Monday, July 11, 2011

The x86_64 Calling Convention


I suppose I can consider myself an 'old-school' developer now; even though I have been reading the AMD64 ABI documentation, I still haven't fully absorbed it into my head yet, which is evidenced by the recent two situations I had today where RTFM-ing would have had saved me hours of GDB debugging pain.

I have been coding some assembly instructions to make C-calls at runtime to a debugging routine, but the call seems to always ends up mysteriously trampling the JIT-ed routines, making the VM take unexpected execution paths and causing some unlikely assertions to be fired.

The situation is confounded by a number of issues:
  1. the code generated is dynamic, and therefore there are no debugging symbols associated with them compared to code typically generated by the assembler/compiler;
  2. there are different types of call-frames for a given method; 1 for a pre-compiled stub, 1 for a frame that's crossed-over from JIT-ed code to native code, and 1 for the JIT-ed code itself;
  3. when the eventual assertion does manifest, the code is already far away in the rabbit-hole from where the original problem manifested. And because some of the JIT-ed code actually makes a "JMP", unlike a "CALL", you can't actually figure out where the code originated from, since %rip is never saved on the call stack.
While situations 1 and 2 make debugging difficult by having the need to keep a lot of contextual information in order to figure out what's going on, situation 3 is just impossible to debug if the bug is non-deterministic in nature. For example, each compiled method in the VM generates a small assembly stub that replaces the actual code to be executed; when the stub gets executed for the first time, it triggers of the JIT compiler at runtime to compile the real method from its intermediate representation. The compiled method then replaces the stub, hence subsequent invocations will simply call the already JIT-generated method, thereby executing at native-speed, like just as you would get on compiled code.

To optimise on space, the stubs are made as small as possible (~20 bytes), and the common execution body shared by all stubs is factored into a common block. All stubs will eventually perform a global "JMP" instruction to this common block. In order to faciliate communication, all shared data between the stub and the common code block is passed on the thread stack, where the common offset to the method handle is agreed upon. 

While the design is elegant, it is also impossible to debug when it breaks; the non-deterministic-ness of the bug seems to surface from time-to-time, where it seems to suggest that the thread stack got corrupted or that it's not passing the method handle correctly. Even when GDB is left running, by the time the assertion triggers, it's already past the fact, and therefore it is unable to trace back to the originating path.

I thought it might be a good idea to inject some debugging calls to trace the execution and stack pointer at runtime, so that I can figure out which stub was last called and the stack height when the call was made; the two information combined should give me sufficient hints on where the problem might lie. However, my injected code has introduced two other issues that I had overlooked, which brings me back into the discussion of the x86_64 ABI again; if you ever wanted to template any assembly instructions into your code that relies on an external library call, do keep these 2 points from the ABI specification in mind:
  1. Save ALL caller-saves registers, not just only the ones that you are using.
  2. (§3.2.2) The end of the input argument area shall be aligned on a 16 (32, if __m256 is passed on stack) byte boundary. In other words, the value (%rsp + 8) is always a multiple of 16 (32) when control is transferred to the function entry point. The stack pointer, %rsp, always points to the end of the latest allocated stack frame.
I have to say that I've dismissed (1) since I've gotten use to the style of only documenting and saving the registers that was used; the convention was something that I had picked up from Peter Norton's 1992 book, "Assembly Language for the PC". For those who don't know, he's the "Norton" that Symantec's Norton Antivirus is named after. I still have the out-of-print book on my desk as a keepsake; it reminds me of the the memories of reading it and scribbling code on a piece of paper at my local library. Remarkably, that was how I learnt assembly, since I didn't have a computer back then. Thumbing through the book today, I still have an incredible respect for Peter's coding prowess. He had a way of organising his procedures so elegantly such that each of them all fitted perfectly together from chapter to chapter.


Sorry, got sidetracked. So yes, point (1) - to save ALL registers; this is necessary because all caller-saved registers can actually be occupied by the JIT routines as input arguments to the callee; while this typically means the 6 defined registers (%rdi, %rsi, %rdx, %rcx, %r8, %r9) for general input (see §3.2.3), other registers can also be trashed upon a call return, so as a rule-of-thumb save everything, except the callee-saved registers (%rbx, %rbp, %r12 to %r15), which are guaranteed to be preserved.

Point (2) - I haven't observed a reproducible side effect from this; however the failure points between adhering to it and not actually causes a visible difference in the JIT-ed code's path; therefore there is a need to be on the side of caution. I seem to have observed that some memory faults from not following this directive, but I can't ascertain this for a fact yet.

Finally, a self-inflicted bug that I'd like to remind myself of; remember make sure to deduct from %rsp if any memory has been written onto the thread stack; otherwise any function calls may unknowingly overwrite it!

For all the trouble with debugging that I've gotten myself into, there is at least a silver-lining to it; I had made the problem deterministic, or if it isn't the same problem, it was a similar class of problem that I can consistently reproduce to analyse its behaviour and learn from the mistakes I have been making. Because of the determinism, I was able to use GDB's reversible debugging feature to record the execution from the stub to the common code to gain a better understanding of how the generated code actually works. It's a really nifty feature, and I'm glad to have it as my first useful case of applied reversible debugging in practice.