Hi :) Just a brief update with Guile 3. Last one was here:
https://lists.gnu.org/archive/html/guile-devel/2018-06/msg00026.html

There is now a "lightning" branch that has GNU lightning merged in and
built statically into Guile.  It causes about 1 MB of overhead in the
-Og libguile-3.0.so, bringing it to 6.1 MB, or 1.36 MB stripped.  Seems
OK for now.  By way of contrast, libguile-2.2.so is 5.65 MB when built
with -Og, or 1.19 MB stripped.

There's some scaffolding for making JIT code emitters for each
instruction.  But then I ran into a problem about how to intermingle
JIT and interpreter returns on the stack.  I was hoping to avoid having
separate interpreter and JIT return addresses in a stack frame, to
avoid adding overhead.  That didn't work out:

  https://lists.gnu.org/archive/html/guile-devel/2018-07/msg00013.html

So, I added a slot to the "overhead" part of stack frames.  From
frames.h:

   Stack frame layout
   ------------------

   | ...                          |
   +==============================+ <- fp + 3 = SCM_FRAME_PREVIOUS_SP (fp)
   | Dynamic link                 |
   +------------------------------+
   | Virtual return address (vRA) |
   +------------------------------+
   | Machine return address (mRA) |
   +==============================+ <- fp
   | Local 0                      |
   +------------------------------+
   | Local 1                      |
   +------------------------------+
   | ...                          |
   +------------------------------+
   | Local N-1                    |
   \------------------------------/ <- sp

   The stack grows down.

   The calling convention is that a caller prepares a stack frame
   consisting of the saved FP, the saved virtual return address, and the
   saved machine return address of the calling function, followed by the
   procedure and then the arguments to the call, in order.  Thus in the
   beginning of a call, the procedure being called is in slot 0, the
   first argument is in slot 1, and the SP points to the last argument.
   The number of arguments, including the procedure, is thus FP - SP.

That took a while.  Anything that changes calling conventions is gnarly.
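To make the frame layout concrete, here is a small C model of it.  This
is only a sketch read off the diagram: the `slot` type and all the
helper names (`frame_mra`, `frame_local`, `push_call_frame`, and so on)
are made up for illustration and are not the macros in frames.h, and
storing the saved FP as a raw pointer is an assumption made to keep the
example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one stack slot; not Guile's real type.  */
typedef uintptr_t slot;

/* Three overhead slots per frame: mRA, vRA and the dynamic link.  */
enum { FRAME_OVERHEAD = 3 };

/* Accessors that mirror the diagram.  The stack grows down, so the
   overhead sits at and above fp, and the locals sit below it.  */
static slot *frame_mra (slot *fp)          { return fp + 0; }
static slot *frame_vra (slot *fp)          { return fp + 1; }
static slot *frame_dynamic_link (slot *fp) { return fp + 2; }
static slot *frame_previous_sp (slot *fp)  { return fp + FRAME_OVERHEAD; }
static slot *frame_local (slot *fp, size_t i) { return fp - 1 - i; }

/* Number of live slots, counting the procedure in slot 0: FP - SP.  */
static size_t
frame_nlocals (slot *fp, slot *sp)
{
  return (size_t) (fp - sp);
}

/* Caller side of a call: lay out the overhead, then the procedure,
   then the arguments in order.  Returns the new fp and stores the new
   sp, which points at the last argument, in *sp_out.  */
static slot *
push_call_frame (slot *stack_limit, slot *old_fp, slot *old_sp,
                 slot vra, slot mra, slot proc,
                 const slot *args, size_t nargs, slot **sp_out)
{
  /* Room is needed for the overhead, the procedure and the arguments.  */
  assert ((size_t) (old_sp - stack_limit) >= FRAME_OVERHEAD + 1 + nargs);

  slot *fp = old_sp - FRAME_OVERHEAD;
  *frame_dynamic_link (fp) = (slot) old_fp;  /* assumption: raw pointer */
  *frame_vra (fp) = vra;
  *frame_mra (fp) = mra;

  *frame_local (fp, 0) = proc;               /* procedure in slot 0 */
  for (size_t i = 0; i < nargs; i++)
    *frame_local (fp, i + 1) = args[i];      /* first argument in slot 1 */

  *sp_out = fp - (nargs + 1);
  return fp;
}
```

With a call that passes two arguments, FP - SP comes out as 3 (the
procedure plus the two arguments), and fp + 3 is the caller's SP from
before the call, which is what the diagram labels
SCM_FRAME_PREVIOUS_SP (fp).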
While I was at it, I changed the return calling convention to expect
return values from slot 0 instead of from slot 1, and made some other
minor changes to instructions related to calls and returns.

The next step will be to add an "enter-function" instruction or
something to function entries.  This instruction's only real purpose
will be to increment a counter associated with the function.  If the
counter exceeds some threshold, JIT code will be emitted for the
function and the function will tier up.  If the enter-function
instruction sees that the function already has JIT code (e.g. emitted
from another thread), then it will tier up directly.

Because enter-function is in the right place to run the apply hook for
debugging, we'll probably move that functionality there, instead of
being inline with the call instructions.

The "enter-function" opcode will take an offset to writable data for
the counter, allocated in the ELF image.  This data will have the form:

  struct jit_data {
    void* mcode;
    uint32_t counter;
    uint32_t start;
    uint32_t end;
  };

The mcode pointer indicates the JIT code, if any.  It will probably need
to be referenced atomically (maybe release/consume ordering?).  The
counter is the counter associated with this function; entering a
function will increment it by some amount.  The start and end elements
indicate the bounds of the function, and are offsets into the vcode,
relative to the jit_data struct.  These are not writable.

Loops will also have an instruction that increments the counter,
possibly tiering up if needed.  The whole function will share one
"struct jit_data".

I am currently thinking that we can make JIT-JIT function calls peek
ahead in the vcode of the callee to find the callee JIT code, if any.
I.e.:

  (if (has-tc7? callee %tc7-program)
      (let ((vcode (word-ref callee 1)))
        (if (= (logand (u32-ref vcode 0) #xff) %enter-function-opcode)
            (let ((mcode (word-ref (+ vcode (* (u32-ref vcode 1) 4)) 0)))
              (if (zero? mcode)
                  (return!) ;; return to interpreter
                  (jmp! mcode)))
            (return!)))
      (return!))

It's a dependent memory load on the function-call hot path but it will
predict really well.  The upside of this is that there is just one
mutable mcode pointer for a function, for all its closures in all
threads.  It also avoids reserving more space on the heap for another
mcode word in program objects.

Loops will tier up ("on-stack replacement") by jumping to an offset in
the mcode corresponding to the vcode for the counter-incrementing
instruction.  The offset will be determined by running the JIT compiler
for the function but without actually emitting the code and flushing
icache; the compiler is run in a mode just to determine the mcode
offset for the vcode offset.

Once the enter-function opcode is done I'll get back to implementing
the JIT compilers for each instruction.

Cheers,

Andy