guile 3 update, july edition

Andy Wingo Sat, 21 Jul 2018 09:37:37 -0700

Hi :)

Just a brief update with Guile 3.  Last one was here:


  https://lists.gnu.org/archive/html/guile-devel/2018-06/msg00026.html

There is now a "lightning" branch that has GNU lightning merged in and
built statically into Guile.  It causes about 1 MB of overhead in the
-Og libguile-3.0.so, bringing it to 6.1 MB, or 1.36 MB stripped.  Seems
OK for now.  By way of contrast, libguile-2.2.so is 5.65 MB when built
with -Og, or 1.19 MB stripped.

There's some scaffolding for making JIT code emitters for each
instruction.  But then I ran into a problem about how to intermingle JIT
and interpreter returns on the stack.  I was hoping to avoid having
separate interpreter and JIT return addresses in a stack frame, to avoid
adding overhead.  That didn't work out:

  https://lists.gnu.org/archive/html/guile-devel/2018-07/msg00013.html

So, I added a slot to the "overhead" part of stack frames.  From
frames.h:

   Stack frame layout
   ------------------

   | ...                          |
   +==============================+ <- fp + 3 = SCM_FRAME_PREVIOUS_SP (fp)
   | Dynamic link                 |
   +------------------------------+
   | Virtual return address (vRA) |
   +------------------------------+
   | Machine return address (mRA) |
   +==============================+ <- fp
   | Local 0                      |
   +------------------------------+
   | Local 1                      |
   +------------------------------+
   | ...                          |
   +------------------------------+
   | Local N-1                    |
   \------------------------------/ <- sp

   The stack grows down.

   The calling convention is that a caller prepares a stack frame
   consisting of the saved FP, the saved virtual return address, and the
   saved machine return address of the calling function, followed by the
   procedure and then the arguments to the call, in order.  Thus in the
   beginning of a call, the procedure being called is in slot 0, the
   first argument is in slot 1, and the SP points to the last argument.
   The number of arguments, including the procedure, is thus FP - SP.
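
To make the layout concrete, here is a small C sketch of how the slots
in the diagram can be addressed relative to fp.  The type and function
names are made up for illustration and are not the actual frames.h
macros; the only things taken from the comment above are the slot order,
the "fp + 3" relation, and the FP - SP count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a stack slot: one machine word. */
typedef uintptr_t slot_t;

/* Three overhead slots per frame: mRA, vRA and the dynamic link. */
enum { FRAME_OVERHEAD_SLOTS = 3 };

/* The stack grows down, so the overhead slots sit at fp[0]..fp[2]. */
static inline slot_t *frame_machine_return_address(slot_t *fp) { return &fp[0]; }
static inline slot_t *frame_virtual_return_address(slot_t *fp) { return &fp[1]; }
static inline slot_t *frame_dynamic_link(slot_t *fp)           { return &fp[2]; }

/* The caller's sp: "fp + 3" in the diagram. */
static inline slot_t *frame_previous_sp(slot_t *fp)
{
  return fp + FRAME_OVERHEAD_SLOTS;
}

/* Local i lives i + 1 slots below fp; local 0 is the procedure on entry. */
static inline slot_t *frame_local(slot_t *fp, size_t i)
{
  return fp - i - 1;
}

/* Number of locals (procedure plus arguments at call time): FP - SP. */
static inline ptrdiff_t frame_num_locals(slot_t *fp, slot_t *sp)
{
  return fp - sp;
}
```

With a frame holding a procedure and two arguments, sp is fp - 3,
frame_num_locals gives 3, and frame_local (fp, 2) is the slot sp points
at.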

That took a while.  Anything that changes calling conventions is gnarly.
While I was at it, I changed the return calling convention to expect
return values from slot 0 instead of from slot 1, and made some other
minor changes to instructions related to calls and returns.

The next step will be to add an "enter-function" instruction or
something to function entries.  This instruction's only real purpose
will be to increment a counter associated with the function.  If the
counter exceeds some threshold, JIT code will be emitted for the
function and the function will tier up.  If the enter-function
instruction sees that the function already has JIT code (e.g. emitted
from another thread), then it will tier up directly.
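
In C, the decision made at function entry might look roughly like the
sketch below.  Everything here is hypothetical: the struct, the enum,
the threshold value and the function name are placeholders, and the
sketch ignores threads entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical threshold; the real value is still to be decided. */
enum { TIER_UP_THRESHOLD = 1000 };

/* What the interpreter should do after running enter-function. */
enum entry_action {
  RUN_INTERPRETED,       /* keep interpreting the vcode */
  COMPILE_THEN_RUN_JIT,  /* counter tripped: emit JIT code, then tier up */
  RUN_JIT                /* JIT code already exists: tier up directly */
};

/* Hypothetical per-function state. */
struct fn_state {
  void *mcode;       /* JIT code, or NULL if none has been emitted */
  uint32_t counter;  /* bumped on each entry */
};

static enum entry_action
enter_function (struct fn_state *fn, uint32_t increment)
{
  /* Someone (perhaps another thread) already compiled this function. */
  if (fn->mcode != NULL)
    return RUN_JIT;

  fn->counter += increment;
  if (fn->counter > TIER_UP_THRESHOLD)
    return COMPILE_THEN_RUN_JIT;

  return RUN_INTERPRETED;
}
```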

Because enter-function is in the right place to run the apply hook for
debugging, we'll probably move that functionality there, instead of
being inline with the call instructions.

The "enter-function" opcode will take an offset to writable data for the
counter, allocated in the ELF image.  This data will have the form:

  struct jit_data {
    void* mcode;
    uint32_t counter;
    uint32_t start;
    uint32_t end;
  };

The mcode pointer indicates the JIT code, if any.  It will probably need
to be referenced atomically (maybe release/consume ordering?).  The
counter is the counter associated with this function; entering a
function will increment it by some amount.  The start and end elements
indicate the bounds of the function, and are offsets into the vcode,
relative to the jit_data struct.  These are not writable.
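
For the atomic access to mcode, one option is C11 <stdatomic.h>.  The
sketch below is only an assumption about how it could be done: the
struct name and helper functions are made up, and it uses an acquire
load where the text above muses about consume, since current compilers
generally treat consume as acquire anyway.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Same fields as struct jit_data above, with mcode made atomic. */
struct jit_data_sketch {
  _Atomic(void *) mcode;
  uint32_t counter;
  uint32_t start;
  uint32_t end;
};

/* Called by the thread that finished emitting code.  The release store
   orders the writes that produced the machine code before the pointer
   becomes visible to other threads. */
static void
publish_mcode (struct jit_data_sketch *data, void *code)
{
  atomic_store_explicit (&data->mcode, code, memory_order_release);
}

/* Called on function entry.  Returns NULL if no JIT code exists yet. */
static void *
lookup_mcode (struct jit_data_sketch *data)
{
  return atomic_load_explicit (&data->mcode, memory_order_acquire);
}
```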

Loops will also have an instruction that increments the counter,
possibly tiering up if needed.  The whole function will share one
"struct jit_data".

I am currently thinking that we can make JIT-JIT function calls peek
ahead in the vcode of the callee to find the callee JIT code, if any.
I.e.:

  (if (has-tc7? callee %tc7-program)
      (let ((vcode (word-ref callee 1)))
        (if (= (logand (u32-ref vcode 0) #xff)
               %enter-function-opcode)
            (let ((mcode (word-ref (+ vcode (* (u32-ref vcode 1) 4)) 0)))
              (if (zero? mcode)
                  (return!) ;; return to interpreter
                  (jmp! mcode)))
            (return!)))
      (return!))

It's a dependent memory load on the function-call hot path but it will
predict really well.  The upside of this is that there is just one
mutable mcode pointer for a function, for all its closures in all
threads.  It also avoids reserving more space on the heap for another
mcode word in program objects.
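
The same peek-ahead can be written out in C.  This is a sketch under
assumptions: the opcode number is invented, the tc7 check on the callee
is left out, and the helper simply returns NULL where the pseudocode
above returns to the interpreter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical opcode number for enter-function. */
enum { ENTER_FUNCTION_OPCODE = 0x2a };

/* Given the start of a callee's vcode, return its JIT code, or NULL if
   the caller should fall back to the interpreter. */
static void *
callee_mcode (const uint32_t *vcode)
{
  /* The low 8 bits of the first word hold the opcode. */
  if ((vcode[0] & 0xff) != ENTER_FUNCTION_OPCODE)
    return NULL;

  /* The second word is the offset, in 32-bit units, from the vcode to
     the function's jit_data. */
  const unsigned char *data =
    (const unsigned char *) vcode + (size_t) vcode[1] * 4;

  /* mcode is the first field of jit_data; copy it out rather than
     casting, to stay clear of alignment assumptions. */
  void *mcode;
  memcpy (&mcode, data, sizeof mcode);
  return mcode;
}
```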

Loops will tier up ("on-stack replacement") by jumping to an offset in
the mcode corresponding to the vcode for the counter-incrementing
instruction.  The offset will be determined by running the JIT compiler
for the function but without actually emitting the code and flushing
icache; the compiler is run in a mode just to determine the mcode offset
for the vcode offset.
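
One way to picture the "offset-only" mode is an emitter that counts
bytes without writing them.  The sketch below is a toy: the emitter
struct, the made-up instruction sizes and the function names are all
hypothetical, and nothing here is GNU lightning's API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical emitter.  With buf == NULL it only tracks the position. */
struct emitter {
  uint8_t *buf;  /* destination for machine code, or NULL for a dry run */
  size_t pos;    /* bytes emitted (or that would have been emitted) */
};

static void
emit_bytes (struct emitter *e, const uint8_t *bytes, size_t n)
{
  if (e->buf != NULL)
    memcpy (e->buf + e->pos, bytes, n);
  e->pos += n;
}

/* Toy per-instruction compiler: the machine code size (1 to 8 bytes)
   is invented, derived from the low bits of the vcode word. */
static void
compile_instruction (struct emitter *e, uint32_t insn)
{
  static const uint8_t filler[8] = { 0 };
  emit_bytes (e, filler, 1 + (insn & 0x7));
}

/* Dry-run the compiler over the whole function and report the mcode
   offset at which the vcode instruction at index target would start. */
static size_t
mcode_offset_for (const uint32_t *vcode, size_t len, size_t target)
{
  struct emitter e = { NULL, 0 };
  size_t offset = 0;

  assert (target < len);
  for (size_t i = 0; i < len; i++)
    {
      if (i == target)
        offset = e.pos;
      compile_instruction (&e, vcode[i]);
    }
  return offset;
}
```

Because the dry run goes through the same compile_instruction as a real
run, the offset it reports matches the position a real emission reaches
at the same vcode index.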

Once the enter-function opcode is done I'll get back to implementing the
JIT compilers for each instruction.

Cheers,

Andy
