Author: Armin Rigo <ar...@tunes.org>
Branch:
Changeset: r87292:1e5c1b905a1c
Date: 2016-09-21 22:24 +0200
http://bitbucket.org/pypy/pypy/changeset/1e5c1b905a1c/
Log:	Check in extradoc/talk/bucharest2016/jit-backend-8vhY1ArTsh.txt
	in the docs, as a draft description of our low-level JIT assembler
	backends

diff --git a/rpython/doc/jit/backend.rst b/rpython/doc/jit/backend.rst
new file mode 100644
--- /dev/null
+++ b/rpython/doc/jit/backend.rst
@@ -0,0 +1,263 @@

=========================
PyPy's assembler backends
=========================

Draft notes about the organization of assembler backends in the PyPy JIT, in 2016
==================================================================================


Input: a linear sequence of instructions, called a "trace".

A trace is a sequence of instructions in SSA form.  Most instructions
correspond to one or a few CPU-level instructions.  There are a few
meta-instructions like `label`, plus some debugging operations.  All
branching is done with guards, which are instructions that check that a
condition is true and exit the trace if not.  A failing guard can have a
new trace added to it later, called a "bridge".  A patched guard becomes
a direct `Jcond` instruction going to the bridge, with no indirection,
no register spilling, etc.

A trace ends with either a `return` or a `jump to label`.  The target
label is either inside the same trace, or in some older one.  For
historical reasons we call a "loop" a trace that is not a bridge.  The
machine code that we generate is organized as a forest of trees; the
trunk of each tree is a "loop", and the branches are all bridges
(branching off the trunk or off another branch).

* Every trunk or branch that ends in a `jump to label` can also target a
  label from a different tree.

* The whole process of assembling a loop or a branch is basically
  single-threaded, so there are no synchronization issues there
  (including when patching older generated instructions).

* The generated assembler keeps a "frame" in %rbp, which is actually not
  on the stack at all, but is a GC object (called a "jitframe").
  Spilling goes there.

* Each guard is a `Jcond` to a very small piece of generated code, which
  basically pushes a couple of constants on the stack and then jumps to
  the general guard-recovery code.  That code saves the registers into
  the jitframe and then exits the whole generated function.  The caller
  of that generated function checks how it finished: if it finished by
  hitting a guard, then the caller is responsible for calling the
  "blackhole interpreter".  This is the part of the front-end that
  recovers from failing guards and finishes running the frame
  (including, possibly, by jumping again into generated assembler).


The JITting process has three main steps:

* front-end and optimization pass

* rewrite (includes GC-related transformations as well as simplifications)

* assembler generation


Front-end and optimization pass
-------------------------------

Not discussed here in detail.  This step produces loops and bridges
using an instruction set that is "high-level" in some sense: it contains
instructions like "new"/"new_array", and
"setfield"/"setarrayitem"/"setinteriorfield", which describe the action
of storing a value in a precise field of a structure or array.  For
example, the "setfield" action might implicitly require a GC write
barrier.  This is the high-level trace that we send to the following
step.
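
To make the input format concrete, here is a deliberately simplified,
hypothetical sketch (plain Python, not the actual RPython classes; the
descr names are invented) of what such a high-level trace looks like as
data before the rewrite step::

    # Toy model of a high-level trace: a linear list of SSA operations.
    # The real implementation uses richer operation objects; this only
    # illustrates the overall shape of the data.
    from collections import namedtuple

    Op = namedtuple("Op", ["result", "opname", "args"])

    trace = [
        Op("p1", "new", ["SomeStructDescr"]),              # allocate a GC structure
        Op(None, "setfield_gc", ["p1", "i0", "descr_x"]),  # may imply a write barrier
        Op("i2", "int_lt", ["i0", "i1"]),                  # a boolean value
        Op(None, "guard_true", ["i2"]),                    # exit to a bridge if false
        Op(None, "jump", ["label_1", "p1", "i0"]),         # back edge to an earlier label
    ]

    for op in trace:
        print(op.result or "-", op.opname, ", ".join(op.args))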


Rewrite
-------

A mostly but not completely CPU-independent phase: it lowers some
instructions.  For example, the variants of "new" are lowered to
"malloc" and a few "gc_store": it bumps the pointer of the GC and then
sets a few fields explicitly in the newly allocated structure.  The
"setfield" is replaced with a "cond_gc_wb_call" (a conditional call to
the write barrier) if needed, followed by a "gc_store".

The "gc_store" instruction can be encoded in a single MOV assembler
instruction, but is not as flexible as a MOV.  The address is always
specified as "some GC pointer + an offset".  We don't have the notion of
an interior pointer for GC objects.

A different instruction, "gc_store_indexed", offers additional operands,
which can be mapped to a single MOV instruction using forms like
`[rax+8*rcx+24]`.

Some other complex instructions pass through to the backend, which must
deal with them: for example, "card marking" in the GC.  (Writing an
object pointer inside an array would require walking the whole array
later to find "young" references.  Instead of that, we flip a bit for
every range of 128 entries.  This is a common GC optimization.)  Setting
the card bit of a GC object requires a sequence of assembler
instructions that depends too much on the target CPU to be expressed
explicitly here (moreover, it contains a few branches, which are hard to
express at this level).


Assembly
--------

No fancy code generation techniques, just a greedy forward pass that
tries to avoid some pitfalls.


Handling instructions
~~~~~~~~~~~~~~~~~~~~~

* Instructions are handled one by one, in the forward direction.  Each
  instruction asks the register allocator to ensure that some arguments
  are in registers (not in the jitframe); asks for a register to put its
  result into; and asks for additional scratch registers that will be
  freed at the end of the instruction.  There is a special case for
  boolean variables: they are stored in the condition code flags instead
  of being materialized as a 0/1 value.  (They are materialized later,
  except in the common case where they are only used by the next
  `guard_false` or `guard_true` and then forgotten.)

* Instruction arguments are loaded into a register on demand.  This
  makes the backend quite easy to write, but leads to some bad
  decisions.


Linear scan register allocation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Although it is always a linear trace that we consider, we don't use
advanced techniques for register allocation: we do forward, on-demand
allocation as the backend produces the assembler.  When it asks for a
register to put some value into, we give it any free register, without
consideration for what will be done with it later.  We compute the
longevity of all variables, but only use it when choosing which register
to spill (we spill the variable with the longest longevity).

This works to some extent because it is well integrated with the earlier
optimization pass.  Loops are unrolled once by the optimization pass to
allow more powerful optimizations---the optimization pass itself is the
place that benefits the most, but it also has benefits here in the
assembly pass.  These are:

* The first peeling (the preamble) initializes the register bindings on
  the first use of each variable.

* As a result, registers are already allocated when entering the trace
  loop itself.

* Similarly, registers are already allocated when exiting into bridges.

[TODO: try to better allocate registers to match the ABI (minor to no
benefit in the current state).]
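
As an illustration only, here is a small hypothetical Python sketch of
this on-demand scheme (this is not PyPy's register allocator; the
register names and data layout are made up): hand out any free register
on demand and, when none is free, spill the live variable with the
longest remaining longevity to the jitframe::

    # Toy on-demand register allocator, illustrating the spill heuristic
    # described above: spill the variable whose last use is farthest away.
    REGISTERS = ["r8", "r9", "r10", "r11"]      # pretend only four are available

    class ToyAllocator:
        def __init__(self, longevity):
            self.longevity = longevity          # variable -> position of its last use
            self.reg_of = {}                    # variable -> register holding it
            self.free = list(REGISTERS)
            self.spilled = set()                # variables living in the jitframe

        def ensure_in_register(self, var):
            if var in self.reg_of:
                return self.reg_of[var]
            if not self.free:
                # no free register: evict the variable with the longest longevity
                victim = max(self.reg_of, key=lambda v: self.longevity[v])
                self.free.append(self.reg_of.pop(victim))
                self.spilled.add(victim)        # its value now lives in the jitframe
            reg = self.free.pop()
            self.reg_of[var] = reg
            return reg

        def free_dead_registers(self, position):
            # called after emitting the instruction at `position`
            for var in list(self.reg_of):
                if self.longevity[var] <= position:
                    self.free.append(self.reg_of.pop(var))

    alloc = ToyAllocator({"i0": 4, "i1": 1, "p0": 7, "p1": 6, "f0": 2})
    for var in ["i0", "i1", "p0", "p1", "f0"]:  # five live values, four registers
        print(var, "->", alloc.ensure_in_register(var))
    print("spilled:", alloc.spilled)            # p0 has the longest longevity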


More complex mappings
~~~~~~~~~~~~~~~~~~~~~

Some instructions generate more complex code.  These are either or both
of:

* complex instructions generating some local control flow, like
  "cond_gc_wb_call" (for write barriers) or "call_assembler" (a call
  followed by a few checks);

* instructions that invoke custom assembler helpers, like the slow path
  of write barriers or the slow path of allocations.  These slow paths
  are typically generated too, so that we are not constrained by the
  usual calling conventions.


GC pointers
~~~~~~~~~~~

Around most CALL instructions, we need to record a description of where
the GC pointers are (registers and stack frame).  This is needed in case
the CALL invokes a garbage collection.  The GC pointers can move; the
pointers in the registers and stack frame are then updated by the GC.
That is one reason why we don't have explicit interior pointers.

GC pointers can appear as constants in the trace.  We are busy changing
that to use a constant table and `MOV REG, (%RIP+offset)`.  The
"constant" in the table is actually updated by the GC if the object
moves.
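
As a purely illustrative sketch (this is not PyPy's actual encoding; the
register numbering and the jitframe layout below are invented), such a
description can be pictured as a bitmask over registers and jitframe
slots, recorded next to each call site::

    # Toy "GC map" for one CALL site: one bit per register, then one bit
    # per jitframe slot, set when that location holds a GC pointer.  The
    # GC would read such a map to find (and, after moving objects,
    # update) the pointers.
    REGISTER_INDEX = {"rcx": 0, "rdx": 1, "rsi": 2, "rdi": 3,
                      "r8": 4, "r9": 5, "r10": 6, "r12": 7}
    NUM_REGS = len(REGISTER_INDEX)

    def encode_gcmap(gc_registers, gc_frame_slots):
        mask = 0
        for reg in gc_registers:
            mask |= 1 << REGISTER_INDEX[reg]
        for slot in gc_frame_slots:
            mask |= 1 << (NUM_REGS + slot)
        return mask

    # Example: p0 is in %rsi and p1 was spilled to jitframe slot 3.
    print(bin(encode_gcmap({"rsi"}, {3})))      # -> 0b100000000100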


Vectorization
~~~~~~~~~~~~~

An optimization developed to use SIMD instructions for trace loops.  The
primary idea was to use it as an optimization for micro-numpy.  It runs
several passes over the already optimized trace.

In short: it builds the dependencies for an unrolled trace loop, gathers
pairs/packs of operations that could be executed in parallel, and
finally schedules the operations.

What it added to the code base:

* Dependencies can be constructed

* Code motion of guards to relax dependencies

* A scheduler to reorder the trace

* Array bound-check removal (especially for unrolled traces)

What it can do:

* Transform vector loops (element-wise operations)

* Accumulation (`reduce([...], operator, 0)`), provided the operation is
  associative and commutative

* Use SSE 4.1 as the "vector backend"


We do not
~~~~~~~~~

* Keep tracing data around to reoptimize the trace tree.  (Once a trace
  is compiled, minimal data is kept.)  This is one reason (there are
  others in the front-end) for the following result: JIT-compiling a
  small loop with two common paths ends up as one "loop" and one bridge
  assembled, and the bridge-following path is slightly less efficient.
  This is notably because this bridge is assembled under two
  constraints: the input registers are fixed (from the guard), and the
  output registers are fixed (from the jump target); usually these two
  sets of fixed registers are different, and copying around is needed.

* We don't join trace tails: we only assemble *trees*.

* We don't do any reordering (neither of trace instructions nor of
  individual assembler instructions).

* We don't do any cross-instruction optimization that makes sense only
  for the backend and can't easily be expressed at a higher level.  I'm
  sure there are tons of examples of that, e.g. loading a large constant
  into a register that will survive for several instructions, or moving
  *parts* of some instruction, like the address calculation, out of
  loops.

* Other optimization opportunities I can think about: look at the
  function prologue/epilogue; look at the overhead (small but not zero)
  at the start of a bridge.  Also check whether the way guards are
  implemented makes sense.  Also, we generate large-ish sequences of
  assembler instructions with tons of `Jcond` that are almost never
  followed; is there any optimization opportunity there?  (They all go
  forward, if it changes anything.)  In theory we could also replace
  some of these with a signal handler on segfault
  (e.g. `guard_nonnull_class`).


A GCC or LLVM backend?
~~~~~~~~~~~~~~~~~~~~~~

At least for comparison we'd like a JIT backend that emits its code
using GCC or LLVM (irrespective of the time it would take).  But it's
hard to map the guards reasonably well to the C language or to LLVM IR.
The problems are: (1) we have many guards, and we would like to avoid
having many paths that each do a full
save-all-local-variables-that-are-still-alive; (2) it's hard to patch a
guard when a bridge is compiled from it; (3) instructions like a CALL
need to expose the local variables that are GC pointers, and
CALL_MAY_FORCE needs to expose *all* local variables for optional
off-line reconstruction of the interpreter state.

diff --git a/rpython/doc/jit/index.rst b/rpython/doc/jit/index.rst
--- a/rpython/doc/jit/index.rst
+++ b/rpython/doc/jit/index.rst
@@ -26,6 +26,7 @@
    optimizer
    virtualizable
    vectorization
+   backend
 
 - :doc:`Overview <overview>`: motivating our approach
 
@@ -34,5 +35,8 @@
 - :doc:`Optimizer <optimizer>`: the step between tracing and writing
   machine code
 
-- :doc:`Virtulizable <virtualizable>` how virtualizables work and what they are
-  (in other words how to make frames more efficient).
+- :doc:`Virtualizable <virtualizable>`: how virtualizables work and what
+  they are (in other words how to make frames more efficient).
+
+- :doc:`Assembler backend <backend>`: draft notes about the organization
+  of the assembler backends

_______________________________________________
pypy-commit mailing list
pypy-commit@python.org
https://mail.python.org/mailman/listinfo/pypy-commit