Re: Threaded interpreter in Java
Thanks Rémi! On Thu, Aug 4, 2016 at 5:36 PM, Remi Foraxwrote: > Hi Krys, > > > On August 3, 2016 3:41:46 PM PDT, Krystal Mok > wrote: > >(Having missed last night's dinner and all the fun there...) > > > >When Rémi started talking about this interpreter today, I thought it'd > >be > >tailcall related. > > I would love that it was tailcall related too. people seams to reduce > tailcall to a feature of some obscure languages but in reality each time > you have to do some argument transformations/injections/checking in a > generic way, it screams for tailcall. > > > > >In college, my introductory course to computer systems was based on the > >book "Introduction to Computing Systems: From Bits & Gates to C & > >Beyond", > >where it introduces a small educational 16-bit architecture called > >LC-2. We > >were given a LC-2 simulator to run our "assembly" programs. > >Later on, when I've learned enough Java to be dangerous, I wanted to > >write > >a more powerful simulator (interpreter) of the LC-2 ISA in Java. > > > >My first version was a most straightforward switch-threaded one. > >Nothing > >much to be said about that. > > > >The next version, wanting to mimic a indirect-threaded style, was > >structured in very much the same way Rémi presented in the link, but > >MethodHandle was still years from available so I didn't have the luxury > >of > >using it. Instead, I modeled the opcode handlers as "single-method > >objects", and do a virtual call at the end of each handler. > >(So, instead of Rémi's MH calls, I'm just using a table lookup + > >virtual > >call) > > > >public class NopHandler extends AbstractHandler // or implements > >Handler > >{ > > public void execute(State s) { > >HANDLERS[s.code[s.pc++]].execute(s); // dispatch > > } > >} > > > >That's didn't work out quite well. The performance was okay-ish with > >small > >program inputs. But HotSpot's JIT compilers doesn't implement tail call > >optimizations (well, recursive inlining sort of counts...), so with a > >long-enough program the stack blows up. And it didn't take that long of > >a > >LC-2 assembler program to make it blow up. > > > > To avoid of blowing the stack, the trick is to stop the "trace" at the end > of a basic block, i.e. when you have a jump or a return. It also means that > things like null check, things that should not jump, has to be encoded with > a different bytecode that conditional jump because you don't want to stop > the trace for them. > > Cool! I haven't thought about using this trick in the context of an interpreter in Java. That's really neat! > >Need a good trampoline. Need proper tail calls. > > YES ! > > YES!! Thanks, Kris > > > >- Kris > > Rémi > > > > >On Wed, Aug 3, 2016 at 3:14 PM, John Rose > >wrote: > > > >> On Aug 3, 2016, at 1:22 PM, Remi Forax wrote: > >> > > >> > Charles ask if we can make a fast register interpreter in Java, > >> > here is my take on a threaded-like interpreter > >> > > >> > https://gist.github.com/forax/f38b533e089217cfc4d0ae3c6e2de9c9 > >> > >> Nicely done. We were talking last night at dinner about making > >> bytecode interpreters go fast, and this is helpful. > >> > >> I think there may be a combinator here, for bytecode dispatch loops. > >> (This feels like the PIC combinator, both fundamental and tricky to > >get > >> right.) > >> A "bytecode" dispatch combinator might provide a pattern like the > >> following: > >> > >> MethodHandle init, pred, step, fini, index; > >> @NonNull MethodHandle[] tab; > >> R dispatch(A… a) { > >>V v = init(a…); // one here; there might be many v… > >>while (pred(a…)) { > >> v = step(v, a…); > >> // execute one "instruction": > >> int i = index(v, a…); > >> tab[i].invokeExact(v, a…); > >>} > >>return fini(V, a…); > >> } > >> > >> The explicit table would be recopied internally to an @Stable array > >for > >> constant folding. > >> The various index values could be tracked and turned into traces (or > >some > >> other profile > >> driven structure) which would be compiled together, in specialized > >> segments of the unrolled > >> interpreter loop. > >> > >> Or maybe just the table lookup part alone makes a good combinator, to > >be > >> paired > >> with the loop combinators? > >> > >> Comments welcome… > >> > >> — John > >> > >> P.S. Another new RFE: > >> experimental hook for creating heisenboxes > >> https://bugs.openjdk.java.net/browse/JDK-8163133 > >> ___ > >> mlvm-dev mailing list > >> mlvm-dev@openjdk.java.net > >> http://mail.openjdk.java.net/mailman/listinfo/mlvm-dev > >> > > > > > > > > > >___ > >mlvm-dev mailing list > >mlvm-dev@openjdk.java.net > >http://mail.openjdk.java.net/mailman/listinfo/mlvm-dev > > -- > Sent from my Android device with
Re: Threaded interpreter in Java
(Having missed last night's dinner and all the fun there...) When Rémi started talking about this interpreter today, I thought it'd be tailcall related. In college, my introductory course to computer systems was based on the book "Introduction to Computing Systems: From Bits & Gates to C & Beyond", where it introduces a small educational 16-bit architecture called LC-2. We were given a LC-2 simulator to run our "assembly" programs. Later on, when I've learned enough Java to be dangerous, I wanted to write a more powerful simulator (interpreter) of the LC-2 ISA in Java. My first version was a most straightforward switch-threaded one. Nothing much to be said about that. The next version, wanting to mimic a indirect-threaded style, was structured in very much the same way Rémi presented in the link, but MethodHandle was still years from available so I didn't have the luxury of using it. Instead, I modeled the opcode handlers as "single-method objects", and do a virtual call at the end of each handler. (So, instead of Rémi's MH calls, I'm just using a table lookup + virtual call) public class NopHandler extends AbstractHandler // or implements Handler { public void execute(State s) { HANDLERS[s.code[s.pc++]].execute(s); // dispatch } } That's didn't work out quite well. The performance was okay-ish with small program inputs. But HotSpot's JIT compilers doesn't implement tail call optimizations (well, recursive inlining sort of counts...), so with a long-enough program the stack blows up. And it didn't take that long of a LC-2 assembler program to make it blow up. I realized that I had to have proper tail calls to make it work. So I added a central dispatch loop to serve as a trampoline, e.g. Handler next = getNextHandler(); while (next != null) { next = next.execute(s); } where my Handler.execute() would still perform the lookup, but not the actual invocation. As expected, the performance was no where as good as directly making the virtual call at the tail position in each handler. This central dispatch loop makes the unpredictable-ness of the Handler.execute() significant. Need a good trampoline. Need proper tail calls. - Kris On Wed, Aug 3, 2016 at 3:14 PM, John Rosewrote: > On Aug 3, 2016, at 1:22 PM, Remi Forax wrote: > > > > Charles ask if we can make a fast register interpreter in Java, > > here is my take on a threaded-like interpreter > > > > https://gist.github.com/forax/f38b533e089217cfc4d0ae3c6e2de9c9 > > Nicely done. We were talking last night at dinner about making > bytecode interpreters go fast, and this is helpful. > > I think there may be a combinator here, for bytecode dispatch loops. > (This feels like the PIC combinator, both fundamental and tricky to get > right.) > A "bytecode" dispatch combinator might provide a pattern like the > following: > > MethodHandle init, pred, step, fini, index; > @NonNull MethodHandle[] tab; > R dispatch(A… a) { >V v = init(a…); // one here; there might be many v… >while (pred(a…)) { > v = step(v, a…); > // execute one "instruction": > int i = index(v, a…); > tab[i].invokeExact(v, a…); >} >return fini(V, a…); > } > > The explicit table would be recopied internally to an @Stable array for > constant folding. > The various index values could be tracked and turned into traces (or some > other profile > driven structure) which would be compiled together, in specialized > segments of the unrolled > interpreter loop. > > Or maybe just the table lookup part alone makes a good combinator, to be > paired > with the loop combinators? > > Comments welcome… > > — John > > P.S. Another new RFE: > experimental hook for creating heisenboxes > https://bugs.openjdk.java.net/browse/JDK-8163133 > ___ > mlvm-dev mailing list > mlvm-dev@openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/mlvm-dev > ___ mlvm-dev mailing list mlvm-dev@openjdk.java.net http://mail.openjdk.java.net/mailman/listinfo/mlvm-dev
Re: Threaded interpreter in Java
I think it's better to make tab[i].invokeExact(...) a tailcall, so the loop will be part of the implementation, not part of the spec. Rémi - Mail original - > De: "John Rose" <john.r.r...@oracle.com> > À: "Da Vinci Machine Project" <mlvm-dev@openjdk.java.net> > Envoyé: Jeudi 4 Août 2016 00:14:37 > Objet: Re: Threaded interpreter in Java > On Aug 3, 2016, at 1:22 PM, Remi Forax <fo...@univ-mlv.fr> wrote: >> >> Charles ask if we can make a fast register interpreter in Java, >> here is my take on a threaded-like interpreter >> >> https://gist.github.com/forax/f38b533e089217cfc4d0ae3c6e2de9c9 > > Nicely done. We were talking last night at dinner about making > bytecode interpreters go fast, and this is helpful. > > I think there may be a combinator here, for bytecode dispatch loops. > (This feels like the PIC combinator, both fundamental and tricky to get > right.) > A "bytecode" dispatch combinator might provide a pattern like the following: > > MethodHandle init, pred, step, fini, index; > @NonNull MethodHandle[] tab; > R dispatch(A… a) { > V v = init(a…); // one here; there might be many v… > while (pred(a…)) { > v = step(v, a…); > // execute one "instruction": > int i = index(v, a…); > tab[i].invokeExact(v, a…); > } > return fini(V, a…); > } > > The explicit table would be recopied internally to an @Stable array for > constant > folding. > The various index values could be tracked and turned into traces (or some > other > profile > driven structure) which would be compiled together, in specialized segments of > the unrolled > interpreter loop. > > Or maybe just the table lookup part alone makes a good combinator, to be > paired > with the loop combinators? > > Comments welcome… > > — John > > P.S. Another new RFE: > experimental hook for creating heisenboxes > https://bugs.openjdk.java.net/browse/JDK-8163133 > ___ > mlvm-dev mailing list > mlvm-dev@openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/mlvm-dev ___ mlvm-dev mailing list mlvm-dev@openjdk.java.net http://mail.openjdk.java.net/mailman/listinfo/mlvm-dev
Re: Threaded interpreter in Java
On Aug 3, 2016, at 1:22 PM, Remi Foraxwrote: > > Charles ask if we can make a fast register interpreter in Java, > here is my take on a threaded-like interpreter > > https://gist.github.com/forax/f38b533e089217cfc4d0ae3c6e2de9c9 Nicely done. We were talking last night at dinner about making bytecode interpreters go fast, and this is helpful. I think there may be a combinator here, for bytecode dispatch loops. (This feels like the PIC combinator, both fundamental and tricky to get right.) A "bytecode" dispatch combinator might provide a pattern like the following: MethodHandle init, pred, step, fini, index; @NonNull MethodHandle[] tab; R dispatch(A… a) { V v = init(a…); // one here; there might be many v… while (pred(a…)) { v = step(v, a…); // execute one "instruction": int i = index(v, a…); tab[i].invokeExact(v, a…); } return fini(V, a…); } The explicit table would be recopied internally to an @Stable array for constant folding. The various index values could be tracked and turned into traces (or some other profile driven structure) which would be compiled together, in specialized segments of the unrolled interpreter loop. Or maybe just the table lookup part alone makes a good combinator, to be paired with the loop combinators? Comments welcome… — John P.S. Another new RFE: experimental hook for creating heisenboxes https://bugs.openjdk.java.net/browse/JDK-8163133 ___ mlvm-dev mailing list mlvm-dev@openjdk.java.net http://mail.openjdk.java.net/mailman/listinfo/mlvm-dev