Re: JIT me some speed!
On Fri, Dec 21, 2001 at 12:03:51AM +, Tom Hughes wrote:
In message [EMAIL PROTECTED] Dan Sugalski [EMAIL PROTECTED] wrote:

To run a program with the JIT, pass test_parrot the -j flag and watch it scream. Well, scream if you're on x86 Linux or BSD (I get a speedup on mops.pbc of 35x) but it's a darned good place to start.

It does seem to be quite impressively fast. Faster even than the compiled version of mops on my machine...

It looks like it is going to need some work before it can work for other instruction sets though, at least for RISC systems where the operands are typically encoded with the opcode as part of a single word and the range of immediate constants is often restricted.

I'm thinking it will need some way of indicating field widths and shifts for the operands and opcode so they can be merged into an instruction word and also some way of handling a constant pool so that arbitrary addresses can be loaded using PC relative loads.

I've been thinking about this while trying to work out how to make an ARM JIT, and worrying somewhat that I'm in danger of trying to make something run before it can walk. I don't know how typical ARM is of RISC CPUs, but it's probably enough food for thought. I may have made some minor technical errors in this:

All our JIT code would be running in user mode on ARM, so we only get to play with the user mode bank of 16 "general purpose" registers, named r0-r15 ("general purpose" in quotes because r15 is the program counter, one (now always r13) is the stack pointer and the ABI specifies uses for some others). I assume that we're not generating shared library code (if we are, the ABI claims r9 from us).

The ABI needs 6 registers to call another function; a called function has the first 4 integer arguments passed in using r0-r3, and returns a result in r0. However, in the body of a function we can arrange to use 2 of those reserved 6, so we have 12 registers for our use, of which r4-r9 will be preserved across function calls.
Memory loads and stores have to be done relative to a base register (which can be r15, the PC), +- a constant (in the range 0-4095) or another register (optionally shifted by a constant).

There is no direct way to load a register with an arbitrary 32 bit constant. Registers can be loaded with an immediate constant stored in 12 bits as (8 bit value) rotated right by 2 * (4 bit value). So you can load 0xFF in one instruction (4 bytes); for something like 0xFFF it's best to express it in 2 instructions (8 bytes) as load 0xFF, add 0xF00; and for complex values stuff it in a data word within 4092 bytes of the PC and load it PC relative, which is also 8 bytes, but possibly slower.

The upshot of this is that loading integer register values by substituting addresses into template code is not an easy way to go. Much better would be to load up r4 to r7 with the base addresses of the integer, floating point, string and PMC registers and do a "load integer register 16 into r0" as

    ldr r0, [r4, #60]    @ parrot registers start at 1, don't they?

[maybe I am already in danger of trying to run before I can walk, as I suspect that the address substitution can be shoe-horned as

    ldr r0, [pc, #0]     @ pc reads as this instruction + 8, i.e. the .word below
    b .L1
    .word                @ substitute your address here
.L1:
    ldr r0, [r0]

but that's something like 5 CPU cycles rather than 2.]

I'm in danger of wanting to run before I can walk. And maybe I should shut up for now as I'm describing a possible next generation JIT, rather than this one:

The reason I'd naturally want to define Parrot_set_i_i not as 1 code snippet, but as

    load parrot integer register (from memory) into CPU register n
    # do nothing here
    store CPU register n to parrot integer register (to memory)

is because even with a more complex JIT syntax that lets me translate set I2, I4 as

    ldr r0, [r4, #12]
    str r0, [r4, #4]

with r4-r7 loaded with those base addresses, I'm wasting most of the ARM CPU: I've got r1-r3, r8, r9, r12 and r14 that I'm not even using.
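The immediate encoding rule described above can be sketched in a few lines. This is only an illustration, assuming the classic 32-bit ARM data-processing immediate (an 8 bit value rotated right by twice a 4 bit count); the function names are invented for the example and are not part of Parrot or of any assembler.

```python
def rotate_right(value, amount):
    """Rotate a 32-bit value right by `amount` bits."""
    amount %= 32
    return ((value >> amount) | (value << (32 - amount))) & 0xFFFFFFFF


def encode_arm_immediate(constant):
    """Return (imm8, rot4) such that imm8 rotated right by 2*rot4 equals
    `constant`, or None if no single-instruction encoding exists.

    Hypothetical helper for illustration only.
    """
    constant &= 0xFFFFFFFF
    for rot4 in range(16):
        # Undo the rotate-right by rotating left the same number of bits.
        imm8 = rotate_right(constant, (32 - 2 * rot4) % 32)
        if imm8 <= 0xFF:
            return imm8, rot4
    return None


if __name__ == "__main__":
    for constant in (0xFF, 0xF00, 0xFFF, 0xFF000000):
        print(hex(constant), encode_arm_immediate(constant))
```

Under these assumptions 0xFF and 0xF00 each fit in one instruction while 0xFFF does not, which is why the text suggests splitting it into a load of 0xFF followed by an add of 0xF00.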
I've got to be careful in what I'm suggesting here: I'm *not* suggesting that I (or anyone else) immediately write a JIT that maps CPU registers to parrot registers and attempts to keep values in the CPU where possible. I *am* suggesting that if Parrot_set_i_i is defined as

    input:  load parrot integer register (from memory) into CPU register n
    body:
    output: store CPU register n to parrot integer register (to memory)

where the JIT provides that load and store code (rather than each op) then the work that goes into writing body code for the ops is still useful and usable with a second (third?) generation JIT that does know how to combine 2 or more adjacent Parrot ops.

Possibly also useful thoughts: For things like generating constants for ARM, it would be useful to be able to specify alternative code generation snippets, with some
Re: .ops metadata [was: Re: JIT me some speed!]
On Mon, Dec 24, Gregor N. Purdy wrote:

Nicholas --

Parrot_set_i_i(in,out): \x8b \x0d IR2 \x89 \x0d IR1

I'm tempted to push the specification of this information all the way back to the syntax of .ops files, since the code that lives there should behave the same wrt read/write on args.

Dan likes C-like syntax as much as possible in other parts of the ops file. That's why I chose 'inline' as the tag for 'trivial' ops (although that's a C++-ism).

More than one C compiler uses 'inline' as an extension to hint at inlining code under optimization. As we are effectively defining our own compiler...

If we didn't mind the verbosity, a C-like syntax would be:

    inline op set(register INTVAL, const register INTVAL) {
        $1 = $2;
    }

instead of

    inline op set(i, i) {
        $1 = $2;
    }

The problem is, we lose the nice space/time/etc. saving capability of:

    inline op set(i, i|ic) {
        $1 = $2;
    }

FWIW, I like the 'C-ish' version. It makes it accessible to folks who know C, not a mongrel language which may or may not even exist.

Note that in the last version above, we are _not_ saying that the second argument is the result of evaluating a bitwise OR on 'i' and 'ic'.

Michael
--
Michael Fischer         7.5 million years to run
[EMAIL PROTECTED]       printf %d, 0x2a;
                            -- deep thought
Re: .ops metadata [was: Re: JIT me some speed!]
Jason --

Making the distinction between the three cases enables a number of optimizations of native code based on analysing data flow. 'in' would be good as an implicit default, as many PMC opcodes will not overwrite any PMC registers.

An optimizing native code generator (whether static or JIT) will also need to be aware of operands that may implicitly clobber parrot register values or modify control flow, so that it knows when it must spill updated parrot register values in hardware registers back to their memory locations and when it must reload hardware registers from main memory.

I had considered something like this, using Apocalypse 2 properties, and 'sub' instead of 'op':

    sub set(INTVAL $1 is written, INTVAL $2 is read) is inline {
        $1 = $2;
        goto NEXT();
    }

    # ...

    sub branch(INTVAL $1 is read) is inline nonlinear {
        goto OFFSET($1);
    }

Heck, that's almost Perl. In fact, we *could* go all the way to named args:

    sub set(INTVAL $target is written, INTVAL $source is read) is inline {
        $target = $source;
        goto NEXT();
    }

    # ...

    sub branch(INTVAL $dest is read) is inline nonlinear {
        goto OFFSET($dest);
    }

We could even use adverbial-looking notation for gotos:

    sub set(INTVAL $target is written, INTVAL $source is read) is inline {
        $target = $source;
        goto : next;
    }

    # ...

    sub branch(INTVAL $dest is read) is inline nonlinear {
        goto : offset $dest;
    }

Finally, we could use the Perl 6 no-funny-business typenames:

    sub set(int $target is written, int $source is read) is inline {
        $target = $source;
        goto : next;
    }

    # ...

    sub branch(int $dest is read) is inline nonlinear {
        goto : offset $dest;
    }

We could take 'written' as implying 'register' and not 'constant'; and 'read' (without 'written') could imply 'constant' and not 'register'. We could automatically treat those 'read' args as we do 'x|xc' today. We could automatically treat those 'written' args as we do 'x' today.

This moves us in the direction of very-Perl-looking .ops code vs. semi-C-looking .ops code, which wouldn't bother me.

I can live with C-like or Perl-like syntax here. Note, though, that we *are* using Perl-style comments and POD documentation, which means that Perl-like syntax would be consistent.

One Perl thing we would be breaking is subroutine name overloading. We'd have 'set' in there multiple times, once for each register type. To get around this, we'd have to name them set_[inps] and make sure we've got the C function name generation logic doing The Right Thing. Not insurmountable.

The code that does the .ops file reading could be made to permit any number of tags. Stashing them in the Parrot/OpLib/*.pm file is easy. Stashing them in *_ops.c won't be hard if we just treat them like an array of strings with a NULL terminator, probably sorted. Harder to use than flag bits, but extensible. Either way would work.

Regards

-- Gregor
 /Inspiration  Innovation  Excellence (TM)\
 Gregor N. Purdy [EMAIL PROTECTED]
 Focus Research, Inc. http://www.focusresearch.com/
 8080 Beckett Center Drive #203 513-860-3570 vox
 West Chester, OH 45069 513-860-3579 fax
 \/

[[EMAIL PROTECTED]]$ ping osama.taliban.af
PING osama.taliban.af (68.69.65.68) from 20.1.9.11 : 56 bytes of data.
From 85.83.77.67: Time to live exceeded
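The "C function name generation logic" mentioned above can be sketched roughly as follows. This is a hypothetical illustration, not the actual .ops reader: it assumes each argument carries a direction tag plus a set of alternatives written as 'i|ic', and that generated names follow the Parrot_set_i_i pattern seen earlier in the thread. The name expand_op and the tuple layout are invented for the example.

```python
from itertools import product


def expand_op(name, args):
    """Expand one op declaration into its concrete variants.

    `args` is a list of (direction, kinds) pairs, e.g. ("in", "i|ic").
    Returns a list of (c_function_name, [(direction, kind), ...]).
    Hypothetical helper; the real .ops tooling may differ.
    """
    directions = [direction for direction, _ in args]
    choices = [kinds.split("|") for _, kinds in args]
    variants = []
    for combo in product(*choices):
        c_name = "_".join(["Parrot", name, *combo])
        variants.append((c_name, list(zip(directions, combo))))
    return variants


if __name__ == "__main__":
    # inline op set(out i, in i|ic) { $1 = $2; }
    for c_name, tagged_args in expand_op("set", [("out", "i"), ("in", "i|ic")]):
        print(c_name, tagged_args)
```

Keeping the tags as plain strings next to each argument mirrors the "array of strings" idea: harder to test than flag bits, but new tags need no change to the data layout.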
Re: JIT me some speed!
I think we should leave all that for an optimizer.

Daniel Grunblatt.

On Mon, 24 Dec 2001, Nicholas Clark wrote:
On Fri, Dec 21, 2001 at 12:03:51AM +, Tom Hughes wrote:

It looks like it is going to need some work before it can work for other instruction sets though, at least for RISC systems where the operands are typically encoded with the opcode as part of a single word and the range of immediate constants is often restricted.

I'm thinking it will need some way of indicating field widths and shifts for the operands and opcode so they can be merged into an instruction word and also some way of handling a constant pool so that arbitrary addresses can be loaded using PC relative loads.

Another thing that struck me on reading it was:

    =item C<B<IR>I<n>>

    Place the address of the C<INTVAL> register specified in the I<n>th argument.

RISC chips have lots of general purpose registers. It's likely that there will be enough spare that several can be used to map to parrot registers. Say 4 are available, it would be useful to be able to say that an op requires the value of rN and rM, and modifies rD. The JIT compiler would make a sandwich with the code to read in N and M into two of the real CPU registers, the op filling, and then some more code to write D back to memory.

However, if the JIT can see that N is already in a CPU register from the previous op, or D is going to be used and modified by the next op, it can skip, defer or whatever some of the memory reads and writes.

[And provided the descriptions are this helpful it doesn't have to do it immediately. It becomes possible to write a better optimising JIT that makes sandwiches with multiple fillings or even Scooby Snacks, while the initial JIT insists that the only recipe available is bread, 1 filling, bread]

mops will be fast if

    REDO: sub I4, I4, I3
          if I4, REDO

maps to

    REDO: load I4 from memory (which will be in the L1 cache)
          load I3 from memory
          I4 = I4 - I3
          store I4 to memory
          load I4 from memory
          is it non-zero?
          goto REDO if so

it will be slightly faster if it maps to

    REDO: load I4 from memory (which will be in the L1 cache)
          load I3 from memory
          I4 = I4 - I3
          store I4 to memory
          # I4 still in a CPU register
          is it non-zero?
          goto REDO if so

and faster still if the JIT can see how to push things out of the loop:

          load I4 from memory
          load I3 from memory
    REDO: I4 = I4 - I3
          is it non-zero?
          goto REDO if so
          store I4 to memory

(does threading mess this idea up?)

Nicholas Clark
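The "sandwich" idea can be illustrated with a toy code generator. This is a minimal sketch under stated assumptions: each op is described only by the parrot registers it reads and writes, the emitted "code" is a list of strings, and register tracking is valid for one straight-line pass through the loop body. None of the names below come from the Parrot sources.

```python
# One pass through the mops loop body: (op text, registers read, registers written).
MOPS_LOOP = [
    ("sub I4, I4, I3", ["I4", "I3"], ["I4"]),
    ("if I4, REDO", ["I4"], []),
]


def naive_sandwich(ops):
    """bread, 1 filling, bread: load every input, run the op, store every output."""
    code = []
    for text, reads, writes in ops:
        code += [f"load {reg}" for reg in reads]
        code.append(text)
        code += [f"store {reg}" for reg in writes]
    return code


def caching_sandwich(ops):
    """Skip a load when the value is already held in a CPU register."""
    in_cpu = set()  # parrot registers currently mirrored in CPU registers
    code = []
    for text, reads, writes in ops:
        for reg in reads:
            if reg not in in_cpu:
                code.append(f"load {reg}")
                in_cpu.add(reg)
        code.append(text)
        for reg in writes:
            code.append(f"store {reg}")
            in_cpu.add(reg)
    return code


if __name__ == "__main__":
    print(naive_sandwich(MOPS_LOOP))
    print(caching_sandwich(MOPS_LOOP))
```

The naive generator reproduces the first mapping (three loads and one store per iteration); the caching one drops the reload of I4 before the test, matching the "slightly faster" mapping. Hoisting loads and stores out of the loop needs control-flow knowledge that this sketch deliberately leaves out.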
Re: JIT me some speed!
Oh, and BTW, I already tried your fastest example last week and got a 50x speed up, but that works only for mops, so ...

Daniel Grunblatt.

On Mon, 24 Dec 2001, Nicholas Clark wrote:

[snip]

and faster still if the JIT can see how to push things out of the loop:

          load I4 from memory
          load I3 from memory
    REDO: I4 = I4 - I3
          is it non-zero?
          goto REDO if so
          store I4 to memory

(does threading mess this idea up?)

Nicholas Clark
.ops metadata [was: Re: JIT me some speed!]
Nicholas --

Parrot_set_i_i(in,out): \x8b \x0d IR2 \x89 \x0d IR1

I'm tempted to push the specification of this information all the way back to the syntax of .ops files, since the code that lives there should behave the same wrt read/write on args.

Dan likes C-like syntax as much as possible in other parts of the ops file. That's why I chose 'inline' as the tag for 'trivial' ops (although that's a C++-ism).

If we didn't mind the verbosity, a C-like syntax would be:

    inline op set(register INTVAL, const register INTVAL) {
        $1 = $2;
    }

instead of

    inline op set(i, i) {
        $1 = $2;
    }

The problem is, we lose the nice space/time/etc. saving capability of:

    inline op set(i, i|ic) {
        $1 = $2;
    }

But, we could still adopt the C-ism 'const' as meaning read-only, and assume all non-const arguments are written:

    inline op set(i, const i|ic) {
        $1 = $2;
    }

Or, do we really need to have the three-way in/out/inout tagset?

    inline op set(out i, in i|ic) {
        $1 = $2;
    }

Regards,

-- Gregor N. Purdy [EMAIL PROTECTED]
RE: .ops metadata [was: Re: JIT me some speed!]
Gregor N. Purdy:
# Parrot_set_i_i(in,out): \x8b \x0d IR2 \x89 \x0d IR1
#
# I'm tempted to push the specification of this information all the way
# back to the syntax of .ops files, since the code that lives there
# should behave the same wrt read/write on args.
#
# Dan likes C-like syntax as much as possible in other parts of the ops
# file. That's why I chose 'inline' as the tag for 'trivial' ops
# (although that's a C++-ism).
#
# If we didn't mind the verbosity, a C-like syntax would be:
#
#     inline op set(register INTVAL, const register INTVAL) {
#         $1 = $2;
#     }
#
# instead of
#
#     inline op set(i, i) {
#         $1 = $2;
#     }
#
# The problem is, we lose the nice space/time/etc. saving capability of:
#
#     inline op set(i, i|ic) {
#         $1 = $2;
#     }
#
# But, we could still adopt the C-ism 'const' as meaning read-only, and
# assume all non-const arguments are written:
#
#     inline op set(i, const i|ic) {
#         $1 = $2;
#     }
#
# Or, do we really need to have the three-way in/out/inout tagset?
#
#     inline op set(out i, in i|ic) {
#         $1 = $2;
#     }

Or we could go with the Perl 6-ism:

    inline op set(i is rw, i|ic) {
        $1=$2;
    }

--Brent Dax
[EMAIL PROTECTED]
Configure pumpking for Perl 6

Nothing important happened today.
    --George III of England's diary entry for 4-Jul-1776
Re: JIT me some speed!
Dan and Michael --

    $ ./test_parrot -j examples/assembly/mops.pbc
    Illegal instruction

That's not supposed to happen is it? It's Linux/PowerPC, so maybe it is supposed to happen.

It's sort of supposed to happen. It shouldn't work, at least--we need better error checking and such, so the -j is disabled on systems we don't support yet.

Today's the day guards in interpreter.c / test_main.c go in so you get yelled at instead of barfed on in this case.

Regards,

-- Gregor N. Purdy [EMAIL PROTECTED]
Re: JIT me some speed!
All --

    $ ./test_parrot -j examples/assembly/mops.pbc
    Illegal instruction

That's not supposed to happen is it? It's Linux/PowerPC, so maybe it is supposed to happen.

It's sort of supposed to happen. It shouldn't work, at least--we need better error checking and such, so the -j is disabled on systems we don't support yet.

Today's the day guards in interpreter.c / test_main.c go in so you get yelled at instead of barfed on in this case.

The code is in.

Regards,

-- Gregor N. Purdy [EMAIL PROTECTED]
Re: JIT me some speed!
On Fri, 21 Dec 2001, Tom Hughes wrote:
In message [EMAIL PROTECTED] Dan Sugalski [EMAIL PROTECTED] wrote:

To run a program with the JIT, pass test_parrot the -j flag and watch it scream. Well, scream if you're on x86 Linux or BSD (I get a speedup on mops.pbc of 35x) but it's a darned good place to start.

It does seem to be quite impressively fast. Faster even than the compiled version of mops on my machine...

It looks like it is going to need some work before it can work for other instruction sets though, at least for RISC systems where the operands are typically encoded with the opcode as part of a single word and the range of immediate constants is often restricted.

I'm thinking it will need some way of indicating field widths and shifts for the operands and opcode so they can be merged into an instruction word and also some way of handling a constant pool so that arbitrary addresses can be loaded using PC relative loads.

I suspect it is also rather questionable to call system calls directly rather than going via their C library veneers - that is even more true when you come to things (like socket calls) which are system calls on some machines and functions on others.

We are not always calling system calls directly; we can use the C library whenever we need it. Check out the .jit syntax.

Tom
--
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu/
Re: JIT me some speed!
In message [EMAIL PROTECTED] Daniel Grunblatt [EMAIL PROTECTED] wrote:
On Fri, 21 Dec 2001, Tom Hughes wrote:

I suspect it is also rather questionable to call system calls directly rather than going via their C library veneers - that is even more true when you come to things (like socket calls) which are system calls on some machines and functions on others.

We are not always calling system calls directly, we can use the C library when ever we need it, check out the .jit syntax.

I did have a brief look last night but I must have missed that. No problem on that front then.

Incidentally the JIT times are definitely impressive... Times for a 1.33 GHz Athlon are like this:

    dutton [~/src/parrot] % ./test_parrot ./examples/assembly/mops.pbc
    Iterations:    1
    Estimated ops: 2
    Elapsed time:  4.806858
    M op/s:        41.607220

    dutton [~/src/parrot] % ./test_parrot -j ./examples/assembly/mops.pbc
    Iterations:    1
    Estimated ops: 2
    Elapsed time:  0.300258
    M op/s:        666.093736

    dutton [~/src/parrot] % ./examples/assembly/mops
    Iterations:    1
    Estimated ops: 2
    Elapsed time:  0.324787
    M op/s:        615.788117

Tom
--
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu
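The relative speeds implied by those timings can be worked out directly. The sketch below uses only the elapsed times and M op/s figures quoted above; the dictionary keys and helper name are labels invented for the example. Multiplying each M op/s figure by its elapsed time also gives the op count each run implies, which serves as a sanity check on the figures.

```python
# (elapsed seconds, reported M op/s) as quoted in the message above.
RUNS = {
    "interpreter": (4.806858, 41.607220),
    "jit": (0.300258, 666.093736),
    "compiled": (0.324787, 615.788117),
}


def speedup(baseline, candidate):
    """How many times faster `candidate` ran than `baseline`."""
    return RUNS[baseline][0] / RUNS[candidate][0]


def implied_million_ops(run):
    """Millions of ops implied by elapsed time and reported rate."""
    elapsed, mops = RUNS[run]
    return elapsed * mops


if __name__ == "__main__":
    print(f"JIT vs interpreter: {speedup('interpreter', 'jit'):.2f}x")
    print(f"JIT vs compiled:    {speedup('compiled', 'jit'):.2f}x")
    for run in RUNS:
        print(f"{run}: about {implied_million_ops(run):.1f} million ops")
```

On this Athlon the JIT comes out roughly 16 times faster than the interpreter and about 8% faster than the compiled mops, and all three runs imply about 200 million ops.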
Re: JIT me some speed!
Don't forget that (unless I'm missing something) by the time pbc2c.pl works with all the ops, it will be much slower than the JIT.

Daniel Grunblatt.

On 21 Dec 2001, Tom Hughes wrote:
In message [EMAIL PROTECTED] Daniel Grunblatt [EMAIL PROTECTED] wrote:
On Fri, 21 Dec 2001, Tom Hughes wrote:

I suspect it is also rather questionable to call system calls directly rather than going via their C library veneers - that is even more true when you come to things (like socket calls) which are system calls on some machines and functions on others.

We are not always calling system calls directly, we can use the C library when ever we need it, check out the .jit syntax.

I did have a brief look last night but I must have missed that. No problem on that front then.

Incidentally the JIT times are definitely impressive... Times for a 1.33 GHz Athlon are like this:

    dutton [~/src/parrot] % ./test_parrot ./examples/assembly/mops.pbc
    Iterations:    1
    Estimated ops: 2
    Elapsed time:  4.806858
    M op/s:        41.607220

    dutton [~/src/parrot] % ./test_parrot -j ./examples/assembly/mops.pbc
    Iterations:    1
    Estimated ops: 2
    Elapsed time:  0.300258
    M op/s:        666.093736

    dutton [~/src/parrot] % ./examples/assembly/mops
    Iterations:    1
    Estimated ops: 2
    Elapsed time:  0.324787
    M op/s:        615.788117

Tom
--
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu
Re: JIT me some speed!
In message [EMAIL PROTECTED] Dan Sugalski [EMAIL PROTECTED] wrote:

To run a program with the JIT, pass test_parrot the -j flag and watch it scream. Well, scream if you're on x86 Linux or BSD (I get a speedup on mops.pbc of 35x) but it's a darned good place to start.

It does seem to be quite impressively fast. Faster even than the compiled version of mops on my machine...

It looks like it is going to need some work before it can work for other instruction sets though, at least for RISC systems where the operands are typically encoded with the opcode as part of a single word and the range of immediate constants is often restricted.

I'm thinking it will need some way of indicating field widths and shifts for the operands and opcode so they can be merged into an instruction word and also some way of handling a constant pool so that arbitrary addresses can be loaded using PC relative loads.

I suspect it is also rather questionable to call system calls directly rather than going via their C library veneers - that is even more true when you come to things (like socket calls) which are system calls on some machines and functions on others.

Tom
--
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu/
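The "field widths and shifts" idea can be sketched as a small table-driven packer. This is only an illustration of the technique, not a proposal for the .jit syntax: the layout table and function name are invented, and the example layout assumes the classic 32-bit ARM encoding of a word load with a positive 12 bit immediate offset.

```python
def pack_fields(layout, values):
    """Merge named fields into one 32-bit instruction word.

    `layout` maps field name -> (shift, width); `values` maps field name -> int.
    Hypothetical helper for illustration only.
    """
    word = 0
    for name, (shift, width) in layout.items():
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << shift
    return word


# Assumed layout for "ldr Rd, [Rn, #imm12]": the top 12 bits hold the
# condition code and fixed opcode bits, then base and destination registers.
LDR_IMMEDIATE = {
    "fixed": (20, 12),
    "rn": (16, 4),
    "rd": (12, 4),
    "imm12": (0, 12),
}

if __name__ == "__main__":
    # ldr r0, [r4, #60]
    word = pack_fields(LDR_IMMEDIATE, {"fixed": 0xE59, "rn": 4, "rd": 0, "imm12": 60})
    print(hex(word))
```

The range check is where a restricted immediate shows up: an offset that does not fit in its field has to be handled some other way, for instance through a constant pool and a PC relative load.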
Re: JIT me some speed!
Dan Sugalski [EMAIL PROTECTED] wrote:

To run a program with the JIT, pass test_parrot the -j flag and watch it scream. Well, scream if you're on x86 Linux or BSD (I get a speedup on mops.pbc of 35x) but it's a darned good place to start.

    $ ./test_parrot -j examples/assembly/mops.pbc
    Illegal instruction

That's not supposed to happen is it? It's Linux/PowerPC, so maybe it is supposed to happen.

--
Michael G. Schwern [EMAIL PROTECTED] http://www.pobox.com/~schwern/
Perl Quality Assurance [EMAIL PROTECTED] Kwalitee Is Job One
mendel ScHWeRnsChweRNsChWErN SchweRN SCHWErNSChwERnsCHwERN sChWErn ScHWeRn schweRn sCHWErN schWeRnscHWeRN SchWeRN scHWErn SchwErn scHWErn ScHweRN sChwern scHWerNscHWeRn scHWerNScHwerN SChWeRN scHWeRn SchwERNschwERnSCHwern sCHWErN SCHWErN sChWeRn
Re: JIT me some speed!
On Thu, 20 Dec 2001, Michael G Schwern wrote:
Dan Sugalski [EMAIL PROTECTED] wrote:

To run a program with the JIT, pass test_parrot the -j flag and watch it scream. Well, scream if you're on x86 Linux or BSD (I get a speedup on mops.pbc of 35x) but it's a darned good place to start.

    $ ./test_parrot -j examples/assembly/mops.pbc
    Illegal instruction

That's not supposed to happen is it? It's Linux/PowerPC, so maybe it is supposed to happen.

It's sort of supposed to happen. It shouldn't work, at least--we need better error checking and such, so the -j is disabled on systems we don't support yet.

dan
JIT me some speed!
Thanks to the work of Daniel Grunblatt, we now have JIT capabilities in parrot. It's in the latest CVS, ready for your use and abuse.

To run a program with the JIT, pass test_parrot the -j flag and watch it scream. Well, scream if you're on x86 Linux or BSD (I get a speedup on mops.pbc of 35x) but it's a darned good place to start.

Dan

--it's like this---
Dan Sugalski          even samurai
[EMAIL PROTECTED]     have teddy bears and even
                      teddy bears get drunk