Re: Need for fingerprinting? [was: Re: What to do if Digest::MD5 is unavailable?]
Nicholas Clark wrote: On Mon, Nov 04, 2002 at 10:09:06AM -0500, [EMAIL PROTECTED] wrote: [ JIT + cg_core ] I'm not convinced. Compiling the computed goto core with any sort of optimisation turned on *really* hurts the machine.

Here gcc 2.95.2 just fails (256 MB Mem, same swap).

I doubt that the CG core's stats look anywhere near as impressive for the unoptimised case. [And I'm not at a machine where I can easily generate some] This makes me think that it would be hard to just in time ... However, I suspect that part of the speed of the CG core comes from the compiler (this is always gcc?) being able to do away with the function call and function return overheads between the ops it has inlined in the CG core.

Yes, saving the function call overhead is the major speed win in CGoto.

I've no idea if gcc is allowed to re-order the op blocks in the CG core.

Doesn't matter IMHO (when we annotate the source) ...

If not, then we might be able to pick apart the blocks it compiles (for units for the JIT to use) by putting in custom asm statements between each, which our assembler (or machine code) parser spots and uses as delimiters (hmm. particularly if we have header and trailer asm statements that are actually just assembly language comments with marker text that gcc passes through undigested. This would let us annotate the assembler output of gcc)

But this is only half of the work. JIT's currently outstanding integer performance depends on explicit register allocation for the most used IRegs in one block. Mixing JIT instructions with gcc-generated code wouldn't work because of this register allocation. My experiment with micro-ops could help here, where the optimizer would basically generate code for a 3-register machine.

Nicholas Clark

leo
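For readers who haven't seen the dispatch technique under discussion, here is a minimal sketch of a computed-goto core using gcc's labels-as-values extension. The two "ops" (0 = add-immediate, 1 = halt) are invented for illustration and are not Parrot's actual ops; the point is that each op body is inlined in one function and dispatch is a single indirect jump, with no call/return overhead between ops.

```c
/* Minimal computed-goto dispatch sketch (gcc labels-as-values
 * extension).  Opcodes: 0 = add-immediate (takes one operand),
 * 1 = halt.  Hypothetical ops, not Parrot's real core. */
long cg_run(const int *pc) {
    static void *optable[] = { &&op_add, &&op_halt };
    long acc = 0;

    goto *optable[*pc];
op_add:                      /* op body is inlined: no function call */
    acc += pc[1];
    pc += 2;
    goto *optable[*pc];      /* dispatch = one indirect jump */
op_halt:
    return acc;
}
```

With the bodies laid out back to back like this, gcc can also apply cross-op optimizations, which is part of why optimizing this one huge function is so expensive for the compiler.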
Re: Need for fingerprinting? [was: Re: What to do if Digest::MD5 is unavailable?]
Jason Gloudon wrote: On Mon, Nov 04, 2002 at 09:21:06PM +, Nicholas Clark wrote: It turns out the optimization does make a difference for gcc at least, but for a strange reason. It seems that without optimization gcc allocates a *lot* more space on the stack for cg_core. I suspect this is because gcc does not coalesce the stack space used for temporary values unless optimization is enabled.

I never figured out where this stack space was used. Anyway, my last patch should improve the unoptimized case due to a faster trace_system_stack, by putting lo_var_ptr beyond this jump table.

(gcc with no optimization) M op/s: 14.783200
(gcc -O2) M op/s: 6.642035

Numbers reversed?

leo
Re: Need for fingerprinting? [was: Re: What to do if Digest::MD5 is unavailable?]
Nicholas Clark wrote: I'm not convinced. Compiling the computed goto core with any sort of optimisation turned on *really* hurts the machine. I think it's over a minute even on a 733 MHz PIII, and it happily pages everything else out while it's doing it. :-(

Use the -fno-gcse option to gcc to turn off global common subexpression elimination. That may help with the speed issue. GCSE messes up interpreter loops anyway. I found out the hard way on Portable.NET that GCSE makes the code perform worse, not better. The compiler gets too greedy about common code, and starts moving things that should stay inline. GCSE is great in normal code, but not in the central interpreter loop. Cheers, Rhys.
Re: Need for fingerprinting? [was: Re: What to do if Digest::MD5 is unavailable?]
[EMAIL PROTECTED] wrote: Leo -- Here's one of the messages about how I'd like to see us link op implementations with their op codes: http://archive.develooper.com/perl6-internals;perl.org/msg06193.html

Thanks for all these pointers. I did read this thread WRT dynamic opcode loading. We will need the ability to load different oplibs. But for core.ops I'd rather stay with the static scheme used now. Your proposal would solve the problem with fingerprinting, but especially for huge programs the load-time overhead seems too big to me. Invalid PBC files due to changes will become less and less frequent during development of parrot, as core.ops stabilizes. What remains is the problem of how to run ops from different oplibs _fast_. leo
Re: Need for fingerprinting? [was: Re: What to do if Digest::MD5 is unavailable?]
Nicholas -- I agree it would be good to see CG performance (compilation and execution speed) with and without optimizations. I haven't done the experiment myself. I had tinkered with some asm-comment ideas last year when discussing JIT with Daniel Grunblatt. For your amusement, I've attached a tarball with a tiny example. I was trying to insert asm comments for later parsing. I didn't get far, but that doesn't mean it's not possible. Regards, -- Gregor

Nicholas Clark [EMAIL PROTECTED] 11/04/2002 04:21 PM
To: [EMAIL PROTECTED]
cc: Leopold Toetsch [EMAIL PROTECTED], Brent Dax [EMAIL PROTECTED], 'Andy Dougherty' [EMAIL PROTECTED], Josh Wilmes [EMAIL PROTECTED], 'Perl6 Internals' [EMAIL PROTECTED]
Subject: Re: Need for fingerprinting? [was: Re: What to do if Digest::MD5 is unavailable?]

On Mon, Nov 04, 2002 at 10:09:06AM -0500, [EMAIL PROTECTED] wrote: While I don't think I'm sophisticated enough to pull it off on my own, I do think it should be possible to use what was learned to build the JIT system to build the equivalent of a CG core on the fly, given its structure. I think the information and basic capabilities are already there: The JIT system already knows how to compile a sequence of ops to machine code -- using this plus enough know-how to plop in the right JMP instructions pretty much gets you there. A possible limitation to the

I'm not convinced. Compiling the computed goto core with any sort of optimisation turned on *really* hurts the machine. I think it's over a minute even on a 733 MHz PIII, and it happily pages everything else out while it's doing it. :-( I doubt that the CG core's stats look anywhere near as impressive for the unoptimised case. [And I'm not at a machine where I can easily generate some] This makes me think that it would be hard to just in time

coolness, here: I think the JIT system bails out for the non-inline ops and just calls the opfunc (please forgive if my understanding of what JIT does and doesn't do is out of date).
I think the CG core doesn't have to take the hit of that extra indirection for non-inline ops. If so, then the hypothetical dynamic core construction approach just described would approach the speed of the CG core, but would fall somewhat short on workloads that involve lots of non-inline ops (FWIW, there are more inline ops than not in the current *.ops files).

I believe that your understanding of the JIT and the CG cores is still correct. The problem would be solved if we had some nice way of getting the C compiler to generate us nice stub versions of all the non-inline op functions, which we could then place inline. However, I suspect that part of the speed of the CG core comes from the compiler (this is always gcc?) being able to do away with the function call and function return overheads between the ops it has inlined in the CG core. I've no idea if gcc is allowed to re-order the op blocks in the CG core. If not, then we might be able to pick apart the blocks it compiles (for units for the JIT to use) by putting in custom asm statements between each, which our assembler (or machine code) parser spots and uses as delimiters (hmm. particularly if we have header and trailer asm statements that are actually just assembly language comments with marker text that gcc passes through undigested. This would let us annotate the assembler output of gcc)

Nicholas Clark -- Brainfuck better than perl? http://www.perl.org/advocacy/spoofathon/

asm-fun.tar.gz Description: Binary data
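The marker idea can be tried directly: gcc copies the string of a basic asm statement into its assembler output unmodified, so a marker that is only an assembler comment survives compilation and can be found by a post-processing script. A sketch under those assumptions (the macro and marker names are invented; "#" starts a line comment in x86 GNU as, so this is target-specific):

```c
/* Header/trailer markers emitted verbatim into gcc -S output so an
 * external parser can delimit op bodies.  Hypothetical names; "#"
 * is the x86 GNU as comment character. */
#define OP_BEGIN(name) __asm__ volatile("# PARROT_OP_BEGIN " #name)
#define OP_END(name)   __asm__ volatile("# PARROT_OP_END " #name)

int marked_add(int a, int b) {
    OP_BEGIN(add);
    int r = a + b;      /* the op body sits between the two markers */
    OP_END(add);
    return r;
}
```

One caveat the thread touches on: volatile asm statements also act as optimization barriers, so the markers themselves can inhibit exactly the cross-op optimizations being measured.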
Re: Need for fingerprinting? [was: Re: What to do if Digest::MD5 is unavailable?]
Nicholas Clark wrote: I believe that your understanding of the JIT and the CG cores is still correct. The problem would be solved if we had some nice way of getting the C compiler to generate us nice stub versions of all the non-inline op functions, which we could then place inline. However, I suspect that part of the speed of the CG core comes from the compiler (this is always gcc?) being able to do away with the function call and function return overheads between the ops it has inlined in the CG core.

You may want to check out the following two papers and their references:

I. Piumarta, F. Riccardi. Optimizing direct threaded code by selective inlining. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 17-19, Montreal, Canada, 1998. ftp://ftp.inria.fr/INRIA/Projects/SOR/papers/1998/ODCSI_pldi98.ps.gz

M. Anton Ertl. A Portable Forth Engine. Proceedings euroFORTH '93, pages 253-257. http://www.complang.tuwien.ac.at/forth/threaded-code.html

Everything you ever wanted to know about optimising threaded interpreters, but were too afraid to ask. The selective inlining method in particular talks about how to extract inline code blocks dynamically and then paste them together. Cheers, Rhys.
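For reference, the core trick of the Ertl paper: in direct threaded code the instruction stream stores the label addresses themselves, so "NEXT" (dispatch) is a single indirect jump with no opcode-to-address table lookup at all. A sketch, again using gcc's labels-as-values extension with an invented three-op program:

```c
/* Direct threaded dispatch sketch: `prog` holds code addresses,
 * not opcodes, so NEXT is just goto *(*ip++).  Ops invented for
 * illustration: inc adds 1, dbl doubles. */
long threaded_run(long x) {
    void *prog[] = { &&inc, &&inc, &&dbl, &&done };
    void **ip = prog;

    goto *(*ip++);           /* NEXT */
inc:  x += 1; goto *(*ip++);
dbl:  x *= 2; goto *(*ip++);
done: return x;
}
```

Selective inlining then goes one step further: instead of jumping between op bodies, it memcpy's the compiled bodies into a contiguous buffer, which is essentially what a dynamically constructed CG core would do.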
Re: Need for fingerprinting? [was: Re: What to do if Digest::MD5 is unavailable?]
[EMAIL PROTECTED] wrote: Leo -- I don't know much about the CG core, but prederef and JIT should be able to work with dynamic optables. For prederef and JIT, optable mucking does expire your prederefed and JITted blocks (in general), but for conventional use (preamble setup), you don't pay a price during mainline execution once you've set up your optable.

Yep.

[ JIT-like cg_core ] ... If so, then the hypothetical dynamic core construction approach just described would approach the speed of the CG core, but would fall somewhat short on workloads that involve lots of non-inline ops (FWIW, there are more inline ops than not in the current *.ops files).

Exactly here is the problem. Almost all non-integer/float stuff is unimplemented in JIT. You don't pay the price per non-inline op, but per op not in JIT. In CG the op functions are not functions but code pieces, which get jumped to. JITed code (as far as implemented) is a linear sequence of the function bodies (or rather their asm equivalents).

Then, you get CG(-esque) speed along with the dynamic capabilities. It's cheating, to be sure, but I like that kind of cheating. :)

If we are able to build such a system, yes. But see the "Of mops and microops" thread for yet another approach. By splitting current opcodes into more fine-grained pieces, we would need fewer different ops altogether, and it could be really fast. Regards, -- Gregor leo
Re: Need for fingerprinting? [was: Re: What to do if Digest::MD5 is unavailable?]
Jason -- Originally, I considered adding an optable segment to the packfile format and no useop op. After considering useop, I found the idea of a conventional (but technically optional, TMTOWTDI) preamble and the ability to modify the optable while running intriguing. There is value in using an optable segment to pick the ops for the code segment, even if none of the more dynamic stuff is done. Retaining the ability to modify the optable at runtime is still interesting to me. It essentially makes the optable a shortcut for the useop preamble. However, retaining this ability means that any generic disassembler will have to be useop-aware, which is what I think you are trying to avoid. Regards, -- Gregor

Jason Gloudon [EMAIL PROTECTED] 11/04/2002 11:41 AM
To: [EMAIL PROTECTED]
cc: Leopold Toetsch [EMAIL PROTECTED], Brent Dax [EMAIL PROTECTED], 'Andy Dougherty' [EMAIL PROTECTED], Josh Wilmes [EMAIL PROTECTED], 'Perl6 Internals' [EMAIL PROTECTED]
Subject: Re: Need for fingerprinting? [was: Re: What to do if Digest::MD5 is unavailable?]

On Sun, Nov 03, 2002 at 04:59:22PM -0500, [EMAIL PROTECTED] wrote: What I advocate is having possibly only one (maybe too extreme, but doable) built-in op pre-loaded at opcode zero. This op's name is useop, and its arguments give an opcode (optable index), and sufficient information for the interpreter to chase down the opinfo (and opfunc). In the best scenario, this

One question this raises is: where does this initialization occur? I think the information that would be encoded in these instructions should normally go into a metadata section of the bytecode stored on disk. Having to pseudo-execute the bytecode in order to disassemble seems unnecessary. I think keeping this information separate from the executable section will make the code generators simpler as well. -- Jason
Re: Need for fingerprinting? [was: Re: What to do if Digest::MD5 is unavailable?]
[EMAIL PROTECTED] wrote: Leo -- ... Optable build time is not a function of program size, but rather of optable size

Ok, I see that, but ...

I don't think it remains a problem how to run ops from different oplibs _fast_.

The problem is that as soon as there are dynamic oplibs, they can't be run in the CGoto core, which is normally the fastest core when execution time depends on opcode dispatch time. JIT is (much) faster for almost-integer-only code, e.g. mops.pasm, but for more complex programs involving PMCs, JIT is currently slower.

... Op lookup is already fast ...

I rewrote find_op to build a lookup hash at runtime, when it's needed. This is 2-3 times faster than the find_op with the static lookup table in the core_ops.c file.

... After the preamble, while the program is running, the cost of having a dynamic optable is absolutely *nil*, whether the ops in question were statically or dynamically loaded (if you don't see that, then either I'm very wrong, or I haven't given you the right mental picture of what I'm talking about).

The cost is only almost *nil* if program execution time doesn't depend on opcode dispatch time. E.g. mops.pasm spends ~50% of its execution time in cg_core (i.e. the computed goto core). Running the normal fast_core slows this down by ~30%. This might or might not be true for RL applications, but I hope that the optimizer will bring average programs near the above ratios.

Nevertheless I see the need for dynamic oplibs. If e.g. a program pulls in obscure.ops, it may as well pay the penalty for using these. Regards, -- Gregor leo
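leo's lazily built find_op could look roughly like this; a toy open-addressing hash and a four-entry op list stand in for Parrot's real hash and oplib, so everything here is illustrative rather than the actual patch:

```c
#include <string.h>

/* Lazily built op-name -> opcode lookup, sketching leo's runtime
 * hash idea.  Toy fixed-size linear-probing table; op list invented. */
#define TBL 64
static const char *op_names[] = { "end", "set", "add", "branch" };
static int slots[TBL];          /* stores opcode + 1; 0 means empty */
static int table_built = 0;

static unsigned hash_name(const char *s) {
    unsigned h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h % TBL;
}

static void build_table(void) {          /* done once, on first lookup */
    for (int i = 0; i < 4; i++) {
        unsigned h = hash_name(op_names[i]);
        while (slots[h]) h = (h + 1) % TBL;   /* linear probing */
        slots[h] = i + 1;
    }
    table_built = 1;
}

int find_op(const char *name) {
    if (!table_built) build_table();
    unsigned h = hash_name(name);
    while (slots[h]) {
        if (strcmp(op_names[slots[h] - 1], name) == 0)
            return slots[h] - 1;
        h = (h + 1) % TBL;
    }
    return -1;                   /* unknown op */
}
```

The point of building at runtime rather than compiling a static table into core_ops.c is that the table only exists (and costs memory) in interpreters that actually do by-name lookups.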
Re: Need for fingerprinting? [was: Re: What to do if Digest::MD5 is unavailable?]
Leo -- Ah. It seems the point of divergence is slow_core vs. cg_core, et al. As you have figured out, I've been referring to the performance of the non-cg, non-prederef, non-JIT (read: slow ;) core. I don't know much about the CG core, but prederef and JIT should be able to work with dynamic optables. For prederef and JIT, optable mucking does expire your prederefed and JITted blocks (in general), but for conventional use (preamble setup), you don't pay a price during mainline execution once you've set up your optable. You only pay an additional cost if your program is dynamic enough to muck with its optable in the middle somewhere, so that you have to re-prederef or re-JIT stuff (and a use tax like that seems appropriate to me). Of all the cores, the CG core is the most crystallized (rigid), so it stands to reason that it would not be a good match for dynamic optables. While I don't think I'm sophisticated enough to pull it off on my own, I do think it should be possible to use what was learned building the JIT system to build the equivalent of a CG core on the fly, given its structure. I think the information and basic capabilities are already there: The JIT system already knows how to compile a sequence of ops to machine code -- using this plus enough know-how to plop in the right JMP instructions pretty much gets you there. A possible limitation to the coolness, here: I think the JIT system bails out for the non-inline ops and just calls the opfunc (please forgive if my understanding of what JIT does and doesn't do is out of date). I think the CG core doesn't have to take the hit of that extra indirection for non-inline ops. If so, then the hypothetical dynamic core construction approach just described would approach the speed of the CG core, but would fall somewhat short on workloads that involve lots of non-inline ops (FWIW, there are more inline ops than not in the current *.ops files). Then, you get CG(-esque) speed along with the dynamic capabilities.
It's cheating, to be sure, but I like that kind of cheating. :) Further, DCC would work with dynamically loaded oplibs (presumably using purely the JIT-func-call technique, although I suppose it's possible to do even better), where the CG core would not. It would be interesting to see where DCC would fit on the performance spectrum compared to JIT, for mops.pasm and for other examples with broader op usage... Regards, -- Gregor

Leopold Toetsch [EMAIL PROTECTED] 11/04/2002 08:45 AM
To: [EMAIL PROTECTED]
cc: Brent Dax [EMAIL PROTECTED], 'Andy Dougherty' [EMAIL PROTECTED], Josh Wilmes [EMAIL PROTECTED], 'Perl6 Internals' [EMAIL PROTECTED]
Subject: Re: Need for fingerprinting? [was: Re: What to do if Digest::MD5 is unavailable?]

[EMAIL PROTECTED] wrote: Leo -- ... Optable build time is not a function of program size, but rather of optable size

Ok, I see that, but ...

I don't think it remains a problem how to run ops from different oplibs _fast_.

The problem is that as soon as there are dynamic oplibs, they can't be run in the CGoto core, which is normally the fastest core when execution time depends on opcode dispatch time. JIT is (much) faster for almost-integer-only code, e.g. mops.pasm, but for more complex programs involving PMCs, JIT is currently slower.

... Op lookup is already fast ...

I rewrote find_op to build a lookup hash at runtime, when it's needed. This is 2-3 times faster than the find_op with the static lookup table in the core_ops.c file.

... After the preamble, while the program is running, the cost of having a dynamic optable is absolutely *nil*, whether the ops in question were statically or dynamically loaded (if you don't see that, then either I'm very wrong, or I haven't given you the right mental picture of what I'm talking about).

The cost is only almost *nil* if program execution time doesn't depend on opcode dispatch time. E.g. mops.pasm spends ~50% of its execution time in cg_core (i.e. the computed goto core).
Running the normal fast_core slows this down by ~30%. This might or might not be true for RL applications, but I hope that the optimizer will bring average programs near the above ratios. Nevertheless I see the need for dynamic oplibs. If e.g. a program pulls in obscure.ops, it may as well pay the penalty for using these. Regards, -- Gregor leo
RE: Need for fingerprinting? [was: Re: What to do if Digest::MD5 is unavailable?]
Leopold Toetsch: # the problem is that as soon as there are dynamic # oplibs, they can't # be run in the CGoto core, which is normally the fastest core when # execution time depends on opcode dispatch time. JIT is (much) # faster for almost-integer-only code, e.g. mops.pasm, but for more # complex programs involving PMCs, JIT is currently slower.

Wasn't the plan for dealing with that to use the JIT to construct a new cgoto core?

--Brent Dax [EMAIL PROTECTED] @roles=map {Parrot $_} qw(embedding regexen Configure)

Wire telegraph is a kind of a very, very long cat. You pull his tail in New York and his head is meowing in Los Angeles. And radio operates exactly the same way. The only difference is that there is no cat. --Albert Einstein (explaining radio)
Re: Need for fingerprinting? [was: Re: What to do if Digest::MD5 is unavailable?]
On Sun, Nov 03, 2002 at 04:59:22PM -0500, [EMAIL PROTECTED] wrote: What I advocate is having possibly only one (maybe too extreme, but doable) built-in op pre-loaded at opcode zero. This op's name is useop, and its arguments give an opcode (optable index), and sufficient information for the interpreter to chase down the opinfo (and opfunc). In the best scenario, this

One question this raises is: where does this initialization occur? I think the information that would be encoded in these instructions should normally go into a metadata section of the bytecode stored on disk. Having to pseudo-execute the bytecode in order to disassemble seems unnecessary. I think keeping this information separate from the executable section will make the code generators simpler as well. -- Jason
Re: Need for fingerprinting? [was: Re: What to do if Digest::MD5 is unavailable?]
On Mon, Nov 04, 2002 at 10:09:06AM -0500, [EMAIL PROTECTED] wrote: While I don't think I'm sophisticated enough to pull it off on my own, I do think it should be possible to use what was learned to build the JIT system to build the equivalent of a CG core on the fly, given its structure. I think the information and basic capabilities are already there: The JIT system already knows how to compile a sequence of ops to machine code -- using this plus enough know-how to plop in the right JMP instructions pretty much gets you there. A possible limitation to the

I'm not convinced. Compiling the computed goto core with any sort of optimisation turned on *really* hurts the machine. I think it's over a minute even on a 733 MHz PIII, and it happily pages everything else out while it's doing it. :-( I doubt that the CG core's stats look anywhere near as impressive for the unoptimised case. [And I'm not at a machine where I can easily generate some] This makes me think that it would be hard to just in time

coolness, here: I think the JIT system bails out for the non-inline ops and just calls the opfunc (please forgive if my understanding of what JIT does and doesn't do is out of date). I think the CG core doesn't have to take the hit of that extra indirection for non-inline ops. If so, then the hypothetical dynamic core construction approach just described would approach the speed of the CG core, but would fall somewhat short on workloads that involve lots of non-inline ops (FWIW, there are more inline ops than not in the current *.ops files).

I believe that your understanding of the JIT and the CG cores is still correct. The problem would be solved if we had some nice way of getting the C compiler to generate us nice stub versions of all the non-inline op functions, which we could then place inline. However, I suspect that part of the speed of the CG core comes from the compiler (this is always gcc?)
being able to do away with the function call and function return overheads between the ops it has inlined in the GC core. I've no idea if gcc is allowed to re-order the op blocks in the CG core. If not, then we might be able to pick apart the blocks it compiles (for units for the JIT to use) by putting in custom asm statements between each, which our assembler (or machine code) parser spots and uses as delimiters (hmm. particularly if we have header and trailer asm statements that are actually just assembly language comments with marker text that gcc passes through undigested. This would let us annotate the assembler output of gcc) Nicholas Clark -- Brainfuck better than perl? http://www.perl.org/advocacy/spoofathon/
Re: Need for fingerprinting? [was: Re: What to do if Digest::MD5 is unavailable?]
[EMAIL PROTECTED] wrote: All -- FWIW, this stuff came up early on in Parrot's infancy. Pointers, hints, information ... On a related note, I'm working on a toy VM outside of Parrot to demonstrate the technique I've proposed here in the past, Pointers, hints, information ... thanks, leo ;-)
Re: Need for fingerprinting? [was: Re: What to do if Digest::MD5 is unavailable?]
Leo -- Here's one of the early fingerprinting patches, 2001-09-14: http://archive.develooper.com/perl6-internals;perl.org/msg04063.html Here's where Simon removed Digest::MD5, 2001-09-18: http://archive.develooper.com/cvs-parrot;perl.org/msg00151.html Here's one of the messages about how I'd like to see us link op implementations with their op codes: http://archive.develooper.com/perl6-internals;perl.org/msg06193.html You can use that to vector into the thread. My ideas have changed a bit since then (see below), but you can get some of the idea there. Here's another message that touched on this kind of stuff: http://archive.develooper.com/perl6-internals;perl.org/msg06270.html

What I advocate is having possibly only one (maybe too extreme, but doable) built-in op pre-loaded at opcode zero. This op's name is useop, and its arguments give an opcode (optable index), and sufficient information for the interpreter to chase down the opinfo (and opfunc). In the best scenario, this could mean even doing some dynamic loading of oplibs. BTW, that was the point of my initial bloated-but-lightning-fast oplookup switch() tree implementation, which has now been replaced with something I expect is more sane (I went to that extreme because I was getting push-back that the by-name lookups would be slow, and even though I never advocated looking them up in DO_OP, I still wanted to demonstrate that they could be *very* fast).

Now, whether or not you statically link other oplibs, I suggest not having every op allocated a slot in the optable. Rather, the initial few ops in the startup code of a chunk of Parrot code are responsible for using useop to arrange the appropriate optable for that code. For example, assembling mops.pasm would result in the first chunk of code making 13 calls to useop to attach the ops used by that code. No longer do we care what order ops are in their oplibs, because opcodes are not a meaningful concept at a static level (in the Parrot source).
I've noticed that the current setup concatenates *.ops into one big core_ops.c, which is very different from what I was trying to move us towards long ago. I'm an advocate of smaller and independent *.ops files, separately compiled, and (possibly) only some actually statically linked in. An additional by-name lookup will be needed to map oplib names (and possibly version info, if we determine that is necessary) to the oplibinfo structures we got from the statically or dynamically linked oplibs. The oplibinfo structures give you the function pointer to call to look up ops by name in that oplib.

One final interesting (at least to me) note: A chunk of code could overwrite optable entry zero with some noop-equivalent op to prevent any further changes to its optable once it has things the way it wants them. Regards, -- Gregor

Leopold Toetsch [EMAIL PROTECTED] 11/03/2002 02:49 PM
To: [EMAIL PROTECTED]
cc: Josh Wilmes [EMAIL PROTECTED], Brent Dax [EMAIL PROTECTED], 'Andy Dougherty' [EMAIL PROTECTED], 'Perl6 Internals' [EMAIL PROTECTED]
Subject: Re: Need for fingerprinting? [was: Re: What to do if Digest::MD5 is unavailable?]

[EMAIL PROTECTED] wrote: All -- FWIW, this stuff came up early on in Parrot's infancy. Pointers, hints, information ... On a related note, I'm working on a toy VM outside of Parrot to demonstrate the technique I've proposed here in the past, Pointers, hints, information ... thanks, leo ;-)
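Gregor's useop scheme can be sketched in a few lines of C. Everything here is hypothetical and heavily simplified (a two-op "oplib", a strcmp-based name lookup standing in for the real by-name lookup, no versioning or dynamic loading): slot 0 of the optable is conceptually the useop built-in, and the preamble calls it to bind optable indices to ops found by name.

```c
#include <string.h>

/* Sketch of the useop idea: the preamble binds optable slots to ops
 * looked up by name, so opcode numbers carry no static meaning.
 * All names and signatures here are invented for illustration. */
typedef void (*opfunc_t)(long *regs);

static void op_set(long *r) { r[0] = r[1]; }     /* stand-in op bodies */
static void op_add(long *r) { r[0] += r[1]; }

/* the by-name lookup an oplibinfo structure would provide */
static opfunc_t find_op_by_name(const char *name) {
    if (strcmp(name, "set") == 0) return op_set;
    if (strcmp(name, "add") == 0) return op_add;
    return NULL;                                 /* unknown op */
}

static opfunc_t optable[256];

/* the one pre-loaded built-in: bind optable index `idx` to an op */
static void useop(int idx, const char *name) {
    optable[idx] = find_op_by_name(name);
}
```

A preamble would make one useop call per op the code segment actually uses; afterwards, DO_OP just indexes optable as it does today, which is why the steady-state dispatch cost of the dynamic table is nil. Overwriting useop's own slot with a no-op, as suggested above, would then freeze the table.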