Re: [Python-Dev] [ANN] VPython 0.1
Hey, I hope you don't mind my replying in digest form.

First off, I guess I should be a little clearer as to what VPython is and what it does. VPython is essentially a set of patches for CPython (it touches only three files; diff -b is about 800 lines IIRC, plus the switch statement in ceval.c's EvalFrameEx()). The main change is moving the VM instruction implementations (in CPython, the blocks of code following each case label) into a separate file, adding Vmgen stack comments, and removing the explicit stack manipulation code (plus some minor modifications like renaming variables to work with Vmgen's type prefixes and labels to enable the generation of superinstructions). Vmgen parses the stack comments and prints some macros around the provided instruction code (incidentally, this reduced the 1500-line switch body to about 1000 lines). Interested parties should consult ceval.vmg and ceval-vm.i; a rough sketch of such a definition appears below.

The nice thing about this is that:

a) It's fairly easy to implement different types of dispatch, simply by changing a few macros (and while I haven't done this, it shouldn't be a problem to add some switch dispatch #ifdefs for non-GCC platforms). In particular, direct threaded code leads to less horrible branch prediction than switch dispatch on many machines (exactly how pronounced this effect is depends heavily on the specific architecture).

b) Vmgen can generate superinstructions. A quick primer: a sequence of code such as LOAD_CONST LOAD_FAST BINARY_ADD will, in CPython, push some constant onto the stack, push some local onto the stack, then pop both off the stack, add them and push the result back onto the stack. Turning this into a superinstruction means inlining LOAD_CONST and LOAD_FAST, modifying them to store the values they'd otherwise push onto the stack in local variables, and adding a version of BINARY_ADD which reads its arguments from those local variables rather than the stack (this reduces dispatch time in addition to pops and pushes).

David Gregg (and friends) recently published a paper comparing stack-based and register-based VMs for Java and found that register-based VMs were substantially faster. The main reason for this appears to be the absence of the various LOAD_ instructions in a register VM. They looked at mitigating this with superinstructions, but Java would have required (again, IIRC) about 1,000 of them, which leads to substantial code growth. Since Python doesn't have multiple (typed) versions of every instruction (Java has iadd and so on), far fewer superinstructions are necessary. On my system, superinstructions account for about 10% of the 30% performance gain.

As for limitations, as the README points out, currently 2 cases in test_doctest fail, as well as 1 case each in test_hotshot, test_inspect, and test_subprocess, and most of the cases in test_trace. The reason for this is, I suspect, that I removed the line tracing code from ceval.c (I didn't want to look at it in detail, and it doesn't seem to affect anything else). I expect this would be a bit of work to fix, but I don't see it as a huge problem (in fact, if you don't use sys.settrace() it shouldn't affect you).

Stack caching: a previous version of VPython supported this, but the performance gain was minimal (maybe 1-2%, though if done really well, e.g. by dedicating a register to the top-of-stack cache, who knows, more may be possible). Also, it led to some problems with the garbage collector seeing an out-of-date stack_pointer[-1].

``Cell'' is, unfortunately, hardcoded into Vmgen.
I could either patch that or run ceval-vm.i through sed or something.

Finally, to the people who pointed out that the name VPython is already taken: darn! I really should have checked that; I will call it something else in the future.

Anyway, HTH,
-jakob
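For concreteness, here is a rough, hypothetical sketch of the kind of stack-commented definition described above. The names are illustrative rather than copied from ceval.vmg, and error handling and Vmgen's type prefixes are left out:

    binary_add ( left right -- result )
    result = PyNumber_Add(left, right);
    Py_DECREF(left);
    Py_DECREF(right);

From the stack comment ( left right -- result ), Vmgen generates the code that fetches the two operands, stores the result back, and dispatches to the next instruction, so the body contains no explicit stack manipulation. Superinstructions are then declared separately as named sequences of existing instructions, and Vmgen emits the fused implementations.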
Re: [Python-Dev] [ANN] VPython 0.1
On Thu, Oct 23, 2008 at 1:08 AM, J. Sievers <[EMAIL PROTECTED]> wrote:
> In particular, direct threaded code leads to less horrible branch
> prediction than switch dispatch on many machines (exactly how
> pronounced this effect is depends heavily on the specific
> architecture).

To clarify: This is *NOT* actually a form of threading, is it? It "merely" breaks the giant dispatch table into a series of small ones, while also grouping instructions into larger superinstructions? OS threads are not touched at any point?

--
Adam Olsen, aka Rhamphoryncus
Re: [Python-Dev] [ANN] VPython 0.1
On 2008-10-23 09:08, J. Sievers wrote:
> a) It's fairly easy to implement different types of dispatch, simply by
> changing a few macros (and while I haven't done this, it shouldn't be a
> problem to add some switch dispatch #ifdefs for non-GCC platforms).
>
> In particular, direct threaded code leads to less horrible branch
> prediction than switch dispatch on many machines (exactly how
> pronounced this effect is depends heavily on the specific
> architecture).

Since VPython is GCC-only, how about naming the patch PyGCCVM ?!

I doubt that you'll find the same kind of performance increase when using switch-based dispatching, but with more profile-based optimizations it should be possible to come up with a solution that provides a performance increase of a few tens of percent while still remaining portable and readable.

When working on the switch statement in ceval some 10 years ago, it was easy to get a 10-20% performance increase just by moving the switch cases around, breaking the switch into two groups (one for opcodes that are used a lot and one for those used less often) and then introducing a few fast paths via goto.

However, at that time CPUs had much smaller internal caches, and the 1.5.2 ceval VM had obviously hit some cache size limit on my machine, since breaking the switch in two provided the best performance increase. With today's CPUs, this shouldn't be a problem anymore.

BTW: I hope you did not use pybench to get profiles of the opcodes. That would most certainly result in good results for pybench, but less good ones for general applications such as Django or Zope/Plone.

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Oct 23 2008)
>>> Python/Zope Consulting and Support ...        http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/

Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free !

eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611
Re: [Python-Dev] [ANN] VPython 0.1
Adam Olsen wrote:
> To clarify: This is *NOT* actually a form of threading, is it?

I think the term "threaded code" is being used here in the sense of Forth, i.e. instead of a sequence of small integers that are dispatched using a switch statement, you use the actual machine addresses of the switched-to code as the virtual instruction opcodes, so the dispatch is just an indirect jump.

I'm having trouble seeing how that would help with branch prediction, though -- seems to me it should make it worse if anything, since the CPU has no idea where an indirect jump is going to go.

--
Greg
Re: [Python-Dev] [ANN] VPython 0.1
On Thu, Oct 23, 2008 at 01:31:48AM -0600, Adam Olsen wrote:
> To clarify: This is *NOT* actually a form of threading, is it? It
> "merely" breaks the giant dispatch table into a series of small ones,
> while also grouping instructions into larger superinstructions? OS
> threads are not touched at any point?

I haven't looked at Vmgen or VPython at all, but that's probably correct. From http://foldoc.org/foldoc.cgi?query=threaded+code :

    threaded code: A technique for implementing virtual machine
    interpreters, introduced by J.R. Bell in 1973, where each op-code in
    the virtual machine instruction set is the address of some (lower
    level) code to perform the required operation. This kind of virtual
    machine can be implemented efficiently in machine code on most
    processors by simply performing an indirect jump to the address which
    is the next instruction. Many Forth implementations use threaded code
    and nowadays some use the term "threading" for almost any technique
    used to implement Forth's virtual machine.

--amk
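To make that definition concrete, here is a minimal direct-threaded toy interpreter using GCC's labels-as-values extension (which is exactly what makes this approach GCC-only). It is purely illustrative and says nothing about how VPython actually lays out its instruction stream:

    #include <stdio.h>

    int main(void)
    {
        /* The "bytecode" is an array of code addresses rather than small
           integers, so dispatch is just an indirect jump through it. */
        static void *program[] = { &&op_inc, &&op_inc, &&op_print, &&op_halt };
        void **ip = program;        /* virtual instruction pointer */
        int acc = 0;

        goto **ip++;                /* start executing */

    op_inc:
        acc += 1;
        goto **ip++;                /* dispatch straight to the next opcode */
    op_print:
        printf("acc = %d\n", acc);
        goto **ip++;
    op_halt:
        return 0;
    }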
Re: [Python-Dev] [ANN] VPython 0.1
A.M. Kuchling amk.ca> writes:
> threaded code: A technique for implementing virtual machine
> interpreters, introduced by J.R. Bell in 1973, where each op-code in
> the virtual machine instruction set is the address of some (lower
> level) code to perform the required operation. This kind of virtual
> machine can be implemented efficiently in machine code on most
> processors by simply performing an indirect jump to the address which
> is the next instruction.

Is this kind of optimization that useful on modern CPUs? It helps remove a memory access to the switch/case lookup table, which should shave off the 3 CPU cycles of latency of a modern L1 data cache, but it won't remove the branch misprediction penalty of the indirect jump itself, which is more in the order of 10-20 CPU cycles depending on pipeline depth. In 1973, CPUs were not pipelined and did not suffer any penalty for indirect jumps, while lookups could be slow, especially if they couldn't run in parallel with other processing in the pipeline.

Thanks

Antoine.
Re: [Python-Dev] [ANN] VPython 0.1
On 2008.10.23 12:02:12 +0200, M.-A. Lemburg wrote:
> BTW: I hope you did not use pybench to get profiles of the opcodes.
> That would most certainly result in good results for pybench, but
> less good ones for general applications such as Django or Zope/Plone.

I was wondering about Pybench-specific optimization too, so last night I ran a few dozen of my projecteuler.net solver programs with VPython. Excluding the ones that are so fast that startup overhead dominates runtime, the least speedup I saw versus vanilla 2.5.2 was ~10%, the best was ~50%, and the average was ~30%. Pretty consistent with Pybench.

--
David Ripton    [EMAIL PROTECTED]
Re: [Python-Dev] [ANN] VPython 0.1
On 2008-10-23 15:19, David Ripton wrote:
> On 2008.10.23 12:02:12 +0200, M.-A. Lemburg wrote:
>> BTW: I hope you did not use pybench to get profiles of the opcodes.
>> That would most certainly result in good results for pybench, but
>> less good ones for general applications such as Django or Zope/Plone.
>
> I was wondering about Pybench-specific optimization too, so last night I
> ran a few dozen of my projecteuler.net solver programs with VPython.
> Excluding the ones that are so fast that startup overhead dominates
> runtime, the least speedup I saw versus vanilla 2.5.2 was ~10%, the best
> was ~50%, and average was ~30%. Pretty consistent with Pybench.

Thanks. That's good to know.

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Oct 23 2008)
>>> Python/Zope Consulting and Support ...        http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/

Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free !

eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611
Re: [Python-Dev] [ANN] superinstructions (VPython 0.1)
Hi,

J. Sievers gmail.com> writes:
> A sequence of code such as LOAD_CONST LOAD_FAST BINARY_ADD will, in
> CPython, push some constant onto the stack, push some local onto the
> stack, then pop both off the stack, add them and push the result back
> onto the stack.
> Turning this into a superinstruction means inlining LOAD_CONST and
> LOAD_FAST, modifying them to store the values they'd otherwise push
> onto the stack in local variables and adding a version of BINARY_ADD
> which reads its arguments from those local variables rather than the
> stack (this reduces dispatch time in addition to pops and pushes).

The problem is that this only optimizes code like "x + 1" but not "1 + x" or "x + y". To make this generic, a first step would be to try to fuse LOAD_CONST and LOAD_FAST into a single opcode (and check that it doesn't slow down the VM). This could be possible by copying the constants table into the start of the frame's variables array when the frame is created, so that the LOAD_FAST code still does a single indexed array dereference. Since constants are constants, they don't need to be copied again when the frame is re-used by a subsequent call of the same function (but this would slow down recursive functions a bit, since those have to create new frames each time they are called).

Then fusing e.g. LOAD_FAST LOAD_FAST BINARY_ADD into ADD_FAST_FAST would cover many more cases than the optimization you are writing about, without any explosion in the number of opcodes.

Regards

Antoine.
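A rough sketch of what such an ADD_FAST_FAST case might look like in a switch-based ceval loop, reusing ceval.c's GETLOCAL and PUSH macros. The opcode, the operand encoding (both local slots packed into one oparg) and the error labels are hypothetical, and refcounting is kept exactly as in the unfused instructions:

    case ADD_FAST_FAST: {
        /* Hypothetical fused opcode: assume the low byte of oparg names the
           left operand's local slot and the high byte the right operand's. */
        PyObject *left  = GETLOCAL(oparg & 0xff);
        PyObject *right = GETLOCAL(oparg >> 8);
        PyObject *result;
        if (left == NULL || right == NULL)
            goto on_error;          /* illustrative: unbound local */
        Py_INCREF(left);
        Py_INCREF(right);
        result = PyNumber_Add(left, right);
        Py_DECREF(left);
        Py_DECREF(right);
        if (result == NULL)
            goto on_error;          /* illustrative error path */
        PUSH(result);               /* only the result touches the value stack */
        continue;
    }

Compared with the three separate instructions, this saves two dispatches and two push/pop pairs; the operands never hit the value stack at all.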
Re: [Python-Dev] [ANN] VPython 0.1
Jakob> David Gregg (and friends) recently published a paper comparing
Jakob> stack based and register based VMs for Java and found that
Jakob> register based VMs were substantially faster. The main reason for
Jakob> this appears to be the absence of the various LOAD_ instructions
Jakob> in a register VM.

Back in the day (2001?) with the help of Tim Peters and others here I wrote most of what was necessary to convert from a stack-based to a register-based virtual machine for CPython. I had trouble with a couple of VM instructions. If memory serves correctly, exec was a problem because it could push a variable number of results on the stack. I never completed the project because the instruction set kept growing and I couldn't keep up. Still, it looked promising because it avoided lots of intermediate loads and stores.

Jakob> ``Cell'' is, unfortunately, hardcoded into Vmgen. I could either
Jakob> patch that or run ceval-vm.i through sed or something.

If the patch is to work and eventually be accepted it will have to be changed, probably with a little downstream find/replace. You might want to also file a bug report with the Vmgen folks. If Vmgen is generating these names, the authors should really know to try to avoid name clashes.

Skip
Re: [Python-Dev] [ANN] superinstructions (VPython 0.1)
Antoine Pitrou wrote:
> Hi,
>
> J. Sievers gmail.com> writes:
>> A sequence of code such as LOAD_CONST LOAD_FAST BINARY_ADD will, in
>> CPython, push some constant onto the stack, push some local onto the
>> stack, then pop both off the stack, add them and push the result back
>> onto the stack.
>> Turning this into a superinstruction means inlining LOAD_CONST and
>> LOAD_FAST, modifying them to store the values they'd otherwise push
>> onto the stack in local variables and adding a version of BINARY_ADD
>> which reads its arguments from those local variables rather than the
>> stack (this reduces dispatch time in addition to pops and pushes).
>
> The problem is that this only optimizes code like "x + 1" but not "1 + x" or
> "x + y". To make this generic a first step would be to try to fuse LOAD_CONST
> and LOAD_FAST into a single opcode (and check it doesn't slow down the VM).
> This could be possible by copying the constants table into the start of the
> frame's variables array when the frame is created, so that the LOAD_FAST code
> still does a single indexed array dereference. Since constants are constants,
> they don't need to be copied again when the frame is re-used by a subsequent
> call of the same function (but this would slow down recursive functions a
> bit, since those have to create new frames each time they are called).

Though it would seem redundant to create multiple copies of constant structures. Wouldn't there be some way to optimize this to allow each call to access the data from the same place?

> Then fusing e.g. LOAD_FAST LOAD_FAST BINARY_ADD into ADD_FAST_FAST would cover
> many more cases than the optimization you are writing about, without any
> explosion in the number of opcodes.

regards
Steve

--
Steve Holden        +1 571 484 6266   +1 800 494 3119
Holden Web LLC      http://www.holdenweb.com/
Re: [Python-Dev] [ANN] VPython 0.1
On Thu, Oct 23, 2008 at 8:13 AM, Antoine Pitrou <[EMAIL PROTECTED]> wrote:
> Is this kind of optimization that useful on modern CPUs? It helps remove a
> memory access to the switch/case lookup table, which should shave off the
> 3 CPU cycles of latency of a modern L1 data cache, but it won't remove the
> branch misprediction penalty of the indirect jump itself, which is more in
> the order of 10-20 CPU cycles depending on pipeline depth.
I searched around for information on how threaded code interacts with branch
prediction, and here's what I found. The short answer is that threaded code
significantly improves branch prediction.
Any bytecode interpreter has some kind of dispatch mechanism that jumps to
the next opcode handler. With a while(1) {switch() {} } format, there's one
dispatch location in the machine code. Processor branch prediction has a
horrible time trying to predict where that dispatch location is going to
jump to. Here's some rough pseudo-assembly:
main_loop:
compute next_handler
jmp next_handler; abysmal branch prediction
handler1:
; do stuff
jmp main_loop
handler2:
; do stuff
jmp main_loop
With threaded code, every handler ends with its own dispatcher, so the
processor can make fine-grained predictions. Since each opcode has its own
indirect branch instruction, the processor can track them separately and
make better predictions (e.g., it can figure out that opcode X is often
followed by opcode Y).
compute next_handler
jmp next_handler ; executed only once
handler1:
; do stuff
compute next_handler
jmp next_handler
handler2:
; do stuff
compute next_handler
jmp next_handler
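In C, that second shape is usually written with GCC's labels-as-values ("computed goto") extension. Here is a toy, self-contained sketch (not VPython's or CPython's actual code); unlike a fully direct-threaded design, the bytecodes stay small integers and the handler addresses live in a separate table, but every handler still ends with its own indirect jump:

    #include <stdio.h>

    enum { OP_INC, OP_PRINT, OP_HALT };

    int main(void)
    {
        static const unsigned char code[] = { OP_INC, OP_INC, OP_PRINT, OP_HALT };
        static void *handlers[] = { &&op_inc, &&op_print, &&op_halt };
        const unsigned char *ip = code;
        int acc = 0;

    /* The dispatch is replicated at the end of every handler, giving the
       CPU's indirect-branch predictor one prediction site per opcode. */
    #define DISPATCH() goto *handlers[*ip++]

        DISPATCH();                 /* executed only once */
    op_inc:
        acc += 1;
        DISPATCH();
    op_print:
        printf("acc = %d\n", acc);
        DISPATCH();
    op_halt:
        return 0;
    }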
--
Daniel Stutzbach, Ph.D.
President, Stutzbach Enterprises, LLC
Re: [Python-Dev] [ANN] VPython 0.1
Daniel Stutzbach wrote:
> With threaded code, every handler ends with its own dispatcher, so the
> processor can make fine-grained predictions.

I'm still wondering whether all this stuff makes a noticeable difference in real-life Python code, which spends most of its time doing expensive things like attribute lookups and function calls, rather than fiddling with integers in local variables.

--
Greg
Re: [Python-Dev] [ANN] VPython 0.1
On Wed, Oct 22, 2008 at 5:16 AM, J. Sievers <[EMAIL PROTECTED]> wrote:
> I implemented a variant of the CPython VM on top of Gforth's Vmgen; this
> made it fairly straightforward to add direct threaded code and
> superinstructions for the various permutations of LOAD_CONST, LOAD_FAST,
> and most of the two-argument VM instructions.
>
> Sources:
> http://svirfneblin.org/stuff/VPython-0.1.tar.gz

Hey Jakob,

This is very interesting (at this point I'm just lurking), but has anyone pointed out yet that there already is something else called VPython, which has a long standing "right" to the name? http://www.vpython.org/ -- "3D Programming for Ordinary Mortals".

I hope you can find a better name for your project, otherwise there will be no end of confusion...

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
Re: [Python-Dev] [ANN] VPython 0.1
Guido van Rossum wrote:
> there already is something else called VPython

Perhaps it could be called Fython (Python with a Forth-like VM) or Thython (threaded-code Python).

--
Greg
Re: [Python-Dev] No manifest files on Windows?
> In http://bugs.python.org/issue4120, the author suggests that it might
> be possible to completely stop using the manifest mechanism, for VS
> 2008. Given the many problems that this SxS stuff has caused, this
> sounds like a very desirable route, although I haven't done any actual
> testing yet.
>
> Can all the Windows experts please comment? Could that work? Does it
> have any downsides?
>
> If it works, I would like to apply it to 3.0, although I probably
> won't be able to apply it to tomorrow's rc. Would it also be possible
> to change that in 2.6.1 (even though python26.dll in 2.6.0 already
> includes a manifest, as do all the pyd files)?

My take is that the bug is suggesting the manifest be dropped only from .pyd files. Python's executable and DLL will still have the manifest reference, so the CRT is still loaded from a manifest (FWIW, the CRT will abort() if initialized other than via a manifest!).

I don't see a downside and can see how it would help with private assemblies.

[I've also added a comment to this effect to the bug]

Mark.
Re: [Python-Dev] [ANN] VPython 0.1
On 23 Oct, 10:42 pm, [EMAIL PROTECTED] wrote:
> Guido van Rossum wrote:
>> there already is something else called VPython
>
> Perhaps it could be called Fython (Python with a Forth-like VM) or
> Thython (threaded-code Python).

I feel like I've missed something important, but, why not just call it "Python"? ;-)

It's a substantial patch, but from what I understand it's a huge performance improvement and completely compatible, both at the C API and Python source levels. Is there any reason this should be a separate project rather than just be rolled into the core? Obviously once issues like the 'Cell' macro are dealt with.
Re: [Python-Dev] [ANN] VPython 0.1
[EMAIL PROTECTED] wrote:
> Is there any reason this should be a separate project rather than just
> be rolled into the core?

Always keep in mind that one of the important characteristics of CPython is that its implementation is very straightforward and easy to follow. Replacing the ceval loop with machine generated code would be a step away from that.

--
Greg
Re: [Python-Dev] [ANN] VPython 0.1
[EMAIL PROTECTED] wrote:
> It's a substantial patch, but from what I understand it's a huge
> performance improvement and completely compatible, both at the C API
> and Python source levels.

I have not seen any Windows test yet. The direct threading is gcc-specific, so there might be degradation with MSVC.
Re: [Python-Dev] No manifest files on Windows?
> I don't see a downside and can see how it would help with private
> assemblies.
>
> [I've also added a comment to this effect to the bug]

Thanks! I'll go ahead and accept this, then.

Regards,
Martin
