[Python-Dev] Benchmark performance...
Hi,

as Antoine pointed out in the corresponding issue (http://bugs.python.org/issue14757#msg160870), measuring/assessing the real-world performance of my patch would be interesting. I mentioned that I am not aware of any relevant Python 3 program/application to report numbers for (but expect that the speedups should persist). Since nobody has come up with an answer yet, I figured it would be a good idea to ask everybody on python-dev for suggestions...

Regards,
--stefan
Re: [Python-Dev] Assigning copyright...
Hello,

> http://www.python.org/psf/contrib/

I took care of the formalities. I am not sure how to proceed further. Would python-dev want me to draft a PEP?

Regards,
--stefan

PS: Personally, I am not 100% convinced that having a PEP is a good thing in this case, as it makes a perfectly transparent optimization visible. AFAIR, Sun opted to keep their instruction derivatives secret, i.e., the second edition of the JVM internals book does not even mention them anymore.
Re: [Python-Dev] Assigning copyright...
> I think you'll find that we don't keep a lot of things secret about CPython and its implementation.

Yeah, I agree that this is in principle a good thing and what makes CPython ideally suited for research. However, my optimizations make use of unused opcodes, which might be used in the future by actual CPython instructions (e.g., between my previous patch and the new one, the YIELD_FROM instruction has been added). I'd say the situation is similar to the threaded code/computed gotos issue, although this is different when it comes to the community. The PSU has ?

I am going to file a patch like Martin von Loewis suggested.

Thanks,
--stefan
Re: [Python-Dev] Assigning copyright...
Hello Mark,

> A URL for the code repository (with an open-source license), so code can be reviewed. It is hard to review and update a giant patch.

OK, I took Nick's advice to heart and created a fork of the official CPython mirror on Bitbucket. You can view the patched code (branch: inca-only) under the following URL:
https://bitbucket.org/sbrunthaler/cpython-inline-caching
Since it is a fork, it contains the usual LICENSE from Python.

Regarding Eric's hint: It seems that this agreement needs to be signed and mailed. Can I sign/scan and email it to somebody? (Or should I wait until there is a decision regarding a potential integration?)

The way I understood Guido's last message, it is best to use the Apache 2 license without retaining my own copyright. I am perfectly fine with that, but am not sure whether using the fork, with sub-directories including the official LICENSE, takes care of that. Obviously, I don't have much experience in this area, so if I am missing something blatantly obvious, I apologize beforehand...

Best,
--stefan
[Python-Dev] Assigning copyright...
Hi,

I only had a little time to spend on my open-sourcing efforts, which is why I could not get back to python-dev any earlier... Yesterday I forward-ported my patches to revision 76549 (13c30fe3f427), which only took 25 minutes or so (primarily due to the small changes necessary to Python itself and the stability of those parts). Thanks to a colleague of mine (Per Larsen), I reimplemented some of the uglier parts of the code generator, too.

Guido's answer from the last thread was that I should duly assign the copyright to the PSF. Unfortunately, I don't really see any relevant part other than the LICENSE and README files in the Python distribution. Since my patch basically just adds another subdirectory (cgen) to the Python top-level directory, I am not sure whether I need to supply other information to make my code officially PSF compatible. Am I missing something obvious?

Thanks,
--stefan
Re: [Python-Dev] Python 3 optimizations, continued, continued again...
> I really don't think that is a problem. The core contributors can deal well with complexity in my experience. :-)

No no, I wasn't trying to insinuate anything like that at all. I just figured that having the code generator able to generate 4 optimizations where only one is supported is a bad idea for several reasons, such as maintainability, etc.

Anyway, I've just completed the integration of the code generator and put the corresponding patch on my page (http://www.ics.uci.edu/~sbruntha/pydev.html) for downloading. The license thing is still missing; I'll do that tomorrow or sometime next week.

Regards,
--stefan
Re: [Python-Dev] Python 3 optimizations, continued, continued again...
> How many times did you regenerate this code until you got it right?

Well, honestly, I changed the code generator to pack the new optimized instruction derivatives densely into the available opcodes, so that I can make optimal use of what's there. Thus I only generated the code twice for this patch.

> And how do you know that you really got it so right that it was the last time ever that you needed your generator for it?

I am positive that I am going to need my code generator in the future, as I have several ideas for increasing performance even more. As I have mentioned before, my quickening-based inline caching technique is very simple, and if it were to crash, chances are that one of the inline-cache miss guards doesn't capture all scenarios, i.e., is non-exhaustive. The regression tests run, and so do the official benchmarks plus the computer language benchmarks game. In addition, this has been my line of research since 2009, so I have extensive experience with it, too.

> What if the C structure of any of those several types ever changes?

Since I optimize interpreter instructions, any change that affects their implementation requires changing the optimized instructions, too. Having the code generator ready for such things would certainly be a good idea (probably also for generating the default interpreter dispatch loop), since you could also add your own profile for your application/domain to reuse the remaining 30+ instruction opcodes. The direct answer is that I would need to regenerate the driver file, which is basically a gdb dump plus an Emacs macro (please note that I have not needed to do that since working with ~3.0b1).

I will add a list of the types I use for specializing to the patch section on the additional resources page of my homepage (including a fixed patch for what Georg brought to my attention).

--stefan
Re: [Python-Dev] Python 3 optimizations, continued, continued again...
> But let me put this straight: as an open-source project, we are hesitant to accept changes which depend on closed software. Even if your optimization techniques would result in performance a hundred times better than what is currently achieved, we would still be wary to accept them. Please note that this is not because of lack of trust or better yet greed for your code. We need to make sure that under no circumstances our codebase is in danger because something important was left out along the way.

I am positive that the code generator does not depend on any closed-source components; I just use Mako for storing the C code templates that I generate -- everything else I wrote myself. Of course, I'll give the code generator to pydev, too, if necessary. However, I need to strip it down so that it does not do all the other stuff that you don't need. I just wanted to give you the implementation now, since Benjamin said that he wants to see real code and results first. If you want to integrate the inca optimization, I am going to start working on this asap.

> Maintenance of generated code is yet another nuisance that should better be strongly justified.

I agree, but the nice thing is that the technique is very simple: only if you changed a significant part of the interpreter implementation would you need to change the optimized derivatives, too. If one generates the default interpreter implementation, too, then one gets the optimizations almost for free. For maintenance reasons I chose to use a template-based system, too, since this gives you a direct correspondence between the actual code and what's generated, without interfering with the code generator at all.

--stefan
Re: [Python-Dev] Python 3 optimizations, continued, continued again...
On Wed, Feb 1, 2012 at 09:46, Guido van Rossum gu...@python.org wrote:
> Let's make one thing clear. The Python core developers need to be able to reproduce your results from scratch, and that means access to the templates, code generators, inputs, and everything else you used. (Of course for stuff you didn't write that's already open source, all we need is a pointer to the open source project and the exact version/configuration you used, plus any local mods you made.) I understand that you're hesitant to just dump your current mess, and you want to clean it up before you show it to us. That's fine. But until you're ready to show it, we're not going to integrate any of your work into CPython, even though some of us (maybe Benjamin) may be interested in kicking its tires. And remember, it doesn't need to be perfect (in fact perfectionism is probably a bad idea here). But it does need to be open source. Every single bit of it. (And no GPL, please.)

I understand all of these issues. Currently, it's not really a mess, but much more complicated than it needs to be for supporting only the inca optimization. I don't know what the time frame for a possible integration is (my guess is that it'd be safe anyway to disable it, like the threaded code support was handled).

As for the license: I really don't care about that at all; the only nice-to-have would be a pointer to my home page and/or the corresponding research, but that's about all on my wish list.

--stefan
Re: [Python-Dev] Python 3 optimizations, continued, continued again...
> I assume yes here means yes, I'm aware and not yes, I'm using Python 2, right? And you're building on top of the existing support for threaded code in order to improve it?

Your assumption is correct; I'm sorry for the sloppiness (I was heading out for lunch). None of the code is 2.x compatible; all of my work has always targeted Python 3.x.

My work does not improve threaded code (as in interpreter dispatch technique), but enables efficient and purely interpretative inline caching via quickening. (So, after execution of BINARY_ADD, I rewrite the specific occurrence of the bytecode instruction to, say, a FLOAT_ADD instruction, and ensure that my assumption is correct in the FLOAT_ADD instruction.)

Thanks,
--stefan
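[To make the rewrite concrete, here is a minimal sketch in the style of ceval.c. The derivative name FLOAT_ADD follows the description above, but REWRITE_CURRENT_OPCODE is a hypothetical macro; in the actual patch the quickening decision is made inside instrumented functions such as binary_op1.]

    /* Sketch only: REWRITE_CURRENT_OPCODE stands for overwriting the
     * opcode of the currently executing instruction. */
    TARGET(BINARY_ADD):
        w = POP();
        v = TOP();
        if (PyFloat_CheckExact(v) && PyFloat_CheckExact(w))
            REWRITE_CURRENT_OPCODE(FLOAT_ADD); /* quicken for next time */
        x = PyNumber_Add(v, w);
        Py_DECREF(v);
        Py_DECREF(w);
        SET_TOP(x);
        if (x != NULL) DISPATCH();
        break;

    TARGET(FLOAT_ADD):
        w = POP();
        v = TOP();
        if (PyFloat_CheckExact(v) && PyFloat_CheckExact(w))
            x = PyFloat_Type.tp_as_number->nb_add(v, w);
        else {
            /* inline-cache miss: deoptimize, take the generic path */
            REWRITE_CURRENT_OPCODE(BINARY_ADD);
            x = PyNumber_Add(v, w);
        }
        Py_DECREF(v);
        Py_DECREF(w);
        SET_TOP(x);
        if (x != NULL) DISPATCH();
        break;
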
Re: [Python-Dev] Python 3 optimizations, continued, continued again...
> If I read the patch correctly, most of it is auto-generated (and there are probably a few spurious changes that blow it up, such as the python-gdb.py file).

Hm, honestly I don't know where the python-gdb.py file comes from; I thought it came with the switch from 3.1 to the tip version I was using. Anyway, I did not touch it, or at least have no recollection of doing so. Regarding the spurious changes: this might very well be; regression testing works, and it would actually be fairly easy to figure out crashes (e.g., by tracing all executed bytecode instructions and seeing whether all of them are actually executed; I could easily do that if wanted/necessary).

> But the tool that actually generates the code doesn't seem to be included? (Which means that in this form, the patch couldn't possibly be accepted.)

Well, the tool is not included because it does a lot more (e.g., generate the code for elimination of reference count operations). Unfortunately, my interpreter architecture that achieves the highest speedups is more complicated, and I got the feeling that this was not going over well with python-dev. So, I had the idea of basically using just one (but a major one) optimization technique and going with that. I don't see why you would need my code generator, though. Not that I mind, but I would need to strip down and remove many parts of it and also make it more accessible to other people. However, if python-dev decides that it wants to include the optimizations and requires the code generator, I'll happily chip in the extra work and give you the corresponding code generator, too.

Thanks,
--stefan
Re: [Python-Dev] Python 3 optimizations, continued, continued again...
> There is also the issue of the two test modules removed from the test suite.

Oh, I'm sorry, it seems the patch contained too much of my development stuff. (I did remove them before, because they were always failing due to the instruction opcodes being changed by quickening; they pass the tests, though.)

> Well, nobody wants to review generated code.

I agree. The code generator basically uses templates that contain the information, plus a dump of the C structure of several types, to traverse and see which of them implement which functions. There is really no magic there; the most complex thing is to get the inline-cache miss checks for function calls right. But I tried to make the generated code look pretty, so that working with it is not too much of a hassle. The code generator itself is a little bit more complicated, so I am not sure it would help a lot...

best,
--stefan
Re: [Python-Dev] Python 3 optimizations, continued, continued again...
Hello,

> Could you try benchmarking with the standard benchmarks: http://hg.python.org/benchmarks/ and see what sort of performance gains you get?

Yeah, of course. I already did; refer to the page listed below for details. I have not looked into the results yet, though.

> How portable is the threaded interpreter?

Well, you can implement threaded code on any machine that supports indirect branch instructions. Fortunately, GCC supports the labels-as-values feature, which makes it available on any machine that supports GCC. My optimizations themselves are portable, and I tested them on a PowerPC for my thesis, too. (AFAIR, LLVM supports this feature, too.)

> Do you have a public repository for the code, so we can take a look?

I have created a patch (as Benjamin wanted) and put all of the resources (i.e., benchmark results and the patch itself) on my home page:
http://www.ics.uci.edu/~sbruntha/pydev.html

Regards,
--stefan
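[For readers unfamiliar with the technique, here is a toy interpreter showing labels-as-values dispatch; this is not CPython code, just the dispatch idea in miniature: each instruction jumps directly to the next handler instead of returning to a central switch.]

    #include <stdio.h>

    int main(void)
    {
        /* GCC extension: && takes the address of a label */
        static void *dispatch_table[] = { &&op_inc, &&op_print, &&op_halt };
        int program[] = { 0, 0, 1, 2 };   /* inc, inc, print, halt */
        int *ip = program;
        int acc = 0;

    #define DISPATCH() goto *dispatch_table[*ip++]

        DISPATCH();
    op_inc:
        acc++;
        DISPATCH();
    op_print:
        printf("%d\n", acc);   /* prints 2 */
        DISPATCH();
    op_halt:
        return 0;
    }
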
Re: [Python-Dev] Python 3 optimizations, continued, continued again...
> Well, you're aware that Python already uses threaded code where available? Or are you testing against Python 2?

Yes, and I am building on that.

--stefan
Re: [Python-Dev] Python 3 optimizations, continued, continued again...
Hi,

On Tue, Nov 8, 2011 at 10:36, Benjamin Peterson benja...@python.org wrote:
> 2011/11/8 stefan brunthaler s.bruntha...@uci.edu:
>> How does that sound?
> I think I can hear real patches and benchmarks most clearly.

I spent the better part of my 20% time on implementing the work as suggested. Please find the benchmarks attached to this email; I ran them on my system (i7-920, Linux 3.0.0-15, GCC 4.6.1). I branched off the regular 3.3a0 default tip changeset 73977 shortly after your email. I do not have an official patch yet, but am going to create one if wanted. Changes to the existing interpreter are minimal; the biggest chunk is a new interpreter dispatch loop. Merging dispatch loops eliminates some of my optimizations, but my inline caching technique enables inlining some functionality, which results in visible speedups. The results are normalized to the non-threaded-code version of the CPython interpreter (named vanilla), so that I can relate them to my preceding results. I anticipate *no* compatibility issues, and the interpreter requires less than 100 KiB of extra memory at run-time. Since my interpreter is using 215 of a maximum of 255 instructions, there is room for adding additional derivatives, e.g., for popular Python libraries, too.

Let me know what python-dev thinks of this and have a nice weekend,
--stefan

PS: AFAIR the version without partial stack frame caching also passes all regression tests modulo the ones that test against specific bytecodes.

currently processing: bench/binarytrees.py3.py
phd-cpy-3a0-thr-cod-pytho      arg: 10 | time: 0.161876 | stdev: 0.007780 | var: 0.000061 | mem:  6633.60
phd-cpy-3a0-thr-cod-pytho      arg: 12 | time: 0.699243 | stdev: 0.019112 | var: 0.000365 | mem:  8142.67
phd-cpy-3a0-thr-cod-pytho      arg: 14 | time: 3.388344 | stdev: 0.048042 | var: 0.002308 | mem: 13586.93
phd-cpy-pio-sne-pre-pyt-no-psf arg: 10 | time: 0.153875 | stdev: 0.003828 | var: 0.000015 | mem:  6873.73
phd-cpy-pio-sne-pre-pyt-no-psf arg: 12 | time: 0.632572 | stdev: 0.019121 | var: 0.000366 | mem:  8246.27
phd-cpy-pio-sne-pre-pyt-no-psf arg: 14 | time: 3.020988 | stdev: 0.043483 | var: 0.001891 | mem: 13640.27
phd-cpy-pio-sne-pre-pytho      arg: 10 | time: 0.150942 | stdev: 0.005157 | var: 0.000027 | mem:  6901.87
phd-cpy-pio-sne-pre-pytho      arg: 12 | time: 0.660841 | stdev: 0.020538 | var: 0.000422 | mem:  8286.80
phd-cpy-pio-sne-pre-pytho      arg: 14 | time: 3.184198 | stdev: 0.051103 | var: 0.002612 | mem: 13680.40
phd-cpy-3a0-van-pytho          arg: 10 | time: 0.202812 | stdev: 0.005480 | var: 0.000030 | mem:  6633.33
phd-cpy-3a0-van-pytho          arg: 12 | time: 0.908456 | stdev: 0.015744 | var: 0.000248 | mem:  8153.07
phd-cpy-3a0-van-pytho          arg: 14 | time: 4.364805 | stdev: 0.037522 | var: 0.001408 | mem: 13593.60
### phd-cpy-3a0-thr-cod-pytho     : 1.2887 (avg-sum: 1.416488)
### phd-cpy-pio-sne-pre-pyt-no-psf: 1.4383 (avg-sum: 1.269145)
### phd-cpy-pio-sne-pre-pytho     : 1.3704 (avg-sum: 1.331994)
### phd-cpy-3a0-van-pytho         : 1.0000 (avg-sum: 1.825358)

currently processing: bench/fannkuch.py3.py
phd-cpy-3a0-thr-cod-pytho      arg: 8 | time: 0.172677 | stdev: 0.006620 | var: 0.000044 | mem: 6424.13
phd-cpy-3a0-thr-cod-pytho      arg: 9 | time: 1.426755 | stdev: 0.035545 | var: 0.001263 | mem: 6425.20
phd-cpy-pio-sne-pre-pyt-no-psf arg: 8 | time: 0.168010 | stdev: 0.010277 | var: 0.000106 | mem: 6481.07
phd-cpy-pio-sne-pre-pyt-no-psf arg: 9 | time: 1.345817 | stdev: 0.033127 | var: 0.001097 | mem: 6479.60
phd-cpy-pio-sne-pre-pytho      arg: 8 | time: 0.165876 | stdev: 0.007136 | var: 0.000051 | mem: 6520.00
phd-cpy-pio-sne-pre-pytho      arg: 9 | time: 1.351150 | stdev: 0.028822 | var: 0.000831 | mem: 6519.73
phd-cpy-3a0-van-pytho          arg: 8 | time: 0.216146 | stdev: 0.012879 | var: 0.000166 | mem: 6419.07
phd-cpy-3a0-van-pytho          arg: 9 | time: 1.834247 | stdev: 0.028224 | var: 0.000797 | mem: 6418.67
### phd-cpy-3a0-thr-cod-pytho     : 1.2820 (avg-sum: 0.799716)
### phd-cpy-pio-sne-pre-pyt-no-psf: 1.3544 (avg-sum: 0.756913)
### phd-cpy-pio-sne-pre-pytho     : 1.3516 (avg-sum: 0.758513)
### phd-cpy-3a0-van-pytho         : 1.0000 (avg-sum: 1.025197)

currently processing: bench/fasta.py3.py
phd-cpy-3a0-thr-cod-pytho      arg:  5 | time: 0.374023 | stdev: 0.010870 | var: 0.000118 | mem: 6495.07
phd-cpy-3a0-thr-cod-pytho      arg: 10 | time: 0.714577 | stdev: 0.024713 | var: 0.000611 | mem: 6495.47
phd-cpy-3a0-thr-cod-pytho      arg: 15 | time: 1.062866 | stdev: 0.040138 | var: 0.001611 | mem: 6496.27
phd-cpy-pio-sne-pre-pyt-no-psf arg:  5 | time: 0.345621 | stdev: 0.022549 | var: 0.000508 | mem: 6551.87
phd-cpy-pio-sne-pre-pyt-no-psf arg
[Python-Dev] Python 3 optimizations, continued, continued again...
Hi guys,

while there is at least some interest in incorporating my optimizations, response has still been low. I figure that the changes are probably too much for a single big incorporation step. On a recent flight, I thought about cutting it down to make it more easily digestible.

The basic idea is to remove the optimized interpreter dispatch loop and advanced instruction format and use the existing ones. Currently (rev. ca8a0dfb2176), opcode.h uses 109 of the potentially available 255 instructions in the current instruction format. Hence, up to 149 instruction opcodes could be given to optimized instruction derivatives. Consequently, a possible change would require changing:
a) opcode.h, to add new instruction opcodes,
b) ceval.c, to include the new instruction opcodes in PyEval_EvalFrameEx,
c) abstract.c and object.c (possibly other files), to add the quickening/rewriting function calls.
(A sketch of these three touch points follows below.)

If this is more interesting, I could start evaluating which instruction opcodes should be allocated to which derivatives to get the biggest benefit. This is a lot easier to implement (because I can reuse the existing instruction implementations) and can easily be made conditionally compile-able, similar to the computed-gotos option. Since the changes are minimal, it is also simpler to understand and deal with for everybody else, too. On the downside, however, not all optimizations are possible and/or make sense within the given limit of instructions (no data-object inlining and no reference-count elimination).

How does that sound?

Have a nice day,
--stefan
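[A hedged sketch of what the three touch points could look like; the names, opcode number, and hook are illustrative, not the actual patch:]

    /* a) opcode.h: allocate one of the unused opcodes to a derivative */
    #ifdef WITH_QUICKENING
    #define INCA_FLOAT_ADD 147          /* illustrative number only */
    #endif

    /* b) ceval.c, PyEval_EvalFrameEx: the derivative reuses the stack
     *    protocol of BINARY_ADD (inline-cache miss guard elided here),
     *    conditionally compiled like the computed-gotos option */
    #ifdef WITH_QUICKENING
        TARGET(INCA_FLOAT_ADD):
            w = POP();
            v = TOP();
            x = PyFloat_Type.tp_as_number->nb_add(v, w);
            Py_DECREF(v);
            Py_DECREF(w);
            SET_TOP(x);
            if (x != NULL) DISPATCH();
            break;
    #endif

    /* c) abstract.c, binary_op1: report the observed operand types so
     *    the current instruction can be rewritten (hypothetical hook) */
    #ifdef WITH_QUICKENING
        if (Py_TYPE(v) == Py_TYPE(w))
            _PyEval_QuickenCurrent(op_slot, Py_TYPE(v));
    #endif
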
Re: [Python-Dev] Python 3 optimizations continued...
as promised, I created a publicly available preview of an implementation with my optimizations, which is available at the following location:
https://bitbucket.org/py3_pio/preview/wiki/Home

One very important thing that I forgot to indicate: you have to use computed gotos (i.e., configure --with-computed-gotos), otherwise it won't work. (Though I think most people can figure this out easily, knowing it a priori isn't too bad.)

Regards,
--stefan
Re: [Python-Dev] Python 3 optimizations continued...
> 1) The SFC optimisation is purely based on static code analysis, right? I assume it takes loops into account (and just multiplies scores for inner loops)? Is that what you mean with nesting level? Obviously, static analysis can sometimes be misleading, e.g. when there's a rare special case with lots of loops that needs to adapt input data in some way, but in general, I'd expect that this heuristic would tend to hit the important cases, especially for well structured code with short functions.

Yes, currently I only use the heuristic to statically estimate the utility of assigning an optimized slot to a local variable. And, another yes, nested blocks (like for-statements) are what I have in mind when using nesting level. I was told that the algorithm itself is very similar to linear-scan register allocation, modulo the ability to spill values, of course. From my benchmarks and in-depth analysis of several programs, I found this to work very well. In fact, the only problematic situation I found is (unfortunately) in one of the most-executed functions in US's bm_django.py: there is one loop that almost never gets executed, but this loop gives precedence to the local variables used inside it. Because of this, I already have an idea for a better approach: first, use the static heuristic to compute stack slot scores, then count back-branches (I would need this anyway, as _Py_CheckInterval is gone and OSR/hot-swapping is in general a good idea) and record their frequency. Next, just replace the current static weight of 100 with the dynamically recorded weight. Consequently, you should get better allocations. (Please note that I did some quantitative analysis of Python functions to determine that using 4 SFC slots covers a substantial number of functions [IIRC 95%], with the trivial scenario being when there are at most 4 local variables.) A sketch of the scoring idea follows after this message.

> 2) The RC elimination is tricky to get right and thus somewhat dangerous, but sounds worthwhile and should work particularly well on a stack based byte code interpreter like CPython.

Well, it was very tricky to get right when I first implemented it around Christmas 2009. The current implementation is reasonably simple to understand; however, it depends on the function refcount_effect giving me correct information at all times. I got the biggest performance improvement on one benchmark on the PowerPC, and I think that RISC architectures in general benefit more from this optimization (eliminating the load, add, and store instructions) than x86 CISCs do (an INCREF is just an add on the memory location without data dependencies, so fairly cheap). In any case, however, you get the replication effect of improving CPU branch prediction by having these additional instruction derivatives. It would be interesting (research-wise, too) to measure whether the reduction in memory operations makes Python programs use less energy, and if so, how big the difference is.

> 3) Inline caching also sounds worthwhile, although I wonder how large the savings will be here. You'd save a couple of indirect jumps at the C-API level, sure, but apart from that, my guess is that it would highly depend on the type of instruction. Certain (repeated) calls to C implemented functions would likely benefit quite a bit, for example, which would be a nice optimisation by itself, e.g. for builtins. I would expect that the same applies to iterators, even a couple of percent faster iteration can make a great deal of a difference, and a substantial set of iterators are implemented in C, e.g. itertools, range, zip and friends.
>
> I'm not so sure about arithmetic operations. In Cython, we (currently?) do not optimistically replace these with more specific code (unless we know the types at compile time), because it complicates the generated C code and indirect jumps aren't all that slow that the benefit would be important. Savings are *much* higher when data can be unboxed, so much that the slight improvement for optimistic type guesses is totally dwarfed in Cython. I would expect that the return of investment is better when the types are actually known at runtime, as in your case.

Well, in my thesis I already hint at another improvement of the existing design that can work on unboxed data as well (while still being an interpreter). I am eager to try this, but don't know how much time I can spend on it (because there are several other research projects I am actively involved in). In my experience, this works very well, and you cannot actually report good speedups without inline-caching arithmetic operations, simply because that's where all JITs shine and most benchmarks don't reflect real-world scenarios but mathematics-inclined microbenchmarks. Also, if in the future compilers (gcc and clang) are able to inline the invoked functions, higher speedups will be possible.

> 4) Regarding inlined object references, I would expect that it's much more worthwhile to speed up LOAD_GLOBAL and LOAD_NAME than LOAD_CONST.
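[A minimal sketch of the scoring heuristic as described above; this is my reconstruction of the idea, not code from the patch, and the names are illustrative:]

    /* Each use of a local contributes a weight that grows with loop
     * nesting depth; LOOP_WEIGHT is the static weight of 100 mentioned
     * above, later to be replaced by recorded back-branch frequencies. */
    #define LOOP_WEIGHT   100
    #define NUM_SFC_SLOTS 4

    static long
    use_score(int nesting_level)
    {
        long score = 1;
        while (nesting_level-- > 0)
            score *= LOOP_WEIGHT;   /* uses in inner loops dominate */
        return score;
    }

    /* Sum use_score() over every occurrence of each local variable,
     * sort descending, and assign the NUM_SFC_SLOTS highest-scoring
     * variables to the dedicated fast-slot instructions. */
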
Re: [Python-Dev] Python 3 optimizations continued...
Hi,

as promised, I created a publicly available preview of an implementation with my optimizations, which is available at the following location:
https://bitbucket.org/py3_pio/preview/wiki/Home

I followed Nick's advice and added some overview/introduction material at the wiki page the link points to; I am positive that spending 10 minutes reading it will provide you with valuable information regarding what's happening. In addition, as Guido already mentioned, this is more or less a direct copy of my research branch, without some of my private comments and with *no* additional refactorings to address software-engineering issues (which I am very much aware of). I hope this clarifies a *lot* and makes it easier to see which parts are involved and how all the pieces fit together.

I hope you'll like it, have fun,
--stefan
Re: [Python-Dev] Python 3 optimizations continued...
> I think that you must deal with big endianness because some RISC machines can't handle data in little-endian format at all. In WPython I wrote some macros which handle both endiannesses, but lacking big-endian machines I never had the opportunity to verify whether something was wrong.

I am sorry for the lapse of not getting back to this directly yesterday; we were just heading out for lunch, I only figured it out then, and immediately forgot about it on our way back to the lab...

So, as I have already said, I evaluated my optimizations on x86 (little-endian) and PowerPC 970 (big-endian), and I did not have to change any of my instruction decoding during interpretation. (The only nasty bug I still remember vividly was that gcc for x86 defaults the char data type to signed, whereas PowerPC's gcc defaults it to unsigned.) When I have time and access to a PowerPC machine again (an ARM might be interesting, too), I will take a look at the generated assembly code to figure out why this is working. (I have some ideas why it might work without changing the code.) If I run into any problems, I'll gladly contact you :)

BTW: AFAIR, we emailed last year regarding wpython, and IIRC your optimizations could primarily be summarized as clever superinstructions. I have not implemented anything in that area at all (and have in fact not even touched the compiler and its peephole optimizer), but if parts of my implementation get in, I am sure that you could add some of your work on top of that, too.

Cheers,
--stefan
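[The signedness difference is easy to demonstrate in isolation; this stand-alone snippet prints -1 where plain char is signed (gcc on x86) and 255 where it is unsigned (gcc on PowerPC and ARM):]

    #include <stdio.h>

    int main(void)
    {
        char c = (char)0xFF;    /* same bit pattern on both platforms */
        printf("%d\n", (int)c); /* -1 if char is signed, 255 if not   */
        return 0;
    }
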
Re: [Python-Dev] Python 3 optimizations continued...
> So, basically, you built a JIT compiler but don't want to call it that, right? Just because it compiles byte code to other byte code rather than to native CPU instructions does not mean it doesn't compile Just In Time.

For me, a definition of a JIT compiler, or any dynamic compilation subsystem, entails that native machine code is generated at run-time. Furthermore, I am not compiling from bytecode to bytecode, but rather changing the instruction encoding underneath and subsequently using quickening to optimize interpretation. But, OTOH, I am not aware of a canonical definition of JIT compilation, so it depends ;)

> I agree with the others that it's best to open up your repository for everyone who is interested. I can see no reason why you would want to close it back down once it's there.

Well, my code has primarily been a vehicle for my research in that area and thus is not immediately suited to adoption (it does not adhere to Python C coding standards, and contains lots of private comments about various facts, debugging hints, etc.). The explanation for this is easy: when I started out on my research, it was far from clear that it would be successful and really that much faster. So, I would like to clean up the comments and some parts of the code, and publish the code I have without any of the clean-up work for naming conventions, etc., so that you can all take a look and it is clear what it's all about. After that, we can have a factual discussion about whether it fits the bill for you, too, and if so, which changes (naming conventions, extensive documentation, etc.) are necessary *before* any adoption is reasonable for you. That seems to be a good way to start off and get results and feedback quickly; any ideas/complaints/comments/suggestions?

Best regards,
--stefan

PS: I am following Nick's suggested plan to port my changes directly to the most recent version, as mine currently only runs on Python 3.1.
Re: [Python-Dev] Python 3 optimizations continued...
>> Changing the bytecode width wouldn't make the interpreter more complex.
>
> No, but I think Stefan is proposing to add a *second* byte code format, in addition to the one that remains there. That would certainly be an increase in complexity.

Yes, indeed I have a more straightforward instruction format to allow for more efficient decoding. Just going from bytecode size to word-code size without changing the instruction format is going to require 8 (or word-size) times more memory on a 64-bit system. From an optimization perspective, the irregular instruction format was the biggest problem, because checking for HAS_ARG is always on the fast path and mostly unpredictable. Hence, I chose to extend the instruction format to word size and use the additional space so that the upper half holds the argument and the lower half the actual opcode. Decoding is more efficient, and *not* more complex. Using profiling to indicate which code is hot, I don't waste too much memory on encoding this regular instruction format.

> For example, I still plan to write a JIT for Python at some point. This may happen in two months, or in two years. I wouldn't try to stop anybody from contributing improvements that may become obsolete with the JIT.

I would not necessarily argue that at least my optimizations would become obsolete; if you are still thinking about writing a JIT, it might make sense to re-use what I've got and not start from scratch. E.g., building a simple JIT compiler that just inlines the operation implementations as templates to eliminate the interpretative overhead (in a similar vein to Piumarta and Riccardi's paper from 1998) might be a good start. Though I don't want to pre-influence your JIT design, I'm just thinking out loud...

Regards,
--stefan
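[A sketch of such a regular word-sized format; the macro names are mine, not from the patch. With opcode and oparg at fixed positions in one machine word, fetch/decode is a single load plus a shift and a mask, and the HAS_ARG check disappears from the fast path:]

    #include <stdint.h>

    typedef uint64_t inst_t;   /* one instruction per 64-bit word */

    #define INST_MAKE(opcode, oparg) (((inst_t)(oparg) << 32) | (opcode))
    #define INST_OPCODE(inst)        ((uint32_t)((inst) & 0xffffffffu))
    #define INST_OPARG(inst)         ((uint32_t)((inst) >> 32))

    /* fetch/decode in the dispatch loop:
     *     inst_t inst = *ip++;
     *     opcode = INST_OPCODE(inst);
     *     oparg  = INST_OPARG(inst);
     */
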
Re: [Python-Dev] Python 3 optimizations continued...
On Tue, Aug 30, 2011 at 09:42, Guido van Rossum gu...@python.org wrote:
> Stefan, have you shared a pointer to your code yet? Is it open source?

I have no shared code repository, but could create one (is there any pydev-preferred provider?). I have all the copyrights on the code, and I would like to open-source it.

> It sounds like people are definitely interested and it would make sense to let them experiment with your code and review it.

That sounds fine. I need to do some clean-up work (the code contains many of my comments reminding me of issues), and it currently does not pass all regression tests. But if people want to take a first look to decide whether they want it, then that's good enough for me. (I just wanted to know whether there is substantial interest, so that it eventually pays off to find and fix the remaining bugs.)

--stefan
Re: [Python-Dev] Python 3 optimizations continued...
> Do you really need it to match a machine word? Or is, say, a 16-bit format sufficient.

Hm, technically no, but practically it makes more sense: at least on x86 architectures, having opargs and opcodes in half-words can be expressed efficiently in assembly. On 64-bit architectures, I can also inline references to data objects whose addresses fit into the 32-bit upper half. It turns out that most constant objects fit nicely into this, and I have used a special cache region (again below 2**32) for global objects, too. So, technically it's not necessary, but practically it makes a lot of sense. (Most of these things work on 32-bit systems, too. For architectures with a smaller word size, we can adapt or disable the optimizations.)

Cheers,
--stefan
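[A sketch of what inlining an object reference could look like; the opcode name is hypothetical, and INST_OPARG and inst are from the decoding sketch above. When a constant's PyObject * value fits below 2**32, the derivative can carry the pointer in the upper half of the instruction word and skip the co_consts indirection entirely:]

    TARGET(INCA_LOAD_CONST):
        /* oparg holds the object's address; valid only below 2**32 */
        x = (PyObject *)(Py_uintptr_t)INST_OPARG(inst);
        Py_INCREF(x);
        PUSH(x);
        FAST_DISPATCH();
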
Re: [Python-Dev] Python 3 optimizations continued...
> Do I sense that the bytecode format is no longer platform-independent? That will need a bit of discussion. I bet there are some things around that depend on that.

Hm, I haven't really thought about that in detail or for long. I ran it on PowerPC 970 and Intel Atom and i7 without problems (the latter ones are a non-issue) and think that it can be made portable. I just stuff argument and opcode into one word for regular instruction decoding, like a RISC CPU, and I realize there might be little/big endian issues, but they can surely be handled by conditional compilation...

--stefan
Re: [Python-Dev] Python 3 optimizations continued...
> Um, I'm sorry, but that reply sounds incredibly naive, like you're not really sure what the on-disk format for .pyc files is or why it would matter. You're not even answering the question, except indirectly -- it seems that you've never even thought about the possibility of generating a .pyc file on one platform and copying it to a computer using a different one.

Well, it may sound incredibly naive, but the truth is: I never store the optimized representation to disk; it is built purely at runtime, when profiling tells me it makes sense to make the switch. Thus I circumvent many of the problems you outlined. I am sure that a full-fledged change of the representation has many more intricacies to it, but my approach is only tangentially related...

--stefan
Re: [Python-Dev] Python 3 optimizations continued...
> OK, then there's something else you haven't told us. Are you saying that the original (old) bytecode is still used (and hence written to and read from .pyc files)?

Short answer: yes.

Long answer: I added an invocation counter to the code object and keep interpreting in the usual Python interpreter until this counter reaches a configurable threshold. When it reaches this threshold, I create the new instruction format and interpret with this optimized representation. All the macros look exactly the same in the source code; they are just redefined to use the different instruction format. I am at no point serializing this representation or the runtime information gathered by me, as any subsequent invocation might have different characteristics.

I will remove my development commentaries and create a private repository at bitbucket for you* to take an early look, like Georg (and more or less Terry, too) suggested. Is that a good way for most of you? (I would then give access to whomever wants to take a look.)

Best,
--stefan

*: not personally targeted at Guido (who is naturally very much welcome to take a look, too) but addressed to python-dev in general.
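[A minimal sketch of that switch-over; the field name, helper names, and threshold value are hypothetical:]

    #define QUICKENING_THRESHOLD 1000   /* illustrative value */

    /* at the top of PyEval_EvalFrameEx, with co = f->f_code: */
    if (++co->co_invocation_count == QUICKENING_THRESHOLD) {
        /* build the word-sized representation once, then continue in
         * the optimized dispatch loop with redefined macros */
        _PyCode_CreateOptimizedRepresentation(co);
        return PyEval_EvalFrameEx_Optimized(f, throwflag);
    }
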
Re: [Python-Dev] Python 3 optimizations continued...
On Tue, Aug 30, 2011 at 13:42, Benjamin Peterson benja...@python.org wrote:
> 2011/8/30 stefan brunthaler ste...@brunthaler.net:
>> I will remove my development commentaries and create a private repository at bitbucket for you* to take an early look like Georg (and more or less Terry, too) suggested. Is that a good way for most of you? (I would then give access to whomever wants to take a look.)
>
> And what is wrong with a public one?

Well, since it does not fully pass all regression tests and is just meant for people to take a first look to find out whether it's interesting, I think I might take it offline after you have had a look. It seems to me that this is easier to do with a private repository, but in general I don't have a problem with a public one...

Regards,
--stefan

PS: If you want, I can also just put a tarball on my home page and post a link here. It's not that I want to have control/influence over who is allowed to look and who isn't.
[Python-Dev] Python 3 optimizations continued...
Hi,

pretty much a year ago, I wrote about the optimizations I did for my PhD thesis that target the Python 3 series interpreters. While I got some replies, the discussion never really picked up and no final explicit conclusion was reached. AFAICT, because of the following two factors, my optimizations were not that interesting for inclusion in the distribution at that time:
a) Unladen Swallow was targeting Python 3, too.
b) My prototype did not pass the regression tests.

As of November 2010 (IIRC), Google is no longer supporting work on US, and the project is stalled. (If I am wrong and there is still activity and any plans with the corresponding PEP, please let me know.) This is why I recently spent some time fixing issues so that I can run the regression tests. There is still some work to be done, but by and large it should be possible to complete all regression tests in reasonable time. (With the actual infrastructure in place, enabling optimizations later on is not a problem at all, either.)

So, the two big issues aside, is there any interest in incorporating these optimizations into Python 3?

Have a nice day,
--stefan

PS: It probably is unusual, but on a part of my home page I have created a link to indicate interest (makes both counting and voting easier: http://www.ics.uci.edu/~sbruntha/). There were also links indicating interest in funding the work; I have disabled these, so as not to upset anybody or give the impression of begging for money...
Re: [Python-Dev] Python 3 optimizations continued...
> Perhaps there would be something to say given patches/overviews/specifics.

Currently I don't have patches, but as an overview with specifics, I can provide the following:

* My optimizations basically rely on quickening to incorporate run-time information.
* I use two separate instruction dispatch routines, and use profiling to switch from the regular Python 3 dispatch routine to an optimized one (the implementation is actually vice versa, but that is not important now).
* The optimized dispatch routine has a changed instruction format (word-sized instead of bytecodes) that allows for regular instruction decoding (without the HAS_ARG check) and inlining of some objects in the instruction format on 64-bit architectures.
* I use inline caching based on quickening (passes almost all regression tests [302 out of 307]), eliminate reference count operations using quickening (passes, but has a memory leak), promote frequently accessed local variables to their own dedicated instructions (passes), and cache LOAD_GLOBAL/LOAD_NAME objects in the instruction encoding when possible (I am working on this right now).

The changes I made can be summarized as:

* I changed some header files to accommodate additional information (Python.h, ceval.h, code.h, frameobject.h, opcode.h, tupleobject.h).
* I changed mostly abstract.c to incorporate runtime type feedback.
* All other changes target mostly ceval.c; all supplementary code is in a sub-directory named opt, and all generated files are in a sub-directory within that (opt/gen).
* I have a code generator in place that takes care of generating all the functions; it uses the Mako template system for creating C code and does not necessarily need to be shipped with the interpreter (though one can play around and experiment with it).

So, all in all, the changes to the actual implementation are not that big, and most of the code is generated (using sloccount, opt has 1990 lines of C, and opt/gen has 8649 lines of C).

That's a quick summary; if there are any further or more in-depth questions, let me know.

best,
--stefan
Re: [Python-Dev] Python 3 optimizations continued...
> The question really is whether this is an all-or-nothing deal. If you could identify smaller parts that can be applied independently, interest would be higher.

Well, it's not an all-or-nothing deal. In my current architecture, I can selectively enable most of the optimizations as I see fit. The only prerequisite (in my implementation) is that I have two dispatch loops with a changed instruction format. That is, however, not a technical necessity, just the way I implemented it. Basically, you can choose whatever you like best, and I could extract that part. I am just offering to add all the things that I have done :)

> Also, I'd be curious whether your techniques help or hinder a potential integration of a JIT generator.

This is something I have frequently discussed with several JIT people. IMHO, having my optimizations in place also helps a JIT compiler, since it can re-use the information I gathered to generate more aggressively optimized native machine code right away (the inline caches can be generated with the type information right away, some functions could be inlined with the guard statements subsumed, etc.). Another benefit could be that the JIT compiler can spend more time on generating code, because the interpreter is already faster (so in some cases it would probably not make sense to include a non-optimizing, fast-and-simple JIT compiler). There are others on the list who can probably comment on this, too.

That aside, I think that while having a JIT is an important goal, I can very well imagine scenarios where the additional memory consumption of a JIT for each process (I assume that the native machine code caches are not shared) hinders scalability. I have no data to back this up, but I think that would be an interesting trade-off: say, a 30% gain in performance without substantial additional memory requirements on existing hardware, compared to higher achievable speedups that require more machines.

Regards,
--stefan
Re: [Python-Dev] Python 3 optimizations continued...
> Does it speed up Python? :-) Could you provide numbers (benchmarks)?

Yes, it does ;)

The maximum overall speedup I achieved was by a factor of 2.42 on my i7-920 for the spectralnorm benchmark of the computer language benchmarks game. Others from the same set are:

binarytrees:  1.9257 (1.9891)
fannkuch:     1.6509 (1.7264)
fasta:        1.5446 (1.7161)
mandelbrot:   2.0040 (2.1847)
nbody:        1.6165 (1.7602)
spectralnorm: 2.2538 (2.4176)
---
overall:      1.8213 (1.9382)

(The first number is with the combination of all optimizations; the one in parentheses is with my last optimization [interpreter instruction scheduling] enabled, too.)

For a comparative real-world benchmark I tested Martin von Loewis' django port (there are not that many meaningful Python 3 real-world benchmarks) and got a speedup of 1.3 (without IIS). This is reasonably good; US got a speedup of 1.35 on this benchmark. I just checked that pypy-c-latest on 64-bit reports 1.5 (the pypy-c-jit-latest figures seem to be either not working currently or *really* fast...), but I cannot tell directly how that relates to speedups (it just says less is better and I did not quickly find an explanation). Since I did this benchmark last year, I have spent more time investigating it and found that I could do better, but I would have to guess as to how much. (An interesting aside, though: on this benchmark, the executable never grew to more than 5 MB of memory usage, exactly like the vanilla Python 3 interpreter.)

hth,
--stefan
Re: [Python-Dev] Python 3 optimizations continued...
> Personally, I *like* CPython fitting into the simple-and-portable niche in the Python interpreter space. Armin Rigo made the judgment years ago that CPython was a poor platform for serious optimisation when he stopped working on Psyco and started PyPy instead, and I think the contrasting fates of PyPy and Unladen Swallow have borne out that opinion. Significantly increasing the complexity of CPython for speed-ups that are dwarfed by those available through PyPy seems like a poor trade-off to me.

I agree with the trade-off, but the nice thing is that CPython's interpreter remains simple and portable with my optimizations. All of these optimizations are purely interpretative, and the complexity of CPython is not affected much. (For example, I have an inline-cached version of BINARY_ADD that is called INCA_FLOAT_ADD [INCA being my abbreviation for INline CAching]; you don't actually have to look at its source code, since it is generated by my code generator, but you can tell immediately what's going on by looking at instruction traces.) So, the interpreter remains fully portable, and any compatibility issues with C modules should not occur either.

> At a bare minimum, I don't think any significant changes should be made under the "it will be faster" justification until the bulk of the real-world benchmark suite used for speed.pypy.org is available for Python 3. (Wasn't there a GSoC project about that?)

Having more tests would surely be helpful. As already said, the most real-world thing I can do is Martin's django patch. (Some of the other benchmarks are from the shootout, and I can [and did] run them, too: binarytrees, fannkuch, fasta, mandelbrot, nbody and spectralnorm. I also have the AI benchmark from Unladen Swallow, but no current figures.)

Best,
--stefan
Re: [Python-Dev] Python 3 optimizations...
Hi,

I guess it would be a good idea to quickly outline my inline caching approach, so that we all have a basic understanding of how it works.

If we take, for instance, the BINARY_ADD instruction, the interpreter evaluates the actual operand types and chooses the matching operation implementation at runtime, i.e., operands that are unicode strings will be concatenated via unicode_concatenate; for float operands, on the other hand, the interpreter would end up invoking float_add via binary_op1. Now, a very efficient way to achieve purely interpretative inline caching is to quicken the type-generic BINARY_ADD instruction to a type-dependent FLOAT_ADD instruction (this technique, i.e., inline caching via quickening, is the primary contribution of my ECOOP paper). Hence, I have a very simple code generator that generates type-dependent interpreter instructions in a pre-compile step of the interpreter, and I use runtime type information to quicken/rewrite instructions. Aside from the operators, I have implemented this quickening technique for the FOR_ITER, COMPARE_OP and CALL_FUNCTION instructions.

> I'm absolutely interested, although not for the CPython project but for Cython. I wonder how you do inline caching in Python if the methods of a type can be replaced by whatever at runtime. Could you elaborate on that?

Currently, I only provide optimized derivatives for several separate call targets, i.e., whether a call target is a C function with varargs, or a Python function/method -- this already eliminates a lot of overhead from invoking call_function. Based on further quantitative analysis, I can easily provide inline-cached derivatives of frequently called functions, such as some builtin primitives.

> Based on what information do you switch between inlining states?

I have instrumented versions of some functions that allow me to make quickening decisions, such as binary_op1, do_richcompare, or call_function, where I can quicken instructions to an optimized, inline-cached instruction derivative.

> Or do you restrict yourself to builtin types?

Currently, my approach provides optimized derivative instructions for the standard library, e.g., unicode strings, numerical objects, containers, and iterators.

> That might be worth it already, just think of list.append(). We have an optimistic optimisation for object.append() in Cython that gives us massive speed-ups in loops that build lists, even if we don't know at compile time that we are dealing with lists.

Yes, that sounds like a reasonable thing to do. I could provide much more optimized derivatives based on application profiles, too. Since I use a simple code generator for generating the derivatives, it would also be possible to provide end-users with the means to analyze their apps and generate optimized instruction derivatives matching their profile.

Regards,
--stefan
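[To illustrate one call derivative; the function name is hypothetical and keyword arguments are ignored for brevity. call_function and load_args are the existing helpers in ceval.c, and the guard is the inline-cache miss check for calls mentioned above:]

    static PyObject *
    call_cfunction_varargs(PyObject ***pp_stack, int oparg)
    {
        PyObject *func = *((*pp_stack) - oparg - 1);
        PyObject *callargs, *x;

        if (!(PyCFunction_Check(func) &&
              PyCFunction_GET_FLAGS(func) == METH_VARARGS))
            /* inline-cache miss: take the generic, kind-dispatching path */
            return call_function(pp_stack, oparg);

        callargs = load_args(pp_stack, oparg);  /* pop args into a tuple */
        x = PyCFunction_Call(func, callargs, NULL);
        Py_XDECREF(callargs);
        return x;
    }
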
Re: [Python-Dev] Python 3 optimizations...
> This sounds like wpython (a CPython derivative with a wider set of byte code commands) could benefit from it.

I am aware of the wpython project by Cesare Di Mauro. I change the instruction format from bytecode to wordcode, too (because it allows for more efficient instruction decoding). Contrary to his approach, however, I do not change the instruction encoding to pack in additional optimizations. (I hope I have put that correctly; I saw his slides about a year ago.)

> Do I understand correctly that you modify the byte code of modules/functions at runtime?

Yes. Quickening is a runtime-only optimization technique that rewrites instructions from a generic instruction to an optimized derivative (originally for the Java virtual machine). It is completely hidden from the compiler and has no other dependencies than the interpreter dispatch routine itself.

> Ah, yes, that makes good sense. So you basically add an intermediate step to calls that provides faster dispatch for known C functions.

Exactly. I also contemplated providing optimized derivatives for all builtin functions, but never implemented it (lack of time). Based on quantitative analysis of usage frequency, one could very well decide to, e.g., provide an optimized CALL_FUNCTION derivative for the len function. Another benefit of using my technique is that a compiler could decide to inline all of the functions of the optimized derivatives (e.g., the float_add function call inside my FLOAT_ADD interpreter instruction). Unfortunately, however, gcc currently does not allow for cross-module inlining (AFAIR). (Preliminary tests with manually changing the default inlining size for ceval.c resulted in speedups of up to 1.3 on my machine, so I think inlining the function bodies of the optimized derivatives would boost performance noticeably.)

> I'm interested in the code that determines what can be optimised in what way. I read that Jython recently received a contribution that provides type information for lots of modules and builtins, but having something like that for CPython would be cool.

OK. For this year's PPPJ I wanted to submit a paper realizing my optimization in Jython. Because of bytecode-rewriting tools, the interpreter could decide at runtime which optimized derivatives to generate and add rewriting code that supports the changing instruction set. Either way (static pre-compiling or dynamic bytecode rewriting, that is), I think that Jython and IronPython would greatly benefit from applying this optimization technique, because their JIT compilers would inline the function calls with a high likelihood.

> Such an approach would also be very useful for Cython. Think of a profiler that runs a program in CPython and tells you exactly what static type annotations to put where in your Python code to make it compile to a fast binary with Cython. Or, even better, it could just spit out a .pxd file that you drop next to your .py file and that provides the static type information for you.

Hm, I think you could very easily save away the optimized bytecode sequence for function calls, which would allow you to do that. E.g., you could save something similar to:

    LOAD_FAST LOAD_CONST LONG_ADD
or
    LOAD_GLOBAL CALL_C_ZERO

Cheers,
--stefan
Re: [Python-Dev] Python 3 optimizations...
> How do you generate the specialized opcode implementations?

I have a small code generator written in Python that uses Mako templates to generate C files that can be included in the main interpreter. It is a data-driven approach that uses type information gathered by gdb and checks whether given types implement, for instance, an nb_add method.

> Presumably that is done ahead of time, or you'd have to use a JIT, which is
> what you're avoiding.

Yes, and yes: I execute the code generator before compiling the Python interpreter, and I am interested in purely interpretative optimization techniques.

> I'm guessing from your comments below about cross-module inlining that you
> generate a separate .c file with the specialized opcode bodies and then
> call through to them via a table of function pointers indexed by opcode,
> but I could be totally wrong. :)

No, dead on ;) Probably a small example from the top of my head illustrates what is going on:

    TARGET(FLOAT_ADD):
        w = POP();
        v = TOP();
        x = PyFloat_Type.tp_as_number->nb_add(v, w);
        SET_TOP(x);
        if (x != NULL) FAST_DISPATCH();
        break;

And I extend the standard indirect threaded-code dispatch table to support the FLOAT_ADD operation.

> There are a variety of solutions to getting cross-module inlining these
> days. Clang+LLVM support link-time optimization (LTO) via a plugin for
> gold. GCC has LTO and LIPO as well.

A PhD colleague from our institute pointed the gold stuff out to me yesterday; I am going to check whether any of these solutions would work. A deeper problem here is that the compilers' heuristics are ill-suited to the needs of compiling an interpreter dispatch routine -- I will investigate this further in future research.

> This would be interesting. We (obviously) have similar instrumentation in
> unladen swallow to gather type feedback. We talked with Craig Citro about
> finding a way to feed that back to Cython for exactly this reason, but we
> haven't really pursued it.

Ok; I think it would actually be fairly easy to use the type information gathered at runtime by the quickening approach. Several auxiliary functions for dealing with these types could be generated by my code generator as well. It is probably worth looking into, though my current top priority is my PhD research, so I cannot promise to be able to allocate vast amounts of time for such endeavours.

Best,
--stefan
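Extending the threaded-code dispatch table essentially means pointing formerly unused opcode values at the labels of the generated instruction bodies. With computed gotos this could look roughly like the abbreviated sketch below; FLOAT_ADD and LONG_ADD stand in for whatever derivatives the code generator emits, and the table shown is only an illustrative stand-in for CPython's real opcode_targets.h.

    static void *opcode_targets[] = {
        /* ... generic instructions ... */
        &&TARGET_BINARY_ADD,
        /* ... */
        /* formerly unused opcode slots now dispatch to quickened derivatives */
        &&TARGET_FLOAT_ADD,
        &&TARGET_LONG_ADD,
    };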
Re: [Python-Dev] Python 3 optimizations...
> wpython has reached 1.1 final version. If you are interested, you can find
> it here: http://code.google.com/p/wpython2/ and you can download the new
> slides that cover the improvements over 1.0 alpha.

Thanks for the hint, I will definitely check your new slides.

> Did you use the wpython wordcode format, or a new one?

No, actually I was well into working on my stuff when you announced wpython last year. My latest instruction format uses a native machine word (64 bit) that contains two 32-bit halves, with the opcode in the lower half and the operand in the upper half. While the opcode certainly does not exceed 10 bits or so, I need more than a byte to support more operations (my latest interpreter has 395 instructions). Our instruction decoding is almost identical, though.

> Yes, you're right. The wpython approach is to encode as much information as
> it can to save space and decoding time, specialize some opcodes, etc.

Yes, I see that wpython eliminates common instruction sequences. From my understanding, it corresponds to using static superinstructions in combination with a changed instruction format. Aside from the optimizations in the operation implementations, wpython eliminates some instruction dispatches, which are notoriously inefficient. I think it is a very nice approach, and some form of inline caching with quickening might well boost performance even more.

Cheers,
--stefan
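As a small illustration of the instruction format described above, the following stand-alone snippet decodes one such 64-bit instruction word with the opcode in the lower 32 bits and the operand in the upper half, so a single memory read fetches both fields; the macro names and the example opcode number are made up for the illustration.

    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t instr_t;

    /* opcode in the lower 32 bits, operand in the upper 32 bits */
    #define OPCODE(word)     ((uint32_t)((word) & 0xffffffffu))
    #define OPARG(word)      ((uint32_t)((word) >> 32))
    #define ENCODE(op, arg)  (((instr_t)(arg) << 32) | (instr_t)(op))

    int main(void)
    {
        /* one instruction word carries both the opcode and its operand */
        instr_t word = ENCODE(396, 3);   /* e.g. a derivative opcode with oparg 3 */
        printf("opcode=%u oparg=%u\n", OPCODE(word), OPARG(word));
        return 0;
    }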
Re: [Python-Dev] Python 3 optimizations...
> I think I was wrong, but now I understand. The inlining you want is to get
> the nb_add body, not the opcode body.

Exactly. This would increase performance by quite a bit -- I will start experimenting with that stuff a.s.a.p.

> The example you've given brings up a correctness issue. It seems you'd want
> to add checks that verify that the operands w and v are both floats, and
> jump to BINARY_ADD if the guard fails. It would require reshuffling the
> stack operations to defer the pop until after the check, but it shouldn't
> be a problem.

That's usually the problem when you're doing something from the top of your head, especially when it's already 9pm :) You're right; a simple way around this is either:

    TARGET(FLOAT_ADD):
        if (!(TOP()->ob_type == SECOND()->ob_type &&
              TOP()->ob_type == &PyFloat_Type))
            goto TARGET_BINARY_ADD_SKIP_DECODE;
        ...remains the same...

Note: the skip-decode label is necessary because otherwise the jump would advance the instruction pointer.

Another simple approach is to add another set of labels to denote inline cache misses, e.g.:

    TARGET(BINARY_ADD):
        w = POP();
        v = TOP();
    BINARY_ADD_CACHE_MISS:
        x = PyNumber_Add(v, w);
        ...

    TARGET(FLOAT_ADD):
        w = POP();
        v = TOP();
        if (v->ob_type != w->ob_type || v->ob_type != &PyFloat_Type)
            goto BINARY_ADD_CACHE_MISS;
        ...

You could also call the PyNumber_Add function yourself instead of using the goto, but I did not implement it this way...

> I think you also record (via gdb) exactly the information that we record. I
> now see three consumers of type feedback from the CPython interpreter: you
> or any quickening approach, Cython, and Unladen Swallow. It might be worth
> converging on a standard way to record this information and serialize it so
> that it can be analyzed offline.

Indeed. Probably a bigger set of types from frequently used C extension modules would be useful as well. I am doing the simplest possible thing here: since the gdb pretty-print output is already pretty close to a Python data type definition, I just copied it and did a few search-and-replace commands to convert it to a valid Python data type.

> Our feedback recording mechanism currently uses LLVM data structures, but
> the point of quickening is to avoid that kind of dependency, so we'd need
> to rewrite it before it would really be useful to you.

Ok.

--stefan
[Python-Dev] Python 3 optimizations...
Hello,

during the last year I have developed a couple of quickening-based optimizations for the Python 3.1 interpreter. As part of my PhD programme, I have published a first technique that combines quickening with inline caching at this year's ECOOP, and subsequently extended this technique to optimize several load instructions as well as to eliminate redundant reference counting operations from instructions; the latter work has been accepted for publication at an upcoming conference.

I have a working prototype combining all of these optimizations that achieves a maximum speedup of 2.18 on the spectralnorm benchmark of the Computer Language Benchmarks Game. Early measurements on the Unladen Swallow django benchmark (with Martin von Loewis' patch for Python 3) show a speedup of about 1.3. Both speedups were measured on an i7 920 with the threaded code/computed goto optimization enabled, and are normalized against the standard Python 3.1 interpreter with all optimizations disabled. Since all of these optimizations are purely interpretative, they have next to no additional memory requirements and do not incur extensive warm-up costs.

I wonder whether you would be interested in integrating these optimizations with the Python 3 distribution, hence this mail. I could send copies of the papers, as well as provide my prototype source code, to interested members of the Python development community.

Have a nice day,
--stefan
Re: [Python-Dev] Python 3 optimizations...
> Is the source code under an open source non-copyleft license?

I am (unfortunately) not employed or funded by anybody, so I think that I can license/release the code as I see fit.

> Have you checked that the whole regression test suite passes?

Currently, I am sure my prototype will not pass the regression test suite, because I used it for benchmarking only. However, I think getting this done is probably not too much work.

> Can you publish all this online (papers and source code)?

I am not sure whether I can publish the ECOOP paper without violating any Springer copyrights, but I will check whether that is a problem. I could send you the second paper in private; until it is published in the corresponding proceedings, however, I fear that I might not be able to put it online. Publishing the source code should not be a problem at all, though I think some polishing might not be a bad idea...

Best,
--stefan
Re: [Python-Dev] Python 3 optimizations...
> The Springer link [1] at least shows the front page to give more of an idea
> as to what this is about.

Thanks, I forgot to mention the link.

> The idea does sound potentially interesting, although I'm not sure how
> applicable it will be with a full-blown LLVM-based JIT on the way for 3.3
> (via Unladen Swallow).

Yeah, I see that. However, I think that a JIT compiler could very well reuse the information gathered by my techniques to generate better code in the first run. There certainly are a couple of other benefits to combining both approaches, i.e., they are not mutually exclusive.

> Depending on the size and complexity of the patches, it may still be worth
> exploring for 3.2.

I am currently not aware of the planned release schedule, but I think that some of the techniques could very well be released in 3.2 -- with a conditional compilation target similar to the computed-goto feature.

Regards,
--stefan