[Python-Dev] Benchmark performance...

2012-05-23 Thread stefan brunthaler
Hi,

as Antoine pointed out in the corresponding issue
(http://bugs.python.org/issue14757#msg160870), measuring/assessing
real-world performance of my patch would be interesting. I mentioned
that I am not aware of any relevant Python 3 program/application to
report numbers for (but guess that the speedups should persist.) Since
nobody came up with an answer yet, I figured it would be a good idea
to ask everybody on python-dev for suggestions...

Regards,
--stefan
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Assigning copyright...

2012-05-07 Thread stefan brunthaler
Hello,

 http://www.python.org/psf/contrib/

I took care of the formalities.

I am not sure how to proceed further. Would python-dev want me to draft a PEP?

Regards,
--stefan

PS: Personally, I am not 100 percent convinced that having a PEP is a
good thing in this case, as it makes a perfectly transparent
optimization visible. AFAIR, Sun opted to keep their instruction
derivatives secret, i.e., the second edition of the JVM internals book does
not even mention them anymore.


Re: [Python-Dev] Assigning copyright...

2012-05-07 Thread stefan brunthaler
 I think you'll find that we don't keep a lot of things secret about CPython
 and its implementation.

Yeah, I agree that this is in principle a good thing and what makes
CPython ideally suited for research. However, my optimizations make
use of unused opcodes, which might be used in the future by actual
CPython instructions (e.g., from my previous patch to the new one the
YIELD_FROM instruction has been added.)
I'd say the situation is similar to the threaded-code/computed-gotos issue.


 Although this is different when it comes to the community.  The PSU has

?

I am going to file a patch like Martin von Loewis suggested.

Thanks,
--stefan


Re: [Python-Dev] Assigning copyright...

2012-04-26 Thread stefan brunthaler
Hello Mark,

 A URL for the code repository (with an open-source license),
 so code can be reviewed.
 It is hard to review and update a giant patch.

OK, I took Nick's advice to heart and created a fork from the official
CPython mirror on Bitbucket. You can view the patched code
(branch: inca-only) under the following URL:
https://bitbucket.org/sbrunthaler/cpython-inline-caching

Since it is a fork, it contains the usual LICENSE from Python.

Regarding Eric's hint: It seems that this agreement needs to be signed
and mailed. Can I sign/scan and email it to somebody? (Or should I
wait until there is a decision regarding a potential integration?) The
way I understood Guido's last message, it is best to use the Apache 2
license without retaining my own copyright. I am perfectly fine with
that but am not sure if using the fork with sub-directories including
the official LICENSE takes care of that. Obviously, I don't have too
much experience in this area, so if I am missing something blatantly
obvious, I apologize beforehand...

Best,
--stefan


[Python-Dev] Assigning copyright...

2012-04-25 Thread stefan brunthaler
Hi,

I have had only a little time to spend on my open-sourcing efforts, which is
why I could not get back to python-dev any earlier...

Yesterday I forward-ported my patches to revision 76549
(13c30fe3f427), which only took 25 minutes or so (primarily due to the
small changes necessary to Python itself and the stability of those
parts.) Thanks to a colleague of mine (Per Larsen), I reimplemented
some of the uglier parts of the code generator, too.
Guido's answer from the last thread was that I should duly assign the
copyright to the PSF. Unfortunately, I don't really see any part other
than the LICENSE and README files in the Python distribution. Since my
patch basically just adds another subdirectory (cgen) to the Python
top-level directory, I am not sure if I need to supply other
information to make my code officially PSF compatible.

Am I missing something obvious?

Thanks,
--stefan


Re: [Python-Dev] Python 3 optimizations, continued, continued again...

2012-02-02 Thread stefan brunthaler
 I really don't think that is a problem. The core contributors can deal
 well with complexity in my experience. :-)

No no, I wasn't trying to insinuate anything like that at all. No, I
just figured that having the code generator able to generate 4
optimizations when only one is supported is a bad idea for several
reasons, such as maintainability, etc.

Anyways, I've just completed the integration of the code generator and
put the corresponding patch on my page
(http://www.ics.uci.edu/~sbruntha/pydev.html) for downloading. The
license thing is still missing, I'll do that tomorrow or sometime next
week.

Regards,
--stefan


Re: [Python-Dev] Python 3 optimizations, continued, continued again...

2012-02-01 Thread stefan brunthaler
 How many times did you regenerate this code until you got it right?

Well, honestly, I changed the code generator to pack the new
optimized instruction derivatives densely into the available opcodes,
so that I can make optimal use of what's there. Thus I only generated
the code twice for this patch.

 And how do you know that you really got it so right that it was the last time 
 ever
 that you needed your generator for it?

I am positive that I am going to need my code generator in the future,
as I have several ideas to increase performance even more. As I have
mentioned before, my quickening-based inline caching technique is very
simple, and if it were to crash, chances are that one of the
inline-cache miss guards does not capture all scenarios, i.e., is
non-exhaustive. The regression tests pass, as do the official
benchmarks plus the Computer Language Benchmarks Game. In addition,
this has been my line of research since 2009, so I have extensive
experience with it, too.

 What if the C structure of any of those several types ever changes?

Since I optimize interpreter instructions, any change that affects
their implementation requires changing of the optimized instructions,
too. Having the code generator ready for such things would certainly
be a good idea (probably also for generating the default interpreter
dispatch loop), since you could also add your own profile for your
application/domain to re-use the remaining 30+ instruction opcodes.
The direct answer is that I would need to re-generate the driver file,
which is basically a gdb dump plus an Emacs macro (please note that I
have not needed to do that since working with ~3.0b1). I will add a list
of the types I use for specializing to the patch section on the
additional-resources page of my homepage (including a fixed patch for
what Georg brought to my attention.)

--stefan


Re: [Python-Dev] Python 3 optimizations, continued, continued again...

2012-02-01 Thread stefan brunthaler
 But let me put this straight: as an open-source project, we are hesitant to
 accept changes which depend on closed software. Even if your optimization
 techniques would result in performance a hundred times better than what is
 currently achieved, we would still be wary to accept them.

 Please note that this is not because of lack of trust or better yet greed
 for your code. We need to make sure
 that under no circumstances our codebase is in danger because something
 important was left out along the way.

I am positive that the code generator does not depend on any closed-
source components; I just use mako for storing the C code templates
that I generate -- everything else I wrote myself.
Of course, I'll give the code generator to pydev, too, if necessary.
However, I need to strip it down, so that it does not do all the other
stuff that you don't need. I just wanted to give you the
implementation now, since Benjamin said that he wants to see real code
and results first. If you want to integrate the inca-optimization, I
am going to start working on this asap.


 Maintenance of generated code is yet another nuisance that should better be
 strongly justified.

I agree, but the nice thing is that the technique is very simple: only
if you changed a significant part of the interpreter implementation
would you need to change the optimized derivatives, too. If one generates
the default interpreter implementation, too, then one gets the
optimizations almost for free. For maintenance reasons I chose to use
a template-based system, too, since this gives you a direct
correspondence between the actual code and what's generated, without
interfering with the code generator at all.
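To make the template idea concrete, here is a minimal, hypothetical sketch. The actual generator uses mako; this standalone version uses the stdlib's string.Template, and the C template text (the TARGET/DISPATCH macros and guard names) is invented for illustration, not taken from the real code generator:

```python
from string import Template

# Hypothetical template for one optimized instruction derivative.
# Macro names and structure are illustrative only.
DERIVATIVE = Template("""\
TARGET(${name}) {
    PyObject *right = POP(), *left = TOP();
    if (!${guard}(left) || !${guard}(right))
        goto ${generic};          /* inline-cache miss: fall back */
    SET_TOP(${fast_op}(left, right));
    DISPATCH();
}
""")

def gen_derivative(name, guard, fast_op, generic):
    """Instantiate the template for one specialized instruction."""
    return DERIVATIVE.substitute(name=name, guard=guard,
                                 fast_op=fast_op, generic=generic)

print(gen_derivative("FLOAT_ADD", "PyFloat_CheckExact",
                     "float_add", "BINARY_ADD"))
```

The point of the template approach is exactly the correspondence mentioned above: each generated derivative reads like hand-written interpreter code.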

--stefan


Re: [Python-Dev] Python 3 optimizations, continued, continued again...

2012-02-01 Thread stefan brunthaler
On Wed, Feb 1, 2012 at 09:46, Guido van Rossum gu...@python.org wrote:
 Let's make one thing clear. The Python core developers need to be able
 to reproduce your results from scratch, and that means access to the
 templates, code generators, inputs, and everything else you used. (Of
 course for stuff you didn't write that's already open source, all we
 need is a pointer to the open source project and the exact
 version/configuration you used, plus any local mods you made.)

 I understand that you're hesitant to just dump your current mess, and
 you want to clean it up before you show it to us. That's fine. But
 until you're ready to show it, we're not going to integrate any of
 your work into CPython, even though some of us (maybe Benjamin) may be
 interested in kicking its tires. And remember, it doesn't need to be
 perfect (in fact perfectionism is probably a bad idea here). But it
 does need to be open source. Every single bit of it. (And no GPL,
 please.)

I understand all of these issues. Currently, it's not really a mess,
but it is much more complicated than it needs to be for supporting only the
inca optimization. I don't know what the time frame for a possible
integration is (my guess is that it'd be safe anyway to disable it,
like the threaded-code support was handled.)
As for the license: I really don't care about that at all, the only
thing nice to have would be to have a pointer to my home page and/or
the corresponding research, but that's about all on my wish list.

--stefan


Re: [Python-Dev] Python 3 optimizations, continued, continued again...

2012-01-31 Thread stefan brunthaler
 I assume yes here means yes, I'm aware and not yes, I'm using Python
 2, right? And you're building on top of the existing support for threaded
 code in order to improve it?

Your assumption is correct; I'm sorry for the sloppiness (I was
heading out for lunch.) None of the code is 2.x compatible; all of my
work has always targeted Python 3.x. My work does not improve threaded
code (as in the interpreter dispatch technique), but enables efficient and
purely interpretative inline caching via quickening. (So, after
execution of BINARY_ADD, I rewrite that specific occurrence of the
bytecode instruction to, say, a FLOAT_ADD instruction and verify that
my assumption is correct in the FLOAT_ADD instruction.)
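The rewriting step can be sketched in a few lines of Python (a toy model, not the actual C implementation; the opcode values, dispatch loop, and guard logic are illustrative):

```python
# Toy model of quickening-based inline caching: after a generic
# BINARY_ADD observes two floats, the instruction is rewritten in
# place to a specialized FLOAT_ADD that guards its assumption.
BINARY_ADD, FLOAT_ADD = 0, 1

def run(code, stack):
    """Execute (opcode, oparg) pairs, rewriting instructions in place."""
    pc = 0
    while pc < len(code):
        op, arg = code[pc]
        b, a = stack.pop(), stack.pop()
        if op == BINARY_ADD:
            if isinstance(a, float) and isinstance(b, float):
                code[pc] = (FLOAT_ADD, arg)      # quicken: rewrite in place
            stack.append(a + b)
        elif op == FLOAT_ADD:
            # Inline-cache miss guard: verify the optimistic assumption.
            if not (isinstance(a, float) and isinstance(b, float)):
                code[pc] = (BINARY_ADD, arg)     # de-optimize
            stack.append(a + b)
        pc += 1
    return stack
```

The guard in FLOAT_ADD is the inline-cache miss check: if the optimistic type assumption fails, the instruction falls back to the generic behavior.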

Thanks,
--stefan


Re: [Python-Dev] Python 3 optimizations, continued, continued again...

2012-01-31 Thread stefan brunthaler
 If I read the patch correctly, most of it is auto-generated (and there
 is probably a few spurious changes that blow it up, such as the
 python-gdb.py file).

Hm, honestly, I don't know where the python-gdb.py file comes from; I
thought it came with the switch from 3.1 to the tip version I was
using. Anyway, I did not touch it, or at least have no recollection of
doing so. Regarding the spurious changes: this might very well be. The
regression tests work, and it would actually be fairly easy to
detect crashes (e.g., by tracing all executed bytecode
instructions and seeing if all of them are actually executed). I could
easily do that if wanted/necessary.


 But the tool that actually generates the code
 doesn't seem to be included?  (Which means that in this form, the
 patch couldn't possibly be accepted.)

Well, the tool is not included because it does a lot more (e.g.,
generates the code for elimination of reference-count operations.)
Unfortunately, the interpreter architecture that achieves the highest
speedups is more complicated, and I got the feeling that this was not
going over well with python-dev. So, I had the idea of using just
one (but a major) optimization technique and going with that. I
don't see why you would need my code generator, though. Not that I
cared, but I would need to strip it down, remove many parts of it, and
also make it more accessible to other people. However, if python-dev
decides that it wants to include the optimizations and requires the
code generator, I'll happily chip in the extra work and give you the
corresponding code generator, too.

Thanks,
--stefan


Re: [Python-Dev] Python 3 optimizations, continued, continued again...

2012-01-31 Thread stefan brunthaler
 There is also the issue of the two test modules removed from the
 test suite.

Oh, I'm sorry, seems like the patch did contain too much of my
development stuff. (I did remove them before, because they were always
failing due to the instruction opcodes being changed because of
quickening; they pass the tests, though.)

 Well, nobody wants to review generated code.

I agree. The code generator basically uses templates that contain the
information, plus a dump of the C structures of several types, which it
traverses to see which of them implements which functions. There is really
no magic there; the most complex part is getting the inline-cache
miss checks for function calls right. But I tried to make the
generated code look pretty, so that working with it is not too much of
a hassle. The code generator itself is a little more complicated,
so I am not sure it would help a lot...

best,
--stefan


Re: [Python-Dev] Python 3 optimizations, continued, continued again...

2012-01-30 Thread stefan brunthaler
Hello,

 Could you try benchmarking with the standard benchmarks:
 http://hg.python.org/benchmarks/
 and see what sort of performance gains you get?

Yeah, of course. I already did; refer to the page listed below for
details. I have not looked into the results yet, though.


 How portable is the threaded interpreter?

Well, you can implement threaded code on any machine that supports
indirect branch instructions. Fortunately, GCC supports the
labels-as-values feature, which makes it available on any machine
that supports GCC. My optimizations themselves are portable, and I
tested them on a PowerPC for my thesis, too. (AFAIR, LLVM supports
this feature, too.)
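For readers unfamiliar with the technique: with labels-as-values, every instruction handler ends in an indirect `goto *dispatch_table[*pc++]` instead of jumping back to a central switch. A toy Python analogue of that pre-resolved dispatch (purely illustrative, not how CPython implements it) looks like this:

```python
# Toy model of threaded-code dispatch: each opcode is pre-resolved to a
# handler function ("threading" the code), so the hot loop performs one
# indirect call per instruction instead of decoding through a switch.
# In C this corresponds to GCC's computed gotos:
#     goto *dispatch_table[*pc++];

def op_push(stack, arg):
    stack.append(arg)

def op_add(stack, arg):
    b, a = stack.pop(), stack.pop()
    stack.append(a + b)

HANDLERS = {"PUSH": op_push, "ADD": op_add}

def thread(code):
    """Replace symbolic opcodes by direct handler references."""
    return [(HANDLERS[op], arg) for op, arg in code]

def run(threaded):
    stack = []
    for handler, arg in threaded:
        handler(stack, arg)     # indirect call; no central decode step
    return stack
```

The win comes from giving the CPU one indirect branch per instruction, which modern branch predictors handle far better than a single shared dispatch branch.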


 Do you have a public repository for the code, so we can take a look?

I have created a patch (as Benjamin wanted) and put all of the
resources (i.e., benchmark results and the patch itself) on my home
page:
http://www.ics.uci.edu/~sbruntha/pydev.html


Regards,
--stefan


Re: [Python-Dev] Python 3 optimizations, continued, continued again...

2012-01-30 Thread stefan brunthaler
 Well, you're aware that Python already uses threaded code where
 available? Or are you testing against Python 2?

Yes, and I am building on that.

--stefan


Re: [Python-Dev] Python 3 optimizations, continued, continued again...

2012-01-27 Thread stefan brunthaler
Hi,

On Tue, Nov 8, 2011 at 10:36, Benjamin Peterson benja...@python.org wrote:
 2011/11/8 stefan brunthaler s.bruntha...@uci.edu:
 How does that sound?

 I think I can hear real patches and benchmarks most clearly.

I spent the better part of my -20% time on implementing the work as
suggested. Please find the benchmarks attached to this email; I just
ran them on my system (i7-920, Linux 3.0.0-15, GCC 4.6.1). I branched
off the regular 3.3a0 default tip, changeset 73977, shortly after your
email. I do not have an official patch yet, but am going to create one
if wanted. Changes to the existing interpreter are minimal; the
biggest chunk is a new interpreter dispatch loop.

Merging dispatch loops eliminates some of my optimizations, but my
inline caching technique enables inlining some functionality, which
results in visible speedups. The results are normalized to the
non-threaded-code version of the CPython interpreter (named
"vanilla"), so that I can compare them with my preceding results. I
anticipate *no* compatibility issues, and the interpreter requires less
than 100 KiB of extra memory at run time. Since my interpreter is
using 215 of a maximum of 255 instructions, there is room for adding
additional derivatives, e.g., for popular Python libraries, too.


Let me know what python-dev thinks of this and have a nice weekend,
--stefan

PS: AFAIR the version without partial stack frame caching also passes
all regression tests modulo the ones that test against specific
bytecodes.
currently processing:  bench/binarytrees.py3.py
phd-cpy-3a0-thr-cod-pytho       arg: 10 | time: 0.161876 | stdev: 0.007780 | var: 0.61     | mem:  6633.60
phd-cpy-3a0-thr-cod-pytho       arg: 12 | time: 0.699243 | stdev: 0.019112 | var: 0.000365 | mem:  8142.67
phd-cpy-3a0-thr-cod-pytho       arg: 14 | time: 3.388344 | stdev: 0.048042 | var: 0.002308 | mem: 13586.93
phd-cpy-pio-sne-pre-pyt-no-psf  arg: 10 | time: 0.153875 | stdev: 0.003828 | var: 0.15     | mem:  6873.73
phd-cpy-pio-sne-pre-pyt-no-psf  arg: 12 | time: 0.632572 | stdev: 0.019121 | var: 0.000366 | mem:  8246.27
phd-cpy-pio-sne-pre-pyt-no-psf  arg: 14 | time: 3.020988 | stdev: 0.043483 | var: 0.001891 | mem: 13640.27
phd-cpy-pio-sne-pre-pytho       arg: 10 | time: 0.150942 | stdev: 0.005157 | var: 0.27     | mem:  6901.87
phd-cpy-pio-sne-pre-pytho       arg: 12 | time: 0.660841 | stdev: 0.020538 | var: 0.000422 | mem:  8286.80
phd-cpy-pio-sne-pre-pytho       arg: 14 | time: 3.184198 | stdev: 0.051103 | var: 0.002612 | mem: 13680.40
phd-cpy-3a0-van-pytho           arg: 10 | time: 0.202812 | stdev: 0.005480 | var: 0.30     | mem:  6633.33
phd-cpy-3a0-van-pytho           arg: 12 | time: 0.908456 | stdev: 0.015744 | var: 0.000248 | mem:  8153.07
phd-cpy-3a0-van-pytho           arg: 14 | time: 4.364805 | stdev: 0.037522 | var: 0.001408 | mem: 13593.60
### phd-cpy-3a0-thr-cod-pytho :  1.2887 (avg-sum:   1.416488)
### phd-cpy-pio-sne-pre-pyt-no-psf:  1.4383 (avg-sum:   1.269145)
### phd-cpy-pio-sne-pre-pytho :  1.3704 (avg-sum:   1.331994)
### phd-cpy-3a0-van-pytho :  1. (avg-sum:   1.825358)
currently processing:  bench/fannkuch.py3.py
phd-cpy-3a0-thr-cod-pytho       arg:  8 | time: 0.172677 | stdev: 0.006620 | var: 0.44     | mem:  6424.13
phd-cpy-3a0-thr-cod-pytho       arg:  9 | time: 1.426755 | stdev: 0.035545 | var: 0.001263 | mem:  6425.20
phd-cpy-pio-sne-pre-pyt-no-psf  arg:  8 | time: 0.168010 | stdev: 0.010277 | var: 0.000106 | mem:  6481.07
phd-cpy-pio-sne-pre-pyt-no-psf  arg:  9 | time: 1.345817 | stdev: 0.033127 | var: 0.001097 | mem:  6479.60
phd-cpy-pio-sne-pre-pytho       arg:  8 | time: 0.165876 | stdev: 0.007136 | var: 0.51     | mem:  6520.00
phd-cpy-pio-sne-pre-pytho       arg:  9 | time: 1.351150 | stdev: 0.028822 | var: 0.000831 | mem:  6519.73
phd-cpy-3a0-van-pytho           arg:  8 | time: 0.216146 | stdev: 0.012879 | var: 0.000166 | mem:  6419.07
phd-cpy-3a0-van-pytho           arg:  9 | time: 1.834247 | stdev: 0.028224 | var: 0.000797 | mem:  6418.67
### phd-cpy-3a0-thr-cod-pytho :  1.2820 (avg-sum:   0.799716)
### phd-cpy-pio-sne-pre-pyt-no-psf:  1.3544 (avg-sum:   0.756913)
### phd-cpy-pio-sne-pre-pytho :  1.3516 (avg-sum:   0.758513)
### phd-cpy-3a0-van-pytho :  1. (avg-sum:   1.025197)
currently processing:  bench/fasta.py3.py
phd-cpy-3a0-thr-cod-pytho       arg:  5 | time: 0.374023 | stdev: 0.010870 | var: 0.000118 | mem:  6495.07
phd-cpy-3a0-thr-cod-pytho       arg: 10 | time: 0.714577 | stdev: 0.024713 | var: 0.000611 | mem:  6495.47
phd-cpy-3a0-thr-cod-pytho       arg: 15 | time: 1.062866 | stdev: 0.040138 | var: 0.001611 | mem:  6496.27
phd-cpy-pio-sne-pre-pyt-no-psf  arg:  5 | time: 0.345621 | stdev: 0.022549 | var: 0.000508 | mem:  6551.87
phd-cpy-pio-sne-pre-pyt-no-psf arg

[Python-Dev] Python 3 optimizations, continued, continued again...

2011-11-08 Thread stefan brunthaler
Hi guys,

while there is at least some interest in incorporating my
optimizations, response has still been low. I figure that the changes
are probably too much for a single big incorporation step. On a recent
flight, I thought about cutting it down to make it more easily
digestible. The basic idea is to remove the optimized interpreter
dispatch loop and advanced instruction format and use the existing
ones. Currently (rev. ca8a0dfb2176), opcode.h uses 109 of potentially
available 255 instructions using the current instruction format.
Hence, up to 149 instruction opcodes could be given to optimized
instruction derivatives. Consequently, a possible change would require
to change:
  a) opcode.h to add new instruction opcodes,
  b) ceval.c to include the new instruction opcodes in PyEval_EvalFrameEx,
  c) abstract.c, object.c (possible other files) to add the
quickening/rewriting function calls.

If this is more interesting, I could start evaluating which
instruction opcodes should be allocated to which derivatives to get
the biggest benefit. This is a lot easier to implement (because I can
re-use the existing instruction implementations) and can easily be
made conditionally compilable, similar to the computed-gotos
option. Since the changes are minimal, it is also simpler for everybody
else to understand and deal with, too. On the downside, however, not
all optimizations are possible and/or make sense within the given limit of
instructions (no data-object inlining and no reference-count
elimination.)
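Step a) amounts to handing the unused opcode numbers to derivatives. A hypothetical helper for that bookkeeping (the opcode values below are invented; the real assignments live in CPython's opcode.h) might look like:

```python
# Hypothetical bookkeeping: assign optimized instruction derivatives
# to opcode numbers not used by the existing instruction set.
EXISTING = {"BINARY_ADD": 23, "CALL_FUNCTION": 131}   # ... 109 in total

def allocate_derivatives(existing, derivatives, max_opcode=255):
    """Assign each derivative the next free opcode number."""
    used = set(existing.values())
    free = (n for n in range(max_opcode + 1) if n not in used)
    table = dict(existing)
    for name in derivatives:
        table[name] = next(free)   # StopIteration once the space is full
    return table
```

Evaluating which derivatives earn a slot is then a matter of profiling which specializations fire most often and ranking them before filling the table.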

How does that sound?

Have a nice day,
--stefan


Re: [Python-Dev] Python 3 optimizations continued...

2011-09-02 Thread stefan brunthaler
 as promised, I created a publicly available preview of an
 implementation with my optimizations, which is available under the
 following location:
 https://bitbucket.org/py3_pio/preview/wiki/Home

One very important thing that I forgot to indicate is that you have
to use computed gotos (i.e., configure --with-computed-gotos);
otherwise it won't work. (Though I think that most people can figure
this out easily, knowing this a priori isn't too bad.)

Regards,
--stefan


Re: [Python-Dev] Python 3 optimizations continued...

2011-09-02 Thread stefan brunthaler
 1) The SFC optimisation is purely based on static code analysis, right? I
 assume it takes loops into account (and just multiplies scores for inner
 loops)? Is that what you mean with nesting level? Obviously, static
 analysis can sometimes be misleading, e.g. when there's a rare special case
 with lots of loops that needs to adapt input data in some way, but in
 general, I'd expect that this heuristic would tend to hit the important
 cases, especially for well structured code with short functions.

Yes, currently I only use the heuristic to statically estimate the utility
of assigning an optimized slot to a local variable. And, another yes:
nested blocks (like for-statements) are what I have in mind when using
nesting level. I was told that the algorithm itself is very similar
to linear-scan register allocation, modulo the ability to spill
values, of course.
From my benchmarks and in-depth analysis of several programs, I found
this to work very well. In fact, the only problematic situation I found is
(unfortunately) in one of the top-most executed functions in US'
bm_django.py: there is one loop that gets almost never executed, but
this loop gives precedence to the local variables used inside it. Because of
this, I already have an idea for a better approach: first, use the
static heuristic to compute stack-slot scores, then count back-branches
(I would need this anyway, as _Py_CheckInterval is gone and
OSR/hot-swapping is in general a good idea) and record their
frequency. Next, just replace the current static weight of 100 by the
dynamically recorded weight. Consequently, you should get better
allocations. (Please note that I did some quantitative analysis of
Python functions to determine that using 4 SFC slots covers a
substantial number of functions [IIRC 95%] in the trivial scenario
where there are at most 4 local variables.)
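A minimal sketch of such a static heuristic, assuming the weight of 100 per nesting level mentioned above (everything else, including the input format and function names, is invented for illustration):

```python
# Hypothetical SFC slot allocation: score each local variable by its
# occurrences, weighted by 100 per loop-nesting level, and give the
# top `nslots` locals the optimized stack-frame-cache slots.

def score_locals(uses):
    """uses: iterable of (varname, nesting_level), one per occurrence."""
    scores = {}
    for name, level in uses:
        scores[name] = scores.get(name, 0) + 100 ** level
    return scores

def assign_sfc_slots(uses, nslots=4):
    scores = score_locals(uses)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:nslots]
```

The bm_django pathology above is visible in this model: one occurrence at nesting level 2 scores 10000 and outranks a level-0 variable used fifty times, which is exactly what replacing the static weight with recorded back-branch frequencies would fix.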


 2) The RC elimination is tricky to get right and thus somewhat dangerous,
 but sounds worthwhile and should work particularly well on a stack based
 byte code interpreter like CPython.

Well, it was very tricky to get right when I implemented it first
around Christmas 2009. The current implementation is reasonably simple
to understand, however, it depends on the function refcount_effect to
give me correct information at all times. I got the biggest
performance improvement on one benchmark on the PowerPC and think that
RISC architectures in general benefit more from this optimization
(eliminate the load, add and store instructions) than x86 CISCs do (an
INCREF is just an add on the memory location without data
dependencies, so fairly cheap). In any case, however, you get the
replication effect of improving CPU branch prediction by having these
additional instruction derivatives. It would be interesting
(research-wise, too) to be able to measure whether the reduction in
memory operations makes Python programs use less energy, and if so,
how much the difference is.


 3) Inline caching also sounds worthwhile, although I wonder how large the
 savings will be here. You'd save a couple of indirect jumps at the C-API
 level, sure, but apart from that, my guess is that it would highly depend on
 the type of instruction. Certain (repeated) calls to C implemented functions
 would likely benefit quite a bit, for example, which would be a nice
 optimisation by itself, e.g. for builtins. I would expect that the same
 applies to iterators, even a couple of percent faster iteration can make a
 great deal of a difference, and a substantial set of iterators are
 implemented in C, e.g. itertools, range, zip and friends.

 I'm not so sure about arithmetic operations. In Cython, we (currently?) do
 not optimistically replace these with more specific code (unless we know the
 types at compile time), because it complicates the generated C code and
 indirect jumps aren't all that slow that the benefit would be important.
 Savings are *much* higher when data can be unboxed, so much that the slight
 improvement for optimistic type guesses is totally dwarfed in Cython. I
 would expect that the return of investment is better when the types are
 actually known at runtime, as in your case.

Well, in my thesis I already hint at another improvement of the
existing design that can work on unboxed data as well (while still
being an interpreter.) I am eager to try this, but don't know how much
time I can spend on this (because there are several other research
projects I am actively involved in.) In my experience, this works very
well and you cannot actually report good speedups without
inline-caching arithmetic operations, simply because that's where all
JITs shine and most benchmarks don't reflect real world scenarios but
mathematics-inclined microbenchmarks. Also, if in the future compilers
(gcc and clang) will be able to inline the invoked functions, higher
speedups will be possible.



 4) Regarding inlined object references, I would expect that it's much more
 worthwhile to speed up LOAD_GLOBAL and LOAD_NAME than LOAD_CONST. 

Re: [Python-Dev] Python 3 optimizations continued...

2011-09-01 Thread stefan brunthaler
Hi,

as promised, I created a publicly available preview of an
implementation with my optimizations, which is available under the
following location:
https://bitbucket.org/py3_pio/preview/wiki/Home

I followed Nick's advice and added some valuable advice and an
overview/introduction at the wiki page the link points to. I am
positive that spending 10 minutes reading it will provide you with
valuable information regarding what's happening.
In addition, as Guido already mentioned, this is more or less a direct
copy of my research-branch without some of my private comments and
*no* additional refactorings because of software-engineering issues
(which I am very much aware of.)

I hope this clarifies a *lot* and makes it easier to see what parts
are involved and how all the pieces fit together.

I hope you'll like it,
have fun,
--stefan


Re: [Python-Dev] Python 3 optimizations continued...

2011-08-31 Thread stefan brunthaler
 I think that you must deal with big endianess because some RISC can't handle
 at all data in little endian format.

 In WPython I have wrote some macros which handle both endianess, but lacking
 big endian machines I never had the opportunity to verify if something was
 wrong.

I am sorry for the temporal lapse of not getting back to this directly
yesterday; we were just heading out for lunch, and I only figured it
out then but immediately forgot it on our way back to the lab...

So, as I have already said, I evaluated my optimizations on x86
(little-endian) and PowerPC 970 (big-endian) and I did not have to
change any of my instruction decoding during interpretation. (The only
nasty bug I still remember vividly was that while on gcc for x86 the
data type char defaults to signed, whereas it defaults to unsigned on
PowerPC's gcc.) When I have time and access to a PowerPC machine again
(an ARM might be interesting, too), I will take a look at the
generated assembly code to figure out why this is working. (I have
some ideas why it might work without changing the code.)
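For reference, the signed/unsigned-char pitfall mentioned above can be illustrated portably with Python's struct module (this is my own stand-in illustration, not code from the patch):

```python
import struct

# A raw byte >= 0x80 reads differently through signed vs. unsigned char:
# plain C 'char' is signed under gcc on x86 but unsigned on PowerPC/ARM.
raw = b"\xff"
(as_signed,) = struct.unpack("b", raw)    # like 'signed char': -1
(as_unsigned,) = struct.unpack("B", raw)  # like 'unsigned char': 255

assert as_signed == -1
assert as_unsigned == 255
```

Spelling out `signed char` or `unsigned char` explicitly in the decoder sidesteps the platform default.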

If I run into any problems, I'll gladly contact you :)

BTW: AFAIR, we emailed last year regarding wpython and IIRC your
optimizations could primarily be summarized as clever
superinstructions. I have not implemented anything in that area at all
(and have in fact not even touched the compiler and its peephole
optimizer), but if parts of my implementation get in, I am sure that
you could add some of your work on top of that, too.

Cheers,
--stefan


Re: [Python-Dev] Python 3 optimizations continued...

2011-08-31 Thread stefan brunthaler
 So, basically, you built a JIT compiler but don't want to call it that,
 right? Just because it compiles byte code to other byte code rather than to
 native CPU instructions does not mean it doesn't compile Just In Time.

For me, a definition of a JIT compiler or any dynamic compilation
subsystem entails that native machine code is generated at run-time.
Furthermore, I am not compiling from bytecode to bytecode, but rather
changing the instruction encoding underneath and subsequently using
quickening to optimize interpretation. But, OTOH, I am not aware of a
canonical definition of JIT compilation, so it depends ;)


 I agree with the others that it's best to open up your repository for
 everyone who is interested. I can see no reason why you would want to close
 it back down once it's there.

Well, my code has primarily been a vehicle for my research in that
area and thus is not immediately suited to adoption (it does not
adhere to Python C coding standards, and contains lots of private
comments about various facts, debugging hints, etc.). The explanation
for this is easy: when I started out on my research, it was far from
clear that it would be successful and really that much faster. So, I
would like to clean up the comments and some parts of the code, and
publish what I have without doing any of the clean-up work for naming
conventions, etc., so that you can all take a look and it is clear
what it's all about. After that we can have a factual discussion about
whether it fits the bill for you, too, and if so, which changes
(naming conventions, extensive documentation, etc.) are necessary
*before* any adoption is reasonable for you.

That seems to be a good way to start off and get results and feedback
quickly, any ideas/complaints/comments/suggestions?

Best regards,
--stefan

PS: I am using Nick's suggested plan to incorporate my changes
directly to the most recent version, as mine is currently only running
on Python 3.1.


Re: [Python-Dev] Python 3 optimizations continued...

2011-08-30 Thread stefan brunthaler
 Changing the bytecode width wouldn't make the interpreter more complex.

 No, but I think Stefan is proposing to add a *second* byte code format,
 in addition to the one that remains there. That would certainly be an
 increase in complexity.

Yes, indeed I have a more straightforward instruction format to allow
for more efficient decoding. Just going from bytecode size to
word-code size without changing the instruction format would require
8 (or word-size) times more memory on a 64bit system. From an
optimization perspective, the irregular instruction format was the
biggest problem, because checking for HAS_ARG is always on the fast
path and mostly unpredictable. Hence, I chose to extend the
instruction format to word size and use the additional space so that
the upper half holds the argument and the lower half the actual
opcode. The encoding is more regular and more efficient, and *not*
more complex. Using profiling to indicate what code is hot, I don't
waste too much memory on encoding this regular instruction format.
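A minimal sketch of the word-sized encoding described above (the names and exact layout here are my own illustration; the patch's actual layout may differ):

```python
OPCODE_MASK = 0xFFFFFFFF  # lower 32-bit half holds the opcode
OPARG_SHIFT = 32          # upper 32-bit half holds the argument

def encode(opcode, oparg=0):
    # Pack one instruction into a single 64-bit word, RISC-style.
    return (oparg << OPARG_SHIFT) | opcode

def decode(word):
    # Regular decoding: no HAS_ARG check needed on the fast path.
    return word & OPCODE_MASK, word >> OPARG_SHIFT

word = encode(100, 7)
assert decode(word) == (100, 7)
assert decode(encode(23)) == (23, 0)  # argument-less opcodes decode uniformly
```

The point is that every instruction decodes the same way, so the dispatch loop never branches on whether an argument is present.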


 For example, I still plan to write a JIT for Python at some point. This
 may happen in two months, or in two years. I wouldn't try to stop
 anybody from contributing improvements that may become obsolete with the
 JIT.

I would not necessarily argue that at least my optimizations would
become obsolete; if you still plan to write a JIT, it might make
sense to re-use what I've got rather than start from scratch. For
example, building a simple JIT compiler that just inlines the
operation implementations as templates to eliminate the interpretive
overhead (in a similar vein to Piumarta and Riccardi's paper from
1998) might be a good start. Though I don't want to pre-influence
your JIT design, I'm just thinking out loud...

Regards,
--stefan


Re: [Python-Dev] Python 3 optimizations continued...

2011-08-30 Thread stefan brunthaler
On Tue, Aug 30, 2011 at 09:42, Guido van Rossum gu...@python.org wrote:
 Stefan, have you shared a pointer to your code yet? Is it open source?

I have no shared code repository, but could create one (is there any
pydev preferred provider?). I have all the copyrights on the code, and
I would like to open-source it.

 It sounds like people are definitely interested and it would make
 sense to let them experiment with your code and review it.

That sounds fine. I need to do some clean-up work (the code contains
most of my comments to remind me of issues) and it currently does not
pass all regression tests. But if people want to take a look first to
decide whether they want it, then that's good enough for me. (I just
wanted to know if there is substantial interest, so that it
eventually pays off to find and fix the remaining bugs.)

--stefan


Re: [Python-Dev] Python 3 optimizations continued...

2011-08-30 Thread stefan brunthaler
 Do you really need it to match a machine word? Or is, say, a 16-bit
 format sufficient.

Hm, technically no, but practically it makes more sense, as (at least
for x86 architectures) having opargs and opcodes in half-words can be
efficiently expressed in assembly. On 64bit architectures, I could
also inline data object references that fit into the 32bit upper half.
It turns out that most constant objects fit nicely into this, and I
have used this for a special cache region (again below 2^32) for
global objects, too. So, technically it's not necessary, but
practically it makes a lot of sense. (Most of these things work on
32bit systems, too. For architectures with a smaller size, we can
adapt or disable the optimizations.)

Cheers,
--stefan


Re: [Python-Dev] Python 3 optimizations continued...

2011-08-30 Thread stefan brunthaler
 Do I sense that the bytecode format is no longer platform-independent?
 That will need a bit of discussion. I bet there are some things around
 that depend on that.

Hm, I haven't really thought about that in detail. I ran it on
PowerPC 970 and Intel Atom & i7 without problems (the latter ones are
a non-issue) and think that it can be portable. I just stuff argument
and opcode into one word for regular instruction decoding, like a
RISC CPU, and I realize there might be little/big-endian issues, but
they surely can be handled with conditional compilation...

--stefan


Re: [Python-Dev] Python 3 optimizations continued...

2011-08-30 Thread stefan brunthaler
 Um, I'm sorry, but that reply sounds incredibly naive, like you're not
 really sure what the on-disk format for .pyc files is or why it would
 matter. You're not even answering the question, except indirectly --
 it seems that you've never even thought about the possibility of
 generating a .pyc file on one platform and copying it to a computer
 using a different one.

Well, it may sound incredibly naive, but the truth is: I am never
storing the optimized representation to disk, it's done purely at
runtime when profiling tells me it makes sense to make the switch.
Thus I circumvent many of the problems you outlined. I am sure that a
full-fledged change of the representation would have many more
intricacies, but my approach is only tangentially related...

--stefan


Re: [Python-Dev] Python 3 optimizations continued...

2011-08-30 Thread stefan brunthaler
 Ok, there there's something else you haven't told us. Are you saying
 that the original (old) bytecode is still used (and hence written to
 and read from .pyc files)?

Short answer: yes.
Long answer: I added an invocation counter to the code object and keep
interpreting in the usual Python interpreter until this counter
reaches a configurable threshold. When it reaches this threshold, I
create the new instruction format and interpret with this optimized
representation. All the macros look exactly the same in the source
code, they are just redefined to use the different instruction format.
I am at no point serializing this representation or the runtime
information gathered by me, as any subsequent invocation might have
different characteristics.
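The counter-based switch described above might look roughly like this (a sketch with invented names and a tiny threshold; the real threshold is configurable and the real switch swaps dispatch macros in C):

```python
THRESHOLD = 3  # stand-in for the configurable invocation threshold

class Code:
    """Mimics a code object carrying an invocation counter."""
    def __init__(self, bytecode):
        self.bytecode = bytecode  # (opcode, oparg) pairs
        self.invocations = 0
        self.wordcode = None      # optimized representation, built lazily

def quicken(bytecode):
    # Build the word-sized instruction format at runtime; it is never
    # serialized to .pyc files.
    return [(oparg << 32) | opcode for opcode, oparg in bytecode]

def invoke(code):
    code.invocations += 1
    if code.wordcode is None and code.invocations >= THRESHOLD:
        code.wordcode = quicken(code.bytecode)  # switch representations here
    return "optimized" if code.wordcode else "regular"

c = Code([(100, 1), (23, 0)])
assert [invoke(c) for _ in range(4)] == ["regular", "regular", "optimized", "optimized"]
```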

I will remove my development commentaries and create a private
repository at bitbucket for you* to take an early look like Georg (and
more or less Terry, too) suggested. Is that a good way for most of
you? (I would then give access to whomever wants to take a look.)

Best,
--stefan

*: not personally targeted at Guido (who is naturally very much
welcome to take a look, too) but addressed to python-dev in general.


Re: [Python-Dev] Python 3 optimizations continued...

2011-08-30 Thread stefan brunthaler
On Tue, Aug 30, 2011 at 13:42, Benjamin Peterson benja...@python.org wrote:
 2011/8/30 stefan brunthaler ste...@brunthaler.net:
 I will remove my development commentaries and create a private
 repository at bitbucket for you* to take an early look like Georg (and
 more or less Terry, too) suggested. Is that a good way for most of
 you? (I would then give access to whomever wants to take a look.)

 And what is wrong with a public one?

Well, since it does not fully pass all regression tests and is just
meant for people to take a first look to find out if it's interesting,
I think I might take it offline after you have had a look. It seems to
me that that is easier to do with a private repository, but in
general, I don't have a problem with a public one...

Regards,
--stefan

PS: If you want to, I can also just put a tarball on my home page and
post a link here. It's not that I would like to have control/influence
about who is allowed to look and who doesn't.


[Python-Dev] Python 3 optimizations continued...

2011-08-29 Thread stefan brunthaler
Hi,

pretty much a year ago I wrote about the optimizations I did for my
PhD thesis that target the Python 3 series interpreters. While I got
some replies, the discussion never really picked up and no final
explicit conclusion was reached. AFAICT, because of the following two
factors, my optimizations were not that interesting for inclusion with
the distribution at that time:
a) Unladen Swallow was targeting Python 3, too.
b) My prototype did not pass the regression tests.

As of November 2010 (IIRC), Google is no longer supporting work on US,
and the project is stalled. (If I am wrong and there is still
activity and any plans with the corresponding PEP, please let me
know.) This is why I recently spent some time fixing issues so that I
can run the regression tests. There is still some work to be done, but
by and large it should be possible to complete all regression tests in
reasonable time. (With the actual infrastructure in place, enabling
optimizations later on is not a problem at all, either.)

So, the two big issues aside, is there any interest in incorporating
these optimizations in Python 3?

Have a nice day,
--stefan

PS: It probably is unusual, but in a part of my home page I have
created a link to indicate interest (makes both counting and voting
easier, http://www.ics.uci.edu/~sbruntha/) There were also links
indicating interest in funding the work; I have disabled these, so as
not to upset anybody or make the impression of begging for money...


Re: [Python-Dev] Python 3 optimizations continued...

2011-08-29 Thread stefan brunthaler
 Perhaps there would be something to say given patches/overviews/specifics.

Currently I don't have patches, but for an overview and specifics, I
can provide the following:
* My optimizations basically rely on quickening to incorporate
run-time information.
* I use two separate instruction dispatch routines, and use profiling
to switch from the regular Python 3 dispatch routine to an optimized
one (the implementation is actually vice versa, but that is not
important now)
* The optimized dispatch routine has a changed instruction format
(word-sized instead of bytecodes) that allows for regular instruction
decoding (without the HAS_ARG check) and inlining of some objects in
the instruction format on 64bit architectures.
* I use inline-caching based on quickening (passes almost all
regression tests [302 out of 307]), eliminate reference count
operations using quickening (passes but has a memory leak), promote
frequently accessed local variables to their dedicated instructions
(passes), and cache LOAD_GLOBAL/LOAD_NAME objects in the instruction
encoding when possible (I am working on this right now.)

The changes I made can be summarized as:
* I changed some header files to accommodate additional information
(Python.h, ceval.h, code.h, frameobject.h, opcode.h, tupleobject.h)
* I changed mostly abstract.c to incorporate runtime-type feedback.
* All other changes target mostly ceval.c and all supplementary code
is in a sub-directory named opt and all generated files in a
sub-directory within that (opt/gen).
* I have a code generator in place that takes care of generating all
the functions; it uses the Mako template system for creating C code
and does not necessarily need to be shipped with the interpreter
(though one can play around and experiment with it.)

So, all in all, the changes are not that big to the actual
implementation, and most of the code is generated (using sloccount,
opt has 1990 lines of C, and opt/gen has 8649 lines of C).

That's a quick summary, if there are any further or more in-depth
questions, let me know.

best,
--stefan


Re: [Python-Dev] Python 3 optimizations continued...

2011-08-29 Thread stefan brunthaler
 The question really is whether this is an all-or-nothing deal. If you
 could identify smaller parts that can be applied independently, interest
 would be higher.

Well, it's not an all-or-nothing deal. In my current architecture, I
can selectively enable most of the optimizations as I see fit. The
only pre-requisite (in my implementation) is that I have two dispatch
loops with a changed instruction format. It is, however, not a
technical necessity, just the way I implemented it. Basically, you can
choose whatever you like best, and I could extract that part. I am
just offering to add all the things that I have done :)


 Also, I'd be curious whether your techniques help or hinder a potential
 integration of a JIT generator.

This is something I have previously frequently discussed with several
JIT people. IMHO, having my optimizations in-place also helps a JIT
compiler, since it can re-use the information I gathered to generate
more aggressively optimized native machine code right away (the inline
caches can be generated with the type information right away, some
functions could be inlined with the guard statements subsumed, etc.)
Another benefit could be that the JIT compiler can spend longer time
on generating code, because the interpreter is already faster (so in
some cases it would probably not make sense to include a
non-optimizing fast and simple JIT compiler).
There are others on the list, who probably can/want to comment on this, too.

That aside, I think that while having a JIT is an important goal, I
can very well imagine scenarios where the additional memory
consumption (for the generated native machine code) of a JIT for each
process (I assume that the native machine code caches are not shared)
hinders scalability. I have in fact no data to back this up, but I
think that would be an interesting trade-off: say, a 30% gain in
performance without substantial additional memory requirements on
existing hardware, compared to higher achievable speedups that
require more machines.


Regards,
--stefan


Re: [Python-Dev] Python 3 optimizations continued...

2011-08-29 Thread stefan brunthaler
 Does it speed up Python? :-) Could you provide numbers (benchmarks)?

Yes, it does ;)

The maximum overall speedup I achieved was by a factor of 2.42 on my
i7-920 for the spectralnorm benchmark of the computer language
benchmark game.

Others from the same set are:
  binarytrees: 1.9257 (1.9891)
  fannkuch: 1.6509 (1.7264)
  fasta: 1.5446 (1.7161)
  mandelbrot: 2.0040 (2.1847)
  nbody: 1.6165 (1.7602)
  spectralnorm: 2.2538 (2.4176)
  ---
  overall: 1.8213 (1.9382)

(The first number is the combination of all optimizations, the one in
parentheses is with my last optimization [Interpreter Instruction
Scheduling] enabled, too.)

For a comparative real-world benchmark, I tested Martin von Loewis'
django port (there are not that many meaningful Python 3 real-world
benchmarks) and got a speedup of 1.3 (without IIS). This compares
reasonably well: US got a speedup of 1.35 on this benchmark. I just
checked that pypy-c-latest on 64 bit reports 1.5 (the
pypy-c-jit-latest figures seem to be either not working currently or
*really* fast...), but I cannot tell directly how that relates to
these speedups (the site just says less is better and I did not
quickly find an explanation).
Since I did this benchmark last year, I have spent more time
investigating it and found that I could do better, but I would have
to guess as to how much. (An interesting aside, though: on this
benchmark, the executable never grew beyond 5 MB of memory usage,
exactly like the vanilla Python 3 interpreter.)

hth,
--stefan


Re: [Python-Dev] Python 3 optimizations continued...

2011-08-29 Thread stefan brunthaler
 Personally, I *like* CPython fitting into the simple-and-portable
 niche in the Python interpreter space. Armin Rigo made the judgment
 years ago that CPython was a poor platform for serious optimisation
 when he stopped working on Psyco and started PyPy instead, and I think
 the contrasting fates of PyPy and Unladen Swallow have borne out that
 opinion. Significantly increasing the complexity of CPython for
 speed-ups that are dwarfed by those available through PyPy seems like
 a poor trade-off to me.

I agree with the trade-off, but the nice thing is that CPython's
interpreter remains simple and portable using my optimizations. All of
these optimizations are purely interpretative and the complexity of
CPython is not affected much. (For example, I have an inline-cached
version of BINARY_ADD called INCA_FLOAT_ADD [INCA being my
abbreviation for INline CAching]; you don't actually have to look at
its source code, since it is generated by my code generator, and by
looking at instruction traces you can immediately tell what's going
on.) So, the interpreter remains fully portable and any compatibility
issues with C modules should not occur either.


 At a bare minimum, I don't think any significant changes should be
 made under the it will be faster justification until the bulk of the
 real-world benchmark suite used for speed.pypy.org is available for
 Python 3. (Wasn't there a GSoC project about that?)

Having more tests would surely be helpful. As already said, the most
real-world thing I can do is Martin's django benchmark (some of the
other benchmarks are from the shootout and I can [and did] run them,
too: binarytrees, fannkuch, fasta, mandelbrot, nbody and
spectralnorm). I also have the AI benchmark from Unladen Swallow, but
no current figures.


Best,
--stefan


Re: [Python-Dev] Python 3 optimizations...

2010-07-23 Thread stefan brunthaler
Hi,

I guess it would be a good idea to quickly outline my inline caching
approach, so that we all have a basic understanding of how it works.
If we take for instance the BINARY_ADD instruction, the interpreter
evaluates the actual operand types and chooses the matching operation
implementation at runtime, i.e., operands that are unicode strings
will be concatenated via unicode_concatenate, for float operands on
the other hand, the interpreter would end up invoking float_add via
binary_op1. Now, a very efficient way to achieve purely interpretative
inline caching is to quicken the type-generic BINARY_ADD instruction
to a type-dependent FLOAT_ADD instruction (this technique, i.e.,
inline caching via quickening, is the primary contribution of my ECOOP
paper). Hence, I have a very simple code generator, that generates
type-dependent interpreter instructions in a pre-compile step of the
interpreter, and uses runtime type information to quicken/rewrite
instructions.
Aside of the operators, I have implemented this quickening technique
for FOR_ITER, COMPARE_OP and CALL_FUNCTION instructions.
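In Python terms, the rewrite-plus-guard cycle sketched above looks roughly like this (illustrative only; the real instructions are C code generated ahead of time, and the names here are mine):

```python
class Frame:
    def __init__(self, code):
        self.code = code  # mutable instruction list, rewritten in place

def BINARY_ADD(frame, ip, a, b):
    # Generic instruction: observe operand types, then quicken in place.
    if type(a) is float and type(b) is float:
        frame.code[ip] = FLOAT_ADD  # rewrite to the type-dependent derivative
    return a + b

def FLOAT_ADD(frame, ip, a, b):
    # Inline-cached derivative: guard the cached type assumption.
    if type(a) is not float or type(b) is not float:
        frame.code[ip] = BINARY_ADD  # assumption failed: fall back to generic
    return a + b

f = Frame([BINARY_ADD])
assert f.code[0](f, 0, 1.5, 2.5) == 4.0
assert f.code[0] is FLOAT_ADD       # the instruction was quickened
assert f.code[0](f, 0, "a", "b") == "ab"
assert f.code[0] is BINARY_ADD      # guard failed, instruction reverted
```

The cache lives entirely in the rewritten instruction stream, which is why no separate cache data structure is needed.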


 I'm absolutely interested, although not for the CPython project but for
 Cython. I wonder how you do inline caching in Python if the methods of a
 type can be replaced by whatever at runtime. Could you elaborate on that?

Currently, I only provide optimized derivatives for several separate
call targets, i.e., whether a call target is a C function with
varargs, or a Python function/method--this already eliminates a lot of
overhead from invoking call_function. Based on further quantitative
analysis, I can easily provide inline cached derivatives of frequently
called functions, such as some builtin primitives.

 Based on what information do you switch between inlining states?

I have instrumented versions of some functions that allow me to make
quickening decisions, such as binary_op1, do_richcompare, or
call_function, where I can quicken instructions to an optimized,
inline cached, instruction derivative.


 Or do you restrict yourself to builtin types?

Currently, my approach provides optimized derivative instructions for
the standard library, e.g., unicode strings, numerical objects,
containers, and iterators.


 That might be worth it
 already, just think of list.append(). We have an optimistic optimisation for
 object.append() in Cython that gives us massive speed-ups in loops that
 build lists, even if we don't know at compile time that we are dealing with
 lists.

Yes, that sounds like a reasonable thing to do. I could provide much
more optimized derivatives based on application profiles, too. Since I
use a simple code generator for generating the derivatives, it would
also be possible to provide end-users with the means to analyze their
apps and generate optimized instruction derivatives matching their
profile.


Regards,
--stefan


Re: [Python-Dev] Python 3 optimizations...

2010-07-23 Thread stefan brunthaler
 This sounds like wpython (a CPython derivative with a wider set of byte code
 commands) could benefit from it.

I am aware of the wpython project of Cesare di Mauro. I change the
instruction format from bytecode to wordcode, too (because it allows
for more efficient instruction decoding). Contrary to his approach,
however, I do not change the instruction encoding to pack in
additional optimizations. (I hope I have put that correctly; I saw
his slides about a year ago.)


 Do I understand correctly that you modify the byte code of modules/functions
 at runtime?

Yes. Quickening is a runtime-only optimization technique that rewrites
instructions from a generic instruction to an optimized derivative
(originally for the Java virtual machine). It is completely hidden
from the compiler and has no dependencies other than the interpreter
dispatch routine itself.



 Ah, yes, that makes good sense. So you basically add an intermediate step to
 calls that provides faster dispatch for known C functions.

Exactly. I also contemplated providing optimized derivatives for all
builtin functions, but never implemented it (lack of time). Based on a
quantitative analysis of usage frequency, one could very well decide
to, e.g., provide an optimized CALL_FUNCTION derivative for the len
function.
Another benefit of using my technique is that a compiler could decide
to inline all of the functions called by the optimized derivatives
(e.g., the float_add function call inside my FLOAT_ADD interpreter
instruction). Unfortunately, however, gcc currently does not allow
cross-module inlining (AFAIR). (Preliminary tests with manually
changing the default inlining size for ceval.c resulted in speedups of
up to 1.3 on my machine, so I think inlining the function bodies into
the optimized derivatives would boost performance noticeably.)


 I'm interested in the code that determines what can be optimised in what
 way. I read that Jython recently received a contribution that provides type
 information for lots of modules and builtins, but having something like that
 for CPython would be cool.

Ok. For this year's PPPJ I wanted to submit a paper realizing my
optimization in Jython. Using bytecode-rewriting tools, the
interpreter could decide at runtime which optimized derivatives to
generate and add rewriting code that supports the changing instruction
set. Either way (static pre-compiling or dynamic bytecode rewriting,
that is), I think that Jython and IronPython would greatly benefit
from applying this optimization technique, because their JIT compilers
would very likely inline the function calls.


 Such an approach would also be very useful for Cython. Think of a profiler
 that runs a program in CPython and tells you exactly what static type
 annotations to put where in your Python code to make it compile to a fast
 binary with Cython. Or, even better, it could just spit out a .pxd file that
 you drop next to your .py file and that provides the static type information
 for you.

Hm, I think you could very easily save away the optimized bytecode
sequence for function calls that would allow you to do that (e.g., you
could save something similar to:
   LOAD_FAST
   LOAD_CONST
   LONG_ADD
or
   LOAD_GLOBAL
   CALL_C_ZERO
)


Cheers,
--stefan


Re: [Python-Dev] Python 3 optimizations...

2010-07-23 Thread stefan brunthaler
 How do you generate the specialized opcode implementations?

I have a small code generator written in Python that uses Mako
templates to generate C files that can be included in the main
interpreter. It is a data-driven approach that uses type information
gathered by gdb and checks whether given types implement, for
instance, an nb_add method.

 Presumably that is done ahead of time, or you'd have to use a JIT,
 which is what you're avoiding.

Yes, and yes: I execute the code generator before compiling the Python
interpreter, and I am interested in purely interpretative optimization
techniques.


 I'm guessing from your comments below about cross-module inlining that
 you generate a separate .c file with the specialized opcode bodies and
 then call through to them via a table of function pointers indexed by
 opcode, but I could be totally wrong.  :)

No, dead on ;)
Probably a small example from the top of my head illustrates what is going on:

TARGET(FLOAT_ADD):
  w = POP();
  v = TOP();
  x = PyFloat_Type.tp_as_number->nb_add(v, w);
  SET_TOP(x);
  if (x != NULL) FAST_DISPATCH();
  break;

And I extend the standard indirect threaded code dispatch table to
support the FLOAT_ADD operation.


 There are a variety of solutions to getting cross-module inlining
 these days.  Clang+LLVM support link-time optimization (LTO) via a
 plugin for gold.  GCC has LTO and LIPO as well.

A PhD colleague from our institute pointed the gold stuff out to me
yesterday, I am going to check out if any of these solutions would
work. A deeper problem here is that the heuristics of the compilers
are ill-suited to the needs of compiling an interpreter dispatch
routine -- I will investigate this further in future research.



 This would be interesting.  We have (obviously) have similar
 instrumentation in unladen swallow to gather type feedback.  We talked
 with Craig Citro about finding a way to feed that back to Cython for
 exactly this reason, but we haven't really pursued it.

Ok; I think it would actually be fairly easy to use the type
information gathered at runtime by the quickening approach. Several
auxiliary functions for dealing with these types could be generated by
my code generator as well. It is probably worth looking into this,
though my current top priority is my PhD research, so I cannot promise
to be able to allocate vast amounts of time for such endeavours.


Best,
--stefan


Re: [Python-Dev] Python 3 optimizations...

2010-07-23 Thread stefan brunthaler
 wpython has reached 1.1 final version. If you are interested, you can find
 it here: http://code.google.com/p/wpython2/ and you can download the new
 slides that cover the improvements over 1.0 alpha.

Thanks for the hint, I will definitely check your new slides.


 Did you use the wpython wordcode format, or a new one?

No, actually I was well into working on my stuff when you announced
wpython last year. My latest instruction format uses a native machine
word (64bit) that contains two 32bit halves with the opcode in the
lower half and the operand in the upper half. While the opcode
certainly does not exceed 10bits or so, I need more than a byte to
support more operations (my latest interpreter has 395 instructions).
Our instruction decoding is almost identical, though.


 Yes, you're right. wpython approach is to encode as much information as it
 can to save space, decoding time, specialize some opcodes, etc..

Yes, I see that wpython eliminates common instruction sequences. From
my understanding, this corresponds to using static superinstructions
in combination with a changed instruction format. Aside from the
optimizations in the operation implementations, wpython eliminates
some instruction dispatches, which are notoriously inefficient. I
think it is a very nice approach, and some form of inline caching with
quickening might well boost performance even more.

Cheers,
--stefan


Re: [Python-Dev] Python 3 optimizations...

2010-07-23 Thread stefan brunthaler
 I think I was wrong, but now I understand.  The inlining you want is
 to get the nb_add body, not the opcode body.

Exactly. This would increase performance by quite a bit -- I will
start experimenting with that stuff a.s.a.p.


 The example you've given brings up a correctness issue.  It seems
 you'd want to add checks that verify that the operands w and v are
 both floats, and jump to BINARY_ADD if the guard fails.  It would
 require reshuffling the stack operations to defer the pop until after
 the check, but it shouldn't be a problem.

That's usually the problem when you're doing something off the top of
your head, especially when it's already 9pm :)
You're right; a simple way around this is:

TARGET(FLOAT_ADD):
   if (!(TOP()->ob_type == SECOND()->ob_type &&
         TOP()->ob_type == &PyFloat_Type))
  goto TARGET_BINARY_ADD_SKIP_DECODE;
   ...remains the same...

Note: the SKIP_DECODE is necessary because otherwise BINARY_ADD would
advance the instruction pointer again.
Another simple approach is to add another set of labels to denote
inline cache misses, e.g.:

TARGET(BINARY_ADD):
   w= POP();
   v= TOP();
  BINARY_ADD_CACHE_MISS:
   x= PyNumber_Add(v, w);
   ...

TARGET(FLOAT_ADD):
   w= POP();
   v= TOP();
   if (v->ob_type != w->ob_type || v->ob_type != &PyFloat_Type)
 goto BINARY_ADD_CACHE_MISS;
   ...

You could also call the PyNumber_Add function yourself instead of the
goto, but I did not implement it this way...



 I think you also record (via gdb) exactly the information that we
 record.  I now see three consumers of type feedback from the CPython
 interpreter: you or any quickening approach, Cython, and Unladen
 Swallow.  It might be worth converging on a standard way to record
 this information and serialize it so that it can be analyzed offline.

Indeed. A bigger set of type definitions for frequently used C
extension modules would probably be useful as well. I am doing the
simplest possible thing here: since gdb's pretty-printed output is
already close to a Python data type definition, I just copied it and
did a few search-and-replace commands to convert it to a valid Python
data type.


 Our feedback recording mechanism currently uses LLVM data structures,
 but the point of quickening is to avoid that kind of dependency, so
 we'd need to rewrite it before it would really be useful to you.

Ok.


--stefan


[Python-Dev] Python 3 optimizations...

2010-07-22 Thread stefan brunthaler
Hello,

during the last year, I have developed a couple of quickening-based
optimizations for the Python 3.1 interpreter. As part of my PhD
programme, I have published a first technique that combines quickening
with inline caching at this year's ECOOP, and subsequently extended
this technique to optimize several load instructions as well as
eliminate redundant reference counting operations from instructions,
which has been accepted for publication at an upcoming conference.
I have a working prototype combining all of these optimizations that
achieves a maximum speedup of 2.18 on the spectralnorm benchmark of
the computer language benchmarks game. Early measurements on the
Unladen Swallow django benchmark (with Martin von Loewis' patch for
Python 3) achieve a speedup of about 1.3. Both speedups were measured
on an i7 920 with the threaded code/computed goto optimization
enabled, and normalized against the standard Python 3.1 interpreter
with all optimizations disabled.
Since all of these optimizations are purely interpretative, they have
next-to-no additional memory requirements and do not incur extensive
warm-up costs.

I wonder whether you would be interested in integrating these
optimizations with the Python 3 distribution, hence this mail. I could
send copies of the papers, as well as provide my prototype source code
to interested members of the python development community.

Have a nice day,
--stefan


Re: [Python-Dev] Python 3 optimizations...

2010-07-22 Thread stefan brunthaler
 Is the source code under an open source non-copyleft license?

I am (unfortunately) not employed or funded by anybody, so I think
that I can license/release the code as I see fit.


 Have you checked that the whole regression test suite passes?

Currently, I am sure my prototype will not pass the regression test
suite, because I used it for benchmarking only. However, I think it
is probably not too much work to get this done.


 Can you publish all this online (papers and source code)?

I am not sure whether I can publish the ECOOP paper without violating
any Springer copyrights, but will check whether that is a problem. I
could send you the second paper in private; until it is going to be
published in the corresponding proceedings, I fear that I might not be
able to put this online, though.
Publishing the source code should not be a problem at all, though I
think some polishing might not be a bad idea...


Best,
--stefan


Re: [Python-Dev] Python 3 optimizations...

2010-07-22 Thread stefan brunthaler
 The Springer link [1] at least shows the front page to give more of an
 idea as to what this is about.

Thanks, I forgot to mention the link.


 The idea does sound potentially interesting, although I'm not sure how
 applicable it will be with a full-blown LLVM-based JIT on the way for
 3.3 (via Unladen Swallow).

Yeah, I see that. However, I think that a JIT compiler could very well
re-use the information gathered by using my techniques to generate
better code in the first run. There certainly are a couple of other
benefits for combining both approaches, i.e., they are not mutually
exclusive.


 Depending on the size and complexity of the
 patches, it may still be worth exploring for 3.2.

I am currently not aware of the planned release schedule, but I think
that some of the techniques could very well be released in 3.2 -- with
a conditional compilation target similar to the computed-goto feature.

Regards,
--stefan