[issue14757] INCA: Inline Caching meets Quickening in Python 3.3

2012-05-16 Thread stefan brunthaler

stefan brunthaler s.bruntha...@uci.edu added the comment:

> Perhaps that's just me, but I find the performance gains rather limited given
> the sheer size of the changes.

Well, there are a couple of things to keep in mind:

a) There is substantial speedup potential in further interpretative
optimizations, but they come at the cost of increased complexity (mostly
due to a different instruction encoding). From the response on python-dev
I took away that this is not what people want.

b) The size is deceptive: the patch contains all resources, i.e., the
code generator *and* the generated files. I could split it up into three
separate patches to show that the *actual* intersection with existing
Python sources is very small. (Disregarding opcode.h, my guess is that
it's about 100 lines.)

c) There are no real compatibility implications (modulo code that
checks specific opcode values), and the memory consumption is
essentially nil (<= 100 KiB, constant).

There are further speedups available by reordering the interpreter
instructions (I have a paper on that, called Interpreter Instruction
Scheduling, and am currently working on a better algorithm [well, the
algorithm already exists; I'm just evaluating it]). I could easily add
that at no extra cost to the implementation, too.

> Is there any non-micro benchmark where the performance gains are actually
> substantial (say, more than 20%)?

Hm, I don't know. Are there applications/frameworks running on Python
3 that I can benchmark with?

Based on my experience, the speedups should be achievable across the
board, primarily because the most frequent CALL_FUNCTION instructions
have optimized derivatives. Together with the arithmetic and COMPARE_OP
derivatives, this covers a wide range of dynamic instruction frequency
mixes. There are further inlining opportunities, too, which can easily
be added to the code generator.
The only reason some benchmarks don't achieve the expected speedups is
that they use operations for which the code generator does not emit
optimized derivatives. There is still space for ~45 more derivatives to
cover those (including some important application-specific ones).
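
To make the mechanism concrete, here is a minimal, self-contained Python
sketch of quickening-based inline caching. This is not code from the patch;
the opcode names merely mimic the INCA naming, and the interpreter is a toy:

# Toy bytecode interpreter illustrating quickening: a generic instruction
# observes its operand types once, then rewrites its own code slot to a
# type-specialized derivative that only needs a cheap guard.
BINARY_ADD, INCA_FLOAT_ADD, LOAD_CONST, RETURN = range(4)

def interpret(code, consts):
    stack, pc = [], 0
    while True:
        op, arg = code[pc]
        if op == LOAD_CONST:
            stack.append(consts[arg])
        elif op == BINARY_ADD:
            b, a = stack.pop(), stack.pop()
            if type(a) is float and type(b) is float:
                code[pc] = (INCA_FLOAT_ADD, arg)   # quicken this code slot
            stack.append(a + b)
        elif op == INCA_FLOAT_ADD:
            b, a = stack.pop(), stack.pop()
            if type(a) is float and type(b) is float:
                stack.append(a + b)                # fast path: guard, then add
            else:
                code[pc] = (BINARY_ADD, arg)       # cache miss: de-quicken
                stack.append(a + b)
        elif op == RETURN:
            return stack.pop()
        pc += 1

code = [(LOAD_CONST, 0), (LOAD_CONST, 1), (BINARY_ADD, 0), (RETURN, 0)]
print(interpret(code, [1.5, 2.5]))  # 4.0; a second run hits the quickened slot

In the patch itself the rewrite happens at the C level on the code object's
bytecode, and the derivative avoids the generic operator dispatch rather
than a Python-level type check, but the control flow is the same idea.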

--


[issue14757] INCA: Inline Caching meets Quickening in Python 3.3

2012-05-14 Thread stefan brunthaler

Changes by stefan brunthaler s.bruntha...@uci.edu:


Added file: http://bugs.python.org/file25586/20120514-inca.patch


[issue14757] INCA: Inline Caching meets Quickening in Python 3.3

2012-05-14 Thread stefan brunthaler

stefan brunthaler s.bruntha...@uci.edu added the comment:

While looking at the instruction frequency traces, I noticed that INPLACE_*
instructions were not quickened to optimized instruction derivatives. The
submitted patch fixes this issue and correctly generates code performing the
substitution. The performance improvements are noticeably better
(iterative_count, nbody and threaded_count are about 1.25x faster on average).
The summarized performance numbers make it easy to see where the actual
performance gains* are, and which results seem to be due to measurement error.

*: It seems that the mako benchmark suite suffers a consistent loss. I am going 
to look into this.
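
For illustration (my own toy example, not one of the benchmarks), the dis
module shows why the INPLACE_* derivatives matter for loop-heavy code such
as iterative_count: the hot loop body is dominated by a single INPLACE_ADD.

import dis

def count(n):
    i = 0
    while i < n:
        i += 1      # INPLACE_ADD on CPython of this era, executed n times
    return i

dis.dis(count)  # loop body: LOAD_FAST, LOAD_CONST, INPLACE_ADD, STORE_FAST

Without a quickened derivative, every one of those n executions pays the
generic dispatch cost, which is exactly what the fixed code generator now
avoids.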

--
Added file: http://bugs.python.org/file25588/20120514-inca-perf.txt


[issue14757] INCA: Inline Caching meets Quickening in Python 3.3

2012-05-11 Thread stefan brunthaler

stefan brunthaler s.bruntha...@uci.edu added the comment:

This is the updated patch file; it fixes the performance issues measured
with the official perf.py py3k benchmark suite.

--
Added file: http://bugs.python.org/file25541/20120511-inca.patch


[issue14757] INCA: Inline Caching meets Quickening in Python 3.3

2012-05-11 Thread stefan brunthaler

Changes by stefan brunthaler s.bruntha...@uci.edu:


Added file: http://bugs.python.org/file25542/20120511-inca-perf.txt


[issue14757] INCA: Inline Caching meets Quickening in Python 3.3

2012-05-11 Thread stefan brunthaler

Changes by stefan brunthaler s.bruntha...@uci.edu:


Added file: http://bugs.python.org/file25543/20120510-vanilla-perf.txt


[issue14757] INCA: Inline Caching meets Quickening in Python 3.3

2012-05-11 Thread stefan brunthaler

stefan brunthaler s.bruntha...@uci.edu added the comment:

So I took a close look at what the performance problem was. Many of the
benchmarks used by the perf.py py3k suite make function calls for which
there are no optimized derivatives available. In that case the function
trying to do the quickening (aptly called quickening_call_function) did not
have a proper target instruction, so it was stuck with the additional
quickening code in quickening_call_function. This causes some overhead, or
at least eats away some of the benefits of quickening. The new patch
quickens to another function if no better target is available; the only
optimization is to have one dedicated instruction for calling a C function
and one for calling a Python function/method (instruction labels INCA_C_CALL
and INCA_PYTHON_CALL, respectively). The new interpreter dispatch loop
therefore has 208 instructions.
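
A sketch of that fallback logic, in Python for brevity (the real code is C;
strings stand in for instruction labels here):

import types

def quickening_call_function(code, pc, callee):
    # Runs once per call site: pick the cheapest applicable instruction,
    # so subsequent executions skip this selection step entirely.
    if isinstance(callee, types.BuiltinFunctionType):
        code[pc] = "INCA_C_CALL"        # go straight to the C calling convention
    elif isinstance(callee, (types.FunctionType, types.MethodType)):
        code[pc] = "INCA_PYTHON_CALL"   # go straight to Python frame setup
    # anything else keeps the generic CALL_FUNCTION slot

code = ["CALL_FUNCTION"]
quickening_call_function(code, 0, len)
print(code[0])  # INCA_C_CALL

The point is that the C-vs-Python test now runs once per call site instead
of on every single call.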

Furthermore, I have added a baseline evaluation of running perf.py on my
system, measuring the vanilla Python 3.3 version against itself. I see
quite some noise there, with some benchmarks showing outliers of up to 10%.
My interpretation is that, on average, around five percent of variance
needs to be accounted for.

The new patch shows an overall improvement. For the slower benchmarks, I
guess the difference is within the error margin demonstrated by the baseline
run. The remaining speedups are acceptable, even though I could tune the
code generator to have more optimized derivatives in place. I took a closer
look at the float benchmark, because in my experience with the Computer
Language Benchmarks Game, the INCA optimization performs quite well on such
code.

These are the dynamic instruction frequencies for the two most frequently 
executed functions:

maximize:
---------
   Freq.  | pos | instruction | arg
    100:   0: LOAD_FAST 0
    100:   3: LOAD_ATTR 0
    100:   6: LOAD_FAST 1
    100:   9: LOAD_ATTR 0
    100:  12: INCA_CMP_FLOAT 4
    100:  15: POP_JUMP_IF_FALSE 27
1996400:  18: LOAD_FAST 0
1996400:  21: LOAD_ATTR 0
1996400:  24: JUMP_FORWARD 6
   3500:  27: LOAD_FAST 1
   3500:  30: LOAD_ATTR 0
    100:  33: LOAD_FAST 0
    100:  36: STORE_ATTR 0
    100:  39: LOAD_FAST 0
    100:  42: LOAD_ATTR 1
    100:  45: LOAD_FAST 1
    100:  48: LOAD_ATTR 1
    100:  51: INCA_CMP_FLOAT 4
    100:  54: POP_JUMP_IF_FALSE 66
    100:  57: LOAD_FAST 0
    100:  60: LOAD_ATTR 1
    100:  63: JUMP_FORWARD 6
    100:  72: LOAD_FAST 0
    100:  75: STORE_ATTR 1
    100:  78: LOAD_FAST 0
    100:  81: LOAD_ATTR 2
    100:  84: LOAD_FAST 1
    100:  87: LOAD_ATTR 2
    100:  90: INCA_CMP_FLOAT 4
    100:  93: POP_JUMP_IF_FALSE 105
1993800:  96: LOAD_FAST 0
1993800:  99: LOAD_ATTR 2
1993800: 102: JUMP_FORWARD 6
   6100: 105: LOAD_FAST 1
   6100: 108: LOAD_ATTR 2
    100: 111: LOAD_FAST 0
    100: 114: STORE_ATTR 2
    100: 117: LOAD_CONST 0
    100: 120: RETURN_VALUE 0

normalize:
----------
   Freq.  | pos | instruction | arg
    200:   0: LOAD_GLOBAL 0
    200:   3: LOAD_FAST 0
    200:   6: LOAD_ATTR 1
    200:   9: LOAD_CONST 1
    200:  12: INCA_FLOAT_POWER 1
    200:  13: LOAD_FAST 0
    200:  16: LOAD_ATTR 2
    200:  19: LOAD_CONST 1
    200:  22: INCA_FLOAT_POWER 1
    200:  23: INCA_FLOAT_ADD 1
    200:  24: LOAD_FAST 0
    200:  27: LOAD_ATTR 3
    200:  30: LOAD_CONST 1
    200:  33: INCA_FLOAT_POWER 1
    200:  34: INCA_FLOAT_ADD

[issue14757] INCA: Inline Caching meets Quickening in Python 3.3

2012-05-08 Thread stefan brunthaler

New submission from stefan brunthaler s.bruntha...@uci.edu:

The attached patch adds quickening-based inline caching (INCA) to the
CPython 3.3 interpreter. It uses a code generator to generate instruction
derivatives with the mako template engine, plus a set of utility functions
to enable automatic and safe quickening.

The code generator itself resides in cgen and the generated files reside in
Python/opt/gen. Auxiliary files reside in Python/opt, and only minor changes
are necessary in ceval.c and in places where type feedback is possible
(mostly in abstract.c and object.c).
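
To give a flavor of what the code generator does, here is a simplified
sketch of how a mako template can emit one specialized instruction body.
The template text, the variable names and the fallback label are
illustrative only, not the actual cgen templates:

from mako.template import Template

case_tmpl = Template("""\
TARGET(${name})
    w = POP();
    v = TOP();
    if (PyFloat_CheckExact(v) && PyFloat_CheckExact(w)) {
        x = PyFloat_FromDouble(PyFloat_AS_DOUBLE(v) ${c_op} PyFloat_AS_DOUBLE(w));
        Py_DECREF(v); Py_DECREF(w);
        SET_TOP(x);
        DISPATCH();
    }
    /* inline-cache miss: de-quicken and redo the generic instruction */
    goto ${generic}_miss;
""")

print(case_tmpl.render(name="INCA_FLOAT_ADD", c_op="+", generic="BINARY_ADD"))

Rendering the same template with c_op="-" or c_op="*" yields the sibling
derivatives, which is why the generated files can stay out of the
hand-maintained sources.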

Details of the technique have been published (see my home page: 
http://www.ics.uci.edu/~sbruntha/.) 

On my machine (i7-920 with Intel Turbo Boost disabled) this results in
average speedups of 1.47x (arithmetic mean) over the vanilla interpreter
without threaded code/computed gotos, and 1.13x over an interpreter with
threaded code/computed gotos enabled. (Maximum speedups are 1.61x over the
vanilla interpreter and 1.17x over the threaded-code interpreter.) The
optimized interpreter uses 206 instructions, which currently only cover the
standard library, i.e., there is still ample space left for optimized
instruction derivatives for popular applications/libraries such as NumPy or
Django.

Furthermore, because the technique is purely interpretative, there are no
compatibility implications (modulo libraries/modules relying on concrete
opcode values; I would guess that such code is rather unlikely, but one
never knows...). The additional memory overhead is minimal, too, since the
technique only requires space for the new derivatives, something along the
lines of 80-100 KiB.
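
To make the caveat concrete, this is the kind of (fragile) code that would
notice the change. It is a contrived example of mine; it walks 3.3-style
bytecode, where an instruction is one opcode byte plus an optional two-byte
argument:

import opcode

BINARY_ADD = opcode.opmap["BINARY_ADD"]

def uses_add(func):
    co, i = func.__code__.co_code, 0
    while i < len(co):
        if co[i] == BINARY_ADD:   # stops matching once the slot is quickened
            return True
        i += 3 if co[i] >= opcode.HAVE_ARGUMENT else 1
    return False

Once INCA rewrites a BINARY_ADD slot to a derivative at run time, a check
like this silently returns False; that is the only kind of breakage I would
expect.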

--
components: Interpreter Core
files: 20120508-inca.patch
hgrepos: 124
keywords: patch
messages: 160216
nosy: sbrunthaler
priority: normal
severity: normal
status: open
title: INCA: Inline Caching meets Quickening in Python 3.3
type: enhancement
versions: Python 3.3
Added file: http://bugs.python.org/file25499/20120508-inca.patch


[issue14757] INCA: Inline Caching meets Quickening in Python 3.3

2012-05-08 Thread stefan brunthaler

stefan brunthaler s.bruntha...@uci.edu added the comment:

> This looks quite impressive, so sorry for immediately jumping in with
> criticism. -- I've benchmarked the things I worked on, and I can't see
> any speedups but some significant slowdowns. This is on 64-bit Linux
> with a Core 2 Duo, both versions compiled with just `./configure && make`:

Well, no problem -- I don't actually consider it criticism at all. Your
build is correct; you can verify that the interpreter is working adequately
by running the test suite and seeing some tests that depend on specific
bytecodes fail (test_dis and test_importlib, AFAIR).

I don't have a Core 2 Duo available for testing, though.

> Modules/_decimal/tests/bench.py:
> --------------------------------
>
> Not much change for floats and decimal.py, 8-10% slowdown for _decimal!

This result is not unexpected, as I have no inline-cached versions of the
functions this module uses. The derivatives I generate work for Long, Float
and Complex numbers (plus Unicode strings and some others). If there is a
clear need, I can of course look into adding these derivatives (as I said,
there are still some 40+ opcodes unused).

> Memoryview:
> -----------
>
> ./python -m timeit -n 1000 -s "x = memoryview(bytearray(b'x'*1))" "x[:100]"
>
> 17% (!) slowdown.

Hm, the 17% slowdown seems strange to me. However, I don't expect to see
any speedups in this case, as there is no repeated execution within the
benchmark code that could leverage type feedback via inline caching.

You should see the biggest speedups when dealing with for-loops (as FOR_ITER
has optimized derivatives), if-statements (COMPARE_OP has optimized
derivatives), and mathematical code. In addition, there are some
optimizations for frequently executed function calls, unpacked sequences,
etc. Note: frequent as in how often I encountered them; this probably needs
adjustment for different use cases.

> Did I perhaps miss some option to turn on the optimizations?

That does not seem to be the case, but if you could verify by running the
regression tests, we could easily rule out this scenario. You could also
verify the speedups on the Computer Language Benchmarks Game programs,
primarily binarytrees, mandelbrot, nbody and spectralnorm, just to see how
much you *should* gain on your machine. Testing methodology could also make
a difference; I use the following (a rough sketch of my timing harness
follows the list):
- Linux 3.0.0-17 (Ubuntu)
- gcc version 4.6.1
- nice -n -20 to minimize scheduler interference
- 30 repetitions per benchmark
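
The harness, sketched (bench.py stands in for any of the benchmark scripts;
timing the whole process includes interpreter startup, which is fine for
relative comparisons):

import subprocess, time

REPS = 30  # 30 repetitions per benchmark, as above

def run_once(script):
    t0 = time.perf_counter()
    # nice -n -20 minimizes scheduler interference (needs root/CAP_SYS_NICE).
    subprocess.check_call(["nice", "-n", "-20", "./python", script])
    return time.perf_counter() - t0

times = sorted(run_once("bench.py") for _ in range(REPS))
print("min %.3fs  mean %.3fs" % (times[0], sum(times) / REPS))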

I hope that helps/explains,
regards,
--stefan

--
