[issue4753] Faster opcode dispatch on gcc

2015-06-02 Thread David Bolen

David Bolen added the comment:

Oops, sorry, I had just followed the commit comment to this issue.  For the 
record here, it looks like Benjamin has committed an update (5e8fa1b13516) that 
resolves the problem.


[issue4753] Faster opcode dispatch on gcc

2015-06-01 Thread R. David Murray

R. David Murray added the comment:

Please open a new issue with the details about your problem.

--
nosy: +r.david.murray


[issue4753] Faster opcode dispatch on gcc

2015-06-01 Thread David Bolen

David Bolen added the comment:

I ran a few more tests, and the generated executable hangs in both release and 
debug builds.  The closest I can get at the moment is that it's stuck importing 
errno from the "import sys, errno" line in os.py - at least no matter how long 
I wait after starting a process before breaking out, output with -v looks like:

> python_d -v
# installing zipimport hook
import zipimport # builtin
# installed zipimport hook
# D:\cygwin\home\db3l\test\python2.7\lib\site.pyc matches 
D:\cygwin\home\db3l\test\python2.7\lib\site.py
import site # precompiled from D:\cygwin\home\db3l\test\python2.7\lib\site.pyc
# D:\cygwin\home\db3l\test\python2.7\lib\os.pyc matches 
D:\cygwin\home\db3l\test\python2.7\lib\os.py
import os # precompiled from D:\cygwin\home\db3l\test\python2.7\lib\os.pyc
import errno # builtin
Traceback (most recent call last):
  File "D:\cygwin\home\db3l\test\python2.7\lib\site.py", line 62, in 
import os
  File "D:\cygwin\home\db3l\test\python2.7\lib\os.py", line 26, in 
import sys, errno
KeyboardInterrupt
# clear __builtin__._
# clear sys.path
# clear sys.argv
# clear sys.ps1
# clear sys.ps2
# clear sys.exitfunc
# clear sys.exc_type
# clear sys.exc_value
# clear sys.exc_traceback
# clear sys.last_type
# clear sys.last_value
# clear sys.last_traceback
# clear sys.path_hooks
# clear sys.path_importer_cache
# clear sys.meta_path
# clear sys.flags
# clear sys.float_info
# restore sys.stdin
# restore sys.stdout
# restore sys.stderr
# cleanup __main__
# cleanup[1] zipimport
# cleanup[1] errno
# cleanup[1] signal
# cleanup[1] exceptions
# cleanup[1] _warnings
# cleanup sys
# cleanup __builtin__
[8991 refs]
# cleanup ints: 6 unfreed ints
# cleanup floats

I never have a problem interrupting the process, so KeyboardInterrupt is 
processed normally - it just looks like it gets stuck in an infinite loop 
during startup.

-- David


[issue4753] Faster opcode dispatch on gcc

2015-06-01 Thread David Bolen

David Bolen added the comment:

The 2.7 back-ported version of this patch appears to have broken compilation on 
the Windows XP buildbot, during the OpenSSL build process, when the newly built 
Python is used to execute the build_ssl.py script.

After this patch, when that stage executes, and prior to any output from the 
build script, the python_d process goes to 100% CPU and sticks there until the 
build process times out 1200s later and kills it.

I don't think it's really ssl related though, as after doing some debugging the 
exact same thing happens if I simply run python_d (I never see a prompt - it 
just sits there burning CPU).  So I think build_ssl.py is just the first use of 
the generated python_d during the build process.

I did try attaching to the CPU-stuck version of python_d from VS, and so far 
from what I can see, the process never gets past the Py_Initialize() call in 
Py_Main().  It's all over the place in terms of locations if I try interrupting 
it, but it's always stuck inside that first Py_Initialize call.

I'm not sure if it's something environmental on my slave, or a difference with 
a debug vs. production build (I haven't had a chance to try building a release 
version yet).

-- David

--
nosy: +db3l


[issue4753] Faster opcode dispatch on gcc

2015-05-28 Thread Roundup Robot

Roundup Robot added the comment:

New changeset 17d3bbde60d2 by Benjamin Peterson in branch '2.7':
backport computed gotos (#4753)
https://hg.python.org/cpython/rev/17d3bbde60d2

--
nosy: +python-dev


[issue4753] Faster opcode dispatch on gcc

2015-05-27 Thread Ned Deily

Ned Deily added the comment:

@Vamsi, could you please open a new issue and attach your patch there so it can 
be properly tracked for 2.7?  This issue has been closed for five years and the 
code has been out in the field for a long time in Python 3.  Thanks!

--
nosy: +ned.deily


[issue4753] Faster opcode dispatch on gcc

2015-05-27 Thread Robert Collins

Robert Collins added the comment:

FWIW I'm interested and willing to poke at this if more testers/reviewers are 
needed.

--
nosy: +rbcollins


[issue4753] Faster opcode dispatch on gcc

2015-05-27 Thread Srinivas Vamsi Parasa

Srinivas Vamsi Parasa added the comment:

Hi All,

This is Vamsi from the Server Scripting Languages Optimization team at Intel Corporation.

We would like to submit a request to enable the computed-goto-based dispatch in Python 2.x (it is enabled by default in Python 3, given its performance benefits on a wide range of workloads). We talked about this patch with Guido and he encouraged us to submit a request on python-dev (the email conversation with Guido is shown at the bottom of this email).
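
For readers new to the technique under discussion, here is a minimal, self-contained sketch of computed-goto dispatch (a GCC extension; this is an illustration, not CPython's actual code). Each handler ends with its own indirect jump through a label table, so the CPU's branch predictor sees one branch per opcode instead of a single, highly unpredictable shared dispatch branch:

/* Minimal computed-goto interpreter sketch (GNU C; build: gcc -O2 demo.c) */
#include <stdio.h>

enum { OP_INCR, OP_DECR, OP_HALT };

static int run(const unsigned char *code)
{
    /* One label per opcode; the table maps opcode -> handler address. */
    static void *targets[] = { &&op_incr, &&op_decr, &&op_halt };
    const unsigned char *ip = code;
    int acc = 0;

#define DISPATCH() goto *targets[*ip++]
    DISPATCH();
op_incr: acc++; DISPATCH();
op_decr: acc--; DISPATCH();
op_halt:
#undef DISPATCH
    return acc;
}

int main(void)
{
    const unsigned char prog[] = { OP_INCR, OP_INCR, OP_DECR, OP_HALT };
    printf("%d\n", run(prog));  /* prints 1 */
    return 0;
}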

Attached is the computed goto patch (along with instructions to run it) for Python 2.7.10, based on the patch submitted by Jeffrey Yasskin at http://bugs.python.org/issue4753. We built and tested this patch for Python 2.7.10 on a Linux machine (Ubuntu 14.04 LTS server, Intel Xeon Haswell-EP CPU with 18 cores, hyper-threading off, turbo off).

Below is a summary of the performance we saw on the "grand unified python benchmarks" suite (available at https://hg.python.org/benchmarks/). We made 3 rigorous runs of the following benchmarks; in each rigorous run, a benchmark is run 100 times with and without the computed goto patch. Below we show the average performance boost over the 3 rigorous runs.
-
Instructions to run the computed goto patch (consolidated as a shell block below):
1) Apply the patch, then regenerate the configure script (autoconf configure.ac > configure)
2) Enable execute permissions for Python/makeopcodetargets.py (sudo chmod +x Python/makeopcodetargets.py)
3) Enable computed gotos: ./configure --with-computed-gotos
4) Build the new python binary: make && sudo make install
--
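
Taken together, the steps above amount to the following shell sequence (assuming the patch is already applied at the top of the Python 2.7.10 source tree):

# Consolidated from the numbered steps above.
autoconf configure.ac > configure
chmod +x Python/makeopcodetargets.py
./configure --with-computed-gotos
make && sudo make install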



Python 2.7.10 (original) vs Computed Goto performance
Benchmark              Delta (run #1) %  Delta (run #2) %  Delta (run #3) %  Avg. Delta %
iterative_count                   24.48             24.36             23.78          24.2
unpack_sequence                   19.06             18.47             19.48          19.0
slowspitfire                      14.36             13.41             16.65          14.8
threaded_count                    15.85             13.43             13.93          14.4
pystone                           10.68             11.67             11.08          11.1
nbody                             10.25              8.93              9.28           9.5
go                                 7.96              8.76              7.69           8.1
pybench                            6.3               6.8               7.2            6.8
spectral_norm                      5.49              9.37              4.62           6.5
float                              6.09              6.2               6.96           6.4
richards                           6.19              6.41              6.42           6.3
slowunpickle                       6.37              8.78              3.55           6.2
json_dump_v2                       1.96             12.53              3.57           6.0
call_simple                        6.37              5.91              3.92           5.4
chaos                              4.57              5.34              3.85           4.6
call_method_slots                  2.63              3.27              7.71           4.5
telco                              5.18              1.83              6.47           4.5
simple_logging                     3.48              1.57              7.4            4.2
call_method                        2.61              5.4               3.88           4.0
chameleon                          2.03              6.26              3.2            3.8
fannkuch                           3.89              3.19              4.39           3.8
silent_logging                     4.33              3.07              3.39           3.6
slowpickle                         5.72             -1.12              6.06           3.6
2to3                               2.99              3.6               3.45           3.3
etree_iterparse                    3.41              2.51              3              3.0
regex_compile                      3.44              2.48              2.84           2.9
mako_v2                            2.14              1.29              5.22           2.9
meteor_contest                     2.01              2.2               3.88           2.7
django                             6.68             -1.23              2.56           2.7
formatted_logging                  1.97              5.82             -0.11           2.6
hexiom2                            2.83              2.1               2.55           2.5
django_v2                          1.93              2.53              2.92           2.5
etree_generate                     2.38              2.13              2.51           2.3
mako                              -0.3               9.66             -3.11           2.1
bzr_startup                        0.35              1.97              3              1.8
etree_process                      1.84              1.01              1.9            1.6
spambayes                          1.76              0.76              0.48           1.0
regex_v8                           1.96             -0.66              1.63           1.0
html5lib                           0.83              0.72              0.97           0.8
normal_startup                     1.41              0.39              0.24           0.7
startup_nosite                     1.2               0.41              0.42           0.7
etree_parse                        0.24              0.9               0.79           0.6
json_load                          1.38              0.56             -0.25           0.6
pidigits                           0.45              0.33              0.28           0.4
hg_startup                         0.32              2.07             -1.41           0.3
rietveld                           0.05              0.91             -0.43           0.2
tornado_http                       2.34             -0.92             -1.27           0.1
call_method_unknown                0.72              1.26             -1.85           0.0
raytrace                          -0.35             -0.75              0.94          -0.1
regex_effbot                       1.97             -1.18             -2.57          -0.6
fastunpickle                      -1.65              0.5              -0.88          -0.7
nqueens                           -2.24             -1.53             -0.81          -1.5
fastpickle                        -0.74              1.98             -6.26          -1.7


Thanks,
Vamsi


From: gvanros...@gmail.com [mailto:gvanros...@gmail.com] On Behalf Of Guido van 
Rossum
Sent: Tuesday, May 19, 2015 1:59 PM
To: Cohn, Robert S
Cc: R. David Murray (r.david.mur...@murrayandwalker.com)
Subject: Re: meeting at PyCon

Hi Robert and David,
I just skimmed that thread. There were a lot of noises about backporting it to 
2.7 but the final message on the topic, by Antoine, claimed it was too late for 
2.7. However, that was before we had announced the EOL extension of 2.7 till 
2020, and perhaps we were also in denial about 3.x uptake vs. 2.x. So I think 
it's definitively worth bringing this up. I would start with a post on 
python-dev linking to the source code for your patch, and adding a message to 
the original tracker issue too (without

[issue4753] Faster opcode dispatch on gcc

2010-07-19 Thread Antoine Pitrou

Antoine Pitrou  added the comment:

This is too late for 2.x now, closing.

--
resolution: accepted -> fixed
status: open -> closed


[issue4753] Faster opcode dispatch on gcc

2010-05-20 Thread Skip Montanaro

Changes by Skip Montanaro:


--
nosy:  -skip.montanaro


[issue4753] Faster opcode dispatch on gcc

2009-07-18 Thread Michele Dionisio

Michele Dionisio  added the comment:

I have patched the code of Python 3.1 to use the computed goto technique with Visual Studio as well. The performance result is not good (I really don't know why), but it is a good workaround for using computed gotos on Windows.
The only difference is that the opcode_targets vector is filled at run time.

--
nosy: +mdionisio
versions: +Python 3.1 -Python 2.7
Added file: http://bugs.python.org/file14521/newfile.zip


[issue4753] Faster opcode dispatch on gcc

2009-07-02 Thread Jesús Cea Avión

Changes by Jesús Cea Avión:


--
nosy: +jcea


[issue4753] Faster opcode dispatch on gcc

2009-04-11 Thread Alexandre Vassalotti

Changes by Alexandre Vassalotti:


--
nosy:  -alexandre.vassalotti


[issue4753] Faster opcode dispatch on gcc

2009-04-11 Thread Mark Dickinson

Changes by Mark Dickinson:


--
nosy:  -marketdickinson


[issue4753] Faster opcode dispatch on gcc

2009-04-09 Thread Andrew I MacIntyre

Andrew I MacIntyre  added the comment:

Antoine, in my testing the "loss" of the HAS_ARG() optimisation in my
patch appears to have negligible cost on i386, but starts to look
significant on amd64.

On an Intel E8200 CPU running FreeBSD 7.1 amd64, with gcc 4.2.1 and the
3.1a2 sources, the computed goto version is ~8% faster (average time of
all rounds) for pybench (with warp factor set to 3 rather than the
default 10, to get the round time up over 10s) than without computed
gotos.  With my patch applied, the computed goto version is ~5.5% faster
than without computed gotos by the same measure.  On this platform,
Pystone rates at ~86k (no computed gotos), ~85k (computed gotos) and
~82k (computed gotos + my patch).

For comparison, this machine running Windows XP (32 bit) with the
python.org builds rates ~92k pystones for 2.6.1 and ~81k for 3.1a2. 
Pybench isn't distributed in the MSI installers :-(


[issue4753] Faster opcode dispatch on gcc

2009-03-31 Thread Antoine Pitrou

Antoine Pitrou  added the comment:

Andrew, your patch disables the optimization that HAS_ARG(op) is a
constant when op is known by the compiler (that is, inside a
"TARGET_##op" label), so I'd rather keep the version which is currently
in SVN.

--
versions:  -Python 3.1


[issue4753] Faster opcode dispatch on gcc

2009-03-31 Thread Marc-Andre Lemburg

Marc-Andre Lemburg  added the comment:

On 2009-03-31 03:19, A.M. Kuchling wrote:
> A.M. Kuchling  added the comment:
> 
> Is a backport to 2.7 still planned?

I hope it is.


[issue4753] Faster opcode dispatch on gcc

2009-03-30 Thread A.M. Kuchling

A.M. Kuchling  added the comment:

Is a backport to 2.7 still planned?

--
nosy: +akuchling


[issue4753] Faster opcode dispatch on gcc

2009-03-22 Thread Andrew I MacIntyre

Andrew I MacIntyre  added the comment:

Out of interest, the attached patch against the py3k branch at r70516
cleans up the threaded code changes a little:
- gets rid of TARGET_WITH_IMPL macro;
- TARGET(op) is followed by a colon, so that it looks like a label (for
editors that make use of that).

On my systems (all older AMD with old versions of gcc), this patch has
performance slightly better than SVN r70516, and performance is
usually very close to the NO_COMPUTED_GOTOS build.

--
nosy: +aimacintyre
Added file: http://bugs.python.org/file13392/ceval.c.threadcode-tidyup.patch


[issue4753] Faster opcode dispatch on gcc

2009-02-20 Thread Joshua Bronson

Changes by Joshua Bronson:


--
nosy: +jab


[issue4753] Faster opcode dispatch on gcc

2009-02-07 Thread Skip Montanaro

Skip Montanaro  added the comment:

Antoine> Skip, removing the colon doesn't work if the macro adds code
Antoine> after the colon :)

When I looked I thought both TARGET and TARGET_WITH_IMPL ended with a colon,
but I see that's not the case.  How about removing TARGET_WITH_IMPL and just
including the goto explicitly?  There are only a few places where
TARGET_WITH_IMPL is used.

Skip


[issue4753] Faster opcode dispatch on gcc

2009-02-07 Thread Antoine Pitrou

Antoine Pitrou  added the comment:

Skip, removing the colon doesn't work if the macro adds code after the
colon :)


[issue4753] Faster opcode dispatch on gcc

2009-02-04 Thread Gabriel Genellina

Gabriel Genellina  added the comment:

> Might I suggest that the TARGET and TARGET_WITH_IMPL macros not 
> include the trailing colon? 

Yes, please!

--
nosy: +gagenellina


[issue4753] Faster opcode dispatch on gcc

2009-02-03 Thread Skip Montanaro

Skip Montanaro  added the comment:

This has been checked in, right?  Might I suggest that the TARGET and
TARGET_WITH_IMPL macros not include the trailing colon?  I think that
will make it more friendly toward "smart" editors such as Emacs' C
mode.  I definitely get better indentation with

TARGET(NOP):

than with

TARGET(NOP)

S
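
For context, the macros under discussion have roughly the following shape (an approximation from memory, not the exact committed text):

/* TARGET ends with "case op:", hence the suggestion to move the colon
 * out to the call site. */
#define TARGET(op) \
    TARGET_##op: \
    case op:

/* TARGET_WITH_IMPL emits a goto *after* the colon, which is why simply
 * dropping the trailing colon from both macros is not a mechanical
 * change (see Antoine's follow-up above). */
#define TARGET_WITH_IMPL(op, impl) \
    TARGET_##op: \
    case op: \
    goto impl;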


[issue4753] Faster opcode dispatch on gcc

2009-01-31 Thread Mark Dickinson

Mark Dickinson  added the comment:

> The test failure also happens on trunk, it may be related to the recent
> tk changes.

Yes; sorry---I didn't mean to suggest that the test failures were in any 
way related to the opcode dispatch stuff.  Apart from the ttk teething 
difficulties, there's a weird 'Unknown signal 32' error that's been going 
on on the gentoo buildbot for many months now.  But that's a separate 
issue (issue #4970, to be precise).


[issue4753] Faster opcode dispatch on gcc

2009-01-31 Thread Antoine Pitrou

Antoine Pitrou  added the comment:

> Square brackets added in r69133.  The gentoo x86 3.x buildbot seems to be 
> passing the compile stage now.  (Though not the test stage, of course:  
> one can't have everything!)

The test failure also happens on trunk, it may be related to the recent
tk changes.


[issue4753] Faster opcode dispatch on gcc

2009-01-31 Thread Mark Dickinson

Mark Dickinson  added the comment:

Square brackets added in r69133.  The gentoo x86 3.x buildbot seems to be 
passing the compile stage now.  (Though not the test stage, of course:  
one can't have everything!)


[issue4753] Faster opcode dispatch on gcc

2009-01-30 Thread Antoine Pitrou

Antoine Pitrou  added the comment:

Mark:

"""Are there any objections to me adding a couple of square brackets to
this line to turn the argument of join into a list comprehension?"""

No problems for me. You might also add to the top comments of the file
that it is 2.3-compatible.


[issue4753] Faster opcode dispatch on gcc

2009-01-30 Thread Mark Dickinson

Mark Dickinson  added the comment:

Sorry:  ignore that last.  Python/opcode_targets.h is already part of the 
distribution.  I don't know what I was doing wrong.


[issue4753] Faster opcode dispatch on gcc

2009-01-30 Thread Mark Dickinson

Mark Dickinson  added the comment:

One other thought:  it seems that as a result of this change, the py3k 
build process now depends on having some version of Python already 
installed;  before this, it didn't.  Is this true, or am I misinterpreting 
something?

Might it be worth adding the file Python/opcode_targets.h to the 
distribution to avoid this problem?


[issue4753] Faster opcode dispatch on gcc

2009-01-30 Thread Mark Dickinson

Mark Dickinson  added the comment:

The x86 gentoo buildbot is failing to compile, with error:

/Python/makeopcodetargets.py ./Python/opcode_targets.h
  File "./Python/makeopcodetargets.py", line 28
f.write(",\n".join("\t&&%s" % s for s in targets))
  ^
SyntaxError: invalid syntax
make: *** [Python/opcode_targets.h] Error 1

I suspect that it's because the system Python on this buildbot is Python 
2.3, which doesn't understand generator expressions.  Are there any 
objections to me adding a couple of square brackets to this line to turn 
the argument of join into a list comprehension?
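
The change being proposed would presumably be just a pair of brackets on that line:

f.write(",\n".join(["\t&&%s" % s for s in targets]))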

--
nosy: +marketdickinson


[issue4753] Faster opcode dispatch on gcc

2009-01-30 Thread Kevin Watters

Kevin Watters  added the comment:

Does anyone know the equivalent ICC command line option for GCC's -fno-
gcse? (Or if it has one?) I can't find a related option in the docs.

It looks like ICC hits the same goto-combining problem that was mentioned:
without changing any options, I applied pitrou_dispatch_2.7.patch to
release-26maint, and pybench reported literally no difference, +0.0%.

Even if stock Python is built with MSVC, devs like myself who ship 
Python would like to see the benefit.


[issue4753] Faster opcode dispatch on gcc

2009-01-28 Thread Antoine Pitrou

Antoine Pitrou  added the comment:

For the record, I've compiled py3k on an embarrassingly fast Core2-based
server (Xeon E5410), and the computed gotos option gives a 16% speedup
on pybench and pystone.

(with gcc 4.3.2 in 64-bit mode)


[issue4753] Faster opcode dispatch on gcc

2009-01-27 Thread Gregory P. Smith

Gregory P. Smith  added the comment:

I'll take on the two remaining tasks for this:

* add configure magic to detect when the compiler supports this so
  that it can default to --with-computed-gotos on modern systems.
* commit the backport to 2.7 trunk.

--
assignee:  -> gregory.p.smith
status: pending -> open
versions: +Python 3.1


[issue4753] Faster opcode dispatch on gcc

2009-01-26 Thread Kevin Watters

Changes by Kevin Watters:


--
nosy: +kevinwatters


[issue4753] Faster opcode dispatch on gcc

2009-01-26 Thread Paolo 'Blaisorblade' Giarrusso

Paolo 'Blaisorblade' Giarrusso  added the comment:

-fno-gcse is controversial.
Even though it might avoid jump sharing, the impact of that option has to
be measured: common subexpression elimination allows omitting some
recalculations, so disabling global CSE might have a negative impact on
other code.

It might be better to disable GCSE only for the interpreter loop,
but that would make some intra-file inlining impossible.


[issue4753] Faster opcode dispatch on gcc

2009-01-25 Thread Antoine Pitrou

Antoine Pitrou  added the comment:

Committed in py3k in r68924.
I won't backport it to trunk myself but it should be easy enough,
provided people are interested.

--
resolution:  -> accepted
stage: patch review -> committed/rejected
status: open -> pending
versions:  -Python 2.6, Python 3.0, Python 3.1


[issue4753] Faster opcode dispatch on gcc

2009-01-24 Thread Jeffrey Yasskin

Jeffrey Yasskin  added the comment:

In the comment, you might mention both -fno-crossjumping and -fno-gcse.
-fno-crossjumping's description looks like it ought to prevent combining
computed gotos, but
http://gcc.gnu.org/onlinedocs/gcc-4.3.3/gcc/Optimize-Options.html says
-fno-gcse actually does it, and in my brief tests, the manual is
actually correct (compiling with just -fno-crossjumping combined gotos
anyway).

Otherwise, threadedceval6.patch looks good to submit to me.
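
One way to check that a given compiler actually kept the dispatch jumps separate (rather than merging them) is to count the indirect jumps in the generated code; a rough sketch, assuming an in-tree build where pyconfig.h has already been generated:

# With working computed gotos there should be roughly one indirect jump
# per opcode handler, not a single shared one.
gcc -O2 -fno-gcse -I. -IInclude -c Python/ceval.c -o ceval.o
objdump -d ceval.o | grep 'jmp' | grep -c '\*'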


[issue4753] Faster opcode dispatch on gcc

2009-01-21 Thread Stefan Ring

Stefan Ring  added the comment:

Hi,

I ported threadedceval6.patch to Python 2.5.4, in case anyone is
interested...

Note that you need to run autoconf and autoheader.

--
nosy: +Ringding
Added file: http://bugs.python.org/file12824/threadedceval6-py254.patch


[issue4753] Faster opcode dispatch on gcc

2009-01-16 Thread Antoine Pitrou

Changes by Antoine Pitrou:


Removed file: http://bugs.python.org/file12767/threadedceval6.patch


[issue4753] Faster opcode dispatch on gcc

2009-01-16 Thread Antoine Pitrou

Antoine Pitrou  added the comment:

Thanks Skip, it makes sense... so here is a patch without the configure
script.

(I wonder however if those huge configure changes, when checked into the
SVN, could break something silently somewhere)

Added file: http://bugs.python.org/file12769/threadedceval6.patch


[issue4753] Faster opcode dispatch on gcc

2009-01-16 Thread Skip Montanaro

Skip Montanaro  added the comment:

Antoine> (sorry, the patch is very long because it seems running
Antoine> autoconf changes a lot of things in the configure script)

Normal practice is to not include the configure script in such patches and
indicate to people that they will need to run autoconf.

Skip


[issue4753] Faster opcode dispatch on gcc

2009-01-16 Thread Antoine Pitrou

Antoine Pitrou  added the comment:

Here is an updated patch with a dedicated configure option
(--with-computed-gotos, disabled by default), rather than a compiler
detection switch.

(sorry, the patch is very long because it seems running autoconf changes
a lot of things in the configure script)

Added file: http://bugs.python.org/file12767/threadedceval6.patch


[issue4753] Faster opcode dispatch on gcc

2009-01-13 Thread Paolo 'Blaisorblade' Giarrusso

Paolo 'Blaisorblade' Giarrusso  added the comment:

#4715 is interesting, but it is not really about superinstructions.
Superinstructions are not created because they make sense individually; any
common sequence of opcodes can become a superinstruction, just for the sake
of saving dispatches. And the creation can even be dynamic!

However, when I have substantial time for coding, I'd like to spend
it experimenting with subroutine threading. vmgen's author despises it,
but nowadays it has probably become even faster, as discussed in the article
"Context threading: A flexible and efficient dispatch technique for
virtual machine interpreters".


[issue4753] Faster opcode dispatch on gcc

2009-01-13 Thread Antoine Pitrou

Antoine Pitrou  added the comment:

As for superinstructions, you can find an example here: #4715.


[issue4753] Faster opcode dispatch on gcc

2009-01-12 Thread Paolo 'Blaisorblade' Giarrusso

Paolo 'Blaisorblade' Giarrusso  added the comment:

OK, so vmgen adds little more than direct threading instead of indirect
threading.

Since the purpose of superinstructions is to eliminate dispatch
overhead, and that matters most when little actual work is done,
what about all the opcodes which unconditionally end with FAST_DISPATCH and
are common? DUP_TOP, POP_TOP, DUP_TOPX(2,3) and other stack-handling stuff
which can't fail? Then any of them + XXX could be fused without error-handling
problems. Removing the handling of DUP_TOPX{4,5} is implied; you shouldn't
check the functionality of the compiler during interpretation - indeed, even
the idea of using a parameter for that is a waste. Having DUP_TOPX2 and
DUP_TOPX3, like the JVM, is just simpler.

> Replication would be trickier since we want the bytecode generation to
be deterministic, but it's probably doable too.

Bytecode conversion during I/O is perfectly fine, to convert from the
actual bytecode to one of the chosen replicas. Conversion in a rescan
pass can also be OK (less cache-friendly though, so if it's easy to
avoid, please do).


[issue4753] Faster opcode dispatch on gcc

2009-01-12 Thread Jeffrey Yasskin

Jeffrey Yasskin  added the comment:

@Paolo: I'm going to be looking into converting more common sequences
into superinstructions. We only have LOAD_CONST+XXX so far. The others
are difficult because vmgen doesn't provide easy ways to deal with error
handling, but Jakob and I have come up with a couple ideas to get around
that.

Replication would be trickier since we want the bytecode generation to
be deterministic, but it's probably doable too. I'll post any
significant results I get.


[issue4753] Faster opcode dispatch on gcc

2009-01-12 Thread Jeffrey Yasskin

Jeffrey Yasskin  added the comment:

I've left some line-by-line comments at
http://codereview.appspot.com/11905. Sorry if there was already a
Rietveld thread; I didn't see one.


[issue4753] Faster opcode dispatch on gcc

2009-01-12 Thread Paolo 'Blaisorblade' Giarrusso

Paolo 'Blaisorblade' Giarrusso  added the comment:

A couple percent is maybe not worth vmgen-ing. But even though I'm not a
vmgen expert, I have read many papers by Ertl about superinstructions and
replication, so the expected speedup from vmgen'ing is much bigger.
Is there some more advanced feature we are not using and could use?
Have the previous static predictions been converted to superinstructions?
Have other common sequences been treated like that?
Is there an existing discussion on this?


[issue4753] Faster opcode dispatch on gcc

2009-01-12 Thread Jeffrey Yasskin

Jeffrey Yasskin  added the comment:

Here's the vmgen-based patch for comparison. Again, it passes all the
tests, but isn't complete outside of that and (unless consensus develops
that a couple percent is worth requiring vmgen) shouldn't distract from
reviewing Antoine's patch. I'll look over threadedceval5.patch in detail
next.

Added file: http://bugs.python.org/file12705/vmgen_2.7.patch


[issue4753] Faster opcode dispatch on gcc

2009-01-11 Thread Gregory P. Smith

Gregory P. Smith  added the comment:

Benchmarking pitrou_dispatch_2.7.patch applied to trunk r68522 on a 32-
bit Efficeon (x86) using gcc 4.2.4-1ubuntu3 yields a 10% pybench 
speedup.


[issue4753] Faster opcode dispatch on gcc

2009-01-11 Thread Jeffrey Yasskin

Jeffrey Yasskin  added the comment:

Here's a port of threadedceval5.patch to trunk. It passes the tests. I
haven't benchmarked this exact patch, but on one Intel Core2, a similar
patch got an 11%-14% speedup (on 2to3 and pybench).

I've also cleaned up Jakob Sievers' vmgen patch (representing
forth-style dispatch) a bit so that it passes all the tests, and on the
same machine it got a 13%-17% speedup. The vmgen patch is not quite at
feature parity (it throws out support for LLTRACE and a couple other
#defines), and there are fairly good arguments against committing it to
python at all (it requires installing and modifying vmgen to build), but
I'll post it after I've ported it to trunk.

Re skip and paolo: JITting and machine-specific assembly will probably
be important to speeding up Python in the long run, but they'll also
take a long while to get right, so we shouldn't let them distract us
from committing the dispatch optimization.

Added file: http://bugs.python.org/file12687/pitrou_dispatch_2.7.patch


[issue4753] Faster opcode dispatch on gcc

2009-01-11 Thread Andrew Bennetts

Changes by Andrew Bennetts:


--
nosy: +spiv


[issue4753] Faster opcode dispatch on gcc

2009-01-10 Thread Alexander Belopolsky

Changes by Alexander Belopolsky:


--
nosy: +belopolsky


[issue4753] Faster opcode dispatch on gcc

2009-01-10 Thread Antoine Pitrou

Antoine Pitrou  added the comment:

> (First culprit might
> be license/compatibility problems I guess, but the speedup would be
> worth the time to fix the troubles IMHO).

That would be the obvious reason IMO. And Intel is the only one who can
"fix the troubles".


[issue4753] Faster opcode dispatch on gcc

2009-01-10 Thread Paolo 'Blaisorblade' Giarrusso

Paolo 'Blaisorblade' Giarrusso  added the comment:

> Same for CPU-specific tuning: I don't think we want to ship Python
with compiler flags which depend on the particular CPU being used.

I wasn't suggesting this - but since different CPUs have different
optimization rules, something like "oh, 20% performance slowdown on
PowerPC" or "on P4" is important to know (and yeah, configure options
are a good solution).

What is the barrier for platform-specific tricks, as long as the code
is still portable? I'd like to experiment with __builtin_expect and with
manual alignment (through 'asm volatile(".p2align 4")' on x86/x86_64
with GAS - PPC might need a different alignment probably).

All hidden through macros to make it disappear on unsupported platforms,
without any configure option for them (there shouldn't be the need for
that).

> I doubt many people compile Python with icc, honestly.

Yep :-(. Why don't distributors do it? (First culprit might
be license/compatibility problems I guess, but the speedup would be
worth the time to fix the troubles IMHO).


[issue4753] Faster opcode dispatch on gcc

2009-01-10 Thread Marc-Andre Lemburg

Marc-Andre Lemburg  added the comment:

On 2009-01-10 10:55, Antoine Pitrou wrote:
> Antoine Pitrou  added the comment:
> 
>> It looks like we still didn't manage, and since ICC is the best 
>> compiler out there, this matters.
> 
> Well, from the perspective of Python, what matters mostly is the
> commonly used compilers (that is, gcc and MSVC). I doubt many people
> compile Python with icc, honestly.

Agreed. Our main targets should be GCC for Linux and MSVC for Windows.

On other platforms such as Solaris and AIX, the native vendor compilers
are commonly used for compiling Python.

That said, with a configure option to turn the optimization on and
off, there shouldn't be any problem with slowdowns.

> Same for CPU-specific tuning: I don't think we want to ship Python with
> compiler flags which depend on the particular CPU being used.

Certainly not in the binaries we release on python.org.

Of course, people are still free to setup OPT to have the compiler
generate CPU specific code.


[issue4753] Faster opcode dispatch on gcc

2009-01-10 Thread Antoine Pitrou

Antoine Pitrou  added the comment:

> It looks like we still didn't manage, and since ICC is the best 
> compiler out there, this matters.

Well, from the perspective of Python, what matters mostly is the
commonly used compilers (that is, gcc and MSVC). I doubt many people
compile Python with icc, honestly.

Same for CPU-specific tuning: I don't think we want to ship Python with
compiler flags which depend on the particular CPU being used.


[issue4753] Faster opcode dispatch on gcc

2009-01-10 Thread Antoine Pitrou

Antoine Pitrou  added the comment:

> @pitrou:
> > The machine I got the 15% speedup on is in 64-bit mode with gcc
> 4.3.2.
> 
> Which is the processor? I guess the bigger speedups should be on 
> Pentium4, since it has the bigger mispredict penalties.

Athlon X2 3600+.


[issue4753] Faster opcode dispatch on gcc

2009-01-10 Thread Paolo 'Blaisorblade' Giarrusso

Paolo 'Blaisorblade' Giarrusso  added the comment:

The standing question is still: can we get ICC to produce the expected
output? It looks like we still haven't managed, and since ICC is the best
compiler out there, this matters.
SunCC has some problems: even though it doesn't do jump sharing, it seems
that one doesn't get the speedups - I guess that on most platforms we
should select the most common alternative for interpreters (i.e. no
switch, one jump table, given by threadedceval5.patch +
abstract-switch-reduced.diff).

On core platforms we can spend time on fine-tuning - and the definition 
of "core platforms" is given by "do developers want to test for that?".

When that's fixed, I think that we just have to choose the simpler form 
and merge that.

@alexandre:
[about removing the switch]
> There is no speed difference on pybench on x86; on x86-64, the code 
is slower due to the opcode fetching change.

Actually, on my machine it looks like the difference is caused by the 
different layout caused by switch removal, or something like that, 
because fixing the opcode fetching doesn't make a difference here (see 
below).

Indeed, I did my benchmarking duties. Results are that 
abstract-switch-reduced.diff (the one removing the switch) gives a 1-3% 
slowdown, and that all the others don't make a significant difference. 
The differences in the assembly output seem to be due to a different 
code layout for some branches, I didn't have a closer look.

However, experimenting with -falign-labels=16 can give a small speedup, 
I'm trying to improve the results (what I actually want is to align 
just the opcode handlers, I'll probably do that by hand).

reenable-static-prediction can give either a slowdown or a speedup by 
around 1%, i.e. around the statistical noise.

Note that on my machine, I get only a 10% speedup with the base patch, 
and that is more reasonable here. In the original thread on PyPy-dev, I 
got a 20% one with the Python interpreter I built for my student 
project, since that one is faster* (by a 2-3x factor, like PyVM), so 
the dispatch cost is more significant, and reducing it has a bigger 
impact. In fact, I couldn't believe that Python got the same speedup.

This is a Core 2 Duo T7200 (Merom) in 64bit mode with 4MB of L2 cache, 
and since it's a laptop I expect it to have slower RAM than a desktop.

@alexandre:
> The patch make a huge difference on 64-bit Linux. I get a 20% 
speed-up and the lowest run time so far. That is quite impressive!
Which processor is that?

@pitrou:
> The machine I got the 15% speedup on is in 64-bit mode with gcc
4.3.2.

Which is the processor? I guess the bigger speedups should be on 
Pentium4, since it has the bigger mispredict penalties.


*DISCLAIMER: the interpreter of our group (me and Sigurd Meldgaard) is 
not complete, has some bugs, and the source code has not yet been 
published, so discussion about why it is faster shall not happen here - 
I want to avoid any flame.
I believe it's not because of skipped runtime checks or such stuff, but 
because we used garbage collection instead of refcounting, indirect  
threading and tagged integers, but I don't have time to discuss that 
yet.
The original thread on pypy-dev has some insights if you are interested 
on this.


[issue4753] Faster opcode dispatch on gcc

2009-01-09 Thread Paolo 'Blaisorblade' Giarrusso

Paolo 'Blaisorblade' Giarrusso  added the comment:

@ ajaksu2
> Applying your patches makes no difference with gcc 4.2 and gives a
> barely noticeable (~2%) slowdown with icc.
"Your patches" is something quite unclear :-)
Which are the patch sets you are comparing?
And on 32 or 64 bits? But does Yonah support 64 bits? IIRC no, but I'm
not sure.
I would be surprised from slowdowns for restore-old-oparg-load.diff,
really surprised.
And I would be just surprised by slowdowns on
reenable-static-prediction.diff.
Also, about ICC output, we still need to ensure that it's properly
compiled (see above the instructions for counting "jmp *" or similar).
In the measurements above, ICC did miscompile the patch with the switch.
By "properly compiled" I mean that separate indirect branches are
generated, instead of just one.

> These results are from a
> Celeron M 410 (Core Solo Yonah-based), so it's a rather old platform to
> run benchmarks on.

Not really - at the very least we should listen to results on Pentium 4,
Core (i.e. Yonah) and Core 2, and I would also add Pentium3/Pentium M to
represent the P6 family.
Anyway, I have to do my benchmarks on this, I hope this weekend I'll
have time.


[issue4753] Faster opcode dispatch on gcc

2009-01-09 Thread Daniel Diniz

Daniel Diniz  added the comment:

Paolo,
Applying your patches makes no difference with gcc 4.2 and gives a
barely noticeable (~2%) slowdown with icc. These results are from a
Celeron M 410 (Core Solo Yonah-based), so it's a rather old platform to
run benchmarks on.


[issue4753] Faster opcode dispatch on gcc

2009-01-07 Thread Gregory P. Smith

Changes by Gregory P. Smith:


--
nosy: +gregory.p.smith


[issue4753] Faster opcode dispatch on gcc

2009-01-07 Thread Paolo 'Blaisorblade' Giarrusso

Paolo 'Blaisorblade' Giarrusso  added the comment:

@skip:
In simple words, the x86 call:
  call 0x2000
placed at address 0x1000 becomes:
  call %rip + 0x1000

RIP holds the instruction pointer, which will be 0x1000 in this case
(actually, I'm ignoring the detail that when executing the call, RIP
points to the first byte of the next instruction).

If I execute the same instruction from a different location (i.e.
different RIP), things will break. So, only code for opcodes without
real calls, nor access to globals can be copied like this (inlines are OK).
With refcounting, not even POP_TOP is safe since it can call
destructors. DUP_TOP is still safe, I guess.


[issue4753] Faster opcode dispatch on gcc

2009-01-07 Thread Skip Montanaro

Skip Montanaro  added the comment:

Paolo> Various techniques allow to create binary code from the
Paolo> interpreter binary, by just pasting together the code for the
Paolo> common interpreters cases and producing calls to the other. But,
Paolo> guess what, on most platforms (except plain x86, but including
Paolo> x86_64 and maybe x86 for the shared library case) this does not
Paolo> work if the copied code includes function calls (on x86_64 that's
Paolo> due to RIP-relative addressing, and on similar issues on other
Paolo> platforms).

I don't understand.  I know little or nothing about the details of various
instruction set architectures or linkage methods.  Can you break it down
into a simple example?

Skip


[issue4753] Faster opcode dispatch on gcc

2009-01-07 Thread Paolo 'Blaisorblade' Giarrusso

Paolo 'Blaisorblade' Giarrusso  added the comment:

I finally implemented my suggestion for the switch elimination.
On top of threadedceval5.patch, apply abstract-switch-reduced.diff and
then restore-old-oparg-load.diff to test it.

This way, only computed gotos are used. I would like those who had
miscompilation problems, or who didn't get any advantage from the patch, to
try compiling and benchmarking this version.

I've also been able to reenable static prediction (PREDICT_*) on top of
computed gotos, and that may help CPU prediction even more (the BTB for
the computed goto will be used to predict the 2nd most frequent target);
obviously it may instead cause a slowdown, I'd need stats on opcode
frequency to try guessing in advance (I'll try gathering them later
through DYNAMIC_EXECUTION_PROFILE).

Apply reenable-static-prediction.diff on top of the rest to get this.
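
For context, the static-prediction machinery being re-enabled is built on macros of roughly this shape in ceval.c (quoted from memory, not from the attached diff):

/* PREDICT(op) at the end of a handler skips the main dispatch entirely
 * when the next opcode is the statically expected one. */
#define PREDICT(op)             if (*next_instr == op) goto PRED_##op
#define PREDICTED(op)           PRED_##op: next_instr++
#define PREDICTED_WITH_ARG(op)  PRED_##op: oparg = PEEKARG(); next_instr += 3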

I'll have to finish other stuff before closing everything to run
pybench, I can't get stable timings otherwise, so it'll take some time
(too busy, sorry). However I ran the check for regressions and they show
none.


abstract-switch-reduced.diff is the fixed abstract-switch.diff -
actually there was just one hunk which changed the handling of f_lasti,
and that looked extraneous. See the end of the message.

--- a/Python/ceval.c    Thu Jan 01 23:54:01 2009 +0100
+++ b/Python/ceval.c    Sun Jan 04 14:21:16 2009 -0500
@@ -1063,12 +1072,12 @@
}

fast_next_opcode:
-   f->f_lasti = INSTR_OFFSET();

/* line-by-line tracing support */

if (_Py_TracingPossible &&
tstate->c_tracefunc != NULL && !tstate->tracing) {
+   f->f_lasti = INSTR_OFFSET();
/* see maybe_call_line_trace
   for expository comments */
f->f_stacktop = stack_pointer;

Added file: http://bugs.python.org/file12635/reenable-static-prediction.diff


[issue4753] Faster opcode dispatch on gcc

2009-01-07 Thread Paolo 'Blaisorblade' Giarrusso

Changes by Paolo 'Blaisorblade' Giarrusso:


Added file: http://bugs.python.org/file12634/restore-old-oparg-load.diff


[issue4753] Faster opcode dispatch on gcc

2009-01-07 Thread Paolo 'Blaisorblade' Giarrusso

Changes by Paolo 'Blaisorblade' Giarrusso:


Added file: http://bugs.python.org/file12633/abstract-switch-reduced.diff


[issue4753] Faster opcode dispatch on gcc

2009-01-06 Thread Paolo 'Blaisorblade' Giarrusso

Paolo 'Blaisorblade' Giarrusso  added the comment:

@pitrou:

Argh, reference counting hinders even that?

I just discovered another problem caused by refcounting.

Various techniques allow creating binary code from the interpreter
binary, by just pasting together the code for the common interpreter
cases and producing calls to the others. But, guess what, on most
platforms (except plain x86, but including x86_64 and maybe x86 for the
shared library case) this does not work if the copied code includes
function calls (on x86_64 that's due to RIP-relative addressing, and to
similar issues on other platforms).

So, Python could not even benefit from that! That's a real pity...
I'll have to try with subroutine threading, to see if that's faster than
indirect threading on current platforms or if it is still slower.


[issue4753] Faster opcode dispatch on gcc

2009-01-06 Thread Antoine Pitrou

Antoine Pitrou  added the comment:

FWIW, I have made a quick attempt at removing the f->f_lasti assignment
in the few places where it could be removed, but it didn't make a
difference on my machine. The problem being that there are very few
places where it is legitimate to remove the assignment (even a call to
Py_DECREF can invoke arbitrary external code through destructors, so
even a very simple opcode such as POP_TOP needs the f->f_lasti assignment).

Another approach would be to create aggregate opcodes, e.g.
BINARY_ADD_FAST would combine LOAD_FAST and BINARY_ADD. That's
orthogonal to the current issue though.
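
A hypothetical sketch of such an aggregate opcode (BINARY_ADD_FAST is not a real CPython opcode; GETLOCAL, TOP, SET_TOP and DISPATCH are the existing ceval.c helpers):

/* Hypothetical fused opcode: one dispatch pays for what LOAD_FAST
 * followed by BINARY_ADD would otherwise do in two.  Error handling is
 * elided; a real version would also check for an unbound local. */
TARGET(BINARY_ADD_FAST)
{
    PyObject *right = GETLOCAL(oparg);          /* the LOAD_FAST half */
    PyObject *left = TOP();
    PyObject *sum = PyNumber_Add(left, right);  /* the BINARY_ADD half */
    Py_DECREF(left);
    SET_TOP(sum);
    if (sum == NULL)
        break;                                  /* propagate the error */
    DISPATCH();
}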


[issue4753] Faster opcode dispatch on gcc

2009-01-05 Thread Jeffrey Yasskin

Changes by Jeffrey Yasskin:


--
nosy: +collinwinter, jyasskin


[issue4753] Faster opcode dispatch on gcc

2009-01-05 Thread Antoine Pitrou

Antoine Pitrou  added the comment:

On Monday, 5 January 2009, at 02:39, Paolo 'Blaisorblade' Giarrusso
wrote:
> About f->last_i, when I have time I want to try optimizing it. Somewhere
> you can be sure it's not going to be used.

There are lots of places which can call into arbitrary Python code. A
few opcodes could be optimized for sure.


[issue4753] Faster opcode dispatch on gcc

2009-01-04 Thread Paolo 'Blaisorblade' Giarrusso

Paolo 'Blaisorblade' Giarrusso  added the comment:

@alexandre: if you add two labels per opcode and two dispatch tables,
one before (like now) and one after the parameter fetch (where we have
the 'case'), you can keep the same speed.
And under the hood we also had two dispatch tables before, with the
switch, so it's not a big deal; finally, the second table is only used
in the slow path (i.e. EXTENDED_ARG, or when additional checks are needed).

About f->last_i, when I have time I want to try optimizing it. Somewhere
you can be sure it's not going to be used.
But you have some changes about that in the abstract-switch patch, I
think that was not intended?


[issue4753] Faster opcode dispatch on gcc

2009-01-04 Thread Alexandre Vassalotti

Alexandre Vassalotti  added the comment:

> I managed to remove switch pretty easily by moving opcode fetching
> in the FAST_DISPATCH macro and abstracting the control flow of the
> switch.

Here is the diff against threadceval5.patch.

Added file: http://bugs.python.org/file12584/abstract-switch.diff


[issue4753] Faster opcode dispatch on gcc

2009-01-04 Thread Alexandre Vassalotti

Alexandre Vassalotti  added the comment:

> Removing the switch won't be possible unless we change the semantic
> EXTENDED_ARG. In addition, I doubt the improvement, if any, would worth
> the increased complexity.

Never mind what I said. I managed to remove the switch pretty easily by
moving opcode fetching into the FAST_DISPATCH macro and abstracting the
control flow of the switch. There is no speed difference on pybench on
x86; on x86-64, the code is slower due to the opcode-fetching change.

> I patched ceval.c to minimize f->last_i manipulations in the dispatch
> code.  On x86, I got an extra 9% speed up on pybench. However, the
> patch is a bit clumsy and a few unit tests are failing. I will see
> if I can improve it and open a new issue if worthwhile.

Nevermind that too. I found out f->last_i can be accessed anytime via
frame.getlineno(). So, you cannot really change how f->last_i is used
like I did.


[issue4753] Faster opcode dispatch on gcc

2009-01-04 Thread Paolo 'Blaisorblade' Giarrusso

Paolo 'Blaisorblade' Giarrusso  added the comment:

@Skip: if one decides to generate binary code, there is no need to use
switches. Inline threading (also known as "code copying" in some
research papers) is what you are probably looking for:

http://blog.mozilla.com/dmandelin/2008/08/27/inline-threading-tracemonkey-etc/

For references and background on threading techniques mentioned there, see:

http://en.wikipedia.org/wiki/Threaded_code
http://www.complang.tuwien.ac.at/forth/threaded-code.html
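
As a complement to the computed-goto sketch earlier in this digest, here is a minimal, self-contained example of direct threading as described in those references (GNU C labels-as-values; the "bytecode" stream stores handler addresses directly, so dispatch needs no table lookup):

/* Minimal direct-threaded interpreter sketch (build: gcc -O2 demo.c) */
#include <stdio.h>

int main(void)
{
    void *program[4];
    void **ip = program;
    long acc = 0;

    /* "Compile" a tiny program: the code stream holds handler addresses. */
    program[0] = &&incr;
    program[1] = &&incr;
    program[2] = &&decr;
    program[3] = &&halt;

    goto **ip;                      /* start dispatching */
incr: acc++; ip++; goto **ip;
decr: acc--; ip++; goto **ip;
halt:
    printf("%ld\n", acc);           /* prints 1 */
    return 0;
}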


[issue4753] Faster opcode dispatch on gcc

2009-01-04 Thread Paolo 'Blaisorblade' Giarrusso

Paolo 'Blaisorblade' Giarrusso  added the comment:

@Alexandre:
> > So, can you try dropping the switch altogether, using always computed
> > goto and seeing how does the resulting code get compiled?

> Removing the switch won't be possible unless we change the semantics of
> EXTENDED_ARG. In addition, I doubt the improvement, if any, would be
> worth the increased complexity.
OK, it's time that I post code to experiment with that - there is no
need to break EXTENDED_ARG. And the point is to fight miscompilations.

> Do you actually mean the time spent interpreting bytecodes compared to
> the time spent in the other parts of Python? If so, your figures are
> wrong for CPython on x86-64. It is about 50% just like on x86 (when
> running pybench). With the patch, this drops to 35% on x86-64 and to 45%
> on x86.

More or less, I mean that, but I was making an example, and I made up
reasonable figures.
70%, or even more, just for _dispatch_ (i.e. just for the mispredicted
indirect jump), is valid for real-world Smalltalk interpreters for
instance, or for the ones in "The Structure and Performance of Efficient
Interpreters".
But when you say "interpreting opcodes", I do not know which part you
refer to: just the computed goto, or the whole code in the interpreter
function.

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com






[issue4753] Faster opcode dispatch on gcc

2009-01-04 Thread Ralph Corderoy

Ralph Corderoy  added the comment:

Regarding compressing the opcode table to make better use of the cache:
what if the most frequently occurring opcodes were placed together, e.g.
the opcodes were ordered by frequency, most frequent first? Just based
on a one-off static analysis of a body of code. A level-one cache line
can be, what, 64 bytes == 16 32-bit pointers.
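
If one wanted to derive such an ordering, a rough, self-contained sketch
follows. The frequency counts here are random stand-ins for what a real
one-off profile of a code corpus would provide, and an actual
renumbering would of course also require regenerating opcode.h and the
compiler/peephole tables:

    /* Order opcodes by (assumed) execution frequency so the hottest 16
     * dispatch-table entries fall into a single 64-byte cache line
     * (16 x 4-byte pointers on 32-bit x86).  Counts are invented. */
    #include <stdio.h>
    #include <stdlib.h>

    #define N_OPCODES 148        /* max valid opcode (147) + 1 */

    struct freq { int opcode; long count; };

    static int by_count_desc(const void *a, const void *b)
    {
        const struct freq *fa = a, *fb = b;
        return (fa->count < fb->count) - (fa->count > fb->count);
    }

    int main(void)
    {
        struct freq f[N_OPCODES];
        int renumber[N_OPCODES]; /* old opcode number -> new table slot */

        for (int i = 0; i < N_OPCODES; i++) {
            f[i].opcode = i;
            f[i].count = rand() % 100000;  /* stand-in for profiled counts */
        }
        qsort(f, N_OPCODES, sizeof f[0], by_count_desc);

        for (int rank = 0; rank < N_OPCODES; rank++)
            renumber[f[rank].opcode] = rank;

        /* slots 0..15 now hold the 16 most frequent opcodes */
        for (int rank = 0; rank < 16; rank++)
            printf("old opcode %3d -> slot %2d (count %ld)\n",
                   f[rank].opcode, renumber[f[rank].opcode], f[rank].count);
        return 0;
    }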

--
nosy: +ralph.corderoy




[issue4753] Faster opcode dispatch on gcc

2009-01-04 Thread Skip Montanaro

Skip Montanaro  added the comment:

I'm sure this is the wrong place to bring this up, but I had a
thought about simple JIT compilation coupled with the opcode
dispatch changes in this issue.

Consider this silly function:

>>> def f(a, b):
...   result = 0
...   while b:
... result += a
... b -= 1
...   return result
... 

which compiles to

  2           0 LOAD_CONST               1 (0)
              3 STORE_FAST               2 (result)

  3           6 SETUP_LOOP              32 (to 41)
        >>    9 LOAD_FAST                1 (b)
             12 JUMP_IF_FALSE           24 (to 39)
             15 POP_TOP

  4          16 LOAD_FAST                2 (result)
             19 LOAD_FAST                0 (a)
             22 INPLACE_ADD
             23 STORE_FAST               2 (result)

  5          26 LOAD_FAST                1 (b)
             29 LOAD_CONST               2 (1)
             32 INPLACE_SUBTRACT
             33 STORE_FAST               1 (b)
             36 JUMP_ABSOLUTE            9
        >>   39 POP_TOP
             40 POP_BLOCK

  6     >>   41 LOAD_FAST                2 (result)
             44 RETURN_VALUE

What if you built and compiled a "Mini Me" version of
PyEval_EvalFrameEx on-the-fly which only contained the prologue and
epilogue of the real function and a small switch statement which only
knew about the byte-code instructions used by f()?  Would the
compiler be better able to optimize the code?  Would the
instructions' placement nearer to each other provide better cache
behavior?  Would branch prediction by CPU be improved?

Another possibility would be to eliminate the for(;;) ... switch
altogether and just inline the code for the individual instructions.
It would help if the body of each bytecode instruction was
implemented as a macro, e.g.:

#define _POP_TOP() \
PREDICTED(POP_TOP); \
TARGET(POP_TOP) \
v = POP(); \
Py_DECREF(v); \
FAST_DISPATCH();

The above function could (lots of hand-waving here) be "compiled" to
something like

PyObject *
_MiniMe(PyFrameObject *f, int throwflag)
{
_PyEVAL_FRAMEEX_PROLOG

_LOAD_CONST(1)
_STORE_FAST(2)
_SETUP_LOOP(_41)
_9:
_LOAD_FAST(1)
_JUMP_IF_FALSE(_39)
_POP_TOP()
_LOAD_FAST(2)
_LOAD_FAST(0)
_INPLACE_ADD()
_STORE_FAST(2)
_26:
_LOAD_FAST(1)
_LOAD_CONST(2)
_INPLACE_SUBTRACT()
_STORE_FAST(1)
_JUMP_ABSOLUTE(_9)
_39:
_POP_TOP()
_POP_BLOCK()
_LOAD_FAST(2)
_RETURN_VALUE()

_PyEVAL_FRAMEEX_EPILOG
}

and the resulting binary code saved as an attribute of the code
object.  Presumably there would be some decision made about whether
to compile a function into this form (maybe only after it's been
called N times?).




[issue4753] Faster opcode dispatch on gcc

2009-01-04 Thread Facundo Batista

Changes by Facundo Batista :


--
nosy: +facundobatista




[issue4753] Faster opcode dispatch on gcc

2009-01-03 Thread Alexandre Vassalotti

Alexandre Vassalotti  added the comment:

Paolo wrote:
> So, can you try dropping the switch altogether, using always computed
> goto and seeing how the resulting code gets compiled?

Removing the switch won't be possible unless we change the semantics of
EXTENDED_ARG. In addition, I doubt the improvement, if any, would be
worth the increased complexity.

> To be absolutely clear: x86_64 has more registers, so the rest of the
> interpreter is faster than x86, but dispatch still takes the same
> absolute time, which is 70% on x86_64, but only 50% on x86 (those are
> realistic figures);

I don't understand what you mean by "absolute time" here. Do you
actually mean the time spent interpreting bytecodes compared to the time
spent in the other parts of Python? If so, your figures are wrong for
CPython on x86-64. It is about 50% just like on x86 (when running
pybench). With the patch, this drops to 35% on x86-64 and to 45% on x86.

> In my toy interpreter, computing last_i for each dispatch doesn't give
> any big slowdown, but storing it in f->last_i gives a ~20% slowdown.

I patched ceval.c to minimize f->last_i manipulations in the dispatch
code.  On x86, I got an extra 9% speed up on pybench. However, the patch
is a bit clumsy and a few unit tests are failing. I will see if I can
improve it and open a new issue if worthwhile.
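
As a concrete (and invented) illustration of what minimizing those
manipulations can look like, here is a small self-contained toy using
GCC computed gotos: the instruction pointer lives in a local variable,
and the frame's f_lasti is written back only on a path where the frame
can actually be observed. None of this is code from the patch; whether
such deferral is safe in CPython depends on everything that can see
f_lasti (tracing, frame.getlineno(), exception handling), which is
exactly the trap described above.

    #include <stdio.h>

    struct frame { int f_lasti; };

    int main(void)
    {
        enum { INC, TRACE, HALT };
        static const unsigned char code[] = { INC, INC, TRACE, INC, HALT };
        const void *targets[] = { &&op_INC, &&op_TRACE, &&op_HALT };

        struct frame fr = { -1 };
        const unsigned char *first_instr = code, *next_instr = code;
        int acc = 0;

    #define DISPATCH()   goto *targets[*next_instr++]
    /* write f_lasti back only where the frame is observable */
    #define SYNC_LASTI() (fr.f_lasti = (int)(next_instr - first_instr) - 1)

        DISPATCH();

    op_INC:
        acc++;
        DISPATCH();              /* fast path: no f_lasti write */

    op_TRACE:
        SYNC_LASTI();            /* frame inspected here: sync first */
        printf("acc=%d at instruction %d\n", acc, fr.f_lasti);
        DISPATCH();

    op_HALT:
        return 0;
    }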




[issue4753] Faster opcode dispatch on gcc

2009-01-03 Thread Daniel Diniz

Daniel Diniz  added the comment:

Paolo 'Blaisorblade' Giarrusso  wrote:
>
> 1st note: is that code from the threaded version? [...] It is vital to
> this patch that the jump is not shared, something similar to
> -fno-crossjumping should be found.

Yes, threaded version by unconditionally defining USE_THREADED_CODE
(old patch version :).

Ok,  I'll try to find a way to get at -fno-crossjumping behavior. BTW,
man gcc suggests using -fno-gcse for programs that use computed gotos
(couldn't fin

[...]
>  In the code you posted, one can see that the program
> counter is spilled to memory by GCC, but isn't by ICC. Either the spill
> is elsewhere, or ICC is better here.
I can [mail you|attach here] icc's output if you want to check the
overall code; it's about 1.9M with the code annotations.

> Finally, I'm a bit surprised by "addl $1, %ecx", since any peephole
> optimizer should remove that; I'm not shocked just because I've never
> seen perfect GCC output.

I'm glad to see the same issue in Alexandre's output, not my fault then :D

The command line I used (after a clean build with gcc) was:
icc  -pthread -c -fno-strict-aliasing -DNDEBUG -g -fwrapv -O3 -w  -I.
-IInclude -I./Include   -DPy_BUILD_CORE  -S -fcode-asm -fsource-asm
-dA Python/ceval.c

(same as with gcc, except for warnings and -fcode-asm -fsource-asm).




[issue4753] Faster opcode dispatch on gcc

2009-01-03 Thread Paolo 'Blaisorblade' Giarrusso

Paolo 'Blaisorblade' Giarrusso  added the comment:

Daniel, I forgot to ask for the compilation command lines you used,
since they make a lot of difference. Can you post them? Thanks




[issue4753] Faster opcode dispatch on gcc

2009-01-03 Thread Paolo 'Blaisorblade' Giarrusso

Paolo 'Blaisorblade' Giarrusso  added the comment:

1st note: is that code from the threaded version? Note that you need to
modify the source to make it accept ICC as well in order to try that.
In case you already did that, I guess the patch is not useful at all
with ICC since, as far as I can see, the jump is shared. It is vital to
this patch that the jump is not shared; something similar to
-fno-crossjumping should be found.

2nd note: the answer to your question seems to be yes, ICC has fewer
register spills. Look for instance at:
   movl    -272(%ebp), %ecx
   movzbl  (%ecx), %eax
   addl    $1, %ecx

and

   movzbl  (%esi), %ecx
   incl    %esi

This represents the increment of the program counter after loading the
next opcode. In the code you posted, one can see that the program
counter is spilled to memory by GCC, but isn't by ICC. Either the spill
is elsewhere, or ICC is better here. And it's widely known that ICC has
a much better optimizer in many cases, and I remember that the GCC
register allocator really needs improvement.

Finally, I'm a bit surprised by "addl $1, %ecx", since any peephole
optimizer should remove that; I'm not shocked just because I've never
seen perfect GCC output.




[issue4753] Faster opcode dispatch on gcc

2009-01-03 Thread Daniel Diniz

Daniel Diniz  added the comment:

IIUC, this is what gcc 4.2.4 generates on a Celeron M for the code
Alexandre posted:
    movl    -272(%ebp), %eax
    movl    8(%ebp), %edx
    subl    -228(%ebp), %eax
    movl    %eax, 60(%edx)
    movl    -272(%ebp), %ecx
    movzbl  (%ecx), %eax
    -
    addl    $1, %ecx
    movl    %ecx, -272(%ebp)
    movl    opcode_targets.9311(,%eax,4), %ecx
    movl    %eax, %ebx
    -
    jmp     *%ecx


And this is what ICC 11.0 generates for (what I think is) the same bit:
    movl    360(%esp), %ecx
    movl    %esi, %edi
    subl    304(%esp), %edi
    movl    %edi, 60(%ecx)
    movzbl  (%esi), %ecx
    -
    movl    opcode_targets.2239.0.34(,%ecx,4), %eax
    incl    %esi
    -
    jmp     ..B12.137        # Prob 100%
    # ..B12.137:  jmp  *%eax


Does this mean that icc handles register starvation better here?

FWIW, on this CPU, compiling with icc gives a 20% speed boost in pybench
regardless of this patch.

--
nosy: +ajaksu2




[issue4753] Faster opcode dispatch on gcc

2009-01-03 Thread Benoit Boissinot

Changes by Benoit Boissinot :


--
nosy: +bboissin




[issue4753] Faster opcode dispatch on gcc

2009-01-03 Thread Yann Ramin

Changes by Yann Ramin :


--
nosy: +theatrus




[issue4753] Faster opcode dispatch on gcc

2009-01-03 Thread djc

Changes by djc :


--
nosy: +djc




[issue4753] Faster opcode dispatch on gcc

2009-01-03 Thread Antoine Pitrou

Antoine Pitrou  added the comment:

> I'm not an expert in this kind of optimization. Could we gain more
> speed by making the dispatcher table more dense? Python has fewer than
> 128 opcodes (len(opcode.opmap) == 113), so they can be squeezed into a
> smaller table. I naively assume a smaller table increases the number of
> cache hits.

I don't think so. The end of the current table, which doesn't correspond
to any valid opcodes, will not be cached anyway. The upper limit to be
considered is the maximum value for a valid opcode, which is 147.
Reducing that to 113 may reduce cache pressure, but only by a tiny bit.

Of course, only experimentation could tell us for sure :)




[issue4753] Faster opcode dispatch on gcc

2009-01-02 Thread Paolo 'Blaisorblade' Giarrusso

Paolo 'Blaisorblade' Giarrusso  added the comment:

> I'm not an expert in this kind of optimization. Could we gain more
> speed by making the dispatcher table more dense? Python has fewer than
> 128 opcodes (len(opcode.opmap) == 113), so they can be squeezed into a
> smaller table. I naively assume a smaller table increases the number of
> cache hits.

Well, you have no binary compatibility constraint with a new release, so
it can be tried and benchmarked, or it can be done anyway!
On x86_64 the impact of the jump table is 8 bytes per pointer * 256
pointers = 2KiB, and the L1 data cache of Pentium4 can be 8KiB or 16KiB
wide.
But I don't expect this to be noticeable in most synthetic
microbenchmarks. Matrix multiplication would be the perfect one, I
guess; the repeated column access would kill the L1 data cache if the
whole matrices don't fit.




[issue4753] Faster opcode dispatch on gcc

2009-01-02 Thread Paolo 'Blaisorblade' Giarrusso

Paolo 'Blaisorblade' Giarrusso  added the comment:

About miscompilations: the current patch is a bit weird for GCC, because
you keep both the switch and the computed goto.

But actually, there is no case in which the switch is needed, and
computed gotos give less room to GCC's choices.

So, can you try dropping the switch altogether, using always computed
goto, and seeing how the resulting code gets compiled? I see you'll
need two labels (before and after argument fetch) per opcode and two
dispatch tables, but that's no big deal (except for code alignment -
align just the common branch target).

An important warning is that by default, on my system, GCC 4.2 aligns
branch targets for switch to a 16-byte boundary (as recommended by the
Intel optimization guide), by adding a ".p2align 4,,7" GAS directive,
and it does not do that for computed goto.

Adding the directive by hand gave a small speedup, 2% I think; I should
try -falign-jumps=16 if it's not enabled (some -falign-jumps is enabled
by -O2), since that is supposed to give the same result.

Please use that yourself as well, and verify it works for labels, even
if I fear it doesn't.

> However, I don't know why the speed up due to the patch is much
> more significant on x86-64 than on x86.

It's Amdahl's law, even if this is not about parallel code. When the
rest is faster (x86_64), the same speedup on dispatch gives a bigger
overall speedup.

To be absolutely clear: x86_64 has more registers, so the rest of the
interpreter is faster than x86, but dispatch still takes the same
absolute time, which is 70% on x86_64, but only 50% on x86 (those are
realistic figures); if this patch halved dispatch time on both (we're
not so lucky), we would save 35% on x86_64 but only 25% on x86.
In fact, on inefficient interpreters, indirect threading is useless
altogether.

So, do those extra registers help _so_ much? Yes. In my toy interpreter,
computing last_i for each dispatch doesn't give any big slowdown, but
storing it in f->last_i gives a ~20% slowdown - I cross-checked multiple
times because I was astonished. Conversely, when the program counter had
to be stored in memory, I think it was like 2x slower.




[issue4753] Faster opcode dispatch on gcc

2009-01-02 Thread Christian Heimes

Christian Heimes  added the comment:

> Alexandre Vassalotti  added the comment:
> The patch makes a huge difference on 64-bit Linux. I get a 20% speed-up
> and the lowest run time so far. That is quite impressive!

I'm really, REALLY impressed by the speed up. Good work!

I'm not an expert in this kind of optimization. Could we gain more
speed by making the dispatcher table more dense? Python has fewer than
128 opcodes (len(opcode.opmap) == 113), so they can be squeezed into a
smaller table. I naively assume a smaller table increases the number of
cache hits.

Christian




[issue4753] Faster opcode dispatch on gcc

2009-01-02 Thread Skip Montanaro

Changes by Skip Montanaro :


Added file: http://bugs.python.org/file12555/ceval.i.threaded




[issue4753] Faster opcode dispatch on gcc

2009-01-02 Thread Skip Montanaro

Skip Montanaro  added the comment:

Alexandre's last comment reminded me I forgot to post the PPC assembler
code.  The next two files are the output as requested by Antoine.

Added file: http://bugs.python.org/file12553/ceval.i.unthreaded




[issue4753] Faster opcode dispatch on gcc

2009-01-02 Thread Alexandre Vassalotti

Alexandre Vassalotti  added the comment:

One more thing: the patch causes the following warnings to be emitted by
GCC when USE_COMPUTED_GOTOS is undefined.

Python/ceval.c: In function ‘PyEval_EvalFrameEx’:
Python/ceval.c:2420: warning: label ‘_make_function’ defined but not used
Python/ceval.c:2374: warning: label ‘_call_function_var_kw’ defined but
not used
Python/ceval.c:2280: warning: label ‘_setup_finally’ defined but not used




[issue4753] Faster opcode dispatch on gcc

2009-01-02 Thread Alexandre Vassalotti

Alexandre Vassalotti  added the comment:

The patch makes a huge difference on 64-bit Linux. I get a 20% speed-up
and the lowest run time so far. That is quite impressive!

At first glance, it seems the extra registers of the x86-64 architecture
permit GCC to avoid spilling registers onto the stack (see assembly just
below). However, I don't know why the speed up due to the patch is much
more significant on x86-64 than on x86.

This is the x86 assembly generated by GCC 4.3 (annotated and
slightly edited for readability):

    movl    -440(%ebp), %eax   # tmp = next_instr
    movl    $145, %esi         # opcode = LIST_APPEND
    movl    8(%ebp), %ecx      # f
    subl    -408(%ebp), %eax   # tmp -= first_instr
    movl    %eax, 60(%ecx)     # f->f_lasti = tmp
    movl    -440(%ebp), %ebx   # next_instr
    movzbl  (%ebx), %eax       # tmp = *next_instr
    addl    $1, %ebx           # next_instr++
    movl    %ebx, -440(%ebp)   # next_instr
    movl    opcode_targets(,%eax,4), %eax   # tmp = opcode_targets[tmp]
    jmp     *%eax              # goto *tmp


And this is the x86-64 assembly generated also by GCC 4.3:

    movl    %r15d, %eax        # tmp = next_instr
    subl    76(%rsp), %eax     # tmp -= first_instr
    movl    $145, %ebp         # opcode = LIST_APPEND
    movl    %eax, 120(%r14)    # f->f_lasti = tmp
    movzbl  (%r15), %eax       # tmp = *next_instr
    addq    $1, %r15           # next_instr++
    movq    opcode_targets(,%rax,8), %rax   # tmp = opcode_targets[tmp]
    jmp     *%rax              # goto *tmp


The above assemblies are equivalent to the following C code:

opcode = LIST_APPEND;
f->f_lasti = ((int)(next_instr - first_instr));
goto *opcode_targets[*next_instr++];

On the register-starved x86 architecture, the assembly has 4 stack loads
and 1 store operation. On the x86-64 architecture, most variables are
kept in registers, so it uses only 1 stack store operation. And from
what I saw of the assemblies, the extra registers aren't much used with
the traditional switch dispatch, especially with the opcode prediction
macros, which avoid manipulations of f->f_lasti.

That said, I am glad to hear the patch makes Python on PowerPC faster,
because this supports the hypothesis that extra registers are better
used with indirect threading (PowerPC has 32 general-purpose registers).

Added file: http://bugs.python.org/file12551/amd-athlon64-x2-64bit-pybench.txt




[issue4753] Faster opcode dispatch on gcc

2009-01-02 Thread Skip Montanaro

Skip Montanaro  added the comment:

Antoine> Ok, so the threaded version is actually faster by 20% on your
Antoine> PPC, and slower by 5% on your Core 2 Duo. Thanks for doing the
Antoine> measurements!

Confirmed by pystone runs as well.  Sorry for the earlier misdirection.

Skip




[issue4753] Faster opcode dispatch on gcc

2009-01-02 Thread Antoine Pitrou

Antoine Pitrou  added the comment:

> OK, I think I'm misreading the output of pybench.  Let me reset.  Ignore
> anything I've written previously on this topic.  Instead, I will just
> post the output of my pybench comparison runs and let more expert people
> interpret as appropriate.  The first file is the result of the run on
> PowerPC (Mac G5).

Ok, so the threaded version is actually faster by 20% on your PPC, and
slower by 5% on your Core 2 Duo. Thanks for doing the measurements!




[issue4753] Faster opcode dispatch on gcc

2009-01-02 Thread Skip Montanaro

Skip Montanaro  added the comment:

The next is the result of running on my MacBook Pro (Intel Core 2 Duo).

Added file: http://bugs.python.org/file12546/pybench.sum.Intel



