Re: [Python-Dev] Program runs in 12s on Python 2.7, but 5s on Python 3.5 -- why so much difference?

2017-07-24 Thread Gregory P. Smith
On Mon, Jul 24, 2017 at 1:49 PM Wang, Peter Xihong <
peter.xihong.w...@intel.com> wrote:

> I believe we have evaluated clang vs gcc before (long time ago), and gcc
> won at that time.
>
>
>
> PGO might have overshadowed the impact of computed goto, and thus the latter
> may no longer be needed.
>

Computed goto is still needed.  PGO does not magically replace it.  A PGO
build with computed goto is faster than one without computed goto.

... as tested on gcc 4.9 a couple of years ago. I doubt that has changed or
differs between compilers; PGO and computed goto are different kinds of
optimizations.

-gps
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Program runs in 12s on Python 2.7, but 5s on Python 3.5 -- why so much difference?

2017-07-24 Thread Wang, Peter Xihong
I believe we evaluated clang vs gcc before (a long time ago), and gcc won at 
that time.

PGO might have overshadowed the impact of computed goto, and thus the latter may 
no longer be needed.

When the performance difference is as large as 50%, there are various options 
for nailing down the root cause, including bytecode analysis.  However, at 3.6 
sec vs 3.4 sec, a delta of ~5%, it can be hard to pin down.  Internally we use 
sampling-based performance tools for micro-architecture-level analysis; 
alternatively, the generic, open-source Linux tool “perf” is very good.  You 
could also do a disassembly analysis/comparison of the object files, such as 
the main interpreter loop (ceval.o), looking at the efficiency of the generated 
code (this gives generic information about Python 2 vs 3, but may not tell you 
the runtime behavior of your specific app, pentomino.py).
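For the bytecode-analysis route, the stdlib dis module is enough for a first
look. A minimal sketch (the `solve` function below is a toy stand-in for the
generated solver, not the real pentomino code):

```python
import dis

# Toy stand-in for the generated solver's nested "if" chains
# (the real code lives in generated_solve.py).
def solve(a, b):
    if a > 0:
        if b > 0:
            return a + b
        return a - b
    return 0

# Print the disassembly; run this under each interpreter and diff the
# output to see how the emitted opcodes differ between versions.
dis.dis(solve)
```

dis.dis() exists on both 2.7 and 3.x, so the same snippet can be run under
each interpreter being compared and the listings diffed.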

Hope that helps.

Peter


From: Ben Hoyt [mailto:benh...@gmail.com]
Sent: Monday, July 24, 2017 12:35 PM
To: Wang, Peter Xihong 
Cc: Nathaniel Smith ; Python-Dev 
Subject: Re: [Python-Dev] Program runs in 12s on Python 2.7, but 5s on Python 
3.5 -- why so much difference?

Thanks for testing.

Oddly, I just tested it on Linux (Ubuntu) and got the same results as you -- 
Python 2.7.13 outperforms 3 (3.5.3 in my case) by a few percent. And even under 
a VirtualBox VM it takes 3.4 and 3.6 seconds, compared to ~5s on the host macOS 
operating system. Very odd. I guess that means VirtualBox is very good, and 
that clang/LLVM is not as good at optimizing the Python VM as gcc is.

I can't find anything majorly different about my macOS Python 2 and 3 builds. 
Both look like they have PGO turned on (from sysconfig.get_config_vars()). Both 
have HAVE_COMPUTED_GOTOS=1 but USE_COMPUTED_GOTOS=0 for some reason. My Python 
2 version is the macOS system version (/usr/local/bin/python2), whereas my 
Python 3 version is from "brew install", so that's probably the difference, 
though that still doesn't explain exactly why.

-Ben

On Mon, Jul 24, 2017 at 1:49 PM, Wang, Peter Xihong <
peter.xihong.w...@intel.com> wrote:
Hi Ben,

Out of curiosity, as a quick experiment I ran your pentomino.py with a 2.7.12 
PGO+LTO build (the Ubuntu 16.04.2 LTS default at /usr/bin/python) and compared 
it with a 3.7.0 alpha1 PGO+LTO build (which I built a while ago) on my 
Skylake-based desktop: 2.7 outperforms 3.7 by 3.5%.
On your 2.5 GHz i7 system, I'd recommend making sure the two Python binaries 
you are comparing are on an equal footing (compiled with the same 
optimizations, PGO+LTO).

Thanks,

Peter



-Original Message-
From: Python-Dev 
[mailto:python-dev-bounces+peter.xihong.wang=intel.com@python.org] On Behalf Of Nathaniel Smith
Sent: Tuesday, July 18, 2017 7:00 PM
To: Ben Hoyt <benh...@gmail.com>
Cc: Python-Dev <python-dev@python.org>
Subject: Re: [Python-Dev] Program runs in 12s on Python 2.7, but 5s on Python 
3.5 -- why so much difference?

I'd probably start with a regular C-level profiler, like perf or callgrind. 
They're not very useful for comparing two versions of code written in Python, 
but here the Python code is the same (modulo changes in the stdlib), and it's 
changes in the interpreter's C code that probably make the difference.

On Tue, Jul 18, 2017 at 9:03 AM, Ben Hoyt <benh...@gmail.com> wrote:
> Hi folks,
>
> (Not entirely sure this is the right place for this question, but
> hopefully it's of interest to several folks.)
>
> A few days ago I posted a note in response to Victor Stinner's
> articles on his CPython contributions, noting that I wrote a program
> that ran in 11.7 seconds on Python 2.7, but only takes 5.1 seconds on
> Python 3.5 (on my 2.5 GHz macOS i7), more than 2x as fast. Obviously
> this is a Good Thing, but I'm curious as to why there's so much difference.
>
> The program is a pentomino puzzle solver, and it works via code
> generation, generating a ton of nested "if" statements, so I believe
> it's exercising the Python bytecode interpreter heavily. Obviously
> there have been some big optimizations to make this happen, but I'm
> curious what the main improvements are that are causing this much difference.
>
> There's a writeup about my program here, with benchmarks at the bottom:
> http://benhoyt.com/writings/python-pentomino/
>
> This is the generated Python code that's being exercised:
> https://github.com/benhoyt/python-pentomino/blob/master/generated_solv
> e.py
>
> For reference, on Python 3.6 it runs in 4.6 seconds (same on Python
> 3.7 alpha). This smallish increase from Python 3.5 to Python 3.6 was
> more expected to me due to the bytecode changing to wordcode in 3.6.
>
> I tried using cProfile on both Python versions, but that didn't say much,
> because the functions being called aren't taking the majority of the time.
> How does one benchmark at a lower level, or otherwise explain what's
> going on here?

Re: [Python-Dev] Program runs in 12s on Python 2.7, but 5s on Python 3.5 -- why so much difference?

2017-07-24 Thread Ben Hoyt
Thanks for testing.

Oddly, I just tested it on Linux (Ubuntu) and got the same results as you
-- Python 2.7.13 outperforms 3 (3.5.3 in my case) by a few percent. And
even under a VirtualBox VM it takes 3.4 and 3.6 seconds, compared to ~5s on
the host macOS operating system. Very odd. I guess that means VirtualBox is
very good, and that clang/LLVM is not as good at optimizing the Python VM
as gcc is.

I can't find anything majorly different about my macOS Python 2 and 3
builds. Both look like they have PGO turned on (from
sysconfig.get_config_vars()). Both have HAVE_COMPUTED_GOTOS=1 but
USE_COMPUTED_GOTOS=0 for some reason. My Python 2 version is the macOS
system version (/usr/local/bin/python2), whereas my Python 3 version is
from "brew install", so that's probably the difference, though that still
doesn't explain exactly why.
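The build-flag check above can be scripted; here's a minimal sketch using only
sysconfig (which config keys are present varies by platform, build, and Python
version, so a missing key just means the build didn't record it):

```python
import sysconfig

# Inspect how this interpreter was built.
cv = sysconfig.get_config_vars()
for key in ("CONFIG_ARGS", "PY_CFLAGS",
            "HAVE_COMPUTED_GOTOS", "USE_COMPUTED_GOTOS"):
    print(key, "=", cv.get(key))

# PGO builds from python.org sources typically pass --enable-optimizations
args = cv.get("CONFIG_ARGS") or ""
print("PGO likely enabled:", "--enable-optimizations" in args)
```

Running this under each of the two binaries being compared makes it easy to
spot a PGO-vs-non-PGO or computed-goto mismatch.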

-Ben

On Mon, Jul 24, 2017 at 1:49 PM, Wang, Peter Xihong <
peter.xihong.w...@intel.com> wrote:

> Hi Ben,
>
> Out of curiosity, as a quick experiment I ran your pentomino.py with a
> 2.7.12 PGO+LTO build (the Ubuntu 16.04.2 LTS default at /usr/bin/python)
> and compared it with a 3.7.0 alpha1 PGO+LTO build (which I built a while
> ago) on my Skylake-based desktop: 2.7 outperforms 3.7 by 3.5%.
> On your 2.5 GHz i7 system, I'd recommend making sure the two Python
> binaries you are comparing are on an equal footing (compiled with the same
> optimizations, PGO+LTO).
>
> Thanks,
>
> Peter
>
>
>
> -Original Message-
> From: Python-Dev [mailto:python-dev-bounces+peter.xihong.wang=intel.com@
> python.org] On Behalf Of Nathaniel Smith
> Sent: Tuesday, July 18, 2017 7:00 PM
> To: Ben Hoyt 
> Cc: Python-Dev 
> Subject: Re: [Python-Dev] Program runs in 12s on Python 2.7, but 5s on
> Python 3.5 -- why so much difference?
>
> I'd probably start with a regular C-level profiler, like perf or
> callgrind. They're not very useful for comparing two versions of code
> written in Python, but here the Python code is the same (modulo changes in
> the stdlib), and it's changes in the interpreter's C code that probably
> make the difference.
>
> On Tue, Jul 18, 2017 at 9:03 AM, Ben Hoyt  wrote:
> > Hi folks,
> >
> > (Not entirely sure this is the right place for this question, but
> > hopefully it's of interest to several folks.)
> >
> > A few days ago I posted a note in response to Victor Stinner's
> > articles on his CPython contributions, noting that I wrote a program
> > that ran in 11.7 seconds on Python 2.7, but only takes 5.1 seconds on
> > Python 3.5 (on my 2.5 GHz macOS i7), more than 2x as fast. Obviously
> > this is a Good Thing, but I'm curious as to why there's so much
> difference.
> >
> > The program is a pentomino puzzle solver, and it works via code
> > generation, generating a ton of nested "if" statements, so I believe
> > it's exercising the Python bytecode interpreter heavily. Obviously
> > there have been some big optimizations to make this happen, but I'm
> > curious what the main improvements are that are causing this much
> difference.
> >
> > There's a writeup about my program here, with benchmarks at the bottom:
> > http://benhoyt.com/writings/python-pentomino/
> >
> > This is the generated Python code that's being exercised:
> > https://github.com/benhoyt/python-pentomino/blob/master/generated_solv
> > e.py
> >
> > For reference, on Python 3.6 it runs in 4.6 seconds (same on Python
> > 3.7 alpha). This smallish increase from Python 3.5 to Python 3.6 was
> > more expected to me due to the bytecode changing to wordcode in 3.6.
> >
> > I tried using cProfile on both Python versions, but that didn't say
> > much, because the functions being called aren't taking the majority of
> the time.
> > How does one benchmark at a lower level, or otherwise explain what's
> > going on here?
> >
> > Thanks,
> > Ben
> >
>
>
>
> --
> Nathaniel J. Smith -- https://vorpus.org


Re: [Python-Dev] Program runs in 12s on Python 2.7, but 5s on Python 3.5 -- why so much difference?

2017-07-24 Thread Wang, Peter Xihong
Hi Ben,

Out of curiosity, as a quick experiment I ran your pentomino.py with a 2.7.12 
PGO+LTO build (the Ubuntu 16.04.2 LTS default at /usr/bin/python) and compared 
it with a 3.7.0 alpha1 PGO+LTO build (which I built a while ago) on my 
Skylake-based desktop: 2.7 outperforms 3.7 by 3.5%.
On your 2.5 GHz i7 system, I'd recommend making sure the two Python binaries 
you are comparing are on an equal footing (compiled with the same 
optimizations, PGO+LTO).

Thanks,

Peter



-Original Message-
From: Python-Dev 
[mailto:python-dev-bounces+peter.xihong.wang=intel.com@python.org] On Behalf Of 
Nathaniel Smith
Sent: Tuesday, July 18, 2017 7:00 PM
To: Ben Hoyt 
Cc: Python-Dev 
Subject: Re: [Python-Dev] Program runs in 12s on Python 2.7, but 5s on Python 
3.5 -- why so much difference?

I'd probably start with a regular C-level profiler, like perf or callgrind. 
They're not very useful for comparing two versions of code written in Python, 
but here the Python code is the same (modulo changes in the stdlib), and it's 
changes in the interpreter's C code that probably make the difference.

On Tue, Jul 18, 2017 at 9:03 AM, Ben Hoyt  wrote:
> Hi folks,
>
> (Not entirely sure this is the right place for this question, but 
> hopefully it's of interest to several folks.)
>
> A few days ago I posted a note in response to Victor Stinner's 
> articles on his CPython contributions, noting that I wrote a program 
> that ran in 11.7 seconds on Python 2.7, but only takes 5.1 seconds on 
> Python 3.5 (on my 2.5 GHz macOS i7), more than 2x as fast. Obviously 
> this is a Good Thing, but I'm curious as to why there's so much difference.
>
> The program is a pentomino puzzle solver, and it works via code 
> generation, generating a ton of nested "if" statements, so I believe 
> it's exercising the Python bytecode interpreter heavily. Obviously 
> there have been some big optimizations to make this happen, but I'm 
> curious what the main improvements are that are causing this much difference.
>
> There's a writeup about my program here, with benchmarks at the bottom:
> http://benhoyt.com/writings/python-pentomino/
>
> This is the generated Python code that's being exercised:
> https://github.com/benhoyt/python-pentomino/blob/master/generated_solv
> e.py
>
> For reference, on Python 3.6 it runs in 4.6 seconds (same on Python 
> 3.7 alpha). This smallish increase from Python 3.5 to Python 3.6 was 
> more expected to me due to the bytecode changing to wordcode in 3.6.
>
> I tried using cProfile on both Python versions, but that didn't say 
> much, because the functions being called aren't taking the majority of the 
> time.
> How does one benchmark at a lower level, or otherwise explain what's 
> going on here?
>
> Thanks,
> Ben
>



--
Nathaniel J. Smith -- https://vorpus.org 


Re: [Python-Dev] Program runs in 12s on Python 2.7, but 5s on Python 3.5 -- why so much difference?

2017-07-18 Thread Nathaniel Smith
I'd probably start with a regular C-level profiler, like perf or
callgrind. They're not very useful for comparing two versions of code
written in Python, but here the Python code is the same (modulo
changes in the stdlib), and it's changes in the interpreter's C code
that probably make the difference.

On Tue, Jul 18, 2017 at 9:03 AM, Ben Hoyt  wrote:
> Hi folks,
>
> (Not entirely sure this is the right place for this question, but hopefully
> it's of interest to several folks.)
>
> A few days ago I posted a note in response to Victor Stinner's articles on
> his CPython contributions, noting that I wrote a program that ran in 11.7
> seconds on Python 2.7, but only takes 5.1 seconds on Python 3.5 (on my 2.5
> GHz macOS i7), more than 2x as fast. Obviously this is a Good Thing, but I'm
> curious as to why there's so much difference.
>
> The program is a pentomino puzzle solver, and it works via code generation,
> generating a ton of nested "if" statements, so I believe it's exercising the
> Python bytecode interpreter heavily. Obviously there have been some big
> optimizations to make this happen, but I'm curious what the main
> improvements are that are causing this much difference.
>
> There's a writeup about my program here, with benchmarks at the bottom:
> http://benhoyt.com/writings/python-pentomino/
>
> This is the generated Python code that's being exercised:
> https://github.com/benhoyt/python-pentomino/blob/master/generated_solve.py
>
> For reference, on Python 3.6 it runs in 4.6 seconds (same on Python 3.7
> alpha). This smallish increase from Python 3.5 to Python 3.6 was more
> expected to me due to the bytecode changing to wordcode in 3.6.
>
> I tried using cProfile on both Python versions, but that didn't say much,
> because the functions being called aren't taking the majority of the time.
> How does one benchmark at a lower level, or otherwise explain what's going
> on here?
>
> Thanks,
> Ben
>



-- 
Nathaniel J. Smith -- https://vorpus.org


Re: [Python-Dev] Program runs in 12s on Python 2.7, but 5s on Python 3.5 -- why so much difference?

2017-07-18 Thread Ben Hoyt
Thanks, Nick -- that's interesting. I just saw the extra JUMP_FORWARD and
JUMP_ABSOLUTE instructions on my commute home (I guess those are something
Python 3.x optimizes away).

VERY strangely, on Windows Python 2.7 is faster! Comparing 64-bit Python
2.7.12 against Python 3.5.3 on my Windows 10 laptop:

* Python 2.7.12: 4.088s
* Python 3.5.3: 5.792s

I'm pretty sure MSVC/Windows doesn't support computed gotos, but that
doesn't explain why 3.5 is so much faster than 2.7 on Mac. I have yet to
try it on Linux.

-Ben

On Tue, Jul 18, 2017 at 9:35 PM, Nick Coghlan  wrote:

> On 19 July 2017 at 02:18, Antoine Pitrou  wrote:
> > On Tue, 18 Jul 2017 12:03:36 -0400
> > Ben Hoyt  wrote:
> >> The program is a pentomino puzzle solver, and it works via code
> generation,
> >> generating a ton of nested "if" statements, so I believe it's exercising
> >> the Python bytecode interpreter heavily.
> >
> > A first step would be to see if the generated bytecode has changed
> > substantially.
>
> Scanning over them, the Python 2.7 bytecode appears to have many more
> JUMP_FORWARD and JUMP_ABSOLUTE opcodes than appear in the 3.6 version
> (I didn't dump them into a Counter instance to tally them properly
> though, since 2.7's dis module is missing the structured opcode
> iteration APIs).
>
> With the shift to wordcode, the overall size of the bytecode is also
> significantly *smaller*:
>
> >>> len(co.co_consts[0].co_code) # 2.7
> 14427
>
> >>> len(co.co_consts[0].co_code) # 3.6
> 11850
>
> However, I'm not aware of any Python profilers that currently offer
> opcode level profiling - the closest would probably be VMProf's JIT
> profiling, and that aspect of VMProf is currently PyPy specific
> (although could presumably be extended to CPython 3.6+ by way of the
> opcode evaluation hook).
>
> Cheers,
> Nick.
>
> --
> Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] Program runs in 12s on Python 2.7, but 5s on Python 3.5 -- why so much difference?

2017-07-18 Thread Nick Coghlan
On 19 July 2017 at 02:18, Antoine Pitrou  wrote:
> On Tue, 18 Jul 2017 12:03:36 -0400
> Ben Hoyt  wrote:
>> The program is a pentomino puzzle solver, and it works via code generation,
>> generating a ton of nested "if" statements, so I believe it's exercising
>> the Python bytecode interpreter heavily.
>
> A first step would be to see if the generated bytecode has changed
> substantially.

Scanning over them, the Python 2.7 bytecode appears to have many more
JUMP_FORWARD and JUMP_ABSOLUTE opcodes than appear in the 3.6 version
(I didn't dump them into a Counter instance to tally them properly
though, since 2.7's dis module is missing the structured opcode
iteration APIs).

With the shift to wordcode, the overall size of the bytecode is also
significantly *smaller*:

>>> len(co.co_consts[0].co_code) # 2.7
14427

>>> len(co.co_consts[0].co_code) # 3.6
11850
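On Python 3.4+, the Counter tally described above can be done with
dis.get_instructions. A sketch, using a toy nested-if function in place of the
real generated module and the same co_consts[0] trick:

```python
import dis
from collections import Counter

# Toy stand-in for the generated solver module
# (the real one is generated_solve.py).
src = """
def solve(x, y):
    if x:
        if y:
            return 1
        return 2
    return 3
"""
code = compile(src, "<generated>", "exec")
func_code = code.co_consts[0]  # code object for solve()

print("bytecode size:", len(func_code.co_code))

# Tally opcode names, then pull out just the jump-family opcodes
counts = Counter(i.opname for i in dis.get_instructions(func_code))
jumps = {op: n for op, n in counts.items() if "JUMP" in op}
print("jump opcodes:", jumps)
```

Running the tally on 2.7 would need a hand-rolled walk over co_code instead,
since dis.get_instructions isn't available there.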

However, I'm not aware of any Python profilers that currently offer
opcode level profiling - the closest would probably be VMProf's JIT
profiling, and that aspect of VMProf is currently PyPy specific
(although could presumably be extended to CPython 3.6+ by way of the
opcode evaluation hook).

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] Program runs in 12s on Python 2.7, but 5s on Python 3.5 -- why so much difference?

2017-07-18 Thread Antoine Pitrou
On Tue, 18 Jul 2017 12:03:36 -0400
Ben Hoyt  wrote:

> Hi folks,
> 
> (Not entirely sure this is the right place for this question, but hopefully
> it's of interest to several folks.)
> 
> A few days ago I posted a note in response to Victor Stinner's articles on
> his CPython contributions, noting that I wrote a program that ran in 11.7
> seconds on Python 2.7, but only takes 5.1 seconds on Python 3.5 (on my 2.5
> GHz macOS i7), more than 2x as fast. Obviously this is a Good Thing, but
> I'm curious as to why there's so much difference.
> 
> The program is a pentomino puzzle solver, and it works via code generation,
> generating a ton of nested "if" statements, so I believe it's exercising
> the Python bytecode interpreter heavily.

A first step would be to see if the generated bytecode has changed
substantially.

Otherwise, you can try to comment out parts of the function until the
performance difference has been nullified.

Regards

Antoine.




[Python-Dev] Program runs in 12s on Python 2.7, but 5s on Python 3.5 -- why so much difference?

2017-07-18 Thread Ben Hoyt
Hi folks,

(Not entirely sure this is the right place for this question, but hopefully
it's of interest to several folks.)

A few days ago I posted a note in response to Victor Stinner's articles on
his CPython contributions, noting that I wrote a program that ran in 11.7
seconds on Python 2.7, but only takes 5.1 seconds on Python 3.5 (on my 2.5
GHz macOS i7), more than 2x as fast. Obviously this is a Good Thing, but
I'm curious as to why there's so much difference.

The program is a pentomino puzzle solver, and it works via code generation,
generating a ton of nested "if" statements, so I believe it's exercising
the Python bytecode interpreter heavily. Obviously there have been some big
optimizations to make this happen, but I'm curious what the main
improvements are that are causing this much difference.

There's a writeup about my program here, with benchmarks at the bottom:
http://benhoyt.com/writings/python-pentomino/

This is the generated Python code that's being exercised:
https://github.com/benhoyt/python-pentomino/blob/master/generated_solve.py

For reference, on Python 3.6 it runs in 4.6 seconds (same on Python 3.7
alpha). This smallish increase from Python 3.5 to Python 3.6 was more
expected to me due to the bytecode changing to wordcode in 3.6.

I tried using cProfile on both Python versions, but that didn't say much,
because the functions being called aren't taking the majority of the time.
How does one benchmark at a lower level, or otherwise explain what's going
on here?
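One low-tech option that works on both 2.7 and 3.x is a timeit harness run
under each interpreter. A minimal sketch, where the workload function is a
hypothetical stand-in for pentomino.py's nested ifs, not the real solver:

```python
import sys
import timeit

# Toy nested-if workload exercising the bytecode interpreter's
# conditional-jump dispatch, loosely like the generated solver.
def workload(n=100000):
    total = 0
    for i in range(n):
        if i % 3 == 0:
            if i % 5 == 0:
                total += i
            else:
                total -= 1
        else:
            total += 1
    return total

# Best-of-3 timing; run the same script under each interpreter and
# compare the printed numbers.
best = min(timeit.repeat(workload, number=5, repeat=3))
print("%s: best of 3 = %.4fs" % (sys.version.split()[0], best))
```

This only measures wall-clock differences; for attributing them to specific
interpreter code paths, the C-level profilers mentioned elsewhere in the
thread (perf, callgrind) are the next step.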

Thanks,
Ben