Re: [Python-Dev] Speeding up CPython 5-10%

2016-02-02 Thread Peter Ludemann via Python-Dev
Also, modern compiler technology tends to use "infinite register" machines
for the intermediate representation, then uses register coloring to assign
the actual registers (and generate spill code if needed). I've seen work on
inter-function optimization for avoiding some register loads and stores
(combined with tail-call optimization, it can turn recursive calls into
loops in the register machine).



On 2 February 2016 at 09:16, Sven R. Kunze  wrote:

> On 02.02.2016 00:27, Greg Ewing wrote:
>
>> Sven R. Kunze wrote:
>>
>>> Are there some resources on why register machines are considered faster
>>> than stack machines?
>>>
>>
>> If a register VM is faster, it's probably because each register
>> instruction does the work of about 2-3 stack instructions,
>> meaning fewer trips around the eval loop, so fewer unpredictable
>> branches and fewer pipeline flushes.
>>
>
> That's what I found so far as well.
>
> This assumes that bytecode dispatching is a substantial fraction
>> of the time taken to execute each instruction. For something
>> like cpython, where the operations carried out by the bytecodes
>> involve a substantial amount of work, this may not be true.
>>
>
> Interesting point indeed. It makes sense that register machines only save
> us the bytecode dispatching.
>
> How much that is compared to the work each instruction requires, I cannot
> say. Maybe, Yury has a better understanding here.
>
> It also assumes the VM is executing the bytecodes directly. If
>> there is a JIT involved, it all gets translated into something
>> else anyway, and then it's more a matter of whether you find
>> it easier to design the JIT to deal with stack or register code.
>>
>
> It seems like Yury thinks so. He didn't tell us so far.
>
>
> Best,
> Sven
>
> ___
> Python-Dev mailing list
> Python-Dev@python.org
> https://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe:
> https://mail.python.org/mailman/options/python-dev/pludemann%40google.com
>
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Opcode cache in ceval loop

2016-02-02 Thread Yury Selivanov



On 2016-02-02 12:41 PM, Serhiy Storchaka wrote:

On 01.02.16 21:10, Yury Selivanov wrote:

To measure the max/average memory impact, I tuned my code to optimize
*every* code object on *first* run.  Then I ran the entire Python test
suite.  Python test suite + standard library both contain around 72395
code objects, which required 20Mb of memory for caches.  The test
process consumed around 400Mb of memory.  Thus, in the absolute worst case
scenario, the overhead is about 5%.


The test process consumes so much memory because a few tests create huge 
objects. If you exclude these tests (note that tests that require more 
than 1Gb are already excluded by default) and tests that create a 
number of threads (threads consume much memory too), the rest of the tests 
need less than 100Mb of memory. The absolute required minimum is about 
25Mb. Thus, in the absolute worst case scenario, the overhead is about 100%.
Can you give me the exact configuration of tests (command line to run) 
that would only consume 25mb?


Yury
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Speeding up CPython 5-10%

2016-02-02 Thread Brett Cannon
On Tue, 2 Feb 2016 at 01:29 Victor Stinner  wrote:

> Hi,
>
> I'm back from the FOSDEM event in Brussels; it was really cool. I gave a
> talk about FAT Python and got good feedback. But friends told me
> that people now have expectations for FAT Python. It looks like people
> care about Python performance :-)
>
> FYI the slides of my talk:
> https://github.com/haypo/conf/raw/master/2016-FOSDEM/fat_python.pdf
> (a video was recorded, I don't know when it will be online)
>
> I took a first look at your patch and, sorry, I'm skeptical about the
> design. I have to play with it a little bit more to check whether there
> is a better design.
>
> To be clear, FAT Python with your work looks more and more like a
> cheap JIT compiler :-) Guards, specializations, optimizing at runtime
> after a threshold... all these things come from JIT compilers. I like
> the idea of a kind-of JIT compiler without having to pay the high cost
> of a large dependency like LLVM. I like baby steps in CPython, it's
> faster, it's possible to implement it in a single release cycle (one
> minor Python release, Python 3.6). Integrating a JIT compiler into
> CPython already failed with Unladen Swallow :-/
>
> PyPy has a completely different design (and has serious issues with the
> Python C API), Pyston is restricted to Python 2.7, Pyjion looks
> specific to Windows (CoreCLR), Numba is specific to numeric
> computations (numpy). IMHO none of these projects can easily be
> merged into CPython "quickly" (again, in a single Python release
> cycle). By the way, Pyjion still looks very young (I heard that they
> are still working on compatibility with CPython, not on
> performance yet).
>

We are not ready to have a serious discussion about Pyjion yet as we are
still working on compatibility (we have a talk proposal in for PyCon US
2016 and so we are hoping to have something to discuss at the language
summit), but Victor's email shows there are some misconceptions about it
already and a misunderstanding of our fundamental goal.

First off, Pyjion is very much a work-in-progress. You can find it at
https://github.com/microsoft/pyjion (where there is an FAQ), but for this
audience the key thing to know is that we are still working on
compatibility (see
https://github.com/Microsoft/Pyjion/blob/master/Tests/python_tests.txt for
the list of tests we do (not) pass from the Python test suite). Out of our
roughly 400 tests, we don't pass about 18 of them.

Second, we have not really started work on performance yet. We have done
some very low-hanging fruit stuff, but just barely. IOW we are not really
ready to discuss performance (ATM we JIT instantly for all code objects, and
even with the overhead of being that aggressive with the JIT we are even
with/slightly slower than an unmodified Python 3.5 VM, so we are hopeful this
work will pan out).

Third, the over-arching goal of Pyjion is not to add a JIT into CPython,
but to add a C API to CPython that will allow plugging in a JIT. If you
simply JIT code objects then the API required to let someone plug in a JIT
is basically three functions, maybe as little as two (you can see the exact
patch against CPython that we are working with at
https://github.com/Microsoft/Pyjion/blob/master/Patches/python.diff). We
have no interest in shipping a JIT with CPython, just making it much easier
for others to add one if it makes sense for their workload (and if Yury's
caching stuff goes in with an execution counter, then even the one bit of true
overhead we had will be part of CPython already, which makes it even more of
an easy decision to consider the API we will eventually propose).

Fourth, it is not Windows-only by design. CoreCLR is cross-platform on all
major OSs, so that is not a restriction (and honestly we are using CoreCLR
simply because Dino used to work on the CLR team so he knows the bytecode
really well; we easily could have used some other JIT to prove our point).
The only reason Pyjion doesn't work on other OSs is momentum/laziness on
Dino's and my part; Dino hacked together Pyjion at PyCon US 2015 and he is
most comfortable on Windows, so he just did it on Windows in Visual
Studio and didn't bother to start with e.g. CMake to make it build on
other OSs. We are still trying to work out some compatibility stuff,
so we would rather do that than worry about Linux or OS X support right now.

Fifth, if we manage to show that a C API can easily be added to CPython to
make a JIT something that can simply be plugged in and be useful, then we
will also have a basic JIT framework for people to use. As I said, our use
of CoreCLR is just for ease of development. There is no reason we couldn't
use ChakraCore, v8, LLVM, etc. But since all of these JIT compilers would
need to know how to handle CPython bytecode, we have tried to design a
framework where JIT 

Re: [Python-Dev] Opcode cache in ceval loop

2016-02-02 Thread Yury Selivanov

Hi Victor,

On 2016-02-02 4:33 AM, Victor Stinner wrote:

Hi,

Maybe it's worth writing a PEP to summarize all your changes to
optimize CPython? It would avoid having to follow different threads
on the mailing lists, different issues on the bug tracker, with
external links to GitHub gists, etc. Your code changes critical parts
of Python: the code object structure and Python/ceval.c.


Not sure about that... PEPs take a LOT of time :(

Besides, all my changes are CPython specific and
can be considered as an implementation detail.



At least, it would help to document Python internals ;-)


I can write a ceval.txt file explaining what's going on
in ceval loop, with details on the opcode cache and other
things.  I think it's even better than a PEP, to be honest.

Yury
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Opcode cache in ceval loop

2016-02-02 Thread Sven R. Kunze

On 02.02.2016 20:41, Yury Selivanov wrote:

Hi Victor,

On 2016-02-02 4:33 AM, Victor Stinner wrote:

Hi,

Maybe it's worth writing a PEP to summarize all your changes to
optimize CPython? It would avoid having to follow different threads
on the mailing lists, different issues on the bug tracker, with
external links to GitHub gists, etc. Your code changes critical parts
of Python: the code object structure and Python/ceval.c.


Not sure about that... PEPs take a LOT of time :(


True.


Besides, all my changes are CPython specific and
can be considered as an implementation detail.



At least, it would help to document Python internals ;-)


I can write a ceval.txt file explaining what's going on
in ceval loop, with details on the opcode cache and other
things.  I think it's even better than a PEP, to be honest.


I would love to see that. :)


Best,
Sven
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Opcode cache in ceval loop

2016-02-02 Thread Yury Selivanov



On 2016-02-02 1:45 PM, Serhiy Storchaka wrote:

On 02.02.16 19:45, Yury Selivanov wrote:

On 2016-02-02 12:41 PM, Serhiy Storchaka wrote:

On 01.02.16 21:10, Yury Selivanov wrote:

To measure the max/average memory impact, I tuned my code to optimize
*every* code object on *first* run.  Then I ran the entire Python test
suite.  Python test suite + standard library both contain around 72395
code objects, which required 20Mb of memory for caches.  The test
process consumed around 400Mb of memory.  Thus, in the absolute worst case
scenario, the overhead is about 5%.


The test process consumes so much memory because a few tests create huge
objects. If you exclude these tests (note that tests that require more
than 1Gb are already excluded by default) and tests that create a
number of threads (threads consume much memory too), the rest of the tests
need less than 100Mb of memory. The absolute required minimum is about
25Mb. Thus, in the absolute worst case scenario, the overhead is about 100%.

Can you give me the exact configuration of tests (command line to run)
that would only consume 25mb?


I don't remember which exact tests consume the most memory, but the 
following tests fail when run with less than 30Mb of memory:


test___all__ test_asynchat test_asyncio test_bz2 test_capi 
test_concurrent_futures test_ctypes test_decimal test_descr 
test_distutils test_docxmlrpc test_eintr test_email test_fork1 
test_fstring test_ftplib test_functools test_gc test_gdb test_hashlib 
test_httplib test_httpservers test_idle test_imaplib test_import 
test_importlib test_io test_itertools test_json test_lib2to3 test_list 
test_logging test_longexp test_lzma test_mmap 
test_multiprocessing_fork test_multiprocessing_forkserver 
test_multiprocessing_main_handling test_multiprocessing_spawn test_os 
test_pickle test_poplib test_pydoc test_queue test_regrtest 
test_resource test_robotparser test_shutil test_smtplib test_socket 
test_sqlite test_ssl test_subprocess test_tarfile test_tcl test_thread 
test_threaded_import test_threadedtempfile test_threading 
test_threading_local test_threadsignals test_tix test_tk test_tools 
test_ttk_guionly test_ttk_textonly test_tuple test_unicode 
test_urllib2_localnet test_wait3 test_wait4 test_xmlrpc test_zipfile 
test_zlib


Alright, I modified the code to optimize ALL code objects, and ran unit 
tests with the above tests excluded:


-- Max process mem (ru_maxrss) = 131858432
-- Opcode cache number of objects  = 42109
-- Opcode cache total extra mem= 10901106

And asyncio tests:

-- Max process mem (ru_maxrss) = 57081856
-- Opcode cache number of objects  = 4656
-- Opcode cache total extra mem= 1766681

So the absolute worst case for a small asyncio program is 3%, for unit 
tests (with the above list excluded) - 8%.
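
(In case anyone wants to reproduce the numbers: the ru_maxrss figure 
above is just what the resource module reports for the process.  This is 
a rough sketch, not the exact code I used; note that ru_maxrss is in 
kilobytes on Linux but in bytes on OS X:)

import resource

# peak resident set size of the current process, as reported by getrusage()
usage = resource.getrusage(resource.RUSAGE_SELF)
print("-- Max process mem (ru_maxrss) =", usage.ru_maxrss)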


I think it'd be very hard to find a real-life program that consists of 
only code objects, and nothing else (no data to work with/process, no 
objects with dicts, no threads, basically nothing).  Only for 
such a program would you have a 100% memory overhead for the bytecode 
cache (when all code objects are optimized).


FWIW, here are stats for asyncio with only hot objects being optimized:

-- Max process mem (ru_maxrss) = 54775808
-- Opcode cache number of objects  = 121
-- Opcode cache total extra mem= 43521

Yury
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Opcode cache in ceval loop

2016-02-02 Thread Victor Stinner
2016-02-02 20:23 GMT+01:00 Yury Selivanov :
> Alright, I modified the code to optimize ALL code objects, and ran unit
> tests with the above tests excluded:
>
> -- Max process mem (ru_maxrss) = 131858432
> -- Opcode cache number of objects  = 42109
> -- Opcode cache total extra mem= 10901106

In my experience, RSS is a coarse measure of the memory usage. I wrote
tracemalloc to get a reliable measure of the *Python* memory usage:
https://docs.python.org/dev/library/tracemalloc.html#tracemalloc.get_traced_memory

Run tests with -X tracemalloc -i, and then type in the REPL:

>>> import tracemalloc; print("%.1f kB" % (tracemalloc.get_traced_memory()[1] / 1024.))
10197.7 kB

I expect this value to be (much) lower than RSS.
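
The same measurement can also be scripted rather than typed at the REPL 
(a rough sketch; tracemalloc has to be started before the code you want 
to measure runs, either via -X tracemalloc or tracemalloc.start()):

import tracemalloc

tracemalloc.start()
data = [str(i) * 10 for i in range(100000)]   # stand-in for the real workload
current, peak = tracemalloc.get_traced_memory()
print("current: %.1f kB, peak: %.1f kB" % (current / 1024., peak / 1024.))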

Victor
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Opcode cache in ceval loop

2016-02-02 Thread Serhiy Storchaka

On 02.02.16 21:23, Yury Selivanov wrote:

Alright, I modified the code to optimize ALL code objects, and ran unit
tests with the above tests excluded:

-- Max process mem (ru_maxrss) = 131858432
-- Opcode cache number of objects  = 42109
-- Opcode cache total extra mem= 10901106


Thank you for doing these tests. Now the results are more convincing to me.


And asyncio tests:

-- Max process mem (ru_maxrss) = 57081856
-- Opcode cache number of objects  = 4656
-- Opcode cache total extra mem= 1766681



FWIW, here are stats for asyncio with only hot objects being optimized:

-- Max process mem (ru_maxrss) = 54775808
-- Opcode cache number of objects  = 121
-- Opcode cache total extra mem= 43521


Interesting, 57081856 - 54775808 = 2306048, but 1766681 - 43521 = 
1723160. There is an additional 0.5Mb lost to fragmentation.


___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Opcode cache in ceval loop

2016-02-02 Thread ƦOB COASTN
>> I can write a ceval.txt file explaining what's going on
>> in ceval loop, with details on the opcode cache and other
>> things.  I think it's even better than a PEP, to be honest.
>
>
> I totally agree.
>
Please include the notes text file.  This provides an excellent
summary for those of us who haven't yet taken the deep dive into the
ceval loop but still wish to understand its implementation.
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Opcode cache in ceval loop

2016-02-02 Thread Stephen J. Turnbull
Yury Selivanov writes:

 > Not sure about that... PEPs take a LOT of time :(

Informational PEPs need not take so much time, no more than you would
spend on ceval.txt.  I'm sure a PEP would get a lot more attention
from reviewers, too.

Even if you PEP the whole thing, as you say it's a (big ;-)
implementation detail.  A PEP won't make things more controversial (or
less) than they already are.  I don't see why it would take that much
more time than ceval.txt.

 > I can write a ceval.txt file explaining what's going on
 > in ceval loop, with details on the opcode cache and other
 > things.  I think it's even better than a PEP, to be honest.

Unlikely to be better, since that's a subset of the proposed PEP.

Of course it's up to you, since you'd be doing most of the work, but
for the rest of us PEPs are a lot more discoverable and easily
referenced than a .txt file with a short name.

___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Opcode cache in ceval loop

2016-02-02 Thread Serhiy Storchaka

On 02.02.16 21:41, Yury Selivanov wrote:

I can write a ceval.txt file explaining what's going on
in ceval loop, with details on the opcode cache and other
things.  I think it's even better than a PEP, to be honest.


I totally agree.


___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Opcode cache in ceval loop

2016-02-02 Thread Serhiy Storchaka

On 02.02.16 19:45, Yury Selivanov wrote:

On 2016-02-02 12:41 PM, Serhiy Storchaka wrote:

On 01.02.16 21:10, Yury Selivanov wrote:

To measure the max/average memory impact, I tuned my code to optimize
*every* code object on *first* run.  Then I ran the entire Python test
suite.  Python test suite + standard library both contain around 72395
code objects, which required 20Mb of memory for caches.  The test
process consumed around 400Mb of memory.  Thus, in the absolute worst case
scenario, the overhead is about 5%.


The test process consumes so much memory because a few tests create huge
objects. If you exclude these tests (note that tests that require more
than 1Gb are already excluded by default) and tests that create a
number of threads (threads consume much memory too), the rest of the tests
need less than 100Mb of memory. The absolute required minimum is about
25Mb. Thus, in the absolute worst case scenario, the overhead is about 100%.

Can you give me the exact configuration of tests (command line to run)
that would only consume 25mb?


I don't remember which exact tests consume the most memory, but the 
following tests fail when run with less than 30Mb of memory:


test___all__ test_asynchat test_asyncio test_bz2 test_capi 
test_concurrent_futures test_ctypes test_decimal test_descr 
test_distutils test_docxmlrpc test_eintr test_email test_fork1 
test_fstring test_ftplib test_functools test_gc test_gdb test_hashlib 
test_httplib test_httpservers test_idle test_imaplib test_import 
test_importlib test_io test_itertools test_json test_lib2to3 test_list 
test_logging test_longexp test_lzma test_mmap test_multiprocessing_fork 
test_multiprocessing_forkserver test_multiprocessing_main_handling 
test_multiprocessing_spawn test_os test_pickle test_poplib test_pydoc 
test_queue test_regrtest test_resource test_robotparser test_shutil 
test_smtplib test_socket test_sqlite test_ssl test_subprocess 
test_tarfile test_tcl test_thread test_threaded_import 
test_threadedtempfile test_threading test_threading_local 
test_threadsignals test_tix test_tk test_tools test_ttk_guionly 
test_ttk_textonly test_tuple test_unicode test_urllib2_localnet 
test_wait3 test_wait4 test_xmlrpc test_zipfile test_zlib



___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Speeding up CPython 5-10%

2016-02-02 Thread Yury Selivanov



On 2016-02-02 4:28 AM, Victor Stinner wrote:
[..]

I took a first look at your patch and, sorry,


Thanks for the initial code review!


I'm skeptical about the
design. I have to play with it a little bit more to check whether there
is a better design.


So far I see two things you are worried about:


1. The cache is attached to the code object vs function/frame.

I think the code object is the perfect place for such a cache.

The cache must be there (and survive!) "across" the frames.
If you attach it to the function object, you'll have to
re-attach it to a frame object on each PyEval call.
I can't see how that would be better.


2. Two levels of indirection in my cache -- offsets table +
cache table.

In my other email thread "Opcode cache in ceval loop" I
explained that optimizing every code object in the standard
library and unittests adds 5% memory overhead.  Optimizing
only those that are called frequently is less than 1%.

Besides, many functions that you import are never called, or
only called once or twice.  And code objects for modules
and class bodies are called once.

If we don't use an offset table and just allocate a cache
entry for every opcode, then the memory usage will rise
*significantly*.  Right now the overhead of the offset table
is *8 bits* per opcode, and the overhead of the cache table is
*32 bytes* per optimized opcode.  The overhead of
using 1 extra indirection is minimal.
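
To make the two levels concrete, here is a conceptual sketch in plain
Python (the real thing is C structures hanging off the code object; the
dict below just stands in for a 32-byte cache entry):

class OpcodeCache:
    def __init__(self, n_opcodes, optimized_offsets):
        # level 1: one byte per opcode; 0 means "no cache entry"
        self.offsets = bytearray(n_opcodes)
        # level 2: entries exist only for the *optimized* opcodes
        self.entries = [None]              # slot 0 is reserved for "no entry"
        for off in optimized_offsets:
            self.offsets[off] = len(self.entries)
            self.entries.append({})        # stands in for a 32-byte C struct

    def lookup(self, instr_offset):
        idx = self.offsets[instr_offset]   # the one extra indirection
        return self.entries[idx] if idx else None

# e.g. a code object with 200 opcodes of which only 3 are worth caching:
# 200 bytes of offsets + 3 cache entries, instead of 200 full entries.
cache = OpcodeCache(200, optimized_offsets=[10, 42, 120])
assert cache.lookup(42) is not None and cache.lookup(0) is None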

[..]





2016-01-27 19:25 GMT+01:00 Yury Selivanov :

tl;dr The summary is that I have a patch that improves CPython performance
up to 5-10% on macro benchmarks.  Benchmarks results on Macbook Pro/Mac OS
X, desktop CPU/Linux, server CPU/Linux are available at [1].  There are no
slowdowns that I could reproduce consistently.

That's really impressive, great job Yury :-) Getting a non-negligible
speedup on large macrobenchmarks has become really hard in CPython.
CPython is already well optimized in all corners. It looks like the
overall Python performance still depends heavily on the performance of
dictionary and attribute lookups. Even though that was well known, I didn't
expect up to 10% speedup on *macro* benchmarks.


Thanks!





LOAD_METHOD & CALL_METHOD
-------------------------

We had a lot of conversations with Victor about his PEP 509, and he sent me
a link to his amazing compilation of notes about CPython performance [2].
One optimization that he pointed out to me was LOAD/CALL_METHOD opcodes, an
idea first originated in PyPy.

There is a patch that implements this optimization, it's tracked here: [3].
There are some low level details that I explained in the issue, but I'll go
over the high level design in this email as well.

Your cache is stored directly in code objects. Currently, code objects
are immutable.


Code objects are immutable on the Python level.  My cache
doesn't make any previously immutable field mutable.

Adding a few mutable cache structures visible only at the
C level is acceptable I think.



Antoine Pitrou's patch adding a LOAD_GLOBAL cache adds a cache to
functions with an "alias" in each frame object:
http://bugs.python.org/issue10401

Andrea Griffini's patch also adding a cache for LOAD_GLOBAL adds a
cache for code objects too.
https://bugs.python.org/issue1616125


Those patches are nice, but optimizing just LOAD_GLOBAL
won't give you a big speed-up.  For instance, 2to3 became
7-8% faster once I started to optimize LOAD_ATTR.

The idea of my patch is that it implements caching
in such a way that we can add it to several different
opcodes.

The opcodes we want to optimize are LOAD_GLOBAL, 0 and 3.  Let's look at the
first one, that loads the 'print' function from builtins.  The opcode knows
the following bits of information:

I tested your latest patch. It looks like LOAD_GLOBAL never
invalidates the cache on cache miss ("deoptimize" the instruction).


Yes, that was a deliberate decision (but we can add the
deoptimization easily).  So far I haven't seen a use case
or benchmark where we really need to deoptimize.



I suggest always invalidating the cache at each cache miss. Not only
is it common to modify global variables, but there is also the issue of
a different namespace used with the same code object. Examples:

* late global initialization. See for example _a85chars cache of
base64.a85encode.
* code object created in a temporary namespace and then always run in
a different global namespace. See for example
collections.namedtuple(). I'm not sure that it's the best example
because it looks like the Python code only loads builtins, not
globals. But it looks like your code keeps a copy of the version of
the global namespace dict.

I tested with a threshold of 1: always optimize all code objects.
Maybe with your default threshold of 1024 runs, the issue with
different namespaces doesn't occur in practice.


Yep. I added a constant in ceval.c that enables collection
of opcode cache stats.

99.9% of all global dicts in benchmarks are stable.
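
To illustrate what the guard looks like, here is a minimal sketch in
pure Python (the per-dict version counter below is hypothetical and
simplified; in the real patch the guard is the C-level dict version,
and the cached value lives in the opcode cache described above):

class VersionedDict(dict):
    # simplified: only __setitem__/__delitem__ bump the version
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.version = 0
    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        self.version += 1
    def __delitem__(self, key):
        super().__delitem__(key)
        self.version += 1

def load_global_cached(name, globals_, builtins_, cache):
    if (cache and cache["g_ver"] == globals_.version
            and cache["b_ver"] == builtins_.version):
        return cache["value"]                    # fast path: versions match
    # slow path: do the real lookup and (re)fill the cache
    value = globals_[name] if name in globals_ else builtins_[name]
    cache.update(g_ver=globals_.version, b_ver=builtins_.version, value=value)
    return value

g, b, cache = VersionedDict(), VersionedDict(print=print), {}
assert load_global_cached("print", g, b, cache) is print   # miss, fills cache
assert load_global_cached("print", g, b, cache) is print   # hit, no dict lookup
g["print"] = "shadowed"                                     # bumps g.version
assert load_global_cached("print", g, b, cache) == "shadowed"  # guard catches it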

test suite was a 

Re: [Python-Dev] [Python-checkins] cpython: merge

2016-02-02 Thread Martin Panter
On 2 February 2016 at 05:21, raymond.hettinger
 wrote:
> https://hg.python.org/cpython/rev/0731f097157b
> changeset:   100142:0731f097157b
> parent:  100140:c7f1acdd8be1
> user:Raymond Hettinger 
> date:Mon Feb 01 21:21:19 2016 -0800
> summary:
>   merge
>
> files:
>   Doc/library/collections.rst  |   4 ++--
>   Lib/test/test_deque.py   |  23 ---
>   Modules/_collectionsmodule.c |   7 ++-
>   3 files changed, 16 insertions(+), 18 deletions(-)

This wasn’t actually a merge (there is only one parent). Hopefully I
fixed it up with . But
it looks like the original NEWS entry didn’t get merged in your
earlier merge , so
there was nothing for me to merge the NEWS changes into in the default
branch.
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Speeding up CPython 5-10%

2016-02-02 Thread Victor Stinner
Hi,

I'm back from the FOSDEM event in Brussels; it was really cool. I gave a
talk about FAT Python and got good feedback. But friends told me
that people now have expectations for FAT Python. It looks like people
care about Python performance :-)

FYI the slides of my talk:
https://github.com/haypo/conf/raw/master/2016-FOSDEM/fat_python.pdf
(a video was recorded, I don't know when it will be online)

I took a first look at your patch and, sorry, I'm skeptical about the
design. I have to play with it a little bit more to check whether there
is a better design.

To be clear, FAT Python with your work looks more and more like a
cheap JIT compiler :-) Guards, specializations, optimizing at runtime
after a threshold... all these things come from JIT compilers. I like
the idea of a kind-of JIT compiler without having to pay the high cost
of a large dependency like LLVM. I like baby steps in CPython, it's
faster, it's possible to implement it in a single release cycle (one
minor Python release, Python 3.6). Integrating a JIT compiler into
CPython already failed with Unladen Swallow :-/

PyPy has a completely different design (and has serious issues with the
Python C API), Pyston is restricted to Python 2.7, Pyjion looks
specific to Windows (CoreCLR), Numba is specific to numeric
computations (numpy). IMHO none of these projects can easily be
merged into CPython "quickly" (again, in a single Python release
cycle). By the way, Pyjion still looks very young (I heard that they
are still working on compatibility with CPython, not on
performance yet).


2016-01-27 19:25 GMT+01:00 Yury Selivanov :
> tl;dr The summary is that I have a patch that improves CPython performance
> up to 5-10% on macro benchmarks.  Benchmarks results on Macbook Pro/Mac OS
> X, desktop CPU/Linux, server CPU/Linux are available at [1].  There are no
> slowdowns that I could reproduce consistently.

That's really impressive, great job Yury :-) Getting a non-negligible
speedup on large macrobenchmarks has become really hard in CPython.
CPython is already well optimized in all corners. It looks like the
overall Python performance still depends heavily on the performance of
dictionary and attribute lookups. Even though that was well known, I didn't
expect up to 10% speedup on *macro* benchmarks.


> LOAD_METHOD & CALL_METHOD
> -------------------------
>
> We had a lot of conversations with Victor about his PEP 509, and he sent me
> a link to his amazing compilation of notes about CPython performance [2].
> One optimization that he pointed out to me was LOAD/CALL_METHOD opcodes, an
> idea first originated in PyPy.
>
> There is a patch that implements this optimization, it's tracked here: [3].
> There are some low level details that I explained in the issue, but I'll go
> over the high level design in this email as well.

Your cache is stored directly in code objects. Currently, code objects
are immutable.

Antoine Pitrou's patch adding a LOAD_GLOBAL cache adds a cache to
functions with an "alias" in each frame object:
http://bugs.python.org/issue10401

Andrea Griffini's patch also adding a cache for LOAD_GLOBAL adds a
cache for code objects too.
https://bugs.python.org/issue1616125

I don't know what the best place to store the cache is.

I vaguely recall a patch which uses a single unique global cache, but
maybe I'm wrong :-p


> The opcodes we want to optimize are LOAD_GLOBAL, 0 and 3.  Let's look at the
> first one, that loads the 'print' function from builtins.  The opcode knows
> the following bits of information:

I tested your latest patch. It looks like LOAD_GLOBAL never
invalidates the cache on cache miss ("deoptimize" the instruction).

I suggest always invalidating the cache at each cache miss. Not only
is it common to modify global variables, but there is also the issue of
a different namespace used with the same code object. Examples:

* late global initialization. See for example _a85chars cache of
base64.a85encode.
* code object created in a temporary namespace and then always run in
a different global namespace. See for example
collections.namedtuple(). I'm not sure that it's the best example
because it looks like the Python code only loads builtins, not
globals. But it looks like your code keeps a copy of the version of
the global namespace dict.
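
A tiny self-contained illustration of that second case (hypothetical
code, not namedtuple itself): the same code object can be executed
against several different globals dicts, which is exactly what a
per-code-object cache has to guard against:

code = compile("x = value * 2", "<generated>", "exec")
ns1 = {"value": 10}
ns2 = {"value": 21}
exec(code, ns1)             # same code object, globals dict #1
exec(code, ns2)             # same code object, globals dict #2
print(ns1["x"], ns2["x"])   # 20 42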

I tested with a threshold of 1: always optimize all code objects.
Maybe with your default threshold of 1024 runs, the issue with
different namespaces doesn't occur in practice.


> A straightforward way to implement such a cache is simple, but it consumes a
> lot of memory that would just be wasted, since we only need such a cache
> for LOAD_GLOBAL and LOAD_METHOD opcodes. So we have to be creative about the
> cache design.

I'm not sure that it's worth developing complex dynamic logic to
only enable optimizations after a threshold (a design very close to a
JIT compiler). What is the overhead (% of RSS memory) on a concrete
application when all code objects are optimized at 

Re: [Python-Dev] Opcode cache in ceval loop

2016-02-02 Thread Victor Stinner
Hi,

Maybe it's worth writing a PEP to summarize all your changes to
optimize CPython? It would avoid having to follow different threads
on the mailing lists, different issues on the bug tracker, with
external links to GitHub gists, etc. Your code changes critical parts
of Python: the code object structure and Python/ceval.c.

At least, it would help to document Python internals ;-)

The previous "big" change (optimization) like that was the new "type
attribute cache": addition of tp_version_tag to PyTypeObject. I
"documented" it in the PEP 509 and it was difficult to rebuild the
context, understand the design, etc.
https://www.python.org/dev/peps/pep-0509/#method-cache-and-type-version-tag

Victor
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Speeding up CPython 5-10%

2016-02-02 Thread Sven R. Kunze

On 02.02.2016 00:27, Greg Ewing wrote:

Sven R. Kunze wrote:
Are there some resources on why register machines are considered 
faster than stack machines?


If a register VM is faster, it's probably because each register
instruction does the work of about 2-3 stack instructions,
meaning fewer trips around the eval loop, so fewer unpredictable
branches and fewer pipeline flushes.


That's what I found so far as well.


This assumes that bytecode dispatching is a substantial fraction
of the time taken to execute each instruction. For something
like cpython, where the operations carried out by the bytecodes
involve a substantial amount of work, this may not be true.


Interesting point indeed. It makes sense that register machines only
save us the bytecode dispatching.
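
Here is a toy sketch (plain Python, nothing like CPython's real eval
loop or bytecode) that just counts trips around the dispatch loop for
a + b*c under both encodings:

def run_stack(code, consts):
    # stack VM: each instruction does little work, so more dispatches
    stack, dispatches = [], 0
    for op, arg in code:
        dispatches += 1
        if op == "LOAD_CONST":
            stack.append(consts[arg])
        elif op == "BINARY_MULTIPLY":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == "BINARY_ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
    return stack[-1], dispatches

def run_register(code, consts, nregs):
    # register VM: "fat" three-address instructions, fewer dispatches
    regs, dispatches = list(consts) + [None] * nregs, 0
    for op, dst, a, b in code:
        dispatches += 1
        if op == "MUL":
            regs[dst] = regs[a] * regs[b]
        elif op == "ADD":
            regs[dst] = regs[a] + regs[b]
    return regs[dst], dispatches

consts = [2, 3, 4]   # a, b, c
stack_code = [("LOAD_CONST", 0), ("LOAD_CONST", 1), ("LOAD_CONST", 2),
              ("BINARY_MULTIPLY", None), ("BINARY_ADD", None)]
reg_code = [("MUL", 3, 1, 2), ("ADD", 4, 0, 3)]

print(run_stack(stack_code, consts))      # (14, 5): 5 dispatches
print(run_register(reg_code, consts, 2))  # (14, 2): 2 dispatches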


How much that is compared to the work each instruction requires, I 
cannot say. Maybe, Yury has a better understanding here.



It also assumes the VM is executing the bytecodes directly. If
there is a JIT involved, it all gets translated into something
else anyway, and then it's more a matter of whether you find
it easier to design the JIT to deal with stack or register code.


It seems like Yury thinks so. He didn't tell us so far.


Best,
Sven
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Opcode cache in ceval loop

2016-02-02 Thread Serhiy Storchaka

On 01.02.16 21:10, Yury Selivanov wrote:

To measure the max/average memory impact, I tuned my code to optimize
*every* code object on *first* run.  Then I ran the entire Python test
suite.  Python test suite + standard library both contain around 72395
code objects, which required 20Mb of memory for caches.  The test
process consumed around 400Mb of memory.  Thus, in the absolute worst case
scenario, the overhead is about 5%.


The test process consumes so much memory because a few tests create huge 
objects. If you exclude these tests (note that tests that require more than 
1Gb are already excluded by default) and tests that create a number of 
threads (threads consume much memory too), the rest of the tests need less 
than 100Mb of memory. The absolute required minimum is about 25Mb. Thus, in 
the absolute worst case scenario, the overhead is about 100%.



___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


[Python-Dev] Python environment registration in the Windows Registry

2016-02-02 Thread Steve Dower
I was throwing around some ideas with colleagues about how we detect 
Python installations on Windows from within Visual Studio, and it came 
up that there are many Python distros that install into different 
locations but write the same registry entries. (I knew about this, of 
course, but this time I decided to do something.)


Apart from not being detected properly by all IDEs/tools/installers, 
non-standard distros that register themselves in the official keys may 
also mess with the default sys.path values. For example, at one point 
(possibly still true) if you installed both Canopy and Anaconda, you 
would break the first one because they tried to load the other's stdlib.


Other implementations have different structures or do not register 
themselves at all, which also makes it more complicated for tools to 
discover them.


So here is a rough proposal to standardise the registry keys that can be 
set on Windows in a way that (a) lets other installers besides the 
official ones have equal footing, (b) provides consistent search and 
resolution semantics for tools, and (c) includes slightly more rich 
metadata (such as display names and URLs). Presented in PEP-like form 
here, but if feedback suggests just putting it in the docs I'm okay with 
that too. It is fully backwards compatible with official releases of 
Python (at least back to 2.5, possibly further) and does not require 
modifications to Python or the official installer - it is purely 
codifying a superset of what we already do.


Any and all feedback welcomed, especially from the owners of other 
distros, Python implementations or tools on the list.


Cheers,
Steve

-

PEP: ???
Title: Python environment registration in the Windows Registry
Version: $Revision$
Last-Modified: $Date$
Author: Steve Dower 
Status: Draft
Type: ???
Content-Type: text/x-rst
Created: 02-Feb-2016


Abstract
========

When installed on Windows, the official Python installer creates a 
registry key for discovery and detection by other applications. 
Unofficial installers, such as those used by distributions, typically 
create identical keys for the same purpose. However, these may conflict 
with the official installer or other distributions.


This PEP defines a schema for the Python registry key to allow 
unofficial installers to separately register their installation, and to 
allow applications to detect and correctly display all Python 
environments on a user's machine. No implementation changes to Python 
are proposed with this PEP.


The schema matches the registry values that have been used by the 
official installer since at least Python 2.5, and the resolution 
behaviour matches the behaviour of the official Python releases.


Specification
=============

We consider there to be a single collection of Python environments on a 
machine, where the collection may be different for each user of the 
machine. There are three potential registry locations where the 
collection may be stored based on the installation options of each 
environment. These are::


HKEY_CURRENT_USER\Software\Python\<Company>\<Tag>
HKEY_LOCAL_MACHINE\Software\Python\<Company>\<Tag>
HKEY_LOCAL_MACHINE\Software\Wow6432Node\Python\<Company>\<Tag>

On a given machine, an environment is uniquely identified by its 
Company-Tag pair. Keys should be searched in the order shown, and if the 
same Company-Tag pair appears in more than one of the above locations, 
only the first occurrence is offered.


Official Python releases use ``PythonCore`` for Company, and the value 
of ``sys.winver`` for Tag. Other registered environments may use any 
values for Company and Tag. Recommendations are made in the following 
sections.
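
For illustration only (non-normative), the search and de-duplication
described above can be sketched with the standard ``winreg`` module::

    import winreg

    ROOTS = [
        (winreg.HKEY_CURRENT_USER,  r"Software\Python"),
        (winreg.HKEY_LOCAL_MACHINE, r"Software\Python"),
        (winreg.HKEY_LOCAL_MACHINE, r"Software\Wow6432Node\Python"),
    ]

    def enum_subkeys(key):
        i = 0
        while True:
            try:
                yield winreg.EnumKey(key, i)
            except OSError:
                return
            i += 1

    def find_environments():
        # the first occurrence of a Company-Tag pair wins
        seen = {}
        for hive, path in ROOTS:
            try:
                with winreg.OpenKey(hive, path) as python_key:
                    for company in enum_subkeys(python_key):
                        with winreg.OpenKey(python_key, company) as company_key:
                            for tag in enum_subkeys(company_key):
                                seen.setdefault((company, tag), (hive, path))
            except OSError:
                continue   # this root does not exist on this machine
        return seen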




Backwards Compatibility
-----------------------

Python 3.4 and earlier did not distinguish between 32-bit and 64-bit 
builds in ``sys.winver``. As a result, it is possible to have valid 
side-by-side installations of both 32-bit and 64-bit interpreters.


To ensure backwards compatibility, applications should treat 
environments listed under the following two registry keys as distinct, 
even if Tag matches::


HKEY_LOCAL_MACHINE\Software\Python\PythonCore\<Tag>
HKEY_LOCAL_MACHINE\Software\Wow6432Node\Python\PythonCore\<Tag>

Note that this does not apply to Python 3.5 and later, which uses 
different Tags. Environments registered under other Company names must 
use distinct Tags for side-by-side installations.


1. Environments in ``HKEY_CURRENT_USER`` are always preferred
2. Environments in ``HKEY_LOCAL_MACHINE\Software\Wow6432Node`` are 
preferred if the interpreter is known to be 32-bit



Company
-------

The Company part of the key is intended to group related environments 
and to ensure that Tags are namespaced appropriately. The key name 
should be alphanumeric without spaces and likely to be unique. For 
example, a trademarked name, a UUID, or a hostname would be appropriate::


HKEY_CURRENT_USER\Software\Python\ExampleCorp