Re: [Python-Dev] Opcode cache in ceval loop

2016-02-05 Thread Sven R. Kunze

On 05.02.2016 00:06, Matthias Bussonnier wrote:

On Feb 4, 2016, at 08:22, Sven R. Kunze  wrote:

On 04.02.2016 16:57, Matthias Bussonnier wrote:

On Feb 3, 2016, at 13:22, Yury Selivanov  wrote:


An ideal way would be to calculate a hit/miss ratio over time
for each cached opcode, but that would be an expensive
calculation.

Do you mean like a sliding window?
Otherwise, if you just want, let's say, a 20% miss threshold, you increment by 1
on a hit and decrement by 4 on a miss.

Division is expensive.

I'm not speaking about division here.
If you do +M / -N, the counter will decrease on average only if the hit ratio
is below N/(M+N), but you do not need to do the division.

Then you deoptimize only if you get < 0.


I see, but it still looks more complicated. :)






On Feb 3, 2016, at 13:37, Sven R. Kunze  wrote:


On 03.02.2016 22:22, Yury Selivanov wrote:

One way of tackling this is to give each optimized opcode
a counter for hit/misses.  When we have a "hit" we increment
that counter, when it's a miss, we decrement it.

Within a given range, I suppose. Like:

c = min(c+1, 100)

Min might be overkill; maybe you can use an OR mask to limit the window range
to 256 consecutive calls?

Sure, that is how I would have written it in Python. But I would suggest an AND 
mask. ;-)


Sure, implementation detail I would say. Should not write emails before 
breakfast...


;-)


The other problem with the mask is that if your increment hits 256 you wrap
around back to 0, where it deoptimizes (which is not what you want), so you
might need to leave the sign bit unmasked and deoptimize only below a certain
negative threshold.


Does it make sense?


Definitely. I am curious about the actual implementation of this idea.

Best,
Sven






Re: [Python-Dev] Opcode cache in ceval loop

2016-02-04 Thread Stephen J. Turnbull
Nick Coghlan writes:

 > If someone else wanted to also describe the change in a PEP for ease
 > of future reference, using Yury's ceval.txt as input, I do think that
 > would be a good thing, but I wouldn't want to make the enhancement
 > conditional on someone volunteering to do that.

I wasn't suggesting making it conditional, I was encouraging Yury to
do it himself as the most familiar with the situation.  I may be
underestimating the additional cost, but it seems to me explaining
both before and after would be very useful to people who've hacked
ceval in the past.  (Presumably Yury would just be explaining
"after" in his ceval.txt.)

The important thing is to make it discoverable, though, and I don't
care if it's done by PEP or not.  In fact, perhaps "let Yury be Yury",
plus an informational PEP listing all of the *.txt files in the tree
would be more useful?  Or in the devguide?



Re: [Python-Dev] Opcode cache in ceval loop

2016-02-04 Thread Matthias Bussonnier

> On Feb 4, 2016, at 08:22, Sven R. Kunze  wrote:
> 
> On 04.02.2016 16:57, Matthias Bussonnier wrote:
>>> On Feb 3, 2016, at 13:22, Yury Selivanov  wrote:
>>> 
>>> 
>>> An ideal way would be to calculate a hit/miss ratio over time
>>> for each cached opcode, but that would be an expensive
>>> calculation.
>> Do you mean like a sliding window?
>> Otherwise, if you just want, let's say, a 20% miss threshold, you increment by
>> 1 on a hit,
>> and decrement by 4 on a miss.
> 
> Division is expensive.

I'm not speaking about division here.
If you do +M / -N, the counter will decrease on average only if the hit ratio
is below N/(M+N), but you do not need to do the division.

Then you deoptimize only if you get < 0.
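
To illustrate, a rough Python sketch of the idea (numbers and names are made
up, this is not the actual ceval.c logic):

HIT_INCREMENT = 1     # M
MISS_DECREMENT = 4    # N

def should_deoptimize(events):
    """events: iterable of booleans, True for a cache hit, False for a miss."""
    counter = 0
    for hit in events:
        counter += HIT_INCREMENT if hit else -MISS_DECREMENT
        if counter < 0:
            return True           # too many misses: deoptimize the opcode
    return False

# ~10% misses: the counter stays non-negative, the cache is kept.
print(should_deoptimize([True] * 9 + [False]))    # False
# 50% misses: the counter goes negative almost immediately.
print(should_deoptimize([True, False] * 10))      # True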

> 
>> 
>> On Feb 3, 2016, at 13:37, Sven R. Kunze  wrote:
>> 
>>> On 03.02.2016 22:22, Yury Selivanov wrote:
 One way of tackling this is to give each optimized opcode
 a counter for hit/misses.  When we have a "hit" we increment
 that counter, when it's a miss, we decrement it.
>>> Within a given range, I suppose. Like:
>>> 
>>> c = min(c+1, 100)
>> 
>> Min might be overkill; maybe you can use an OR mask to limit the window
>> range
>> to 256 consecutive calls?
> 
> Sure, that is how I would have written it in Python. But I would suggest an 
> AND mask. ;-)


Sure, implementation detail I would say. Should not write emails before 
breakfast...

The other problem with the mask is that if your increment hits 256 you wrap
around back to 0, where it deoptimizes (which is not what you want), so you
might need to leave the sign bit unmasked and deoptimize only below a certain
negative threshold.
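
Sketched in Python (only an illustration, with made-up bounds):

UPPER_BOUND = 127        # saturate here instead of masking/wrapping
DEOPT_THRESHOLD = -20    # deoptimize only below this negative value

def update(counter, hit):
    if hit:
        return min(counter + 1, UPPER_BOUND)   # never wraps back to 0
    return counter - 4

counter = 0
for _ in range(300):                 # a long run of hits just pins the counter
    counter = update(counter, True)
print(counter)                       # 127

for _ in range(40):                  # a burst of misses eventually deoptimizes
    counter = update(counter, False)
    if counter < DEOPT_THRESHOLD:
        print("deoptimize")
        break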


Does it make sense?

-- 
M

 

> 
> Best,
> Sven



Re: [Python-Dev] Opcode cache in ceval loop

2016-02-04 Thread Sven R. Kunze

On 04.02.2016 16:57, Matthias Bussonnier wrote:

On Feb 3, 2016, at 13:22, Yury Selivanov  wrote:


An ideal way would be to calculate a hit/miss ratio over time
for each cached opcode, but that would be an expensive
calculation.

Do you mean like a sliding window?
Otherwise, if you just want, let's say, a 20% miss threshold, you increment by 1
on a hit and decrement by 4 on a miss.


Division is expensive.



On Feb 3, 2016, at 13:37, Sven R. Kunze  wrote:


On 03.02.2016 22:22, Yury Selivanov wrote:

One way of tackling this is to give each optimized opcode
a counter for hit/misses.  When we have a "hit" we increment
that counter, when it's a miss, we decrement it.

Within a given range, I suppose. Like:

c = min(c+1, 100)


Min might be overkill; maybe you can use an OR mask to limit the window range
to 256 consecutive calls?


Sure, that is how I would have written it in Python. But I would suggest 
an AND mask. ;-)


Best,
Sven


Re: [Python-Dev] Opcode cache in ceval loop

2016-02-04 Thread Matthias Bussonnier

> On Feb 3, 2016, at 13:22, Yury Selivanov  wrote:
> 
> 
> An ideal way would be to calculate a hit/miss ratio over time
> for each cached opcode, but that would be an expensive
> calculation.

Do you mean like a sliding window?
Otherwise, if you just want, let's say, a 20% miss threshold, you increment by 1
on a hit and decrement by 4 on a miss.


On Feb 3, 2016, at 13:37, Sven R. Kunze  wrote:

> On 03.02.2016 22:22, Yury Selivanov wrote:
>> One way of tackling this is to give each optimized opcode
>> a counter for hit/misses.  When we have a "hit" we increment
>> that counter, when it's a miss, we decrement it.
> 
> Within a given range, I suppose. Like:
> 
> c = min(c+1, 100)


Min might be overkill; maybe you can use an OR mask to limit the window range
to 256 consecutive calls?
-- 
M





> 
> Yury
> 


Re: [Python-Dev] Opcode cache in ceval loop

2016-02-04 Thread Nick Coghlan
On 3 February 2016 at 06:49, Stephen J. Turnbull  wrote:
> Yury Selivanov writes:
>
>  > Not sure about that... PEPs take a LOT of time :(
>
> Informational PEPs need not take so much time, no more than you would
> spend on ceval.txt.  I'm sure a PEP would get a lot more attention
> from reviewers, too.
>
> Even if you PEP the whole thing, as you say it's a (big ;-)
> implementation detail.  A PEP won't make things more controversial (or
> less) than they already are.  I don't see why it would take that much
> more time than ceval.txt.

For a typical PEP, you need to explain both the status quo *and* the
state after the changes, as well as provide references to the related
discussions.

I think in this case the main target audience for the technical
details should be future maintainers, so Yury writing a ceval.txt akin
to the current dictnotes.txt, listsort.txt, etc would cover the
essentials.

If someone else wanted to also describe the change in a PEP for ease
of future reference, using Yury's ceval.txt as input, I do think that
would be a good thing, but I wouldn't want to make the enhancement
conditional on someone volunteering to do that.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] Opcode cache in ceval loop

2016-02-03 Thread Sven R. Kunze

On 03.02.2016 22:22, Yury Selivanov wrote:

One way of tackling this is to give each optimized opcode
a counter for hit/misses.  When we have a "hit" we increment
that counter, when it's a miss, we decrement it.


Within a given range, I suppose. Like:

c = min(c+1, 100)



I kind of have something like that right now:
https://github.com/1st1/cpython/blob/opcache5/Python/ceval.c#L3035

But I only decrement that counter -- the idea is that LOAD_ATTR
is allowed to "miss" only 20 times before getting deoptimized.

I'll experiment with inc/dec on hit/miss and see how that affects
the performance.

An ideal way would be to calculate a hit/miss ratio over time
for each cached opcode, but that would be an expensive
calculation.




Re: [Python-Dev] Opcode cache in ceval loop

2016-02-03 Thread Yury Selivanov



On 2016-02-03 3:53 PM, francismb wrote:

Hi,

On 02/01/2016 10:43 PM, Yury Selivanov wrote:


We also need to deoptimize the code to avoid having too many cache
misses/pointless cache updates.  I found that, for instance, LOAD_ATTR
is either super stable (hits 100% of the time) or really unstable, so 20
misses, again, seems to be alright.


Aren't those hits/misses a way to see how dynamic the code is? I mean,
can't the current magic values (manually tweaked on a limited set)
be self-tweaked/adapted based on those numbers?



Probably.

One way of tackling this is to give each optimized opcode
a counter for hit/misses.  When we have a "hit" we increment
that counter, when it's a miss, we decrement it.

I kind of have something like that right now:
https://github.com/1st1/cpython/blob/opcache5/Python/ceval.c#L3035

But I only decrement that counter -- the idea is that LOAD_ATTR
is allowed to "miss" only 20 times before getting deoptimized.

I'll experiment with inc/dec on hit/miss and see how that affects
the performance.

An ideal way would be to calculate a hit/miss ratio over time
for each cached opcode, but that would be an expensive
calculation.

Yury



Re: [Python-Dev] Opcode cache in ceval loop

2016-02-03 Thread francismb
Hi,

On 02/01/2016 10:43 PM, Yury Selivanov wrote:

> 
> We also need to deoptimize the code to avoid having too many cache
> misses/pointless cache updates.  I found that, for instance, LOAD_ATTR
> is either super stable (hits 100% of the time) or really unstable, so 20
> misses, again, seems to be alright.
> 
Aren't those hits/misses a way to see how dynamic the code is? I mean,
can't the current magic values (manually tweaked on a limited set)
be self-tweaked/adapted based on those numbers?

Thanks in advance,
francis


Re: [Python-Dev] Opcode cache in ceval loop

2016-02-02 Thread Serhiy Storchaka

On 02.02.16 21:41, Yury Selivanov wrote:

I can write a ceval.txt file explaining what's going on
in ceval loop, with details on the opcode cache and other
things.  I think it's even better than a PEP, to be honest.


I totally agree.




Re: [Python-Dev] Opcode cache in ceval loop

2016-02-02 Thread Stephen J. Turnbull
Yury Selivanov writes:

 > Not sure about that... PEPs take a LOT of time :(

Informational PEPs need not take so much time, no more than you would
spend on ceval.txt.  I'm sure a PEP would get a lot more attention
from reviewers, too.

Even if you PEP the whole thing, as you say it's a (big ;-)
implementation detail.  A PEP won't make things more controversial (or
less) than they already are.  I don't see why it would take that much
more time than ceval.txt.

 > I can write a ceval.txt file explaining what's going on
 > in ceval loop, with details on the opcode cache and other
 > things.  I think it's even better than a PEP, to be honest.

Unlikely to be better, since that's a subset of the proposed PEP.

Of course it's up to you, since you'd be doing most of the work, but
for the rest of us PEPs are a lot more discoverable and easily
referenced than a .txt file with a short name.



Re: [Python-Dev] Opcode cache in ceval loop

2016-02-02 Thread Serhiy Storchaka

On 02.02.16 21:23, Yury Selivanov wrote:

Alright, I modified the code to optimize ALL code objects, and ran unit
tests with the above tests excluded:

-- Max process mem (ru_maxrss) = 131858432
-- Opcode cache number of objects  = 42109
-- Opcode cache total extra mem= 10901106


Thank you for doing these tests. Now the results are more convincing to me.


And asyncio tests:

-- Max process mem (ru_maxrss) = 57081856
-- Opcode cache number of objects  = 4656
-- Opcode cache total extra mem= 1766681



FWIW, here are stats for asyncio with only hot objects being optimized:

-- Max process mem (ru_maxrss) = 54775808
-- Opcode cache number of objects  = 121
-- Opcode cache total extra mem= 43521


Interesting: 57081856 - 54775808 = 2306048, but 1766681 - 43521 =
1723160. That is an additional ~0.5 MB (2306048 - 1723160 = 582888 bytes)
lost to fragmentation.




Re: [Python-Dev] Opcode cache in ceval loop

2016-02-02 Thread Victor Stinner
2016-02-02 20:23 GMT+01:00 Yury Selivanov :
> Alright, I modified the code to optimize ALL code objects, and ran unit
> tests with the above tests excluded:
>
> -- Max process mem (ru_maxrss) = 131858432
> -- Opcode cache number of objects  = 42109
> -- Opcode cache total extra mem= 10901106

In my experience, RSS is a coarse measure of the memory usage. I wrote
tracemalloc to get a reliable measure of the *Python* memory usage:
https://docs.python.org/dev/library/tracemalloc.html#tracemalloc.get_traced_memory

Run tests with -X tracemalloc -i, and then type in the REPL:

>>> import tracemalloc; print("%.1f kB" % (tracemalloc.get_traced_memory()[1] / 1024.))
10197.7 kB

I expect this value to be (much) lower than RSS.

Victor


Re: [Python-Dev] Opcode cache in ceval loop

2016-02-02 Thread Yury Selivanov



On 2016-02-02 1:45 PM, Serhiy Storchaka wrote:

On 02.02.16 19:45, Yury Selivanov wrote:

On 2016-02-02 12:41 PM, Serhiy Storchaka wrote:

On 01.02.16 21:10, Yury Selivanov wrote:

To measure the max/average memory impact, I tuned my code to optimize
*every* code object on *first* run.  Then I ran the entire Python test
suite.  Python test suite + standard library both contain around 72395
code objects, which required 20 MB of memory for caches.  The test
process consumed around 400 MB of memory.  Thus, in the absolute worst-case
scenario, the overhead is about 5%.


The test process consumes so much memory because a few tests create huge
objects. If we exclude these tests (note that tests that require more
than 1 GB are already excluded by default) and tests that create a
number of threads (threads consume much memory too), the rest of the tests
need less than 100 MB of memory. The absolute required minimum is about
25 MB. Thus, in the absolute worst-case scenario, the overhead is about
100%.

Can you give me the exact configuration of tests (command line to run)
that would only consume 25mb?


I don't remember exactly which tests consume the most memory, but the
following tests fail when run with less than 30 MB of memory:


test___all__ test_asynchat test_asyncio test_bz2 test_capi 
test_concurrent_futures test_ctypes test_decimal test_descr 
test_distutils test_docxmlrpc test_eintr test_email test_fork1 
test_fstring test_ftplib test_functools test_gc test_gdb test_hashlib 
test_httplib test_httpservers test_idle test_imaplib test_import 
test_importlib test_io test_itertools test_json test_lib2to3 test_list 
test_logging test_longexp test_lzma test_mmap 
test_multiprocessing_fork test_multiprocessing_forkserver 
test_multiprocessing_main_handling test_multiprocessing_spawn test_os 
test_pickle test_poplib test_pydoc test_queue test_regrtest 
test_resource test_robotparser test_shutil test_smtplib test_socket 
test_sqlite test_ssl test_subprocess test_tarfile test_tcl test_thread 
test_threaded_import test_threadedtempfile test_threading 
test_threading_local test_threadsignals test_tix test_tk test_tools 
test_ttk_guionly test_ttk_textonly test_tuple test_unicode 
test_urllib2_localnet test_wait3 test_wait4 test_xmlrpc test_zipfile 
test_zlib


Alright, I modified the code to optimize ALL code objects, and ran unit 
tests with the above tests excluded:


-- Max process mem (ru_maxrss) = 131858432
-- Opcode cache number of objects  = 42109
-- Opcode cache total extra mem= 10901106

And asyncio tests:

-- Max process mem (ru_maxrss) = 57081856
-- Opcode cache number of objects  = 4656
-- Opcode cache total extra mem= 1766681

So the absolute worst case for a small asyncio program is 3%; for unit
tests (with the above list excluded), 8%.


I think it'd be very hard to find a real-life program that consists of
only code objects and nothing else (no data to work with/process, no
objects with dicts, no threads, basically nothing), because only for
such a program would you have a 100% memory overhead for the bytecode
cache (when all code objects are optimized).


FWIW, here are stats for asyncio with only hot objects being optimized:

-- Max process mem (ru_maxrss) = 54775808
-- Opcode cache number of objects  = 121
-- Opcode cache total extra mem= 43521

Yury


Re: [Python-Dev] Opcode cache in ceval loop

2016-02-02 Thread ƦOB COASTN
>> I can write a ceval.txt file explaining what's going on
>> in ceval loop, with details on the opcode cache and other
>> things.  I think it's even better than a PEP, to be honest.
>
>
> I totally agree.
>
Please include the notes text file.  This provides an excellent
summary for those of us who haven't yet taken the deep dive into the
ceval loop but still wish to understand its implementation.


Re: [Python-Dev] Opcode cache in ceval loop

2016-02-02 Thread Sven R. Kunze

On 02.02.2016 20:41, Yury Selivanov wrote:

Hi Victor,

On 2016-02-02 4:33 AM, Victor Stinner wrote:

Hi,

Maybe it's worth writing a PEP to summarize all your changes to
optimize CPython? It would avoid having to follow different threads
on the mailing lists, different issues on the bug tracker, with
external links to GitHub gists, etc. Your code changes critical parts
of Python: the code object structure and Python/ceval.c.


Not sure about that... PEPs take a LOT of time :(


True.


Besides, all my changes are CPython specific and
can be considered as an implementation detail.



At least, it would help to document Python internals ;-)


I can write a ceval.txt file explaining what's going on
in ceval loop, with details on the opcode cache and other
things.  I think it's even better than a PEP, to be honest.


I would love to see that. :)


Best,
Sven


Re: [Python-Dev] Opcode cache in ceval loop

2016-02-02 Thread Yury Selivanov



On 2016-02-02 12:41 PM, Serhiy Storchaka wrote:

On 01.02.16 21:10, Yury Selivanov wrote:

To measure the max/average memory impact, I tuned my code to optimize
*every* code object on *first* run.  Then I ran the entire Python test
suite.  Python test suite + standard library both contain around 72395
code objects, which required 20Mb of memory for caches.  The test
process consumed around 400Mb of memory.  Thus, the absolute worst case
scenario, the overhead is about 5%.


The test process consumes so much memory because a few tests create huge
objects. If we exclude these tests (note that tests that require more
than 1 GB are already excluded by default) and tests that create a
number of threads (threads consume much memory too), the rest of the tests
need less than 100 MB of memory. The absolute required minimum is about
25 MB. Thus, in the absolute worst-case scenario, the overhead is about 100%.
Can you give me the exact configuration of tests (command line to run) 
that would only consume 25mb?


Yury


Re: [Python-Dev] Opcode cache in ceval loop

2016-02-02 Thread Yury Selivanov

Hi Victor,

On 2016-02-02 4:33 AM, Victor Stinner wrote:

Hi,

Maybe it's worth writing a PEP to summarize all your changes to
optimize CPython? It would avoid having to follow different threads
on the mailing lists, different issues on the bug tracker, with
external links to GitHub gists, etc. Your code changes critical parts
of Python: the code object structure and Python/ceval.c.


Not sure about that... PEPs take a LOT of time :(

Besides, all my changes are CPython specific and
can be considered as an implementation detail.



At least, it would help to document Python internals ;-)


I can write a ceval.txt file explaining what's going on
in ceval loop, with details on the opcode cache and other
things.  I think it's even better than a PEP, to be honest.

Yury


Re: [Python-Dev] Opcode cache in ceval loop

2016-02-02 Thread Serhiy Storchaka

On 02.02.16 19:45, Yury Selivanov wrote:

On 2016-02-02 12:41 PM, Serhiy Storchaka wrote:

On 01.02.16 21:10, Yury Selivanov wrote:

To measure the max/average memory impact, I tuned my code to optimize
*every* code object on *first* run.  Then I ran the entire Python test
suite.  Python test suite + standard library both contain around 72395
code objects, which required 20Mb of memory for caches.  The test
process consumed around 400Mb of memory.  Thus, the absolute worst case
scenario, the overhead is about 5%.


The test process consumes so much memory because a few tests create huge
objects. If we exclude these tests (note that tests that require more
than 1 GB are already excluded by default) and tests that create a
number of threads (threads consume much memory too), the rest of the tests
need less than 100 MB of memory. The absolute required minimum is about
25 MB. Thus, in the absolute worst-case scenario, the overhead is about 100%.

Can you give me the exact configuration of tests (command line to run)
that would only consume 25mb?


I don't remember exactly which tests consume the most memory, but the
following tests fail when run with less than 30 MB of memory:


test___all__ test_asynchat test_asyncio test_bz2 test_capi 
test_concurrent_futures test_ctypes test_decimal test_descr 
test_distutils test_docxmlrpc test_eintr test_email test_fork1 
test_fstring test_ftplib test_functools test_gc test_gdb test_hashlib 
test_httplib test_httpservers test_idle test_imaplib test_import 
test_importlib test_io test_itertools test_json test_lib2to3 test_list 
test_logging test_longexp test_lzma test_mmap test_multiprocessing_fork 
test_multiprocessing_forkserver test_multiprocessing_main_handling 
test_multiprocessing_spawn test_os test_pickle test_poplib test_pydoc 
test_queue test_regrtest test_resource test_robotparser test_shutil 
test_smtplib test_socket test_sqlite test_ssl test_subprocess 
test_tarfile test_tcl test_thread test_threaded_import 
test_threadedtempfile test_threading test_threading_local 
test_threadsignals test_tix test_tk test_tools test_ttk_guionly 
test_ttk_textonly test_tuple test_unicode test_urllib2_localnet 
test_wait3 test_wait4 test_xmlrpc test_zipfile test_zlib





Re: [Python-Dev] Opcode cache in ceval loop

2016-02-02 Thread Serhiy Storchaka

On 01.02.16 21:10, Yury Selivanov wrote:

To measure the max/average memory impact, I tuned my code to optimize
*every* code object on *first* run.  Then I ran the entire Python test
suite.  Python test suite + standard library both contain around 72395
code objects, which required 20Mb of memory for caches.  The test
process consumed around 400Mb of memory.  Thus, the absolute worst case
scenario, the overhead is about 5%.


The test process consumes so much memory because a few tests create huge
objects. If we exclude these tests (note that tests that require more than
1 GB are already excluded by default) and tests that create a number of
threads (threads consume much memory too), the rest of the tests need less
than 100 MB of memory. The absolute required minimum is about 25 MB. Thus,
in the absolute worst-case scenario, the overhead is about 100%.





Re: [Python-Dev] Opcode cache in ceval loop

2016-02-02 Thread Victor Stinner
Hi,

Maybe it's worth writing a PEP to summarize all your changes to
optimize CPython? It would avoid having to follow different threads
on the mailing lists, different issues on the bug tracker, with
external links to GitHub gists, etc. Your code changes critical parts
of Python: the code object structure and Python/ceval.c.

At least, it would help to document Python internals ;-)

The previous "big" change (optimization) like that was the new "type
attribute cache": addition of tp_version_tag to PyTypeObject. I
"documented" it in the PEP 509 and it was difficult to rebuild the
context, understand the design, etc.
https://www.python.org/dev/peps/pep-0509/#method-cache-and-type-version-tag

Victor


Re: [Python-Dev] Opcode cache in ceval loop

2016-02-01 Thread Yury Selivanov

Andrew,

On 2016-02-01 4:29 PM, Andrew Barnert wrote:

Looking over the thread and the two issues, you've got good arguments for why 
the improved code will be the most common code, and good benchmarks for various 
kinds of real-life code, but it doesn't seem like you'd tried to stress it on 
anything that could be made worse. From your explanations and your code, I 
wouldn't expect that @classmethods, functions stored in the object dict or 
generated by __getattr__, non-function callables as methods, etc. would go 
significantly slower,


Right.  The caching, of course, has some overhead, albeit barely
detectable.  The only way the slowdown might become "significant" is if
there is a bug in the ceval.c code -- i.e. an opcode doesn't get
de-optimized, etc.  That should be fixable.



  or code that mixes @properties or __getattr__ proxy attributes with real 
attributes, or uses __slots__,


No performance degradation for __slots__, we have a benchmark for that.  
I also tried to add __slots__ to every class in the Richards test - no 
improvement or performance degradation there.



  or code that does frequently write to a global, etc. But it would be nice to 
_know_ that they don't instead of just expecting it.


FWIW I've just tried to write a micro-benchmark for __getattr__: 
https://gist.github.com/1st1/22c1aa0a46f246a31515


The opcode cache gets quickly deoptimized with it, but, as expected,
CPython with the opcode cache is <1% slower.  But that's 1% in a super
micro-benchmark; of course the cost of having a cache that isn't used
will show up.  In real code that doesn't consist only of LOAD_ATTRs,
it won't even be possible to see any slowdown.
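
The benchmark is shaped roughly like this (not the exact gist contents):
every attribute read goes through __getattr__, so the LOAD_ATTR cache can
never stabilize:

import time

class Proxy:
    def __init__(self):
        self._data = {'x': 1}
    def __getattr__(self, name):
        # hit for every 'p.x' below, so the per-opcode cache keeps missing
        return self._data[name]

def bench(n=1000000):
    p = Proxy()
    start = time.perf_counter()
    total = 0
    for _ in range(n):
        total += p.x
    return time.perf_counter() - start

print(bench())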


Thanks,
Yury


Re: [Python-Dev] Opcode cache in ceval loop

2016-02-01 Thread Sven R. Kunze

On 01.02.2016 22:43, Yury Selivanov wrote:

Sven,

On 2016-02-01 4:32 PM, Sven R. Kunze wrote:

On 01.02.2016 22:27, Yury Selivanov wrote:

Right now they are private constants in ceval.c.

I will (maybe) expose a private API via the _testcapi module to 
re-define them (set them to 1 or 0), only to write better 
unittests.  I have no plans to make those constants public or have a 
public API to tackle them.  IMHO, this is something that almost 
nobody will ever use.


Alright. I agree with you on that.

What I actually meant was: how can we find the optimal values? I 
understand that 1000 and 20 are some hand-figured/subjective values 
for now.


Is there standardized/objective way to find out the best values? What 
does best even mean here?


Running lots of benchmarks and micro-benchmarks hundreds of times ;)  
I've done a lot of that, and I noticed that the numbers don't matter 
too much.


That's actually pretty interesting. :)

Do you consider writing a blog post about this at some time?

What matters is that we don't want to optimize the code that runs 0 or 
1 times.  To save some memory we don't want to optimize the code that 
runs 10 times.  So 1000 seems to be about right.


We also need to deoptimize the code to avoid having too many cache 
misses/pointless cache updates.  I found that, for instance, LOAD_ATTR 
is either super stable (hits 100% of the time) or really unstable, so 20
misses, again, seems to be alright.


I'm flexible about tweaking those values, I encourage you and everyone 
to experiment, if you have time ;) 
https://github.com/1st1/cpython/blob/opcache5/Python/ceval.c#L100


Right now, I am busy with the heap implementation but I think I can look 
into it later.


Best,
Sven


Re: [Python-Dev] Opcode cache in ceval loop

2016-02-01 Thread Yury Selivanov

Sven,

On 2016-02-01 4:32 PM, Sven R. Kunze wrote:

On 01.02.2016 22:27, Yury Selivanov wrote:

Right now they are private constants in ceval.c.

I will (maybe) expose a private API via the _testcapi module to 
re-define them (set them to 1 or 0), only to write better unittests.  
I have no plans to make those constants public or have a public API 
to tackle them.  IMHO, this is something that almost nobody will ever 
use.


Alright. I agree with you on that.

What I actually meant was: how can we find the optimal values? I 
understand that 1000 and 20 are some hand-figured/subjective values 
for now.


Is there standardized/objective way to find out the best values? What 
does best even mean here?


Running lots of benchmarks and micro-benchmarks hundreds of times ;)  
I've done a lot of that, and I noticed that the numbers don't matter too 
much.


What matters is that we don't want to optimize the code that runs 0 or 1 
times.  To save some memory we don't want to optimize the code that runs 
10 times.  So 1000 seems to be about right.


We also need to deoptimize the code to avoid having too many cache 
misses/pointless cache updates.  I found that, for instance, LOAD_ATTR 
is either super stable (hits 100% of the time) or really unstable, so 20
misses, again, seems to be alright.


I'm flexible about tweaking those values, I encourage you and everyone 
to experiment, if you have time ;) 
https://github.com/1st1/cpython/blob/opcache5/Python/ceval.c#L100


Thanks,
Yury


Re: [Python-Dev] Opcode cache in ceval loop

2016-02-01 Thread Sven R. Kunze

On 01.02.2016 22:27, Yury Selivanov wrote:

Right now they are private constants in ceval.c.

I will (maybe) expose a private API via the _testcapi module to 
re-define them (set them to 1 or 0), only to write better unittests.  
I have no plans to make those constants public or have a public API to 
tackle them.  IMHO, this is something that almost nobody will ever use.


Alright. I agree with you on that.

What I actually meant was: how can we find the optimal values? I 
understand that 1000 and 20 are some hand-figured/subjective values for now.


Is there a standardized/objective way to find out the best values? What
does "best" even mean here?


Best,
Sven


Re: [Python-Dev] Opcode cache in ceval loop

2016-02-01 Thread Damien George
Hi Yury,

That's great news about the speed improvements with the dict offset cache!

> The cache struct is defined in code.h [2], and is 32 bytes long. When a
> code object becomes hot, it gets an cache offset table allocated for it
> (+1 byte for each opcode) + an array of cache structs.

Ok, so each opcode has a 1-byte cache that sits separately to the
actual bytecode.  But a lot of opcodes don't use it so that leads to
some wasted memory, correct?

But then how do you index the cache, do you keep a count of the
current opcode number?  If I remember correctly, CPython has some
opcodes taking 1 byte, and some taking 3 bytes, so the offset into the
bytecode cannot be easily mapped to a bytecode number.

Cheers,
Damien.


Re: [Python-Dev] Opcode cache in ceval loop

2016-02-01 Thread Yury Selivanov



On 2016-02-01 3:27 PM, Sven R. Kunze wrote:

On 01.02.2016 20:51, Yury Selivanov wrote:
If LOAD_ATTR gets too many cache misses (20 in my current patch) it 
gets deoptimized, and the default implementation is used.  So if the 
code is very dynamic - there's no improvement, but no performance 
penalty either.


Will you re-try optimizing it?


No.

It's important to understand that if we have a lot of cache misses after
the code object was executed 1000 times, it doesn't make sense to keep
trying to update that cache.  It just means that the code, at that
particular point, works with different kinds of objects. FWIW, I
experimented with different ideas (one is to never de-optimize), and the
current strategy works best across the vast majority of benchmarks.


Yury


Re: [Python-Dev] Opcode cache in ceval loop

2016-02-01 Thread Yury Selivanov



On 2016-02-01 3:21 PM, Brett Cannon wrote:


On Mon, 1 Feb 2016 at 12:16 Yury Selivanov wrote:


Brett,

On 2016-02-01 3:08 PM, Brett Cannon wrote:
>
>
> > On Mon, 1 Feb 2016 at 11:51 Yury Selivanov wrote:
>
> Hi Brett,
>
[..]
>
>
> > The first two fields are used to make sure that we have objects of the
> > same type.  If it changes, we deoptimize the opcode immediately.  Then
> > we try the offset.  If it's successful - we have a cache hit.  If not,
> > that's fine, we'll try another few times before deoptimizing the
> > opcode.
>
>
> So this is a third "next step" that has its own issue?

It's all in issue http://bugs.python.org/issue26219 right now.

My current plan is to implement LOAD_METHOD/CALL_METHOD (just opcodes,
no cache) in 26110.

Then implement caching for LOAD_METHOD, LOAD_GLOBAL, and LOAD_ATTR in
26219.  I'm flexible to break down 26219 in three separate issues if
that helps the review process (but that would take more of my time):

- implement support for opcode caching (general infrastructure) +
LOAD_GLOBAL optimization
- LOAD_METHOD optimization
- LOAD_ATTR optimization


I personally don't care how you break it down, just trying to keep all 
the moving pieces in my head. :)


Anyway, it sounds like PEP 509 is blocking part of it, but the 
LOAD_METHOD stuff can go in as-is. So are you truly blocked only on 
getting the latest version of that patch up to 
http://bugs.python.org/issue26110 and getting a code review?
Yep.  The initial implementation of LOAD_METHOD doesn't need PEP 509 / 
opcode caching.  I'll have to focus on something else this week, but 
early next week I can upload a new patch for 26110.


When we have 26110 committed and PEP 509 approved and committed, I can 
update the opcode cache patch (issue 26219) and we can start reviewing it.


Yury


Re: [Python-Dev] Opcode cache in ceval loop

2016-02-01 Thread Yury Selivanov



On 2016-02-01 4:21 PM, Yury Selivanov wrote:

Hi Damien,

On 2016-02-01 3:59 PM, Damien George wrote:


[..]


But then how do you index the cache, do you keep a count of the
current opcode number?  If I remember correctly, CPython has some
opcodes taking 1 byte, and some taking 3 bytes, so the offset into the
bytecode cannot be easily mapped to a bytecode number.




Here are a few links that might explain the idea better:

https://github.com/1st1/cpython/blob/opcache5/Python/ceval.c#L1229
https://github.com/1st1/cpython/blob/opcache5/Python/ceval.c#L2610
https://github.com/1st1/cpython/blob/opcache5/Objects/codeobject.c#L167

Yury


Re: [Python-Dev] Opcode cache in ceval loop

2016-02-01 Thread Yury Selivanov

Brett,

On 2016-02-01 3:08 PM, Brett Cannon wrote:



On Mon, 1 Feb 2016 at 11:51 Yury Selivanov wrote:


Hi Brett,


[..]



The first two fields are used to make sure that we have objects of the
same type.  If it changes, we deoptimize the opcode immediately.  Then
we try the offset.  If it's successful - we have a cache hit.  If not,
that's fine, we'll try another few times before deoptimizing the
opcode.


So this is a third "next step" that has its own issue?


It's all in issue http://bugs.python.org/issue26219 right now.

My current plan is to implement LOAD_METHOD/CALL_METHOD (just opcodes, 
no cache) in 26110.


Then implement caching for LOAD_METHOD, LOAD_GLOBAL, and LOAD_ATTR in 
26219.  I'm flexible to break down 26219 in three separate issues if 
that helps the review process (but that would take more of my time):


- implement support for opcode caching (general infrastructure) + 
LOAD_GLOBAL optimization

- LOAD_METHOD optimization
- LOAD_ATTR optimization

Yury


Re: [Python-Dev] Opcode cache in ceval loop

2016-02-01 Thread Brett Cannon
On Mon, 1 Feb 2016 at 11:51 Yury Selivanov  wrote:

> Hi Brett,
>
> On 2016-02-01 2:30 PM, Brett Cannon wrote:
> >
> >
> > On Mon, 1 Feb 2016 at 11:11 Yury Selivanov wrote:
> >
> > Hi,
> >
> [..]
> >
> > What's next?
> > 
> >
> > First, I'd like to merge the new LOAD_METHOD opcode, see issue 26110
> > [1].  It's a very straightforward optimization, the patch is small
> and
> > easy to review.
> >
> >
> > +1 from me.
> >
> >
> > Second, I'd like to merge the new opcode cache, see issue 26219 [5].
> > All unittests pass.  Memory usage increase is very moderate (<1mb for
> > the entire test suite), and the performance increase is significant.
> > The only potential blocker for this is PEP 509 approval (which I'd be
> > happy to assist with).
> >
> >
> > I think the fact that it improves performance across the board as well
> > as eliminates the various tricks people use to cache global and
> > built-ins, a big +1 from me. I guess that means Victor needs to ask
> > for pronouncement on PEP 509.
>
> Great!  AFAIK Victor still needs to update the PEP with some changes
> (globally unique ma_version).  My patch includes the latest
> implementation of PEP 509, and it works fine (no regressions, no broken
> unittests).  I can also assist with reviewing Victor's implementation if
> the PEP is accepted.
>
> >
> > BTW, where does LOAD_ATTR fit into all of this?
>
> LOAD_ATTR optimization doesn't use any of PEP 509 new stuff (if I
> understand you question correctly).  It's based on the following
> assumptions (that really make JITs work so well):
>
> 1. Most classes don't implement __getattribute__.
>
> 2. A lot of attributes are stored in objects' __dict__s.
>
> 3. Most attributes aren't shaded by descriptors/getters-setters; most
> code just uses "self.attr".
>
> 4. An average method/function works on objects of the same type. Which
> means that those objects were constructed in a very similar (if not
> exact) fashion.
>
> For instance:
>
> class F:
> def __init__(self, name):
> self.name = name
> def say(self):
> print(self.name)   # <- For all F instances,
># offset of 'name' in `F().__dict__`s
># will be the same
>
> If LOAD_ATTR gets too many cache misses (20 in my current patch) it gets
> deoptimized, and the default implementation is used.  So if the code is
> very dynamic - there's no improvement, but no performance penalty either.
>
> In my patch, I use the cache to store (for LOAD_ATTR specifically):
>
> - pointer to object's type
> - type->tp_version_tag
> - the last successful __dict__ offset
>
> The first two fields are used to make sure that we have objects of the
> same type.  If it changes, we deoptimize the opcode immediately.  Then
> we try the offset.  If it's successful - we have a cache hit.  If not,
> that's fine, we'll try another few times before deoptimizing the opcode.
>

So this is a third "next step" that has its own issue?

-Brett


>
> >
> > What do you think?
> >
> >
> > It all looks great to me!
>
> Thanks!
>
> Yury
>
>


Re: [Python-Dev] Opcode cache in ceval loop

2016-02-01 Thread Yury Selivanov

Hi Brett,

On 2016-02-01 2:30 PM, Brett Cannon wrote:



On Mon, 1 Feb 2016 at 11:11 Yury Selivanov wrote:


Hi,


[..]


What's next?


First, I'd like to merge the new LOAD_METHOD opcode, see issue 26110
[1].  It's a very straightforward optimization, the patch is small and
easy to review.


+1 from me.


Second, I'd like to merge the new opcode cache, see issue 26219 [5].
All unittests pass.  Memory usage increase is very moderate (<1mb for
the entire test suite), and the performance increase is significant.
The only potential blocker for this is PEP 509 approval (which I'd be
happy to assist with).


I think the fact that it improves performance across the board as well 
as eliminates the various tricks people use to cache global and 
built-ins, a big +1 from me. I guess that means Victor needs to ask 
for pronouncement on PEP 509.


Great!  AFAIK Victor still needs to update the PEP with some changes 
(globally unique ma_version).  My patch includes the latest 
implementation of PEP 509, and it works fine (no regressions, no broken 
unittests).  I can also assist with reviewing Victor's implementation if 
the PEP is accepted.




BTW, where does LOAD_ATTR fit into all of this?


LOAD_ATTR optimization doesn't use any of PEP 509 new stuff (if I 
understand you question correctly).  It's based on the following 
assumptions (that really make JITs work so well):


1. Most classes don't implement __getattribute__.

2. A lot of attributes are stored in objects' __dict__s.

3. Most attributes aren't shadowed by descriptors/getters-setters; most 
code just uses "self.attr".


4. An average method/function works on objects of the same type. Which 
means that those objects were constructed in a very similar (if not 
exact) fashion.


For instance:

class F:
    def __init__(self, name):
        self.name = name
    def say(self):
        print(self.name)   # <- For all F instances,
                           # offset of 'name' in `F().__dict__`s
                           # will be the same

If LOAD_ATTR gets too many cache misses (20 in my current patch) it gets 
deoptimized, and the default implementation is used.  So if the code is 
very dynamic - there's no improvement, but no performance penalty either.


In my patch, I use the cache to store (for LOAD_ATTR specifically):

- pointer to object's type
- type->tp_version_tag
- the last successful __dict__ offset

The first two fields are used to make sure that we have objects of the 
same type.  If it changes, we deoptimize the opcode immediately.  Then 
we try the offset.  If it's successful - we have a cache hit.  If not, 
that's fine, we'll try another few times before deoptimizing the opcode.
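
In very rough Python terms, the check works like this (purely illustrative:
the real entry is a C struct, and the __dict__-offset probe is reduced here
to a flag passed in):

class AttrCacheEntry:
    def __init__(self, tp, tp_version, dict_offset):
        self.tp = tp                    # expected type of the object
        self.tp_version = tp_version    # snapshot of type->tp_version_tag
        self.dict_offset = dict_offset  # last successful __dict__ offset
        self.misses = 0

def probe(entry, obj, current_tp_version, found_at_offset, max_misses=20):
    """Return 'hit', 'miss' or 'deoptimize' for one LOAD_ATTR execution."""
    if type(obj) is not entry.tp or current_tp_version != entry.tp_version:
        return 'deoptimize'             # different/modified type: give up now
    if found_at_offset:                 # the cached __dict__ offset still works
        return 'hit'
    entry.misses += 1                   # wrong offset: tolerate a few misses
    return 'deoptimize' if entry.misses > max_misses else 'miss'

class F:
    def __init__(self, name):
        self.name = name

entry = AttrCacheEntry(F, 1, 0)
print(probe(entry, F('a'), 1, True))    # 'hit'
print(probe(entry, F('b'), 2, True))    # 'deoptimize'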




What do you think?


It all looks great to me!


Thanks!

Yury



Re: [Python-Dev] Opcode cache in ceval loop

2016-02-01 Thread Andrew Barnert via Python-Dev
Looking over the thread and the two issues, you've got good arguments for why 
the improved code will be the most common code, and good benchmarks for various 
kinds of real-life code, but it doesn't seem like you've tried to stress it on 
anything that could be made worse. From your explanations and your code, I 
wouldn't expect that @classmethods, functions stored in the object dict or 
generated by __getattr__, non-function callables as methods, etc. would go 
significantly slower, or code that mixes @properties or __getattr__ proxy 
attributes with real attributes, or uses __slots__, or code that does 
frequently write to a global, etc. But it would be nice to _know_ that they 
don't instead of just expecting it.

Sent from my iPhone

> On Feb 1, 2016, at 11:10, Yury Selivanov  wrote:
> 
> Hi,
> 
> This is the second email thread I start regarding implementing an opcode 
> cache in ceval loop.  Since my first post on this topic:
> 
> - I've implemented another optimization (LOAD_ATTR);
> 
> - I've added detailed statistics mode so that I can "see" how the cache 
> performs and tune it;
> 
> - some macro benchmarks are now 10-20% faster; 2to3 (a real application) is 
> 7-8% faster;
> 
> - and I have some good insights on the memory footprint.
> 
> ** The purpose of this email is to get a general approval from python-dev, so 
> that I can start polishing the patches and getting them reviewed/committed. **
> 
> 
> Summary of optimizations
> 
> 
> When a code object is executed more than ~1000 times, it's considered "hot".  
> It gets its opcodes analyzed to initialize caches for LOAD_METHOD (a new 
> opcode I propose to add in [1]), LOAD_ATTR, and LOAD_GLOBAL.
> 
> It's important to only optimize code objects that were executed "enough" 
> times, to avoid optimizing code objects for modules, classes, and functions 
> that were imported but never used.
> 
> The cache struct is defined in code.h [2], and is 32 bytes long. When a code 
> object becomes hot, it gets an cache offset table allocated for it (+1 byte 
> for each opcode) + an array of cache structs.
> 
> To measure the max/average memory impact, I tuned my code to optimize *every* 
> code object on *first* run.  Then I ran the entire Python test suite.  Python 
> test suite + standard library both contain around 72395 code objects, which 
> required 20Mb of memory for caches.  The test process consumed around 400Mb 
> of memory.  Thus, the absolute worst case scenario, the overhead is about 5%.
> 
> Then I ran the test suite without any modifications to the patch. This means 
> that only code objects that are called frequently enough are optimized.  In 
> this mode, only 2072 code objects were optimized, using less than 1 MB of 
> memory for the cache.
> 
> 
> LOAD_ATTR
> -
> 
> Damien George mentioned that they optimize a lot of dict lookups in 
> MicroPython by memorizing last key/value offset in the dict object, thus 
> eliminating lots of hash lookups.  I've implemented this optimization in my 
> patch.  The results are quite good.  A simple micro-benchmark [3] shows ~30% 
> speed improvement.  Here are some debug stats generated by 2to3 benchmark:
> 
> -- Opcode cache LOAD_ATTR hits = 14778415 (83%)
> -- Opcode cache LOAD_ATTR misses   = 750 (0%)
> -- Opcode cache LOAD_ATTR opts = 282
> -- Opcode cache LOAD_ATTR deopts   = 60
> -- Opcode cache LOAD_ATTR total= 1912
> 
> Each "hit" makes LOAD_ATTR about 30% faster.
> 
> 
> LOAD_GLOBAL
> ---
> 
> This turned out to be a very stable optimization.  Here is the debug output 
> of the 2to3 test:
> 
> -- Opcode cache LOAD_GLOBAL hits   = 3940647 (100%)
> -- Opcode cache LOAD_GLOBAL misses = 0 (0%)
> -- Opcode cache LOAD_GLOBAL opts   = 252
> 
> All benchmarks (and real code) have stats like that.  Globals and builtins 
> are very rarely modified, so the cache works really well.  With LOAD_GLOBAL 
> opcode cache, global lookup is very cheap, there is no hash lookup for it at 
> all.  It makes optimizations like "def foo(len=len)" obsolete.
> 
> 
> LOAD_METHOD
> ---
> 
> This is a new opcode I propose to add in [1].  The idea is to substitute 
> LOAD_ATTR with it, and avoid instantiation of BoundMethod objects.
> 
> With the cache, we can store a reference to the method descriptor (I use 
> type->tp_version_tag for cache invalidation, the same thing _PyType_Lookup is 
> built around).
> 
> The cache makes LOAD_METHOD really efficient.  A simple micro-benchmark like 
> [4], shows that with the cache and LOAD_METHOD, "s.startswith('abc')" becomes 
> as efficient as "s[:3] == 'abc'".
> 
> LOAD_METHOD/CALL_FUNCTION without cache is about 20% faster than 
> LOAD_ATTR/CALL_FUNCTION.  With the cache, it's about 30% faster.
> 
> Here's the debug output of the 2to3 benchmark:
> 
> -- Opcode cache LOAD_METHOD hits   = 5164848 (64%)
> -- Opcode cache LOAD_METHOD misses = 12 (0%)
> -- Opcode cache LOAD_METHOD opts   = 94
> -- Opcode cache LOAD_METHOD d

Re: [Python-Dev] Opcode cache in ceval loop

2016-02-01 Thread Yury Selivanov



On 2016-02-01 4:02 PM, Sven R. Kunze wrote:

On 01.02.2016 21:35, Yury Selivanov wrote:
It's important to understand that if we have a lot of cache misses 
after the code object was executed 1000 times, it doesn't make sense 
to keep trying to update that cache.  It just means that the code, in 
that particular point, works with different kinds of objects.


So, the assumption is that the code makes the difference here not 
time. That could be true for production code.


FWIW, I experimented with different ideas (one is to never 
de-optimize), and the current strategy works best on the vast number 
of benchmarks.


Nice.

Regarding the magic constants (1000, 20) what is the process of 
updating them?


Right now they are private constants in ceval.c.

I will (maybe) expose a private API via the _testcapi module to 
re-define them (set them to 1 or 0), only to write better unittests.  I 
have no plans to make those constants public or have a public API to 
tackle them.  IMHO, this is something that almost nobody will ever use.


Yury


Re: [Python-Dev] Opcode cache in ceval loop

2016-02-01 Thread Yury Selivanov

Hi Damien,

On 2016-02-01 3:59 PM, Damien George wrote:

Hi Yury,

That's great news about the speed improvements with the dict offset cache!


The cache struct is defined in code.h [2], and is 32 bytes long. When a
code object becomes hot, it gets a cache offset table allocated for it
(+1 byte for each opcode) + an array of cache structs.

Ok, so each opcode has a 1-byte cache that sits separately to the
actual bytecode.  But a lot of opcodes don't use it so that leads to
some wasted memory, correct?


Each code object has a list of opcodes and their arguments
(bytes object == unsigned char array).

"Hot" code objects have an offset table (unsigned chars), and
a cache entries array (hope your email client will display
the following correctly):

   opcodes       offset    cache entries
                 table

   OPCODE        0         cache for 1st LOAD_ATTR
   ARG1          0         cache for 1st LOAD_GLOBAL
   ARG2          0         cache for 2nd LOAD_ATTR
   OPCODE        0         cache for 1st LOAD_METHOD
   LOAD_ATTR     1         ...
   ARG1          0
   ARG2          0
   OPCODE        0
   LOAD_GLOBAL   2
   ARG1          0
   ARG2          0
   LOAD_ATTR     3
   ARG1          0
   ARG2          0
   ...           ...
   LOAD_METHOD   4
   ...           ...

When, say, a LOAD_ATTR opcode executes, it first checks if the
code object has a non-NULL cache-entries table.

If it has, that LOAD_ATTR then uses the offset table (indexing
with its `INSTR_OFFSET()`) to find its position in
cache-entries.



But then how do you index the cache, do you keep a count of the
current opcode number?  If I remember correctly, CPython has some
opcodes taking 1 byte, and some taking 3 bytes, so the offset into the
bytecode cannot be easily mapped to a bytecode number.


First, when a code object is created, it doesn't have
an offset table and cache entries (those are set to NULL).

Each code object has a new field to count how many times
it was called.  Each time a code object is called with
PyEval_EvalFrameEx, that field is incremented.

Once a code object is called more than 1024 times we:

1. allocate memory for its offset table

2. iterate through its opcodes and count how many
LOAD_ATTR, LOAD_METHOD and LOAD_GLOBAL opcodes it has;

3. As part of (2) we initialize the offset-table with
correct mapping.  Some opcodes will have a non-zero
entry in the offset-table, some won't.  Opcode args
will always have zeros in the offset tables.

4. Then we allocate cache-entries table.
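
In Python-ish pseudocode, the call-counting part is roughly (threshold and
names are illustrative):

HOT_THRESHOLD = 1024

class Code:
    def __init__(self, n_opcode_bytes, n_cacheable_opcodes):
        self.calls = 0
        self.offset_table = None
        self.cache_entries = None
        self._n_bytes = n_opcode_bytes
        self._n_cacheable = n_cacheable_opcodes

    def on_call(self):
        self.calls += 1
        if self.calls > HOT_THRESHOLD and self.offset_table is None:
            # steps 1-4 above: allocate and fill the tables
            self.offset_table = [0] * self._n_bytes
            self.cache_entries = [None] * self._n_cacheable

code = Code(n_opcode_bytes=30, n_cacheable_opcodes=5)
for _ in range(HOT_THRESHOLD + 1):
    code.on_call()
print(code.offset_table is not None)   # True -- the code object is now "hot"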

Yury


Re: [Python-Dev] Opcode cache in ceval loop

2016-02-01 Thread Sven R. Kunze

On 01.02.2016 21:35, Yury Selivanov wrote:
It's important to understand that if we have a lot of cache misses 
after the code object was executed 1000 times, it doesn't make sense 
to keep trying to update that cache.  It just means that the code, at 
that particular point, works with different kinds of objects.


So, the assumption is that the code makes the difference here, not time. 
That could be true for production code.


FWIW, I experimented with different ideas (one is to never 
de-optimize), and the current strategy works best on the vast majority 
of benchmarks.


Nice.

Regarding the magic constants (1000, 20), what is the process for updating 
them?



Best,
Sven
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Opcode cache in ceval loop

2016-02-01 Thread Sven R. Kunze

On 01.02.2016 20:51, Yury Selivanov wrote:
If LOAD_ATTR gets too many cache misses (20 in my current patch), it 
gets deoptimized, and the default implementation is used.  So if the 
code is very dynamic, there's no improvement, but no performance 
penalty either.


Will you re-try optimizing it?

___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Opcode cache in ceval loop

2016-02-01 Thread Brett Cannon
On Mon, 1 Feb 2016 at 12:16 Yury Selivanov  wrote:

> Brett,
>
> On 2016-02-01 3:08 PM, Brett Cannon wrote:
> >
> >
> > On Mon, 1 Feb 2016 at 11:51 Yury Selivanov wrote:
> >
> > Hi Brett,
> >
> [..]
> >
> >
> > The first two fields are used to make sure that we have objects of
> the
> > same type.  If it changes, we deoptimize the opcode immediately.
> Then
> > we try the offset.  If it's successful - we have a cache hit.  If
> not,
> > that's fine, we'll try another few times before deoptimizing the
> > opcode.
> >
> >
> > So this is a third "next step" that has its own issue?
>
> It's all in issue http://bugs.python.org/issue26219 right now.
>
> My current plan is to implement LOAD_METHOD/CALL_METHOD (just opcodes,
> no cache) in 26110.
>
> Then implement caching for LOAD_METHOD, LOAD_GLOBAL, and LOAD_ATTR in
> 26219.  I'm flexible to break 26219 down into three separate issues if
> that helps the review process (but that would take more of my time):
>
> - implement support for opcode caching (general infrastructure) +
> LOAD_GLOBAL optimization
> - LOAD_METHOD optimization
> - LOAD_ATTR optimization
>

I personally don't care how you break it down, just trying to keep all the
moving pieces in my head. :)

Anyway, it sounds like PEP 509 is blocking part of it, but the LOAD_METHOD
stuff can go in as-is. So are you truly blocked only on getting the latest
version of that patch up to http://bugs.python.org/issue26110 and getting a
code review?
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Opcode cache in ceval loop

2016-02-01 Thread Brett Cannon
On Mon, 1 Feb 2016 at 11:11 Yury Selivanov  wrote:

> Hi,
>
> This is the second email thread I start regarding implementing an opcode
> cache in ceval loop.  Since my first post on this topic:
>
> - I've implemented another optimization (LOAD_ATTR);
>
> - I've added detailed statistics mode so that I can "see" how the cache
> performs and tune it;
>
> - some macro benchmarks are now 10-20% faster; 2to3 (a real application)
> is 7-8% faster;
>
> - and I have some good insights on the memory footprint.
>
> ** The purpose of this email is to get a general approval from
> python-dev, so that I can start polishing the patches and getting them
> reviewed/committed. **
>
>
> Summary of optimizations
> 
>
> When a code object is executed more than ~1000 times, it's considered
> "hot".  It gets its opcodes analyzed to initialize caches for
> LOAD_METHOD (a new opcode I propose to add in [1]), LOAD_ATTR, and
> LOAD_GLOBAL.
>
> It's important to only optimize code objects that were executed "enough"
> times, to avoid optimizing code objects for modules, classes, and
> functions that were imported but never used.
>
> The cache struct is defined in code.h [2], and is 32 bytes long. When a
> code object becomes hot, it gets a cache offset table allocated for it
> (+1 byte for each opcode) + an array of cache structs.
>
> To measure the max/average memory impact, I tuned my code to optimize
> *every* code object on *first* run.  Then I ran the entire Python test
> suite.  Python test suite + standard library both contain around 72395
> code objects, which required 20Mb of memory for caches.  The test
> process consumed around 400Mb of memory.  Thus, in the absolute worst-case
> scenario, the overhead is about 5%.
>
> Then I ran the test suite without any modifications to the patch. This
> means that only code objects that are called frequently enough are
> optimized.  In this mode, only 2072 code objects were optimized, using
> less than 1Mb of memory for the cache.
>
>
> LOAD_ATTR
> -
>
> Damien George mentioned that they optimize a lot of dict lookups in
> MicroPython by memorizing the last key/value offset in the dict object, thus
> eliminating lots of hash lookups.  I've implemented this optimization in
> my patch.  The results are quite good.  A simple micro-benchmark [3]
> shows ~30% speed improvement.  Here are some debug stats generated by
> the 2to3 benchmark:
>
> -- Opcode cache LOAD_ATTR hits = 14778415 (83%)
> -- Opcode cache LOAD_ATTR misses   = 750 (0%)
> -- Opcode cache LOAD_ATTR opts = 282
> -- Opcode cache LOAD_ATTR deopts   = 60
> -- Opcode cache LOAD_ATTR total= 1912
>
> Each "hit" makes LOAD_ATTR about 30% faster.
>
>
> LOAD_GLOBAL
> ---
>
> This turned out to be a very stable optimization.  Here is the debug
> output of the 2to3 test:
>
> -- Opcode cache LOAD_GLOBAL hits   = 3940647 (100%)
> -- Opcode cache LOAD_GLOBAL misses = 0 (0%)
> -- Opcode cache LOAD_GLOBAL opts   = 252
>
> All benchmarks (and real code) have stats like that.  Globals and
> builtins are very rarely modified, so the cache works really well.  With
> the LOAD_GLOBAL opcode cache, global lookup is very cheap; there is no hash
> lookup for it at all.  It makes optimizations like "def foo(len=len)"
> obsolete.
>
>
> LOAD_METHOD
> ---
>
> This is a new opcode I propose to add in [1].  The idea is to substitute
> LOAD_ATTR with it, and avoid instantiation of BoundMethod objects.
>
> With the cache, we can store a reference to the method descriptor (I use
> type->tp_version_tag for cache invalidation, the same thing
> _PyType_Lookup is built around).
>
> The cache makes LOAD_METHOD really efficient.  A simple micro-benchmark
> like [4] shows that with the cache and LOAD_METHOD,
> "s.startswith('abc')" becomes as efficient as "s[:3] == 'abc'".
>
> LOAD_METHOD/CALL_FUNCTION without cache is about 20% faster than
> LOAD_ATTR/CALL_FUNCTION.  With the cache, it's about 30% faster.
>
> Here's the debug output of the 2to3 benchmark:
>
> -- Opcode cache LOAD_METHOD hits   = 5164848 (64%)
> -- Opcode cache LOAD_METHOD misses = 12 (0%)
> -- Opcode cache LOAD_METHOD opts   = 94
> -- Opcode cache LOAD_METHOD deopts = 12
> -- Opcode cache LOAD_METHOD dct-chk= 1614801
> -- Opcode cache LOAD_METHOD total  = 7945954
>
>
> What's next?
> 
>
> First, I'd like to merge the new LOAD_METHOD opcode, see issue 26110
> [1].  It's a very straightforward optimization, the patch is small and
> easy to review.


+1 from me.


>
> Second, I'd like to merge the new opcode cache, see issue 26219 [5].
> All unittests pass.  Memory usage increase is very moderate (<1Mb for
> the entire test suite), and the performance increase is significant.
> The only potential blocker for this is PEP 509 approval (which I'd be
> happy to assist with).
>

I think the fact that it improves performance across the board, as well as
eliminating the various tricks people use to cache globals and built-ins,
earns a big +1 from me. I g

[Python-Dev] Opcode cache in ceval loop

2016-02-01 Thread Yury Selivanov

Hi,

This is the second email thread I start regarding implementing an opcode 
cache in ceval loop.  Since my first post on this topic:


- I've implemented another optimization (LOAD_ATTR);

- I've added detailed statistics mode so that I can "see" how the cache 
performs and tune it;


- some macro benchmarks are now 10-20% faster; 2to3 (a real application) 
is 7-8% faster;


- and I have some good insights on the memory footprint.

** The purpose of this email is to get a general approval from 
python-dev, so that I can start polishing the patches and getting them 
reviewed/committed. **



Summary of optimizations


When a code object is executed more than ~1000 times, it's considered 
"hot".  It gets its opcodes analyzed to initialize caches for 
LOAD_METHOD (a new opcode I propose to add in [1]), LOAD_ATTR, and 
LOAD_GLOBAL.


It's important to only optimize code objects that were executed "enough" 
times, to avoid optimizing code objects for modules, classes, and 
functions that were imported but never used.


The cache struct is defined in code.h [2], and is 32 bytes long. When a 
code object becomes hot, it gets a cache offset table allocated for it 
(+1 byte for each opcode) + an array of cache structs.
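
To make "32 bytes" concrete, here is one way such a fixed-size entry could
be laid out as a union over the three specializations.  This is an
assumption for illustration only, not the actual layout in code.h:

    #include <stdint.h>

    /* Each variant is 32 bytes on a typical 64-bit build. */
    typedef struct {
        uint64_t type_tag, dict_tag;
        void    *value;
        uint64_t extra;
    } attr_entry;                       /* 8+8+8+8 = 32 bytes */

    typedef struct {
        uint64_t globals_tag, builtins_tag;
        void    *value;
        uint64_t pad;
    } global_entry;                     /* 32 bytes */

    typedef struct {
        uint64_t type_tag;
        void    *descr;
        uint64_t pad[2];
    } method_entry;                     /* 32 bytes */

    typedef union {
        attr_entry   attr;              /* LOAD_ATTR   */
        global_entry global;            /* LOAD_GLOBAL */
        method_entry method;            /* LOAD_METHOD */
    } cache_entry;                      /* sizeof(cache_entry) == 32 here */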


To measure the max/average memory impact, I tuned my code to optimize 
*every* code object on *first* run.  Then I ran the entire Python test 
suite.  Python test suite + standard library both contain around 72395 
code objects, which required 20Mb of memory for caches.  The test 
process consumed around 400Mb of memory.  Thus, in the absolute worst-case 
scenario, the overhead is about 5%.


Then I ran the test suite without any modifications to the patch. This 
means that only code objects that are called frequently enough are 
optimized.  In this mode, only 2072 code objects were optimized, using 
less than 1Mb of memory for the cache.



LOAD_ATTR
-

Damien George mentioned that they optimize a lot of dict lookups in 
MicroPython by memorizing the last key/value offset in the dict object, thus 
eliminating lots of hash lookups.  I've implemented this optimization in 
my patch.  The results are quite good.  A simple micro-benchmark [3] 
shows ~30% speed improvement.  Here are some debug stats generated by 
the 2to3 benchmark:


-- Opcode cache LOAD_ATTR hits = 14778415 (83%)
-- Opcode cache LOAD_ATTR misses   = 750 (0%)
-- Opcode cache LOAD_ATTR opts = 282
-- Opcode cache LOAD_ATTR deopts   = 60
-- Opcode cache LOAD_ATTR total= 1912

Each "hit" makes LOAD_ATTR about 30% faster.
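
A small standalone model of that idea, as described above: remember where
the key was found last time, re-check cheaply, and only fall back to the
full lookup on a miss.  The guard fields and the toy dict are assumptions,
not the patch's actual data structures:

    #include <stddef.h>
    #include <string.h>

    typedef struct { const char *key; void *value; } slot;
    typedef struct { slot *slots; size_t nslots; unsigned long version; } toy_dict;

    typedef struct {
        unsigned long dict_version;   /* guard: bumped whenever the dict mutates */
        size_t        slot_index;     /* remembered position of the key          */
    } attr_cache;

    static void *
    lookup_attr(toy_dict *d, const char *key, attr_cache *c)
    {
        /* Hit: the dict is unchanged and the remembered slot still holds the
           (interned) key, so hashing and probing are skipped entirely. */
        if (c->dict_version == d->version &&
            c->slot_index < d->nslots &&
            d->slots[c->slot_index].key == key)
            return d->slots[c->slot_index].value;

        /* Miss: fall back to the normal lookup (a linear scan stands in for
           the real hash lookup) and re-prime the cache. */
        for (size_t i = 0; i < d->nslots; i++) {
            if (strcmp(d->slots[i].key, key) == 0) {
                c->dict_version = d->version;
                c->slot_index   = i;
                return d->slots[i].value;
            }
        }
        return NULL;
    }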


LOAD_GLOBAL
---

This turned out to be a very stable optimization.  Here is the debug 
output of the 2to3 test:


-- Opcode cache LOAD_GLOBAL hits   = 3940647 (100%)
-- Opcode cache LOAD_GLOBAL misses = 0 (0%)
-- Opcode cache LOAD_GLOBAL opts   = 252

All benchmarks (and real code) have stats like that.  Globals and 
builtins are very rarely modified, so the cache works really well.  With 
the LOAD_GLOBAL opcode cache, global lookup is very cheap; there is no hash 
lookup for it at all.  It makes optimizations like "def foo(len=len)" 
obsolete.
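
A sketch of why a hit needs no hash lookup at all: the cached value is
guarded by PEP 509-style version counters for the globals and builtins
dicts, and is returned directly while both counters are unchanged (field
and function names here are assumptions):

    #include <stddef.h>

    typedef struct {
        unsigned long globals_version;    /* version of f_globals at cache time  */
        unsigned long builtins_version;   /* version of f_builtins at cache time */
        void         *value;              /* the resolved global or builtin      */
    } global_cache;

    /* Returns the cached object on a hit, NULL on a miss (the caller then
       performs the ordinary two-dict lookup and refills the cache). */
    static void *
    load_global_cached(global_cache *c,
                       unsigned long globals_now,
                       unsigned long builtins_now)
    {
        if (c->value != NULL &&
            c->globals_version == globals_now &&
            c->builtins_version == builtins_now)
            return c->value;              /* hit: no hash lookup at all */
        return NULL;
    }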



LOAD_METHOD
---

This is a new opcode I propose to add in [1].  The idea is to substitute 
LOAD_ATTR with it, and avoid instantiation of BoundMethod objects.


With the cache, we can store a reference to the method descriptor (I use 
type->tp_version_tag for cache invalidation, the same thing 
_PyType_Lookup is built around).
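
A sketch of that guard: remember the descriptor together with the type's
version tag, and treat the entry as valid only while the tag is unchanged.
Apart from the tp_version_tag concept itself, the names below are assumed
for illustration:

    #include <stddef.h>

    typedef struct {
        unsigned long type_version;   /* tp_version_tag when the entry was filled     */
        void         *descriptor;     /* the function/descriptor found by type lookup */
    } method_cache;

    static void *
    cached_method(method_cache *c, unsigned long type_version_now)
    {
        if (c->descriptor != NULL && c->type_version == type_version_now)
            return c->descriptor;  /* hit: skip the lookup and the bound-method object */
        return NULL;               /* miss: fall back to the regular LOAD_ATTR path    */
    }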


The cache makes LOAD_METHOD really efficient.  A simple micro-benchmark 
like [4] shows that with the cache and LOAD_METHOD, 
"s.startswith('abc')" becomes as efficient as "s[:3] == 'abc'".


LOAD_METHOD/CALL_FUNCTION without cache is about 20% faster than 
LOAD_ATTR/CALL_FUNCTION.  With the cache, it's about 30% faster.


Here's the debug output of the 2to3 benchmark:

-- Opcode cache LOAD_METHOD hits   = 5164848 (64%)
-- Opcode cache LOAD_METHOD misses = 12 (0%)
-- Opcode cache LOAD_METHOD opts   = 94
-- Opcode cache LOAD_METHOD deopts = 12
-- Opcode cache LOAD_METHOD dct-chk= 1614801
-- Opcode cache LOAD_METHOD total  = 7945954


What's next?


First, I'd like to merge the new LOAD_METHOD opcode, see issue 26110 
[1].  It's a very straightforward optimization, the patch is small and 
easy to review.


Second, I'd like to merge the new opcode cache, see issue 26219 [5].  
All unittests pass.  Memory usage increase is very moderate (<1Mb for 
the entire test suite), and the performance increase is significant.  
The only potential blocker for this is PEP 509 approval (which I'd be 
happy to assist with).


What do you think?

Thanks,
Yury


[1] http://bugs.python.org/issue26110
[2] https://github.com/1st1/cpython/blob/opcache5/Include/code.h#L10
[3] https://gist.github.com/1st1/37d928f1e84813bf1c44
[4] https://gist.github.com/1st1/10588e6e11c4d7c19445
[5] http://bugs.python.org/issue26219

___
Python-Dev mailing list
Python-Dev@python.org
http