Re: [Python-Dev] Python 3.6 dict becomes compact and gets a private version; and keywords become ordered

2016-09-15 Thread Victor Stinner
2016-09-15 8:31 GMT+02:00 Serhiy Storchaka :
> Note that this comes at the expense of a 20% slowdown in iteration.
>
> $ ./python -m timeit -s "d = dict.fromkeys(range(10**6))" -- "list(d)"
> Python 3.5: 66.1 msec per loop
> Python 3.6: 82.5 msec per loop
>
> Fortunately the cost of the lookup (the most critical operation for dicts)
> seems to stay the same.
>
> But this can be an argument against using this technique in sets.

My small benchmarks on dict memory usage and dict lookup:


http://bugs.python.org/issue27350#msg275581

It seems like the memory usage is between 20% and 25% smaller. Great job!

Memory usage, Python 3.5 => Python 3.6 on Linux x86_64:

./python -c 'import sys; print(sys.getsizeof({str(i):i for i in range(10)}))'

* 10 items: 480 B => 384 B (-20%)
* 100 items: 6240 B => 4720 B (-24%)
* 1000 items: 49248 B => 36984 B (-25%)

Note: the size is the size of the container itself, not of the keys or values.
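For reference, the measurement above can be reproduced for several sizes in one go (exact byte counts vary by Python version and platform):

```python
import sys

# sys.getsizeof reports only the size of the dict container itself,
# not the keys or values it references.
for n in (10, 100, 1000):
    d = {str(i): i for i in range(n)}
    print(n, sys.getsizeof(d), "bytes")
```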



http://bugs.python.org/issue27350#msg275587

As I expected, a dictionary lookup is a _little bit_ slower (3%)
between Python 3.5 and Python 3.6:

$ ./python -m perf timeit -s 'd={str(i):i for i in range(100)}'
'd["10"]; d["20"]; d["30"]; d["40"]; d["50"]; d["10"]; d["20"];
d["30"]; d["40"]; d["50"]' --rigorous

Median +- std dev: [lookup35] 309 ns +- 10 ns -> [lookup36] 320 ns +-
8 ns: 1.03x slower


Victor
___
Python-Dev mailing list
[email protected]
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python 3.6 dict becomes compact and gets a private version; and keywords become ordered

2016-09-15 Thread INADA Naoki
>
>
> Note that this is made at the expense of the 20% slowing down an iteration.
>
> $ ./python -m timeit -s "d = dict.fromkeys(range(10**6))" -- "list(d)"
> Python 3.5: 66.1 msec per loop
> Python 3.6: 82.5 msec per loop
>
>
Are the two Pythons built with the same options?

In my environment:

~/local/python-master/bin/python3 -m timeit -s "d =
dict.fromkeys(range(10**6))" 'list(d)'
Python master (8cd9c) 100 loops, best of 3: 11 msec per loop
Python 3.5.2 100 loops, best of 3: 11.6 msec per loop

And dict creation time is:

~/local/python-master/bin/python3 -m timeit "d =
dict.fromkeys(range(10**6))"
Python master  10 loops, best of 3: 70.1 msec per loop
Python 3.5.2  10 loops, best of 3: 78.2 msec per loop

Both Pythons are built without `--with-optimizations` or `make
profile-opt`.


Re: [Python-Dev] Python 3.6 dict becomes compact and gets a private version; and keywords become ordered

2016-09-15 Thread Paul Moore
On 15 September 2016 at 07:31, Serhiy Storchaka  wrote:
> Note that this is made at the expense of the 20% slowing down an iteration.
>
> $ ./python -m timeit -s "d = dict.fromkeys(range(10**6))" -- "list(d)"
> Python 3.5: 66.1 msec per loop
> Python 3.6: 82.5 msec per loop

On my Windows 7 PC with 3.5.2 and 3.6.0b1 installed from the standard
python.org builds:

>py -3.5 -m timeit -s "d = dict.fromkeys(range(10**6))" -- "list(d)"
10 loops, best of 3: 21.7 msec per loop
>py -3.6 -m timeit -s "d = dict.fromkeys(range(10**6))" -- "list(d)"
100 loops, best of 3: 19.6 msec per loop

So 3.6 is faster for me.

Paul


Re: [Python-Dev] Python 3.6 dict becomes compact and gets a private version; and keywords become ordered

2016-09-15 Thread Victor Stinner
2016-09-15 10:02 GMT+02:00 INADA Naoki :
> In my environ:
>
> ~/local/python-master/bin/python3 -m timeit -s "d =
> dict.fromkeys(range(10**6))" 'list(d)'

Stop! Please stop using timeit, it's lying!

* You must not use the minimum but average or median
* You must run a microbenchmark in multiple processes to test
different randomized hash functions and different memory layouts

In short: you should use my perf module.
http://perf.readthedocs.io/en/latest/cli.html#timeit

The memory layout and the hash function have a major impact on such
microbenchmarks:
https://haypo.github.io/journey-to-stable-benchmark-average.html
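Victor's first point, preferring the median (or mean) over the minimum, can be illustrated with some hypothetical timing samples: the minimum is an optimistic outlier, while the median reflects a typical run.

```python
import statistics

# Hypothetical timing samples (ms) from repeated runs of a microbenchmark.
samples = [66.1, 67.0, 66.5, 82.3, 66.8, 70.2, 66.3, 90.1, 67.4, 66.9]

print("min:   ", min(samples))                    # optimistic outlier
print("median:", statistics.median(samples))      # robust typical value
print("mean:  ", round(statistics.mean(samples), 1))
```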


> Both Pythons are built without `--with-optimizations` or `make
> profile-opt`.

That's bad :-) For most reliable benchmarks, it's better to use
LTO+PGO compilation.

Victor


Re: [Python-Dev] Python 3.6 dict becomes compact and gets a private version; and keywords become ordered

2016-09-15 Thread INADA Naoki
On Thu, Sep 15, 2016 at 5:57 PM Victor Stinner 
wrote:

> 2016-09-15 10:02 GMT+02:00 INADA Naoki :
> > In my environ:
> >
> > ~/local/python-master/bin/python3 -m timeit -s "d =
> > dict.fromkeys(range(10**6))" 'list(d)'
>
> Stop! Please stop using timeit, it's lying!
>
> * You must not use the minimum but average or median
> * You must run a microbenchmark in multiple processes to test
> different randomized hash functions and different memory layouts
>
> In short: you should use my perf module.
> http://perf.readthedocs.io/en/latest/cli.html#timeit
>
>
I'm sorry.  Changing habits is a bit difficult. I'll use it next time.

I ran the microbenchmark 3-5 times and confirmed the result was stable
before posting it.
And when the difference is smaller than 10%, I don't believe the result.


> The memory layout and the hash function have a major impact on such
> microbenchmarks:
> https://haypo.github.io/journey-to-stable-benchmark-average.html
>
>
In this microbenchmark, hash randomization is not important, because the
dict keys are ints.
(It also means iterating the dict didn't cause random memory access in the
old dict implementation either.)



>
> > Both Pythons are built without `--with-optimizations` or `make
> > profile-opt`.
>
> That's bad :-) For most reliable benchmarks, it's better to use
> LTO+PGO compilation.
>

LTO+PGO may make the performance of `git pull && make` unstable.
A PGO clean build takes too long for such a quick benchmark,
so I don't want to use PGO for quick benchmarks like this.
And Python doesn't provide a way to use LTO without PGO.


Re: [Python-Dev] Python 3.6 dict becomes compact and gets a private version; and keywords become ordered

2016-09-15 Thread Paul Moore
On 15 September 2016 at 09:57, Victor Stinner  wrote:
> 2016-09-15 10:02 GMT+02:00 INADA Naoki :
>> In my environ:
>>
>> ~/local/python-master/bin/python3 -m timeit -s "d =
>> dict.fromkeys(range(10**6))" 'list(d)'
>
> Stop! Please stop using timeit, it's lying!
>
> * You must not use the minimum but average or median
> * You must run a microbenchmark in multiple processes to test
> different randomized hash functions and different memory layouts
>
> In short: you should use my perf module.
> http://perf.readthedocs.io/en/latest/cli.html#timeit

Made essentially no difference to the results I posted:

>py -3.5 -m perf timeit -s "d = dict.fromkeys(range(10**6))" -- "list(d)"

Median +- std dev: 21.4 ms +- 0.7 ms
>py -3.6 -m perf timeit -s "d = dict.fromkeys(range(10**6))" -- "list(d)"

Median +- std dev: 20.0 ms +- 1.1 ms

3.6 remains faster, by very little (barely one standard deviation).

I would consider that the same result as timeit (to the level that
it's reasonable to assign any meaning to a microbenchmark).

Paul


Re: [Python-Dev] Python 3.6 dict becomes compact and gets a private version; and keywords become ordered

2016-09-15 Thread Antoine Pitrou
On Thu, 15 Sep 2016 10:57:07 +0200
Victor Stinner  wrote:
> 
> > Both Pythons are built without `--with-optimizations` or `make
> > profile-opt`.
> 
> That's bad :-) For most reliable benchmarks, it's better to use
> LTO+PGO compilation.

That sounds irrelevant. LTO+PGO improves performance, it does
nothing for benchmarking per se. That said, it's probably more useful
to benchmark an optimized Python build than an unoptimized one...

Regards

Antoine.




Re: [Python-Dev] Python 3.6 dict becomes compact and gets a private version; and keywords become ordered

2016-09-15 Thread Raymond Hettinger

> On Sep 14, 2016, at 11:31 PM, Serhiy Storchaka  wrote:
> 
> Note that this is made at the expense of the 20% slowing down an iteration.
> 
> $ ./python -m timeit -s "d = dict.fromkeys(range(10**6))" -- "list(d)"
> Python 3.5: 66.1 msec per loop
> Python 3.6: 82.5 msec per loop

A range of consecutive integers which have consecutive hash values is a really 
weak and non-representative basis for comparison.

Something like this will reveal the true and massive improvement in iteration 
speed:

 $ ./python.exe -m timeit -s "d=dict.fromkeys(map(str,range(10**6)))" 
"list(d)"

There are two reasons for the significant improvement in iteration speed:

1) The dense key table is smaller (no intervening NULL entries) so we do fewer 
total memory fetches to loop over the keys, values, or items.

2) The loop over the dense table no longer makes frequent, unpredictable tests 
for NULL entries.  (To better understand why this matters and how major the 
impact is, see http://stackoverflow.com/questions/11227809 ).

Your mileage will vary depending on the size of dictionary and whether the old 
dictionary would have densely packed the keys (as in Serhiy's 
non-representative example).


Raymond


P.S.  Algorithmically, the compact dict seems to be mostly where it needs to be 
(modulo some implementation bugs that are being ironed-out).  However, the code 
hasn't been tuned and polished as much as the old implementation, so there is 
still room for its timings to improve.  Dict copies should end-up being faster 
(fewer bytes copied and a predictable test for NULLs).  Resizes should be much 
faster (only the small index table needs to be updated, while the 
keys/values/hashes don't get moved).  In complex apps, the memory savings ought 
to translate into better cache performance (that doesn't show up much in tight 
benchmark loops but tends to make a difference in real code).


Re: [Python-Dev] Python 3.6 dict becomes compact and gets a private version; and keywords become ordered

2016-09-15 Thread Victor Stinner
2016-09-15 11:29 GMT+02:00 Antoine Pitrou :
> That sounds irrelevant. LTO+PGO improves performance, it does
> nothing for benchmarking per se.

In the past, I had bad surprises when running benchmarks without PGO:
https://haypo.github.io/journey-to-stable-benchmark-deadcode.html

I don't recall if ASLR was enabled or not. But I don't think that I
used multiple processes when I ran these benchmarks, because I hadn't
written the code yet :-)

I should probably redo the same benchmark using new shiny benchmarking
tools (which are expected to be more reliable and stable).

Victor


Re: [Python-Dev] Python 3.6 dict becomes compact and gets a private version; and keywords become ordered

2016-09-15 Thread Paul Moore
On 15 September 2016 at 10:43, Raymond Hettinger
 wrote:
> Something like this will reveal the true and massive improvement in iteration 
> speed:
>
>  $ ./python.exe -m timeit -s "d=dict.fromkeys(map(str,range(10**6)))" 
> "list(d)"

>py -3.5 -m timeit -s "d=dict.fromkeys(map(str,range(10**6)))" "list(d)"
10 loops, best of 3: 66.2 msec per loop
>py -3.6 -m timeit -s "d=dict.fromkeys(map(str,range(10**6)))" "list(d)"
10 loops, best of 3: 27.8 msec per loop

And for Victor:

>py -3.5 -m perf timeit -s "d=dict.fromkeys(map(str,range(10**6)))" "list(d)"

Median +- std dev: 65.7 ms +- 3.8 ms
>py -3.6 -m perf timeit -s "d=dict.fromkeys(map(str,range(10**6)))" "list(d)"

Median +- std dev: 27.9 ms +- 1.2 ms

Just as a side point, perf provided essentially identical results but
took 2 minutes as opposed to 8 seconds for timeit to do so. I
understand why perf is better, and I appreciate all the work Victor
did to create it, and analyze the results, but for getting a quick
impression of how a microbenchmark performs, I don't see timeit as
being *quite* as bad as Victor is claiming.

I will tend to use perf now that I have it installed, and now that I
know how to run a published timeit invocation using perf. It's a
really cool tool. But I certainly won't object to seeing people
publish timeit results (any more than I'd object to *any*
microbenchmark).

Paul


Re: [Python-Dev] Python 3.6 dict becomes compact and gets a private version; and keywords become ordered

2016-09-15 Thread Serhiy Storchaka

On 15.09.16 11:02, INADA Naoki wrote:

Are two Pythons built with same options?


Both are built from clean checkout with default options (hg update -C 
3.x; ./configure; make -s). The only difference is -std=c99 and 
additional warnings in 3.6:


Python 3.5:
gcc -pthread -c -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv 
-O3 -Wall -Wstrict-prototypes -Werror=declaration-after-statement 
-I. -I./Include -DPy_BUILD_CORE -o Objects/dictobject.o 
Objects/dictobject.c


Python 3.6:
gcc -pthread -c -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv 
-O3 -Wall -Wstrict-prototypes -std=c99 -Wextra -Wno-unused-result 
-Wno-unused-parameter -Wno-missing-field-initializers -I. -I./Include 
-DPy_BUILD_CORE -o Objects/dictobject.o Objects/dictobject.c


Usually I run a microbenchmark 3-5 times and choose the median. Results 
were stable enough (the variation is about 1%); it's unlikely the perf 
tool will give a significantly different result.


I repeated the measurements on a different computer; the difference is the same:

Python 3.5: 10 loops, best of 3: 33.5 msec per loop
Python 3.6: 10 loops, best of 3: 37.5 msec per loop

These results look surprising and inexplicable to me. I expected that 
even if there is some performance regression in the lookup or modifying 
operations, iteration should not be slower.


The CPUs on both computers work in 32-bit mode. Maybe this has an effect.


For string keys Python 3.6 is 4 times faster!

$ ./python -m timeit -s "d = dict.fromkeys(map(str, range(10**6)))" -- 
"list(d)"


On one computer:
Python 3.5: 10 loops, best of 3: 384 msec per loop
Python 3.6: 10 loops, best of 3: 94.6 msec per loop

On other computer:
Python 3.5: 10 loops, best of 3: 179 msec per loop
Python 3.6: 10 loops, best of 3: 46 msec per loop




Re: [Python-Dev] Python 3.6 dict becomes compact and gets a private version; and keywords become ordered

2016-09-15 Thread Serhiy Storchaka

On 15.09.16 11:57, Victor Stinner wrote:

Stop! Please stop using timeit, it's lying!

* You must not use the minimum but average or median
* You must run a microbenchmark in multiple processes to test
different randomized hash functions and different memory layouts

In short: you should use my perf module.
http://perf.readthedocs.io/en/latest/cli.html#timeit

The memory layout and the hash function have a major impact on such
microbenchmarks:
https://haypo.github.io/journey-to-stable-benchmark-average.html


$ ./python -m perf timeit -s "d = dict.fromkeys(range(10**6))" -- "list(d)"

Python 3.5: Median +- std dev: 65.1 ms +- 4.9 ms
Python 3.6: Median +- std dev: 79.4 ms +- 3.9 ms

Other computer:
Python 3.5: Median +- std dev: 33.6 ms +- 0.3 ms
Python 3.6: Median +- std dev: 37.5 ms +- 0.2 ms




Re: [Python-Dev] Python 3.6 dict becomes compact and gets a private version; and keywords become ordered

2016-09-15 Thread Serhiy Storchaka

On 15.09.16 12:43, Raymond Hettinger wrote:

On Sep 14, 2016, at 11:31 PM, Serhiy Storchaka  wrote:

Note that this is made at the expense of the 20% slowing down an iteration.

$ ./python -m timeit -s "d = dict.fromkeys(range(10**6))" -- "list(d)"
Python 3.5: 66.1 msec per loop
Python 3.6: 82.5 msec per loop


A range of consecutive integers which have consecutive hash values is a really 
weak and non-representative basis for comparison.


With randomized integers the result is even worse.

$ ./python -m timeit -s "import random; a = list(range(10**6)); 
random.seed(0); random.shuffle(a); d = dict.fromkeys(a)" -- "list(d)"


Python 3.5: 10 loops, best of 3: 33.6 msec per loop
Python 3.6: 10 loops, best of 3: 166 msec per loop




Re: [Python-Dev] Python 3.6 dict becomes compact and gets a private version; and keywords become ordered

2016-09-15 Thread Eric Snow
On Sep 15, 2016 06:06, "Serhiy Storchaka"  wrote:
> Python 3.5: 10 loops, best of 3: 33.5 msec per loop
> Python 3.6: 10 loops, best of 3: 37.5 msec per loop
>
> These results look surprising and inexplicable to me. I expected that
> even if there is some performance regression in the lookup or modifying
> operations, iteration should not be slower.

My understanding is that the all-int-keys case is an outlier.  This is due
to how ints hash, resulting in fewer collisions and a mostly
insertion-ordered hash table.  Consequently, I'd expect the above
microbenchmark to give roughly the same result between 3.5 and 3.6, which
it did.
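Eric's point about int hashing is easy to check: in CPython, small non-negative ints hash to themselves, so dict.fromkeys(range(n)) fills the table nearly collision-free and in insertion order, while string keys get randomized hashes and scatter across the table.

```python
# Small non-negative ints hash to themselves in CPython.
print(all(hash(i) == i for i in range(10**6)))   # True

# String hashes are randomized per process; int hashes are not.
print(hash("100"), hash(100))   # hash(100) is always 100
```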

-eric


Re: [Python-Dev] Python 3.6 dict becomes compact and gets a private version; and keywords become ordered

2016-09-15 Thread Antoine Pitrou
On Thu, 15 Sep 2016 07:08:50 -0600
Eric Snow  wrote:
> On Sep 15, 2016 06:06, "Serhiy Storchaka"  wrote:
> > Python 3.5: 10 loops, best of 3: 33.5 msec per loop
> > Python 3.6: 10 loops, best of 3: 37.5 msec per loop
> >
> > These results look surprising and inexplicable to me. I expected that
> > even if there is some performance regression in the lookup or modifying
> > operations, iteration should not be slower.
> 
> My understanding is that the all-int-keys case is an outlier.  This is due
> to how ints hash, resulting in fewer collisions and a mostly
> insertion-ordered hash table.  Consequently, I'd expect the above
> microbenchmark to give roughly the same result between 3.5 and 3.6, which
> it did.

Dict iteration shouldn't have any dependence on collisions or insertion
order.  It's just a table scan, both in 3.5 and 3.6.

Regards

Antoine.




Re: [Python-Dev] Python 3.6 dict becomes compact and gets a private version; and keywords become ordered

2016-09-15 Thread Raymond Hettinger
[Eric]

>> My understanding is that the all-int-keys case is an outlier.  This is due
>> to how ints hash, resulting in fewer collisions and a mostly
>> insertion-ordered hash table.  Consequently, I'd expect the above
>> microbenchmark to give roughly the same result between 3.5 and 3.6, which
>> it did.


[Antoine]
> Dict iteration shouldn't have any dependence on collisions or insertion
> order.  It's just a table scan, both in 3.5 and 3.6.

Eric is correct on this one.  The consecutive hashes make a huge difference for 
Python 3.5.  While there is still a full table scan, the check for NULL 
entries becomes a predictable branch when all the keys are in consecutive 
positions.  There is an astonishingly well-written Stack Overflow post that 
explains this effect clearly: http://stackoverflow.com/questions/11227809

With normal randomized keys, the Python 3.6 loop is dramatically better than 
Python 3.5's:

~/py36 $ ./python.exe -m timeit -s "d=dict.fromkeys(map(str,range(10**6)))" 
"list(d)"
100 loops, best of 3: 12.3 msec per loop

~/py35 $ ./python.exe -m timeit -s "d=dict.fromkeys(map(str,range(10**6)))" 
"list(d)"
10 loops, best of 3: 54.7 msec per loop

Repeating the timings, I get consistent results: 12.0 vs 46.7 and 12.0 vs 52.2 
and 11.5 vs 44.8. 


Raymond



P.S. Timings are from fresh builds on Mac OS X 10.11.6 running on a 2.6 GHz 
Haswell i7 with 16 GB of 1600 MHz RAM:  $ ./configure CC=gcc-6 && make





Re: [Python-Dev] Python parser performance optimizations

2016-09-15 Thread Artyom Skrobov
Hello,

This is a monthly ping to get a review on http://bugs.python.org/issue26415 -- 
"Excessive peak memory consumption by the Python parser".

Following the comments from August, the patches now include a more detailed 
comment for Init_ValidationGrammar().

The code change itself is still the same as two months ago.


From: Artyom Skrobov
Sent: 07 July 2016 15:44
To: [email protected]; 
[email protected]; 
[email protected]; 
[email protected]
Cc: nd
Subject: RE: Python parser performance optimizations

Hello,

This is a monthly ping to get a review on http://bugs.python.org/issue26415 -- 
"Excessive peak memory consumption by the Python parser".
The first patch of the series (an NFC refactoring) was successfully committed 
earlier in June, so the next step is to get the second patch, "the payload", 
reviewed and committed.

To address the concerns raised by the commenters back in May: the patch doesn't 
lead to negative memory consumption, of course. The base for calculating 
percentages is the smaller number of the two; this is the same style of 
reporting that perf.py uses. In other words, "200% less memory usage" is a 
threefold shrink.
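To make that reporting convention concrete (with hypothetical figures; the real numbers are under the ticket): the percentage is computed against the smaller of the two values, so a reduction from 3x to 1x reads as "200% less memory usage".

```python
# Hypothetical peak memory figures illustrating the perf.py-style
# convention: percentages are relative to the smaller number.
old_kb, new_kb = 30.0, 10.0

reduction_pct = (old_kb - new_kb) / new_kb * 100
print(reduction_pct)    # 200.0 -> "200% less memory usage"
print(old_kb / new_kb)  # 3.0   -> i.e. a threefold shrink
```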

The absolute values, and the way they were produced, are all reported under the 
ticket.


From: Artyom Skrobov
Sent: 26 May 2016 11:19
To: '[email protected]'
Subject: Python parser performance optimizations

Hello,

Back in March, I've posted a patch at http://bugs.python.org/issue26526 -- "In 
parsermodule.c, replace over 2KLOC of hand-crafted validation code, with a DFA".

The motivation for this patch was to enable a memory footprint optimization, 
discussed at http://bugs.python.org/issue26415
My proposed optimization reduces the memory footprint by up to 30% on the 
standard benchmarks, and by 200% on a degenerate case which sparked the 
discussion.
The run time stays unaffected by this optimization.

Python Developer's Guide says: "If you don't get a response within a few days 
after pinging the issue, then you can try emailing 
[email protected] asking for someone to 
review your patch."

So, here I am.


Re: [Python-Dev] Python 3.6 dict becomes compact and gets a private version; and keywords become ordered

2016-09-15 Thread Ethan Furman

On 09/15/2016 08:02 AM, Raymond Hettinger wrote:


Eric is correct on this one.  The consecutive hashes make a huge difference for 
Python 3.5.  While there is still a full table scan, the check for NULL 
entries becomes a predictable branch when all the keys are in consecutive 
positions.   There is an astonishingly well written stack overflow post that 
explains this effect clearly: http://stackoverflow.com/questions/11227809


Thanks for that.  Very good answer.

--
~Ethan~


Re: [Python-Dev] Python parser performance optimizations

2016-09-15 Thread Guido van Rossum
I wonder if this patch could just be rejected instead of lingering
forever? It clearly has no champion among the current core devs and
therefore it won't be included in Python 3.6 (we're all volunteers so
that's how it goes).

The use case for the patch is also debatable: Python's parser wasn't
designed to *efficiently* parse huge data tables like that, and if you
have that much data, using JSON is the right answer. So this doesn't
really scratch anyone's itch except the patch author's (Artyom).

From a quick look it seems the patch is very disruptive in terms of
what it changes, so it's not easy to review.

I recommend giving up, closing the issue as "won't fix", recommending
to use JSON, and moving on. Sometimes a change is just not worth the
effort.

--Guido

On Tue, Aug 9, 2016 at 1:59 AM, Artyom Skrobov  wrote:
> Hello,
>
>
>
> This is a monthly ping to get a review on http://bugs.python.org/issue26415
> -- “Excessive peak memory consumption by the Python parser”.
>
>
>
> Following the comments from July, the patches now include updating Misc/NEWS
> and compiler.rst to describe the change.
>
>
>
> The code change itself is still the same as a month ago.
>
>
>
>
>
> From: Artyom Skrobov
> Sent: 07 July 2016 15:44
> To: [email protected]; [email protected]; [email protected];
> [email protected]
> Cc: nd
> Subject: RE: Python parser performance optimizations
>
>
>
> Hello,
>
>
>
> This is a monthly ping to get a review on http://bugs.python.org/issue26415
> -- “Excessive peak memory consumption by the Python parser”.
>
> The first patch of the series (an NFC refactoring) was successfully
> committed earlier in June, so the next step is to get the second patch, “the
> payload”, reviewed and committed.
>
>
>
> To address the concerns raised by the commenters back in May: the patch
> doesn’t lead to negative memory consumption, of course. The base for
> calculating percentages is the smaller number of the two; this is the same
> style of reporting that perf.py uses. In other words, “200% less memory
> usage” is a threefold shrink.
>
>
>
> The absolute values, and the way they were produced, are all reported under
> the ticket.
>
>
>
>
>
> From: Artyom Skrobov
> Sent: 26 May 2016 11:19
> To: '[email protected]'
> Subject: Python parser performance optimizations
>
>
>
> Hello,
>
>
>
> Back in March, I’ve posted a patch at http://bugs.python.org/issue26526 --
> “In parsermodule.c, replace over 2KLOC of hand-crafted validation code, with
> a DFA”.
>
>
>
> The motivation for this patch was to enable a memory footprint optimization,
> discussed at http://bugs.python.org/issue26415
>
> My proposed optimization reduces the memory footprint by up to 30% on the
> standard benchmarks, and by 200% on a degenerate case which sparked the
> discussion.
>
> The run time stays unaffected by this optimization.
>
>
>
> Python Developer’s Guide says: “If you don’t get a response within a few
> days after pinging the issue, then you can try emailing
> [email protected] asking for someone to review your patch.”
>
>
>
> So, here I am.
>
>



-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Python 3.6 dict becomes compact and gets a private version; and keywords become ordered

2016-09-15 Thread Antoine Pitrou
On Thu, 15 Sep 2016 08:02:10 -0700
Raymond Hettinger  wrote:
> 
> Eric is correct on this one.  The consecutive hashes make a huge difference 
> for Python 3.5.  While there is still a full table scan, the check for NULL 
> entries becomes a predictable branch when all the keys are in consecutive 
> positions.   There is an astonishingly well written stack overflow post that 
> explains this effect clearly: http://stackoverflow.com/questions/11227809 
> 
> With normal randomized keys, Python 3.6 loop is dramatically better that 
> Python 3.5:
[...]

You're jumping to conclusions. While there is a difference, there
is no evidence that the difference is due to better branch prediction.

Actually, let's do a quick back-of-the-envelope calculation and show
that it can't be attributed mostly to branch prediction:

> ~/py36 $ ./python.exe -m timeit -s "d=dict.fromkeys(map(str,range(10**6)))" 
> "list(d)"
> 100 loops, best of 3: 12.3 msec per loop
> ~/py35 $ ./python.exe -m timeit -s "d=dict.fromkeys(map(str,range(10**6)))" 
> "list(d)"
> 10 loops, best of 3: 54.7 msec per loop

For 10**6 elements, this is a 42ns difference per dict element.

A 2.6 GHz Haswell doesn't stall for 42ns when there's a branch
mispredict.  According to the Internet, the branch mispredict penalty
for a Haswell CPU is 15 cycles, which is 5.8ns at 2.6 GHz.  Far from the
observed 42ns.

42ns, however, is congruent with another possible effect: a main memory
access following a last-level cache miss.
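The arithmetic behind that back-of-the-envelope calculation, using Raymond's timings from above and the stated 2.6 GHz clock:

```python
# Raymond's timings: 54.7 ms (3.5) vs 12.3 ms (3.6) for 10**6 elements.
t35_ms, t36_ms, n = 54.7, 12.3, 10**6

per_element_ns = (t35_ms - t36_ms) * 1e6 / n   # ms -> ns, per dict item
mispredict_ns = 15 / 2.6                       # 15 cycles at 2.6 GHz

print(round(per_element_ns, 1))   # 42.4 ns difference per element
print(round(mispredict_ns, 1))    # 5.8 ns per branch mispredict
```

The observed per-element gap is roughly seven times the worst-case mispredict penalty, which is what makes a memory-access explanation more plausible.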


And indeed, Serhiy showed that this micro-benchmark is actually
dependent on insertion order *on Python 3.6*:

$ ./python -m timeit -s "l = [str(i) for i in range(10**6)];
d=dict.fromkeys(l)" "list(d)"
-> 100 loops, best of 3: 20 msec per loop

$ ./python -m timeit -s "import random; l = [str(i) for i in
range(10**6)]; random.shuffle(l); d=dict.fromkeys(l)" "list(d)"
-> 10 loops, best of 3: 55.8 msec per loop

The only way the table scan (without NULL checks, since this is
Python 3.6) can depend on insertion order is that iterating the
table elements needs to INCREF each element, that is: this benchmark
doesn't only scan the table in a nice prefetcher-friendly linear
sequence, it also accesses object headers at arbitrary places in
Python's heap memory.

Since this micro-benchmark creates the keys in order just before
filling the dict with them, randomizing the insertion order destroys
the temporal locality of object header accesses when iterating over the
dict keys. *This* looks like the right explanation, not branch
mispredicts due to NULL checks.

This also shows that a micro-benchmark that merely looks ok can actually
be a terrible proxy of actual performance.


As a further validation of this theory, let's dramatically decrease the
working set size on the initial benchmark:

$ ./python -m timeit -s "d=dict.fromkeys(map(str,range(10**3)))"
"list(d)"

-> Python 3.5: 10 loops, best of 3: 10.9 usec per loop
-> Python 3.6: 10 loops, best of 3: 9.72 usec per loop

When the working set fits in the CPU cache, this micro-benchmark is
only 12% slower on 3.5 than on 3.6.
*This* much smaller difference (a mere 1.2ns difference per dict
element) could be attributed to eliminating the NULL checks, or to any
other streamlining of the core iteration logic.
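The difference between the two regimes can be made concrete by estimating each benchmark's working set (a rough sketch; sys.getsizeof counts only each object's own footprint, not allocator overhead):

```python
import sys

def dict_working_set(n):
    """Rough working set: the dict's own buffers plus its key objects.

    The shared None value produced by dict.fromkeys is ignored.
    """
    d = dict.fromkeys(map(str, range(n)))
    return sys.getsizeof(d) + sum(sys.getsizeof(k) for k in d)

small = dict_working_set(10**3)  # on the order of 100 KB: cache-resident
large = dict_working_set(10**6)  # tens of MB: far beyond typical LLC sizes
```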

Regards

Antoine.
___
Python-Dev mailing list
[email protected]
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python 3.6 dict becomes compact and gets a private version; and keywords become ordered

2016-09-15 Thread Antoine Pitrou
On Thu, 15 Sep 2016 18:13:54 +0200
Antoine Pitrou  wrote:
> 
> This also shows that a micro-benchmark that merely looks ok can actually
> be a terrible proxy of actual performance.

... unless all your dicts have their key objects nicely arranged
sequentially in heap memory, of course.

Regards

Antoine.




Re: [Python-Dev] Drastically improving list.sort() for lists of strings/ints

2016-09-15 Thread Nikolaus Rath
On Sep 13 2016, Tim Peters  wrote:
> [Terry Reedy ]
>>> Tim Peters investigated and empirically determined that an
>>> O(n*n) binary insort, as he optimized it on real machines, is faster
>>> than O(n*logn) sorting for up to around 64 items.
>
> [Nikolaus Rath ]
>> Out of curiosity: is this test repeated periodically on different
>> architectures? Or could it be that it only ever was true 10 years ago on
>> Tim's Power Mac G5 (or whatever he used)?
>
> It has little to do with architecture, but much to do with the
> relative cost of comparisons versus pointer-copying.  Near the end of
>
> https://github.com/python/cpython/blob/master/Objects/listsort.txt
[...]

Ah, that makes sense, thanks! 

Best,
-Nikolaus

-- 
GPG encrypted emails preferred. Key id: 0xD113FCAC3C4E599F
Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

 »Time flies like an arrow, fruit flies like a Banana.«


Re: [Python-Dev] Python 3.6 dict becomes compact and gets a private version; and keywords become ordered

2016-09-15 Thread Serhiy Storchaka

On 15.09.16 19:13, Antoine Pitrou wrote:

Since this micro-benchmark creates the keys in order just before
filling the dict with them, randomizing the insertion order destroys
the temporal locality of object header accesses when iterating over the
dict keys. *This* looks like the right explanation, not branch
mispredicts due to NULL checks.

This also shows that a micro-benchmark that merely looks ok can actually
be a terrible proxy of actual performance.


Thank you for the great explanation, Antoine! I came to the same conclusions 
about the randomized-integers example, but didn't notice that this is also the 
main cause of the speed-up in the strings example.



As a further validation of this theory, let's dramatically decrease the
working set size on the initial benchmark:

$ ./python -m timeit -s "d=dict.fromkeys(map(str,range(10**3)))"
"list(d)"

-> Python 3.5: 10 loops, best of 3: 10.9 usec per loop
-> Python 3.6: 10 loops, best of 3: 9.72 usec per loop

When the working set fits in the cache, this micro-benchmark is
only 12% slower on 3.5 compared to 3.6.
*This* much smaller difference (a mere 1.2ns difference per dict
element) could be attributed to eliminating the NULL checks, or to any
other streamlining of the core iteration logic.


One more example, with random hashes and an insertion order independent of 
the creation order.


$ ./python -m timeit -s "import random; a = list(map(str, 
range(10**6))); random.shuffle(a); d = dict.fromkeys(a)" -- "list(d)"


Python 3.5: 180, 180, 180 msec per loop
Python 3.6: 171, 172, 171 msec per loop

Python 3.6 is 5% faster and this looks closer to the actual performance.




[Python-Dev] Microbenchmarks

2016-09-15 Thread Victor Stinner
The discussion on benchmarking is no longer related to the compact dict, so
I'm starting a new thread.


2016-09-15 13:27 GMT+02:00 Paul Moore :
> Just as a side point, perf provided essentially identical results but
> took 2 minutes as opposed to 8 seconds for timeit to do so. I
> understand why perf is better, and I appreciate all the work Victor
> did to create it, and analyze the results, but for getting a quick
> impression of how a microbenchmark performs, I don't see timeit as
> being *quite* as bad as Victor is claiming.

Heh, I expected that complaint. I already wrote a section in the doc
explaining "why perf is so slow":
http://perf.readthedocs.io/en/latest/perf.html#why-is-perf-so-slow


So you say that timeit just works and is faster? Ok. Let's see a small session:

$ python3 -m timeit -s "d=dict.fromkeys(map(str,range(10**6)))" "list(d)"
10 loops, best of 3: 46.7 msec per loop
$ python3 -m timeit -s "d=dict.fromkeys(map(str,range(10**6)))" "list(d)"
10 loops, best of 3: 46.9 msec per loop
$ python3 -m timeit -s "d=dict.fromkeys(map(str,range(10**6)))" "list(d)"
10 loops, best of 3: 46.9 msec per loop
$ python3 -m timeit -s "d=dict.fromkeys(map(str,range(10**6)))" "list(d)"
10 loops, best of 3: 47 msec per loop

$ python2 -m timeit -s "d=dict.fromkeys(map(str,range(10**6)))" "list(d)"
10 loops, best of 3: 36.3 msec per loop
$ python2 -m timeit -s "d=dict.fromkeys(map(str,range(10**6)))" "list(d)"
10 loops, best of 3: 36.1 msec per loop
$ python2 -m timeit -s "d=dict.fromkeys(map(str,range(10**6)))" "list(d)"
10 loops, best of 3: 36.5 msec per loop

$ python3 -m timeit -s "d=dict.fromkeys(map(str,range(10**6)))" "list(d)"
10 loops, best of 3: 48.3 msec per loop
$ python3 -m timeit -s "d=dict.fromkeys(map(str,range(10**6)))" "list(d)"
10 loops, best of 3: 48.4 msec per loop
$ python3 -m timeit -s "d=dict.fromkeys(map(str,range(10**6)))" "list(d)"
10 loops, best of 3: 48.8 msec per loop

I ran timeit 7 times on Python 3 and 3 times on Python 2. Please
ignore the Python 2 numbers: I only ran those commands to interfere with
the Python 3 tests.

Now the question is: what is the "correct" result for Python 3? Let's
take the minimum of the minimums: 46.7 ms.

Now imagine that you only had the first 4 runs. What is the
"good" result now? The min is still 46.7 ms.

And what if you only had the last 3 runs? What is the "good" result
now? Min becomes 48.3 ms.

On such microbenchmark, the difference between 46.7 ms and 48.3 ms is large :-(

How do you know that you ran timeit enough times to make sure that the
result is the good one?

For me, the timeit tool is broken because you *must* run it many times
to work around its limits.
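The kind of aggregation perf performs can be sketched in pure Python with the stdlib (a hypothetical helper; the real perf module additionally spawns multiple worker processes to sample different memory layouts and hash seeds):

```python
import statistics
import timeit

def robust_timeit(stmt, setup="pass", number=1000, repeats=20):
    """Report median +- stdev over many samples instead of a single min."""
    samples = [t / number
               for t in timeit.repeat(stmt, setup=setup,
                                      number=number, repeat=repeats)]
    return statistics.median(samples), statistics.stdev(samples)

# Per-lookup time for a 100-entry str-keyed dict, with a spread estimate.
med, dev = robust_timeit("d['10']",
                         setup="d = {str(i): i for i in range(100)}")
```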


In short, I wrote the perf module to answer to these questions.

* perf uses multiple processes to test multiple memory layouts and
multiple randomized hash functions
* perf ignores the first run, used to "warmup" the benchmark
(--warmups command line option)
* perf provides many tools to analyze the distribution of results:
minimum, maximum, standard deviation, histogram, number of samples,
median, etc.
* perf displays the median +- standard deviation: the median is more
reproducible, and the standard deviation gives an idea of the stability of
the benchmark
* etc.


> I will tend to use perf now that I have it installed, and now that I
> know how to run a published timeit invocation using perf. It's a
> really cool tool. But I certainly won't object to seeing people
> publish timeit results (any more than I'd object to *any*
> mirobenchmark).

I consider that timeit results are not reliable at all. There is no
standard deviation, and it's hard to guess how many times the user ran
timeit or how they computed the "good" result.

perf takes ~60 seconds by default. If you don't care about accuracy,
use --fast and it now only takes 20 seconds ;-)

Victor


Re: [Python-Dev] Microbenchmarks

2016-09-15 Thread Victor Stinner
2016-09-15 21:33 GMT+02:00 Victor Stinner :
> perf takes ~60 seconds by default. If you don't care about accuracy,
> use --fast and it now only takes 20 seconds ;-)

Oops, I'm wrong. By default, a "single dot" (one worker process) takes
less than 1 second, so 20 dots (the default) take less than 20 seconds.

On the following example, the setup statement is quite expensive:

$ python3 -m timeit -s "d=dict.fromkeys(map(str,range(10**6)))" "list(d)"

The statement "list(d)" takes around 47 ms, but the setup statement
takes 377 ms.
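That setup cost is easy to confirm by timing the setup statement itself as the measured statement (a quick sketch):

```python
import timeit

# Build the million-key dict exactly once and time it; this is the cost
# that timeit (and each perf worker process) pays before taking a sample.
setup_cost = timeit.timeit("dict.fromkeys(map(str, range(10**6)))", number=1)
```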

It seems that timeit executes the setup statement in each process. perf
is based on timeit, so each worker process runs the setup statement at
least 4 times (1 warmup + 3 samples): 4*377 ms ~= 1.5 sec per process.

Replace range(10**6) with range(10**5) and the benchmark itself
becomes much faster: 57 seconds => 15 seconds.

Victor