[issue21955] ceval.c: implement fast path for integers with a single digit

2020-10-01 Thread STINNER Victor


STINNER Victor  added the comment:


New changeset bd0a08ea90e4c7a2ebf29697937e9786d4d8e5ee by Victor Stinner in 
branch 'master':
bpo-21955: Change my nickname in BINARY_ADD comment (GH-22481)
https://github.com/python/cpython/commit/bd0a08ea90e4c7a2ebf29697937e9786d4d8e5ee


--




[issue21955] ceval.c: implement fast path for integers with a single digit

2020-10-01 Thread STINNER Victor


Change by STINNER Victor :


--
pull_requests: +21500
pull_request: https://github.com/python/cpython/pull/22481




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-10-20 Thread Roundup Robot

Roundup Robot added the comment:

New changeset 61fcb12a9873 by Victor Stinner in branch 'default':
Issue #21955: Please don't try to optimize int+int
https://hg.python.org/cpython/rev/61fcb12a9873

--
nosy: +python-dev




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-10-20 Thread STINNER Victor

Changes by STINNER Victor :


--
resolution: fixed -> rejected




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-10-20 Thread STINNER Victor

STINNER Victor added the comment:

The fastest patch (inline2.patch) has a negligible impact on benchmarks. The 
purpose of an optimization is to make Python faster; that is not the case here, so 
I am closing the issue.

Using timeit, the largest speedup is 1.29x. Using the performance benchmark suite, 
spectral_norm is 1.07x faster and pybench.SimpleLongArithmetic is 1.06x faster. 
I consider that spectral_norm and pybench.SimpleLongArithmetic are 
microbenchmarks and so not representative of a real application.

The issue was fun, thank you for playing the micro-optimization game with me 
;-) Let's move on to more interesting optimizations with a larger impact on more 
realistic workloads, like caching global variables, optimizing method calls, 
fastcalls, etc.

--
resolution:  -> fixed
status: open -> closed




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-10-20 Thread STINNER Victor

STINNER Victor added the comment:

fastint6_inline2_json.tar.gz: archive of JSON files

- fastint6.json
- inline2.json
- master.json
- timeit-fastint6.json
- timeit-inline2.json
- timeit-master.json

--
Added file: http://bugs.python.org/file45150/fastint6_inline2_json.tar.gz




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-10-20 Thread STINNER Victor

STINNER Victor added the comment:

Results with performance 0.3.3 (and perf 0.8.3).

No major benchmark is faster. A few benchmarks even seem to be slower using 
fastint6.patch (but I don't really trust pybench).


== fastint6.patch ==

$ python3 -m perf compare_to master.json fastint6.json --group-by-speed 
--min-speed=5
Slower (3):
- pybench.ConcatUnicode: 52.7 ns +- 0.0 ns -> 56.1 ns +- 0.4 ns: 1.06x slower
- pybench.ConcatStrings: 52.7 ns +- 0.3 ns -> 56.1 ns +- 0.1 ns: 1.06x slower
- pybench.CompareInternedStrings: 16.5 ns +- 0.0 ns -> 17.4 ns +- 0.0 ns: 1.05x 
slower

Faster (4):
- pybench.SimpleIntFloatArithmetic: 441 ns +- 2 ns -> 400 ns +- 6 ns: 1.10x 
faster
- pybench.SimpleIntegerArithmetic: 441 ns +- 2 ns -> 401 ns +- 5 ns: 1.10x 
faster
- pybench.SimpleLongArithmetic: 643 ns +- 4 ns -> 608 ns +- 6 ns: 1.06x faster
- genshi_text: 79.6 ms +- 0.5 ms -> 75.5 ms +- 0.8 ms: 1.05x faster

Benchmark hidden because not significant (114): 2to3, call_method, (...)


== inline2.patch ==

haypo@selma$ python3 -m perf compare_to master.json inline2.json 
--group-by-speed --min-speed=5
Faster (2):
- spectral_norm: 223 ms +- 1 ms -> 209 ms +- 1 ms: 1.07x faster
- pybench.SimpleLongArithmetic: 643 ns +- 4 ns -> 606 ns +- 7 ns: 1.06x faster

Benchmark hidden because not significant (119): 2to3, call_method, (...)

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-10-20 Thread STINNER Victor

STINNER Victor added the comment:

Between inline2.patch and fastint6.patch, it seems like inline2.patch is faster 
(between 9% and 12% faster than fastint6.patch).

Microbenchmark on Python default (rev 554fb699af8c), compilation using LTO 
(./configure --with-lto), GCC 6.2.1 on Fedora 24, Intel(R) Core(TM) i7-3520M 
CPU @ 2.90GHz, perf 0.8.3 (dev version, just after 0.8.2).

Commands:

./python -m perf timeit --name='x+y' -s 'x=1; y=2' 'x+y' --dup 1000 -v -o 
timeit-$branch.json
./python -m perf timeit --name=sum -s "R=range(100)" "[x + x + 1 for x in R]" 
--dup 1000 -v --append timeit-$branch.json

Results:

$ python3 -m perf compare_to timeit-master.json timeit-inline2.json
sum: Median +- std dev: [timeit-master] 6.23 us +- 0.13 us -> [timeit-inline2] 
5.45 us +- 0.09 us: 1.14x faster
x+y: Median +- std dev: [timeit-master] 15.0 ns +- 0.2 ns -> [timeit-inline2] 
11.6 ns +- 0.2 ns: 1.29x faster

$ python3 -m perf compare_to timeit-master.json timeit-fastint6.json 
sum: Median +- std dev: [timeit-master] 6.23 us +- 0.13 us -> [timeit-fastint6] 
6.09 us +- 0.11 us: 1.02x faster
x+y: Median +- std dev: [timeit-master] 15.0 ns +- 0.2 ns -> [timeit-fastint6] 
12.7 ns +- 0.2 ns: 1.18x faster

$ python3 -m perf compare_to timeit-fastint6.json  timeit-inline2.json
sum: Median +- std dev: [timeit-fastint6] 6.09 us +- 0.11 us -> 
[timeit-inline2] 5.45 us +- 0.09 us: 1.12x faster
x+y: Median +- std dev: [timeit-fastint6] 12.7 ns +- 0.2 ns -> [timeit-inline2] 
11.6 ns +- 0.2 ns: 1.09x faster

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-04-22 Thread Stefan Krah

Stefan Krah added the comment:

#14757 has an implementation of inline caching, which at least seemed to slow 
down some use cases. Then again, whenever someone posts a new speedup 
suggestion, it seems to slow down things I'm working on. At least Case van 
Horsen independently verified the phenomenon in this issue. :)

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-04-22 Thread STINNER Victor

STINNER Victor added the comment:

Maybe we should adopt a different approach.

There is something called "inline caching": put the cache between instructions, 
in the same memory block. An example of a paper on CPython:

"Efficient Inline Caching without Dynamic Translation" by Stefan Brunthaler 
(2009)
https://www.sba-research.org/wp-content/uploads/publications/sac10.pdf

Maybe we can build something on top of the issue #26219 "implement per-opcode 
cache in ceval"?
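
To make the idea concrete, here is a minimal sketch with invented names (it is not
the design from the paper nor from issue #26219, just the general shape of a cache
stored next to the instruction and re-validated with a type check):

#include <Python.h>

/* Hypothetical per-instruction cache entry for a binary operator. */
typedef struct {
    PyTypeObject *cached_type;  /* operand type seen on the previous execution */
    binaryfunc cached_op;       /* nb_add slot resolved for that type */
} inline_binop_cache;

static PyObject *
cached_add(inline_binop_cache *cache, PyObject *left, PyObject *right)
{
    if (cache->cached_op != NULL &&
        Py_TYPE(left) == cache->cached_type &&
        Py_TYPE(right) == cache->cached_type)
    {
        /* Cache hit: call the resolved slot directly, skipping the
           generic PyNumber_Add() machinery. */
        PyObject *res = cache->cached_op(left, right);
        if (res != Py_NotImplemented)
            return res;
        Py_DECREF(res);             /* the slot refused; fall back below */
    }
    /* Cache miss (or refusal): re-prime the cache from the left operand's
       type, then use the generic protocol. */
    cache->cached_type = Py_TYPE(left);
    cache->cached_op = (Py_TYPE(left)->tp_as_number != NULL)
                           ? Py_TYPE(left)->tp_as_number->nb_add
                           : NULL;
    return PyNumber_Add(left, right);
}

On a hit the slot is called directly; on a miss the generic path is used and the
cache is re-primed for the next execution of the same instruction.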

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-10 Thread STINNER Victor

STINNER Victor added the comment:

> The test suite can be run directly from the source tree. The test suite 
> includes timing information for individual tests and for the entire test. 
> Sample invocation:

I extracted the slowest test (test_polyroots_legendre) and put it in a loop of 
5 iterations: see the attached mpmath_bench.py. I ran this benchmark on Linux with 
4 isolated CPUs (/sys/devices/system/cpu/isolated=2-3,6-7).
http://haypo-notes.readthedocs.org/misc.html#reliable-micro-benchmarks

On such setup, the benchmark looks stable. Example:

Run #1/5: 12.28 sec
Run #2/5: 12.27 sec
Run #3/5: 12.29 sec
Run #4/5: 12.28 sec
Run #5/5: 12.30 sec

test_polyroots_legendre (min of 5 runs):

* Original: 12.51 sec
* fastint5_4.patch: 12.27 sec (-1.9%)
* fastint6.patch: 12.21 sec (-2.4%)

I ran tests without GMP, to stress the Python int type.

I guess that the benchmark is dominated by CPU time spent on computing 
operations on large Python int, not by the time spent in ceval.c. So the 
speedup is low (2%). Such a use case doesn't seem to benefit from the 
micro-optimizations discussed in this issue.

mpmath is an arbitrary-precision floating-point arithmetic library using Python 
int (or GMP if available).

--
Added file: http://bugs.python.org/file41882/mpmath_bench.py




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-09 Thread Case Van Horsen

Case Van Horsen added the comment:

I ran the mpmath test suite with the fastint6 and fastint5_4 patches.

fastint6 results

without gmpy: 0.25% faster
with gmpy: 3% slower

fastint5_4 results

without gmpy: 1.5% slower
with gmpy: 5.5% slower

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-09 Thread STINNER Victor

STINNER Victor added the comment:

Case Van Horsen added the comment:
> I ran the mpmath test suite with the fastint6 and fastint5_4 patches.
>
> fastint6 results
> without gmpy: 0.25% faster
> with gmpy: 3% slower
>
> fastint5_4 results
> without gmpy: 1.5% slower
> with gmpy: 5.5% slower

I'm more and more disappointed by this issue... If even a test
stressing int & float is *slower* (or less than 1% faster) with a
patch supposed to optimize them, what's the point? I'm also concerned
by the slow-down for other types (gmpy types).

Maybe we should just close the issue?

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-09 Thread Yury Selivanov

Yury Selivanov added the comment:

> Maybe we should just close the issue?

I'll take a closer look at gmpy later. Please don't close.

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-08 Thread Yury Selivanov

Yury Selivanov added the comment:

> I ran the mpmath test suite with Python 3.6 and with the fastint6 patch. The 
> overall increase when using Python long type was about 1%. When using gmpy2's 
> mpz type, there was a slowdown of about 2%.

> I will run more tests tonight.

Please try to test fastint5 too (fast paths for long & floats, whereas fastint6 
is only focused on longs).

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-08 Thread Case Van Horsen

Case Van Horsen added the comment:

mpmath is a library for arbitrary-precision floating-point arithmetic. It uses 
Python's native long type or gmpy2's mpz type for computations. It is available 
at https://pypi.python.org/pypi/mpmath.

The test suite can be run directly from the source tree. The test suite 
includes timing information for individual tests and for the entire test. 
Sample invocation:

~/src/mpmath-0.19/mpmath/tests$ time py36 runtests.py -local

For example, I've tried to optimize gmpy2's handling of binary operations 
between its mpz type and short Python integers. I've found it to provide useful 
results: improvements that are significant on a micro-benchmark (say 20%) will 
usually cause a small but repeatable improvement. And some improvements that 
looked good on a micro-benchmark would slow down mpmath.

I ran the mpmath test suite with Python 3.6 and with the fastint6 patch. The 
overall increase when using Python long type was about 1%. When using gmpy2's 
mpz type, there was a slowdown of about 2%.

I will run more tests tonight.

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-07 Thread Stefan Krah

Stefan Krah added the comment:

#26288 brought a great speedup for floats. With fastint5_4.patch *on top of 
#26288* I see no improvement for floats and a big slowdown for _decimal.

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-07 Thread Yury Selivanov

Yury Selivanov added the comment:

From what I can see there is no negative impact of the patch on stable macro 
benchmarks.

There is quite a detectable positive impact on most of integer and float 
operations from my patch.  13-16% on nbody and spectral_norm benchmarks is 
still impressive.  And you can see a huge improvement in various timeit 
micro-benchmarks.

fastint5 is a very compact patch, that only touches the ceval.c file.  It 
doesn't complicate the code, as the macro is very straightforward.  Since the 
patch passed the code review, thorough benchmarking and discussion stages, I'd 
like to commit it.

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-07 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Please don't commit it right now. Yes, due to using macros the patch looks 
simple, but the macros expand to complex code. We need more statistics.

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-07 Thread Yury Selivanov

Yury Selivanov added the comment:

> Please don't commit it right now. Yes, due to using macros the patch looks 
> simple, but the macros expand to complex code. We need more statistics.

But what will you use to gather statistics data?  The test suite isn't 
representative, and we already know what the benchmark suite will show.  I can 
assist with writing some code for stats, but what's the plan?

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-07 Thread Antoine Pitrou

Antoine Pitrou added the comment:

Be careful with test suites: first, they might exercise code that would never 
be a critical point for performance in a real-world application; second and 
most important, unittest seems to have gotten slower between 2.x and 3.x, so 
you would really be comparing apples to oranges.

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-07 Thread Yury Selivanov

Yury Selivanov added the comment:

Attaching another patch - fastint6.patch that only optimizes longs (no FP fast 
path).

> #26288 brought a great speedup for floats. With fastint5_4.patch *on top of 
> #26288* I see no improvement for floats and a big slowdown for _decimal.

What benchmark did you use?  What were the numbers?  I'm asking because before 
you benchmarked different patches that are conceptually similar to fastint5, 
and the result was that decimal was 5% faster with fast paths for just longs, 
and 6% slower with fast paths for longs & floats.

Also, some quick timeit results (quite stable from run to run):


-m timeit -s "x=2" "x + 10 + x * 20  + x* 10 + 20 -x"
3.6: 0.150usec   3.6+fastint: 0.112usec


-m timeit -s "x=2" "x*2.2 + 2 + x*2.5 + 1.0 - x / 2.0 + (x+0.1)/(x-0.1)*2 + 
(x+10)*(x-30)"
3.6: 0.425usec   3.6+fastint: 0.302usec

--
Added file: http://bugs.python.org/file41843/fastint6.patch




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-07 Thread Case Van Horsen

Case Van Horsen added the comment:

Can I suggest the mpmath test suite as a good benchmark? I've used it to test 
the various optimizations in gmpy2 and it has always been a valuable real-world 
benchmark. And it is slower in Python 3 than in Python 2.

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-06 Thread STINNER Victor

STINNER Victor added the comment:

Benchmark on inline-2.patch. No speedup, only slowdown.

I'm now running benchmark on fastint5_4.patch.

$ python3 -u perf.py --affinity=2-3,6-7 --rigorous ../default/python.orig 
../default/python.inline-2

Report on Linux smithers 4.3.4-300.fc23.x86_64 #1 SMP Mon Jan 25 13:39:23 UTC 
2016 x86_64 x86_64
Total CPU cores: 8

### json_load ###
Min: 0.707290 -> 0.723411: 1.02x slower
Avg: 0.707845 -> 0.724238: 1.02x slower
Significant (t=-297.25)
Stddev: 0.00026 -> 0.00049: 1.8696x larger

### regex_v8 ###
Min: 0.03 -> 0.070435: 1.06x slower
Avg: 0.066947 -> 0.071378: 1.07x slower
Significant (t=-17.98)
Stddev: 0.00172 -> 0.00177: 1.0313x larger

The following not significant results are hidden, use -v to show them:
2to3, chameleon_v2, django_v3, fastpickle, fastunpickle, json_dump_v2, nbody, 
tornado_http.

real    58m32.662s
user    57m43.058s
sys     0m47.428s

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-06 Thread STINNER Victor

STINNER Victor added the comment:

Benchmark on fastint5_4.patch.

python3 -u perf.py --affinity=2-3,6-7 --rigorous ../default/python.orig 
../default/python_fastint5_4

Report on Linux smithers 4.3.4-300.fc23.x86_64 #1 SMP Mon Jan 25 13:39:23 UTC 
2016 x86_64 x86_64
Total CPU cores: 8

### django_v3 ###
Min: 0.563959 -> 0.578181: 1.03x slower
Avg: 0.565383 -> 0.579137: 1.02x slower
Significant (t=-152.48)
Stddev: 0.00075 -> 0.00050: 1.4900x smaller

### fastunpickle ###
Min: 0.551076 -> 0.563469: 1.02x slower
Avg: 0.555481 -> 0.567028: 1.02x slower
Significant (t=-27.05)
Stddev: 0.00278 -> 0.00324: 1.1687x larger

### json_dump_v2 ###
Min: 2.737429 -> 2.662615: 1.03x faster
Avg: 2.754239 -> 2.685404: 1.03x faster
Significant (t=55.63)
Stddev: 0.00610 -> 0.01077: 1.7662x larger

### nbody ###
Min: 0.228548 -> 0.212292: 1.08x faster
Avg: 0.230082 -> 0.213574: 1.08x faster
Significant (t=73.74)
Stddev: 0.00175 -> 0.00139: 1.2567x smaller

### regex_v8 ###
Min: 0.041323 -> 0.048099: 1.16x slower
Avg: 0.041624 -> 0.049318: 1.18x slower
Significant (t=-45.38)
Stddev: 0.00123 -> 0.00116: 1.0613x smaller

The following not significant results are hidden, use -v to show them:
2to3, chameleon_v2, fastpickle, json_load, tornado_http.

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-06 Thread Yury Selivanov

Yury Selivanov added the comment:

> ### regex_v8 ###
> Min: 0.041323 -> 0.048099: 1.16x slower
> Avg: 0.041624 -> 0.049318: 1.18x slower

I think this is a random fluctuation; that benchmark (and the re lib) doesn't use 
the operators too much.  It can't be THAT much slower just because of optimizing a 
few binops.

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-06 Thread Yury Selivanov

Yury Selivanov added the comment:

Alright, I ran a few benchmarks myself.  In rigorous mode regex_v8 has the same 
performance on my 2013 MacBook Pro and an 8-year-old i7 CPU (Linux).

Here're results of "perf.py -b raytrace,spectral_norm,meteor_contest,nbody 
../cpython/python.exe ../cpython-git/python.exe -r"


fastint5:

### nbody ###
Min: 0.227683 -> 0.197046: 1.16x faster
Avg: 0.229366 -> 0.198889: 1.15x faster
Significant (t=137.31)
Stddev: 0.00170 -> 0.00142: 1.1977x smaller

### spectral_norm ###
Min: 0.296840 -> 0.262279: 1.13x faster
Avg: 0.299616 -> 0.265387: 1.13x faster
Significant (t=74.52)
Stddev: 0.00331 -> 0.00319: 1.0382x smaller

The following not significant results are hidden, use -v to show them:
meteor_contest, raytrace.


==


inline-2:


### raytrace ###
Min: 1.188825 -> 1.213788: 1.02x slower
Avg: 1.199827 -> 1.227276: 1.02x slower
Significant (t=-18.12)
Stddev: 0.00559 -> 0.01408: 2.5184x larger

### spectral_norm ###
Min: 0.296535 -> 0.277025: 1.07x faster
Avg: 0.299044 -> 0.278071: 1.08x faster
Significant (t=87.40)
Stddev: 0.00220 -> 0.00097: 2.2684x smaller

The following not significant results are hidden, use -v to show them:
meteor_contest, nbody.

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-06 Thread Yury Selivanov

Yury Selivanov added the comment:

You're also running a very small subset of all benchmarks available. Please try 
the '-b all' option.  I'll also run benchmarks on my machines.

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-06 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

> I see two main trends: optimize most cases (optimize most operators for int 
> and float,  ex: fastint5_4.patch) versus optimize very few cases to limit 
> changes and to limit effects on ceval.c (ex: inline-2.patch).

I agree that maybe optimizing very few cases is better. We need to collect 
statistics on the use of different operations with different types in long runs of 
tests or benchmarks. If, say, division is used 100 times less than addition, we 
shouldn't complicate the ceval loop to optimize it.
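
To make this concrete, the instrumentation could be as simple as the following
sketch (hypothetical code, names invented; a real experiment would hook this into
the BINARY_* opcodes and dump the counters at interpreter shutdown):

#include <Python.h>
#include <stdio.h>

/* Count how often an opcode (here: BINARY_ADD) sees each left-operand type. */
#define MAX_TRACKED 64
static PyTypeObject *tracked_types[MAX_TRACKED];
static unsigned long long tracked_counts[MAX_TRACKED];

static void
count_binary_add(PyObject *left)
{
    PyTypeObject *tp = Py_TYPE(left);
    for (int i = 0; i < MAX_TRACKED; i++) {
        if (tracked_types[i] == tp) {
            tracked_counts[i]++;
            return;
        }
        if (tracked_types[i] == NULL) {   /* first time we see this type */
            tracked_types[i] = tp;
            tracked_counts[i] = 1;
            return;
        }
    }
    /* table full: further new types are ignored in this sketch */
}

static void
dump_binary_add_stats(void)
{
    for (int i = 0; i < MAX_TRACKED && tracked_types[i] != NULL; i++)
        fprintf(stderr, "BINARY_ADD %s: %llu\n",
                tracked_types[i]->tp_name, tracked_counts[i]);
}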

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-05 Thread Yury Selivanov

Yury Selivanov added the comment:

As to whether we want this patch committed or not, here's a 
mini-macro-something benchmark:


$ ./python.exe -m timeit -s "x=2" "x + 10 + x * 20  + x* 10 + 20 -x"
1000 loops, best of 3: 0.115 usec per loop

$ python3.5 -m timeit -s "x=2" "x + 10 + x * 20  + x* 10 + 20 -x"
1000 loops, best of 3: 0.141 usec per loop


$ ./python.exe -m timeit -s "x=2" "x*2.2 + 2 + x*2.5 + 1.0 - x / 2.0 + 
(x+0.1)/(x-0.1)*2 + (x+10)*(x-30)"
100 loops, best of 3: 0.308 usec per loop

$ python3.5 -m timeit -s "x=2" "x*2.2 + 2 + x*2.5 + 1.0 - x / 2.0 + 
(x+0.1)/(x-0.1)*2 + (x+10)*(x-30)"
100 loops, best of 3: 0.652 usec per loop


Still, longs are 30-50% faster and FP is 100% faster.  I think it's a very good 
result.  Please don't block this patch.

--




Re: [issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-05 Thread M.-A. Lemburg
On 05.02.2016 16:14, STINNER Victor wrote:
> 
> Please don't. I would like to have time to benchmark all these patches (there 
> are now 9 patches attached to the issue :-)) and I would like to hear 
> Serhiy's feedback on your latest patches.

Regardless of the performance, the fastint5.patch looks like the
least invasive approach to me. It also doesn't incur as much
maintenance overhead as the others do.

I'd only rename the macro MAYBE_DISPATCH_FAST_NUM_OP to
TRY_FAST_NUMOP_DISPATCH :-)

BTW: I do wonder why this approach is as fast as the others. Have
compilers grown smart enough to realize that the number slot
functions will not change and can thus be inlined?

-- 
Marc-Andre Lemburg
eGenix.com




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-05 Thread STINNER Victor

STINNER Victor added the comment:

My analysis of benchmarks.

Even using CPU isolation to run benchmarks, the results look unreliable for 
very short benchmarks like 3 ** 2.0: I don't think that fastint_alt can make 
the operation 16% slower since it doesn't touch this code, no?

Well... as expected, the speedup is quite *small*: the largest difference is on "3 
* 2" run 100 times: 18% faster with fastint_alt. We are talking about 1.82 us 
=> 1.49 us: a delta of 330 ns. I expect a much larger difference if you compile a 
function to machine code using Cython or a JIT like Numba or PyPy. Remember 
that we are running *micro*-benchmarks, so we should not push overkill 
optimizations unless the speedup is really impressive.

It's quite obvious from the tables that fastint_alt.patch only optimizes int 
(float is not optimized). If we choose to optimize float too, 
fastintfloat_alt.patch and fastint5.patch look to have the *same* speed.

I don't see any overhead on Decimal + Decimal with any patch: good.

--

Between fastintfloat_alt.patch and fastint5.patch, I prefer 
fastintfloat_alt.patch which is much easier to read, so probably much easier to 
debug. I hate huge macro when I have to debug code in gdb :-( I also like very 
much the idea of *reusing* existing functions, rather than duplicating code.

Even if Antoine doesn't seem interested in optimizations on float, I think that 
it's ok to add a few lines for this type; fastintfloat_alt.patch is not so 
complex. What do *you* think?

Why not optimize a**b? It's a common operation, especially 2**k, no?

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-05 Thread Yury Selivanov

Yury Selivanov added the comment:

Anyways, if it's about macro vs non-macro, I can inline the macro by hand 
(which I think is an inferior approach here).  But I'd like the final code to 
use my approach of using slots directly, instead of modifying 
longobject/floatobject to export lots of *internal* stuff.

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-05 Thread Yury Selivanov

Yury Selivanov added the comment:

Unless there are any objections, I'll commit fastint5.patch in a day or two.

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-05 Thread Yury Selivanov

Yury Selivanov added the comment:

>> Unless there are any objections, I'll commit fastint5.patch in a day or two.

> Please don't. I would like to have time to benchmark all these patches (there 
> are now 9 patches attached to the issue :-)) and I would like to hear 
> Serhiy's feedback on your latest patches.

Sure, I'd very appreciate a review of fastint5.

I can save you some time on benchmarking -- it's really about fastint_alt vs 
fastint5.  The latter optimizes ALL ops on longs AND floats.  The former only 
optimizes some ops on longs.  So please be sure you're comparing oranges to 
oranges ;)

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-05 Thread STINNER Victor

STINNER Victor added the comment:

bench_long2.py: my updated microbenchmark to test many types and more 
operations.

compare.txt: compare Python original, fastint_alt.patch, fastintfloat_alt.patch 
and fastint5.patch. "(*)" marks the minimum of the line; percentages are relative 
to the minimum (if larger than +/-5%).

compare_to.txt: similar to compare.txt, but percentages are relative to the 
original Python.

--
Added file: http://bugs.python.org/file41821/bench_long2.py




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-05 Thread Yury Selivanov

Yury Selivanov added the comment:

> Regardless of the performance, the fastint5.patch looks like the
least invasive approach to me. It also doesn't incur as much
maintenance overhead as the others do.

Thanks.  It's a result of an enlightenment that can only come
after running benchmarks all day :)

> I'd only rename the macro MAYBE_DISPATCH_FAST_NUM_OP to
TRY_FAST_NUMOP_DISPATCH :-)

Yeah, your name is better.

> BTW: I do wonder why this approach is as fast as the others. Have
compilers grown smart enough to realize that the number slot
functions will not change and can thus be inlined ?

It looks like it; I'm very impressed myself.  I'd expect fastint3 (which just 
inlines a lot of logic directly in ceval.c) to be the fastest one.  But it 
seems that the compiler does an excellent job here.

Victor, BTW, if you want to test fastint3 vs fastint5, don't forget to apply 
the patch from issue #26288 over fastint5 (fixes slow performance of 
PyLong_AsDouble)

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-05 Thread Yury Selivanov

Yury Selivanov added the comment:

> Between fastintfloat_alt.patch and fastint5.patch, I prefer 
> fastintfloat_alt.patch which is much easier to read, so probably much easier 
> to debug. I hate huge macro when I have to debug code in gdb :-( I also like 
> very much the idea of *reusing* existing functions, rather than duplicating 
> code.

I disagree.

fastintfloat_alt exports a lot of functions from longobject/floatobject, 
something that I really don't like.  Lots of repetitive code in ceval.c also 
makes it harder to make sure everything is correct.

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-05 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

My patches were just samples. I'm glad that Yury incorporated the main idea and 
that this helps. If we apply any patch, I would prefer fastint5.patch. But I don't 
quite understand why it adds any gain. Is this just due to the overhead of calling 
PyNumber_Add? Then we should test with other compilers and with the LTO option. 
fastint5.patch adds an overhead for type checks and increases the size of the ceval 
loop. What outweighs this overhead?

As for tests, it would be more honest to test data whose results fall outside of 
the small ints range (-5..256).

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-05 Thread STINNER Victor

STINNER Victor added the comment:

> Unless there are any objections, I'll commit fastint5.patch in a day or two.

Please don't. I would like to have time to benchmark all these patches (there 
are now 9 patches attached to the issue :-)) and I would like to hear Serhiy's 
feedback on your latest patches.

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-05 Thread Yury Selivanov

Yury Selivanov added the comment:

Thanks, Serhiy,

> But I don't quite understand why it adds any gain. 

Perhaps, and this is just a guess - the fast path does only a couple of eq tests 
& one call for the actual op.  If it's long+long then long_add will be called 
directly.

PyNumber_Add has more overhead:
- at least one extra call
- a few extra checks to guard against NotImplemented
- abstract.c/binary_op1 has a few more checks/slot lookups

So it looks like there are just far fewer instructions to be executed.  If this 
guess is correct, then an LTO build without fast paths will still be somewhat 
slower.

> Is this just due to overhead of calling PyNumber_Add? Then we should test 
> with other compilers and with the LTO option.

I actually tried to compile CPython with LTO -- but couldn't.  Almost all of the C 
extension modules failed to link.  Do we compile official binaries with LTO?

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-05 Thread STINNER Victor

Changes by STINNER Victor :


Added file: http://bugs.python.org/file41823/compare_to.txt




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-05 Thread STINNER Victor

Changes by STINNER Victor :


Added file: http://bugs.python.org/file41822/compare.txt




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-05 Thread STINNER Victor

STINNER Victor added the comment:

Serhiy Storchaka: "My patches were just samples. I'm glad that Yury 
incorporated the main idea and that this helps."

Oh, if even Serhiy prefers Yury's patches, I should read them again :-)

--

I read fastint5.patch one more time and I finally understood the following 
macros:

+#define NB_SLOT(slot) offsetof(PyNumberMethods, slot)
+#define NB_BINOP(nb_methods, slot) \
+    (*(binaryfunc*)(& ((char*)nb_methods)[NB_SLOT(slot)]))
+#define PY_LONG_CALL_BINOP(slot, left, right) \
+    (NB_BINOP(PyLong_Type.tp_as_number, slot))(left, right)
+#define PY_FLOAT_CALL_BINOP(slot, left, right) \
+    (NB_BINOP(PyFloat_Type.tp_as_number, slot))(left, right)

In short, with that, a+b calls long_add(a, b). On first read, I understood 
that it cast objects to C long or C double (don't ask me why).
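
Concretely, my reading is that PY_LONG_CALL_BINOP(nb_add, left, right) boils down
to something like this sketch (not a literal expansion of the macro):

#include <Python.h>

/* Look up the nb_add slot of PyLong_Type at run time and call it; for two
   exact ints this ends up in long_add(left, right). */
static PyObject *
call_long_nb_add(PyObject *left, PyObject *right)
{
    binaryfunc add = PyLong_Type.tp_as_number->nb_add;  /* resolved at run time */
    return add(left, right);
}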


I see a difference between fastint5.patch and fastintfloat_alt.patch: 
fastint5.patch resolves the address of long_add() at runtime, whereas 
fastintfloat_alt.patch gets a direct pointer to _PyLong_Add() at compile time. 
I expected a subtle speedup, but I'm unable to see it in benchmarks 
(again, both patches have the same speed).

The float path is simpler in fastint5.patch because it uses the same code whether 
right is float or long, but it adds more checks to the slow path. No patch 
looks to have a real impact on the slow path. Is it worth changing the second 
if to PyFloat_CheckExact() and then checking the type of right in the if body to 
avoid other checks on the slow path?

(C checks look very cheap, so I think that I already replied to my own question 
:-))

--

fastint5.patch optimizes a+b, a-b, a*b, a/b and a//b. Why not other operators? 
List of operators from my constant folding optimization in fatoptimizer:

* int, float: a+b, a-b, a*b, a/b, +x, -x, ~x, a//b, a%b, a**b
* int only: a<<b, a>>b, a&b, a|b, a^b

If we optimize a//b, I suggest to also optimize a%b to be consistent. For 
integers, a**b, a<<b and a>>b would make sense too. Coming from the C language, 
I would prefer a<<b and a>>b over a*2**k or a//2**k, since I expect better 
performance.

For float, -x and +x may be common, but less so than a+b, a-b, a*b, a/b.

Well, what I'm trying to say is: if we choose the fastintfloat_alt.patch design, we 
will have to expose a lot of new C functions in headers, and duplicate a lot of 
code.

To support more than 4 operators, we need a macro.

If we use a macro, it's cheap (in terms of code maintenance) to use it for most 
or even all operators.

--

> But I don't quite understand why it adds any gain. Is this just due to 
> overhead of calling PyNumber_Add?

Hum, that's a good question.


> Then we should test with other compilers and with the LTO option.

There are projects for that (I don't recall the issue number), but I would prefer 
to handle LTO separately. Python supports platforms and compilers which don't 
implement LTO.


> fastint5.patch adds an overhead for type checks and increases the size of 
> ceval loop. What outweighs this overhead?

I stopped guessing the speedup just by reading the code or a patch. I only 
trust benchmarks :-)

Advice: don't trust yourself! Only trust benchmarks.

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-05 Thread STINNER Victor

STINNER Victor added the comment:

msg223186, Serhiy Storchaka about inline.patch: "Confirmed speed up about 20%. 
Surprisingly it affects even integers outside of the range of preallocated small 
integers (-5...255)."

The optimization applies to Python ints with 0 or 1 digit, so in the range 
[-2^30+1; 2^30-1].

Small integers in [-5; 255] might be faster, but only because of PyLong_FromLong().
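
For reference, the "0 or 1 digit" test that these fast paths rely on looks roughly
like the following sketch (ints are stored as arrays of 30-bit digits on most
builds, with the sign carried by the object size):

#include <Python.h>

/* True if v is an exact int that fits in a single 30-bit digit,
   i.e. roughly in [-2**30+1; 2**30-1]. */
static int
is_single_digit_int(PyObject *v)
{
    return PyLong_CheckExact(v) && Py_ABS(Py_SIZE(v)) <= 1;
}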

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-05 Thread STINNER Victor

STINNER Victor added the comment:

myself> Ok. Now I'm lost. We have so many patches :-) Which one do you prefer?

I fully re-read this *old* issue - well, *almost* all of the messages.

Well, it's clear that no consensus was found yet :-) I see two main trends: 
optimize most cases (optimize most operators for int and float,  ex: 
fastint5_4.patch) versus optimize very few cases to limit changes and to limit 
effects on ceval.c (ex: inline-2.patch).

Marc-Andre and Antoine asked us not to stick to micro-optimizations but to think 
wider: run macro benchmarks, like perf.py, and suggest PyPy, Numba, 
Cython & co. to users who want the best performance on numeric functions.

They also warned about subtle side-effects of any kind of change on ceval.c 
which may be counter-productive. It was shown in the long list of patches that 
some of them introduced performance *regressions*.

I don't expect that CPython can beat any compiler emitting machine code. CPython 
will always have to pay the price of boxing/unboxing and its loop evaluating 
bytecode. We can do *better*, the question is "how far?".

I think that we have gone far enough investigating *all* the different options to 
optimize 1+2 ;-) Each option was micro-benchmarked very carefully.

Now I suggest to focus on *macro* benchmarks to help us make a decision. I 
will run perf.py on fastint5_4.patch and inline-2.patch.

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-05 Thread Yury Selivanov

Changes by Yury Selivanov :


Added file: http://bugs.python.org/file41831/fastint5_4.patch




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-05 Thread Yury Selivanov

Yury Selivanov added the comment:

Attached is the new version of the fastint5 patch.  I fixed most of the review 
comments.  I also optimized the %, << and >> operators.  I didn't optimize other 
operators because they are less common.  I guess we have to draw a line 
somewhere...

Victor, thanks a lot for your suggestion to drop the NB_SLOT etc. macros!  Without 
them the code is even simpler.

--
Added file: http://bugs.python.org/file41829/fastint5_2.patch




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-05 Thread Yury Selivanov

Changes by Yury Selivanov :


Added file: http://bugs.python.org/file41830/fastint5_3.patch




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-05 Thread STINNER Victor

STINNER Victor added the comment:

inline-2.patch: a more complete version of inline.patch.

It optimizes the same instructions as Python 2: BINARY_ADD, INPLACE_ADD, 
BINARY_SUBTRACT, INPLACE_SUBTRACT.
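
For reference, a minimal sketch of this kind of fast path (hypothetical code, not
the actual inline-2.patch): when both operands are exact single-digit ints, the
addition can be done with plain C arithmetic, since the sum of two 30-bit values
cannot overflow a C long.

#include <Python.h>

/* Sketch only: both values fit in one 30-bit digit, so PyLong_AsLong()
   cannot fail here and a + b cannot overflow. */
static PyObject *
inline_add(PyObject *left, PyObject *right)
{
    if (PyLong_CheckExact(left) && PyLong_CheckExact(right) &&
        Py_ABS(Py_SIZE(left)) <= 1 && Py_ABS(Py_SIZE(right)) <= 1)
    {
        long a = PyLong_AsLong(left);
        long b = PyLong_AsLong(right);
        return PyLong_FromLong(a + b);
    }
    return PyNumber_Add(left, right);   /* generic slow path */
}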


Quick & *dirty* microbenchmark:

$ ./python -m timeit -s 'x=1' 'x+x+x+x+x+x+x+x+x+x+x+x+x+x+x+x+x+x+x+x+x+x+x'

* Original: 287 ns
* fastint5_2.patch: 261 ns (-9%)
* inline-2.patch: 212 ns (-26%)


$ ./python -m timeit -s 'x=1000; y=1' 
'x-y-y-y-y-y-y-y-y-y-y-y-y-y-y-y-y-y-y-y-y-y-y-y-y'

* Original: 517 ns
* fastint5_2.patch: 469 ns (-9%)
* inline-2.patch: 442 ns (-15%)


Ok. Now I'm lost. We have so many patches :-) Which one do you prefer?

In terms of speedup, I expect that the Python 2 design (inline-2.patch) cannot be 
beaten by another option, since it doesn't need any extra C code and does 
everything inline in ceval.c.

--
Added file: http://bugs.python.org/file41832/inline-2.patch




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-05 Thread Yury Selivanov

Yury Selivanov added the comment:

> Ok. Now I'm lost. We have so many patches :-) Which one do you prefer?

To no-one's surprise I prefer fastint5, because it optimizes almost all binary 
operators on both ints and floats.

inline-2.patch optimizes just + and - for just ints.  If the + and - 
performance of inline-2 is really important, I suggest merging it into fastint5, 
but I'd keep it simple ;)

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-05 Thread STINNER Victor

STINNER Victor added the comment:

msg222985: Raymond Hettinger
"There also used to be a fast path for binary subscriptions with integer 
indexes.  I would like to see that performance regression fixed if it can be 
done cleanly."

Issue #26280 was opened to track this optimization.

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-04 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

fastint2.patch adds a small regression for string multiplication:

$ ./python -m timeit -s "x = 'x'" -- "x*2; x*2; x*2; x*2; x*2; x*2; x*2; x*2; 
x*2; x*2; "
Unpatched:  1.46 usec per loop
Patched:1.54 usec per loop

Here is an alternative patch. It just uses existing specialized functions for 
integers: long_add, long_sub and long_mul. It doesn't add a regression for the 
above example with string multiplication, and it looks faster than fastint2.patch for 
integer multiplication.

$ ./python -m timeit -s "x = 12345" -- "x*2; x*2; x*2; x*2; x*2; x*2; x*2; x*2; 
x*2; x*2; "
Unpatched:  0.887 usec per loop
fastint2.patch: 0.841 usec per loop
fastint_alt.patch:  0.804 usec per loop

--
Added file: http://bugs.python.org/file41801/fastint_alt.patch




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-04 Thread STINNER Victor

STINNER Victor added the comment:

I prefer the fastint_alt.patch design; it's simpler. I added a comment on the 
review.

My numbers, best of 5 timeit runs:

$ ./python -m timeit -s "x = 12345" -- "x*2; x*2; x*2; x*2; x*2; x*2; x*2; x*2; 
x*2; x*2; "

* original: 299 ns
* fastint2.patch: 282 ns (-17 ns, -6%)
* fastint_alt.patch: 267 ns (-32 ns, -11%)

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-04 Thread Antoine Pitrou

Antoine Pitrou added the comment:

People should stop getting hung up about benchmark numbers and instead should 
first think about what they are trying to *achieve*. FP performance in pure 
Python does not seem like an important goal in itself. Also, some benchmarks 
may show variations which are randomly correlated with a patch (e.g. because of 
different code placement by the compiler interfering with instruction cache 
wayness). It is important not to block a patch because some random benchmark on 
some random machine shows an unexpected slowdown.

That said, both of Serhiy's patches are probably ok IMO.

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-04 Thread Antoine Pitrou

Antoine Pitrou added the comment:

Hi Yury,

> I'm not sure how to respond to that. Every performance aspect *is*
> important.

Performance is not a religion (not any more than security or any other
matter).  It is not helpful to brandish results on benchmarks which have
no relevance to real-world applications.

It helps to define what we should achieve and why we want to achieve it.
 Once you start asking "why", the prospect of speeding up FP
computations in the eval loop starts becoming dubious.

> numpy isn't shipped with CPython, not everyone uses it.

That's not the point. *People doing FP-heavy computations* should use
Numpy or any of the packages that can make FP-heavy computations faster
(Numba, Cython, Pythran, etc.).

You should use the right tool for the job.  There is no need to
micro-optimize a hammer for driving screws when you could use a
screwdriver instead.  Lists or tuples of Python float objects are an
awful representation for what should be vectorized native data.  They
eat more memory in addition to being massively slower (they will also be
slower to serialize from/to disk, etc.).

"Not using" Numpy when you would benefit from it is silly.
Numpy is not only massively faster on array-wide tasks, it also makes it
easier to write high-level, readable, reusable code instead of writing
loops and iterating by hand.  Because it has been designed explicitly
for such use cases (which the Python core was not, despite the existence
of the colorsys module ;-)).  It also gives you access to a large
ecosystem of third-party modules implementing various domain-specific
operations, actively maintained by experts in the field.

Really, the mindset of "people shouldn't need to use Numpy, they can do
FP computations in the interpreter loop" is counter-productive.  I
understand that it's seductive to think that Python core should stand on
its own, but it's also a dangerous fallacy.

You *should* advocate people use Numpy for FP computations.  It's an
excellent library, and it's currently a major selling point for Python.
Anyone doing FP-heavy computations with Python should learn to use
Numpy, even if they only use it from time to time.  Downplaying its
importance, and pretending core Python is sufficient, is not helpful.

> It also harms Python 3 adoption a little bit, since many benchmarks
> are still slower. Some of them are FP related.

The Python 3 migration is happening already. There is no need to worry
about it... Even the diehard 3.x haters have stopped talking of
releasing a 2.8 ;-)

> In any case, I think that if we can optimize something - we should.

That's not true. Some optimizations add maintenance overhead for no real
benefit. Some may even hinder performance as they add conditional
branches in a critical path (increasing the load on the CPU's branch
predictors and making them potentially less efficient).

Some optimizations are obviously good, like the method call optimization
which caters to real-world use cases (and, by the way, kudos for that...
you are doing much better than all previous attempts ;-)). But some are
solutions waiting for a problem to solve.

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-04 Thread Yury Selivanov

Yury Selivanov added the comment:

tl;dr   I'm attaching a new patch - fastint4 -- the fastest of them all. It 
incorporates Serhiy's suggestion to export long/float functions and use them.  
I think it's reasonably complete -- please review it, and let's get it 
committed.

== Benchmarks ==

spectral_norm (fastint_alt)     -> 1.07x faster
spectral_norm (fastintfloat)    -> 1.08x faster
spectral_norm (fastint3.patch)  -> 1.29x faster
spectral_norm (fastint4.patch)  -> 1.16x faster

spectral_norm (fastint**.patch) -> 1.31x faster
nbody (fastint**.patch)         -> 1.16x faster

Where:
- fastint3 - is my previous patch that nobody likes (it inlined a lot of logic 
from longobject/floatobject)

- fastint4 - is the patch I'm attaching and ideally want to commit

- fastint** - is a modification of fastint4.  This is very interesting -- I 
started to profile different approaches, and found two bottlenecks, that really 
made Serhiy's and my other patches slower than fastint3.  What I found is that 
PyLong_AsDouble can be significantly optimized, and PyLong_FloorDiv is super 
inefficient.

PyLong_AsDouble can be sped up several times if we add a fastpath for 1-digit 
longs:

// longobject.c: PyLong_AsDouble
if (PyLong_CheckExact(v) && Py_ABS(Py_SIZE(v)) <= 1) {
    /* fast path; single digit will always fit decimal */
    return (double)MEDIUM_VALUE((PyLongObject *)v);
}


PyLong_FloorDiv (fastint4 adds it) can be specialized for single digits, which 
gives it a tremendous boost.

With those two optimizations, fastint4 becomes as fast as fastint3.  I'll 
create separate issues for PyLong_AsDouble and FloorDiv.
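
A hypothetical sketch of the kind of single-digit specialization meant here (not
the fastint4 code): C division truncates toward zero while Python's // floors, so
the quotient needs one adjustment when the signs differ.

#include <Python.h>

/* Floor division when both operands are exact single-digit ints. */
static PyObject *
single_digit_floor_div(PyObject *left, PyObject *right)
{
    if (PyLong_CheckExact(left) && PyLong_CheckExact(right) &&
        Py_ABS(Py_SIZE(left)) <= 1 && Py_ABS(Py_SIZE(right)) <= 1)
    {
        long a = PyLong_AsLong(left);
        long b = PyLong_AsLong(right);
        if (b != 0) {
            long q = a / b;                      /* truncates toward zero */
            if ((a % b != 0) && ((a < 0) != (b < 0)))
                q -= 1;                          /* adjust to floor */
            return PyLong_FromLong(q);
        }
        /* b == 0: fall through and let the generic path raise. */
    }
    return PyNumber_FloorDivide(left, right);
}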

== Micro-benchmarks ==

Floats + ints:  -m timeit -s "x=2" "x*2.2 + 2 + x*2.5 + 1.0 - x / 2.0 + 
(x+0.1)/(x-0.1)*2 + (x+10)*(x-30)"

2.7:  0.42 (usec)
3.5:  0.619
fastint_alt   0.619
fastintfloat: 0.52
fastint3: 0.289
fastint4: 0.51
fastint**:0.314

===

Ints:  -m timeit -s "x=2" "x + 10 + x * 20 - x // 3 + x* 10 + 20 -x"

2.7:  0.151 (usec)
3.5:  0.19
fastint_alt:  0.136
fastintfloat: 0.135
fastint3: 0.135
fastint4: 0.122
fastint**:0.122


P.S. I have another variant of fastint4 that uses fast_* functions in the ceval 
loop, instead of a big macro.  Its performance is slightly worse than with the 
macro.

--
Added file: http://bugs.python.org/file41811/fastint4.patch




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-04 Thread Yury Selivanov

Yury Selivanov added the comment:

> People should stop getting hung up about benchmarks numbers and instead 
> should first think about what they are trying to *achieve*. FP performance in 
> pure Python does not seem like an important goal in itself.

I'm not sure how to respond to that.  Every performance aspect *is* important.  
numpy isn't shipped with CPython, not everyone uses it.  In one of my programs 
I used colorsys extensively -- did I need to rewrite it using numpy?  Probably 
I could, but that was a simple shell script without any dependencies.

It also harms Python 3 adoption a little bit, since many benchmarks are still 
slower.  Some of them are FP related.

In any case, I think that if we can optimize something - we should.


> Also, some benchmarks may show variations which are randomly correlated with 
> a patch (e.g. before of different code placement by the compiler interfering 
> with instruction cache wayness). 

A 30-50% speed improvement is not a variation.  It's just that a lot less code 
gets executed if we inline some operations.


> It is important not to block a patch because some random benchmark on some 
> random machine shows an unexpected slowdown.

Nothing is blocked atm, we're just discussing various approaches.


> That said, both of Serhiy's patches are probably ok IMO.

Serhiy's current patches are incomplete.  In any case, I've been doing some 
research and will post another message shortly.

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-04 Thread Yury Selivanov

Yury Selivanov added the comment:

Antoine, FWIW I agree on most of your points :)  And yes, numpy, scipy, numba, 
etc rock.

Please take a look at my fastint4.patch.  All tests pass, no performance 
regressions, no crazy inlining of floating point exceptions etc.  And yet we 
have a nice improvement for both ints and floats.

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-04 Thread STINNER Victor

STINNER Victor added the comment:

> Why not combine my patch and Serhiy's?  First we check if left & right are 
> both longs.  Then we check if they are unicode (for +).  And then we have a 
> fastpath for floats.

See my comment on Serhiy's patch. Maybe we can start by checking that the types 
of both operands are the same, and then use PyLong_CheckExact and 
PyUnicode_CheckExact.

Using such design, we may add a _PyFloat_Add(). But the next question is then 
the overhead on the "slow" path, which requires a benchmark too! For example, 
use a subtype of int.

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-04 Thread Yury Selivanov

Yury Selivanov added the comment:

> I agree with Marc-Andre, people doing FP-heavy math in Python use Numpy 
> (possibly with Numba, Cython or any other additional library). 
> Micro-optimizing floating-point operations in the eval loop makes little 
> sense IMO.

I disagree.

30% faster floats (sic!) is a serious improvement that shouldn't just be 
discarded.  Many applications have floating point calculations one way or 
another, but don't use numpy because it's overkill.

Python 2 is much faster than Python 3 on any kind of numeric calculations.  
This point is being frequently brought up in every python2 vs 3 debate.  I 
think it's unacceptable.


> * the ceval loop may no longer fit in to the CPU cache on
   systems with small cache sizes, since the compiler will likely
   inline all the fast_*() functions (I guess it would be possible
   to simply eliminate all fast paths using a compile time
   flag)

That's speculation.  It may still fit.  Or it never really fitted; it's 
already huge.  I tested the patch on an 8 year old desktop CPU, with no 
performance degradation on our benchmarks.

### raytrace ###
Avg: 1.858527 -> 1.652754: 1.12x faster

### nbody ###
Avg: 0.310281 -> 0.285179: 1.09x faster

### float ###
Avg: 0.392169 -> 0.358989: 1.09x faster

### chaos ###
Avg: 0.355519 -> 0.326400: 1.09x faster

### spectral_norm ###
Avg: 0.377147 -> 0.303928: 1.24x faster

### telco ###
Avg: 0.012845 -> 0.013006: 1.01x slower

The last benchmark (telco) is especially interesting.  It uses decimals for 
calculation, that means that it uses overloaded numeric operators.  Still no 
significant performance degradation.

> * maintenance will get more difficult

The fast path for floats is easy to understand for every core dev who works with 
ceval; there is no rocket science there (if you want rocket science that is 
hard to maintain, look at generators/yield from).  If you don't like inlining 
floating point calculations, we can export float_add, float_sub, float_div, and 
float_mul and use them in ceval.

Why not combine my patch and Serhiy's?  First we check if left & right are both 
longs.  Then we check if they are unicode (for +).  And then we have a fastpath 
for floats.

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-04 Thread Antoine Pitrou

Antoine Pitrou added the comment:

On 04/02/2016 15:18, Yury Selivanov wrote:
> 
> But it is faster. That's visible on many benchmarks. Even simple
> timeit oneliners can show that. Probably it's because such
> benchmarks usually combine floats and ints, i.e. "2 * smth" instead of
> "2.0 * smth".

So it's not about FP-related calculations anymore. It's about ints
having become slower ;-)

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-04 Thread Yury Selivanov

Yury Selivanov added the comment:

>> But it is faster. That's visible on many benchmarks. Even simple
> timeit oneliners can show that. Probably it's because that such
> benchmarks usually combine floats and ints, i.e. "2 * smth" instead of
> "2.0 * smth".
> 
> So it's not about FP-related calculations anymore. It's about ints
> having become slower ;-)

I should have written 2 * smth_float vs 2.0 * smth_float

--
nosy: +Yury.Selivanov




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-04 Thread Yury Selivanov

Yury Selivanov added the comment:

Attaching another approach -- fastint5.patch.

Similar to what fastint4.patch does, but doesn't export any new APIs.  Instead, 
similarly to abstract.c, it uses type slots directly.

--
Added file: http://bugs.python.org/file41815/fastint5.patch




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-04 Thread Antoine Pitrou

Antoine Pitrou added the comment:

I agree with Marc-Andre, people doing FP-heavy math in Python use Numpy 
(possibly with Numba, Cython or any other additional library). Micro-optimizing 
floating-point operations in the eval loop makes little sense IMO.

The point of optimizing integers is that they are used for many purposes, not 
only "math" (e.g. indexing).

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-04 Thread STINNER Victor

STINNER Victor added the comment:

> I agree with Marc-Andre, people doing FP-heavy math in Python use Numpy 
> (possibly with Numba, Cython or any other additional library). 
> Micro-optimizing floating-point operations in the eval loop makes little 
> sense IMO.

Oh wait, I maybe misunderstood Marc-Andre comment. If the question is only on 
float: I'm ok to drop the fast-path for float. By the way, I would prefer to 
see PyLong_CheckExact() in the main loop, and only call fast_mul() if both 
operands are Python int.

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-04 Thread Stefan Krah

Stefan Krah added the comment:

I mean, if you run the benchmark 10 times and the unpatched result is always 
between 11.3 and 12.0 for floats while the patched result is
between 12.3 and 12.9, for me the situation is clear.

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-04 Thread Antoine Pitrou

Antoine Pitrou added the comment:

On 04/02/2016 14:54, Yury Selivanov wrote:
> 
> 30% faster floats (sic!) is a serious improvement, that shouldn't
> just be discarded. Many applications have floating point calculations one way
> or another, but don't use numpy because it's an overkill.

Can you give any example of such an application and how it would
actually benefit from "faster floats"? I'm curious why anyone who wants
fast FP calculations would use pure Python with CPython...

Discarding Numpy because it's "overkill" sounds misguided to me.
That's like discarding asyncio because it's "less overkill" to write
your own select() loop. It's often far more productive to use the
established, robust, optimized library rather than tweak your own
low-level code.

(and Numpy is easier to learn than asyncio ;-))

I'm not violently opposing the patch, but I think maintenance effort
devoted to such micro-optimizations is a bit wasted. And once you add
such a micro-optimization, trying to remove it often faces a barrage of
FUD about making Python slower, even if the micro-optimization is
practically worthless.

> Python 2 is much faster than Python 3 on any kind of numeric
> calculations.

Actually, it shouldn't really be faster on FP calculations, since the
float object hasn't changed (as opposed to int/long). So I'm skeptical
of FP-heavy code that would have been made slower by Python 3 (unless
there's also integer handling in that, e.g. indexing).

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-04 Thread Yury Selivanov

Yury Selivanov added the comment:

>But the next question is then the overhead on the "slow" path, which requires 
>a benchmark too! For example, use a subtype of int.

telco is such a benchmark (although it's very unstable).  It uses decimals 
extensively.  I've tested it many times on three different CPUs, and it doesn't 
seem to become any slower.


> Discarding Numpy because it's "overkill" sounds misguided to me.
That's like discarding asyncio because it's "less overkill" to write
your own select() loop. It's often far more productive to use the
established, robust, optimized library rather than tweak your own
low-level code.

Don't get me wrong, numpy is simply amazing!  But if you have a 100,000-line 
application that happens to have a few FP-related calculations here and 
there, you won't use numpy (unless you had experience with it before).

My opinion on this: numeric operations in Python (and any general purpose 
language) should be as fast as we can make them.


> Python 2 is much faster than Python 3 on any kind of numeric
> calculations.

> Actually, it shouldn't really be faster on FP calculations, since the
float object hasn't changed (as opposed to int/long). So I'm skeptical
of FP-heavy code that would have been made slower by Python 3 (unless
there's also integer handling in that, e.g. indexing).

But it is faster.  That's visible on many benchmarks.  Even simple timeit 
oneliners can show that.  Probably it's because such benchmarks usually 
combine floats and ints, i.e. "2 * smth" instead of "2.0 * smth".

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-04 Thread Yury Selivanov

Yury Selivanov added the comment:

> 
> Stefan Krah added the comment:
> 
> It's instructive to run ./python Modules/_decimal/tests/bench.py (Hit Ctrl-C 
> after the first cdecimal result, 5 repetitions or so).
> 
> fastint2.patch speeds up floats enormously and slows down decimal by 6%.
> fastint_alt.patch slows down float *and* decimal (5% or so).
> 
> Overall the status quo isn't that bad, but I concede that float benchmarks 
> like that are useful for PR.
> 

Thanks Stefan! I'll update my patch to include Serhiy's ideas. The fact that 
fastint_alt slows down floats *and* decimals is not acceptable.

I'm all for keeping cpython and ceval loop simple, but we should not pass on 
opportunities to improve some things in a significant way.

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-04 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

It is easy to extend fastint_alt.patch to support floats too. Here is a new patch.

> It's instructive to run ./python Modules/_decimal/tests/bench.py (Hit Ctrl-C 
> after the first cdecimal result, 5 repetitions or so).

Note that this benchmark is not very stable. I ran it a few times and the 
difference between runs was about 20%.

--
Added file: http://bugs.python.org/file41807/fastintfloat_alt.patch




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-04 Thread Stefan Krah

Stefan Krah added the comment:

I've never seen 20% fluctuation in that benchmark between runs. The benchmark 
is very stable if you take the average of 10 runs.

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-04 Thread Stefan Krah

Stefan Krah added the comment:

It's instructive to run ./python Modules/_decimal/tests/bench.py (Hit Ctrl-C 
after the first cdecimal result, 5 repetitions or so).

fastint2.patch speeds up floats enormously and slows down decimal by 6%.
fastint_alt.patch slows down float *and* decimal (5% or so).

Overall the status quo isn't that bad, but I concede that float benchmarks like 
that are useful for PR.

--
nosy: +skrah




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-04 Thread STINNER Victor

STINNER Victor added the comment:

"In a numerics heavy application it's like that all fast paths will trigger 
somewhere, but those will likely be better off using numpy or numba. For a text 
heavy application such as a web server, only few fast paths will trigger and so 
the various checks only add overhead."

Hum, I disagree. See benchmark results in other messages. Examples:

### django_v2 ###
Min: 2.682884 -> 2.633110: 1.02x faster

### unpickle_list ###
Min: 1.333952 -> 1.212805: 1.10x faster

These benchmarks are not written for numeric, but are more "general" 
benchmarks. int is just a core feature of Python, simply used everywhere, as 
the str type.

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-04 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

On 04.02.2016 09:01, STINNER Victor wrote:
> 
> "In a numerics heavy application it's like that all fast paths will trigger 
> somewhere, but those will likely be better off using numpy or numba. For a 
> text heavy application such as a web server, only few fast paths will trigger 
> and so the various checks only add overhead."
> 
> Hum, I disagree. See benchmark results in other messages. Examples:
> 
> ### django_v2 ###
> Min: 2.682884 -> 2.633110: 1.02x faster
> 
> ### unpickle_list ###
> Min: 1.333952 -> 1.212805: 1.10x faster
> 
> These benchmarks are not written for numeric, but are more "general" 
> benchmarks. int is just a core feature of Python, simply used everywhere, as 
> the str type.

Sure, some integer math is used in text applications as well,
e.g. for indexing, counting and slicing, but the patch puts more
emphasis on numeric operations, e.g. fast_add() tests for integers
and floats before falling back to the Unicode check.

It would be interesting to know how often these paths trigger
or not in the various benchmarks.

BTW: The django_v2 benchmark result does not really say
anything much. Values of +/- 2% do not have much meaning in
benchmark results :-)

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-04 Thread STINNER Victor

STINNER Victor added the comment:

+    if (Py_SIZE(left) != 0) {
+        if (Py_SIZE(right) != 0) {
+
+#ifdef HAVE_LONG_LONG
+            mul = PyLong_FromLongLong(
+                (long long)SINGLE_DIGIT_LONG_AS_LONG(left) *
+                SINGLE_DIGIT_LONG_AS_LONG(right));
+#else
+            mul = PyNumber_Multiply(left, right);
+#endif

Why don't you use the same code than long_mul() (you need #include 
"longintrepr.h")?

stwodigits v = (stwodigits)(MEDIUM_VALUE(a)) * MEDIUM_VALUE(b);
#ifdef HAVE_LONG_LONG
return PyLong_FromLongLong((PY_LONG_LONG)v);
#else
/* if we don't have long long then we're almost certainly
   using 15-bit digits, so v will fit in a long.  In the
   unlikely event that we're using 30-bit digits on a platform
   without long long, a large v will just cause us to fall
   through to the general multiplication code below. */
if (v >= LONG_MIN && v <= LONG_MAX)
    return PyLong_FromLong((long)v);
#endif


I guess that long_mul() was always well optimized; no need to experiment with 
something new.
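
For reference, the quoted fast path can be written as a self-contained helper
(medium_value() and try_small_mul() are illustrative names, not CPython
functions; stwodigits and ob_digit come from longintrepr.h):

    #include "Python.h"
    #include "longintrepr.h"

    /* Value of an int known to have at most one 15/30-bit digit. */
    static stwodigits
    medium_value(PyLongObject *x)
    {
        Py_ssize_t size = Py_SIZE(x);
        if (size == 0)
            return 0;
        return size < 0 ? -(stwodigits)x->ob_digit[0]
                        : (stwodigits)x->ob_digit[0];
    }

    /* Multiply two ints, returning NULL (with no error set) when either
       operand has more than one digit, so the caller can fall back to the
       generic PyNumber_Multiply() path. */
    static PyObject *
    try_small_mul(PyLongObject *a, PyLongObject *b)
    {
        stwodigits v;

        if (Py_ABS(Py_SIZE(a)) > 1 || Py_ABS(Py_SIZE(b)) > 1)
            return NULL;
        v = medium_value(a) * medium_value(b);   /* fits in stwodigits */
    #ifdef HAVE_LONG_LONG
        return PyLong_FromLongLong((PY_LONG_LONG)v);
    #else
        if (v >= LONG_MIN && v <= LONG_MAX)
            return PyLong_FromLong((long)v);
        return NULL;
    #endif
    }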

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-03 Thread Yury Selivanov

Yury Selivanov added the comment:

Attaching a second version of the patch.  (BTW, Serhiy, I tried your idea of 
using a switch statement to optimize branches 
(https://github.com/1st1/cpython/blob/fastint2/Python/ceval.c#L5390) -- no 
detectable speed improvement).


I decided to add fast path for floats & single-digit longs and their 
combinations.  +, -, *, /, //, and their inplace versions are optimized now.  


I'll have a full result of macro-benchmarks run tomorrow morning, but here's a 
result for spectral_norm (rigorous run, best of 3):

### spectral_norm ###
Min: 0.300269 -> 0.233037: 1.29x faster
Avg: 0.301700 -> 0.234282: 1.29x faster
Significant (t=399.89)
Stddev: 0.00147 -> 0.00083: 1.7619x smaller


Some nano-benchmarks (best of 3):

-m timeit -s "loops=tuple(range(100))" "sum([x + x + 1 for x in loops])"
2.7  7.23   3.5  8.17   3.6  7.57

-m timeit -s "loops=tuple(range(100))" "sum([x + x + 1.0 for x in loops])"
2.7  9.08   3.5  11.7   3.6  7.22

-m timeit -s "loops=tuple(range(100))" "sum([x/2.2 + 2 + x*2.5 + 1.0 for x in 
loops])"
2.7  17.9   3.5  24.3   3.6  11.8

--
Added file: http://bugs.python.org/file41799/fastint2.patch




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-03 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

On 04.02.2016 07:02, Yury Selivanov wrote:
> Attaching a second version of the patch.  (BTW, Serhiy, I tried your idea of 
> using a switch statement to optimize branches 
> (https://github.com/1st1/cpython/blob/fastint2/Python/ceval.c#L5390) -- no 
> detectable speed improvement).

It would be better to consistently have the fast_*() helpers
return -1 in case of an error, instead of -1 or 1.

Overall, I see two problems with doing too many of these
fast paths:

 * the ceval loop may no longer fit in to the CPU cache on
   systems with small cache sizes, since the compiler will likely
   inline all the fast_*() functions (I guess it would be possible
   to simply eliminate all fast paths using a compile time
   flag)

 * maintenance will get more difficult

In a numerics heavy application it's likely that all fast paths
will trigger somewhere, but those will likely be better off
using numpy or numba. For a text heavy application such as a web
server, only a few fast paths will trigger and so the various
checks only add overhead.

Since 'a'+'b' is a very often used instruction type in the
latter type of applications, please make sure that this fast
path gets more priority in your patch.

Please also check the effects of the fast paths for cases
where they don't trigger, e.g. 'a'+'b' or 'a'*2.
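
To illustrate the first point, here is one possible uniform convention for
the helpers, sketched only for the exact-str case ('a' + 'b'), which is
checked before anything else; this is illustrative and not the actual patch:

    /* Return -1 if an error is set, 1 if *result holds the answer, and 0
       if no fast path applied and the caller should use PyNumber_Add(). */
    static int
    fast_add(PyObject *left, PyObject *right, PyObject **result)
    {
        if (PyUnicode_CheckExact(left) && PyUnicode_CheckExact(right)) {
            *result = PyUnicode_Concat(left, right);
            return (*result == NULL) ? -1 : 1;
        }
        /* exact-int and exact-float branches would follow the same pattern */
        return 0;
    }

The eval loop would then treat -1 as an error exit, 1 as a produced result,
and 0 as "call PyNumber_Add()".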

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-03 Thread Antoine Pitrou

Antoine Pitrou added the comment:

On 03/02/2016 18:21, Yury Selivanov wrote:
> 
> Yury Selivanov added the comment:
> 
>> Yury suggested running perf.py twice with the binaries swapped
> 
> Yeah, I had some experience with perf.py when its results were skewed
> depending on what you test first.

Have you tried disabling turbo on your CPU? (or any kind of power
management that would change the CPU clock depending on the current
workload)

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-03 Thread Yury Selivanov

Yury Selivanov added the comment:

Attaching a new patch -- rewritten to optimize -, *, +, -=, *= and +=.  I also 
removed the optimization of [] operator -- that should be done in a separate 
patch and in a separate issue.

Some nano-benchmarks (best of 3):

python -m timeit  "sum([x + x + 1 for x in range(100)])"
2.7: 7.71 3.5: 8.54  3.6: 7.33

python -m timeit  "sum([x - x - 1 for x in range(100)])"
2.7: 7.81 3.5: 8.59  3.6: 7.57

python -m timeit  "sum([x * x * 1 for x in range(100)])"
2.7: 9.28 3.5: 10.6  3.6: 9.44


Python 3.6 vs 3.5 (spectral_norm, rigorous run):
Min: 0.315917 -> 0.276785: 1.14x faster
Avg: 0.321006 -> 0.284909: 1.13x faster


Zach, thanks a lot for the research!  I'm glad that unpack_sequence finally 
proved to be irrelevant.  Could you please take a look at the updated patch?

--
Added file: http://bugs.python.org/file41796/fastint1.patch




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-03 Thread Antoine Pitrou

Antoine Pitrou added the comment:

> In this table I've flipped the results for running the modified build
> as the reference, but in the new attachment, slower in the right
> column means faster, I think :)

I don't understand what this table means (why 4 columns?). Can you explain what 
you did?

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-03 Thread STINNER Victor

STINNER Victor added the comment:

> python -m timeit  "sum([x * x * 1 for x in range(100)])"

If you only want to benchmark x*y, x+y and list-comprehension, you
should use a tuple for the iterator.

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-03 Thread Yury Selivanov

Yury Selivanov added the comment:

> Yury suggested running perf.py twice with the binaries swapped

Yeah, I had some experience with perf.py when its results were skewed depending 
on what you test first.  Hopefully Victor's new patch will fix that 
http://bugs.python.org/issue26275

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-03 Thread Zach Byrne

Zach Byrne added the comment:

> Could you please take a look at the updated patch?
Looks ok to me, for whatever that's worth.

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-03 Thread Yury Selivanov

Yury Selivanov added the comment:

> Fast path is already implemented in long_mul(). Maybe we should just use 
> this function if both arguments are exact int, and apply the switch 
> optimization inside.

Agree.

BTW, what do you think about using __int128 when available?  That way we can 
also optimize twodigit PyLongs.

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-03 Thread STINNER Victor

STINNER Victor added the comment:

I don't think. I run benchmarks (for __int128) :-)

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-03 Thread Yury Selivanov

Yury Selivanov added the comment:

> I don't think. I run benchmarks (for __int128) :-)

Never mind...  Seems that __int128 is still an experimental feature and some 
versions of clang even had bugs with it.

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-03 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

> BTW, what do you think about using __int128 when available?  That way we can 
> also optimize twodigit PyLongs.

__int128 is not always available and it would add too much complexity for 
possibly less gain. There are many ways to optimize the code and we should 
choose those that have the best gain/complexity ratio.

Let's split the patch into smaller parts: 1) directly using long-specialized 
functions in ceval.c, and 2) optimizing the fast path in these functions, and 
test them separately and combined. Maybe only one way will add a gain.

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-03 Thread Zach Byrne

Zach Byrne added the comment:

> I don't understand what this table means (why 4 columns?). Can you explain 
> what you did?

Yury suggested running perf.py twice with the binaries swapped.  So "faster" 
and "slower" underneath "Baseline Reference" are runs where the unmodified 
python binary was the first argument to perf, and "Modified Reference" is 
where the patched binary is the first argument.

ie. "perf.py -r -b all python patched_python" vs "perf.py -r -b all 
patched_python python"

bench_results.txt has the actual output in it, and the "slower in the right 
column" comment was referring to the contents of that file, not the table. 
Sorry for the confusion.

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-03 Thread Yury Selivanov

Yury Selivanov added the comment:

Antoine, yeah, it's probably turbo boost related.  There is no easy way to turn 
it off on mac os x, though.  I hope Victor's patch to perf.py will help to 
mitigate this. 

Victor, Marc-Andre,

Updated results of nano-bench (best of 10):

-m timeit -s "loops=tuple(range(100))" "sum([x * x * 1 for x in loops])"
2.7  8.5 3.5  10.1 3.6  8.91

-m timeit -s "loops=tuple(range(100))" "sum([x + x + 1 for x in loops])"
2.7  7.27   3.5  8.2   3.6  7.13

-m timeit -s "loops=tuple(range(100))" "sum([x - x - 1 for x in loops])"
2.7  7.01   3.5  8.1   3.6  6.95

Antoine, Serhiy, I'll upload a new patch soon.  Probably Serhiy's idea of using 
a switch statement will make it slightly faster.  I'll also add a fast path for 
integer division.
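
For the integer division fast path, the extra twist is that C division
truncates toward zero while Python's // rounds toward negative infinity.  A
hypothetical sketch (assuming both operands are exact ints that fit in a C
long; on any other input it returns NULL with no error set, so the caller
falls back to the generic path, which also handles division by zero):

    #include "Python.h"
    #include <limits.h>

    static PyObject *
    fast_floor_div(PyObject *left, PyObject *right)
    {
        int oa, ob;
        long a, b, q;

        if (!PyLong_CheckExact(left) || !PyLong_CheckExact(right))
            return NULL;
        a = PyLong_AsLongAndOverflow(left, &oa);
        b = PyLong_AsLongAndOverflow(right, &ob);
        if (oa || ob || b == 0 || (a == LONG_MIN && b == -1))
            return NULL;            /* huge ints, division by zero, overflow */
        q = a / b;                  /* truncated quotient */
        if (a % b != 0 && ((a < 0) != (b < 0)))
            q--;                    /* adjust to floor for mixed signs */
        return PyLong_FromLong(q);
    }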

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-03 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Fast path is already implemented in long_mul(). Maybe we should just use this 
function if both arguments are exact int, and apply the switch optimization 
inside.

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-03 Thread Zach Byrne

Zach Byrne added the comment:

I ran 6 benchmarks on my work machine (not the same one as the last set) 
overnight: two with just the BINARY_ADD change, two with the BINARY_SUBSCR 
change, and two with both.
I'm attaching the output from all my benchmark runs, but here are the 
highlights. In this table I've flipped the results for the runs with the 
modified build as the reference, but in the new attachment, slower in the 
right column means faster, I think :)
|--------------|-------------------|-------------------|-------------------|-------------------|
| Build        | Baseline Reference                    | Modified Reference                    |
|--------------|-------------------|-------------------|-------------------|-------------------|
|              | Faster            | Slower            | Faster            | Slower            |
|--------------|-------------------|-------------------|-------------------|-------------------|
| BINARY_ADD   | chameleon_v2      | etree_parse       | chameleon_v2      | call_simple       |
|              | chaos             | nbody             | fannkuch          | nbody             |
|              | django            | normal_startup    | normal_startup    | pickle_dict       |
|              | etree_generate    | pickle_dict       | nqueens           | regex_v8          |
|              | fannkuch          | pickle_list       | regex_compile     |                   |
|              | formatted_logging | regex_effbot      | spectral_norm     |                   |
|              | go                |                   | unpickle_list     |                   |
|              | json_load         |                   |                   |                   |
|              | regex_compile     |                   |                   |                   |
|              | simple_logging    |                   |                   |                   |
|              | spectral_norm     |                   |                   |                   |
|--------------|-------------------|-------------------|-------------------|-------------------|
| BINARY_SUBSCR| chameleon_v2      | call_simple       | 2to3              | etree_parse       |
|              | chaos             | go                | call_method_slots | json_dump_v2      |
|              | etree_generate    | pickle_list       | chaos             | pickle_dict       |
|              | fannkuch          | telco             | fannkuch          |                   |
|              | fastpickle        |                   | formatted_logging |                   |
|              | hexiom2           |                   | go                |                   |
|              | json_load         |                   | hexiom2           |                   |
|              | mako_v2           |                   | mako_v2           |                   |
|              | meteor_contest    |                   | meteor_contest    |                   |
|              | nbody             |                   | nbody             |                   |
|              | regex_v8          |                   | normal_startup    |                   |
|              | spectral_norm     |                   | nqueens           |                   |
|              |                   |                   | pickle_list       |                   |
|              |                   |                   | simple_logging    |                   |
|              |                   |                   | spectral_norm     |                   |
|              |                   |                   | telco             |                   |
|--------------|-------------------|-------------------|-------------------|-------------------|
| BOTH         | chameleon_v2      | call_simple       | chameleon_v2      | fastpickle        |
|              | chaos             | etree_parse       | chaos             | pickle_dict       |
|              | etree_generate    | pathlib           | etree_generate    | pickle_list       |
|              | etree_process     | pickle_list       | etree_process     | telco             |
|              | fannkuch          |                   | fannkuch          |                   |
|              | fastunpickle      |                   | float             |                   |
|              | float             |                   | formatted_logging |                   |
|              | formatted_logging |                   | go                |                   |
|              | hexiom2           |                   | hexiom2           |                   |
|              | nbody             |                   | nbody             |                   |
|              | nqueens           |                   | normal_startup    |                   |
|              | regex_v8          |                   | nqueens           |                   |
|              | spectral_norm

Re: [issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-03 Thread M.-A. Lemburg
On 03.02.2016 18:05, STINNER Victor wrote:
> 
>> python -m timeit  "sum([x * x * 1 for x in range(100)])"
> 
> If you only want to benchmark x*y, x+y and list-comprehension, you
> should use a tuple for the iterator.

... and precalculate that in the setup:

python -m timeit -s "loops=tuple(range(100))" "sum([x * x * 1 for x in loops])"

# python -m timeit "sum([x * x * 1 for x in range(100)])"
10 loops, best of 3: 5.74 usec per loop
# python -m timeit -s "loops=tuple(range(100))" "sum([x * x * 1 for x in 
loops])"
10 loops, best of 3: 5.56 usec per loop

(python = Python 2.7)

-- 
Marc-Andre Lemburg
eGenix.com




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-02 Thread Antoine Pitrou

Antoine Pitrou added the comment:

Any change that increases the cache or branch predictor footprint of the 
evaluation loop may make the interpreter slower, even if the change doesn't 
seem related to a particular benchmark. That may be the reason here.

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-02 Thread Zach Byrne

Zach Byrne added the comment:

I took another look at this, and tried applying it to 3.6 and running the 
latest benchmarks. It applied cleanly, and the benchmark results were similar; 
this time unpack_sequence and spectral_norm were slower. Spectral norm makes 
sense, since it's doing lots of FP addition. The unpack_sequence instruction 
looks like it already has optimizations for unpacking lists and tuples onto the 
stack, and running dis on the test showed that it's completely dominated by 
calls to unpack_sequence, load_fast, and store_fast, so I still don't know 
what's going on there.

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-02 Thread Yury Selivanov

Yury Selivanov added the comment:

unpack_sequence contains 400 lines of this: "a, b, c, d, e, f, g, h, i, j = 
to_unpack".  This code doesn't even touch BINARY_SUBSCR or BINARY_ADD.

Zach, could you please run your benchmarks in rigorous mode (perf.py -r)?  I'd 
also suggest experimenting with putting the baseline cpython as the first arg 
and as the second -- maybe your machine runs the second interpreter slightly 
faster.

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-02-02 Thread Yury Selivanov

Yury Selivanov added the comment:

I'm assigning this patch to myself to commit it in 3.6 later.

--
assignee:  -> yselivanov
components: +Interpreter Core
stage:  -> patch review
versions: +Python 3.6 -Python 3.5




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-01-11 Thread Zach Byrne

Zach Byrne added the comment:

Anybody still looking at this? I can take another stab at it if it's still in 
scope.

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-01-11 Thread Zach Byrne

Zach Byrne added the comment:

> Can you figure why unpack_sequence and other benchmarks were slower?
I didn't look really closely.  A few of the slower ones were floating point 
heavy, which would incur the slow path penalty, but I can dig into 
unpack_sequence.

--




[issue21955] ceval.c: implement fast path for integers with a single digit

2016-01-11 Thread Yury Selivanov

Yury Selivanov added the comment:

> Anybody still looking at this? I can take another stab at it if it's still in 
> scope.

There were some visible speedups from your patch -- I think we should merge 
this optimization.  Can you figure why unpack_sequence and other benchmarks 
were slower?

--
nosy: +yselivanov



