[issue46020] Optimize long_pow for the common case

2022-01-12 Thread Tim Peters


[issue46020] Optimize long_pow for the common case

2022-01-11 Thread Tim Peters


Tim Peters  added the comment:

GH-30555 helps a bit by leaving the giant-exponent table of small odd powers as 
uninitialized stack trash unless it's actually used.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue46020] Optimize long_pow for the common case

2022-01-11 Thread Tim Peters


Change by Tim Peters :


--
keywords: +patch
pull_requests: +28756
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/30555




[issue46020] Optimize long_pow for the common case

2022-01-03 Thread Tim Peters


Tim Peters  added the comment:

I was surprised that

https://bugs.python.org/issue44376 

managed to get i**2 to within a factor of 2 of i*i's speed. The overheads of 
running long_pow() at all are high! Don't overlook that initialization of stack 
variables at the start, like

PyLongObject *z = NULL;  /* accumulated result */

isn't free - code has to be generated to force zeroes into those variables. The 
initialization of `table[]` alone requires code to fill 256 memory bytes with 
zeroes (down to 128 on current main branch). Nothing is free.

We can't sanely move the `table` initialization expense into the "giant k-ary 
window exponentiation" block either, because every bigint operation can fail 
("out of memory"), and the macros for doing the common ones (MULT and REDUCE) 
can do "goto Error;", and that common exit code has no way to know what is or 
isn't initialized. We can't let it see uninitialized stack trash.
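A rough Python model of the deferred-table idea (a fixed-window k-ary sketch, not 
CPython's actual sliding-window code with its table of small odd powers): the 
table is built only after we commit to the giant-exponent branch, so the common 
small-exponent path never pays for its initialization.

```python
# Hypothetical model of deferring the window table (assumes exp >= 0).
# The table of small powers is only built once we commit to the k-ary
# algorithm, so small exponents never pay for it.
def kary_pow(base, exp, k=5):
    if exp < (1 << k):
        # Common case: plain square-and-multiply, no table needed.
        result = 1
        while exp:
            if exp & 1:
                result *= base
            base *= base
            exp >>= 1
        return result
    # Giant exponent: now build table[i] == base**i for i < 2**k.
    table = [1]
    for _ in range((1 << k) - 1):
        table.append(table[-1] * base)
    # Consume the exponent k bits at a time, most significant group first.
    result = 1
    top = ((exp.bit_length() + k - 1) // k) * k
    for shift in range(top - k, -1, -k):
        for _ in range(k):
            result *= result
        digit = (exp >> shift) & ((1 << k) - 1)
        if digit:
            result *= table[digit]
    return result
```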

The exit code in turn has a string of things like

Py_DECREF(a);
Py_DECREF(b);
Py_XDECREF(c);

and those cost cycles too, including tests and branches.

So the real "outrage" to me is why x*x took 17.6 nsec for x == 10 in the 
original report. That's many times longer than the HW takes to do the actual 
multiply. Whether it's spelled x*x or x**2, we're overwhelmingly timing 
overheads. `pow()` has many because it's a kind of Swiss army knife doing all 
sorts of things; what's `x*x`'s excuse? ;-)
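For anyone who wants to reproduce the effect, a minimal harness (absolute 
numbers are machine- and version-dependent; only the relative ordering is 
meaningful):

```python
# Compare dispatch overhead for three spellings of squaring a small int.
# All three are dominated by interpreter overhead, not the multiply itself.
import timeit

SETUP = "x = 10"
N = 100_000
for expr in ("x * x", "x ** 2", "pow(x, 2)"):
    best = min(timeit.repeat(expr, setup=SETUP, number=N, repeat=5))
    print(f"{expr:9s} {best / N * 1e9:6.1f} ns per loop")
```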

--




[issue46020] Optimize long_pow for the common case

2022-01-02 Thread Mark Dickinson


Change by Mark Dickinson :


--
nosy: +tim.peters




[issue46020] Optimize long_pow for the common case

2021-12-09 Thread Ken Jin


Ken Jin  added the comment:

I'm not sure about the original 10:1 difference in 3.10, but in 3.11, the 2:1 
difference might be due to the PEP 659 machinery optimizing the int * int and 
float * float cases (see BINARY_OP_MULTIPLY_INT and BINARY_OP_MULTIPLY_FLOAT in 
ceval).

Last I recall, these were slightly faster than their unspecialized counterparts 
as they have less overhead in C function calls. Meanwhile, none of the power 
operations are optimized via PEP 659 machinery at the moment.
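On 3.11+ you can watch the specializing interpreter at work: after a function 
has run enough times, adaptive disassembly may show specialized instructions 
such as BINARY_OP_MULTIPLY_INT (a sketch only; the exact instruction names and 
warm-up thresholds are CPython implementation details that change between 
versions).

```python
import dis
import sys

def square(x):
    return x * x

# Warm the function so the adaptive interpreter can quicken its bytecode.
for _ in range(1000):
    square(10)

if sys.version_info >= (3, 11):
    # adaptive=True shows the quickened (specialized) instructions.
    dis.dis(square, adaptive=True)
else:
    dis.dis(square)
```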

--
nosy: +kj




[issue46020] Optimize long_pow for the common case

2021-12-09 Thread Raymond Hettinger


Raymond Hettinger  added the comment:

The situation for floats is also disappointing:

$ python3.11 -m timeit -s 'x=1.1' 'x ** 2'
500 loops, best of 5: 60.8 nsec per loop
$ python3.11 -m timeit -s 'x=1.1' 'x ** 2.0'
500 loops, best of 5: 51.5 nsec per loop
$ python3.11 -m timeit -s 'x=1.1' 'x * x'
2000 loops, best of 5: 17.7 nsec per loop

Complex numbers are more balanced.  Surprisingly, some of the complex cases are 
faster than their float counterparts:

$ python3.11 -m timeit -s 'x=1.1+2.2j' 'x ** 2'
500 loops, best of 5: 42.4 nsec per loop
$ python3.11 -m timeit -s 'x=1.1+2.2j' 'x ** 2.0'
500 loops, best of 5: 43.3 nsec per loop
$ python3.11 -m timeit -s 'x=1.1+2.2j' 'x ** 2.0j'
200 loops, best of 5: 107 nsec per loop
$ python3.11 -m timeit -s 'x=1.1+2.2j' 'x * x'
1000 loops, best of 5: 30.6 nsec per loop

Decimal still shows a large difference:

$ python3.11 -m timeit -s 'from decimal import Decimal' -s 'x=Decimal("1.1")' 
'x ** 2'
100 loops, best of 5: 206 nsec per loop
$ python3.11 -m timeit -s 'from decimal import Decimal' -s 'x=Decimal("1.1")' 
'x ** Decimal(2)'
100 loops, best of 5: 281 nsec per loop
$ python3.11 -m timeit -s 'from decimal import Decimal' -s 'x=Decimal("1.1")' 
'x * x'
500 loops, best of 5: 60.9 nsec per loop
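The comparisons above can be reproduced from one script (a hypothetical 
harness; absolute timings depend on machine and Python version):

```python
# Time x ** 2 vs. x * x across float, complex, and Decimal operands.
import timeit

CASES = [
    ("x = 1.1", "x ** 2"),
    ("x = 1.1", "x * x"),
    ("x = 1.1 + 2.2j", "x ** 2"),
    ("x = 1.1 + 2.2j", "x * x"),
    ("from decimal import Decimal; x = Decimal('1.1')", "x ** 2"),
    ("from decimal import Decimal; x = Decimal('1.1')", "x * x"),
]
N = 50_000
for setup, expr in CASES:
    best = min(timeit.repeat(expr, setup=setup, number=N, repeat=3))
    print(f"{setup:50s} {expr:7s} {best / N * 1e9:7.1f} ns per loop")
```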

--




[issue46020] Optimize long_pow for the common case

2021-12-09 Thread Raymond Hettinger


Raymond Hettinger  added the comment:

Hmm, I had just looked at that code and it wasn't at all obvious that an 
optimization had been added.  I expected something like:

   if (exp==2) return PyNumber_Multiply(x, x);

I wonder where the extra clock cycles are going.  Looking at the ceval.c 
dispatch and the bodies of PyNumber_Multiply(), _PyNumber_PowerNoMod(), and 
PyNumber_Power(), it looks like the power dispatch code should be about the 
same as or slightly cheaper than the multiply dispatch.

I'm surprised there is still almost a two-to-one speed difference.  ISTM there 
is still too much overhead for squaring.
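The shape of the expected fast path can be modeled at the Python level (a 
sketch only; the real long_pow fast path lives inside the C implementation 
rather than dispatching back through PyNumber_Multiply):

```python
# Python-level model of the proposed fast path: short-circuit the most
# common small exponents before the general algorithm runs (exp >= 0).
def long_pow_model(base, exp):
    if exp == 2:           # by far the most common case
        return base * base
    if exp == 3:
        return base * base * base
    # General square-and-multiply fallback for everything else.
    result = 1
    while exp:
        if exp & 1:
            result *= base
        base *= base
        exp >>= 1
    return result
```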

--




[issue46020] Optimize long_pow for the common case

2021-12-09 Thread Mark Dickinson


Change by Mark Dickinson :


--
nosy: +mark.dickinson




[issue46020] Optimize long_pow for the common case

2021-12-09 Thread Dennis Sweeney


Dennis Sweeney  added the comment:

I believe https://bugs.python.org/issue44376 added a special case for 2nd and 
3rd powers, and that's the 3.10/3.11 difference in the speed of x**2, not ceval 
optimizations.

--
nosy: +Dennis Sweeney




[issue46020] Optimize long_pow for the common case

2021-12-08 Thread Raymond Hettinger


New submission from Raymond Hettinger :

The expression 'x * x' is faster than 'x ** 2'.

In Python 3.10, the speed difference was enormous.  Due to ceval optimizations, 
the difference in Python 3.11 is tighter; however, there is still room for 
improvement.

The code for long_pow() doesn't currently have a fast path for squaring, which 
is by far the most important case.

$ python3.10 -m timeit -r 11 -s 'x = 10' 'x ** 2'
100 loops, best of 11: 201 nsec per loop
$ python3.10 -m timeit -r 11 -s 'x = 10' 'x * x'
1000 loops, best of 11: 21.9 nsec per loop

$ python3.11 -m timeit -r 11 -s 'x = 10' 'x ** 2'
1000 loops, best of 11: 32 nsec per loop
$ python3.11 -m timeit -r 11 -s 'x = 10' 'x * x'
2000 loops, best of 11: 17.6 nsec per loop

--
components: Interpreter Core
messages: 408076
nosy: rhettinger
priority: normal
severity: normal
status: open
title: Optimize long_pow for the common case
type: performance
