[issue32534] Speed-up list.insert: use memmove()

2019-04-08 Thread Inada Naoki


Change by Inada Naoki :


--
resolution:  -> wont fix
stage: patch review -> resolved
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue32534] Speed-up list.insert: use memmove()

2018-05-18 Thread STINNER Victor

STINNER Victor  added the comment:

This issue is a micro-optimization which is only 1.08x faster:
https://bugs.python.org/issue32534#msg310146

Moreover, it seems really hard to measure precisely the benefit on benchmarks. 
Results seem to not be reliable.

I suggest to close the issue as WONTFIX, as I cannot see any reliable speedup 
nor any significant impact on performance.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue32534] Speed-up list.insert: use memmove()

2018-05-15 Thread Stéphane Wirtel

Stéphane Wirtel  added the comment:

Hi,

just a small reminder for this issue because I was reviewing the PR. what is 
the status?

Thanks

--
nosy: +matrixise

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue32534] Speed-up list.insert: use memmove()

2018-01-18 Thread Jesse Bakker

Change by Jesse Bakker :


--
nosy: +Jesse Bakker

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue32534] Speed-up list.insert: use memmove()

2018-01-17 Thread Jeethu Rao

Jeethu Rao  added the comment:

It's also interesting that in 
https://gist.github.com/pitrou/29eb7592fa1eae2be390f3bfa3db0a3a :

| django_template | 307 ms| 312 ms   | 1.02x slower 
| Not significant|

It seems to be slower and the benchmarks before it (2to3, chameleon, chaos, 
crypto_pyaes, deltablue) which we now know barely use list.insert are slightly 
faster :)

¯\_(ツ)_/¯

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue32534] Speed-up list.insert: use memmove()

2018-01-17 Thread STINNER Victor

STINNER Victor  added the comment:

In https://gist.github.com/jeethu/dc0811d415dd6d1a1621761e43842f88 I read:

| django_template | 160 ms | 129 ms | 1.24x faster | Significant (t=15.05) |

So on this benchmark, the optimization seems significant. But in the same 
paste, I see other benchmarks 1.2x faster whereas they don't abuse 
list.insert(), and maybe don't use list.insert() at all.

Maybe Django template engine should be fixed to use deque instead ;-)

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue32534] Speed-up list.insert: use memmove()

2018-01-17 Thread Jeethu Rao

Jeethu Rao  added the comment:

> What is 54640?

That's the pid of the process.

> I'm interested to know which benchmarks call list.insert() 40k times.

The django_template benchmark.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue32534] Speed-up list.insert: use memmove()

2018-01-17 Thread STINNER Victor

STINNER Victor  added the comment:

> Sure! https://gist.github.com/jeethu/000a2d3ecd9033c0ef51331f062ac294

I don't understand how to read this output.

"54640 ins1 called 40050 times"

What is 54640?

I'm interested to know which benchmarks call list.insert() 40k times.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue32534] Speed-up list.insert: use memmove()

2018-01-17 Thread Jeethu Rao

Jeethu Rao  added the comment:

> > I still think those numbers are misleading or downright bogus.  There is no 
> > existing proof that list.insert() is a critical path in those benchmarks.

> Can someone check if these bencmarks really use list.insert() in hot code? If 
> yes, why? :-) The cost of list.insert() is known, no?

Sure! https://gist.github.com/jeethu/000a2d3ecd9033c0ef51331f062ac294

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue32534] Speed-up list.insert: use memmove()

2018-01-17 Thread STINNER Victor

STINNER Victor  added the comment:

> I still think those numbers are misleading or downright bogus.  There is no 
> existing proof that list.insert() is a critical path in those benchmarks.

Can someone check if these bencmarks really use list.insert() in hot code? If 
yes, why? :-) The cost of list.insert() is known, no?

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue32534] Speed-up list.insert: use memmove()

2018-01-17 Thread Antoine Pitrou

Antoine Pitrou  added the comment:

Le 17/01/2018 à 14:36, Jeethu Rao a écrit :
> 
> On the other hand, on the pyperformance comparison I'd posted yesterday[1], 
> there seems to be an average improvement of 1.27x  on the first seven 
> benchmarks, and the slowest slowdown is only 1.03x.

I still think those numbers are misleading or downright bogus.  There is
no existing proof that list.insert() is a critical path in those benchmarks.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue32534] Speed-up list.insert: use memmove()

2018-01-17 Thread Jeethu Rao

Jeethu Rao  added the comment:

> FWIW, we've encountered a number of situations in the past when something 
> that improved the timings on one compiler would make timings worse on another 
> compiler.  There was also variance between timings on 32-bit builds versus 
> 64-bit builds.

I've verified that both clang and gcc generate similar assembly (memcpy is not 
inlined and the loop is not vectorized) 32-bit mode. I'd wager that the 
improvement with vectorization (in memmove) would be even more pronounced on 
32-bit systems, given that pointers are half the size and cache lines are still 
64 bytes wide.

> It's 1.08x faster (-7.8%). It's small for a microbenchmark, usually an 
> optimization should make a *microbenchmark* at least 10% faster.

That's true if we assume lists to have 100 or lesser elements. On the other 
hand, on the pyperformance comparison I'd posted yesterday[1], there seems to 
be an average improvement of 1.27x  on the first seven benchmarks, and the 
slowest slowdown is only 1.03x. Albeit, the improvement cannot be better than 
by a constant factor with the vectorized loop in memmove.

> Using memmove() for large copy is a good idea. The main question is the "if 
> (n <= INS1_MEMMOVE_THRESHOLD)" test. Is it slower if we always call memmove()?

The overhead of calling memmove makes it slower for small lists. That's how I 
arrived at this patch in the first place. I tried replacing the loop with a 
memmove() and it was slower on pyperformance and it was faster with switching 
to memmove() after a threshold.

> Previously, Python had a Py_MEMCPY() macro which also had such threshold. 
> Basically, it's a workaround for compiler performance issues:

That's very interesting! I think it boils down to the pointer aliasing problem. 
The pointers in memcpy()'s signature have the `restrict` qualifier, which gives 
the compiler more leeway to optimize calls to memcpy, while the compiler has to 
be more conservative with memmove(). I wonder if it's worth trying out a 
Py_MEMMOVE() :)


[1]: https://gist.github.com/jeethu/d6e4045f7932136548a806380eddd030

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue32534] Speed-up list.insert: use memmove()

2018-01-17 Thread STINNER Victor

STINNER Victor  added the comment:

https://gist.github.com/jeethu/19430d802aa08e28d1cb5eb20a47a470

Mean +- std dev: 10.5 us +- 1.4 us => Mean +- std dev: 9.68 us +- 0.89 us

It's 1.08x faster (-7.8%). It's small for a microbenchmark, usually an 
optimization should make a *microbenchmark* at least 10% faster.

For this optimization, I have no strong opinion.


Using memmove() for large copy is a good idea. The main question is the "if (n 
<= INS1_MEMMOVE_THRESHOLD)" test. Is it slower if we always call memmove()?


Previously, Python had a Py_MEMCPY() macro which also had such threshold. 
Basically, it's a workaround for compiler performance issues:

#if defined(_MSC_VER)
#define Py_MEMCPY(target, source, length) do {  \
size_t i_, n_ = (length);   \
char *t_ = (void*) (target);\
const char *s_ = (void*) (source);  \
if (n_ >= 16)   \
memcpy(t_, s_, n_); \
else\
for (i_ = 0; i_ < n_; i_++) \
t_[i_] = s_[i_];\
} while (0)
#else
#define Py_MEMCPY memcpy
#endif

Hopefully, the macro now just became:

/* Py_MEMCPY is kept for backwards compatibility,
 * see https://bugs.python.org/issue28126 */
#define Py_MEMCPY memcpy

And it's no more used.


I recall a performance issues with GCC memcmp() builtin function (replacing the 
libc function during the compilation): 
https://bugs.python.org/issue17628#msg186012

See also:
* https://bugs.python.org/issue13134
* https://bugs.python.org/issue29782

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue32534] Speed-up list.insert: use memmove()

2018-01-16 Thread Raymond Hettinger

Raymond Hettinger  added the comment:

FWIW, we've encountered a number of situations in the past when something that 
improved the timings on one compiler would make timings worse on another 
compiler.  There was also variance between timings on 32-bit builds versus 
64-bit builds.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue32534] Speed-up list.insert: use memmove()

2018-01-16 Thread Jeethu Rao

Jeethu Rao  added the comment:

> Be careful. Moving "l.insert" lookup of the loop might make the code slower. 
> I never looked why. But Python 3.7 was also optimized in many places to call 
> methods, so I'm not sure anymore :)

Thanks again! Here's a gist without the hack[1].

[1]: https://gist.github.com/jeethu/19430d802aa08e28d1cb5eb20a47a470

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue32534] Speed-up list.insert: use memmove()

2018-01-16 Thread STINNER Victor

STINNER Victor  added the comment:

> I’ve run the benchmark that you've suggested with a minor change (to avoid 
> the cost of LOAD_ATTR)

Be careful. Moving "l.insert" lookup of the loop might make the code slower. I 
never looked why. But Python 3.7 was also optimized in many places to call 
methods, so I'm not sure anymore :)

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue32534] Speed-up list.insert: use memmove()

2018-01-16 Thread Jeethu Rao

Jeethu Rao  added the comment:

Victor: I’m booting with the isolcpus and rcu_nocbs flags, and running 
pyperformance with the --affinity flag to pin the benchmark to the isolated CPU 
cores. I’ve also run `perf system tune`. And the OS is Ubuntu 17.10. Thanks for 
the tip on using perf timeit instead of timeit. I’ve run the benchmark that 
you've suggested with a minor change (to avoid the cost of LOAD_ATTR) and 
attached the output on a gist[1].

Antoine: Thanks for benchmarking it. After looking at the generated 
assembly[2], I found that ins1 is being inlined and the call to memmove was 
appearing before the loop (possibly because the compiler assumes that the call 
to memmove is more likely). I made a minor change and increased the threshold 
to 32. I’ve attached the generated assembly in a gist[3] (The relevant sequence 
is around line 8406, if you’re interested). And here’s the pyperformance 
comparison[4]. Could you please try benchmarking this version on your machine?

[1]: https://gist.github.com/jeethu/2d2de55afdb8ea4ad03b6a5d04d5227f
[2]: Generated with “gcc -DNDEBUG -fwrapv -O3 -std=c99  -I. -I./Include 
-DPy_BUILD_CORE -S -masm=intel Objects/listobject.c”
[3]: https://gist.github.com/jeethu/596bfc1251590bc51cc230046b52fb38
[4]: https://gist.github.com/jeethu/d6e4045f7932136548a806380eddd030

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue32534] Speed-up list.insert: use memmove()

2018-01-16 Thread Antoine Pitrou

Antoine Pitrou  added the comment:

Ok, I ran the benchmarks here (Ubuntu 16.04, Core i5-2500K, PGO and LTO 
disabled) and I don't get any consistent speedup, which is more in line with 
what I was expecting:
https://gist.github.com/pitrou/29eb7592fa1eae2be390f3bfa3db0a3a

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue32534] Speed-up list.insert: use memmove()

2018-01-16 Thread STINNER Victor

Change by STINNER Victor :


--
title: Speed-up list.insert -> Speed-up list.insert: use memmove()

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com