My first intuition is that some cache somewhere is unhappy with the
varying sizes. Have you tried any of this on another machine to see if
the results are consistent?
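If it is the cache, the varying dead-code sizes would shift where the
hot functions land relative to cache-line boundaries. A quick check is
to print the function addresses modulo 64. A throwaway sketch (made-up
function names; casting a function pointer to an integer is
implementation-defined, but it works on x86-64 Linux):
---------------------------
#include <stdio.h>
#include <stdint.h>

static int hot_path(int x)  { return x * 2; }
static int dead_code(int x) { return x + 1; }

int main(void)
{
    /* If adding or removing dead code changes these offsets between
       builds, the layout of the hot code changed as well. */
    printf("hot_path  %% 64 = %u\n", (unsigned)((uintptr_t)hot_path % 64));
    printf("dead_code %% 64 = %u\n", (unsigned)((uintptr_t)dead_code % 64));
    return 0;
}
---------------------------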
On Wed, 27 Apr 2016 at 08:06 Victor Stinner <victor.stin...@gmail.com> wrote:
> Hi,
>
> I'm working on an experimental change of CPython introducing a new
> "fast call" calling convention for Python and C functions. It passes
> an array of PyObject* and a number of arguments as a C int (PyObject
> **stack, int nargs) instead of using a temporary tuple (PyObject
> *args). The expectation is that avoiding the tuple creation makes
> Python faster.
> http://bugs.python.org/issue26814
>
> First microbenchmarks on optimized code are promising: between 18%
> and 44% faster.
> http://bugs.python.org/issue26814#msg263999
> http://bugs.python.org/issue26814#msg264003
>
> But I was quickly blocked on "macrobenchmarks" (?): running the
> Python benchmark suite says that many benchmarks are between 2% and
> 15% slower. I spent hours (days) investigating the issue using
> Cachegrind, Callgrind, Linux perf, strace, ltrace, etc., but I was
> unable to understand how my change can make CPython slower.
>
> My change is quite big: "34 files changed, 3301 insertions(+), 730
> deletions(-)". In fact, the performance regression can be reproduced
> easily with a few lines of C code: see the attached patches. You only
> have to add some *unused* (dead) code to see a "glitch" in
> performance. It's even worse: the performance change depends on the
> size of the unused code.
>
> I did my best to isolate the microbenchmark to make it as reliable
> as possible. Results of bm_call_simple on my desktop PC:
>
> (a) Reference:
> Average: 1201.0 ms +/- 0.2 ms (min: 1200.7 ms, max: 1201.2 ms)
>
> (b) Add 2 unused functions, based on (a):
> Average: 1273.0 ms +/- 1.8 ms (min: 1270.1 ms, max: 1274.4 ms)
>
> (c) Add 1 unused short function ("return NULL;"), based on (a):
> Average: 1169.6 ms +/- 0.2 ms (min: 1169.3 ms, max: 1169.8 ms)
>
> (b) and (c) are two versions that only add unused code to (a). The
> difference between (b) and (c) is the size of the unused code. The
> problem is that (b) makes the code slower and (c) makes the code
> faster (!), whereas I would not expect any performance change.
>
> A sane person would ignore such a minor performance delta (+72 ms =
> +6% // -31.4 ms = -3%). Right. But for optimization patches on
> CPython, we use the CPython benchmark suite as proof that yes, the
> change really makes CPython faster, as announced.
>
> I compiled the C code using GCC (5.3) and Clang (3.7) with various
> options: -O0, -O3, -fno-align-functions, -falign-functions=N (with
> N=1, 2, 6, 12), -fomit-frame-pointer, -flto, etc. In short, the
> performance looks "random". I'm unable to correlate the performance
> with any Linux perf event. IMHO the performance depends on something
> low level like the L1 cache, the CPU pipeline, branch prediction,
> etc. As I wrote, I'm unable to verify that.
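(Interjecting for readers who haven't followed issue 26814: the two
calling conventions differ roughly as below. This is a minimal sketch
with made-up function names, not Victor's actual patch, and error
handling is omitted.)
---------------------------
#include <Python.h>

/* Classic convention: the caller packs the arguments into a temporary
   tuple, which the callee then unpacks (METH_VARARGS style). */
static PyObject *
sum_args_tuple(PyObject *self, PyObject *args)
{
    long total = 0;
    Py_ssize_t n = PyTuple_GET_SIZE(args);
    for (Py_ssize_t i = 0; i < n; i++)
        total += PyLong_AsLong(PyTuple_GET_ITEM(args, i));
    return PyLong_FromLong(total);
}

/* "Fast call" style: the caller passes a C array plus a count, so no
   temporary tuple has to be allocated and destroyed per call. */
static PyObject *
sum_args_fastcall(PyObject **stack, int nargs)
{
    long total = 0;
    for (int i = 0; i < nargs; i++)
        total += PyLong_AsLong(stack[i]);
    return PyLong_FromLong(total);
}
---------------------------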
> To reproduce my issue, you can use the following commands:
> ---------------------------
> hg clone https://hg.python.org/cpython fastcall
> # or: "hg clone (...)/cpython fastcall"
> # if you already have a local copy of cpython ;-)
> cd fastcall
> ./configure -C
>
> # build reference binary
> hg up -C -r 496e094f4734
> patch -p1 < prepare.patch
> make && mv python python-ref
>
> # build binary with deadcode 1
> hg up -C -r 496e094f4734
> patch -p1 < prepare.patch
> patch -p1 < deadcode1.patch
> make && mv python python-deadcode1
>
> # build binary with deadcode 2
> hg up -C -r 496e094f4734
> patch -p1 < prepare.patch
> patch -p1 < deadcode2.patch
> make && mv python python-deadcode2
>
> # run benchmark
> PYTHONHASHSEED=0 ./python-ref bm_call_simple.py
> PYTHONHASHSEED=0 ./python-deadcode1 bm_call_simple.py
> PYTHONHASHSEED=0 ./python-deadcode2 bm_call_simple.py
> ---------------------------
>
> I suggest you isolate at least one CPU and run the benchmark on the
> isolated CPUs to get reliable timings:
> ---------------------------
> # run the benchmark on CPU #2
> PYTHONHASHSEED=0 taskset -c 2 ./python-ref bm_call_simple.py
> PYTHONHASHSEED=0 taskset -c 2 ./python-deadcode1 bm_call_simple.py
> PYTHONHASHSEED=0 taskset -c 2 ./python-deadcode2 bm_call_simple.py
> ---------------------------
> My notes on CPU isolation:
> http://haypo-notes.readthedocs.org/microbenchmark.html
>
> If you don't want to try CPU isolation, try to get an idle system
> and/or run the benchmark many times until the standard deviation
> (the "+/- ..." part) looks small enough.
>
> Don't try to run the microbenchmark without PYTHONHASHSEED=0, or you
> will get random results depending on the secret hash key used by the
> randomized hash function. (Or modify the code to spawn enough child
> processes to get a uniform distribution ;-))
>
> I don't expect that you get the same numbers as me. For example, on
> my laptop, the delta is very small (+/- 1%):
>
> $ PYTHONHASHSEED=0 taskset -c 2 ./python-ref bm_call_simple.py
> Average: 1096.1 ms +/- 12.9 ms (min: 1079.5 ms, max: 1110.3 ms)
>
> $ PYTHONHASHSEED=0 taskset -c 2 ./python-deadcode1 bm_call_simple.py
> Average: 1109.2 ms +/- 11.1 ms (min: 1095.8 ms, max: 1122.9 ms)
> => +1% (+13 ms)
>
> $ PYTHONHASHSEED=0 taskset -c 2 ./python-deadcode2 bm_call_simple.py
> Average: 1072.0 ms +/- 1.5 ms (min: 1070.0 ms, max: 1073.9 ms)
> => -2% (-24 ms)
>
> CPU of my desktop: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz - 4
> physical cores with hyper-threading
> CPU of my laptop: Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz - 2
> physical cores with hyper-threading
>
> I modified bm_call_simple.py to call foo() 100 times rather than 20
> in the loop to see the issue more easily. I also removed dependencies
> and changed the output format to display the average, standard
> deviation, minimum and maximum.
>
> For more benchmarks, see the attached deadcode1.log and
> deadcode2.log: results of the CPython benchmark suite comparing
> deadcode1 vs. reference and deadcode2 vs. reference, run on my
> desktop PC (perf.py --fast & CPU isolation). Again, deadcode1 looks
> slower in most cases, whereas deadcode2 looks faster in most cases,
> even though the difference is still only dead code...
>
> Victor, disappointed
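For reference, the perturbation being measured really is just dead
code. The attached deadcode1.patch and deadcode2.patch are not
reproduced in the mail, but from the description above, variant (c)
amounts to something of this shape (illustrative only; the function is
never called from anywhere):
---------------------------
static PyObject *
unused_short_function(void)
{
    return NULL;  /* only effect: a few extra bytes in the binary */
}
---------------------------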