Hi, I'm working on an experimental change to CPython introducing a new "fast call" calling convention for Python and C functions. It passes an array of PyObject* and the number of arguments as a C int (PyObject **stack, int nargs) instead of using a temporary tuple (PyObject *args). The expectation is that avoiding the creation of the temporary tuple makes Python faster. http://bugs.python.org/issue26814
First microbenchmarks on optimized code are promising: between 18% and 44% faster. http://bugs.python.org/issue26814#msg263999 http://bugs.python.org/issue26814#msg264003

But I was quickly blocked on "macrobenchmarks" (?): running the Python benchmark suite shows that many benchmarks are between 2% and 15% slower. I spent hours (days) investigating the issue using Cachegrind, Callgrind, Linux perf, strace, ltrace, etc., but I was unable to understand how my change can make CPython slower. My change is quite big: "34 files changed, 3301 insertions(+), 730 deletions(-)".

In fact, the performance regression can be reproduced easily with a few lines of C code: see the attached patches. You only have to add some *unused* (dead) code to see a "glitch" in performance. It's even worse: the performance change depends on the size of the unused code.

I did my best to isolate the microbenchmark to make it as reliable as possible. Results of bm_call_simple on my desktop PC:

(a) Reference:
    Average: 1201.0 ms +/- 0.2 ms (min: 1200.7 ms, max: 1201.2 ms)

(b) Add 2 unused functions, based on (a):
    Average: 1273.0 ms +/- 1.8 ms (min: 1270.1 ms, max: 1274.4 ms)

(c) Add 1 unused short function ("return NULL;"), based on (a):
    Average: 1169.6 ms +/- 0.2 ms (min: 1169.3 ms, max: 1169.8 ms)

(b) and (c) are two versions that only add unused code to (a). The difference between (b) and (c) is the size of the unused code. The problem is that (b) makes the code slower and (c) makes the code faster (!), whereas I would expect no performance change at all.

A sane person would ignore such a minor performance delta (+72 ms = +6% // -31.4 ms = -3%). Right. But for optimization patches on CPython, we use the CPython benchmark suite as proof that yes, the change really makes CPython faster, as announced.

I compiled the C code using GCC (5.3) and Clang (3.7) with various options: -O0, -O3, -fno-align-functions, -falign-functions=N (with N=1, 2, 6, 12), -fomit-frame-pointer, -flto, etc.
In short, the performance looks "random". I'm unable to correlate the performance with any Linux perf event. IMHO the performance depends on something low-level like the L1 cache, the CPU pipeline, branch prediction, etc. As I wrote, I'm unable to verify that.

To reproduce my issue, you can use the following commands:
---------------------------
hg clone https://hg.python.org/cpython fastcall
# or: "hg clone (...)/cpython fastcall"
# if you already have a local copy of cpython ;-)
cd fastcall
./configure -C

# build reference binary
hg up -C -r 496e094f4734
patch -p1 < prepare.patch
make && mv python python-ref

# build binary with deadcode 1
hg up -C -r 496e094f4734
patch -p1 < prepare.patch
patch -p1 < deadcode1.patch
make && mv python python-deadcode1

# build binary with deadcode 2
hg up -C -r 496e094f4734
patch -p1 < prepare.patch
patch -p1 < deadcode2.patch
make && mv python python-deadcode2

# run benchmark
PYTHONHASHSEED=0 ./python-ref bm_call_simple.py
PYTHONHASHSEED=0 ./python-deadcode1 bm_call_simple.py
PYTHONHASHSEED=0 ./python-deadcode2 bm_call_simple.py
---------------------------

I suggest you isolate at least one CPU and run the benchmark on the isolated CPUs to get reliable timings:
---------------------------
# run benchmark on the CPU #2
PYTHONHASHSEED=0 taskset -c 2 ./python-ref bm_call_simple.py
PYTHONHASHSEED=0 taskset -c 2 ./python-deadcode1 bm_call_simple.py
PYTHONHASHSEED=0 taskset -c 2 ./python-deadcode2 bm_call_simple.py
---------------------------

My notes on CPU isolation: http://haypo-notes.readthedocs.org/microbenchmark.html

If you don't want to try CPU isolation, try to get an idle system and/or run the benchmark many times until the standard deviation (the "+/- ..." part) looks small enough... Don't try to run the microbenchmark without PYTHONHASHSEED=0, or you will get random results depending on the secret hash key used by the randomized hash function.
(Or modify the code to spawn enough child processes to get a uniform distribution ;-))

I don't expect that you get the same numbers as me. For example, on my laptop, the delta is very small (+/- 1%):

$ PYTHONHASHSEED=0 taskset -c 2 ./python-ref bm_call_simple.py
Average: 1096.1 ms +/- 12.9 ms (min: 1079.5 ms, max: 1110.3 ms)

$ PYTHONHASHSEED=0 taskset -c 2 ./python-deadcode1 bm_call_simple.py
Average: 1109.2 ms +/- 11.1 ms (min: 1095.8 ms, max: 1122.9 ms)
=> +1% (+13 ms)

$ PYTHONHASHSEED=0 taskset -c 2 ./python-deadcode2 bm_call_simple.py
Average: 1072.0 ms +/- 1.5 ms (min: 1070.0 ms, max: 1073.9 ms)
=> -2% (-24 ms)

CPU of my desktop: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz - 4 physical cores with hyper-threading
CPU of my laptop: Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz - 2 physical cores with hyper-threading

I modified bm_call_simple.py to call foo() 100 times rather than 20 in the loop, to see the issue more easily. I also removed dependencies and changed the output format to display the average, standard deviation, minimum and maximum.

For more benchmarks, see the attached deadcode1.log and deadcode2.log: results of the CPython benchmark suite comparing deadcode1 vs reference and deadcode2 vs reference, run on my desktop PC (perf.py --fast & CPU isolation). Again, deadcode1 looks slower in most cases, whereas deadcode2 looks faster in most cases, even though the only difference is still dead code...

Victor, disappointed
changeset:   101088:96628516a16b
user:        Victor Stinner <victor.stin...@gmail.com>
date:        Mon Apr 25 21:34:47 2016 +0200
files:       Include/eval.h Include/funcobject.h Python/ceval.c
description:
ceval.c: expose _PyEval_EvalCodeWithName; add _PyFunction_FastCall()


diff -r 496e094f4734 -r 96628516a16b Include/eval.h
--- a/Include/eval.h	Thu Apr 21 00:23:08 2016 -0700
+++ b/Include/eval.h	Mon Apr 25 21:34:47 2016 +0200
@@ -17,6 +17,12 @@ PyAPI_FUNC(PyObject *) PyEval_EvalCodeEx
                                          PyObject **defs, int defc,
                                          PyObject *kwdefs, PyObject *closure);
 
+PyAPI_FUNC(PyObject *) _PyEval_EvalCodeWithName(PyObject *_co,
+    PyObject *globals, PyObject *locals,
+    PyObject **args, int argcount, PyObject **kws, int kwcount,
+    PyObject **defs, int defcount, PyObject *kwdefs, PyObject *closure,
+    PyObject *name, PyObject *qualname);
+
 #ifndef Py_LIMITED_API
 PyAPI_FUNC(PyObject *) _PyEval_CallTracing(PyObject *func, PyObject *args);
 #endif

diff -r 496e094f4734 -r 96628516a16b Include/funcobject.h
--- a/Include/funcobject.h	Thu Apr 21 00:23:08 2016 -0700
+++ b/Include/funcobject.h	Mon Apr 25 21:34:47 2016 +0200
@@ -58,6 +58,9 @@ PyAPI_FUNC(int) PyFunction_SetClosure(Py
 PyAPI_FUNC(PyObject *) PyFunction_GetAnnotations(PyObject *);
 PyAPI_FUNC(int) PyFunction_SetAnnotations(PyObject *, PyObject *);
 
+PyAPI_FUNC(PyObject *) _PyFunction_FastCall(PyObject *func,
+    PyObject **stack, int na, int nk);
+
 /* Macros for direct access to these values. Type checks are *not*
    done, so use with care. */
 #define PyFunction_GET_CODE(func) \

diff -r 496e094f4734 -r 96628516a16b Python/ceval.c
--- a/Python/ceval.c	Thu Apr 21 00:23:08 2016 -0700
+++ b/Python/ceval.c	Mon Apr 25 21:34:47 2016 +0200
@@ -771,18 +771,6 @@ static int
 _Py_TracingPossible = 0;
 
-PyObject *
-PyEval_EvalCode(PyObject *co, PyObject *globals, PyObject *locals)
-{
-    return PyEval_EvalCodeEx(co,
-                      globals, locals,
-                      (PyObject **)NULL, 0,
-                      (PyObject **)NULL, 0,
-                      (PyObject **)NULL, 0,
-                      NULL, NULL);
-}
-
-
 /* Interpreter main loop */
 
 PyObject *
@@ -3814,7 +3802,7 @@ too_many_positional(PyCodeObject *co, in
    the test in the if statements in Misc/gdbinit (pystack and pystackv). */
 
 static PyObject *
-_PyEval_EvalCodeWithName(PyObject *_co, PyObject *globals, PyObject *locals,
+eval_code(PyObject *_co, PyObject *globals, PyObject *locals,
            PyObject **args, int argcount, PyObject **kws, int kwcount,
            PyObject **defs, int defcount, PyObject *kwdefs, PyObject *closure,
            PyObject *name, PyObject *qualname)
@@ -4059,16 +4047,41 @@ fail: /* Jump here from prelude on failu
 }
 
 PyObject *
+_PyEval_EvalCodeWithName(PyObject *_co, PyObject *globals, PyObject *locals,
+    PyObject **args, int argcount, PyObject **kws, int kwcount,
+    PyObject **defs, int defcount, PyObject *kwdefs, PyObject *closure,
+    PyObject *name, PyObject *qualname)
+{
+    return eval_code(_co, globals, locals,
+                     args, argcount, kws, kwcount,
+                     defs, defcount, kwdefs, closure,
+                     name, qualname);
+}
+
+PyObject *
 PyEval_EvalCodeEx(PyObject *_co, PyObject *globals, PyObject *locals,
            PyObject **args, int argcount, PyObject **kws, int kwcount,
            PyObject **defs, int defcount, PyObject *kwdefs, PyObject *closure)
 {
-    return _PyEval_EvalCodeWithName(_co, globals, locals,
-                                    args, argcount, kws, kwcount,
-                                    defs, defcount, kwdefs, closure,
-                                    NULL, NULL);
+    return eval_code(_co, globals, locals,
+                     args, argcount, kws, kwcount,
+                     defs, defcount, kwdefs, closure,
+                     NULL, NULL);
 }
 
+PyObject *
+PyEval_EvalCode(PyObject *co, PyObject *globals, PyObject *locals)
+{
+    return eval_code(co, globals, locals,
+                     (PyObject **)NULL, 0,
+                     (PyObject **)NULL, 0,
+                     (PyObject **)NULL, 0,
+                     NULL, NULL,
+                     NULL, NULL);
+}
+
+
+
 static PyObject *
 special_lookup(PyObject *o, _Py_Identifier *id)
 {
@@ -4837,11 +4850,19 @@ fast_function(PyObject *func, PyObject *
         d = &PyTuple_GET_ITEM(argdefs, 0);
         nd = Py_SIZE(argdefs);
     }
-    return _PyEval_EvalCodeWithName((PyObject*)co, globals,
-                                    (PyObject *)NULL, (*pp_stack)-n, na,
-                                    (*pp_stack)-2*nk, nk, d, nd, kwdefs,
-                                    PyFunction_GET_CLOSURE(func),
-                                    name, qualname);
+    return eval_code((PyObject*)co, globals,
+                     (PyObject *)NULL, (*pp_stack)-n, na,
+                     (*pp_stack)-2*nk, nk, d, nd, kwdefs,
+                     PyFunction_GET_CLOSURE(func),
+                     name, qualname);
+}
+
+PyObject *
+_PyFunction_FastCall(PyObject *func, PyObject **arg_stack, int na, int nk)
+{
+    const int n = na + nk * 2;
+    PyObject **stack = arg_stack + n;
+    return fast_function(func, &stack, n, na, nk);
 }
 
 static PyObject *
diff -r 96628516a16b -r 9515d44d93ff Include/abstract.h
--- a/Include/abstract.h	Mon Apr 25 21:34:47 2016 +0200
+++ b/Include/abstract.h	Mon Apr 25 21:35:43 2016 +0200
@@ -266,6 +266,10 @@ xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
      PyAPI_FUNC(PyObject *) PyObject_Call(PyObject *callable_object,
                                           PyObject *args, PyObject *kw);
 
+     PyAPI_FUNC(PyObject *) _PyObject_FastCall(PyObject *callable_object,
+                                               PyObject **stack,
+                                               int na, int nk);
+
 #ifndef Py_LIMITED_API
 PyAPI_FUNC(PyObject *) _Py_CheckFunctionResult(PyObject *func,
                                                PyObject *result,

diff -r 96628516a16b -r 9515d44d93ff Objects/abstract.c
--- a/Objects/abstract.c	Mon Apr 25 21:34:47 2016 +0200
+++ b/Objects/abstract.c	Mon Apr 25 21:35:43 2016 +0200
@@ -2174,6 +2174,98 @@ PyObject_Call(PyObject *func, PyObject *
     return _Py_CheckFunctionResult(func, result, NULL);
 }
 
+static int
+_PyStack_AsArgs(PyObject **stack, Py_ssize_t na, Py_ssize_t nk,
+                PyObject *func,
+                PyObject **p_args, PyObject **p_kwargs)
+{
+    PyObject *args, *kwargs;
+    Py_ssize_t i;
+
+    args = PyTuple_New(na);
+    if (args == NULL) {
+        goto error;
+    }
+
+    for (i=0; i < na; i++) {
+        PyObject *item = stack[i];
+
+        Py_INCREF(item);
+        PyTuple_SET_ITEM(args, i, item);
+    }
+
+    if (nk > 0) {
+        kwargs = PyDict_New();
+        if (kwargs == NULL) {
+            Py_DECREF(args);
+            goto error;
+        }
+
+        for (i=0; i < nk; i++) {
+            PyObject *key = stack[na + i*2];
+            PyObject *value = stack[na + i*2 + 1];
+
+            /* FIXME: reuse update_keyword_args() of ceval.c */
+            if (PyDict_SetItem(kwargs, key, value) < 0) {
+                Py_DECREF(args);
+                Py_DECREF(kwargs);
+                goto error;
+            }
+        }
+    }
+    else {
+        kwargs = NULL;
+    }
+
+    *p_args = args;
+    *p_kwargs = kwargs;
+    return 0;
+
+error:
+    *p_args = NULL;
+    *p_kwargs = NULL;
+    return -1;
+}
+
+PyObject *
+_PyObject_FastCall(PyObject *func, PyObject **stack, int na, int nk)
+{
+    PyObject *args, *kwargs, *res;
+
+    /* _PyObject_FastCall() must not be called with an exception set,
+       because it may clear it (directly or indirectly) and so the
+       caller loses its exception */
+    assert(!PyErr_Occurred());
+
+    assert(func != NULL);
+    if (na < 0 || nk < 0 || stack == NULL) {
+        PyErr_BadInternalCall();
+        return NULL;
+    }
+
+    if (PyFunction_Check(func)) {
+        if (Py_EnterRecursiveCall(" while calling a Python object"))
+            return NULL;
+
+        res = _PyFunction_FastCall(func, stack, na, nk);
+
+        Py_LeaveRecursiveCall();
+
+        return res;
+    }
+
+    /* Slow-path: build a temporary tuple and maybe also a temporary dict */
+
+    if (_PyStack_AsArgs(stack, na, nk, func, &args, &kwargs) < 0) {
+        return NULL;
+    }
+
+    res = PyObject_Call(func, args, kwargs);
+    Py_DECREF(args);
+    Py_XDECREF(kwargs);
+    return res;
+}
+
 static PyObject*
 call_function_tail(PyObject *callable, PyObject *args)
 {
diff -r 96628516a16b -r eb5b04495993 Include/abstract.h
--- a/Include/abstract.h	Mon Apr 25 21:34:47 2016 +0200
+++ b/Include/abstract.h	Mon Apr 25 21:36:28 2016 +0200
@@ -266,6 +266,10 @@ xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
      PyAPI_FUNC(PyObject *) PyObject_Call(PyObject *callable_object,
                                           PyObject *args, PyObject *kw);
 
+     PyAPI_FUNC(PyObject *) _PyObject_FastCall(PyObject *callable_object,
+                                               PyObject **stack,
+                                               int na, int nk);
+
 #ifndef Py_LIMITED_API
 PyAPI_FUNC(PyObject *) _Py_CheckFunctionResult(PyObject *func,
                                                PyObject *result,

diff -r 96628516a16b -r eb5b04495993 Objects/abstract.c
--- a/Objects/abstract.c	Mon Apr 25 21:34:47 2016 +0200
+++ b/Objects/abstract.c	Mon Apr 25 21:36:28 2016 +0200
@@ -2174,6 +2174,12 @@ PyObject_Call(PyObject *func, PyObject *
     return _Py_CheckFunctionResult(func, result, NULL);
 }
 
+PyObject *
+_PyObject_FastCall(PyObject *func, PyObject **stack, int na, int nk)
+{
+    return NULL;
+}
+
 static PyObject*
 call_function_tail(PyObject *callable, PyObject *args)
 {
import time

LOOPS = 5
SHOW_RESULT = True

def foo(a, b, c, d):
    # 20 calls
    bar(a, b, c)
    bar(a, b, c)
    bar(a, b, c)
    bar(a, b, c)
    bar(a, b, c)
    bar(a, b, c)
    bar(a, b, c)
    bar(a, b, c)
    bar(a, b, c)
    bar(a, b, c)
    bar(a, b, c)
    bar(a, b, c)
    bar(a, b, c)
    bar(a, b, c)
    bar(a, b, c)
    bar(a, b, c)
    bar(a, b, c)
    bar(a, b, c)
    bar(a, b, c)
    bar(a, b, c)

def bar(a, b, c):
    # 20 calls
    baz(a, b)
    baz(a, b)
    baz(a, b)
    baz(a, b)
    baz(a, b)
    baz(a, b)
    baz(a, b)
    baz(a, b)
    baz(a, b)
    baz(a, b)
    baz(a, b)
    baz(a, b)
    baz(a, b)
    baz(a, b)
    baz(a, b)
    baz(a, b)
    baz(a, b)
    baz(a, b)
    baz(a, b)
    baz(a, b)

def baz(a, b):
    # 20 calls
    quux(a)
    quux(a)
    quux(a)
    quux(a)
    quux(a)
    quux(a)
    quux(a)
    quux(a)
    quux(a)
    quux(a)
    quux(a)
    quux(a)
    quux(a)
    quux(a)
    quux(a)
    quux(a)
    quux(a)
    quux(a)
    quux(a)
    quux(a)

def quux(a):
    # 20 calls
    qux()
    qux()
    qux()
    qux()
    qux()
    qux()
    qux()
    qux()
    qux()
    qux()
    qux()
    qux()
    qux()
    qux()
    qux()
    qux()
    qux()
    qux()
    qux()
    qux()

def qux():
    pass

def test_calls(iterations, timer):
    times = []
    for run in range(1, iterations+1):
        t0 = timer()
        # 20 calls
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        # 20 calls
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        # 20 calls
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        # 20 calls
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        # 20 calls
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        foo(1, 2, 3, 4)
        t1 = timer()
        dt = t1 - t0
        times.append(dt)
    return times

times = test_calls(LOOPS, time.perf_counter)
if SHOW_RESULT:
    import statistics
    min_time = min(times)
    max_time = max(times)
    avg = statistics.mean(times)
    dev = statistics.stdev(times)
    print("Average: %.1f ms +/- %.1f ms (min: %.1f ms, max: %.1f ms)"
          % (avg * 1e3, dev * 1e3, min_time * 1e3, max_time * 1e3))
_______________________________________________
Speed mailing list
Speed@python.org
https://mail.python.org/mailman/listinfo/speed