[issue28839] _PyFunction_FastCallDict(): replace PyTuple_New() with PyMem_Malloc()

2017-01-10 Thread STINNER Victor

STINNER Victor added the comment:

Ok, I give up on that one. I don't think that it's worth it.

--
resolution:  -> rejected
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28839] _PyFunction_FastCallDict(): replace PyTuple_New() with PyMem_Malloc()

2017-01-02 Thread STINNER Victor

STINNER Victor added the comment:

fastcalldict-4.patch: Rebased patch.

kw1: Median +- std dev: [ref] 290 ns +- 3 ns -> [patch] 253 ns +- 21 ns: 1.14x 
faster (-13%)
kw5: Median +- std dev: [ref] 438 ns +- 33 ns -> [patch] 394 ns +- 27 ns: 1.11x 
faster (-10%)
kw10: Median +- std dev: [ref] 663 ns +- 14 ns -> [patch] 617 ns +- 16 ns: 
1.07x faster (-7%)

This patch still always allocated 10 pointers (80 bytes) on the C stack: see 
issue #28870 which discuss options to use less memory.

--
Added file: http://bugs.python.org/file46118/fastcalldict-4.patch

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28839] _PyFunction_FastCallDict(): replace PyTuple_New() with PyMem_Malloc()

2017-01-02 Thread STINNER Victor

STINNER Victor added the comment:

I pushed the two obvious and safe optimization of fastcalldict-3.patch.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28839] _PyFunction_FastCallDict(): replace PyTuple_New() with PyMem_Malloc()

2017-01-02 Thread Roundup Robot

Roundup Robot added the comment:

New changeset 5f7cd3b6c9b1 by Victor Stinner in branch 'default':
Issue #28839: Optimize function_call()
https://hg.python.org/cpython/rev/5f7cd3b6c9b1

New changeset f9dd607dc04c by Victor Stinner in branch 'default':
Optimize _PyFunction_FastCallDict() when kwargs is {}
https://hg.python.org/cpython/rev/f9dd607dc04c

--
nosy: +python-dev

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28839] _PyFunction_FastCallDict(): replace PyTuple_New() with PyMem_Malloc()

2017-01-02 Thread STINNER Victor

STINNER Victor added the comment:

Quick update on the fastcall work.

> I pushed th echange b9c9691c72c5 to replace PyObject_CallFunctionObjArgs() 
> with _PyObject_CallNoArg() or _PyObject_CallArg1(). These functions are 
> simpler and don't allocate memory on the C stack.

Using _PyObject_CallArg1() increased the usage of the C stack: see issue 
#28858. The fix was to remove the _PyObject_CallArg1() macro, replaced with 
PyObject_CallFunctionObjArgs(func, arg, NULL).

There is still an open issue #28870 to reduce the usage of the C stack memory 
even more, but it's more a general optimization.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28839] _PyFunction_FastCallDict(): replace PyTuple_New() with PyMem_Malloc()

2016-12-01 Thread STINNER Victor

STINNER Victor added the comment:

I pushed th echange b9c9691c72c5 to replace PyObject_CallFunctionObjArgs() with 
_PyObject_CallNoArg() or _PyObject_CallArg1(). These functions are simpler and 
don't allocate
memory on the C stack.

I made similar to PyObject_CallFunctionObjArgs() in Python 3.6 to this issue: 
don't create a tuple but a small stack allocated on the C stack or allocate 
heap memory to pass a C array of PyObject* to _PyObject_FastCall().

Currently, PyObject_CallFunctionObjArgs() uses a small stack for up to 5 
positional arguments: it allocates 40 bytes on the C stack.

Serhiy Storchaka: "The problem with C stack overflow is not new, but your patch 
may make it worse (I don't know if it actually make it worse)."

I consider that 80 bytes is small enough for a C stack. As I wrote, we can 
reduce this "small stack" in the future if somone reports issues.


"Py_EnterRecursiveCall() is used for limiting Python stack. It can't prevent C 
stack overflow."

I know that the protection is not perfect. It's an heuristic.

I don't even know if it counts Python builtin functions, or only pure Python 
functions.


But I'm not sure that I understood your comment: do you suggest to use a tuple 
and reject this issue? To reduce the size of the small stack? Or to only 
allocate memory on the heap memory?

If the issue is the memory allocated on the C stack, maybe we can use a free 
list for "stacks" (C array of PyObject*), as done for tuples? I'm not sure that 
a free list for PyMem_Malloc()/PyMem_Free() is useful, since it uses our 
efficient pymalloc.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28839] _PyFunction_FastCallDict(): replace PyTuple_New() with PyMem_Malloc()

2016-12-01 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

The problem with C stack overflow is not new, but your patch may make it worse 
(I don't know if it actually make it worse). Py_EnterRecursiveCall() is used 
for limiting Python stack. It can't prevent C stack overflow.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28839] _PyFunction_FastCallDict(): replace PyTuple_New() with PyMem_Malloc()

2016-12-01 Thread STINNER Victor

STINNER Victor added the comment:

> I agree with Josh, PyTuple_New() can be faster than PyMem_Malloc() due to 
> tuple free list.

According to benchmarks, PyTuple_New() is slower than PyMem_Malloc(). It's not 
surprising for me, using a tuple object requires extra work:

* Track and then untrack the object from the garbage collector
* Destructor uses Py_TRASHCAN_SAFE_BEGIN/Py_TRASHCAN_SAFE_END macros
* Some additional indirectons

When I started working on "fastcall", I was surprised that not creating tuples 
has a *significant* (positive) effect on performance. It seems to be between 5% 
and 45% faster. Obviously, it depends on the speed of the function body. The 
speedup is higher for faster functions, like fast functions implemented in C.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28839] _PyFunction_FastCallDict(): replace PyTuple_New() with PyMem_Malloc()

2016-12-01 Thread STINNER Victor

STINNER Victor added the comment:

Serhiy: "small_stack increases C stack consumption even for calls without 
keyword arguments. This is serious problem since we can't control stack 
overflow."

This problem is not new and is worked around by Py_EnterRecursiveCall() macro 
which counts the depth of the Python stack.

I didn't notice any stack overflow crash with my patch. We can reduce the 
"small_stack" size later if needed.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28839] _PyFunction_FastCallDict(): replace PyTuple_New() with PyMem_Malloc()

2016-12-01 Thread STINNER Victor

STINNER Victor added the comment:

> Note: Using a simple printf() in the C code, I noticed that it is not 
> uncommon that _PyFunction_FastCallDict() is called with an empty dictionary 
> for keyword arguments.

Simplified Python example where _PyFunction_FastCallDict() is called with an 
empty dictionary:
---
def f2():
pass

def wrapper(func, *args, **kw):
# CALL_FUNCTION_EX: func(**{}) calls PyObject_Call() with kwargs={} which
# calls _PyFunction_FastCallDict()
func(*args, **kw)

def f():
# CALL_FUNCTION: calling wrapper calls fast_function() which calls
# _PyEval_EvalCodeWithName() which creates an empty dictionary for kw
wrapper(f2)

f()
---

But on this specific case, the speedup is *very* small: 3 nanoseconds :-)

./python -m perf timeit -s 'kw={}' -s 'def func(): pass' --duplicate=1000 
'func(**kw)'
(...)
Median +- std dev: [ref] 108 ns +- 4 ns -> [patch] 105 ns +- 5 ns: 1.02x faster 
(-2%)

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28839] _PyFunction_FastCallDict(): replace PyTuple_New() with PyMem_Malloc()

2016-12-01 Thread STINNER Victor

STINNER Victor added the comment:

(Oops, I attached the wrong benchmark script. It's now fixed.)

--
Added file: http://bugs.python.org/file45721/bench_fastcalldict.py

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28839] _PyFunction_FastCallDict(): replace PyTuple_New() with PyMem_Malloc()

2016-12-01 Thread STINNER Victor

Changes by STINNER Victor :


Removed file: http://bugs.python.org/file45720/bench_fastcalldict.py

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28839] _PyFunction_FastCallDict(): replace PyTuple_New() with PyMem_Malloc()

2016-12-01 Thread STINNER Victor

STINNER Victor added the comment:

bench_fastcalldict.py: hardcore microbenchmark on _PyFunction_FastCallDict(). 
Pass keyword arguments to the tp_init slot of a Python constructor. Result for 
1, 5 and 10 keyword arguments:

kw1: Median +- std dev: [ref] 329 ns +- 21 ns -> [patch] 306 ns +- 17 ns: 1.07x 
faster (-7%)
kw5: Median +- std dev: [ref] 508 ns +- 22 ns -> [patch] 481 ns +- 25 ns: 1.05x 
faster (-5%)
kw10: Median +- std dev: [ref] 829 ns +- 45 ns -> [patch] 805 ns +- 39 ns: 
1.03x faster (-3%)

As expected, the difference is small, but it's faster :-) Indirect benefit is 
that the garbage collector should be less stressed :-) (tuples are tracked by 
the GC.)

Note: Using a simple printf() in the C code, I noticed that it is not uncommon 
that _PyFunction_FastCallDict() is called with an empty dictionary for keyword 
arguments. Without the patch, an empty tuple was created. With my patch, 
"unpack" the empty dictionary costs nothing.

--
Added file: http://bugs.python.org/file45720/bench_fastcalldict.py

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28839] _PyFunction_FastCallDict(): replace PyTuple_New() with PyMem_Malloc()

2016-12-01 Thread STINNER Victor

STINNER Victor added the comment:

Patch version: fix the "if (0)" to use the small stack allocated on the C stack.

--
Added file: http://bugs.python.org/file45719/fastcalldict-3.patch

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28839] _PyFunction_FastCallDict(): replace PyTuple_New() with PyMem_Malloc()

2016-12-01 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

I agree with Josh, PyTuple_New() can be faster than PyMem_Malloc() due to tuple 
free list. small_stack increases C stack consumption even for calls without 
keyword arguments. This is serious problem since we can't control stack 
overflow.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28839] _PyFunction_FastCallDict(): replace PyTuple_New() with PyMem_Malloc()

2016-11-30 Thread Josh Rosenberg

Josh Rosenberg added the comment:

Minor correction: No allocation when small stack used, so you'd only see 
(possibly) regressions with 6+ keyword arguments (assuming the tuple free list 
applies for tuples that large). Admittedly a minor concern; keyword processing 
is already pretty slow, and large numbers of keywords being passed are likely a 
tiny fraction of call cases, so I'd guess microbenchmarks wouldn't show a big 
change, and broad benchmarks would be completely unaffected, but worth checking.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28839] _PyFunction_FastCallDict(): replace PyTuple_New() with PyMem_Malloc()

2016-11-30 Thread Josh Rosenberg

Josh Rosenberg added the comment:

Given you can't avoid the refcounting overhead, how much does this really help? 
Are there meaningful benefits in microbenchmarks? I'd worry that unconditional 
allocation from PyMem_Malloc might lose out relative to PyTuple_New, which is 
likely to not involve memory allocation at all (pulling from tuple free list).

--
nosy: +josh.r

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28839] _PyFunction_FastCallDict(): replace PyTuple_New() with PyMem_Malloc()

2016-11-30 Thread STINNER Victor

STINNER Victor added the comment:

fastcalldict.patch avoided INCREF/DECREF on keyword keys and values. This is 
wrong: we must hold strong references because the keyword dictionary can be 
technically modified: see issue #2016 and test_extcall.

Hum, I'm quite sure that it's not the first time that I was bitten by this bug. 
That's maybe why I didn't try to implement this optimization the first time.

fastcalldict-2.patch keeps INCREF/DECREF and so doesn't crash on test_extcall.

--
Added file: http://bugs.python.org/file45704/fastcalldict-2.patch

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28839] _PyFunction_FastCallDict(): replace PyTuple_New() with PyMem_Malloc()

2016-11-30 Thread STINNER Victor

New submission from STINNER Victor:

Attached patch is a minor optimization for _PyFunction_FastCallDict(): avoid 
the creation of a tuple to pass keyword arguments, use a simple C array 
allocated by PyMem_Malloc().

It also uses a small stack of 80 bytes (2*5*sizeof(PyObject*)) allocated on the 
C stack to pass up to 5 keyword arguments (5 key+value pairs): it avoids 
completely the allocation on the heap memory 

I wrote _PyFunction_FastCallDict() (issue #27809): the code was based on 
function_call() which also uses PyTuple_New(). When I wrote the function, I 
didn't notice that PyEval_EvalCodeEx() doesn't expect a Python tuple object, 
but a C array.

The patch also modifies function_call() to call _PyFunction_FastCallDict(), so 
it gets _PyFunction_FastCallDict() optimizations.

--
files: fastcalldict.patch
keywords: patch
messages: 282079
nosy: haypo, serhiy.storchaka
priority: normal
severity: normal
status: open
title: _PyFunction_FastCallDict(): replace PyTuple_New() with PyMem_Malloc()
type: performance
versions: Python 3.7
Added file: http://bugs.python.org/file45703/fastcalldict.patch

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com