[issue26814] [WIP] Add a new _PyObject_FastCall() function which avoids the creation of a tuple or dict for arguments

2016-09-01 Thread STINNER Victor

STINNER Victor added the comment:

I split the giant patch into smaller patches that are easier to review. The
first part (_PyObject_FastCall, _PyObject_FastCallDict) is already merged.
Other issues were opened to implement the full feature. I am now closing this
issue.

--
resolution:  -> fixed
status: open -> closed




[issue26814] [WIP] Add a new _PyObject_FastCall() function which avoids the creation of a tuple or dict for arguments

2016-05-25 Thread STINNER Victor

STINNER Victor added the comment:

I fixed even more issues with my setup for running benchmarks. Results should
now be even more reliable. Moreover, I fixed multiple reference leaks in the
code which had introduced performance regressions. I started to write articles
explaining how to run stable benchmarks:

* https://haypo.github.io/journey-to-stable-benchmark-system.html
* https://haypo.github.io/journey-to-stable-benchmark-deadcode.html
* https://haypo.github.io/journey-to-stable-benchmark-average.html

Summary of benchmarks at the revision e6f3bf996c01:

Faster (25):
- pickle_list: 1.29x faster
- etree_generate: 1.22x faster
- pickle_dict: 1.19x faster
- etree_process: 1.16x faster
- mako_v2: 1.13x faster
- telco: 1.09x faster
- raytrace: 1.08x faster
- etree_iterparse: 1.08x faster
- regex_compile: 1.07x faster
- json_dump_v2: 1.07x faster
- etree_parse: 1.06x faster
- regex_v8: 1.05x faster
- call_method_unknown: 1.05x faster
- chameleon_v2: 1.05x faster
- fastunpickle: 1.04x faster
- django_v3: 1.04x faster
- chaos: 1.04x faster
- 2to3: 1.03x faster
- pathlib: 1.03x faster
- unpickle_list: 1.03x faster
- json_load: 1.03x faster
- fannkuch: 1.03x faster
- call_method: 1.02x faster
- unpack_sequence: 1.02x faster
- call_method_slots: 1.02x faster

Slower (4):
- regex_effbot: 1.08x slower
- nbody: 1.08x slower
- spectral_norm: 1.07x slower
- nqueens: 1.06x slower

Not significant (13):
- tornado_http
- startup_nosite
- simple_logging
- silent_logging
- richards
- pidigits
- normal_startup
- meteor_contest
- go
- formatted_logging
- float
- fastpickle
- call_simple

I'm now investigating why 4 benchmarks are slower.

Note: I'm still using my patched CPython benchmark suite to get more stable
benchmarks. I will send the patches upstream later.

--




[issue26814] [WIP] Add a new _PyObject_FastCall() function which avoids the creation of a tuple or dict for arguments

2016-05-20 Thread STINNER Victor

STINNER Victor added the comment:

> unpickle_list: 1.11x faster

This result was unfair: my fastcall branch contained the optimization of
issue #27056. I just pushed this optimization into the default branch.

I ran the benchmark again: the result is now "not significant", as expected.
Since the benchmark is a microbenchmark testing C functions of
Modules/_pickle.c, it doesn't really rely on the performance of (C or Python)
function calls.

--




[issue26814] [WIP] Add a new _PyObject_FastCall() function which avoids the creation of a tuple or dict for arguments

2016-05-19 Thread STINNER Victor

STINNER Victor added the comment:

> In short, I replayed exactly the same scenario. And... only raytrace remains
> slower, (...)

Oh, it looks like the reference binary runs the garbage collector less
frequently than the patched Python. In the patched Python, collections of
generation 2 are needed, whereas no generation 2 collection is needed with the
reference binary.

--




[issue26814] [WIP] Add a new _PyObject_FastCall() function which avoids the creation of a tuple or dict for arguments

2016-05-19 Thread STINNER Victor

STINNER Victor added the comment:

> Result of the benchmark suite:
>
> slower (3):
>
> * raytrace: 1.06x slower
> * etree_parse: 1.03x slower
> * normal_startup: 1.02x slower

Hum, I recompiled the patched Python, again with PGO+LTO, and ran the same
benchmark with the same command. In short, I replayed exactly the same
scenario. And... only raytrace remains slower; etree_parse and normal_startup
moved to the "not significant" list.

The difference in the benchmark results doesn't come from the benchmark itself.
For example, I ran the normal_startup benchmark again 3 times: I got the same
result 3 times.

### normal_startup ###
Avg: 0.295168 +/- 0.000991 -> 0.294926 +/- 0.00048: 1.00x faster
Not significant

### normal_startup ###
Avg: 0.294871 +/- 0.000606 -> 0.294883 +/- 0.00072: 1.00x slower
Not significant

### normal_startup ###
Avg: 0.295096 +/- 0.000706 -> 0.294967 +/- 0.00068: 1.00x faster
Not significant

IMHO the difference comes from the data collected by PGO.

--




[issue26814] [WIP] Add a new _PyObject_FastCall() function which avoids the creation of a tuple or dict for arguments

2016-05-19 Thread STINNER Victor

STINNER Victor added the comment:

Status of my FASTCALL implementation (34456cce64bb.patch):

* Add METH_FASTCALL calling convention to C functions, similar
  to METH_VARARGS|METH_KEYWORDS
* Clinic uses METH_FASTCALL when possible (it may use METH_FASTCALL
  for all cases in the future)
* Add new C functions:

  - _PyObject_FastCall(func, stack, nargs, kwds): root of the FASTCALL branch
  - PyObject_CallNoArg(func)
  - PyObject_CallArg1(func, arg)

* Add new type flags changing the calling conventions of tp_new, tp_init and
  tp_call:

  - Py_TPFLAGS_FASTNEW
  - Py_TPFLAGS_FASTINIT
  - Py_TPFLAGS_FASTCALL

* Backward incompatible change with the Py_TPFLAGS_FASTNEW and
  Py_TPFLAGS_FASTINIT flags: calling type->tp_new() and type->tp_init()
  explicitly is now a bug and is likely to crash, since the calling convention
  can now be FASTCALL.

* New _PyType_CallNew() and _PyType_CallInit() functions to call tp_new and
  tp_init of a type. Functions which called tp_new and tp_init directly were
  patched.

* New helper functions to parse function arguments:

  - PyArg_ParseStack()
  - PyArg_ParseStackAndKeywords()
  - PyArg_UnpackStack()

* New Py_Build functions:

  - Py_BuildStack()
  - Py_VaBuildStack()

* New _PyStack API to handle a stack:

  - _PyStack_Alloc(), _PyStack_Free(), _PyStack_Copy()
  - _PyStack_FromTuple()
  - _PyStack_FromBorrowedTuple()
  - _PyStack_AsTuple(), _PyStack_AsTupleSlice()
  - ...

* Many changes were done in the typeobject.c file to handle FASTCALL and the
  new type flags, handle the flags correctly when a new type is created, etc.

* ceval.c: add _PyFunction_FastCall() function (somehow, I only exposed
  existing code)

A large part of the patch changes existing code to use the new calling
convention in many functions of many modules. Some changes were generated
by Argument Clinic. IMHO the best approach would be to use Argument Clinic in
more places, rather than patching the code manually.
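
To make the new calling convention more concrete, here is a minimal sketch of
what a C function using METH_FASTCALL could look like: the arguments arrive as
a C array plus a count, so the caller does not have to build an argument tuple.
The exact signature and flag handling shown here are assumptions based on the
description above, not the final API of the patch.

    /* Sketch only: signature assumed from the description above. */
    static PyObject *
    example_add(PyObject *module, PyObject **args, Py_ssize_t nargs)
    {
        if (nargs != 2) {
            PyErr_SetString(PyExc_TypeError,
                            "example_add() expects exactly 2 arguments");
            return NULL;
        }
        /* No argument tuple was created; args points directly at the
           caller's values. */
        return PyNumber_Add(args[0], args[1]);
    }

    static PyMethodDef example_methods[] = {
        /* Registered with METH_FASTCALL instead of METH_VARARGS, so the
           interpreter can pass its evaluation stack slice directly. */
        {"example_add", (PyCFunction)example_add, METH_FASTCALL,
         "Add two objects."},
        {NULL, NULL, 0, NULL}
    };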

--




[issue26814] [WIP] Add a new _PyObject_FastCall() function which avoids the creation of a tuple or dict for arguments

2016-05-19 Thread STINNER Victor

STINNER Victor added the comment:

New patch: 34456cce64bb.patch

$ diffstat 34456cce64bb.patch 
 .hgignore |3 
 Makefile.pre.in   |   37 
 b/Doc/includes/shoddy.c   |2 
 b/Include/Python.h|1 
 b/Include/abstract.h  |   17 
 b/Include/descrobject.h   |   14 
 b/Include/funcobject.h|6 
 b/Include/methodobject.h  |6 
 b/Include/modsupport.h|   20 
 b/Include/object.h|   28 
 b/Lib/json/encoder.py |1 
 b/Lib/test/test_extcall.py|   19 
 b/Lib/test/test_sys.py|6 
 b/Modules/_collectionsmodule.c|   14 
 b/Modules/_csv.c  |   15 
 b/Modules/_ctypes/_ctypes.c   |   12 
 b/Modules/_ctypes/stgdict.c   |2 
 b/Modules/_datetimemodule.c   |   47 
 b/Modules/_elementtree.c  |   11 
 b/Modules/_functoolsmodule.c  |  113 +-
 b/Modules/_io/clinic/_iomodule.c.h|8 
 b/Modules/_io/clinic/bufferedio.c.h   |   42 
 b/Modules/_io/clinic/bytesio.c.h  |   42 
 b/Modules/_io/clinic/fileio.c.h   |   26 
 b/Modules/_io/clinic/iobase.c.h   |   26 
 b/Modules/_io/clinic/stringio.c.h |   34 
 b/Modules/_io/clinic/textio.c.h   |   40 
 b/Modules/_io/iobase.c|4 
 b/Modules/_json.c |   24 
 b/Modules/_lsprof.c   |4 
 b/Modules/_operator.c |   11 
 b/Modules/_pickle.c   |  106 -
 b/Modules/_posixsubprocess.c  |   15 
 b/Modules/_sre.c  |   11 
 b/Modules/_ssl.c  |9 
 b/Modules/_testbuffer.c   |4 
 b/Modules/_testcapimodule.c   |4 
 b/Modules/_threadmodule.c |   32 
 b/Modules/_tkinter.c  |   11 
 b/Modules/arraymodule.c   |   29 
 b/Modules/cjkcodecs/clinic/multibytecodec.c.h |   50 
 b/Modules/clinic/_bz2module.c.h   |8 
 b/Modules/clinic/_codecsmodule.c.h|  318 +++--
 b/Modules/clinic/_cryptmodule.c.h |   10 
 b/Modules/clinic/_datetimemodule.c.h  |8 
 b/Modules/clinic/_dbmmodule.c.h   |   26 
 b/Modules/clinic/_elementtree.c.h |   86 -
 b/Modules/clinic/_gdbmmodule.c.h  |   26 
 b/Modules/clinic/_lzmamodule.c.h  |   16 
 b/Modules/clinic/_opcode.c.h  |   10 
 b/Modules/clinic/_pickle.c.h  |   34 
 b/Modules/clinic/_sre.c.h |  124 +-
 b/Modules/clinic/_ssl.c.h |   74 -
 b/Modules/clinic/_tkinter.c.h |   50 
 b/Modules/clinic/_winapi.c.h  |  124 +-
 b/Modules/clinic/arraymodule.c.h  |   34 
 b/Modules/clinic/audioop.c.h  |  210 ++-
 b/Modules/clinic/binascii.c.h |   36 
 b/Modules/clinic/cmathmodule.c.h  |   24 
 b/Modules/clinic/fcntlmodule.c.h  |   34 
 b/Modules/clinic/grpmodule.c.h|   14 
 b/Modules/clinic/md5module.c.h|8 
 b/Modules/clinic/posixmodule.c.h  |  642 ++-
 b/Modules/clinic/pyexpat.c.h  |   32 
 b/Modules/clinic/sha1module.c.h   |8 
 b/Modules/clinic/sha256module.c.h |   14 
 b/Modules/clinic/sha512module.c.h |   14 
 b/Modules/clinic/signalmodule.c.h |   50 
 b/Modules/clinic/unicodedata.c.h  |   42 
 b/Modules/clinic/zlibmodule.c.h   |   68 -
 b/Modules/itertoolsmodule.c   |   20 
 b/Modules/main.c  |2 
 b/Modules/pyexpat.c   |3 
 b/Modules/signalmodule.c  |9 
 b/Modules/xxsubtype.c |4 
 b/Objects/abstract.c  |  403 ---
 b/Objects/bytesobject.c   |2 
 b/Objects/classobject.c   |   36 
 b/Objects/clinic/bytearrayobject.c.h  |   90 -
 b/Objects/clinic/bytesobject.c.h  |   66 -
 b/Objects/clinic/dictobject.c.h   |   10 
 b/Objects/clinic/unicodeobject.c.h|   10 
 b/Objects/descrobject.c   |  162 +-
 b/Objects/dictobject.c|   26 
 b/Objects/enumobject.c|8 
 b/Objects/exceptions.c|   91 +
 b/Objects/fileobject.c|   29 
 b/Objects/floatobject.c   |   25 
 

[issue26814] [WIP] Add a new _PyObject_FastCall() function which avoids the creation of a tuple or dict for arguments

2016-05-19 Thread STINNER Victor

Changes by STINNER Victor :


Removed file: http://bugs.python.org/file42898/34456cce64bb.diff




[issue26814] [WIP] Add a new _PyObject_FastCall() function which avoids the creation of a tuple or dict for arguments

2016-05-19 Thread STINNER Victor

Changes by STINNER Victor :


Added file: http://bugs.python.org/file42898/34456cce64bb.diff




[issue26814] [WIP] Add a new _PyObject_FastCall() function which avoids the creation of a tuple or dict for arguments

2016-05-19 Thread STINNER Victor

STINNER Victor added the comment:

Hi,

I made progress on my FASTCALL branch. I removed the tp_fastnew, tp_fastinit
and tp_fastcall fields from PyTypeObject and replaced them with new type flags
(ex: Py_TPFLAGS_FASTNEW) to avoid code duplication and reduce the memory
footprint. Before, each function was simply duplicated.

This change introduces a backward incompatibility: it is no longer possible to
call tp_new, tp_init and tp_call directly. I don't know yet if such a change
would be acceptable in Python 3.6, nor if it is worth it.
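
To illustrate the incompatibility, here is a hedged sketch: extension code that
calls the tp_new slot directly assumes the tuple/dict calling convention and
may crash once a type opts into the FASTCALL convention, so it has to go
through a helper such as _PyType_CallNew() (listed in the status summary above;
its exact signature is an assumption here).

    /* Sketch only: _PyType_CallNew()'s signature is assumed. */
    static PyObject *
    instantiate(PyTypeObject *type, PyObject *args, PyObject *kwds)
    {
        /* Before the patch (now considered a bug for FASTCALL types): */
        /* return type->tp_new(type, args, kwds); */

        /* With the patch: go through the helper, which knows about the
           new type flags and dispatches to the right convention. */
        return _PyType_CallNew(type, args, kwds);
    }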

I spent a lot of time on the CPython benchmark suite to check for performance
regressions. In fact, I spent most of my time trying to understand why most
benchmarks looked completely unstable. I have now tuned my system correctly
and patched perf.py to get reliable benchmarks.

On the latest run of the benchmark suite, most benchmarks are faster! I have to
investigate why 3 benchmarks are still slower. In a previous run, normal_startup
was not significant, etree_parse was faster (instead of slower), but raytrace
was already slower (though only 1.13x slower). It may be the "noise" of PGO
compilation. I already noticed that once: see issue #27056 "pickle: constant
propagation in _Unpickler_Read()".

Result of the benchmark suite:

slower (3):

* raytrace: 1.06x slower
* etree_parse: 1.03x slower
* normal_startup: 1.02x slower

faster (18):

* unpickle_list: 1.11x faster
* chameleon_v2: 1.09x faster
* etree_generate: 1.08x faster
* etree_process: 1.08x faster
* mako_v2: 1.06x faster
* call_method_unknown: 1.06x faster
* django_v3: 1.05x faster
* regex_compile: 1.05x faster
* etree_iterparse: 1.05x faster
* fastunpickle: 1.05x faster
* meteor_contest: 1.05x faster
* pickle_dict: 1.05x faster
* float: 1.04x faster
* pathlib: 1.04x faster
* silent_logging: 1.04x faster
* call_method: 1.03x faster
* json_dump_v2: 1.03x faster
* call_simple: 1.03x faster

not significant (21):

* 2to3
* call_method_slots
* chaos
* fannkuch
* fastpickle
* formatted_logging
* go
* json_load
* nbody
* nqueens
* pickle_list
* pidigits
* regex_effbot
* regex_v8
* richards
* simple_logging
* spectral_norm
* startup_nosite
* telco
* tornado_http
* unpack_sequence

I know that my patch is simply giant and cannot be merged as is.

Since the performance is still promising, I plan to split my giant patch into
smaller patches that are easier to review. I will try to check that the
individual patches don't make Python slower. This work will take time.

--




[issue26814] [WIP] Add a new _PyObject_FastCall() function which avoids the creation of a tuple or dict for arguments

2016-05-09 Thread Jakub Stasiak

Changes by Jakub Stasiak :


--
nosy: +jstasiak




[issue26814] [WIP] Add a new _PyObject_FastCall() function which avoids the creation of a tuple or dict for arguments

2016-04-29 Thread STINNER Victor

STINNER Victor added the comment:

> Results look like noise.

As I wrote, it's really hard to get reliable benchmark results. I did my best.

See also discussions about the CPython benchmark suite on the speed list:
https://mail.python.org/pipermail/speed/

I'm not sure that you will get less noise on other computers. IMHO many 
benchmarks are simply "broken" (not reliable).

--




[issue26814] [WIP] Add a new _PyObject_FastCall() function which avoids the creation of a tuple or dict for arguments

2016-04-29 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Results look like noise. Some tests become slower, others become faster. If
results on a different machine show the same sets of slowed-down and sped-up
tests, then it is likely not noise.

--




[issue26814] [WIP] Add a new _PyObject_FastCall() function which avoids the creation of a tuple or dict for arguments

2016-04-29 Thread STINNER Victor

STINNER Victor added the comment:

> Could you repeat the benchmarks on a different computer? Preferably with a
> different CPU or compiler.

Sorry, I don't really have the bandwidth to repeat the benchmarks. PGO+LTO
compilation is slow, and running the benchmark suite in rigorous mode is very
slow.

What do you expect from running the benchmarks on a different computer?

--




[issue26814] [WIP] Add a new _PyObject_FastCall() function which avoids the creation of a tuple or dict for arguments

2016-04-29 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Could you repeat the benchmarks on a different computer? Preferably with a
different CPU or compiler.

--




[issue26814] [WIP] Add a new _PyObject_FastCall() function which avoids the creation of a tuple or dict for arguments

2016-04-29 Thread STINNER Victor

STINNER Victor added the comment:

> Results of the CPython benchmark suite. Reference = default branch at rev 
> 496e094f4734, patched: fastcall fork at rev 2b4b7def2949.

Oh, I forgot to mention that I modified perf.py to run each benchmark using 10 
fresh processes to test multiple random seeds for the randomized hash function, 
instead of testing a fixed seed (PYTHONHASHSEED=1). This change should reduce 
the noise in the benchmark results.

I ran the benchmark suite using --rigorous.

I will open a new issue later for my perf.py change.

--




[issue26814] [WIP] Add a new _PyObject_FastCall() function which avoids the creation of a tuple or dict for arguments

2016-04-29 Thread STINNER Victor

STINNER Victor added the comment:

Results of the CPython benchmark suite. Reference = default branch at rev 
496e094f4734, patched: fastcall fork at rev 2b4b7def2949.

I ran into many issues getting reliable benchmark output:

* https://mail.python.org/pipermail/speed/2016-April/000329.html
* https://mail.python.org/pipermail/speed/2016-April/000341.html

The benchmark was run with CPU isolation. Both binaries were compiled with 
PGO+LTO.

Report on Linux smithers 4.4.4-301.fc23.x86_64 #1 SMP Fri Mar 4 17:42:42 UTC 
2016 x86_64 x86_64
Total CPU cores: 8

### call_method_slots ###
Min: 0.289704 -> 0.269634: 1.07x faster
Avg: 0.290149 -> 0.275953: 1.05x faster
Significant (t=162.17)
Stddev: 0.00019 -> 0.00150: 8.1176x larger

### call_method_unknown ###
Min: 0.275295 -> 0.302810: 1.10x slower
Avg: 0.280201 -> 0.309166: 1.10x slower
Significant (t=-200.65)
Stddev: 0.00161 -> 0.00191: 1.1909x larger

### call_simple ###
Min: 0.202163 -> 0.207939: 1.03x slower
Avg: 0.202332 -> 0.208662: 1.03x slower
Significant (t=-636.09)
Stddev: 0.8 -> 0.00015: 2.0130x larger

### chameleon_v2 ###
Min: 4.349474 -> 3.901936: 1.11x faster
Avg: 4.377664 -> 3.942932: 1.11x faster
Significant (t=62.39)
Stddev: 0.01403 -> 0.06826: 4.8635x larger

### django_v3 ###
Min: 0.484456 -> 0.462013: 1.05x faster
Avg: 0.489186 -> 0.465189: 1.05x faster
Significant (t=53.10)
Stddev: 0.00415 -> 0.00180: 2.3096x smaller

### etree_generate ###
Min: 0.193538 -> 0.182069: 1.06x faster
Avg: 0.196306 -> 0.184403: 1.06x faster
Significant (t=65.94)
Stddev: 0.00140 -> 0.00115: 1.2181x smaller

### etree_iterparse ###
Min: 0.189955 -> 0.177583: 1.07x faster
Avg: 0.195268 -> 0.183411: 1.06x faster
Significant (t=27.04)
Stddev: 0.00316 -> 0.00304: 1.0386x smaller

### etree_process ###
Min: 0.166556 -> 0.158617: 1.05x faster
Avg: 0.168822 -> 0.160672: 1.05x faster
Significant (t=43.33)
Stddev: 0.00125 -> 0.00140: 1.1205x larger

### fannkuch ###
Min: 0.859842 -> 0.878412: 1.02x slower
Avg: 0.865138 -> 0.889188: 1.03x slower
Significant (t=-14.97)
Stddev: 0.00718 -> 0.01436: 2.x larger

### float ###
Min: 0.222095 -> 0.214706: 1.03x faster
Avg: 0.226273 -> 0.218210: 1.04x faster
Significant (t=21.61)
Stddev: 0.00307 -> 0.00212: 1.4469x smaller

### hexiom2 ###
Min: 100.489630 -> 94.765364: 1.06x faster
Avg: 101.204871 -> 94.885605: 1.07x faster
Significant (t=77.45)
Stddev: 0.25310 -> 0.05016: 5.0454x smaller

### meteor_contest ###
Min: 0.181076 -> 0.176904: 1.02x faster
Avg: 0.181759 -> 0.177783: 1.02x faster
Significant (t=43.68)
Stddev: 0.00061 -> 0.00067: 1.1041x larger

### nbody ###
Min: 0.208752 -> 0.217011: 1.04x slower
Avg: 0.211552 -> 0.219621: 1.04x slower
Significant (t=-69.45)
Stddev: 0.00080 -> 0.00084: 1.0526x larger

### pathlib ###
Min: 0.077121 -> 0.070698: 1.09x faster
Avg: 0.078310 -> 0.071958: 1.09x faster
Significant (t=133.39)
Stddev: 0.00069 -> 0.00081: 1.1735x larger

### pickle_dict ###
Min: 0.530379 -> 0.514363: 1.03x faster
Avg: 0.531325 -> 0.515902: 1.03x faster
Significant (t=154.33)
Stddev: 0.00086 -> 0.00050: 1.7213x smaller

### pickle_list ###
Min: 0.253445 -> 0.263959: 1.04x slower
Avg: 0.255362 -> 0.267402: 1.05x slower
Significant (t=-95.47)
Stddev: 0.00075 -> 0.00101: 1.3447x larger

### raytrace ###
Min: 1.071042 -> 1.030849: 1.04x faster
Avg: 1.076629 -> 1.109029: 1.03x slower
Significant (t=-3.93)
Stddev: 0.00199 -> 0.08246: 41.4609x larger

### regex_compile ###
Min: 0.286053 -> 0.273454: 1.05x faster
Avg: 0.287171 -> 0.274422: 1.05x faster
Significant (t=153.16)
Stddev: 0.00067 -> 0.00050: 1.3452x smaller

### regex_effbot ###
Min: 0.044186 -> 0.048192: 1.09x slower
Avg: 0.044336 -> 0.048513: 1.09x slower
Significant (t=-172.41)
Stddev: 0.00020 -> 0.00014: 1.4671x smaller

### richards ###
Min: 0.137456 -> 0.135029: 1.02x faster
Avg: 0.138993 -> 0.136028: 1.02x faster
Significant (t=20.35)
Stddev: 0.00116 -> 0.00088: 1.3247x smaller

### silent_logging ###
Min: 0.060288 -> 0.056344: 1.07x faster
Avg: 0.060380 -> 0.056518: 1.07x faster
Significant (t=310.27)
Stddev: 0.00011 -> 0.5: 2.1029x smaller

### telco ###
Min: 0.010735 -> 0.010441: 1.03x faster
Avg: 0.010849 -> 0.010557: 1.03x faster
Significant (t=34.04)
Stddev: 0.7 -> 0.5: 1.3325x smaller

### unpickle_list ###
Min: 0.290750 -> 0.297958: 1.02x slower
Avg: 0.292741 -> 0.299419: 1.02x slower
Significant (t=-41.62)
Stddev: 0.00133 -> 0.00090: 1.4852x smaller

The following not significant results are hidden, use -v to show them:
2to3, call_method, chaos, etree_parse, fastpickle, fastunpickle, 
formatted_logging, go, json_dump_v2, json_load, mako_v2, normal_startup, 
nqueens, pidigits, regex_v8, simple_logging, spectral_norm, startup_nosite, 
tornado_http, unpack_sequence.

--


[issue26814] [WIP] Add a new _PyObject_FastCall() function which avoids the creation of a tuple or dict for arguments

2016-04-24 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

I think you can simplify the patch by dropping keyword argument support from
fastcall. Then you can decrease _PyStack_SIZE to 4 (a larger size would serve
only 1.7% of calls), and maybe refactor the code, since an array of 4 pointers
consumes less C stack than an array of 10 pointers.

--




[issue26814] [WIP] Add a new _PyObject_FastCall() function which avoids the creation of a tuple or dict for arguments

2016-04-24 Thread STINNER Victor

STINNER Victor added the comment:

> Thus I think we need to optimize only calls with a small number (0-3) of
> positional arguments.

My code is optimized for up to 10 positional arguments: with 0..10 arguments,
the C stack is used to hold the array of PyObject*. For more arguments, the
array is allocated on the heap.

+   /* 10 positional parameters or 5 (key, value) pairs for keyword parameters.
+  40 bytes on 32-bit or 80 bytes on 64-bit. */
+#  define _PyStack_SIZE 10
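
Here is a minimal sketch of that strategy: up to _PyStack_SIZE arguments are
held in an array on the C stack, while larger calls fall back to a heap
allocation. Only _PyStack_SIZE and the _PyObject_FastCall() signature come from
the patch; the helper itself is illustrative.

    /* Sketch only: call a function with the arguments of a tuple, using a
       small C-stack array when possible (assumes <Python.h>). */
    static PyObject *
    fast_call_tuple(PyObject *func, PyObject *argtuple)
    {
        Py_ssize_t nargs = PyTuple_GET_SIZE(argtuple);
        PyObject *small_stack[_PyStack_SIZE];
        PyObject **stack;
        PyObject *result;
        Py_ssize_t i;

        if (nargs <= _PyStack_SIZE) {
            stack = small_stack;           /* lives on the C stack */
        }
        else {
            stack = PyMem_Malloc(nargs * sizeof(PyObject *));
            if (stack == NULL) {
                return PyErr_NoMemory();   /* heap fallback failed */
            }
        }
        for (i = 0; i < nargs; i++) {
            stack[i] = PyTuple_GET_ITEM(argtuple, i);  /* borrowed refs */
        }

        result = _PyObject_FastCall(func, stack, nargs, 0);  /* no kwargs */

        if (stack != small_stack) {
            PyMem_Free(stack);
        }
        return result;
    }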

For keyword parameters, I don't know yet what the best (fastest) API is. Right
now, I'm using the same PyObject** array for positional and keyword arguments
with an "int nk" count, but maybe a dictionary is faster for combining keyword
arguments and for parsing keyword arguments.

--




[issue26814] [WIP] Add a new _PyObject_FastCall() function which avoids the creation of a tuple or dict for arguments

2016-04-24 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

I have collected statistics about the use of CALL_FUNCTION* opcodes in compiled
code while running the CPython test suite. According to them, 99.4% of the
emitted opcodes are the CALL_FUNCTION opcode, 89% of the emitted CALL_FUNCTION
opcodes have only positional arguments, and 98% of those have no more than 3
arguments.

That was about calls from Python code. All the convenience C API functions
(like PyObject_CallFunction and PyObject_CallFunctionObjArgs) used for direct
calls in C code take only positional arguments.

Thus I think we need to optimize only calls with a small number (0-3) of
positional arguments.

--




[issue26814] [WIP] Add a new _PyObject_FastCall() function which avoids the creation of a tuple or dict for arguments

2016-04-22 Thread STINNER Victor

STINNER Victor added the comment:

Results of the CPython benchmark suite at revision 6c376e866330 of
https://hg.python.org/sandbox/fastcall, compared to CPython 3.6 at revision
496e094f4734.

It's surprising that call_simple is 1.08x slower with fastcall. This slowdown
is not acceptable and should be fixed. It probably explains why many other
benchmarks are slower.

Fortunately, some benchmarks are faster, between 1.02x and 1.09x faster.

IMHO there are still performance issues in my current implementation that can
and must be fixed. At least we have a starting point for comparing performance.


$ python3 -u perf.py ../default/python ../fastcall/python -b all
(...)
Report on Linux smithers 4.4.4-301.fc23.x86_64 #1 SMP Fri Mar 4 17:42:42 UTC 
2016 x86_64 x86_64
Total CPU cores: 8

[ slower ]

### 2to3 ###
6.859604 -> 6.985351: 1.02x slower

### call_method_slots ###
Min: 0.308846 -> 0.317780: 1.03x slower
Avg: 0.308902 -> 0.318667: 1.03x slower
Significant (t=-464.83)
Stddev: 0.3 -> 0.00026: 9.8974x larger

### call_simple ###
Min: 0.232594 -> 0.251789: 1.08x slower
Avg: 0.232816 -> 0.252443: 1.08x slower
Significant (t=-911.97)
Stddev: 0.00024 -> 0.00011: 2.2373x smaller

### chaos ###
Min: 0.273084 -> 0.284790: 1.04x slower
Avg: 0.273951 -> 0.293177: 1.07x slower
Significant (t=-7.57)
Stddev: 0.00036 -> 0.01796: 49.9421x larger

### django_v3 ###
Min: 0.549604 -> 0.569982: 1.04x slower
Avg: 0.550557 -> 0.571038: 1.04x slower
Significant (t=-204.09)
Stddev: 0.00046 -> 0.00054: 1.1747x larger

### float ###
Min: 0.261939 -> 0.269224: 1.03x slower
Avg: 0.268475 -> 0.276515: 1.03x slower
Significant (t=-12.22)
Stddev: 0.00301 -> 0.00354: 1.1757x larger

### formatted_logging ###
Min: 0.325786 -> 0.334440: 1.03x slower
Avg: 0.326827 -> 0.335968: 1.03x slower
Significant (t=-34.44)
Stddev: 0.00129 -> 0.00136: 1.0503x larger

### mako_v2 ###
Min: 0.039642 -> 0.044765: 1.13x slower
Avg: 0.040251 -> 0.045562: 1.13x slower
Significant (t=-323.73)
Stddev: 0.00028 -> 0.00024: 1.1558x smaller

### meteor_contest ###
Min: 0.196589 -> 0.203667: 1.04x slower
Avg: 0.197497 -> 0.204782: 1.04x slower
Significant (t=-76.06)
Stddev: 0.00050 -> 0.00045: 1.x smaller

### nqueens ###
Min: 0.274664 -> 0.285866: 1.04x slower
Avg: 0.275285 -> 0.286774: 1.04x slower
Significant (t=-68.34)
Stddev: 0.00091 -> 0.00076: 1.2036x smaller

### pickle_list ###
Min: 0.262687 -> 0.269629: 1.03x slower
Avg: 0.263804 -> 0.270789: 1.03x slower
Significant (t=-50.14)
Stddev: 0.00070 -> 0.00070: 1.0004x larger

### raytrace ###
Min: 1.272960 -> 1.284516: 1.01x slower
Avg: 1.276398 -> 1.368574: 1.07x slower
Significant (t=-3.41)
Stddev: 0.00157 -> 0.19115: 122.0022x larger

### regex_compile ###
Min: 0.335753 -> 0.343820: 1.02x slower
Avg: 0.336273 -> 0.344894: 1.03x slower
Significant (t=-127.84)
Stddev: 0.00026 -> 0.00040: 1.5701x larger

### regex_effbot ###
Min: 0.048656 -> 0.050810: 1.04x slower
Avg: 0.048692 -> 0.051619: 1.06x slower
Significant (t=-69.92)
Stddev: 0.2 -> 0.00030: 16.7793x larger

### silent_logging ###
Min: 0.069539 -> 0.071172: 1.02x slower
Avg: 0.069679 -> 0.071230: 1.02x slower
Significant (t=-124.08)
Stddev: 0.9 -> 0.2: 3.7073x smaller

### simple_logging ###
Min: 0.278439 -> 0.287736: 1.03x slower
Avg: 0.279504 -> 0.288811: 1.03x slower
Significant (t=-52.46)
Stddev: 0.00084 -> 0.00093: 1.1074x larger

### telco ###
Min: 0.012480 -> 0.013104: 1.05x slower
Avg: 0.012561 -> 0.013157: 1.05x slower
Significant (t=-100.42)
Stddev: 0.4 -> 0.2: 1.5881x smaller

### unpack_sequence ###
Min: 0.47 -> 0.48: 1.03x slower
Avg: 0.47 -> 0.48: 1.03x slower
Significant (t=-1170.16)
Stddev: 0.0 -> 0.0: 1.0749x larger

### unpickle_list ###
Min: 0.325310 -> 0.330080: 1.01x slower
Avg: 0.326484 -> 0.333974: 1.02x slower
Significant (t=-24.19)
Stddev: 0.00100 -> 0.00195: 1.9392x larger

[ faster ]

### chameleon_v2 ###
Min: 5.525575 -> 5.263668: 1.05x faster
Avg: 5.541444 -> 5.281893: 1.05x faster
Significant (t=85.79)
Stddev: 0.01107 -> 0.01831: 1.6539x larger

### etree_iterparse ###
Min: 0.212073 -> 0.197146: 1.08x faster
Avg: 0.215504 -> 0.200254: 1.08x faster
Significant (t=61.07)
Stddev: 0.00119 -> 0.00130: 1.0893x larger

### etree_parse ###
Min: 0.282983 -> 0.260390: 1.09x faster
Avg: 0.284333 -> 0.262758: 1.08x faster
Significant (t=77.34)
Stddev: 0.00102 -> 0.00169: 1.6628x larger

### etree_process ###
Min: 0.218953 -> 0.213683: 1.02x faster
Avg: 0.221036 -> 0.215280: 1.03x faster
Significant (t=25.98)
Stddev: 0.00114 -> 0.00108: 1.0580x smaller

### hexiom2 ###
Min: 122.001408 -> 118.967112: 1.03x faster
Avg: 122.108010 -> 119.110115: 1.03x faster
Significant (t=16.81)
Stddev: 0.15076 -> 0.20224: 1.3415x larger

### pathlib ###
Min: 0.088533 -> 0.084888: 1.04x faster
Avg: 0.088916 -> 0.085280: 1.04x faster
Significant (t=257.68)
Stddev: 0.00014 -> 0.00017: 1.1725x larger


The following not significant results are hidden, use -v to show them:
call_method, 

[issue26814] [WIP] Add a new _PyObject_FastCall() function which avoids the creation of a tuple or dict for arguments

2016-04-22 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Could you compare filter(), map() and sorted() performance with your patch and
with the issue23507 patch?

--




[issue26814] [WIP] Add a new _PyObject_FastCall() function which avoids the creation of a tuple or dict for arguments

2016-04-22 Thread STINNER Victor

STINNER Victor added the comment:

For more fun, here is a comparison between Python 2.7 / 3.4 / 3.6 / 3.6 FASTCALL.

----------------------------------+-------------+----------------+----------------+----------------
Tests                             |        py27 |           py34 |           py36 |           fast
----------------------------------+-------------+----------------+----------------+----------------
filter                            |  165 us (*) |  318 us (+93%) |  237 us (+43%) |         165 us
map                               |  209 us (*) |  258 us (+24%) |         202 us |  171 us (-18%)
sorted(list, key=lambda x: x)     |  272 us (*) |  348 us (+28%) |  237 us (-13%) |  163 us (-40%)
sorted(list)                      | 33.7 us (*) | 47.8 us (+42%) | 27.3 us (-19%) | 27.7 us (-18%)
b=MyBytes(); bytes(b)             | 3.31 us (*) |  835 ns (-75%) |  510 ns (-85%) |  561 ns (-83%)
namedtuple.attr                   | 4.63 us (*) |        4.51 us | 1.98 us (-57%) | 1.57 us (-66%)
object.__setattr__(obj, "x", 1)   |  463 ns (*) |         440 ns |  343 ns (-26%) |  222 ns (-52%)
object.__getattribute__(obj, "x") |  323 ns (*) |  396 ns (+23%) |         316 ns |  196 ns (-39%)
getattr(1, "real")                |  218 ns (*) |   237 ns (+8%) |  264 ns (+21%) |  147 ns (-33%)
bounded_pymethod(1, 2)            |  213 ns (*) |  244 ns (+14%) |   194 ns (-9%) |  188 ns (-12%)
unbound_pymethod(obj, 1, 2)       |  345 ns (*) |  247 ns (-29%) |  196 ns (-43%) |  191 ns (-45%)
func()                            |  161 ns (*) |  211 ns (+31%) |         161 ns |         157 ns
func(1, 2, 3)                     |  219 ns (*) |  247 ns (+13%) |  196 ns (-10%) |  190 ns (-13%)
----------------------------------+-------------+----------------+----------------+----------------
Total                             |  689 us (*) |  980 us (+42%) |         707 us |  531 us (-23%)
----------------------------------+-------------+----------------+----------------+----------------


I didn't know that Python 3.4 was so much slower than Python 2.7 on function 
calls!?

Note: Python 2.7 and Python 3.4 are system binaries (Fedora 22), whereas Python
3.6 and Python 3.6 FASTCALL were compiled manually.

Ignore "b=MyBytes(); bytes(b)", this benchmark is written for Python 3.

--

details:

Common platform:
Bits: int=32, long=64, long long=64, size_t=64, void*=64
Platform: Linux-4.4.4-301.fc23.x86_64-x86_64-with-fedora-23-Twenty_Three
CPU model: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz

Platform of campaign py27:
CFLAGS: -fno-strict-aliasing -O2 -g -pipe -Wall -Werror=format-security 
-Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong 
--param=ssp-buffer-size=4 -grecord-gcc-switches 
-specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64 -mtune=generic 
-D_GNU_SOURCE -fPIC -fwrapv -DNDEBUG -O2 -g -pipe -Wall -Werror=format-security 
-Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong 
--param=ssp-buffer-size=4 -grecord-gcc-switches 
-specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64 -mtune=generic 
-D_GNU_SOURCE -fPIC -fwrapv
Python unicode implementation: UCS-4
Timer precision: 954 ns
Python version: 2.7.10 (default, Sep 8 2015, 17:20:17) [GCC 5.1.1 20150618 (Red 
Hat 5.1.1-4)]
Timer: time.time

Platform of campaign py34:
Timer info: namespace(adjustable=False, 
implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, 
resolution=1e-09)
CFLAGS: -Wno-unused-result -DDYNAMIC_ANNOTATIONS_ENABLED=1 -DNDEBUG -O2 -g 
-pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions 
-fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches 
-specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64 -mtune=generic 
-D_GNU_SOURCE -fPIC -fwrapv
Timer precision: 84 ns
Python unicode implementation: PEP 393
Python version: 3.4.3 (default, Jun 29 2015, 12:16:01) [GCC 5.1.1 20150618 (Red 
Hat 5.1.1-4)]
Timer: time.perf_counter

Platform of campaign py36:
Timer info: namespace(adjustable=False, 
implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, 
resolution=1e-09)
Python version: 3.6.0a0 (default:496e094f4734, Apr 22 2016, 02:18:13) [GCC 
5.3.1 20151207 (Red Hat 5.3.1-2)]
CFLAGS: -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall 
-Wstrict-prototypes
Python unicode implementation: PEP 393
Timer: time.perf_counter

Platform of campaign fast:
Timer info: namespace(adjustable=False, 
implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, 
resolution=1e-09)
CFLAGS: -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall 
-Wstrict-prototypes
Python unicode implementation: PEP 393
Python version: 3.6.0a0 (default:ad4a53ed1fbf, Apr 22 2016, 12:42:15) [GCC 
5.3.1 20151207 (Red Hat 5.3.1-2)]

--
Added file: http://bugs.python.org/file42568/bench_fast-2.py


[issue26814] [WIP] Add a new _PyObject_FastCall() function which avoids the creation of a tuple or dict for arguments

2016-04-22 Thread STINNER Victor

STINNER Victor added the comment:

Some microbenchmarks: bench_fast.py.

== Python 3.6 / Python 3.6 FASTCALL ==

----------------------------------+--------------+----------------
Tests                             | /tmp/default |  /tmp/fastcall
----------------------------------+--------------+----------------
filter                            |   241 us (*) |  166 us (-31%)
map                               |   205 us (*) |  168 us (-18%)
sorted(list, key=lambda x: x)     |   242 us (*) |  162 us (-33%)
sorted(list)                      |  27.7 us (*) |        27.8 us
b=MyBytes(); bytes(b)             |   549 ns (*) |         533 ns
namedtuple.attr                   |  2.03 us (*) | 1.56 us (-23%)
object.__setattr__(obj, "x", 1)   |   347 ns (*) |  218 ns (-37%)
object.__getattribute__(obj, "x") |   331 ns (*) |  200 ns (-40%)
getattr(1, "real")                |   267 ns (*) |  150 ns (-44%)
bounded_pymethod(1, 2)            |   193 ns (*) |         190 ns
unbound_pymethod(obj, 1, 2)       |   195 ns (*) |         192 ns
----------------------------------+--------------+----------------
Total                             |   719 us (*) |  526 us (-27%)
----------------------------------+--------------+----------------


== Compare Python 3.4 / Python 3.6 / Python 3.6 FASTCALL ==

Common platform:
Timer: time.perf_counter
Python unicode implementation: PEP 393
Timer info: namespace(adjustable=False, 
implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, 
resolution=1e-09)
CPU model: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
Platform: Linux-4.4.4-301.fc23.x86_64-x86_64-with-fedora-23-Twenty_Three
SCM: hg revision=abort: repository . not found! tag=abort: repository . not 
found! branch=abort: repository . not found! date=abort: no repository found in 
'/home/haypo/prog/python' (.hg not found)!
Bits: int=32, long=64, long long=64, size_t=64, void*=64

Platform of campaign /tmp/py34:
Python version: 3.4.3 (default, Jun 29 2015, 12:16:01) [GCC 5.1.1 20150618 (Red 
Hat 5.1.1-4)]
CFLAGS: -Wno-unused-result -DDYNAMIC_ANNOTATIONS_ENABLED=1 -DNDEBUG -O2 -g 
-pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions 
-fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches 
-specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64 -mtune=generic 
-D_GNU_SOURCE -fPIC -fwrapv
Timer precision: 78 ns
Date: 2016-04-22 13:37:52

Platform of campaign /tmp/default:
Python version: 3.6.0a0 (default:496e094f4734, Apr 22 2016, 02:18:13) [GCC 
5.3.1 20151207 (Red Hat 5.3.1-2)]
CFLAGS: -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall 
-Wstrict-prototypes
Timer precision: 103 ns
Date: 2016-04-22 13:38:07

Platform of campaign /tmp/fastcall:
Python version: 3.6.0a0 (default:ad4a53ed1fbf, Apr 22 2016, 12:42:15) [GCC 
5.3.1 20151207 (Red Hat 5.3.1-2)]
Timer precision: 99 ns
CFLAGS: -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall 
-Wstrict-prototypes
Date: 2016-04-22 13:38:21

----------------------------------+-------------+----------------+----------------
Tests                             |   /tmp/py34 |   /tmp/default |  /tmp/fastcall
----------------------------------+-------------+----------------+----------------
filter                            |  325 us (*) |  241 us (-26%) |  166 us (-49%)
map                               |  260 us (*) |  205 us (-21%) |  168 us (-35%)
sorted(list, key=lambda x: x)     |  354 us (*) |  242 us (-32%) |  162 us (-54%)
sorted(list)                      | 46.9 us (*) | 27.7 us (-41%) | 27.8 us (-41%)
b=MyBytes(); bytes(b)             |  839 ns (*) |  549 ns (-35%) |  533 ns (-36%)
namedtuple.attr                   | 4.51 us (*) | 2.03 us (-55%) | 1.56 us (-65%)
object.__setattr__(obj, "x", 1)   |  447 ns (*) |  347 ns (-22%) |  218 ns (-51%)
object.__getattribute__(obj, "x") |  401 ns (*) |  331 ns (-17%) |  200 ns (-50%)
getattr(1, "real")                |  236 ns (*) |  267 ns (+13%) |  150 ns (-36%)
bounded_pymethod(1, 2)            |  249 ns (*) |  193 ns (-22%) |  190 ns (-24%)
unbound_pymethod(obj, 1, 2)       |  251 ns (*) |  195 ns (-22%) |  192 ns (-23%)
----------------------------------+-------------+----------------+----------------
Total                             |  993 us (*) |  719 us (-28%) |  526 us (-47%)
----------------------------------+-------------+----------------+----------------

--
Added file: http://bugs.python.org/file42567/bench_fast.py




[issue26814] [WIP] Add a new _PyObject_FastCall() function which avoids the creation of a tuple or dict for arguments

2016-04-22 Thread STINNER Victor

STINNER Victor added the comment:

Related issue: issue #23507, "Tuple creation is too slow".

--




[issue26814] [WIP] Add a new _PyObject_FastCall() function which avoids the creation of a tuple or dict for arguments

2016-04-22 Thread STINNER Victor

STINNER Victor added the comment:

Changes in my current implementation, ad4a53ed1fbf.diff.

The good thing is that all changes are internal (really?). Even if you don't
modify your C extensions (nor your Python code), you should benefit from the
new fast call in *a lot* of cases.

IMHO the trickiest part is the changes to PyTypeObject. Is it ok to add a new
tp_fastcall slot? Should we add even more slots using the fast call convention,
like tp_fastnew and tp_fastinit? How should we handle the inheritance of types
with that?


(*) Add 2 new public functions:

PyObject* PyObject_CallNoArg(PyObject *func);
PyObject* PyObject_CallArg1(PyObject *func, PyObject *arg);


(*) Add 1 new private function:

PyObject* _PyObject_FastCall(PyObject *func, PyObject **stack, int na, int nk);

_PyObject_FastCall() is the root of the new feature.
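
As a hedged sketch of how these functions would be used from C code, here is a
call with one argument done the classic way (building a temporary 1-element
tuple) and with the new API (no temporary tuple). PyObject_CallArg1() and
_PyObject_FastCall() are the new functions described above; everything else is
the existing C API.

    /* Classic call: a 1-element tuple is allocated just to carry the
       argument. */
    static PyObject *
    call_classic(PyObject *func, PyObject *arg)
    {
        PyObject *args = PyTuple_Pack(1, arg);
        if (args == NULL) {
            return NULL;
        }
        PyObject *res = PyObject_Call(func, args, NULL);
        Py_DECREF(args);
        return res;
    }

    /* Fast call: the argument is passed directly, no temporary tuple. */
    static PyObject *
    call_fast(PyObject *func, PyObject *arg)
    {
        return PyObject_CallArg1(func, arg);
        /* roughly equivalent to: _PyObject_FastCall(func, &arg, 1, 0) */
    }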


(*) type: add a new "tp_fastcall" field to the PyTypeObject structure.

It's unclear to me how inheritance is handled here. Maybe it's simply broken,
but it's strange because it looks like it works :-) Maybe it's very rare that
tp_call is overridden in a child class?

TODO: maybe reuse the "tp_call" field? (risk of major backward 
incompatibility...)


(*) slots: add a new "fastwrapper" field to the wrappercase structure. Add a
fast wrapper to all slots (really all? I should check).

I don't think that consumers of the C API are affected by this change, or maybe
only a few projects are.

TODO: maybe remove "fastwrapper" and reuse the "wrapper" field? (low risk of
backward incompatibility?)


(*) Implement fast call for Python function (_PyFunction_FastCall) and C 
functions (PyCFunction_FastCall)


(*) Add a new METH_FASTCALL calling convention for C functions. Right now, it 
is used for 4 builtin functions: sorted(), getattr(), iter(), next().

Argument Clinic should be modified to emit C code using this new fast calling 
convention.


(*) Implement fast call in the following functions (types):

- method()
- method_descriptor()
- wrapper_descriptor()
- method_wrapper()
- operator.itemgetter => used by collections.namedtuple to get an item by its 
name


(*) Modify the PyObject_Call*() functions to reuse the fast call internally.
"tp_fastcall" is preferred over "tp_call" (FIXME: is it really useful to do
that?).

The following functions are able to avoid the temporary tuple/dict without
having to modify the code calling them:

- PyObject_CallFunction()
- PyObject_CallMethod(), _PyObject_CallMethodId()
- PyObject_CallFunctionObjArgs(), PyObject_CallMethodObjArgs()

It's not required to modify code using these functions to use the 3 new shiny
functions (PyObject_CallNoArg, PyObject_CallArg1, _PyObject_FastCall). For
example, replacing PyObject_CallFunctionObjArgs(func, NULL) with
PyObject_CallNoArg(func) is just a micro-optimization: the tuple is already
avoided. But PyObject_CallNoArg() should use less C stack memory and be a
"little bit" faster.


(*) Add new helpers: new Include/pystack.h file, Py_VaBuildStack(), etc.


Please ignore unrelated changes.

--




[issue26814] [WIP] Add a new _PyObject_FastCall() function which avoids the creation of a tuple or dict for arguments

2016-04-22 Thread STINNER Victor

Changes by STINNER Victor :


Added file: http://bugs.python.org/file42566/ad4a53ed1fbf.diff




[issue26814] [WIP] Add a new _PyObject_FastCall() function which avoids the creation of a tuple or dict for arguments

2016-04-21 Thread STINNER Victor

STINNER Victor added the comment:

I created a repository. I will work there and run some experiments. It should
help to get a better idea of the concrete performance. When I have a better
view of all the changes required to get the best performance everywhere, I will
start a discussion to see which parts are worth it or not. In my latest
microbenchmarks, function calls (C/Python, mixed) are between 8% and 40%
faster. I'm now running the CPython benchmark suite.

--
title: Add a new _PyObject_FastCall() function which avoids the creation of a 
tuple or dict for arguments -> [WIP] Add a new _PyObject_FastCall() function 
which avoids the creation of a tuple or dict for arguments
