[Python-Dev] Speeding up CPython

2020-10-20 Thread Mark Shannon

Hi everyone,

CPython is slow. We all know that, yet little is done to fix it.

I'd like to change that.
I have a plan to speed up CPython by a factor of five over the next few 
years. But it needs funding.


I am aware that there have been several promised speed-ups in the past 
that have failed. You might wonder why this is different.


Here are three reasons:
1. I already have working code for the first stage.
2. I'm not promising a silver bullet. I recognize that this is a 
substantial amount of work and needs funding.
3. I have extensive experience in VM implementation, not to mention a 
PhD in the subject.


My ideas for possible funding, as well as the actual plan of 
development, can be found here:


https://github.com/markshannon/faster-cpython

I'd love to hear your thoughts on this.

Cheers,
Mark.


Re: [Python-Dev] Speeding up CPython 5-10%

2016-05-18 Thread zreed
No problem, I did not think you were attacking me or find your
response rude.
 
 
On Wed, May 18, 2016, at 01:06 PM, Cesare Di Mauro wrote:
> If you feel like I've attacked you, I apologize: it wasn't my
> intention. Please, don't take it personally: I only offered my honest
> opinion, although on re-reading it comes across as too rude, and I'm
> sorry for that.
>
> Regarding the post-bytecode optimization issues, they are mainly
> represented by the constant folding code, which is still in the
> peephole stage. Once it's moved to the proper place (ASDL/AST), such
> issues with the stack calculations disappear, whereas the
> remaining ones can be addressed by a fix of the current
> stackdepth_walk function.
>
> And just to be clear, I've nothing against your code. I simply think
> that, based on my experience, it doesn't fit in CPython.
>
> Regards
> Cesare
>
> 2016-05-18 18:50 GMT+02:00 :
>> __
>> Your criticisms may very well be true. IIRC though, I wrote that pass
>> because what was available was not general enough. The
>> stackdepth_walk function made assumptions that, while true of code
>> generated by the current cpython frontend, were not universally true.
>> If a goal is to move this calculation after any bytecode
>> optimization, something along these lines seems like it will
>> eventually be necessary.
>>
>> Anyway, just offering things already written. If you don't feel it's
>> useful, no worries.
>>
>>
>> On Wed, May 18, 2016, at 11:35 AM, Cesare Di Mauro wrote:
>>> 2016-05-17 8:25 GMT+02:00 :
 In the project https://github.com/zachariahreed/byteasm I mentioned on
 the list earlier this month, I have a pass that computes stack usage
 for a given sequence of bytecodes. It seems to be a fair bit more
 aggressive than cpython. Maybe it's more generally useful. It's pure
 python rather than C though.

>>> IMO it's too big, resource hungry, and slower, even if you convert
>>> it to C.
>>>
>>> If you take a look at the current stackdepth_walk function which
>>> CPython uses, it's much smaller (not even 50 lines in simple C code)
>>> and quite efficient.
>>>
>>> Currently the problem is that it doesn't return the maximum depth of
>>> the tree; instead it updates the intermediate/current maximum and
>>> *then* uses it for the subsequent calculations. So the depth
>>> artificially grows, as in the reported cases.
>>>
>>> It doesn't require a complete rewrite, just some time spent
>>> fine-tuning it.
>>>
>>> Regards
>>> Cesare
>>
 


Re: [Python-Dev] Speeding up CPython 5-10%

2016-05-18 Thread Cesare Di Mauro
If you feel like I've attacked you, I apologize: it wasn't my intention.
Please, don't take it personally: I only offered my honest opinion, although
on re-reading it comes across as too rude, and I'm sorry for that.

Regarding the post-bytecode optimization issues, they are mainly
represented by the constant folding code, which is still in the peephole
stage. Once it's moved to the proper place (ASDL/AST), such issues
with the stack calculations disappear, whereas the remaining ones
can be addressed by a fix of the current stackdepth_walk function.

And just to be clear, I've nothing against your code. I simply think that,
based on my experience, it doesn't fit in CPython.

Regards
Cesare

2016-05-18 18:50 GMT+02:00 :

> Your criticisms may very well be true. IIRC though, I wrote that pass
> because what was available was not general enough. The stackdepth_walk
> function made assumptions that, while true of code generated by the current
> cpython frontend, were not universally true. If a goal is to move this
> calculation after any bytecode optimization, something along these lines
> seems like it will eventually be necessary.
>
> Anyway, just offering things already written. If you don't feel it's
> useful, no worries.
>
>
> On Wed, May 18, 2016, at 11:35 AM, Cesare Di Mauro wrote:
>
> 2016-05-17 8:25 GMT+02:00 :
>
> In the project https://github.com/zachariahreed/byteasm I mentioned on
> the list earlier this month, I have a pass that computes stack usage
> for a given sequence of bytecodes. It seems to be a fair bit more
> aggressive than cpython. Maybe it's more generally useful. It's pure
> python rather than C though.
>
>
> IMO it's too big, resource hungry, and slower, even if you convert it to C.
>
> If you take a look at the current stackdepth_walk function which CPython
> uses, it's much smaller (not even 50 lines in simple C code) and quite
> efficient.
>
> Currently the problem is that it doesn't return the maximum depth of the
> tree; instead it updates the intermediate/current maximum and *then* uses
> it for the subsequent calculations. So the depth artificially grows, as
> in the reported cases.
>
> It doesn't require a complete rewrite, just some time spent fine-tuning it.
>
> Regards
> Cesare
>
>
>


Re: [Python-Dev] Speeding up CPython 5-10%

2016-05-18 Thread zreed
Your criticisms may very well be true. IIRC though, I wrote that pass
because what was available was not general enough. The stackdepth_walk
function made assumptions that, while true of code generated by the
current cpython frontend, were not universally true. If a goal is to
move this calculation after any bytecode optimization, something along
these lines seems like it will eventually be necessary.
 
Anyway, just offering things already written. If you don't feel it's
useful, no worries.
 
 
On Wed, May 18, 2016, at 11:35 AM, Cesare Di Mauro wrote:
> 2016-05-17 8:25 GMT+02:00 :
>> In the project https://github.com/zachariahreed/byteasm I mentioned on
>> the list earlier this month, I have a pass that computes stack usage
>> for a given sequence of bytecodes. It seems to be a fair bit more
>> aggressive than cpython. Maybe it's more generally useful. It's pure
>> python rather than C though.
>>
> IMO it's too big, resource hungry, and slower, even if you convert
> it to C.
>
> If you take a look at the current stackdepth_walk function which
> CPython uses, it's much smaller (not even 50 lines in simple C code)
> and quite efficient.
>
> Currently the problem is that it doesn't return the maximum depth of
> the tree; instead it updates the intermediate/current maximum and *then*
> uses it for the subsequent calculations. So the depth artificially
> grows, as in the reported cases.
>
> It doesn't require a complete rewrite, just some time spent fine-tuning it.
>
> Regards
> Cesare
 


Re: [Python-Dev] Speeding up CPython 5-10%

2016-05-18 Thread Cesare Di Mauro
2016-05-17 8:25 GMT+02:00 :

> In the project https://github.com/zachariahreed/byteasm I mentioned on
> the list earlier this month, I have a pass that computes stack usage
> for a given sequence of bytecodes. It seems to be a fair bit more
> aggressive than cpython. Maybe it's more generally useful. It's pure
> python rather than C though.
>

IMO it's too big, resource hungry, and slower, even if you convert it to C.

If you take a look at the current stackdepth_walk function which CPython
uses, it's much smaller (not even 50 lines in simple C code) and quite
efficient.

Currently the problem is that it doesn't return the maximum depth of the
tree; instead it updates the intermediate/current maximum and *then* uses
it for the subsequent calculations. So the depth artificially grows, as
in the reported cases.

It doesn't require a complete rewrite, just some time spent fine-tuning it.
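
The distinction can be illustrated with a small Python sketch (hypothetical
code, not the actual C function; 'effects' and 'targets' are made-up
stand-ins for the per-block data). The running maximum is tracked on the
side and is never fed back in as the entry depth of a successor block:

    from collections import namedtuple

    # Hypothetical basic-block representation, just for the sketch.
    Block = namedtuple("Block", "effects targets")  # stack effects, successor blocks

    def max_stack_depth(block, depth=0, best=0, seen=None):
        seen = {} if seen is None else seen
        if seen.get(id(block), -1) >= depth:   # revisit a block only if reached deeper
            return best
        seen[id(block)] = depth
        for effect in block.effects:
            depth += effect                    # running depth along this path
            best = max(best, depth)            # the maximum is tracked separately...
        for target in block.targets:
            # ...and is *not* used as the entry depth of the successors
            best = max_stack_depth(target, depth, best, seen)
        return best

    # e.g. a block that pushes two values and pops one, with no successors:
    print(max_stack_depth(Block(effects=[+1, +1, -1], targets=[])))  # -> 2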

Regards
Cesare


Re: [Python-Dev] Speeding up CPython 5-10%

2016-05-17 Thread zreed
In the project https://github.com/zachariahreed/byteasm I mentioned on
the list earlier this month, I have a pass that computes stack usage
for a given sequence of bytecodes. It seems to be a fair bit more
aggressive than cpython. Maybe it's more generally useful. It's pure
python rather than C though.


Re: [Python-Dev] Speeding up CPython 5-10%

2016-05-16 Thread Cesare Di Mauro
2016-05-16 17:55 GMT+02:00 Meador Inge :

> On Sun, May 15, 2016 at 2:23 AM, Cesare Di Mauro <
> cesare.di.ma...@gmail.com> wrote:
>
>
>> Just one thing that comes to mind: has the stack depth calculation
>> routine changed? It was suboptimal, and calculating a better number
>> decreases stack allocation, and increases the frame usage.
>>
>
> This is still a problem and came up again recently:
>
> http://bugs.python.org/issue26549
>
> -- Meador
>

I saw the last two comments on the issue: this is what I was talking about
(in particular, the issue opened by Armin applies).

However there's another case where the situation is even worse.

Let me show a small reproducer:

def test(self):
    for i in range(self.count):
        with self: pass

The stack size reported by Python 2.7.11:
>>> test.__code__.co_stacksize
6

Adding another with statement:
>>> test.__code__.co_stacksize
7

But unfortunately with Python 3.5.1 the problem is much worse. With a single
with statement:

>>> test.__code__.co_stacksize
10

And with the second with statement added:

>>> test.__code__.co_stacksize
17

Here the situation is exacerbated by the fact that the WITH_CLEANUP
instruction of Python 2.x was split into two (WITH_CLEANUP_START and
WITH_CLEANUP_FINISH) in some Python 3 release.
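
For what it's worth, the difference can be inspected directly with the dis
module (a quick sketch that reuses the reproducer above; the exact output
depends on the interpreter version):

    import dis

    def test(self):
        for i in range(self.count):
            with self: pass

    print(test.__code__.co_stacksize)  # 6 on 2.7.11 vs 10 on 3.5.1, per the numbers above
    dis.dis(test)  # on 3.5 the with block compiles to SETUP_WITH plus
                   # WITH_CLEANUP_START and WITH_CLEANUP_FINISH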

I don't know why two different instructions were introduced, but IMO it's
better to have one instruction which handles all code finalization of the
with statement, at least in this case. If there are other scenarios where
two different instructions are needed, then ad-hoc instructions like those
can be used.

Regards,
Cesare


Re: [Python-Dev] Speeding up CPython 5-10%

2016-05-16 Thread Meador Inge
On Sun, May 15, 2016 at 2:23 AM, Cesare Di Mauro 
wrote:


> Just one thing that comes to mind: has the stack depth calculation
> routine changed? It was suboptimal, and calculating a better number
> decreases stack allocation, and increases the frame usage.
>

This is still a problem and came up again recently:

http://bugs.python.org/issue26549

-- Meador


Re: [Python-Dev] Speeding up CPython 5-10%

2016-05-15 Thread Cesare Di Mauro
2016-02-01 17:54 GMT+01:00 Yury Selivanov :

> Thanks for bringing this up!
>
> IIRC wpython was about using "fat" bytecodes, i.e. using 64bits per
> bytecode instead of 8.


No, it used 16, 32, or 48 bits per opcode (1, 2, or 3 16-bit words).


> That allows to minimize the number of bytecodes, thus having some
> performance increase.  TBH, I don't think it was "significantly faster".
>

Please, take a look at the benchmarks, or compile it and check yourself. ;-)

> If I were to do some big refactoring of the ceval loop, I'd probably
> consider implementing a register VM.  While register VMs are a bit faster
> than stack VMs (up to 20-30%), they would also allow us to apply more
> optimizations, and even bolt on a simple JIT compiler.
>
> Yury


WPython was a hybrid VM: it supported both a stack-based and a
register-based approach.

I think that's needed, given the nature of Python, because you can have
operations with intermixed operands: constants, locals, globals, names.
It's quite difficult to handle all possible cases with a register-based VM.

Regards,
Cesare


Re: [Python-Dev] Speeding up CPython 5-10%

2016-05-15 Thread Cesare Di Mauro
2016-02-02 10:28 GMT+01:00 Victor Stinner :

> 2016-01-27 19:25 GMT+01:00 Yury Selivanov :
> > tl;dr The summary is that I have a patch that improves CPython performance
> > up to 5-10% on macro benchmarks.  Benchmarks results on Macbook Pro/Mac OS
> > X, desktop CPU/Linux, server CPU/Linux are available at [1].  There are no
> > slowdowns that I could reproduce consistently.
>
> That's really impressive, great job Yury :-) Getting non-negligible
> speedup on large macrobenchmarks became really hard in CPython.
> CPython is already well optimized in all corners.


It's been a long time since I took a look at CPython (3.2), but if it hasn't
changed a lot, then there might be some corner cases still waiting to be
optimized. ;-)

Just one thing that comes to mind: has the stack depth calculation
routine changed? It was suboptimal, and calculating a better number
decreases stack allocation, and increases the frame usage.


> It looks like the
> overall Python performance still depends heavily on the performance of
> dictionary and attribute lookups. Even if it was well known, I didn't
> expect up to 10% speedup on *macro* benchmarks.
>

True, but it might be mitigated in some ways, at least for built-in types.
There are ideas about that, but they are a bit complicated to implement.

The problem is with functions like len, which IMO should become attribute
lookups ('foo'.len) or method executions ('foo'.len()). Then it'll be
easier to accelerate their execution, with one of the above ideas.

However such changes belong to Guido, who defines the language
structure/philosophy. IMO something like len should be part of the
attributes exposed by an object: it's more "object-oriented". Whereas other
things like open, file, sum, etc., are "general facilities".

Regards,
Cesare


Re: [Python-Dev] Speeding up CPython 5-10%

2016-02-04 Thread Nick Coghlan
On 3 February 2016 at 03:52, Brett Cannon  wrote:
> Fifth, if we manage to show that a C API can easily be added to CPython to
> make a JIT something that can simply be plugged in and be useful, then we
> will also have a basic JIT framework for people to use. As I said, our use
> of CoreCLR is just for ease of development. There is no reason we couldn't
> use ChakraCore, v8, LLVM, etc. But since all of these JIT compilers would
> need to know how to handle CPython bytecode, we have tried to design a
> framework where JIT compilers just need a wrapper to handle code emission
> and our framework that we are building will handle driving the code emission
> (e.g., the wrapper needs to know how to emit add_integer(), but our
> framework handles when to have to do that).

That could also be really interesting in the context of pymetabiosis
[1] if it meant that PyPy could still at least partially JIT the
Python code running on the CPython side of the boundary.

Cheers,
Nick.

[1] https://github.com/rguillebert/pymetabiosis


-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] Speeding up CPython 5-10%

2016-02-02 Thread Brett Cannon
On Tue, 2 Feb 2016 at 01:29 Victor Stinner  wrote:

> Hi,
>
> I'm back from the FOSDEM event at Bruxelles, it was really cool. I gave
> a talk about FAT Python and I got good feedback. But friends told me
> that people now have expectations for FAT Python. It looks like people
> care about Python performance :-)
>
> FYI the slides of my talk:
> https://github.com/haypo/conf/raw/master/2016-FOSDEM/fat_python.pdf
> (a video was recorded, I don't know when it will be online)
>
> I took a first look at your patch and, sorry, I'm skeptical about the
> design. I have to play with it a little bit more to check whether there
> is a better design.
>
> To be clear, FAT Python with your work looks more and more like a
> cheap JIT compiler :-) Guards, specializations, optimizing at runtime
> after a threshold... all these things come from JIT compilers. I like
> the idea of a kind-of JIT compiler without having to pay the high cost
> of a large dependency like LLVM. I like baby steps in CPython, it's
> faster, it's possible to implement it in a single release cycle (one
> minor Python release, Python 3.6). Integrating a JIT compiler into
> CPython already failed with Unladen Swallow :-/
>
> PyPy has a completely different design (and has serious issues with the
> Python C API), Pyston is restricted to Python 2.7, Pyjion looks
> specific to Windows (CoreCLR), Numba is specific to numeric
> computations (numpy). IMHO none of these projects can easily be
> merged into CPython "quickly" (again, in a single Python release
> cycle). By the way, Pyjion still looks very young (I heard that they
> are still working on the compatibility with CPython, not on
> performance yet).
>

We are not ready to have a serious discussion about Pyjion yet as we are
still working on compatibility (we have a talk proposal in for PyCon US
2016 and so we are hoping to have something to discuss at the language
summit), but Victor's email shows there are some misconceptions about it
already and a misunderstanding of our fundamental goal.

First off, Pyjion is very much a work-in-progress. You can find it at
https://github.com/microsoft/pyjion (where there is an FAQ), but for this
audience the key thing to know is that we are still working on
compatibility (see
https://github.com/Microsoft/Pyjion/blob/master/Tests/python_tests.txt for
the list of tests we do (not) pass from the Python test suite). Out of our
roughly 400 tests, we don't pass about 18 of them.

Second, we have not really started work on performance yet. We have done
some very low-hanging fruit stuff, but just barely. IOW we are not really
ready to discuss performance (ATM we JIT all code objects immediately, and
even being that aggressive with the JIT overhead we are only even with or
slightly slower than an unmodified Python 3.5 VM, so we are hopeful this
work will pan out).

Third, the over-arching goal of Pyjion is not to add a JIT into CPython,
but to add a C API to CPython that will allow plugging in a JIT. If you
simply JIT code objects then the API required to let someone plug in a JIT
is basically three functions, maybe as little as two (you can see the exact
patch against CPython that we are working with at
https://github.com/Microsoft/Pyjion/blob/master/Patches/python.diff). We
have no interest in shipping a JIT with CPython, just making it much easier
to let others add one if they want to because it makes sense for their
workload. We have no plans to suggest shipping a JIT with CPython, just to
make it an option for people to add in if they want (and if Yury's caching
stuff goes in with an execution counter then even the one bit of true
overhead we had will be part of CPython already which makes it even more of
an easy decision to consider the API we will eventually propose).

Fourth, it is not Windows-only by design. CoreCLR is cross-platform on all
major OSs, so that is not a restriction (and honestly we are using CoreCLR
simply because Dino used to work on the CLR team so he knows the bytecode
really well; we easily could have used some other JIT to prove our point).
The only reason Pyjion doesn't work on other OSs is momentum/laziness on
Dino's and my part; Dino hacked together Pyjion at PyCon US 2015 and he is
the most comfortable on Windows, so he just did it on Windows in Visual
Studio and didn't bother to start with, e.g., CMake to make it build on
other OSs. We are still trying to work out some compatibility stuff, so we
would rather do that than worry about Linux or OS X support right now.

Fifth, if we manage to show that a C API can easily be added to CPython to
make a JIT something that can simply be plugged in and be useful, then we
will also have a basic JIT framework for people to use. As I said, our use
of CoreCLR is just for ease of development. There is no reason we couldn't
use ChakraCore, v8, LLVM, etc. But since all of these JIT compilers would
need to know how to handle CPython bytecode, we have tried to design a
framework where JIT compilers just need a wrapper to handle code emission
and our framework that we are building will handle driving the code emission
(e.g., the wrapper needs to know how to emit add_integer(), but our
framework handles when to have to do that).

Re: [Python-Dev] Speeding up CPython 5-10%

2016-02-02 Thread Peter Ludemann via Python-Dev
Also, modern compiler technology tends to use "infinite register" machines
for the intermediate representation, then uses register coloring to assign
the actual registers (and generate spill code if needed). I've seen work on
inter-function optimization for avoiding some register loads and stores
(combined with tail-call optimization, it can turn recursive calls into
loops in the register machine).



On 2 February 2016 at 09:16, Sven R. Kunze  wrote:

> On 02.02.2016 00:27, Greg Ewing wrote:
>
>> Sven R. Kunze wrote:
>>
>>> Are there some resources on why register machines are considered faster
>>> than stack machines?
>>>
>>
>> If a register VM is faster, it's probably because each register
>> instruction does the work of about 2-3 stack instructions,
>> meaning less trips around the eval loop, so less unpredictable
>> branches and less pipeline flushes.
>>
>
> That's what I found so far as well.
>
> This assumes that bytecode dispatching is a substantial fraction
>> of the time taken to execute each instruction. For something
>> like cpython, where the operations carried out by the bytecodes
>> involve a substantial amount of work, this may not be true.
>>
>
> Interesting point indeed. It makes sense that register machines only save
> us the bytecode dispatching.
>
> How much that is compared to the work each instruction requires, I cannot
> say. Maybe, Yury has a better understanding here.
>
> It also assumes the VM is executing the bytecodes directly. If
>> there is a JIT involved, it all gets translated into something
>> else anyway, and then it's more a matter of whether you find
>> it easier to design the JIT to deal with stack or register code.
>>
>
> It seems like Yury thinks so. He hasn't told us so far.
>
>
> Best,
> Sven
>


Re: [Python-Dev] Speeding up CPython 5-10%

2016-02-02 Thread Sven R. Kunze

On 02.02.2016 00:27, Greg Ewing wrote:

Sven R. Kunze wrote:
Are there some resources on why register machines are considered 
faster than stack machines?


If a register VM is faster, it's probably because each register
instruction does the work of about 2-3 stack instructions,
meaning less trips around the eval loop, so less unpredictable
branches and less pipeline flushes.


That's what I found so far as well.


This assumes that bytecode dispatching is a substantial fraction
of the time taken to execute each instruction. For something
like cpython, where the operations carried out by the bytecodes
involve a substantial amount of work, this may not be true.


Interesting point indeed. It makes sense that register machines only 
save us the bytecode dispatching.


How much that is compared to the work each instruction requires, I 
cannot say. Maybe Yury has a better understanding here.



It also assumes the VM is executing the bytecodes directly. If
there is a JIT involved, it all gets translated into something
else anyway, and then it's more a matter of whether you find
it easier to design the JIT to deal with stack or register code.


It seems like Yury thinks so. He hasn't told us so far.


Best,
Sven


Re: [Python-Dev] Speeding up CPython 5-10%

2016-02-02 Thread Yury Selivanov



On 2016-02-02 4:28 AM, Victor Stinner wrote:
[..]

I took a first look at your patch and, sorry,


Thanks for the initial code review!


I'm skeptical about the
design. I have to play with it a little bit more to check whether there
is a better design.


So far I see two things you are worried about:


1. The cache is attached to the code object vs function/frame.

I think the code object is the perfect place for such a cache.

The cache must be there (and survive!) "across" the frames.
If you attach it to the function object, you'll have to
re-attach it to a frame object on each PyEval call.
I can't see how that would be better.


2. Two levels of indirection in my cache -- offsets table +
cache table.

In my other email thread "Opcode cache in ceval loop" I
explained that optimizing every code object in the standard
library and unittests adds 5% memory overhead.  Optimizing
only those that are called frequently is less than 1%.

Besides, many functions that you import are never called, or
only called once or twice.  And code objects for modules
and class bodies are called once.

If we don't use an offset table and just allocate a cache
entry for every opcode, then the memory usage will rise
*significantly*.  Right now the overhead of the offset table
is *8 bits* per opcode, and the overhead of the cache table is
*32 bytes* per optimized opcode.  The overhead of
using 1 extra indirection is minimal.
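
In Python terms, the two-level layout is roughly the following (a sketch
with made-up names; the real thing is a pair of C arrays hanging off the
code object):

    def make_opcode_cache(code):
        # one byte per bytecode position: 0 means "this opcode is not cached",
        # n > 0 means "this opcode uses cache entry n - 1"
        offsets = bytearray(len(code.co_code))
        cache = []   # ~32-byte entries, allocated only for the optimized opcodes
        return offsets, cache

    def cache_entry(offsets, cache, instr_offset):
        n = offsets[instr_offset]
        return cache[n - 1] if n else None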

[..]





2016-01-27 19:25 GMT+01:00 Yury Selivanov :

tl;dr The summary is that I have a patch that improves CPython performance
up to 5-10% on macro benchmarks.  Benchmarks results on Macbook Pro/Mac OS
X, desktop CPU/Linux, server CPU/Linux are available at [1].  There are no
slowdowns that I could reproduce consistently.

That's really impressive, great job Yury :-) Getting non-negligible
speedup on large macrobenchmarks became really hard in CPython.
CPython is already well optimized in all corners. It looks like the
overall Python performance still depends heavily on the performance of
dictionary and attribute lookups. Even if it was well known, I didn't
expect up to 10% speedup on *macro* benchmarks.


Thanks!





LOAD_METHOD & CALL_METHOD
-

We had a lot of conversations with Victor about his PEP 509, and he sent me
a link to his amazing compilation of notes about CPython performance [2].
One optimization that he pointed out to me was LOAD/CALL_METHOD opcodes, an
idea first originated in PyPy.

There is a patch that implements this optimization, it's tracked here: [3].
There are some low level details that I explained in the issue, but I'll go
over the high level design in this email as well.

Your cache is stored directly in code objects. Currently, code objects
are immutable.


Code objects are immutable on the Python level.  My cache
doesn't make any previously immutable field mutable.

Adding a few mutable cache structures visible only at the
C level is acceptable I think.



Antoine Pitrou's patch adding a LOAD_GLOBAL cache adds a cache to
functions with an "alias" in each frame object:
http://bugs.python.org/issue10401

Andrea Griffini's patch also adding a cache for LOAD_GLOBAL adds a
cache for code objects too.
https://bugs.python.org/issue1616125


Those patches are nice, but optimizing just LOAD_GLOBAL
won't give you a big speed-up.  For instance, 2to3 became
7-8% faster once I started to optimize LOAD_ATTR.

The idea of my patch is that it implements caching
in such a way that we can add it to several different
opcodes.

The opcodes we want to optimize are LOAD_GLOBAL, 0 and 3.  Let's look at the
first one, that loads the 'print' function from builtins.  The opcode knows
the following bits of information:

I tested your latest patch. It looks like LOAD_GLOBAL never
invalidates the cache on cache miss ("deoptimize" the instruction).


Yes, that was a deliberate decision (but we can add the
deoptimization easily).  So far I haven't seen a use case
or benchmark where we really need to deoptimize.



I suggest always invalidating the cache at each cache miss. Not only
is it common to modify global variables, but there is also the issue of
a different namespace being used with the same code object. Examples:

* late global initialization. See for example _a85chars cache of
base64.a85encode.
* code object created in a temporary namespace and then always run in
a different global namespace. See for example
collections.namedtuple(). I'm not sure that it's the best example
because it looks like the Python code only loads builtins, not
globals. But it looks like your code keeps a copy of the version of
the global namespace dict.

I tested with a threshold of 1: always optimize all code objects.
Maybe with your default threshold of 1024 runs, the issue with
different namespaces doesn't occur in practice.


Yep. I added a constant in ceval.c that enables collection
of opcode cache stats.

99.9% of all global dicts in benchmarks are stable.

The test suite was a bit different, only ~99% :) 

Re: [Python-Dev] Speeding up CPython 5-10%

2016-02-02 Thread Victor Stinner
Hi,

I'm back from the FOSDEM event at Bruxelles, it was really cool. I gave
a talk about FAT Python and I got good feedback. But friends told me
that people now have expectations for FAT Python. It looks like people
care about Python performance :-)

FYI the slides of my talk:
https://github.com/haypo/conf/raw/master/2016-FOSDEM/fat_python.pdf
(a video was recorded, I don't know when it will be online)

I took a first look at your patch and, sorry, I'm skeptical about the
design. I have to play with it a little bit more to check whether there
is a better design.

To be clear, FAT Python with your work looks more and more like a
cheap JIT compiler :-) Guards, specializations, optimizing at runtime
after a threshold... all these things come from JIT compilers. I like
the idea of a kind-of JIT compiler without having to pay the high cost
of a large dependency like LLVM. I like baby steps in CPython, it's
faster, it's possible to implement it in a single release cycle (one
minor Python release, Python 3.6). Integrating a JIT compiler into
CPython already failed with Unladen Swallow :-/

PyPy has a completely different design (and has serious issues with the
Python C API), Pyston is restricted to Python 2.7, Pyjion looks
specific to Windows (CoreCLR), Numba is specific to numeric
computations (numpy). IMHO none of these projects can easily be
merged into CPython "quickly" (again, in a single Python release
cycle). By the way, Pyjion still looks very young (I heard that they
are still working on the compatibility with CPython, not on
performance yet).


2016-01-27 19:25 GMT+01:00 Yury Selivanov :
> tl;dr The summary is that I have a patch that improves CPython performance
> up to 5-10% on macro benchmarks.  Benchmarks results on Macbook Pro/Mac OS
> X, desktop CPU/Linux, server CPU/Linux are available at [1].  There are no
> slowdowns that I could reproduce consistently.

That's really impressive, great job Yury :-) Getting non-negligible
speedup on large macrobenchmarks became really hard in CPython.
CPython is already well optimized in all corners. It looks like the
overall Python performance still depends heavily on the performance of
dictionary and attribute lookups. Even if it was well known, I didn't
expect up to 10% speedup on *macro* benchmarks.


> LOAD_METHOD & CALL_METHOD
> -
>
> We had a lot of conversations with Victor about his PEP 509, and he sent me
> a link to his amazing compilation of notes about CPython performance [2].
> One optimization that he pointed out to me was LOAD/CALL_METHOD opcodes, an
> idea first originated in PyPy.
>
> There is a patch that implements this optimization, it's tracked here: [3].
> There are some low level details that I explained in the issue, but I'll go
> over the high level design in this email as well.

Your cache is stored directly in code objects. Currently, code objects
are immutable.

Antoine Pitrou's patch adding a LOAD_GLOBAL cache adds a cache to
functions with an "alias" in each frame object:
http://bugs.python.org/issue10401

Andrea Griffini's patch also adding a cache for LOAD_GLOBAL adds a
cache for code objects too.
https://bugs.python.org/issue1616125

I don't know what the best place to store the cache is.

I vaguely recall a patch which uses a single unique global cache, but
maybe I'm wrong :-p


> The opcodes we want to optimize are LOAD_GLOBAL, 0 and 3.  Let's look at the
> first one, that loads the 'print' function from builtins.  The opcode knows
> the following bits of information:

I tested your latest patch. It looks like LOAD_GLOBAL never
invalidates the cache on cache miss ("deoptimize" the instruction).

I suggest always invalidating the cache at each cache miss. Not only
is it common to modify global variables, but there is also the issue of
a different namespace being used with the same code object. Examples:

* late global initialization (sketched below). See for example the
_a85chars cache of base64.a85encode.
* a code object created in a temporary namespace and then always run in
a different global namespace. See for example
collections.namedtuple(). I'm not sure that it's the best example
because it looks like the Python code only loads builtins, not
globals. But it looks like your code keeps a copy of the version of
the global namespace dict.
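
For the first case, the pattern looks roughly like this (a simplified,
made-up reduction of the idea, not the actual base64 code):

    _table = None   # module-level global, unset until the first call

    def encode(data):
        global _table
        if _table is None:
            # late global initialization: the value of the global changes
            # after the code object has already been executed once
            _table = {i: chr(33 + i) for i in range(85)}
        return "".join(_table[b % 85] for b in data)

A LOAD_GLOBAL cache filled on the first call would have to notice that
_table has changed by the next one.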

I tested with a threshold of 1: always optimize all code objects.
Maybe with your default threshold of 1024 runs, the issue with
different namespaces doesn't occur in practice.


> A straightforward way to implement such a cache is simple, but consumes a
> lot of memory, that would be just wasted, since we only need such a cache
> for LOAD_GLOBAL and LOAD_METHOD opcodes. So we have to be creative about the
> cache design.

I'm not sure that it's worth developing complex dynamic logic to
only enable optimizations after a threshold (a design very close to a
JIT compiler). What is the overhead (% of RSS memory) on a concrete
application when all code objects are optimized at startup?

Maybe we need a global 

Re: [Python-Dev] Speeding up CPython 5-10%

2016-02-01 Thread Greg Ewing

Sven R. Kunze wrote:
Are there some resources on why register machines are considered faster 
than stack machines?


If a register VM is faster, it's probably because each register
instruction does the work of about 2-3 stack instructions,
meaning less trips around the eval loop, so less unpredictable
branches and less pipeline flushes.
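
As a toy illustration (schematic instructions only, not any particular VM's
real encoding), compiling c = a + b:

    # stack VM: four instructions, four trips around the eval loop
    LOAD_FAST   a
    LOAD_FAST   b
    BINARY_ADD
    STORE_FAST  c

    # register VM: one three-address instruction, one dispatch
    ADD         c, a, b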

This assumes that bytecode dispatching is a substantial fraction
of the time taken to execute each instruction. For something
like cpython, where the operations carried out by the bytecodes
involve a substantial amount of work, this may not be true.

It also assumes the VM is executing the bytecodes directly. If
there is a JIT involved, it all gets translated into something
else anyway, and then it's more a matter of whether you find
it easier to design the JIT to deal with stack or register code.

--
Greg


Re: [Python-Dev] Speeding up CPython 5-10%

2016-02-01 Thread Mark Lawrence

On 01/02/2016 16:54, Yury Selivanov wrote:



On 2016-01-29 11:28 PM, Steven D'Aprano wrote:

On Wed, Jan 27, 2016 at 01:25:27PM -0500, Yury Selivanov wrote:

Hi,

tl;dr The summary is that I have a patch that improves CPython
performance up to 5-10% on macro benchmarks.  Benchmarks results on
Macbook Pro/Mac OS X, desktop CPU/Linux, server CPU/Linux are available
at [1].  There are no slowdowns that I could reproduce consistently.

Have you looked at Cesare Di Mauro's wpython? As far as I know, it's now
unmaintained, and the project repo on Google Code appears to be dead (I
get a 404), but I understand that it was significantly faster than
CPython back in the 2.6 days.

https://wpython.googlecode.com/files/Beyond%20Bytecode%20-%20A%20Wordcode-based%20Python.pdf



Thanks for bringing this up!

IIRC wpython was about using "fat" bytecodes, i.e. using 64bits per
bytecode instead of 8.  That allows to minimize the number of bytecodes,
thus having some performance increase.  TBH, I don't think it was
"significantly faster".



From https://code.google.com/archive/p/wpython/


WPython is a re-implementation of (some parts of) Python, which drops 
support for bytecode in favour of a wordcode-based model (where a 
word is 16 bits wide).


It also implements a hybrid stack-register virtual machine, and adds a 
lot of other optimizations.



--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence



Re: [Python-Dev] Speeding up CPython 5-10%

2016-02-01 Thread Brett Cannon
On Mon, 1 Feb 2016 at 10:21 Sven R. Kunze  wrote:

>
>
> On 01.02.2016 18:18, Brett Cannon wrote:
>
>
>
> On Mon, 1 Feb 2016 at 09:08 Yury Selivanov < 
> yselivanov...@gmail.com> wrote:
>
>>
>>
>> On 2016-01-29 11:28 PM, Steven D'Aprano wrote:
>> > On Wed, Jan 27, 2016 at 01:25:27PM -0500, Yury Selivanov wrote:
>> >> Hi,
>> >>
>> >>
>> >> tl;dr The summary is that I have a patch that improves CPython
>> >> performance up to 5-10% on macro benchmarks.  Benchmarks results on
>> >> Macbook Pro/Mac OS X, desktop CPU/Linux, server CPU/Linux are available
>> >> at [1].  There are no slowdowns that I could reproduce consistently.
>> > Have you looked at Cesare Di Mauro's wpython? As far as I know, it's now
>> > unmaintained, and the project repo on Google Code appears to be dead (I
>> > get a 404), but I understand that it was significantly faster than
>> > CPython back in the 2.6 days.
>> >
>> >
>> https://wpython.googlecode.com/files/Beyond%20Bytecode%20-%20A%20Wordcode-based%20Python.pdf
>> >
>> >
>>
>> Thanks for bringing this up!
>>
>> IIRC wpython was about using "fat" bytecodes, i.e. using 64bits per
>> bytecode instead of 8.  That allows to minimize the number of bytecodes,
>> thus having some performance increase.  TBH, I don't think it was
>> "significantly faster".
>>
>> If I were to do some big refactoring of the ceval loop, I'd probably
>> consider implementing a register VM.  While register VMs are a bit
>> faster than stack VMs (up to 20-30%), they would also allow us to apply
>> more optimizations, and even bolt on a simple JIT compiler.
>>
>
> If you did tackle the register VM approach that would also settle a
> long-standing question of whether a certain optimization works for Python.
>
>
> Are there some resources on why register machines are considered faster
> than stack machines?
>

A search for [stack vs register based virtual machine] will get you some
information.


>
>
> As for bolting on a JIT, the whole point of Pyjion is to see if that's
> worth it for CPython, so that's already being taken care of (and is
> actually easier with a stack-based VM since the JIT engine we're using is
> stack-based itself).
>
>
> Interesting. Haven't noticed these projects, yet.
>

You aren't really supposed to yet. :) In Pyjion's case we are still working
on compatibility, let alone trying to show a speed improvement so we have
not said much beyond this mailing list (we have a talk proposal in for
PyCon US that we hope gets accepted). We just happened to get picked up on
Reddit and HN recently and so interest has spiked in the project.


>
> So, it could be that we will see a jitted CPython when Pyjion appears to
> be successful?
>

The ability to plug in a JIT, but yes, that's the hope.


Re: [Python-Dev] Speeding up CPython 5-10%

2016-02-01 Thread Sven R. Kunze



On 01.02.2016 18:18, Brett Cannon wrote:



On Mon, 1 Feb 2016 at 09:08 Yury Selivanov > wrote:




On 2016-01-29 11:28 PM, Steven D'Aprano wrote:
> On Wed, Jan 27, 2016 at 01:25:27PM -0500, Yury Selivanov wrote:
>> Hi,
>>
>>
>> tl;dr The summary is that I have a patch that improves CPython
>> performance up to 5-10% on macro benchmarks. Benchmarks results on
>> Macbook Pro/Mac OS X, desktop CPU/Linux, server CPU/Linux are
available
>> at [1].  There are no slowdowns that I could reproduce
consistently.
> Have you looked at Cesare Di Mauro's wpython? As far as I know,
it's now
> unmaintained, and the project repo on Google Code appears to be
dead (I
> get a 404), but I understand that it was significantly faster than
> CPython back in the 2.6 days.
>
>

https://wpython.googlecode.com/files/Beyond%20Bytecode%20-%20A%20Wordcode-based%20Python.pdf
>
>

Thanks for bringing this up!

IIRC wpython was about using "fat" bytecodes, i.e. using 64bits per
bytecode instead of 8.  That allows to minimize the number of
bytecodes,
thus having some performance increase.  TBH, I don't think it was
"significantly faster".

If I were to do some big refactoring of the ceval loop, I'd probably
consider implementing a register VM.  While register VMs are a bit
faster than stack VMs (up to 20-30%), they would also allow us to
apply
more optimizations, and even bolt on a simple JIT compiler.


If you did tackle the register VM approach that would also settle a 
long-standing question of whether a certain optimization works for Python.


Are there some resources on why register machines are considered faster 
than stack machines?


As for bolting on a JIT, the whole point of Pyjion is to see if that's 
worth it for CPython, so that's already being taken care of (and is 
actually easier with a stack-based VM since the JIT engine we're using 
is stack-based itself).


Interesting. Haven't noticed these projects, yet.

So, it could be that we will see a jitted CPython when Pyjion appears to 
be successful?


Best,
Sven


Re: [Python-Dev] Speeding up CPython 5-10%

2016-02-01 Thread Brett Cannon
On Mon, 1 Feb 2016 at 09:08 Yury Selivanov  wrote:

>
>
> On 2016-01-29 11:28 PM, Steven D'Aprano wrote:
> > On Wed, Jan 27, 2016 at 01:25:27PM -0500, Yury Selivanov wrote:
> >> Hi,
> >>
> >>
> >> tl;dr The summary is that I have a patch that improves CPython
> >> performance up to 5-10% on macro benchmarks.  Benchmarks results on
> >> Macbook Pro/Mac OS X, desktop CPU/Linux, server CPU/Linux are available
> >> at [1].  There are no slowdowns that I could reproduce consistently.
> > Have you looked at Cesare Di Mauro's wpython? As far as I know, it's now
> > unmaintained, and the project repo on Google Code appears to be dead (I
> > get a 404), but I understand that it was significantly faster than
> > CPython back in the 2.6 days.
> >
> >
> https://wpython.googlecode.com/files/Beyond%20Bytecode%20-%20A%20Wordcode-based%20Python.pdf
> >
> >
>
> Thanks for bringing this up!
>
> IIRC wpython was about using "fat" bytecodes, i.e. using 64bits per
> bytecode instead of 8.  That allows to minimize the number of bytecodes,
> thus having some performance increase.  TBH, I don't think it was
> "significantly faster".
>
> If I were to do some big refactoring of the ceval loop, I'd probably
> consider implementing a register VM.  While register VMs are a bit
> faster than stack VMs (up to 20-30%), they would also allow us to apply
> more optimizations, and even bolt on a simple JIT compiler.
>

If you did tackle the register VM approach that would also settle a
long-standing question of whether a certain optimization works for Python.

As for bolting on a JIT, the whole point of Pyjion is to see if that's
worth it for CPython, so that's already being taken care of (and is
actually easier with a stack-based VM since the JIT engine we're using is
stack-based itself).


Re: [Python-Dev] Speeding up CPython 5-10%

2016-02-01 Thread Sven R. Kunze

On 01.02.2016 19:28, Brett Cannon wrote:
A search for [stack vs register based virtual machine] will get you 
some information.


Alright. :) Will go for that.

You aren't really supposed to yet. :) In Pyjion's case we are still 
working on compatibility, let alone trying to show a speed improvement 
so we have not said much beyond this mailing list (we have a talk 
proposal in for PyCon US that we hope gets accepted). We just happened 
to get picked up on Reddit and HN recently and so interest has spiked 
in the project.


Exciting. :)



So, it could be that we will see a jitted CPython when Pyjion
appears to be successful?


The ability to plug in a JIT, but yes, that's the hope.


Okay. Not sure what you mean by plugin. One thing I like about Python is 
that it just works. So, plugin sounds like unnecessary work.


Re: [Python-Dev] Speeding up CPython 5-10%

2016-02-01 Thread Sven R. Kunze

On 01.02.2016 17:54, Yury Selivanov wrote:
If I were to do some big refactoring of the ceval loop, I'd probably 
consider implementing a register VM.  While register VMs are a bit 
faster than stack VMs (up to 20-30%), they would also allow us to 
apply more optimizations, and even bolt on a simple JIT compiler.


How do a JIT and a register machine relate to each other? :)


Best,
Sven


Re: [Python-Dev] Speeding up CPython 5-10%

2016-02-01 Thread Yury Selivanov

Hi Brett,

On 2016-02-01 12:18 PM, Brett Cannon wrote:


On Mon, 1 Feb 2016 at 09:08 Yury Selivanov > wrote:



[..]

If I were to do some big refactoring of the ceval loop, I'd probably
consider implementing a register VM.  While register VMs are a bit
faster than stack VMs (up to 20-30%), they would also allow us to
apply
more optimizations, and even bolt on a simple JIT compiler.


[..]

As for bolting on a JIT, the whole point of Pyjion is to see if that's 
worth it for CPython, so that's already being taken care of (and is 
actually easier with a stack-based VM since the JIT engine we're using 
is stack-based itself).


Sure, I have very high hopes for Pyjion and Pyston.  I really hope that 
Microsoft and Dropbox will keep pushing.


Yury


Re: [Python-Dev] Speeding up CPython 5-10%

2016-02-01 Thread Yury Selivanov



On 2016-01-29 11:28 PM, Steven D'Aprano wrote:

On Wed, Jan 27, 2016 at 01:25:27PM -0500, Yury Selivanov wrote:

Hi,


tl;dr The summary is that I have a patch that improves CPython
performance up to 5-10% on macro benchmarks.  Benchmarks results on
Macbook Pro/Mac OS X, desktop CPU/Linux, server CPU/Linux are available
at [1].  There are no slowdowns that I could reproduce consistently.

Have you looked at Cesare Di Mauro's wpython? As far as I know, it's now
unmaintained, and the project repo on Google Code appears to be dead (I
get a 404), but I understand that it was significantly faster than
CPython back in the 2.6 days.

https://wpython.googlecode.com/files/Beyond%20Bytecode%20-%20A%20Wordcode-based%20Python.pdf




Thanks for bringing this up!

IIRC wpython was about using "fat" bytecodes, i.e. using 64bits per 
bytecode instead of 8.  That allows to minimize the number of bytecodes, 
thus having some performance increase.  TBH, I don't think it was 
"significantly faster".


If I were to do some big refactoring of the ceval loop, I'd probably 
consider implementing a register VM.  While register VMs are a bit 
faster than stack VMs (up to 20-30%), they would also allow us to apply 
more optimizations, and even bolt on a simple JIT compiler.


Yury


Re: [Python-Dev] Speeding up CPython 5-10%

2016-01-29 Thread Steven D'Aprano
On Wed, Jan 27, 2016 at 01:25:27PM -0500, Yury Selivanov wrote:
> Hi,
> 
> 
> tl;dr The summary is that I have a patch that improves CPython 
> performance up to 5-10% on macro benchmarks.  Benchmarks results on 
> Macbook Pro/Mac OS X, desktop CPU/Linux, server CPU/Linux are available 
> at [1].  There are no slowdowns that I could reproduce consistently.

Have you looked at Cesare Di Mauro's wpython? As far as I know, it's now 
unmaintained, and the project repo on Google Code appears to be dead (I 
get a 404), but I understand that it was significantly faster than 
CPython back in the 2.6 days.

https://wpython.googlecode.com/files/Beyond%20Bytecode%20-%20A%20Wordcode-based%20Python.pdf



-- 
Steve


Re: [Python-Dev] Speeding up CPython 5-10%

2016-01-29 Thread Yury Selivanov

Hi Damien,

BTW I just saw (and backed!) your new Kickstarter campaign
to port MicroPython to ESP8266, good stuff!

On 2016-01-29 7:38 AM, Damien George wrote:

Hi Yury,

[..]

Do you use opcode dictionary caching only for LOAD_GLOBAL-like
opcodes?  Do you have an equivalent of LOAD_FAST, or you use
dicts to store local variables?

The opcodes that have dict caching are:

LOAD_NAME
LOAD_GLOBAL
LOAD_ATTR
STORE_ATTR
LOAD_METHOD (not implemented yet in mainline repo)

For local variables we use LOAD_FAST and STORE_FAST (and DELETE_FAST).
Actually, there are 16 dedicated opcodes for loading from positions
0-15, and 16 for storing to these positions.  Eg:

LOAD_FAST_0
LOAD_FAST_1
...

Mostly this is done to save RAM, since LOAD_FAST_0 is 1 byte.


Interesting.  This might actually make CPython slightly faster
too.  Worth trying.




If we change the opcode size, it will probably affect libraries
that compose or modify code objects.  Modules like "dis" will
also need to be updated.  And that's probably just a tip of the
iceberg.

We can still implement your approach if we add a separate
private 'unsigned char' array to each code object, so that
LOAD_GLOBAL can store the key offsets.  It should be a bit
faster than my current patch, since it has one less level
of indirection.  But this way we lose the ability to
optimize LOAD_METHOD, simply because it requires more memory
for its cache.  In any case, I'll experiment!

The problem with that approach (having a separate array for offset_guess)
is: how do you know where to look in that array for a given
LOAD_GLOBAL opcode?  The second LOAD_GLOBAL in your bytecode should
look at the second entry in the array, but how does it know that?




I've changed my approach a little bit.  Now I have a simple
function [1] to initialize the cache for code objects that
are called frequently enough.

It walks through the code object's opcodes and creates the
appropriate offset/cache tables.

Then, in the ceval loop I have a couple of convenient macros
to work with the cache [2].  They use the INSTR_OFFSET() macro
to locate the cache entry via the offset table initialized
by [1].

Thanks,
Yury

[1] https://github.com/1st1/cpython/blob/opcache4/Objects/codeobject.c#L167
[2] https://github.com/1st1/cpython/blob/opcache4/Python/ceval.c#L1164


Re: [Python-Dev] Speeding up CPython 5-10%

2016-01-29 Thread Yury Selivanov



On 2016-01-29 5:00 AM, Stefan Behnel wrote:

Yury Selivanov wrote on 27.01.2016 at 19:25:

[..]

LOAD_METHOD looks at the object on top of the stack, and checks if the name
resolves to a method or to a regular attribute.  If it's a method, then we
push the unbound method object and the object to the stack.  If it's an
attribute, we push the resolved attribute and NULL.

When CALL_METHOD looks at the stack it knows how to call the unbound method
properly (pushing the object as a first arg), or how to call a regular
callable.

This idea does make CPython around 2-4% faster.  And it surely doesn't make
it slower.  I think it's a safe bet to at least implement this optimization
in CPython 3.6.

So far, the patch only optimizes positional-only method calls. It's
possible to optimize all kinds of calls, but this will necessitate 3 more
opcodes (explained in the issue).  We'll need to do some careful
benchmarking to see if it's really needed.

I implemented a similar but simpler optimisation in Cython a while back:

http://blog.behnel.de/posts/faster-python-calls-in-cython-021.html

Instead of avoiding the creation of method objects, as you proposed, it
just normally calls getattr and if that returns a bound method object, it
uses inlined calling code that avoids re-packing the argument tuple.
Interestingly, I got speedups of 5-15% for some of the Python benchmarks,
but I don't quite remember which ones (at least raytrace and richards, I
think), nor do I recall the overall gain, which (I assume) is what you are
referring to with your 2-4% above. Might have been in the same order.


That's great!

I'm still working on the patch, but so far it looks like adding
just LOAD_METHOD/CALL_METHOD (that avoid instantiating BoundMethods)
gives us 10-15% faster method calls.

Combining them with my opcode cache makes them 30-35% faster.

Yury


Re: [Python-Dev] Speeding up CPython 5-10%

2016-01-29 Thread Damien George
Hi Yury,

> An off-topic question: have you ever tried hg.python.org/benchmarks
> or compared MicroPython vs CPython?  I'm curious if MicroPython
> is faster -- in that case we'll try to copy some optimization
> ideas.

I've tried a small number of those benchmarks, but not in any rigorous
way, and not enough to compare properly with CPython.  Maybe one day I
(or someone) will get to it and report results :)

One thing that makes MP fast is the use of pointer tagging and
stuffing of small integers within object pointers.  Thus integer
arithmetic below 2**30 (on 32-bit arch) requires no heap.
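
As a rough illustration of that representation (the tag layout below is
made up, not MicroPython's actual one): a small integer is shifted left
and tagged in the low bit of the object word, so creating and adding such
integers never allocates.

#include <stdint.h>
#include <stdio.h>

typedef uintptr_t obj_t;   /* a tagged object word: a pointer or a small int */

static int      obj_is_small_int(obj_t o)  { return (o & 1) != 0; }
static obj_t    small_int_new(intptr_t i)  { return ((uintptr_t)i << 1) | 1; }
static intptr_t small_int_value(obj_t o)   { return (intptr_t)o >> 1; }

int
main(void)
{
    obj_t a = small_int_new(20);
    obj_t b = small_int_new(22);
    if (obj_is_small_int(a) && obj_is_small_int(b)) {
        /* integer add with no heap allocation at all */
        obj_t sum = small_int_new(small_int_value(a) + small_int_value(b));
        printf("%ld\n", (long)small_int_value(sum));   /* prints 42 */
    }
    return 0;
}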

> Do you use opcode dictionary caching only for LOAD_GLOBAL-like
> opcodes?  Do you have an equivalent of LOAD_FAST, or do you use
> dicts to store local variables?

The opcodes that have dict caching are:

LOAD_NAME
LOAD_GLOBAL
LOAD_ATTR
STORE_ATTR
LOAD_METHOD (not implemented yet in mainline repo)

For local variables we use LOAD_FAST and STORE_FAST (and DELETE_FAST).
Actually, there are 16 dedicated opcodes for loading from positions
0-15, and 16 for storing to these positions.  Eg:

LOAD_FAST_0
LOAD_FAST_1
...

Mostly this is done to save RAM, since LOAD_FAST_0 is 1 byte.
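
A toy rendering of those dedicated opcodes (illustrative, not MicroPython's
real bytecode): the opcode value itself encodes the local slot, so the
common cases need no argument byte at all.

#include <stdio.h>

enum { LOAD_FAST_0, LOAD_FAST_1, LOAD_FAST_2, LOAD_FAST_N, OP_END };

int
main(void)
{
    int fastlocals[4] = { 10, 20, 30, 40 };
    unsigned char code[] = { LOAD_FAST_1, LOAD_FAST_N, 3, OP_END };

    for (int pc = 0; code[pc] != OP_END; ) {
        unsigned char op = code[pc++];
        if (op <= LOAD_FAST_2)
            /* the opcode doubles as the local index: 1-byte instruction */
            printf("push %d (1-byte opcode)\n", fastlocals[op]);
        else
            /* generic form: opcode plus one argument byte */
            printf("push %d (opcode + argument byte)\n", fastlocals[code[pc++]]);
    }
    return 0;
}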

> If we change the opcode size, it will probably affect libraries
> that compose or modify code objects.  Modules like "dis" will
> also need to be updated.  And that's probably just the tip of the
> iceberg.
>
> We can still implement your approach if we add a separate
> private 'unsigned char' array to each code object, so that
> LOAD_GLOBAL can store the key offsets.  It should be a bit
> faster than my current patch, since it has one less level
> of indirection.  But this way we lose the ability to
> optimize LOAD_METHOD, simply because it requires more memory
> for its cache.  In any case, I'll experiment!

The problem with that approach (having a separate array for offset_guess)
is: how do you know where to look in that array for a given
LOAD_GLOBAL opcode?  The second LOAD_GLOBAL in your bytecode should
look at the second entry in the array, but how does it know?

I'd love to experiment implementing my original caching idea with
CPython, but no time!

Cheers,
Damien.


Re: [Python-Dev] Speeding up CPython 5-10%

2016-01-29 Thread Stefan Behnel
Yury Selivanov schrieb am 27.01.2016 um 19:25:
> tl;dr The summary is that I have a patch that improves CPython performance
> up to 5-10% on macro benchmarks.  Benchmarks results on Macbook Pro/Mac OS
> X, desktop CPU/Linux, server CPU/Linux are available at [1].  There are no
> slowdowns that I could reproduce consistently.
> 
> There are two different optimizations that yield this speedup:
> LOAD_METHOD/CALL_METHOD opcodes and per-opcode cache in ceval loop.
> 
> LOAD_METHOD & CALL_METHOD
> -
> 
> We had a lot of conversations with Victor about his PEP 509, and he sent me
> a link to his amazing compilation of notes about CPython performance [2]. 
> One optimization that he pointed out to me was LOAD/CALL_METHOD opcodes, an
> idea first originated in PyPy.
> 
> There is a patch that implements this optimization, it's tracked here:
> [3].  There are some low level details that I explained in the issue, but
> I'll go over the high level design in this email as well.
> 
> Every time you access a method attribute on an object, a BoundMethod object
> is created. It is a fairly expensive operation, despite a freelist of
> BoundMethods (so that memory allocation is generally avoided).  The idea is
> to detect what looks like a method call in the compiler, and emit a pair of
> specialized bytecodes for that.
> 
> So instead of LOAD_GLOBAL/LOAD_ATTR/CALL_FUNCTION we will have
> LOAD_GLOBAL/LOAD_METHOD/CALL_METHOD.
> 
> LOAD_METHOD looks at the object on top of the stack, and checks if the name
> resolves to a method or to a regular attribute.  If it's a method, then we
> push the unbound method object and the object to the stack.  If it's an
> attribute, we push the resolved attribute and NULL.
> 
> When CALL_METHOD looks at the stack it knows how to call the unbound method
> properly (pushing the object as a first arg), or how to call a regular
> callable.
> 
> This idea does make CPython faster by around 2-4%.  And it surely doesn't make
> it slower.  I think it's a safe bet to at least implement this optimization
> in CPython 3.6.
> 
> So far, the patch only optimizes positional-only method calls. It's
> possible to optimize all kinds of calls, but this will necessitate 3 more
> opcodes (explained in the issue).  We'll need to do some careful
> benchmarking to see if it's really needed.

I implemented a similar but simpler optimisation in Cython a while back:

http://blog.behnel.de/posts/faster-python-calls-in-cython-021.html

Instead of avoiding the creation of method objects, as you proposed, it
just normally calls getattr and if that returns a bound method object, it
uses inlined calling code that avoids re-packing the argument tuple.
Interestingly, I got speedups of 5-15% for some of the Python benchmarks,
but I don't quite remember which ones (at least raytrace and richards, I
think), nor do I recall the overall gain, which (I assume) is what you are
referring to with your 2-4% above. Might have been in the same order.
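
Very roughly, and only as a sketch of the idea rather than the code Cython
actually generates, the fast path looks like this against the public C API:
do the normal getattr, and if the result turns out to be a bound method,
call the wrapped function directly with self passed explicitly.

#include <Python.h>

/* Call obj.<name>(arg) but, if the attribute is a bound method, skip the
 * bound-method indirection and call the wrapped function with 'self'
 * as the first argument. */
static PyObject *
call_method_one_arg(PyObject *obj, PyObject *name, PyObject *arg)
{
    PyObject *attr = PyObject_GetAttr(obj, name);
    if (attr == NULL)
        return NULL;

    PyObject *result;
    if (PyMethod_Check(attr) && PyMethod_GET_SELF(attr) != NULL) {
        PyObject *func = PyMethod_GET_FUNCTION(attr);  /* borrowed */
        PyObject *self = PyMethod_GET_SELF(attr);      /* borrowed */
        result = PyObject_CallFunctionObjArgs(func, self, arg, NULL);
    }
    else {
        result = PyObject_CallFunctionObjArgs(attr, arg, NULL);
    }
    Py_DECREF(attr);
    return result;
}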

Stefan




Re: [Python-Dev] Speeding up CPython 5-10%

2016-01-27 Thread Yury Selivanov



On 2016-01-27 3:46 PM, Glenn Linderman wrote:

On 1/27/2016 12:37 PM, Yury Selivanov wrote:




MicroPython also has dictionary lookup caching, but it's a bit
different to your proposal.  We do something much simpler: each opcode
that has a cache ability (eg LOAD_GLOBAL, STORE_GLOBAL, LOAD_ATTR,
etc) includes a single byte in the opcode which is an offset-guess
into the dictionary to find the desired element.  Eg for LOAD_GLOBAL
we have (pseudo code):

CASE(LOAD_GLOBAL):
    key = DECODE_KEY;
    offset_guess = DECODE_BYTE;
    if (global_dict[offset_guess].key == key) {
        // found the element straight away
    } else {
        // not found, do a full lookup and save the offset
        offset_guess = dict_lookup(global_dict, key);
        UPDATE_BYTECODE(offset_guess);
    }
    PUSH(global_dict[offset_guess].elem);

We have found that such caching gives a massive performance increase,
on the order of 20%.  The issue (for us) is that it increases bytecode
size by a considerable amount, requires writeable bytecode, and can be
non-deterministic in terms of lookup time.  Those things are important
in the embedded world, but not so much on the desktop.


That's a neat idea!  You're right, it does require bytecode to become 
writeable.


Would it?

Remember "fixup lists"?  Maybe they still exist for loading function 
addresses from one DLL into the code of another at load time?


So the equivalent for bytecode requires a static table of 
offset_guess, and the offsets into that table are allocated by the 
byte-code loader at byte-code load time, and the byte-code is "fixed 
up" at load time to use the correct offsets into the offset_guess 
table.  It takes one more indirection to find the guess, but if the 
result is a 20% improvement, maybe you'd still get 19%...


Right, in my current patch I have an offset table per code object. 
Essentially, this offset table adds 8 bits per opcode.  It also means 
that only the first 255 LOAD_GLOBAL/LOAD_METHOD opcodes *per-code-object* 
are optimized (because the offset table can only store 8-bit offsets), 
which is usually enough (I think you need a function of more than 500 
lines of code to reach that limit).


Yury


Re: [Python-Dev] Speeding up CPython 5-10%

2016-01-27 Thread Glenn Linderman

On 1/27/2016 12:37 PM, Yury Selivanov wrote:




MicroPython also has dictionary lookup caching, but it's a bit
different to your proposal.  We do something much simpler: each opcode
that has a cache ability (eg LOAD_GLOBAL, STORE_GLOBAL, LOAD_ATTR,
etc) includes a single byte in the opcode which is an offset-guess
into the dictionary to find the desired element.  Eg for LOAD_GLOBAL
we have (pseudo code):

CASE(LOAD_GLOBAL):
    key = DECODE_KEY;
    offset_guess = DECODE_BYTE;
    if (global_dict[offset_guess].key == key) {
        // found the element straight away
    } else {
        // not found, do a full lookup and save the offset
        offset_guess = dict_lookup(global_dict, key);
        UPDATE_BYTECODE(offset_guess);
    }
    PUSH(global_dict[offset_guess].elem);

We have found that such caching gives a massive performance increase,
on the order of 20%.  The issue (for us) is that it increases bytecode
size by a considerable amount, requires writeable bytecode, and can be
non-deterministic in terms of lookup time.  Those things are important
in the embedded world, but not so much on the desktop.


That's a neat idea!  You're right, it does require bytecode to become 
writeable.


Would it?

Remember "fixup lists"?  Maybe they still exist for loading function 
addresses from one DLL into the code of another at load time?


So the equivalent for bytecode requires a static table of offset_guess, 
and the offsets into that table are allocated by the byte-code loader at 
byte-code load time, and the byte-code is "fixed up" at load time to use 
the correct offsets into the offset_guess table.  It takes one more 
indirection to find the guess, but if the result is a 20% improvement, 
maybe you'd still get 19%...
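
Here is a purely hypothetical toy of that load-time fixup (nothing like
real CPython code objects): the loader patches each cacheable instruction's
operand once, assigning it an index into the static guess table, and
afterwards only the table is ever written, never the bytecode.

#include <stdint.h>
#include <stdio.h>

enum { OP_NOP, OP_LOAD_GLOBAL };

int
main(void)
{
    /* (opcode, operand) pairs; LOAD_GLOBAL operands get patched below */
    uint8_t code[] = { OP_LOAD_GLOBAL, 0, OP_NOP, 0, OP_LOAD_GLOBAL, 0 };
    uint8_t offset_guess[16] = {0};   /* the writable guess table */
    uint8_t next_slot = 0;

    /* load-time "fixup" pass: assign each LOAD_GLOBAL its table slot */
    for (size_t i = 0; i < sizeof(code); i += 2) {
        if (code[i] == OP_LOAD_GLOBAL)
            code[i + 1] = next_slot++;
    }

    /* at run time the bytecode stays untouched; only the table is updated */
    for (size_t i = 0; i < sizeof(code); i += 2) {
        if (code[i] == OP_LOAD_GLOBAL)
            printf("LOAD_GLOBAL at %zu uses guess slot %u (current guess %u)\n",
                   i, (unsigned)code[i + 1],
                   (unsigned)offset_guess[code[i + 1]]);
    }
    return 0;
}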





Re: [Python-Dev] Speeding up CPython 5-10%

2016-01-27 Thread Yury Selivanov

Damien,

On 2016-01-27 4:20 PM, Damien George wrote:

Hi Yury,

(Sorry for misspelling your name previously!)


NP.  As long as the first letter is "y" I don't care ;)




Yes, we'll need to add CALL_METHOD{_VAR|_KW|etc} opcodes to optimize all
kinds of method calls.  However, I'm not sure how big the impact will be,
need to do more benchmarking.

I never did such fine grained analysis with MicroPython.  I don't
think there are many uses of * and ** that it'd be worth it, but
definitely there are lots of uses of plain keywords.  Also, you'd want
to consider how simple/complex it is to treat all these different
opcodes in the compiler.  For us, it's simpler to treat everything the
same.  Otherwise your LOAD_METHOD part of the compiler will need to
peek deep into the AST to see what kind of call it is.


BTW, how do you benchmark MicroPython?

Haha, good question!  Well, we use Pystone 1.2 (unmodified) to do
basic benchmarking, and find it to be quite good.  We track our code
live at:

http://micropython.org/resources/code-dashboard/


The dashboard is cool!

An off-topic: have you ever tried hg.python.org/benchmarks
or compared MicroPython vs CPython?  I'm curious if MicroPython
is faster -- in that case we'll try to copy some optimization
ideas.


You can see there the red line, which is the Pystone result.  There
was a big jump around Jan 2015 which is when we introduced opcode
dictionary caching.  And since then it's been very gradually
increasing due to small optimisations here and there.


Do you use opcode dictionary caching only for LOAD_GLOBAL-like
opcodes?  Do you have an equivalent of LOAD_FAST, or do you use
dicts to store local variables?


That's a neat idea!  You're right, it does require bytecode to become
writeable.  I considered implementing a similar strategy, but this would
be a big change for CPython.  So I decided to minimize the impact of the
patch and leave the opcodes untouched.

I think you need to consider "big" changes, especially ones like this
that can have a great (and good) impact.  But really, this is a
behind-the-scenes change that *should not* affect end users, and so
you should not have any second thoughts about doing it.


If we change the opcode size, it will probably affect libraries
that compose or modify code objects.  Modules like "dis" will
also need to be updated.  And that's probably just the tip of the
iceberg.

We can still implement your approach if we add a separate
private 'unsigned char' array to each code object, so that
LOAD_GLOBAL can store the key offsets.  It should be a bit
faster than my current patch, since it has one less level
of indirection.  But this way we lose the ability to
optimize LOAD_METHOD, simply because it requires more memory
for its cache.  In any case, I'll experiment!


One problem I
see with CPython is that it exposes way too much to the user (both
Python programmer and C extension writer) and this hurts both language
evolution (you constantly need to provide backwards compatibility) and
ability to optimise.


Right.  Even though CPython explicitly states that opcodes
and code objects might change in the future, we still have to
be careful about changing them.

Yury


Re: [Python-Dev] Speeding up CPython 5-10%

2016-01-27 Thread Damien George
Hi Yury,

(Sorry for misspelling your name previously!)

> Yes, we'll need to add CALL_METHOD{_VAR|_KW|etc} opcodes to optimize all
> kinds of method calls.  However, I'm not sure how big the impact will be,
> need to do more benchmarking.

I never did such fine grained analysis with MicroPython.  I don't
think there are many uses of * and ** that it'd be worth it, but
definitely there are lots of uses of plain keywords.  Also, you'd want
to consider how simple/complex it is to treat all these different
opcodes in the compiler.  For us, it's simpler to treat everything the
same.  Otherwise your LOAD_METHOD part of the compiler will need to
peek deep into the AST to see what kind of call it is.

> BTW, how do you benchmark MicroPython?

Haha, good question!  Well, we use Pystone 1.2 (unmodified) to do
basic benchmarking, and find it to be quite good.  We track our code
live at:

http://micropython.org/resources/code-dashboard/

You can see there the red line, which is the Pystone result.  There
was a big jump around Jan 2015 which is when we introduced opcode
dictionary caching.  And since then it's been very gradually
increasing due to small optimisations here and there.

Pystone is actually a great benchmark for embedded systems because it
gives very reliable results there (almost zero variation across runs)
and if we can squeeze 5 more Pystones out with some change then we
know that it's a good optimisation (for efficiency at least).

For us, low RAM usage and small code size are the most important
factors, and we track those meticulously.  But in fact, smaller code
size quite often correlates with more efficient code because there's
less to execute and it fits in the CPU cache (at least on the
desktop).

We do have some other benchmarks, but they are highly specialised for
us.  For example, how fast can you bit bang a GPIO pin using pure
Python code.  Currently we get around 200kHz on a 168MHz MCU, which
shows that pure (Micro)Python code is about 100 times slower than C.

> That's a neat idea!  You're right, it does require bytecode to become
> writeable.  I considered implementing a similar strategy, but this would
> be a big change for CPython.  So I decided to minimize the impact of the
> patch and leave the opcodes untouched.

I think you need to consider "big" changes, especially ones like this
that can have a great (and good) impact.  But really, this is a
behind-the-scenes change that *should not* affect end users, and so
you should not have any second thoughts about doing it.  One problem I
see with CPython is that it exposes way too much to the user (both
Python programmer and C extension writer) and this hurts both language
evolution (you constantly need to provide backwards compatibility) and
ability to optimise.

Cheers,
Damien.


Re: [Python-Dev] Speeding up CPython 5-10%

2016-01-27 Thread Yury Selivanov

BTW, this optimization also makes some old optimization tricks obsolete.

1. No need to write 'def func(len=len)'.  Globals lookups will be fast.

2. No need to save bound methods:

obj = []
obj_append = obj.append
for _ in range(10**6):
   obj_append(something)

This hand-optimized code would now be only marginally faster than the plain 
'obj.append(...)' loop, because of LOAD_METHOD and how it's cached.



Yury


Re: [Python-Dev] Speeding up CPython 5-10%

2016-01-27 Thread Damien George
Hi Yuri,

I think these are great ideas to speed up CPython.  They are probably
the simplest yet most effective ways to get performance improvements
in the VM.

MicroPython has had LOAD_METHOD/CALL_METHOD from the start (inspired
by PyPy, and the main reason to have it is because you don't need to
allocate on the heap when doing a simple method call).  The specific
opcodes are:

LOAD_METHOD # same behaviour as you propose
CALL_METHOD # for calls with positional and/or keyword args
CALL_METHOD_VAR_KW # for calls with one or both of */**

We also have LOAD_ATTR, CALL_FUNCTION and CALL_FUNCTION_VAR_KW for
non-method calls.

MicroPython also has dictionary lookup caching, but it's a bit
different to your proposal.  We do something much simpler: each opcode
that has a cache ability (eg LOAD_GLOBAL, STORE_GLOBAL, LOAD_ATTR,
etc) includes a single byte in the opcode which is an offset-guess
into the dictionary to find the desired element.  Eg for LOAD_GLOBAL
we have (pseudo code):

CASE(LOAD_GLOBAL):
    key = DECODE_KEY;
    offset_guess = DECODE_BYTE;
    if (global_dict[offset_guess].key == key) {
        // found the element straight away
    } else {
        // not found, do a full lookup and save the offset
        offset_guess = dict_lookup(global_dict, key);
        UPDATE_BYTECODE(offset_guess);
    }
    PUSH(global_dict[offset_guess].elem);

We have found that such caching gives a massive performance increase,
on the order of 20%.  The issue (for us) is that it increases bytecode
size by a considerable amount, requires writeable bytecode, and can be
non-deterministic in terms of lookup time.  Those things are important
in the embedded world, but not so much on the desktop.

Good luck with it!

Regards,
Damien.


Re: [Python-Dev] Speeding up CPython 5-10%

2016-01-27 Thread Brett Cannon
On Wed, 27 Jan 2016 at 10:26 Yury Selivanov  wrote:

> Hi,
>
>
> tl;dr The summary is that I have a patch that improves CPython
> performance up to 5-10% on macro benchmarks.  Benchmarks results on
> Macbook Pro/Mac OS X, desktop CPU/Linux, server CPU/Linux are available
> at [1].  There are no slowdowns that I could reproduce consistently.
>
> There are two different optimizations that yield this speedup:
> LOAD_METHOD/CALL_METHOD opcodes and per-opcode cache in ceval loop.
>
>
> LOAD_METHOD & CALL_METHOD
> -
>
> We had a lot of conversations with Victor about his PEP 509, and he sent
> me a link to his amazing compilation of notes about CPython performance
> [2].  One optimization that he pointed out to me was LOAD/CALL_METHOD
> opcodes, an idea first originated in PyPy.
>
> There is a patch that implements this optimization, it's tracked here:
> [3].  There are some low level details that I explained in the issue,
> but I'll go over the high level design in this email as well.
>
> Every time you access a method attribute on an object, a BoundMethod
> object is created. It is a fairly expensive operation, despite a
> freelist of BoundMethods (so that memory allocation is generally
> avoided).  The idea is to detect what looks like a method call in the
> compiler, and emit a pair of specialized bytecodes for that.
>
> So instead of LOAD_GLOBAL/LOAD_ATTR/CALL_FUNCTION we will have
> LOAD_GLOBAL/LOAD_METHOD/CALL_METHOD.
>
> LOAD_METHOD looks at the object on top of the stack, and checks if the
> name resolves to a method or to a regular attribute.  If it's a method,
> then we push the unbound method object and the object to the stack.  If
> it's an attribute, we push the resolved attribute and NULL.
>
> When CALL_METHOD looks at the stack it knows how to call the unbound
> method properly (pushing the object as a first arg), or how to call a
> regular callable.
>
> This idea does make CPython faster by around 2-4%.  And it surely doesn't
> make it slower.  I think it's a safe bet to at least implement this
> optimization in CPython 3.6.
>
> So far, the patch only optimizes positional-only method calls. It's
> possible to optimize all kinds of calls, but this will necessitate 3 more
> opcodes (explained in the issue).  We'll need to do some careful
> benchmarking to see if it's really needed.
>
>
> Per-opcode cache in ceval
> -
>
> While reading PEP 509, I was thinking about how we can use
> dict->ma_version in ceval to speed up globals lookups.  One of the key
> assumptions (and this is what makes JITs possible) is that real-life
> programs don't modify globals and rebind builtins (often), and that most
> code paths operate on objects of the same type.
>
> In CPython, all pure Python functions have code objects.  When you call
> a function, ceval executes its code object in a frame. Frames contain
> contextual information, including pointers to the globals and builtins
> dict.  The key observation here is that almost all code objects always
> have same pointers to the globals (the module they were defined in) and
> to the builtins.  And it's not a good programming practice to mutate
> globals or rebind builtins.
>
> Let's look at this function:
>
> def spam():
>     print(ham)
>
> Here are its opcodes:
>
>    2           0 LOAD_GLOBAL              0 (print)
>                3 LOAD_GLOBAL              1 (ham)
>                6 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
>                9 POP_TOP
>               10 LOAD_CONST               0 (None)
>               13 RETURN_VALUE
>
> The opcodes we want to optimize are LOAD_GLOBAL, at offsets 0 and 3.  Let's
> look at the first one, which loads the 'print' function from builtins.  The
> opcode knows the following bits of information:
>
> - its offset (0),
> - its argument (0 -> 'print'),
> - its type (LOAD_GLOBAL).
>
> And these bits of information will *never* change.  So if this opcode
> could resolve the 'print' name (from globals or builtins, likely the
> latter) and save the pointer to it somewhere, along with
> globals->ma_version and builtins->ma_version, it could, on its second
> call, just load this cached info back, check that the globals and
> builtins dict haven't changed and push the cached ref to the stack.
> That would save it from doing two dict lookups.
>
> We can also optimize LOAD_METHOD.  There are high chances, that 'obj' in
> 'obj.method()' will be of the same type every time we execute the code
> object.  So if we'd have an opcodes cache, LOAD_METHOD could then cache
> a pointer to the resolved unbound method, a pointer to obj.__class__,
> and tp_version_tag of obj.__class__.  Then it would only need to check
> if the cached object type is the same (and that it wasn't modified) and
> that obj.__dict__ doesn't override 'method'.  Long story short, this
> caching really speeds up method calls on types implemented in C.
> list.append becomes very fast, because list doesn't have a __dict__, so

Re: [Python-Dev] Speeding up CPython 5-10%

2016-01-27 Thread Yury Selivanov
As Brett suggested, I've just run the benchmarks suite with memory 
tracking on.  The results are here: 
https://gist.github.com/1st1/1851afb2773526fd7c58


Looks like the memory increase is around 1%.

One synthetic micro-benchmark, unpack_sequence, which contains hundreds of 
lines that load a global variable and do nothing else, consumes 5% more memory.


Yury


Re: [Python-Dev] Speeding up CPython 5-10%

2016-01-27 Thread Yury Selivanov



On 2016-01-27 3:10 PM, Damien George wrote:

Hi Yuri,

I think these are great ideas to speed up CPython.  They are probably
the simplest yet most effective ways to get performance improvements
in the VM.


Thanks!



MicroPython has had LOAD_METHOD/CALL_METHOD from the start (inspired
by PyPy, and the main reason to have it is because you don't need to
allocate on the heap when doing a simple method call).  The specific
opcodes are:

LOAD_METHOD # same behaviour as you propose
CALL_METHOD # for calls with positional and/or keyword args
CALL_METHOD_VAR_KW # for calls with one or both of */**

We also have LOAD_ATTR, CALL_FUNCTION and CALL_FUNCTION_VAR_KW for
non-method calls.


Yes, we'll need to add CALL_METHOD{_VAR|_KW|etc} opcodes to optimize all 
kinds of method calls.  However, I'm not sure how big the impact will be, 
need to do more benchmarking.


BTW, how do you benchmark MicroPython?



MicroPython also has dictionary lookup caching, but it's a bit
different to your proposal.  We do something much simpler: each opcode
that has a cache ability (eg LOAD_GLOBAL, STORE_GLOBAL, LOAD_ATTR,
etc) includes a single byte in the opcode which is an offset-guess
into the dictionary to find the desired element.  Eg for LOAD_GLOBAL
we have (pseudo code):

CASE(LOAD_GLOBAL):
    key = DECODE_KEY;
    offset_guess = DECODE_BYTE;
    if (global_dict[offset_guess].key == key) {
        // found the element straight away
    } else {
        // not found, do a full lookup and save the offset
        offset_guess = dict_lookup(global_dict, key);
        UPDATE_BYTECODE(offset_guess);
    }
    PUSH(global_dict[offset_guess].elem);

We have found that such caching gives a massive performance increase,
on the order of 20%.  The issue (for us) is that it increases bytecode
size by a considerable amount, requires writeable bytecode, and can be
non-deterministic in terms of lookup time.  Those things are important
in the embedded world, but not so much on the desktop.


That's a neat idea!  You're right, it does require bytecode to become 
writeable.  I considered implementing a similar strategy, but this would 
be a big change for CPython.  So I decided to minimize the impact of the 
patch and leave the opcodes untouched.



Thanks!
Yury


[Python-Dev] Speeding up CPython 5-10%

2016-01-27 Thread Yury Selivanov

Hi,


tl;dr The summary is that I have a patch that improves CPython 
performance up to 5-10% on macro benchmarks.  Benchmarks results on 
Macbook Pro/Mac OS X, desktop CPU/Linux, server CPU/Linux are available 
at [1].  There are no slowdowns that I could reproduce consistently.


There are two different optimizations that yield this speedup: 
LOAD_METHOD/CALL_METHOD opcodes and per-opcode cache in ceval loop.



LOAD_METHOD & CALL_METHOD
-

We had a lot of conversations with Victor about his PEP 509, and he sent 
me a link to his amazing compilation of notes about CPython performance 
[2].  One optimization that he pointed out to me was LOAD/CALL_METHOD 
opcodes, an idea first originated in PyPy.


There is a patch that implements this optimization, it's tracked here: 
[3].  There are some low level details that I explained in the issue, 
but I'll go over the high level design in this email as well.


Every time you access a method attribute on an object, a BoundMethod 
object is created. It is a fairly expensive operation, despite a 
freelist of BoundMethods (so that memory allocation is generally 
avoided).  The idea is to detect what looks like a method call in the 
compiler, and emit a pair of specialized bytecodes for that.


So instead of LOAD_GLOBAL/LOAD_ATTR/CALL_FUNCTION we will have 
LOAD_GLOBAL/LOAD_METHOD/CALL_METHOD.


LOAD_METHOD looks at the object on top of the stack, and checks if the 
name resolves to a method or to a regular attribute.  If it's a method, 
then we push the unbound method object and the object to the stack.  If 
it's an attribute, we push the resolved attribute and NULL.


When CALL_METHOD looks at the stack it knows how to call the unbound 
method properly (pushing the object as a first arg), or how to call a 
regular callable.
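
A tiny self-contained model of that stack protocol (purely illustrative;
the real opcodes live in ceval.c): LOAD_METHOD leaves two slots on the
stack, and CALL_METHOD inspects the top slot to pick the calling
convention.

#include <stdio.h>
#include <stddef.h>

typedef struct { const char *name; } object;

static object *stack[8];
static int sp = 0;
static void    push(object *o) { stack[sp++] = o; }
static object *pop(void)       { return stack[--sp]; }

/* LOAD_METHOD: push (method, self) for a real method,
 * or (attribute, NULL) for anything else. */
static void
load_method(object *callable, object *self, int is_method)
{
    push(callable);
    push(is_method ? self : NULL);
}

/* CALL_METHOD: a non-NULL top slot means "prepend self and call". */
static void
call_method(void)
{
    object *self_or_null = pop();
    object *callable = pop();
    if (self_or_null != NULL)
        printf("call %s with %s as the first argument\n",
               callable->name, self_or_null->name);
    else
        printf("call %s as a regular callable\n", callable->name);
}

int
main(void)
{
    object obj = { "obj" }, method = { "obj.method" };
    load_method(&method, &obj, 1);
    call_method();
    return 0;
}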


This idea does make CPython faster by around 2-4%.  And it surely doesn't 
make it slower.  I think it's a safe bet to at least implement this 
optimization in CPython 3.6.


So far, the patch only optimizes positional-only method calls. It's 
possible to optimize all kinds of calls, but this will necessitate 3 more 
opcodes (explained in the issue).  We'll need to do some careful 
benchmarking to see if it's really needed.



Per-opcode cache in ceval
-

While reading PEP 509, I was thinking about how we can use 
dict->ma_version in ceval to speed up globals lookups.  One of the key 
assumptions (and this is what makes JITs possible) is that real-life 
programs don't modify globals and rebind builtins (often), and that most 
code paths operate on objects of the same type.


In CPython, all pure Python functions have code objects.  When you call 
a function, ceval executes its code object in a frame. Frames contain 
contextual information, including pointers to the globals and builtins 
dict.  The key observation here is that almost all code objects always 
have same pointers to the globals (the module they were defined in) and 
to the builtins.  And it's not a good programming practice to mutate 
globals or rebind builtins.


Let's look at this function:

def spam():
    print(ham)

Here are its opcodes:

  2           0 LOAD_GLOBAL              0 (print)
              3 LOAD_GLOBAL              1 (ham)
              6 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
              9 POP_TOP
             10 LOAD_CONST               0 (None)
             13 RETURN_VALUE

The opcodes we want to optimize are LOAD_GLOBAL, at offsets 0 and 3.  Let's 
look at the first one, which loads the 'print' function from builtins.  The 
opcode knows the following bits of information:


- its offset (0),
- its argument (0 -> 'print'),
- its type (LOAD_GLOBAL).

And these bits of information will *never* change.  So if this opcode 
could resolve the 'print' name (from globals or builtins, likely the 
latter) and save the pointer to it somewhere, along with 
globals->ma_version and builtins->ma_version, it could, on its second 
call, just load this cached info back, check that the globals and 
builtins dict haven't changed and push the cached ref to the stack.  
That would save it from doing two dict lookups.
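
Sketched out, such a cache entry and its fast path could look roughly like
this (field names are illustrative, not the actual patch):

#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint64_t globals_ver;    /* globals->ma_version when the entry was filled */
    uint64_t builtins_ver;   /* builtins->ma_version when the entry was filled */
    void    *cached_obj;     /* e.g. a pointer to the 'print' builtin */
} load_global_cache;

/* Fast path: two integer comparisons instead of two dict lookups.
 * Returns NULL on a miss; the caller then does the normal lookups
 * and refills the entry. */
static void *
load_global_fast(const load_global_cache *e,
                 uint64_t globals_ver, uint64_t builtins_ver)
{
    if (e->cached_obj != NULL &&
        e->globals_ver == globals_ver &&
        e->builtins_ver == builtins_ver)
    {
        return e->cached_obj;
    }
    return NULL;
}

On a hit the cost is two integer comparisons; on a miss the interpreter
falls back to the normal lookups and refills the entry.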


We can also optimize LOAD_METHOD.  There are high chances, that 'obj' in 
'obj.method()' will be of the same type every time we execute the code 
object.  So if we'd have an opcodes cache, LOAD_METHOD could then cache 
a pointer to the resolved unbound method, a pointer to obj.__class__, 
and tp_version_tag of obj.__class__.  Then it would only need to check 
if the cached object type is the same (and that it wasn't modified) and 
that obj.__dict__ doesn't override 'method'.  Long story short, this 
caching really speeds up method calls on types implemented in C.  
list.append becomes very fast, because list doesn't have a __dict__, so 
the check is very cheap (with cache).
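
The LOAD_METHOD side can be sketched the same way (again illustrative only):
the guard is the cached type pointer plus its tp_version_tag, and for a type
whose instances have no __dict__ (like list) that is the entire check.

#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint64_t tp_version_tag;     /* bumped whenever the type is mutated */
    int      has_instance_dict;  /* e.g. 0 for list instances */
} toy_type;

typedef struct {
    toy_type *cached_type;
    uint64_t  cached_tag;
    void     *cached_method;     /* the resolved unbound method */
} load_method_cache;

/* Fast path for obj.method(): valid only while the type is unchanged.
 * If the instance has a __dict__, the caller must additionally make sure
 * it doesn't shadow the method before using the cached pointer. */
static void *
load_method_fast(const load_method_cache *e, toy_type *tp)
{
    if (e->cached_method != NULL &&
        e->cached_type == tp &&
        e->cached_tag == tp->tp_version_tag &&
        !tp->has_instance_dict)
    {
        return e->cached_method;
    }
    return NULL;   /* miss: fall back to the normal attribute lookup */
}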


A straightforward way to implement such a cache is simple, but it consumes 
a lot of memory that would just be wasted, since we only need such a cache 
for LOAD_GLOBAL and LOAD_METHOD opcodes.

Re: [Python-Dev] Speeding up CPython 5-10%

2016-01-27 Thread Yury Selivanov



On 2016-01-27 3:01 PM, Brett Cannon wrote:



[..]


We can also optimize LOAD_METHOD.  There are high chances, that 'obj' in
'obj.method()' will be of the same type every time we execute the code
object.  So if we'd have an opcodes cache, LOAD_METHOD could then cache
a pointer to the resolved unbound method, a pointer to obj.__class__,
and tp_version_tag of obj.__class__.  Then it would only need to check
if the cached object type is the same (and that it wasn't modified) and
that obj.__dict__ doesn't override 'method'.  Long story short, this
caching really speeds up method calls on types implemented in C.
list.append becomes very fast, because list doesn't have a __dict__, so
the check is very cheap (with cache).


What would it take to make this work with Python-defined classes?


It already works for Python-defined classes.  But it's a bit more 
expensive because you still have to check the object's __dict__.  Still, 
there is a very noticeable performance increase (see the results of 
benchmark runs).


I guess that would require knowing the version of the instance's 
__dict__, the instance's __class__ version, the MRO, and where the 
method object was found in the MRO and any intermediary classes to 
know if it was suddenly shadowed? I think that's everything. :)


No, unfortunately we can't use the version of the instance's __dict__ as 
it is very volatile.  The current implementation of opcode cache works 
because types are much more stable.  Remember, the cache is per *code 
object*, so it has to work across all executions of the code object.


class F:
    def spam(self):
        self.ham()   # <- version of self.__dict__ is unstable,
                     #    so we'll end up invalidating the cache
                     #    too often

__class__ version, MRO changes etc are covered by tp_version_tag, which 
I use as one of the guards.




Obviously that's a lot, but I wonder how many classes have a deep 
inheritance model vs. inheriting only from `object`? In that case you 
only have to check self.__dict__.ma_version, self.__class__, 
self.__class__.__dict__.ma_version, and self.__class__.__class__ == 
`type`. I guess another way to look at this is to get an idea of how 
complex do the checks have to get before caching something like this 
is not worth it (probably also depends on how often you mutate 
self.__dict__ thanks to mutating attributes, but you could in that 
instance just decide to always look at self.__dict__ for the method's 
key and then do the ma_version cache check for everything coming from 
the class).


Otherwise we can consider looking at the caching strategies that 
Self helped pioneer (http://bibliography.selflanguage.org/) that all 
of the various JS engines lifted and consider caching all method lookups.


Yeah, hidden classes are great.  But the infrastructure to support them 
properly is huge.  I think that to make them work you'll need a JIT -- 
to trace, deoptimize, optimize, and do it all with a reasonable memory 
footprint.  My patch is much smaller and simpler, something we can 
realistically tune and ship in 3.6.




A straightforward way to implement such a cache is simple, but it consumes
a lot of memory that would just be wasted, since we only need such a cache
for LOAD_GLOBAL and LOAD_METHOD opcodes. So we have to be creative about
the cache design.  Here's what I came up with:

1. We add a few fields to the code object.

2. ceval will count how many times each code object is executed.

3. When the code object is executed over ~900 times, we mark it as
"hot".


What happens if you simply consider all code as hot? Is the overhead 
of building the mapping such that you really need this, or is this 
simply to avoid some memory/startup cost?


That's the first step for this patch.  I think we need to profile 
several big applications (I'll do it later for some of my code bases) 
and see how big the memory impact is if we optimize everything.


In any case, I expect it to be noticeable (which may be acceptable), so 
we'll probably try to optimize it.



  We also create an 'unsigned char' array "MAPPING", with length
set to match the length of the code object.  So we have a 1-to-1 mapping
between opcodes and MAPPING array.

4. Next ~100 calls, while the code object is "hot", LOAD_GLOBAL and
LOAD_METHOD do "MAPPING[opcode_offset()]++".

5. After 1024 calls to the code object, ceval loop will iterate through
the MAPPING, counting all opcodes that were executed more than 50 times.


Where did the "50 times" boundary come from? Was this measured somehow 
or did you just guess at a number?


If the number is too low, then you'll optimize code in branches that are 
rarely executed.  So I picked 50, because I only trace opcodes for 100 
calls.


All of those numbers can be (should be?) changed, and I think we should 
experiment with different heuristics.
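
Putting the warm-up scheme together as a toy model (the 900/100/50/1024
numbers come from the messages above; the structure and names are made up
for illustration):

#include <stdint.h>
#include <string.h>

#define HOT_AFTER_CALLS    900   /* mark the code object "hot"            */
#define OPTIMIZE_AT_CALLS 1024   /* then decide which opcodes to cache    */
#define MIN_OPCODE_RUNS     50   /* threshold over the ~100 traced calls  */
#define MAX_CODE_LEN       256

typedef struct {
    uint32_t ncalls;                 /* how many times this code object ran */
    uint8_t  mapping[MAX_CODE_LEN];  /* per-instruction run counters        */
    int      optimized;
} toy_code_object;

/* Called from LOAD_GLOBAL / LOAD_METHOD while the code object is hot
 * but not yet optimized. */
static void
trace_opcode(toy_code_object *co, uint8_t instr_offset)
{
    if (co->ncalls >= HOT_AFTER_CALLS && !co->optimized &&
        co->mapping[instr_offset] < UINT8_MAX)
        co->mapping[instr_offset]++;
}

/* Called on entry to the code object.  Returns the number of opcodes
 * that earned a cache slot once the trace window is over. */
static int
on_code_enter(toy_code_object *co)
{
    int promoted = 0;
    co->ncalls++;
    if (co->ncalls == HOT_AFTER_CALLS)
        memset(co->mapping, 0, sizeof(co->mapping));   /* start tracing */
    if (co->ncalls == OPTIMIZE_AT_CALLS) {
        for (int i = 0; i < MAX_CODE_LEN; i++) {
            if (co->mapping[i] >= MIN_OPCODE_RUNS)
                promoted++;            /* give instruction i a cache slot */
        }
        co->optimized = 1;
    }
    return promoted;
}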