Re: [Python-Dev] Speeding up CPython 5-10%

2016-01-27 Thread Damien George
Hi Yuri,

I think these are great ideas to speed up CPython.  They are probably
the simplest yet most effective ways to get performance improvements
in the VM.

MicroPython has had LOAD_METHOD/CALL_METHOD from the start (inspired
by PyPy, and the main reason to have it is because you don't need to
allocate on the heap when doing a simple method call).  The specific
opcodes are:

LOAD_METHOD # same behaviour as you propose
CALL_METHOD # for calls with positional and/or keyword args
CALL_METHOD_VAR_KW # for calls with one or both of */**

We also have LOAD_ATTR, CALL_FUNCTION and CALL_FUNCTION_VAR_KW for
non-method calls.
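
As a quick illustration of the allocation being avoided: in CPython today,
every attribute access on an instance method builds a fresh bound-method
object on the heap, even when it is immediately called and thrown away.
A minimal sketch:

class Spam:
    def ham(self):
        pass

obj = Spam()
# Each access creates a new bound-method object:
print(obj.ham is obj.ham)   # False -- two distinct bound-method objects
print(obj.ham == obj.ham)   # True  -- they wrap the same function and instance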

MicroPython also has dictionary lookup caching, but it's a bit
different to your proposal.  We do something much simpler: each opcode
that has a cache ability (eg LOAD_GLOBAL, STORE_GLOBAL, LOAD_ATTR,
etc) includes a single byte in the opcode which is an offset-guess
into the dictionary to find the desired element.  Eg for LOAD_GLOBAL
we have (pseudo code):

CASE(LOAD_GLOBAL):
    key = DECODE_KEY;
    offset_guess = DECODE_BYTE;
    if (global_dict[offset_guess].key == key) {
        // found the element straight away
    } else {
        // not found, do a full lookup and save the offset
        offset_guess = dict_lookup(global_dict, key);
        UPDATE_BYTECODE(offset_guess);
    }
    PUSH(global_dict[offset_guess].elem);

We have found that such caching gives a massive performance increase,
on the order of 20%.  The issue (for us) is that it increases bytecode
size by a considerable amount, requires writeable bytecode, and can be
non-deterministic in terms of lookup time.  Those things are important
in the embedded world, but not so much on the desktop.
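
For readers who want to play with the idea, here is a rough Python model
of the offset-guess trick.  The names and structure are made up for
illustration; this is not MicroPython's actual implementation:

class GuessDict:
    """Toy stand-in for a dict whose entry table we can index directly."""

    def __init__(self, d):
        self.entries = list(d.items())        # models the internal entry array

    def lookup(self, key, guess):
        """Return (value, new_guess): try the cached slot before a full scan."""
        if guess < len(self.entries) and self.entries[guess][0] == key:
            return self.entries[guess][1], guess       # hit: a single comparison
        for i, (k, v) in enumerate(self.entries):      # miss: do a full lookup
            if k == key:
                return v, i                            # remember where we found it
        raise KeyError(key)

globals_dict = GuessDict({"print": print, "len": len})
value, guess = globals_dict.lookup("len", 0)       # miss: guess becomes 1
value, guess = globals_dict.lookup("len", guess)   # hit on the cached offset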

Good luck with it!

Regards,
Damien.


Re: [Python-Dev] Speeding up CPython 5-10%

2016-01-27 Thread Yury Selivanov

BTW, this optimization also makes some old optimization tricks obsolete.

1. No need to write 'def func(len=len)' (see the sketch after point 2 below).  Globals lookups will be fast.

2. No need to save bound methods:

obj = []
obj_append = obj.append
for _ in range(10**6):
   obj_append(something)

This hand-optimized code would only be marginally faster, because of 
LOAD_METHOD and how it's cached.
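
For reference, the first trick looks like this -- the builtin is bound to a
default argument so it is read with LOAD_FAST instead of a globals/builtins
lookup (a small sketch):

def count_nonempty(lines, len=len):   # 'len' becomes a fast local
    total = 0
    for line in lines:
        if len(line) > 0:
            total += 1
    return total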



Yury


Re: [Python-Dev] Speeding up CPython 5-10%

2016-01-27 Thread Yury Selivanov



On 2016-01-27 3:46 PM, Glenn Linderman wrote:

On 1/27/2016 12:37 PM, Yury Selivanov wrote:




MicroPython also has dictionary lookup caching, but it's a bit
different to your proposal.  We do something much simpler: each opcode
that has a cache ability (eg LOAD_GLOBAL, STORE_GLOBAL, LOAD_ATTR,
etc) includes a single byte in the opcode which is an offset-guess
into the dictionary to find the desired element.  Eg for LOAD_GLOBAL
we have (pseudo code):

CASE(LOAD_GLOBAL):
key = DECODE_KEY;
offset_guess = DECODE_BYTE;
if (global_dict[offset_guess].key == key) {
 // found the element straight away
} else {
 // not found, do a full lookup and save the offset
 offset_guess = dict_lookup(global_dict, key);
 UPDATE_BYTECODE(offset_guess);
}
PUSH(global_dict[offset_guess].elem);

We have found that such caching gives a massive performance increase,
on the order of 20%.  The issue (for us) is that it increases bytecode
size by a considerable amount, requires writeable bytecode, and can be
non-deterministic in terms of lookup time.  Those things are important
in the embedded world, but not so much on the desktop.


That's a neat idea!  You're right, it does require bytecode to become 
writeable.


Would it?

Remember "fixup lists"?  Maybe they still exist for loading function 
addresses from one DLL into the code of another at load time?


So the equivalent for bytecode requires a static table of 
offset_guess, and the offsets into that table are allocated by the 
byte-code loader at byte-code load time, and the byte-code is "fixed 
up" at load time to use the correct offsets into the offset_guess 
table.  It takes one more indirection to find the guess, but if the 
result is a 20% improvement, maybe you'd still get 19%...


Right, in my current patch I have an offset table per code object. 
Essentially, this offset table adds 8 bits per opcode.  It also means 
that only the first 255 LOAD_GLOBAL/LOAD_METHOD opcodes *per code object* 
are optimized (because the offset table can only store 8-bit offsets), 
which is usually enough (I think a function would need to be longer than 
about 500 lines of code to reach that limit).
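
A rough Python model of that layout, with made-up names -- an 8-bit slot
index per opcode pointing into a small per-code-object array of cache
entries:

class CodeObjectCache:
    def __init__(self, n_opcodes):
        self.offsets = bytearray(n_opcodes)   # one byte per opcode; 0 means "no slot"
        self.entries = [None] * 256           # slots 1..255 hold the cache entries
        self.next_slot = 1

    def slot_for(self, opcode_index):
        """Lazily assign a cache slot to an optimized opcode (at most 255 of them)."""
        if self.offsets[opcode_index] == 0 and self.next_slot <= 255:
            self.offsets[opcode_index] = self.next_slot
            self.next_slot += 1
        idx = self.offsets[opcode_index]
        return idx if idx else None           # None: out of slots, stay unoptimized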


Yury


Re: [Python-Dev] Speeding up CPython 5-10%

2016-01-27 Thread Glenn Linderman

On 1/27/2016 12:37 PM, Yury Selivanov wrote:




MicroPython also has dictionary lookup caching, but it's a bit
different to your proposal.  We do something much simpler: each opcode
that has a cache ability (eg LOAD_GLOBAL, STORE_GLOBAL, LOAD_ATTR,
etc) includes a single byte in the opcode which is an offset-guess
into the dictionary to find the desired element.  Eg for LOAD_GLOBAL
we have (pseudo code):

CASE(LOAD_GLOBAL):
key = DECODE_KEY;
offset_guess = DECODE_BYTE;
if (global_dict[offset_guess].key == key) {
 // found the element straight away
} else {
 // not found, do a full lookup and save the offset
 offset_guess = dict_lookup(global_dict, key);
 UPDATE_BYTECODE(offset_guess);
}
PUSH(global_dict[offset_guess].elem);

We have found that such caching gives a massive performance increase,
on the order of 20%.  The issue (for us) is that it increases bytecode
size by a considerable amount, requires writeable bytecode, and can be
non-deterministic in terms of lookup time.  Those things are important
in the embedded world, but not so much on the desktop.


That's a neat idea!  You're right, it does require bytecode to become 
writeable.


Would it?

Remember "fixup lists"?  Maybe they still exist for loading function 
addresses from one DLL into the code of another at load time?


So the equivalent for bytecode requires a static table of offset_guess, 
and the offsets into that table are allocated by the byte-code loader at 
byte-code load time, and the byte-code is "fixed up" at load time to use 
the correct offsets into the offset_guess table.  It takes one more 
indirection to find the guess, but if the result is a 20% improvement, 
maybe you'd still get 19%...





Re: [Python-Dev] Speeding up CPython 5-10%

2016-01-27 Thread Brett Cannon
On Wed, 27 Jan 2016 at 10:26 Yury Selivanov  wrote:

> Hi,
>
>
> tl;dr The summary is that I have a patch that improves CPython
> performance up to 5-10% on macro benchmarks.  Benchmarks results on
> Macbook Pro/Mac OS X, desktop CPU/Linux, server CPU/Linux are available
> at [1].  There are no slowdowns that I could reproduce consistently.
>
> There are two different optimizations that yield this speedup:
> LOAD_METHOD/CALL_METHOD opcodes and per-opcode cache in ceval loop.
>
>
> LOAD_METHOD & CALL_METHOD
> -
>
> We had a lot of conversations with Victor about his PEP 509, and he sent
> me a link to his amazing compilation of notes about CPython performance
> [2].  One optimization that he pointed out to me was LOAD/CALL_METHOD
> opcodes, an idea first originated in PyPy.
>
> There is a patch that implements this optimization, it's tracked here:
> [3].  There are some low level details that I explained in the issue,
> but I'll go over the high level design in this email as well.
>
> Every time you access a method attribute on an object, a BoundMethod
> object is created. It is a fairly expensive operation, despite a
> freelist of BoundMethods (so that memory allocation is generally
> avoided).  The idea is to detect what looks like a method call in the
> compiler, and emit a pair of specialized bytecodes for that.
>
> So instead of LOAD_GLOBAL/LOAD_ATTR/CALL_FUNCTION we will have
> LOAD_GLOBAL/LOAD_METHOD/CALL_METHOD.
>
> LOAD_METHOD looks at the object on top of the stack, and checks if the
> name resolves to a method or to a regular attribute.  If it's a method,
> then we push the unbound method object and the object to the stack.  If
> it's an attribute, we push the resolved attribute and NULL.
>
> When CALL_METHOD looks at the stack it knows how to call the unbound
> method properly (pushing the object as a first arg), or how to call a
> regular callable.
>
> This idea does make CPython faster around 2-4%.  And it surely doesn't
> make it slower.  I think it's a safe bet to at least implement this
> optimization in CPython 3.6.
>
> So far, the patch only optimizes positional-only method calls. It's
> possible to optimize all kind of calls, but this will necessitate 3 more
> opcodes (explained in the issue).  We'll need to do some careful
> benchmarking to see if it's really needed.
>
>
> Per-opcode cache in ceval
> -
>
> While reading PEP 509, I was thinking about how we can use
> dict->ma_version in ceval to speed up globals lookups.  One of the key
> assumptions (and this is what makes JITs possible) is that real-life
> programs don't modify globals and rebind builtins (often), and that most
> code paths operate on objects of the same type.
>
> In CPython, all pure Python functions have code objects.  When you call
> a function, ceval executes its code object in a frame. Frames contain
> contextual information, including pointers to the globals and builtins
> dict.  The key observation here is that almost all code objects always
> have same pointers to the globals (the module they were defined in) and
> to the builtins.  And it's not a good programming practice to mutate
> globals or rebind builtins.
>
> Let's look at this function:
>
> def spam():
>  print(ham)
>
> Here are its opcodes:
>
>2   0 LOAD_GLOBAL  0 (print)
>3 LOAD_GLOBAL  1 (ham)
>6 CALL_FUNCTION1 (1 positional, 0 keyword pair)
>9 POP_TOP
>   10 LOAD_CONST   0 (None)
>   13 RETURN_VALUE
>
> The opcodes we want to optimize are the LOAD_GLOBALs at offsets 0 and 3.  Let's look at
> the first one, that loads the 'print' function from builtins.  The
> opcode knows the following bits of information:
>
> - its offset (0),
> - its argument (0 -> 'print'),
> - its type (LOAD_GLOBAL).
>
> And these bits of information will *never* change.  So if this opcode
> could resolve the 'print' name (from globals or builtins, likely the
> latter) and save the pointer to it somewhere, along with
> globals->ma_version and builtins->ma_version, it could, on its second
> call, just load this cached info back, check that the globals and
> builtins dict haven't changed and push the cached ref to the stack.
> That would save it from doing two dict lookups.
>
> We can also optimize LOAD_METHOD.  There are high chances, that 'obj' in
> 'obj.method()' will be of the same type every time we execute the code
> object.  So if we'd have an opcodes cache, LOAD_METHOD could then cache
> a pointer to the resolved unbound method, a pointer to obj.__class__,
> and tp_version_tag of obj.__class__.  Then it would only need to check
> if the cached object type is the same (and that it wasn't modified) and
> that obj.__dict__ doesn't override 'method'.  Long story short, this
> caching really speeds up method calls on types implemented in C.
> list.append becomes very fast, because list 

Re: [Python-Dev] Speeding up CPython 5-10%

2016-01-27 Thread Damien George
Hi Yury,

(Sorry for misspelling your name previously!)

> Yes, we'll need to add CALL_METHOD{_VAR|_KW|etc} opcodes to optimize all
> kind of method calls.  However, I'm not sure how big the impact will be,
> need to do more benchmarking.

I never did such fine-grained analysis with MicroPython.  I don't
think there are enough uses of * and ** for it to be worth it, but
there are definitely lots of uses of plain keywords.  Also, you'd want
to consider how simple/complex it is to treat all these different
opcodes in the compiler.  For us, it's simpler to treat everything the
same.  Otherwise your LOAD_METHOD part of the compiler will need to
peek deep into the AST to see what kind of call it is.

> BTW, how do you benchmark MicroPython?

Haha, good question!  Well, we use Pystone 1.2 (unmodified) to do
basic benchmarking, and find it to be quite good.  We track our code
live at:

http://micropython.org/resources/code-dashboard/

You can see there the red line, which is the Pystone result.  There
was a big jump around Jan 2015 which is when we introduced opcode
dictionary caching.  And since then it's been very gradually
increasing due to small optimisations here and there.

Pystone is actually a great benchmark for embedded systems because it
gives very reliable results there (almost zero variation across runs)
and if we can squeeze 5 more Pystones out with some change then we
know that it's a good optimisation (for efficiency at least).

For us, low RAM usage and small code size are the most important
factors, and we track those meticulously.  But in fact, smaller code
size quite often correlates with more efficient code because there's
less to execute and it fits in the CPU cache (at least on the
desktop).

We do have some other benchmarks, but they are highly specialised for
us.  For example, how fast can you bit bang a GPIO pin using pure
Python code.  Currently we get around 200kHz on a 168MHz MCU, which
shows that pure (Micro)Python code is about 100 times slower than C.

> That's a neat idea!  You're right, it does require bytecode to become
> writeable.  I considered implementing a similar strategy, but this would
> be a big change for CPython.  So I decided to minimize the impact of the
> patch and leave the opcodes untouched.

I think you need to consider "big" changes, especially ones like this
that can have a great (and good) impact.  But really, this is a
behind-the-scenes change that *should not* affect end users, and so
you should not have any second thoughts about doing it.  One problem I
see with CPython is that it exposes way too much to the user (both
Python programmer and C extension writer) and this hurts both language
evolution (you constantly need to provide backwards compatibility) and
ability to optimise.

Cheers,
Damien.


Re: [Python-Dev] Speeding up CPython 5-10%

2016-01-27 Thread Yury Selivanov

Damien,

On 2016-01-27 4:20 PM, Damien George wrote:

Hi Yury,

(Sorry for misspelling your name previously!)


NP.  As long as the first letter is "y" I don't care ;)




Yes, we'll need to add CALL_METHOD{_VAR|_KW|etc} opcodes to optimize all
kind of method calls.  However, I'm not sure how big the impact will be,
need to do more benchmarking.

I never did such fine grained analysis with MicroPython.  I don't
think there are many uses of * and ** that it'd be worth it, but
definitely there are lots of uses of plain keywords.  Also, you'd want
to consider how simple/complex it is to treat all these different
opcodes in the compiler.  For us, it's simpler to treat everything the
same.  Otherwise your LOAD_METHOD part of the compiler will need to
peek deep into the AST to see what kind of call it is.


BTW, how do you benchmark MicroPython?

Haha, good question!  Well, we use Pystone 1.2 (unmodified) to do
basic benchmarking, and find it to be quite good.  We track our code
live at:

http://micropython.org/resources/code-dashboard/


The dashboard is cool!

Slightly off-topic: have you ever tried hg.python.org/benchmarks
to compare MicroPython vs CPython?  I'm curious whether MicroPython
is faster -- in that case we'll try to copy some optimization
ideas.


You can see there the red line, which is the Pystone result.  There
was a big jump around Jan 2015 which is when we introduced opcode
dictionary caching.  And since then it's been very gradually
increasing due to small optimisations here and there.


Do you use opcode dictionary caching only for LOAD_GLOBAL-like
opcodes?  Do you have an equivalent of LOAD_FAST, or do you use
dicts to store local variables?


That's a neat idea!  You're right, it does require bytecode to become
writeable.  I considered implementing a similar strategy, but this would
be a big change for CPython.  So I decided to minimize the impact of the
patch and leave the opcodes untouched.

I think you need to consider "big" changes, especially ones like this
that can have a great (and good) impact.  But really, this is a
behind-the-scenes change that *should not* affect end users, and so
you should not have any second thoughts about doing it.


If we change the opcode size, it will probably affect libraries
that compose or modify code objects.  Modules like "dis" will
also need to be updated.  And that's probably just the tip of the 
iceberg.

We can still implement your approach if we add a separate
private 'unsigned char' array to each code object, so that
LOAD_GLOBAL can store the key offsets.  It should be a bit
faster than my current patch, since it has one less level
of indirection.  But this way we lose the ability to 
optimize LOAD_METHOD, simply because it requires more memory
for its cache.  In any case, I'll experiment!


One problem I
see with CPython is that it exposes way too much to the user (both
Python programmer and C extension writer) and this hurts both language
evolution (you constantly need to provide backwards compatibility) and
ability to optimise.


Right.  Even though CPython explicitly states that opcodes
and code objects might change in the future, we still have to
be careful about changing them.

Yury


Re: [Python-Dev] FAT Python (lack of) performance

2016-01-27 Thread Stephen J. Turnbull
Brett Cannon writes:

 > the core team has an implicit understanding that any performance
 > improvement is taken into consideration in terms of balancing
 > complexity in CPython with how much improvement it gets us.

EIBTI.  I can shut up now.  Thank you!



Re: [Python-Dev] [Python-checkins] Daily reference leaks (cbd4a6a2657e): sum=134

2016-01-27 Thread Victor Stinner
Hi,

I pushed a fix before you sent your message. At least test_ast should be
fixed.

https://hg.python.org/cpython/rev/c5df914e73ad

FYI I'm unable to reproduce the test_collections leak.

Victor

On Tuesday, 26 January 2016, Brett Cannon wrote:

> Looks like Victor's ast.Constant change introduced a refleak.
>
> On Tue, 26 Jan 2016 at 00:47  > wrote:
>
>> results for cbd4a6a2657e on branch "default"
>> 
>>
>> test_ast leaked [39, 39, 39] references, sum=117
>> test_ast leaked [5, 5, 5] memory blocks, sum=15
>> test_collections leaked [-2, 0, 0] references, sum=-2
>> test_functools leaked [0, 2, 2] memory blocks, sum=4
>>
>>
>> Command line was: ['./python', '-m', 'test.regrtest', '-uall', '-R',
>> '3:3:/home/psf-users/antoine/refleaks/reflogqIZGVY', '--timeout', '7200']
>


Re: [Python-Dev] FAT Python (lack of) performance

2016-01-27 Thread Nick Coghlan
On 28 January 2016 at 04:40, Sven R. Kunze  wrote:
> On 27.01.2016 12:16, Nick Coghlan wrote:
>> Umm, no, that's not how this works
> That's exactly how it works, Nick.
>
> INADA uses Python as I use crossroads each day. Daily human business.
>
> If you read his post carefully, you can discover that he just presented to
> you his perspective of the world. Moreover, I can assure you that he's not
> alone. As usual with humans it's not about facts or mathematically proven
> theorems but perception. It's more about marketing, little important details
> (or unimportant ones depending on whom you ask) and so on. Stating that he
> has a wrong perspective will not change anything.

The only part I disagree with is requesting that *other people* care
about marketing numbers if that's not something they're already
inclined to care about. I'm not in any way disputing that folks make
decisions based on inappropriate metrics, nor that it bothers some
folks that there are dozens of perfectly viable programming languages
people may choose to use instead of Python.

The fact remains that contributors to open source projects work on
what they want to work on or on what their employers pay them to work
on (for a lucky few, those are the same thing), so telling other
contributors that they're working on the "wrong thing" because their
priorities differ from our priorities is almost always going to be
irritating rather than helpful.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] FAT Python (lack of) performance

2016-01-27 Thread Stephen J. Turnbull
Terry Reedy writes:

 > So you agree that the limit of 39 is not intrinsic to the fib function 
 > or its uses, but is an after-the-fact limit imposed to mask the bug 
 > proneness of using substitutes for integers.

I don't know what the limit used in the benchmark is, but it must be
quite a bit lower than 50 for 32-bit integers and could be greater
than 90 for 64-bit integers.  And it's not "masking bugs", it's
"respecting the domain of valid input", s'il vous plait.

 > To my mind, a fairer and more useful benchmark of 'base language 
 > performance' based on fib would use a wider domain.

"Fair", maybe.  But why play this game at all?  These benchmarks are
simply not useful to users choosing languages, unless they already
know the difficulties of interpreting benchmarks and are willing to
expend the effort to account for them.

Without that knowledge and effort, choosing a programming language
based on microbenchmarks is like choosing a car based on the
leg-length of the model sitting on the hood in the TV commercial.

 > The report would say that CPython (with lru_cache disallowed) is
 > slow but works over a wide range of inputs,

No, the report would say "Use of this benchmark for cross-language
comparison of function call speed is more or less inaccurate due to
differences in representation of integers and in handling the
possibility of exceptions in 'integer' arithmetic."  You are picking
one tiny difference, but there are potentially many, some quite a bit
larger on the tested domain (for example, some languages may be able
to optimize fib() to unboxed integers, in which case they'll blow away
all those that don't).

 > Users could then make a more informed pick.

My point in my reply to Nick is that users aren't making informed
picks.  If they were, we wouldn't even be thinking about having this
conversation.  I'm not sure what they are doing (maybe, as Nick
suggests, justifying their "tribal" prejudices?), but it's not
that. ;-)  Sure, other things being equal, better benchmarks will
improve runtime performance, but other things are so far from being
equal even an economist can't say "ceteris paribus" here.

To expand that point: I don't really see a point in users (ie,
developers in Python and other such languages) looking at these
benchmarks except for the fun of feeling like implementers, to be
honest.  Even the implementers shouldn't much care about cross-
language benchmarks, except that when a "similar"[1] language does
significantly better on a particular benchmark, it's often useful to
wonder "how dey do dat?!"  Typically the answer is "they 'cheat'" ==
"fail one of the properties we consider required", but sometimes it's
"ooh, that's cute, and I bet we could make Python work the same way"
or "urkh, we can't do *that* (yuck!) but we could FATten up Python
with similar effect".  (Let me take this opportunity to say "Thank
you, Victor!")

Of course in the case of a controlled experiment like "configure in
Victor's changes and run the benchmarks to make sure they're not
detectably slower", they're invaluable regression tests, and more or
less valuable (ie, YMMV) as measures of improvement to compare to
costs they may impose in other features or (even more fuzzy) in
developer time.


Footnotes: 
[1]  Whatever that means



Re: [Python-Dev] FAT Python (lack of) performance

2016-01-27 Thread Sven R. Kunze

On 27.01.2016 11:59, Terry Reedy wrote:

On 1/26/2016 12:35 PM, Sven R. Kunze wrote:

I completely agree with INADA.


I am not sure you do.



I am sure I am. He wants to solve a problem the way that is natural to 
him as a unique human being.



It's like saying, because a specific crossroad features a higher
accident rate, *people need to change their driving behavior*.
*No!* People won't change and it's not necessary either. The crossroad
needs to be changed to be safer.


Safer crossroads tend to be slower unless one switched to alternate 
designs that eliminate crossing streams of traffic.


So Python can be safer AND faster ( = different design) if we try hard 
enough.


Languages that don't have integers but use residue classes (with 
wraparound) or finite integer classes (with overflow) as a (faster) 
substitute have, in practice, lots of accidents (bugs) when used by 
non-experts.  Guido noticed this, gave up on changing coder behavior, 
and put the expert behavior of checking for wraparound/overflow and 
switching to real integers (longs) into the language.  (I forget when 
this was added.)


I am glad he did because it helps humans solve their problems in a 
natural way without artificial boundaries. :)


The purpose of the artificially low input to fib() is to hide and 
avoid the bugginess of most languages.  The analogous trick with 
testing crossroads would be to artificially restrict the density of 
cars to mask the accident-proneness of a 'fast, consenting-adults' 
crossroads with no stop signs and no stop lights.




I am completely with you here, however I disagree about suspected 
hiding/avoiding mentality. You say:


Python -> *no problem with big integers but slow at small integers*
Other Language -> *faster but breaks at big integers*

Yes. That's it.

We haven't solved the human side, however. A human AGAIN would need to 
compromise on either speed or safety.



My point is: it would be insanely great if Python could be more like 
"*fast AND no problem with big integers*".  No compromise here (at least 
none noticeable).


So, people could entirely *concentrate on their problem domain* without 
every worrying about such tiny little, nitty-gritty computer science 
details. I love computer science but people of other domains don't have 
the time nor the knowledge to decide properly. That's the reason why 
they might decide by using some weird micro-benchmarks. Just humans.



Same goes for Python. If it's slow using the very same piece of code
(even superficially), you better make the language faster.
Developers won't change and they won't change their code either. Just
not necessary.


Instead of making people rewrite fib to dramatically increase speed, 
we added the lru-cache decorator to get most of the benefit without a 
rewrite.  But Inada rejected this Python speedup.  An ast optimizer 
could potentially do the same speedup without the explicit decorator. 
(No side-effects?  Multiple recursive calls? Add a cache!)




Bingo! That's the spirit.

Why that decorator in the first place? Hey, I mean, if I ever want to 
write some cryptic-looking source code with 3-letters abbreviations 
(LRU), I use Assembler again. But I discovered and love Python and I 
never want to go back when my problem domain does not require me to. So, 
when a machine can detect such an optimization, hell, do it, please. 
It's more likely that I apply it to the wrong function AND only in 10% 
of the correct cases: missing 90% and introducing some wild errors.


Again human stuff.


Btw. it would be a great feature for Python 3 to be faster than Python
2.


We all agree on that.  One way for this to happen is to add optimizers 
that would make Python 'cheat' on microbenchmarks.


Then, we are all set. :)

Best,
Sven


Re: [Python-Dev] Speeding up CPython 5-10%

2016-01-27 Thread Yury Selivanov



On 2016-01-27 3:10 PM, Damien George wrote:

Hi Yuri,

I think these are great ideas to speed up CPython.  They are probably
the simplest yet most effective ways to get performance improvements
in the VM.


Thanks!



MicroPython has had LOAD_METHOD/CALL_METHOD from the start (inspired
by PyPy, and the main reason to have it is because you don't need to
allocate on the heap when doing a simple method call).  The specific
opcodes are:

LOAD_METHOD # same behaviour as you propose
CALL_METHOD # for calls with positional and/or keyword args
CALL_METHOD_VAR_KW # for calls with one or both of */**

We also have LOAD_ATTR, CALL_FUNCTION and CALL_FUNCTION_VAR_KW for
non-method calls.


Yes, we'll need to add CALL_METHOD{_VAR|_KW|etc} opcodes to optimize all 
kinds of method calls.  However, I'm not sure how big the impact will be; 
I need to do more benchmarking.


BTW, how do you benchmark MicroPython?



MicroPython also has dictionary lookup caching, but it's a bit
different to your proposal.  We do something much simpler: each opcode
that has a cache ability (eg LOAD_GLOBAL, STORE_GLOBAL, LOAD_ATTR,
etc) includes a single byte in the opcode which is an offset-guess
into the dictionary to find the desired element.  Eg for LOAD_GLOBAL
we have (pseudo code):

CASE(LOAD_GLOBAL):
key = DECODE_KEY;
offset_guess = DECODE_BYTE;
if (global_dict[offset_guess].key == key) {
 // found the element straight away
} else {
 // not found, do a full lookup and save the offset
 offset_guess = dict_lookup(global_dict, key);
 UPDATE_BYTECODE(offset_guess);
}
PUSH(global_dict[offset_guess].elem);

We have found that such caching gives a massive performance increase,
on the order of 20%.  The issue (for us) is that it increases bytecode
size by a considerable amount, requires writeable bytecode, and can be
non-deterministic in terms of lookup time.  Those things are important
in the embedded world, but not so much on the desktop.


That's a neat idea!  You're right, it does require bytecode to become 
writeable.  I considered implementing a similar strategy, but this would 
be a big change for CPython.  So I decided to minimize the impact of the 
patch and leave the opcodes untouched.



Thanks!
Yury


[Python-Dev] Speeding up CPython 5-10%

2016-01-27 Thread Yury Selivanov

Hi,


tl;dr The summary is that I have a patch that improves CPython 
performance up to 5-10% on macro benchmarks.  Benchmarks results on 
Macbook Pro/Mac OS X, desktop CPU/Linux, server CPU/Linux are available 
at [1].  There are no slowdowns that I could reproduce consistently.


There are two different optimizations that yield this speedup: 
LOAD_METHOD/CALL_METHOD opcodes and per-opcode cache in ceval loop.



LOAD_METHOD & CALL_METHOD
-

We had a lot of conversations with Victor about his PEP 509, and he sent 
me a link to his amazing compilation of notes about CPython performance 
[2].  One optimization that he pointed out to me was LOAD/CALL_METHOD 
opcodes, an idea first originated in PyPy.


There is a patch that implements this optimization; it's tracked here: 
[3].  There are some low-level details that I explained in the issue, 
but I'll go over the high level design in this email as well.


Every time you access a method attribute on an object, a BoundMethod 
object is created. It is a fairly expensive operation, despite a 
freelist of BoundMethods (so that memory allocation is generally 
avoided).  The idea is to detect what looks like a method call in the 
compiler, and emit a pair of specialized bytecodes for that.


So instead of LOAD_GLOBAL/LOAD_ATTR/CALL_FUNCTION we will have 
LOAD_GLOBAL/LOAD_METHOD/CALL_METHOD.


LOAD_METHOD looks at the object on top of the stack, and checks if the 
name resolves to a method or to a regular attribute.  If it's a method, 
then we push the unbound method object and the object to the stack.  If 
it's an attribute, we push the resolved attribute and NULL.


When CALL_METHOD looks at the stack it knows how to call the unbound 
method properly (pushing the object as a first arg), or how to call a 
regular callable.
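
Here is a rough pure-Python sketch of that stack protocol, with made-up
helper names.  The real logic lives in the compiler and ceval; this
simplification only handles methods defined in Python:

import types

def load_method(obj, name):
    """Sketch of LOAD_METHOD: push (function, obj) if the bound method can be skipped."""
    cls_attr = type(obj).__dict__.get(name)
    shadowed = name in getattr(obj, "__dict__", {})
    if isinstance(cls_attr, types.FunctionType) and not shadowed:
        return cls_attr, obj             # plain function + the instance
    return getattr(obj, name), None      # fall back: regular attribute, no self

def call_method(func, maybe_self, *args):
    """Sketch of CALL_METHOD: prepend self only if LOAD_METHOD found a method."""
    if maybe_self is not None:
        return func(maybe_self, *args)
    return func(*args)

class Greeter:
    def hello(self, who):
        return "hello " + who

func, self_ = load_method(Greeter(), "hello")
print(call_method(func, self_, "world"))   # hello world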


This idea makes CPython around 2-4% faster.  And it surely doesn't 
make it slower.  I think it's a safe bet to at least implement this 
optimization in CPython 3.6.


So far, the patch only optimizes positional-only method calls.  It's 
possible to optimize all kinds of calls, but this will require 3 more 
opcodes (explained in the issue).  We'll need to do some careful 
benchmarking to see if it's really needed.



Per-opcode cache in ceval
-

While reading PEP 509, I was thinking about how we can use 
dict->ma_version in ceval to speed up globals lookups.  One of the key 
assumptions (and this is what makes JITs possible) is that real-life 
programs don't (often) modify globals or rebind builtins, and that most 
code paths operate on objects of the same type.


In CPython, all pure Python functions have code objects.  When you call 
a function, ceval executes its code object in a frame. Frames contain 
contextual information, including pointers to the globals and builtins 
dict.  The key observation here is that almost all code objects always 
have the same pointers to the globals (the module they were defined in) and 
to the builtins.  And it's not a good programming practice to mutate 
globals or rebind builtins.


Let's look at this function:

def spam():
    print(ham)

Here are its opcodes:

  2   0 LOAD_GLOBAL  0 (print)
  3 LOAD_GLOBAL  1 (ham)
  6 CALL_FUNCTION1 (1 positional, 0 keyword pair)
  9 POP_TOP
 10 LOAD_CONST   0 (None)
 13 RETURN_VALUE

The opcodes we want to optimize are the LOAD_GLOBALs at offsets 0 and 3.  Let's 
look at the first one, which loads the 'print' function from builtins.  The 
opcode knows the following bits of information:


- its offset (0),
- its argument (0 -> 'print'),
- its type (LOAD_GLOBAL).

And these bits of information will *never* change.  So if this opcode 
could resolve the 'print' name (from globals or builtins, likely the 
latter) and save the pointer to it somewhere, along with 
globals->ma_version and builtins->ma_version, it could, on its second 
call, just load this cached info back, check that the globals and 
builtins dict haven't changed and push the cached ref to the stack.  
That would save it from doing two dict lookups.
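
A rough pure-Python model of that fast path, assuming a PEP 509 style
version tag.  All the names here are made up; the real cache entry would
live alongside the code object:

class VersionedDict(dict):
    """Toy stand-in for PEP 509's ma_version: bump a counter on every mutation."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.ma_version = 0
    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        self.ma_version += 1
    def __delitem__(self, key):
        super().__delitem__(key)
        self.ma_version += 1

class GlobalCacheEntry:
    def __init__(self):
        self.obj = None
        self.globals_version = None
        self.builtins_version = None

def load_global(name, globals_ns, builtins_ns, entry):
    if (entry.obj is not None
            and entry.globals_version == globals_ns.ma_version
            and entry.builtins_version == builtins_ns.ma_version):
        return entry.obj                             # fast path: no dict lookups
    obj = globals_ns[name] if name in globals_ns else builtins_ns[name]
    entry.obj = obj                                  # slow path: look up, then cache
    entry.globals_version = globals_ns.ma_version
    entry.builtins_version = builtins_ns.ma_version
    return obj

builtins_ns = VersionedDict(print=print)
globals_ns = VersionedDict()
entry = GlobalCacheEntry()
load_global("print", globals_ns, builtins_ns, entry)   # slow path, fills the cache
load_global("print", globals_ns, builtins_ns, entry)   # fast path, versions unchanged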


We can also optimize LOAD_METHOD.  Chances are high that 'obj' in 
'obj.method()' will be of the same type every time we execute the code 
object.  So if we had an opcode cache, LOAD_METHOD could then cache 
a pointer to the resolved unbound method, a pointer to obj.__class__, 
and tp_version_tag of obj.__class__.  Then it would only need to check 
if the cached object type is the same (and that it wasn't modified) and 
that obj.__dict__ doesn't override 'method'.  Long story short, this 
caching really speeds up method calls on types implemented in C.  
list.append becomes very fast, because list doesn't have a __dict__, so 
the check is very cheap (with cache).


A straightforward implementation of such a cache is simple, but consumes 
a lot of memory that would just be wasted, since we only need such 

Re: [Python-Dev] Speeding up CPython 5-10%

2016-01-27 Thread Yury Selivanov



On 2016-01-27 3:01 PM, Brett Cannon wrote:



[..]


We can also optimize LOAD_METHOD.  There are high chances, that
'obj' in
'obj.method()' will be of the same type every time we execute the code
object.  So if we'd have an opcodes cache, LOAD_METHOD could then
cache
a pointer to the resolved unbound method, a pointer to obj.__class__,
and tp_version_tag of obj.__class__.  Then it would only need to check
if the cached object type is the same (and that it wasn't
modified) and
that obj.__dict__ doesn't override 'method'.  Long story short, this
caching really speeds up method calls on types implemented in C.
list.append becomes very fast, because list doesn't have a
__dict__, so
the check is very cheap (with cache).


What would it take to make this work with Python-defined classes?


It already works for Python-defined classes.  But it's a bit more 
expensive because you still have to check the object's __dict__.  Still, 
there is a very noticeable performance increase (see the results of 
benchmark runs).


I guess that would require knowing the version of the instance's 
__dict__, the instance's __class__ version, the MRO, and where the 
method object was found in the MRO and any intermediary classes to 
know if it was suddenly shadowed? I think that's everything. :)


No, unfortunately we can't use the version of the instance's __dict__, as 
it is very volatile.  The current implementation of the opcode cache works 
because types are much more stable.  Remember, the cache is per *code 
object*, so it has to stay valid every time the code object is executed.


class F:
    def spam(self):
        self.ham()   # <- version of self.__dict__ is unstable,
                     #    so we'd end up invalidating the cache
                     #    too often

__class__ version, MRO changes, etc. are covered by tp_version_tag, which 
I use as one of the guards.
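
Roughly, the guard check for a cached LOAD_METHOD entry looks like this
(a sketch with made-up names, where type_tag_of stands in for reading
tp_version_tag in C):

def cache_entry_is_valid(entry, obj, type_tag_of):
    tp = type(obj)
    return (entry["type"] is tp                                  # same class object
            and entry["tag"] == type_tag_of(tp)                  # class unchanged since caching
            and entry["name"] not in getattr(obj, "__dict__", {}))  # not shadowed per-instance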




Obviously that's a lot, but I wonder how many classes have a deep 
inheritance model vs. inheriting only from `object`? In that case you 
only have to check self.__dict__.ma_version, self.__class__, 
self.__class__.__dict__.ma_version, and self.__class__.__class__ == 
`type`. I guess another way to look at this is to get an idea of how 
complex do the checks have to get before caching something like this 
is not worth it (probably also depends on how often you mutate 
self.__dict__ thanks to mutating attributes, but you could in that 
instance just decide to always look at self.__dict__ for the method's 
key and then do the ma_version cache check for everything coming from 
the class).


Otherwise we can consider looking at the the caching strategies that 
Self helped pioneer (http://bibliography.selflanguage.org/) that all 
of the various JS engines lifted and consider caching all method lookups.


Yeah, hidden classes are great.  But the infrastructure to support them 
properly is huge.  I think that to make them work you'll need a JIT -- 
to trace, deoptimize, optimize, and do it all with a reasonable memory 
footprint.  My patch is much smaller and simpler, something we can 
realistically tune and ship in 3.6.




A straightforward way to implement such a cache is simple, but
consumes
a lot of memory, that would be just wasted, since we only need such a
cache for LOAD_GLOBAL and LOAD_METHOD opcodes. So we have to be
creative
about the cache design.  Here's what I came up with:

1. We add a few fields to the code object.

2. ceval will count how many times each code object is executed.

3. When the code object is executed over ~900 times, we mark it as
"hot".


What happens if you simply consider all code as hot? Is the overhead 
of building the mapping such that you really need this, or is this 
simply to avoid some memory/startup cost?


That's the first step for this patch.  I think we need to profile 
several big applications (I'll do it later for some of my code bases) 
and see how big the memory impact is if we optimize everything.


In any case, I expect it to be noticeable (which may be acceptable), so 
we'll probably try to optimize it.



  We also create an 'unsigned char' array "MAPPING", with length
set to match the length of the code object.  So we have a 1-to-1
mapping
between opcodes and MAPPING array.

4. Next ~100 calls, while the code object is "hot", LOAD_GLOBAL and
LOAD_METHOD do "MAPPING[opcode_offset()]++".

5. After 1024 calls to the code object, ceval loop will iterate
through
the MAPPING, counting all opcodes that were executed more than 50
times.


Where did the "50 times" boundary come from? Was this measured somehow 
or did you just guess at a number?


If the number is too low, then you'll optimize code in branches that are 
rarely executed.  So I picked 50, because I only trace opcodes for 100 
calls.


All of those numbers can be (should be?) changed, and I think we should 
experiment with different heuristics.
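
As a sketch, the warm-up logic described above looks roughly like this
(the thresholds and names are placeholders, all subject to change):

HOT_AFTER, TRACE_CALLS, MIN_HITS = 900, 100, 50

class CodeObjectStats:
    def __init__(self, n_opcodes):
        self.calls = 0
        self.hits = [0] * n_opcodes          # the "MAPPING" array: per-opcode counters

    def on_call(self):
        """Called on entry; True while the code object is in its tracing phase."""
        self.calls += 1
        return HOT_AFTER < self.calls <= HOT_AFTER + TRACE_CALLS

    def opcodes_to_optimize(self):
        """After tracing, pick the opcodes that fired often enough to get a cache."""
        if self.calls <= HOT_AFTER + TRACE_CALLS:
            return []
        return [i for i, n in enumerate(self.hits) if n > MIN_HITS]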


Re: [Python-Dev] FAT Python (lack of) performance

2016-01-27 Thread Glenn Linderman

On 1/27/2016 9:12 AM, Stephen J. Turnbull wrote:

Without that knowledge and effort, choosing a programming language
based on microbenchmarks is like choosing a car based on the
leg-length of the model sitting on the hood in the TV commercial.


+1  QOTD


Re: [Python-Dev] FAT Python (lack of) performance

2016-01-27 Thread Brett Cannon
On Wed, 27 Jan 2016 at 10:12 Sven R. Kunze  wrote:

> On 27.01.2016 11:59, Terry Reedy wrote:
>
> On 1/26/2016 12:35 PM, Sven R. Kunze wrote:
>
> I completely agree with INADA.
>
>
> I am not sure you do.
>
>
> I am sure I am. He wants to solve a problem the way that is natural to him
> as a unique human being.
>
>
> It's like saying, because a specific crossroad features a higher
> accident rate, *people need to change their driving behavior*.
> *No!* People won't change and it's not necessary either. The crossroad
> needs to be changed to be safer.
>
>
> Safer crossroads tend to be slower unless one switched to alternate
> designs that eliminate crossing streams of traffic.
>
>
> So Python can be safer AND faster ( = different design) if we try hard
> enough.
>
>
> Languages that don't have integers but use residue classes (with
> wraparound) or finite integer classes (with overflow) as a (faster)
> substitute have, in practice, lots of accidents (bugs) when used by
> non-experts.  Guido noticed this, gave up on changing coder behavior, and
> put the expert behavior of checking for wraparound/overflow and switching
> to real integers (longs) into the language.  (I forget when this was
> added.)
>
>
> I am glad he did because it helps humans solve their problems in a natural
> way without artificial boundaries. :)
>
>
> The purpose of the artificially low input to fib() is to hide and avoid
> the bugginess of most languages.  The analogous trick with testing
> crossroads would be to artificially restrict the density of cars to mask
> the accident-proneness of a 'fast, consenting-adults' crossroads with no
> stop signs and no stop lights.
>
>
> I am completely with you here, however I disagree about suspected
> hiding/avoiding mentality. You say:
>
> Python -> *no problem with big integers but slow at small integers*
> Other Language -> *faster but breaks at big integers*
>
> Yes. That's it.
>
> We haven't solved the human side, however. A human AGAIN would need to
> compromise on either speed or safety.
>
>
> My point is: it would be insanely great if Python could be more like "*fast
> AND no problem with big integers*". No compromise here (at least no
> noticeable).
>

And this is why this entire email thread has devolved into a conversation
that isn't really going anywhere. This whole thread has completely lost
track of the point of Victor's earlier email saying "I'm still working on
my FAT work and don't take any notice of the performance numbers until more
stuff gets finished". And this discussion of what benchmarks to care about
is rather pointless since the core team has an implicit understanding that
any performance improvement is taken into consideration in terms of
balancing complexity in CPython with how much improvement it gets us. So if
someone wants to speed up Fibonacci then they are welcome to try, but the
solution must be maintainable in proportion to the speed increase it buys
Python as a whole.

-Brett




>
> So, people could entirely *concentrate on their problem domain* without
> every worrying about such tiny little, nitty-gritty computer science
> details. I love computer science but people of other domains don't have the
> time nor the knowledge to decide properly. That's the reason why they might
> decide by using some weird micro-benchmarks. Just humans.
>
>
> Same goes for Python. If it's slow using the very same piece of code
> (even superficially), you better make the language faster.
> Developers won't change and they won't change their code either. Just
> not necessary.
>
>
> Instead of making people rewrite fib to dramatically increase speed, we
> added the lru-cache decorator to get most of the benefit without a
> rewrite.  But Inada rejected this Python speedup.  An ast optimizer could
> potentially do the same speedup without the explicit decorator. (No
> side-effects?  Multiple recursive calls?  Add a cache!)
>
>
> Bingo! That's the spirit.
>
> Why that decorator in the first place? Hey, I mean, if I ever want to
> write some cryptic-looking source code with 3-letters abbreviations (LRU),
> I use Assembler again. But I discovered and love Python and I never want to
> go back when my problem domain does not require me to. So, when a machine
> can detect such an optimization, hell, do it, please. It's more likely that
> I apply it at the wrong function AND only in 10% of the correct cases:
> missing 90% and introducing some wild errors.
>
> Again human stuff.
>
>
> Btw. it would be a great feature for Python 3 to be faster than Python
> 2.
>
>
> We all agree on that.  One way for this to happen is to add optimizers
> that would make Python 'cheat' on microbenchmarks
>
>
> Then, we are all set. :)
>
> Best,
> Sven

Re: [Python-Dev] FAT Python (lack of) performance

2016-01-27 Thread Sven R. Kunze

On 27.01.2016 12:16, Nick Coghlan wrote:

On 27 January 2016 at 03:35, Sven R. Kunze  wrote:

I completely agree with  INADA.

It's like saying, because a specific crossroad features a higher accident
rate, people need to change their driving behavior.
No! People won't change and it's not necessary either. The crossroad needs
to be changed to be safer.

Umm, no, that's not how this works


That's exactly how it works, Nick.

INADA uses Python as I use crossroads each day. Daily human business.

If you read his post carefully, you can discover that he just presented 
to you his perspective of the world. Moreover, I can assure you that 
he's not alone. As usual with humans it's not about facts or 
mathematically proven theorems but *perception*. It's more about 
marketing, little important details (or unimportant ones depending on 
whom you ask) and so on. Stating that he has a wrong perspective will 
not change anything. Believing that Python is treated unfair will not 
change that either.


Most people believe what they see. When they see a "FUNCTION CALL", it's 
the same in every language. Why? Because it looks like a function call ( 
name + parentheses ), it's called "function call" even if it's 
implemented completely differently. It even doesn't matter if we use 
commas, 'def', return types, etc. Because people understand the bigger 
concept, so that is what people want to compare.


Average Joe doesn't care and does not understand. He looks at the 
benchmarks. That is something he can understand. "While performance is 
not a matter when choosing first language, slowest of three makes bad 
impression and people feel less attractive about Python." << just like that


Not saying that INADA is an Average Joe, but I think you get the idea.

  - developers contribute to
community driven projects for their *own* reasons. Nobody gets to tell
them what to do unless they're paying them.


Bit off-topic.


Micro-optimising a poor algorithm won't deliver macro level
improvements because macro level code uses things like space-speed
trade-offs to improve the algorithmic efficiency (as in the example of
applying functools.lru_cache to a naive recursive fibonacci
implementation).


I completely agree, Nick. :) But that isn't the issue here.


Victor's work on FAT optimiser is interesting because it offers
opportunities to speed up even code that is already algorithmically
efficient, as well as making CPython a better platform for
experimenting with those kinds of changes.


Exactly. :)


More generally though, much larger opportunities for improvement lie
in persuading people to *stop writing code*, and instead spending more
of their time on finding and assembling code other people have
*already written* into solutions to interesting problems. *That's* the
kind of improvement that turns enormously complex problems like facial
recognition into 25 line Python scripts:
https://realpython.com/blog/python/face-recognition-with-python/


Interesting post. :) Thanks.

Btw. I completely agree with you on the "improve programming education", 
but not everybody can do it; and not everybody wants to learn and to 
practice it properly.


Best,
Sven


Re: [Python-Dev] FAT Python (lack of) performance

2016-01-27 Thread Sven R. Kunze

On 27.01.2016 19:33, Brett Cannon wrote:
And this is why this entire email thread has devolved into a 
conversation that isn't really going anywhere. This whole thread has 
completely lost track of the point of Victor's earlier email saying 
"I'm still working on my FAT work and don't take any notice of the 
performance numbers until more stuff gets finished". And this 
discussion of what benchmarks to care about is rather pointless since 
the core team has an implicit understanding that any performance 
improvement is taken into consideration in terms of balancing 
complexity in CPython with how much improvement it gets us. So if 
someone wants to speed up Fibonacci then they are welcome to try, but 
the solution must be maintainable in proportion to the speed increase 
it buys Python as a whole.


+1


Re: [Python-Dev] Speeding up CPython 5-10%

2016-01-27 Thread Yury Selivanov
As Brett suggested, I've just run the benchmarks suite with memory 
tracking on.  The results are here: 
https://gist.github.com/1st1/1851afb2773526fd7c58


Looks like the memory increase is around 1%.

One synthetic micro-benchmark, unpack_sequence, which contains hundreds of 
lines that load a global variable and do nothing else, consumes 5% more.


Yury


Re: [Python-Dev] FAT Python (lack of) performance

2016-01-27 Thread Nick Coghlan
On 27 January 2016 at 03:35, Sven R. Kunze  wrote:
> I completely agree with  INADA.
>
> It's like saying, because a specific crossroad features a higher accident
> rate, people need to change their driving behavior.
> No! People won't change and it's not necessary either. The crossroad needs
> to be changed to be safer.

Umm, no, that's not how this works - developers contribute to
community driven projects for their *own* reasons. Nobody gets to tell
them what to do unless they're paying them.

Micro-optimising a poor algorithm won't deliver macro level
improvements because macro level code uses things like space-speed
trade-offs to improve the algorithmic efficiency (as in the example of
applying functools.lru_cache to a naive recursive fibonacci
implementation).

Victor's work on FAT optimiser is interesting because it offers
opportunities to speed up even code that is already algorithmically
efficient, as well as making CPython a better platform for
experimenting with those kinds of changes.

More generally though, much larger opportunities for improvement lie
in persuading people to *stop writing code*, and instead spending more
of their time on finding and assembling code other people have
*already written* into solutions to interesting problems. *That's* the
kind of improvement that turns enormously complex problems like facial
recognition into 25 line Python scripts:
https://realpython.com/blog/python/face-recognition-with-python/

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] FAT Python (lack of) performance

2016-01-27 Thread Terry Reedy

On 1/26/2016 12:51 PM, Stephen J. Turnbull wrote:

Terry Reedy writes:
  > On 1/26/2016 12:02 AM, INADA Naoki wrote:
  >
  > > People use same algorithm on every language when compares base language
  > > performance [1].
  >
  > The python code is NOT using the same algorithm.  The proof is that the
  > Python function will return the correct value for, say fib(50) while
  > most if not all the other versions will not.


Let me try to be clearer.

1.  Like everyone else, I would like Python function calls to be faster, 
either in general or in special cases detected during compilation.  This 
will require micro-benchmarks for function calls that do just that: 
first time an empty loop, then time a loop with a call to an empty 
function.  Do the same for various signatures, and maybe other special 
cases.


2. Cross-language micro-benchmarks aimed at timing specific operations 
are tough.  To run on multiple languages, they must be restricted to a 
lowest-common-denominator set of features.  It is impossible to make every 
implementation perform exactly the same set of operations.  Some 
languages may bundle features together.  Some languages and 
implementations have optimizations that avoid unneeded operations.  Not 
all optimizations can be turned off.


3. While there are trends as to the speed of implementations of a 
language, benchmarks time particular implementations.  Shedskin would 
compile fib to a C function that runs much faster than fib with the 
restricted subset of CPython allowed for the benchmark.



True, but that's not a reasonable criterion for "same algorithm" in
this context.  Naoki's application ("base language performance"
benchmarking) requires fib(n) only for n < 40, and run it in a loop
100 times if you want 2 more decimal places of precision ("40" is
appropriate for an implementation with 32-bit ints).


So you agree that the limit of 39 is not intrinsic to the fib function 
or its uses, but is an after-the-fact limit imposed to mask the bug 
proneness of using substitutes for integers.


To my mind, a fairer and more useful benchmark of 'base language 
performance' based on fib would use a wider domain.  The report would 
say that CPython (with lru_cache disallowed) is slow but works over a 
wide range of inputs, while some implementations of other languages 
run faster for small inputs but fail catastrophically for larger 
inputs.  Users could then make a more informed pick.


Also see my answer to Sven Kunze.

--
Terry Jan Reedy

___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] FAT Python (lack of) performance

2016-01-27 Thread Victor Stinner
Python has test_dynamic which tests such corner cases.

For example, test_modify_builtins_while_generator_active(): "Modify
the builtins out from under a live generator."
https://hg.python.org/cpython/file/58266f5101cc/Lib/test/test_dynamic.py#l49
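
Roughly the kind of behaviour it pins down (a sketch along the same
lines, not the actual test code):

import builtins

def gen():
    # each next() re-resolves len(), so it must see the current
    # builtins, not whatever was in place when the generator started
    while True:
        yield len("abc")

g = gen()
assert next(g) == 3

original_len = builtins.len
builtins.len = lambda obj: 42   # swap the builtin under the live generator
try:
    assert next(g) == 42
finally:
    builtins.len = original_len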

Victor

2016-01-27 10:28 GMT+01:00 Paul Moore :
> On 27 January 2016 at 05:23, Sjoerd Job Postmus  wrote:
>> On Mon, Jan 25, 2016 at 11:58:12PM +0100, Victor Stinner wrote:
>>> ...
>>> Oh, there are a lot of things to do! ...
>>
>> Just wondering, do you also need a set of (abusive) test-cases which
>> check 100% conformity to the CPython semantics? I'm sure many of us
>> would be able to whip up some ideas of things that are possible with
>> Python and are of the kind "but you should not do that! That's bad
>> programming!" which may or may not break the optimizations (especially
>> specialized functions).
>>
>> I'm thinking of things like
>>
>> def override_length_function_test():
>>     global len
>>     value = len("abc")
>>     len = lambda obj: ord(obj[0])
>>     value += len("abc")
>>     assert value == 3 + 97, "Outdated `len` used."
>>
>> And also cases where `len` was changed not in the function itself, but
>> in another function that was called indirectly (maybe 4 functions down,
>> one of which was monkey-patched in after the compilation step):
>>
>> module_a.py
>> def test_func(callback):
>>     value = len("abc")
>>     callback()
>>     value += len("abc")
>>     assert value == 3 + 97, "Outdated `len` used."
>>     return value
>>
>> module_b.py
>> import module_a
>> def override_length_function_test():
>>     def callback():
>>         module_a.len = lambda obj: ord(obj[0])
>>     assert module_a.test_func(callback) == 3 + 97, "Outdated `len` used."
>>
>> (I'm sure some of the other readers of this list can be even more
>> creative in trying to show that making optimizations like this can break
>> semantics.)
>>
>> Other fun tricks I'd like to try is overriding the `len` method from a
>> signal handler, what happens when you monkey-patch a dependent method,
>> having `__getitem__` and `__getattr__` on some value override `len`.
>>
>> Basically: trying things that I normally should not try during my
>> working hours, on account of wanting to still have a job the next day.
>>
>> Kind regards,
>> Sjoerd Job
>
> Maybe I'm just nasty, but IMO those kinds of "torture tests" are just
> as valuable in general, so I'd encourage people to submit patches to
> the main Python test suite to add them.
>
> Paul
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Buildbot timing out - test suite failure - test_socket issue with UDP6?

2016-01-27 Thread Victor Stinner
2016-01-23 7:03 GMT+01:00 Chris Angelico :
> I just had a major crash on the system that hosts the
> angelico-debian-amd64 buildbot, and as usual, checked it carefully
> after bringing everything up. It seems now to be timing out after an
> hour of operation:
>
> http://buildbot.python.org/all/builders/AMD64%20Debian%20root%203.x/builds/3132/steps/test/logs/stdio

I opened http://bugs.python.org/issue26206 to track this issue.

> Running just that test file:
>
> $ ./python Lib/test/test_socket.py
> ... chomp lots of lines ...
> testRecvmsgPeek (__main__.RecvmsgUDP6Test) ...
>
> seems to indicate that the stall is due to IPv6 and UDP. The VM should
> have full IPv6 support, although my ISPs don't carry IPv6 traffic, so
> it won't be able to reach the internet proper; but it should be able
> to do all manner of local tests.

Try to apply attached patch and run:

$ ./python -m test -v -m testRecvmsgPeek test_socket
(...)
testRecvmsgPeek (test.test_socket.RecvmsgUDP6Test) ... CLI SOCK
<socket.socket fd=..., family=AddressFamily.AF_INET6, type=SocketKind.SOCK_DGRAM, proto=0, laddr=('::1', 44347, 0, 0)>
SERV SOCK <socket.socket fd=..., family=AddressFamily.AF_INET6, type=SocketKind.SOCK_DGRAM, proto=0, laddr=('::1', 40488, 0, 0)>
CLI SOCK ('::1', 44347, 0, 0)
SERV SOCK ('::1', 40488, 0, 0)
ok
testRecvmsgPeek (test.test_socket.RecvmsgIntoUDP6Test) ... CLI SOCK
<socket.socket fd=..., family=AddressFamily.AF_INET6, type=SocketKind.SOCK_DGRAM, proto=0, laddr=('::1', 52721, 0, 0)>
SERV SOCK <socket.socket fd=..., family=AddressFamily.AF_INET6, type=SocketKind.SOCK_DGRAM, proto=0, laddr=('::1', 43967, 0, 0)>
CLI SOCK ('::1', 52721, 0, 0)
SERV SOCK ('::1', 43967, 0, 0)
ok
(...)

As you can see: the test uses the local loopback interface.
Inet6TestBase.host is "::1".

You can try to run a UDP server using netcat: "nc -l -u ::1 12345".
Keep the command running in a terminal, and then run in a different
terminal: "echo abc | nc -u ::1 12345". You should receive abc in the
server.
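
If netcat is not available, a rough Python equivalent of the same
loopback check (just a sketch; the port and timeout are arbitrary):

import socket

# "server": bind a UDP socket on the IPv6 loopback address
serv = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
serv.bind(("::1", 12345))
serv.settimeout(5)

# "client": send one datagram to it
cli = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
cli.sendto(b"abc", ("::1", 12345))

print(serv.recvfrom(1024))  # expect (b'abc', ('::1', <port>, 0, 0))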

Victor
diff -r 58266f5101cc Lib/test/test_socket.py
--- a/Lib/test/test_socket.py	Tue Jan 26 21:46:03 2016 -0800
+++ b/Lib/test/test_socket.py	Wed Jan 27 10:35:32 2016 +0100
@@ -2049,6 +2049,8 @@ class SendrecvmsgConnectionlessBase(Send
 return ([], [], 0, self.serv_addr)
 
 def sendToServer(self, msg):
+print("CLI SOCK", self.cli_sock)
+print("CLI SOCK", self.cli_sock.getsockname())
 return self.cli_sock.sendto(msg, self.serv_addr)
 
 
@@ -2371,6 +2373,8 @@ class RecvmsgGenericTests(SendrecvmsgBas
 # data without consuming it.
 
 # Receive part of data with MSG_PEEK.
+print("SERV SOCK", self.serv_sock)
+print("SERV SOCK", self.serv_sock.getsockname())
 msg, ancdata, flags, addr = self.doRecvmsg(self.serv_sock,
len(MSG) - 3, 0,
socket.MSG_PEEK)
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] FAT Python (lack of) performance

2016-01-27 Thread Paul Moore
On 27 January 2016 at 05:23, Sjoerd Job Postmus  wrote:
> On Mon, Jan 25, 2016 at 11:58:12PM +0100, Victor Stinner wrote:
>> ...
>> Oh, there are a lot of things to do! ...
>
> Just wondering, do you also need a set of (abusive) test-cases which
> check 100% conformity to the CPython semantics? I'm sure many of us
> would be able to whip up some ideas of things that are possible with
> Python and are of the kind "but you should not do that! That's bad
> programming!" which may or may not break the optimizations (especially
> specialized functions).
>
> I'm thinking of things like
>
> def override_length_function_test():
>     global len
>     value = len("abc")
>     len = lambda obj: ord(obj[0])
>     value += len("abc")
>     assert value == 3 + 97, "Outdated `len` used."
>
> And also cases where `len` was changed not in the function itself, but
> in another function that was called indirectly (maybe 4 functions down,
> one of which was monkey-patched in after the compilation step):
>
> module_a.py
> def test_func(callback):
>     value = len("abc")
>     callback()
>     value += len("abc")
>     assert value == 3 + 97, "Outdated `len` used."
>     return value
>
> module_b.py
> import module_a
> def override_length_function_test():
>     def callback():
>         module_a.len = lambda obj: ord(obj[0])
>     assert module_a.test_func(callback) == 3 + 97, "Outdated `len` used."
>
> (I'm sure some of the other readers of this list can be even more
> creative in trying to show that making optimizations like this can break
> semantics.)
>
> Other fun tricks I'd like to try is overriding the `len` method from a
> signal handler, what happens when you monkey-patch a dependent method,
> having `__getitem__` and `__getattr__` on some value override `len`.
>
> Basically: trying things that I normally should not try during my
> working hours, on account of wanting to still have a job the next day.
>
> Kind regards,
> Sjoerd Job

Maybe I'm just nasty, but IMO those kinds of "torture tests" are just
as valuable in general, so I'd encourage people to submit patches to
the main Python test suite to add them.

Paul
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] FAT Python (lack of) performance

2016-01-27 Thread Terry Reedy

On 1/26/2016 12:35 PM, Sven R. Kunze wrote:

I completely agree with INADA.


I am not sure you do.


It's like saying, because a specific crossroad features a higher
accident rate, *people need to change their driving behavior*.
*No!* People won't change and it's not necessary either. The crossroad
needs to be changed to be safer.


Safer crossroads tend to be slower unless one switches to alternate 
designs that eliminate crossing streams of traffic.  Python is safer, in 
everyday use as well as in artificial benchmarks, and is slower as a result.


Languages that don't have integers but use residue classes (with 
wraparound) or finite integer classes (with overflow) as a (faster) 
substitute have, in practice, lots of accidents (bugs) when used by 
non-experts.  Guido noticed this, gave up on changing coder behavior, 
and put the expert behavior of checking for wraparound/overflow and 
switching to real integers (longs) into the language.  (I forget when 
this was added.)


The purpose of the artificially low input to fib() is to hide and avoid 
the bugginess of most languages.  The analogous trick with testing 
crossroads would be to artificially restrict the density of cars to mask 
the accident-proneness of a 'fast, consenting-adults' crossroads with no 
stop signs and no stop lights.



Same goes for Python. If it's slow using the very same piece of code
(even superficially), you better make the language faster.
Developers won't change and they won't change their code either. Just
not necessary.


Instead of making people rewrite fib to dramatically increase speed, we 
added the lru_cache decorator to get most of the benefit without a 
rewrite.  But Inada rejected this Python speedup.  An AST optimizer 
could potentially do the same speedup without the explicit decorator. 
(No side-effects?  Multiple recursive calls?  Add a cache!)
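
For example, a rough sketch of the kind of mechanical rewrite such an
optimizer could apply (assuming it can prove the function is pure and
self-recursive; auto_cache is a hypothetical helper, not an existing API):

import functools
import sys

def auto_cache(module_name, func_name):
    # Rebind the module global to a cached wrapper, so the recursive
    # calls (which go through that global) hit the cache as well.
    mod = sys.modules[module_name]
    original = getattr(mod, func_name)
    setattr(mod, func_name, functools.lru_cache(maxsize=None)(original))

def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

auto_cache(__name__, "fib")
print(fib(200))   # finishes instantly once the global is rebound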



Btw. it would be a great feature for Python 3 to be faster than Python
2.


We all agree on that.  One way for this to happen is to add optimizers 
that would make Python 'cheat' on microbenchmarks.


--
Terry Jan Reedy

___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Buildbot timing out - test suite failure - test_socket issue with UDP6?

2016-01-27 Thread Chris Angelico
On Wed, Jan 27, 2016 at 8:39 PM, Victor Stinner
 wrote:
> 2016-01-23 7:03 GMT+01:00 Chris Angelico :
>> Running just that test file:
>>
>> $ ./python Lib/test/test_socket.py
>> ... chomp lots of lines ...
>> testRecvmsgPeek (__main__.RecvmsgUDP6Test) ...
>>
>> seems to indicate that the stall is due to IPv6 and UDP. The VM should
>> have full IPv6 support, although my ISPs don't carry IPv6 traffic, so
>> it won't be able to reach the internet proper; but it should be able
>> to do all manner of local tests.
>
> Try to apply attached patch and run:
>
> $ ./python -m test -v -m testRecvmsgPeek test_socket
> (...)
> testRecvmsgPeek (test.test_socket.RecvmsgUDP6Test) ... CLI SOCK
> <socket.socket fd=..., family=AddressFamily.AF_INET6, type=SocketKind.SOCK_DGRAM, proto=0, laddr=('::1', 44347, 0, 0)>
> SERV SOCK <socket.socket fd=..., family=AddressFamily.AF_INET6, type=SocketKind.SOCK_DGRAM, proto=0, laddr=('::1', 40488, 0, 0)>
> CLI SOCK ('::1', 44347, 0, 0)
> SERV SOCK ('::1', 40488, 0, 0)
> ok
> testRecvmsgPeek (test.test_socket.RecvmsgIntoUDP6Test) ... CLI SOCK
> <socket.socket fd=..., family=AddressFamily.AF_INET6, type=SocketKind.SOCK_DGRAM, proto=0, laddr=('::1', 52721, 0, 0)>
> SERV SOCK <socket.socket fd=..., family=AddressFamily.AF_INET6, type=SocketKind.SOCK_DGRAM, proto=0, laddr=('::1', 43967, 0, 0)>
> CLI SOCK ('::1', 52721, 0, 0)
> SERV SOCK ('::1', 43967, 0, 0)
> ok
> (...)
>
> As you can see: the test uses the local loopback interface.
> Inet6TestBase.host is "::1".

Confirmed. It does two tests of IPv4 which run just fine, and then:

testRecvmsgPeek (test.test_socket.RecvmsgUDP6Test) ... CLI SOCK
<socket.socket fd=..., family=AddressFamily.AF_INET6, type=SocketKind.SOCK_DGRAM, proto=0, laddr=('::1', 47518, 0, 0)>
CLI SOCK ('::1', 47518, 0, 0)
SERV SOCK <socket.socket fd=..., family=AddressFamily.AF_INET6, type=SocketKind.SOCK_DGRAM, proto=0, laddr=('::1', 42421, 0, 0)>
SERV SOCK ('::1', 42421, 0, 0)

and hangs until I interrupt it.

> You can try to run a UDP server using netcat: "nc -l -u ::1 12345".
> Keep the command running in a terminal, and then run in a different
> terminal: "echo abc | nc -u ::1 12345". You should receive abc in the
> server.

(Oddly, the default netcat-traditional doesn't seem to support this,
but installing netcat-openbsd adds IPv6 support.)

Yep, and that works flawlessly. There's nothing weird about that
particular port, either - nc can use 42421 without a problem.

After digging through test_socket.py for over an hour (the MRO for
RecvmsgUDP6Test is enormous!!), I've boiled the issue down to this:

import socket
MSG = b'asdf qwer zxcv'
serv = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
serv.bind(("::1", 0))
cli = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
cli.bind(("::1", 0))
cli.sendto(MSG, serv.getsockname())
print(serv.recvmsg(len(MSG) - 3, 0, socket.MSG_PEEK))
print(serv.recvmsg(len(MSG), 0, socket.MSG_PEEK))
print(serv.recvmsg(len(MSG)))

On my main system, this produces three lines of output: the first has
truncated text, the second has full text, and the third also has full
text. This proves that MSG_PEEK is working correctly. On the buildbot,
though, the first one stalls out. Commenting that line out produces
correct results - peek the full data, then read it, and all is well.

Any idea why partial read on a datagram socket would sometimes stall?

ChrisA
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com