Re: [Python-ideas] INSANE FLOAT PERFORMANCE!!!

2016-10-13 Thread Chris Angelico
On Thu, Oct 13, 2016 at 5:17 PM, Stephen J. Turnbull wrote:
> Chris Angelico writes:
>
>  > I'm not sure what you mean by "strcmp-able"; do you mean that the
>  > lexical ordering of two Unicode strings is guaranteed to be the same
>  > as the byte-wise ordering of their UTF-8 encodings?
>
> This is definitely not true for the Han characters.  In Japanese, the
> most commonly used lexical ordering is based on the pronunciation,
> meaning that there are few characters (perhaps none) in common use
that have a unique place in lexical ordering (most individual
> characters have multiple pronunciations, and even many whole personal
> names do).

Yeah, and even just with Latin-1 characters, you have (a) non-ASCII
characters that sort between ASCII characters, and (b) characters that
have different meanings in different languages, and should be sorted
differently. So lexicographical ordering is impossible in a generic
string sort.

ChrisA
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] INSANE FLOAT PERFORMANCE!!!

2016-10-12 Thread Elliot Gorokhovsky
On Wed, Oct 12, 2016 at 4:26 PM Terry Reedy  wrote:

> I suspect that optimizing string sorting
> will take some experimentation.  If the initial item is str, it might be
> worthwhile to record the highest 'kind' during the type scan, so that
> strncmp can be used if all are ascii or latin-1.
>

My thoughts exactly.

One other optimization along these lines: the reason ints don't give quite
as shocking results as floats is that comparisons are a bit more expensive:
one first has to check that the int would fit in a C long before comparing;
if not, then a custom procedure has to be used. However, in practice ints
being sorted are almost always smaller in absolute value than 2**32 or
whatever. So I think, just as it might pay off to check for latin-1 and use
strcmp, it may also pay off to check for fits-in-a-C-long and use a custom
function for that case as well, since the performance would be precisely as
awesome as the float performance that started this thread: comparisons
would just be the cost of pointer dereference plus the cost of C long
comparison, i.e. the minimum possible cost.
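The pre-scan being proposed can be sketched in pure Python. This is a hypothetical illustration of the idea, not CPython code; the 64-bit C long bounds are an assumption about the platform:

```python
# Hypothetical sketch of the idea: scan once to check that every item is
# an int that would fit in a C long, and only then commit to a fast
# fixed-width compare path.
C_LONG_MIN, C_LONG_MAX = -2**63, 2**63 - 1  # assuming a 64-bit C long

def all_fit_in_c_long(lst):
    """Pre-scan: True if a machine-word comparison would be safe."""
    return all(type(x) is int and C_LONG_MIN <= x <= C_LONG_MAX for x in lst)

assert all_fit_in_c_long([3, -7, 2**40])
assert not all_fit_in_c_long([3, 2**80])   # too big: needs the general path
assert not all_fit_in_c_long([3, 2.5])     # mixed types: needs the general path
```

The scan is O(n), while a sort does O(n log n) comparisons, so the check amortizes quickly when it succeeds.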

Re: [Python-ideas] INSANE FLOAT PERFORMANCE!!!

2016-10-12 Thread Greg Ewing

Paul Moore wrote:
> What I'm *not* quite clear on is why Python 3's change to reject
> comparisons between unrelated types makes this optimisation possible.


I think the idea was that it's likely to be *useful* a higher
proportion of the time, because Python 3 programmers have to
be careful that the types they're sorting are compatible.

I'm not sure how true that is -- just because you *could*
sort lists containing a random selection of types in Python
2 doesn't necessarily mean it was done often.

--
Greg


Re: [Python-ideas] INSANE FLOAT PERFORMANCE!!!

2016-10-12 Thread Nathaniel Smith
On Wed, Oct 12, 2016 at 3:34 PM, Alexander Belopolsky wrote:
>
> On Wed, Oct 12, 2016 at 6:14 PM, Elliot Gorokhovsky wrote:
>>
>> so then Latin1 strings are memcmp-able, and others are not.
>
>
> No.  Strings of the same kind are "memcmp-able" regardless of what that kind is.

I don't think this is true on little-endian systems.
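Nathaniel's caveat can be checked from Python. The snippet below (an editorial illustration, not from the thread) shows why byte-wise comparison of 2-byte code units disagrees with code point order on a little-endian layout:

```python
# On a little-endian layout the low byte of each code unit comes first,
# so byte order and code point order disagree for multi-byte kinds.
a, b = "\u00ff", "\u0100"                              # U+00FF < U+0100
assert a < b                                           # code point order
assert a.encode("utf-16-le") > b.encode("utf-16-le")   # bytes: ff 00 vs 00 01
assert a.encode("utf-16-be") < b.encode("utf-16-be")   # big-endian agrees
```

So a memcmp fast path is only unconditionally safe for the 1-byte (latin-1) kind, or for wider kinds on big-endian machines.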

-n

-- 
Nathaniel J. Smith -- https://vorpus.org


Re: [Python-ideas] INSANE FLOAT PERFORMANCE!!!

2016-10-12 Thread MRAB

On 2016-10-12 23:34, Alexander Belopolsky wrote:
> On Wed, Oct 12, 2016 at 6:14 PM, Elliot Gorokhovsky wrote:
>> so then Latin1 strings are memcmp-able, and others are not.
>
> No.  Strings of the same kind are "memcmp-able" regardless of what that
> kind is.

Surely that's true only if they're big-endian.



Re: [Python-ideas] INSANE FLOAT PERFORMANCE!!!

2016-10-12 Thread Terry Reedy

On 10/12/2016 5:57 PM, Elliot Gorokhovsky wrote:
> On Wed, Oct 12, 2016 at 3:51 PM Nathaniel Smith wrote:
>> But this isn't relevant to Python's str, because Python's str never
>> uses UTF-8.
>
> Really? I thought in python 3, strings are all unicode...

They are ...

> so what encoding do they use, then?

Since 3.3, essentially ascii, latin1, utf-16 without surrogates (ucs2),
or utf-32, depending on the highest codepoint.  This is the 'kind'
field.  If we go this route, I suspect that optimizing string sorting
will take some experimentation.  If the initial item is str, it might be
worthwhile to record the highest 'kind' during the type scan, so that
strncmp can be used if all are ascii or latin-1.
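A pure-Python stand-in for the 'kind' field Terry describes (the name `kind` and the bucketing are illustrative, not CPython's API):

```python
# Bucket a string by the width (in bytes) its widest code point needs,
# mirroring CPython's 1/2/4-byte 'kind' since PEP 393 (Python 3.3).
def kind(s):
    m = max(map(ord, s), default=0)
    return 1 if m < 0x100 else 2 if m < 0x10000 else 4

assert kind("ascii") == 1
assert kind("latin-1: \u00e9") == 1
assert kind("bmp: \u0416") == 2
assert kind("astral: \U0001F600") == 4
# Recording the highest kind over a whole list is one max() per item:
assert max(kind(s) for s in ["abc", "\u00e9"]) == 1
```

If that running maximum stays at 1, every string in the list is memcmp-safe.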



--
Terry Jan Reedy



Re: [Python-ideas] INSANE FLOAT PERFORMANCE!!!

2016-10-12 Thread Elliot Gorokhovsky
Ah. That makes a lot of sense, actually. Anyway, so then Latin1 strings are
memcmp-able, and others are not. That's fine; I'll just add a check for
that (I think there are already helper functions for this) and then have
two special-case string functions. Thanks!

On Wed, Oct 12, 2016 at 4:08 PM Alexander Belopolsky <
alexander.belopol...@gmail.com> wrote:

>
> On Wed, Oct 12, 2016 at 5:57 PM, Elliot Gorokhovsky <
> elliot.gorokhov...@gmail.com> wrote:
>
> On Wed, Oct 12, 2016 at 3:51 PM Nathaniel Smith  wrote:
>
> But this isn't relevant to Python's str, because Python's str never uses
> UTF-8.
>
>
> Really? I thought in python 3, strings are all unicode... so what encoding
> do they use, then?
>
>
> No encoding is used.  The actual code points are stored as integers of the
> same size.  If all code points are less than 256, they are stored as 8-bit
> integers (bytes).  If some code points are greater or equal to 256 but less
> than 65536, they are stored as 16-bit integers and so on.
>

Re: [Python-ideas] INSANE FLOAT PERFORMANCE!!!

2016-10-12 Thread Chris Angelico
On Thu, Oct 13, 2016 at 8:51 AM, Nathaniel Smith  wrote:
> The comparison methods on Python's str are codepoint-by-codepoint.

Thanks, that's what I wasn't sure of.

ChrisA


Re: [Python-ideas] INSANE FLOAT PERFORMANCE!!!

2016-10-12 Thread Alexander Belopolsky
On Wed, Oct 12, 2016 at 5:57 PM, Elliot Gorokhovsky <
elliot.gorokhov...@gmail.com> wrote:

> On Wed, Oct 12, 2016 at 3:51 PM Nathaniel Smith  wrote:
>
>> But this isn't relevant to Python's str, because Python's str never uses
>> UTF-8.
>>
>
> Really? I thought in python 3, strings are all unicode... so what encoding
> do they use, then?
>

No encoding is used.  The actual code points are stored as integers of the
same size.  If all code points are less than 256, they are stored as 8-bit
integers (bytes).  If some code points are greater or equal to 256 but less
than 65536, they are stored as 16-bit integers and so on.
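Alexander's description of the flexible storage is observable from Python via memory usage (an editorial demo; exact sizes vary by CPython version, so only the ordering is asserted):

```python
import sys

# The same number of characters takes more memory as the widest code
# point grows: roughly 1, 2, or 4 bytes per character, plus a header.
ascii_s  = "a" * 1000               # all code points < 256
bmp_s    = "\u0416" * 1000          # some >= 256, all < 65536
astral_s = "\U0001F600" * 1000      # some >= 65536
assert sys.getsizeof(ascii_s) < sys.getsizeof(bmp_s) < sys.getsizeof(astral_s)
```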

Re: [Python-ideas] INSANE FLOAT PERFORMANCE!!!

2016-10-12 Thread Elliot Gorokhovsky
On Wed, Oct 12, 2016 at 3:51 PM Nathaniel Smith  wrote:

> But this isn't relevant to Python's str, because Python's str never uses
> UTF-8.
>

Really? I thought in python 3, strings are all unicode... so what encoding
do they use, then?

Re: [Python-ideas] INSANE FLOAT PERFORMANCE!!!

2016-10-12 Thread Elliot Gorokhovsky
> So what's Python doing? Is it a codepoint ordering?
>

...yeah... how is the python interpreter supposed to know what language
strings are in? There is a unique ordering of unicode strings defined by
the unicode standard, AFAIK.
If you want to sort by natural language ordering, see here:
https://pypi.python.org/pypi/natsort
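The default ordering being referred to is plain code point order, which is easy to demonstrate (this reproduces the German example discussed elsewhere in the thread):

```python
# Python's default string ordering is code point order, which is
# language-neutral: 'ö' (U+00F6) sorts after every ASCII letter.
assert ord("ö") == 0xF6 and ord("ö") > ord("z")
assert sorted(["offenbar", "ökonomisch", "olfaktorisch"]) == \
       ["offenbar", "olfaktorisch", "ökonomisch"]
```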

Re: [Python-ideas] INSANE FLOAT PERFORMANCE!!!

2016-10-12 Thread Elliot Gorokhovsky
On Wed, Oct 12, 2016 at 3:39 PM Nathaniel Smith  wrote:

> It looks like PyUnicode_Compare already has a special case to use
> memcmp when both of the strings fit into latin1:
>

Wow! That's great! I didn't even try reading through unicode_compare,
because I felt I might miss some subtle detail that would break everything.
But ya, that's great! Since surely latin1 is the most common use case. So
I'll just add a latin1 check in the check-loop, and then I'll have two
unsafe_unicode_compare functions. I felt bad about not being able to get
the same kind of string performance I had gotten with python2, so this is
nice.

Re: [Python-ideas] INSANE FLOAT PERFORMANCE!!!

2016-10-12 Thread Chris Angelico
On Thu, Oct 13, 2016 at 8:19 AM, Elliot Gorokhovsky wrote:
>
> My first question was how expensive python compares are vs C compares. And
> since python 2 has PyString_AS_STRING, which just gives you a char* pointer
> to a C string, I went in and replaced PyObject_RichCompareBool with strcmp
> and did a simple benchmark. And I was just totally blown away; it turns out
> you get something like a 40-50% improvement (at least on my simple
> benchmark).
>
> So that was the motivation for all this. Actually, if I wrote this for
> python 2, I might be able to get even better numbers (at least for strings),
> since we can't use strcmp in python 3. (Actually, I've heard UTF-8 strings
> are strcmp-able, so maybe if we go through and verify all the strings are
> UTF-8 we can strcmp them? I don't know enough about how PyUnicode stuff
> works to do this safely).

I'm not sure what you mean by "strcmp-able"; do you mean that the
lexical ordering of two Unicode strings is guaranteed to be the same
as the byte-wise ordering of their UTF-8 encodings? I don't think
that's true, but then, I'm not entirely sure how Python currently
sorts strings. Without knowing which language the text represents,
it's not possible to sort perfectly.

https://en.wikipedia.org/wiki/Collation#Automated_collation
"""
Problems are nonetheless still common when the algorithm has to
encompass more than one language. For example, in German dictionaries
the word ökonomisch comes between offenbar and olfaktorisch, while
Turkish dictionaries treat o and ö as different letters, placing oyun
before öbür.
"""

Which means these lists would already be considered sorted, in their
respective languages:

rosuav@sikorsky:~$ python3
Python 3.7.0a0 (default:a78446a65b1d+, Sep 29 2016, 02:01:55)
[GCC 6.1.1 20160802] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> sorted(["offenbar", "ökonomisch", "olfaktorisch"])
['offenbar', 'olfaktorisch', 'ökonomisch']
>>> sorted(["oyun", "öbür", "parıldıyor"])
['oyun', 'parıldıyor', 'öbür']

So what's Python doing? Is it a codepoint ordering?

ChrisA

Re: [Python-ideas] INSANE FLOAT PERFORMANCE!!!

2016-10-12 Thread Nathaniel Smith
On Wed, Oct 12, 2016 at 2:19 PM, Elliot Gorokhovsky wrote:
[...]
> So that was the motivation for all this. Actually, if I wrote this for
> python 2, I might be able to get even better numbers (at least for strings),
> since we can't use strcmp in python 3. (Actually, I've heard UTF-8 strings
> are strcmp-able, so maybe if we go through and verify all the strings are
> UTF-8 we can strcmp them? I don't know enough about how PyUnicode stuff
> works to do this safely). My string special case currently just bypasses the
> typechecks and goes to unicode_compare(), which is still wayyy overkill for
> the common case of ASCII or Latin-1 strings, since it uses a for loop to go
> through and check characters, and strcmp uses compiler magic to do it in
> like, negative time or something. I even PyUnicode_READY the strings before
> comparing; I'm not sure if that's really necessary, but that's how
> PyUnicode_Compare does it.

It looks like PyUnicode_Compare already has a special case to use
memcmp when both of the strings fit into latin1:


https://github.com/python/cpython/blob/cfc517e6eba37f1bd61d57bf0dbece9843bff9c8/Objects/unicodeobject.c#L10855-L10860

I suppose the for loops that are used for multibyte strings could
potentially be sped up with SIMD or something, but that gets
complicated fast, and modern compilers might even be doing it already.

-n

-- 
Nathaniel J. Smith -- https://vorpus.org


Re: [Python-ideas] INSANE FLOAT PERFORMANCE!!!

2016-10-12 Thread Elliot Gorokhovsky
On Tue, Oct 11, 2016 at 9:56 PM Nick Coghlan  wrote:

> Once you get to the point of being able to do performance measurements on
> a CPython build with a modified list.sort() implementation, you'll
> want to take a look at the modern benchmark suite in
> https://github.com/python/performance
>

Yup, that's the plan. I'm going to implement optimized compares for tuples,
then implement this as a CPython build, and then run benchmark suites and
write some rigorous benchmarks using perf/timeit.

Re: [Python-ideas] INSANE FLOAT PERFORMANCE!!!

2016-10-12 Thread Elliot Gorokhovsky
On Wed, Oct 12, 2016 at 9:20 AM Tim Peters  wrote:

> > What I'm *not* quite clear on is why Python 3's change to reject
> > comparisons between unrelated types makes this optimisation possible.
>
> It doesn't.  It would also apply in Python 2.  I simply expect the
> optimization will pay off more frequently in Python 3 code.  For
> example, in Python 2 I used to create lists with objects of wildly
> mixed types, and sort them merely to bring objects of the same type
> next to each other.  Things "like that" don't work at all in Python 3.
>
>
> > Surely you have to check either way? It's not that it's a particularly
> > important question - if the optimisation works, it's not that big a
> > deal what triggered the insight. It's just that I'm not sure if
> > there's some other point that I've not properly understood.
>

Yup. Actually, the initial version of this work was with Python 2. What
happened was this: I had posted earlier something along the lines of "hey
everybody let's radix sort strings instead of merge sort because that will
be more fun ok". And everyone wrote me back "no please don't are you
kidding". Tim Peters wrote back "try it but just fyi it's not gonna work".
So I set off to try it. I had never used the C API before, but luckily I
found some Python 2 documentation that gives an example of subclassing
list, so I was able to mostly just copy-paste to get a working list
extension module. I then copied over the implementation of listsort.

My first question was how expensive python compares are vs C compares. And
since python 2 has PyString_AS_STRING, which just gives you a char* pointer
to a C string, I went in and replaced PyObject_RichCompareBool with strcmp
and did a simple benchmark. And I was just totally blown away; it turns out
you get something like a 40-50% improvement (at least on my simple
benchmark).

So that was the motivation for all this. Actually, if I wrote this for
python 2, I might be able to get even better numbers (at least for
strings), since we can't use strcmp in python 3. (Actually, I've heard
UTF-8 strings are strcmp-able, so maybe if we go through and verify all the
strings are UTF-8 we can strcmp them? I don't know enough about how
PyUnicode stuff works to do this safely). My string special case currently
just bypasses the typechecks and goes to unicode_compare(), which is still
wayyy overkill for the common case of ASCII or Latin-1 strings, since it
uses a for loop to go through and check characters, and strcmp uses
compiler magic to do it in like, negative time or something. I even
PyUnicode_READY the strings before comparing; I'm not sure if that's really
necessary, but that's how PyUnicode_Compare does it.
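On the "UTF-8 strings are strcmp-able" aside: for the code point ordering that Python's str actually uses, UTF-8 byte order does agree, since UTF-8 was designed to be order-preserving. A quick editorial check (not part of the original thread):

```python
# UTF-8 byte-wise ordering matches code point ordering for valid
# strings, so sorting the UTF-8 bytes gives the same result as sorting
# the strings themselves.
words = ["abc", "offenbar", "ökonomisch", "\u0100", "\U0001F600"]
by_codepoint = sorted(words)
by_utf8_bytes = [b.decode("utf-8")
                 for b in sorted(w.encode("utf-8") for w in words)]
assert by_codepoint == by_utf8_bytes
```

The practical obstacle is that CPython does not store str as UTF-8, so using strcmp would first require encoding, which defeats the point.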

Re: [Python-ideas] INSANE FLOAT PERFORMANCE!!!

2016-10-12 Thread Tim Peters
[Nick Coghlan]
> It's probably more relevant that cmp() went away, since that
> simplified the comparison logic to just PyObject_RichCompareBool,
> without the custom comparison function path.

Well, the current sort is old by now, and was written for Python 2.
But it did anticipate that rich comparisons were the future, and
deliberately restricted itself to using only "<" (Py_LT) comparisons.
So, same as now, only the "<" path needed to be examined.


> It *might* have still been possible to do something like this in the
> Py2 code (since the main requirement is to do the pre-check for
> consistent types if the first object is of a known type with an
> optimised fast path),

It shouldn't really matter whether it's a known type.  For any type,
if it's known that all the objects are of that type, that type's
tp_richcompare slot can be read up once, and if non-NULL used
throughout.  That would save several levels of function call per
comparison during the sort; although that's not factor-of-3-speedup
potential, it should still be a significant win.


> but I don't know anyone that actually *likes* adding new special cases
> to already complex code and trying to figure out how to test whether
> or not they've broken anything :)

A nice thing about this one is that special cases are a one-time thing
at the start, and don't change anything in the vast bulk of the
current sorting code.  So when it breaks, it should be pretty easy to
figure out why ;-)


Re: [Python-ideas] INSANE FLOAT PERFORMANCE!!!

2016-10-12 Thread Tim Peters
[Paul Moore]
> My understanding is that the code does a pre-check that all the
> elements of the list are the same type (float, for example). This is a
> relatively quick test (O(n) pointer comparisons). If everything *is* a
> float, then an optimised comparison routine is used that skips all the
> type checks and goes straight to a float comparison (single machine op).

That matches my understanding.

> Because there are more than O(n) comparisons done in a typical sort,
> this is a win.

If the types are in fact all the same, it should be a win even for
n==2 (at n < 2 no comparisons are done; at n==2 exactly 1 comparison
is done):  one pointer compare + go-straight-to-C-float-"x < y" is
cheaper than the general rich-comparison machinery.

> And because the type checks needed in rich comparison

And layers of function calls.

> are much more expensive than a pointer check, it's a *big* win.

Bingo :-)


> What I'm *not* quite clear on is why Python 3's change to reject
> comparisons between unrelated types makes this optimisation possible.

It doesn't.  It would also apply in Python 2.  I simply expect the
optimization will pay off more frequently in Python 3 code.  For
example, in Python 2 I used to create lists with objects of wildly
mixed types, and sort them merely to bring objects of the same type
next to each other.  Things "like that" don't work at all in Python 3.


> Surely you have to check either way? It's not that it's a particularly
> important question - if the optimisation works, it's not that big a
> deal what triggered the insight. It's just that I'm not sure if
> there's some other point that I've not properly understood.

Well, either your understanding is fine, or we're both confused :-)


Re: [Python-ideas] INSANE FLOAT PERFORMANCE!!!

2016-10-12 Thread Nick Coghlan
On 12 October 2016 at 21:35, Paul Moore  wrote:
> What I'm *not* quite clear on is why Python 3's change to reject
> comparisons between unrelated types makes this optimisation possible.
> Surely you have to check either way? It's not that it's a particularly
> important question - if the optimisation works, it's not that big a
> deal what triggered the insight. It's just that I'm not sure if
> there's some other point that I've not properly understood.

It's probably more relevant that cmp() went away, since that
simplified the comparison logic to just PyObject_RichCompareBool,
without the custom comparison function path.

It *might* have still been possible to do something like this in the
Py2 code (since the main requirement is to do the pre-check for
consistent types if the first object is of a known type with an
optimised fast path), but I don't know anyone that actually *likes*
adding new special cases to already complex code and trying to figure
out how to test whether or not they've broken anything :)

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-ideas] INSANE FLOAT PERFORMANCE!!!

2016-10-12 Thread Paul Moore
On 12 October 2016 at 11:16, Steven D'Aprano  wrote:
> On Wed, Oct 12, 2016 at 12:25:16AM +0000, Elliot Gorokhovsky wrote:
>
>> Regarding generalization: the general technique for special-casing is you
>> just substitute all type checks with 1 or 0 by applying the type assumption
>> you're making. That's the only way to guarantee it's safe and compliant.
>
> I'm confused -- I don't understand how *removing* type checks can
> possible guarantee the code is safe and compliant.
>
> It's all very well and good when you are running tests that meet your
> type assumption, but what happens if they don't? If I sort a list made
> up of (say) mixed int and float (possibly including subclasses), does
> your "all type checks are 1 or 0" sort segfault? If not, why not?
> Where's the safety coming from?

My understanding is that the code does a pre-check that all the
elements of the list are the same type (float, for example). This is a
relatively quick test (O(n) pointer comparisons). If everything *is* a
float, then an optimised comparison routine is used that skips all the
type checks and goes straight to a float comparison (single machine op).
Because there are more than O(n) comparisons done in a typical sort,
this is a win. And because the type checks needed in rich comparison
are much more expensive than a pointer check, it's a *big* win.
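The pre-check Paul describes can be sketched in pure Python (a hypothetical illustration; `homogeneous_type` is an invented helper, not a CPython function):

```python
def homogeneous_type(lst):
    """O(n) pre-check: the exact type shared by all items, or None."""
    if not lst:
        return None
    t = type(lst[0])
    return t if all(type(x) is t for x in lst) else None

assert homogeneous_type([1.0, 2.0, 3.0]) is float
assert homogeneous_type([1, 2.0]) is None    # mixed: use the general path
assert homogeneous_type([True, 1]) is None   # bool is a subclass, not int
```

Note the `type(x) is t` exact-type test: subclasses may override comparison, so they must fall back to the general path.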

What I'm *not* quite clear on is why Python 3's change to reject
comparisons between unrelated types makes this optimisation possible.
Surely you have to check either way? It's not that it's a particularly
important question - if the optimisation works, it's not that big a
deal what triggered the insight. It's just that I'm not sure if
there's some other point that I've not properly understood.

Paul


Re: [Python-ideas] INSANE FLOAT PERFORMANCE!!!

2016-10-11 Thread Nick Coghlan
On 12 October 2016 at 06:58, Elliot Gorokhovsky wrote:
> So I got excited here. And the reason why is that I got those numbers *on
> Tim's benchmark*. When I got these kinds of numbers on my benchmarks, I
> figured there was probably a problem with the way I was timing, and
> certainly the gains couldn't be as extreme as they suggested. But this is on
> a benchmark that's already in the codebase!

Thanks for the clearer write-up - this is indeed very cool, and it's
wonderful to see that the new assumptions permitted by Python 3
getting stricter about cross-type ordering comparisons may lead to
speed-ups for certain common kinds of operations (i.e. sorting lists
where the sorting keys are builtin immutable types).

Once you get to the point of being able to do performance measurements on
a CPython build with a modified list.sort() implementation, you'll
want to take a look at the modern benchmark suite in
https://github.com/python/performance

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-ideas] INSANE FLOAT PERFORMANCE!!!

2016-10-11 Thread Terry Reedy

On 10/11/2016 2:30 PM, Paul Moore wrote:
> On 11 October 2016 at 17:49, Nick Coghlan wrote:
>> On 12 October 2016 at 02:16, Elliot Gorokhovsky wrote:
>>> So I thought, wow, this will give some nice numbers! But I underestimated
>>> the power of this optimization. You have no idea. It's crazy.
>>> This is just insane. This is crazy.
>>
>> Not to take away from the potential for speed improvements (which do
>> indeed seem interesting), but I'd ask that folks avoid using mental
>> health terms to describe test results that we find unbelievable. There
>> are plenty of other adjectives we can use, and a text-based medium
>> like email gives us a chance to proofread our posts before we send
>> them.
>
> I'd also suggest toning down the rhetoric a bit (all-caps title, "the
> contents of this message may be dangerous for readers with heart
> conditions" etc.)

I triple the motion.  In general, all caps = spam or worse and I usually
don't even open such posts.  Elliot, to me, all caps means IGNORE ME.  I
suspect this is not what you want.

> Your results do seem good, but it's a little hard to
> work out what you actually did, and how your results were produced,
> through the hype. It'll be much better when someone else has a means
> to reproduce your results to confirm them. In all honesty, people
> have been working on Python's performance for a long time now, and I'm
> more inclined to think that a 50% speedup is a mistake rather than an
> opportunity that's been missed for all that time. I'd be happy to be
> proved wrong, but for now I'm skeptical.

I'm not, in the same sense, even though Elliot suggested that we should
be ;-).  His key insight is that if all members of a list have the same
type (which is a common 'special case'), then we can replace the
general, somewhat convoluted, rich-comparison function, containing at
least two type checks, with a faster special-case comparison function
without any type checks.  Since Python floats wrap machine doubles, I
expect that float may have the greatest speedup.

> Please continue working on this - I'd love my skepticism to be proved wrong!

It may be the case now that sorting a list of all floats is faster than
a mixed list of ints and floats.  I expect that it definitely will be
with a float comparison function.


--
Terry Jan Reedy



Re: [Python-ideas] INSANE FLOAT PERFORMANCE!!!

2016-10-11 Thread Elliot Gorokhovsky
So I got excited here. And the reason why is that I got those numbers *on
Tim's benchmark*. When I got these kinds of numbers on my benchmarks, I
figured there was probably a problem with the way I was timing, and
certainly the gains couldn't be as extreme as they suggested. But this is
on a benchmark that's already in the codebase!


Here is a detailed explanation of how to reproduce my results, and the
circumstances that would lead them to be invalid:

**

To reproduce, just activate a virtualenv, and then clone
https://github.com/embg/python-fast-listsort.git. Then python setup.py
install and python sortperf.py.


Now let's look at what sortperf.py does and how it relates to Tim's
benchmark at Lib/test/sortperf.py. If you diff the two, you'll find I made
three changes:


1. I added an import, "import fastlist". This obviously would not make
sorting twice faster.


2. I changed the way it formats the output: I changed "fmt = ("%2s %7s" + "
%7s"*len(cases))" to "fmt = ("%2s %7s" + " %6s"*len(cases))". Again
irrelevant.


3. I changed the timing function

#from this


def doit_fast(L):
    t0 = time.perf_counter()
    L.fastsort()
    t1 = time.perf_counter()
    print("%6.2f" % (t1-t0), end=' ')
    flush()



#to this


def doit(L):
    F = FastList(L)
    f0 = time.perf_counter()
    F.fastsort()
    f1 = time.perf_counter()
    F = FastList(L)
    t0 = time.perf_counter()
    F.sort()
    t1 = time.perf_counter()
    print("%6.2f%%" % (100*(1-(f1-f0)/(t1-t0))), end=' ')
    flush()


***

So what we've shown is that (1) if you trust the existing sorting benchmark
and (2) if my modification to doit() doesn't mess anything up (I leave this
up to you to judge), then the measurements are as valid. Which is a pretty
big deal (50% !!!), hence my overexcitement.




Now I'd like to respond to responses (the one I'm thinking of was off-list
so I don't want to quote it) I've gotten questioning how it could be
possible for such a small optimization, bypassing the typechecks, to
possibly have such a large effect, even in theory. Here's my answer:

Let's ignore branch prediction and cache for now and just look at a high
level. The cost of sorting is related to the cost of a single comparison,
because the vast majority of our time (let's say certainly at least 90%,
depending on the list) is spent in comparisons. So let's look at the cost
of a comparison.

Without my optimization, comparisons for floats (that's what this benchmark
looks at) go roughly like this:

1. Test type of left and right for PyObject_RichCompare (which costs two
pointer dereferences) and compare them. "3 ops" (quotes because counting
ops like this is pretty hand-wavy). "2 memory accesses".

2. Get the address of the float compare method from
PyFloat_Type->tp_richcompare. "1 op". "1 memory access".

3. Call the function whose address we just got. "1 op". "Basically 0 memory
accesses because we count the stack stuff in that 1 op".

4. Test type of left and right again in PyFloat_RichCompare and compare
both of them to PyFloat_Type. "4 ops". "2 memory accesses".

5. Get floats from the PyObject* by calling PyFloat_AS_DOUBLE or whatever.
"2 ops". "2 memory accesses".

6. Compare the floats and return. "2 ops".

Now let's tally the "cost" (sorry for use of quotes here, just trying to
emphasize this is an intuitive, theoretical explanation for the numbers
which doesn't take into account the hardware):
"13 ops, 7 memory accesses".

Here's what it looks like in my code:

1. Call PyFloat_AS_DOUBLE on left and right. "2 ops". "2 memory acceses".

2. Compare the floats and return. "2 ops".

Tally: "4 ops, 2 memory accesses".

Now you can argue branch prediction alleviates a lot of this cost, since
we're taking the same branches every time. But note that, branch prediction
or not, we still have to do all of those memory accesses, and since they're
pointers to places all over memory, they miss the cache basically every
time (correct me if I'm wrong). So memory-wise, we really are doing
something like a 7:2 ratio, and op-wise, perhaps not as bad because of
branch prediction, but still, 13:4 is probably bad no matter what's going
on in the hardware.

Now consider that something like 90% of our time is spent in those steps.
Are my numbers really that unbelievable?

Thanks for everything, looking forward to writing this up as a nice LaTeX
doc with graphs and perf benchmarks and all the other rigorous goodies, as
well as a special case cmp func for homogeneous tuples and a simple patch
file,

Elliot
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/

Re: [Python-ideas] INSANE FLOAT PERFORMANCE!!!

2016-10-11 Thread Paul Moore
On 11 October 2016 at 17:49, Nick Coghlan  wrote:
> On 12 October 2016 at 02:16, Elliot Gorokhovsky
>  wrote:
>> So I thought, wow, this will give some nice numbers! But I underestimated
>> the power of this optimization. You have no idea. It's crazy.
>> This is just insane. This is crazy.
>
> Not to take away from the potential for speed improvements (which do
> indeed seem interesting), but I'd ask that folks avoid using mental
> health terms to describe test results that we find unbelievable. There
> are plenty of other adjectives we can use, and a text-based medium
> like email gives us a chance to proofread our posts before we send
> them.

I'd also suggest toning down the rhetoric a bit (all-caps title, "the
contents of this message may be dangerous for readers with heart
conditions", etc.). Your results do seem good, but it's a little hard to
work out what you actually did, and how your results were produced,
through the hype. It'll be much better when someone else has a means
to reproduce your results to confirm them. In all honesty, people
have been working on Python's performance for a long time now, and I'm
more inclined to think that a 50% speedup is a mistake rather than an
opportunity that's been missed for all that time. I'd be happy to be
proved wrong, but for now I'm skeptical.

Please continue working on this - I'd love my skepticism to be proved wrong!

Paul


Re: [Python-ideas] INSANE FLOAT PERFORMANCE!!!

2016-10-11 Thread Nick Coghlan
On 12 October 2016 at 02:16, Elliot Gorokhovsky
 wrote:
> So I thought, wow, this will give some nice numbers! But I underestimated
> the power of this optimization. You have no idea. It's crazy.
> This is just insane. This is crazy.

Not to take away from the potential for speed improvements (which do
indeed seem interesting), but I'd ask that folks avoid using mental
health terms to describe test results that we find unbelievable. There
are plenty of other adjectives we can use, and a text-based medium
like email gives us a chance to proofread our posts before we send
them.

Regards,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


[Python-ideas] INSANE FLOAT PERFORMANCE!!!

2016-10-11 Thread Elliot Gorokhovsky
Warning: the contents of this message may be dangerous for readers with
heart conditions.

So today I looked at PyFloat_RichCompare. I had been scared initially
because it was so complicated, I was worried I would miss some important
error check or something if I special cased it. So I took the following
approach: I replaced all statements of the form PyLong_Check(w) with 0, and
all statements of the form PyFloat_Check(w) with 1. Because that's the
whole point; we're cutting out type checks. I then cut out all blocks
preceded by if (0). So it's definitely safe.

And it turns out that if you do that, PyFloat_RichCompare becomes ONE LINE
(after we set op=Py_LT)!!! Just

return (PyFloat_AS_DOUBLE(v) < PyFloat_AS_DOUBLE(w)) == 0 ? Py_False : Py_True;

So I thought, wow, this will give some nice numbers! But I underestimated
the power of this optimization. You have no idea. It's crazy.

Since we're dealing with floats, I used Tim Peters' benchmark:
Lib/test/sortperf.py, just modifying one function:

# Old function Tim Peters wrote:
def doit_fast(L):
    t0 = time.perf_counter()
    L.fastsort()
    t1 = time.perf_counter()
    print("%6.2f" % (t1-t0), end=' ')
    flush()

# My function:
def doit(L):
    F = FastList(L)
    f0 = time.perf_counter()
    F.fastsort()
    f1 = time.perf_counter()
    F = FastList(L)
    t0 = time.perf_counter()
    F.sort()
    t1 = time.perf_counter()
    print("%6.2f%%" % (100*(1-(f1-f0)/(t1-t0))), end=' ')
    flush()

So the benchmarking here is valid. I didn't write it. All I did was modify
it to print percent improvement instead of sort time. The numbers below are
percent improvement of my sort vs default sort (just clone my repo and run
python sortperf.py to verify):

 i     2**i   *sort   \sort   /sort   3sort   +sort   %sort   ~sort   =sort   !sort
15    32768  44.11%  54.69%  47.41%  57.61%  50.17%  75.24%  68.03%  65.16%  82.16%
16    65536  14.14%  75.38%  63.44%  56.56%  67.99%  66.19%  50.72%  61.55%  61.87%
17   131072  73.54%  60.52%  60.97%  61.63%  52.55%  49.84%  68.68%  84.12%  65.90%
18   262144  54.19%  55.34%  54.67%  54.13%  55.62%  52.88%  69.30%  74.86%  72.66%
19   524288  55.12%  53.32%  53.77%  54.27%  52.97%  53.53%  67.55%  76.60%  78.56%
20  1048576  55.05%  51.09%  60.05%  50.69%  62.98%  50.20%  66.24%  71.47%  61.40%

This is just insane. This is crazy. I didn't write this benchmark, OK, this
is a valid benchmark. Tim Peters wrote it. And except for one trial, they
all show more than 50% improvement. Some in the 70s and 80s. This is crazy.
This is so cool. I just wanted to share this with you guys. I'll submit a
patch to bugs.python.org soon; I just have to write a special case
comparison for tuples and then I'm basically done. This is so cool!
50% man! Crazy!

Elliot

On Mon, Oct 10, 2016 at 9:02 PM Tim Peters  wrote:

> [please restrict follow-ups to python-ideas]
>
> Let's not get hung up on meta-discussion here - I always thought "massive
> clusterf**k" was a precise technical term anyway ;-)
>
> While timing certainly needs to be done more carefully, it's obvious to me
> that this approach _should_ pay off significantly when it applies.
> Comparisons are extraordinarily expensive in Python, precisely because of
> the maze of test-and-branch code it requires just to figure out which
> bottom-level comparison function to invoke each time.  That's why I spent
> months of my life (overall) devising a sequence of sorting algorithms for
> Python that reduced the number of comparisons needed.
>
> Note that when Python's current sort was adopted in Java, they still kept
> a quicksort variant for "unboxed" builtin types.  The adaptive merge sort
> incurs many overheads that often cost more than they save unless
> comparisons are in fact very expensive compared to the cost of pointer
> copying (and in Java comparison of unboxed types is cheap).  Indeed, for
> native numeric types, where comparison is dirt cheap, quicksort generally
> runs faster than mergesort despite that the former does _more_ comparisons
> (because mergesort does so much more pointer-copying).
>
> I had considered something "like this" for Python 2, but didn't pursue it
> because comparison was defined between virtually any two types (34 < [1],
> etc), and people were careless about that (both by design and by
> accident).  In Python 3, comparison "blows up" for absurdly mixed types, so
> specializing for homogeneously-typed lists is a more promising idea on the
> face of it.
>
> The comparisons needed to determine _whether_ a list's objects have a
> common type is just len(list)-1 C-level pointer comparisons, and so goes
> fast.  So I expect that, when it applies, this would speed even sorting an
> already-ordered list with at least 2 elements.
>
> For a mixed-type list with at least 2 elements, it will always be pure
> loss.  But (a) I expect such lists are uncommon (and especially uncommon in
> Python 3); and (b) a one-time scan doing C-level pointer comparisons