Re: [Python-Dev] Drastically improving list.sort() for lists of strings/ints

2016-09-15 Thread Nikolaus Rath
On Sep 13 2016, Tim Peters  wrote:
> [Terry Reedy ]
>>> Tim Peters investigated and empirically determined that an
>>> O(n*n) binary insort, as he optimized it on real machines, is faster
>>> than O(n*logn) sorting for up to around 64 items.
>
> [Nikolaus Rath ]
>> Out of curiosity: is this test repeated periodically on different
>> architectures? Or could it be that it only ever was true 10 years ago on
>> Tim's Power Mac G5 (or whatever he used)?
>
> It has little to do with architecture, but much to do with the
> relative cost of comparisons versus pointer-copying.  Near the end of
>
> https://github.com/python/cpython/blob/master/Objects/listsort.txt
[...]

Ah, that makes sense, thanks! 

Best,
-Nikolaus

-- 
GPG encrypted emails preferred. Key id: 0xD113FCAC3C4E599F
Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

 »Time flies like an arrow, fruit flies like a Banana.«


Re: [Python-Dev] Drastically improving list.sort() for lists of strings/ints

2016-09-13 Thread Stephen J. Turnbull
Nikolaus Rath writes:

 > Out of curiosity: is this test repeated periodically on different
 > architectures? Or could it be that it only ever was true 10 years
 > ago on Tim's Power Mac G5 (or whatever he used)?

This is the same Tim Peters of the eponymous test for readability of
syntax: "Syntax shall not look like grit on Tim's screen."  I don't
know if that will help you decide whether to trust his analysis, but
it comforts me. <0.5 wink/>



Re: [Python-Dev] Drastically improving list.sort() for lists of strings/ints

2016-09-13 Thread Tim Peters
[Terry Reedy ]
>> Tim Peters investigated and empirically determined that an
>> O(n*n) binary insort, as he optimized it on real machines, is faster
>> than O(n*logn) sorting for up to around 64 items.

[Nikolaus Rath ]
> Out of curiosity: is this test repeated periodically on different
> architectures? Or could it be that it only ever was true 10 years ago on
> Tim's Power Mac G5 (or whatever he used)?

It has little to do with architecture, but much to do with the
relative cost of comparisons versus pointer-copying.  Near the end of

https://github.com/python/cpython/blob/master/Objects/listsort.txt

"""
BINSORT
A "binary insertion sort" is just like a textbook insertion sort, but instead
of locating the correct position of the next item via linear (one at a time)
search, an equivalent to Python's bisect.bisect_right is used to find the
correct position in logarithmic time.  Most texts don't mention this
variation, and those that do usually say it's not worth the bother:  insertion
sort remains quadratic (expected and worst cases) either way.  Speeding the
search doesn't reduce the quadratic data movement costs.

But in CPython's case, comparisons are extraordinarily expensive compared to
moving data, and the details matter.  Moving objects is just copying
pointers.  Comparisons can be arbitrarily expensive (can invoke arbitrary
user-supplied Python code), but even in simple cases (like 3 < 4) _all_
decisions are made at runtime:  what's the type of the left comparand?  the
type of the right?  do they need to be coerced to a common type?  where's the
code to compare these types?  And so on.  Even the simplest Python comparison
triggers a large pile of C-level pointer dereferences, conditionals, and
function calls.

So cutting the number of compares is almost always measurably helpful in
CPython, and the savings swamp the quadratic-time data movement costs for
reasonable minrun values.
"""

Binsort does a close-to-optimal number of comparisons on randomly
ordered data, and that's the point.  Also, in the context of the
overall sorting algorithm, binsort is used to _extend_ the length of a
naturally occurring "too short" run.  There's no need to sort the
whole thing from scratch, because we already know the prefix is
sorted.  That makes binsort a more-or-less obvious choice: it takes
full advantage of knowing that the prefix is already ordered.
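
For concreteness, a minimal Python sketch of that idea (not the C
implementation; the name binsort_extend is made up for illustration):

from bisect import bisect_right

def binsort_extend(a, lo, start, hi):
    # a[lo:start] is already sorted; extend the sorted region to a[lo:hi].
    # Each insertion costs O(log i) compares (the bisect) but O(i) moves;
    # in CPython the moves are cheap pointer copies, the compares are not.
    for i in range(start, hi):
        item = a[i]
        pos = bisect_right(a, item, lo, i)
        a[pos + 1:i + 1] = a[pos:i]   # shift the tail right one slot
        a[pos] = item                 # drop the new item into place

Using bisect_right (rather than bisect_left) puts each new item after
any equal items already placed, so the sort stays stable.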

As that doc also says:

"""
When N is a power of 2, testing on random data showed that minrun values of
16, 32, 64 and 128 worked about equally well. At 256 the data-movement cost
in binary insertion sort clearly hurt, and at 8 the increase in the number
of function calls clearly hurt.
"""

So it settled on forcing minrun into the range 32 <= minrun <= 64 (the
precise value depends on the number of elements in the entire array,
for reasons also explained in that doc).  That's far from either end
where the value clearly mattered.
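
The computation itself is tiny; here is a Python transcription of the
rule described there (the C routine is merge_compute_minrun):

def merge_compute_minrun(n):
    # Take the six most-significant bits of n, and add 1 if any of the
    # remaining bits are set.  For n >= 64 the result lands in [32, 64],
    # chosen so n/minrun is a power of 2 or slightly less than one;
    # for n < 64, minrun is n itself and binsort sorts the whole array.
    r = 0
    while n >= 64:
        r |= n & 1
        n >>= 1
    return n + r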

If the full path through Python's expensive
PyObject_RichCompareBool(X, Y, Py_LT) has gotten significantly faster,
a smaller minrun range may make more sense now; or if it's gotten
significantly slower, a larger minrun range.

But, no, I don't believe anyone retests it.

IIRC, when the algorithm was adopted in Java, they found a minrun
range of 16 through 32 worked marginally better for them, because
_their_ spelling of PyObject_RichCompareBool (for Java object
comparison methods) is faster than CPython's.


Re: [Python-Dev] Drastically improving list.sort() for lists of strings/ints

2016-09-13 Thread Random832
On Tue, Sep 13, 2016, at 13:24, Nikolaus Rath wrote:
> On Sep 11 2016, Terry Reedy  wrote:
> > Tim Peters investigated and empirically determined that an
> > O(n*n) binary insort, as he optimized it on real machines, is faster
> > than O(n*logn) sorting for up to around 64 items.
> 
> Out of curiosity: is this test repeated periodically on different
> architectures? Or could it be that it only ever was true 10 years ago on
> Tim's Power Mac G5 (or whatever he used)?

Binary insertion sort is still O(n*logn) in comparisons, so it's likely
that the crossover comes from short memmoves being fast enough (thanks
to cache effects) not to matter. The number might have gotten larger or
smaller, though. I wonder if it's something that could be tuned
dynamically, at compile time or install time.


Re: [Python-Dev] Drastically improving list.sort() for lists of strings/ints

2016-09-13 Thread Nikolaus Rath
On Sep 11 2016, Terry Reedy  wrote:
> Tim Peters investigated and empirically determined that an
> O(n*n) binary insort, as he optimized it on real machines, is faster
> than O(n*logn) sorting for up to around 64 items.

Out of curiosity: is this test repeated periodically on different
architectures? Or could it be that it only ever was true 10 years ago on
Tim's Power Mac G5 (or whatever he used)?

Best,
-Nikolaus

-- 
GPG encrypted emails preferred. Key id: 0xD113FCAC3C4E599F
Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

 »Time flies like an arrow, fruit flies like a Banana.«


Re: [Python-Dev] Drastically improving list.sort() for lists of strings/ints

2016-09-11 Thread Elliot Gorokhovsky
Wow, Tim himself! Regarding performance on semi-ordered data: we'll have
to benchmark to see, but intuitively I imagine radix would match Timsort,
because verifying that a list of strings is sorted takes Omega(nw) (which
gives a lower bound on Timsort), where w is the word length. Radix sort is
Theta(nw). So at least asymptotically it checks out. I think if one uses
the two-array algorithm, other semi-sortings can also be exploited, since
the items get placed into their respective buckets in the order in which
they appear in the list. So, for the example you gave, one pass would sort
it correctly (since the list has the property that if x_1 and x_2 are in
bucket b, x_1 comes before x_2 in the list, so x_1 will also come before
x_2 in the bucket; except possibly for one "border bucket" that includes
n/2). And then it would just be Theta(nw/b) in each bucket to verify
sortedness. Honestly, the cool thing about radix is that the best case
for Timsort on strings, Omega(nw), is the worst case for radix! So the
point is, I think the two-array version, at least, preserves a lot of
structure. Anyway, I hope to have benchmarks (relatively) soon! (I'm a
senior in high school, so I'm pretty busy... but I'll try to get on this
as soon as I can.)
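
A toy Python sketch of the two-array bucket idea for byte strings
(illustrative names only, not a proposed implementation):

def msd_radix_sort(strings, pos=0):
    # Stable MSD radix sort over bytes objects.  Items enter each
    # bucket in list order, which is what preserves existing structure.
    if len(strings) <= 1:
        return list(strings)
    buckets = [[] for _ in range(257)]   # slot 0: exhausted at pos
    for s in strings:
        buckets[0 if pos >= len(s) else s[pos] + 1].append(s)
    out = buckets[0]                     # shorter strings sort first
    for b in buckets[1:]:
        out.extend(msd_radix_sort(b, pos + 1))
    return out

# e.g. msd_radix_sort([b"banana", b"app", b"apple"]) returns
# [b'app', b'apple', b'banana']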


On Sun, Sep 11, 2016 at 3:42 PM Tim Peters  wrote:

> [redirected from python-dev, to python-ideas;
>  please send followups only to python-ideas]
>
> [Elliot Gorokhovsky ]
> > ...
> > TL;DR: Should I spend time making list.sort() detect if it is sorting
> > lexicographically and, if so, use a much faster algorithm?
>
> It will be fun to find out ;-)
>
> As Mark, and especially Terry, pointed out, a major feature of the
> current sort is that it can exploit many kinds of pre-existing order.
> As the paper you referenced says, "Realistic sorting problems are
> usually far from random."  But, although they did run some tests
> against data with significant order, they didn't test against any
> algorithms _aiming_ at exploiting uniformity.  Just against their
> radix sort variants, and against a quicksort.
>
> That's where it's likely next to impossible to guess in advance
> whether radix sort _will_ have a real advantage.  All the kinds of
> order the current sort can exploit are far from obvious, because the
> mechanisms it employs are low-level & very general.  For example,
> consider arrays created by this function, for even `n`:
>
> def bn(n):
>     r = [None] * n
>     r[0::2] = range(0, n//2)
>     r[1::2] = range(n//2, n)
>     return r
>
> Then, e.g.,
>
> >>> bn(16)
> [0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15]
>
> This effectively takes range(n), cuts it in half, and does a "perfect
> shuffle" on the two halves.
>
> You'll find nothing in the current code looking for this as a special
> case, but it nevertheless sorts such arrays in "close to" O(n) time,
> and despite that there's no natural run in the input longer than 2
> elements.
>
> That said, I'd encourage you to write your code as a new list method
> at first, to make it easiest to run benchmarks.  If that proves
> promising, then you can worry about how to make a single method
> auto-decide which algorithm to use.
>
> Also use the two-array version.  It's easier to understand and to
> code, and stability is crucial now.  The extra memory burden doesn't
> bother me - an array of C pointers consumes little memory compared to
> the memory consumed by the Python objects they point at.
>
> Most of all, have fun! :-)
>


Re: [Python-Dev] Drastically improving list.sort() for lists of strings/ints

2016-09-11 Thread Petr Viktorin

On 09/11/2016 10:48 PM, Terry Reedy wrote:
> [...]
>
> Second, with respect to timsort in particular: timsort is designed to
> exploit structure and run faster than O(n*logn) in special cases.  If a
> list is already sorted, timsort will do one O(n) scan and stop.  Any
> radix sort will take several times longer.  If a list is reverse sorted,
> timsort will do one O(n) scan and do an O(n) reverse.  If a list is the
> concatenation of two sorted lists, timsort will find the two sorted
> sublists and merge them.  If a sorted list has unsorted items appended
> to the end, timsort will sort the appended items and then do a merge.  I
> expect any radix sort to be slower for all these cases.  Tim Peters
> somewhere documented his experiments and results with various special
> but plausible cases of non-randomness.


That write-up is included in Python source:
https://github.com/python/cpython/blob/master/Objects/listsort.txt

A good read if you want to know what sort of thinking, benchmarking, and 
justification should go into a new sorting algorithm.




Re: [Python-Dev] Drastically improving list.sort() for lists of strings/ints

2016-09-11 Thread Tim Peters
[redirected from python-dev, to python-ideas;
 please send followups only to python-ideas]

[Elliot Gorokhovsky ]
> ...
> TL;DR: Should I spend time making list.sort() detect if it is sorting
> lexicographically and, if so, use a much faster algorithm?

It will be fun to find out ;-)

As Mark, and especially Terry, pointed out, a major feature of the
current sort is that it can exploit many kinds of pre-existing order.
As the paper you referenced says, "Realistic sorting problems are
usually far from random."  But, although they did run some tests
against data with significant order, they didn't test against any
algorithms _aiming_ at exploiting uniformity.  Just against their
radix sort variants, and against a quicksort.

That's where it's likely next to impossible to guess in advance
whether radix sort _will_ have a real advantage.  All the kinds of
order the current sort can exploit are far from obvious, because the
mechanisms it employs are low-level & very general.  For example,
consider arrays created by this function, for even `n`:

def bn(n):
    r = [None] * n
    r[0::2] = range(0, n//2)
    r[1::2] = range(n//2, n)
    return r

Then, e.g.,

>>> bn(16)
[0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15]

This effectively takes range(n), cuts it in half, and does a "perfect
shuffle" on the two halves.

You'll find nothing in the current code looking for this as a special
case, but it nevertheless sorts such arrays in "close to" O(n) time,
and despite that there's no natural run in the input longer than 2
elements.

That said, I'd encourage you to write your code as a new list method
at first, to make it easiest to run benchmarks.  If that proves
promising, then you can worry about how to make a single method
auto-decide which algorithm to use.

Also use the two-array version.  It's easier to understand and to
code, and stability is crucial now.  The extra memory burden doesn't
bother me - an array of C pointers consumes little memory compared to
the memory consumed by the Python objects they point at.

Most of all, have fun! :-)


Re: [Python-Dev] Drastically improving list.sort() for lists of strings/ints

2016-09-11 Thread Terry Reedy

On 9/11/2016 3:45 AM, Elliot Gorokhovsky wrote:

> Hello all,
>
> I am interested in making a non-trivial improvement to list.sort(),

This is non-trivial, and harder than you seem to think it is.

> but before I put in the work, I want to test the waters and see if
> this is something the community would accept.


The debate on proposed enhancements is usually whether they are really 
enhancements, all things considered.  For special-case speedups, the 
'all things' include the frequency of the special cases, the ease of 
detecting them, the thoroughness of testing, and the maintainability of
the proposed, and likely more complicated, code.



> Basically, I want to implement radix sort for lists of strings.


Radix sort was invented for fixed-length strings of digits, as in
all-digit id 'numbers', so 10 bins.  ASCII strings need 128 bins,
general byte strings need 256; still manageable.  General unicode
strings require 1,114,112 bins, most of which will be empty at most
character positions.  This is harder to manage.  So are variable-length
strings.


In CPython 3.3+, strings at the C level are not just one thing: they use
1, 2, or 4 bytes per character depending on their contents (PEP 393).
So you could specifically target lists of bytes (and bytearrays) and
lists of strings limited to 1-byte characters.  The same code should
pretty much work for both.
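
A quick illustration of those compact representations from the Python
level (exact sizes are CPython- and version-specific):

import sys

# One string of each internal width: 1 byte/char (latin-1),
# 2 bytes/char (BMP), 4 bytes/char (astral).
for s in ("abcd", "abc\u20ac", "abc\U0001F40D"):
    print(len(s), sys.getsizeof(s))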


> ...
> In-place radix sort is significantly faster for lexicographic sorting
> than Timsort (or in general any comparison-based sort, since radix can
> beat the nlogn barrier).


This unqualified statement is doubly wrong.

First, with respect to sorting in general: 'asymptotically faster' only
means faster for 'large enough' tasks.  Many real-world tasks are small,
and big tasks get broken down into multiple small tasks.
'Asymptotically slower' algorithms may be faster for small tasks.  Tim
Peters investigated and empirically determined that an O(n*n) binary
insort, as he optimized it on real machines, is faster than O(n*logn)
sorting for up to around 64 items.  So timsort uses binary insertion to
sort up to 64 items.  Similarly, you would have to investigate whether
there is a size below which timsort is faster.


Second, with respect to timsort in particular: timsort is designed to 
exploit structure and run faster than O(n*logn) in special cases.  If a 
list is already sorted, timsort will do one O(n) scan and stop.  Any 
radix sort will take several times longer.  If a list is reverse sorted, 
timsort will do one O(n) scan and do an O(n) reverse.  If a list is the 
concatenation of two sorted lists, timsort will find the two sorted 
sublists and merge them.  If a sorted list has unsorted items appended 
to the end, timsort will sort the appended items and then do a merge.  I 
expect any radix sort to be slower for all these cases.  Tim Peters 
somewhere documented his experiments and results with various special 
but plausible cases of non-randomness.


What you might propose is this:  if the initial up or down sequence 
already determined by timsort is less than, say, 64 items, and the 
length of the list is more than some empirically determined value,
and all items are bytes or byte-char strings and some further checks 
determine the same, then switch to rsort.  But you should start by 
putting a module on PyPI.


--
Terry Jan Reedy



Re: [Python-Dev] Drastically improving list.sort() for lists of strings/ints

2016-09-11 Thread Raymond Hettinger

> On Sep 11, 2016, at 11:58 AM, Mark Dickinson  wrote:
> 
>> So I suppose the thing to do is to benchmark stable radix sort against 
>> timsort and see if it's still worth it.
> 
> Agreed; it would definitely be interesting to see benchmarks for the
> two-array stable sort as well as the American Flag unstable sort.
> (Indeed, I think it would be hard to move the proposal forward without
> such benchmarks.)

In the meantime, can I suggest moving this discussion to python-ideas.

There are many practical issues to be addressed:  
* sort stability
* detecting whether you're dealing with a list of strings
* working with unicode rather than inputs limited to a one-byte alphabet
* dealing with the multiple compact forms of unicode strings (i.e. complex 
internal representation)
* avoiding degenerate cases
* cache performance

The referenced article tells us that "troubles with radix sort are in 
implementation, not in conception".


Raymond




Re: [Python-Dev] Drastically improving list.sort() for lists of strings/ints

2016-09-11 Thread Mark Dickinson
On Sun, Sep 11, 2016 at 7:43 PM, Elliot Gorokhovsky
 wrote:
> So I suppose the thing to do is to benchmark stable radix sort against 
> timsort and see if it's still worth it.

Agreed; it would definitely be interesting to see benchmarks for the
two-array stable sort as well as the American Flag unstable sort.
(Indeed, I think it would be hard to move the proposal forward without
such benchmarks.)

Apart from the cases already mentioned by Chris, one of the situations
you'll want to include in the benchmarks is the case of a list that's
already almost sorted (e.g., an already sorted list with a few extra
unsorted elements appended). This is a case that does arise in
practice, and that Timsort performs particularly well on.
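
A sketch of how that input might be generated for a benchmark (the
function name is hypothetical):

import random

def almost_sorted_strings(n, extra=16):
    # An already-sorted list of strings with a few unsorted items
    # appended: timsort sees one long run plus one tiny run and
    # finishes with a single merge, in close to O(n) time.
    lst = ["key%09d" % i for i in range(n)]
    lst.extend("key%09d" % random.randrange(n) for _ in range(extra))
    return lst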

-- 
Mark


Re: [Python-Dev] Drastically improving list.sort() for lists of strings/ints

2016-09-11 Thread Elliot Gorokhovsky
The sort can be made stable, but that requires the allocation of an
equal-sized auxiliary array. To quote from the paper: "Both list-based and
two-array sorts entail Θ(n) space overhead. That overhead shrinks to
Θ(logn) in American flag sort, which, like quicksort, trades off stability
for space efficiency." So there are two options: follow C++ in providing a
stable and unstable sort, or just use stable radix sort at the cost of
allocating a scratch array. I understand why the first approach is
essentially impossible, since it could break code written under the
assumption that list.sort() is stable. But I think that in Python, since
the list just holds pointers to objects instead of objects themselves,
being in-place isn't that important: we're missing the cache all the time
anyway since our objects are stored all over the place in memory. So I
suppose the thing to do is to benchmark stable radix sort against timsort
and see if it's still worth it. Again, I really don't think the auxiliary
array would make that much of a difference. Note that in timsort we also
use an auxiliary array.
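
To make the trade-off concrete, here is a sketch of one stable pass of
the two-array variant; the scratch list `out` is the Theta(n) auxiliary
array under discussion (a toy, not proposed code):

def stable_byte_pass(keys, pos):
    # One counting-sort pass on byte position pos.  Counting plus
    # prefix sums locate each bucket's start; writing keys into the
    # scratch array in input order is what makes the pass stable.
    bucket = lambda s: 0 if pos >= len(s) else s[pos] + 1
    counts = [0] * 257                # slot 0: key shorter than pos+1
    for k in keys:
        counts[bucket(k)] += 1
    starts, total = [], 0
    for c in counts:
        starts.append(total)
        total += c
    out = [None] * len(keys)          # the Theta(n) scratch array
    for k in keys:
        b = bucket(k)
        out[starts[b]] = k
        starts[b] += 1
    return out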

On Sun, Sep 11, 2016, 12:15 PM Mark Dickinson  wrote:

> > I am interested in making a non-trivial improvement to list.sort() [...]
>
> Would your proposed new sorting algorithm be stable? The language
> currently guarantees stability for `list.sort` and `sorted`.
>
> --
> Mark
>


Re: [Python-Dev] Drastically improving list.sort() for lists of strings/ints

2016-09-11 Thread Mark Dickinson
> I am interested in making a non-trivial improvement to list.sort() [...]

Would your proposed new sorting algorithm be stable? The language
currently guarantees stability for `list.sort` and `sorted`.

-- 
Mark


Re: [Python-Dev] Drastically improving list.sort() for lists of strings/ints

2016-09-11 Thread Elliot Gorokhovsky
Thanks for the link. If you look at the conclusion it says "We recommend
American flag sort as an all-round algorithm for sorting strings."

On Sun, Sep 11, 2016, 11:30 AM Raymond Hettinger <
raymond.hettin...@gmail.com> wrote:

>
> > On Sep 11, 2016, at 12:45 AM, Elliot Gorokhovsky 
> wrote:
> >
> > I am interested in making a non-trivial improvement to list.sort(), but
> before I put in the work, I want to test the waters and see if this is
> something the community would accept. Basically, I want to implement radix
> sort for lists of strings. So list.sort() would detect if it is sorting a
> list of strings (which is one of the more common things you sort in python)
> and, if so, use in-place radix sort (see
> https://xlinux.nist.gov/dads/HTML/americanFlagSort.html).
>
> For those who are interested, here is a direct link to the PDF that
> describes the algorithm.
>
>
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.6990&rep=rep1&type=pdf
>
>
> Raymond
>
>


Re: [Python-Dev] Drastically improving list.sort() for lists of strings/ints

2016-09-11 Thread Raymond Hettinger

> On Sep 11, 2016, at 12:45 AM, Elliot Gorokhovsky  
> wrote:
> 
> I am interested in making a non-trivial improvement to list.sort(), but 
> before I put in the work, I want to test the waters and see if this is 
> something the community would accept. Basically, I want to implement radix 
> sort for lists of strings. So list.sort() would detect if it is sorting a 
> list of strings (which is one of the more common things you sort in python) 
> and, if so, use in-place radix sort (see 
> https://xlinux.nist.gov/dads/HTML/americanFlagSort.html).

For those who are interested, here is a direct link to the PDF that describes 
the algorithm.

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.6990&rep=rep1&type=pdf


Raymond



Re: [Python-Dev] Drastically improving list.sort() for lists of strings/ints

2016-09-11 Thread Chris Angelico
On Sun, Sep 11, 2016 at 5:45 PM, Elliot Gorokhovsky
 wrote:
> I am interested in making a non-trivial improvement to list.sort(), but
> before I put in the work, I want to test the waters and see if this is
> something the community would accept. Basically, I want to implement radix
> sort for lists of strings. So list.sort() would detect if it is sorting a
> list of strings (which is one of the more common things you sort in python)
> and, if so, use in-place radix sort (see
> https://xlinux.nist.gov/dads/HTML/americanFlagSort.html). In-place radix
> sort is significantly faster for lexicographic sorting than Timsort (or in
> general any comparison-based sort, since radix can beat the nlogn barrier).
> If you don't believe that last claim, suppose for the sake of the argument
> that it's true (because if I actually implemented this I could supply
> benchmarks to prove it).

I'd like to see these benchmarks, actually. Sounds interesting. How
does it fare on different distributions of data - for instance,
strings consisting exclusively of ASCII digits and punctuation eg
"01:12:35,726 --> 01:12:36,810", or strings consisting primarily of
ASCII but with occasional BMP or astral characters, or strings
primarily of Cyrillic text, etc? What if every single string begins
with a slash character (eg if you're sorting a bunch of path names)?
At what list size does it become visibly faster?

Could this be put onto PyPI and then benchmarked as lst.sort() vs
flagsort(lst) ?

ChrisA


[Python-Dev] Drastically improving list.sort() for lists of strings/ints

2016-09-11 Thread Elliot Gorokhovsky
Hello all,

I am interested in making a non-trivial improvement to list.sort(), but
before I put in the work, I want to test the waters and see if this is
something the community would accept. Basically, I want to implement radix
sort for lists of strings. So list.sort() would detect if it is sorting a
list of strings (which is one of the more common things you sort in python)
and, if so, use in-place radix sort (see
https://xlinux.nist.gov/dads/HTML/americanFlagSort.html). In-place radix
sort is significantly faster
for lexicographic sorting than Timsort (or in general any comparison-based
sort, since radix can beat the nlogn barrier). If you don't believe that
last claim, suppose for the sake of the argument that it's true (because if
I actually implemented this I could supply benchmarks to prove it).

The idea is the following: in list.sort(), if using the default comparison
operator, test the type of the first, middle, and last elements (or
something along those lines). If they are all strings, in practice this
means the list is very likely a list of strings, so it's probably worth the
investment to check and see. So we iterate through and see if they really
are all strings (again, in practice it is very unlikely this test would
fail). Luckily, this is very, very nearly free (i.e. no memory cost) since
we have to iterate through anyway as part of the in-place radix sort (first
step is to count how many elements go in each bucket, you iterate through
to count. So we would just put a test in the loop to break if it finds a
non-string).
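
A rough Python rendering of that heuristic (the real check would live
in the C sort; the function name is made up):

def probably_all_strings(lst):
    # Cheap sample first: first, middle, and last elements.  The full
    # confirmation pass piggybacks on the radix sort's counting loop,
    # bailing out to the general sort on the first non-string.
    if not lst:
        return False
    sample = (lst[0], lst[len(lst) // 2], lst[-1])
    return all(type(s) is str for s in sample)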

Additionally, since often one is sorting objects with string or int fields
instead of strings or ints directly, one could check the type of the field
extracted by the key function or something.

Is this something the community would be interested in? Again, supposing I
benchmark it and show it gives non-trivial improvements on existing
codebases. Like for example I could benchmark the top 10 packages on pypi
and show they run faster using my list.sort().

TL;DR: Should I spend time making list.sort() detect if it is sorting
lexicographically and, if so, use a much faster algorithm?

Thanks for reading,
Elliot