[issue45902] Bytes and bytesarrays can be sorted with a much faster count sort.

2021-11-29 Thread Marc-Andre Lemburg


Marc-Andre Lemburg  added the comment:

On 26.11.2021 10:56, Ruben Vorderman wrote:
> 
> $ python -m timeit -c "from bytes_sort import bytes_sort" "bytes_sort(b'')"
> 50 loops, best of 5: 495 nsec per loop

Shouldn't this read:

$ python -m timeit -s "from bytes_sort import bytes_sort" "bytes_sort(b'')"

(-s instead of -c) ?

--
nosy: +lemburg

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue45902] Bytes and bytesarrays can be sorted with a much faster count sort.

2021-11-28 Thread Tim Peters


Tim Peters  added the comment:

I agree with closing this - I don't know of real use cases either.

Serhiy, essentially all LSD radix sorts are stable, and rely on that for their 
own correctness. MSD radix sorts vary.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue45902] Bytes and bytesarrays can be sorted with a much faster count sort.

2021-11-27 Thread Gregory P. Smith


Gregory P. Smith  added the comment:

General consensus: There isn't a common need for this.

--
nosy: +gregory.p.smith
resolution:  -> rejected
stage:  -> resolved
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue45902] Bytes and bytesarrays can be sorted with a much faster count sort.

2021-11-27 Thread Serhiy Storchaka


Serhiy Storchaka  added the comment:

I concur with Raymond. It is difficult to find any use case for sorting bytes 
objects (I cannot find any).

As for using radix sort in list.sort() in special case of small integer keys, 
it is difficult to implement, because we should preserve the initial order of 
items with the same key. I am not sure that it is possible to implement stable 
radix sort with linear complexity. In any case the overhead will be more 
significant.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue45902] Bytes and bytesarrays can be sorted with a much faster count sort.

2021-11-27 Thread Raymond Hettinger

Raymond Hettinger  added the comment:

I’m -1 on this. Given that use cases are rare, there is no need to burden the 
code base with an optimization of something we can already do in other ways.

Also, I don’t like that the APIs for list.sort(), bytes.sort(), and 
bytearray.sort() wouldn’t match.  IMO that would do more harm than good.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue45902] Bytes and bytesarrays can be sorted with a much faster count sort.

2021-11-26 Thread Brett Cannon


Change by Brett Cannon :


--
nosy:  -brett.cannon

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue45902] Bytes and bytesarrays can be sorted with a much faster count sort.

2021-11-26 Thread STINNER Victor


STINNER Victor  added the comment:

There are other byte-like objects like memoryview or array.array (for some 
array types). I would suggest writing a function rather than a method.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue45902] Bytes and bytesarrays can be sorted with a much faster count sort.

2021-11-26 Thread Eric V. Smith


Eric V. Smith  added the comment:

Given that the normal sort() machinery wouldn't use this code, I don't think 
there's any advantage to adding .sort() methods to bytes and bytesarray. The 
downside to adding these methods is the increased complexity in the stdlib.

I think the better approach is to put bytes_sort on PyPI.

--
nosy: +eric.smith

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue45902] Bytes and bytesarrays can be sorted with a much faster count sort.

2021-11-26 Thread Mark Dickinson

Mark Dickinson  added the comment:

> If there are enough use cases for it.

Well, that's the question. :-) *I* can't think of any good use cases, but maybe 
others can. But if we can't come up with some use cases, then this feels like a 
solution looking for a problem, and that makes it hard to justify both the 
short-term effort and the longer-term maintenance costs of adding the 
complexity.

FWIW, given a need to efficiently compute frequency tables for reasonably long 
byte data, I'd probably reach first for NumPy (and numpy.bincount in 
particular):

Python 3.10.0 (default, Nov 12 2021, 12:32:57) [Clang 12.0.5 
(clang-1205.0.22.11)]
Type 'copyright', 'credits' or 'license' for more information
IPython 7.28.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import collections, numpy as np

In [2]: t = 
b'MDIAIHHPWIRRPFFPFHSPSRLFDQFFGEHLLESDLFSTATSLSPFYLRPPSFLRAPSWIDTGLSEMRLEKDRFSVNLDVKHFSPEELKVKVLGDVIEVHGKHEERQDEHGFISREFHRKYRI
   ...: PADVDPLAITSSLSSDGVLTVNGPRKQVSGPERTIPITREEKPAVAAAPKK';  t *= 100

In [3]: %timeit np.bincount(np.frombuffer(t, np.uint8))
32.7 µs ± 3.15 µs per loop (mean ± std. dev. of 7 runs, 1 loops each)

In [4]: %timeit collections.Counter(t)
702 µs ± 25.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: %timeit sorted(t)
896 µs ± 64.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue45902] Bytes and bytesarrays can be sorted with a much faster count sort.

2021-11-26 Thread Ruben Vorderman


Ruben Vorderman  added the comment:

I used it for the median calculation of FASTQ quality scores 
(https://en.wikipedia.org/wiki/FASTQ_format). But in the end I used the 
frequency table to calculate the median more quickly. So as you say, the 
frequency table turned out to be more useful.

Having said that the usefulness depends on how many times 8-bit data is passed 
into sorted. (bytes,bytearrays, most python strings are 8-bit I believe). I 
raised this issue not because I want a .sort() method on bytes or 
bytearrays, but mostly because I think python's sorted function can be improved 
with regards to 8-bit data. I think it is an interesting thing to consider, 
depending on how often this occurs.

For example:
sorted(b'Let\'s test a proper string now. One that has some value to be 
sorted.')
and 
list(bytes_sort(b'Let\'s test a proper string now. One that has some value to 
be sorted.'))

This returns the same result (a list of integers). But the byte_sort 
implementation is 2.5 times faster. So sorted is not optimally implemented here.

Since sorted is now basically throwing everything into list.sort an alternative 
codepath using bytes.sort can be considered. (If there are enough use cases for 
it).

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue45902] Bytes and bytesarrays can be sorted with a much faster count sort.

2021-11-26 Thread Mark Dickinson


Mark Dickinson  added the comment:

(Changing the issue type: as I understand it, this is a proposal for a new 
feature, namely new methods bytes.sort and bytearray.sort, rather than a 
performance improvement to an existing feature.)

--
type: performance -> enhancement

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue45902] Bytes and bytesarrays can be sorted with a much faster count sort.

2021-11-26 Thread Mark Dickinson


Mark Dickinson  added the comment:

Can you give example use-cases for sorting a bytes or bytearray object? I see 
value in the intermediate object - the frequency table, but the reconstructed 
sorted bytes object just seems like an inefficient representation of the 
frequency table, and I'm not sure how it would be useful.

As the wikipedia page for counting sort says, the real value is in sorting 
items by keys that are small integers, and the special case where the item is 
identical to the key isn't all that useful:

> In some descriptions of counting sort, the input to be sorted is assumed to 
> be more simply a sequence of integers itself, but this simplification does 
> not accommodate many applications of counting sort.

--
nosy: +mark.dickinson

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue45902] Bytes and bytesarrays can be sorted with a much faster count sort.

2021-11-26 Thread Ruben Vorderman


Ruben Vorderman  added the comment:

Sorry for the spam. I see I made a typo in the timeit script. Next time I will 
be more dilligent when making these kinds of reports and triple checking it 
before hand, and sending it once. I used -c instead of -s and now all the setup 
time is also included. This confounds the results.

The proper test commands should be:

python -m timeit -s "from bytes_sort import bytes_sort, bytearray_sort_inplace" 
"bytes_sort(b'My string here')"

python -m timeit "bytes(sorted(b'My string here'))"

Using just sorted, to purely compare the sorting algorithms without the 
overhead of creating a new bytes object.
python -m timeit "sorted(b'My string here')"

Correct comparison results
# String = b''
using bytes(sorted: 188 nsec
using sorted:   108 nsec
using byte_sort:125 nsec  # Some overhead here, setting up the countarray
# String = b'abc'
using bytes(sorted: 252 nsec
using sorted:   145 nsec
using byte_sort:136 nsec  # Overhead compared to sorted already negated 
when sorting 3 items(!)
# String = b'Let\'s test a proper string now. One that has some value to be 
sorted.'
using bytes(sorted: 1830 nsec (reported as 1.83 usec)
using sorted:   1550 nsec (reported as 1.55 usec)
using byte_sort: 220 nsec

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue45902] Bytes and bytesarrays can be sorted with a much faster count sort.

2021-11-26 Thread Ruben Vorderman


Ruben Vorderman  added the comment:

I changed the cython script a bit to use a more naive implementation without 
memset.
Now it is always significantly faster than bytes(sorted(my_bytes)).

$ python -m timeit -c "from bytes_sort import bytes_sort" "bytes_sort(b'')"
50 loops, best of 5: 495 nsec per loop
$ python -m timeit -c "from bytes_sort import bytes_sort" "bytes_sort(b'abc')"
50 loops, best of 5: 519 nsec per loop
$ python -m timeit -c "from bytes_sort import bytes_sort" "bytes_sort(b'Let\'s 
test a proper string now. One that has some value to be sorted.')"
50 loops, best of 5: 594 nsec per loop

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue45902] Bytes and bytesarrays can be sorted with a much faster count sort.

2021-11-26 Thread Alex Waygood


Change by Alex Waygood :


--
nosy: +Mark.Shannon, brett.cannon, rhettinger, serhiy.storchaka, tim.peters, 
vstinner, yselivanov

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue45902] Bytes and bytesarrays can be sorted with a much faster count sort.

2021-11-26 Thread Ruben Vorderman


Ruben Vorderman  added the comment:

Also I didn't know if this should be in Component C-API or Interpreter Core. 
But I guess this will be implemented as C-API calls PyBytes_Sort and 
PyByteArray_SortInplace so I figured C-API is the correct component here.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue45902] Bytes and bytesarrays can be sorted with a much faster count sort.

2021-11-26 Thread Ruben Vorderman


New submission from Ruben Vorderman :

Python now uses the excellent timsort for most (all?) of its sorting. But this 
is not the fastest sort available for one particular use case.

If the number of possible values in the array is limited, it is possible to 
perform a counting sort: https://en.wikipedia.org/wiki/Counting_sort. In a 
counting sort each value maps to an integer corresponding to its relative 
value. The values are then counted by using key = map_to_key(value); 
count_array[key]. Then from this count_array a new array of sorted values can 
be constructed. (See the wikipedia article for a more detailed explanation).

For the python bytes and bytesarray types this is extremely simple to 
implement. All 256 possible values are can be directly used as keys. 

Rough implementation:
- Use buffer protocol to get pointer to bytes/bytesarray internal c-string
- Declare a count_array: Py_ssize_t[256] count_array . (use memset to set it to 
0)
- Iterate over the c-string and add each value to the countarray. 
count_array[buffer[i]] += 1
- Allocate a new bytes(array) object, or in the case of bytesarray the sorting 
can be performed inplace when bytesarray.sort() is used. 
- Iterate over the count_array. Get the number of values for each key and use 
memset to insert the sequences of keys into position.


The most obvious way to implement this algorithm will be as bytesarray.sort() 
where it is sorted inplace and as bytes.sort() which returns a new sorted bytes 
object. This is much much faster than using bytes(sorted(bytes)).

I made a quick cython implementation for speed testing here: 
https://github.com/rhpvorderman/bytes_sort/blob/main/bytes_sort.pyx

Currently to get a sorted bytestring one has to do bytes(sorted(my_bytes)). 

Test results:

# First make sure there is no regression when sorting an empty string
$ python -m timeit -c "from bytes_sort import bytes_sort" "bytes(sorted(b''))"
50 loops, best of 5: 560 nsec per loop
$ python -m timeit -c "from bytes_sort import bytes_sort" "bytes_sort(b'')"
50 loops, best of 5: 565 nsec per loop

# Test result for very small strings
$ python -m timeit -c "from bytes_sort import bytes_sort" 
"bytes(sorted(b'abc'))"
50 loops, best of 5: 628 nsec per loop
$ python -m timeit -c "from bytes_sort import bytes_sort" "bytes_sort(b'abc')"
50 loops, best of 5: 578 nsec per loop

# Even on a very small already sorted string, a counting sort is faster.

# Test with a proper string
$ python -m timeit -c "from bytes_sort import bytes_sort" 
"bytes(sorted(b'Let\'s test a proper string now. One that has some value to be 
sorted.'))"
10 loops, best of 5: 2.32 usec per loop
$ python -m timeit -c "from bytes_sort import bytes_sort" "bytes_sort(b'Let\'s 
test a proper string now. One that has some value to be sorted.')"
50 loops, best of 5: 674 nsec per loop

More than three times faster!

--
components: C API
messages: 407032
nosy: rhpvorderman
priority: normal
severity: normal
status: open
title: Bytes and bytesarrays can be sorted with a much faster count sort.
type: performance
versions: Python 3.11

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com