[issue21592] Make statistics.median run in linear time

2018-05-07 Thread Dong-hee Na
Dong-hee Na added the comment: For CPython, I agree with the opinion that finding a median value with Tim sort(C version) is faster than quickselect algorithm with pure python. Or we need to implement quickselect by C API. The advantage of implementing this method by

[issue21592] Make statistics.median run in linear time

2018-05-06 Thread Steven D'Aprano
Steven D'Aprano added the comment: How does the performance change with this patch? Quick-select is a nice idea in theory, but unless it is written in C, it is unlikely to beat sorting the list unless you have HUGE data sets. Its been nearly four years since I

[issue21592] Make statistics.median run in linear time

2018-05-06 Thread Dong-hee Na
Change by Dong-hee Na : -- pull_requests: +6408 ___ Python tracker ___ ___

[issue21592] Make statistics.median run in linear time

2018-01-13 Thread Raymond Hettinger
Raymond Hettinger added the comment: Everyone here should heed Tim's comments. The statistics module already has a history of suboptimal decisions made in the service of theoretical perfection (i.e. mean(seq) is over a 100x slower than fsum(seq)/len(seq)). While

[issue21592] Make statistics.median run in linear time

2018-01-13 Thread Roundup Robot
Change by Roundup Robot : -- keywords: +patch pull_requests: +5029 stage: needs patch -> patch review ___ Python tracker

[issue21592] Make statistics.median run in linear time

2017-08-21 Thread Samuel Naughton-Baldwin
Changes by Samuel Naughton-Baldwin : -- pull_requests: +3207 ___ Python tracker ___

[issue21592] Make statistics.median run in linear time

2016-01-03 Thread Julian Taylor
Julian Taylor added the comment: the median of median of 5 is quite significantly slower than a quickselect. numpy implements an introselect which uses quickselect but falls back to median of median of 5 if not enough progress is done. In the numpy implementation for 10 element median

[issue21592] Make statistics.median run in linear time

2016-01-02 Thread Upendra Kumar
Upendra Kumar added the comment: This is my first attempt to contribute. I have used the Median of Medians with n/5 groups of 5 elements each. It has linear time complexity. But, still I am not sure about it, because constants may be high. Therefore, if anyone can please benchmark it with

[issue21592] Make statistics.median run in linear time

2016-01-02 Thread Upendra Kumar
Upendra Kumar added the comment: This method can also be further used to implement multiselect functionality with slight changes in the nth-order function. -- ___ Python tracker

[issue21592] Make statistics.median run in linear time

2014-06-07 Thread Julian Taylor
Julian Taylor added the comment: for median alone a multiselect is probably overkill (thats why I mentioned the minimum trick) but a selection algorithm is useful on its own for all of python and then a multiselect should be considered. Of course that means it would need to be implemented in

[issue21592] Make statistics.median run in linear time

2014-06-07 Thread Steven D'Aprano
Steven D'Aprano added the comment: On Sat, Jun 07, 2014 at 01:02:52PM +, Julian Taylor wrote: but a selection algorithm is useful on its own for all of python and then a multiselect should be considered. I like the idea of a select and/or multiselect for 3.5. As a new feature, it cannot

[issue21592] Make statistics.median run in linear time

2014-06-02 Thread Thomas Dybdahl Ahle
Thomas Dybdahl Ahle added the comment: I don't know if it's worth the overhead to implement a multiselect, given we only expose a median function. I've rewritten select2 to be intro, just falling back on sorting. This doesn't appear to degrade the performance. I also added np.median to the

[issue21592] Make statistics.median run in linear time

2014-06-01 Thread Stefan Behnel
Stefan Behnel added the comment: I tried the same with a Cython compiled version of select.py in the latest CPython 3.5 build. It pretty clearly shows that select2 is pretty much always faster than sorting, by a factor of 2-5x or so. I'll also attach the annotated source file that Cython

[issue21592] Make statistics.median run in linear time

2014-06-01 Thread Stefan Behnel
Changes by Stefan Behnel sco...@users.sourceforge.net: Added file: http://bugs.python.org/file35427/select.html ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue21592 ___

[issue21592] Make statistics.median run in linear time

2014-06-01 Thread Stefan Behnel
Stefan Behnel added the comment: Here's also the pathological average of three calls case. As Steven suggests, it shows that select() suffers quite heavily (algorithmically), but select2() also suffers enough to jump up to about 2/3 of the runtime of sorting (so it's still 1/3 faster even

[issue21592] Make statistics.median run in linear time

2014-06-01 Thread Stefan Behnel
Stefan Behnel added the comment: Updating the type declaration file to remove the dependency on the list builtin and allow arbitrary containers. The test code has this dependency (calls a.sort()), but the current statistics module in the stdlib does not (calls sorted()). Timings do not

[issue21592] Make statistics.median run in linear time

2014-06-01 Thread Stefan Behnel
Changes by Stefan Behnel sco...@users.sourceforge.net: Added file: http://bugs.python.org/file35430/select.html ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue21592 ___

[issue21592] Make statistics.median run in linear time

2014-06-01 Thread Julian Taylor
Julian Taylor added the comment: in the case of the median you can archive similar performance to a multiselect by simply calling min([len(data) // 2 + 1]) for the second order statistic which you need for the averaging of even number of elements. maybe an interesting datapoint would be to

[issue21592] Make statistics.median run in linear time

2014-05-31 Thread Thomas Dybdahl Ahle
Thomas Dybdahl Ahle added the comment: I think minimize expected-case time is a good goal. If we wanted minimize worst-case time we would have to use k-means rather than quickselect. My trials on random data, where sort arguably has a disadvantage, suggests sorting is about twice as fast for

[issue21592] Make statistics.median run in linear time

2014-05-31 Thread Ezio Melotti
Ezio Melotti added the comment: I have written some proof of concept code here [1], I would appreciate you commenting on it, before I turn it into a patch, as I haven't contributed code to Python before. I would encourage you to consult the devguide, prepare a patch, and upload it here so

[issue21592] Make statistics.median run in linear time

2014-05-31 Thread Steven D'Aprano
Steven D'Aprano added the comment: I've run some performance tests on six variations of the O(N) select algorithm, based on Tim Peters' and Thomas Ahle's code, comparing them to the naive O(N log N) sort first algorithm, and sorting is consistently faster up to the limit I tested. About the

[issue21592] Make statistics.median run in linear time

2014-05-31 Thread Alex Gaynor
Alex Gaynor added the comment: I ran the script (modified very slightly for python 2 support) under PyPy 2.3: $ pypy select.py == Single call mode == Nsort select7 select23 select47 select97 select select2

[issue21592] Make statistics.median run in linear time

2014-05-30 Thread Thomas Dybdahl Ahle
Thomas Dybdahl Ahle added the comment: If you have a good, realistic test set, we can try testing quick-select vs sorting. If it's still not good, I can also reimplement it in C. -- ___ Python tracker rep...@bugs.python.org

[issue21592] Make statistics.median run in linear time

2014-05-30 Thread Terry J. Reedy
Terry J. Reedy added the comment: To accept contributions, we need a signed (possibly electronically) contribution form. https://www.python.org/psf/contrib/contrib-form/ -- nosy: +terry.reedy ___ Python tracker rep...@bugs.python.org

[issue21592] Make statistics.median run in linear time

2014-05-30 Thread Tim Peters
Tim Peters added the comment: I suggest this needs a measurable goal, like minimize expected-case time or minimize worst-case time. Use a state of the art implementation isn't a measurable goal - it's an abstract wish with no measurable merit on its own ;-) Note that I wrote the median-of-k

[issue21592] Make statistics.median run in linear time

2014-05-28 Thread Thomas Dybdahl Ahle
New submission from Thomas Dybdahl Ahle: The statistics module currently contains the following comment: FIXME: investigate ways to calculate medians without sorting? Quickselect? This is important, because users expect standard library functions to use state of the art implementations, and

[issue21592] Make statistics.median run in linear time

2014-05-28 Thread Zachary Ware
Changes by Zachary Ware zachary.w...@gmail.com: -- nosy: +steven.daprano stage: - needs patch ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue21592 ___

[issue21592] Make statistics.median run in linear time

2014-05-28 Thread Thomas Dybdahl Ahle
Thomas Dybdahl Ahle added the comment: I have written some proof of concept code here [1], I would appreciate you commenting on it, before I turn it into a patch, as I haven't contributed code to Python before. I have tried to write it as efficiently as possible, but it is of course possible

[issue21592] Make statistics.median run in linear time

2014-05-28 Thread Steven D'Aprano
Steven D'Aprano added the comment: On Wed, May 28, 2014 at 02:43:29PM +, Thomas Dybdahl Ahle wrote: I have written some proof of concept code here [1], I would appreciate you commenting on it, before I turn it into a patch, as I haven't contributed code to Python before. Thanks

[issue21592] Make statistics.median run in linear time

2014-05-28 Thread Vajrasky Kok
Vajrasky Kok added the comment: Yeah, I remember I tried to improve the performance of the median in the past using median-of-k algorithm. But too bad I could not beat sorted C function efficiency. Maybe you have a better luck! -- nosy: +vajrasky