[issue35094] Improved algorithms for random.sample

2018-10-28 Thread Paul Crowley
Paul Crowley added the comment: Thank you for a very comprehensive and helpful answer! Yep, reservoir sampling makes n calls not k calls, and so should only be used when k is a large fraction of n; in my patch it's k/n >= 1/2. Because modern CPRNGs are so fast, I had been assuming that

[issue35094] Improved algorithms for random.sample

2018-10-28 Thread Raymond Hettinger
Raymond Hettinger added the comment: One other thought. We don't guarantee that sample() will always produce the same sequences, so that does give us some freedom; however, we have a strong aversion to doing so unless there is a compelling advantage. It will almost certainly cause some

[issue35094] Improved algorithms for random.sample

2018-10-28 Thread Raymond Hettinger
Raymond Hettinger added the comment: For the common case, where k is small and n is large, reservoir sampling makes n calls rather the current k plus a few reselections -- the only cost is a temporary k-sized set using superfast int hashing. This technique works even for a very large

[issue35094] Improved algorithms for random.sample

2018-10-28 Thread Paul Crowley
Paul Crowley added the comment: I would be very grateful for your help finding those dicussions! I've tried this search: https://www.google.com/search?q=python+%22Reservoir+sampling%22+rhettinger and found this discussion:

[issue35094] Improved algorithms for random.sample

2018-10-28 Thread Raymond Hettinger
Raymond Hettinger added the comment: Sorry, this has been discussed and rejected multiple times. Reservoir sampling makes far too many calls to the underlying random number generator. -- assignee: -> rhettinger resolution: -> rejected stage: patch review -> resolved status: open

[issue35094] Improved algorithms for random.sample

2018-10-28 Thread Pablo Galindo Salgado
Change by Pablo Galindo Salgado : -- nosy: +mark.dickinson, rhettinger ___ Python tracker ___ ___ Python-bugs-list mailing list

[issue35094] Improved algorithms for random.sample

2018-10-28 Thread Paul Crowley
Change by Paul Crowley : -- keywords: +patch pull_requests: +9510 stage: -> patch review ___ Python tracker ___ ___

[issue35094] Improved algorithms for random.sample

2018-10-28 Thread Paul Crowley
New submission from Paul Crowley : random.sample currently uses either a Fisher-Yates shuffle, or rejection sampling, to achieve sampling without replacement. I propose using reservoir sampling or "cardchoose"; these are similar performance or sometimes faster, and don't need to allocate