[issue35094] Improved algorithms for random.sample

2018-10-28 Thread Paul Crowley


Paul Crowley  added the comment:

Thank you for a very comprehensive and helpful answer!

Yep, reservoir sampling makes n calls not k calls, and so should only be used 
when k is a large fraction of n; in my patch it's k/n >= 1/2.

Because modern CPRNGs are so fast, I had been assuming that overall runtime, 
rather than calls to the RNG; I'll have to bear that in mind here, though in 
general "use a secure seed to whatever secure RNG is fastest" is the right 
strategy.

I don't think hedging against the quality of the RNG is the right thing to do 
here.

I don't mean to suggest you didn't think about this problem hard! It's just 
that I've been obsessing about this problem for the last few weeks for some 
reason (see my repo) so I thought I might be able to help. Thanks again for you 
reply!

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35094] Improved algorithms for random.sample

2018-10-28 Thread Raymond Hettinger


Raymond Hettinger  added the comment:

One other thought.  We don't guarantee that sample() will always produce the 
same sequences, so that does give us some freedom; however, we have a strong 
aversion to doing so unless there is a compelling advantage.  It will almost 
certainly cause some disruption for users if their samples stop being 
reproducible from a given seed.  There is some value to the implementation 
remaining stable.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35094] Improved algorithms for random.sample

2018-10-28 Thread Raymond Hettinger


Raymond Hettinger  added the comment:

For the common case, where k is small and n is large, reservoir sampling makes 
n calls rather the current k plus a few reselections -- the only cost is a 
temporary k-sized set using superfast int hashing.  This technique works even 
for a very large values of n, such as sample(range(1_000_000_000), k=10_000).

For the case where k is a large percentage of n, the size of the set becomes 
noticeable and reselections would become more common (placing a higher load on 
the underlying RNG and eating its entropy, keep in mind that the underlying RNG 
can be SystemRandom for example).  So, we switch algorithms to a partial 
shuffle, which has no reselections and uses a temporary list of references to 
the sampled objects.   The places the absolute minimum burden on the underlying 
RNG and makes the minimum number of data swaps (neither of those virtues can be 
claimed by reservoir sampling).

The only advantage of the reservoir approach is not requiring auxiliary
storage.  Keep in mind, the temporary list only holds references to the sampled 
data, so tends to be small relative to that data.  It is no more 
disadvantageous than typical applications of list comprehensions which make new 
lists while looping over some other iterable.

In any case, the algorithms prefer to optimize for fewest calls the the random 
number generator rather than aspiring to zero auxiliary storage.  The 
randbelow() calls are not superfast, so we don't want to use many of them.  
Likewise, calls to SystemRandom() eat available entropy,  so we don't want to 
use many of them either.  Also, the Random class is designed to be subclassed 
to allow uses of other RNGs which are likely to have a smaller period that the 
MersenneTwister so we don't want many calls to them either (it eats their 
limited state space just like shuffle() exceeds the capabilities of MT with an 
input size as small as 2081).

Lastly, I have a vaguely held concern that reservoir sampling uses the RNG in a 
way that would magnify any weaknesses in that RNG (by virtues to making more 
calls and by virtue of using selections in the full range from n-k to n), so we 
would get lower quality shuffles.

FWIW, all of these things were considered when shuffle() was designed, it 
wasn't like other methods weren't considered.  The design we have now was 
deemed to be the best fit for most of our users, most of time (we've never 
gotten a complaint about the temporary storage being a problem for any user, 
ever).  I would however expect complaints about an increased number of calls to 
the user's rngs.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35094] Improved algorithms for random.sample

2018-10-28 Thread Paul Crowley


Paul Crowley  added the comment:

I would be very grateful for your help finding those dicussions! I've tried 
this search:

https://www.google.com/search?q=python+%22Reservoir+sampling%22+rhettinger

and found this discussion:

https://mail.python.org/pipermail/python-ideas/2016-April/039708.html

but if I've missed any I'm keen to know.

In my pull request reservoir sampling is only used if 2k>=n, so it makes at 
most twice as many random requests as any other algorithm.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35094] Improved algorithms for random.sample

2018-10-28 Thread Raymond Hettinger


Raymond Hettinger  added the comment:

Sorry, this has been discussed and rejected multiple times.  Reservoir sampling 
makes far too many calls to the underlying random number generator.

--
assignee:  -> rhettinger
resolution:  -> rejected
stage: patch review -> resolved
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35094] Improved algorithms for random.sample

2018-10-28 Thread Pablo Galindo Salgado


Change by Pablo Galindo Salgado :


--
nosy: +mark.dickinson, rhettinger

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35094] Improved algorithms for random.sample

2018-10-28 Thread Paul Crowley


Change by Paul Crowley :


--
keywords: +patch
pull_requests: +9510
stage:  -> patch review

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35094] Improved algorithms for random.sample

2018-10-28 Thread Paul Crowley


New submission from Paul Crowley :

random.sample currently uses either a Fisher-Yates shuffle, or rejection 
sampling, to achieve sampling without replacement. I propose using reservoir 
sampling or "cardchoose"; these are similar performance or sometimes faster, 
and don't need to allocate anything except the list used for the results.

--
components: Library (Lib)
messages: 328728
nosy: ciphergoth
priority: normal
severity: normal
status: open
title: Improved algorithms for random.sample
type: resource usage
versions: Python 3.8

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com