[issue47114] random.choice and random.choices have different distributions

2022-03-24 Thread Raymond Hettinger


Change by Raymond Hettinger :


--
assignee:  -> rhettinger

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue47114] random.choice and random.choices have different distributions

2022-03-24 Thread Raymond Hettinger


Raymond Hettinger  added the comment:

This was an intentional decision.  It is documented and tested.  The rationale 
is that is keeps choices() internally consistent so that choices with equal 
weights produces the same result as if no weights are specified.

For anyone who wants an alternative, it is trivial just call choice() in a list 
comprehension:

   [choice(seq) for i in range(k)]

Another thought is that choices() is more performant than calling choice() in a 
loop.  This matters because one of the principal use cases for choices() in 
bootstrapping where choices() can be called many times.

Lastly, we are strongly averse to changing the algorithms after they are 
published because it impacts reproducible research where people need to be able 
to recreate their random selections for a given seed.

--
nosy: +rhettinger
resolution:  -> not a bug
stage:  -> resolved
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue47114] random.choice and random.choices have different distributions

2022-03-24 Thread Tim Peters


Tim Peters  added the comment:

Definitely a duplicate, and I doubt Mark or Raymond will change their mind.

One observation: while floats are not uniformly dense in [0, 1), random() 
results are uniformly spaced. Each is of the form I / 2**53 for an integer I in 
range(2**53).

--
nosy: +tim.peters

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue47114] random.choice and random.choices have different distributions

2022-03-24 Thread Dennis Sweeney


Dennis Sweeney  added the comment:

Possible duplicate of https://bugs.python.org/issue44080

--
nosy: +Dennis Sweeney

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue47114] random.choice and random.choices have different distributions

2022-03-24 Thread Mark Bell


Mark Bell  added the comment:

To give two more consequences of `random.choices` using floating point 
arithmetic:

1) When doing `random.choices([A, B, C], weights=[2**55, 1, 1])` the cumulative 
weight to bisect for is selected using `floor(random() * (2**55 + 1 + 1 + 
0.0))`. Since this is always even, it is impossible for `B` to be chosen.

2) When doing `random.choices([A, B], weights=[2**54, 1])` although the 
`cum_weights` is [18014398509481984, 18014398509481985] the `total` used by 
`random.choices` is `cum_weights[-1] + 0.0`. Due to floating point rounding 
this is 18014398509481984.0 and so is actually smaller than `cum_weights[-1]`. 
Therefore it is impossible for `B` to be chosen.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue47114] random.choice and random.choices have different distributions

2022-03-24 Thread Mark Bell


New submission from Mark Bell :

The docstring for `random.choices` indicates that
```
import random
random.choices(population, k=1)
```
should produce a list containing one item, where each item of `population` has 
equal likelihood of being selected. However `random.choices` draws elements for 
its sample by doing `population[floor(random() * len(population)]` and so 
relies on floating point numbers. Therefore not each item is equally likely to 
be chosen since floats are not uniformly dense in [0, 1] and this problem 
becomes worse as `population` becomes larger. 

Note that this issue does not apply to `random.choice(population)` since this 
uses `random.randint` to choose a random element of `population` and performs 
exact integer arithmetic. Compare 
https://github.com/python/cpython/blob/main/Lib/random.py#L371 and 
https://github.com/python/cpython/blob/main/Lib/random.py#L490

Could `random.choices` fall back to doing `return [choice(population) for _ in 
_repeat(None, k)]` if no weights are given? Similarly, is it also plausible to 
only rely on `random.randint` and integer arithmetic if all of the (cumulative) 
weights given to `random.choices` are integers?

--
components: Library (Lib)
messages: 415981
nosy: Mark.Bell
priority: normal
severity: normal
status: open
title: random.choice and random.choices have different distributions
versions: Python 3.11

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com