[Numpy-discussion] Re: How to sample unique vectors
On Fri, Nov 17, 2023, at 16:52, Robert Kern wrote: > That optimistic optimization makes this the fastest solution. That'd work great, thanks Robert, Aaron, and everyone who shared input. Stéfan___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Re: How to sample unique vectors
Is numba solution an option for you? > On 17 Nov 2023, at 20:49, Stefan van der Walt via NumPy-Discussion > wrote: > > Hi all, > > I am trying to sample k N-dimensional vectors from a uniform distribution > without replacement. > It seems like this should be straightforward, but I can't seem to pin it down. > > Specifically, I am trying to get random indices in an d0 x d1 x d2.. x dN-1 > array. > > I thought about sneaking in a structured dtype into `rng.integers`, but of > course that doesn't work. > > If we had a string sampler, I could sample k unique words (consisting of > digits), and convert them to indices. > > I could over-sample and filter out the non-unique indices. Or iteratively > draw blocks of samples until I've built up my k unique indices. > > The most straightforward solution would be to flatten indices, and to sample > from those. The integers get large quickly, though. The rng.integers > docstring suggests that it can handle object arrays for very large integers: > >> When using broadcasting with uint64 dtypes, the maximum value (2**64) >> cannot be represented as a standard integer type. >> The high array (or low if high is None) must have object dtype, e.g., >> array([2**64]). > > But, that doesn't work: > > In [35]: rng.integers(np.array([2**64], dtype=object)) > ValueError: high is out of bounds for int64 > > Is there an elegant way to handle this problem? > > Best regards, > Stéfan > ___ > NumPy-Discussion mailing list -- numpy-discussion@python.org > To unsubscribe send an email to numpy-discussion-le...@python.org > https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ > Member address: dom.grigo...@gmail.com ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Re: How to sample unique vectors
On Fri, Nov 17, 2023 at 7:11 PM Aaron Meurer wrote: > rng.integers() (or np.random.randint) lets you specify lists for low > and high. So you can just use rng.integers((0,)*len(dims), dims). > > Although I'm not seeing how to use this to generate a bunch of vectors > at once. I would have thought something like size=(10, dims) would let > you generate 10 vectors of length dims but it doesn't seem to work. > `size=(k, len(dims))` def sample_indices(shape, size, rng=None): rng = np.random.default_rng(rng) ashape = np.array(shape) seen = set() while len(seen) < size: dsize = size - len(seen) seen.update(map(tuple, rng.integers(0, ashape, size=(dsize, len(shape) return list(seen) That optimistic optimization makes this the fastest solution. -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Re: How to sample unique vectors
rng.integers() (or np.random.randint) lets you specify lists for low and high. So you can just use rng.integers((0,)*len(dims), dims). Although I'm not seeing how to use this to generate a bunch of vectors at once. I would have thought something like size=(10, dims) would let you generate 10 vectors of length dims but it doesn't seem to work. Aaron Meurer On Fri, Nov 17, 2023 at 3:31 PM Stefan van der Walt via NumPy-Discussion wrote: > > On Fri, Nov 17, 2023, at 11:07, Robert Kern wrote: > > If the arrays you are drawing indices for are real in-memory arrays for > present-day 64-bit computers, this should be adequate. If it's a notional > array that is larger, then you'll need actual arbitrary-sized integer > sampling. The builtin `random.randrange()` will do arbitrary-sized integers > and is quite reasonable for this task. If you want it to use our > BitGenerators underneath for clean PRNG state management, this is quite > doable with a simple subclass of `random.Random`: > https://github.com/numpy/numpy/issues/24458#issuecomment-1685022258 > > > Thanks, Robert. The use case is for randomly populating a hypothetical > N-dimensional sparse array object. > In practice, int64 will probably suffice. > > I can see how to generate arbitrarily large integers using the pattern above, > but still unsure how to sample without replacement. Aaron mentioned that > conflicts are extremely unlikely, so perhaps fine to assume they won't > happen. Checking for conflicts is expensive. > > Attached is a script that implements this solution. > > Stéfan > > ___ > NumPy-Discussion mailing list -- numpy-discussion@python.org > To unsubscribe send an email to numpy-discussion-le...@python.org > https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ > Member address: asmeu...@gmail.com ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Re: How to sample unique vectors
On Fri, Nov 17, 2023 at 5:34 PM Stefan van der Walt wrote: > On Fri, Nov 17, 2023, at 14:28, Stefan van der Walt wrote: > > Attached is a script that implements this solution. > > > And the version with set duplicates checking. > If you're going to do the set-checking yourself, then you don't need the unbounded integer support. This is somewhat slower, probably due to the greedy tuple conversions, but eliminates all of the complexity of subclassing `Random`: def sample_indices(shape, size, rng=None): rng = np.random.default_rng(rng) ashape = np.array(shape) seen = set() while len(seen) < size: idx = tuple(rng.integers(0, ashape)) seen.add(idx) return list(seen) Unfortunately, subclassing from `Random` still doesn't get you the ability to sample without replacement for arbitrary-sized populations using `Random`'s own methods for that. `Random.sample(population, k)` requires `population` to be a `Sequence`, and that restricts you to index-sized integers (`ssize_t`) (you can try to fake one, but `len()` will balk if it gets a too-large integer from `__len__`). But `Random.sample` is only just doing set-checking anyways, so no loss. -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Re: How to sample unique vectors
On Fri, Nov 17, 2023, at 14:28, Stefan van der Walt wrote: > Attached is a script that implements this solution. And the version with set duplicates checking. Stéfan import random import functools import itertools import operator import numpy as np def cumulative_prod(arr): return list(itertools.accumulate(arr, func=operator.mul)) def unravel_index(x, dims): dim_prod = cumulative_prod([1] + list(dims)[:0:-1])[::-1] return [list((ix // dim_prod[i]) % dims[i] for i in range(len(dims))) for ix in x] # From Robert Kern's comment at # https://github.com/numpy/numpy/issues/24458#issuecomment-1685022258 class PythonRandomInterface(random.Random): def __init__(self, rng): self._rng = rng def getrandbits(self, k): """getrandbits(k) -> x. Generates an int with k random bits.""" if k < 0: raise ValueError('number of bits must be non-negative') numbytes = (k + 7) // 8 # bits / 8 and rounded up x = int.from_bytes(self._rng.bytes(numbytes), 'big') return x >> (numbytes * 8 - k)# trim excess bits def indices(self, shape, size=1): D = functools.reduce(lambda x, y: x * y, dims) indices = set() while len(indices) < size: indices.add(pri.randint(0, D)) return unravel_index(indices, shape) rng = np.random.default_rng() pri = PythonRandomInterface(rng) dims = (500, 400, 30, 15, 20, 800, 900, 2000, 800) k = 5 print(pri.indices(dims, size=k)) ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Re: How to sample unique vectors
On Fri, Nov 17, 2023, at 11:07, Robert Kern wrote: > If the arrays you are drawing indices for are real in-memory arrays for > present-day 64-bit computers, this should be adequate. If it's a notional > array that is larger, then you'll need actual arbitrary-sized integer > sampling. The builtin `random.randrange()` will do arbitrary-sized integers > and is quite reasonable for this task. If you want it to use our > BitGenerators underneath for clean PRNG state management, this is quite > doable with a simple subclass of `random.Random`: > https://github.com/numpy/numpy/issues/24458#issuecomment-1685022258 Thanks, Robert. The use case is for randomly populating a hypothetical N-dimensional sparse array object. In practice, int64 will probably suffice. I can see how to generate arbitrarily large integers using the pattern above, but still unsure how to sample without replacement. Aaron mentioned that conflicts are extremely unlikely, so perhaps fine to assume they won't happen. Checking for conflicts is expensive. Attached is a script that implements this solution. Stéfan import random import functools import itertools import operator import numpy as np def cumulative_prod(arr): return list(itertools.accumulate(arr, func=operator.mul)) def unravel_index(x, dims): dim_prod = cumulative_prod([1] + list(dims)[:0:-1])[::-1] return [list((x[j] // dim_prod[i]) % dims[i] for i in range(len(dims))) for j in range(len(x))] # From Robert Kern's comment at # https://github.com/numpy/numpy/issues/24458#issuecomment-1685022258 class PythonRandomInterface(random.Random): def __init__(self, rng): self._rng = rng def getrandbits(self, k): """getrandbits(k) -> x. Generates an int with k random bits.""" if k < 0: raise ValueError('number of bits must be non-negative') numbytes = (k + 7) // 8 # bits / 8 and rounded up x = int.from_bytes(self._rng.bytes(numbytes), 'big') return x >> (numbytes * 8 - k)# trim excess bits def indices(self, shape, size=1): D = functools.reduce(lambda x, y: x * y, dims) indices = [pri.randint(0, D) for i in range(size)] return unravel_index(indices, shape) rng = np.random.default_rng() pri = PythonRandomInterface(rng) dims = (500, 400, 30, 15, 20, 800, 900, 2000, 800) k = 5 print(pri.indices(dims, size=k)) ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Re: How to sample unique vectors
On Fri, Nov 17, 2023 at 4:15 PM Aaron Meurer wrote: > On Fri, Nov 17, 2023 at 12:10 PM Robert Kern > wrote: > > > > If the arrays you are drawing indices for are real in-memory arrays for > present-day 64-bit computers, this should be adequate. If it's a notional > array that is larger, then you'll need actual arbitrary-sized integer > sampling. The builtin `random.randrange()` will do arbitrary-sized integers > and is quite reasonable for this task. If you want it to use our > BitGenerators underneath for clean PRNG state management, this is quite > doable with a simple subclass of `random.Random`: > https://github.com/numpy/numpy/issues/24458#issuecomment-1685022258 > > Wouldn't it be better to just use random.randint to generate the > vectors directly at that point? If the number of possibilities is more > than 2**64, birthday odds of generating the same vector twice are on > the order of 1 in 2**32. And you can always do a unique() rejection > check if you really want to be careful. > Yes, I jumped to correcting the misreading of the docstring rather than solving the root problem. Almost certainly, the most straightforward strategy is to use `rng.integers(0, array_shape)` repeatedly, storing tuples of the resulting indices in a set and rejecting duplicates. You'd have to do the same thing if one used `rng.integers(0, np.prod(array_shape))` or `random.randint(0, np.prod(array_shape))` in the flattened case. `rng.integers()` doesn't sample without replacement in any case. `rng.choice()` does. It also will not support unbounded integers; it uses a signed `int64` to hold the population size, so it is bounded to that. That is _likely_ to be a fine upper bound on the size of the array (if it is a real array in a 64-bit address space). So one could use `rng.choice(np.prod(array_shape), replace=False, size=n_samples)` and unflatten these integers to the index tuples if that suits the array size. It's just doing the same set lookups internally, just somewhat more efficiently. -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Re: How to sample unique vectors
On Fri, Nov 17, 2023 at 12:10 PM Robert Kern wrote: > > On Fri, Nov 17, 2023 at 1:54 PM Stefan van der Walt via NumPy-Discussion > wrote: >> >> Hi all, >> >> I am trying to sample k N-dimensional vectors from a uniform distribution >> without replacement. >> It seems like this should be straightforward, but I can't seem to pin it >> down. >> >> Specifically, I am trying to get random indices in an d0 x d1 x d2.. x dN-1 >> array. >> >> I thought about sneaking in a structured dtype into `rng.integers`, but of >> course that doesn't work. >> >> If we had a string sampler, I could sample k unique words (consisting of >> digits), and convert them to indices. >> >> I could over-sample and filter out the non-unique indices. Or iteratively >> draw blocks of samples until I've built up my k unique indices. >> >> The most straightforward solution would be to flatten indices, and to sample >> from those. The integers get large quickly, though. The rng.integers >> docstring suggests that it can handle object arrays for very large integers: >> >> > When using broadcasting with uint64 dtypes, the maximum value (2**64) >> > cannot be represented as a standard integer type. >> > The high array (or low if high is None) must have object dtype, e.g., >> > array([2**64]). >> >> But, that doesn't work: >> >> In [35]: rng.integers(np.array([2**64], dtype=object)) >> ValueError: high is out of bounds for int64 >> >> Is there an elegant way to handle this problem? > > > The default dtype for the result of `integers()` is the signed `int64`. If > you want to sample from the range `[0, 2**64)`, you need to specify > `dtype=np.uint64`. The text you are reading is saying that if you want to > specify exactly `2**64` as the exclusive upper bound that you won't be able > to do it with a `np.uint64` array/scalar because it's one above the bound for > that dtype, so you'll have to use a plain Python `int` object or > `dtype=object` array in order to represent `2**64`. It is not saying that you > can draw arbitrary-sized integers. > > >>> rng.integers(2**64, dtype=np.uint64) > 11569248114014186612 > > If the arrays you are drawing indices for are real in-memory arrays for > present-day 64-bit computers, this should be adequate. If it's a notional > array that is larger, then you'll need actual arbitrary-sized integer > sampling. The builtin `random.randrange()` will do arbitrary-sized integers > and is quite reasonable for this task. If you want it to use our > BitGenerators underneath for clean PRNG state management, this is quite > doable with a simple subclass of `random.Random`: > https://github.com/numpy/numpy/issues/24458#issuecomment-1685022258 Wouldn't it be better to just use random.randint to generate the vectors directly at that point? If the number of possibilities is more than 2**64, birthday odds of generating the same vector twice are on the order of 1 in 2**32. And you can always do a unique() rejection check if you really want to be careful. Aaron Meurer > > -- > Robert Kern > ___ > NumPy-Discussion mailing list -- numpy-discussion@python.org > To unsubscribe send an email to numpy-discussion-le...@python.org > https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ > Member address: asmeu...@gmail.com ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Re: How to sample unique vectors
On Fri, Nov 17, 2023 at 1:54 PM Stefan van der Walt via NumPy-Discussion < numpy-discussion@python.org> wrote: > Hi all, > > I am trying to sample k N-dimensional vectors from a uniform distribution > without replacement. > It seems like this should be straightforward, but I can't seem to pin it > down. > > Specifically, I am trying to get random indices in an d0 x d1 x d2.. x > dN-1 array. > > I thought about sneaking in a structured dtype into `rng.integers`, but of > course that doesn't work. > > If we had a string sampler, I could sample k unique words (consisting of > digits), and convert them to indices. > > I could over-sample and filter out the non-unique indices. Or iteratively > draw blocks of samples until I've built up my k unique indices. > > The most straightforward solution would be to flatten indices, and to > sample from those. The integers get large quickly, though. The rng.integers > docstring suggests that it can handle object arrays for very large integers: > > > When using broadcasting with uint64 dtypes, the maximum value (2**64) > > cannot be represented as a standard integer type. > > The high array (or low if high is None) must have object dtype, e.g., > array([2**64]). > > But, that doesn't work: > > In [35]: rng.integers(np.array([2**64], dtype=object)) > ValueError: high is out of bounds for int64 > > Is there an elegant way to handle this problem? > The default dtype for the result of `integers()` is the signed `int64`. If you want to sample from the range `[0, 2**64)`, you need to specify `dtype=np.uint64`. The text you are reading is saying that if you want to specify exactly `2**64` as the exclusive upper bound that you won't be able to do it with a `np.uint64` array/scalar because it's one above the bound for that dtype, so you'll have to use a plain Python `int` object or `dtype=object` array in order to represent `2**64`. It is not saying that you can draw arbitrary-sized integers. >>> rng.integers(2**64, dtype=np.uint64) 11569248114014186612 If the arrays you are drawing indices for are real in-memory arrays for present-day 64-bit computers, this should be adequate. If it's a notional array that is larger, then you'll need actual arbitrary-sized integer sampling. The builtin `random.randrange()` will do arbitrary-sized integers and is quite reasonable for this task. If you want it to use our BitGenerators underneath for clean PRNG state management, this is quite doable with a simple subclass of `random.Random`: https://github.com/numpy/numpy/issues/24458#issuecomment-1685022258 -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com