[Numpy-discussion] Re: How to sample unique vectors

2023-11-17 Thread Stefan van der Walt via NumPy-Discussion
On Fri, Nov 17, 2023, at 16:52, Robert Kern wrote:
> That optimistic optimization makes this the fastest solution.

That'd work great, thanks Robert, Aaron, and everyone who shared input. 

Stéfan___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: How to sample unique vectors

2023-11-17 Thread Dom Grigonis
Is numba solution an option for you?

> On 17 Nov 2023, at 20:49, Stefan van der Walt via NumPy-Discussion 
>  wrote:
> 
> Hi all,
> 
> I am trying to sample k N-dimensional vectors from a uniform distribution 
> without replacement.
> It seems like this should be straightforward, but I can't seem to pin it down.
> 
> Specifically, I am trying to get random indices in an d0 x d1 x d2.. x dN-1 
> array.
> 
> I thought about sneaking in a structured dtype into `rng.integers`, but of 
> course that doesn't work.
> 
> If we had a string sampler, I could sample k unique words (consisting of 
> digits), and convert them to indices.
> 
> I could over-sample and filter out the non-unique indices. Or iteratively 
> draw blocks of samples until I've built up my k unique indices.
> 
> The most straightforward solution would be to flatten indices, and to sample 
> from those. The integers get large quickly, though. The rng.integers 
> docstring suggests that it can handle object arrays for very large integers:
> 
>> When using broadcasting with uint64 dtypes, the maximum value (2**64)
>> cannot be represented as a standard integer type.
>> The high array (or low if high is None) must have object dtype, e.g., 
>> array([2**64]).
> 
> But, that doesn't work:
> 
> In [35]: rng.integers(np.array([2**64], dtype=object))
> ValueError: high is out of bounds for int64
> 
> Is there an elegant way to handle this problem?
> 
> Best regards,
> Stéfan
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: dom.grigo...@gmail.com

___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: How to sample unique vectors

2023-11-17 Thread Robert Kern
On Fri, Nov 17, 2023 at 7:11 PM Aaron Meurer  wrote:

> rng.integers() (or np.random.randint) lets you specify lists for low
> and high. So you can just use rng.integers((0,)*len(dims), dims).
>
> Although I'm not seeing how to use this to generate a bunch of vectors
> at once. I would have thought something like size=(10, dims) would let
> you generate 10 vectors of length dims but it doesn't seem to work.
>

`size=(k, len(dims))`

def sample_indices(shape, size, rng=None):
rng = np.random.default_rng(rng)
ashape = np.array(shape)
seen = set()
while len(seen) < size:
dsize = size - len(seen)
seen.update(map(tuple, rng.integers(0, ashape, size=(dsize,
len(shape)
return list(seen)

That optimistic optimization makes this the fastest solution.

-- 
Robert Kern
___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: How to sample unique vectors

2023-11-17 Thread Aaron Meurer
rng.integers() (or np.random.randint) lets you specify lists for low
and high. So you can just use rng.integers((0,)*len(dims), dims).

Although I'm not seeing how to use this to generate a bunch of vectors
at once. I would have thought something like size=(10, dims) would let
you generate 10 vectors of length dims but it doesn't seem to work.

Aaron Meurer

On Fri, Nov 17, 2023 at 3:31 PM Stefan van der Walt via
NumPy-Discussion  wrote:
>
> On Fri, Nov 17, 2023, at 11:07, Robert Kern wrote:
>
> If the arrays you are drawing indices for are real in-memory arrays for 
> present-day 64-bit computers, this should be adequate. If it's a notional 
> array that is larger, then you'll need actual arbitrary-sized integer 
> sampling. The builtin `random.randrange()` will do arbitrary-sized integers 
> and is quite reasonable for this task. If you want it to use our 
> BitGenerators underneath for clean PRNG state management, this is quite 
> doable with a simple subclass of `random.Random`: 
> https://github.com/numpy/numpy/issues/24458#issuecomment-1685022258
>
>
> Thanks, Robert. The use case is for randomly populating a hypothetical 
> N-dimensional sparse array object.
> In practice, int64 will probably suffice.
>
> I can see how to generate arbitrarily large integers using the pattern above, 
> but still unsure how to sample without replacement. Aaron mentioned that 
> conflicts are extremely unlikely, so perhaps fine to assume they won't 
> happen. Checking for conflicts is expensive.
>
> Attached is a script that implements this solution.
>
> Stéfan
>
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: asmeu...@gmail.com
___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: How to sample unique vectors

2023-11-17 Thread Robert Kern
On Fri, Nov 17, 2023 at 5:34 PM Stefan van der Walt 
wrote:

> On Fri, Nov 17, 2023, at 14:28, Stefan van der Walt wrote:
>
> Attached is a script that implements this solution.
>
>
> And the version with set duplicates checking.
>

If you're going to do the set-checking yourself, then you don't need the
unbounded integer support. This is somewhat slower, probably due to the
greedy tuple conversions, but eliminates all of the complexity of
subclassing `Random`:

def sample_indices(shape, size, rng=None):
rng = np.random.default_rng(rng)
ashape = np.array(shape)
seen = set()
while len(seen) < size:
idx = tuple(rng.integers(0, ashape))
seen.add(idx)
return list(seen)

Unfortunately, subclassing from `Random` still doesn't get you the ability
to sample without replacement for arbitrary-sized populations using
`Random`'s own methods for that. `Random.sample(population, k)` requires
`population` to be a `Sequence`, and that restricts you to index-sized
integers (`ssize_t`) (you can try to fake one, but `len()` will balk if it
gets a too-large integer from `__len__`). But `Random.sample` is only just
doing set-checking anyways, so no loss.

-- 
Robert Kern
___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: How to sample unique vectors

2023-11-17 Thread Stefan van der Walt via NumPy-Discussion
On Fri, Nov 17, 2023, at 14:28, Stefan van der Walt wrote:
> Attached is a script that implements this solution.

And the version with set duplicates checking.

Stéfan
import random
import functools
import itertools
import operator

import numpy as np


def cumulative_prod(arr):
return list(itertools.accumulate(arr, func=operator.mul))


def unravel_index(x, dims):
dim_prod = cumulative_prod([1] + list(dims)[:0:-1])[::-1]
return [list((ix // dim_prod[i]) % dims[i] for i in range(len(dims))) for ix in x]


# From Robert Kern's comment at
# https://github.com/numpy/numpy/issues/24458#issuecomment-1685022258
class PythonRandomInterface(random.Random):
def __init__(self, rng):
self._rng = rng

def getrandbits(self, k):
"""getrandbits(k) -> x.  Generates an int with k random bits."""
if k < 0:
raise ValueError('number of bits must be non-negative')
numbytes = (k + 7) // 8   # bits / 8 and rounded up
x = int.from_bytes(self._rng.bytes(numbytes), 'big')
return x >> (numbytes * 8 - k)# trim excess bits

def indices(self, shape, size=1):
D = functools.reduce(lambda x, y: x * y, dims)
indices = set()
while len(indices) < size:
indices.add(pri.randint(0, D))
return unravel_index(indices, shape)


rng = np.random.default_rng()
pri = PythonRandomInterface(rng)

dims = (500, 400, 30, 15, 20, 800, 900, 2000, 800)
k = 5

print(pri.indices(dims, size=k))
___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: How to sample unique vectors

2023-11-17 Thread Stefan van der Walt via NumPy-Discussion
On Fri, Nov 17, 2023, at 11:07, Robert Kern wrote:
> If the arrays you are drawing indices for are real in-memory arrays for 
> present-day 64-bit computers, this should be adequate. If it's a notional 
> array that is larger, then you'll need actual arbitrary-sized integer 
> sampling. The builtin `random.randrange()` will do arbitrary-sized integers 
> and is quite reasonable for this task. If you want it to use our 
> BitGenerators underneath for clean PRNG state management, this is quite 
> doable with a simple subclass of `random.Random`: 
> https://github.com/numpy/numpy/issues/24458#issuecomment-1685022258

Thanks, Robert. The use case is for randomly populating a hypothetical 
N-dimensional sparse array object.
In practice, int64 will probably suffice. 

I can see how to generate arbitrarily large integers using the pattern above, 
but still unsure how to sample without replacement. Aaron mentioned that 
conflicts are extremely unlikely, so perhaps fine to assume they won't happen. 
Checking for conflicts is expensive.

Attached is a script that implements this solution.

Stéfan
import random
import functools
import itertools
import operator

import numpy as np


def cumulative_prod(arr):
return list(itertools.accumulate(arr, func=operator.mul))


def unravel_index(x, dims):
dim_prod = cumulative_prod([1] + list(dims)[:0:-1])[::-1]
return [list((x[j] // dim_prod[i]) % dims[i] for i in range(len(dims))) for j in range(len(x))]


# From Robert Kern's comment at
# https://github.com/numpy/numpy/issues/24458#issuecomment-1685022258
class PythonRandomInterface(random.Random):
def __init__(self, rng):
self._rng = rng

def getrandbits(self, k):
"""getrandbits(k) -> x.  Generates an int with k random bits."""
if k < 0:
raise ValueError('number of bits must be non-negative')
numbytes = (k + 7) // 8   # bits / 8 and rounded up
x = int.from_bytes(self._rng.bytes(numbytes), 'big')
return x >> (numbytes * 8 - k)# trim excess bits

def indices(self, shape, size=1):
D = functools.reduce(lambda x, y: x * y, dims)
indices = [pri.randint(0, D) for i in range(size)]
return unravel_index(indices, shape)


rng = np.random.default_rng()
pri = PythonRandomInterface(rng)

dims = (500, 400, 30, 15, 20, 800, 900, 2000, 800)
k = 5

print(pri.indices(dims, size=k))
___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: How to sample unique vectors

2023-11-17 Thread Robert Kern
On Fri, Nov 17, 2023 at 4:15 PM Aaron Meurer  wrote:

> On Fri, Nov 17, 2023 at 12:10 PM Robert Kern 
> wrote:
> >
> > If the arrays you are drawing indices for are real in-memory arrays for
> present-day 64-bit computers, this should be adequate. If it's a notional
> array that is larger, then you'll need actual arbitrary-sized integer
> sampling. The builtin `random.randrange()` will do arbitrary-sized integers
> and is quite reasonable for this task. If you want it to use our
> BitGenerators underneath for clean PRNG state management, this is quite
> doable with a simple subclass of `random.Random`:
> https://github.com/numpy/numpy/issues/24458#issuecomment-1685022258
>
> Wouldn't it be better to just use random.randint to generate the
> vectors directly at that point? If the number of possibilities is more
> than 2**64, birthday odds of generating the same vector twice are on
> the order of 1 in 2**32. And you can always do a unique() rejection
> check if you really want to be careful.
>

Yes, I jumped to correcting the misreading of the docstring rather than
solving the root problem. Almost certainly, the most straightforward
strategy is to use `rng.integers(0, array_shape)` repeatedly, storing
tuples of the resulting indices in a set and rejecting duplicates. You'd
have to do the same thing if one used `rng.integers(0,
np.prod(array_shape))` or `random.randint(0, np.prod(array_shape))` in the
flattened case.

`rng.integers()` doesn't sample without replacement in any case.
`rng.choice()` does. It also will not support unbounded integers; it uses a
signed `int64` to hold the population size, so it is bounded to that. That
is _likely_ to be a fine upper bound on the size of the array (if it is a
real array in a 64-bit address space). So one could use
`rng.choice(np.prod(array_shape), replace=False, size=n_samples)` and
unflatten these integers to the index tuples if that suits the array size.
It's just doing the same set lookups internally, just somewhat more
efficiently.

-- 
Robert Kern
___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: How to sample unique vectors

2023-11-17 Thread Aaron Meurer
On Fri, Nov 17, 2023 at 12:10 PM Robert Kern  wrote:
>
> On Fri, Nov 17, 2023 at 1:54 PM Stefan van der Walt via NumPy-Discussion 
>  wrote:
>>
>> Hi all,
>>
>> I am trying to sample k N-dimensional vectors from a uniform distribution 
>> without replacement.
>> It seems like this should be straightforward, but I can't seem to pin it 
>> down.
>>
>> Specifically, I am trying to get random indices in an d0 x d1 x d2.. x dN-1 
>> array.
>>
>> I thought about sneaking in a structured dtype into `rng.integers`, but of 
>> course that doesn't work.
>>
>> If we had a string sampler, I could sample k unique words (consisting of 
>> digits), and convert them to indices.
>>
>> I could over-sample and filter out the non-unique indices. Or iteratively 
>> draw blocks of samples until I've built up my k unique indices.
>>
>> The most straightforward solution would be to flatten indices, and to sample 
>> from those. The integers get large quickly, though. The rng.integers 
>> docstring suggests that it can handle object arrays for very large integers:
>>
>> > When using broadcasting with uint64 dtypes, the maximum value (2**64)
>> > cannot be represented as a standard integer type.
>> > The high array (or low if high is None) must have object dtype, e.g., 
>> > array([2**64]).
>>
>> But, that doesn't work:
>>
>> In [35]: rng.integers(np.array([2**64], dtype=object))
>> ValueError: high is out of bounds for int64
>>
>> Is there an elegant way to handle this problem?
>
>
> The default dtype for the result of `integers()` is the signed `int64`. If 
> you want to sample from the range `[0, 2**64)`, you need to specify 
> `dtype=np.uint64`. The text you are reading is saying that if you want to 
> specify exactly `2**64` as the exclusive upper bound that you won't be able 
> to do it with a `np.uint64` array/scalar because it's one above the bound for 
> that dtype, so you'll have to use a plain Python `int` object or 
> `dtype=object` array in order to represent `2**64`. It is not saying that you 
> can draw arbitrary-sized integers.
>
> >>> rng.integers(2**64, dtype=np.uint64)
> 11569248114014186612
>
> If the arrays you are drawing indices for are real in-memory arrays for 
> present-day 64-bit computers, this should be adequate. If it's a notional 
> array that is larger, then you'll need actual arbitrary-sized integer 
> sampling. The builtin `random.randrange()` will do arbitrary-sized integers 
> and is quite reasonable for this task. If you want it to use our 
> BitGenerators underneath for clean PRNG state management, this is quite 
> doable with a simple subclass of `random.Random`: 
> https://github.com/numpy/numpy/issues/24458#issuecomment-1685022258

Wouldn't it be better to just use random.randint to generate the
vectors directly at that point? If the number of possibilities is more
than 2**64, birthday odds of generating the same vector twice are on
the order of 1 in 2**32. And you can always do a unique() rejection
check if you really want to be careful.

Aaron Meurer

>
> --
> Robert Kern
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: asmeu...@gmail.com
___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: How to sample unique vectors

2023-11-17 Thread Robert Kern
On Fri, Nov 17, 2023 at 1:54 PM Stefan van der Walt via NumPy-Discussion <
numpy-discussion@python.org> wrote:

> Hi all,
>
> I am trying to sample k N-dimensional vectors from a uniform distribution
> without replacement.
> It seems like this should be straightforward, but I can't seem to pin it
> down.
>
> Specifically, I am trying to get random indices in an d0 x d1 x d2.. x
> dN-1 array.
>
> I thought about sneaking in a structured dtype into `rng.integers`, but of
> course that doesn't work.
>
> If we had a string sampler, I could sample k unique words (consisting of
> digits), and convert them to indices.
>
> I could over-sample and filter out the non-unique indices. Or iteratively
> draw blocks of samples until I've built up my k unique indices.
>
> The most straightforward solution would be to flatten indices, and to
> sample from those. The integers get large quickly, though. The rng.integers
> docstring suggests that it can handle object arrays for very large integers:
>
> > When using broadcasting with uint64 dtypes, the maximum value (2**64)
> > cannot be represented as a standard integer type.
> > The high array (or low if high is None) must have object dtype, e.g.,
> array([2**64]).
>
> But, that doesn't work:
>
> In [35]: rng.integers(np.array([2**64], dtype=object))
> ValueError: high is out of bounds for int64
>
> Is there an elegant way to handle this problem?
>

The default dtype for the result of `integers()` is the signed `int64`. If
you want to sample from the range `[0, 2**64)`, you need to specify
`dtype=np.uint64`. The text you are reading is saying that if you want to
specify exactly `2**64` as the exclusive upper bound that you won't be able
to do it with a `np.uint64` array/scalar because it's one above the bound
for that dtype, so you'll have to use a plain Python `int` object or
`dtype=object` array in order to represent `2**64`. It is not saying that
you can draw arbitrary-sized integers.

>>> rng.integers(2**64, dtype=np.uint64)
11569248114014186612

If the arrays you are drawing indices for are real in-memory arrays for
present-day 64-bit computers, this should be adequate. If it's a notional
array that is larger, then you'll need actual arbitrary-sized integer
sampling. The builtin `random.randrange()` will do arbitrary-sized integers
and is quite reasonable for this task. If you want it to use our
BitGenerators underneath for clean PRNG state management, this is quite
doable with a simple subclass of `random.Random`:
https://github.com/numpy/numpy/issues/24458#issuecomment-1685022258

-- 
Robert Kern
___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com