Re: [Numpy-discussion] sample without replacement

2010-12-21 Thread Robert Kern
On Mon, Dec 20, 2010 at 10:28, Alan G Isaac  wrote:
> I want to sample *without* replacement from a vector
> (as with Python's random.sample).  I don't see a direct
> replacement for this, and I don't want to carry two
> PRNG's around.  Is the best way something like  this?
>
>        permutation(myvector)[:samplesize]

For one of my personal projects, I copied over the mtrand package and
added a method to RandomState for doing this kind of thing using
reservoir sampling.

http://en.wikipedia.org/wiki/Reservoir_sampling

def subset_reservoir(self, long nselected, long ntotal, object size=None):
""" Sample a given number integers from the set [0, ntotal) without
replacement using a reservoir algorithm.

Parameters
--
nselected : int
The number of integers to sample.
ntotal : int
The size of the set to sample from.
size : int, sequence of ints, or None
The number of subsets to sample or a shape tuple. An axis of the
length nselected will be appended to a shape.

Returns
---
out : ndarray
The sampled subsets. The order of the items is not necessarily
random. Use a slice from the result of permutation() if you need the
order of the items to be randomized.
"""
cdef long total_size, length, i, j, u
cdef cnp.ndarray[cnp.int_t, ndim=2] out
if size is None:
shape = (nselected,)
total_size = nselected
length = 1
elif isinstance(size, int):
shape = (size, nselected)
total_size = size * nselected
length = size
else:
shape = size + (nselected,)
length = 1
for i from 0 <= i < len(size):
length *= size[i]
total_size = length * nselected
out = np.empty((length, nselected), dtype=int)
for i from 0 <= i < length:
for j from 0 <= j < nselected:
out[i,j] = j
for j from nselected <= j < ntotal:
u = rk_interval(j+1, self.internal_state)
if u < nselected:
out[i,u] = j
return out.reshape(shape)

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] sample without replacement

2010-12-21 Thread Sturla Molden

We often need to generate more than one such sample from an array, e.g.
for permutation tests. If we shuffle an array x of size N and use x[:M] as
a random sample "without replacement", we just need to put them back
randomly to get the next sample (cf. Fisher-Yates shuffle). That way we
get O(M) amortized complexity for each sample of size M. Only the first
sample will have complexity O(N).

Sturla


> I know this question came up on the mailing list some time ago
> (19/09/2008), and the conclusion was that yes, you can do it more or
> less efficiently in pure python; the trick is to use two different
> methods. If your sample is more than, say, a quarter the size of the
> set you're drawing from, you permute the set and take the first few.
> If your sample is smaller, you draw with replacement, then redraw the
> duplicates, and repeat until there aren't any more duplicates. Since
> you only do this when your sample is much smaller than the population
> you don't need to repeat many times.
>
> Here's the code I posted to the previous discussion (not tested this
> time around) with comments:
>
> '''
> def choose_without_replacement(m,n,repeats=None):
>"""Choose n nonnegative integers less than m without replacement
>
>Returns an array of shape n, or (n,repeats).
>"""
>if repeats is None:
>r = 1
>else:
>r = repeats
>if n>m:
>raise ValueError, "Cannot find %d nonnegative integers less
> than %d" %  (n,m)
>if n>m/2:
>res = np.sort(np.random.rand(m,r).argsort(axis=0)[:n,:],axis=0)
>else:
>res = np.random.random_integers(m,size=(n,r))
>while True:
>res = np.sort(res,axis=0)
>w = np.nonzero(np.diff(res,axis=0)==0)
>nr = len(w[0])
>if nr==0:
>break
>res[w] = np.random.random_integers(m,size=nr)
>
>if repeats is None:
>return res[:,0]
>else:
>return res
>
> For really large values of repeats it does too much sorting; I didn't
> have the energy to make it pull all the ones with repeats to the
> beginning so that only they need to be re-sorted the next time
> through. Still, the expected number of trips through the while loop
> grows only logarithmically with repeats, so it shouldn't be too bad.
> '''
>
> Anne
>
> On 20 December 2010 12:13, John Salvatier 
> wrote:
>> I think this is not possible to do efficiently with just numpy. If you
>> want
>> to do this efficiently, I wrote a no-replacement sampler in Cython some
>> time
>> ago (below). I hearby release it to the public domain.
>>
>> '''
>>
>> Created on Oct 24, 2009
>> http://stackoverflow.com/questions/311703/algorithm-for-sampling-without-replacement
>> @author: johnsalvatier
>>
>> '''
>>
>> from __future__ import division
>>
>> import numpy
>>
>> def random_no_replace(sampleSize, populationSize, numSamples):
>>
>>
>>
>>     samples  = numpy.zeros((numSamples, sampleSize),dtype=int)
>>
>>
>>
>>     # Use Knuth's variable names
>>
>>     cdef int n = sampleSize
>>
>>     cdef int N = populationSize
>>
>>     cdef i = 0
>>
>>     cdef int t = 0 # total input records dealt with
>>
>>     cdef int m = 0 # number of items selected so far
>>
>>     cdef double u
>>
>>     while i < numSamples:
>>
>>         t = 0
>>
>>         m = 0
>>
>>         while m < n :
>>
>>
>>
>>             u = numpy.random.uniform() # call a uniform(0,1) random
>> number
>> generator
>>
>>             if  (N - t)*u >= n - m :
>>
>>
>>
>>                 t += 1
>>
>>
>>
>>             else:
>>
>>
>>
>>                 samples[i,m] = t
>>
>>                 t += 1
>>
>>                 m += 1
>>
>>
>>
>>         i += 1
>>
>>
>>
>>     return samples
>>
>>
>>
>> On Mon, Dec 20, 2010 at 8:28 AM, Alan G Isaac 
>> wrote:
>>>
>>> I want to sample *without* replacement from a vector
>>> (as with Python's random.sample).  I don't see a direct
>>> replacement for this, and I don't want to carry two
>>> PRNG's around.  Is the best way something like  this?
>>>
>>>        permutation(myvector)[:samplesize]
>>>
>>> Thanks,
>>> Alan Isaac
>>> ___
>>> NumPy-Discussion mailing list
>>> NumPy-Discussion@scipy.org
>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>
>>
>> ___
>> NumPy-Discussion mailing list
>> NumPy-Discussion@scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>
>>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] sample without replacement

2010-12-21 Thread Anne Archibald
I know this question came up on the mailing list some time ago
(19/09/2008), and the conclusion was that yes, you can do it more or
less efficiently in pure python; the trick is to use two different
methods. If your sample is more than, say, a quarter the size of the
set you're drawing from, you permute the set and take the first few.
If your sample is smaller, you draw with replacement, then redraw the
duplicates, and repeat until there aren't any more duplicates. Since
you only do this when your sample is much smaller than the population
you don't need to repeat many times.

Here's the code I posted to the previous discussion (not tested this
time around) with comments:

'''
def choose_without_replacement(m,n,repeats=None):
   """Choose n nonnegative integers less than m without replacement

   Returns an array of shape n, or (n,repeats).
   """
   if repeats is None:
   r = 1
   else:
   r = repeats
   if n>m:
   raise ValueError, "Cannot find %d nonnegative integers less
than %d" %  (n,m)
   if n>m/2:
   res = np.sort(np.random.rand(m,r).argsort(axis=0)[:n,:],axis=0)
   else:
   res = np.random.random_integers(m,size=(n,r))
   while True:
   res = np.sort(res,axis=0)
   w = np.nonzero(np.diff(res,axis=0)==0)
   nr = len(w[0])
   if nr==0:
   break
   res[w] = np.random.random_integers(m,size=nr)

   if repeats is None:
   return res[:,0]
   else:
   return res

For really large values of repeats it does too much sorting; I didn't
have the energy to make it pull all the ones with repeats to the
beginning so that only they need to be re-sorted the next time
through. Still, the expected number of trips through the while loop
grows only logarithmically with repeats, so it shouldn't be too bad.
'''

Anne

On 20 December 2010 12:13, John Salvatier  wrote:
> I think this is not possible to do efficiently with just numpy. If you want
> to do this efficiently, I wrote a no-replacement sampler in Cython some time
> ago (below). I hearby release it to the public domain.
>
> '''
>
> Created on Oct 24, 2009
> http://stackoverflow.com/questions/311703/algorithm-for-sampling-without-replacement
> @author: johnsalvatier
>
> '''
>
> from __future__ import division
>
> import numpy
>
> def random_no_replace(sampleSize, populationSize, numSamples):
>
>
>
>     samples  = numpy.zeros((numSamples, sampleSize),dtype=int)
>
>
>
>     # Use Knuth's variable names
>
>     cdef int n = sampleSize
>
>     cdef int N = populationSize
>
>     cdef i = 0
>
>     cdef int t = 0 # total input records dealt with
>
>     cdef int m = 0 # number of items selected so far
>
>     cdef double u
>
>     while i < numSamples:
>
>         t = 0
>
>         m = 0
>
>         while m < n :
>
>
>
>             u = numpy.random.uniform() # call a uniform(0,1) random number
> generator
>
>             if  (N - t)*u >= n - m :
>
>
>
>                 t += 1
>
>
>
>             else:
>
>
>
>                 samples[i,m] = t
>
>                 t += 1
>
>                 m += 1
>
>
>
>         i += 1
>
>
>
>     return samples
>
>
>
> On Mon, Dec 20, 2010 at 8:28 AM, Alan G Isaac  wrote:
>>
>> I want to sample *without* replacement from a vector
>> (as with Python's random.sample).  I don't see a direct
>> replacement for this, and I don't want to carry two
>> PRNG's around.  Is the best way something like  this?
>>
>>        permutation(myvector)[:samplesize]
>>
>> Thanks,
>> Alan Isaac
>> ___
>> NumPy-Discussion mailing list
>> NumPy-Discussion@scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
>
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] sample without replacement

2010-12-21 Thread Alan G Isaac
On 12/20/2010 10:49 PM, josef.p...@gmail.com wrote:
> What's the difference between a numpy Random and a python
> random.Random instance of separate states of the random number
> generators?


Sorry, I don't understand the question.  The difference
for my use is that a np.RandomState instance provides
access to a different set of methods, which unfortunately
does not include an equivalent to random.Random's sample
method but which does include others I need.

Would it be appropriate to request that an analog
to random.sample be added to numpy.random?
(It might sample only a range, since producing indexes
would provide the base functionality.)
Or is this functionality absent intentionally?

Alan

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] sample without replacement

2010-12-20 Thread josef . pktd
On Mon, Dec 20, 2010 at 10:19 PM, Alan G Isaac  wrote:
> On 12/20/2010 9:41 PM, josef.p...@gmail.com wrote:
>> python has it in random
>>
>> sample( population, k)
>
>
> Yes, I mentioned this in my original post:
> http://www.mail-archive.com/numpy-discussion@scipy.org/msg29324.html
>
> But good simulation practice is perhaps to seed
> a simulation specific random number generator
> (not just rely on a global), and I don't want
> to pass around two different instances.
> So I want to get this functionality from numpy.random.

Sorry, I was reading to fast, and I might be tired.

What's the difference between a numpy Random and a python
random.Random instance of separate states of the random number
generators?

Josef

>
> Which reminds me of another question.
> numpy.random.RandomState accepts an int array as a seed:
> what is the *intended* use?
>
> Thanks,
> Alan
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] sample without replacement

2010-12-20 Thread Alan G Isaac
On 12/20/2010 9:41 PM, josef.p...@gmail.com wrote:
> python has it in random
>
> sample( population, k)


Yes, I mentioned this in my original post:
http://www.mail-archive.com/numpy-discussion@scipy.org/msg29324.html

But good simulation practice is perhaps to seed
a simulation specific random number generator
(not just rely on a global), and I don't want
to pass around two different instances.
So I want to get this functionality from numpy.random.

Which reminds me of another question.
numpy.random.RandomState accepts an int array as a seed:
what is the *intended* use?

Thanks,
Alan
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] sample without replacement

2010-12-20 Thread josef . pktd
On Mon, Dec 20, 2010 at 11:28 AM, Alan G Isaac  wrote:
> I want to sample *without* replacement from a vector
> (as with Python's random.sample).  I don't see a direct
> replacement for this, and I don't want to carry two
> PRNG's around.  Is the best way something like  this?
>
>        permutation(myvector)[:samplesize]

python has it in random

sample( population, k)
Return a k length list of unique elements chosen from the population
sequence. Used for random sampling without replacement. New in version
2.3

Josef


>
> Thanks,
> Alan Isaac
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] sample without replacement

2010-12-20 Thread John Salvatier
I think this is not possible to do efficiently with just numpy. If you want
to do this efficiently, I wrote a no-replacement sampler in Cython some time
ago (below). I hearby release it to the public domain.

'''

Created on Oct 24, 2009
http://stackoverflow.com/questions/311703/algorithm-for-sampling-without-replacement
@author: johnsalvatier

'''

from __future__ import division

import numpy

def random_no_replace(sampleSize, populationSize, numSamples):



samples  = numpy.zeros((numSamples, sampleSize),dtype=int)



# Use Knuth's variable names

cdef int n = sampleSize

cdef int N = populationSize

cdef i = 0

cdef int t = 0 # total input records dealt with

cdef int m = 0 # number of items selected so far

cdef double u

while i < numSamples:

t = 0

m = 0

while m < n :



u = numpy.random.uniform() # call a uniform(0,1) random number
generator

if  (N - t)*u >= n - m :



t += 1



else:



samples[i,m] = t

t += 1

m += 1



i += 1



return samples


On Mon, Dec 20, 2010 at 8:28 AM, Alan G Isaac  wrote:

> I want to sample *without* replacement from a vector
> (as with Python's random.sample).  I don't see a direct
> replacement for this, and I don't want to carry two
> PRNG's around.  Is the best way something like  this?
>
>permutation(myvector)[:samplesize]
>
> Thanks,
> Alan Isaac
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] sample without replacement

2010-12-20 Thread Alan G Isaac
I want to sample *without* replacement from a vector
(as with Python's random.sample).  I don't see a direct
replacement for this, and I don't want to carry two
PRNG's around.  Is the best way something like  this?

permutation(myvector)[:samplesize]

Thanks,
Alan Isaac
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion