[Numpy-discussion] take from structured array is faster than boolean indexing, but reshapes columns to 2D
Dear all-

Structured arrays are great, but I am having problems filtering them efficiently. Reading through the mailing list, it seems like boolean arrays are the recommended approach to filtering arrays for arbitrary conditions, but my testing shows that a combination of take and where can be much faster when dealing with structured arrays:

    import timeit
    setup = "from numpy import random, where, zeros; r = random.random_integers(1e3, size=1e6); q = zeros((1e6), dtype=[('foo', 'u4'), ('bar', 'u4'), ('baz', 'u4')]); q['foo'] = r"
    statement1 = "s = q.take(where(q['foo'] < 500))"
    statement2 = "s = q[q['foo'] < 500]"
    t = timeit.Timer(statement1, setup)
    t.timeit(10)
    t = timeit.Timer(statement2, setup)
    t.timeit(10)

Using the boolean array is about 4 times slower when dealing with large arrays. In my case, these operations happen on a web server handling a large number of requests, so the efficiency gain matters. However, the combination of take and where reshapes the columns of structured arrays to be 2-dimensional:

    >>> q['foo'].shape
    (1000000,)
    >>> s = q[q['foo'] < 500]
    >>> s['foo'].shape
    (499102,)
    >>> s = q.take(where(q['foo'] < 500))
    >>> s['foo'].shape
    (1, 499102)

Is there a way to use this seemingly more efficient approach (take and where) without having to manually reshape the columns? This seems ungainly for larger structured arrays. Or should I file this as a bug? Perhaps there are even more efficient approaches that I haven't thought of, but are obvious to others?

Thanks in advance,

Yours,
-Chris

--
Chris Mutel
Ökologisches Systemdesign - Ecological Systems Design
Institut f. Umweltingenieurwissenschaften - Institute for Environmental Engineering
ETH Zürich - HIF C 42 - Schafmattstr. 6
8093 Zürich
Telefon: +41 44 633 71 45 - Fax: +41 44 633 10 61
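The 2-D result comes from ``where`` returning a tuple of index arrays, which ``take`` then treats as a 2-D index. A minimal sketch of one way to keep the columns 1-D while still using ``take`` is to pass plain 1-D indices, e.g. ``where(...)[0]`` or ``flatnonzero`` (illustrative only, using made-up data):

    import numpy as np

    q = np.zeros(1000000, dtype=[('foo', 'u4'), ('bar', 'u4'), ('baz', 'u4')])
    q['foo'] = np.random.randint(0, 1000, size=q.size)

    idx = np.flatnonzero(q['foo'] < 500)   # 1-D integer indices, same as where(...)[0]
    s = q.take(idx)                        # rows where the condition holds
    print(s['foo'].shape)                  # (n_matches,) -- stays 1-D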
Re: [Numpy-discussion] NEP for faster ufuncs
On Tue, Dec 21, 2010 at 5:53 PM, Mark Wiebe wrote:
> Hello NumPy-ers,
>
> After some performance analysis, I've designed and implemented a new
> iterator designed to speed up ufuncs and allow for easier multi-dimensional
> iteration. The new code is fairly large, but works quite well already. If
> some people could read the NEP and give some feedback, that would be great!
> Here's a link:
>
> https://github.com/m-paradox/numpy/blob/mw_neps/doc/neps/new-iterator-ufunc.rst
>
> [...]

Wow, that's a really nice design and write-up. Small typo:

    /* Only allow exactly equivalent types */
    NPY_NO_CASTING=0,
    /* Allow casts between equivalent types of different byte orders */
    NPY_EQUIV_CASTING=0,

Chuck
Re: [Numpy-discussion] NEP for faster ufuncs
On Tue, Dec 21, 2010 at 6:06 PM, David wrote:
>
> This looks pretty cool. I hope to be able to take a look at it during
> the christmas holidays.

Thanks!

> I cannot comment in details yet, but it seems to address several issues
> I encountered myself while implementing the neighborhood iterator (which
> I will try to update to use the new one).
>
> One question: which CPU/platform did you test it on ?

The system I'm testing on is a bit old: a dual-core Athlon 64 X2 4200+ on 64-bit Fedora 13, gcc 4.4.5.

-Mark
Re: [Numpy-discussion] NEP for faster ufuncs
That's a good suggestion - added. Unfortunately, it looks like the github rst converter doesn't make a table of contents with working links.

Cheers,
Mark

On Tue, Dec 21, 2010 at 6:00 PM, John Salvatier wrote:
> I applaud you on your vision. I only have one small suggestion: I suggest
> you put a table of contents at the beginning of your NEP so people may skip
> to the part that most interests them.
Re: [Numpy-discussion] NEP for faster ufuncs
Hi Mark,

On 12/22/2010 09:53 AM, Mark Wiebe wrote:
> Hello NumPy-ers,
>
> After some performance analysis, I've designed and implemented a new
> iterator designed to speed up ufuncs and allow for easier
> multi-dimensional iteration. The new code is fairly large, but works
> quite well already. If some people could read the NEP and give some
> feedback, that would be great! Here's a link:
>
> https://github.com/m-paradox/numpy/blob/mw_neps/doc/neps/new-iterator-ufunc.rst

This looks pretty cool. I hope to be able to take a look at it during the christmas holidays.

I cannot comment in details yet, but it seems to address several issues I encountered myself while implementing the neighborhood iterator (which I will try to update to use the new one).

One question: which CPU/platform did you test it on?

cheers,

David
Re: [Numpy-discussion] NEP for faster ufuncs
I applaud you on your vision. I only have one small suggestion: I suggest you put a table of contents at the beginning of your NEP so people may skip to the part that most interests them.

On Tue, Dec 21, 2010 at 4:59 PM, John Salvatier wrote:
> That is an amazing christmas present.
>
> On Tue, Dec 21, 2010 at 4:53 PM, Mark Wiebe wrote:
>> Hello NumPy-ers,
>>
>> After some performance analysis, I've designed and implemented a new
>> iterator designed to speed up ufuncs and allow for easier multi-dimensional
>> iteration. The new code is fairly large, but works quite well already. If
>> some people could read the NEP and give some feedback, that would be great!
>> Here's a link:
>>
>> https://github.com/m-paradox/numpy/blob/mw_neps/doc/neps/new-iterator-ufunc.rst
>>
>> [...]
Re: [Numpy-discussion] Giving numpy the ability to multi-iterate excluding an axis
On Mon, Dec 20, 2010 at 1:42 PM, John Salvatier wrote:
> A while ago, I asked whether it was possible to multi-iterate over
> several ndarrays but exclude a certain axis
> (http://www.mail-archive.com/numpy-discussion@scipy.org/msg29204.html),
> sort of a combination of PyArray_IterAllButAxis and PyArray_MultiIterNew.
> My goal was to allow creation of relatively complex ufuncs that can allow
> reduction or directionally dependent computation and still use broadcasting
> (for example a moving-averaging ufunc that can have changing averaging
> parameters). I didn't get any solutions, which I take to mean that no one
> knew how to do this.
>
> I am thinking about trying to make a numpy patch with this functionality,
> and I have some questions: 1) How difficult would this kind of task be for
> someone with non-expert C knowledge and good numpy knowledge? 2) Does
> anyone have advice on how to do this kind of thing?

You may be able to do what you would like with the new iterator I've written. In particular, it supports nesting multiple iterators by providing either pointers or offsets, and allowing you to specify any subset of the axes to iterate. Here's how the code to do this in a simple 3D case might look, for making axis 1 the inner loop:

    PyArrayObject *op[2] = {a, b};
    npy_intp axes_outer[2] = {0, 2};
    npy_intp *op_axes[2];
    npy_intp axis_inner = 1;
    npy_int32 flags[2] = {NPY_ITER_READONLY, NPY_ITER_READONLY};
    NpyIter *outer, *inner;
    NpyIter_IterNext_Fn oiternext, iiternext;
    npy_intp *ooffsets;
    char **idataptrs;

    op_axes[0] = op_axes[1] = axes_outer;
    outer = NpyIter_MultiNew(2, op, NPY_ITER_OFFSETS,
                             NPY_KEEPORDER, NPY_NO_CASTING, flags, NULL,
                             2, op_axes, 0);
    op_axes[0] = op_axes[1] = &axis_inner;
    inner = NpyIter_MultiNew(2, op, 0,
                             NPY_KEEPORDER, NPY_NO_CASTING, flags, NULL,
                             1, op_axes, 0);

    oiternext = NpyIter_GetIterNext(outer);
    iiternext = NpyIter_GetIterNext(inner);

    ooffsets = (npy_intp *)NpyIter_GetDataPtrArray(outer);
    idataptrs = NpyIter_GetDataPtrArray(inner);

    do {
        do {
            char *a_data = idataptrs[0] + ooffsets[0],
                 *b_data = idataptrs[1] + ooffsets[1];
            /* Do stuff with the data */
        } while (iiternext());
        NpyIter_Reset(inner);
    } while (oiternext());

    NpyIter_Deallocate(outer);
    NpyIter_Deallocate(inner);

Extending to more dimensions, or making both the inner and outer loops have multiple dimensions, isn't too crazy. Is this along the lines of what you need?

If you check out my code, note that it currently isn't exposed as NumPy API yet, but you can try a lot of things with the Python exposure.

Cheers,
Mark
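A rough Python-level counterpart of the same nested-iterator pattern, using the ``nested_iters`` helper exposed in released NumPy (the function name and behaviour here come from the released API, not from the branch under discussion, so treat it as a sketch):

    import numpy as np

    a = np.arange(24).reshape(2, 3, 4)

    # Outer iteration over axes 0 and 2, inner iteration over axis 1,
    # mirroring the C sketch above where axis 1 is the inner loop.
    outer, inner = np.nested_iters(a, [[0, 2], [1]], flags=["multi_index"])
    for _ in outer:
        for x in inner:
            # x is a zero-dimensional view of one element of a; do work here
            pass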
Re: [Numpy-discussion] NEP for faster ufuncs
That is an amazing christmas present.

On Tue, Dec 21, 2010 at 4:53 PM, Mark Wiebe wrote:
> Hello NumPy-ers,
>
> After some performance analysis, I've designed and implemented a new
> iterator designed to speed up ufuncs and allow for easier multi-dimensional
> iteration. The new code is fairly large, but works quite well already. If
> some people could read the NEP and give some feedback, that would be great!
> Here's a link:
>
> https://github.com/m-paradox/numpy/blob/mw_neps/doc/neps/new-iterator-ufunc.rst
>
> [...]
[Numpy-discussion] NEP for faster ufuncs
Hello NumPy-ers,

After some performance analysis, I've designed and implemented a new iterator designed to speed up ufuncs and allow for easier multi-dimensional iteration. The new code is fairly large, but works quite well already. If some people could read the NEP and give some feedback, that would be great! Here's a link:

https://github.com/m-paradox/numpy/blob/mw_neps/doc/neps/new-iterator-ufunc.rst

I would also love it if someone could try building the code and play around with it a bit. The github branch is here:

https://github.com/m-paradox/numpy/tree/new_iterator

To give a taste of the iterator's functionality, below is an example from the NEP for how to implement a "Lambda UFunc." With just a few lines of code, it's possible to replicate something similar to the numexpr library (numexpr still gets a bigger speedup, though). In the example expression I chose, execution time went from 138ms to 61ms.

Hopefully this is a good Christmas present for NumPy. :)

Cheers,
Mark

Here is the definition of the ``luf`` function::

    def luf(lamdaexpr, *args, **kwargs):
        """Lambda UFunc

        e.g.
        c = luf(lambda i,j:i+j, a, b, order='K',
                casting='safe', buffersize=8192)

        c = np.empty(...)
        luf(lambda i,j:i+j, a, b, out=c, order='K',
            casting='safe', buffersize=8192)
        """

        nargs = len(args)
        op = args + (kwargs.get('out', None),)
        it = np.newiter(op, ['buffered', 'no_inner_iteration'],
                        [['readonly', 'nbo_aligned']]*nargs +
                        [['writeonly', 'allocate', 'no_broadcast']],
                        order=kwargs.get('order', 'K'),
                        casting=kwargs.get('casting', 'safe'),
                        buffersize=kwargs.get('buffersize', 0))
        while not it.finished:
            it[-1] = lamdaexpr(*it[:-1])
            it.iternext()

        return it.operands[-1]

Then, by using ``luf`` instead of straight Python expressions, we can gain some performance from better cache behavior::

    In [2]: a = np.random.random((50,50,50,10))
    In [3]: b = np.random.random((50,50,1,10))
    In [4]: c = np.random.random((50,50,50,1))

    In [5]: timeit 3*a+b-(a/c)
    1 loops, best of 3: 138 ms per loop

    In [6]: timeit luf(lambda a,b,c:3*a+b-(a/c), a, b, c)
    10 loops, best of 3: 60.9 ms per loop

    In [7]: np.all(3*a+b-(a/c) == luf(lambda a,b,c:3*a+b-(a/c), a, b, c))
    Out[7]: True
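For readers coming to this thread from a released NumPy: this iterator shipped as ``np.nditer``, with ``no_inner_iteration`` renamed to ``external_loop`` and ``nbo_aligned`` split into the ``nbo`` and ``aligned`` flags. A sketch of ``luf`` against that released API (assuming NumPy 1.6 or later; not part of the original announcement) could look like:

    import numpy as np

    def luf(lambdaexpr, *args, **kwargs):
        """Lambda UFunc sketch against the released np.nditer API."""
        nargs = len(args)
        op = args + (kwargs.get('out', None),)
        it = np.nditer(op, ['buffered', 'external_loop'],
                       [['readonly', 'nbo', 'aligned']] * nargs +
                       [['writeonly', 'allocate', 'no_broadcast']],
                       order=kwargs.get('order', 'K'),
                       casting=kwargs.get('casting', 'safe'),
                       buffersize=kwargs.get('buffersize', 0))
        while not it.finished:
            it[-1] = lambdaexpr(*it[:-1])   # operate on buffered 1-D chunks
            it.iternext()
        return it.operands[-1]

    a = np.random.random((50, 50, 50, 10))
    b = np.random.random((50, 50, 1, 10))
    c = np.random.random((50, 50, 50, 1))
    print(np.allclose(3*a + b - (a/c),
                      luf(lambda a, b, c: 3*a + b - (a/c), a, b, c)))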
Re: [Numpy-discussion] sample without replacement
On Mon, Dec 20, 2010 at 10:28, Alan G Isaac wrote:
> I want to sample *without* replacement from a vector
> (as with Python's random.sample). I don't see a direct
> replacement for this, and I don't want to carry two
> PRNG's around. Is the best way something like this?
>
> permutation(myvector)[:samplesize]

For one of my personal projects, I copied over the mtrand package and added a method to RandomState for doing this kind of thing using reservoir sampling.

http://en.wikipedia.org/wiki/Reservoir_sampling

    def subset_reservoir(self, long nselected, long ntotal, object size=None):
        """
        Sample a given number of integers from the set [0, ntotal) without
        replacement using a reservoir algorithm.

        Parameters
        ----------
        nselected : int
            The number of integers to sample.
        ntotal : int
            The size of the set to sample from.
        size : int, sequence of ints, or None
            The number of subsets to sample or a shape tuple. An axis of
            length nselected will be appended to the shape.

        Returns
        -------
        out : ndarray
            The sampled subsets. The order of the items is not necessarily
            random. Use a slice from the result of permutation() if you need
            the order of the items to be randomized.
        """
        cdef long total_size, length, i, j, u
        cdef cnp.ndarray[cnp.int_t, ndim=2] out
        if size is None:
            shape = (nselected,)
            total_size = nselected
            length = 1
        elif isinstance(size, int):
            shape = (size, nselected)
            total_size = size * nselected
            length = size
        else:
            shape = size + (nselected,)
            length = 1
            for i from 0 <= i < len(size):
                length *= size[i]
            total_size = length * nselected
        out = np.empty((length, nselected), dtype=int)
        for i from 0 <= i < length:
            for j from 0 <= j < nselected:
                out[i,j] = j
            for j from nselected <= j < ntotal:
                u = rk_interval(j+1, self.internal_state)
                if u < nselected:
                    out[i,u] = j
        return out.reshape(shape)

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
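The same reservoir algorithm written as a plain NumPy/Python sketch, for readers who don't want to build a modified mtrand (``rk_interval`` and ``internal_state`` above are internals of the copied mtrand package; ``default_rng`` below assumes a modern NumPy):

    import numpy as np

    def reservoir_sample(nselected, ntotal, rng=None):
        """Plain-Python sketch of the reservoir algorithm above (illustrative only)."""
        rng = np.random.default_rng() if rng is None else rng
        out = np.arange(nselected)            # reservoir starts as 0..nselected-1
        for j in range(nselected, ntotal):
            u = rng.integers(0, j + 1)        # uniform integer in [0, j]
            if u < nselected:
                out[u] = j                    # keep j with probability nselected/(j+1)
        return out

    print(reservoir_sample(5, 100))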
Re: [Numpy-discussion] Reversing an array in-place
Chuck wrote:
> The reversed matrix is a view, no copying is done. It is even faster than
> an in-place reversal.

This is why I love NumPy. In C, Fortran or Matlab, most programmers would probably form the reversed array. In NumPy we just change some metainformation (data pointer and strides) behind the scenes. It cannot be done more efficiently than that.

Sturla
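A small sketch of what "just changing the metainformation" looks like in practice (the exact stride values assume an 8-byte integer dtype on a 64-bit build):

    import numpy as np

    a = np.arange(5)
    b = a[::-1]                      # reversed view: same buffer, negative stride
    print(a.strides, b.strides)      # e.g. (8,) (-8,)
    print(np.shares_memory(a, b))    # True: no data was copied
    b[0] = 99                        # writes through to a[4]
    print(a)                         # [ 0  1  2  3 99]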
Re: [Numpy-discussion] sample without replacement
We often need to generate more than one such sample from an array, e.g. for permutation tests. If we shuffle an array x of size N and use x[:M] as a random sample "without replacement", we just need to put them back randomly to get the next sample (cf. the Fisher-Yates shuffle). That way we get O(M) amortized complexity for each sample of size M. Only the first sample will have complexity O(N).

Sturla

> I know this question came up on the mailing list some time ago
> (19/09/2008), and the conclusion was that yes, you can do it more or
> less efficiently in pure python; the trick is to use two different
> methods. If your sample is more than, say, a quarter the size of the
> set you're drawing from, you permute the set and take the first few.
> If your sample is smaller, you draw with replacement, then redraw the
> duplicates, and repeat until there aren't any more duplicates.
> [...]
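A minimal sketch of the partial-shuffle idea Sturla describes above (the function name and the use of ``default_rng`` are illustrative assumptions, not an existing NumPy API):

    import numpy as np

    def partial_fisher_yates_sampler(x, m, rng=None):
        """Yield repeated size-m samples without replacement from x, O(m) work per draw."""
        rng = np.random.default_rng() if rng is None else rng
        x = np.array(x)                     # copy we are free to shuffle in place
        n = len(x)
        while True:
            for i in range(m):              # partial Fisher-Yates: only the first m slots
                j = rng.integers(i, n)      # uniform position in x[i:]
                x[i], x[j] = x[j], x[i]
            yield x[:m].copy()

    gen = partial_fisher_yates_sampler(np.arange(1000), 10)
    print(next(gen))
    print(next(gen))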
[Numpy-discussion] bincount and generators
bincount does not currently allow a generator as an argument. I'm wondering if it is considered too costly to extend it to allow this. (Motivation: I'm counting based on an attribute of a large number of objects, and I don't need a list of the data.)

Thanks,
Alan Isaac
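One workaround that avoids materializing a Python list is to funnel the generator through ``np.fromiter`` before calling ``bincount`` (a sketch; the object and attribute names here are made up):

    import numpy as np

    class Record:
        def __init__(self, category):
            self.category = category

    objs = (Record(i % 7) for i in range(1000000))   # hypothetical stream of objects

    # fromiter builds a compact integer array straight from the generator,
    # which bincount can then consume without an intermediate list.
    counts = np.bincount(np.fromiter((o.category for o in objs), dtype=np.intp))
    print(counts)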
[Numpy-discussion] bincount question
::

    >>> np.bincount([])
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: The first argument cannot be empty.

Why not? (I.e., why isn't an empty array the right answer?)

Thanks,
Alan Isaac
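Until the behaviour changes, a tiny guard works around it (``safe_bincount`` is an illustrative helper, not part of NumPy):

    import numpy as np

    def safe_bincount(x, minlength=0):
        """bincount that tolerates empty input (illustrative helper, not a NumPy API)."""
        x = np.asarray(x, dtype=np.intp)
        if x.size == 0:
            return np.zeros(minlength, dtype=np.intp)
        out = np.bincount(x)
        if minlength > out.size:
            out = np.concatenate([out, np.zeros(minlength - out.size, dtype=np.intp)])
        return out

    print(safe_bincount([]))         # []
    print(safe_bincount([0, 1, 1]))  # [1 2]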
Re: [Numpy-discussion] sample without replacement
I know this question came up on the mailing list some time ago (19/09/2008), and the conclusion was that yes, you can do it more or less efficiently in pure python; the trick is to use two different methods. If your sample is more than, say, a quarter the size of the set you're drawing from, you permute the set and take the first few. If your sample is smaller, you draw with replacement, then redraw the duplicates, and repeat until there aren't any more duplicates. Since you only do this when your sample is much smaller than the population you don't need to repeat many times.

Here's the code I posted to the previous discussion (not tested this time around) with comments:

'''
    def choose_without_replacement(m, n, repeats=None):
        """Choose n nonnegative integers less than m without replacement

        Returns an array of shape n, or (n, repeats).
        """
        if repeats is None:
            r = 1
        else:
            r = repeats
        if n > m:
            raise ValueError, "Cannot find %d nonnegative integers less than %d" % (n, m)
        if n > m/2:
            res = np.sort(np.random.rand(m, r).argsort(axis=0)[:n, :], axis=0)
        else:
            res = np.random.random_integers(m, size=(n, r))
            while True:
                res = np.sort(res, axis=0)
                w = np.nonzero(np.diff(res, axis=0) == 0)
                nr = len(w[0])
                if nr == 0:
                    break
                res[w] = np.random.random_integers(m, size=nr)

        if repeats is None:
            return res[:, 0]
        else:
            return res

For really large values of repeats it does too much sorting; I didn't have the energy to make it pull all the ones with repeats to the beginning so that only they need to be re-sorted the next time through. Still, the expected number of trips through the while loop grows only logarithmically with repeats, so it shouldn't be too bad.
'''

Anne

On 20 December 2010 12:13, John Salvatier wrote:
> I think this is not possible to do efficiently with just numpy. If you want
> to do this efficiently, I wrote a no-replacement sampler in Cython some
> time ago (below). I hereby release it to the public domain.
>
>     '''
>     Created on Oct 24, 2009
>     http://stackoverflow.com/questions/311703/algorithm-for-sampling-without-replacement
>     @author: johnsalvatier
>     '''
>
>     from __future__ import division
>
>     import numpy
>
>     def random_no_replace(sampleSize, populationSize, numSamples):
>
>         samples = numpy.zeros((numSamples, sampleSize), dtype=int)
>
>         # Use Knuth's variable names
>         cdef int n = sampleSize
>         cdef int N = populationSize
>         cdef i = 0
>         cdef int t = 0  # total input records dealt with
>         cdef int m = 0  # number of items selected so far
>         cdef double u
>
>         while i < numSamples:
>             t = 0
>             m = 0
>             while m < n:
>                 u = numpy.random.uniform()  # call a uniform(0,1) random number generator
>                 if (N - t)*u >= n - m:
>                     t += 1
>                 else:
>                     samples[i, m] = t
>                     t += 1
>                     m += 1
>             i += 1
>
>         return samples
>
> On Mon, Dec 20, 2010 at 8:28 AM, Alan G Isaac wrote:
>> I want to sample *without* replacement from a vector
>> (as with Python's random.sample). I don't see a direct
>> replacement for this, and I don't want to carry two
>> PRNG's around. Is the best way something like this?
>>
>> permutation(myvector)[:samplesize]
>>
>> Thanks,
>> Alan Isaac
Re: [Numpy-discussion] sample without replacement
On 12/20/2010 10:49 PM, josef.p...@gmail.com wrote:
> What's the difference between a numpy Random and a python
> random.Random instance of separate states of the random number
> generators?

Sorry, I don't understand the question. The difference for my use is that a np.RandomState instance provides access to a different set of methods, which unfortunately does not include an equivalent to random.Random's sample method but which does include others I need.

Would it be appropriate to request that an analog to random.sample be added to numpy.random? (It might sample only a range, since producing indexes would provide the base functionality.) Or is this functionality absent intentionally?

Alan
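For readers finding this thread later: NumPy did grow this capability; ``numpy.random.choice`` and, later, ``Generator.choice`` accept ``replace=False``. A short sketch, assuming a reasonably recent NumPy (1.17+ for the Generator API):

    import numpy as np

    myvector = np.arange(1000000)

    rng = np.random.default_rng(0)
    sample = rng.choice(myvector, size=10, replace=False)   # values without replacement
    print(sample)

    # The index-only form suggested above works too:
    idx = rng.choice(len(myvector), size=10, replace=False)
    print(myvector[idx])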