[Numpy-discussion] take from structured array is faster than boolean indexing, but reshapes columns to 2D

2010-12-21 Thread Christopher Mutel
Dear all-

Structured arrays are great, but I am having problems filtering them
efficiently. Reading through the mailing list, it seems like boolean
arrays are the recommended approach to filtering arrays for arbitrary
conditions, but my testing shows that a combination of take and where
can be much faster when dealing with structured arrays:

import timeit

setup = """
from numpy import random, where, zeros
r = random.random_integers(1e3, size=1e6)
q = zeros((1e6), dtype=[('foo', 'u4'), ('bar', 'u4'), ('baz', 'u4')])
q['foo'] = r
"""
statement1 = "s = q.take(where(q['foo'] < 500))"
statement2 = "s = q[q['foo'] < 500]"

t = timeit.Timer(statement1, setup)
t.timeit(10)
t = timeit.Timer(statement2, setup)
t.timeit(10)

Using the boolean array is about 4 times slower when dealing with
large arrays. In my case, these operations are supposed to happen on a
web server with a large number of requests, so the efficiency gain is
important.

However, the combination of take and where reshapes the columns of
structured arrays to be 2-dimensional:

>>> q['foo'].shape
(1000000,)
>>> s = q[q['foo'] < 500]
>>> s['foo'].shape
(499102,)
>>> s = q.take(where(q['foo'] < 500))
>>> s['foo'].shape
(1, 499102)

Is there a way to use this seemingly more efficient approach (take &
where) and not have to manually reshape the columns? This seems
ungainly for larger structured arrays. Or should I file this as a bug?
Perhaps there are even more efficient approaches that I haven't
thought of, but are obvious to others?
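
For reference, one workaround that appears to avoid the manual reshape
(where returns a tuple of index arrays, so passing its first element to
take keeps the columns 1-D), though I'm not sure it's the intended idiom:

>>> s = q.take(where(q['foo'] < 500)[0])
>>> s['foo'].shape
(499102,)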

Thanks in advance,

Yours,
-Chris

-- 

Chris Mutel
Ökologisches Systemdesign - Ecological Systems Design
Institut f.Umweltingenieurwissenschaften - Institute for Environmental
Engineering
ETH Zürich - HIF C 42 - Schafmattstr. 6
8093 Zürich

Telefon: +41 44 633 71 45 - Fax: +41 44 633 10 61



Re: [Numpy-discussion] NEP for faster ufuncs

2010-12-21 Thread Charles R Harris
On Tue, Dec 21, 2010 at 5:53 PM, Mark Wiebe wrote:

> [full NEP announcement and ``luf`` example snipped; see the original
> message below]
Wow, that's a really nice design and write-up. Small typo (both constants are given the value 0):

/* Only allow exactly equivalent types */
NPY_NO_CASTING=0,
/* Allow casts between equivalent types of different byte orders  */
NPY_EQUIV_CASTING=0,

 Chuck


Re: [Numpy-discussion] NEP for faster ufuncs

2010-12-21 Thread Mark Wiebe
On Tue, Dec 21, 2010 at 6:06 PM, David wrote:

> This looks pretty cool. I hope to be able to take a look at it during
> the christmas holidays.

Thanks!

> I cannot comment in detail yet, but it seems to address several issues
> I encountered myself while implementing the neighborhood iterator (which
> I will try to update to use the new one).
>
> One question: which CPU/platform did you test it on?

The system I'm testing on is a bit old: a dual-core Athlon 64 X2 4200+
running 64-bit Fedora 13 with gcc 4.4.5.

-Mark


Re: [Numpy-discussion] NEP for faster ufuncs

2010-12-21 Thread Mark Wiebe
That's a good suggestion - added.  Unfortunately, it looks like the github
rst converter doesn't make a table of contents with working links.

Cheers,
Mark

On Tue, Dec 21, 2010 at 6:00 PM, John Salvatier wrote:

> I applaud you on your vision. I only have one small suggestion: I suggest
> you put a table of contents at the beginning of your NEP so people may skip
> to the part that most interests them.


Re: [Numpy-discussion] NEP for faster ufuncs

2010-12-21 Thread David
Hi Mark,

On 12/22/2010 09:53 AM, Mark Wiebe wrote:
> Hello NumPy-ers,
>
> After some performance analysis, I've designed and implemented a new
> iterator designed to speed up ufuncs and allow for easier
> multi-dimensional iteration.  The new code is fairly large, but works
> quite well already.  If some people could read the NEP and give some
> feedback, that would be great!  Here's a link:
>
> https://github.com/m-paradox/numpy/blob/mw_neps/doc/neps/new-iterator-ufunc.rst

This looks pretty cool. I hope to be able to take a look at it during 
the christmas holidays.

I cannot comment in detail yet, but it seems to address several issues
I encountered myself while implementing the neighborhood iterator (which
I will try to update to use the new one).

One question: which CPU/platform did you test it on?

cheers,

David


Re: [Numpy-discussion] NEP for faster ufuncs

2010-12-21 Thread John Salvatier
I applaud you on your vision. I only have one small suggestion: I suggest
you put a table of contents at the beginning of your NEP so people may skip
to the part that most interests them.

On Tue, Dec 21, 2010 at 4:59 PM, John Salvatier wrote:

> That is an amazing christmas present.
>
> On Tue, Dec 21, 2010 at 4:53 PM, Mark Wiebe wrote:
>
>> [full NEP announcement and ``luf`` example snipped; see the original
>> message below]


Re: [Numpy-discussion] Giving numpy the ability to multi-iterate excluding an axis

2010-12-21 Thread Mark Wiebe
On Mon, Dec 20, 2010 at 1:42 PM, John Salvatier wrote:

> A while ago, I asked whether it was possible to multi-iterate over
> several ndarrays but exclude a certain axis
> (http://www.mail-archive.com/numpy-discussion@scipy.org/msg29204.html),
> sort of a combination of PyArray_IterAllButAxis and PyArray_MultiIterNew. My
> goal was to allow creation of relatively complex ufuncs that can allow
> reduction or directionally dependent computation and still use broadcasting
> (for example a moving averaging ufunc that can have changing averaging
> parameters). I didn't get any solutions, which I take to mean that no one
> knew how to do this.
>
> I am thinking about trying to make a numpy patch with this functionality,
> and I have some questions: 1) How difficult would this kind of task be for
> someone with non-expert C knowledge and good numpy knowledge? 2) Does anyone
> have advice on how to do this kind of thing?
>

You may be able to do what you would like with the new iterator I've
written.  In particular, it supports nesting multiple iterators by providing
either pointers or offsets, and allowing you to specify any subset of the
axes to iterate.  Here's how the code to do this in a simple 3D case might
look, for making axis 1 the inner loop:

PyArrayObject *op[2] = {a, b};
npy_intp axes_outer[2] = {0, 2};
npy_intp *op_axes[2];
npy_intp axis_inner = 1;
npy_int32 flags[2] = {NPY_ITER_READONLY, NPY_ITER_READONLY};
NpyIter *outer, *inner;
NpyIter_IterNext_Fn oiternext, iiternext;
npy_intp *ooffsets;
char **idataptrs;

/* Outer iterator covers axes 0 and 2, producing offsets */
op_axes[0] = op_axes[1] = axes_outer;
outer = NpyIter_MultiNew(2, op, NPY_ITER_OFFSETS,
                NPY_KEEPORDER, NPY_NO_CASTING, flags, NULL, 2,
                op_axes, 0);
/* Inner iterator covers axis 1, producing data pointers */
op_axes[0] = op_axes[1] = &axis_inner;
inner = NpyIter_MultiNew(2, op, 0, NPY_KEEPORDER, NPY_NO_CASTING, flags,
                NULL, 1, op_axes, 0);

oiternext = NpyIter_GetIterNext(outer);
iiternext = NpyIter_GetIterNext(inner);

ooffsets = (npy_intp *)NpyIter_GetDataPtrArray(outer);
idataptrs = NpyIter_GetDataPtrArray(inner);

do {
    do {
        char *a_data = idataptrs[0] + ooffsets[0],
             *b_data = idataptrs[1] + ooffsets[1];
        /* Do stuff with the data */
    } while (iiternext());
    NpyIter_Reset(inner);
} while (oiternext());

NpyIter_Deallocate(outer);
NpyIter_Deallocate(inner);

Extending to more dimensions, or making both the inner and outer loops have
multiple dimensions, isn't too crazy.  Is this along the lines of what you
need?

If you check out my code, note that it currently isn't exposed as NumPy API
yet, but you can try a lot of things with the Python exposure.

Cheers,
Mark


Re: [Numpy-discussion] NEP for faster ufuncs

2010-12-21 Thread John Salvatier
That is an amazing christmas present.

On Tue, Dec 21, 2010 at 4:53 PM, Mark Wiebe wrote:

> [full NEP announcement and ``luf`` example snipped; see the original
> message below]


[Numpy-discussion] NEP for faster ufuncs

2010-12-21 Thread Mark Wiebe
Hello NumPy-ers,

After some performance analysis, I've designed and implemented a new
iterator designed to speed up ufuncs and allow for easier multi-dimensional
iteration.  The new code is fairly large, but works quite well already.  If
some people could read the NEP and give some feedback, that would be great!
 Here's a link:

https://github.com/m-paradox/numpy/blob/mw_neps/doc/neps/new-iterator-ufunc.rst

I would also love it if someone could try building the code and play around
with it a bit.  The github branch is here:

https://github.com/m-paradox/numpy/tree/new_iterator

To give a taste of the iterator's functionality, below is an example from
the NEP for how to implement a "Lambda UFunc."  With just a few lines of
code, it's possible to replicate something similar to the numexpr library
(numexpr still gets a bigger speedup, though).  In the example expression I
chose, execution time went from 138ms to 61ms.

Hopefully this is a good Christmas present for NumPy. :)

Cheers,
Mark

Here is the definition of the ``luf`` function.::

    def luf(lamdaexpr, *args, **kwargs):
        """Lambda UFunc

        e.g.
        c = luf(lambda i,j:i+j, a, b, order='K',
                casting='safe', buffersize=8192)

        c = np.empty(...)
        luf(lambda i,j:i+j, a, b, out=c, order='K',
            casting='safe', buffersize=8192)
        """

        nargs = len(args)
        op = args + (kwargs.get('out', None),)
        it = np.newiter(op, ['buffered', 'no_inner_iteration'],
                        [['readonly', 'nbo_aligned']]*nargs +
                            [['writeonly', 'allocate', 'no_broadcast']],
                        order=kwargs.get('order', 'K'),
                        casting=kwargs.get('casting', 'safe'),
                        buffersize=kwargs.get('buffersize', 0))
        while not it.finished:
            it[-1] = lamdaexpr(*it[:-1])
            it.iternext()

        return it.operands[-1]

Then, by using ``luf`` instead of straight Python expressions, we
can gain some performance from better cache behavior.::

In [2]: a = np.random.random((50,50,50,10))
In [3]: b = np.random.random((50,50,1,10))
In [4]: c = np.random.random((50,50,50,1))

In [5]: timeit 3*a+b-(a/c)
1 loops, best of 3: 138 ms per loop

In [6]: timeit luf(lambda a,b,c:3*a+b-(a/c), a, b, c)
10 loops, best of 3: 60.9 ms per loop

In [7]: np.all(3*a+b-(a/c) == luf(lambda a,b,c:3*a+b-(a/c), a, b, c))
Out[7]: True


Re: [Numpy-discussion] sample without replacement

2010-12-21 Thread Robert Kern
On Mon, Dec 20, 2010 at 10:28, Alan G Isaac wrote:
> I want to sample *without* replacement from a vector
> (as with Python's random.sample).  I don't see a direct
> replacement for this, and I don't want to carry two
> PRNG's around.  Is the best way something like this?
>
>        permutation(myvector)[:samplesize]

For one of my personal projects, I copied over the mtrand package and
added a method to RandomState for doing this kind of thing using
reservoir sampling.

http://en.wikipedia.org/wiki/Reservoir_sampling

def subset_reservoir(self, long nselected, long ntotal, object size=None):
    """ Sample a given number of integers from the set [0, ntotal) without
    replacement using a reservoir algorithm.

    Parameters
    ----------
    nselected : int
        The number of integers to sample.
    ntotal : int
        The size of the set to sample from.
    size : int, sequence of ints, or None
        The number of subsets to sample or a shape tuple. An axis of
        length nselected will be appended to the shape.

    Returns
    -------
    out : ndarray
        The sampled subsets. The order of the items is not necessarily
        random. Use a slice from the result of permutation() if you need
        the order of the items to be randomized.
    """
    cdef long total_size, length, i, j, u
    cdef cnp.ndarray[cnp.int_t, ndim=2] out
    if size is None:
        shape = (nselected,)
        total_size = nselected
        length = 1
    elif isinstance(size, int):
        shape = (size, nselected)
        total_size = size * nselected
        length = size
    else:
        shape = size + (nselected,)
        length = 1
        for i from 0 <= i < len(size):
            length *= size[i]
        total_size = length * nselected
    out = np.empty((length, nselected), dtype=int)
    for i from 0 <= i < length:
        for j from 0 <= j < nselected:
            out[i,j] = j
        for j from nselected <= j < ntotal:
            u = rk_interval(j+1, self.internal_state)
            if u < nselected:
                out[i,u] = j
    return out.reshape(shape)
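
Hypothetical usage, assuming the method has been compiled into a copy
of mtrand's RandomState (the class name here is illustrative only):

rs = RandomState()                            # the patched RandomState
idx = rs.subset_reservoir(5, 1000)            # 5 distinct ints from [0, 1000)
batch = rs.subset_reservoir(5, 1000, size=8)  # shape (8, 5), 8 subsets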

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco


Re: [Numpy-discussion] Reversing an array in-place

2010-12-21 Thread Sturla Molden
Chuck wrote:

> The reversed matrix is a view, no copying is done. It is even faster than
> an in-place reversal.

This is why I love NumPy. In C, Fortran or Matlab most programmers would
probably form the reversed array. In NumPy we just change some
metainformation (data pointer and strides) behind the scenes. It cannot be
done more efficiently than that.
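
A quick way to see the view at work (a sketch; the stride values shown
assume 8-byte integers):

>>> import numpy as np
>>> a = np.arange(5)
>>> b = a[::-1]          # reversed view, no copy
>>> b.base is a          # b shares a's buffer
True
>>> a.strides, b.strides # same data, negative stride
((8,), (-8,))
>>> b[0] = 99            # writing through the view...
>>> a[4]                 # ...is visible in the original
99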

Sturla



Re: [Numpy-discussion] sample without replacement

2010-12-21 Thread Sturla Molden

We often need to generate more than one such sample from an array, e.g.
for permutation tests. If we shuffle an array x of size N and use x[:M] as
a random sample "without replacement", we just need to put them back
randomly to get the next sample (cf. Fisher-Yates shuffle). That way we
get O(M) amortized complexity for each sample of size M. Only the first
sample will have complexity O(N).
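
One way to realize this (a minimal sketch of my reading of the scheme;
names are illustrative, and it assumes M <= len(x)):

import numpy as np

def repeated_samples(x, M, nsamples):
    # Partial Fisher-Yates shuffle: each pass re-randomizes only the
    # first M slots against the rest of the array, so x[:M] is a
    # uniform sample without replacement at O(M) cost per pass.
    x = np.asarray(x).copy()
    N = len(x)
    for _ in range(nsamples):
        for i in range(M):
            j = np.random.randint(i, N)  # uniform on [i, N)
            x[i], x[j] = x[j], x[i]
        yield x[:M].copy()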

Sturla


> [quoted messages from Anne Archibald and John Salvatier snipped; both
> appear in full in Anne Archibald's message below]


[Numpy-discussion] bincount and generators

2010-12-21 Thread Alan G Isaac
bincount does not currently allow a generator as an argument.
I'm wondering if it is considered too costly to extend it to allow this.
(Motivation: I'm counting based on an attribute of a large number of objects,
and I don't need a list of the data.)
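
A workaround in the meantime (np.fromiter consumes the generator
without building a Python list, though it still allocates the
intermediate array):

>>> import numpy as np
>>> np.bincount(np.fromiter((k % 4 for k in range(10)), dtype=np.intp))
array([3, 3, 2, 2])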

Thanks,
Alan Isaac



[Numpy-discussion] bincount question

2010-12-21 Thread Alan G Isaac
::

 >>> np.bincount([])
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 ValueError: The first argument cannot be empty.

Why not?
(I.e., why isn't an empty array the right answer?)
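
A by-hand guard in the meantime (a sketch; the printed dtype depends
on the platform)::

 >>> x = np.array([], dtype=int)
 >>> np.bincount(x) if len(x) else np.zeros(0, dtype=int)
 array([], dtype=int64)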

Thanks,
Alan Isaac



Re: [Numpy-discussion] sample without replacement

2010-12-21 Thread Anne Archibald
I know this question came up on the mailing list some time ago
(19/09/2008), and the conclusion was that yes, you can do it more or
less efficiently in pure python; the trick is to use two different
methods. If your sample is more than, say, a quarter the size of the
set you're drawing from, you permute the set and take the first few.
If your sample is smaller, you draw with replacement, then redraw the
duplicates, and repeat until there aren't any more duplicates. Since
you only do this when your sample is much smaller than the population
you don't need to repeat many times.

Here's the code I posted to the previous discussion (not tested this
time around) with comments:

'''
def choose_without_replacement(m, n, repeats=None):
    """Choose n nonnegative integers less than m without replacement

    Returns an array of shape n, or (n, repeats).
    """
    if repeats is None:
        r = 1
    else:
        r = repeats
    if n > m:
        raise ValueError, "Cannot find %d nonnegative integers less than %d" % (n, m)
    if n > m/2:
        res = np.sort(np.random.rand(m, r).argsort(axis=0)[:n, :], axis=0)
    else:
        # draw candidates uniformly from [0, m-1]
        res = np.random.random_integers(0, m-1, size=(n, r))
        while True:
            res = np.sort(res, axis=0)
            w = np.nonzero(np.diff(res, axis=0) == 0)
            nr = len(w[0])
            if nr == 0:
                break
            res[w] = np.random.random_integers(0, m-1, size=nr)

    if repeats is None:
        return res[:, 0]
    else:
        return res

For really large values of repeats it does too much sorting; I didn't
have the energy to make it pull all the ones with repeats to the
beginning so that only they need to be re-sorted the next time
through. Still, the expected number of trips through the while loop
grows only logarithmically with repeats, so it shouldn't be too bad.
'''

Anne

On 20 December 2010 12:13, John Salvatier wrote:
> I think this is not possible to do efficiently with just numpy. If you want
> to do this efficiently, I wrote a no-replacement sampler in Cython some time
> ago (below). I hereby release it to the public domain.
>
> '''
> Created on Oct 24, 2009
> http://stackoverflow.com/questions/311703/algorithm-for-sampling-without-replacement
> @author: johnsalvatier
> '''
>
> from __future__ import division
>
> import numpy
>
> def random_no_replace(sampleSize, populationSize, numSamples):
>
>     samples = numpy.zeros((numSamples, sampleSize), dtype=int)
>
>     # Use Knuth's variable names
>     cdef int n = sampleSize
>     cdef int N = populationSize
>     cdef i = 0
>     cdef int t = 0 # total input records dealt with
>     cdef int m = 0 # number of items selected so far
>     cdef double u
>
>     while i < numSamples:
>         t = 0
>         m = 0
>         while m < n:
>             u = numpy.random.uniform() # call a uniform(0,1) random number generator
>             if (N - t)*u >= n - m:
>                 t += 1
>             else:
>                 samples[i,m] = t
>                 t += 1
>                 m += 1
>         i += 1
>
>     return samples
>
> On Mon, Dec 20, 2010 at 8:28 AM, Alan G Isaac wrote:
>>
>> I want to sample *without* replacement from a vector
>> (as with Python's random.sample).  I don't see a direct
>> replacement for this, and I don't want to carry two
>> PRNG's around.  Is the best way something like this?
>>
>>        permutation(myvector)[:samplesize]
>>
>> Thanks,
>> Alan Isaac
>


Re: [Numpy-discussion] sample without replacement

2010-12-21 Thread Alan G Isaac
On 12/20/2010 10:49 PM, josef.p...@gmail.com wrote:
> What's the difference between a numpy Random and a python
> random.Random instance of separate states of the random number
> generators?


Sorry, I don't understand the question.  The difference
for my use is that an np.random.RandomState instance provides
access to a different set of methods, which unfortunately
does not include an equivalent to random.Random's sample
method but does include others I need.

Would it be appropriate to request that an analog
to random.sample be added to numpy.random?
(It might sample only a range, since producing indexes
would provide the base functionality.)
Or is this functionality absent intentionally?
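
To illustrate the index-only idea (``sample_range`` is a hypothetical
name, not an existing method)::

 idx = prng.sample_range(len(myvector), samplesize)  # distinct indices
 sample = myvector[idx]                              # recover the items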

Alan
