[Numpy-discussion] Efficient way of binning points and applying functions to these groups
Hi! I am looking for an efficient way of binning points and then applying some functions to the points within each bin. I have tried several approaches, including crude looping over the indices and using digitize (see below), but I cannot get it as efficient as I need it to be. I have a comparison with a similar (although more complex) code in IDL, hence this question to the forum. In IDL there is a way to invert a histogram and get a reverse set of indices (via the histogram function), which seems to do the trick (or maybe it is faster for another reason). Below I provide a (dummy) example of what I wish to achieve. Any hint on how to do this EFFICIENTLY using numpy is most welcome. I need to speed things up quite a bit: at the moment, the version I have (see below) is 10 times slower than the more complex IDL routine, so I must be doing something wrong!

thanks!!
Eric

# I have a random set of data points in 2D with coordinates x, y:
import numpy as np
x = np.random.random(1000)
y = np.random.random(1000)

# I now have a 2D grid (11 x 21 grid points, i.e. 10 x 20 bins):
nx = 11
ny = 21
lx = np.linspace(0, 1, nx)
ly = np.linspace(0, 1, ny)
gx, gy = np.meshgrid(lx, ly)

# So my set of 2D bins is (not needed in the solution I present, just for clarity):
bins = np.dstack((gx.ravel(), gy.ravel()))[0]

# Now I want the list of points in each bin, and:
# - if the number of points in that bin is larger than 10, apply (dummy) function func1
# - if it is between 1 and 10, apply (dummy) function func2
# - if 0, do nothing
# Two dummy functions, for example:
def func1(x):
    return x.mean()

def func2(x):
    return x.std()

# One solution would be to use digitize in 1D and histogram in 2D
# (gx, gy are not needed for this one):
h = np.histogram2d(x, y, bins=[lx, ly])[0]
digitX = np.digitize(x, lx)
digitY = np.digitize(y, ly)

# Create the output array, filled with -999 to make sure I see which bins are never filled in
result = np.zeros_like(h) - 999

for i in range(nx-1):
    for j in range(ny-1):
        selectionofpoints = (digitX == i+1) & (digitY == j+1)
        if h[i, j] > 10:
            result[i, j] = func1(x[selectionofpoints])
        elif h[i, j] > 0:
            result[i, j] = func2(x[selectionofpoints])

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
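For reference, IDL's reverse-indices trick mentioned above can be emulated in numpy: flatten the 2D bin index, argsort it once, and every bin's points become a contiguous slice of the sorted order, so the per-bin work no longer scans the full array. This is a sketch, not from the thread; the variable names (`flat`, `starts`, `order`) are my own.

```python
import numpy as np

np.random.seed(0)
x = np.random.random(1000)
y = np.random.random(1000)
nx, ny = 11, 21
lx = np.linspace(0, 1, nx)
ly = np.linspace(0, 1, ny)

# Flattened bin index of every point (digitize returns 1..n, shift to 0-based)
ix = np.digitize(x, lx) - 1
iy = np.digitize(y, ly) - 1
flat = ix * (ny - 1) + iy
nbins = (nx - 1) * (ny - 1)

# Sort points by bin; 'starts' marks where each bin's slice begins,
# playing the role of IDL's REVERSE_INDICES
order = np.argsort(flat, kind='stable')
counts = np.bincount(flat, minlength=nbins)
starts = np.concatenate(([0], np.cumsum(counts)))

result = np.full(nbins, -999.0)
xs = x[order]
for b in range(nbins):
    pts = xs[starts[b]:starts[b + 1]]  # all points of bin b, no full-array mask
    if counts[b] > 10:
        result[b] = pts.mean()
    elif counts[b] > 0:
        result[b] = pts.std()
result = result.reshape(nx - 1, ny - 1)
```

The loop still runs once per bin, but each iteration touches only that bin's points instead of masking the full 1000-element arrays.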
Re: [Numpy-discussion] Efficient way of binning points and applying functions to these groups
This looks like perfect work for cython; it is great at optimizing loops. Another option is Numba, the new automatic compiler.

David.

On 26/12/2012 10:09, Eric Emsellem eric.emsel...@eso.org wrote:

> Hi! I am looking for an efficient way of binning points and then applying some functions to the points within each bin. [...]
Re: [Numpy-discussion] Efficient way of binning points and applying functions to these groups
On Wed, Dec 26, 2012 at 10:09 AM, Eric Emsellem eric.emsel...@eso.org wrote:

> Hi! I am looking for an efficient way of binning points and then applying some functions to the points within each bin. [...]

That's exactly what scipy.stats.binned_statistic does:
http://docs.scipy.org/doc/scipy-dev/reference/generated/scipy.stats.binned_statistic.html

binned_statistic uses np.digitize as well, but I'm not sure that digitize is the bottleneck in your code below - the nested for-loop looks like the more likely suspect.

Ralf
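For the 2D case in this thread, the sibling function scipy.stats.binned_statistic_2d (added alongside binned_statistic in scipy 0.11) fits directly. A hedged sketch of what the call might look like; the bin edges match the original post, and the first tuple element is the per-bin statistic:

```python
import numpy as np
from scipy import stats

np.random.seed(0)
x = np.random.random(1000)
y = np.random.random(1000)
lx = np.linspace(0, 1, 11)
ly = np.linspace(0, 1, 21)

# statistic can be 'mean', 'std', 'count', ... or a user-defined function
means = stats.binned_statistic_2d(x, y, x, statistic='mean', bins=[lx, ly])[0]
counts = stats.binned_statistic_2d(x, y, x, statistic='count', bins=[lx, ly])[0]
# empty bins come back as NaN for 'mean'
```

Note the user-defined-function form only receives the binned values array, so applying func1 to large bins and func2 to small ones still needs a second pass over `counts`.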
[Numpy-discussion] dtype reduction
Hi all,

I'm looking for a way to reduce dtype1 into dtype2 (when it is possible, of course). Is there some easy way to do that, by any chance?

dtype1 = np.dtype([
    ('vertex', [('x', 'f4'), ('y', 'f4'), ('z', 'f4')]),
    ('normal', [('x', 'f4'), ('y', 'f4'), ('z', 'f4')]),
    ('color',  [('r', 'f4'), ('g', 'f4'), ('b', 'f4'), ('a', 'f4')]),
])

dtype2 = np.dtype([('vertex', 'f4', 3), ('normal', 'f4', 3), ('color', 'f4', 4)])

Nicolas
[Numpy-discussion] numpy.testing.asserts and masked array
Dear all,

I found here http://mail.scipy.org/pipermail/numpy-discussion/2009-January/039681.html that one should use numpy.ma.testutils.assert_almost_equal for masked array assertions, but I cannot find the np.ma.testutils module. Am I doing something wrong? My numpy version is 1.6.2.

thanks!
Chao

--
Chao YUE
Laboratoire des Sciences du Climat et de l'Environnement (LSCE-IPSL)
UMR 1572 CEA-CNRS-UVSQ
Batiment 712 - Pe 119
91191 GIF Sur YVETTE Cedex
Tel: (33) 01 69 08 29 02; Fax: 01.69.08.77.16
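For reference, the module is importable as numpy.ma.testutils even though it is not exposed as an attribute of np.ma, which is why tab-completion on np.ma does not show it. A minimal sketch; the claim that masked entries are excluded from the comparison is my understanding of the helper, worth double-checking against your numpy version:

```python
import numpy as np
from numpy.ma.testutils import assert_almost_equal

a = np.ma.array([1.0, 2.0, 3.0], mask=[False, False, True])
b = np.ma.array([1.0, 2.0, 99.0], mask=[False, False, True])

# The 3.0 vs 99.0 mismatch sits under the mask, so it does not
# trigger a failure; only unmasked entries are compared
assert_almost_equal(a, b)
```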
Re: [Numpy-discussion] dtype reduction
On Wed, Dec 26, 2012 at 8:09 PM, Nicolas Rougier nicolas.roug...@inria.fr wrote:

> I'm looking for a way to reduce dtype1 into dtype2 (when it is possible of course). Is there some easy way to do that by any chance? [...]

If you have an array whose dtype is dtype1 and you want to convert it into an array with dtype2, then you just do:

my_dtype2_array = my_dtype1_array.view(dtype2)

If you have dtype1 and you want to programmatically construct dtype2, then that's a little more fiddly and depends on what exactly you're trying to do, but start by poking around with dtype1.names and dtype1.fields, which contain information on how dtype1 is put together in the form of regular python structures.

-n
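A quick sketch of both halves of that answer, using the dtypes from the original post (the view is valid because both layouts are 40 contiguous float32 bytes per record):

```python
import numpy as np

dtype1 = np.dtype([
    ('vertex', [('x', 'f4'), ('y', 'f4'), ('z', 'f4')]),
    ('normal', [('x', 'f4'), ('y', 'f4'), ('z', 'f4')]),
    ('color',  [('r', 'f4'), ('g', 'f4'), ('b', 'f4'), ('a', 'f4')]),
])
dtype2 = np.dtype([('vertex', 'f4', 3), ('normal', 'f4', 3), ('color', 'f4', 4)])

# Reinterpreting existing data is just a view, no copy
a1 = np.zeros(5, dtype=dtype1)
a2 = a1.view(dtype2)

# Building dtype2 programmatically from dtype1's metadata, assuming
# each sub-dtype has homogeneous fields (true here):
fields = []
for name in dtype1.names:
    sub = dtype1[name]                       # nested structured dtype
    base = sub[sub.names[0]]                 # dtype of its first field, e.g. float32
    fields.append((name, base, len(sub.names)))
reduced = np.dtype(fields)
```

The homogeneity assumption matters: if a sub-dtype mixed, say, 'f4' and 'i4' fields, collapsing it to a subarray of the first field's type would silently misdescribe the layout.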
Re: [Numpy-discussion] Efficient way of binning points and applying functions to these groups
Thanks Ralf! This module looks great in fact. I didn't know it existed, and it is only available in scipy 0.11.0 (I had to install from source since no Ubuntu 12.04 binary is available). Too bad that the user-defined function only accepts one single array: if the function needs more input, you have to rely on a trick, basically duplicating the input coordinates and concatenating the input arrays you need. But apart from the fact that the program looks much cleaner now, I just tested it and it is rather SLOW in fact. Since I have to repeat this 2 or 3 times (I actually have 3 functions to apply), it takes about 20 seconds for a full test, while with the changes I made (see below) it now takes 3 seconds or so [I am talking about the real code, not the example I give below]. I managed to speed things up a bit by doing two things: keeping the first loop but replacing the second one with a loop ONLY over bins that contain the right number of points, and, more importantly, not addressing the full array at each loop iteration but first selecting the right points (to reduce the size). So something as shown below. It is still a factor of 2 slower than the IDL routine and I have no clue why; I will analyse it further. The IDL routine has similar loops etc., so there is no reason for this. If anybody has an idea, THANKS (using cython is a bit too much on my side, being a low-profile python developer; as for numba, I will have a look). And thanks again for your help!
Eric

===
import numpy as np

x = np.random.random(1000)
y = np.random.random(1000)
data = np.random.random(1000)

# A 2D grid (11 x 21 grid points, i.e. 10 x 20 bins):
nx = 11
ny = 21
lx = np.linspace(0, 1, nx)
ly = np.linspace(0, 1, ny)
gx, gy = np.meshgrid(lx, ly)

# So my set of 2D bins is (not needed below, just for clarity):
bins = np.dstack((gx.ravel(), gy.ravel()))[0]

# Two dummy functions, for example:
def func1(x):
    return x.mean()

def func2(x):
    return x.std()

h = np.histogram2d(x, y, bins=[lx, ly])[0]
digitX = np.digitize(x, lx)
digitY = np.digitize(y, ly)

# Create the output array, filled with -999 to see which bins are never filled in
result = np.zeros_like(h) - 999

for i in range(nx-1):
    # Select the points of this x-slice once, to keep the arrays small
    selX = (digitX == i+1)
    dataX = data[selX]
    digitYX = digitY[selX]
    # Loop only over the bins with the right number of points
    selH10 = np.where(h[i] > 10)[0]
    selH0 = np.where((h[i] > 0) & (h[i] <= 10))[0]
    for j in selH10:
        result[i, j] = func1(dataX[digitYX == j+1])
    for j in selH0:
        result[i, j] = func2(dataX[digitYX == j+1])

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
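For statistics that decompose into per-bin sums, like the mean and std used here, the Python loops can be dropped entirely with np.bincount on the flattened bin index. This is a sketch of that alternative, not something proposed in the thread; note the one-pass std formula it relies on matches np.std (ddof=0) only up to floating-point round-off.

```python
import numpy as np

np.random.seed(1)
x = np.random.random(1000)
y = np.random.random(1000)
data = np.random.random(1000)
nx, ny = 11, 21
lx = np.linspace(0, 1, nx)
ly = np.linspace(0, 1, ny)

# Flattened 0-based bin index of every point
ix = np.digitize(x, lx) - 1
iy = np.digitize(y, ly) - 1
flat = ix * (ny - 1) + iy
nbins = (nx - 1) * (ny - 1)

# Per-bin count, sum, and sum of squares, each in one vectorized pass
counts = np.bincount(flat, minlength=nbins)
sums = np.bincount(flat, weights=data, minlength=nbins)
sqsums = np.bincount(flat, weights=data**2, minlength=nbins)

with np.errstate(invalid='ignore', divide='ignore'):
    means = sums / counts                     # NaN for empty bins
    stds = np.sqrt(np.maximum(sqsums / counts - means**2, 0))

# Assemble the result with the same -999 sentinel for empty bins
result = np.full(nbins, -999.0)
big = counts > 10
small = (counts > 0) & (counts <= 10)
result[big] = means[big]
result[small] = stds[small]
result = result.reshape(nx - 1, ny - 1)
```

This trades generality (func1/func2 must be expressible via bincount-style reductions) for removing the per-bin Python overhead altogether.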