[Numpy-discussion] Efficient way of binning points and applying functions to these groups

2012-12-26 Thread Eric Emsellem
Hi!

I am looking for an efficient way of doing some simple binning of points 
and then applying some functions to points within each bin.

I have tried several ways, including crude looping over the indices, or 
using digitize (see below) but I cannot manage to get it as efficient as 
I need it to be. I have a comparison with a similar (although complex) 
code in idl, and thought I would ask the forum. In idl there is a way to 
invert an histogram and get a reverse set of indices (via the 
histogram function) which seems to do the trick (or maybe it is faster 
for another reason).

Below I provide a (dummy) example of what I wish to achieve. Any hint on 
how to do this EFFICIENTLY using numpy is most welcome. I need to speed 
things up quite a bit (at the moment, the version I have, see below, is 
10 times slower than the more complex idl routine..., I must be doing 
something wrong!).

thanks!!
Eric

# I have a random set of data points in 2D with coordinates x,y :
import numpy as np
x = np.random.random(1000)
y = np.random.random(1000)

# I have now a 2D grid given by let's say 10x10 grid points:
nx = 11
ny = 21
lx = linspace(0,1,nx)
ly = linspace(0,1,ny)
gx, gy = np.meshgrid(lx, ly)

# So my set of 2D bins are (not needed in the solution I present but 
just for clarity)
bins = np.dstack((gx.ravel(), gy.ravel()))[0]

# Now I want to have the list of points in each bin and
# if the number of points in that bin is larger than 10, apply (dummy) 
function func1 (see below)
# If less than 10, apply (dummy) function func2 so (dum?)
# if 0, do nothing
# for two dummy functions like (for example):
def func1(x) : return x.mean()

def func2(x) : return x.std()

# One solution would be to use digitize in 1D and histogram in 2D (don't 
need gx, gy for this one):

h = histogram2d(x, y, bins=[lx, ly])[0]

digitX = np.digitize(x, lx)
digitY = np.digitize(y, ly)

# create the output array, with -999 values to make sure I see which 
ones are not filled in
result = np.zeros_like(h) - 999

for i in range(nx-1) :
 for j in range(ny-1) :
 selectionofpoints = (digitX == i+1)  (digitY == j+1)
 if h[i,j]  10 : result[i,j] = func1(x[selectionofpoints])
 elif h[i,j]  0 : result[i,j] = func2(x[selectionofpoints])
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Efficient way of binning points and applying functions to these groups

2012-12-26 Thread Daπid
This looks like the perfect work for cython. It it's great opp optimizing
loops.
Another option is the new
Numba, an automatic compiler.

David.
El 26/12/2012 10:09, Eric Emsellem eric.emsel...@eso.org escribió:

 Hi!

 I am looking for an efficient way of doing some simple binning of points
 and then applying some functions to points within each bin.

 I have tried several ways, including crude looping over the indices, or
 using digitize (see below) but I cannot manage to get it as efficient as
 I need it to be. I have a comparison with a similar (although complex)
 code in idl, and thought I would ask the forum. In idl there is a way to
 invert an histogram and get a reverse set of indices (via the
 histogram function) which seems to do the trick (or maybe it is faster
 for another reason).

 Below I provide a (dummy) example of what I wish to achieve. Any hint on
 how to do this EFFICIENTLY using numpy is most welcome. I need to speed
 things up quite a bit (at the moment, the version I have, see below, is
 10 times slower than the more complex idl routine..., I must be doing
 something wrong!).

 thanks!!
 Eric
 
 # I have a random set of data points in 2D with coordinates x,y :
 import numpy as np
 x = np.random.random(1000)
 y = np.random.random(1000)

 # I have now a 2D grid given by let's say 10x10 grid points:
 nx = 11
 ny = 21
 lx = linspace(0,1,nx)
 ly = linspace(0,1,ny)
 gx, gy = np.meshgrid(lx, ly)

 # So my set of 2D bins are (not needed in the solution I present but
 just for clarity)
 bins = np.dstack((gx.ravel(), gy.ravel()))[0]

 # Now I want to have the list of points in each bin and
 # if the number of points in that bin is larger than 10, apply (dummy)
 function func1 (see below)
 # If less than 10, apply (dummy) function func2 so (dum?)
 # if 0, do nothing
 # for two dummy functions like (for example):
 def func1(x) : return x.mean()

 def func2(x) : return x.std()

 # One solution would be to use digitize in 1D and histogram in 2D (don't
 need gx, gy for this one):

 h = histogram2d(x, y, bins=[lx, ly])[0]

 digitX = np.digitize(x, lx)
 digitY = np.digitize(y, ly)

 # create the output array, with -999 values to make sure I see which
 ones are not filled in
 result = np.zeros_like(h) - 999

 for i in range(nx-1) :
  for j in range(ny-1) :
  selectionofpoints = (digitX == i+1)  (digitY == j+1)
  if h[i,j]  10 : result[i,j] = func1(x[selectionofpoints])
  elif h[i,j]  0 : result[i,j] = func2(x[selectionofpoints])
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Efficient way of binning points and applying functions to these groups

2012-12-26 Thread Ralf Gommers
On Wed, Dec 26, 2012 at 10:09 AM, Eric Emsellem eric.emsel...@eso.orgwrote:

 Hi!

 I am looking for an efficient way of doing some simple binning of points
 and then applying some functions to points within each bin.


That's exactly what scipy.stats.binned_statistic does:
http://docs.scipy.org/doc/scipy-dev/reference/generated/scipy.stats.binned_statistic.html

binned_statistic uses np.digitize as well, but I'm not sure that in your
code below digitize is the bottleneck - the nested for-loop looks like the
more likely suspect.

Ralf




 I have tried several ways, including crude looping over the indices, or
 using digitize (see below) but I cannot manage to get it as efficient as
 I need it to be. I have a comparison with a similar (although complex)
 code in idl, and thought I would ask the forum. In idl there is a way to
 invert an histogram and get a reverse set of indices (via the
 histogram function) which seems to do the trick (or maybe it is faster
 for another reason).

 Below I provide a (dummy) example of what I wish to achieve. Any hint on
 how to do this EFFICIENTLY using numpy is most welcome. I need to speed
 things up quite a bit (at the moment, the version I have, see below, is
 10 times slower than the more complex idl routine..., I must be doing
 something wrong!).

 thanks!!
 Eric
 
 # I have a random set of data points in 2D with coordinates x,y :
 import numpy as np
 x = np.random.random(1000)
 y = np.random.random(1000)

 # I have now a 2D grid given by let's say 10x10 grid points:
 nx = 11
 ny = 21
 lx = linspace(0,1,nx)
 ly = linspace(0,1,ny)
 gx, gy = np.meshgrid(lx, ly)

 # So my set of 2D bins are (not needed in the solution I present but
 just for clarity)
 bins = np.dstack((gx.ravel(), gy.ravel()))[0]

 # Now I want to have the list of points in each bin and
 # if the number of points in that bin is larger than 10, apply (dummy)
 function func1 (see below)
 # If less than 10, apply (dummy) function func2 so (dum?)
 # if 0, do nothing
 # for two dummy functions like (for example):
 def func1(x) : return x.mean()

 def func2(x) : return x.std()

 # One solution would be to use digitize in 1D and histogram in 2D (don't
 need gx, gy for this one):

 h = histogram2d(x, y, bins=[lx, ly])[0]

 digitX = np.digitize(x, lx)
 digitY = np.digitize(y, ly)

 # create the output array, with -999 values to make sure I see which
 ones are not filled in
 result = np.zeros_like(h) - 999

 for i in range(nx-1) :
  for j in range(ny-1) :
  selectionofpoints = (digitX == i+1)  (digitY == j+1)
  if h[i,j]  10 : result[i,j] = func1(x[selectionofpoints])
  elif h[i,j]  0 : result[i,j] = func2(x[selectionofpoints])
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] dtype reduction

2012-12-26 Thread Nicolas Rougier


Hi all,


I'm looking for a way to reduce dtype1 into dtype2 (when it is possible of 
course).
Is there some easy way to do that by any chance ?


dtype1 = np.dtype( [ ('vertex',  [('x', 'f4'),
  ('y', 'f4'),
  ('z', 'f4')]),
('normal',  [('x', 'f4'),
 ('y', 'f4'),
 ('z', 'f4')]),
('color',   [('r', 'f4'),
 ('g', 'f4'),
 ('b', 'f4'),
 ('a', 'f4')]) ] )

dtype2 = np.dtype( [ ('vertex',  'f4', 3),
 ('normal',  'f4', 3),
 ('color',   'f4', 4)] )


Nicolas
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] numpy.testing.asserts and masked array

2012-12-26 Thread Chao YUE
Dear all,

I found here
http://mail.scipy.org/pipermail/numpy-discussion/2009-January/039681.html
that to use* numpy.ma.testutils.assert_almost_equal* for masked array
assertion, but I cannot find the np.ma.testutils module?
Am I getting somewhere wrong? my numpy version is 1.6.2 thanks!

Chao

-- 
***
Chao YUE
Laboratoire des Sciences du Climat et de l'Environnement (LSCE-IPSL)
UMR 1572 CEA-CNRS-UVSQ
Batiment 712 - Pe 119
91191 GIF Sur YVETTE Cedex
Tel: (33) 01 69 08 29 02; Fax:01.69.08.77.16

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] dtype reduction

2012-12-26 Thread Nathaniel Smith
On Wed, Dec 26, 2012 at 8:09 PM, Nicolas Rougier
nicolas.roug...@inria.fr wrote:


 Hi all,


 I'm looking for a way to reduce dtype1 into dtype2 (when it is possible of 
 course).
 Is there some easy way to do that by any chance ?


 dtype1 = np.dtype( [ ('vertex',  [('x', 'f4'),
   ('y', 'f4'),
   ('z', 'f4')]),
 ('normal',  [('x', 'f4'),
  ('y', 'f4'),
  ('z', 'f4')]),
 ('color',   [('r', 'f4'),
  ('g', 'f4'),
  ('b', 'f4'),
  ('a', 'f4')]) ] )

 dtype2 = np.dtype( [ ('vertex',  'f4', 3),
  ('normal',  'f4', 3),
  ('color',   'f4', 4)] )


If you have an array whose dtype is dtype1, and you want to convert it
into an array with dtype2, then you just do
  my_dtype2_array = my_dtype1_array.view(dtype2)

If you have dtype1 and you want to programmaticaly construct dtype2,
then that's a little more fiddly and depends on what exactly you're
trying to do, but start by poking around with dtype1.names and
dtype1.fields, which contain information on how dtype1 is put together
in the form of regular python structures.

-n
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Efficient way of binning points and, applying functions to these groups

2012-12-26 Thread Eric Emsellem
Thanks Ralf!

this module looks great in fact. I didn't know it existed, and in fact 
It is only available in Scipy 0.11.0 (had to install from source since 
an Ubuntu 12.04 bin is not available). Too bad that the User-defined 
function only accepts one single array. If that function should take 
more input you need to rely on a trick to basically duplicate the input 
coordinates and concatenate the input arrays you need.

But apart from the fact that the programme looks much cleaner now, I 
just tested it and it is rather SLOW in fact.

Since I have to repeat this 2 or 3 times (I have in fact 3 functions to 
apply), it takes about 20 seconds for a full test, while with the 
changes I made (see below) it takes now 3 seconds or so [I am talking 
about the real code, not the example I give below].

So I managed to speed up things a bit by doing two things:
- keeping the first loop but replacing the second one with a loop ONLY 
on bins which contains the right number of points
- and more importantly not addressing the full array at each loop 
iteration but first selecting the right points (to reduce the size).

So something as shown below.

it is still a factor of 2 slower than the idl routine and I have no clue 
why. I will analyse it further. The idl routine has similar loops etc, 
so there is no reason for this.

if anybody has an idea  THANKS (using cython is a bit too much on my 
side - being a low-profile python developer. As for numba, will have a look)

and thanks again for your help!

Eric

===
import numpy as np
x = np.random.random(1000)
y = np.random.random(1000)
data = np.random.random(1000)

# I have now a 2D grid given by let's say 10x10 grid points:
nx = 11
ny = 21
lx = linspace(0,1,nx)
ly = linspace(0,1,ny)
gx, gy = np.meshgrid(lx, ly)

# So my set of 2D bins are (not needed in the solution I present but
just for clarity)
bins = np.dstack((gx.ravel(), gy.ravel()))[0]

# Now I want to have the list of points in each bin and
# if the number of points in that bin is larger than 10, apply (dummy)
function func1 (see below)
# If less than 10, apply (dummy) function func2 so (dum?)
# if 0, do nothing
# for two dummy functions like (for example):
def func1(x) : return x.mean()
def func2(x) : return x.std()

h = histogram2d(x, y, bins=[lx, ly])[0]
digitX = np.digitize(x, lx)
digitY = np.digitize(y, ly)

# create the output array, with -999 values to make sure I see which ones are 
not filled in
result = np.zeros_like(h) - 999

for i in range(nx-1) :
 selX = (digitX == i+1)
 dataX = data[selX]
 selH10 = np.where(h  10)
 selH0 = np.where((h  0)  (h = 10))
 for j in selH10 :
  selectionofpoints = (digitY == j+1)
  result[i,j] = func1(data[selectionofpoints])
 for j in selH0 :
  selectionofpoints = (digitY == j+1)
  result[i,j] = func2(data[selectionofpoints])

 Hi!

 I am looking for an efficient way of doing some simple binning of points
 and then applying some functions to points within each bin.

That's exactly what scipy.stats.binned_statistic does:
http://docs.scipy.org/doc/scipy-dev/reference/generated/scipy.stats.binned_statistic.html

binned_statistic uses np.digitize as well, but I'm not sure that in your
code below digitize is the bottleneck - the nested for-loop looks like the
more likely suspect.

Ralf

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion