[Numpy-discussion] Histogram does not preserve subclasses of ndarray (e.g. masked arrays)

2010-09-02 Thread Joe Kington
Hi all,

I just wanted to check if this would be considered a bug.

numpy.histogram does not appear to preserve subclasses of ndarrays (e.g.
masked arrays).  This leads to considerable problems when working with
masked arrays. (As per this Stack Overflow question:
http://stackoverflow.com/questions/3610040/how-to-create-the-histogram-of-an-array-with-masked-values-in-numpy)

E.g.

import numpy as np
x = np.arange(100)
x = np.ma.masked_where(x > 30, x)

counts, bin_edges = np.histogram(x)

yields:
counts --> array([10, 10, 10, 10, 10, 10, 10, 10, 10, 10])
bin_edges --> array([  0. ,   9.9,  19.8,  29.7,  39.6,  49.5,  59.4,
                      69.3,  79.2,  89.1,  99. ])

I would have expected histogram to ignore the masked portion of the data.
Is this a bug, or expected behavior?  I'll open a bug report, if it's not
expected behavior...

This would appear to be easily fixed by using asanyarray rather than asarray
within histogram.  E.g. this diff for numpy/lib/function_base.py
Index: function_base.py
===
--- function_base.py (revision 8604)
+++ function_base.py (working copy)
@@ -132,9 +132,9 @@
 
 
 
-    a = asarray(a)
+    a = asanyarray(a)
     if weights is not None:
-        weights = asarray(weights)
+        weights = asanyarray(weights)
         if np.any(weights.shape != a.shape):
             raise ValueError(
                     'weights should have the same shape as a.')
@@ -156,7 +156,7 @@
             mx += 0.5
         bins = linspace(mn, mx, bins+1, endpoint=True)
     else:
-        bins = asarray(bins)
+        bins = asanyarray(bins)
         if (np.diff(bins) < 0).any():
             raise AttributeError(
                     'bins must increase monotonically.')
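
For readers unfamiliar with the two conversion functions, the difference the diff relies on can be seen in isolation. This is only a small sketch on a masked array built like the example above (assuming the mask condition x > 30); it says nothing about whether the rest of histogram would handle the subclass correctly.

```python
import numpy as np

x = np.arange(100)
x = np.ma.masked_where(x > 30, x)

plain = np.asarray(x)     # converted to a base-class ndarray; the mask is dropped
kept = np.asanyarray(x)   # ndarray subclasses are passed through unchanged

print(type(plain).__name__)   # ndarray
print(type(kept).__name__)    # MaskedArray
print(plain[95])              # 95: the "masked" value is visible again
```

Because asarray exposes the underlying data, every one of the 100 values is counted, which is why all ten bins above report 10.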

Thanks!
-Joe
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Histogram does not preserve subclasses of ndarray (e.g. masked arrays)

2010-09-02 Thread Bruce Southey

 On 09/02/2010 02:50 PM, Joe Kington wrote:

> I would have expected histogram to ignore the masked portion of the
> data.  Is this a bug, or expected behavior?  I'll open a bug report,
> if it's not expected behavior...

I would not call it a bug, as this is a known 'feature' of functions that
use np.asarray().  You are welcome to file an enhancement request, but there
are some issues that need to be addressed.


Typical questions that come to mind are:
1) Should a user be warned that the input is a masked array?
2) Should histogram count the number of masked values?
3) What is the expected output when normed=True?
4) What type of array should the weights and bins arguments be?
5) What are the dimensions of the weights and bins arguments, since they
only need to have the number of bins?
6) If the input array is masked, should the weights and bins arguments
also be masked arrays when applicable? If so, what happens if the masks
are in different locations between arrays?
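
One possible answer to question 6 can be sketched outside of numpy itself. The helper below is hypothetical (masked_histogram is not a NumPy function) and assumes one particular policy: an observation is dropped if it is masked in either the data or the weights. Other policies are just as defensible, which is the point of the questions above.

```python
import numpy as np

def masked_histogram(a, bins=10, weights=None):
    """Hypothetical helper: drop entries masked in either a or weights."""
    a = np.ma.asarray(a)
    mask = np.ma.getmaskarray(a)          # full boolean mask, even if nothing is masked
    if weights is not None:
        weights = np.ma.asarray(weights)
        if weights.shape != a.shape:
            raise ValueError('weights should have the same shape as a.')
        mask = mask | np.ma.getmaskarray(weights)   # union of the two masks
        weights = weights.data[~mask]
    return np.histogram(a.data[~mask], bins=bins, weights=weights)

# The masks sit in different locations: a hides 7..9, w hides 0..1.
a = np.ma.array(np.arange(10.0), mask=np.arange(10) > 6)
w = np.ma.array(np.ones(10), mask=np.arange(10) < 2)

counts, bin_edges = masked_histogram(a, bins=5, weights=w)
print(counts.sum())                  # 5.0: only the values 2..6 survive both masks
print(bin_edges[0], bin_edges[-1])   # 2.0 6.0
```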


Regards
Bruce



Re: [Numpy-discussion] Histogram does not preserve subclasses of ndarray (e.g. masked arrays)

2010-09-02 Thread josef . pktd
On Thu, Sep 2, 2010 at 3:50 PM, Joe Kington <jking...@wisc.edu> wrote:
> I would have expected histogram to ignore the masked portion of the data.
> Is this a bug, or expected behavior?  I'll open a bug report, if it's not
> expected behavior...

If you want to ignore masked data it's just one extra function call

histogram(m_arr.compressed())

I don't think the fact that this makes an extra copy will be relevant,
because I guess full masked array handling inside histogram will be a
lot more expensive.
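
Applied to the example from the original post, the workaround looks roughly like this (a sketch, assuming the mask condition was x > 30; m_arr is just the masked array from that example):

```python
import numpy as np

x = np.arange(100)
m_arr = np.ma.masked_where(x > 30, x)

# compressed() returns the unmasked values as a plain 1-D ndarray (a copy).
counts, bin_edges = np.histogram(m_arr.compressed())

print(counts.sum())     # 31: only the unmasked values 0..30 are counted
print(bin_edges[-1])    # 30.0: the bin range no longer extends to 99
```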

Using asanyarray would also allow matrices in and other subtypes that
might not be handled correctly by the histogram calculations.
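
As a small illustration of why letting subtypes through can matter, np.matrix keeps its own semantics after asanyarray. This is only a sketch of one such difference (a matrix never becomes 1-D); it does not show what histogram itself would do with a matrix.

```python
import warnings
import numpy as np

with warnings.catch_warnings():
    warnings.simplefilter("ignore")   # np.matrix is pending deprecation in recent NumPy
    m = np.matrix([1, 2, 3, 4])

as_base = np.asarray(m)      # plain ndarray of shape (1, 4)
as_sub = np.asanyarray(m)    # still an np.matrix

print(as_base.ravel().shape)   # (4,)
print(as_sub.ravel().shape)    # (1, 4): ravel() on a matrix stays 2-D
```

Code written with ordinary ndarrays in mind can silently misbehave on shapes like the second one.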

For anything else besides dropping masked observations, it would be
necessary to figure out what the masked array definition of a
histogram is, as Bruce pointed out.

(Another interesting question would be if histogram handles nans
correctly, searchsorted ???)
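
Whatever histogram does with NaNs internally, the same one-extra-call approach sidesteps the question. A minimal sketch, with made-up sample data:

```python
import numpy as np

data = np.array([0.5, 1.5, np.nan, 2.5, np.nan, 3.5])

finite = data[~np.isnan(data)]                  # drop NaNs before binning
counts, bin_edges = np.histogram(finite, bins=3)

print(counts.sum())   # 4: only the non-NaN values are counted
```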

Josef




Re: [Numpy-discussion] Histogram does not preserve subclasses of ndarray (e.g. masked arrays)

2010-09-02 Thread Joe Kington
On Thu, Sep 2, 2010 at 5:31 PM, josef.p...@gmail.com wrote:

> If you want to ignore masked data it's just one extra function call
>
> histogram(m_arr.compressed())
>
> I don't think the fact that this makes an extra copy will be relevant,
> because I guess full masked array handling inside histogram will be a
> lot more expensive.
>
> Using asanyarray would also allow matrices in and other subtypes that
> might not be handled correctly by the histogram calculations.
>
> For anything else besides dropping masked observations, it would be
> necessary to figure out what the masked array definition of a
> histogram is, as Bruce pointed out.
>
> (Another interesting question would be if histogram handles nans
> correctly, searchsorted ???)
>
> Josef


Good points all around.  I'll skip the enhancement request.  Sorry for the
noise!
Thanks!
-Joe



 