Re: [Numpy-discussion] What should be the result in some statistics corner cases?

2013-07-15 Thread Charles R Harris
On Sun, Jul 14, 2013 at 3:35 PM, Charles R Harris charlesr.har...@gmail.com
 wrote:



 On Sun, Jul 14, 2013 at 2:55 PM, Warren Weckesser 
 warren.weckes...@gmail.com wrote:

 On 7/14/13, Charles R Harris charlesr.har...@gmail.com wrote:
  Some corner cases in the mean, var, std.
 
  *Empty arrays*
 
  I think these cases should either raise an error or just return nan.
  Warnings seem ineffective to me as they are only issued once by default.
 
  In [3]: ones(0).mean()
 
 /home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:61:
  RuntimeWarning: invalid value encountered in double_scalars
ret = ret / float(rcount)
  Out[3]: nan
 
  In [4]: ones(0).var()
 
 /home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:76:
  RuntimeWarning: invalid value encountered in true_divide
out=arrmean, casting='unsafe', subok=False)
 
 /home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:100:
  RuntimeWarning: invalid value encountered in double_scalars
ret = ret / float(rcount)
  Out[4]: nan
 
  In [5]: ones(0).std()
 
 /home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:76:
  RuntimeWarning: invalid value encountered in true_divide
out=arrmean, casting='unsafe', subok=False)
 
 /home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:100:
  RuntimeWarning: invalid value encountered in double_scalars
ret = ret / float(rcount)
  Out[5]: nan
 
  *ddof >= number of elements*
 
  I think these should just raise errors. The results for ddof >=
 #elements
  are happenstance, and certainly negative numbers should never be
 returned.
 
  In [6]: ones(2).var(ddof=2)
 
 /home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:100:
  RuntimeWarning: invalid value encountered in double_scalars
ret = ret / float(rcount)
  Out[6]: nan
 
  In [7]: ones(2).var(ddof=3)
  Out[7]: -0.0
  *nansum*
 
  Currently returns nan for empty arrays. I suspect it should return nan
 for
  slices that are all nan, but 0 for empty slices. That would make it
  consistent with sum in the empty case.
 


 For nansum, I would expect 0 even in the case of all nans.  The point
 of these functions is to simply ignore nans, correct?  So I would aim
 for this behaviour:  nanfunc(x) behaves the same as func(x[~isnan(x)])


 Agreed, although that changes current behavior. What about the other
 cases?


Looks like there isn't much interest in the topic, so I'll just go ahead
with the following choices:

Non-NaN case

1) Empty array -> ValueError

The current behavior with stats is an accident, i.e., the nan arises from
0/0. I like to think that in this case the result is any number, rather
than not a number, so *the* value is simply not defined. So in this case
raise a ValueError for empty array.

2) ddof >= n -> ValueError

If the number of elements, n, is not zero and ddof >= n, raise a ValueError
for the ddof value.

Nan case

1) Empty array -> ValueError
2) Empty slice -> NaN
3) For slice ddof >= n -> NaN
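
For concreteness, a rough sketch of what those checks could look like in a
thin wrapper (the name checked_var is made up here for illustration; this is
not the planned implementation):

import numpy as np

def checked_var(a, axis=None, ddof=0):
    # Sketch of the proposed non-NaN rules, not NumPy's actual code.
    a = np.asanyarray(a)
    n = a.size if axis is None else a.shape[axis]
    if n == 0:
        raise ValueError("var of an empty array is not defined")
    if ddof >= n:
        raise ValueError("ddof must be smaller than the number of elements")
    return a.var(axis=axis, ddof=ddof)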

 Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] What should be the result in some statistics corner cases?

2013-07-15 Thread Benjamin Root
This is going to need to be heavily documented with doctests. Also, just to
clarify, are we talking about a ValueError for doing a nansum on an empty
array as well, or will that now return a zero?

Ben Root


On Mon, Jul 15, 2013 at 9:52 AM, Charles R Harris charlesr.har...@gmail.com
 wrote:



 On Sun, Jul 14, 2013 at 3:35 PM, Charles R Harris 
 charlesr.har...@gmail.com wrote:



 On Sun, Jul 14, 2013 at 2:55 PM, Warren Weckesser 
 warren.weckes...@gmail.com wrote:

 On 7/14/13, Charles R Harris charlesr.har...@gmail.com wrote:
  Some corner cases in the mean, var, std.
 
  *Empty arrays*
 
  I think these cases should either raise an error or just return nan.
  Warnings seem ineffective to me as they are only issued once by
 default.
 
  In [3]: ones(0).mean()
 
 /home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:61:
  RuntimeWarning: invalid value encountered in double_scalars
ret = ret / float(rcount)
  Out[3]: nan
 
  In [4]: ones(0).var()
 
 /home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:76:
  RuntimeWarning: invalid value encountered in true_divide
out=arrmean, casting='unsafe', subok=False)
 
 /home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:100:
  RuntimeWarning: invalid value encountered in double_scalars
ret = ret / float(rcount)
  Out[4]: nan
 
  In [5]: ones(0).std()
 
 /home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:76:
  RuntimeWarning: invalid value encountered in true_divide
out=arrmean, casting='unsafe', subok=False)
 
 /home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:100:
  RuntimeWarning: invalid value encountered in double_scalars
ret = ret / float(rcount)
  Out[5]: nan
 
  *ddof >= number of elements*
 
  I think these should just raise errors. The results for ddof >=
 #elements
  are happenstance, and certainly negative numbers should never be
 returned.
 
  In [6]: ones(2).var(ddof=2)
 
 /home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:100:
  RuntimeWarning: invalid value encountered in double_scalars
ret = ret / float(rcount)
  Out[6]: nan
 
  In [7]: ones(2).var(ddof=3)
  Out[7]: -0.0
  *nansum*
 
  Currently returns nan for empty arrays. I suspect it should return nan
 for
  slices that are all nan, but 0 for empty slices. That would make it
  consistent with sum in the empty case.
 


 For nansum, I would expect 0 even in the case of all nans.  The point
 of these functions is to simply ignore nans, correct?  So I would aim
 for this behaviour:  nanfunc(x) behaves the same as func(x[~isnan(x)])


 Agreed, although that changes current behavior. What about the other
 cases?


 Looks like there isn't much interest in the topic, so I'll just go ahead
 with the following choices:

 Non-NaN case

 1) Empty array -> ValueError

 The current behavior with stats is an accident, i.e., the nan arises from
 0/0. I like to think that in this case the result is any number, rather
 than not a number, so *the* value is simply not defined. So in this case
 raise a ValueError for empty array.

 2) ddof >= n -> ValueError

 If the number of elements, n, is not zero and ddof >= n, raise a
 ValueError for the ddof value.

 Nan case

 1) Empty array -> ValueError
 2) Empty slice -> NaN
 3) For slice ddof >= n -> NaN

  Chuck


 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] What should be the result in some statistics corner cases?

2013-07-15 Thread Sebastian Berg
On Mon, 2013-07-15 at 07:52 -0600, Charles R Harris wrote:
 
 
 On Sun, Jul 14, 2013 at 3:35 PM, Charles R Harris
 charlesr.har...@gmail.com wrote:
 

snip

 
 For nansum, I would expect 0 even in the case of all
 nans.  The point
 of these functions is to simply ignore nans, correct?
  So I would aim
 for this behaviour:  nanfunc(x) behaves the same as
 func(x[~isnan(x)])
 
 
 Agreed, although that changes current behavior. What about the
 other cases? 
 
 
 
 Looks like there isn't much interest in the topic, so I'll just go
 ahead with the following choices:
 
 Non-NaN case
 
 1) Empty array -> ValueError
 
 The current behavior with stats is an accident, i.e., the nan arises
 from 0/0. I like to think that in this case the result is any number,
 rather than not a number, so *the* value is simply not defined. So in
 this case raise a ValueError for empty array.
 
To be honest, I don't mind the current behaviour much: sum([]) = 0,
len([]) = 0, so it is in a way well defined. At least I am not sure that I
would always prefer an error. I am a bit worried that just changing it
might break code out there, such as plotting code where it makes
perfect sense to plot a NaN (i.e. nothing), but if that is the case it
would probably become visible fast.

 2) ddof >= n -> ValueError
 
 If the number of elements, n, is not zero and ddof >= n, raise a
 ValueError for the ddof value.
 
Makes sense to me, especially for ddof > n. Just returning nan in all
cases for backward compatibility would be fine with me too.

 Nan case
 
 1) Empty array -> ValueError
 2) Empty slice -> NaN
 3) For slice ddof >= n -> NaN
 
Personally I would somewhat prefer if 1) and 2) would at least default
to the same thing. But I don't use the nanfuncs anyway. I was wondering
about adding an option for the user to pick what the fill is (e.g.
if it is None (maybe default) -> ValueError). We could also allow this
for normal reductions without an identity, but I am not sure if it is
useful there.

- Sebastian

  Chuck
 
 
 
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] What should be the result in some statistics corner cases?

2013-07-15 Thread Charles R Harris
On Mon, Jul 15, 2013 at 8:25 AM, Benjamin Root ben.r...@ou.edu wrote:

 This is going to need to be heavily documented with doctests. Also, just
 to clarify, are we talking about a ValueError for doing a nansum on an
 empty array as well, or will that now return a zero?


I was going to leave nansum as is, as it seems that the result was by
choice rather than by accident.

Tests, not doctests. I detest doctests ;) Examples, OTOH...

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] What should be the result in some statistics corner cases?

2013-07-15 Thread Charles R Harris
On Mon, Jul 15, 2013 at 8:34 AM, Sebastian Berg
sebast...@sipsolutions.netwrote:

 On Mon, 2013-07-15 at 07:52 -0600, Charles R Harris wrote:
 
 
  On Sun, Jul 14, 2013 at 3:35 PM, Charles R Harris
  charlesr.har...@gmail.com wrote:
 

 snip

 
  For nansum, I would expect 0 even in the case of all
  nans.  The point
  of these functions is to simply ignore nans, correct?
   So I would aim
  for this behaviour:  nanfunc(x) behaves the same as
  func(x[~isnan(x)])
 
 
  Agreed, although that changes current behavior. What about the
  other cases?
 
 
 
  Looks like there isn't much interest in the topic, so I'll just go
  ahead with the following choices:
 
  Non-NaN case
 
  1) Empty array - ValueError
 
  The current behavior with stats is an accident, i.e., the nan arises
  from 0/0. I like to think that in this case the result is any number,
  rather than not a number, so *the* value is simply not defined. So in
  this case raise a ValueError for empty array.
 
 To be honest, I don't mind the current behaviour much sum([]) = 0,
 len([]) = 0, so it is in a way well defined. At least I am not sure if I
 would prefer always an error. I am a bit worried that just changing it
 might break code out there, such as plotting code where it makes
 perfectly sense to plot a NaN (i.e. nothing), but if that is the case it
 would probably be visible fast.


I'm talking about mean, var, and std as statistics, sum isn't part of that.
If there is agreement that nansum of empty arrays/columns should be zero I
will do that. Note the sums of empty arrays may or may not be empty.

In [1]: ones((0, 3)).sum(axis=0)
Out[1]: array([ 0.,  0.,  0.])

In [2]: ones((3, 0)).sum(axis=0)
Out[2]: array([], dtype=float64)

Which, sort of, makes sense.



  2) ddof = n - ValueError
 
  If the number of elements, n, is not zero and ddof = n, raise a
  ValueError for the ddof value.
 
 Makes sense to me, especially for ddof  n. Just returning nan in all
 cases for backward compatibility would be fine with me too.

  Nan case
 
  1) Empty array - Value Error
  2) Empty slice - NaN
  3) For slice ddof = n - Nan
 
 Personally I would somewhat prefer if 1) and 2) would at least default
 to the same thing. But I don't use the nanfuncs anyway. I was wondering
 about adding the option for the user to pick what the fill is (and i.e.
 if it is None (maybe default) - ValueError). We could also allow this
 for normal reductions without an identity, but I am not sure if it is
 useful there.


Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] What should be the result in some statistics corner cases?

2013-07-15 Thread Charles R Harris
On Mon, Jul 15, 2013 at 8:34 AM, Sebastian Berg
sebast...@sipsolutions.netwrote:

 On Mon, 2013-07-15 at 07:52 -0600, Charles R Harris wrote:
 
 
  On Sun, Jul 14, 2013 at 3:35 PM, Charles R Harris
  charlesr.har...@gmail.com wrote:
 

 snip

 
  For nansum, I would expect 0 even in the case of all
  nans.  The point
  of these functions is to simply ignore nans, correct?
   So I would aim
  for this behaviour:  nanfunc(x) behaves the same as
  func(x[~isnan(x)])
 
 
  Agreed, although that changes current behavior. What about the
  other cases?
 
 
 
  Looks like there isn't much interest in the topic, so I'll just go
  ahead with the following choices:
 
  Non-NaN case
 
  1) Empty array - ValueError
 
  The current behavior with stats is an accident, i.e., the nan arises
  from 0/0. I like to think that in this case the result is any number,
  rather than not a number, so *the* value is simply not defined. So in
  this case raise a ValueError for empty array.
 
 To be honest, I don't mind the current behaviour much sum([]) = 0,
 len([]) = 0, so it is in a way well defined. At least I am not sure if I
 would prefer always an error. I am a bit worried that just changing it
 might break code out there, such as plotting code where it makes
 perfectly sense to plot a NaN (i.e. nothing), but if that is the case it
 would probably be visible fast.

  2) ddof = n - ValueError
 
  If the number of elements, n, is not zero and ddof = n, raise a
  ValueError for the ddof value.
 
 Makes sense to me, especially for ddof  n. Just returning nan in all
 cases for backward compatibility would be fine with me too.


Currently if ddof > n it returns a negative number for variance; the NaN
only comes when ddof == 0 and n == 0, leading to 0/0 (float is NaN, integer
is zero division).



  Nan case
 
  1) Empty array - Value Error
  2) Empty slice - NaN
  3) For slice ddof = n - Nan
 
 Personally I would somewhat prefer if 1) and 2) would at least default
 to the same thing. But I don't use the nanfuncs anyway. I was wondering
 about adding the option for the user to pick what the fill is (and i.e.
 if it is None (maybe default) - ValueError). We could also allow this
 for normal reductions without an identity, but I am not sure if it is
 useful there.


In the NaN case some slices may be empty, others not. My reasoning is that
that is going to be data dependent, not operator error, but if the array is
empty the writer of the code should deal with that.
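
To make the distinction concrete, a small made-up example where one column
still has data after dropping NaNs and the other is left empty:

>>> import numpy as np
>>> a = np.array([[1.0, np.nan],
...               [2.0, np.nan]])
>>> [a[~np.isnan(a[:, j]), j] for j in range(2)]
[array([ 1.,  2.]), array([], dtype=float64)]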

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] What should be the result in some statistics corner cases?

2013-07-15 Thread Charles R Harris
On Mon, Jul 15, 2013 at 8:58 AM, Charles R Harris charlesr.har...@gmail.com
 wrote:



 On Mon, Jul 15, 2013 at 8:34 AM, Sebastian Berg 
 sebast...@sipsolutions.net wrote:

 On Mon, 2013-07-15 at 07:52 -0600, Charles R Harris wrote:
 
 
  On Sun, Jul 14, 2013 at 3:35 PM, Charles R Harris
  charlesr.har...@gmail.com wrote:
 

 snip

 
  For nansum, I would expect 0 even in the case of all
  nans.  The point
  of these functions is to simply ignore nans, correct?
   So I would aim
  for this behaviour:  nanfunc(x) behaves the same as
  func(x[~isnan(x)])
 
 
  Agreed, although that changes current behavior. What about the
  other cases?
 
 
 
  Looks like there isn't much interest in the topic, so I'll just go
  ahead with the following choices:
 
  Non-NaN case
 
  1) Empty array - ValueError
 
  The current behavior with stats is an accident, i.e., the nan arises
  from 0/0. I like to think that in this case the result is any number,
  rather than not a number, so *the* value is simply not defined. So in
  this case raise a ValueError for empty array.
 
 To be honest, I don't mind the current behaviour much sum([]) = 0,
 len([]) = 0, so it is in a way well defined. At least I am not sure if I
 would prefer always an error. I am a bit worried that just changing it
 might break code out there, such as plotting code where it makes
 perfectly sense to plot a NaN (i.e. nothing), but if that is the case it
 would probably be visible fast.

  2) ddof = n - ValueError
 
  If the number of elements, n, is not zero and ddof = n, raise a
  ValueError for the ddof value.
 
 Makes sense to me, especially for ddof  n. Just returning nan in all
 cases for backward compatibility would be fine with me too.


 Currently if ddof  n it returns a negative number for variance, the NaN
 only comes when ddof == 0 and n == 0, leading to 0/0 (float is NaN, integer
 is zero division).



  Nan case
 
  1) Empty array - Value Error
  2) Empty slice - NaN
  3) For slice ddof = n - Nan
 
 Personally I would somewhat prefer if 1) and 2) would at least default
 to the same thing. But I don't use the nanfuncs anyway. I was wondering
 about adding the option for the user to pick what the fill is (and i.e.
 if it is None (maybe default) - ValueError). We could also allow this
 for normal reductions without an identity, but I am not sure if it is
 useful there.


 In the NaN case some slices may be empty, others not. My reasoning is that
 that is going to be data dependent, not operator error, but if the array is
 empty the writer of the code should deal with that.


In the case of nanvar, nanstd, it might make more sense to handle ddof
as

1) if ddof is >= axis size, raise ValueError
2) if ddof is >= number of values after removing NaNs, return NaN

The first would be consistent with the non-nan case, the second accounts
for the variable nature of data containing NaNs.
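
A hedged sketch of what that two-step rule could look like for a single
axis (nanvar_sketch is just an illustrative name, not a proposed API):

import numpy as np

def nanvar_sketch(a, axis=0, ddof=0):
    # Illustration of the two rules above; not NumPy's implementation.
    a = np.asanyarray(a, dtype=float)
    if ddof >= a.shape[axis]:
        # ddof can never be satisfied along this axis: likely a program bug
        raise ValueError("ddof must be smaller than the axis length")

    def one_slice(x):
        x = x[~np.isnan(x)]
        # too few values left after removing NaNs: NaN for this slice
        return np.nan if x.size <= ddof else x.var(ddof=ddof)

    return np.apply_along_axis(one_slice, axis, a)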

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] What should be the result in some statistics corner cases?

2013-07-15 Thread Benjamin Root
On Jul 15, 2013 11:47 AM, Charles R Harris charlesr.har...@gmail.com
wrote:



 On Mon, Jul 15, 2013 at 8:58 AM, Charles R Harris 
 charlesr.har...@gmail.com wrote:



 On Mon, Jul 15, 2013 at 8:34 AM, Sebastian Berg 
 sebast...@sipsolutions.net wrote:

 On Mon, 2013-07-15 at 07:52 -0600, Charles R Harris wrote:
 
 
  On Sun, Jul 14, 2013 at 3:35 PM, Charles R Harris
  charlesr.har...@gmail.com wrote:
 

 snip

 
  For nansum, I would expect 0 even in the case of all
  nans.  The point
  of these functions is to simply ignore nans, correct?
   So I would aim
  for this behaviour:  nanfunc(x) behaves the same as
  func(x[~isnan(x)])
 
 
  Agreed, although that changes current behavior. What about the
  other cases?
 
 
 
  Looks like there isn't much interest in the topic, so I'll just go
  ahead with the following choices:
 
  Non-NaN case
 
  1) Empty array - ValueError
 
  The current behavior with stats is an accident, i.e., the nan arises
  from 0/0. I like to think that in this case the result is any number,
  rather than not a number, so *the* value is simply not defined. So in
  this case raise a ValueError for empty array.
 
 To be honest, I don't mind the current behaviour much sum([]) = 0,
 len([]) = 0, so it is in a way well defined. At least I am not sure if I
 would prefer always an error. I am a bit worried that just changing it
 might break code out there, such as plotting code where it makes
 perfectly sense to plot a NaN (i.e. nothing), but if that is the case it
 would probably be visible fast.

  2) ddof = n - ValueError
 
  If the number of elements, n, is not zero and ddof = n, raise a
  ValueError for the ddof value.
 
 Makes sense to me, especially for ddof  n. Just returning nan in all
 cases for backward compatibility would be fine with me too.


 Currently if ddof  n it returns a negative number for variance, the NaN
 only comes when ddof == 0 and n == 0, leading to 0/0 (float is NaN, integer
 is zero division).



  Nan case
 
  1) Empty array - Value Error
  2) Empty slice - NaN
  3) For slice ddof = n - Nan
 
 Personally I would somewhat prefer if 1) and 2) would at least default
 to the same thing. But I don't use the nanfuncs anyway. I was wondering
 about adding the option for the user to pick what the fill is (and i.e.
 if it is None (maybe default) - ValueError). We could also allow this
 for normal reductions without an identity, but I am not sure if it is
 useful there.


 In the NaN case some slices may be empty, others not. My reasoning is
 that that is going to be data dependent, not operator error, but if the
 array is empty the writer of the code should deal with that.


 In the case of the nanvar, nanstd, it might make more sense to handle ddof
 as

 1) if ddof is = axis size, raise ValueError
 2) if ddof is = number of values after removing NaNs, return NaN

 The first would be consistent with the non-nan case, the second accounts
 for the variable nature of data containing NaNs.

 Chuck



I think this is a good idea in that it follows naturally from the
conventions of what to do with empty arrays / empty slices with nanmean,
etc. Note, however, I am not a very big fan of the idea of having two
different behaviors for what I see as semantically the same thing.

But, my objections are not strong enough to veto it, and I do think this
proposal is well thought-out.

Ben Root
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] What should be the result in some statistics corner cases?

2013-07-15 Thread Sebastian Berg
On Mon, 2013-07-15 at 08:47 -0600, Charles R Harris wrote:
 
 
 On Mon, Jul 15, 2013 at 8:34 AM, Sebastian Berg
 sebast...@sipsolutions.net wrote:
 On Mon, 2013-07-15 at 07:52 -0600, Charles R Harris wrote:
 
 
  On Sun, Jul 14, 2013 at 3:35 PM, Charles R Harris
  charlesr.har...@gmail.com wrote:
 
 
 
 snip
 
 
  For nansum, I would expect 0 even in the
 case of all
  nans.  The point
  of these functions is to simply ignore nans,
 correct?
   So I would aim
  for this behaviour:  nanfunc(x) behaves the
 same as
  func(x[~isnan(x)])
 
 
  Agreed, although that changes current behavior. What
 about the
  other cases?
 
 
 
  Looks like there isn't much interest in the topic, so I'll
 just go
  ahead with the following choices:
 
  Non-NaN case
 
  1) Empty array - ValueError
 
  The current behavior with stats is an accident, i.e., the
 nan arises
  from 0/0. I like to think that in this case the result is
 any number,
  rather than not a number, so *the* value is simply not
 defined. So in
  this case raise a ValueError for empty array.
 
 
 To be honest, I don't mind the current behaviour much sum([])
 = 0,
 len([]) = 0, so it is in a way well defined. At least I am not
 sure if I
 would prefer always an error. I am a bit worried that just
 changing it
 might break code out there, such as plotting code where it
 makes
 perfectly sense to plot a NaN (i.e. nothing), but if that is
 the case it
 would probably be visible fast.
 
 I'm talking about mean, var, and std as statistics, sum isn't part of
 that. If there is agreement that nansum of empty arrays/columns should
 be zero I will do that. Note the sums of empty arrays may or may not
 be empty.
 
 In [1]: ones((0, 3)).sum(axis=0)
 Out[1]: array([ 0.,  0.,  0.])
 
 In [2]: ones((3, 0)).sum(axis=0)
 Out[2]: array([], dtype=float64)
 
 Which, sort of, makes sense.
  
 
I think we can agree that the behaviour for reductions with an identity
should default to returning the identity, including for the nanfuncs,
i.e. sum([]) is 0, product([]) is 1...
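
This is already what the plain reductions do, e.g.:

>>> import numpy as np
>>> np.sum([])     # identity of addition
0.0
>>> np.prod([])    # identity of multiplication
1.0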

Since mean = sum/length is a sensible definition, having 0/0 as a result
doesn't seem too bad to me, to be honest; it might be accidental but it is
not a special case in the code ;). Though I don't mind an error as long
as it doesn't break matplotlib or so.

I agree that for the nanfuncs raising an error would probably be more of a
problem than for a usual ufunc, but I am still a bit hesitant about saying
that it is ok too. I could imagine adding a very general identity
argument (though I would not call it identity, because it is not the
same as `np.add.identity`, just used in a place where that would be used
otherwise):

np.add.reduce([], identity=123) -> [123]
np.add.reduce([1], identity=123) -> [1]
np.nanmean([np.nan], identity=None) -> Error
np.nanmean([np.nan], identity=np.nan) -> np.nan

It doesn't really make sense, but:
np.subtract.reduce([]) -> Error, since np.subtract.identity is None
np.subtract.reduce([], identity=0) -> 0, suppressing the error.

I am not sure if I am convinced myself, but especially for the nanfuncs
it could maybe provide a way to circumvent the problem somewhat.
Including functions such as np.nanargmin, whose result type does not
even support NaN. Plus it gives an argument allowing for warnings about
changing behaviour.
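
For concreteness, roughly what such an identity/fill argument could look
like, expressed as a wrapper since no such keyword exists today (the name
reduce_with_fill and the `fill` argument are hypothetical):

import numpy as np

def reduce_with_fill(op, a, fill=None):
    # 'fill' plays the role of the identity argument discussed above.
    a = np.asanyarray(a)
    if a.size == 0:
        if fill is None:
            raise ValueError("reduction of an empty array with no fill value")
        return fill
    return op.reduce(a)

# reduce_with_fill(np.add, [], fill=123)     -> 123
# reduce_with_fill(np.subtract, [])          -> ValueError
# reduce_with_fill(np.subtract, [], fill=0)  -> 0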

- Sebastian

 
  2) ddof = n - ValueError
 
  If the number of elements, n, is not zero and ddof = n,
 raise a
  ValueError for the ddof value.
 
 
 Makes sense to me, especially for ddof  n. Just returning nan
 in all
 cases for backward compatibility would be fine with me too.
 
  Nan case
 
  1) Empty array - Value Error
  2) Empty slice - NaN
  3) For slice ddof = n - Nan
 
 
 Personally I would somewhat prefer if 1) and 2) would at least
 default
 to the same thing. But I don't use the nanfuncs anyway. I was
 wondering
 about adding the option for the user to pick what the fill is
 (and i.e.
 if it is None (maybe default) - ValueError). We could also
 allow this
 for normal reductions without an identity, but I am not sure
 if it is
 useful there.
 
 
 Chuck 
 
 
 
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 

Re: [Numpy-discussion] What should be the result in some statistics corner cases?

2013-07-15 Thread Charles R Harris
On Mon, Jul 15, 2013 at 9:55 AM, Sebastian Berg
sebast...@sipsolutions.netwrote:

 On Mon, 2013-07-15 at 08:47 -0600, Charles R Harris wrote:
 
 
  On Mon, Jul 15, 2013 at 8:34 AM, Sebastian Berg
  sebast...@sipsolutions.net wrote:
  On Mon, 2013-07-15 at 07:52 -0600, Charles R Harris wrote:
  
  
   On Sun, Jul 14, 2013 at 3:35 PM, Charles R Harris
   charlesr.har...@gmail.com wrote:
  
 
 
  snip
 
  
   For nansum, I would expect 0 even in the
  case of all
   nans.  The point
   of these functions is to simply ignore nans,
  correct?
So I would aim
   for this behaviour:  nanfunc(x) behaves the
  same as
   func(x[~isnan(x)])
  
  
   Agreed, although that changes current behavior. What
  about the
   other cases?
  
  
  
   Looks like there isn't much interest in the topic, so I'll
  just go
   ahead with the following choices:
  
   Non-NaN case
  
   1) Empty array - ValueError
  
   The current behavior with stats is an accident, i.e., the
  nan arises
   from 0/0. I like to think that in this case the result is
  any number,
   rather than not a number, so *the* value is simply not
  defined. So in
   this case raise a ValueError for empty array.
  
 
  To be honest, I don't mind the current behaviour much sum([])
  = 0,
  len([]) = 0, so it is in a way well defined. At least I am not
  sure if I
  would prefer always an error. I am a bit worried that just
  changing it
  might break code out there, such as plotting code where it
  makes
  perfectly sense to plot a NaN (i.e. nothing), but if that is
  the case it
  would probably be visible fast.
 
  I'm talking about mean, var, and std as statistics, sum isn't part of
  that. If there is agreement that nansum of empty arrays/columns should
  be zero I will do that. Note the sums of empty arrays may or may not
  be empty.
 
  In [1]: ones((0, 3)).sum(axis=0)
  Out[1]: array([ 0.,  0.,  0.])
 
  In [2]: ones((3, 0)).sum(axis=0)
  Out[2]: array([], dtype=float64)
 
  Which, sort of, makes sense.
 
 
 I think we can agree that the behaviour for reductions with an identity
 should default to returning the identity, including for the nanfuncs,
 i.e. sum([]) is 0, product([]) is 1...

 Since mean = sum/length is a sensible definition, having 0/0 as a result
 doesn't seem to bad to me to be honest, it might be accidental but it is
 not a special case in the code ;). Though I don't mind an error as long
 as it doesn't break matplotlib or so.

 I agree about the nanfuncs raising an error would probably be more of a
 problem then for a usual ufunc, but still a bit hesitant about saying
 that it is ok too. I could imagine adding a very general identity
 argument (though I would not call it identity, because it is not the
 same as `np.add.identity`, just used in a place where that would be used
 otherwise):

 np.add.reduce([], identity=123) - [123]
 np.add.reduce([1], identity=123) - [1]
 np.nanmean([np.nan], identity=None) - Error
 np.nanmean([np.nan], identity=np.nan) - np.nan

 It doesn't really make sense, but:
 np.subtract.reduce([]) - Error, since np.substract.identity is None
 np.subtract.reduce([], identity=0) - 0, suppressing the error.

 I am not sure if I am convinced myself, but especially for the nanfuncs
 it could maybe provide a way to circumvent the problem somewhat.
 Including functions such as np.nanargmin, whose result type does not
 even support NaN. Plus it gives an argument allowing for warnings about
 changing behaviour.


Let me try to summarize. To begin with, the environment of the nan
functions is rather special.

1) if the array is not of inexact type, they punt to the non-nan
versions.
2) if the array is of inexact type, then out and dtype must be inexact if
specified

The second assumption guarantees that NaN can be used in the return values.

*sum and nansum*

These should be consistent so that empty sums are 0. This should cover the
empty array case, but will change the behaviour of nansum which currently
returns NaN if the array isn't empty but the slice is after NaN removal.

*mean and nanmean*

In the case of empty arrays, i.e., an empty slice, this leads to 0/0. For Python
this is always a zero division error; for Numpy this raises a warning and
returns NaN for floats, 0 for integers.

Currently mean returns NaN and raises a RuntimeWarning when 0/0 occurs. In
the special case where dtype=int, the NaN is cast to integer.
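
For reference, the default behaviour being described, as already seen
earlier in this thread (the exact warning text varies with the NumPy
version):

>>> import numpy as np
>>> np.array(0.0) / np.array(0.0)   # float 0/0: RuntimeWarning, result nan
nan
>>> np.ones(0).mean()               # the empty mean hits the same 0/0 path
nan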

Option1
1) mean raise error on 0/0
2) nanmean no warning, return NaN

Option2

Re: [Numpy-discussion] What should be the result in some statistics corner cases?

2013-07-15 Thread Nathaniel Smith
On Mon, Jul 15, 2013 at 6:29 PM, Charles R Harris
charlesr.har...@gmail.com wrote:
 Let me try to summarize. To begin with, the environment of the nan functions
 is rather special.

 1) if the array is of not of inexact type, they punt to the non-nan
 versions.
 2) if the array is of inexact type, then out and dtype must be inexact if
 specified

 The second assumption guarantees that NaN can be used in the return values.

The requirement on the 'out' dtype only exists because currently the
nan functions like to return nan for things like empty arrays, right?
If not for that, it could be relaxed? (it's a rather weird
requirement, since the whole point of these functions is that they
ignore nans, yet they don't always...)

 sum and nansum

 These should be consistent so that empty sums are 0. This should cover the
 empty array case, but will change the behaviour of nansum which currently
 returns NaN if the array isn't empty but the slice is after NaN removal.

I agree that returning 0 is the right behaviour, but we might need a
FutureWarning period.

 mean and nanmean

 In the case of empty arrays, an empty slice, this leads to 0/0. For Python
 this is always a zero division error, for Numpy this raises a warning and
 and returns NaN for floats, 0 for integers.

 Currently mean returns NaN and raises a RuntimeWarning when 0/0 occurs. In
 the special case where dtype=int, the NaN is cast to integer.

 Option1
 1) mean raise error on 0/0
 2) nanmean no warning, return NaN

 Option2
 1) mean raise warning, return NaN (current behavior)
 2) nanmean no warning, return NaN

 Option3
 1) mean raise warning, return NaN (current behavior)
 2) nanmean raise warning, return NaN

I have mixed feelings about the whole np.seterr apparatus, but since
it exists, shouldn't we use it for consistency? I.e., just do whatever
numpy is set up to do with 0/0? (Which I think means, warn and return
NaN by default, but this can be changed.)
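
A minimal illustration of that, assuming the 0/0 in mean goes through the
usual error-state machinery (which the warnings above suggest it does):

>>> import numpy as np
>>> old = np.seterr(invalid='ignore')   # silence the 0/0 warning
>>> np.ones(0).mean()
nan
>>> old = np.seterr(invalid='raise')    # or turn the 0/0 into an error
>>> np.ones(0).mean()                   # now raises FloatingPointError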

 var, std, nanvar, nanstd

 1) if ddof > axis(axes) size, raise error, probably a program bug.
 2) If ddof=0, then whatever is the case for mean, nanmean

 For nanvar, nanstd it is possible that some slices are good, some bad, so

 option1
 1) if n - ddof <= 0 for a slice, raise warning, return NaN for slice

 option2
 1) if n - ddof <= 0 for a slice, don't warn, return NaN for slice

I don't really have any intuition for these ddof cases. Just raising
an error on negative effective dof is pretty defensible and might be
the safest -- it's easy to turn an error into something sensible
later if people come up with use cases...

-n
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] What should be the result in some statistics corner cases?

2013-07-15 Thread josef . pktd
On Mon, Jul 15, 2013 at 2:55 PM, Nathaniel Smith n...@pobox.com wrote:
 On Mon, Jul 15, 2013 at 6:29 PM, Charles R Harris
 charlesr.har...@gmail.com wrote:
 Let me try to summarize. To begin with, the environment of the nan functions
 is rather special.

 1) if the array is of not of inexact type, they punt to the non-nan
 versions.
 2) if the array is of inexact type, then out and dtype must be inexact if
 specified

 The second assumption guarantees that NaN can be used in the return values.

 The requirement on the 'out' dtype only exists because currently the
 nan function like to return nan for things like empty arrays, right?
 If not for that, it could be relaxed? (it's a rather weird
 requirement, since the whole point of these functions is that they
 ignore nans, yet they don't always...)

 sum and nansum

 These should be consistent so that empty sums are 0. This should cover the
 empty array case, but will change the behaviour of nansum which currently
 returns NaN if the array isn't empty but the slice is after NaN removal.

 I agree that returning 0 is the right behaviour, but we might need a
 FutureWarning period.

 mean and nanmean

 In the case of empty arrays, an empty slice, this leads to 0/0. For Python
 this is always a zero division error, for Numpy this raises a warning and
 and returns NaN for floats, 0 for integers.

 Currently mean returns NaN and raises a RuntimeWarning when 0/0 occurs. In
 the special case where dtype=int, the NaN is cast to integer.

 Option1
 1) mean raise error on 0/0
 2) nanmean no warning, return NaN

 Option2
 1) mean raise warning, return NaN (current behavior)
 2) nanmean no warning, return NaN

 Option3
 1) mean raise warning, return NaN (current behavior)
 2) nanmean raise warning, return NaN

 I have mixed feelings about the whole np.seterr apparatus, but since
 it exists, shouldn't we use it for consistency? I.e., just do whatever
 numpy is set up to do with 0/0? (Which I think means, warn and return
 NaN by default, but this can be changed.)

 var, std, nanvar, nanstd

 1) if ddof  axis(axes) size, raise error, probably a program bug.
 2) If ddof=0, then whatever is the case for mean, nanmean

 For nanvar, nanstd it is possible that some slice are good, some bad, so

 option1
 1) if n - ddof = 0 for a slice, raise warning, return NaN for slice

 option2
 1) if n - ddof = 0 for a slice, don't warn, return NaN for slice

 I don't really have any intuition for these ddof cases. Just raising
 an error on negative effective dof is pretty defensible and might be
 the safest -- it's a easy to turn an error into something sensible
 later if people come up with use cases...

Related: why does reduceat not have empty slices?

>>> np.add.reduceat(np.arange(8),[0,4, 5, 7,7])
array([ 6,  4, 11,  7,  7])


I'm in favor of returning nans instead of raising exceptions, except
if the return type is int and we cannot cast nan to int.

If we get functions into numpy that know how to handle nans, then it
would be useful to get the nans, so we can work with them

Some cases where this might come in handy are when we iterate over
slices of an array that define groups or category levels with possible
empty groups *)

>>> idx = np.repeat(np.array([0, 1, 2, 3]), [4, 3, 0, 2])
>>> x = np.arange(9)
>>> [x[idx==ii].mean() for ii in range(4)]
[1.5, 5.0, nan, 7.5]

instead of

>>> [x[idx==ii].mean() for ii in range(4) if (idx==ii).sum() > 0]
[1.5, 5.0, 7.5]

same for var, I wouldn't have to check that the size is larger than
the ddof (whatever that is in the specific case)

*) groups could be empty because they were defined for a larger
dataset or as a union of different datasets


PS: I used mean() above and not var() because

>>> np.__version__
'1.5.1'
>>> np.mean([])
nan
>>> np.var([])
0.0

Josef


 -n
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] What should be the result in some statistics corner cases?

2013-07-15 Thread josef . pktd
On Mon, Jul 15, 2013 at 4:24 PM,  josef.p...@gmail.com wrote:
 On Mon, Jul 15, 2013 at 2:55 PM, Nathaniel Smith n...@pobox.com wrote:
 On Mon, Jul 15, 2013 at 6:29 PM, Charles R Harris
 charlesr.har...@gmail.com wrote:
 Let me try to summarize. To begin with, the environment of the nan functions
 is rather special.

 1) if the array is of not of inexact type, they punt to the non-nan
 versions.
 2) if the array is of inexact type, then out and dtype must be inexact if
 specified

 The second assumption guarantees that NaN can be used in the return values.

 The requirement on the 'out' dtype only exists because currently the
 nan function like to return nan for things like empty arrays, right?
 If not for that, it could be relaxed? (it's a rather weird
 requirement, since the whole point of these functions is that they
 ignore nans, yet they don't always...)

 sum and nansum

 These should be consistent so that empty sums are 0. This should cover the
 empty array case, but will change the behaviour of nansum which currently
 returns NaN if the array isn't empty but the slice is after NaN removal.

 I agree that returning 0 is the right behaviour, but we might need a
 FutureWarning period.

 mean and nanmean

 In the case of empty arrays, an empty slice, this leads to 0/0. For Python
 this is always a zero division error, for Numpy this raises a warning and
 and returns NaN for floats, 0 for integers.

 Currently mean returns NaN and raises a RuntimeWarning when 0/0 occurs. In
 the special case where dtype=int, the NaN is cast to integer.

 Option1
 1) mean raise error on 0/0
 2) nanmean no warning, return NaN

 Option2
 1) mean raise warning, return NaN (current behavior)
 2) nanmean no warning, return NaN

 Option3
 1) mean raise warning, return NaN (current behavior)
 2) nanmean raise warning, return NaN

 I have mixed feelings about the whole np.seterr apparatus, but since
 it exists, shouldn't we use it for consistency? I.e., just do whatever
 numpy is set up to do with 0/0? (Which I think means, warn and return
 NaN by default, but this can be changed.)

 var, std, nanvar, nanstd

 1) if ddof  axis(axes) size, raise error, probably a program bug.
 2) If ddof=0, then whatever is the case for mean, nanmean

 For nanvar, nanstd it is possible that some slice are good, some bad, so

 option1
 1) if n - ddof = 0 for a slice, raise warning, return NaN for slice

 option2
 1) if n - ddof = 0 for a slice, don't warn, return NaN for slice

 I don't really have any intuition for these ddof cases. Just raising
 an error on negative effective dof is pretty defensible and might be
 the safest -- it's a easy to turn an error into something sensible
 later if people come up with use cases...

 related why does reduceat not have empty slices?

 np.add.reduceat(np.arange(8),[0,4, 5, 7,7])
 array([ 6,  4, 11,  7,  7])


 I'm in favor of returning nans instead of raising exceptions, except
 if the return type is int and we cannot cast nan to int.

 If we get functions into numpy that know how to handle nans, then it
 would be useful to get the nans, so we can work with them

 Some cases where this might come in handy are when we iterate over
 slices of an array that define groups or category levels with possible
 empty groups *)

 idx = np.repeat(np.array([0, 1, 2, 3]), [4, 3, 0, 2])
 x = np.arange(9)
 [x[idx==ii].mean() for ii in range(4)]
 [1.5, 5.0, nan, 7.5]

 instead of
 [x[idx==ii].mean() for ii in range(4) if (idx==ii).sum()0]
 [1.5, 5.0, 7.5]

 same for var, I wouldn't have to check that the size is larger than
 the ddof (whatever that is in the specific case)

 *) groups could be empty because they were defined for a larger
 dataset or as a union of different datasets

background:

I wrote several robust anova versions a few weeks ago, that were
essentially list comprehension as above. However, I didn't allow nans
and didn't check for minimum size.
Allowing for empty groups to return nan would mainly be a convenience,
since I need to check the group size only once.

ddof: tests for proportions have ddof=0, for regular t-test ddof=1,
for tests of correlation ddof=2   IIRC
so we would need to check for the corresponding minimum size so that n - ddof > 0

negative effective dof doesn't exist, that's np.maximum(n - ddof, 0)
which is always non-negative but might result in a zero-division
error. :)

I don't think making anything conditional on ddof > 0 is useful.

Josef



 PS: I used mean() above and not var() because

 np.__version__
 '1.5.1'
 np.mean([])
 nan
 np.var([])
 0.0

 Josef


 -n
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] What should be the result in some statistics corner cases?

2013-07-15 Thread Charles R Harris
On Mon, Jul 15, 2013 at 2:44 PM, josef.p...@gmail.com wrote:

 On Mon, Jul 15, 2013 at 4:24 PM,  josef.p...@gmail.com wrote:
  On Mon, Jul 15, 2013 at 2:55 PM, Nathaniel Smith n...@pobox.com wrote:
  On Mon, Jul 15, 2013 at 6:29 PM, Charles R Harris
  charlesr.har...@gmail.com wrote:
  Let me try to summarize. To begin with, the environment of the nan
 functions
  is rather special.
 
  1) if the array is of not of inexact type, they punt to the non-nan
  versions.
  2) if the array is of inexact type, then out and dtype must be inexact
 if
  specified
 
  The second assumption guarantees that NaN can be used in the return
 values.
 
  The requirement on the 'out' dtype only exists because currently the
  nan function like to return nan for things like empty arrays, right?
  If not for that, it could be relaxed? (it's a rather weird
  requirement, since the whole point of these functions is that they
  ignore nans, yet they don't always...)
 
  sum and nansum
 
  These should be consistent so that empty sums are 0. This should cover
 the
  empty array case, but will change the behaviour of nansum which
 currently
  returns NaN if the array isn't empty but the slice is after NaN
 removal.
 
  I agree that returning 0 is the right behaviour, but we might need a
  FutureWarning period.
 
  mean and nanmean
 
  In the case of empty arrays, an empty slice, this leads to 0/0. For
 Python
  this is always a zero division error, for Numpy this raises a warning
 and
  and returns NaN for floats, 0 for integers.
 
  Currently mean returns NaN and raises a RuntimeWarning when 0/0
 occurs. In
  the special case where dtype=int, the NaN is cast to integer.
 
  Option1
  1) mean raise error on 0/0
  2) nanmean no warning, return NaN
 
  Option2
  1) mean raise warning, return NaN (current behavior)
  2) nanmean no warning, return NaN
 
  Option3
  1) mean raise warning, return NaN (current behavior)
  2) nanmean raise warning, return NaN
 
  I have mixed feelings about the whole np.seterr apparatus, but since
  it exists, shouldn't we use it for consistency? I.e., just do whatever
  numpy is set up to do with 0/0? (Which I think means, warn and return
  NaN by default, but this can be changed.)
 
  var, std, nanvar, nanstd
 
  1) if ddof  axis(axes) size, raise error, probably a program bug.
  2) If ddof=0, then whatever is the case for mean, nanmean
 
  For nanvar, nanstd it is possible that some slice are good, some bad,
 so
 
  option1
  1) if n - ddof = 0 for a slice, raise warning, return NaN for slice
 
  option2
  1) if n - ddof = 0 for a slice, don't warn, return NaN for slice
 
  I don't really have any intuition for these ddof cases. Just raising
  an error on negative effective dof is pretty defensible and might be
  the safest -- it's a easy to turn an error into something sensible
  later if people come up with use cases...
 
  related why does reduceat not have empty slices?
 
  np.add.reduceat(np.arange(8),[0,4, 5, 7,7])
  array([ 6,  4, 11,  7,  7])
 
 
  I'm in favor of returning nans instead of raising exceptions, except
  if the return type is int and we cannot cast nan to int.
 
  If we get functions into numpy that know how to handle nans, then it
  would be useful to get the nans, so we can work with them
 
  Some cases where this might come in handy are when we iterate over
  slices of an array that define groups or category levels with possible
  empty groups *)
 
  idx = np.repeat(np.array([0, 1, 2, 3]), [4, 3, 0, 2])
  x = np.arange(9)
  [x[idx==ii].mean() for ii in range(4)]
  [1.5, 5.0, nan, 7.5]
 
  instead of
  [x[idx==ii].mean() for ii in range(4) if (idx==ii).sum()0]
  [1.5, 5.0, 7.5]
 
  same for var, I wouldn't have to check that the size is larger than
  the ddof (whatever that is in the specific case)
 
  *) groups could be empty because they were defined for a larger
  dataset or as a union of different datasets

 background:

 I wrote several robust anova versions a few weeks ago, that were
 essentially list comprehension as above. However, I didn't allow nans
 and didn't check for minimum size.
 Allowing for empty groups to return nan would mainly be a convenience,
 since I need to check the group size only once.

 ddof: tests for proportions have ddof=0, for regular t-test ddof=1,
 for tests of correlation ddof=2   IIRC
 so we would need to check for the corresponding minimum size that n-ddof0

 negative effective dof doesn't exist, that's np.maximum(n - ddof, 0)
 which is always non-negative but might result in a zero-division
 error. :)

 I don't think making anything conditional on ddof0 is useful.


So how would you want it?

To summarize the problem areas:

1) What is the sum of an empty slice? NaN or 0?
2) What is the mean of an empty slice? NaN, NaN and warn, or error?
3) What if n - ddof < 0 for a slice? NaN, NaN and warn, or error?
4) What if n - ddof = 0 for a slice? NaN, NaN and warn, or error?

I'm tending to NaN and warn for 2 -- 3, because, as Nathaniel notes, the
warning can be 

Re: [Numpy-discussion] What should be the result in some statistics corner cases?

2013-07-15 Thread josef . pktd
On Mon, Jul 15, 2013 at 5:34 PM, Charles R Harris
charlesr.har...@gmail.com wrote:


 On Mon, Jul 15, 2013 at 2:44 PM, josef.p...@gmail.com wrote:

 On Mon, Jul 15, 2013 at 4:24 PM,  josef.p...@gmail.com wrote:
  On Mon, Jul 15, 2013 at 2:55 PM, Nathaniel Smith n...@pobox.com wrote:
  On Mon, Jul 15, 2013 at 6:29 PM, Charles R Harris
  charlesr.har...@gmail.com wrote:
  Let me try to summarize. To begin with, the environment of the nan
  functions
  is rather special.
 
  1) if the array is of not of inexact type, they punt to the non-nan
  versions.
  2) if the array is of inexact type, then out and dtype must be inexact
  if
  specified
 
  The second assumption guarantees that NaN can be used in the return
  values.
 
  The requirement on the 'out' dtype only exists because currently the
  nan function like to return nan for things like empty arrays, right?
  If not for that, it could be relaxed? (it's a rather weird
  requirement, since the whole point of these functions is that they
  ignore nans, yet they don't always...)
 
  sum and nansum
 
  These should be consistent so that empty sums are 0. This should cover
  the
  empty array case, but will change the behaviour of nansum which
  currently
  returns NaN if the array isn't empty but the slice is after NaN
  removal.
 
  I agree that returning 0 is the right behaviour, but we might need a
  FutureWarning period.
 
  mean and nanmean
 
  In the case of empty arrays, an empty slice, this leads to 0/0. For
  Python
  this is always a zero division error, for Numpy this raises a warning
  and
  and returns NaN for floats, 0 for integers.
 
  Currently mean returns NaN and raises a RuntimeWarning when 0/0
  occurs. In
  the special case where dtype=int, the NaN is cast to integer.
 
  Option1
  1) mean raise error on 0/0
  2) nanmean no warning, return NaN
 
  Option2
  1) mean raise warning, return NaN (current behavior)
  2) nanmean no warning, return NaN
 
  Option3
  1) mean raise warning, return NaN (current behavior)
  2) nanmean raise warning, return NaN
 
  I have mixed feelings about the whole np.seterr apparatus, but since
  it exists, shouldn't we use it for consistency? I.e., just do whatever
  numpy is set up to do with 0/0? (Which I think means, warn and return
  NaN by default, but this can be changed.)
 
  var, std, nanvar, nanstd
 
  1) if ddof  axis(axes) size, raise error, probably a program bug.
  2) If ddof=0, then whatever is the case for mean, nanmean
 
  For nanvar, nanstd it is possible that some slice are good, some bad,
  so
 
  option1
  1) if n - ddof = 0 for a slice, raise warning, return NaN for slice
 
  option2
  1) if n - ddof = 0 for a slice, don't warn, return NaN for slice
 
  I don't really have any intuition for these ddof cases. Just raising
  an error on negative effective dof is pretty defensible and might be
  the safest -- it's a easy to turn an error into something sensible
  later if people come up with use cases...
 
  related why does reduceat not have empty slices?
 
  np.add.reduceat(np.arange(8),[0,4, 5, 7,7])
  array([ 6,  4, 11,  7,  7])
 
 
  I'm in favor of returning nans instead of raising exceptions, except
  if the return type is int and we cannot cast nan to int.
 
  If we get functions into numpy that know how to handle nans, then it
  would be useful to get the nans, so we can work with them
 
  Some cases where this might come in handy are when we iterate over
  slices of an array that define groups or category levels with possible
  empty groups *)
 
  idx = np.repeat(np.array([0, 1, 2, 3]), [4, 3, 0, 2])
  x = np.arange(9)
  [x[idx==ii].mean() for ii in range(4)]
  [1.5, 5.0, nan, 7.5]
 
  instead of
  [x[idx==ii].mean() for ii in range(4) if (idx==ii).sum()0]
  [1.5, 5.0, 7.5]
 
  same for var, I wouldn't have to check that the size is larger than
  the ddof (whatever that is in the specific case)
 
  *) groups could be empty because they were defined for a larger
  dataset or as a union of different datasets

 background:

 I wrote several robust anova versions a few weeks ago, that were
 essentially list comprehension as above. However, I didn't allow nans
 and didn't check for minimum size.
 Allowing for empty groups to return nan would mainly be a convenience,
 since I need to check the group size only once.

 ddof: tests for proportions have ddof=0, for regular t-test ddof=1,
 for tests of correlation ddof=2   IIRC
 so we would need to check for the corresponding minimum size that n-ddof0

 negative effective dof doesn't exist, that's np.maximum(n - ddof, 0)
 which is always non-negative but might result in a zero-division
 error. :)

 I don't think making anything conditional on ddof0 is useful.


 So how would you want it?

 To summarize the problem areas:

 1) What is the sum of an empty slice? NaN or 0?
0, as it is now for sum (including 0 for nansum with no valid entries).

 2) What is the mean of an empty slice? NaN, NaN and warn, or error?
 3) What if n - ddof < 0 for a slice? NaN, 

Re: [Numpy-discussion] What should be the result in some statistics corner cases?

2013-07-15 Thread Charles R Harris
On Mon, Jul 15, 2013 at 3:57 PM, josef.p...@gmail.com wrote:

 On Mon, Jul 15, 2013 at 5:34 PM, Charles R Harris
 charlesr.har...@gmail.com wrote:
 
 
  On Mon, Jul 15, 2013 at 2:44 PM, josef.p...@gmail.com wrote:
 
  On Mon, Jul 15, 2013 at 4:24 PM,  josef.p...@gmail.com wrote:
   On Mon, Jul 15, 2013 at 2:55 PM, Nathaniel Smith n...@pobox.com
 wrote:
   On Mon, Jul 15, 2013 at 6:29 PM, Charles R Harris
   charlesr.har...@gmail.com wrote:
   Let me try to summarize. To begin with, the environment of the nan
   functions
   is rather special.
  
   1) if the array is of not of inexact type, they punt to the non-nan
   versions.
   2) if the array is of inexact type, then out and dtype must be
 inexact
   if
   specified
  
   The second assumption guarantees that NaN can be used in the return
   values.
  
   The requirement on the 'out' dtype only exists because currently the
   nan function like to return nan for things like empty arrays, right?
   If not for that, it could be relaxed? (it's a rather weird
   requirement, since the whole point of these functions is that they
   ignore nans, yet they don't always...)
  
   sum and nansum
  
   These should be consistent so that empty sums are 0. This should
 cover
   the
   empty array case, but will change the behaviour of nansum which
   currently
   returns NaN if the array isn't empty but the slice is after NaN
   removal.
  
   I agree that returning 0 is the right behaviour, but we might need a
   FutureWarning period.
  
   mean and nanmean
  
   In the case of empty arrays, an empty slice, this leads to 0/0. For
   Python
   this is always a zero division error, for Numpy this raises a
 warning
   and
   and returns NaN for floats, 0 for integers.
  
   Currently mean returns NaN and raises a RuntimeWarning when 0/0
   occurs. In
   the special case where dtype=int, the NaN is cast to integer.
  
   Option1
   1) mean raise error on 0/0
   2) nanmean no warning, return NaN
  
   Option2
   1) mean raise warning, return NaN (current behavior)
   2) nanmean no warning, return NaN
  
   Option3
   1) mean raise warning, return NaN (current behavior)
   2) nanmean raise warning, return NaN
  
   I have mixed feelings about the whole np.seterr apparatus, but since
   it exists, shouldn't we use it for consistency? I.e., just do
 whatever
   numpy is set up to do with 0/0? (Which I think means, warn and return
   NaN by default, but this can be changed.)
  
   var, std, nanvar, nanstd
  
   1) if ddof  axis(axes) size, raise error, probably a program bug.
   2) If ddof=0, then whatever is the case for mean, nanmean
  
   For nanvar, nanstd it is possible that some slice are good, some
 bad,
   so
  
   option1
   1) if n - ddof = 0 for a slice, raise warning, return NaN for slice
  
   option2
   1) if n - ddof = 0 for a slice, don't warn, return NaN for slice
  
   I don't really have any intuition for these ddof cases. Just raising
   an error on negative effective dof is pretty defensible and might be
   the safest -- it's a easy to turn an error into something sensible
   later if people come up with use cases...
  
   related why does reduceat not have empty slices?
  
   np.add.reduceat(np.arange(8),[0,4, 5, 7,7])
   array([ 6,  4, 11,  7,  7])
  
  
   I'm in favor of returning nans instead of raising exceptions, except
   if the return type is int and we cannot cast nan to int.
  
   If we get functions into numpy that know how to handle nans, then it
   would be useful to get the nans, so we can work with them
  
   Some cases where this might come in handy are when we iterate over
   slices of an array that define groups or category levels with possible
   empty groups *)
  
   idx = np.repeat(np.array([0, 1, 2, 3]), [4, 3, 0, 2])
   x = np.arange(9)
   [x[idx==ii].mean() for ii in range(4)]
   [1.5, 5.0, nan, 7.5]
  
   instead of
    [x[idx==ii].mean() for ii in range(4) if (idx==ii).sum() > 0]
   [1.5, 5.0, 7.5]
  
   same for var, I wouldn't have to check that the size is larger than
   the ddof (whatever that is in the specific case)
  
   *) groups could be empty because they were defined for a larger
   dataset or as a union of different datasets
 
  background:
 
   I wrote several robust anova versions a few weeks ago, that were
   essentially list comprehensions as above. However, I didn't allow nans
  and didn't check for minimum size.
  Allowing for empty groups to return nan would mainly be a convenience,
  since I need to check the group size only once.
 
   ddof: tests for proportions have ddof=0, for regular t-test ddof=1,
   for tests of correlation ddof=2   IIRC
   so we would need to check for the corresponding minimum size so that
   n - ddof > 0
 
  negative effective dof doesn't exist, that's np.maximum(n - ddof, 0)
  which is always non-negative but might result in a zero-division
  error. :)
 
   I don't think making anything conditional on ddof > 0 is useful.
 
 
  So how would you want it?
 
  To summarize the problem areas:
 
  1) What is 

Re: [Numpy-discussion] What should be the result in some statistics corner cases?

2013-07-15 Thread Stéfan van der Walt
On Mon, 15 Jul 2013 08:33:47 -0600, Charles R Harris wrote:
 On Mon, Jul 15, 2013 at 8:25 AM, Benjamin Root ben.r...@ou.edu wrote:
 
  This is going to need to be heavily documented with doctests. Also, just
  to clarify, are we talking about a ValueError for doing a nansum on an
  empty array as well, or will that now return a zero?
 
 
 I was going to leave nansum as is, as it seems that the result was by
 choice rather than by accident.

That makes sense--I like Sebastian's explanation whereby operations that
define an identity yield that upon empty input.
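
A minimal illustration of that rule, using the identity attribute that
ufuncs already carry:

import numpy as np

np.add.identity         # 0, so an empty sum is 0
np.multiply.identity    # 1, so an empty product is 1
np.sum(np.array([]))    # 0.0
np.prod(np.array([]))   # 1.0
np.max(np.array([]))    # ValueError: maximum has no identity, so there is
                        # nothing sensible to return for empty input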

Stéfan
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] What should be the result in some statistics corner cases?

2013-07-15 Thread Charles R Harris
On Mon, Jul 15, 2013 at 6:22 PM, Stéfan van der Walt ste...@sun.ac.za wrote:

 On Mon, 15 Jul 2013 08:33:47 -0600, Charles R Harris wrote:
  On Mon, Jul 15, 2013 at 8:25 AM, Benjamin Root ben.r...@ou.edu wrote:
 
   This is going to need to be heavily documented with doctests. Also,
 just
   to clarify, are we talking about a ValueError for doing a nansum on an
   empty array as well, or will that now return a zero?
  
  
  I was going to leave nansum as is, as it seems that the result was by
  choice rather than by accident.

 That makes sense--I like Sebastian's explanation whereby operations that
 define an identity yields that upon empty input.


So nansum should return zeros rather than the current NaNs?

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] What should be the result in some statistics corner cases?

2013-07-15 Thread Stéfan van der Walt
On Mon, 15 Jul 2013 18:46:33 -0600, Charles R Harris wrote:
 
 So nansum should return zeros rather than the current NaNs?

Yes, my feeling is that nansum([]) should be 0.

Stéfan
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] What should be the result in some statistics corner cases?

2013-07-15 Thread Benjamin Root
To add a bit of context to the question of nansum on empty results, we
currently differ from MATLAB and R in this respect: they return zero no
matter what. Personally, I think it should return zero, but our current
behavior of returning nans has existed for a long time.

Personally, I think we need a deprecation warning and possibly wait to
change this until 2.0, with plenty of warning that this will change.

Ben Root
On Jul 15, 2013 8:46 PM, Charles R Harris charlesr.har...@gmail.com
wrote:



 On Mon, Jul 15, 2013 at 6:22 PM, Stéfan van der Walt ste...@sun.ac.za wrote:

 On Mon, 15 Jul 2013 08:33:47 -0600, Charles R Harris wrote:
  On Mon, Jul 15, 2013 at 8:25 AM, Benjamin Root ben.r...@ou.edu wrote:
 
   This is going to need to be heavily documented with doctests. Also,
 just
   to clarify, are we talking about a ValueError for doing a nansum on an
   empty array as well, or will that now return a zero?
  
  
  I was going to leave nansum as is, as it seems that the result was by
  choice rather than by accident.

 That makes sense--I like Sebastian's explanation whereby operations that
 define an identity yields that upon empty input.


 So nansum should return zeros rather than the current NaNs?

 Chuck

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] What should be the result in some statistics corner cases?

2013-07-15 Thread Charles R Harris
On Mon, Jul 15, 2013 at 6:58 PM, Benjamin Root ben.r...@ou.edu wrote:

 To add a bit of context to the question of nansum on empty results, we
 currently differ from MATLAB and R in this respect, they return zero no
 matter what. Personally, I think it should return zero, but our current
 behavior of returning nans has existed for a long time.

 Personally, I think we need a deprecation warning and possibly wait to
 change this until 2.0, with plenty of warning that this will change.

Waiting for the mythical 2.0 probably won't work ;) We also need to give
folks a way to adjust ahead of time. I think the easiest way to do that is
with an extra keyword, say nanok, with True as the starting default, then
later we can make False the default.
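
Just to sketch what such a transition could look like (nanok is purely a
name floated here, nothing like it exists in numpy, and this toy wrapper is
not a proposed implementation):

import numpy as np

def nansum_transitional(a, nanok=True):
    # nanok=True: today's behaviour, empty or all-NaN input gives nan
    # nanok=False: the proposed behaviour, NaNs ignored and the empty sum is 0
    a = np.asarray(a, dtype=float)
    valid = a[~np.isnan(a)]
    if nanok and valid.size == 0:
        return np.nan
    return valid.sum()

nansum_transitional(np.array([np.nan, np.nan]))               # nan (old default)
nansum_transitional(np.array([np.nan, np.nan]), nanok=False)  # 0.0 (future default)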

snip

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] What should be the result in some statistics corner cases?

2013-07-15 Thread Ralf Gommers
On Tue, Jul 16, 2013 at 3:50 AM, Charles R Harris charlesr.har...@gmail.com
 wrote:



 On Mon, Jul 15, 2013 at 6:58 PM, Benjamin Root ben.r...@ou.edu wrote:

 To add a bit of context to the question of nansum on empty results, we
 currently differ from MATLAB and R in this respect, they return zero no
 matter what. Personally, I think it should return zero, but our current
 behavior of returning nans has existed for a long time.

 Personally, I think we need a deprecation warning and possibly wait to
 change this until 2.0, with plenty of warning that this will change.

 Waiting for the mythical 2.0 probably won't work ;) We also need to give
 folks a way to adjust ahead of time. I think the easiest way to do that is
 with an extra keyword, say nanok, with True as the starting default, then
 later we can make False the default.


No special keywords to work around a behavior change, please; it doesn't work
well and you end up with a keyword you don't really want.

Why not just give a FutureWarning in 1.8 and change to returning zero in
1.9?
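
Something along these lines, presumably (a rough sketch of the deprecation
path, not actual numpy code; the warning text is made up):

import warnings

# inside nansum, when the slice is empty after NaN removal:
warnings.warn("nansum of an empty or all-NaN slice currently returns NaN; "
              "in numpy 1.9 it will return 0.0", FutureWarning, stacklevel=2)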

Ralf
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] What should be the result in some statistics corner cases?

2013-07-14 Thread Warren Weckesser
On 7/14/13, Charles R Harris charlesr.har...@gmail.com wrote:
 Some corner cases in the mean, var, std.

 *Empty arrays*

 I think these cases should either raise an error or just return nan.
 Warnings seem ineffective to me as they are only issued once by default.

 In [3]: ones(0).mean()
 /home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:61:
 RuntimeWarning: invalid value encountered in double_scalars
   ret = ret / float(rcount)
 Out[3]: nan

 In [4]: ones(0).var()
 /home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:76:
 RuntimeWarning: invalid value encountered in true_divide
   out=arrmean, casting='unsafe', subok=False)
 /home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:100:
 RuntimeWarning: invalid value encountered in double_scalars
   ret = ret / float(rcount)
 Out[4]: nan

 In [5]: ones(0).std()
 /home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:76:
 RuntimeWarning: invalid value encountered in true_divide
   out=arrmean, casting='unsafe', subok=False)
 /home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:100:
 RuntimeWarning: invalid value encountered in double_scalars
   ret = ret / float(rcount)
 Out[5]: nan

 *ddof >= number of elements*

 I think these should just raise errors. The results for ddof >= #elements
 are happenstance, and certainly negative numbers should never be returned.

 In [6]: ones(2).var(ddof=2)
 /home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:100:
 RuntimeWarning: invalid value encountered in double_scalars
   ret = ret / float(rcount)
 Out[6]: nan

 In [7]: ones(2).var(ddof=3)
 Out[7]: -0.0
 *nansum*

 Currently returns nan for empty arrays. I suspect it should return nan for
 slices that are all nan, but 0 for empty slices. That would make it
 consistent with sum in the empty case.



For nansum, I would expect 0 even in the case of all nans.  The point
of these functions is to simply ignore nans, correct?  So I would aim
for this behaviour:  nanfunc(x) behaves the same as func(x[~isnan(x)])
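
Spelled out with the mask form, which is well defined today regardless of
how the nan functions end up behaving:

import numpy as np

x = np.array([1.0, np.nan, 2.0, np.nan])
np.sum(x[~np.isnan(x)])    # 3.0
np.mean(x[~np.isnan(x)])   # 1.5

y = np.array([np.nan, np.nan])  # all NaNs
np.sum(y[~np.isnan(y)])    # 0.0 -- an empty sum, hence the expectation above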

Warren


 Chuck

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] What should be the result in some statistics corner cases?

2013-07-14 Thread Charles R Harris
On Sun, Jul 14, 2013 at 2:55 PM, Warren Weckesser 
warren.weckes...@gmail.com wrote:

 On 7/14/13, Charles R Harris charlesr.har...@gmail.com wrote:
  Some corner cases in the mean, var, std.
 
  *Empty arrays*
 
  I think these cases should either raise an error or just return nan.
  Warnings seem ineffective to me as they are only issued once by default.
 
  In [3]: ones(0).mean()
 
 /home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:61:
  RuntimeWarning: invalid value encountered in double_scalars
ret = ret / float(rcount)
  Out[3]: nan
 
  In [4]: ones(0).var()
 
 /home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:76:
  RuntimeWarning: invalid value encountered in true_divide
out=arrmean, casting='unsafe', subok=False)
 
 /home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:100:
  RuntimeWarning: invalid value encountered in double_scalars
ret = ret / float(rcount)
  Out[4]: nan
 
  In [5]: ones(0).std()
 
 /home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:76:
  RuntimeWarning: invalid value encountered in true_divide
out=arrmean, casting='unsafe', subok=False)
 
 /home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:100:
  RuntimeWarning: invalid value encountered in double_scalars
ret = ret / float(rcount)
  Out[5]: nan
 
   *ddof >= number of elements*
  
   I think these should just raise errors. The results for ddof >= #elements
   are happenstance, and certainly negative numbers should never be returned.
 
  In [6]: ones(2).var(ddof=2)
 
 /home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:100:
  RuntimeWarning: invalid value encountered in double_scalars
ret = ret / float(rcount)
  Out[6]: nan
 
  In [7]: ones(2).var(ddof=3)
  Out[7]: -0.0
   *nansum*
 
  Currently returns nan for empty arrays. I suspect it should return nan
 for
  slices that are all nan, but 0 for empty slices. That would make it
  consistent with sum in the empty case.
 


 For nansum, I would expect 0 even in the case of all nans.  The point
 of these functions is to simply ignore nans, correct?  So I would aim
 for this behaviour:  nanfunc(x) behaves the same as func(x[~isnan(x)])


Agreed, although that changes current behavior. What about the other cases?

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion