Re: [Numpy-discussion] read-only or immutable masked array
Hi Pierre,

I'm a bit surprised, though. Here's what I tried:

>>> np.version.version
'1.7.0'
>>> x = np.ma.array([1,2,3], mask=[0,1,0])
>>> x.flags.writeable = False
>>> x[0] = -1
ValueError: assignment destination is read-only

Thanks, it works perfectly =) Sorry, I probably overlooked this simple solution and tried to set x.data and x.mask directly. I noticed that this only protects the data, so the mask also has to be set to read-only or be hardened to avoid accidental (un)masking.

Gregorio
Re: [Numpy-discussion] read-only or immutable masked array
On Jul 15, 2013, at 10:04 , Gregorio Bastardo gregorio.basta...@gmail.com wrote:

> Hi Pierre,
>
> I'm a bit surprised, though. Here's what I tried:
>
> >>> np.version.version
> '1.7.0'
> >>> x = np.ma.array([1,2,3], mask=[0,1,0])
> >>> x.flags.writeable = False
> >>> x[0] = -1
> ValueError: assignment destination is read-only
>
> Thanks, it works perfectly =) Sorry, I probably overlooked this simple solution and tried to set x.data and x.mask directly. I noticed that this only protects the data, so the mask also has to be set to read-only or be hardened to avoid accidental (un)masking.

Well, yes and no. Setting the flags of `x` doesn't set (yet) the flags of the mask, that's true. Still, `.writeable=False` should prevent you from unmasking data, provided you're not trying to modify the mask directly but use basic assignment like `x[…]=…`. However, assigning `np.ma.masked` to array items does modify the mask and only the mask, hence the absence of error if the array is not writeable.

Note as well that hardening the mask only prevents unmasking: you can still grow the mask, which may not be what you want. Use `x.mask.flags.writeable=False` to make the mask really read-only.
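Putting the pieces together, a minimal sketch of the approach (against the numpy 1.7 behaviour discussed here; later versions may differ):

import numpy as np

x = np.ma.array([1, 2, 3], mask=[0, 1, 0])
x.flags.writeable = False        # protects the data
x.mask.flags.writeable = False   # protects the mask as well

try:
    x[0] = -1                    # blocked: data is read-only
except ValueError:
    pass
try:
    x.mask[0] = True             # blocked: mask is read-only
except ValueError:
    pass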
Re: [Numpy-discussion] read-only or immutable masked array
Hi Pierre,

> Note as well that hardening the mask only prevents unmasking: you can still grow the mask, which may not be what you want. Use `x.mask.flags.writeable=False` to make the mask really read-only.

I ran into an unmasking problem with the suggested approach:

>>> np.version.version
'1.7.0'
>>> x = np.ma.masked_array(xrange(4), [0,1,0,1])
>>> x
masked_array(data = [0 -- 2 --], mask = [False True False True], fill_value = 99)
>>> x.flags.writeable = False
>>> x.mask.flags.writeable = False
>>> x.mask[1] = 0 # ok
Traceback (most recent call last):
  ...
ValueError: assignment destination is read-only
>>> x[1] = 0 # ok
Traceback (most recent call last):
  ...
ValueError: assignment destination is read-only
>>> x.mask[1] = 0 # ??
>>> x
masked_array(data = [0 1 2 --], mask = [False False False True], fill_value = 99)

I noticed that the sharedmask attribute changes (from True to False) after x[1] = 0.

Also, some of the ma operations result in mask identity with the new ma, which causes a ValueError when the new ma's mask is modified:

>>> x = np.ma.masked_array(xrange(4), [0,1,0,1])
>>> x.flags.writeable = False
>>> x.mask.flags.writeable = False
>>> x1 = x > 0
>>> x1.mask is x.mask # ok
False
>>> x2 = x != 0
>>> x2.mask is x.mask # ??
True
>>> x2.mask[1] = 0
Traceback (most recent call last):
  ...
ValueError: assignment destination is read-only

which is a bit confusing. And I experienced that *_like operations give mask identity too:

>>> y = np.ones_like(x)
>>> y.mask is x.mask
True

but for that I found a recent discussion ("empty_like for masked arrays") on the mailing list:
http://mail.scipy.org/pipermail/numpy-discussion/2013-June/066836.html

I might be missing something but could you clarify these issues?

Thanks,
Gregorio
Re: [Numpy-discussion] Allow == and != to raise errors
Python itself doesn't raise an exception in such cases:

>>> (3,4) != (2, 3, 4)
True
>>> (3,4) == (2, 3, 4)
False

Should numpy behave differently?

Bruno.

2013/7/12 Frédéric Bastien no...@nouiz.org

> I also don't like that idea, but I'm not able to come to a good reasoning like Benjamin. I don't see an advantage to this change, and the reason isn't good enough to justify breaking the interface, I think. But I don't think we rely on this, so if the change goes in, it probably won't break stuff, or breakage will be easily seen and repaired.
>
> Fred
>
> On Fri, Jul 12, 2013 at 9:13 AM, Benjamin Root ben.r...@ou.edu wrote:
>
>> I can see where you are getting at, but I would have to disagree. First of all, when a comparison between two mis-shaped arrays occurs, you get back a bona fide python boolean, not a numpy array of bools. So if any action taken on the result of such a comparison assumed that the result was some sort of an array, it would fail (yes, this does make it a bit difficult to trace back the source of the problem, but not impossible). Second, no semantics are broken with this. Are the arrays equal or not? If they weren't broadcastable, then returning False for == and True for != makes perfect sense to me. At least, that is my take on it.
>>
>> Cheers!
>> Ben Root
>>
>> On Fri, Jul 12, 2013 at 8:38 AM, Sebastian Berg sebast...@sipsolutions.net wrote:
>>
>>> Hey,
>>>
>>> the array comparisons == and != never raise errors but instead simply return False for invalid comparisons. The main examples are arrays of non-matching dimensions, and object arrays with invalid element-wise comparisons:
>>>
>>> In [1]: np.array([1,2,3]) == np.array([1,2])
>>> Out[1]: False
>>>
>>> In [2]: np.array([1, np.array([2, 3])], dtype=object) == [1, 2]
>>> Out[2]: False
>>>
>>> This seems wrong to me, and I am sure not just to me. I doubt any large project makes use of such comparisons, and I assume that most would prefer the shape mismatch to raise an error, so I would like to change it. But I am a bit unsure, especially about smaller projects. So to keep the transition a bit safer I could imagine implementing a FutureWarning for these cases (that would at least notify new users that what they are doing doesn't seem like the right thing).
>>>
>>> So the question is: Is such a change safe enough, or is there some good reason for the current behavior that I am missing?
>>>
>>> Regards,
>>>
>>> Sebastian
>>>
>>> (There may be other issues with structured types that would continue returning False I think, because neither side knows how to compare)
Re: [Numpy-discussion] empty_like for masked arrays
Hi, On Mon, Jun 10, 2013 at 3:47 PM, Nathaniel Smith n...@pobox.com wrote: Hi all, Is there anyone out there using numpy masked arrays, who has an opinion on how empty_like (and its friends ones_like, zeros_like) should handle the mask? Right now apparently if you call np.ma.empty_like on a masked array, you get a new masked array that shares the original array's mask, so modifying one modifies the other. That's almost certainly wrong. This PR: https://github.com/numpy/numpy/pull/3404 makes it so instead the new array has values that are all set to empty/zero/one, and a mask which is set to match the input array's mask (so whenever something was masked in the original array, the empty/zero/one in that place is also masked). We don't know if this is the desired behaviour for these functions, though. Maybe it's more intuitive for the new array to match the original array in shape and dtype, but to always have an empty mask. Or maybe not. None of us really use np.ma, so if you do and have an opinion then please speak up... I recently joined the mailing list, so the message might not reach the original thread, sorry for that. I use masked arrays extensively, and would vote for the first option, as I use the *_like operations with the assumption that the resulting array has the same mask as the original. I think it's more intuitive than selecting between all masked or all unmasked behaviour. If it's not too late, please consider my use case. Thanks, Gregorio ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
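In the meantime, a version-independent way to get "same shape and mask values, but an independent mask object" is to build the result explicitly — a sketch only, not what the PR does (np.ma.getmaskarray just returns the mask as a full boolean array):

import numpy as np

x = np.ma.masked_array([1, 2, 3, 4], mask=[0, 1, 0, 1])

# On the versions discussed here, *_like may hand back an array whose mask is
# the very same object as x.mask, so modifying one modifies the other.
y = np.ones_like(x)

# Explicit construction with a copied mask sidesteps the sharing question.
z = np.ma.masked_array(np.ones(x.shape, dtype=x.dtype),
                       mask=np.ma.getmaskarray(x).copy())
print(z.mask is x.mask)   # False: z.mask can be changed without touching x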
Re: [Numpy-discussion] What should be the result in some statistics corner cases?
On Sun, Jul 14, 2013 at 3:35 PM, Charles R Harris charlesr.har...@gmail.com wrote:

> On Sun, Jul 14, 2013 at 2:55 PM, Warren Weckesser warren.weckes...@gmail.com wrote:
>
>> On 7/14/13, Charles R Harris charlesr.har...@gmail.com wrote:
>>
>>> Some corner cases in the mean, var, std.
>>>
>>> *Empty arrays*
>>>
>>> I think these cases should either raise an error or just return nan. Warnings seem ineffective to me as they are only issued once by default.
>>>
>>> In [3]: ones(0).mean()
>>> /home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:61: RuntimeWarning: invalid value encountered in double_scalars
>>>   ret = ret / float(rcount)
>>> Out[3]: nan
>>>
>>> In [4]: ones(0).var()
>>> /home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:76: RuntimeWarning: invalid value encountered in true_divide
>>>   out=arrmean, casting='unsafe', subok=False)
>>> /home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:100: RuntimeWarning: invalid value encountered in double_scalars
>>>   ret = ret / float(rcount)
>>> Out[4]: nan
>>>
>>> In [5]: ones(0).std()
>>> /home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:76: RuntimeWarning: invalid value encountered in true_divide
>>>   out=arrmean, casting='unsafe', subok=False)
>>> /home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:100: RuntimeWarning: invalid value encountered in double_scalars
>>>   ret = ret / float(rcount)
>>> Out[5]: nan
>>>
>>> *ddof >= number of elements*
>>>
>>> I think these should just raise errors. The results for ddof >= #elements are happenstance, and certainly negative numbers should never be returned.
>>>
>>> In [6]: ones(2).var(ddof=2)
>>> /home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:100: RuntimeWarning: invalid value encountered in double_scalars
>>>   ret = ret / float(rcount)
>>> Out[6]: nan
>>>
>>> In [7]: ones(2).var(ddof=3)
>>> Out[7]: -0.0
>>>
>>> *nansum*
>>>
>>> Currently returns nan for empty arrays. I suspect it should return nan for slices that are all nan, but 0 for empty slices. That would make it consistent with sum in the empty case.
>>
>> For nansum, I would expect 0 even in the case of all nans. The point of these functions is to simply ignore nans, correct? So I would aim for this behaviour: nanfunc(x) behaves the same as func(x[~isnan(x)])
>
> Agreed, although that changes current behavior. What about the other cases?

Looks like there isn't much interest in the topic, so I'll just go ahead with the following choices:

Non-NaN case

1) Empty array -> ValueError

   The current behavior with stats is an accident, i.e., the nan arises from 0/0. I like to think that in this case the result is any number, rather than not a number, so *the* value is simply not defined. So in this case raise a ValueError for empty array.

2) ddof >= n -> ValueError

   If the number of elements, n, is not zero and ddof >= n, raise a ValueError for the ddof value.

NaN case

1) Empty array -> ValueError
2) Empty slice -> NaN
3) For slice ddof >= n -> NaN

Chuck
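To make the nanfunc rule above concrete, a small sketch (flattened 1-D case only; `nanmean_ref` is just an illustrative name, not the eventual implementation):

import numpy as np

def nanmean_ref(x):
    # Reference behaviour: the nan-function should act like the plain
    # function applied to the array with the NaNs removed.
    x = np.asarray(x, dtype=float).ravel()
    return x[~np.isnan(x)].mean()

print(nanmean_ref([1.0, np.nan, 3.0]))   # 2.0
# An all-NaN or empty input reduces to the mean of an empty array, which is
# exactly the 0/0 corner case under discussion.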
Re: [Numpy-discussion] Allow == and != to raise errors
On Mon, Jul 15, 2013 at 2:09 PM, bruno Piguet bruno.pig...@gmail.com wrote:

> Python itself doesn't raise an exception in such cases :
>
> >>> (3,4) != (2, 3, 4)
> True
> >>> (3,4) == (2, 3, 4)
> False
>
> Should numpy behave differently ?

The numpy equivalent to Python's scalar == is called array_equal, and that does indeed behave the same:

In [5]: np.array_equal([3, 4], [2, 3, 4])
Out[5]: False

But in numpy, the name == is shorthand for the ufunc np.equal, which raises an error:

In [8]: np.equal([3, 4], [2, 3, 4])
ValueError: operands could not be broadcast together with shapes (2) (3)

-n
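Side by side, a quick sketch of the two spellings (the exact wording of the error varies across numpy versions):

import numpy as np

# Whole-array comparison, analogous to Python's tuple ==: one plain bool.
print(np.array_equal([3, 4], [2, 3, 4]))      # False

# Element-wise comparison (the ufunc behind == for ndarrays): shapes must
# broadcast, otherwise a ValueError is raised.
try:
    np.equal([3, 4], [2, 3, 4])
except ValueError as err:
    print("shape mismatch:", err)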
Re: [Numpy-discussion] Allow == and != to raise errors
On Mon, 2013-07-15 at 15:09 +0200, bruno Piguet wrote: Python itself doesn't raise an exception in such cases : (3,4) != (2, 3, 4) True (3,4) == (2, 3, 4) False Should numpy behave differently ? Yes, because Python tests whether the tuple is different, not whether the elements are: (3, 4) == (3, 4) True np.array([3, 4]) == np.array([3, 4]) array([ True, True], dtype=bool) So doing the test like python *changes* the behaviour. - Sebastian Bruno. 2013/7/12 Frédéric Bastien no...@nouiz.org I also don't like that idea, but I'm not able to come to a good reasoning like Benjamin. I don't see advantage to this change and the reason isn't good enough to justify breaking the interface I think. But I don't think we rely on this, so if the change goes in, it probably won't break stuff or they will be easily seen and repared. Fred On Fri, Jul 12, 2013 at 9:13 AM, Benjamin Root ben.r...@ou.edu wrote: I can see where you are getting at, but I would have to disagree. First of all, when a comparison between two mis-shaped arrays occur, you get back a bone fide python boolean, not a numpy array of bools. So if any action was taken on the result of such a comparison assumed that the result was some sort of an array, it would fail (yes, this does make it a bit difficult to trace back the source of the problem, but not impossible). Second, no semantics are broken with this. Are the arrays equal or not? If they weren't broadcastible, then returning False for == and True for != makes perfect sense to me. At least, that is my take on it. Cheers! Ben Root On Fri, Jul 12, 2013 at 8:38 AM, Sebastian Berg sebast...@sipsolutions.net wrote: Hey, the array comparisons == and != never raise errors but instead simply return False for invalid comparisons. The main example are arrays of non-matching dimensions, and object arrays with invalid element-wise comparisons: In [1]: np.array([1,2,3]) == np.array([1,2]) Out[1]: False In [2]: np.array([1, np.array([2, 3])], dtype=object) == [1, 2] Out[2]: False This seems wrong to me, and I am sure not just me. I doubt any large projects makes use of such comparisons and assume that most would prefer the shape mismatch to raise an error, so I would like to change it. But I am a bit unsure especially about smaller projects. So to keep the transition a bit safer could imagine implementing a FutureWarning for these cases (and that would at least notify new users that what they are doing doesn't seem like the right thing). So the question is: Is such a change safe enough, or is there some good reason for the current behavior that I am missing? Regards, Sebastian (There may be other issues with structured types that would continue returning False I think, because neither side knows how to compare) ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list
Re: [Numpy-discussion] What should be the result in some statistics corner cases?
This is going to need to be heavily documented with doctests. Also, just to clarify, are we talking about a ValueError for doing a nansum on an empty array as well, or will that now return a zero? Ben Root On Mon, Jul 15, 2013 at 9:52 AM, Charles R Harris charlesr.har...@gmail.com wrote: On Sun, Jul 14, 2013 at 3:35 PM, Charles R Harris charlesr.har...@gmail.com wrote: On Sun, Jul 14, 2013 at 2:55 PM, Warren Weckesser warren.weckes...@gmail.com wrote: On 7/14/13, Charles R Harris charlesr.har...@gmail.com wrote: Some corner cases in the mean, var, std. *Empty arrays* I think these cases should either raise an error or just return nan. Warnings seem ineffective to me as they are only issued once by default. In [3]: ones(0).mean() /home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:61: RuntimeWarning: invalid value encountered in double_scalars ret = ret / float(rcount) Out[3]: nan In [4]: ones(0).var() /home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:76: RuntimeWarning: invalid value encountered in true_divide out=arrmean, casting='unsafe', subok=False) /home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:100: RuntimeWarning: invalid value encountered in double_scalars ret = ret / float(rcount) Out[4]: nan In [5]: ones(0).std() /home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:76: RuntimeWarning: invalid value encountered in true_divide out=arrmean, casting='unsafe', subok=False) /home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:100: RuntimeWarning: invalid value encountered in double_scalars ret = ret / float(rcount) Out[5]: nan *ddof = number of elements* I think these should just raise errors. The results for ddof = #elements is happenstance, and certainly negative numbers should never be returned. In [6]: ones(2).var(ddof=2) /home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:100: RuntimeWarning: invalid value encountered in double_scalars ret = ret / float(rcount) Out[6]: nan In [7]: ones(2).var(ddof=3) Out[7]: -0.0 * nansum* Currently returns nan for empty arrays. I suspect it should return nan for slices that are all nan, but 0 for empty slices. That would make it consistent with sum in the empty case. For nansum, I would expect 0 even in the case of all nans. The point of these functions is to simply ignore nans, correct? So I would aim for this behaviour: nanfunc(x) behaves the same as func(x[~isnan(x)]) Agreed, although that changes current behavior. What about the other cases? Looks like there isn't much interest in the topic, so I'll just go ahead with the following choices: Non-NaN case 1) Empty array - ValueError The current behavior with stats is an accident, i.e., the nan arises from 0/0. I like to think that in this case the result is any number, rather than not a number, so *the* value is simply not defined. So in this case raise a ValueError for empty array. 2) ddof = n - ValueError If the number of elements, n, is not zero and ddof = n, raise a ValueError for the ddof value. Nan case 1) Empty array - Value Error 2) Empty slice - NaN 3) For slice ddof = n - Nan Chuck ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] What should be the result in some statistics corner cases?
On Mon, 2013-07-15 at 07:52 -0600, Charles R Harris wrote:

> On Sun, Jul 14, 2013 at 3:35 PM, Charles R Harris charlesr.har...@gmail.com wrote:
>
> <snip>
>
>>> For nansum, I would expect 0 even in the case of all nans. The point of these functions is to simply ignore nans, correct? So I would aim for this behaviour: nanfunc(x) behaves the same as func(x[~isnan(x)])
>>
>> Agreed, although that changes current behavior. What about the other cases?
>
> Looks like there isn't much interest in the topic, so I'll just go ahead with the following choices:
>
> Non-NaN case
>
> 1) Empty array -> ValueError
>
>    The current behavior with stats is an accident, i.e., the nan arises from 0/0. I like to think that in this case the result is any number, rather than not a number, so *the* value is simply not defined. So in this case raise a ValueError for empty array.

To be honest, I don't mind the current behaviour much: sum([]) = 0, len([]) = 0, so it is in a way well defined. At least I am not sure if I would prefer always an error. I am a bit worried that just changing it might break code out there, such as plotting code where it makes perfect sense to plot a NaN (i.e. nothing), but if that is the case it would probably be visible fast.

> 2) ddof >= n -> ValueError
>
>    If the number of elements, n, is not zero and ddof >= n, raise a ValueError for the ddof value.

Makes sense to me, especially for ddof > n. Just returning nan in all cases for backward compatibility would be fine with me too.

> NaN case
>
> 1) Empty array -> ValueError
> 2) Empty slice -> NaN
> 3) For slice ddof >= n -> NaN

Personally I would somewhat prefer if 1) and 2) would at least default to the same thing. But I don't use the nanfuncs anyway.

I was wondering about adding the option for the user to pick what the fill is (i.e. if it is None (maybe default) -> ValueError). We could also allow this for normal reductions without an identity, but I am not sure if it is useful there.

- Sebastian
Re: [Numpy-discussion] What should be the result in some statistics corner cases?
On Mon, Jul 15, 2013 at 8:25 AM, Benjamin Root ben.r...@ou.edu wrote: This is going to need to be heavily documented with doctests. Also, just to clarify, are we talking about a ValueError for doing a nansum on an empty array as well, or will that now return a zero? I was going to leave nansum as is, as it seems that the result was by choice rather than by accident. Tests, not doctests. I detest doctests ;) Examples, OTOH... Chuck ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] What should be the result in some statistics corner cases?
On Mon, Jul 15, 2013 at 8:34 AM, Sebastian Berg sebast...@sipsolutions.netwrote: On Mon, 2013-07-15 at 07:52 -0600, Charles R Harris wrote: On Sun, Jul 14, 2013 at 3:35 PM, Charles R Harris charlesr.har...@gmail.com wrote: snip For nansum, I would expect 0 even in the case of all nans. The point of these functions is to simply ignore nans, correct? So I would aim for this behaviour: nanfunc(x) behaves the same as func(x[~isnan(x)]) Agreed, although that changes current behavior. What about the other cases? Looks like there isn't much interest in the topic, so I'll just go ahead with the following choices: Non-NaN case 1) Empty array - ValueError The current behavior with stats is an accident, i.e., the nan arises from 0/0. I like to think that in this case the result is any number, rather than not a number, so *the* value is simply not defined. So in this case raise a ValueError for empty array. To be honest, I don't mind the current behaviour much sum([]) = 0, len([]) = 0, so it is in a way well defined. At least I am not sure if I would prefer always an error. I am a bit worried that just changing it might break code out there, such as plotting code where it makes perfectly sense to plot a NaN (i.e. nothing), but if that is the case it would probably be visible fast. I'm talking about mean, var, and std as statistics, sum isn't part of that. If there is agreement that nansum of empty arrays/columns should be zero I will do that. Note the sums of empty arrays may or may not be empty. In [1]: ones((0, 3)).sum(axis=0) Out[1]: array([ 0., 0., 0.]) In [2]: ones((3, 0)).sum(axis=0) Out[2]: array([], dtype=float64) Which, sort of, makes sense. 2) ddof = n - ValueError If the number of elements, n, is not zero and ddof = n, raise a ValueError for the ddof value. Makes sense to me, especially for ddof n. Just returning nan in all cases for backward compatibility would be fine with me too. Nan case 1) Empty array - Value Error 2) Empty slice - NaN 3) For slice ddof = n - Nan Personally I would somewhat prefer if 1) and 2) would at least default to the same thing. But I don't use the nanfuncs anyway. I was wondering about adding the option for the user to pick what the fill is (and i.e. if it is None (maybe default) - ValueError). We could also allow this for normal reductions without an identity, but I am not sure if it is useful there. Chuck ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Allow == and != to raise errors
Just a question, should == behave like a ufunc or like python == for tuple? I think that all ndarray comparision (==, !=, =, ...) should behave the same. If they don't (like it was said), making them consistent is good. What is the minimal change to have them behave the same? From my understanding, it is your proposal to change == and != to behave like real ufunc. But I'm not sure if the minimal change is the best, for new user, what they will expect more? The ufunc of the python behavior? Anyway, I see the advantage to simplify the interface to something more consistent. Anyway, if we make all comparison behave like ufunc, there is array_equal as said to have the python behavior of ==, is it useful to have equivalent function the other comparison? Do they already exist. thanks Fred On Mon, Jul 15, 2013 at 10:20 AM, Nathaniel Smith n...@pobox.com wrote: On Mon, Jul 15, 2013 at 2:09 PM, bruno Piguet bruno.pig...@gmail.com wrote: Python itself doesn't raise an exception in such cases : (3,4) != (2, 3, 4) True (3,4) == (2, 3, 4) False Should numpy behave differently ? The numpy equivalent to Python's scalar == is called array_equal, and that does indeed behave the same: In [5]: np.array_equal([3, 4], [2, 3, 4]) Out[5]: False But in numpy, the name == is shorthand for the ufunc np.equal, which raises an error: In [8]: np.equal([3, 4], [2, 3, 4]) ValueError: operands could not be broadcast together with shapes (2) (3) -n ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] What should be the result in some statistics corner cases?
On Mon, Jul 15, 2013 at 8:34 AM, Sebastian Berg sebast...@sipsolutions.net wrote:

> On Mon, 2013-07-15 at 07:52 -0600, Charles R Harris wrote:
>
>> On Sun, Jul 14, 2013 at 3:35 PM, Charles R Harris charlesr.har...@gmail.com wrote:
>>
>> <snip>
>>
>>>> For nansum, I would expect 0 even in the case of all nans. The point of these functions is to simply ignore nans, correct? So I would aim for this behaviour: nanfunc(x) behaves the same as func(x[~isnan(x)])
>>>
>>> Agreed, although that changes current behavior. What about the other cases?
>>
>> Looks like there isn't much interest in the topic, so I'll just go ahead with the following choices:
>>
>> Non-NaN case
>>
>> 1) Empty array -> ValueError
>>
>>    The current behavior with stats is an accident, i.e., the nan arises from 0/0. I like to think that in this case the result is any number, rather than not a number, so *the* value is simply not defined. So in this case raise a ValueError for empty array.
>
> To be honest, I don't mind the current behaviour much: sum([]) = 0, len([]) = 0, so it is in a way well defined. At least I am not sure if I would prefer always an error. I am a bit worried that just changing it might break code out there, such as plotting code where it makes perfect sense to plot a NaN (i.e. nothing), but if that is the case it would probably be visible fast.
>
>> 2) ddof >= n -> ValueError
>>
>>    If the number of elements, n, is not zero and ddof >= n, raise a ValueError for the ddof value.
>
> Makes sense to me, especially for ddof > n. Just returning nan in all cases for backward compatibility would be fine with me too.

Currently if ddof > n it returns a negative number for variance; the NaN only comes when ddof == 0 and n == 0, leading to 0/0 (float is NaN, integer is zero division).

>> NaN case
>>
>> 1) Empty array -> ValueError
>> 2) Empty slice -> NaN
>> 3) For slice ddof >= n -> NaN
>
> Personally I would somewhat prefer if 1) and 2) would at least default to the same thing. But I don't use the nanfuncs anyway.
>
> I was wondering about adding the option for the user to pick what the fill is (i.e. if it is None (maybe default) -> ValueError). We could also allow this for normal reductions without an identity, but I am not sure if it is useful there.

In the NaN case some slices may be empty, others not. My reasoning is that that is going to be data dependent, not operator error, but if the array is empty the writer of the code should deal with that.

Chuck
Re: [Numpy-discussion] Allow == and != to raise errors
Thank-you for your explanations. So, if the operator == applied to np.arrays is a shorthand for the ufunc np.equal, it should definitly behave exactly as np.equal(), and raise an error. One side question about style : In case you would like to protect a x == y test by a try/except clause, wouldn't it feel more natural to write np.equal(x, y) ? Bruno. 2013/7/15 Nathaniel Smith n...@pobox.com On Mon, Jul 15, 2013 at 2:09 PM, bruno Piguet bruno.pig...@gmail.com wrote: Python itself doesn't raise an exception in such cases : (3,4) != (2, 3, 4) True (3,4) == (2, 3, 4) False Should numpy behave differently ? The numpy equivalent to Python's scalar == is called array_equal, and that does indeed behave the same: In [5]: np.array_equal([3, 4], [2, 3, 4]) Out[5]: False But in numpy, the name == is shorthand for the ufunc np.equal, which raises an error: In [8]: np.equal([3, 4], [2, 3, 4]) ValueError: operands could not be broadcast together with shapes (2) (3) -n ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
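Something like this is what I have in mind (just a sketch; np.equal already raises today on a shape mismatch, and under the proposal x == y would too):

import numpy as np

x = np.array([3, 4])
y = np.array([2, 3, 4])

try:
    same = bool(np.all(np.equal(x, y)))
except ValueError:
    same = False          # shapes don't even broadcast
print(same)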
Re: [Numpy-discussion] PIL and NumPy
On Jul 12, 2013, at 8:51 PM, Brady McCary brady.mcc...@gmail.com wrote:

> something to do with an alpha channel being present.

I'd check and see how PIL is storing the alpha channel. If it's RGBA, then I'd expect it to work. But if PIL is storing the alpha channel as a separate band, then I'm not surprised you have an issue. Can you either drop the alpha or convert to RGBA?

There is also a package called something like imageArray that loads and saves image formats directly to/from numpy arrays - maybe that would be helpful.

CHB

> When I remove the alpha channel, things appear to work as I expect. Any discussion on the matter?
>
> Brady
>
> On Fri, Jul 12, 2013 at 10:00 PM, Brady McCary brady.mcc...@gmail.com wrote:
>
>> NumPy Folks,
>>
>> I want to load images with PIL and then operate on them with NumPy. According to the PIL and NumPy documentation, I would expect the following to work, but it is not.
>>
>> Python 2.7.4 (default, Apr 19 2013, 18:28:01)
>> [GCC 4.7.3] on linux2
>> Type "help", "copyright", "credits" or "license" for more information.
>> >>> import numpy
>> >>> numpy.version.version
>> >>> import Image
>> >>> Image.VERSION
>> '1.1.7'
>> >>> im = Image.open('big-0.png')
>> >>> im.size
>> (2550, 3300)
>> >>> ar = numpy.asarray(im)
>> >>> ar.size
>> 1
>> >>> ar.shape
>> ()
>> >>> ar
>> array(<PIL.PngImagePlugin.PngImageFile image mode=LA size=2550x3300 at 0x1E5BA70>, dtype=object)
>>
>> By not working I mean that I would have expected the data to be loaded/available in ar.
>>
>> PIL and NumPy/SciPy seem to be working fine independently of each other. Any guidance?
>>
>> Brady
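A sketch of the two suggested workarounds (written with the Pillow-style import; the original session used PIL 1.1.7's bare `import Image`, and 'big-0.png' is the file from the report):

import numpy as np
from PIL import Image   # with classic PIL 1.1.7 this would be: import Image

im = Image.open('big-0.png')          # reported mode 'LA' (greyscale + alpha)

# Convert to a mode the PIL/numpy interop handles, so the pixel data arrives
# as a real array instead of a 0-d object array.
rgba = np.asarray(im.convert('RGBA')) # shape (height, width, 4)
grey = np.asarray(im.convert('L'))    # alpha dropped, shape (height, width)
print(rgba.shape, grey.shape)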
Re: [Numpy-discussion] Allow == and != to raise errors
2013/7/15 Frédéric Bastien no...@nouiz.org Just a question, should == behave like a ufunc or like python == for tuple? That's what I was also wondering. I see the advantage of consistency for newcomers. I'm not experienced enough to see if this is a problem for numerical practitionners Maybe they wouldn't even imagine that == applied to arrays could do anything else than element-wise comparison ? Explicit is better than implicit : to me, np.equal(x, y) is more explicit than x == y. But Beautiful is better than ugly. Is np.equal(x, y) ugly ? Bruno. I think that all ndarray comparision (==, !=, =, ...) should behave the same. If they don't (like it was said), making them consistent is good. What is the minimal change to have them behave the same? From my understanding, it is your proposal to change == and != to behave like real ufunc. But I'm not sure if the minimal change is the best, for new user, what they will expect more? The ufunc of the python behavior? Anyway, I see the advantage to simplify the interface to something more consistent. Anyway, if we make all comparison behave like ufunc, there is array_equal as said to have the python behavior of ==, is it useful to have equivalent function the other comparison? Do they already exist. thanks Fred On Mon, Jul 15, 2013 at 10:20 AM, Nathaniel Smith n...@pobox.com wrote: On Mon, Jul 15, 2013 at 2:09 PM, bruno Piguet bruno.pig...@gmail.com wrote: Python itself doesn't raise an exception in such cases : (3,4) != (2, 3, 4) True (3,4) == (2, 3, 4) False Should numpy behave differently ? The numpy equivalent to Python's scalar == is called array_equal, and that does indeed behave the same: In [5]: np.array_equal([3, 4], [2, 3, 4]) Out[5]: False But in numpy, the name == is shorthand for the ufunc np.equal, which raises an error: In [8]: np.equal([3, 4], [2, 3, 4]) ValueError: operands could not be broadcast together with shapes (2) (3) -n ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] read-only or immutable masked array
On Jul 15, 2013, at 14:40 , Gregorio Bastardo gregorio.basta...@gmail.com wrote:

> Hi Pierre,
>
>> Note as well that hardening the mask only prevents unmasking: you can still grow the mask, which may not be what you want. Use `x.mask.flags.writeable=False` to make the mask really read-only.
>
> I ran into an unmasking problem with the suggested approach:
>
> >>> np.version.version
> '1.7.0'
> >>> x = np.ma.masked_array(xrange(4), [0,1,0,1])
> >>> x
> masked_array(data = [0 -- 2 --], mask = [False True False True], fill_value = 99)
> >>> x.flags.writeable = False
> >>> x.mask.flags.writeable = False
> >>> x.mask[1] = 0 # ok
> Traceback (most recent call last):
>   ...
> ValueError: assignment destination is read-only
> >>> x[1] = 0 # ok
> Traceback (most recent call last):
>   ...
> ValueError: assignment destination is read-only
> >>> x.mask[1] = 0 # ??
> >>> x
> masked_array(data = [0 1 2 --], mask = [False False False True], fill_value = 99)

Ouch… Quick workaround: use `x.harden_mask()` *then* `x.mask.flags.writeable=False`.

[Longer explanation]

> I noticed that the sharedmask attribute changes (from True to False) after x[1] = 0.

Indeed, indeed… When setting items, the mask is unshared to limit some issues (like propagation to the other masked_arrays sharing the mask). Unsharing the mask involves a copy, which unfortunately doesn't copy the flags. In other terms, when you try `x[1]=0`, the mask becomes rewritable. That hurts… But! This call to `unshare_mask` is performed only when the mask is 'soft', hence the quick workaround…

Note to self (or whomever will fix the issue before I can do it):
* We could make sure that copying a mask copies some of its flags too (like the `writeable` one; which other ones?)
* The call to `unshare_mask` is made *before* we try to call `__setitem__` on the `_data` part: that's silly. If we called `__setitem__(_data, index, dval)` first, the `ValueError: assignment destination is read-only` would be raised before the mask could get unshared… TL;DR: move L3073 of np.ma.core to L3068.
* There should be some simpler ways to make a masked_array read-only, this little dance is rapidly tiring.

> Also, some of the ma operations result in mask identity with the new ma, which causes a ValueError when the new ma's mask is modified:
>
> >>> x = np.ma.masked_array(xrange(4), [0,1,0,1])
> >>> x.flags.writeable = False
> >>> x.mask.flags.writeable = False
> >>> x1 = x > 0
> >>> x1.mask is x.mask # ok
> False
> >>> x2 = x != 0
> >>> x2.mask is x.mask # ??
> True
> >>> x2.mask[1] = 0
> Traceback (most recent call last):
>   ...
> ValueError: assignment destination is read-only
>
> which is a bit confusing.

Ouch again.

[TL;DR] No workaround, sorry.

[Long version] The inconsistency comes from the fact that '!=' or '==' call the `__ne__` or `__eq__` methods, while the other comparison operators call their own function. In the first case, because we're comparing with a non-masked scalar, no copy of the mask is made; in the second case, a copy is systematically made. As pointed out earlier, copies of a mask don't preserve its flags…

[Note to self]
* Define a factory for __lt__/__le__/__gt__/__ge__ based on __eq__: MaskedArray.__eq__ and __ne__ already have almost the same code.. (but what about filling? Is it an issue?)

> And I experienced that *_like operations give mask identity too:
>
> >>> y = np.ones_like(x)
> >>> y.mask is x.mask
> True

This may change in the future, depending on a yet-to-be-achieved consensus on the definition of 'least-surprising behaviour'. Right now, the *_like functions return an array that shares the mask with the input, as you've noticed. Some people complained about it, what's your take on that?

> I might be missing something but could you clarify these issues?

You were not missing anything, np.ma isn't the most straightforward module: plenty of corner cases, and the implementation is pretty naive at times (but hey, it works). My only advice is to never lose hope.
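In code, the workaround order looks like this (a sketch against the 1.7-era behaviour described above):

import numpy as np

x = np.ma.masked_array(range(4), [0, 1, 0, 1])

# Harden the mask *before* freezing the flags, so item assignment never
# reaches the unshare_mask copy that silently drops the read-only flag.
x.harden_mask()
x.flags.writeable = False
x.mask.flags.writeable = False

try:
    x[1] = 0            # blocked: data is read-only
except ValueError:
    pass
try:
    x.mask[1] = False   # blocked: mask is read-only
except ValueError:
    pass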
Re: [Numpy-discussion] What should be the result in some statistics corner cases?
On Mon, Jul 15, 2013 at 8:58 AM, Charles R Harris charlesr.har...@gmail.com wrote: On Mon, Jul 15, 2013 at 8:34 AM, Sebastian Berg sebast...@sipsolutions.net wrote: On Mon, 2013-07-15 at 07:52 -0600, Charles R Harris wrote: On Sun, Jul 14, 2013 at 3:35 PM, Charles R Harris charlesr.har...@gmail.com wrote: snip For nansum, I would expect 0 even in the case of all nans. The point of these functions is to simply ignore nans, correct? So I would aim for this behaviour: nanfunc(x) behaves the same as func(x[~isnan(x)]) Agreed, although that changes current behavior. What about the other cases? Looks like there isn't much interest in the topic, so I'll just go ahead with the following choices: Non-NaN case 1) Empty array - ValueError The current behavior with stats is an accident, i.e., the nan arises from 0/0. I like to think that in this case the result is any number, rather than not a number, so *the* value is simply not defined. So in this case raise a ValueError for empty array. To be honest, I don't mind the current behaviour much sum([]) = 0, len([]) = 0, so it is in a way well defined. At least I am not sure if I would prefer always an error. I am a bit worried that just changing it might break code out there, such as plotting code where it makes perfectly sense to plot a NaN (i.e. nothing), but if that is the case it would probably be visible fast. 2) ddof = n - ValueError If the number of elements, n, is not zero and ddof = n, raise a ValueError for the ddof value. Makes sense to me, especially for ddof n. Just returning nan in all cases for backward compatibility would be fine with me too. Currently if ddof n it returns a negative number for variance, the NaN only comes when ddof == 0 and n == 0, leading to 0/0 (float is NaN, integer is zero division). Nan case 1) Empty array - Value Error 2) Empty slice - NaN 3) For slice ddof = n - Nan Personally I would somewhat prefer if 1) and 2) would at least default to the same thing. But I don't use the nanfuncs anyway. I was wondering about adding the option for the user to pick what the fill is (and i.e. if it is None (maybe default) - ValueError). We could also allow this for normal reductions without an identity, but I am not sure if it is useful there. In the NaN case some slices may be empty, others not. My reasoning is that that is going to be data dependent, not operator error, but if the array is empty the writer of the code should deal with that. In the case of the nanvar, nanstd, it might make more sense to handle ddof as 1) if ddof is = axis size, raise ValueError 2) if ddof is = number of values after removing NaNs, return NaN The first would be consistent with the non-nan case, the second accounts for the variable nature of data containing NaNs. Chuck ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] What should be the result in some statistics corner cases?
On Jul 15, 2013 11:47 AM, Charles R Harris charlesr.har...@gmail.com wrote: On Mon, Jul 15, 2013 at 8:58 AM, Charles R Harris charlesr.har...@gmail.com wrote: On Mon, Jul 15, 2013 at 8:34 AM, Sebastian Berg sebast...@sipsolutions.net wrote: On Mon, 2013-07-15 at 07:52 -0600, Charles R Harris wrote: On Sun, Jul 14, 2013 at 3:35 PM, Charles R Harris charlesr.har...@gmail.com wrote: snip For nansum, I would expect 0 even in the case of all nans. The point of these functions is to simply ignore nans, correct? So I would aim for this behaviour: nanfunc(x) behaves the same as func(x[~isnan(x)]) Agreed, although that changes current behavior. What about the other cases? Looks like there isn't much interest in the topic, so I'll just go ahead with the following choices: Non-NaN case 1) Empty array - ValueError The current behavior with stats is an accident, i.e., the nan arises from 0/0. I like to think that in this case the result is any number, rather than not a number, so *the* value is simply not defined. So in this case raise a ValueError for empty array. To be honest, I don't mind the current behaviour much sum([]) = 0, len([]) = 0, so it is in a way well defined. At least I am not sure if I would prefer always an error. I am a bit worried that just changing it might break code out there, such as plotting code where it makes perfectly sense to plot a NaN (i.e. nothing), but if that is the case it would probably be visible fast. 2) ddof = n - ValueError If the number of elements, n, is not zero and ddof = n, raise a ValueError for the ddof value. Makes sense to me, especially for ddof n. Just returning nan in all cases for backward compatibility would be fine with me too. Currently if ddof n it returns a negative number for variance, the NaN only comes when ddof == 0 and n == 0, leading to 0/0 (float is NaN, integer is zero division). Nan case 1) Empty array - Value Error 2) Empty slice - NaN 3) For slice ddof = n - Nan Personally I would somewhat prefer if 1) and 2) would at least default to the same thing. But I don't use the nanfuncs anyway. I was wondering about adding the option for the user to pick what the fill is (and i.e. if it is None (maybe default) - ValueError). We could also allow this for normal reductions without an identity, but I am not sure if it is useful there. In the NaN case some slices may be empty, others not. My reasoning is that that is going to be data dependent, not operator error, but if the array is empty the writer of the code should deal with that. In the case of the nanvar, nanstd, it might make more sense to handle ddof as 1) if ddof is = axis size, raise ValueError 2) if ddof is = number of values after removing NaNs, return NaN The first would be consistent with the non-nan case, the second accounts for the variable nature of data containing NaNs. Chuck I think this is a good idea in that it naturally follows well with the conventions of what to do with empty arrays / empty slices with nanmean, etc. Note, however, I am not a very big fan of the idea of having two different behaviors for what I see as semantically the same thing. But, my objections are not strong enough to veto it, and I do think this proposal is well thought-out. Ben Root ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] What should be the result in some statistics corner cases?
On Mon, 2013-07-15 at 08:47 -0600, Charles R Harris wrote:

> On Mon, Jul 15, 2013 at 8:34 AM, Sebastian Berg sebast...@sipsolutions.net wrote:
>
>> On Mon, 2013-07-15 at 07:52 -0600, Charles R Harris wrote:
>>
>>> On Sun, Jul 14, 2013 at 3:35 PM, Charles R Harris charlesr.har...@gmail.com wrote:
>>>
>>> <snip>
>>>
>>>>> For nansum, I would expect 0 even in the case of all nans. The point of these functions is to simply ignore nans, correct? So I would aim for this behaviour: nanfunc(x) behaves the same as func(x[~isnan(x)])
>>>>
>>>> Agreed, although that changes current behavior. What about the other cases?
>>>
>>> Looks like there isn't much interest in the topic, so I'll just go ahead with the following choices:
>>>
>>> Non-NaN case
>>>
>>> 1) Empty array -> ValueError
>>>
>>>    The current behavior with stats is an accident, i.e., the nan arises from 0/0. I like to think that in this case the result is any number, rather than not a number, so *the* value is simply not defined. So in this case raise a ValueError for empty array.
>>
>> To be honest, I don't mind the current behaviour much: sum([]) = 0, len([]) = 0, so it is in a way well defined. At least I am not sure if I would prefer always an error. I am a bit worried that just changing it might break code out there, such as plotting code where it makes perfect sense to plot a NaN (i.e. nothing), but if that is the case it would probably be visible fast.
>
> I'm talking about mean, var, and std as statistics, sum isn't part of that. If there is agreement that nansum of empty arrays/columns should be zero I will do that. Note the sums of empty arrays may or may not be empty.
>
> In [1]: ones((0, 3)).sum(axis=0)
> Out[1]: array([ 0., 0., 0.])
>
> In [2]: ones((3, 0)).sum(axis=0)
> Out[2]: array([], dtype=float64)
>
> Which, sort of, makes sense.

I think we can agree that the behaviour for reductions with an identity should default to returning the identity, including for the nanfuncs, i.e. sum([]) is 0, product([]) is 1...

Since mean = sum/length is a sensible definition, having 0/0 as a result doesn't seem too bad to me, to be honest; it might be accidental but it is not a special case in the code ;). Though I don't mind an error as long as it doesn't break matplotlib or so. I agree that the nanfuncs raising an error would probably be more of a problem than for a usual ufunc, but I am still a bit hesitant about saying that it is ok too.

I could imagine adding a very general identity argument (though I would not call it identity, because it is not the same as `np.add.identity`, just used in a place where that would be used otherwise):

np.add.reduce([], identity=123) -> [123]
np.add.reduce([1], identity=123) -> [1]
np.nanmean([np.nan], identity=None) -> Error
np.nanmean([np.nan], identity=np.nan) -> np.nan

It doesn't really make sense, but:

np.subtract.reduce([]) -> Error, since np.subtract.identity is None
np.subtract.reduce([], identity=0) -> 0, suppressing the error.

I am not sure if I am convinced myself, but especially for the nanfuncs it could maybe provide a way to circumvent the problem somewhat. Including functions such as np.nanargmin, whose result type does not even support NaN. Plus it gives an argument allowing for warnings about changing behaviour.

- Sebastian

>>> 2) ddof >= n -> ValueError
>>>
>>>    If the number of elements, n, is not zero and ddof >= n, raise a ValueError for the ddof value.
>>
>> Makes sense to me, especially for ddof > n. Just returning nan in all cases for backward compatibility would be fine with me too.
>
>>> NaN case
>>>
>>> 1) Empty array -> ValueError
>>> 2) Empty slice -> NaN
>>> 3) For slice ddof >= n -> NaN
>>
>> Personally I would somewhat prefer if 1) and 2) would at least default to the same thing. But I don't use the nanfuncs anyway.
>>
>> I was wondering about adding the option for the user to pick what the fill is (i.e. if it is None (maybe default) -> ValueError). We could also allow this for normal reductions without an identity, but I am not sure if it is useful there.
>
> Chuck
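Purely as an illustration of the intended semantics — `identity` is not an existing argument and `nan_reduce` is a made-up name — the idea could be emulated with a small wrapper:

import numpy as np

def nan_reduce(func, x, identity=None):
    # Emulate the sketched semantics: reduce the non-NaN values with `func`;
    # if nothing is left, return `identity`, or raise if it is None.
    x = np.asarray(x, dtype=float)
    good = x[~np.isnan(x)]
    if good.size == 0:
        if identity is None:
            raise ValueError("empty reduction with no identity given")
        return identity
    return func(good)

# nan_reduce(np.mean, [1.0, np.nan, 3.0])         -> 2.0
# nan_reduce(np.mean, [np.nan], identity=np.nan)  -> nan
# nan_reduce(np.mean, [np.nan])                   -> ValueError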
Re: [Numpy-discussion] Allow == and != to raise errors
On Mon, 2013-07-15 at 17:12 +0200, bruno Piguet wrote: 2013/7/15 Frédéric Bastien no...@nouiz.org Just a question, should == behave like a ufunc or like python == for tuple? That's what I was also wondering. I am not sure I understand the question. Of course == should be (mostly?) identical to np.equal. Things like arr[arr == 0] = -1 etc., etc., are a common design pattern. Operations on arrays are element-wise by default, falling back to the python tuple/container behaviour is a special case and I do not see a good reason for it, except possibly backward compatibility. Personally I doubt anyone who seriously uses numpy, uses the np.array([1, 2, 3]) == np.array([1,2]) - False behaviour, and it seems a bit like a trap to me, because suddenly you get: np.array([1, 2, 3]) == np.array([1]) - np.array([True, False, False]) (Though in combination with np.all, it can make sense and is then identical to np.array_equiv/np.array_equal) - Sebastian I see the advantage of consistency for newcomers. I'm not experienced enough to see if this is a problem for numerical practitionners Maybe they wouldn't even imagine that == applied to arrays could do anything else than element-wise comparison ? Explicit is better than implicit : to me, np.equal(x, y) is more explicit than x == y. But Beautiful is better than ugly. Is np.equal(x, y) ugly ? Bruno. I think that all ndarray comparision (==, !=, =, ...) should behave the same. If they don't (like it was said), making them consistent is good. What is the minimal change to have them behave the same? From my understanding, it is your proposal to change == and != to behave like real ufunc. But I'm not sure if the minimal change is the best, for new user, what they will expect more? The ufunc of the python behavior? Anyway, I see the advantage to simplify the interface to something more consistent. Anyway, if we make all comparison behave like ufunc, there is array_equal as said to have the python behavior of ==, is it useful to have equivalent function the other comparison? Do they already exist. thanks Fred On Mon, Jul 15, 2013 at 10:20 AM, Nathaniel Smith n...@pobox.com wrote: On Mon, Jul 15, 2013 at 2:09 PM, bruno Piguet bruno.pig...@gmail.com wrote: Python itself doesn't raise an exception in such cases : (3,4) != (2, 3, 4) True (3,4) == (2, 3, 4) False Should numpy behave differently ? The numpy equivalent to Python's scalar == is called array_equal, and that does indeed behave the same: In [5]: np.array_equal([3, 4], [2, 3, 4]) Out[5]: False But in numpy, the name == is shorthand for the ufunc np.equal, which raises an error: In [8]: np.equal([3, 4], [2, 3, 4]) ValueError: operands could not be broadcast together with shapes (2) (3) -n ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
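Spelled out, the pattern in question (nothing version-specific here):

import numpy as np

arr = np.array([0, 1, 0, 2])

# The boolean array that element-wise == returns is used directly as an
# index; this is the idiom that relies on == staying element-wise.
arr[arr == 0] = -1
print(arr)          # [-1  1 -1  2]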
Re: [Numpy-discussion] PIL and NumPy
Dear Brady On Fri, 12 Jul 2013 22:00:08 -0500, Brady McCary wrote: I want to load images with PIL and then operate on them with NumPy. According to the PIL and NumPy documentation, I would expect the following to work, but it is not. Reading images as PIL is a little bit trickier than one would hope. You can find an example of how to do it (taken scikit-image) here: https://github.com/scikit-image/scikit-image/blob/master/skimage/io/_plugins/pil_plugin.py#L15 Stéfan ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] read-only or immutable masked array
Ouch… Quick workaround: use `x.harden_mask()` *then* `x.mask.flags.writeable=False` Thanks for the update and the detailed explanation. I'll try this trick. This may change in the future, depending on a yet-to-be-achieved consensus on the definition of 'least-surprising behaviour'. Right now, the *-like functions return an array that shares the mask with the input, as you've noticed. Some people complained about it, what's your take on that? I already took part in the survey (possibly out of thread): http://mail.scipy.org/pipermail/numpy-discussion/2013-July/067136.html You were not missing anything, np.ma isn't the most straightforward module: plenty of corner cases, and the implementation is pretty naive at times (but hey, it works). My only advice is to never lose hope. I agree there are plenty of hard-to-define cases, and I came accross a hot debate on missing data representation in python: https://github.com/njsmith/numpy/wiki/NA-discussion-status but still I believe np.ma is very usable when compression is not strongly needed. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] What should be the result in some statistics corner cases?
On Mon, Jul 15, 2013 at 9:55 AM, Sebastian Berg sebast...@sipsolutions.netwrote: On Mon, 2013-07-15 at 08:47 -0600, Charles R Harris wrote: On Mon, Jul 15, 2013 at 8:34 AM, Sebastian Berg sebast...@sipsolutions.net wrote: On Mon, 2013-07-15 at 07:52 -0600, Charles R Harris wrote: On Sun, Jul 14, 2013 at 3:35 PM, Charles R Harris charlesr.har...@gmail.com wrote: snip For nansum, I would expect 0 even in the case of all nans. The point of these functions is to simply ignore nans, correct? So I would aim for this behaviour: nanfunc(x) behaves the same as func(x[~isnan(x)]) Agreed, although that changes current behavior. What about the other cases? Looks like there isn't much interest in the topic, so I'll just go ahead with the following choices: Non-NaN case 1) Empty array - ValueError The current behavior with stats is an accident, i.e., the nan arises from 0/0. I like to think that in this case the result is any number, rather than not a number, so *the* value is simply not defined. So in this case raise a ValueError for empty array. To be honest, I don't mind the current behaviour much sum([]) = 0, len([]) = 0, so it is in a way well defined. At least I am not sure if I would prefer always an error. I am a bit worried that just changing it might break code out there, such as plotting code where it makes perfectly sense to plot a NaN (i.e. nothing), but if that is the case it would probably be visible fast. I'm talking about mean, var, and std as statistics, sum isn't part of that. If there is agreement that nansum of empty arrays/columns should be zero I will do that. Note the sums of empty arrays may or may not be empty. In [1]: ones((0, 3)).sum(axis=0) Out[1]: array([ 0., 0., 0.]) In [2]: ones((3, 0)).sum(axis=0) Out[2]: array([], dtype=float64) Which, sort of, makes sense. I think we can agree that the behaviour for reductions with an identity should default to returning the identity, including for the nanfuncs, i.e. sum([]) is 0, product([]) is 1... Since mean = sum/length is a sensible definition, having 0/0 as a result doesn't seem to bad to me to be honest, it might be accidental but it is not a special case in the code ;). Though I don't mind an error as long as it doesn't break matplotlib or so. I agree about the nanfuncs raising an error would probably be more of a problem then for a usual ufunc, but still a bit hesitant about saying that it is ok too. I could imagine adding a very general identity argument (though I would not call it identity, because it is not the same as `np.add.identity`, just used in a place where that would be used otherwise): np.add.reduce([], identity=123) - [123] np.add.reduce([1], identity=123) - [1] np.nanmean([np.nan], identity=None) - Error np.nanmean([np.nan], identity=np.nan) - np.nan It doesn't really make sense, but: np.subtract.reduce([]) - Error, since np.substract.identity is None np.subtract.reduce([], identity=0) - 0, suppressing the error. I am not sure if I am convinced myself, but especially for the nanfuncs it could maybe provide a way to circumvent the problem somewhat. Including functions such as np.nanargmin, whose result type does not even support NaN. Plus it gives an argument allowing for warnings about changing behaviour. Let me try to summarize. To begin with, the environment of the nan functions is rather special. 1) if the array is of not of inexact type, they punt to the non-nan versions. 
2) if the array is of inexact type, then out and dtype must be inexact if specified.

The second assumption guarantees that NaN can be used in the return values.

*sum and nansum*

These should be consistent so that empty sums are 0. This should cover the empty array case, but will change the behaviour of nansum, which currently returns NaN if the array isn't empty but the slice is after NaN removal.

*mean and nanmean*

In the case of empty arrays, or an empty slice, this leads to 0/0. For Python this is always a zero division error; for Numpy this raises a warning and returns NaN for floats, 0 for integers. Currently mean returns NaN and raises a RuntimeWarning when 0/0 occurs. In the special case where dtype=int, the NaN is cast to integer.

Option1
1) mean: raise error on 0/0
2) nanmean: no warning, return NaN

Option2
Re: [Numpy-discussion] What should be the result in some statistics corner cases?
On Mon, Jul 15, 2013 at 6:29 PM, Charles R Harris charlesr.har...@gmail.com wrote:

Let me try to summarize. To begin with, the environment of the nan functions is rather special. 1) if the array is not of inexact type, they punt to the non-nan versions. 2) if the array is of inexact type, then out and dtype must be inexact if specified The second assumption guarantees that NaN can be used in the return values.

The requirement on the 'out' dtype only exists because currently the nan functions like to return nan for things like empty arrays, right? If not for that, it could be relaxed? (it's a rather weird requirement, since the whole point of these functions is that they ignore nans, yet they don't always...)

sum and nansum These should be consistent so that empty sums are 0. This should cover the empty array case, but will change the behaviour of nansum, which currently returns NaN if the array isn't empty but the slice is after NaN removal.

I agree that returning 0 is the right behaviour, but we might need a FutureWarning period.

mean and nanmean In the case of empty arrays, an empty slice, this leads to 0/0. For Python this is always a zero division error; for Numpy this raises a warning and returns NaN for floats, 0 for integers. Currently mean returns NaN and raises a RuntimeWarning when 0/0 occurs. In the special case where dtype=int, the NaN is cast to integer.
Option1 1) mean raise error on 0/0 2) nanmean no warning, return NaN
Option2 1) mean raise warning, return NaN (current behavior) 2) nanmean no warning, return NaN
Option3 1) mean raise warning, return NaN (current behavior) 2) nanmean raise warning, return NaN

I have mixed feelings about the whole np.seterr apparatus, but since it exists, shouldn't we use it for consistency? I.e., just do whatever numpy is set up to do with 0/0? (Which I think means, warn and return NaN by default, but this can be changed.)

var, std, nanvar, nanstd 1) if ddof > axis(axes) size, raise error, probably a program bug. 2) If ddof=0, then whatever is the case for mean, nanmean For nanvar, nanstd it is possible that some slices are good, some bad, so option1 1) if n - ddof <= 0 for a slice, raise warning, return NaN for slice option2 1) if n - ddof <= 0 for a slice, don't warn, return NaN for slice

I don't really have any intuition for these ddof cases. Just raising an error on negative effective dof is pretty defensible and might be the safest -- it's easy to turn an error into something sensible later if people come up with use cases...

-n ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
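The seterr machinery Nathaniel mentions can be seen on the underlying 0/0 itself (a sketch; note that, at least on recent releases, np.mean's empty-slice warning is issued by numpy's own Python code rather than by this ufunc error state):

import numpy as np

zero = np.array([0.0])

# Default error state: the invalid 0/0 warns and yields NaN.
print(zero / zero)            # [nan], with a RuntimeWarning

# The same machinery can escalate 0/0 to an exception instead.
with np.errstate(invalid="raise"):
    try:
        zero / zero
    except FloatingPointError as exc:
        print("raised:", exc)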
Re: [Numpy-discussion] What should be the result in some statistics corner cases?
On Mon, Jul 15, 2013 at 2:55 PM, Nathaniel Smith n...@pobox.com wrote: snip

I don't really have any intuition for these ddof cases. Just raising an error on negative effective dof is pretty defensible and might be the safest -- it's easy to turn an error into something sensible later if people come up with use cases...

Related: why does reduceat not have empty slices? np.add.reduceat(np.arange(8), [0, 4, 5, 7, 7]) array([ 6, 4, 11, 7, 7])

I'm in favor of returning nans instead of raising exceptions, except if the return type is int and we cannot cast nan to int.
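On the reduceat aside: by reduceat's documented rule there is no empty segment at all; whenever indices[i] >= indices[i+1] the i-th result is simply a[indices[i]], which is where the two trailing 7s come from. A quick check:

import numpy as np

a = np.arange(8)
# Segments: [0,4) -> 6, [4,5) -> 4, [5,7) -> 11, then the degenerate
# segments starting at 7 each yield a[7] = 7 rather than an empty sum.
print(np.add.reduceat(a, [0, 4, 5, 7, 7]))   # [ 6  4 11  7  7]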
If we get functions into numpy that know how to handle nans, then it would be useful to get the nans, so we can work with them. Some cases where this might come in handy are when we iterate over slices of an array that define groups or category levels, with possible empty groups *):

idx = np.repeat(np.array([0, 1, 2, 3]), [4, 3, 0, 2])
x = np.arange(9)
[x[idx==ii].mean() for ii in range(4)]
[1.5, 5.0, nan, 7.5]

instead of

[x[idx==ii].mean() for ii in range(4) if (idx==ii).sum() > 0]
[1.5, 5.0, 7.5]

Same for var: I wouldn't have to check that the size is larger than the ddof (whatever that is in the specific case).

*) groups could be empty because they were defined for a larger dataset or as a union of different datasets

PS: I used mean() above and not var() because
np.__version__
'1.5.1'
np.mean([])
nan
np.var([])
0.0

Josef ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
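A runnable version of the group-mean pattern above (a sketch; on newer NumPy the empty group also triggers a "Mean of empty slice" RuntimeWarning, silenced here so the nan placeholder comes through quietly):

import numpy as np
import warnings

# Group labels for 9 observations; group 2 is intentionally empty.
idx = np.repeat(np.array([0, 1, 2, 3]), [4, 3, 0, 2])
x = np.arange(9, dtype=float)

with warnings.catch_warnings():
    warnings.simplefilter("ignore", RuntimeWarning)
    means = [float(x[idx == ii].mean()) for ii in range(4)]
print(means)  # [1.5, 5.0, nan, 7.5] -- the empty group is kept as nan

# The alternative being avoided: filtering out empty groups, which silently
# drops group 2 and misaligns the result with the group labels.
means_kept = [float(x[idx == ii].mean()) for ii in range(4) if (idx == ii).sum() > 0]
print(means_kept)  # [1.5, 5.0, 7.5]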
Re: [Numpy-discussion] What should be the result in some statistics corner cases?
On Mon, Jul 15, 2013 at 4:24 PM, josef.p...@gmail.com wrote: snip

*) groups could be empty because they were defined for a larger dataset or as a union of different datasets

Background: I wrote several robust anova versions a few weeks ago that were essentially list comprehensions as above. However, I didn't allow nans and didn't check for minimum size. Allowing for empty groups to return nan would mainly be a convenience, since I need to check the group size only once.

ddof: tests for proportions have ddof=0, the regular t-test ddof=1, tests of correlation ddof=2 IIRC, so we would need to check for the corresponding minimum size so that n - ddof > 0.

Negative effective dof doesn't exist, that's np.maximum(n - ddof, 0), which is always non-negative but might result in a zero-division error. :) I don't think making anything conditional on ddof > 0 is useful.

Josef ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
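A small sketch of the arithmetic Josef is describing, using his clipped denominator max(n - ddof, 0) directly rather than np.var itself (whose warning text and edge behaviour have varied across releases); the point is that the problem case is a 0/0, not a negative denominator:

import numpy as np

x = np.array([1.0, 1.0])
n = x.size
ss = np.sum((x - x.mean()) ** 2)       # sum of squared deviations, here 0.0

for ddof in (0, 1, 2):                 # proportions, t-test, correlation (per Josef)
    denom = np.maximum(n - ddof, 0)    # clipped denominator: never negative
    with np.errstate(invalid="ignore"):
        print(ddof, int(denom), ss / denom)   # ddof=2 hits 0/0 and gives nan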
Re: [Numpy-discussion] What should be the result in some statistics corner cases?
On Mon, Jul 15, 2013 at 2:44 PM, josef.p...@gmail.com wrote: snip

Negative effective dof doesn't exist, that's np.maximum(n - ddof, 0), which is always non-negative but might result in a zero-division error. :) I don't think making anything conditional on ddof > 0 is useful.

So how would you want it? To summarize the problem areas:
1) What is the sum of an empty slice? NaN or 0?
2) What is the mean of an empty slice? NaN, NaN and warn, or error?
3) What if n - ddof < 0 for a slice? NaN, NaN and warn, or error?
4) What if n - ddof = 0 for a slice? NaN, NaN and warn, or error?
I'm tending to NaN and warn for 2 -- 3, because, as Nathaniel notes, the warning can be
Re: [Numpy-discussion] What should be the result in some statistics corner cases?
On Mon, Jul 15, 2013 at 5:34 PM, Charles R Harris charlesr.har...@gmail.com wrote: snip

So how would you want it? To summarize the problem areas:
1) What is the sum of an empty slice? NaN or 0?

0 as it is now for sum (including 0 for nansum with no valid entries).

2) What is the mean of an empty slice? NaN, NaN and warn, or error?
3) What if n - ddof < 0 for a slice? NaN,
Re: [Numpy-discussion] What should be the result in some statistics corner cases?
On Mon, Jul 15, 2013 at 3:57 PM, josef.p...@gmail.com wrote: snip
Re: [Numpy-discussion] What should be the result in some statistics corner cases?
On Mon, 15 Jul 2013 08:33:47 -0600, Charles R Harris wrote: On Mon, Jul 15, 2013 at 8:25 AM, Benjamin Root ben.r...@ou.edu wrote: This is going to need to be heavily documented with doctests. Also, just to clarify, are we talking about a ValueError for doing a nansum on an empty array as well, or will that now return a zero? I was going to leave nansum as is, as it seems that the result was by choice rather than by accident.

That makes sense -- I like Sebastian's explanation whereby operations that define an identity yield that upon empty input.

Stéfan ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
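Sebastian's identity rule is easy to check against the ufunc machinery (a sketch; output as on recent NumPy):

import numpy as np

empty = np.array([], dtype=float)

# Reductions whose ufunc defines an identity return it for empty input...
print(np.add.identity, np.add.reduce(empty))            # 0 0.0
print(np.multiply.identity, np.multiply.reduce(empty))  # 1 1.0

# ...while a reduction without an identity refuses the empty case.
print(np.maximum.identity)                              # None
try:
    np.maximum.reduce(empty)
except ValueError as exc:
    print("ValueError:", exc)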
Re: [Numpy-discussion] What should be the result in some statistics corner cases?
On Mon, Jul 15, 2013 at 6:22 PM, Stéfan van der Walt ste...@sun.ac.za wrote: On Mon, 15 Jul 2013 08:33:47 -0600, Charles R Harris wrote: On Mon, Jul 15, 2013 at 8:25 AM, Benjamin Root ben.r...@ou.edu wrote: This is going to need to be heavily documented with doctests. Also, just to clarify, are we talking about a ValueError for doing a nansum on an empty array as well, or will that now return a zero? I was going to leave nansum as is, as it seems that the result was by choice rather than by accident. That makes sense -- I like Sebastian's explanation whereby operations that define an identity yield that upon empty input.

So nansum should return zeros rather than the current NaNs?

Chuck ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] What should be the result in some statistics corner cases?
On Mon, 15 Jul 2013 18:46:33 -0600, Charles R Harris wrote: So nansum should return zeros rather than the current NaNs? Yes, my feeling is that nansum([]) should be 0. Stéfan ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] What should be the result in some statistics corner cases?
To add a bit of context to the question of nansum on empty results: we currently differ from MATLAB and R in this respect, as they return zero no matter what. Personally, I think it should return zero, but our current behavior of returning nans has existed for a long time. I think we need a deprecation warning, and possibly should wait to change this until 2.0, with plenty of warning that this will change.

Ben Root

On Jul 15, 2013 8:46 PM, Charles R Harris charlesr.har...@gmail.com wrote: On Mon, Jul 15, 2013 at 6:22 PM, Stéfan van der Walt ste...@sun.ac.za wrote: On Mon, 15 Jul 2013 08:33:47 -0600, Charles R Harris wrote: On Mon, Jul 15, 2013 at 8:25 AM, Benjamin Root ben.r...@ou.edu wrote: This is going to need to be heavily documented with doctests. Also, just to clarify, are we talking about a ValueError for doing a nansum on an empty array as well, or will that now return a zero? I was going to leave nansum as is, as it seems that the result was by choice rather than by accident. That makes sense -- I like Sebastian's explanation whereby operations that define an identity yield that upon empty input. So nansum should return zeros rather than the current NaNs? Chuck ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
[Numpy-discussion] retrieving original array locations from 2d argsort
I know that there's an easy way to solve this problem, but I'm not sufficiently knowledgeable about numpy indexing to figure it out. Here is the problem: Take a 2-d array a, of any size. Sort it in ascending order using, I presume, argsort. Step through the sorted array in order, and for each element in the sorted array, retrieve what the corresponding (line, sample) indices in the original array are. For instance:

a = numpy.arange(0, 16).reshape(4,4)
a[0,:] = -1*numpy.arange(0,4)
a[2,:] = -1*numpy.arange(4,8)
asort = numpy.sort(a, axis=None)
for idx in xrange(0, asort.size):
    element = asort[idx]
    !! Find the line and sample location in a that corresponds to the i-th element in asort

Thank-you for your help, Catherine ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] retrieving original array locations from 2d argsort
On 7/15/13, Moroney, Catherine M (398D) catherine.m.moro...@jpl.nasa.gov wrote: snip Take a 2-d array a, of any size. Sort it in ascending order using, I presume, argsort. Step through the sorted array in order, and for each element in the sorted array, retrieve what the corresponding (line, sample) indices in the original array are.

One way is to use argsort and `numpy.unravel_index` to recover the original 2D indices:

import numpy

a = numpy.arange(0, 16).reshape(4,4)
a[0,:] = -1*numpy.arange(0,4)
a[2,:] = -1*numpy.arange(4,8)

flat_sort_indices = numpy.argsort(a, axis=None)
original_indices = numpy.unravel_index(flat_sort_indices, a.shape)

print "  i   j  a[i,j]"
for i, j in zip(*original_indices):
    element = a[i, j]
    print "%3d %3d %6d" % (i, j, element)

Warren ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
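The same approach in Python 3 syntax, with a check (not in the original mail) that the recovered (row, column) pairs really do visit the array in ascending order:

import numpy as np

a = np.arange(16).reshape(4, 4)
a[0, :] = -1 * np.arange(0, 4)
a[2, :] = -1 * np.arange(4, 8)

# Argsort the flattened array, then map flat positions back to (row, col).
flat_order = np.argsort(a, axis=None)
rows, cols = np.unravel_index(flat_order, a.shape)

# Fancy indexing with the recovered indices reproduces the flat sort.
assert np.array_equal(a[rows, cols], np.sort(a, axis=None))

for i, j in zip(rows, cols):
    print(f"{i:3d} {j:3d} {a[i, j]:6d}")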
Re: [Numpy-discussion] What should be the result in some statistics corner cases?
On Mon, Jul 15, 2013 at 6:58 PM, Benjamin Root ben.r...@ou.edu wrote: snip

Waiting for the mythical 2.0 probably won't work ;) We also need to give folks a way to adjust ahead of time. I think the easiest way to do that is with an extra keyword, say nanok, with True as the starting default, then later we can make False the default. snip

Chuck ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] What should be the result in some statistics corner cases?
On Tue, Jul 16, 2013 at 3:50 AM, Charles R Harris charlesr.har...@gmail.com wrote: snip Waiting for the mythical 2.0 probably won't work ;) We also need to give folks a way to adjust ahead of time. I think the easiest way to do that is with an extra keyword, say nanok, with True as the starting default, then later we can make False the default.

No special keywords to work around a behavior change please; it doesn't work well and you end up with a keyword you don't really want. Why not just give a FutureWarning in 1.8 and change to returning zero in 1.9?

Ralf ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
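The FutureWarning route Ralf suggests is the standard transition pattern; a hypothetical shim (invented names, not actual NumPy code) shows what the warning release would look like before the default flips to 0:

import warnings
import numpy as np

def nansum_with_transition(a):
    # Hypothetical transition behaviour: keep returning NaN for an empty or
    # all-NaN input, but warn that a future release will return 0 instead.
    a = np.asarray(a, dtype=float)
    kept = a[~np.isnan(a)]
    if kept.size == 0:
        warnings.warn("nansum of an empty or all-NaN slice will return 0 "
                      "instead of NaN in a future release", FutureWarning)
        return np.float64(np.nan)
    return kept.sum()

print(nansum_with_transition([1.0, np.nan, 2.0]))   # 3.0
print(nansum_with_transition([np.nan, np.nan]))     # nan, plus a FutureWarning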