Re: [Numpy-discussion] in the NA discussion, what can we agree on?
Nathaniel Smith writes: So assignment is not destructive -- the old value is retained as the payload. I never assumed (and I think it is also the case for others) that the payload was retaining the old value. In fact, AFAIR, the payloads were introduced as a way of having more than one special value that (if wanted by the user) can be handled differently depending on the payload. Note that while you're assuming IGNORED(x) means a value that is ignoring the x original value, you're never writing MISSING(x) to retain the original value that is now missing. Thus I think that decoupling the payload from the previous value concept makes it all consistent regardless of the destructiveness property. That's one of the reasons why I used the special value concept since the beginning, so that no assumption can be made about its propagation and destructiveness properties. Lluis -- And it's much the same thing with knowledge, for whenever you learn something new, the whole world becomes that much richer. -- The Princess of Pure Reason, as told by Norton Juster in The Phantom Tollbooth ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] in the NA discussion, what can we agree on?
On Sat, Nov 5, 2011 at 3:22 PM, T J tjhn...@gmail.com wrote: So what do people expect out of ignored values? It seems that we might need to extend the list you put forward so that it includes these desires. Since my primary use is with MISSING and not so much IGNORED, I'm not in a very good position to help extend that list. I'd be curious to know if this present suggestion would work with how matplotlib uses masked arrays. I'm in a similar position -- I don't have any use cases for IGNORED myself right now, so I'm just trying to guess from what I've seen people say. Just having where= around makes it at least possible, if cumbersome, to handle one of the big problems -- working with subsets of large datasets without having to make a copy. It's possible we should leave IGNORED support alone until people have more experience with where=, and can say what kind of convenience features would be useful. Or maybe we can get some experts to speak up... perhaps this will help: http://thread.gmane.org/gmane.comp.python.matplotlib.devel/10740 -- Nathaniel
Re: [Numpy-discussion] in the NA discussion, what can we agree on?
On Fri, Nov 4, 2011 at 8:33 PM, T J tjhn...@gmail.com wrote: On Fri, Nov 4, 2011 at 8:03 PM, Nathaniel Smith n...@pobox.com wrote: Again, I really don't think you're going to be able to sell an API where [2] + [IGNORED(20)] == [IGNORED(22)] I mean, it's not me you have to convince, it's Gary, Pierre, maybe Benjamin, Lluís, etc. So I could be wrong. But you might want to figure that out first before making plans based on this... But this is how np.ma currently does it, except that it doesn't compute the payload---it just calls it IGNORED. Yes, that's what I mean -- if you're just temporarily masking something out because you want it to be IGNORED, then you don't want it to change around when you do something like a += 2, right? If the operation is changing the payload, then it's weird to say that the operation ignored the payload... Anyway, I think this is another way to think about your suggestion:

-- each array gets an extra boolean array called the mask that it carries around with it
-- unary ufuncs automatically copy these masks to their results; for binary ufuncs, the input masks get automatically ORed together, and that determines the mask attached to the output array
-- these masks have absolutely no effect on any computations, except that ufunc.reduce(a, skip_IGNORED=True) is defined to be a synonym for ufunc.reduce(a, where=a.mask)

Is that correct? Also, if I can ask -- is this something you would find useful yourself? -- Nathaniel
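Nathaniel's restatement above is easy to model in a few lines. This is a toy sketch, not a proposed API: `Ignorable`, `skip_ignored`, and the method names are all invented here; the point is only that the mask tags along unchanged through elementwise ops and is consulted solely by the reduction.

```python
class Ignorable:
    """Toy array-with-mask; mask[i] is True where element i is IGNORED."""

    def __init__(self, data, mask=None):
        self.data = list(data)
        self.mask = list(mask) if mask is not None else [False] * len(data)

    def unary(self, op):
        # Unary ufunc: the mask is copied unchanged to the result.
        return Ignorable([op(d) for d in self.data], list(self.mask))

    def binary(self, op, other):
        # Binary ufunc: input masks are ORed; payloads are still computed.
        return Ignorable(
            [op(a, b) for a, b in zip(self.data, other.data)],
            [ma or mb for ma, mb in zip(self.mask, other.mask)],
        )

    def reduce(self, op, identity, skip_ignored=False):
        # Only here does the mask affect the computation at all.
        out = identity
        for d, m in zip(self.data, self.mask):
            if skip_ignored and m:
                continue
            out = op(out, d)
        return out
```

For example, reducing Ignorable([1, 2, 3], [False, True, False]) with addition gives 6 normally and 4 with skip_ignored=True, matching the skip_IGNORED=True behaviour described in the message.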
Re: [Numpy-discussion] in the NA discussion, what can we agree on?
Hi, 05.11.2011 03:43, T J kirjoitti: [clip] I thought that PdC satisfied (a) and (b). Let me show you what I thought they were. Perhaps I am not being consistent. If so, point out my mistake. Yes, propagating + destructive assignment + do-computations-on-payload should satisfy (a) and (b). (NA also works as it's a singleton.) The question now is whether there are other rules, with more desirable behavior of masked values, that also have

a += b
a += 42
print unmask(a)

and

a += 42
a += b
print unmask(a)

as equivalent operations. The rules chosen by np.ma don't satisfy this. If taking a commutative version of np.ma's binary op rules, it seems that it's not clear how to make assignment work exactly in the way you'd expect of masked values while retaining equivalence in the above code. It seems that having `a += b` leave `a[j]` unchanged if it's ignored, and having ignored values propagate, creates the problem. -- Pauli Virtanen
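Pauli's equivalence property can be checked with a toy scalar model of the PdC rules. All names here (Special, unmask) are illustrative only; destructive assignment is modelled as plain rebinding of the payload-carrying value.

```python
class Special:
    """Toy PdC scalar: propagating, payload always computed."""

    def __init__(self, payload, ignored=False):
        self.payload = payload
        self.ignored = ignored

    def __add__(self, other):
        if isinstance(other, Special):
            # P: special if either operand is; C: payload is computed anyway.
            return Special(self.payload + other.payload,
                           self.ignored or other.ignored)
        return Special(self.payload + other, self.ignored)

def unmask(a):
    return a.payload

# d: assignment simply overwrites, so `a += b` is plain rebinding here.
a0, b = Special(1, ignored=True), Special(10)

a = a0 + b
a = a + 42
order1 = unmask(a)   # b first, then 42

a = a0 + 42
a = a + b
order2 = unmask(a)   # 42 first, then b
# order1 == order2, as the equivalence above requires
```

Because the payload is always computed and assignment always overwrites, the two orderings reduce to ordinary associative/commutative addition of the payloads, which is why PdC passes the test.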
Re: [Numpy-discussion] in the NA discussion, what can we agree on?
04.11.2011 19:59, Pauli Virtanen kirjoitti: [clip] This makes inline binary ops behave like Nn. Reductions are N. (Assignment: dC, reductions: N, binary ops: PX, unary ops: PC, inline binary ops: Nn). Sorry, inline binary ops are also PdX, not Nn. -- Pauli Virtanen
Re: [Numpy-discussion] in the NA discussion, what can we agree on?
On Fri, Nov 4, 2011 at 11:59 AM, Pauli Virtanen p...@iki.fi wrote: I have a feeling that if you don't start by mathematically defining the scalar operations first, and only after that generalize them to arrays, some conceptual problems may follow. Yes. I was going to mention this point as well. For shorthand, we can refer to the above choices with the nomenclature

shorthand ::= propagation destructivity payload_type
propagation ::= P | N
destructivity ::= d | n | s
payload_type ::= S | E | C

That makes 2 * 3 * 3 = 18 different ways to construct consistent behavior. Some of them might make sense, the problem is to find out which :) This is great for the discussion, IMO. The self-destructive assignment hasn't come up at all, so I'm guessing we can probably ignore it. --- Can you be a bit more explicit on the payload types? Let me try; respond with corrections if necessary. S is singleton: in the case of missing data, we take it to mean that we only care that data is missing, not *how* missing it is.

x = MISSING
-x      # unary
MISSING
x + 3   # binary
MISSING

E means that we acknowledge that we want to track the how, but that we aren't interested in it. So raise an error. In the case of ignored data, we might have:

x = 2
ignore(x)
x
IGNORED(2)
-x
Error
x + 3
Error

C means that we acknowledge that we want to track the how, and that we are interested in it. So do the computations.

x = 2
ignore(x)
-x
IGNORED(-2)
x + 3
IGNORED(5)

Did I get that mostly right? NAN and NA apparently fall into the PdS class. Here is where I think we need to be a bit more careful. It is true that we want NAN and MISSING to propagate, but then we additionally want to ignore them sometimes. This is precisely why we have functions like nansum. Although people are well aware of this desire, I think this thread has largely conflated the issues when discussing propagation.
To push this forward a bit, can I propose that IGNORE behave as PnC?

x = np.array([1, 2, 3])
y = np.array([10, 20, 30])
ignore(x[1])
x
[1, IGNORED(2), 3]
x + 2
[3, IGNORED(4), 5]
x + y
[11, IGNORED(22), 33]
z = x.sum()
z
IGNORED(6)
unignore(z)
z
6
x.sum(skipIGNORED=True)
4

When done in this fashion, I think it is perfectly fine for masks to be unmasked.
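As a concrete (and purely hypothetical) sketch of the PnC proposal at the scalar level -- propagating, non-destructive, computing on the payload -- one might write:

```python
class IGNORED:
    """Toy PnC special value: propagates, computes on the payload."""

    def __init__(self, payload):
        self.payload = payload

    def __add__(self, other):
        other = other.payload if isinstance(other, IGNORED) else other
        return IGNORED(self.payload + other)   # result stays ignored

    __radd__ = __add__

def ignore(value):
    return value if isinstance(value, IGNORED) else IGNORED(value)

def unignore(value):
    return value.payload if isinstance(value, IGNORED) else value

# Mirrors the session above at the scalar level:
z = ignore(2) + 2          # IGNORED(4)
total = 1 + ignore(2) + 3  # IGNORED(6), like z = x.sum()
```

unignore(z) recovers 4 and unignore(total) recovers 6, matching the unignore(z) step in the proposal.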
Re: [Numpy-discussion] in the NA discussion, what can we agree on?
NAN and NA apparently fall into the PdS class. Here is where I think we need to be a bit more careful. It is true that we want NAN and MISSING to propagate, but then we additionally want to ignore them sometimes. This is precisely why we have functions like nansum. Although people are well aware of this desire, I think this thread has largely conflated the issues when discussing propagation. To push this forward a bit, can I propose that IGNORE behave as PnC?

x = np.array([1, 2, 3])
y = np.array([10, 20, 30])
ignore(x[1])
x
[1, IGNORED(2), 3]
x + 2
[3, IGNORED(4), 5]
x + y
[11, IGNORED(22), 33]
z = x.sum()
z
IGNORED(6)
unignore(z)
z
6
x.sum(skipIGNORED=True)
4

When done in this fashion, I think it is perfectly fine for masks to be unmasked.

In my mind, IGNORED items should be skipped by default (i.e., skipIGNORED seems redundant ... isn't that what ignoring is all about?). Thus I might instead suggest the opposite (default) behavior at the end:

x = np.array([1, 2, 3])
y = np.array([10, 20, 30])
ignore(x[1])
x
[1, IGNORED(2), 3]
x + 2
[3, IGNORED(4), 5]
x + y
[11, IGNORED(22), 33]
z = x.sum()
z
4
unignore(x).sum()
6
x.sum(keepIGNORED=True)
6

(Obviously all the syntax is totally up for debate.) -best Gary The information in this e-mail is intended only for the person to whom it is addressed. If you believe this e-mail was sent to you in error and the e-mail contains patient information, please contact the Partners Compliance HelpLine at http://www.partners.org/complianceline . If the e-mail was sent to you in error but does not contain patient information, please contact the sender and properly dispose of the e-mail.
Re: [Numpy-discussion] in the NA discussion, what can we agree on?
On Fri, Nov 4, 2011 at 1:03 PM, Gary Strangman str...@nmr.mgh.harvard.edu wrote: To push this forward a bit, can I propose that IGNORE behave as PnC?

x = np.array([1, 2, 3])
y = np.array([10, 20, 30])
ignore(x[1])
x
[1, IGNORED(2), 3]
x + 2
[3, IGNORED(4), 5]
x + y
[11, IGNORED(22), 33]
z = x.sum()
z
IGNORED(6)
unignore(z)
z
6
x.sum(skipIGNORED=True)
4

In my mind, IGNORED items should be skipped by default (i.e., skipIGNORED seems redundant ... isn't that what ignoring is all about?). Thus I might instead suggest the opposite (default) behavior at the end:

x = np.array([1, 2, 3])
y = np.array([10, 20, 30])
ignore(x[1])
x
[1, IGNORED(2), 3]
x + 2
[3, IGNORED(4), 5]
x + y
[11, IGNORED(22), 33]
z = x.sum()
z
4
unignore(x).sum()
6
x.sum(keepIGNORED=True)
6

(Obviously all the syntax is totally up for debate.)

I agree that it would be ideal if the default were to skip IGNORED values, but that behavior seems inconsistent with its propagation properties (such as when adding arrays with IGNORED values). To illustrate, when we did x + 2, we were stating that:

IGNORED(2) + 2 == IGNORED(4)

which means that we propagated the IGNORED value. If we were to skip them by default, then we'd have:

IGNORED(2) + 2 == 2

To be consistent, then it seems we also should have had:

x + 2
[3, 2, 5]

which I think we can agree is not so desirable. What this seems to come down to is that we tend to want different behavior when we are doing reductions: for IGNORED data, we want it to propagate in every situation except for a reduction (where we want to skip over it). I don't know if there is a well-defined way to distinguish reductions from the other operations. Would it hold for generalized ufuncs? Would it hold for other functions which might return arrays instead of scalars?
Re: [Numpy-discussion] in the NA discussion, what can we agree on?
On Fri, Nov 4, 2011 at 1:03 PM, Gary Strangman str...@nmr.mgh.harvard.edu wrote: To push this forward a bit, can I propose that IGNORE behave as: PnC x = np.array([1, 2, 3]) y = np.array([10, 20, 30]) ignore(x[2]) x [1, IGNORED(2), 3] x + 2 [3, IGNORED(4), 5] x + y [11, IGNORED(22), 33] z = x.sum() z IGNORED(6) unignore(z) z 6 x.sum(skipIGNORED=True) 4 In my mind, IGNORED items should be skipped by default (i.e., skipIGNORED seems redundant ... isn't that what ignoring is all about?). Thus I might instead suggest the opposite (default) behavior at the end: x = np.array([1, 2, 3]) y = np.array([10, 20, 30]) ignore(x[2]) x [1, IGNORED(2), 3] x + 2 [3, IGNORED(4), 5] x + y [11, IGNORED(22), 33] z = x.sum() z 4 unignore(x).sum() 6 x.sum(keepIGNORED=True) 6 (Obviously all the syntax is totally up for debate.) I agree that it would be ideal if the default were to skip IGNORED values, but that behavior seems inconsistent with its propagation properties (such as when adding arrays with IGNORED values). To illustrate, when we did x+2, we were stating that: IGNORED(2) + 2 == IGNORED(4) which means that we propagated the IGNORED value. If we were to skip them by default, then we'd have: IGNORED(2) + 2 == 2 To be consistent, then it seems we also should have had: x + 2 [3, 2, 5] which I think we can agree is not so desirable. What this seems to come down to is that we tend to want different behavior when we are doing reductions, and that for IGNORED data, we want it to propagate in every situation except for a reduction (where we want to skip over it). I don't know if there is a well-defined way to distinguish reductions from the other operations. Would it hold for generalized ufuncs? Would it hold for other functions which might return arrays instead of scalars? Ahhh, yes. That clearly explains the issue hung-up in my mind, and also clarifies what I was getting at with the elementwise vs. reduction distinction I made earlier today. Maybe this is a pickle in a jar with no lid. 
I'll have to think about it ... -best Gary
Re: [Numpy-discussion] in the NA discussion, what can we agree on?
On Fri, Nov 4, 2011 at 1:59 PM, Pauli Virtanen p...@iki.fi wrote: For shorthand, we can refer to the above choices with the nomenclature

shorthand ::= propagation destructivity payload_type
propagation ::= P | N
destructivity ::= d | n | s
payload_type ::= S | E | C

I really like this problem formulation and description. Can we all agree to use this language/shorthand from this point on? I think it really illuminates the discussion, and I would like to have it added to the wiki page. Thanks! Ben Root
Re: [Numpy-discussion] in the NA discussion, what can we agree on?
On Fri, Nov 4, 2011 at 1:22 PM, T J tjhn...@gmail.com wrote: I agree that it would be ideal if the default were to skip IGNORED values, but that behavior seems inconsistent with its propagation properties (such as when adding arrays with IGNORED values). To illustrate, when we did x+2, we were stating that: IGNORED(2) + 2 == IGNORED(4) which means that we propagated the IGNORED value. If we were to skip them by default, then we'd have: IGNORED(2) + 2 == 2 To be consistent, then it seems we also should have had: x + 2 [3, 2, 5] which I think we can agree is not so desirable. What this seems to come down to is that we tend to want different behavior when we are doing reductions, and that for IGNORED data, we want it to propagate in every situation except for a reduction (where we want to skip over it). I don't know if there is a well-defined way to distinguish reductions from the other operations. Would it hold for generalized ufuncs? Would it hold for other functions which might return arrays instead of scalars? Continuing my theme of looking for consensus first... there are obviously a ton of ugly corners in here. But my impression is that at least for some simple cases, it's clear what users want:

a = [1, IGNORED(2), 3]
# array-with-ignored-values + unignored scalar only affects unignored values
a + 2
[3, IGNORED(2), 5]
# reduction operations skip ignored values
np.sum(a)
4

For example, Gary mentioned the common idiom of wanting to take an array and subtract off its mean, and he wants to do that while leaving the masked-out/ignored values unchanged. As long as the above cases work the way I wrote, we will have

np.mean(a)
2
a -= np.mean(a)
a
[-1, IGNORED(2), 1]

Which I'm pretty sure is the result that he wants. (Gary, is that right?) Also numpy.ma follows these rules, so that's some additional evidence that they're reasonable.
(And I think part of the confusion between Lluís and me was that these are the rules that I meant when I said non-propagating, but he understood that to mean something else.) So before we start exploring the whole vast space of possible ways to handle masked-out data, does anyone see any reason to consider rules that don't have, as a subset, the ones above? Do other rules have any use cases or user demand? (I *love* playing with clever mathematics and making things consistent, but there's not much point unless the end result is something that people will use :-).) -- Nathaniel
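Since the post claims numpy.ma already follows these rules, they can be checked directly against numpy.ma (using masks in place of the IGNORED notation):

```python
import numpy as np

a = np.ma.array([1, 2, 3], mask=[False, True, False])  # [1, IGNORED(2), 3]

b = a + 2          # elementwise: the masked slot stays masked
total = a.sum()    # reduction: masked value skipped -> 4
mean = a.mean()    # reduction: masked value skipped -> 2.0
c = a - a.mean()   # Gary's subtract-the-mean idiom; unmasked values only
```

b.compressed() is [3, 5], total is 4, and c.compressed() is [-1.0, 1.0], so np.ma does behave as claimed for these simple cases.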
Re: [Numpy-discussion] in the NA discussion, what can we agree on?
04.11.2011 20:49, T J kirjoitti: [clip] To push this forward a bit, can I propose that IGNORE behave as: PnC The *n* classes can be a bit confusing in Python:

### PnC
x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
ignore(y[1])
z = x + y
z
np.array([5, IGNORE(7), 9])
x += y   # NB: x[1] := x[1] + y[1]
x
np.array([5, 2, 3])
***

I think I defined the destructive and non-destructive in a different way than earlier in the thread. Maybe this behavior from np.ma is closer to what was meant earlier:

x = np.ma.array([1, 2, 3], mask=[0, 0, 1])
y = np.ma.array([4, 5, 6], mask=[0, 1, 1])
x += y
x
masked_array(data = [5 -- --], mask = [False True True], fill_value = 99)
x.data
array([5, 2, 3])

Let's call this (since I botched and already reserved the letter n :)

(m) mark-ignored
a := SPECIAL_1
# -> a == SPECIAL_a ; the payload of the RHS is neglected,
#    the assigned value has the original LHS as the payload

-- Pauli Virtanen
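For reference, the np.ma session above can be re-run as an ordinary script; the in-place add masks the second and third elements but leaves the underlying data there untouched:

```python
import numpy as np

x = np.ma.array([1, 2, 3], mask=[0, 0, 1])
y = np.ma.array([4, 5, 6], mask=[0, 1, 1])
x += y
# x now prints as [5 -- --]: the result mask is the OR of the two masks,
# while the underlying buffer keeps the old values at the masked slots.
```

x.data comes back as [5, 2, 3] and x.mask as [False, True, True], as in the session quoted in the message.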
Re: [Numpy-discussion] in the NA discussion, what can we agree on?
On Fri, Nov 4, 2011 at 2:41 PM, Pauli Virtanen p...@iki.fi wrote: 04.11.2011 20:49, T J kirjoitti: [clip] To push this forward a bit, can I propose that IGNORE behave as: PnC The *n* classes can be a bit confusing in Python:

### PnC
x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
ignore(y[1])
z = x + y
z
np.array([5, IGNORE(7), 9])
x += y   # NB: x[1] := x[1] + y[1]
x
np.array([5, 2, 3])
***

Interesting. I think I defined the destructive and non-destructive in a different way than earlier in the thread. Maybe this behavior from np.ma is closer to what was meant earlier:

x = np.ma.array([1, 2, 3], mask=[0, 0, 1])
y = np.ma.array([4, 5, 6], mask=[0, 1, 1])
x += y
x
masked_array(data = [5 -- --], mask = [False True True], fill_value = 99)
x.data
array([5, 2, 3])

Let's call this (since I botched and already reserved the letter n :)

(m) mark-ignored
a := SPECIAL_1
# -> a == SPECIAL_a ; the payload of the RHS is neglected,
#    the assigned value has the original LHS as the payload

Does this behave as expected for x + y (as opposed to the inplace operation)?

z = x + y
z
np.array([5, IGNORED(2), IGNORED(3)])
x += y
np.array([5, IGNORED(2), IGNORED(3)])

However, doesn't this have the issue that Nathaniel brought up earlier: commutativity, unignore(x + y) != unignore(y + x)?
Re: [Numpy-discussion] in the NA discussion, what can we agree on?
On Fri, Nov 4, 2011 at 11:59 AM, Pauli Virtanen p...@iki.fi wrote: I have a feeling that if you don't start by mathematically defining the scalar operations first, and only after that generalize them to arrays, some conceptual problems may follow. On the other hand, I should note that numpy.ma does not work this way, and many people seem still happy with how it works. Yes, my impression is that people who want MISSING just want something that acts like a special scalar value (PdS, in your scheme), but the people who want IGNORED want something that *can't* be defined in this way (see my other recent post). That said... There are two options how to behave with respect to binary/unary operations:

(P) Propagating
unop(SPECIAL_1) == SPECIAL_new
binop(SPECIAL_1, SPECIAL_2) == SPECIAL_new
binop(a, SPECIAL) == SPECIAL_new

(N) Non-propagating
unop(SPECIAL_1) == SPECIAL_new
binop(SPECIAL_1, SPECIAL_2) == SPECIAL_new
binop(a, SPECIAL) == binop(a, binop.identity) == a

SPECIAL_1 means a special value with payload 1, right? Same thing that some of us have been writing IGNORED(1) in other places? Assuming that, I believe that what people want for IGNORED values is

unop(SPECIAL_1) == SPECIAL_1

which doesn't seem to be an option in your taxonomy. There's also the option of binop(a, SPECIAL) -> error. And three options on what to do on assignment:

(d) Destructive
a := SPECIAL  # -> a == SPECIAL

(n) Non-destructive
a := SPECIAL  # -> a unchanged

(s) Self-destructive
a := SPECIAL_1  # -> if `a` is SPECIAL-class, then a == SPECIAL_1,
                #    otherwise `a` remains unchanged

I'm not sure assignment is a useful way to think about what we've been calling IGNORED values (for MISSING/NA it's fine).
I've been talking about masking/unmasking values or toggling the IGNORED state, because my impression is that what people want is something like:

a[0] = 3
a[0] = SPECIAL
# now a[0] == SPECIAL(3)

This is pretty confusing when written as an assignment (and note that now I'm assigning into an array, because if I were just assigning to a python variable then these semantics would be impossible to implement!). So we might prefer a syntax like

a.visible[0] = False

or

a.ignore(0)

If classified this way, behaviour of items in np.ma arrays is different in different operations, but seems roughly PdX, where X stands for returning a masked value with the first argument as the payload in binary ops if either argument is masked. No -- np.ma implements the assignment semantics I described above, not d semantics. Trimming some output for readability:

a = np.ma.masked_array([1, 2, 3])
a[1] = np.ma.masked
a
[1, --, 3]
a.mask[1] = False
a
[1, 2, 3]

So assignment is not destructive -- the old value is retained as the payload. -- Nathaniel
Re: [Numpy-discussion] in the NA discussion, what can we agree on?
On Fri, Nov 4, 2011 at 2:29 PM, Nathaniel Smith n...@pobox.com wrote: On Fri, Nov 4, 2011 at 1:22 PM, T J tjhn...@gmail.com wrote: I agree that it would be ideal if the default were to skip IGNORED values, but that behavior seems inconsistent with its propagation properties (such as when adding arrays with IGNORED values). To illustrate, when we did x+2, we were stating that: IGNORED(2) + 2 == IGNORED(4) which means that we propagated the IGNORED value. If we were to skip them by default, then we'd have: IGNORED(2) + 2 == 2 To be consistent, then it seems we also should have had: x + 2 [3, 2, 5] which I think we can agree is not so desirable. What this seems to come down to is that we tend to want different behavior when we are doing reductions, and that for IGNORED data, we want it to propagate in every situation except for a reduction (where we want to skip over it). I don't know if there is a well-defined way to distinguish reductions from the other operations. Would it hold for generalized ufuncs? Would it hold for other functions which might return arrays instead of scalars? Continuing my theme of looking for consensus first... there are obviously a ton of ugly corners in here. But my impression is that at least for some simple cases, it's clear what users want: a = [1, IGNORED(2), 3] # array-with-ignored-values + unignored scalar only affects unignored values a + 2 [3, IGNORED(2), 5] # reduction operations skip ignored values np.sum(a) 4 For example, Gary mentioned the common idiom of wanting to take an array and subtract off its mean, and he wants to do that while leaving the masked-out/ignored values unchanged. As long as the above cases work the way I wrote, we will have np.mean(a) 2 a -= np.mean(a) a [-1, IGNORED(2), 1] Which I'm pretty sure is the result that he wants. (Gary, is that right?) Also numpy.ma follows these rules, so that's some additional evidence that they're reasonable. 
(And I think part of the confusion between Lluís and me was that these are the rules that I meant when I said non-propagating, but he understood that to mean something else.) So before we start exploring the whole vast space of possible ways to handle masked-out data, does anyone see any reason to consider rules that don't have, as a subset, the ones above? Do other rules have any use cases or user demand? (I *love* playing with clever mathematics and making things consistent, but there's not much point unless the end result is something that people will use :-).) I guess I'm just confused on how one, in principle, would distinguish the various forms of propagation that you are suggesting (i.e., for reductions). I also don't think it is good that we lack commutativity. If we disallow unignoring, then yes, I agree that what you wrote above is what people want. But if we are allowed to unignore, then I do not. Also, how does something like this get handled?

a = [1, 2, IGNORED(3), NaN]

If I were to ask what the mean of 'a' is, then I think most of the time people would want 1.5. I guess if we kept nanmean around, then we could do:

a -= np.nanmean(a)
[-.5, .5, IGNORED(3), NaN]

Sorry if this is considered digging deeper than consensus. I'm just curious if arrays having NaNs in them, in addition to IGNORED, causes problems.
Re: [Numpy-discussion] in the NA discussion, what can we agree on?
On Fri, Nov 4, 2011 at 3:04 PM, Nathaniel Smith n...@pobox.com wrote: On Fri, Nov 4, 2011 at 11:59 AM, Pauli Virtanen p...@iki.fi wrote: If classified this way, behaviour of items in np.ma arrays is different in different operations, but seems roughly PdX, where X stands for returning a masked value with the first argument as the payload in binary ops if either argument is masked. No -- np.ma implements the assignment semantics I described above, not d semantics. Trimming some output for readability: Oops, I see we cross-posted :-). So never-mind that part of my email... Okay, I'm going to follow my own suggestion now and stop talking about these thorny details until I know whether we can simplify the discussion to only considering schemes that are consistent with the rules I posted about here: http://article.gmane.org/gmane.comp.python.numeric.general/46760 -- Nathaniel
Re: [Numpy-discussion] in the NA discussion, what can we agree on?
04.11.2011 22:57, T J kirjoitti: [clip] (m) mark-ignored a := SPECIAL_1 # -> a == SPECIAL_a ; the payload of the RHS is neglected, the assigned value has the original LHS as the payload [clip] Does this behave as expected for x + y (as opposed to the inplace operation)? [clip] The definition is for assignment, and does not concern binary ops, so it behaves as expected. The inplace operation

x += y

is defined as equivalent to a binary op followed by assignment,

x[:] = x + y

as far as the missing values are concerned. However, doesn't this have the issue that Nathaniel brought up earlier: commutativity unignore(x + y) != unignore(y + x) As the definition concerns only what happens on assignment, it does not have problems with commutativity. -- Pauli Virtanen
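A toy model of the (m) mark-ignored assignment rule might look as follows; Slot, Special, and assign are invented names purely to illustrate that the RHS payload is discarded and the old LHS value survives as the payload:

```python
class Special:
    """Toy special value; its payload is what (m)-assignment discards."""

    def __init__(self, payload=None):
        self.payload = payload

class Slot:
    """One array element under the (m) mark-ignored assignment rule."""

    def __init__(self, value):
        self.value = value      # doubles as the payload once ignored
        self.ignored = False

    def assign(self, rhs):
        if isinstance(rhs, Special):
            # (m): the RHS payload is neglected; the old LHS value
            # becomes the payload of the now-ignored slot.
            self.ignored = True
        else:
            self.value = rhs
            self.ignored = False

a = Slot(3)
a.assign(Special(99))   # the 99 is thrown away
# a.ignored is now True and the payload a.value is still 3
```

Because the rule is purely about assignment, nothing here touches binary ops, which is Pauli's point about commutativity above.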
Re: [Numpy-discussion] in the NA discussion, what can we agree on?
On Fri, Nov 4, 2011 at 3:08 PM, T J tjhn...@gmail.com wrote: On Fri, Nov 4, 2011 at 2:29 PM, Nathaniel Smith n...@pobox.com wrote: Continuing my theme of looking for consensus first... there are obviously a ton of ugly corners in here. But my impression is that at least for some simple cases, it's clear what users want: a = [1, IGNORED(2), 3] # array-with-ignored-values + unignored scalar only affects unignored values a + 2 [3, IGNORED(2), 5] # reduction operations skip ignored values np.sum(a) 4 For example, Gary mentioned the common idiom of wanting to take an array and subtract off its mean, and he wants to do that while leaving the masked-out/ignored values unchanged. As long as the above cases work the way I wrote, we will have np.mean(a) 2 a -= np.mean(a) a [-1, IGNORED(2), 1] Which I'm pretty sure is the result that he wants. (Gary, is that right?) Also numpy.ma follows these rules, so that's some additional evidence that they're reasonable. (And I think part of the confusion between Lluís and me was that these are the rules that I meant when I said non-propagating, but he understood that to mean something else.) So before we start exploring the whole vast space of possible ways to handle masked-out data, does anyone see any reason to consider rules that don't have, as a subset, the ones above? Do other rules have any use cases or user demand? (I *love* playing with clever mathematics and making things consistent, but there's not much point unless the end result is something that people will use :-).) I guess I'm just confused on how one, in principle, would distinguish the various forms of propagation that you are suggesting (ie for reductions). Well, numpy.ma does work this way, so certainly it's possible to do. At the code level, np.add() and np.add.reduce() are different entry points and can behave differently. OTOH, it might be that it's impossible to do *while still maintaining other things we care about*... 
but in that case we should just shake our fists at the mathematics and then give up, instead of coming up with an elegant system that isn't actually useful. So that's why I think we should figure out what's useful first. I also don't think it is good that we lack commutativity. If we disallow unignoring, then yes, I agree that what you wrote above is what people want. But if we are allowed to unignore, then I do not. I *think* that for the no-unignoring (also known as MISSING) case, we have a pretty clear consensus that we want something like:

a + 2
[3, MISSING, 5]
np.sum(a)
MISSING
np.sum(a, skip_MISSING=True)
4

(Please say if you disagree, but I really hope you don't!) This case is also easier, because we don't even have to allow a skip_MISSING flag in cases where it doesn't make sense (e.g., unary or binary operations) -- it's a convenience feature, so no-one will care if it only works when it's useful ;-). The use case that we're still confused about is specifically the one where people want to *temporarily* hide parts of their data, do some calculations that ignore those parts of their data, and then unhide that data again -- e.g., see Gary's first post in this thread. So for this use case, allowing unignore is definitely important, and having np.sum() return IGNORED seems pretty useless to me. (When an operation involves actually missing data, then you need to stop and think what would be a statistically meaningful way to handle that -- sometimes it's skip_MISSING, sometimes something else. So np.sum returning MISSING is useful - it tells you something you might not have realized. If you just ignored some data because you want to ignore that data, then having np.sum return IGNORED is useless, because it tells you something you already knew perfectly well.) Also, how does something like this get handled?

a = [1, 2, IGNORED(3), NaN]

If I were to say, What is the mean of 'a'?, then I think most of the time people would want 1.5. I would want NaN!
But that's because the only way I get NaN's is when I do dumb things like compute log(0), and again, I want my code to tell me that I was dumb instead of just quietly making up a meaningless answer. -- Nathaniel ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
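As a concrete check of the rules discussed above, here is a small sketch of the same mean-subtraction computation with today's numpy.ma (behaviour checked against a recent numpy; the exact repr may differ by version):

```python
import numpy as np

# a = [1, IGNORED(2), 3], written as a masked array
a = np.ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])

m = a.mean()   # reductions skip the masked value: (1 + 3) / 2
assert float(m) == 2.0

a -= m         # in-place ops only touch unmasked entries
assert float(a[0]) == -1.0 and float(a[2]) == 1.0
assert bool(np.ma.getmaskarray(a)[1])   # the middle entry stays ignored
```

Whether the hidden payload under the mask is still 2 afterwards is exactly the destructiveness question being debated in this thread.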
Re: [Numpy-discussion] in the NA discussion, what can we agree on?
04.11.2011 23:04, Nathaniel Smith wrote: [clip] Assuming that, I believe that what people want for IGNORED values is unop(SPECIAL_1) == SPECIAL_1 which doesn't seem to be an option in your taxonomy. Well, you can always add a new branch for rules on what to do with unary ops. [clip] I'm not sure assignment is a useful way to think about what we've been calling IGNORED values (for MISSING/NA it's fine). I've been talking about masking/unmasking values or toggling the IGNORED state, because my impression is that what people want is something like: a[0] = 3 a[0] = SPECIAL # now a[0] == SPECIAL(3) That's partly syntax sugar. What I meant above by assignment is what happens on a[:] = b and what should occur in in-place operations, a += b which are equivalent to a[:] = a + b Yeah, it's a different definition for destructive and non-destructive than what was used earlier in the discussion. [clip] If classified this way, the behaviour of items in np.ma arrays is different in different operations, but seems roughly PdX, where X stands for returning a masked value with the first argument as the payload in binary ops if either argument is masked. No -- np.ma implements the assignment semantics I described above, not d semantics. Trimming some output for readability: Well, np.ma implements d semantics, but because of the way its binary ops are noncommutative, in-place binary ops behave as if they were not mutating. Assignments do actually change the masked data: a[:] = b also changes the masked values in `a`. That may be a bug. -- Pauli Virtanen
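The assignment behaviour Pauli describes is easy to demonstrate; the following sketch (checked against numpy.ma's current semantics, though this may be version-dependent) shows `a[:] = b` overwriting the data under the mask:

```python
import numpy as np

a = np.ma.masked_array([1, 2, 3], mask=[False, True, False])
b = np.ma.masked_array([10, 20, 30], mask=[False, True, False])

a[:] = b
# The whole data buffer is copied, including the value under the mask:
assert np.ma.getdata(a).tolist() == [10, 20, 30]
assert bool(np.ma.getmaskarray(a)[1])
```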
Re: [Numpy-discussion] in the NA discussion, what can we agree on?
On Fri, Nov 4, 2011 at 3:38 PM, Nathaniel Smith n...@pobox.com wrote: On Fri, Nov 4, 2011 at 3:08 PM, T J tjhn...@gmail.com wrote: On Fri, Nov 4, 2011 at 2:29 PM, Nathaniel Smith n...@pobox.com wrote: Continuing my theme of looking for consensus first... there are obviously a ton of ugly corners in here. But my impression is that at least for some simple cases, it's clear what users want: a = [1, IGNORED(2), 3] # array-with-ignored-values + unignored scalar only affects unignored values a + 2 [3, IGNORED(2), 5] # reduction operations skip ignored values np.sum(a) 4 For example, Gary mentioned the common idiom of wanting to take an array and subtract off its mean, and he wants to do that while leaving the masked-out/ignored values unchanged. As long as the above cases work the way I wrote, we will have np.mean(a) 2 a -= np.mean(a) a [-1, IGNORED(2), 1] Which I'm pretty sure is the result that he wants. (Gary, is that right?) Also numpy.ma follows these rules, so that's some additional evidence that they're reasonable. (And I think part of the confusion between Lluís and me was that these are the rules that I meant when I said non-propagating, but he understood that to mean something else.) So before we start exploring the whole vast space of possible ways to handle masked-out data, does anyone see any reason to consider rules that don't have, as a subset, the ones above? Do other rules have any use cases or user demand? (I *love* playing with clever mathematics and making things consistent, but there's not much point unless the end result is something that people will use :-).) I guess I'm just confused on how one, in principle, would distinguish the various forms of propagation that you are suggesting (ie for reductions). Well, numpy.ma does work this way, so certainly it's possible to do. At the code level, np.add() and np.add.reduce() are different entry points and can behave differently. 
I see your point, but that seems like just an API difference with a bad name. reduce() is just calling add() a bunch of times, so it seems like it should behave as add() does. That we can create different behaviors with various assignment rules (like Pauli's 'm' for mark-ignored), only makes it more confusing to me. a = 1 a += 2 a += IGNORE b = 1 + 2 + IGNORE I think having a == b is essential. If they can be different, that will only lead to confusion. On this point alone, does anyone think it is acceptable to have a != b? OTOH, it might be that it's impossible to do *while still maintaining other things we care about*... but in that case we should just shake our fists at the mathematics and then give up, instead of coming up with an elegant system that isn't actually useful. So that's why I think we should figure out what's useful first. Agreed. I'm on the same page. I also don't think it is good that we lack commutativity. If we disallow unignoring, then yes, I agree that what you wrote above is what people want. But if we are allowed to unignore, then I do not. I *think* that for the no-unignoring (also known as MISSING) case, we have a pretty clear consensus that we want something like: a + 2 [3, MISSING, 5] np.sum(a) MISSING np.sum(a, skip_MISSING=True) 4 (Please say if you disagree, but I really hope you don't!) This case is also easier, because we don't even have to allow a skip_MISSING flag in cases where it doesn't make sense (e.g., unary or binary operations) -- it's a convenience feature, so no-one will care if it only works when it's useful ;-). Yes, in agreement. I was talking specifically about the IGNORE case. And my point is that if we allow people to remove the IGNORE flag and see the original data (and if the payloads are computed), then we should care about commutativity: x = [1, IGNORE(2), 3] x2 = x.copy() y = [10, 11, IGNORE(12)] z = x + y a = z.sum() x += y b = x.sum() y += x2 c = y.sum() So, we should have: a == b == c. 
Additionally, if we allow users to unignore data, then we should have: x = [1, IGNORE(2), 3] x2 = x.copy() y = [10, 11, IGNORE(12)] x += y aa = unignore(x).sum() y += x2 bb = unignore(y).sum() aa == bb True Is there agreement on this? Also, how does something like this get handled? a = [1, 2, IGNORED(3), NaN] If I were to say, What is the mean of 'a'?, then I think most of the time people would want 1.5. I would want NaN! But that's because the only way I get NaN's is when I do dumb things like compute log(0), and again, I want my code to tell me that I was dumb instead of just quietly making up a meaningless answer. That's definitely field-specific then. In probability, 0 * log(0) = 0 is a common idiom. In NumPy, 0 * log(0) gives NaN, so you'd want to ignore them when summing.
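T J's requirement can be prototyped without numpy at all. The `Ignored` class below is purely hypothetical -- a sketch of "propagating, payload-computing" (PdC) addition plus a skipping reduction -- just to confirm that the three sums agree:

```python
class Ignored:
    """Hypothetical ignored value whose payload is computed (PdC rules)."""
    def __init__(self, payload):
        self.payload = payload

    def __add__(self, other):
        other_payload = other.payload if isinstance(other, Ignored) else other
        return Ignored(self.payload + other_payload)

    __radd__ = __add__


def add(u, v):
    """Elementwise addition of two lists."""
    return [p + q for p, q in zip(u, v)]


def total(values):
    """Reduction that skips ignored entries."""
    return sum(v for v in values if not isinstance(v, Ignored))


x = [1, Ignored(2), 3]
x2 = list(x)               # x.copy()
y = [10, 11, Ignored(12)]

z = add(x, y)
a = total(z)               # z.sum()

x = add(x, y)              # x += y
b = total(x)

y = add(y, x2)             # y += x2
c = total(y)

assert a == b == c == 11   # the commutativity requirement holds
```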
Re: [Numpy-discussion] in the NA discussion, what can we agree on?
04.11.2011 23:29, Pauli Virtanen wrote: [clip] As the definition concerns only what happens on assignment, it does not have problems with commutativity. This is of course then not really true in a wider sense, as an example from T J shows: a = 1 a += IGNORE(3) # -> a := a + IGNORE(3) # -> a := IGNORE(4) # -> a == IGNORE(1) which is different from a = 1 + IGNORE(3) # -> a == IGNORE(4) Damn, it seemed so good. Probably anything except destructive assignment leads to problems like this with propagating special values. -- Pauli Virtanen
Re: [Numpy-discussion] in the NA discussion, what can we agree on?
On Fri, Nov 4, 2011 at 4:29 PM, Pauli Virtanen p...@iki.fi wrote: 04.11.2011 23:29, Pauli Virtanen wrote: [clip] As the definition concerns only what happens on assignment, it does not have problems with commutativity. This is of course then not really true in a wider sense, as an example from T J shows: a = 1 a += IGNORE(3) # -> a := a + IGNORE(3) # -> a := IGNORE(4) # -> a == IGNORE(1) which is different from a = 1 + IGNORE(3) # -> a == IGNORE(4) Damn, it seemed so good. Probably anything except destructive assignment leads to problems like this with propagating special values. Ok... with what I understand now, it seems like for almost all operations: MISSING : PdS IGNORED : PdC (this gives commutativity when unignoring data points) When you want some sort of reduction, we want to change the behavior for IGNORED so that it skips the IGNORED values by default. Personally, I still believe that this inconsistent behavior warrants a new method name. What I mean is: x = np.array([1, IGNORED(2), 3]) y = x.sum() z = x[0] + x[1] + x[2] To say that y != z will only be a source of confusion. To remedy, we force people to be explicit, even if they'll need to be explicit 99% of the time: q = x.sum(skipIGNORED=True) Then we can have y == z and y != q. To make the 99% use case easier, we provide a new method which passes the keyword for us. With PdS and PdC it seems rather clear to me why MISSING should be implemented as a bit pattern and IGNORED implemented using masks. Setting implementation details aside and going back to Nathaniel's original biggest *un*resolved question, I am now convinced that these (IGNORED and MISSING) should be distinct API concepts and yet still distinct from NaN with floating point dtypes. The NA implementation in NumPy does not seem to match either of these (IGNORED and MISSING) exactly. One cannot, as far as I know, unignore an element marked as NA.
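NumPy's NaN handling is an existing precedent for the separate-method-name approach: np.sum propagates NaN exactly like the element-by-element chain, while the explicitly named np.nansum skips. A quick sketch of the analogy:

```python
import numpy as np

x = np.array([1.0, np.nan, 3.0])   # stand-in for [1, IGNORED(2), 3]

y = np.nansum(x)                   # explicitly named skipping reduction
z = x[0] + x[1] + x[2]             # element-by-element chain: propagates

assert y == 4.0
assert np.isnan(z)
assert np.isnan(np.sum(x))         # the default sum matches the chain
```

Here the default reduction agrees with the elementwise chain, and skipping requires opting in by name, which is the behaviour T J argues for.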
Re: [Numpy-discussion] in the NA discussion, what can we agree on?
04.11.2011 22:29, Nathaniel Smith wrote: [clip] Continuing my theme of looking for consensus first... there are obviously a ton of ugly corners in here. But my impression is that at least for some simple cases, it's clear what users want: a = [1, IGNORED(2), 3] # array-with-ignored-values + unignored scalar only affects unignored values a + 2 [3, IGNORED(2), 5] # reduction operations skip ignored values np.sum(a) 4 This can break commutativity: a = [1, IGNORED(2), 3] b = [4, IGNORED(5), 6] x = a + b y = b + a x[1] = ??? y[1] = ??? Defining unop(IGNORED(a)) == IGNORED(a) binop(IGNORED(a), b) == IGNORED(a) binop(a, IGNORED(b)) == IGNORED(b) binop(IGNORED(a), IGNORED(b)) == IGNORED(binop(a, b)) # or NA could however get around that. That seems to be pretty much how NaN works, except that it now carries a hidden value with it. For example, Gary mentioned the common idiom of wanting to take an array and subtract off its mean, and he wants to do that while leaving the masked-out/ignored values unchanged. As long as the above cases work the way I wrote, we will have np.mean(a) 2 a -= np.mean(a) a [-1, IGNORED(2), 1] That would be propagating + the above NaN-like rules for binary operators. Whether the reduction methods have skip_IGNORE=True as default or not is in my opinion more of an API question, rather than a question on how the algebra of ignored values should work. *** If destructive assignment is really needed to avoid problems with commutation, [see T. J. (2011)] is then maybe a problem. So, one would need to have x = [1, IGNORED(2), 3] y = [1, IGNORED(2), 3] z = [4, IGNORED(5), IGNORED(6)] x[:] = z x [4, IGNORED(5), IGNORED(6)] y += z y [4, IGNORED(7), IGNORED(6)] This is not how np.ma works.
But if you do otherwise, there doesn't seem to be any guarantee that a += 42 a += b is the same thing as a += b a += 42 [clip] So before we start exploring the whole vast space of possible ways to handle masked-out data, does anyone see any reason to consider rules that don't have, as a subset, the ones above? Do other rules have any use cases or user demand? (I *love* playing with clever mathematics and making things consistent, but there's not much point unless the end result is something that people will use :-).) Yep, it's important to keep in mind what people want. People however tend to implicitly expect that simple arithmetic operations on arrays, containing ignored values or not, operate in a certain way. Actually stating how these operations work with scalars gives valuable insight on how you'd like things to work. Also, if you propose to break the rules of arithmetic, in a fundamental library meant for scientific computation, you should be aware that you do so, and how you do so. I mean, at least for me it was not clear before this formulation that there was a reason why binary ops in np.ma were not commutative! Now I kind of see that there is an asymmetry in assignment into masked arrays, and there is a conflict with commuting operations and with what you'd expect ignored values to do. I'm not sure if it's possible to get rid of this problem, but it could be possible to restrict it to assignments and in-place operations rather than having it in binary ops. -- Pauli Virtanen
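The four rules Pauli lists can be spelled out as a hypothetical sketch in plain Python; with them, elementwise a + b and b + a agree even when the arrays are masked in different places (unlike np.ma's first-argument rule):

```python
class Ignored:
    """Hypothetical ignored value following the NaN-like rules above."""
    def __init__(self, payload):
        self.payload = payload

    def __eq__(self, other):
        return isinstance(other, Ignored) and self.payload == other.payload

    def __add__(self, other):
        if isinstance(other, Ignored):
            # binop(IGNORED(a), IGNORED(b)) == IGNORED(binop(a, b))
            return Ignored(self.payload + other.payload)
        # binop(IGNORED(a), b) == IGNORED(a): the payload is kept, not computed
        return Ignored(self.payload)

    def __radd__(self, other):
        # binop(a, IGNORED(b)) == IGNORED(b)
        return Ignored(self.payload)


a = [1, Ignored(2), 3, Ignored(4)]
b = [10, 20, Ignored(30), Ignored(40)]

x = [p + q for p, q in zip(a, b)]
y = [q + p for p, q in zip(a, b)]

assert x == [11, Ignored(2), Ignored(30), Ignored(44)]
assert x == y   # commutes, because neither one-sided rule computes a payload
```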
Re: [Numpy-discussion] in the NA discussion, what can we agree on?
Also, how does something like this get handled? a = [1, 2, IGNORED(3), NaN] If I were to say, What is the mean of 'a'?, then I think most of the time people would want 1.5. I would want NaN! But that's because the only way I get NaN's is when I do dumb things like compute log(0), and again, I want my code to tell me that I was dumb instead of just quietly making up a meaningless answer. As another data point, I prefer control over this sort of situation. Sometimes I'm completely in agreement with Nathaniel and want the operation to fail. Other times I am forced to perform operations (e.g. log) on a huge matrix, and I fully expect some 0s may be in there. For a complex enough chain of operations, looking for all the bad apples at each step in the chain can be prohibitive, so in those cases I'm looking for compute it if you can, and give me a NaN if you can't ... G The information in this e-mail is intended only for the person to whom it is addressed. If you believe this e-mail was sent to you in error and the e-mail contains patient information, please contact the Partners Compliance HelpLine at http://www.partners.org/complianceline . If the e-mail was sent to you in error but does not contain patient information, please contact the sender and properly dispose of the e-mail.
Re: [Numpy-discussion] in the NA discussion, what can we agree on?
05.11.2011 00:14, T J wrote: [clip] a = 1 a += 2 a += IGNORE b = 1 + 2 + IGNORE I think having a == b is essential. If they can be different, that will only lead to confusion. On this point alone, does anyone think it is acceptable to have a != b? It seems to me that requiring this sort of a thing gives some limitations on how array operations should behave. An acid test for proposed rules: given two arrays `a` and `b`, a = [1, 2, IGNORED(3), IGNORED(4)] b = [10, IGNORED(20), 30, IGNORED(40)] (a) Are the following pieces of code equivalent: print unmask(a + b) a += 42 a += b print unmask(a) and print unmask(b + a) a += b a += 42 print unmask(a) (b) Are the following two statements equivalent (wrt. ignored values): a += b a[:] = a + b For np.ma (a) is false whereas (b) is true. For arrays containing nans, on the other hand, (a) and (b) are both true (but of course, in this case values cannot be unmasked). Is there a way to define operations so that (a) is true, while retaining the desired other properties of arrays with ignored values? Is there a real-world need to have (a) be true? -- Pauli Virtanen
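Pauli's NaN remark is easy to verify: for plain float arrays containing nans, both (a) and (b) hold (though of course nothing can be unmasked afterwards). A quick check, assuming numpy >= 1.19 for the equal_nan keyword:

```python
import numpy as np

a0 = np.array([1.0, 2.0, np.nan, np.nan])
b = np.array([10.0, np.nan, 30.0, np.nan])

# (a) does the order of the in-place updates matter?
a = a0.copy(); a += 42; a += b
a_then = a.copy()
a = a0.copy(); a += b; a += 42
assert np.array_equal(a_then, a, equal_nan=True)     # it does not

# (b) is `a += b` the same as `a[:] = a + b`?
a = a0.copy(); a += b
a_inplace = a.copy()
a = a0.copy(); a[:] = a + b
assert np.array_equal(a_inplace, a, equal_nan=True)  # yes
```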
Re: [Numpy-discussion] in the NA discussion, what can we agree on?
On Fri, Nov 4, 2011 at 6:31 PM, Pauli Virtanen p...@iki.fi wrote: 05.11.2011 00:14, T J wrote: [clip] a = 1 a += 2 a += IGNORE b = 1 + 2 + IGNORE I think having a == b is essential. If they can be different, that will only lead to confusion. On this point alone, does anyone think it is acceptable to have a != b? It seems to me that requiring this sort of a thing gives some limitations on how array operations should behave. An acid test for proposed rules: given two arrays `a` and `b`, a = [1, 2, IGNORED(3), IGNORED(4)] b = [10, IGNORED(20), 30, IGNORED(40)] (a) Are the following pieces of code equivalent: print unmask(a + b) a += 42 a += b print unmask(a) and print unmask(b + a) a += b a += 42 print unmask(a) (b) Are the following two statements equivalent (wrt. ignored values): a += b a[:] = a + b For np.ma (a) is false whereas (b) is true. For arrays containing nans, on the other hand (a) and (b) are both true (but of course, in this case values cannot be unmasked). Is there a way to define operations so that (a) is true, while retaining the desired other properties of arrays with ignored values? I thought that PdC satisfied (a) and (b). Let me show you what I thought they were. Perhaps I am not being consistent. If so, point out my mistake. (A) You are making two comparisons. (A1) Does unmask(a+b) == unmask(b + a) ? Yes. They both equal: unmask([11, IGNORED(22), IGNORED(33), IGNORED(44)]) = [11, 22, 33, 44] (A2) Is 'a' the same after a += 42; a += b and a += b; a += 42? Yes.
a = [1, 2, IGNORED(3), IGNORED(4)] b = [10, IGNORED(20), 30, IGNORED(40)] a += 42; a [43, 44, IGNORED(45), IGNORED(46)] a += b; a [53, IGNORED(64), IGNORED(75), IGNORED(86)] vs a = [1, 2, IGNORED(3), IGNORED(4)] b = [10, IGNORED(20), 30, IGNORED(40)] a += b; a [11, IGNORED(22), IGNORED(33), IGNORED(44)] a += 42; a [53, IGNORED(64), IGNORED(75), IGNORED(86)] For part (B), I thought we were in agreement that in-place assignment should be defined so that: a += b is equivalent to: tmp = a + b a = tmp If so, this definitely holds. Have I missed something? Probably. Please spell it out for me.
Re: [Numpy-discussion] in the NA discussion, what can we agree on?
On Fri, Nov 4, 2011 at 7:43 PM, T J tjhn...@gmail.com wrote: On Fri, Nov 4, 2011 at 6:31 PM, Pauli Virtanen p...@iki.fi wrote: An acid test for proposed rules: given two arrays `a` and `b`, a = [1, 2, IGNORED(3), IGNORED(4)] b = [10, IGNORED(20), 30, IGNORED(40)] [...] (A1) Does unmask(a+b) == unmask(b + a) ? Yes. They both equal: unmask([11, IGNORED(22), IGNORED(33), IGNORED(44)]) = [11, 22, 33, 44] Again, I really don't think you're going to be able to sell an API where [2] + [IGNORED(20)] == [IGNORED(22)] I mean, it's not me you have to convince, it's Gary, Pierre, maybe Benjamin, Lluís, etc. So I could be wrong. But you might want to figure that out first before making plans based on this... -- Nathaniel
Re: [Numpy-discussion] in the NA discussion, what can we agree on?
On Fri, Nov 4, 2011 at 8:03 PM, Nathaniel Smith n...@pobox.com wrote: On Fri, Nov 4, 2011 at 7:43 PM, T J tjhn...@gmail.com wrote: On Fri, Nov 4, 2011 at 6:31 PM, Pauli Virtanen p...@iki.fi wrote: An acid test for proposed rules: given two arrays `a` and `b`, a = [1, 2, IGNORED(3), IGNORED(4)] b = [10, IGNORED(20), 30, IGNORED(40)] [...] (A1) Does unmask(a+b) == unmask(b + a) ? Yes. They both equal: unmask([11, IGNORED(22), IGNORED(33), IGNORED(44)]) = [11, 22, 33, 44] Again, I really don't think you're going to be able to sell an API where [2] + [IGNORED(20)] == [IGNORED(22)] I mean, it's not me you have to convince, it's Gary, Pierre, maybe Benjamin, Lluís, etc. So I could be wrong. But you might want to figure that out first before making plans based on this... But this is how np.ma currently does it, except that it doesn't compute the payload---it just calls it IGNORED. And it seems that this generalizes the way people want it to: z = [2, 4] + [IGNORED(20), 3] z [IGNORED(24), 7] z.sum(skip_ignored=True) # True could be the default 7 z.sum(skip_ignored=False) IGNORED(31) I guess I am confused because it seems that you implicitly used this same rule here: Say we have a = np.array([1, IGNORED(2), 3]) b = np.array([10, 20, 30]) (Here I'm using IGNORED(2) to mean a value that is currently ignored, but if you unmasked it it would have the value 2.) Then we have: # non-propagating **or** propagating, doesn't matter: a + 2 [3, IGNORED(2), 5] That is, element-wise, you had to have done: IGNORED(2) + 2 -> IGNORED(2). I said it should be equal to IGNORED(4), but the result is still some form of ignore. Sorry if I am missing the bigger picture at this point; it's late and it's a Friday.
Re: [Numpy-discussion] in the NA discussion, what can we agree on?
On Fri, Nov 4, 2011 at 10:33 PM, T J tjhn...@gmail.com wrote: On Fri, Nov 4, 2011 at 8:03 PM, Nathaniel Smith n...@pobox.com wrote: On Fri, Nov 4, 2011 at 7:43 PM, T J tjhn...@gmail.com wrote: On Fri, Nov 4, 2011 at 6:31 PM, Pauli Virtanen p...@iki.fi wrote: An acid test for proposed rules: given two arrays `a` and `b`, a = [1, 2, IGNORED(3), IGNORED(4)] b = [10, IGNORED(20), 30, IGNORED(40)] [...] (A1) Does unmask(a+b) == unmask(b + a) ? Yes. They both equal: unmask([11, IGNORED(22), IGNORED(33), IGNORED(44)]) = [11, 22, 33, 44] Again, I really don't think you're going to be able to sell an API where [2] + [IGNORED(20)] == [IGNORED(22)] I mean, it's not me you have to convince, it's Gary, Pierre, maybe Benjamin, Lluís, etc. So I could be wrong. But you might want to figure that out first before making plans based on this... But this is how np.ma currently does it, except that it doesn't compute the payload---it just calls it IGNORED. And it seems that this generalizes the way people want it to: z = [2, 4] + [IGNORED(20), 3] z [IGNORED(24), 7] z.sum(skip_ignored=True) # True could be the default 7 z.sum(skip_ignored=False) IGNORED(31) I guess I am confused because it seems that you implicitly used this same rule here: Say we have a = np.array([1, IGNORED(2), 3]) b = np.array([10, 20, 30]) (Here I'm using IGNORED(2) to mean a value that is currently ignored, but if you unmasked it it would have the value 2.) Then we have: # non-propagating **or** propagating, doesn't matter: a + 2 [3, IGNORED(2), 5] That is, element-wise, you had to have done: IGNORED(2) + 2 -> IGNORED(2). I said it should be equal to IGNORED(4), but the result is still some form of ignore. Sorry if I am missing the bigger picture at this point; it's late and it's a Friday. This scheme is actually somewhat intriguing. Not totally convinced, but intrigued. Unfortunately, I fell behind in the postings by having dinner...
We probably should start a new thread soon with a bunch of this stuff solidified and stated to give others a chance to hop back into the game. Maybe a table of some sort with pros/cons (mathematically speaking, deferring implementation details for later). I swear, if we can get this to make sense... we should have a Nobel prize or something. Ben Root
Re: [Numpy-discussion] in the NA discussion, what can we agree on?
On Thu, Nov 3, 2011 at 7:54 PM, Gary Strangman str...@nmr.mgh.harvard.edu wrote: For the non-destructive+propagating case, do I understand correctly that this would mean I (as a user) could temporarily decide to IGNORE certain portions of my data, perform a series of computation on that data, and the IGNORED flag (or however it is implemented) would be propagated from computation to computation? If that's the case, I suspect I'd use it all the time ... to effectively perform data subsetting without generating (partial) copies of large datasets. But maybe I misunderstand the intended notion of propagation ... I *think* it's more subtle than that, but I admit I'm somewhat confused about how exactly people would want IGNORED to work in various corner cases. (This is another part of why figuring out our audience/use-cases seems like an important first step to me... fortunately the semantics for MISSING are, I think, much more clear.) Say we have a = np.array([1, IGNORED(2), 3]) b = np.array([10, 20, 30]) (Here's I'm using IGNORED(2) to mean a value that is currently ignored, but if you unmasked it it would have the value 2.) Then we have: # non-propagating *or* propagating, doesn't matter: a + 2 [3, IGNORED(2), 5] # non-propagating: a + b One of these, I don't know which: [11, IGNORED(2), 33] # numpy.ma chooses this [11, 20, 33] Error: shape mismatch (An error is maybe the most *consistent* option; the suggestion in the alterNEP was that masks had to match on all axes that were *not* broadcast, so a + 2 and a + a are okay, but a + b is an error. I assume the numpy.ma approach is also useful, but note that it has the surprising effect that addition is not commutative: IGNORED(x) + IGNORED(y) = IGNORED(x). 
Try it: masked1 = np.ma.masked_array([1, 2, 3], mask=[False, True, False]) masked2 = np.ma.masked_array([10, 20, 30], mask=[False, True, False]) np.asarray(masked1 + masked2) # [11, 2, 33] np.asarray(masked2 + masked1) # [11, 20, 33] I don't really know what people would prefer.) # propagating: a + b One of these, I don't know which: [11, IGNORED(2), 33] # same as numpy.ma, again [11, IGNORED(22), 33] # non-propagating: np.sum(a) 4 # propagating: np.sum(a) One of these, I don't know which: IGNORED(4) IGNORED(6) So from your description, I wouldn't say that you necessarily want non-destructive+propagating -- it really depends on exactly what computations you want to perform, and how you expect them to work. The main difference is how reduction operations are treated. I kind of feel like the non-propagating version makes more sense overall, but I don't know if there's any consensus on that. (You also have the option of just using the new where= argument to your ufuncs, which avoids some of this confusion because it gives a single mask that would apply to the whole operation. The ambiguities here arise because it's not clear what to do when applying a binary operation to two arrays that have different masks.) Maybe you could give some examples of the kinds of computations you're thinking of? -- Nathaniel
Re: [Numpy-discussion] in the NA discussion, what can we agree on?
On Friday, November 4, 2011, Nathaniel Smith n...@pobox.com wrote: On Thu, Nov 3, 2011 at 7:54 PM, Gary Strangman str...@nmr.mgh.harvard.edu wrote: For the non-destructive+propagating case, do I understand correctly that this would mean I (as a user) could temporarily decide to IGNORE certain portions of my data, perform a series of computation on that data, and the IGNORED flag (or however it is implemented) would be propagated from computation to computation? If that's the case, I suspect I'd use it all the time ... to effectively perform data subsetting without generating (partial) copies of large datasets. But maybe I misunderstand the intended notion of propagation ... I *think* it's more subtle than that, but I admit I'm somewhat confused about how exactly people would want IGNORED to work in various corner cases. (This is another part of why figuring out our audience/use-cases seems like an important first step to me... fortunately the semantics for MISSING are, I think, much more clear.) Say we have a = np.array([1, IGNORED(2), 3]) b = np.array([10, 20, 30]) (Here's I'm using IGNORED(2) to mean a value that is currently ignored, but if you unmasked it it would have the value 2.) Then we have: # non-propagating *or* propagating, doesn't matter: a + 2 [3, IGNORED(2), 5] # non-propagating: a + b One of these, I don't know which: [11, IGNORED(2), 33] # numpy.ma chooses this [11, 20, 33] Error: shape mismatch (An error is maybe the most *consistent* option; the suggestion in the alterNEP was that masks had to match on all axes that were *not* broadcast, so a + 2 and a + a are okay, but a + b is an error. I assume the numpy.ma approach is also useful, but note that it has the surprising effect that addition is not commutative: IGNORED(x) + IGNORED(y) = IGNORED(x). 
Try it: masked1 = np.ma.masked_array([1, 2, 3], mask=[False, True, False]) masked2 = np.ma.masked_array([10, 20, 30], mask=[False, True, False]) np.asarray(masked1 + masked2) # [11, 2, 33] np.asarray(masked2 + masked1) # [11, 20, 33] I don't really know what people would prefer.) # propagating: a + b One of these, I don't know which: [11, IGNORED(2), 33] # same as numpy.ma, again [11, IGNORED(22), 33] # non-propagating: np.sum(a) 4 # propagating: np.sum(a) One of these, I don't know which: IGNORED(4) IGNORED(6) So from your description, I wouldn't say that you necessarily want non-destructive+propagating -- it really depends on exactly what computations you want to perform, and how you expect them to work. The main difference is how reduction operations are treated. I kind of feel like the non-propagating version makes more sense overall, but I don't know if there's any consensus on that. I think this is further evidence for my idea that a mask should not be undone, but is non destructive. If you want to be able to access the values after masking, have a view, or only apply the mask to a view. Reduction ufuncs make a lot of sense because they have a basis in mathematics when there are no values. Reduction ufuncs are covered in great detail in Mark's NEP. Ben Root
Re: [Numpy-discussion] in the NA discussion, what can we agree on?
non-destructive+propagating -- it really depends on exactly what computations you want to perform, and how you expect them to work. The main difference is how reduction operations are treated. I kind of feel like the non-propagating version makes more sense overall, but I don't know if there's any consensus on that. I think this is further evidence for my idea that a mask should not be undone, but is non destructive. If you want to be able to access the values after masking, have a view, or only apply the mask to a view. OK, so my understanding of what's meant by propagating is probably incomplete (and is definitely still fuzzy). I'm a little confused by the phrase a mask should not be undone though. Say I want to perform a statistical analysis or filtering procedure excluding and (separately) including a handful of outliers? Isn't that a natural case for undoing a mask? Or did you mean something else? I think I understand the use a view option above, though I don't see how one could apply a mask only to a view. What if my view is every other row in a 2D array, and I want to mask the last half of this view? What is the state of the original array once the mask has been applied? (If this is derailing the progress of this thread, feel free to ignore it.) -best Gary
Re: [Numpy-discussion] in the NA discussion, what can we agree on?
Gary Strangman writes: For the non-destructive+propagating case, do I understand correctly that this would mean I (as a user) could temporarily decide to IGNORE certain portions of my data, perform a series of computation on that data, and the IGNORED flag (or however it is implemented) would be propagated from computation to computation? If that's the case, I suspect I'd use it all the time ... to effectively perform data subsetting without generating (partial) copies of large datasets. But maybe I misunderstand the intended notion of propagation ... I *think* you're right. I say think because to *me* IGNORE is the opposite of propagate. For example, you could temporarily (non-destructively) decide to assign a propagating special value to array a. Then do a+=2; a*=5 and get something like (which I think is the kind of use case you were talking about): # original values a [1, 1, 1] # with special value a [1, SPECIAL, 1] # computations a += 2 a *= 5 # result a [15, SPECIAL, 15] # without special value a [15, 1, 15] So yes, separating both properties is not only a matter of elegance and simplicity, but also has the practical impact of (as Ben said in another mail) making the non-destructive property into a fancy form of indexing. Lluis -- And it's much the same thing with knowledge, for whenever you learn something new, the whole world becomes that much richer. -- The Princess of Pure Reason, as told by Norton Juster in The Phantom Tollbooth
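Lluís's session maps directly onto NaN today, with the crucial difference that NaN is destructive: there is no stored payload, so the final "without special value" step cannot be performed. A sketch:

```python
import numpy as np

a = np.array([1.0, 1.0, 1.0])
a[1] = np.nan     # assign a propagating special value (destructively!)

a += 2
a *= 5            # the special value propagates through both ops

assert a[0] == 15.0 and a[2] == 15.0
assert np.isnan(a[1])   # but the original 1 is gone: no way to unmask
```

A non-destructive SPECIAL would behave the same in the arithmetic but additionally let you recover [15, 1, 15] at the end, which is what makes it a fancy form of indexing.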
Re: [Numpy-discussion] in the NA discussion, what can we agree on?
On Fri, 4 Nov 2011, Benjamin Root wrote: On Friday, November 4, 2011, Gary Strangman str...@nmr.mgh.harvard.edu wrote: non-destructive+propagating -- it really depends on exactly what computations you want to perform, and how you expect them to work. The main difference is how reduction operations are treated. I kind of feel like the non-propagating version makes more sense overall, but I don't know if there's any consensus on that. I think this is further evidence for my idea that a mask should not be undone, but is non destructive. If you want to be able to access the values after masking, have a view, or only apply the mask to a view. OK, so my understanding of what's meant by propagating is probably incomplete (and is definitely still fuzzy). I'm a little confused by the phrase a mask should not be undone though. Say I want to perform a statistical analysis or filtering procedure excluding and (separately) including a handful of outliers? Isn't that a natural case for undoing a mask? Or did you mean something else? I think I understand the use a view option above, though I don't see how one could apply a mask only to a view. What if my view is every other row in a 2D array, and I want to mask the last half of this view? What is the state of the original array once the mask has been applied? (If this is derailing the progress of this thread, feel free to ignore it.) -best Gary Ufuncs can be broadly categorized as element-wise (binary ops like +, *, etc) as well as regular functions that return an array with a shape that matches the inputs broadcasted together. And reduction ops (sum, min, mean, etc). For element-wise, things are a bit murky for IGNORE, and I defer to Mark's NEP: https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst#id17, and it probably should be expanded and clarified in the NEP. For reduction ops, propagation means that sum([3 5 NA 6]) == NA, just like if you had a NaN in the array. 
Non-propagating (or skipping or ignore) would have that operation produce 14. A mean() for the propagating case would be NA, but about 4.67 (14/3) for non-propagating. The part about undoing a mask addresses the issue that when an operation produces a new array that has ignored elements in it, those elements were never initialized with any value at all. Therefore, unmasking those elements and accessing their values makes no sense. This and more are covered in this section of the NEP: https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst#id11 For your stated case, I would have two views of the data (or at least the original data and a view of it). For the view, I would apply the mask to hide the outliers from the filtering operation and produce a result. The first view (or the original array) sees the same data as it did before the other view took on a mask, so you can perform the filtering operation on the data and have two separate results. You can keep the masked view for subsequent calculations, and/or keep the original view, and/or create new views with new masks for other analyses, all while keeping the original data intact. Note that I am right now speaking of views in a somewhat more abstract sense that is only loosely tied to numpy's specific behavior with respect to views right now. As for np.view() specifically, that is an implementation detail that probably shouldn't be in this thread yet, so don't hook too much onto it. Thanks Ben. That's quite helpful. And it also points to my worry (sorry, I already knew enough about views to be dangerous) ... your conceptual version of views is great, but I don't think numpy fully and reliably follows it (occasionally giving copies instead of views, for example, when a view is particularly difficult to generate). So I worry that your notion of views will actually collide with core numpy view implementations. But like you said, perhaps this thread shouldn't go there (yet).
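Ben's reduction examples map directly onto existing NaN behavior, which is the float-only special case of propagation; `nansum`/`nanmean` play the role of the non-propagating variants in this sketch:

```python
import numpy as np

a = np.array([3.0, 5.0, np.nan, 6.0])  # nan stands in for NA

print(np.sum(a))     # nan  -- propagating: one special value poisons the reduction
print(np.nansum(a))  # 14.0 -- non-propagating: the special value is skipped
print(np.nanmean(a)) # 4.666... (14/3) -- mean over the three present values
```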
Given I'm still fuzzy on all the distinctions, perhaps someone could try to help me (and others?) to define all /4/ logical possibilities ... some may be obvious dead-ends. I'll take a stab at them, but these should definitely get edited by others: destructive + propagating = the data point is truly missing (satellite fell into the ocean; dog ate my source datasheet, or whatever), this is the nature of that data point, such missingness should be replicated in elementwise operations, and the missingness SHOULD interfere with reduction operations that involve that datapoint (np.sum([1,MISSING])=MISSING) destructive + non-propagating = the data point is truly missing, this is the nature of that data point, such missingness should be replicated in elementwise operations, but such missingness should NOT interfere with reduction operations that involve that datapoint (np.sum([1,MISSING])=1) non-destructive + propagating = I want to ignore this datapoint for now; element-wise operations should replicate
this ignore designation, and missingness of this type SHOULD interfere with reduction operations that involve this datapoint (np.sum([1,IGNORE])=IGNORE) non-destructive + non-propagating = I want to ignore this datapoint for now; element-wise operations should replicate this ignore designation, but missingness of this type SHOULD NOT interfere with reduction operations that involve this datapoint (np.sum([1,IGNORE])=1)
Re: [Numpy-discussion] in the NA discussion, what can we agree on?
Gary Strangman writes: [...] Given I'm still fuzzy on all the distinctions, perhaps someone could try to help me (and others?) to define all /4/ logical possibilities ... some may be obvious dead-ends. I'll take a stab at them, but these should definitely get edited by others: destructive + propagating = the data point is truly missing (satellite fell into the ocean; dog ate my source datasheet, or whatever), this is the nature of that data point, such missingness should be replicated in elementwise operations, and the missingness SHOULD interfere with reduction operations that involve that datapoint (np.sum([1,MISSING])=MISSING) Right. destructive + non-propagating = the data point is truly missing, this is the nature of that data point, such missingness should be replicated in elementwise operations, but such missingness should NOT interfere with reduction operations that involve that datapoint (np.sum([1,MISSING])=1) What do you define as element-wise operations? Is a sum on an array an element-wise operation? [1, MISSING]+2 [1, MISSING] Or is it just a form of reduction (after shape broadcasting)? [1, MISSING]+2 [3, 2] For me it's the second, so the only time where special values propagate in a non-propagating scenario is when you slice an array. non-destructive + propagating = I want to ignore this datapoint for now; element-wise operations should replicate this ignore designation, and missingness of this type SHOULD interfere with reduction operations that involve this datapoint (np.sum([1,IGNORE])=IGNORE) Right. non-destructive + non-propagating = I want to ignore this datapoint for now; element-wise operations should replicate this ignore designation, but missingness of this type SHOULD NOT interfere with reduction operations that involve this datapoint (np.sum([1,IGNORE])=1) Same concerns as above. Lluis -- And it's much the same thing with knowledge, for whenever you learn something new, the whole world becomes that much richer. 
-- The Princess of Pure Reason, as told by Norton Juster in The Phantom Tollbooth
Re: [Numpy-discussion] in the NA discussion, what can we agree on?
On Fri, Nov 4, 2011 at 11:08 AM, Lluís xscr...@gmx.net wrote: Gary Strangman writes: [...] destructive + non-propagating = the data point is truly missing, this is the nature of that data point, such missingness should be replicated in elementwise operations, but such missingness should NOT interfere with reduction operations that involve that datapoint (np.sum([1,MISSING])=1) What do you define as element-wise operations? Is a sum on an array an element-wise operation? [1, MISSING]+2 [1, MISSING] did you mean [3, MISSING]? Or is it just a form of reduction (after shape broadcasting)? [1, MISSING]+2 [3, 2] For me it's the second, so the only time where special values propagate in a non-propagating scenario is when you slice an array. Propagation has a very specific meaning here, and I think it is causing confusion elsewhere. Propagation (to me) is the *exact* same behavior that occurs with NaNs, but generalized to any dtype. It seems like you are taking propagate to mean whether the mask of the inputs follow on to the mask of the output. This is related, but is possibly a murkier concept and should probably be cleaned up. Ben Root ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] in the NA discussion, what can we agree on?
destructive + propagating = the data point is truly missing (satellite fell into the ocean; dog ate my source datasheet, or whatever), this is the nature of that data point, such missingness should be replicated in elementwise operations, and the missingness SHOULD interfere with reduction operations that involve that datapoint (np.sum([1,MISSING])=MISSING) Right. destructive + non-propagating = the data point is truly missing, this is the nature of that data point, such missingness should be replicated in elementwise operations, but such missingness should NOT interfere with reduction operations that involve that datapoint (np.sum([1,MISSING])=1) What do you define as element-wise operations? Is a sum on an array an element-wise operation? [1, MISSING]+2 [1, MISSING] Or is it just a form of reduction (after shape broadcasting)? [1, MISSING]+2 [3, 2] For me it's the second, so the only time where special values propagate in a non-propagating scenario is when you slice an array. Let's say I want to re-scale a column (or remove the mean from a column). I wouldn't want that to change my missingness. Thus, I'm thinking: x = [1,2,MISSING] x*3 [3, 6, MISSING] x = [1,2,MISSING] x - x.mean() [-0.5, 0.5, MISSING] To me it makes sense to have identical operations for the temporary IGNORE case below (versus the permanent MISSING case here). Note, the reason to independently have separate IGNORE and MISSING is so that I can (for example) temporarily IGNORE entire rows in my 2D array (which may have scattered MISSING elements), and when I undo the IGNORE operation the MISSING elements are still MISSING. The question does still remain what to do when performing operations like those above in IGNORE cases. Perform the operation underneath? Or not? 
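Gary's rescaling examples behave exactly this way under today's `numpy.ma` (a sketch only, with a masked element standing in for MISSING):

```python
import numpy as np
import numpy.ma as ma

x = ma.masked_array([1.0, 2.0, 0.0], mask=[False, False, True])

y = x * 3         # mask carried through elementwise ops: [3.0, 6.0, --]
z = x - x.mean()  # mean over unmasked values is 1.5: [-0.5, 0.5, --]

print(y)
print(z)
```

Note that `x.mean()` here already skips the masked element, so removing the mean does not disturb the missingness of the third slot.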
non-destructive + propagating = I want to ignore this datapoint for now; element-wise operations should replicate this ignore designation, and missingness of this type SHOULD interfere with reduction operations that involve this datapoint (np.sum([1,IGNORE])=IGNORE) Right. non-destructive + non-propagating = I want to ignore this datapoint for now; element-wise operations should replicate this ignore designation, but missingness of this type SHOULD NOT interfere with reduction operations that involve this datapoint (np.sum([1,IGNORE])=1) Same concerns as above. Lluis
Re: [Numpy-discussion] in the NA discussion, what can we agree on?
On Fri, Nov 4, 2011 at 11:08 AM, Lluís xscr...@gmx.net wrote: Gary Strangman writes: [...] destructive + non-propagating = the data point is truly missing, this is the nature of that data point, such missingness should be replicated in elementwise operations, but such missingness should NOT interfere with reduction operations that involve that datapoint (np.sum([1,MISSING])=1) What do you define as element-wise operations? Is a sum on an array an element-wise operation? [1, MISSING]+2 [1, MISSING] did you mean [3, MISSING]? Or is it just a form of reduction (after shape broadcasting)? [1, MISSING]+2 [3, 2] For me it's the second, so the only time where special values propagate in a non-propagating scenario is when you slice an array. Propagation has a very specific meaning here, and I think it is causing confusion elsewhere. Propagation (to me) is the *exact* same behavior that occurs with NaNs, but generalized to any dtype. It seems like you are taking propagate to mean whether the mask of the inputs follow on to the mask of the output. This is related, but is possibly a murkier concept and should probably be cleaned up. I think different people have different notions of propagation here. Yes, my notion was more related to input masks propagating to output masks. It's important to know you define it differently ... and I think the difference in (implicit) definitions is indeed causing confusion. At least it is for me. ;-) -best Gary
Re: [Numpy-discussion] in the NA discussion, what can we agree on?
Benjamin Root writes: On Fri, Nov 4, 2011 at 11:08 AM, Lluís xscr...@gmx.net wrote: Gary Strangman writes: [...] destructive + non-propagating = the data point is truly missing, this is the nature of that data point, such missingness should be replicated in elementwise operations, but such missingness should NOT interfere with reduction operations that involve that datapoint (np.sum([1,MISSING])=1) What do you define as element-wise operations? Is a sum on an array an element-wise operation? [1, MISSING]+2 [1, MISSING] did you mean [3, MISSING]? Yes, sorry. Or is it just a form of reduction (after shape broadcasting)? [1, MISSING]+2 [3, 2] For me it's the second, so the only time where special values propagate in a non-propagating scenario is when you slice an array. Propagation has a very specific meaning here, and I think it is causing confusion elsewhere. Propagation (to me) is the *exact* same behavior that occurs with NaNs, but generalized to any dtype. It seems like you are taking propagate to mean whether the mask of the inputs follow on to the mask of the output. This is related, but is possibly a murkier concept and should probably be cleaned up. If you ignore the existence of a mask (as it is a specific mechanism for handling the destructiveness, not the propagation), I think we both think of the same concept of propagation: High-level: x + SPECIAL Propagating (SPECIAL = NaN-like = MISSING): x + SPECIAL = SPECIAL Non-propagating (SPECIAL = ignore this element, similar to nansum = IGNORE): x + SPECIAL = x Is there an agreement on this, or am I missing something else? Lluis -- And it's much the same thing with knowledge, for whenever you learn something new, the whole world becomes that much richer. -- The Princess of Pure Reason, as told by Norton Juster in The Phantom Tollbooth ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
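The two scalar rules Lluís writes out can be stated as code (a toy model with `float('nan')` standing in for SPECIAL; the non-propagating rule treats SPECIAL as the operation's identity, matching the `binop(a, binop.identity)` formulation Pauli gives later in the thread):

```python
import math

SPECIAL = float('nan')

def add_propagating(x, y):
    # nan + anything -> nan, so propagation falls out of float arithmetic
    return x + y

def add_nonpropagating(x, y):
    # SPECIAL acts as the identity of +, i.e. it is skipped
    if math.isnan(x):
        x = 0.0
    if math.isnan(y):
        y = 0.0
    return x + y

print(add_propagating(1.0, SPECIAL))     # nan  -- x + SPECIAL = SPECIAL
print(add_nonpropagating(1.0, SPECIAL))  # 1.0  -- x + SPECIAL = x
```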
Re: [Numpy-discussion] in the NA discussion, what can we agree on?
04.11.2011 17:31, Gary Strangman kirjoitti: [clip] The question does still remain what to do when performing operations like those above in IGNORE cases. Perform the operation underneath? Or not? I have a feeling that if you don't start by mathematically defining the scalar operations first, and only after that generalize them to arrays, some conceptual problems may follow. On the other hand, I should note that numpy.ma does not work this way, and many people seem still happy with how it works. But if you go defining scalars first, as far as I see, ufuncs (e.g. binary operations) and assignment are what need to be defined. Since the idea seems to be to use these as masks, let's assume that each special value can also carry a payload.

***

There are two options for how to behave with respect to binary/unary operations:

(P) Propagating
    unop(SPECIAL_1) == SPECIAL_new
    binop(SPECIAL_1, SPECIAL_2) == SPECIAL_new
    binop(a, SPECIAL) == SPECIAL_new

(N) Non-propagating
    unop(SPECIAL_1) == SPECIAL_new
    binop(SPECIAL_1, SPECIAL_2) == SPECIAL_new
    binop(a, SPECIAL) == binop(a, binop.identity) == a

***

And three options on what to do on assignment:

(d) Destructive
    a := SPECIAL    # -> a == SPECIAL

(n) Non-destructive
    a := SPECIAL    # -> a unchanged

(s) Self-destructive
    a := SPECIAL_1  # -> if `a` is SPECIAL-class, then a == SPECIAL_1,
                    #    otherwise `a` remains unchanged

***

Finally, there is a question of whether the value has a payload or not. The payload complicates the scheme, as binary and unary operations need to create new values. For singletons (e.g. NaN) this is not a problem. But if it's a non-singleton, desirable behavior would be to retain commutativity (and other similar properties) of binary ops. I see two sensible approaches for this: either raise an error, or do the computation on the payload. This brings in a third choice: (S) singleton, (E) payload, but raise errors on operations only on special values, and (C) payload, but do computations on payload.
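Pauli's three assignment behaviors can be modelled on a toy scalar cell holding a value plus an optional special tag (the `Cell` class and function names are invented for illustration, and the non-destructive case is given one plausible reading: the tag is recorded but the datum survives underneath):

```python
class Cell:
    """Toy scalar cell; `special` is None or a special-value tag."""
    def __init__(self, value):
        self.value = value
        self.special = None

def assign_destructive(cell, tag):
    cell.value = None          # (d): the old value is lost
    cell.special = tag

def assign_nondestructive(cell, tag):
    cell.special = tag         # (n): the old value is retained underneath

def assign_selfdestructive(cell, tag):
    if cell.special is not None:
        cell.special = tag     # (s): overwrite only if already SPECIAL-class
```

A quick run shows the differences: a destructive assignment forgets 5, a non-destructive one keeps it, and a self-destructive one is a no-op on a non-special cell.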
***

For shorthand, we can refer to the above choices with the nomenclature

    shorthand     ::= propagation destructivity payload_type
    propagation   ::= P | N
    destructivity ::= d | n | s
    payload_type  ::= S | E | C

That makes 2 * 3 * 3 = 18 different ways to construct consistent behavior. Some of them might make sense, the problem is to find out which :) NaN and NA apparently fall into the PdS class. If classified this way, behaviour of items in np.ma arrays is different in different operations, but seems roughly PdX, where X stands for returning a masked value with the first argument as the payload in binary ops if either argument is masked. This makes inline binary ops behave like Nn. Reductions are N. (Assignment: dC, reductions: N, binary ops: PX, unary ops: PC, inline binary ops: Nn). Finally, there's a can of worms on specifying the outcome of binary operations on two special values of different kinds, but it's maybe best to first choose one that behaves sensibly by itself. Cheers, Pauli ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
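Pauli's shorthand grammar enumerates cleanly; a quick combinatorial check of the 18 classes:

```python
from itertools import product

propagation   = ["P", "N"]        # propagating / non-propagating
destructivity = ["d", "n", "s"]   # destructive / non-destructive / self-destructive
payload_type  = ["S", "E", "C"]   # singleton / error-on-op / compute-on-payload

classes = ["".join(c) for c in product(propagation, destructivity, payload_type)]
print(len(classes))      # 18
print("PdS" in classes)  # True -- the class NaN and NA fall into
```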
Re: [Numpy-discussion] in the NA discussion, what can we agree on?
On Wed, Nov 2, 2011 at 8:20 PM, Benjamin Root ben.r...@ou.edu wrote: On Wednesday, November 2, 2011, Nathaniel Smith n...@pobox.com wrote: By R compatibility, I specifically had in mind in-memory compatibility. rpy2 provides a more-or-less seamless within-process interface between R and Python (and specifically lets you get numpy views on arrays returned by R functions), so if we can make this work for R arrays containing NA too then that'd be handy. (The rpy2 author requested this in the last discussion here: http://mail.scipy.org/pipermail/numpy-discussion/2011-June/057084.html) When it comes to disk formats, then this doesn't matter so much, since IO routines have to translate between different representations all the time anyway. Interesting, but I still have to wonder if that should be on the wishlist for MISSING. I guess it would matter by knowing whether people would be fully converting from R or gradually transitioning from it? That is something that I can't answer. Well, I'm one of the people who would use it, so yeah :-). I've been trying to standardize my code on Python for a while now, but there's a ton of statistical tools that are only really available through R, and that will remain true for a while yet. So I use rpy2 when I have to. I take the replacement of my line about MISSING disallowing unmasking and your line about MISSING assignment being destructive as basically expressing the same idea. Is that fair, or did you mean something else? I am someone who wants to get to the absolute core of ideas. Also, this expression cleanly delineates the differences as binary. By expressing it this way, we also shy away from implementation details. For example, Unmasking can be programmatically prevented for MISSING while it could be implemented by other indirect means for IGNORE. Not that those are the preferred ways, only that the phrasing is more flexible and exacting. 
Finally, do you think that people who want IGNORED support care about having a convenient API for masking/unmasking values? You removed that line, but I don't know if that was because you disagreed with it, or were just trying to simplify. See previous. I like getting to the core of things too, but unless there's actual disagreement, then I think even less central points are still worth noting :-). I've tried editing things a bit to make the compare/contrast clearer based on your comments, and put it up here: https://github.com/njsmith/numpy/wiki/NA-discussion-status Maybe it would be better to split each list into core idea versus extra niceties or something? I'm not sure. -- Nathaniel ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] in the NA discussion, what can we agree on?
I also mentioned this at the bottom of a reply to Benjamin, but to make sure people joining the thread see it: I went ahead and put this up on a github wiki page that everyone should be able to edit https://github.com/njsmith/numpy/wiki/NA-discussion-status We could move it to the numpy wiki or whatever if people prefer, this just seemed like the easiest way to get something up there that everyone would have write access to. -- Nathaniel On Wed, Nov 2, 2011 at 4:37 PM, Nathaniel Smith n...@pobox.com wrote: Hi again, Okay, here's my attempt at an *uncontroversial* email! Specifically, I think it'll be easier to talk about this NA stuff if we can establish some common ground, and easier for people to follow if the basic points of agreement are laid out in one place. So I'm going to try and summarize just the things that we can agree about. Note that right now I'm *only* talking about what kind of tools we want to give the user -- i.e., what kind of problems we are trying to solve. AFAICT we don't have as much consensus on implementation matters, and anyway it's hard to make implementation decisions without knowing what we're trying to accomplish. 1) I think we have consensus that there are (at least) two different possible ways of thinking about this problem, with somewhat different constituencies. Let's call these two concepts MISSING data and IGNORED data. 
2) I also think we have at least a rough consensus on what these concepts mean, and what their supporters want from them: MISSING data: - Conceptually, MISSINGness acts like a property of a datum -- assigning MISSING to a location is like assigning any other value to that location - Ufuncs and other operations must propagate these values by default, and there must be an option to cause them to be ignored - Must be competitive with NaNs in terms of speed and memory usage (or else people will just use NaNs) - Compatibility with R is valuable - To avoid user confusion, ideally it should *not* be possible to 'unmask' a missing value, since this is inconsistent with the missing value metaphor (e.g., see Wes's comment about leaky abstractions) - Possible useful extension: having different classes of missing values (similar to Stata) - Target audience: data analysis with missing data, neuroimaging, econometrics, former R users, ... IGNORED data: - Conceptually, IGNOREDness acts like a property of the array -- toggling a location to be IGNORED is kind of vaguely similar to changing an array's shape - Ufuncs and other operations must ignore these values by default, and there doesn't really need to be a way to propagate them, even as an option (though it probably wouldn't hurt either) - Some memory overhead is inevitable and acceptable - Compatibility with R neither possible nor valuable - Ability to toggle the IGNORED state of a location is critical, and should be as convenient as possible - Possible useful extension: having not just different types of ignored values, but richer ways to combine them -- e.g., the example of combining astronomical images with some kind of associated per-pixel quality scores, where one might want the 'mask' to be not just a boolean IGNORED/not-IGNORED flag, but an integer (perhaps a multi-byte integer) or even a float, and to allow these 'masks' to be combined in some more complex way than just logical_and. 
- Target audience: anyone who's already doing this kind of thing by hand using a second mask array + boolean indexing, former numpy.ma users, matplotlib, ... 3) And perhaps we can all agree that the biggest *un*resolved question is whether we want to: - emphasize the similarities between these two use cases and build a single interface that can handle both concepts, with some compromises - or, treat these as two mostly-separate features that can each become exactly what the respective constituency wants without compromise -- but with some potential redundancy and extra code. Each approach has advantages and disadvantages. Does that seem like a fair summary? Anything more we can add? Most importantly, anything here that you disagree with? Did I summarize your needs well? Do you have a use case that you feel doesn't fit naturally into either category? [Also, I thought this might make the start of a good wiki page for people to reference during these discussions, but I don't seem to have edit rights. If other people agree, maybe someone could put it up, or give me access? My trac id is n...@pobox.com.] Thanks, -- Nathaniel ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
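The "by hand" pattern named in the IGNORED target audience -- a second mask array plus boolean indexing -- looks like this in plain NumPy today:

```python
import numpy as np

data    = np.array([2.0, 7.0, 3.0, 9.0])
ignored = np.array([False, True, False, False])  # handwritten "mask"

keep = data[~ignored]       # boolean indexing drops the ignored element
print(keep.sum())           # 14.0
print(keep.mean())          # 4.666... (14/3)

ignored[1] = False          # toggling is trivial: the datum was never touched
print(data[~ignored].sum()) # 21.0
```

The cost is that `data[~ignored]` makes a copy, which is exactly the overhead an integrated IGNORED feature would aim to avoid on large datasets.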
Re: [Numpy-discussion] in the NA discussion, what can we agree on?
On 2011-11-03 04:22, numpy-discussion-requ...@scipy.org wrote: Message: 1 Date: Wed, 2 Nov 2011 22:20:15 -0500 From: Benjamin Root ben.r...@ou.edu Subject: Re: [Numpy-discussion] in the NA discussion, what can we agree on? To: Discussion of Numerical Python numpy-discussion@scipy.org Message-ID: cannq6fnlkweuxugoey0kto7yi-v+tnv3nzj6upukkva+d0d...@mail.gmail.com Content-Type: text/plain; charset=iso-8859-1 On Wednesday, November 2, 2011, Nathaniel Smith n...@pobox.com wrote: Hi Benjamin, On Wed, Nov 2, 2011 at 5:25 PM, Benjamin Root ben.r...@ou.edu wrote: I want to pare this down even more. I think the above lists make too many unneeded extrapolations. Okay. I found your formatting a little confusing, so I want to make sure I understood the changes you're suggesting: For the description of what MISSING means, you removed the lines: - Compatibility with R is valuable - To avoid user confusion, ideally it should *not* be possible to 'unmask' a missing value, since this is inconsistent with the missing value metaphor (e.g., see Wes's comment about leaky abstractions) And you added the line: + Assigning MISSING is destructive And for the description of what IGNORED means, you removed the lines: - Some memory overhead is inevitable and acceptable - Compatibility with R neither possible nor valuable - Ability to toggle the IGNORED state of a location is critical, and should be as convenient as possible And you added the lines: + IGNORE is non-destructive + Must be competitive with np.ma for speed and memory (or else users would just use np.ma) Is that right? Correct. Assuming it is, my thoughts are: By R compatibility, I specifically had in mind in-memory compatibility. rpy2 provides a more-or-less seamless within-process interface between R and Python (and specifically lets you get numpy views on arrays returned by R functions), so if we can make this work for R arrays containing NA too then that'd be handy.
(The rpy2 author requested this in the last discussion here: http://mail.scipy.org/pipermail/numpy-discussion/2011-June/057084.html) When it comes to disk formats, then this doesn't matter so much, since IO routines have to translate between different representations all the time anyway. Interesting, but I still have to wonder if that should be on the wishlist for MISSING. I guess it would matter by knowing whether people would be fully converting from R or gradually transitioning from it? That is something that I can't answer. I probably do not have all possible use-cases but what I'd think of as the most common is: use R stuff just straight out of R from Python. Say that you are doing your work in Python and read about some statistical method for which an implementation in R exists (but not in Python/numpy). You can just pass your numpy arrays or vectors to the relevant R function(s) and retrieve the results in a form directly usable by numpy (without having the data copied around). Should performance become an issue, and that method be of crucial importance, you will probably want to reimplement it (C, or Cython, for example). Otherwise you could pick R's phenomenal toolbox without much effort and keep those calls to R as part of your code. In my experience, the latter would be the most frequent. Get some compatibility for the NA magic values and that possible coupling between R and numpy becomes even better, by preventing one side or the other from treating them as non-NA values. I take the replacement of my line about MISSING disallowing unmasking and your line about MISSING assignment being destructive as basically expressing the same idea. Is that fair, or did you mean something else? I am someone who wants to get to the absolute core of ideas. Also, this expression cleanly delineates the differences as binary. By expressing it this way, we also shy away from implementation details.
For example, Unmasking can be programmatically prevented for MISSING while it could be implemented by other indirect means for IGNORE. Not that those are the preferred ways, only that the phrasing is more flexible and exacting. Finally, do you think that people who want IGNORED support care about having a convenient API for masking/unmasking values? You removed that line, but I don't know if that was because you disagreed with it, or were just trying to simplify. See previous. Then, as a third-party module developer, I can tell you that having separate and independent ways to detect MISSING/IGNORED would likely make support more difficult and would greatly benefit from a common (or easily combinable) method of identification. Right, sorry... I didn't forget, and that's part of what I was thinking when I described the second approach as keeping them as *mostly*-separate interfaces... but I should have made it more explicit! Anyway, yes: 4) There is consensus that whatever approach is taken
, there should be a quick and convenient way to identify values that are MISSING, IGNORED, or both. (E.g., functions is_MISSING, is_IGNORED, is_MISSING_or_IGNORED, or some equivalent.)
Re: [Numpy-discussion] in the NA discussion, what can we agree on?
Nathaniel Smith writes: 4) There is consensus that whatever approach is taken, there should be a quick and convenient way to identify values that are MISSING, IGNORED, or both. (E.g., functions is_MISSING, is_IGNORED, is_MISSING_or_IGNORED, or some equivalent.) Well, maybe it's too low level, but I'd rather decouple the two concepts into two orthogonal properties that can be composed: * Destructiveness: whether the previous data value is lost whenever you assign a special value. * Propagation: whether any of these special values is propagated or just skipped when performing computations. I think we can all agree on the definition of these two properties (where bit-patterns are destructive and masks are non-destructive), so I'd say that the first discussion is establishing whether to expose them as separate properties or just expose specific combinations of them: * MISSING: destructive + propagating * IGNORED: non-destructive + non-propagating For example, it makes sense to me to have non-destructive + propagating. If we take this road, then the next points to discuss should probably be how these combinations are expressed: * At the array level: all special values behave the same in a specific array, given its properties (e.g., all of them are destructive+propagating). * At the value level: each special value conveys a specific combination of the aforementioned properties (e.g., assigning A is destructive+propagating and assigning B is non-destructive+non-propagating). * Hybrid: e.g., all special values are destructive, but propagation depends on the specific special value. I think this last decision is crucial, as it will have a direct impact on performance, numpy code maintainability and 3rd party interface simplicity. Lluis -- And it's much the same thing with knowledge, for whenever you learn something new, the whole world becomes that much richer.
-- The Princess of Pure Reason, as told by Norton Juster in The Phantom Tollbooth ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] in the NA discussion, what can we agree on?
On Thu, Nov 3, 2011 at 9:28 AM, Lluís xscr...@gmx.net wrote: Nathaniel Smith writes: 4) There is consensus that whatever approach is taken, there should be a quick and convenient way to identify values that are MISSING, IGNORED, or both. (E.g., functions is_MISSING, is_IGNORED, is_MISSING_or_IGNORED, or some equivalent.) Well, maybe it's too low level, but I'd rather decouple the two concepts into two orthogonal properties that can be composed: * Destructiveness: whether the previous data value is lost whenever you assign a special value. * Propagation: whether any of these special values is propagated or just skipped when performing computations. I think we can all agree on the definition of these two properties (where bit-patterns are destructive and masks are non-destructive), so I'd say that the first discussion is establishing whether to expose them as separate properties or just expose specific combinations of them: * MISSING: destructive + propagating * IGNORED: non-destructive + non-propagating For example, it makes sense to me to have non-destructive + propagating. This is sort of how it is currently implemented. By default, NA propagates, but it is possible to override these defaults on an operation-by-operation basis using the skipna kwarg, and a subclassed array could implement a __ufunc_wrap__() to default the skipna kwarg to True. If we take this road, then the next points to discuss should probably be how these combinations are expressed: * At the array level: all special values behave the same in a specific array, given its properties (e.g., all of them are destructive+propagating). * At the value level: each special value conveys a specific combination of the aforementioned properties (e.g., assigning A is destructive+propagating and assigning B is non-destructive+non-propagating). * Hybrid: e.g., all special values are destructive, but propagation depends on the specific special value. 
I think this last decision is crucial, as it will have a direct impact on performance, numpy code maintainability and 3rd party interface simplicity. This is actually a very good point, and plays directly on the types of implementations that can be done. Currently, Mark's implementation is the first one. The others are not possible with the current design. Ben Root ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
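[The two orthogonal properties discussed above can be approximated with tools that already ship in NumPy, since the proposed MISSING/IGNORED machinery itself never existed as shown: NaN behaves like a destructive, propagating bit-pattern, and numpy.ma behaves like a non-destructive, non-propagating mask. A minimal sketch, using those stand-ins:]

```python
import numpy as np

# Destructive + propagating, approximated by NaN (a bit-pattern):
# assigning NaN overwrites the payload, and it propagates through reductions.
a = np.array([1.0, 2.0, 3.0])
a[1] = np.nan             # the old value 2.0 is gone for good
assert np.isnan(a.sum())  # NaN propagates by default

# Non-destructive + non-propagating, approximated by numpy.ma (a mask):
# masking hides the value, but the payload survives underneath.
b = np.ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])
assert b.sum() == 4.0     # the masked entry is skipped, not propagated
assert b.data[1] == 2.0   # the original value is still there
```

[The other two combinations in Lluís's grid -- destructive+non-propagating and non-destructive+propagating -- have no direct analogue in either tool, which is exactly the gap the proposal is probing.]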
Re: [Numpy-discussion] in the NA discussion, what can we agree on?
On 11/2/11 7:16 PM, Nathaniel Smith wrote: By R compatibility, I specifically had in mind in-memory compatibility. The R crowd has had a big voice in this discussion, and I understand that there are some nice lessons to be learned from it with regard to the NA issues. However, I think making R compatibility a priority is a mistake -- numpy is numpy, it is NOT, nor should it be, an emulation of anything else. NA functionality is useful to virtually everyone -- not just folks doing R-like stuff, and even less so folks directly working with R. rpy2 provides a more-or-less seamless within-process interface between R and Python. Perhaps rpy2 will need to do some translating -- so be it, better than crippling numpy for other uses. That being said, if the R binary format is a good one for numpy, no harm in using it, but I think that should be a secondary, at best, concern. So should emulating the R API. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] in the NA discussion, what can we agree on?
Hi Chris, On Thu, Nov 3, 2011 at 9:45 AM, Chris.Barker chris.bar...@noaa.gov wrote: On 11/2/11 7:16 PM, Nathaniel Smith wrote: By R compatibility, I specifically had in mind in-memory compatibility. The R crowd has had a big voice in this discussion, and I understand that there are some nice lessons to be learned from it with regard to the NA issues. However, I think making R compatibility a priority is a mistake -- numpy is numpy, it is NOT, nor should it be, an emulation of anything else. NA functionality is useful to virtually everyone -- not just folks doing R-like stuff, and even less so folks directly working with R. I think we agree, actually. What I currently have written on the wiki page is In-memory compatibility with R would be handy, which is intended to convey that all else being equal this is a desirable feature, but that it's not worth crippling numpy (as you put it) to get. Do you have a suggestion about how I could make this clearer? or am I misunderstanding your point? -- Nathaniel ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] in the NA discussion, what can we agree on?
Hi Lluís, On Thu, Nov 3, 2011 at 7:28 AM, Lluís xscr...@gmx.net wrote: Well, maybe it's too low level, but I'd rather decouple the two concepts into two orthogonal properties that can be composed: * Destructiveness: whether the previous data value is lost whenever you assign a special value. * Propagation: whether any of these special values is propagated or just skipped when performing computations. I think we can all agree on the definition of these two properties (where bit-patterns are destructive and masks are non-destructive), so I'd say that the first discussion is establishing whether to expose them as separate properties or just expose specific combinations of them: * MISSING: destructive + propagating * IGNORED: non-destructive + non-propagating Thanks, that's an interesting idea that I'd forgotten about. I added a link to your message to the proposals section, and to the list of proposed solutions in point (3). I'm tempted to respond in more depth, but I'm worried that if we start digging into specific proposals like this right now then we'll start going in circles again -- that's why I'm trying to establish some common ground on what our goals are, so we have more of a basis for comparing different ideas. So obviously your suggestion of breaking things down into finer-grained orthogonal features has merits in terms of simplicity, elegance, etc. Before we get into those, though, I want to ask: do you feel that the extra ability to have values that are destructive+non-propagating and non-destructive+propagating is a major *practical* benefit, or are you more motivated by the simplicity and elegance, and the extra flexibility is just something kind of cool that we'd get for free? Or put another way, do you think that the MISSING and IGNORED concepts are adequate to cover practical use cases, or do you have an example where what's really wanted is say non-destructive + propagating? 
I can see how it would work, but I don't think I'd ever use it, so I'm curious... -- Nathaniel ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] in the NA discussion, what can we agree on?
For the non-destructive+propagating case, do I understand correctly that this would mean I (as a user) could temporarily decide to IGNORE certain portions of my data, perform a series of computation on that data, and the IGNORED flag (or however it is implemented) would be propagated from computation to computation? If that's the case, I suspect I'd use it all the time ... to effectively perform data subsetting without generating (partial) copies of large datasets. But maybe I misunderstand the intended notion of propagation ... Gary Or put another way, do you think that the MISSING and IGNORED concepts are adequate to cover practical use cases, or do you have an example where what's really wanted is say non-destructive + propagating? I can see how it would work, but I don't think I'd ever use it, so I'm curious... -- Nathaniel ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
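[The subsetting-without-copies use case Gary describes is already half-possible with today's numpy.ma: the mask itself costs memory, but the payload is not copied. A sketch of that half -- note that current numpy.ma *skips* masked values rather than propagating them, so this only approximates the non-destructive property, not the propagation Gary asks about:]

```python
import numpy as np

# A large dataset, and a boolean "ignore" pattern over part of it.
data = np.arange(1_000_000, dtype=np.float64)
mask = np.zeros(data.shape, dtype=bool)
mask[::2] = True                                # temporarily ignore every other value

subset = np.ma.masked_array(data, mask=mask)    # copy=False is the default
assert np.shares_memory(subset.data, data)      # the payload is not copied
assert subset.sum() == data[~mask].sum()        # computations skip ignored values

# Toggling the mask off recovers the full dataset -- nothing was destroyed.
subset.mask = False
assert subset.sum() == data.sum()
```

[Whether the IGNORED state should then flow through a chain of computations, as Gary envisions, is precisely the non-destructive+propagating combination that neither NaN nor numpy.ma provides today.]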
Re: [Numpy-discussion] in the NA discussion, what can we agree on?
On Thursday, November 3, 2011, Gary Strangman str...@nmr.mgh.harvard.edu wrote: For the non-destructive+propagating case, do I understand correctly that this would mean I (as a user) could temporarily decide to IGNORE certain portions of my data, perform a series of computation on that data, and the IGNORED flag (or however it is implemented) would be propagated from computation to computation? If that's the case, I suspect I'd use it all the time ... to effectively perform data subsetting without generating (partial) copies of large datasets. But maybe I misunderstand the intended notion of propagation ... Gary Propagating is default NaN-like behavior when performing a sum with at least one NaN in it. Ignoring is like using nansum on that same array. Masking, one can think of it as very fancy indexing, but the shape and structure of the data is maintained. Ben Root ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
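[Ben's three behaviors -- propagating, ignoring, and masking -- can be seen side by side with standard NumPy calls on a NaN-bearing array:]

```python
import numpy as np

x = np.array([1.0, np.nan, 3.0])

# Propagating: a plain sum with at least one NaN in it yields NaN.
assert np.isnan(np.sum(x))

# Ignoring: nansum skips the NaN instead of propagating it.
assert np.nansum(x) == 4.0

# Masking: like very fancy indexing, but the shape and structure
# of the data are maintained (unlike boolean indexing, which flattens).
m = np.ma.masked_invalid(x)
assert m.shape == x.shape
assert m.sum() == 4.0
```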
Re: [Numpy-discussion] in the NA discussion, what can we agree on?
On Wed, Nov 2, 2011 at 6:37 PM, Nathaniel Smith n...@pobox.com wrote: Hi again, Okay, here's my attempt at an *uncontroversial* email! Specifically, I think it'll be easier to talk about this NA stuff if we can establish some common ground, and easier for people to follow if the basic points of agreement are laid out in one place. So I'm going to try and summarize just the things that we can agree about. Note that right now I'm *only* talking about what kind of tools we want to give the user -- i.e., what kind of problems we are trying to solve. AFAICT we don't have as much consensus on implementation matters, and anyway it's hard to make implementation decisions without knowing what we're trying to accomplish. 1) I think we have consensus that there are (at least) two different possible ways of thinking about this problem, with somewhat different constituencies. Let's call these two concepts MISSING data and IGNORED data. 2) I also think we have at least a rough consensus on what these concepts mean, and what their supporters want from them: MISSING data: - Conceptually, MISSINGness acts like a property of a datum -- assigning MISSING to a location is like assigning any other value to that location - Ufuncs and other operations must propagate these values by default, and there must be an option to cause them to be ignored - Must be competitive with NaNs in terms of speed and memory usage (or else people will just use NaNs) - Compatibility with R is valuable - To avoid user confusion, ideally it should *not* be possible to 'unmask' a missing value, since this is inconsistent with the missing value metaphor (e.g., see Wes's comment about leaky abstractions) - Possible useful extension: having different classes of missing values (similar to Stata) - Target audience: data analysis with missing data, neuroimaging, econometrics, former R users, ... 
IGNORED data: - Conceptually, IGNOREDness acts like a property of the array -- toggling a location to be IGNORED is kind of vaguely similar to changing an array's shape - Ufuncs and other operations must ignore these values by default, and there doesn't really need to be a way to propagate them, even as an option (though it probably wouldn't hurt either) - Some memory overhead is inevitable and acceptable - Compatibility with R neither possible nor valuable - Ability to toggle the IGNORED state of a location is critical, and should be as convenient as possible - Possible useful extension: having not just different types of ignored values, but richer ways to combine them -- e.g., the example of combining astronomical images with some kind of associated per-pixel quality scores, where one might want the 'mask' to be not just a boolean IGNORED/not-IGNORED flag, but an integer (perhaps a multi-byte integer) or even a float, and to allow these 'masks' to be combined in some more complex way than just logical_and. - Target audience: anyone who's already doing this kind of thing by hand using a second mask array + boolean indexing, former numpy.ma users, matplotlib, ... 3) And perhaps we can all agree that the biggest *un*resolved question is whether we want to: - emphasize the similarities between these two use cases and build a single interface that can handle both concepts, with some compromises - or, treat these at two mostly-separate features that can each become exactly what the respective constituency wants without compromise -- but with some potential redundancy and extra code. Each approach has advantages and disadvantages. Does that seem like a fair summary? Anything more we can add? Most importantly, anything here that you disagree with? Did I summarize your needs well? Do you have a use case that you feel doesn't fit naturally into either category? 
[Also, I thought this might make the start of a good wiki page for people to reference during these discussions, but I don't seem to have edit rights. If other people agree, maybe someone could put it up, or give me access? My trac id is n...@pobox.com.] Thanks, -- Nathaniel I want to pare this down even more. I think the above lists make too many unneeded extrapolations. MISSING data: - Conceptually, MISSINGness acts like a property of a datum -- assigning MISSING to a location is like assigning any other value to that location - Ufuncs and other operations must propagate these values by default, and there must be an option to cause them to be ignored - Assigning MISSING is destructive - Must be competitive with NaNs in terms of speed and memory usage (or else people will just use NaNs) - Target audience: data analysis with missing data, neuroimaging, econometrics, former R users, ... - Possible useful extension: having different classes of missing values (similar to Stata) IGNORED data: - Conceptually, IGNOREDness acts like a property of the array -- toggling a location to be IGNORED is kind of vaguely similar to changing an array's shape
Re: [Numpy-discussion] in the NA discussion, what can we agree on?
Hi Benjamin, On Wed, Nov 2, 2011 at 5:25 PM, Benjamin Root ben.r...@ou.edu wrote: I want to pare this down even more. I think the above lists makes too many unneeded extrapolations. Okay. I found your formatting a little confusing, so I want to make sure I understood the changes you're suggesting: For the description of what MISSING means, you removed the lines: - Compatibility with R is valuable - To avoid user confusion, ideally it should *not* be possible to 'unmask' a missing value, since this is inconsistent with the missing value metaphor (e.g., see Wes's comment about leaky abstractions) And you added the line: + Assigning MISSING is destructive And for the description of what IGNORED means, you removed the lines: - Some memory overhead is inevitable and acceptable - Compatibility with R neither possible nor valuable - Ability to toggle the IGNORED state of a location is critical, and should be as convenient as possible And you added the lines: + IGNORE is non-destructive + Must be competitive with np.ma for speed and memory (or else users would just use np.ma) Is that right? Assuming it is, my thoughts are: By R compatibility, I specifically had in mind in-memory compatibility. rpy2 provides a more-or-less seamless within-process interface between R and Python (and specifically lets you get numpy views on arrays returned by R functions), so if we can make this work for R arrays containing NA too then that'd be handy. (The rpy2 author requested this in the last discussion here: http://mail.scipy.org/pipermail/numpy-discussion/2011-June/057084.html) When it comes to disk formats, then this doesn't matter so much, since IO routines have to translate between different representations all the time anyway. I take the replacement of my line about MISSING disallowing unmasking and your line about MISSING assignment being destructive as basically expressing the same idea. Is that fair, or did you mean something else? 
Finally, do you think that people who want IGNORED support care about having a convenient API for masking/unmasking values? You removed that line, but I don't know if that was because you disagreed with it, or were just trying to simplify. Then, as a third-party module developer, I can tell you that having separate and independent ways to detect MISSING/IGNORED would likely make support more difficult and would greatly benefit from a common (or easily combinable) method of identification. Right, sorry... I didn't forget, and that's part of what I was thinking when I described the second approach as keeping them as *mostly*-separate interfaces... but I should have made it more explicit! Anyway, yes: 4) There is consensus that whatever approach is taken, there should be a quick and convenient way to identify values that are MISSING, IGNORED, or both. (E.g., functions is_MISSING, is_IGNORED, is_MISSING_or_IGNORED, or some equivalent.) -- Nathaniel ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
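[The is_MISSING / is_IGNORED names in point (4) are placeholders, not real NumPy API. With NaN standing in for MISSING and a numpy.ma mask standing in for IGNORED, the three proposed predicates might be sketched as the following hypothetical helpers:]

```python
import numpy as np

# Hypothetical helpers -- is_missing / is_ignored do not exist in NumPy;
# MISSING is approximated by NaN, IGNORED by a numpy.ma mask.
def is_missing(arr):
    """Boolean array: True where the underlying payload is NaN."""
    return np.isnan(np.asarray(arr, dtype=float))

def is_ignored(arr):
    """Boolean array: True where a numpy.ma mask hides the value."""
    return np.ma.getmaskarray(np.ma.asarray(arr))

def is_missing_or_ignored(arr):
    """The combined identification method third-party code would want."""
    return is_missing(arr) | is_ignored(arr)
```

[The point of the last function is Ben's: third-party code mostly wants a single, easily combinable way to ask "is this value usable?", regardless of which mechanism made it special.]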
Re: [Numpy-discussion] in the NA discussion, what can we agree on?
On Wednesday, November 2, 2011, Nathaniel Smith n...@pobox.com wrote: Hi Benjamin, On Wed, Nov 2, 2011 at 5:25 PM, Benjamin Root ben.r...@ou.edu wrote: I want to pare this down even more. I think the above lists makes too many unneeded extrapolations. Okay. I found your formatting a little confusing, so I want to make sure I understood the changes you're suggesting: For the description of what MISSING means, you removed the lines: - Compatibility with R is valuable - To avoid user confusion, ideally it should *not* be possible to 'unmask' a missing value, since this is inconsistent with the missing value metaphor (e.g., see Wes's comment about leaky abstractions) And you added the line: + Assigning MISSING is destructive And for the description of what IGNORED means, you removed the lines: - Some memory overhead is inevitable and acceptable - Compatibility with R neither possible nor valuable - Ability to toggle the IGNORED state of a location is critical, and should be as convenient as possible And you added the lines: + IGNORE is non-destructive + Must be competitive with np.ma for speed and memory (or else users would just use np.ma) Is that right? Correct. Assuming it is, my thoughts are: By R compatibility, I specifically had in mind in-memory compatibility. rpy2 provides a more-or-less seamless within-process interface between R and Python (and specifically lets you get numpy views on arrays returned by R functions), so if we can make this work for R arrays containing NA too then that'd be handy. (The rpy2 author requested this in the last discussion here: http://mail.scipy.org/pipermail/numpy-discussion/2011-June/057084.html) When it comes to disk formats, then this doesn't matter so much, since IO routines have to translate between different representations all the time anyway. Interesting, but I still have to wonder if that should be on the wishlist for MISSING. 
I guess it would matter by knowing whether people would be fully converting from R or gradually transitioning from it? That is something that I can't answer. I take the replacement of my line about MISSING disallowing unmasking and your line about MISSING assignment being destructive as basically expressing the same idea. Is that fair, or did you mean something else? I am someone who wants to get to the absolute core of ideas. Also, this expression cleanly delineates the differences as binary. By expressing it this way, we also shy away from implementation details. For example, Unmasking can be programmatically prevented for MISSING while it could be implemented by other indirect means for IGNORE. Not that those are the preferred ways, only that the phrasing is more flexible and exacting. Finally, do you think that people who want IGNORED support care about having a convenient API for masking/unmasking values? You removed that line, but I don't know if that was because you disagreed with it, or were just trying to simplify. See previous. Then, as a third-party module developer, I can tell you that having separate and independent ways to detect MISSING/IGNORED would likely make support more difficult and would greatly benefit from a common (or easily combinable) method of identification. Right, sorry... I didn't forget, and that's part of what I was thinking when I described the second approach as keeping them as *mostly*-separate interfaces... but I should have made it more explicit! Anyway, yes: 4) There is consensus that whatever approach is taken, there should be a quick and convenient way to identify values that are MISSING, IGNORED, or both. (E.g., functions is_MISSING, is_IGNORED, is_MISSING_or_IGNORED, or some equivalent.) Good. Cheers! Ben Root ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion