Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-07 Thread Lluís
Nathaniel Smith writes:
 So assignment is not destructive -- the old value is retained as the 
 payload.

I never assumed (and I think it is also the case for others) that the payload
was retaining the old value. In fact, AFAIR, the payloads were introduced as a
way of having more than one special value that (if wanted by the user) can be
handled differently depending on the payload.

Note that while you're assuming IGNORED(x) means a value that is ignoring the
original value x, you never write MISSING(x) to retain the original value
that is now missing.

Thus I think that decoupling the payload from the "previous value" concept
makes it all consistent, regardless of the destructiveness property.

That's one of the reasons why I have used the "special value" concept from
the beginning, so that no assumption is made about its propagation and
destructiveness properties.


Lluis

-- 
 And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer.
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth


Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-06 Thread Nathaniel Smith
On Sat, Nov 5, 2011 at 3:22 PM, T J tjhn...@gmail.com wrote:
 So what do people expect out of ignored values?  It seems that we might need
 to extend the list you put forward so that it includes these desires. Since
 my primary use is with MISSING and not so much IGNORED, I'm not in a very
 good position to help extend that list.  I'd be curious to know if this
 present suggestion would work with how matplotlib uses masked arrays.

I'm in a similar position -- I don't have any use cases for IGNORED
myself right now, so I'm just trying to guess from what I've seen
people say.

Just having where= around makes it at least possible, if cumbersome,
to handle one of the big problems -- working with subsets of large
datasets without having to make a copy. It's possible we should leave
IGNORED support alone until people have more experience with where=,
and can say what kind of convenience features would be useful. Or
maybe we can get some experts to speak up... perhaps this will help:
  http://thread.gmane.org/gmane.comp.python.matplotlib.devel/10740
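
For anyone who wants to experiment, here is a rough sketch of that where=
workflow in plain NumPy (using the ufunc where= argument as it eventually
shipped; the subset mask name `keep` is just illustrative):

    import numpy as np

    data = np.arange(6, dtype=float)
    keep = data % 2 == 0          # the subset we want to operate on
    out = data.copy()             # start from the original values
    np.multiply(data, 10.0, out=out, where=keep)
    # out == [0., 1., 20., 3., 40., 5.]: elements outside the
    # subset are untouched, and no copy of the excluded data was made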

-- Nathaniel


Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-05 Thread Nathaniel Smith
On Fri, Nov 4, 2011 at 8:33 PM, T J tjhn...@gmail.com wrote:
 On Fri, Nov 4, 2011 at 8:03 PM, Nathaniel Smith n...@pobox.com wrote:
 Again, I really don't think you're going to be able to sell an API where
  [2] + [IGNORED(20)] == [IGNORED(22)]
 I mean, it's not me you have to convince, it's Gary, Pierre, maybe
 Benjamin, Lluís, etc. So I could be wrong. But you might want to
 figure that out first before making plans based on this...

 But this is how np.ma currently does it, except that it doesn't compute the
 payload---it just calls it IGNORED.

Yes, that's what I mean -- if you're just temporarily masking
something out because you want it to be IGNORED, then you don't want
it to change around when you do something like a += 2, right? If the
operation is changing the payload, then it's weird to say that the
operation ignored the payload...

Anyway, I think this is another way to think about your suggestion:

-- each array gets an extra boolean array called the mask that it
carries around with it
-- Unary ufuncs automatically copy these masks to their results. For
binary ufuncs, the input masks get automatically ORed together, and
that determines the mask attached to the output array
-- these masks have absolutely no effect on any computations, except that
 ufunc.reduce(a, skip_IGNORED=True)
is defined to be a synonym for
 ufunc.reduce(a, where=a.mask)

Is that correct?
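
For concreteness, a rough sketch of those three rules in plain NumPy (the
mask polarity and the skip_IGNORED spelling are my assumptions, and where=
on reductions only exists in much later NumPy versions):

    import numpy as np

    a_data = np.array([1.0, 2.0, 3.0])
    a_mask = np.array([False, True, False])   # True == ignored
    b_data = np.array([10.0, 20.0, 30.0])
    b_mask = np.array([False, False, True])

    # binary ufunc: compute on all payloads, OR the input masks
    c_data = np.add(a_data, b_data)
    c_mask = a_mask | b_mask                  # [False, True, True]

    # "skip ignored" reduction == reduce over only the visible elements
    total = np.add.reduce(a_data, where=~a_mask)   # 1.0 + 3.0 == 4.0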

Also, if I can ask -- is this something you would find useful yourself?

-- Nathaniel


Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-05 Thread Pauli Virtanen
Hi,

05.11.2011 03:43, T J wrote:
[clip]
 I thought that PdC satisfied (a) and (b).
 Let me show you what I thought they were. Perhaps I am not being
 consistent. If so, point out my mistake.

Yes, propagating + destructive assignment + do-computations-on-payload
should satisfy (a) and (b). (NA also works, as it's a singleton.)

The question now is whether there are other rules, with more desirable
behavior for masked values, that also have

a += b
a += 42
print unmask(a)

and

a += 42
a += b
print unmask(a)

as equivalent operations.
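
A minimal scalar sketch of why the PdC combination keeps the two orderings
equivalent (the IG class and unmask() are hypothetical stand-ins, not a
NumPy API):

    class IG:
        def __init__(self, v):
            self.v = v
        def __add__(self, other):           # propagate, compute on payload
            o = other.v if isinstance(other, IG) else other
            return IG(self.v + o)
        __radd__ = __add__

    def unmask(x):
        return x.v if isinstance(x, IG) else x

    b = IG(5)
    a1 = 1; a1 += b; a1 += 42
    a2 = 1; a2 += 42; a2 += b
    assert unmask(a1) == unmask(a2) == 48   # order does not matter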

The rules chosen by np.ma don't satisfy this. If we take a commutative
version of np.ma's binary-op rules, it is not clear how to make assignment
work exactly the way you'd expect of masked values while retaining the
equivalence of the code above.

It seems that having `a += b` leave `a[j]` unchanged when it is ignored,
while also having ignored values propagate, is what creates the problem.

-- 
Pauli Virtanen



Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-04 Thread Pauli Virtanen
04.11.2011 19:59, Pauli Virtanen wrote:
[clip]
 This makes inline binary ops
 behave like Nn. Reductions are N. (Assignment: dC, reductions: N, binary
 ops: PX, unary ops: PC, inline binary ops: Nn).

Sorry, inline binary ops are also PdX, not Nn.

-- 
Pauli Virtanen



Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-04 Thread T J
On Fri, Nov 4, 2011 at 11:59 AM, Pauli Virtanen p...@iki.fi wrote:

 I have a feeling that if you don't start by mathematically defining the
 scalar operations first, and only after that generalize them to arrays,
 some conceptual problems may follow.


Yes.  I was going to mention this point as well.


 For shorthand, we can refer to the above choices with the nomenclature

shorthand ::= propagation destructivity payload_type
propagation ::= P | N
destructivity ::= d | n | s
payload_type ::= S | E | C

 That makes 2 * 3 * 3 = 18 different ways to construct consistent
 behavior. Some of them might make sense, the problem is to find out
 which :)


This is great for the discussion, IMO.  The self-destructive assignment
hasn't come up at all, so I'm guessing we can probably ignore it.

---

Can you be a bit more explicit on the payload types?  Let me try, respond
with corrections if necessary.

"S" is "singleton": in the case of missing data, we take it to mean that
we only care that data is missing, and not *how* the data is missing.

>>> x = MISSING
>>> -x  # unary
MISSING
>>> x + 3  # binary
MISSING

"E" means that we acknowledge that we want to track the "how", but that we
aren't interested in it, so raise an error.
In the case of ignored data, we might have:

>>> x = 2
>>> ignore(x)
>>> x
IGNORED(2)
>>> -x
Error
>>> x + 3
Error

"C" means that we acknowledge that we want to track the "how", and that we
are interested in it, so do the computations.

>>> x = 2
>>> ignore(x)
>>> -x
IGNORED(-2)
>>> x + 3
IGNORED(5)

Did I get that mostly right?
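
As a toy illustration of the E and C rules (hypothetical classes, not a
NumPy API; S would just return a bare IGNORED singleton):

    class IgnoredC:                      # C: compute on the payload
        def __init__(self, payload):
            self.payload = payload
        def __repr__(self):
            return "IGNORED(%r)" % self.payload
        def __neg__(self):
            return IgnoredC(-self.payload)
        def __add__(self, other):
            return IgnoredC(self.payload + other)
        __radd__ = __add__

    class IgnoredE:                      # E: track the payload, refuse to compute
        def __init__(self, payload):
            self.payload = payload
        def __neg__(self):
            raise TypeError("operation on an ignored value")
        def __add__(self, other):
            raise TypeError("operation on an ignored value")
        __radd__ = __add__

    x = IgnoredC(2)
    print(-x)       # IGNORED(-2)
    print(x + 3)    # IGNORED(5)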

 NAN and NA apparently fall into the PdS class.


Here is where I think we need to be a bit more careful.  It is true that we
want NAN and MISSING to propagate, but then we additionally want to ignore
it sometimes.  This is precisely why we have functions like nansum.
Although people are well-aware of this desire, I think this thread has
largely conflated the issues when discussing propagation.

To push this forward a bit, can I propose that IGNORE behave as:   PnC

>>> x = np.array([1, 2, 3])
>>> y = np.array([10, 20, 30])
>>> ignore(x[2])
>>> x
[1, IGNORED(2), 3]
>>> x + 2
[3, IGNORED(4), 5]
>>> x + y
[11, IGNORED(22), 33]
>>> z = x.sum()
>>> z
IGNORED(6)
>>> unignore(z)
>>> z
6
>>> x.sum(skipIGNORED=True)
4

When done in this fashion, I think it is perfectly fine for masks to be
unmasked.


Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-04 Thread Gary Strangman



 NAN and NA apparently fall into the PdS class.


Here is where I think we need to be a bit more careful.  It is true that we want
NAN and MISSING to propagate, but then we additionally want to ignore it
sometimes.  This is precisely why we have functions like nansum.  Although 
people
are well-aware of this desire, I think this thread has largely conflated the
issues when discussing propagation.

To push this forward a bit, can I propose that IGNORE behave as:   PnC

 x = np.array([1, 2, 3])
 y = np.array([10, 20, 30])
 ignore(x[2])
 x
[1, IGNORED(2), 3]
 x + 2
[3, IGNORED(4), 5]
 x + y
[11, IGNORED(22), 33]
 z = x.sum()
 z
IGNORED(6)
 unignore(z)
 z
6
 x.sum(skipIGNORED=True)
4

When done in this fashion, I think it is perfectly fine for masks to be
unmasked.


In my mind, IGNORED items should be skipped by default (i.e., skipIGNORED 
seems redundant ... isn't that what ignoring is all about?). Thus I might 
instead suggest the opposite (default) behavior at the end:



>>> x = np.array([1, 2, 3])
>>> y = np.array([10, 20, 30])
>>> ignore(x[2])
>>> x
[1, IGNORED(2), 3]
>>> x + 2
[3, IGNORED(4), 5]
>>> x + y
[11, IGNORED(22), 33]
>>> z = x.sum()
>>> z
4
>>> unignore(x).sum()
6
>>> x.sum(keepIGNORED=True)
6

(Obviously all the syntax is totally up for debate.)

-best
Gary




Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-04 Thread T J
On Fri, Nov 4, 2011 at 1:03 PM, Gary Strangman
str...@nmr.mgh.harvard.eduwrote:


 To push this forward a bit, can I propose that IGNORE behave as:   PnC

  x = np.array([1, 2, 3])
  y = np.array([10, 20, 30])
  ignore(x[2])
  x
 [1, IGNORED(2), 3]
  x + 2
 [3, IGNORED(4), 5]
  x + y
 [11, IGNORED(22), 33]
  z = x.sum()
  z
 IGNORED(6)
  unignore(z)
  z
 6
  x.sum(skipIGNORED=True)
 4


 In my mind, IGNORED items should be skipped by default (i.e., skipIGNORED
 seems redundant ... isn't that what ignoring is all about?). Thus I might
 instead suggest the opposite (default) behavior at the end:


 >>> x = np.array([1, 2, 3])
 >>> y = np.array([10, 20, 30])
 >>> ignore(x[2])
 >>> x
 [1, IGNORED(2), 3]
 >>> x + 2
 [3, IGNORED(4), 5]
 >>> x + y
 [11, IGNORED(22), 33]
 >>> z = x.sum()
 >>> z
 4
 >>> unignore(x).sum()
 6
 >>> x.sum(keepIGNORED=True)
 6

 (Obviously all the syntax is totally up for debate.)



I agree that it would be ideal if the default were to skip IGNORED values,
but that behavior seems inconsistent with its propagation properties (such
as when adding arrays with IGNORED values).  To illustrate, when we did
x+2, we were stating that:

IGNORED(2) + 2 == IGNORED(4)

which means that we propagated the IGNORED value.  If we were to skip them
by default, then we'd have:

IGNORED(2) + 2 == 2

To be consistent, then it seems we also should have had:

>>> x + 2
[3, 2, 5]

which I think we can agree is not so desirable.   What this seems to come
down to is that we tend to want different behavior when we are doing
reductions, and that for IGNORED data, we want it to propagate in every
situation except for a reduction (where we want to skip over it).

I don't know if there is a well-defined way to distinguish reductions from
the other operations.  Would it hold for generalized ufuncs?  Would it hold
for other functions which might return arrays instead of scalars?


Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-04 Thread Gary Strangman



On Fri, Nov 4, 2011 at 1:03 PM, Gary Strangman str...@nmr.mgh.harvard.edu
wrote:

  To push this forward a bit, can I propose that IGNORE behave
  as:   PnC

   x = np.array([1, 2, 3])
   y = np.array([10, 20, 30])
   ignore(x[2])
   x
  [1, IGNORED(2), 3]
   x + 2
  [3, IGNORED(4), 5]
   x + y
  [11, IGNORED(22), 33]
   z = x.sum()
   z
  IGNORED(6)
   unignore(z)
   z
  6
   x.sum(skipIGNORED=True)
  4


In my mind, IGNORED items should be skipped by default (i.e., skipIGNORED
seems redundant ... isn't that what ignoring is all about?). Thus I might
instead suggest the opposite (default) behavior at the end:

  >>> x = np.array([1, 2, 3])
  >>> y = np.array([10, 20, 30])
  >>> ignore(x[2])
  >>> x
  [1, IGNORED(2), 3]
  >>> x + 2
  [3, IGNORED(4), 5]
  >>> x + y
  [11, IGNORED(22), 33]
  >>> z = x.sum()
  >>> z
  4
  >>> unignore(x).sum()
  6
  >>> x.sum(keepIGNORED=True)
  6

(Obviously all the syntax is totally up for debate.)



I agree that it would be ideal if the default were to skip IGNORED values, but
that behavior seems inconsistent with its propagation properties (such as when
adding arrays with IGNORED values).  To illustrate, when we did x+2, we were
stating that:

IGNORED(2) + 2 == IGNORED(4)

which means that we propagated the IGNORED value.  If we were to skip them by
default, then we'd have:

IGNORED(2) + 2 == 2

To be consistent, then it seems we also should have had:

 x + 2
[3, 2, 5]

which I think we can agree is not so desirable.  What this seems to come
down to is that we tend to want different behavior when we are doing
reductions, and that for IGNORED data, we want it to propagate in every
situation except for a reduction (where we want to skip over it).

I don't know if there is a well-defined way to distinguish reductions from
the other operations.  Would it hold for generalized ufuncs?  Would it hold
for other functions which might return arrays instead of scalars?


Ahhh, yes. That clearly explains the issue hung up in my mind, and also
clarifies what I was getting at with the elementwise vs. reduction 
distinction I made earlier today. Maybe this is a pickle in a jar with no 
lid. I'll have to think about it ...


-best
Gary




Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-04 Thread Benjamin Root
On Fri, Nov 4, 2011 at 1:59 PM, Pauli Virtanen p...@iki.fi wrote:


 For shorthand, we can refer to the above choices with the nomenclature

shorthand ::= propagation destructivity payload_type
propagation ::= P | N
destructivity ::= d | n | s
payload_type ::= S | E | C


I really like this problem formulation and description.  Can we all agree
to use this language/shorthand from this point on?  I think it really
illuminates the discussion, and I would like to have it added to the wiki
page.

Thanks!
Ben Root


Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-04 Thread Nathaniel Smith
On Fri, Nov 4, 2011 at 1:22 PM, T J tjhn...@gmail.com wrote:
 I agree that it would be ideal if the default were to skip IGNORED values,
 but that behavior seems inconsistent with its propagation properties (such
 as when adding arrays with IGNORED values).  To illustrate, when we did
 x+2, we were stating that:

 IGNORED(2) + 2 == IGNORED(4)

 which means that we propagated the IGNORED value.  If we were to skip them
 by default, then we'd have:

 IGNORED(2) + 2 == 2

 To be consistent, then it seems we also should have had:

 x + 2
 [3, 2, 5]

 which I think we can agree is not so desirable.   What this seems to come
 down to is that we tend to want different behavior when we are doing
 reductions, and that for IGNORED data, we want it to propagate in every
 situation except for a reduction (where we want to skip over it).

 I don't know if there is a well-defined way to distinguish reductions from
 the other operations.  Would it hold for generalized ufuncs?  Would it hold
 for other functions which might return arrays instead of scalars?

Continuing my theme of looking for consensus first... there are
obviously a ton of ugly corners in here. But my impression is that at
least for some simple cases, it's clear what users want:

>>> a = [1, IGNORED(2), 3]
# array-with-ignored-values + unignored scalar only affects unignored values
>>> a + 2
[3, IGNORED(2), 5]
# reduction operations skip ignored values
>>> np.sum(a)
4

For example, Gary mentioned the common idiom of wanting to take an
array and subtract off its mean, and he wants to do that while leaving
the masked-out/ignored values unchanged. As long as the above cases
work the way I wrote, we will have

>>> np.mean(a)
2
>>> a -= np.mean(a)
>>> a
[-1, IGNORED(2), 1]

Which I'm pretty sure is the result that he wants. (Gary, is that
right?) Also numpy.ma follows these rules, so that's some additional
evidence that they're reasonable. (And I think part of the confusion
between Lluís and me was that these are the rules that I meant when I
said "non-propagating", but he understood that to mean something
else.)

So before we start exploring the whole vast space of possible ways to
handle masked-out data, does anyone see any reason to consider rules
that don't have, as a subset, the ones above? Do other rules have any
use cases or user demand? (I *love* playing with clever mathematics
and making things consistent, but there's not much point unless the
end result is something that people will use :-).)

-- Nathaniel


Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-04 Thread Pauli Virtanen
04.11.2011 20:49, T J wrote:
[clip]
 To push this forward a bit, can I propose that IGNORE behave as:   PnC

The *n* classes can be a bit confusing in Python:

### PnC

>>> x = np.array([1, 2, 3])
>>> y = np.array([4, 5, 6])
>>> ignore(y[1])
>>> z = x + y
>>> z
np.array([5, IGNORE(7), 9])
>>> x += y  # NB: x[1] := x[1] + y[1]
>>> x
np.array([5, 2, 3])

 ***

I think I defined "destructive" and "non-destructive" in a different way
than earlier in the thread. Maybe this behavior from np.ma is closer to
what was meant earlier:

>>> x = np.ma.array([1, 2, 3], mask=[0, 0, 1])
>>> y = np.ma.array([4, 5, 6], mask=[0, 1, 1])
>>> x += y
>>> x
masked_array(data = [5 -- --],
             mask = [False  True  True],
       fill_value = 999999)
>>> x.data
array([5, 2, 3])


Let's call this (since I botched and already reserved the letter n :)

(m) mark-ignored

a := SPECIAL_1
# -> a == SPECIAL_a ; the payload of the RHS is neglected,
#    the assigned value has the original LHS as the payload

-- 
Pauli Virtanen



Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-04 Thread T J
On Fri, Nov 4, 2011 at 2:41 PM, Pauli Virtanen p...@iki.fi wrote:

 04.11.2011 20:49, T J wrote:
 [clip]
  To push this forward a bit, can I propose that IGNORE behave as:   PnC

 The *n* classes can be a bit confusing in Python:

 ### PnC

   x = np.array([1, 2, 3])
   y = np.array([4, 5, 6])
   ignore(y[1])
   z = x + y
   z
 np.array([5, IGNORE(7), 9])
   x += y # NB: x[1] := x[1] + y[1]
   x
 np.array([5, 2, 3])

 ***


Interesting.


 I think I defined the destructive and non-destructive in a different
 way than earlier in the thread. Maybe this behavior from np.ma is closer
 to what was meant earlier:

   x = np.ma.array([1, 2, 3], mask=[0, 0, 1])
   y = np.ma.array([4, 5, 6], mask=[0, 1, 1])
   x += y
   x
 masked_array(data = [5 -- --],
              mask = [False  True  True],
        fill_value = 999999)
   x.data
 array([5, 2, 3])


 Let's call this (since I botched and already reserved the letter n :)

 (m) mark-ignored

 a := SPECIAL_1
 # -> a == SPECIAL_a ; the payload of the RHS is neglected,
 #    the assigned value has the original LHS as the payload



Does this behave as expected for x + y (as opposed to the inplace
operation)?

>>> z = x + y
>>> z
np.array([5, IGNORED(2), IGNORED(3)])
>>> x += y
>>> x
np.array([5, IGNORED(2), IGNORED(3)])

However, doesn't this have the issue that Nathaniel brought up earlier:
commutativity

unignore(x + y) != unignore(y + x)


Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-04 Thread Nathaniel Smith
On Fri, Nov 4, 2011 at 11:59 AM, Pauli Virtanen p...@iki.fi wrote:
 I have a feeling that if you don't start by mathematically defining the
 scalar operations first, and only after that generalize them to arrays,
 some conceptual problems may follow.

 On the other hand, I should note that numpy.ma does not work this way,
 and many people seem still happy with how it works.

Yes, my impression is that people who want MISSING just want something
that acts like a special scalar value (PdS, in your scheme), but the
people who want IGNORED want something that *can't* be defined in this
way (see my other recent post). That said...

 There are a two options how to behave with respect to binary/unary
 operations:

 (P) Propagating

 unop(SPECIAL_1) == SPECIAL_new
 binop(SPECIAL_1, SPECIAL_2) == SPECIAL_new
 binop(a, SPECIAL) == SPECIAL_new

 (N) Non-propagating

 unop(SPECIAL_1) == SPECIAL_new
 binop(SPECIAL_1, SPECIAL_2) == SPECIAL_new
 binop(a, SPECIAL) == binop(a, binop.identity) == a

SPECIAL_1 means a special value with payload 1, right? Same thing
that some of us have been writing IGNORED(1) in other places?

Assuming that, I believe that what people want for IGNORED values is
  unop(SPECIAL_1) == SPECIAL_1
which doesn't seem to be an option in your taxonomy.

There's also the option of binop(a, SPECIAL) -> error.

 And three options on what to do on assignment:

 (d) Destructive

 a := SPECIAL      # -> a == SPECIAL

 (n) Non-destructive

 a := SPECIAL      # -> a unchanged

 (s) Self-destructive

 a := SPECIAL_1
 # -> if `a` is SPECIAL-class, then a == SPECIAL_1,
 #    otherwise `a` remains unchanged

I'm not sure assignment is a useful way to think about what we've
been calling IGNORED values (for MISSING/NA it's fine). I've been
talking about masking/unmasking values or toggling the IGNORED
state, because my impression is that what people want is something
like:

a[0] = 3
a[0] = SPECIAL
# now a[0] == SPECIAL(3)

This is pretty confusing when written as an assignment (and note that
now I'm assigning into an array, because if I were just assigning to a
python variable then these semantics would be impossible to
implement!). So we might prefer a syntax like
  a.visible[0] = False
or
  a.ignore(0)

 If classified this way, behaviour of items in np.ma arrays is different
 in different operations, but seems roughly PdX, where X stands for
 returning a masked value with the first argument as the payload in
 binary ops if either argument is masked.

No -- np.ma implements the assignment semantics I described above, not
d semantics. Trimming some output for readability:

>>> a = np.ma.masked_array([1, 2, 3])
>>> a[1] = np.ma.masked
>>> a
[1, --, 3]
>>> a.mask[1] = False
>>> a
[1, 2, 3]

So assignment is not destructive -- the old value is retained as the payload.

-- Nathaniel


Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-04 Thread T J
On Fri, Nov 4, 2011 at 2:29 PM, Nathaniel Smith n...@pobox.com wrote:

 On Fri, Nov 4, 2011 at 1:22 PM, T J tjhn...@gmail.com wrote:
  I agree that it would be ideal if the default were to skip IGNORED
 values,
  but that behavior seems inconsistent with its propagation properties
 (such
  as when adding arrays with IGNORED values).  To illustrate, when we did
  x+2, we were stating that:
 
  IGNORED(2) + 2 == IGNORED(4)
 
  which means that we propagated the IGNORED value.  If we were to skip
 them
  by default, then we'd have:
 
  IGNORED(2) + 2 == 2
 
  To be consistent, then it seems we also should have had:
 
  x + 2
  [3, 2, 5]
 
  which I think we can agree is not so desirable.   What this seems to come
  down to is that we tend to want different behavior when we are doing
  reductions, and that for IGNORED data, we want it to propagate in every
  situation except for a reduction (where we want to skip over it).
 
  I don't know if there is a well-defined way to distinguish reductions
 from
  the other operations.  Would it hold for generalized ufuncs?  Would it
 hold
  for other functions which might return arrays instead of scalars?

 Continuing my theme of looking for consensus first... there are
 obviously a ton of ugly corners in here. But my impression is that at
 least for some simple cases, it's clear what users want:

  a = [1, IGNORED(2), 3]
 # array-with-ignored-values + unignored scalar only affects unignored
 values
  a + 2
 [3, IGNORED(2), 5]
 # reduction operations skip ignored values
  np.sum(a)
 4

 For example, Gary mentioned the common idiom of wanting to take an
 array and subtract off its mean, and he wants to do that while leaving
 the masked-out/ignored values unchanged. As long as the above cases
 work the way I wrote, we will have

  np.mean(a)
 2
  a -= np.mean(a)
  a
 [-1, IGNORED(2), 1]

 Which I'm pretty sure is the result that he wants. (Gary, is that
 right?) Also numpy.ma follows these rules, so that's some additional
 evidence that they're reasonable. (And I think part of the confusion
 between Lluís and me was that these are the rules that I meant when I
 said non-propagating, but he understood that to mean something
 else.)

 So before we start exploring the whole vast space of possible ways to
 handle masked-out data, does anyone see any reason to consider rules
 that don't have, as a subset, the ones above? Do other rules have any
 use cases or user demand? (I *love* playing with clever mathematics
 and making things consistent, but there's not much point unless the
 end result is something that people will use :-).)


I guess I'm just confused on how one, in principle, would distinguish the
various forms of propagation that you are suggesting (ie for reductions).
I also don't think it is good that we lack commutativity.  If we disallow
unignoring, then yes, I agree that what you wrote above is what people
want.  But if we are allowed to unignore, then I do not.

Also, how does something like this get handled?

>>> a = [1, 2, IGNORED(3), NaN]

If I were to say, "What is the mean of a?", then I think most of the time
people would want 1.5.  I guess if we kept nanmean around, then we could do:

>>> a -= np.nanmean(a)
>>> a
[-.5, .5, IGNORED(3), NaN]

Sorry if this is considered digging deeper than consensus.  I'm just
curious if arrays having NaNs in them, in addition to IGNORED, causes
problems.


Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-04 Thread Nathaniel Smith
On Fri, Nov 4, 2011 at 3:04 PM, Nathaniel Smith n...@pobox.com wrote:
 On Fri, Nov 4, 2011 at 11:59 AM, Pauli Virtanen p...@iki.fi wrote:
 If classified this way, behaviour of items in np.ma arrays is different
 in different operations, but seems roughly PdX, where X stands for
 returning a masked value with the first argument as the payload in
 binary ops if either argument is masked.

 No -- np.ma implements the assignment semantics I described above, not
 d semantics. Trimming some output for readability:

Oops, I see we cross-posted :-). So never mind that part of my email...

Okay, I'm going to follow my own suggestion now and stop talking about
these thorny details until I know whether we can simplify the
discussion to only considering schemes that are consistent with the
rules I posted about here:
http://article.gmane.org/gmane.comp.python.numeric.general/46760

-- Nathaniel


Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-04 Thread Pauli Virtanen
04.11.2011 22:57, T J wrote:
[clip]
(m) mark-ignored
 
 a := SPECIAL_1
 # -> a == SPECIAL_a ; the payload of the RHS is neglected,
 #    the assigned value has the original LHS as the payload
[clip]
 Does this behave as expected for x + y (as opposed to the inplace
 operation)?
[clip]

The definition is for assignment, and does not concern binary ops, so it 
behaves as expected. The inplace operation

  x += y

is defined as equivalent to binary op followed by assignment

  x[:] = x + y

as far as the missing values are concerned.

 However, doesn't this have the issue that Nathaniel brought up earlier:
 commutativity

 unignore(x + y) != unignore(y + x)

As the definition concerns only what happens on assignment, it does not 
have problems with commutativity.

-- 
Pauli Virtanen



Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-04 Thread Nathaniel Smith
On Fri, Nov 4, 2011 at 3:08 PM, T J tjhn...@gmail.com wrote:
 On Fri, Nov 4, 2011 at 2:29 PM, Nathaniel Smith n...@pobox.com wrote:
 Continuing my theme of looking for consensus first... there are
 obviously a ton of ugly corners in here. But my impression is that at
 least for some simple cases, it's clear what users want:

  a = [1, IGNORED(2), 3]
 # array-with-ignored-values + unignored scalar only affects unignored
 values
  a + 2
 [3, IGNORED(2), 5]
 # reduction operations skip ignored values
  np.sum(a)
 4

 For example, Gary mentioned the common idiom of wanting to take an
 array and subtract off its mean, and he wants to do that while leaving
 the masked-out/ignored values unchanged. As long as the above cases
 work the way I wrote, we will have

  np.mean(a)
 2
  a -= np.mean(a)
  a
 [-1, IGNORED(2), 1]

 Which I'm pretty sure is the result that he wants. (Gary, is that
 right?) Also numpy.ma follows these rules, so that's some additional
 evidence that they're reasonable. (And I think part of the confusion
 between Lluís and me was that these are the rules that I meant when I
 said non-propagating, but he understood that to mean something
 else.)

 So before we start exploring the whole vast space of possible ways to
 handle masked-out data, does anyone see any reason to consider rules
 that don't have, as a subset, the ones above? Do other rules have any
 use cases or user demand? (I *love* playing with clever mathematics
 and making things consistent, but there's not much point unless the
 end result is something that people will use :-).)

 I guess I'm just confused on how one, in principle, would distinguish the
 various forms of propagation that you are suggesting (ie for reductions).

Well, numpy.ma does work this way, so certainly it's possible to do.
At the code level, np.add() and np.add.reduce() are different entry
points and can behave differently.
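
For instance (a sketch using today's NumPy, where reductions grew a where=
argument much later than this thread; the visible-mask convention here is
my own assumption):

    import numpy as np

    a = np.array([1.0, 2.0, 3.0])
    visible = np.array([True, False, True])     # False == ignored

    elementwise = np.add(a, 2.0)                # elementwise entry point
    skipped = np.add.reduce(a, where=visible)   # reduction entry point -> 4.0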

OTOH, it might be that it's impossible to do *while still maintaining
other things we care about*... but in that case we should just shake
our fists at the mathematics and then give up, instead of coming up
with an elegant system that isn't actually useful. So that's why I
think we should figure out what's useful first.

 I also don't think it is good that we lack commutativity.  If we disallow
 unignoring, then yes, I agree that what you wrote above is what people
 want.  But if we are allowed to unignore, then I do not.

I *think* that for the no-unignoring (also known as MISSING) case,
we have a pretty clear consensus that we want something like:

>>> a + 2
[3, MISSING, 5]
>>> np.sum(a)
MISSING
>>> np.sum(a, skip_MISSING=True)
4

(Please say if you disagree, but I really hope you don't!) This case
is also easier, because we don't even have to allow a skip_MISSING
flag in cases where it doesn't make sense (e.g., unary or binary
operations) -- it's a convenience feature, so no-one will care if it
only works when it's useful ;-).

The use case that we're still confused about is specifically the one
where people want to *temporarily* hide parts of their data, do some
calculations that ignore those parts of their data, and then unhide
that data again -- e.g., see Gary's first post in this thread. So for
this use case, allowing unignore is definitely important, and having
np.sum() return IGNORED seems pretty useless to me. (When an operation
involves actually missing data, then you need to stop and think what
would be a statistically meaningful way to handle that -- sometimes
it's skip_MISSING, sometimes something else. So np.sum returning
MISSING is useful - it tells you something you might not have
realized. If you just ignored some data because you want to ignore
that data, then having np.sum return IGNORED is useless, because it
tells you something you already knew perfectly well.)

 Also, how does something like this get handled?

 a = [1, 2, IGNORED(3), NaN]

 If I were to say, What is the mean of 'a'?, then I think most of the time
 people would want 1.5.

I would want NaN! But that's because the only way I get NaN's is when
I do dumb things like compute log(0), and again, I want my code to
tell me that I was dumb instead of just quietly making up a
meaningless answer.

-- Nathaniel


Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-04 Thread Pauli Virtanen
04.11.2011 23:04, Nathaniel Smith wrote:
[clip]
 Assuming that, I believe that what people want for IGNORED values is
unop(SPECIAL_1) == SPECIAL_1
 which doesn't seem to be an option in your taxonomy.

Well, you can always add a new branch for rules on what to do with unary 
ops.

[clip]
 I'm not sure assignment is a useful way to think about what we've
 been calling IGNORED values (for MISSING/NA it's fine). I've been
 talking about masking/unmasking values or toggling the IGNORED
 state, because my impression is that what people want is something
 like:

 a[0] = 3
 a[0] = SPECIAL
 # now a[0] == SPECIAL(3)

That's partly syntax sugar. What I meant above by assignment is what 
happens on

a[:] = b

and what should occur in in-place operations,

 a += b

which are equivalent to

 a[:] = a + b

Yeah, it's a different definition for "destructive" and "non-destructive"
than what was used earlier in the discussion.

[clip]
 If classified this way, behaviour of items in np.ma arrays is different
 in different operations, but seems roughly PdX, where X stands for
 returning a masked value with the first argument as the payload in
 binary ops if either argument is masked.

 No -- np.ma implements the assignment semantics I described above, not
 d semantics. Trimming some output for readability:

Well, np.ma implements d semantics, but because of the way binary ops 
are noncommutative, in-place binary ops behave as if they were not mutating.

Assignments do actually change the masked data:

  a[:] = b

which also changes masked values in `a`. That may be a bug.

-- 
Pauli Virtanen



Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-04 Thread T J
On Fri, Nov 4, 2011 at 3:38 PM, Nathaniel Smith n...@pobox.com wrote:

 On Fri, Nov 4, 2011 at 3:08 PM, T J tjhn...@gmail.com wrote:
  On Fri, Nov 4, 2011 at 2:29 PM, Nathaniel Smith n...@pobox.com wrote:
  Continuing my theme of looking for consensus first... there are
  obviously a ton of ugly corners in here. But my impression is that at
  least for some simple cases, it's clear what users want:
 
   a = [1, IGNORED(2), 3]
  # array-with-ignored-values + unignored scalar only affects unignored
  values
   a + 2
  [3, IGNORED(2), 5]
  # reduction operations skip ignored values
   np.sum(a)
  4
 
  For example, Gary mentioned the common idiom of wanting to take an
  array and subtract off its mean, and he wants to do that while leaving
  the masked-out/ignored values unchanged. As long as the above cases
  work the way I wrote, we will have
 
   np.mean(a)
  2
   a -= np.mean(a)
   a
  [-1, IGNORED(2), 1]
 
  Which I'm pretty sure is the result that he wants. (Gary, is that
  right?) Also numpy.ma follows these rules, so that's some additional
  evidence that they're reasonable. (And I think part of the confusion
  between Lluís and me was that these are the rules that I meant when I
  said non-propagating, but he understood that to mean something
  else.)
 
  So before we start exploring the whole vast space of possible ways to
  handle masked-out data, does anyone see any reason to consider rules
  that don't have, as a subset, the ones above? Do other rules have any
  use cases or user demand? (I *love* playing with clever mathematics
  and making things consistent, but there's not much point unless the
  end result is something that people will use :-).)
 
  I guess I'm just confused on how one, in principle, would distinguish the
  various forms of propagation that you are suggesting (ie for reductions).

 Well, numpy.ma does work this way, so certainly it's possible to do.
 At the code level, np.add() and np.add.reduce() are different entry
 points and can behave differently.


I see your point, but that seems like just an API difference with a bad
name.  reduce() is just calling add() a bunch of times, so it seems like it
should behave as add() does.  That we can create different behaviors with
various assignment rules (like Pauli's 'm' for mark-ignored), only makes it
more confusing to me.

a = 1
a += 2
a += IGNORE

b = 1 + 2 + IGNORE

I think having a == b is essential.  If they can be different, that will
only lead to confusion.  On this point alone, does anyone think it is
acceptable to have a != b?



 OTOH, it might be that it's impossible to do *while still maintaining
 other things we care about*... but in that case we should just shake
 our fists at the mathematics and then give up, instead of coming up
 with an elegant system that isn't actually useful. So that's why I
 think we should figure out what's useful first.


Agreed.  I'm on the same page.



  I also don't think it is good that we lack commutativity.  If we disallow
  unignoring, then yes, I agree that what you wrote above is what people
  want.  But if we are allowed to unignore, then I do not.

 I *think* that for the no-unignoring (also known as MISSING) case,
 we have a pretty clear consensus that we want something like:

  a + 2
 [3, MISSING, 5]
  np.sum(a)
 MISSING
  np.sum(a, skip_MISSING=True)
 4

 (Please say if you disagree, but I really hope you don't!) This case
 is also easier, because we don't even have to allow a skip_MISSING
 flag in cases where it doesn't make sense (e.g., unary or binary
 operations) -- it's a convenience feature, so no-one will care if it
 only works when it's useful ;-).


Yes, in agreement.  I was talking specifically about the IGNORE case.   And
my point is that if we allow people to remove the IGNORE flag and see the
original data (and if the payloads are computed), then we should care about
commutativity:

>>> x = [1, IGNORE(2), 3]
>>> x2 = x.copy()
>>> y = [10, 11, IGNORE(12)]
>>> z = x + y
>>> a = z.sum()
>>> x += y
>>> b = x.sum()
>>> y += x2
>>> c = y.sum()

So, we should have:  a == b == c.
Additionally, if we allow users to unignore data, then we should have:

>>> x = [1, IGNORE(2), 3]
>>> x2 = x.copy()
>>> y = [10, 11, IGNORE(12)]
>>> x += y
>>> aa = unignore(x).sum()
>>> y += x2
>>> bb = unignore(y).sum()
>>> aa == bb
True

Is there agreement on this?


 Also, how does something like this get handled?
 
  a = [1, 2, IGNORED(3), NaN]
 
  If I were to say, What is the mean of 'a'?, then I think most of the
 time
  people would want 1.5.


 I would want NaN! But that's because the only way I get NaN's is when
 I do dumb things like compute log(0), and again, I want my code to
 tell me that I was dumb instead of just quietly making up a
 meaningless answer.


That's definitely field specific then.  In probability, the convention
0 log(0) = 0 is a common idiom.  In NumPy, 0 * np.log(0) gives NaN, so
you'd want to ignore those terms when summing.

Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-04 Thread Pauli Virtanen
04.11.2011 23:29, Pauli Virtanen wrote:
[clip]
 As the definition concerns only what happens on assignment, it does not
 have problems with commutativity.

This is of course then not really true in a wider sense, as an example 
from T J shows:

a = 1
a += IGNORE(3)
# -> a := a + IGNORE(3)
# -> a := IGNORE(4)
# -> a == IGNORE(1)

which is different from

a = 1 + IGNORE(3)
# -> a == IGNORE(4)

Damn, it seemed so good. Probably anything except destructive assignment
leads to problems like this with propagating special values.

-- 
Pauli Virtanen



Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-04 Thread T J
On Fri, Nov 4, 2011 at 4:29 PM, Pauli Virtanen p...@iki.fi wrote:

 04.11.2011 23:29, Pauli Virtanen wrote:
 [clip]
  As the definition concerns only what happens on assignment, it does not
  have problems with commutativity.

 This is of course then not really true in a wider sense, as an example
 from T J shows:

 a = 1
 a += IGNORE(3)
 # -> a := a + IGNORE(3)
 # -> a := IGNORE(4)
 # -> a == IGNORE(1)

 which is different from

 a = 1 + IGNORE(3)
 # -> a == IGNORE(4)

 Damn, it seemed so good. Probably anything except destructive assignment
 leads to problems like this with propagating special values.



Ok... with what I understand now, it seems like for almost all operations:

   MISSING : PdS
   IGNORED : PdC  (this gives commutativity when unignoring data points)

When we want some sort of reduction, we want to change the behavior for
IGNORED so that it skips the IGNORED values by default. Personally, I still
believe that this inconsistent behavior warrants a new method name.  What
I mean is:

>>> x = np.array([1, IGNORED(2), 3])
>>> y = x.sum()
>>> z = x[0] + x[1] + x[2]

To say that y != z will only be a source of confusion. To remedy, we force
people to be explicit, even if they'll need to be explicit 99% of the time:

>>> q = x.sum(skipIGNORED=True)

Then we can have y == z and y != q.  To make the 99% use case easier, we
provide a new method which passes the keyword for us.



With PdS and PdC it seems rather clear to me why MISSING should be
implemented as a bit pattern and IGNORED implemented using masks.  Setting
implementation details aside and going back to Nathaniel's original
biggest *un*resolved question, I am now convinced that these (IGNORED and
MISSING) should be distinct API concepts, each also distinct from NaN with
floating point dtypes.  The NA implementation in NumPy does not seem to
match either of these (IGNORED and MISSING) exactly.  One cannot, as far
as I know, unignore an element marked as NA.
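
A sketch of the distinction I mean (NaN standing in for a dedicated NA bit
pattern; the names here are only illustrative):

    import numpy as np

    data = np.array([1.0, 2.0, 3.0])

    # MISSING as a bit pattern: the payload is destroyed in place,
    # so there is nothing left to unignore.
    missing = data.copy()
    missing[1] = np.nan

    # IGNORED as a mask: the payload survives alongside the mask,
    # so unignoring is just flipping a mask entry.
    mask = np.array([False, True, False])
    # data is still [1., 2., 3.] underneath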


Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-04 Thread Pauli Virtanen
04.11.2011 22:29, Nathaniel Smith wrote:
[clip]
 Continuing my theme of looking for consensus first... there are
 obviously a ton of ugly corners in here. But my impression is that at
 least for some simple cases, it's clear what users want:

 a = [1, IGNORED(2), 3]
 # array-with-ignored-values + unignored scalar only affects unignored values
 a + 2
 [3, IGNORED(2), 5]
 # reduction operations skip ignored values
 np.sum(a)
 4

This can break commutativity:

  a = [1, IGNORED(2), 3]
  b = [4, IGNORED(5), 6]
  x = a + b
  y = b + a
  x[1] = ???
  y[1] = ???

Defining

unop(IGNORED(a)) == IGNORED(a)
binop(IGNORED(a), b) == IGNORED(a)
binop(a, IGNORED(b)) == IGNORED(b)
binop(IGNORED(a), IGNORED(b)) == IGNORED(binop(a, b))   # or NA

could however get around that. That seems to be pretty much how NaN 
works, except that it now carries a hidden value with it.
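
A quick scalar check (toy class, hypothetical, not a NumPy API) that these
rules make the unmasked result independent of operand order:

    class IG:
        def __init__(self, v):
            self.v = v

    def add(x, y):
        if isinstance(x, IG) and isinstance(y, IG):
            return IG(x.v + y.v)    # binop(IGNORED(a), IGNORED(b))
        if isinstance(x, IG):
            return IG(x.v)          # binop(IGNORED(a), b) == IGNORED(a)
        if isinstance(y, IG):
            return IG(y.v)          # binop(a, IGNORED(b)) == IGNORED(b)
        return x + y

    assert add(IG(2), 5).v == add(5, IG(2)).v == 2   # unmask commutes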

 For example, Gary mentioned the common idiom of wanting to take an
 array and subtract off its mean, and he wants to do that while leaving
 the masked-out/ignored values unchanged. As long as the above cases
 work the way I wrote, we will have

 np.mean(a)
 2
 a -= np.mean(a)
 a
 [-1, IGNORED(2), 1]

That would be propagating + the above NaN-like rules for binary operators.

Whether the reduction methods have skip_IGNORE=True as default or not is 
in my opinion more of an API question, rather than a question on how the 
algebra of ignored values should work.

 ***

If destructive assignment is really needed to avoid problems with
commutation [see T. J. (2011)], this is then maybe a problem. So, one
would need to have

>>> x = [1, IGNORED(2), 3]
>>> y = [1, IGNORED(2), 3]
>>> z = [4, IGNORED(5), IGNORED(6)]

>>> x[:] = z
>>> x
[4, IGNORED(5), IGNORED(6)]

>>> y += z
>>> y
[5, IGNORED(7), IGNORED(6)]

This is not how np.ma works. But if you do otherwise, there doesn't seem 
to be any guarantee that

  a += 42
  a += b

is the same thing as

  a += b
  a += 42

[clip]
 So before we start exploring the whole vast space of possible ways to
 handle masked-out data, does anyone see any reason to consider rules
 that don't have, as a subset, the ones above? Do other rules have any
 use cases or user demand? (I *love* playing with clever mathematics
 and making things consistent, but there's not much point unless the
 end result is something that people will use :-).)

Yep, it's important to keep in mind what people want.

People however tend to implicitly expect that simple arithmetic 
operations on arrays, containing ignored values or not, operate in a 
certain way. Actually stating how these operations work with scalars 
gives valuable insight on how you'd like things to work.

Also, if you propose to break the rules of arithmetic, in a fundamental 
library meant for scientific computation, you should be aware that you 
do so, and how you do so.

I mean, at least for me it was not clear before this formulation that 
there was a reason why binary ops in np.ma were not commutative! Now I 
kind of see that there is an asymmetry in assignment into masked arrays, 
and there is a conflict with commuting operations and with what you'd 
expect ignored values to do. I'm not sure if it's possible to get rid 
of this problem, but it could be possible to restrict it to assignments 
and in-place operations rather than having it in binary ops.

-- 
Pauli Virtanen



Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-04 Thread Gary Strangman

 Also, how does something like this get handled?

 a = [1, 2, IGNORED(3), NaN]

 If I were to say, What is the mean of 'a'?, then I think most of the time
 people would want 1.5.

 I would want NaN! But that's because the only way I get NaN's is when
 I do dumb things like compute log(0), and again, I want my code to
 tell me that I was dumb instead of just quietly making up a
 meaningless answer.

As another data point, I prefer control over this sort of situation. 
Sometimes I'm completely in agreement with Nathaniel and want the 
operation to fail. Other times I am forced to perform operations (e.g. 
log) on a huge matrix, and I fully expect some 0s may be in there. For a 
complex enough chain of operations, looking for all the bad apples at each 
step in the chain can be prohibitive, so in those cases I'm looking for
"compute it if you can, and give me a NaN if you can't" ...

G




Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-04 Thread Pauli Virtanen
05.11.2011 00:14, T J wrote:
[clip]
  a = 1
  a += 2
  a += IGNORE
  b = 1 + 2 + IGNORE

 I think having a == b is essential.  If they can be different, that will
 only lead to confusion.  On this point alone, does anyone think it is
 acceptable to have a != b?

It seems to me that requiring this sort of thing places some limitations
on how array operations should behave.

An acid test for proposed rules: given two arrays `a` and `b`,

    a = [1, 2, IGNORED(3), IGNORED(4)]
    b = [10, IGNORED(20), 30, IGNORED(40)]

(a) Are the following pieces of code equivalent:

print unmask(a + b)
a += 42
a += b
print unmask(a)

 and

print unmask(b + a)
a += b
a += 42
print unmask(a)

(b) Are the following two statements equivalent (wrt. ignored values):

a += b
a[:] = a + b

For np.ma (a) is false whereas (b) is true. For arrays containing nans, 
on the other hand (a) and (b) are both true (but of course, in this case 
values cannot be unmasked).
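
For reference, the acid test is easy to run against np.ma (behavior of
recent NumPy versions; `.data` playing the role of unmask()), and it shows
(a) failing:

    import numpy as np

    a0 = np.ma.array([1, 2, 3, 4], mask=[0, 0, 1, 1])
    b = np.ma.array([10, 20, 30, 40], mask=[0, 1, 0, 1])

    x = a0.copy(); x += 42; x += b
    y = a0.copy(); y += b; y += 42
    print(x.data)   # [53 44  3  4]
    print(y.data)   # [53  2  3  4]  -> the unmasked payloads differ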

Is there a way to define operations so that (a) is true, while retaining 
the desired other properties of arrays with ignored values?

Is there a real-world need to have (a) be true?

-- 
Pauli Virtanen



Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-04 Thread T J
On Fri, Nov 4, 2011 at 6:31 PM, Pauli Virtanen p...@iki.fi wrote:

 05.11.2011 00:14, T J wrote:
 [clip]
   a = 1
   a += 2
   a += IGNORE
   b = 1 + 2 + IGNORE
 
  I think having a == b is essential.  If they can be different, that will
  only lead to confusion.  On this point alone, does anyone think it is
  acceptable to have a != b?

 It seems to me that requiring this sort of a thing gives some
 limitations on how array operations should behave.

 An acid test for proposed rules: given two arrays `a` and `b`,

 a = [1, 2, IGNORED(3), IGNORED(4)]
b = [10, IGNORED(20), 30, IGNORED(40)]

 (a) Are the following pieces of code equivalent:

print unmask(a + b)
 a += 42
a += b
 print unmask(a)

 and

print unmask(b + a)
 a += b
a += 42
 print unmask(a)

 (b) Are the following two statements equivalent (wrt. ignored values):

a += b
 a[:] = a + b

 For np.ma (a) is false whereas (b) is true. For arrays containing nans,
 on the other hand (a) and (b) are both true (but of course, in this case
 values cannot be unmasked).

 Is there a way to define operations so that (a) is true, while retaining
 the desired other properties of arrays with ignored values?


I thought that PdC satisfied (a) and (b).
Let me show you what I thought they were. Perhaps I am not being
consistent. If so, point out my mistake.

(A)  You are making two comparisons.

(A1)  Does  unmask(a+b) == unmask(b + a) ?

Yes.  They both equal:

   unmask([11, IGNORED(22), IGNORED(33), IGNORED(44)])
     = [11, 22, 33, 44]


(A2) Is `a` the same after `a += 42; a += b` as after `a += b; a += 42`?

Yes.

>>> a = [1, 2, IGNORED(3), IGNORED(4)]
>>> b = [10, IGNORED(20), 30, IGNORED(40)]
>>> a += 42; a
[43, 44, IGNORED(45), IGNORED(46)]
>>> a += b; a
[53, IGNORED(64), IGNORED(75), IGNORED(86)]

vs

>>> a = [1, 2, IGNORED(3), IGNORED(4)]
>>> b = [10, IGNORED(20), 30, IGNORED(40)]
>>> a += b; a
[11, IGNORED(22), IGNORED(33), IGNORED(44)]
>>> a += 42; a
[53, IGNORED(64), IGNORED(75), IGNORED(86)]


For part (B), I thought we were in agreement that in-place assignment
should be defined so that:

 a += b

is equivalent to:

 tmp = a + b
 a = tmp

If so, this definitely holds.


Have I missed something?  Probably.  Please spell it out for me.


Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-04 Thread Nathaniel Smith
On Fri, Nov 4, 2011 at 7:43 PM, T J tjhn...@gmail.com wrote:
 On Fri, Nov 4, 2011 at 6:31 PM, Pauli Virtanen p...@iki.fi wrote:
 An acid test for proposed rules: given two arrays `a` and `b`,

         a = [1, 2, IGNORED(3), IGNORED(4)]
        b = [10, IGNORED(20), 30, IGNORED(40)]
[...]
 (A1)  Does  unmask(a+b) == unmask(b + a) ?

 Yes.  They both equal:

    unmask([11, IGNORED(22), IGNORED(33), IGNORED(44)])
  =
    [11, 22, 33, 44]

Again, I really don't think you're going to be able to sell an API where
  [2] + [IGNORED(20)] == [IGNORED(22)]
I mean, it's not me you have to convince, it's Gary, Pierre, maybe
Benjamin, Lluís, etc. So I could be wrong. But you might want to
figure that out first before making plans based on this...

-- Nathaniel


Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-04 Thread T J
On Fri, Nov 4, 2011 at 8:03 PM, Nathaniel Smith n...@pobox.com wrote:

 On Fri, Nov 4, 2011 at 7:43 PM, T J tjhn...@gmail.com wrote:
  On Fri, Nov 4, 2011 at 6:31 PM, Pauli Virtanen p...@iki.fi wrote:
  An acid test for proposed rules: given two arrays `a` and `b`,
 
  a = [1, 2, IGNORED(3), IGNORED(4)]
 b = [10, IGNORED(20), 30, IGNORED(40)]
 [...]
  (A1)  Does  unmask(a+b) == unmask(b + a) ?
 
  Yes.  They both equal:
 
 unmask([11, IGNORED(22), IGNORED(33), IGNORED(44)])
   =
 [11, 22, 33, 44]

 Again, I really don't think you're going to be able to sell an API where
  [2] + [IGNORED(20)] == [IGNORED(22)]
 I mean, it's not me you have to convince, it's Gary, Pierre, maybe
 Benjamin, Lluís, etc. So I could be wrong. But you might want to
 figure that out first before making plans based on this...


But this is how np.ma currently does it, except that it doesn't compute the
payload---it just calls it IGNORED.
And it seems that this generalizes the way people want it to:

>>> z = [2, 4] + [IGNORED(20), 3]
>>> z
[IGNORED(22), 7]
>>> z.sum(skip_ignored=True)   # True could be the default
7
>>> z.sum(skip_ignored=False)
IGNORED(29)

I guess I am confused because it seems that you implicitly used this same
rule here:

Say we have
   a = np.array([1, IGNORED(2), 3])
   b = np.array([10, 20, 30])
(Here I'm using IGNORED(2) to mean a value that is currently
ignored, but if you unmasked it, it would have the value 2.)

Then we have:

# non-propagating **or** propagating, doesn't matter:
>>> a + 2
[3, IGNORED(2), 5]


That is, element-wise, you had to have done:

IGNORED(2) + 2 --> IGNORED(2)

I said it should be equal to IGNORED(4), but the result is still some form
of ignore.  Sorry if I am missing the bigger picture at this point... it's
late and a Friday.


Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-04 Thread Benjamin Root
On Fri, Nov 4, 2011 at 10:33 PM, T J tjhn...@gmail.com wrote:

 On Fri, Nov 4, 2011 at 8:03 PM, Nathaniel Smith n...@pobox.com wrote:

 On Fri, Nov 4, 2011 at 7:43 PM, T J tjhn...@gmail.com wrote:
  On Fri, Nov 4, 2011 at 6:31 PM, Pauli Virtanen p...@iki.fi wrote:
  An acid test for proposed rules: given two arrays `a` and `b`,
 
  a = [1, 2, IGNORED(3), IGNORED(4)]
 b = [10, IGNORED(20), 30, IGNORED(40)]
 [...]
  (A1)  Does  unmask(a+b) == unmask(b + a) ?
 
  Yes.  They both equal:
 
 unmask([11, IGNORED(22), IGNORED(33), IGNORED(44)])
   =
 [11, 22, 33, 44]

 Again, I really don't think you're going to be able to sell an API where
  [2] + [IGNORED(20)] == [IGNORED(22)]
 I mean, it's not me you have to convince, it's Gary, Pierre, maybe
 Benjamin, Lluís, etc. So I could be wrong. But you might want to
 figure that out first before making plans based on this...


 But this is how np.ma currently does it, except that it doesn't compute
 the payload---it just calls it IGNORED.
 And it seems that this generalizes the way people want it to:

 >>> z = [2, 4] + [IGNORED(20), 3]
 >>> z
 [IGNORED(24), 7]
 >>> z.sum(skip_ignored=True)   # True could be the default
 7
 >>> z.sum(skip_ignored=False)
 IGNORED(31)

 I guess I am confused because it seems that you implicitly used this same
 rule here:

 Say we have
    a = np.array([1, IGNORED(2), 3])
    b = np.array([10, 20, 30])
 (Here I'm using IGNORED(2) to mean a value that is currently
 ignored, but if you unmasked it, it would have the value 2.)

 Then we have:

 # non-propagating **or** propagating, doesn't matter:

  >>> a + 2
 [3, IGNORED(2), 5]


 That is, element-wise, you had to have done:

 IGNORED(2) + 2 -> IGNORED(2).

 I said it should be equal to IGNORED(4), but the result is still some form
 of ignore.  Sorry if I am missing the bigger picture at this point; it's
 late and it's a Friday.



This scheme is actually somewhat intriguing.  Not totally convinced, but
intrigued.  Unfortunately, I fell behind in the postings by having
dinner... We probably should start a new thread soon with a bunch of this
stuff solidified and stated to give others a chance to hop back into the
game. Maybe a table of some sort with pros/cons (mathematically speaking,
deferring implementation details for later).

I swear, if we can get this to make sense... we should have a Nobel prize
or something.

Ben Root


Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-04 Thread Nathaniel Smith
On Thu, Nov 3, 2011 at 7:54 PM, Gary Strangman
str...@nmr.mgh.harvard.edu wrote:
 For the non-destructive+propagating case, do I understand correctly that
 this would mean I (as a user) could temporarily decide to IGNORE certain
 portions of my data, perform a series of computation on that data, and the
 IGNORED flag (or however it is implemented) would be propagated from
 computation to computation? If that's the case, I suspect I'd use it all
 the time ... to effectively perform data subsetting without generating
 (partial) copies of large datasets. But maybe I misunderstand the
 intended notion of propagation ...

I *think* it's more subtle than that, but I admit I'm somewhat
confused about how exactly people would want IGNORED to work in
various corner cases. (This is another part of why figuring out our
audience/use-cases seems like an important first step to me...
fortunately the semantics for MISSING are, I think, much more clear.)

Say we have
   a = np.array([1, IGNORED(2), 3])
   b = np.array([10, 20, 30])
(Here I'm using IGNORED(2) to mean a value that is currently
ignored, but if you unmasked it, it would have the value 2.)

Then we have:

# non-propagating *or* propagating, doesn't matter:
>>> a + 2
[3, IGNORED(2), 5]

# non-propagating:
>>> a + b
One of these, I don't know which:
  [11, IGNORED(2), 33]  # numpy.ma chooses this
  [11, 20, 33]
  Error: shape mismatch

(An error is maybe the most *consistent* option; the suggestion in the
alterNEP was that masks had to match on all axes that were *not*
broadcast, so a + 2 and a + a are okay, but a + b is an error. I
assume the numpy.ma approach is also useful, but note that it has the
surprising effect that addition is not commutative: IGNORED(x) +
IGNORED(y) = IGNORED(x). Try it:
   >>> masked1 = np.ma.masked_array([1, 2, 3], mask=[False, True, False])
   >>> masked2 = np.ma.masked_array([10, 20, 30], mask=[False, True, False])
   >>> np.asarray(masked1 + masked2)   # [11, 2, 33]
   >>> np.asarray(masked2 + masked1)   # [11, 20, 33]
I don't really know what people would prefer.)

# propagating:
>>> a + b
One of these, I don't know which:
  [11, IGNORED(2), 33] # same as numpy.ma, again
  [11, IGNORED(22), 33]

# non-propagating:
>>> np.sum(a)
4

# propagating:
>>> np.sum(a)
One of these, I don't know which:
  IGNORED(4)
  IGNORED(6)

So from your description, I wouldn't say that you necessarily want
non-destructive+propagating -- it really depends on exactly what
computations you want to perform, and how you expect them to work. The
main difference is how reduction operations are treated. I kind of
feel like the non-propagating version makes more sense overall, but I
don't know if there's any consensus on that.

(You also have the option of just using the new where= argument to
your ufuncs, which avoids some of this confusion because it gives a
single mask that would apply to the whole operation. The ambiguities
here arise because it's not clear what to do when applying a binary
operation to two arrays that have different masks.)
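
For concreteness, here is a minimal sketch of the where= idea, using the
ufunc where= kwarg spelled as it later landed in mainline numpy (the exact
API in Mark's branch may differ slightly):

    import numpy as np

    a = np.array([1.0, 2.0, 3.0])
    b = np.array([10.0, 20.0, 30.0])
    keep = np.array([True, False, True])   # False marks the "ignored" slot

    out = np.zeros_like(a)                 # unselected slots are left untouched
    np.add(a, b, out=out, where=keep)      # compute only where keep is True
    print(out)                             # [ 11.   0.  33.]

Note that out= must be preallocated: there is a single mask for the whole
operation, so none of the two-input ambiguities above arise.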

Maybe you could give some examples of the kinds of computations you're
thinking of?

-- Nathaniel


Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-04 Thread Benjamin Root
On Friday, November 4, 2011, Nathaniel Smith n...@pobox.com wrote:
 On Thu, Nov 3, 2011 at 7:54 PM, Gary Strangman
 str...@nmr.mgh.harvard.edu wrote:
 For the non-destructive+propagating case, do I understand correctly that
 this would mean I (as a user) could temporarily decide to IGNORE certain
 portions of my data, perform a series of computation on that data, and
the
 IGNORED flag (or however it is implemented) would be propagated from
 computation to computation? If that's the case, I suspect I'd use it all
 the time ... to effectively perform data subsetting without generating
 (partial) copies of large datasets. But maybe I misunderstand the
 intended notion of propagation ...

 I *think* it's more subtle than that, but I admit I'm somewhat
 confused about how exactly people would want IGNORED to work in
 various corner cases. (This is another part of why figuring out our
 audience/use-cases seems like an important first step to me...
 fortunately the semantics for MISSING are, I think, much more clear.)

 Say we have
   a = np.array([1, IGNORED(2), 3])
   b = np.array([10, 20, 30])
 (Here I'm using IGNORED(2) to mean a value that is currently
 ignored, but if you unmasked it, it would have the value 2.)

 Then we have:

 # non-propagating *or* propagating, doesn't matter:
 >>> a + 2
 [3, IGNORED(2), 5]

 # non-propagating:
 >>> a + b
 One of these, I don't know which:
  [11, IGNORED(2), 33]  # numpy.ma chooses this
  [11, 20, 33]
  Error: shape mismatch

 (An error is maybe the most *consistent* option; the suggestion in the
 alterNEP was that masks had to match on all axes that were *not*
 broadcast, so a + 2 and a + a are okay, but a + b is an error. I
 assume the numpy.ma approach is also useful, but note that it has the
 surprising effect that addition is not commutative: IGNORED(x) +
 IGNORED(y) = IGNORED(x). Try it:
    >>> masked1 = np.ma.masked_array([1, 2, 3], mask=[False, True, False])
    >>> masked2 = np.ma.masked_array([10, 20, 30], mask=[False, True, False])
    >>> np.asarray(masked1 + masked2)   # [11, 2, 33]
    >>> np.asarray(masked2 + masked1)   # [11, 20, 33]
 I don't really know what people would prefer.)

 # propagating:
 >>> a + b
 One of these, I don't know which:
  [11, IGNORED(2), 33] # same as numpy.ma, again
  [11, IGNORED(22), 33]

 # non-propagating:
 >>> np.sum(a)
 4

 # propagating:
 >>> np.sum(a)
 One of these, I don't know which:
  IGNORED(4)
  IGNORED(6)

 So from your description, I wouldn't say that you necessarily want
 non-destructive+propagating -- it really depends on exactly what
 computations you want to perform, and how you expect them to work. The
 main difference is how reduction operations are treated. I kind of
 feel like the non-propagating version makes more sense overall, but I
 don't know if there's any consensus on that.

I think this is further evidence for my idea that a mask should not be
undone, but is non-destructive.  If you want to be able to access the
values after masking, keep a view, or only apply the mask to a view.

Reduction ufuncs make a lot of sense because mathematics gives them a
well-defined result even when there are no values (the operation's identity
element). Reduction ufuncs are covered in great detail in Mark's NEP.
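
(For concreteness, that mathematical basis is already visible in plain
numpy's empty reductions:)

    import numpy as np

    print(np.sum([]))    # 0.0 -- the additive identity
    print(np.prod([]))   # 1.0 -- the multiplicative identity
    # np.min([]) raises instead: min has no identity element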

Ben Root


Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-04 Thread Gary Strangman



 non-destructive+propagating -- it really depends on exactly what
 computations you want to perform, and how you expect them to work. The
 main difference is how reduction operations are treated. I kind of
 feel like the non-propagating version makes more sense overall, but I
 don't know if there's any consensus on that.

I think this is further evidence for my idea that a mask should not be
undone, but is non-destructive.  If you want to be able to access the values
after masking, keep a view, or only apply the mask to a view.


OK, so my understanding of what's meant by "propagating" is probably
incomplete (and is definitely still fuzzy). I'm a little confused by the
phrase "a mask should not be undone", though. Say I want to perform a
statistical analysis or filtering procedure excluding and (separately)
including a handful of outliers? Isn't that a natural case for undoing a
mask? Or did you mean something else?


I think I understand the "use a view" option above, though I don't see how 
one could apply a mask only to a view. What if my view is every other row 
in a 2D array, and I want to mask the last half of this view? What is the 
state of the original array once the mask has been applied?
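
As a concrete illustration of why this question is tricky with numpy's
current views (NaN standing in for a destructive mask):

    import numpy as np

    base = np.arange(16.0).reshape(4, 4)
    view = base[::2]          # every other row; shares memory with base
    view[:, 2:] = np.nan      # destructively "mask" the last half of the view
    print(base[0])            # [  0.   1.  nan  nan] -- the write shows through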


(If this is derailing the progress of this thread, feel free to ignore 
it.)


-best
Gary




Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-04 Thread Lluís
Gary Strangman writes:

 For the non-destructive+propagating case, do I understand correctly that 
 this would mean I (as a user) could temporarily decide to IGNORE certain 
 portions of my data, perform a series of computation on that data, and the 
 IGNORED flag (or however it is implemented) would be propagated from 
 computation to computation? If that's the case, I suspect I'd use it all 
 the time ... to effectively perform data subsetting without generating 
 (partial) copies of large datasets. But maybe I misunderstand the 
 intended notion of propagation ...

I *think* you're right. I say "think" because to *me* IGNORE is the opposite
of propagate.

For example, you could temporarily (non-destructively) decide to assign a
propagating special value to array a. Then do a+=2; a*=5 and get something
like (which I think is the kind of use case you were talking about):

  # original values
  >>> a
  [1, 1, 1]

  # with special value
  >>> a
  [1, SPECIAL, 1]

  # computations
  >>> a += 2
  >>> a *= 5

  # result
  >>> a
  [15, SPECIAL, 15]

  # without special value
  >>> a
  [15, 1, 15]


So yes, separating both properties is not only a matter of elegance and
simplicity, but also has the practical impact of (as Ben said in another mail)
making the non-destructive property into a fancy form of indexing.
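
For what it's worth, plain boolean indexing can already emulate this
particular non-destructive use today; a minimal sketch (NaN is used only to
display the special slot):

    import numpy as np

    a = np.array([1.0, 1.0, 1.0])
    special = np.array([False, True, False])   # non-destructive: payload stays

    a[~special] += 2                           # compute only on ordinary slots
    a[~special] *= 5
    print(np.where(special, np.nan, a))        # [ 15.  nan  15.] -- special shown
    print(a)                                   # [ 15.   1.  15.] -- payload intact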


Lluis

-- 
 And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer.
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth


Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-04 Thread Gary Strangman


On Fri, 4 Nov 2011, Benjamin Root wrote:


On Friday, November 4, 2011, Gary Strangman str...@nmr.mgh.harvard.edu
wrote:

  non-destructive+propagating -- it really depends on exactly what
  computations you want to perform, and how you expect them to work. The
  main difference is how reduction operations are treated. I kind of
  feel like the non-propagating version makes more sense overall, but I
  don't know if there's any consensus on that.

  I think this is further evidence for my idea that a mask should not be
  undone, but is non-destructive.  If you want to be able to access the
  values after masking, keep a view, or only apply the mask to a view.

 OK, so my understanding of what's meant by propagating is probably
incomplete (and is definitely still fuzzy). I'm a little confused by the
phrase a mask should not be undone though. Say I want to perform a
statistical analysis or filtering procedure excluding and (separately)
including a handful of outliers? Isn't that a natural case for undoing a
mask? Or did you mean something else?

 I think I understand the use a view option above, though I don't see how
one could apply a mask only to a view. What if my view is every other row in
a 2D array, and I want to mask the last half of this view? What is the state
of the original array once the mask has been applied?

 (If this is derailing the progress of this thread, feel free to ignore
it.)

 -best
 Gary

Ufuncs can be broadly categorized as element-wise operations (binary ops like
+, *, etc., as well as regular functions that return an array whose shape
matches the inputs broadcast together) and reduction ops (sum, min, mean,
etc.).

For element-wise, things are a bit murky for IGNORE, and I defer to Mark's
NEP:
https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst#id17,
and it probably should be expanded and clarified in the NEP.

For reduction ops, propagation means that sum([3 5 NA 6]) == NA, just as if
you had a NaN in the array. Non-propagating (or skipping, or ignoring) would
have that operation produce 14.  A mean() for the propagating case would be
NA, but 14/3 (about 4.67) for the non-propagating case.
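
For concreteness, the NaN analogy is directly runnable (np.nanmean appeared
in numpy releases later than this thread, but the semantics are the ones
meant here):

    import numpy as np

    x = np.array([3.0, 5.0, np.nan, 6.0])   # NaN standing in for NA
    print(np.sum(x), np.mean(x))            # nan nan (propagating)
    print(np.nansum(x), np.nanmean(x))      # 14.0 4.666... (non-propagating)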

The part about undoing a mask addresses the issue that when an operation
produces a new array that has ignored elements in it, those elements were
never initialized with any value at all.  Therefore, unmasking those
elements and accessing their values makes no sense. This and more are covered
in this section of the NEP:
https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst#id11

For your stated case, I would have two views of the data (or at least the
original data and a view of it).  For the view, I would apply the mask to
hide the outliers from the filtering operation and produce a result.  The
first view (or the original array) sees the same data as it did before the
other view took on a mask, so you can perform the filtering operation on the
data and have two separate results. You can keep the masked view for
subsequent calculations, and/or keep the original view, and/or create new
views with new masks for other analyses, all while keeping the original data
intact.
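
For what it's worth, np.ma can already express this two-analyses workflow; a
small sketch (np.ma wraps a copy by default, which sidesteps the view
questions raised below):

    import numpy as np

    data = np.array([1.0, 2.0, 9.0, 8.0])          # 9.0 and 8.0 are outliers
    no_outliers = np.ma.masked_greater(data, 5.0)  # masked wrapper for one analysis

    print(no_outliers.mean())   # 1.5 -- analysis with outliers excluded
    print(data.mean())          # 5.0 -- original data untouched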

Note that I am right now speaking of views in a somewhat more abstract sense
that is only loosely tied to numpy's specific behavior with respect to views
right now.  As for ndarray.view() specifically, that is an implementation
detail that probably shouldn't be in this thread yet, so don't hang too much
on it.


Thanks Ben. That's quite helpful. And it also points to my worry (sorry, I 
already knew enough about views to be dangerous) ... your conceptual 
version of views is great, but I don't think numpy fully and reliably 
follows it (occasionally giving copies instead of views, for example, when 
a view is particularly difficult to generate). So I worry that your notion 
of views will actually collide with core numpy view implementations. But 
like you said, perhaps this thread shouldn't go there (yet).


Given I'm still fuzzy on all the distinctions, perhaps someone could try 
to help me (and others?) to define all /4/ logical possibilities ... some 
may be obvious dead-ends. I'll take a stab at them, but these should 
definitely get edited by others:


destructive + propagating = the data point is truly missing (satellite 
fell into the ocean; dog ate my source datasheet, or whatever), this is 
the nature of that data point, such missingness should be replicated in 
elementwise operations, and the missingness SHOULD interfere with 
reduction operations that involve that datapoint 
(np.sum([1,MISSING])=MISSING)


destructive + non-propagating = the data point is truly missing, this is 
the nature of that data point, such missingness should be replicated in 
elementwise operations, but such missingness should NOT interfere with 
reduction operations that involve that datapoint (np.sum([1,MISSING])=1)


non-destructive + propagating = I want to ignore this datapoint for now; 
element-wise operations should replicate this ignore designation, and 
missingness of this type SHOULD interfere with reduction operations that 
involve this datapoint (np.sum([1,IGNORE])=IGNORE)


non-destructive + non-propagating = I want to ignore this datapoint for now; 
element-wise operations should replicate this ignore designation, but 
missingness of this type SHOULD NOT interfere with reduction operations that 
involve this datapoint (np.sum([1,IGNORE])=1)

Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-04 Thread Lluís
Gary Strangman writes:
[...]
 Given I'm still fuzzy on all the distinctions, perhaps someone could try to 
 help
 me (and others?) to define all /4/ logical possibilities ... some may be 
 obvious
 dead-ends. I'll take a stab at them, but these should definitely get edited by
 others:

 destructive + propagating = the data point is truly missing (satellite fell 
 into
 the ocean; dog ate my source datasheet, or whatever), this is the nature of 
 that
 data point, such missingness should be replicated in elementwise operations, 
 and
 the missingness SHOULD interfere with reduction operations that involve that
 datapoint (np.sum([1,MISSING])=MISSING)

Right.


 destructive + non-propagating = the data point is truly missing, this is the
 nature of that data point, such missingness should be replicated in 
 elementwise
 operations, but such missingness should NOT interfere with reduction 
 operations
 that involve that datapoint (np.sum([1,MISSING])=1)

What do you define as element-wise operations?

Is a sum on an array an element-wise operation?

   >>> [1, MISSING]+2
  [1, MISSING]

Or is it just a form of reduction (after shape broadcasting)?

   >>> [1, MISSING]+2
  [3, 2]

For me it's the second, so the only time where special values propagate in a
non-propagating scenario is when you slice an array.


 non-destructive + propagating = I want to ignore this datapoint for now;
 element-wise operations should replicate this ignore designation, and
 missingness of this type SHOULD interfere with reduction operations that 
 involve
 this datapoint (np.sum([1,IGNORE])=IGNORE)

Right.


 non-destructive + non-propagating = I want to ignore this datapoint for now;
 element-wise operations should replicate this ignore designation, but
 missingness of this type SHOULD NOT interfere with reduction operations that
 involve this datapoint (np.sum([1,IGNORE])=1)

Same concerns as above.


Lluis

-- 
 And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer.
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth


Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-04 Thread Benjamin Root
On Fri, Nov 4, 2011 at 11:08 AM, Lluís xscr...@gmx.net wrote:

 Gary Strangman writes:
 [...]

  destructive + non-propagating = the data point is truly missing, this is
 the
  nature of that data point, such missingness should be replicated in
 elementwise
  operations, but such missingness should NOT interfere with reduction
 operations
  that involve that datapoint (np.sum([1,MISSING])=1)

 What do you define as element-wise operations?

 Is a sum on an array an element-wise operation?

   >>> [1, MISSING]+2
  [1, MISSING]


did you mean [3, MISSING]?


 Or is it just a form of reduction (after shape broadcasting)?

   >>> [1, MISSING]+2
  [3, 2]

 For me it's the second, so the only time where special values propagate
 in a
 non-propagating scenario is when you slice an array.


Propagation has a very specific meaning here, and I think it is causing
confusion elsewhere.  Propagation (to me) is the *exact* same behavior that
occurs with NaNs, but generalized to any dtype.  It seems like you are
taking "propagate" to mean whether the mask of the inputs follows on to the
mask of the output.  This is related, but is possibly a murkier concept and
should probably be cleaned up.

Ben Root


Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-04 Thread Gary Strangman

 destructive + propagating = the data point is truly missing (satellite fell 
 into
 the ocean; dog ate my source datasheet, or whatever), this is the nature of 
 that
 data point, such missingness should be replicated in elementwise operations, 
 and
 the missingness SHOULD interfere with reduction operations that involve that
 datapoint (np.sum([1,MISSING])=MISSING)

 Right.


 destructive + non-propagating = the data point is truly missing, this is the
 nature of that data point, such missingness should be replicated in 
 elementwise
 operations, but such missingness should NOT interfere with reduction 
 operations
 that involve that datapoint (np.sum([1,MISSING])=1)

 What do you define as element-wise operations?

 Is a sum on an array an element-wise operation?

   >>> [1, MISSING]+2
  [1, MISSING]

 Or is it just a form of reduction (after shape broadcasting)?

   >>> [1, MISSING]+2
  [3, 2]

 For me it's the second, so the only time where special values propagate in a
 non-propagating scenario is when you slice an array.

Let's say I want to re-scale a column (or remove the mean from a column). 
I wouldn't want that to change my missingness. Thus, I'm thinking:

>>> x = [1,2,MISSING]
>>> x*3
[3, 6, MISSING]
>>> x = [1,2,MISSING]
>>> x - x.mean()
[-0.5, 0.5, MISSING]
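
For what it's worth, np.ma already gives exactly these results for the
masked (IGNORE-style) analogue, which is easy to check:

    import numpy as np

    x = np.ma.masked_array([1.0, 2.0, 0.0], mask=[False, False, True])
    print(x * 3)           # [3.0 6.0 --]
    print(x - x.mean())    # mean over unmasked values is 1.5 -> [-0.5 0.5 --]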

To me it makes sense to have identical operations for the temporary IGNORE 
case below (versus the permanent MISSING case here). Note, the reason to 
independently have separate IGNORE and MISSING is so that I can (for 
example) temporarily IGNORE entire rows in my 2D array (which may have 
scattered MISSING elements), and when I undo the IGNORE operation the 
MISSING elements are still MISSING.

The question still remains what to do when performing operations like 
those above in IGNORE cases. Perform the operation underneath? Or not?

 non-destructive + propagating = I want to ignore this datapoint for now;
 element-wise operations should replicate this ignore designation, and
 missingness of this type SHOULD interfere with reduction operations that 
 involve
 this datapoint (np.sum([1,IGNORE])=IGNORE)

 Right.


 non-destructive + non-propagating = I want to ignore this datapoint for now;
 element-wise operations should replicate this ignore designation, but
 missingness of this type SHOULD NOT interfere with reduction operations that
 involve this datapoint (np.sum([1,IGNORE])=1)

 Same concerns as above.


 Lluis






Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-04 Thread Gary Strangman



On Fri, Nov 4, 2011 at 11:08 AM, Lluís xscr...@gmx.net wrote:
  Gary Strangman writes:
  [...]

   destructive + non-propagating = the data point is truly
  missing, this is the
   nature of that data point, such missingness should be
  replicated in elementwise
   operations, but such missingness should NOT interfere with
  reduction operations
   that involve that datapoint (np.sum([1,MISSING])=1)

What do you define as element-wise operations?

Is a sum on an array an element-wise operation?

  >>> [1, MISSING]+2
 [1, MISSING]


did you mean [3, MISSING]?

  Or is it just a form of reduction (after shape broadcasting)?

    >>> [1, MISSING]+2
   [3, 2]

  For me it's the second, so the only time where special values
  propagate in a
  non-propagating scenario is when you slice an array.


Propagation has a very specific meaning here, and I think it is causing
confusion elsewhere.  Propagation (to me) is the *exact* same behavior that
occurs with NaNs, but generalized to any dtype.  It seems like you are
taking propagate to mean whether the mask of the inputs follow on to the
mask of the output.  This is related, but is possibly a murkier concept and
should probably be cleaned up.


I think different people have different notions of propagation here. Yes, 
my notion was more related to input masks propagating to output masks. 
It's important to know you define it differently ... and I think the 
difference in (implicit) definitions is indeed causing confusion. At least 
it is for me. ;-)


-best
Gary




Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-04 Thread Lluís
Benjamin Root writes:

 On Fri, Nov 4, 2011 at 11:08 AM, Lluís xscr...@gmx.net wrote:

 Gary Strangman writes:
 [...]
   
 destructive + non-propagating = the data point is truly missing, this is the
 nature of that data point, such missingness should be replicated in 
 elementwise
 operations, but such missingness should NOT interfere with reduction 
 operations
 that involve that datapoint (np.sum([1,MISSING])=1)
   
 What do you define as element-wise operations?
   
 Is a sum on an array an element-wise operation?
   
   >>> [1, MISSING]+2
  [1, MISSING]

 did you mean [3, MISSING]?

Yes, sorry.


 Or is it just a form of reduction (after shape broadcasting)?

   >>> [1, MISSING]+2
  [3, 2]
   
 For me it's the second, so the only time where special values propagate 
 in a
 non-propagating scenario is when you slice an array.

 Propagation has a very specific meaning here, and I think it is causing 
 confusion elsewhere.  Propagation (to me) is the *exact* same behavior that 
 occurs
 with NaNs, but generalized to any dtype.  It seems like you are taking 
 propagate to mean whether the mask of the inputs follow on to the mask of 
 the
 output.  This is related, but is possibly a murkier concept and should 
 probably be cleaned up.

If you ignore the existence of a mask (as it is a specific mechanism for
handling the destructiveness, not the propagation), I think we both think of the
same concept of propagation:

High-level:

   x + SPECIAL

Propagating (SPECIAL = NaN-like = MISSING):

   x + SPECIAL = SPECIAL

Non-propagating (SPECIAL = ignore this element, similar to nansum = IGNORE):

   x + SPECIAL = x


Is there an agreement on this, or am I missing something else?


Lluis

-- 
 And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer.
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth


Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-04 Thread Pauli Virtanen
On 04.11.2011 17:31, Gary Strangman wrote:
[clip]
 The question does still remain what to do when performing operations like 
 those above in IGNORE cases. Perform the operation underneath? Or not?

I have a feeling that if you don't start by mathematically defining the
scalar operations first, and only after that generalize them to arrays,
some conceptual problems may follow.

On the other hand, I should note that numpy.ma does not work this way,
and many people seem still happy with how it works.

But if you go about defining scalars first, then as far as I can see, ufuncs
(e.g. binary operations) and assignment are what need to be defined. Since
the idea seems to be to use these as masks, let's assume that each special
value can also carry a payload.

***

There are two options for how to behave with respect to binary/unary
operations:

(P) Propagating

unop(SPECIAL_1) == SPECIAL_new
binop(SPECIAL_1, SPECIAL_2) == SPECIAL_new
binop(a, SPECIAL) == SPECIAL_new

(N) Non-propagating

unop(SPECIAL_1) == SPECIAL_new
binop(SPECIAL_1, SPECIAL_2) == SPECIAL_new
binop(a, SPECIAL) == binop(a, binop.identity) == a

***

And three options on what to do on assignment:

(d) Destructive

a := SPECIAL  # -> a == SPECIAL

(n) Non-destructive

a := SPECIAL  # -> a unchanged

(s) Self-destructive

a := SPECIAL_1
# -> if `a` is SPECIAL-class, then a == SPECIAL_1,
# otherwise `a` remains unchanged

***

Finally, there is the question of whether the value has a payload or not.

The payload complicates the scheme, as binary and unary operations need
to create new values. For singletons (eg. NaN) this is not a problem.
But if it's a non-singleton, desirable behavior would be to retain
commutativity (and other similar properties) of binary ops. I see two
sensible approaches for this: either raise an error, or do the
computation on the payload.

This brings in a third choice: (S) singleton; (E) payload, but raise an
error on operations between special values; and (C) payload, with the
computation done on the payloads.

***

For shorthand, we can refer to the above choices with the nomenclature

shorthand ::= propagation destructivity payload_type
propagation ::= P | N
destructivity ::= d | n | s
payload_type ::= S | E | C

That makes 2 * 3 * 3 = 18 different ways to construct consistent
behavior. Some of them might make sense; the problem is to find out which :)
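
For concreteness, a toy scalar-level sketch of the (P)/(N) choice, with a
hypothetical SPECIAL singleton (all names here are illustrative, not a
proposed API):

    import operator

    class Special:
        def __repr__(self):
            return "SPECIAL"

    SPECIAL = Special()

    def binop_P(op, a, b):
        # (P) propagating: any SPECIAL operand poisons the result
        if isinstance(a, Special) or isinstance(b, Special):
            return SPECIAL
        return op(a, b)

    def binop_N(op, a, b, identity):
        # (N) non-propagating: a lone SPECIAL acts as the op's identity
        if isinstance(a, Special) and isinstance(b, Special):
            return SPECIAL
        if isinstance(a, Special):
            return op(identity, b)
        if isinstance(b, Special):
            return op(a, identity)
        return op(a, b)

    print(binop_P(operator.add, 1, SPECIAL))     # SPECIAL
    print(binop_N(operator.add, 1, SPECIAL, 0))  # 1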

NAN and NA apparently fall into the PdS class.

If classified this way, behaviour of items in np.ma arrays is different
in different operations, but seems roughly PdX, where X stands for
returning a masked value with the first argument as the payload in
binary ops if either argument is masked. This makes inline binary ops
behave like Nn. Reductions are N. (Assignment: dC, reductions: N, binary
ops: PX, unary ops: PC, inline binary ops: Nn).

Finally, there's a can of worms on specifying the outcome of binary
operations on two special values of different kinds, but it's maybe best
to first choose one that behaves sensibly by itself.

Cheers,
Pauli



Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-03 Thread Nathaniel Smith
On Wed, Nov 2, 2011 at 8:20 PM, Benjamin Root ben.r...@ou.edu wrote:
 On Wednesday, November 2, 2011, Nathaniel Smith n...@pobox.com wrote:
 By R compatibility, I specifically had in mind in-memory
 compatibility. rpy2 provides a more-or-less seamless within-process
 interface between R and Python (and specifically lets you get numpy
 views on arrays returned by R functions), so if we can make this work
 for R arrays containing NA too then that'd be handy. (The rpy2 author
 requested this in the last discussion here:
 http://mail.scipy.org/pipermail/numpy-discussion/2011-June/057084.html)
 When it comes to disk formats, then this doesn't matter so much, since
 IO routines have to translate between different representations all
 the time anyway.

 Interesting, but I still have to wonder if that should be on the wishlist
 for MISSING.  I guess it would matter by knowing whether people would be
 fully converting from R or gradually transitioning from it?  That is
 something that I can't answer.

Well, I'm one of the people who would use it, so yeah :-). I've been
trying to standardize my code on Python for a while now, but there's a
ton of statistical tools that are only really available through R, and
that will remain true for a while yet. So I use rpy2 when I have to.

 I take the replacement of my line about MISSING disallowing unmasking
 and your line about MISSING assignment being destructive as basically
 expressing the same idea. Is that fair, or did you mean something
 else?

 I am someone who wants to get to the absolute core of ideas. Also, this
 expression cleanly delineates the differences as binary.

 By expressing it this way, we also shy away from implementation details. For
 example, Unmasking can be programmatically prevented for MISSING while it
 could be implemented by other indirect means for IGNORE. Not that those are
 the preferred ways, only that the phrasing is more flexible and exacting.


 Finally, do you think that people who want IGNORED support care about
 having a convenient API for masking/unmasking values? You removed that
 line, but I don't know if that was because you disagreed with it, or
 were just trying to simplify.

 See previous.

I like getting to the core of things too, but unless there's actual
disagreement, I think even less-central points are still worth
noting :-). I've tried editing things a bit to make the
compare/contrast clearer based on your comments, and put it up here:
   https://github.com/njsmith/numpy/wiki/NA-discussion-status

Maybe it would be better to split each list into "core ideas" versus
"extra niceties" or something? I'm not sure.

-- Nathaniel


Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-03 Thread Nathaniel Smith
I also mentioned this at the bottom of a reply to Benjamin, but to
make sure people joining the thread see it: I went ahead and put this
up on a github wiki page that everyone should be able to edit
  https://github.com/njsmith/numpy/wiki/NA-discussion-status

We could move it to the numpy wiki or whatever if people prefer, this
just seemed like the easiest way to get something up there that
everyone would have write access to.

-- Nathaniel

On Wed, Nov 2, 2011 at 4:37 PM, Nathaniel Smith n...@pobox.com wrote:
 Hi again,

 Okay, here's my attempt at an *uncontroversial* email!

 Specifically, I think it'll be easier to talk about this NA stuff if
 we can establish some common ground, and easier for people to follow
 if the basic points of agreement are laid out in one place. So I'm
 going to try and summarize just the things that we can agree about.

 Note that right now I'm *only* talking about what kind of tools we
 want to give the user -- i.e., what kind of problems we are trying to
 solve. AFAICT we don't have as much consensus on implementation
 matters, and anyway it's hard to make implementation decisions without
 knowing what we're trying to accomplish.

 1) I think we have consensus that there are (at least) two different
 possible ways of thinking about this problem, with somewhat different
 constituencies. Let's call these two concepts MISSING data and
 IGNORED data.

 2) I also think we have at least a rough consensus on what these
 concepts mean, and what their supporters want from them:

 MISSING data:
 - Conceptually, MISSINGness acts like a property of a datum --
 assigning MISSING to a location is like assigning any other value to
 that location
 - Ufuncs and other operations must propagate these values by default,
 and there must be an option to cause them to be ignored
 - Must be competitive with NaNs in terms of speed and memory usage (or
 else people will just use NaNs)
 - Compatibility with R is valuable
 - To avoid user confusion, ideally it should *not* be possible to
 'unmask' a missing value, since this is inconsistent with the missing
 value metaphor (e.g., see Wes's comment about leaky abstractions)
 - Possible useful extension: having different classes of missing
 values (similar to Stata)
 - Target audience: data analysis with missing data, neuroimaging,
 econometrics, former R users, ...

 IGNORED data:
 - Conceptually, IGNOREDness acts like a property of the array --
 toggling a location to be IGNORED is kind of vaguely similar to
 changing an array's shape
 - Ufuncs and other operations must ignore these values by default, and
 there doesn't really need to be a way to propagate them, even as an
 option (though it probably wouldn't hurt either)
 - Some memory overhead is inevitable and acceptable
 - Compatibility with R neither possible nor valuable
 - Ability to toggle the IGNORED state of a location is critical, and
 should be as convenient as possible
 - Possible useful extension: having not just different types of
 ignored values, but richer ways to combine them -- e.g., the example
 of combining astronomical images with some kind of associated
 per-pixel quality scores, where one might want the 'mask' to be not
 just a boolean IGNORED/not-IGNORED flag, but an integer (perhaps a
 multi-byte integer) or even a float, and to allow these 'masks' to be
 combined in some more complex way than just logical_and.
 - Target audience: anyone who's already doing this kind of thing by
 hand using a second mask array + boolean indexing, former numpy.ma
 users, matplotlib, ...

 3) And perhaps we can all agree that the biggest *un*resolved question
 is whether we want to:
 - emphasize the similarities between these two use cases and build a
 single interface that can handle both concepts, with some compromises
 - or, treat these as two mostly-separate features that can each become
 exactly what the respective constituency wants without compromise --
 but with some potential redundancy and extra code.
 Each approach has advantages and disadvantages.

 Does that seem like a fair summary? Anything more we can add? Most
 importantly, anything here that you disagree with? Did I summarize
 your needs well? Do you have a use case that you feel doesn't fit
 naturally into either category?

 [Also, I thought this might make the start of a good wiki page for
 people to reference during these discussions, but I don't seem to have
 edit rights. If other people agree, maybe someone could put it up, or
 give me access? My trac id is n...@pobox.com.]

 Thanks,
 -- Nathaniel



Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-03 Thread Laurent Gautier
On 2011-11-03 04:22, numpy-discussion-requ...@scipy.org wrote:

 Message: 1
 Date: Wed, 2 Nov 2011 22:20:15 -0500
 From: Benjamin Root ben.r...@ou.edu
 Subject: Re: [Numpy-discussion] in the NA discussion, what can we
   agree on?

 On Wednesday, November 2, 2011, Nathaniel Smith n...@pobox.com wrote:
 Hi Benjamin,

 On Wed, Nov 2, 2011 at 5:25 PM, Benjamin Root ben.r...@ou.edu wrote:
 I want to pare this down even more.  I think the above lists make too
 many unneeded extrapolations.
 Okay. I found your formatting a little confusing, so I want to make
 sure I understood the changes you're suggesting:

 For the description of what MISSING means, you removed the lines:
 - Compatibility with R is valuable
 - To avoid user confusion, ideally it should *not* be possible to
 'unmask' a missing value, since this is inconsistent with the missing
 value metaphor (e.g., see Wes's comment about leaky abstractions)

 And you added the line:
 + Assigning MISSING is destructive

 And for the description of what IGNORED means, you removed the lines:
 - Some memory overhead is inevitable and acceptable
 - Compatibility with R neither possible nor valuable
 - Ability to toggle the IGNORED state of a location is critical, and
 should be as convenient as possible

 And you added the lines:
 + IGNORE is non-destructive
 + Must be competitive with np.ma for speed and memory (or else users
 would just use np.ma)

 Is that right?
 Correct.

 Assuming it is, my thoughts are:

 By R compatibility, I specifically had in mind in-memory
 compatibility. rpy2 provides a more-or-less seamless within-process
 interface between R and Python (and specifically lets you get numpy
 views on arrays returned by R functions), so if we can make this work
 for R arrays containing NA too then that'd be handy. (The rpy2 author
 requested this in the last discussion here:
 http://mail.scipy.org/pipermail/numpy-discussion/2011-June/057084.html)
 When it comes to disk formats, then this doesn't matter so much, since
 IO routines have to translate between different representations all
 the time anyway.

 Interesting, but I still have to wonder if that should be on the wishlist
 for MISSING.  I guess it would matter by knowing whether people would be
 fully converting from R or gradually transitioning from it?  That is
 something that I can't answer.

I probably do not have all possible use-cases in mind, but what I'd think 
of as the most common is: use R functionality straight from Python. Say 
that you are doing your work in Python and read about some statistical 
method for which an implementation exists in R (but not in Python/numpy). 
You can just pass your numpy arrays or vectors to the relevant R 
function(s) and retrieve the results in a form directly usable by numpy 
(without having the data copied around). Should performance become an 
issue, and that method be of crucial importance, you will probably want to 
reimplement it (in C or Cython, for example). Otherwise you can draw on 
R's phenomenal toolbox without much effort and keep those calls to R as 
part of your code.

In my experience, the latter would be the most frequent.

Get some compatibility for the NA magic values, and that possible coupling 
between R and numpy becomes even better, since it prevents one side or the 
other from misreading them as non-NA values.



 I take the replacement of my line about MISSING disallowing unmasking
 and your line about MISSING assignment being destructive as basically
 expressing the same idea. Is that fair, or did you mean something
 else?
 I am someone who wants to get to the absolute core of ideas. Also, this
 expression cleanly delineates the differences as binary.

 By expressing it this way, we also shy away from implementation details.
 For example, Unmasking can be programmatically prevented for MISSING while
 it could be implemented by other indirect means for IGNORE. Not that those
 are the preferred ways, only that the phrasing is more flexible and
 exacting.

 Finally, do you think that people who want IGNORED support care about
 having a convenient API for masking/unmasking values? You removed that
 line, but I don't know if that was because you disagreed with it, or
 were just trying to simplify.
 See previous.

 Then, as a third-party module developer, I can tell you that having
 separate and independent ways to detect MISSING/IGNORED would likely
 make support more difficult and would greatly benefit from a common
 (or easily combinable) method of identification.
 Right, sorry... I didn't forget, and that's part of what I was
 thinking when I described the second approach as keeping them as
 *mostly*-separate interfaces... but I should have made it more
 explicit! Anyway, yes:

 4) There is consensus that whatever approach is taken, there should be
 a quick and convenient way to identify values that are MISSING,
 IGNORED, or both. (E.g., functions is_MISSING, is_IGNORED,
 is_MISSING_or_IGNORED, or some equivalent.)

Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-03 Thread Lluís
Nathaniel Smith writes:

 4) There is consensus that whatever approach is taken, there should be
 a quick and convenient way to identify values that are MISSING,
 IGNORED, or both. (E.g., functions is_MISSING, is_IGNORED,
 is_MISSING_or_IGNORED, or some equivalent.)

Well, maybe it's too low level, but I'd rather decouple the two concepts into
two orthogonal properties that can be composed:

* Destructiveness: whether the previous data value is lost whenever you assign a
  special value.

* Propagation: whether any of these special values is propagated or just
  skipped when performing computations.

I think we can all agree on the definition of these two properties (where
bit-patterns are destructive and masks are non-destructive), so I'd say that the
first discussion is establishing whether to expose them as separate properties
or just expose specific combinations of them:

* MISSING: destructive + propagating
* IGNORED: non-destructive + non-propagating

For example, it makes sense to me to have non-destructive + propagating.
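
The orthogonality is easy to state in code; a toy sketch with hypothetical
names (modern dataclass syntax used purely for illustration):

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class SpecialKind:
        destructive: bool    # is the old payload lost on assignment?
        propagating: bool    # does the value poison computations?

    MISSING = SpecialKind(destructive=True,  propagating=True)
    IGNORED = SpecialKind(destructive=False, propagating=False)
    # ... and the extra combination mentioned above:
    SHELVED = SpecialKind(destructive=False, propagating=True)  # hypothetical name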


If we take this road, then the next points to discuss should probably be how
these combinations are expressed:

* At the array level: all special values behave the same in a specific array,
  given its properties (e.g., all of them are destructive+propagating).

* At the value level: each special value conveys a specific combination of the
  aforementioned properties (e.g., assigning A is destructive+propagating and
  assigning B is non-destructive+non-propagating).

* Hybrid: e.g., all special values are destructive, but propagation depends on
  the specific special value.

I think this last decision is crucial, as it will have a direct impact on
performance, numpy code maintainability and 3rd party interface simplicity.


Lluis

-- 
 And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer.
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth


Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-03 Thread Benjamin Root
On Thu, Nov 3, 2011 at 9:28 AM, Lluís xscr...@gmx.net wrote:

 Nathaniel Smith writes:

  4) There is consensus that whatever approach is taken, there should be
  a quick and convenient way to identify values that are MISSING,
  IGNORED, or both. (E.g., functions is_MISSING, is_IGNORED,
  is_MISSING_or_IGNORED, or some equivalent.)

 Well, maybe it's too low level, but I'd rather decouple the two concepts
 into
 two orthogonal properties that can be composed:

 * Destructiveness: whether the previous data value is lost whenever you
 assign a
  special value.

 * Propagation: whether any of these special values is propagated or just
  skipped when performing computations.

 I think we can all agree on the definition of these two properties (where
 bit-patterns are destructive and masks are non-destructive), so I'd say
 that the
 first discussion is establishing whether to expose them as separate
 properties
 or just expose specific combinations of them:

 * MISSING: destructive + propagating
 * IGNORED: non-destructive + non-propagating

 For example, it makes sense to me to have non-destructive + propagating.


This is sort of how it is currently implemented.  By default, NA
propagates, but it is possible to override these defaults on an
operation-by-operation basis using the skipna kwarg, and a subclassed array
could implement a __ufunc_wrap__() to default the skipna kwarg to True.
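
For concreteness, a sketch of that API as described in the NEP (this assumes
the NA-enabled development build; no released numpy has it):

    import numpy as np

    a = np.array([1.0, 2.0, np.NA], maskna=True)
    print(np.sum(a))                # NA  -- propagates by default
    print(np.sum(a, skipna=True))   # 3.0 -- NA ignored for this one reduction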



 If we take this road, then the next points to discuss should probably be
 how
 these combinations are expressed:

 * At the array level: all special values behave the same in a specific
 array,
  given its properties (e.g., all of them are destructive+propagating).

 * At the value level: each special value conveys a specific combination of
 the
  aforementioned properties (e.g., assigning A is destructive+propagating
 and
  assigning B is non-destructive+non-propagating).

 * Hybrid: e.g., all special values are destructive, but propagation
 depends on
  the specific special value.

 I think this last decision is crucial, as it will have a direct impact on
 performance, numpy code maintainability and 3rd party interface simplicity.


This is actually a very good point, and bears directly on the types of
implementations that can be done.  Currently, Mark's implementation takes
the first approach; the others are not possible with the current design.

Ben Root


Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-03 Thread Chris.Barker
On 11/2/11 7:16 PM, Nathaniel Smith wrote:
 By R compatibility, I specifically had in mind in-memory
 compatibility.

The R crowd has had a big voice in this discussion, and I understand 
that there are some nice lessons to be learned from it with regard to 
the NA issues.

However, I think making R compatibility a priority is a mistake -- numpy 
is numpy, it is NOT, nor should it be, an emulation of anything else. NA 
functionality is useful to virtually everyone -- not just folks doing 
R-like stuff, and even less so folks directly working with R.

 rpy2 provides a more-or-less seamless within-process
 interface between R and Python

Perhaps rpy2 will need to do some translating -- so be it, better than 
crippling numpy for other uses.

That being said, if the R binary format is a good one for numpy, no harm 
in using it, but I think that should be a secondary, at best, concern.

So should emulating the R API.

-Chris



-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov


Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-03 Thread Nathaniel Smith
Hi Chris,

On Thu, Nov 3, 2011 at 9:45 AM, Chris.Barker chris.bar...@noaa.gov wrote:
 On 11/2/11 7:16 PM, Nathaniel Smith wrote:
 By R compatibility, I specifically had in mind in-memory
 compatibility.

 The R crowd has had a big voice in this discussion, and I understand
 that there are some nice lessons to be learned from it with regard to
 the NA issues.

 However, I think making R compatibility a priority is a mistake -- numpy
 is numpy, it is NOT, nor should it be, an emulation of anything else. NA
 functionality is useful to virtually everyone -- not just folks doing
 R-like stuff, and even less so folks directly working with R.

I think we agree, actually. What I currently have written on the wiki
page is "In-memory compatibility with R would be handy", which is
intended to convey that all else being equal this is a desirable
feature, but that it's not worth crippling numpy (as you put it) to
get. Do you have a suggestion about how I could make this clearer? or
am I misunderstanding your point?

-- Nathaniel


Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-03 Thread Nathaniel Smith
Hi Lluís,

On Thu, Nov 3, 2011 at 7:28 AM, Lluís xscr...@gmx.net wrote:
 Well, maybe it's too low level, but I'd rather decouple the two concepts into
 two orthogonal properties that can be composed:

 * Destructiveness: whether the previous data value is lost whenever you 
 assign a
  special value.

 * Propagation: whether any of these special values is propagated or just
  skipped when performing computations.

 I think we can all agree on the definition of these two properties (where
 bit-patterns are destructive and masks are non-destructive), so I'd say that 
 the
 first discussion is establishing whether to expose them as separate properties
 or just expose specific combinations of them:

 * MISSING: destructive + propagating
 * IGNORED: non-destructive + non-propagating

Thanks, that's an interesting idea that I'd forgotten about. I added a
link to your message to the proposals section, and to the list of
proposed solutions in point (3).

I'm tempted to respond in more depth, but I'm worried that if we start
digging into specific proposals like this right now then we'll start
going in circles again -- that's why I'm trying to establish some
common ground on what our goals are, so we have more of a basis for
comparing different ideas.

So obviously your suggestion of breaking things down into
finer-grained orthogonal features has merits in terms of simplicity,
elegance, etc. Before we get into those, though, I want to ask: do you
feel that the extra ability to have values that are
destructive+non-propagating and non-destructive+propagating is a major
*practical* benefit, or are you more motivated by the simplicity and
elegance, and the extra flexibility is just something kind of cool
that we'd get for free?

Or put another way, do you think that the MISSING and IGNORED concepts
are adequate to cover practical use cases, or do you have an example
where what's really wanted is say non-destructive + propagating? I can
see how it would work, but I don't think I'd ever use it, so I'm
curious...

-- Nathaniel


Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-03 Thread Gary Strangman

For the non-destructive+propagating case, do I understand correctly that 
this would mean I (as a user) could temporarily decide to IGNORE certain 
portions of my data, perform a series of computation on that data, and the 
IGNORED flag (or however it is implemented) would be propagated from 
computation to computation? If that's the case, I suspect I'd use it all 
the time ... to effectively perform data subsetting without generating 
(partial) copies of large datasets. But maybe I misunderstand the 
intended notion of propagation ...

Gary


 Or put another way, do you think that the MISSING and IGNORED concepts
 are adequate to cover practical use cases, or do you have an example
 where what's really wanted is say non-destructive + propagating? I can
 see how it would work, but I don't think I'd ever use it, so I'm
 curious...

 -- Nathaniel
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion







Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-03 Thread Benjamin Root
On Thursday, November 3, 2011, Gary Strangman str...@nmr.mgh.harvard.edu
wrote:

 For the non-destructive+propagating case, do I understand correctly that
 this would mean I (as a user) could temporarily decide to IGNORE certain
 portions of my data, perform a series of computation on that data, and the
 IGNORED flag (or however it is implemented) would be propagated from
 computation to computation? If that's the case, I suspect I'd use it all
 the time ... to effectively perform data subsetting without generating
 (partial) copies of large datasets. But maybe I misunderstand the
 intended notion of propagation ...

 Gary


Propagating is default NaN-like behavior when performing a sum with at
least one NaN in it. Ignoring is like using nansum on that same array.

Masking, one can think of it as very fancy indexing, but the shape and
structure of the data is maintained.
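
A small example of the distinction: plain fancy indexing drops elements and
changes the shape, while a mask keeps the structure:

    import numpy as np

    data = np.array([3.0, 5.0, 2.0, 6.0])
    bad = np.array([False, False, True, False])   # element 2 is to be ignored

    print(data[~bad])                             # [ 3.  5.  6.] -- shape changed
    masked = np.ma.masked_array(data, mask=bad)
    print(masked)                                 # [3.0 5.0 -- 6.0] -- shape kept
    print(masked.sum())                           # 14.0, like nansum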

Ben Root


Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-02 Thread Benjamin Root
On Wed, Nov 2, 2011 at 6:37 PM, Nathaniel Smith n...@pobox.com wrote:

 Hi again,

 Okay, here's my attempt at an *uncontroversial* email!

 Specifically, I think it'll be easier to talk about this NA stuff if
 we can establish some common ground, and easier for people to follow
 if the basic points of agreement are laid out in one place. So I'm
 going to try and summarize just the things that we can agree about.

 Note that right now I'm *only* talking about what kind of tools we
 want to give the user -- i.e., what kind of problems we are trying to
 solve. AFAICT we don't have as much consensus on implementation
 matters, and anyway it's hard to make implementation decisions without
 knowing what we're trying to accomplish.

 1) I think we have consensus that there are (at least) two different
 possible ways of thinking about this problem, with somewhat different
 constituencies. Let's call these two concepts MISSING data and
 IGNORED data.

 2) I also think we have at least a rough consensus on what these
 concepts mean, and what their supporters want from them:

 MISSING data:
 - Conceptually, MISSINGness acts like a property of a datum --
 assigning MISSING to a location is like assigning any other value to
 that location
 - Ufuncs and other operations must propagate these values by default,
 and there must be an option to cause them to be ignored
 - Must be competitive with NaNs in terms of speed and memory usage (or
 else people will just use NaNs)
 - Compatibility with R is valuable
 - To avoid user confusion, ideally it should *not* be possible to
 'unmask' a missing value, since this is inconsistent with the missing
 value metaphor (e.g., see Wes's comment about leaky abstractions)
 - Possible useful extension: having different classes of missing
 values (similar to Stata)
 - Target audience: data analysis with missing data, neuroimaging,
 econometrics, former R users, ...

 IGNORED data:
 - Conceptually, IGNOREDness acts like a property of the array --
 toggling a location to be IGNORED is kind of vaguely similar to
 changing an array's shape
 - Ufuncs and other operations must ignore these values by default, and
 there doesn't really need to be a way to propagate them, even as an
 option (though it probably wouldn't hurt either)
 - Some memory overhead is inevitable and acceptable
 - Compatibility with R neither possible nor valuable
 - Ability to toggle the IGNORED state of a location is critical, and
 should be as convenient as possible
 - Possible useful extension: having not just different types of
 ignored values, but richer ways to combine them -- e.g., the example
 of combining astronomical images with some kind of associated
 per-pixel quality scores, where one might want the 'mask' to be not
 just a boolean IGNORED/not-IGNORED flag, but an integer (perhaps a
 multi-byte integer) or even a float, and to allow these 'masks' to be
 combined in some more complex way than just logical_and.
 - Target audience: anyone who's already doing this kind of thing by
 hand using a second mask array + boolean indexing, former numpy.ma
 users, matplotlib, ...

 3) And perhaps we can all agree that the biggest *un*resolved question
 is whether we want to:
 - emphasize the similarities between these two use cases and build a
 single interface that can handle both concepts, with some compromises
 - or, treat these at two mostly-separate features that can each become
 exactly what the respective constituency wants without compromise --
 but with some potential redundancy and extra code.
 Each approach has advantages and disadvantages.

 Does that seem like a fair summary? Anything more we can add? Most
 importantly, anything here that you disagree with? Did I summarize
 your needs well? Do you have a use case that you feel doesn't fit
 naturally into either category?

 [Also, I thought this might make the start of a good wiki page for
 people to reference during these discussions, but I don't seem to have
 edit rights. If other people agree, maybe someone could put it up, or
 give me access? My trac id is n...@pobox.com.]

 Thanks,
 -- Nathaniel


I want to pare this down even more.  I think the above lists make too many
unneeded extrapolations.

MISSING data:
- Conceptually, MISSINGness acts like a property of a datum --
assigning MISSING to a location is like assigning any other value to
that location
- Ufuncs and other operations must propagate these values by default,
and there must be an option to cause them to be ignored
- Assigning MISSING is destructive (see the sketch below)
- Must be competitive with NaNs in terms of speed and memory usage (or
else people will just use NaNs)
- Target audience: data analysis with missing data, neuroimaging,
econometrics, former R users, ...


- Possible useful extension: having different classes of missing
values (similar to Stata)
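
To pin down the destructive/non-destructive split with tools that exist
today -- NaN standing in for MISSING, numpy.ma standing in for IGNORED; an
analogy only, not the proposed semantics:

import numpy as np
import numpy.ma as ma

# MISSING modeled as NaN: assignment is destructive, the old value is gone
a = np.array([1.0, 2.0, 3.0])
a[1] = np.nan      # nothing can recover the original 2.0 from `a`

# IGNORED modeled as a mask: assignment is non-destructive
b = ma.masked_array([1.0, 2.0, 3.0])
b[1] = ma.masked   # hide the value
print(b.sum())     # 4.0 -- the hidden 2.0 is ignored
b.mask[1] = False  # toggle it back
print(b[1])        # 2.0 -- the payload was retained all along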


IGNORED data:
- Conceptually, IGNOREDness acts like a property of the array --
toggling a location to be IGNORED is kind of vaguely similar to
changing an array's shape

Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-02 Thread Nathaniel Smith
Hi Benjamin,

On Wed, Nov 2, 2011 at 5:25 PM, Benjamin Root ben.r...@ou.edu wrote:
 I want to pare this down even more.  I think the above lists make too many
 unneeded extrapolations.

Okay. I found your formatting a little confusing, so I want to make
sure I understood the changes you're suggesting:

For the description of what MISSING means, you removed the lines:
- Compatibility with R is valuable
- To avoid user confusion, ideally it should *not* be possible to
'unmask' a missing value, since this is inconsistent with the missing
value metaphor (e.g., see Wes's comment about leaky abstractions)

And you added the line:
+ Assigning MISSING is destructive

And for the description of what IGNORED means, you removed the lines:
- Some memory overhead is inevitable and acceptable
- Compatibility with R neither possible nor valuable
- Ability to toggle the IGNORED state of a location is critical, and
should be as convenient as possible

And you added the lines:
+ IGNORE is non-destructive
+ Must be competitive with np.ma for speed and memory (or else users
would just use np.ma)

Is that right?

Assuming it is, my thoughts are:

By R compatibility, I specifically had in mind in-memory
compatibility. rpy2 provides a more-or-less seamless within-process
interface between R and Python (and specifically lets you get numpy
views on arrays returned by R functions), so if we can make this work
for R arrays containing NA too then that'd be handy. (The rpy2 author
requested this in the last discussion here:
http://mail.scipy.org/pipermail/numpy-discussion/2011-June/057084.html)
When it comes to disk formats, this doesn't matter so much, since
IO routines have to translate between different representations all
the time anyway.
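
(For the curious: IIRC, R encodes NA_real_ as one specific NaN bit pattern
-- exponent all ones, 1954 in the low mantissa word -- so in-memory
compatibility mostly means recognizing exactly that pattern. A rough
sketch, assuming I have the pattern right:)

import numpy as np

R_NA_BITS = np.uint64(0x7FF00000000007A2)  # assumed NA_real_ pattern; 0x7A2 == 1954

a = np.array([1.0, 2.0, 3.0])
a.view(np.uint64)[1] = R_NA_BITS  # what an R vector holding an NA looks like in memory

print(np.isnan(a))                     # [False  True False] -- NA is a NaN to numpy...
print(a.view(np.uint64) == R_NA_BITS)  # [False  True False] -- ...but a recognizable one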

I take the replacement of my line about MISSING disallowing unmasking
and your line about MISSING assignment being destructive as basically
expressing the same idea. Is that fair, or did you mean something
else?

Finally, do you think that people who want IGNORED support care about
having a convenient API for masking/unmasking values? You removed that
line, but I don't know if that was because you disagreed with it, or
were just trying to simplify.

 Then, as a third-party module developer, I can tell you that having separate
 and independent ways to detect MISSING/IGNORED would likely make support
 more difficult; support would greatly benefit from a common (or easily
 combinable) method of identification.

Right, sorry... I didn't forget, and that's part of what I was
thinking when I described the second approach as keeping them as
*mostly*-separate interfaces... but I should have made it more
explicit! Anyway, yes:

4) There is consensus that whatever approach is taken, there should be
a quick and convenient way to identify values that are MISSING,
IGNORED, or both. (E.g., functions is_MISSING, is_IGNORED,
is_MISSING_or_IGNORED, or some equivalent.)
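
Just to make point 4 concrete, a sketch of what such predicates could look
like on top of a NaN-plus-mask model (all names and representation choices
here are placeholders, not a proposal):

import numpy as np
import numpy.ma as ma

def is_missing(a):
    # placeholder: MISSING modeled as NaN in the underlying data
    return np.isnan(ma.getdata(a))

def is_ignored(a):
    # placeholder: IGNORED modeled as the mask of a masked array
    return ma.getmaskarray(a)

def is_missing_or_ignored(a):
    return is_missing(a) | is_ignored(a)

x = ma.masked_array([1.0, np.nan, 3.0], mask=[False, False, True])
print(is_missing(x))             # [False  True False]
print(is_ignored(x))             # [False False  True]
print(is_missing_or_ignored(x))  # [False  True  True]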

-- Nathaniel


Re: [Numpy-discussion] in the NA discussion, what can we agree on?

2011-11-02 Thread Benjamin Root
On Wednesday, November 2, 2011, Nathaniel Smith n...@pobox.com wrote:
 Hi Benjamin,

 On Wed, Nov 2, 2011 at 5:25 PM, Benjamin Root ben.r...@ou.edu wrote:
 I want to pare this down even more.  I think the above lists make too many
 unneeded extrapolations.

 Okay. I found your formatting a little confusing, so I want to make
 sure I understood the changes you're suggesting:

 For the description of what MISSING means, you removed the lines:
 - Compatibility with R is valuable
 - To avoid user confusion, ideally it should *not* be possible to
 'unmask' a missing value, since this is inconsistent with the missing
 value metaphor (e.g., see Wes's comment about leaky abstractions)

 And you added the line:
 + Assigning MISSING is destructive

 And for the description of what IGNORED means, you removed the lines:
 - Some memory overhead is inevitable and acceptable
 - Compatibility with R neither possible nor valuable
 - Ability to toggle the IGNORED state of a location is critical, and
 should be as convenient as possible

 And you added the lines:
 + IGNORE is non-destructive
 + Must be competitive with np.ma for speed and memory (or else users
 would just use np.ma)

 Is that right?

Correct.


 Assuming it is, my thoughts are:

 By R compatibility, I specifically had in mind in-memory
 compatibility. rpy2 provides a more-or-less seamless within-process
 interface between R and Python (and specifically lets you get numpy
 views on arrays returned by R functions), so if we can make this work
 for R arrays containing NA too then that'd be handy. (The rpy2 author
 requested this in the last discussion here:
 http://mail.scipy.org/pipermail/numpy-discussion/2011-June/057084.html)
 When it comes to disk formats, then this doesn't matter so much, since
 IO routines have to translate between different representations all
 the time anyway.


Interesting, but I still have to wonder if that should be on the wishlist
for MISSING.  I guess it depends on whether people would be fully
converting from R or gradually transitioning from it?  That is something
that I can't answer.

 I take the replacement of my line about MISSING disallowing unmasking
 and your line about MISSING assignment being destructive as basically
 expressing the same idea. Is that fair, or did you mean something
 else?

I am someone who wants to get to the absolute core of ideas. Also, this
expression cleanly delineates the differences as binary.

By expressing it this way, we also shy away from implementation details.
For example, unmasking can be programmatically prevented for MISSING while
it could be implemented by other indirect means for IGNORE. Not that those
are the preferred ways, only that the phrasing is more flexible and
exacting.


 Finally, do you think that people who want IGNORED support care about
 having a convenient API for masking/unmasking values? You removed that
 line, but I don't know if that was because you disagreed with it, or
 were just trying to simplify.

See previous.


 Then, as a third-party module developer, I can tell you that having
 separate and independent ways to detect MISSING/IGNORED would likely
 make support more difficult; support would greatly benefit from a common
 (or easily combinable) method of identification.

 Right, sorry... I didn't forget, and that's part of what I was
 thinking when I described the second approach as keeping them as
 *mostly*-separate interfaces... but I should have made it more
 explicit! Anyway, yes:

 4) There is consensus that whatever approach is taken, there should be
 a quick and convenient way to identify values that are MISSING,
 IGNORED, or both. (E.g., functions is_MISSING, is_IGNORED,
 is_MISSING_or_IGNORED, or some equivalent.)


Good.

Cheers!
Ben Root