Re: [Numpy-discussion] Concepts for masked/missing data

2011-06-26 Thread Travis Oliphant
I haven't commented yet on the mailing list because of time pressures although 
I have spoken to Mark as often as I can --- and have encouraged him to pursue 
his ideas and discuss them with the community.  The Numeric Python discussion 
list has a long history of great dialogue to try and bring out as many 
perspectives as possible as we wrestle with improving the code base.   It is 
very encouraging to see that tradition continuing.  

Because Enthought was mentioned earlier in this thread, I would like to try and 
clarify a few things about my employer and the company's interest.  Enthought 
has been very interested in the development of the NumPy and SciPy stack (and 
the broader SciPy community) for some time.   With its limited resources, 
Enthought helped significantly to form the SciPy community and continues to 
sponsor it as much as it can.   Many developers who work at Enthought 
(including me) have personal interests in the NumPy / SciPy community and 
codebase that go beyond Enthought's ability to invest directly as well.   

While Enthought has limited resources to invest directly in pursuing the goal, 
Enthought is very interested in improving Python's use as a data analysis 
environment.  Because of that interest, Enthought sponsored a data-array 
summit in May.   There is an inScight podcast that summarizes some of the event, 
which you can listen to at http://inscight.org/2011/05/18/episode_13/.   The 
purpose of this event was to bring together a few people who have been working 
on different aspects of the problem (particularly around the labeled array, or 
data array, problem).   We also wanted to jump-start the activity of our 
interns and make sure that some of the use cases we have seen during the past 
several years while working on client projects received some attention. 

The event was successful in that it generated *a lot* of ideas.   Some of these 
ideas were summarized in notes that are linked to at this Convore thread: 
https://convore.com/python-scientific-computing/data-array-in-numpy/   One of 
the major ideas that emerged during the discussion is that NumPy needs to be 
able to handle missing data in a more integrated way (i.e. there need to be 
functions that do the right thing in the face of missing data).   One 
approach suggested during the discussion was to handle missing data by 
introducing special NA dtypes.   

Mark is one of two interns that we have this summer who are tasked at a high 
level with taking what was learned at the summit and implementing critical 
pieces as their skills and interests allow.   I have been talking with them 
individually to map out specific work targets for the summer.   Christopher 
Jordan-Squires is one of our interns who is studying for a PhD in 
Mathematics at the Univ. of Washington.   He has a strong interest in 
statistics and a desire to make Python as easy to use as R for certain 
statistical workflows.   Mark Wiebe is known on this list because of his 
recent success at working on the NumPy code base.  As a result of that success, 
Mark is working on making improvements to NumPy that are seen as most critical 
to solving some of the same problems we keep seeing in our projects (labeled 
arrays being one of them).   We are also very interested in the Pandas project, 
as it brings a data structure like R's successful DataFrame to the Python 
space (and it helps solve some of the problems our clients are seeing).   It 
would be good to make sure that core functionality that Pandas needs is 
available in NumPy where appropriate. 

The date-time work that Mark did was the first low-hanging fruit that needed 
to be finished.   The second project that Mark is involved with is creating an 
approach for missing data in NumPy.   I suggested the missing-data dtypes (in 
part because Mark had expressed some concerns about the way dtypes are handled 
in NumPy, and I would love for that mechanism for user-defined data-types and 
the whole data-type infrastructure to be improved as needed).   Mark spent some 
time thinking about it and felt more comfortable with the masked-array solution, 
and that is where we are now. 

Enthought's main interest remains in seeing how much of the data array can and 
should be moved into low-level NumPy, as well as the implementation of 
functionality (wherever it may live) to make data analysis easier and more 
productive in Python.   Again, though, this is something Enthought as a 
company can only invest limited resources in, and we want to make sure that 
Mark spends the time we are sponsoring on work that is seen as valuable by 
the community and, more importantly, matches our own internal needs. 

I will post a follow-on message that provides my current views on the subject 
of missing data and masked arrays.

-Travis



Re: [Numpy-discussion] Concepts for masked/missing data

2011-06-25 Thread Charles R Harris
On Sat, Jun 25, 2011 at 10:05 AM, Nathaniel Smith n...@pobox.com wrote:

 So obviously there's a lot of interest in this question, but I'm
 losing track of all the different issues that've been raised in the
 150-post thread of doom. I think I'll find this easier if we start by
 putting aside the questions about implementation and such and focus
 for now on the *conceptual model* that we want. Maybe I'm not the only
 one?

 So as far as I can tell, there are three different ways of thinking
 about masked/missing data that people have been using in the other
 thread:

 1) Missingness is part of the data. Some data is missing, some isn't,
 this might change through computation on the data (just like some data
 might change from a 3 to a 6 when we apply some transformation; NA |
 True could be True, instead of NA), but we can't just decide that
 some data is no longer missing. It makes no sense to ask what value is
 "really" there underneath the missingness. And it's critical that we
 keep track of this through all operations, because otherwise we may
 silently give incorrect answers -- exactly like it's critical that we
 keep track of the difference between 3 and 6.

 2) All the data exists, at least in some sense, but we don't always
 want to look at all of it. We lay a mask over our data to view and
 manipulate only parts of it at a time. We might want to use different
 masks at different times, mutate the mask as we go, etc. The most
 important thing is to provide convenient ways to do complex
 manipulations -- preserve masks through indexing operations, overlay
 the mask from one array on top of another array, etc. When it comes to
 other sorts of operations then we'd rather just silently skip the
 masked values -- we know there are values that are masked, that's the
 whole point, to work with the unmasked subset of the data, so if sum
 returned NA then that would just be a stupid hassle.

 3) The "all things to all people" approach: implement every feature
 implied by either (1) or (2), and switch back and forth between these
 conceptual frameworks whenever necessary to make sense of the
 resulting code.

 The advantage of deciding up front what our model is is that it makes
 a lot of other questions easier. E.g., someone asked in the other
 thread whether, after setting an array element to NA, it would be
 possible to get back the original value. If we follow (1), the answer
 is obviously "no"; if we follow (2), the answer is obviously "yes";
 and if we follow (3), the answer is obviously "yes, probably, well,
 maybe you better check the docs?".
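
 For concreteness, here is roughly what the two behaviors look like
 today -- NaN standing in for a float-only NA under (1), since np.NA
 does not exist yet, and numpy.ma for (2):

 import numpy as np

 # Model (1): missingness propagates (NaN is a float-only stand-in).
 a = np.array([1.0, np.nan, 3.0])
 print(a.sum())        # nan -- one missing value poisons the result

 # Model (2): masked values are skipped, and the mask is mutable.
 m = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False])
 print(m.sum())        # 4.0 -- the masked element is ignored
 m.mask[1] = False     # "unmasking" recovers the original value
 print(m.sum())        # 6.0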

 My personal opinions on these are:
 (1): This is a real problem I face, and there isn't any good solution
 now. Support for this in numpy would be awesome.
 (2): This feels more like a convenience feature to me; we already have
 lots of ways to work with subsets of data. I probably wouldn't bother
 using it, but that's fine -- I don't use np.matrix either, but some
 people like it.
 (3): Well, it's a bit of a mess, but I guess it might be better than
 nothing?

 But that's just my opinion. I'm wondering if we can get any consensus
 on which of these we actually *want* (or maybe we want some fourth
 option!), and *then* we can try to figure out the best way to get
 there? Pretty much any implementation strategy we've talked about
 could work for any of these, but hard to decide between them if we
 don't even know what we're trying to do...


I go for 3 ;) And I think that is where we are heading. By default, masked
array operations look like 1), but by taking views one can get 2). I think
the crucial aspect here is the use of views, which both saves on storage and
fits with the current numpy concept of views.

Chuck


Re: [Numpy-discussion] Concepts for masked/missing data

2011-06-25 Thread Matthew Brett
Hi,

On Sat, Jun 25, 2011 at 5:05 PM, Nathaniel Smith n...@pobox.com wrote:
 So obviously there's a lot of interest in this question, but I'm
 losing track of all the different issues that've been raised in the
 150-post thread of doom. I think I'll find this easier if we start by
 putting aside the questions about implementation and such and focus
 for now on the *conceptual model* that we want. Maybe I'm not the only
 one?

 So as far as I can tell, there are three different ways of thinking
 about masked/missing data that people have been using in the other
 thread:

 1) Missingness is part of the data. Some data is missing, some isn't,
 this might change through computation on the data (just like some data
 might change from a 3 to a 6 when we apply some transformation; NA |
 True could be True, instead of NA), but we can't just decide that
 some data is no longer missing. It makes no sense to ask what value is
 "really" there underneath the missingness. And it's critical that we
 keep track of this through all operations, because otherwise we may
 silently give incorrect answers -- exactly like it's critical that we
 keep track of the difference between 3 and 6.

So far I see the difference between 1) and 2) being that you cannot
unmask.  So, if you didn't even know you could unmask data, then it
would not matter that 1) was being implemented by masks?

 2) All the data exists, at least in some sense, but we don't always
 want to look at all of it. We lay a mask over our data to view and
 manipulate only parts of it at a time. We might want to use different
 masks at different times, mutate the mask as we go, etc. The most
 important thing is to provide convenient ways to do complex
 manipulations -- preserve masks through indexing operations, overlay
 the mask from one array on top of another array, etc. When it comes to
 other sorts of operations then we'd rather just silently skip the
 masked values -- we know there are values that are masked, that's the
 whole point, to work with the unmasked subset of the data, so if sum
 returned NA then that would just be a stupid hassle.

To clarify, you're proposing for:

a = np.sum(np.array([np.NA, np.NA]))

1) -> np.NA
2) -> 0.0

?

 But that's just my opinion. I'm wondering if we can get any consensus
 on which of these we actually *want* (or maybe we want some fourth
 option!), and *then* we can try to figure out the best way to get
 there? Pretty much any implementation strategy we've talked about
 could work for any of these, but hard to decide between them if we
 don't even know what we're trying to do...

I agree it's good to separate the API from the implementation.   I
think the implementation is also important because I care about memory
and possibly speed.  But, that is a separate problem from the API...

Cheers,

Matthew


Re: [Numpy-discussion] Concepts for masked/missing data

2011-06-25 Thread Charles R Harris
On Sat, Jun 25, 2011 at 10:26 AM, Matthew Brett matthew.br...@gmail.com wrote:

 Hi,

 On Sat, Jun 25, 2011 at 5:05 PM, Nathaniel Smith n...@pobox.com wrote:
  So obviously there's a lot of interest in this question, but I'm
  losing track of all the different issues that've been raised in the
  150-post thread of doom. I think I'll find this easier if we start by
  putting aside the questions about implementation and such and focus
  for now on the *conceptual model* that we want. Maybe I'm not the only
  one?
 
  So as far as I can tell, there are three different ways of thinking
  about masked/missing data that people have been using in the other
  thread:
 
  1) Missingness is part of the data. Some data is missing, some isn't,
  this might change through computation on the data (just like some data
  might change from a 3 to a 6 when we apply some transformation; NA |
  True could be True, instead of NA), but we can't just decide that
  some data is no longer missing. It makes no sense to ask what value is
  "really" there underneath the missingness. And it's critical that we
  keep track of this through all operations, because otherwise we may
  silently give incorrect answers -- exactly like it's critical that we
  keep track of the difference between 3 and 6.

 So far I see the difference between 1) and 2) being that you cannot
 unmask.  So, if you didn't even know you could unmask data, then it
 would not matter that 1) was being implemented by masks?

  2) All the data exists, at least in some sense, but we don't always
  want to look at all of it. We lay a mask over our data to view and
  manipulate only parts of it at a time. We might want to use different
  masks at different times, mutate the mask as we go, etc. The most
  important thing is to provide convenient ways to do complex
  manipulations -- preserve masks through indexing operations, overlay
  the mask from one array on top of another array, etc. When it comes to
  other sorts of operations then we'd rather just silently skip the
  masked values -- we know there are values that are masked, that's the
  whole point, to work with the unmasked subset of the data, so if sum
  returned NA then that would just be a stupid hassle.

 To clarify, you're proposing for:

 a = np.sum(np.array([np.NA, np.NA]))

 1) -> np.NA
 2) -> 0.0

 ?

  But that's just my opinion. I'm wondering if we can get any consensus
  on which of these we actually *want* (or maybe we want some fourth
  option!), and *then* we can try to figure out the best way to get
  there? Pretty much any implementation strategy we've talked about
  could work for any of these, but hard to decide between them if we
  don't even know what we're trying to do...

 I agree it's good to separate the API from the implementation.   I
 think the implementation is also important because I care about memory
 and possibly speed.  But, that is a separate problem from the API...


In a larger sense, we are seeking to add metadata to array elements and have
ufuncs that use that metadata together with the element values to compute
results. Off topic a bit, but it reminds me of the Burroughs 6600 that I
once used. The word size on that machine was 48 bits, so it could
accommodate both 6- and 8-bit characters, and 3 bits of metadata were
appended to mark the type. So there was a machine with 51-bit words ;) IIRC,
Knuth was involved in the design and helped with the OS, which was written
in ALGOL...

Chuck


Re: [Numpy-discussion] Concepts for masked/missing data

2011-06-25 Thread Nathaniel Smith
On Sat, Jun 25, 2011 at 9:26 AM, Matthew Brett matthew.br...@gmail.com wrote:
 So far I see the difference between 1) and 2) being that you cannot
 unmask.  So, if you didn't even know you could unmask data, then it
 would not matter that 1) was being implemented by masks?

I guess that is a difference, but I'm trying to get at something more
fundamental -- not just what operations are allowed, but what
operations people *expect* to be allowed. It seems like some of us
have been talking past each other a lot, where someone says "but
changing masks is the single most important feature!" and then someone
else says "what are you talking about? that doesn't even make sense".

 To clarify, you're proposing for:

 a = np.sum(np.array([np.NA, np.NA]))

 1) -> np.NA
 2) -> 0.0

Yes -- and in R you actually do get NA, while in numpy.ma you
actually do get 0. I don't think this is a coincidence; I think it's
because they're designed as coherent systems that are trying to solve
different problems. (Well, numpy.ma's "hardmask" idea seems inspired
by the missing-data concept rather than the temporary-mask concept,
but aside from that it seems pretty consistent in implementing option
2.)

Here's another possible difference -- in (1), intuitively, missingness
is a property of the data, so the logical place to put information
about whether you can expect missing values is in the dtype, and to
enable missing values you need to make a new array with a new dtype.
(If we use a mask-based implementation, then
np.asarray(nomissing_array, dtype=yesmissing_type) would still be able
to skip making a copy of the data -- I'm talking ONLY about the
interface here, not whether missing data has a different storage
format from non-missing data.)

In (2), the whole point is to use different masks with the same data,
so I'd argue masking should be a property of the array object rather
than the dtype, and the interface should logically allow masks to be
created, modified, and destroyed in place.

They're both internally consistent, but I think we might have to make
a decision and stick to it.
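
To make the contrast concrete, here is a toy sketch (purely
illustrative, not a proposed API): in (1) missingness is fixed at
construction, like a dtype, and there is no unmasking; in (2) the mask
is mutable state on the array object:

import numpy as np

class NAArray:
    """Model (1): NA-ness is part of the type; no unmasking."""
    def __init__(self, data, na):
        self._data = np.asarray(data, dtype=float)
        self._na = np.asarray(na, dtype=bool)
    def sum(self):
        # NA propagates: any missing value poisons the reduction
        return float("nan") if self._na.any() else float(self._data.sum())

class MaskedView:
    """Model (2): the mask is a mutable property of the array object."""
    def __init__(self, data):
        self.data = np.asarray(data, dtype=float)
        self.mask = np.zeros(self.data.shape, dtype=bool)
    def sum(self):
        # masked values are simply skipped
        return float(self.data[~self.mask].sum())

print(NAArray([1.0, 2.0], na=[False, True]).sum())   # nan

b = MaskedView([1.0, 2.0])
b.mask[1] = True      # lay a mask over part of the data...
print(b.sum())        # 1.0
b.mask[1] = False     # ...then take it off again
print(b.sum())        # 3.0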

 I agree it's good to separate the API from the implementation.   I
 think the implementation is also important because I care about memory
 and possibly speed.  But, that is a separate problem from the API...

Yes, absolutely memory and speed are important. But a really fast
solution to the wrong problem isn't so useful either :-).

-- Nathaniel


Re: [Numpy-discussion] Concepts for masked/missing data

2011-06-25 Thread Wes McKinney
On Sat, Jun 25, 2011 at 1:05 PM, Nathaniel Smith n...@pobox.com wrote:
 On Sat, Jun 25, 2011 at 9:26 AM, Matthew Brett matthew.br...@gmail.com 
 wrote:
 So far I see the difference between 1) and 2) being that you cannot
 unmask.  So, if you didn't even know you could unmask data, then it
 would not matter that 1) was being implemented by masks?

 I guess that is a difference, but I'm trying to get at something more
 fundamental -- not just what operations are allowed, but what
 operations people *expect* to be allowed. It seems like some of us
 have been talking past each other a lot, where someone says "but
 changing masks is the single most important feature!" and then someone
 else says "what are you talking about? that doesn't even make sense".

 To clarify, you're proposing for:

 a = np.sum(np.array([np.NA, np.NA]))

 1) -> np.NA
 2) -> 0.0

 Yes -- and in R you actually do get NA, while in numpy.ma you
 actually do get 0. I don't think this is a coincidence; I think it's
 because they're designed as coherent systems that are trying to solve
 different problems. (Well, numpy.ma's "hardmask" idea seems inspired
 by the missing-data concept rather than the temporary-mask concept,
 but aside from that it seems pretty consistent in implementing option
 2.)

Agree. My basic observation about numpy.ma is that it's a finely
crafted solution for a different set of problems than the ones I have.
I just don't want the same thing to happen here, where I'm stuck writing
code (like I am now) that looks like:

# y: a float MaskedArray; axis: the reduction axis
mask = y.mask                    # True where a value is masked (missing)
the_sum = y.sum(axis)            # masked values are skipped by ma's sum
the_count = (~mask).sum(axis)    # number of valid (unmasked) values per slice
the_sum[the_count == 0] = np.nan # slices with no valid data become NaN, not 0
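
(For what it's worth, ma's axis reductions do return masked for
all-masked slices -- see Eric's message further down -- so, assuming y
holds float data, the above can collapse to something like:)

the_sum = y.sum(axis).filled(np.nan)   # masked results become NaN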

 Here's another possible difference -- in (1), intuitively, missingness
 is a property of the data, so the logical place to put information
 about whether you can expect missing values is in the dtype, and to
 enable missing values you need to make a new array with a new dtype.
 (If we use a mask-based implementation, then
 np.asarray(nomissing_array, dtype=yesmissing_type) would still be able
 to skip making a copy of the data -- I'm talking ONLY about the
 interface here, not whether missing data has a different storage
 format from non-missing data.)

 In (2), the whole point is to use different masks with the same data,
 so I'd argue masking should be a property of the array object rather
 than the dtype, and the interface should logically allow masks to be
 created, modified, and destroyed in place.

 They're both internally consistent, but I think we might have to make
 a decision and stick to it.

 I agree it's good to separate the API from the implementation.   I
 think the implementation is also important because I care about memory
 and possibly speed.  But, that is a separate problem from the API...

 Yes, absolutely memory and speed are important. But a really fast
 solution to the wrong problem isn't so useful either :-).

 -- Nathaniel


Re: [Numpy-discussion] Concepts for masked/missing data

2011-06-25 Thread Benjamin Root
On Sat, Jun 25, 2011 at 11:26 AM, Matthew Brett matthew.br...@gmail.com wrote:

 Hi,

 On Sat, Jun 25, 2011 at 5:05 PM, Nathaniel Smith n...@pobox.com wrote:
  So obviously there's a lot of interest in this question, but I'm
  losing track of all the different issues that've been raised in the
  150-post thread of doom. I think I'll find this easier if we start by
  putting aside the questions about implementation and such and focus
  for now on the *conceptual model* that we want. Maybe I'm not the only
  one?
 
  So as far as I can tell, there are three different ways of thinking
  about masked/missing data that people have been using in the other
  thread:
 
  1) Missingness is part of the data. Some data is missing, some isn't,
  this might change through computation on the data (just like some data
  might change from a 3 to a 6 when we apply some transformation; NA |
  True could be True, instead of NA), but we can't just decide that
  some data is no longer missing. It makes no sense to ask what value is
  "really" there underneath the missingness. And it's critical that we
  keep track of this through all operations, because otherwise we may
  silently give incorrect answers -- exactly like it's critical that we
  keep track of the difference between 3 and 6.

 So far I see the difference between 1) and 2) being that you cannot
 unmask.  So, if you didn't even know you could unmask data, then it
 would not matter that 1) was being implemented by masks?


Yes, bingo, you hit it right on the nose.  Essentially, 1) could be
considered the "hard mask", while 2) would be the "soft mask".  Everything
else is implementation details.
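
In today's numpy.ma terms, the hard_mask flag already draws exactly this
line; a quick, runnable illustration:

import numpy as np

# soft mask: assignment to a masked element unmasks it
soft = np.ma.array([1.0, 2.0], mask=[False, True])
soft[1] = 5.0
print(soft)       # [1.0 5.0] -- the value came back

# hard mask: masked elements cannot be unmasked by assignment
hard = np.ma.array([1.0, 2.0], mask=[False, True], hard_mask=True)
hard[1] = 5.0     # silently ignored
print(hard)       # [1.0 --] -- still masked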


  2) All the data exists, at least in some sense, but we don't always
  want to look at all of it. We lay a mask over our data to view and
  manipulate only parts of it at a time. We might want to use different
  masks at different times, mutate the mask as we go, etc. The most
  important thing is to provide convenient ways to do complex
  manipulations -- preserve masks through indexing operations, overlay
  the mask from one array on top of another array, etc. When it comes to
  other sorts of operations then we'd rather just silently skip the
  masked values -- we know there are values that are masked, that's the
  whole point, to work with the unmasked subset of the data, so if sum
  returned NA then that would just be a stupid hassle.

 To clarify, you're proposing for:

 a = np.sum(np.array([np.NA, np.NA]))

 1) -> np.NA
 2) -> 0.0

 ?


Actually, I have always considered this to be a bug.  Note that np.sum([])
also returns 0.0.  I think the reason why it has been returning zero instead
of NaN was because there wasn't a NaN-equivalent for integers.  This is
where I think a np.NA could best serve NumPy by providing a dtype-agnostic
way to represent missing or invalid data.
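
(A quick check of the current behavior -- 0.0 is the identity element
of addition, which is why the empty sum returns it:)

import numpy as np

print(np.sum([]))    # 0.0 -- the sum of no elements is the additive identity
print(np.prod([]))   # 1.0 -- likewise, the empty product is 1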

Ben Root


Re: [Numpy-discussion] Concepts for masked/missing data

2011-06-25 Thread Alan G Isaac
On 6/25/2011 2:06 PM, Benjamin Root wrote:
 Note that np.sum([]) also returns 0.0.  I think the
 reason why it has been returning zero instead of NaN was
 because there wasn't a NaN-equivalent for integers.


http://en.wikipedia.org/wiki/Empty_sum

fwiw,
Alan Isaac


Re: [Numpy-discussion] Concepts for masked/missing data

2011-06-25 Thread Benjamin Root
On Sat, Jun 25, 2011 at 12:05 PM, Nathaniel Smith n...@pobox.com wrote:

 On Sat, Jun 25, 2011 at 9:26 AM, Matthew Brett matthew.br...@gmail.com
 wrote:
  So far I see the difference between 1) and 2) being that you cannot
  unmask.  So, if you didn't even know you could unmask data, then it
  would not matter that 1) was being implemented by masks?

 I guess that is a difference, but I'm trying to get at something more
 fundamental -- not just what operations are allowed, but what
 operations people *expect* to be allowed.


That is a much trickier problem.




 Here's another possible difference -- in (1), intuitively, missingness
 is a property of the data, so the logical place to put information
 about whether you can expect missing values is in the dtype, and to
 enable missing values you need to make a new array with a new dtype.
 (If we use a mask-based implementation, then
 np.asarray(nomissing_array, dtype=yesmissing_type) would still be able
 to skip making a copy of the data -- I'm talking ONLY about the
 interface here, not whether missing data has a different storage
 format from non-missing data.)

 In (2), the whole point is to use different masks with the same data,
 so I'd argue masking should be a property of the array object rather
 than the dtype, and the interface should logically allow masks to be
 created, modified, and destroyed in place.


I can agree with this distinction.  However, if missingness is an
intrinsic property of the data, then shouldn't users be implementing their
own dtype tailored to the data they are using?  In other words, how far does
the core of NumPy need to go to address this issue?  And how far would be
too much?



 They're both internally consistent, but I think we might have to make
 a decision and stick to it.


Of course.  I think that Mark has had a very inspired idea of giving the R
audience what they want (np.NA), while simultaneously making the use of
masked arrays even easier (which I can certainly appreciate).


  I agree it's good to separate the API from the implementation.   I
  think the implementation is also important because I care about memory
  and possibly speed.  But, that is a separate problem from the API...

 Yes, absolutely memory and speed are important. But a really fast
 solution to the wrong problem isn't so useful either :-).


The one thing I have always loved about Python (and NumPy) is that it
respects the developer's time.  I come from a C++ background where I found
C++ to be powerful, but tedious.  I went to Matlab because it was just
straight-up easier to code math and display graphs.  (If anybody here ever
used GrADS, then you know how badly I would want a language that respected
my time).  However, even Matlab couldn't fully respect my time as I usually
kept wasting it trying to get various pieces working.  Python came along,
and while it didn't always match the speed of some of my Matlab programs, it
was fast enough.

I will put out a little disclaimer.  I once had to use S+ for a class.  To
be honest, it was the worst programming experience of my life.  This
experience may be coloring my perception of R's approach to handling missing
data.

Ben Root


Re: [Numpy-discussion] Concepts for masked/missing data

2011-06-25 Thread Nathaniel Smith
On Sat, Jun 25, 2011 at 11:50 AM, Eric Firing efir...@hawaii.edu wrote:
 On 06/25/2011 07:05 AM, Nathaniel Smith wrote:
  On Sat, Jun 25, 2011 at 9:26 AM, Matthew Brett matthew.br...@gmail.com
  wrote:
 To clarify, you're proposing for:

  a = np.sum(np.array([np.NA, np.NA]))

  1) ->  np.NA
  2) ->  0.0

 Yes -- and in R you actually do get NA, while in numpy.ma you
 actually do get 0. I don't think this is a coincidence; I think it's

 No, you don't:

 In [2]: np.ma.array([2, 4], mask=[True, True]).sum()
 Out[2]: masked

 In [4]: np.sum(np.ma.array([2, 4], mask=[True, True]))
 Out[4]: masked

Huh. So in numpy.ma, sum([10, NA]) and sum([10]) are the same, but
sum([NA]) and sum([]) are different? Sounds to me like you should file
a bug on numpy.ma...

Anyway, the general point is that in R, NA's propagate, and in
numpy.ma, masked values are ignored (except, apparently, if all values
are masked). Here, I actually checked these:

Python: np.ma.array([2, 4], mask=[True, False]).sum() -> 4
R: sum(c(NA, 4)) -> NA
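
(A short session reproducing the comparison -- the numpy side is
runnable as-is; the R line is a comment because it runs in R:)

import numpy as np

print(np.ma.array([2, 4], mask=[True, False]).sum())  # 4  -- masked value ignored
print(np.ma.array([2, 4], mask=[True, True]).sum())   # -- (the masked constant)
# In R: sum(c(NA, 4)) evaluates to NA -- the NA propagates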

-- Nathaniel


Re: [Numpy-discussion] Concepts for masked/missing data

2011-06-25 Thread Nathaniel Smith
On Sat, Jun 25, 2011 at 11:32 AM, Benjamin Root ben.r...@ou.edu wrote:
 On Sat, Jun 25, 2011 at 12:05 PM, Nathaniel Smith n...@pobox.com wrote:
 I guess that is a difference, but I'm trying to get at something more
 fundamental -- not just what operations are allowed, but what
 operations people *expect* to be allowed.

 That is a much trickier problem.

It can be. I think of it as the difference between design and coding.
They overlap less than one might expect...

 Here's another possible difference -- in (1), intuitively, missingness
 is a property of the data, so the logical place to put information
 about whether you can expect missing values is in the dtype, and to
 enable missing values you need to make a new array with a new dtype.
 (If we use a mask-based implementation, then
 np.asarray(nomissing_array, dtype=yesmissing_type) would still be able
 to skip making a copy of the data -- I'm talking ONLY about the
 interface here, not whether missing data has a different storage
 format from non-missing data.)

 In (2), the whole point is to use different masks with the same data,
 so I'd argue masking should be a property of the array object rather
 than the dtype, and the interface should logically allow masks to be
 created, modified, and destroyed in place.


 I can agree with this distinction.  However, if missingness is an
 intrinsic property of the data, then shouldn't users be implementing their
 own dtype tailored to the data they are using?  In other words, how far does
 the core of NumPy need to go to address this issue?  And how far would be
 too much?

Yes, that's exactly my question: whether our goal is to implement
missingness in numpy or not!


 They're both internally consistent, but I think we might have to make
 a decision and stick to it.


 Of course.  I think that Mark has had a very inspired idea of giving the R
 audience what they want (np.NA), while simultaneously making the use of
 masked arrays even easier (which I can certainly appreciate).

I don't know. I think we could build a really top-notch implementation
of missingness. I also think we could build a really top-notch
implementation of masking. But my suggestions for how to improve the
current design are totally different depending on which of those is
the goal, and neither the R audience (like me) nor the masked array
audience (like you) seems really happy with the current design. And I
don't know what the goal is -- maybe it's something else and the
current design hits it perfectly? Maybe we want a top-notch
implementation of *both* missingness and masking, and those should be
two different things that can be combined, so that some of the
unmasked values inside a masked array can be NA? I don't know.

 I will put out a little disclaimer.  I once had to use S+ for a class.  To
 be honest, it was the worst programming experience of my life.  This
 experience may be coloring my perception of R's approach to handling missing
 data.

There are a lot of things that R does wrong (not their fault; language
design is an extremely difficult and specialized skill that
statisticians are not exactly trained in), but it did make a few
excellent choices at the beginning. One was to steal the execution
model from Scheme, which, uh, isn't really relevant here. The other
was to steal the basic data types and standard library that the Bell
Labs statisticians had pounded into shape over many years. I use
Python now because using R for everything would drive me crazy, but
despite its many flaws, it still does some things so well that it's
become *the* language used for basically all statistical research. I'm
only talking about stealing those things :-).

-- Nathaniel


Re: [Numpy-discussion] Concepts for masked/missing data

2011-06-25 Thread Eric Firing
On 06/25/2011 09:09 AM, Benjamin Root wrote:


 On Sat, Jun 25, 2011 at 1:57 PM, Nathaniel Smith n...@pobox.com wrote:

 On Sat, Jun 25, 2011 at 11:50 AM, Eric Firing efir...@hawaii.edu wrote:
   On 06/25/2011 07:05 AM, Nathaniel Smith wrote:
   On Sat, Jun 25, 2011 at 9:26 AM, Matthew Brett matthew.br...@gmail.com wrote:
   To clarify, you're proposing for:
  
   a = np.sum(np.array([np.NA, np.NA]))

   1) ->  np.NA
   2) ->  0.0
  
   Yes -- and in R you actually do get NA, while in numpy.ma you
   actually do get 0. I don't think this is a coincidence; I think it's
  
   No, you don't:
  
   In [2]: np.ma.array([2, 4], mask=[True, True]).sum()
   Out[2]: masked
  
   In [4]: np.sum(np.ma.array([2, 4], mask=[True, True]))
   Out[4]: masked

 Huh. So in numpy.ma, sum([10, NA]) and sum([10]) are the same, but
 sum([NA]) and sum([]) are different? Sounds to me like you should file
 a bug on numpy.ma...


 Actually, no... I should have tested this before replying earlier:

 >>> a = np.ma.array([2, 4], mask=[True, True])
 >>> a
 masked_array(data = [-- --],
              mask = [ True  True],
        fill_value = 999999)

 >>> a.sum()
 masked

 >>> a = np.ma.array([], mask=[])
 >>> a
 masked_array(data = [],
              mask = [],
        fill_value = 1e+20)

 >>> a.sum()
 masked

 They are the same.


 Anyway, the general point is that in R, NA's propagate, and in
 numpy.ma, masked values are ignored (except, apparently, if all values
 are masked). Here, I actually checked these:

 Python: np.ma.array([2, 4], mask=[True, False]).sum() -> 4
 R: sum(c(NA, 4)) -> NA


 If you want NaN behavior, then use NaNs.  If you want masked behavior,
 then use masks.

But I think that where Mark is heading is towards infrastructure that 
makes it easy and efficient to do either, as needed, case by case, line 
by line, for any dtype--not just floats.  If he can succeed, that helps 
all of us.  This doesn't have to be R versus masked arrays, or 
beginners versus experienced programmers.

Eric


 Ben Root





Re: [Numpy-discussion] Concepts for masked/missing data

2011-06-25 Thread Wes McKinney
On Sat, Jun 25, 2011 at 3:51 PM, Nathaniel Smith n...@pobox.com wrote:
 On Sat, Jun 25, 2011 at 11:32 AM, Benjamin Root ben.r...@ou.edu wrote:
 On Sat, Jun 25, 2011 at 12:05 PM, Nathaniel Smith n...@pobox.com wrote:
 I guess that is a difference, but I'm trying to get at something more
 fundamental -- not just what operations are allowed, but what
 operations people *expect* to be allowed.

 That is a much trickier problem.

 It can be. I think of it as the difference between design and coding.
 They overlap less than one might expect...

 Here's another possible difference -- in (1), intuitively, missingness
 is a property of the data, so the logical place to put information
 about whether you can expect missing values is in the dtype, and to
 enable missing values you need to make a new array with a new dtype.
 (If we use a mask-based implementation, then
 np.asarray(nomissing_array, dtype=yesmissing_type) would still be able
 to skip making a copy of the data -- I'm talking ONLY about the
 interface here, not whether missing data has a different storage
 format from non-missing data.)

 In (2), the whole point is to use different masks with the same data,
 so I'd argue masking should be a property of the array object rather
 than the dtype, and the interface should logically allow masks to be
 created, modified, and destroyed in place.


 I can agree with this distinction.  However, if missingness is an
 intrinsic property of the data, then shouldn't users be implementing their
 own dtype tailored to the data they are using?  In other words, how far does
 the core of NumPy need to go to address this issue?  And how far would be
 too much?

 Yes, that's exactly my question: whether our goal is to implement
 missingness in numpy or not!


 They're both internally consistent, but I think we might have to make
 a decision and stick to it.


 Of course.  I think that Mark has had a very inspired idea of giving the R
 audience what they want (np.NA), while simultaneously making the use of
 masked arrays even easier (which I can certainly appreciate).

 I don't know. I think we could build a really top-notch implementation
 of missingness. I also think we could build a really top-notch
 implementation of masking. But my suggestions for how to improve the
 current design are totally different depending on which of those is
 the goal, and neither the R audience (like me) nor the masked array
 audience (like you) seems really happy with the current design. And I
 don't know what the goal is -- maybe it's something else and the
 current design hits it perfectly? Maybe we want a top-notch
 implementation of *both* missingness and masking, and those should be
 two different things that can be combined, so that some of the
 unmasked values inside a masked array can be NA? I don't know.

 I will put out a little disclaimer.  I once had to use S+ for a class.  To
 be honest, it was the worst programming experience of my life.  This
 experience may be coloring my perception of R's approach to handling missing
 data.

 There are a lot of things that R does wrong (not their fault; language
 design is an extremely difficult and specialized skill that
 statisticians are not exactly trained in), but it did make a few
 excellent choices at the beginning. One was to steal the execution
 model from Scheme, which, uh, isn't really relevant here. The other
 was to steal the basic data types and standard library that the Bell
 Labs statisticians had pounded into shape over many years. I use
 Python now because using R for everything would drive me crazy, but
 despite its many flaws, it still does some things so well that it's
 become *the* language used for basically all statistical research. I'm
 only talking about stealing those things :-).

 -- Nathaniel


+1. Everyone knows R ain't perfect. I think it's an atrociously bad
programming language but it can be unbelievably good at statistics, as
evidenced by its success. Brings to mind Andy Gelman's blog last fall:

http://www.stat.columbia.edu/~cook/movabletype/archives/2010/09/ross_ihaka_to_r.html

As someone in a statistics department I've frequently been
disheartened when I see how easy many statistical things are in R and
how much more difficult they are in Python. This is partially the
result of poor interfaces for statistical modeling, partially due to
data structures (e.g. the integrated-ness of data.frame throughout R),
and partially due to things like the handling of missing data, for which
there's currently no equivalent in Python.

- Wes