Re: [Numpy-discussion] Concepts for masked/missing data

2011-06-26 Thread Travis Oliphant
I haven't commented yet on the mailing list because of time pressures, although
I have spoken to Mark as often as I can --- and have encouraged him to pursue
his ideas and discuss them with the community.  The Numeric Python discussion
list has a long history of great dialogue that brings out as many perspectives
as possible as we wrestle with improving the code base.  It is very encouraging
to see that tradition continuing.

Because Enthought was mentioned earlier in this thread, I would like to try and
clarify a few things about my employer and the company's interest.  Enthought
has been very interested in the development of the NumPy and SciPy stack (and
the broader SciPy community) for some time.  With its limited resources,
Enthought helped significantly to form the SciPy community and continues to
sponsor it as much as it can.  Many developers who work at Enthought
(including me) have a personal interest in the NumPy / SciPy community and
codebase that goes beyond Enthought's ability to invest directly.

While Enthought has limited resources to invest directly in pursuing that goal,
Enthought is very interested in improving Python's use as a data analysis
environment.  Because of that interest, Enthought sponsored a "data-array"
summit in May.  There is an inScight podcast summarizing some of the event,
which you can listen to at http://inscight.org/2011/05/18/episode_13/.  The
purpose of the event was to bring together a few people who have been working
on different aspects of the problem (particularly around the labeled array, or
data array, problem).  We also wanted to jump-start the activity of our
interns and make sure that some of the use cases we have seen during the past
several years of working on client projects were brought to light.

The event was successful in that it generated *a lot* of ideas.  Some of these
ideas were summarized in notes that are linked to at this Convore thread:
https://convore.com/python-scientific-computing/data-array-in-numpy/ One of
the major ideas that emerged during the discussion is that NumPy needs to be
able to handle missing data in a more integrated way (i.e., there need to be
functions that do the "right" thing in the face of missing data).  One
approach suggested during the discussion was to handle missing data by
introducing special NA dtypes.
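
For context, floating-point arrays can already press NaN into service as a
missing-value marker, but integer arrays have no such sentinel -- a gap an
NA dtype would close.  A minimal sketch of that asymmetry (plain NumPy, not
the proposed dtype):

import numpy as np

f = np.array([1.0, np.nan, 3.0])
print(np.nansum(f))   # 4.0 -- NaN can stand in for "missing" with floats

i = np.array([1, 2, 3])
# int64 has no NaN-like value, so there is nothing to mark a missing
# integer with; a dedicated NA dtype would supply exactly that.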

Mark is one of two interns that we have this summer who are tasked, at a high
level, with taking what was learned at the summit and implementing critical
pieces as their skills and interests allow.  I have been talking with them
individually to map out specific work targets for the summer.  Christopher
Jordan-Squires is one of our interns; he is studying for a PhD in
Mathematics at the Univ. of Washington.  He has a strong interest in
statistics and a desire to make Python as easy to use as R for certain
statistical work flows.  Mark Wiebe is known on this list because of his
recent success working on the NumPy code base.  As a result of that success,
Mark is working on making improvements to NumPy that are seen as most critical
to solving some of the same problems we keep seeing in our projects (labeled
arrays being one of them).  We are also very interested in the Pandas project,
as it brings a data structure like R's successful DataFrame to the Python
space (and it helps solve some of the problems our clients are seeing).  It
would be good to make sure that core functionality that Pandas needs is
available in NumPy where appropriate.

The date-time work that Mark did was the first "low-hanging" fruit that needed
to be finished.  The second project that Mark is involved with is creating an
approach for missing data in NumPy.  I suggested the missing-data dtypes, in
part because Mark had expressed some concerns about the way dtypes are handled
in NumPy, and I would love for that mechanism for user-defined data-types and
the whole data-type infrastructure to be improved as needed.  Mark spent some
time thinking about it, felt more comfortable with the masked-array solution,
and that is where we are now.

Enthought's main interest remains in seeing how much of the data array can and
should be moved into low-level NumPy, as well as in implementing functionality
(wherever it may live) that makes data analysis easier and more productive in
Python.  Again, though, this is something Enthought as a company can only
invest limited resources in, and we want to make sure that Mark spends the
time we are sponsoring on work that is seen as valuable by the community
while, more importantly, matching our own internal needs.

I will post a follow-on message that provides my current views on the subject 
of missing data and masked arrays.

-Travis



On Jun 25, 2011, at 2:09 PM, Benjamin Root wrote:

> 
> 
> On Sat, Jun 25, 2011 at 1:57 PM, Nathaniel Smith  wrote:

Re: [Numpy-discussion] Concepts for masked/missing data

2011-06-25 Thread Wes McKinney
On Sat, Jun 25, 2011 at 3:51 PM, Nathaniel Smith  wrote:
> On Sat, Jun 25, 2011 at 11:32 AM, Benjamin Root  wrote:
>> On Sat, Jun 25, 2011 at 12:05 PM, Nathaniel Smith  wrote:
>>> I guess that is a difference, but I'm trying to get at something more
>>> fundamental -- not just what operations are allowed, but what
>>> operations people *expect* to be allowed.
>>
>> That is quite a trickier problem.
>
> It can be. I think of it as the difference between design and coding.
> They overlap less than one might expect...
>
>>> Here's another possible difference -- in (1), intuitively, missingness
>>> is a property of the data, so the logical place to put information
>>> about whether you can expect missing values is in the dtype, and to
>>> enable missing values you need to make a new array with a new dtype.
>>> (If we use a mask-based implementation, then
>>> np.asarray(nomissing_array, dtype=yesmissing_type) would still be able
>>> to skip making a copy of the data -- I'm talking ONLY about the
>>> interface here, not whether missing data has a different storage
>>> format from non-missing data.)
>>>
>>> In (2), the whole point is to use different masks with the same data,
>>> so I'd argue masking should be a property of the array object rather
>>> than the dtype, and the interface should logically allow masks to be
>>> created, modified, and destroyed in place.
>>>
>>
>> I can agree with this distinction.  However, if "missingness" is an
>> intrinsic property of the data, then shouldn't users be implementing their
>> own dtype tailored to the data they are using?  In other words, how far does
>> the core of NumPy need to go to address this issue?  And how far would be
>> "too much"?
>
> Yes, that's exactly my question: whether our goal is to implement
> missingness in numpy or not!
>
>>>
>>> They're both internally consistent, but I think we might have to make
>>> a decision and stick to it.
>>>
>>
>> Of course.  I think that Mark has had a very inspired idea of giving the R
>> audience what they want (np.NA), while simultaneously making the use of
>> masked arrays even easier (which I can certainly appreciate).
>
> I don't know. I think we could build a really top-notch implementation
> of missingness. I also think we could build a really top-notch
> implementation of masking. But my suggestions for how to improve the
> current design are totally different depending on which of those is
> the goal, and neither the R audience (like me) nor the masked array
> audience (like you) seems really happy with the current design. And I
> don't know what the goal is -- maybe it's something else and the
> current design hits it perfectly? Maybe we want a top-notch
> implementation of *both* missingness and masking, and those should be
> two different things that can be combined, so that some of the
> unmasked values inside a masked array can be NA? I don't know.
>
>> I will put out a little disclaimer.  I once had to use S+ for a class.  To
>> be honest, it was the worst programming experience in my life.  This
>> experience may be coloring my perception of R's approach to handling missing
>> data.
>
> There's a lot of things that R does wrong (not their fault; language
> design is an extremely difficult and specialized skill that
> statisticians are not exactly trained in), but it did make a few
> excellent choices at the beginning. One was to steal the execution
> model from Scheme, which, uh, isn't really relevant here. The other
> was to steal the basic data types and standard library that the Bell
> Labs statisticians had pounded into shape over many years. I use
> Python now because using R for everything would drive me crazy, but
> despite its many flaws, it still does some things so well that it's
> become *the* language used for basically all statistical research. I'm
> only talking about stealing those things :-).
>
> -- Nathaniel
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>

+1. Everyone knows R ain't perfect. I think it's an atrociously bad
programming language but it can be unbelievably good at statistics, as
evidenced by its success. Brings to mind Andy Gelman's blog last fall:

http://www.stat.columbia.edu/~cook/movabletype/archives/2010/09/ross_ihaka_to_r.html

As someone in a statistics department I've frequently been
disheartened to see how easy many statistical things are in R and
how much more difficult they are in Python. This is partially the
result of poor interfaces for statistical modeling, partially due to
data structures (e.g. how integrated data.frame is throughout R), and
partially due to things like missing-data handling, for which Python
currently has no equivalent.

- Wes
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Concepts for masked/missing data

2011-06-25 Thread Eric Firing
On 06/25/2011 09:09 AM, Benjamin Root wrote:
>
>
> On Sat, Jun 25, 2011 at 1:57 PM, Nathaniel Smith  wrote:
>
> On Sat, Jun 25, 2011 at 11:50 AM, Eric Firing  wrote:
>  > On 06/25/2011 07:05 AM, Nathaniel Smith wrote:
>  >> On Sat, Jun 25, 2011 at 9:26 AM, Matthew Brett  wrote:
>  >>> To clarify, you're proposing for:
>  >>>
>  >>> a = np.sum(np.array([np.NA, np.NA]))
>  >>>
>  >>> 1) ->  np.NA
>  >>> 2) ->  0.0
>  >>
>  >> Yes -- and in R you actually do get NA, while in numpy.ma you
>  >> actually do get 0. I don't think this is a coincidence; I think it's
>  >
>  > No, you don't:
>  >
>  > In [2]: np.ma.array([2, 4], mask=[True, True]).sum()
>  > Out[2]: masked
>  >
>  > In [4]: np.sum(np.ma.array([2, 4], mask=[True, True]))
>  > Out[4]: masked
>
> Huh. So in numpy.ma, sum([10, NA]) and sum([10]) are the same, but
> sum([NA]) and sum([]) are different? Sounds to me like you should file
> a bug on numpy.ma...
>
>
> Actually, no... I should have tested this before replying earlier:
>
>  >>> a = np.ma.array([2, 4], mask=[True, True])
>  >>> a
> masked_array(data = [-- --],
>              mask = [ True  True],
>        fill_value = 99)
>
>  >>> a.sum()
> masked
>  >>> a = np.ma.array([], mask=[])
>  >>> a
> masked_array(data = [],
>              mask = [],
>        fill_value = 1e+20)
>  >>> a.sum()
> masked
>
> They are the same.
>
>
> Anyway, the general point is that in R, NA's propagate, and in
> numpy.ma, masked values are ignored (except, apparently, if all values
> are masked). Here, I actually checked these:
>
> Python: np.ma.array([2, 4], mask=[True, False]).sum() -> 4
> R: sum(c(NA, 4)) -> NA
>
>
> If you want NaN behavior, then use NaNs.  If you want masked behavior,
> then use masks.

But I think that where Mark is heading is towards infrastructure that 
makes it easy and efficient to do either, as needed, case by case, line 
by line, for any dtype--not just floats.  If he can succeed, that helps 
all of us.  This doesn't have to be "R versus masked arrays", or 
beginners versus experienced programmers.

Eric

>
> Ben Root
>
>
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Concepts for masked/missing data

2011-06-25 Thread Nathaniel Smith
On Sat, Jun 25, 2011 at 11:32 AM, Benjamin Root  wrote:
> On Sat, Jun 25, 2011 at 12:05 PM, Nathaniel Smith  wrote:
>> I guess that is a difference, but I'm trying to get at something more
>> fundamental -- not just what operations are allowed, but what
>> operations people *expect* to be allowed.
>
> That is quite a trickier problem.

It can be. I think of it as the difference between design and coding.
They overlap less than one might expect...

>> Here's another possible difference -- in (1), intuitively, missingness
>> is a property of the data, so the logical place to put information
>> about whether you can expect missing values is in the dtype, and to
>> enable missing values you need to make a new array with a new dtype.
>> (If we use a mask-based implementation, then
>> np.asarray(nomissing_array, dtype=yesmissing_type) would still be able
>> to skip making a copy of the data -- I'm talking ONLY about the
>> interface here, not whether missing data has a different storage
>> format from non-missing data.)
>>
>> In (2), the whole point is to use different masks with the same data,
>> so I'd argue masking should be a property of the array object rather
>> than the dtype, and the interface should logically allow masks to be
>> created, modified, and destroyed in place.
>>
>
> I can agree with this distinction.  However, if "missingness" is an
> intrinsic property of the data, then shouldn't users be implementing their
> own dtype tailored to the data they are using?  In other words, how far does
> the core of NumPy need to go to address this issue?  And how far would be
> "too much"?

Yes, that's exactly my question: whether our goal is to implement
missingness in numpy or not!

>>
>> They're both internally consistent, but I think we might have to make
>> a decision and stick to it.
>>
>
> Of course.  I think that Mark has had a very inspired idea of giving the R
> audience what they want (np.NA), while simultaneously making the use of
> masked arrays even easier (which I can certainly appreciate).

I don't know. I think we could build a really top-notch implementation
of missingness. I also think we could build a really top-notch
implementation of masking. But my suggestions for how to improve the
current design are totally different depending on which of those is
the goal, and neither the R audience (like me) nor the masked array
audience (like you) seems really happy with the current design. And I
don't know what the goal is -- maybe it's something else and the
current design hits it perfectly? Maybe we want a top-notch
implementation of *both* missingness and masking, and those should be
two different things that can be combined, so that some of the
unmasked values inside a masked array can be NA? I don't know.

> I will put out a little disclaimer.  I once had to use S+ for a class.  To
> be honest, it was the worst programming experience in my life.  This
> experience may be coloring my perception of R's approach to handling missing
> data.

There's a lot of things that R does wrong (not their fault; language
design is an extremely difficult and specialized skill that
statisticians are not exactly trained in), but it did make a few
excellent choices at the beginning. One was to steal the execution
model from Scheme, which, uh, isn't really relevant here. The other
was to steal the basic data types and standard library that the Bell
Labs statisticians had pounded into shape over many years. I use
Python now because using R for everything would drive me crazy, but
despite its many flaws, it still does some things so well that it's
become *the* language used for basically all statistical research. I'm
only talking about stealing those things :-).

-- Nathaniel
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Concepts for masked/missing data

2011-06-25 Thread Benjamin Root
On Sat, Jun 25, 2011 at 1:18 PM, Alan G Isaac  wrote:

> On 6/25/2011 2:06 PM, Benjamin Root wrote:
> > Note that "np.sum([])" also returns 0.0.  I think the
> > reason why it has been returning zero instead of NaN was
> > because there wasn't a NaN-equivalent for integers.
>
>
> http://en.wikipedia.org/wiki/Empty_sum
>
>
Ah, thanks.

This then does raise the question of what masked arrays should do in the
case of a completely masked out record (or field, or whatever).

Ben Root
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Concepts for masked/missing data

2011-06-25 Thread Benjamin Root
On Sat, Jun 25, 2011 at 1:57 PM, Nathaniel Smith  wrote:

> On Sat, Jun 25, 2011 at 11:50 AM, Eric Firing  wrote:
> > On 06/25/2011 07:05 AM, Nathaniel Smith wrote:
> >> On Sat, Jun 25, 2011 at 9:26 AM, Matthew Brett
>  wrote:
> >>> To clarify, you're proposing for:
> >>>
> >>> a = np.sum(np.array([np.NA, np.NA]))
> >>>
> >>> 1) ->  np.NA
> >>> 2) ->  0.0
> >>
> >> Yes -- and in R you actually do get NA, while in numpy.ma you
> >> actually do get 0. I don't think this is a coincidence; I think it's
> >
> > No, you don't:
> >
> > In [2]: np.ma.array([2, 4], mask=[True, True]).sum()
> > Out[2]: masked
> >
> > In [4]: np.sum(np.ma.array([2, 4], mask=[True, True]))
> > Out[4]: masked
>
> Huh. So in numpy.ma, sum([10, NA]) and sum([10]) are the same, but
> sum([NA]) and sum([]) are different? Sounds to me like you should file
> a bug on numpy.ma...
>

Actually, no... I should have tested this before replying earlier:

>>> a = np.ma.array([2, 4], mask=[True, True])
>>> a
masked_array(data = [-- --],
             mask = [ True  True],
       fill_value = 99)

>>> a.sum()
masked
>>> a = np.ma.array([], mask=[])
>>> a
masked_array(data = [],
             mask = [],
       fill_value = 1e+20)
>>> a.sum()
masked

They are the same.


> Anyway, the general point is that in R, NA's propagate, and in
> numpy.ma, masked values are ignored (except, apparently, if all values
> are masked). Here, I actually checked these:
>
> Python: np.ma.array([2, 4], mask=[True, False]).sum() -> 4
> R: sum(c(NA, 4)) -> NA
>
>
If you want NaN behavior, then use NaNs.  If you want masked behavior, then
use masks.

Ben Root
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Concepts for masked/missing data

2011-06-25 Thread Benjamin Root
On Sat, Jun 25, 2011 at 12:17 PM, Wes McKinney  wrote:

>
> Agree. My basic observation about numpy.ma is that it's a finely
> crafted solution for a different set of problems than the ones I have.
> I just don't want the same thing to happen here so I'm stuck writing
> code (like I am now) that looks like
>
> mask = y.mask
> the_sum = y.sum(axis)
> the_count = mask.sum(axis)
> the_sum[the_count == 0] = nan
>
>
Yeah, you are expecting NaN behavior there, in which case, using NaNs
without masks is fine.  But, for a general solution, you might want to
consider that masked_array provides a "recordmask" as well as a mask.
Admittedly, the documentation is lacking and using masked arrays with
record arrays seems clumsy, but that is certainly something that could be
cleaned up.

Also, as a cleaner version of your code:

the_sum = y.sum(axis)
the_sum.mask = np.any(y.mask, axis)
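
For reference, a runnable sketch of that suggestion, assuming y is a 2-D
numpy.ma MaskedArray and axis is an int.  Note the semantics: np.any masks
the result wherever *any* input was masked (R-style propagation), whereas
np.all would mask only the all-missing slices:

import numpy as np

y = np.ma.array([[1.0, 2.0], [3.0, 4.0]],
                mask=[[True, False], [True, True]])
axis = 0

the_sum = y.sum(axis=axis)
# getmaskarray() guards against the scalar nomask case
the_sum.mask = np.any(np.ma.getmaskarray(y), axis=axis)
print(the_sum)   # [-- --]: every column has at least one masked entry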
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Concepts for masked/missing data

2011-06-25 Thread Nathaniel Smith
On Sat, Jun 25, 2011 at 11:50 AM, Eric Firing  wrote:
> On 06/25/2011 07:05 AM, Nathaniel Smith wrote:
>> On Sat, Jun 25, 2011 at 9:26 AM, Matthew Brett  
>> wrote:
>>> To clarify, you're proposing for:
>>>
>>> a = np.sum(np.array([np.NA, np.NA]))
>>>
>>> 1) ->  np.NA
>>> 2) ->  0.0
>>
>> Yes -- and in R you actually do get NA, while in numpy.ma you
>> actually do get 0. I don't think this is a coincidence; I think it's
>
> No, you don't:
>
> In [2]: np.ma.array([2, 4], mask=[True, True]).sum()
> Out[2]: masked
>
> In [4]: np.sum(np.ma.array([2, 4], mask=[True, True]))
> Out[4]: masked

Huh. So in numpy.ma, sum([10, NA]) and sum([10]) are the same, but
sum([NA]) and sum([]) are different? Sounds to me like you should file
a bug on numpy.ma...

Anyway, the general point is that in R, NA's propagate, and in
numpy.ma, masked values are ignored (except, apparently, if all values
are masked). Here, I actually checked these:

Python: np.ma.array([2, 4], mask=[True, False]).sum() -> 4
R: sum(c(NA, 4)) -> NA
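
For reference, the numpy.ma half of that comparison as a runnable snippet
(the R result above is from an R session):

import numpy as np

s = np.ma.array([2, 4], mask=[True, False]).sum()
print(s)   # 4 -- the masked value is silently skipped, unlike R's NA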

-- Nathaniel
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Concepts for masked/missing data

2011-06-25 Thread Eric Firing
On 06/25/2011 07:05 AM, Nathaniel Smith wrote:
> On Sat, Jun 25, 2011 at 9:26 AM, Matthew Brett  
> wrote:
>> So far I see the difference between 1) and 2) being that you cannot
>> unmask.  So, if you didn't even know you could unmask data, then it
>> would not matter that 1) was being implemented by masks?
>
> I guess that is a difference, but I'm trying to get at something more
> fundamental -- not just what operations are allowed, but what
> operations people *expect* to be allowed. It seems like some of us
> have been talking past each other a lot, where someone says "but
> changing masks is the single most important feature!" and then someone
> else says "what are you talking about that doesn't even make sense".
>
>> To clarify, you're proposing for:
>>
>> a = np.sum(np.array([np.NA, np.NA]))
>>
>> 1) ->  np.NA
>> 2) ->  0.0
>
> Yes -- and in R you actually do get NA, while in numpy.ma you
> actually do get 0. I don't think this is a coincidence; I think it's

No, you don't:

In [2]: np.ma.array([2, 4], mask=[True, True]).sum()
Out[2]: masked

In [4]: np.sum(np.ma.array([2, 4], mask=[True, True]))
Out[4]: masked

Eric

> because they're designed as coherent systems that are trying to solve
> different problems. (Well, numpy.ma's "hardmask" idea seems inspired
> by the missing-data concept rather than the temporary-mask concept,
> but aside from that it seems pretty consistent in implementing option
> 2.)
>
> Here's another possible difference -- in (1), intuitively, missingness
> is a property of the data, so the logical place to put information
> about whether you can expect missing values is in the dtype, and to
> enable missing values you need to make a new array with a new dtype.
> (If we use a mask-based implementation, then
> np.asarray(nomissing_array, dtype=yesmissing_type) would still be able
> to skip making a copy of the data -- I'm talking ONLY about the
> interface here, not whether missing data has a different storage
> format from non-missing data.)
>
> In (2), the whole point is to use different masks with the same data,
> so I'd argue masking should be a property of the array object rather
> than the dtype, and the interface should logically allow masks to be
> created, modified, and destroyed in place.
>
> They're both internally consistent, but I think we might have to make
> a decision and stick to it.
>
>> I agree it's good to separate the API from the implementation.   I
>> think the implementation is also important because I care about memory
>> and possibly speed.  But, that is a separate problem from the API...
>
> Yes, absolutely memory and speed are important. But a really fast
> solution to the wrong problem isn't so useful either :-).
>
> -- Nathaniel
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Concepts for masked/missing data

2011-06-25 Thread Benjamin Root
On Sat, Jun 25, 2011 at 12:05 PM, Nathaniel Smith  wrote:

> On Sat, Jun 25, 2011 at 9:26 AM, Matthew Brett 
> wrote:
> > So far I see the difference between 1) and 2) being that you cannot
> > unmask.  So, if you didn't even know you could unmask data, then it
> > would not matter that 1) was being implemented by masks?
>
> I guess that is a difference, but I'm trying to get at something more
> fundamental -- not just what operations are allowed, but what
> operations people *expect* to be allowed.


That is quite a trickier problem.



>
> Here's another possible difference -- in (1), intuitively, missingness
> is a property of the data, so the logical place to put information
> about whether you can expect missing values is in the dtype, and to
> enable missing values you need to make a new array with a new dtype.
> (If we use a mask-based implementation, then
> np.asarray(nomissing_array, dtype=yesmissing_type) would still be able
> to skip making a copy of the data -- I'm talking ONLY about the
> interface here, not whether missing data has a different storage
> format from non-missing data.)
>
> In (2), the whole point is to use different masks with the same data,
> so I'd argue masking should be a property of the array object rather
> than the dtype, and the interface should logically allow masks to be
> created, modified, and destroyed in place.
>
>
I can agree with this distinction.  However, if "missingness" is an
intrinsic property of the data, then shouldn't users be implementing their
own dtype tailored to the data they are using?  In other words, how far does
the core of NumPy need to go to address this issue?  And how far would be
"too much"?



> They're both internally consistent, but I think we might have to make
> a decision and stick to it.
>
>
Of course.  I think that Mark has had a very inspired idea of giving the R
audience what they want (np.NA), while simultaneously making the use of
masked arrays even easier (which I can certainly appreciate).


> > I agree it's good to separate the API from the implementation.   I
> > think the implementation is also important because I care about memory
> > and possibly speed.  But, that is a separate problem from the API...
>
> Yes, absolutely memory and speed are important. But a really fast
> solution to the wrong problem isn't so useful either :-).
>
>
The one thing I have always loved about Python (and NumPy) is that "it
respects the developer's time".  I come from a C++ background where I found
C++ to be powerful, but tedious.  I went to Matlab because it was just
straight-up easier to code math and display graphs.  (If anybody here has
ever used GrADS, then you know how badly I wanted a language that respected
my time.)  However, even Matlab couldn't fully respect my time, as I usually
kept wasting it trying to get various pieces working.  Python came along,
and while it didn't always match the speed of some of my matlab programs, it
was "fast enough".

I will put out a little disclaimer.  I once had to use S+ for a class.  To
be honest, it was the worst programming experience in my life.  This
experience may be coloring my perception of R's approach to handling missing
data.

Ben Root
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Concepts for masked/missing data

2011-06-25 Thread Alan G Isaac
On 6/25/2011 2:06 PM, Benjamin Root wrote:
> Note that "np.sum([])" also returns 0.0.  I think the
> reason why it has been returning zero instead of NaN was
> because there wasn't a NaN-equivalent for integers.


http://en.wikipedia.org/wiki/Empty_sum

fwiw,
Alan Isaac
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Concepts for masked/missing data

2011-06-25 Thread Benjamin Root
On Sat, Jun 25, 2011 at 11:26 AM, Matthew Brett wrote:

> Hi,
>
> On Sat, Jun 25, 2011 at 5:05 PM, Nathaniel Smith  wrote:
> > So obviously there's a lot of interest in this question, but I'm
> > losing track of all the different issues that've being raised in the
> > 150-post thread of doom. I think I'll find this easier if we start by
> > putting aside the questions about implementation and such and focus
> > for now on the *conceptual model* that we want. Maybe I'm not the only
> > one?
> >
> > So as far as I can tell, there are three different ways of thinking
> > about masked/missing data that people have been using in the other
> > thread:
> >
> > 1) Missingness is part of the data. Some data is missing, some isn't,
> > this might change through computation on the data (just like some data
> > might change from a 3 to a 6 when we apply some transformation, NA |
> > True could be True, instead of NA), but we can't just "decide" that
> > some data is no longer missing. It makes no sense to ask what value is
> > "really" there underneath the missingness. And It's critical that we
> > keep track of this through all operations, because otherwise we may
> > silently give incorrect answers -- exactly like it's critical that we
> > keep track of the difference between 3 and 6.
>
> So far I see the difference between 1) and 2) being that you cannot
> unmask.  So, if you didn't even know you could unmask data, then it
> would not matter that 1) was being implemented by masks?
>
>
Yes, bingo, you hit it right on the nose.  Essentially, 1) could be
considered the "hard mask", while 2) would be the "soft mask".  Everything
else is implementation details.
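
For reference, numpy.ma already exposes both flavors -- a small runnable
sketch of the difference (hard_mask is an existing numpy.ma option):

import numpy as np

soft = np.ma.array([1.0, 2.0], mask=[True, False])
soft[0] = 9.0   # assigning to a soft-masked slot unmasks and overwrites
print(soft)     # [9.0 2.0]

hard = np.ma.array([1.0, 2.0], mask=[True, False], hard_mask=True)
hard[0] = 9.0   # assigning to a hard-masked slot is silently ignored
print(hard)     # [-- 2.0]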


> > 2) All the data exists, at least in some sense, but we don't always
> > want to look at all of it. We lay a mask over our data to view and
> > manipulate only parts of it at a time. We might want to use different
> > masks at different times, mutate the mask as we go, etc. The most
> > important thing is to provide convenient ways to do complex
> > manipulations -- preserve masks through indexing operations, overlay
> > the mask from one array on top of another array, etc. When it comes to
> > other sorts of operations then we'd rather just silently skip the
> > masked values -- we know there are values that are masked, that's the
> > whole point, to work with the unmasked subset of the data, so if sum
> > returned NA then that would just be a stupid hassle.
>
> To clarify, you're proposing for:
>
> a = np.sum(np.array([np.NA, np.NA]))
>
> 1) -> np.NA
> 2) -> 0.0
>
> ?
>

Actually, I have always considered this to be a bug.  Note that "np.sum([])"
also returns 0.0.  I think the reason why it has been returning zero instead
of NaN was because there wasn't a NaN-equivalent for integers.  This is
where I think a np.NA could best serve NumPy by providing a dtype-agnostic
way to represent missing or invalid data.
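
For reference, the two behaviors under discussion, as reported earlier in
the thread:

import numpy as np

print(np.sum([]))                                    # 0.0 -- the empty sum
print(np.ma.array([2, 4], mask=[True, True]).sum())  # masked (prints --), not 0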

Ben Root
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Concepts for masked/missing data

2011-06-25 Thread Wes McKinney
On Sat, Jun 25, 2011 at 1:05 PM, Nathaniel Smith  wrote:
> On Sat, Jun 25, 2011 at 9:26 AM, Matthew Brett  
> wrote:
>> So far I see the difference between 1) and 2) being that you cannot
>> unmask.  So, if you didn't even know you could unmask data, then it
>> would not matter that 1) was being implemented by masks?
>
> I guess that is a difference, but I'm trying to get at something more
> fundamental -- not just what operations are allowed, but what
> operations people *expect* to be allowed. It seems like some of us
> have been talking past each other a lot, where someone says "but
> changing masks is the single most important feature!" and then someone
> else says "what are you talking about that doesn't even make sense".
>
>> To clarify, you're proposing for:
>>
>> a = np.sum(np.array([np.NA, np.NA]))
>>
>> 1) -> np.NA
>> 2) -> 0.0
>
> Yes -- and in R you actually do get NA, while in numpy.ma you
> actually do get 0. I don't think this is a coincidence; I think it's
> because they're designed as coherent systems that are trying to solve
> different problems. (Well, numpy.ma's "hardmask" idea seems inspired
> by the missing-data concept rather than the temporary-mask concept,
> but aside from that it seems pretty consistent in implementing option
> 2.)

Agree. My basic observation about numpy.ma is that it's a finely
crafted solution for a different set of problems than the ones I have.
I just don't want the same thing to happen here so I'm stuck writing
code (like I am now) that looks like

mask = y.mask
the_sum = y.sum(axis)
the_count = mask.sum(axis)
the_sum[the_count == 0] = nan
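
For reference, a self-contained version of that workaround, assuming y is
a numpy.ma MaskedArray (where the mask is True for *missing* entries, so
valid observations are counted with ~mask):

import numpy as np

y = np.ma.array([[1.0, 2.0], [3.0, 4.0]],
                mask=[[True, False], [True, False]])
axis = 0

the_sum = y.sum(axis=axis).filled(np.nan)            # ma's skip-missing sum
the_count = (~np.ma.getmaskarray(y)).sum(axis=axis)  # valid obs per slice
the_sum[the_count == 0] = np.nan                     # all-missing -> NaN
print(the_sum)   # [nan 6.0]: column 0 is entirely masked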

> Here's another possible difference -- in (1), intuitively, missingness
> is a property of the data, so the logical place to put information
> about whether you can expect missing values is in the dtype, and to
> enable missing values you need to make a new array with a new dtype.
> (If we use a mask-based implementation, then
> np.asarray(nomissing_array, dtype=yesmissing_type) would still be able
> to skip making a copy of the data -- I'm talking ONLY about the
> interface here, not whether missing data has a different storage
> format from non-missing data.)
>
> In (2), the whole point is to use different masks with the same data,
> so I'd argue masking should be a property of the array object rather
> than the dtype, and the interface should logically allow masks to be
> created, modified, and destroyed in place.
>
> They're both internally consistent, but I think we might have to make
> a decision and stick to it.
>
>> I agree it's good to separate the API from the implementation.   I
>> think the implementation is also important because I care about memory
>> and possibly speed.  But, that is a separate problem from the API...
>
> Yes, absolutely memory and speed are important. But a really fast
> solution to the wrong problem isn't so useful either :-).
>
> -- Nathaniel
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Concepts for masked/missing data

2011-06-25 Thread Matthew Brett
Hi,

On Sat, Jun 25, 2011 at 6:05 PM, Nathaniel Smith  wrote:
> Yes, absolutely memory and speed are important. But a really fast
> solution to the wrong problem isn't so useful either :-).

Would you be happy with me summarizing your idea as

1) = NA logic / API
2) = mask logic / API

?

It might be possible to create (most of) an NA logic API from a mask
implementation, and in fact, I think that's more or less what is being
proposed.  It's possible to imagine exposing an apparently separate
API for mask operations, but which in fact uses the same
implementation.  Mark, Chuck - is that right?

If we choose a mask implementation, that will have consequences for
memory (predictable) and performance (unpredictable).

Is that a good summary?

See you,

Matthew
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Concepts for masked/missing data

2011-06-25 Thread Nathaniel Smith
On Sat, Jun 25, 2011 at 9:26 AM, Matthew Brett  wrote:
> So far I see the difference between 1) and 2) being that you cannot
> unmask.  So, if you didn't even know you could unmask data, then it
> would not matter that 1) was being implemented by masks?

I guess that is a difference, but I'm trying to get at something more
fundamental -- not just what operations are allowed, but what
operations people *expect* to be allowed. It seems like some of us
have been talking past each other a lot, where someone says "but
changing masks is the single most important feature!" and then someone
else says "what are you talking about that doesn't even make sense".

> To clarify, you're proposing for:
>
> a = np.sum(np.array([np.NA, np.NA]))
>
> 1) -> np.NA
> 2) -> 0.0

Yes -- and in R you actually do get NA, while in numpy.ma you
actually do get 0. I don't think this is a coincidence; I think it's
because they're designed as coherent systems that are trying to solve
different problems. (Well, numpy.ma's "hardmask" idea seems inspired
by the missing-data concept rather than the temporary-mask concept,
but aside from that it seems pretty consistent in implementing option
2.)

Here's another possible difference -- in (1), intuitively, missingness
is a property of the data, so the logical place to put information
about whether you can expect missing values is in the dtype, and to
enable missing values you need to make a new array with a new dtype.
(If we use a mask-based implementation, then
np.asarray(nomissing_array, dtype=yesmissing_type) would still be able
to skip making a copy of the data -- I'm talking ONLY about the
interface here, not whether missing data has a different storage
format from non-missing data.)

In (2), the whole point is to use different masks with the same data,
so I'd argue masking should be a property of the array object rather
than the dtype, and the interface should logically allow masks to be
created, modified, and destroyed in place.
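
For concreteness: interface (2) is roughly what numpy.ma offers today,
while interface (1) exists only as the hypothetical spelling used above.
A sketch:

import numpy as np

# Interface (2): the mask is mutable array state.
a = np.ma.array([1.0, 2.0, 3.0])
a.mask = [True, False, False]   # lay a mask over existing data, in place
print(a.sum())                  # 5.0 -- the masked 1.0 is skipped
a.mask = np.ma.nomask           # destroy the mask; the 1.0 reappears
print(a.sum())                  # 6.0

# Interface (1) would instead look something like
#   np.asarray(a, dtype=yesmissing_type)
# with assignment of np.NA discarding the old value for good
# (hypothetical -- no such dtype exists).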

They're both internally consistent, but I think we might have to make
a decision and stick to it.

> I agree it's good to separate the API from the implementation.   I
> think the implementation is also important because I care about memory
> and possibly speed.  But, that is a separate problem from the API...

Yes, absolutely memory and speed are important. But a really fast
solution to the wrong problem isn't so useful either :-).

-- Nathaniel
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Concepts for masked/missing data

2011-06-25 Thread Charles R Harris
On Sat, Jun 25, 2011 at 10:26 AM, Matthew Brett wrote:

> Hi,
>
> On Sat, Jun 25, 2011 at 5:05 PM, Nathaniel Smith  wrote:
> > So obviously there's a lot of interest in this question, but I'm
> > losing track of all the different issues that've been raised in the
> > 150-post thread of doom. I think I'll find this easier if we start by
> > putting aside the questions about implementation and such and focus
> > for now on the *conceptual model* that we want. Maybe I'm not the only
> > one?
> >
> > So as far as I can tell, there are three different ways of thinking
> > about masked/missing data that people have been using in the other
> > thread:
> >
> > 1) Missingness is part of the data. Some data is missing, some isn't,
> > this might change through computation on the data (just like some data
> > might change from a 3 to a 6 when we apply some transformation, NA |
> > True could be True, instead of NA), but we can't just "decide" that
> > some data is no longer missing. It makes no sense to ask what value is
> > "really" there underneath the missingness. And It's critical that we
> > keep track of this through all operations, because otherwise we may
> > silently give incorrect answers -- exactly like it's critical that we
> > keep track of the difference between 3 and 6.
>
> So far I see the difference between 1) and 2) being that you cannot
> unmask.  So, if you didn't even know you could unmask data, then it
> would not matter that 1) was being implemented by masks?
>
> > 2) All the data exists, at least in some sense, but we don't always
> > want to look at all of it. We lay a mask over our data to view and
> > manipulate only parts of it at a time. We might want to use different
> > masks at different times, mutate the mask as we go, etc. The most
> > important thing is to provide convenient ways to do complex
> > manipulations -- preserve masks through indexing operations, overlay
> > the mask from one array on top of another array, etc. When it comes to
> > other sorts of operations then we'd rather just silently skip the
> > masked values -- we know there are values that are masked, that's the
> > whole point, to work with the unmasked subset of the data, so if sum
> > returned NA then that would just be a stupid hassle.
>
> To clarify, you're proposing for:
>
> a = np.sum(np.array([np.NA, np.NA]))
>
> 1) -> np.NA
> 2) -> 0.0
>
> ?
>
> > But that's just my opinion. I'm wondering if we can get any consensus
> > on which of these we actually *want* (or maybe we want some fourth
> > option!), and *then* we can try to figure out the best way to get
> > there? Pretty much any implementation strategy we've talked about
> > could work for any of these, but hard to decide between them if we
> > don't even know what we're trying to do...
>
> I agree it's good to separate the API from the implementation.   I
> think the implementation is also important because I care about memory
> and possibly speed.  But, that is a separate problem from the API...
>
>
In a larger sense, we are seeking to add metadata to array elements and have
ufuncs that use that metadata together with the element values to compute
results. Off topic a bit, but it reminds me of the Burroughs 6600 that I
once used. The word size on that machine was 48 bits, so it could
accommodate both 6- and 8-bit characters, and 3 bits of metadata were
appended to mark the type. So there was a machine with 51-bit words ;) IIRC,
Knuth was involved in the design and helped with the OS, which was written
in ALGOL...

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Concepts for masked/missing data

2011-06-25 Thread Matthew Brett
Hi,

On Sat, Jun 25, 2011 at 5:05 PM, Nathaniel Smith  wrote:
> So obviously there's a lot of interest in this question, but I'm
> losing track of all the different issues that've been raised in the
> 150-post thread of doom. I think I'll find this easier if we start by
> putting aside the questions about implementation and such and focus
> for now on the *conceptual model* that we want. Maybe I'm not the only
> one?
>
> So as far as I can tell, there are three different ways of thinking
> about masked/missing data that people have been using in the other
> thread:
>
> 1) Missingness is part of the data. Some data is missing, some isn't,
> this might change through computation on the data (just like some data
> might change from a 3 to a 6 when we apply some transformation, NA |
> True could be True, instead of NA), but we can't just "decide" that
> some data is no longer missing. It makes no sense to ask what value is
> "really" there underneath the missingness. And It's critical that we
> keep track of this through all operations, because otherwise we may
> silently give incorrect answers -- exactly like it's critical that we
> keep track of the difference between 3 and 6.

So far I see the difference between 1) and 2) being that you cannot
unmask.  So, if you didn't even know you could unmask data, then it
would not matter that 1) was being implemented by masks?

> 2) All the data exists, at least in some sense, but we don't always
> want to look at all of it. We lay a mask over our data to view and
> manipulate only parts of it at a time. We might want to use different
> masks at different times, mutate the mask as we go, etc. The most
> important thing is to provide convenient ways to do complex
> manipulations -- preserve masks through indexing operations, overlay
> the mask from one array on top of another array, etc. When it comes to
> other sorts of operations then we'd rather just silently skip the
> masked values -- we know there are values that are masked, that's the
> whole point, to work with the unmasked subset of the data, so if sum
> returned NA then that would just be a stupid hassle.

To clarify, you're proposing for:

a = np.sum(np.array([np.NA, np.NA]))

1) -> np.NA
2) -> 0.0

?

> But that's just my opinion. I'm wondering if we can get any consensus
> on which of these we actually *want* (or maybe we want some fourth
> option!), and *then* we can try to figure out the best way to get
> there? Pretty much any implementation strategy we've talked about
> could work for any of these, but hard to decide between them if we
> don't even know what we're trying to do...

I agree it's good to separate the API from the implementation.   I
think the implementation is also important because I care about memory
and possibly speed.  But, that is a separate problem from the API...

Cheers,

Matthew
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Concepts for masked/missing data

2011-06-25 Thread Charles R Harris
On Sat, Jun 25, 2011 at 10:05 AM, Nathaniel Smith  wrote:

> So obviously there's a lot of interest in this question, but I'm
> losing track of all the different issues that've been raised in the
> 150-post thread of doom. I think I'll find this easier if we start by
> putting aside the questions about implementation and such and focus
> for now on the *conceptual model* that we want. Maybe I'm not the only
> one?
>
> So as far as I can tell, there are three different ways of thinking
> about masked/missing data that people have been using in the other
> thread:
>
> 1) Missingness is part of the data. Some data is missing, some isn't,
> this might change through computation on the data (just like some data
> might change from a 3 to a 6 when we apply some transformation, NA |
> True could be True, instead of NA), but we can't just "decide" that
> some data is no longer missing. It makes no sense to ask what value is
> "really" there underneath the missingness. And It's critical that we
> keep track of this through all operations, because otherwise we may
> silently give incorrect answers -- exactly like it's critical that we
> keep track of the difference between 3 and 6.
>
> 2) All the data exists, at least in some sense, but we don't always
> want to look at all of it. We lay a mask over our data to view and
> manipulate only parts of it at a time. We might want to use different
> masks at different times, mutate the mask as we go, etc. The most
> important thing is to provide convenient ways to do complex
> manipulations -- preserve masks through indexing operations, overlay
> the mask from one array on top of another array, etc. When it comes to
> other sorts of operations then we'd rather just silently skip the
> masked values -- we know there are values that are masked, that's the
> whole point, to work with the unmasked subset of the data, so if sum
> returned NA then that would just be a stupid hassle.
>
> 3) The "all things to all people" approach: implement every feature
> implied by either (1) or (2), and switch back and forth between these
> conceptual frameworks whenever necessary to make sense of the
> resulting code.
>
> The advantage of deciding up front what our model is is that it makes
> a lot of other questions easier. E.g., someone asked in the other
> thread whether, after setting an array element to NA, it would be
> possible to get back the original value. If we follow (1), the answer
> is obviously "no", if we follow (2), the answer is obviously "yes",
> and if we follow (3), the answer is obviously "yes, probably, well,
> maybe you better check the docs?".
>
> My personal opinions on these are:
> (1): This is a real problem I face, and there isn't any good solution
> now. Support for this in numpy would be awesome.
> (2): This feels more like a convenience feature to me; we already have
> lots of ways to work with subsets of data. I probably wouldn't bother
> using it, but that's fine -- I don't use np.matrix either, but some
> people like it.
> (3): Well, it's a bit of a mess, but I guess it might be better than
> nothing?
>
> But that's just my opinion. I'm wondering if we can get any consensus
> on which of these we actually *want* (or maybe we want some fourth
> option!), and *then* we can try to figure out the best way to get
> there? Pretty much any implementation strategy we've talked about
> could work for any of these, but hard to decide between them if we
> don't even know what we're trying to do...
>
>
I go for 3 ;) And I think that is where we are heading. By default, masked
array operations look like 1), but by taking views one can get 2). I think
the crucial aspect here is the use of views, which both saves on storage and
fits with the current numpy concept of views.
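
For reference, a small runnable sketch of that idea -- two masked views
sharing one data buffer, each with its own mask (numpy.ma wraps without
copying by default):

import numpy as np

data = np.arange(5.0)
m1 = np.ma.array(data, mask=[True, False, False, False, False], copy=False)
m2 = np.ma.array(data, mask=[False, False, True, True, False], copy=False)

data[4] = 99.0             # both views see the change to the shared buffer
print(m1[4], m2[4])        # 99.0 99.0
print(m1.sum(), m2.sum())  # 105.0 vs 100.0 -- same data, different masks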

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion