Re: [Numpy-discussion] Concepts for masked/missing data
I haven't commented yet on the mailing list because of time pressures, although I have spoken to Mark as often as I can --- and have encouraged him to pursue his ideas and discuss them with the community. The Numeric Python discussion list has a long history of great dialogue to try and bring out as many perspectives as possible as we wrestle with improving the code base. It is very encouraging to see that tradition continuing. Because Enthought was mentioned earlier in this thread, I would like to try and clarify a few things about my employer and the company's interest.

Enthought has been very interested in the development of the NumPy and SciPy stack (and the broader SciPy community) for some time. With its limited resources, Enthought helped significantly to form the SciPy community and continues to sponsor it as much as it can. Many developers who work at Enthought (including me) have personal interests in the NumPy / SciPy community and codebase that go beyond Enthought's ability to invest directly as well. While Enthought has limited resources to invest directly in pursuing the goal, Enthought is very interested in improving Python's use as a data analysis environment.

Because of that interest, Enthought sponsored a data-array summit in May. There is an inScight podcast that summarizes some of the event, which you can listen to at http://inscight.org/2011/05/18/episode_13/. The purpose of this event was to bring a few people together who have been working on different aspects of the problem (particularly around the labelled array, or data array, problem). We also wanted to jump-start the activity of our interns and make sure that some of the use cases we have seen during the past several years while working on client projects had some light shed on them. The event was successful in that it generated *a lot* of ideas.
Some of these ideas were summarized in notes that are linked to at this Convore thread: https://convore.com/python-scientific-computing/data-array-in-numpy/

One of the major ideas that emerged during the discussion is that NumPy needs to be able to handle missing data in a more integrated way (i.e. there need to be functions that do the right thing in the face of missing data). One approach suggested during the discussion was to handle missing data by introducing special NA dtypes.

Mark is one of two interns that we have this summer who are tasked at a high level with taking what was learned at the summit and implementing critical pieces as their skills and interests allow. I have been talking with them individually to map out specific work targets for the summer. Christopher Jordan-Squires is one of our interns who is studying to get a PhD in Mathematics at the Univ. of Washington. He has a strong interest in statistics and a desire to make Python as easy to use as R for certain statistical workflows. Mark Wiebe is known on this list because of his recent success at working on the NumPy code base. As a result of that success, Mark is working on making improvements to NumPy that are seen as most critical to solving some of the same problems we keep seeing in our projects (labeled arrays being one of them).

We are also very interested in the Pandas project, as it brings a data structure like the successful DataFrame in R to the Python space (and it helps solve some of the problems our clients are seeing). It would be good to make sure that core functionality that Pandas needs is available in NumPy where appropriate. The date-time work that Mark did was the first low-hanging fruit that needed to be finished. The second project that Mark is involved with is creating an approach for missing data in NumPy.
I suggested the missing-data dtypes (in part because Mark had expressed some concerns about the way dtypes are handled in NumPy, and I would love for that mechanism for user-defined data-types and the whole data-type infrastructure to be improved as needed). Mark spent some time thinking about it and felt more comfortable with the masked array solution, and that is where we are now.

Enthought's main interest remains in seeing how much of the data array can and should be moved into low-level NumPy, as well as the implementation of functionality (wherever it may live) to make data analysis easier and more productive in Python. Again, though, this is something Enthought as a company can only invest limited resources in, and we want to make sure that Mark spends the time that we are sponsoring doing work that is seen as valuable by the community but, more importantly, matching our own internal needs.

I will post a follow-on message that provides my current views on the subject of missing data and masked arrays.

-Travis
Re: [Numpy-discussion] Concepts for masked/missing data
On Sat, Jun 25, 2011 at 10:05 AM, Nathaniel Smith n...@pobox.com wrote:

So obviously there's a lot of interest in this question, but I'm losing track of all the different issues that've been raised in the 150-post thread of doom. I think I'll find this easier if we start by putting aside the questions about implementation and such and focus for now on the *conceptual model* that we want. Maybe I'm not the only one?

So as far as I can tell, there are three different ways of thinking about masked/missing data that people have been using in the other thread:

1) Missingness is part of the data. Some data is missing, some isn't, and this might change through computation on the data (just like some data might change from a 3 to a 6 when we apply some transformation; NA | True could be True, instead of NA), but we can't just decide that some data is no longer missing. It makes no sense to ask what value is really there underneath the missingness. And it's critical that we keep track of this through all operations, because otherwise we may silently give incorrect answers -- exactly like it's critical that we keep track of the difference between 3 and 6.

2) All the data exists, at least in some sense, but we don't always want to look at all of it. We lay a mask over our data to view and manipulate only parts of it at a time. We might want to use different masks at different times, mutate the mask as we go, etc. The most important thing is to provide convenient ways to do complex manipulations -- preserve masks through indexing operations, overlay the mask from one array on top of another array, etc. When it comes to other sorts of operations, we'd rather just silently skip the masked values -- we know there are values that are masked, that's the whole point, to work with the unmasked subset of the data, so if sum returned NA then that would just be a stupid hassle.
3) The all-things-to-all-people approach: implement every feature implied by either (1) or (2), and switch back and forth between these conceptual frameworks whenever necessary to make sense of the resulting code.

The advantage of deciding up front what our model is is that it makes a lot of other questions easier. E.g., someone asked in the other thread whether, after setting an array element to NA, it would be possible to get back the original value. If we follow (1), the answer is obviously no; if we follow (2), the answer is obviously yes; and if we follow (3), the answer is "obviously yes, probably, well, maybe you better check the docs?"

My personal opinions on these are:
(1): This is a real problem I face, and there isn't any good solution now. Support for this in numpy would be awesome.
(2): This feels more like a convenience feature to me; we already have lots of ways to work with subsets of data. I probably wouldn't bother using it, but that's fine -- I don't use np.matrix either, but some people like it.
(3): Well, it's a bit of a mess, but I guess it might be better than nothing?

But that's just my opinion. I'm wondering if we can get any consensus on which of these we actually *want* (or maybe we want some fourth option!), and *then* we can try to figure out the best way to get there? Pretty much any implementation strategy we've talked about could work for any of these, but it's hard to decide between them if we don't even know what we're trying to do...

I go for 3 ;) And I think that is where we are heading. By default, masked array operations look like 1), but by taking views one can get 2). I think the crucial aspect here is the use of views, which both saves on storage and fits with the current numpy concept of views.

Chuck

___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
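The "NA | True could be True" remark in model (1) is three-valued (Kleene) logic, which is what R uses for logical NA. A minimal, hypothetical sketch (np.NA is not a real NumPy object; `NA`, `kleene_or`, and `kleene_and` here are illustrative names, not any proposed API):

```python
# Hypothetical sketch of three-valued (Kleene) logic: the result is NA
# only when the unknown value could actually change the answer.
NA = object()  # stand-in for a missing boolean

def kleene_or(a, b):
    if a is True or b is True:
        return True   # True | anything is True, even if the other side is NA
    if a is NA or b is NA:
        return NA     # otherwise missingness propagates
    return a or b     # both operands known, neither True

def kleene_and(a, b):
    if a is False or b is False:
        return False  # False & anything is False, even against NA
    if a is NA or b is NA:
        return NA
    return a and b
```

This is why model (1) can return a definite answer from an operation involving missing data without ever "unmasking" anything.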
Re: [Numpy-discussion] Concepts for masked/missing data
Hi,

On Sat, Jun 25, 2011 at 5:05 PM, Nathaniel Smith n...@pobox.com wrote:

So obviously there's a lot of interest in this question, but I'm losing track of all the different issues that've been raised in the 150-post thread of doom. I think I'll find this easier if we start by putting aside the questions about implementation and such and focus for now on the *conceptual model* that we want. Maybe I'm not the only one?

So as far as I can tell, there are three different ways of thinking about masked/missing data that people have been using in the other thread:

1) Missingness is part of the data. Some data is missing, some isn't, and this might change through computation on the data (just like some data might change from a 3 to a 6 when we apply some transformation; NA | True could be True, instead of NA), but we can't just decide that some data is no longer missing. It makes no sense to ask what value is really there underneath the missingness. And it's critical that we keep track of this through all operations, because otherwise we may silently give incorrect answers -- exactly like it's critical that we keep track of the difference between 3 and 6.

So far I see the difference between 1) and 2) being that you cannot unmask. So, if you didn't even know you could unmask data, then it would not matter that 1) was being implemented by masks?

2) All the data exists, at least in some sense, but we don't always want to look at all of it. We lay a mask over our data to view and manipulate only parts of it at a time. We might want to use different masks at different times, mutate the mask as we go, etc. The most important thing is to provide convenient ways to do complex manipulations -- preserve masks through indexing operations, overlay the mask from one array on top of another array, etc.
When it comes to other sorts of operations, we'd rather just silently skip the masked values -- we know there are values that are masked, that's the whole point, to work with the unmasked subset of the data, so if sum returned NA then that would just be a stupid hassle.

To clarify, you're proposing that for:

a = np.sum(np.array([np.NA, np.NA]))

1) -> np.NA
2) -> 0.0

?

But that's just my opinion. I'm wondering if we can get any consensus on which of these we actually *want* (or maybe we want some fourth option!), and *then* we can try to figure out the best way to get there? Pretty much any implementation strategy we've talked about could work for any of these, but it's hard to decide between them if we don't even know what we're trying to do...

I agree it's good to separate the API from the implementation. I think the implementation is also important because I care about memory and possibly speed. But, that is a separate problem from the API...

Cheers,

Matthew
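For concreteness, the two proposed answers can be checked against what exists today, substituting np.nan for the proposed np.NA (which NumPy does not provide):

```python
import numpy as np

# Behavior (2), numpy.ma: masked values are skipped; a fully masked sum
# yields the special `masked` constant rather than a number.
fully_masked = np.ma.array([1.0, 2.0], mask=[True, True])
print(fully_masked.sum())           # the `masked` constant

partly_masked = np.ma.array([1.0, 2.0], mask=[True, False])
print(partly_masked.sum())          # 2.0 -- the masked entry is skipped

# Behavior (1), approximated with NaN for floats: missingness propagates.
print(np.array([np.nan, 2.0]).sum())  # nan
```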
Re: [Numpy-discussion] Concepts for masked/missing data
On Sat, Jun 25, 2011 at 10:26 AM, Matthew Brett matthew.br...@gmail.com wrote:

Hi,

On Sat, Jun 25, 2011 at 5:05 PM, Nathaniel Smith n...@pobox.com wrote:

So obviously there's a lot of interest in this question, but I'm losing track of all the different issues that've been raised in the 150-post thread of doom. I think I'll find this easier if we start by putting aside the questions about implementation and such and focus for now on the *conceptual model* that we want. Maybe I'm not the only one?

So as far as I can tell, there are three different ways of thinking about masked/missing data that people have been using in the other thread:

1) Missingness is part of the data. Some data is missing, some isn't, and this might change through computation on the data (just like some data might change from a 3 to a 6 when we apply some transformation; NA | True could be True, instead of NA), but we can't just decide that some data is no longer missing. It makes no sense to ask what value is really there underneath the missingness. And it's critical that we keep track of this through all operations, because otherwise we may silently give incorrect answers -- exactly like it's critical that we keep track of the difference between 3 and 6.

So far I see the difference between 1) and 2) being that you cannot unmask. So, if you didn't even know you could unmask data, then it would not matter that 1) was being implemented by masks?

2) All the data exists, at least in some sense, but we don't always want to look at all of it. We lay a mask over our data to view and manipulate only parts of it at a time. We might want to use different masks at different times, mutate the mask as we go, etc. The most important thing is to provide convenient ways to do complex manipulations -- preserve masks through indexing operations, overlay the mask from one array on top of another array, etc.
When it comes to other sorts of operations, we'd rather just silently skip the masked values -- we know there are values that are masked, that's the whole point, to work with the unmasked subset of the data, so if sum returned NA then that would just be a stupid hassle.

To clarify, you're proposing that for:

a = np.sum(np.array([np.NA, np.NA]))

1) -> np.NA
2) -> 0.0

?

But that's just my opinion. I'm wondering if we can get any consensus on which of these we actually *want* (or maybe we want some fourth option!), and *then* we can try to figure out the best way to get there? Pretty much any implementation strategy we've talked about could work for any of these, but it's hard to decide between them if we don't even know what we're trying to do...

I agree it's good to separate the API from the implementation. I think the implementation is also important because I care about memory and possibly speed. But, that is a separate problem from the API...

In a larger sense, we are seeking to add metadata to array elements and have ufuncs that use that metadata together with the element values to compute results. Off topic a bit, but it reminds me of the Burroughs 6600 that I once used. The word size on that machine was 48 bits, so it could accommodate both 6- and 8-bit characters, and 3 bits of metadata were appended to mark the type. So there was a machine with 51-bit words ;) IIRC, Knuth was involved in the design and helped with the OS, which was written in ALGOL...

Chuck
Re: [Numpy-discussion] Concepts for masked/missing data
On Sat, Jun 25, 2011 at 9:26 AM, Matthew Brett matthew.br...@gmail.com wrote:

So far I see the difference between 1) and 2) being that you cannot unmask. So, if you didn't even know you could unmask data, then it would not matter that 1) was being implemented by masks?

I guess that is a difference, but I'm trying to get at something more fundamental -- not just what operations are allowed, but what operations people *expect* to be allowed. It seems like some of us have been talking past each other a lot, where someone says "but changing masks is the single most important feature!" and then someone else says "what are you talking about, that doesn't even make sense".

To clarify, you're proposing that for:

a = np.sum(np.array([np.NA, np.NA]))

1) -> np.NA
2) -> 0.0

Yes -- and in R you actually do get NA, while in numpy.ma you actually do get 0. I don't think this is a coincidence; I think it's because they're designed as coherent systems that are trying to solve different problems. (Well, numpy.ma's hardmask idea seems inspired by the missing-data concept rather than the temporary-mask concept, but aside from that it seems pretty consistent in implementing option 2.)

Here's another possible difference -- in (1), intuitively, missingness is a property of the data, so the logical place to put information about whether you can expect missing values is in the dtype, and to enable missing values you need to make a new array with a new dtype. (If we use a mask-based implementation, then np.asarray(nomissing_array, dtype=yesmissing_type) would still be able to skip making a copy of the data -- I'm talking ONLY about the interface here, not whether missing data has a different storage format from non-missing data.) In (2), the whole point is to use different masks with the same data, so I'd argue masking should be a property of the array object rather than the dtype, and the interface should logically allow masks to be created, modified, and destroyed in place.
They're both internally consistent, but I think we might have to make a decision and stick to it.

I agree it's good to separate the API from the implementation. I think the implementation is also important because I care about memory and possibly speed. But, that is a separate problem from the API...

Yes, absolutely, memory and speed are important. But a really fast solution to the wrong problem isn't so useful either :-).

-- Nathaniel
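The two interfaces Nathaniel contrasts can be illustrated with what NumPy already has: NaN in a float dtype approximates "missingness lives in the dtype", while numpy.ma puts the mask on the array object. This is only an analogy for the proposed designs (np.NA and NA dtypes did not exist in NumPy):

```python
import numpy as np

# Interface (1) analogue: whether "missing" values can occur is a property
# of the element type -- float64 admits NaN, and it travels with the data.
a = np.array([1.0, np.nan, 3.0])
print(a.dtype)  # float64

# Interface (2) analogue: numpy.ma attaches a mask to the array object;
# the mask can be modified in place without touching the underlying values.
m = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False])
m.mask[1] = False   # unmask in place -- the original value was still there
print(m.sum())      # 6.0
```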
Re: [Numpy-discussion] Concepts for masked/missing data
On Sat, Jun 25, 2011 at 1:05 PM, Nathaniel Smith n...@pobox.com wrote:

On Sat, Jun 25, 2011 at 9:26 AM, Matthew Brett matthew.br...@gmail.com wrote:

So far I see the difference between 1) and 2) being that you cannot unmask. So, if you didn't even know you could unmask data, then it would not matter that 1) was being implemented by masks?

I guess that is a difference, but I'm trying to get at something more fundamental -- not just what operations are allowed, but what operations people *expect* to be allowed. It seems like some of us have been talking past each other a lot, where someone says "but changing masks is the single most important feature!" and then someone else says "what are you talking about, that doesn't even make sense".

To clarify, you're proposing that for:

a = np.sum(np.array([np.NA, np.NA]))

1) -> np.NA
2) -> 0.0

Yes -- and in R you actually do get NA, while in numpy.ma you actually do get 0. I don't think this is a coincidence; I think it's because they're designed as coherent systems that are trying to solve different problems. (Well, numpy.ma's hardmask idea seems inspired by the missing-data concept rather than the temporary-mask concept, but aside from that it seems pretty consistent in implementing option 2.)

Agree. My basic observation about numpy.ma is that it's a finely crafted solution for a different set of problems than the ones I have. I just don't want the same thing to happen here, so I'm stuck writing code (like I am now) that looks like:

mask = y.mask
the_sum = y.sum(axis)
the_count = mask.sum(axis)
the_sum[the_count == 0] = nan

Here's another possible difference -- in (1), intuitively, missingness is a property of the data, so the logical place to put information about whether you can expect missing values is in the dtype, and to enable missing values you need to make a new array with a new dtype.
(If we use a mask-based implementation, then np.asarray(nomissing_array, dtype=yesmissing_type) would still be able to skip making a copy of the data -- I'm talking ONLY about the interface here, not whether missing data has a different storage format from non-missing data.) In (2), the whole point is to use different masks with the same data, so I'd argue masking should be a property of the array object rather than the dtype, and the interface should logically allow masks to be created, modified, and destroyed in place.

They're both internally consistent, but I think we might have to make a decision and stick to it.

I agree it's good to separate the API from the implementation. I think the implementation is also important because I care about memory and possibly speed. But, that is a separate problem from the API...

Yes, absolutely, memory and speed are important. But a really fast solution to the wrong problem isn't so useful either :-).

-- Nathaniel
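The workaround quoted above is shorthand; a runnable sketch of the same intent with numpy.ma's own conventions (not the actual pandas code, and assuming the goal is "rows with no valid data become NaN, like R") can be much shorter:

```python
import numpy as np

# Sum along an axis, then turn any all-masked result into NaN, so that
# "no valid data" is distinguishable from "sums to zero".
y = np.ma.array([[1.0, 2.0], [3.0, 4.0]],
                mask=[[True, True], [False, True]])  # row 0 fully masked

the_sum = y.sum(axis=1).filled(np.nan)  # `masked` results -> NaN
print(the_sum)  # row 0 -> nan, row 1 -> 3.0
```

The `.filled(np.nan)` step is exactly the kind of boilerplate the thread is arguing an NA-aware sum would make unnecessary.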
Re: [Numpy-discussion] Concepts for masked/missing data
On Sat, Jun 25, 2011 at 11:26 AM, Matthew Brett matthew.br...@gmail.com wrote:

Hi,

On Sat, Jun 25, 2011 at 5:05 PM, Nathaniel Smith n...@pobox.com wrote:

So obviously there's a lot of interest in this question, but I'm losing track of all the different issues that've been raised in the 150-post thread of doom. I think I'll find this easier if we start by putting aside the questions about implementation and such and focus for now on the *conceptual model* that we want. Maybe I'm not the only one?

So as far as I can tell, there are three different ways of thinking about masked/missing data that people have been using in the other thread:

1) Missingness is part of the data. Some data is missing, some isn't, and this might change through computation on the data (just like some data might change from a 3 to a 6 when we apply some transformation; NA | True could be True, instead of NA), but we can't just decide that some data is no longer missing. It makes no sense to ask what value is really there underneath the missingness. And it's critical that we keep track of this through all operations, because otherwise we may silently give incorrect answers -- exactly like it's critical that we keep track of the difference between 3 and 6.

So far I see the difference between 1) and 2) being that you cannot unmask. So, if you didn't even know you could unmask data, then it would not matter that 1) was being implemented by masks?

Yes, bingo, you hit it right on the nose. Essentially, 1) could be considered the hard mask, while 2) would be the soft mask. Everything else is implementation details.

2) All the data exists, at least in some sense, but we don't always want to look at all of it. We lay a mask over our data to view and manipulate only parts of it at a time. We might want to use different masks at different times, mutate the mask as we go, etc.
The most important thing is to provide convenient ways to do complex manipulations -- preserve masks through indexing operations, overlay the mask from one array on top of another array, etc. When it comes to other sorts of operations, we'd rather just silently skip the masked values -- we know there are values that are masked, that's the whole point, to work with the unmasked subset of the data, so if sum returned NA then that would just be a stupid hassle.

To clarify, you're proposing that for:

a = np.sum(np.array([np.NA, np.NA]))

1) -> np.NA
2) -> 0.0

?

Actually, I have always considered this to be a bug. Note that np.sum([]) also returns 0.0. I think the reason why it has been returning zero instead of NaN was because there wasn't a NaN-equivalent for integers. This is where I think an np.NA could best serve NumPy, by providing a dtype-agnostic way to represent missing or invalid data.

Ben Root
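The empty-sum convention Ben refers to is deliberate rather than accidental: 0 is the additive identity, so a sum over no elements is 0 regardless of dtype. It is easy to check:

```python
import numpy as np

# The empty sum is 0 by convention, for any numeric dtype.
print(np.sum([]))                        # 0.0
print(np.sum(np.array([], dtype=int)))   # 0
print(np.sum(np.zeros((0, 3)), axis=0))  # three zeros, one per column
```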
Re: [Numpy-discussion] Concepts for masked/missing data
On 6/25/2011 2:06 PM, Benjamin Root wrote:

Note that np.sum([]) also returns 0.0. I think the reason why it has been returning zero instead of NaN was because there wasn't a NaN-equivalent for integers.

http://en.wikipedia.org/wiki/Empty_sum

fwiw,
Alan Isaac
Re: [Numpy-discussion] Concepts for masked/missing data
On Sat, Jun 25, 2011 at 12:05 PM, Nathaniel Smith n...@pobox.com wrote:

On Sat, Jun 25, 2011 at 9:26 AM, Matthew Brett matthew.br...@gmail.com wrote:

So far I see the difference between 1) and 2) being that you cannot unmask. So, if you didn't even know you could unmask data, then it would not matter that 1) was being implemented by masks?

I guess that is a difference, but I'm trying to get at something more fundamental -- not just what operations are allowed, but what operations people *expect* to be allowed.

That is quite a trickier problem.

Here's another possible difference -- in (1), intuitively, missingness is a property of the data, so the logical place to put information about whether you can expect missing values is in the dtype, and to enable missing values you need to make a new array with a new dtype. (If we use a mask-based implementation, then np.asarray(nomissing_array, dtype=yesmissing_type) would still be able to skip making a copy of the data -- I'm talking ONLY about the interface here, not whether missing data has a different storage format from non-missing data.) In (2), the whole point is to use different masks with the same data, so I'd argue masking should be a property of the array object rather than the dtype, and the interface should logically allow masks to be created, modified, and destroyed in place.

I can agree with this distinction. However, if missingness is an intrinsic property of the data, then shouldn't users be implementing their own dtype tailored to the data they are using? In other words, how far does the core of NumPy need to go to address this issue? And how far would be too much?

They're both internally consistent, but I think we might have to make a decision and stick to it.

Of course. I think that Mark is having a very inspired idea of giving the R audience what they want (np.NA), while simultaneously making the use of masked arrays even easier (which I can certainly appreciate).
I agree it's good to separate the API from the implementation. I think the implementation is also important because I care about memory and possibly speed. But, that is a separate problem from the API...

Yes, absolutely, memory and speed are important. But a really fast solution to the wrong problem isn't so useful either :-).

The one thing I have always loved about Python (and NumPy) is that it respects the developer's time. I come from a C++ background where I found C++ to be powerful, but tedious. I went to Matlab because it was just straight-up easier to code math and display graphs. (If anybody here ever used GrADS, then you know how badly I would want a language that respected my time.) However, even Matlab couldn't fully respect my time, as I usually kept wasting it trying to get various pieces working. Python came along, and while it didn't always match the speed of some of my Matlab programs, it was fast enough.

I will put out a little disclaimer. I once had to use S+ for a class. To be honest, it was the worst programming experience in my life. This experience may be coloring my perception of R's approach to handling missing data.

Ben Root
Re: [Numpy-discussion] Concepts for masked/missing data
On Sat, Jun 25, 2011 at 11:50 AM, Eric Firing efir...@hawaii.edu wrote:

On 06/25/2011 07:05 AM, Nathaniel Smith wrote:

On Sat, Jun 25, 2011 at 9:26 AM, Matthew Brett matthew.br...@gmail.com wrote:

To clarify, you're proposing that for:

a = np.sum(np.array([np.NA, np.NA]))

1) -> np.NA
2) -> 0.0

Yes -- and in R you actually do get NA, while in numpy.ma you actually do get 0. I don't think this is a coincidence; I think it's

No, you don't:

In [2]: np.ma.array([2, 4], mask=[True, True]).sum()
Out[2]: masked

In [4]: np.sum(np.ma.array([2, 4], mask=[True, True]))
Out[4]: masked

Huh. So in numpy.ma, sum([10, NA]) and sum([10]) are the same, but sum([NA]) and sum([]) are different? Sounds to me like you should file a bug on numpy.ma...

Anyway, the general point is that in R, NAs propagate, and in numpy.ma, masked values are ignored (except, apparently, if all values are masked). Here, I actually checked these:

Python: np.ma.array([2, 4], mask=[True, False]).sum() -> 4
R: sum(c(NA, 4)) -> NA

-- Nathaniel
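The Python half of Nathaniel's comparison is easy to re-run (the R result is as quoted from his message). For float data, np.nansum gives the skip-missing result, analogous to R's sum(..., na.rm=TRUE):

```python
import numpy as np

# numpy.ma ignores masked values in reductions:
print(np.ma.array([2, 4], mask=[True, False]).sum())  # 4

# R-style propagation, approximated with NaN for float data:
print(np.array([np.nan, 4.0]).sum())                  # nan

# Skipping missing values explicitly, like R's sum(..., na.rm=TRUE):
print(np.nansum(np.array([np.nan, 4.0])))             # 4.0
```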
Re: [Numpy-discussion] Concepts for masked/missing data
On Sat, Jun 25, 2011 at 11:32 AM, Benjamin Root ben.r...@ou.edu wrote:

On Sat, Jun 25, 2011 at 12:05 PM, Nathaniel Smith n...@pobox.com wrote:

I guess that is a difference, but I'm trying to get at something more fundamental -- not just what operations are allowed, but what operations people *expect* to be allowed.

That is quite a trickier problem.

It can be. I think of it as the difference between design and coding. They overlap less than one might expect...

Here's another possible difference -- in (1), intuitively, missingness is a property of the data, so the logical place to put information about whether you can expect missing values is in the dtype, and to enable missing values you need to make a new array with a new dtype. (If we use a mask-based implementation, then np.asarray(nomissing_array, dtype=yesmissing_type) would still be able to skip making a copy of the data -- I'm talking ONLY about the interface here, not whether missing data has a different storage format from non-missing data.) In (2), the whole point is to use different masks with the same data, so I'd argue masking should be a property of the array object rather than the dtype, and the interface should logically allow masks to be created, modified, and destroyed in place.

I can agree with this distinction. However, if missingness is an intrinsic property of the data, then shouldn't users be implementing their own dtype tailored to the data they are using? In other words, how far does the core of NumPy need to go to address this issue? And how far would be too much?

Yes, that's exactly my question: whether our goal is to implement missingness in numpy or not!

They're both internally consistent, but I think we might have to make a decision and stick to it.

Of course. I think that Mark is having a very inspired idea of giving the R audience what they want (np.NA), while simultaneously making the use of masked arrays even easier (which I can certainly appreciate).

I don't know.
I think we could build a really top-notch implementation of missingness. I also think we could build a really top-notch implementation of masking. But my suggestions for how to improve the current design are totally different depending on which of those is the goal, and neither the R audience (like me) nor the masked array audience (like you) seems really happy with the current design. And I don't know what the goal is -- maybe it's something else and the current design hits it perfectly? Maybe we want a top-notch implementation of *both* missingness and masking, and those should be two different things that can be combined, so that some of the unmasked values inside a masked array can be NA? I don't know.

I will put out a little disclaimer. I once had to use S+ for a class. To be honest, it was the worst programming experience in my life. This experience may be coloring my perception of R's approach to handling missing data.

There's a lot of things that R does wrong (not their fault; language design is an extremely difficult and specialized skill that statisticians are not exactly trained in), but it did make a few excellent choices at the beginning. One was to steal the execution model from Scheme, which, uh, isn't really relevant here. The other was to steal the basic data types and standard library that the Bell Labs statisticians had pounded into shape over many years. I use Python now because using R for everything would drive me crazy, but despite its many flaws, it still does some things so well that it's become *the* language used for basically all statistical research. I'm only talking about stealing those things :-).

-- Nathaniel
Re: [Numpy-discussion] Concepts for masked/missing data
On 06/25/2011 09:09 AM, Benjamin Root wrote:
> On Sat, Jun 25, 2011 at 1:57 PM, Nathaniel Smith n...@pobox.com wrote:
>> On Sat, Jun 25, 2011 at 11:50 AM, Eric Firing efir...@hawaii.edu wrote:
>>> On 06/25/2011 07:05 AM, Nathaniel Smith wrote:
>>>> On Sat, Jun 25, 2011 at 9:26 AM, Matthew Brett matthew.br...@gmail.com wrote:
>>>>> To clarify, you're proposing for:
>>>>>   a = np.sum(np.array([np.NA, np.NA]))
>>>>> 1) -> np.NA
>>>>> 2) -> 0.0
>>>>
>>>> Yes -- and in R you actually do get NA, while in numpy.ma you actually do get 0. I don't think this is a coincidence; I think it's
>>>
>>> No, you don't:
>>>
>>>   In [2]: np.ma.array([2, 4], mask=[True, True]).sum()
>>>   Out[2]: masked
>>>
>>>   In [4]: np.sum(np.ma.array([2, 4], mask=[True, True]))
>>>   Out[4]: masked
>>
>> Huh. So in numpy.ma, sum([10, NA]) and sum([10]) are the same, but sum([NA]) and sum([]) are different? Sounds to me like you should file a bug on numpy.ma...
>
> Actually, no... I should have tested this before replying earlier:
>
>   >>> a = np.ma.array([2, 4], mask=[True, True])
>   >>> a
>   masked_array(data = [-- --],
>                mask = [ True  True],
>          fill_value = 99)
>   >>> a.sum()
>   masked
>   >>> a = np.ma.array([], mask=[])
>   >>> a
>   masked_array(data = [],
>                mask = [],
>          fill_value = 1e+20)
>   >>> a.sum()
>   masked
>
> They are the same. Anyway, the general point is that in R, NA's propagate, and in numpy.ma, masked values are ignored (except, apparently, if all values are masked). Here, I actually checked these:
>
>   Python: np.ma.array([2, 4], mask=[True, False]).sum() -> 4
>   R:      sum(c(NA, 4)) -> NA

If you want NaN behavior, then use NaNs. If you want masked behavior, then use masks. But I think that where Mark is heading is towards infrastructure that makes it easy and efficient to do either, as needed, case by case, line by line, for any dtype -- not just floats. If he can succeed, that helps all of us. This doesn't have to be R versus masked arrays, or beginners versus experienced programmers.
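The ignore-versus-propagate behaviors checked above can be reproduced directly in a short script (a sketch; the R comparison appears only in comments, and NaN floats stand in for R-style propagating NA):

```python
import numpy as np

# numpy.ma *ignores* masked values in reductions:
a = np.ma.array([2, 4], mask=[True, False])
print(a.sum())   # 4

# ...except when *all* values are masked, where the result is the
# special `masked` singleton rather than the sum identity 0:
b = np.ma.array([2, 4], mask=[True, True])
print(np.ma.is_masked(b.sum()))   # True

# R-style *propagation*, where any missing value poisons the result,
# is what NaN arithmetic already does for floats:
c = np.array([np.nan, 4.0])
print(c.sum())   # nan, analogous to R's sum(c(NA, 4)) -> NA
```

Using `np.ma.is_masked` rather than `is np.ma.masked` is the documented way to test for a fully masked result.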
Eric
Ben Root
Re: [Numpy-discussion] Concepts for masked/missing data
On Sat, Jun 25, 2011 at 3:51 PM, Nathaniel Smith n...@pobox.com wrote:
> [quoted text snipped; identical to Nathaniel's message above]
> -- Nathaniel

+1. Everyone knows R ain't perfect. I think it's an atrociously bad programming language, but it can be unbelievably good at statistics, as evidenced by its success. Brings to mind Andy Gelman's blog post last fall: http://www.stat.columbia.edu/~cook/movabletype/archives/2010/09/ross_ihaka_to_r.html

As someone in a statistics department, I've frequently been disheartened when I see how easy many statistical things are in R and how much more difficult they are in Python. This is partly the result of poor interfaces for statistical modeling, partly due to data structures (e.g. how integrated data.frame is throughout R), and partly due to things like handling of missing data, for which there's currently no equivalent.

- Wes