[Numpy-discussion] Missing Data
What is the status of: https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst and of missing data in Numpy, more generally? Is np.ma.array still the state-of-the-art way to handle missing data? Or has something better and more comprehensive been put together? ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Missing Data
On Wed, Mar 26, 2014 at 7:22 PM, T J tjhn...@gmail.com wrote: What is the status of: https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst For what it's worth, this NEP was written in 2011 by mwiebe, who made 258 numpy commits in 2011, 1 in 2012, and 3 in 2014. According to github, in the last few hours alone mwiebe has made several commits to 'blaze' and 'dynd-python'. Here's the blog post explaining the vision for Continuum's 'blaze' project http://continuum.io/blog/blaze. Continuum seems to have been started in early 2012.
Re: [Numpy-discussion] Missing Data
On Wed, Mar 26, 2014 at 5:43 PM, alex argri...@ncsu.edu wrote: On Wed, Mar 26, 2014 at 7:22 PM, T J tjhn...@gmail.com wrote: What is the status of: https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst For what it's worth, this NEP was written in 2011 by mwiebe, who made 258 numpy commits in 2011, 1 in 2012, and 3 in 2014. According to github, in the last few hours alone mwiebe has made several commits to 'blaze' and 'dynd-python'. Here's the blog post explaining the vision for Continuum's 'blaze' project http://continuum.io/blog/blaze. Continuum seems to have been started in early 2012. It looks like blaze will have bit-pattern missing values a la R. I don't know if there is going to be a masked array implementation. The NA code was taken out of Numpy because it was not possible to reach agreement that it did the right thing. Numpy.ma remains the only solution for bad data at this time. The code could probably use more love than it has gotten ;) Chuck
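For readers landing on this thread, the two approaches Chuck contrasts (numpy.ma masks vs. R-style bit patterns) can be sketched with standard NumPy calls; both functions below are real NumPy API, and np.nanmean is only a rough stand-in for true bit-pattern NA support:

```python
import numpy as np

# numpy.ma: an explicit boolean mask marks the bad entry.
data = np.ma.array([1.0, 2.0, 3.0, 4.0], mask=[False, True, False, False])
print(data.mean())      # masked entry is ignored: (1 + 3 + 4) / 3

# Bit-pattern style (what R does for floats): NaN marks the bad entry,
# and NaN-aware reductions skip it.
raw = np.array([1.0, np.nan, 3.0, 4.0])
print(np.nanmean(raw))  # also (1 + 3 + 4) / 3
```

Both print the same mean, but only the masked array keeps the original value recoverable under the mask; the NaN overwrite is destructive.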
Re: [Numpy-discussion] Missing data wrap-up and request for comments
For what it's worth, I'd prefer ndmasked. As has been mentioned elsewhere, some algorithms can't really cope with missing data. I'd very much rather they fail than silently give incorrect results. Working in the climate prediction business (as with many other domains I'm sure), even the *potential* for incorrect results can be damaging. On 11 May 2012 06:14, Travis Oliphant tra...@continuum.io wrote: On May 10, 2012, at 12:21 AM, Charles R Harris wrote: On Wed, May 9, 2012 at 11:05 PM, Benjamin Root ben.r...@ou.edu wrote: On Wednesday, May 9, 2012, Nathaniel Smith wrote: My only objection to this proposal is that committing to this approach seems premature. The existing masked array objects act quite differently from numpy.ma, so why do you believe that they're a good foundation for numpy.ma, and why will users want to switch to their semantics over numpy.ma's semantics? These aren't rhetorical questions, it seems like they must have concrete answers, but I don't know what they are. Based on the design decisions made in the original NEP, a re-made numpy.ma would have to lose _some_ features, particularly the ability to share masks. Save for that and some very obscure behaviors that are undocumented, it is possible to remake numpy.ma as a compatibility layer. That being said, I think that there are some fundamental questions that have me concerned. If I recall, there were unresolved questions about behaviors surrounding assignments to elements of a view. I see the project as broken down like this: 1.) internal architecture (largely ABI issues) 2.) external architecture (hooks throughout numpy to utilize the new features where possible, such as the where= argument) 3.) getter/setter semantics 4.) mathematical semantics At this moment, I think we have pieces of 2 and they are fairly non-controversial. It is 1 that I see as being the immediate hold-up here.
3 & 4 are non-trivial, but because they are mostly about interfaces, I think we can be willing to accept some very basic, fundamental, barebones components here in order to lay the groundwork for a more complete API later. To talk of Travis's proposal, doing nothing is a no-go. Not moving forward would dishearten the community. Making a ndmasked type is very intriguing. I see it as a step towards eventually deprecating ndarray? Also, how would it behave with np.asarray() and np.asanyarray()? My other concern is a possible violation of DRY. How difficult would it be to maintain two ndarrays in parallel? As for the flag approach, this still doesn't solve the problem of legacy code (or did I misunderstand?) My understanding of the flag is to allow the code to stay in and get reworked and experimented with while keeping it from contaminating conventional use. The whole point of putting the code in was to experiment and adjust. The rather bizarre idea that it needs to be perfect from the get-go is disheartening, and is seldom how new things get developed. Sure, there is a plan up front, but there needs to be feedback and change. And in fact, I haven't seen much feedback about the actual code; I don't even know that the people complaining have tried using it to see where it hurts. I'd like that sort of feedback. I don't think anyone is saying it needs to be perfect from the get-go. What I am saying is that this is fundamental enough to downstream users that this kind of thing is best done as a separate object. The flag could still be used to make all Python-level array constructors build ndmasked objects. But, this doesn't address the C-level story, where there is quite a bit of downstream use where people have used the NumPy array as just a pointer to memory without considering that there might be a mask attached that should be inspected as well.
The NEP addresses this a little bit for those C or C++ consumers of the ndarray in C who always use PyArray_FromAny, which can fail if the array has non-NULL mask contents. However, it is *not* true that all downstream users use PyArray_FromAny. A large number of users just use something like PyArray_Check and then PyArray_DATA to get the pointer to the data buffer, and then go from there thinking of their data as a strided memory chunk only (no extra mask). The NEP fundamentally changes this simple invariant that has been in NumPy, and Numeric before it, for a long, long time. I really don't see how we can do this in a 1.7 release. It has too many unknown, and I think unknowable, downstream effects. But, I think we could introduce another arrayobject that is the masked_array, with a Python-level flag that makes it the default array in Python. There are a few more subtleties: PyArray_Check by default will pass sub-classes, so if the new ndmask array were a sub-class then it would be passed (just like current numpy.ma arrays and matrices would pass that check today). However, there is a PyArray_CheckExact macro which could
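Benjamin's question about np.asarray()/np.asanyarray(), and the C-level hazard Travis describes, can both be illustrated with today's numpy.ma (an ndmasked type would presumably behave analogously; this is a sketch of current behavior, not of the proposed type):

```python
import numpy as np

m = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False])

# np.asarray strips the subclass: the mask is silently dropped and the
# underlying data (including the "bad" slot) is exposed -- the same
# hazard as C code grabbing a bare PyArray_DATA pointer.
plain = np.asarray(m)
print(type(plain) is np.ndarray, np.ma.isMaskedArray(plain))  # True False

# np.asanyarray passes subclasses through, so the mask survives.
still_masked = np.asanyarray(m)
print(np.ma.isMaskedArray(still_masked))                      # True

# The where= ufunc hook (point 2 in Benjamin's breakdown; available from
# NumPy 1.7 on): compute only where allowed, leave other slots untouched.
out = np.zeros(3)
np.add([1.0, 2.0, 3.0], 10.0, out=out, where=np.array([True, False, True]))
print(out)                                                    # [11.  0. 13.]
```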
Re: [Numpy-discussion] Missing data wrap-up and request for comments
On 11 May 2012 06:57, Travis Oliphant tra...@continuum.io wrote: On May 10, 2012, at 3:40 AM, Scott Sinclair wrote: On 9 May 2012 18:46, Travis Oliphant tra...@continuum.io wrote: The document is available here: https://github.com/numpy/numpy.scipy.org/blob/master/NA-overview.rst This is orthogonal to the discussion, but I'm curious as to why this discussion document has landed in the website repo? I suppose it's not a really big deal, but future uploads of the website will now include a page at http://numpy.scipy.org/NA-overview.html with the content of this document. If that's desirable, I'll add a note at the top of the overview referencing this discussion thread. If not it can be relocated somewhere more desirable after this thread's discussion deadline expires.. Yes, it can be relocated. Can you suggest where it should go? It was added there so that nathaniel and mark could both edit it together with Nathaniel added to the web-team. It may not be a bad place for it, though. At least for a while. Having thought about it, a page on the website isn't a bad idea. I've added a note pointing to this discussion. The document now appears at http://numpy.scipy.org/NA-overview.html Cheers, Scott
Re: [Numpy-discussion] Missing data wrap-up and request for comments
On Thu, May 10, 2012 at 11:03 PM, Scott Sinclair scott.sinclair...@gmail.com wrote: Having thought about it, a page on the website isn't a bad idea. I've added a note pointing to this discussion. The document now appears at http://numpy.scipy.org/NA-overview.html Why not have a separate repo for neps/discussion docs? That way, people can be added to the team as they need to edit them and removed when done, and it's separate from the main site itself. The site can simply have a link to this set of documents, which can be built and tracked separately and cleanly. We have more or less that setup with ipython for the site and docs: - main site page that points to the doc builds: http://ipython.org/documentation.html - doc builds on a secondary site: http://ipython.org/ipython-doc/stable/index.html This seems to me like the best way to separate the main web team (assuming we'll have a nice website for numpy one day) from the team that will edit documents of nep/discussion type. I imagine the web team will be fairly stable, whereas the team for these docs will have people coming and going. Just a thought... As usual, crib anything you find useful from our setup. Cheers, f
Re: [Numpy-discussion] Missing data wrap-up and request for comments
On 11 May 2012 08:12, Fernando Perez fperez@gmail.com wrote: On Thu, May 10, 2012 at 11:03 PM, Scott Sinclair scott.sinclair...@gmail.com wrote: Having thought about it, a page on the website isn't a bad idea. I've added a note pointing to this discussion. The document now appears at http://numpy.scipy.org/NA-overview.html Why not have a separate repo for neps/discussion docs? That way, people can be added to the team as they need to edit them and removed when done, and it's separate from the main site itself. The site can simply have a link to this set of documents, which can be built, tracked, separately and cleanly. We have more or less that setup with ipython for the site and docs: - main site page that points to the doc builds: http://ipython.org/documentation.html - doc builds on a secondary site: http://ipython.org/ipython-doc/stable/index.html That's pretty much how things already work. The documentation is in the main source tree and built docs end up at http://docs.scipy.org. NEPs live at https://github.com/numpy/numpy/tree/master/doc/neps, but don't get published outside of the source tree and there's no preferred place for discussion documents. (assuming we'll have a nice website for numpy one day) Ha ha ha ;-) Thanks for the thoughts and prodding. Cheers, Scott
Re: [Numpy-discussion] Missing data wrap-up and request for comments
On Thu, May 10, 2012 at 11:44 PM, Scott Sinclair scott.sinclair...@gmail.com wrote: That's pretty much how things already work. The documentation is in the main source tree and built docs end up at http://docs.scipy.org. NEPs live at https://github.com/numpy/numpy/tree/master/doc/neps, but don't get published outside of the source tree and there's no preferred place for discussion documents. No, b/c that means that for someone to be able to push to a NEP, they'd have to get commit rights to the main numpy source code repo. The whole point of what I'm suggesting is to isolate the NEP repo so that commit rights can be given for it with minimal thought, whenever pretty much anyone says they're going to work on a NEP. Obviously today anyone can do that and submit a PR against the main repo, but that raises the PR review burden for said repo. And that burden is something that we should strive to keep as low as possible, so those key people (the team with commit rights to the main repo) can focus their limited resources on reviewing code PRs. I'm simply suggesting a way to spread the load as much as possible, so that the team with commit rights on the main repo isn't a bottleneck on other tasks. Cheers, f
Re: [Numpy-discussion] Missing data wrap-up and request for comments
On May 11, 2012, at 2:13 AM, Fernando Perez wrote: On Thu, May 10, 2012 at 11:44 PM, Scott Sinclair scott.sinclair...@gmail.com wrote: That's pretty much how things already work. The documentation is in the main source tree and built docs end up at http://docs.scipy.org. NEPs live at https://github.com/numpy/numpy/tree/master/doc/neps, but don't get published outside of the source tree and there's no preferred place for discussion documents. No, b/c that means that for someone to be able to push to a NEP, they'd have to get commit rights to the main numpy source code repo. The whole point of what I'm suggesting is to isolate the NEP repo so that commit rights can be given for it with minimal thought, whenever pretty much anyone says they're going to work on a NEP. Obviously today anyone can do that and submit a PR against the main repo, but that raises the PR review burden for said repo. And that burden is something that we should strive to keep as low as possible, so those key people (the team with commit rights to the main repo) can focus their limited resources on reviewing code PRs. I'm simply suggesting a way to spread the load as much as possible, so that the team with commit rights on the main repo isn't a bottleneck on other tasks. This is a good idea, I think. I like the thought of a separate NEP and docs repo. -Travis Cheers, f
Re: [Numpy-discussion] Missing data wrap-up and request for comments
On Thu, May 10, 2012 at 10:28 PM, Matthew Brett matthew.br...@gmail.comwrote: Hi, On Thu, May 10, 2012 at 2:43 AM, Nathaniel Smith n...@pobox.com wrote: Hi Matthew, On Thu, May 10, 2012 at 12:01 AM, Matthew Brett matthew.br...@gmail.com wrote: The third proposal is certainly the best one from Cython's perspective; and I imagine for those writing C extensions against the C API too. Having PyType_Check fail for ndmasked is a very good way of having code fail that is not written to take masks into account. Mark, Nathaniel - can you comment how your chosen approaches would interact with extension code? I'm guessing the bitpattern dtypes would be expected to cause extension code to choke if the type is not supported? That's pretty much how I'm imagining it, yes. Right now if you have, say, a Cython function like cdef f(np.ndarray[double] a): ... and you do f(np.zeros(10, dtype=int)), then it will error out, because that function doesn't know how to handle ints, only doubles. The same would apply for, say, a NA-enabled integer. In general there are almost arbitrarily many dtypes that could get passed into any function (including user-defined ones, etc.), so C code already has to check dtypes for correctness. Second order issues: - There is certainly C code out there that just assumes that it will only be passed an array with certain dtype (and ndim, memory layout, etc...). If you write such C code then it's your job to make sure that you only pass it the kinds of arrays that it expects, just like now :-). - We may want to do some sort of special-casing of handling for floating point NA dtypes that use an NaN as the magic bitpattern, since many algorithms *will* work with these unchanged, and it might be frustrating to have to wait for every extension module to be updated just to allow for this case explicitly before using them. OTOH you can easily work around this. 
Like say my_qr is a legacy C function that will in fact propagate NaNs correctly, so float NA dtypes would Just Work -- except, it errors out at the start because it doesn't recognize the dtype. How annoying. We *could* have some special hack you can use to force it to work anyway (by like making the "is this the dtype I expect?" routine lie.) But you can also just do:

def my_qr_wrapper(arr):
    if arr.dtype is a NA float dtype with NaN magic value:
        result = my_qr(arr.view(arr.dtype.base_dtype))
        return result.view(arr.dtype)
    else:
        return my_qr(arr)

and hey presto, now it will correctly pass through NAs. So perhaps it's not worth bothering with special hacks. - Of course if your extension function does want to handle NAs generically, then there will be a simple C api for checking for them, setting them, etc. Numpy needs such an API internally anyway! Thanks for this. Mark - in view of the discussions about Cython and extension code - could you say what you see as disadvantages to the ndmasked subclass proposal? The biggest difficulty looks to me like how to work with both of them reasonably from the C API. The idea of ndarray and ndmasked having different independent TypeObjects, but still working through the same API calls, feels a little disconcerting. Maybe this is a reasonable compromise, though; it would be nice to see the idea fleshed out a bit more with some examples of how the code would work from the C level. Cheers, Mark Cheers, Matthew
Re: [Numpy-discussion] Missing data wrap-up and request for comments
On Wed, May 09, 2012 at 02:35:26PM -0500, Travis Oliphant wrote: Basically it buys not forcing *all* NumPy users (on the C-API level) to now deal with a masked array. I know this push is a feature that is part of Mark's intention (as it pushes downstream libraries to think about missing data at a fundamental level). I think that this is a bad policy because: 1. An array is not always data. I realize that there is a big push for data-related computing lately, but I still believe that the notion of missing data makes no sense for the majority of numpy arrays instantiated. 2. Not every algorithm can be made to work with missing data. I would even say that most of the advanced algorithms do not work with missing data. Don't try to force upon people a problem that they do not have :). Gael PS: This message does not claim to take any position in the debate on which solution for missing data is the best, because I don't think that I have a good technical vision to back any position.
Re: [Numpy-discussion] Missing data wrap-up and request for comments
On 9 May 2012 18:46, Travis Oliphant tra...@continuum.io wrote: The document is available here: https://github.com/numpy/numpy.scipy.org/blob/master/NA-overview.rst This is orthogonal to the discussion, but I'm curious as to why this discussion document has landed in the website repo? I suppose it's not a really big deal, but future uploads of the website will now include a page at http://numpy.scipy.org/NA-overview.html with the content of this document. If that's desirable, I'll add a note at the top of the overview referencing this discussion thread. If not it can be relocated somewhere more desirable after this thread's discussion deadline expires.. Cheers, Scott
Re: [Numpy-discussion] Missing data wrap-up and request for comments
On 05/10/2012 06:05 AM, Dag Sverre Seljebotn wrote: On 05/10/2012 01:01 AM, Matthew Brett wrote: Hi, On Wed, May 9, 2012 at 12:44 PM, Dag Sverre Seljebotn d.s.seljeb...@astro.uio.no wrote: On 05/09/2012 06:46 PM, Travis Oliphant wrote: Hey all, Nathaniel and Mark have worked very hard on a joint document to try and explain the current status of the missing-data debate. I think they've done an amazing job at providing some context, articulating their views and suggesting ways forward in a mutually respectful manner. This is an exemplary collaboration and is at the core of why open source is valuable. The document is available here: https://github.com/numpy/numpy.scipy.org/blob/master/NA-overview.rst After reading that document, it appears to me that there are some fundamentally different views on how things should move forward. I'm also reading the document incorporating my understanding of the history, of NumPy as well as all of the users I've met and interacted with which means I have my own perspective that is not necessarily incorporated into that document but informs my recommendations. I'm not sure we can reach full consensus on this. We are also well past time for moving forward with a resolution on this (perhaps we can all agree on that). I would like one more discussion thread where the technical discussion can take place. I will make a plea that we keep this discussion as free from logical fallacies http://en.wikipedia.org/wiki/Logical_fallacy as we can. I can't guarantee that I personally will succeed at that, but I can tell you that I will try. That's all I'm asking of anyone else. I recognize that there are a lot of other issues at play here besides *just* the technical questions, but we are not going to resolve every community issue in this technical thread. We need concrete proposals and so I will start with three. Please feel free to comment on these proposals or add your own during the discussion. 
I will stop paying attention to this thread next Wednesday (May 16th) (or earlier if the thread dies) and hope that by that time we can agree on a way forward. If we don't have agreement, then I will move forward with what I think is the right approach. I will either write the code myself or convince someone else to write it. In all cases, we have agreement that bit-pattern dtypes should be added to NumPy. We should work on these (int32, float64, complex64, str, bool) to start. So, the three proposals are independent of this way forward. The proposals are all about the extra mask part. My three proposals:

* do nothing and leave things as is
* add a global flag that turns off masked array support by default but otherwise leaves things unchanged (I'm still unclear how this would work exactly)
* move Mark's masked ndarray objects into a new fundamental type (ndmasked), leaving the actual ndarray type unchanged. The array_interface keeps the masked array notions and the ufuncs keep the ability to handle arrays like ndmasked. Ideally, numpy.ma would be changed to use ndmasked objects as their core.

For the record, I'm currently in favor of the third proposal. Feel free to comment on these proposals (or provide your own). Bravo! NA-overview.rst was an excellent read. Thanks Nathaniel and Mark! Yes, it is very well written, my compliments to the chefs. The third proposal is certainly the best one from Cython's perspective; and I imagine for those writing C extensions against the C API too. Having PyType_Check fail for ndmasked is a very good way of having code fail that is not written to take masks into account. I want to make something more clear: There are two Cython cases; in the case of cdef np.ndarray[double] there is no problem, as PEP 3118 access will raise an exception for masked arrays. But, there's the case where you do cdef np.ndarray, and then proceed to use PyArray_DATA.
Myself I do this more than PEP 3118 access; usually because I pass the data pointer to some C or C++ code. It'd be great to have such code be forward-compatible in the sense that it raises an exception when it meets a masked array. Having PyType_Check fail seems like the only way? Am I wrong? I'm very sorry; I always meant PyObject_TypeCheck, not PyType_Check. Dag Mark, Nathaniel - can you comment how your chosen approaches would interact with extension code? I'm guessing the bitpattern dtypes would be expected to cause extension code to choke if the type is not supported? The proposal, as I understand it, is to use that with new dtypes (?). So things will often be fine for that reason:

if arr.dtype == np.float32:
    c_function_32bit(np.PyArray_DATA(arr), ...)
else:
    raise ValueError("need 32-bit float array")

Mark - in : https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst#cython - do I understand correctly that you think that Cython and other extension writers should
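Dag's pattern of validating the dtype before handing a PyArray_DATA pointer to C/C++ code has a direct Python-level analogue via the ndarray.ctypes attribute. The helper name below is ours (hypothetical), but the attributes and checks it uses are real NumPy API:

```python
import numpy as np

def checked_data_pointer(arr, expected_dtype):
    """Hypothetical helper mirroring the C-level PyArray_Check +
    PyArray_DATA pattern: validate before exposing a raw pointer."""
    if np.ma.isMaskedArray(arr):
        # A bare pointer would silently ignore the mask.
        raise TypeError("masked arrays are not plain strided buffers")
    if arr.dtype != np.dtype(expected_dtype):
        raise ValueError("need %s array" % np.dtype(expected_dtype))
    if not arr.flags['C_CONTIGUOUS']:
        raise ValueError("need a C-contiguous array")
    return arr.ctypes.data  # integer address, what PyArray_DATA returns

addr = checked_data_pointer(np.zeros(4, dtype=np.float32), np.float32)
print(addr != 0)  # a real address the C side could consume
```

Passing a masked array or a wrong dtype raises instead of handing out a pointer, which is exactly the forward-compatible failure mode Dag asks for.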
Re: [Numpy-discussion] Missing data wrap-up and request for comments
Hi Matthew, On Thu, May 10, 2012 at 12:01 AM, Matthew Brett matthew.br...@gmail.com wrote: The third proposal is certainly the best one from Cython's perspective; and I imagine for those writing C extensions against the C API too. Having PyType_Check fail for ndmasked is a very good way of having code fail that is not written to take masks into account. Mark, Nathaniel - can you comment how your chosen approaches would interact with extension code? I'm guessing the bitpattern dtypes would be expected to cause extension code to choke if the type is not supported? That's pretty much how I'm imagining it, yes. Right now if you have, say, a Cython function like cdef f(np.ndarray[double] a): ... and you do f(np.zeros(10, dtype=int)), then it will error out, because that function doesn't know how to handle ints, only doubles. The same would apply for, say, a NA-enabled integer. In general there are almost arbitrarily many dtypes that could get passed into any function (including user-defined ones, etc.), so C code already has to check dtypes for correctness. Second order issues: - There is certainly C code out there that just assumes that it will only be passed an array with certain dtype (and ndim, memory layout, etc...). If you write such C code then it's your job to make sure that you only pass it the kinds of arrays that it expects, just like now :-). - We may want to do some sort of special-casing of handling for floating point NA dtypes that use an NaN as the magic bitpattern, since many algorithms *will* work with these unchanged, and it might be frustrating to have to wait for every extension module to be updated just to allow for this case explicitly before using them. OTOH you can easily work around this. Like say my_qr is a legacy C function that will in fact propagate NaNs correctly, so float NA dtypes would Just Work -- except, it errors out at the start because it doesn't recognize the dtype. How annoying. 
We *could* have some special hack you can use to force it to work anyway (by like making the "is this the dtype I expect?" routine lie.) But you can also just do:

def my_qr_wrapper(arr):
    if arr.dtype is a NA float dtype with NaN magic value:
        result = my_qr(arr.view(arr.dtype.base_dtype))
        return result.view(arr.dtype)
    else:
        return my_qr(arr)

and hey presto, now it will correctly pass through NAs. So perhaps it's not worth bothering with special hacks. - Of course if your extension function does want to handle NAs generically, then there will be a simple C api for checking for them, setting them, etc. Numpy needs such an API internally anyway! -- Nathaniel
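Since the NA dtypes discussed here never shipped, Nathaniel's my_qr_wrapper trick can't be run verbatim, but the same strip-compute-remask shape works today with numpy.ma and NaN. Here legacy_scale is a hypothetical stand-in for a NaN-propagating legacy routine, not a real library call:

```python
import numpy as np

def legacy_scale(arr):
    """Hypothetical stand-in for a legacy C routine that knows nothing
    about masks but happens to propagate NaNs correctly."""
    if arr.dtype != np.float64:
        raise ValueError("need float64 array")
    return arr * 2.0

def legacy_scale_wrapper(arr):
    # Same shape as my_qr_wrapper: strip the missing-data layer, let NaN
    # carry the NA through the legacy code, then re-mask on the way out.
    if np.ma.isMaskedArray(arr):
        result = legacy_scale(arr.filled(np.nan))
        return np.ma.masked_invalid(result)
    return legacy_scale(arr)

m = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False])
r = legacy_scale_wrapper(m)
print(r)  # the masked slot passes through intact: [2.0 -- 6.0]
```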
Re: [Numpy-discussion] Missing data wrap-up and request for comments
Hi, On Thu, May 10, 2012 at 2:43 AM, Nathaniel Smith n...@pobox.com wrote: Hi Matthew, On Thu, May 10, 2012 at 12:01 AM, Matthew Brett matthew.br...@gmail.com wrote: The third proposal is certainly the best one from Cython's perspective; and I imagine for those writing C extensions against the C API too. Having PyType_Check fail for ndmasked is a very good way of having code fail that is not written to take masks into account. Mark, Nathaniel - can you comment how your chosen approaches would interact with extension code? I'm guessing the bitpattern dtypes would be expected to cause extension code to choke if the type is not supported? That's pretty much how I'm imagining it, yes. Right now if you have, say, a Cython function like cdef f(np.ndarray[double] a): ... and you do f(np.zeros(10, dtype=int)), then it will error out, because that function doesn't know how to handle ints, only doubles. The same would apply for, say, a NA-enabled integer. In general there are almost arbitrarily many dtypes that could get passed into any function (including user-defined ones, etc.), so C code already has to check dtypes for correctness. Second order issues: - There is certainly C code out there that just assumes that it will only be passed an array with certain dtype (and ndim, memory layout, etc...). If you write such C code then it's your job to make sure that you only pass it the kinds of arrays that it expects, just like now :-). - We may want to do some sort of special-casing of handling for floating point NA dtypes that use an NaN as the magic bitpattern, since many algorithms *will* work with these unchanged, and it might be frustrating to have to wait for every extension module to be updated just to allow for this case explicitly before using them. OTOH you can easily work around this. 
Like say my_qr is a legacy C function that will in fact propagate NaNs correctly, so float NA dtypes would Just Work -- except, it errors out at the start because it doesn't recognize the dtype. How annoying. We *could* have some special hack you can use to force it to work anyway (by like making the "is this the dtype I expect?" routine lie.) But you can also just do:

def my_qr_wrapper(arr):
    if arr.dtype is a NA float dtype with NaN magic value:
        result = my_qr(arr.view(arr.dtype.base_dtype))
        return result.view(arr.dtype)
    else:
        return my_qr(arr)

and hey presto, now it will correctly pass through NAs. So perhaps it's not worth bothering with special hacks. - Of course if your extension function does want to handle NAs generically, then there will be a simple C api for checking for them, setting them, etc. Numpy needs such an API internally anyway! Thanks for this. Mark - in view of the discussions about Cython and extension code - could you say what you see as disadvantages to the ndmasked subclass proposal? Cheers, Matthew
Re: [Numpy-discussion] Missing data wrap-up and request for comments
On May 10, 2012, at 3:40 AM, Scott Sinclair wrote: On 9 May 2012 18:46, Travis Oliphant tra...@continuum.io wrote: The document is available here: https://github.com/numpy/numpy.scipy.org/blob/master/NA-overview.rst This is orthogonal to the discussion, but I'm curious as to why this discussion document has landed in the website repo? I suppose it's not a really big deal, but future uploads of the website will now include a page at http://numpy.scipy.org/NA-overview.html with the content of this document. If that's desirable, I'll add a note at the top of the overview referencing this discussion thread. If not it can be relocated somewhere more desirable after this thread's discussion deadline expires.. Yes, it can be relocated. Can you suggest where it should go? It was added there so that nathaniel and mark could both edit it together with Nathaniel added to the web-team. It may not be a bad place for it, though. At least for a while. -Travis Cheers, Scott
Re: [Numpy-discussion] Missing data wrap-up and request for comments
On May 10, 2012, at 12:21 AM, Charles R Harris wrote: On Wed, May 9, 2012 at 11:05 PM, Benjamin Root ben.r...@ou.edu wrote: On Wednesday, May 9, 2012, Nathaniel Smith wrote: My only objection to this proposal is that committing to this approach seems premature. The existing masked array objects act quite differently from numpy.ma, so why do you believe that they're a good foundation for numpy.ma, and why will users want to switch to their semantics over numpy.ma's semantics? These aren't rhetorical questions, it seems like they must have concrete answers, but I don't know what they are. Based on the design decisions made in the original NEP, a re-made numpy.ma would have to lose _some_ features, particularly the ability to share masks. Save for that and some very obscure behaviors that are undocumented, it is possible to remake numpy.ma as a compatibility layer. That being said, I think that there are some fundamental questions that remain unanswered. If I recall, there were unresolved questions about behaviors surrounding assignments to elements of a view. I see the project as broken down like this: 1.) internal architecture (largely abi issues) 2.) external architecture (hooks throughout numpy to utilize the new features where possible such as where= argument) 3.) getter/setter semantics 4.) mathematical semantics At this moment, I think we have pieces of 2 and they are fairly non-controversial. It is 1 that I see as being the immediate hold-up here. 3 & 4 are non-trivial, but because they are mostly about interfaces, I think we can be willing to accept some very basic, fundamental, barebones components here in order to lay the groundwork for a more complete API later. To talk of Travis's proposal, doing nothing is a no-go. Not moving forward would dishearten the community. Making a ndmasked type is very intriguing. I see it as a step towards eventually deprecating ndarray? Also, how would it behave with np.asarray() and np.asanyarray()? 
My other concern is a possible violation of DRY. How difficult would it be to maintain two ndarrays in parallel? As for the flag approach, this still doesn't solve the problem of legacy code (or did I misunderstand?) My understanding of the flag is to allow the code to stay in and get reworked and experimented with while keeping it from contaminating conventional use. The whole point of putting the code in was to experiment and adjust. The rather bizarre idea that it needs to be perfect from the get-go is disheartening, and is seldom how new things get developed. Sure, there is a plan up front, but there needs to be feedback and change. And in fact, I haven't seen much feedback about the actual code; I don't even know that the people complaining have tried using it to see where it hurts. I'd like that sort of feedback. I don't think anyone is saying it needs to be perfect from the get-go. What I am saying is that this is fundamental enough to downstream users that this kind of thing is best done as a separate object. The flag could still be used to make all Python-level array constructors build ndmasked objects. But, this doesn't address the C-level story where there is quite a bit of downstream use where people have used the NumPy array as just a pointer to memory without considering that there might be a mask attached that should be inspected as well. The NEP addresses this a little bit for those C or C++ consumers of the ndarray in C who always use PyArray_FromAny which can fail if the array has non-NULL mask contents. However, it is *not* true that all downstream users use PyArray_FromAny. A large number of users just use something like PyArray_Check and then PyArray_DATA to get the pointer to the data buffer and then go from there thinking of their data as a strided memory chunk only (no extra mask). The NEP fundamentally changes this simple invariant that has been in NumPy and Numeric before it for a long, long time. 
I really don't see how we can do this in a 1.7 release. It has too many unknown, and I think unknowable, downstream effects. But, I think we could introduce another arrayobject that is the masked_array with a Python-level flag that makes it the default array in Python. There are a few more subtleties: PyArray_Check by default will pass sub-classes, so if the new ndmasked array were a sub-class then it would be passed (just like current numpy.ma arrays and matrices would pass that check today). However, there is a PyArray_CheckExact macro which could be used to ensure the object was actually of PyArray_Type. There is also the PyArg_ParseTuple command with "O!" that I have seen used many times to ensure an exact NumPy array. -Travis Chuck ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
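The PyArray_Check versus PyArray_CheckExact distinction Travis describes has a direct Python-level analog that is easy to experiment with. A minimal sketch (the two function names here are illustrative, not NumPy API), using numpy.ma's masked array as a stand-in for a hypothetical ndmasked subclass:

```python
import numpy as np

def accepts_subclasses(arr):
    # Analog of PyArray_Check: matrices, numpy.ma arrays, and any
    # other ndarray subclass all pass this test.
    return isinstance(arr, np.ndarray)

def accepts_exact_only(arr):
    # Analog of PyArray_CheckExact: only a plain ndarray passes, so
    # an ndarray subclass carrying a mask would be rejected here.
    return type(arr) is np.ndarray

plain = np.zeros(3)
masked = np.ma.MaskedArray(plain, mask=[True, False, False])
```

Code written in the `accepts_exact_only` style would fail loudly when handed a masked subclass, which is precisely the safety property being argued for in the ndmasked proposal.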
[Numpy-discussion] Missing data wrap-up and request for comments
Hey all, Nathaniel and Mark have worked very hard on a joint document to try and explain the current status of the missing-data debate. I think they've done an amazing job at providing some context, articulating their views and suggesting ways forward in a mutually respectful manner. This is an exemplary collaboration and is at the core of why open source is valuable. The document is available here: https://github.com/numpy/numpy.scipy.org/blob/master/NA-overview.rst After reading that document, it appears to me that there are some fundamentally different views on how things should move forward. I'm also reading the document incorporating my understanding of the history of NumPy as well as all of the users I've met and interacted with, which means I have my own perspective that is not necessarily incorporated into that document but informs my recommendations. I'm not sure we can reach full consensus on this. We are also well past time for moving forward with a resolution on this (perhaps we can all agree on that). I would like one more discussion thread where the technical discussion can take place. I will make a plea that we keep this discussion as free from logical fallacies http://en.wikipedia.org/wiki/Logical_fallacy as we can. I can't guarantee that I personally will succeed at that, but I can tell you that I will try. That's all I'm asking of anyone else. I recognize that there are a lot of other issues at play here besides *just* the technical questions, but we are not going to resolve every community issue in this technical thread. We need concrete proposals and so I will start with three. Please feel free to comment on these proposals or add your own during the discussion. I will stop paying attention to this thread next Wednesday (May 16th) (or earlier if the thread dies) and hope that by that time we can agree on a way forward. If we don't have agreement, then I will move forward with what I think is the right approach. 
I will either write the code myself or convince someone else to write it. In all cases, we have agreement that bit-pattern dtypes should be added to NumPy. We should work on these (int32, float64, complex64, str, bool) to start. So, the three proposals are independent of this way forward. The proposals are all about the extra mask part. My three proposals:

* do nothing and leave things as is
* add a global flag that turns off masked array support by default but otherwise leaves things unchanged (I'm still unclear how this would work exactly)
* move Mark's masked ndarray objects into a new fundamental type (ndmasked), leaving the actual ndarray type unchanged. The array_interface keeps the masked array notions and the ufuncs keep the ability to handle arrays like ndmasked. Ideally, numpy.ma would be changed to use ndmasked objects as their core.

For the record, I'm currently in favor of the third proposal. Feel free to comment on these proposals (or provide your own). Best regards, -Travis ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Missing data wrap-up and request for comments
On Wed, May 9, 2012 at 10:46 AM, Travis Oliphant tra...@continuum.io wrote: [Travis's preamble snipped; quoted in full in his original message above] 
If we don't have agreement, then I will move forward with what I think is the right approach. I will either write the code myself or convince someone else to write it. In all cases, we have agreement that bit-pattern dtypes should be added to NumPy. We should work on these (int32, float64, complex64, str, bool) to start. So, the three proposals are independent of this way forward. The proposals are all about the extra mask part: My three proposals: * do nothing and leave things as is * add a global flag that turns off masked array support by default but otherwise leaves things unchanged (I'm still unclear how this would work exactly) * move Mark's masked ndarray objects into a new fundamental type (ndmasked), leaving the actual ndarray type unchanged. The array_interface keeps the masked array notions and the ufuncs keep the ability to handle arrays like ndmasked. Ideally, numpy.ma would be changed to use ndmasked objects as their core. The numpy.ma is unmaintained and I don't see that changing anytime soon. As you know, I would prefer 1), but 2) is a good compromise and the infrastructure for such a flag could be useful for other things, although like yourself I'm not sure how it would be implemented. I don't understand your proposal for 3), but from the description I don't see that it buys anything. For the record, I'm currently in favor of the third proposal. Feel free to comment on these proposals (or provide your own). Chuck ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Missing data wrap-up and request for comments
On Wed, May 9, 2012 at 11:46 AM, Travis Oliphant tra...@continuum.io wrote: [Travis's preamble snipped; quoted in full above] 
If we don't have agreement, then I will move forward with what I think is the right approach. I will either write the code myself or convince someone else to write it. In all cases, we have agreement that bit-pattern dtypes should be added to NumPy. We should work on these (int32, float64, complex64, str, bool) to start. So, the three proposals are independent of this way forward. The proposals are all about the extra mask part: My three proposals: * do nothing and leave things as is * add a global flag that turns off masked array support by default but otherwise leaves things unchanged (I'm still unclear how this would work exactly) * move Mark's masked ndarray objects into a new fundamental type (ndmasked), leaving the actual ndarray type unchanged. The array_interface keeps the masked array notions and the ufuncs keep the ability to handle arrays like ndmasked. Ideally, numpy.ma would be changed to use ndmasked objects as their core. For the record, I'm currently in favor of the third proposal. Feel free to comment on these proposals (or provide your own). I'm most in favour of the second proposal. It won't take very much effort, and more clearly marks off this code as experimental than just documentation notes. Thanks, -Mark Best regards, -Travis ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Missing data wrap-up and request for comments
On May 9, 2012, at 2:07 PM, Mark Wiebe wrote: On Wed, May 9, 2012 at 11:46 AM, Travis Oliphant tra...@continuum.io wrote: [Travis's preamble snipped; quoted in full above] 
If we don't have agreement, then I will move forward with what I think is the right approach. I will either write the code myself or convince someone else to write it. In all cases, we have agreement that bit-pattern dtypes should be added to NumPy. We should work on these (int32, float64, complex64, str, bool) to start. So, the three proposals are independent of this way forward. The proposals are all about the extra mask part: My three proposals: * do nothing and leave things as is * add a global flag that turns off masked array support by default but otherwise leaves things unchanged (I'm still unclear how this would work exactly) * move Mark's masked ndarray objects into a new fundamental type (ndmasked), leaving the actual ndarray type unchanged. The array_interface keeps the masked array notions and the ufuncs keep the ability to handle arrays like ndmasked. Ideally, numpy.ma would be changed to use ndmasked objects as their core. For the record, I'm currently in favor of the third proposal. Feel free to comment on these proposals (or provide your own). I'm most in favour of the second proposal. It won't take very much effort, and more clearly marks off this code as experimental than just documentation notes. Mark, will you give more details about this proposal? How would the flag work, what would it modify? The proposal to create a ndmasked object that is separate from ndarray objects also won't take much effort and also marks off the object so those who want to use it can and those who don't are not pushed into using it anyway. -Travis Thanks, -Mark Best regards, -Travis ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Missing data wrap-up and request for comments
On Wed, May 9, 2012 at 2:15 PM, Travis Oliphant tra...@continuum.io wrote: On May 9, 2012, at 2:07 PM, Mark Wiebe wrote: On Wed, May 9, 2012 at 11:46 AM, Travis Oliphant tra...@continuum.io wrote: [Travis's preamble snipped; quoted in full above] 
I will stop paying attention to this thread next Wednesday (May 16th) (or earlier if the thread dies) and hope that by that time we can agree on a way forward. If we don't have agreement, then I will move forward with what I think is the right approach. I will either write the code myself or convince someone else to write it. In all cases, we have agreement that bit-pattern dtypes should be added to NumPy. We should work on these (int32, float64, complex64, str, bool) to start. So, the three proposals are independent of this way forward. The proposals are all about the extra mask part: My three proposals: * do nothing and leave things as is * add a global flag that turns off masked array support by default but otherwise leaves things unchanged (I'm still unclear how this would work exactly) * move Mark's masked ndarray objects into a new fundamental type (ndmasked), leaving the actual ndarray type unchanged. The array_interface keeps the masked array notions and the ufuncs keep the ability to handle arrays like ndmasked. Ideally, numpy.ma would be changed to use ndmasked objects as their core. For the record, I'm currently in favor of the third proposal. Feel free to comment on these proposals (or provide your own). I'm most in favour of the second proposal. It won't take very much effort, and more clearly marks off this code as experimental than just documentation notes. Mark, will you give more details about this proposal? How would the flag work, what would it modify? The idea is inspired in part by the Chrome release cycle, which has a presentation here: https://docs.google.com/present/view?id=dg63dpc6_4d7vkk6chpli=1 Some quotes: "Features should be engineered so that they can be disabled easily (1 patch)" and "Would large feature development still be possible? Yes, engineers would have to work behind flags, however they can work for as many releases as they need to and can remove the flag when they are done." 
The current numpy codebase isn't designed for this kind of workflow, but I think we can productively emulate the idea for a big feature like NA support. One way to do this flag would be to have a numpy.experimental namespace which is not imported by default. To enable the NA-mask feature, you could do:

    import numpy.experimental.maskna

This would trigger an ExperimentalWarning to message that an experimental feature has been enabled, and would add any NA-specific symbols to the numpy namespace (NA, NAType, etc). Without this import, any operation which would create an NA or NA-masked array raises an ExperimentalError instead of succeeding. After this import, things would behave as they do now. Cheers, Mark The proposal to create a ndmasked object that is separate from ndarray objects also won't take much effort and also marks off the object so those who want to use it can and those who don't are not pushed into using it anyway. -Travis Thanks, -Mark Best regards, -Travis
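The gating mechanism Mark describes can be emulated in a few lines of plain Python. This is only a sketch of the proposed pattern: `ExperimentalWarning`, `enable_maskna`, and `make_masked_array` are hypothetical stand-ins, not NumPy API (the real proposal tied the flag flip to importing numpy.experimental.maskna):

```python
import warnings

class ExperimentalWarning(UserWarning):
    """Hypothetical warning class from the proposal."""

_MASKNA_ENABLED = False

def enable_maskna():
    # Stand-in for what `import numpy.experimental.maskna` would do:
    # warn that an experimental feature is on, then flip the flag.
    global _MASKNA_ENABLED
    warnings.warn("NA-mask support is experimental", ExperimentalWarning)
    _MASKNA_ENABLED = True

def make_masked_array(data):
    # Any NA-creating operation would check the flag first and refuse
    # to proceed while the feature is disabled.
    if not _MASKNA_ENABLED:
        raise RuntimeError("enable the experimental maskna feature first")
    return data  # placeholder for real NA-masked construction
```

The point of the pattern is that the experimental code path stays in the tree and can evolve across releases, while users who never perform the enabling import can never trip over it.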
Re: [Numpy-discussion] Missing data wrap-up and request for comments
My three proposals: * do nothing and leave things as is * add a global flag that turns off masked array support by default but otherwise leaves things unchanged (I'm still unclear how this would work exactly) * move Mark's masked ndarray objects into a new fundamental type (ndmasked), leaving the actual ndarray type unchanged. The array_interface keeps the masked array notions and the ufuncs keep the ability to handle arrays like ndmasked. Ideally, numpy.ma would be changed to use ndmasked objects as their core. The numpy.ma is unmaintained and I don't see that changing anytime soon. As you know, I would prefer 1), but 2) is a good compromise and the infrastructure for such a flag could be useful for other things, although like yourself I'm not sure how it would be implemented. I don't understand your proposal for 3), but from the description I don't see that it buys anything. That is a bit strong, to call numpy.ma unmaintained. I don't consider it that way. Are there a lot of tickets for it that are unaddressed? Is it broken? I know it gets a lot of use in the wild and so I don't think NumPy users would be happy to hear it is considered unmaintained by NumPy developers. I'm looking forward to more details of Mark's proposal for #2. The proposal for #3 is quite simple and I think it is also a good compromise between removing the masked array entirely from the core NumPy object and leaving things as is in master. It keeps the functionality (but in a separate object) much like numpy.ma is a separate object. Basically it buys not forcing *all* NumPy users (on the C-API level) to now deal with a masked array. I know this push is a feature that is part of Mark's intention (as it pushes downstream libraries to think about missing data at a fundamental level). But, I think this is too big of a change to put in a 1.X release. The internal array-model used by NumPy is used quite extensively in downstream libraries as a *concept*. 
Many people have enhanced this model with a separate mask array for various reasons, and Mark's current use of masks does not satisfy all those use-cases. I don't see how we can justify changing the NumPy 1.X memory model under these circumstances. This is, in my mind, a NumPy 2.0 kind of change, where downstream users will be looking for possible array-model changes. -Travis For the record, I'm currently in favor of the third proposal. Feel free to comment on these proposals (or provide your own). Chuck ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
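Travis's point that numpy.ma leaves the core memory model untouched is easy to see from Python: the masked array is an ndarray subclass whose data buffer remains an ordinary strided array that C consumers can read as-is. A small sketch:

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0])
mask = np.array([False, True, False, False])

# numpy.ma layers the mask on top of an unchanged ndarray; code that
# only looks at the data buffer still sees plain strided memory.
m = np.ma.MaskedArray(data, mask=mask)

unmasked_count = m.count()   # number of elements not masked out
mean_ignoring_na = m.mean()  # the masked element is excluded
```

This is exactly the "separate object" compromise: missing-data semantics live in the subclass, and nothing about the base ndarray's layout or C-level invariants changes.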
Re: [Numpy-discussion] Missing data wrap-up and request for comments
Mark, will you give more details about this proposal? How would the flag work, what would it modify? The idea is inspired in part by the Chrome release cycle, which has a presentation here: https://docs.google.com/present/view?id=dg63dpc6_4d7vkk6chpli=1 Some quotes: "Features should be engineered so that they can be disabled easily (1 patch)" and "Would large feature development still be possible? Yes, engineers would have to work behind flags, however they can work for as many releases as they need to and can remove the flag when they are done." The current numpy codebase isn't designed for this kind of workflow, but I think we can productively emulate the idea for a big feature like NA support. One way to do this flag would be to have a numpy.experimental namespace which is not imported by default. To enable the NA-mask feature, you could do:

    import numpy.experimental.maskna

This would trigger an ExperimentalWarning to message that an experimental feature has been enabled, and would add any NA-specific symbols to the numpy namespace (NA, NAType, etc). Without this import, any operation which would create an NA or NA-masked array raises an ExperimentalError instead of succeeding. After this import, things would behave as they do now. How would this flag work at the C-API level? -Travis ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Missing data wrap-up and request for comments
On 05/09/2012 06:46 PM, Travis Oliphant wrote: [Travis's preamble snipped; quoted in full above] 
If we don't have agreement, then I will move forward with what I think is the right approach. I will either write the code myself or convince someone else to write it. In all cases, we have agreement that bit-pattern dtypes should be added to NumPy. We should work on these (int32, float64, complex64, str, bool) to start. So, the three proposals are independent of this way forward. The proposals are all about the extra mask part: My three proposals: * do nothing and leave things as is * add a global flag that turns off masked array support by default but otherwise leaves things unchanged (I'm still unclear how this would work exactly) * move Mark's masked ndarray objects into a new fundamental type (ndmasked), leaving the actual ndarray type unchanged. The array_interface keeps the masked array notions and the ufuncs keep the ability to handle arrays like ndmasked. Ideally, numpy.ma would be changed to use ndmasked objects as their core. For the record, I'm currently in favor of the third proposal. Feel free to comment on these proposals (or provide your own). Bravo! NA-overview.rst was an excellent read. Thanks Nathaniel and Mark! The third proposal is certainly the best one from Cython's perspective; and I imagine for those writing C extensions against the C API too. Having PyArray_Check fail for ndmasked is a very good way of having code fail that is not written to take masks into account. If it is in ndarray we would also have some pressure to add support in Cython; with ndmasked we avoid that too. Likely outcome is we won't ever support it either way, but then we need some big warning in the docs, and it's better to avoid that. (I guess I'd be +0 on Mark Florisson implementing it if it ends up in core ndarray; I'd almost certainly not do it myself.) That covers Cython. My view as a NumPy user follows. I'm a heavy user of masks, which are used to make data NA in the statistical sense. 
The setting is that we have to mask out the radiation coming from the Milky Way in full-sky images of the Cosmic Microwave Background. There's data, but we know we can't trust it, so we make it NA. But we also do play around with different masks. Today we keep the mask in a separate array, and to zero-mask we do

    masked_data = data * mask

or

    masked_data = data.copy()
    masked_data[mask == 0] = np.nan  # soon np.NA

depending on the circumstances. Honestly, API-wise, this is as good as it gets for us. Nice and transparent, no new semantics to learn in the special case of masks. Now, this has performance issues: lots of memory use, extra transfers over the memory bus. BUT, NumPy has that problem all over the place, even for x + y + z! Solving it in the special case of masks, by making a new API, seems a bit myopic to me. IMO, that's much better solved at the fundamental level. As an *illustration*:

    with np.lazy:
        masked_data1 = data * mask1
        masked_data2 = data * (mask1 | mask2)
        masked_data3 = (x + y + z) * (mask1 & mask3)

This would
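The separate-mask workflow described above can be sketched concretely (the array values here are invented for illustration, not taken from the original post):

```python
# Sketch of the "mask in a separate array" workflow described above.
import numpy as np

data = np.array([2.7, 3.1, 9.9, 4.2])   # e.g. per-pixel sky samples
mask = np.array([1, 1, 0, 1])           # 0 marks untrusted pixels

# Zero-masking: multiply by the mask
masked_zero = data * mask

# NaN-masking: copy, then overwrite the masked entries
masked_nan = data.copy()
masked_nan[mask == 0] = np.nan

print(masked_zero)             # the masked pixel becomes 0.0
print(np.nanmean(masked_nan))  # reductions can then skip the NaNs
```

Both variants allocate a full-size temporary, which is exactly the memory-bus cost the post goes on to complain about.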
Re: [Numpy-discussion] Missing data wrap-up and request for comments
On Wed, May 9, 2012 at 1:35 PM, Travis Oliphant tra...@continuum.io wrote: My three proposals: * do nothing and leave things as is * add a global flag that turns off masked array support by default but otherwise leaves things unchanged (I'm still unclear how this would work exactly) * move Mark's masked ndarray objects into a new fundamental type (ndmasked), leaving the actual ndarray type unchanged. The array_interface keeps the masked array notions and the ufuncs keep the ability to handle arrays like ndmasked. Ideally, numpy.ma would be changed to use ndmasked objects as their core.

The numpy.ma is unmaintained and I don't see that changing anytime soon. As you know, I would prefer 1), but 2) is a good compromise and the infrastructure for such a flag could be useful for other things, although like yourself I'm not sure how it would be implemented. I don't understand your proposal for 3), but from the description I don't see that it buys anything.

That is a bit strong, to call numpy.ma unmaintained. I don't consider it that way. Are there a lot of tickets for it that are unaddressed? Is it broken? I know it gets a lot of use in the wild and so I don't think NumPy users would be happy to hear it is considered unmaintained by NumPy developers. I'm looking forward to more details of Mark's proposal for #2. The proposal for #3 is quite simple and I think it is also a good compromise between removing the masked array entirely from the core NumPy object and leaving things as is in master. It keeps the functionality (but in a separate object) much like numpy.ma is a separate object. Basically it buys not forcing *all* NumPy users (on the C-API level) to now deal with a masked array.

To me, it looks like we will get stuck with a more complicated implementation without changing the API, something that 2) achieves more easily while providing a feature likely to be useful as we head towards 2.0.
I know this push is a feature that is part of Mark's intention (as it pushes downstream libraries to think about missing data at a fundamental level).But, I think this is too big of a change to put in a 1.X release. The internal array-model used by NumPy is used quite extensively in downstream libraries as a *concept*. Many people have enhanced this model with a separate mask array for various reasons, and Mark's current use of mask does not satisfy all those use-cases. I don't see how we can justify changing the NumPy 1.X memory model under these circumstances. You keep referring to these ghostly people and their unspecified uses, no doubt to protect the guilty. You don't have to name names, but a little detail on what they have done and how they use things would be *very* helpful. This is the sort of change that in my mind is a NumPy 2.0 kind of change where downstream users will be looking for possible array-model changes. We tried the flag day approach to 2.0 already and it failed. I think it better to have a long term release and a series of releases thereafter moving step by step with incremental changes towards a 2.0. Mark's 2) would support that approach. snip Chuck ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Missing data wrap-up and request for comments
On re-reading, I want to make a couple of things clear: 1) This wrap-up discussion is *only* for what to do for NumPy 1.7 in such a way that we don't tie our hands in the future. I do not believe we can figure out what to do for masked arrays in one short week. What happens beyond NumPy 1.7 should be still discussed and explored. My urgency is entirely about moving forward from where we are in master right now in a direction that we can all accept. The tight timeline is so that we do *something* and move forward. 2) I missed another possible proposal for NumPy 1.7 which is in the write-up that Mark and Nathaniel made: remove the masked array additions entirely, possibly moving them to another module like numpy-dtypes. Again, these are only for NumPy 1.7. What happens in any future NumPy and beyond will depend on who comes to the table for both discussion and code-development. Best regards, -Travis

On May 9, 2012, at 11:46 AM, Travis Oliphant wrote: Hey all, Nathaniel and Mark have worked very hard on a joint document to try and explain the current status of the missing-data debate. I think they've done an amazing job at providing some context, articulating their views and suggesting ways forward in a mutually respectful manner. This is an exemplary collaboration and is at the core of why open source is valuable. The document is available here: https://github.com/numpy/numpy.scipy.org/blob/master/NA-overview.rst After reading that document, it appears to me that there are some fundamentally different views on how things should move forward. I'm also reading the document incorporating my understanding of the history of NumPy as well as all of the users I've met and interacted with, which means I have my own perspective that is not necessarily incorporated into that document but informs my recommendations. I'm not sure we can reach full consensus on this. We are also well past time for moving forward with a resolution on this (perhaps we can all agree on that).
I would like one more discussion thread where the technical discussion can take place. I will make a plea that we keep this discussion as free from logical fallacies http://en.wikipedia.org/wiki/Logical_fallacy as we can. I can't guarantee that I personally will succeed at that, but I can tell you that I will try. That's all I'm asking of anyone else. I recognize that there are a lot of other issues at play here besides *just* the technical questions, but we are not going to resolve every community issue in this technical thread. We need concrete proposals and so I will start with three. Please feel free to comment on these proposals or add your own during the discussion. I will stop paying attention to this thread next Wednesday (May 16th) (or earlier if the thread dies) and hope that by that time we can agree on a way forward. If we don't have agreement, then I will move forward with what I think is the right approach. I will either write the code myself or convince someone else to write it. In all cases, we have agreement that bit-pattern dtypes should be added to NumPy. We should work on these (int32, float64, complex64, str, bool) to start. So, the three proposals are independent of this way forward. The proposals are all about the extra mask part.

My three proposals:

* do nothing and leave things as is
* add a global flag that turns off masked array support by default but otherwise leaves things unchanged (I'm still unclear how this would work exactly)
* move Mark's masked ndarray objects into a new fundamental type (ndmasked), leaving the actual ndarray type unchanged. The array_interface keeps the masked array notions and the ufuncs keep the ability to handle arrays like ndmasked. Ideally, numpy.ma would be changed to use ndmasked objects as their core.

For the record, I'm currently in favor of the third proposal. Feel free to comment on these proposals (or provide your own).
Best regards, -Travis
Re: [Numpy-discussion] Missing data wrap-up and request for comments
On Wed, May 9, 2012 at 5:46 PM, Travis Oliphant tra...@continuum.io wrote: Hey all, Nathaniel and Mark have worked very hard on a joint document to try and explain the current status of the missing-data debate. I think they've done an amazing job at providing some context, articulating their views and suggesting ways forward in a mutually respectful manner. This is an exemplary collaboration and is at the core of why open source is valuable. The document is available here: https://github.com/numpy/numpy.scipy.org/blob/master/NA-overview.rst After reading that document, it appears to me that there are some fundamentally different views on how things should move forward. I'm also reading the document incorporating my understanding of the history, of NumPy as well as all of the users I've met and interacted with which means I have my own perspective that is not necessarily incorporated into that document but informs my recommendations. I'm not sure we can reach full consensus on this. We are also well past time for moving forward with a resolution on this (perhaps we can all agree on that). If we're talking about deciding what to do for the 1.7 release branch, then I agree. Otherwise, I definitely don't. We really just don't *know* what our users need with regards to mask-based storage versions of missing data, so committing to something within a short time period will just guarantee we have to re-do it all again later. [Edit: I see that you've clarified this in a follow-up email -- great!] We need concrete proposals and so I will start with three. Please feel free to comment on these proposals or add your own during the discussion. I will stop paying attention to this thread next Wednesday (May 16th) (or earlier if the thread dies) and hope that by that time we can agree on a way forward. If we don't have agreement, then I will move forward with what I think is the right approach. I will either write the code myself or convince someone else to write it. 
Again, I'm assuming that what you mean here is that we can't and shouldn't delay 1.7 indefinitely for this discussion to play out, so you're proposing that we give ourselves a deadline of 1 week to decide how to at least get the release unblocked. Let me know if I'm misreading, though... In all cases, we have agreement that bit-pattern dtypes should be added to NumPy. We should work on these (int32, float64, complex64, str, bool) to start. So, the three proposals are independent of this way forward. The proposals are all about the extra mask part: My three proposals: * do nothing and leave things as is In the context of 1.7, this seems like a non-starter at this point, at least if we're going to move in the direction of making decisions by consensus. It might well be that we'll decide that the current NEP-like API is what we want (or that some compatible super-set is). But (as described in more detail in the NA-overview document), I think there are still serious questions to work out about how and whether a masked-storage/NA-semantics API is something we want as part of the ndarray object at all. And Ralf with his release-manager hat says that he doesn't want to release the current API unless we can guarantee that some version of it will continue to be supported. To me that suggests that this is off the table for 1.7. * add a global flag that turns off masked array support by default but otherwise leaves things unchanged (I'm still unclear how this would work exactly) I've been assuming something like a global variable, and some guards added to all the top-level functions that take maskna= arguments, so that it's impossible to construct an ndarray that has its maskna flag set to True unless the flag has been toggled. As I said in NA-overview, I'd be fine with this in principle, but only if we're certain we're okay with the ABI consequences. 
And we should be clear on the goal -- if we just want to let people play with the API, then there are other options, such as my little experiment: https://github.com/njsmith/numpyNEP (This is certainly less robust, but it works, and is probably a much easier base for modifications to test alternative APIs.) If the goal is just to keep the code in master, then that's fine too, though it has both costs and benefits. (An example of a cost is that its presence may complicate adding bitpattern NA support.) * move Mark's masked ndarray objects into a new fundamental type (ndmasked), leaving the actual ndarray type unchanged. The array_interface keeps the masked array notions and the ufuncs keep the ability to handle arrays like ndmasked. Ideally, numpy.ma would be changed to use ndmasked objects as their core. If we're talking about 1.7, then what kind of status do you propose these new objects would have in 1.7? Regular feature, totally experimental, something else? My only objection to this proposal is that committing to this approach
Re: [Numpy-discussion] Missing data wrap-up and request for comments
Hi, On Wed, May 9, 2012 at 12:44 PM, Dag Sverre Seljebotn d.s.seljeb...@astro.uio.no wrote: On 05/09/2012 06:46 PM, Travis Oliphant wrote: Hey all, Nathaniel and Mark have worked very hard on a joint document to try and explain the current status of the missing-data debate. I think they've done an amazing job at providing some context, articulating their views and suggesting ways forward in a mutually respectful manner. This is an exemplary collaboration and is at the core of why open source is valuable. The document is available here: https://github.com/numpy/numpy.scipy.org/blob/master/NA-overview.rst After reading that document, it appears to me that there are some fundamentally different views on how things should move forward. I'm also reading the document incorporating my understanding of the history, of NumPy as well as all of the users I've met and interacted with which means I have my own perspective that is not necessarily incorporated into that document but informs my recommendations. I'm not sure we can reach full consensus on this. We are also well past time for moving forward with a resolution on this (perhaps we can all agree on that). I would like one more discussion thread where the technical discussion can take place. I will make a plea that we keep this discussion as free from logical fallacies http://en.wikipedia.org/wiki/Logical_fallacy as we can. I can't guarantee that I personally will succeed at that, but I can tell you that I will try. That's all I'm asking of anyone else. I recognize that there are a lot of other issues at play here besides *just* the technical questions, but we are not going to resolve every community issue in this technical thread. We need concrete proposals and so I will start with three. Please feel free to comment on these proposals or add your own during the discussion. 
I will stop paying attention to this thread next Wednesday (May 16th) (or earlier if the thread dies) and hope that by that time we can agree on a way forward. If we don't have agreement, then I will move forward with what I think is the right approach. I will either write the code myself or convince someone else to write it. In all cases, we have agreement that bit-pattern dtypes should be added to NumPy. We should work on these (int32, float64, complex64, str, bool) to start. So, the three proposals are independent of this way forward. The proposals are all about the extra mask part: My three proposals: * do nothing and leave things as is * add a global flag that turns off masked array support by default but otherwise leaves things unchanged (I'm still unclear how this would work exactly) * move Mark's masked ndarray objects into a new fundamental type (ndmasked), leaving the actual ndarray type unchanged. The array_interface keeps the masked array notions and the ufuncs keep the ability to handle arrays like ndmasked. Ideally, numpy.ma http://numpy.ma would be changed to use ndmasked objects as their core. For the record, I'm currently in favor of the third proposal. Feel free to comment on these proposals (or provide your own). Bravo!, NA-overview.rst was an excellent read. Thanks Nathaniel and Mark! Yes, it is very well written, my compliments to the chefs. The third proposal is certainly the best one from Cython's perspective; and I imagine for those writing C extensions against the C API too. Having PyType_Check fail for ndmasked is a very good way of having code fail that is not written to take masks into account. Mark, Nathaniel - can you comment how your chosen approaches would interact with extension code? I'm guessing the bitpattern dtypes would be expected to cause extension code to choke if the type is not supported? 
Mark - in : https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst#cython - do I understand correctly that you think that Cython and other extension writers should use the numpy API to access the data rather than accessing it directly via the data pointer and strides? Best, Matthew
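As background for the bitpattern question above, R's float NA gives a concrete picture of the approach: the NA is an ordinary IEEE NaN with a distinguished payload, stored in-band in the data buffer itself. A small sketch in Python (the bit constant below matches what is commonly reported for R's NA_real_, payload 1954, and is an assumption here rather than anything NumPy has committed to):

```python
# Illustration of a "bit-pattern" NA for float64, in the spirit of
# R's NA_real_. The exact constant is an assumption for illustration;
# NumPy's proposed bit-pattern dtypes might choose differently.
import math
import struct

NA_BITS = 0x7FF00000000007A2  # exponent all ones, fraction = 1954

# Reinterpret the 64-bit pattern as a double (a memcpy, not a cast)
na = struct.unpack("<d", struct.pack("<Q", NA_BITS))[0]

# The NA lives in-band: it occupies the same 8 bytes as any other
# float64 value, so no separate mask buffer is required.
print(math.isnan(na))
```

This is why extension code that only inspects the dtype keeps working: an NA-carrying buffer is still a plain float64 buffer.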
Re: [Numpy-discussion] Missing data wrap-up and request for comments
On Wed, May 9, 2012 at 3:12 PM, Travis Oliphant tra...@continuum.io wrote: On re-reading, I want to make a couple of things clear: 1) This wrap-up discussion is *only* for what to do for NumPy 1.7 in such a way that we don't tie our hands in the future. I do not believe we can figure out what to do for masked arrays in one short week. What happens beyond NumPy 1.7 should be still discussed and explored. My urgency is entirely about moving forward from where we are in master right now in a direction that we can all accept. The tight timeline is so that we do *something* and move forward. 2) I missed another possible proposal for NumPy 1.7 which is in the write-up that Mark and Nathaniel made: remove the masked array additions entirely, possibly moving them to another module like numpy-dtypes. Again, these are only for NumPy 1.7. What happens in any future NumPy and beyond will depend on who comes to the table for both discussion and code-development.

I'm glad that this sentence made it into the write-up: A project like numpy requires developers to write code for advancement to occur, and obstacles that impede the writing of code discourage existing developers from contributing more, and potentially scare away developers who are thinking about joining in. I agree, which is why I'm a little surprised after reading the write-up that there's no deference to the alterNEP (admittedly kludgy) implementation? One of the arguments made for the NEP preliminary NA-mask implementation is that it has been extensively tested against scipy and other third-party packages, and has been in master in a stable state for a significant amount of time. It is my understanding that the manner in which this implementation found its way into master was a source of concern and contention. To me (and I don't know the level to which this is technically feasible) that's precisely the reason that BOTH approaches should be allowed to make their way into numpy with experimental status.
Otherwise, it seems that there is a sort of scaring away of developers - seeing (from the sidelines) how much of a struggle it's been for the alterNEP to find a nurturing environment as an experimental alternative inside numpy. In my reading, the process and consensus threads that have generated so many responses stem precisely from trying to have an atmosphere where everyone is encouraged to join in. The alternatives proposed so far (though I do understand it's only for 1.7) do not suggest an appreciation for the gravity of the fallout from the neglect of the alterNEP and the issues which sprang forth from that.

Importantly, I find a problem with how personal this document (and discussion) is - I'd much prefer if we talk about technical things by a descriptive name, not the person who thought of it. You'll note how I've been referring to NEP and alterNEP above. One advantage of this is that down the line, if either Mark or Nathaniel change their minds about their current preferred way forward, it doesn't take the wind out of it with something like Even Paul changed his mind and now withdraws his support of Paul's proposal. We should only focus on the technical merits of a given approach, not how many commits have been made by the person proposing them or what else they've done in their life: a good idea has value regardless of who expresses it.

In my fantasy world, with both approaches clearly existing in an experimental sandbox inside numpy, folks who feel primary attachments to either NEP or alterNEP would be willing to cross party lines and pitch in toward making progress in both camps. That's the way we'll find better solutions, by working together, instead of working in opposition. best, -- Paul Ivanov 314 address only used for lists, off-list direct email at: http://pirsquared.org | GPG/PGP key id: 0x0F3E28F7
Re: [Numpy-discussion] Missing data wrap-up and request for comments
On Wed, May 9, 2012 at 6:13 PM, Paul Ivanov pivanov...@gmail.com wrote: On Wed, May 9, 2012 at 3:12 PM, Travis Oliphant tra...@continuum.io wrote: On re-reading, I want to make a couple of things clear: 1) This wrap-up discussion is *only* for what to do for NumPy 1.7 in such a way that we don't tie our hands in the future. I do not believe we can figure out what to do for masked arrays in one short week. What happens beyond NumPy 1.7 should be still discussed and explored. My urgency is entirely about moving forward from where we are in master right now in a direction that we can all accept. The tight timeline is so that we do *something* and move forward. 2) I missed another possible proposal for NumPy 1.7 which is in the write-up that Mark and Nathaniel made: remove the masked array additions entirely, possibly moving them to another module like numpy-dtypes. Again, these are only for NumPy 1.7. What happens in any future NumPy and beyond will depend on who comes to the table for both discussion and code-development.

I'm glad that this sentence made it into the write-up: A project like numpy requires developers to write code for advancement to occur, and obstacles that impede the writing of code discourage existing developers from contributing more, and potentially scare away developers who are thinking about joining in. I agree, which is why I'm a little surprised after reading the write-up that there's no deference to the alterNEP (admittedly kludgy) implementation? One of the arguments made for the NEP preliminary NA-mask implementation is that it has been extensively tested against scipy and other third-party packages, and has been in master in a stable state for a significant amount of time. It is my understanding that the manner in which this implementation found its way into master was a source of concern and contention.
To me (and I don't know the level to which this is a technically feasible) that's precisely the reason that BOTH approaches be allowed to make their way into numpy with experimental status. Otherwise, it seems that there is a sort of scaring away of developers - seeing (from the sidelines) how much of a struggle it's been for the alterNEP to find a nurturing environment as an experimental alternative inside numpy. In my reading, the process and consensus threads that have generated so many responses stem precisely from trying to have an atmosphere where everyone is encouraged to join in. The alternatives proposed so far (though I do understand it's only for 1.7) do not suggest an appreciation for the gravity of the fallout from the neglect the alterNEP and the issues which sprang forth from that. Importantly, I find a problem with how personal this document (and discussion) is - I'd much prefer if we talk about technical things by a descriptive name, not the person who thought of it. You'll note how I've been referring to NEP and alterNEP above. One advantage of this is that down the line, if either Mark or Nathaniel change their minds about their current preferred way forward, it doesn't take the wind out of it with something like Even Paul changed his mind and now withdraws his support of Paul's proposal. We should only focus on the technical merits of a given approach, not how many commits have been made by the person proposing them or what else they've done in their life: a good idea has value regardless of who expresses it. In my fantasy world, with both approaches clearly existing in an experimental sandbox inside numpy, folks who feel primary attachments to either NEP or alterNEP would be willing to cross party lines and pitch in towardd making progress in both camps. That's the way we'll find better solutions, by working together, instead of working in opposition. We are certainly open to code submissions and alternate implementations. 
The experimental tag would help there. But someone, as you mention, needs to write the code. Chuck
Re: [Numpy-discussion] Missing data wrap-up and request for comments
On 05/10/2012 01:01 AM, Matthew Brett wrote: Hi, On Wed, May 9, 2012 at 12:44 PM, Dag Sverre Seljebotn d.s.seljeb...@astro.uio.no wrote: On 05/09/2012 06:46 PM, Travis Oliphant wrote: Hey all, Nathaniel and Mark have worked very hard on a joint document to try and explain the current status of the missing-data debate. I think they've done an amazing job at providing some context, articulating their views and suggesting ways forward in a mutually respectful manner. This is an exemplary collaboration and is at the core of why open source is valuable. The document is available here: https://github.com/numpy/numpy.scipy.org/blob/master/NA-overview.rst After reading that document, it appears to me that there are some fundamentally different views on how things should move forward. I'm also reading the document incorporating my understanding of the history, of NumPy as well as all of the users I've met and interacted with which means I have my own perspective that is not necessarily incorporated into that document but informs my recommendations. I'm not sure we can reach full consensus on this. We are also well past time for moving forward with a resolution on this (perhaps we can all agree on that). I would like one more discussion thread where the technical discussion can take place. I will make a plea that we keep this discussion as free from logical fallacies http://en.wikipedia.org/wiki/Logical_fallacy as we can. I can't guarantee that I personally will succeed at that, but I can tell you that I will try. That's all I'm asking of anyone else. I recognize that there are a lot of other issues at play here besides *just* the technical questions, but we are not going to resolve every community issue in this technical thread. We need concrete proposals and so I will start with three. Please feel free to comment on these proposals or add your own during the discussion. 
I will stop paying attention to this thread next Wednesday (May 16th) (or earlier if the thread dies) and hope that by that time we can agree on a way forward. If we don't have agreement, then I will move forward with what I think is the right approach. I will either write the code myself or convince someone else to write it. In all cases, we have agreement that bit-pattern dtypes should be added to NumPy. We should work on these (int32, float64, complex64, str, bool) to start. So, the three proposals are independent of this way forward. The proposals are all about the extra mask part: My three proposals: * do nothing and leave things as is * add a global flag that turns off masked array support by default but otherwise leaves things unchanged (I'm still unclear how this would work exactly) * move Mark's masked ndarray objects into a new fundamental type (ndmasked), leaving the actual ndarray type unchanged. The array_interface keeps the masked array notions and the ufuncs keep the ability to handle arrays like ndmasked. Ideally, numpy.ma http://numpy.ma would be changed to use ndmasked objects as their core. For the record, I'm currently in favor of the third proposal. Feel free to comment on these proposals (or provide your own). Bravo!, NA-overview.rst was an excellent read. Thanks Nathaniel and Mark! Yes, it is very well written, my compliments to the chefs. The third proposal is certainly the best one from Cython's perspective; and I imagine for those writing C extensions against the C API too. Having PyType_Check fail for ndmasked is a very good way of having code fail that is not written to take masks into account. I want to make something more clear: There are two Cython cases; in the case of cdef np.ndarray[double] there is no problem as PEP 3118 access will raise an exception for masked arrays. But, there's the case where you do cdef np.ndarray, and then proceed to use PyArray_DATA. 
Myself I do this more than PEP 3118 access; usually because I pass the data pointer to some C or C++ code. It'd be great to have such code be forward-compatible in the sense that it raises an exception when it meets a masked array. Having PyType_Check fail seems like the only way? Am I wrong? Mark, Nathaniel - can you comment how your chosen approaches would interact with extension code? I'm guessing the bitpattern dtypes would be expected to cause extension code to choke if the type is not supported? The proposal, as I understand it, is to use that with new dtypes (?). So things will often be fine for that reason:

    if arr.dtype == np.float32:
        c_function_32bit(np.PyArray_DATA(arr), ...)
    else:
        raise ValueError("need 32-bit float array")

Mark - in : https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst#cython - do I understand correctly that you think that Cython and other extension writers should use the numpy API to access the data rather than accessing it directly via the data pointer and strides? That's not really fleshed out (for
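The dtype-guard idiom in the quoted snippet can be mimicked in pure Python/NumPy. The wrapper name below is hypothetical; a real extension would hand the array's data pointer to C after the same check:

```python
import numpy as np

def c_function_32bit_wrapper(arr):
    # Guard: refuse anything that is not a float32 ndarray, the same
    # check an extension would perform before touching the raw data
    # pointer. A bit-pattern NA dtype would simply fail this test and
    # raise, rather than being silently misread.
    if not (isinstance(arr, np.ndarray) and arr.dtype == np.float32):
        raise ValueError("need 32-bit float array")
    # Stand-in for handing arr's buffer to C code:
    return float(arr.sum())

print(c_function_32bit_wrapper(np.ones(4, dtype=np.float32)))  # 4.0
```

The point being debated is that this guard works for new *dtypes* but not for a mask stored alongside an ordinary dtype, which is why a separate ndmasked type (failing PyType_Check) was attractive to extension authors.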
Re: [Numpy-discussion] Missing data wrap-up and request for comments
On Wed, May 9, 2012 at 11:05 PM, Benjamin Root ben.r...@ou.edu wrote: On Wednesday, May 9, 2012, Nathaniel Smith wrote: My only objection to this proposal is that committing to this approach seems premature. The existing masked array objects act quite differently from numpy.ma, so why do you believe that they're a good foundation for numpy.ma, and why will users want to switch to their semantics over numpy.ma's semantics? These aren't rhetorical questions, it seems like they must have concrete answers, but I don't know what they are.

Based on the design decisions made in the original NEP, a re-made numpy.ma would have to lose _some_ features, particularly the ability to share masks. Save for that and some very obscure behaviors that are undocumented, it is possible to remake numpy.ma as a compatibility layer. That being said, I think that there are some fundamental questions that remain. If I recall, there were unresolved questions about behaviors surrounding assignments to elements of a view. I see the project as broken down like this:

1.) internal architecture (largely ABI issues)
2.) external architecture (hooks throughout numpy to utilize the new features where possible, such as the where= argument)
3.) getter/setter semantics
4.) mathematical semantics

At this moment, I think we have pieces of 2 and they are fairly non-controversial. It is 1 that I see as being the immediate hold-up here. 3 & 4 are non-trivial, but because they are mostly about interfaces, I think we can be willing to accept some very basic, fundamental, barebones components here in order to lay the groundwork for a more complete API later. To talk of Travis's proposal, doing nothing is a no-go. Not moving forward would dishearten the community. Making an ndmasked type is very intriguing. I see it as a step towards eventually deprecating ndarray? Also, how would it behave with np.asarray() and np.asanyarray()? My other concern is a possible violation of DRY.
How difficult would it be to maintain two ndarrays in parallel? As for the flag approach, this still doesn't solve the problem of legacy code (or did I misunderstand?) My understanding of the flag is to allow the code to stay in and get reworked and experimented with while keeping it from contaminating conventional use. The whole point of putting the code in was to experiment and adjust. The rather bizarre idea that it needs to be perfect from the get go is disheartening, and is seldom how new things get developed. Sure, there is a plan up front, but there needs to be feedback and change. And in fact, I haven't seen much feedback about the actual code, I don't even know that the people complaining have tried using it to see where it hurts. I'd like that sort of feedback. Chuck
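Benjamin's item 2 (hooks throughout numpy, "such as the where= argument") did eventually land as a ufunc keyword in later NumPy releases. As a hedged sketch of the idea being discussed, here is how a boolean mask can drive a ufunc through where= in current NumPy; the array names are illustrative only:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([10.0, 20.0, 30.0, 40.0])
valid = np.array([True, False, True, True])  # False marks an ignored element

# where= skips the unselected slots entirely, so out must be
# pre-initialized to give those slots a defined value.
out = np.zeros_like(a)
np.add(a, b, out=out, where=valid)
```

Only the slots where valid is True are written; out[1] keeps its initialized 0.0, which is exactly the "ignored element" behavior the hooks were meant to enable.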
Re: [Numpy-discussion] Missing data again
Hi Chuck, I think I let my frustration get the better of me, and the message below is too confrontational. I apologize. I truly would like to understand where you're coming from on this, though, so I'll try to make this more productive. My summary of points that no-one has disagreed with yet is here: https://github.com/njsmith/numpy/wiki/NA-discussion-status Of course, this means that there's lots that's left out. Instead of getting into all those contentious details, I'll stick to just a few basic questions that might let us get at least a bit of common ground: 1) Do you disagree with anything that is stated there? 2) Do you feel like that document accurately summarises your basic idea of what this feature is supposed to do (I assume under the IGNORED heading)? Thanks, -- Nathaniel On Wed, Mar 7, 2012 at 11:10 PM, Nathaniel Smith n...@pobox.com wrote: On Wed, Mar 7, 2012 at 7:37 PM, Charles R Harris charlesr.har...@gmail.com wrote: On Wed, Mar 7, 2012 at 12:26 PM, Nathaniel Smith n...@pobox.com wrote: When it comes to missing data, bitpatterns can do everything that masks can do, are no more complicated to implement, and have better performance characteristics. Maybe for float, for other things, no. And we have lots of other things. It would be easier to discuss this if you'd, like, discuss :-(. If you know of some advantage that masks have over bitpatterns when it comes to missing data, can you please share it, instead of just asserting it? Not that I'm immune... I perhaps should have been more explicit myself; when I said performance characteristics, let me clarify that I was thinking of both speed (for floats) and memory (for most-but-not-all things). The performance is a strawman, How many users need to speak up to say that this is a serious problem they have with the current implementation before you stop calling it a strawman? 
Because when Wes says that it's not going to fly for his stats/econometrics cases, and the neuroimaging folk like Gary and Matt say it's not going to fly for their use cases... surely just waving that away is a bit dismissive? I'm not saying that we *have* to implement bitpatterns because performance is *the most important feature* -- I'm just saying, well, what I said. For *missing data use* cases, bitpatterns have better performance characteristics than masks. If we decide that these use cases are important, then we should take this into account and weigh it against other considerations. Maybe what you think is that these use cases shouldn't be the focus of this feature and it should focus on the ignored use cases instead? That would be a legitimate argument... but if that's what you want to say, say it, don't just dismiss your users! and it *isn't* easier to implement. If I thought bitpatterns would be easier to implement, I would have said so... What I said was that they're not harder. You have some extra complexity, mostly in casting, and some reduced complexity -- no need to allocate and manipulate the mask. (E.g., simple same-type assignments and slicing require special casing for masks, but not for bitpatterns.) In many places the complexity is identical -- printing routines need to check for either special bitpatterns or masked values, whatever. Ufunc loops need to either find the appropriate part of the mask, or create a temporary mask buffer by calling a dtype func, whatever. On net they seem about equivalent, complexity-wise. ...I assume you disagree with this analysis, since I've said it before, wrote up a sketch for how the implementation would work at the C level, etc., and you continue to claim that simplicity is a compelling advantage for the masked approach. But I still don't know why you think that :-(. 
Also, different folks adopt different values for 'missing' data, and distributing one or several masks along with the data is another common practice. True, but not really relevant to the current debate, because you have to handle such issues as part of your general data import workflow anyway, and none of these is any more complicated no matter which implementations are available. One inconvenience I have run into with the current API is that it should be easier to clear the mask from an ignored value without taking a new view or assigning known data. So maybe two types of masks (different payloads), or an additional flag could be helpful. The process of assigning masks could also be made a bit easier than using fancy indexing. So this, uh... this was actually the whole goal of the alterNEP design for masks -- making all this stuff easy for people (like you, apparently?) that want support for ignored values, separately from missing data, and want a nice clean API for it. Basically having a separate .mask attribute which was an ordinary, assignable array broadcastable to the attached array's shape. Nobody seemed interested in talking about it much then
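One way to make Nathaniel's memory argument concrete: a boolean mask costs one extra byte per element, while a bit-pattern NA reuses bits the dtype already has and so costs nothing extra. A small sketch of the overhead, using numpy.ma purely to illustrate the arithmetic:

```python
import numpy as np
import numpy.ma as ma

data = np.zeros(1_000_000)  # float64: 8 MB of payload

# Attaching a full boolean mask adds one byte per element.
masked = ma.array(data, mask=np.zeros(data.shape, dtype=bool))

# 12.5% extra memory for float64, and proportionally worse for
# narrower dtypes (100% extra for int8, for instance).
overhead = masked.mask.nbytes / data.nbytes
```

A bit-pattern implementation would leave overhead at zero, which is the performance characteristic being weighed against the mask design's other advantages.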
Re: [Numpy-discussion] Missing data again
Hi, Thank you very much for your insights! On 06/03/2012 21:59, Nathaniel Smith wrote: Right -- R has a very impoverished type system as compared to numpy. There's basically four types: numeric (meaning double precision float), integer, logical (boolean), and character (string). And in practice the integer type is essentially unused, because R parses numbers like 1 as being floating point, not integer; the only way to get an integer value is to explicitly cast to it. Each of these types has a specific bit-pattern set aside for representing NA. And... that's it. It's very simple when it works, but also very limited. I also suspected R to be less powerful in terms of types. However, I think the fact that "it's very simple when it works" is important to take into account. At the end of the day, when using all the fanciness, it is not only about "can I have some NAs in my array?" but also "how *easily* can I have some NAs in my array?". It's about balancing the how easy and the how powerful. The ease of use is the reason for my concern about having separate types nafloatNN and floatNN. Of course, I won't argue that not breaking everything is even more important !! Coming back to Travis' proposition "bit-pattern approaches to missing data (*at least* for float64 and int32) need to be implemented", I wonder what is the amount of extra work to go from nafloat64 to nafloat32/16? Is there hardware support for NaN payloads in these smaller floats? If not, or if it is too complicated, I feel it is acceptable to say "it's too complicated" and fall back to masks. One may have to choose between fancy types and fancy NAs... Best, Pierre
Re: [Numpy-discussion] Missing data again
On Wed, Mar 7, 2012 at 4:35 PM, Pierre Haessig pierre.haes...@crans.org wrote: Hi, Thanks you very much for your lights ! Le 06/03/2012 21:59, Nathaniel Smith a écrit : Right -- R has a very impoverished type system as compared to numpy. There's basically four types: numeric (meaning double precision float), integer, logical (boolean), and character (string). And in practice the integer type is essentially unused, because R parses numbers like 1 as being floating point, not integer; the only way to get an integer value is to explicitly cast to it. Each of these types has a specific bit-pattern set aside for representing NA. And... that's it. It's very simple when it works, but also very limited. I also suspected R to be less powerful in terms of types. However, I think the fact that It's very simple when it works is important to take into account. At the end of the day, when using all the fanciness it is not only about can I have some NAs in my array ? but also how *easily* can I have some NAs in my array ?. It's about balancing the how easy and the how powerful. The easyness-of-use is the reason of my concern about having separate types nafloatNN and floatNN. Of course, I won't argue that not breaking everything is even more important !! It's a good point, I just don't see how we can really tell what the trade-offs are at this point. You should bring this up again once more of the big picture stuff is hammered out. Coming back to Travis proposition bit-pattern approaches to missing data (*at least* for float64 and int32) need to be implemented., I wonder what is the amount of extra work to go from nafloat64 to nafloat32/16 ? Is there an hardware support NaN payloads with these smaller floats ? If not, or if it is too complicated, I feel it is acceptable to say it's too complicated and fall back to mask. One may have to choose between fancy types and fancy NAs... 
All modern floating point formats can represent NaNs with payloads, so in principle there's no difficulty in supporting NA the same way for all of them. If you're using float16 because you want to offload computation to a GPU then I would test carefully before trusting the GPU to handle NaNs correctly, and there may need to be a bit of care to make sure that casts between these types properly map NAs to NAs, but generally it should be fine. -- Nathaniel
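To make the NaN-payload idea concrete: an IEEE-754 binary64 NaN leaves 51 mantissa bits free, so an NA can be encoded as a NaN carrying a reserved payload (R reserves 1954 for NA_real_). A hedged sketch in pure Python, assuming a quiet NaN and little-endian packing; note that the payload reliably survives storage and retrieval, though arithmetic is not guaranteed to propagate it:

```python
import math
import struct

PAYLOAD = 1954  # the payload R uses for NA_real_

# Build a quiet NaN (exponent all ones, quiet bit set) carrying the payload.
bits = 0x7FF8000000000000 | PAYLOAD
na = struct.unpack('<d', struct.pack('<Q', bits))[0]

assert math.isnan(na)  # still an ordinary NaN to the FPU

# Round-trip through a plain double: the payload is preserved.
roundtrip = struct.unpack('<Q', struct.pack('<d', na))[0]
payload_back = roundtrip & 0x0007FFFFFFFFFFFF  # mask off sign/exponent/quiet bit
```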
Re: [Numpy-discussion] Missing data again
On Wed, Mar 7, 2012 at 9:35 AM, Pierre Haessig pierre.haes...@crans.orgwrote: Hi, Thanks you very much for your lights ! Le 06/03/2012 21:59, Nathaniel Smith a écrit : Right -- R has a very impoverished type system as compared to numpy. There's basically four types: numeric (meaning double precision float), integer, logical (boolean), and character (string). And in practice the integer type is essentially unused, because R parses numbers like 1 as being floating point, not integer; the only way to get an integer value is to explicitly cast to it. Each of these types has a specific bit-pattern set aside for representing NA. And... that's it. It's very simple when it works, but also very limited. I also suspected R to be less powerful in terms of types. However, I think the fact that It's very simple when it works is important to take into account. At the end of the day, when using all the fanciness it is not only about can I have some NAs in my array ? but also how *easily* can I have some NAs in my array ?. It's about balancing the how easy and the how powerful. The easyness-of-use is the reason of my concern about having separate types nafloatNN and floatNN. Of course, I won't argue that not breaking everything is even more important !! Coming back to Travis proposition bit-pattern approaches to missing data (*at least* for float64 and int32) need to be implemented., I wonder what is the amount of extra work to go from nafloat64 to nafloat32/16 ? Is there an hardware support NaN payloads with these smaller floats ? If not, or if it is too complicated, I feel it is acceptable to say it's too complicated and fall back to mask. One may have to choose between fancy types and fancy NAs... I'm in agreement here, and that was a major consideration in making a 'masked' implementation first. Also, different folks adopt different values for 'missing' data, and distributing one or several masks along with the data is another common practice. 
One inconvenience I have run into with the current API is that it should be easier to clear the mask from an ignored value without taking a new view or assigning known data. So maybe two types of masks (different payloads), or an additional flag, could be helpful. The process of assigning masks could also be made a bit easier than using fancy indexing. Chuck
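For comparison, numpy.ma already makes the un-ignore operation Chuck asks for painless, because the mask is a directly assignable attribute and the underlying value survives masking. A minimal example:

```python
import numpy.ma as ma

a = ma.array([1.0, 2.0, 3.0], mask=[False, True, False])
assert a[1] is ma.masked  # element 1 is currently ignored

# Clearing one mask entry brings the original value back untouched,
# with no new view and no reassignment of known data.
a.mask[1] = False
```

The debate in this thread is partly about whether the new core implementation should expose an equally direct mask attribute, or hide it to stay interchangeable with bit-patterns.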
Re: [Numpy-discussion] Missing data again
Charles R Harris writes: [...] One inconvenience I have run into with the current API is that it should be easier to clear the mask from an ignored value without taking a new view or assigning known data. AFAIR, the inability to directly access a mask attribute was intentional, to make bit-patterns and masks indistinguishable from the POV of the array user. What's the workflow that leads you to un-ignore specific elements? So maybe two types of masks (different payloads), or an additional flag could be helpful. Do you mean different NA values? If that's the case, I think it was taken into account when implementing the current mechanisms (and was also mentioned in the NEP), so that it could be supported by both bit-patterns and masks (as one of the main design points was to make them indistinguishable in the common case). I think the name was parametrized dtypes. The process of assigning masks could also be made a bit easier than using fancy indexing. I don't get what you mean here, sorry. Do you mean here that this is too cumbersome to use? a[a > 5] = np.NA (obviously oversimplified example where everything looks sufficiently simple :)) Lluis -- And it's much the same thing with knowledge, for whenever you learn something new, the whole world becomes that much richer. -- The Princess of Pure Reason, as told by Norton Juster in The Phantom Tollbooth
Re: [Numpy-discussion] Missing data again
On Wed, Mar 7, 2012 at 11:21 AM, Lluís xscr...@gmx.net wrote: Charles R Harris writes: [...] One inconvenience I have run into with the current API is that it should be easier to clear the mask from an ignored value without taking a new view or assigning known data. AFAIR, the inability to directly access a mask attribute was intentional to make bit-patterns and masks indistinguishable from the POV of the array user. What's the workflow that leads you to un-ignore specific elements? Because they are not 'unknown', just (temporarily) 'ignored'. This might be the case if you are experimenting with what happens if certain data is left out of a fit. The current implementation tries to handle both these cases, and can do so; I would just like the 'ignored' use to be more convenient than it is. So maybe two types of masks (different payloads), or an additional flag could be helpful. Do you mean different NA values? If that's the case, I think it was taken into account when implementing the current mechanisms (and was also mentioned in the NEP), so that it could be supported by both bit-patterns and masks (as one of the main design points was to make them indistinguishable in the common case). No, the mask as currently implemented is eight bits and can be extended to handle different mask values, aka payloads. I think the name was parametrized dtypes. They don't interest me in the least. But that is a whole different area of discussion. The process of assigning masks could also be made a bit easier than using fancy indexing. I don't get what you mean here, sorry. Suppose I receive a data set, say an hdf file, that also includes a mask. I'd like to load the data and apply the mask directly without doing something like data[mask] = np.NA Do you mean here that this is too cumbersome to use? a[a > 5] = np.NA (obviously oversimplified example where everything looks sufficiently simple :)) Mostly speed and memory. 
Chuck
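For the HDF-style workflow Chuck describes, numpy.ma can attach a precomputed mask directly at construction, with no fancy indexing and (by default) no copy of the data. A sketch, with ordinary arrays standing in for what would be read from the file:

```python
import numpy as np
import numpy.ma as ma

# Stand-ins for arrays loaded from an HDF file (hypothetical values).
data = np.array([3.1, -999.0, 2.7, -999.0])
mask = np.array([False, True, False, True])

# Attach the mask directly: no element-wise assignment, no NA writes,
# and the underlying data buffer is shared rather than copied.
m = ma.array(data, mask=mask)
valid_mean = m.mean()  # statistics skip the masked slots
```

Whether the new core masked implementation should offer an equally direct "apply this mask" entry point is exactly the convenience being requested here.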
Re: [Numpy-discussion] Missing data again
On Wed, Mar 7, 2012 at 5:17 PM, Charles R Harris charlesr.har...@gmail.com wrote: On Wed, Mar 7, 2012 at 9:35 AM, Pierre Haessig pierre.haes...@crans.org Coming back to Travis proposition bit-pattern approaches to missing data (*at least* for float64 and int32) need to be implemented., I wonder what is the amount of extra work to go from nafloat64 to nafloat32/16 ? Is there an hardware support NaN payloads with these smaller floats ? If not, or if it is too complicated, I feel it is acceptable to say it's too complicated and fall back to mask. One may have to choose between fancy types and fancy NAs... I'm in agreement here, and that was a major consideration in making a 'masked' implementation first. When it comes to missing data, bitpatterns can do everything that masks can do, are no more complicated to implement, and have better performance characteristics. Also, different folks adopt different values for 'missing' data, and distributing one or several masks along with the data is another common practice. True, but not really relevant to the current debate, because you have to handle such issues as part of your general data import workflow anyway, and none of these is any more complicated no matter which implementations are available. One inconvenience I have run into with the current API is that is should be easier to clear the mask from an ignored value without taking a new view or assigning known data. So maybe two types of masks (different payloads), or an additional flag could be helpful. The process of assigning masks could also be made a bit easier than using fancy indexing. So this, uh... this was actually the whole goal of the alterNEP design for masks -- making all this stuff easy for people (like you, apparently?) that want support for ignored values, separately from missing data, and want a nice clean API for it. Basically having a separate .mask attribute which was an ordinary, assignable array broadcastable to the attached array's shape. 
Nobody seemed interested in talking about it much then but maybe there's interest now? -- Nathaniel
Re: [Numpy-discussion] Missing data again
On Wed, Mar 7, 2012 at 12:26 PM, Nathaniel Smith n...@pobox.com wrote: On Wed, Mar 7, 2012 at 5:17 PM, Charles R Harris charlesr.har...@gmail.com wrote: On Wed, Mar 7, 2012 at 9:35 AM, Pierre Haessig pierre.haes...@crans.org Coming back to Travis proposition bit-pattern approaches to missing data (*at least* for float64 and int32) need to be implemented., I wonder what is the amount of extra work to go from nafloat64 to nafloat32/16 ? Is there an hardware support NaN payloads with these smaller floats ? If not, or if it is too complicated, I feel it is acceptable to say it's too complicated and fall back to mask. One may have to choose between fancy types and fancy NAs... I'm in agreement here, and that was a major consideration in making a 'masked' implementation first. When it comes to missing data, bitpatterns can do everything that masks can do, are no more complicated to implement, and have better performance characteristics. Maybe for float, for other things, no. And we have lots of otherthings. The performance is a strawman, and it *isn't* easier to implement. Also, different folks adopt different values for 'missing' data, and distributing one or several masks along with the data is another common practice. True, but not really relevant to the current debate, because you have to handle such issues as part of your general data import workflow anyway, and none of these is any more complicated no matter which implementations are available. One inconvenience I have run into with the current API is that is should be easier to clear the mask from an ignored value without taking a new view or assigning known data. So maybe two types of masks (different payloads), or an additional flag could be helpful. The process of assigning masks could also be made a bit easier than using fancy indexing. So this, uh... this was actually the whole goal of the alterNEP design for masks -- making all this stuff easy for people (like you, apparently?) 
that want support for ignored values, separately from missing data, and want a nice clean API for it. Basically having a separate .mask attribute which was an ordinary, assignable array broadcastable to the attached array's shape. Nobody seemed interested in talking about it much then but maybe there's interest now? Come off it, Nathaniel, the problem is minor and fixable. The intent of the initial implementation was to discover such things. These things are less accessible with the current API *precisely* because of the feedback from R users. It didn't start that way. We now have something to evolve into what we want. That is a heck of a lot more useful than endless discussion. Chuck ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Missing data again
On Wed, Mar 7, 2012 at 1:26 PM, Nathaniel Smith n...@pobox.com wrote: On Wed, Mar 7, 2012 at 5:17 PM, Charles R Harris charlesr.har...@gmail.com wrote: On Wed, Mar 7, 2012 at 9:35 AM, Pierre Haessig pierre.haes...@crans.org Coming back to Travis' proposition "bit-pattern approaches to missing data (*at least* for float64 and int32) need to be implemented", I wonder what is the amount of extra work to go from nafloat64 to nafloat32/16? Is there hardware support for NaN payloads in these smaller floats? If not, or if it is too complicated, I feel it is acceptable to say "it's too complicated" and fall back to masks. One may have to choose between fancy types and fancy NAs... I'm in agreement here, and that was a major consideration in making a 'masked' implementation first. When it comes to missing data, bitpatterns can do everything that masks can do, are no more complicated to implement, and have better performance characteristics. Not true: bitpatterns inherently destroy the data, while masks do not. For matplotlib, we cannot use bitpatterns because it could over-write user data (or we would have to copy the data). I would imagine other extension writers would have similar issues when they need to play around with input data in a safe manner. Also, I doubt that the performance characteristics for strings and integers are the same as they are for floats. Ben Root
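Ben's distinction can be shown in a few lines: writing a bit-pattern NA (approximated here by NaN) overwrites the stored value, while a mask leaves it recoverable underneath. A minimal sketch:

```python
import numpy as np
import numpy.ma as ma

a = np.array([1.0, 2.0, 3.0])

# Bit-pattern style: the original 2.0 is destroyed in place.
b = a.copy()
b[1] = np.nan

# Mask style: the value is hidden, but still present under the mask.
m = ma.array(a, mask=[False, True, False])
```

This is why a library that must not modify user input either needs a mask or has to copy the data before marking anything as missing.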
Re: [Numpy-discussion] Missing data again
Hi, On Wed, Mar 7, 2012 at 11:37 AM, Charles R Harris charlesr.har...@gmail.com wrote: On Wed, Mar 7, 2012 at 12:26 PM, Nathaniel Smith n...@pobox.com wrote: On Wed, Mar 7, 2012 at 5:17 PM, Charles R Harris charlesr.har...@gmail.com wrote: On Wed, Mar 7, 2012 at 9:35 AM, Pierre Haessig pierre.haes...@crans.org Coming back to Travis proposition bit-pattern approaches to missing data (*at least* for float64 and int32) need to be implemented., I wonder what is the amount of extra work to go from nafloat64 to nafloat32/16 ? Is there an hardware support NaN payloads with these smaller floats ? If not, or if it is too complicated, I feel it is acceptable to say it's too complicated and fall back to mask. One may have to choose between fancy types and fancy NAs... I'm in agreement here, and that was a major consideration in making a 'masked' implementation first. When it comes to missing data, bitpatterns can do everything that masks can do, are no more complicated to implement, and have better performance characteristics. Maybe for float, for other things, no. And we have lots of otherthings. The performance is a strawman, and it *isn't* easier to implement. Also, different folks adopt different values for 'missing' data, and distributing one or several masks along with the data is another common practice. True, but not really relevant to the current debate, because you have to handle such issues as part of your general data import workflow anyway, and none of these is any more complicated no matter which implementations are available. One inconvenience I have run into with the current API is that is should be easier to clear the mask from an ignored value without taking a new view or assigning known data. So maybe two types of masks (different payloads), or an additional flag could be helpful. The process of assigning masks could also be made a bit easier than using fancy indexing. So this, uh... 
this was actually the whole goal of the alterNEP design for masks -- making all this stuff easy for people (like you, apparently?) that want support for ignored values, separately from missing data, and want a nice clean API for it. Basically having a separate .mask attribute which was an ordinary, assignable array broadcastable to the attached array's shape. Nobody seemed interested in talking about it much then but maybe there's interest now? Come off it, Nathaniel, the problem is minor and fixable. The intent of the initial implementation was to discover such things. These things are less accessible with the current API *precisely* because of the feedback from R users. It didn't start that way. We now have something to evolve into what we want. That is a heck of a lot more useful than endless discussion. The endless discussion is for the following reason: - The discussion was never adequately resolved. The discussion was never adequately resolved because there was not enough work done to understand the various arguments. In particular, you've several times said things that indicate to me, as to Nathaniel, that you either have not read or have not understood the points that Nathaniel was making. Travis' recent email - to me - also indicates that there is still a genuine problem here that has not been adequately explored. There is no future in trying to stop discussion, and trying to do so will only prolong it and make it less useful. It will make the discussion - endless. If you want to help - read the alterNEP, respond to it directly, and further the discussion by engaged debate. Best, Matthew ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Missing data again
On 03/07/2012 09:26 AM, Nathaniel Smith wrote: On Wed, Mar 7, 2012 at 5:17 PM, Charles R Harris charlesr.har...@gmail.com wrote: On Wed, Mar 7, 2012 at 9:35 AM, Pierre Haessigpierre.haes...@crans.org Coming back to Travis proposition bit-pattern approaches to missing data (*at least* for float64 and int32) need to be implemented., I wonder what is the amount of extra work to go from nafloat64 to nafloat32/16 ? Is there an hardware support NaN payloads with these smaller floats ? If not, or if it is too complicated, I feel it is acceptable to say it's too complicated and fall back to mask. One may have to choose between fancy types and fancy NAs... I'm in agreement here, and that was a major consideration in making a 'masked' implementation first. When it comes to missing data, bitpatterns can do everything that masks can do, are no more complicated to implement, and have better performance characteristics. Also, different folks adopt different values for 'missing' data, and distributing one or several masks along with the data is another common practice. True, but not really relevant to the current debate, because you have to handle such issues as part of your general data import workflow anyway, and none of these is any more complicated no matter which implementations are available. One inconvenience I have run into with the current API is that is should be easier to clear the mask from an ignored value without taking a new view or assigning known data. So maybe two types of masks (different payloads), or an additional flag could be helpful. The process of assigning masks could also be made a bit easier than using fancy indexing. So this, uh... this was actually the whole goal of the alterNEP design for masks -- making all this stuff easy for people (like you, apparently?) that want support for ignored values, separately from missing data, and want a nice clean API for it. 
Basically having a separate .mask attribute which was an ordinary, assignable array broadcastable to the attached array's shape. Nobody seemed interested in talking about it much then but maybe there's interest now? In other words, good low-level support for numpy.ma functionality? With a migration path so that a separate numpy.ma might wither away? Yes, there is interest; this is exactly what I think is needed for my own style of applications (which I think are common at least in geoscience), and for matplotlib. The question is how to achieve it as simply and cleanly as possible while also satisfying the needs of the R users, and while making it easy for matplotlib, for example, to handle *any* reasonable input: ma, other masking, nan, or NA-bitpattern. It may be that a rather pragmatic approach to implementation will prove better than a highly idealized set of data models. Or, it may be that a dual approach is best, in which the flag value missing data implementation is tightly bound to the R model and the mask implementation is explicitly designed for the numpy.ma model. In any case, a reasonable level of agreement on the goals is needed. I presume Travis's involvement will facilitate a clarification of the goals and of the implementation; and I expect that much of Mark's work will end up serving well, even if much needs to be added and the API evolves considerably. Eric -- Nathaniel
Re: [Numpy-discussion] Missing data again
Hi, On 07/03/2012 20:57, Eric Firing wrote: In other words, good low-level support for numpy.ma functionality? Coming back to *existing* ma support, I was just wondering whether it was now possible to np.save a masked array. (I'm using numpy 1.5.) In the end, this is the most annoying problem I have with the existing ma module, which otherwise is pretty useful to me. I'm happy not to need to process 100% of my data, though. Best, Pierre
Re: [Numpy-discussion] Missing data again
On 03/07/2012 11:15 AM, Pierre Haessig wrote: Hi, On 07/03/2012 20:57, Eric Firing wrote: In other words, good low-level support for numpy.ma functionality? Coming back to *existing* ma support, I was just wondering whether it was now possible to np.save a masked array. (I'm using numpy 1.5.) No, not with the mask preserved. This is one of the improvements I am hoping for with the upcoming missing data work. Eric In the end, this is the most annoying problem I have with the existing ma module, which otherwise is pretty useful to me. I'm happy not to need to process 100% of my data, though. Best, Pierre
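Until np.save learns about masks, one common workaround is to serialize the data and mask side by side and rebuild the masked array on load. A hedged sketch, using an in-memory buffer in place of an on-disk .npz file:

```python
import io

import numpy as np
import numpy.ma as ma

a = ma.array([1.0, 2.0, 3.0], mask=[False, True, False])

# Save the payload and the mask as two named arrays in one archive.
buf = io.BytesIO()  # stands in for a real file path
np.savez(buf, data=a.data, mask=a.mask)

# Reload and reattach the mask.
buf.seek(0)
f = np.load(buf)
restored = ma.array(f['data'], mask=f['mask'])
```

This preserves both the masked status and the hidden values, which plain np.save on the masked array does not.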
Re: [Numpy-discussion] Missing data again
On Wed, Mar 7, 2012 at 7:37 PM, Charles R Harris charlesr.har...@gmail.com wrote: On Wed, Mar 7, 2012 at 12:26 PM, Nathaniel Smith n...@pobox.com wrote: When it comes to missing data, bitpatterns can do everything that masks can do, are no more complicated to implement, and have better performance characteristics. Maybe for float, for other things, no. And we have lots of otherthings. It would be easier to discuss this if you'd, like, discuss :-(. If you know of some advantage that masks have over bitpatterns when it comes to missing data, can you please share it, instead of just asserting it? Not that I'm immune... I perhaps should have been more explicit myself, when I said performance characteristics, let me clarify that I was thinking of both speed (for floats) and memory (for most-but-not-all things). The performance is a strawman, How many users need to speak up to say that this is a serious problem they have with the current implementation before you stop calling it a strawman? Because when Wes says that it's not going to fly for his stats/econometics cases, and the neuroimaging folk like Gary and Matt say it's not going to fly for their use cases... surely just waving that away is a bit dismissive? I'm not saying that we *have* to implement bitpatterns because performance is *the most important feature* -- I'm just saying, well, what I said. For *missing data use* cases, bitpatterns have better performance characteristics than masks. If we decide that these use cases are important, then we should take this into account and weigh it against other considerations. Maybe what you think is that these use cases shouldn't be the focus of this feature and it should focus on the ignored use cases instead? That would be a legitimate argument... but if that's what you want to say, say it, don't just dismiss your users! and it *isn't* easier to implement. If I thought bitpatterns would be easier to implement, I would have said so... 
What I said was that they're not harder. You have some extra complexity, mostly in casting, and some reduced complexity -- no need to allocate and manipulate the mask. (E.g., simple same-type assignments and slicing require special casing for masks, but not for bitpatterns.) In many places the complexity is identical -- printing routines need to check for either special bitpatterns or masked values, whatever. Ufunc loops need to either find the appropriate part of the mask, or create a temporary mask buffer by calling a dtype func, whatever. On net they seem about equivalent, complexity-wise. ...I assume you disagree with this analysis, since I've said it before, wrote up a sketch for how the implementation would work at the C level, etc., and you continue to claim that simplicity is a compelling advantage for the masked approach. But I still don't know why you think that :-(. Also, different folks adopt different values for 'missing' data, and distributing one or several masks along with the data is another common practice. True, but not really relevant to the current debate, because you have to handle such issues as part of your general data import workflow anyway, and none of these is any more complicated no matter which implementations are available. One inconvenience I have run into with the current API is that it should be easier to clear the mask from an ignored value without taking a new view or assigning known data. So maybe two types of masks (different payloads), or an additional flag could be helpful. The process of assigning masks could also be made a bit easier than using fancy indexing. So this, uh... this was actually the whole goal of the alterNEP design for masks -- making all this stuff easy for people (like you, apparently?) that want support for ignored values, separately from missing data, and want a nice clean API for it.
Basically having a separate .mask attribute which was an ordinary, assignable array broadcastable to the attached array's shape. Nobody seemed interested in talking about it much then but maybe there's interest now? Come off it, Nathaniel, the problem is minor and fixable. The intent of the initial implementation was to discover such things. Implementation can be wonderful, I absolutely agree. But you understand that I'd be more impressed by this example if your discovery weren't something I had been arguing for since before the implementation began :-). These things are less accessible with the current API *precisely* because of the feedback from R users. It didn't start that way. We now have something to evolve into what we want. That is a heck of a lot more useful than endless discussion. No, you are still missing the point completely! There is no what *we* want, because what you want is different than what I want. The masking stuff in the alterNEP was an attempt to give people like you who wanted ignored support what they wanted, and the bitpattern stuff was to
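The "ordinary, assignable .mask attribute" interface discussed above can be illustrated with np.ma, which already exposes roughly that API today (a sketch of the usage pattern, not of the alterNEP implementation):

```python
import numpy as np

# np.ma's mask is an ordinary, user-visible boolean array that can be
# assigned to directly, broadcast against the data's shape.
m = np.ma.masked_array([1.0, 2.0, 7.0], mask=[False, False, True])
print(m.sum())        # 3.0 -- the 7.0 is ignored

# Clearing the ignore flag needs no new view and no re-assignment of
# known data: the masked value was never destroyed.
m.mask[2] = False
print(m.sum())        # 10.0
```

This is the convenience that the separate-mask design buys: un-ignoring an element is a one-line mask assignment.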
Re: [Numpy-discussion] Missing data again
On Wed, Mar 7, 2012 at 7:39 PM, Benjamin Root ben.r...@ou.edu wrote: On Wed, Mar 7, 2012 at 1:26 PM, Nathaniel Smith n...@pobox.com wrote: When it comes to missing data, bitpatterns can do everything that masks can do, are no more complicated to implement, and have better performance characteristics. Not true. bitpatterns inherently destroys the data, while masks do not. Yes, that's why I only wrote that this is true for missing data, not in general :-). If you have data that is being destroyed, then that's not missing data, by definition. We don't have consensus yet on whether that's the use case we are aiming for, but it's the one that Pierre was worrying about. For matplotlib, we can not use bitpatterns because it could over-write user data (or we have to copy the data). I would imagine other extension writers would have similar issues when they need to play around with input data in a safe manner. Right. You clearly need some sort of masking, either an explicit mask array that you keep somewhere, or one that gets attached to the underlying ndarray in some non-destructive way. Also, I doubt that the performance characteristics for strings and integers are the same as it is for masks. Not sure what you mean by this, but I'd be happy to hear more. -- Nathaniel ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
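Ben's point that bitpatterns "destroy the data" while masks do not can be made concrete with today's tools, using NaN as a stand-in for a bitpattern NA:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])

# Bitpattern-style marking overwrites the user's value in place:
y = x.copy()
y[1] = np.nan            # the original 2.0 is gone unless we copied first

# A mask leaves the underlying buffer untouched:
m = np.ma.masked_array(x, mask=[False, True, False])
print(m.data[1])         # 2.0 -- still recoverable
```

This is why a library like matplotlib, which must not clobber user input, either copies before writing bitpatterns or keeps a mask on the side.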
Re: [Numpy-discussion] Missing data again
Hi Mark, I went through the NA NEP a few days ago, but only too quickly so that my question is probably a rather dumb one. It's about the usability of bitpattern-based NAs, based on your recent post : Le 03/03/2012 22:46, Mark Wiebe a écrit : Also, here's a thought for the usability of NA-float64. As much as global state is a bad idea, something which determines whether implicit float dtypes are NA-float64 or float64 could help. In IPython, pylab mode would default to float64, and statlab or pystat would default to NA-float64. One way to write this might be:

np.set_default_float(np.nafloat64)
np.array([1.0, 2.0, 3.0])
array([ 1., 2., 3.], dtype=nafloat64)
np.set_default_float(np.float64)
np.array([1.0, 2.0, 3.0])
array([ 1., 2., 3.], dtype=float64)

Q: Is it an *absolute* necessity to have two separate dtypes nafloatNN and floatNN to enable NA bitpattern storage ? From a potential user perspective, I feel it would be nice to have NA and non-NA cases look as similar as possible. Your code example is particularly striking : two different dtypes to store (from a user perspective) the exact same content ! If this *could* be avoided, it would be great... I don't know how the NA machinery is working in R. Does it work with a kind of nafloat64 all the time or is there some type inference mechanics involved in choosing the appropriate type ? Best, Pierre ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Missing data again
Hi Pierre, On Tue, Mar 6, 2012 at 5:48 AM, Pierre Haessig pierre.haes...@crans.orgwrote: Hi Mark, I went through the NA NEP a few days ago, but only too quickly so that my question is probably a rather dumb one. It's about the usability of bitpatter-based NAs, based on your recent post : Le 03/03/2012 22:46, Mark Wiebe a écrit : Also, here's a thought for the usability of NA-float64. As much as global state is a bad idea, something which determines whether implicit float dtypes are NA-float64 or float64 could help. In IPython, pylab mode would default to float64, and statlab or pystat would default to NA-float64. One way to write this might be: np.set_default_float(np.nafloat64) np.array([1.0, 2.0, 3.0]) array([ 1., 2., 3.], dtype=nafloat64) np.set_default_float(np.float64) np.array([1.0, 2.0, 3.0]) array([ 1., 2., 3.], dtype=float64) Q: Is is an *absolute* necessity to have two separate dtypes nafloatNN and floatNN to enable NA bitpattern storage ? From a potential user perspective, I feel it would be nice to have NA and non-NA cases look as similar as possible. Your code example is particularly striking : two different dtypes to store (from a user perspective) the exact same content ! If this *could* be avoided, it would be great... The biggest reason to keep the two types separate is performance. The straight float dtypes map directly to hardware floating-point operations, which can be very fast. The NA-float dtypes have to use additional logic to handle the NA values correctly. NA is treated as a particular NaN, and if the hardware float operations were used directly, NA would turn into NaN. This additional logic usually means more branches, so is slower. One possibility we could consider is to automatically convert an array's dtype from float64 to nafloat64 the first time an NA is assigned. This would have good performance when there are no NAs, but would transparently switch on NA support when it's needed. I don't know how the NA machinery is working R. 
Does it works with a kind of nafloat64 all the time or is there some type inference mechanics involved in choosing the appropriate type ? My understanding of R is that it works with the nafloat64 for all its operations, yes. Cheers, Mark Best, Pierre ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
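Mark's "switch on NA support when it's needed" idea could look something like the sketch below. Everything here is hypothetical: `NA`, `AutoNAArray`, and `dtype_name` are illustrative names only; no such promotion machinery exists in NumPy.

```python
import numpy as np

NA = object()  # hypothetical NA singleton, for illustration only

class AutoNAArray:
    """Sketch of the promote-on-first-NA idea: start as a plain float64
    array (full hardware speed), and only switch to an NA-aware
    representation the first time an NA is assigned."""
    def __init__(self, values):
        self.data = np.asarray(values, dtype=np.float64)
        self.na_mask = None   # None means "plain float64, no NA support yet"

    def __setitem__(self, idx, value):
        if value is NA:
            if self.na_mask is None:          # first NA: promote
                self.na_mask = np.zeros(self.data.shape, bool)
            self.na_mask[idx] = True
        else:
            self.data[idx] = value
            if self.na_mask is not None:      # assigning real data clears NA
                self.na_mask[idx] = False

    @property
    def dtype_name(self):
        return "float64" if self.na_mask is None else "nafloat64"

a = AutoNAArray([1.0, 2.0, 3.0])
print(a.dtype_name)   # float64
a[1] = NA
print(a.dtype_name)   # nafloat64
```

The trade-off Mark describes falls out directly: until the first NA assignment, operations could dispatch to the plain fast path.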
Re: [Numpy-discussion] Missing data again
On Sat, Mar 3, 2012 at 8:30 PM, Travis Oliphant tra...@continuum.io wrote: Hi all, Hi Travis, Thanks for bringing this back up. Have you looked at the summary from the last thread? https://github.com/njsmith/numpy/wiki/NA-discussion-status The goal was to try and at least work out what points we all *could* agree on, to have some common footing for further discussion. I won't copy the whole thing here, but I'd summarize the state as: -- It's pretty clear that there are two fairly different conceptual models/use cases in play here. For one of them (R-style missing data cases) it's pretty clear what the desired semantics would be. For the other (temporary ignored values) there's still substantive disagreement. -- We *haven't* yet established what we want numpy to actually support. IMHO the critical next step is this latter one -- maybe we want to fully support both use cases. Maybe it's really only one of them that's worth trying to support in the numpy core right now. Maybe it's just one of them, but it's worth doing so thoroughly that it should have multiple implementations. Or whatever. I fear that if we don't talk about these big picture questions and just wade directly back into round-and-round arguments about API details then we'll never get anywhere. [...] Because it is slated to go into release 1.7, we need to re-visit the masked array discussion again. The NEP process is the appropriate one and I'm glad we are taking that route for these discussions. My goal is to get consensus in order for code to get into NumPy (regardless of who writes the code). It may be that we don't come to a consensus (reasonable and intelligent people can disagree on things --- look at the coming election...). We can represent different parts of what is fortunately a very large user-base of NumPy users. First of all, I want to be clear that I think there is much great work that has been done in the current missing data code. 
There are some nice features in the where clause of the ufunc and the machinery for the iterator that allows re-using ufunc loops that are not re-written to check for missing data. I'm sure there are other things as well that I'm not quite aware of yet. However, I don't think the API presented to the numpy user presently is the correct one for NumPy 1.X. A few particulars: * the reduction operations need to default to skipna --- this is the most common use case which has been reinforced again to me today by a new user to Python who is using masked arrays presently This is one of the points where the two conceptual models disagree (see also Skipper's point down-thread). If you have missing data, then propagation has to be the default -- the sum of 1, 2, and I-DON'T-KNOW-MAYBE-7-MAYBE-12 is not 3. If you have some data there but you've asked numpy to temporarily ignore it, then, well, duh, of course it should ignore it. * the mask needs to be visible to the user if they use that approach to missing data (people should be able to get a hold of the mask and work with it in Python) This is also a point where the two conceptual models disagree. Actually this is one of the original arguments we made against the NEP design -- that if you want missing data, then having a mask at all is counterproductive, and if you are ignoring data, then of course it should be easy to manipulate the ignore mask. The rationale for the current design is to compromise between these two approaches -- there is a mask, but it's hidden behind a curtain. Mostly. (This may be a compromise in the Solomonic sense.) * bit-pattern approaches to missing data (at least for float64 and int32) need to be implemented. * there should be some way when using masks (even if it's hidden from most users) for missing data to separate the low-level ufunc operation from the operation on the masks... I don't understand what this means.
I have heard from several users that they will *not use the missing data* in NumPy as currently implemented, and I can now see why. For better or for worse, my approach to software is generally very user-driven and very pragmatic. On the other hand, I'm also a mathematician and appreciate the cognitive compression that can come out of well-formed structure. None-the-less, I'm an *applied* mathematician and am ultimately motivated by applications. I will get a hold of the NEP and spend some time with it to discuss some of this in that document. This will take several weeks (as PyCon is next week and I have a tutorial I'm giving there). For now, I do not think 1.7 can be released unless the masked array is labeled *experimental*. In project management terms, I see three options: 1) Put a big warning label on the functionality and leave it for now (If this option is given, np.asarray returns a masked array. NOTE: IN THE NEXT RELEASE, IT MAY INSTEAD RETURN A BAG OF
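The propagate-vs-skip disagreement in this exchange can be made concrete with tools that already exist, using NaN as a stand-in for a missing value:

```python
import numpy as np

x = np.array([1.0, 2.0, np.nan])

# Missing-data semantics: an unknown value poisons the reduction.
print(np.sum(x))      # nan

# Ignored-value semantics (what a skipna default would give):
print(np.nansum(x))   # 3.0

# np.ma gives the skipping behaviour today via a mask:
m = np.ma.masked_array([1.0, 2.0, 7.0], mask=[False, False, True])
print(m.sum())        # 3.0
```

Both behaviours are coherent; the dispute is only over which one a bare `.sum()` should mean by default.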
Re: [Numpy-discussion] Missing data again
On Tue, Mar 6, 2012 at 4:38 PM, Mark Wiebe mwwi...@gmail.com wrote: On Tue, Mar 6, 2012 at 5:48 AM, Pierre Haessig pierre.haes...@crans.org wrote: From a potential user perspective, I feel it would be nice to have NA and non-NA cases look as similar as possible. Your code example is particularly striking : two different dtypes to store (from a user perspective) the exact same content ! If this *could* be avoided, it would be great... The biggest reason to keep the two types separate is performance. The straight float dtypes map directly to hardware floating-point operations, which can be very fast. The NA-float dtypes have to use additional logic to handle the NA values correctly. NA is treated as a particular NaN, and if the hardware float operations were used directly, NA would turn into NaN. This additional logic usually means more branches, so is slower. Actually, no -- hardware float operations preserve NA-as-NaN. You might well need to be careful around more exotic code like optimized BLAS kernels, but all the basic ufuncs should Just Work at full speed. Demo:

def hexify(x): return hex(np.float64(x).view(np.int64))
hexify(np.nan)
'0x7ff8000000000000L'
# IIRC this is R's NA bitpattern (presumably 1974 is someone's birthday)
NA = np.int64(0x7ff8000000000000 + 1974).view(np.float64)
# It is an NaN...
NA
nan
# But it has a distinct bitpattern:
hexify(NA)
'0x7ff80000000007b6L'
# Like any NaN, it propagates through floating point operations:
NA + 3
nan
# But, critically, so does the bitpattern; ordinary Python + is returning NA on this operation:
hexify(NA + 3)
'0x7ff80000000007b6L'

This is how R does it, which is more evidence that this actually works on real hardware. There is one place where it fails. In a binary operation with *two* NaN values, there's an ambiguity about which payload should be returned. IEEE754 recommends just returning the first one. This means that NA + NaN = NA, NaN + NA = NaN.
This is ugly, but it's an obscure case that nobody cares about, so it's probably worth it for the speed gain. (In fact, if you type those two expressions at the R prompt, then that's what you get, and I can't find any reference to anyone even noticing this.) I don't know how the NA machinery is working R. Does it works with a kind of nafloat64 all the time or is there some type inference mechanics involved in choosing the appropriate type ? My understanding of R is that it works with the nafloat64 for all its operations, yes. Right -- R has a very impoverished type system as compared to numpy. There's basically four types: numeric (meaning double precision float), integer, logical (boolean), and character (string). And in practice the integer type is essentially unused, because R parses numbers like 1 as being floating point, not integer; the only way to get an integer value is to explicitly cast to it. Each of these types has a specific bit-pattern set aside for representing NA. And... that's it. It's very simple when it works, but also very limited. I'm still skeptical that we could make the floating point types NA-aware by default -- until we have an implementation in hand, I'm nervous there'd be some corner case that broke everything. (Maybe ufuncs are fine but np.dot has an unavoidable overhead, or maybe it would mess up casting from float types to non-NA-aware types, etc.) But who knows. Probably not something we can really make a meaningful decision about yet. -- Nathaniel ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Missing data again
On Tue, Mar 6, 2012 at 9:25 PM, Nathaniel Smith n...@pobox.com wrote: On Sat, Mar 3, 2012 at 8:30 PM, Travis Oliphant tra...@continuum.io wrote: Hi all, Hi Travis, Thanks for bringing this back up. Have you looked at the summary from the last thread? https://github.com/njsmith/numpy/wiki/NA-discussion-status Re-reading that summary and the main documents and threads linked from it, I could find either examples of statistical software that treats missing and ignored data explicitly separately, or links to relevant literature. Those would probably help the discussion a lot. The goal was to try and at least work out what points we all *could* agree on, to have some common footing for further discussion. I won't copy the whole thing here, but I'd summarize the state as: -- It's pretty clear that there are two fairly different conceptual models/use cases in play here. For one of them (R-style missing data cases) it's pretty clear what the desired semantics would be. For the other (temporary ignored values) there's still substantive disagreement. -- We *haven't* yet established what we want numpy to actually support. IMHO the critical next step is this latter one -- maybe we want to fully support both use cases. Maybe it's really only one of them that's worth trying to support in the numpy core right now. Maybe it's just one of them, but it's worth doing so thoroughly that it should have multiple implementations. Or whatever. I fear that if we don't talk about these big picture questions and just wade directly back into round-and-round arguments about API details then we'll never get anywhere. [...] Because it is slated to go into release 1.7, we need to re-visit the masked array discussion again. The NEP process is the appropriate one and I'm glad we are taking that route for these discussions.
My goal is to get consensus in order for code to get into NumPy (regardless of who writes the code). It may be that we don't come to a consensus (reasonable and intelligent people can disagree on things --- look at the coming election...). We can represent different parts of what is fortunately a very large user-base of NumPy users. First of all, I want to be clear that I think there is much great work that has been done in the current missing data code. There are some nice features in the where clause of the ufunc and the machinery for the iterator that allows re-using ufunc loops that are not re-written to check for missing data. I'm sure there are other things as well that I'm not quite aware of yet. However, I don't think the API presented to the numpy user presently is the correct one for NumPy 1.X. A few particulars: * the reduction operations need to default to skipna --- this is the most common use case which has been reinforced again to me today by a new user to Python who is using masked arrays presently This is one of the points where the two conceptual models disagree (see also Skipper's point down-thread). If you have missing data, then propagation has to be the default -- the sum of 1, 2, and I-DON'T-KNOW-MAYBE-7-MAYBE-12 is not 3. If you have some data there but you've asked numpy to temporarily ignore it, then, well, duh, of course it should ignore it. * the mask needs to be visible to the user if they use that approach to missing data (people should be able to get a hold of the mask and work with it in Python) This is also a point where the two conceptual models disagree. Actually this is one of the original arguments we made against the NEP design -- that if you want missing data, then having a mask at all is counterproductive, and if you are ignoring data, then of course it should be easy to manipulate the ignore mask.
The rationale for the current design is to compromise between these two approaches -- there is a mask, but it's hidden behind a curtain. Mostly. (This may be a compromise in the Solomonic sense.) * bit-pattern approaches to missing data (at least for float64 and int32) need to be implemented. * there should be some way when using masks (even if it's hidden from most users) for missing data to separate the low-level ufunc operation from the operation on the masks... I don't understand what this means. I have heard from several users that they will *not use the missing data* in NumPy as currently implemented, and I can now see why. For better or for worse, my approach to software is generally very user-driven and very pragmatic. On the other hand, I'm also a mathematician and appreciate the cognitive compression that can come out of well-formed structure. None-the-less, I'm an *applied* mathematician and am ultimately motivated by applications. I will get a hold of the NEP and spend some time with it to discuss some of this in that document. This will take several weeks (as PyCon is next week and I have a
Re: [Numpy-discussion] Missing data again
On Tue, Mar 6, 2012 at 9:14 PM, Ralf Gommers ralf.gomm...@googlemail.com wrote: On Tue, Mar 6, 2012 at 9:25 PM, Nathaniel Smith n...@pobox.com wrote: On Sat, Mar 3, 2012 at 8:30 PM, Travis Oliphant tra...@continuum.io wrote: Hi all, Hi Travis, Thanks for bringing this back up. Have you looked at the summary from the last thread? https://github.com/njsmith/numpy/wiki/NA-discussion-status Re-reading that summary and the main documents and threads linked from it, I could find either examples of statistical software that treats missing and ignored data explicitly separately, or links to relevant literature. Those would probably help the discussion a lot. (I think you mean couldn't find?) I'm not aware of any software that supports the IGNORED concept at all, whether in combination with missing data or not. np.ma is probably the closest example. I think we'd be breaking new ground there. This is also probably why it is less clear how it should work :-). IIUC, the basic reason that people want IGNORED in the core is that it provides convenience and syntactic sugar for efficient in place operation on subsets of large arrays. So there are actually two parts there -- the efficient operation, and the convenience/syntactic sugar. The key feature for efficient operation is the where= feature, which is not controversial at all. So, there's an argument that for now we should focus on where=, give people some time to work with it, and then use that experience to decide what kind of convenience/sugar would be useful, if any. But, that's just my own idea; I definitely can't claim any consensus on it. In project management terms, I see three options: 1) Put a big warning label on the functionality and leave it for now (If this option is given, np.asarray returns a masked array. NOTE: IN THE NEXT RELEASE, IT MAY INSTEAD RETURN A BAG OF RABID, HUNGRY WEASELS. NO GUARANTEES.) I've opened http://projects.scipy.org/numpy/ticket/2072 for that. Cool, thanks. 
Assuming we stick with this option, I'd appreciate it if you could check in the first beta that comes out whether or not the warnings are obvious enough and in all the right places. There probably won't be weasels though :) Of course. I've added myself to the CC list. (Err, if the beta won't be for a bit, though, then please remind me if you remember? I'm juggling a lot of balls right now.) 2) Move the code back out of mainline and into a branch until there's consensus. 3) Hold up the release until this is all sorted. I come from the project-management school that says you should always have a releasable mainline, keep unready code in branches, and never hold up the release for features, so (2) seems obvious to me. While it may sound obvious, I hope you've understood why in practice it's not at all obvious and why you got such strong reactions to your proposal of taking out all that code. If not, just look at what happened with the numpy-refactor work. Of course, and that's why I'm not pressing the point. These trade-offs might be worth talking about at some point -- there are reasons that basically all the major FOSS projects have moved towards time-based releases :-) -- but that'd be a huge discussion at a time when we already have more than enough of those on our plate... But I seem to be very much in the minority on that[1], so oh well :-). I don't have any objection to (1), personally. (3) seems like a bad idea. Just my 2 pence. Agreed that (3) is a bad idea. +1 for (1). Ralf Cheers, -- Nathaniel ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
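The where= feature Nathaniel calls "not controversial at all" did survive as a ufunc keyword argument in released NumPy; a minimal illustration of efficient in-place operation on a subset of an array, without copying:

```python
import numpy as np

# Ufuncs accept a where= boolean mask selecting which elements to
# operate on; slots where the mask is False keep whatever `out`
# already held (here, the zeros it was initialized with).
a = np.array([1.0, 2.0, 3.0, 4.0])
keep = np.array([True, False, True, False])
out = np.zeros_like(a)
np.multiply(a, 10.0, out=out, where=keep)
print(out)   # [10.  0. 30.  0.]
```

This covers the "efficient operation" half of the IGNORED use case; the convenience/sugar half is what the rest of the thread is arguing about.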
[Numpy-discussion] Missing data again
Hi all, I've been thinking a lot about the masked array implementation lately. I finally had the time to look hard at what has been done and now am of the opinion that I do not think that 1.7 can be released with the current state of the masked array implementation *unless* it is clearly marked as experimental and may be changed in 1.8. I wish I had been able to be a bigger part of this conversation last year. But, that is why I took the steps I took to try and figure out another way to feed my family *and* stay involved in the NumPy community. I would love to stay involved in what is happening in the SciPy community, but I am more satisfied with what Ralf, Warren, Robert, Pauli, Josef, Charles, Stefan, and others are doing there right now, and don't have time to keep up with everything. Even though SciPy was the heart and soul of why I even got involved with Python for open source in the first place and took many years of my volunteer labor, I won't be able to spend significant time on SciPy code over the coming months. At some point, I really hope to be able to make contributions again to that code-base. Time will tell whether or not my aspirations will be realized. It depends quite a bit on whether or not my kids have what they need from me (which right now is money and time). NumPy, on the other hand, is not in a position where I can feel comfortable leaving my baby to others. I recognize and value the contributions from many people to make NumPy what it is today (e.g.
code contributions, code rearrangement and standardization, build and install improvement, and most recently some architectural changes). But, I feel a personal responsibility for the code base as I spent a great many months writing NumPy in the first place, and I've spent a great deal of time interacting with NumPy users and feel like I have at least some sense of their stories. Of course, I built on the shoulders of giants, and much of what is there is *because of* where the code was adapted from (it was not created de-novo). Currently, there remains much that needs to be communicated, improved, and worked on, and I have specific opinions about what some changes and improvements should be, how they should be written, and how the resulting users need to be benefited. It will take time to discuss all of this, and that's where I will spend my open-source time in the coming months. In that vein: Because it is slated to go into release 1.7, we need to re-visit the masked array discussion again. The NEP process is the appropriate one and I'm glad we are taking that route for these discussions. My goal is to get consensus in order for code to get into NumPy (regardless of who writes the code). It may be that we don't come to a consensus (reasonable and intelligent people can disagree on things --- look at the coming election...). We can represent different parts of what is fortunately a very large user-base of NumPy users. First of all, I want to be clear that I think there is much great work that has been done in the current missing data code. There are some nice features in the where clause of the ufunc and the machinery for the iterator that allows re-using ufunc loops that are not re-written to check for missing data. I'm sure there are other things as well that I'm not quite aware of yet. However, I don't think the API presented to the numpy user presently is the correct one for NumPy 1.X.
A few particulars: * the reduction operations need to default to skipna --- this is the most common use case which has been reinforced again to me today by a new user to Python who is using masked arrays presently * the mask needs to be visible to the user if they use that approach to missing data (people should be able to get a hold of the mask and work with it in Python) * bit-pattern approaches to missing data (at least for float64 and int32) need to be implemented. * there should be some way when using masks (even if it's hidden from most users) for missing data to separate the low-level ufunc operation from the operation on the masks... I have heard from several users that they will *not use the missing data* in NumPy as currently implemented, and I can now see why. For better or for worse, my approach to software is generally very user-driven and very pragmatic. On the other hand, I'm also a mathematician and appreciate the cognitive compression that can come out of well-formed structure. None-the-less, I'm an *applied* mathematician and am ultimately motivated by applications. I will get a hold of the NEP and spend some time with it to discuss some of this in that document. This will take several weeks (as PyCon is next week and I have a tutorial I'm giving there). For now, I do not think 1.7 can be released unless the masked array is labeled
Re: [Numpy-discussion] Missing data again
On Sat, Mar 3, 2012 at 1:30 PM, Travis Oliphant tra...@continuum.io wrote: Hi all, I've been thinking a lot about the masked array implementation lately. I finally had the time to look hard at what has been done and now am of the opinion that I do not think that 1.7 can be released with the current state of the masked array implementation *unless* it is clearly marked as experimental and may be changed in 1.8. That was the intention. I wish I had been able to be a bigger part of this conversation last year. But, that is why I took the steps I took to try and figure out another way to feed my family *and* stay involved in the NumPy community. I would love to stay involved in what is happening in the SciPy community, but I am more satisfied with what Ralf, Warren, Robert, Pauli, Josef, Charles, Stefan, and others are doing there right now, and don't have time to keep up with everything. Even though SciPy was the heart and soul of why I even got involved with Python for open source in the first place and took many years of my volunteer labor, I won't be able to spend significant time on SciPy code over the coming months. At some point, I really hope to be able to make contributions again to that code-base. Time will tell whether or not my aspirations will be realized. It depends quite a bit on whether or not my kids have what they need from me (which right now is money and time). NumPy, on the other hand, is not in a position where I can feel comfortable leaving my baby to others. I recognize and value the contributions from many people to make NumPy what it is today (e.g.
code contributions, code rearrangement and standardization, build and install improvement, and most recently some architectural changes).But, I feel a personal responsibility for the code base as I spent a great many months writing NumPy in the first place, and I've spent a great deal of time interacting with NumPy users and feel like I have at least some sense of their stories.Of course, I built on the shoulders of giants, and much of what is there is *because of* where the code was adapted from (it was not created de-novo). Currently, there remains much that needs to be communicated, improved, and worked on, and I have specific opinions about what some changes and improvements should be, how they should be written, and how the resulting users need to be benefited. It will take time to discuss all of this, and that's where I will spend my open-source time in the coming months. In that vein: Because it is slated to go into release 1.7, we need to re-visit the masked array discussion again.The NEP process is the appropriate one and I'm glad we are taking that route for these discussions. My goal is to get consensus in order for code to get into NumPy (regardless of who writes the code).It may be that we don't come to a consensus (reasonable and intelligent people can disagree on things --- look at the coming election...). We can represent different parts of what is fortunately a very large user-base of NumPy users. First of all, I want to be clear that I think there is much great work that has been done in the current missing data code. There are some nice features in the where clause of the ufunc and the machinery for the iterator that allows re-using ufunc loops that are not re-written to check for missing data. I'm sure there are other things as well that I'm not quite aware of yet.However, I don't think the API presented to the numpy user presently is the correct one for NumPy 1.X. 
A few particulars:
* the reduction operations need to default to skipna --- this is the most common use case, which has been reinforced again to me today by a new user to Python who is presently using masked arrays
* the mask needs to be visible to the user if they use that approach to missing data (people should be able to get a hold of the mask and work with it in Python)
* bit-pattern approaches to missing data (at least for float64 and int32) need to be implemented
* there should be some way when using masks (even if it's hidden from most users) for missing data to separate the low-level ufunc operation from the operation on the masks...

Mind, Mark only had a few weeks to write code. I think the unfinished state is a direct function of that. I have heard from several users that they will *not use the missing data* in NumPy as currently implemented, and I can now see why. For better or for worse, my approach to software is generally very user-driven and very pragmatic. On the other hand, I'm also a mathematician and appreciate the cognitive compression that can come out of well-formed structure. Nonetheless, I'm an *applied* mathematician and am ultimately motivated by applications. I think that would be Wes. I thought the current state wasn't that far away from what he wanted
Re: [Numpy-discussion] Missing data again
Mind, Mark only had a few weeks to write code. I think the unfinished state is a direct function of that. I have heard from several users that they will *not use the missing data* in NumPy as currently implemented, and I can now see why. For better or for worse, my approach to software is generally very user-driven and very pragmatic. On the other hand, I'm also a mathematician and appreciate the cognitive compression that can come out of well-formed structure. Nonetheless, I'm an *applied* mathematician and am ultimately motivated by applications. I think that would be Wes. I thought the current state wasn't that far away from what he wanted in the only post where he was somewhat explicit. I think it would be useful for him to sit down with Mark at some time and thrash things out, since I think there is some misunderstanding involved. Actually, it wasn't Wes. It was 3 other people. I'm already well aware of Wes's perspective and actually think his concerns have been handled already. Also, the person who showed me their use-case was a new user. But your point about getting people together is well-taken. I also recognize the fact that there have been (and likely continue to be) misunderstandings on multiple fronts. Fortunately, many of us will be at PyCon later this week. We tried really hard to get Mark Wiebe here this weekend as well --- but he could only sacrifice a week away from his degree work to join us for PyCon. It would be great if you could come to PyCon as well. Perhaps we can apply to NumFOCUS for a travel grant to bring NumPy developers together with other interested people to finish the masked array design and implementation. -Travis ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
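The skipna-by-default behavior Travis asks for is what numpy.ma reductions already do, and NaN-skipping reductions exist for the bit-pattern style. A rough illustration of the two defaults using only released NumPy (the skipna= keyword itself is from the NEP and does not exist today):

```python
import numpy as np

# Bit-pattern style, with NaN standing in for NA: reductions propagate by default
x = np.array([1.0, np.nan, 3.0])
assert np.isnan(np.sum(x))      # propagating default: NA poisons the sum
assert np.nansum(x) == 4.0      # explicit skip-the-missing reduction

# numpy.ma reductions skip masked elements by default --- the default
# being argued for in the message above
m = np.ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])
assert m.sum() == 4.0
```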
[Numpy-discussion] Missing Data development plan
It's been a day less than two weeks since I posted my first feedback request on a masked array implementation of missing data. I'd like to thank everyone that contributed to the discussion, and that continues to contribute. I believe my design is very solid thanks to all the feedback, and I understand at the same time there are still concerns that people have about the design. I sincerely hope that those concerns are further discussed and made more clear, just as I have spent a lot of effort making sure my ideas are clear and understood by everyone in the discussion. Travis has directed me, for the moment, to focus the majority of my attention on the implementation. He will post further thoughts on the design issues in the next few days when he has enough of a break in his schedule. With the short time available for this implementation, my plan is as follows:

1) Implement the masked implementation of NA nearly to completion. This is the quickest way to get something that people can provide hands-on feedback with, and the NA dtype in my design uses the machinery of the masked implementation for all the computational kernels.

2) Assuming there is enough time left, implement the NA[] parameterized dtype in concert with a derived[] dtype and cleanups of the datetime64[] dtype, with the goal of creating some good structure for the possibility of creating more parameterized dtypes in the future. The derived[] dtype idea is based on an idea Travis had which he called computed columns, but generalized to apply in more contexts. When the time comes, I will post a proposal for feedback on this idea as well.

Thanks once again for all the great feedback, and I look forward to getting a prototype into your hands to test as quickly as possible! -Mark
Re: [Numpy-discussion] missing data discussion round 2
On 6/27/11 9:53 AM, Charles R Harris wrote: Some discussion of disk storage might also help. I don't see how the rules can be enforced if two files are used, one for the mask and another for the data, but that may just be something we need to live with. It seems it wouldn't be too big a deal to extend the *.npy format to include the mask. Could one memmap both the data array and the mask? NetCDF (and I assume HDF) have ways to support masks as well. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov
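The two-file arrangement Chuck worries about can at least be prototyped with today's tools. A sketch (the file names and layout are purely illustrative; nothing here is the proposed .npy extension):

```python
import os
import tempfile
import numpy as np

# Hypothetical two-file layout: raw data in one file, boolean mask in another
tmp = tempfile.mkdtemp()
data_path = os.path.join(tmp, "data.bin")
mask_path = os.path.join(tmp, "mask.bin")

data = np.arange(6, dtype=np.float64)
mask = np.zeros(6, dtype=bool)
mask[2] = True                      # mark element 2 as masked
data.tofile(data_path)
mask.tofile(mask_path)

# Memmap both halves and recombine into a masked array. Nothing ties the
# two files together, which is exactly the enforcement problem Chuck raises.
d = np.memmap(data_path, dtype=np.float64, mode="r")
m = np.memmap(mask_path, dtype=bool, mode="r")
arr = np.ma.masked_array(d, mask=m)
assert arr.sum() == 13.0            # 0+1+3+4+5, element 2 skipped
```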
Re: [Numpy-discussion] missing data discussion round 2
On Wed, Jun 29, 2011 at 1:07 PM, Dag Sverre Seljebotn d.s.seljeb...@astro.uio.no wrote: On 06/29/2011 07:38 PM, Mark Wiebe wrote: On Wed, Jun 29, 2011 at 9:35 AM, Dag Sverre Seljebotn d.s.seljeb...@astro.uio.no wrote: On 06/29/2011 03:45 PM, Matthew Brett wrote: Hi, On Wed, Jun 29, 2011 at 12:39 AM, Mark Wiebe mwwi...@gmail.com wrote: On Tue, Jun 28, 2011 at 5:20 PM, Matthew Brett matthew.br...@gmail.com wrote: Hi, On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smith n...@pobox.com wrote: ... (You might think, what difference does it make if you *can* unmask an item? Us missing data folks could just ignore this feature. But: whatever we end up implementing is something that I will have to explain over and over to different people, most of them not particularly sophisticated programmers. And there's just no sensible way to explain this idea that if you store some particular value, then it replaces the old value, but if you store NA, then the old value is still there. Ouch - yes. No question, that is difficult to explain. Well, I think the explanation might go like this: Ah, yes, well, that's because in fact numpy records missing values by using a 'mask'. So when you say `a[3] = np.NA`, what you mean is, `a._mask = np.ones(a.shape, np.dtype(bool)); a._mask[3] = False`. Is that fair? My favorite way of explaining it would be to have a grid of numbers written on paper, then have several cardboards with holes poked in them in different configurations. Placing these cardboard masks in front of the grid would show different sets of non-missing data, without affecting the values stored on the paper behind them. Right - but here of course you are trying to explain the mask, and this is Nathaniel's point: in order to explain NAs, you have to explain masks, and so, even at a basic level, the fusion of the two ideas is obvious, and already confusing.
I mean this: `a[3] = np.NA`. Oh, so you just set the a[3] value to have some missing value code? Ah - no - in fact what I did was set an associated mask at position a[3], so that you can't any longer see the previous value of a[3]. Huh. You mean I have a mask for every single value in order to be able to blank out a[3]? It looks like an assignment. I mean, it looks just like a[3] = 4. But I guess it isn't? Er... I think Nathaniel's point is a very good one - these are separate ideas, np.NA and np.IGNORE, and a joint implementation is bound to draw them together in the mind of the user. Apart from anything else, the user has to know that, if they want a single NA value in an array, they have to add a mask of size array.shape in bytes. They have to know then that NA is implemented by masking, and then the 'NA for free by adding masking' idea breaks down and starts to feel like a kludge. The counter argument is of course that, in time, the implementation of NA with masking will seem as obvious and intuitive as, say, broadcasting, and that we are just reacting from lack of experience with the new API. However, no matter how used we get to this, people coming from almost any other tool (in particular R) will keep thinking it is counter-intuitive. Why set up a major semantic incompatibility that people then have to overcome in order to start using NumPy? I'm not aware of a semantic incompatibility. I believe R doesn't support views like NumPy does, so the things you have to do to see masking semantics aren't even possible in R. Well, whether the same feature is possible or not in R is irrelevant to whether a semantic incompatibility would exist. Views themselves are a *major* semantic incompatibility, and are highly confusing at first to MATLAB/Fortran/R people. However, they have major advantages outweighing the disadvantage of having to caution new users.
But there's simply no precedent anywhere for an assignment that doesn't erase the old value for a particular input value, and the advantages seem pretty minor (well, I think it is ugly in its own right, but that is beside the point...) I disagree that there's no precedent, but maybe there isn't something which is exactly the same as my design. The whole actual real literal assignment thought process leads to considerations of little gnomes writing
Re: [Numpy-discussion] missing data discussion round 2
On Wed, Jun 29, 2011 at 2:32 PM, Matthew Brett matthew.br...@gmail.com wrote: Hi, On Wed, Jun 29, 2011 at 6:22 PM, Mark Wiebe mwwi...@gmail.com wrote: On Wed, Jun 29, 2011 at 8:20 AM, Lluís xscr...@gmx.net wrote: Matthew Brett writes: Maybe instead of np.NA, we could say np.IGNORE, which sort of conveys the idea that the entry is still there, but we're just ignoring it. Of course, that goes against common convention, but it might be easier to explain. I think Nathaniel's point is that np.IGNORE is a different idea than np.NA, and that is why joining the implementations can lead to conceptual confusion. This is how I see it:

>>> a = np.array([0, 1, 2], dtype=int)
>>> a[0] = np.NA
ValueError
>>> e = np.array([np.NA, 1, 2], dtype=int)
ValueError
>>> b = np.array([np.NA, 1, 2], dtype=np.maybe(int))
>>> m = np.array([np.NA, 1, 2], dtype=int, masked=True)
>>> bm = np.array([np.NA, 1, 2], dtype=np.maybe(int), masked=True)
>>> b[1] = np.NA
>>> np.sum(b)
np.NA
>>> np.sum(b, skipna=True)
2
>>> b.mask
None
>>> m[1] = np.NA
>>> np.sum(m)
2
>>> np.sum(m, skipna=True)
2
>>> m.mask
[False, False, True]
>>> bm[1] = np.NA
>>> np.sum(bm)
2
>>> np.sum(bm, skipna=True)
2
>>> bm.mask
[False, False, True]

So:
* Mask takes precedence over bit pattern on element assignment. There's still the question of how to assign a bit pattern NA when the mask is active.
* When using mask, elements are automagically skipped.
* m[1] = np.NA is equivalent to m.mask[1] = False
* When using bit pattern + mask, it might make sense to have the initial values as bit-pattern NAs, instead of masked (i.e., bm.mask == [True, False, True] and np.sum(bm) == np.NA)

There seems to be a general idea that masks and NA bit patterns imply particular differing semantics, something which I think is simply false. Well - first - it's helpful surely to separate the concepts and the implementation. Concepts / use patterns (as delineated by Nathaniel): A) missing values == 'np.NA' in my emails. Can we call that CMV (concept missing values)? B) masks == np.IGNORE in my emails. CMSK (concept masks)?
This is a different conceptual model than I'm proposing in the NEP. This is also exactly what I was trying to clarify in the first email in this thread under the headings Missing Data Abstraction and Implementation Techniques. Masks are *just* an implementation technique. They imply nothing more, except through previously established conventions such as in various bitmasks, image masks, numpy.ma and others. masks != np.IGNORE bit patterns != np.NA Masks vs bit patterns and R's default NA vs na.rm semantics are completely independent, except where design choices are made that they should be related. I think they should be unrelated; masks and bit patterns are two approaches to solving the same problem. Implementations: 1) bit-pattern == na-dtype - how about we call that IBP (implementation bit pattern)? 2) array.mask. IM (implementation mask)? Nathaniel implied that: CMV implies sum([np.NA, 1]) == np.NA; CMSK implies sum([np.NA, 1]) == 1; and indeed, that's how R and masked arrays respectively behave. R and numpy.ma. If we're trying to be clear about our concepts and implementations, numpy.ma is just one possible implementation of masked arrays. So I think it's reasonable to say that at least R thought that the bitmask implied the first, and Pierre and others thought the mask meant the second. R's model is based on years of experience and a model of what missing values imply; the bitmask implies nothing about the behavior of NA. The NEP as it stands thinks of CMV and CMSK as being different views of the same thing. Please correct me if I'm wrong. Both NaN and Inf are implemented in hardware with the same idea as the NA bit pattern, but they do not follow NA missing value semantics. Right - and that doesn't affect the argument, because the argument is about the concepts and not the implementation. You just said R thought bitmasks implied something, and you're saying masked arrays imply something.
If the argument is just about the missing value concepts, neither of these should be in the present discussion. As far as I can tell, the only required difference between them is that NA bit patterns must destroy the data. Nothing else. I think Nathaniel's point was about the expected default behavior in the different concepts. Everything on top of that is a choice of API and interface mechanisms. I want them to behave exactly the same except for that necessary difference, so that it will be possible to use the *exact same Python code* with either approach. Right. And Nathaniel's point is that that desire leads to fusion of the two ideas into one when they should be separated. For example, if I understand correctly: a = np.array([1.0, 2.0, 3, 7.0], masked=True) b = np.array([1.0, 2.0, np.NA, 7.0],
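np.NA, np.maybe(int), and the masked= constructor argument in the exchange above are all hypothetical API from the NEP discussion. The one piece that ships today is numpy.ma, where the non-destructive "assignment" being debated can already be observed: assigning the masked constant only flips the mask and leaves the stored value intact, while an ordinary assignment destroys it.

```python
import numpy as np

a = np.ma.masked_array([0, 1, 2])
a[1] = np.ma.masked       # "assignment" that only hides the element
assert bool(a.mask[1])    # element 1 is now masked...
assert a.data[1] == 1     # ...but the old value is still behind the mask

a[1] = 5                  # ordinary assignment stores a new value and unmasks
assert a[1] == 5
assert a.sum() == 7       # 0 + 5 + 2
```

This is exactly the behavior Nathaniel predicts will be hard to explain: two statements that both look like assignments, only one of which replaces the data.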
Re: [Numpy-discussion] missing data discussion round 2
On Wed, Jun 29, 2011 at 1:20 PM, Lluís xscr...@gmx.net wrote: Mark Wiebe writes: There seems to be a general idea that masks and NA bit patterns imply particular differing semantics, something which I think is simply false. Well, my example contained a difference (the need for the skipna=True argument) precisely because it seemed that there was some need for different defaults. Honestly, I think this difference breaks the POLA (principle of least astonishment). [...] As far as I can tell, the only required difference between them is that NA bit patterns must destroy the data. Nothing else. Everything on top of that is a choice of API and interface mechanisms. I want them to behave exactly the same except for that necessary difference, so that it will be possible to use the *exact same Python code* with either approach. I completely agree. What I'd suggest is a global and/or per-object ndarray.flags.skipna for people like me that just want to ignore these entries without caring about setting it on each operation (or the other way around, depends on the default behaviour). The downside is that it adds yet another tweaking knob, which is not desirable... One way around this would be to create an ndarray subclass which changes that default. Currently this would not be possible to do nicely, but with the __numpy_ufunc__ idea I proposed in a separate thread a while back, this could work. -Mark Lluis -- And it's much the same thing with knowledge, for whenever you learn something new, the whole world becomes that much richer. -- The Princess of Pure Reason, as told by Norton Juster in The Phantom Tollbooth
Re: [Numpy-discussion] missing data discussion round 2
On Wed, Jun 29, 2011 at 4:21 PM, Eric Firing efir...@hawaii.edu wrote: On 06/29/2011 09:32 AM, Matthew Brett wrote: Hi, [...] Clearly there are some overlaps between what masked arrays are trying to achieve and what R's NA mechanisms are trying to achieve. Are they really similar enough that they should function using the same API? And if so, won't that be confusing? I think that's the question that's being asked. And I think the answer is no. No more confusing to people coming from R to numpy than views already are--with or without the NEP--and not *requiring* people to use any NA-related functionality beyond what they are used to from R. My understanding of the NEP is that it directly yields an API closely matching that of R, but with the opportunity, via views, to do more with less work, if one so desires. The present masked array module could be made more efficient if the NEP is implemented; regardless of whether this is done, the masked array module is not about to vanish, so anyone wanting precisely the masked array API will have it; and others remain free to ignore it (except for those of us involved in developing libraries such as matplotlib, which will have to support all variations of the new API along with the already-supported masked arrays). In addition, for new code, the full-blown masked array module may not be needed. A convenience it adds, however, is the automatic masking of invalid values:

In [1]: np.ma.log(-1)
Out[1]: masked

I'm sure this horrifies some, but there are times and places where it is a genuine convenience, and preferable to having to use a separate operation to replace nan or inf with NA or whatever it ends up being. I added a mechanism to support this idea with the NA dtypes approach, spelled 'NA[f8,InfNan]'. Here, all Infs and NaNs are treated as NA by the system. -Mark If np.seterr were extended to allow such automatic masking as an option, then the need for a separate masked array module would shrink further.
I wouldn't mind having to use an explicit kwarg for ignoring NA in reduction methods. Eric See you, Matthew
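The automatic-masking convenience Eric describes can be seen in today's numpy.ma, alongside the separate-operation alternative he would rather avoid (the 'NA[f8,InfNan]' spelling Mark mentions is from the NEP and does not exist in released NumPy):

```python
import numpy as np

# np.ma's ufunc wrappers mask domain errors automatically
assert np.ma.is_masked(np.ma.log(-1))

# The separate-operation alternative: compute first, then mask NaN/Inf after
raw = np.array([1.0, np.nan, np.inf, 4.0])
cleaned = np.ma.masked_invalid(raw)
assert cleaned.mask.tolist() == [False, True, True, False]
assert cleaned.sum() == 5.0
```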
Re: [Numpy-discussion] missing data discussion round 2
On Wed, Jun 29, 2011 at 5:42 PM, Nathaniel Smith n...@pobox.com wrote: On Wed, Jun 29, 2011 at 2:40 PM, Lluís xscr...@gmx.net wrote: I'm for the option of having a single API when you want to have NA elements, regardless of whether it's using masks or bit patterns. I understand the desire to avoid having two different APIs... [snip] My concern is now about how to set the skipna in a comfortable way, so that I don't have to set it again and again as ufunc arguments:

>>> a
array([NA, 2, 3])
>>> b
array([1, 2, NA])
>>> a + b
array([NA, 2, NA])
>>> a.flags.skipna = True
>>> b.flags.skipna = True
>>> a + b
array([1, 4, 3])

...But... now you're introducing two different kinds of arrays with different APIs again? Ones where .skipna==True, and ones where .skipna==False? I know that this way it's not keyed on the underlying storage format, but if we support both bit patterns and mask arrays at the implementation level, then the only way to make them have identical APIs is if we completely disallow unmasking, and shared masks, and so forth. The right set of these conditions has been in the NEP from the beginning. Unmasking without value assignment is disallowed - the only way to see behind the mask or to share masks is with views. My impression is that more people are concerned with sharing the same data between different masks, something also supported through views. -Mark Which doesn't seem like it'd be very popular (and would make including the mask-based implementation pretty pointless). So I think we have to assume that they will have APIs that are at least somewhat different. And then it seems like with this proposal we'd actually end up with *4* different APIs that any particular array might follow... (or maybe more, depending on how arrays that had both a bit-pattern and mask ended up working).
That's why I was thinking the best solution might be to just bite the bullet and make the APIs *totally* different and non-overlapping, so it was always obvious which you were using and how they'd interact. But I don't know -- for my work I'd be happy to just pass skipna everywhere I needed it, and never unmask anything, and so forth, so maybe there's some reason why it's really important for the bit-pattern NA API to overlap more with the masked array API? -- Nathaniel
Re: [Numpy-discussion] missing data discussion round 2
On Thu, Jun 30, 2011 at 1:49 AM, Chris Barker chris.bar...@noaa.gov wrote: On 6/27/11 9:53 AM, Charles R Harris wrote: Some discussion of disk storage might also help. I don't see how the rules can be enforced if two files are used, one for the mask and another for the data, but that may just be something we need to live with. It seems it wouldn't be too big a deal to extend the *.npy format to include the mask. Could one memmap both the data array and the mask? This I haven't thought about too much yet, but I don't see why not. This does provide a back door into the mask which violates the abstractions, so I would want it to be an extremely narrow special case. -Mark NetCDF (and I assume HDF) have ways to support masks as well. -Chris
Re: [Numpy-discussion] missing data discussion round 2
Clearly there are some overlaps between what masked arrays are trying to achieve and what R's NA mechanisms are trying to achieve. Are they really similar enough that they should function using the same API? Yes. And if so, won't that be confusing? No, I don't believe so, any more than NA's in R, NaN's, or Inf's are already confusing. As one who's been silently following (most of) this thread, and a heavy R and numpy user, perhaps I should chime in briefly here with a use case. I more-or-less always work with partially masked data, like Matthew, but not numpy masked arrays, because the memory overhead is prohibitive. And, sad to say, my experiments don't always go perfectly. I therefore have arrays in which there is /both/ (1) data that is simply missing (np.NA?)--it never had a value and never will--as well as simultaneously (2) data that is temporarily masked (np.IGNORE? np.MASKED?), where I want to mask/unmask different portions for different purposes/analyses. I consider these two separate, completely independent issues, and I unfortunately currently have to kluge a lot to handle this. Concretely, consider a list of 100,000 observations (rows), with 12 measures per observation-row (a 100,000 x 12 array). Every now and then, sprinkled throughout this array, I have missing values (someone didn't answer a question, or a computer failed to record a response, or whatever). For some analyses I want to mask the whole row (e.g., complete-case analysis), leaving me with array entries that should be tagged with all 4 possible labels:
1) not masked, not missing
2) masked, not missing
3) not masked, missing
4) masked, missing
Obviously #4 is overkill ... but only until I want to unmask that row. At that point, I need to be sure that missing values remain missing when unmasked. Can a single API really handle this? -best Gary
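Gary's kluge can be sketched with today's tools by keeping the two layers independent: a destructive NaN bit pattern for "truly missing" and a separate boolean "ignore" mask for the analysis-specific layer. All variable names here are illustrative, not any proposed API:

```python
import numpy as np

data = np.array([[1.0, np.nan],        # NaN = truly missing, never coming back
                 [3.0, 4.0]])
ignore = np.array([[False, False],     # analyst's temporary row mask
                   [True,  True]])

# One analysis: hide both the ignored row and the missing cell
view = np.ma.masked_array(data, mask=ignore | np.isnan(data))
assert view.sum() == 1.0

# "Unmasking" drops only the ignore layer; missingness survives because it
# lives in the data itself, so Gary's case #4 stays missing when unmasked
unmasked = np.ma.masked_array(data, mask=np.isnan(data))
assert unmasked.sum() == 8.0
```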
Re: [Numpy-discussion] missing data discussion round 2
Mark Wiebe writes: On Wed, Jun 29, 2011 at 1:20 PM, Lluís xscr...@gmx.net wrote: [...] As far as I can tell, the only required difference between them is that NA bit patterns must destroy the data. Nothing else. Everything on top of that is a choice of API and interface mechanisms. I want them to behave exactly the same except for that necessary difference, so that it will be possible to use the *exact same Python code* with either approach. I completely agree. What I'd suggest is a global and/or per-object ndarray.flags.skipna for people like me that just want to ignore these entries without caring about setting it on each operation (or the other way around, depends on the default behaviour). The downside is that it adds yet another tweaking knob, which is not desirable... One way around this would be to create an ndarray subclass which changes that default. Currently this would not be possible to do nicely, but with the __numpy_ufunc__ idea I proposed in a separate thread a while back, this could work. That does indeed sound good :) Lluis
Re: [Numpy-discussion] missing data discussion round 2
On Thu, Jun 30, 2011 at 11:04 AM, Gary Strangman str...@nmr.mgh.harvard.edu wrote: Clearly there are some overlaps between what masked arrays are trying to achieve and what R's NA mechanisms are trying to achieve. Are they really similar enough that they should function using the same API? Yes. And if so, won't that be confusing? No, I don't believe so, any more than NA's in R, NaN's, or Inf's are already confusing. As one who's been silently following (most of) this thread, and a heavy R and numpy user, perhaps I should chime in briefly here with a use case. I more-or-less always work with partially masked data, like Matthew, but not numpy masked arrays, because the memory overhead is prohibitive. And, sad to say, my experiments don't always go perfectly. I therefore have arrays in which there is /both/ (1) data that is simply missing (np.NA?)--it never had a value and never will--as well as simultaneously (2) data that is temporarily masked (np.IGNORE? np.MASKED?), where I want to mask/unmask different portions for different purposes/analyses. I consider these two separate, completely independent issues, and I unfortunately currently have to kluge a lot to handle this. Concretely, consider a list of 100,000 observations (rows), with 12 measures per observation-row (a 100,000 x 12 array). Every now and then, sprinkled throughout this array, I have missing values (someone didn't answer a question, or a computer failed to record a response, or whatever). For some analyses I want to mask the whole row (e.g., complete-case analysis), leaving me with array entries that should be tagged with all 4 possible labels:
1) not masked, not missing
2) masked, not missing
3) not masked, missing
4) masked, missing
Obviously #4 is overkill ... but only until I want to unmask that row. At that point, I need to be sure that missing values remain missing when unmasked. Can a single API really handle this?
The single API does support a masked array with an NA dtype, and the behavior in this case will be that the value is considered NA if either it is masked or the value is the NA bit pattern. So you could add a mask to an array with an NA dtype to temporarily treat the data as if more values were missing. One important reason I'm doing it this way is so that each NumPy algorithm and any 3rd party code only needs to be updated once to support both forms of missing data. The C API with masks is also a lot cleaner to work with than one for NA dtypes with the ability to have different NA bit patterns. -Mark -best Gary
Re: [Numpy-discussion] missing data discussion round 2
Hi, On Thu, Jun 30, 2011 at 5:13 PM, Mark Wiebe mwwi...@gmail.com wrote: On Thu, Jun 30, 2011 at 11:04 AM, Gary Strangman str...@nmr.mgh.harvard.edu wrote: Clearly there are some overlaps between what masked arrays are trying to achieve and what Rs NA mechanisms are trying to achieve. Are they really similar enough that they should function using the same API? Yes. And if so, won't that be confusing? No, I don't believe so, any more than NA's in R, NaN's, or Inf's are already confusing. As one who's been silently following (most of) this thread, and a heavy R and numpy user, perhaps I should chime in briefly here with a use case. I more-or-less always work with partially masked data, like Matthew, but not numpy masked arrays because the memory overhead is prohibitive. And, sad to say, my experiments don't always go perfectly. I therefore have arrays in which there is /both/ (1) data that is simply missing (np.NA?)--it never had a value and never will--as well as simultaneously (2) data that that is temporarily masked (np.IGNORE? np.MASKED?) where I want to mask/unmask different portions for different purposes/analyses. I consider these two separate, completely independent issues and I unfortunately currently have to kluge a lot to handle this. Concretely, consider a list of 100,000 observations (rows), with 12 measures per observation-row (a 100,000 x 12 array). Every now and then, sprinkled throughout this array, I have missing values (someone didn't answer a question, or a computer failed to record a response, or whatever). For some analyses I want to mask the whole row (e.g., complete-case analysis), leaving me with array entries that should be tagged with all 4 possible labels: 1) not masked, not missing 2) masked, not missing 3) not masked, missing 4) masked, missing Obviously #4 is overkill ... but only until I want to unmask that row. At that point, I need to be sure that missing values remain missing when unmasked. Can a single API really handle this? 
The single API does support a masked array with an NA dtype, and the behavior in this case will be that the value is considered NA if either it is masked or the value is the NA bit pattern. So you could add a mask to an array with an NA dtype to temporarily treat the data as if more values were missing. Right - but I think the separated API is cleaner and easier to explain. Do you disagree? One important reason I'm doing it this way is so that each NumPy algorithm and any 3rd party code only needs to be updated once to support both forms of missing data. Could you explain what you mean? Maybe a couple of examples? Whatever API results, it will surely be with us for a long time, and so it would be good to make sure we have the right one even if it costs a bit more to update current code. Cheers, Matthew ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] missing data discussion round 2
Mark Wiebe writes: Why is one magic and the other real? All of this is already sitting on 100 layers of abstraction above electrons and atoms. If we're talking about real, maybe we should be programming in machine code or using breadboards with individual transistors. M-x butterfly RET http://xkcd.com/378/ -- And it's much the same thing with knowledge, for whenever you learn something new, the whole world becomes that much richer. -- The Princess of Pure Reason, as told by Norton Juster in The Phantom Tollbooth ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
[Numpy-discussion] missing data: semantics
Ok, I think it's time to step back and reformulate the problem by completely ignoring the implementation. Here we have 2 generic concepts (i.e., applicable to R), plus another extra concept that is exclusive to numpy: * Assigning np.NA to an array cannot be undone, except through explicit assignment (i.e., assigning a new arbitrary value, or saving a copy of the original array before assigning np.NA). * np.NA values propagate by default, unless ufuncs have the skipna = True argument (or the other way around, it doesn't really matter to this discussion). In order to avoid passing the argument on each ufunc, we can either have some per-array variable for the default skipna value (undesirable), or we can make a trivial ndarray subclass that will set the skipna argument on all ufuncs through the _ufunc_wrapper_ mechanism. Now, numpy has the concept of views, which adds some more goodies to the list of concepts: * With views, two arrays can share the same physical data, so that assignments to either of them will be seen by the others (including NA values). The creation of a view is explicitly stated by the user, so its behaviour should not be perceived as odd (after all, you asked for a view). The good thing is that with views you can avoid costly array copies if you're careful when writing into these views. Now, you can add a new concept: local/temporal/transient missing data. We can take an existing array and create a view with the new argument transientna = True. Here, both the view and the transientna = True are explicitly stated by the user, so it is assumed that she already knows what this is all about. The difference from a regular view is that you also explicitly asked for local/temporal/transient NA values. * Assigning np.NA to an array view with transientna = True will *not* be seen by any of the other views (nor the original array), but anything else will still work as usual. After all, this is what *you* asked for when using the transientna = True argument.
To conclude: other code *must not* need to care about whether the arrays it works with have transient NA values. This way, I can create a view with transient NAs, set some uninteresting data to NA, and pass it to a routine written by someone else that sets to NA any elements that, for example, are beyond a certain threshold from the mean of the elements. This would be equivalent to storing a copy of the original array before passing it to this 3rd party function, except that transientna, just like views, provides a handy shortcut to avoid copies. My main point here is that views and local/temporal/transient NAs are all *explicitly* requested, so their behaviour should not appear as something unexpected. Is there an agreement on this? Lluis -- And it's much the same thing with knowledge, for whenever you learn something new, the whole world becomes that much richer. -- The Princess of Pure Reason, as told by Norton Juster in The Phantom Tollbooth
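The "explicitly requested view with a local mask, shared data" behaviour Lluís sketches can be approximated today with numpy.ma (transientna itself is only a proposal, not an existing numpy keyword):

```python
import numpy as np

# Sketch of the "transient NA" idea with today's numpy.ma: wrapping an
# array in a masked array (copy=False by default) shares the underlying
# data, while the mask itself is local to the wrapper.
base = np.arange(6, dtype=float)
view = np.ma.masked_array(base)  # shares base's memory, has its own mask

# "Transient NA": hide an element via the mask; base is untouched.
view[2] = np.ma.masked
print(base[2])  # 2.0 -- the original value survives

# A real assignment through the view IS seen by base (shared data),
# matching the ordinary view semantics Lluís describes.
view[0] = 99.0
print(base[0])  # 99.0
```

The difference from the proposal is that here the caller must know it received a masked array; transientna's point is that third-party code would not need to care.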
Re: [Numpy-discussion] missing data: semantics
Hi, On Thu, Jun 30, 2011 at 6:46 PM, Lluís xscr...@gmx.net wrote: Ok, I think it's time to step back and reformulate the problem by completely ignoring the implementation. Here we have 2 generic concepts (i.e., applicable to R), plus another extra concept that is exclusive to numpy: * Assigning np.NA to an array, cannot be undone unless through explicit assignment (i.e., assigning a new arbitrary value, or saving a copy of the original array before assigning np.NA). * np.NA values propagate by default, unless ufuncs have the skipna = True argument (or the other way around, it doesn't really matter to this discussion). In order to avoid passing the argument on each ufunc, we either have some per-array variable for the default skipna value (undesirable) or we can make a trivial ndarray subclass that will set the skipna argument on all ufuncs through the _ufunc_wrapper_ mechanism. Now, numpy has the concept of views, which adds some more goodies to the list of concepts: * With views, two arrays can share the same physical data, so that assignments to any of them will be seen by others (including NA values). The creation of a view is explicitly stated by the user, so its behaviour should not be perceived as odd (after all, you asked for a view). The good thing is that with views you can avoid costly array copies if you're careful when writing into these views. Now, you can add a new concept: local/temporal/transient missing data. We can take an existing array and create a view with the new argument transientna = True. Here, both the view and the transientna = True are explicitly stated by the user, so it is assumed that she already knows what this is all about. The difference with a regular view is that you also explicitly asked for local/temporal/transient NA values. * Assigning np.NA to an array view with transientna = True will *not* be seen by any of the other views (nor the original array), but anything else will still work as usual. 
After all, this is what *you* asked for when using the transientna = True argument. To conclude, say that others *must not* care about whether the arrays they're working with have transient NA values. This way, I can create a view with transient NAs, set to NA some uninteresting data, and pass it to a routine written by someone else that sets to NA elements that, for example, are beyond certain threshold from the mean of the elements. This would be equivalent to storing a copy of the original array before passing it to this 3rd party function, only that transientna, just as views, provide some handy shortcuts to avoid copies. My main point here is that views and local/temporal/transient NAs are all *explicitly* requested, so that its behaviour should not appear as something unexpected. Is there an agreement on this? Absolutely, if by 'transientna' you mean 'masked'. The discussion is whether the NA API should be the same as the masking API. The thing you are describing is what masking is for, and what it's always been for, as far as I can see. We're arguing that to call this 'transientna' instead of 'masked' confuses two concepts that are different, to no good purpose. Best, Matthew ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] missing data: semantics
On Thu, Jun 30, 2011 at 11:46 AM, Lluís xscr...@gmx.net wrote: Ok, I think it's time to step back and reformulate the problem by completely ignoring the implementation. Here we have 2 generic concepts (i.e., applicable to R), plus another extra concept that is exclusive to numpy: * Assigning np.NA to an array, cannot be undone unless through explicit assignment (i.e., assigning a new arbitrary value, or saving a copy of the original array before assigning np.NA). * np.NA values propagate by default, unless ufuncs have the skipna = True argument (or the other way around, it doesn't really matter to this discussion). In order to avoid passing the argument on each ufunc, we either have some per-array variable for the default skipna value (undesirable) or we can make a trivial ndarray subclass that will set the skipna argument on all ufuncs through the _ufunc_wrapper_ mechanism. Now, numpy has the concept of views, which adds some more goodies to the list of concepts: * With views, two arrays can share the same physical data, so that assignments to any of them will be seen by others (including NA values). The creation of a view is explicitly stated by the user, so its behaviour should not be perceived as odd (after all, you asked for a view). The good thing is that with views you can avoid costly array copies if you're careful when writing into these views. Now, you can add a new concept: local/temporal/transient missing data. We can take an existing array and create a view with the new argument transientna = True. This is already there: x.view(masked=1), although the keyword transientna has appeal, not least because it avoids the word 'mask', which seems a source of endless confusion. Note that currently this is only supposed to work if the original array is unmasked. Here, both the view and the transientna = True are explicitly stated by the user, so it is assumed that she already knows what this is all about. 
The difference with a regular view is that you also explicitly asked for local/temporal/transient NA values. * Assigning np.NA to an array view with transientna = True will *not* be seen by any of the other views (nor the original array), but anything else will still work as usual. After all, this is what *you* asked for when using the transientna = True argument. To conclude, say that others *must not* care about whether the arrays they're working with have transient NA values. This way, I can create a view with transient NAs, set to NA some uninteresting data, and pass it to a routine written by someone else that sets to NA elements that, for example, are beyond certain threshold from the mean of the elements. This would be equivalent to storing a copy of the original array before passing it to this 3rd party function, only that transientna, just as views, provide some handy shortcuts to avoid copies. My main point here is that views and local/temporal/transient NAs are all *explicitly* requested, so that its behaviour should not appear as something unexpected. Is there an agreement on this? Chuck ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] missing data: semantics
On Thu, Jun 30, 2011 at 11:51 AM, Matthew Brett matthew.br...@gmail.comwrote: Hi, On Thu, Jun 30, 2011 at 6:46 PM, Lluís xscr...@gmx.net wrote: Ok, I think it's time to step back and reformulate the problem by completely ignoring the implementation. Here we have 2 generic concepts (i.e., applicable to R), plus another extra concept that is exclusive to numpy: * Assigning np.NA to an array, cannot be undone unless through explicit assignment (i.e., assigning a new arbitrary value, or saving a copy of the original array before assigning np.NA). * np.NA values propagate by default, unless ufuncs have the skipna = True argument (or the other way around, it doesn't really matter to this discussion). In order to avoid passing the argument on each ufunc, we either have some per-array variable for the default skipna value (undesirable) or we can make a trivial ndarray subclass that will set the skipna argument on all ufuncs through the _ufunc_wrapper_ mechanism. Now, numpy has the concept of views, which adds some more goodies to the list of concepts: * With views, two arrays can share the same physical data, so that assignments to any of them will be seen by others (including NA values). The creation of a view is explicitly stated by the user, so its behaviour should not be perceived as odd (after all, you asked for a view). The good thing is that with views you can avoid costly array copies if you're careful when writing into these views. Now, you can add a new concept: local/temporal/transient missing data. We can take an existing array and create a view with the new argument transientna = True. Here, both the view and the transientna = True are explicitly stated by the user, so it is assumed that she already knows what this is all about. The difference with a regular view is that you also explicitly asked for local/temporal/transient NA values. 
* Assigning np.NA to an array view with transientna = True will *not* be seen by any of the other views (nor the original array), but anything else will still work as usual. After all, this is what *you* asked for when using the transientna = True argument. To conclude, say that others *must not* care about whether the arrays they're working with have transient NA values. This way, I can create a view with transient NAs, set to NA some uninteresting data, and pass it to a routine written by someone else that sets to NA elements that, for example, are beyond certain threshold from the mean of the elements. This would be equivalent to storing a copy of the original array before passing it to this 3rd party function, only that transientna, just as views, provide some handy shortcuts to avoid copies. My main point here is that views and local/temporal/transient NAs are all *explicitly* requested, so that its behaviour should not appear as something unexpected. Is there an agreement on this? Absolutely, if by 'transientna' you mean 'masked'. The discussion is whether the NA API should be the same as the masking API. The thing you are describing is what masking is for, and what it's always been for, as far as I can see. We're arguing that to call this 'transientna' instead of 'masked' confuses two concepts that are different, to no good purpose. It's a hammer. If you want to hammer nails, fine, if you want hammer a bit of tubing flat, fine. It's a tool, the hammer concept if you will. Chuck ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] missing data discussion round 2
On Thu, Jun 30, 2011 at 11:42 AM, Matthew Brett matthew.br...@gmail.comwrote: Hi, On Thu, Jun 30, 2011 at 5:13 PM, Mark Wiebe mwwi...@gmail.com wrote: On Thu, Jun 30, 2011 at 11:04 AM, Gary Strangman str...@nmr.mgh.harvard.edu wrote: Clearly there are some overlaps between what masked arrays are trying to achieve and what Rs NA mechanisms are trying to achieve. Are they really similar enough that they should function using the same API? Yes. And if so, won't that be confusing? No, I don't believe so, any more than NA's in R, NaN's, or Inf's are already confusing. As one who's been silently following (most of) this thread, and a heavy R and numpy user, perhaps I should chime in briefly here with a use case. I more-or-less always work with partially masked data, like Matthew, but not numpy masked arrays because the memory overhead is prohibitive. And, sad to say, my experiments don't always go perfectly. I therefore have arrays in which there is /both/ (1) data that is simply missing (np.NA?)--it never had a value and never will--as well as simultaneously (2) data that that is temporarily masked (np.IGNORE? np.MASKED?) where I want to mask/unmask different portions for different purposes/analyses. I consider these two separate, completely independent issues and I unfortunately currently have to kluge a lot to handle this. Concretely, consider a list of 100,000 observations (rows), with 12 measures per observation-row (a 100,000 x 12 array). Every now and then, sprinkled throughout this array, I have missing values (someone didn't answer a question, or a computer failed to record a response, or whatever). For some analyses I want to mask the whole row (e.g., complete-case analysis), leaving me with array entries that should be tagged with all 4 possible labels: 1) not masked, not missing 2) masked, not missing 3) not masked, missing 4) masked, missing Obviously #4 is overkill ... but only until I want to unmask that row. 
At that point, I need to be sure that missing values remain missing when unmasked. Can a single API really handle this? The single API does support a masked array with an NA dtype, and the behavior in this case will be that the value is considered NA if either it is masked or the value is the NA bit pattern. So you could add a mask to an array with an NA dtype to temporarily treat the data as if more values were missing. Right - but I think the separated API is cleaner and easier to explain. Do you disagree? Kind of, yeah. I think the important things to understand from the Python perspective are that there are two ways of doing missing values with NA that look exactly the same except for how you create the arrays. Since you know that the mask way takes more memory, and that's important for your application, you can decide to use the NA dtype without any additional depth. Understanding that one of them has a special signal for NA while the other uses masks in the background probably isn't even that important to understand to be able to use it. I bet lots of people who use R regularly couldn't come up with a correct explanation of how it works there. If someone doesn't understand masks, they can use their intuition based on the special signal idea without any difficulty. The idea that you can temporarily make some values NA without overwriting your data may not be intuitive at first glance, but I expect people will find it useful even if they don't fully understand the subtle details of the masking mechanism. One important reason I'm doing it this way is so that each NumPy algorithm and any 3rd party code only needs to be updated once to support both forms of missing data. Could you explain what you mean? Maybe a couple of examples? Yeah, I've started adding some implementation notes to the NEP. First I need volunteers to review my current pull requests though. 
;) -Mark Whatever API results, it will surely be with us for a long time, and so it would be good to make sure we have the right one even if it costs a bit more to update current code. Cheers, Matthew ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
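For readers following along, the creation-time difference Mark refers to looks roughly like this in the NEP's proposed API. This is pseudocode: the NEP was never merged into mainline NumPy, np.NA/maskna/skipna are the proposal's names, and the exact spellings here are illustrative only:

```
# Proposed, not implemented: both forms were meant to behave identically
# from Python, differing only in how the array is created.
a = np.array([1.0, 2.0, np.NA], maskna=True)   # mask-based NA (hidden mask)
np.sum(a)                # NA propagates -> NA
np.sum(a, skipna=True)   # -> 3.0
a[2] = 5.0               # assigning a value un-NAs the element
```

This is the sense in which a user could treat NA as a "special signal" without ever learning that a mask sits behind it.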
Re: [Numpy-discussion] missing data discussion round 2
On Thu, Jun 30, 2011 at 11:54 AM, Lluís xscr...@gmx.net wrote: Mark Wiebe writes: Why is one magic and the other real? All of this is already sitting on 100 layers of abstraction above electrons and atoms. If we're talking about real, maybe we should be programming in machine code or using breadboards with individual transistors. M-x butterfly RET http://xkcd.com/378/ Ok, I've run this, how long does it take to execute? -Mark
Re: [Numpy-discussion] missing data discussion round 2
On Wed, Jun 29, 2011 at 2:21 PM, Eric Firing efir...@hawaii.edu wrote: In addition, for new code, the full-blown masked array module may not be needed. A convenience it adds, however, is the automatic masking of invalid values: In [1]: np.ma.log(-1) Out[1]: masked I'm sure this horrifies some, but there are times and places where it is a genuine convenience, and preferable to having to use a separate operation to replace nan or inf with NA or whatever it ends up being. Err, but what would this even get you? NA, NaN, and Inf basically all behave the same WRT floating point operations anyway, i.e., they all propagate? Is the idea that if ufunc's gain a skipna=True flag, you'd also like to be able to turn it into a skipna_and_nan_and_inf=True flag? -- Nathaniel ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] missing data discussion round 2
On 06/30/2011 08:53 AM, Nathaniel Smith wrote: On Wed, Jun 29, 2011 at 2:21 PM, Eric Firing efir...@hawaii.edu wrote: In addition, for new code, the full-blown masked array module may not be needed. A convenience it adds, however, is the automatic masking of invalid values: In [1]: np.ma.log(-1) Out[1]: masked I'm sure this horrifies some, but there are times and places where it is a genuine convenience, and preferable to having to use a separate operation to replace nan or inf with NA or whatever it ends up being. Err, but what would this even get you? NA, NaN, and Inf basically all behave the same WRT floating point operations anyway, i.e., they all propagate? Not exactly. First, it depends on np.seterr; second, calculations on NaN can be very slow, so are better avoided entirely; third, if an array is passed to extension code, it is much nicer if that code only has one NA value to handle, instead of having to check for all possible bad values. Is the idea that if ufunc's gain a skipna=True flag, you'd also like to be able to turn it into a skipna_and_nan_and_inf=True flag? No, it is to have a situation where skipna_and_nan_and_inf would not be needed, because an operation generating a nan or inf would turn those values into NA or IGNORE or whatever right away. Eric -- Nathaniel
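Eric's convenience point is easy to demonstrate with released numpy: np.ma.log masks an invalid result automatically, while plain np.log produces a NaN (and, depending on np.seterr settings, a warning):

```python
import numpy as np

# Plain ufunc: log of a negative number yields NaN (warning suppressed
# here via errstate, which is what np.seterr configures globally).
with np.errstate(invalid='ignore'):
    plain = np.log(np.array([-1.0, 1.0]))

# Masked version: the invalid element comes back masked instead.
result = np.ma.log(np.array([-1.0, 1.0]))

print(np.isnan(plain[0]))          # True -- NaN, still a float value
print(np.ma.is_masked(result[0]))  # True -- automatically masked
print(result[1])                   # 0.0 -- valid entries pass through
```

This is the "one bad value instead of many" behaviour Eric wants: downstream code sees a mask, not an assortment of NaN/Inf patterns.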
Re: [Numpy-discussion] missing data discussion round 2
On Thu, Jun 30, 2011 at 12:27 PM, Eric Firing efir...@hawaii.edu wrote: On 06/30/2011 08:53 AM, Nathaniel Smith wrote: On Wed, Jun 29, 2011 at 2:21 PM, Eric Firingefir...@hawaii.edu wrote: In addition, for new code, the full-blown masked array module may not be needed. A convenience it adds, however, is the automatic masking of invalid values: In [1]: np.ma.log(-1) Out[1]: masked I'm sure this horrifies some, but there are times and places where it is a genuine convenience, and preferable to having to use a separate operation to replace nan or inf with NA or whatever it ends up being. Err, but what would this even get you? NA, NaN, and Inf basically all behave the same WRT floating point operations anyway, i.e., they all propagate? Not exactly. First, it depends on np.seterr; IIUC, you're proposing to make this conversion depend on np.seterr too, though, right? second, calculations on NaN can be very slow, so are better avoided entirely They're slow because inside the processor they require a branch and a separate code path (which doesn't get a lot of transistors allocated to it). In any of the NA proposals we're talking about, handling an NA would require a software branch and a separate code path (which is in ordinary software, now, so it doesn't get any special transistors allocated to it...). I don't think masking support is likely to give you a speedup over the processor's NaN handling. And if it did, that would mean that we speed up FP operations in general by checking for NaN in software, so then we should do that everywhere anyway instead of making it an NA-specific feature... third, if an array is passed to extension code, it is much nicer if that code only has one NA value to handle, instead of having to check for all possible bad values. 
I'm pretty sure that Mark's proposal does not work this way -- he's saying that the NA-checking code in numpy could optionally check for all these different bad values and handle them the same in ufuncs, not that we would check the outputs of all FP operations for bad values and then replace them by NA. So your extension code would still have the same problem. Sorry :-( -- Nathaniel ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] missing data discussion round 2
On 06/28/2011 11:52 PM, Matthew Brett wrote: Hi, On Tue, Jun 28, 2011 at 5:38 PM, Charles R Harris charlesr.har...@gmail.com wrote: Nathaniel, an implementation using masks will look *exactly* like an implementation using na-dtypes from the user's point of view. Except that taking a masked view of an unmasked array allows ignoring values without destroying or copying the original data. The only downside I can see to an implementation using masks is memory and disk storage, and perhaps memory mapped arrays. And I rather expect the former to solve itself in a few years, eight gigs is becoming a baseline for workstations and in a couple of years I expect that to be up around 16-32, and a few years after that In any case we are talking 12% - 25% overhead, and in practice I expect it won't be quite as big a problem as folks project. Or, in the case of 16 bit integers, 50% memory overhead. I honestly find it hard to believe that I will not care about memory use in the near future, and I don't think it's wise to make decisions on that assumption. In many sciences, waiting for the future makes things worse, not better, simply because the amount of available data easily grows at a faster rate than the amount of memory you can get per dollar :-) Dag Sverre ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
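The overhead figures being traded in this exchange follow from one boolean mask byte per element; a trivial sketch of the arithmetic:

```python
# One mask byte per element, expressed as a percentage of the element's
# own size: 12.5% for float64, 25% for float32/int32, 50% for int16 --
# matching the 12%-25% and 50% figures quoted in the thread.
def mask_overhead_percent(itemsize_bytes, mask_bytes=1):
    return 100.0 * mask_bytes / itemsize_bytes

for name, size in [("float64", 8), ("float32", 4), ("int16", 2)]:
    print(f"{name}: {mask_overhead_percent(size):.1f}% overhead")
```

The smaller the element type, the worse the relative cost, which is Dag's point about int16 data.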
Re: [Numpy-discussion] missing data discussion round 2
On 06/27/2011 05:55 PM, Mark Wiebe wrote: First I'd like to thank everyone for all the feedback you're providing, clearly this is an important topic to many people, and the discussion has helped clarify the ideas for me. I've renamed and updated the NEP, then placed it into the master NumPy repository so it has a more permanent home here: https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst One thing to think about is the presence of SSE/AVX instructions, which has the potential to change some of the memory/speed trade-offs here. In the newest Intel-platform CPUs you can do 256-bit operations, translating to a theoretical factor 8 speedup for in-cache single precision data, and the instruction set is constructed for future expansion possibilities to 512 or 1024 bit registers. I feel one should take care to not design oneself into a corner where this can't (eventually) be leveraged. 1) The shuffle instructions take a single byte as a control character for moving around data in different ways in 128-bit registers. One could probably implement fast IGNORE-style NA with a separate mask using 1 byte per 16 bytes of data (with 4 or 8-byte elements). OTOH, I'm not sure if a 1-byte-per-element kind of mask would be that fast (but I don't know much about this and haven't looked at the details). 2) The alternative Parameterized Data Type Which Adds Additional Memory for the NA Flag would mean that contiguous arrays with NA's/IGNORE's would not be subject to vector instructions, or would create a mess of copying in and out prior to operating on the data. This really seems like the worst of all possibilities to me. (FWIW, my vote is in favour of both NA-using-NaN and IGNORE-using-explicit-masks, and keeping the two as entirely separate worlds to avoid confusion.) Dag Sverre ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] missing data discussion round 2
Matthew Brett writes: Maybe instead of np.NA, we could say np.IGNORE, which sort of conveys the idea that the entry is still there, but we're just ignoring it. Of course, that goes against common convention, but it might be easier to explain. I think Nathaniel's point is that np.IGNORE is a different idea than np.NA, and that is why joining the implementations can lead to conceptual confusion. This is how I see it:

    a = np.array([0, 1, 2], dtype=int)
    a[0] = np.NA                              # ValueError
    e = np.array([np.NA, 1, 2], dtype=int)    # ValueError

    b  = np.array([np.NA, 1, 2], dtype=np.maybe(int))
    m  = np.array([np.NA, 1, 2], dtype=int, masked=True)
    bm = np.array([np.NA, 1, 2], dtype=np.maybe(int), masked=True)

    b[1] = np.NA
    np.sum(b)                 # np.NA
    np.sum(b, skipna=True)    # 2
    b.mask                    # None

    m[1] = np.NA
    np.sum(m)                 # 2
    np.sum(m, skipna=True)    # 2
    m.mask                    # [False, False, True]

    bm[1] = np.NA
    np.sum(bm)                # 2
    np.sum(bm, skipna=True)   # 2
    bm.mask                   # [False, False, True]

So:

* Mask takes precedence over bit pattern on element assignment. There's still the question of how to assign a bit pattern NA when the mask is active.
* When using a mask, elements are automagically skipped.
* m[1] = np.NA is equivalent to m.mask[1] = False
* When using bit pattern + mask, it might make sense to have the initial values as bit-pattern NAs, instead of masked (i.e., bm.mask == [True, False, True] and np.sum(bm) == np.NA)

Lluis -- And it's much the same thing with knowledge, for whenever you learn something new, the whole world becomes that much richer. -- The Princess of Pure Reason, as told by Norton Juster in The Phantom Tollbooth
Re: [Numpy-discussion] missing data discussion round 2
Hi, On Wed, Jun 29, 2011 at 12:39 AM, Mark Wiebe mwwi...@gmail.com wrote: On Tue, Jun 28, 2011 at 5:20 PM, Matthew Brett matthew.br...@gmail.com wrote: Hi, On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smith n...@pobox.com wrote: ... (You might think, what difference does it make if you *can* unmask an item? Us missing data folks could just ignore this feature. But: whatever we end up implementing is something that I will have to explain over and over to different people, most of them not particularly sophisticated programmers. And there's just no sensible way to explain this idea that if you store some particular value, then it replaces the old value, but if you store NA, then the old value is still there.

Ouch - yes. No question, that is difficult to explain. Well, I think the explanation might go like this: Ah, yes, well, that's because in fact numpy records missing values by using a 'mask'. So when you say `a[3] = np.NA`, what you mean is, `a._mask = np.ones(a.shape, np.dtype(bool)); a._mask[3] = False`. Is that fair?

My favorite way of explaining it would be to have a grid of numbers written on paper, then have several cardboards with holes poked in them in different configurations. Placing these cardboard masks in front of the grid would show different sets of non-missing data, without affecting the values stored on the paper behind them.

Right - but here of course you are trying to explain the mask, and this is Nathaniel's point, that in order to explain NAs, you have to explain masks, and so, even at a basic level, the fusion of the two ideas is obvious, and already confusing. I mean this:

a[3] = np.NA

Oh, so you just set the a[3] value to have some missing value code? Ah - no - in fact what I did was set an associated mask in position a[3] so that you can't any longer see the previous value of a[3]. Huh. You mean I have a mask for every single value in order to be able to blank out a[3]? It looks like an assignment. I mean, it looks just like a[3] = 4.
But I guess it isn't? Er... I think Nathaniel's point is a very good one - these are separate ideas, np.NA and np.IGNORE, and a joint implementation is bound to draw them together in the mind of the user. Apart from anything else, the user has to know that, if they want a single NA value in an array, they have to add a mask of size array.shape in bytes. They then have to know that NA is implemented by masking, and then the 'NA for free by adding masking' idea breaks down and starts to feel like a kludge. The counter-argument is of course that, in time, the implementation of NA with masking will seem as obvious and intuitive as, say, broadcasting, and that we are just reacting from lack of experience with the new API. Of course, that does happen, but here, unless I am mistaken, the primary drive to fuse NA and masking is ease of implementation. That doesn't necessarily mean that they don't go together - if something is easy to implement, sometimes it means it will also feel natural in use - but at least we might say that there is some risk of the implementation driving the API, and that that can lead to problems. See you, Matthew
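[Editor's aside: Matthew's memory point is easy to demonstrate with today's np.ma, where the first masked assignment materializes a full boolean mask, one byte per element.]

```python
import numpy as np

a = np.ma.array(np.zeros(1_000_000))   # no mask allocated yet (nomask)
a[3] = np.ma.masked                    # a single "NA" ...
mask_bytes = a.mask.nbytes             # ... pays for a full boolean mask (~1 MB)
n_missing = int(a.mask.sum())          # only one entry is actually missing
```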
Re: [Numpy-discussion] missing data discussion round 2
On 06/29/2011 03:45 PM, Matthew Brett wrote: Hi, On Wed, Jun 29, 2011 at 12:39 AM, Mark Wiebemwwi...@gmail.com wrote: On Tue, Jun 28, 2011 at 5:20 PM, Matthew Brettmatthew.br...@gmail.com wrote: Hi, On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smithn...@pobox.com wrote: ... (You might think, what difference does it make if you *can* unmask an item? Us missing data folks could just ignore this feature. But: whatever we end up implementing is something that I will have to explain over and over to different people, most of them not particularly sophisticated programmers. And there's just no sensible way to explain this idea that if you store some particular value, then it replaces the old value, but if you store NA, then the old value is still there. Ouch - yes. No question, that is difficult to explain. Well, I think the explanation might go like this: Ah, yes, well, that's because in fact numpy records missing values by using a 'mask'. So when you say `a[3] = np.NA', what you mean is, 'a._mask = np.ones(a.shape, np.dtype(bool); a._mask[3] = False` Is that fair? My favorite way of explaining it would be to have a grid of numbers written on paper, then have several cardboards with holes poked in them in different configurations. Placing these cardboard masks in front of the grid would show different sets of non-missing data, without affecting the values stored on the paper behind them. Right - but here of course you are trying to explain the mask, and this is Nathaniel's point, that in order to explain NAs, you have to explain masks, and so, even at a basic level, the fusion of the two ideas is obvious, and already confusing. I mean this: a[3] = np.NA Oh, so you just set the a[3] value to have some missing value code? Ah - no - in fact what I did was set a associated mask in position a[3] so that you can't any longer see the previous value of a[3] Huh. You mean I have a mask for every single value in order to be able to blank out a[3]? It looks like an assignment. 
I mean, it looks just like a[3] = 4. But I guess it isn't? Er... I think Nathaniel's point is a very good one - these are separate ideas, np.NA and np.IGNORE, and a joint implementation is bound to draw them together in the mind of the user. Apart from anything else, the user has to know that, if they want a single NA value in an array, they have to add a mask of size array.shape in bytes. They then have to know that NA is implemented by masking, and then the 'NA for free by adding masking' idea breaks down and starts to feel like a kludge. The counter-argument is of course that, in time, the implementation of NA with masking will seem as obvious and intuitive as, say, broadcasting, and that we are just reacting from lack of experience with the new API.

However, no matter how used we get to this, people coming from almost any other tool (in particular R) will keep thinking it is counter-intuitive. Why set up a major semantic incompatibility that people then have to overcome in order to start using NumPy? I really don't see what's wrong with some more explicit API like a.mask[3] = True. Explicit is better than implicit.

Dag Sverre
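[Editor's aside: the explicit API Dag asks for already exists in today's np.ma, where mask flags can be flipped directly without touching the data underneath.]

```python
import numpy as np

a = np.ma.array([10.0, 20.0, 30.0, 40.0], mask=[False, False, False, False])

a.mask[3] = True                  # explicit: hide the value
hidden = a[3] is np.ma.masked     # element now reads as missing
behind = float(a.data[3])         # 40.0, untouched behind the mask

a.mask[3] = False                 # explicit: reveal it again
revealed = float(a[3])            # 40.0 is back
```

Nothing here looks like a value assignment, so there is no confusion about whether the old value was erased.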
Re: [Numpy-discussion] missing data discussion round 2
Matthew, Dag, +1. On Jun 29, 2011 4:35 PM, Dag Sverre Seljebotn d.s.seljeb...@astro.uio.no wrote: On 06/29/2011 03:45 PM, Matthew Brett wrote: Hi, On Wed, Jun 29, 2011 at 12:39 AM, Mark Wiebemwwi...@gmail.com wrote: On Tue, Jun 28, 2011 at 5:20 PM, Matthew Brettmatthew.br...@gmail.com wrote: Hi, On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smithn...@pobox.com wrote: ... (You might think, what difference does it make if you *can* unmask an item? Us missing data folks could just ignore this feature. But: whatever we end up implementing is something that I will have to explain over and over to different people, most of them not particularly sophisticated programmers. And there's just no sensible way to explain this idea that if you store some particular value, then it replaces the old value, but if you store NA, then the old value is still there. Ouch - yes. No question, that is difficult to explain. Well, I think the explanation might go like this: Ah, yes, well, that's because in fact numpy records missing values by using a 'mask'. So when you say `a[3] = np.NA', what you mean is, 'a._mask = np.ones(a.shape, np.dtype(bool); a._mask[3] = False` Is that fair? My favorite way of explaining it would be to have a grid of numbers written on paper, then have several cardboards with holes poked in them in different configurations. Placing these cardboard masks in front of the grid would show different sets of non-missing data, without affecting the values stored on the paper behind them. Right - but here of course you are trying to explain the mask, and this is Nathaniel's point, that in order to explain NAs, you have to explain masks, and so, even at a basic level, the fusion of the two ideas is obvious, and already confusing. I mean this: a[3] = np.NA Oh, so you just set the a[3] value to have some missing value code? Ah - no - in fact what I did was set a associated mask in position a[3] so that you can't any longer see the previous value of a[3] Huh. 
You mean I have a mask for every single value in order to be able to blank out a[3]? It looks like an assignment. I mean, it looks just like a[3] = 4. But I guess it isn't? Er... I think Nathaniel's point is a very good one - these are separate ideas, np.NA and np.IGNORE, and a joint implementation is bound to draw them together in the mind of the user. Apart from anything else, the user has to know that, if they want a single NA value in an array, they have to add a mask size array.shape in bytes. They have to know then, that NA is implemented by masking, and then the 'NA for free by adding masking' idea breaks down and starts to feel like a kludge. The counter argument is of course that, in time, the implementation of NA with masking will seem as obvious and intuitive, as, say, broadcasting, and that we are just reacting from lack of experience with the new API. However, no matter how used we get to this, people coming from almost any other tool (in particular R) will keep think it is counter-intuitive. Why set up a major semantic incompatability that people then have to overcome in order to start using NumPy. I really don't see what's wrong with some more explicit API like a.mask[3] = True. Explicit is better than implicit. Dag Sverre ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] missing data discussion round 2
On Wed, Jun 29, 2011 at 2:26 AM, Dag Sverre Seljebotn d.s.seljeb...@astro.uio.no wrote: On 06/27/2011 05:55 PM, Mark Wiebe wrote: First I'd like to thank everyone for all the feedback you're providing, clearly this is an important topic to many people, and the discussion has helped clarify the ideas for me. I've renamed and updated the NEP, then placed it into the master NumPy repository so it has a more permanent home here: https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst One thing to think about is the presence of SSE/AVX instructions, which has the potential to change some of the memory/speed trade-offs here. In the newest Intel-platform CPUs you can do 256-bit operations, translating to a theoretical factor 8 speedup for in-cache single precision data, and the instruction set is constructed for future expansion possibilites to 512 or 1024 bit registers. The ufuncs themselves need a good bit of refactoring to be able to use these kinds of instructions well. I'm definitely thinking about this kind of thing while designing/implementing. I feel one should take care to not design oneself into a corner where this can't (eventually) be leveraged. 1) The shuffle instructions takes a single byte as a control character for moving around data in different ways in 128-bit registers. One could probably implement fast IGNORE-style NA with a seperate mask using 1 byte per 16 bytes of data (with 4 or 8-byte elements). OTOH, I'm not sure if 1 byte per element kind of mask would be that fast (but I don't know much about this and haven't looked at the details). This level of optimization, while important, is often dwarfed by the effects of cache. Because of the complexity of the system demanded by the functionality, I'm trying to favor simplicity and generality without precluding high performance. 
2) The alternative Parameterized Data Type Which Adds Additional Memory for the NA Flag would mean that contiguous arrays with NA's/IGNORE's would not be subject to vector instructions, or create a mess of copying in and out prior to operating on the data. This really seems like the worst of all possibilites to me. This one was suggested on the list, so I added it. -Mark (FWIW, my vote is in favour of both NA-using-NaN and IGNORE-using-explicit-masks, and keep the two as entirely seperate worlds to avoid confusion.) Dag Sverre ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] missing data discussion round 2
On Wed, Jun 29, 2011 at 8:20 AM, Lluís xscr...@gmx.net wrote: Matthew Brett writes: Maybe instead of np.NA, we could say np.IGNORE, which sort of conveys the idea that the entry is still there, but we're just ignoring it. Of course, that goes against common convention, but it might be easier to explain. I think Nathaniel's point is that np.IGNORE is a different idea than np.NA, and that is why joining the implementations can lead to conceptual confusion. This is how I see it: a = np.array([0, 1, 2], dtype=int) a[0] = np.NA ValueError e = np.array([np.NA, 1, 2], dtype=int) ValueError b = np.array([np.NA, 1, 2], dtype=np.maybe(int)) m = np.array([np.NA, 1, 2], dtype=int, masked=True) bm = np.array([np.NA, 1, 2], dtype=np.maybe(int), masked=True) b[1] = np.NA np.sum(b) np.NA np.sum(b, skipna=True) 2 b.mask None m[1] = np.NA np.sum(m) 2 np.sum(m, skipna=True) 2 m.mask [False, False, True] bm[1] = np.NA np.sum(bm) 2 np.sum(bm, skipna=True) 2 bm.mask [False, False, True] So: * Mask takes precedence over bit pattern on element assignment. There's still the question of how to assign a bit pattern NA when the mask is active. * When using mask, elements are automagically skipped. * m[1] = np.NA is equivalent to m.mask[1] = False * When using bit pattern + mask, it might make sense to have the initial values as bit-pattern NAs, instead of masked (i.e., bm.mask == [True, False, True] and np.sum(bm) == np.NA) There seems to be a general idea that masks and NA bit patterns imply particular differing semantics, something which I think is simply false. Both NaN and Inf are implemented in hardware with the same idea as the NA bit pattern, but they do not follow NA missing value semantics. As far as I can tell, the only required difference between them is that NA bit patterns must destroy the data. Nothing else. Everything on top of that is a choice of API and interface mechanisms. 
I want them to behave exactly the same except for that necessary difference, so that it will be possible to use the *exact same Python code* with either approach. Say you're using NA dtypes, and suddenly you think, what if I temporarily treated these as NA too. Now you have to copy your whole array to avoid destroying your data! The NA bit pattern didn't save you memory here... Say you're using masks, and it turns out you didn't actually need masking semantics. If they're different, you now have to do lots of code changes to switch to NA dtypes! -Mark Lluis -- And it's much the same thing with knowledge, for whenever you learn something new, the whole world becomes that much richer. -- The Princess of Pure Reason, as told by Norton Juster in The Phantom Tollbooth ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
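[Editor's aside: the one required difference Mark names - bit patterns destroy the data, masks merely hide it - can be shown side by side with current NumPy, NaN standing in for a bit-pattern NA.]

```python
import numpy as np

# Bit-pattern NA: the old 2.0 is overwritten and gone for good.
bitpat = np.array([1.0, 2.0, 3.0])
bitpat[1] = np.nan
destroyed = bool(np.isnan(bitpat[1]))

# Mask NA: the old 2.0 survives underneath the mask.
masked = np.ma.array([1.0, 2.0, 3.0])
masked[1] = np.ma.masked
recovered = float(masked.data[1])
```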
Re: [Numpy-discussion] missing data discussion round 2
On Wed, Jun 29, 2011 at 8:45 AM, Matthew Brett matthew.br...@gmail.comwrote: Hi, On Wed, Jun 29, 2011 at 12:39 AM, Mark Wiebe mwwi...@gmail.com wrote: On Tue, Jun 28, 2011 at 5:20 PM, Matthew Brett matthew.br...@gmail.com wrote: Hi, On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smith n...@pobox.com wrote: ... (You might think, what difference does it make if you *can* unmask an item? Us missing data folks could just ignore this feature. But: whatever we end up implementing is something that I will have to explain over and over to different people, most of them not particularly sophisticated programmers. And there's just no sensible way to explain this idea that if you store some particular value, then it replaces the old value, but if you store NA, then the old value is still there. Ouch - yes. No question, that is difficult to explain. Well, I think the explanation might go like this: Ah, yes, well, that's because in fact numpy records missing values by using a 'mask'. So when you say `a[3] = np.NA', what you mean is, 'a._mask = np.ones(a.shape, np.dtype(bool); a._mask[3] = False` Is that fair? My favorite way of explaining it would be to have a grid of numbers written on paper, then have several cardboards with holes poked in them in different configurations. Placing these cardboard masks in front of the grid would show different sets of non-missing data, without affecting the values stored on the paper behind them. Right - but here of course you are trying to explain the mask, and this is Nathaniel's point, that in order to explain NAs, you have to explain masks, and so, even at a basic level, the fusion of the two ideas is obvious, and already confusing. I mean this: a[3] = np.NA Oh, so you just set the a[3] value to have some missing value code? I would answer Yes, that's basically true. The abstraction works that way, and there's no reason to confuse people with those implementation details right off the bat. 
When you introduce a new user to floating point numbers, it would seem odd to first point out that addition isn't associative. That kind of detail is important when you're learning more about the system and digging deeper. I think it was in a Knuth book that I read the idea that the best teaching is a series of lies that successively correct the previous lies. Ah - no - in fact what I did was set an associated mask in position a[3] so that you can't any longer see the previous value of a[3] Huh. You mean I have a mask for every single value in order to be able to blank out a[3]? It looks like an assignment. I mean, it looks just like a[3] = 4. But I guess it isn't? Er... I think Nathaniel's point is a very good one - these are separate ideas, np.NA and np.IGNORE, and a joint implementation is bound to draw them together in the mind of the user. R jointly implements them with the na.rm=TRUE parameter, and that's our model system for missing data. Apart from anything else, the user has to know that, if they want a single NA value in an array, they have to add a mask of size array.shape in bytes. They have to know then, that NA is implemented by masking, and then the 'NA for free by adding masking' idea breaks down and starts to feel like a kludge. The counter argument is of course that, in time, the implementation of NA with masking will seem as obvious and intuitive as, say, broadcasting, and that we are just reacting from lack of experience with the new API. It will literally work the same as the implementation with NA dtypes, except for the masking semantics, which require the extra steps of taking views. Of course, that does happen, but here, unless I am mistaken, the primary drive to fuse NA and masking is because of ease of implementation. That's not the case, and I've tried to give a slightly better justification for this in my answer to Lluis' email.
That doesn't necessarily mean that they don't go together - if something is easy to implement, sometimes it means it will also feel natural in use, but at least we might say that there is some risk of the implementation driving the API, and that that can lead to problems. In the design process I'm doing, the implementation concerns are affecting the interface concerns and vice versa, but the missing data semantics are the main driver. -Mark See you, Matthew ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] missing data discussion round 2
On Wed, Jun 29, 2011 at 9:35 AM, Dag Sverre Seljebotn d.s.seljeb...@astro.uio.no wrote: On 06/29/2011 03:45 PM, Matthew Brett wrote: Hi, On Wed, Jun 29, 2011 at 12:39 AM, Mark Wiebemwwi...@gmail.com wrote: On Tue, Jun 28, 2011 at 5:20 PM, Matthew Brettmatthew.br...@gmail.com wrote: Hi, On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smithn...@pobox.com wrote: ... (You might think, what difference does it make if you *can* unmask an item? Us missing data folks could just ignore this feature. But: whatever we end up implementing is something that I will have to explain over and over to different people, most of them not particularly sophisticated programmers. And there's just no sensible way to explain this idea that if you store some particular value, then it replaces the old value, but if you store NA, then the old value is still there. Ouch - yes. No question, that is difficult to explain. Well, I think the explanation might go like this: Ah, yes, well, that's because in fact numpy records missing values by using a 'mask'. So when you say `a[3] = np.NA', what you mean is, 'a._mask = np.ones(a.shape, np.dtype(bool); a._mask[3] = False` Is that fair? My favorite way of explaining it would be to have a grid of numbers written on paper, then have several cardboards with holes poked in them in different configurations. Placing these cardboard masks in front of the grid would show different sets of non-missing data, without affecting the values stored on the paper behind them. Right - but here of course you are trying to explain the mask, and this is Nathaniel's point, that in order to explain NAs, you have to explain masks, and so, even at a basic level, the fusion of the two ideas is obvious, and already confusing. I mean this: a[3] = np.NA Oh, so you just set the a[3] value to have some missing value code? Ah - no - in fact what I did was set a associated mask in position a[3] so that you can't any longer see the previous value of a[3] Huh. 
You mean I have a mask for every single value in order to be able to blank out a[3]? It looks like an assignment. I mean, it looks just like a[3] = 4. But I guess it isn't? Er... I think Nathaniel's point is a very good one - these are separate ideas, np.NA and np.IGNORE, and a joint implementation is bound to draw them together in the mind of the user.Apart from anything else, the user has to know that, if they want a single NA value in an array, they have to add a mask size array.shape in bytes. They have to know then, that NA is implemented by masking, and then the 'NA for free by adding masking' idea breaks down and starts to feel like a kludge. The counter argument is of course that, in time, the implementation of NA with masking will seem as obvious and intuitive, as, say, broadcasting, and that we are just reacting from lack of experience with the new API. However, no matter how used we get to this, people coming from almost any other tool (in particular R) will keep think it is counter-intuitive. Why set up a major semantic incompatability that people then have to overcome in order to start using NumPy. I'm not aware of a semantic incompatibility. I believe R doesn't support views like NumPy does, so the things you have to do to see masking semantics aren't even possible in R. I really don't see what's wrong with some more explicit API like a.mask[3] = True. Explicit is better than implicit. I agree, but initial feedback was that the way R deals with NA values is very nice, and I've come to agree that it's worth emulating. -Mark Dag Sverre ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] missing data discussion round 2
On 06/29/2011 07:38 PM, Mark Wiebe wrote: On Wed, Jun 29, 2011 at 9:35 AM, Dag Sverre Seljebotn d.s.seljeb...@astro.uio.no mailto:d.s.seljeb...@astro.uio.no wrote: On 06/29/2011 03:45 PM, Matthew Brett wrote: Hi, On Wed, Jun 29, 2011 at 12:39 AM, Mark Wiebemwwi...@gmail.com mailto:mwwi...@gmail.com wrote: On Tue, Jun 28, 2011 at 5:20 PM, Matthew Brettmatthew.br...@gmail.com mailto:matthew.br...@gmail.com wrote: Hi, On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smithn...@pobox.com mailto:n...@pobox.com wrote: ... (You might think, what difference does it make if you *can* unmask an item? Us missing data folks could just ignore this feature. But: whatever we end up implementing is something that I will have to explain over and over to different people, most of them not particularly sophisticated programmers. And there's just no sensible way to explain this idea that if you store some particular value, then it replaces the old value, but if you store NA, then the old value is still there. Ouch - yes. No question, that is difficult to explain. Well, I think the explanation might go like this: Ah, yes, well, that's because in fact numpy records missing values by using a 'mask'. So when you say `a[3] = np.NA', what you mean is, 'a._mask = np.ones(a.shape, np.dtype(bool); a._mask[3] = False` Is that fair? My favorite way of explaining it would be to have a grid of numbers written on paper, then have several cardboards with holes poked in them in different configurations. Placing these cardboard masks in front of the grid would show different sets of non-missing data, without affecting the values stored on the paper behind them. Right - but here of course you are trying to explain the mask, and this is Nathaniel's point, that in order to explain NAs, you have to explain masks, and so, even at a basic level, the fusion of the two ideas is obvious, and already confusing. I mean this: a[3] = np.NA Oh, so you just set the a[3] value to have some missing value code? 
Ah - no - in fact what I did was set a associated mask in position a[3] so that you can't any longer see the previous value of a[3] Huh. You mean I have a mask for every single value in order to be able to blank out a[3]? It looks like an assignment. I mean, it looks just like a[3] = 4. But I guess it isn't? Er... I think Nathaniel's point is a very good one - these are separate ideas, np.NA and np.IGNORE, and a joint implementation is bound to draw them together in the mind of the user.Apart from anything else, the user has to know that, if they want a single NA value in an array, they have to add a mask size array.shape in bytes. They have to know then, that NA is implemented by masking, and then the 'NA for free by adding masking' idea breaks down and starts to feel like a kludge. The counter argument is of course that, in time, the implementation of NA with masking will seem as obvious and intuitive, as, say, broadcasting, and that we are just reacting from lack of experience with the new API. However, no matter how used we get to this, people coming from almost any other tool (in particular R) will keep think it is counter-intuitive. Why set up a major semantic incompatability that people then have to overcome in order to start using NumPy. I'm not aware of a semantic incompatibility. I believe R doesn't support views like NumPy does, so the things you have to do to see masking semantics aren't even possible in R. Well, whether the same feature is possible or not in R is irrelevant to whether a semantic incompatability would exist. Views themselves are a *major* semantic incompatability, and are highly confusing at first to MATLAB/Fortran/R people. However they have major advantages outweighing the disadvantage of having to caution new users. 
But there's simply no precedence anywhere for an assignment that doesn't erase the old value for a particular input value, and the advantages seem pretty minor (well, I think it is ugly in its own right, but that is besides the point...) Dag Sverre ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
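[Editor's aside: precedent-free assignment semantics are not entirely new territory for NumPy - np.ma's existing hard mask is another case where an ordinary-looking assignment does not do what a plain array user would expect: writes to hard-masked entries are silently ignored.]

```python
import numpy as np

a = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False], hard_mask=True)

a[1] = 99.0                      # looks like an assignment, but the hard mask
                                 # blocks writes to masked entries
still_na = a[1] is np.ma.masked  # element 1 is still missing
untouched = float(a.data[1])     # 2.0 under the mask, unchanged
```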
Re: [Numpy-discussion] missing data discussion round 2
Mark Wiebe writes: There seems to be a general idea that masks and NA bit patterns imply particular differing semantics, something which I think is simply false.

Well, my example contained a difference (the need for the skipna=True argument) precisely because it seemed that there was some need for different defaults. Honestly, I think this difference breaks the POLA (principle of least astonishment). [...]

As far as I can tell, the only required difference between them is that NA bit patterns must destroy the data. Nothing else. Everything on top of that is a choice of API and interface mechanisms. I want them to behave exactly the same except for that necessary difference, so that it will be possible to use the *exact same Python code* with either approach.

I completely agree. What I'd suggest is a global and/or per-object ndarray.flags.skipna for people like me who just want to ignore these entries without caring about setting it on each operation (or the other way around, depending on the default behaviour). The downside is that it adds yet another tweaking knob, which is not desirable...

Lluis
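[Editor's aside: a minimal sketch of the per-object skipna knob Lluis suggests, written as a hypothetical wrapper class over NaN bit patterns - SkipnaArray is invented here for illustration, not an existing API, and a real implementation would live in ndarray.flags as he proposes.]

```python
import numpy as np

class SkipnaArray:
    """Hypothetical per-object skipna flag layered on NaN bit patterns."""

    def __init__(self, values, skipna=False):
        self.values = np.asarray(values, dtype=float)
        self.skipna = skipna            # chosen once, not on every operation

    def sum(self):
        # The flag picks the reduction behaviour, replacing per-call skipna=True.
        return np.nansum(self.values) if self.skipna else self.values.sum()

strict = SkipnaArray([1.0, np.nan, 3.0])                # NA propagates
lenient = SkipnaArray([1.0, np.nan, 3.0], skipna=True)  # NA skipped
```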
Re: [Numpy-discussion] missing data discussion round 2
On 06/29/2011 01:07 PM, Dag Sverre Seljebotn wrote: On 06/29/2011 07:38 PM, Mark Wiebe wrote: On Wed, Jun 29, 2011 at 9:35 AM, Dag Sverre Seljebotn d.s.seljeb...@astro.uio.nomailto:d.s.seljeb...@astro.uio.no wrote: On 06/29/2011 03:45 PM, Matthew Brett wrote: Hi, On Wed, Jun 29, 2011 at 12:39 AM, Mark Wiebemwwi...@gmail.com mailto:mwwi...@gmail.com wrote: On Tue, Jun 28, 2011 at 5:20 PM, Matthew Brettmatthew.br...@gmail.commailto:matthew.br...@gmail.com wrote: Hi, On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smithn...@pobox.com mailto:n...@pobox.com wrote: ... (You might think, what difference does it make if you *can* unmask an item? Us missing data folks could just ignore this feature. But: whatever we end up implementing is something that I will have to explain over and over to different people, most of them not particularly sophisticated programmers. And there's just no sensible way to explain this idea that if you store some particular value, then it replaces the old value, but if you store NA, then the old value is still there. Ouch - yes. No question, that is difficult to explain. Well, I think the explanation might go like this: Ah, yes, well, that's because in fact numpy records missing values by using a 'mask'. So when you say `a[3] = np.NA', what you mean is, 'a._mask = np.ones(a.shape, np.dtype(bool); a._mask[3] = False` Is that fair? My favorite way of explaining it would be to have a grid of numbers written on paper, then have several cardboards with holes poked in them in different configurations. Placing these cardboard masks in front of the grid would show different sets of non-missing data, without affecting the values stored on the paper behind them. Right - but here of course you are trying to explain the mask, and this is Nathaniel's point, that in order to explain NAs, you have to explain masks, and so, even at a basic level, the fusion of the two ideas is obvious, and already confusing. 
I mean this: a[3] = np.NA Oh, so you just set the a[3] value to have some missing value code? Ah - no - in fact what I did was set a associated mask in position a[3] so that you can't any longer see the previous value of a[3] Huh. You mean I have a mask for every single value in order to be able to blank out a[3]? It looks like an assignment. I mean, it looks just like a[3] = 4. But I guess it isn't? Er... I think Nathaniel's point is a very good one - these are separate ideas, np.NA and np.IGNORE, and a joint implementation is bound to draw them together in the mind of the user.Apart from anything else, the user has to know that, if they want a single NA value in an array, they have to add a mask size array.shape in bytes. They have to know then, that NA is implemented by masking, and then the 'NA for free by adding masking' idea breaks down and starts to feel like a kludge. The counter argument is of course that, in time, the implementation of NA with masking will seem as obvious and intuitive, as, say, broadcasting, and that we are just reacting from lack of experience with the new API. However, no matter how used we get to this, people coming from almost any other tool (in particular R) will keep think it is counter-intuitive. Why set up a major semantic incompatability that people then have to overcome in order to start using NumPy. I'm not aware of a semantic incompatibility. I believe R doesn't support views like NumPy does, so the things you have to do to see masking semantics aren't even possible in R. Well, whether the same feature is possible or not in R is irrelevant to whether a semantic incompatability would exist. Views themselves are a *major* semantic incompatability, and are highly confusing at first to MATLAB/Fortran/R people. However they have major advantages outweighing the disadvantage of having to caution new users. 
But there's simply no precedence anywhere for an assignment that doesn't erase the old value for a particular input value, and the advantages seem pretty minor (well, I think it is ugly in its own right, but that is beside the point...)

Dag Sverre

Depending on what you really mean by 'precedence', in most stats software (R, SAS, etc.) it is completely up to the user to do this and do it correctly.
Re: [Numpy-discussion] missing data discussion round 2
Hi, On Wed, Jun 29, 2011 at 6:22 PM, Mark Wiebe mwwi...@gmail.com wrote: On Wed, Jun 29, 2011 at 8:20 AM, Lluís xscr...@gmx.net wrote: Matthew Brett writes:

Maybe instead of np.NA, we could say np.IGNORE, which sort of conveys the idea that the entry is still there, but we're just ignoring it. Of course, that goes against common convention, but it might be easier to explain.

I think Nathaniel's point is that np.IGNORE is a different idea than np.NA, and that is why joining the implementations can lead to conceptual confusion. This is how I see it:

a = np.array([0, 1, 2], dtype=int)
a[0] = np.NA                              # ValueError
e = np.array([np.NA, 1, 2], dtype=int)    # ValueError
b = np.array([np.NA, 1, 2], dtype=np.maybe(int))
m = np.array([np.NA, 1, 2], dtype=int, masked=True)
bm = np.array([np.NA, 1, 2], dtype=np.maybe(int), masked=True)
b[1] = np.NA
np.sum(b)                # np.NA
np.sum(b, skipna=True)   # 2
b.mask                   # None
m[1] = np.NA
np.sum(m)                # 2
np.sum(m, skipna=True)   # 2
m.mask                   # [False, False, True]
bm[1] = np.NA
np.sum(bm)               # 2
np.sum(bm, skipna=True)  # 2
bm.mask                  # [False, False, True]

So:
* Mask takes precedence over bit pattern on element assignment. There's still the question of how to assign a bit pattern NA when the mask is active.
* When using mask, elements are automagically skipped.
* m[1] = np.NA is equivalent to m.mask[1] = False
* When using bit pattern + mask, it might make sense to have the initial values as bit-pattern NAs, instead of masked (i.e., bm.mask == [True, False, True] and np.sum(bm) == np.NA)

There seems to be a general idea that masks and NA bit patterns imply particular differing semantics, something which I think is simply false.

Well - first - it's helpful surely to separate the concepts and the implementation. Concepts / use patterns (as delineated by Nathaniel): A) missing values == 'np.NA' in my emails. Can we call that CMV (concept missing values)? B) masks == np.IGNORE in my emails. CMSK (concept masks)? Implementations: 1) bit-pattern == na-dtype - how about we call that IBP (implementation bit pattern)? 2) array.mask. IM (implementation mask)?

Nathaniel implied that: CMV implies sum([np.NA, 1]) == np.NA; CMSK implies sum([np.NA, 1]) == 1 - and indeed, that's how R and masked arrays respectively behave. So I think it's reasonable to say that at least R thought that the bit pattern implied the first and Pierre and others thought the mask meant the second. The NEP as it stands thinks of CMV and CMSK as being different views of the same thing. Please correct me if I'm wrong.

Both NaN and Inf are implemented in hardware with the same idea as the NA bit pattern, but they do not follow NA missing value semantics.

Right - and that doesn't affect the argument, because the argument is about the concepts and not the implementation.

As far as I can tell, the only required difference between them is that NA bit patterns must destroy the data. Nothing else.

I think Nathaniel's point was about the expected default behavior in the different concepts.

Everything on top of that is a choice of API and interface mechanisms. I want them to behave exactly the same except for that necessary difference, so that it will be possible to use the *exact same Python code* with either approach.

Right. And Nathaniel's point is that that desire leads to fusion of the two ideas into one when they should be separated. For example, if I understand correctly:

a = np.array([1.0, 2.0, 3, 7.0], masked=True)
b = np.array([1.0, 2.0, np.NA, 7.0], dtype='NA[f8]')
a[3] = np.NA # actual real hand-on-heart assignment
b[3] = np.NA # magic mask setting although it looks the same

Say you're using NA dtypes, and suddenly you think, what if I temporarily treated these as NA too. Now you have to copy your whole array to avoid destroying your data! The NA bit pattern didn't save you memory here... Say you're using masks, and it turns out you didn't actually need masking semantics.
If they're different, you now have to do lots of code changes to switch to NA dtypes!

I personally have not run across that case. I'd imagine that, if you knew you wanted to do something so explicitly masking-like, you'd start with the masking interface.

Clearly there are some overlaps between what masked arrays are trying to achieve and what R's NA mechanisms are trying to achieve. Are they really similar enough that they should function using the same API? And if so, won't that be confusing? I think that's the question that's being asked.

See you, Matthew
___
NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
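The default behaviours Matthew distinguishes (CMV propagates, CMSK skips) can be demonstrated with what NumPy already ships, using NaN as a rough stand-in for a propagating bit-pattern NA and np.ma for skipping mask semantics; np.NA and the skipna= keyword are from the NEP, not released NumPy:

```python
import numpy as np

cmv = np.array([np.nan, 1.0])                       # NaN standing in for a bit-pattern NA
cmsk = np.ma.array([0.0, 1.0], mask=[True, False])  # masked element standing in for IGNORE

print(np.sum(cmv))     # nan -- missing propagates, like R's sum(c(NA, 1))
print(cmsk.sum())      # 1.0 -- the masked element is skipped
print(np.nansum(cmv))  # 1.0 -- explicit opt-in skipping, akin to skipna=True
```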
Re: [Numpy-discussion] missing data discussion round 2
Oops, On Wed, Jun 29, 2011 at 8:32 PM, Matthew Brett matthew.br...@gmail.com wrote: [snip]
For example, if I understand correctly:

a = np.array([1.0, 2.0, 3, 7.0], masked=True)
b = np.array([1.0, 2.0, np.NA, 7.0], dtype='NA[f8]')
a[3] = np.NA # actual real hand-on-heart assignment
b[3] = np.NA # magic mask setting although it looks the same

I meant:

a = np.array([1.0, 2.0, 3.0, 7.0], masked=True)
b = np.array([1.0, 2.0, 3.0, 7.0], dtype='NA[f8]')
b[3] = np.NA # actual real hand-on-heart assignment
a[3] = np.NA # magic mask setting although it looks the same

Sorry, Matthew
Re: [Numpy-discussion] missing data discussion round 2
Hi, On Wed, Jun 29, 2011 at 7:20 PM, Lluís xscr...@gmx.net wrote: Mark Wiebe writes:

There seems to be a general idea that masks and NA bit patterns imply particular differing semantics, something which I think is simply false.

Well, my example contained a difference (the need for the skipna=True argument) precisely because it seemed that there was some need for different defaults. Honestly, I think this difference breaks the POLA (principle of least astonishment). [...]

As far as I can tell, the only required difference between them is that NA bit patterns must destroy the data. Nothing else. Everything on top of that is a choice of API and interface mechanisms. I want them to behave exactly the same except for that necessary difference, so that it will be possible to use the *exact same Python code* with either approach.

I completely agree. What I'd suggest is a global and/or per-object ndarray.flags.skipna for people like me that just want to ignore these entries without caring about setting it on each operation (or the other way around, depending on the default behaviour). The downside is that it adds yet another tweaking knob, which is not desirable...

Oh - dear - that would be horrible: if, depending on a tweak somewhere in the distant past of your script, this:

a = np.array([np.NA, 1.0], masked=True)
np.sum(a)

could return either np.NA or 1.0... Imagine someone twiddled the knob the other way and ran your script...

See you, Matthew
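Matthew's objection can be made concrete with a small sketch. The SKIPNA toggle and the na_sum helper below are both hypothetical (no such flag exists in NumPy); the point is only that a global knob changes the meaning of the same expression at a distance:

```python
import numpy as np

SKIPNA = False  # hypothetical global knob, imagine it toggled who-knows-where

def na_sum(a):
    # Hypothetical reduction honouring the global flag: skip masked
    # entries if SKIPNA is set, otherwise propagate the missing value.
    if SKIPNA:
        return a.sum()
    return np.ma.masked if a.mask.any() else a.sum()

a = np.ma.array([0.0, 1.0], mask=[True, False])
print(na_sum(a))   # the masked constant: missing propagates
SKIPNA = True
print(na_sum(a))   # 1.0: the very same expression now silently skips
```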
Re: [Numpy-discussion] missing data discussion round 2
On Wed, Jun 29, 2011 at 1:32 PM, Matthew Brett matthew.br...@gmail.com wrote: [snip]
Implementations: 1) bit-pattern == na-dtype - how about we call that IBP (implementation bit pattern)? 2) array.mask. IM (implementation mask)?

Remember that the masks are invisible, you can't see them, they are an implementation detail. A good reason to hide the implementation is so it can be changed without impacting software that depends on the API.

[snip]

Chuck
Re: [Numpy-discussion] missing data discussion round 2
Hi, On Wed, Jun 29, 2011 at 9:17 PM, Charles R Harris charlesr.har...@gmail.com wrote: On Wed, Jun 29, 2011 at 1:32 PM, Matthew Brett matthew.br...@gmail.com wrote: [snip]
Can we call that CMV (concept missing values)? B) masks == np.IGNORE in my emails. CMSK (concept masks)? Implementations: 1) bit-pattern == na-dtype - how about we call that IBP (implementation bit pattern)? 2) array.mask. IM (implementation mask)?

Remember that the masks are invisible, you can't see them, they are an implementation detail. A good reason to hide the implementation is so it can be changed without impacting software that depends on the API.

It's not true that you can't see them, because masks are using the same API as for missing values. Because they're using the same API, the person using the CMV stuff will soon find out about the masks, accidentally or not; then they will need to understand masking, and that is the problem we're discussing here.

See you, Matthew
Re: [Numpy-discussion] missing data discussion round 2
On Wed, Jun 29, 2011 at 11:20 AM, Lluís xscr...@gmx.net wrote:

I completely agree. What I'd suggest is a global and/or per-object ndarray.flags.skipna for people like me that just want to ignore these entries without caring about setting it on each operation (or the other way around, depending on the default behaviour).

I agree with Matthew that this approach would end up having horrible side-effects, but I can see why you'd want some way to accomplish this... I suggested another approach to handling both NA-style and mask-style missing data by making them totally separate features. It's buried at the bottom of this over-long message (you can search for "my proposal"): http://mail.scipy.org/pipermail/numpy-discussion/2011-June/057251.html

I know that part 1 of that proposal would satisfy my needs, but I don't know as much about your use case, so I'm curious. Would that proposal (in particular, part 2, the classic masked-array part) work for you?

-- Nathaniel
Re: [Numpy-discussion] missing data discussion round 2
On 06/29/2011 09:32 AM, Matthew Brett wrote: Hi, [...] Clearly there are some overlaps between what masked arrays are trying to achieve and what R's NA mechanisms are trying to achieve. Are they really similar enough that they should function using the same API? And if so, won't that be confusing? I think that's the question that's being asked.

And I think the answer is no. No more confusing to people coming from R to numpy than views already are - with or without the NEP - and not *requiring* people to use any NA-related functionality beyond what they are used to from R. My understanding of the NEP is that it directly yields an API closely matching that of R, but with the opportunity, via views, to do more with less work, if one so desires.

The present masked array module could be made more efficient if the NEP is implemented; regardless of whether this is done, the masked array module is not about to vanish, so anyone wanting precisely the masked array API will have it; and others remain free to ignore it (except for those of us involved in developing libraries such as matplotlib, which will have to support all variations of the new API along with the already-supported masked arrays).

In addition, for new code, the full-blown masked array module may not be needed. A convenience it adds, however, is the automatic masking of invalid values:

In [1]: np.ma.log(-1)
Out[1]: masked

I'm sure this horrifies some, but there are times and places where it is a genuine convenience, and preferable to having to use a separate operation to replace nan or inf with NA or whatever it ends up being. If np.seterr were extended to allow such automatic masking as an option, then the need for a separate masked array module would shrink further. I wouldn't mind having to use an explicit kwarg for ignoring NA in reduction methods.
Eric
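Eric's example still works in current NumPy. A slightly fuller sketch of np.ma's automatic domain masking follows; the errstate guard just silences the underlying floating-point warnings:

```python
import numpy as np

with np.errstate(divide='ignore', invalid='ignore'):
    r = np.ma.log(np.array([-1.0, 0.0, np.e]))

print(r)       # [-- -- 1.0] -- out-of-domain inputs come back masked
print(r.mask)  # [ True  True False]
print(np.ma.log(-1) is np.ma.masked)  # True: the scalar case yields the masked constant
```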
Re: [Numpy-discussion] missing data discussion round 2
Nathaniel Smith writes:

I know that part 1 of that proposal would satisfy my needs, but I don't know as much about your use case, so I'm curious. Would that proposal (in particular, part 2, the classic masked-array part) work for you?

I'm for the option of having a single API when you want to have NA elements, regardless of whether it's using masks or bit patterns. My question is whether your ufuncs should react differently depending on the type of array you're using (bit pattern vs mask). In the beginning I thought it could make sense, as you know how you have created the array. So if you're using masks, you're probably going to ignore the NAs (because you've explicitly set them, and you don't want an NA as the result of your summation). *But*, the more API/semantics both approaches share, the better; so I'd say that it's better that they show the *very same* behaviour (w.r.t. skipna).

My concern is now about how to set skipna in a comfortable way, so that I don't have to set it again and again as ufunc arguments:

a
array([NA, 2, 3])
b
array([1, 2, NA])
a + b
array([NA, 4, NA])
a.flags.skipna = True
b.flags.skipna = True
a + b
array([1, 4, 3])

Lluis

-- And it's much the same thing with knowledge, for whenever you learn something new, the whole world becomes that much richer. -- The Princess of Pure Reason, as told by Norton Juster in The Phantom Tollbooth
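For comparison, here is how today's np.ma handles Lluís's a + b scenario. There is no flags.skipna (that attribute is part of the proposal, not NumPy): elementwise operations propagate the mask and only reductions skip, so the skipna-style elementwise result has to be requested explicitly, e.g. via filled():

```python
import numpy as np

a = np.ma.array([0, 2, 3], mask=[True, False, False])  # stand-in for [NA, 2, 3]
b = np.ma.array([1, 2, 0], mask=[False, False, True])  # stand-in for [1, 2, NA]
c = a + b
print(c)             # [-- 4 --] -- the mask propagates through the ufunc
print(c.sum())       # 4 -- reductions skip masked entries
print(a.filled(0) + b.filled(0))  # [1 4 3] -- the skipna-like result, made explicit
```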