[Numpy-discussion] Missing Data

2014-03-26 Thread T J
What is the status of:

   https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst

and of missing data in Numpy, more generally?

Is np.ma.array still the state-of-the-art way to handle missing data? Or
has something better and more comprehensive been put together?
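For context, a minimal sketch of the np.ma.array interface in question (editor's illustration, not part of the original message):

```python
import numpy as np

# Mask the second element; masked entries are skipped by reductions.
a = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False])

print(a.sum())         # 4.0 -- the masked 2.0 is excluded
print(a.mean())        # 2.0
print(a.filled(-999))  # masked entries replaced by the fill value -999
```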
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing Data

2014-03-26 Thread alex
On Wed, Mar 26, 2014 at 7:22 PM, T J tjhn...@gmail.com wrote:
 What is the status of:

https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst

For what it's worth, this NEP was written in 2011 by mwiebe, who made
258 numpy commits in 2011, 1 in 2012, and 3 in 2014.  According to
github, in the last few hours alone mwiebe has made several commits to
'blaze' and 'dynd-python'.  Here's the blog post explaining the vision
for Continuum's 'blaze' project http://continuum.io/blog/blaze.
Continuum seems to have been started in early 2012.


Re: [Numpy-discussion] Missing Data

2014-03-26 Thread Charles R Harris
On Wed, Mar 26, 2014 at 5:43 PM, alex argri...@ncsu.edu wrote:

 On Wed, Mar 26, 2014 at 7:22 PM, T J tjhn...@gmail.com wrote:
  What is the status of:
 
 https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst

 For what it's worth this NEP was written in 2011 by mwiebe who made
 258 numpy commits in 2011, 1 in 2012, and 3 in 2014.  According to
 github, in the last few hours alone mwiebe has made several commits to
 'blaze' and 'dynd-python'.  Here's the blog post explaining the vision
 for Continuum's 'blaze' project http://continuum.io/blog/blaze.
 Continuum seems to have been started in early 2012.


It looks like blaze will have bit-pattern missing values à la R. I don't
know if there is going to be a masked array implementation. The NA code was
taken out of Numpy because it was not possible to reach agreement that it
did the right thing.

Numpy.ma remains the only solution for bad data at this time. The code
could probably use more love than it has gotten ;)
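To illustrate the distinction between R-style bit-pattern missing values and a separate mask, here is a small sketch using plain NaN as a stand-in for a bit pattern (the proposed NA dtypes were never merged):

```python
import numpy as np

# Bit-pattern style: the missing value lives in the data itself.
x = np.array([1.0, np.nan, 3.0])
print(np.nansum(x))   # 4.0 -- NaN-aware reduction skips the missing entry

# Mask style: data and validity are stored separately.
m = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False])
print(m.sum())        # 4.0 -- mask respected
print(m.data[1])      # 2.0 -- the original value survives under the mask
```

The practical difference: a bit pattern destroys the underlying value, while a mask merely hides it and can be lifted later.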

Chuck


Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-14 Thread Richard Hattersley
For what it's worth, I'd prefer ndmasked.

As has been mentioned elsewhere, some algorithms can't really cope with
missing data. I'd very much rather they fail than silently give incorrect
results. Working in the climate prediction business (as with many other
domains I'm sure), even the *potential* for incorrect results can be
damaging.
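Today's numpy.ma already illustrates the silent-failure hazard: plain np.asarray() quietly discards the mask, so downstream code happily computes with the invalid values. A minimal demonstration (editor's sketch):

```python
import numpy as np

m = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False])
print(m.sum())          # 4.0 -- mask respected

plain = np.asarray(m)   # mask silently dropped, no error raised
print(plain.sum())      # 6.0 -- the "missing" 2.0 leaks back into the result
```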


On 11 May 2012 06:14, Travis Oliphant tra...@continuum.io wrote:


 On May 10, 2012, at 12:21 AM, Charles R Harris wrote:



 On Wed, May 9, 2012 at 11:05 PM, Benjamin Root ben.r...@ou.edu wrote:



 On Wednesday, May 9, 2012, Nathaniel Smith wrote:



 My only objection to this proposal is that committing to this approach
 seems premature. The existing masked array objects act quite
 differently from numpy.ma, so why do you believe that they're a good
 foundation for numpy.ma, and why will users want to switch to their
 semantics over numpy.ma's semantics? These aren't rhetorical
 questions, it seems like they must have concrete answers, but I don't
 know what they are.


 Based on the design decisions made in the original NEP, a re-made
 numpy.ma would have to lose _some_ features, particularly the ability to
 share masks. Save for that and some very obscure behaviors that are
 undocumented, it is possible to remake numpy.ma as a compatibility layer.

 That being said, I think that there are some fundamental questions that
 remain of concern. If I recall, there were unresolved questions about
 behaviors surrounding assignments to elements of a view.

 I see the project as broken down like this:
 1.) internal architecture (largely abi issues)
 2.) external architecture (hooks throughout numpy to utilize the new
 features where possible such as where= argument)
 3.) getter/setter semantics
 4.) mathematical semantics

 At this moment, I think we have pieces of 2 and they are fairly
 non-controversial. It is 1 that I see as being the immediate hold-up here.
 3 & 4 are non-trivial, but because they are mostly about interfaces, I
 think we can be willing to accept some very basic, fundamental, barebones
 components here in order to lay the groundwork for a more complete API
 later.

 To talk of Travis's proposal, doing nothing is a no-go. Not moving forward
 would dishearten the community. Making an ndmasked type is very intriguing.
 I see it as a step towards eventually deprecating ndarray? Also, how would
 it behave with np.asarray() and np.asanyarray()? My other concern is a
 possible violation of DRY. How difficult would it be to maintain two
 ndarrays in parallel?

 As for the flag approach, this still doesn't solve the problem of legacy
 code (or did I misunderstand?)
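Regarding the np.asarray()/np.asanyarray() question above, today's numpy.ma subclass offers a concrete point of comparison (a hypothetical ndmasked type would face the same choice):

```python
import numpy as np

m = np.ma.array([1.0, 2.0], mask=[False, True])

# asarray coerces to the base class: the mask is discarded.
print(type(np.asarray(m)) is np.ndarray)                # True

# asanyarray passes subclasses through: mask and type survive.
print(isinstance(np.asanyarray(m), np.ma.MaskedArray))  # True
```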


 My understanding of the flag is to allow the code to stay in and get
 reworked and experimented with while keeping it from contaminating
 conventional use.

 The whole point of putting the code in was to experiment and adjust. The
 rather bizarre idea that it needs to be perfect from the get go is
 disheartening, and is seldom how new things get developed. Sure, there is a
 plan up front, but there needs to be feedback and change. And in fact, I
 haven't seen much feedback about the actual code, I don't even know that
 the people complaining have tried using it to see where it hurts. I'd like
 that sort of feedback.


 I don't think anyone is saying it needs to be perfect from the get go.
  What I am saying is that this is fundamental enough to downstream users
 that this kind of thing is best done as a separate object.  The flag could
 still be used to make all Python-level array constructors build ndmasked
 objects.

 But, this doesn't address the C-level story where there is quite a bit of
 downstream use where people have used the NumPy array as just a pointer to
 memory without considering that there might be a mask attached that should
 be inspected as well.

 The NEP addresses this a little bit for those C or C++ consumers of the
 ndarray in C who always use PyArray_FromAny which can fail if the array has
 non-NULL mask contents.   However, it is *not* true that all downstream
 users use PyArray_FromAny.

 A large number of users just use something like PyArray_Check and then
 PyArray_DATA to get the pointer to the data buffer and then go from there
 thinking of their data as a strided memory chunk only (no extra mask).
  The NEP fundamentally changes this simple invariant that has been in NumPy
 and Numeric before it for a long, long time.

 I really don't see how we can do this in a 1.7 release. It has too many
 unknown and, I think, unknowable downstream effects. But I think we could
 introduce another arrayobject that is the masked_array with a Python-level
 flag that makes it the default array in Python.

 There are a few more subtleties. PyArray_Check by default will pass
 sub-classes, so if the new ndmask array were a sub-class then it would be
 passed (just like current numpy.ma arrays and matrices would pass that
 check today). However, there is a PyArray_CheckExact macro which could
 

Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-11 Thread Scott Sinclair
On 11 May 2012 06:57, Travis Oliphant tra...@continuum.io wrote:

 On May 10, 2012, at 3:40 AM, Scott Sinclair wrote:

 On 9 May 2012 18:46, Travis Oliphant tra...@continuum.io wrote:
 The document is available here:
    https://github.com/numpy/numpy.scipy.org/blob/master/NA-overview.rst

 This is orthogonal to the discussion, but I'm curious as to why this
 discussion document has landed in the website repo?

 I suppose it's not a really big deal, but future uploads of the
 website will now include a page at
 http://numpy.scipy.org/NA-overview.html with the content of this
 document. If that's desirable, I'll add a note at the top of the
 overview referencing this discussion thread. If not it can be
 relocated somewhere more desirable after this thread's discussion
 deadline expires.

 Yes, it can be relocated.   Can you suggest where it should go?  It was added
 there so that Nathaniel and Mark could both edit it together, with Nathaniel
 added to the web-team.

 It may not be a bad place for it, though.   At least for a while.

Having thought about it, a page on the website isn't a bad idea. I've
added a note pointing to this discussion. The document now appears at
http://numpy.scipy.org/NA-overview.html

Cheers,
Scott


Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-11 Thread Fernando Perez
On Thu, May 10, 2012 at 11:03 PM, Scott Sinclair
scott.sinclair...@gmail.com wrote:
 Having thought about it, a page on the website isn't a bad idea. I've
 added a note pointing to this discussion. The document now appears at
 http://numpy.scipy.org/NA-overview.html

Why not have a separate repo for neps/discussion docs?  That way,
people can be added to the team as they need to edit them and removed
when done, and it's separate from the main site itself.  The site can
simply have a link to this set of documents, which can be built,
tracked, separately and cleanly.  We have more or less that setup with
ipython for the site and docs:

- main site page that points to the doc builds:
http://ipython.org/documentation.html
- doc builds on a secondary site:
http://ipython.org/ipython-doc/stable/index.html

This seems to me like the best way to separate the main web team
(assuming we'll have a nice website for numpy one day) from the team
that will edit documents of nep/discussion type.  I imagine the web
team will be fairly stable, whereas the team for these docs will have
people coming and going.

Just a thought...  As usual, crib anything you find useful from our setup.

Cheers,

f


Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-11 Thread Scott Sinclair
On 11 May 2012 08:12, Fernando Perez fperez@gmail.com wrote:
 On Thu, May 10, 2012 at 11:03 PM, Scott Sinclair
 scott.sinclair...@gmail.com wrote:
 Having thought about it, a page on the website isn't a bad idea. I've
 added a note pointing to this discussion. The document now appears at
 http://numpy.scipy.org/NA-overview.html

 Why not have a separate repo for neps/discussion docs?  That way,
 people can be added to the team as they need to edit them and removed
 when done, and it's separate from the main site itself.  The site can
 simply have a link to this set of documents, which can be built,
 tracked, separately and cleanly.  We have more or less that setup with
 ipython for the site and docs:

 - main site page that points to the doc builds:
 http://ipython.org/documentation.html
 - doc builds on a secondary site:
 http://ipython.org/ipython-doc/stable/index.html

That's pretty much how things already work. The documentation is in
the main source tree and built docs end up at http://docs.scipy.org.
NEPs live at https://github.com/numpy/numpy/tree/master/doc/neps, but
don't get published outside of the source tree and there's no
preferred place for discussion documents.

 (assuming we'll have a nice website for numpy one day)

Ha ha ha ;-) Thanks for the thoughts and prodding.

Cheers,
Scott


Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-11 Thread Fernando Perez
On Thu, May 10, 2012 at 11:44 PM, Scott Sinclair
scott.sinclair...@gmail.com wrote:
 That's pretty much how things already work. The documentation is in
 the main source tree and built docs end up at http://docs.scipy.org.
 NEPs live at https://github.com/numpy/numpy/tree/master/doc/neps, but
 don't get published outside of the source tree and there's no
 preferred place for discussion documents.

No, b/c that means that for someone to be able to push to a NEP,
they'd have to get commit rights to the main numpy source code repo.
The whole point of what I'm suggesting is to isolate the NEP repo so
that commit rights can be given for it with minimal thought, whenever
pretty much anyone says they're going to work on a NEP.

Obviously today anyone can do that and submit a PR against the main
repo, but that raises the PR review burden for said repo.  And that
burden is something that we should strive to keep as low as possible,
so those key people (the team with commit rights to the main repo) can
focus their limited resources on reviewing code PRs.

I'm simply suggesting a way to spread the load as much as possible, so
that the team with commit rights on the main repo isn't a bottleneck
on other tasks.

Cheers,

f


Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-11 Thread Travis Oliphant

On May 11, 2012, at 2:13 AM, Fernando Perez wrote:

 On Thu, May 10, 2012 at 11:44 PM, Scott Sinclair
 scott.sinclair...@gmail.com wrote:
 That's pretty much how things already work. The documentation is in
 the main source tree and built docs end up at http://docs.scipy.org.
 NEPs live at https://github.com/numpy/numpy/tree/master/doc/neps, but
 don't get published outside of the source tree and there's no
 preferred place for discussion documents.
 
 No, b/c that means that for someone to be able to push to a NEP,
 they'd have to get commit rights to the main numpy source code repo.
 The whole point of what I'm suggesting is to isolate the NEP repo so
 that commit rights can be given for it with minimal thought, whenever
 pretty much anyone says they're going to work on a NEP.
 
 Obviously today anyone can do that and submit a PR against the main
 repo, but that raises the PR review burden for said repo.  And that
 burden is something that we should strive to keep as low as possible,
 so those key people (the team with commit rights to the main repo) can
 focus their limited resources on reviewing code PRs.
 
 I'm simply suggesting a way to spread the load as much as possible, so
 that the team with commit rights on the main repo isn't a bottleneck
 on other tasks.

This is a good idea, I think. I like the thought of a separate NEP and docs
repo.

-Travis


 
 Cheers,
 
 f


Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-11 Thread Mark Wiebe
On Thu, May 10, 2012 at 10:28 PM, Matthew Brett matthew.br...@gmail.comwrote:

 Hi,

 On Thu, May 10, 2012 at 2:43 AM, Nathaniel Smith n...@pobox.com wrote:
  Hi Matthew,
 
  On Thu, May 10, 2012 at 12:01 AM, Matthew Brett matthew.br...@gmail.com
 wrote:
  The third proposal is certainly the best one from Cython's perspective;
  and I imagine for those writing C extensions against the C API too.
  Having PyType_Check fail for ndmasked is a very good way of having code
  fail that is not written to take masks into account.
 
  Mark, Nathaniel - can you comment how your chosen approaches would
  interact with extension code?
 
  I'm guessing the bitpattern dtypes would be expected to cause
  extension code to choke if the type is not supported?
 
  That's pretty much how I'm imagining it, yes. Right now if you have,
  say, a Cython function like
 
  cdef f(np.ndarray[double] a):
 ...
 
  and you do f(np.zeros(10, dtype=int)), then it will error out, because
  that function doesn't know how to handle ints, only doubles. The same
  would apply for, say, a NA-enabled integer. In general there are
  almost arbitrarily many dtypes that could get passed into any function
  (including user-defined ones, etc.), so C code already has to check
  dtypes for correctness.
 
  Second order issues:
  - There is certainly C code out there that just assumes that it will
  only be passed an array with certain dtype (and ndim, memory layout,
  etc...). If you write such C code then it's your job to make sure that
  you only pass it the kinds of arrays that it expects, just like now
  :-).
 
  - We may want to do some sort of special-casing of handling for
  floating point NA dtypes that use an NaN as the magic bitpattern,
  since many algorithms *will* work with these unchanged, and it might
  be frustrating to have to wait for every extension module to be
  updated just to allow for this case explicitly before using them. OTOH
  you can easily work around this. Like say my_qr is a legacy C function
  that will in fact propagate NaNs correctly, so float NA dtypes would
  Just Work -- except, it errors out at the start because it doesn't
  recognize the dtype. How annoying. We *could* have some special hack
  you can use to force it to work anyway (by like making the "is this
  the dtype I expect?" routine lie.) But you can also just do:
 
   def my_qr_wrapper(arr):
 if arr.dtype is a NA float dtype with NaN magic value:
   result = my_qr(arr.view(arr.dtype.base_dtype))
   return result.view(arr.dtype)
 else:
   return my_qr(arr)
 
  and hey presto, now it will correctly pass through NAs. So perhaps
  it's not worth bothering with special hacks.
 
  - Of course if  your extension function does want to handle NAs
  generically, then there will be a simple C api for checking for them,
  setting them, etc. Numpy needs such an API internally anyway!

 Thanks for this.

 Mark - in view of the discussions about Cython and extension code -
 could you say what you see as disadvantages to the ndmasked subclass
 proposal?


The biggest difficulty looks to me like how to work with both of them
reasonably from the C API. The idea of ndarray and ndmasked having
different independent TypeObjects, but still working through the same API
calls feels a little disconcerting. Maybe this is a reasonable compromise,
though, it would be nice to see the idea fleshed out a bit more with some
examples of how the code would work from the C level.

Cheers,
Mark



 Cheers,

 Matthew


Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-10 Thread Gael Varoquaux
On Wed, May 09, 2012 at 02:35:26PM -0500, Travis Oliphant wrote:
  Basically it buys not forcing *all* NumPy users (on the C-API level) to
now deal with a masked array.    I know this push is a feature that is
part of Mark's intention (as it pushes downstream libraries to think about
missing data at a fundamental level). 

I think that this is a bad policy because:

 1. An array is not always data. I realize that there is a big push for
data-related computing lately, but I still believe that the notion of
missing data makes no sense for the majority of numpy arrays
instantiated.

 2. Not every algorithm can be made to work with missing data. I would
even say that most of the advanced algorithms do not work with missing
data.

Don't try to force upon people a problem that they do not have :).

Gael

PS: This message does not claim to take any position in the debate on
which solution for missing data is the best, because I don't think that I
have a good technical vision to back any position.


Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-10 Thread Scott Sinclair
On 9 May 2012 18:46, Travis Oliphant tra...@continuum.io wrote:
 The document is available here:
    https://github.com/numpy/numpy.scipy.org/blob/master/NA-overview.rst

This is orthogonal to the discussion, but I'm curious as to why this
discussion document has landed in the website repo?

I suppose it's not a really big deal, but future uploads of the
website will now include a page at
http://numpy.scipy.org/NA-overview.html with the content of this
document. If that's desirable, I'll add a note at the top of the
overview referencing this discussion thread. If not it can be
relocated somewhere more desirable after this thread's discussion
deadline expires.

Cheers,
Scott


Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-10 Thread Dag Sverre Seljebotn
On 05/10/2012 06:05 AM, Dag Sverre Seljebotn wrote:
 On 05/10/2012 01:01 AM, Matthew Brett wrote:
 Hi,

 On Wed, May 9, 2012 at 12:44 PM, Dag Sverre Seljebotn
 d.s.seljeb...@astro.uio.no   wrote:
 On 05/09/2012 06:46 PM, Travis Oliphant wrote:
 Hey all,

 Nathaniel and Mark have worked very hard on a joint document to try and
 explain the current status of the missing-data debate. I think they've
 done an amazing job at providing some context, articulating their views
 and suggesting ways forward in a mutually respectful manner. This is an
 exemplary collaboration and is at the core of why open source is valuable.

 The document is available here:
 https://github.com/numpy/numpy.scipy.org/blob/master/NA-overview.rst

 After reading that document, it appears to me that there are some
 fundamentally different views on how things should move forward. I'm
 also reading the document incorporating my understanding of the history,
 of NumPy as well as all of the users I've met and interacted with which
 means I have my own perspective that is not necessarily incorporated
 into that document but informs my recommendations. I'm not sure we can
 reach full consensus on this. We are also well past time for moving
 forward with a resolution on this (perhaps we can all agree on that).

 I would like one more discussion thread where the technical discussion
 can take place. I will make a plea that we keep this discussion as free
 from logical fallacies http://en.wikipedia.org/wiki/Logical_fallacy as
 we can. I can't guarantee that I personally will succeed at that, but I
 can tell you that I will try. That's all I'm asking of anyone else. I
 recognize that there are a lot of other issues at play here besides
 *just* the technical questions, but we are not going to resolve every
 community issue in this technical thread.

 We need concrete proposals and so I will start with three. Please feel
 free to comment on these proposals or add your own during the
 discussion. I will stop paying attention to this thread next Wednesday
 (May 16th) (or earlier if the thread dies) and hope that by that time we
 can agree on a way forward. If we don't have agreement, then I will move
 forward with what I think is the right approach. I will either write the
 code myself or convince someone else to write it.

 In all cases, we have agreement that bit-pattern dtypes should be added
 to NumPy. We should work on these (int32, float64, complex64, str, bool)
 to start. So, the three proposals are independent of this way forward.
 The proposals are all about the extra mask part:

 My three proposals:

 * do nothing and leave things as is

 * add a global flag that turns off masked array support by default but
 otherwise leaves things unchanged (I'm still unclear how this would work
 exactly)

 * move Mark's masked ndarray objects into a new fundamental type
 (ndmasked), leaving the actual ndarray type unchanged. The
 array_interface keeps the masked array notions and the ufuncs keep the
 ability to handle arrays like ndmasked. Ideally, numpy.ma would be
 changed to use ndmasked objects as their core.

 For the record, I'm currently in favor of the third proposal. Feel free
 to comment on these proposals (or provide your own).


 Bravo!, NA-overview.rst was an excellent read. Thanks Nathaniel and Mark!

 Yes, it is very well written, my compliments to the chefs.

 The third proposal is certainly the best one from Cython's perspective;
 and I imagine for those writing C extensions against the C API too.
 Having PyType_Check fail for ndmasked is a very good way of having code
 fail that is not written to take masks into account.

 I want to make something more clear: There are two Cython cases; in the
 case of cdef np.ndarray[double] there is no problem as PEP 3118 access
 will raise an exception for masked arrays.

 But, there's the case where you do cdef np.ndarray, and then proceed
 to use PyArray_DATA. Myself I do this more than PEP 3118 access; usually
 because I pass the data pointer to some C or C++ code.

 It'd be great to have such code be forward-compatible in the sense that
 it raises an exception when it meets a masked array. Having PyType_Check
 fail seems like the only way? Am I wrong?

I'm very sorry; I always meant PyObject_TypeCheck, not PyType_Check.

Dag



 Mark, Nathaniel - can you comment how your chosen approaches would
 interact with extension code?

 I'm guessing the bitpattern dtypes would be expected to cause
 extension code to choke if the type is not supported?

 The proposal, as I understand it, is to use that with new dtypes (?). So
 things will often be fine for that reason:

 if arr.dtype == np.float32:
   c_function_32bit(np.PyArray_DATA(arr), ...)
 else:
   raise ValueError("need 32-bit float array")



 Mark - in :

 https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst#cython

 - do I understand correctly that you think that Cython and other
 extension writers should 

Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-10 Thread Nathaniel Smith
Hi Matthew,

On Thu, May 10, 2012 at 12:01 AM, Matthew Brett matthew.br...@gmail.com wrote:
 The third proposal is certainly the best one from Cython's perspective;
 and I imagine for those writing C extensions against the C API too.
 Having PyType_Check fail for ndmasked is a very good way of having code
 fail that is not written to take masks into account.

 Mark, Nathaniel - can you comment how your chosen approaches would
 interact with extension code?

 I'm guessing the bitpattern dtypes would be expected to cause
 extension code to choke if the type is not supported?

That's pretty much how I'm imagining it, yes. Right now if you have,
say, a Cython function like

cdef f(np.ndarray[double] a):
...

and you do f(np.zeros(10, dtype=int)), then it will error out, because
that function doesn't know how to handle ints, only doubles. The same
would apply for, say, a NA-enabled integer. In general there are
almost arbitrarily many dtypes that could get passed into any function
(including user-defined ones, etc.), so C code already has to check
dtypes for correctness.
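The dtype-checking obligation described here can be mirrored at the Python level; sum_doubles below is a hypothetical stand-in for a typed C/Cython function like the one in the example:

```python
import numpy as np

def sum_doubles(a):
    # A typed C/Cython function effectively performs this check on entry:
    # reject any dtype it was not written for, rather than misread memory.
    if a.dtype != np.float64:
        raise TypeError(f"expected float64, got {a.dtype}")
    return float(a.sum())

print(sum_doubles(np.zeros(10)))          # 0.0
# sum_doubles(np.zeros(3, dtype=int))     # raises TypeError, like the Cython case
```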

Second order issues:
- There is certainly C code out there that just assumes that it will
only be passed an array with certain dtype (and ndim, memory layout,
etc...). If you write such C code then it's your job to make sure that
you only pass it the kinds of arrays that it expects, just like now
:-).

- We may want to do some sort of special-casing of handling for
floating point NA dtypes that use an NaN as the magic bitpattern,
since many algorithms *will* work with these unchanged, and it might
be frustrating to have to wait for every extension module to be
updated just to allow for this case explicitly before using them. OTOH
you can easily work around this. Like say my_qr is a legacy C function
that will in fact propagate NaNs correctly, so float NA dtypes would
Just Work -- except, it errors out at the start because it doesn't
recognize the dtype. How annoying. We *could* have some special hack
you can use to force it to work anyway (by like making the "is this
the dtype I expect?" routine lie.) But you can also just do:

  def my_qr_wrapper(arr):
if arr.dtype is a NA float dtype with NaN magic value:
  result = my_qr(arr.view(arr.dtype.base_dtype))
  return result.view(arr.dtype)
else:
  return my_qr(arr)

and hey presto, now it will correctly pass through NAs. So perhaps
it's not worth bothering with special hacks.

- Of course if  your extension function does want to handle NAs
generically, then there will be a simple C api for checking for them,
setting them, etc. Numpy needs such an API internally anyway!
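Since the proposed NA dtypes never landed, the "legacy NaN-propagating function" case can only be sketched with plain floats; legacy_colsum below is a hypothetical stand-in for my_qr:

```python
import math
import numpy as np

def legacy_colsum(a):
    # Hypothetical legacy routine: knows nothing about any NA machinery,
    # but IEEE NaN propagates through its arithmetic for free.
    assert a.dtype == np.float64, "legacy code: float64 only"
    return a.sum(axis=0)

out = legacy_colsum(np.array([[1.0, np.nan],
                              [2.0, 3.0]]))
# First column sums to 3.0; the NaN in the second column propagates unchanged.
print(out[0], math.isnan(out[1]))
```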

-- Nathaniel


Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-10 Thread Matthew Brett
Hi,

On Thu, May 10, 2012 at 2:43 AM, Nathaniel Smith n...@pobox.com wrote:
 Hi Matthew,

 On Thu, May 10, 2012 at 12:01 AM, Matthew Brett matthew.br...@gmail.com 
 wrote:
 The third proposal is certainly the best one from Cython's perspective;
 and I imagine for those writing C extensions against the C API too.
 Having PyType_Check fail for ndmasked is a very good way of having code
 fail that is not written to take masks into account.

 Mark, Nathaniel - can you comment how your chosen approaches would
 interact with extension code?

 I'm guessing the bitpattern dtypes would be expected to cause
 extension code to choke if the type is not supported?

 That's pretty much how I'm imagining it, yes. Right now if you have,
 say, a Cython function like

 cdef f(np.ndarray[double] a):
    ...

 and you do f(np.zeros(10, dtype=int)), then it will error out, because
 that function doesn't know how to handle ints, only doubles. The same
 would apply for, say, a NA-enabled integer. In general there are
 almost arbitrarily many dtypes that could get passed into any function
 (including user-defined ones, etc.), so C code already has to check
 dtypes for correctness.

 Second order issues:
 - There is certainly C code out there that just assumes that it will
 only be passed an array with certain dtype (and ndim, memory layout,
 etc...). If you write such C code then it's your job to make sure that
 you only pass it the kinds of arrays that it expects, just like now
 :-).

 - We may want to do some sort of special-casing of handling for
 floating point NA dtypes that use an NaN as the magic bitpattern,
 since many algorithms *will* work with these unchanged, and it might
 be frustrating to have to wait for every extension module to be
 updated just to allow for this case explicitly before using them. OTOH
 you can easily work around this. Like say my_qr is a legacy C function
 that will in fact propagate NaNs correctly, so float NA dtypes would
 Just Work -- except, it errors out at the start because it doesn't
 recognize the dtype. How annoying. We *could* have some special hack
 you can use to force it to work anyway (by like making the "is this
 the dtype I expect?" routine lie.) But you can also just do:

  def my_qr_wrapper(arr):
    if arr.dtype is a NA float dtype with NaN magic value:
      result = my_qr(arr.view(arr.dtype.base_dtype))
      return result.view(arr.dtype)
    else:
      return my_qr(arr)

 and hey presto, now it will correctly pass through NAs. So perhaps
 it's not worth bothering with special hacks.

 - Of course if  your extension function does want to handle NAs
 generically, then there will be a simple C api for checking for them,
 setting them, etc. Numpy needs such an API internally anyway!

Thanks for this.

Mark - in view of the discussions about Cython and extension code -
could you say what you see as disadvantages to the ndmasked subclass
proposal?

Cheers,

Matthew
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-10 Thread Travis Oliphant

On May 10, 2012, at 3:40 AM, Scott Sinclair wrote:

 On 9 May 2012 18:46, Travis Oliphant tra...@continuum.io wrote:
 The document is available here:
https://github.com/numpy/numpy.scipy.org/blob/master/NA-overview.rst
 
 This is orthogonal to the discussion, but I'm curious as to why this
 discussion document has landed in the website repo?
 
 I suppose it's not a really big deal, but future uploads of the
 website will now include a page at
 http://numpy.scipy.org/NA-overview.html with the content of this
 document. If that's desirable, I'll add a note at the top of the
 overview referencing this discussion thread. If not it can be
 relocated somewhere more desirable after this thread's discussion
 deadline expires.

Yes, it can be relocated.   Can you suggest where it should go?  It was added 
there so that Nathaniel and Mark could both edit it, with Nathaniel 
added to the web-team. 

It may not be a bad place for it, though.   At least for a while. 

-Travis


 
 Cheers,
 Scott

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-10 Thread Travis Oliphant

On May 10, 2012, at 12:21 AM, Charles R Harris wrote:

 
 
 On Wed, May 9, 2012 at 11:05 PM, Benjamin Root ben.r...@ou.edu wrote:
 
 
 On Wednesday, May 9, 2012, Nathaniel Smith wrote:
 
 
 My only objection to this proposal is that committing to this approach
 seems premature. The existing masked array objects act quite
 differently from numpy.ma, so why do you believe that they're a good
 foundation for numpy.ma, and why will users want to switch to their
 semantics over numpy.ma's semantics? These aren't rhetorical
 questions, it seems like they must have concrete answers, but I don't
 know what they are.
 
 Based on the design decisions made in the original NEP, a re-made numpy.ma 
 would have to lose _some_ features, particularly the ability to share masks. 
 Save for that and some very obscure behaviors that are undocumented, it is 
 possible to remake numpy.ma as a compatibility layer.
 
 That being said, I think that there are some fundamental questions that remain 
 a concern. If I recall, there were unresolved questions about behaviors 
 surrounding assignments to elements of a view.
 
 I see the project as broken down like this:
 1.) internal architecture (largely abi issues)
 2.) external architecture (hooks throughout numpy to utilize the new features 
 where possible such as where= argument)
 3.) getter/setter semantics
 4.) mathematical semantics
 
 At this moment, I think we have pieces of 2, and they are fairly 
 non-controversial. It is 1 that I see as the immediate hold-up here. 3 
 & 4 are non-trivial, but because they are mostly about interfaces, I think we 
 can be willing to accept some very basic, fundamental, barebones components 
 here in order to lay the groundwork for a more complete API later.
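For reference, the where= hook mentioned in point 2 did land as a ufunc keyword; a minimal illustration of mask-aware computation with it:

```python
import numpy as np

a = np.array([1.0, 4.0, 9.0])
valid = np.array([True, False, True])   # mask marking the usable entries

# Compute sqrt only where the mask is True; masked slots keep whatever
# the output array already held (here, NaN as a missing-value marker).
out = np.full_like(a, np.nan)
np.sqrt(a, out=out, where=valid)
# out is now [1.0, nan, 3.0]
```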
 
 To talk of Travis's proposal, doing nothing is a no-go. Not moving forward 
 would dishearten the community. Making an ndmasked type is very intriguing. I 
 see it as a step towards eventually deprecating ndarray? Also, how would it 
 behave with np.asarray() and np.asanyarray()? My other concern is a possible 
 violation of DRY. How difficult would it be to maintain two ndarrays in 
 parallel?  
 
 As for the flag approach, this still doesn't solve the problem of legacy code 
 (or did I misunderstand?)
 
 My understanding of the flag is to allow the code to stay in and get reworked 
 and experimented with while keeping it from contaminating conventional use.
 
 The whole point of putting the code in was to experiment and adjust. The 
 rather bizarre idea that it needs to be perfect from the get go is 
 disheartening, and is seldom how new things get developed. Sure, there is a 
 plan up front, but there needs to be feedback and change. And in fact, I 
 haven't seen much feedback about the actual code, I don't even know that the 
 people complaining have tried using it to see where it hurts. I'd like that 
 sort of feedback.
 

I don't think anyone is saying it needs to be perfect from the get-go. What 
I am saying is that this is fundamental enough to downstream users that this 
kind of thing is best done as a separate object. The flag could still be used 
to make all Python-level array constructors build ndmasked objects.  

But, this doesn't address the C-level story where there is quite a bit of 
downstream use where people have used the NumPy array as just a pointer to 
memory without considering that there might be a mask attached that should be 
inspected as well. 

The NEP addresses this a little bit for those C or C++ consumers of the ndarray 
in C who always use PyArray_FromAny which can fail if the array has non-NULL 
mask contents.   However, it is *not* true that all downstream users use 
PyArray_FromAny. 

A large number of users just use something like PyArray_Check and then 
PyArray_DATA to get the pointer to the data buffer, and then go from there 
thinking of their data as a strided memory chunk only (no extra mask). The 
NEP fundamentally changes this simple invariant that has been in NumPy, and 
Numeric before it, for a long, long time. 

I really don't see how we can do this in a 1.7 release. It has too many 
unknown, and I think unknowable, downstream effects. But, I think we could 
introduce another arrayobject that is the masked_array, with a Python-level flag 
that makes it the default array in Python. 

There are a few more subtleties. PyArray_Check by default will pass 
sub-classes, so if the new ndmasked array were a sub-class then it would be passed 
(just like current numpy.ma arrays and matrices would pass that check today). 
However, there is a PyArray_CheckExact macro which could be used to ensure the 
object was actually of PyArray_Type. There is also the PyArg_ParseTuple 
command with "O!" that I have seen used many times to ensure an exact NumPy 
array.  
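The subclass-versus-exact distinction described here is visible from Python as well, where isinstance plays the role of PyArray_Check and an exact type comparison plays the role of PyArray_CheckExact (the analogy is ours; the C macros are the real API):

```python
import numpy as np
import numpy.ma as ma

m = ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])

# PyArray_Check passes subclasses, as isinstance does:
print(isinstance(m, np.ndarray))   # True: MaskedArray subclasses ndarray

# PyArray_CheckExact requires PyArray_Type itself, like an exact type test:
print(type(m) is np.ndarray)       # False: the exact type is MaskedArray
```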

-Travis






 Chuck
 


[Numpy-discussion] Missing data wrap-up and request for comments

2012-05-09 Thread Travis Oliphant
Hey all, 

Nathaniel and Mark have worked very hard on a joint document to try and explain 
the current status of the missing-data debate.   I think they've done an 
amazing job at providing some context, articulating their views and suggesting 
ways forward in a mutually respectful manner.   This is an exemplary 
collaboration and is at the core of why open source is valuable. 

The document is available here: 
   https://github.com/numpy/numpy.scipy.org/blob/master/NA-overview.rst

After reading that document, it appears to me that there are some fundamentally 
different views on how things should move forward.   I'm also reading the 
document incorporating my understanding of the history, of NumPy as well as all 
of the users I've met and interacted with which means I have my own perspective 
that is not necessarily incorporated into that document but informs my 
recommendations. I'm not sure we can reach full consensus on this. We 
are also well past time for moving forward with a resolution on this (perhaps 
we can all agree on that).

I would like one more discussion thread where the technical discussion can take 
place. I will make a plea that we keep this discussion as free from logical 
fallacies (http://en.wikipedia.org/wiki/Logical_fallacy) as we can.   I can't 
guarantee that I personally will succeed at that, but I can tell you that I 
will try.   That's all I'm asking of anyone else. I recognize that there are 
a lot of other issues at play here besides *just* the technical questions, but 
we are not going to resolve every community issue in this technical thread. 

We need concrete proposals and so I will start with three.   Please feel free 
to comment on these proposals or add your own during the discussion. I will 
stop paying attention to this thread next Wednesday (May 16th) (or earlier if 
the thread dies) and hope that by that time we can agree on a way forward.  If 
we don't have agreement, then I will move forward with what I think is the 
right approach.   I will either write the code myself or convince someone else 
to write it. 

In all cases, we have agreement that bit-pattern dtypes should be added to 
NumPy.  We should work on these (int32, float64, complex64, str, bool) to 
start. So, the three proposals are independent of this way forward.   The 
proposals are all about the extra mask part:  

My three proposals: 

* do nothing and leave things as is 

* add a global flag that turns off masked array support by default but 
otherwise leaves things unchanged (I'm still unclear how this would work 
exactly)

* move Mark's masked ndarray objects into a new fundamental type 
(ndmasked), leaving the actual ndarray type unchanged.  The array_interface 
keeps the masked array notions and the ufuncs keep the ability to handle arrays 
like ndmasked.Ideally, numpy.ma would be changed to use ndmasked objects as 
their core. 

For the record, I'm currently in favor of the third proposal.   Feel free to 
comment on these proposals (or provide your own). 
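For a sense of what the agreed-on bit-pattern dtypes involve, float64 NA can be sketched the way R does it: a NaN with a distinguished payload. The pattern below is illustrative only (a quiet NaN carrying R's 1954 marker), not a committed NumPy choice:

```python
import struct

# A quiet NaN with payload 1954 (0x7A2), echoing R's NA_real_; this
# exact bit pattern is an illustration, not a NumPy commitment.
NA_BITS = 0x7FF80000000007A2

def make_na():
    return struct.unpack("<d", struct.pack("<Q", NA_BITS))[0]

def is_na(x):
    # NA and an ordinary NaN differ only in the NaN payload bits.
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return bits == NA_BITS

na = make_na()
print(na != na)              # True: NA still behaves as a NaN arithmetically
print(is_na(na))             # True
print(is_na(float("nan")))   # False: a plain NaN is not NA
```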

Best regards,

-Travis

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-09 Thread Charles R Harris
On Wed, May 9, 2012 at 10:46 AM, Travis Oliphant tra...@continuum.io wrote:

 Hey all,

 Nathaniel and Mark have worked very hard on a joint document to try and
 explain the current status of the missing-data debate.   I think they've
 done an amazing job at providing some context, articulating their views and
 suggesting ways forward in a mutually respectful manner.   This is an
 exemplary collaboration and is at the core of why open source is valuable.

 The document is available here:
https://github.com/numpy/numpy.scipy.org/blob/master/NA-overview.rst

 After reading that document, it appears to me that there are some
 fundamentally different views on how things should move forward.   I'm also
 reading the document incorporating my understanding of the history, of
 NumPy as well as all of the users I've met and interacted with which means
 I have my own perspective that is not necessarily incorporated into that
 document but informs my recommendations.I'm not sure we can reach full
 consensus on this. We are also well past time for moving forward with a
 resolution on this (perhaps we can all agree on that).

 I would like one more discussion thread where the technical discussion can
 take place.I will make a plea that we keep this discussion as free from
  logical fallacies (http://en.wikipedia.org/wiki/Logical_fallacy) as we can.
   I can't guarantee that I personally will succeed at that, but I can tell
 you that I will try.   That's all I'm asking of anyone else.I recognize
 that there are a lot of other issues at play here besides *just* the
 technical questions, but we are not going to resolve every community issue
 in this technical thread.

 We need concrete proposals and so I will start with three.   Please feel
 free to comment on these proposals or add your own during the discussion.
  I will stop paying attention to this thread next Wednesday (May 16th) (or
 earlier if the thread dies) and hope that by that time we can agree on a
 way forward.  If we don't have agreement, then I will move forward with
 what I think is the right approach.   I will either write the code myself
 or convince someone else to write it.

 In all cases, we have agreement that bit-pattern dtypes should be added to
 NumPy.  We should work on these (int32, float64, complex64, str, bool)
 to start.So, the three proposals are independent of this way forward.
 The proposals are all about the extra mask part:

 My three proposals:

 * do nothing and leave things as is

 * add a global flag that turns off masked array support by default but
 otherwise leaves things unchanged (I'm still unclear how this would work
 exactly)

 * move Mark's masked ndarray objects into a new fundamental type
 (ndmasked), leaving the actual ndarray type unchanged.  The array_interface
 keeps the masked array notions and the ufuncs keep the ability to handle
 arrays like ndmasked.Ideally, numpy.ma would be changed to use
 ndmasked objects as their core.


numpy.ma is unmaintained and I don't see that changing anytime soon. As
you know, I would prefer 1), but 2) is a good compromise, and the
infrastructure for such a flag could be useful for other things, although like
yourself I'm not sure how it would be implemented. I don't understand your
proposal for 3), but from the description I don't see that it buys anything.


 For the record, I'm currently in favor of the third proposal.   Feel free
 to comment on these proposals (or provide your own).


Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-09 Thread Mark Wiebe
On Wed, May 9, 2012 at 11:46 AM, Travis Oliphant tra...@continuum.io wrote:

 Hey all,

 Nathaniel and Mark have worked very hard on a joint document to try and
 explain the current status of the missing-data debate.   I think they've
 done an amazing job at providing some context, articulating their views and
 suggesting ways forward in a mutually respectful manner.   This is an
 exemplary collaboration and is at the core of why open source is valuable.

 The document is available here:
https://github.com/numpy/numpy.scipy.org/blob/master/NA-overview.rst

 After reading that document, it appears to me that there are some
 fundamentally different views on how things should move forward.   I'm also
 reading the document incorporating my understanding of the history, of
 NumPy as well as all of the users I've met and interacted with which means
 I have my own perspective that is not necessarily incorporated into that
 document but informs my recommendations.I'm not sure we can reach full
 consensus on this. We are also well past time for moving forward with a
 resolution on this (perhaps we can all agree on that).

 I would like one more discussion thread where the technical discussion can
 take place.I will make a plea that we keep this discussion as free from
  logical fallacies (http://en.wikipedia.org/wiki/Logical_fallacy) as we can.
   I can't guarantee that I personally will succeed at that, but I can tell
 you that I will try.   That's all I'm asking of anyone else.I recognize
 that there are a lot of other issues at play here besides *just* the
 technical questions, but we are not going to resolve every community issue
 in this technical thread.

 We need concrete proposals and so I will start with three.   Please feel
 free to comment on these proposals or add your own during the discussion.
  I will stop paying attention to this thread next Wednesday (May 16th) (or
 earlier if the thread dies) and hope that by that time we can agree on a
 way forward.  If we don't have agreement, then I will move forward with
 what I think is the right approach.   I will either write the code myself
 or convince someone else to write it.

 In all cases, we have agreement that bit-pattern dtypes should be added to
 NumPy.  We should work on these (int32, float64, complex64, str, bool)
 to start.So, the three proposals are independent of this way forward.
 The proposals are all about the extra mask part:

 My three proposals:

 * do nothing and leave things as is

 * add a global flag that turns off masked array support by default but
 otherwise leaves things unchanged (I'm still unclear how this would work
 exactly)

 * move Mark's masked ndarray objects into a new fundamental type
 (ndmasked), leaving the actual ndarray type unchanged.  The array_interface
 keeps the masked array notions and the ufuncs keep the ability to handle
 arrays like ndmasked.Ideally, numpy.ma would be changed to use
 ndmasked objects as their core.

 For the record, I'm currently in favor of the third proposal.   Feel free
 to comment on these proposals (or provide your own).


I'm most in favour of the second proposal. It won't take very much effort,
and more clearly marks off this code as experimental than just
documentation notes.

Thanks,
-Mark



 Best regards,

 -Travis




___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-09 Thread Travis Oliphant

On May 9, 2012, at 2:07 PM, Mark Wiebe wrote:

 On Wed, May 9, 2012 at 11:46 AM, Travis Oliphant tra...@continuum.io wrote:
 Hey all, 
 
 Nathaniel and Mark have worked very hard on a joint document to try and 
 explain the current status of the missing-data debate.   I think they've done 
 an amazing job at providing some context, articulating their views and 
 suggesting ways forward in a mutually respectful manner.   This is an 
 exemplary collaboration and is at the core of why open source is valuable. 
 
 The document is available here: 
https://github.com/numpy/numpy.scipy.org/blob/master/NA-overview.rst
 
 After reading that document, it appears to me that there are some 
 fundamentally different views on how things should move forward.   I'm also 
 reading the document incorporating my understanding of the history, of NumPy 
 as well as all of the users I've met and interacted with which means I have 
 my own perspective that is not necessarily incorporated into that document 
 but informs my recommendations.I'm not sure we can reach full consensus 
 on this. We are also well past time for moving forward with a resolution 
 on this (perhaps we can all agree on that). 
 
 I would like one more discussion thread where the technical discussion can 
 take place.I will make a plea that we keep this discussion as free from 
 logical fallacies (http://en.wikipedia.org/wiki/Logical_fallacy) as we can.   I 
 can't guarantee that I personally will succeed at that, but I can tell you 
 that I will try.   That's all I'm asking of anyone else.I recognize that 
 there are a lot of other issues at play here besides *just* the technical 
 questions, but we are not going to resolve every community issue in this 
 technical thread. 
 
 We need concrete proposals and so I will start with three.   Please feel free 
 to comment on these proposals or add your own during the discussion.I 
 will stop paying attention to this thread next Wednesday (May 16th) (or 
 earlier if the thread dies) and hope that by that time we can agree on a way 
 forward.  If we don't have agreement, then I will move forward with what I 
 think is the right approach.   I will either write the code myself or 
 convince someone else to write it. 
 
 In all cases, we have agreement that bit-pattern dtypes should be added to 
 NumPy.  We should work on these (int32, float64, complex64, str, bool) to 
 start.So, the three proposals are independent of this way forward.   The 
 proposals are all about the extra mask part:  
 
 My three proposals: 
 
   * do nothing and leave things as is 
 
   * add a global flag that turns off masked array support by default but 
 otherwise leaves things unchanged (I'm still unclear how this would work 
 exactly)
 
   * move Mark's masked ndarray objects into a new fundamental type 
 (ndmasked), leaving the actual ndarray type unchanged.  The array_interface 
 keeps the masked array notions and the ufuncs keep the ability to handle 
 arrays like ndmasked.Ideally, numpy.ma would be changed to use ndmasked 
 objects as their core. 
 
 For the record, I'm currently in favor of the third proposal.   Feel free to 
 comment on these proposals (or provide your own).
 
 I'm most in favour of the second proposal. It won't take very much effort, 
 and more clearly marks off this code as experimental than just documentation 
 notes.
 

Mark, will you give more details about this proposal? How would the flag 
work, and what would it modify? 

The proposal to create a ndmasked object that is separate from ndarray objects 
also won't take much effort and also marks off the object so those who want to 
use it can and those who don't are not pushed into using it anyway. 

-Travis


 Thanks,
 -Mark
  
 
 Best regards,
 
 -Travis
 
 

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-09 Thread Mark Wiebe
On Wed, May 9, 2012 at 2:15 PM, Travis Oliphant tra...@continuum.io wrote:


 On May 9, 2012, at 2:07 PM, Mark Wiebe wrote:

 On Wed, May 9, 2012 at 11:46 AM, Travis Oliphant tra...@continuum.io wrote:

 Hey all,

 Nathaniel and Mark have worked very hard on a joint document to try and
 explain the current status of the missing-data debate.   I think they've
 done an amazing job at providing some context, articulating their views and
 suggesting ways forward in a mutually respectful manner.   This is an
 exemplary collaboration and is at the core of why open source is valuable.

 The document is available here:
https://github.com/numpy/numpy.scipy.org/blob/master/NA-overview.rst

 After reading that document, it appears to me that there are some
 fundamentally different views on how things should move forward.   I'm also
 reading the document incorporating my understanding of the history, of
 NumPy as well as all of the users I've met and interacted with which means
 I have my own perspective that is not necessarily incorporated into that
 document but informs my recommendations.I'm not sure we can reach full
 consensus on this. We are also well past time for moving forward with a
 resolution on this (perhaps we can all agree on that).

 I would like one more discussion thread where the technical discussion
 can take place.I will make a plea that we keep this discussion as free
 from logical fallacies (http://en.wikipedia.org/wiki/Logical_fallacy) as
 we can.   I can't guarantee that I personally will succeed at that, but I
 can tell you that I will try.   That's all I'm asking of anyone else.I
 recognize that there are a lot of other issues at play here besides *just*
 the technical questions, but we are not going to resolve every community
 issue in this technical thread.

 We need concrete proposals and so I will start with three.   Please feel
 free to comment on these proposals or add your own during the discussion.
  I will stop paying attention to this thread next Wednesday (May 16th) (or
 earlier if the thread dies) and hope that by that time we can agree on a
 way forward.  If we don't have agreement, then I will move forward with
 what I think is the right approach.   I will either write the code myself
 or convince someone else to write it.

 In all cases, we have agreement that bit-pattern dtypes should be added
 to NumPy.  We should work on these (int32, float64, complex64, str,
 bool) to start.So, the three proposals are independent of this way
 forward.   The proposals are all about the extra mask part:

 My three proposals:

 * do nothing and leave things as is

 * add a global flag that turns off masked array support by default but
 otherwise leaves things unchanged (I'm still unclear how this would work
 exactly)

 * move Mark's masked ndarray objects into a new fundamental type
 (ndmasked), leaving the actual ndarray type unchanged.  The array_interface
 keeps the masked array notions and the ufuncs keep the ability to handle
 arrays like ndmasked.Ideally, numpy.ma would be changed to use
 ndmasked objects as their core.

 For the record, I'm currently in favor of the third proposal.   Feel free
 to comment on these proposals (or provide your own).


 I'm most in favour of the second proposal. It won't take very much effort,
 and more clearly marks off this code as experimental than just
 documentation notes.


 Mark, will you give more details about this proposal? How would the flag
 work, and what would it modify?


The idea is inspired in part by the Chrome release cycle, which has a
presentation here:

https://docs.google.com/present/view?id=dg63dpc6_4d7vkk6chpli=1

Some quotes:

"Features should be engineered so that they can be disabled easily (1 patch)"

and

"Would large feature development still be possible?

Yes, engineers would have to work behind flags; however, they can work for
as many releases as they need to and can remove the flag when they are
done."


The current numpy codebase isn't designed for this kind of workflow, but I
think we can productively emulate the idea for a big feature like NA
support.

One way to do this flag would be to have a numpy.experimental namespace
which is not imported by default. To enable the NA-mask feature, you could
do:

>>> import numpy.experimental.maskna

This would trigger an ExperimentalWarning to message that an experimental
feature has been enabled, and would add any NA-specific symbols to the
numpy namespace (NA, NAType, etc). Without this import, any operation which
would create an NA or NA-masked array raises an ExperimentalError instead
of succeeding. After this import, things would behave as they do now.

Cheers,
Mark

The proposal to create a ndmasked object that is separate from ndarray
 objects also won't take much effort and also marks off the object so those
 who want to use it can and those who don't are not pushed into using it
 anyway.

 -Travis


 Thanks,
 -Mark



 Best regards,

 -Travis


 

Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-09 Thread Travis Oliphant
 My three proposals: 
 
   * do nothing and leave things as is 
 
   * add a global flag that turns off masked array support by default but 
 otherwise leaves things unchanged (I'm still unclear how this would work 
 exactly)
 
   * move Mark's masked ndarray objects into a new fundamental type 
 (ndmasked), leaving the actual ndarray type unchanged.  The array_interface 
 keeps the masked array notions and the ufuncs keep the ability to handle 
 arrays like ndmasked.Ideally, numpy.ma would be changed to use ndmasked 
 objects as their core. 
 
 
 The numpy.ma is unmaintained and I don't see that changing anytime soon. As 
 you know, I would prefer 1), but 2) is a good compromise and the infra 
 structure for such a flag could be useful for other things, although like 
 yourself I'm not sure how it would be implemented. I don't understand your 
 proposal for 3), but from the description I don't see that it buys anything.

That is a bit strong, to call numpy.ma unmaintained. I don't consider it that 
way. Are there a lot of tickets for it that are unaddressed? Is it broken? 
I know it gets a lot of use in the wild, and so I don't think NumPy users 
would be happy to hear it is considered unmaintained by NumPy developers. 

I'm looking forward to more details of Mark's proposal for #2. 

The proposal for #3 is quite simple and I think it is also a good compromise 
between removing the masked array entirely from the core NumPy object and 
leaving things as is in master.  It keeps the functionality (but in a separate 
object) much like numpy.ma is a separate object.   Basically it buys not 
forcing *all* NumPy users (on the C-API level) to now deal with a masked array. 
   I know this push is a feature that is part of Mark's intention (as it pushes 
downstream libraries to think about missing data at a fundamental level).
But, I think this is too big of a change to put in a 1.X release.   The 
internal array-model used by NumPy is used quite extensively in downstream 
libraries as a *concept*.  Many people have enhanced this model with a separate 
mask array for various reasons, and Mark's current use of mask does not satisfy 
all those use-cases.   I don't see how we can justify changing the NumPy 1.X 
memory model under these circumstances. 

This is the sort of change that in my mind is a NumPy 2.0 kind of change where 
downstream users will be looking for possible array-model changes.  

-Travis





  
 For the record, I'm currently in favor of the third proposal.   Feel free to 
 comment on these proposals (or provide your own). 
 
 
 Chuck 

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-09 Thread Travis Oliphant
 Mark, will you give more details about this proposal? How would the flag 
 work, and what would it modify?
 
 The idea is inspired in part by the Chrome release cycle, which has a 
 presentation here:
 
 https://docs.google.com/present/view?id=dg63dpc6_4d7vkk6chpli=1
 
 Some quotes:
 "Features should be engineered so that they can be disabled easily (1 patch)"
 and
 "Would large feature development still be possible?
 
 Yes, engineers would have to work behind flags; however, they can work for as 
 many releases as they need to and can remove the flag when they are done."
 
 The current numpy codebase isn't designed for this kind of workflow, but I 
 think we can productively emulate the idea for a big feature like NA support.
 
 One way to do this flag would be to have a numpy.experimental namespace 
 which is not imported by default. To enable the NA-mask feature, you could do:
 
  >>> import numpy.experimental.maskna
 
 This would trigger an ExperimentalWarning to message that an experimental 
 feature has been enabled, and would add any NA-specific symbols to the numpy 
 namespace (NA, NAType, etc). Without this import, any operation which would 
 create an NA or NA-masked array raises an ExperimentalError instead of 
 succeeding. After this import, things would behave as they do now.

How would this flag work at the C-API level? 

-Travis


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-09 Thread Dag Sverre Seljebotn
On 05/09/2012 06:46 PM, Travis Oliphant wrote:
 Hey all,

 Nathaniel and Mark have worked very hard on a joint document to try and
 explain the current status of the missing-data debate. I think they've
 done an amazing job at providing some context, articulating their views
 and suggesting ways forward in a mutually respectful manner. This is an
 exemplary collaboration and is at the core of why open source is valuable.

 The document is available here:
 https://github.com/numpy/numpy.scipy.org/blob/master/NA-overview.rst

 After reading that document, it appears to me that there are some
 fundamentally different views on how things should move forward. I'm
 also reading the document incorporating my understanding of the history,
 of NumPy as well as all of the users I've met and interacted with which
 means I have my own perspective that is not necessarily incorporated
 into that document but informs my recommendations. I'm not sure we can
 reach full consensus on this. We are also well past time for moving
 forward with a resolution on this (perhaps we can all agree on that).

 I would like one more discussion thread where the technical discussion
 can take place. I will make a plea that we keep this discussion as free
 from logical fallacies http://en.wikipedia.org/wiki/Logical_fallacy as
 we can. I can't guarantee that I personally will succeed at that, but I
 can tell you that I will try. That's all I'm asking of anyone else. I
 recognize that there are a lot of other issues at play here besides
 *just* the technical questions, but we are not going to resolve every
 community issue in this technical thread.

 We need concrete proposals and so I will start with three. Please feel
 free to comment on these proposals or add your own during the
 discussion. I will stop paying attention to this thread next Wednesday
 (May 16th) (or earlier if the thread dies) and hope that by that time we
 can agree on a way forward. If we don't have agreement, then I will move
 forward with what I think is the right approach. I will either write the
 code myself or convince someone else to write it.

 In all cases, we have agreement that bit-pattern dtypes should be added
 to NumPy. We should work on these (int32, float64, complex64, str, bool)
 to start. So, the three proposals are independent of this way forward.
 The proposals are all about the extra mask part:

 My three proposals:

 * do nothing and leave things as is

 * add a global flag that turns off masked array support by default but
 otherwise leaves things unchanged (I'm still unclear how this would work
 exactly)

 * move Mark's masked ndarray objects into a new fundamental type
 (ndmasked), leaving the actual ndarray type unchanged. The
 array_interface keeps the masked array notions and the ufuncs keep the
 ability to handle arrays like ndmasked. Ideally, numpy.ma would be
 changed to use ndmasked objects as their core.

 For the record, I'm currently in favor of the third proposal. Feel free
 to comment on these proposals (or provide your own).


Bravo! NA-overview.rst was an excellent read. Thanks Nathaniel and Mark!

The third proposal is certainly the best one from Cython's perspective; 
and I imagine for those writing C extensions against the C API too. 
Having PyType_Check fail for ndmasked is a very good way of having code 
fail that is not written to take masks into account.

If it is in ndarray we would also have some pressure to add support in 
Cython, with ndmasked we avoid that too. Likely outcome is we won't ever 
support it either way, but then we need some big warning in the docs, 
and it's better to avoid that. (I guess be +0 on Mark Florisson 
implementing it if it ends up in core ndarray; I'd almost certainly not 
do it myself.)

That covers Cython. My view as a NumPy user follows.

I'm a heavy user of masks, which are used to make data NA in the 
statistical sense. The setting is that we have to mask out the radiation 
coming from the Milky Way in full-sky images of the Cosmic Microwave 
Background. There's data, but we know we can't trust it, so we make it 
NA. But we also do play around with different masks.

Today we keep the mask in a separate array, and to zero-mask we do

masked_data = data * mask

or

masked_data = data.copy()
masked_data[mask == 0] = np.nan # soon np.NA

depending on the circumstances.

Honestly, API-wise, this is as good as it gets for us. Nice and 
transparent, no new semantics to learn in the special case of masks.
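For concreteness, the two idioms above look like this as a small self-contained NumPy session (NaN stands in for the then-proposed np.NA):

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0])
mask = np.array([1, 0, 1, 1])        # 1 = trusted sample, 0 = masked out

# Idiom 1: zero-masking by multiplication.
masked_zero = data * mask            # [1., 0., 3., 4.]

# Idiom 2: copy, then overwrite masked entries with NaN.
masked_nan = data.copy()
masked_nan[mask == 0] = np.nan       # [1., nan, 3., 4.]

# NaN-aware reductions then skip the masked entries.
total = np.nansum(masked_nan)        # 8.0
```

Downstream code needs no mask-specific API at all; it just sees ordinary arrays, which is the transparency being praised here.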

Now, this has performance issues: Lots of memory use, extra transfers 
over the memory bus.

BUT, NumPy has that problem all over the place, even for x + y + z! 
Solving it in the special case of masks, by making a new API, seems a 
bit myopic to me.

IMO, that's much better solved at the fundamental level. As an 
*illustration*:

with np.lazy:
 masked_data1 = data * mask1
 masked_data2 = data * (mask1 | mask2)
 masked_data3 = (x + y + z) * (mask1 & mask3)

This would 

Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-09 Thread Charles R Harris
On Wed, May 9, 2012 at 1:35 PM, Travis Oliphant tra...@continuum.io wrote:

 My three proposals:

 * do nothing and leave things as is

 * add a global flag that turns off masked array support by default but
 otherwise leaves things unchanged (I'm still unclear how this would work
 exactly)

 * move Mark's masked ndarray objects into a new fundamental type
 (ndmasked), leaving the actual ndarray type unchanged.  The array_interface
 keeps the masked array notions and the ufuncs keep the ability to handle
 arrays like ndmasked. Ideally, numpy.ma would be changed to use
 ndmasked objects as their core.


 The numpy.ma is unmaintained and I don't see that changing anytime soon.
 As you know, I would prefer 1), but 2) is a good compromise and the infra
 structure for such a flag could be useful for other things, although like
 yourself I'm not sure how it would be implemented. I don't understand your
 proposal for 3), but from the description I don't see that it buys anything.


 That is a bit strong to call numpy.ma unmaintained. I don't consider
 it that way. Are there a lot of tickets for it that are unaddressed?
 Is it broken? I know it gets a lot of use in the wild and so I don't
 think NumPy users would be happy to hear it is considered unmaintained by
 NumPy developers.

 I'm looking forward to more details of Mark's proposal for #2.

 The proposal for #3 is quite simple and I think it is also a good
 compromise between removing the masked array entirely from the core NumPy
 object and leaving things as is in master.  It keeps the functionality (but
 in a separate object) much like numpy.ma is a separate object.
   Basically it buys not forcing *all* NumPy users (on the C-API level) to
 now deal with a masked array.


To me, it looks like we will get stuck with a more complicated
implementation without changing the API, something that 2) achieves more
easily while providing a feature likely to be useful as we head towards 2.0.


 I know this push is a feature that is part of Mark's intention (as it
 pushes downstream libraries to think about missing data at a fundamental
 level).But, I think this is too big of a change to put in a 1.X
 release.   The internal array-model used by NumPy is used quite extensively
 in downstream libraries as a *concept*.  Many people have enhanced this
 model with a separate mask array for various reasons, and Mark's current
 use of mask does not satisfy all those use-cases.   I don't see how we can
 justify changing the NumPy 1.X memory model under these circumstances.


You keep referring to these ghostly people and their unspecified uses, no
doubt to protect the guilty. You don't have to name names, but a little
detail on what they have done and how they use things would be *very*
helpful.


 This is the sort of change that in my mind is a NumPy 2.0 kind of change
 where downstream users will be looking for possible array-model changes.


We tried the flag day approach to 2.0 already and it failed. I think it
better to have a long term release and a series of releases thereafter
moving step by step with incremental changes towards a 2.0. Mark's 2) would
support that approach.

snip

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-09 Thread Travis Oliphant
On re-reading, I want to make a couple of things clear:   

1) This wrap-up discussion is *only* for what to do for NumPy 1.7 in 
such a way that we don't tie our hands in the future. I do not believe we 
can figure out what to do for masked arrays in one short week. What happens 
beyond NumPy 1.7 should be still discussed and explored. My urgency is 
entirely about moving forward from where we are in master right now in a 
direction that we can all accept. The tight timeline is so that we do 
*something* and move forward.

2) I missed another possible proposal for NumPy 1.7 which is in the 
write-up that Mark and Nathaniel made:  remove the masked array additions 
entirely possibly moving them to another module like numpy-dtypes.

Again, these are only for NumPy 1.7.   What happens in any future NumPy and 
beyond will depend on who comes to the table for both discussion and 
code-development. 

Best regards,

-Travis



On May 9, 2012, at 11:46 AM, Travis Oliphant wrote:

 Hey all, 
 
 Nathaniel and Mark have worked very hard on a joint document to try and 
 explain the current status of the missing-data debate.   I think they've done 
 an amazing job at providing some context, articulating their views and 
 suggesting ways forward in a mutually respectful manner.   This is an 
 exemplary collaboration and is at the core of why open source is valuable. 
 
 The document is available here: 
https://github.com/numpy/numpy.scipy.org/blob/master/NA-overview.rst
 
 After reading that document, it appears to me that there are some 
 fundamentally different views on how things should move forward.   I'm also 
 reading the document incorporating my understanding of the history, of NumPy 
 as well as all of the users I've met and interacted with which means I have 
 my own perspective that is not necessarily incorporated into that document 
 but informs my recommendations.I'm not sure we can reach full consensus 
 on this. We are also well past time for moving forward with a resolution 
 on this (perhaps we can all agree on that). 
 
 I would like one more discussion thread where the technical discussion can 
 take place.I will make a plea that we keep this discussion as free from 
 logical fallacies http://en.wikipedia.org/wiki/Logical_fallacy as we can.   I 
 can't guarantee that I personally will succeed at that, but I can tell you 
 that I will try.   That's all I'm asking of anyone else.I recognize that 
 there are a lot of other issues at play here besides *just* the technical 
 questions, but we are not going to resolve every community issue in this 
 technical thread. 
 
 We need concrete proposals and so I will start with three.   Please feel free 
 to comment on these proposals or add your own during the discussion.I 
 will stop paying attention to this thread next Wednesday (May 16th) (or 
 earlier if the thread dies) and hope that by that time we can agree on a way 
 forward.  If we don't have agreement, then I will move forward with what I 
 think is the right approach.   I will either write the code myself or 
 convince someone else to write it. 
 
 In all cases, we have agreement that bit-pattern dtypes should be added to 
 NumPy.  We should work on these (int32, float64, complex64, str, bool) to 
 start.So, the three proposals are independent of this way forward.   The 
 proposals are all about the extra mask part:  
 
 My three proposals: 
 
   * do nothing and leave things as is 
 
   * add a global flag that turns off masked array support by default but 
 otherwise leaves things unchanged (I'm still unclear how this would work 
 exactly)
 
   * move Mark's masked ndarray objects into a new fundamental type 
 (ndmasked), leaving the actual ndarray type unchanged.  The array_interface 
 keeps the masked array notions and the ufuncs keep the ability to handle 
 arrays like ndmasked.Ideally, numpy.ma would be changed to use ndmasked 
 objects as their core. 
 
 For the record, I'm currently in favor of the third proposal.   Feel free to 
 comment on these proposals (or provide your own). 
 
 Best regards,
 
 -Travis
 

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-09 Thread Nathaniel Smith
On Wed, May 9, 2012 at 5:46 PM, Travis Oliphant tra...@continuum.io wrote:
 Hey all,

 Nathaniel and Mark have worked very hard on a joint document to try and
 explain the current status of the missing-data debate.   I think they've
 done an amazing job at providing some context, articulating their views and
 suggesting ways forward in a mutually respectful manner.   This is an
 exemplary collaboration and is at the core of why open source is valuable.

 The document is available here:
    https://github.com/numpy/numpy.scipy.org/blob/master/NA-overview.rst

 After reading that document, it appears to me that there are some
 fundamentally different views on how things should move forward.   I'm also
 reading the document incorporating my understanding of the history, of NumPy
 as well as all of the users I've met and interacted with which means I have
 my own perspective that is not necessarily incorporated into that document
 but informs my recommendations.    I'm not sure we can reach full consensus
 on this.     We are also well past time for moving forward with a resolution
 on this (perhaps we can all agree on that).

If we're talking about deciding what to do for the 1.7 release branch,
then I agree. Otherwise, I definitely don't. We really just don't
*know* what our users need with regards to mask-based storage versions
of missing data, so committing to something within a short time period
will just guarantee we have to re-do it all again later.

[Edit: I see that you've clarified this in a follow-up email -- great!]

 We need concrete proposals and so I will start with three.   Please feel
 free to comment on these proposals or add your own during the discussion.
  I will stop paying attention to this thread next Wednesday (May 16th) (or
 earlier if the thread dies) and hope that by that time we can agree on a way
 forward.  If we don't have agreement, then I will move forward with what I
 think is the right approach.   I will either write the code myself or
 convince someone else to write it.

Again, I'm assuming that what you mean here is that we can't and
shouldn't delay 1.7 indefinitely for this discussion to play out, so
you're proposing that we give ourselves a deadline of 1 week to decide
how to at least get the release unblocked. Let me know if I'm
misreading, though...

 In all cases, we have agreement that bit-pattern dtypes should be added to
 NumPy.      We should work on these (int32, float64, complex64, str, bool)
 to start.    So, the three proposals are independent of this way forward.
 The proposals are all about the extra mask part:

 My three proposals:

 * do nothing and leave things as is

In the context of 1.7, this seems like a non-starter at this point, at
least if we're going to move in the direction of making decisions by
consensus. It might well be that we'll decide that the current
NEP-like API is what we want (or that some compatible super-set is).
But (as described in more detail in the NA-overview document), I think
there are still serious questions to work out about how and whether a
masked-storage/NA-semantics API is something we want as part of the
ndarray object at all. And Ralf with his release-manager hat says that
he doesn't want to release the current API unless we can guarantee
that some version of it will continue to be supported. To me that
suggests that this is off the table for 1.7.

 * add a global flag that turns off masked array support by default but
 otherwise leaves things unchanged (I'm still unclear how this would work
 exactly)

I've been assuming something like a global variable, and some guards
added to all the top-level functions that take maskna= arguments, so
that it's impossible to construct an ndarray that has its maskna
flag set to True unless the flag has been toggled.

As I said in NA-overview, I'd be fine with this in principle, but only
if we're certain we're okay with the ABI consequences. And we should
be clear on the goal -- if we just want to let people play with the
API, then there are other options, such as my little experiment:
  https://github.com/njsmith/numpyNEP
(This is certainly less robust, but it works, and is probably a much
easier base for modifications to test alternative APIs.) If the goal
is just to keep the code in master, then that's fine too, though it
has both costs and benefits. (An example of a cost is that its
presence may complicate adding bitpattern NA support.)

 * move Mark's masked ndarray objects into a new fundamental type
 (ndmasked), leaving the actual ndarray type unchanged.  The array_interface
 keeps the masked array notions and the ufuncs keep the ability to handle
 arrays like ndmasked.    Ideally, numpy.ma would be changed to use ndmasked
 objects as their core.

If we're talking about 1.7, then what kind of status do you propose
these new objects would have in 1.7? Regular feature, totally
experimental, something else?

My only objection to this proposal is that committing to this approach

Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-09 Thread Matthew Brett
Hi,

On Wed, May 9, 2012 at 12:44 PM, Dag Sverre Seljebotn
d.s.seljeb...@astro.uio.no wrote:
 On 05/09/2012 06:46 PM, Travis Oliphant wrote:
 Hey all,

 Nathaniel and Mark have worked very hard on a joint document to try and
 explain the current status of the missing-data debate. I think they've
 done an amazing job at providing some context, articulating their views
 and suggesting ways forward in a mutually respectful manner. This is an
 exemplary collaboration and is at the core of why open source is valuable.

 The document is available here:
 https://github.com/numpy/numpy.scipy.org/blob/master/NA-overview.rst

 After reading that document, it appears to me that there are some
 fundamentally different views on how things should move forward. I'm
 also reading the document incorporating my understanding of the history,
 of NumPy as well as all of the users I've met and interacted with which
 means I have my own perspective that is not necessarily incorporated
 into that document but informs my recommendations. I'm not sure we can
 reach full consensus on this. We are also well past time for moving
 forward with a resolution on this (perhaps we can all agree on that).

 I would like one more discussion thread where the technical discussion
 can take place. I will make a plea that we keep this discussion as free
 from logical fallacies http://en.wikipedia.org/wiki/Logical_fallacy as
 we can. I can't guarantee that I personally will succeed at that, but I
 can tell you that I will try. That's all I'm asking of anyone else. I
 recognize that there are a lot of other issues at play here besides
 *just* the technical questions, but we are not going to resolve every
 community issue in this technical thread.

 We need concrete proposals and so I will start with three. Please feel
 free to comment on these proposals or add your own during the
 discussion. I will stop paying attention to this thread next Wednesday
 (May 16th) (or earlier if the thread dies) and hope that by that time we
 can agree on a way forward. If we don't have agreement, then I will move
 forward with what I think is the right approach. I will either write the
 code myself or convince someone else to write it.

 In all cases, we have agreement that bit-pattern dtypes should be added
 to NumPy. We should work on these (int32, float64, complex64, str, bool)
 to start. So, the three proposals are independent of this way forward.
 The proposals are all about the extra mask part:

 My three proposals:

 * do nothing and leave things as is

 * add a global flag that turns off masked array support by default but
 otherwise leaves things unchanged (I'm still unclear how this would work
 exactly)

 * move Mark's masked ndarray objects into a new fundamental type
 (ndmasked), leaving the actual ndarray type unchanged. The
 array_interface keeps the masked array notions and the ufuncs keep the
 ability to handle arrays like ndmasked. Ideally, numpy.ma would be
 changed to use ndmasked objects as their core.

 For the record, I'm currently in favor of the third proposal. Feel free
 to comment on these proposals (or provide your own).


 Bravo!, NA-overview.rst was an excellent read. Thanks Nathaniel and Mark!

Yes, it is very well written, my compliments to the chefs.

 The third proposal is certainly the best one from Cython's perspective;
 and I imagine for those writing C extensions against the C API too.
 Having PyType_Check fail for ndmasked is a very good way of having code
 fail that is not written to take masks into account.

Mark, Nathaniel - can you comment how your chosen approaches would
interact with extension code?

I'm guessing the bitpattern dtypes would be expected to cause
extension code to choke if the type is not supported?

Mark - in :

https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst#cython

- do I understand correctly that you think that Cython and other
extension writers should use the numpy API to access the data rather
than accessing it directly via the data pointer and strides?

Best,

Matthew
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-09 Thread Paul Ivanov
On Wed, May 9, 2012 at 3:12 PM, Travis Oliphant tra...@continuum.io wrote:

 On re-reading, I want to make a couple of things clear:

 1) This wrap-up discussion is *only* for what to do for NumPy 1.7 in
 such a way that we don't tie our hands in the future.I do not believe
 we can figure out what to do for masked arrays in one short week.   What
 happens beyond NumPy 1.7 should be still discussed and explored.My
 urgency is entirely about moving forward from where we are in master right
 now in a direction that we can all accept.  The tight timeline is so
 that we do *something* and move forward.

 2) I missed another possible proposal for NumPy 1.7 which is in the
 write-up that Mark and Nathaniel made:  remove the masked array additions
 entirely possibly moving them to another module like numpy-dtypes.

 Again, these are only for NumPy 1.7.   What happens in any future NumPy
 and beyond will depend on who comes to the table for both discussion and
 code-development.


I'm glad that this sentence made it into the write-up: "A project like
numpy requires developers to write code for advancement to occur, and
obstacles that impede the writing of code discourage existing developers
from contributing more, and potentially scare away developers who are
thinking about joining in." I agree, which is why I'm a little surprised
after reading the write-up that there's no deference to the alterNEP
(admittedly kludgy) implementation? One of the arguments made for the NEP
preliminary NA-mask implementation is that it has been extensively tested
against scipy and other third-party packages, and has been in master in a
stable state for a significant amount of time. It is my understanding that
the manner in which this implementation found its way into master was a
source of concern and contention. To me (and I don't know the level to
which this is a technically feasible) that's precisely the reason that BOTH
approaches be allowed to make their way into numpy with experimental
status. Otherwise, it seems that there is a sort of scaring away of
developers - seeing (from the sidelines) how much of a struggle it's been
for the alterNEP to find a nurturing environment as an experimental
alternative inside numpy. In my reading, the process and consensus threads
that have generated so many responses stem precisely from trying to have an
atmosphere where everyone is encouraged to join in. The alternatives
proposed so far (though I do understand it's only for 1.7) do not suggest
an appreciation for the gravity of the fallout from the neglect of the
alterNEP and the issues which sprang forth from that.

Importantly, I find a problem with how personal this document (and
discussion) is - I'd much prefer if we talk about technical things by a
descriptive name, not the person who thought of it. You'll note how I've
been referring to NEP and alterNEP above. One advantage of this is that
down the line, if either Mark or Nathaniel change their minds about their
current preferred way forward, it doesn't take the wind out of it with
something like "Even Paul changed his mind and now withdraws his support of
Paul's proposal." We should only focus on the technical merits of a given
approach, not how many commits have been made by the person proposing them
or what else they've done in their life: a good idea has value regardless
of who expresses it. In my fantasy world, with both approaches clearly
existing in an experimental sandbox inside numpy, folks who feel primary
attachments to either NEP or alterNEP would be willing to cross party lines
and pitch in toward making progress in both camps. That's the way we'll
find better solutions, by working together, instead of working in
opposition.

best,
-- 
Paul Ivanov
314 address only used for lists,  off-list direct email at:
http://pirsquared.org | GPG/PGP key id: 0x0F3E28F7
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-09 Thread Charles R Harris
On Wed, May 9, 2012 at 6:13 PM, Paul Ivanov pivanov...@gmail.com wrote:



 On Wed, May 9, 2012 at 3:12 PM, Travis Oliphant tra...@continuum.iowrote:

 On re-reading, I want to make a couple of things clear:

 1) This wrap-up discussion is *only* for what to do for NumPy 1.7 in
 such a way that we don't tie our hands in the future.I do not believe
 we can figure out what to do for masked arrays in one short week.   What
 happens beyond NumPy 1.7 should be still discussed and explored.My
 urgency is entirely about moving forward from where we are in master right
 now in a direction that we can all accept.  The tight timeline is so
 that we do *something* and move forward.

 2) I missed another possible proposal for NumPy 1.7 which is in the
 write-up that Mark and Nathaniel made:  remove the masked array additions
 entirely possibly moving them to another module like numpy-dtypes.

 Again, these are only for NumPy 1.7.   What happens in any future NumPy
 and beyond will depend on who comes to the table for both discussion and
 code-development.


 I'm glad that this sentence made it into the write-up: A project like
 numpy requires developers to write code for advancement to occur, and
 obstacles that impede the writing of code discourage existing developers
 from contributing more, and potentially scare away developers who are
 thinking about joining in. I agree, which is why I'm a little surprised
 after reading the write-up that there's no deference to the alterNEP
 (admittedly kludgy) implementation? One of the arguments made for the NEP
 preliminary NA-mask implementation is that it has been extensively tested
 against scipy and other third-party packages, and has been in master in a
 stable state for a significant amount of time. It is my understanding that
 the manner in which this implementation found its way into master was a
 source of concern and contention. To me (and I don't know the level to
 which this is a technically feasible) that's precisely the reason that BOTH
 approaches be allowed to make their way into numpy with experimental
 status. Otherwise, it seems that there is a sort of scaring away of
 developers - seeing (from the sidelines) how much of a struggle it's been
 for the alterNEP to find a nurturing environment as an experimental
 alternative inside numpy. In my reading, the process and consensus threads
 that have generated so many responses stem precisely from trying to have an
 atmosphere where everyone is encouraged to join in. The alternatives
 proposed so far (though I do understand it's only for 1.7) do not suggest
 an appreciation for the gravity of the fallout from the neglect of the
 alterNEP and the issues which sprang forth from that.

 Importantly, I find a problem with how personal this document (and
 discussion) is - I'd much prefer if we talk about technical things by a
 descriptive name, not the person who thought of it. You'll note how I've
 been referring to NEP and alterNEP above. One advantage of this is that
 down the line, if either Mark or Nathaniel change their minds about their
 current preferred way forward, it doesn't take the wind out of it with
 something like Even Paul changed his mind and now withdraws his support of
 Paul's proposal. We should only focus on the technical merits of a given
 approach, not how many commits have been made by the person proposing them
 or what else they've done in their life: a good idea has value regardless
 of who expresses it. In my fantasy world, with both approaches clearly
 existing in an experimental sandbox inside numpy, folks who feel primary
 attachments to either NEP or alterNEP would be willing to cross party lines
 and pitch in toward making progress in both camps. That's the way we'll
 find better solutions, by working together, instead of working in
 opposition.


We are certainly open to code submissions and alternate implementations.
The experimental tag would help there. But someone, as you mention, needs
to write the code.

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-09 Thread Dag Sverre Seljebotn
On 05/10/2012 01:01 AM, Matthew Brett wrote:
 Hi,

 On Wed, May 9, 2012 at 12:44 PM, Dag Sverre Seljebotn
 d.s.seljeb...@astro.uio.no  wrote:
 On 05/09/2012 06:46 PM, Travis Oliphant wrote:
 Hey all,

 Nathaniel and Mark have worked very hard on a joint document to try and
 explain the current status of the missing-data debate. I think they've
 done an amazing job at providing some context, articulating their views
 and suggesting ways forward in a mutually respectful manner. This is an
 exemplary collaboration and is at the core of why open source is valuable.

 The document is available here:
 https://github.com/numpy/numpy.scipy.org/blob/master/NA-overview.rst

 After reading that document, it appears to me that there are some
 fundamentally different views on how things should move forward. I'm
 also reading the document incorporating my understanding of the history,
 of NumPy as well as all of the users I've met and interacted with which
 means I have my own perspective that is not necessarily incorporated
 into that document but informs my recommendations. I'm not sure we can
 reach full consensus on this. We are also well past time for moving
 forward with a resolution on this (perhaps we can all agree on that).

 I would like one more discussion thread where the technical discussion
 can take place. I will make a plea that we keep this discussion as free
 from logical fallacies http://en.wikipedia.org/wiki/Logical_fallacy as
 we can. I can't guarantee that I personally will succeed at that, but I
 can tell you that I will try. That's all I'm asking of anyone else. I
 recognize that there are a lot of other issues at play here besides
 *just* the technical questions, but we are not going to resolve every
 community issue in this technical thread.

 We need concrete proposals and so I will start with three. Please feel
 free to comment on these proposals or add your own during the
 discussion. I will stop paying attention to this thread next Wednesday
 (May 16th) (or earlier if the thread dies) and hope that by that time we
 can agree on a way forward. If we don't have agreement, then I will move
 forward with what I think is the right approach. I will either write the
 code myself or convince someone else to write it.

 In all cases, we have agreement that bit-pattern dtypes should be added
 to NumPy. We should work on these (int32, float64, complex64, str, bool)
 to start. So, the three proposals are independent of this way forward.
 The proposals are all about the extra mask part:

 My three proposals:

 * do nothing and leave things as is

 * add a global flag that turns off masked array support by default but
 otherwise leaves things unchanged (I'm still unclear how this would work
 exactly)

 * move Mark's masked ndarray objects into a new fundamental type
 (ndmasked), leaving the actual ndarray type unchanged. The
 array_interface keeps the masked array notions and the ufuncs keep the
 ability to handle arrays like ndmasked. Ideally, numpy.ma would be changed
 to use ndmasked objects as their core.

 For the record, I'm currently in favor of the third proposal. Feel free
 to comment on these proposals (or provide your own).


 Bravo! NA-overview.rst was an excellent read. Thanks Nathaniel and Mark!

 Yes, it is very well written, my compliments to the chefs.

 The third proposal is certainly the best one from Cython's perspective;
 and I imagine for those writing C extensions against the C API too.
 Having PyType_Check fail for ndmasked is a very good way of making code
 that is not written to take masks into account fail.

I want to make something more clear: There are two Cython cases; in the 
case of cdef np.ndarray[double] there is no problem as PEP 3118 access 
will raise an exception for masked arrays.

But, there's the case where you do cdef np.ndarray, and then proceed 
to use PyArray_DATA. Myself I do this more than PEP 3118 access; usually 
because I pass the data pointer to some C or C++ code.

It'd be great to have such code be forward-compatible in the sense that 
it raises an exception when it meets a masked array. Having PyType_Check 
fail seems like the only way? Am I wrong?
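A rough Python-level sketch of the guard being asked about; the ndmasked type is hypothetical, so the check below relies on the fact that any ndarray subclass (numpy.ma's MaskedArray stands in for it here) already fails a strict type test:

```python
import numpy as np

def process_buffer(arr):
    # Strict type check: a hypothetical ndmasked type -- like any
    # other ndarray subclass, e.g. numpy.ma's MaskedArray -- fails
    # here, so code that goes on to grab the raw data pointer can
    # never silently ignore a mask.
    if type(arr) is not np.ndarray:
        raise TypeError("need a plain ndarray, got %s" % type(arr).__name__)
    # ... hand arr's data pointer to C/C++ code here ...
    return arr.sum()

print(process_buffer(np.arange(4.0)))  # 6.0
try:
    process_buffer(np.ma.masked_array([1.0, 2.0], mask=[False, True]))
except TypeError as exc:
    print("rejected:", exc)
```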


 Mark, Nathaniel - can you comment how your chosen approaches would
 interact with extension code?

 I'm guessing the bitpattern dtypes would be expected to cause
 extension code to choke if the type is not supported?

The proposal, as I understand it, is to use that with new dtypes (?). So 
things will often be fine for that reason:

if arr.dtype == np.float32:
    c_function_32bit(np.PyArray_DATA(arr), ...)
else:
    raise ValueError("need 32-bit float array")



 Mark - in :

 https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst#cython

 - do I understand correctly that you think that Cython and other
 extension writers should use the numpy API to access the data rather
 than accessing it directly via the data pointer and strides?

That's not really fleshed out (for 

Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-09 Thread Charles R Harris
On Wed, May 9, 2012 at 11:05 PM, Benjamin Root ben.r...@ou.edu wrote:



 On Wednesday, May 9, 2012, Nathaniel Smith wrote:



 My only objection to this proposal is that committing to this approach
 seems premature. The existing masked array objects act quite
 differently from numpy.ma, so why do you believe that they're a good
 foundation for numpy.ma, and why will users want to switch to their
 semantics over numpy.ma's semantics? These aren't rhetorical
 questions, it seems like they must have concrete answers, but I don't
 know what they are.


 Based on the design decisions made in the original NEP, a re-made
 numpy.ma would have to lose _some_ features, particularly the ability to
 share masks. Save for that and some very obscure behaviors that are
 undocumented, it is possible to remake numpy.ma as a compatibility layer.

 That being said, I think that there are some fundamental questions that
 remain open. If I recall, there were unresolved questions about behaviors
 surrounding assignments to elements of a view.

 I see the project as broken down like this:
 1.) internal architecture (largely ABI issues)
 2.) external architecture (hooks throughout numpy to utilize the new
 features where possible such as where= argument)
 3.) getter/setter semantics
 4.) mathematical semantics

 At this moment, I think we have pieces of 2 and they are fairly
 non-controversial. It is 1 that I see as being the immediate hold-up here.
 3 & 4 are non-trivial, but because they are mostly about interfaces, I
 think we can be willing to accept some very basic, fundamental, barebones
 components here in order to lay the groundwork for a more complete API
 later.
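Point 2 above can be made concrete: the where= ufunc hook already lets an operation skip positions, which is a minimal model (not the masked-array machinery itself) of how NA-aware operations plug into numpy:

```python
import numpy as np

a = np.array([1.0, 4.0, 9.0])
valid = np.array([True, False, True])  # False marks an "ignored" slot
out = np.zeros_like(a)

# where= makes the ufunc skip the False position entirely; the
# skipped slot keeps whatever was already in `out`.
np.sqrt(a, out=out, where=valid)
print(out)  # [1. 0. 3.]
```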

 To talk of Travis's proposal, doing nothing is a no-go. Not moving forward
 would dishearten the community. Making an ndmasked type is very intriguing.
 I see it as a step towards eventually deprecating ndarray? Also, how would
 it behave with np.asarray() and np.asanyarray()? My other concern is a
 possible violation of DRY. How difficult would it be to maintain two
 ndarrays in parallel?

 As for the flag approach, this still doesn't solve the problem of legacy
 code (or did I misunderstand?)


My understanding of the flag is to allow the code to stay in and get
reworked and experimented with while keeping it from contaminating
conventional use.

The whole point of putting the code in was to experiment and adjust. The
rather bizarre idea that it needs to be perfect from the get go is
disheartening, and is seldom how new things get developed. Sure, there is a
plan up front, but there needs to be feedback and change. And in fact, I
haven't seen much feedback about the actual code, I don't even know that
the people complaining have tried using it to see where it hurts. I'd like
that sort of feedback.

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing data again

2012-03-15 Thread Nathaniel Smith
Hi Chuck,

I think I let my frustration get the better of me, and the message
below is too confrontational. I apologize.

I truly would like to understand where you're coming from on this,
though, so I'll try to make this more productive. My summary of points
that no-one has disagreed with yet is here:
  https://github.com/njsmith/numpy/wiki/NA-discussion-status
Of course, this means that there's lots that's left out. Instead of
getting into all those contentious details, I'll stick to just a few
basic questions that might let us get at least a bit of common
ground:
1) Do you disagree with anything that is stated there?
2) Do you feel like that document accurately summarises your basic
idea of what this feature is supposed to do (I assume under the
"IGNORED" heading)?

Thanks,
-- Nathaniel

On Wed, Mar 7, 2012 at 11:10 PM, Nathaniel Smith n...@pobox.com wrote:
 On Wed, Mar 7, 2012 at 7:37 PM, Charles R Harris
 charlesr.har...@gmail.com wrote:


 On Wed, Mar 7, 2012 at 12:26 PM, Nathaniel Smith n...@pobox.com wrote:
 When it comes to missing data, bitpatterns can do everything that
 masks can do, are no more complicated to implement, and have better
 performance characteristics.


 Maybe for float, for other things, no. And we have lots of other things.

 It would be easier to discuss this if you'd, like, discuss :-(. If you
 know of some advantage that masks have over bitpatterns when it comes
 to missing data, can you please share it, instead of just asserting
 it?

 Not that I'm immune... I perhaps should have been more explicit
 myself, when I said performance characteristics, let me clarify that
 I was thinking of both speed (for floats) and memory (for
 most-but-not-all things).

 The
 performance is a strawman,

 How many users need to speak up to say that this is a serious problem
 they have with the current implementation before you stop calling it a
 strawman? Because when Wes says that it's not going to fly for his
 stats/econometics cases, and the neuroimaging folk like Gary and Matt
 say it's not going to fly for their use cases... surely just waving
 that away is a bit dismissive?

 I'm not saying that we *have* to implement bitpatterns because
 performance is *the most important feature* -- I'm just saying, well,
 what I said. For *missing data use* cases, bitpatterns have better
 performance characteristics than masks. If we decide that these use
 cases are important, then we should take this into account and weigh
 it against other considerations. Maybe what you think is that these
 use cases shouldn't be the focus of this feature and it should focus
 on the ignored use cases instead? That would be a legitimate
 argument... but if that's what you want to say, say it, don't just
 dismiss your users!

 and it *isn't* easier to implement.

 If I thought bitpatterns would be easier to implement, I would have
 said so... What I said was that they're not harder. You have some
 extra complexity, mostly in casting, and some reduced complexity -- no
 need to allocate and manipulate the mask. (E.g., simple same-type
 assignments and slicing require special casing for masks, but not for
 bitpatterns.) In many places the complexity is identical -- printing
 routines need to check for either special bitpatterns or masked
 values, whatever. Ufunc loops need to either find the appropriate part
 of the mask, or create a temporary mask buffer by calling a dtype
 func, whatever. On net they seem about equivalent, complexity-wise.

 ...I assume you disagree with this analysis, since I've said it
 before, wrote up a sketch for how the implementation would work at the
 C level, etc., and you continue to claim that simplicity is a
 compelling advantage for the masked approach. But I still don't know
 why you think that :-(.

  Also, different folks adopt different values
  for 'missing' data, and distributing one or several masks along with the
  data is another common practice.

 True, but not really relevant to the current debate, because you have
 to handle such issues as part of your general data import workflow
 anyway, and none of these is any more complicated no matter which
 implementations are available.

  One inconvenience I have run into with the current API is that is should
  be
  easier to clear the mask from an ignored value without taking a new
  view
  or assigning known data. So maybe two types of masks (different
  payloads),
  or an additional flag could be helpful. The process of assigning masks
  could
  also be made a bit easier than using fancy indexing.

 So this, uh... this was actually the whole goal of the alterNEP
 design for masks -- making all this stuff easy for people (like you,
 apparently?) that want support for ignored values, separately from
 missing data, and want a nice clean API for it. Basically having a
 separate .mask attribute which was an ordinary, assignable array
 broadcastable to the attached array's shape. Nobody seemed interested
 in talking about it much then 

Re: [Numpy-discussion] Missing data again

2012-03-07 Thread Pierre Haessig
Hi,

Thanks you very much for your lights !

On 06/03/2012 21:59, Nathaniel Smith wrote:
 Right -- R has a very impoverished type system as compared to numpy.
 There's basically four types: numeric (meaning double precision
 float), integer, logical (boolean), and character (string). And
 in practice the integer type is essentially unused, because R parses
 numbers like 1 as being floating point, not integer; the only way to
 get an integer value is to explicitly cast to it. Each of these types
 has a specific bit-pattern set aside for representing NA. And...
 that's it. It's very simple when it works, but also very limited.
I also suspected R to be less powerful in terms of types.
However, I think the fact that "it's very simple when it works" is
important to take into account. At the end of the day, when using all
the fanciness it is not only about "can I have some NAs in my array?"
but also "how *easily* can I have some NAs in my array?". It's about
balancing the "how easy" and the "how powerful".

The ease of use is the reason for my concern about having separate
types nafloatNN and floatNN. Of course, I won't argue that "not
breaking everything" is even more important!!

Coming back to Travis' proposition "bit-pattern approaches to missing
data (*at least* for float64 and int32) need to be implemented", I
wonder what is the amount of extra work to go from nafloat64 to
nafloat32/16? Is there hardware support for NaN payloads with these
smaller floats? If not, or if it is too complicated, I feel it is
acceptable to say "it's too complicated" and fall back to masks. One may
have to choose between fancy types and fancy NAs...

Best,
Pierre





Re: [Numpy-discussion] Missing data again

2012-03-07 Thread Nathaniel Smith
On Wed, Mar 7, 2012 at 4:35 PM, Pierre Haessig pierre.haes...@crans.org wrote:
 Hi,

 Thanks you very much for your lights !

 On 06/03/2012 21:59, Nathaniel Smith wrote:
 Right -- R has a very impoverished type system as compared to numpy.
 There's basically four types: numeric (meaning double precision
 float), integer, logical (boolean), and character (string). And
 in practice the integer type is essentially unused, because R parses
 numbers like 1 as being floating point, not integer; the only way to
 get an integer value is to explicitly cast to it. Each of these types
 has a specific bit-pattern set aside for representing NA. And...
 that's it. It's very simple when it works, but also very limited.
 I also suspected R to be less powerful in terms of types.
 However, I think  the fact that It's very simple when it works is
 important to take into account. At the end of the day, when using all
 the fanciness it is not only about can I have some NAs in my array ?
 but also how *easily* can I have some NAs in my array ?. It's about
 balancing the how easy and the how powerful.

 The easyness-of-use is the reason of my concern about having separate
 types nafloatNN and floatNN. Of course, I won't argue that not
 breaking everything is even more important !!

It's a good point, I just don't see how we can really tell what the
trade-offs are at this point. You should bring this up again once more
of the big picture stuff is hammered out.

 Coming back to Travis proposition bit-pattern approaches to missing
 data (*at least* for float64 and int32) need to be implemented., I
 wonder what is the amount of extra work to go from nafloat64 to
 nafloat32/16 ? Is there an hardware support NaN payloads with these
 smaller floats ? If not, or if it is too complicated, I feel it is
 acceptable to say it's too complicated and fall back to mask. One may
 have to choose between fancy types and fancy NAs...

All modern floating point formats can represent NaNs with payloads, so
in principle there's no difficulty in supporting NA the same way for
all of them. If you're using float16 because you want to offload
computation to a GPU then I would test carefully before trusting the
GPU to handle NaNs correctly, and there may need to be a bit of care
to make sure that casts between these types properly map NAs to NAs,
but generally it should be fine.
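To make the payload idea concrete, here is a small sketch of stashing a distinguishing payload in a quiet NaN. The bit pattern below is invented for the example; it is not R's or anyone else's actual NA constant:

```python
import struct

import numpy as np

# A quiet NaN (exponent all ones, quiet bit set) carrying an
# arbitrary illustrative payload in its low bits.
NA_BITS = 0x7FF80000000007A2

na = struct.unpack("<d", struct.pack("<Q", NA_BITS))[0]
print(np.isnan(na))  # True: to the FPU it is just another NaN

# The payload survives a round trip through memory, so code that
# inspects the raw bits can tell this NA apart from a plain NaN.
bits = struct.unpack("<Q", struct.pack("<d", na))[0]
plain = struct.unpack("<Q", struct.pack("<d", float("nan")))[0]
print(bits == NA_BITS, plain == NA_BITS)  # True False
```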

-- Nathaniel


Re: [Numpy-discussion] Missing data again

2012-03-07 Thread Charles R Harris
On Wed, Mar 7, 2012 at 9:35 AM, Pierre Haessig pierre.haes...@crans.org wrote:

 Hi,

 Thanks you very much for your lights !

 On 06/03/2012 21:59, Nathaniel Smith wrote:
  Right -- R has a very impoverished type system as compared to numpy.
  There's basically four types: numeric (meaning double precision
  float), integer, logical (boolean), and character (string). And
  in practice the integer type is essentially unused, because R parses
  numbers like 1 as being floating point, not integer; the only way to
  get an integer value is to explicitly cast to it. Each of these types
  has a specific bit-pattern set aside for representing NA. And...
  that's it. It's very simple when it works, but also very limited.
 I also suspected R to be less powerful in terms of types.
 However, I think  the fact that It's very simple when it works is
 important to take into account. At the end of the day, when using all
 the fanciness it is not only about can I have some NAs in my array ?
 but also how *easily* can I have some NAs in my array ?. It's about
 balancing the how easy and the how powerful.

 The easyness-of-use is the reason of my concern about having separate
 types nafloatNN and floatNN. Of course, I won't argue that not
 breaking everything is even more important !!

 Coming back to Travis proposition bit-pattern approaches to missing
 data (*at least* for float64 and int32) need to be implemented., I
 wonder what is the amount of extra work to go from nafloat64 to
 nafloat32/16 ? Is there an hardware support NaN payloads with these
 smaller floats ? If not, or if it is too complicated, I feel it is
 acceptable to say it's too complicated and fall back to mask. One may
 have to choose between fancy types and fancy NAs...


I'm in agreement here, and that was a major consideration in making a
'masked' implementation first. Also, different folks adopt different values
for 'missing' data, and distributing one or several masks along with the
data is another common practice.

One inconvenience I have run into with the current API is that it should be
easier to clear the mask from an 'ignored' value without taking a new view
or assigning known data. So maybe two types of masks (different payloads),
or an additional flag could be helpful. The process of assigning masks
could also be made a bit easier than using fancy indexing.
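For reference, the fancy-indexing route being discussed looks like this in today's numpy.ma (a sketch of the current API, not of any proposed replacement):

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0])
mask = np.array([False, True, False, True])

# Current route: mark elements through fancy indexing with a boolean mask...
m = np.ma.masked_array(data)
m[mask] = np.ma.masked

# ...which ends up equivalent to attaching the mask at construction time.
m2 = np.ma.masked_array(data, mask=mask)
print(m.compressed())   # [1. 3.]
print(m2.compressed())  # [1. 3.]
```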

Chuck


Re: [Numpy-discussion] Missing data again

2012-03-07 Thread Lluís
Charles R Harris writes:
[...]
 One inconvenience I have run into with the current API is that is should be
 easier to clear the mask from an ignored value without taking a new view or
 assigning known data.

AFAIR, the inability to directly access a mask attribute was intentional to
make bit-patterns and masks indistinguishable from the POV of the array user.

What's the workflow that leads you to un-ignore specific elements?


 So maybe two types of masks (different payloads), or an additional flag could
 be helpful.

Do you mean different NA values? If that's the case, I think it was taken into
account when implementing the current mechanisms (and was also mentioned in the
NEP), so that it could be supported by both bit-patterns and masks (as one of
the main design points was to make them indistinguishable in the common case).

I think the name was "parametrized dtypes".


 The process of assigning masks could also be made a bit easier than using
 fancy indexing.

I don't get what you mean here, sorry.

Do you mean here that this is too cumbersome to use?

 a[a < 5] = np.NA

(obviously oversimplified example where everything looks sufficiently simple :))




Lluis

-- 
 And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer.
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth


Re: [Numpy-discussion] Missing data again

2012-03-07 Thread Charles R Harris
On Wed, Mar 7, 2012 at 11:21 AM, Lluís xscr...@gmx.net wrote:

 Charles R Harris writes:
 [...]
  One inconvenience I have run into with the current API is that is should
 be
  easier to clear the mask from an ignored value without taking a new
 view or
  assigning known data.

 AFAIR, the inability to directly access a mask attribute was intentional
 to
 make bit-patterns and masks indistinguishable from the POV of the array
 user.

 What's the workflow that leads you to un-ignore specific elements?



Because they are not 'unknown', just (temporarily) 'ignored'. This might be
the case if you are experimenting with what happens if certain data is left
out of a fit. The current implementation tries to handle both these cases,
and can do so, I would just like the 'ignored' use to be more convenient
than it is.
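The temporarily-ignored workflow described here can be sketched with today's numpy.ma; note that masking an entry does not erase the underlying value:

```python
import numpy as np

y = np.ma.masked_array([1.0, 2.0, 10.0, 4.0])

y[2] = np.ma.masked        # temporarily ignore a suspect point
mean_without = y.mean()    # statistics/fit computed without it

y.mask[2] = False          # un-ignore: the value was never destroyed
print(mean_without)        # mean of [1, 2, 4]
print(y[2])                # 10.0
```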


  So maybe two types of masks (different payloads), or an additional flag
 could
  be helpful.

 Do you mean different NA values? If that's the case, I think it was taken
 into
 account when implementing the current mechanisms (and was also mentioned
 in the
 NEP), so that it could be supported by both bit-patterns and masks (as one
 of
 the main design points was to make them indistinguishable in the common
 case).


No, the mask as currently implemented is eight bits and can be extended to
handle different mask values, aka, payloads.


 I think the name was parametrized dtypes.


They don't interest me in the least. But that is a whole different area of
discussion.



  The process of assigning masks could also be made a bit easier than using
  fancy indexing.

 I don't get what you mean here, sorry.


Suppose I receive a data set, say an HDF file, that also includes a mask.
I'd like to load the data and apply the mask directly without doing
something like

data[mask] = np.NA
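With current tools, the HDF scenario can at least avoid writing NA values into the data: numpy.ma can attach a loaded mask without modifying, and with copy=False without duplicating, the original buffer. The small arrays here are stand-ins for whatever the file provides:

```python
import numpy as np

# Stand-ins for a data set and mask loaded from an HDF file.
data = np.array([1.0, 2.0, 3.0, 4.0])
mask = np.array([False, True, False, True])

m = np.ma.masked_array(data, mask=mask, copy=False)
print(m.compressed())                  # [1. 3.] -- masked values dropped
print(np.shares_memory(m.data, data))  # True: the data buffer is shared
```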


Do you mean here that this is too cumbersome to use?

 a[a < 5] = np.NA

 (obviously oversimplified example where everything looks sufficiently
 simple :))


Mostly speed and memory.

Chuck


Re: [Numpy-discussion] Missing data again

2012-03-07 Thread Nathaniel Smith
On Wed, Mar 7, 2012 at 5:17 PM, Charles R Harris
charlesr.har...@gmail.com wrote:
 On Wed, Mar 7, 2012 at 9:35 AM, Pierre Haessig pierre.haes...@crans.org
 Coming back to Travis proposition bit-pattern approaches to missing
 data (*at least* for float64 and int32) need to be implemented., I
 wonder what is the amount of extra work to go from nafloat64 to
 nafloat32/16 ? Is there an hardware support NaN payloads with these
 smaller floats ? If not, or if it is too complicated, I feel it is
 acceptable to say it's too complicated and fall back to mask. One may
 have to choose between fancy types and fancy NAs...

 I'm in agreement here, and that was a major consideration in making a
 'masked' implementation first.

When it comes to missing data, bitpatterns can do everything that
masks can do, are no more complicated to implement, and have better
performance characteristics.

 Also, different folks adopt different values
 for 'missing' data, and distributing one or several masks along with the
 data is another common practice.

True, but not really relevant to the current debate, because you have
to handle such issues as part of your general data import workflow
anyway, and none of these is any more complicated no matter which
implementations are available.

 One inconvenience I have run into with the current API is that is should be
 easier to clear the mask from an ignored value without taking a new view
 or assigning known data. So maybe two types of masks (different payloads),
 or an additional flag could be helpful. The process of assigning masks could
 also be made a bit easier than using fancy indexing.

So this, uh... this was actually the whole goal of the alterNEP
design for masks -- making all this stuff easy for people (like you,
apparently?) that want support for ignored values, separately from
missing data, and want a nice clean API for it. Basically having a
separate .mask attribute which was an ordinary, assignable array
broadcastable to the attached array's shape. Nobody seemed interested
in talking about it much then but maybe there's interest now?

-- Nathaniel


Re: [Numpy-discussion] Missing data again

2012-03-07 Thread Charles R Harris
On Wed, Mar 7, 2012 at 12:26 PM, Nathaniel Smith n...@pobox.com wrote:

 On Wed, Mar 7, 2012 at 5:17 PM, Charles R Harris
 charlesr.har...@gmail.com wrote:
  On Wed, Mar 7, 2012 at 9:35 AM, Pierre Haessig pierre.haes...@crans.org
 
  Coming back to Travis proposition bit-pattern approaches to missing
  data (*at least* for float64 and int32) need to be implemented., I
  wonder what is the amount of extra work to go from nafloat64 to
  nafloat32/16 ? Is there an hardware support NaN payloads with these
  smaller floats ? If not, or if it is too complicated, I feel it is
  acceptable to say it's too complicated and fall back to mask. One may
  have to choose between fancy types and fancy NAs...
 
  I'm in agreement here, and that was a major consideration in making a
  'masked' implementation first.

 When it comes to missing data, bitpatterns can do everything that
 masks can do, are no more complicated to implement, and have better
 performance characteristics.


Maybe for float, for other things, no. And we have lots of other things. The
performance is a strawman, and it *isn't* easier to implement.


  Also, different folks adopt different values
  for 'missing' data, and distributing one or several masks along with the
  data is another common practice.

 True, but not really relevant to the current debate, because you have
 to handle such issues as part of your general data import workflow
 anyway, and none of these is any more complicated no matter which
 implementations are available.

  One inconvenience I have run into with the current API is that is should
 be
  easier to clear the mask from an ignored value without taking a new
 view
  or assigning known data. So maybe two types of masks (different
 payloads),
  or an additional flag could be helpful. The process of assigning masks
 could
  also be made a bit easier than using fancy indexing.

 So this, uh... this was actually the whole goal of the alterNEP
 design for masks -- making all this stuff easy for people (like you,
 apparently?) that want support for ignored values, separately from
 missing data, and want a nice clean API for it. Basically having a
 separate .mask attribute which was an ordinary, assignable array
 broadcastable to the attached array's shape. Nobody seemed interested
 in talking about it much then but maybe there's interest now?


Come off it, Nathaniel, the problem is minor and fixable. The intent of the
initial implementation was to discover such things. These things are less
accessible with the current API *precisely* because of the feedback from R
users. It didn't start that way.

We now have something to evolve into what we want. That is a heck of a lot
more useful than endless discussion.

Chuck


Re: [Numpy-discussion] Missing data again

2012-03-07 Thread Benjamin Root
On Wed, Mar 7, 2012 at 1:26 PM, Nathaniel Smith n...@pobox.com wrote:

 On Wed, Mar 7, 2012 at 5:17 PM, Charles R Harris
 charlesr.har...@gmail.com wrote:
  On Wed, Mar 7, 2012 at 9:35 AM, Pierre Haessig pierre.haes...@crans.org
 
  Coming back to Travis proposition bit-pattern approaches to missing
  data (*at least* for float64 and int32) need to be implemented., I
  wonder what is the amount of extra work to go from nafloat64 to
  nafloat32/16 ? Is there an hardware support NaN payloads with these
  smaller floats ? If not, or if it is too complicated, I feel it is
  acceptable to say it's too complicated and fall back to mask. One may
  have to choose between fancy types and fancy NAs...
 
  I'm in agreement here, and that was a major consideration in making a
  'masked' implementation first.

 When it comes to missing data, bitpatterns can do everything that
 masks can do, are no more complicated to implement, and have better
 performance characteristics.


Not true. Bitpatterns inherently destroy the data, while masks do not.
For matplotlib, we can not use bitpatterns because it could over-write user
data (or we have to copy the data).  I would imagine other extension
writers would have similar issues when they need to play around with input
data in a safe manner.

Also, I doubt that the performance characteristics for strings and integers
are the same as they are for masks.

Ben Root


Re: [Numpy-discussion] Missing data again

2012-03-07 Thread Matthew Brett
Hi,

On Wed, Mar 7, 2012 at 11:37 AM, Charles R Harris
charlesr.har...@gmail.com wrote:


 On Wed, Mar 7, 2012 at 12:26 PM, Nathaniel Smith n...@pobox.com wrote:

 On Wed, Mar 7, 2012 at 5:17 PM, Charles R Harris
 charlesr.har...@gmail.com wrote:
  On Wed, Mar 7, 2012 at 9:35 AM, Pierre Haessig
  pierre.haes...@crans.org
  Coming back to Travis proposition bit-pattern approaches to missing
  data (*at least* for float64 and int32) need to be implemented., I
  wonder what is the amount of extra work to go from nafloat64 to
  nafloat32/16 ? Is there an hardware support NaN payloads with these
  smaller floats ? If not, or if it is too complicated, I feel it is
  acceptable to say it's too complicated and fall back to mask. One may
  have to choose between fancy types and fancy NAs...
 
  I'm in agreement here, and that was a major consideration in making a
  'masked' implementation first.

 When it comes to missing data, bitpatterns can do everything that
 masks can do, are no more complicated to implement, and have better
 performance characteristics.


 Maybe for float, for other things, no. And we have lots of other things. The
 performance is a strawman, and it *isn't* easier to implement.


  Also, different folks adopt different values
  for 'missing' data, and distributing one or several masks along with the
  data is another common practice.

 True, but not really relevant to the current debate, because you have
 to handle such issues as part of your general data import workflow
 anyway, and none of these is any more complicated no matter which
 implementations are available.

  One inconvenience I have run into with the current API is that is should
  be
  easier to clear the mask from an ignored value without taking a new
  view
  or assigning known data. So maybe two types of masks (different
  payloads),
  or an additional flag could be helpful. The process of assigning masks
  could
  also be made a bit easier than using fancy indexing.

 So this, uh... this was actually the whole goal of the alterNEP
 design for masks -- making all this stuff easy for people (like you,
 apparently?) that want support for ignored values, separately from
 missing data, and want a nice clean API for it. Basically having a
 separate .mask attribute which was an ordinary, assignable array
 broadcastable to the attached array's shape. Nobody seemed interested
 in talking about it much then but maybe there's interest now?


 Come off it, Nathaniel, the problem is minor and fixable. The intent of the
 initial implementation was to discover such things. These things are less
 accessible with the current API *precisely* because of the feedback from R
 users. It didn't start that way.

 We now have something to evolve into what we want. That is a heck of a lot
 more useful than endless discussion.

The endless discussion is for the following reason:

- The discussion was never adequately resolved.

The discussion was never adequately resolved because there was not
enough work done to understand the various arguments.   In particular,
you've several times said things that indicate to me, as to Nathaniel,
that you either have not read or have not understood the points that
Nathaniel was making.

Travis' recent email - to me - also indicates that there is still a
genuine problem here that has not been adequately explored.

There is no future in trying to stop discussion, and trying to do so
will only prolong it and make it less useful.  It will make the
discussion - endless.

If you want to help - read the alterNEP, respond to it directly, and
further the discussion by engaged debate.

Best,

Matthew


Re: [Numpy-discussion] Missing data again

2012-03-07 Thread Eric Firing
On 03/07/2012 09:26 AM, Nathaniel Smith wrote:
 On Wed, Mar 7, 2012 at 5:17 PM, Charles R Harris
 charlesr.har...@gmail.com  wrote:
 On Wed, Mar 7, 2012 at 9:35 AM, Pierre Haessig pierre.haes...@crans.org wrote:
 Coming back to Travis' proposition that "bit-pattern approaches to missing
 data (*at least* for float64 and int32) need to be implemented", I
 wonder what is the amount of extra work to go from nafloat64 to
 nafloat32/16. Is there hardware support for NaN payloads with these
 smaller floats? If not, or if it is too complicated, I feel it is
 acceptable to say it's too complicated and fall back to masks. One may
 have to choose between fancy types and fancy NAs...

 I'm in agreement here, and that was a major consideration in making a
 'masked' implementation first.

 When it comes to missing data, bitpatterns can do everything that
 masks can do, are no more complicated to implement, and have better
 performance characteristics.

 Also, different folks adopt different values
 for 'missing' data, and distributing one or several masks along with the
 data is another common practice.

 True, but not really relevant to the current debate, because you have
 to handle such issues as part of your general data import workflow
 anyway, and none of these is any more complicated no matter which
 implementations are available.

 One inconvenience I have run into with the current API is that it should be
 easier to clear the mask from an ignored value without taking a new view
 or assigning known data. So maybe two types of masks (different payloads),
 or an additional flag could be helpful. The process of assigning masks could
 also be made a bit easier than using fancy indexing.

 So this, uh... this was actually the whole goal of the alterNEP
 design for masks -- making all this stuff easy for people (like you,
 apparently?) that want support for ignored values, separately from
 missing data, and want a nice clean API for it. Basically having a
 separate .mask attribute which was an ordinary, assignable array
 broadcastable to the attached array's shape. Nobody seemed interested
 in talking about it much then but maybe there's interest now?

In other words, good low-level support for numpy.ma functionality?  With 
a migration path so that a separate numpy.ma might wither away?  Yes, 
there is interest; this is exactly what I think is needed for my own 
style of applications (which I think are common at least in geoscience), 
and for matplotlib.  The question is how to achieve it as simply and 
cleanly as possible while also satisfying the needs of the R users, and 
while making it easy for matplotlib, for example, to handle *any* 
reasonable input: ma, other masking, nan, or NA-bitpattern.
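One pragmatic way to handle "any reasonable input" today is to funnel everything through numpy.ma (an editorial sketch, not part of the message; the helper name `as_data_and_mask` is invented, and `np.ma.masked_invalid` does the heavy lifting):

```python
import numpy as np

def as_data_and_mask(x):
    """Normalize ma arrays, NaN-bearing arrays, or plain arrays to a
    (data, mask) pair, with mask True where a value is bad/missing."""
    m = np.ma.masked_invalid(x)   # masks nan/inf and keeps any existing mask
    return np.ma.getdata(m), np.ma.getmaskarray(m)

data, mask = as_data_and_mask([1.0, np.nan, 3.0])
print(mask)           # [False  True False]

_, mmask = as_data_and_mask(np.ma.masked_array([1.0, 2.0], mask=[False, True]))
print(mmask)          # [False  True]
```

An NA-bitpattern input would need one extra branch in the same helper once such a dtype existed.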

It may be that a rather pragmatic approach to implementation will prove 
better than a highly idealized set of data models.  Or, it may be that a 
dual approach is best, in which the flag value missing data 
implementation is tightly bound to the R model and the mask 
implementation is explicitly designed for the numpy.ma model. In any 
case, a reasonable level of agreement on the goals is needed.  I presume 
Travis's involvement will facilitate a clarification of the goals and of 
the implementation; and I expect that much of Mark's work will end up 
serving well, even if much needs to be added and the API evolves 
considerably.
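For reference, the numpy.ma model being discussed -- an ordinary, assignable mask that can be cleared per element -- already behaves like this (an editorial sketch, not part of the message):

```python
import numpy as np

# numpy.ma: the mask is an ordinary boolean array attribute.
a = np.ma.masked_array([1.0, 2.0, 3.0, 4.0],
                       mask=[False, True, False, True])
print(a.sum())        # 4.0 -- reductions skip the masked 2.0 and 4.0

# The mask is assignable as a whole...
a.mask = [True, False, False, False]
print(a.sum())        # 9.0

# ...and clearable per element, revealing the preserved data
# without taking a new view or assigning known data:
a.mask[0] = False
print(a[0])           # 1.0
```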

Eric


 -- Nathaniel


Re: [Numpy-discussion] Missing data again

2012-03-07 Thread Pierre Haessig
Hi,
On 07/03/2012 20:57, Eric Firing wrote:
 In other words, good low-level support for numpy.ma functionality?
Coming back to *existing* ma support, I was just wondering whether it
was now possible to np.save a masked array.
(I'm using numpy 1.5)
In the end, this is the most annoying problem I have with the existing
ma module which otherwise is pretty useful to me. I'm happy not to need
to process 100% of my data though.

Best,
Pierre





Re: [Numpy-discussion] Missing data again

2012-03-07 Thread Eric Firing
On 03/07/2012 11:15 AM, Pierre Haessig wrote:
 Hi,
 On 07/03/2012 20:57, Eric Firing wrote:
 In other words, good low-level support for numpy.ma functionality?
 Coming back to *existing* ma support, I was just wondering whether it
 was now possible to np.save a masked array.
 (I'm using numpy 1.5)

No, not with the mask preserved.  This is one of the improvements I am 
hoping for with the upcoming missing data work.
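A workaround available with these APIs (a sketch, not from the thread): persist the data and the mask as separate arrays with np.savez and rebuild the masked array on load.

```python
import numpy as np

a = np.ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])

# np.save(a) would silently lose the mask; store both pieces instead.
np.savez("masked.npz", data=np.ma.getdata(a), mask=np.ma.getmaskarray(a))

f = np.load("masked.npz")
b = np.ma.masked_array(f["data"], mask=f["mask"])
print(b)              # [1.0 -- 3.0]
print(b.sum())        # 4.0
```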

Eric

 In the end, this is the most annoying problem I have with the existing
 ma module which otherwise is pretty useful to me. I'm happy not to need
 to process 100% of my data though.

 Best,
 Pierre






Re: [Numpy-discussion] Missing data again

2012-03-07 Thread Nathaniel Smith
On Wed, Mar 7, 2012 at 7:37 PM, Charles R Harris
charlesr.har...@gmail.com wrote:


 On Wed, Mar 7, 2012 at 12:26 PM, Nathaniel Smith n...@pobox.com wrote:
 When it comes to missing data, bitpatterns can do everything that
 masks can do, are no more complicated to implement, and have better
 performance characteristics.


 Maybe for float, for other things, no. And we have lots of other things.

It would be easier to discuss this if you'd, like, discuss :-(. If you
know of some advantage that masks have over bitpatterns when it comes
to missing data, can you please share it, instead of just asserting
it?

Not that I'm immune... I perhaps should have been more explicit
myself, when I said performance characteristics, let me clarify that
I was thinking of both speed (for floats) and memory (for
most-but-not-all things).
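As an aside on the float16/float32 payload question raised earlier in the thread (computed with today's numpy, not part of the message): every IEEE width has room for NaN payloads, it just shrinks fast. Payload bits are the explicit mantissa bits minus the one bit that distinguishes quiet from signaling NaNs:

```python
import numpy as np

for dt in (np.float16, np.float32, np.float64):
    nmant = np.finfo(dt).nmant        # explicit mantissa bits: 10, 23, 52
    payload = nmant - 1               # minus the quiet/signaling bit
    print(f"{dt.__name__}: {payload} payload bits, "
          f"{2**payload - 1} usable non-zero payloads")
```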

 The
 performance is a strawman,

How many users need to speak up to say that this is a serious problem
they have with the current implementation before you stop calling it a
strawman? Because when Wes says that it's not going to fly for his
stats/econometics cases, and the neuroimaging folk like Gary and Matt
say it's not going to fly for their use cases... surely just waving
that away is a bit dismissive?

I'm not saying that we *have* to implement bitpatterns because
performance is *the most important feature* -- I'm just saying, well,
what I said. For *missing data use* cases, bitpatterns have better
performance characteristics than masks. If we decide that these use
cases are important, then we should take this into account and weigh
it against other considerations. Maybe what you think is that these
use cases shouldn't be the focus of this feature and it should focus
on the ignored use cases instead? That would be a legitimate
argument... but if that's what you want to say, say it, don't just
dismiss your users!

 and it *isn't* easier to implement.

If I thought bitpatterns would be easier to implement, I would have
said so... What I said was that they're not harder. You have some
extra complexity, mostly in casting, and some reduced complexity -- no
need to allocate and manipulate the mask. (E.g., simple same-type
assignments and slicing require special casing for masks, but not for
bitpatterns.) In many places the complexity is identical -- printing
routines need to check for either special bitpatterns or masked
values, whatever. Ufunc loops need to either find the appropriate part
of the mask, or create a temporary mask buffer by calling a dtype
func, whatever. On net they seem about equivalent, complexity-wise.

...I assume you disagree with this analysis, since I've said it
before, wrote up a sketch for how the implementation would work at the
C level, etc., and you continue to claim that simplicity is a
compelling advantage for the masked approach. But I still don't know
why you think that :-(.
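The memory half of "performance characteristics" is easy to quantify (a sketch, not from the thread; assumes the mask costs one byte per element, as in the masked implementation):

```python
import numpy as np

n = 1_000_000
data = np.zeros(n, dtype=np.float64)

# Bitpattern NA: the missing marker lives inside the 8 data bytes.
bitpattern_bytes = data.nbytes

# Masked NA: a parallel byte-per-element mask array comes along for the ride.
mask = np.zeros(n, dtype=bool)
masked_bytes = data.nbytes + mask.nbytes

print(bitpattern_bytes)   # 8000000
print(masked_bytes)       # 9000000 -- 12.5% more memory (and bandwidth)
```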

  Also, different folks adopt different values
  for 'missing' data, and distributing one or several masks along with the
  data is another common practice.

 True, but not really relevant to the current debate, because you have
 to handle such issues as part of your general data import workflow
 anyway, and none of these is any more complicated no matter which
 implementations are available.

  One inconvenience I have run into with the current API is that it should
  be
  easier to clear the mask from an ignored value without taking a new
  view
  or assigning known data. So maybe two types of masks (different
  payloads),
  or an additional flag could be helpful. The process of assigning masks
  could
  also be made a bit easier than using fancy indexing.

 So this, uh... this was actually the whole goal of the alterNEP
 design for masks -- making all this stuff easy for people (like you,
 apparently?) that want support for ignored values, separately from
 missing data, and want a nice clean API for it. Basically having a
 separate .mask attribute which was an ordinary, assignable array
 broadcastable to the attached array's shape. Nobody seemed interested
 in talking about it much then but maybe there's interest now?


 Come off it, Nathaniel, the problem is minor and fixable. The intent of the
 initial implementation was to discover such things.

Implementation can be wonderful, I absolutely agree. But you
understand that I'd be more impressed by this example if your
discovery weren't something I had been arguing for since before the
implementation began :-).

 These things are less
 accessible with the current API *precisely* because of the feedback from R
 users. It didn't start that way.

 We now have something to evolve into what we want. That is a heck of a lot
 more useful than endless discussion.

No, you are still missing the point completely! There is no what *we*
want, because what you want is different than what I want. The
masking stuff in the alterNEP was an attempt to give people like you
who wanted ignored support what they wanted, and the bitpattern
stuff was to 

Re: [Numpy-discussion] Missing data again

2012-03-07 Thread Nathaniel Smith
On Wed, Mar 7, 2012 at 7:39 PM, Benjamin Root ben.r...@ou.edu wrote:
 On Wed, Mar 7, 2012 at 1:26 PM, Nathaniel Smith n...@pobox.com wrote:
 When it comes to missing data, bitpatterns can do everything that
 masks can do, are no more complicated to implement, and have better
 performance characteristics.


 Not true.  bitpatterns inherently destroys the data, while masks do not.

Yes, that's why I only wrote that this is true for missing data, not
in general :-). If you have data that is being destroyed, then that's
not missing data, by definition. We don't have consensus yet on
whether that's the use case we are aiming for, but it's the one that
Pierre was worrying about.

 For matplotlib, we can not use bitpatterns because it could over-write user
 data (or we have to copy the data).  I would imagine other extension writers
 would have similar issues when they need to play around with input data in a
 safe manner.

Right. You clearly need some sort of masking, either an explicit mask
array that you keep somewhere, or one that gets attached to the
underlying ndarray in some non-destructive way.
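The destructive/non-destructive distinction can be shown with tools available today, using NaN as a stand-in for a bitpattern NA (a sketch, not from the thread):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])

# Bitpattern-style marking overwrites the value it marks.
b = a.copy()
b[1] = np.nan          # the original 2.0 is gone from b

# Mask-style marking hides the value but preserves it.
m = np.ma.masked_array(a.copy(), mask=[False, True, False])
print(m)               # [1.0 -- 3.0]
print(m.data[1])       # 2.0 -- still there underneath
m.mask[1] = False      # and recoverable by clearing the mask
print(m[1])            # 2.0
```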

 Also, I doubt that the performance characteristics for strings and integers
 are the same as it is for masks.

Not sure what you mean by this, but I'd be happy to hear more.

-- Nathaniel


Re: [Numpy-discussion] Missing data again

2012-03-06 Thread Pierre Haessig
Hi Mark,

I went through the NA NEP a few days ago, but only too quickly so that
my question is probably a rather dumb one. It's about the usability of
bitpatter-based NAs, based on your recent post :

On 03/03/2012 22:46, Mark Wiebe wrote:
 Also, here's a thought for the usability of NA-float64. As much as
 global state is a bad idea, something which determines whether
 implicit float dtypes are NA-float64 or float64 could help. In
 IPython, pylab mode would default to float64, and statlab or
 pystat would default to NA-float64. One way to write this might be:

 >>> np.set_default_float(np.nafloat64)
 >>> np.array([1.0, 2.0, 3.0])
 array([ 1.,  2.,  3.], dtype=nafloat64)
 >>> np.set_default_float(np.float64)
 >>> np.array([1.0, 2.0, 3.0])
 array([ 1.,  2.,  3.], dtype=float64)

Q: Is it an *absolute* necessity to have two separate dtypes nafloatNN
and floatNN to enable NA bitpattern storage ?

From a potential user perspective, I feel it would be nice to have NA
and non-NA cases look as similar as possible. Your code example is
particularly striking : two different dtypes to store (from a user
perspective) the exact same content ! If this *could* be avoided, it
would be great...

I don't know how the NA machinery works in R. Does it work with a
kind of nafloat64 all the time, or is there some type-inference
mechanism involved in choosing the appropriate type?

Best,
Pierre





Re: [Numpy-discussion] Missing data again

2012-03-06 Thread Mark Wiebe
Hi Pierre,

On Tue, Mar 6, 2012 at 5:48 AM, Pierre Haessig pierre.haes...@crans.orgwrote:

 Hi Mark,

 I went through the NA NEP a few days ago, but only too quickly so that
 my question is probably a rather dumb one. It's about the usability of
 bitpattern-based NAs, based on your recent post:

 On 03/03/2012 22:46, Mark Wiebe wrote:
  Also, here's a thought for the usability of NA-float64. As much as
  global state is a bad idea, something which determines whether
  implicit float dtypes are NA-float64 or float64 could help. In
  IPython, pylab mode would default to float64, and statlab or
  pystat would default to NA-float64. One way to write this might be:
 
  >>> np.set_default_float(np.nafloat64)
  >>> np.array([1.0, 2.0, 3.0])
  array([ 1.,  2.,  3.], dtype=nafloat64)
  >>> np.set_default_float(np.float64)
  >>> np.array([1.0, 2.0, 3.0])
  array([ 1.,  2.,  3.], dtype=float64)

 Q: Is it an *absolute* necessity to have two separate dtypes nafloatNN
 and floatNN to enable NA bitpattern storage ?

 From a potential user perspective, I feel it would be nice to have NA
 and non-NA cases look as similar as possible. Your code example is
 particularly striking : two different dtypes to store (from a user
 perspective) the exact same content ! If this *could* be avoided, it
 would be great...


The biggest reason to keep the two types separate is performance. The
straight float dtypes map directly to hardware floating-point operations,
which can be very fast. The NA-float dtypes have to use additional logic to
handle the NA values correctly. NA is treated as a particular NaN, and if
the hardware float operations were used directly, NA would turn into NaN.
This additional logic usually means more branches, so is slower.

One possibility we could consider is to automatically convert an array's
dtype from float64 to nafloat64 the first time an NA is assigned. This
would have good performance when there are no NAs, but would transparently
switch on NA support when it's needed.
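That transparent switch could look something like this wrapper (purely hypothetical illustration -- the class, the NA sentinel, and the use of np.ma as the "NA-aware" representation are all invented for the sketch, not a proposed numpy API):

```python
import numpy as np

NA = object()   # stand-in for a proposed np.NA singleton

class AutoPromoteArray:
    """Plain float64 ndarray until the first NA assignment,
    then transparently promoted to an NA-aware representation."""

    def __init__(self, values):
        self._a = np.asarray(values, dtype=np.float64)   # fast path

    def __setitem__(self, i, value):
        if value is NA:
            if not isinstance(self._a, np.ma.MaskedArray):
                self._a = np.ma.masked_array(self._a)    # promotion happens here
            self._a[i] = np.ma.masked
        else:
            self._a[i] = value

    def sum(self):
        return self._a.sum()

arr = AutoPromoteArray([1.0, 2.0, 3.0])
print(arr.sum())   # 6.0, computed on a plain hardware-speed float64 array
arr[1] = NA        # first NA: NA support switches on
print(arr.sum())   # 4.0
```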


 I don't know how the NA machinery works in R. Does it work with a
 kind of nafloat64 all the time, or is there some type-inference
 mechanism involved in choosing the appropriate type?


My understanding of R is that it works with the nafloat64 for all its
operations, yes.

Cheers,
Mark


 Best,
 Pierre




Re: [Numpy-discussion] Missing data again

2012-03-06 Thread Nathaniel Smith
On Sat, Mar 3, 2012 at 8:30 PM, Travis Oliphant tra...@continuum.io wrote:
 Hi all,

Hi Travis,

Thanks for bringing this back up.

Have you looked at the summary from the last thread?
  https://github.com/njsmith/numpy/wiki/NA-discussion-status
The goal was to try and at least work out what points we all *could*
agree on, to have some common footing for further discussion. I won't
copy the whole thing here, but I'd summarize the state as:
  -- It's pretty clear that there are two fairly different conceptual
models/use cases in play here. For one of them (R-style missing data
cases) it's pretty clear what the desired semantics would be. For the
other (temporary ignored values) there's still substantive
disagreement.
  -- We *haven't* yet established what we want numpy to actually support.

IMHO the critical next step is this latter one -- maybe we want to
fully support both use cases. Maybe it's really only one of them
that's worth trying to support in the numpy core right now. Maybe it's
just one of them, but it's worth doing so thoroughly that it should
have multiple implementations. Or whatever.

I fear that if we don't talk about these big picture questions and
just wade directly back into round-and-round arguments about API
details then we'll never get anywhere.

[...]
 Because it is slated to go into release 1.7, we need to re-visit the masked 
 array discussion again.    The NEP process is the appropriate one and I'm 
 glad we are taking that route for these discussions.   My goal is to get 
 consensus in order for code to get into NumPy (regardless of who writes the 
 code).    It may be that we don't come to a consensus (reasonable and 
 intelligent people can disagree on things --- look at the coming 
 election...).   We can represent different parts of what is fortunately a 
 very large user-base of NumPy users.

 First of all, I want to be clear that I think there is much great work that 
 has been done in the current missing data code.  There are some nice features 
 in the where clause of the ufunc and the machinery for the iterator that 
 allows re-using ufunc loops that are not re-written to check for missing 
 data.   I'm sure there are other things as well that I'm not quite aware of 
 yet.    However, I don't think the API presented to the numpy user presently 
 is the correct one for NumPy 1.X.

 A few particulars:

        * the reduction operations need to default to skipna --- this is the 
 most common use case which has been re-inforced again to me today by a new 
 user to Python who is using masked arrays presently

This is one of the points where the two conceptual models disagree
(see also Skipper's point down-thread). If you have missing data,
then propagation has to be the default -- the sum of 1, 2, and
I-DON'T-KNOW-MAYBE-7-MAYBE-12 is not 3. If you have some data there
but you've asked numpy to temporarily ignore it, then, well, duh, of
course it should ignore it.
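numpy's NaN handling already exhibits both semantics side by side, which makes the disagreement concrete (a sketch, not from the thread):

```python
import numpy as np

x = np.array([1.0, 2.0, np.nan])

# Missing-data semantics: an unknown value poisons the reduction.
print(np.sum(x))       # nan

# Ignored-value semantics: the marked entry is simply skipped.
print(np.nansum(x))    # 3.0

# np.ma defaults to the skipna behavior described above:
print(np.ma.masked_invalid(x).sum())   # 3.0
```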

        * the mask needs to be visible to the user if they use that approach 
 to missing data (people should be able to get a hold of the mask and work 
 with it in Python)

This is also a point where the two conceptual models disagree.

Actually this is one of the original arguments we made against the NEP
design -- that if you want missing data, then having a mask at all is
counterproductive, and if you are ignoring data, then of course it
should be easy to manipulate the ignore mask. The rationale for the
current design is to compromise between these two approaches -- there
is a mask, but it's hidden behind a curtain. Mostly. (This may be a
compromise in the Solomonic sense.)

        * bit-pattern approaches to missing data (at least for float64 and 
 int32) need to be implemented.

        * there should be some way when using masks (even if it's hidden 
 from most users) for missing data to separate the low-level ufunc operation 
 from the operation
           on the masks...

I don't understand what this means.

 I have heard from several users that they will *not use the missing data* in 
 NumPy as currently implemented, and I can now see why.    For better or for 
 worse, my approach to software is generally very user-driven and very 
 pragmatic.  On the other hand, I'm also a mathematician and appreciate the 
 cognitive compression that can come out of well-formed structure.    
 None-the-less, I'm an *applied* mathematician and am ultimately motivated by 
 applications.

 I will get a hold of the NEP and spend some time with it to discuss some of 
 this in that document.   This will take several weeks (as PyCon is next week 
 and I have a tutorial I'm giving there).    For now, I do not think 1.7 can 
 be released unless the masked array is labeled *experimental*.

In project management terms, I see three options:
1) Put a big warning label on the functionality and leave it for now
(If this option is given, np.asarray returns a masked array. NOTE: IN
THE NEXT RELEASE, IT MAY INSTEAD RETURN A BAG OF RABID, HUNGRY
WEASELS. NO GUARANTEES.)
2) Move the code back out of mainline and into a branch until there's
consensus.
3) Hold up the release until this is all sorted.

Re: [Numpy-discussion] Missing data again

2012-03-06 Thread Nathaniel Smith
On Tue, Mar 6, 2012 at 4:38 PM, Mark Wiebe mwwi...@gmail.com wrote:
 On Tue, Mar 6, 2012 at 5:48 AM, Pierre Haessig pierre.haes...@crans.org
 wrote:
 From a potential user perspective, I feel it would be nice to have NA
 and non-NA cases look as similar as possible. Your code example is
 particularly striking : two different dtypes to store (from a user
 perspective) the exact same content ! If this *could* be avoided, it
 would be great...

 The biggest reason to keep the two types separate is performance. The
 straight float dtypes map directly to hardware floating-point operations,
 which can be very fast. The NA-float dtypes have to use additional logic to
 handle the NA values correctly. NA is treated as a particular NaN, and if
 the hardware float operations were used directly, NA would turn into NaN.
 This additional logic usually means more branches, so is slower.

Actually, no -- hardware float operations preserve NA-as-NaN. You
might well need to be careful around more exotic code like optimized
BLAS kernels, but all the basic ufuncs should Just Work at full speed.
Demo:

>>> def hexify(x): return hex(np.float64(x).view(np.int64))
>>> hexify(np.nan)
'0x7ff8000000000000L'
# IIRC this is R's NA bitpattern (presumably 1974 is someone's birthday)
>>> NA = np.int64(0x7ff8000000000000 + 1974).view(np.float64)
# It is a NaN...
>>> NA
nan
# But it has a distinct bitpattern:
>>> hexify(NA)
'0x7ff80000000007b6L'
# Like any NaN, it propagates through floating point operations:
>>> NA + 3
nan
# But, critically, so does the bitpattern; ordinary Python + is
# returning NA on this operation:
>>> hexify(NA + 3)
'0x7ff80000000007b6L'

This is how R does it, which is more evidence that this actually works
on real hardware.

There is one place where it fails. In a binary operation with *two*
NaN values, there's an ambiguity about which payload should be
returned. IEEE754 recommends just returning the first one. This means
that NA + NaN = NA, NaN + NA = NaN. This is ugly, but it's an obscure
case that nobody cares about, so it's probably worth it for the speed
gain. (In fact, if you type those two expressions at the R prompt,
then that's what you get, and I can't find any reference to anyone
even noticing this.)

 I don't know how the NA machinery works in R. Does it work with a
 kind of nafloat64 all the time, or is there some type-inference
 mechanism involved in choosing the appropriate type?

 My understanding of R is that it works with the nafloat64 for all its
 operations, yes.

Right -- R has a very impoverished type system as compared to numpy.
There's basically four types: numeric (meaning double precision
float), integer, logical (boolean), and character (string). And
in practice the integer type is essentially unused, because R parses
numbers like 1 as being floating point, not integer; the only way to
get an integer value is to explicitly cast to it. Each of these types
has a specific bit-pattern set aside for representing NA. And...
that's it. It's very simple when it works, but also very limited.
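For concreteness, R's integer NA is the single bit pattern INT_MIN; the same sentinel trick is easy to mimic in numpy (a sketch, not from the thread -- `is_na` is an invented helper):

```python
import numpy as np

NA_INT32 = np.int32(-2**31)    # INT_MIN: the bit pattern R reserves for integer NA

def is_na(a):
    return a == NA_INT32

a = np.array([1, NA_INT32, 3], dtype=np.int32)
print(is_na(a))                # [False  True False]
print(a[~is_na(a)].sum())      # 4 -- summing only the non-NA entries
```

The cost, of course, is that one ordinary int32 value silently becomes unusable, which is exactly the trade-off the thread is weighing.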

I'm still skeptical that we could make the floating point types
NA-aware by default -- until we have an implementation in hand, I'm
nervous there'd be some corner case that broke everything. (Maybe
ufuncs are fine but np.dot has an unavoidable overhead, or maybe it
would mess up casting from float types to non-NA-aware types, etc.)
But who knows. Probably not something we can really make a meaningful
decision about yet.

-- Nathaniel


Re: [Numpy-discussion] Missing data again

2012-03-06 Thread Ralf Gommers
On Tue, Mar 6, 2012 at 9:25 PM, Nathaniel Smith n...@pobox.com wrote:

 On Sat, Mar 3, 2012 at 8:30 PM, Travis Oliphant tra...@continuum.io
 wrote:
  Hi all,

 Hi Travis,

 Thanks for bringing this back up.

 Have you looked at the summary from the last thread?
  https://github.com/njsmith/numpy/wiki/NA-discussion-status


Re-reading that summary and the main documents and threads linked from it,
I could find either examples of statistical software that treats missing
and ignored data explicitly separately, or links to relevant literature.
Those would probably help the discussion a lot.

The goal was to try and at least work out what points we all *could*
 agree on, to have some common footing for further discussion. I won't
 copy the whole thing here, but I'd summarize the state as:
  -- It's pretty clear that there are two fairly different conceptual
 models/use cases in play here. For one of them (R-style missing data
 cases) it's pretty clear what the desired semantics would be. For the
 other (temporary ignored values) there's still substantive
 disagreement.
  -- We *haven't* yet established what we want numpy to actually support.

 IMHO the critical next step is this latter one -- maybe we want to
 fully support both use cases. Maybe it's really only one of them
 that's worth trying to support in the numpy core right now. Maybe it's
 just one of them, but it's worth doing so thoroughly that it should
 have multiple implementations. Or whatever.

 I fear that if we don't talk about these big picture questions and
 just wade directly back into round-and-round arguments about API
 details then we'll never get anywhere.

 [...]
  Because it is slated to go into release 1.7, we need to re-visit the
 masked array discussion again.The NEP process is the appropriate one
 and I'm glad we are taking that route for these discussions.   My goal is
 to get consensus in order for code to get into NumPy (regardless of who
 writes the code).It may be that we don't come to a consensus
 (reasonable and intelligent people can disagree on things --- look at the
 coming election...).   We can represent different parts of what is
 fortunately a very large user-base of NumPy users.
 
  First of all, I want to be clear that I think there is much great work
 that has been done in the current missing data code.  There are some nice
 features in the where clause of the ufunc and the machinery for the
 iterator that allows re-using ufunc loops that are not re-written to check
 for missing data.   I'm sure there are other things as well that I'm not
 quite aware of yet.However, I don't think the API presented to the
 numpy user presently is the correct one for NumPy 1.X.
 
  A few particulars:
 
 * the reduction operations need to default to skipna --- this
 is the most common use case which has been reinforced again to me today by
 a new user to Python who is using masked arrays presently

 This is one of the points where the two conceptual models disagree
 (see also Skipper's point down-thread). If you have missing data,
 then propagation has to be the default -- the sum of 1, 2, and
 I-DON'T-KNOW-MAYBE-7-MAYBE-12 is not 3. If you have some data there
 but you've asked numpy to temporarily ignore it, then, well, duh, of
 course it should ignore it.

 * the mask needs to be visible to the user if they use that
 approach to missing data (people should be able to get a hold of the mask
 and work with it in Python)

 This is also a point where the two conceptual models disagree.

 Actually this is one of the original arguments we made against the NEP
 design -- that if you want missing data, then having a mask at all is
 counterproductive, and if you are ignoring data, then of course it
 should be easy to manipulate the ignore mask. The rationale for the
 current design is to compromise between these two approaches -- there
 is a mask, but it's hidden behind a curtain. Mostly. (This may be a
 compromise in the Solomonic sense.)

 * bit-pattern approaches to missing data (at least for float64
 and int32) need to be implemented.
 
 * there should be some way when using masks (even if it's
 hidden from most users) for missing data to separate the low-level ufunc
 operation from the operation
on the masks...

 I don't understand what this means.

  I have heard from several users that they will *not use the missing
 data* in NumPy as currently implemented, and I can now see why.For
 better or for worse, my approach to software is generally very user-driven
 and very pragmatic.  On the other hand, I'm also a mathematician and
 appreciate the cognitive compression that can come out of well-formed
 structure.None-the-less, I'm an *applied* mathematician and am
 ultimately motivated by applications.
 
  I will get a hold of the NEP and spend some time with it to discuss some
 of this in that document.   This will take several weeks (as PyCon is next
 week and I have a 

Re: [Numpy-discussion] Missing data again

2012-03-06 Thread Nathaniel Smith
On Tue, Mar 6, 2012 at 9:14 PM, Ralf Gommers
ralf.gomm...@googlemail.com wrote:
 On Tue, Mar 6, 2012 at 9:25 PM, Nathaniel Smith n...@pobox.com wrote:
 On Sat, Mar 3, 2012 at 8:30 PM, Travis Oliphant tra...@continuum.io
 wrote:
  Hi all,

 Hi Travis,

 Thanks for bringing this back up.

 Have you looked at the summary from the last thread?
  https://github.com/njsmith/numpy/wiki/NA-discussion-status

 Re-reading that summary and the main documents and threads linked from it, I
 could find either examples of statistical software that treats missing and
 ignored data explicitly separately, or links to relevant literature. Those
 would probably help the discussion a lot.

(I think you mean couldn't find?)

I'm not aware of any software that supports the IGNORED concept at
all, whether in combination with missing data or not. np.ma is
probably the closest example. I think we'd be breaking new ground
there. This is also probably why it is less clear how it should work
:-).

IIUC, the basic reason that people want IGNORED in the core is that it
provides convenience and syntactic sugar for efficient in place
operation on subsets of large arrays. So there are actually two parts
there -- the efficient operation, and the convenience/syntactic sugar.
The key feature for efficient operation is the where= feature, which
is not controversial at all. So, there's an argument that for now we
should focus on where=, give people some time to work with it, and
then use that experience to decide what kind of convenience/sugar
would be useful, if any. But, that's just my own idea; I definitely
can't claim any consensus on it.
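For the record, the where= keyword did become an ordinary part of released numpy's ufunc machinery; the example below uses modern numpy, so it is slightly anachronistic relative to the 2012 thread:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
keep = np.array([True, False, True, True])   # False = leave this element alone

out = np.zeros_like(a)
np.add(a, 10.0, out=out, where=keep)   # operate only on the selected subset
print(out)             # [11.  0. 13. 14.] -- untouched where keep is False

# Reductions later grew the same keyword (numpy >= 1.17):
print(a.sum(where=keep))   # 8.0
```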

 In project management terms, I see three options:
 1) Put a big warning label on the functionality and leave it for now
 (If this option is given, np.asarray returns a masked array. NOTE: IN
 THE NEXT RELEASE, IT MAY INSTEAD RETURN A BAG OF RABID, HUNGRY
 WEASELS. NO GUARANTEES.)

 I've opened http://projects.scipy.org/numpy/ticket/2072 for that.

Cool, thanks.

 Assuming
 we stick with this option, I'd appreciate it if you could check in the first
 beta that comes out whether or not the warnings are obvious enough and in
 all the right places. There probably won't be weasels though:)

Of course. I've added myself to the CC list. (Err, if the beta won't
be for a bit, though, then please remind me if you remember? I'm
juggling a lot of balls right now.)

 2) Move the code back out of mainline and into a branch until until
 there's consensus.
 3) Hold up the release until this is all sorted.

 I come from the project-management school that says you should always
 have a releasable mainline, keep unready code in branches, and never
 hold up the release for features, so (2) seems obvious to me.

 While it may sound obvious, I hope you've understood why in practice it's
 not at all obvious and why you got such strong reactions to your proposal of
 taking out all that code. If not, just look at what happened with the
 numpy-refactor work.

Of course, and that's why I'm not pressing the point. These trade-offs
might be worth talking about at some point -- there are reasons that
basically all the major FOSS projects have moved towards time-based
releases :-) -- but that'd be a huge discussion at a time when we
already have more than enough of those on our plate...

 But I seem to be very much in the minority on that[1], so oh well :-). I
 don't have any objection to (1), personally. (3) seems like a bad
 idea. Just my 2 pence.


 Agreed that (3) is a bad idea. +1 for (1).

 Ralf


 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


Cheers,
-- Nathaniel


[Numpy-discussion] Missing data again

2012-03-03 Thread Travis Oliphant
Hi all, 

I've been thinking a lot about the masked array implementation lately. I 
finally had the time to look hard at what has been done and now am of the 
opinion that I do not think that 1.7 can be released with the current state of 
the masked array implementation *unless* it is clearly marked as experimental 
and may be changed in 1.8  

I wish I had been able to be a bigger part of this conversation last year.   
But, that is why I took the steps I took to try and figure out another way to 
feed my family *and* stay involved in the NumPy community.   I would love to 
stay involved in what is happening in the SciPy community, but I am more 
satisfied with what Ralf, Warren, Robert, Pauli, Josef, Charles, Stefan, and 
others are doing there right now, and don't have time to keep up with 
everything.Even though SciPy was the heart and soul of why I even got 
involved with Python for open source in the first place and took many years of 
my volunteer labor, I won't be able to spend significant time on SciPy code 
over the coming months.   At some point, I really hope to be able to make 
contributions again to that code-base.   Time will tell whether or not my 
aspirations will be realized.  It depends quite a bit on whether or not my kids 
have what they need from me (which right now is money and time). 
 
NumPy, on the other hand, is not in a position where I can feel comfortable 
leaving my baby to others.  I recognize and value the contributions from many 
people to make NumPy what it is today (e.g. code contributions, code 
rearrangement and standardization, build and install improvement, and most 
recently some architectural changes).But, I feel a personal responsibility 
for the code base as I spent a great many months writing NumPy in the first 
place, and I've spent a great deal of time interacting with NumPy users and 
feel like I have at least some sense of their stories.Of course, I built on 
the shoulders of giants, and much of what is there is *because of* where the 
code was adapted from (it was not created de-novo).   Currently,  there remains 
much that needs to be communicated, improved, and worked on, and I have 
specific opinions about what some changes and improvements should be, how they 
should be written, and how the resulting users need to be benefited.   
 It will take time to discuss all of this, and that's where I will spend my 
open-source time in the coming months. 

In that vein: 

Because it is slated to go into release 1.7, we need to re-visit the masked 
array discussion again.The NEP process is the appropriate one and I'm glad 
we are taking that route for these discussions.   My goal is to get consensus 
in order for code to get into NumPy (regardless of who writes the code).It 
may be that we don't come to a consensus (reasonable and intelligent people can 
disagree on things --- look at the coming election...).   We can represent 
different parts of what is fortunately a very large user-base of NumPy users.   
 

First of all, I want to be clear that I think there is much great work that has 
been done in the current missing data code.  There are some nice features in 
the where clause of the ufunc and the machinery for the iterator that allows 
re-using ufunc loops that are not re-written to check for missing data.   I'm 
sure there are other things as well that I'm not quite aware of yet.
However, I don't think the API presented to the numpy user presently is the 
correct one for NumPy 1.X.   

A few particulars: 

* the reduction operations need to default to skipna --- this is the 
most common use case which has been reinforced again to me today by a new user 
to Python who is using masked arrays presently 

* the mask needs to be visible to the user if they use that approach to 
missing data (people should be able to get a hold of the mask and work with it 
in Python)

* bit-pattern approaches to missing data (at least for float64 and 
int32) need to be implemented. 

* there should be some way when using masks (even if it's hidden from 
most users) for missing data to separate the low-level ufunc operation from the 
operation
   on the masks...
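For comparison, numpy.ma already behaves the way the first two bullets ask for: reductions skip masked entries by default and the mask is a visible attribute. A small illustration:

```python
import numpy as np

# numpy.ma reductions skip masked entries by default (the skipna behavior)
m = np.ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])
print(m.sum())   # 4.0 -- the masked entry is skipped by default
print(m.mask)    # [False  True False] -- the mask is directly accessible
```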

I have heard from several users that they will *not use the missing data* in 
NumPy as currently implemented, and I can now see why.For better or for 
worse, my approach to software is generally very user-driven and very 
pragmatic.  On the other hand, I'm also a mathematician and appreciate the 
cognitive compression that can come out of well-formed structure.
None-the-less, I'm an *applied* mathematician and am ultimately motivated by 
applications.

I will get a hold of the NEP and spend some time with it to discuss some of 
this in that document.   This will take several weeks (as PyCon is next week 
and I have a tutorial I'm giving there).For now, I do not think 1.7 can be 
released unless the masked array is labeled 

Re: [Numpy-discussion] Missing data again

2012-03-03 Thread Charles R Harris
On Sat, Mar 3, 2012 at 1:30 PM, Travis Oliphant tra...@continuum.io wrote:

 Hi all,

 I've been thinking a lot about the masked array implementation lately.
 I finally had the time to look hard at what has been done and now am of the
 opinion that I do not think that 1.7 can be released with the current state
 of the masked array implementation *unless* it is clearly marked as
 experimental and may be changed in 1.8


That was the intention.


 I wish I had been able to be a bigger part of this conversation last year.
   But, that is why I took the steps I took to try and figure out another
 way to feed my family *and* stay involved in the NumPy community.   I would
 love to stay involved in what is happening in the SciPy community, but I am
 more satisfied with what Ralf, Warren, Robert, Pauli, Josef, Charles,
 Stefan, and others are doing there right now, and don't have time to keep
 up with everything.Even though SciPy was the heart and soul of why I
 even got involved with Python for open source in the first place and took
 many years of my volunteer labor, I won't be able to spend significant time
 on SciPy code over the coming months.   At some point, I really hope to be
 able to make contributions again to that code-base.   Time will tell
 whether or not my aspirations will be realized.  It depends quite a bit on
 whether or not my kids have what they need from me (which right now is
 money and time).

 NumPy, on the other hand, is not in a position where I can feel
 comfortable leaving my baby to others.  I recognize and value the
 contributions from many people to make NumPy what it is today (e.g. code
 contributions, code rearrangement and standardization, build and install
 improvement, and most recently some architectural changes).But, I feel
 a personal responsibility for the code base as I spent a great many months
 writing NumPy in the first place, and I've spent a great deal of time
 interacting with NumPy users and feel like I have at least some sense of
 their stories.Of course, I built on the shoulders of giants, and much
 of what is there is *because of* where the code was adapted from (it was
 not created de-novo).   Currently,  there remains much that needs to be
 communicated, improved, and worked on, and I have specific opinions about
 what some changes and improvements should be, how they should be written,
 and how the resulting users need to be benefited.
  It will take time to discuss all of this, and that's where I will spend
 my open-source time in the coming months.

 In that vein:

 Because it is slated to go into release 1.7, we need to re-visit the
 masked array discussion again.The NEP process is the appropriate one
 and I'm glad we are taking that route for these discussions.   My goal is
 to get consensus in order for code to get into NumPy (regardless of who
 writes the code).It may be that we don't come to a consensus
 (reasonable and intelligent people can disagree on things --- look at the
 coming election...).   We can represent different parts of what is
 fortunately a very large user-base of NumPy users.

 First of all, I want to be clear that I think there is much great work
 that has been done in the current missing data code.  There are some nice
 features in the where clause of the ufunc and the machinery for the
 iterator that allows re-using ufunc loops that are not re-written to check
 for missing data.   I'm sure there are other things as well that I'm not
 quite aware of yet.However, I don't think the API presented to the
 numpy user presently is the correct one for NumPy 1.X.


A few particulars:

* the reduction operations need to default to skipna --- this is
  the most common use case which has been reinforced again to me today by a
 new user to Python who is using masked arrays presently

* the mask needs to be visible to the user if they use that
 approach to missing data (people should be able to get a hold of the mask
 and work with it in Python)

* bit-pattern approaches to missing data (at least for float64 and
 int32) need to be implemented.

* there should be some way when using masks (even if it's hidden
 from most users) for missing data to separate the low-level ufunc operation
 from the operation
   on the masks...


Mind, Mark only had a few weeks to write code. I think the unfinished state
is a direct function of that.


 I have heard from several users that they will *not use the missing data*
 in NumPy as currently implemented, and I can now see why.For better or
 for worse, my approach to software is generally very user-driven and very
 pragmatic.  On the other hand, I'm also a mathematician and appreciate the
 cognitive compression that can come out of well-formed structure.
  None-the-less, I'm an *applied* mathematician and am ultimately motivated
 by applications.


I think that would be Wes. I thought the current state wasn't that far away
from what he wanted 

Re: [Numpy-discussion] Missing data again

2012-03-03 Thread Travis Oliphant
 
 Mind, Mark only had a few weeks to write code. I think the unfinished state 
 is a direct function of that.
  
 I have heard from several users that they will *not use the missing data* in 
 NumPy as currently implemented, and I can now see why.For better or for 
 worse, my approach to software is generally very user-driven and very 
 pragmatic.  On the other hand, I'm also a mathematician and appreciate the 
 cognitive compression that can come out of well-formed structure.
 None-the-less, I'm an *applied* mathematician and am ultimately motivated by 
 applications.
 
 
 I think that would be Wes. I thought the current state wasn't that far away 
 from what he wanted in the only post where he was somewhat explicit. I think 
 it would be useful for him to sit down with Mark at some time and thrash 
 things out since I think there is some misunderstanding involved.
  

Actually it wasn't Wes.  It was 3 other people.   I'm already well aware of 
Wes's perspective and actually think his concerns have been handled already.
Also, the person who showed me their use-case was a new user.

But, your point about getting people together is well-taken.  I also recognize 
the fact that there have been (and likely continue to be) misunderstandings on 
multiple fronts.   Fortunately, many of us will be at PyCon later this week.   
We tried really hard to get Mark Wiebe here this weekend as well --- but he 
could only sacrifice a week away from his degree work to join us for PyCon. 

It would be great if you could come to PyCon as well.   Perhaps we can apply to 
NumFOCUS for a travel grant to bring NumPy developers together with other 
interested people to finish the masked array design and implementation.

-Travis




[Numpy-discussion] Missing Data development plan

2011-07-07 Thread Mark Wiebe
It's been a day less than two weeks since I posted my first feedback request
on a masked array implementation of missing data. I'd like to thank everyone
that contributed to the discussion, and that continues to contribute.

I believe my design is very solid thanks to all the feedback, and I
understand at the same time there are still concerns that people have about
the design. I sincerely hope that those concerns are further discussed and
made more clear just as I have spent a lot of effort making sure my ideas
are clear and understood by everyone in the discussion.

Travis has directed me to for the moment focus a majority of my attention on
the implementation. He will post further thoughts on the design issues in
the next few days when he has enough of a break in his schedule.

With the short time available for this implementation, my plan is as
follows:

1) Implement the masked implementation of NA nearly to completion. This is
the quickest way to get something that people can provide hands-on feedback
with, and the NA dtype in my design uses the machinery of the masked
implementation for all the computational kernels.

2) Assuming there is enough time left, implement the NA[] parameterized
dtype in concert with a derived[] dtype and cleanups of the datetime64[]
dtype, with the goal of creating some good structure for the possibility of
creating more parameterized dtypes in the future. The derived[] dtype idea
is based on an idea Travis had which he called computed columns, but
generalized to apply in more contexts. When the time comes, I will post a
proposal for feedback on this idea as well.

Thanks once again for all the great feedback, and I look forward to getting
a prototype into your hands to test as quickly as possible!

-Mark


Re: [Numpy-discussion] missing data discussion round 2

2011-06-30 Thread Chris Barker
On 6/27/11 9:53 AM, Charles R Harris wrote:
 Some discussion of disk storage might also help. I don't see how the
 rules can be enforced if two files are used, one for the mask and
 another for the data, but that may just be something we need to live with.

It seems it wouldn't be too big a deal to extend the *.npy format to 
include the mask.

Could one memmap both the data array and the mask?
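Both can be memmapped as separate flat files and recombined without copying; a sketch (the file paths and the True-means-missing convention are placeholders):

```python
import numpy as np
import os
import tempfile

tmp = tempfile.mkdtemp()
data_path = os.path.join(tmp, 'data.bin')
mask_path = os.path.join(tmp, 'mask.bin')

# Create and fill the two backing files.
data = np.memmap(data_path, dtype='f8', mode='w+', shape=(4,))
mask = np.memmap(mask_path, dtype=bool, mode='w+', shape=(4,))
data[:] = [1.0, 2.0, 3.0, 4.0]
mask[:] = [False, True, False, False]   # True == missing
data.flush()
mask.flush()

# Reopen read-only and combine into a masked array; no data is copied.
m = np.ma.masked_array(
    np.memmap(data_path, dtype='f8', mode='r', shape=(4,)),
    mask=np.memmap(mask_path, dtype=bool, mode='r', shape=(4,)))
print(m.sum())  # 8.0 -- the masked 2.0 is skipped
```

A single-file variant is np.savez('masked.npz', data=m.data, mask=m.mask), which is close in spirit to extending the .npy format to carry the mask.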

Netcdf (and assume hdf) have ways to support masks as well.

-Chris




-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR             (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov


Re: [Numpy-discussion] missing data discussion round 2

2011-06-30 Thread Mark Wiebe
On Wed, Jun 29, 2011 at 1:07 PM, Dag Sverre Seljebotn 
d.s.seljeb...@astro.uio.no wrote:

 On 06/29/2011 07:38 PM, Mark Wiebe wrote:
  On Wed, Jun 29, 2011 at 9:35 AM, Dag Sverre Seljebotn
  d.s.seljeb...@astro.uio.no mailto:d.s.seljeb...@astro.uio.no wrote:
 
  On 06/29/2011 03:45 PM, Matthew Brett wrote:
Hi,
   
On Wed, Jun 29, 2011 at 12:39 AM, Mark Wiebe mwwi...@gmail.com
  mailto:mwwi...@gmail.com  wrote:
On Tue, Jun 28, 2011 at 5:20 PM, Matthew
  Brett matthew.br...@gmail.com mailto:matthew.br...@gmail.com
wrote:
   
Hi,
   
On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smith n...@pobox.com
  mailto:n...@pobox.com  wrote:
...
(You might think, what difference does it make if you *can*
  unmask an
item? Us missing data folks could just ignore this feature.
 But:
whatever we end up implementing is something that I will have
 to
explain over and over to different people, most of them not
particularly sophisticated programmers. And there's just no
  sensible
way to explain this idea that if you store some particular
  value, then
it replaces the old value, but if you store NA, then the old
  value is
still there.
   
Ouch - yes.  No question, that is difficult to explain.   Well,
 I
think the explanation might go like this:
   
Ah, yes, well, that's because in fact numpy records missing
  values by
using a 'mask'.   So when you say `a[3] = np.NA', what you mean
 is,
'a._mask = np.ones(a.shape, np.dtype(bool)); a._mask[3] = False`
   
Is that fair?
   
My favorite way of explaining it would be to have a grid of
  numbers written
on paper, then have several cardboards with holes poked in them
  in different
configurations. Placing these cardboard masks in front of the
  grid would
show different sets of non-missing data, without affecting the
  values stored
on the paper behind them.
   
Right - but here of course you are trying to explain the mask, and
this is Nathaniel's point, that in order to explain NAs, you have
 to
explain masks, and so, even at a basic level, the fusion of the
 two
ideas is obvious, and already confusing.  I mean this:
   
a[3] = np.NA
   
Oh, so you just set the a[3] value to have some missing value
 code?
   
Ah - no - in fact what I did was set a associated mask in
 position
a[3] so that you can't any longer see the previous value of a[3]
   
Huh.  You mean I have a mask for every single value in order to
 be
able to blank out a[3]?  It looks like an assignment.  I mean, it
looks just like a[3] = 4.  But I guess it isn't?
   
Er...
   
I think Nathaniel's point is a very good one - these are separate
ideas, np.NA and np.IGNORE, and a joint implementation is bound to
draw them together in the mind of the user.Apart from anything
else, the user has to know that, if they want a single NA value in
 an
array, they have to add a mask size array.shape in bytes.  They
 have
to know then, that NA is implemented by masking, and then the 'NA
 for
free by adding masking' idea breaks down and starts to feel like a
kludge.
   
The counter argument is of course that, in time, the
  implementation of
NA with masking will seem as obvious and intuitive, as, say,
broadcasting, and that we are just reacting from lack of
 experience
with the new API.
 
  However, no matter how used we get to this, people coming from almost
  any other tool (in particular R) will keep thinking it is
  counter-intuitive. Why set up a major semantic incompatibility that
  people then have to overcome in order to start using NumPy.
 
 
  I'm not aware of a semantic incompatibility. I believe R doesn't support
  views like NumPy does, so the things you have to do to see masking
  semantics aren't even possible in R.

 Well, whether the same feature is possible or not in R is irrelevant to
 whether a semantic incompatibility would exist.

 Views themselves are a *major* semantic incompatibility, and are highly
 confusing at first to MATLAB/Fortran/R people. However they have major
 advantages outweighing the disadvantage of having to caution new users.

 But there's simply no precedence anywhere for an assignment that doesn't
 erase the old value for a particular input value, and the advantages
 seem pretty minor (well, I think it is ugly in its own right, but that
 is besides the point...)


I disagree that there's no precedent, but maybe there isn't something which
is exactly the same as my design. The whole actual real literal assignment
thought process leads to considerations of little gnomes writing 

Re: [Numpy-discussion] missing data discussion round 2

2011-06-30 Thread Mark Wiebe
On Wed, Jun 29, 2011 at 2:32 PM, Matthew Brett matthew.br...@gmail.com wrote:

 Hi,

 On Wed, Jun 29, 2011 at 6:22 PM, Mark Wiebe mwwi...@gmail.com wrote:
  On Wed, Jun 29, 2011 at 8:20 AM, Lluís xscr...@gmx.net wrote:
 
  Matthew Brett writes:
 
   Maybe instead of np.NA, we could say np.IGNORE, which sort of conveys
   the idea that the entry is still there, but we're just ignoring it.
  Of
   course, that goes against common convention, but it might be easier
 to
   explain.
 
   I think Nathaniel's point is that np.IGNORE is a different idea than
   np.NA, and that is why joining the implementations can lead to
   conceptual confusion.
 
  This is how I see it:
 
   a = np.array([0, 1, 2], dtype=int)
   a[0] = np.NA
  ValueError
   e = np.array([np.NA, 1, 2], dtype=int)
  ValueError
   b  = np.array([np.NA, 1, 2], dtype=np.maybe(int))
   m  = np.array([np.NA, 1, 2], dtype=int, masked=True)
   bm = np.array([np.NA, 1, 2], dtype=np.maybe(int), masked=True)
   b[1] = np.NA
   np.sum(b)
  np.NA
   np.sum(b, skipna=True)
  2
   b.mask
  None
   m[1] = np.NA
   np.sum(m)
  2
   np.sum(m, skipna=True)
  2
   m.mask
  [False, False, True]
   bm[1] = np.NA
   np.sum(bm)
  2
   np.sum(bm, skipna=True)
  2
   bm.mask
  [False, False, True]
 
  So:
 
  * Mask takes precedence over bit pattern on element assignment. There's
   still the question of how to assign a bit pattern NA when the mask is
   active.
 
  * When using mask, elements are automagically skipped.
 
  * m[1] = np.NA is equivalent to m.mask[1] = False
 
  * When using bit pattern + mask, it might make sense to have the initial
   values as bit-pattern NAs, instead of masked (i.e., bm.mask == [True,
   False, True] and np.sum(bm) == np.NA)
 
  There seems to be a general idea that masks and NA bit patterns imply
  particular differing semantics, something which I think is simply false.

 Well - first - it's helpful surely to separate the concepts and the
 implementation.

 Concepts / use patterns (as delineated by Nathaniel):
 A) missing values == 'np.NA' in my emails.  Can we call that CMV
 (concept missing values)?
 B) masks == np.IGNORE in my emails . CMSK (concept masks)?


This is a different conceptual model than I'm proposing in the NEP. This is
also exactly what I was trying to clarify in the first email in this thread
under the headings "Missing Data Abstraction" and "Implementation
Techniques". Masks are *just* an implementation technique. They imply
nothing more, except through previously established conventions such as in
various bitmasks, image masks, numpy.ma and others.

masks != np.IGNORE
bit patterns != np.NA

Masks vs bit patterns and R's default NA vs na.rm NA semantics are
completely independent, except where design choices are made that they
should be related. I think they should be unrelated, masks and bit patterns
are two approaches to solving the same problem.



 Implementations
 1) bit-pattern == na-dtype - how about we call that IBP
 (implementation bit patten)?
 2) array.mask.  IM (implementation mask)?

 Nathaniel implied that:

 CMV implies: sum([np.NA, 1]) == np.NA
 CMSK implies sum([np.NA, 1]) == 1

 and indeed, that's how R and masked arrays respectively behave.


R and numpy.ma.  If we're trying to be clear about our concepts and
implementations, numpy.ma is just one possible implementation of masked
arrays.


 So I
 think it's reasonable to say that at least R thought that the bitmask
 implied the first and Pierre and others thought the mask meant the
 second.


R's model is based on years of experience and a model of what missing values
imply; the bitmask implies nothing about the behavior of NA.



 The NEP as it stands thinks of CMV and CMSK as being different views
 of the same thing.  Please correct me if I'm wrong.

  Both NaN and Inf are implemented in hardware with the same idea as the NA
  bit pattern, but they do not follow NA missing value semantics.

 Right - and that doesn't affect the argument, because the argument is
 about the concepts and not the implementation.


You just said R thought bitmasks implied something, and you're saying masked
arrays imply something. If the argument is just about the missing value
concepts, neither of these should be in the present discussion.



  As far as I can tell, the only required difference between them is that
 NA
  bit patterns must destroy the data. Nothing else.

 I think Nathaniel's point was about the expected default behavior in
 the different concepts.

  Everything on top of that
  is a choice of API and interface mechanisms. I want them to behave
 exactly
  the same except for that necessary difference, so that it will be
 possible
  to use the *exact same Python code* with either approach.

 Right.  And Nathaniel's point is that that desire leads to fusion of
 the two ideas into one when they should be separated.  For example, if
 I understand correctly:

  a = np.array([1.0, 2.0, 3, 7.0], masked=True)
  b = np.array([1.0, 2.0, np.NA, 7.0], 

Re: [Numpy-discussion] missing data discussion round 2

2011-06-30 Thread Mark Wiebe
On Wed, Jun 29, 2011 at 1:20 PM, Lluís xscr...@gmx.net wrote:

 Mark Wiebe writes:

  There seems to be a general idea that masks and NA bit patterns imply
  particular differing semantics, something which I think is simply
  false.

 Well, my example contained a difference (the need for the skipna=True
 argument) precisely because it seemed that there was some need for
 different defaults.

 Honestly, I think this difference breaks the POLA (principle of least
 astonishment).


 [...]
  As far as I can tell, the only required difference between them is
  that NA bit patterns must destroy the data. Nothing else. Everything
  on top of that is a choice of API and interface mechanisms. I want
  them to behave exactly the same except for that necessary difference,
  so that it will be possible to use the *exact same Python code* with
  either approach.

 I completely agree. What I'd suggest is a global and/or per-object
 ndarray.flags.skipna for people like me that just want to ignore these
 entries without caring about setting it on each operation (or the other
 way around, depends on the default behaviour).

 The downside is that it adds yet another tweaking knob, which is not
 desirable...


One way around this would be to create an ndarray subclass which changes
that default. Currently this would not be possible to do nicely, but with
the _numpy_ufunc_ idea I proposed in a separate thread a while back, this
could work.

-Mark
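The _numpy_ufunc_ idea later landed, in revised form, as __array_ufunc__ (NumPy 1.13). A rough sketch of a subclass that changes the reduction default, with NaN standing in for NA since a flags.skipna was never added; the class name and function mapping are illustrative only:

```python
import numpy as np

class SkipNAArray(np.ndarray):
    """Illustrative subclass: reductions skip NaN-coded missing values."""
    _NAN_REDUCTIONS = {np.add: np.nansum, np.multiply: np.nanprod}

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        if method == 'reduce' and ufunc in self._NAN_REDUCTIONS:
            # Reroute e.g. add.reduce -> nansum, keeping only `axis`.
            return self._NAN_REDUCTIONS[ufunc](np.asarray(inputs[0]),
                                               axis=kwargs.get('axis'))
        # Otherwise defer to the default behavior on plain ndarrays.
        inputs = tuple(np.asarray(x) if isinstance(x, SkipNAArray) else x
                       for x in inputs)
        return getattr(ufunc, method)(*inputs, **kwargs)

a = np.array([1.0, np.nan, 3.0]).view(SkipNAArray)
print(np.sum(a))  # 4.0 rather than nan
```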




 Lluis

 --
  And it's much the same thing with knowledge, for whenever you learn
  something new, the whole world becomes that much richer.
  -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
  Tollbooth


Re: [Numpy-discussion] missing data discussion round 2

2011-06-30 Thread Mark Wiebe
On Wed, Jun 29, 2011 at 4:21 PM, Eric Firing efir...@hawaii.edu wrote:

 On 06/29/2011 09:32 AM, Matthew Brett wrote:
  Hi,
 
 [...]
 
  Clearly there are some overlaps between what masked arrays are trying
  to achieve and what Rs NA mechanisms are trying to achieve.  Are they
  really similar enough that they should function using the same API?
  And if so, won't that be confusing?  I think that's the question
  that's being asked.

 And I think the answer is no.  No more confusing to people coming from
 R to numpy than views already are--with or without the NEP--and not
 *requiring* people to use any NA-related functionality beyond what they
 are used to from R.

 My understanding of the NEP is that it directly yields an API closely
 matching that of R, but with the opportunity, via views, to do more with
 less work, if one so desires.  The present masked array module could be
 made more efficient if the NEP is implemented; regardless of whether
 this is done, the masked array module is not about to vanish, so anyone
 wanting precisely the masked array API will have it; and others remain
 free to ignore it (except for those of us involved in developing
 libraries such as matplotlib, which will have to support all variations
 of the new API along with the already-supported masked arrays).

 In addition, for new code, the full-blown masked array module may not be
 needed.  A convenience it adds, however, is the automatic masking of
 invalid values:

 In [1]: np.ma.log(-1)
 Out[1]: masked

 I'm sure this horrifies some, but there are times and places where it is
 a genuine convenience, and preferable to having to use a separate
 operation to replace nan or inf with NA or whatever it ends up being.


I added a mechanism to support this idea with the NA dtypes approach,
spelled 'NA[f8,InfNan]'. Here, all Infs and NaNs are treated as NA by the
system.

-Mark
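The 'NA[f8,InfNan]' parameterized dtype never made it into a release; in NumPy as shipped, the closest analogues to treating NaN/Inf as missing are the nan-aware reductions and np.ma.masked_invalid:

```python
import numpy as np

a = np.array([1.0, np.nan, 3.0])
print(a.sum())                        # nan -- NaN propagates by default
print(np.nansum(a))                   # 4.0 -- NaN skipped as missing
print(np.ma.masked_invalid(a).sum())  # 4.0 -- mask NaN/Inf, then reduce
```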

If np.seterr were extended to allow such automatic masking as an option,
 then the need for a separate masked array module would shrink further.
 I wouldn't mind having to use an explicit kwarg for ignoring NA in
 reduction methods.

 Eric


 
  See you,
 
  Matthew


Re: [Numpy-discussion] missing data discussion round 2

2011-06-30 Thread Mark Wiebe
On Wed, Jun 29, 2011 at 5:42 PM, Nathaniel Smith n...@pobox.com wrote:

 On Wed, Jun 29, 2011 at 2:40 PM, Lluís xscr...@gmx.net wrote:
  I'm for the option of having a single API when you want to have NA
  elements, regardless of whether it's using masks or bit patterns.

 I understand the desire to avoid having two different APIS...

 [snip]
  My concern is now about how to set the skipna in a comfortable way,
  so that I don't have to set it again and again as ufunc arguments:
 
  a
  array([NA, 2, 3])
  b
  array([1, 2, NA])
  a + b
  array([NA, 2, NA])
  a.flags.skipna=True
  b.flags.skipna=True
  a + b
  array([1, 4, 3])

 ...But... now you're introducing two different kinds of arrays with
 different APIs again? Ones where .skipna==True, and ones where
 .skipna==False?

 I know that this way it's not keyed on the underlying storage format,
 but if we support both bit patterns and mask arrays at the
 implementation level, then the only way to make them have identical
 APIs is if we completely disallow unmasking, and shared masks, and so
 forth.


The right set of these conditions has been in the NEP from the beginning.
Unmasking without value assignment is disallowed - the only way to see
behind the mask or to share masks is with views. My impression is that more
people are concerned with sharing the same data between different masks,
something also supported through views.

-Mark


 Which doesn't seem like it'd be very popular (and would make
 including the mask-based implementation pretty pointless). So I think
 we have to assume that they will have APIs that are at least somewhat
 different. And then it seems like with this proposal then we'd
 actually end up with *4* different APIs that any particular array
 might follow... (or maybe more, depending on how arrays that had both
 a bit-pattern and mask ended up working).

 That's why I was thinking the best solution might be to just bite the
 bullet and make the APIs *totally* different and non-overlapping, so
 it was always obvious which you were using and how they'd interact.
 But I don't know -- for my work I'd be happy to just pass skipna
 everywhere I needed it, and never unmask anything, and so forth, so
 maybe there's some reason why it's really important for the
 bit-pattern NA API to overlap more with the masked array API?

 -- Nathaniel



Re: [Numpy-discussion] missing data discussion round 2

2011-06-30 Thread Mark Wiebe
On Thu, Jun 30, 2011 at 1:49 AM, Chris Barker chris.bar...@noaa.gov wrote:

 On 6/27/11 9:53 AM, Charles R Harris wrote:
  Some discussion of disk storage might also help. I don't see how the
  rules can be enforced if two files are used, one for the mask and
  another for the data, but that may just be something we need to live
 with.

 It seems it wouldn't be too big a deal to extend the *.npy format to
 include the mask.

 Could one memmap both the data array and the mask?


This I haven't thought about too much yet, but I don't see why not. This
does provide a back door into the mask which violates the abstractions, so I
would want it to be an extremely narrow special case.

-Mark
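
[Editorial note: the two-file arrangement Chris asks about can be sketched today with plain np.memmap plus np.ma. This is only an illustration under stated assumptions -- NumPy has no built-in on-disk masked format, and the file names below are made up:]

```python
import os
import tempfile
import numpy as np
import numpy.ma as ma

# Persist data and mask as two separate memory-mapped files,
# then recombine them into a masked array without copying.
d = tempfile.mkdtemp()
data = np.memmap(os.path.join(d, 'data.dat'), dtype='f8', mode='w+', shape=(4,))
mask = np.memmap(os.path.join(d, 'mask.dat'), dtype=bool, mode='w+', shape=(4,))
data[:] = [1.0, 2.0, 3.0, 4.0]
mask[:] = [False, True, False, False]

m = ma.MaskedArray(data, mask=mask)   # both buffers stay memory-mapped
print(m.sum())                        # 8.0 -- the masked element is skipped
```

As Mark notes, this is a back door into the mask: anything else holding the mask file open can flip mask bits behind the array's back.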



 Netcdf (and assume hdf) have ways to support masks as well.

 -Chris




 --
 Christopher Barker, Ph.D.
 Oceanographer

 Emergency Response Division
 NOAA/NOS/ORR(206) 526-6959   voice
 7600 Sand Point Way NE   (206) 526-6329   fax
 Seattle, WA  98115   (206) 526-6317   main reception

 chris.bar...@noaa.gov



Re: [Numpy-discussion] missing data discussion round 2

2011-06-30 Thread Gary Strangman



  Clearly there are some overlaps between what masked arrays are
  trying to achieve and what R's NA mechanisms are trying to achieve.
   Are they really similar enough that they should function using
  the same API?

Yes.

  And if so, won't that be confusing?

No, I don't believe so, any more than NA's in R, NaN's, or Inf's are already
confusing.


As one who's been silently following (most of) this thread, and a heavy R 
and numpy user, perhaps I should chime in briefly here with a use case. I 
more-or-less always work with partially masked data, like Matthew, but not 
numpy masked arrays because the memory overhead is prohibitive. And, sad 
to say, my experiments don't always go perfectly. I therefore have arrays 
in which there is /both/ (1) data that is simply missing (np.NA?)--it 
never had a value and never will--as well as simultaneously (2) data that 
is temporarily masked (np.IGNORE? np.MASKED?) where I want to 
mask/unmask different portions for different purposes/analyses. I consider 
these two separate, completely independent issues and I unfortunately 
currently have to kluge a lot to handle this.


Concretely, consider a list of 100,000 observations (rows), with 12 
measures per observation-row (a 100,000 x 12 array). Every now and then, 
sprinkled throughout this array, I have missing values (someone didn't 
answer a question, or a computer failed to record a response, or 
whatever). For some analyses I want to mask the whole row (e.g., 
complete-case analysis), leaving me with array entries that should be 
tagged with all 4 possible labels:


1) not masked, not missing
2) masked, not missing
3) not masked, missing
4) masked, missing

Obviously #4 is overkill ... but only until I want to unmask that row. 
At that point, I need to be sure that missing values remain missing when 
unmasked. Can a single API really handle this?


-best
Gary
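
[Editorial note: Gary's four states can be sketched with today's tools. The assumptions of this illustration: NaN stands in for "missing" (permanent, lives in the data) and the np.ma mask stands in for "temporarily masked" (transient, lives beside the data):]

```python
import numpy as np
import numpy.ma as ma

# NaN plays "missing"; the mask plays "masked". All four combinations
# of (masked, missing) are then representable independently.
x = np.array([[1.0,    2.0],
              [np.nan, 4.0]])                    # (1, 0): missing for good
mx = ma.masked_array(x, mask=[[False, True],
                              [False, False]])   # (0, 1): temporarily masked

mx.mask[...] = False        # unmask everything for a different analysis
# The missing value survives unmasking, because it lives in the data:
print(np.isnan(mx.data[1, 0]))   # True
```

So the answer to "can missing values remain missing when unmasked?" is yes, provided missingness is encoded in the values rather than the mask.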


The information in this e-mail is intended only for the person to whom it is
addressed. If you believe this e-mail was sent to you in error and the e-mail
contains patient information, please contact the Partners Compliance HelpLine at
http://www.partners.org/complianceline . If the e-mail was sent to you in error
but does not contain patient information, please contact the sender and properly
dispose of the e-mail.


Re: [Numpy-discussion] missing data discussion round 2

2011-06-30 Thread Lluís
Mark Wiebe writes:

 On Wed, Jun 29, 2011 at 1:20 PM, Lluís xscr...@gmx.net wrote:
 [...]
 As far as I can tell, the only required difference between them is
 that NA bit patterns must destroy the data. Nothing else. Everything
 on top of that is a choice of API and interface mechanisms. I want
 them to behave exactly the same except for that necessary difference,
 so that it will be possible to use the *exact same Python code* with
 either approach.
   
 I completely agree. What I'd suggest is a global and/or per-object
 ndarray.flags.skipna for people like me that just want to ignore these
  entries without caring about setting it on each operation (or the other
 way around, depends on the default behaviour).
   
 The downside is that it adds yet another tweaking knob, which is not
 desirable...

 One way around this would be to create an ndarray subclass which
 changes that default. Currently this would not be possible to do
 nicely, but with the _numpy_ufunc_ idea I proposed in a separate
 thread a while back, this could work.

That does indeed sound good :)


Lluis
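
[Editorial note: the _numpy_ufunc_ idea referred to here eventually landed in NumPy 1.13 as __array_ufunc__. A minimal sketch of the subclass Mark describes -- assuming NaN as the NA sentinel and intercepting only the sum reduction, so this is an illustration rather than the proposed machinery:]

```python
import numpy as np

class SkipNAArray(np.ndarray):
    """Sketch: a subclass whose sum-reductions skip NaN ("NA") by
    default, emulating a per-array skipna=True without a ufunc kwarg."""
    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        arrs = [np.asarray(x) for x in inputs]   # drop the subclass
        if method == 'reduce' and ufunc is np.add:
            # Only the sum reduction gets skipna behaviour here.
            return np.nansum(arrs[0], axis=kwargs.get('axis'))
        return getattr(ufunc, method)(*arrs, **kwargs)

a = np.array([1.0, np.nan, 3.0]).view(SkipNAArray)
print(a.sum())        # 4.0 -- the NaN is skipped
print(a + 1)          # elementwise ops still propagate NaN
```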


-- 
 And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer.
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth


Re: [Numpy-discussion] missing data discussion round 2

2011-06-30 Thread Mark Wiebe
On Thu, Jun 30, 2011 at 11:04 AM, Gary Strangman str...@nmr.mgh.harvard.edu
 wrote:


   Clearly there are some overlaps between what masked arrays are
  trying to achieve and what Rs NA mechanisms are trying to achieve.
   Are they really similar enough that they should function using
  the same API?

 Yes.

  And if so, won't that be confusing?

 No, I don't believe so, any more than NA's in R, NaN's, or Inf's are
 already
 confusing.


 As one who's been silently following (most of) this thread, and a heavy R
 and numpy user, perhaps I should chime in briefly here with a use case. I
 more-or-less always work with partially masked data, like Matthew, but not
 numpy masked arrays because the memory overhead is prohibitive. And, sad to
 say, my experiments don't always go perfectly. I therefore have arrays in
 which there is /both/ (1) data that is simply missing (np.NA?)--it never had
 a value and never will--as well as simultaneously (2) data that that is
 temporarily masked (np.IGNORE? np.MASKED?) where I want to mask/unmask
 different portions for different purposes/analyses. I consider these two
 separate, completely independent issues and I unfortunately currently have
 to kluge a lot to handle this.

 Concretely, consider a list of 100,000 observations (rows), with 12
 measures per observation-row (a 100,000 x 12 array). Every now and then,
 sprinkled throughout this array, I have missing values (someone didn't
 answer a question, or a computer failed to record a response, or whatever).
 For some analyses I want to mask the whole row (e.g., complete-case
 analysis), leaving me with array entries that should be tagged with all 4
 possible labels:

 1) not masked, not missing
 2) masked, not missing
 3) not masked, missing
 4) masked, missing

 Obviously #4 is overkill ... but only until I want to unmask that row. At
 that point, I need to be sure that missing values remain missing when
 unmasked. Can a single API really handle this?


The single API does support a masked array with an NA dtype, and the
behavior in this case will be that the value is considered NA if either it
is masked or the value is the NA bit pattern. So you could add a mask to an
array with an NA dtype to temporarily treat the data as if more values were
missing.

One important reason I'm doing it this way is so that each NumPy algorithm
and any 3rd party code only needs to be updated once to support both forms
of missing data. The C API with masks is also a lot cleaner to work with
than one for NA dtypes with the ability to have different NA bit patterns.

-Mark
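
[Editorial note: the "NA if either masked or the NA bit pattern" rule Mark describes can be sketched with NaN standing in for the bit pattern -- an emulation, since the proposed NA dtype does not exist:]

```python
import numpy as np
import numpy.ma as ma

x = np.array([1.0, np.nan, 3.0, 4.0])            # NaN = "bit-pattern NA"
mask = np.array([False, False, True, False])     # element 2 masked on top

# Effective NA = masked OR bit-pattern, per the rule described above:
na = mask | np.isnan(x)
m = ma.MaskedArray(x, mask=na)

print(m.sum())    # 5.0 -- only elements 0 and 3 participate
```

Dropping the mask later re-exposes element 2, while element 1 stays NA: the two layers remain independent.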



 -best
 Gary







Re: [Numpy-discussion] missing data discussion round 2

2011-06-30 Thread Matthew Brett
Hi,

On Thu, Jun 30, 2011 at 5:13 PM, Mark Wiebe mwwi...@gmail.com wrote:
 On Thu, Jun 30, 2011 at 11:04 AM, Gary Strangman
 str...@nmr.mgh.harvard.edu wrote:

      Clearly there are some overlaps between what masked arrays are
      trying to achieve and what Rs NA mechanisms are trying to achieve.
       Are they really similar enough that they should function using
      the same API?

 Yes.

      And if so, won't that be confusing?

 No, I don't believe so, any more than NA's in R, NaN's, or Inf's are
 already
 confusing.

 As one who's been silently following (most of) this thread, and a heavy R
 and numpy user, perhaps I should chime in briefly here with a use case. I
 more-or-less always work with partially masked data, like Matthew, but not
 numpy masked arrays because the memory overhead is prohibitive. And, sad to
 say, my experiments don't always go perfectly. I therefore have arrays in
 which there is /both/ (1) data that is simply missing (np.NA?)--it never had
 a value and never will--as well as simultaneously (2) data that that is
 temporarily masked (np.IGNORE? np.MASKED?) where I want to mask/unmask
 different portions for different purposes/analyses. I consider these two
 separate, completely independent issues and I unfortunately currently have
 to kluge a lot to handle this.

 Concretely, consider a list of 100,000 observations (rows), with 12
 measures per observation-row (a 100,000 x 12 array). Every now and then,
 sprinkled throughout this array, I have missing values (someone didn't
 answer a question, or a computer failed to record a response, or whatever).
 For some analyses I want to mask the whole row (e.g., complete-case
 analysis), leaving me with array entries that should be tagged with all 4
 possible labels:

 1) not masked, not missing
 2) masked, not missing
 3) not masked, missing
 4) masked, missing

 Obviously #4 is overkill ... but only until I want to unmask that row.
 At that point, I need to be sure that missing values remain missing when
 unmasked. Can a single API really handle this?

 The single API does support a masked array with an NA dtype, and the
 behavior in this case will be that the value is considered NA if either it
 is masked or the value is the NA bit pattern. So you could add a mask to an
 array with an NA dtype to temporarily treat the data as if more values were
 missing.

Right - but I think the separated API is cleaner and easier to
explain.  Do you disagree?

 One important reason I'm doing it this way is so that each NumPy algorithm
 and any 3rd party code only needs to be updated once to support both forms
 of missing data.

Could you explain what you mean?  Maybe a couple of examples?

Whatever API results, it will surely be with us for a long time, and
so it would be good to make sure we have the right one even if it
costs a bit more to update current code.

Cheers,

Matthew


Re: [Numpy-discussion] missing data discussion round 2

2011-06-30 Thread Lluís
Mark Wiebe writes:
 Why is one magic and the other real? All of this is already
 sitting on 100 layers of abstraction above electrons and atoms. If
 we're talking about real, maybe we should be programming in machine
 code or using breadboards with individual transistors.

M-x butterfly RET

http://xkcd.com/378/



[Numpy-discussion] missing data: semantics

2011-06-30 Thread Lluís
Ok, I think it's time to step back and reformulate the problem by
completely ignoring the implementation.

Here we have 2 generic concepts (i.e., applicable to R), plus another
extra concept that is exclusive to numpy:

* Assigning np.NA to an array, cannot be undone unless through explicit
  assignment (i.e., assigning a new arbitrary value, or saving a copy of
  the original array before assigning np.NA).

* np.NA values propagate by default, unless ufuncs have the skipna =
  True argument (or the other way around, it doesn't really matter to
  this discussion). In order to avoid passing the argument on each
  ufunc, we either have some per-array variable for the default skipna
  value (undesirable) or we can make a trivial ndarray subclass that
  will set the skipna argument on all ufuncs through the
  _ufunc_wrapper_ mechanism.



Now, numpy has the concept of views, which adds some more goodies to the
list of concepts:

* With views, two arrays can share the same physical data, so that
  assignments to any of them will be seen by others (including NA
  values).

The creation of a view is explicitly stated by the user, so its
behaviour should not be perceived as odd (after all, you asked for a
view).

The good thing is that with views you can avoid costly array copies if
you're careful when writing into these views.



Now, you can add a new concept: local/temporal/transient missing data.

We can take an existing array and create a view with the new argument
transientna = True.

Here, both the view and the transientna = True are explicitly stated
by the user, so it is assumed that she already knows what this is all
about.

The difference with a regular view is that you also explicitly asked for
local/temporal/transient NA values.

* Assigning np.NA to an array view with transientna = True will
  *not* be seen by any of the other views (nor the original array),
  but anything else will still work as usual.

After all, this is what *you* asked for when using the transientna =
True argument.



To conclude, say that others *must not* care about whether the arrays
they're working with have transient NA values. This way, I can create a
view with transient NAs, set to NA some uninteresting data, and pass it
to a routine written by someone else that sets to NA elements that, for
example, are beyond certain threshold from the mean of the elements.

This would be equivalent to storing a copy of the original array before
passing it to this 3rd party function, only that transientna, just as
views, provide some handy shortcuts to avoid copies.


My main point here is that views and local/temporal/transient NAs are
all *explicitly* requested, so that its behaviour should not appear as
something unexpected.

Is there an agreement on this?


Lluis
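
[Editorial note: the transient-NA semantics described above can be approximated with today's np.ma. This is an emulation only -- the proposed transientna= keyword does not exist. A masked view shares the base array's buffer, and "assigning NA" touches only the view's own mask:]

```python
import numpy as np
import numpy.ma as ma

data = np.array([1.0, 2.0, 3.0, 4.0])

# Emulated "transientna" view: shares data's buffer, owns its mask.
view = ma.MaskedArray(data, mask=np.zeros(data.shape, dtype=bool))

view[1] = ma.masked     # "assign NA": only the view's mask changes
view[0] = 10.0          # ordinary assignment: visible through the base

print(data)             # [10.  2.  3.  4.] -- NA did not destroy data[1]
print(view.mean())      # (10 + 3 + 4) / 3
```

That matches the semantics asked for: NA assignments in the transient view are invisible to other views of the data, while everything else works as usual.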



Re: [Numpy-discussion] missing data: semantics

2011-06-30 Thread Matthew Brett
Hi,

On Thu, Jun 30, 2011 at 6:46 PM, Lluís xscr...@gmx.net wrote:
 Ok, I think it's time to step back and reformulate the problem by
 completely ignoring the implementation.

 Here we have 2 generic concepts (i.e., applicable to R), plus another
 extra concept that is exclusive to numpy:

 * Assigning np.NA to an array, cannot be undone unless through explicit
  assignment (i.e., assigning a new arbitrary value, or saving a copy of
  the original array before assigning np.NA).

 * np.NA values propagate by default, unless ufuncs have the skipna =
  True argument (or the other way around, it doesn't really matter to
  this discussion). In order to avoid passing the argument on each
  ufunc, we either have some per-array variable for the default skipna
  value (undesirable) or we can make a trivial ndarray subclass that
  will set the skipna argument on all ufuncs through the
  _ufunc_wrapper_ mechanism.



 Now, numpy has the concept of views, which adds some more goodies to the
 list of concepts:

 * With views, two arrays can share the same physical data, so that
  assignments to any of them will be seen by others (including NA
  values).

 The creation of a view is explicitly stated by the user, so its
 behaviour should not be perceived as odd (after all, you asked for a
 view).

 The good thing is that with views you can avoid costly array copies if
 you're careful when writing into these views.



 Now, you can add a new concept: local/temporal/transient missing data.

 We can take an existing array and create a view with the new argument
 transientna = True.

 Here, both the view and the transientna = True are explicitly stated
 by the user, so it is assumed that she already knows what this is all
 about.

 The difference with a regular view is that you also explicitly asked for
 local/temporal/transient NA values.

 * Assigning np.NA to an array view with transientna = True will
  *not* be seen by any of the other views (nor the original array),
  but anything else will still work as usual.

 After all, this is what *you* asked for when using the transientna =
 True argument.



 To conclude, say that others *must not* care about whether the arrays
 they're working with have transient NA values. This way, I can create a
 view with transient NAs, set to NA some uninteresting data, and pass it
 to a routine written by someone else that sets to NA elements that, for
 example, are beyond certain threshold from the mean of the elements.

 This would be equivalent to storing a copy of the original array before
 passing it to this 3rd party function, only that transientna, just as
 views, provide some handy shortcuts to avoid copies.


 My main point here is that views and local/temporal/transient NAs are
 all *explicitly* requested, so that its behaviour should not appear as
 something unexpected.

 Is there an agreement on this?

Absolutely, if by 'transientna' you mean 'masked'.  The discussion is
whether the NA API should be the same as the masking API.   The thing
you are describing is what masking is for, and what it's always been
for, as far as I can see.   We're arguing that to call this
'transientna' instead of 'masked' confuses two concepts that are
different, to no good purpose.

Best,

Matthew


Re: [Numpy-discussion] missing data: semantics

2011-06-30 Thread Charles R Harris
On Thu, Jun 30, 2011 at 11:46 AM, Lluís xscr...@gmx.net wrote:

 Ok, I think it's time to step back and reformulate the problem by
 completely ignoring the implementation.

 Here we have 2 generic concepts (i.e., applicable to R), plus another
 extra concept that is exclusive to numpy:

 * Assigning np.NA to an array, cannot be undone unless through explicit
  assignment (i.e., assigning a new arbitrary value, or saving a copy of
  the original array before assigning np.NA).

 * np.NA values propagate by default, unless ufuncs have the skipna =
  True argument (or the other way around, it doesn't really matter to
  this discussion). In order to avoid passing the argument on each
  ufunc, we either have some per-array variable for the default skipna
  value (undesirable) or we can make a trivial ndarray subclass that
  will set the skipna argument on all ufuncs through the
  _ufunc_wrapper_ mechanism.



 Now, numpy has the concept of views, which adds some more goodies to the
 list of concepts:

 * With views, two arrays can share the same physical data, so that
  assignments to any of them will be seen by others (including NA
  values).

 The creation of a view is explicitly stated by the user, so its
 behaviour should not be perceived as odd (after all, you asked for a
 view).

 The good thing is that with views you can avoid costly array copies if
 you're careful when writing into these views.



 Now, you can add a new concept: local/temporal/transient missing data.

 We can take an existing array and create a view with the new argument
 transientna = True.


This is already there: x.view(masked=1), although the keyword transientna
has appeal, not least because it avoids the word 'mask', which seems a
source of endless confusion. Note that currently this is only supposed to
work if the original array is unmasked.

Here, both the view and the transientna = True are explicitly stated
 by the user, so it is assumed that she already knows what this is all
 about.

 The difference with a regular view is that you also explicitly asked for
 local/temporal/transient NA values.

 * Assigning np.NA to an array view with transientna = True will
  *not* be seen by any of the other views (nor the original array),
  but anything else will still work as usual.

 After all, this is what *you* asked for when using the transientna =
 True argument.



 To conclude, say that others *must not* care about whether the arrays
 they're working with have transient NA values. This way, I can create a
 view with transient NAs, set to NA some uninteresting data, and pass it
 to a routine written by someone else that sets to NA elements that, for
 example, are beyond certain threshold from the mean of the elements.

 This would be equivalent to storing a copy of the original array before
 passing it to this 3rd party function, only that transientna, just as
 views, provide some handy shortcuts to avoid copies.


 My main point here is that views and local/temporal/transient NAs are
 all *explicitly* requested, so that its behaviour should not appear as
 something unexpected.

 Is there an agreement on this?


Chuck


Re: [Numpy-discussion] missing data: semantics

2011-06-30 Thread Charles R Harris
On Thu, Jun 30, 2011 at 11:51 AM, Matthew Brett matthew.br...@gmail.comwrote:

 Hi,

 On Thu, Jun 30, 2011 at 6:46 PM, Lluís xscr...@gmx.net wrote:
  Ok, I think it's time to step back and reformulate the problem by
  completely ignoring the implementation.
 
  Here we have 2 generic concepts (i.e., applicable to R), plus another
  extra concept that is exclusive to numpy:
 
  * Assigning np.NA to an array, cannot be undone unless through explicit
   assignment (i.e., assigning a new arbitrary value, or saving a copy of
   the original array before assigning np.NA).
 
  * np.NA values propagate by default, unless ufuncs have the skipna =
   True argument (or the other way around, it doesn't really matter to
   this discussion). In order to avoid passing the argument on each
   ufunc, we either have some per-array variable for the default skipna
   value (undesirable) or we can make a trivial ndarray subclass that
   will set the skipna argument on all ufuncs through the
   _ufunc_wrapper_ mechanism.
 
 
 
  Now, numpy has the concept of views, which adds some more goodies to the
  list of concepts:
 
  * With views, two arrays can share the same physical data, so that
   assignments to any of them will be seen by others (including NA
   values).
 
  The creation of a view is explicitly stated by the user, so its
  behaviour should not be perceived as odd (after all, you asked for a
  view).
 
  The good thing is that with views you can avoid costly array copies if
  you're careful when writing into these views.
 
 
 
  Now, you can add a new concept: local/temporal/transient missing data.
 
  We can take an existing array and create a view with the new argument
  transientna = True.
 
  Here, both the view and the transientna = True are explicitly stated
  by the user, so it is assumed that she already knows what this is all
  about.
 
  The difference with a regular view is that you also explicitly asked for
  local/temporal/transient NA values.
 
  * Assigning np.NA to an array view with transientna = True will
   *not* be seen by any of the other views (nor the original array),
   but anything else will still work as usual.
 
  After all, this is what *you* asked for when using the transientna =
  True argument.
 
 
 
  To conclude, say that others *must not* care about whether the arrays
  they're working with have transient NA values. This way, I can create a
  view with transient NAs, set to NA some uninteresting data, and pass it
  to a routine written by someone else that sets to NA elements that, for
  example, are beyond certain threshold from the mean of the elements.
 
  This would be equivalent to storing a copy of the original array before
  passing it to this 3rd party function, only that transientna, just as
  views, provide some handy shortcuts to avoid copies.
 
 
  My main point here is that views and local/temporal/transient NAs are
  all *explicitly* requested, so that its behaviour should not appear as
  something unexpected.
 
  Is there an agreement on this?

 Absolutely, if by 'transientna' you mean 'masked'.  The discussion is
 whether the NA API should be the same as the masking API.   The thing
 you are describing is what masking is for, and what it's always been
 for, as far as I can see.   We're arguing that to call this
 'transientna' instead of 'masked' confuses two concepts that are
 different, to no good purpose.


It's a hammer. If you want to hammer nails, fine; if you want to hammer a bit
of tubing flat, fine. It's a tool, the hammer concept if you will.

Chuck


Re: [Numpy-discussion] missing data discussion round 2

2011-06-30 Thread Mark Wiebe
On Thu, Jun 30, 2011 at 11:42 AM, Matthew Brett matthew.br...@gmail.comwrote:

 Hi,

 On Thu, Jun 30, 2011 at 5:13 PM, Mark Wiebe mwwi...@gmail.com wrote:
  On Thu, Jun 30, 2011 at 11:04 AM, Gary Strangman
  str...@nmr.mgh.harvard.edu wrote:
 
   Clearly there are some overlaps between what masked arrays are
   trying to achieve and what Rs NA mechanisms are trying to achieve.
Are they really similar enough that they should function using
   the same API?
 
  Yes.
 
   And if so, won't that be confusing?
 
  No, I don't believe so, any more than NA's in R, NaN's, or Inf's are
  already
  confusing.
 
  As one who's been silently following (most of) this thread, and a heavy
 R
  and numpy user, perhaps I should chime in briefly here with a use case.
 I
  more-or-less always work with partially masked data, like Matthew, but
 not
  numpy masked arrays because the memory overhead is prohibitive. And, sad
 to
  say, my experiments don't always go perfectly. I therefore have arrays
 in
  which there is /both/ (1) data that is simply missing (np.NA?)--it never
 had
  a value and never will--as well as simultaneously (2) data that that is
  temporarily masked (np.IGNORE? np.MASKED?) where I want to mask/unmask
  different portions for different purposes/analyses. I consider these two
  separate, completely independent issues and I unfortunately currently
 have
  to kluge a lot to handle this.
 
  Concretely, consider a list of 100,000 observations (rows), with 12
  measures per observation-row (a 100,000 x 12 array). Every now and then,
  sprinkled throughout this array, I have missing values (someone didn't
  answer a question, or a computer failed to record a response, or
 whatever).
  For some analyses I want to mask the whole row (e.g., complete-case
  analysis), leaving me with array entries that should be tagged with all
 4
  possible labels:
 
  1) not masked, not missing
  2) masked, not missing
  3) not masked, missing
  4) masked, missing
 
  Obviously #4 is overkill ... but only until I want to unmask that row.
  At that point, I need to be sure that missing values remain missing when
  unmasked. Can a single API really handle this?
 
  The single API does support a masked array with an NA dtype, and the
  behavior in this case will be that the value is considered NA if either
 it
  is masked or the value is the NA bit pattern. So you could add a mask to
 an
  array with an NA dtype to temporarily treat the data as if more values
 were
  missing.

 Right - but I think the separated API is cleaner and easier to
 explain.  Do you disagree?


Kind of, yeah. I think the important things to understand from the Python
perspective are that there are two ways of doing missing values with NA that
look exactly the same except for how you create the arrays. Since you know
that the mask way takes more memory, and that's important for your
application, you can decide to use the NA dtype without any additional
depth.

Understanding that one of them has a special signal for NA while the other
uses masks in the background probably isn't even that important to
understand to be able to use it. I bet lots of people who use R regularly
couldn't come up with a correct explanation of how it works there.

If someone doesn't understand masks, they can use their intuition based on
the special signal idea without any difficulty. The idea that you can
temporarily make some values NA without overwriting your data may not be
intuitive at first glance, but I expect people will find it useful even if
they don't fully understand the subtle details of the masking mechanism.

 One important reason I'm doing it this way is so that each NumPy algorithm
  and any 3rd party code only needs to be updated once to support both
 forms
  of missing data.

 Could you explain what you mean?  Maybe a couple of examples?


Yeah, I've started adding some implementation notes to the NEP. First I need
volunteers to review my current pull requests though. ;)

-Mark



 Whatever API results, it will surely be with us for a long time, and
 so it would be good to make sure we have the right one even if it
 costs a bit more to update current code.

 Cheers,

 Matthew
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion



Re: [Numpy-discussion] missing data discussion round 2

2011-06-30 Thread Mark Wiebe
On Thu, Jun 30, 2011 at 11:54 AM, Lluís xscr...@gmx.net wrote:

 Mark Wiebe writes:
  Why is one magic and the other real? All of this is already
  sitting on 100 layers of abstraction above electrons and atoms. If
  we're talking about real, maybe we should be programming in machine
  code or using breadboards with individual transistors.

 M-x butterfly RET

 http://xkcd.com/378/


Ok, I've run this, how long does it take to execute?

-Mark




 --
  And it's much the same thing with knowledge, for whenever you learn
  something new, the whole world becomes that much richer.
  -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
  Tollbooth


Re: [Numpy-discussion] missing data discussion round 2

2011-06-30 Thread Nathaniel Smith
On Wed, Jun 29, 2011 at 2:21 PM, Eric Firing efir...@hawaii.edu wrote:
 In addition, for new code, the full-blown masked array module may not be
 needed.  A convenience it adds, however, is the automatic masking of
 invalid values:

 In [1]: np.ma.log(-1)
 Out[1]: masked

 I'm sure this horrifies some, but there are times and places where it is
 a genuine convenience, and preferable to having to use a separate
 operation to replace nan or inf with NA or whatever it ends up being.

Err, but what would this even get you? NA, NaN, and Inf basically all
behave the same WRT floating point operations anyway, i.e., they all
propagate?
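For concreteness, the propagation Nathaniel refers to looks like this in
plain NumPy, with no NA machinery involved:

```python
import numpy as np

# NaN poisons every arithmetic operation it touches; Inf propagates
# through most operations (though inf * 0 degrades to nan).
x = np.array([1.0, np.nan, np.inf])

print(x + 1)       # 2.0, nan, inf
print(x * 0)       # 0.0, nan, nan (inf * 0 is nan)
print(np.sum(x))   # nan -- one NaN poisons the whole reduction
```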

Is the idea that if ufunc's gain a skipna=True flag, you'd also like
to be able to turn it into a skipna_and_nan_and_inf=True flag?

-- Nathaniel


Re: [Numpy-discussion] missing data discussion round 2

2011-06-30 Thread Eric Firing
On 06/30/2011 08:53 AM, Nathaniel Smith wrote:
 On Wed, Jun 29, 2011 at 2:21 PM, Eric Firingefir...@hawaii.edu  wrote:
 In addition, for new code, the full-blown masked array module may not be
 needed.  A convenience it adds, however, is the automatic masking of
 invalid values:

 In [1]: np.ma.log(-1)
 Out[1]: masked

 I'm sure this horrifies some, but there are times and places where it is
 a genuine convenience, and preferable to having to use a separate
 operation to replace nan or inf with NA or whatever it ends up being.

 Err, but what would this even get you? NA, NaN, and Inf basically all
 behave the same WRT floating point operations anyway, i.e., they all
 propagate?

Not exactly. First, it depends on np.seterr; second, calculations on NaN 
can be very slow, so are better avoided entirely; third, if an array is 
passed to extension code, it is much nicer if that code only has one NA 
value to handle, instead of having to check for all possible bad values.
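Eric's first point, that the outcome of an invalid operation depends on
np.seterr, can be seen directly (a small sketch with plain NumPy):

```python
import numpy as np

# The same invalid operation either yields nan or raises,
# depending on the np.seterr state.
old = np.seterr(invalid='warn')
val = np.log(np.array(-1.0))      # RuntimeWarning; result is nan

np.seterr(invalid='raise')
raised = False
try:
    np.log(np.array(-1.0))
except FloatingPointError:
    raised = True                 # now the same operation raises

np.seterr(**old)                  # restore the previous error state
print(val, raised)
```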


 Is the idea that if ufunc's gain a skipna=True flag, you'd also like
 to be able to turn it into a skipna_and_nan_and_inf=True flag?

No, it is to have a situation where skipna_and_nan_and_inf would not be 
needed, because an operation generating a nan or inf would turn those 
values into NA or IGNORE or whatever right away.
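This is essentially what np.ma already does for domain errors today: the
invalid result is masked at the moment it is produced, so no bare nan or
-inf escapes into later computations (sketch with the current np.ma module):

```python
import numpy as np

# np.ma.log masks out-of-domain inputs (x <= 0) immediately, so the
# result carries masked entries rather than nan / -inf values.
x = np.array([-1.0, 0.0, 1.0, 2.0])
result = np.ma.log(x)

print(result.mask)    # True, True, False, False
print(result.sum())   # log(2) ~= 0.693 -- masked entries are skipped
```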

Eric


 -- Nathaniel


Re: [Numpy-discussion] missing data discussion round 2

2011-06-30 Thread Nathaniel Smith
On Thu, Jun 30, 2011 at 12:27 PM, Eric Firing efir...@hawaii.edu wrote:
 On 06/30/2011 08:53 AM, Nathaniel Smith wrote:
 On Wed, Jun 29, 2011 at 2:21 PM, Eric Firingefir...@hawaii.edu  wrote:
 In addition, for new code, the full-blown masked array module may not be
 needed.  A convenience it adds, however, is the automatic masking of
 invalid values:

 In [1]: np.ma.log(-1)
 Out[1]: masked

 I'm sure this horrifies some, but there are times and places where it is
 a genuine convenience, and preferable to having to use a separate
 operation to replace nan or inf with NA or whatever it ends up being.

 Err, but what would this even get you? NA, NaN, and Inf basically all
 behave the same WRT floating point operations anyway, i.e., they all
 propagate?

 Not exactly. First, it depends on np.seterr;

IIUC, you're proposing to make this conversion depend on np.seterr
too, though, right?

 second, calculations on NaN
 can be very slow, so are better avoided entirely

They're slow because inside the processor they require a branch and a
separate code path (which doesn't get a lot of transistors allocated
to it). In any of the NA proposals we're talking about, handling an NA
would require a software branch and a separate code path (which is in
ordinary software, now, so it doesn't get any special transistors
allocated to it...). I don't think masking support is likely to give
you a speedup over the processor's NaN handling.

And if it did, that would mean that we speed up FP operations in
general by checking for NaN in software, so then we should do that
everywhere anyway instead of making it an NA-specific feature...

 third, if an array is
 passed to extension code, it is much nicer if that code only has one NA
 value to handle, instead of having to check for all possible bad values.

I'm pretty sure that Mark's proposal does not work this way -- he's
saying that the NA-checking code in numpy could optionally check for
all these different bad values and handle them the same in ufuncs,
not that we would check the outputs of all FP operations for bad
values and then replace them by NA. So your extension code would still
have the same problem. Sorry :-(

-- Nathaniel


Re: [Numpy-discussion] missing data discussion round 2

2011-06-29 Thread Dag Sverre Seljebotn
On 06/28/2011 11:52 PM, Matthew Brett wrote:
 Hi,

 On Tue, Jun 28, 2011 at 5:38 PM, Charles R Harris
 charlesr.har...@gmail.com  wrote:
 Nathaniel, an implementation using masks will look *exactly* like an
 implementation using na-dtypes from the user's point of view. Except that
 taking a masked view of an unmasked array allows ignoring values without
 destroying or copying the original data. The only downside I can see to an
 implementation using masks is memory and disk storage, and perhaps memory
 mapped arrays. And I rather expect the former to solve itself in a few
 years, eight gigs is becoming a baseline for workstations and in a couple of
 years I expect that to be up around 16-32, and a few years after that In
 any case we are talking 12% - 25% overhead, and in practice I expect it
 won't be quite as big a problem as folks project.

 Or, in the case of 16 bit integers, 50% memory overhead.

 I honestly find it hard to believe that I will not care about memory
 use in the near future, and I don't think it's wise to make decisions
 on that assumption.

In many sciences, waiting for the future makes things worse, not better, 
simply because the amount of available data easily grows at a faster 
rate than the amount of memory you can get per dollar :-)
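For reference, the overhead figures in this exchange (Charles's 12% - 25%,
Matthew's 50% for 16-bit integers) all follow directly from the mask costing
one byte per element:

```python
# A one-byte-per-element mask adds 100/itemsize percent of memory.
for name, itemsize in [("float64", 8), ("float32", 4), ("int16", 2)]:
    overhead = 100.0 / itemsize
    print(f"{name}: {overhead:.1f}% extra memory for a byte mask")
# float64: 12.5% extra memory for a byte mask
# float32: 25.0% extra memory for a byte mask
# int16: 50.0% extra memory for a byte mask
```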

Dag Sverre


Re: [Numpy-discussion] missing data discussion round 2

2011-06-29 Thread Dag Sverre Seljebotn
On 06/27/2011 05:55 PM, Mark Wiebe wrote:
 First I'd like to thank everyone for all the feedback you're providing,
 clearly this is an important topic to many people, and the discussion
 has helped clarify the ideas for me. I've renamed and updated the NEP,
 then placed it into the master NumPy repository so it has a more
 permanent home here:

 https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst

One thing to think about is the presence of SSE/AVX instructions, which 
has the potential to change some of the memory/speed trade-offs here.

In the newest Intel-platform CPUs you can do 256-bit operations, 
translating to a theoretical factor 8 speedup for in-cache single 
precision data, and the instruction set is constructed for future 
expansion possibilities to 512 or 1024 bit registers.

I feel one should take care to not design oneself into a corner where 
this can't (eventually) be leveraged.

1) The shuffle instructions takes a single byte as a control character 
for moving around data in different ways in 128-bit registers. One could 
probably implement fast IGNORE-style NA with a separate mask using 1 
byte per 16 bytes of data (with 4 or 8-byte elements). OTOH, I'm not 
sure if 1 byte per element kind of mask would be that fast (but I don't 
know much about this and haven't looked at the details).

2) The alternative Parameterized Data Type Which Adds Additional Memory 
for the NA Flag would mean that contiguous arrays with NA's/IGNORE's 
would not be subject to vector instructions, or create a mess of copying 
in and out prior to operating on the data. This really seems like the 
worst of all possibilities to me.

(FWIW, my vote is in favour of both NA-using-NaN and 
IGNORE-using-explicit-masks, and keep the two as entirely separate 
worlds to avoid confusion.)

Dag Sverre



Re: [Numpy-discussion] missing data discussion round 2

2011-06-29 Thread Lluís
Matthew Brett writes:

 Maybe instead of np.NA, we could say np.IGNORE, which sort of conveys
 the idea that the entry is still there, but we're just ignoring it.  Of
 course, that goes against common convention, but it might be easier to
 explain.

 I think Nathaniel's point is that np.IGNORE is a different idea than
 np.NA, and that is why joining the implementations can lead to
 conceptual confusion.

This is how I see it:

 a = np.array([0, 1, 2], dtype=int)
 a[0] = np.NA
ValueError
 e = np.array([np.NA, 1, 2], dtype=int)
ValueError
 b  = np.array([np.NA, 1, 2], dtype=np.maybe(int))
 m  = np.array([np.NA, 1, 2], dtype=int, masked=True)
 bm = np.array([np.NA, 1, 2], dtype=np.maybe(int), masked=True)
 b[1] = np.NA
 np.sum(b)
np.NA
 np.sum(b, skipna=True)
2
 b.mask
None
 m[1] = np.NA
 np.sum(m)
2
 np.sum(m, skipna=True)
2
 m.mask
[False, False, True]
 bm[1] = np.NA
 np.sum(bm)
2
 np.sum(bm, skipna=True)
2
 bm.mask
[False, False, True]

So:

* Mask takes precedence over bit pattern on element assignment. There's
  still the question of how to assign a bit pattern NA when the mask is
  active.

* When using mask, elements are automagically skipped.

* m[1] = np.NA is equivalent to m.mask[1] = False

* When using bit pattern + mask, it might make sense to have the initial
  values as bit-pattern NAs, instead of masked (i.e., bm.mask == [True,
  False, True] and np.sum(bm) == np.NA)
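The sketch above uses proposal-only names (np.maybe, masked=, skipna=) that
do not exist in NumPy; the two flavors can be partially emulated with what
ships today, NaN standing in for a propagating bit-pattern NA and np.ma for
the automatically-skipped mask:

```python
import numpy as np

# NaN ~ destructive bit-pattern NA that propagates by default:
b = np.array([np.nan, 1.0, 2.0])
print(np.sum(b))       # nan -- the "NA" propagates
print(np.nansum(b))    # 3.0 -- the skipna=True analogue

# np.ma ~ non-destructive mask whose entries are skipped:
m = np.ma.array([0.0, 1.0, 2.0], mask=[True, False, False])
print(np.sum(m))       # 3.0
```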


Lluis



Re: [Numpy-discussion] missing data discussion round 2

2011-06-29 Thread Matthew Brett
Hi,

On Wed, Jun 29, 2011 at 12:39 AM, Mark Wiebe mwwi...@gmail.com wrote:
 On Tue, Jun 28, 2011 at 5:20 PM, Matthew Brett matthew.br...@gmail.com
 wrote:

 Hi,

 On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smith n...@pobox.com wrote:
 ...
  (You might think, what difference does it make if you *can* unmask an
  item? Us missing data folks could just ignore this feature. But:
  whatever we end up implementing is something that I will have to
  explain over and over to different people, most of them not
  particularly sophisticated programmers. And there's just no sensible
  way to explain this idea that if you store some particular value, then
  it replaces the old value, but if you store NA, then the old value is
  still there.

 Ouch - yes.  No question, that is difficult to explain.   Well, I
 think the explanation might go like this:

 Ah, yes, well, that's because in fact numpy records missing values by
 using a 'mask'.   So when you say `a[3] = np.NA', what you mean is,
'a._mask = np.ones(a.shape, np.dtype(bool)); a._mask[3] = False`

 Is that fair?

 My favorite way of explaining it would be to have a grid of numbers written
 on paper, then have several cardboards with holes poked in them in different
 configurations. Placing these cardboard masks in front of the grid would
 show different sets of non-missing data, without affecting the values stored
 on the paper behind them.

Right - but here of course you are trying to explain the mask, and
this is Nathaniel's point, that in order to explain NAs, you have to
explain masks, and so, even at a basic level, the fusion of the two
ideas is obvious, and already confusing.  I mean this:

a[3] = np.NA

Oh, so you just set the a[3] value to have some missing value code?

Ah - no - in fact what I did was set an associated mask in position
a[3] so that you can't any longer see the previous value of a[3]

Huh.  You mean I have a mask for every single value in order to be
able to blank out a[3]?  It looks like an assignment.  I mean, it
looks just like a[3] = 4.  But I guess it isn't?

Er...

I think Nathaniel's point is a very good one - these are separate
ideas, np.NA and np.IGNORE, and a joint implementation is bound to
draw them together in the mind of the user.  Apart from anything
else, the user has to know that, if they want a single NA value in an
array, they have to add a mask of size array.shape in bytes.  They have
to know then, that NA is implemented by masking, and then the 'NA for
free by adding masking' idea breaks down and starts to feel like a
kludge.

The counter argument is of course that, in time, the implementation of
NA with masking will seem as obvious and intuitive, as, say,
broadcasting, and that we are just reacting from lack of experience
with the new API.

Of course, that does happen, but here, unless I am mistaken, the
primary drive to fuse NA and masking is because of ease of
implementation.   That doesn't necessarily mean that they don't go
together - if something is easy to implement, sometimes it means it
will also feel natural in use, but at least we might say that there is
some risk of the implementation driving the API, and that that can
lead to problems.

See you,

Matthew


Re: [Numpy-discussion] missing data discussion round 2

2011-06-29 Thread Dag Sverre Seljebotn
On 06/29/2011 03:45 PM, Matthew Brett wrote:
 Hi,

 On Wed, Jun 29, 2011 at 12:39 AM, Mark Wiebe mwwi...@gmail.com  wrote:
 On Tue, Jun 28, 2011 at 5:20 PM, Matthew Brett matthew.br...@gmail.com
 wrote:

 Hi,

 On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smith n...@pobox.com  wrote:
 ...
 (You might think, what difference does it make if you *can* unmask an
 item? Us missing data folks could just ignore this feature. But:
 whatever we end up implementing is something that I will have to
 explain over and over to different people, most of them not
 particularly sophisticated programmers. And there's just no sensible
 way to explain this idea that if you store some particular value, then
 it replaces the old value, but if you store NA, then the old value is
 still there.

 Ouch - yes.  No question, that is difficult to explain.   Well, I
 think the explanation might go like this:

 Ah, yes, well, that's because in fact numpy records missing values by
 using a 'mask'.   So when you say `a[3] = np.NA', what you mean is,
 'a._mask = np.ones(a.shape, np.dtype(bool)); a._mask[3] = False`

 Is that fair?

 My favorite way of explaining it would be to have a grid of numbers written
 on paper, then have several cardboards with holes poked in them in different
 configurations. Placing these cardboard masks in front of the grid would
 show different sets of non-missing data, without affecting the values stored
 on the paper behind them.

 Right - but here of course you are trying to explain the mask, and
 this is Nathaniel's point, that in order to explain NAs, you have to
 explain masks, and so, even at a basic level, the fusion of the two
 ideas is obvious, and already confusing.  I mean this:

 a[3] = np.NA

 Oh, so you just set the a[3] value to have some missing value code?

 Ah - no - in fact what I did was set an associated mask in position
 a[3] so that you can't any longer see the previous value of a[3]

 Huh.  You mean I have a mask for every single value in order to be
 able to blank out a[3]?  It looks like an assignment.  I mean, it
 looks just like a[3] = 4.  But I guess it isn't?

 Er...

 I think Nathaniel's point is a very good one - these are separate
 ideas, np.NA and np.IGNORE, and a joint implementation is bound to
 draw them together in the mind of the user.  Apart from anything
 else, the user has to know that, if they want a single NA value in an
 array, they have to add a mask size array.shape in bytes.  They have
 to know then, that NA is implemented by masking, and then the 'NA for
 free by adding masking' idea breaks down and starts to feel like a
 kludge.

 The counter argument is of course that, in time, the implementation of
 NA with masking will seem as obvious and intuitive, as, say,
 broadcasting, and that we are just reacting from lack of experience
 with the new API.

However, no matter how used we get to this, people coming from almost 
any other tool (in particular R) will keep thinking it is 
counter-intuitive. Why set up a major semantic incompatibility that 
people then have to overcome in order to start using NumPy?

I really don't see what's wrong with some more explicit API like 
a.mask[3] = True. Explicit is better than implicit.
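The explicit style Dag advocates already works with today's np.ma (a sketch;
shrink=False just keeps the all-False mask as a real boolean array so that
individual entries can be flipped in place):

```python
import numpy as np

# Explicit, in-place mask manipulation with current masked arrays.
a = np.ma.array([10.0, 20.0, 30.0, 40.0],
                mask=[False] * 4, shrink=False)

a.mask[3] = True       # explicitly hide a value...
print(a.sum())         # 60.0 -- the hidden 40.0 is skipped
a.mask[3] = False      # ...and explicitly bring it back
print(a[3])            # 40.0 -- the data was never overwritten
```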

Dag Sverre


Re: [Numpy-discussion] missing data discussion round 2

2011-06-29 Thread Pierre GM
Matthew, Dag, +1.
On Jun 29, 2011 4:35 PM, Dag Sverre Seljebotn d.s.seljeb...@astro.uio.no
wrote:
 On 06/29/2011 03:45 PM, Matthew Brett wrote:
 Hi,

  On Wed, Jun 29, 2011 at 12:39 AM, Mark Wiebe mwwi...@gmail.com wrote:
  On Tue, Jun 28, 2011 at 5:20 PM, Matthew Brett matthew.br...@gmail.com
 wrote:

 Hi,

  On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smith n...@pobox.com wrote:
 ...
 (You might think, what difference does it make if you *can* unmask an
 item? Us missing data folks could just ignore this feature. But:
 whatever we end up implementing is something that I will have to
 explain over and over to different people, most of them not
 particularly sophisticated programmers. And there's just no sensible
 way to explain this idea that if you store some particular value, then
 it replaces the old value, but if you store NA, then the old value is
 still there.

 Ouch - yes. No question, that is difficult to explain. Well, I
 think the explanation might go like this:

 Ah, yes, well, that's because in fact numpy records missing values by
 using a 'mask'. So when you say `a[3] = np.NA', what you mean is,
  'a._mask = np.ones(a.shape, np.dtype(bool)); a._mask[3] = False`

 Is that fair?

 My favorite way of explaining it would be to have a grid of numbers
written
 on paper, then have several cardboards with holes poked in them in
different
 configurations. Placing these cardboard masks in front of the grid would
 show different sets of non-missing data, without affecting the values
stored
 on the paper behind them.

 Right - but here of course you are trying to explain the mask, and
 this is Nathaniel's point, that in order to explain NAs, you have to
 explain masks, and so, even at a basic level, the fusion of the two
 ideas is obvious, and already confusing. I mean this:

 a[3] = np.NA

 Oh, so you just set the a[3] value to have some missing value code?

  Ah - no - in fact what I did was set an associated mask in position
 a[3] so that you can't any longer see the previous value of a[3]

 Huh. You mean I have a mask for every single value in order to be
 able to blank out a[3]? It looks like an assignment. I mean, it
 looks just like a[3] = 4. But I guess it isn't?

 Er...

 I think Nathaniel's point is a very good one - these are separate
 ideas, np.NA and np.IGNORE, and a joint implementation is bound to
 draw them together in the mind of the user. Apart from anything
 else, the user has to know that, if they want a single NA value in an
 array, they have to add a mask size array.shape in bytes. They have
 to know then, that NA is implemented by masking, and then the 'NA for
 free by adding masking' idea breaks down and starts to feel like a
 kludge.

 The counter argument is of course that, in time, the implementation of
 NA with masking will seem as obvious and intuitive, as, say,
 broadcasting, and that we are just reacting from lack of experience
 with the new API.

  However, no matter how used we get to this, people coming from almost
  any other tool (in particular R) will keep thinking it is
  counter-intuitive. Why set up a major semantic incompatibility that
  people then have to overcome in order to start using NumPy?

 I really don't see what's wrong with some more explicit API like
 a.mask[3] = True. Explicit is better than implicit.

 Dag Sverre


Re: [Numpy-discussion] missing data discussion round 2

2011-06-29 Thread Mark Wiebe
On Wed, Jun 29, 2011 at 2:26 AM, Dag Sverre Seljebotn 
d.s.seljeb...@astro.uio.no wrote:

 On 06/27/2011 05:55 PM, Mark Wiebe wrote:
  First I'd like to thank everyone for all the feedback you're providing,
  clearly this is an important topic to many people, and the discussion
  has helped clarify the ideas for me. I've renamed and updated the NEP,
  then placed it into the master NumPy repository so it has a more
  permanent home here:
 
  https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst

 One thing to think about is the presence of SSE/AVX instructions, which
 has the potential to change some of the memory/speed trade-offs here.

 In the newest Intel-platform CPUs you can do 256-bit operations,
 translating to a theoretical factor 8 speedup for in-cache single
 precision data, and the instruction set is constructed for future
  expansion possibilities to 512 or 1024 bit registers.


The ufuncs themselves need a good bit of refactoring to be able to use these
kinds of instructions well. I'm definitely thinking about this kind of thing
while designing/implementing.


 I feel one should take care to not design oneself into a corner where
 this can't (eventually) be leveraged.

 1) The shuffle instructions takes a single byte as a control character
 for moving around data in different ways in 128-bit registers. One could
  probably implement fast IGNORE-style NA with a separate mask using 1
 byte per 16 bytes of data (with 4 or 8-byte elements). OTOH, I'm not
 sure if 1 byte per element kind of mask would be that fast (but I don't
 know much about this and haven't looked at the details).


This level of optimization, while important, is often dwarfed by the effects
of cache. Because of the complexity of the system demanded by the
functionality, I'm trying to favor simplicity and generality without
precluding high performance.

2) The alternative Parameterized Data Type Which Adds Additional Memory
 for the NA Flag would mean that contiguous arrays with NA's/IGNORE's
 would not be subject to vector instructions, or create a mess of copying
 in and out prior to operating on the data. This really seems like the
  worst of all possibilities to me.


This one was suggested on the list, so I added it.

-Mark


 (FWIW, my vote is in favour of both NA-using-NaN and
  IGNORE-using-explicit-masks, and keep the two as entirely separate
 worlds to avoid confusion.)


 Dag Sverre



Re: [Numpy-discussion] missing data discussion round 2

2011-06-29 Thread Mark Wiebe
On Wed, Jun 29, 2011 at 8:20 AM, Lluís xscr...@gmx.net wrote:

 Matthew Brett writes:

  Maybe instead of np.NA, we could say np.IGNORE, which sort of conveys
  the idea that the entry is still there, but we're just ignoring it.  Of
  course, that goes against common convention, but it might be easier to
  explain.

  I think Nathaniel's point is that np.IGNORE is a different idea than
  np.NA, and that is why joining the implementations can lead to
  conceptual confusion.

 This is how I see it:

  a = np.array([0, 1, 2], dtype=int)
  a[0] = np.NA
 ValueError
  e = np.array([np.NA, 1, 2], dtype=int)
 ValueError
  b  = np.array([np.NA, 1, 2], dtype=np.maybe(int))
  m  = np.array([np.NA, 1, 2], dtype=int, masked=True)
  bm = np.array([np.NA, 1, 2], dtype=np.maybe(int), masked=True)
  b[1] = np.NA
  np.sum(b)
 np.NA
  np.sum(b, skipna=True)
 2
  b.mask
 None
  m[1] = np.NA
  np.sum(m)
 2
  np.sum(m, skipna=True)
 2
  m.mask
 [False, False, True]
  bm[1] = np.NA
  np.sum(bm)
 2
  np.sum(bm, skipna=True)
 2
  bm.mask
 [False, False, True]

 So:

 * Mask takes precedence over bit pattern on element assignment. There's
  still the question of how to assign a bit pattern NA when the mask is
  active.

 * When using mask, elements are automagically skipped.

 * m[1] = np.NA is equivalent to m.mask[1] = False

 * When using bit pattern + mask, it might make sense to have the initial
  values as bit-pattern NAs, instead of masked (i.e., bm.mask == [True,
  False, True] and np.sum(bm) == np.NA)


There seems to be a general idea that masks and NA bit patterns imply
particular differing semantics, something which I think is simply false.
Both NaN and Inf are implemented in hardware with the same idea as the NA
bit pattern, but they do not follow NA missing value semantics.

As far as I can tell, the only required difference between them is that NA
bit patterns must destroy the data. Nothing else. Everything on top of that
is a choice of API and interface mechanisms. I want them to behave exactly
the same except for that necessary difference, so that it will be possible
to use the *exact same Python code* with either approach.

Say you're using NA dtypes, and suddenly you think, what if I temporarily
treated these as NA too. Now you have to copy your whole array to avoid
destroying your data! The NA bit pattern didn't save you memory here... Say
you're using masks, and it turns out you didn't actually need masking
semantics. If they're different, you now have to do lots of code changes to
switch to NA dtypes!

-Mark




 Lluis



Re: [Numpy-discussion] missing data discussion round 2

2011-06-29 Thread Mark Wiebe
On Wed, Jun 29, 2011 at 8:45 AM, Matthew Brett matthew.br...@gmail.com wrote:

 Hi,

 On Wed, Jun 29, 2011 at 12:39 AM, Mark Wiebe mwwi...@gmail.com wrote:
  On Tue, Jun 28, 2011 at 5:20 PM, Matthew Brett matthew.br...@gmail.com
  wrote:
 
  Hi,
 
  On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smith n...@pobox.com wrote:
  ...
   (You might think, what difference does it make if you *can* unmask an
   item? Us missing data folks could just ignore this feature. But:
   whatever we end up implementing is something that I will have to
   explain over and over to different people, most of them not
   particularly sophisticated programmers. And there's just no sensible
   way to explain this idea that if you store some particular value, then
   it replaces the old value, but if you store NA, then the old value is
   still there.
 
  Ouch - yes.  No question, that is difficult to explain.   Well, I
  think the explanation might go like this:
 
  Ah, yes, well, that's because in fact numpy records missing values by
  using a 'mask'.   So when you say `a[3] = np.NA', what you mean is,
   'a._mask = np.ones(a.shape, np.dtype(bool)); a._mask[3] = False`
 
  Is that fair?
 
  My favorite way of explaining it would be to have a grid of numbers
 written
  on paper, then have several cardboards with holes poked in them in
 different
  configurations. Placing these cardboard masks in front of the grid would
  show different sets of non-missing data, without affecting the values
 stored
  on the paper behind them.

 Right - but here of course you are trying to explain the mask, and
 this is Nathaniel's point, that in order to explain NAs, you have to
 explain masks, and so, even at a basic level, the fusion of the two
 ideas is obvious, and already confusing.  I mean this:

 a[3] = np.NA

 Oh, so you just set the a[3] value to have some missing value code?


I would answer Yes, that's basically true. The abstraction works that way,
and there's no reason to confuse people with those implementation details
right off the bat. When you introduce a new user to floating point numbers,
it would seem odd to first point out that addition isn't associative. That
kind of detail is important when you're learning more about the system and
digging deeper.

I think it was in a Knuth book that I read the idea that the best teaching
is a series of lies that successively correct the previous lies.


  Ah - no - in fact what I did was set an associated mask in position
 a[3] so that you can't any longer see the previous value of a[3]

 Huh.  You mean I have a mask for every single value in order to be
 able to blank out a[3]?  It looks like an assignment.  I mean, it
 looks just like a[3] = 4.  But I guess it isn't?

 Er...

 I think Nathaniel's point is a very good one - these are separate
 ideas, np.NA and np.IGNORE, and a joint implementation is bound to
 draw them together in the mind of the user.


R jointly implements them with the na.rm=TRUE parameter, and that's our model
system for missing data.


 Apart from anything
 else, the user has to know that, if they want a single NA value in an
 array, they have to add a mask size array.shape in bytes.  They have
 to know then, that NA is implemented by masking, and then the 'NA for
 free by adding masking' idea breaks down and starts to feel like a
 kludge.

 The counter argument is of course that, in time, the implementation of
 NA with masking will seem as obvious and intuitive, as, say,
 broadcasting, and that we are just reacting from lack of experience
 with the new API.


It will literally work the same as the implementation with NA dtypes, except
for the masking semantics which requires the extra steps of taking views.



 Of course, that does happen, but here, unless I am mistaken, the
 primary drive to fuse NA and masking is because of ease of
 implementation.


That's not the case, and I've tried to give a slightly better justification
for this in my answer to Lluís' email.


 That doesn't necessarily mean that they don't go
 together - if something is easy to implement, sometimes it means it
 will also feel natural in use, but at least we might say that there is
 some risk of the implementation driving the API, and that that can
 lead to problems.


In the design process I'm doing, the implementation concerns are affecting
the interface concerns and vice versa, but the missing data semantics are
the main driver.

-Mark



 See you,

 Matthew
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion



Re: [Numpy-discussion] missing data discussion round 2

2011-06-29 Thread Mark Wiebe
On Wed, Jun 29, 2011 at 9:35 AM, Dag Sverre Seljebotn 
d.s.seljeb...@astro.uio.no wrote:

 On 06/29/2011 03:45 PM, Matthew Brett wrote:
  Hi,
 
  On Wed, Jun 29, 2011 at 12:39 AM, Mark Wiebemwwi...@gmail.com  wrote:
  On Tue, Jun 28, 2011 at 5:20 PM, Matthew Brettmatthew.br...@gmail.com
  wrote:
 
  Hi,
 
  On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smithn...@pobox.com
  wrote:
  ...
  (You might think, what difference does it make if you *can* unmask an
  item? Us missing data folks could just ignore this feature. But:
  whatever we end up implementing is something that I will have to
  explain over and over to different people, most of them not
  particularly sophisticated programmers. And there's just no sensible
  way to explain this idea that if you store some particular value, then
  it replaces the old value, but if you store NA, then the old value is
  still there.
 
  Ouch - yes.  No question, that is difficult to explain.   Well, I
  think the explanation might go like this:
 
  Ah, yes, well, that's because in fact numpy records missing values by
  using a 'mask'.   So when you say `a[3] = np.NA', what you mean is,
   'a._mask = np.ones(a.shape, np.dtype(bool)); a._mask[3] = False`
 
  Is that fair?
 
  My favorite way of explaining it would be to have a grid of numbers
 written
  on paper, then have several cardboards with holes poked in them in
 different
  configurations. Placing these cardboard masks in front of the grid would
  show different sets of non-missing data, without affecting the values
 stored
  on the paper behind them.
 
  Right - but here of course you are trying to explain the mask, and
  this is Nathaniel's point, that in order to explain NAs, you have to
  explain masks, and so, even at a basic level, the fusion of the two
  ideas is obvious, and already confusing.  I mean this:
 
  a[3] = np.NA
 
  Oh, so you just set the a[3] value to have some missing value code?
 
  Ah - no - in fact what I did was set an associated mask in position
  a[3] so that you can't any longer see the previous value of a[3]
 
  Huh.  You mean I have a mask for every single value in order to be
  able to blank out a[3]?  It looks like an assignment.  I mean, it
  looks just like a[3] = 4.  But I guess it isn't?
 
  Er...
 
  I think Nathaniel's point is a very good one - these are separate
  ideas, np.NA and np.IGNORE, and a joint implementation is bound to
  draw them together in the mind of the user.  Apart from anything
  else, the user has to know that, if they want a single NA value in an
  array, they have to add a mask of size array.shape in bytes.  They have
  to know then, that NA is implemented by masking, and then the 'NA for
  free by adding masking' idea breaks down and starts to feel like a
  kludge.
 
  The counter argument is of course that, in time, the implementation of
  NA with masking will seem as obvious and intuitive, as, say,
  broadcasting, and that we are just reacting from lack of experience
  with the new API.

 However, no matter how used we get to this, people coming from almost
 any other tool (in particular R) will keep thinking it is
 counter-intuitive. Why set up a major semantic incompatibility that
 people then have to overcome in order to start using NumPy?


I'm not aware of a semantic incompatibility. I believe R doesn't support
views like NumPy does, so the things you have to do to see masking semantics
aren't even possible in R.

I really don't see what's wrong with some more explicit API like
 a.mask[3] = True. Explicit is better than implicit.


I agree, but initial feedback was that the way R deals with NA values is
very nice, and I've come to agree that it's worth emulating.

-Mark



 Dag Sverre


Re: [Numpy-discussion] missing data discussion round 2

2011-06-29 Thread Dag Sverre Seljebotn
On 06/29/2011 07:38 PM, Mark Wiebe wrote:
 On Wed, Jun 29, 2011 at 9:35 AM, Dag Sverre Seljebotn
 d.s.seljeb...@astro.uio.no mailto:d.s.seljeb...@astro.uio.no wrote:

  snip

 However, no matter how used we get to this, people coming from almost
 any other tool (in particular R) will keep thinking it is
 counter-intuitive. Why set up a major semantic incompatibility that
 people then have to overcome in order to start using NumPy?


 I'm not aware of a semantic incompatibility. I believe R doesn't support
 views like NumPy does, so the things you have to do to see masking
 semantics aren't even possible in R.

Well, whether the same feature is possible or not in R is irrelevant to
whether a semantic incompatibility would exist.

Views themselves are a *major* semantic incompatibility, and are highly
confusing at first to MATLAB/Fortran/R people. However, they have major
advantages outweighing the disadvantage of having to caution new users.

But there's simply no precedence anywhere for an assignment that doesn't 
erase the old value for a particular input value, and the advantages 
seem pretty minor (well, I think it is ugly in its own right, but that
is beside the point...)
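As an editorial aside: the assignment-that-doesn't-erase behavior under discussion can be observed with today's np.ma, quite apart from the proposed NEP API. A minimal sketch using only the existing np.ma calls (not the proposed np.NA interface):

```python
import numpy as np

a = np.ma.array([1.0, 2.0, 3.0, 7.0])
a[3] = np.ma.masked   # looks like an ordinary assignment...
print(a.data[3])      # ...but the old value, 7.0, is still stored
a.mask[3] = False     # lift the mask
print(a[3])           # the original value reappears: 7.0
```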

Dag Sverre


Re: [Numpy-discussion] missing data discussion round 2

2011-06-29 Thread Lluís
Mark Wiebe writes:

 There seems to be a general idea that masks and NA bit patterns imply
 particular differing semantics, something which I think is simply
 false.

Well, my example contained a difference (the need for the skipna=True
argument) precisely because it seemed that there was some need for
different defaults.

Honestly, I think this difference breaks the POLA (principle of least
astonishment).


[...]
 As far as I can tell, the only required difference between them is
 that NA bit patterns must destroy the data. Nothing else. Everything
 on top of that is a choice of API and interface mechanisms. I want
 them to behave exactly the same except for that necessary difference,
 so that it will be possible to use the *exact same Python code* with
 either approach.

I completely agree. What I'd suggest is a global and/or per-object
ndarray.flags.skipna for people like me that just want to ignore these
entries without caring about setting it on each operation (or the other
way around, depends on the default behaviour).

The downside is that it adds yet another tweaking knob, which is not
desirable...


Lluis

-- 
 And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer.
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth


Re: [Numpy-discussion] missing data discussion round 2

2011-06-29 Thread Bruce Southey
On 06/29/2011 01:07 PM, Dag Sverre Seljebotn wrote:
 snip

 But there's simply no precedence anywhere for an assignment that doesn't
 erase the old value for a particular input value, and the advantages
 seem pretty minor (well, I think it is ugly in its own right, but that
 is besides the point...)

 Dag Sverre
Depending on what you really mean by 'precedence', in most stats 
software (R, SAS, etc.) it is completely up to the user to do this and 
do it correctly. 

Re: [Numpy-discussion] missing data discussion round 2

2011-06-29 Thread Matthew Brett
Hi,

On Wed, Jun 29, 2011 at 6:22 PM, Mark Wiebe mwwi...@gmail.com wrote:
 On Wed, Jun 29, 2011 at 8:20 AM, Lluís xscr...@gmx.net wrote:

 Matthew Brett writes:

  Maybe instead of np.NA, we could say np.IGNORE, which sort of conveys
  the idea that the entry is still there, but we're just ignoring it.  Of
  course, that goes against common convention, but it might be easier to
  explain.

  I think Nathaniel's point is that np.IGNORE is a different idea than
  np.NA, and that is why joining the implementations can lead to
  conceptual confusion.

 This is how I see it:

  a = np.array([0, 1, 2], dtype=int)
  a[0] = np.NA
 ValueError
  e = np.array([np.NA, 1, 2], dtype=int)
 ValueError
  b  = np.array([np.NA, 1, 2], dtype=np.maybe(int))
  m  = np.array([np.NA, 1, 2], dtype=int, masked=True)
  bm = np.array([np.NA, 1, 2], dtype=np.maybe(int), masked=True)
  b[1] = np.NA
  np.sum(b)
 np.NA
  np.sum(b, skipna=True)
 2
  b.mask
 None
  m[1] = np.NA
  np.sum(m)
 2
  np.sum(m, skipna=True)
 2
  m.mask
 [False, False, True]
  bm[1] = np.NA
  np.sum(bm)
 2
  np.sum(bm, skipna=True)
 2
  bm.mask
 [False, False, True]

 So:

 * Mask takes precedence over bit pattern on element assignment. There's
  still the question of how to assign a bit pattern NA when the mask is
  active.

 * When using mask, elements are automagically skipped.

 * m[1] = np.NA is equivalent to m.mask[1] = False

 * When using bit pattern + mask, it might make sense to have the initial
  values as bit-pattern NAs, instead of masked (i.e., bm.mask == [True,
  False, True] and np.sum(bm) == np.NA)

 There seems to be a general idea that masks and NA bit patterns imply
 particular differing semantics, something which I think is simply false.

Well - first - it's helpful surely to separate the concepts and the
implementation.

Concepts / use patterns (as delineated by Nathaniel):
A) missing values == 'np.NA' in my emails.  Can we call that CMV
(concept missing values)?
B) masks == np.IGNORE in my emails. CMSK (concept masks)?

Implementations
1) bit-pattern == na-dtype - how about we call that IBP
(implementation bit pattern)?
2) array.mask.  IM (implementation mask)?

Nathaniel implied that:

CMV implies: sum([np.NA, 1]) == np.NA
CMSK implies sum([np.NA, 1]) == 1

and indeed, that's how R and masked arrays respectively behave.  So I
think it's reasonable to say that at least R thought that the bitmask
implied the first and Pierre and others thought the mask meant the
second.
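The two default behaviors can be demonstrated with tools that exist today — np.ma for CMSK, and NaN as a rough stand-in for CMV (again assuming NaN only approximates a true NA):

```python
import numpy as np

# CMSK (mask) semantics: reductions skip the hidden entry by default
m = np.ma.array([0.0, 1.0], mask=[True, False])
print(m.sum())   # 1.0

# CMV (missing value) semantics, approximated with NaN: it propagates
v = np.array([np.nan, 1.0])
print(v.sum())   # nan
```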

The NEP as it stands thinks of CMV and CMSK as being different views
of the same thing.  Please correct me if I'm wrong.

 Both NaN and Inf are implemented in hardware with the same idea as the NA
 bit pattern, but they do not follow NA missing value semantics.

Right - and that doesn't affect the argument, because the argument is
about the concepts and not the implementation.

 As far as I can tell, the only required difference between them is that NA
 bit patterns must destroy the data. Nothing else.

I think Nathaniel's point was about the expected default behavior in
the different concepts.

 Everything on top of that
 is a choice of API and interface mechanisms. I want them to behave exactly
 the same except for that necessary difference, so that it will be possible
 to use the *exact same Python code* with either approach.

Right.  And Nathaniel's point is that that desire leads to fusion of
the two ideas into one when they should be separated.  For example, if
I understand correctly:

 a = np.array([1.0, 2.0, 3, 7.0], masked=True)
 b = np.array([1.0, 2.0, np.NA, 7.0], dtype='NA[f8]')
 a[3] = np.NA  # actual real hand-on-heart assignment
 b[3] = np.NA # magic mask setting although it looks the same

 Say you're using NA dtypes, and suddenly you think, what if I temporarily
 treated these as NA too. Now you have to copy your whole array to avoid
 destroying your data! The NA bit pattern didn't save you memory here... Say
 you're using masks, and it turns out you didn't actually need masking
 semantics. If they're different, you now have to do lots of code changes to
 switch to NA dtypes!

I personally have not run across that case.  I'd imagine that, if you
knew you wanted to do something so explicitly masking-like, you'd
start with the masking interface.

Clearly there are some overlaps between what masked arrays are trying
to achieve and what R's NA mechanisms are trying to achieve.  Are they
really similar enough that they should function using the same API?
And if so, won't that be confusing?  I think that's the question
that's being asked.

See you,

Matthew


Re: [Numpy-discussion] missing data discussion round 2

2011-06-29 Thread Matthew Brett
Oops,

On Wed, Jun 29, 2011 at 8:32 PM, Matthew Brett matthew.br...@gmail.com wrote:
 Hi,

 snip

 Right.  And Nathaniel's point is that that desire leads to fusion of
 the two ideas into one when they should be separated.  For example, if
 I understand correctly:

 a = np.array([1.0, 2.0, 3, 7.0], masked=True)
 b = np.array([1.0, 2.0, np.NA, 7.0], dtype='NA[f8]')
 a[3] = np.NA  # actual real hand-on-heart assignment
 b[3] = np.NA # magic mask setting although it looks the same

I meant:

 a = np.array([1.0, 2.0, 3.0, 7.0], masked=True)
 b = np.array([1.0, 2.0, 3.0, 7.0], dtype='NA[f8]')
 b[3] = np.NA  # actual real hand-on-heart assignment
 a[3] = np.NA # magic mask setting although it looks the same

Sorry,

Matthew


Re: [Numpy-discussion] missing data discussion round 2

2011-06-29 Thread Matthew Brett
Hi,

On Wed, Jun 29, 2011 at 7:20 PM, Lluís xscr...@gmx.net wrote:
 snip

 I completely agree. What I'd suggest is a global and/or per-object
 ndarray.flags.skipna for people like me that just want to ignore these
 entries without caring about setting it on each operation (or the other
 way around, depends on the default behaviour).

 The downside is that it adds yet another tweaking knob, which is not
 desirable...

Oh - dear - that would be horrible, if, depending on the tweak
somewhere in the distant past of your script, this:

 a = np.array([np.NA, 1.0], masked=True)
 np.sum(a)

could return either np.NA or 1.0...

Imagine someone twiddled the knob the other way and ran your script...

See you,

Matthew


Re: [Numpy-discussion] missing data discussion round 2

2011-06-29 Thread Charles R Harris
On Wed, Jun 29, 2011 at 1:32 PM, Matthew Brett matthew.br...@gmail.com wrote:

 Hi,

  snip

 Well - first - it's helpful surely to separate the concepts and the
 implementation.

 Concepts / use patterns (as delineated by Nathaniel):
 A) missing values == 'np.NA' in my emails.  Can we call that CMV
 (concept missing values)?
 B) masks == np.IGNORE in my emails . CMSK (concept masks)?

 Implementations
 1) bit-pattern == na-dtype - how about we call that IBP
 (implementation bit patten)?
 2) array.mask.  IM (implementation mask)?


Remember that the masks are invisible, you can't see them, they are an
implementation detail. A good reason to hide the implementation is so it can
be changed without impacting software that depends on the API.

snip

Chuck


Re: [Numpy-discussion] missing data discussion round 2

2011-06-29 Thread Matthew Brett
Hi,

On Wed, Jun 29, 2011 at 9:17 PM, Charles R Harris
charlesr.har...@gmail.com wrote:


 On Wed, Jun 29, 2011 at 1:32 PM, Matthew Brett matthew.br...@gmail.com
 wrote:

 Hi,

   snip

 Well - first - it's helpful surely to separate the concepts and the
 implementation.

 Concepts / use patterns (as delineated by Nathaniel):
 A) missing values == 'np.NA' in my emails.  Can we call that CMV
 (concept missing values)?
 B) masks == np.IGNORE in my emails . CMSK (concept masks)?

 Implementations
 1) bit-pattern == na-dtype - how about we call that IBP
 (implementation bit patten)?
 2) array.mask.  IM (implementation mask)?


 Remember that the masks are invisible, you can't see them, they are an
 implementation detail. A good reason to hide the implementation is so it can
 be changed without impacting software that depends on the API.

It's not true that you can't see them, because masks use the same API
as missing values.  Because they share the same API, the person using
the CMV functionality will soon find out about the masks, accidentally
or not; then they will need to understand masking, and that is the
problem we're discussing here.

See you,

Matthew


Re: [Numpy-discussion] missing data discussion round 2

2011-06-29 Thread Nathaniel Smith
On Wed, Jun 29, 2011 at 11:20 AM, Lluís xscr...@gmx.net wrote:
 I completely agree. What I'd suggest is a global and/or per-object
 ndarray.flags.skipna for people like me that just want to ignore these
 entries without caring about setting it on each operation (or the other
 way around, depends on the default behaviour).

I agree with with Matthew that this approach would end up having
horrible side-effects, but I can see why you'd want some way to
accomplish this...

I suggested another approach to handling both NA-style and mask-style
missing data by making them totally separate features. It's buried at
the bottom of this over-long message (you can search for "my
proposal"):
  http://mail.scipy.org/pipermail/numpy-discussion/2011-June/057251.html

I know that the part 1 of that proposal would satisfy my needs, but I
don't know as much about your use case, so I'm curious. Would that
proposal (in particular, part 2, the classic masked-array part) work
for you?

-- Nathaniel


Re: [Numpy-discussion] missing data discussion round 2

2011-06-29 Thread Eric Firing
On 06/29/2011 09:32 AM, Matthew Brett wrote:
 Hi,

[...]

 Clearly there are some overlaps between what masked arrays are trying
 to achieve and what R's NA mechanisms are trying to achieve.  Are they
 really similar enough that they should function using the same API?
 And if so, won't that be confusing?  I think that's the question
 that's being asked.

And I think the answer is no.  No more confusing to people coming from 
R to numpy than views already are--with or without the NEP--and not 
*requiring* people to use any NA-related functionality beyond what they 
are used to from R.

My understanding of the NEP is that it directly yields an API closely 
matching that of R, but with the opportunity, via views, to do more with 
less work, if one so desires.  The present masked array module could be 
made more efficient if the NEP is implemented; regardless of whether 
this is done, the masked array module is not about to vanish, so anyone 
wanting precisely the masked array API will have it; and others remain 
free to ignore it (except for those of us involved in developing 
libraries such as matplotlib, which will have to support all variations 
of the new API along with the already-supported masked arrays).

In addition, for new code, the full-blown masked array module may not be 
needed.  A convenience it adds, however, is the automatic masking of 
invalid values:

In [1]: np.ma.log(-1)
Out[1]: masked

I'm sure this horrifies some, but there are times and places where it is 
a genuine convenience, and preferable to having to use a separate 
operation to replace nan or inf with NA or whatever it ends up being.
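[A small runnable sketch of the domain-masking convenience Eric
describes, using the np.ma ufuncs that exist today:]

```python
import numpy as np

# np.ma ufuncs mask domain errors instead of producing nan/inf or raising:
r = np.ma.log([-1.0, 1.0, np.e])
print(r.mask.tolist())    # [True, False, False] -- log(-1) is masked out
print(float(r.sum()))     # ~1.0 -- the masked entry is skipped in reductions
```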

If np.seterr were extended to allow such automatic masking as an option, 
then the need for a separate masked array module would shrink further. 
I wouldn't mind having to use an explicit kwarg for ignoring NA in 
reduction methods.

Eric



 See you,

 Matthew



Re: [Numpy-discussion] missing data discussion round 2

2011-06-29 Thread Lluís
Nathaniel Smith writes:
 I know that the part 1 of that proposal would satisfy my needs, but I
 don't know as much about your use case, so I'm curious. Would that
 proposal (in particular, part 2, the classic masked-array part) work
 for you?

I'm for the option of having a single API when you want to have NA
elements, regardless of whether it's using masks or bit patterns.

My question is whether your ufuncs should react differently depending on
the type of array you're using (bit pattern vs mask).

In the beginning I thought it could make sense, as you know how you have
created the array. So if you're using masks, you're probably going to
ignore the NAs (because you've explicitly set them, and you don't want a
NA as the result of your summation).

*But*, the more API/semantics both approaches share, the better; so I'd
say that it's better that they show the *very same* behaviour
(w.r.t. skipna).

My concern is now about how to set skipna in a comfortable way,
so that I don't have to pass it again and again as a ufunc argument:

>>> a
array([NA, 2, 3])
>>> b
array([1, 2, NA])
>>> a + b
array([NA, 4, NA])
>>> a.flags.skipna = True
>>> b.flags.skipna = True
>>> a + b
array([1, 4, 3])
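[The flags.skipna machinery in the session above was never merged. In
NumPy as it exists, the closest spellings of per-operation skipping are
the nan-aware reductions (with NaN as the missing marker) or np.ma, where
skipping is the default. Neither reproduces the elementwise NA-as-identity
behaviour shown above (NA + 1 == 1): missingness propagates through the
addition and is only skipped at the reduction. A hedged sketch:]

```python
import numpy as np

# With NaN as the missing-value marker, skipping is per-call via nan* functions:
a = np.array([np.nan, 2.0, 3.0])
b = np.array([1.0, 2.0, np.nan])
print(np.sum(a + b))       # nan -- missing values propagate by default
print(np.nansum(a + b))    # 4.0 -- only the fully observed element survives

# With np.ma, masked entries are skipped in reductions without any flag:
ma, mb = np.ma.masked_invalid(a), np.ma.masked_invalid(b)
print(float((ma + mb).sum()))   # 4.0
```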


Lluis

-- 
 And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer.
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth


  1   2   >