Re: [Numpy-discussion] Moving lib.recfunctions?
Pierre GM pgmdevlist at gmail.com writes:

Hello, The idea behind having a lib.recfunctions and not a rec.recfunctions or whatever was to illustrate that the functions of this package are more generic than they appear. They work with regular structured ndarrays and don't need recarrays. Methinks we're gonna lose this aspect if you try to rename it, but hey, your call.

I've never really thought there's much distinction between the two - AFAICT a recarray is just a structured array with attribute access? If a function only accepts a recarray (are there any?) isn't it just a simple call to .view(np.recarray) to get it to work with structured arrays? Because of this view I've always thought functions which work on either should be grouped together.

As to why they were never really advertised? Because I never received any feedback when I started developing them (developing is a big word here, I just took a lot of code that John D Hunter had developed in matplotlib and made it more consistent). I advertised them once or twice on the list, wrote the basic docstrings, but waited for other people to start using them. Anyhow. So, yes, there might be some weird imports to polish. Note that if you decided to just rename the package and leave it where it was, it would probably be easier.

The path of least resistance is to just import lib.recfunctions.* into the (already crowded) main numpy namespace and be done with it.

Why? Why can't you leave it available through numpy.lib? Once again, if it's only a matter of PRing, you could start writing an entry page in the doc describing the functions; that would improve the visibility.

I do recall them being advertised a while ago, but when I came to look for them I couldn't find them - IMHO np.rec is a much more intuitive (and nicer/shorter) namespace than np.lib.recfunctions. I think having similar functionality in two completely different namespaces is confusing and hard to remember.
It also doesn't help that np.lib.recfunctions isn't discoverable by tab-completion:

In [2]: np.lib.rec
np.lib.recfromcsv    np.lib.recfromtxt

...of course you could probably find it with np.lookfor but it's one more barrier to their use. FWIW I'd be happy if the np.lib.recfunctions functions were made available in the np.rec namespace (and possibly deprecate np.lib.recfunctions to avoid confusion?) I'm conscious that as a user (not a developer) talk is cheap and I'm happy with whatever the consensus is. I just thought I'd pipe up since it was only through this thread that I re-discovered np.lib.recfunctions! HTH, Dave ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
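The point about `.view(np.recarray)` above can be sketched in a few lines: a recarray is just a structured array plus attribute access, and the view copies no data (values below are illustrative).

```python
import numpy as np

# A plain structured array: fields are accessed by key.
a = np.array([(1, 2.0), (3, 4.0)], dtype=[('x', int), ('y', float)])
print(a['x'])          # field access via indexing

# Viewing it as a recarray adds attribute access; no data is copied.
r = a.view(np.recarray)
print(r.x)             # same data, attribute-style access
```

Since the two share memory, a function written for structured arrays works on a recarray unchanged, and vice versa after a one-line view.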
Re: [Numpy-discussion] Current status of 64 bit windows support.
On Fri, Jul 1, 2011 at 7:23 PM, Sturla Molden stu...@molden.no wrote: On 01.07.2011 19:22, Charles R Harris wrote:

Just curious as to what folks know about the current status of the free Windows 64-bit compilers. I know things were dicey with gcc and gfortran some two years ago, but... well, two years have passed.

This Windows 7 SDK is free (as in beer). It is the C compiler used to build Python on Windows 64. Here is the download: http://www.microsoft.com/download/en/details.aspx?displaylang=enid=3138 A newer version of the Windows SDK will use a C compiler that links with a different CRT than Python uses. Use version 3.5. When using this compiler, remember to set the environment variable DISTUTILS_USE_SDK. This should be sufficient to build NumPy. AFAIK only SciPy requires a Fortran compiler. MinGW is still not stable on Windows 64. There are supposedly compatibility issues between the MinGW runtime used by libgfortran and Python's CRT. While there are experimental MinGW builds for Windows 64 (e.g. TDM-GCC), we will probably need to build libgfortran against another C runtime for SciPy. A commercial Fortran compiler compatible with MSVC is recommended for SciPy, e.g. Intel, Absoft or Portland. Sturla

So it sounds like we're getting closer to having official NumPy 1.6.x binaries for 64-bit Windows (using the Windows 7 SDK), but not quite there yet? What is the roadblock? I would guess from the comments on Christoph Gohlke's page that the issue is having something that will work with SciPy... see http://www.lfd.uci.edu/~gohlke/pythonlibs/ I'm interested from the point of view of third-party libraries using NumPy, where we have had users asking for 64-bit installers. We need an official NumPy installer to build against. Regards, Peter
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
Hi, Just for reference, I am using this as the latest version of the NEP - I hope it's current: https://github.com/m-paradox/numpy/blob/7b10c9ab1616b9100e98dd2ab80cef639d5b5735/doc/neps/missing-data.rst I'm mostly relaying stuff I said, although generally (please do correct me if I am wrong) I am just re-expressing points that Nathaniel has already made in the alterNEP text and the emails.

On Wed, Jul 6, 2011 at 12:46 AM, Christopher Jordan-Squire cjord...@uw.edu wrote: ... Since Mark is only around Austin until early August, there's also broad agreement that we need to get something done quickly.

I think I might have missed that part of the discussion :) I feel the need to emphasize the centrality of the assertion by Nathaniel, and agreement by (at least) me, that the NA case (there really is no data) and the IGNORE case (there is data but I'm concealing it from you) are conceptually different, and come from different use-cases. The underlying disagreement returned many times to this fundamental difference between the NEP and alterNEP: In the NEP - by design - it is impossible to distinguish between na.NA and na.IGNORE. The alterNEP insists you should be able to distinguish. Mark says something like "it's all missing data, there's no reason you should want to distinguish". Nathaniel and I were saying the two types of missing do have different use-cases, and it should be possible to distinguish. You might want to choose to treat them the same, but you should be able to see what they are. I returned several times to this (original point by Nathaniel): a[3] = np.NA (what does this mean? Am I altering the underlying array, or a mask? How would I explain this to someone?) We confirmed that, in order to make it difficult to know what your NA is (masked or bit-pattern), Mark has to a) hinder access to the data below the mask and b) prevent direct API access to the masking array.
I described this as 'hobbling the API' and Mark thought of it as 'generic programming' (missing is always missing). I asserted that explaining NA to people would be easier if ``a[3] = np.NA`` was direct assignment and altered the array.

BIT PATTERN AND MASK IMPLEMENTATIONS FOR NA
-------------------------------------------

The current NEP proposes both mask and bit pattern implementations for missing data. I use the terms bit pattern and parameterized dtype interchangeably, since the parameterized dtype will use a bit pattern for its implementation. The two implementations will support the same functionality with respect to NA, and the implementation details will be largely invisible to the user. Their differences are in the 'extra' features each supports. Two common questions were:

1. Why make two implementations of missing data: one with masks and the other with parameterized dtypes?
2. Why does the implementation using masks have higher priority?

The answers are:

1. The mask implementation is more general and easier to implement and maintain. The bit pattern implementation saves memory, makes interoperability easier, and makes ABI (Application Binary Interface) compatibility easier. Since each has different strengths, the argument is that both should be implemented.
2. The implementation for the parameterized dtypes will rely on the implementation using a mask.

NA VS. IGNORE
-------------

A lot of discussion centered on IGNORE vs. NA types. We take IGNORE in the aNEP sense and NA in the NEP sense. With NA, there is a clear notion of how NA propagates through all basic numpy operations (e.g., 3 + NA = NA and log(NA) = NA, while NA | True = True). IGNORE is separate from NA, with different interpretations depending on the use case. IGNORE could mean:

1. Data that is being temporarily ignored, e.g., a possible outlier that is temporarily being removed from consideration.
2. Data that cannot exist, e.g., a matrix representing a grid of water depths for a lake. Since the lake isn't square, some entries will represent land, and so depth will be a meaningless concept for those entries.
3. Using IGNORE to signal a jagged array, e.g., [[1, 2, IGNORE], [IGNORE, 3, 4]] should behave exactly the same as [[1, 2], [3, 4]]. Though this leaves open how [1, 2, IGNORE] + [3, 4] should behave.

Because of these different uses of IGNORE, it doesn't have as clear a theoretical interpretation as NA. (For instance, what is IGNORE + 3, IGNORE * 3, or IGNORE | True?) I don't remember this bit of the discussion, but I see from current masked arrays that IGNORE is treated as the identity, so: IGNORE + 3 = 3 and IGNORE * 3 = 3. But several of the discussants thought the use cases for IGNORE were very compelling. Specifically, they wanted to be able to use IGNORE's and NA's simultaneously while still being able to differentiate between them. So, for example, being able to designate some data as IGNORE while still able to determine
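The identity behaviour described for IGNORE can be checked against today's numpy.ma; a small sketch (this illustrates the existing masked-array semantics, not either NEP's proposed machinery):

```python
import numpy.ma as ma

# In current masked arrays, ignored (masked) entries act as the
# identity element in reductions: 0 for sums, 1 for products.
a = ma.array([1.0, 2.0, 3.0], mask=[False, True, False])
print(a.sum())    # 4.0 -- the masked 2.0 contributes 0
print(a.prod())   # 3.0 -- the masked 2.0 contributes 1
```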
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
On 07/06/2011 02:05 PM, Matthew Brett wrote: [long quote of the NA/IGNORE summary snipped]
I described this as 'hobbling the API' and Mark thought of it as 'generic programming' (missing is always missing).

Here's an HPC perspective...: If you, say, want to off-load array processing with a mask to some code running on a GPU, you really can't have the GPU go through some NumPy API. Or if you want to implement a masked array on a cluster with MPI, you similarly really, really want raw access. At least I feel that the transparency of NumPy is a huge part of its current success. Many more than me spend half their time in C/Fortran and half their time in Python. I tend to look at NumPy this way: Assuming you have some data in memory (possibly loaded by a C or Fortran library), (almost) no matter how it is allocated, ordered, packed, aligned -- there's a way to find strides and dtypes to put a nice NumPy wrapper around it and use the memory from Python. So, my view on Mark's NEP was: With a reasonable amount of flexibility in how you decided to implement masking for your data, you can create a NumPy wrapper that will understand that. Whether your Fortran library exposes NAs in its 40GB buffer as bit patterns, or using a separate mask, both will work. And IMO Mark's NEP comes rather close to this, you just need an additional NEP later to give raw access to the implementation details, once those are settled :-) Dag Sverre
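The "find strides and dtypes and wrap it" idea above can be sketched as follows; the buffers and layout are invented for illustration, standing in for memory a C or Fortran library might hand back:

```python
import numpy as np

# Pretend this 48-byte buffer was allocated and filled by a Fortran library.
raw = bytearray(8 * 6)

# No copy: just describe the memory with a dtype, shape and order.
data = np.frombuffer(raw, dtype=np.float64).reshape(2, 3, order='F')

# A library that carries a separate (explicit) mask for NA can be
# wrapped the same way, with a parallel boolean view.
mask_raw = bytearray(6)
mask = np.frombuffer(mask_raw, dtype=np.bool_).reshape(2, 3, order='F')

# Writing through the view writes straight into the library's buffer.
data[0, 0] = 1.5
```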
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
On 07/06/2011 02:27 PM, Dag Sverre Seljebotn wrote: [earlier discussion snipped]

And IMO Mark's NEP comes rather close to this, you just need an additional NEP later to give raw access to the implementation details, once those are settled :-)

To be concrete, I'm thinking something like a custom extension to PEP 3118, which could also allow efficient access from Cython without hard-coding Cython for NumPy (a GSoC project this summer will continue to move us away from the np.ndarray[int] syntax to a more generic int[:] that's less tied to NumPy). But first things first!
Dag Sverre
[Numpy-discussion] HPC missing data - was: NA/Missing Data Conference Call Summary
Hi, Sorry, I hope you don't mind, I moved this to its own thread, trying to separate comments on the NA debate from the discussion yesterday.

On Wed, Jul 6, 2011 at 1:27 PM, Dag Sverre Seljebotn d.s.seljeb...@astro.uio.no wrote: [earlier discussion snipped]

So, my view on Mark's NEP was: With a reasonable amount of flexibility in how you decided to implement masking for your data, you can create a NumPy wrapper that will understand that. Whether your Fortran library exposes NAs in its 40GB buffer as bit patterns, or using a separate mask, both will work. And IMO Mark's NEP comes rather close to this, you just need an additional NEP later to give raw access to the implementation details, once those are settled :-)

I was a little puzzled as to what you were trying to say, but I suspect that's my ignorance about Numpy internals. Superficially, I would have assumed that making masked and bit-pattern NAs behave the same in numpy would take you away from the raw data, in the sense that you not only need the dtype, you also need the mask machinery, in order to know if you have an NA. Later I realized that you probably weren't saying that.
So, just for my unhappy ignorance - how does the HPC perspective relate to the debate about can / can't distinguish NA from IGNORE? Sorry, thanks, Matthew
Re: [Numpy-discussion] HPC missing data - was: NA/Missing Data Conference Call Summary
On 07/06/2011 02:46 PM, Matthew Brett wrote: Hi, Sorry, I hope you don't mind, I moved this to its own thread, trying to separate comments on the NA debate from the discussion yesterday.

I'm sorry.

[long quote snipped]
Later I realized that you probably weren't saying that. So, just for my unhappy ignorance - how does the HPC perspective relate to the debate about can / can't distinguish NA from IGNORE?

I just commented on the "prevent direct API access to the masking array" part -- I'm hoping direct access by external code to the underlying implementation details will be allowed, at some point. What I'm saying is that Mark's proposal is more flexible. Say for the sake of the argument that I have two codes I need to interface with:

- Library A is written in Fortran and uses a separate (explicit) mask array for NA
- Library B runs on a GPU and uses a bit pattern for NA

Mark's proposal then comes closer to allowing me to wrap both codes using NumPy, since it supports both implementation mechanisms. Sure, it would need a separate NEP down the road to extend it, but it goes in the right direction for this to happen. As for
Re: [Numpy-discussion] HPC missing data - was: NA/Missing Data Conference Call Summary
Hi, On Wed, Jul 6, 2011 at 2:12 PM, Dag Sverre Seljebotn d.s.seljeb...@astro.uio.no wrote: [long quote snipped]
[Numpy-discussion] using the same vocabulary for missing value ideas
It appears to me that one of the biggest reasons some of us have been talking past each other in the discussions is that different people have different definitions for the terms being used. Until this is thoroughly cleared up, I feel the design process is tilting at windmills. In the interests of clarity in our discussions, here is a starting point which is consistent with the NEP. These definitions have been added in a glossary within the NEP. If there are any ideas for amendments to these definitions that we can agree on, I will update the NEP with those amendments. Also, if I missed any important terms which need to be added, please propose definitions for them.

NA (Not Available)
    A placeholder for a value which is unknown to computations. That value may be temporarily hidden with a mask, may have been lost due to hard drive corruption, or gone for any number of reasons. This is the same as NA in the R project.

IGNORE (Skip/Ignore)
    A placeholder which should be treated by computations as if no value does or could exist there. For sums, this means act as if the value were zero, and for products, this means act as if the value were one. It's as if the array were compressed in some fashion to not include that element.

bitpattern
    A technique for implementing either NA or IGNORE, where a particular set of bit patterns is chosen from all the possible bit patterns of the value's data type to signal that the element is NA or IGNORE.

mask
    A technique for implementing either NA or IGNORE, where a boolean or enum array parallel to the data array is used to signal which elements are NA or IGNORE.

numpy.ma
    The existing implementation of a particular form of masked arrays, which is part of the NumPy codebase.

The most important distinctions I'm trying to draw are:

1) NA vs IGNORE and bitpattern vs mask are completely independent. Any combination of NA as bitpattern, NA as mask, IGNORE as bitpattern, and IGNORE as mask is reasonable.
2) The idea of masking and the numpy.ma implementation are different. The numpy.ma object makes particular choices about how to interpret the mask, but while backwards compatibility is important, a fresh evaluation of all the design choices going into a mask implementation is worthwhile. Thanks, Mark ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
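The NA/IGNORE distinction above can be seen with tools that exist today. A minimal sketch: numpy.ma gives ignore-style reductions, while NaN is the closest current stand-in for a propagating NA (np.NA itself is only proposed, so it is not used here):

```python
import numpy as np

# IGNORE semantics: numpy.ma reductions skip masked elements,
# acting as if the value were zero for sums and one for products.
a = np.ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])
print(a.sum())    # 4.0 -- the masked 2.0 is skipped
print(a.prod())   # 3.0 -- the masked slot acts as the multiplicative identity

# NA-like (propagating) semantics: NaN is the nearest existing analogue.
b = np.array([1.0, np.nan, 3.0])
print(b.sum())    # nan -- the unknown value propagates through the sum
```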
Re: [Numpy-discussion] using the same vocabulary for missing value ideas
On Wed, Jul 6, 2011 at 4:40 PM, Mark Wiebe mwwi...@gmail.com wrote: It appears to me that one of the biggest reason some of us have been talking past each other in the discussions is that different people have different definitions for the terms being used. Until this is thoroughly cleared up, I feel the design process is tilting at windmills. In the interests of clarity in our discussions, here is a starting point which is consistent with the NEP. These definitions have been added in a glossary within the NEP. If there are any ideas for amendments to these definitions that we can agree on, I will update the NEP with those amendments. Also, if I missed any important terms which need to be added, please propose definitions for them. That sounds good - I've only been scanning these discussions and it is confusing. NA (Not Available) A placeholder for a value which is unknown to computations. That value may be temporarily hidden with a mask, may have been lost due to hard drive corruption, or gone for any number of reasons. This is the same as NA in the R project. Could you expand that to say how sums and products act with NA (since you do so for the IGNORE case). Thanks, Peter ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] using the same vocabulary for missing value ideas
Hi, On Wed, Jul 6, 2011 at 4:40 PM, Mark Wiebe mwwi...@gmail.com wrote: It appears to me that one of the biggest reason some of us have been talking past each other in the discussions is that different people have different definitions for the terms being used. Until this is thoroughly cleared up, I feel the design process is tilting at windmills. In the interests of clarity in our discussions, here is a starting point which is consistent with the NEP. These definitions have been added in a glossary within the NEP. If there are any ideas for amendments to these definitions that we can agree on, I will update the NEP with those amendments. Also, if I missed any important terms which need to be added, please propose definitions for them. NA (Not Available) A placeholder for a value which is unknown to computations. That value may be temporarily hidden with a mask, may have been lost due to hard drive corruption, or gone for any number of reasons. This is the same as NA in the R project. Really? Can one implement NA with a mask in R? I thought an NA was always bitpattern in R? IGNORE (Skip/Ignore) A placeholder which should be treated by computations as if no value does or could exist there. For sums, this means act as if the value were zero, and for products, this means act as if the value were one. It's as if the array were compressed in some fashion to not include that element. bitpattern A technique for implementing either NA or IGNORE, where a particular set of bit patterns are chosen from all the possible bit patterns of the value's data type to signal that the element is NA or IGNORE. mask A technique for implementing either NA or IGNORE, where a boolean or enum array parallel to the data array is used to signal which elements are NA or IGNORE. numpy.ma The existing implementation of a particular form of masked arrays, which is part of the NumPy codebase. 
The most important distinctions I'm trying to draw are: 1) NA vs IGNORE and bitpattern vs mask are completely independent. Any combination of NA as bitpattern, NA as mask, IGNORE as bitpattern, and IGNORE as mask are reasonable. 2) The idea of masking and the numpy.ma implementation are different. The numpy.ma object makes particular choices about how to interpret the mask, but while backwards compatibility is important, a fresh evaluation of all the design choices going into a mask implementation is worthwhile.

I agree that there has been some confusion due to the terms. However, I continue to believe that the disagreement is substantial and not merely due to confusion. Let us then characterize the substantial discussion as this:

NEP: bitpattern and masked-out values should be made nearly impossible to distinguish in the API.

alterNEP: bitpattern and masked-out values should be distinct in the API so that it can be made clear which is meant (and therefore, implicitly, how they are implemented).

Do you agree that this is the discussion? See you, Matthew ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
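The crux here, whether the two kinds of missingness stay distinguishable, comes down to whether the original value survives. A plain-Python sketch of that difference (no proposed numpy API is assumed):

```python
import math

data = [1.0, 2.0, 3.0]
mask = [False, True, False]    # element 1 is "masked out"

# With a mask, the underlying value is merely hidden from computations...
visible = [v for v, m in zip(data, mask) if not m]
print(visible)        # [1.0, 3.0]
# ...and remains recoverable by clearing the mask.
print(data[1])        # 2.0

# With a bitpattern (NaN used as the stand-in), the slot itself is overwritten:
data[1] = float("nan")
print(math.isnan(data[1]))    # True -- the original 2.0 is gone for good
```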
Re: [Numpy-discussion] using the same vocabulary for missing value ideas
On Wed, Jul 6, 2011 at 5:38 PM, Matthew Brett matthew.br...@gmail.com wrote: Hi, On Wed, Jul 6, 2011 at 4:40 PM, Mark Wiebe mwwi...@gmail.com wrote: It appears to me that one of the biggest reason some of us have been talking past each other in the discussions is that different people have different definitions for the terms being used. Until this is thoroughly cleared up, I feel the design process is tilting at windmills. In the interests of clarity in our discussions, here is a starting point which is consistent with the NEP. These definitions have been added in a glossary within the NEP. If there are any ideas for amendments to these definitions that we can agree on, I will update the NEP with those amendments. Also, if I missed any important terms which need to be added, please propose definitions for them. NA (Not Available) A placeholder for a value which is unknown to computations. That value may be temporarily hidden with a mask, may have been lost due to hard drive corruption, or gone for any number of reasons. This is the same as NA in the R project. Really? Can one implement NA with a mask in R? I thought an NA was always bitpattern in R? I don't think that was what Mark was saying, see this bit later in this email: The most important distinctions I'm trying to draw are: 1) NA vs IGNORE and bitpattern vs mask are completely independent. Any combination of NA as bitpattern, NA as mask, IGNORE as bitpattern, and IGNORE as mask are reasonable. This point as I understood it is there is the semantics of the special values (not available vs ignore), and there is the implementation (bitpattern vs mask), and they are independent. Peter ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] using the same vocabulary for missing value ideas
Hi, On Wed, Jul 6, 2011 at 5:48 PM, Peter numpy-discuss...@maubp.freeserve.co.uk wrote: On Wed, Jul 6, 2011 at 5:38 PM, Matthew Brett matthew.br...@gmail.com wrote: Hi, On Wed, Jul 6, 2011 at 4:40 PM, Mark Wiebe mwwi...@gmail.com wrote: It appears to me that one of the biggest reason some of us have been talking past each other in the discussions is that different people have different definitions for the terms being used. Until this is thoroughly cleared up, I feel the design process is tilting at windmills. In the interests of clarity in our discussions, here is a starting point which is consistent with the NEP. These definitions have been added in a glossary within the NEP. If there are any ideas for amendments to these definitions that we can agree on, I will update the NEP with those amendments. Also, if I missed any important terms which need to be added, please propose definitions for them. NA (Not Available) A placeholder for a value which is unknown to computations. That value may be temporarily hidden with a mask, may have been lost due to hard drive corruption, or gone for any number of reasons. This is the same as NA in the R project. Really? Can one implement NA with a mask in R? I thought an NA was always bitpattern in R? I don't think that was what Mark was saying, see this bit later in this email: I think it would make a difference if there was an implementation that had conflated masking with bitpatterns in terms of API. I don't think R is an example. The most important distinctions I'm trying to draw are: 1) NA vs IGNORE and bitpattern vs mask are completely independent. Any combination of NA as bitpattern, NA as mask, IGNORE as bitpattern, and IGNORE as mask are reasonable. This point as I understood it is there is the semantics of the special values (not available vs ignore), and there is the implementation (bitpattern vs mask), and they are independent. Yes. 
Although, we can see from the implementations that we have to hand that:

a) bitpatterns -> propagation (NaN-like) semantics by default (R)
b) masks -> ignore semantics by default (masked arrays)

I don't think Mark accepts that there is any reason for this tendency of implementations towards particular semantics, but Nathaniel was arguing otherwise in the alterNEP. I think we all accept that it's possible to imagine masking having propagation semantics and bitpatterns having ignore semantics. Cheers, Matthew ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] using the same vocabulary for missing value ideas
On Wed, Jul 6, 2011 at 12:01 PM, Matthew Brett matthew.br...@gmail.comwrote: Hi, On Wed, Jul 6, 2011 at 5:48 PM, Peter numpy-discuss...@maubp.freeserve.co.uk wrote: On Wed, Jul 6, 2011 at 5:38 PM, Matthew Brett matthew.br...@gmail.com wrote: Hi, On Wed, Jul 6, 2011 at 4:40 PM, Mark Wiebe mwwi...@gmail.com wrote: It appears to me that one of the biggest reason some of us have been talking past each other in the discussions is that different people have different definitions for the terms being used. Until this is thoroughly cleared up, I feel the design process is tilting at windmills. In the interests of clarity in our discussions, here is a starting point which is consistent with the NEP. These definitions have been added in a glossary within the NEP. If there are any ideas for amendments to these definitions that we can agree on, I will update the NEP with those amendments. Also, if I missed any important terms which need to be added, please propose definitions for them. NA (Not Available) A placeholder for a value which is unknown to computations. That value may be temporarily hidden with a mask, may have been lost due to hard drive corruption, or gone for any number of reasons. This is the same as NA in the R project. Really? Can one implement NA with a mask in R? I thought an NA was always bitpattern in R? I don't think that was what Mark was saying, see this bit later in this email: I think it would make a difference if there was an implementation that had conflated masking with bitpatterns in terms of API. I don't think R is an example. Of course R is not an example of that. Nothing is. This is merely conceptual. Separate NA from np.NA in Mark's NEP, and you will see his point. Consider it the logical intersection of NA in Mark's NEP and the aNEP. The most important distinctions I'm trying to draw are: 1) NA vs IGNORE and bitpattern vs mask are completely independent. 
Any combination of NA as bitpattern, NA as mask, IGNORE as bitpattern, and IGNORE as mask are reasonable. This point as I understood it is there is the semantics of the special values (not available vs ignore), and there is the implementation (bitpattern vs mask), and they are independent. Yes. Good, that's all Mark's definition guide is trying to do. Although, we can see from the implementations that we have to hand that a) bitpatterns - propagation (NaN-like) semantics by default (R) b) masks - ignore semantics by default (masked arrays) The above is extraneous and out of the scope of Mark's definitions. We are taking this little-by-little. I don't think Mark accepts that there is any reason for this tendency of implementations to semantics, but Nathaniel was arguing otherwise in the alterNEP. Then that is what we will debate *later*, once we establish definitions. I think we all accept that it's possible to imagine masking have propagation semantics and bitpatterns having ignore semantics. Good! I think that is what Mark wanted to get across in this set of definitions. It kinda seems like you are champing at the bit here to continue the debate, but I agree with Mark that after yesterday's discussion, we need to make sure that we have a solid foundation for understanding each other. Ben Root ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] using the same vocabulary for missing value ideas
Ah, semantics... On Jul 6, 2011, at 5:40 PM, Mark Wiebe wrote: NA (Not Available) A placeholder for a value which is unknown to computations. That value may be temporarily hidden with a mask, may have been lost due to hard drive corruption, or gone for any number of reasons. This is the same as NA in the R project. I have a problem with 'temporarily hidden with a mask'. In my mind, the concept of NA carries a notion of permanence. The data is just not available, just as a NaN is just not a number. IGNORE (Skip/Ignore) A placeholder which should be treated by computations as if no value does or could exist there. For sums, this means act as if the value were zero, and for products, this means act as if the value were one. It's as if the array were compressed in some fashion to not include that element. Data temporarily hidden by a mask becomes np.IGNORE. bitpattern A technique for implementing either NA or IGNORE, where a particular set of bit patterns are chosen from all the possible bit patterns of the value's data type to signal that the element is NA or IGNORE. mask A technique for implementing either NA or IGNORE, where a boolean or enum array parallel to the data array is used to signal which elements are NA or IGNORE. numpy.ma The existing implementation of a particular form of masked arrays, which is part of the NumPy codebase. OK with that. The most important distinctions I'm trying to draw are: 1) NA vs IGNORE and bitpattern vs mask are completely independent. Any combination of NA as bitpattern, NA as mask, IGNORE as bitpattern, and IGNORE as mask are reasonable. OK with that. 2) The idea of masking and the numpy.ma implementation are different. The numpy.ma object makes particular choices about how to interpret the mask, but while backwards compatibility is important, a fresh evaluation of all the design choices going into a mask implementation is worthwhile. Indeed. 
___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] using the same vocabulary for missing value ideas
Hi, On Wed, Jul 6, 2011 at 6:11 PM, Benjamin Root ben.r...@ou.edu wrote: On Wed, Jul 6, 2011 at 12:01 PM, Matthew Brett matthew.br...@gmail.com wrote: Hi, On Wed, Jul 6, 2011 at 5:48 PM, Peter numpy-discuss...@maubp.freeserve.co.uk wrote: On Wed, Jul 6, 2011 at 5:38 PM, Matthew Brett matthew.br...@gmail.com wrote: Hi, On Wed, Jul 6, 2011 at 4:40 PM, Mark Wiebe mwwi...@gmail.com wrote: It appears to me that one of the biggest reason some of us have been talking past each other in the discussions is that different people have different definitions for the terms being used. Until this is thoroughly cleared up, I feel the design process is tilting at windmills. In the interests of clarity in our discussions, here is a starting point which is consistent with the NEP. These definitions have been added in a glossary within the NEP. If there are any ideas for amendments to these definitions that we can agree on, I will update the NEP with those amendments. Also, if I missed any important terms which need to be added, please propose definitions for them. NA (Not Available) A placeholder for a value which is unknown to computations. That value may be temporarily hidden with a mask, may have been lost due to hard drive corruption, or gone for any number of reasons. This is the same as NA in the R project. Really? Can one implement NA with a mask in R? I thought an NA was always bitpattern in R? I don't think that was what Mark was saying, see this bit later in this email: I think it would make a difference if there was an implementation that had conflated masking with bitpatterns in terms of API. I don't think R is an example. Of course R is not an example of that. Nothing is. This is merely conceptual. Separate NA from np.NA in Mark's NEP, and you will see his point. Consider it the logical intersection of NA in Mark's NEP and the aNEP. I am trying to work out what you feel the points of discussion are. 
There's surely no point in continuing to debate things we agree on. I don't think anyone disputes (or has ever disputed) that:

- There can be missing data implemented with bitpatterns
- There can be missing data implemented with masks
- Missing data can have propagate semantics
- Missing data can have ignore semantics
- The implementation does not in itself constrain the semantics

Let's not discuss that any more; we all agree. So what do you think is the source of the disagreement? Or are you saying that there should be no disagreement at this stage? Cheers, Matthew ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
On Wed, Jul 6, 2011 at 5:05 AM, Matthew Brett matthew.br...@gmail.com wrote: Hi, Just for reference, I am using this as the latest version of the NEP - I hope it's current: https://github.com/m-paradox/numpy/blob/7b10c9ab1616b9100e98dd2ab80cef639d5b5735/doc/neps/missing-data.rst I'm mostly relaying stuff I said, although generally (please do correct me if I am wrong) I am just re-expressing points that Nathaniel has already made in the alterNEP text and the emails. On Wed, Jul 6, 2011 at 12:46 AM, Christopher Jordan-Squire cjord...@uw.edu wrote: ... Since we only have Mark around Austin until early August, there's also broad agreement that we need to get something done quickly. I think I might have missed that part of the discussion :) I think that might have been mentioned by Travis right before he had to leave for another meeting, which might have been after you'd disconnected. Travis' concern as a member of the numpy community is the desire for something that is broadly applicable and adopted. But as Mark's employer, his concern is to get a more complete and coherent missing data functionality implemented in numpy while Mark is still at Enthought, for use in the problems Enthought and statisticians commonly encounter if nothing else. I feel the need to emphasize the centrality of the assertion by Nathaniel, and agreement by (at least) me, that the NA case (there really is no data) and the IGNORE case (there is data but I'm concealing it from you) are conceptually different, and come from different use-cases. The underlying disagreement returned many times to this fundamental difference between the NEP and alterNEP:

- In the NEP - by design - it is impossible to distinguish between na.NA and na.IGNORE.
- The alterNEP insists you should be able to distinguish.

Mark says something like it's all missing data, there's no reason you should want to distinguish. 
Nathaniel and I were saying the two types of missing do have different use-cases, and it should be possible to distinguish. You might want to choose to treat them the same, but you should be able to see what they are. I returned several times to this (original point by Nathaniel): a[3] = np.NA (what does this mean? Am I altering the underlying array, or a mask? How would I explain this to someone?) We confirmed that, in order to make it difficult to know what your NA is (masked or bit-pattern), Mark has to a) hinder access to the data below the mask and b) prevent direct API access to the masking array. I described this as 'hobbling the API' and Mark thought of it as 'generic programming' (missing is always missing). I asserted that explaining NA to people would be easier if ``a[3] = np.NA`` was direct assignment and altered the array.

BIT PATTERN & MASK IMPLEMENTATIONS FOR NA
-----------------------------------------

The current NEP proposes both mask and bit pattern implementations for missing data. I use the terms bit pattern and parameterized dtype interchangeably, since the parameterized dtype will use a bit pattern for its implementation. The two implementations will support the same functionality with respect to NA, and the implementation details will be largely invisible to the user. Their differences are in the 'extra' features each supports. Two common questions were:

1. Why make two implementations of missing data: one with masks and the other with parameterized dtypes?
2. Why does the implementation using masks have higher priority?

The answers are:

1. The mask implementation is more general and easier to implement and maintain. The bit pattern implementation saves memory, makes interoperability easier, and makes ABI (Application Binary Interface) compatibility easier. Since each has different strengths, the argument is that both should be implemented.
2. The implementation for the parameterized dtypes will rely on the implementation using a mask.

NA VS. IGNORE
-------------

A lot of discussion centered on IGNORE vs. NA types. We take IGNORE in the aNEP sense and NA in the NEP sense. With NA, there is a clear notion of how NA propagates through all basic numpy operations. (e.g., 3 + NA = NA and log(NA) = NA, while NA | True = True.) IGNORE is separate from NA, with different interpretations depending on the use case. IGNORE could mean:

1. Data that is being temporarily ignored. e.g., a possible outlier that is temporarily being removed from consideration.
2. Data that cannot exist. e.g., a matrix representing a grid of water depths for a lake. Since the lake isn't square, some entries will represent land, and so depth will be a meaningless concept for those entries.
3. Using IGNORE to signal a jagged array. e.g., [ [1, 2, IGNORE], [IGNORE, 3, 4] ] should behave exactly the same as [ [1, 2], [3, 4] ]. Though this leaves open how [1,
Re: [Numpy-discussion] ANN: NumPy 1.6.1 release candidate 2
In article cabl7cqhnnjkzk9xnrlvdarsdknwrm4ev0mxdurjsaxq73eb...@mail.gmail.com, Ralf Gommers ralf.gomm...@googlemail.com wrote: On Tue, Jul 5, 2011 at 11:41 PM, Russell E. Owen ro...@uw.edu wrote: In article BANLkTi=LXiTcrv1LgMtP=p9nF8eMr8=+h...@mail.gmail.com, Ralf Gommers ralf.gomm...@googlemail.com wrote: https://sourceforge.net/projects/numpy/files/NumPy/1.6.1rc2/ Will there be a Mac binary for 32-bit pythons (one that is compatible with older versions of MacOS X)? At present I only see a 64-bit 10.6-only version. Yes there will be for the final release (10.4-10.6 compatible). I can't create those on my own computer, so sometimes I don't make them for RCs. I'm glad they will be present for the final release. FYI: I built my own 1.6.1rc2 against Python 2.7.2 (the 32-bit Mac version from python.org). I reproduced a memory error that I've been trying to narrow down. This is ticket 1896: http://projects.scipy.org/numpy/ticket/1896 and the problem is also in 1.6.0. -- Russell ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
Christopher Jordan-Squire wrote: Here's a short-ish summary of the topics discussed in the conference call this afternoon. Thanks, this is great! And thanks to all who participated in the call. 3. Using IGNORE to signal a jagged array. e.g., [ [1, 2, IGNORE], [IGNORE, 3, 4] ] should behave exactly the same as [ [1 , 2] , [3 , 4] ]. whoooa! I actually have been looking for, and thinking about jagged arrays a fair bit lately, so this is kind of exciting, but this looks like a bad idea to me. The above indicates that: a = np.array( [ [1, 2, np.IGNORE], [np.IGNORE, 3, 4] ] ) a[:,1] would yield: array([2, 4]) which seems really wrong -- you've tossed out the location information altogether. (I think it should be: array([2, 3])) I could see a jagged array being represented by IGNOREs all at the END of each row, but putting items in the middle, and shifting things to the left strikes me as a plain old bad idea (and a pain to implement) -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
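Chris's expectation matches what numpy.ma does today: masked slots stay in place, so location information is preserved when slicing. A quick check (the zeros are arbitrary fill values under the mask):

```python
import numpy as np

a = np.ma.masked_array([[1, 2, 0], [0, 3, 4]],
                       mask=[[False, False, True], [True, False, False]])
col = a[:, 1]
print(col.tolist())    # [2, 3] -- positions preserved, not shifted left
```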
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
Hi, On Wed, Jul 6, 2011 at 6:54 PM, Christopher Jordan-Squire cjord...@uw.edu wrote: On Wed, Jul 6, 2011 at 5:05 AM, Matthew Brett matthew.br...@gmail.com wrote: Hi, Just for reference, I am using this as the latest version of the NEP - I hope it's current: https://github.com/m-paradox/numpy/blob/7b10c9ab1616b9100e98dd2ab80cef639d5b5735/doc/neps/missing-data.rst I'm mostly relaying stuff I said, although generally (please do correct me if I am wrong) I am just re-expressing points that Nathaniel has already made in the alterNEP text and the emails. On Wed, Jul 6, 2011 at 12:46 AM, Christopher Jordan-Squire cjord...@uw.edu wrote: ... Since we only have Mark is only around Austin until early August, there's also broad agreement that we need to get something done quickly. I think I might have missed that part of the discussion :) I think that might have been mentioned by Travis right before he had to leave for another meeting, which might have been after you'd disconnected. Travis' concern as a member of a numpy community is the desire for something that is broadly applicable and adopted. But as Mark's employer, his concern is to get a more complete and coherent missing data functionality implemented in numpy while Mark is still at Enthought, for use in the problems Enthought and statisticians commonly encounter if nothing else. Sorry - yes - I wasn't there for all the conversation. Of course (not disagreeing), we must take care to get the API right because it's unlikely to change and will be explaining and supporting it for a long time to come. I feel the need to emphasize the centrality of the assertion by Nathaniel, and agreement by (at least) me, that the NA case (there really is no data) and the IGNORE case (there is data but I'm concealing it from you) are conceptually different, and come from different use-cases. 
The underlying disagreement returned many times to this fundamental difference between the NEP and alterNEP: In the NEP - by design - it is impossible to distinguish between na.NA and na.IGNORE The alterNEP insists you should be able to distinguish. Mark says something like it's all missing data, there's no reason you should want to distinguish. Nathaniel and I were saying the two types of missing do have different use-cases, and it should be possible to distinguish. You might want to chose to treat them the same, but you should be able to see what they are.. I returned several times to this (original point by Nathaniel): a[3] = np.NA (what does this mean? I am altering the underlying array, or a mask? How would I explain this to someone?) We confirmed that, in order to make it difficult to know what your NA is (masked or bit-pattern), Mark has to a) hinder access to the data below the mask and b) prevent direct API access to the masking array. I described this as 'hobbling the API' and Mark thought of it as 'generic programming' (missing is always missing). I asserted that explaining NA to people would be easier if ``a[3] = np.NA`` was direct assignment and altered the array. BIT PATTERN MASK IMPLEMENTATIONS FOR NA -- The current NEP proposes both mask and bit pattern implementations for missing data. I use the terms bit pattern and parameterized dtype interchangeably, since the parameterized dtype will use a bit pattern for its implementation. The two implementations will support the same functionality with respect to NA, and the implementation details will be largely invisible to the user. Their differences are in the 'extra' features each supports. Two common questions were: 1. Why make two implementations of missing data: one with masks and the other with parameterized dtypes? 2. Why does the implementation using masks have higher priority? The answers are: 1. The mask implementation is more general and easier to implement and maintain. 
The bit pattern implementation saves memory, makes interoperability easier, and makes ABI (Application Binary Interface) compatibility easier. Since each has different strengths, the argument is both should be implemented. 2. The implementation for the parameterized dtypes will rely on the implementation using a mask. NA VS. IGNORE - A lot of discussion centered on IGNORE vs. NA types. We take IGNORE in aNEP sense and NA in NEP sense. With NA, there is a clear notion of how NA propagates through all basic numpy operations. (e.g., 3+NA=NA and log(NA) = NA, while NA | True = True.) IGNORE is separate from NA, with different interpretations depending on the use case. IGNORE could mean: 1. Data that is being temporarily ignored. e.g., a possible outlier that is temporarily being removed from consideration. 2. Data that cannot exist. e.g., a matrix representing a grid of water
Re: [Numpy-discussion] HPC missing data - was: NA/Missing Data Conference Call Summary
On Wed, Jul 6, 2011 at 6:12 AM, Dag Sverre Seljebotn d.s.seljeb...@astro.uio.no wrote: What I'm saying is that Mark's proposal is more flexible. Say for the sake of the argument that I have two codes I need to interface with:

- Library A is written in Fortran and uses a separate (explicit) mask array for NA
- Library B runs on a GPU and uses a bit pattern for NA

Have you ever encountered any such codes? I'm not aware of any code outside of R that implements the proposed NA semantics -- esp. in high-performance code, people generally want to avoid lots of conditionals, and the proposed NA semantics require a branch around every operation inside your inner loops. Certainly there is code out there that uses NaNs, and code that uses masks (in various ways that might or might not match the way the NEP uses them). And it's easy to work with both from numpy right now. The question is whether and how the core should add some tricky and subtle semantics for a few very specific ways of handling NaN-like objects and masking. Upthread you also wrote: At least I feel that the transparency of NumPy is a huge part of its current success. Many more than me spend half their time in C/Fortran and half their time in Python. It's exactly this transparency that worries Matthew and me -- we feel that the alterNEP preserves it, and the NEP attempts to erase it. In the NEP, there are two totally different underlying data structures, but this difference is blurred at the Python level. The idea is that you shouldn't have to think about which you have, but if you work with C/Fortran, then of course you do have to be constantly aware of the underlying implementation anyway. And operations which would obviously make sense for some of the objects that you know you're working with (e.g., unmasking elements from a masked array, or even accessing the mask directly using numpy slicing) are disallowed, specifically in order to make this distinction harder to make. 
According to the NEP, C code that takes a masked array should never ever unmask any element; unmasking should only be done by making a full copy of the mask, and attaching it to a new view taken from the original array. Would you honestly feel obliged to follow this requirement in your C code? Or would you just unmask elements in place when it made sense, in order to save memory? -- Nathaniel ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
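Nathaniel's point about conditionals can be made concrete with a sketch of an NA-propagating reduction over a separate mask array (plain Python standing in for a C inner loop; None is used as a stand-in for NA):

```python
def na_sum(data, mask):
    """Sum data, propagating NA: any masked element makes the whole result NA."""
    total = 0.0
    for value, is_na in zip(data, mask):
        if is_na:          # the per-element branch inside the inner loop
            return None    # None stands in for NA here
        total += value
    return total

print(na_sum([1.0, 2.0, 3.0], [False, False, False]))  # 6.0
print(na_sum([1.0, 2.0, 3.0], [False, True, False]))   # None
```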
Re: [Numpy-discussion] using the same vocabulary for missing value ideas
On Wed, Jul 6, 2011 at 10:44 AM, Matthew Brett matthew.br...@gmail.comwrote: Hi, On Wed, Jul 6, 2011 at 6:11 PM, Benjamin Root ben.r...@ou.edu wrote: On Wed, Jul 6, 2011 at 12:01 PM, Matthew Brett matthew.br...@gmail.com wrote: Hi, On Wed, Jul 6, 2011 at 5:48 PM, Peter numpy-discuss...@maubp.freeserve.co.uk wrote: On Wed, Jul 6, 2011 at 5:38 PM, Matthew Brett matthew.br...@gmail.com wrote: Hi, On Wed, Jul 6, 2011 at 4:40 PM, Mark Wiebe mwwi...@gmail.com wrote: It appears to me that one of the biggest reason some of us have been talking past each other in the discussions is that different people have different definitions for the terms being used. Until this is thoroughly cleared up, I feel the design process is tilting at windmills. In the interests of clarity in our discussions, here is a starting point which is consistent with the NEP. These definitions have been added in a glossary within the NEP. If there are any ideas for amendments to these definitions that we can agree on, I will update the NEP with those amendments. Also, if I missed any important terms which need to be added, please propose definitions for them. NA (Not Available) A placeholder for a value which is unknown to computations. That value may be temporarily hidden with a mask, may have been lost due to hard drive corruption, or gone for any number of reasons. This is the same as NA in the R project. Really? Can one implement NA with a mask in R? I thought an NA was always bitpattern in R? I don't think that was what Mark was saying, see this bit later in this email: I think it would make a difference if there was an implementation that had conflated masking with bitpatterns in terms of API. I don't think R is an example. Of course R is not an example of that. Nothing is. This is merely conceptual. Separate NA from np.NA in Mark's NEP, and you will see his point. Consider it the logical intersection of NA in Mark's NEP and the aNEP. 
I am trying to work out what you feel the points of discussion are. There's surely no point in continuing to debate things we agree on. I don't think anyone disputes (or has ever disputed) that: There can be missing data implemented with bitpatterns. There can be missing data implemented with masks. Missing data can have propagate semantics. Missing data can have ignore semantics. The implementation does not in itself constrain the semantics. So, to be clear, is your concern that you want to be able to tell the difference between whether an np.NA comes from the bit pattern or the mask in its implementation? But why would you have both the parameterized dtype and the mask implementation at the same time? They implement the same abstraction. Is your desire that the np.NA's are implemented solely through bit patterns and np.IGNORE is implemented solely through masks? So that you can think of the masks as being IGNORE flags? What if you want multiple types of IGNORE? (To ignore certain values because they're outliers, others because the data wouldn't make sense, and others because you're just focusing on a particular subgroup, for instance.) A related question is whether the IGNORE values could just be another NA value? I don't understand what the specific problem would be with having several NA values, say NA(1), NA(2), ..., and then letting the user decide that NA(1) means NA in the sense discussed above and NA(2) means IGNORE. Then the ufuncs could be told whether to ignore or propagate each type of NA value. Could you explain to me if this would resolve your concerns about NA/IGNORE, or possibly give a few examples if it doesn't? Because I am still rather confused. Let's not discuss that any more; we all agree. So what do you think is the source of the disagreement? Or are you saying that there should be no disagreement at this stage?
Cheers, Matthew
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
Dag Sverre Seljebotn wrote: Here's an HPC perspective...: At least I feel that the transparency of NumPy is a huge part of its current success. Many more than me spend half their time in C/Fortran and half their time in Python. Absolutely -- and this point has been raised a couple times in the discussion, so I hope it is not forgotten. I tend to look at NumPy this way: Assuming you have some data in memory (possibly loaded by a C or Fortran library), (almost) no matter how it is allocated, ordered, packed, aligned -- there's a way to find strides and dtypes to put a nice NumPy wrapper around it and use the memory from Python. And vice-versa -- assuming you have some data in numpy arrays, there's a way to process it with a C or Fortran library without copying the data. And this is where I am skeptical of the bit-pattern idea -- while one can expect C and Fortran and GPU, and ??? to understand NaNs for floating point data, is there any support in compilers or hardware for special bit patterns for NA values in integers? I've never seen it in my (very limited) experience. Maybe having the mask option, too, will make that irrelevant, but I want to be clear about that kind of use case. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov
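Chris's point about integers is exactly where bit patterns get awkward: there is no hardware-blessed NaN for integer dtypes, so the sentinel has to be checked in software. R, for example, reserves INT_MIN as the bit pattern for NA_integer_. A sketch of what that software-level convention looks like (the sentinel mirrors R's choice; nothing in today's numpy knows about it):

```python
import numpy as np

NA_INT32 = np.int32(-2**31)          # R's NA_integer_ bit pattern (INT_MIN)

a = np.array([1, 5, 3], dtype=np.int32)
a[1] = NA_INT32                      # mark the element as NA by convention

is_na = a == NA_INT32                # every operation needs this explicit test
assert a[~is_na].sum() == 4          # "skip the NAs" has to be done by hand
```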
Re: [Numpy-discussion] using the same vocabulary for missing value ideas
Mark Wiebe wrote: 1) NA vs IGNORE and bitpattern vs mask are completely independent. Any combination of NA as bitpattern, NA as mask, IGNORE as bitpattern, and IGNORE as mask are reasonable. Is this really true? If you use a bitpattern for IGNORE, haven't you just lost the ability to get the original value back if you want to stop ignoring it? Maybe that's not inherent to what an IGNORE means, but it seems pretty key to me. -Chris
Re: [Numpy-discussion] using the same vocabulary for missing value ideas
On Wed, Jul 6, 2011 at 12:01 PM, Matthew Brett matthew.br...@gmail.comwrote: Hi, On Wed, Jul 6, 2011 at 5:48 PM, Peter numpy-discuss...@maubp.freeserve.co.uk wrote: On Wed, Jul 6, 2011 at 5:38 PM, Matthew Brett matthew.br...@gmail.com wrote: Hi, On Wed, Jul 6, 2011 at 4:40 PM, Mark Wiebe mwwi...@gmail.com wrote: It appears to me that one of the biggest reason some of us have been talking past each other in the discussions is that different people have different definitions for the terms being used. Until this is thoroughly cleared up, I feel the design process is tilting at windmills. In the interests of clarity in our discussions, here is a starting point which is consistent with the NEP. These definitions have been added in a glossary within the NEP. If there are any ideas for amendments to these definitions that we can agree on, I will update the NEP with those amendments. Also, if I missed any important terms which need to be added, please propose definitions for them. NA (Not Available) A placeholder for a value which is unknown to computations. That value may be temporarily hidden with a mask, may have been lost due to hard drive corruption, or gone for any number of reasons. This is the same as NA in the R project. Really? Can one implement NA with a mask in R? I thought an NA was always bitpattern in R? I don't think that was what Mark was saying, see this bit later in this email: I think it would make a difference if there was an implementation that had conflated masking with bitpatterns in terms of API. I don't think R is an example. This reminds me of another confusion I've seen in the list. I'd like to suggest that we ban the word API by itself from the present discussion, and always specify Python API or C API for clarity's sake. Here are my suggested definitions for these two terms: Python API All the interface mechanisms that are exposed to Python code for using missing values in NumPy. 
This API is designed to be Pythonic and fit into the way NumPy works as much as possible. C API All the implementation mechanisms exposed for CPython extensions written in C that want to support NumPy missing value support. This API is designed to be as natural as possible in C, and usually prioritizes flexibility and high performance. Before we proceed to any discussion of what are good/bad choices, I really want to nail this down from just the definition perspective. I don't want arbitrary choices baked into the terms we use, because that implies already having made a design decision. -Mark The most important distinctions I'm trying to draw are: 1) NA vs IGNORE and bitpattern vs mask are completely independent. Any combination of NA as bitpattern, NA as mask, IGNORE as bitpattern, and IGNORE as mask are reasonable. This point as I understood it is there is the semantics of the special values (not available vs ignore), and there is the implementation (bitpattern vs mask), and they are independent. Yes. Although, we can see from the implementations that we have to hand that a) bitpatterns - propagation (NaN-like) semantics by default (R) b) masks - ignore semantics by default (masked arrays). I don't think Mark accepts that there is any reason for this tendency of implementations to semantics, but Nathaniel was arguing otherwise in the alterNEP. I think we all accept that it's possible to imagine masking having propagation semantics and bitpatterns having ignore semantics. Cheers, Matthew
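Matthew's a)/b) tendency can be seen with tools numpy already ships: NaN (a bit pattern) propagates through reductions by default, while numpy.ma (a mask) ignores by default, yet neither pairing is forced. A sketch:

```python
import numpy as np

# bitpattern: the special value lives in the data itself
bp = np.array([1.0, np.nan, 3.0])
assert np.isnan(bp.sum())            # propagates by default, R/NaN style

# mask: a parallel boolean array marks the special elements
mk = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False])
assert mk.sum() == 4.0               # ignores by default, masked-array style

# but the pairing is a convention, not a law: np.nansum gives a
# bitpattern ignore semantics, i.e. implementation != semantics
assert np.nansum(bp) == 4.0
```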
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
Christopher Jordan-Squire wrote: If we follow those rules for IGNORE for all computations, we sometimes get some weird output. For example: [ [1, 2], [3, 4] ] * [ IGNORE, 7] = [ 15, 31 ]. (Where * is matrix multiply and not * with broadcasting.) Or should that sort of operation throw an error? That should throw an error -- matrix computation is heavily influenced by the shape and size of matrices, so I think IGNOREs really don't make sense there. Nathaniel Smith wrote: It's exactly this transparency that worries Matthew and me -- we feel that the alterNEP preserves it, and the NEP attempts to erase it. In the NEP, there are two totally different underlying data structures, but this difference is blurred at the Python level. The idea is that you shouldn't have to think about which you have, but if you work with C/Fortran, then of course you do have to be constantly aware of the underlying implementation anyway. I don't think this bothers me -- I think it's analogous to things in numpy like Fortran order and non-contiguous arrays -- you can ignore all that when working in pure python when performance isn't critical, but you need a deeper understanding if you want to work with the data in C or Fortran or to tune performance in python. So as long as there is an API to query and control how things work, I like that it's hidden from simple python code. -Chris
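The weird [15, 31] comes from applying the proposed "for products, act as if the value were one" rule literally before the matrix multiply. A sketch reproducing it (NaN stands in for IGNORE, since no released numpy has an IGNORE value):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
v = np.array([np.nan, 7.0])             # NaN plays the role of IGNORE here

# "for products, act as if the value were one"
v_filled = np.where(np.isnan(v), 1.0, v)
result = A @ v_filled
assert result.tolist() == [15.0, 31.0]  # 1*1 + 2*7, 3*1 + 4*7
```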
Re: [Numpy-discussion] HPC missing data - was: NA/Missing Data Conference Call Summary
On 07/06/2011 08:10 PM, Nathaniel Smith wrote: On Wed, Jul 6, 2011 at 6:12 AM, Dag Sverre Seljebotn d.s.seljeb...@astro.uio.no wrote: What I'm saying is that Mark's proposal is more flexible. Say for the sake of the argument that I have two codes I need to interface with: - Library A is written in Fortran and uses a separate (explicit) mask array for NA - Library B runs on a GPU and uses a bit pattern for NA Have you ever encountered any such codes? I'm not aware of any code outside of R that implements the proposed NA semantics -- esp. in high-performance code, people generally want to avoid lots of conditionals, and the proposed NA semantics require a branch around every operation inside your inner loops. I'll admit that this whole thing was a hypothetical exercise. I've interfaced with Fortran code with NA values -- not a high performance case, but not all you interface with is high performance. Certainly there is code out there that uses NaNs, and code that uses masks (in various ways that might or might not match the way the NEP uses them). And it's easy to work with both from numpy right now. The question is whether and how the core should add some tricky and subtle semantics for a few very specific ways of handling NaN-like objects and masking. I don't disagree with this. It's exactly this transparency that worries Matthew and me -- we feel that the alterNEP preserves it, and the NEP attempts to erase it. In the NEP, there are two totally different underlying data structures, but this difference is blurred at the Python level. The idea is that you shouldn't have to think about which you have, but if you work with C/Fortran, then of course you do have to be constantly aware of the underlying implementation anyway.
And operations which would obviously make sense for some of the objects that you know you're working with (e.g., unmasking elements from a masked array, or even accessing the mask directly using numpy slicing) are disallowed, specifically in order to make this distinction harder to make. This worries me too. What I was thinking is that it could be sort of like indexing -- it works OK to have indexing be transparent in Python-land with respect to striding, and have a contiguous array be just a special case marked by an attribute. If you want, you can still check the strides or flags attributes. According to the NEP, C code that takes a masked array should never ever unmask any element; unmasking should only be done by making a full copy of the mask, and attaching it to a new view taken from the original array. Would you honestly feel obliged to follow this requirement in your C code? Or would you just unmask elements in place when it made sense, in order to save memory? I'm with you on this one: I wouldn't adopt any NumPy feature widely unless I had totally transparent access to the underlying implementation details from C -- without relying on any NumPy headers (except in my Cython wrappers)! I don't believe in APIs, I believe in standardized binary data. But I always assumed that could be done down the road, once the internal details had stabilized. As for myself, I'll admit that I'll almost certainly continue with explicit masking without using any of the proposed NEPs -- I have to be extremely aware of the masks in the statistical methods I use. Perhaps that's a sign I should withdraw from the discussion. Dag Sverre
Re: [Numpy-discussion] using the same vocabulary for missing value ideas
On Wed, Jul 6, 2011 at 11:33 AM, Peter numpy-discuss...@maubp.freeserve.co.uk wrote: On Wed, Jul 6, 2011 at 4:40 PM, Mark Wiebe mwwi...@gmail.com wrote: It appears to me that one of the biggest reasons some of us have been talking past each other in the discussions is that different people have different definitions for the terms being used. Until this is thoroughly cleared up, I feel the design process is tilting at windmills. In the interests of clarity in our discussions, here is a starting point which is consistent with the NEP. These definitions have been added in a glossary within the NEP. If there are any ideas for amendments to these definitions that we can agree on, I will update the NEP with those amendments. Also, if I missed any important terms which need to be added, please propose definitions for them. That sounds good - I've only been scanning these discussions and it is confusing. NA (Not Available) A placeholder for a value which is unknown to computations. That value may be temporarily hidden with a mask, may have been lost due to hard drive corruption, or gone for any number of reasons. This is the same as NA in the R project. Could you expand that to say how sums and products act with NA (since you do so for the IGNORE case). I've added that, here's the new version: NA (Not Available) A placeholder for a value which is unknown to computations. That value may be temporarily hidden with a mask, may have been lost due to hard drive corruption, or gone for any number of reasons. For sums and products this means to produce NA if any of the inputs are NA. This is the same as NA in the R project. Thanks, -Mark Thanks, Peter
Re: [Numpy-discussion] using the same vocabulary for missing value ideas
On Wed, Jul 6, 2011 at 11:38 AM, Matthew Brett matthew.br...@gmail.comwrote: Hi, On Wed, Jul 6, 2011 at 4:40 PM, Mark Wiebe mwwi...@gmail.com wrote: It appears to me that one of the biggest reason some of us have been talking past each other in the discussions is that different people have different definitions for the terms being used. Until this is thoroughly cleared up, I feel the design process is tilting at windmills. In the interests of clarity in our discussions, here is a starting point which is consistent with the NEP. These definitions have been added in a glossary within the NEP. If there are any ideas for amendments to these definitions that we can agree on, I will update the NEP with those amendments. Also, if I missed any important terms which need to be added, please propose definitions for them. NA (Not Available) A placeholder for a value which is unknown to computations. That value may be temporarily hidden with a mask, may have been lost due to hard drive corruption, or gone for any number of reasons. This is the same as NA in the R project. Really? Can one implement NA with a mask in R? I thought an NA was always bitpattern in R? IGNORE (Skip/Ignore) A placeholder which should be treated by computations as if no value does or could exist there. For sums, this means act as if the value were zero, and for products, this means act as if the value were one. It's as if the array were compressed in some fashion to not include that element. bitpattern A technique for implementing either NA or IGNORE, where a particular set of bit patterns are chosen from all the possible bit patterns of the value's data type to signal that the element is NA or IGNORE. mask A technique for implementing either NA or IGNORE, where a boolean or enum array parallel to the data array is used to signal which elements are NA or IGNORE. numpy.ma The existing implementation of a particular form of masked arrays, which is part of the NumPy codebase. 
The most important distinctions I'm trying to draw are: 1) NA vs IGNORE and bitpattern vs mask are completely independent. Any combination of NA as bitpattern, NA as mask, IGNORE as bitpattern, and IGNORE as mask are reasonable. 2) The idea of masking and the numpy.ma implementation are different. The numpy.ma object makes particular choices about how to interpret the mask, but while backwards compatibility is important, a fresh evaluation of all the design choices going into a mask implementation is worthwhile. I agree that there has been some confusion due to the terms. However, I continue to believe that the discussion is substantial and not due to confusion. I believe this is true as well, but the confusion due to the terms appears to be one of the root causes preventing the ideas from getting across. Without first clearing up this aspect of the discussion, things will stay confusing. Let us then characterize the substantial discussion as this: NEP: bitpattern and masked out values should be made nearly impossible to distinguish in the API alterNEP: bitpattern and masked out values should be distinct in the API so that it can be made clear which is meant (and therefore, implicitly, how they are implemented). Do you agree that this is the discussion? I'd like to get agreement on the definitions before moving to any of the points of contention that are being raised. Thanks, -Mark See you, Matthew
Re: [Numpy-discussion] HPC missing data - was: NA/Missing Data Conference Call Summary
On 07/06/2011 04:47 PM, Matthew Brett wrote: Hi, On Wed, Jul 6, 2011 at 2:12 PM, Dag Sverre Seljebotn d.s.seljeb...@astro.uio.no wrote: I just commented on the prevent direct API access to the masking array part -- I'm hoping direct access by external code to the underlying implementation details will be allowed, at some point. What I'm saying is that Mark's proposal is more flexible. Say for the sake of the argument that I have two codes I need to interface with: - Library A is written in Fortran and uses a separate (explicit) mask array for NA - Library B runs on a GPU and uses a bit pattern for NA Mark's proposal then comes closer to allowing me to wrap both codes using NumPy, since it supports both implementation mechanisms. Sure, it would need a separate NEP down the road to extend it, but it goes in the right direction for this to happen. I'm sorry - honestly - maybe it's because I've just had lunch, but I think I am not understanding something. When you say Mark's proposal is more flexible - more flexible than what? I think we agree that: * NA bitpatterns are good to have * masks are good to have and the discussion is about: * should it be possible to distinguish between bitpatterns (NAs) and masks (IGNORE). I guess I just don't agree with these definitions. There's (NA, IGNORE), and there's (bitpatterns, masks); these are in principle orthogonal. It is possible (and perhaps reasonable) to hard-wire them the way you say -- that may be more obvious, user-friendly, etc., but it is not more flexible. Both Mark and Chuck have explicitly supported having many different NA types down the road (thread: An NA compromise idea -- many-NA). So the main difference to me seems to be that you want to hard-wire the NA type and the representation in a specific configuration. I may be missing something though. Are you saying that making it not-possible to distinguish - at the numpy level, is more flexible?
I'm OK with the common ways of accessing data not distinguishing, as long as there's some power-user way around it. Just like strides -- you index a strided array just like a contiguous array, but you can peek inside into the implementation if you want. Dag Sverre
Re: [Numpy-discussion] using the same vocabulary for missing value ideas
On Wed, Jul 6, 2011 at 12:41 PM, Pierre GM pgmdevl...@gmail.com wrote: Ah, semantics... On Jul 6, 2011, at 5:40 PM, Mark Wiebe wrote: NA (Not Available) A placeholder for a value which is unknown to computations. That value may be temporarily hidden with a mask, may have been lost due to hard drive corruption, or gone for any number of reasons. This is the same as NA in the R project. I have a problem with 'temporarily hidden with a mask'. In my mind, the concept of NA carries a notion of perennation. The data is just not available, just as a NaN is just not a number. Yes, this gets directly to what I've been meaning when I say NA vs IGNORE is independent of mask vs bitpattern. The way I'm trying to structure things, NA vs IGNORE only affects the semantic meaning, i.e. the outputs produced by computations. This is precisely why I put 'temporarily hidden with a mask' first, to make that more clear. IGNORE (Skip/Ignore) A placeholder which should be treated by computations as if no value does or could exist there. For sums, this means act as if the value were zero, and for products, this means act as if the value were one. It's as if the array were compressed in some fashion to not include that element. A data temporarily hidden by a mask becomes np.IGNORE. Are you willing to suspend the idea of that implication for the purposes of the present discussion? If not, do you see a way to amend things so that masked NAs and bitpattern-based IGNOREs make sense? Would renaming IGNORE to SKIP be more clear, perhaps? Thanks, Mark bitpattern A technique for implementing either NA or IGNORE, where a particular set of bit patterns are chosen from all the possible bit patterns of the value's data type to signal that the element is NA or IGNORE. mask A technique for implementing either NA or IGNORE, where a boolean or enum array parallel to the data array is used to signal which elements are NA or IGNORE. 
numpy.ma The existing implementation of a particular form of masked arrays, which is part of the NumPy codebase. OK with that. The most important distinctions I'm trying to draw are: 1) NA vs IGNORE and bitpattern vs mask are completely independent. Any combination of NA as bitpattern, NA as mask, IGNORE as bitpattern, and IGNORE as mask are reasonable. OK with that. 2) The idea of masking and the numpy.ma implementation are different. The numpy.ma object makes particular choices about how to interpret the mask, but while backwards compatibility is important, a fresh evaluation of all the design choices going into a mask implementation is worthwhile. Indeed.
Re: [Numpy-discussion] ANN: NumPy 1.6.1 release candidate 2
On 7/6/2011 10:57 AM, Russell E. Owen wrote: In article cabl7cqhnnjkzk9xnrlvdarsdknwrm4ev0mxdurjsaxq73eb...@mail.gmail.com, Ralf Gommers ralf.gomm...@googlemail.com wrote: On Tue, Jul 5, 2011 at 11:41 PM, Russell E. Owen ro...@uw.edu wrote: In article BANLkTi=LXiTcrv1LgMtP=p9nF8eMr8=+h...@mail.gmail.com, Ralf Gommers ralf.gomm...@googlemail.com wrote: https://sourceforge.net/projects/numpy/files/NumPy/1.6.1rc2/ Will there be a Mac binary for 32-bit pythons (one that is compatible with older versions of MacOS X)? At present I only see a 64-bit 10.6-only version. Yes there will be for the final release (10.4-10.6 compatible). I can't create those on my own computer, so sometimes I don't make them for RCs. I'm glad they will be present for the final release. FYI: I built my own 1.6.1rc2 against Python 2.7.2 (the 32-bit Mac version from python.org). I reproduced a memory error that I've been trying to narrow down. This is ticket 1896: http://projects.scipy.org/numpy/ticket/1896 and the problem is also in 1.6.0. -- Russell I can reproduce this error on Windows. It looks like a serious regression. Christoph
Re: [Numpy-discussion] using the same vocabulary for missing value ideas
On Wed, Jul 6, 2011 at 1:25 PM, Christopher Barker chris.bar...@noaa.gov wrote: Mark Wiebe wrote: 1) NA vs IGNORE and bitpattern vs mask are completely independent. Any combination of NA as bitpattern, NA as mask, IGNORE as bitpattern, and IGNORE as mask are reasonable. Is this really true? If you use a bitpattern for IGNORE, haven't you just lost the ability to get the original value back if you want to stop ignoring it? Maybe that's not inherent to what an IGNORE means, but it seems pretty key to me. What do you think of renaming IGNORE to SKIP? -Mark
Re: [Numpy-discussion] using the same vocabulary for missing value ideas
On 07/06/2011 08:25 PM, Christopher Barker wrote: Mark Wiebe wrote: 1) NA vs IGNORE and bitpattern vs mask are completely independent. Any combination of NA as bitpattern, NA as mask, IGNORE as bitpattern, and IGNORE as mask are reasonable. Is this really true? If you use a bitpattern for IGNORE, haven't you just lost the ability to get the original value back if you want to stop ignoring it? Maybe that's not inherent to what an IGNORE means, but it seems pretty key to me. There's the question of how reductions treat the value. IIUC, IGNORE as bitpattern would imply that reductions treat the value as 0, which is a question orthogonal to whether the value can possibly be unmasked or not. Dag Sverre
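Chris's worry is easy to demonstrate with NaN as the stand-in bit pattern: writing the pattern destroys the stored value, even though reductions can still skip it, which is Dag's orthogonality point. A sketch:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
a[1] = np.nan               # IGNORE as a bit pattern overwrites the 2.0

assert np.nansum(a) == 4.0  # the reduction can treat it as 0 for sums...
# ...but there is no way to "stop ignoring" a[1]: the original value
# is gone from memory, unlike with a mask kept in a parallel array.
```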
[Numpy-discussion] towards a more productive missing values/masked arrays discussion...
So one thing that came up on the call yesterday is that there actually is a significant chunk of functionality that everyone seems to agree is useful, needed, and basically how it should work. This includes: -- the basic existence and semantics for NA values (however this is implemented) -- that there should exist a dtype/bit-pattern implementation for NAs (whatever other implementations there might also be) -- that ufuncs should take a where= argument -- that there should be a better way for ndarray subclasses like numpy.ma to override the arguments to ufuncs involving them -- maybe some other things I'm not thinking of The real controversy is around what role masking should play, both at the API and implementation level; there are lots of different arguments for different approaches, and it's not at all clear any current proposal will actually solve the problems we are facing (or even what those problems are). So rather than continue to go around in circles indefinitely on that, I'm going to write up some miniNEPs just focusing on the details of how the features we do agree on should work, so we can hopefully have a more technical discussion of *that*. Cheers, -- Nathaniel
Re: [Numpy-discussion] towards a more productive missing values/masked arrays discussion...
On Wed, Jul 6, 2011 at 2:20 PM, Nathaniel Smith n...@pobox.com wrote: So one thing that came up on the call yesterday is that there actually is a significant chunk of functionality that everyone seems to agree is useful, needed, and basically how it should work. This includes: -- the basic existence and semantics for NA values (however this is implemented) -- that there should exist a dtype/bit-pattern implementation for NAs (whatever other implementations there might also be) -- that ufuncs should take a where= argument -- that there should be a better way for ndarray subclasses like numpy.ma to override the arguments to ufuncs involving them -- maybe some other things I'm not thinking of The real controversy is around what role masking should play, both at the API and implementation level; there are lots of different arguments for different approaches, and it's not at all clear any current proposal will actually solve the problems we are facing (or even what those problems are). So rather than continue to go around in circles indefinitely on that, I'm going to write up some miniNEPs just focusing on the details of how the features we do agree on should work, so we can hopefully have a more technical discussion of *that*. That sounds alright to me. One thing I would like to ask is to please adopt the vocabulary we are discussing, using it exactly as defined so that people reading all the various ideas don't have to readjust when switching between documents. Thanks, Mark Cheers, -- Nathaniel
[Numpy-discussion] miniNEP1: where= argument for ufuncs
Here's the master copy: https://gist.github.com/1068056 But for your commenting convenience, I'll include the current text here:

A mini-NEP for the where= argument to ufuncs
============================================

To try and make more progress on the whole missing values/masked arrays/... debate, it seems useful to have a more technical discussion of the pieces which we *can* agree on. This is the first, which attempts to nail down the details of the new ``where=`` argument to ufuncs.

Rationale
---------

It is often useful to apply operations to a subset of your data, and numpy provides a rich interface for accomplishing this by combining indexing operations with ufunc operations, e.g.::

  a[10, mymask] += b
  np.sum(a[which_indices], axis=0)

But any kind of complex indexing necessarily requires making a temporary copy of (parts of) the underlying array, which can be quite expensive, and this copying could be avoided by teaching the ufunc loop to 'index as it goes'.

There are strong arguments against doing this. There are tons of cases like this where one can save some memory by avoiding temporaries, and we can't build them all into the core -- especially since we also have more general solutions like numexpr or writing optimized routines in C/Fortran/Cython. Furthermore, this case is a clear violation of orthogonality -- we already have indexing and ufuncs as separate things, so adding a second, somewhat crippled implementation of indexing to ufuncs themselves is a bit ugly. (It would be better if we could make sure that anything that could be passed to ndarray.__getitem__ could also be passed to ufuncs with the same semantics, but this would require substantial refactoring and seems unlikely to be implemented any time soon.) However,

API
---

A new optional keyword argument named ``where=`` will be added to all ufuncs.

Error checking
~~~~~~~~~~~~~~

If given, this argument must be a boolean array. If ``f`` is a ufunc, then given a function call like::

  f(a, b, where=mymask)

the following occurs.
First, ``mymask`` is coerced to an array if necessary, but no type conversion is performed. (I.e., we do ``np.asarray(mymask)``.) Next, we check whether ``mymask`` is a boolean array. If it is not, then we raise an exception. (In the future it would be nice to support other forms of indexing as well, such as lists of slices or arrays of integer indices. In order to preserve this option, we do not want to coerce integers into booleans.) Next, ``a`` and ``b`` are broadcast against each other, just as now; this determines the shape of the output array. Then ``mymask`` is broadcast to match this output array shape. (The shape of the output array cannot be changed by this process -- for example, having ``a.shape == (10, 1, 1)``, ``b.shape == (1, 10, 1)``, ``mymask.shape == (1, 1, 10)`` will raise an error rather than returning a new array with shape ``(10, 10, 10)``.)

Semantics: ufunc ``__call__``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When simply calling a ufunc with an output argument, e.g.::

  f(a, b, out=c, where=mymask)

then the result is equivalent to::

  c[mymask] = f(a[mymask], b[mymask])

On the other hand, if no output argument is given::

  f(a, b, where=mymask)

then an output array is instantiated as if by calling ``np.empty(shape, dtype=dtype)``, and then treated as above::

  c = np.empty(shape_for(a, b), dtype=dtype_for(f, a, b))
  f(a, b, out=c, where=mymask)
  return c

Note that this means that the output will, in general, contain uninitialized values.

Semantics: ufunc ``.reduce``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Take an expression like::

  f.reduce(a, axis=0, where=mymask)

This performs the given reduction operation along each column of ``a``, but simply skips any elements where the corresponding entry in ``mymask`` is false. (For ufuncs which have an identity, this is equivalent to treating the given elements as if they were the identity.)
For example, if ``a`` is a 2-dimensional array, then skipping over the details of broadcasting, dtype selection, etc., the above operation produces the same result as::

  out = np.empty(a.shape[1])
  for i in xrange(a.shape[1]):
      out[i] = f.reduce(a[mymask[:, i], i])
  return out

Semantics: ufunc ``.accumulate``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Accumulation is similar to reduction, except that ``.accumulate`` saves the intermediate values generated during the reduction loop. Therefore we use the same semantics as for ``.reduce`` above. If ``a`` is 2-d, etc., then this expression::

  f.accumulate(a, axis=0, where=mymask)

is equivalent to::

  out = np.empty(a.shape)
  for i in xrange(a.shape[1]):
      out[mymask[:, i], i] = f.accumulate(a[mymask[:, i], i])
  return out

Notice that once again, elements of ``out`` which correspond to False entries in the mask are left uninitialized.
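[Editorial note: the ``__call__`` and ``.reduce`` semantics above can be modeled directly with boolean indexing. The sketch below illustrates the proposed behavior only, not the eventual C implementation, which would avoid these temporaries; the function names are illustrative.]

```python
import numpy as np

def call_with_where(f, a, b, out=None, where=None):
    """Boolean-indexing model of f(a, b, out=..., where=mymask)."""
    a, b = np.asarray(a), np.asarray(b)
    mask = np.asarray(where)              # no type conversion is performed
    if mask.dtype != np.bool_:
        raise TypeError("where= must be a boolean array")
    shape = np.broadcast(a, b).shape      # a and b alone fix the output shape
    mask = np.broadcast_to(mask, shape)   # raises if mask would enlarge it
    if out is None:
        out = np.empty(shape, dtype=np.result_type(a, b))  # uninitialized!
    out[mask] = f(np.broadcast_to(a, shape)[mask],
                  np.broadcast_to(b, shape)[mask])
    return out

def reduce_with_where(f, a, mymask):
    """Column-wise model of f.reduce(a, axis=0, where=mymask)."""
    a, mymask = np.asarray(a), np.asarray(mymask)
    out = np.empty(a.shape[1], dtype=a.dtype)
    for i in range(a.shape[1]):
        out[i] = f.reduce(a[mymask[:, i], i])  # skip masked-out rows
    return out

c = np.zeros(4)
call_with_where(np.add, [1, 2, 3, 4], 10, out=c,
                where=np.array([True, False, True, False]))
# c == [11., 0., 13., 0.]: unselected slots keep their previous contents

a = np.array([[1, 2], [3, 4], [5, 6]])
mymask = np.array([[True, True], [False, True], [True, False]])
reduce_with_where(np.add, a, mymask)
# column 0 reduces rows 0 and 2; column 1 reduces rows 0 and 1
```

Note how, when ``out`` is freshly allocated with ``np.empty``, the entries outside the mask really are uninitialized, exactly as the text warns.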
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
On Wed, Jul 6, 2011 at 11:38 AM, Christopher Barker chris.bar...@noaa.gov wrote: Christopher Jordan-Squire wrote: If we follow those rules for IGNORE for all computations, we sometimes get some weird output. For example: [ [1, 2], [3, 4] ] * [ IGNORE, 7] = [ 15, 31 ]. (Where * is matrix multiply and not * with broadcasting.) Or should that sort of operation throw an error? That should throw an error -- matrix computation is heavily influenced by the shape and size of matrices, so I think IGNOREs really don't make sense there. If the IGNOREs don't make sense in basic numpy computations then I'm kinda confused why they'd be included at the numpy core level. Nathaniel Smith wrote: It's exactly this transparency that worries Matthew and me -- we feel that the alterNEP preserves it, and the NEP attempts to erase it. In the NEP, there are two totally different underlying data structures, but this difference is blurred at the Python level. The idea is that you shouldn't have to think about which you have, but if you work with C/Fortran, then of course you do have to be constantly aware of the underlying implementation anyway. I don't think this bothers me -- I think it's analogous to things in numpy like Fortran order and non-contiguous arrays -- you can ignore all that when working in pure python when performance isn't critical, but you need a deeper understanding if you want to work with the data in C or Fortran or to tune performance in python. So as long as there is an API to query and control how things work, I like that it's hidden from simple python code. -Chris I'm similarly not too concerned about it. Performance seems finicky when you're dealing with missing data, since a lot of arrays will likely have to be copied over to other arrays containing only complete data before being handed over to BLAS. My primary concern is that the np.NA stuff 'just works'.
Especially since I've never run into use cases in statistics where the difference between IGNORE and NA mattered. -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
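[Editorial note: the [15, 31] in the example falls out of one particular reading of the IGNORE rules -- "skip the multiply, keep the other operand" for elementwise products -- whereas dropping the IGNOREd entries from the dot product entirely gives [14, 28]. A sketch, not from the original posts; None stands in for IGNORE:]

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
v = [None, 7]          # None plays the role of IGNORE here

# Reading 1: a * IGNORE "skips the multiply", leaving a; the sum skips nothing.
rule1 = [sum(a if x is None else a * x for a, x in zip(row, v)) for row in A]
# rule1 == [15, 31] -- the weird output from the example

# Reading 2: drop IGNOREd entries from the dot product entirely.
keep = [i for i, x in enumerate(v) if x is not None]
rule2 = A[:, keep].dot([v[i] for i in keep])
# rule2 == [14, 28]
```

That two defensible readings give different answers is exactly why raising an error for matrix operations is attractive.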
Re: [Numpy-discussion] HPC missing data - was: NA/Missing Data Conference Call Summary
On Wed, Jul 6, 2011 at 8:12 AM, Dag Sverre Seljebotn d.s.seljeb...@astro.uio.no wrote: snip I just commented on the prevent direct API access to the masking array part -- I'm hoping direct access by external code to the underlying implementation details will be allowed, at some point. I think direct or nearly direct access needs to be in right away, unless we're fairly sure that we will change low-level implementation details in the near future. I've added Python API and C API definitions for us to use to try and clear up this kind of potential confusion. -Mark

What I'm saying is that Mark's proposal is more flexible. Say for the sake of argument that I have two codes I need to interface with:

- Library A is written in Fortran and uses a separate (explicit) mask array for NA
- Library B runs on a GPU and uses a bit pattern for NA

Mark's proposal then comes closer to allowing me to wrap both codes using NumPy, since it supports both implementation mechanisms. Sure, it would need a separate NEP down the road to extend it, but it goes in the right direction for this to happen. As for NA vs. IGNORE, I still think 2 types is too little. One should allow for 255 different NA values, each with user-defined behaviour. Again, Mark's proposal then makes a good start on that, even if more work would be needed to make it happen. I.e., in my perfect world I'd do this to wrap library A (Cython-ish pseudo-code)::

  def call_lib_A():
      ...
      lib_A_function(arraybuf, maskbuf, ...)
      # behaviour could also be zero, invalid
      DOG_ATE_IT = np.NA(DOG_ATE_IT, value=42, behaviour=raise)
      missing_value_map = {0xAF: np.NA, 0x43: np.IGNORE, 0xF0: DOG_ATE_IT}
      result = np.PyArray_CreateArrayFromBufferWithMaskBuffer(
          arraybuf, maskbuf, missing_value_map, ...)
      return result

  def call_lib_B():
      lib_B_function(arraybuf, ...)
      missing_value_patterns = {0xCACA: np.NA}
      result = np.PyArray_CreateArrayFromBufferWithBitPattern(
          arraybuf, maskbuf, missing_value_patterns, ...)
      return result

Hope that is clearer.
Again, my intention is not to suggest even more work at the present stage, just to state some advantages of the general direction of Mark's proposal. Dag Sverre ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] towards a more productive missing values/masked arrays discussion...
It'd be easier to follow if you just made changes/suggestions on github to Mark's NEP directly. (You can check out Mark's missing data branch to get the NEP.) Then I'll be able to focus on the ways the suggestions differ from or complement the current NEP. -Chris Jordan-Squire On Wed, Jul 6, 2011 at 12:24 PM, Mark Wiebe mwwi...@gmail.com wrote: On Wed, Jul 6, 2011 at 2:20 PM, Nathaniel Smith n...@pobox.com wrote: So one thing that came up on the call yesterday is that there actually is a significant chunk of functionality that everyone seems to agree is useful, needed, and basically how it should work. This includes: -- the basic existence and semantics for NA values (however this is implemented) -- that there should exist a dtype/bit-pattern implementation for NAs (whatever other implementations there might also be) -- that ufuncs should take a where= argument -- that there should be a better way for ndarray subclasses like numpy.ma to override the arguments to ufuncs involving them -- maybe some other things I'm not thinking of The real controversy is around what role masking should play, both at the API and implementation level; there are lots of different arguments for different approaches, and it's not at all clear any current proposal will actually solve the problems we are facing (or even what those problems are). So rather than continue to go around in circles indefinitely on that, I'm going to write up some miniNEPs just focusing on the details of how the features we do agree on should work, so we can hopefully have a more technical discussion of *that*. That sounds alright to me. One thing I would like to ask is to please adopt the vocabulary we are discussing, using it exactly as defined so that people reading all the various ideas don't have to readjust when switching between documents.
Thanks, Mark Cheers, -- Nathaniel ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
On Wed, Jul 6, 2011 at 3:38 PM, Christopher Jordan-Squire cjord...@uw.edu wrote: On Wed, Jul 6, 2011 at 11:38 AM, Christopher Barker chris.bar...@noaa.gov wrote: Christopher Jordan-Squire wrote: If we follow those rules for IGNORE for all computations, we sometimes get some weird output. For example: [ [1, 2], [3, 4] ] * [ IGNORE, 7] = [ 15, 31 ]. (Where * is matrix multiply and not * with broadcasting.) Or should that sort of operation through an error? That should throw an error -- matrix computation is heavily influenced by the shape and size of matrices, so I think IGNORES really don't make sense there. If the IGNORES don't make sense in basic numpy computations then I'm kinda confused why they'd be included at the numpy core level. Nathaniel Smith wrote: It's exactly this transparency that worries Matthew and me -- we feel that the alterNEP preserves it, and the NEP attempts to erase it. In the NEP, there are two totally different underlying data structures, but this difference is blurred at the Python level. The idea is that you shouldn't have to think about which you have, but if you work with C/Fortran, then of course you do have to be constantly aware of the underlying implementation anyway. I don't think this bothers me -- I think it's analogous to things in numpy like Fortran order and non-contiguous arrays -- you can ignore all that when working in pure python when performance isn't critical, but you need a deeper understanding if you want to work with the data in C or Fortran or to tune performance in python. So as long as there is an API to query and control how things work, I like that it's hidden from simple python code. -Chris I'm similarly not too concerned about it. Performance seems finicky when you're dealing with missing data, since a lot of arrays will likely have to be copied over to other arrays containing only complete data before being handed over to BLAS. 
Unless you know the neutral value for the computation or you just want to do a forward_fill in time series, and you have to ask the user not to give you an immutable array with NAs if they don't want extra copies. Josef My primary concern is that the np.NA stuff 'just works'. Especially since I've never run into use cases in statistics where the difference between IGNORE and NA mattered. -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
On 07/06/2011 02:38 PM, Christopher Jordan-Squire wrote: On Wed, Jul 6, 2011 at 11:38 AM, Christopher Barker chris.bar...@noaa.gov mailto:chris.bar...@noaa.gov wrote: Christopher Jordan-Squire wrote: If we follow those rules for IGNORE for all computations, we sometimes get some weird output. For example: [ [1, 2], [3, 4] ] * [ IGNORE, 7] = [ 15, 31 ]. (Where * is matrix multiply and not * with broadcasting.) Or should that sort of operation through an error? That should throw an error -- matrix computation is heavily influenced by the shape and size of matrices, so I think IGNORES really don't make sense there. If the IGNORES don't make sense in basic numpy computations then I'm kinda confused why they'd be included at the numpy core level. Nathaniel Smith wrote: It's exactly this transparency that worries Matthew and me -- we feel that the alterNEP preserves it, and the NEP attempts to erase it. In the NEP, there are two totally different underlying data structures, but this difference is blurred at the Python level. The idea is that you shouldn't have to think about which you have, but if you work with C/Fortran, then of course you do have to be constantly aware of the underlying implementation anyway. I don't think this bothers me -- I think it's analogous to things in numpy like Fortran order and non-contiguous arrays -- you can ignore all that when working in pure python when performance isn't critical, but you need a deeper understanding if you want to work with the data in C or Fortran or to tune performance in python. So as long as there is an API to query and control how things work, I like that it's hidden from simple python code. -Chris I'm similarly not too concerned about it. Performance seems finicky when you're dealing with missing data, since a lot of arrays will likely have to be copied over to other arrays containing only complete data before being handed over to BLAS. My primary concern is that the np.NA stuff 'just works'. 
Especially since I've never run into use cases in statistics where the difference between IGNORE and NA mattered. Exactly! I have not been able to think of a real example where that difference matters, as the calculations are only on the 'valid' (i.e. non-missing and non-masked) values. Bruce ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
On Wed, Jul 6, 2011 at 1:08 PM, josef.p...@gmail.com wrote: On Wed, Jul 6, 2011 at 3:38 PM, Christopher Jordan-Squire cjord...@uw.edu wrote: On Wed, Jul 6, 2011 at 11:38 AM, Christopher Barker chris.bar...@noaa.gov wrote: Christopher Jordan-Squire wrote: If we follow those rules for IGNORE for all computations, we sometimes get some weird output. For example: [ [1, 2], [3, 4] ] * [ IGNORE, 7] = [ 15, 31 ]. (Where * is matrix multiply and not * with broadcasting.) Or should that sort of operation through an error? That should throw an error -- matrix computation is heavily influenced by the shape and size of matrices, so I think IGNORES really don't make sense there. If the IGNORES don't make sense in basic numpy computations then I'm kinda confused why they'd be included at the numpy core level. Nathaniel Smith wrote: It's exactly this transparency that worries Matthew and me -- we feel that the alterNEP preserves it, and the NEP attempts to erase it. In the NEP, there are two totally different underlying data structures, but this difference is blurred at the Python level. The idea is that you shouldn't have to think about which you have, but if you work with C/Fortran, then of course you do have to be constantly aware of the underlying implementation anyway. I don't think this bothers me -- I think it's analogous to things in numpy like Fortran order and non-contiguous arrays -- you can ignore all that when working in pure python when performance isn't critical, but you need a deeper understanding if you want to work with the data in C or Fortran or to tune performance in python. So as long as there is an API to query and control how things work, I like that it's hidden from simple python code. -Chris I'm similarly not too concerned about it. Performance seems finicky when you're dealing with missing data, since a lot of arrays will likely have to be copied over to other arrays containing only complete data before being handed over to BLAS. 
Unless you know the neutral value for the computation or you just want to do a forward_fill in time series, and you have to ask the user not to give you an immutable array with NAs if they don't want extra copies. Josef Mean value replacement, or more generally single scalar value replacement, is generally not a good idea. It biases your standard error estimates downward if you use mean replacement, and it will bias both the point estimates and the standard errors if you use anything other than mean replacement. The bias gets worse with more missing data, so it's worst in precisely the cases where you'd want to fill in the data the most. (Though I admit I'm not too familiar with time series, so maybe this doesn't apply. But it's true as a general principle in statistics.) I'm not sure why we'd want to make this use case easier. -Chris Jordan-Squire My primary concern is that the np.NA stuff 'just works'. Especially since I've never run into use cases in statistics where the difference between IGNORE and NA mattered. -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
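[Editorial note: the downward bias of mean replacement on standard errors is easy to see numerically. An illustrative sketch, not from the original posts; the 40% missingness rate is arbitrary:]

```python
import numpy as np

# Imputing the sample mean adds points with zero deviation, shrinking the
# estimated spread while inflating n -- so the standard error is understated.
rng = np.random.RandomState(0)
full = rng.normal(size=1000)
observed = full[:600]                    # pretend the other 40% went missing

imputed = np.concatenate([observed, np.full(400, observed.mean())])

se_observed = observed.std(ddof=1) / np.sqrt(observed.size)
se_imputed = imputed.std(ddof=1) / np.sqrt(imputed.size)
# se_imputed < se_observed, even though the 400 imputed points carry no
# information at all -- the apparent precision is spurious.
```

This is why the statistics literature prefers multiple imputation (drawing repeatedly from a distribution) over any single-value fill.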
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
On Jul 6, 2011, at 10:11 PM, Bruce Southey wrote: On 07/06/2011 02:38 PM, Christopher Jordan-Squire wrote: On Wed, Jul 6, 2011 at 11:38 AM, Christopher Barker chris.bar...@noaa.gov wrote: Christopher Jordan-Squire wrote: If we follow those rules for IGNORE for all computations, we sometimes get some weird output. For example: [ [1, 2], [3, 4] ] * [ IGNORE, 7] = [ 15, 31 ]. (Where * is matrix multiply and not * with broadcasting.) Or should that sort of operation through an error? That should throw an error -- matrix computation is heavily influenced by the shape and size of matrices, so I think IGNORES really don't make sense there. If the IGNORES don't make sense in basic numpy computations then I'm kinda confused why they'd be included at the numpy core level. Nathaniel Smith wrote: It's exactly this transparency that worries Matthew and me -- we feel that the alterNEP preserves it, and the NEP attempts to erase it. In the NEP, there are two totally different underlying data structures, but this difference is blurred at the Python level. The idea is that you shouldn't have to think about which you have, but if you work with C/Fortran, then of course you do have to be constantly aware of the underlying implementation anyway. I don't think this bothers me -- I think it's analogous to things in numpy like Fortran order and non-contiguous arrays -- you can ignore all that when working in pure python when performance isn't critical, but you need a deeper understanding if you want to work with the data in C or Fortran or to tune performance in python. So as long as there is an API to query and control how things work, I like that it's hidden from simple python code. -Chris I'm similarly not too concerned about it. Performance seems finicky when you're dealing with missing data, since a lot of arrays will likely have to be copied over to other arrays containing only complete data before being handed over to BLAS. My primary concern is that the np.NA stuff 'just works'. 
Especially since I've never run into use cases in statistics where the difference between IGNORE and NA mattered. Exactly! I have not been able to think of a real example where that difference matters, as the calculations are only on the 'valid' (i.e. non-missing and non-masked) values. In practice, they could be treated the same way (i.e., skipped). However, they are conceptually different, and one may wish to keep this difference of information around (between NAs you didn't have and IGNOREs you just dropped temporarily). ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
On Wed, Jul 6, 2011 at 4:22 PM, Christopher Jordan-Squire cjord...@uw.edu wrote: On Wed, Jul 6, 2011 at 1:08 PM, josef.p...@gmail.com wrote: On Wed, Jul 6, 2011 at 3:38 PM, Christopher Jordan-Squire cjord...@uw.edu wrote: On Wed, Jul 6, 2011 at 11:38 AM, Christopher Barker chris.bar...@noaa.gov wrote: Christopher Jordan-Squire wrote: If we follow those rules for IGNORE for all computations, we sometimes get some weird output. For example: [ [1, 2], [3, 4] ] * [ IGNORE, 7] = [ 15, 31 ]. (Where * is matrix multiply and not * with broadcasting.) Or should that sort of operation through an error? That should throw an error -- matrix computation is heavily influenced by the shape and size of matrices, so I think IGNORES really don't make sense there. If the IGNORES don't make sense in basic numpy computations then I'm kinda confused why they'd be included at the numpy core level. Nathaniel Smith wrote: It's exactly this transparency that worries Matthew and me -- we feel that the alterNEP preserves it, and the NEP attempts to erase it. In the NEP, there are two totally different underlying data structures, but this difference is blurred at the Python level. The idea is that you shouldn't have to think about which you have, but if you work with C/Fortran, then of course you do have to be constantly aware of the underlying implementation anyway. I don't think this bothers me -- I think it's analogous to things in numpy like Fortran order and non-contiguous arrays -- you can ignore all that when working in pure python when performance isn't critical, but you need a deeper understanding if you want to work with the data in C or Fortran or to tune performance in python. So as long as there is an API to query and control how things work, I like that it's hidden from simple python code. -Chris I'm similarly not too concerned about it. 
Performance seems finicky when you're dealing with missing data, since a lot of arrays will likely have to be copied over to other arrays containing only complete data before being handed over to BLAS. Unless you know the neutral value for the computation or you just want to do a forward_fill in time series, and you have to ask the user not to give you an unmutable array with NAs if they don't want extra copies. Josef Mean value replacement, or more generally single scalar value replacement, is generally not a good idea. It biases downward your standard error estimates if you use mean replacement, and it will bias both if you use anything other than mean replacement. The bias is gets worse with more missing data. So it's worst in the precisely the cases where you'd want to fill in the data the most. (Though I admit I'm not too familiar with time series, so maybe this doesn't apply. But it's true as a general principle in statistics.) I'm not sure why we'd want to make this use case easier. We just discussed a use case for pandas on the statsmodels mailing list, minute data of stock quotes (prices), if the quote is NA then fill it with the last price quote. If it would be necessary for memory usage and performance, this can be handled efficiently and with minimal copying. If you want to fill in a missing value without messing up any result statistics, then there is a large literature in statistics on imputations, repeatedly assigning values to a NA from an underlying distribution. scipy/statsmodels doesn't have anything like this (yet) but R and the others have it available, and it looks more popular in bio-statistics. (But similar to what Dag said, for statistical analysis it will be necessary to keep case specific masks and data arrays around. I haven't actually written any missing values algorithm yet, so I'm quite again.) Josef -Chris Jordan-Squire My primary concern is that the np.NA stuff 'just works'. 
Especially since I've never run into use cases in statistics where the difference between IGNORE and NA mattered. -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
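[Editorial note: the last-quote-carried-forward fill Josef describes can be done without a Python-level loop per element, using NaN as the missing marker. A sketch; forward_fill is an illustrative name, not an existing numpy function:]

```python
import numpy as np

def forward_fill(quotes):
    """Replace each NaN with the most recent non-NaN value before it."""
    quotes = np.asarray(quotes, dtype=float)
    idx = np.where(np.isnan(quotes), 0, np.arange(len(quotes)))
    np.maximum.accumulate(idx, out=idx)   # index of the last valid quote
    return quotes[idx]

forward_fill([10.0, np.nan, np.nan, 10.5, np.nan, 11.0])
# -> [10. , 10. , 10. , 10.5, 10.5, 11. ]
```

Note that a leading NaN has no earlier quote to carry forward, so it stays NaN; and because the fill allocates a new array, it sidesteps the immutable-input problem Josef mentions at the cost of one copy.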
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
On Wed, Jul 6, 2011 at 4:38 PM, josef.p...@gmail.com wrote: On Wed, Jul 6, 2011 at 4:22 PM, Christopher Jordan-Squire cjord...@uw.edu wrote: On Wed, Jul 6, 2011 at 1:08 PM, josef.p...@gmail.com wrote: On Wed, Jul 6, 2011 at 3:38 PM, Christopher Jordan-Squire cjord...@uw.edu wrote: On Wed, Jul 6, 2011 at 11:38 AM, Christopher Barker chris.bar...@noaa.gov wrote: Christopher Jordan-Squire wrote: If we follow those rules for IGNORE for all computations, we sometimes get some weird output. For example: [ [1, 2], [3, 4] ] * [ IGNORE, 7] = [ 15, 31 ]. (Where * is matrix multiply and not * with broadcasting.) Or should that sort of operation through an error? That should throw an error -- matrix computation is heavily influenced by the shape and size of matrices, so I think IGNORES really don't make sense there. If the IGNORES don't make sense in basic numpy computations then I'm kinda confused why they'd be included at the numpy core level. Nathaniel Smith wrote: It's exactly this transparency that worries Matthew and me -- we feel that the alterNEP preserves it, and the NEP attempts to erase it. In the NEP, there are two totally different underlying data structures, but this difference is blurred at the Python level. The idea is that you shouldn't have to think about which you have, but if you work with C/Fortran, then of course you do have to be constantly aware of the underlying implementation anyway. I don't think this bothers me -- I think it's analogous to things in numpy like Fortran order and non-contiguous arrays -- you can ignore all that when working in pure python when performance isn't critical, but you need a deeper understanding if you want to work with the data in C or Fortran or to tune performance in python. So as long as there is an API to query and control how things work, I like that it's hidden from simple python code. -Chris I'm similarly not too concerned about it. 
Performance seems finicky when you're dealing with missing data, since a lot of arrays will likely have to be copied over to other arrays containing only complete data before being handed over to BLAS. Unless you know the neutral value for the computation or you just want to do a forward_fill in time series, and you have to ask the user not to give you an unmutable array with NAs if they don't want extra copies. Josef Mean value replacement, or more generally single scalar value replacement, is generally not a good idea. It biases downward your standard error estimates if you use mean replacement, and it will bias both if you use anything other than mean replacement. The bias is gets worse with more missing data. So it's worst in the precisely the cases where you'd want to fill in the data the most. (Though I admit I'm not too familiar with time series, so maybe this doesn't apply. But it's true as a general principle in statistics.) I'm not sure why we'd want to make this use case easier. Another qualification on this (I cannot help it). I think this only applies if you use a prefabricated no-missing-values algorithm. If I write it myself, I can do the proper correction for the reduced number of observations. (similar to the case when we ignore correlated information and use statistics based on uncorrelated observations which also overestimate the amount of information we have available.) Josef We just discussed a use case for pandas on the statsmodels mailing list, minute data of stock quotes (prices), if the quote is NA then fill it with the last price quote. If it would be necessary for memory usage and performance, this can be handled efficiently and with minimal copying. If you want to fill in a missing value without messing up any result statistics, then there is a large literature in statistics on imputations, repeatedly assigning values to a NA from an underlying distribution. 
scipy/statsmodels doesn't have anything like this (yet) but R and the others have it available, and it looks more popular in bio-statistics. (But similar to what Dag said, for statistical analysis it will be necessary to keep case specific masks and data arrays around. I haven't actually written any missing values algorithm yet, so I'm quite agnostic.) Josef -Chris Jordan-Squire My primary concern is that the np.NA stuff 'just works'. Especially since I've never run into use cases in statistics where the difference between IGNORE and NA mattered. -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org
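The downward bias from mean replacement that Chris describes is easy to demonstrate with a small simulation (a sketch, not from the original thread; the variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
missing = rng.random(1000) < 0.3          # ~30% missing at random
observed = x[~missing]

# Mean imputation: fill the missing slots with the observed mean.
filled = x.copy()
filled[missing] = observed.mean()

# The imputed sample understates the spread: every filled value sits
# exactly at the mean, adding nothing to the variance numerator while
# inflating the denominator (n instead of n_observed).
assert filled.std(ddof=1) < observed.std(ddof=1)
```

This is exactly the "prefabricated no-missing-values algorithm" problem Josef raises: any routine that sees `filled` as 1000 complete observations will report standard errors that are too small.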
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
Christopher Barker wrote: Dag Sverre Seljebotn wrote: Here's an HPC perspective...: At least I feel that the transparency of NumPy is a huge part of its current success. Many more than me spend half their time in C/Fortran and half their time in Python. Absolutely -- and this point has been raised a couple times in the discussion, so I hope it is not forgotten. I tend to look at NumPy this way: Assuming you have some data in memory (possibly loaded by a C or Fortran library). (Almost) no matter how it is allocated, ordered, packed, aligned -- there's a way to find strides and dtypes to put a nice NumPy wrapper around it and use the memory from Python. And vice-versa -- assuming you have some data in numpy arrays, there's a way to process it with a C or Fortran library without copying the data. And this is where I am skeptical of the bit-pattern idea -- while one can expect C and Fortran and GPUs, and ??? to understand NaNs for floating point data, is there any support in compilers or hardware for special bit patterns for NA values in integers? I've never seen it in my (very limited) experience. Maybe having the mask option, too, will make that irrelevant, but I want to be clear about that kind of use case. -Chris Am I the only one that finds the idea of special values of things like int[1] having special meanings to be really ugly? [1] which already have defined behavior over their entire domain of bit patterns
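For what it's worth, there is indeed no hardware-level NA for integers; bitpattern NAs for ints are a pure software convention. R, for instance, reserves INT_MIN as its integer NA sentinel. A minimal sketch of that convention (the constant name is illustrative):

```python
import numpy as np

# R's convention: INT_MIN (0x80000000) is NA for 32-bit integers.
# Nothing in C or the hardware enforces this -- every operation has
# to check for the sentinel explicitly in software.
INT32_NA = np.int32(-2**31)

a = np.array([1, 2, -2**31, 4], dtype=np.int32)
mask = a == INT32_NA
assert mask.tolist() == [False, False, True, False]

# A "NA-aware" sum must mask the sentinel out by hand:
total = a[~mask].sum()
assert total == 7
```

This is the crux of Chris's concern: unlike float NaN, which compilers and BLAS propagate for free, every integer operation has to pay for the sentinel check in software.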
Re: [Numpy-discussion] miniNEP1: where= argument for ufuncs
On Wed, Jul 06, 2011 at 12:26:24PM -0700, Nathaniel Smith wrote: A mini-NEP for the where= argument to ufuncs I _love_ this proposal and it would probably be much more useful to me than the different masked array proposals, which are too focused on a specific usage pattern to answer all my needs. So a strong +1 on the miniNEP. G
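NumPy later did gain a `where=` argument on ufuncs with essentially the semantics the miniNEP proposes; a minimal sketch using the modern API:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([10.0, 20.0, 30.0, 40.0])
mask = np.array([True, False, True, False])

# Where mask is False the output element is left untouched, so an
# explicit out= array is needed to make those slots well-defined.
out = np.zeros_like(a)
np.add(a, b, out=out, where=mask)
# out is now [11., 0., 33., 0.]
```

The key design point is that `where=` only controls *which elements the ufunc computes*; it says nothing about masked storage, which is why it composes cleanly with either missing-data proposal.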
Re: [Numpy-discussion] histogram2d error with empty inputs
On Mon, Jun 27, 2011 at 9:38 PM, Benjamin Root ben.r...@ou.edu wrote: I found another empty input edge case. Somewhat recently, we fixed an issue with np.histogram() and empty inputs (so long as the bins are somehow known). np.histogram([], bins=4) (array([0, 0, 0, 0]), array([ 0. , 0.25, 0.5 , 0.75, 1. ])) However, histogram2d needs the same treatment. np.histogram2d([], [], bins=4) (array([ 0., 0.]), array([ 0. , 0.25, 0.5 , 0.75, 1. ]), array([ 0. , 0.25, 0.5 , 0.75, 1. ])) The first element in the return tuple needs to be 4x4 (in this case). Could you open a ticket for this? Ralf
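The behavior Ben is asking for (and which later NumPy releases provide) can be checked directly; the edge values assume NumPy's default range of [0, 1] for empty input:

```python
import numpy as np

# 1-D case: empty input with a known bin count already works.
counts, edges = np.histogram([], bins=4)
# counts: four zeros; edges: 0.0, 0.25, 0.5, 0.75, 1.0

# 2-D case: the counts array should likewise be a bins x bins block
# of zeros, consistent with the 4-bin edges along each axis.
H, xedges, yedges = np.histogram2d([], [], bins=4)
assert H.shape == (4, 4)
assert H.sum() == 0
```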
Re: [Numpy-discussion] HPC missing data - was: NA/Missing Data Conference Call Summary
On Wed, Jul 06, 2011 at 08:39:37PM +0200, Dag Sverre Seljebotn wrote: As for myself, I'll admit that I'll almost certainly continue with explicit masking without using any of the proposed NEPs -- I have to be extremely aware of the masks in the statistical methods I use. My gut feeling is that I am in the same case. G
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
On Wed, Jul 6, 2011 at 2:53 PM, Neal Becker ndbeck...@gmail.com wrote: Christopher Barker wrote: Dag Sverre Seljebotn wrote: Here's an HPC perspective...: At least I feel that the transparency of NumPy is a huge part of its current success. Many more than me spend half their time in C/Fortran and half their time in Python. Absolutely -- and this point has been raised a couple times in the discussion, so I hope it is not forgotten. I tend to look at NumPy this way: Assuming you have some data in memory (possibly loaded by a C or Fortran library). (Almost) no matter how it is allocated, ordered, packed, aligned -- there's a way to find strides and dtypes to put a nice NumPy wrapper around it and use the memory from Python. And vice-versa -- assuming you have some data in numpy arrays, there's a way to process it with a C or Fortran library without copying the data. And this is where I am skeptical of the bit-pattern idea -- while one can expect C and Fortran and GPUs, and ??? to understand NaNs for floating point data, is there any support in compilers or hardware for special bit patterns for NA values in integers? I've never seen it in my (very limited) experience. Maybe having the mask option, too, will make that irrelevant, but I want to be clear about that kind of use case. -Chris Am I the only one that finds the idea of special values of things like int[1] having special meanings to be really ugly? [1] which already have defined behavior over their entire domain of bit patterns Umm, no, I find it ugly also. On the other hand, it is a useful artifact left to us by the ancients and solves a lot of problems. So in the absence of anything more standardized... Chuck
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
On 07/06/2011 03:37 PM, Pierre GM wrote: On Jul 6, 2011, at 10:11 PM, Bruce Southey wrote: On 07/06/2011 02:38 PM, Christopher Jordan-Squire wrote: On Wed, Jul 6, 2011 at 11:38 AM, Christopher Barker chris.bar...@noaa.gov wrote: Christopher Jordan-Squire wrote: If we follow those rules for IGNORE for all computations, we sometimes get some weird output. For example: [ [1, 2], [3, 4] ] * [ IGNORE, 7] = [ 15, 31 ]. (Where * is matrix multiply and not * with broadcasting.) Or should that sort of operation throw an error? That should throw an error -- matrix computation is heavily influenced by the shape and size of matrices, so I think IGNOREs really don't make sense there. If the IGNOREs don't make sense in basic numpy computations then I'm kinda confused why they'd be included at the numpy core level. Nathaniel Smith wrote: It's exactly this transparency that worries Matthew and me -- we feel that the alterNEP preserves it, and the NEP attempts to erase it. In the NEP, there are two totally different underlying data structures, but this difference is blurred at the Python level. The idea is that you shouldn't have to think about which you have, but if you work with C/Fortran, then of course you do have to be constantly aware of the underlying implementation anyway. I don't think this bothers me -- I think it's analogous to things in numpy like Fortran order and non-contiguous arrays -- you can ignore all that when working in pure python when performance isn't critical, but you need a deeper understanding if you want to work with the data in C or Fortran or to tune performance in python. So as long as there is an API to query and control how things work, I like that it's hidden from simple python code. -Chris I'm similarly not too concerned about it. Performance seems finicky when you're dealing with missing data, since a lot of arrays will likely have to be copied over to other arrays containing only complete data before being handed over to BLAS.
My primary concern is that the np.NA stuff 'just works'. Especially since I've never run into use cases in statistics where the difference between IGNORE and NA mattered. Exactly! I have not been able to think of a real example where that difference matters as the calculations are only on the 'valid' (ie non-missing and non-masked) values. In practice, they could be treated the same way (ie, skipped). However, they are conceptually different and one may wish to keep this difference of information around (between NAs you didn't have and IGNOREs you just dropped temporarily). I have yet to see these as *conceptually different* in any of the arguments given. Separate NAs or IGNOREs or any number of missing value codes just requires us to avoid 'unmasking' those missing value codes in your array as, I presume like masked arrays, you need some placeholder values. Bruce
Re: [Numpy-discussion] histogram2d error with empty inputs
On Wednesday, July 6, 2011, Ralf Gommers ralf.gomm...@googlemail.com wrote: On Mon, Jun 27, 2011 at 9:38 PM, Benjamin Root ben.r...@ou.edu wrote: I found another empty input edge case. Somewhat recently, we fixed an issue with np.histogram() and empty inputs (so long as the bins are somehow known). np.histogram([], bins=4) (array([0, 0, 0, 0]), array([ 0. , 0.25, 0.5 , 0.75, 1. ])) However, histogram2d needs the same treatment. np.histogram2d([], [], bins=4) (array([ 0., 0.]), array([ 0. , 0.25, 0.5 , 0.75, 1. ]), array([ 0. , 0.25, 0.5 , 0.75, 1. ])) The first element in the return tuple needs to be 4x4 (in this case). Could you open a ticket for this? Ralf Not a problem. I managed to partly trace the problem down into histogramdd, but the function is a little confusing. Ben Root
Re: [Numpy-discussion] using the same vocabulary for missing value ideas
On Wednesday, July 6, 2011, Dag Sverre Seljebotn d.s.seljeb...@astro.uio.no wrote: On 07/06/2011 08:25 PM, Christopher Barker wrote: Mark Wiebe wrote: 1) NA vs IGNORE and bitpattern vs mask are completely independent. Any combination of NA as bitpattern, NA as mask, IGNORE as bitpattern, and IGNORE as mask are reasonable. Is this really true? If you use a bitpattern for IGNORE, haven't you just lost the ability to get the original value back if you want to stop ignoring it? Maybe that's not inherent to what an IGNORE means, but it seems pretty key to me. There's the question of how reductions treat the value. IIUC, IGNORE as bitpattern would imply that reductions treat the value as 0, which is a question orthogonal to whether the value can possibly be unmasked or not. Dag Sverre Just because we are trying to be exact here, the reductions would treat IGNORE as the operation's identity. Therefore, for addition, it would be treated like 0, but for multiplication, it is treated like a 1. Ben Root
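The identity rule Ben states is how numpy.ma already behaves, so it can be sketched with a masked array standing in for the proposed IGNORE:

```python
import numpy as np

a = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False])

# The masked (IGNOREd) element acts as the operation's identity:
# 0 for sum, 1 for prod -- in other words, it is simply skipped.
assert a.sum() == 4.0    # 1 + 3
assert a.prod() == 3.0   # 1 * 3
```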
[Numpy-discussion] Call for papers: AMS Jan 22-26, 2012
I would like to call to the attention of the NumPy community the following call for papers: Second Symposium on Advances in Modeling and Analysis Using Python, 22–26 January 2012, New Orleans, Louisiana The Second Symposium on Advances in Modeling and Analysis Using Python, sponsored by the American Meteorological Society, will be held 22–26 January 2012, as part of the 92nd AMS Annual Meeting in New Orleans, Louisiana. Preliminary programs, registration, hotel, and general information will be posted on the AMS Web site (http://www.ametsoc.org/meet/annual/) in late-September 2011. The application of object-oriented programming and other advances in computer science to the atmospheric and oceanic sciences has in turn led to advances in modeling and analysis tools and methods. This symposium focuses on applications of the open-source language Python and seeks to disseminate advances using Python in the atmospheric and oceanic sciences, as well as grow the earth sciences Python community. Papers describing Python work in applications, methodologies, and package development in all areas of meteorology, climatology, oceanography, and space sciences are welcome, including (but not limited to): modeling, time series analysis, air quality, satellite data processing, in-situ data analysis, GIS, Python as a software integration platform, visualization, gridding, model intercomparison, and very large (petabyte) dataset manipulation and access. The $95 abstract fee includes the submission of your abstract, the posting of your extended abstract, and the uploading and recording of your presentation which will be archived on the AMS Web site. Please submit your abstract electronically via the Web by 1 August 2011 (refer to the AMS Web page at http://www.ametsoc.org/meet/online_submit.html). An abstract fee of $95 (payable by credit card or purchase order) is charged at the time of submission (refundable only if abstract is not accepted).
Authors of accepted presentations will be notified via e-mail by late-September 2011. All extended abstracts are to be submitted electronically and will be available on-line via the Web. Instructions for formatting extended abstracts will be posted on the AMS Web site. Manuscripts (up to 3MB) must be submitted electronically by 22 February 2012. All abstracts, extended abstracts and presentations will be available on the AMS Web site at no cost. For additional information, please contact the program chairperson, Johnny Lin, Physics Department, North Park University (j...@northpark.edu). (5/11)
Re: [Numpy-discussion] using the same vocabulary for missing value ideas
On Wed, Jul 6, 2011 at 5:03 PM, Benjamin Root ben.r...@ou.edu wrote: On Wednesday, July 6, 2011, Dag Sverre Seljebotn d.s.seljeb...@astro.uio.no wrote: On 07/06/2011 08:25 PM, Christopher Barker wrote: Mark Wiebe wrote: 1) NA vs IGNORE and bitpattern vs mask are completely independent. Any combination of NA as bitpattern, NA as mask, IGNORE as bitpattern, and IGNORE as mask are reasonable. Is this really true? If you use a bitpattern for IGNORE, haven't you just lost the ability to get the original value back if you want to stop ignoring it? Maybe that's not inherent to what an IGNORE means, but it seems pretty key to me. There's the question of how reductions treat the value. IIUC, IGNORE as bitpattern would imply that reductions treat the value as 0, which is a question orthogonal to whether the value can possibly be unmasked or not. Dag Sverre Just because we are trying to be exact here, the reductions would treat IGNORE as the operation's identity. Therefore, for addition, it would be treated like 0, but for multiplication, it is treated like a 1. Ben Root Yes. But, as discussed on another thread, that can lead to unexpected results when it's propagated through several operations.
[Numpy-discussion] using the same vocabulary for missing value ideas
On Wednesday, July 6, 2011, Christopher Jordan-Squire cjord...@uw.edu wrote: On Wed, Jul 6, 2011 at 5:03 PM, Benjamin Root ben.r...@ou.edu wrote: On Wednesday, July 6, 2011, Dag Sverre Seljebotn d.s.seljeb...@astro.uio.no wrote: On 07/06/2011 08:25 PM, Christopher Barker wrote: Mark Wiebe wrote: 1) NA vs IGNORE and bitpattern vs mask are completely independent. Any combination of NA as bitpattern, NA as mask, IGNORE as bitpattern, and IGNORE as mask are reasonable. Is this really true? If you use a bitpattern for IGNORE, haven't you just lost the ability to get the original value back if you want to stop ignoring it? Maybe that's not inherent to what an IGNORE means, but it seems pretty key to me. There's the question of how reductions treat the value. IIUC, IGNORE as bitpattern would imply that reductions treat the value as 0, which is a question orthogonal to whether the value can possibly be unmasked or not. Dag Sverre Just because we are trying to be exact here, the reductions would treat IGNORE as the operation's identity. Therefore, for addition, it would be treated like 0, but for multiplication, it is treated like a 1. Ben Root Yes. But, as discussed on another thread, that can lead to unexpected results when it's propagated through several operations. If you are talking about means, for example, then the count is adjusted before dividing. It is like they never existed. Same with standard deviation. Of course, there are issues with having fewer samples, but that isn't a problem caused by the underlying concept of skipping elements. As long as the underlying mathematical support for array math is still valid, I am not certain what the issue is. Matrix math on the other hand... Ben Root
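The adjusted-count behavior Ben describes ("like they never existed") is what numpy.ma does today; a quick sketch with np.ma standing in for IGNORE:

```python
import numpy as np

a = np.ma.array([2.0, 4.0, 999.0], mask=[False, False, True])

# mean divides by the number of unmasked elements (2), not by 3,
# so the masked value contributes to neither numerator nor count.
assert a.count() == 2
assert a.mean() == 3.0
```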
Re: [Numpy-discussion] using the same vocabulary for missing value ideas
On Wed, Jul 6, 2011 at 5:38 PM, Benjamin Root ben.r...@ou.edu wrote: On Wednesday, July 6, 2011, Christopher Jordan-Squire cjord...@uw.edu wrote: On Wed, Jul 6, 2011 at 5:03 PM, Benjamin Root ben.r...@ou.edu wrote: On Wednesday, July 6, 2011, Dag Sverre Seljebotn d.s.seljeb...@astro.uio.no wrote: On 07/06/2011 08:25 PM, Christopher Barker wrote: Mark Wiebe wrote: 1) NA vs IGNORE and bitpattern vs mask are completely independent. Any combination of NA as bitpattern, NA as mask, IGNORE as bitpattern, and IGNORE as mask are reasonable. Is this really true? If you use a bitpattern for IGNORE, haven't you just lost the ability to get the original value back if you want to stop ignoring it? Maybe that's not inherent to what an IGNORE means, but it seems pretty key to me. There's the question of how reductions treat the value. IIUC, IGNORE as bitpattern would imply that reductions treat the value as 0, which is a question orthogonal to whether the value can possibly be unmasked or not. Dag Sverre Just because we are trying to be exact here, the reductions would treat IGNORE as the operation's identity. Therefore, for addition, it would be treated like 0, but for multiplication, it is treated like a 1. Ben Root Yes. But, as discussed on another thread, that can lead to unexpected results when it's propagated through several operations. If you are talking about means, for example, then the count is adjusted before dividing. It is like they never existed. Same with standard deviation. Of course, there are issues with having fewer samples, but that isn't a problem caused by the underlying concept of skipping elements. As long as the underlying mathematical support for array math is still valid, I am not certain what the issue is. Matrix math on the other hand... Ah, I see. I misunderstood the class of operations you were discussing.
-Chris Jordan-Squire Ben Root
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
On Wed, Jul 6, 2011 at 3:47 PM, josef.p...@gmail.com wrote: On Wed, Jul 6, 2011 at 4:38 PM, josef.p...@gmail.com wrote: On Wed, Jul 6, 2011 at 4:22 PM, Christopher Jordan-Squire cjord...@uw.edu wrote: On Wed, Jul 6, 2011 at 1:08 PM, josef.p...@gmail.com wrote: On Wed, Jul 6, 2011 at 3:38 PM, Christopher Jordan-Squire cjord...@uw.edu wrote: On Wed, Jul 6, 2011 at 11:38 AM, Christopher Barker chris.bar...@noaa.gov wrote: Christopher Jordan-Squire wrote: If we follow those rules for IGNORE for all computations, we sometimes get some weird output. For example: [ [1, 2], [3, 4] ] * [ IGNORE, 7] = [ 15, 31 ]. (Where * is matrix multiply and not * with broadcasting.) Or should that sort of operation throw an error? That should throw an error -- matrix computation is heavily influenced by the shape and size of matrices, so I think IGNOREs really don't make sense there. If the IGNOREs don't make sense in basic numpy computations then I'm kinda confused why they'd be included at the numpy core level. Nathaniel Smith wrote: It's exactly this transparency that worries Matthew and me -- we feel that the alterNEP preserves it, and the NEP attempts to erase it. In the NEP, there are two totally different underlying data structures, but this difference is blurred at the Python level. The idea is that you shouldn't have to think about which you have, but if you work with C/Fortran, then of course you do have to be constantly aware of the underlying implementation anyway. I don't think this bothers me -- I think it's analogous to things in numpy like Fortran order and non-contiguous arrays -- you can ignore all that when working in pure python when performance isn't critical, but you need a deeper understanding if you want to work with the data in C or Fortran or to tune performance in python. So as long as there is an API to query and control how things work, I like that it's hidden from simple python code. -Chris I'm similarly not too concerned about it.
Performance seems finicky when you're dealing with missing data, since a lot of arrays will likely have to be copied over to other arrays containing only complete data before being handed over to BLAS. Unless you know the neutral value for the computation or you just want to do a forward_fill in time series, and you have to ask the user not to give you an immutable array with NAs if they don't want extra copies. Josef Mean value replacement, or more generally single scalar value replacement, is generally not a good idea. It biases downward your standard error estimates if you use mean replacement, and it will bias both if you use anything other than mean replacement. The bias gets worse with more missing data. So it's worst in precisely the cases where you'd want to fill in the data the most. (Though I admit I'm not too familiar with time series, so maybe this doesn't apply. But it's true as a general principle in statistics.) I'm not sure why we'd want to make this use case easier. Another qualification on this (I cannot help it). I think this only applies if you use a prefabricated no-missing-values algorithm. If I write it myself, I can do the proper correction for the reduced number of observations. (similar to the case when we ignore correlated information and use statistics based on uncorrelated observations which also overestimate the amount of information we have available.) Can you do that sort of technique with longitudinal (panel) data? I'm honestly curious because I haven't looked into such corrections before. I haven't been able to find a reference after a few quick google searches. I don't suppose you know one off the top of your head? And you're right about the last measurement carried forward. I was just thinking about filling in all missing values with the same value. -Chris Jordan-Squire PS--Thanks for mentioning the statsmodels discussion.
I'd been keeping track of that on a different email account, and I hadn't realized it wasn't forwarding those messages correctly. Josef We just discussed a use case for pandas on the statsmodels mailing list, minute data of stock quotes (prices): if the quote is NA then fill it with the last price quote. If it would be necessary for memory usage and performance, this can be handled efficiently and with minimal copying. If you want to fill in a missing value without messing up any result statistics, then there is a large literature in statistics on imputations, repeatedly assigning values to an NA from an underlying distribution. scipy/statsmodels doesn't have anything like this (yet) but R and the others have it available, and it looks more popular in bio-statistics. (But similar to what Dag said, for statistical
Re: [Numpy-discussion] using the same vocabulary for missing value ideas
Hi, On Wed, Jul 6, 2011 at 7:10 PM, Christopher Jordan-Squire cjord...@uw.edu wrote: On Wed, Jul 6, 2011 at 10:44 AM, Matthew Brett matthew.br...@gmail.com wrote: Hi, On Wed, Jul 6, 2011 at 6:11 PM, Benjamin Root ben.r...@ou.edu wrote: On Wed, Jul 6, 2011 at 12:01 PM, Matthew Brett matthew.br...@gmail.com wrote: Hi, On Wed, Jul 6, 2011 at 5:48 PM, Peter numpy-discuss...@maubp.freeserve.co.uk wrote: On Wed, Jul 6, 2011 at 5:38 PM, Matthew Brett matthew.br...@gmail.com wrote: Hi, On Wed, Jul 6, 2011 at 4:40 PM, Mark Wiebe mwwi...@gmail.com wrote: It appears to me that one of the biggest reasons some of us have been talking past each other in the discussions is that different people have different definitions for the terms being used. Until this is thoroughly cleared up, I feel the design process is tilting at windmills. In the interests of clarity in our discussions, here is a starting point which is consistent with the NEP. These definitions have been added in a glossary within the NEP. If there are any ideas for amendments to these definitions that we can agree on, I will update the NEP with those amendments. Also, if I missed any important terms which need to be added, please propose definitions for them. NA (Not Available) A placeholder for a value which is unknown to computations. That value may be temporarily hidden with a mask, may have been lost due to hard drive corruption, or gone for any number of reasons. This is the same as NA in the R project. Really? Can one implement NA with a mask in R? I thought an NA was always bitpattern in R? I don't think that was what Mark was saying, see this bit later in this email: I think it would make a difference if there was an implementation that had conflated masking with bitpatterns in terms of API. I don't think R is an example. Of course R is not an example of that. Nothing is. This is merely conceptual. Separate NA from np.NA in Mark's NEP, and you will see his point.
Consider it the logical intersection of NA in Mark's NEP and the aNEP. I am trying to work out what you feel the points of discussion are. There's surely no point in continuing to debate things we agree on. I don't think anyone disputes (or has ever disputed) that: there can be missing data implemented with bitpatterns; there can be missing data implemented with masks; missing data can have propagate semantics; missing data can have ignore semantics. The implementation does not in itself constrain the semantics. So, to be clear, is your concern that you want to be able to tell the difference between whether an np.NA comes from the bit pattern or the mask in its implementation? But why would you have both the parameterized dtype and the mask implementation at the same time? They implement the same abstraction. In Mark's mind they implement the same abstraction. In my mind, and Nathaniel's, and I think, Pierre's, and others, they are not the same abstraction. You can treat them the same if you want, even by default, but they are two different ideas, with two different implementations. A bitpattern NA value is absolutely completely missing. It's a value that says 'missing' A masked-out value is temporarily or provisionally missing. When you take away the mask, the previous value is there. These are two different things. They are each very easy to explain. Is your desire that the np.NA's are implemented solely through bit patterns and np.IGNORE is implemented solely through masks? So that you can think of the masks as being IGNORE flags? What if you want multiple types of IGNORE? (To ignore certain values because they're outliers, others because the data wouldn't make sense, and others because you're just focusing on a particular subgroup, for instance.) Forgive me, I have been at dinner and had several glasses of wine. So, what I'm about to say might be dumber than usual.
With that rider: I agree with Mark, we should avoid np.IGNORE because it conflates ignore semantics with the masking implementation. The idea of several different missings seems to me orthogonal. There can be different missings with bitpatterns and different missings with masks. My fundamental point, that I accept I am not getting across with much success, is the following: In general, as Dag has pointed out elsewhere, numpy is close to the metal - you can almost feel the C array underneath the python numpy object. This is its strength. It doesn't try and hide the C array from you, it gives you the whole machinery, open kimono. I can see an open kimono way of dealing with missing values. There's the bitpattern way. If I do a[3] = np.NA, what I mean is 'store an NA in the array memory'. Exactly the same as when I do a[3] = 2, I mean 'store a 2 in the array memory'. It's obvious and
Re: [Numpy-discussion] miniNEP1: where= argument for ufuncs
Sorry, but I didn't find a way of inserting inline comments in the gist. Nathaniel Smith writes: [...] Is there any less stupid-looking name than ``where1=`` and ``where2=`` for the ``.outer`` operation? (For that matter, can ``.outer`` be applied to more than 2 arrays? The docs say it can't, but it's perfectly well-defined for arbitrary number of arrays too, so maybe we want an interface that allows for 3-way, 4-way etc. ``.outer`` operations in the future?) Well, if outer can indeed be defined for an arbitrary number of arrays (and if it's going to be sometime in the future), I'd say the simplest is to use an array: .outer(a, b, ..., where = [my_where1, my_where2, ...]) Lluis -- And it's much the same thing with knowledge, for whenever you learn something new, the whole world becomes that much richer. -- The Princess of Pure Reason, as told by Norton Juster in The Phantom Tollbooth
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
On Wed, Jul 6, 2011 at 7:14 PM, Christopher Jordan-Squire cjord...@uw.edu wrote: On Wed, Jul 6, 2011 at 3:47 PM, josef.p...@gmail.com wrote: On Wed, Jul 6, 2011 at 4:38 PM, josef.p...@gmail.com wrote: On Wed, Jul 6, 2011 at 4:22 PM, Christopher Jordan-Squire cjord...@uw.edu wrote: On Wed, Jul 6, 2011 at 1:08 PM, josef.p...@gmail.com wrote: On Wed, Jul 6, 2011 at 3:38 PM, Christopher Jordan-Squire cjord...@uw.edu wrote: On Wed, Jul 6, 2011 at 11:38 AM, Christopher Barker chris.bar...@noaa.gov wrote: Christopher Jordan-Squire wrote: If we follow those rules for IGNORE for all computations, we sometimes get some weird output. For example: [ [1, 2], [3, 4] ] * [ IGNORE, 7] = [ 15, 31 ]. (Where * is matrix multiply and not * with broadcasting.) Or should that sort of operation throw an error? That should throw an error -- matrix computation is heavily influenced by the shape and size of matrices, so I think IGNOREs really don't make sense there. If the IGNOREs don't make sense in basic numpy computations then I'm kinda confused why they'd be included at the numpy core level. Nathaniel Smith wrote: It's exactly this transparency that worries Matthew and me -- we feel that the alterNEP preserves it, and the NEP attempts to erase it. In the NEP, there are two totally different underlying data structures, but this difference is blurred at the Python level. The idea is that you shouldn't have to think about which you have, but if you work with C/Fortran, then of course you do have to be constantly aware of the underlying implementation anyway. I don't think this bothers me -- I think it's analogous to things in numpy like Fortran order and non-contiguous arrays -- you can ignore all that when working in pure python when performance isn't critical, but you need a deeper understanding if you want to work with the data in C or Fortran or to tune performance in python.
So as long as there is an API to query and control how things work, I like that it's hidden from simple python code. -Chris I'm similarly not too concerned about it. Performance seems finicky when you're dealing with missing data, since a lot of arrays will likely have to be copied over to other arrays containing only complete data before being handed over to BLAS. Unless you know the neutral value for the computation or you just want to do a forward_fill in time series, and you have to ask the user not to give you an immutable array with NAs if they don't want extra copies. Josef Mean value replacement, or more generally single scalar value replacement, is generally not a good idea. It biases downward your standard error estimates if you use mean replacement, and it will bias both if you use anything other than mean replacement. The bias gets worse with more missing data. So it's worst in precisely the cases where you'd want to fill in the data the most. (Though I admit I'm not too familiar with time series, so maybe this doesn't apply. But it's true as a general principle in statistics.) I'm not sure why we'd want to make this use case easier. Another qualification on this (I cannot help it). I think this only applies if you use a prefabricated no-missing-values algorithm. If I write it myself, I can do the proper correction for the reduced number of observations. (similar to the case when we ignore correlated information and use statistics based on uncorrelated observations, which also overestimate the amount of information we have available.) Can you do that sort of technique with longitudinal (panel) data? I'm honestly curious because I haven't looked into such corrections before. I haven't been able to find a reference after a few quick google searches. I don't suppose you know one off the top of your head?
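The downward bias from mean replacement is easy to demonstrate on synthetic data (everything below is made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
x_obs = x.copy()
x_obs[:200] = np.nan          # pretend 20% of the data went missing

# Mean replacement: fill every missing entry with the observed mean.
filled = np.where(np.isnan(x_obs), np.nanmean(x_obs), x_obs)

# The filled array has the same mean as the observed values, but its
# spread is mechanically smaller -- 200 entries now sit exactly at the
# mean -- which is the downward bias in the standard error estimates.
assert filled.std() < np.nanstd(x_obs)
```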
I was thinking mainly of simple cases where the correction only requires correctly counting the number of observations in order to adjust the degrees of freedom. For example, statistical tests that are based on relatively simple statistics, or ANOVA, which just needs a correct counting of the number of observations by groups. (This might be partially covered by any NA ufunc implementation that does mean, var and cov correctly, and maybe sorting like the current NaN sort.) In the panel data case it might be possible to do this, if it can just be treated like an unbalanced panel. I guess it depends on the details of the model. For regression, one way to remove an observation is to include a dummy variable for that observation, or use X'X with rows zeroed out. R has a package for multivariate normal with missing values that allows calculation of expected values for the missing ones. But in many of these cases, getting a clean
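The simple counting correction Josef describes might look like this with NaN standing in as the missing-value marker (a sketch, not any particular package's API):

```python
import numpy as np

x = np.array([2.0, 4.0, np.nan, 6.0, 8.0, np.nan])

# Count only the actual observations, and use that count for the
# degrees-of-freedom adjustment rather than len(x).
n = np.count_nonzero(~np.isnan(x))          # 4 observations
mean = np.nansum(x) / n
var = np.nansum((x - mean) ** 2) / (n - 1)  # unbiased sample variance

# This matches numpy's own NaN-skipping reduction with ddof=1:
assert np.isclose(var, np.nanvar(x, ddof=1))
```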
Re: [Numpy-discussion] using the same vocabulary for missing value ideas
(snip discussion of open kimono) On the other hand, to try and conceal these implementation differences seems to me to break my feeling for numpy arrays, and make me feel I have an object that is rather magic, that I don't fully understand, and for which clever stuff is going on, under the hood, that I worry about but have to trust. To weigh in as someone less tipsy, I totally agree with this concern. In fact, in trying to understand the proposal myself--and I use numpy and R NAs all the time--it was difficult to understand, and I don't think I have fully gotten it yet. That makes it seem like magic, and magic makes me seriously nervous ... specifically, that I won't get what I intended, which will lead to nearly-impossible-to-find bugs. I think this is not the numpy way. I think I fully understand why it's attractive, but I continue to think that it's a mistake, and one that may take some time to become clear. It will become clear only after a few years of trying to teach people, and noticing that when they get to this stuff, they start switching off, and getting a bit confused, and concluding it's all too hard for them. Agreed. For ultra simplicity, I'd be perfectly happy with an np.NA element (bitpattern?) that I could use to represent points that will forevermore be missing, as well as a masking capability that allows multiple masking values (not just true/false) such as:
a.mask[3] = 0 # unmasked
a.mask[3] = 1 # masked type 1 (eg, missing?)
a.mask[3] = 2 # masked type 2 (eg, data from different source)
a.mask[3] = 3 # masked type 3 (eg, ignore in complete-case analysis)
etc. Regardless of whether a mask is boolean or more, though, the simplicity of explaining masking separate from NA cases is, I think, a huge win. -best Gary
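Gary's multi-valued mask could be prototyped today with an ordinary side-car integer array; the codes below are just the ones from his example, and all names are illustrative:

```python
import numpy as np

data = np.array([3.1, 2.7, 5.0, 9.9, 4.2])

# A parallel integer mask instead of a boolean one:
# 0 = unmasked, 1 = missing, 2 = different source, 3 = ignore
mask = np.zeros(data.shape, dtype=np.uint8)
mask[3] = 1    # 9.9 is missing
mask[1] = 2    # 2.7 came from a different source

# Complete-case analysis is then a plain boolean selection, and the
# original values under nonzero codes remain recoverable.
usable = data[mask == 0]
mean_of_usable = usable.mean()
```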
___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] miniNEP1: where= argument for ufuncs
On Wed, Jul 6, 2011 at 5:41 PM, Lluís xscr...@gmx.net wrote: Sorry, but I didn't find a way of inserting inline comments in the gist. I'm a little confused about how gists work, actually. For actual discussion, it's probably just as well, since this way everyone sees the comment on the list and has a chance to join the conversation... but I'd be just as happy if other people could just go in and edit it, and I'm not sure how that works. I'm happy to move to somewhere else if people have suggestions, this was just easiest. Nathaniel Smith writes: [...] Is there any less stupid-looking name than ``where1=`` and ``where2=`` for the ``.outer`` operation? (For that matter, can ``.outer`` be applied to more than 2 arrays? The docs say it can't, but it's perfectly well-defined for an arbitrary number of arrays too, so maybe we want an interface that allows for 3-way, 4-way etc. ``.outer`` operations in the future?) Well, if outer can indeed be defined for an arbitrary number of arrays (and if it's going to be sometime in the future), I'd say the simplest is to use an array: .outer(a, b, ..., where = [my_where1, my_where2, ...]) Yeah, that's a much better idea... I've edited it to match. -- Nathaniel
[Numpy-discussion] miniNEP 2: NA support via special dtypes
Well, everyone seems to like my first attempt at this so far, so I guess I'll really stick my foot in it now... here's my second miniNEP, which lays out a plan for handling dtype/bit-pattern-style NAs. I've stolen bits of text from both the NEP and the alterNEP for this, but since the focus is on nailing down the details, most of the content is new. There are many FIXME's noted, where some decisions or more work is needed... the idea here is to lay out some specifics, so we can figure out if the idea will work and get the details right. So feedback is *very* welcome! Master version: https://gist.github.com/1068264 Current version for commenting: ### miniNEP 2: NA support via special dtypes ### To try and make more progress on the whole missing values/masked arrays/... debate, it seems useful to have a more technical discussion of the pieces which we *can* agree on. This is the second, which attempts to nail down the details of how NAs can be implemented using special dtypes. * Table of contents * .. contents:: * Rationale * An ordinary value is something like an integer or a floating point number. A missing value is a placeholder for an ordinary value that is for some reason unavailable. For example, in working with statistical data, we often build tables in which each row represents one item, and each column represents properties of that item. For instance, we might take a group of people and for each one record height, age, education level, and income, and then stick these values into a table. But then we discover that our research assistant screwed up and forgot to record the age of one of our individuals. We could throw out the rest of their data as well, but this would be wasteful; even such an incomplete row is still perfectly usable for some analyses (e.g., we can compute the correlation of height and income).
The traditional way to handle this would be to stick some particular meaningless value in for the missing data, e.g., recording this person's age as 0. But this is very error prone; we may later forget about these special values while running other analyses, and discover to our surprise that babies have higher incomes than teenagers. (In this case, the solution would be to just leave out all the items where we have no age recorded, but this isn't a general solution; many analyses require something more clever to handle missing values.) So instead of using an ordinary value like 0, we define a special missing value, written NA for not available. There are several possible ways to represent such a value in memory. For instance, we could reserve a specific value (like 0, or a particular NaN, or the smallest negative integer) and then ensure that this value is treated specially by all arithmetic and other operations on our array. Another option would be to add an additional mask array next to our main array, use this to indicate which values should be treated as NA, and then extend our array operations to check this mask array whenever performing computations. Each implementation approach has various strengths and weaknesses, but here we focus on the former (value-based) approach exclusively and leave the possible addition of the latter to future discussion. 
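The "reserve a specific value" option can be sketched by hand in today's numpy; choosing the smallest negative int32 below mirrors R's `NA_integer_` and is purely illustrative, not anything numpy itself does:

```python
import numpy as np

# Illustrative only: reserve INT32_MIN as the NA bit pattern.
NA_INT32 = np.int32(np.iinfo(np.int32).min)

ages = np.array([23, 41, NA_INT32, 35], dtype=np.int32)

# Every operation must now treat the reserved value specially --
# this manual bookkeeping is exactly what an NA-aware dtype would
# fold into the array machinery itself.
is_na = ages == NA_INT32
observed = ages[~is_na]
```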
The core advantages of this approach are (1) it adds no additional memory overhead, (2) it is straightforward to store and retrieve such arrays to disk using existing file storage formats, (3) it allows binary compatibility with R arrays including NA values, (4) it is compatible with the common practice of using NaN to indicate missingness when working with floating point numbers, (5) the dtype is already a place where `weird things can happen' -- there are a wide variety of dtypes that don't act like ordinary numbers (including structs, Python objects, fixed-length strings, ...), so code that accepts arbitrary numpy arrays already has to be prepared to handle these (even if only by checking for them and raising an error). Therefore adding yet more new dtypes has less impact on extension authors than if we change the ndarray object itself. The basic semantics of NA values are as follows. Like any other value, they must be supported by your array's dtype -- you can't store a floating point number in an array with dtype=int32, and you can't store an NA in it either. You need an array with dtype=NAint32 or something (exact syntax to be determined). Otherwise, NA values act exactly like any other values. In particular, you can apply arithmetic functions and so forth to them. By default, any function which takes an NA as an argument always returns an NA as well, regardless of the values of the other arguments. This ensures that if we try to compute the correlation of income with age, we will get NA, meaning given that some of the entries could be anything, the answer could be anything as well. This reminds us to
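The propagate-by-default semantics described here already exist for floats under the NaN-as-NA convention that advantage (4) above alludes to:

```python
import numpy as np

income = np.array([30_000.0, 52_000.0, np.nan, 41_000.0])

# By default the missing marker propagates through reductions...
total = income.sum()                 # nan

# ...and skipping it must be requested explicitly -- the same
# propagate-vs-skip split the NA proposal makes for every dtype.
observed_mean = np.nanmean(income)   # mean of the three observed values
```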
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
On Wed, Jul 6, 2011 at 7:14 PM, Christopher Jordan-Squire cjord...@uw.edu wrote: On Wed, Jul 6, 2011 at 3:47 PM, josef.p...@gmail.com wrote: On Wed, Jul 6, 2011 at 4:38 PM, josef.p...@gmail.com wrote: On Wed, Jul 6, 2011 at 4:22 PM, Christopher Jordan-Squire snip Mean value replacement, or more generally single scalar value replacement, is generally not a good idea. It biases downward your standard error estimates if you use mean replacement, and it will bias both if you use anything other than mean replacement. The bias gets worse with more missing data. So it's worst in precisely the cases where you'd want to fill in the data the most. (Though I admit I'm not too familiar with time series, so maybe this doesn't apply. But it's true as a general principle in statistics.) I'm not sure why we'd want to make this use case easier. Another qualification on this (I cannot help it). I think this only applies if you use a prefabricated no-missing-values algorithm. If I write it myself, I can do the proper correction for the reduced number of observations. (similar to the case when we ignore correlated information and use statistics based on uncorrelated observations, which also overestimate the amount of information we have available.) Can you do that sort of technique with longitudinal (panel) data? I'm honestly curious because I haven't looked into such corrections before. I haven't been able to find a reference after a few quick google searches. I don't suppose you know one off the top of your head? And you're right about the last measurement carried forward. I was just thinking about filling in all missing values with the same value. -Chris Jordan-Squire PS--Thanks for mentioning the statsmodels discussion. I'd been keeping track of that on a different email account, and I hadn't realized it wasn't forwarding those messages correctly.
Maybe a bit OT, but I've seen people doing imputation using Bayesian MCMC or multiple imputation for missing values in panel data. Google 'data augmentation' or 'multiple imputation'. I haven't looked much into the details yet, but it's definitely not mean replacement. FWIW (I haven't been following closely the discussion), there is a distinction in statistics between ignorable and nonignorable missing data, but I can't think of a situation where I would need this at the computational level rather than relying on a (numerically comparable) missing data type(s) a la SAS/Stata. I've also found the odd examples of IGNORE without a clear answer to be scary. Skipper
Re: [Numpy-discussion] miniNEP 2: NA support via special dtypes
On Wed, Jul 6, 2011 at 7:01 PM, Charles R Harris charlesr.har...@gmail.com wrote: Numpy already has a general mechanism for defining new dtypes and slotting them in so that they're supported by ndarrays, by the casting machinery, by ufuncs, and so on. In principle, we could implement Well, actually not in any useful sense, take a look at what Mark went through for the half floats. There is a reason the NEP went with parametrized dtypes and masks. But we would sure welcome a plan and code to make it true, it is one of the areas that could really use improvement. Err, yes, that's basically what the next few sentences say? This is basically a draft spec for implementing the parametrized dtypes idea. -- Nathaniel
Re: [Numpy-discussion] miniNEP 2: NA support via special dtypes
On Wed, Jul 6, 2011 at 8:09 PM, Nathaniel Smith n...@pobox.com wrote: On Wed, Jul 6, 2011 at 7:01 PM, Charles R Harris charlesr.har...@gmail.com wrote: Numpy already has a general mechanism for defining new dtypes and slotting them in so that they're supported by ndarrays, by the casting machinery, by ufuncs, and so on. In principle, we could implement Well, actually not in any useful sense, take a look at what Mark went through for the half floats. There is a reason the NEP went with parametrized dtypes and masks. But we would sure welcome a plan and code to make it true, it is one of the areas that could really use improvement. Err, yes, that's basically what the next few sentences say? This is basically a draft spec for implementing the parametrized dtypes idea. And yet: FIXME: this really needs attention from an expert on numpy's casting rules. But I can't seem to find the docs that explain how casting loops are looked up and decided between (e.g., if you're casting from dtype A to dtype B, which dtype's loops are used?), so I can't go into details. But those details are tricky and they matter... There is also a reason that masks were chosen to be implemented first. The numpy code is freely available and there is no reason not to make experiments or help Mark get some of the current problems solved, it doesn't need to be a one man effort and your feedback will have a lot more impact if you are in the trenches. In particular, I think there is a good deal of work that will need to be done for the sorts, argmax, and the other functions you mention that would give you a good idea of what was involved and how to go about implementing your ideas. Chuck
Re: [Numpy-discussion] miniNEP 2: NA support via special dtypes
On Wed, Jul 6, 2011 at 8:34 PM, Charles R Harris charlesr.har...@gmail.comwrote: On Wed, Jul 6, 2011 at 8:09 PM, Nathaniel Smith n...@pobox.com wrote: On Wed, Jul 6, 2011 at 7:01 PM, Charles R Harris charlesr.har...@gmail.com wrote: Numpy already has a general mechanism for defining new dtypes and slotting them in so that they're supported by ndarrays, by the casting machinery, by ufuncs, and so on. In principle, we could implement Well, actually not in any useful sense, take a look at what Mark went through for the half floats. There is a reason the NEP went with parametrized dtypes and masks. But we would sure welcome a plan and code to make it true, it is one of the areas that could really use improvement. Err, yes, that's basically what the next few sentences say? This is basically a draft spec for implementing the parametrized dtypes idea. And yet: FIXME: this really needs attention from an expert on numpy's casting rules. But I can't seem to find the docs that explain how casting loops are looked up and decided between (e.g., if you're casting from dtype A to dtype B, which dtype's loops are used?), so I can't go into details. But those details are tricky and they matter... There is also a reason that masks were chosen to be implemented first. The numpy code is freely available and there is no reason not to make experiments or help Mark get some of the current problems solved, it doesn't need to be a one man effort and your feedback will have a lot more impact if you are in the trenches. In particular, I think there is a good deal of work that will need to be done for the sorts, argmax, and the other functions you mention that would give you a good idea of what was involved and how to go about implementing your ideas. Let me lay out a bit more how I see things developing at this point, and bear in mind that I am not a psychic so this is just a guess ;) Mark is going to work at Enthought for maybe 3-4 more weeks and then return to school. 
Mark is very good, but that is still a very tough schedule and all the things in the NEP may not get finished, let alone all the supporting work that will be needed around the core implementation. After that what Mark does in his spare time is up to him. I expect there will be another numpy release sometime in the Fall, maybe around Nov/Dec, to get the new features, especially the datetime work, out there. At that point the interface is semi-fixed. I like to think that new features should be regarded as experimental for at least one release cycle, but that is certainly not official Numpy policy. In any case there is likely going to be a gap of several months where the rate of commits slows down and other folks, if they are interested, have a real opportunity to get involved. After the projected Fall release I see maybe another six months to make changes/extensions to the interface, and this is where new ideas can get worked out, but there needs to be someone with the interest and skill to implement those ideas for that to happen. If no such person shows up, then the interface will be what it is until there is such a person with an interest in carrying things forward. But at that point they will need to take care to maintain backward compatibility unless pretty much everyone agrees that the then-current interface is a disaster. Chuck
Re: [Numpy-discussion] miniNEP 2: NA support via special dtypes
On Wed, Jul 6, 2011 at 7:34 PM, Charles R Harris charlesr.har...@gmail.com wrote: On Wed, Jul 6, 2011 at 8:09 PM, Nathaniel Smith n...@pobox.com wrote: On Wed, Jul 6, 2011 at 7:01 PM, Charles R Harris charlesr.har...@gmail.com wrote: Numpy already has a general mechanism for defining new dtypes and slotting them in so that they're supported by ndarrays, by the casting machinery, by ufuncs, and so on. In principle, we could implement Well, actually not in any useful sense, take a look at what Mark went through for the half floats. There is a reason the NEP went with parametrized dtypes and masks. But we would sure welcome a plan and code to make it true, it is one of the areas that could really use improvement. Err, yes, that's basically what the next few sentences say? This is basically a draft spec for implementing the parametrized dtypes idea. And yet: FIXME: this really needs attention from an expert on numpy's casting rules. But I can't seem to find the docs that explain how casting loops are looked up and decided between (e.g., if you're casting from dtype A to dtype B, which dtype's loops are used?), so I can't go into details. But those details are tricky and they matter... There is also a reason that masks were chosen to be implemented first. The numpy code is freely available and there is no reason not to make experiments or help Mark get some of the current problems solved, it doesn't need to be a one man effort and your feedback will have a lot more impact if you are in the trenches. In particular, I think there is a good deal of work that will need to be done for the sorts, argmax, and the other functions you mention that would give you a good idea of what was involved and how to go about implementing your ideas. Hi Chuck, My goal in posting this was to try to find a way for those of us who disagree to still be productive together. 
If you'd like to help with that in a constructive way, then please do, but otherwise, can I ask in a polite and well-meaning way that you butt out? Scolding me for not getting in the trenches is not helpful. People like Wes and Matthew and I have been in the trenches for years building up numpy as a viable platform for statistical computing. (I can't claim that my efforts compare to theirs, but see for instance [1], which is an improved version of R's formula support, one of the other key advantages it has over Python. It works, so I'd have written some docs and released it by now, except I'm defending my PhD in 4 weeks, so, well, you know.) Yes, there are some details missing from the spec I wrote up in a few hours this afternoon, but how about we solve them? There are plenty of people on this list who know more than me, or Mark, or any one of any of us. This problem is complicated, but not *that* complicated. So, you know, let's do this. And maybe that way, in a month, we'll have something that we all actually like, even if it doesn't do everything that we want. -- Nathaniel [1] https://github.com/charlton/charlton
Re: [Numpy-discussion] using the same vocabulary for missing value ideas
On 7/6/11 11:57 AM, Mark Wiebe wrote: On Wed, Jul 6, 2011 at 1:25 PM, Christopher Barker Is this really true? If you use a bitpattern for IGNORE, haven't you just lost the ability to get the original value back if you want to stop ignoring it? Maybe that's not inherent to what an IGNORE means, but it seems pretty key to me. What do you think of renaming IGNORE to SKIP? This isn't a semantics issue -- IGNORE is fine. What I'm getting at is that we need a word (and code) for: ignore for now, but I might want to use it later - Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov