Re: [Numpy-discussion] NA/Missing Data Conference Call Summary

2011-07-06 Thread Matthew Brett
Hi,

Just for reference, I am using this as the latest version of the NEP -
I hope it's current:

https://github.com/m-paradox/numpy/blob/7b10c9ab1616b9100e98dd2ab80cef639d5b5735/doc/neps/missing-data.rst

I'm mostly relaying stuff I said, although generally (please do
correct me if I am wrong) I am just re-expressing points that
Nathaniel has already made in the alterNEP text and the emails.

On Wed, Jul 6, 2011 at 12:46 AM, Christopher Jordan-Squire
cjord...@uw.edu wrote:
...
 Since Mark is only around Austin until early August, there's
 also broad agreement that we need to get something done quickly.

I think I might have missed that part of the discussion :)

I feel the need to emphasize the centrality of the assertion by
Nathaniel, and agreement by (at least) me, that the NA case (there
really is no data) and the IGNORE case (there is data but I'm
concealing it from you) are conceptually different, and come from
different use-cases.

The underlying disagreement returned many times to this fundamental
difference between the NEP and alterNEP:

In the NEP - by design - it is impossible to distinguish between na.NA
and na.IGNORE
The alterNEP insists you should be able to distinguish.

Mark says something like it's all missing data, there's no reason you
should want to distinguish.  Nathaniel and I were saying the two
types of missing do have different use-cases, and it should be
possible to distinguish.  You might want to choose to treat them the
same, but you should be able to see what they are.

I returned several times to this (original point by Nathaniel):

a[3] = np.NA

(what does this mean?  Am I altering the underlying array, or a mask?
 How would I explain this to someone?)

We confirmed that, in order to make it difficult to know what your NA
is (masked or bit-pattern), Mark has to a) hinder access to the data
below the mask and b) prevent direct API access to the masking array.
I described this as 'hobbling the API' and Mark thought of it as
'generic programming' (missing is always missing).

I asserted that explaining NA to people would be easier if ``a[3] =
np.NA`` was direct assignment and altered the array.
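
To make the distinction concrete, here is a small sketch with today's
numpy.ma (np.NA doesn't exist yet, so np.ma.masked stands in for the
IGNORE side; this is only an illustration of the two meanings, not
either proposal's API):

import numpy as np

a = np.ma.array([0., 1., 2., 3., 4.])
a[3] = np.ma.masked      # "IGNORE": only the mask changes
print(a.data[3])         # -> 3.0, the original value is still underneath
a.mask[3] = False        # unmasking recovers it

a.data[3] = np.nan       # "NA" in the destructive sense: the value really is gone

Under the alterNEP reading, ``a[3] = np.NA`` would be the second,
destructive kind of assignment; under the NEP you cannot tell, from the
Python level, which of the two happened.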

 BIT PATTERN AND MASK IMPLEMENTATIONS FOR NA
 -------------------------------------------
 The current NEP proposes both mask and bit pattern implementations for
 missing data. I use the terms bit pattern and parameterized dtype
 interchangeably, since the parameterized dtype will use a bit pattern for
 its implementation. The two implementations will support the same
 functionality with respect to NA, and the implementation details will be
 largely invisible to the user. Their differences are in the 'extra' features
 each supports.

 Two common questions were:
 1. Why make two implementations of missing data: one with masks and the
 other with parameterized dtypes?
 2. Why does the implementation using masks have higher priority?
 The answers are:
 1.  The mask implementation is more general and easier to implement and
 maintain.  The bit pattern implementation saves memory, makes
 interoperability easier, and makes ABI (Application Binary Interface)
 compatibility easier. Since each has different strengths, the argument is
 both should be implemented.
 2. The implementation for the parameterized dtypes will rely on the
 implementation using a mask.

 NA VS. IGNORE
 -------------
 A lot of discussion centered on IGNORE vs. NA types. We take IGNORE in the aNEP
 sense and NA in the NEP sense. With NA, there is a clear notion of how NA
 propagates through all basic numpy operations.  (e.g., 3+NA=NA and log(NA) =
 NA, while NA | True = True.) IGNORE is separate from NA, with different
 interpretations depending on the use case.
 IGNORE could mean:
 1. Data that is being temporarily ignored. e.g., a possible outlier that is
 temporarily being removed from consideration.
 2. Data that cannot exist. e.g., a matrix representing a grid of water
 depths for a lake. Since the lake isn't square, some entries will represent
 land, and so depth will be a meaningless concept for those entries.
 3. Using IGNORE to signal a jagged array. e.g., [ [1, 2, IGNORE], [IGNORE,
 3, 4] ] should behave exactly the same as [ [1 , 2] , [3 , 4] ]. Though this
 leaves open how [1, 2, IGNORE] + [3 , 4] should behave.
 Because of these different uses of IGNORE, it doesn't have as clear a
 theoretical interpretation as NA. (For instance, what is IGNORE+3, IGNORE*3,
 or IGNORE | True?)

I don't remember this bit of the discussion, but I see from current
masked arrays that IGNORE is treated as the identity, so:

IGNORE + 3 = 3
IGNORE * 3 = 3
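
At least for reductions, that is what numpy.ma does today: masked entries
act as the identity of the operation (an elementwise ``np.ma.masked + 3``
actually returns masked rather than 3).  A quick check -- this is just
current numpy.ma behaviour, not what either NEP specifies:

import numpy as np

a = np.ma.array([1, 2, 3], mask=[False, True, False])
print(a.sum())            # -> 4; the masked 2 contributes the additive identity, 0
print(a.prod())           # -> 3; it contributes the multiplicative identity, 1
print(np.ma.masked + 3)   # -> masked, not 3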

 But several of the discussants thought the use cases for IGNORE were very
 compelling. Specifically, they wanted to be able to use IGNORE's and NA's
 simultaneously while still being able to differentiate between them. So, for
 example, being able to designate some data as IGNORE while still able to
 determine 

Re: [Numpy-discussion] NA/Missing Data Conference Call Summary

2011-07-06 Thread Dag Sverre Seljebotn
On 07/06/2011 02:05 PM, Matthew Brett wrote:
 Hi,

 Just for reference, I am using this as the latest version of the NEP -
 I hope it's current:

 https://github.com/m-paradox/numpy/blob/7b10c9ab1616b9100e98dd2ab80cef639d5b5735/doc/neps/missing-data.rst

 I'm mostly relaying stuff I said, although generally (please do
 correct me if I am wrong) I am just re-expressing points that
 Nathaniel has already made in the alterNEP text and the emails.

 On Wed, Jul 6, 2011 at 12:46 AM, Christopher Jordan-Squire
 cjord...@uw.edu  wrote:
 ...
 Since Mark is only around Austin until early August, there's
 also broad agreement that we need to get something done quickly.

 I think I might have missed that part of the discussion :)

 I feel the need to emphasize the centrality of the assertion by
 Nathaniel, and agreement by (at least) me, that the NA case (there
 really is no data) and the IGNORE case (there is data but I'm
 concealing it from you) are conceptually different, and come from
 different use-cases.

 The underlying disagreement returned many times to this fundamental
 difference between the NEP and alterNEP:

 In the NEP - by design - it is impossible to distinguish between na.NA
 and na.IGNORE
 The alterNEP insists you should be able to distinguish.

 Mark says something like it's all missing data, there's no reason you
 should want to distinguish.  Nathaniel and I were saying the two
 types of missing do have different use-cases, and it should be
 possible to distinguish.  You might want to choose to treat them the
 same, but you should be able to see what they are.

 I returned several times to this (original point by Nathaniel):

 a[3] = np.NA

 (what does this mean?   I am altering the underlying array, or a mask?
How would I explain this to someone?)

 We confirmed that, in order to make it difficult to know what your NA
 is (masked or bit-pattern), Mark has to a) hinder access to the data
 below the mask and b) prevent direct API access to the masking array.
 I described this as 'hobbling the API' and Mark thought of it as
 'generic programming' (missing is always missing).

Here's an HPC perspective...:

If you, say, want to off-load array processing with a mask to some code 
running on a GPU, you really can't have the GPU go through some NumPy 
API. Or if you want to implement a masked array on a cluster with MPI, 
you similarly really, really want raw access.

At least I feel that the transparency of NumPy is a huge part of its 
current success. Many more than me spend half their time in C/Fortran 
and half their time in Python.

I tend to look at NumPy this way: Assuming you have some data in memory 
(possibly loaded by a C or Fortran library). (Almost) no matter how it 
is allocated, ordered, packed, aligned -- there's a way to find strides 
and dtypes to put a nice NumPy wrapper around it and use the memory from 
Python.
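
For instance -- just a toy illustration of that transparency, not anything
from either NEP -- a buffer allocated by a C or Fortran library can be
viewed in place:

import ctypes
import numpy as np

buf = (ctypes.c_double * 6)(*range(6))     # pretend this came from a C/Fortran library
a = np.frombuffer(buf, dtype=np.float64)   # zero-copy view of the foreign memory
a = a.reshape(2, 3, order='F')             # impose a Fortran layout purely via strides
print(a.strides)                           # (8, 16): no data was moved or copied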

So, my view on Mark's NEP was: with a reasonable amount of flexibility
in how you decide to implement masking for your data, you can create a
NumPy wrapper that will understand that. Whether your Fortran library
exposes NAs in its 40GB buffer as bit patterns, or using a separate
mask, both will work.

And IMO Mark's NEP comes rather close to this; you just need an
additional NEP later to expose the raw implementation details,
once those are settled :-)

Dag Sverre


Re: [Numpy-discussion] NA/Missing Data Conference Call Summary

2011-07-06 Thread Dag Sverre Seljebotn
On 07/06/2011 02:27 PM, Dag Sverre Seljebotn wrote:
 On 07/06/2011 02:05 PM, Matthew Brett wrote:
 Hi,

 Just for reference, I am using this as the latest version of the NEP -
 I hope it's current:

 https://github.com/m-paradox/numpy/blob/7b10c9ab1616b9100e98dd2ab80cef639d5b5735/doc/neps/missing-data.rst

 I'm mostly relaying stuff I said, although generally (please do
 correct me if I am wrong) I am just re-expressing points that
 Nathaniel has already made in the alterNEP text and the emails.

 On Wed, Jul 6, 2011 at 12:46 AM, Christopher Jordan-Squire
 cjord...@uw.edu   wrote:
 ...
 Since Mark is only around Austin until early August, there's
 also broad agreement that we need to get something done quickly.

 I think I might have missed that part of the discussion :)

 I feel the need to emphasize the centrality of the assertion by
 Nathaniel, and agreement by (at least) me, that the NA case (there
 really is no data) and the IGNORE case (there is data but I'm
 concealing it from you) are conceptually different, and come from
 different use-cases.

 The underlying disagreement returned many times to this fundamental
 difference between the NEP and alterNEP:

 In the NEP - by design - it is impossible to distinguish between na.NA
 and na.IGNORE
 The alterNEP insists you should be able to distinguish.

 Mark says something like it's all missing data, there's no reason you
 should want to distinguish.  Nathaniel and I were saying the two
 types of missing do have different use-cases, and it should be
 possible to distinguish.  You might want to choose to treat them the
 same, but you should be able to see what they are.

 I returned several times to this (original point by Nathaniel):

 a[3] = np.NA

 (what does this mean?   I am altering the underlying array, or a mask?
 How would I explain this to someone?)

 We confirmed that, in order to make it difficult to know what your NA
 is (masked or bit-pattern), Mark has to a) hinder access to the data
 below the mask and b) prevent direct API access to the masking array.
 I described this as 'hobbling the API' and Mark thought of it as
 'generic programming' (missing is always missing).

 Here's an HPC perspective...:

 If you, say, want to off-load array processing with a mask to some code
 running on a GPU, you really can't have the GPU go through some NumPy
 API. Or if you want to implement a masked array on a cluster with MPI,
 you similarly really, really want raw access.

 At least I feel that the transparency of NumPy is a huge part of its
 current success. Many more than me spend half their time in C/Fortran
 and half their time in Python.

 I tend to look at NumPy this way: Assuming you have some data in memory
 (possibly loaded by a C or Fortran library). (Almost) no matter how it
 is allocated, ordered, packed, aligned -- there's a way to find strides
 and dtypes to put a nice NumPy wrapper around it and use the memory from
 Python.

 So, my view on Mark's NEP was: with a reasonable amount of flexibility
 in how you decide to implement masking for your data, you can create a
 NumPy wrapper that will understand that. Whether your Fortran library
 exposes NAs in its 40GB buffer as bit patterns, or using a separate
 mask, both will work.

 And IMO Mark's NEP comes rather close to this; you just need an
 additional NEP later to expose the raw implementation details,
 once those are settled :-)

To be concrete, I'm thinking something like a custom extension to PEP 
3118, which could also allow efficient access from Cython without 
hard-coding Cython for NumPy (a GSoC project this summer will continue 
to move us away from the np.ndarray[int] syntax to a more generic 
int[:] that's less tied to NumPy).
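
Just to illustrate what the existing protocol already exposes (this
snippet is only an illustration, not part of any proposal): the dtype,
shape and strides of an array are visible to any consumer without
touching the NumPy C API, but there is no slot for a mask yet -- that is
the gap such an extension would have to fill.

import numpy as np

a = np.arange(6, dtype=np.int64).reshape(2, 3)
m = memoryview(a)                     # a PEP 3118 buffer view, no NumPy API involved
print(m.format, m.shape, m.strides)   # 'l' or 'q', (2, 3), (24, 8)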

But first things first!

Dag Sverre


Re: [Numpy-discussion] NA/Missing Data Conference Call Summary

2011-07-06 Thread Christopher Jordan-Squire
On Wed, Jul 6, 2011 at 5:05 AM, Matthew Brett matthew.br...@gmail.comwrote:

 Hi,

 Just for reference, I am using this as the latest version of the NEP -
 I hope it's current:


 https://github.com/m-paradox/numpy/blob/7b10c9ab1616b9100e98dd2ab80cef639d5b5735/doc/neps/missing-data.rst

 I'm mostly relaying stuff I said, although generally (please do
 correct me if I am wrong) I am just re-expressing points that
 Nathaniel has already made in the alterNEP text and the emails.

 On Wed, Jul 6, 2011 at 12:46 AM, Christopher Jordan-Squire
 cjord...@uw.edu wrote:
 ...
  Since Mark is only around Austin until early August, there's
  also broad agreement that we need to get something done quickly.

 I think I might have missed that part of the discussion :)


I think that might have been mentioned by Travis right before he had to
leave for another meeting, which might have been after you'd disconnected.
Travis' concern as a member of the numpy community is the desire for something
that is broadly applicable and adopted. But as Mark's employer, his concern
is to get more complete and coherent missing data functionality
implemented in numpy while Mark is still at Enthought, for use, if nothing
else, in the problems Enthought and statisticians commonly encounter.



 I feel the need to emphasize the centrality of the assertion by
 Nathaniel, and agreement by (at least) me, that the NA case (there
 really is no data) and the IGNORE case (there is data but I'm
 concealing it from you) are conceptually different, and come from
 different use-cases.

 The underlying disagreement returned many times to this fundamental
 difference between the NEP and alterNEP:

 In the NEP - by design - it is impossible to distinguish between na.NA
 and na.IGNORE
 The alterNEP insists you should be able to distinguish.

 Mark says something like it's all missing data, there's no reason you
 should want to distinguish.  Nathaniel and I were saying the two
 types of missing do have different use-cases, and it should be
  possible to distinguish.  You might want to choose to treat them the
  same, but you should be able to see what they are.

 I returned several times to this (original point by Nathaniel):

 a[3] = np.NA

 (what does this mean?   I am altering the underlying array, or a mask?
  How would I explain this to someone?)

 We confirmed that, in order to make it difficult to know what your NA
 is (masked or bit-pattern), Mark has to a) hinder access to the data
 below the mask and b) prevent direct API access to the masking array.
 I described this as 'hobbling the API' and Mark thought of it as
 'generic programming' (missing is always missing).

 I asserted that explaining NA to people would be easier if ``a[3] =
 np.NA`` was direct assignment and altered the array.

  BIT PATTERN AND MASK IMPLEMENTATIONS FOR NA
  -------------------------------------------
  The current NEP proposes both mask and bit pattern implementations for
  missing data. I use the terms bit pattern and parameterized dtype
  interchangeably, since the parameterized dtype will use a bit pattern for
  its implementation. The two implementations will support the same
  functionality with respect to NA, and the implementation details will be
  largely invisible to the user. Their differences are in the 'extra'
 features
  each supports.
 
  Two common questions were:
  1. Why make two implementations of missing data: one with masks and the
  other with parameterized dtypes?
  2. Why does the implementation using masks have higher priority?
  The answers are:
  1.  The mask implementation is more general and easier to implement and
  maintain.  The bit pattern implementation saves memory, makes
  interoperability easier, and makes ABI (Application Binary Interface)
  compatibility easier. Since each has different strengths, the argument is
  both should be implemented.
  2. The implementation for the parameterized dtypes will rely on the
  implementation using a mask.
 
  NA VS. IGNORE
  -
  A lot of discussion centered on IGNORE vs. NA types. We take IGNORE in
 aNEP
  sense and NA in  NEP sense. With NA, there is a clear notion of how NA
  propagates through all basic numpy operations.  (e.g., 3+NA=NA and
 log(NA) =
  NA, while NA | True = True.) IGNORE is separate from NA, with different
  interpretations depending on the use case.
  IGNORE could mean:
  1. Data that is being temporarily ignored. e.g., a possible outlier that
 is
  temporarily being removed from consideration.
  2. Data that cannot exist. e.g., a matrix representing a grid of water
  depths for a lake. Since the lake isn't square, some entries will
 represent
  land, and so depth will be a meaningless concept for those entries.
  3. Using IGNORE to signal a jagged array. e.g., [ [1, 2, IGNORE],
 [IGNORE,
  3, 4] ] should behave exactly the same as [ [1 , 2] , [3 , 4] ]. Though
 this
  leaves open how [1, 

Re: [Numpy-discussion] NA/Missing Data Conference Call Summary

2011-07-06 Thread Christopher Barker
Christopher Jordan-Squire wrote:
 Here's a short-ish summary of the topics discussed in the conference 
 call this afternoon.

Thanks, this is great! And thanks to all who participated in the call.

 3. Using IGNORE to signal a jagged array. e.g., [ [1, 2, IGNORE], 
 [IGNORE, 3, 4] ] should behave exactly the same as [ [1 , 2] , [3 , 4] 
 ].

whoooa!

I actually have been looking for, and thinking about jagged arrays a 
fair bit lately, so this is kind of exciting, but this looks like a bad 
idea to me. The above indicates that:

a = np.array( [ [1, 2, np.IGNORE],
                [np.IGNORE, 3, 4] ] )

a[:,1] would yield:

array([2, 4])

which seems really wrong -- you've tossed out the location information 
altogether.

(I think it should be: array([2, 3]).)
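
For what it's worth, today's numpy.ma already behaves the way I'd expect
here -- positions are preserved and nothing gets shifted left.  Just an
illustration, with masked values standing in for IGNORE:

import numpy as np

a = np.ma.array([[1, 2, -1], [-1, 3, 4]],
                mask=[[False, False, True], [True, False, False]])
print(a[:, 1])    # -> [2 3]: column 1 keeps its location information
print(a[:, 2])    # -> [-- 4]: the IGNOREd slot stays where it is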


I could see a jagged array being represented by IGNOREs all at the END 
of each row, but putting items in the middle, and shifting things to the 
left strikes me as a plain old bad idea (and a pain to implement)

-Chris



-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov


Re: [Numpy-discussion] NA/Missing Data Conference Call Summary

2011-07-06 Thread Matthew Brett
Hi,

On Wed, Jul 6, 2011 at 6:54 PM, Christopher Jordan-Squire
cjord...@uw.edu wrote:


 On Wed, Jul 6, 2011 at 5:05 AM, Matthew Brett matthew.br...@gmail.com
 wrote:

 Hi,

 Just for reference, I am using this as the latest version of the NEP -
 I hope it's current:


 https://github.com/m-paradox/numpy/blob/7b10c9ab1616b9100e98dd2ab80cef639d5b5735/doc/neps/missing-data.rst

 I'm mostly relaying stuff I said, although generally (please do
 correct me if I am wrong) I am just re-expressing points that
 Nathaniel has already made in the alterNEP text and the emails.

 On Wed, Jul 6, 2011 at 12:46 AM, Christopher Jordan-Squire
 cjord...@uw.edu wrote:
 ...
  Since Mark is only around Austin until early August, there's
  also broad agreement that we need to get something done quickly.

 I think I might have missed that part of the discussion :)


 I think that might have been mentioned by Travis right before he had to
 leave for another meeting, which might have been after you'd disconnected.
 Travis' concern as a member of a numpy community is the desire for something
 that is broadly applicable and adopted. But as Mark's employer, his concern
 is to get a more complete and coherent missing data functionality
 implemented in numpy while Mark is still at Enthought, for use in the
 problems Enthought and statisticians commonly encounter if nothing else.

Sorry - yes - I wasn't there for all the conversation.  Of course
(not disagreeing), we must take care to get the API right, because it's
unlikely to change and we will be explaining and supporting it for a long
time to come.

 I feel the need to emphasize the centrality of the assertion by
 Nathaniel, and agreement by (at least) me, that the NA case (there
 really is no data) and the IGNORE case (there is data but I'm
 concealing it from you) are conceptually different, and come from
 different use-cases.

 The underlying disagreement returned many times to this fundamental
 difference between the NEP and alterNEP:

 In the NEP - by design - it is impossible to distinguish between na.NA
 and na.IGNORE
 The alterNEP insists you should be able to distinguish.

 Mark says something like it's all missing data, there's no reason you
 should want to distinguish.  Nathaniel and I were saying the two
 types of missing do have different use-cases, and it should be
  possible to distinguish.  You might want to choose to treat them the
  same, but you should be able to see what they are.

 I returned several times to this (original point by Nathaniel):

 a[3] = np.NA

 (what does this mean?   I am altering the underlying array, or a mask?
  How would I explain this to someone?)

 We confirmed that, in order to make it difficult to know what your NA
 is (masked or bit-pattern), Mark has to a) hinder access to the data
 below the mask and b) prevent direct API access to the masking array.
 I described this as 'hobbling the API' and Mark thought of it as
 'generic programming' (missing is always missing).

 I asserted that explaining NA to people would be easier if ``a[3] =
 np.NA`` was direct assignment and altered the array.

   BIT PATTERN AND MASK IMPLEMENTATIONS FOR NA
   -------------------------------------------
  The current NEP proposes both mask and bit pattern implementations for
  missing data. I use the terms bit pattern and parameterized dtype
  interchangeably, since the parameterized dtype will use a bit pattern
  for
  its implementation. The two implementations will support the same
  functionality with respect to NA, and the implementation details will be
  largely invisible to the user. Their differences are in the 'extra'
  features
  each supports.
 
  Two common questions were:
  1. Why make two implementations of missing data: one with masks and the
  other with parameterized dtypes?
  2. Why does the implementation using masks have higher priority?
  The answers are:
  1.  The mask implementation is more general and easier to implement and
  maintain.  The bit pattern implementation saves memory, makes
  interoperability easier, and makes ABI (Application Binary Interface)
  compatibility easier. Since each has different strengths, the argument
  is
  both should be implemented.
  2. The implementation for the parameterized dtypes will rely on the
  implementation using a mask.
 
  NA VS. IGNORE
  -
  A lot of discussion centered on IGNORE vs. NA types. We take IGNORE in
  aNEP
  sense and NA in  NEP sense. With NA, there is a clear notion of how NA
  propagates through all basic numpy operations.  (e.g., 3+NA=NA and
  log(NA) =
  NA, while NA | True = True.) IGNORE is separate from NA, with different
  interpretations depending on the use case.
  IGNORE could mean:
  1. Data that is being temporarily ignored. e.g., a possible outlier that
  is
  temporarily being removed from consideration.
  2. Data that cannot exist. e.g., a matrix representing a grid of water

Re: [Numpy-discussion] NA/Missing Data Conference Call Summary

2011-07-06 Thread Christopher Barker
Dag Sverre Seljebotn wrote:
 Here's an HPC perspective...:

 At least I feel that the transparency of NumPy is a huge part of its 
 current success. Many more than me spend half their time in C/Fortran 
 and half their time in Python.

Absolutely -- and this point has been raised a couple times in the 
discussion, so I hope it is not forgotten.

   I tend to look at NumPy this way: Assuming you have some data in memory
 (possibly loaded by a C or Fortran library). (Almost) no matter how it 
 is allocated, ordered, packed, aligned -- there's a way to find strides 
 and dtypes to put a nice NumPy wrapper around it and use the memory from 
 Python.

and vice-versa -- Assuming you have some data in numpy arrays, there's a 
way to process it with a C or Fortran library without copying the data.

And this is where I am skeptical of the bit-pattern idea -- while one
can expect C and Fortran and GPU code, and ???, to understand NaNs for
floating point data, is there any support in compilers or hardware for
special bit patterns for NA values in integers? I've never seen it in my
(very limited) experience.
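
To spell out the worry: with floats the hardware gives you NaN for free,
but with integers an NA would have to be a plain sentinel value that
every piece of C/Fortran/GPU code agrees to test for by hand.  A rough
sketch of what that convention looks like (the INT_NA choice here is
just the one R happens to use for integer NA, not anything NumPy defines):

import numpy as np

INT_NA = np.iinfo(np.int32).min   # reserve one bit pattern as "NA" (R's choice)
a = np.array([1, 2, INT_NA, 4], dtype=np.int32)
valid = (a != INT_NA)             # no compiler or hardware does this test for you;
valid_sum = a[valid].sum()        # every consumer has to repeat it by hand
print(valid_sum)                  # -> 7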

Maybe having the mask option, too, will make that irrelevant, but I want 
to be clear about that kind of use case.

-Chris





-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov


Re: [Numpy-discussion] NA/Missing Data Conference Call Summary

2011-07-06 Thread Christopher Barker
Christopher Jordan-Squire wrote:
 If we follow those rules for IGNORE for all computations, we sometimes 
 get some weird output. For example:
 [ [1, 2], [3, 4] ] * [ IGNORE, 7] = [ 15, 31 ]. (Where * is matrix 
 multiply and not * with broadcasting.) Or should that sort of operation 
 throw an error?

That should throw an error -- matrix computation is heavily influenced 
by the shape and size of matrices, so I think IGNORES really don't make 
sense there.
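
For comparison, current numpy.ma doesn't raise here: np.ma.dot simply
fills masked entries with 0 by default (strict=True propagates the mask
instead).  Neither of those is the IGNORE-as-identity rule, which is part
of why the semantics feel underspecified to me.  A quick check, with
masked values standing in for IGNORE:

import numpy as np

A = np.ma.array([[1, 2], [3, 4]])
b = np.ma.array([[0], [7]], mask=[[True], [False]])
print(np.ma.dot(A, b))    # [[14] [28]]: the masked entry just contributes 0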


Nathaniel Smith wrote:
 It's exactly this transparency that worries Matthew and me -- we feel
 that the alterNEP preserves it, and the NEP attempts to erase it. In
 the NEP, there are two totally different underlying data structures,
 but this difference is blurred at the Python level. The idea is that
 you shouldn't have to think about which you have, but if you work with
 C/Fortran, then of course you do have to be constantly aware of the
 underlying implementation anyway. 

I don't think this bothers me -- I think it's analogous to things in 
numpy like Fortran order and non-contiguous arrays -- you can ignore all 
that when working in pure python when performance isn't critical, but 
you need a deeper understanding if you want to work with the data in C 
or Fortran or to tune performance in python.

So as long as there is an API to query and control how things work, I 
like that it's hidden from simple python code.

-Chris





-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov


Re: [Numpy-discussion] NA/Missing Data Conference Call Summary

2011-07-06 Thread Christopher Jordan-Squire
On Wed, Jul 6, 2011 at 11:38 AM, Christopher Barker
chris.bar...@noaa.govwrote:

 Christopher Jordan-Squire wrote:
  If we follow those rules for IGNORE for all computations, we sometimes
  get some weird output. For example:
  [ [1, 2], [3, 4] ] * [ IGNORE, 7] = [ 15, 31 ]. (Where * is matrix
  multiply and not * with broadcasting.) Or should that sort of operation
  throw an error?

 That should throw an error -- matrix computation is heavily influenced
 by the shape and size of matrices, so I think IGNORES really don't make
 sense there.



If the IGNORES don't make sense in basic numpy computations then I'm kinda
confused why they'd be included at the numpy core level.



 Nathaniel Smith wrote:
  It's exactly this transparency that worries Matthew and me -- we feel
  that the alterNEP preserves it, and the NEP attempts to erase it. In
  the NEP, there are two totally different underlying data structures,
  but this difference is blurred at the Python level. The idea is that
  you shouldn't have to think about which you have, but if you work with
  C/Fortran, then of course you do have to be constantly aware of the
  underlying implementation anyway.

 I don't think this bothers me -- I think it's analogous to things in
 numpy like Fortran order and non-contiguous arrays -- you can ignore all
 that when working in pure python when performance isn't critical, but
 you need a deeper understanding if you want to work with the data in C
 or Fortran or to tune performance in python.

 So as long as there is an API to query and control how things work, I
 like that it's hidden from simple python code.

 -Chris



I'm similarly not too concerned about it. Performance seems finicky when
you're dealing with missing data, since a lot of arrays will likely have to
be copied over to other arrays containing only complete data before being
handed over to BLAS. My primary concern is that the np.NA stuff 'just
works'. Especially since I've never run into use cases in statistics where
the difference between IGNORE and NA mattered.







 --
 Christopher Barker, Ph.D.
 Oceanographer

 Emergency Response Division
 NOAA/NOS/ORR(206) 526-6959   voice
 7600 Sand Point Way NE   (206) 526-6329   fax
 Seattle, WA  98115   (206) 526-6317   main reception

 chris.bar...@noaa.gov


Re: [Numpy-discussion] NA/Missing Data Conference Call Summary

2011-07-06 Thread josef . pktd
On Wed, Jul 6, 2011 at 3:38 PM, Christopher Jordan-Squire
cjord...@uw.edu wrote:


 On Wed, Jul 6, 2011 at 11:38 AM, Christopher Barker chris.bar...@noaa.gov
 wrote:

 Christopher Jordan-Squire wrote:
  If we follow those rules for IGNORE for all computations, we sometimes
  get some weird output. For example:
  [ [1, 2], [3, 4] ] * [ IGNORE, 7] = [ 15, 31 ]. (Where * is matrix
  multiply and not * with broadcasting.) Or should that sort of operation
  throw an error?

 That should throw an error -- matrix computation is heavily influenced
 by the shape and size of matrices, so I think IGNORES really don't make
 sense there.



 If the IGNORES don't make sense in basic numpy computations then I'm kinda
 confused why they'd be included at the numpy core level.


 Nathaniel Smith wrote:
  It's exactly this transparency that worries Matthew and me -- we feel
  that the alterNEP preserves it, and the NEP attempts to erase it. In
  the NEP, there are two totally different underlying data structures,
  but this difference is blurred at the Python level. The idea is that
  you shouldn't have to think about which you have, but if you work with
  C/Fortran, then of course you do have to be constantly aware of the
  underlying implementation anyway.

 I don't think this bothers me -- I think it's analogous to things in
 numpy like Fortran order and non-contiguous arrays -- you can ignore all
 that when working in pure python when performance isn't critical, but
 you need a deeper understanding if you want to work with the data in C
 or Fortran or to tune performance in python.

 So as long as there is an API to query and control how things work, I
 like that it's hidden from simple python code.

 -Chris



 I'm similarly not too concerned about it. Performance seems finicky when
 you're dealing with missing data, since a lot of arrays will likely have to
 be copied over to other arrays containing only complete data before being
 handed over to BLAS.

Unless you know the neutral value for the computation, or you just want
to do a forward_fill in time series, and you have to ask the user not
to give you an immutable array with NAs if they don't want extra
copies.

Josef

 My primary concern is that the np.NA stuff 'just
 works'. Especially since I've never run into use cases in statistics where
 the difference between IGNORE and NA mattered.




 --
 Christopher Barker, Ph.D.
 Oceanographer

 Emergency Response Division
 NOAA/NOS/ORR            (206) 526-6959   voice
 7600 Sand Point Way NE   (206) 526-6329   fax
 Seattle, WA  98115       (206) 526-6317   main reception

 chris.bar...@noaa.gov


Re: [Numpy-discussion] NA/Missing Data Conference Call Summary

2011-07-06 Thread Bruce Southey

On 07/06/2011 02:38 PM, Christopher Jordan-Squire wrote:



On Wed, Jul 6, 2011 at 11:38 AM, Christopher Barker 
chris.bar...@noaa.gov mailto:chris.bar...@noaa.gov wrote:


Christopher Jordan-Squire wrote:
 If we follow those rules for IGNORE for all computations, we
sometimes
 get some weird output. For example:
 [ [1, 2], [3, 4] ] * [ IGNORE, 7] = [ 15, 31 ]. (Where * is matrix
 multiply and not * with broadcasting.) Or should that sort of
operation
 throw an error?

That should throw an error -- matrix computation is heavily influenced
by the shape and size of matrices, so I think IGNORES really don't
make
sense there.



If the IGNORES don't make sense in basic numpy computations then I'm 
kinda confused why they'd be included at the numpy core level.


Nathaniel Smith wrote:
 It's exactly this transparency that worries Matthew and me -- we
feel
 that the alterNEP preserves it, and the NEP attempts to erase it. In
 the NEP, there are two totally different underlying data structures,
 but this difference is blurred at the Python level. The idea is that
 you shouldn't have to think about which you have, but if you
work with
 C/Fortran, then of course you do have to be constantly aware of the
 underlying implementation anyway.

I don't think this bothers me -- I think it's analogous to things in
numpy like Fortran order and non-contiguous arrays -- you can
ignore all
that when working in pure python when performance isn't critical, but
you need a deeper understanding if you want to work with the data in C
or Fortran or to tune performance in python.

So as long as there is an API to query and control how things work, I
like that it's hidden from simple python code.

-Chris



I'm similarly not too concerned about it. Performance seems finicky 
when you're dealing with missing data, since a lot of arrays will 
likely have to be copied over to other arrays containing only complete 
data before being handed over to BLAS. My primary concern is that the 
np.NA stuff 'just works'. Especially since I've never run into use 
cases in statistics where the difference between IGNORE and NA mattered.




Exactly!
I have not been able to think of a real example where that difference
matters, as the calculations are only on the 'valid' (i.e., non-missing and
non-masked) values.


Bruce






Re: [Numpy-discussion] NA/Missing Data Conference Call Summary

2011-07-06 Thread Christopher Jordan-Squire
On Wed, Jul 6, 2011 at 1:08 PM, josef.p...@gmail.com wrote:

 On Wed, Jul 6, 2011 at 3:38 PM, Christopher Jordan-Squire
 cjord...@uw.edu wrote:
 
 
  On Wed, Jul 6, 2011 at 11:38 AM, Christopher Barker 
 chris.bar...@noaa.gov
  wrote:
 
  Christopher Jordan-Squire wrote:
   If we follow those rules for IGNORE for all computations, we sometimes
   get some weird output. For example:
   [ [1, 2], [3, 4] ] * [ IGNORE, 7] = [ 15, 31 ]. (Where * is matrix
   multiply and not * with broadcasting.) Or should that sort of
 operation
  throw an error?
 
  That should throw an error -- matrix computation is heavily influenced
  by the shape and size of matrices, so I think IGNORES really don't make
  sense there.
 
 
 
  If the IGNORES don't make sense in basic numpy computations then I'm
 kinda
  confused why they'd be included at the numpy core level.
 
 
  Nathaniel Smith wrote:
   It's exactly this transparency that worries Matthew and me -- we feel
   that the alterNEP preserves it, and the NEP attempts to erase it. In
   the NEP, there are two totally different underlying data structures,
   but this difference is blurred at the Python level. The idea is that
   you shouldn't have to think about which you have, but if you work with
   C/Fortran, then of course you do have to be constantly aware of the
   underlying implementation anyway.
 
  I don't think this bothers me -- I think it's analogous to things in
  numpy like Fortran order and non-contiguous arrays -- you can ignore all
  that when working in pure python when performance isn't critical, but
  you need a deeper understanding if you want to work with the data in C
  or Fortran or to tune performance in python.
 
  So as long as there is an API to query and control how things work, I
  like that it's hidden from simple python code.
 
  -Chris
 
 
 
  I'm similarly not too concerned about it. Performance seems finicky when
  you're dealing with missing data, since a lot of arrays will likely have
 to
  be copied over to other arrays containing only complete data before being
  handed over to BLAS.

 Unless you know the neutral value for the computation or you just want
 to do a forward_fill in time series, and you have to ask the user not
  to give you an immutable array with NAs if they don't want extra
 copies.

 Josef


Mean value replacement, or more generally single scalar value replacement,
is generally not a good idea. It biases your standard error estimates
downward if you use mean replacement, and it biases both the estimates and
the standard errors if you use anything other than mean replacement. The
bias gets worse with more missing data. So it's worst in precisely the cases
where you'd want to fill in the data the most. (Though I admit I'm not too
familiar with time series, so maybe this doesn't apply. But it's true as a
general principle in statistics.) I'm not sure why we'd want to make this
use case easier.
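
A quick simulation sketch of the effect (not from the call, just to make
the downward bias concrete -- 40% of the values are deleted completely at
random and then replaced by the observed mean):

import numpy as np

np.random.seed(0)
x = np.random.randn(1000)
miss = np.random.rand(1000) < 0.4
obs = x[~miss]

se_complete_case = obs.std(ddof=1) / np.sqrt(obs.size)     # honest SE from observed cases
x_filled = np.where(miss, obs.mean(), x)                   # mean imputation
se_imputed = x_filled.std(ddof=1) / np.sqrt(x_filled.size)

print(se_complete_case, se_imputed)   # the imputed SE is much smaller: n is inflated
                                      # and the sample variance is shrunk toward 0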

-Chris Jordan-Squire



  My primary concern is that the np.NA stuff 'just
  works'. Especially since I've never run into use cases in statistics
 where
  the difference between IGNORE and NA mattered.
 
 
 
 
  --
  Christopher Barker, Ph.D.
  Oceanographer
 
  Emergency Response Division
  NOAA/NOS/ORR(206) 526-6959   voice
  7600 Sand Point Way NE   (206) 526-6329   fax
  Seattle, WA  98115   (206) 526-6317   main reception
 
  chris.bar...@noaa.gov


Re: [Numpy-discussion] NA/Missing Data Conference Call Summary

2011-07-06 Thread Pierre GM

On Jul 6, 2011, at 10:11 PM, Bruce Southey wrote:

 On 07/06/2011 02:38 PM, Christopher Jordan-Squire wrote:
 
 
 On Wed, Jul 6, 2011 at 11:38 AM, Christopher Barker chris.bar...@noaa.gov 
 wrote:
 Christopher Jordan-Squire wrote:
  If we follow those rules for IGNORE for all computations, we sometimes
  get some weird output. For example:
  [ [1, 2], [3, 4] ] * [ IGNORE, 7] = [ 15, 31 ]. (Where * is matrix
  multiply and not * with broadcasting.) Or should that sort of operation
  throw an error?
 
 That should throw an error -- matrix computation is heavily influenced
 by the shape and size of matrices, so I think IGNORES really don't make
 sense there.
 
 
 
 If the IGNORES don't make sense in basic numpy computations then I'm kinda 
 confused why they'd be included at the numpy core level. 
 
  
 Nathaniel Smith wrote:
  It's exactly this transparency that worries Matthew and me -- we feel
  that the alterNEP preserves it, and the NEP attempts to erase it. In
  the NEP, there are two totally different underlying data structures,
  but this difference is blurred at the Python level. The idea is that
  you shouldn't have to think about which you have, but if you work with
  C/Fortran, then of course you do have to be constantly aware of the
  underlying implementation anyway.
 
 I don't think this bothers me -- I think it's analogous to things in
 numpy like Fortran order and non-contiguous arrays -- you can ignore all
 that when working in pure python when performance isn't critical, but
 you need a deeper understanding if you want to work with the data in C
 or Fortran or to tune performance in python.
 
 So as long as there is an API to query and control how things work, I
 like that it's hidden from simple python code.
 
 -Chris
 
 
 
 I'm similarly not too concerned about it. Performance seems finicky when 
 you're dealing with missing data, since a lot of arrays will likely have to 
 be copied over to other arrays containing only complete data before being 
 handed over to BLAS. My primary concern is that the np.NA stuff 'just 
 works'. Especially since I've never run into use cases in statistics where 
 the difference between IGNORE and NA mattered. 
 
 
 Exactly!
 I have not been able to think of an real example where that difference 
 matters as the calculations are only on the 'valid' (ie non-missing and 
 non-masked) values.

In practice, they could be treated the same way (i.e., skipped). However, they
are conceptually different, and one may wish to keep this difference of
information around (between NAs you didn't have and IGNOREs you just dropped
temporarily).




Re: [Numpy-discussion] NA/Missing Data Conference Call Summary

2011-07-06 Thread josef . pktd
On Wed, Jul 6, 2011 at 4:22 PM, Christopher Jordan-Squire
cjord...@uw.edu wrote:


 On Wed, Jul 6, 2011 at 1:08 PM, josef.p...@gmail.com wrote:

 On Wed, Jul 6, 2011 at 3:38 PM, Christopher Jordan-Squire
 cjord...@uw.edu wrote:
 
 
  On Wed, Jul 6, 2011 at 11:38 AM, Christopher Barker
  chris.bar...@noaa.gov
  wrote:
 
  Christopher Jordan-Squire wrote:
   If we follow those rules for IGNORE for all computations, we
   sometimes
   get some weird output. For example:
   [ [1, 2], [3, 4] ] * [ IGNORE, 7] = [ 15, 31 ]. (Where * is matrix
   multiply and not * with broadcasting.) Or should that sort of
   operation
    throw an error?
 
  That should throw an error -- matrix computation is heavily influenced
  by the shape and size of matrices, so I think IGNORES really don't make
  sense there.
 
 
 
  If the IGNORES don't make sense in basic numpy computations then I'm
  kinda
  confused why they'd be included at the numpy core level.
 
 
  Nathaniel Smith wrote:
   It's exactly this transparency that worries Matthew and me -- we feel
   that the alterNEP preserves it, and the NEP attempts to erase it. In
   the NEP, there are two totally different underlying data structures,
   but this difference is blurred at the Python level. The idea is that
   you shouldn't have to think about which you have, but if you work
   with
   C/Fortran, then of course you do have to be constantly aware of the
   underlying implementation anyway.
 
  I don't think this bothers me -- I think it's analogous to things in
  numpy like Fortran order and non-contiguous arrays -- you can ignore
  all
  that when working in pure python when performance isn't critical, but
  you need a deeper understanding if you want to work with the data in C
  or Fortran or to tune performance in python.
 
  So as long as there is an API to query and control how things work, I
  like that it's hidden from simple python code.
 
  -Chris
 
 
 
  I'm similarly not too concerned about it. Performance seems finicky when
  you're dealing with missing data, since a lot of arrays will likely have
  to
  be copied over to other arrays containing only complete data before
  being
  handed over to BLAS.

 Unless you know the neutral value for the computation or you just want
 to do a forward_fill in time series, and you have to ask the user not
  to give you an immutable array with NAs if they don't want extra
 copies.

 Josef


 Mean value replacement, or more generally single scalar value replacement,
 is generally not a good idea. It biases downward your standard error
 estimates if you use mean replacement, and it will bias both if you use
 anything other than mean replacement. The bias gets worse with more
 missing data. So it's worst in precisely the cases where you'd want to
 fill in the data the most. (Though I admit I'm not too familiar with time
 series, so maybe this doesn't apply. But it's true as a general principle in
 statistics.) I'm not sure why we'd want to make this use case easier.

We just discussed a use case for pandas on the statsmodels mailing
list: minute data of stock quotes (prices) where, if the quote is NA, you
fill it with the last price quote. If necessary for memory
usage and performance, this can be handled efficiently and with
minimal copying.
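
For the record, a forward fill like that doesn't even need a Python loop;
a rough numpy-only sketch (with NaN standing in for NA, and assuming the
first quote is present):

import numpy as np

quotes = np.array([10.1, np.nan, np.nan, 10.3, np.nan, 10.2])
idx = np.where(~np.isnan(quotes), np.arange(quotes.size), 0)
np.maximum.accumulate(idx, out=idx)   # index of the most recent observed quote
filled = quotes[idx]                  # [10.1, 10.1, 10.1, 10.3, 10.3, 10.2]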

If you want to fill in a missing value without messing up any result
statistics, then there is a large literature in statistics on
imputation: repeatedly assigning values to an NA from an underlying
distribution. scipy/statsmodels doesn't have anything like this (yet),
but R and the others have it available, and it looks more popular in
bio-statistics.

(But similar to what Dag said, for statistical analysis it will be
necessary to keep case specific masks and data arrays around. I
haven't actually written any missing values algorithm yet, so I'm
quite again.)

Josef

 -Chris Jordan-Squire


  My primary concern is that the np.NA stuff 'just
  works'. Especially since I've never run into use cases in statistics
  where
  the difference between IGNORE and NA mattered.
 
 
 
 
  --
  Christopher Barker, Ph.D.
  Oceanographer
 
  Emergency Response Division
  NOAA/NOS/ORR            (206) 526-6959   voice
  7600 Sand Point Way NE   (206) 526-6329   fax
  Seattle, WA  98115       (206) 526-6317   main reception
 
  chris.bar...@noaa.gov

Re: [Numpy-discussion] NA/Missing Data Conference Call Summary

2011-07-06 Thread josef . pktd
On Wed, Jul 6, 2011 at 4:38 PM,  josef.p...@gmail.com wrote:
 On Wed, Jul 6, 2011 at 4:22 PM, Christopher Jordan-Squire
 cjord...@uw.edu wrote:


 On Wed, Jul 6, 2011 at 1:08 PM, josef.p...@gmail.com wrote:

 On Wed, Jul 6, 2011 at 3:38 PM, Christopher Jordan-Squire
 cjord...@uw.edu wrote:
 
 
  On Wed, Jul 6, 2011 at 11:38 AM, Christopher Barker
  chris.bar...@noaa.gov
  wrote:
 
  Christopher Jordan-Squire wrote:
   If we follow those rules for IGNORE for all computations, we
   sometimes
   get some weird output. For example:
   [ [1, 2], [3, 4] ] * [ IGNORE, 7] = [ 15, 31 ]. (Where * is matrix
   multiply and not * with broadcasting.) Or should that sort of
   operation
    throw an error?
 
  That should throw an error -- matrix computation is heavily influenced
  by the shape and size of matrices, so I think IGNORES really don't make
  sense there.
 
 
 
  If the IGNORES don't make sense in basic numpy computations then I'm
  kinda
  confused why they'd be included at the numpy core level.
 
 
  Nathaniel Smith wrote:
   It's exactly this transparency that worries Matthew and me -- we feel
   that the alterNEP preserves it, and the NEP attempts to erase it. In
   the NEP, there are two totally different underlying data structures,
   but this difference is blurred at the Python level. The idea is that
   you shouldn't have to think about which you have, but if you work
   with
   C/Fortran, then of course you do have to be constantly aware of the
   underlying implementation anyway.
 
  I don't think this bothers me -- I think it's analogous to things in
  numpy like Fortran order and non-contiguous arrays -- you can ignore
  all
  that when working in pure python when performance isn't critical, but
  you need a deeper understanding if you want to work with the data in C
  or Fortran or to tune performance in python.
 
  So as long as there is an API to query and control how things work, I
  like that it's hidden from simple python code.
 
  -Chris
 
 
 
  I'm similarly not too concerned about it. Performance seems finicky when
  you're dealing with missing data, since a lot of arrays will likely have
  to
  be copied over to other arrays containing only complete data before
  being
  handed over to BLAS.

 Unless you know the neutral value for the computation or you just want
 to do a forward_fill in time series, and you have to ask the user not
  to give you an immutable array with NAs if they don't want extra
 copies.

 Josef


 Mean value replacement, or more generally single scalar value replacement,
 is generally not a good idea. It biases downward your standard error
 estimates if you use mean replacement, and it will bias both if you use
 anything other than mean replacement. The bias gets worse with more
 missing data. So it's worst in precisely the cases where you'd want to
 fill in the data the most. (Though I admit I'm not too familiar with time
 series, so maybe this doesn't apply. But it's true as a general principle in
 statistics.) I'm not sure why we'd want to make this use case easier.

Another qualification on this (I cannot help it):
I think this only applies if you use a prefabricated no-missing-values
algorithm. If I write it myself, I can do the proper correction for
the reduced number of observations. (Similar to the case when we
ignore correlated information and use statistics based on uncorrelated
observations, which also overestimates the amount of information we have
available.)

Josef


 We just discussed a use case for pandas on the statsmodels mailing
 list, minute data of stock quotes (prices), if the quote is NA then
 fill it with the last price quote. If it would be necessary for memory
 usage and performance, this can be handled efficiently and with
 minimal copying.

 If you want to fill in a missing value without messing up any result
 statistics, then there is a large literature in statistics on
 imputations, repeatedly assigning values to a NA from an underlying
 distribution. scipy/statsmodels doesn't have anything like this (yet)
 but R and the others have it available, and it looks more popular in
 bio-statistics.

 (But similar to what Dag said, for statistical analysis it will be
 necessary to keep case specific masks and data arrays around. I
 haven't actually written any missing values algorithm yet, so I'm
 quite again.)

 Josef

 -Chris Jordan-Squire


  My primary concern is that the np.NA stuff 'just
  works'. Especially since I've never run into use cases in statistics
  where
  the difference between IGNORE and NA mattered.
 
 
 
 
  --
  Christopher Barker, Ph.D.
  Oceanographer
 
  Emergency Response Division
  NOAA/NOS/ORR            (206) 526-6959   voice
  7600 Sand Point Way NE   (206) 526-6329   fax
  Seattle, WA  98115       (206) 526-6317   main reception
 
  chris.bar...@noaa.gov

Re: [Numpy-discussion] NA/Missing Data Conference Call Summary

2011-07-06 Thread Neal Becker
Christopher Barker wrote:

 Dag Sverre Seljebotn wrote:
 Here's an HPC perspective...:
 
 At least I feel that the transparency of NumPy is a huge part of its
 current success. Many more than me spend half their time in C/Fortran
 and half their time in Python.
 
 Absolutely -- and this point has been raised a couple times in the
 discussion, so I hope it is not forgotten.
 
I tend to look at NumPy this way: Assuming you have some data in memory
 (possibly loaded by a C or Fortran library). (Almost) no matter how it
 is allocated, ordered, packed, aligned -- there's a way to find strides
 and dtypes to put a nice NumPy wrapper around it and use the memory from
 Python.
 
 and vice-versa -- Assuming you have some data in numpy arrays, there's a
 way to process it with a C or Fortran library without copying the data.
 
 And this is where I am skeptical of the bit-pattern idea -- while one
 can expect C and fortran and GPU, and ??? to understand NaNs for
 floating point data, is there any support in compilers or hardware for
 special bit patterns for NA values to integers? I've never seen in my
 (very limited experience).
 
 Maybe having the mask option, too, will make that irrelevant, but I want
 to be clear about that kind of use case.
 
 -Chris

Am I the only one that finds the idea of special values of things like int[1]
to have special meanings to be really ugly?

[1] which already have defined behavior over their entire domain of bit patterns



Re: [Numpy-discussion] NA/Missing Data Conference Call Summary

2011-07-06 Thread Charles R Harris
On Wed, Jul 6, 2011 at 2:53 PM, Neal Becker ndbeck...@gmail.com wrote:

 Christopher Barker wrote:

  Dag Sverre Seljebotn wrote:
  Here's an HPC perspective...:
 
  At least I feel that the transparency of NumPy is a huge part of its
  current success. Many more than me spend half their time in C/Fortran
  and half their time in Python.
 
  Absolutely -- and this point has been raised a couple times in the
  discussion, so I hope it is not forgotten.
 
 I tend to look at NumPy this way: Assuming you have some data in
 memory
  (possibly loaded by a C or Fortran library). (Almost) no matter how it
  is allocated, ordered, packed, aligned -- there's a way to find strides
  and dtypes to put a nice NumPy wrapper around it and use the memory from
  Python.
 
  and vice-versa -- Assuming you have some data in numpy arrays, there's a
  way to process it with a C or Fortran library without copying the data.
 
  And this is where I am skeptical of the bit-pattern idea -- while one
  can expect C and fortran and GPU, and ??? to understand NaNs for
  floating point data, is there any support in compilers or hardware for
  special bit patterns for NA values to integers? I've never seen in my
  (very limited experience).
 
  Maybe having the mask option, too, will make that irrelevant, but I want
  to be clear about that kind of use case.
 
  -Chris

 Am I the only one that finds the idea of special values of things like
 int[1] to
 have special meanings to be really ugly?

 [1] which already have defined behavior over their entire domain of bit
 patterns


Umm, no, I find it ugly also. On the other hand, it is a useful artifact
left to us by the ancients and solves a lot of problems. So in the absence
of anything more standardized...

Chuck


Re: [Numpy-discussion] NA/Missing Data Conference Call Summary

2011-07-06 Thread Bruce Southey
On 07/06/2011 03:37 PM, Pierre GM wrote:
 On Jul 6, 2011, at 10:11 PM, Bruce Southey wrote:

 On 07/06/2011 02:38 PM, Christopher Jordan-Squire wrote:

 On Wed, Jul 6, 2011 at 11:38 AM, Christopher Barkerchris.bar...@noaa.gov  
 wrote:
 Christopher Jordan-Squire wrote:
 If we follow those rules for IGNORE for all computations, we sometimes
 get some weird output. For example:
 [ [1, 2], [3, 4] ] * [ IGNORE, 7] = [ 15, 31 ]. (Where * is matrix
 multiply and not * with broadcasting.) Or should that sort of operation
  throw an error?
 That should throw an error -- matrix computation is heavily influenced
 by the shape and size of matrices, so I think IGNORES really don't make
 sense there.



 If the IGNORES don't make sense in basic numpy computations then I'm kinda 
 confused why they'd be included at the numpy core level.


 Nathaniel Smith wrote:
 It's exactly this transparency that worries Matthew and me -- we feel
 that the alterNEP preserves it, and the NEP attempts to erase it. In
 the NEP, there are two totally different underlying data structures,
 but this difference is blurred at the Python level. The idea is that
 you shouldn't have to think about which you have, but if you work with
 C/Fortran, then of course you do have to be constantly aware of the
 underlying implementation anyway.
 I don't think this bothers me -- I think it's analogous to things in
 numpy like Fortran order and non-contiguous arrays -- you can ignore all
 that when working in pure python when performance isn't critical, but
 you need a deeper understanding if you want to work with the data in C
 or Fortran or to tune performance in python.

 So as long as there is an API to query and control how things work, I
 like that it's hidden from simple python code.

 -Chris



 I'm similarly not too concerned about it. Performance seems finicky when 
 you're dealing with missing data, since a lot of arrays will likely have to 
 be copied over to other arrays containing only complete data before being 
 handed over to BLAS. My primary concern is that the np.NA stuff 'just 
 works'. Especially since I've never run into use cases in statistics where 
 the difference between IGNORE and NA mattered.


 Exactly!
 I have not been able to think of a real example where that difference
 matters, as the calculations are only on the 'valid' (i.e. non-missing and
 non-masked) values.
 In practice, they could be treated the same way (i.e., skipped). However, they
 are conceptually different, and one may wish to keep this difference of
 information around (between NAs you never had and IGNOREs you just dropped
 temporarily).


I have yet to see these as *conceptually different* in any of the 
arguments given.

Separate NAs or IGNOREs, or any number of missing value codes, just
require us to avoid 'unmasking' those missing value codes in your
array since, I presume, like masked arrays you need some placeholder values.

Bruce





Re: [Numpy-discussion] NA/Missing Data Conference Call Summary

2011-07-06 Thread Christopher Jordan-Squire
On Wed, Jul 6, 2011 at 3:47 PM, josef.p...@gmail.com wrote:

 On Wed, Jul 6, 2011 at 4:38 PM,  josef.p...@gmail.com wrote:
  On Wed, Jul 6, 2011 at 4:22 PM, Christopher Jordan-Squire
  cjord...@uw.edu wrote:
 
 
  On Wed, Jul 6, 2011 at 1:08 PM, josef.p...@gmail.com wrote:
 
  On Wed, Jul 6, 2011 at 3:38 PM, Christopher Jordan-Squire
  cjord...@uw.edu wrote:
  
  
   On Wed, Jul 6, 2011 at 11:38 AM, Christopher Barker
   chris.bar...@noaa.gov
   wrote:
  
   Christopher Jordan-Squire wrote:
If we follow those rules for IGNORE for all computations, we
sometimes
get some weird output. For example:
[ [1, 2], [3, 4] ] * [ IGNORE, 7] = [ 15, 31 ]. (Where * is matrix
multiply and not * with broadcasting.) Or should that sort of
operation
 throw an error?
  
   That should throw an error -- matrix computation is heavily
 influenced
   by the shape and size of matrices, so I think IGNORES really don't
 make
   sense there.
  
  
  
   If the IGNORES don't make sense in basic numpy computations then I'm
   kinda
   confused why they'd be included at the numpy core level.
  
  
   Nathaniel Smith wrote:
It's exactly this transparency that worries Matthew and me -- we
 feel
that the alterNEP preserves it, and the NEP attempts to erase it.
 In
the NEP, there are two totally different underlying data
 structures,
but this difference is blurred at the Python level. The idea is
 that
you shouldn't have to think about which you have, but if you work
with
C/Fortran, then of course you do have to be constantly aware of
 the
underlying implementation anyway.
  
   I don't think this bothers me -- I think it's analogous to things in
   numpy like Fortran order and non-contiguous arrays -- you can ignore
   all
   that when working in pure python when performance isn't critical,
 but
   you need a deeper understanding if you want to work with the data in
 C
   or Fortran or to tune performance in python.
  
   So as long as there is an API to query and control how things work,
 I
   like that it's hidden from simple python code.
  
   -Chris
  
  
  
   I'm similarly not too concerned about it. Performance seems finicky
 when
   you're dealing with missing data, since a lot of arrays will likely
 have
   to
   be copied over to other arrays containing only complete data before
   being
   handed over to BLAS.
 
  Unless you know the neutral value for the computation or you just want
  to do a forward_fill in time series, and you have to ask the user not
  to give you an immutable array with NAs if they don't want extra
  copies.
 
  Josef
 
 
  Mean value replacement, or more generally single scalar value replacement,
  is generally not a good idea. It biases your standard error estimates
  downward if you use mean replacement, and it will bias both your estimates
  and your standard errors if you use anything other than mean replacement.
  The bias gets worse with more missing data. So it's worst in precisely the
  cases where you'd want to fill in the data the most. (Though I admit I'm not
  too familiar with time series, so maybe this doesn't apply. But it's true as
  a general principle in statistics.) I'm not sure why we'd want to make this
  use case easier.

 Another qualification on this (I cannot help it).
 I think this only applies if you use a prefabricated no-missing-values
 algorithm. If I write it myself, I can do the proper correction for
 the reduced number of observations. (similar to the case when we
 ignore correlated information and use statistics based on uncorrelated
 observations which also overestimate the amount of information we have
 available.)


Can you do that sort of technique with longitudinal (panel) data? I'm
honestly curious because I haven't looked into such corrections before. I
haven't been able to find a reference after a few quick google searches. I
don't suppose you know one off the top of your head?

And you're right about the last measurement carried forward. I was just
thinking about filling in all missing values with the same value.
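
(For what it's worth, last-observation-carried-forward is a one-liner in
pandas; the quote data below is made up purely for illustration.)

    import numpy as np
    import pandas as pd

    # Hypothetical minute bars with gaps where no trade occurred.
    quotes = pd.Series([101.2, np.nan, np.nan, 101.5, np.nan],
                       index=pd.date_range("2011-07-06 09:30", periods=5,
                                           freq="min"))
    filled = quotes.ffill()   # carry the last observed price forward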

-Chris Jordan-Squire

PS--Thanks for mentioning the statsmodels discussion. I'd been keeping track
of that on a different email account, and I hadn't realized it wasn't
forwarding those messages correctly.




 Josef



  We just discussed a use case for pandas on the statsmodels mailing
  list, minute data of stock quotes (prices), if the quote is NA then
  fill it with the last price quote. If it would be necessary for memory
  usage and performance, this can be handled efficiently and with
  minimal copying.
 
  If you want to fill in a missing value without messing up any result
  statistics, then there is a large literature in statistics on
  imputations, repeatedly assigning values to a NA from an underlying
  distribution. scipy/statsmodels doesn't have anything like this (yet)
  but R and the others have it available, and it looks more popular in
  bio-statistics.
 
  (But similar to what Dag said, for statistical 

Re: [Numpy-discussion] NA/Missing Data Conference Call Summary

2011-07-06 Thread josef . pktd
On Wed, Jul 6, 2011 at 7:14 PM, Christopher Jordan-Squire
cjord...@uw.edu wrote:


 snip

 Can you do that sort of technique with longitudinal (panel) data? I'm
 honestly curious because I haven't looked into such corrections before. I
 haven't been able to find a reference after a few quick google searches. I
 don't suppose you know one off the top of your head?

I was thinking mainly of simple cases where the correction only
requires counting the number of observations correctly in order to
adjust the degrees of freedom. For example, statistical tests that are
based on relatively simple statistics, or ANOVA, which just needs a
correct count of the number of observations by group. (This might
be partially covered by any NA ufunc implementation that does mean,
var and cov correctly, and maybe sorting like the current NaN sort.)
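
(A minimal sketch of the kind of correction I mean, using NaN as a stand-in
for NA; the numbers are made up.)

    import numpy as np

    x = np.array([2.0, 3.5, np.nan, 4.0, np.nan, 5.5])
    obs = ~np.isnan(x)
    n = obs.sum()                  # count only the observed values
    mean = x[obs].mean()
    var = x[obs].var(ddof=1)       # denominator n - 1, not len(x) - 1
    se_mean = np.sqrt(var / n)     # the standard error uses the reduced n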

In the panel data case it might be possible to do this, if it can just
be treated like an unbalanced panel. I guess it depends on the details
of the model.

For regression, one way to remove an observation is to include a dummy
variable for that observation, or use X'X with rows zeroed out. R has
a package for multivariate normal with missing values that allows
calculation of expected values for the missing ones.

But in many of these cases, getting a clean 

Re: [Numpy-discussion] NA/Missing Data Conference Call Summary

2011-07-06 Thread Skipper Seabold
On Wed, Jul 6, 2011 at 7:14 PM, Christopher Jordan-Squire
cjord...@uw.edu wrote:
 On Wed, Jul 6, 2011 at 3:47 PM, josef.p...@gmail.com wrote:
 On Wed, Jul 6, 2011 at 4:38 PM,  josef.p...@gmail.com wrote:
  On Wed, Jul 6, 2011 at 4:22 PM, Christopher Jordan-Squire
snip

 Can you do that sort of technique with longitudinal (panel) data? I'm
 honestly curious because I haven't looked into such corrections before. I
 haven't been able to find a reference after a few quick google searches. I
 don't suppose you know one off the top of your head?
 And you're right about the last measurement carried forward. I was just
 thinking about filling in all missing values with the same value.
 -Chris Jordan-Squire
 PS--Thanks for mentioning the statsmodels discussion. I'd been keeping track
 of that on a different email account, and I hadn't realized it wasn't
 forwarding those messages correctly.


Maybe a bit OT, but I've seen people doing imputation using Bayesian
MCMC or multiple imputation for missing values in panel data. Google
'data augmentation' or 'multiple imputation'. I haven't looked much
into the details yet, but it's definitely not mean replacement.

FWIW (I haven't been following the discussion closely), there is a
distinction in statistics between ignorable and nonignorable missing
data, but I can't think of a situation where I would need this at the
computational level rather than relying on a (numerically comparable)
missing data type(s) a la SAS/Stata. I've also found the odd examples
of IGNORE without a clear answer to be scary.

Skipper


[Numpy-discussion] NA/Missing Data Conference Call Summary

2011-07-05 Thread Christopher Jordan-Squire
Here's a short-ish summary of the topics discussed in the conference call
this afternoon. WARNING: I try to give examples for everything discussed to
make it as concrete as possible. However, most of the examples were not
explicitly discussed during the conference. I apologize in advance if I
mischaracterize anyone's arguments, and please jump in to correct me if I
did.

Participants: Travis Oliphant, Mark Wiebe, Matthew Brett, Nathaniel Smith,
Pierre GM, Ben Root, Chuck Harris, Wes McKinney, Chris Jordan-Squire

First, areas of broad agreement:
*There should be more functionality for missing data
*There should be dtypes which support missing data ('parameterized dtypes'
in the current NEP)
*Adding a 'where' semantic to ufuncs
*Have the same data with different sets of missing elements in different
views
*Easy for non-expert numpy users

Since Mark is only around Austin until early August, there's also broad
agreement that we need to get something done quickly. However, the numpy
community (and Travis in particular) is balancing this against the
possibility of a sub-optimal solution which can't be taken back.

BIT PATTERN  MASK IMPLEMENTATIONS FOR NA
--

The current NEP proposes both mask and bit pattern implementations for
missing data. I use the terms bit pattern and parameterized dtype
interchangeably, since the parameterized dtype will use a bit pattern for
its implementation. The two implementations will support the same
functionality with respect to NA, and the implementation details will be
largely invisible to the user. Their differences are in the 'extra' features
each supports.

Two common questions were:
1. Why make two implementations of missing data: one with masks and the
other with parameterized dtypes?
2. Why does the implementation using masks have higher priority?

The answers are:
1.  The mask implementation is more general and easier to implement and
maintain.  The bit pattern implementation saves memory, makes
interoperability easier, and makes ABI (Application Binary Interface)
compatibility easier. Since each has different strengths, the argument is
both should be implemented.
2. The implementation for the parameterized dtypes will rely on the
implementation using a mask.
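
(A rough illustration of the memory trade-off mentioned in answer 1, using NaN
as a stand-in for a bit pattern; none of this is the proposed API.)

    import numpy as np

    n = 1000000
    data = np.random.rand(n)

    # Mask implementation: the payload array plus a separate boolean mask.
    mask = np.zeros(n, dtype=bool)
    print(data.nbytes + mask.nbytes)   # 8 MB of data + 1 MB of mask

    # Bit-pattern implementation: missingness lives inside the same 8 bytes, so
    # no extra memory is needed and the buffer can be handed to C/Fortran code
    # unchanged.
    data_bp = data.copy()
    data_bp[::1000] = np.nan
    print(data_bp.nbytes)              # 8 MB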


NA VS. IGNORE
-

A lot of discussion centered on IGNORE vs. NA types. We take IGNORE in the
aNEP sense and NA in the NEP sense. With NA, there is a clear notion of how NA
propagates through all basic numpy operations.  (e.g., 3+NA=NA and log(NA) =
NA, while NA | True = True.) IGNORE is separate from NA, with different
interpretations depending on the use case.
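
(A toy sketch of those propagation rules, just to pin the semantics down; this
is not the NEP's actual NA implementation.)

    class _NAType(object):
        # Toy stand-in for an NA scalar -- illustration only.
        def __repr__(self):
            return "NA"
        def __add__(self, other):
            return self                # 3 + NA -> NA
        __radd__ = __mul__ = __rmul__ = __add__
        def __or__(self, other):
            # NA | True = True: the result is known whatever the missing value was.
            return True if other is True else self
        __ror__ = __or__

    NA = _NAType()
    print(3 + NA)       # NA
    print(NA | True)    # True
    print(NA | False)   # NA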

IGNORE could mean:
1. Data that is being temporarily ignored. e.g., a possible outlier that is
temporarily being removed from consideration.
2. Data that cannot exist. e.g., a matrix representing a grid of water
depths for a lake. Since the lake isn't square, some entries will represent
land, and so depth will be a meaningless concept for those entries.
3. Using IGNORE to signal a jagged array. e.g., [ [1, 2, IGNORE], [IGNORE,
3, 4] ] should behave exactly the same as [ [1 , 2] , [3 , 4] ]. Though this
leaves open how [1, 2, IGNORE] + [3 , 4] should behave.

Because of these different uses of IGNORE, it doesn't have as clear a
theoretical interpretation as NA. (For instance, what is IGNORE+3, IGNORE*3,
or IGNORE | True?)
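
(For comparison, this is roughly what today's numpy.ma already does in those
cases -- a reference point, not a statement about what the NEP will do.)

    import numpy as np

    a = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False])
    print(a.sum())            # 4.0 -- reductions simply skip the masked entry
    print(a + 3)              # [4.0 -- 6.0] -- the masked element stays masked
    print(np.ma.masked * 3)   # -- (masked): a masked scalar absorbs arithmetic

    # For a matrix product, numpy.ma makes the choice explicit:
    m = np.ma.array([[1.0, 2.0], [3.0, 4.0]])
    v = np.ma.array([[0.0], [7.0]], mask=[[True], [False]])
    print(np.ma.dot(m, v, strict=False))  # masked entries treated as 0
    print(np.ma.dot(m, v, strict=True))   # masked entries propagate into the result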

But several of the discussants thought the use cases for IGNORE were very
compelling. Specifically, they wanted to be able to use IGNORE's and NA's
simultaneously while still being able to differentiate between them. So, for
example, being able to designate some data as IGNORE while still able to
determine which data was NA but not IGNORE. The current NEP does not allow
for this directly, although in some cases it can be done indirectly via
views. (By taking a view of the original data, expanding the values which
are considered NA in the view, and then comparing with the original data to
see if the NA is in the original or not.) Since both are possible in this
sense, Mark's NEP makes it so IGNORE is allowed but isn't the default.

Another important point from the current NEP is that not being able to
access values considered missing, even if the implementation of missingness
is via a mask, is a feature and not a bug. It is a feature because if the
data is missing then, conceptually, neither the user nor any function the
user calls should be able to obtain that data. This is precisely why the
indirect route, via views of the original data, is required to access data
that a different view says is missing.
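
(A sketch of that indirect route using today's numpy.ma as an analogy; the
NEP's mask-view mechanics would differ in detail.)

    import numpy as np

    base = np.arange(5.0)

    # Two masked views over the same buffer, each hiding a different subset.
    a = np.ma.array(base, mask=[False, True, False, False, False], copy=False)
    b = np.ma.array(base, mask=[False, False, False, True, True], copy=False)

    base[0] = 99.0
    print(a[0], b[0])   # 99.0 99.0 -- one payload, two notions of missing

    # What a hides is still reachable through b (or through base itself).
    print(b[1])         # 1.0, even though a[1] is masked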

The current NEP treats all NA's the same. The reasoning is that, regardless
of where the NA originated, the functions the numpy array is fed into will
either ignore all NA's or propagate them (i.e. not ignore them). These two
different behaviors are chosen when passed into a ufunc by setting the
skipna ufunc 

Re: [Numpy-discussion] NA/Missing Data Conference Call Summary

2011-07-05 Thread Benjamin Root
Thanks for these notes.  Just a couple of thoughts as I looked over these
notes.

On Tue, Jul 5, 2011 at 6:46 PM, Christopher Jordan-Squire
cjord...@uw.edu wrote:



3. Using IGNORE to signal a jagged array. e.g., [ [1, 2, IGNORE], [IGNORE,
 3, 4] ] should behave exactly the same as [ [1 , 2] , [3 , 4] ]. Though this
 leaves open how [1, 2, IGNORE] + [3 , 4] should behave.


I don't think there is any confusion about that particular case. Even when
using the IGNORE semantics, numpy broadcasting rules are still in play.
This particular case should throw an exception.



 Because of these different uses of IGNORE, it doesn't have as clear a
 theoretical interpretation as NA. (For instance, what is IGNORE+3, IGNORE*3,
 or IGNORE | True?)


I think we were more referring to matrix operations like dot products.
Element-by-element operations should still behave the same as NA.  Scalar
operations should return IGNORE.

HOW DOES THIS RELATE TO THE CURRENT MASKED ARRAY?

 

 Everyone seems to agree they'd love it if this could encompass all current
 use cases of the numpy.ma arrays, so numpy.ma arrays could be deprecated.
 (However they wouldn't be eliminated for several years, even in the most
 optimistic scenarios.)


This is going to be a very tricky thing to handle and it is going to require
coordination and agreement among many of the third-party toolkits like
scipy and matplotlib.


In addition to these notes (unless I missed it), Nathaniel pointed out that
with the ufunc where= parameter feature and the ufunc wrapper, we have the
potential to greatly improve the codebase of numpy.ma as it stands,
potentially mitigating the need to move more of numpy.ma into the core
and letting us focus more on NA.  While I am not 100% on board with this idea, I can
definitely see the potential for this path.
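
(For reference, a minimal sketch of the where= semantics under discussion, in
the form ufuncs eventually grew; the arrays are made up.)

    import numpy as np

    a = np.array([1.0, 2.0, 3.0, 4.0])
    valid = np.array([True, False, True, True])

    out = np.zeros_like(a)
    np.add(a, 10.0, out=out, where=valid)   # where=False elements are left untouched
    print(out)                              # [11.  0. 13. 14.]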

Thanks everybody for the productive chat!
Ben Root