Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
Hi,

Just for reference, I am using this as the latest version of the NEP - I hope it's current: https://github.com/m-paradox/numpy/blob/7b10c9ab1616b9100e98dd2ab80cef639d5b5735/doc/neps/missing-data.rst

I'm mostly relaying stuff I said, although generally (please do correct me if I am wrong) I am just re-expressing points that Nathaniel has already made in the alterNEP text and the emails.

On Wed, Jul 6, 2011 at 12:46 AM, Christopher Jordan-Squire cjord...@uw.edu wrote:
...
Since Mark is only around Austin until early August, there's also broad agreement that we need to get something done quickly.

I think I might have missed that part of the discussion :)

I feel the need to emphasize the centrality of the assertion by Nathaniel, and agreement by (at least) me, that the NA case (there really is no data) and the IGNORE case (there is data but I'm concealing it from you) are conceptually different, and come from different use-cases.

The underlying disagreement returned many times to this fundamental difference between the NEP and alterNEP: in the NEP - by design - it is impossible to distinguish between na.NA and na.IGNORE; the alterNEP insists you should be able to distinguish.

Mark says something like "it's all missing data, there's no reason you should want to distinguish". Nathaniel and I were saying the two types of missing do have different use-cases, and it should be possible to distinguish. You might want to choose to treat them the same, but you should be able to see what they are.

I returned several times to this (original point by Nathaniel):

a[3] = np.NA

(What does this mean? Am I altering the underlying array, or a mask? How would I explain this to someone?)

We confirmed that, in order to make it difficult to know what your NA is (masked or bit-pattern), Mark has to a) hinder access to the data below the mask and b) prevent direct API access to the masking array.
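The two readings of that assignment can be made concrete with today's tools -- np.NA itself does not exist in released NumPy, so this sketch stands in a mask (via numpy.ma) and a bit pattern (via NaN) for it:

```python
import numpy as np
import numpy.ma as ma

# Mask reading: assignment hides the value, but the data survives underneath.
a = ma.array([0.0, 1.0, 2.0, 7.0, 4.0])
a[3] = ma.masked
print(a.data[3])   # 7.0 -- still recoverable below the mask

# Bit-pattern reading: assignment destroys the value (NaN as a stand-in).
b = np.array([0.0, 1.0, 2.0, 7.0, 4.0])
b[3] = np.nan
print(np.isnan(b[3]))   # True -- the original 7.0 is gone
```

Whether a reader of ``a[3] = np.NA`` can tell which of these two things happened is exactly the point of disagreement.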
I described this as 'hobbling the API' and Mark thought of it as 'generic programming' (missing is always missing). I asserted that explaining NA to people would be easier if ``a[3] = np.NA`` were direct assignment and altered the array.

BIT PATTERN & MASK IMPLEMENTATIONS FOR NA
-----------------------------------------

The current NEP proposes both mask and bit pattern implementations for missing data. I use the terms "bit pattern" and "parameterized dtype" interchangeably, since the parameterized dtype will use a bit pattern for its implementation. The two implementations will support the same functionality with respect to NA, and the implementation details will be largely invisible to the user. Their differences are in the 'extra' features each supports.

Two common questions were:

1. Why make two implementations of missing data: one with masks and the other with parameterized dtypes?
2. Why does the implementation using masks have higher priority?

The answers are:

1. The mask implementation is more general and easier to implement and maintain. The bit pattern implementation saves memory, makes interoperability easier, and makes ABI (Application Binary Interface) compatibility easier. Since each has different strengths, the argument is that both should be implemented.
2. The implementation for the parameterized dtypes will rely on the implementation using a mask.

NA VS. IGNORE
-------------

A lot of discussion centered on IGNORE vs. NA types. We take IGNORE in the aNEP sense and NA in the NEP sense. With NA, there is a clear notion of how NA propagates through all basic numpy operations (e.g., 3 + NA = NA and log(NA) = NA, while NA | True = True). IGNORE is separate from NA, with different interpretations depending on the use case. IGNORE could mean:

1. Data that is being temporarily ignored, e.g., a possible outlier that is temporarily being removed from consideration.
2. Data that cannot exist, e.g., a matrix representing a grid of water depths for a lake.
Since the lake isn't square, some entries will represent land, and so depth will be a meaningless concept for those entries.
3. Using IGNORE to signal a jagged array, e.g., [ [1, 2, IGNORE], [IGNORE, 3, 4] ] should behave exactly the same as [ [1, 2], [3, 4] ]. Though this leaves open how [1, 2, IGNORE] + [3, 4] should behave.

Because of these different uses of IGNORE, it doesn't have as clear a theoretical interpretation as NA. (For instance, what is IGNORE + 3, IGNORE * 3, or IGNORE | True?)

I don't remember this bit of the discussion, but I see from current masked arrays that IGNORE is treated as the identity, so:

IGNORE + 3 = 3
IGNORE * 3 = 3

But several of the discussants thought the use cases for IGNORE were very compelling. Specifically, they wanted to be able to use IGNOREs and NAs simultaneously while still being able to differentiate between them. So, for example, being able to designate some data as IGNORE while still able to determine
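The identity behaviour referred to above shows up in today's numpy.ma when reductions skip masked entries -- a minimal sketch:

```python
import numpy.ma as ma

# One entry masked out ("ignored"); reductions simply skip it.
a = ma.array([1, 2, 3], mask=[False, True, False])
print(a.sum())    # 4 -- the masked 2 behaves like the additive identity 0
print(a.prod())   # 3 -- and like the multiplicative identity 1
```

Note this identity behaviour is a property of the reductions; elementwise operations on a masked element in numpy.ma just produce another masked element.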
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
On 07/06/2011 02:05 PM, Matthew Brett wrote:
[Matthew's summary, quoted in full above, trimmed]
I described this as 'hobbling the API' and Mark thought of it as 'generic programming' (missing is always missing).

Here's an HPC perspective...:

If you, say, want to off-load array processing with a mask to some code running on a GPU, you really can't have the GPU go through some NumPy API. Or if you want to implement a masked array on a cluster with MPI, you similarly really, really want raw access. At least I feel that the transparency of NumPy is a huge part of its current success. Many more than me spend half their time in C/Fortran and half their time in Python.

I tend to look at NumPy this way: assuming you have some data in memory (possibly loaded by a C or Fortran library), (almost) no matter how it is allocated, ordered, packed, aligned -- there's a way to find strides and dtypes to put a nice NumPy wrapper around it and use the memory from Python.

So, my view on Mark's NEP was: with a reasonable amount of flexibility in how you decided to implement masking for your data, you can create a NumPy wrapper that will understand that. Whether your Fortran library exposes NAs in its 40GB buffer as bit patterns, or using a separate mask, both will work. And IMO Mark's NEP comes rather close to this; you just need an additional NEP later to pin down the raw implementation details, once those are settled :-)

Dag Sverre
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
On 07/06/2011 02:27 PM, Dag Sverre Seljebotn wrote:
On 07/06/2011 02:05 PM, Matthew Brett wrote:
[Matthew's summary, quoted in full above, trimmed]
[Dag's HPC comments, quoted in full above, trimmed]

To be concrete, I'm thinking something like a custom extension to PEP 3118, which could also allow efficient access from Cython without hard-coding Cython for NumPy (a GSoC project this summer will continue to move us away from the np.ndarray[int] syntax to a more generic int[:] that's less tied to NumPy). But first things first!
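Dag's "find strides and dtypes to wrap any memory" point can be sketched with today's buffer machinery; here a ctypes array stands in for a buffer filled by a C/Fortran library, and a second buffer plays the role of a separate mask:

```python
import ctypes
import numpy as np

# Stand-in for a 2x3 Fortran-ordered buffer filled by a C/Fortran library.
buf = (ctypes.c_double * 6)(1, 2, 3, 4, 5, 6)

# Zero-copy wrap: choose dtype, shape and order to match the library's layout.
data = np.frombuffer(buf, dtype=np.float64).reshape((2, 3), order='F')

# A separate mask buffer from the same library can be wrapped the same way.
mbuf = (ctypes.c_uint8 * 6)(0, 1, 0, 0, 0, 0)
mask = np.frombuffer(mbuf, dtype=np.uint8).reshape((2, 3), order='F')

print(data[1, 0])        # 2.0 -- column-major, so (1, 0) is the second element
print(bool(mask[1, 0]))  # True -- that element is flagged as missing
```

The point is that neither wrapper copied anything: the same raw memory is visible to Python and to the external library, which is the transparency being argued for.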
Dag Sverre
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
On Wed, Jul 6, 2011 at 5:05 AM, Matthew Brett matthew.br...@gmail.com wrote:
[Matthew's summary, quoted in full above, trimmed]

Since Mark is only around Austin until early August, there's also broad agreement that we need to get something done quickly.

I think I might have missed that part of the discussion :)

I think that might have been mentioned by Travis right before he had to leave for another meeting, which might have been after you'd disconnected. Travis' concern as a member of the numpy community is the desire for something that is broadly applicable and adopted. But as Mark's employer, his concern is to get a more complete and coherent missing data functionality implemented in numpy while Mark is still at Enthought, for use in the problems Enthought and statisticians commonly encounter if nothing else.
[remainder of Matthew's summary, quoted in full above, trimmed]
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
Christopher Jordan-Squire wrote:
Here's a short-ish summary of the topics discussed in the conference call this afternoon.

Thanks, this is great! And thanks to all who participated in the call.

3. Using IGNORE to signal a jagged array. e.g., [ [1, 2, IGNORE], [IGNORE, 3, 4] ] should behave exactly the same as [ [1, 2], [3, 4] ].

whoooa! I actually have been looking for, and thinking about, jagged arrays a fair bit lately, so this is kind of exciting, but this looks like a bad idea to me. The above indicates that:

a = np.array( [ [1, 2, np.IGNORE], [np.IGNORE, 3, 4] ] )

a[:,1] would yield:

array([2, 4])

which seems really wrong -- you've tossed out the location information altogether. (I think it should be: array([2, 3]).) I could see a jagged array being represented by IGNOREs all at the END of each row, but putting items in the middle and shifting things to the left strikes me as a plain old bad idea (and a pain to implement).

-Chris

--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/ORR (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
chris.bar...@noaa.gov
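The array([2, 3]) result Chris expects is what a mask-style representation gives today; expressing the (hypothetical) IGNORE entries as a numpy.ma mask:

```python
import numpy.ma as ma

# The IGNORE positions expressed as a mask (np.IGNORE itself does not exist).
a = ma.array([[1, 2, 9], [9, 3, 4]],
             mask=[[False, False, True], [True, False, False]])
print(a[:, 1])   # [2 3] -- location information is preserved
```

So the disagreement is specifically with the jagged-array reading that shifts values leftward, not with masking as such.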
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
Hi,

On Wed, Jul 6, 2011 at 6:54 PM, Christopher Jordan-Squire cjord...@uw.edu wrote:
[earlier exchange, quoted in full above, trimmed]

I think that might have been mentioned by Travis right before he had to leave for another meeting, which might have been after you'd disconnected. Travis' concern as a member of the numpy community is the desire for something that is broadly applicable and adopted. But as Mark's employer, his concern is to get a more complete and coherent missing data functionality implemented in numpy while Mark is still at Enthought, for use in the problems Enthought and statisticians commonly encounter if nothing else.

Sorry - yes - I wasn't there for all the conversation. Of course (not disagreeing), we must take care to get the API right, because it's unlikely to change and we will be explaining and supporting it for a long time to come.
[remainder of Matthew's summary, quoted in full above, trimmed]
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
Dag Sverre Seljebotn wrote:
Here's an HPC perspective...: At least I feel that the transparency of NumPy is a huge part of its current success. Many more than me spend half their time in C/Fortran and half their time in Python.

Absolutely -- and this point has been raised a couple of times in the discussion, so I hope it is not forgotten.

I tend to look at NumPy this way: Assuming you have some data in memory (possibly loaded by a C or Fortran library). (Almost) no matter how it is allocated, ordered, packed, aligned -- there's a way to find strides and dtypes to put a nice NumPy wrapper around it and use the memory from Python.

and vice-versa -- assuming you have some data in numpy arrays, there's a way to process it with a C or Fortran library without copying the data.

And this is where I am skeptical of the bit-pattern idea -- while one can expect C and Fortran and GPU, and ???, to understand NaNs for floating point data, is there any support in compilers or hardware for special bit patterns for NA values for integers? I've never seen it in my (very limited) experience. Maybe having the mask option, too, will make that irrelevant, but I want to be clear about that kind of use case.

-Chris
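The asymmetry Chris describes can be shown directly: floats get a hardware-understood missing pattern for free, while for integers any sentinel (the minimum value is used here purely as an illustrative convention) is invisible to hardware and compilers:

```python
import numpy as np

# Floats: IEEE 754 reserves bit patterns (NaN) that C, Fortran and GPUs
# all understand natively.
f = np.array([1.0, np.nan, 3.0])
print(np.isnan(f))   # [False  True False]

# Integers: every int64 bit pattern is a valid number, so a "missing"
# sentinel is only a convention that external code must agree on --
# the hardware gives it no special meaning.
sentinel = np.iinfo(np.int64).min
i = np.array([1, sentinel, 3], dtype=np.int64)
print(i == sentinel)   # [False  True False]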
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
Christopher Jordan-Squire wrote:
If we follow those rules for IGNORE for all computations, we sometimes get some weird output. For example: [ [1, 2], [3, 4] ] * [ IGNORE, 7 ] = [ 15, 31 ]. (Where * is matrix multiply and not * with broadcasting.) Or should that sort of operation throw an error?

That should throw an error -- matrix computation is heavily influenced by the shape and size of matrices, so I think IGNOREs really don't make sense there.

Nathaniel Smith wrote:
It's exactly this transparency that worries Matthew and me -- we feel that the alterNEP preserves it, and the NEP attempts to erase it. In the NEP, there are two totally different underlying data structures, but this difference is blurred at the Python level. The idea is that you shouldn't have to think about which you have, but if you work with C/Fortran, then of course you do have to be constantly aware of the underlying implementation anyway.

I don't think this bothers me -- I think it's analogous to things in numpy like Fortran order and non-contiguous arrays -- you can ignore all that when working in pure python when performance isn't critical, but you need a deeper understanding if you want to work with the data in C or Fortran or to tune performance in python. So as long as there is an API to query and control how things work, I like that it's hidden from simple python code.

-Chris
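The [ 15, 31 ] result above follows from one possible reading -- IGNORE acting as the multiplicative identity inside the dot product. A hypothetical sketch (IGNORE is not a real NumPy object) reproduces it, and illustrates why one might prefer an error:

```python
# Hypothetical IGNORE sentinel; not a real NumPy object.
IGNORE = object()

def mul(x, y):
    # Treat IGNORE as the multiplicative identity: IGNORE * x == x.
    if x is IGNORE:
        return y
    if y is IGNORE:
        return x
    return x * y

A = [[1, 2], [3, 4]]
v = [IGNORE, 7]
result = [sum(mul(a_ij, v_j) for a_ij, v_j in zip(row, v)) for row in A]
print(result)   # [15, 31]
```

Each row silently loses a term's worth of shape information (1 and 3 pass through unmultiplied), which is exactly the "weird output" being objected to.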
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
On Wed, Jul 6, 2011 at 11:38 AM, Christopher Barker chris.bar...@noaa.gov wrote:
That should throw an error -- matrix computation is heavily influenced by the shape and size of matrices, so I think IGNOREs really don't make sense there.

If the IGNOREs don't make sense in basic numpy computations then I'm kinda confused why they'd be included at the numpy core level.

So as long as there is an API to query and control how things work, I like that it's hidden from simple python code.

I'm similarly not too concerned about it. Performance seems finicky when you're dealing with missing data, since a lot of arrays will likely have to be copied over to other arrays containing only complete data before being handed over to BLAS. My primary concern is that the np.NA stuff 'just works'.
Especially since I've never run into use cases in statistics where the difference between IGNORE and NA mattered.
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
On Wed, Jul 6, 2011 at 3:38 PM, Christopher Jordan-Squire cjord...@uw.edu wrote:
[earlier exchange, quoted in full above, trimmed]

I'm similarly not too concerned about it. Performance seems finicky when you're dealing with missing data, since a lot of arrays will likely have to be copied over to other arrays containing only complete data before being handed over to BLAS.
Unless you know the neutral value for the computation, or you just want to do a forward_fill in time series; and you have to ask the user not to give you an immutable array with NAs if they don't want extra copies.

Josef
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
On 07/06/2011 02:38 PM, Christopher Jordan-Squire wrote: snip My primary concern is that the np.NA stuff 'just works'. Especially since I've never run into use cases in statistics where the difference between IGNORE and NA mattered.

Exactly! I have not been able to think of a real example where that difference matters, as the calculations are only on the 'valid' (ie non-missing and non-masked) values. Bruce
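The two semantics under discussion -- NA propagating through a reduction versus the 'valid'-values-only computation Bruce describes -- can be sketched with nan standing in for NA, using nan-aware functions NumPy later gained (np.nanmean post-dates this thread):

```python
import numpy as np

x = np.array([1.0, np.nan, 3.0])  # nan standing in for a missing value

print(np.mean(x))     # nan -- propagating semantics: any NA poisons the result
print(np.nanmean(x))  # 2.0 -- skipping semantics: compute on the 'valid' values only
```

Whichever of NA or IGNORE is in the array, a reduction ultimately has to pick one of these two behaviors.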
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
On Wed, Jul 6, 2011 at 1:08 PM, josef.p...@gmail.com wrote: On Wed, Jul 6, 2011 at 3:38 PM, Christopher Jordan-Squire cjord...@uw.edu wrote: snip I'm similarly not too concerned about it. Performance seems finicky when you're dealing with missing data, since a lot of arrays will likely have to be copied over to other arrays containing only complete data before being handed over to BLAS.
Unless you know the neutral value for the computation or you just want to do a forward_fill in time series, and you have to ask the user not to give you an immutable array with NAs if they don't want extra copies. Josef

Mean value replacement, or more generally single scalar value replacement, is generally not a good idea. It biases downward your standard error estimates if you use mean replacement, and it will bias both if you use anything other than mean replacement. The bias gets worse with more missing data. So it's worst in precisely the cases where you'd want to fill in the data the most. (Though I admit I'm not too familiar with time series, so maybe this doesn't apply. But it's true as a general principle in statistics.) I'm not sure why we'd want to make this use case easier. -Chris Jordan-Squire
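The downward bias from mean replacement is easy to check numerically. A hedged sketch on synthetic data (the variable names are mine): the filled sample packs 30% of its entries exactly at the sample mean, so its spread, and hence any standard error built from it, shrinks.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)           # the "complete" sample

xm = x.copy()
xm[:300] = np.nan                   # knock out 30% of the observations
filled = np.where(np.isnan(xm), np.nanmean(xm), xm)  # mean replacement

# The mean-filled sample has artificially low spread, biasing standard
# errors downward -- worse the more data is missing.
print(x.std(ddof=1), filled.std(ddof=1))
```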
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
On Jul 6, 2011, at 10:11 PM, Bruce Southey wrote: On 07/06/2011 02:38 PM, Christopher Jordan-Squire wrote: snip My primary concern is that the np.NA stuff 'just works'.
Especially since I've never run into use cases in statistics where the difference between IGNORE and NA mattered.

Exactly! I have not been able to think of a real example where that difference matters, as the calculations are only on the 'valid' (ie non-missing and non-masked) values.

In practice, they could be treated the same way (ie, skipped). However, they are conceptually different, and one may wish to keep this difference of information around (between NAs you never had and IGNOREs you just dropped temporarily).
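Pierre's point -- identical treatment in reductions, but a preserved distinction -- can be sketched with today's tools, using nan for NA and the mask for IGNORE (purely illustrative; the NEP itself would not expose things this way):

```python
import numpy as np

# nan plays the NA role (the value never existed); the mask plays IGNORE
# (a real value temporarily hidden).
x = np.ma.masked_array([1.0, np.nan, 3.0, 4.0],
                       mask=[False, False, False, True])

print(np.isnan(x.data))    # where the NAs are
print(np.asarray(x.mask))  # where the IGNOREs are
print(x.data[3])           # 4.0 -- the IGNOREd value is still recoverable; an NA is not
```

Both kinds of entry would be skipped by a masked reduction, yet the information about which is which survives.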
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
On Wed, Jul 6, 2011 at 4:22 PM, Christopher Jordan-Squire cjord...@uw.edu wrote: On Wed, Jul 6, 2011 at 1:08 PM, josef.p...@gmail.com wrote: snip I'm similarly not too concerned about it.
snip Mean value replacement, or more generally single scalar value replacement, is generally not a good idea. snip I'm not sure why we'd want to make this use case easier.

We just discussed a use case for pandas on the statsmodels mailing list: minute data of stock quotes (prices); if the quote is NA, then fill it with the last price quote. If it were necessary for memory usage and performance, this could be handled efficiently and with minimal copying.

If you want to fill in a missing value without messing up any result statistics, then there is a large literature in statistics on imputation, repeatedly assigning values to an NA from an underlying distribution. scipy/statsmodels doesn't have anything like this (yet) but R and the others have it available, and it looks more popular in bio-statistics. (But similar to what Dag said, for statistical analysis it will be necessary to keep case-specific masks and data arrays around. I haven't actually written any missing values algorithm yet, so I'm quite again.) Josef

My primary concern is that the np.NA stuff 'just works'.
Especially since I've never run into use cases in statistics where the difference between IGNORE and NA mattered.
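Josef's stock-quote use case -- fill each NA with the last observed price -- can be sketched in plain NumPy. The `ffill` helper and its name are my own (nan again standing in for NA):

```python
import numpy as np

def ffill(a):
    """Replace each nan with the most recent non-nan value (illustrative helper)."""
    a = np.asarray(a, dtype=float)
    # Index of each position's last valid observation, via a running maximum.
    idx = np.where(~np.isnan(a), np.arange(a.size), 0)
    np.maximum.accumulate(idx, out=idx)
    return a[idx]

quotes = [10.0, np.nan, np.nan, 12.5, np.nan]
print(ffill(quotes))  # [10.  10.  10.  12.5 12.5]
```

Note this copies the data, which is exactly the memory/performance concern raised earlier in the thread.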
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
On Wed, Jul 6, 2011 at 4:38 PM, josef.p...@gmail.com wrote: On Wed, Jul 6, 2011 at 4:22 PM, Christopher Jordan-Squire cjord...@uw.edu wrote: snip I'm similarly not too concerned about it.
snip Mean value replacement, or more generally single scalar value replacement, is generally not a good idea. snip I'm not sure why we'd want to make this use case easier.

Another qualification on this (I cannot help it). I think this only applies if you use a prefabricated no-missing-values algorithm. If I write it myself, I can do the proper correction for the reduced number of observations. (Similar to the case when we ignore correlated information and use statistics based on uncorrelated observations, which also overestimates the amount of information we have available.) Josef
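The "proper correction" josef describes is, in the simplest cases, just counting the valid observations and adjusting the degrees of freedom accordingly. A minimal sketch with nan as NA (values are mine):

```python
import numpy as np

x = np.array([2.0, np.nan, 4.0, 6.0, np.nan])

n = int(np.count_nonzero(~np.isnan(x)))      # 3 valid observations, not 5
mean = np.nansum(x) / n                      # 4.0
var = np.nansum((x - mean) ** 2) / (n - 1)   # 4.0 -- unbiased, with df = n - 1

print(n, mean, var)  # 3 4.0 4.0
```

A prefabricated routine that divided by the full length (or by the full length minus one) would understate the variance; doing it by hand lets you use the reduced count.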
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
Christopher Barker wrote: Dag Sverre Seljebotn wrote: Here's an HPC perspective...: At least I feel that the transparency of NumPy is a huge part of its current success. Many more than me spend half their time in C/Fortran and half their time in Python.

Absolutely -- and this point has been raised a couple times in the discussion, so I hope it is not forgotten. I tend to look at NumPy this way: assuming you have some data in memory (possibly loaded by a C or Fortran library), (almost) no matter how it is allocated, ordered, packed, or aligned -- there's a way to find strides and dtypes to put a nice NumPy wrapper around it and use the memory from Python. And vice versa -- assuming you have some data in numpy arrays, there's a way to process it with a C or Fortran library without copying the data.

And this is where I am skeptical of the bit-pattern idea -- while one can expect C and Fortran and GPUs, and ???, to understand NaNs for floating point data, is there any support in compilers or hardware for special bit patterns for NA values in integers? I've never seen it in my (very limited) experience. Maybe having the mask option, too, will make that irrelevant, but I want to be clear about that kind of use case. -Chris

Am I the only one that finds the idea of special values of things like int[1] having special meanings to be really ugly?

[1] which already have defined behavior over their entire domain of bit patterns
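For context on what an integer bit-pattern NA looks like in practice: R reserves INT_MIN as its NA_integer_ sentinel. A sketch of that convention in NumPy (the constant name is mine, and nothing enforces it):

```python
import numpy as np

# R-style convention: reserve the smallest int32 as the NA bit pattern.
NA_INT32 = np.int32(-2**31)

a = np.array([1, NA_INT32, 3], dtype=np.int32)
valid = a != NA_INT32

print(int(a[valid].sum()))  # 4 -- correct only because we masked by hand;
# raw arithmetic on `a` would silently fold the sentinel into the result,
# which is exactly the ugliness Neal points out above.
```

Unlike NaN for floats, no compiler or ALU knows about this sentinel, so every operation must check for it explicitly -- Chris's concern about C/Fortran interoperability.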
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
On Wed, Jul 6, 2011 at 2:53 PM, Neal Becker ndbeck...@gmail.com wrote: snip Am I the only one that finds the idea of special values of things like int[1] having special meanings to be really ugly? [1] which already have defined behavior over their entire domain of bit patterns

Umm, no, I find it ugly also. On the other hand, it is a useful artifact left to us by the ancients and solves a lot of problems. So in the absence of anything more standardized... Chuck
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
On 07/06/2011 03:37 PM, Pierre GM wrote: On Jul 6, 2011, at 10:11 PM, Bruce Southey wrote: On 07/06/2011 02:38 PM, Christopher Jordan-Squire wrote: snip
My primary concern is that the np.NA stuff 'just works'. Especially since I've never run into use cases in statistics where the difference between IGNORE and NA mattered.

Exactly! I have not been able to think of a real example where that difference matters, as the calculations are only on the 'valid' (ie non-missing and non-masked) values.

In practice, they could be treated the same way (ie, skipped). However, they are conceptually different, and one may wish to keep this difference of information around (between NAs you never had and IGNOREs you just dropped temporarily).

I have yet to see these as *conceptually different* in any of the arguments given. Separate NAs or IGNOREs or any number of missing value codes just requires us to avoid 'unmasking' those missing value codes in your array, as, I presume like masked arrays, you need some placeholder values. Bruce
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
On Wed, Jul 6, 2011 at 3:47 PM, josef.p...@gmail.com wrote: On Wed, Jul 6, 2011 at 4:38 PM, josef.p...@gmail.com wrote: snip I'm similarly not too concerned about it.
snip Another qualification on this (I cannot help it). I think this only applies if you use a prefabricated no-missing-values algorithm. If I write it myself, I can do the proper correction for the reduced number of observations.

Can you do that sort of technique with longitudinal (panel) data? I'm honestly curious because I haven't looked into such corrections before. I haven't been able to find a reference after a few quick google searches. I don't suppose you know one off the top of your head?

And you're right about the last measurement carried forward. I was just thinking about filling in all missing values with the same value. -Chris Jordan-Squire

PS--Thanks for mentioning the statsmodels discussion.
I'd been keeping track of that on a different email account, and I hadn't realized it wasn't forwarding those messages correctly.
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
On Wed, Jul 6, 2011 at 7:14 PM, Christopher Jordan-Squire cjord...@uw.edu wrote: snip Can you do that sort of technique with longitudinal (panel) data? I'm honestly curious because I haven't looked into such corrections before. I haven't been able to find a reference after a few quick google searches. I don't suppose you know one off the top of your head?
I was thinking mainly of simple cases where the correction only requires correctly counting the number of observations in order to adjust the degrees of freedom. For example, statistical tests that are based on relatively simple statistics, or ANOVA, which just needs a correct count of the number of observations by groups. (This might be partially covered by any NA ufunc implementation that does mean, var and cov correctly, and maybe sorting like the current NaN sort.)

In the panel data case it might be possible to do this, if it can just be treated like an unbalanced panel. I guess it depends on the details of the model. For regression, one way to remove an observation is to include a dummy variable for that observation, or use X'X with rows zeroed out. R has a package for multivariate normal with missing values that allows calculation of expected values for the missing ones. But in many of these cases, getting a clean
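[A minimal sketch, not from the thread, of the kind of correction josef describes: compute the statistic over only the observed values and adjust the degrees of freedom by the actual observation count. Here None marks a missing value, and the helper name is hypothetical.]

```python
def var_ignoring_missing(values):
    """Sample variance over observed entries only, with the
    degrees-of-freedom denominator (n - 1) based on the actual
    number of observations. None marks a missing value."""
    obs = [v for v in values if v is not None]
    n = len(obs)
    if n < 2:
        raise ValueError("need at least two observed values")
    mean = sum(obs) / n
    return sum((v - mean) ** 2 for v in obs) / (n - 1)

data = [1.0, None, 2.0, 3.0, None, 4.0, 5.0]
# Same result as the variance of [1, 2, 3, 4, 5]: 2.5
print(var_ignoring_missing(data))
```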
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
On Wed, Jul 6, 2011 at 7:14 PM, Christopher Jordan-Squire cjord...@uw.edu wrote: snip

And you're right about the last measurement carried forward. I was just thinking about filling in all missing values with the same value.
-Chris Jordan-Squire

PS--Thanks for mentioning the statsmodels discussion. I'd been keeping track of that on a different email account, and I hadn't realized it wasn't forwarding those messages correctly.
Maybe a bit OT, but I've seen people doing imputation using Bayesian MCMC or multiple imputation for missing values in panel data. Google 'data augmentation' or 'multiple imputation'. I haven't looked much into the details yet, but it's definitely not mean replacement.

FWIW (I haven't been following the discussion closely), there is a distinction in statistics between ignorable and nonignorable missing data, but I can't think of a situation where I would need this at the computational level rather than relying on a (numerically comparable) missing data type(s) a la SAS/Stata. I've also found the odd examples of IGNORE without a clear answer to be scary.

Skipper

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
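[The downward bias from mean replacement that Chris describes above is easy to demonstrate numerically; this sketch is not from the thread. Every imputed point sits exactly at the sample mean, so the filled-in data look artificially less dispersed than the observed data.]

```python
observed = [1.0, 2.0, 3.0, 4.0, 5.0]   # the values we actually saw
n_missing = 5                          # five more values are missing

mean = sum(observed) / len(observed)   # 3.0

def sample_var(xs):
    """Ordinary sample variance with denominator n - 1."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

var_observed = sample_var(observed)              # 2.5
filled = observed + [mean] * n_missing           # mean replacement
var_filled = sample_var(filled)                  # 10/9, biased downward

print(var_observed, var_filled)
```

The filled-in variance (about 1.11) is less than half the complete-case estimate (2.5), which is exactly the shrunken-standard-error problem.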
[Numpy-discussion] NA/Missing Data Conference Call Summary
Here's a short-ish summary of the topics discussed in the conference call this afternoon.

WARNING: I try to give examples for everything discussed to make it as concrete as possible. However, most of the examples were not explicitly discussed during the conference. I apologize in advance if I mischaracterize anyone's arguments, and please jump in to correct me if I did.

Participants: Travis Oliphant, Mark Wiebe, Matthew Brett, Nathaniel Smith, Pierre GM, Ben Root, Chuck Harris, Wes McKinney, Chris Jordan-Squire

First, areas of broad agreement:
*There should be more functionality for missing data
*There should be dtypes which support missing data ('parameterized dtypes' in the current NEP)
*Adding a 'where' semantic to ufuncs
*Have the same data with different sets of missing elements in different views
*Easy for non-expert numpy users

Since Mark is only around Austin until early August, there's also broad agreement that we need to get something done quickly. However, the numpy community (and Travis in particular) is balancing this against the possibility of a sub-optimal solution which can't be taken back.

BIT PATTERN AND MASK IMPLEMENTATIONS FOR NA
--
The current NEP proposes both mask and bit pattern implementations for missing data. I use the terms bit pattern and parameterized dtype interchangeably, since the parameterized dtype will use a bit pattern for its implementation. The two implementations will support the same functionality with respect to NA, and the implementation details will be largely invisible to the user. Their differences are in the 'extra' features each supports.

Two common questions were:
1. Why make two implementations of missing data: one with masks and the other with parameterized dtypes?
2. Why does the implementation using masks have higher priority?

The answers are:
1. The mask implementation is more general and easier to implement and maintain.
The bit pattern implementation saves memory, makes interoperability easier, and makes ABI (Application Binary Interface) compatibility easier. Since each has different strengths, the argument is that both should be implemented.
2. The implementation for the parameterized dtypes will rely on the implementation using a mask.

NA VS. IGNORE
--
A lot of discussion centered on IGNORE vs. NA types. We take IGNORE in the aNEP sense and NA in the NEP sense. With NA, there is a clear notion of how NA propagates through all basic numpy operations. (e.g., 3 + NA = NA and log(NA) = NA, while NA | True = True.) IGNORE is separate from NA, with different interpretations depending on the use case. IGNORE could mean:
1. Data that is being temporarily ignored. e.g., a possible outlier that is temporarily being removed from consideration.
2. Data that cannot exist. e.g., a matrix representing a grid of water depths for a lake. Since the lake isn't square, some entries will represent land, and so depth will be a meaningless concept for those entries.
3. Using IGNORE to signal a jagged array. e.g., [ [1, 2, IGNORE], [IGNORE, 3, 4] ] should behave exactly the same as [ [1, 2], [3, 4] ]. Though this leaves open how [1, 2, IGNORE] + [3, 4] should behave.

Because of these different uses of IGNORE, it doesn't have as clear a theoretical interpretation as NA. (For instance, what is IGNORE + 3, IGNORE * 3, or IGNORE | True?) But several of the discussants thought the use cases for IGNORE were very compelling. Specifically, they wanted to be able to use IGNOREs and NAs simultaneously while still being able to differentiate between them. So, for example, being able to designate some data as IGNORE while still being able to determine which data was NA but not IGNORE. The current NEP does not allow for this directly, although in some cases it can be done indirectly via views.
(By taking a view of the original data, expanding the values which are considered NA in the view, and then comparing with the original data to see if the NA is in the original or not.) Since both are possible in this sense, Mark's NEP makes it so IGNORE is allowed but isn't the default.

Another important point from the current NEP is that not being able to access values considered missing, even if the implementation of missingness is via a mask, is a feature and not a bug. It is a feature because if the data is missing then, conceptually, neither the user nor any function the user calls should be able to obtain that data. This is precisely why the indirect route, via views of the original data, is required to access data that a different view says is missing.

The current NEP treats all NAs the same. The reasoning is that, regardless of where the NA originated, the functions the numpy array is fed into will either ignore all NAs or propagate them (i.e. not ignore them). These two different behaviors are chosen when passing the array into a ufunc, by setting the skipna ufunc parameter.
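[The propagation rules and the two reduction behaviors described in the summary can be sketched in pure Python. This is only an illustration of the semantics under discussion, not the NEP's implementation; the `_NA` class and `na_sum` helper here are hypothetical.]

```python
class _NA:
    """Singleton modeling NEP-style NA propagation."""
    def __repr__(self):
        return "NA"
    # Arithmetic with NA propagates NA: 3 + NA = NA, NA * x = NA.
    def __add__(self, other): return self
    __radd__ = __add__
    def __mul__(self, other): return self
    __rmul__ = __mul__
    # Logical 'or' can escape NA: NA | True = True, because the
    # answer is True no matter what the missing value was.
    def __or__(self, other):
        return True if other is True else self
    __ror__ = __or__

NA = _NA()

def na_sum(values, skipna=False):
    """Reduction with the NEP's two behaviors: propagate NA
    (the default) or skip NA entries (skipna=True)."""
    if skipna:
        return sum(v for v in values if v is not NA)
    total = 0
    for v in values:
        total = total + v   # any NA makes the whole result NA
    return total

print(3 + NA)                           # NA
print(NA | True)                        # True
print(na_sum([1, NA, 2]))               # NA
print(na_sum([1, NA, 2], skipna=True))  # 3
```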
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
Thanks for these notes. Just a couple of thoughts as I looked over these notes.

On Tue, Jul 5, 2011 at 6:46 PM, Christopher Jordan-Squire cjord...@uw.edu wrote:
3. Using IGNORE to signal a jagged array. e.g., [ [1, 2, IGNORE], [IGNORE, 3, 4] ] should behave exactly the same as [ [1, 2], [3, 4] ]. Though this leaves open how [1, 2, IGNORE] + [3, 4] should behave.

I don't think there is any confusion about that particular case. Even when using the IGNORE semantics, numpy broadcasting rules are still in play. This particular case should throw an exception.

Because of these different uses of IGNORE, it doesn't have as clear a theoretical interpretation as NA. (For instance, what is IGNORE + 3, IGNORE * 3, or IGNORE | True?)

I think we were more referring to matrix operations like dot products. Element-by-element operations should still behave the same as NA. Scalar operations should return IGNORE.

HOW DOES THIS RELATE TO THE CURRENT MASKED ARRAY?
Everyone seems to agree they'd love it if this could encompass all current use cases of the numpy.ma arrays, so numpy.ma arrays could be deprecated. (However, they wouldn't be eliminated for several years, even in the most optimistic scenarios.)

This is going to be a very tricky thing to handle, and it is going to require coordination and agreement among many of the third-party toolkits like scipy and matplotlib.

In addition to these notes (unless I missed it), Nathaniel pointed out that with the ufunc where= parameter feature and the ufunc wrapper, we have the potential to greatly improve the codebase of numpy.ma as it stands, potentially mitigating the need to move more of numpy.ma into the core and letting us focus more on NA. While I am not 100% on board with this idea, I can definitely see the potential for this path.

Thanks everybody for the productive chat!
Ben Root
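[Editor's illustration, not from the thread: the where= ufunc semantic discussed above did eventually land in NumPy. Assuming a reasonably modern NumPy, the behavior looks like this; note that elements where the condition is False are left as whatever `out` already holds, so `out` must be initialized explicitly.]

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([10.0, 20.0, 30.0, 40.0])
mask = np.array([True, False, True, False])  # compute only where True

# Positions where mask is False are left untouched in `out`,
# so fill it with a recognizable sentinel first.
out = np.full_like(a, -1.0)
np.add(a, b, out=out, where=mask)
print(out)  # [11. -1. 33. -1.]
```

This is the building block Nathaniel suggested could replace much of numpy.ma's per-function masking logic: the mask stays separate, and the ufunc simply never touches the masked-out slots.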