Re: [Numpy-discussion] Moving lib.recfunctions?
Pierre GM pgmdevlist at gmail.com writes:

Hello, The idea behind having a lib.recfunctions and not a rec.recfunctions or whatever was to illustrate that the functions of this package are more generic than they appear. They work with regular structured ndarrays and don't need recarrays. Methinks we're gonna lose this aspect if you try to rename it, but hey, your call.

I've never really thought there's much distinction between the two - AFAICT a recarray is just a structured array with attribute access? If a function only accepts a recarray (are there any?) isn't it just a simple call to .view(np.recarray) to get it to work with structured arrays? Because of this view I've always thought functions which work on either should be grouped together.

As to why they were never really advertised? Because I never received any feedback when I started developing them (developing is a big word here, I just took a lot of code that John D Hunter had developed in matplotlib and made it more consistent). I advertised them once or twice on the list, wrote the basic docstrings, but waited for other people to start using them. Anyhow. So, yes, there might be some weird imports to polish. Note that if you decided to just rename the package and leave it where it was, it would probably be easier.

The path of least resistance is to just import lib.recfunctions.* into the (already crowded) main numpy namespace and be done with it.

Why? Why can't you leave it available through numpy.lib? Once again, if it's only a matter of PRing, you could start writing an entry page in the doc describing the functions; that would improve the visibility.

I do recall them being advertised a while ago, but when I came to look for them I couldn't find them - IMHO np.rec is a much more intuitive (and nicer/shorter) namespace than np.lib.recfunctions. I think having similar functionality in two completely different namespaces is confusing and hard to remember.
It also doesn't help that np.lib.recfunctions isn't discoverable by tab-completion:

In [2]: np.lib.rec
np.lib.recfromcsv    np.lib.recfromtxt

...of course you could probably find it with np.lookfor but it's one more barrier to their use. FWIW I'd be happy if the np.lib.recfunctions functions were made available in the np.rec namespace (and possibly deprecate np.lib.recfunctions to avoid confusion?) I'm conscious that as a user (not a developer) talk is cheap and I'm happy with whatever the consensus is. I just thought I'd pipe up since it was only through this thread that I re-discovered np.lib.recfunctions! HTH, Dave ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
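The point about `.view(np.recarray)` above can be sketched in a few lines: a recarray is just a structured array plus attribute access, and the view copies no data (values below are illustrative).

```python
import numpy as np

# A plain structured array: fields are accessed by key.
a = np.array([(1, 2.0), (3, 4.0)], dtype=[('x', int), ('y', float)])
print(a['x'])          # field access via indexing

# Viewing it as a recarray adds attribute access; no data is copied.
r = a.view(np.recarray)
print(r.x)             # same data, attribute-style access
```

Since the two share memory, a function written for structured arrays works on a recarray unchanged, and vice versa after a one-line view.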
Re: [Numpy-discussion] Current status of 64 bit windows support.
On Fri, Jul 1, 2011 at 7:23 PM, Sturla Molden stu...@molden.no wrote: On 01.07.2011 19:22, Charles R Harris wrote:

Just curious as to what folks know about the current status of the free Windows 64-bit compilers. I know things were dicey with gcc and gfortran some two years ago, but... well, two years have passed.

This Windows 7 SDK is free (as in beer). It is the C compiler used to build Python on Windows 64. Here is the download: http://www.microsoft.com/download/en/details.aspx?displaylang=enid=3138 A newer version of the Windows SDK will use a C compiler that links with a different CRT than Python uses. Use version 3.5. When using this compiler, remember to set the environment variable DISTUTILS_USE_SDK. This should be sufficient to build NumPy. AFAIK only SciPy requires a Fortran compiler. MinGW is still not stable on Windows 64. There are supposedly compatibility issues between the MinGW runtime used by libgfortran and Python's CRT. While there are experimental MinGW builds for Windows 64 (e.g. TDM-GCC), we will probably need to build libgfortran against another C runtime for SciPy. A commercial Fortran compiler compatible with MSVC is recommended for SciPy, e.g. Intel, Absoft or Portland. Sturla

So it sounds like we're getting closer to having official NumPy 1.6.x binaries for 64-bit Windows (using the Windows 7 SDK), but not quite there yet? What is the roadblock? I would guess from the comments on Christoph Gohlke's page that the issue is having something that will work with SciPy... see http://www.lfd.uci.edu/~gohlke/pythonlibs/ I'm interested from the point of view of third-party libraries using NumPy, where we have had users asking for 64-bit installers. We need an official NumPy installer to build against. Regards, Peter
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
Hi, Just for reference, I am using this as the latest version of the NEP - I hope it's current: https://github.com/m-paradox/numpy/blob/7b10c9ab1616b9100e98dd2ab80cef639d5b5735/doc/neps/missing-data.rst I'm mostly relaying stuff I said, although generally (please do correct me if I am wrong) I am just re-expressing points that Nathaniel has already made in the alterNEP text and the emails.

On Wed, Jul 6, 2011 at 12:46 AM, Christopher Jordan-Squire cjord...@uw.edu wrote: ... Since Mark is only around Austin until early August, there's also broad agreement that we need to get something done quickly.

I think I might have missed that part of the discussion :) I feel the need to emphasize the centrality of the assertion by Nathaniel, and agreement by (at least) me, that the NA case (there really is no data) and the IGNORE case (there is data but I'm concealing it from you) are conceptually different, and come from different use-cases. The underlying disagreement returned many times to this fundamental difference between the NEP and alterNEP: In the NEP - by design - it is impossible to distinguish between na.NA and na.IGNORE. The alterNEP insists you should be able to distinguish. Mark says something like "it's all missing data, there's no reason you should want to distinguish". Nathaniel and I were saying the two types of missing do have different use-cases, and it should be possible to distinguish. You might want to choose to treat them the same, but you should be able to see what they are. I returned several times to this (original point by Nathaniel): a[3] = np.NA (what does this mean? Am I altering the underlying array, or a mask? How would I explain this to someone?) We confirmed that, in order to make it difficult to know what your NA is (masked or bit-pattern), Mark has to a) hinder access to the data below the mask and b) prevent direct API access to the masking array.
I described this as 'hobbling the API' and Mark thought of it as 'generic programming' (missing is always missing). I asserted that explaining NA to people would be easier if ``a[3] = np.NA`` was direct assignment and altered the array.

BIT PATTERN AND MASK IMPLEMENTATIONS FOR NA
-------------------------------------------

The current NEP proposes both mask and bit pattern implementations for missing data. I use the terms bit pattern and parameterized dtype interchangeably, since the parameterized dtype will use a bit pattern for its implementation. The two implementations will support the same functionality with respect to NA, and the implementation details will be largely invisible to the user. Their differences are in the 'extra' features each supports. Two common questions were:

1. Why make two implementations of missing data: one with masks and the other with parameterized dtypes?
2. Why does the implementation using masks have higher priority?

The answers are:

1. The mask implementation is more general and easier to implement and maintain. The bit pattern implementation saves memory, makes interoperability easier, and makes ABI (Application Binary Interface) compatibility easier. Since each has different strengths, the argument is that both should be implemented.
2. The implementation for the parameterized dtypes will rely on the implementation using a mask.

NA VS. IGNORE
-------------

A lot of discussion centered on IGNORE vs. NA types. We take IGNORE in the aNEP sense and NA in the NEP sense. With NA, there is a clear notion of how NA propagates through all basic numpy operations (e.g., 3 + NA = NA and log(NA) = NA, while NA | True = True). IGNORE is separate from NA, with different interpretations depending on the use case. IGNORE could mean:

1. Data that is being temporarily ignored, e.g., a possible outlier that is temporarily being removed from consideration.
2. Data that cannot exist, e.g., a matrix representing a grid of water depths for a lake. Since the lake isn't square, some entries will represent land, and so depth will be a meaningless concept for those entries.
3. Using IGNORE to signal a jagged array, e.g., [[1, 2, IGNORE], [IGNORE, 3, 4]] should behave exactly the same as [[1, 2], [3, 4]]. Though this leaves open how [1, 2, IGNORE] + [3, 4] should behave.

Because of these different uses of IGNORE, it doesn't have as clear a theoretical interpretation as NA. (For instance, what is IGNORE + 3, IGNORE * 3, or IGNORE | True?) I don't remember this bit of the discussion, but I see from current masked arrays that IGNORE is treated as the identity, so: IGNORE + 3 = 3 and IGNORE * 3 = 3. But several of the discussants thought the use cases for IGNORE were very compelling. Specifically, they wanted to be able to use IGNORE's and NA's simultaneously while still being able to differentiate between them. So, for example, being able to designate some data as IGNORE while still able to determine
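The identity behaviour described for IGNORE can be checked against today's numpy.ma; a small sketch (this illustrates the existing masked-array semantics, not either NEP's proposed machinery):

```python
import numpy.ma as ma

# In current masked arrays, ignored (masked) entries act as the
# identity element in reductions: 0 for sums, 1 for products.
a = ma.array([1.0, 2.0, 3.0], mask=[False, True, False])
print(a.sum())    # 4.0 -- the masked 2.0 contributes 0
print(a.prod())   # 3.0 -- the masked 2.0 contributes 1
```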
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
On 07/06/2011 02:05 PM, Matthew Brett wrote: [long quote of the NA/IGNORE summary snipped]
I described this as 'hobbling the API' and Mark thought of it as 'generic programming' (missing is always missing).

Here's an HPC perspective...: If you, say, want to off-load array processing with a mask to some code running on a GPU, you really can't have the GPU go through some NumPy API. Or if you want to implement a masked array on a cluster with MPI, you similarly really, really want raw access. At least I feel that the transparency of NumPy is a huge part of its current success. Many more than me spend half their time in C/Fortran and half their time in Python. I tend to look at NumPy this way: Assuming you have some data in memory (possibly loaded by a C or Fortran library), (almost) no matter how it is allocated, ordered, packed, aligned -- there's a way to find strides and dtypes to put a nice NumPy wrapper around it and use the memory from Python. So, my view on Mark's NEP was: With a reasonable amount of flexibility in how you decided to implement masking for your data, you can create a NumPy wrapper that will understand that. Whether your Fortran library exposes NAs in its 40GB buffer as bit patterns, or using a separate mask, both will work. And IMO Mark's NEP comes rather close to this, you just need an additional NEP later to give raw access to the implementation details, once those are settled :-) Dag Sverre
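The "find strides and dtypes and wrap it" idea above can be sketched as follows; the buffers and layout are invented for illustration, standing in for memory a C or Fortran library might hand back:

```python
import numpy as np

# Pretend this 48-byte buffer was allocated and filled by a Fortran library.
raw = bytearray(8 * 6)

# No copy: just describe the memory with a dtype, shape and order.
data = np.frombuffer(raw, dtype=np.float64).reshape(2, 3, order='F')

# A library that carries a separate (explicit) mask for NA can be
# wrapped the same way, with a parallel boolean view.
mask_raw = bytearray(6)
mask = np.frombuffer(mask_raw, dtype=np.bool_).reshape(2, 3, order='F')

# Writing through the view writes straight into the library's buffer.
data[0, 0] = 1.5
```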
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
On 07/06/2011 02:27 PM, Dag Sverre Seljebotn wrote: [earlier discussion snipped]

And IMO Mark's NEP comes rather close to this, you just need an additional NEP later to give raw access to the implementation details, once those are settled :-)

To be concrete, I'm thinking something like a custom extension to PEP 3118, which could also allow efficient access from Cython without hard-coding Cython for NumPy (a GSoC project this summer will continue to move us away from the np.ndarray[int] syntax to a more generic int[:] that's less tied to NumPy). But first things first!
Dag Sverre
[Numpy-discussion] HPC missing data - was: NA/Missing Data Conference Call Summary
Hi, Sorry, I hope you don't mind, I moved this to its own thread, trying to separate comments on the NA debate from the discussion yesterday.

On Wed, Jul 6, 2011 at 1:27 PM, Dag Sverre Seljebotn d.s.seljeb...@astro.uio.no wrote: [earlier discussion snipped]

So, my view on Mark's NEP was: With a reasonable amount of flexibility in how you decided to implement masking for your data, you can create a NumPy wrapper that will understand that. Whether your Fortran library exposes NAs in its 40GB buffer as bit patterns, or using a separate mask, both will work. And IMO Mark's NEP comes rather close to this, you just need an additional NEP later to give raw access to the implementation details, once those are settled :-)

I was a little puzzled as to what you were trying to say, but I suspect that's my ignorance about Numpy internals. Superficially, I would have assumed that making masked and bit-pattern NAs behave the same in numpy would take you away from the raw data, in the sense that you not only need the dtype, you also need the mask machinery, in order to know if you have an NA. Later I realized that you probably weren't saying that.
So, just for my unhappy ignorance - how does the HPC perspective relate to the debate about can / can't distinguish NA from IGNORE? Sorry, thanks, Matthew
Re: [Numpy-discussion] HPC missing data - was: NA/Missing Data Conference Call Summary
On 07/06/2011 02:46 PM, Matthew Brett wrote: Hi, Sorry, I hope you don't mind, I moved this to its own thread, trying to separate comments on the NA debate from the discussion yesterday.

I'm sorry.

[long quote snipped]
Later I realized that you probably weren't saying that. So, just for my unhappy ignorance - how does the HPC perspective relate to the debate about can / can't distinguish NA from IGNORE?

I just commented on the "prevent direct API access to the masking array" part -- I'm hoping direct access by external code to the underlying implementation details will be allowed, at some point. What I'm saying is that Mark's proposal is more flexible. Say for the sake of the argument that I have two codes I need to interface with:

- Library A is written in Fortran and uses a separate (explicit) mask array for NA
- Library B runs on a GPU and uses a bit pattern for NA

Mark's proposal then comes closer to allowing me to wrap both codes using NumPy, since it supports both implementation mechanisms. Sure, it would need a separate NEP down the road to extend it, but it goes in the right direction for this to happen. As for
Re: [Numpy-discussion] HPC missing data - was: NA/Missing Data Conference Call Summary
Hi, On Wed, Jul 6, 2011 at 2:12 PM, Dag Sverre Seljebotn d.s.seljeb...@astro.uio.no wrote: [long quote snipped]
[Numpy-discussion] using the same vocabulary for missing value ideas
It appears to me that one of the biggest reasons some of us have been talking past each other in the discussions is that different people have different definitions for the terms being used. Until this is thoroughly cleared up, I feel the design process is tilting at windmills. In the interests of clarity in our discussions, here is a starting point which is consistent with the NEP. These definitions have been added in a glossary within the NEP. If there are any ideas for amendments to these definitions that we can agree on, I will update the NEP with those amendments. Also, if I missed any important terms which need to be added, please propose definitions for them.

NA (Not Available)
    A placeholder for a value which is unknown to computations. That value may be temporarily hidden with a mask, may have been lost due to hard drive corruption, or gone for any number of reasons. This is the same as NA in the R project.

IGNORE (Skip/Ignore)
    A placeholder which should be treated by computations as if no value does or could exist there. For sums, this means act as if the value were zero, and for products, this means act as if the value were one. It's as if the array were compressed in some fashion to not include that element.

bitpattern
    A technique for implementing either NA or IGNORE, where a particular set of bit patterns is chosen from all the possible bit patterns of the value's data type to signal that the element is NA or IGNORE.

mask
    A technique for implementing either NA or IGNORE, where a boolean or enum array parallel to the data array is used to signal which elements are NA or IGNORE.

numpy.ma
    The existing implementation of a particular form of masked arrays, which is part of the NumPy codebase.

The most important distinctions I'm trying to draw are:

1) NA vs IGNORE and bitpattern vs mask are completely independent. Any combination of NA as bitpattern, NA as mask, IGNORE as bitpattern, and IGNORE as mask is reasonable.
2) The idea of masking and the numpy.ma implementation are different. The numpy.ma object makes particular choices about how to interpret the mask, but while backwards compatibility is important, a fresh evaluation of all the design choices going into a mask implementation is worthwhile. Thanks, Mark ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
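The NA/IGNORE distinction above can be seen with tools that exist today. A minimal sketch: numpy.ma gives ignore-style reductions, while NaN is the closest current stand-in for a propagating NA (np.NA itself is only proposed, so it is not used here):

```python
import numpy as np

# IGNORE semantics: numpy.ma reductions skip masked elements,
# acting as if the value were zero for sums and one for products.
a = np.ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])
print(a.sum())    # 4.0 -- the masked 2.0 is skipped
print(a.prod())   # 3.0 -- the masked slot acts as the multiplicative identity

# NA-like (propagating) semantics: NaN is the nearest existing analogue.
b = np.array([1.0, np.nan, 3.0])
print(b.sum())    # nan -- the unknown value propagates through the sum
```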
Re: [Numpy-discussion] using the same vocabulary for missing value ideas
On Wed, Jul 6, 2011 at 4:40 PM, Mark Wiebe mwwi...@gmail.com wrote: It appears to me that one of the biggest reason some of us have been talking past each other in the discussions is that different people have different definitions for the terms being used. Until this is thoroughly cleared up, I feel the design process is tilting at windmills. In the interests of clarity in our discussions, here is a starting point which is consistent with the NEP. These definitions have been added in a glossary within the NEP. If there are any ideas for amendments to these definitions that we can agree on, I will update the NEP with those amendments. Also, if I missed any important terms which need to be added, please propose definitions for them. That sounds good - I've only been scanning these discussions and it is confusing. NA (Not Available) A placeholder for a value which is unknown to computations. That value may be temporarily hidden with a mask, may have been lost due to hard drive corruption, or gone for any number of reasons. This is the same as NA in the R project. Could you expand that to say how sums and products act with NA (since you do so for the IGNORE case). Thanks, Peter ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] using the same vocabulary for missing value ideas
Hi, On Wed, Jul 6, 2011 at 4:40 PM, Mark Wiebe mwwi...@gmail.com wrote: It appears to me that one of the biggest reason some of us have been talking past each other in the discussions is that different people have different definitions for the terms being used. Until this is thoroughly cleared up, I feel the design process is tilting at windmills. In the interests of clarity in our discussions, here is a starting point which is consistent with the NEP. These definitions have been added in a glossary within the NEP. If there are any ideas for amendments to these definitions that we can agree on, I will update the NEP with those amendments. Also, if I missed any important terms which need to be added, please propose definitions for them. NA (Not Available) A placeholder for a value which is unknown to computations. That value may be temporarily hidden with a mask, may have been lost due to hard drive corruption, or gone for any number of reasons. This is the same as NA in the R project. Really? Can one implement NA with a mask in R? I thought an NA was always bitpattern in R? IGNORE (Skip/Ignore) A placeholder which should be treated by computations as if no value does or could exist there. For sums, this means act as if the value were zero, and for products, this means act as if the value were one. It's as if the array were compressed in some fashion to not include that element. bitpattern A technique for implementing either NA or IGNORE, where a particular set of bit patterns are chosen from all the possible bit patterns of the value's data type to signal that the element is NA or IGNORE. mask A technique for implementing either NA or IGNORE, where a boolean or enum array parallel to the data array is used to signal which elements are NA or IGNORE. numpy.ma The existing implementation of a particular form of masked arrays, which is part of the NumPy codebase. 
The most important distinctions I'm trying to draw are: 1) NA vs IGNORE and bitpattern vs mask are completely independent. Any combination of NA as bitpattern, NA as mask, IGNORE as bitpattern, and IGNORE as mask are reasonable. 2) The idea of masking and the numpy.ma implementation are different. The numpy.ma object makes particular choices about how to interpret the mask, but while backwards compatibility is important, a fresh evaluation of all the design choices going into a mask implementation is worthwhile.

I agree that there has been some confusion due to the terms. However, I continue to believe that the disagreement is substantial and not merely due to confusion. Let us then characterize the substantial discussion as this:

NEP: bitpattern and masked-out values should be made nearly impossible to distinguish in the API.

alterNEP: bitpattern and masked-out values should be distinct in the API so that it can be made clear which is meant (and therefore, implicitly, how they are implemented).

Do you agree that this is the discussion? See you, Matthew ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
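The crux here, whether the two kinds of missingness stay distinguishable, comes down to whether the original value survives. A plain-Python sketch of that difference (no proposed numpy API is assumed):

```python
import math

data = [1.0, 2.0, 3.0]
mask = [False, True, False]    # element 1 is "masked out"

# With a mask, the underlying value is merely hidden from computations...
visible = [v for v, m in zip(data, mask) if not m]
print(visible)        # [1.0, 3.0]
# ...and remains recoverable by clearing the mask.
print(data[1])        # 2.0

# With a bitpattern (NaN used as the stand-in), the slot itself is overwritten:
data[1] = float("nan")
print(math.isnan(data[1]))    # True -- the original 2.0 is gone for good
```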
Re: [Numpy-discussion] using the same vocabulary for missing value ideas
On Wed, Jul 6, 2011 at 5:38 PM, Matthew Brett matthew.br...@gmail.com wrote: Hi, On Wed, Jul 6, 2011 at 4:40 PM, Mark Wiebe mwwi...@gmail.com wrote: It appears to me that one of the biggest reason some of us have been talking past each other in the discussions is that different people have different definitions for the terms being used. Until this is thoroughly cleared up, I feel the design process is tilting at windmills. In the interests of clarity in our discussions, here is a starting point which is consistent with the NEP. These definitions have been added in a glossary within the NEP. If there are any ideas for amendments to these definitions that we can agree on, I will update the NEP with those amendments. Also, if I missed any important terms which need to be added, please propose definitions for them. NA (Not Available) A placeholder for a value which is unknown to computations. That value may be temporarily hidden with a mask, may have been lost due to hard drive corruption, or gone for any number of reasons. This is the same as NA in the R project. Really? Can one implement NA with a mask in R? I thought an NA was always bitpattern in R? I don't think that was what Mark was saying, see this bit later in this email: The most important distinctions I'm trying to draw are: 1) NA vs IGNORE and bitpattern vs mask are completely independent. Any combination of NA as bitpattern, NA as mask, IGNORE as bitpattern, and IGNORE as mask are reasonable. This point as I understood it is there is the semantics of the special values (not available vs ignore), and there is the implementation (bitpattern vs mask), and they are independent. Peter ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] using the same vocabulary for missing value ideas
Hi, On Wed, Jul 6, 2011 at 5:48 PM, Peter numpy-discuss...@maubp.freeserve.co.uk wrote: On Wed, Jul 6, 2011 at 5:38 PM, Matthew Brett matthew.br...@gmail.com wrote: Hi, On Wed, Jul 6, 2011 at 4:40 PM, Mark Wiebe mwwi...@gmail.com wrote: It appears to me that one of the biggest reason some of us have been talking past each other in the discussions is that different people have different definitions for the terms being used. Until this is thoroughly cleared up, I feel the design process is tilting at windmills. In the interests of clarity in our discussions, here is a starting point which is consistent with the NEP. These definitions have been added in a glossary within the NEP. If there are any ideas for amendments to these definitions that we can agree on, I will update the NEP with those amendments. Also, if I missed any important terms which need to be added, please propose definitions for them. NA (Not Available) A placeholder for a value which is unknown to computations. That value may be temporarily hidden with a mask, may have been lost due to hard drive corruption, or gone for any number of reasons. This is the same as NA in the R project. Really? Can one implement NA with a mask in R? I thought an NA was always bitpattern in R? I don't think that was what Mark was saying, see this bit later in this email: I think it would make a difference if there was an implementation that had conflated masking with bitpatterns in terms of API. I don't think R is an example. The most important distinctions I'm trying to draw are: 1) NA vs IGNORE and bitpattern vs mask are completely independent. Any combination of NA as bitpattern, NA as mask, IGNORE as bitpattern, and IGNORE as mask are reasonable. This point as I understood it is there is the semantics of the special values (not available vs ignore), and there is the implementation (bitpattern vs mask), and they are independent. Yes. 
Although, we can see from the implementations that we have to hand that:

a) bitpatterns -> propagation (NaN-like) semantics by default (R)
b) masks -> ignore semantics by default (masked arrays)

I don't think Mark accepts that there is any reason for this tendency of implementations towards particular semantics, but Nathaniel was arguing otherwise in the alterNEP. I think we all accept that it's possible to imagine masking having propagation semantics and bitpatterns having ignore semantics. Cheers, Matthew ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] using the same vocabulary for missing value ideas
On Wed, Jul 6, 2011 at 12:01 PM, Matthew Brett matthew.br...@gmail.comwrote: Hi, On Wed, Jul 6, 2011 at 5:48 PM, Peter numpy-discuss...@maubp.freeserve.co.uk wrote: On Wed, Jul 6, 2011 at 5:38 PM, Matthew Brett matthew.br...@gmail.com wrote: Hi, On Wed, Jul 6, 2011 at 4:40 PM, Mark Wiebe mwwi...@gmail.com wrote: It appears to me that one of the biggest reason some of us have been talking past each other in the discussions is that different people have different definitions for the terms being used. Until this is thoroughly cleared up, I feel the design process is tilting at windmills. In the interests of clarity in our discussions, here is a starting point which is consistent with the NEP. These definitions have been added in a glossary within the NEP. If there are any ideas for amendments to these definitions that we can agree on, I will update the NEP with those amendments. Also, if I missed any important terms which need to be added, please propose definitions for them. NA (Not Available) A placeholder for a value which is unknown to computations. That value may be temporarily hidden with a mask, may have been lost due to hard drive corruption, or gone for any number of reasons. This is the same as NA in the R project. Really? Can one implement NA with a mask in R? I thought an NA was always bitpattern in R? I don't think that was what Mark was saying, see this bit later in this email: I think it would make a difference if there was an implementation that had conflated masking with bitpatterns in terms of API. I don't think R is an example. Of course R is not an example of that. Nothing is. This is merely conceptual. Separate NA from np.NA in Mark's NEP, and you will see his point. Consider it the logical intersection of NA in Mark's NEP and the aNEP. The most important distinctions I'm trying to draw are: 1) NA vs IGNORE and bitpattern vs mask are completely independent. 
Any combination of NA as bitpattern, NA as mask, IGNORE as bitpattern, and IGNORE as mask are reasonable. This point as I understood it is there is the semantics of the special values (not available vs ignore), and there is the implementation (bitpattern vs mask), and they are independent. Yes. Good, that's all Mark's definition guide is trying to do. Although, we can see from the implementations that we have to hand that a) bitpatterns - propagation (NaN-like) semantics by default (R) b) masks - ignore semantics by default (masked arrays) The above is extraneous and out of the scope of Mark's definitions. We are taking this little-by-little. I don't think Mark accepts that there is any reason for this tendency of implementations to semantics, but Nathaniel was arguing otherwise in the alterNEP. Then that is what we will debate *later*, once we establish definitions. I think we all accept that it's possible to imagine masking have propagation semantics and bitpatterns having ignore semantics. Good! I think that is what Mark wanted to get across in this set of definitions. It kinda seems like you are champing at the bit here to continue the debate, but I agree with Mark that after yesterday's discussion, we need to make sure that we have a solid foundation for understanding each other. Ben Root ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] using the same vocabulary for missing value ideas
Ah, semantics... On Jul 6, 2011, at 5:40 PM, Mark Wiebe wrote: NA (Not Available) A placeholder for a value which is unknown to computations. That value may be temporarily hidden with a mask, may have been lost due to hard drive corruption, or gone for any number of reasons. This is the same as NA in the R project. I have a problem with 'temporarily hidden with a mask'. In my mind, the concept of NA carries a notion of permanence. The data is just not available, just as a NaN is just not a number. IGNORE (Skip/Ignore) A placeholder which should be treated by computations as if no value does or could exist there. For sums, this means act as if the value were zero, and for products, this means act as if the value were one. It's as if the array were compressed in some fashion to not include that element. Data temporarily hidden by a mask becomes np.IGNORE. bitpattern A technique for implementing either NA or IGNORE, where a particular set of bit patterns are chosen from all the possible bit patterns of the value's data type to signal that the element is NA or IGNORE. mask A technique for implementing either NA or IGNORE, where a boolean or enum array parallel to the data array is used to signal which elements are NA or IGNORE. numpy.ma The existing implementation of a particular form of masked arrays, which is part of the NumPy codebase. OK with that. The most important distinctions I'm trying to draw are: 1) NA vs IGNORE and bitpattern vs mask are completely independent. Any combination of NA as bitpattern, NA as mask, IGNORE as bitpattern, and IGNORE as mask are reasonable. OK with that. 2) The idea of masking and the numpy.ma implementation are different. The numpy.ma object makes particular choices about how to interpret the mask, but while backwards compatibility is important, a fresh evaluation of all the design choices going into a mask implementation is worthwhile. Indeed. 
___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] using the same vocabulary for missing value ideas
Hi, On Wed, Jul 6, 2011 at 6:11 PM, Benjamin Root ben.r...@ou.edu wrote: On Wed, Jul 6, 2011 at 12:01 PM, Matthew Brett matthew.br...@gmail.com wrote: Hi, On Wed, Jul 6, 2011 at 5:48 PM, Peter numpy-discuss...@maubp.freeserve.co.uk wrote: On Wed, Jul 6, 2011 at 5:38 PM, Matthew Brett matthew.br...@gmail.com wrote: Hi, On Wed, Jul 6, 2011 at 4:40 PM, Mark Wiebe mwwi...@gmail.com wrote: It appears to me that one of the biggest reason some of us have been talking past each other in the discussions is that different people have different definitions for the terms being used. Until this is thoroughly cleared up, I feel the design process is tilting at windmills. In the interests of clarity in our discussions, here is a starting point which is consistent with the NEP. These definitions have been added in a glossary within the NEP. If there are any ideas for amendments to these definitions that we can agree on, I will update the NEP with those amendments. Also, if I missed any important terms which need to be added, please propose definitions for them. NA (Not Available) A placeholder for a value which is unknown to computations. That value may be temporarily hidden with a mask, may have been lost due to hard drive corruption, or gone for any number of reasons. This is the same as NA in the R project. Really? Can one implement NA with a mask in R? I thought an NA was always bitpattern in R? I don't think that was what Mark was saying, see this bit later in this email: I think it would make a difference if there was an implementation that had conflated masking with bitpatterns in terms of API. I don't think R is an example. Of course R is not an example of that. Nothing is. This is merely conceptual. Separate NA from np.NA in Mark's NEP, and you will see his point. Consider it the logical intersection of NA in Mark's NEP and the aNEP. I am trying to work out what you feel the points of discussion are. 
There's surely no point in continuing to debate things we agree on. I don't think anyone disputes (or has ever disputed) that:

- There can be missing data implemented with bitpatterns
- There can be missing data implemented with masks
- Missing data can have propagate semantics
- Missing data can have ignore semantics
- The implementation does not in itself constrain the semantics

Let's not discuss that any more; we all agree. So what do you think is the source of the disagreement? Or are you saying that there should be no disagreement at this stage? Cheers, Matthew ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
On Wed, Jul 6, 2011 at 5:05 AM, Matthew Brett matthew.br...@gmail.com wrote: Hi, Just for reference, I am using this as the latest version of the NEP - I hope it's current: https://github.com/m-paradox/numpy/blob/7b10c9ab1616b9100e98dd2ab80cef639d5b5735/doc/neps/missing-data.rst I'm mostly relaying stuff I said, although generally (please do correct me if I am wrong) I am just re-expressing points that Nathaniel has already made in the alterNEP text and the emails. On Wed, Jul 6, 2011 at 12:46 AM, Christopher Jordan-Squire cjord...@uw.edu wrote: ... Since we only have Mark around Austin until early August, there's also broad agreement that we need to get something done quickly. I think I might have missed that part of the discussion :) I think that might have been mentioned by Travis right before he had to leave for another meeting, which might have been after you'd disconnected. Travis' concern as a member of the numpy community is the desire for something that is broadly applicable and adopted. But as Mark's employer, his concern is to get a more complete and coherent missing data functionality implemented in numpy while Mark is still at Enthought, for use in the problems Enthought and statisticians commonly encounter if nothing else. I feel the need to emphasize the centrality of the assertion by Nathaniel, and agreement by (at least) me, that the NA case (there really is no data) and the IGNORE case (there is data but I'm concealing it from you) are conceptually different, and come from different use-cases. The underlying disagreement returned many times to this fundamental difference between the NEP and alterNEP:

- In the NEP - by design - it is impossible to distinguish between na.NA and na.IGNORE.
- The alterNEP insists you should be able to distinguish.

Mark says something like it's all missing data, there's no reason you should want to distinguish. 
Nathaniel and I were saying the two types of missing do have different use-cases, and it should be possible to distinguish. You might want to choose to treat them the same, but you should be able to see what they are. I returned several times to this (original point by Nathaniel): a[3] = np.NA (what does this mean? Am I altering the underlying array, or a mask? How would I explain this to someone?) We confirmed that, in order to make it difficult to know what your NA is (masked or bit-pattern), Mark has to a) hinder access to the data below the mask and b) prevent direct API access to the masking array. I described this as 'hobbling the API' and Mark thought of it as 'generic programming' (missing is always missing). I asserted that explaining NA to people would be easier if ``a[3] = np.NA`` was direct assignment and altered the array.

BIT PATTERN & MASK IMPLEMENTATIONS FOR NA
-----------------------------------------

The current NEP proposes both mask and bit pattern implementations for missing data. I use the terms bit pattern and parameterized dtype interchangeably, since the parameterized dtype will use a bit pattern for its implementation. The two implementations will support the same functionality with respect to NA, and the implementation details will be largely invisible to the user. Their differences are in the 'extra' features each supports. Two common questions were:

1. Why make two implementations of missing data: one with masks and the other with parameterized dtypes?
2. Why does the implementation using masks have higher priority?

The answers are:

1. The mask implementation is more general and easier to implement and maintain. The bit pattern implementation saves memory, makes interoperability easier, and makes ABI (Application Binary Interface) compatibility easier. Since each has different strengths, the argument is that both should be implemented.
2. The implementation for the parameterized dtypes will rely on the implementation using a mask.

NA VS. IGNORE
-------------

A lot of discussion centered on IGNORE vs. NA types. We take IGNORE in the aNEP sense and NA in the NEP sense. With NA, there is a clear notion of how NA propagates through all basic numpy operations. (e.g., 3 + NA = NA and log(NA) = NA, while NA | True = True.) IGNORE is separate from NA, with different interpretations depending on the use case. IGNORE could mean:

1. Data that is being temporarily ignored. e.g., a possible outlier that is temporarily being removed from consideration.
2. Data that cannot exist. e.g., a matrix representing a grid of water depths for a lake. Since the lake isn't square, some entries will represent land, and so depth will be a meaningless concept for those entries.
3. Using IGNORE to signal a jagged array. e.g., [ [1, 2, IGNORE], [IGNORE, 3, 4] ] should behave exactly the same as [ [1, 2], [3, 4] ]. Though this leaves open how [1,
Re: [Numpy-discussion] ANN: NumPy 1.6.1 release candidate 2
In article cabl7cqhnnjkzk9xnrlvdarsdknwrm4ev0mxdurjsaxq73eb...@mail.gmail.com, Ralf Gommers ralf.gomm...@googlemail.com wrote: On Tue, Jul 5, 2011 at 11:41 PM, Russell E. Owen ro...@uw.edu wrote: In article BANLkTi=LXiTcrv1LgMtP=p9nF8eMr8=+h...@mail.gmail.com, Ralf Gommers ralf.gomm...@googlemail.com wrote: https://sourceforge.net/projects/numpy/files/NumPy/1.6.1rc2/ Will there be a Mac binary for 32-bit pythons (one that is compatible with older versions of MacOS X)? At present I only see a 64-bit 10.6-only version. Yes there will be for the final release (10.4-10.6 compatible). I can't create those on my own computer, so sometimes I don't make them for RCs. I'm glad they will be present for the final release. FYI: I built my own 1.6.1rc2 against Python 2.7.2 (the 32-bit Mac version from python.org). I reproduced a memory error that I've been trying to narrow down. This is ticket 1896: http://projects.scipy.org/numpy/ticket/1896 and the problem is also in 1.6.0. -- Russell ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
Christopher Jordan-Squire wrote: Here's a short-ish summary of the topics discussed in the conference call this afternoon. Thanks, this is great! And thanks to all who participated in the call. 3. Using IGNORE to signal a jagged array. e.g., [ [1, 2, IGNORE], [IGNORE, 3, 4] ] should behave exactly the same as [ [1 , 2] , [3 , 4] ]. whoooa! I actually have been looking for, and thinking about jagged arrays a fair bit lately, so this is kind of exciting, but this looks like a bad idea to me. The above indicates that: a = np.array( [ [1, 2, np.IGNORE], [np.IGNORE, 3, 4] ] ) a[:,1] would yield: array([2, 4]) which seems really wrong -- you've tossed out the location information altogether. (I think it should be: array([2, 3])) I could see a jagged array being represented by IGNOREs all at the END of each row, but putting items in the middle, and shifting things to the left strikes me as a plain old bad idea (and a pain to implement) -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
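Chris's expectation matches what numpy.ma does today: masked slots stay in place, so location information is preserved when slicing. A quick check (the zeros are arbitrary fill values under the mask):

```python
import numpy as np

a = np.ma.masked_array([[1, 2, 0], [0, 3, 4]],
                       mask=[[False, False, True], [True, False, False]])
col = a[:, 1]
print(col.tolist())    # [2, 3] -- positions preserved, not shifted left
```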
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
Hi, On Wed, Jul 6, 2011 at 6:54 PM, Christopher Jordan-Squire cjord...@uw.edu wrote: On Wed, Jul 6, 2011 at 5:05 AM, Matthew Brett matthew.br...@gmail.com wrote: Hi, Just for reference, I am using this as the latest version of the NEP - I hope it's current: https://github.com/m-paradox/numpy/blob/7b10c9ab1616b9100e98dd2ab80cef639d5b5735/doc/neps/missing-data.rst I'm mostly relaying stuff I said, although generally (please do correct me if I am wrong) I am just re-expressing points that Nathaniel has already made in the alterNEP text and the emails. On Wed, Jul 6, 2011 at 12:46 AM, Christopher Jordan-Squire cjord...@uw.edu wrote: ... Since we only have Mark is only around Austin until early August, there's also broad agreement that we need to get something done quickly. I think I might have missed that part of the discussion :) I think that might have been mentioned by Travis right before he had to leave for another meeting, which might have been after you'd disconnected. Travis' concern as a member of a numpy community is the desire for something that is broadly applicable and adopted. But as Mark's employer, his concern is to get a more complete and coherent missing data functionality implemented in numpy while Mark is still at Enthought, for use in the problems Enthought and statisticians commonly encounter if nothing else. Sorry - yes - I wasn't there for all the conversation. Of course (not disagreeing), we must take care to get the API right because it's unlikely to change and will be explaining and supporting it for a long time to come. I feel the need to emphasize the centrality of the assertion by Nathaniel, and agreement by (at least) me, that the NA case (there really is no data) and the IGNORE case (there is data but I'm concealing it from you) are conceptually different, and come from different use-cases. 
The underlying disagreement returned many times to this fundamental difference between the NEP and alterNEP: In the NEP - by design - it is impossible to distinguish between na.NA and na.IGNORE The alterNEP insists you should be able to distinguish. Mark says something like it's all missing data, there's no reason you should want to distinguish. Nathaniel and I were saying the two types of missing do have different use-cases, and it should be possible to distinguish. You might want to chose to treat them the same, but you should be able to see what they are.. I returned several times to this (original point by Nathaniel): a[3] = np.NA (what does this mean? I am altering the underlying array, or a mask? How would I explain this to someone?) We confirmed that, in order to make it difficult to know what your NA is (masked or bit-pattern), Mark has to a) hinder access to the data below the mask and b) prevent direct API access to the masking array. I described this as 'hobbling the API' and Mark thought of it as 'generic programming' (missing is always missing). I asserted that explaining NA to people would be easier if ``a[3] = np.NA`` was direct assignment and altered the array. BIT PATTERN MASK IMPLEMENTATIONS FOR NA -- The current NEP proposes both mask and bit pattern implementations for missing data. I use the terms bit pattern and parameterized dtype interchangeably, since the parameterized dtype will use a bit pattern for its implementation. The two implementations will support the same functionality with respect to NA, and the implementation details will be largely invisible to the user. Their differences are in the 'extra' features each supports. Two common questions were: 1. Why make two implementations of missing data: one with masks and the other with parameterized dtypes? 2. Why does the implementation using masks have higher priority? The answers are: 1. The mask implementation is more general and easier to implement and maintain. 
The bit pattern implementation saves memory, makes interoperability easier, and makes ABI (Application Binary Interface) compatibility easier. Since each has different strengths, the argument is both should be implemented. 2. The implementation for the parameterized dtypes will rely on the implementation using a mask. NA VS. IGNORE - A lot of discussion centered on IGNORE vs. NA types. We take IGNORE in aNEP sense and NA in NEP sense. With NA, there is a clear notion of how NA propagates through all basic numpy operations. (e.g., 3+NA=NA and log(NA) = NA, while NA | True = True.) IGNORE is separate from NA, with different interpretations depending on the use case. IGNORE could mean: 1. Data that is being temporarily ignored. e.g., a possible outlier that is temporarily being removed from consideration. 2. Data that cannot exist. e.g., a matrix representing a grid of water
Re: [Numpy-discussion] HPC missing data - was: NA/Missing Data Conference Call Summary
On Wed, Jul 6, 2011 at 6:12 AM, Dag Sverre Seljebotn d.s.seljeb...@astro.uio.no wrote: What I'm saying is that Mark's proposal is more flexible. Say for the sake of the argument that I have two codes I need to interface with:

- Library A is written in Fortran and uses a separate (explicit) mask array for NA
- Library B runs on a GPU and uses a bit pattern for NA

Have you ever encountered any such codes? I'm not aware of any code outside of R that implements the proposed NA semantics -- esp. in high-performance code, people generally want to avoid lots of conditionals, and the proposed NA semantics require a branch around every operation inside your inner loops. Certainly there is code out there that uses NaNs, and code that uses masks (in various ways that might or might not match the way the NEP uses them). And it's easy to work with both from numpy right now. The question is whether and how the core should add some tricky and subtle semantics for a few very specific ways of handling NaN-like objects and masking. Upthread you also wrote: At least I feel that the transparency of NumPy is a huge part of its current success. Many more than me spend half their time in C/Fortran and half their time in Python. It's exactly this transparency that worries Matthew and me -- we feel that the alterNEP preserves it, and the NEP attempts to erase it. In the NEP, there are two totally different underlying data structures, but this difference is blurred at the Python level. The idea is that you shouldn't have to think about which you have, but if you work with C/Fortran, then of course you do have to be constantly aware of the underlying implementation anyway. And operations which would obviously make sense for some of the objects that you know you're working with (e.g., unmasking elements from a masked array, or even accessing the mask directly using numpy slicing) are disallowed, specifically in order to make this distinction harder to make. 
According to the NEP, C code that takes a masked array should never ever unmask any element; unmasking should only be done by making a full copy of the mask, and attaching it to a new view taken from the original array. Would you honestly feel obliged to follow this requirement in your C code? Or would you just unmask elements in place when it made sense, in order to save memory? -- Nathaniel ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
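Nathaniel's point about conditionals can be made concrete with a sketch of an NA-propagating reduction over a separate mask array (plain Python standing in for a C inner loop; None is used as a stand-in for NA):

```python
def na_sum(data, mask):
    """Sum data, propagating NA: any masked element makes the whole result NA."""
    total = 0.0
    for value, is_na in zip(data, mask):
        if is_na:          # the per-element branch inside the inner loop
            return None    # None stands in for NA here
        total += value
    return total

print(na_sum([1.0, 2.0, 3.0], [False, False, False]))  # 6.0
print(na_sum([1.0, 2.0, 3.0], [False, True, False]))   # None
```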
Re: [Numpy-discussion] using the same vocabulary for missing value ideas
On Wed, Jul 6, 2011 at 10:44 AM, Matthew Brett matthew.br...@gmail.comwrote: Hi, On Wed, Jul 6, 2011 at 6:11 PM, Benjamin Root ben.r...@ou.edu wrote: On Wed, Jul 6, 2011 at 12:01 PM, Matthew Brett matthew.br...@gmail.com wrote: Hi, On Wed, Jul 6, 2011 at 5:48 PM, Peter numpy-discuss...@maubp.freeserve.co.uk wrote: On Wed, Jul 6, 2011 at 5:38 PM, Matthew Brett matthew.br...@gmail.com wrote: Hi, On Wed, Jul 6, 2011 at 4:40 PM, Mark Wiebe mwwi...@gmail.com wrote: It appears to me that one of the biggest reason some of us have been talking past each other in the discussions is that different people have different definitions for the terms being used. Until this is thoroughly cleared up, I feel the design process is tilting at windmills. In the interests of clarity in our discussions, here is a starting point which is consistent with the NEP. These definitions have been added in a glossary within the NEP. If there are any ideas for amendments to these definitions that we can agree on, I will update the NEP with those amendments. Also, if I missed any important terms which need to be added, please propose definitions for them. NA (Not Available) A placeholder for a value which is unknown to computations. That value may be temporarily hidden with a mask, may have been lost due to hard drive corruption, or gone for any number of reasons. This is the same as NA in the R project. Really? Can one implement NA with a mask in R? I thought an NA was always bitpattern in R? I don't think that was what Mark was saying, see this bit later in this email: I think it would make a difference if there was an implementation that had conflated masking with bitpatterns in terms of API. I don't think R is an example. Of course R is not an example of that. Nothing is. This is merely conceptual. Separate NA from np.NA in Mark's NEP, and you will see his point. Consider it the logical intersection of NA in Mark's NEP and the aNEP. 
I am trying to work out what you feel the points of discussion are. There's surely no point in continuing to debate things we agree on. I don't think anyone disputes (or has ever disputed) that: There can be missing data implemented with bitpatterns. There can be missing data implemented with masks. Missing data can have propagate semantics. Missing data can have ignore semantics. The implementation does not in itself constrain the semantics. So, to be clear, is your concern that you want to be able to tell the difference between whether an np.NA comes from the bit pattern or the mask in its implementation? But why would you have both the parameterized dtype and the mask implementation at the same time? They implement the same abstraction. Is your desire that the np.NA's are implemented solely through bit patterns and np.IGNORE is implemented solely through masks? So that you can think of the masks as being IGNORE flags? What if you want multiple types of IGNORE? (To ignore certain values because they're outliers, others because the data wouldn't make sense, and others because you're just focusing on a particular subgroup, for instance.) A related question is whether the IGNORE values could just be another NA value? I don't understand what the specific problem would be with having several NA values, say NA(1), NA(2), ..., and then letting the user decide that NA(1) means NA in the sense discussed above and NA(2) means IGNORE. Then the ufuncs could be told whether to ignore or propagate each type of NA value. Could you explain to me if this would resolve your concerns about NA/IGNORE, or possibly give a few examples if it doesn't? Because I am still rather confused. Let's not discuss that any more; we all agree. So what do you think is the source of the disagreement? Or are you saying that there should be no disagreement at this stage?
Cheers, Matthew
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
Dag Sverre Seljebotn wrote: Here's an HPC perspective...: At least I feel that the transparency of NumPy is a huge part of its current success. Many more than me spend half their time in C/Fortran and half their time in Python. Absolutely -- and this point has been raised a couple times in the discussion, so I hope it is not forgotten. I tend to look at NumPy this way: Assuming you have some data in memory (possibly loaded by a C or Fortran library), (almost) no matter how it is allocated, ordered, packed, aligned -- there's a way to find strides and dtypes to put a nice NumPy wrapper around it and use the memory from Python. And vice-versa -- assuming you have some data in numpy arrays, there's a way to process it with a C or Fortran library without copying the data. And this is where I am skeptical of the bit-pattern idea -- while one can expect C and Fortran and GPU, and ??? to understand NaNs for floating point data, is there any support in compilers or hardware for special bit patterns for NA values in integers? I've never seen it in my (very limited) experience. Maybe having the mask option, too, will make that irrelevant, but I want to be clear about that kind of use case. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov
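Chris's point about integers is exactly where bit patterns get awkward: there is no hardware-blessed NaN for integer dtypes, so the sentinel has to be checked in software. R, for example, reserves INT_MIN as the bit pattern for NA_integer_. A sketch of what that software-level convention looks like (the sentinel mirrors R's choice; nothing in today's numpy knows about it):

```python
import numpy as np

NA_INT32 = np.int32(-2**31)          # R's NA_integer_ bit pattern (INT_MIN)

a = np.array([1, 5, 3], dtype=np.int32)
a[1] = NA_INT32                      # mark the element as NA by convention

is_na = a == NA_INT32                # every operation needs this explicit test
assert a[~is_na].sum() == 4          # "skip the NAs" has to be done by hand
```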
Re: [Numpy-discussion] using the same vocabulary for missing value ideas
Mark Wiebe wrote: 1) NA vs IGNORE and bitpattern vs mask are completely independent. Any combination of NA as bitpattern, NA as mask, IGNORE as bitpattern, and IGNORE as mask are reasonable. Is this really true? If you use a bitpattern for IGNORE, haven't you just lost the ability to get the original value back if you want to stop ignoring it? Maybe that's not inherent to what an IGNORE means, but it seems pretty key to me. -Chris
Re: [Numpy-discussion] using the same vocabulary for missing value ideas
On Wed, Jul 6, 2011 at 12:01 PM, Matthew Brett matthew.br...@gmail.comwrote: Hi, On Wed, Jul 6, 2011 at 5:48 PM, Peter numpy-discuss...@maubp.freeserve.co.uk wrote: On Wed, Jul 6, 2011 at 5:38 PM, Matthew Brett matthew.br...@gmail.com wrote: Hi, On Wed, Jul 6, 2011 at 4:40 PM, Mark Wiebe mwwi...@gmail.com wrote: It appears to me that one of the biggest reason some of us have been talking past each other in the discussions is that different people have different definitions for the terms being used. Until this is thoroughly cleared up, I feel the design process is tilting at windmills. In the interests of clarity in our discussions, here is a starting point which is consistent with the NEP. These definitions have been added in a glossary within the NEP. If there are any ideas for amendments to these definitions that we can agree on, I will update the NEP with those amendments. Also, if I missed any important terms which need to be added, please propose definitions for them. NA (Not Available) A placeholder for a value which is unknown to computations. That value may be temporarily hidden with a mask, may have been lost due to hard drive corruption, or gone for any number of reasons. This is the same as NA in the R project. Really? Can one implement NA with a mask in R? I thought an NA was always bitpattern in R? I don't think that was what Mark was saying, see this bit later in this email: I think it would make a difference if there was an implementation that had conflated masking with bitpatterns in terms of API. I don't think R is an example. This reminds me of another confusion I've seen in the list. I'd like to suggest that we ban the word API by itself from the present discussion, and always specify Python API or C API for clarity's sake. Here are my suggested definitions for these two terms: Python API All the interface mechanisms that are exposed to Python code for using missing values in NumPy. 
This API is designed to be Pythonic and fit into the way NumPy works as much as possible. C API All the implementation mechanisms exposed for CPython extensions written in C that want to support NumPy missing value support. This API is designed to be as natural as possible in C, and usually prioritizes flexibility and high performance. Before we proceed to any discussion of what are good/bad choices, I really want to nail this down from just the definition perspective. I don't want arbitrary choices baked into the terms we use, because that implies already having made a design decision. -Mark The most important distinctions I'm trying to draw are: 1) NA vs IGNORE and bitpattern vs mask are completely independent. Any combination of NA as bitpattern, NA as mask, IGNORE as bitpattern, and IGNORE as mask are reasonable. This point as I understood it is there is the semantics of the special values (not available vs ignore), and there is the implementation (bitpattern vs mask), and they are independent. Yes. Although, we can see from the implementations that we have to hand that a) bitpatterns - propagation (NaN-like) semantics by default (R) b) masks - ignore semantics by default (masked arrays). I don't think Mark accepts that there is any reason for this tendency of implementations to semantics, but Nathaniel was arguing otherwise in the alterNEP. I think we all accept that it's possible to imagine masking having propagation semantics and bitpatterns having ignore semantics. Cheers, Matthew
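Matthew's a)/b) tendency can be seen with tools numpy already ships: NaN (a bit pattern) propagates through reductions by default, while numpy.ma (a mask) ignores by default, yet neither pairing is forced. A sketch:

```python
import numpy as np

# bitpattern: the special value lives in the data itself
bp = np.array([1.0, np.nan, 3.0])
assert np.isnan(bp.sum())            # propagates by default, R/NaN style

# mask: a parallel boolean array marks the special elements
mk = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False])
assert mk.sum() == 4.0               # ignores by default, masked-array style

# but the pairing is a convention, not a law: np.nansum gives a
# bitpattern ignore semantics, i.e. implementation != semantics
assert np.nansum(bp) == 4.0
```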
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
Christopher Jordan-Squire wrote: If we follow those rules for IGNORE for all computations, we sometimes get some weird output. For example: [ [1, 2], [3, 4] ] * [ IGNORE, 7] = [ 15, 31 ]. (Where * is matrix multiply and not * with broadcasting.) Or should that sort of operation throw an error? That should throw an error -- matrix computation is heavily influenced by the shape and size of matrices, so I think IGNOREs really don't make sense there. Nathaniel Smith wrote: It's exactly this transparency that worries Matthew and me -- we feel that the alterNEP preserves it, and the NEP attempts to erase it. In the NEP, there are two totally different underlying data structures, but this difference is blurred at the Python level. The idea is that you shouldn't have to think about which you have, but if you work with C/Fortran, then of course you do have to be constantly aware of the underlying implementation anyway. I don't think this bothers me -- I think it's analogous to things in numpy like Fortran order and non-contiguous arrays -- you can ignore all that when working in pure python when performance isn't critical, but you need a deeper understanding if you want to work with the data in C or Fortran or to tune performance in python. So as long as there is an API to query and control how things work, I like that it's hidden from simple python code. -Chris
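The weird [15, 31] comes from applying the proposed "for products, act as if the value were one" rule literally before the matrix multiply. A sketch reproducing it (NaN stands in for IGNORE, since no released numpy has an IGNORE value):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
v = np.array([np.nan, 7.0])             # NaN plays the role of IGNORE here

# "for products, act as if the value were one"
v_filled = np.where(np.isnan(v), 1.0, v)
result = A @ v_filled
assert result.tolist() == [15.0, 31.0]  # 1*1 + 2*7, 3*1 + 4*7
```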
Re: [Numpy-discussion] HPC missing data - was: NA/Missing Data Conference Call Summary
On 07/06/2011 08:10 PM, Nathaniel Smith wrote: On Wed, Jul 6, 2011 at 6:12 AM, Dag Sverre Seljebotn d.s.seljeb...@astro.uio.no wrote: What I'm saying is that Mark's proposal is more flexible. Say for the sake of the argument that I have two codes I need to interface with: - Library A is written in Fortran and uses a separate (explicit) mask array for NA - Library B runs on a GPU and uses a bit pattern for NA Have you ever encountered any such codes? I'm not aware of any code outside of R that implements the proposed NA semantics -- esp. in high-performance code, people generally want to avoid lots of conditionals, and the proposed NA semantics require a branch around every operation inside your inner loops. I'll admit that this whole thing was a hypothetical exercise. I've interfaced with Fortran code with NA values -- not a high performance case, but not all you interface with is high performance. Certainly there is code out there that uses NaNs, and code that uses masks (in various ways that might or might not match the way the NEP uses them). And it's easy to work with both from numpy right now. The question is whether and how the core should add some tricky and subtle semantics for a few very specific ways of handling NaN-like objects and masking. I don't disagree with this. It's exactly this transparency that worries Matthew and me -- we feel that the alterNEP preserves it, and the NEP attempts to erase it. In the NEP, there are two totally different underlying data structures, but this difference is blurred at the Python level. The idea is that you shouldn't have to think about which you have, but if you work with C/Fortran, then of course you do have to be constantly aware of the underlying implementation anyway.
And operations which would obviously make sense for some of the objects that you know you're working with (e.g., unmasking elements from a masked array, or even accessing the mask directly using numpy slicing) are disallowed, specifically in order to make this distinction harder to make. This worries me too. What I was thinking is that it could be sort of like indexing -- it works OK to have indexing be transparent in Python-land with respect to striding, and have a contiguous array be just a special case marked by an attribute. If you want, you can still check the strides or flags attributes. According to the NEP, C code that takes a masked array should never ever unmask any element; unmasking should only be done by making a full copy of the mask, and attaching it to a new view taken from the original array. Would you honestly feel obliged to follow this requirement in your C code? Or would you just unmask elements in place when it made sense, in order to save memory? I'm with you on this one: I wouldn't adopt any NumPy feature widely unless I had totally transparent access to the underlying implementation details from C -- without relying on any NumPy headers (except in my Cython wrappers)! I don't believe in APIs, I believe in standardized binary data. But I always assumed that could be done down the road, once the internal details had stabilized. As for myself, I'll admit that I'll almost certainly continue with explicit masking without using any of the proposed NEPs -- I have to be extremely aware of the masks in the statistical methods I use. Perhaps that's a sign I should withdraw from the discussion. Dag Sverre
Re: [Numpy-discussion] using the same vocabulary for missing value ideas
On Wed, Jul 6, 2011 at 11:33 AM, Peter numpy-discuss...@maubp.freeserve.co.uk wrote: On Wed, Jul 6, 2011 at 4:40 PM, Mark Wiebe mwwi...@gmail.com wrote: It appears to me that one of the biggest reasons some of us have been talking past each other in the discussions is that different people have different definitions for the terms being used. Until this is thoroughly cleared up, I feel the design process is tilting at windmills. In the interests of clarity in our discussions, here is a starting point which is consistent with the NEP. These definitions have been added in a glossary within the NEP. If there are any ideas for amendments to these definitions that we can agree on, I will update the NEP with those amendments. Also, if I missed any important terms which need to be added, please propose definitions for them. That sounds good - I've only been scanning these discussions and it is confusing. NA (Not Available) A placeholder for a value which is unknown to computations. That value may be temporarily hidden with a mask, may have been lost due to hard drive corruption, or gone for any number of reasons. This is the same as NA in the R project. Could you expand that to say how sums and products act with NA (since you do so for the IGNORE case). I've added that, here's the new version: NA (Not Available) A placeholder for a value which is unknown to computations. That value may be temporarily hidden with a mask, may have been lost due to hard drive corruption, or gone for any number of reasons. For sums and products this means to produce NA if any of the inputs are NA. This is the same as NA in the R project. Thanks, -Mark Thanks, Peter
Re: [Numpy-discussion] using the same vocabulary for missing value ideas
On Wed, Jul 6, 2011 at 11:38 AM, Matthew Brett matthew.br...@gmail.comwrote: Hi, On Wed, Jul 6, 2011 at 4:40 PM, Mark Wiebe mwwi...@gmail.com wrote: It appears to me that one of the biggest reason some of us have been talking past each other in the discussions is that different people have different definitions for the terms being used. Until this is thoroughly cleared up, I feel the design process is tilting at windmills. In the interests of clarity in our discussions, here is a starting point which is consistent with the NEP. These definitions have been added in a glossary within the NEP. If there are any ideas for amendments to these definitions that we can agree on, I will update the NEP with those amendments. Also, if I missed any important terms which need to be added, please propose definitions for them. NA (Not Available) A placeholder for a value which is unknown to computations. That value may be temporarily hidden with a mask, may have been lost due to hard drive corruption, or gone for any number of reasons. This is the same as NA in the R project. Really? Can one implement NA with a mask in R? I thought an NA was always bitpattern in R? IGNORE (Skip/Ignore) A placeholder which should be treated by computations as if no value does or could exist there. For sums, this means act as if the value were zero, and for products, this means act as if the value were one. It's as if the array were compressed in some fashion to not include that element. bitpattern A technique for implementing either NA or IGNORE, where a particular set of bit patterns are chosen from all the possible bit patterns of the value's data type to signal that the element is NA or IGNORE. mask A technique for implementing either NA or IGNORE, where a boolean or enum array parallel to the data array is used to signal which elements are NA or IGNORE. numpy.ma The existing implementation of a particular form of masked arrays, which is part of the NumPy codebase. 
The most important distinctions I'm trying to draw are: 1) NA vs IGNORE and bitpattern vs mask are completely independent. Any combination of NA as bitpattern, NA as mask, IGNORE as bitpattern, and IGNORE as mask are reasonable. 2) The idea of masking and the numpy.ma implementation are different. The numpy.ma object makes particular choices about how to interpret the mask, but while backwards compatibility is important, a fresh evaluation of all the design choices going into a mask implementation is worthwhile. I agree that there has been some confusion due to the terms. However, I continue to believe that the discussion is substantial and not due to confusion. I believe this is true as well, but the confusion due to the terms appears to be one of the root causes preventing the ideas from getting across. Without first clearing up this aspect of the discussion, things will stay confusing. Let us then characterize the substantial discussion as this: NEP: bitpattern and masked out values should be made nearly impossible to distinguish in the API alterNEP: bitpattern and masked out values should be distinct in the API so that it can be made clear which is meant (and therefore, implicitly, how they are implemented). Do you agree that this is the discussion? I'd like to get agreement on the definitions before moving to any of the points of contention that are being raised. Thanks, -Mark See you, Matthew
Re: [Numpy-discussion] HPC missing data - was: NA/Missing Data Conference Call Summary
On 07/06/2011 04:47 PM, Matthew Brett wrote: Hi, On Wed, Jul 6, 2011 at 2:12 PM, Dag Sverre Seljebotn d.s.seljeb...@astro.uio.no wrote: I just commented on the prevent direct API access to the masking array part -- I'm hoping direct access by external code to the underlying implementation details will be allowed, at some point. What I'm saying is that Mark's proposal is more flexible. Say for the sake of the argument that I have two codes I need to interface with: - Library A is written in Fortran and uses a separate (explicit) mask array for NA - Library B runs on a GPU and uses a bit pattern for NA Mark's proposal then comes closer to allowing me to wrap both codes using NumPy, since it supports both implementation mechanisms. Sure, it would need a separate NEP down the road to extend it, but it goes in the right direction for this to happen. I'm sorry - honestly - maybe it's because I've just had lunch, but I think I am not understanding something. When you say Mark's proposal is more flexible - more flexible than what? I think we agree that: * NA bitpatterns are good to have * masks are good to have and the discussion is about: * should it be possible to distinguish between bitpatterns (NAs) and masks (IGNORE). I guess I just don't agree with these definitions. There's (NA, IGNORE), and there's (bitpatterns, masks); these are in principle orthogonal. It is possible (and perhaps reasonable) to hard-wire them the way you say -- that may be more obvious, user-friendly, etc., but it is not more flexible. Both Mark and Chuck have explicitly supported having many different NA types down the road (thread: An NA compromise idea -- many-NA). So the main difference to me seems to be that you want to hard-wire the NA type and the representation in a specific configuration. I may be missing something though. Are you saying that making it not-possible to distinguish - at the numpy level, is more flexible?
I'm OK with the common ways of accessing data not distinguishing, as long as there's some power-user way around it. Just like strides -- you index a strided array just like a contiguous array, but you can peek inside into the implementation if you want. Dag Sverre
Re: [Numpy-discussion] using the same vocabulary for missing value ideas
On Wed, Jul 6, 2011 at 12:41 PM, Pierre GM pgmdevl...@gmail.com wrote: Ah, semantics... On Jul 6, 2011, at 5:40 PM, Mark Wiebe wrote: NA (Not Available) A placeholder for a value which is unknown to computations. That value may be temporarily hidden with a mask, may have been lost due to hard drive corruption, or gone for any number of reasons. This is the same as NA in the R project. I have a problem with 'temporarily hidden with a mask'. In my mind, the concept of NA carries a notion of perennation. The data is just not available, just as a NaN is just not a number. Yes, this gets directly to what I've been meaning when I say NA vs IGNORE is independent of mask vs bitpattern. The way I'm trying to structure things, NA vs IGNORE only affects the semantic meaning, i.e. the outputs produced by computations. This is precisely why I put 'temporarily hidden with a mask' first, to make that more clear. IGNORE (Skip/Ignore) A placeholder which should be treated by computations as if no value does or could exist there. For sums, this means act as if the value were zero, and for products, this means act as if the value were one. It's as if the array were compressed in some fashion to not include that element. A data temporarily hidden by a mask becomes np.IGNORE. Are you willing to suspend the idea of that implication for the purposes of the present discussion? If not, do you see a way to amend things so that masked NAs and bitpattern-based IGNOREs make sense? Would renaming IGNORE to SKIP be more clear, perhaps? Thanks, Mark bitpattern A technique for implementing either NA or IGNORE, where a particular set of bit patterns are chosen from all the possible bit patterns of the value's data type to signal that the element is NA or IGNORE. mask A technique for implementing either NA or IGNORE, where a boolean or enum array parallel to the data array is used to signal which elements are NA or IGNORE. 
numpy.ma The existing implementation of a particular form of masked arrays, which is part of the NumPy codebase. OK with that. The most important distinctions I'm trying to draw are: 1) NA vs IGNORE and bitpattern vs mask are completely independent. Any combination of NA as bitpattern, NA as mask, IGNORE as bitpattern, and IGNORE as mask are reasonable. OK with that. 2) The idea of masking and the numpy.ma implementation are different. The numpy.ma object makes particular choices about how to interpret the mask, but while backwards compatibility is important, a fresh evaluation of all the design choices going into a mask implementation is worthwhile. Indeed.
Re: [Numpy-discussion] ANN: NumPy 1.6.1 release candidate 2
On 7/6/2011 10:57 AM, Russell E. Owen wrote: In article cabl7cqhnnjkzk9xnrlvdarsdknwrm4ev0mxdurjsaxq73eb...@mail.gmail.com, Ralf Gommers ralf.gomm...@googlemail.com wrote: On Tue, Jul 5, 2011 at 11:41 PM, Russell E. Owen ro...@uw.edu wrote: In article BANLkTi=LXiTcrv1LgMtP=p9nF8eMr8=+h...@mail.gmail.com, Ralf Gommers ralf.gomm...@googlemail.com wrote: https://sourceforge.net/projects/numpy/files/NumPy/1.6.1rc2/ Will there be a Mac binary for 32-bit pythons (one that is compatible with older versions of MacOS X)? At present I only see a 64-bit 10.6-only version. Yes there will be for the final release (10.4-10.6 compatible). I can't create those on my own computer, so sometimes I don't make them for RCs. I'm glad they will be present for the final release. FYI: I built my own 1.6.1rc2 against Python 2.7.2 (the 32-bit Mac version from python.org). I reproduced a memory error that I've been trying to narrow down. This is ticket 1896: http://projects.scipy.org/numpy/ticket/1896 and the problem is also in 1.6.0. -- Russell I can reproduce this error on Windows. It looks like a serious regression. Christoph
Re: [Numpy-discussion] using the same vocabulary for missing value ideas
On Wed, Jul 6, 2011 at 1:25 PM, Christopher Barker chris.bar...@noaa.gov wrote: Mark Wiebe wrote: 1) NA vs IGNORE and bitpattern vs mask are completely independent. Any combination of NA as bitpattern, NA as mask, IGNORE as bitpattern, and IGNORE as mask are reasonable. Is this really true? If you use a bitpattern for IGNORE, haven't you just lost the ability to get the original value back if you want to stop ignoring it? Maybe that's not inherent to what an IGNORE means, but it seems pretty key to me. What do you think of renaming IGNORE to SKIP? -Mark
Re: [Numpy-discussion] using the same vocabulary for missing value ideas
On 07/06/2011 08:25 PM, Christopher Barker wrote: Mark Wiebe wrote: 1) NA vs IGNORE and bitpattern vs mask are completely independent. Any combination of NA as bitpattern, NA as mask, IGNORE as bitpattern, and IGNORE as mask are reasonable. Is this really true? If you use a bitpattern for IGNORE, haven't you just lost the ability to get the original value back if you want to stop ignoring it? Maybe that's not inherent to what an IGNORE means, but it seems pretty key to me. There's the question of how reductions treat the value. IIUC, IGNORE as bitpattern would imply that reductions treat the value as 0, which is a question orthogonal to whether the value can possibly be unmasked or not. Dag Sverre
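Chris's worry is easy to demonstrate with NaN as the stand-in bit pattern: writing the pattern destroys the stored value, even though reductions can still skip it, which is Dag's orthogonality point. A sketch:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
a[1] = np.nan               # IGNORE as a bit pattern overwrites the 2.0

assert np.nansum(a) == 4.0  # the reduction can treat it as 0 for sums...
# ...but there is no way to "stop ignoring" a[1]: the original value
# is gone from memory, unlike with a mask kept in a parallel array.
```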
[Numpy-discussion] towards a more productive missing values/masked arrays discussion...
So one thing that came up on the call yesterday is that there actually is a significant chunk of functionality that everyone seems to agree is useful, needed, and basically how it should work. This includes: -- the basic existence and semantics for NA values (however this is implemented) -- that there should exist a dtype/bit-pattern implementation for NAs (whatever other implementations there might also be) -- that ufuncs should take a where= argument -- that there should be a better way for ndarray subclasses like numpy.ma to override the arguments to ufuncs involving them -- maybe some other things I'm not thinking of The real controversy is around what role masking should play, both at the API and implementation level; there are lots of different arguments for different approaches, and it's not at all clear any current proposal will actually solve the problems we are facing (or even what those problems are). So rather than continue to go around in circles indefinitely on that, I'm going to write up some miniNEPs just focusing on the details of how the features we do agree on should work, so we can hopefully have a more technical discussion of *that*. Cheers, -- Nathaniel
Re: [Numpy-discussion] towards a more productive missing values/masked arrays discussion...
On Wed, Jul 6, 2011 at 2:20 PM, Nathaniel Smith n...@pobox.com wrote: So one thing that came up on the call yesterday is that there actually is a significant chunk of functionality that everyone seems to agree is useful, needed, and basically how it should work. This includes: -- the basic existence and semantics for NA values (however this is implemented) -- that there should exist a dtype/bit-pattern implementation for NAs (whatever other implementations there might also be) -- that ufuncs should take a where= argument -- that there should be a better way for ndarray subclasses like numpy.ma to override the arguments to ufuncs involving them -- maybe some other things I'm not thinking of The real controversy is around what role masking should play, both at the API and implementation level; there are lots of different arguments for different approaches, and it's not at all clear any current proposal will actually solve the problems we are facing (or even what those problems are). So rather than continue to go around in circles indefinitely on that, I'm going to write up some miniNEPs just focusing on the details of how the features we do agree on should work, so we can hopefully have a more technical discussion of *that*. That sounds alright to me. One thing I would like to ask is to please adopt the vocabulary we are discussing, using it exactly as defined so that people reading all the various ideas don't have to readjust when switching between documents. Thanks, Mark Cheers, -- Nathaniel
[Numpy-discussion] miniNEP1: where= argument for ufuncs
Here's the master copy: https://gist.github.com/1068056 But for your commenting convenience, I'll include the current text here:

A mini-NEP for the where= argument to ufuncs
============================================

To try and make more progress on the whole missing values/masked arrays/... debate, it seems useful to have a more technical discussion of the pieces which we *can* agree on. This is the first, which attempts to nail down the details of the new ``where=`` argument to ufuncs.

Rationale
---------

It is often useful to apply operations to a subset of your data, and numpy provides a rich interface for accomplishing this by combining indexing operations with ufunc operations, e.g.::

  a[10, mymask] += b
  np.sum(a[which_indices], axis=0)

But any kind of complex indexing necessarily requires making a temporary copy of (parts of) the underlying array, which can be quite expensive, and this copying could be avoided by teaching the ufunc loop to 'index as it goes'.

There are strong arguments against doing this. There are tons of cases like this where one can save some memory by avoiding temporaries, and we can't build them all into the core -- especially since we also have more general solutions like numexpr or writing optimized routines in C/Fortran/Cython. Furthermore, this case is a clear violation of orthogonality -- we already have indexing and ufuncs as separate things, so adding a second, somewhat crippled implementation of indexing to ufuncs themselves is a bit ugly. (It would be better if we could make sure that anything that could be passed to ndarray.__getitem__ could also be passed to ufuncs with the same semantics, but this would require substantial refactoring and seems unlikely to be implemented any time soon.) However,

API
---

A new optional keyword argument named ``where=`` will be added to all ufuncs.

Error checking
~~~~~~~~~~~~~~

If given, this argument must be a boolean array. If ``f`` is a ufunc, then given a function call like::

  f(a, b, where=mymask)

the following occurs.
First, ``mymask`` is coerced to an array if necessary, but no type conversion is performed. (I.e., we do ``np.asarray(mymask)``.) Next, we check whether ``mymask`` is a boolean array. If it is not, then we raise an exception. (In the future it would be nice to support other forms of indexing as well, such as lists of slices or arrays of integer indices. In order to preserve this option, we do not want to coerce integers into booleans.) Next, ``a`` and ``b`` are broadcast against each other, just as now; this determines the shape of the output array. Then ``mymask`` is broadcast to match this output array shape. (The shape of the output array cannot be changed by this process -- for example, having ``a.shape == (10, 1, 1)``, ``b.shape == (1, 10, 1)``, ``mymask.shape == (1, 1, 10)`` will raise an error rather than returning a new array with shape ``(10, 10, 10)``.)

Semantics: ufunc ``__call__``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When simply calling a ufunc with an output argument, e.g.::

  f(a, b, out=c, where=mymask)

then the result is equivalent to::

  c[mymask] = f(a[mymask], b[mymask])

On the other hand, if no output argument is given::

  f(a, b, where=mymask)

then an output array is instantiated as if by calling ``np.empty(shape, dtype=dtype)``, and then treated as above::

  c = np.empty(shape_for(a, b), dtype=dtype_for(f, a, b))
  f(a, b, out=c, where=mymask)
  return c

Note that this means that the output will, in general, contain uninitialized values.

Semantics: ufunc ``.reduce``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Take an expression like::

  f.reduce(a, axis=0, where=mymask)

This performs the given reduction operation along each column of ``a``, but simply skips any elements where the corresponding entry in ``mymask`` is false. (For ufuncs which have an identity, this is equivalent to treating the given elements as if they were the identity.)
For example, if ``a`` is a 2-dimensional array, then skipping over the details of broadcasting, dtype selection, etc., the above operation produces the same result as::

  out = np.empty(a.shape[1])
  for i in xrange(a.shape[1]):
      out[i] = f.reduce(a[mymask[:, i], i])
  return out

Semantics: ufunc ``.accumulate``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Accumulation is similar to reduction, except that ``.accumulate`` saves the intermediate values generated during the reduction loop. Therefore we use the same semantics as for ``.reduce`` above. If ``a`` is 2-d, etc., then this expression::

  f.accumulate(a, axis=0, where=mymask)

is equivalent to::

  out = np.empty(a.shape)
  for i in xrange(a.shape[1]):
      out[mymask[:, i], i] = f.accumulate(a[mymask[:, i], i])
  return out

Notice that once again, elements of ``out`` which correspond to False entries in the mask are left uninitialized.
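[Editorial note: the ``__call__`` and ``.reduce`` semantics above can be modeled directly with boolean indexing. The sketch below illustrates the proposed behavior only, not the eventual C implementation, which would avoid these temporaries; the function names are illustrative.]

```python
import numpy as np

def call_with_where(f, a, b, out=None, where=None):
    """Boolean-indexing model of f(a, b, out=..., where=mymask)."""
    a, b = np.asarray(a), np.asarray(b)
    mask = np.asarray(where)              # no type conversion is performed
    if mask.dtype != np.bool_:
        raise TypeError("where= must be a boolean array")
    shape = np.broadcast(a, b).shape      # a and b alone fix the output shape
    mask = np.broadcast_to(mask, shape)   # raises if mask would enlarge it
    if out is None:
        out = np.empty(shape, dtype=np.result_type(a, b))  # uninitialized!
    out[mask] = f(np.broadcast_to(a, shape)[mask],
                  np.broadcast_to(b, shape)[mask])
    return out

def reduce_with_where(f, a, mymask):
    """Column-wise model of f.reduce(a, axis=0, where=mymask)."""
    a, mymask = np.asarray(a), np.asarray(mymask)
    out = np.empty(a.shape[1], dtype=a.dtype)
    for i in range(a.shape[1]):
        out[i] = f.reduce(a[mymask[:, i], i])  # skip masked-out rows
    return out

c = np.zeros(4)
call_with_where(np.add, [1, 2, 3, 4], 10, out=c,
                where=np.array([True, False, True, False]))
# c == [11., 0., 13., 0.]: unselected slots keep their previous contents

a = np.array([[1, 2], [3, 4], [5, 6]])
mymask = np.array([[True, True], [False, True], [True, False]])
reduce_with_where(np.add, a, mymask)
# column 0 reduces rows 0 and 2; column 1 reduces rows 0 and 1
```

Note how, when ``out`` is freshly allocated with ``np.empty``, the entries outside the mask really are uninitialized, exactly as the text warns.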
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
On Wed, Jul 6, 2011 at 11:38 AM, Christopher Barker chris.bar...@noaa.gov wrote: Christopher Jordan-Squire wrote: If we follow those rules for IGNORE for all computations, we sometimes get some weird output. For example: [ [1, 2], [3, 4] ] * [ IGNORE, 7] = [ 15, 31 ]. (Where * is matrix multiply and not * with broadcasting.) Or should that sort of operation throw an error? That should throw an error -- matrix computation is heavily influenced by the shape and size of matrices, so I think IGNOREs really don't make sense there. If the IGNOREs don't make sense in basic numpy computations then I'm kinda confused why they'd be included at the numpy core level. Nathaniel Smith wrote: It's exactly this transparency that worries Matthew and me -- we feel that the alterNEP preserves it, and the NEP attempts to erase it. In the NEP, there are two totally different underlying data structures, but this difference is blurred at the Python level. The idea is that you shouldn't have to think about which you have, but if you work with C/Fortran, then of course you do have to be constantly aware of the underlying implementation anyway. I don't think this bothers me -- I think it's analogous to things in numpy like Fortran order and non-contiguous arrays -- you can ignore all that when working in pure python when performance isn't critical, but you need a deeper understanding if you want to work with the data in C or Fortran or to tune performance in python. So as long as there is an API to query and control how things work, I like that it's hidden from simple python code. -Chris I'm similarly not too concerned about it. Performance seems finicky when you're dealing with missing data, since a lot of arrays will likely have to be copied over to other arrays containing only complete data before being handed over to BLAS. My primary concern is that the np.NA stuff 'just works'.
Especially since I've never run into use cases in statistics where the difference between IGNORE and NA mattered. -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
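[Editorial note: the [15, 31] in the example falls out of one particular reading of the IGNORE rules -- "skip the multiply, keep the other operand" for elementwise products -- whereas dropping the IGNOREd entries from the dot product entirely gives [14, 28]. A sketch, not from the original posts; None stands in for IGNORE:]

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
v = [None, 7]          # None plays the role of IGNORE here

# Reading 1: a * IGNORE "skips the multiply", leaving a; the sum skips nothing.
rule1 = [sum(a if x is None else a * x for a, x in zip(row, v)) for row in A]
# rule1 == [15, 31] -- the weird output from the example

# Reading 2: drop IGNOREd entries from the dot product entirely.
keep = [i for i, x in enumerate(v) if x is not None]
rule2 = A[:, keep].dot([v[i] for i in keep])
# rule2 == [14, 28]
```

That two defensible readings give different answers is exactly why raising an error for matrix operations is attractive.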
Re: [Numpy-discussion] HPC missing data - was: NA/Missing Data Conference Call Summary
On Wed, Jul 6, 2011 at 8:12 AM, Dag Sverre Seljebotn d.s.seljeb...@astro.uio.no wrote: snip I just commented on the prevent direct API access to the masking array part -- I'm hoping direct access by external code to the underlying implementation details will be allowed, at some point. I think direct or nearly direct access needs to be in right away, unless we're fairly sure that we will change low-level implementation details in the near future. I've added Python API and C API definitions for us to use to try and clear up this kind of potential confusion. -Mark

What I'm saying is that Mark's proposal is more flexible. Say for the sake of argument that I have two codes I need to interface with:

- Library A is written in Fortran and uses a separate (explicit) mask array for NA
- Library B runs on a GPU and uses a bit pattern for NA

Mark's proposal then comes closer to allowing me to wrap both codes using NumPy, since it supports both implementation mechanisms. Sure, it would need a separate NEP down the road to extend it, but it goes in the right direction for this to happen. As for NA vs. IGNORE, I still think 2 types is too little. One should allow for 255 different NA values, each with user-defined behaviour. Again, Mark's proposal then makes a good start on that, even if more work would be needed to make it happen. I.e., in my perfect world I'd do this to wrap library A (Cython-ish pseudo-code)::

  def call_lib_A():
      ...
      lib_A_function(arraybuf, maskbuf, ...)
      # behaviour could also be zero, invalid
      DOG_ATE_IT = np.NA(DOG_ATE_IT, value=42, behaviour=raise)
      missing_value_map = {0xAF: np.NA, 0x43: np.IGNORE, 0xF0: DOG_ATE_IT}
      result = np.PyArray_CreateArrayFromBufferWithMaskBuffer(
          arraybuf, maskbuf, missing_value_map, ...)
      return result

  def call_lib_B():
      lib_B_function(arraybuf, ...)
      missing_value_patterns = {0xCACA: np.NA}
      result = np.PyArray_CreateArrayFromBufferWithBitPattern(
          arraybuf, maskbuf, missing_value_patterns, ...)
      return result

Hope that is clearer.
Again, my intention is not to suggest even more work at the present stage, just to state some advantages of the general direction of Mark's proposal. Dag Sverre ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] towards a more productive missing values/masked arrays discussion...
It'd be easier to follow if you just made changes/suggestions on github to Mark's NEP directly. (You can check out Mark's missing data branch to get the NEP.) Then I'll be able to focus on the ways the suggestions differ from or complement the current NEP. -Chris Jordan-Squire On Wed, Jul 6, 2011 at 12:24 PM, Mark Wiebe mwwi...@gmail.com wrote: On Wed, Jul 6, 2011 at 2:20 PM, Nathaniel Smith n...@pobox.com wrote: So one thing that came up on the call yesterday is that there actually is a significant chunk of functionality that everyone seems to agree is useful, needed, and basically how it should work. This includes: -- the basic existence and semantics for NA values (however this is implemented) -- that there should exist a dtype/bit-pattern implementation for NAs (whatever other implementations there might also be) -- that ufuncs should take a where= argument -- that there should be a better way for ndarray subclasses like numpy.ma to override the arguments to ufuncs involving them -- maybe some other things I'm not thinking of The real controversy is around what role masking should play, both at the API and implementation level; there are lots of different arguments for different approaches, and it's not at all clear any current proposal will actually solve the problems we are facing (or even what those problems are). So rather than continue to go around in circles indefinitely on that, I'm going to write up some miniNEPs just focusing on the details of how the features we do agree on should work, so we can hopefully have a more technical discussion of *that*. That sounds alright to me. One thing I would like to ask is to please adopt the vocabulary we are discussing, using it exactly as defined so that people reading all the various ideas don't have to readjust when switching between documents.
Thanks, Mark Cheers, -- Nathaniel ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
On Wed, Jul 6, 2011 at 3:38 PM, Christopher Jordan-Squire cjord...@uw.edu wrote: On Wed, Jul 6, 2011 at 11:38 AM, Christopher Barker chris.bar...@noaa.gov wrote: Christopher Jordan-Squire wrote: If we follow those rules for IGNORE for all computations, we sometimes get some weird output. For example: [ [1, 2], [3, 4] ] * [ IGNORE, 7] = [ 15, 31 ]. (Where * is matrix multiply and not * with broadcasting.) Or should that sort of operation through an error? That should throw an error -- matrix computation is heavily influenced by the shape and size of matrices, so I think IGNORES really don't make sense there. If the IGNORES don't make sense in basic numpy computations then I'm kinda confused why they'd be included at the numpy core level. Nathaniel Smith wrote: It's exactly this transparency that worries Matthew and me -- we feel that the alterNEP preserves it, and the NEP attempts to erase it. In the NEP, there are two totally different underlying data structures, but this difference is blurred at the Python level. The idea is that you shouldn't have to think about which you have, but if you work with C/Fortran, then of course you do have to be constantly aware of the underlying implementation anyway. I don't think this bothers me -- I think it's analogous to things in numpy like Fortran order and non-contiguous arrays -- you can ignore all that when working in pure python when performance isn't critical, but you need a deeper understanding if you want to work with the data in C or Fortran or to tune performance in python. So as long as there is an API to query and control how things work, I like that it's hidden from simple python code. -Chris I'm similarly not too concerned about it. Performance seems finicky when you're dealing with missing data, since a lot of arrays will likely have to be copied over to other arrays containing only complete data before being handed over to BLAS. 
Unless you know the neutral value for the computation or you just want to do a forward_fill in time series, and you have to ask the user not to give you an immutable array with NAs if they don't want extra copies. Josef My primary concern is that the np.NA stuff 'just works'. Especially since I've never run into use cases in statistics where the difference between IGNORE and NA mattered. -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
On 07/06/2011 02:38 PM, Christopher Jordan-Squire wrote: On Wed, Jul 6, 2011 at 11:38 AM, Christopher Barker chris.bar...@noaa.gov mailto:chris.bar...@noaa.gov wrote: Christopher Jordan-Squire wrote: If we follow those rules for IGNORE for all computations, we sometimes get some weird output. For example: [ [1, 2], [3, 4] ] * [ IGNORE, 7] = [ 15, 31 ]. (Where * is matrix multiply and not * with broadcasting.) Or should that sort of operation through an error? That should throw an error -- matrix computation is heavily influenced by the shape and size of matrices, so I think IGNORES really don't make sense there. If the IGNORES don't make sense in basic numpy computations then I'm kinda confused why they'd be included at the numpy core level. Nathaniel Smith wrote: It's exactly this transparency that worries Matthew and me -- we feel that the alterNEP preserves it, and the NEP attempts to erase it. In the NEP, there are two totally different underlying data structures, but this difference is blurred at the Python level. The idea is that you shouldn't have to think about which you have, but if you work with C/Fortran, then of course you do have to be constantly aware of the underlying implementation anyway. I don't think this bothers me -- I think it's analogous to things in numpy like Fortran order and non-contiguous arrays -- you can ignore all that when working in pure python when performance isn't critical, but you need a deeper understanding if you want to work with the data in C or Fortran or to tune performance in python. So as long as there is an API to query and control how things work, I like that it's hidden from simple python code. -Chris I'm similarly not too concerned about it. Performance seems finicky when you're dealing with missing data, since a lot of arrays will likely have to be copied over to other arrays containing only complete data before being handed over to BLAS. My primary concern is that the np.NA stuff 'just works'. 
Especially since I've never run into use cases in statistics where the difference between IGNORE and NA mattered. Exactly! I have not been able to think of a real example where that difference matters, as the calculations are only on the 'valid' (i.e. non-missing and non-masked) values. Bruce ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
On Wed, Jul 6, 2011 at 1:08 PM, josef.p...@gmail.com wrote: On Wed, Jul 6, 2011 at 3:38 PM, Christopher Jordan-Squire cjord...@uw.edu wrote: On Wed, Jul 6, 2011 at 11:38 AM, Christopher Barker chris.bar...@noaa.gov wrote: Christopher Jordan-Squire wrote: If we follow those rules for IGNORE for all computations, we sometimes get some weird output. For example: [ [1, 2], [3, 4] ] * [ IGNORE, 7] = [ 15, 31 ]. (Where * is matrix multiply and not * with broadcasting.) Or should that sort of operation through an error? That should throw an error -- matrix computation is heavily influenced by the shape and size of matrices, so I think IGNORES really don't make sense there. If the IGNORES don't make sense in basic numpy computations then I'm kinda confused why they'd be included at the numpy core level. Nathaniel Smith wrote: It's exactly this transparency that worries Matthew and me -- we feel that the alterNEP preserves it, and the NEP attempts to erase it. In the NEP, there are two totally different underlying data structures, but this difference is blurred at the Python level. The idea is that you shouldn't have to think about which you have, but if you work with C/Fortran, then of course you do have to be constantly aware of the underlying implementation anyway. I don't think this bothers me -- I think it's analogous to things in numpy like Fortran order and non-contiguous arrays -- you can ignore all that when working in pure python when performance isn't critical, but you need a deeper understanding if you want to work with the data in C or Fortran or to tune performance in python. So as long as there is an API to query and control how things work, I like that it's hidden from simple python code. -Chris I'm similarly not too concerned about it. Performance seems finicky when you're dealing with missing data, since a lot of arrays will likely have to be copied over to other arrays containing only complete data before being handed over to BLAS. 
Unless you know the neutral value for the computation or you just want to do a forward_fill in time series, and you have to ask the user not to give you an immutable array with NAs if they don't want extra copies. Josef Mean value replacement, or more generally single scalar value replacement, is generally not a good idea. It biases your standard error estimates downward if you use mean replacement, and it will bias both the point estimates and the standard errors if you use anything other than mean replacement. The bias gets worse with more missing data, so it's worst in precisely the cases where you'd want to fill in the data the most. (Though I admit I'm not too familiar with time series, so maybe this doesn't apply. But it's true as a general principle in statistics.) I'm not sure why we'd want to make this use case easier. -Chris Jordan-Squire My primary concern is that the np.NA stuff 'just works'. Especially since I've never run into use cases in statistics where the difference between IGNORE and NA mattered. -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
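[Editorial note: the downward bias of mean replacement on standard errors is easy to see numerically. An illustrative sketch, not from the original posts; the 40% missingness rate is arbitrary:]

```python
import numpy as np

# Imputing the sample mean adds points with zero deviation, shrinking the
# estimated spread while inflating n -- so the standard error is understated.
rng = np.random.RandomState(0)
full = rng.normal(size=1000)
observed = full[:600]                    # pretend the other 40% went missing

imputed = np.concatenate([observed, np.full(400, observed.mean())])

se_observed = observed.std(ddof=1) / np.sqrt(observed.size)
se_imputed = imputed.std(ddof=1) / np.sqrt(imputed.size)
# se_imputed < se_observed, even though the 400 imputed points carry no
# information at all -- the apparent precision is spurious.
```

This is why the statistics literature prefers multiple imputation (drawing repeatedly from a distribution) over any single-value fill.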
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
On Jul 6, 2011, at 10:11 PM, Bruce Southey wrote: On 07/06/2011 02:38 PM, Christopher Jordan-Squire wrote: On Wed, Jul 6, 2011 at 11:38 AM, Christopher Barker chris.bar...@noaa.gov wrote: Christopher Jordan-Squire wrote: If we follow those rules for IGNORE for all computations, we sometimes get some weird output. For example: [ [1, 2], [3, 4] ] * [ IGNORE, 7] = [ 15, 31 ]. (Where * is matrix multiply and not * with broadcasting.) Or should that sort of operation through an error? That should throw an error -- matrix computation is heavily influenced by the shape and size of matrices, so I think IGNORES really don't make sense there. If the IGNORES don't make sense in basic numpy computations then I'm kinda confused why they'd be included at the numpy core level. Nathaniel Smith wrote: It's exactly this transparency that worries Matthew and me -- we feel that the alterNEP preserves it, and the NEP attempts to erase it. In the NEP, there are two totally different underlying data structures, but this difference is blurred at the Python level. The idea is that you shouldn't have to think about which you have, but if you work with C/Fortran, then of course you do have to be constantly aware of the underlying implementation anyway. I don't think this bothers me -- I think it's analogous to things in numpy like Fortran order and non-contiguous arrays -- you can ignore all that when working in pure python when performance isn't critical, but you need a deeper understanding if you want to work with the data in C or Fortran or to tune performance in python. So as long as there is an API to query and control how things work, I like that it's hidden from simple python code. -Chris I'm similarly not too concerned about it. Performance seems finicky when you're dealing with missing data, since a lot of arrays will likely have to be copied over to other arrays containing only complete data before being handed over to BLAS. My primary concern is that the np.NA stuff 'just works'. 
Especially since I've never run into use cases in statistics where the difference between IGNORE and NA mattered. Exactly! I have not been able to think of a real example where that difference matters, as the calculations are only on the 'valid' (i.e. non-missing and non-masked) values. In practice, they could be treated the same way (i.e., skipped). However, they are conceptually different, and one may wish to keep this difference of information around (between NAs you didn't have and IGNOREs you just dropped temporarily). ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
On Wed, Jul 6, 2011 at 4:22 PM, Christopher Jordan-Squire cjord...@uw.edu wrote: On Wed, Jul 6, 2011 at 1:08 PM, josef.p...@gmail.com wrote: On Wed, Jul 6, 2011 at 3:38 PM, Christopher Jordan-Squire cjord...@uw.edu wrote: On Wed, Jul 6, 2011 at 11:38 AM, Christopher Barker chris.bar...@noaa.gov wrote: Christopher Jordan-Squire wrote: If we follow those rules for IGNORE for all computations, we sometimes get some weird output. For example: [ [1, 2], [3, 4] ] * [ IGNORE, 7] = [ 15, 31 ]. (Where * is matrix multiply and not * with broadcasting.) Or should that sort of operation through an error? That should throw an error -- matrix computation is heavily influenced by the shape and size of matrices, so I think IGNORES really don't make sense there. If the IGNORES don't make sense in basic numpy computations then I'm kinda confused why they'd be included at the numpy core level. Nathaniel Smith wrote: It's exactly this transparency that worries Matthew and me -- we feel that the alterNEP preserves it, and the NEP attempts to erase it. In the NEP, there are two totally different underlying data structures, but this difference is blurred at the Python level. The idea is that you shouldn't have to think about which you have, but if you work with C/Fortran, then of course you do have to be constantly aware of the underlying implementation anyway. I don't think this bothers me -- I think it's analogous to things in numpy like Fortran order and non-contiguous arrays -- you can ignore all that when working in pure python when performance isn't critical, but you need a deeper understanding if you want to work with the data in C or Fortran or to tune performance in python. So as long as there is an API to query and control how things work, I like that it's hidden from simple python code. -Chris I'm similarly not too concerned about it. 
Performance seems finicky when you're dealing with missing data, since a lot of arrays will likely have to be copied over to other arrays containing only complete data before being handed over to BLAS. Unless you know the neutral value for the computation or you just want to do a forward_fill in time series, and you have to ask the user not to give you an unmutable array with NAs if they don't want extra copies. Josef Mean value replacement, or more generally single scalar value replacement, is generally not a good idea. It biases downward your standard error estimates if you use mean replacement, and it will bias both if you use anything other than mean replacement. The bias is gets worse with more missing data. So it's worst in the precisely the cases where you'd want to fill in the data the most. (Though I admit I'm not too familiar with time series, so maybe this doesn't apply. But it's true as a general principle in statistics.) I'm not sure why we'd want to make this use case easier. We just discussed a use case for pandas on the statsmodels mailing list, minute data of stock quotes (prices), if the quote is NA then fill it with the last price quote. If it would be necessary for memory usage and performance, this can be handled efficiently and with minimal copying. If you want to fill in a missing value without messing up any result statistics, then there is a large literature in statistics on imputations, repeatedly assigning values to a NA from an underlying distribution. scipy/statsmodels doesn't have anything like this (yet) but R and the others have it available, and it looks more popular in bio-statistics. (But similar to what Dag said, for statistical analysis it will be necessary to keep case specific masks and data arrays around. I haven't actually written any missing values algorithm yet, so I'm quite again.) Josef -Chris Jordan-Squire My primary concern is that the np.NA stuff 'just works'. 
Especially since I've never run into use cases in statistics where the difference between IGNORE and NA mattered. -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
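[Editorial note: the last-quote-carried-forward fill Josef describes can be done without a Python-level loop per element, using NaN as the missing marker. A sketch; forward_fill is an illustrative name, not an existing numpy function:]

```python
import numpy as np

def forward_fill(quotes):
    """Replace each NaN with the most recent non-NaN value before it."""
    quotes = np.asarray(quotes, dtype=float)
    idx = np.where(np.isnan(quotes), 0, np.arange(len(quotes)))
    np.maximum.accumulate(idx, out=idx)   # index of the last valid quote
    return quotes[idx]

forward_fill([10.0, np.nan, np.nan, 10.5, np.nan, 11.0])
# -> [10. , 10. , 10. , 10.5, 10.5, 11. ]
```

Note that a leading NaN has no earlier quote to carry forward, so it stays NaN; and because the fill allocates a new array, it sidesteps the immutable-input problem Josef mentions at the cost of one copy.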
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
On Wed, Jul 6, 2011 at 4:38 PM, josef.p...@gmail.com wrote: On Wed, Jul 6, 2011 at 4:22 PM, Christopher Jordan-Squire cjord...@uw.edu wrote: On Wed, Jul 6, 2011 at 1:08 PM, josef.p...@gmail.com wrote: On Wed, Jul 6, 2011 at 3:38 PM, Christopher Jordan-Squire cjord...@uw.edu wrote: On Wed, Jul 6, 2011 at 11:38 AM, Christopher Barker chris.bar...@noaa.gov wrote: Christopher Jordan-Squire wrote: If we follow those rules for IGNORE for all computations, we sometimes get some weird output. For example: [ [1, 2], [3, 4] ] * [ IGNORE, 7] = [ 15, 31 ]. (Where * is matrix multiply and not * with broadcasting.) Or should that sort of operation through an error? That should throw an error -- matrix computation is heavily influenced by the shape and size of matrices, so I think IGNORES really don't make sense there. If the IGNORES don't make sense in basic numpy computations then I'm kinda confused why they'd be included at the numpy core level. Nathaniel Smith wrote: It's exactly this transparency that worries Matthew and me -- we feel that the alterNEP preserves it, and the NEP attempts to erase it. In the NEP, there are two totally different underlying data structures, but this difference is blurred at the Python level. The idea is that you shouldn't have to think about which you have, but if you work with C/Fortran, then of course you do have to be constantly aware of the underlying implementation anyway. I don't think this bothers me -- I think it's analogous to things in numpy like Fortran order and non-contiguous arrays -- you can ignore all that when working in pure python when performance isn't critical, but you need a deeper understanding if you want to work with the data in C or Fortran or to tune performance in python. So as long as there is an API to query and control how things work, I like that it's hidden from simple python code. -Chris I'm similarly not too concerned about it. 
Performance seems finicky when you're dealing with missing data, since a lot of arrays will likely have to be copied over to other arrays containing only complete data before being handed over to BLAS. Unless you know the neutral value for the computation or you just want to do a forward_fill in time series, and you have to ask the user not to give you an unmutable array with NAs if they don't want extra copies. Josef Mean value replacement, or more generally single scalar value replacement, is generally not a good idea. It biases downward your standard error estimates if you use mean replacement, and it will bias both if you use anything other than mean replacement. The bias is gets worse with more missing data. So it's worst in the precisely the cases where you'd want to fill in the data the most. (Though I admit I'm not too familiar with time series, so maybe this doesn't apply. But it's true as a general principle in statistics.) I'm not sure why we'd want to make this use case easier. Another qualification on this (I cannot help it). I think this only applies if you use a prefabricated no-missing-values algorithm. If I write it myself, I can do the proper correction for the reduced number of observations. (similar to the case when we ignore correlated information and use statistics based on uncorrelated observations which also overestimate the amount of information we have available.) Josef We just discussed a use case for pandas on the statsmodels mailing list, minute data of stock quotes (prices), if the quote is NA then fill it with the last price quote. If it would be necessary for memory usage and performance, this can be handled efficiently and with minimal copying. If you want to fill in a missing value without messing up any result statistics, then there is a large literature in statistics on imputations, repeatedly assigning values to a NA from an underlying distribution. 
scipy/statsmodels doesn't have anything like this (yet) but R and the others have it available, and it looks more popular in bio-statistics. (But similar to what Dag said, for statistical analysis it will be necessary to keep case specific masks and data arrays around. I haven't actually written any missing values algorithm yet, so I'm quite agnostic.) Josef -Chris Jordan-Squire My primary concern is that the np.NA stuff 'just works'. Especially since I've never run into use cases in statistics where the difference between IGNORE and NA mattered. -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org
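The downward bias from mean replacement that Chris describes is easy to demonstrate with a small simulation (a sketch, not from the original thread; the variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
missing = rng.random(1000) < 0.3          # ~30% missing at random
observed = x[~missing]

# Mean imputation: fill the missing slots with the observed mean.
filled = x.copy()
filled[missing] = observed.mean()

# The imputed sample understates the spread: every filled value sits
# exactly at the mean, adding nothing to the variance numerator while
# inflating the denominator (n instead of n_observed).
assert filled.std(ddof=1) < observed.std(ddof=1)
```

This is exactly the "prefabricated no-missing-values algorithm" problem Josef raises: any routine that sees `filled` as 1000 complete observations will report standard errors that are too small.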
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
Christopher Barker wrote: Dag Sverre Seljebotn wrote: Here's an HPC perspective...: At least I feel that the transparency of NumPy is a huge part of its current success. Many more than me spend half their time in C/Fortran and half their time in Python. Absolutely -- and this point has been raised a couple times in the discussion, so I hope it is not forgotten. I tend to look at NumPy this way: Assuming you have some data in memory (possibly loaded by a C or Fortran library). (Almost) no matter how it is allocated, ordered, packed, aligned -- there's a way to find strides and dtypes to put a nice NumPy wrapper around it and use the memory from Python. And vice-versa -- assuming you have some data in numpy arrays, there's a way to process it with a C or Fortran library without copying the data. And this is where I am skeptical of the bit-pattern idea -- while one can expect C and Fortran and GPUs, and ??? to understand NaNs for floating point data, is there any support in compilers or hardware for special bit patterns for NA values in integers? I've never seen it in my (very limited) experience. Maybe having the mask option, too, will make that irrelevant, but I want to be clear about that kind of use case. -Chris Am I the only one that finds the idea of special values of things like int[1] having special meanings to be really ugly? [1] which already have defined behavior over their entire domain of bit patterns
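For what it's worth, there is indeed no hardware-level NA for integers; bitpattern NAs for ints are a pure software convention. R, for instance, reserves INT_MIN as its integer NA sentinel. A minimal sketch of that convention (the constant name is illustrative):

```python
import numpy as np

# R's convention: INT_MIN (0x80000000) is NA for 32-bit integers.
# Nothing in C or the hardware enforces this -- every operation has
# to check for the sentinel explicitly in software.
INT32_NA = np.int32(-2**31)

a = np.array([1, 2, -2**31, 4], dtype=np.int32)
mask = a == INT32_NA
assert mask.tolist() == [False, False, True, False]

# A "NA-aware" sum must mask the sentinel out by hand:
total = a[~mask].sum()
assert total == 7
```

This is the crux of Chris's concern: unlike float NaN, which compilers and BLAS propagate for free, every integer operation has to pay for the sentinel check in software.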
Re: [Numpy-discussion] miniNEP1: where= argument for ufuncs
On Wed, Jul 06, 2011 at 12:26:24PM -0700, Nathaniel Smith wrote: A mini-NEP for the where= argument to ufuncs I _love_ this proposal and it would probably be much more useful to me than the different masked array proposals, which are too focused on a specific usage pattern to answer all my needs. So a strong +1 on the miniNEP. G
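NumPy later did gain a `where=` argument on ufuncs with essentially the semantics the miniNEP proposes; a minimal sketch using the modern API:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([10.0, 20.0, 30.0, 40.0])
mask = np.array([True, False, True, False])

# Where mask is False the output element is left untouched, so an
# explicit out= array is needed to make those slots well-defined.
out = np.zeros_like(a)
np.add(a, b, out=out, where=mask)
# out is now [11., 0., 33., 0.]
```

The key design point is that `where=` only controls *which elements the ufunc computes*; it says nothing about masked storage, which is why it composes cleanly with either missing-data proposal.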
Re: [Numpy-discussion] histogram2d error with empty inputs
On Mon, Jun 27, 2011 at 9:38 PM, Benjamin Root ben.r...@ou.edu wrote: I found another empty input edge case. Somewhat recently, we fixed an issue with np.histogram() and empty inputs (so long as the bins are somehow known). np.histogram([], bins=4) (array([0, 0, 0, 0]), array([ 0. , 0.25, 0.5 , 0.75, 1. ])) However, histogram2d needs the same treatment. np.histogram2d([], [], bins=4) (array([ 0., 0.]), array([ 0. , 0.25, 0.5 , 0.75, 1. ]), array([ 0. , 0.25, 0.5 , 0.75, 1. ])) The first element in the return tuple needs to be 4x4 (in this case). Could you open a ticket for this? Ralf
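The behavior Ben is asking for (and which later NumPy releases provide) can be checked directly; the edge values assume NumPy's default range of [0, 1] for empty input:

```python
import numpy as np

# 1-D case: empty input with a known bin count already works.
counts, edges = np.histogram([], bins=4)
# counts: four zeros; edges: 0.0, 0.25, 0.5, 0.75, 1.0

# 2-D case: the counts array should likewise be a bins x bins block
# of zeros, consistent with the 4-bin edges along each axis.
H, xedges, yedges = np.histogram2d([], [], bins=4)
assert H.shape == (4, 4)
assert H.sum() == 0
```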
Re: [Numpy-discussion] HPC missing data - was: NA/Missing Data Conference Call Summary
On Wed, Jul 06, 2011 at 08:39:37PM +0200, Dag Sverre Seljebotn wrote: As for myself, I'll admit that I'll almost certainly continue with explicit masking without using any of the proposed NEPs -- I have to be extremely aware of the masks in the statistical methods I use. My gut feeling is that I am in the same case. G
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
On Wed, Jul 6, 2011 at 2:53 PM, Neal Becker ndbeck...@gmail.com wrote: Christopher Barker wrote: Dag Sverre Seljebotn wrote: Here's an HPC perspective...: At least I feel that the transparency of NumPy is a huge part of its current success. Many more than me spend half their time in C/Fortran and half their time in Python. Absolutely -- and this point has been raised a couple times in the discussion, so I hope it is not forgotten. I tend to look at NumPy this way: Assuming you have some data in memory (possibly loaded by a C or Fortran library). (Almost) no matter how it is allocated, ordered, packed, aligned -- there's a way to find strides and dtypes to put a nice NumPy wrapper around it and use the memory from Python. And vice-versa -- assuming you have some data in numpy arrays, there's a way to process it with a C or Fortran library without copying the data. And this is where I am skeptical of the bit-pattern idea -- while one can expect C and Fortran and GPUs, and ??? to understand NaNs for floating point data, is there any support in compilers or hardware for special bit patterns for NA values in integers? I've never seen it in my (very limited) experience. Maybe having the mask option, too, will make that irrelevant, but I want to be clear about that kind of use case. -Chris Am I the only one that finds the idea of special values of things like int[1] having special meanings to be really ugly? [1] which already have defined behavior over their entire domain of bit patterns Umm, no, I find it ugly also. On the other hand, it is a useful artifact left to us by the ancients and solves a lot of problems. So in the absence of anything more standardized... Chuck
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
On 07/06/2011 03:37 PM, Pierre GM wrote: On Jul 6, 2011, at 10:11 PM, Bruce Southey wrote: On 07/06/2011 02:38 PM, Christopher Jordan-Squire wrote: On Wed, Jul 6, 2011 at 11:38 AM, Christopher Barker chris.bar...@noaa.gov wrote: Christopher Jordan-Squire wrote: If we follow those rules for IGNORE for all computations, we sometimes get some weird output. For example: [ [1, 2], [3, 4] ] * [ IGNORE, 7] = [ 15, 31 ]. (Where * is matrix multiply and not * with broadcasting.) Or should that sort of operation throw an error? That should throw an error -- matrix computation is heavily influenced by the shape and size of matrices, so I think IGNOREs really don't make sense there. If the IGNOREs don't make sense in basic numpy computations then I'm kinda confused why they'd be included at the numpy core level. Nathaniel Smith wrote: It's exactly this transparency that worries Matthew and me -- we feel that the alterNEP preserves it, and the NEP attempts to erase it. In the NEP, there are two totally different underlying data structures, but this difference is blurred at the Python level. The idea is that you shouldn't have to think about which you have, but if you work with C/Fortran, then of course you do have to be constantly aware of the underlying implementation anyway. I don't think this bothers me -- I think it's analogous to things in numpy like Fortran order and non-contiguous arrays -- you can ignore all that when working in pure python when performance isn't critical, but you need a deeper understanding if you want to work with the data in C or Fortran or to tune performance in python. So as long as there is an API to query and control how things work, I like that it's hidden from simple python code. -Chris I'm similarly not too concerned about it. Performance seems finicky when you're dealing with missing data, since a lot of arrays will likely have to be copied over to other arrays containing only complete data before being handed over to BLAS.
My primary concern is that the np.NA stuff 'just works'. Especially since I've never run into use cases in statistics where the difference between IGNORE and NA mattered. Exactly! I have not been able to think of a real example where that difference matters as the calculations are only on the 'valid' (ie non-missing and non-masked) values. In practice, they could be treated the same way (ie, skipped). However, they are conceptually different and one may wish to keep this difference of information around (between NAs you didn't have and IGNOREs you just dropped temporarily). I have yet to see these as *conceptually different* in any of the arguments given. Separate NAs or IGNOREs or any number of missing value codes just requires us to avoid 'unmasking' those missing value codes in your array as, I presume like masked arrays, you need some placeholder values. Bruce
Re: [Numpy-discussion] histogram2d error with empty inputs
On Wednesday, July 6, 2011, Ralf Gommers ralf.gomm...@googlemail.com wrote: On Mon, Jun 27, 2011 at 9:38 PM, Benjamin Root ben.r...@ou.edu wrote: I found another empty input edge case. Somewhat recently, we fixed an issue with np.histogram() and empty inputs (so long as the bins are somehow known). np.histogram([], bins=4) (array([0, 0, 0, 0]), array([ 0. , 0.25, 0.5 , 0.75, 1. ])) However, histogram2d needs the same treatment. np.histogram2d([], [], bins=4) (array([ 0., 0.]), array([ 0. , 0.25, 0.5 , 0.75, 1. ]), array([ 0. , 0.25, 0.5 , 0.75, 1. ])) The first element in the return tuple needs to be 4x4 (in this case). Could you open a ticket for this? Ralf Not a problem. I managed to partly trace the problem down into histogramdd, but the function is a little confusing. Ben Root
Re: [Numpy-discussion] using the same vocabulary for missing value ideas
On Wednesday, July 6, 2011, Dag Sverre Seljebotn d.s.seljeb...@astro.uio.no wrote: On 07/06/2011 08:25 PM, Christopher Barker wrote: Mark Wiebe wrote: 1) NA vs IGNORE and bitpattern vs mask are completely independent. Any combination of NA as bitpattern, NA as mask, IGNORE as bitpattern, and IGNORE as mask are reasonable. Is this really true? If you use a bitpattern for IGNORE, haven't you just lost the ability to get the original value back if you want to stop ignoring it? Maybe that's not inherent to what an IGNORE means, but it seems pretty key to me. There's the question of how reductions treat the value. IIUC, IGNORE as bitpattern would imply that reductions treat the value as 0, which is a question orthogonal to whether the value can possibly be unmasked or not. Dag Sverre Just because we are trying to be exact here, the reductions would treat IGNORE as the operation's identity. Therefore, for addition, it would be treated like 0, but for multiplication, it is treated like a 1. Ben Root
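The identity rule Ben states is how numpy.ma already behaves, so it can be sketched with a masked array standing in for the proposed IGNORE:

```python
import numpy as np

a = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False])

# The masked (IGNOREd) element acts as the operation's identity:
# 0 for sum, 1 for prod -- in other words, it is simply skipped.
assert a.sum() == 4.0    # 1 + 3
assert a.prod() == 3.0   # 1 * 3
```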
[Numpy-discussion] Call for papers: AMS Jan 22-26, 2012
I would like to call to the attention of the NumPy community the following call for papers: Second Symposium on Advances in Modeling and Analysis Using Python, 22–26 January 2012, New Orleans, Louisiana The Second Symposium on Advances in Modeling and Analysis Using Python, sponsored by the American Meteorological Society, will be held 22–26 January 2012, as part of the 92nd AMS Annual Meeting in New Orleans, Louisiana. Preliminary programs, registration, hotel, and general information will be posted on the AMS Web site (http://www.ametsoc.org/meet/annual/) in late-September 2011. The application of object-oriented programming and other advances in computer science to the atmospheric and oceanic sciences has in turn led to advances in modeling and analysis tools and methods. This symposium focuses on applications of the open-source language Python and seeks to disseminate advances using Python in the atmospheric and oceanic sciences, as well as grow the earth sciences Python community. Papers describing Python work in applications, methodologies, and package development in all areas of meteorology, climatology, oceanography, and space sciences are welcome, including (but not limited to): modeling, time series analysis, air quality, satellite data processing, in-situ data analysis, GIS, Python as a software integration platform, visualization, gridding, model intercomparison, and very large (petabyte) dataset manipulation and access. The $95 abstract fee includes the submission of your abstract, the posting of your extended abstract, and the uploading and recording of your presentation which will be archived on the AMS Web site. Please submit your abstract electronically via the Web by 1 August 2011 (refer to the AMS Web page at http://www.ametsoc.org/meet/online_submit.html). An abstract fee of $95 (payable by credit card or purchase order) is charged at the time of submission (refundable only if abstract is not accepted).
Authors of accepted presentations will be notified via e-mail by late-September 2011. All extended abstracts are to be submitted electronically and will be available on-line via the Web. Instructions for formatting extended abstracts will be posted on the AMS Web site. Manuscripts (up to 3MB) must be submitted electronically by 22 February 2012. All abstracts, extended abstracts and presentations will be available on the AMS Web site at no cost. For additional information, please contact the program chairperson, Johnny Lin, Physics Department, North Park University (j...@northpark.edu). (5/11)
Re: [Numpy-discussion] using the same vocabulary for missing value ideas
On Wed, Jul 6, 2011 at 5:03 PM, Benjamin Root ben.r...@ou.edu wrote: On Wednesday, July 6, 2011, Dag Sverre Seljebotn d.s.seljeb...@astro.uio.no wrote: On 07/06/2011 08:25 PM, Christopher Barker wrote: Mark Wiebe wrote: 1) NA vs IGNORE and bitpattern vs mask are completely independent. Any combination of NA as bitpattern, NA as mask, IGNORE as bitpattern, and IGNORE as mask are reasonable. Is this really true? If you use a bitpattern for IGNORE, haven't you just lost the ability to get the original value back if you want to stop ignoring it? Maybe that's not inherent to what an IGNORE means, but it seems pretty key to me. There's the question of how reductions treat the value. IIUC, IGNORE as bitpattern would imply that reductions treat the value as 0, which is a question orthogonal to whether the value can possibly be unmasked or not. Dag Sverre Just because we are trying to be exact here, the reductions would treat IGNORE as the operation's identity. Therefore, for addition, it would be treated like 0, but for multiplication, it is treated like a 1. Ben Root Yes. But, as discussed on another thread, that can lead to unexpected results when it's propagated through several operations.
[Numpy-discussion] using the same vocabulary for missing value ideas
On Wednesday, July 6, 2011, Christopher Jordan-Squire cjord...@uw.edu wrote: On Wed, Jul 6, 2011 at 5:03 PM, Benjamin Root ben.r...@ou.edu wrote: On Wednesday, July 6, 2011, Dag Sverre Seljebotn d.s.seljeb...@astro.uio.no wrote: On 07/06/2011 08:25 PM, Christopher Barker wrote: Mark Wiebe wrote: 1) NA vs IGNORE and bitpattern vs mask are completely independent. Any combination of NA as bitpattern, NA as mask, IGNORE as bitpattern, and IGNORE as mask are reasonable. Is this really true? If you use a bitpattern for IGNORE, haven't you just lost the ability to get the original value back if you want to stop ignoring it? Maybe that's not inherent to what an IGNORE means, but it seems pretty key to me. There's the question of how reductions treat the value. IIUC, IGNORE as bitpattern would imply that reductions treat the value as 0, which is a question orthogonal to whether the value can possibly be unmasked or not. Dag Sverre Just because we are trying to be exact here, the reductions would treat IGNORE as the operation's identity. Therefore, for addition, it would be treated like 0, but for multiplication, it is treated like a 1. Ben Root Yes. But, as discussed on another thread, that can lead to unexpected results when it's propagated through several operations. If you are talking about means, for example, then the count is adjusted before dividing. It is like they never existed. Same with standard deviation. Of course, there are issues with having fewer samples, but that isn't a problem caused by the underlying concept of skipping elements. As long as the underlying mathematical support for array math is still valid, I am not certain what the issue is. Matrix math on the other hand... Ben Root
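The adjusted-count behavior Ben describes ("like they never existed") is what numpy.ma does today; a quick sketch with np.ma standing in for IGNORE:

```python
import numpy as np

a = np.ma.array([2.0, 4.0, 999.0], mask=[False, False, True])

# mean divides by the number of unmasked elements (2), not by 3,
# so the masked value contributes to neither numerator nor count.
assert a.count() == 2
assert a.mean() == 3.0
```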
Re: [Numpy-discussion] using the same vocabulary for missing value ideas
On Wed, Jul 6, 2011 at 5:38 PM, Benjamin Root ben.r...@ou.edu wrote: On Wednesday, July 6, 2011, Christopher Jordan-Squire cjord...@uw.edu wrote: On Wed, Jul 6, 2011 at 5:03 PM, Benjamin Root ben.r...@ou.edu wrote: On Wednesday, July 6, 2011, Dag Sverre Seljebotn d.s.seljeb...@astro.uio.no wrote: On 07/06/2011 08:25 PM, Christopher Barker wrote: Mark Wiebe wrote: 1) NA vs IGNORE and bitpattern vs mask are completely independent. Any combination of NA as bitpattern, NA as mask, IGNORE as bitpattern, and IGNORE as mask are reasonable. Is this really true? If you use a bitpattern for IGNORE, haven't you just lost the ability to get the original value back if you want to stop ignoring it? Maybe that's not inherent to what an IGNORE means, but it seems pretty key to me. There's the question of how reductions treat the value. IIUC, IGNORE as bitpattern would imply that reductions treat the value as 0, which is a question orthogonal to whether the value can possibly be unmasked or not. Dag Sverre Just because we are trying to be exact here, the reductions would treat IGNORE as the operation's identity. Therefore, for addition, it would be treated like 0, but for multiplication, it is treated like a 1. Ben Root Yes. But, as discussed on another thread, that can lead to unexpected results when it's propagated through several operations. If you are talking about means, for example, then the count is adjusted before dividing. It is like they never existed. Same with standard deviation. Of course, there are issues with having fewer samples, but that isn't a problem caused by the underlying concept of skipping elements. As long as the underlying mathematical support for array math is still valid, I am not certain what the issue is. Matrix math on the other hand... Ah, I see. I misunderstood the class of operations you were discussing.
-Chris Jordan-Squire Ben Root
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
On Wed, Jul 6, 2011 at 3:47 PM, josef.p...@gmail.com wrote: On Wed, Jul 6, 2011 at 4:38 PM, josef.p...@gmail.com wrote: On Wed, Jul 6, 2011 at 4:22 PM, Christopher Jordan-Squire cjord...@uw.edu wrote: On Wed, Jul 6, 2011 at 1:08 PM, josef.p...@gmail.com wrote: On Wed, Jul 6, 2011 at 3:38 PM, Christopher Jordan-Squire cjord...@uw.edu wrote: On Wed, Jul 6, 2011 at 11:38 AM, Christopher Barker chris.bar...@noaa.gov wrote: Christopher Jordan-Squire wrote: If we follow those rules for IGNORE for all computations, we sometimes get some weird output. For example: [ [1, 2], [3, 4] ] * [ IGNORE, 7] = [ 15, 31 ]. (Where * is matrix multiply and not * with broadcasting.) Or should that sort of operation throw an error? That should throw an error -- matrix computation is heavily influenced by the shape and size of matrices, so I think IGNOREs really don't make sense there. If the IGNOREs don't make sense in basic numpy computations then I'm kinda confused why they'd be included at the numpy core level. Nathaniel Smith wrote: It's exactly this transparency that worries Matthew and me -- we feel that the alterNEP preserves it, and the NEP attempts to erase it. In the NEP, there are two totally different underlying data structures, but this difference is blurred at the Python level. The idea is that you shouldn't have to think about which you have, but if you work with C/Fortran, then of course you do have to be constantly aware of the underlying implementation anyway. I don't think this bothers me -- I think it's analogous to things in numpy like Fortran order and non-contiguous arrays -- you can ignore all that when working in pure python when performance isn't critical, but you need a deeper understanding if you want to work with the data in C or Fortran or to tune performance in python. So as long as there is an API to query and control how things work, I like that it's hidden from simple python code. -Chris I'm similarly not too concerned about it.
Performance seems finicky when you're dealing with missing data, since a lot of arrays will likely have to be copied over to other arrays containing only complete data before being handed over to BLAS. Unless you know the neutral value for the computation or you just want to do a forward_fill in time series, and you have to ask the user not to give you an immutable array with NAs if they don't want extra copies. Josef Mean value replacement, or more generally single scalar value replacement, is generally not a good idea. It biases downward your standard error estimates if you use mean replacement, and it will bias both if you use anything other than mean replacement. The bias gets worse with more missing data. So it's worst in precisely the cases where you'd want to fill in the data the most. (Though I admit I'm not too familiar with time series, so maybe this doesn't apply. But it's true as a general principle in statistics.) I'm not sure why we'd want to make this use case easier. Another qualification on this (I cannot help it). I think this only applies if you use a prefabricated no-missing-values algorithm. If I write it myself, I can do the proper correction for the reduced number of observations. (similar to the case when we ignore correlated information and use statistics based on uncorrelated observations which also overestimate the amount of information we have available.) Can you do that sort of technique with longitudinal (panel) data? I'm honestly curious because I haven't looked into such corrections before. I haven't been able to find a reference after a few quick google searches. I don't suppose you know one off the top of your head? And you're right about the last measurement carried forward. I was just thinking about filling in all missing values with the same value. -Chris Jordan-Squire PS--Thanks for mentioning the statsmodels discussion.
I'd been keeping track of that on a different email account, and I hadn't realized it wasn't forwarding those messages correctly. Josef We just discussed a use case for pandas on the statsmodels mailing list, minute data of stock quotes (prices): if the quote is NA then fill it with the last price quote. If it would be necessary for memory usage and performance, this can be handled efficiently and with minimal copying. If you want to fill in a missing value without messing up any result statistics, then there is a large literature in statistics on imputations, repeatedly assigning values to an NA from an underlying distribution. scipy/statsmodels doesn't have anything like this (yet) but R and the others have it available, and it looks more popular in bio-statistics. (But similar to what Dag said, for statistical
Re: [Numpy-discussion] using the same vocabulary for missing value ideas
Hi, On Wed, Jul 6, 2011 at 7:10 PM, Christopher Jordan-Squire cjord...@uw.edu wrote: On Wed, Jul 6, 2011 at 10:44 AM, Matthew Brett matthew.br...@gmail.com wrote: Hi, On Wed, Jul 6, 2011 at 6:11 PM, Benjamin Root ben.r...@ou.edu wrote: On Wed, Jul 6, 2011 at 12:01 PM, Matthew Brett matthew.br...@gmail.com wrote: Hi, On Wed, Jul 6, 2011 at 5:48 PM, Peter numpy-discuss...@maubp.freeserve.co.uk wrote: On Wed, Jul 6, 2011 at 5:38 PM, Matthew Brett matthew.br...@gmail.com wrote: Hi, On Wed, Jul 6, 2011 at 4:40 PM, Mark Wiebe mwwi...@gmail.com wrote: It appears to me that one of the biggest reasons some of us have been talking past each other in the discussions is that different people have different definitions for the terms being used. Until this is thoroughly cleared up, I feel the design process is tilting at windmills. In the interests of clarity in our discussions, here is a starting point which is consistent with the NEP. These definitions have been added in a glossary within the NEP. If there are any ideas for amendments to these definitions that we can agree on, I will update the NEP with those amendments. Also, if I missed any important terms which need to be added, please propose definitions for them. NA (Not Available) A placeholder for a value which is unknown to computations. That value may be temporarily hidden with a mask, may have been lost due to hard drive corruption, or gone for any number of reasons. This is the same as NA in the R project. Really? Can one implement NA with a mask in R? I thought an NA was always bitpattern in R? I don't think that was what Mark was saying, see this bit later in this email: I think it would make a difference if there was an implementation that had conflated masking with bitpatterns in terms of API. I don't think R is an example. Of course R is not an example of that. Nothing is. This is merely conceptual. Separate NA from np.NA in Mark's NEP, and you will see his point.
Consider it the logical intersection of NA in Mark's NEP and the aNEP. I am trying to work out what you feel the points of discussion are. There's surely no point in continuing to debate things we agree on. I don't think anyone disputes (or has ever disputed) that: there can be missing data implemented with bitpatterns; there can be missing data implemented with masks; missing data can have propagate semantics; missing data can have ignore semantics. The implementation does not in itself constrain the semantics. So, to be clear, is your concern that you want to be able to tell the difference between whether an np.NA comes from the bit pattern or the mask in its implementation? But why would you have both the parameterized dtype and the mask implementation at the same time? They implement the same abstraction. In Mark's mind they implement the same abstraction. In my mind, and Nathaniel's, and I think, Pierre's, and others, they are not the same abstraction. You can treat them the same if you want, even by default, but they are two different ideas, with two different implementations. A bitpattern NA value is absolutely completely missing. It's a value that says 'missing' A masked-out value is temporarily or provisionally missing. When you take away the mask, the previous value is there. These are two different things. They are each very easy to explain. Is your desire that the np.NA's are implemented solely through bit patterns and np.IGNORE is implemented solely through masks? So that you can think of the masks as being IGNORE flags? What if you want multiple types of IGNORE? (To ignore certain values because they're outliers, others because the data wouldn't make sense, and others because you're just focusing on a particular subgroup, for instance.) Forgive me, I have been at dinner and had several glasses of wine. So, what I'm about to say might be dumber than usual.
With that rider: I agree with Mark, we should avoid np.IGNORE because it conflates ignore semantics with the masking implementation. The idea of several different missings seems to me orthogonal. There can be different missings with bitpatterns and different missings with masks. My fundamental point, that I accept I am not getting across with much success, is the following: In general, as Dag has pointed out elsewhere, numpy is close to the metal - you can almost feel the C array underneath the python numpy object. This is its strength. It doesn't try and hide the C array from you, it gives you the whole machinery, open kimono. I can see an open kimono way of dealing with missing values. There's the bitpattern way. If I do a[3] = np.NA, what I mean is 'store an NA in the array memory'. Exactly the same as when I do a[3] = 2, I mean 'store a 2 in the array memory'. It's obvious and
Re: [Numpy-discussion] miniNEP1: where= argument for ufuncs
Sorry, but I didn't find a way of inserting inline comments in the gist. Nathaniel Smith writes: [...] Is there any less stupid-looking name than ``where1=`` and ``where2=`` for the ``.outer`` operation? (For that matter, can ``.outer`` be applied to more than 2 arrays? The docs say it can't, but it's perfectly well-defined for arbitrary number of arrays too, so maybe we want an interface that allows for 3-way, 4-way etc. ``.outer`` operations in the future?) Well, if outer can indeed be defined for an arbitrary number of arrays (and if it's going to be sometime in the future), I'd say the simplest is to use an array: .outer(a, b, ..., where = [my_where1, my_where2, ...]) Lluis -- And it's much the same thing with knowledge, for whenever you learn something new, the whole world becomes that much richer. -- The Princess of Pure Reason, as told by Norton Juster in The Phantom Tollbooth
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
On Wed, Jul 6, 2011 at 7:14 PM, Christopher Jordan-Squire cjord...@uw.edu wrote: On Wed, Jul 6, 2011 at 3:47 PM, josef.p...@gmail.com wrote: On Wed, Jul 6, 2011 at 4:38 PM, josef.p...@gmail.com wrote: On Wed, Jul 6, 2011 at 4:22 PM, Christopher Jordan-Squire cjord...@uw.edu wrote: On Wed, Jul 6, 2011 at 1:08 PM, josef.p...@gmail.com wrote: On Wed, Jul 6, 2011 at 3:38 PM, Christopher Jordan-Squire cjord...@uw.edu wrote: On Wed, Jul 6, 2011 at 11:38 AM, Christopher Barker chris.bar...@noaa.gov wrote: Christopher Jordan-Squire wrote: If we follow those rules for IGNORE for all computations, we sometimes get some weird output. For example: [ [1, 2], [3, 4] ] * [ IGNORE, 7] = [ 15, 31 ]. (Where * is matrix multiply and not * with broadcasting.) Or should that sort of operation throw an error? That should throw an error -- matrix computation is heavily influenced by the shape and size of matrices, so I think IGNOREs really don't make sense there. If the IGNOREs don't make sense in basic numpy computations then I'm kinda confused why they'd be included at the numpy core level. Nathaniel Smith wrote: It's exactly this transparency that worries Matthew and me -- we feel that the alterNEP preserves it, and the NEP attempts to erase it. In the NEP, there are two totally different underlying data structures, but this difference is blurred at the Python level. The idea is that you shouldn't have to think about which you have, but if you work with C/Fortran, then of course you do have to be constantly aware of the underlying implementation anyway. I don't think this bothers me -- I think it's analogous to things in numpy like Fortran order and non-contiguous arrays -- you can ignore all that when working in pure python when performance isn't critical, but you need a deeper understanding if you want to work with the data in C or Fortran or to tune performance in python.
So as long as there is an API to query and control how things work, I like that it's hidden from simple python code. -Chris I'm similarly not too concerned about it. Performance seems finicky when you're dealing with missing data, since a lot of arrays will likely have to be copied over to other arrays containing only complete data before being handed over to BLAS. Unless you know the neutral value for the computation or you just want to do a forward_fill in time series, and you have to ask the user not to give you an immutable array with NAs if they don't want extra copies. Josef Mean value replacement, or more generally single scalar value replacement, is generally not a good idea. It biases downward your standard error estimates if you use mean replacement, and it will bias both if you use anything other than mean replacement. The bias gets worse with more missing data. So it's worst in precisely the cases where you'd want to fill in the data the most. (Though I admit I'm not too familiar with time series, so maybe this doesn't apply. But it's true as a general principle in statistics.) I'm not sure why we'd want to make this use case easier. Another qualification on this (I cannot help it). I think this only applies if you use a prefabricated no-missing-values algorithm. If I write it myself, I can do the proper correction for the reduced number of observations. (similar to the case when we ignore correlated information and use statistics based on uncorrelated observations, which also overestimate the amount of information we have available.) Can you do that sort of technique with longitudinal (panel) data? I'm honestly curious because I haven't looked into such corrections before. I haven't been able to find a reference after a few quick google searches. I don't suppose you know one off the top of your head?
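The downward bias from mean replacement is easy to demonstrate on synthetic data (everything below is made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
x_obs = x.copy()
x_obs[:200] = np.nan          # pretend 20% of the data went missing

# Mean replacement: fill every missing entry with the observed mean.
filled = np.where(np.isnan(x_obs), np.nanmean(x_obs), x_obs)

# The filled array has the same mean as the observed values, but its
# spread is mechanically smaller -- 200 entries now sit exactly at the
# mean -- which is the downward bias in the standard error estimates.
assert filled.std() < np.nanstd(x_obs)
```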
I was thinking mainly of simple cases where the correction only requires correctly counting the number of observations in order to adjust the degrees of freedom. For example, statistical tests that are based on relatively simple statistics, or ANOVA, which just needs a correct counting of the number of observations by groups. (This might be partially covered by any NA ufunc implementation that does mean, var and cov correctly, and maybe sorting like the current NaN sort.) In the panel data case it might be possible to do this, if it can just be treated like an unbalanced panel. I guess it depends on the details of the model. For regression, one way to remove an observation is to include a dummy variable for that observation, or use X'X with rows zeroed out. R has a package for multivariate normal with missing values that allows calculation of expected values for the missing ones. But in many of these cases, getting a clean
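The simple counting correction Josef describes might look like this with NaN standing in as the missing-value marker (a sketch, not any particular package's API):

```python
import numpy as np

x = np.array([2.0, 4.0, np.nan, 6.0, 8.0, np.nan])

# Count only the actual observations, and use that count for the
# degrees-of-freedom adjustment rather than len(x).
n = np.count_nonzero(~np.isnan(x))          # 4 observations
mean = np.nansum(x) / n
var = np.nansum((x - mean) ** 2) / (n - 1)  # unbiased sample variance

# This matches numpy's own NaN-skipping reduction with ddof=1:
assert np.isclose(var, np.nanvar(x, ddof=1))
```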
Re: [Numpy-discussion] using the same vocabulary for missing value ideas
(snip discussion of open kimono) On the other hand, to try and conceal these implementation differences seems to me to break my feeling for numpy arrays, and make me feel I have an object that is rather magic, that I don't fully understand, and for which clever stuff is going on, under the hood, that I worry about but have to trust. To weigh in as someone less tipsy, I totally agree with this concern. In fact, in trying to understand the proposal myself--and I use numpy and R NAs all the time--it was difficult to understand, and I don't think I have fully gotten it yet. That makes it seem like magic, and magic makes me seriously nervous ... specifically, that I won't get what I intended, which will lead to nearly-impossible-to-find bugs. I think this is not the numpy way. I think I fully understand why it's attractive, but I continue to think that it's a mistake, and one that may take some time to become clear. It will become clear only after a few years of trying to teach people, and noticing that when they get to this stuff, they start switching off, and getting a bit confused, and concluding it's all too hard for them. Agreed. For ultra simplicity, I'd be perfectly happy with an np.NA element (bitpattern?) that I could use to represent points that will forevermore be missing, as well as a masking capability that allows multiple masking values (not just true/false) such as:
a.mask[3] = 0 # unmasked
a.mask[3] = 1 # masked type 1 (eg, missing?)
a.mask[3] = 2 # masked type 2 (eg, data from different source)
a.mask[3] = 3 # masked type 3 (eg, ignore in complete-case analysis)
etc. Regardless of whether a mask is boolean or more, though, the simplicity of explaining masking separate from NA cases is, I think, a huge win. -best Gary
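Gary's multi-valued mask could be prototyped today with an ordinary side-car integer array; the codes below are just the ones from his example, and all names are illustrative:

```python
import numpy as np

data = np.array([3.1, 2.7, 5.0, 9.9, 4.2])

# A parallel integer mask instead of a boolean one:
# 0 = unmasked, 1 = missing, 2 = different source, 3 = ignore
mask = np.zeros(data.shape, dtype=np.uint8)
mask[3] = 1    # 9.9 is missing
mask[1] = 2    # 2.7 came from a different source

# Complete-case analysis is then a plain boolean selection, and the
# original values under nonzero codes remain recoverable.
usable = data[mask == 0]
mean_of_usable = usable.mean()
```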
___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] miniNEP1: where= argument for ufuncs
On Wed, Jul 6, 2011 at 5:41 PM, Lluís xscr...@gmx.net wrote: Sorry, but I didn't find a way of inserting inline comments in the gist. I'm a little confused about how gists work, actually. For actual discussion, it's probably just as well, since this way everyone sees the comment on the list and has a chance to join the conversation... but I'd be just as happy if other people could just go in and edit it, and I'm not sure how that works. I'm happy to move to somewhere else if people have suggestions, this was just easiest. Nathaniel Smith writes: [...] Is there any less stupid-looking name than ``where1=`` and ``where2=`` for the ``.outer`` operation? (For that matter, can ``.outer`` be applied to more than 2 arrays? The docs say it can't, but it's perfectly well-defined for an arbitrary number of arrays too, so maybe we want an interface that allows for 3-way, 4-way etc. ``.outer`` operations in the future?) Well, if outer can indeed be defined for an arbitrary number of arrays (and if it's going to be sometime in the future), I'd say the simplest is to use an array: .outer(a, b, ..., where = [my_where1, my_where2, ...]) Yeah, that's a much better idea... I've edited it to match. -- Nathaniel
[Numpy-discussion] miniNEP 2: NA support via special dtypes
Well, everyone seems to like my first attempt at this so far, so I guess I'll really stick my foot in it now... here's my second miniNEP, which lays out a plan for handling dtype/bit-pattern-style NAs. I've stolen bits of text from both the NEP and the alterNEP for this, but since the focus is on nailing down the details, most of the content is new. There are many FIXME's noted, where some decisions or more work is needed... the idea here is to lay out some specifics, so we can figure out if the idea will work and get the details right. So feedback is *very* welcome! Master version: https://gist.github.com/1068264 Current version for commenting: ### miniNEP 2: NA support via special dtypes ### To try and make more progress on the whole missing values/masked arrays/... debate, it seems useful to have a more technical discussion of the pieces which we *can* agree on. This is the second, which attempts to nail down the details of how NAs can be implemented using special dtypes. * Table of contents * .. contents:: * Rationale * An ordinary value is something like an integer or a floating point number. A missing value is a placeholder for an ordinary value that is for some reason unavailable. For example, in working with statistical data, we often build tables in which each row represents one item, and each column represents properties of that item. For instance, we might take a group of people and for each one record height, age, education level, and income, and then stick these values into a table. But then we discover that our research assistant screwed up and forgot to record the age of one of our individuals. We could throw out the rest of their data as well, but this would be wasteful; even such an incomplete row is still perfectly usable for some analyses (e.g., we can compute the correlation of height and income).
The traditional way to handle this would be to stick some particular meaningless value in for the missing data, e.g., recording this person's age as 0. But this is very error prone; we may later forget about these special values while running other analyses, and discover to our surprise that babies have higher incomes than teenagers. (In this case, the solution would be to just leave out all the items where we have no age recorded, but this isn't a general solution; many analyses require something more clever to handle missing values.) So instead of using an ordinary value like 0, we define a special missing value, written NA for not available. There are several possible ways to represent such a value in memory. For instance, we could reserve a specific value (like 0, or a particular NaN, or the smallest negative integer) and then ensure that this value is treated specially by all arithmetic and other operations on our array. Another option would be to add an additional mask array next to our main array, use this to indicate which values should be treated as NA, and then extend our array operations to check this mask array whenever performing computations. Each implementation approach has various strengths and weaknesses, but here we focus on the former (value-based) approach exclusively and leave the possible addition of the latter to future discussion. 
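The "reserve a specific value" option can be sketched by hand in today's numpy; choosing the smallest negative int32 below mirrors R's `NA_integer_` and is purely illustrative, not anything numpy itself does:

```python
import numpy as np

# Illustrative only: reserve INT32_MIN as the NA bit pattern.
NA_INT32 = np.int32(np.iinfo(np.int32).min)

ages = np.array([23, 41, NA_INT32, 35], dtype=np.int32)

# Every operation must now treat the reserved value specially --
# this manual bookkeeping is exactly what an NA-aware dtype would
# fold into the array machinery itself.
is_na = ages == NA_INT32
observed = ages[~is_na]
```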
The core advantages of this approach are (1) it adds no additional memory overhead, (2) it is straightforward to store and retrieve such arrays to disk using existing file storage formats, (3) it allows binary compatibility with R arrays including NA values, (4) it is compatible with the common practice of using NaN to indicate missingness when working with floating point numbers, (5) the dtype is already a place where `weird things can happen' -- there are a wide variety of dtypes that don't act like ordinary numbers (including structs, Python objects, fixed-length strings, ...), so code that accepts arbitrary numpy arrays already has to be prepared to handle these (even if only by checking for them and raising an error). Therefore adding yet more new dtypes has less impact on extension authors than if we change the ndarray object itself. The basic semantics of NA values are as follows. Like any other value, they must be supported by your array's dtype -- you can't store a floating point number in an array with dtype=int32, and you can't store an NA in it either. You need an array with dtype=NAint32 or something (exact syntax to be determined). Otherwise, NA values act exactly like any other values. In particular, you can apply arithmetic functions and so forth to them. By default, any function which takes an NA as an argument always returns an NA as well, regardless of the values of the other arguments. This ensures that if we try to compute the correlation of income with age, we will get NA, meaning given that some of the entries could be anything, the answer could be anything as well. This reminds us to
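The propagate-by-default semantics described here already exist for floats under the NaN-as-NA convention that advantage (4) above alludes to:

```python
import numpy as np

income = np.array([30_000.0, 52_000.0, np.nan, 41_000.0])

# By default the missing marker propagates through reductions...
total = income.sum()                 # nan

# ...and skipping it must be requested explicitly -- the same
# propagate-vs-skip split the NA proposal makes for every dtype.
observed_mean = np.nanmean(income)   # mean of the three observed values
```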
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
On Wed, Jul 6, 2011 at 7:14 PM, Christopher Jordan-Squire cjord...@uw.edu wrote: On Wed, Jul 6, 2011 at 3:47 PM, josef.p...@gmail.com wrote: On Wed, Jul 6, 2011 at 4:38 PM, josef.p...@gmail.com wrote: On Wed, Jul 6, 2011 at 4:22 PM, Christopher Jordan-Squire snip Mean value replacement, or more generally single scalar value replacement, is generally not a good idea. It biases downward your standard error estimates if you use mean replacement, and it will bias both if you use anything other than mean replacement. The bias gets worse with more missing data. So it's worst in precisely the cases where you'd want to fill in the data the most. (Though I admit I'm not too familiar with time series, so maybe this doesn't apply. But it's true as a general principle in statistics.) I'm not sure why we'd want to make this use case easier. Another qualification on this (I cannot help it). I think this only applies if you use a prefabricated no-missing-values algorithm. If I write it myself, I can do the proper correction for the reduced number of observations. (similar to the case when we ignore correlated information and use statistics based on uncorrelated observations, which also overestimate the amount of information we have available.) Can you do that sort of technique with longitudinal (panel) data? I'm honestly curious because I haven't looked into such corrections before. I haven't been able to find a reference after a few quick google searches. I don't suppose you know one off the top of your head? And you're right about the last measurement carried forward. I was just thinking about filling in all missing values with the same value. -Chris Jordan-Squire PS--Thanks for mentioning the statsmodels discussion. I'd been keeping track of that on a different email account, and I hadn't realized it wasn't forwarding those messages correctly.
Maybe a bit OT, but I've seen people doing imputation using Bayesian MCMC or multiple imputation for missing values in panel data. Google 'data augmentation' or 'multiple imputation'. I haven't looked much into the details yet, but it's definitely not mean replacement. FWIW (I haven't been following closely the discussion), there is a distinction in statistics between ignorable and nonignorable missing data, but I can't think of a situation where I would need this at the computational level rather than relying on a (numerically comparable) missing data type(s) a la SAS/Stata. I've also found the odd examples of IGNORE without a clear answer to be scary. Skipper
Re: [Numpy-discussion] miniNEP 2: NA support via special dtypes
On Wed, Jul 6, 2011 at 7:01 PM, Charles R Harris charlesr.har...@gmail.com wrote: Numpy already has a general mechanism for defining new dtypes and slotting them in so that they're supported by ndarrays, by the casting machinery, by ufuncs, and so on. In principle, we could implement Well, actually not in any useful sense, take a look at what Mark went through for the half floats. There is a reason the NEP went with parametrized dtypes and masks. But we would sure welcome a plan and code to make it true, it is one of the areas that could really use improvement. Err, yes, that's basically what the next few sentences say? This is basically a draft spec for implementing the parametrized dtypes idea. -- Nathaniel
Re: [Numpy-discussion] miniNEP 2: NA support via special dtypes
On Wed, Jul 6, 2011 at 8:09 PM, Nathaniel Smith n...@pobox.com wrote: On Wed, Jul 6, 2011 at 7:01 PM, Charles R Harris charlesr.har...@gmail.com wrote: Numpy already has a general mechanism for defining new dtypes and slotting them in so that they're supported by ndarrays, by the casting machinery, by ufuncs, and so on. In principle, we could implement Well, actually not in any useful sense, take a look at what Mark went through for the half floats. There is a reason the NEP went with parametrized dtypes and masks. But we would sure welcome a plan and code to make it true, it is one of the areas that could really use improvement. Err, yes, that's basically what the next few sentences say? This is basically a draft spec for implementing the parametrized dtypes idea. And yet: FIXME: this really needs attention from an expert on numpy's casting rules. But I can't seem to find the docs that explain how casting loops are looked up and decided between (e.g., if you're casting from dtype A to dtype B, which dtype's loops are used?), so I can't go into details. But those details are tricky and they matter... There is also a reason that masks were chosen to be implemented first. The numpy code is freely available and there is no reason not to make experiments or help Mark get some of the current problems solved, it doesn't need to be a one man effort and your feedback will have a lot more impact if you are in the trenches. In particular, I think there is a good deal of work that will need to be done for the sorts, argmax, and the other functions you mention that would give you a good idea of what was involved and how to go about implementing your ideas. Chuck
Re: [Numpy-discussion] miniNEP 2: NA support via special dtypes
On Wed, Jul 6, 2011 at 8:34 PM, Charles R Harris charlesr.har...@gmail.comwrote: On Wed, Jul 6, 2011 at 8:09 PM, Nathaniel Smith n...@pobox.com wrote: On Wed, Jul 6, 2011 at 7:01 PM, Charles R Harris charlesr.har...@gmail.com wrote: Numpy already has a general mechanism for defining new dtypes and slotting them in so that they're supported by ndarrays, by the casting machinery, by ufuncs, and so on. In principle, we could implement Well, actually not in any useful sense, take a look at what Mark went through for the half floats. There is a reason the NEP went with parametrized dtypes and masks. But we would sure welcome a plan and code to make it true, it is one of the areas that could really use improvement. Err, yes, that's basically what the next few sentences say? This is basically a draft spec for implementing the parametrized dtypes idea. And yet: FIXME: this really needs attention from an expert on numpy's casting rules. But I can't seem to find the docs that explain how casting loops are looked up and decided between (e.g., if you're casting from dtype A to dtype B, which dtype's loops are used?), so I can't go into details. But those details are tricky and they matter... There is also a reason that masks were chosen to be implemented first. The numpy code is freely available and there is no reason not to make experiments or help Mark get some of the current problems solved, it doesn't need to be a one man effort and your feedback will have a lot more impact if you are in the trenches. In particular, I think there is a good deal of work that will need to be done for the sorts, argmax, and the other functions you mention that would give you a good idea of what was involved and how to go about implementing your ideas. Let me lay out a bit more how I see things developing at this point, and bear in mind that I am not a psychic so this is just a guess ;) Mark is going to work at Enthought for maybe 3-4 more weeks and then return to school. 
Mark is very good, but that is still a very tough schedule and all the things in the NEP may not get finished, let alone all the supporting work that will be needed around the core implementation. After that what Mark does in his spare time is up to him. I expect there will be another numpy release sometime in the Fall, maybe around Nov/Dec, to get the new features, especially the datetime work, out there. At that point the interface is semi-fixed. I like to think that new features should be regarded as experimental for at least one release cycle, but that is certainly not official Numpy policy. In any case there is likely going to be a gap of several months where the rate of commits slows down and other folks, if they are interested, have a real opportunity to get involved. After the projected Fall release I see maybe another six months to make changes/extensions to the interface, and this is where new ideas can get worked out, but there needs to be someone with the interest and skill to implement those ideas for that to happen. If no such person shows up, then the interface will be what it is until there is such a person with an interest in carrying things forward. But at that point they will need to take care to maintain backward compatibility unless pretty much everyone agrees that the then-current interface is a disaster. Chuck
Re: [Numpy-discussion] miniNEP 2: NA support via special dtypes
On Wed, Jul 6, 2011 at 7:34 PM, Charles R Harris charlesr.har...@gmail.com wrote: On Wed, Jul 6, 2011 at 8:09 PM, Nathaniel Smith n...@pobox.com wrote: On Wed, Jul 6, 2011 at 7:01 PM, Charles R Harris charlesr.har...@gmail.com wrote: Numpy already has a general mechanism for defining new dtypes and slotting them in so that they're supported by ndarrays, by the casting machinery, by ufuncs, and so on. In principle, we could implement Well, actually not in any useful sense, take a look at what Mark went through for the half floats. There is a reason the NEP went with parametrized dtypes and masks. But we would sure welcome a plan and code to make it true, it is one of the areas that could really use improvement. Err, yes, that's basically what the next few sentences say? This is basically a draft spec for implementing the parametrized dtypes idea. And yet: FIXME: this really needs attention from an expert on numpy's casting rules. But I can't seem to find the docs that explain how casting loops are looked up and decided between (e.g., if you're casting from dtype A to dtype B, which dtype's loops are used?), so I can't go into details. But those details are tricky and they matter... There is also a reason that masks were chosen to be implemented first. The numpy code is freely available and there is no reason not to make experiments or help Mark get some of the current problems solved, it doesn't need to be a one man effort and your feedback will have a lot more impact if you are in the trenches. In particular, I think there is a good deal of work that will need to be done for the sorts, argmax, and the other functions you mention that would give you a good idea of what was involved and how to go about implementing your ideas. Hi Chuck, My goal in posting this was to try to find a way for those of us who disagree to still be productive together. 
If you'd like to help with that in a constructive way, then please do, but otherwise, can I ask in a polite and well-meaning way that you butt out? Scolding me for not getting in the trenches is not helpful. People like Wes and Matthew and I have been in the trenches for years building up numpy as a viable platform for statistical computing. (I can't claim that my efforts compare to theirs, but see for instance [1], which is an improved version of R's formula support, one of the other key advantages it has over Python. It works, so I'd have written some docs and released it by now, except I'm defending my PhD in 4 weeks, so, well, you know.) Yes, there are some details missing from the spec I wrote up in a few hours this afternoon, but how about we solve them? There are plenty of people on this list who know more than me, or Mark, or any one of any of us. This problem is complicated, but not *that* complicated. So, you know, let's do this. And maybe that way, in a month, we'll have something that we all actually like, even if it doesn't do everything that we want. -- Nathaniel [1] https://github.com/charlton/charlton
Re: [Numpy-discussion] using the same vocabulary for missing value ideas
On 7/6/11 11:57 AM, Mark Wiebe wrote: On Wed, Jul 6, 2011 at 1:25 PM, Christopher Barker Is this really true? If you use a bitpattern for IGNORE, haven't you just lost the ability to get the original value back if you want to stop ignoring it? Maybe that's not inherent to what an IGNORE means, but it seems pretty key to me. What do you think of renaming IGNORE to SKIP? This isn't a semantics issue -- IGNORE is fine. What I'm getting at is that we need a word (and code) for: ignore for now, but I might want to use it later - Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov