Re: [Numpy-discussion] (no subject)
Hi,

On Fri, Apr 20, 2012 at 9:15 PM, Andre Martel soucoupevola...@yahoo.com wrote:

What would be the best way to remove the maximum from a cube and collapse the remaining elements along the z-axis? For example, I want to reduce Cube to NewCube:

Cube
array([[[ 13,   2,   3,  42],
        [  5, 100,   7,   8],
        [  9,   1,  11,  12]],

       [[ 25,   4,  15,   1],
        [ 17,  30,   9,  20],
        [ 21,   2,  23,  24]],

       [[  1,   2,  27,  28],
        [ 29,  18,  31,  32],
        [ -1,   3,  35,   4]]])

NewCube
array([[[ 13,   2,   3,   1],
        [  5,  30,   7,   8],
        [  9,   1,  11,  12]],

       [[  1,   2,  15,  28],
        [ 17,  18,   9,  20],
        [ -1,   2,  23,   4]]])

I tried with argmax() and then roll() and delete(), but these all work on 1-D arrays only. Thanks.

Perhaps it would be more straightforward to process via 2D-arrays, like:

In []: C
Out[]:
array([[[ 13,   2,   3,  42],
        [  5, 100,   7,   8],
        [  9,   1,  11,  12]],

       [[ 25,   4,  15,   1],
        [ 17,  30,   9,  20],
        [ 21,   2,  23,  24]],

       [[  1,   2,  27,  28],
        [ 29,  18,  31,  32],
        [ -1,   3,  35,   4]]])

In []: C_in = C.reshape(3, -1).copy()
In []: ndx = C_in.argmax(0)
In []: C_out = C_in[:2, :]
In []: C_out[:, ndx == 0] = C_in[1:, ndx == 0]
In []: C_out[1, ndx == 1] = C_in[2, ndx == 1]
In []: C_out.reshape(2, 3, 4)
Out[]:
array([[[ 13,   2,   3,   1],
        [  5,  30,   7,   8],
        [  9,   1,  11,  12]],

       [[  1,   2,  15,  28],
        [ 17,  18,   9,  20],
        [ -1,   2,  23,   4]]])

My 2 cents,
-eat
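For readers following along, here is a sketch of the same idea written for an arbitrary (nz, ny, nx) cube, using argmax along the z-axis and boolean indexing; it is not from the original thread, and the helper name drop_max_along_z and the variable names are just illustrative:

import numpy as np

def drop_max_along_z(cube):
    """Remove, for every (y, x) position, the element that is the maximum
    along the z-axis, keeping the remaining elements in their original z order."""
    nz, ny, nx = cube.shape
    kmax = cube.argmax(axis=0)                     # z-index of the max, shape (ny, nx)
    keep = np.arange(nz)[:, None, None] != kmax    # (nz, ny, nx); nz-1 True values per column
    # Boolean indexing flattens, so select in (y, x, z) order and reshape back.
    kept = cube.transpose(1, 2, 0)[keep.transpose(1, 2, 0)]
    return kept.reshape(ny, nx, nz - 1).transpose(2, 0, 1)

# With the Cube from the question, drop_max_along_z(C) reproduces NewCube
# (shape (2, 3, 4)). Ties are handled by argmax's first-occurrence rule, so
# exactly one element per (y, x) column is dropped.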
Re: [Numpy-discussion] Removing masked arrays for 1.7? (Was 1.7 blockers)
On 21. apr. 2012, at 00:16, Drew Frank wrote:

On Fri, Apr 20, 2012 at 11:45 AM, Chris Barker chris.bar...@noaa.gov wrote:

On Fri, Apr 20, 2012 at 11:39 AM, Dag Sverre Seljebotn d.s.seljeb...@astro.uio.no wrote:

Oh, right. I was thinking small as in "fits in L2 cache", not small as in a few dozen entries.

Another example of a small-array use case: I've been using numpy for my research in multi-target tracking, which involves something like a bunch of entangled hidden Markov models. I represent target states with small 2-D arrays (e.g. 2x2, 4x4, ...) and observations with small 1-D arrays (1 or 2 elements). It may be possible to combine a bunch of these small arrays into a single larger array and use indexing to extract views, but it is much cleaner and more intuitive to use separate, small arrays. It's also convenient to use numpy arrays rather than a custom class because I use the linear algebra functionality as well as the integration with other libraries (e.g. matplotlib). I also work with approximate probabilistic inference in graphical models (belief propagation, etc.), which is another area where it can be nice to work with many small arrays.

In any case, I just wanted to chime in with my small bit of evidence for people wanting to use numpy for work with small arrays, even if they are currently pretty slow. If there were a special version of a numpy array that would be faster for cases like this, I would definitely make use of it.

Drew

Although performance hasn't been a killer for me, I've been using numpy arrays (or matrices) for Mueller matrices [0] and Stokes vectors [1]. These describe the polarization of light and are always 4x1 vectors or 4x4 matrices. It would be nice if my code ran in one night instead of one week, although this is still tolerable in my case. Again, just an example of how small-vector/matrix performance can be important in certain use cases.

Paul

[0] https://en.wikipedia.org/wiki/Mueller_calculus
[1] https://en.wikipedia.org/wiki/Stokes_vector
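As a concrete illustration of the kind of small fixed-size computation Paul describes (not taken from his code; the polarizer matrix below is just the textbook Mueller matrix of an ideal horizontal linear polarizer):

import numpy as np

# Stokes vector for unpolarized light of unit intensity.
stokes_in = np.array([1.0, 0.0, 0.0, 0.0])

# Mueller matrix of an ideal linear polarizer with a horizontal transmission axis.
polarizer = 0.5 * np.array([[1.0, 1.0, 0.0, 0.0],
                            [1.0, 1.0, 0.0, 0.0],
                            [0.0, 0.0, 0.0, 0.0],
                            [0.0, 0.0, 0.0, 0.0]])

# Propagating the light through the element is a single 4x4 matrix-vector product;
# a simulation repeats tiny products like this millions of times, which is where
# per-call overhead on small arrays starts to dominate the run time.
stokes_out = np.dot(polarizer, stokes_in)   # -> array([0.5, 0.5, 0. , 0. ])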
Re: [Numpy-discussion] A 1.6.2 release?
On Sat, Apr 21, 2012 at 5:16 PM, Charles R Harris charlesr.har...@gmail.com wrote:

On Sat, Apr 21, 2012 at 2:46 AM, Ralf Gommers ralf.gomm...@googlemail.com wrote:

On Fri, Apr 20, 2012 at 8:04 PM, Charles R Harris charlesr.har...@gmail.com wrote:

Hi All,

Given the amount of new stuff coming in 1.7 and the slip in its schedule, I wonder if it would be worth putting out a 1.6.2 release with fixes for einsum, ticket 1578, and perhaps some others. My reasoning is that the fall releases of Fedora and Ubuntu are likely to still use 1.6, and they might as well use a somewhat fixed-up version. The downside is that locating and backporting fixes is likely to be a fair amount of work.

A 1.7 release would be preferable, but I'm not sure when we can make that happen. Travis still sounded hopeful of being able to resolve the 1.7 issues relatively soon. On the other hand, even if that's done in one month we'll still miss Debian stable, and a 1.6.2 release won't be *that* much work. Let's go for it, I would say. Aiming for an RC on May 2nd and a final release on May 16th would work for me.

I count 280 BUG commits since 1.6.1, so we are going to need to thin those out.

Indeed. We can discard all commits related to NA and datetime, and then we should find some balance between how important the fixes are and how much risk there is that they break something. I agree with the couple of backports you've done so far, but I propose to do the rest via PRs.

There's also build issues.

I checked all of those and sent a PR with backports of all the relevant ones: https://github.com/numpy/numpy/pull/258

Ralf
Re: [Numpy-discussion] A 1.6.2 release?
On Sun, Apr 22, 2012 at 5:25 AM, Ralf Gommers ralf.gomm...@googlemail.com wrote:

[earlier 1.6.2 discussion quoted in full; snipped -- see Ralf's message above]

I checked all of those and sent a PR with backports of all the relevant ones: https://github.com/numpy/numpy/pull/258

Hi Ralf,

I went ahead and merged those. What's the easiest way to make things merge into the maintenance/1.6.x branch in the pull requests?

Chuck
Re: [Numpy-discussion] A 1.6.2 release?
On Sun, Apr 22, 2012 at 3:44 PM, Charles R Harris charlesr.har...@gmail.com wrote:

[earlier 1.6.2 discussion quoted in full; snipped -- see the messages above]

Hi Ralf,

I went ahead and merged those. What's the easiest way to make things merge into the maintenance/1.6.x branch in the pull requests?

When sending a PR: in your own GitHub repo you press "Pull request", then in the next screen, under "Base branch - tag - commit", you change the branch from master to maintenance/1.6.x. Then press "Update commit range". Then merge like normal.

Ralf
[Numpy-discussion] Euroscipy 2012 - abstract deadline soon (April 30) + sprints
Hello,

This is a reminder of the approaching deadline for abstract submission for the EuroSciPy 2012 conference: the deadline is April 30, one week from now.

EuroSciPy 2012 will be held in **Brussels**, **August 23-27**, at the Université Libre de Bruxelles (ULB, Solbosch Campus). The EuroSciPy meeting is a cross-disciplinary gathering focused on the use and development of the Python language in scientific research and industry. This event strives to bring together both users and developers of scientific tools, as well as academic research and state-of-the-art industry. More information about the conference, including practical information, can be found on the conference website: http://www.euroscipy.org/conference/euroscipy2012

We are soliciting talks and posters that discuss topics related to scientific computing using Python. These include applications, teaching, future development directions, and research. We welcome contributions from industry as well as the academic world. Submission guidelines can be found at http://www.euroscipy.org/card/euroscipy2012_call_for_contributions

Also, rooms are available at the ULB for sprints on Tuesday, August 28th and Wednesday, August 29th. If you wish to organize a sprint at EuroSciPy, please get in touch with Berkin Malkoc (malk...@itu.edu.tr). Any other questions should be addressed exclusively to org-t...@lists.euroscipy.org

We apologize for the inconvenience if you received this e-mail through several mailing lists.

-- Emmanuelle, for the organizing team
[Numpy-discussion] the state of NA/masking
Hi all,

Travis, Mark, and I talked on Skype this week about how to productively move forward with the NA debate, and I got picked to summarize for the list :-). There are three main things we discussed:

1) About process: We seem to agree that this discussion has been ineffective for a variety of reasons, and that it would be best to back up and try the consensus-based approach. Maybe not everyone agrees... I'm not sure how we go about building consensus on whether we need consensus? And we noted that we may not actually all mean the same thing by that. To start a discussion, I'll write up separately what I understand by that term.

2) If we require consensus on our NA implementation, then we have a problem for the 1.7.0 release. The problem is this:

-- We have some kind of commitment to keeping compatibility between releases.
-- Therefore, if we release with NA masks, then we have some kind of commitment to continuing to support these in some form going forward.
-- But as per above, we can't make such a commitment until we have consensus, and we don't have consensus. Even if we end up deciding that the current code is the best thing ever, we haven't done that yet.

Therefore, we have a kind of constrained optimization problem: we need to find the best way to adjust our "some kind of commitment", or the current code, or both, so that we can release 1.7. Alternatively, we could delay the release until we have reached and implemented consensus, but I have an allergy to putting such amorphous things on our critical path, and I suspect I'm not the only one. (If it turns out that consensus is quick and the release is slow for other reasons, then that'd be great, of course, but why depend on it if we don't have to?) I'll also send a separate email to try and lay out the main options here, as a basis for discussion.

3) And, in the long run, there's the actual question of what kind of NA support we actually want in numpy. A major problem here is that it's very difficult for anyone who hasn't spent huge amounts of time wading through the mailing list to actually understand what the points of contention are. So, Mark and I are going to *co*-write a document explaining what we see as the main problems, and trying to clarify our disagreements. Of course, this still won't include everyone's point of view, but hopefully it will serve as a good starting point for... you guessed it... discussion.

Cheers,
-- Nathaniel
[Numpy-discussion] NEP mask code and the 1.7 release
We need to decide what to do with the NA masking code currently in master, vis-a-vis the 1.7 release. While this code is great at what it is, we don't actually have consensus yet that it's the best way to give our users what they want/need -- or even an appropriate way. So we need to figure out how to release 1.7 without committing ourselves to supporting this design in the future.

Background: what does the code currently in master do?
-------------------------------------------------------

It adds 3 pointers at the end of the PyArrayObject struct (which is better known as the numpy.ndarray object). These new struct members, and some accessors for them, are exposed as part of the public API. There are also a few additions to the Python-level API (mask= argument to np.array, skipna= argument to ufuncs, etc.).

What does this mean for compatibility?
---------------------------------------

The change in the ndarray struct is not as problematic as it might seem, compatibility-wise, since Python objects are almost always referred to by pointers. Since the initial part of the struct will continue to have the same memory layout, existing source and binary code that works with PyArrayObject *pointers* will continue to work unchanged. One place where the actual struct size matters is for any C-level ndarray subclasses, which will have their memory layout change, and thus will need to be recompiled. (Python-level ndarray subclasses will have their memory layout change as well -- e.g., they will have different __dictoffset__ values -- but it's unlikely that any existing Python code depends on such details.)

What if we want to change our minds later?
-------------------------------------------

For the same reasons as given above, any new code which avoids referencing the new struct fields referring to masks, or using the new masking APIs, will continue to work even if the masking is later removed.

Any new code which *does* refer to the new masking APIs, or references the fields directly, will break if masking is later removed. Specifically, source will fail to compile, and existing binaries will silently access memory that is past the end of the PyArrayObject struct, which will have unpredictable consequences. (Most likely segfaults, but no guarantees.) This applies even to code which simply tries to check whether a mask is present.

So I think the preconditions for leaving this code as-is for 1.7 are that we must agree:
* We are willing to require a recompile of any C-level ndarray subclasses (do any exist?)
* We are willing to make absolutely no guarantees about future compatibility for code which uses APIs marked experimental
* We are willing for this breakage to occur in the form of random segfaults
* We are okay with the extra 3 pointers' worth of memory overhead on each ndarray

Personally I can live with all of these if everyone else can, but I'm nervous about reducing our compatibility guarantees like that, and we'd probably need, at a minimum, a flashier EXPERIMENTAL sign than we currently have. (Maybe we should resurrect the weasels ;-) [1])

[1] http://mail.scipy.org/pipermail/numpy-discussion/2012-March/061204.html

Any other options?
-------------------

Alternative 1: The obvious other option is to go through and move all the strictly mask-related code out of master and into a branch. Presumably this wouldn't include all the infrastructure that Mark added, since a lot of it is e.g. shared with where=, and that would stay. Even so, this would be a big and possibly time-consuming change.

Alternative 2: After auditing the code a bit, the cleanest third option I can think of is:

1. Go through and make sure that all numpy-internal access to the new maskna fields happens via the accessor functions. (This patch would produce no functionality change.)
2. Move the accessors into some numpy-internal header file, so that user code can't call them.
3. Remove the mask= argument to Python-level ndarray constructors, remove the new maskna_ fields from PyArrayObject, and modify the accessors so that they always return NULL, 0, etc., as if the array does not have a mask.

This would make 1.7 completely compatible with 1.6 API- and ABI-wise. But it would also be a minimal code change, leaving the mask-related code paths in place but inaccessible. If we decided to re-enable them, it would just be a matter of reverting steps (3) and (2). The main downside I see with this approach is that leaving a bunch of inaccessible code paths lying around might make it harder to maintain 1.7 as a long-term support release.

I'm personally willing to implement either of these changes. Or perhaps there's another option that I'm not thinking of!

-- Nathaniel
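For readers trying to follow along, here is a rough sketch (not part of Nathaniel's message) of what Python-level code opting into the experimental masked-NA support looks like, and how it can guard itself against the API being removed again. The exact spellings below (np.NA, maskna=, skipna=) are taken from the NEP/master branch as I understand them and should be treated as illustrative, not authoritative:

import numpy as np

# Hedged sketch: the masked-NA API exists only in the 1.7 development branch,
# and the keyword spellings below are assumptions based on the NEP.
if hasattr(np, "NA"):
    a = np.array([1.0, np.NA, 3.0], maskna=True)  # array carrying an NA mask
    total = a.sum(skipna=True)                    # reduction that skips NA entries
else:
    # On releases without the experimental support, fall back to the usual NaN kludge.
    a = np.array([1.0, np.nan, 3.0])
    total = np.nansum(a)

Python code written in this guarded style keeps working whichever way the decision goes; C code that touches the new struct fields or accessors directly is the case described above as breaking with random segfaults.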
[Numpy-discussion] What is consensus anyway
If you hang around big FOSS projects, you'll see the word "consensus" come up a lot. For example, the glibc steering committee recently dissolved itself in favor of governance directly by the consensus of the people active in glibc development[1]. It's the governing rule of the IETF, which defines many of the most important internet standards[2]. It is the primary way decisions are made on Wikipedia[3]. It's one of the fundamental aspects of accomplishing things within the Apache framework[4].

[1] https://lwn.net/Articles/488778/
[2] https://www.ietf.org/tao.html#getting.things.done
[3] https://en.wikipedia.org/wiki/Wikipedia:Consensus
[4] https://www.apache.org/foundation/voting.html

But it turns out that this consensus thing is actually somewhat mysterious, and most programmers immersed in this culture pick it up by osmosis. And numpy in particular has a lot of developers who are not coming from a classic FOSS programmer background! So this is my personal attempt to articulate what it is, and why requiring consensus is probably the best possible approach to project decision making.

So what is consensus? Like, voting or something?
-------------------------------------------------

This is surprisingly subtle and specific. Consensus means something like, "everyone who cares is satisfied with the result."

It does *not* mean
* Every opinion counts equally
* We vote on anything
* Every solution must be perfect and flawless
* Every solution must leave everyone overjoyed
* Everyone must sign off on every solution.

It *does* mean
* We invite people to speak up
* We generally trust individuals to decide how important their opinion is
* We generally trust individuals to decide whether or not they can live with some outcome
* If they can't, then we take the time to find something better.

One simple way of stating this is, "everyone has a veto." In practice, such vetoes are almost never used, so this rule is not particularly illuminating on its own. Hence, the rest of this document.

What a waste of time! That all sounds very pretty on paper, but we have stuff to get done.
-------------------------------------------------------------------------------------------

First, I'll note that this seemingly utopian scheme has a track record of producing such impractical systems as TCP/IP, SMTP, DNS, Apache, GCC, Linux, Samba, Python, ...

But mere empirical results are often less convincing than a good story, so I will give you two.

Why does a requirement for consensus work?

Reason 1 (for optimists): *All of us are smarter than any of us.* For a complex project with many users, it's extraordinarily difficult for any one person to understand the full ramifications of any decision, particularly the sort of far-reaching architectural decisions that are most important. It's even more difficult to understand all the possibilities of all the different possible solutions. In fact, it's *extremely* common that the correct solution to a problem is the one that no-one thinks of until after a month of annoying debate. Spending a month to avoid an architectural problem that will haunt us for years is an *excellent* trade-off, even if it feels interminable at the time. Even two months. Usually disagreements are an indication that a better solution is possible, even when it's not clear what that would be.

Reason 2 (for pessimists): *You **will** reach consensus sooner or later; it's less painful to do up front.* Example: NA handling. There are two schemes that people use for this right now -- numpy.ma and ugly NaN kluges (see e.g. nanmean). These are generally agreed to be suboptimal. Recently, two new contenders have shown up: the NEP masked-NA support currently in master, and the unrelated NA support in pandas (which as a library is attracting a *lot* of the statistical analysis folk who care about missing data, kudos to Wes). I think that right now, the most likely future is that a few years from now, many people will still be using the old solutions, and others will have switched to the new (incompatible) solutions, and we will have *4* suboptimal schemes in concurrent use. If (when) this happens, we will have to re-open this discussion yet again, but now with a heck of a mess to clean up. This is FOSS -- if people aren't convinced by your solution, they will just ignore it and do their own thing. So a policy that allows changes to be made without consensus is a recipe for entrenching disagreements and splitting the community.

Okay, great, but even if it's the best thing ever, we *can't* hold a vote on every change! What are you actually suggesting we do?
-----------------------------------------------------------------------------------------------------------------------------------

Right, that's not the idea. Most changes are pretty obviously uncontroversial, and in fact we usually have the opposite problem -- it's hard to get people to do code review! So having consensus on every change is an ideal, and in practice, just following the reasonable person principle lets us
Re: [Numpy-discussion] What is consensus anyway
On Sun, Apr 22, 2012 at 4:15 PM, Nathaniel Smith n...@pobox.com wrote:

[Nathaniel's consensus write-up quoted in full; snipped -- see his message above]

First, I'll note that this seemingly utopian scheme has a track record of producing such impractical systems as TCP/IP, SMTP, DNS, Apache, GCC, Linux, Samba, Python, ...

Linux is Linus' private tree. Everything that goes in is his decision, everything that stays out is his decision. Of course, he delegates much of the work to people he trusts, but it doesn't even reach the level of a BDFL; it's DFL. As for consensus, it basically comes down to convincing the gatekeepers one level below Linus that your code might be useful. So, bad example. Same with TCP/IP, which was basically Kahn and Cerf consulting with a few others and working at the request of DARPA. GCC was Richard Stallman (I got one of the first tapes for a $30 donation), and Python was Guido. Some of the projects later developed some form of governance, but Guido, for instance, can veto anything he dislikes even if he is disinclined to do so.

I'm not saying you're wrong about open source, I'm just saying that each project differs and it is wrong to imply that they follow some common form of governance under the rubric FOSS and that they all seek consensus. And they certainly don't *start* that way. And there are also plenty of projects that fail when the prime mover loses interest or folks get tired of the politics.

[remainder of the quoted write-up snipped]
Re: [Numpy-discussion] NEP mask code and the 1.7 release
On Sun, Apr 22, 2012 at 4:15 PM, Nathaniel Smith n...@pobox.com wrote:

["NEP mask code and the 1.7 release" quoted in full; snipped -- see Nathaniel's message above]

I'm not deeply invested in the current version of masked NA. OTOH, code development usually goes through several
Re: [Numpy-discussion] NEP mask code and the 1.7 release
On Sun, Apr 22, 2012 at 6:26 PM, Charles R Harris charlesr.har...@gmail.com wrote:

On Sun, Apr 22, 2012 at 4:15 PM, Nathaniel Smith n...@pobox.com wrote:

["NEP mask code and the 1.7 release" quoted in full; snipped -- see Nathaniel's message above]

I'm not deeply invested in
Re: [Numpy-discussion] What is consensus anyway
Hi Nathaniel,

Thanks for a solid writeup of this topic. I just want to add a note from personal experience, regarding this specific point:

On Sun, Apr 22, 2012 at 3:15 PM, Nathaniel Smith n...@pobox.com wrote:

Usually disagreements are an indication that a better solution is possible, even when it's not clear what that would be.

I think this is *extremely* important, so I want to highlight it from the rest of your post. Regarding how IPython operates, I think we have good evidence to illustrate the value of this...

One of the members of the IPython team who joined earliest is Brian Granger: he started working on IPython around 2004 after a conversation we had in the context of a SciPy conference. Some of you may know that Brian and I went to graduate school together, which means we've known each other for much longer than IPython, and we've been good friends since. But that alone doesn't ensure a smooth collaboration; in fact Brian and I extremely often disagree *deeply* on design decisions about IPython. And yet, I think the project makes solid progress, not despite this but in an important way *thanks* to this divergence. Whenever we disagree, it typically means that each of us is seeing a partial solution to a problem, but not a really solid and complete one. I don't recall ever using my 'BDFL vote' in one of these discussions; instead we just keep going around the problem. Typically what happens is that after much discussion, we settle on a new solution that neither of us had quite seen at the start.

I mention Brian specifically because he and I seem to be at opposite ends of some weird spectrum; disagreements between the other parties appear to fall somewhere in between. Here's an example that is currently in open discussion, and despite the fact that I'm completely convinced that something like this should go into IPython, I'm waiting. We'll continue the discussion to either find arguments that convince me otherwise, or to convince Brian of the value of the PR: https://github.com/ipython/ipython/pull/1343

It takes both patience and trust for this to work: we have to be willing to wait out the long discussion, and we have to trust that despite how much we may disagree on something, we both play fair and ultimately only want what's best for the project. That means giving the other party the benefit of the doubt at every turn, and having a willingness to let the discussion happen openly as long as is necessary for the project to remain healthy.

For example in this case, I'm *really* convinced of my point, and I think blocking this PR actively hurts users. Is it worth saying "OK, I'm overriding your concerns here and pushing this forward"? Absolutely NOT! I'd only:

- alienate Brian, a key member of the project without whom IPython would be nowhere near where it is today, and decrease his motivation to continue working
- kill the opportunity for a discussion to produce an even cleaner solution than what we've seen so far
- piss off a good friend. I put this last because while that's actually a very important reason for me, the fact that Brian and I are good personal friends is secondary here: this is about discussion between contributors, independent of their personal relationships.

I hope this perspective is useful...

1. Make it as easy as possible for people to see what's going on and join the discussion. All decisions and reasoning behind decisions take place in public. (On this note, it would be *really* good if pull request notifications went to the list.)

If anyone knows how to do this, let me know; I'd like to do the same for IPython and our -dev list.

Cheers,
f