Re: [Numpy-discussion] (no subject)

2012-04-22 Thread eat
Hi,

On Fri, Apr 20, 2012 at 9:15 PM, Andre Martel soucoupevola...@yahoo.com wrote:

 What would be the best way to remove the maximum from a cube and
 collapse the remaining elements along the z-axis ?
 For example, I want to reduce Cube to NewCube:

 Cube
 array([[[ 13,   2,   3,  42],
         [  5, 100,   7,   8],
         [  9,   1,  11,  12]],

        [[ 25,   4,  15,   1],
         [ 17,  30,   9,  20],
         [ 21,   2,  23,  24]],

        [[  1,   2,  27,  28],
         [ 29,  18,  31,  32],
         [ -1,   3,  35,   4]]])

 NewCube

 array([[[ 13,   2,   3,   1],
         [  5,  30,   7,   8],
         [  9,   1,  11,  12]],

        [[  1,   2,  15,  28],
         [ 17,  18,   9,  20],
         [ -1,   2,  23,   4]]])

 I tried with argmax() and then roll() and delete() but these
 all work on 1-D arrays only. Thanks.

Perhaps it would be more straightforward to process via 2D-arrays, like:
In []: C
Out[]:
array([[[ 13,   2,   3,  42],
        [  5, 100,   7,   8],
        [  9,   1,  11,  12]],

       [[ 25,   4,  15,   1],
        [ 17,  30,   9,  20],
        [ 21,   2,  23,  24]],

       [[  1,   2,  27,  28],
        [ 29,  18,  31,  32],
        [ -1,   3,  35,   4]]])
In []: C_in = C.reshape(3, -1).copy()           # one column per (row, col) position
In []: ndx = C_in.argmax(0)                     # which layer holds each column's max
In []: C_out = C_in[:2, :]                      # view of the first two layers
In []: C_out[:, ndx == 0] = C_in[1:, ndx == 0]  # max in layer 0: shift layers 1, 2 up
In []: C_out[1, ndx == 1] = C_in[2, ndx == 1]   # max in layer 1: layer 2 fills row 1
In []: C_out.reshape(2, 3, 4)                   # max in layer 2 needs no move
Out[]:
array([[[13,  2,  3,  1],
        [ 5, 30,  7,  8],
        [ 9,  1, 11, 12]],

       [[ 1,  2, 15, 28],
        [17, 18,  9, 20],
        [-1,  2, 23,  4]]])
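
For completeness, a boolean-mask variant of the same idea (a sketch in the
same spirit, assuming numpy is imported as np; verified only against this
example) drops each column's maximum in a single step:

In []: C2 = C.reshape(3, -1)
In []: keep = np.ones(C2.shape, dtype=bool)
In []: keep[C2.argmax(0), np.arange(C2.shape[1])] = False  # unset each column's max
In []: C2.T[keep.T].reshape(-1, 2).T.reshape(2, 3, 4)      # same NewCube as above

The transposes matter here: boolean indexing walks C2.T in row-major order,
i.e. column by column of C2, so the two survivors of each column come out
together and in layer order.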

My 2 cents,
-eat


 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Removing masked arrays for 1.7? (Was 1.7 blockers)

2012-04-22 Thread Paul Anton Letnes

On 21. apr. 2012, at 00:16, Drew Frank wrote:

 On Fri, Apr 20, 2012 at 11:45 AM, Chris Barker chris.bar...@noaa.gov wrote:
 
 On Fri, Apr 20, 2012 at 11:39 AM, Dag Sverre Seljebotn
 d.s.seljeb...@astro.uio.no wrote:
 Oh, right. I was thinking small as in "fits in L2 cache", not small as
 in "a few dozen entries".
 
 Another example of a small array use-case: I've been using numpy for
 my research in multi-target tracking, which involves something like a
 bunch of entangled hidden Markov models. I represent target states
 with small 2d arrays (e.g. 2x2, 4x4, ...) and observations with small
 1d arrays (1 or 2 elements). It may be possible to combine a bunch of
 these small arrays into a single larger array and use indexing to
 extract views, but it is much cleaner and more intuitive to use
 separate, small arrays. It's also convenient to use numpy arrays
 rather than a custom class because I use the linear algebra
 functionality as well as integration with other libraries (e.g.
 matplotlib).
 
 I also work with approximate probabilistic inference in graphical
 models (belief propagation, etc), which is another area where it can
 be nice to work with many small arrays.
 
 In any case, I just wanted to chime in with my small bit of evidence
 for people wanting to use numpy for work with small arrays, even if
 they are currently pretty slow. If there were a special version of a
 numpy array that would be faster for cases like this, I would
 definitely make use of it.
 
 Drew

Although performance hasn't been a killer for me, I've been using numpy arrays 
(or matrices) for Mueller matrices [0] and Stokes vectors [1]. These describe 
the polarization of light and are always 4x1 vectors or 4x4 matrices. It would 
be nice if my code ran in one night instead of one week, although this is still 
tolerable in my case. Again, just an example of how small-vector/matrix 
performance can be important in certain use cases.
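
For readers unfamiliar with the calculus, the computation consists of tiny
fixed-size products like the following (a minimal sketch: an ideal horizontal
polarizer acting on unpolarized light):

import numpy as np

S = np.array([1., 0., 0., 0.])          # Stokes vector of unpolarized light
M = 0.5 * np.array([[1., 1., 0., 0.],   # Mueller matrix of an ideal
                    [1., 1., 0., 0.],   # horizontal linear polarizer
                    [0., 0., 0., 0.],
                    [0., 0., 0., 0.]])
print(np.dot(M, S))                     # [ 0.5  0.5  0.   0. ] -- half the
                                        # intensity, horizontally polarized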

Paul

[0] https://en.wikipedia.org/wiki/Mueller_calculus
[1] https://en.wikipedia.org/wiki/Stokes_vector
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] A 1.6.2 release?

2012-04-22 Thread Ralf Gommers
On Sat, Apr 21, 2012 at 5:16 PM, Charles R Harris charlesr.har...@gmail.com
 wrote:



 On Sat, Apr 21, 2012 at 2:46 AM, Ralf Gommers ralf.gomm...@googlemail.com
  wrote:



 On Fri, Apr 20, 2012 at 8:04 PM, Charles R Harris 
 charlesr.har...@gmail.com wrote:

 Hi All,

 Given the amount of new stuff coming in 1.7 and the slip in its
 schedule, I wonder if it would be worth putting out a 1.6.2 release with
 fixes for einsum, ticket 1578, and perhaps some others. My reasoning is that
 the fall releases of Fedora and Ubuntu are likely to still use 1.6, and they
 might as well use a somewhat fixed-up version. The downside is that locating
 and backporting fixes is likely to be a fair amount of work. A 1.7 release
 would be preferable, but I'm not sure when we can make that happen.


 Travis still sounded hopeful of being able to resolve the 1.7 issues
 relatively soon. On the other hand, even if that's done in one month we'll
 still miss Debian stable and a 1.6.2 release won't be *that* much work.

 Let's go for it I would say.

 Aiming for an RC on May 2nd and a final release on May 16th would work for
 me.


 I count 280 BUG commits since 1.6.1, so we are going to need to thin those
 out.


Indeed. We can discard all commits related to NA and datetime, and then we
should find some balance between how important the fixes are and how much
risk there is that they break something. I agree with the couple of
backports you've done so far, but I propose to do the rest via PRs.
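
As a first cut at thinning them, something like this could list the
candidates (a sketch, assuming the usual BUG: commit-message prefix and a
checkout that has the release tags):

  git log --oneline --grep='^BUG' v1.6.1..master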

There are also build issues. I checked all of those and sent a PR with
backports of all the relevant ones: https://github.com/numpy/numpy/pull/258

Ralf
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] A 1.6.2 release?

2012-04-22 Thread Charles R Harris
On Sun, Apr 22, 2012 at 5:25 AM, Ralf Gommers
ralf.gomm...@googlemail.com wrote:



 On Sat, Apr 21, 2012 at 5:16 PM, Charles R Harris 
 charlesr.har...@gmail.com wrote:



 On Sat, Apr 21, 2012 at 2:46 AM, Ralf Gommers 
 ralf.gomm...@googlemail.com wrote:



 On Fri, Apr 20, 2012 at 8:04 PM, Charles R Harris 
 charlesr.har...@gmail.com wrote:

 Hi All,

 Given the amount of new stuff coming in 1.7 and the slip in its
 schedule, I wonder if it would be worth putting out a 1.6.2 release with
 fixes for einsum, ticket 1578, and perhaps some others. My reasoning is that
 the fall releases of Fedora and Ubuntu are likely to still use 1.6, and they
 might as well use a somewhat fixed-up version. The downside is that locating
 and backporting fixes is likely to be a fair amount of work. A 1.7 release
 would be preferable, but I'm not sure when we can make that happen.


 Travis still sounded hopeful of being able to resolve the 1.7 issues
 relatively soon. On the other hand, even if that's done in one month we'll
 still miss Debian stable and a 1.6.2 release won't be *that* much work.

 Let's go for it I would say.

 Aiming for an RC on May 2nd and a final release on May 16th would work for
 me.


 I count 280 BUG commits since 1.6.1, so we are going to need to thin
 those out.


 Indeed. We can discard all commits related to NA and datetime, and then we
 should find some balance between how important the fixes are and how much
 risk there is that they break something. I agree with the couple of
 backports you've done so far, but I propose to do the rest via PRs.

 There are also build issues. I checked all of those and sent a PR with
 backports of all the relevant ones:
 https://github.com/numpy/numpy/pull/258


Hi Ralf, I went ahead and merged those. What's the easiest way to make
things merge into the maintenance/1.6.x branch in the pull requests?

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] A 1.6.2 release?

2012-04-22 Thread Ralf Gommers
On Sun, Apr 22, 2012 at 3:44 PM, Charles R Harris charlesr.har...@gmail.com
 wrote:



 On Sun, Apr 22, 2012 at 5:25 AM, Ralf Gommers ralf.gomm...@googlemail.com
  wrote:



 On Sat, Apr 21, 2012 at 5:16 PM, Charles R Harris 
 charlesr.har...@gmail.com wrote:



 On Sat, Apr 21, 2012 at 2:46 AM, Ralf Gommers 
 ralf.gomm...@googlemail.com wrote:



 On Fri, Apr 20, 2012 at 8:04 PM, Charles R Harris 
 charlesr.har...@gmail.com wrote:

 Hi All,

 Given the amount of new stuff coming in 1.7 and the slip in its
 schedule, I wonder if it would be worth putting out a 1.6.2 release with
 fixes for einsum, ticket 1578, and perhaps some others. My reasoning is that
 the fall releases of Fedora and Ubuntu are likely to still use 1.6, and they
 might as well use a somewhat fixed-up version. The downside is that locating
 and backporting fixes is likely to be a fair amount of work. A 1.7 release
 would be preferable, but I'm not sure when we can make that happen.


 Travis still sounded hopeful of being able to resolve the 1.7 issues
 relatively soon. On the other hand, even if that's done in one month we'll
 still miss Debian stable and a 1.6.2 release won't be *that* much work.

 Let's go for it I would say.

 Aiming for an RC on May 2nd and a final release on May 16th would work for
 me.


 I count 280 BUG commits since 1.6.1, so we are going to need to thin
 those out.


 Indeed. We can discard all commits related to NA and datetime, and then
 we should find some balance between how important the fixes are and how
 much risk there is that they break something. I agree with the couple of
 backports you've done so far, but I propose to do the rest via PRs.

 There are also build issues. I checked all of those and sent a PR with
 backports of all the relevant ones:
 https://github.com/numpy/numpy/pull/258


 Hi Ralf, I went ahead and merged those. What's the easiest way to make
 things merge into the maintenance/1.6.x branch in the pull requests?


When sending a PR: in your own GitHub repo you press "Pull request", then
in the next screen, under "Base branch - tag - commit", you change the branch
from master to maintenance/1.6.x. Then press "Update commit range".

Then merge like normal.
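
For the backport branch itself, the usual sequence is something like this (a
sketch, assuming a remote named upstream that points at the main numpy repo;
the branch name is arbitrary):

  git fetch upstream
  git checkout -b backport-foo upstream/maintenance/1.6.x
  git cherry-pick -x <sha-of-fix>   # -x records the original commit id
  git push origin backport-foo

and then open the pull request with maintenance/1.6.x as the base branch, as
described above.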

Ralf
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] Euroscipy 2012 - abstract deadline soon (April 30) + sprints

2012-04-22 Thread Emmanuelle Gouillart
Hello,

This is a reminder of the approaching deadline for abstract submission for
the Euroscipy 2012 conference: the deadline is April 30, one week from now.

Euroscipy 2012 will be held in **Brussels**, **August 23-27**, at the
Université Libre de Bruxelles (ULB, Solbosch Campus). 

The EuroSciPy meeting is a cross-disciplinary gathering focused on the
use and development of the Python language in scientific research and
industry. This event strives to bring together both users and developers
of scientific tools, as well as academic research and state-of-the-art
industry.

More information about the conference, including practical details, can be
found on the conference website:
http://www.euroscipy.org/conference/euroscipy2012

We are soliciting talks and posters that discuss topics related to
scientific computing using Python. These include applications, teaching,
future development directions, and research. We welcome contributions
from industry as well as the academic world.

Submission guidelines can be found at
http://www.euroscipy.org/card/euroscipy2012_call_for_contributions

Also, rooms are available at the ULB for sprints on Tuesday August 28th 
and Wednesday 29th. If you wish to organize a sprint at Euroscipy, please
get in touch with Berkin Malkoc (malk...@itu.edu.tr).

Any other questions should be addressed exclusively to
org-t...@lists.euroscipy.org

We apologize for the inconvenience if you received this e-mail through
several mailing lists.

--
Emmanuelle, for the organizing team

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] the state of NA/masking

2012-04-22 Thread Nathaniel Smith
Hi all,

Travis, Mark, and I talked on Skype this week about how to
productively move forward with the NA debate, and I got picked to
summarize for the list :-).

There are three main things we discussed:

1) About process: We seem to agree that this discussion has been
ineffective for a variety of reasons, and that it would be best to
back up and try the consensus-based approach. Maybe not everyone
agrees... I'm not sure how we go about building consensus on whether
we need consensus? And we noted that we may not actually all mean the
same thing by that. To start a discussion, I'll write up separately
what I understand by that term.

2) If we require consensus on our NA implementation, then we have a
problem for the 1.7.0 release. The problem is this:
  -- We have some kind of commitment to keeping compatibility between releases
  -- Therefore, if we release with NA masks, then we have some kind of
commitment to continuing to support these in some form going forward
  -- But as per above, we can't make such a commitment until we have
consensus, and we don't have consensus. Even if we end up deciding
that the current code is the best thing ever, we haven't done that
yet.

Therefore, we have a kind of constrained optimization problem: we need
to find the best way to adjust our "some kind of commitment", or the
current code, or both, so that we can release 1.7. Alternatively we
could delay the release until we have reached and implemented
consensus, but I have an allergy to putting such amorphous things on
our critical path, and I suspect I'm not the only one. (If it turns
out that consensus is quick and the release is slow for other reasons,
then that'd be great, of course, but why depend on it if we don't have
to?)

I'll also send a separate email to try and lay out the main options
here, as a basis for discussion.

3) And, in the long run, there's the actual question of what kind of
NA support we actually want in numpy. A major problem here is that
it's very difficult for anyone who hasn't spent huge amounts of time
wading through the mailing list to actually understand what the points
of contention are. So, Mark and I are going to *co*-write a document
explaining what we see as the main problems, and trying to clarify our
disagreements. Of course, this still won't include everyone's point of
view, but hopefully it will serve as a good starting point for... you
guessed it... discussion.

Cheers,
-- Nathaniel
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] NEP mask code and the 1.7 release

2012-04-22 Thread Nathaniel Smith
We need to decide what to do with the NA masking code currently in
master, vis-a-vis the 1.7 release. While this code is great at what it
does, we don't actually have consensus yet that it's the best way to
give our users what they want/need -- or even an appropriate way. So
we need to figure out how to release 1.7 without committing ourselves
to supporting this design in the future.

Background: what does the code currently in master do?
------------------------------------------------------

It adds 3 pointers at the end of the PyArrayObject struct (which is
better known as the numpy.ndarray object). These new struct members,
and some accessors for them, are exposed as part of the public API.
There are also a few additions to the Python-level API (mask= argument
to np.array, skipna= argument to ufuncs, etc.)
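
For concreteness, the Python-level additions look roughly like this (a
sketch from memory -- the exact argument spellings in master may differ):

In []: a = np.array([1., 2., 3.], maskna=True)  # array with an NA mask
In []: a[1] = np.NA                             # mask out one element
In []: np.sum(a, skipna=True)                   # reduce over the unmasked values
Out[]: 4.0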

What does this mean for compatibility?
--------------------------------------

The change in the ndarray struct is not as problematic as it might
seem, compatibility-wise, since Python objects are almost always
referred to by pointers. Since the initial part of the struct will
continue to have the same memory layout, existing source and binary
code that works with PyArrayObject *pointers* will continue to work
unchanged.

One place where the actual struct size matters is for any C-level
ndarray subclasses, which will have their memory layout change, and
thus will need to be recompiled. (Python-level ndarray subclasses will
have their memory layout change as well -- e.g., they will have
different __dictoffset__ values -- but it's unlikely that any existing
Python code depends on such details.)

What if we want to change our minds later?
------------------------------------------

For the same reasons as given above, any new code which avoids
referencing the new struct fields referring to masks, or using the new
masking APIs, will continue to work even if the masking is later
removed.

Any new code which *does* refer to the new masking APIs, or references
the fields directly, will break if masking is later removed.
Specifically, source will fail to compile, and existing binaries will
silently access memory that is past the end of the PyArrayObject
struct, which will have unpredictable consequences. (Most likely
segfaults, but no guarantees.) This applies even to code which simply
tries to check whether a mask is present.

So I think the preconditions for leaving this code as-is for 1.7 are
that we must agree:
  * We are willing to require a recompile of any C-level ndarray
subclasses (do any exist?)
  * We are willing to make absolutely no guarantees about future
compatibility for code which uses APIs marked experimental
  * We are willing for this breakage to occur in the form of random segfaults
  * We are okay with the extra 3 pointers worth of memory overhead on
each ndarray

Personally I can live with all of these if everyone else can, but I'm
nervous about reducing our compatibility guarantees like that, and
we'd probably need, at a minimum, a flashier EXPERIMENTAL sign than we
currently have. (Maybe we should resurrect the weasels ;-) [1])

[1] http://mail.scipy.org/pipermail/numpy-discussion/2012-March/061204.html

Any other options?
------------------

Alternative 1: The obvious other option is to go through and move all
the strictly mask-related code out of master and into a branch.
Presumably this wouldn't include all the infrastructure that Mark
added, since a lot of it is e.g. shared with where=, and that would
stay. Even so, this would be a big and possibly time-consuming change.

Alternative 2: After auditing the code a bit, the cleanest third
option I can think of is:

1. Go through and make sure that all numpy-internal access to the new
maskna fields happens via the accessor functions. (This patch would
produce no functionality change.)
2. Move the accessors into some numpy-internal header file, so that
user code can't call them.
3. Remove the mask= argument to Python-level ndarray constructors,
remove the new maskna_ fields from PyArrayObject, and modify the
accessors so that they always return NULL, 0, etc., as if the array
does not have a mask.

This would make 1.7 completely compatible with 1.6, API- and ABI-wise.
But it would also be a minimal code change, leaving the mask-related
code paths in place but inaccessible. If we decided to re-enable them,
it would just be a matter of reverting steps (3) and (2).

The main downside I see with this approach is that leaving a bunch of
inaccessible code paths lying around might make it harder to maintain
1.7 as a long term support release.

I'm personally willing to implement either of these changes. Or
perhaps there's another option that I'm not thinking of!

-- Nathaniel
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] What is consensus anyway

2012-04-22 Thread Nathaniel Smith
If you hang around big FOSS projects, you'll see the word "consensus"
come up a lot. For example, the glibc steering committee recently
dissolved itself in favor of governance directly by the consensus of
the people active in glibc development[1]. It's the governing rule of
the IETF, which defines many of the most important internet
standards[2]. It is the primary way decisions are made on
Wikipedia[3]. It's one of the fundamental aspects of accomplishing
things within the Apache framework[4].

[1] https://lwn.net/Articles/488778/
[2] https://www.ietf.org/tao.html#getting.things.done
[3] https://en.wikipedia.org/wiki/Wikipedia:Consensus
[4] https://www.apache.org/foundation/voting.html

But it turns out that this consensus thing is actually somewhat
mysterious, and most programmers immersed in this culture simply
pick it up by osmosis. And numpy in particular has a lot of developers
who are not coming from a classic FOSS programmer background! So this
is my personal attempt to articulate what it is, and why requiring
consensus is probably the best possible approach to project decision
making.

So what is consensus? Like, voting or something?
------------------------------------------------

This is surprisingly subtle and specific.

Consensus means something like, "everyone who cares is satisfied
with the result".

It does *not* mean
* Every opinion counts equally
* We vote on anything
* Every solution must be perfect and flawless
* Every solution must leave everyone overjoyed
* Everyone must sign off on every solution.

It *does* mean
* We invite people to speak up
* We generally trust individuals to decide how important their opinion is
* We generally trust individuals to decide whether or not they can
live with some outcome
* If they can't, then we take the time to find something better.

One simple way of stating this is, "everyone has a veto". In practice,
such vetoes are almost never used, so this rule is not particularly
illuminating on its own. Hence, the rest of this document.

What a waste of time! That all sounds very pretty on paper, but we
have stuff to get done.
------------------------------------------------------------

First, I'll note that this seemingly utopian scheme has a track record
of producing such impractical systems as TCP/IP, SMTP, DNS, Apache,
GCC, Linux, Samba, Python, ...

But mere empirical results are often less convincing than a good
story, so I will give you two. Why does a requirement for consensus
work?

Reason 1 (for optimists): *All of us are smarter than any of us.* For
a complex project with many users, it's extraordinarily difficult for
any one person to understand the full ramifications of any decision,
particularly the sort of far-reaching architectural decisions that are
most important. It's even more difficult to understand all the
possibilities of all the different possible solutions. In fact, it's
*extremely* common that the correct solution to a problem is the one
that no-one thinks of until after a month of annoying debate. Spending
a month to avoid an architectural problem that will haunt us for years
is an *excellent* trade-off, even if it feels interminable at the
time. Even two months. Usually disagreements are an indication that a
better solution is possible, even when it's not clear what that would
be.

Reason 2 (for pessimists): *You **will** reach consensus sooner or
later; it's less painful to do up front.* Example: NA handling. There
are two schemes that people use for this right now -- numpy.ma and
ugly NaN kluges (see e.g. nanmean). These are generally agreed to be
suboptimal. Recently, two new contenders have shown up: the NEP
masked-NA support currently in master, and the unrelated NA support in
pandas (which as a library is attracting a *lot* of the statistical
analysis folk who care about missing data, kudos to Wes). I think that
right now, the most likely future is that a few years from now, many
people will still be using the old solutions, and others will have
switched to the new (incompatible) solutions, and we will have *4*
suboptimal schemes in concurrent use. If (when) this happens, we will
have to re-open this discussion yet again, but now with a heck of a
mess to clean up. This is FOSS -- if people aren't convinced by your
solution, they will just ignore it and do their own thing. So a policy
that allows changes to be made without consensus is a recipe for
entrenching disagreements and splitting the community.

Okay, great, but even if it's the best thing ever, we *can't* hold a
vote on every change! What are you actually suggesting we do?
-------------------------------------------------------------

Right, that's not the idea. Most changes are pretty obviously
uncontroversial, and in fact we usually have the opposite problem --
it's hard to get people to do code review!

So having consensus on every change is an ideal, and in practice, just
following the reasonable person principle lets us 

Re: [Numpy-discussion] What is consensus anyway

2012-04-22 Thread Charles R Harris
On Sun, Apr 22, 2012 at 4:15 PM, Nathaniel Smith n...@pobox.com wrote:

 If you hang around big FOSS projects, you'll see the word "consensus"
 come up a lot. For example, the glibc steering committee recently
 dissolved itself in favor of governance directly by the consensus of
 the people active in glibc development[1]. It's the governing rule of
 the IETF, which defines many of the most important internet
 standards[2]. It is the primary way decisions are made on
 Wikipedia[3]. It's one of the fundamental aspects of accomplishing
 things within the Apache framework[4].

 [1] https://lwn.net/Articles/488778/
 [2] https://www.ietf.org/tao.html#getting.things.done
 [3] https://en.wikipedia.org/wiki/Wikipedia:Consensus
 [4] https://www.apache.org/foundation/voting.html

 But it turns out that this consensus thing is actually somewhat
 mysterious, and most programmers immersed in this culture simply
 pick it up by osmosis. And numpy in particular has a lot of developers
 who are not coming from a classic FOSS programmer background! So this
 is my personal attempt to articulate what it is, and why requiring
 consensus is probably the best possible approach to project decision
 making.

 So what is consensus? Like, voting or something?
 ------------------------------------------------

 This is surprisingly subtle and specific.

 Consensus means something like, "everyone who cares is satisfied
 with the result".

 It does *not* mean
 * Every opinion counts equally
 * We vote on anything
 * Every solution must be perfect and flawless
 * Every solution must leave everyone overjoyed
 * Everyone must sign off on every solution.

 It *does* mean
 * We invite people to speak up
 * We generally trust individuals to decide how important their opinion is
 * We generally trust individuals to decide whether or not they can
 live with some outcome
 * If they can't, then we take the time to find something better.

 One simple way of stating this is, "everyone has a veto". In practice,
 such vetoes are almost never used, so this rule is not particularly
 illuminating on its own. Hence, the rest of this document.

 What a waste of time! That all sounds very pretty on paper, but we
 have stuff to get done.

 ------------------------------------------------------------

 First, I'll note that this seemingly utopian scheme has a track record
 of producing such impractical systems as TCP/IP, SMTP, DNS, Apache,
 GCC, Linux, Samba, Python, ...


Linux is Linus' private tree. Everything that goes in is his decision,
everything that stays out is his decision. Of course, he delegates much of
the work to people he trusts, but it doesn't even reach the level of a
BDFL, it's DFL. As for consensus, it basically comes down to convincing the
gatekeepers one level below Linus that your code might be useful. So bad
example. Same with TCP/IP, which was basically Kahn and Cerf consulting
with a few others and working by request of DARPA. GCC was Richard Stallman
(I got one of the first tapes for a $30 donation), Python was Guido. Some
of the projects later developed some form of governance but Guido, for
instance, can veto anything he dislikes even if he is disinclined to do so.
I'm not saying you're wrong about open source, I'm just saying that
each project differs and it is wrong to imply that they follow some common
form of governance under the rubric FOSS and that they all seek consensus.
And they certainly don't *start* that way. And there are also plenty of
projects that fail when the prime mover loses interest or folks get tired
of the politics.

But mere empirical results are often less convincing than a good
 story, so I will give you two. Why does a requirement for consensus
 work?

 Reason 1 (for optimists): *All of us are smarter than any of us.* For
 a complex project with many users, it's extraordinarily difficult for
 any one person to understand the full ramifications of any decision,
 particularly the sort of far-reaching architectural decisions that are
 most important. It's even more difficult to understand all the
 possibilities of all the different possible solutions. In fact, it's
 *extremely* common that the correct solution to a problem is the one
 that no-one thinks of until after a month of annoying debate. Spending
 a month to avoid an architectural problem that will haunt us for years
 is an *excellent* trade-off, even if it feels interminable at the
 time. Even two months. Usually disagreements are an indication that a
 better solution is possible, even when it's not clear what that would
 be.

 Reason 2 (for pessimists): *You **will** reach consensus sooner or
 later; it's less painful to do up front.* Example: NA handling. There
 are two schemes that people use for this right now -- numpy.ma and
 ugly NaN kluges (see e.g. nanmean). These are generally agreed to be
 suboptimal. Recently, two new contenders have shown up: the NEP
 masked-NA support currently in master, and the 

Re: [Numpy-discussion] NEP mask code and the 1.7 release

2012-04-22 Thread Charles R Harris
On Sun, Apr 22, 2012 at 4:15 PM, Nathaniel Smith n...@pobox.com wrote:

 We need to decide what to do with the NA masking code currently in
 master, vis-a-vis the 1.7 release. While this code is great at what it
 does, we don't actually have consensus yet that it's the best way to
 give our users what they want/need -- or even an appropriate way. So
 we need to figure out how to release 1.7 without committing ourselves
 to supporting this design in the future.

 Background: what does the code currently in master do?
 

 It adds 3 pointers at the end of the PyArrayObject struct (which is
 better known as the numpy.ndarray object). These new struct members,
 and some accessors for them, are exposed as part of the public API.
 There are also a few additions to the Python-level API (mask= argument
 to np.array, skipna= argument to ufuncs, etc.)

 What does this mean for compatibility?
 

 The change in the ndarray struct is not as problematic as it might
 seem, compatibility-wise, since Python objects are almost always
 referred to by pointers. Since the initial part of the struct will
 continue to have the same memory layout, existing source and binary
 code that works with PyArrayObject *pointers* will continue to work
 unchanged.

 One place where the actual struct size matters is for any C-level
 ndarray subclasses, which will have their memory layout change, and
 thus will need to be recompiled. (Python-level ndarray subclasses will
 have their memory layout change as well -- e.g., they will have
 different __dictoffset__ values -- but it's unlikely that any existing
 Python code depends on such details.)

 What if we want to change our minds later?
 ------------------------------------------

 For the same reasons as given above, any new code which avoids
 referencing the new struct fields referring to masks, or using the new
 masking APIs, will continue to work even if the masking is later
 removed.

 Any new code which *does* refer to the new masking APIs, or references
 the fields directly, will break if masking is later removed.
 Specifically, source will fail to compile, and existing binaries will
 silently access memory that is past the end of the PyArrayObject
 struct, which will have unpredictable consequences. (Most likely
 segfaults, but no guarantees.) This applies even to code which simply
 tries to check whether a mask is present.

 So I think the preconditions for leaving this code as-is for 1.7 are
 that we must agree:
  * We are willing to require a recompile of any C-level ndarray
 subclasses (do any exist?)
  * We are willing to make absolutely no guarantees about future
 compatibility for code which uses APIs marked experimental
  * We are willing for this breakage to occur in the form of random
 segfaults
  * We are okay with the extra 3 pointers worth of memory overhead on
 each ndarray

 Personally I can live with all of these if everyone else can, but I'm
 nervous about reducing our compatibility guarantees like that, and
 we'd probably need, at a minimum, a flashier EXPERIMENTAL sign than we
 currently have. (Maybe we should resurrect the weasels ;-) [1])

 [1]
 http://mail.scipy.org/pipermail/numpy-discussion/2012-March/061204.html

 Any other options?
 

 Alternative 1: The obvious other option is to go through and move all
 the strictly mask-related code out of master and into a branch.
 Presumably this wouldn't include all the infrastructure that Mark
 added, since a lot of it is e.g. shared with where=, and that would
 stay. Even so, this would be a big and possibly time-consuming change.

 Alternative 2: After auditing the code a bit, the cleanest third
 option I can think of is:

 1. Go through and make sure that all numpy-internal access to the new
 maskna fields happens via the accessor functions. (This patch would
 produce no functionality change.)
 2. Move the accessors into some numpy-internal header file, so that
 user code can't call them.
 3. Remove the mask= argument to Python-level ndarray constructors,
 remove the new maskna_ fields from PyArrayObject, and modify the
 accessors so that they always return NULL, 0, etc., as if the array
 does not have a mask.

 This would make 1.7 completely compatible with 1.6, API- and ABI-wise.
 But it would also be a minimal code change, leaving the mask-related
 code paths in place but inaccessible. If we decided to re-enable them,
 it would just be a matter of reverting steps (3) and (2).

 The main downside I see with this approach is that leaving a bunch of
 inaccessible code paths lying around might make it harder to maintain
 1.7 as a long term support release.

 I'm personally willing to implement either of these changes. Or
 perhaps there's another option that I'm not thinking of!



I'm not deeply invested in the current version of masked NA. OTOH, code
development usually goes through several 

Re: [Numpy-discussion] NEP mask code and the 1.7 release

2012-04-22 Thread Charles R Harris
On Sun, Apr 22, 2012 at 6:26 PM, Charles R Harris charlesr.har...@gmail.com
 wrote:



 On Sun, Apr 22, 2012 at 4:15 PM, Nathaniel Smith n...@pobox.com wrote:

 We need to decide what to do with the NA masking code currently in
 master, vis-a-vis the 1.7 release. While this code is great at what it
 does, we don't actually have consensus yet that it's the best way to
 give our users what they want/need -- or even an appropriate way. So
 we need to figure out how to release 1.7 without committing ourselves
 to supporting this design in the future.

 Background: what does the code currently in master do?
 

 It adds 3 pointers at the end of the PyArrayObject struct (which is
 better known as the numpy.ndarray object). These new struct members,
 and some accessors for them, are exposed as part of the public API.
 There are also a few additions to the Python-level API (mask= argument
 to np.array, skipna= argument to ufuncs, etc.)

 What does this mean for compatibility?
 

 The change in the ndarray struct is not as problematic as it might
 seem, compatibility-wise, since Python objects are almost always
 referred to by pointers. Since the initial part of the struct will
 continue to have the same memory layout, existing source and binary
 code that works with PyArrayObject *pointers* will continue to work
 unchanged.

 One place where the actual struct size matters is for any C-level
 ndarray subclasses, which will have their memory layout change, and
 thus will need to be recompiled. (Python-level ndarray subclasses will
 have their memory layout change as well -- e.g., they will have
 different __dictoffset__ values -- but it's unlikely that any existing
 Python code depends on such details.)

 What if we want to change our minds later?
 ------------------------------------------

 For the same reasons as given above, any new code which avoids
 referencing the new struct fields referring to masks, or using the new
 masking APIs, will continue to work even if the masking is later
 removed.

 Any new code which *does* refer to the new masking APIs, or references
 the fields directly, will break if masking is later removed.
 Specifically, source will fail to compile, and existing binaries will
 silently access memory that is past the end of the PyArrayObject
 struct, which will have unpredictable consequences. (Most likely
 segfaults, but no guarantees.) This applies even to code which simply
 tries to check whether a mask is present.

 So I think the preconditions for leaving this code as-is for 1.7 are
 that we must agree:
  * We are willing to require a recompile of any C-level ndarray
 subclasses (do any exist?)
  * We are willing to make absolutely no guarantees about future
 compatibility for code which uses APIs marked experimental
  * We are willing for this breakage to occur in the form of random
 segfaults
  * We are okay with the extra 3 pointers worth of memory overhead on
 each ndarray

 Personally I can live with all of these if everyone else can, but I'm
 nervous about reducing our compatibility guarantees like that, and
 we'd probably need, at a minimum, a flashier EXPERIMENTAL sign than we
 currently have. (Maybe we should resurrect the weasels ;-) [1])

 [1]
 http://mail.scipy.org/pipermail/numpy-discussion/2012-March/061204.html

 Any other options?
 

 Alternative 1: The obvious other option is to go through and move all
 the strictly mask-related code out of master and into a branch.
 Presumably this wouldn't include all the infrastructure that Mark
 added, since a lot of it is e.g. shared with where=, and that would
 stay. Even so, this would be a big and possibly time-consuming change.

 Alternative 2: After auditing the code a bit, the cleanest third
 option I can think of is:

 1. Go through and make sure that all numpy-internal access to the new
 maskna fields happens via the accessor functions. (This patch would
 produce no functionality change.)
 2. Move the accessors into some numpy-internal header file, so that
 user code can't call them.
 3. Remove the mask= argument to Python-level ndarray constructors,
 remove the new maskna_ fields from PyArrayObject, and modify the
 accessors so that they always return NULL, 0, etc., as if the array
 does not have a mask.

 This would make 1.7 completely compatible with 1.6, API- and ABI-wise.
 But it would also be a minimal code change, leaving the mask-related
 code paths in place but inaccessible. If we decided to re-enable them,
 it would just be a matter of reverting steps (3) and (2).

 The main downside I see with this approach is that leaving a bunch of
 inaccessible code paths lying around might make it harder to maintain
 1.7 as a long term support release.

 I'm personally willing to implement either of these changes. Or
 perhaps there's another option that I'm not thinking of!



 I'm not deeply invested in 

Re: [Numpy-discussion] What is consensus anyway

2012-04-22 Thread Fernando Perez
Hi Nathaniel,

thanks for a solid writeup of this topic.  I just want to add a note
from personal experience, regarding this specific point:

On Sun, Apr 22, 2012 at 3:15 PM, Nathaniel Smith n...@pobox.com wrote:
 Usually disagreements are an indication that a
 better solution is possible, even when it's not clear what that would
 be.

I think this is *extremely* important, so I want to highlight it from
the rest of your post.  Regarding how IPython operates, I think we
have good evidence to illustrate the value of this... One of the
members of the IPython team who joined earliest is Brian Granger: he
started working on IPython around 2004 after a conversation we had in
the context of a SciPy conference.  Some of you may know that Brian
and I went to graduate school together, which means we've known each
other for much longer than IPython, and we've been good friends since.
But that alone doesn't ensure a smooth collaboration; in fact Brian
and I very often disagree *deeply* on design decisions about
IPython.

And yet, I think the project makes solid progress, not despite this
but in an important way *thanks* to this divergence.  Whenever we
disagree, it typically means that each of us is seeing a partial
solution to a problem, but not a really solid and complete one.  I
don't recall ever using my 'BDFL vote' in one of these discussions;
instead we just keep going around the problem.  Typically what happens
is that after much discussion, we settle on a new solution that
neither of us had quite seen at the start. I mention Brian
specifically because he and I seem to be at opposite ends of some
weird spectrum; disagreement between the other parties appears to fall
somewhere in between.

Here's an example that is currently in open discussion, and despite
the fact that I'm completely convinced that something like this should
go into IPython, I'm waiting.  We'll continue the discussion to either
find arguments that convince me otherwise, or to convince Brian of the
value of the PR:

https://github.com/ipython/ipython/pull/1343

It takes both patience and trust for this to work: we have to be
willing to wait out the long discussion, and we have to trust that
despite how much we may disagree on something, we both play fair and
ultimately only want what's best for the project.  That means giving
the other party the benefit of the doubt at every turn, and having a
willingness to let the discussion happen openly as long as is
necessary for the project to remain healthy.

For example in this case, I'm *really* convinced of my point, and I
think blocking this PR actively hurts users.  Is it worth saying "OK,
I'm overriding your concerns here and pushing this forward"?
Absolutely NOT!  I'd only:

- alienate Brian, a key member of the project without whom IPython
would be nowhere near where it is today, and decrease his motivation
to continue working
- kill the opportunity for a discussion to produce an even cleaner
solution than what we've seen so far
- piss off a good friend.  I put this last because while that's
actually a very important reason for me, the fact that Brian and I are
good personal friends is secondary here: this is about discussion
between contributors independent of their personal relationships.

I hope this perspective is useful...

 1. Make it as easy as possible for people to see what's going on and
 join the discussion. All decisions and reasoning behind decisions take
 place in public. (On this note, it would be *really* good if pull
 request notifications went to the list.)

If anyone knows how to do this, let me know; I'd like to do the same
for IPython and our -dev list.

Cheers,

f
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion