Re: [Numpy-discussion] Notes from the numpy dev meeting at scipy 2015
Hi Nathaniel, Thanks for the detailed reply; it helped a lot to understand how one could, indeed, have dtypes contain units. And if one had not just on-the-fly conversion from int to float as part of an internal loop, but also on-the-fly multiplication, then it would even be remarkably fast. Will be interesting to think this through in more detail. Still think subclassing ndarray is not all *that* bad (MaskedArray is a different story...), and it may still be needed for my other examples, but perhaps masked/uncertainties do work with the collections idea. Anyway, it now makes sense to focus on dtype first. Thanks again, Marten
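[The on-the-fly conversion Marten mentions is what NumPy's buffered casting already does inside ufunc loops; a unit-aware dtype would hook in a multiplication at the same point. A minimal illustration of the existing behavior, using only standard NumPy:]

    import numpy as np

    i = np.arange(5)            # int64
    f = np.linspace(0, 1, 5)    # float64
    # np.add has no (int64, float64) inner loop; the casting machinery
    # converts int64 -> float64 on the fly, buffer by buffer, inside the
    # iteration -- the same hook where a unit dtype could insert a scale
    # factor.
    print(np.add(i, f).dtype)   # float64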
Re: [Numpy-discussion] Notes from the numpy dev meeting at scipy 2015
From my perspective, a major advantage to dtypes is composability. For example, it's hard to write a library like dask.array (out-of-core arrays) that can support holding any conceivable ndarray subclass (like MaskedArray or Quantity), but handling arbitrary dtypes is quite straightforward -- and that dtype information can be passed directly on, without the container library knowing anything about the library that implements the dtype. Stephan
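[A toy sketch of the composability Stephan describes: a container that only moves chunks around and carries the dtype through without interpreting it. The ChunkedArray class here is hypothetical, not dask's actual implementation:]

    import numpy as np

    class ChunkedArray:
        """Toy out-of-core-style container: shuffles chunks around and
        never interprets the dtype it is carrying."""
        def __init__(self, chunks):
            self.chunks = list(chunks)
            self.dtype = self.chunks[0].dtype   # passed through, not inspected

        def map(self, func):
            # The result dtype is whatever func produces; the container
            # stays agnostic about what it means.
            return ChunkedArray([func(c) for c in self.chunks])

    x = ChunkedArray([np.arange(3, dtype='f8'), np.arange(3, 6, dtype='f8')])
    print(x.map(np.sin).dtype)   # float64, or any third-party dtype func returns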
Re: [Numpy-discussion] Notes from the numpy dev meeting at scipy 2015
On Sun, Aug 30, 2015 at 9:12 PM, Marten van Kerkwijk wrote:

> Hi Nathaniel, others,
>
> I read the discussion of plans with interest. One item that struck me is that while there are great plans to have a proper extensible and presumably subclassable dtype, it is discouraged to subclass ndarray itself (rather, it is encouraged to use a broader array interface). From my experience with astropy in both Quantity (an ndarray subclass), Time (a separate class containing high precision times using two ndarray float64), and Table (initially holding structured arrays, but now sets of Columns, which themselves are ndarray subclasses), I'm not convinced the broader, new containers approach is that much preferable. Rather, it leads to a lot of boiler-plate code to reimplement things ndarray does already (since one is effectively just calling the methods on the underlying arrays).
>
> I also think the idea that a dtype becomes something that also contains a unit is a bit odd. Shouldn't dtype just be about how data is stored? Why include meta-data such as units?
>
> Instead, I think a quantity is most logically seen as numbers with a unit, just like masked arrays are numbers with masks, and variables numbers with uncertainties. Each of these cases adds extra information in different forms, and all are quite easily thought of as subclasses of ndarray where all operations do the normal operation, plus some extra work to keep the extra information up to date.

The intuition behind the array/dtype split is that an array is just a container: it knows how to shuffle bytes around, be reshaped, indexed, etc., but it knows nothing about the meaning of the items it holds -- as far as it's concerned, each entry is just an opaque binary blob. If it wants to actually do anything with these blobs, it has to ask the dtype for help. The dtype, OTOH, knows how to interpret these blobs, and (in cooperation with ufuncs) how to perform operations on them, but it doesn't need to know how they're stored, or about slicing or anything like that -- all that's the container's job.

Think about it this way: does it make sense to have a sparse array of numbers-with-units? How about a blosc-style compressed array of numbers-with-units? If yes, then numbers-with-units are a special kind of dtype, not a special kind of array.

Another way of getting this intuition: if I have 8 bytes, that could be an int64, or it could be a float64. Which one it is doesn't affect how it's stored at all -- either way it's stored as a chunk of 8 arbitrary bytes. What it affects is how we *interpret* these bytes -- e.g. there is one function called "int64 addition" which takes two 8-byte chunks and returns a new 8-byte chunk as the result, and a second function called "float64 addition" which takes those same two 8-byte chunks and returns a different one. The dtype tells you which of these operations should be used for a particular array.

What's special about a float64-with-units? Well, it's 8 bytes, but the addition operation is different from regular float64 addition: it has to do some extra checks and possibly unit conversions. This is exactly what the ufunc dtype dispatch and casting system is there for. This also solves your problem with having to write lots of boilerplate code, b/c if this is a dtype then it means you can just use the actual ndarray class directly without subclassing or anything :-).
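[Nathaniel's 8-bytes point can be seen directly in plain NumPy: the same buffer reinterpreted under two dtypes:]

    import numpy as np

    a = np.array([1.0, 2.0])    # float64: two 8-byte chunks
    # Same bytes, reinterpreted as int64 -- only the *interpretation*
    # changes, not the storage:
    print(a.view('int64'))      # [4607182418800017408 4611686018427387904]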
> Anyway, my suggestion would be to *encourage* rather than discourage ndarray subclassing, and help this by making ndarray (even) better.

So, we very much need robust support for objects-that-quack-like-an-array that are *not* ndarrays, because ndarray subclasses are forced to use ndarray-style strided in-memory storage, and there's huge demand for objects that expose an array-like interface but that use a different storage strategy underneath: sparse arrays, compressed arrays (like blosc), out-of-core arrays, computed-on-demand arrays (like dask), distributed arrays, etc. etc. And once we have solid support for duck-arrays and for user-defined dtypes (as discussed above), then those two things remove a huge amount of the motivation for subclassing ndarray.

At the same time, ndarray subclassing is... nearly unmaintainable, AFAICT. The problem with subclassing is that you're basically taking some interface, making a copy of it, and then monkeypatching the copy. As you would expect, this is intrinsically very fragile, because it breaks abstraction barriers. Suddenly things that used to be implementation details -- like which methods are implemented in terms of which other methods -- become part of the public API. And there's never been any coherent, documentable theory of how ndarray subclassing is *supposed* to work, so in practice it's just a bunch of ad hoc hooks designed around the needs of np.matrix and np.ma. We get a regular stream of bug reports asking us to tweak things one way or another, and
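[The boilerplate problem with subclassing is easy to reproduce: the existing hooks carry metadata along, but nothing keeps it *correct*. A minimal sketch with a toy Unit class (not astropy's Quantity):]

    import numpy as np

    class Unit(np.ndarray):
        """Toy quantity: an ndarray subclass carrying a .unit string."""
        def __new__(cls, data, unit):
            obj = np.asarray(data, dtype=float).view(cls)
            obj.unit = unit
            return obj

        def __array_finalize__(self, obj):
            # called for views and ufunc outputs: blindly copies the unit
            self.unit = getattr(obj, 'unit', None)

    a = Unit([1.0, 2.0], 'm')
    print((a * a).unit)   # 'm' -- physically it should be m**2; every
                          # operation needs extra code to keep the
                          # metadata up to date, hence the boilerplate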
Re: [Numpy-discussion] Notes from the numpy dev meeting at scipy 2015
Hi Nathaniel, others,

I read the discussion of plans with interest. One item that struck me is that while there are great plans to have a proper extensible and presumably subclassable dtype, it is discouraged to subclass ndarray itself (rather, it is encouraged to use a broader array interface). From my experience with astropy in both Quantity (an ndarray subclass), Time (a separate class containing high precision times using two ndarray float64), and Table (initially holding structured arrays, but now sets of Columns, which themselves are ndarray subclasses), I'm not convinced the broader, new containers approach is that much preferable. Rather, it leads to a lot of boiler-plate code to reimplement things ndarray does already (since one is effectively just calling the methods on the underlying arrays).

I also think the idea that a dtype becomes something that also contains a unit is a bit odd. Shouldn't dtype just be about how data is stored? Why include meta-data such as units?

Instead, I think a quantity is most logically seen as numbers with a unit, just like masked arrays are numbers with masks, and variables numbers with uncertainties. Each of these cases adds extra information in different forms, and all are quite easily thought of as subclasses of ndarray where all operations do the normal operation, plus some extra work to keep the extra information up to date.

Anyway, my suggestion would be to *encourage* rather than discourage ndarray subclassing, and help this by making ndarray (even) better.

All the best, Marten

On Thu, Aug 27, 2015 at 11:03 AM, josef.p...@gmail.com wrote: On Wed, Aug 26, 2015 at 10:06 AM, Travis Oliphant tra...@continuum.io wrote: [snip: quoted text of the Travis Oliphant / Nathaniel Smith exchange, reproduced in full further down in this thread]
Re: [Numpy-discussion] Notes from the numpy dev meeting at scipy 2015
I thought I'd add a little more specifically about the kind of graphics/point cloud work I'm doing right now at Thinkbox, and how it relates. To echo Francesc's point about NumPy already being an industry standard: within the VFX/graphics industry there is a reference platform definition on Linux, and the most recent iteration of that specifies a version of NumPy. It also includes a bunch of other open source libraries worth taking a look at if you haven't seen them before: http://www.vfxplatform.com/

Point cloud/particle system data, mesh geometry, numerical grids (both dense and sparse), and many other primitive components in graphics are built out of arrays. What NumPy represents for that kind of data is amazing. The extra baggage of an API tied to the CPython GIL can be a hard pill to swallow, though, and this is one of the reasons I'm hopeful that as DyND continues maturing, it can make inroads into places NumPy hasn't been able to.

Thanks, Mark

On Wed, Aug 26, 2015 at 9:45 AM, Irwin Zaid iz...@continuum.io wrote: [snip: quoted text of Irwin Zaid's message, reproduced in full further down in this thread]
Re: [Numpy-discussion] Notes from the numpy dev meeting at scipy 2015
On Tue, Aug 25, 2015 at 12:21 PM, Antoine Pitrou solip...@pitrou.net wrote: On Tue, 25 Aug 2015 03:03:41 -0700 Nathaniel Smith n...@pobox.com wrote:

Supporting third-party dtypes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[...] Some features that would become straightforward to implement (e.g. even in third-party libraries) if this were fixed:
- missing value support
- physical unit tracking (meters / seconds -> array of velocity; meters + seconds -> error)
- better and more diverse datetime representations (e.g. datetimes with attached timezones, or using funky geophysical or astronomical calendars)
- categorical data
- variable length strings
- strings-with-encodings (e.g. latin1)
- forward mode automatic differentiation (write a function that computes f(x) where x is an array of float64; pass that function an array with a special dtype and get out both f(x) and f'(x))
- probably others I'm forgetting right now

It should also be the opportunity to streamline datetime64 and timedelta64 dtypes. Currently the unit information is IIRC hidden in some weird metadata thing called the PyArray_DatetimeMetaData.

Yeah, and PyArray_DatetimeMetaData is an NpyAuxData, which is its own personal little object system implemented in C with its own reference counting system... the design of dtypes has great bones, but the current implementation has a lot of, um, historical baggage. -n -- Nathaniel J. Smith -- http://vorpus.org
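[The forward-mode autodiff item is worth unpacking, since it shows how far a dtype can reach. The effect can be simulated today with an object-dtype array of dual numbers; a real f'(x) dtype would do the same with unboxed storage and compiled loops. The Dual class below is a toy for illustration only:]

    import numpy as np

    class Dual:
        """Dual number a + b*eps with eps**2 == 0: carries f(x) and f'(x)."""
        def __init__(self, val, der=0.0):
            self.val, self.der = val, der
        def __add__(self, other):
            o = other if isinstance(other, Dual) else Dual(other)
            return Dual(self.val + o.val, self.der + o.der)
        __radd__ = __add__
        def __mul__(self, other):
            # product rule: (uv)' = u'v + uv'
            o = other if isinstance(other, Dual) else Dual(other)
            return Dual(self.val * o.val, self.der * o.val + self.val * o.der)
        __rmul__ = __mul__
        def __repr__(self):
            return f"Dual({self.val}, {self.der})"

    def f(x):
        return 3 * x * x + x   # f'(x) = 6x + 1

    # seed derivative of 1 at each point; object dtype applies Dual's
    # operators elementwise
    x = np.array([Dual(1.0, 1.0), Dual(2.0, 1.0)], dtype=object)
    print(f(x))   # [Dual(4.0, 7.0) Dual(14.0, 13.0)]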
Re: [Numpy-discussion] Notes from the numpy dev meeting at scipy 2015
Hi Travis,

Thanks for taking the time to write up your thoughts! I have many thoughts in return, but I will try to restrict myself to two main ones :-).

1) On the question of whether work should be directed towards improving NumPy-as-it-is or instead towards a compatibility-breaking replacement: There's plenty of room for debate about whether it's better engineering practice to try and evolve an existing system in place versus starting over, and I guess we have some fundamental disagreements there, but I actually think this debate is a distraction -- we can agree to disagree, because in fact we have to try both.

At a practical level: NumPy *is* going to continue to evolve, because it has users and people interested in evolving it; similarly, dynd and other alternative libraries will also continue to evolve, because they also have people interested in doing it. And at a normative level, this is a good thing! If NumPy and dynd both get better, then that's awesome: the worst case is that NumPy adds the new features that we talked about at the meeting, and dynd simultaneously becomes so awesome that everyone wants to switch to it, and the result of this would be... that those NumPy features are exactly the ones that will make the transition to dynd easier. Or if some part of that plan goes wrong, then well, NumPy will still be there as a fallback, and in the mean time we've actually fixed the major pain points our users are begging us to fix.

You seem to be urging us all to make a double-or-nothing wager that your extremely ambitious plans will all work out, with the entire numerical Python ecosystem as the stakes. I think this ambition is awesome, but maybe it'd be wise to hedge our bets a bit?

2) You really emphasize this idea of an ABI-breaking (but not API-breaking) release, and I think this must indicate some basic gap in how we're looking at things. Where I'm getting stuck here is that... I actually can't think of anything important that we can't do now, but could if we were allowed to break ABI compatibility. The kinds of things that break ABI but keep API are like... rearranging what order the fields in a struct fall in, or changing the numeric value of opaque constants like NPY_ARRAY_WRITEABLE. The biggest win I can think of is that we could save a few bytes per array by arranging the fields inside the ndarray struct more optimally, but that's hardly a feature to hang a 2.0 on.

You seem to have a vision of this ABI-breaking release as being something very different from that, and I'm not clear on what this vision is. The main reason I personally am against having a big ABI-breaking release is not that I hate ABI breakage a priori, it's that all the big features that I care about and that users are asking for seem to be ones that... don't actually require doing that. At most they seem to get a mild benefit from breaking some obscure corner cases. So the cost/benefits don't make any sense to me.

So: can you give a concrete example of a change you have in mind where breaking ABI would be the key enabler?

(I guess you might also be thinking of a separate issue that you sort of allude to: perhaps we will try to make changes which we think don't involve breaking the ABI, but discover too late that we have failed to fully understand the implications and have broken it by mistake. IIUC this is what happened in the 1.4 timeframe when datetime64 was merged and accidentally renumbered some of the NPY_* constants.

Partially I am less worried about this because I have a fair amount of confidence that our review and QA process has improved these days to the point that we would not let a change like that slip through by accident -- we have a lot more active reviewers, people are sensitized to the issues, we've successfully landed intrusive changes like Sebastian's indexing rewrite, ... though this is very much second-hand impressions on my part, and I'd welcome input from folks like Chuck who have a clearer view on how things have changed from then to now. But more importantly, even if this is true, then I can't see how your proposal helps. If we aren't good enough at our jobs to predict when we'll break ABI, then by assumption it makes no sense to pick one release and decide that this is the one time that we'll break ABI.)

On Tue, Aug 25, 2015 at 12:00 PM, Travis Oliphant tra...@continuum.io wrote:

Thanks for the write-up Nathaniel. There is a lot of great detail and interesting ideas here. I am very eager to understand how to help NumPy and the wider community move forward however I can (my passions on this have not changed since 1999, though what I myself spend time on has changed). There are a lot of ways to think about approaching this, though. It's hard to get all the ideas on the table, and it was unfortunate we couldn't get everybody who is a core NumPy dev together in person to have this discussion as there are still a lot of questions unanswered and a lot of thought
Re: [Numpy-discussion] Notes from the numpy dev meeting at scipy 2015
On Tue, Aug 25, 2015 at 5:53 PM, David Cournapeau courn...@gmail.com wrote:

Thanks for the good summary Nathaniel.

Regarding dtype machinery, I agree casting is the hardest part. Unless the code has changed dramatically, this was the main reason why you could not make most of the dtypes separate from the numpy codebase (I tried to move the datetime dtype out of multiarray into a separate C extension some years ago). Being able to separate the dtypes from the multiarray module would be an obvious way to drive the internal API change.

For practical reasons I don't imagine we'll ever want to actually move the core dtypes out of multiarray -- if nothing else they will always remain a little bit special, like np.array([1.0, 2.0]) will just know that this should use the float64 dtype. But yeah, in general a good heuristic would be that -- aside from a few limited cases like that -- we want to make built-in dtypes and user-defined dtypes use the same APIs.

Regarding the use of cython in numpy, was there any discussion about the compilation/size cost of using cython, and talking to the cython team to improve this? Or was that considered acceptable with current cython for numpy. I am convinced cleanly separating the low level parts from the Python C API plumbing would be the single most important thing one could do to make the codebase more amenable to change.

It's still a more blue-sky idea than that... the discussion was more at the level of "is this something that is even worth trying to make work, and where are the problems?" The big immediate problem, before we got into code size issues, would be that we would need to be able to compile a mix of .pyx files and .c files into a single .so, while cython-generated code currently makes some strong assumptions about how each .pyx file will live in its own .so. From playing around with it I suspect the first version of making this work will be klugey indeed. But yeah, the thing to do would be for someone to dig in and make the kluges and then decide how to clean them up once you know where they are. -n -- Nathaniel J. Smith -- http://vorpus.org
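[For context on what does and doesn't work today: building one .pyx together with plain C sources into a single .so is routine; it's several .pyx files sharing one .so that needs the kluges Nathaniel mentions. A minimal sketch with hypothetical module and file names:]

    from setuptools import setup, Extension
    from Cython.Build import cythonize

    # One .pyx plus plain C sources compiled into a single extension is
    # the supported case; multiple .pyx files per .so is the hard part.
    ext = Extension("multiarray2", ["multiarray2.pyx", "lowlevel_loops.c"])
    setup(ext_modules=cythonize([ext]))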
Re: [Numpy-discussion] Notes from the numpy dev meeting at scipy 2015
On Wed, Aug 26, 2015 at 10:11 AM, Antoine Pitrou solip...@pitrou.net wrote: On Wed, 26 Aug 2015 16:45:51 +0000 (UTC) Irwin Zaid iz...@continuum.io wrote:

So, we see DyND as having a twofold purpose. The first is to expand upon the kinds of data that NumPy can represent and do computations upon. The second is to provide a standard array package that can cross the language barrier and easily interoperate between C++, Python, or whatever you want.

One possible limitation is that the lingua franca for language interoperability is C, not C++. DyND doesn't have to be written in C, but exposing a nice C API may help make it attractive to the various language runtimes out there. (Even those languages whose runtime doesn't have a compile-time interface to C generally have some kind of cffi or ctypes equivalent to load external C routines at runtime.) Regards, Antoine.

I kind of like the path LLVM has chosen here, of a stable C API and an unstable C++ API. This has both pros and cons though, so I'm not sure what will be right for DyND in the long term. -Mark
Re: [Numpy-discussion] Notes from the numpy dev meeting at scipy 2015
On Wed, 26 Aug 2015 16:45:51 +0000 (UTC) Irwin Zaid iz...@continuum.io wrote:

So, we see DyND as having a twofold purpose. The first is to expand upon the kinds of data that NumPy can represent and do computations upon. The second is to provide a standard array package that can cross the language barrier and easily interoperate between C++, Python, or whatever you want.

One possible limitation is that the lingua franca for language interoperability is C, not C++. DyND doesn't have to be written in C, but exposing a nice C API may help make it attractive to the various language runtimes out there. (Even those languages whose runtime doesn't have a compile-time interface to C generally have some kind of cffi or ctypes equivalent to load external C routines at runtime.)

Regards, Antoine.
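[The runtime-loading route Antoine mentions, concretely: grabbing a plain C symbol with ctypes, no compile-time interface needed. This sketch assumes a POSIX system where find_library can locate libm:]

    import ctypes
    import ctypes.util

    # Load the C math library at runtime and call a plain C routine.
    libm = ctypes.CDLL(ctypes.util.find_library('m'))
    libm.cos.restype = ctypes.c_double
    libm.cos.argtypes = [ctypes.c_double]
    print(libm.cos(0.0))   # 1.0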
Re: [Numpy-discussion] Notes from the numpy dev meeting at scipy 2015
26.08.2015, 14:14, Francesc Alted wrote: [clip] 2015-08-25 12:03 GMT+02:00 Nathaniel Smith n...@pobox.com:

Let's focus on evolving numpy as far as we can without major break-the-world changes (no numpy 2.0, at least in the foreseeable future). And, as a target for that evolution, let's change our focus from numpy as "NumPy is the library that gives you the np.ndarray object (plus some attached infrastructure)" to "NumPy provides the standard framework for working with arrays and array-like objects in Python".

Sorry to disagree here, but in my opinion NumPy *already* provides the standard framework for working with arrays and array-like objects in Python, as its huge popularity shows. If what you mean is that there are too many efforts trying to provide other, specialized data containers (things like DataFrame in pandas, DataArray/Dataset in xarray or carray/ctable in bcolz, just to mention a few), then let me say that I am of the opinion that there can't be a silver bullet for tackling all the problems that the PyData community is facing.

My reading of the above was that this was about multimethods, and allowing different types of containers to interoperate beyond the array interface and Python's builtin operator hooks. The exact performance details of course vary, and an algorithm written for in-memory arrays simply fails for too-large on-disk or distributed arrays. However, a case for a minimal common API probably could be made, especially in algorithms relying mainly on linear algebra. This is to a degree different from subclassing, as many of the array-like objects you might want do not have a simple strided memory model. Pauli
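[A sketch of the multimethod idea Pauli is reading into the roadmap: an np-like function that dispatches to a container-specific implementation instead of assuming strided memory. This toy registry is roughly the shape of what later shipped as NEP 18's __array_function__, but the names here are invented:]

    import numpy as np

    # Toy multimethod layer: functions dispatch on container type, so
    # on-disk or distributed arrays can supply their own algorithms.
    _impls = {}

    def implements(name, cls):
        def deco(f):
            _impls[(name, cls)] = f
            return f
        return deco

    def dot(a, b):
        try:
            return _impls[('dot', type(a))](a, b)
        except KeyError:
            raise TypeError(f"no dot() implementation for {type(a).__name__}")

    implements('dot', np.ndarray)(np.dot)   # register the in-memory case
    print(dot(np.eye(2), np.ones(2)))       # [1. 1.]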
Re: [Numpy-discussion] Notes from the numpy dev meeting at scipy 2015
On Wed, Aug 26, 2015 at 6:11 PM, Antoine Pitrou solip...@pitrou.net wrote: One possible limitation is that the lingua franca for language interoperability is C, not C++. DyND doesn't have to be written in C, but exposing a nice C API may help make it attractive to the various language runtimes out there.

That is absolutely true and a C API is on the long-term roadmap. At the moment, a C API is not needed for DyND to be stable and usable from Python, which is one reason we aren't doing it now. Irwin
Re: [Numpy-discussion] Notes from the numpy dev meeting at scipy 2015
Hello everyone,

Mark and I thought it would be good to weigh in here and also be explicitly around to discuss DyND. To be clear, neither of us has strong feelings on what NumPy *should* do -- we are both long-time NumPy users and we both see NumPy being around for a while. But, as Francesc mentioned, there is also the open question of where the community should be implementing new features. It would certainly be nice to not have duplication of effort, but a decision like that can only arise naturally from a broad consensus.

Travis covered DyND's history and its relationship with Continuum pretty well, so what's really missing here is what DyND is, where it is going, and how long we think it'll take to get there. We'll try to stick to those topics.

We designed DyND to fill what we saw as fundamental gaps in NumPy. These are not only missing features, but also limitations of its architecture. Many of these gaps have been mentioned several times before in this thread and elsewhere, but a brief list would include: better support for missing values, variable-length strings, GPUs, more extensible types, categoricals, more datetime features, ... Some of these were indeed on Nathaniel's list and many of them are already working (albeit sometimes partially) in DyND. And, yes, we strongly feel that NumPy's fundamental dependence on Python itself is a limitation. Why should we not take the fantastic success of NumPy and generalize it across other languages?

So, we see DyND as having a twofold purpose. The first is to expand upon the kinds of data that NumPy can represent and do computations upon. The second is to provide a standard array package that can cross the language barrier and easily interoperate between C++, Python, or whatever you want.

DyND, at the moment, is quite functional in some areas and lacking a bit in others. There is no doubt that it is still experimental and a bit unstable. But, it has advanced by a lot recently, and we are steadily working towards something like a version 1.0. In fact, DyND's internal C++ architecture stabilized some time ago -- what's missing now is really solid coverage of some common use cases, alongside up-to-date Python bindings and an easy installation process. All of these are in progress and advancing as quickly as we can make them.

On the other hand, we are also building out some other features. To give just one example that might excite people, DyND now has Numba interoperability -- one can write DyND's equivalent of a ufunc in Python and, with a single decorator, have a broadcasting or reduction callable that gets JITed or (soon) ahead-of-time compiled. Over the next few months, we are hopeful that we can get DyND into a state where it is largely usable by those familiar with NumPy semantics. The reason why we can be a bit more aggressive in our timeline now is because of the great support we are getting from Continuum.

With all that said, we are happy to be a part of any broader conversation involving NumPy and the community.

All the best, Irwin and Mark
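[For comparison, the decorator-defined-ufunc pattern Irwin describes has the same shape as Numba's vectorize on the NumPy side; the sketch below uses Numba's actual API, not DyND's:]

    import numba
    import numpy as np

    @numba.vectorize(['float64(float64, float64)'])
    def scaled_add(a, b):
        # elementwise kernel; the decorator turns it into a broadcasting
        # ufunc, JIT-compiled on first use
        return 2.0 * a + b

    print(scaled_add(np.arange(3.0), np.ones(3)))   # [1. 3. 5.]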
Re: [Numpy-discussion] Notes from the numpy dev meeting at scipy 2015
On Mi, 2015-08-26 at 00:05 -0700, Nathaniel Smith wrote: On Tue, Aug 25, 2015 at 5:53 PM, David Cournapeau courn...@gmail.com wrote:

Thanks for the good summary Nathaniel. Regarding dtype machinery, I agree casting is the hardest part. Unless the code has changed dramatically, this was the main reason why you could not make most of the dtypes separate from the numpy codebase (I tried to move the datetime dtype out of multiarray into a separate C extension some years ago). Being able to separate the dtypes from the multiarray module would be an obvious way to drive the internal API change.

For practical reasons I don't imagine we'll ever want to actually move the core dtypes out of multiarray -- if nothing else they will always remain a little bit special, like np.array([1.0, 2.0]) will just know that this should use the float64 dtype. But yeah, in general a good heuristic would be that -- aside from a few limited cases like that -- we want to make built-in dtypes and user-defined dtypes use the same APIs.

Well, casting is the conceptually hardest part. Marrying it to the rest of numpy is probably just as hard ;).

At the risk of not having thought this through enough, maybe some points about the general discussion. I think I would like some more clarity about what we want and especially *need* [1]. From SciPy, there were two things I particularly remember: 1. the dtype/scalar issue, and 2. making an interface so that array-likes interact more sanely (this I think can go quite far, and we are already going part of the way).

The dtypes/scalars seem a particularly dark corner of numpy, and if it is feasible for us to replace it with something new, then I would be willing to accept some breaks for it (admittedly, given protest, I would back down from that and another solution would be needed). The point for me is, I currently think a dtype/scalar rework could get numpy a long way, especially from the point of view of downstream packages. Of course it would be harder to do in numpy than in something new, but it should also be of much more immediate use.

Maybe I am going a bit too far with this right now, but I could imagine that if we cannot clean up the dtype/scalars, numpy may indeed be doomed, or at least be a brick slowing down a lot of other people. And if it is not possible to do this without a numpy 2, then likely that is the way to go. But I am not convinced we should aim to fix all the other stuff at the same time. I am afraid it would just accumulate to grow over everyone's heads.

In other words, I think if we can muster the resources I would like to see this problem attacked within numpy. If this proves impossible, a new dtype abstraction may well be reason for numpy 2, or be used by a DyND or similar? But I do believe we should not give up on NumPy here from the start; at least I do not see a compelling reason to. Instead, giving up on numpy seems like the last way out of a misery. And much of the difference in opinions seems to me to be about whether we think this will clearly happen or not, or has already happened (or maybe whether it is too costly to do in numpy).

Cleaning it up would open doors to many things. Note that I think it would make the numpy source much less scary, because I think it is the one big piece of code that is maybe not clearly a separate chunk [2]. After making it sane, I would argue that numpy does become much more maintainable and extensible. From my current view, probably enough so for a long time.

Also, I think it would give us the abstractions to make different/new projects work together better, and if done well enough, some grand new project set to replace numpy could reuse it. Of course it is entirely possible that more things need to be changed in numpy and that some others would be just as hard or even harder to do. But if we can identify this as the one big thing that gets us 90%, then I refuse to give up hope of doing it in numpy just yet.

- Sebastian

[1] Travis has said quite a lot about it, but it is not yet clear to me what is a priority/real pain point. Take datashape for example. By now I think that the datashape is likely a good idea to make structured arrays nicer, since it moves the structured part into the array object and not the dtype, which makes sense to me. However, I am not convinced that the datashape is something that would make numpy a compelling amount better. In fact, I could imagine that for many things it would make it unnecessarily more complicated for users.

[2] Take indexing: I like to think I did not break that much when redoing it (except on purpose, which I hope did not create much trouble). In some sense indexing was simple to redo, because it does not overlap at all with anything else directly. If we get dtypes/scalars more separated, I think we are at a point where this is possible with pretty much any part of numpy.

Regarding the use of cython in numpy, was there any discussion about the compilation/size cost of using cython, and
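[Footnote [1]'s datashape point, made concrete: today the record structure lives inside the dtype, whereas datashape-style designs move it into the array's type. The first part is plain NumPy; the datashape line is notation, not a runnable API:]

    import numpy as np

    # Today: the structure is part of the dtype...
    rec = np.zeros(3, dtype=[('x', 'f8'), ('y', 'i4')])
    print(rec.dtype)   # [('x', '<f8'), ('y', '<i4')]

    # ...whereas datashape would describe the same data as
    #     3 * {x: float64, y: int32}
    # putting the structure in the array type and leaving the "dtype"
    # to describe only the scalar leaves.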
Re: [Numpy-discussion] Notes from the numpy dev meeting at scipy 2015
Hi,

Thanks Nathaniel and others for sparking this discussion, as I think it is very timely.

2015-08-25 12:03 GMT+02:00 Nathaniel Smith n...@pobox.com:

Let's focus on evolving numpy as far as we can without major break-the-world changes (no numpy 2.0, at least in the foreseeable future). And, as a target for that evolution, let's change our focus from numpy as "NumPy is the library that gives you the np.ndarray object (plus some attached infrastructure)" to "NumPy provides the standard framework for working with arrays and array-like objects in Python".

Sorry to disagree here, but in my opinion NumPy *already* provides the standard framework for working with arrays and array-like objects in Python, as its huge popularity shows. If what you mean is that there are too many efforts trying to provide other, specialized data containers (things like DataFrame in pandas, DataArray/Dataset in xarray or carray/ctable in bcolz, just to mention a few), then let me say that I am of the opinion that there can't be a silver bullet for tackling all the problems that the PyData community is facing.

The libraries using specialized data containers (pandas, xray, bcolz...) may have more or less machinery on top of them, so conversion to NumPy does not necessarily happen internally (many times we don't want conversions, for efficiency), but it is the capability of producing NumPy arrays out of them (or parts of them) that makes these specialized containers incredibly more useful to users, because they can use NumPy to fill the missing gaps, or just use NumPy as an intermediate container that acts as input for other libraries.

On the subject of why I don't think a universal data container is feasible for PyData, you just have to have a look at how many data structures Python provides in the language itself (tuples, lists, dicts, sets...), and how many are added in the standard library (like those in the collections sub-package). Every data container is designed to do a couple of things (maybe three) well, but for other use cases it is the responsibility of the user to choose the more appropriate one depending on her needs. In the same vein, I also think that it makes little sense to try to come up with a standard solution that is going to satisfy everyone's needs. IMHO, and despite all efforts, neither NumPy, NumPy 2.0, DyND, bcolz nor any other is going to offer the universal data container.

Instead of that, let me summarize what users/developers like me need from NumPy for continuing to create more specialized data containers:

1) Keep NumPy simple. NumPy is the true cornerstone of PyData right now, and it will be for the foreseeable future, so please keep it usable and *minimal*. Before adding any more features, the increase in complexity should be carefully weighed.

2) Make NumPy more flexible. Any rewrite that allows arrays or dtypes to be subclassed and extended more easily will be a huge win. *But* if in order to allow flexibility you have to make NumPy much more complex, then point 1) should prevail.

3) Make NumPy a sustainable project. Historically NumPy depended on the heroic efforts of individuals to make it what it is now: *an industry standard*. But individual efforts, while laudable, are not enough, so please, please, please continue the effort of constituting a governance team that ensures the future of NumPy (and with it, the whole PyData community).

Finally, the question of whether NumPy 2.0 or projects like DyND should be chosen instead for implementing new features is still legitimate, and while I have my own opinions (favourable to DyND), I still see (such is the price of technological debt) a distant future where we will still find NumPy as we know it, allowing more innovation to happen in the Python data space.

Again, thanks to all those braves that are allowing others to build on top of NumPy's shoulders. -- Francesc Alted
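[Francesc's point about producing NumPy arrays out of specialized containers, concretely; this assumes pandas is available:]

    import numpy as np
    import pandas as pd

    s = pd.Series([1.0, 2.0, 3.0])
    a = np.asarray(s)   # the specialized container hands back a plain
    print(a.dtype)      # float64 ndarray for libraries that need one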
Re: [Numpy-discussion] Notes from the numpy dev meeting at scipy 2015
On Wed, Aug 26, 2015 at 1:41 AM, Nathaniel Smith n...@pobox.com wrote:

> Hi Travis, Thanks for taking the time to write up your thoughts! I have many thoughts in return, but I will try to restrict myself to two main ones :-).
>
> 1) On the question of whether work should be directed towards improving NumPy-as-it-is or instead towards a compatibility-breaking replacement: There's plenty of room for debate about whether it's better engineering practice to try and evolve an existing system in place versus starting over, and I guess we have some fundamental disagreements there, but I actually think this debate is a distraction -- we can agree to disagree, because in fact we have to try both.

Yes, on this we agree. I think NumPy can improve *and* we can have new innovative array objects. I don't disagree about that.

> At a practical level: NumPy *is* going to continue to evolve, because it has users and people interested in evolving it; similarly, dynd and other alternative libraries will also continue to evolve, because they also have people interested in doing it. And at a normative level, this is a good thing! If NumPy and dynd both get better, then that's awesome: the worst case is that NumPy adds the new features that we talked about at the meeting, and dynd simultaneously becomes so awesome that everyone wants to switch to it, and the result of this would be... that those NumPy features are exactly the ones that will make the transition to dynd easier. Or if some part of that plan goes wrong, then well, NumPy will still be there as a fallback, and in the mean time we've actually fixed the major pain points our users are begging us to fix. You seem to be urging us all to make a double-or-nothing wager that your extremely ambitious plans will all work out, with the entire numerical Python ecosystem as the stakes. I think this ambition is awesome, but maybe it'd be wise to hedge our bets a bit?

You are mis-characterizing my view. I think NumPy can evolve (though I would personally rather see a bigger change to the underlying system like I outlined before). But, I don't believe it can even evolve easily in the direction needed without breaking ABI, and that insisting on not breaking it, or even putting too much effort into not breaking it, will continue to create less-optimal solutions that are harder to maintain and do not take advantage of knowledge this community now has. I'm also very concerned that 'evolving' NumPy will create a situation where there are regular semantic and subtle API changes that will cause NumPy to be less stable for its user base. I've watched this happen. This at a time that people are already looking around for new and different approaches anyway.

> 2) You really emphasize this idea of an ABI-breaking (but not API-breaking) release, and I think this must indicate some basic gap in how we're looking at things. Where I'm getting stuck here is that... I actually can't think of anything important that we can't do now, but could if we were allowed to break ABI compatibility. The kinds of things that break ABI but keep API are like... rearranging what order the fields in a struct fall in, or changing the numeric value of opaque constants like NPY_ARRAY_WRITEABLE. The biggest win I can think of is that we could save a few bytes per array by arranging the fields inside the ndarray struct more optimally, but that's hardly a feature to hang a 2.0 on. You seem to have a vision of this ABI-breaking release as being something very different from that, and I'm not clear on what this vision is.

We already broke the ABI with date-time changes --- it's still broken for a certain percentage of users, last I checked. So, part of my disagreement is that we've tried this and it didn't work --- even though smart people thought it would. I've had to deal with this personally and I'm not enthusiastic about having to deal with this for the next 5 years because of even more attempts to make changes while not breaking the ABI. I think the group is more careful now --- but I still think the API is broad enough and uses of NumPy deep enough that the effort involved in trying not to break the ABI is just not worth it (because it's a non-feature today). Adding new dtypes without breaking the ABI is tricky (and to do it without breaking the ABI is ugly).

I also continue to believe that putting out a new ABI-breaking NumPy will allow re-compiling *once* (with some porting changes needed) and not subtle breakages requiring code changes every time a release is made. If subtle changes aren't made, then the new features won't come. Right now, I'd rather have stability from NumPy than new features. New features can come from other libraries.

One specific change that could easily be made in NumPy 2.0 (the current code but with an ABI change) is that dtypes should become true type objects and array-scalars (which are the current
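[The dtypes-as-true-type-objects idea Travis is driving at can be seen in the current split: a dtype is an instance describing storage, while the array scalar is a separate type, linked only through the .type attribute. Plain NumPy:]

    import numpy as np

    d = np.dtype('float64')
    # The dtype instance and the array-scalar type are two parallel
    # systems, connected only by .type:
    print(d.type is np.float64)                  # True
    print(isinstance(np.float64(1.0), d.type))   # True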
Re: [Numpy-discussion] Notes from the numpy dev meeting at scipy 2015
On Tue, Aug 25, 2015 at 5:03 AM, Nathaniel Smith n...@pobox.com wrote:

Hi all,

These are the notes from the NumPy dev meeting held July 7, 2015, at the SciPy conference in Austin, presented here so the list can keep up with what happens, and so you can give feedback. Please do give feedback, none of this is final! (Also, if anyone who was there notices anything I left out or mischaracterized, please speak up -- these are a lot of notes I'm trying to gather together, so I could easily have missed something!)

Thanks to Jill Cowan and the rest of the SciPy organizers for donating space and organizing logistics for us, and to the Berkeley Institute for Data Science for funding travel for Jaime, Nathaniel, and Sebastian.

Attendees
=========

Present in the room for all or part: Daniel Allan, Chris Barker, Sebastian Berg, Thomas Caswell, Jeff Reback, Jaime Fernández del Río, Chuck Harris, Nathaniel Smith, Stéfan van der Walt. (Note: I'm pretty sure this list is incomplete.) Joining remotely for all or part: Stephan Hoyer, Julian Taylor.

Formalizing our governance/decision making
==========================================

This was a major focus of discussion. At a high level, the consensus was to steal IPython's governance document (IPEP 29) and modify it to remove its use of a BDFL as a backstop to normal community consensus-based decision, and replace it with a new backstop based on Apache-project-style consensus voting amongst the core team. I'll send out a proper draft of this shortly for further discussion.

Development roadmap
===================

General consensus: Let's assume NumPy is going to remain important indefinitely, and try to make it better, instead of waiting for something better to come along. (This is unlikely to be wasted effort even if something better does come along, and it's hardly a sure thing that that will happen anyway.)

Let's focus on evolving numpy as far as we can without major break-the-world changes (no numpy 2.0, at least in the foreseeable future). And, as a target for that evolution, let's change our focus from "NumPy is the library that gives you the np.ndarray object (plus some attached infrastructure)" to "NumPy provides the standard framework for working with arrays and array-like objects in Python".

This means creating defined interfaces between array-like objects / ufunc objects / dtype objects, so that it becomes possible for third parties to add their own and mix-and-match. Right now ufuncs are pretty good at this, but if you want a new array class or dtype then in most cases you pretty much have to modify numpy itself.

Vision: instead of everyone who wants a new container type having to reimplement all of numpy, Alice can implement an array class using (sparse / distributed / compressed / tiled / gpu / out-of-core / delayed / ...) storage, pass it to code that was written using direct calls to np.* functions, and it just works. (Instead of np.sin being the way you calculate the sine of an ndarray, it's the way you calculate the sine of any array-like container object.)

Vision: Darryl can implement a new dtype for (categorical data / astronomical dates / integers-with-missing-values / ...) without having to touch the numpy core.

Vision: Chandni can then come along and combine them by doing a = alice_array([...], dtype=darryl_dtype) and it just works.

Vision: no-one is tempted to subclass ndarray, because anything you can do with an ndarray subclass you can also easily do by defining your own new class that implements the array protocol.

Supporting third-party array types
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Sub-goals:
- Get __numpy_ufunc__ done, which will cover a good chunk of numpy's API right there.
- Go through the rest of the stuff in numpy, and figure out some story for how to let it handle third-party array classes:
  - ufunc ALL the things: Some things can be converted directly into (g)ufuncs and then use __numpy_ufunc__ (e.g., np.std); some things could be converted into (g)ufuncs if we extended the (g)ufunc interface a bit (e.g. np.sort, np.matmul).
  - Some things probably need their own __numpy_ufunc__-like extensions (__numpy_concatenate__?)
- Provide tools to make it easier to implement the more complicated parts of an array object (e.g. the bazillion different methods, many of which are ufuncs in disguise, or indexing)
- Longer-run interesting research project: __numpy_ufunc__ requires that one or the other object have explicit knowledge of how to handle the other, so to handle binary ufuncs with N array types you need something like N**2 __numpy_ufunc__ code paths. As an alternative, if there were some interface that an object could export that provided the operations
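[The N**2 dispatch problem aside, the basic duck-array idea can be sketched with the protocol that later shipped, renamed, as __array_ufunc__ in NumPy 1.13; the 2015-era __numpy_ufunc__ proposal had the same shape. An illustrative toy, not the API as it existed at the time of this thread:]

    import numpy as np

    class LoggedArray:
        """Duck array: owns its storage (not an ndarray subclass) and
        intercepts ufunc calls via the protocol."""
        def __init__(self, data):
            self.data = np.asarray(data)

        def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
            # unwrap, compute, rewrap -- the container controls storage,
            # the ufunc machinery supplies the math
            args = [x.data if isinstance(x, LoggedArray) else x
                    for x in inputs]
            return LoggedArray(getattr(ufunc, method)(*args, **kwargs))

    x = LoggedArray([0.0, np.pi / 2])
    print(np.sin(x).data)   # [0. 1.] -- np.sin works on a non-ndarray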
Re: [Numpy-discussion] Notes from the numpy dev meeting at scipy 2015
On Tue, Aug 25, 2015 at 3:58 PM, Charles R Harris charlesr.har...@gmail.com wrote: On Tue, Aug 25, 2015 at 1:00 PM, Travis Oliphant tra...@continuum.io wrote: snip

I think the only thing that looks even a little bit like a numpy 2.0 at this time is dynd. Rewriting numpy, let alone producing numpy 2.0, is a major project. Dynd is 2.5+ years old, 3500+ commits in, and still in progress. If there is a decision to pursue dynd I could support that, but I think we would want to think deeply about how to make the transition as painless as possible. It would be good at this point to get some feedback from people currently using dynd. IIRC, part of the reason for starting dynd was the perception that it was not possible to evolve numpy without running into compatibility road blocks. Travis, could you perhaps summarize the thinking that went into the decision to make dynd a separate project?

I think it would be best if Mark Wiebe speaks up here. I can explain why Continuum supported DyND with some fraction of Mark's time for a few years and give my perspective, but ultimately DyND is Mark's story to tell (and a few talented people have now joined him in the effort).

Mark Wiebe was a productive NumPy developer. He was one of a few people who jumped in on the code-base, made substantial and significant changes, and came to understand just how hard it can be to develop in the NumPy code-base. He is also a C++ developer who really likes the beauty and power of that language (which definitely biases his NumPy work, but he did put a lot of effort into making NumPy better). Before Peter and I started Continuum, Mark had begun the DyND project as an example of a general-purpose dynamic array library that could be used by any dynamic language to make arrays.

In the early days of Continuum, we spent time from at least Mark W, Bryan Van de Ven, Jay Bourque, and Francesc Alted looking at how to extend NumPy to add 1) categorical data-types, 2) variable-length strings, and 3) better date-time types. Bryan, a good developer who has gone on to be a primary developer of Bokeh, spent quite a bit of time and had a prototype of categoricals *nearly* working. He did not like working on the NumPy code-base at all. He struggled with it and found it very difficult to extend. He worked closely with Mark Wiebe, who helped him the best he could. What took him 4 weeks in NumPy took him 3 days in DyND to build. I think that experience convinced both him and Mark W that working with the NumPy code-base would take too long to make significant progress.

Also, during 2012 I was trying to help with release-management (though I ended up just hiring Ondrej Certik to actually do the work, and he did a great job of getting a release of NumPy out the door --- thanks to much help from many of you). At that point, I realized very clearly that what I could best do was to try and get more resources for open source and for the NumPy stack, rather than work on the code directly.
We also did work with several clients that helped me realize just how many disruptive changes had happened from 1.4 to 1.7 for heavy users of NumPy (much more than would be justified by the "we don't break the ABI" mantra that was the stated goal). We also realized that the kind of experimentation we wanted to do in the first 2 years of Continuum would just not be possible on the NumPy code-base, and the need for getting community buy-in on every decision would slow us down too much --- as we had to iterate rapidly on so many things and find our center as a startup. It also would not be fair to the NumPy community.

Our decision to do *all* of our exploration outside the NumPy code base was basically: 1) the kinds of changes we wanted ultimately were potentially dramatic and disruptive, 2) it would be too difficult and time-consuming to decide all things in public discussions with the NumPy community --- especially when some things were experimental, 3) tying ourselves to releases of NumPy would be difficult at that time, and 4) the design of the NumPy code-base makes it difficult to contribute to --- both Mark W and Bryan V felt they could make progress *much* faster in a new code-base.

Continuum did not have enough start-up funding to devote significant time to DyND in the early days. So Mark rallied what resources he could, we supported him the best we could, and he made progress. My only real requirement with sponsoring his work when we did [...]
Re: [Numpy-discussion] Notes from the numpy dev meeting at scipy 2015
On Tue, Aug 25, 2015 at 3:58 PM, Charles R Harris charlesr.har...@gmail.com wrote: On Tue, Aug 25, 2015 at 1:00 PM, Travis Oliphant tra...@continuum.io wrote: snip

Thanks Chuck. I'll do this in a separate email, but I just wanted to point out that when I say NumPy 2.0, I'm actually only specifically talking about a release of NumPy that breaks ABI compatibility --- not some potential re-write. I'm not ruling that out, but I'm not necessarily implying such a thing by saying NumPy 2.0.
Re: [Numpy-discussion] Notes from the numpy dev meeting at scipy 2015
Thanks for the write-up Nathaniel. There is a lot of great detail and interesting ideas here.

I am very eager to understand how to help NumPy and the wider community move forward however I can (my passions on this have not changed since 1999, though what I myself spend time on has changed).

There are a lot of ways to think about approaching this, though. It's hard to get all the ideas on the table, and it was unfortunate we couldn't get everybody who is a core NumPy dev together in person to have this discussion, as there are still a lot of questions unanswered and a lot of thought that has gone into other approaches that was not brought up or represented in the meeting (how does Numba fit into this, what about data-shape, dynd, memory-views and the Python type system, etc.). If NumPy becomes just an interface-specification, then why don't we just do that *outside* NumPy itself in a way that doesn't jeopardize the stability of NumPy today? These are some of the real questions I have. I will try to write up my thoughts in more depth soon, but I won't be able to respond in-depth right now. I just wanted to comment because Nathaniel said "I disagree", which is only partly true.

The three most important things for me are 1) let's make sure we have representation from as wide of the community as possible (this is really hard), 2) let's look around at the broader community and the prior art that is happening in this space right now, and 3) let's not pretend we are going to be able to make all this happen without breaking ABI compatibility. Let's just break ABI compatibility with NumPy 2.0 *and* have as much fidelity with the API and semantics of current NumPy as possible (though there will be some changes necessary long-term).

I don't think we should intentionally break ABI if we can avoid it, but I also don't think we should spend inordinate amounts of time trying to pretend that we won't break ABI (for at least some people), and most importantly we should not pretend *not* to break the ABI when we actually do. We did this once before with the roll-out of date-time, and it was really unnecessary.

When I released NumPy 1.0, there were several things that I knew should be fixed very soon (NumPy was never designed to not break ABI). Those problems are still there. Now that we have quite a bit better understanding of what NumPy *should* be (there have been tremendous strides in understanding and community size over the past 10 years), let's actually make the infrastructure we think will last for the next 20 years (instead of trying to shoe-horn new ideas into a 20-year-old code-base that wasn't designed for it).

NumPy is a hard code-base. It has been since Numeric days in 1995. I could be wrong, but my guess is that we will be passed by as a community if we don't seize the opportunity to build something better than we can build if we are forced to use a 20-year-old code-base.

It is more important to not break people's code and to be clear when a re-compile is necessary for dependencies. Those to me are the most important constraints. There are a lot of great ideas that we all have about what we want NumPy to be able to do. Some of these are pretty transformational (and the more exciting they are, the harder I think they are going to be to implement without breaking at least the ABI).

There is probably some CAP-like theorem around Stability-Features-Speed-of-Development (pick 2) when it comes to open source software development, and making feature-progress with NumPy *is going* to create instability, which concerns me. I would like to see a little-bit-of-pain-one-time with a NumPy 2.0, rather than the constant-pain-because-of-constant-churn-over-many-years approach that Nathaniel seems to advocate. To me, NumPy 2.0 is an ABI-breaking release that is as API-compatible as possible and whose semantics are not dramatically different.

There are at least 3 areas of compatibility (ABI, API, and semantic). ABI-compatibility is a non-feature in today's world. There are so many distributions of the NumPy stack (and conda makes it trivial for anyone to build their own, or for you to build one yourself). Making less-optimal software-engineering choices because of fear of breaking the ABI is not something I'm supportive of at all. We should not break ABI every release, but a release every 3 years that breaks ABI is not a problem.

API compatibility should be much more sacrosanct, but it is also something that can be managed. Any NumPy 2.0 should definitely support the full NumPy API (though there could be deprecated swaths). I think the community has done well in using deprecation and limiting the public API to make this more manageable, and I would love to see a NumPy 2.0 that solidifies a future-oriented API along with a backward-compatible API that is also available.

Semantic compatibility is the hardest. We have already broken this on multiple occasions throughout the 1.x NumPy releases. Every time you change the code, this can change. This is what I fear causing deep instability over the course of many years. These are things like the casting rule details, the effect of indexing changes, any change to the calculation approaches. It is and has been the most at risk during any code-changes. My view is that a NumPy 2.0 (with a new low-level architecture) minimizes these changes to a single release rather than unavoidably spreading them out over many, many releases.

I think that summarizes my main concerns. I will write up more forward-thinking ideas for what else is possible in the coming weeks. In the meantime, thanks for keeping the discussion going. It is extremely exciting to see the help people have continued to provide to maintain and improve NumPy. It will be exciting to see what the next few years bring as well.
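To make "casting rule details" concrete: in the NumPy 1.x series the result dtype of a mixed scalar/array operation depends on the *value* of the scalar, not just its type. A small illustration of the kind of semantic detail downstream code quietly relies on (standard NumPy behavior at the time of writing):

import numpy as np

a = np.array([1, 2, 3], dtype=np.int8)
print((a + 1).dtype)     # int8  -- the scalar 1 fits in int8, no upcast
print((a + 1000).dtype)  # int16 -- 1000 does not fit in int8, so it upcasts

Changing details like this is a semantic break even when the API is untouched, which is exactly why spreading such changes over many releases is painful.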
Re: [Numpy-discussion] Notes from the numpy dev meeting at scipy 2015
On Tue, 25 Aug 2015 03:03:41 -0700 Nathaniel Smith n...@pobox.com wrote:

Supporting third-party dtypes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

[...]

Some features that would become straightforward to implement (e.g. even in third-party libraries) if this were fixed:
- missing value support
- physical unit tracking (meters / seconds -> array of velocity; meters + seconds -> error)
- better and more diverse datetime representations (e.g. datetimes with attached timezones, or using funky geophysical or astronomical calendars)
- categorical data
- variable length strings
- strings-with-encodings (e.g. latin1)
- forward mode automatic differentiation (write a function that computes f(x) where x is an array of float64; pass that function an array with a special dtype and get out both f(x) and f'(x))
- probably others I'm forgetting right now

It should also be the opportunity to streamline the datetime64 and timedelta64 dtypes. Currently the unit information is IIRC hidden in some weird metadata thing called the PyArray_DatetimeMetaData.

Also, thanks for the notes. It has been an interesting read.

Regards

Antoine.
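For reference, the unit metadata Antoine mentions does surface in Python through the dtype, even though it lives in that C-level struct. A quick illustration (standard NumPy API):

import numpy as np

# The unit is part of the dtype; converting units means casting dtypes.
a = np.array(['2015-07-07T12:00'], dtype='datetime64[ms]')
print(a.dtype)                          # datetime64[ms]
print(a.astype('datetime64[s]').dtype)  # datetime64[s]

A first-class dtype API would presumably expose this unit as an ordinary dtype parameter rather than as hidden metadata.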
[Numpy-discussion] Notes from the numpy dev meeting at scipy 2015
Hi all,

These are the notes from the NumPy dev meeting held July 7, 2015, at the SciPy conference in Austin, presented here so the list can keep up with what happens, and so you can give feedback. Please do give feedback, none of this is final!

(Also, if anyone who was there notices anything I left out or mischaracterized, please speak up -- these are a lot of notes I'm trying to gather together, so I could easily have missed something!)

Thanks to Jill Cowan and the rest of the SciPy organizers for donating space and organizing logistics for us, and to the Berkeley Institute for Data Science for funding travel for Jaime, Nathaniel, and Sebastian.

Attendees
=========

Present in the room for all or part: Daniel Allan, Chris Barker, Sebastian Berg, Thomas Caswell, Jeff Reback, Jaime Fernández del Río, Chuck Harris, Nathaniel Smith, Stéfan van der Walt. (Note: I'm pretty sure this list is incomplete.)

Joining remotely for all or part: Stephan Hoyer, Julian Taylor.

Formalizing our governance/decision making
==========================================

This was a major focus of discussion. At a high level, the consensus was to steal IPython's governance document (IPEP 29) and modify it to remove its use of a BDFL as a backstop to normal community consensus-based decision-making, and replace it with a new backstop based on Apache-project-style consensus voting amongst the core team. I'll send out a proper draft of this shortly for further discussion.

Development roadmap
===================

General consensus:

Let's assume NumPy is going to remain important indefinitely, and try to make it better, instead of waiting for something better to come along. (This is unlikely to be wasted effort even if something better does come along, and it's hardly a sure thing that that will happen anyway.)

Let's focus on evolving numpy as far as we can without major break-the-world changes (no numpy 2.0, at least in the foreseeable future).

And, as a target for that evolution, let's change our focus from "NumPy is the library that gives you the np.ndarray object (plus some attached infrastructure)" to "NumPy provides the standard framework for working with arrays and array-like objects in Python".

This means creating defined interfaces between array-like objects / ufunc objects / dtype objects, so that it becomes possible for third parties to add their own and mix-and-match. Right now ufuncs are pretty good at this, but if you want a new array class or dtype then in most cases you pretty much have to modify numpy itself.

Vision: instead of everyone who wants a new container type having to reimplement all of numpy, Alice can implement an array class using (sparse / distributed / compressed / tiled / gpu / out-of-core / delayed / ...) storage, pass it to code that was written using direct calls to np.* functions, and it just works. (Instead of np.sin being the way you calculate the sine of an ndarray, it's the way you calculate the sine of any array-like container object.)

Vision: Darryl can implement a new dtype for (categorical data / astronomical dates / integers-with-missing-values / ...) without having to touch the numpy core.

Vision: Chandni can then come along and combine them by doing a = alice_array([...], dtype=darryl_dtype) and it just works.

Vision: no-one is tempted to subclass ndarray, because anything you can do with an ndarray subclass you can also easily do by defining your own new class that implements the array protocol.

Supporting third-party array types
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Sub-goals:

- Get __numpy_ufunc__ done, which will cover a good chunk of numpy's API right there.

- Go through the rest of the stuff in numpy, and figure out some story for how to let it handle third-party array classes:

  - ufunc ALL the things: some things can be converted directly into (g)ufuncs and then use __numpy_ufunc__ (e.g., np.std); some things could be converted into (g)ufuncs if we extended the (g)ufunc interface a bit (e.g. np.sort, np.matmul).

  - Some things probably need their own __numpy_ufunc__-like extensions (__numpy_concatenate__?)

- Provide tools to make it easier to implement the more complicated parts of an array object (e.g. the bazillion different methods, many of which are ufuncs in disguise, or indexing).

- Longer-run interesting research project: __numpy_ufunc__ requires that one or the other object have explicit knowledge of how to handle the other, so to handle binary ufuncs with N array types you need something like N**2 __numpy_ufunc__ code paths. As an alternative, if there were some interface that an object could export that provided the operations nditer needs to efficiently iterate over (chunks of) it, then you would only need N implementations of this interface to handle all N**2 operations. This [...]
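For concreteness, here is a rough sketch of the kind of override __numpy_ufunc__ would enable, using a hypothetical unit-carrying container. The method name and signature follow the proposal discussed on this list and are illustrative only, since the hook is not part of any released numpy:

import numpy as np

class UnitArray:
    # Hypothetical container holding raw data plus a unit tag.
    def __init__(self, data, unit):
        self.data = np.asarray(data)
        self.unit = unit

    # Per the __numpy_ufunc__ proposal: unwrap our inputs, apply the
    # plain ufunc to the raw data, and re-attach the unit to the result.
    def __numpy_ufunc__(self, ufunc, method, i, inputs, **kwargs):
        raw = [x.data if isinstance(x, UnitArray) else x for x in inputs]
        result = getattr(ufunc, method)(*raw, **kwargs)
        return UnitArray(result, self.unit)

v = UnitArray([1.0, 2.0], unit='m/s')
# Under the proposal, np.multiply(v, 10.0) would dispatch through
# v.__numpy_ufunc__ and return a UnitArray rather than a plain ndarray.

The N**2 concern above is visible even in this toy: the class only knows how to unwrap other UnitArrays, so mixing it with a second third-party container would require yet another code path.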
Re: [Numpy-discussion] Notes from the numpy dev meeting at scipy 2015
Hi Nathaniel,

Thanks for the notes. In some sense, the new dtype class(es) will provide a way of formalizing this `weird` metadata, and probably exposing it to Python.

May I also ask that you please consider adding a way to declare the sorting order (priority and direction) of fields in a structured array in the new dtype as well?

Regards,

Yu

On Tue, Aug 25, 2015 at 12:21 PM, Antoine Pitrou solip...@pitrou.net wrote: snip
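For context on the sorting-order request: today the field priority has to be restated at every call site via np.sort's order argument, and there is no per-field direction flag at all. A short illustration (standard NumPy API):

import numpy as np

dt = np.dtype([('name', 'S10'), ('age', 'i4'), ('height', 'f8')])
a = np.array([(b'Lancelot', 38, 1.9), (b'Arthur', 41, 1.8),
              (b'Galahad', 38, 1.7)], dtype=dt)
# Sort by age, breaking ties by height; always ascending.
print(np.sort(a, order=['age', 'height']))

Declaring priority and direction once in the dtype, as suggested, would remove this per-call boilerplate.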
Re: [Numpy-discussion] Notes from the numpy dev meeting at scipy 2015
Thanks for the good summary Nathaniel.

Regarding dtype machinery, I agree casting is the hardest part (a small illustration follows below). Unless the code has changed dramatically, this was the main reason why you could not make most of the dtypes separate from the numpy codebase (I tried to move the datetime dtype out of multiarray into a separate C extension some years ago). Being able to separate the dtypes from the multiarray module would be an obvious way to drive the internal API change.

Regarding the use of cython in numpy, was there any discussion about the compilation/size cost of using cython, and about talking to the cython team to improve this? Or was that considered acceptable with current cython for numpy?

I am convinced that cleanly separating the low-level parts from the Python C API plumbing would be the single most important thing one could do to make the codebase more amenable to contribution.

David

On Tue, Aug 25, 2015 at 9:58 PM, Charles R Harris charlesr.har...@gmail.com wrote: On Tue, Aug 25, 2015 at 1:00 PM, Travis Oliphant tra...@continuum.io wrote: snip
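A small illustration of why casting is so entangling: the casting and promotion rules form one global table over all dtype pairs, so a single dtype cannot be pulled out of multiarray without carrying that table's edges along with it (standard NumPy API):

import numpy as np

print(np.can_cast('int64', 'float64'))       # True under the default 'safe' rule
print(np.can_cast('float64', 'int64'))       # False: would discard information
print(np.promote_types('int32', 'float32'))  # float64 -- promotion consults the same table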