Re: [Numpy-discussion] Custom dtypes without C -- or, a standard ndarray-like type

2014-09-24 Thread Chris Barker
On Tue, Sep 23, 2014 at 4:40 AM, Eric Moore e...@redtetrahedron.org wrote:

  Improving the dtype system requires working on c code.


Yes -- it sure does. But I think that is a bit of a red herring. I'm barely
competent in C, and don't like it much, but the real barrier to entry for
me is not that it's in C, but that it's really complex and hard to hack
on, as it wasn't designed to support custom dtypes, etc. from the start.
There is a lot of ugly code in there that has been hacked in to support
various functionality over time. If there was a clean dtype-extension
system in C, then A) it wouldn't be bad C to write, and B) it would be pretty
easy to make a Cython-wrapped version.

Travis gave a nice vision for the future, but in the meantime, I'm
wondering:

Could we hack a generic custom dtype object into the current system that
would delegate everything to the dtype object -- in a truly
object-oriented way? I'm imagining that this custom dtype object would be a
PyObject and thus very hackable, easy to make a new subclass, etc --
essentially like making a new class in python that emulates one of the
built-in type interfaces.

This would be slow as a dog -- inside that C loop, numpy would have to
call out to python to do anything, maybe as simple as arithmetic, but it
would be a clean, extensible system, and a good way for folks to plug in and
try out new dtypes when performance didn't matter, or as prototypes for
something that would get plugged in at the C level later once the API was
worked out.

Is this even possible without too much hacking to the current dtype system?
Would it be as simple as adding a bit to the object dtype?
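
For reference, the existing object dtype already behaves roughly this way: ufuncs on object arrays call back into Python for every element. A minimal sketch, assuming a hypothetical element class Money (the class and its fields are illustrative only, not a proposed API):

```python
import numpy as np

class Money:
    """Hypothetical element type; all arithmetic lives in Python."""
    def __init__(self, cents):
        self.cents = int(cents)

    def __add__(self, other):
        return Money(self.cents + other.cents)

    def __repr__(self):
        return f"Money({self.cents})"

a = np.array([Money(0), Money(100), Money(200)], dtype=object)
b = np.array([Money(5), Money(5), Money(5)], dtype=object)

# np.add on object arrays calls Money.__add__ once per element,
# which is exactly the "slow but hackable" trade-off described above.
total = a + b
print(total)
```

What the object dtype lacks is any per-array type information: nothing stops a mixed bag of unrelated objects from ending up in the same array.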

-Chris

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Custom dtypes without C -- or, a standard ndarray-like type

2014-09-24 Thread Travis Oliphant
This could actually be done by using the structured dtype pretty easily.
The hard work would be improving the ufunc and generalized ufunc mechanism
to handle structured data-types. Numba actually provides some of this
already, so if you have NumPy + Numba you can do this sort of thing now.
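
A rough illustration of the structured-dtype route, assuming a hypothetical "rational" type with two integer fields. The helper add() is hand-written because plain ufuncs do not operate on structured arrays, which is the missing piece mentioned above:

```python
import numpy as np

# Hypothetical "rational" type stored as two int64 fields
rational = np.dtype([("num", np.int64), ("den", np.int64)])

def add(x, y):
    """Field-wise a/b + c/d = (a*d + c*b) / (b*d); result is not reduced."""
    out = np.empty(x.shape, dtype=rational)
    out["num"] = x["num"] * y["den"] + y["num"] * x["den"]
    out["den"] = x["den"] * y["den"]
    return out

x = np.array([(1, 2), (1, 3)], dtype=rational)
y = np.array([(1, 4), (2, 3)], dtype=rational)
z = add(x, y)
print(z)
```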

-Travis










-- 

Travis Oliphant
CEO
Continuum Analytics, Inc.
http://www.continuum.io


Re: [Numpy-discussion] Custom dtypes without C -- or, a standard ndarray-like type

2014-09-23 Thread Stephan Hoyer
On Sun, Sep 21, 2014 at 8:31 PM, Nathaniel Smith n...@pobox.com wrote:

 For cases where people genuinely want to implement a new array-like
 types (e.g. DataFrame or scipy.sparse) then numpy provides a fair
 amount of support for this already (e.g., the various hooks that allow
 things like np.asarray(mydf) or np.sin(mydf) to work), and we're
 working on adding more over time (e.g., __numpy_ufunc__).


Agreed, numpy does a great job of this. It has been a surprising pleasure
to integrate with numpy for my custom array-like types in xray.
__numpy_ufunc__ will let us add a few more neat tricks.
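
A minimal sketch of the two hooks discussed here, using a hypothetical wrapper class Labeled. Note that the protocol proposed at the time as __numpy_ufunc__ was eventually released under the name __array_ufunc__ (NumPy 1.13), which is what the sketch uses:

```python
import numpy as np

class Labeled:
    """Hypothetical array-like: an ndarray plus a label."""
    def __init__(self, values, label):
        self.values = np.asarray(values, dtype=float)
        self.label = label

    def __array__(self, dtype=None, copy=None):
        # Hook used by np.asarray(obj)
        return np.asarray(self.values, dtype=dtype)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Hook used by np.sqrt(obj), np.sin(obj), ...
        if method != "__call__":
            return NotImplemented
        raw = [x.values if isinstance(x, Labeled) else x for x in inputs]
        return Labeled(ufunc(*raw, **kwargs), self.label)

x = Labeled([0.0, 1.0, 4.0], "distance")
plain = np.asarray(x)   # a plain ndarray
root = np.sqrt(x)       # still a Labeled, label preserved
```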


 My feeling though is that in most of the cases you mention,
 implementing a new array-like type is huge overkill. ndarray's
 interface is vast and reimplementing even 90% of it is a huge effort.
 For most of the cases that people seem to run into in practice, the
 solution is to enhance numpy's dtype interface so that it's possible
 for mere mortals to implement new dtypes, e.g. by just subclassing
 np.dtype. This is totally doable and would enable a ton of
 awesomeness, but it requires someone with the time to sit down and
 work on it, and no-one has volunteered yet. Unfortunately it does
 require hacking on C code though.


Something to allow mere mortals such as myself to implement new dtypes
sounds wonderful!

Would it be useful to prototype something like this in pure Python? That
sounds like a task that I could be up for. Like I said, I expect a (mostly)
pure Python solution, at least for categorical and datetime, would be
more maintainable and even performant enough for use in pandas (given that
this is basically the current approach), as long as the bottlenecks are
dealt with appropriately. Anyone else interested in hacking on this with me?

For what it's worth, I am not convinced that it is that terrible to
reimplement most of the ndarray interface. As long as your object looks
pretty much like an ndarray with a custom dtype, it should be quite
straightforward to wrap the underlying array's methods/properties. So I'm
not too scared of that option, although I agree that it is a complete waste
to do it again and again.
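
A sketch of what such wrapping might look like, assuming a hypothetical ArrayWrapper class that forwards unknown attributes to the underlying ndarray. It is a wrapper, not a subclass, and covers only a tiny part of the interface:

```python
import numpy as np

class ArrayWrapper:
    """Hypothetical ndarray-like wrapper carrying extra metadata."""
    def __init__(self, data, units):
        self._data = np.asarray(data)
        self.units = units

    def __getattr__(self, name):
        # Only called when normal lookup fails: forward to the ndarray.
        if name == "_data":          # guard against recursion before __init__
            raise AttributeError(name)
        return getattr(self._data, name)

    def __getitem__(self, key):
        out = self._data[key]
        # Re-wrap array results so the metadata survives slicing
        return ArrayWrapper(out, self.units) if isinstance(out, np.ndarray) else out

    def __len__(self):
        return len(self._data)

w = ArrayWrapper(np.arange(6).reshape(2, 3), "m")
print(w.shape, w.mean(), w[0].units)
```

The repetitive part is everything __getattr__ cannot catch: operators, special methods, and functions that should return a wrapped result rather than a bare ndarray.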

Nathaniel and Jeff -- thank you so much for the detailed replies.

Cheers,
Stephan


Re: [Numpy-discussion] Custom dtypes without C -- or, a standard ndarray-like type

2014-09-23 Thread David Cournapeau
On Mon, Sep 22, 2014 at 4:31 AM, Nathaniel Smith n...@pobox.com wrote:

 On Sun, Sep 21, 2014 at 7:50 PM, Stephan Hoyer sho...@gmail.com wrote:
  pandas has some hacks to support custom types of data which numpy can't
  handle well enough or at all. Examples include datetime and Categorical
  [1], and others like GeoArray [2] that haven't made it into pandas yet.
 
  Most of these look like numpy arrays but with custom dtypes and type
  specific methods/properties. But clearly nobody is particularly excited
  about writing the C necessary to implement custom dtypes [3]. Nor do
  we need the ndarray ABI.
 
  In many cases, writing C may not actually even be necessary for
  performance reasons, e.g., categorical can be fast enough just by wrapping
  an integer ndarray for the internal storage and using vectorized
  operations. And even if it is necessary, I think we'd all rather write
  Cython than C.
 
  It's great for pandas to write its own ndarray-like wrappers (*not*
  subclasses) that work with pandas, but it's a shame that there isn't a
  standard interface like the ndarray to make these arrays useable for the
  rest of the scientific Python ecosystem. For example, pandas has loads of
  fixes for np.datetime64, but nobody seems to be up for porting them to
  numpy (I doubt it would be easy).

 Writing them in the first place probably wasn't easy either :-). I
 don't really know why pandas spends so much effort on reimplementing
 stuff and papering over numpy limitations instead of fixing things
 upstream so that everyone can benefit. I assume they have reasons, and
 I could make some general guesses at what some of them might be, but
 if you want to know what they are -- which is presumably the first
 step in changing the situation -- you'll have to ask them, not us :-).

  I know these sorts of concerns are not new, but I wish I had a sense of
  what the solution looks like. Is anyone actively working on these issues?
  Does the fix belong in numpy, pandas, blaze or a new project? I'd love to
  get a sense of where things stand and how I could help -- without writing
  any C :).

 I think there are three parts:

 For stuff that's literally just fixing bugs in stuff that numpy
 already has, then we'd certainly be happy to accept those bug fixes.
 Probably there are things we can do to make this easier, I dunno. I'd
 love to see some of numpy's internals moving into Cython to make them
 easier to hack on, but this won't be simple because right now using
 Cython to implement a module is really an all-or-nothing affair;
 making it possible to mix Cython with numpy's existing C code will
 require upstream changes in Cython.


 For cases where people genuinely want to implement a new array-like
 types (e.g. DataFrame or scipy.sparse) then numpy provides a fair
 amount of support for this already (e.g., the various hooks that allow
 things like np.asarray(mydf) or np.sin(mydf) to work), and we're
 working on adding more over time (e.g., __numpy_ufunc__).

 My feeling though is that in most of the cases you mention,
 implementing a new array-like type is huge overkill. ndarray's
 interface is vast and reimplementing even 90% of it is a huge effort.
 For most of the cases that people seem to run into in practice, the
 solution is to enhance numpy's dtype interface so that it's possible
 for mere mortals to implement new dtypes, e.g. by just subclassing
 np.dtype. This is totally doable and would enable a ton of
 awesomeness, but it requires someone with the time to sit down and
 work on it, and no-one has volunteered yet. Unfortunately it does
 require hacking on C code though.


While preparing my tutorial on NumPy C internals 1 year ago, I tried to get
a basic dtype implemented in cython, and there were various issues even
if you wanted to do all of it in cython (I can't remember the details now).
Solving this would be a good first step.

There were (are ?) also some issues regarding precedence in ufuncs
depending on the new dtype: numpy hardcodes that long double is the highest
precision floating point type, for example, and there were similar issues
regarding datetime handling. Does not matter for completely new types that
don't require interactions with others (categorical ?).
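
To make the precedence point concrete, here is how the built-in types resolve today. This only queries NumPy's existing promotion rules; how a new dtype would register itself in them is the open question, and nothing below addresses that:

```python
import numpy as np

# The result dtype for mixed inputs comes from NumPy's built-in promotion rules
f32_f64 = np.promote_types(np.float32, np.float64)
i64_f32 = np.promote_types(np.int64, np.float32)

# long double against double (the result is platform dependent in size)
f64_ld = np.result_type(np.float64, np.longdouble)

print(f32_f64, i64_f32, f64_ld)
```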

Would it help to prepare a set of "implement your own dtype" notebooks? I
have a starting point from last year's tutorial (the corresponding slides
were never shown for lack of time).

David




 --
 Nathaniel J. Smith
 Postdoctoral researcher - Informatics - University of Edinburgh
 http://vorpus.org


Re: [Numpy-discussion] Custom dtypes without C -- or, a standard ndarray-like type

2014-09-23 Thread Todd
On Mon, Sep 22, 2014 at 5:31 AM, Nathaniel Smith n...@pobox.com wrote:

 On Sun, Sep 21, 2014 at 7:50 PM, Stephan Hoyer sho...@gmail.com wrote:
  My feeling though is that in most of the cases you mention,
 implementing a new array-like type is huge overkill. ndarray's
 interface is vast and reimplementing even 90% of it is a huge effort.
 For most of the cases that people seem to run into in practice, the
 solution is to enhance numpy's dtype interface so that it's possible
 for mere mortals to implement new dtypes, e.g. by just subclassing
 np.dtype. This is totally doable and would enable a ton of
 awesomeness, but it requires someone with the time to sit down and
 work on it, and no-one has volunteered yet. Unfortunately it does
 require hacking on C code though.


I'm unclear about the last sentence.  Do you mean improving the dtype
system will require hacking on C code or even if we improve the dtype
system dtypes will still have to be written in C?


Re: [Numpy-discussion] Custom dtypes without C -- or, a standard ndarray-like type

2014-09-23 Thread Eric Moore
On Tuesday, September 23, 2014, Todd toddr...@gmail.com wrote:

 On Mon, Sep 22, 2014 at 5:31 AM, Nathaniel Smith n...@pobox.com wrote:

 On Sun, Sep 21, 2014 at 7:50 PM, Stephan Hoyer sho...@gmail.com wrote:
  My feeling though is that in most of the cases you mention,
 implementing a new array-like type is huge overkill. ndarray's
 interface is vast and reimplementing even 90% of it is a huge effort.
 For most of the cases that people seem to run into in practice, the
 solution is to enhance numpy's dtype interface so that it's possible
 for mere mortals to implement new dtypes, e.g. by just subclassing
 np.dtype. This is totally doable and would enable a ton of
 awesomeness, but it requires someone with the time to sit down and
 work on it, and no-one has volunteered yet. Unfortunately it does
 require hacking on C code though.


 I'm unclear about the last sentence.  Do you mean improving the dtype
 system will require hacking on C code or even if we improve the dtype
 system dtypes will still have to be written in C?


What ends up making this hard is that every place numpy does anything with a
dtype needs to be at least audited and probably changed. All of that is in C
right now, and most of it would likely still be after the fact, simply
because the rest of numpy is in C. Improving the dtype system requires
working on C code.

Eric


Re: [Numpy-discussion] Custom dtypes without C -- or, a standard ndarray-like type

2014-09-23 Thread Travis Oliphant
On Sun, Sep 21, 2014 at 6:50 PM, Stephan Hoyer sho...@gmail.com wrote:

 pandas has some hacks to support custom types of data which numpy
 can't handle well enough or at all. Examples include datetime and
 Categorical [1], and others like GeoArray [2] that haven't made it into
 pandas yet.

 Most of these look like numpy arrays but with custom dtypes and type
 specific methods/properties. But clearly nobody is particularly excited
 about writing the C necessary to implement custom dtypes [3]. Nor do
 we need the ndarray ABI.

 In many cases, writing C may not actually even be necessary for
 performance reasons, e.g., categorical can be fast enough just by wrapping
 an integer ndarray for the internal storage and using vectorized
 operations. And even if it is necessary, I think we'd all rather write
 Cython than C.

 It's great for pandas to write its own ndarray-like wrappers (*not*
 subclasses) that work with pandas, but it's a shame that there isn't a
 standard interface like the ndarray to make these arrays useable for the
 rest of the scientific Python ecosystem. For example, pandas has loads of
 fixes for np.datetime64, but nobody seems to be up for porting them to
 numpy (I doubt it would be easy).

 I know these sort of concerns are not new, but I wish I had a sense of
 what the solution looks like. Is anyone actively working on these issues?
 Does the fix belong in numpy, pandas, blaze or a new project? I'd love to
 get a sense of where things stand and how I could help -- without writing
 any C :).


Hey Stephan,

There are no easy answers to your questions.   The reason is that NumPy's
dtype system is not extensible enough with its fixed set of builtin
data-types and its bolted-on user-defined datatypes.   The implementation
was adapted from the *descriptor* notion that was in Numeric (written
almost 20 years ago). While a significant improvement over Numeric, the
dtype system in NumPy still has several limitations:

1) it was not designed to add new fundamental data-types without
breaking the ABI (most of the ABI breakage between 1.3 and 1.7 due to the
addition of np.datetime has been pushed to a small corner but it is still
there).

2) The user-defined data-type system which is present is not well
tested and likely incomplete:  it was the best I could come up with at the
time NumPy first came out with a bit of input from people like Fernando
Perez and Francesc Alted.

3) It is far easier than in Numeric to add new data-types (that was a
big part of the effort of NumPy), but it is still not as easy as one would
like to add new data-types (either fundamental ones requiring recompilation
of NumPy or 'user-defined' data-types requiring C-code).

I believe this system has served us well, but it needs to be replaced
eventually.  I think it can be replaced fairly seamlessly in a largely
backward compatible way (though requiring re-compilation of dependencies).
Fixing the dtype system is a fundamental effort behind several projects
we are working on at Continuum:  datashape, dynd, and numba.   These
projects are addressing fundamental limitations in a way that can lead to a
significantly improved framework for scientific and tabular computing in
Python.

In the mean-time, NumPy can continue to improve in small ways and in
orthogonal ways (like the new __numpy_ufunc__ mechanism which allows ufuncs
to work more seamlessly with different kinds of array-like objects).
 This kind of effort as well as the improved buffer protocol in Python,
mean that multiple array-like objects can co-exist and use each-other's
data.   Right now, I think that is the best current way to address the
data-type limitations of NumPy.
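
As a small illustration of the buffer-protocol point, two unrelated array-like objects can share one block of memory without copying. A sketch using only the standard library's array module and NumPy:

```python
import array
import numpy as np

source = array.array("d", [1.0, 2.0, 3.0])

# np.frombuffer wraps the existing memory instead of copying it
view = np.frombuffer(source, dtype=np.float64)
view[0] = 10.0            # writes through to the array.array

# memoryview exposes the ndarray's buffer going the other way
mv = memoryview(view)
print(source[0], mv.format, mv.shape)
```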

Another small project is possible today --- one could today use Numba or
Cython to generate user-defined data-types for existing NumPy.   That would
be an interesting project and would certainly help to understand the
limitations of the user-defined data-type framework without making people
write C-code.   You could use a meta-class and some code-generation
techniques so that by defining a particular class you end-up with a
user-defined data-type for NumPy.
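
A very rough sketch of the "define a class, get a dtype" idea. The metaclass StructMeta and the fields attribute are hypothetical names, and the result is only a structured np.dtype, not a true user-defined data-type registered with NumPy, which would still need generated C, Cython, or Numba code:

```python
import numpy as np

class StructMeta(type):
    """Hypothetical metaclass: builds a structured dtype from a class body."""
    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        fields = namespace.get("fields")
        if fields is not None:
            cls.dtype = np.dtype(fields)

class Point(metaclass=StructMeta):
    fields = [("x", "f8"), ("y", "f8")]

pts = np.zeros(3, dtype=Point.dtype)
pts["x"] = [1.0, 2.0, 3.0]
print(pts)
```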

Even while we have been addressing the fundamental limitations of NumPy
with our new tools at Continuum, replacing NumPy is a big undertaking
because of its large user-base.   While I personally think that NumPy could
be replaced for new users as early as next year with a combination of dynd
and numba, the big install base of NumPy means that many people (including
the company I work with, Continuum) will be supporting NumPy 1.X and Pandas
and the rest of the NumPy-Stack for many years to come.

So, even if you see me working and advocating new technology, that should
never be construed as somehow ignoring or abandoning the current technology
base.   I remain deeply interested in the success of the scientific
computing community --- even though I am not currently contributing a lot
of code directly myself.As 

Re: [Numpy-discussion] Custom dtypes without C -- or, a standard ndarray-like type

2014-09-23 Thread Benjamin Root
Travis,

Thank you for your perspective on this issue. Such input is always valuable
in helping us see where we came from and where we might go.

My perspective on NumPy is fairly different, having come into Python right
after the whole Numeric/NumArray transition to NumPy. One of the things
that really sold me on NumPy was not only just how simple it was for me to
use out of the box, but how easy it was to explicitly state that something
needed to be of one type or another. The dtype notation was fairly simple
and straight-forward.
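
For readers who have not used it, the notation in question looks like this (a few equivalent spellings of existing NumPy dtypes, nothing new):

```python
import numpy as np

a = np.zeros(3, dtype="f4")            # 32-bit float by short code
b = np.zeros(3, dtype=np.int16)        # by type object
c = np.zeros(3, dtype="<i8")           # explicit little-endian 64-bit int
d = np.array([1.7, 2.2]).astype(int)   # plain Python types work as well
print(a.dtype, b.dtype, c.dtype, d.dtype)
```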

-- We should not underestimate the value of simple to write and simple to
read notations in Python --

We can go ahead and put as many bells and whistles into the underlying
infrastructure as we want, but if we can't design a simple notation
language to utilize it, then it will never catch on. This isn't criticism
of the work being done in dynd or the other projects, rather, it is
a call for innovation. I don't know how I would design such a notation
language, but we need that ah-ha! moment from *somebody*.

I expressed this back at the NumPy BoF this summer. I would love an
improved notation system that Matplotlib could take advantage of that would
facilitate the plotting of more complicated graphs. But I am also not
really interested in seeing NumPy turn into Pandas. Nothing wrong with
Pandas; I just like the idea of modularity and I think it has suited the
community well. Striking the right balance is going to be extremely
important.

Cheers!
Ben Root



Re: [Numpy-discussion] Custom dtypes without C -- or, a standard ndarray-like type

2014-09-22 Thread Jeff Reback
Hopefully this is not TL;DR!

There are 3 'dtype'-likes that exist in pandas that could in theory mostly
be migrated back to numpy. These currently exist as the .values, in other
words the object to which pandas defers data storage and computation for
some/most operations.

1) SparseArray: This is the basis for SparseSeries. It is ndarray-like (it's
actually an ndarray sub-class) and optimized for the 1-d case. My guess is
that @wesm https://github.com/wesm created this because a) it didn't
exist in numpy, and b) he didn't want scipy as an explicit dependency (at the
time, late 2011).

2) datetime support: This is not a target dtype per se, but really a
reimplementation over the top of datetime64[ns], with the associated scalar
Timestamp which is a proper sub-class of datetime.datetime. I believe @wesm
https://github.com/wesm created this because numpy datetime support was
(and still is to some extent) just completely broken (though better in
1.7+). It doesn't support proper timezones, the display is always in the
local timezone, and the scalar type (np.datetime64) is not extensible at
all (e.g. it is not easy to have custom printing or parsing). These are
all well known by the numpy community and have seen some recent proposals
to remedy them.
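
For context, this is roughly what pandas builds on top of: a datetime64[ns] array is an int64 count of nanoseconds since the epoch, with a unit but no timezone attached to the dtype. A small sketch using NumPy only:

```python
import numpy as np

stamps = np.array(["2014-09-23T09:34", "2014-09-24T12:08"],
                  dtype="datetime64[ns]")

# Under the hood each element is an int64 count of nanoseconds since the epoch
raw = stamps.view("i8")

# Subtraction yields timedelta64; the dtype carries a unit but no timezone
gap = stamps[1] - stamps[0]
hours = gap / np.timedelta64(1, "h")
print(raw.dtype, gap, hours)
```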

3) pd.Categorical: This was another class wesm wrote several years ago. It
actually *could* be a numpy sub-class, though it's a bit awkward as it's
really a numpy-like sub-class that contains 2 ndarray-like arrays, and is
more appropriately implemented as a container of multiple ndarrays.
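
To illustrate the "container of two arrays" shape described in 3), here is a toy sketch. MiniCategorical is a hypothetical class for illustration and is not how pandas implements pd.Categorical:

```python
import numpy as np

class MiniCategorical:
    """Hypothetical two-array container: unique labels plus integer codes."""
    def __init__(self, values):
        # categories: sorted unique labels; codes: position of each value
        self.categories, self.codes = np.unique(np.asarray(values),
                                                return_inverse=True)

    def __len__(self):
        return len(self.codes)

    def __array__(self, dtype=None, copy=None):
        # Materialise the labels on demand
        return np.asarray(self.categories[self.codes], dtype=dtype)

cat = MiniCategorical(["b", "a", "b", "c"])
print(cat.categories, cat.codes, np.asarray(cat))
```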

So when we added support for Categoricals recently, why didn't we, say, try
to push a categorical dtype? I think there are several reasons, in no
particular order:

   - pd.Categorical is really a container of multiple ndarrays, and is
     ndarray-like. Further, its API is somewhat constrained. It was simpler to
     make a python container class rather than try to sub-class ndarray and
     basically override / throw out many methods (as a lot of computation
     methods simply don't make sense between 2 categoricals). You can make a
     case that this *should not* be in numpy for this reason.

   - The changes in pandas for the 3 cases outlined above were mostly about how
     to integrate these with the top-level containers (Series/DataFrame), rather
     than actually writing / re-writing a new dtype for an ndarray class. We
     always try to reuse, so we just try to extend the ndarray-like rather than
     create a new one from scratch.

   - Getting, for example, a Categorical dtype into numpy would probably take a
     pretty long cycle time. I think you need a champion for new features to
     really push them. It hasn't happened with datetime and that's been a while
     (of course it's possible that pandas diverted some of this need).

   - API design: I think this is a big issue actually. When I added
     Categorical container support, I didn't want to change the API of
     Categorical much (and it pretty much worked out that way, mainly adding
     to it). So, say we took the path of assuming that numpy would have a nice
     categorical data dtype. We would almost certainly have to wrap it in
     something to provide needed functionality that would necessarily be
     missing in an initial version (of course eventually that may not be
     necessary).

   - So the 'nobody wants to write in C' argument is true for datetimes, but
     not for SparseArray/Categorical. In fact much of that code is just
     calling out to numpy (though some cython code too).

   - From a performance perspective, numpy needs a really good hashtable in
     order to support proper factorizing, which @wesm
     https://github.com/wesm co-opted klib to do (see this thread here
     https://www.mail-archive.com/numpy-discussion@scipy.org/msg46024.html for
     a discussion on this).
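
On the last point, a sketch of what hash-based factorizing means, with a Python dict standing in for the C hashtable. The function factorize below is illustrative only and is not the pandas implementation; the contrast is with np.unique, which sorts instead of hashing:

```python
import numpy as np

def factorize(values):
    """Map each value to an integer code in order of first appearance."""
    table = {}
    codes = np.empty(len(values), dtype=np.intp)
    for i, v in enumerate(values):
        # setdefault returns the existing code, or assigns the next free one
        codes[i] = table.setdefault(v, len(table))
    uniques = np.array(list(table), dtype=object)
    return codes, uniques

codes, uniques = factorize(["b", "a", "b", "c"])
print(codes, uniques)

# Sort-based alternative: uniques come back sorted, not in first-seen order
sorted_uniques, sorted_codes = np.unique(["b", "a", "b", "c"], return_inverse=True)
print(sorted_codes, sorted_uniques)
```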

So I know I am repeating myself, but it comes down to this. The
API/interface of the delegated methods needs to be defined. For ndarrays it
is long established and well-known. So easy to gear pandas to that. However
with a *newer* type that is not the case, so pandas can easily decide, hey
this is the most correct behavior, let's do it this way, nothing to break,
no back compat needed.


Jeff


[Numpy-discussion] Custom dtypes without C -- or, a standard ndarray-like type

2014-09-21 Thread Stephan Hoyer
pandas has some hacks to support custom types of data which numpy can't
handle well enough or at all. Examples include datetime and Categorical
[1], and others like GeoArray [2] that haven't made it into pandas yet.

Most of these look like numpy arrays but with custom dtypes and type
specific methods/properties. But clearly nobody is particularly excited
about writing the C necessary to implement custom dtypes [3]. Nor do
we need the ndarray ABI.

In many cases, writing C may not actually even be necessary for performance
reasons, e.g., categorical can be fast enough just by wrapping an integer
ndarray for the internal storage and using vectorized operations. And even
if it is necessary, I think we'd all rather write Cython than C.
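
A small sketch of that claim: once the labels are stored as small integers, the common operations are ordinary vectorized NumPy calls. The category names and codes below are made up for illustration:

```python
import numpy as np

categories = np.array(["low", "mid", "high"])
codes = np.array([0, 2, 1, 2, 2, 0], dtype=np.int8)   # internal storage

counts = np.bincount(codes, minlength=len(categories))   # value counts
is_high = codes == 2                                      # label comparison
labels = categories[codes]                                # decode via indexing
print(counts, is_high, labels)
```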

It's great for pandas to write its own ndarray-like wrappers (*not*
subclasses) that work with pandas, but it's a shame that there isn't a
standard interface like the ndarray to make these arrays useable for the
rest of the scientific Python ecosystem. For example, pandas has loads of
fixes for np.datetime64, but nobody seems to be up for porting them to
numpy (I doubt it would be easy).

I know these sort of concerns are not new, but I wish I had a sense of what
the solution looks like. Is anyone actively working on these issues? Does
the fix belong in numpy, pandas, blaze or a new project? I'd love to get a
sense of where things stand and how I could help -- without writing any C
:).

Thanks,
Stephan

[1] https://github.com/pydata/pandas/pull/7217
[2] https://github.com/geopandas/geopandas/issues/166
[3] https://github.com/numpy/numpy-dtypes
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Custom dtypes without C -- or, a standard ndarray-like type

2014-09-21 Thread Charles R Harris
On Sun, Sep 21, 2014 at 5:50 PM, Stephan Hoyer sho...@gmail.com wrote:

 pandas has some hacks to support custom types of data that numpy can't
 handle well enough or at all. Examples include datetime and
 Categorical [1], and others like GeoArray [2] that haven't made it into
 pandas yet.

 Most of these look like numpy arrays but with custom dtypes and type
 specific methods/properties. But clearly nobody is particularly excited
 about writing the C necessary to implement custom dtypes [3]. Nor do
 we need the ndarray ABI.

 In many cases, writing C may not actually even be necessary for
 performance reasons, e.g., categorical can be fast enough just by wrapping
 an integer ndarray for the internal storage and using vectorized
 operations. And even if it is necessary, I think we'd all rather write
 Cython than C.

 It's great for pandas to write its own ndarray-like wrappers (*not*
 subclasses) that work with pandas, but it's a shame that there isn't a
 standard interface like the ndarray to make these arrays useable for the
 rest of the scientific Python ecosystem. For example, pandas has loads of
 fixes for np.datetime64, but nobody seems to be up for porting them to
 numpy (I doubt it would be easy).

 I know these sorts of concerns are not new, but I wish I had a sense of
 what the solution looks like. Is anyone actively working on these issues?
 Does the fix belong in numpy, pandas, blaze or a new project? I'd love to
 get a sense of where things stand and how I could help -- without writing
 any C :).


I haven't thought much about this myself, but others (Nathaniel?) have, and
it would be good to explore the topic and maybe put together some
examples/templates to make this approach easier. Input from someone with
some experience would be *much* appreciated.

The datetime problem persists and I've been thinking it would be nice to replace
the current implementation with something simpler that can be stolen from
elsewhere. It would be nice to hear how someone else dealt with the problem.

Chuck


Re: [Numpy-discussion] Custom dtypes without C -- or, a standard ndarray-like type

2014-09-21 Thread Nathaniel Smith
On Sun, Sep 21, 2014 at 7:50 PM, Stephan Hoyer sho...@gmail.com wrote:
 pandas has some hacks to support custom types of data that numpy can't
 handle well enough or at all. Examples include datetime and Categorical [1],
 and others like GeoArray [2] that haven't made it into pandas yet.

 Most of these look like numpy arrays but with custom dtypes and type
 specific methods/properties. But clearly nobody is particularly excited
 about writing the C necessary to implement custom dtypes [3]. Nor do
 we need the ndarray ABI.

 In many cases, writing C may not actually even be necessary for performance
 reasons, e.g., categorical can be fast enough just by wrapping an integer
 ndarray for the internal storage and using vectorized operations. And even
 if it is necessary, I think we'd all rather write Cython than C.

 It's great for pandas to write its own ndarray-like wrappers (*not*
 subclasses) that work with pandas, but it's a shame that there isn't a
 standard interface like the ndarray to make these arrays useable for the
 rest of the scientific Python ecosystem. For example, pandas has loads of
 fixes for np.datetime64, but nobody seems to be up for porting them to numpy
 (I doubt it would be easy).

Writing them in the first place probably wasn't easy either :-). I
don't really know why pandas spends so much effort on reimplementing
stuff and papering over numpy limitations instead of fixing things
upstream so that everyone can benefit. I assume they have reasons, and
I could make some general guesses at what some of them might be, but
if you want to know what they are -- which is presumably the first
step in changing the situation -- you'll have to ask them, not us :-).

 I know these sorts of concerns are not new, but I wish I had a sense of what
 the solution looks like. Is anyone actively working on these issues? Does
 the fix belong in numpy, pandas, blaze or a new project? I'd love to get a
 sense of where things stand and how I could help -- without writing any C
 :).

I think there are three parts:

For stuff that's literally just fixing bugs in stuff that numpy
already has, then we'd certainly be happy to accept those bug fixes.
Probably there are things we can do to make this easier, I dunno. I'd
love to see some of numpy's internals moving into Cython to make them
easier to hack on, but this won't be simple because right now using
Cython to implement a module is really an all-or-nothing affair;
making it possible to mix Cython with numpy's existing C code will
require upstream changes in Cython.

For cases where people genuinely want to implement new array-like
types (e.g. DataFrame or scipy.sparse), numpy provides a fair
amount of support for this already (e.g., the various hooks that allow
things like np.asarray(mydf) or np.sin(mydf) to work), and we're
working on adding more over time (e.g., __numpy_ufunc__).
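
To make those hooks concrete, here is a minimal sketch of an array-like
class. The class name Wrapped is hypothetical. The sketch assumes a numpy
recent enough to provide __array_ufunc__, which is the name under which the
__numpy_ufunc__ proposal was eventually released. __array__ is what lets
np.asarray(obj) work, and __array_ufunc__ lets np.sin(obj) hand back the
wrapper type instead of a bare ndarray.

```python
import numpy as np


class Wrapped:
    """Hypothetical array-like that keeps its data in a plain ndarray."""

    def __init__(self, values):
        self._values = np.asarray(values)

    def __array__(self, dtype=None, copy=None):
        # Called by np.asarray(obj) to get the underlying data.
        return self._values if dtype is None else self._values.astype(dtype)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Only plain calls such as np.sin(obj) are handled in this sketch.
        if method != "__call__" or kwargs.get("out") is not None:
            return NotImplemented
        # Unwrap our own type, run the ufunc on ndarrays, then rewrap.
        arrays = [x._values if isinstance(x, Wrapped) else x for x in inputs]
        result = ufunc(*arrays, **kwargs)
        if isinstance(result, tuple):
            return tuple(Wrapped(r) for r in result)
        return Wrapped(result)


w = Wrapped([0.0, np.pi / 2])
print(type(np.asarray(w)))
print(type(np.sin(w)))
```

Without __array_ufunc__, np.sin(w) would still work through __array__, but
the result would come back as an ordinary ndarray.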

My feeling though is that in most of the cases you mention,
implementing a new array-like type is huge overkill. ndarray's
interface is vast and reimplementing even 90% of it is a huge effort.
For most of the cases that people seem to run into in practice, the
solution is to enhance numpy's dtype interface so that it's possible
for mere mortals to implement new dtypes, e.g. by just subclassing
np.dtype. This is totally doable and would enable a ton of
awesomeness, but it requires someone with the time to sit down and
work on it, and no-one has volunteered yet. Unfortunately it does
require hacking on C code though.

-- 
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org