Re: [Numpy-discussion] [ANN] Nanny, faster NaN functions

2010-11-21 Thread Keith Goodman
On Sun, Nov 21, 2010 at 10:25 AM, Wes McKinney wesmck...@gmail.com wrote:
 On Sat, Nov 20, 2010 at 7:24 PM, Keith Goodman kwgood...@gmail.com wrote:
 On Sat, Nov 20, 2010 at 3:54 PM, Wes McKinney wesmck...@gmail.com wrote:

 Keith (and others),

 What would you think about creating a library of mostly Cython-based
 domain specific functions? So stuff like rolling statistical
 moments, nan* functions like you have here, and all that-- NumPy-array
 only functions that don't necessarily belong in NumPy or SciPy (but
 could be included on down the road). You were already talking about
 this on the statsmodels mailing list for larry. I spent a lot of time
 writing a bunch of these for pandas over the last couple of years, and
 I would have relatively few qualms about moving these outside of
 pandas and introducing a dependency. You could do the same for larry--
 then we'd all be relying on the same well-vetted and tested codebase.

 I've started working on moving window statistics cython functions. I
 plan to make it into a package called Roly (for rolling). The
 signatures are: mov_sum(arr, window, axis=-1) and mov_nansum(arr,
 window, axis=-1), etc.

 I think of Nanny and Roly as two separate packages. A narrow focus is
 good for a new package. But maybe each package could be a subpackage
 in a super package?

 Would the function signatures in Nanny (exact duplicates of the
 corresponding functions in Numpy and Scipy) work for pandas? I plan to
 use Nanny in larry. I'll try to get the structure of the Nanny package
 in place. But if it doesn't attract any interest after that then I may
 fold it into larry.


 Why make multiple packages? It seems like all these functions are
 somewhat related: practical tools for real-world data analysis (where
 observations are often missing). I suspect having everything under one
 hood would create more interest than chopping things up-- would be
 very useful to folks in many different disciplines (finance,
 economics, statistics, etc.). In R, for example, NA-handling is just a
 part of everyday life. Of course in R there is a special NA value
 which is distinct from NaN-- many folks object to the use of NaN for
 missing values. The alternative is masked arrays, but in my case I
 wasn't willing to sacrifice so much performance for purity's sake.

 I could certainly use the nan* functions to replace code in pandas
 where I've handled things in a somewhat ad hoc way.

A package focused on NaN-aware functions sounds like a good idea. I
think a good plan would be to start by making faster, drop-in
replacements for the NaN functions that are already in numpy and
scipy. That is already a lot of work. After that, one possibility is
to add stuff like nancumsum, nanprod, etc. After that moving window
stuff?


Re: [Numpy-discussion] [ANN] Nanny, faster NaN functions

2010-11-21 Thread josef . pktd
On Sun, Nov 21, 2010 at 2:48 PM, Keith Goodman kwgood...@gmail.com wrote:
 On Sun, Nov 21, 2010 at 10:25 AM, Wes McKinney wesmck...@gmail.com wrote:
 On Sat, Nov 20, 2010 at 7:24 PM, Keith Goodman kwgood...@gmail.com wrote:
 On Sat, Nov 20, 2010 at 3:54 PM, Wes McKinney wesmck...@gmail.com wrote:

 Keith (and others),

 What would you think about creating a library of mostly Cython-based
 domain specific functions? So stuff like rolling statistical
 moments, nan* functions like you have here, and all that-- NumPy-array
 only functions that don't necessarily belong in NumPy or SciPy (but
 could be included on down the road). You were already talking about
 this on the statsmodels mailing list for larry. I spent a lot of time
 writing a bunch of these for pandas over the last couple of years, and
 I would have relatively few qualms about moving these outside of
 pandas and introducing a dependency. You could do the same for larry--
 then we'd all be relying on the same well-vetted and tested codebase.

 I've started working on moving window statistics cython functions. I
 plan to make it into a package called Roly (for rolling). The
 signatures are: mov_sum(arr, window, axis=-1) and mov_nansum(arr,
 window, axis=-1), etc.

 I think of Nanny and Roly as two separate packages. A narrow focus is
 good for a new package. But maybe each package could be a subpackage
 in a super package?

 Would the function signatures in Nanny (exact duplicates of the
 corresponding functions in Numpy and Scipy) work for pandas? I plan to
 use Nanny in larry. I'll try to get the structure of the Nanny package
 in place. But if it doesn't attract any interest after that then I may
 fold it into larry.


 Why make multiple packages? It seems like all these functions are
 somewhat related: practical tools for real-world data analysis (where
 observations are often missing). I suspect having everything under one
 hood would create more interest than chopping things up-- would be
 very useful to folks in many different disciplines (finance,
 economics, statistics, etc.). In R, for example, NA-handling is just a
 part of everyday life. Of course in R there is a special NA value
 which is distinct from NaN-- many folks object to the use of NaN for
 missing values. The alternative is masked arrays, but in my case I
 wasn't willing to sacrifice so much performance for purity's sake.

 I could certainly use the nan* functions to replace code in pandas
 where I've handled things in a somewhat ad hoc way.

 A package focused on NaN-aware functions sounds like a good idea. I
 think a good plan would be to start by making faster, drop-in
 replacements for the NaN functions that are already in numpy and
 scipy. That is already a lot of work. After that, one possibility is
 to add stuff like nancumsum, nanprod, etc. After that moving window
 stuff?

and maybe group functions after that?

If there is a lot of repetition, you could use templating. Even simple
string substitution, if it is only replacing the dtype, works pretty
well. It would at least reduce some copy-paste.
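For example, a minimal sketch of the string-substitution idea (the
template body and generated names here are only illustrative, not
nanny's actual code):

# Generate type-specialized Cython source by plain string substitution.
template = '''
def nansum_1d_DTYPE_axisNone(np.ndarray[np.DTYPE_t, ndim=1] a):
    cdef Py_ssize_t i
    cdef np.DTYPE_t asum = 0
    for i in range(a.shape[0]):
        if a[i] == a[i]:  # False only when a[i] is NaN
            asum += a[i]
    return asum
'''
with open('nansum.pyx', 'w') as f:
    for dtype in ('int32', 'int64', 'float64'):
        f.write(template.replace('DTYPE', dtype))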

Josef



Re: [Numpy-discussion] [ANN] Nanny, faster NaN functions

2010-11-21 Thread Wes McKinney
On Sat, Nov 20, 2010 at 7:24 PM, Keith Goodman kwgood...@gmail.com wrote:
 On Sat, Nov 20, 2010 at 3:54 PM, Wes McKinney wesmck...@gmail.com wrote:

 Keith (and others),

 What would you think about creating a library of mostly Cython-based
 domain specific functions? So stuff like rolling statistical
 moments, nan* functions like you have here, and all that-- NumPy-array
 only functions that don't necessarily belong in NumPy or SciPy (but
 could be included on down the road). You were already talking about
 this on the statsmodels mailing list for larry. I spent a lot of time
 writing a bunch of these for pandas over the last couple of years, and
 I would have relatively few qualms about moving these outside of
 pandas and introducing a dependency. You could do the same for larry--
 then we'd all be relying on the same well-vetted and tested codebase.

 I've started working on moving window statistics cython functions. I
 plan to make it into a package called Roly (for rolling). The
 signatures are: mov_sum(arr, window, axis=-1) and mov_nansum(arr,
 window, axis=-1), etc.

 I think of Nanny and Roly as two separate packages. A narrow focus is
 good for a new package. But maybe each package could be a subpackage
 in a super package?

 Would the function signatures in Nanny (exact duplicates of the
 corresponding functions in Numpy and Scipy) work for pandas? I plan to
 use Nanny in larry. I'll try to get the structure of the Nanny package
 in place. But if it doesn't attract any interest after that then I may
 fold it into larry.


Why make multiple packages? It seems like all these functions are
somewhat related: practical tools for real-world data analysis (where
observations are often missing). I suspect having everything under one
hood would create more interest than chopping things up-- would be
very useful to folks in many different disciplines (finance,
economics, statistics, etc.). In R, for example, NA-handling is just a
part of everyday life. Of course in R there is a special NA value
which is distinct from NaN-- many folks object to the use of NaN for
missing values. The alternative is masked arrays, but in my case I
wasn't willing to sacrifice so much performance for purity's sake.

I could certainly use the nan* functions to replace code in pandas
where I've handled things in a somewhat ad hoc way.


Re: [Numpy-discussion] [ANN] Nanny, faster NaN functions

2010-11-21 Thread Keith Goodman
On Sun, Nov 21, 2010 at 12:30 PM,  josef.p...@gmail.com wrote:
 On Sun, Nov 21, 2010 at 2:48 PM, Keith Goodman kwgood...@gmail.com wrote:
 On Sun, Nov 21, 2010 at 10:25 AM, Wes McKinney wesmck...@gmail.com wrote:
 On Sat, Nov 20, 2010 at 7:24 PM, Keith Goodman kwgood...@gmail.com wrote:
 On Sat, Nov 20, 2010 at 3:54 PM, Wes McKinney wesmck...@gmail.com wrote:

 Keith (and others),

 What would you think about creating a library of mostly Cython-based
 domain specific functions? So stuff like rolling statistical
 moments, nan* functions like you have here, and all that-- NumPy-array
 only functions that don't necessarily belong in NumPy or SciPy (but
 could be included on down the road). You were already talking about
 this on the statsmodels mailing list for larry. I spent a lot of time
 writing a bunch of these for pandas over the last couple of years, and
 I would have relatively few qualms about moving these outside of
 pandas and introducing a dependency. You could do the same for larry--
 then we'd all be relying on the same well-vetted and tested codebase.

 I've started working on moving window statistics cython functions. I
 plan to make it into a package called Roly (for rolling). The
 signatures are: mov_sum(arr, window, axis=-1) and mov_nansum(arr,
 window, axis=-1), etc.

 I think of Nanny and Roly as two separate packages. A narrow focus is
 good for a new package. But maybe each package could be a subpackage
 in a super package?

 Would the function signatures in Nanny (exact duplicates of the
 corresponding functions in Numpy and Scipy) work for pandas? I plan to
 use Nanny in larry. I'll try to get the structure of the Nanny package
 in place. But if it doesn't attract any interest after that then I may
 fold it into larry.


 Why make multiple packages? It seems like all these functions are
 somewhat related: practical tools for real-world data analysis (where
 observations are often missing). I suspect having everything under one
 hood would create more interest than chopping things up-- would be
 very useful to folks in many different disciplines (finance,
 economics, statistics, etc.). In R, for example, NA-handling is just a
 part of everyday life. Of course in R there is a special NA value
 which is distinct from NaN-- many folks object to the use of NaN for
 missing values. The alternative is masked arrays, but in my case I
 wasn't willing to sacrifice so much performance for purity's sake.

 I could certainly use the nan* functions to replace code in pandas
 where I've handled things in a somewhat ad hoc way.

 A package focused on NaN-aware functions sounds like a good idea. I
 think a good plan would be to start by making faster, drop-in
 replacements for the NaN functions that are already in numpy and
 scipy. That is already a lot of work. After that, one possibility is
 to add stuff like nancumsum, nanprod, etc. After that moving window
 stuff?

 and maybe group functions after that?

Yes, group functions are on my list.

 If there is a lot of repetition, you could use templating. Even simple
 string substitution, if it is only replacing the dtype, works pretty
 well. It would at least reduce some copy-paste.

Unit test coverage should be good enough to mess around with trying
templating. What's a good way to go? Write my own script that creates
the .pyx file and call it from the make file? Or are there packages
for doing the templating?

I added nanmean (the first scipy function to enter nanny) and nanmin.


Re: [Numpy-discussion] [ANN] Nanny, faster NaN functions

2010-11-21 Thread josef . pktd
On Sun, Nov 21, 2010 at 5:09 PM, Keith Goodman kwgood...@gmail.com wrote:
 On Sun, Nov 21, 2010 at 12:30 PM,  josef.p...@gmail.com wrote:
 On Sun, Nov 21, 2010 at 2:48 PM, Keith Goodman kwgood...@gmail.com wrote:
 On Sun, Nov 21, 2010 at 10:25 AM, Wes McKinney wesmck...@gmail.com wrote:
 On Sat, Nov 20, 2010 at 7:24 PM, Keith Goodman kwgood...@gmail.com wrote:
 On Sat, Nov 20, 2010 at 3:54 PM, Wes McKinney wesmck...@gmail.com wrote:

 Keith (and others),

 What would you think about creating a library of mostly Cython-based
 domain specific functions? So stuff like rolling statistical
 moments, nan* functions like you have here, and all that-- NumPy-array
 only functions that don't necessarily belong in NumPy or SciPy (but
 could be included on down the road). You were already talking about
 this on the statsmodels mailing list for larry. I spent a lot of time
 writing a bunch of these for pandas over the last couple of years, and
 I would have relatively few qualms about moving these outside of
 pandas and introducing a dependency. You could do the same for larry--
 then we'd all be relying on the same well-vetted and tested codebase.

 I've started working on moving window statistics cython functions. I
 plan to make it into a package called Roly (for rolling). The
 signatures are: mov_sum(arr, window, axis=-1) and mov_nansum(arr,
 window, axis=-1), etc.

 I think of Nanny and Roly as two separate packages. A narrow focus is
 good for a new package. But maybe each package could be a subpackage
 in a super package?

 Would the function signatures in Nanny (exact duplicates of the
 corresponding functions in Numpy and Scipy) work for pandas? I plan to
 use Nanny in larry. I'll try to get the structure of the Nanny package
 in place. But if it doesn't attract any interest after that then I may
 fold it into larry.


 Why make multiple packages? It seems like all these functions are
 somewhat related: practical tools for real-world data analysis (where
 observations are often missing). I suspect having everything under one
 hood would create more interest than chopping things up-- would be
 very useful to folks in many different disciplines (finance,
 economics, statistics, etc.). In R, for example, NA-handling is just a
 part of everyday life. Of course in R there is a special NA value
 which is distinct from NaN-- many folks object to the use of NaN for
 missing values. The alternative is masked arrays, but in my case I
 wasn't willing to sacrifice so much performance for purity's sake.

 I could certainly use the nan* functions to replace code in pandas
 where I've handled things in a somewhat ad hoc way.

 A package focused on NaN-aware functions sounds like a good idea. I
 think a good plan would be to start by making faster, drop-in
 replacements for the NaN functions that are already in numpy and
 scipy. That is already a lot of work. After that, one possibility is
 to add stuff like nancumsum, nanprod, etc. After that moving window
 stuff?

 and maybe group functions after that?

 Yes, group functions are on my list.

 If there is a lot of repetition, you could use templating. Even simple
 string substitution, if it is only replacing the dtype, works pretty
 well. It would at least reduce some copy-paste.

 Unit test coverage should be good enough to mess around with trying
 templating. What's a good way to go? Write my own script that creates
 the .pyx file and call it from the make file? Or are there packages
 for doing the templating?

Depends on the scale. I tried once with simple string templates:
http://codespeak.net/pipermail/cython-dev/2009-August/006614.html

Here is a pastebin of another version by (?),
http://pastebin.com/f1a49143d discussed on the cython-dev mailing
list.

The cython list has the discussion every once in a while but I haven't
seen any conclusion yet. For heavier duty templating a proper
templating package (Jinja?) might be better.

I'm not an expert.

Josef



 I added nanmean (the first scipy function to enter nanny) and nanmin.


Re: [Numpy-discussion] [ANN] Nanny, faster NaN functions

2010-11-21 Thread Wes McKinney
On Sun, Nov 21, 2010 at 6:02 PM,  josef.p...@gmail.com wrote:
 On Sun, Nov 21, 2010 at 5:09 PM, Keith Goodman kwgood...@gmail.com wrote:
 On Sun, Nov 21, 2010 at 12:30 PM,  josef.p...@gmail.com wrote:
 On Sun, Nov 21, 2010 at 2:48 PM, Keith Goodman kwgood...@gmail.com wrote:
 On Sun, Nov 21, 2010 at 10:25 AM, Wes McKinney wesmck...@gmail.com wrote:
 On Sat, Nov 20, 2010 at 7:24 PM, Keith Goodman kwgood...@gmail.com 
 wrote:
 On Sat, Nov 20, 2010 at 3:54 PM, Wes McKinney wesmck...@gmail.com 
 wrote:

 Keith (and others),

 What would you think about creating a library of mostly Cython-based
 domain specific functions? So stuff like rolling statistical
 moments, nan* functions like you have here, and all that-- NumPy-array
 only functions that don't necessarily belong in NumPy or SciPy (but
 could be included on down the road). You were already talking about
 this on the statsmodels mailing list for larry. I spent a lot of time
 writing a bunch of these for pandas over the last couple of years, and
 I would have relatively few qualms about moving these outside of
 pandas and introducing a dependency. You could do the same for larry--
 then we'd all be relying on the same well-vetted and tested codebase.

 I've started working on moving window statistics cython functions. I
 plan to make it into a package called Roly (for rolling). The
 signatures are: mov_sum(arr, window, axis=-1) and mov_nansum(arr,
 window, axis=-1), etc.

 I think of Nanny and Roly as two separate packages. A narrow focus is
 good for a new package. But maybe each package could be a subpackage
 in a super package?

 Would the function signatures in Nanny (exact duplicates of the
 corresponding functions in Numpy and Scipy) work for pandas? I plan to
 use Nanny in larry. I'll try to get the structure of the Nanny package
 in place. But if it doesn't attract any interest after that then I may
 fold it into larry.


 Why make multiple packages? It seems like all these functions are
 somewhat related: practical tools for real-world data analysis (where
 observations are often missing). I suspect having everything under one
 hood would create more interest than chopping things up-- would be
 very useful to folks in many different disciplines (finance,
 economics, statistics, etc.). In R, for example, NA-handling is just a
 part of everyday life. Of course in R there is a special NA value
 which is distinct from NaN-- many folks object to the use of NaN for
 missing values. The alternative is masked arrays, but in my case I
 wasn't willing to sacrifice so much performance for purity's sake.

 I could certainly use the nan* functions to replace code in pandas
 where I've handled things in a somewhat ad hoc way.

 A package focused on NaN-aware functions sounds like a good idea. I
 think a good plan would be to start by making faster, drop-in
 replacements for the NaN functions that are already in numpy and
 scipy. That is already a lot of work. After that, one possibility is
 to add stuff like nancumsum, nanprod, etc. After that moving window
 stuff?

 and maybe group functions after that?

 Yes, group functions are on my list.

 If there is a lot of repetition, you could use templating. Even simple
 string substitution, if it is only replacing the dtype, works pretty
 well. It would at least reduce some copy-paste.

 Unit test coverage should be good enough to mess around with trying
 templating. What's a good way to go? Write my own script that creates
 the .pyx file and call it from the make file? Or are there packages
 for doing the templating?

 Depends on the scale. I tried once with simple string templates:
 http://codespeak.net/pipermail/cython-dev/2009-August/006614.html

 Here is a pastebin of another version by (?),
 http://pastebin.com/f1a49143d discussed on the cython-dev mailing
 list.

 The cython list has the discussion every once in a while but I haven't
 seen any conclusion yet. For heavier duty templating a proper
 templating package (Jinja?) might be better.

 I'm not an expert.

 Josef



 I added nanmean (the first scipy function to enter nanny) and nanmin.


What would you say to a single package that contains:

- NaN-aware NumPy and SciPy functions (nanmean, nanmin, etc.)
- moving window functions (moving_{count, sum, mean, var, std, etc.})
- core subroutines for labeled data
- group-by functions
- other things to add to this list?

In other words, basic building-block computational tools for making
libraries like larry, pandas, etc. and doing time series / statistical /
other manipulations on real world (messy) data sets. The focus isn't so
much NaN-awareness per se but more practical data wrangling. I would be
happy to work on such a package and to move all the Cython code I've
written into it. There's a little bit of datarray overlap potentially but
I think that's OK.

Re: [Numpy-discussion] [ANN] Nanny, faster NaN functions

2010-11-21 Thread Wes McKinney
On Sun, Nov 21, 2010 at 6:37 PM, Keith Goodman kwgood...@gmail.com wrote:
 On Sun, Nov 21, 2010 at 3:16 PM, Wes McKinney wesmck...@gmail.com wrote:

 What would you say to a single package that contains:

 - NaN-aware NumPy and SciPy functions (nanmean, nanmin, etc.)

 I'd say yes.

 - moving window functions (moving_{count, sum, mean, var, std, etc.})

 Yes.

 BTW, we both do arr=arr.astype(float), I think, before doing the
 moving statistics. So I sped things up by running the moving window
 backwards and writing the result in place.
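A minimal NumPy sketch of that trick (illustrative only; the real
version would be a typed Cython loop). Since astype(float) already
makes a working copy, the moving sum can be written into that same
buffer by filling positions from the end, so the inputs for position i
are still intact when they are read:

import numpy as np

def mov_sum_inplace(arr, window):
    a = arr.astype(np.float64)  # the copy both packages already make
    n = a.shape[-1]
    # Walk backwards so a[..., i-window+1:i+1] is unmodified at step i.
    for i in range(n - 1, window - 2, -1):
        a[..., i] = a[..., i - window + 1:i + 1].sum(axis=-1)
    a[..., :window - 1] = np.nan  # too few observations at the front
    return a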

 - core subroutines for labeled data

 Not sure what this would be. Let's discuss.

Basically want to produce an indexing vector based on rules-- something
to pass to ndarray.take later on. And maybe your generic binary-op
function from a while back?
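For instance (a toy sketch; make_indexer is a hypothetical name):

import numpy as np

def make_indexer(labels, new_labels):
    # Map each label to its position, then build the integer vector
    # that ndarray.take needs in order to realign the data.
    position = dict((label, i) for i, label in enumerate(labels))
    return np.array([position[label] for label in new_labels])

data = np.array([10.0, 20.0, 30.0])
indexer = make_indexer(['a', 'b', 'c'], ['c', 'a', 'b'])
data.take(indexer)  # array([ 30.,  10.,  20.])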

 - group-by functions

 Yes. I have some ideas on function signatures.

 - other things to add to this list?

 A no-op function with a really long doc string!

 In other words, basic building-block computational tools for making
 libraries like larry, pandas, etc. and doing time series / statistical
 / other manipulations on real world (messy) data sets. The focus isn't
 so much NaN-awareness per se but more practical data wrangling. I
 would be happy to work on such a package and to move all the Cython
 code I've written into it. There's a little bit of datarray overlap
 potentially but I think that's OK

 Maybe we should make a list of function signatures along with brief
 doc strings to get a feel for what we (and hopefully others) have in
 mind?

I've personally never been much for writing specs, but could be
useful. We probably aren't going to get it all right on the first try,
so we'll just do our best and refactor the code later if necessary. We
might be well-served by collecting exemplary data sets and making a
list of things we would like to be able to do easily with that data.

But writing stuff like:

moving_{funcname}(ndarray data, int window, int axis=0, int
min_periods=window) -> ndarray
group_aggregate(ndarray data, ndarray labels, int axis=0, function
agg_function) -> ndarray
group_transform(...) ...

etc. makes sense
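For example, a rough pure-Python rendering of the group_aggregate
signature above (illustrative; the eventual version would be Cython and
would avoid the Python-level loop):

import numpy as np

def group_aggregate(data, labels, axis=0, agg_function=np.sum):
    # Apply agg_function along axis to the slices sharing each label.
    groups = np.unique(labels)
    out = [agg_function(np.compress(labels == g, data, axis=axis), axis=axis)
           for g in groups]
    return groups, np.array(out)

data = np.arange(6.0).reshape(3, 2)
groups, agg = group_aggregate(data, np.array([0, 1, 0]))
# groups -> array([0, 1]); agg -> array([[ 4.,  6.], [ 2.,  3.]])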

 Where should we continue the discussion? The pystatsmodels mailing
 list? By now the numpy list probably thinks of NaN as Not ANother
 email from this guy.


Maybe let's have the next thread on SciPy-user-- I think what we're
talking about is general enough to be discussed there.


Re: [Numpy-discussion] [ANN] Nanny, faster NaN functions

2010-11-20 Thread Keith Goodman
On Fri, Nov 19, 2010 at 7:42 PM, Keith Goodman kwgood...@gmail.com wrote:
 I should make a benchmark suite.

 ny.benchit(verbose=False)
Nanny performance benchmark
    Nanny 0.0.1dev
    Numpy 1.4.1
    Speed is numpy time divided by nanny time
    NaN means all NaNs
   Speed   Test                Shape        dtype    NaN?
   6.6770  nansum(a, axis=-1)  (500,500)    int64
   4.6612  nansum(a, axis=-1)  (1,)         float64
   9.0351  nansum(a, axis=-1)  (500,500)    int32
   3.0746  nansum(a, axis=-1)  (500,500)    float64
  11.5740  nansum(a, axis=-1)  (1,)         int32
   6.4484  nansum(a, axis=-1)  (1,)         int64
  51.3917  nansum(a, axis=-1)  (500,500)    float64  NaN
  13.8692  nansum(a, axis=-1)  (1,)         float64  NaN
   6.5327  nanmax(a, axis=-1)  (500,500)    int64
   8.8222  nanmax(a, axis=-1)  (1,)         float64
   0.2059  nanmax(a, axis=-1)  (500,500)    int32
   6.9262  nanmax(a, axis=-1)  (500,500)    float64
   5.0688  nanmax(a, axis=-1)  (1,)         int32
   6.5605  nanmax(a, axis=-1)  (1,)         int64
  48.4850  nanmax(a, axis=-1)  (500,500)    float64  NaN
  14.6289  nanmax(a, axis=-1)  (1,)         float64  NaN

You can also use the makefile to run the benchmark: make bench


Re: [Numpy-discussion] [ANN] Nanny, faster NaN functions

2010-11-20 Thread Wes McKinney
On Sat, Nov 20, 2010 at 6:39 PM, Keith Goodman kwgood...@gmail.com wrote:
 On Fri, Nov 19, 2010 at 7:42 PM, Keith Goodman kwgood...@gmail.com wrote:
 I should make a benchmark suite.

 ny.benchit(verbose=False)
 Nanny performance benchmark
    Nanny 0.0.1dev
    Numpy 1.4.1
    Speed is numpy time divided by nanny time
    NaN means all NaNs
   Speed   Test                Shape        dtype    NaN?
   6.6770  nansum(a, axis=-1)  (500,500)    int64
   4.6612  nansum(a, axis=-1)  (1,)     float64
   9.0351  nansum(a, axis=-1)  (500,500)    int32
   3.0746  nansum(a, axis=-1)  (500,500)    float64
  11.5740  nansum(a, axis=-1)  (1,)     int32
   6.4484  nansum(a, axis=-1)  (1,)     int64
  51.3917  nansum(a, axis=-1)  (500,500)    float64  NaN
  13.8692  nansum(a, axis=-1)  (1,)     float64  NaN
   6.5327  nanmax(a, axis=-1)  (500,500)    int64
   8.8222  nanmax(a, axis=-1)  (1,)     float64
   0.2059  nanmax(a, axis=-1)  (500,500)    int32
   6.9262  nanmax(a, axis=-1)  (500,500)    float64
   5.0688  nanmax(a, axis=-1)  (1,)     int32
   6.5605  nanmax(a, axis=-1)  (1,)     int64
  48.4850  nanmax(a, axis=-1)  (500,500)    float64  NaN
  14.6289  nanmax(a, axis=-1)  (1,)     float64  NaN

 You can also use the makefile to run the benchmark: make bench


Keith (and others),

What would you think about creating a library of mostly Cython-based
domain specific functions? So stuff like rolling statistical
moments, nan* functions like you have here, and all that-- NumPy-array
only functions that don't necessarily belong in NumPy or SciPy (but
could be included on down the road). You were already talking about
this on the statsmodels mailing list for larry. I spent a lot of time
writing a bunch of these for pandas over the last couple of years, and
I would have relatively few qualms about moving these outside of
pandas and introducing a dependency. You could do the same for larry--
then we'd all be relying on the same well-vetted and tested codebase.

- Wes


Re: [Numpy-discussion] [ANN] Nanny, faster NaN functions

2010-11-20 Thread Wes McKinney
On Sat, Nov 20, 2010 at 6:54 PM, Wes McKinney wesmck...@gmail.com wrote:
 On Sat, Nov 20, 2010 at 6:39 PM, Keith Goodman kwgood...@gmail.com wrote:
 On Fri, Nov 19, 2010 at 7:42 PM, Keith Goodman kwgood...@gmail.com wrote:
 I should make a benchmark suite.

 ny.benchit(verbose=False)
 Nanny performance benchmark
    Nanny 0.0.1dev
    Numpy 1.4.1
    Speed is numpy time divided by nanny time
    NaN means all NaNs
   Speed   Test                Shape        dtype    NaN?
   6.6770  nansum(a, axis=-1)  (500,500)    int64
   4.6612  nansum(a, axis=-1)  (1,)     float64
   9.0351  nansum(a, axis=-1)  (500,500)    int32
   3.0746  nansum(a, axis=-1)  (500,500)    float64
  11.5740  nansum(a, axis=-1)  (1,)     int32
   6.4484  nansum(a, axis=-1)  (1,)     int64
  51.3917  nansum(a, axis=-1)  (500,500)    float64  NaN
  13.8692  nansum(a, axis=-1)  (1,)     float64  NaN
   6.5327  nanmax(a, axis=-1)  (500,500)    int64
   8.8222  nanmax(a, axis=-1)  (1,)     float64
   0.2059  nanmax(a, axis=-1)  (500,500)    int32
   6.9262  nanmax(a, axis=-1)  (500,500)    float64
   5.0688  nanmax(a, axis=-1)  (1,)     int32
   6.5605  nanmax(a, axis=-1)  (1,)     int64
  48.4850  nanmax(a, axis=-1)  (500,500)    float64  NaN
  14.6289  nanmax(a, axis=-1)  (1,)     float64  NaN

 You can also use the makefile to run the benchmark: make bench


 Keith (and others),

 What would you think about creating a library of mostly Cython-based
 domain specific functions? So stuff like rolling statistical
 moments, nan* functions like you have here, and all that-- NumPy-array
 only functions that don't necessarily belong in NumPy or SciPy (but
 could be included on down the road). You were already talking about
 this on the statsmodels mailing list for larry. I spent a lot of time
 writing a bunch of these for pandas over the last couple of years, and
 I would have relatively few qualms about moving these outside of
 pandas and introducing a dependency. You could do the same for larry--
 then we'd all be relying on the same well-vetted and tested codebase.

 - Wes


By the way I wouldn't mind pushing all of my datetime-related code
(date range generation, date offsets, etc.) into this new library,
too.


Re: [Numpy-discussion] [ANN] Nanny, faster NaN functions

2010-11-20 Thread Keith Goodman
On Sat, Nov 20, 2010 at 3:54 PM, Wes McKinney wesmck...@gmail.com wrote:

 Keith (and others),

 What would you think about creating a library of mostly Cython-based
 domain specific functions? So stuff like rolling statistical
 moments, nan* functions like you have here, and all that-- NumPy-array
 only functions that don't necessarily belong in NumPy or SciPy (but
 could be included on down the road). You were already talking about
 this on the statsmodels mailing list for larry. I spent a lot of time
 writing a bunch of these for pandas over the last couple of years, and
 I would have relatively few qualms about moving these outside of
 pandas and introducing a dependency. You could do the same for larry--
 then we'd all be relying on the same well-vetted and tested codebase.

I've started working on moving window statistics cython functions. I
plan to make it into a package called Roly (for rolling). The
signatures are: mov_sum(arr, window, axis=-1) and mov_nansum(arr,
window, axis=-1), etc.

I think of Nanny and Roly as two separate packages. A narrow focus is
good for a new package. But maybe each package could be a subpackage
in a super package?

Would the function signatures in Nanny (exact duplicates of the
corresponding functions in Numpy and Scipy) work for pandas? I plan to
use Nanny in larry. I'll try to get the structure of the Nanny package
in place. But if it doesn't attract any interest after that then I may
fold it into larry.


Re: [Numpy-discussion] [ANN] Nanny, faster NaN functions

2010-11-20 Thread Charles R Harris
On Sat, Nov 20, 2010 at 4:39 PM, Keith Goodman kwgood...@gmail.com wrote:

 On Fri, Nov 19, 2010 at 7:42 PM, Keith Goodman kwgood...@gmail.com
 wrote:
  I should make a benchmark suite.

  ny.benchit(verbose=False)
 Nanny performance benchmark
     Nanny 0.0.1dev
     Numpy 1.4.1
     Speed is numpy time divided by nanny time
     NaN means all NaNs
    Speed   Test                Shape        dtype    NaN?
    6.6770  nansum(a, axis=-1)  (500,500)    int64
    4.6612  nansum(a, axis=-1)  (1,)         float64
    9.0351  nansum(a, axis=-1)  (500,500)    int32
    3.0746  nansum(a, axis=-1)  (500,500)    float64
   11.5740  nansum(a, axis=-1)  (1,)         int32
    6.4484  nansum(a, axis=-1)  (1,)         int64
   51.3917  nansum(a, axis=-1)  (500,500)    float64  NaN
   13.8692  nansum(a, axis=-1)  (1,)         float64  NaN
    6.5327  nanmax(a, axis=-1)  (500,500)    int64
    8.8222  nanmax(a, axis=-1)  (1,)         float64
    0.2059  nanmax(a, axis=-1)  (500,500)    int32
    6.9262  nanmax(a, axis=-1)  (500,500)    float64
    5.0688  nanmax(a, axis=-1)  (1,)         int32
    6.5605  nanmax(a, axis=-1)  (1,)         int64
   48.4850  nanmax(a, axis=-1)  (500,500)    float64  NaN
   14.6289  nanmax(a, axis=-1)  (1,)         float64  NaN


Here's what I get using (my current) np.fmax.reduce in place of nanmax.

   Speed   Test                Shape        dtype    NaN?
   3.3717  nansum(a, axis=-1)  (500,500)    int64
   5.1639  nansum(a, axis=-1)  (1,)         float64
   3.8308  nansum(a, axis=-1)  (500,500)    int32
   6.0854  nansum(a, axis=-1)  (500,500)    float64
   8.7821  nansum(a, axis=-1)  (1,)         int32
   1.1716  nansum(a, axis=-1)  (1,)         int64
   5.5777  nansum(a, axis=-1)  (500,500)    float64  NaN
   5.8718  nansum(a, axis=-1)  (1,)         float64  NaN
   0.5419  nanmax(a, axis=-1)  (500,500)    int64
   2.8732  nanmax(a, axis=-1)  (1,)         float64
   0.0301  nanmax(a, axis=-1)  (500,500)    int32
   2.7437  nanmax(a, axis=-1)  (500,500)    float64
   0.7868  nanmax(a, axis=-1)  (1,)         int32
   0.5535  nanmax(a, axis=-1)  (1,)         int64
   2.8715  nanmax(a, axis=-1)  (500,500)    float64  NaN
   2.5937  nanmax(a, axis=-1)  (1,)         float64  NaN

I think the really small int32 ratio is due to timing granularity. For
random ints in the range 0..99 the results are not quite as good for fmax,
which I find puzzling.

   Speed   Test                Shape        dtype    NaN?
   3.4021  nansum(a, axis=-1)  (500,500)    int64
   5.5913  nansum(a, axis=-1)  (1,)         float64
   4.4569  nansum(a, axis=-1)  (500,500)    int32
   6.6202  nansum(a, axis=-1)  (500,500)    float64
   7.1847  nansum(a, axis=-1)  (1,)         int32
   2.0448  nansum(a, axis=-1)  (1,)         int64
   6.0257  nansum(a, axis=-1)  (500,500)    float64  NaN
   6.3172  nansum(a, axis=-1)  (1,)         float64  NaN
   0.9598  nanmax(a, axis=-1)  (500,500)    int64
   3.2407  nanmax(a, axis=-1)  (1,)         float64
   0.0520  nanmax(a, axis=-1)  (500,500)    int32
   3.1954  nanmax(a, axis=-1)  (500,500)    float64
   1.5538  nanmax(a, axis=-1)  (1,)         int32
   0.3716  nanmax(a, axis=-1)  (1,)         int64
   3.2372  nanmax(a, axis=-1)  (500,500)    float64  NaN
   2.5633  nanmax(a, axis=-1)  (1,)         float64  NaN

Chuck


[Numpy-discussion] [ANN] Nanny, faster NaN functions

2010-11-19 Thread Keith Goodman
=====
Nanny
=====

Nanny uses the magic of Cython to give you a faster, drop-in replacement for
the NaN functions in NumPy and SciPy.

For example::

    >>> import nanny as ny
    >>> import numpy as np
    >>> arr = np.random.rand(100, 100)

    >>> timeit np.nansum(arr)
    1 loops, best of 3: 67.5 us per loop
    >>> timeit ny.nansum(arr)
    10 loops, best of 3: 18.2 us per loop

Let's not forget to add some NaNs::

    >>> arr[arr > 0.5] = np.nan
    >>> timeit np.nansum(arr)
    1000 loops, best of 3: 411 us per loop
    >>> timeit ny.nansum(arr)
    1 loops, best of 3: 65 us per loop

Nanny uses a separate Cython function for each combination of ndim, dtype, and
axis. You can get rid of a lot of overhead (useful in an inner loop, e.g.) by
directly importing the function that matches your problem::

    >>> arr = np.random.rand(10, 10)
    >>> from nansum import nansum_2d_float64_axis1

    >>> timeit np.nansum(arr, axis=1)
    1 loops, best of 3: 25.5 us per loop
    >>> timeit ny.nansum(arr, axis=1)
    10 loops, best of 3: 5.15 us per loop
    >>> timeit nansum_2d_float64_axis1(arr)
    100 loops, best of 3: 1.75 us per loop

I put together Nanny as a way to learn Cython. It currently only supports:

- functions: nansum
- Operating systems: 64-bit (accumulator for int32 is hard coded to int64)
- dtype: int32, int64, float64
- ndim: 1, 2, and 3

If there is interest in the project, I could continue adding the remaining NaN
functions from NumPy and SciPy: nanmin, nanmax, nanmean, nanmedian (using a
partial sort), nanstd. But why stop there? How about nancumsum or nanprod? Or
anynan, which could short-circuit once a NaN is found?
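For instance, anynan might amount to no more than this (a pure-Python
sketch of the short-circuit idea; a Cython version would walk the raw
buffer)::

    def anynan(a):
        # Stop at the first NaN instead of scanning the whole array.
        for x in a.flat:
            if x != x:  # only NaN compares unequal to itself
                return True
        return False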

Feedback on the code or the direction of the project is welcome. So is
coding help---without that I doubt the package will ever be completed. Once
nansum is complete, many of the remaining functions will be copy, paste,
touch-up operations.

Remember, Nanny quickly protects your precious data from the corrupting
influence of Mr. Nan.


License
=======

Nanny is distributed under a Simplified BSD license. Parts of NumPy, which has
a BSD license, are included in Nanny. See the LICENSE file, which is
distributed with Nanny, for details.


Installation
============

You can grab Nanny at http://github.com/kwgoodman/nanny.

nansum of ints is only supported by 64-bit operating systems at the moment.

**GNU/Linux, Mac OS X, et al.**

To install Nanny::

$ python setup.py build
$ sudo python setup.py install

Or, if you wish to specify where Nanny is installed, for example inside
``/usr/local``::

$ python setup.py build
$ sudo python setup.py install --prefix=/usr/local

**Windows**

In order to compile the C code in Nanny you need a Windows version of the gcc
compiler. MinGW (Minimalist GNU for Windows) contains gcc and has been
used to successfully compile Nanny on Windows.

Install MinGW and add it to your system path. Then install Nanny with the
commands::

python setup.py build --compiler=mingw32
python setup.py install

**Post install**

After you have installed Nanny, run the suite of unit tests::

    >>> import nanny
    >>> nanny.test()
    <snip>
    Ran 1 tests in 0.008s
    OK
    <nose.result.TextTestResult run=1 errors=0 failures=0>


Re: [Numpy-discussion] [ANN] Nanny, faster NaN functions

2010-11-19 Thread Nathaniel Smith
On Fri, Nov 19, 2010 at 10:33 AM, Keith Goodman kwgood...@gmail.com wrote:
 Nanny uses the magic of Cython to give you a faster, drop-in replacement for
 the NaN functions in NumPy and SciPy.

Neat!

Why not make this a patch to numpy/scipy instead?

 Nanny uses a separate Cython function for each combination of ndim, dtype, and
 axis. You can get rid of a lot of overhead (useful in an inner loop, e.g.) by
 directly importing the function that matches your problem::

     arr = np.random.rand(10, 10)
     from nansum import nansum_2d_float64_axis1

If this is really useful, then better to provide a function that finds
the correct function for you?

best_nansum = ny.get_best_nansum(ary[0, :, :], axis=1)
for i in xrange(ary.shape[0]):
    best_nansum(ary[i, :, :], axis=1)

 - functions: nansum
 - Operating systems: 64-bit (accumulator for int32 is hard coded to int64)
 - dtype: int32, int64, float64
 - ndim: 1, 2, and 3

What does it even mean to do NaN operations on integers? (I'd
sometimes find it *really convenient* if there were a NaN value for
standard computer integers... but there isn't?)

-- Nathaniel


Re: [Numpy-discussion] [ANN] Nanny, faster NaN functions

2010-11-19 Thread Benjamin Root
On Fri, Nov 19, 2010 at 12:55 PM, Nathaniel Smith n...@pobox.com wrote:

 On Fri, Nov 19, 2010 at 10:33 AM, Keith Goodman kwgood...@gmail.com
 wrote:
  Nanny uses the magic of Cython to give you a faster, drop-in replacement
 for
  the NaN functions in NumPy and SciPy.

 Neat!

 Why not make this a patch to numpy/scipy instead?

  Nanny uses a separate Cython function for each combination of ndim,
 dtype, and
  axis. You can get rid of a lot of overhead (useful in an inner loop,
 e.g.) by
  directly importing the function that matches your problem::
 
  arr = np.random.rand(10, 10)
  from nansum import nansum_2d_float64_axis1

 If this is really useful, then better to provide a function that finds
 the correct function for you?

 best_nansum = ny.get_best_nansum(ary[0, :, :], axis=1)
 for i in xrange(ary.shape[0]):
    best_nansum(ary[i, :, :], axis=1)

  - functions: nansum
  - Operating systems: 64-bit (accumulator for int32 is hard coded to
 int64)
  - dtype: int32, int64, float64
  - ndim: 1, 2, and 3

 What does it even mean to do NaN operations on integers? (I'd
 sometimes find it *really convenient* if there were a NaN value for
 standard computer integers... but there isn't?)

 -- Nathaniel


That's why I use masked arrays.  They are dtype agnostic.

I am curious if there are any lessons that were learned in making Nanny that
could be applied to the masked array functions?

Ben Root


Re: [Numpy-discussion] [ANN] Nanny, faster NaN functions

2010-11-19 Thread Keith Goodman
On Fri, Nov 19, 2010 at 10:55 AM, Nathaniel Smith n...@pobox.com wrote:
 On Fri, Nov 19, 2010 at 10:33 AM, Keith Goodman kwgood...@gmail.com wrote:
 Nanny uses the magic of Cython to give you a faster, drop-in replacement for
 the NaN functions in NumPy and SciPy.

 Neat!

 Why not make this a patch to numpy/scipy instead?

My guess is that having separate underlying functions for each dtype,
ndim, and axis would be a nightmare for a large project like Numpy.
But manageable for a focused project like nanny.
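The top-level lookup is mechanical, something like this sketch
(assuming the nansum extension module layout from the announcement):

import numpy as np
import nansum  # nanny's module of type-specialized functions

def get_func(arr, axis):
    # Build the specialized name from ndim, dtype, and axis,
    # e.g. nansum_2d_float64_axis1, and look it up once.
    name = 'nansum_%dd_%s_axis%s' % (arr.ndim, arr.dtype.name, axis)
    return getattr(nansum, name)

arr = np.random.rand(10, 10)
get_func(arr, axis=1)(arr)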

 Nanny uses a separate Cython function for each combination of ndim, dtype, 
 and
 axis. You can get rid of a lot of overhead (useful in an inner loop, e.g.) by
 directly importing the function that matches your problem::

     arr = np.random.rand(10, 10)
     from nansum import nansum_2d_float64_axis1

 If this is really useful, then better to provide a function that finds
 the correct function for you?

 best_nansum = ny.get_best_nansum(ary[0, :, :], axis=1)
 for i in xrange(ary.shape[0]):
    best_nansum(ary[i, :, :], axis=1)

That would be useful. It is what nanny.nansum does but it returns the
sum instead of the function.

 - functions: nansum
 - Operating systems: 64-bit (accumulator for int32 is hard coded to int64)
 - dtype: int32, int64, float64
 - ndim: 1, 2, and 3

 What does it even mean to do NaN operations on integers? (I'd
 sometimes find it *really convenient* if there were a NaN value for
 standard computer integers... but there isn't?)

Well, sometimes you write functions without knowing the dtype of the input.
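For integer input there are no NaNs to skip, so a nan-aware function
simply degenerates to the plain version, which is what you want when
the caller's dtype is unknown:

import numpy as np

a = np.array([1, 2, 3])
np.nansum(a) == a.sum()  # True: nothing to skip for ints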


Re: [Numpy-discussion] [ANN] Nanny, faster NaN functions

2010-11-19 Thread josef . pktd
On Fri, Nov 19, 2010 at 2:35 PM, Keith Goodman kwgood...@gmail.com wrote:
 On Fri, Nov 19, 2010 at 11:12 AM, Benjamin Root ben.r...@ou.edu wrote:

 That's why I use masked arrays.  They are dtype agnostic.

 I am curious if there are any lessons that were learned in making Nanny that
 could be applied to the masked array functions?

 I suppose you could write a cython function that operates on masked
 arrays. But other than that, I can't think of any lessons. All I can
 think about is speed:

 x = np.ma.array([[1, 2], [3, 4]], mask=[[0, 1], [1, 0]])
 timeit np.sum(x)
 1 loops, best of 3: 25.1 us per loop
 a = np.array([[1, np.nan], [np.nan, 4]])
 timeit ny.nansum(a)
 10 loops, best of 3: 3.11 us per loop
 from nansum import nansum_2d_float64_axisNone
 timeit nansum_2d_float64_axisNone(a)
 100 loops, best of 3: 395 ns per loop

What's the speed advantage of nanny compared to np.nansum that you
have if the arrays are larger, say (1000,10) or (1,100) axis=0 ?

Josef




Re: [Numpy-discussion] [ANN] Nanny, faster NaN functions

2010-11-19 Thread Keith Goodman
On Fri, Nov 19, 2010 at 12:10 PM,  josef.p...@gmail.com wrote:

 What's the speed advantage of nanny compared to np.nansum that you
 have if the arrays are larger, say (1000,10) or (1,100) axis=0 ?

Good point. In the small examples I showed so far maybe the speed up
was all in overhead. Fortunately, that's not the case:

 arr = np.random.rand(1000, 1000)
 timeit np.nansum(arr)
100 loops, best of 3: 4.79 ms per loop
 timeit ny.nansum(arr)
1000 loops, best of 3: 1.53 ms per loop

 arr[arr > 0.5] = np.nan
 timeit np.nansum(arr)
10 loops, best of 3: 44.5 ms per loop
 timeit ny.nansum(arr)
100 loops, best of 3: 6.18 ms per loop

 timeit np.nansum(arr, axis=0)
10 loops, best of 3: 52.3 ms per loop
 timeit ny.nansum(arr, axis=0)
100 loops, best of 3: 12.2 ms per loop

np.nansum makes a copy of the input array and makes a mask (another
copy) and then uses the mask to set the NaNs to zero in the copy. So
not only is nanny faster, but it uses less memory.
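Roughly what np.nansum was doing at the time (a paraphrase of its
pure-Python implementation, so details may differ by version):

import numpy as np

def numpy_style_nansum(a, axis=None):
    y = np.array(a)      # first copy of the data
    mask = np.isnan(a)   # second temporary, one bool per element
    y[mask] = 0          # zero the NaNs in the copy
    return y.sum(axis)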


Re: [Numpy-discussion] [ANN] Nanny, faster NaN functions

2010-11-19 Thread Keith Goodman
On Fri, Nov 19, 2010 at 12:19 PM, Pauli Virtanen p...@iki.fi wrote:
 Fri, 19 Nov 2010 11:19:57 -0800, Keith Goodman wrote:
 [clip]
 My guess is that having separate underlying functions for each dtype,
 ndim, and axis would be a nightmare for a large project like Numpy. But
 manageable for a focused project like nanny.

 Might be easier to migrate the nan* functions to using Ufuncs.

 Unless I'm missing something,

        np.nanmax -> np.fmax.reduce
        np.nanmin -> np.fmin.reduce

 For `nansum`, we'd need to add a ufunc `nanadd`, and for
 `nanargmax/min`, we'd need `argfmin/fmax`.

How about that! I wasn't aware of fmax/fmin. Yes, I'd like a nanadd, please.

 arr = np.random.rand(1000, 1000)
 arr[arr > 0.5] = np.nan
 np.nanmax(arr)
   0.4625409581072
 np.fmax.reduce(arr, axis=None)
 <snip>
TypeError: an integer is required
 np.fmax.reduce(np.fmax.reduce(arr, axis=0), axis=0)
   0.4625409581072

 timeit np.fmax.reduce(np.fmax.reduce(arr, axis=0), axis=0)
100 loops, best of 3: 12.7 ms per loop
 timeit np.nanmax(arr)
10 loops, best of 3: 39.6 ms per loop

 timeit np.nanmax(arr, axis=0)
10 loops, best of 3: 46.5 ms per loop
 timeit np.fmax.reduce(arr, axis=0)
100 loops, best of 3: 12.7 ms per loop


Re: [Numpy-discussion] [ANN] Nanny, faster NaN functions

2010-11-19 Thread Keith Goodman
On Fri, Nov 19, 2010 at 12:29 PM, Keith Goodman kwgood...@gmail.com wrote:
 On Fri, Nov 19, 2010 at 12:19 PM, Pauli Virtanen p...@iki.fi wrote:
 Fri, 19 Nov 2010 11:19:57 -0800, Keith Goodman wrote:
 [clip]
 My guess is that having separate underlying functions for each dtype,
 ndim, and axis would be a nightmare for a large project like Numpy. But
 manageable for a focused project like nanny.

 Might be easier to migrate the nan* functions to using Ufuncs.

 Unless I'm missing something,

        np.nanmax -> np.fmax.reduce
        np.nanmin -> np.fmin.reduce

 For `nansum`, we'd need to add a ufunc `nanadd`, and for
 `nanargmax/min`, we'd need `argfmin/fmax`.

 How about that! I wasn't aware of fmax/fmin. Yes, I'd like a nanadd, please.

 arr = np.random.rand(1000, 1000)
 arr[arr > 0.5] = np.nan
 np.nanmax(arr)
   0.4625409581072
 np.fmax.reduce(arr, axis=None)
 <snip>
 TypeError: an integer is required
 np.fmax.reduce(np.fmax.reduce(arr, axis=0), axis=0)
   0.4625409581072

 timeit np.fmax.reduce(np.fmax.reduce(arr, axis=0), axis=0)
 100 loops, best of 3: 12.7 ms per loop
 timeit np.nanmax(arr)
 10 loops, best of 3: 39.6 ms per loop

 timeit np.nanmax(arr, axis=0)
 10 loops, best of 3: 46.5 ms per loop
 timeit np.fmax.reduce(arr, axis=0)
 100 loops, best of 3: 12.7 ms per loop

Cython is faster than np.fmax.reduce.

I wrote a cython version of np.nanmax, called nanmax below. (It only
handles the 2d, float64, axis=None case, but since the array is large
I don't think that explains the time difference).
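The kernel is just a double loop that skips NaNs; in pure Python the
same logic reads (a sketch of the logic, not the actual Cython, which
adds static types):

import numpy as np

def nanmax_2d_float64_axisNone(a):
    amax = -np.inf
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            x = a[i, j]
            if x == x and x > amax:  # x == x is False only for NaN
                amax = x
    return amax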

Note that fmax.reduce is slower than np.nanmax when there are no NaNs:

 arr = np.random.rand(1000, 1000)
 timeit np.nanmax(arr)
100 loops, best of 3: 5.82 ms per loop
 timeit np.fmax.reduce(np.fmax.reduce(arr))
100 loops, best of 3: 9.14 ms per loop
 timeit nanmax(arr)
1000 loops, best of 3: 1.17 ms per loop

 arr[arr > 0.5] = np.nan

 timeit np.nanmax(arr)
10 loops, best of 3: 45.5 ms per loop
 timeit np.fmax.reduce(np.fmax.reduce(arr))
100 loops, best of 3: 12.7 ms per loop
 timeit nanmax(arr)
1000 loops, best of 3: 1.17 ms per loop


Re: [Numpy-discussion] [ANN] Nanny, faster NaN functions

2010-11-19 Thread Christopher Barker
On 11/19/10 11:19 AM, Keith Goodman wrote:
 On Fri, Nov 19, 2010 at 10:55 AM, Nathaniel Smithn...@pobox.com  wrote:
 Why not make this a patch to numpy/scipy instead?

 My guess is that having separate underlying functions for each dtype,
 ndim, and axis would be a nightmare for a large project like Numpy.

True, but:

1) Having special-cases for the most common cases is not such a bad idea.

2) could one use some sort of templating approach to get all the dtypes 
and such that you want?

3) as for number of dimensions, I don't think it would be too hard to
generalize that -- at least for contiguous arrays.

-Chris


-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov


Re: [Numpy-discussion] [ANN] Nanny, faster NaN functions

2010-11-19 Thread Charles R Harris
On Fri, Nov 19, 2010 at 3:18 PM, Christopher Barker
chris.bar...@noaa.govwrote:

 On 11/19/10 11:19 AM, Keith Goodman wrote:
  On Fri, Nov 19, 2010 at 10:55 AM, Nathaniel Smithn...@pobox.com  wrote:
  Why not make this a patch to numpy/scipy instead?
 
  My guess is that having separate underlying functions for each dtype,
  ndim, and axis would be a nightmare for a large project like Numpy.

 True, but:

 1) Having special-cases for the most common cases is not such a bad idea.

 2) could one use some sort of templating approach to get all the dtypes
 and such that you want?

 3) as for number of dimensions, I don't think it would be too hard to
 generalize that -- at least for contiguous arrays.


Note that the fmax/fmin versions can be sped up in the same way as
sum.reduce was. Also, you should pass the flattened array to the routine for
the axis=None case.

Chuck


Re: [Numpy-discussion] [ANN] Nanny, faster NaN functions

2010-11-19 Thread Charles R Harris
On Fri, Nov 19, 2010 at 1:50 PM, Keith Goodman kwgood...@gmail.com wrote:

 On Fri, Nov 19, 2010 at 12:29 PM, Keith Goodman kwgood...@gmail.com
 wrote:
  On Fri, Nov 19, 2010 at 12:19 PM, Pauli Virtanen p...@iki.fi wrote:
  Fri, 19 Nov 2010 11:19:57 -0800, Keith Goodman wrote:
  [clip]
  My guess is that having separate underlying functions for each dtype,
  ndim, and axis would be a nightmare for a large project like Numpy. But
  manageable for a focused project like nanny.
 
  Might be easier to migrate the nan* functions to using Ufuncs.
 
  Unless I'm missing something,
 
 np.nanmax -> np.fmax.reduce
 np.nanmin -> np.fmin.reduce

  For `nansum`, we'd need to add a ufunc `nanadd`, and for
  `nanargmax/min`, we'd need `argfmin/fmax`.
 
  How about that! I wasn't aware of fmax/fmin. Yes, I'd like a nanadd,
 please.
 
  arr = np.random.rand(1000, 1000)
  arr[arr > 0.5] = np.nan
  np.nanmax(arr)
    0.4625409581072
  np.fmax.reduce(arr, axis=None)
  <snip>
  TypeError: an integer is required
  np.fmax.reduce(np.fmax.reduce(arr, axis=0), axis=0)
0.4625409581072
 
  timeit np.fmax.reduce(np.fmax.reduce(arr, axis=0), axis=0)
  100 loops, best of 3: 12.7 ms per loop
  timeit np.nanmax(arr)
  10 loops, best of 3: 39.6 ms per loop
 
  timeit np.nanmax(arr, axis=0)
  10 loops, best of 3: 46.5 ms per loop
  timeit np.fmax.reduce(arr, axis=0)
  100 loops, best of 3: 12.7 ms per loop

 Cython is faster than np.fmax.reduce.

 I wrote a cython version of np.nanmax, called nanmax below. (It only
 handles the 2d, float64, axis=None case, but since the array is large
 I don't think that explains the time difference).

 Note that fmax.reduce is slower than np.nanmax when there are no NaNs:

  arr = np.random.rand(1000, 1000)
  timeit np.nanmax(arr)
 100 loops, best of 3: 5.82 ms per loop
  timeit np.fmax.reduce(np.fmax.reduce(arr))
 100 loops, best of 3: 9.14 ms per loop
  timeit nanmax(arr)
 1000 loops, best of 3: 1.17 ms per loop

  arr[arr > 0.5] = np.nan

  timeit np.nanmax(arr)
 10 loops, best of 3: 45.5 ms per loop
  timeit np.fmax.reduce(np.fmax.reduce(arr))
 100 loops, best of 3: 12.7 ms per loop
  timeit nanmax(arr)
 1000 loops, best of 3: 1.17 ms per loop


There seem to be some odd hardware/compiler dependencies. I get quite a
different pattern of times:

In [1]: arr = np.random.rand(1000, 1000)

In [2]: timeit np.nanmax(arr)
100 loops, best of 3: 10.4 ms per loop

In [3]: timeit np.fmax.reduce(arr.flat)
100 loops, best of 3: 2.09 ms per loop

In [4]: arr[arr > 0.5] = np.nan

In [5]: timeit np.nanmax(arr)
100 loops, best of 3: 12.9 ms per loop

In [6]: timeit np.fmax.reduce(arr.flat)
100 loops, best of 3: 7.09 ms per loop


I've tweaked fmax with the reduce loop option but the nanmax times don't
look like yours at all. I'm also a bit surprised that
you don't see any difference in times when the array contains a lot of nans.
I'm running on AMD Phenom, gcc 4.4.5.

Chuck


Re: [Numpy-discussion] [ANN] Nanny, faster NaN functions

2010-11-19 Thread Charles R Harris
On Fri, Nov 19, 2010 at 8:19 PM, Charles R Harris charlesr.har...@gmail.com
 wrote:



 On Fri, Nov 19, 2010 at 1:50 PM, Keith Goodman kwgood...@gmail.comwrote:

 On Fri, Nov 19, 2010 at 12:29 PM, Keith Goodman kwgood...@gmail.com
 wrote:
  On Fri, Nov 19, 2010 at 12:19 PM, Pauli Virtanen p...@iki.fi wrote:
  Fri, 19 Nov 2010 11:19:57 -0800, Keith Goodman wrote:
  [clip]
  My guess is that having separate underlying functions for each dtype,
  ndim, and axis would be a nightmare for a large project like Numpy.
 But
  manageable for a focused project like nanny.
 
  Might be easier to migrate the nan* functions to using Ufuncs.
 
  Unless I'm missing something,
 
 np.nanmax -> np.fmax.reduce
 np.nanmin -> np.fmin.reduce

  For `nansum`, we'd need to add a ufunc `nanadd`, and for
  `nanargmax/min`, we'd need `argfmin/fmax`.
 
  How about that! I wasn't aware of fmax/fmin. Yes, I'd like a nanadd,
 please.
 
  arr = np.random.rand(1000, 1000)
  arr[arr > 0.5] = np.nan
  np.nanmax(arr)
    0.4625409581072
  np.fmax.reduce(arr, axis=None)
  <snip>
  TypeError: an integer is required
  np.fmax.reduce(np.fmax.reduce(arr, axis=0), axis=0)
0.4625409581072
 
  timeit np.fmax.reduce(np.fmax.reduce(arr, axis=0), axis=0)
  100 loops, best of 3: 12.7 ms per loop
  timeit np.nanmax(arr)
  10 loops, best of 3: 39.6 ms per loop
 
  timeit np.nanmax(arr, axis=0)
  10 loops, best of 3: 46.5 ms per loop
  timeit np.fmax.reduce(arr, axis=0)
  100 loops, best of 3: 12.7 ms per loop

 Cython is faster than np.fmax.reduce.

 I wrote a cython version of np.nanmax, called nanmax below. (It only
 handles the 2d, float64, axis=None case, but since the array is large
 I don't think that explains the time difference).

 Note that fmax.reduce is slower than np.nanmax when there are no NaNs:

  arr = np.random.rand(1000, 1000)
  timeit np.nanmax(arr)
 100 loops, best of 3: 5.82 ms per loop
  timeit np.fmax.reduce(np.fmax.reduce(arr))
 100 loops, best of 3: 9.14 ms per loop
  timeit nanmax(arr)
 1000 loops, best of 3: 1.17 ms per loop

  arr[arr  0.5] = np.nan

  timeit np.nanmax(arr)
 10 loops, best of 3: 45.5 ms per loop
  timeit np.fmax.reduce(np.fmax.reduce(arr))
 100 loops, best of 3: 12.7 ms per loop
  timeit nanmax(arr)
 1000 loops, best of 3: 1.17 ms per loop


 There seem to be some odd hardware/compiler dependencies. I get quite a
 different pattern of times:

 In [1]: arr = np.random.rand(1000, 1000)

 In [2]: timeit np.nanmax(arr)
 100 loops, best of 3: 10.4 ms per loop

 In [3]: timeit np.fmax.reduce(arr.flat)
 100 loops, best of 3: 2.09 ms per loop

 In [4]: arr[arr  0.5] = np.nan

 In [5]: timeit np.nanmax(arr)
 100 loops, best of 3: 12.9 ms per loop

 In [6]: timeit np.fmax.reduce(arr.flat)
 100 loops, best of 3: 7.09 ms per loop


 I've tweaked fmax with the reduce loop option but the nanmax times don't
 look like yours at all. I'm also a bit surprised that
 you don't see any difference in times when the array contains a lot of
 nans. I'm running on AMD Phenom, gcc 4.4.5.


However, I noticed that the build wants to be -O1 by default. I have my own
CFLAGS that make it -O2, but it looks like Ubuntu's Python might be built
with -O1. Hmm. That could certainly cause some odd timings.

Chuck


Re: [Numpy-discussion] [ANN] Nanny, faster NaN functions

2010-11-19 Thread Keith Goodman
On Fri, Nov 19, 2010 at 7:19 PM, Charles R Harris
charlesr.har...@gmail.com wrote:


 <snip>


 I've tweaked fmax with the reduce loop option but the nanmax times don't
 look like yours at all. I'm also a bit surprised that
 you don't see any difference in times when the array contains a lot of nans.
 I'm running on AMD Phenom, gcc 4.4.5.

Ubuntu 10.04 64 bit, numpy 1.4.1.

Difference in which times? nanny.nanmax with and without NaNs? The
code doesn't explicitly check for NaNs (it does check for all NaNs).
It basically loops through the data and does:

allnan = 1
ai = a[i,k]
if ai > amax:
    amax = ai
    allnan = 0
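
For clarity, a minimal pure-Python sketch of that scan; naive_nanmax is a
hypothetical name, not Nanny's typed Cython kernel:

import numpy as np

def naive_nanmax(a):
    # Running max that skips NaNs for free: every comparison
    # involving NaN is False, so a NaN never replaces amax.
    amax = -np.inf
    allnan = True
    for ai in np.asarray(a, dtype=np.float64).flat:
        if ai > amax:
            amax = ai
            allnan = False
    # Note: allnan also stays True for [-inf], a corner case
    # raised later in the thread.
    return np.nan if allnan else amax

print(naive_nanmax([1.0, np.nan, 3.0]))  # 3.0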

I should make a benchmark suite.


Re: [Numpy-discussion] [ANN] Nanny, faster NaN functions

2010-11-19 Thread josef . pktd
On Fri, Nov 19, 2010 at 10:42 PM, Keith Goodman kwgood...@gmail.com wrote:
 <snip>

 Ubuntu 10.04 64 bit, numpy 1.4.1.

 Difference in which times? nanny.nanmax with and without NaNs? The
 code doesn't explicitly check for NaNs (it does check for all NaNs).
 It basically loops through the data and does:

 allnan = 1
 ai = a[i,k]
 if ai > amax:
     amax = ai
     allnan = 0

does this give you the correct answer?

 1np.nan
False

What's the starting value for amax? -inf?

Josef






Re: [Numpy-discussion] [ANN] Nanny, faster NaN functions

2010-11-19 Thread Nathaniel Smith
On Fri, Nov 19, 2010 at 7:51 PM,  josef.p...@gmail.com wrote:
 On Fri, Nov 19, 2010 at 10:42 PM, Keith Goodman kwgood...@gmail.com wrote:
 It basically loops through the data and does:

 allnan = 1
 ai = a[i,k]
 if ai > amax:
    amax = ai
    allnan = 0

 does this give you the correct answer?

 1 > np.nan
 False

Yes -- notice he does the comparison the other way, and

 1 < np.nan
False

(All comparisons involving NaN return false, including, famously, NaN
== NaN, which is why we need np.isnan.)
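
A quick interpreter check of those semantics:

import numpy as np

print(1 > np.nan)        # False
print(1 < np.nan)        # False
print(np.nan == np.nan)  # False
print(np.isnan(np.nan))  # True -- the reliable test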

-- Nathaniel


Re: [Numpy-discussion] [ANN] Nanny, faster NaN functions

2010-11-19 Thread Keith Goodman
On Fri, Nov 19, 2010 at 7:51 PM,  josef.p...@gmail.com wrote:

 does this give you the correct answer?

 1 > np.nan
 False

 What's the starting value for amax? -inf?

Because 1 > np.nan is False, the current running max does not get
updated, which is what we want.

 import nanny as ny
 ny.nanmax([1, np.nan])
   1.0
 ny.nanmax([np.nan, 1])
   1.0
 ny.nanmax([np.nan, 1, np.nan])
   1.0

Starting value is -np.inf for floats and stuff like this for ints:

cdef np.int32_t MININT32 = np.iinfo(np.int32).min
cdef np.int64_t MININT64 = np.iinfo(np.int64).min
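
A sketch of how such per-dtype starting values might be chosen;
max_identity is a hypothetical helper, not part of Nanny:

import numpy as np

def max_identity(dtype):
    # Smallest representable value: the identity for a running max.
    dtype = np.dtype(dtype)
    if dtype.kind == 'f':
        return -np.inf
    if dtype.kind in 'iu':
        return np.iinfo(dtype).min
    raise TypeError("unsupported dtype: %s" % dtype)

print(max_identity(np.float64))  # -inf
print(max_identity(np.int32))    # -2147483648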

Numpy does this:

 np.nanmax([])
 <snip>
ValueError: zero-size array to ufunc.reduce without identity

Nanny does this:

 ny.nanmax([])
   nan

So I haven't taken care of that corner case yet. I'll commit nanmax to
github in case anyone wants to give it a try.


Re: [Numpy-discussion] [ANN] Nanny, faster NaN functions

2010-11-19 Thread Charles R Harris
On Fri, Nov 19, 2010 at 8:42 PM, Keith Goodman kwgood...@gmail.com wrote:

 <snip>

 Ubuntu 10.04 64 bit, numpy 1.4.1.

 Difference in which times? nanny.nanmax with and without NaNs? The
 code doesn't explicitly check for NaNs (it does check for all NaNs).
 It basically loops through the data and does:

 allnan = 1
 ai = a[i,k]
 if ai > amax:
     amax = ai
     allnan = 0

 I should make a benchmark suite.


This doesn't look right:

@cython.boundscheck(False)
@cython.wraparound(False)
def nanmax_2d_float64_axisNone(np.ndarray[np.float64_t, ndim=2] a):
    """nanmax of 2d numpy array with dtype=np.float64 along axis=None."""
    cdef Py_ssize_t i, j
    cdef int arow = a.shape[0], acol = a.shape[1], allnan = 1
    cdef np.float64_t amax = 0, aij
    for i in range(arow):
        for j in range(acol):
            aij = a[i,j]
            if aij == aij:
                amax += aij
                allnan = 0
    if allnan == 0:
        return np.float64(amax)
    else:
        return NAN

 It's doing a sum, not a comparison.

Chuck


Re: [Numpy-discussion] [ANN] Nanny, faster NaN functions

2010-11-19 Thread Keith Goodman
On Fri, Nov 19, 2010 at 8:05 PM, Charles R Harris
charlesr.har...@gmail.com wrote:

 This doesn't look right:

 <snip>

  It's doing a sum, not a comparison.

That was a placeholder. Look at the latest commit. Sorry for the confusion.


Re: [Numpy-discussion] [ANN] Nanny, faster NaN functions

2010-11-19 Thread josef . pktd
On Fri, Nov 19, 2010 at 10:59 PM, Keith Goodman kwgood...@gmail.com wrote:
 <snip>

 Starting value is -np.inf for floats and stuff like this for ints:

 cdef np.int32_t MININT32 = np.iinfo(np.int32).min
 cdef np.int64_t MININT64 = np.iinfo(np.int64).min

That's what I thought halfway through typing the question.

 -np.inf > -np.inf
False

If the only value is -np.inf, you will return nan, I guess.

 np.nanmax([-np.inf, np.nan])
-inf

Josef
(being picky)





Re: [Numpy-discussion] [ANN] Nanny, faster NaN functions

2010-11-19 Thread Keith Goodman
On Fri, Nov 19, 2010 at 8:33 PM,  josef.p...@gmail.com wrote:

 -np.inf > -np.inf
 False

 If the only value is -np.inf, you will return nan, I guess.

 np.nanmax([-np.inf, np.nan])
 -inf

That's a great corner case. Thanks, Josef. This looks like it would fix it:

change

if ai > amax:
    amax = ai

to

if ai >= amax:
    amax = ai
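
Applying the fix to a pure-Python sketch of the scan shows why >= helps:
a lone -inf now flips the allnan flag, while NaNs are still skipped
because every comparison with NaN is False. (nanmax_fixed is a
hypothetical name, not Nanny's actual kernel.)

import numpy as np

def nanmax_fixed(a):
    amax = -np.inf
    allnan = True
    for ai in np.asarray(a, dtype=np.float64).flat:
        if ai >= amax:  # >= lets ai == -inf mark the array as not all-NaN
            amax = ai
            allnan = False
    return np.nan if allnan else amax

print(nanmax_fixed([-np.inf, np.nan]))  # -inf, matching np.nanmax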