Re: [Numpy-discussion] Automatic number of bins for numpy histograms

2015-04-18 Thread Sturla Molden
Jaime Fernández del Río jaime.f...@gmail.com wrote:

 I think we have an explicit rule against C++, although I may be wrong.

Currently there is Python, C and Cython in NumPy.

SciPy also has C++ and Fortran code.


Sturla

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Automatic number of bins for numpy histograms

2015-04-15 Thread Jaime Fernández del Río
On Wed, Apr 15, 2015 at 4:36 AM, Neil Girdhar mistersh...@gmail.com wrote:

 Yeah, I'm not arguing, I'm just curious about your reasoning.  That
 explains why not C++.  Why would you want to do this in C and not Python?


Well, the algorithm has to iterate over all the inputs, updating the
estimated percentile positions at every iteration. Because the estimated
percentiles may change in every iteration, I don't think there is an easy
way of vectorizing the calculation with numpy. So I think it would be very
slow if done in Python.

Looking at this in some more details, how is this typically used? Because
it gives you approximate values that should split your sample into
similarly filled bins, but because the values are approximate, to compute a
proper histogram you would still need to do the binning to get the exact
results, right? Even with this drawback P-2 does have an algorithmic
advantage, so for huge inputs and many bins it should come ahead. But for
many medium sized problems it may be faster to simply use np.partition,
which gives you the whole thing in a single go. And it would be much
simpler to implement.

Jaime

-- 
(\__/)
( O.o)
(  ) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes
de dominación mundial.
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Automatic number of bins for numpy histograms

2015-04-15 Thread Neil Girdhar
You got it.  I remember this from when I worked at Google and we would
process (many many) logs.  With enough bins, the approximation is still
really close.  It's great if you want to make an automatic plot of data.
Calling numpy.partition a hundred times is probably slower than calling P^2
with n=100 bins.  I don't think it does O(n) computations per point.  I
think it's more like O(log(n)).

Best,

Neil

On Wed, Apr 15, 2015 at 10:02 AM, Jaime Fernández del Río 
jaime.f...@gmail.com wrote:

 On Wed, Apr 15, 2015 at 4:36 AM, Neil Girdhar mistersh...@gmail.com
 wrote:

 Yeah, I'm not arguing, I'm just curious about your reasoning.  That
 explains why not C++.  Why would you want to do this in C and not Python?


 Well, the algorithm has to iterate over all the inputs, updating the
 estimated percentile positions at every iteration. Because the estimated
 percentiles may change in every iteration, I don't think there is an easy
 way of vectorizing the calculation with numpy. So I think it would be very
 slow if done in Python.

 Looking at this in some more details, how is this typically used? Because
 it gives you approximate values that should split your sample into
 similarly filled bins, but because the values are approximate, to compute a
 proper histogram you would still need to do the binning to get the exact
 results, right? Even with this drawback P-2 does have an algorithmic
 advantage, so for huge inputs and many bins it should come ahead. But for
 many medium sized problems it may be faster to simply use np.partition,
 which gives you the whole thing in a single go. And it would be much
 simpler to implement.

 Jaime

 --
 (\__/)
 ( O.o)
 (  ) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes
 de dominación mundial.

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Automatic number of bins for numpy histograms

2015-04-15 Thread Benjamin Root
Then you can set about convincing matplotlib and friends to
use it by default

Just to note, this proposal was originally made over in the matplotlib
project. We sent it over here where its benefits would have wider reach.
Matplotlib's plan is not to change the defaults, but to offload as much as
possible to numpy so that it can support these new features if they are
available. We might need to do some input validation so that users running
older version of numpy can get a sensible error message.

Cheers!
Ben Root


On Tue, Apr 14, 2015 at 7:12 PM, Nathaniel Smith n...@pobox.com wrote:

 On Mon, Apr 13, 2015 at 8:02 AM, Neil Girdhar mistersh...@gmail.com
 wrote:
  Can I suggest that we instead add the P-square algorithm for the dynamic
  calculation of histograms?
  (
 http://pierrechainais.ec-lille.fr/Centrale/Option_DAD/IMPACT_files/Dynamic%20quantiles%20calcultation%20-%20P2%20Algorythm.pdf
 )
 
  This is already implemented in C++'s boost library
  (
 http://www.boost.org/doc/libs/1_44_0/boost/accumulators/statistics/extended_p_square.hpp
 )
 
  I implemented it in Boost Python as a module, which I'm happy to share.
  This is much better than fixed-width histograms in practice.  Rather than
  adjusting the number of bins, it adjusts what you really want, which is
 the
  resolution of the bins throughout the domain.

 This definitely sounds like a useful thing to have in numpy or scipy
 (though if it's possible to do without using Boost/C++ that would be
 nice). But yeah, we should leave the existing histogram alone (in this
 regard) and add a new name for this like adaptive_histogram or
 something. Then you can set about convincing matplotlib and friends to
 use it by default :-)

 -n

 --
 Nathaniel J. Smith -- http://vorpus.org
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Automatic number of bins for numpy histograms

2015-04-15 Thread Jaime Fernández del Río
On Wed, Apr 15, 2015 at 8:06 AM, Neil Girdhar mistersh...@gmail.com wrote:

 You got it.  I remember this from when I worked at Google and we would
 process (many many) logs.  With enough bins, the approximation is still
 really close.  It's great if you want to make an automatic plot of data.
 Calling numpy.partition a hundred times is probably slower than calling P^2
 with n=100 bins.  I don't think it does O(n) computations per point.  I
 think it's more like O(log(n)).


Looking at it again, it probably is O(n) after all: it does a binary
search, which is O(log n), but it then goes on to update all the n bin
counters and estimations, so O(n) I'm afraid. So there is no algorithmic
advantage over partition/percentile: if there are m samples and n bins, P-2
that O(n) m times, while partition does O(m) n times, so both end up being
O(m n). It seems to me that the big thing of P^2 is not having to hold the
full dataset in memory. Online statistics (is that the name for this?),
even if only estimations, is a cool thing, but I am not sure numpy is the
place for them. That's not to say that we couldn't eventually have P^2
implemented for histogram, but I would start off with a partition based one.

Would SciPy have a place for online statistics? Perhaps there's room for
yet another scikit?

Jaime

-- 
(\__/)
( O.o)
(  ) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes
de dominación mundial.
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Automatic number of bins for numpy histograms

2015-04-15 Thread Jaime Fernández del Río
On Wed, Apr 15, 2015 at 9:14 AM, Eric Moore e...@redtetrahedron.org wrote:

 This blog post, and the links within also seem relevant.  Appears to have
 python code available to try things out as well.


 https://dataorigami.net/blogs/napkin-folding/19055451-percentile-and-quantile-estimation-of-big-data-the-t-digest


Very cool indeed... The original works is licensed under an Apache 2.0
license (https://github.com/tdunning/t-digest/blob/master/LICENSE). I am
not fluent in legalese, so not sure whether that means we can use it or
not, seems awfully more complicated than what we normally use.

Jaime

-- 
(\__/)
( O.o)
(  ) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes
de dominación mundial.
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Automatic number of bins for numpy histograms

2015-04-15 Thread Eric Moore
This blog post, and the links within also seem relevant.  Appears to have
python code available to try things out as well.

https://dataorigami.net/blogs/napkin-folding/19055451-percentile-and-quantile-estimation-of-big-data-the-t-digest

-Eric

On Wed, Apr 15, 2015 at 11:24 AM, Benjamin Root ben.r...@ou.edu wrote:

 Then you can set about convincing matplotlib and friends to
 use it by default

 Just to note, this proposal was originally made over in the matplotlib
 project. We sent it over here where its benefits would have wider reach.
 Matplotlib's plan is not to change the defaults, but to offload as much as
 possible to numpy so that it can support these new features if they are
 available. We might need to do some input validation so that users running
 older version of numpy can get a sensible error message.

 Cheers!
 Ben Root


 On Tue, Apr 14, 2015 at 7:12 PM, Nathaniel Smith n...@pobox.com wrote:

 On Mon, Apr 13, 2015 at 8:02 AM, Neil Girdhar mistersh...@gmail.com
 wrote:
  Can I suggest that we instead add the P-square algorithm for the dynamic
  calculation of histograms?
  (
 http://pierrechainais.ec-lille.fr/Centrale/Option_DAD/IMPACT_files/Dynamic%20quantiles%20calcultation%20-%20P2%20Algorythm.pdf
 )
 
  This is already implemented in C++'s boost library
  (
 http://www.boost.org/doc/libs/1_44_0/boost/accumulators/statistics/extended_p_square.hpp
 )
 
  I implemented it in Boost Python as a module, which I'm happy to share.
  This is much better than fixed-width histograms in practice.  Rather
 than
  adjusting the number of bins, it adjusts what you really want, which is
 the
  resolution of the bins throughout the domain.

 This definitely sounds like a useful thing to have in numpy or scipy
 (though if it's possible to do without using Boost/C++ that would be
 nice). But yeah, we should leave the existing histogram alone (in this
 regard) and add a new name for this like adaptive_histogram or
 something. Then you can set about convincing matplotlib and friends to
 use it by default :-)

 -n

 --
 Nathaniel J. Smith -- http://vorpus.org
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion



 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Automatic number of bins for numpy histograms

2015-04-15 Thread Neil Girdhar
Cool, thanks for looking at this.  P2 might still be better even if the
whole dataset is in memory because of cache misses.  Partition, which I
guess is based on quickselect, is going to run over all of the data as many
times as there are bins roughly, whereas p2 only runs over it once.  From a
cache miss standpoint, I think p2 is better?  Anyway, it might be worth
maybe coding to verify any performance advantages?  Not sure if it should
be in numpy or not since it really should accept an iterable rather than a
numpy vector, right?

Best,

Neil

On Wed, Apr 15, 2015 at 12:40 PM, Jaime Fernández del Río 
jaime.f...@gmail.com wrote:

 On Wed, Apr 15, 2015 at 8:06 AM, Neil Girdhar mistersh...@gmail.com
 wrote:

 You got it.  I remember this from when I worked at Google and we would
 process (many many) logs.  With enough bins, the approximation is still
 really close.  It's great if you want to make an automatic plot of data.
 Calling numpy.partition a hundred times is probably slower than calling P^2
 with n=100 bins.  I don't think it does O(n) computations per point.  I
 think it's more like O(log(n)).


 Looking at it again, it probably is O(n) after all: it does a binary
 search, which is O(log n), but it then goes on to update all the n bin
 counters and estimations, so O(n) I'm afraid. So there is no algorithmic
 advantage over partition/percentile: if there are m samples and n bins, P-2
 that O(n) m times, while partition does O(m) n times, so both end up being
 O(m n). It seems to me that the big thing of P^2 is not having to hold the
 full dataset in memory. Online statistics (is that the name for this?),
 even if only estimations, is a cool thing, but I am not sure numpy is the
 place for them. That's not to say that we couldn't eventually have P^2
 implemented for histogram, but I would start off with a partition based one.

 Would SciPy have a place for online statistics? Perhaps there's room for
 yet another scikit?

 Jaime

 --
 (\__/)
 ( O.o)
 (  ) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes
 de dominación mundial.

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Automatic number of bins for numpy histograms

2015-04-15 Thread Neil Girdhar
Yeah, I'm not arguing, I'm just curious about your reasoning.  That
explains why not C++.  Why would you want to do this in C and not Python?

On Wed, Apr 15, 2015 at 1:48 AM, Jaime Fernández del Río 
jaime.f...@gmail.com wrote:

 On Tue, Apr 14, 2015 at 6:16 PM, Neil Girdhar mistersh...@gmail.com
 wrote:

 If you're going to C, is there a reason not to go to C++ and include the
 already-written Boost code?  Otherwise, why not use Python?


 I think we have an explicit rule against C++, although I may be wrong. Not
 sure how much of boost we would have to make part of numpy to use that, the
 whole accumulators lib I'm guessing? Seems like an awful lot given what we
 are after.

 Jaime

 --
 (\__/)
 ( O.o)
 (  ) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes
 de dominación mundial.

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Automatic number of bins for numpy histograms

2015-04-14 Thread Paul Hobson
On Tue, Apr 14, 2015 at 4:24 PM, Jaime Fernández del Río 
jaime.f...@gmail.com wrote:

 On Tue, Apr 14, 2015 at 4:12 PM, Nathaniel Smith n...@pobox.com wrote:

 On Mon, Apr 13, 2015 at 8:02 AM, Neil Girdhar mistersh...@gmail.com
 wrote:
  Can I suggest that we instead add the P-square algorithm for the dynamic
  calculation of histograms?
  (
 http://pierrechainais.ec-lille.fr/Centrale/Option_DAD/IMPACT_files/Dynamic%20quantiles%20calcultation%20-%20P2%20Algorythm.pdf
 )
 
  This is already implemented in C++'s boost library
  (
 http://www.boost.org/doc/libs/1_44_0/boost/accumulators/statistics/extended_p_square.hpp
 )
 
  I implemented it in Boost Python as a module, which I'm happy to share.
  This is much better than fixed-width histograms in practice.  Rather
 than
  adjusting the number of bins, it adjusts what you really want, which is
 the
  resolution of the bins throughout the domain.

 This definitely sounds like a useful thing to have in numpy or scipy
 (though if it's possible to do without using Boost/C++ that would be
 nice). But yeah, we should leave the existing histogram alone (in this
 regard) and add a new name for this like adaptive_histogram or
 something. Then you can set about convincing matplotlib and friends to
 use it by default :-)


 Would having a negative number of bins mean this many, but with optimized
 boundaries be too clever an interface?


As a user, I think so. Wouldn't np.histogram(..., adaptive=True) do well
enough?
-p
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Automatic number of bins for numpy histograms

2015-04-14 Thread Neil Girdhar
If you're going to C, is there a reason not to go to C++ and include the
already-written Boost code?  Otherwise, why not use Python?

On Tue, Apr 14, 2015 at 7:24 PM, Jaime Fernández del Río 
jaime.f...@gmail.com wrote:

 On Tue, Apr 14, 2015 at 4:12 PM, Nathaniel Smith n...@pobox.com wrote:

 On Mon, Apr 13, 2015 at 8:02 AM, Neil Girdhar mistersh...@gmail.com
 wrote:
  Can I suggest that we instead add the P-square algorithm for the dynamic
  calculation of histograms?
  (
 http://pierrechainais.ec-lille.fr/Centrale/Option_DAD/IMPACT_files/Dynamic%20quantiles%20calcultation%20-%20P2%20Algorythm.pdf
 )
 
  This is already implemented in C++'s boost library
  (
 http://www.boost.org/doc/libs/1_44_0/boost/accumulators/statistics/extended_p_square.hpp
 )
 
  I implemented it in Boost Python as a module, which I'm happy to share.
  This is much better than fixed-width histograms in practice.  Rather
 than
  adjusting the number of bins, it adjusts what you really want, which is
 the
  resolution of the bins throughout the domain.

 This definitely sounds like a useful thing to have in numpy or scipy
 (though if it's possible to do without using Boost/C++ that would be
 nice). But yeah, we should leave the existing histogram alone (in this
 regard) and add a new name for this like adaptive_histogram or
 something. Then you can set about convincing matplotlib and friends to
 use it by default :-)


 Would having a negative number of bins mean this many, but with optimized
 boundaries be too clever an interface?

 I have taken a look at the paper linked, and the P-2 algorithm would not
 be too complicated to implement from scratch, although it would require
 writing some C code I'm afraid.

 Jaime

 --
 (\__/)
 ( O.o)
 (  ) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes
 de dominación mundial.

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Automatic number of bins for numpy histograms

2015-04-14 Thread Neil Girdhar
By the way, the p^2 algorithm still needs to know how many bins you want.
It just adapts the endpoints of the bins.  I like adaptive=True.  However,
you will have to find a way to return both the bins and and their
calculated endpoints.

The P^2 algorithm can also give approximate answers to numpy.percentile,
numpy.median.  How approximate they are depends on the number of bins you
let it keep track of.  I believe the authors bound the error as a function
of number of points and bins.

On Tue, Apr 14, 2015 at 10:00 PM, Paul Hobson pmhob...@gmail.com wrote:



 On Tue, Apr 14, 2015 at 4:24 PM, Jaime Fernández del Río 
 jaime.f...@gmail.com wrote:

 On Tue, Apr 14, 2015 at 4:12 PM, Nathaniel Smith n...@pobox.com wrote:

 On Mon, Apr 13, 2015 at 8:02 AM, Neil Girdhar mistersh...@gmail.com
 wrote:
  Can I suggest that we instead add the P-square algorithm for the
 dynamic
  calculation of histograms?
  (
 http://pierrechainais.ec-lille.fr/Centrale/Option_DAD/IMPACT_files/Dynamic%20quantiles%20calcultation%20-%20P2%20Algorythm.pdf
 )
 
  This is already implemented in C++'s boost library
  (
 http://www.boost.org/doc/libs/1_44_0/boost/accumulators/statistics/extended_p_square.hpp
 )
 
  I implemented it in Boost Python as a module, which I'm happy to share.
  This is much better than fixed-width histograms in practice.  Rather
 than
  adjusting the number of bins, it adjusts what you really want, which
 is the
  resolution of the bins throughout the domain.

 This definitely sounds like a useful thing to have in numpy or scipy
 (though if it's possible to do without using Boost/C++ that would be
 nice). But yeah, we should leave the existing histogram alone (in this
 regard) and add a new name for this like adaptive_histogram or
 something. Then you can set about convincing matplotlib and friends to
 use it by default :-)


 Would having a negative number of bins mean this many, but with
 optimized boundaries be too clever an interface?


 As a user, I think so. Wouldn't np.histogram(..., adaptive=True) do well
 enough?
 -p

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Automatic number of bins for numpy histograms

2015-04-14 Thread Jaime Fernández del Río
On Tue, Apr 14, 2015 at 6:16 PM, Neil Girdhar mistersh...@gmail.com wrote:

 If you're going to C, is there a reason not to go to C++ and include the
 already-written Boost code?  Otherwise, why not use Python?


I think we have an explicit rule against C++, although I may be wrong. Not
sure how much of boost we would have to make part of numpy to use that, the
whole accumulators lib I'm guessing? Seems like an awful lot given what we
are after.

Jaime

-- 
(\__/)
( O.o)
(  ) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes
de dominación mundial.
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Automatic number of bins for numpy histograms

2015-04-14 Thread Neil Girdhar
Can I suggest that we instead add the P-square algorithm for the dynamic
calculation of histograms?  (
http://pierrechainais.ec-lille.fr/Centrale/Option_DAD/IMPACT_files/Dynamic%20quantiles%20calcultation%20-%20P2%20Algorythm.pdf
)

This is already implemented in C++'s boost library (
http://www.boost.org/doc/libs/1_44_0/boost/accumulators/statistics/extended_p_square.hpp
)

I implemented it in Boost Python as a module, which I'm happy to share.
This is much better than fixed-width histograms in practice.  Rather than
adjusting the number of bins, it adjusts what you really want, which is the
resolution of the bins throughout the domain.

Best,

Neil

On Sun, Apr 12, 2015 at 4:02 AM, Ralf Gommers ralf.gomm...@gmail.com
wrote:



 On Sun, Apr 12, 2015 at 9:45 AM, Jaime Fernández del Río 
 jaime.f...@gmail.com wrote:

 On Sun, Apr 12, 2015 at 12:19 AM, Varun nayy...@gmail.com wrote:


 http://nbviewer.ipython.org/github/nayyarv/matplotlib/blob/master/examples/sta
 tistics/A
 http://nbviewer.ipython.org/github/nayyarv/matplotlib/blob/master/examples/statistics/A
 utomating%20Binwidth%20Choice%20for%20Histogram.ipynb

 Long story short, histogram visualisations that depend on numpy (such as
 matplotlib, or  nearly all of them) have poor default behaviour as I
 have to
 constantly play around with  the number of bins to get a good idea of
 what I'm
 looking at. The bins=10 works ok for  up to 1000 points or very normal
 data,
 but has poor performance for anything else, and  doesn't account for
 variability either. I don't have a method easily available to scale the
 number
 of bins given the data.

 R doesn't suffer from these problems and provides methods for use with
 it's
 hist  method. I would like to provide similar functionality for
 matplotlib, to
 at least provide  some kind of good starting point, as histograms are
 very
 useful for initial data discovery.

 The notebook above provides an explanation of the problem as well as some
 proposed  alternatives. Use different datasets (type and size) to see the
 performance of the  suggestions. All of the methods proposed exist in R
 and
 literature.

 I've put together an implementation to add this new functionality, but am
 hesitant to  make a pull request as I would like some feedback from a
 maintainer before doing so.


 +1 on the PR.


 +1 as well.

 Unfortunately we can't change the default of 10, but a number of string
 methods, with a bins=auto or some such name prominently recommended in
 the docstring, would be very good to have.

 Ralf

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Automatic number of bins for numpy histograms

2015-04-14 Thread Chris Barker
On Mon, Apr 13, 2015 at 5:02 AM, Neil Girdhar mistersh...@gmail.com wrote:

 Can I suggest that we instead add the P-square algorithm for the dynamic
 calculation of histograms?  (
 http://pierrechainais.ec-lille.fr/Centrale/Option_DAD/IMPACT_files/Dynamic%20quantiles%20calcultation%20-%20P2%20Algorythm.pdf
 )


This look slike a great thing to have in numpy. However, I suspect that a
lot of the downstream code that uses histogram expects equally-spaced bins.

So this should probably be a in addition to, rather than an instead of

-CHB




 This is already implemented in C++'s boost library (
 http://www.boost.org/doc/libs/1_44_0/boost/accumulators/statistics/extended_p_square.hpp
 )

 I implemented it in Boost Python as a module, which I'm happy to share.
 This is much better than fixed-width histograms in practice.  Rather than
 adjusting the number of bins, it adjusts what you really want, which is the
 resolution of the bins throughout the domain.

 Best,

 Neil

 On Sun, Apr 12, 2015 at 4:02 AM, Ralf Gommers ralf.gomm...@gmail.com
 wrote:



 On Sun, Apr 12, 2015 at 9:45 AM, Jaime Fernández del Río 
 jaime.f...@gmail.com wrote:

 On Sun, Apr 12, 2015 at 12:19 AM, Varun nayy...@gmail.com wrote:


 http://nbviewer.ipython.org/github/nayyarv/matplotlib/blob/master/examples/sta
 tistics/A
 http://nbviewer.ipython.org/github/nayyarv/matplotlib/blob/master/examples/statistics/A
 utomating%20Binwidth%20Choice%20for%20Histogram.ipynb

 Long story short, histogram visualisations that depend on numpy (such as
 matplotlib, or  nearly all of them) have poor default behaviour as I
 have to
 constantly play around with  the number of bins to get a good idea of
 what I'm
 looking at. The bins=10 works ok for  up to 1000 points or very normal
 data,
 but has poor performance for anything else, and  doesn't account for
 variability either. I don't have a method easily available to scale the
 number
 of bins given the data.

 R doesn't suffer from these problems and provides methods for use with
 it's
 hist  method. I would like to provide similar functionality for
 matplotlib, to
 at least provide  some kind of good starting point, as histograms are
 very
 useful for initial data discovery.

 The notebook above provides an explanation of the problem as well as
 some
 proposed  alternatives. Use different datasets (type and size) to see
 the
 performance of the  suggestions. All of the methods proposed exist in R
 and
 literature.

 I've put together an implementation to add this new functionality, but
 am
 hesitant to  make a pull request as I would like some feedback from a
 maintainer before doing so.


 +1 on the PR.


 +1 as well.

 Unfortunately we can't change the default of 10, but a number of string
 methods, with a bins=auto or some such name prominently recommended in
 the docstring, would be very good to have.

 Ralf

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion



 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion




-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Automatic number of bins for numpy histograms

2015-04-14 Thread Antony Lee
Another improvement would be to make sure, for integer-valued datasets,
that all bins cover the same number of integer, as it is easy to end up
otherwise with bins effectively wider than others:

hist(np.random.randint(11, size=1))

shows a peak in the last bin, as it covers both 9 and 10.

Antony

2015-04-13 5:02 GMT-07:00 Neil Girdhar mistersh...@gmail.com:

 Can I suggest that we instead add the P-square algorithm for the dynamic
 calculation of histograms?  (
 http://pierrechainais.ec-lille.fr/Centrale/Option_DAD/IMPACT_files/Dynamic%20quantiles%20calcultation%20-%20P2%20Algorythm.pdf
 )

 This is already implemented in C++'s boost library (
 http://www.boost.org/doc/libs/1_44_0/boost/accumulators/statistics/extended_p_square.hpp
 )

 I implemented it in Boost Python as a module, which I'm happy to share.
 This is much better than fixed-width histograms in practice.  Rather than
 adjusting the number of bins, it adjusts what you really want, which is the
 resolution of the bins throughout the domain.

 Best,

 Neil

 On Sun, Apr 12, 2015 at 4:02 AM, Ralf Gommers ralf.gomm...@gmail.com
 wrote:



 On Sun, Apr 12, 2015 at 9:45 AM, Jaime Fernández del Río 
 jaime.f...@gmail.com wrote:

 On Sun, Apr 12, 2015 at 12:19 AM, Varun nayy...@gmail.com wrote:


 http://nbviewer.ipython.org/github/nayyarv/matplotlib/blob/master/examples/sta
 tistics/A
 http://nbviewer.ipython.org/github/nayyarv/matplotlib/blob/master/examples/statistics/A
 utomating%20Binwidth%20Choice%20for%20Histogram.ipynb

 Long story short, histogram visualisations that depend on numpy (such as
 matplotlib, or  nearly all of them) have poor default behaviour as I
 have to
 constantly play around with  the number of bins to get a good idea of
 what I'm
 looking at. The bins=10 works ok for  up to 1000 points or very normal
 data,
 but has poor performance for anything else, and  doesn't account for
 variability either. I don't have a method easily available to scale the
 number
 of bins given the data.

 R doesn't suffer from these problems and provides methods for use with
 it's
 hist  method. I would like to provide similar functionality for
 matplotlib, to
 at least provide  some kind of good starting point, as histograms are
 very
 useful for initial data discovery.

 The notebook above provides an explanation of the problem as well as
 some
 proposed  alternatives. Use different datasets (type and size) to see
 the
 performance of the  suggestions. All of the methods proposed exist in R
 and
 literature.

 I've put together an implementation to add this new functionality, but
 am
 hesitant to  make a pull request as I would like some feedback from a
 maintainer before doing so.


 +1 on the PR.


 +1 as well.

 Unfortunately we can't change the default of 10, but a number of string
 methods, with a bins=auto or some such name prominently recommended in
 the docstring, would be very good to have.

 Ralf

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion



 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Automatic number of bins for numpy histograms

2015-04-14 Thread Neil Girdhar
Yes, you're right.  Although in practice, people almost always want
adaptive bins.

On Tue, Apr 14, 2015 at 5:08 PM, Chris Barker chris.bar...@noaa.gov wrote:

 On Mon, Apr 13, 2015 at 5:02 AM, Neil Girdhar mistersh...@gmail.com
 wrote:

 Can I suggest that we instead add the P-square algorithm for the dynamic
 calculation of histograms?  (
 http://pierrechainais.ec-lille.fr/Centrale/Option_DAD/IMPACT_files/Dynamic%20quantiles%20calcultation%20-%20P2%20Algorythm.pdf
 )


 This look slike a great thing to have in numpy. However, I suspect that a
 lot of the downstream code that uses histogram expects equally-spaced bins.

 So this should probably be a in addition to, rather than an instead of

 -CHB




 This is already implemented in C++'s boost library (
 http://www.boost.org/doc/libs/1_44_0/boost/accumulators/statistics/extended_p_square.hpp
 )

 I implemented it in Boost Python as a module, which I'm happy to share.
 This is much better than fixed-width histograms in practice.  Rather than
 adjusting the number of bins, it adjusts what you really want, which is the
 resolution of the bins throughout the domain.

 Best,

 Neil

 On Sun, Apr 12, 2015 at 4:02 AM, Ralf Gommers ralf.gomm...@gmail.com
 wrote:



 On Sun, Apr 12, 2015 at 9:45 AM, Jaime Fernández del Río 
 jaime.f...@gmail.com wrote:

 On Sun, Apr 12, 2015 at 12:19 AM, Varun nayy...@gmail.com wrote:


 http://nbviewer.ipython.org/github/nayyarv/matplotlib/blob/master/examples/sta
 tistics/A
 http://nbviewer.ipython.org/github/nayyarv/matplotlib/blob/master/examples/statistics/A
 utomating%20Binwidth%20Choice%20for%20Histogram.ipynb

 Long story short, histogram visualisations that depend on numpy (such
 as
 matplotlib, or  nearly all of them) have poor default behaviour as I
 have to
 constantly play around with  the number of bins to get a good idea of
 what I'm
 looking at. The bins=10 works ok for  up to 1000 points or very normal
 data,
 but has poor performance for anything else, and  doesn't account for
 variability either. I don't have a method easily available to scale
 the number
 of bins given the data.

 R doesn't suffer from these problems and provides methods for use with
 it's
 hist  method. I would like to provide similar functionality for
 matplotlib, to
 at least provide  some kind of good starting point, as histograms are
 very
 useful for initial data discovery.

 The notebook above provides an explanation of the problem as well as
 some
 proposed  alternatives. Use different datasets (type and size) to see
 the
 performance of the  suggestions. All of the methods proposed exist in
 R and
 literature.

 I've put together an implementation to add this new functionality, but
 am
 hesitant to  make a pull request as I would like some feedback from a
 maintainer before doing so.


 +1 on the PR.


 +1 as well.

 Unfortunately we can't change the default of 10, but a number of string
 methods, with a bins=auto or some such name prominently recommended in
 the docstring, would be very good to have.

 Ralf

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion



 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion




 --

 Christopher Barker, Ph.D.
 Oceanographer

 Emergency Response Division
 NOAA/NOS/ORR(206) 526-6959   voice
 7600 Sand Point Way NE   (206) 526-6329   fax
 Seattle, WA  98115   (206) 526-6317   main reception

 chris.bar...@noaa.gov

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Automatic number of bins for numpy histograms

2015-04-14 Thread Nathaniel Smith
On Mon, Apr 13, 2015 at 8:02 AM, Neil Girdhar mistersh...@gmail.com wrote:
 Can I suggest that we instead add the P-square algorithm for the dynamic
 calculation of histograms?
 (http://pierrechainais.ec-lille.fr/Centrale/Option_DAD/IMPACT_files/Dynamic%20quantiles%20calcultation%20-%20P2%20Algorythm.pdf)

 This is already implemented in C++'s boost library
 (http://www.boost.org/doc/libs/1_44_0/boost/accumulators/statistics/extended_p_square.hpp)

 I implemented it in Boost Python as a module, which I'm happy to share.
 This is much better than fixed-width histograms in practice.  Rather than
 adjusting the number of bins, it adjusts what you really want, which is the
 resolution of the bins throughout the domain.

This definitely sounds like a useful thing to have in numpy or scipy
(though if it's possible to do without using Boost/C++ that would be
nice). But yeah, we should leave the existing histogram alone (in this
regard) and add a new name for this like adaptive_histogram or
something. Then you can set about convincing matplotlib and friends to
use it by default :-)

-n

-- 
Nathaniel J. Smith -- http://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Automatic number of bins for numpy histograms

2015-04-14 Thread Jaime Fernández del Río
On Tue, Apr 14, 2015 at 4:12 PM, Nathaniel Smith n...@pobox.com wrote:

 On Mon, Apr 13, 2015 at 8:02 AM, Neil Girdhar mistersh...@gmail.com
 wrote:
  Can I suggest that we instead add the P-square algorithm for the dynamic
  calculation of histograms?
  (
 http://pierrechainais.ec-lille.fr/Centrale/Option_DAD/IMPACT_files/Dynamic%20quantiles%20calcultation%20-%20P2%20Algorythm.pdf
 )
 
  This is already implemented in C++'s boost library
  (
 http://www.boost.org/doc/libs/1_44_0/boost/accumulators/statistics/extended_p_square.hpp
 )
 
  I implemented it in Boost Python as a module, which I'm happy to share.
  This is much better than fixed-width histograms in practice.  Rather than
  adjusting the number of bins, it adjusts what you really want, which is
 the
  resolution of the bins throughout the domain.

 This definitely sounds like a useful thing to have in numpy or scipy
 (though if it's possible to do without using Boost/C++ that would be
 nice). But yeah, we should leave the existing histogram alone (in this
 regard) and add a new name for this like adaptive_histogram or
 something. Then you can set about convincing matplotlib and friends to
 use it by default :-)


Would having a negative number of bins mean this many, but with optimized
boundaries be too clever an interface?

I have taken a look at the paper linked, and the P-2 algorithm would not be
too complicated to implement from scratch, although it would require
writing some C code I'm afraid.

Jaime

-- 
(\__/)
( O.o)
(  ) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes
de dominación mundial.
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Automatic number of bins for numpy histograms

2015-04-12 Thread Ralf Gommers
On Sun, Apr 12, 2015 at 9:45 AM, Jaime Fernández del Río 
jaime.f...@gmail.com wrote:

 On Sun, Apr 12, 2015 at 12:19 AM, Varun nayy...@gmail.com wrote:


 http://nbviewer.ipython.org/github/nayyarv/matplotlib/blob/master/examples/sta
 tistics/A
 http://nbviewer.ipython.org/github/nayyarv/matplotlib/blob/master/examples/statistics/A
 utomating%20Binwidth%20Choice%20for%20Histogram.ipynb

 Long story short, histogram visualisations that depend on numpy (such as
 matplotlib, or  nearly all of them) have poor default behaviour as I have
 to
 constantly play around with  the number of bins to get a good idea of
 what I'm
 looking at. The bins=10 works ok for  up to 1000 points or very normal
 data,
 but has poor performance for anything else, and  doesn't account for
 variability either. I don't have a method easily available to scale the
 number
 of bins given the data.

 R doesn't suffer from these problems and provides methods for use with
 it's
 hist  method. I would like to provide similar functionality for
 matplotlib, to
 at least provide  some kind of good starting point, as histograms are very
 useful for initial data discovery.

 The notebook above provides an explanation of the problem as well as some
 proposed  alternatives. Use different datasets (type and size) to see the
 performance of the  suggestions. All of the methods proposed exist in R
 and
 literature.

 I've put together an implementation to add this new functionality, but am
 hesitant to  make a pull request as I would like some feedback from a
 maintainer before doing so.


 +1 on the PR.


+1 as well.

Unfortunately we can't change the default of 10, but a number of string
methods, with a bins=auto or some such name prominently recommended in
the docstring, would be very good to have.

Ralf
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Automatic number of bins for numpy histograms

2015-04-12 Thread Jaime Fernández del Río
On Sun, Apr 12, 2015 at 12:19 AM, Varun nayy...@gmail.com wrote:


 http://nbviewer.ipython.org/github/nayyarv/matplotlib/blob/master/examples/sta
 tistics/A utomating%20Binwidth%20Choice%20for%20Histogram.ipynb

 Long story short, histogram visualisations that depend on numpy (such as
 matplotlib, or  nearly all of them) have poor default behaviour as I have
 to
 constantly play around with  the number of bins to get a good idea of what
 I'm
 looking at. The bins=10 works ok for  up to 1000 points or very normal
 data,
 but has poor performance for anything else, and  doesn't account for
 variability either. I don't have a method easily available to scale the
 number
 of bins given the data.

 R doesn't suffer from these problems and provides methods for use with it's
 hist  method. I would like to provide similar functionality for
 matplotlib, to
 at least provide  some kind of good starting point, as histograms are very
 useful for initial data discovery.

 The notebook above provides an explanation of the problem as well as some
 proposed  alternatives. Use different datasets (type and size) to see the
 performance of the  suggestions. All of the methods proposed exist in R and
 literature.

 I've put together an implementation to add this new functionality, but am
 hesitant to  make a pull request as I would like some feedback from a
 maintainer before doing so.


+1 on the PR.

Jaime

-- 
(\__/)
( O.o)
(  ) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes
de dominación mundial.
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Automatic number of bins for numpy histograms

2015-04-12 Thread Varun
Using a URL shortener for the notebook to get around the 80 char width limit

http://goo.gl/JmfTRJ

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] Automatic number of bins for numpy histograms

2015-04-12 Thread Varun
http://nbviewer.ipython.org/github/nayyarv/matplotlib/blob/master/examples/sta
tistics/A utomating%20Binwidth%20Choice%20for%20Histogram.ipynb

Long story short, histogram visualisations that depend on numpy (such as
matplotlib, or  nearly all of them) have poor default behaviour as I have to
constantly play around with  the number of bins to get a good idea of what I'm
looking at. The bins=10 works ok for  up to 1000 points or very normal data,
but has poor performance for anything else, and  doesn't account for
variability either. I don't have a method easily available to scale the number
of bins given the data.

R doesn't suffer from these problems and provides methods for use with it's
hist  method. I would like to provide similar functionality for matplotlib, to
at least provide  some kind of good starting point, as histograms are very
useful for initial data discovery.

The notebook above provides an explanation of the problem as well as some
proposed  alternatives. Use different datasets (type and size) to see the
performance of the  suggestions. All of the methods proposed exist in R and
literature.

I've put together an implementation to add this new functionality, but am
hesitant to  make a pull request as I would like some feedback from a
maintainer before doing so.

https://github.com/numpy/numpy/compare/master...nayyarv:master

I've provided them as functions for easy refactoring, as it can be argued that
it should be  in it's own function/file/class, or alternatively can be turned
into simple if, elif statements.  I believe this belongs in numpy as it is
where the functionality exists for histogram  methods that most libraries
build on, and it would useful for them to not require scipy for  example.

I will update the documentation accordingly before making a pull request, and
add in  more tests to show it's functionality. I can adapt my ipython notebook
into a quick  tutorial/help file if need be.

I've already attempted to add this into matplotlib before being redirected
here  https://github.com/matplotlib/matplotlib/issues/4316

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion