Re: [Numpy-discussion] Automatic number of bins for numpy histograms
Jaime Fernández del Río <jaime.f...@gmail.com> wrote:
> I think we have an explicit rule against C++, although I may be wrong.

Currently there is Python, C and Cython in NumPy. SciPy also has C++ and Fortran code.

Sturla

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Automatic number of bins for numpy histograms
On Wed, Apr 15, 2015 at 4:36 AM, Neil Girdhar <mistersh...@gmail.com> wrote:
> Yeah, I'm not arguing, I'm just curious about your reasoning. That explains
> why not C++. Why would you want to do this in C and not Python?

Well, the algorithm has to iterate over all the inputs, updating the estimated percentile positions at every iteration. Because the estimated percentiles may change in every iteration, I don't think there is an easy way of vectorizing the calculation with numpy, so I think it would be very slow if done in Python.

Looking at this in some more detail, how is this typically used? It gives you approximate values that should split your sample into similarly filled bins, but because the values are approximate, to compute a proper histogram you would still need to do the binning to get exact results, right?

Even with this drawback, P-2 does have an algorithmic advantage, so for huge inputs and many bins it should come out ahead. But for many medium-sized problems it may be faster to simply use np.partition, which gives you the whole thing in a single go, and it would be much simpler to implement.

Jaime

--
(\__/)
( O.o)
( ) This is Bunny. Copy Bunny into your signature and help him with his plans for world domination.
Re: [Numpy-discussion] Automatic number of bins for numpy histograms
You got it. I remember this from when I worked at Google and we would process (many, many) logs. With enough bins, the approximation is still really close. It's great if you want to make an automatic plot of data.

Calling numpy.partition a hundred times is probably slower than calling P^2 with n=100 bins. I don't think it does O(n) computations per point; I think it's more like O(log(n)).

Best,

Neil

On Wed, Apr 15, 2015 at 10:02 AM, Jaime Fernández del Río <jaime.f...@gmail.com> wrote:
> Even with this drawback, P-2 does have an algorithmic advantage, so for
> huge inputs and many bins it should come out ahead. But for many
> medium-sized problems it may be faster to simply use np.partition, which
> gives you the whole thing in a single go, and it would be much simpler to
> implement.
Re: [Numpy-discussion] Automatic number of bins for numpy histograms
> Then you can set about convincing matplotlib and friends to use it by default

Just to note, this proposal was originally made over in the matplotlib project. We sent it over here, where its benefits would have wider reach. Matplotlib's plan is not to change the defaults, but to offload as much as possible to numpy so that it can support these new features if they are available. We might need to do some input validation so that users running older versions of numpy can get a sensible error message.

Cheers!
Ben Root

On Tue, Apr 14, 2015 at 7:12 PM, Nathaniel Smith <n...@pobox.com> wrote:
> This definitely sounds like a useful thing to have in numpy or scipy
> (though if it's possible to do without using Boost/C++ that would be nice).
> But yeah, we should leave the existing histogram alone (in this regard) and
> add a new name for this like adaptive_histogram or something. Then you can
> set about convincing matplotlib and friends to use it by default :-)
Re: [Numpy-discussion] Automatic number of bins for numpy histograms
On Wed, Apr 15, 2015 at 8:06 AM, Neil Girdhar <mistersh...@gmail.com> wrote:
> Calling numpy.partition a hundred times is probably slower than calling P^2
> with n=100 bins. I don't think it does O(n) computations per point. I think
> it's more like O(log(n)).

Looking at it again, it probably is O(n) after all: it does a binary search, which is O(log n), but it then goes on to update all the n bin counters and estimations, so O(n) I'm afraid. So there is no algorithmic advantage over partition/percentile: if there are m samples and n bins, P-2 does O(n) work m times, while partition does O(m) work n times, so both end up being O(m n).

It seems to me that the big selling point of P^2 is not having to hold the full dataset in memory. Online statistics (is that the name for this?), even if they are only estimations, are a cool thing, but I am not sure numpy is the place for them. That's not to say that we couldn't eventually have P^2 implemented for histogram, but I would start off with a partition-based one.

Would SciPy have a place for online statistics? Perhaps there's room for yet another scikit?

Jaime
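[The partition/percentile alternative mentioned above can be sketched in a few lines: compute equal-frequency bin edges from percentiles of the full in-memory dataset, then bin with np.histogram. This is only an illustrative sketch of the idea, not a proposed implementation.]

```python
import numpy as np

rng = np.random.RandomState(42)
data = rng.lognormal(size=1000)

# Equal-frequency bin edges: the 0th, 10th, ..., 100th percentiles.
# np.percentile uses np.partition internally, so this avoids a full sort.
nbins = 10
edges = np.percentile(data, np.linspace(0, 100, nbins + 1))

# Exact binning against those edges; for continuous data with no ties,
# each of the 10 bins holds exactly data.size / nbins samples.
counts, edges = np.histogram(data, bins=edges)
```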
Re: [Numpy-discussion] Automatic number of bins for numpy histograms
On Wed, Apr 15, 2015 at 9:14 AM, Eric Moore <e...@redtetrahedron.org> wrote:
> This blog post, and the links within, also seem relevant. It appears to
> have Python code available to try things out as well.
> https://dataorigami.net/blogs/napkin-folding/19055451-percentile-and-quantile-estimation-of-big-data-the-t-digest

Very cool indeed... The original work is licensed under an Apache 2.0 license (https://github.com/tdunning/t-digest/blob/master/LICENSE). I am not fluent in legalese, so I'm not sure whether that means we can use it or not; it seems awfully more complicated than what we normally use.

Jaime
Re: [Numpy-discussion] Automatic number of bins for numpy histograms
This blog post, and the links within, also seem relevant. It appears to have Python code available to try things out as well.

https://dataorigami.net/blogs/napkin-folding/19055451-percentile-and-quantile-estimation-of-big-data-the-t-digest

-Eric

On Wed, Apr 15, 2015 at 11:24 AM, Benjamin Root <ben.r...@ou.edu> wrote:
> Just to note, this proposal was originally made over in the matplotlib
> project. We sent it over here, where its benefits would have wider reach.
> Matplotlib's plan is not to change the defaults, but to offload as much as
> possible to numpy so that it can support these new features if they are
> available.
Re: [Numpy-discussion] Automatic number of bins for numpy histograms
Cool, thanks for looking at this. P2 might still be better even if the whole dataset is in memory, because of cache misses. Partition, which I guess is based on quickselect, is going to run over all of the data roughly as many times as there are bins, whereas P2 only runs over it once. From a cache-miss standpoint, I think P2 is better? Anyway, it might be worth coding both up to verify any performance advantages. I'm not sure it should be in numpy, since it really should accept an iterable rather than a numpy vector, right?

Best,

Neil

On Wed, Apr 15, 2015 at 12:40 PM, Jaime Fernández del Río <jaime.f...@gmail.com> wrote:
> So there is no algorithmic advantage over partition/percentile: if there
> are m samples and n bins, P-2 does O(n) work m times, while partition does
> O(m) work n times, so both end up being O(m n).
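[Worth noting on the "a hundred separate partition calls" worry: np.partition accepts a sequence of kth indices and places all of them into their sorted positions in one call, so the data need not be scanned once per bin. A quick sketch:]

```python
import numpy as np

rng = np.random.RandomState(1)
data = rng.standard_normal(100000)

# Order-statistic indices for 100 equal-frequency bins (101 edges),
# all handled by a single partition call rather than 100 quickselects.
kth = np.linspace(0, data.size - 1, 101).astype(int)
part = np.partition(data, kth)
edges = part[kth]  # exact order statistics, hence exact bin edges
```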
Re: [Numpy-discussion] Automatic number of bins for numpy histograms
Yeah, I'm not arguing, I'm just curious about your reasoning. That explains why not C++. Why would you want to do this in C and not Python?

On Wed, Apr 15, 2015 at 1:48 AM, Jaime Fernández del Río <jaime.f...@gmail.com> wrote:
> I think we have an explicit rule against C++, although I may be wrong. Not
> sure how much of Boost we would have to make part of numpy to use that, the
> whole accumulators lib I'm guessing? Seems like an awful lot given what we
> are after.
Re: [Numpy-discussion] Automatic number of bins for numpy histograms
On Tue, Apr 14, 2015 at 4:24 PM, Jaime Fernández del Río <jaime.f...@gmail.com> wrote:
> Would having a negative number of bins mean "this many, but with optimized
> boundaries" be too clever an interface?

As a user, I think so. Wouldn't np.histogram(..., adaptive=True) do well enough?

-p
Re: [Numpy-discussion] Automatic number of bins for numpy histograms
If you're going to C, is there a reason not to go to C++ and include the already-written Boost code? Otherwise, why not use Python?

On Tue, Apr 14, 2015 at 7:24 PM, Jaime Fernández del Río <jaime.f...@gmail.com> wrote:
> Would having a negative number of bins mean "this many, but with optimized
> boundaries" be too clever an interface?
>
> I have taken a look at the paper linked, and the P-2 algorithm would not be
> too complicated to implement from scratch, although it would require
> writing some C code, I'm afraid.
Re: [Numpy-discussion] Automatic number of bins for numpy histograms
By the way, the P^2 algorithm still needs to know how many bins you want; it just adapts the endpoints of the bins. I like adaptive=True. However, you will have to find a way to return both the bin counts and their calculated endpoints.

The P^2 algorithm can also give approximate answers to numpy.percentile and numpy.median. How approximate they are depends on the number of bins you let it keep track of. I believe the authors bound the error as a function of the number of points and bins.

On Tue, Apr 14, 2015 at 10:00 PM, Paul Hobson <pmhob...@gmail.com> wrote:
> As a user, I think so. Wouldn't np.histogram(..., adaptive=True) do well
> enough?
>
> -p
Re: [Numpy-discussion] Automatic number of bins for numpy histograms
On Tue, Apr 14, 2015 at 6:16 PM, Neil Girdhar <mistersh...@gmail.com> wrote:
> If you're going to C, is there a reason not to go to C++ and include the
> already-written Boost code? Otherwise, why not use Python?

I think we have an explicit rule against C++, although I may be wrong. Not sure how much of Boost we would have to make part of numpy to use that, the whole accumulators lib I'm guessing? Seems like an awful lot given what we are after.

Jaime
Re: [Numpy-discussion] Automatic number of bins for numpy histograms
Can I suggest that we instead add the P-square algorithm for the dynamic calculation of histograms? (http://pierrechainais.ec-lille.fr/Centrale/Option_DAD/IMPACT_files/Dynamic%20quantiles%20calcultation%20-%20P2%20Algorythm.pdf)

This is already implemented in C++'s Boost library (http://www.boost.org/doc/libs/1_44_0/boost/accumulators/statistics/extended_p_square.hpp). I implemented it in Boost Python as a module, which I'm happy to share.

This is much better than fixed-width histograms in practice. Rather than adjusting the number of bins, it adjusts what you really want, which is the resolution of the bins throughout the domain.

Best,

Neil

On Sun, Apr 12, 2015 at 4:02 AM, Ralf Gommers <ralf.gomm...@gmail.com> wrote:
> On Sun, Apr 12, 2015 at 9:45 AM, Jaime Fernández del Río <jaime.f...@gmail.com> wrote:
>> On Sun, Apr 12, 2015 at 12:19 AM, Varun <nayy...@gmail.com> wrote:
>>> http://nbviewer.ipython.org/github/nayyarv/matplotlib/blob/master/examples/statistics/Automating%20Binwidth%20Choice%20for%20Histogram.ipynb
>>>
>>> Long story short, histogram visualisations that depend on numpy (such as
>>> matplotlib, or nearly all of them) have poor default behaviour, as I have
>>> to constantly play around with the number of bins to get a good idea of
>>> what I'm looking at. The default bins=10 works OK for up to 1000 points
>>> or very normal data, but has poor performance for anything else, and
>>> doesn't account for variability either. I don't have a method easily
>>> available to scale the number of bins given the data. R doesn't suffer
>>> from these problems and provides methods for use with its hist method.
>>> I would like to provide similar functionality for matplotlib, to at
>>> least provide some kind of good starting point, as histograms are very
>>> useful for initial data discovery. The notebook above provides an
>>> explanation of the problem as well as some proposed alternatives. Use
>>> different datasets (type and size) to see the performance of the
>>> suggestions. All of the methods proposed exist in R and the literature.
>>> I've put together an implementation to add this new functionality, but
>>> am hesitant to make a pull request as I would like some feedback from a
>>> maintainer before doing so.
>>
>> +1 on the PR.
>
> +1 as well. Unfortunately we can't change the default of 10, but a number
> of string methods, with bins="auto" or some such name prominently
> recommended in the docstring, would be very good to have.
>
> Ralf
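[For concreteness, two of the standard rules behind the proposed string methods (Sturges and Freedman-Diaconis, both available in R) are essentially one-liners. The function names below are just for illustration, not a proposed numpy API.]

```python
import numpy as np

def sturges_bins(x):
    # Sturges' rule: ceil(log2(n)) + 1 bins; suits small, normal-ish data.
    return int(np.ceil(np.log2(x.size))) + 1

def fd_bins(x):
    # Freedman-Diaconis: bin width 2*IQR*n**(-1/3), robust to outliers.
    q75, q25 = np.percentile(x, [75, 25])
    width = 2.0 * (q75 - q25) * x.size ** (-1.0 / 3.0)
    return max(1, int(np.ceil((x.max() - x.min()) / width)))

rng = np.random.RandomState(0)
x = rng.standard_normal(10000)
counts, edges = np.histogram(x, bins=fd_bins(x))
```

Note how the two rules diverge as n grows: Sturges stays logarithmic, while Freedman-Diaconis scales as n**(1/3), which is why Sturges under-bins large datasets.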
Re: [Numpy-discussion] Automatic number of bins for numpy histograms
On Mon, Apr 13, 2015 at 5:02 AM, Neil Girdhar <mistersh...@gmail.com> wrote:
> Can I suggest that we instead add the P-square algorithm for the dynamic
> calculation of histograms?
> (http://pierrechainais.ec-lille.fr/Centrale/Option_DAD/IMPACT_files/Dynamic%20quantiles%20calcultation%20-%20P2%20Algorythm.pdf)

This looks like a great thing to have in numpy. However, I suspect that a lot of the downstream code that uses histogram expects equally spaced bins, so this should probably be an "in addition to" rather than an "instead of".

-CHB

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR            (206) 526-6959 voice
7600 Sand Point Way NE  (206) 526-6329 fax
Seattle, WA 98115       (206) 526-6317 main reception

chris.bar...@noaa.gov
Re: [Numpy-discussion] Automatic number of bins for numpy histograms
Another improvement would be to make sure, for integer-valued datasets, that all bins cover the same number of integers, as it is otherwise easy to end up with some bins effectively wider than others: hist(np.random.randint(11, size=10000)) shows a peak in the last bin, as that bin covers both 9 and 10.

Antony

2015-04-13 5:02 GMT-07:00 Neil Girdhar <mistersh...@gmail.com>:
> This is much better than fixed-width histograms in practice. Rather than
> adjusting the number of bins, it adjusts what you really want, which is
> the resolution of the bins throughout the domain.
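[The integer-bin pitfall described above is easy to demonstrate, and equally easy to work around today with explicit half-integer edges. A sketch, not a proposed numpy change:]

```python
import numpy as np

rng = np.random.RandomState(0)
data = rng.randint(11, size=10000)  # uniform integers 0..10

# Default: 10 equal-width bins over [0, 10]. The last bin, [9, 10], is the
# only one whose right edge is closed, so it collects two integers (9 and
# 10) and shows roughly twice the count of every other bin.
bad_counts, _ = np.histogram(data, bins=10)

# Half-integer edges -0.5, 0.5, ..., 10.5 give each integer its own bin,
# so the uniform data produces a flat histogram.
good_counts, _ = np.histogram(data, bins=np.arange(-0.5, 11.5))
```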
Re: [Numpy-discussion] Automatic number of bins for numpy histograms
Yes, you're right. Although in practice, people almost always want adaptive bins.

On Tue, Apr 14, 2015 at 5:08 PM, Chris Barker chris.bar...@noaa.gov wrote:

On Mon, Apr 13, 2015 at 5:02 AM, Neil Girdhar mistersh...@gmail.com wrote:

Can I suggest that we instead add the P-square algorithm for the dynamic calculation of histograms? (http://pierrechainais.ec-lille.fr/Centrale/Option_DAD/IMPACT_files/Dynamic%20quantiles%20calcultation%20-%20P2%20Algorythm.pdf)

This looks like a great thing to have in numpy. However, I suspect that a lot of the downstream code that uses histogram expects equally-spaced bins, so this should probably be in addition to, rather than instead of. -CHB

This is already implemented in C++'s Boost library (http://www.boost.org/doc/libs/1_44_0/boost/accumulators/statistics/extended_p_square.hpp). I implemented it in Boost Python as a module, which I'm happy to share. This is much better than fixed-width histograms in practice. Rather than adjusting the number of bins, it adjusts what you really want, which is the resolution of the bins throughout the domain.

Best, Neil

On Sun, Apr 12, 2015 at 4:02 AM, Ralf Gommers ralf.gomm...@gmail.com wrote:

On Sun, Apr 12, 2015 at 9:45 AM, Jaime Fernández del Río jaime.f...@gmail.com wrote:

On Sun, Apr 12, 2015 at 12:19 AM, Varun nayy...@gmail.com wrote:

http://nbviewer.ipython.org/github/nayyarv/matplotlib/blob/master/examples/statistics/Automating%20Binwidth%20Choice%20for%20Histogram.ipynb

Long story short, histogram visualisations that depend on numpy (such as matplotlib, or nearly all of them) have poor default behaviour, as I have to constantly play around with the number of bins to get a good idea of what I'm looking at. The bins=10 default works OK for up to 1000 points or very normal data, but has poor performance for anything else, and doesn't account for variability either. I don't have a method easily available to scale the number of bins given the data. R doesn't suffer from these problems and provides methods for use with its hist method. I would like to provide similar functionality for matplotlib, to at least provide some kind of good starting point, as histograms are very useful for initial data discovery.

The notebook above provides an explanation of the problem as well as some proposed alternatives. Use different datasets (type and size) to see the performance of the suggestions. All of the methods proposed exist in R and the literature. I've put together an implementation to add this new functionality, but am hesitant to make a pull request as I would like some feedback from a maintainer before doing so.

+1 on the PR.

+1 as well. Unfortunately we can't change the default of 10, but a number of string methods, with a bins='auto' or some such name prominently recommended in the docstring, would be very good to have.

Ralf

--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/ORR (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
chris.bar...@noaa.gov

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
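[Editor's note: the R-style bin-count rules discussed in this thread (Sturges, Freedman-Diaconis, etc.) can be sketched with plain numpy. The function names below are illustrative only, not the API proposed in the PR.]

```python
import numpy as np

def sturges_bins(data):
    # Sturges' rule: ceil(log2(n) + 1) bins; suited to small,
    # roughly normal samples (this is R's hist() default).
    return int(np.ceil(np.log2(data.size) + 1))

def freedman_diaconis_bins(data):
    # Freedman-Diaconis: bin width h = 2 * IQR * n^(-1/3),
    # robust to outliers because it uses the interquartile range.
    iqr = np.subtract(*np.percentile(data, [75, 25]))
    if iqr == 0:
        return sturges_bins(data)  # degenerate spread: fall back
    h = 2 * iqr * data.size ** (-1.0 / 3.0)
    return int(np.ceil((data.max() - data.min()) / h))

rng = np.random.RandomState(0)
data = rng.standard_normal(1000)
counts, edges = np.histogram(data, bins=freedman_diaconis_bins(data))
```

Both rules scale the number of bins with the sample size, which is exactly the behaviour the bins=10 default lacks.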
Re: [Numpy-discussion] Automatic number of bins for numpy histograms
On Mon, Apr 13, 2015 at 8:02 AM, Neil Girdhar mistersh...@gmail.com wrote:

Can I suggest that we instead add the P-square algorithm for the dynamic calculation of histograms? (http://pierrechainais.ec-lille.fr/Centrale/Option_DAD/IMPACT_files/Dynamic%20quantiles%20calcultation%20-%20P2%20Algorythm.pdf) This is already implemented in C++'s Boost library (http://www.boost.org/doc/libs/1_44_0/boost/accumulators/statistics/extended_p_square.hpp). I implemented it in Boost Python as a module, which I'm happy to share. This is much better than fixed-width histograms in practice. Rather than adjusting the number of bins, it adjusts what you really want, which is the resolution of the bins throughout the domain.

This definitely sounds like a useful thing to have in numpy or scipy (though if it's possible to do without using Boost/C++ that would be nice). But yeah, we should leave the existing histogram alone (in this regard) and add a new name for this, like adaptive_histogram or something. Then you can set about convincing matplotlib and friends to use it by default :-)

-n

--
Nathaniel J. Smith -- http://vorpus.org
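[Editor's note: the "adaptive_histogram" name floated above is hypothetical; one minimal, exact (non-streaming) sketch of the idea is to place bin edges at sample quantiles so every bin holds roughly the same number of points:]

```python
import numpy as np

def adaptive_histogram(data, bins=10):
    # Hypothetical sketch: equal-frequency rather than equal-width
    # bins, with edges at exact sample quantiles. Unlike P-square,
    # this needs the whole sample in memory.
    edges = np.percentile(data, np.linspace(0, 100, bins + 1))
    edges = np.unique(edges)  # guard against duplicate edges under ties
    return np.histogram(data, bins=edges)

rng = np.random.RandomState(42)
data = rng.exponential(size=10000)
counts, edges = adaptive_histogram(data, bins=20)
```

On heavily skewed data such as the exponential sample above, every bin ends up with about the same count, so the bin resolution follows the density of the data.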
Re: [Numpy-discussion] Automatic number of bins for numpy histograms
On Tue, Apr 14, 2015 at 4:12 PM, Nathaniel Smith n...@pobox.com wrote:

On Mon, Apr 13, 2015 at 8:02 AM, Neil Girdhar mistersh...@gmail.com wrote:

Can I suggest that we instead add the P-square algorithm for the dynamic calculation of histograms? (http://pierrechainais.ec-lille.fr/Centrale/Option_DAD/IMPACT_files/Dynamic%20quantiles%20calcultation%20-%20P2%20Algorythm.pdf) This is already implemented in C++'s Boost library (http://www.boost.org/doc/libs/1_44_0/boost/accumulators/statistics/extended_p_square.hpp). I implemented it in Boost Python as a module, which I'm happy to share. This is much better than fixed-width histograms in practice. Rather than adjusting the number of bins, it adjusts what you really want, which is the resolution of the bins throughout the domain.

This definitely sounds like a useful thing to have in numpy or scipy (though if it's possible to do without using Boost/C++ that would be nice). But yeah, we should leave the existing histogram alone (in this regard) and add a new name for this, like adaptive_histogram or something. Then you can set about convincing matplotlib and friends to use it by default :-)

Would having a negative number of bins mean "this many, but with optimized boundaries" be too clever an interface?

I have taken a look at the paper linked, and the P-2 algorithm would not be too complicated to implement from scratch, although it would require writing some C code, I'm afraid.

Jaime

--
(\__/)
( O.o)
( ) This is Bunny. Copy Bunny into your signature and help him with his plans for world domination.
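[Editor's note: for reference, the single-quantile P-square update from the linked Jain & Chlamtac paper is small enough to sketch in Python; a C version would follow the same structure, and Boost's extended_p_square generalizes it to several quantiles with more markers. This is a from-scratch illustrative sketch, not the Boost code or a proposed numpy API.]

```python
class P2Quantile:
    """Streaming estimate of the p-th quantile (P-square algorithm).

    Keeps only 5 markers; each new observation costs O(1) and no
    data is stored, which is the algorithmic advantage discussed above.
    """
    def __init__(self, p):
        self.p = p
        self.first = []      # buffer for the first five observations
        self.q = None        # marker heights
        self.n = None        # actual marker positions (0-based)
        self.d = None        # desired marker positions
        self.dn = [0.0, p / 2.0, p, (1.0 + p) / 2.0, 1.0]

    def add(self, x):
        if self.q is None:
            self.first.append(x)
            if len(self.first) == 5:
                self.first.sort()
                self.q = list(self.first)
                self.n = [0.0, 1.0, 2.0, 3.0, 4.0]
                p = self.p
                self.d = [0.0, 2 * p, 4 * p, 2 + 2 * p, 4.0]
            return
        q, n = self.q, self.n
        # 1. Find the cell the new observation falls into.
        if x < q[0]:
            q[0] = x; k = 0
        elif x >= q[4]:
            q[4] = x; k = 3
        else:
            k = next(i for i in range(4) if q[i] <= x < q[i + 1])
        # 2. Shift actual and desired positions.
        for i in range(k + 1, 5):
            n[i] += 1
        for i in range(5):
            self.d[i] += self.dn[i]
        # 3. Adjust interior markers with the parabolic (P^2) formula.
        for i in (1, 2, 3):
            diff = self.d[i] - n[i]
            if (diff >= 1 and n[i + 1] - n[i] > 1) or \
               (diff <= -1 and n[i - 1] - n[i] < -1):
                s = 1.0 if diff > 0 else -1.0
                qp = q[i] + s / (n[i + 1] - n[i - 1]) * (
                    (n[i] - n[i - 1] + s) * (q[i + 1] - q[i]) / (n[i + 1] - n[i])
                    + (n[i + 1] - n[i] - s) * (q[i] - q[i - 1]) / (n[i] - n[i - 1]))
                if not (q[i - 1] < qp < q[i + 1]):
                    # Parabolic step left the bracket: fall back to linear.
                    j = i + int(s)
                    qp = q[i] + s * (q[j] - q[i]) / (n[j] - n[i])
                q[i] = qp
                n[i] += s

    @property
    def estimate(self):
        return self.q[2] if self.q is not None else None
```

Feeding it a long uniform stream, the middle marker converges to the requested quantile without the stream ever being stored, in contrast to the exact np.partition/np.percentile route.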
Re: [Numpy-discussion] Automatic number of bins for numpy histograms
On Sun, Apr 12, 2015 at 9:45 AM, Jaime Fernández del Río jaime.f...@gmail.com wrote:

On Sun, Apr 12, 2015 at 12:19 AM, Varun nayy...@gmail.com wrote:

http://nbviewer.ipython.org/github/nayyarv/matplotlib/blob/master/examples/statistics/Automating%20Binwidth%20Choice%20for%20Histogram.ipynb

Long story short, histogram visualisations that depend on numpy (such as matplotlib, or nearly all of them) have poor default behaviour, as I have to constantly play around with the number of bins to get a good idea of what I'm looking at. The bins=10 default works OK for up to 1000 points or very normal data, but has poor performance for anything else, and doesn't account for variability either. I don't have a method easily available to scale the number of bins given the data. R doesn't suffer from these problems and provides methods for use with its hist method. I would like to provide similar functionality for matplotlib, to at least provide some kind of good starting point, as histograms are very useful for initial data discovery.

The notebook above provides an explanation of the problem as well as some proposed alternatives. Use different datasets (type and size) to see the performance of the suggestions. All of the methods proposed exist in R and the literature. I've put together an implementation to add this new functionality, but am hesitant to make a pull request as I would like some feedback from a maintainer before doing so.

+1 on the PR.

+1 as well. Unfortunately we can't change the default of 10, but a number of string methods, with a bins='auto' or some such name prominently recommended in the docstring, would be very good to have.

Ralf
Re: [Numpy-discussion] Automatic number of bins for numpy histograms
On Sun, Apr 12, 2015 at 12:19 AM, Varun nayy...@gmail.com wrote:

http://nbviewer.ipython.org/github/nayyarv/matplotlib/blob/master/examples/statistics/Automating%20Binwidth%20Choice%20for%20Histogram.ipynb

Long story short, histogram visualisations that depend on numpy (such as matplotlib, or nearly all of them) have poor default behaviour, as I have to constantly play around with the number of bins to get a good idea of what I'm looking at. The bins=10 default works OK for up to 1000 points or very normal data, but has poor performance for anything else, and doesn't account for variability either. I don't have a method easily available to scale the number of bins given the data. R doesn't suffer from these problems and provides methods for use with its hist method. I would like to provide similar functionality for matplotlib, to at least provide some kind of good starting point, as histograms are very useful for initial data discovery.

The notebook above provides an explanation of the problem as well as some proposed alternatives. Use different datasets (type and size) to see the performance of the suggestions. All of the methods proposed exist in R and the literature. I've put together an implementation to add this new functionality, but am hesitant to make a pull request as I would like some feedback from a maintainer before doing so.

+1 on the PR.

Jaime

--
(\__/)
( O.o)
( ) This is Bunny. Copy Bunny into your signature and help him with his plans for world domination.
Re: [Numpy-discussion] Automatic number of bins for numpy histograms
Using a URL shortener for the notebook to get around the 80 char width limit http://goo.gl/JmfTRJ
[Numpy-discussion] Automatic number of bins for numpy histograms
http://nbviewer.ipython.org/github/nayyarv/matplotlib/blob/master/examples/statistics/Automating%20Binwidth%20Choice%20for%20Histogram.ipynb

Long story short, histogram visualisations that depend on numpy (such as matplotlib, or nearly all of them) have poor default behaviour, as I have to constantly play around with the number of bins to get a good idea of what I'm looking at. The bins=10 default works OK for up to 1000 points or very normal data, but has poor performance for anything else, and doesn't account for variability either. I don't have a method easily available to scale the number of bins given the data. R doesn't suffer from these problems and provides methods for use with its hist method. I would like to provide similar functionality for matplotlib, to at least provide some kind of good starting point, as histograms are very useful for initial data discovery.

The notebook above provides an explanation of the problem as well as some proposed alternatives. Use different datasets (type and size) to see the performance of the suggestions. All of the methods proposed exist in R and the literature.

I've put together an implementation to add this new functionality, but am hesitant to make a pull request as I would like some feedback from a maintainer before doing so.

https://github.com/numpy/numpy/compare/master...nayyarv:master

I've provided them as functions for easy refactoring, as it can be argued that they should go in their own function/file/class, or alternatively can be turned into simple if/elif statements. I believe this belongs in numpy, as that is where the histogram functionality that most libraries build on lives, and it would be useful for them to not require scipy, for example. I will update the documentation accordingly before making a pull request, and add more tests to show its functionality. I can adapt my IPython notebook into a quick tutorial/help file if need be.

I've already attempted to add this into matplotlib before being redirected here: https://github.com/matplotlib/matplotlib/issues/4316