Re: [Numpy-discussion] Automatic number of bins for numpy histograms
On Wed, Apr 15, 2015 at 4:36 AM, Neil Girdhar mistersh...@gmail.com wrote:
> Yeah, I'm not arguing, I'm just curious about your reasoning. That explains why not C++. Why would you want to do this in C and not Python?

Well, the algorithm has to iterate over all the inputs, updating the estimated percentile positions at every iteration. Because the estimated percentiles may change at every iteration, I don't think there is an easy way of vectorizing the calculation with numpy, so I think it would be very slow if done in pure Python.

Looking at this in some more detail: how is this typically used? It gives you approximate values that should split your sample into similarly filled bins, but because the values are approximate, to compute a proper histogram you would still need to do the binning to get exact results, right?

Even with this drawback, P² does have an algorithmic advantage, so for huge inputs and many bins it should come out ahead. But for many medium-sized problems it may be faster to simply use np.partition, which gives you the whole thing in a single go, and it would be much simpler to implement.

Jaime

--
(\__/)
( O.o)
( ) This is Rabbit. Copy Rabbit into your signature and help him with his plans for world domination.

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
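[A minimal sketch of the partition/percentile approach Jaime describes -- not code from the thread, variable names are mine. Equal-fill bin edges come from a single vectorized percentile pass, and the exact binning then follows with np.histogram:]

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.standard_normal(10_000)

nbins = 10
# Bin edges at evenly spaced sample percentiles: each bin should
# end up holding roughly the same number of points.
edges = np.percentile(x, np.linspace(0, 100, nbins + 1))

# Exact binning against those approximate-resolution edges.
counts, _ = np.histogram(x, bins=edges)
print(counts)  # each bin holds roughly 1000 of the 10000 samples
```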
Re: [Numpy-discussion] Automatic number of bins for numpy histograms
You got it. I remember this from when I worked at Google and we would process (many, many) logs. With enough bins, the approximation is still really close. It's great if you want to make an automatic plot of data.

Calling numpy.partition a hundred times is probably slower than calling P² with n=100 bins. I don't think it does O(n) computations per point; I think it's more like O(log(n)).

Best,

Neil

On Wed, Apr 15, 2015 at 10:02 AM, Jaime Fernández del Río jaime.f...@gmail.com wrote:
> Well, the algorithm has to iterate over all the inputs, updating the estimated percentile positions at every iteration. [...]
>
> Even with this drawback, P² does have an algorithmic advantage, so for huge inputs and many bins it should come out ahead. But for many medium-sized problems it may be faster to simply use np.partition, which gives you the whole thing in a single go. And it would be much simpler to implement.
>
> Jaime
Re: [Numpy-discussion] Automatic number of bins for numpy histograms
> Then you can set about convincing matplotlib and friends to use it by default

Just to note, this proposal was originally made over in the matplotlib project. We sent it over here, where its benefits would have wider reach. Matplotlib's plan is not to change the defaults, but to offload as much as possible to numpy so that it can support these new features when they are available. We might need to do some input validation so that users running older versions of numpy get a sensible error message.

Cheers!
Ben Root

On Tue, Apr 14, 2015 at 7:12 PM, Nathaniel Smith n...@pobox.com wrote:
> On Mon, Apr 13, 2015 at 8:02 AM, Neil Girdhar mistersh...@gmail.com wrote:
>> Can I suggest that we instead add the P-square algorithm for the dynamic calculation of histograms? ( http://pierrechainais.ec-lille.fr/Centrale/Option_DAD/IMPACT_files/Dynamic%20quantiles%20calcultation%20-%20P2%20Algorythm.pdf ) This is already implemented in C++'s Boost library ( http://www.boost.org/doc/libs/1_44_0/boost/accumulators/statistics/extended_p_square.hpp ). I implemented it in Boost.Python as a module, which I'm happy to share. This is much better than fixed-width histograms in practice. Rather than adjusting the number of bins, it adjusts what you really want, which is the resolution of the bins throughout the domain.
>
> This definitely sounds like a useful thing to have in numpy or scipy (though if it's possible to do without using Boost/C++ that would be nice). But yeah, we should leave the existing histogram alone (in this regard) and add a new name for this, like adaptive_histogram or something. Then you can set about convincing matplotlib and friends to use it by default :-)
>
> -n
> --
> Nathaniel J. Smith -- http://vorpus.org
Re: [Numpy-discussion] Automatic number of bins for numpy histograms
On Wed, Apr 15, 2015 at 8:06 AM, Neil Girdhar mistersh...@gmail.com wrote:
> Calling numpy.partition a hundred times is probably slower than calling P² with n=100 bins. I don't think it does O(n) computations per point. I think it's more like O(log(n)).

Looking at it again, it probably is O(n) after all: it does a binary search, which is O(log n), but then it goes on to update all n bin counters and estimations, so O(n) I'm afraid. So there is no algorithmic advantage over partition/percentile: if there are m samples and n bins, P² does O(n) work m times, while partition does O(m) work n times, so both end up being O(m n).

It seems to me that the big selling point of P² is not having to hold the full dataset in memory. Online statistics (is that the name for this?), even if they are only estimations, are a cool thing, but I am not sure numpy is the place for them. That's not to say that we couldn't eventually have P² implemented for histogram, but I would start off with a partition-based one. Would SciPy have a place for online statistics? Perhaps there's room for yet another scikit?

Jaime
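[A toy illustration of the cost argument -- this is not the real P² algorithm, just a sketch of its update pattern, with names of my own choosing. Each incoming sample costs a binary search plus an update of every marker above it, which is what makes the per-sample cost O(n) rather than O(log n):]

```python
import bisect

def toy_update(marker_heights, marker_positions, x):
    """One streaming update in the spirit of P² (simplified toy):
    locate the new sample, then touch every marker above it."""
    # binary search for where the sample falls: O(log n)
    k = bisect.bisect_left(marker_heights, x)
    # every marker at or above the sample moves one rank up: O(n)
    for i in range(k, len(marker_positions)):
        marker_positions[i] += 1

heights = [0.0, 1.0, 2.0, 3.0, 4.0]
positions = [0, 1, 2, 3, 4]
toy_update(heights, positions, 1.5)
print(positions)  # markers above 1.5 shifted: [0, 1, 3, 4, 5]
```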
Re: [Numpy-discussion] Automatic number of bins for numpy histograms
On Wed, Apr 15, 2015 at 9:14 AM, Eric Moore e...@redtetrahedron.org wrote:
> This blog post, and the links within, also seem relevant. It appears to have Python code available to try things out as well. https://dataorigami.net/blogs/napkin-folding/19055451-percentile-and-quantile-estimation-of-big-data-the-t-digest

Very cool indeed... The original work is licensed under an Apache 2.0 license (https://github.com/tdunning/t-digest/blob/master/LICENSE). I am not fluent in legalese, so I'm not sure whether that means we can use it or not; it seems awfully more complicated than what we normally use.

Jaime
Re: [Numpy-discussion] Automatic number of bins for numpy histograms
This blog post, and the links within, also seem relevant. It appears to have Python code available to try things out as well.

https://dataorigami.net/blogs/napkin-folding/19055451-percentile-and-quantile-estimation-of-big-data-the-t-digest

-Eric

On Wed, Apr 15, 2015 at 11:24 AM, Benjamin Root ben.r...@ou.edu wrote:
> Just to note, this proposal was originally made over in the matplotlib project. We sent it over here, where its benefits would have wider reach. Matplotlib's plan is not to change the defaults, but to offload as much as possible to numpy so that it can support these new features when they are available. [...]
>
> Cheers!
> Ben Root
[Numpy-discussion] [ANN] python-blosc v1.2.5
= Announcing python-blosc 1.2.5 =

What is new?
============

This release contains support for Blosc v1.5.4, including changes to how the GIL is kept. This was required because Blosc was refactored in the v1.5.x line to remove global variables and use context objects instead. As a result, it became necessary to keep the GIL while calling Blosc from Python code that uses the multiprocessing module.

In addition, it is now possible to change the blocksize used by Blosc using ``set_blocksize``. When using this, however, bear in mind that the default blocksize has been finely tuned to be a good value, and that randomly messing with it may have unforeseen and unpredictable consequences for the performance of Blosc.

Additionally, we can now compile on POSIX architectures; thanks again to Andreas Schwab for that one.

For more info, you can have a look at the release notes at:

https://github.com/Blosc/python-blosc/wiki/Release-notes

More docs and examples are available on the documentation site:

http://python-blosc.blosc.org

What is it?
===========

Blosc (http://www.blosc.org) is a high-performance compressor optimized for binary data. It has been designed to transmit data to the processor cache faster than the traditional, non-compressed, direct memory fetch approach via a memcpy() OS call. Blosc is the first compressor that is meant not only to reduce the size of large datasets on disk or in memory, but also to accelerate object manipulations that are memory-bound (http://www.blosc.org/docs/StarvingCPUs.pdf). See http://www.blosc.org/synthetic-benchmarks.html for some benchmarks on how much speed it can achieve on some datasets.

Blosc works well for compressing numerical arrays that contain data with relatively low entropy, like sparse data, time series, grids with regularly spaced values, etc.

python-blosc (http://python-blosc.blosc.org/) is the Python wrapper for the Blosc compression library. There is also a handy tool built on Blosc called Bloscpack (https://github.com/Blosc/bloscpack).
It features a command-line interface that allows you to compress large binary datafiles on disk. It also comes with a Python API that has built-in support for serializing and deserializing Numpy arrays, both on disk and in memory, at speeds that are competitive with the regular Pickle/cPickle machinery.

Installing
==========

python-blosc is in the PyPI repository, so installing it is easy:

$ pip install -U blosc  # yes, you should omit the python- prefix

Download sources
================

The sources are managed through github services at:

http://github.com/Blosc/python-blosc

Documentation
=============

There is a Sphinx-based documentation site at:

http://python-blosc.blosc.org/

Mailing list
============

There is an official mailing list for Blosc at:

bl...@googlegroups.com
http://groups.google.es/group/blosc

Licenses
========

Both Blosc and its Python wrapper are distributed under the MIT license. See:

https://github.com/Blosc/python-blosc/blob/master/LICENSES

for more details.

**Enjoy data!**
Re: [Numpy-discussion] IDE's for numpy development?
On 08/04/2015 21:19, Yuxiang Wang wrote:
> I think spyder supports code highlighting in C and that's all... There's no way to compile in Spyder, is there?

Well, you could write a compilation script using SCons and run it from spyder! :) But no, spyder is very Python-oriented and there is no way to compile C in spyder. For the record, the next version should have better support for plugins, so this could be done as a third-party extension.

Joseph
Re: [Numpy-discussion] Consider improving numpy.outer's behavior with zero-dimensional vectors
On Wed, Apr 15, 2015 at 6:08 PM, josef.p...@gmail.com wrote:
> On Wed, Apr 15, 2015 at 5:31 PM, Neil Girdhar mistersh...@gmail.com wrote:
>> Does it work for you to set outer = np.multiply.outer? It's actually faster on my machine.
>
> I assume it does, because np.corrcoef uses it, and it's the same type of use case. However, I'm not using it very often (I prefer broadcasting), but I've seen it often enough when reviewing code. This is mainly to point out that it could be a popular function (that maybe shouldn't be deprecated).
>
> https://github.com/search?utf8=%E2%9C%93&q=np.outer  416914

After thinking another minute: I think it should not be deprecated; it's like toeplitz. We can also use it to normalize 2d arrays where columns and rows are different, i.e. not symmetric as in the corrcoef case.

Josef

On Wed, Apr 15, 2015 at 5:29 PM, josef.p...@gmail.com wrote:
> [...]
> I'm just looking at this thread. I see outer used quite often:
>
>     corrcoef = cov / np.outer(std, std)
>
> (even I use it sometimes instead of cov / std[:,None] / std)
>
> Josef
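[The idiom Josef mentions, spelled out as a sketch of my own: the outer product of the standard deviations normalizes the covariance matrix, and agrees both with the broadcasting form and with np.corrcoef:]

```python
import numpy as np

rng = np.random.RandomState(0)
data = rng.standard_normal((5, 100))  # 5 variables, 100 observations

cov = np.cov(data)
std = np.sqrt(np.diag(cov))

# The two normalizations Josef compares:
corr_outer = cov / np.outer(std, std)
corr_bcast = cov / std[:, None] / std[None, :]

assert np.allclose(corr_outer, corr_bcast)
assert np.allclose(corr_outer, np.corrcoef(data))
```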
Re: [Numpy-discussion] Consider improving numpy.outer's behavior with zero-dimensional vectors
On Wed, Apr 15, 2015 at 6:40 PM, Nathaniel Smith n...@pobox.com wrote:
> For future reference, that's not the number -- you have to click through to Code and then look at a single-language result to get anything remotely meaningful. [...] (So 29,397 is what you want in this case.) Also that count then tends to have tons of duplicates (e.g. b/c there are hundreds of copies of numpy itself on github), so you need a big grain of salt when looking at the absolute number, but it can be useful, esp. for relative comparisons.

My mistake, rushing too much. github shows only 25 code references in numpy itself.

In quotes, Python only (namespace-conscious packages on github; I think github counts modules, not instances):

np.cumsum   11,022
np.cumprod   1,290
np.outer     6,838

statsmodels:

np.cumsum   21
np.cumprod   2
np.outer    15

Josef
Re: [Numpy-discussion] Consider improving numpy.outer's behavior with zero-dimensional vectors
I don't understand. Are you at pycon by any chance?

On Wed, Apr 15, 2015 at 6:12 PM, josef.p...@gmail.com wrote:
> After thinking another minute: I think it should not be deprecated; it's like toeplitz. We can also use it to normalize 2d arrays where columns and rows are different, not symmetric as in the corrcoef case.
>
> Josef
> [...]
Re: [Numpy-discussion] Consider improving numpy.outer's behavior with zero-dimensional vectors
On Wed, Apr 15, 2015 at 6:08 PM, josef.p...@gmail.com wrote:
> I assume it does, because np.corrcoef uses it, and it's the same type of use case. [...] This is mainly to point out that it could be a popular function (that maybe shouldn't be deprecated).
>
> https://github.com/search?utf8=%E2%9C%93&q=np.outer  416914

For future reference, that's not the number -- you have to click through to Code and then look at a single-language result to get anything remotely meaningful. In this case b/c they're different by an order of magnitude, and in general because sometimes the top-line number is completely made up (like it has no relation to the per-language numbers on the left, and then changes around randomly if you simply reload the page). (So 29,397 is what you want in this case.) Also that count tends to have tons of duplicates (e.g. b/c there are hundreds of copies of numpy itself on github), so you need a big grain of salt when looking at the absolute number, but it can be useful, esp. for relative comparisons.

-n
Re: [Numpy-discussion] Automatic number of bins for numpy histograms
Cool, thanks for looking at this. P² might still be better even if the whole dataset is in memory, because of cache misses. Partition, which I guess is based on quickselect, is going to run over all of the data roughly as many times as there are bins, whereas P² only runs over it once. From a cache-miss standpoint, I think P² is better? Anyway, it might be worth coding it up to verify any performance advantages. Not sure if it should be in numpy or not, since it really should accept an iterable rather than a numpy vector, right?

Best,

Neil

On Wed, Apr 15, 2015 at 12:40 PM, Jaime Fernández del Río jaime.f...@gmail.com wrote:
> Looking at it again, it probably is O(n) after all: it does a binary search, which is O(log n), but then it goes on to update all n bin counters and estimations, so O(n) I'm afraid. So there is no algorithmic advantage over partition/percentile: if there are m samples and n bins, P² does O(n) work m times, while partition does O(m) work n times, so both end up being O(m n).
>
> It seems to me that the big selling point of P² is not having to hold the full dataset in memory. [...] Would SciPy have a place for online statistics? Perhaps there's room for yet another scikit?
>
> Jaime
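[One way to probe the "single go" claim -- my sketch, not code from the thread: np.partition accepts a sequence of kth indices, so all bin boundaries can be placed in a single pass over the data instead of one quickselect per boundary:]

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.standard_normal(100_000)

nbins = 100
kth = (np.arange(1, nbins) * len(x)) // nbins  # ranks of the bin boundaries

# One partition call places every requested order statistic at once...
edges_single = np.partition(x, kth)[kth]

# ...and agrees with partitioning once per boundary (100 separate passes).
edges_repeated = np.array([np.partition(x, k)[k] for k in kth])
assert np.array_equal(edges_single, edges_repeated)
```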
Re: [Numpy-discussion] Consider improving numpy.outer's behavior with zero-dimensional vectors
On Wed, Apr 15, 2015 at 5:31 PM, Neil Girdhar mistersh...@gmail.com wrote:
> Does it work for you to set outer = np.multiply.outer? It's actually faster on my machine.

I assume it does, because np.corrcoef uses it, and it's the same type of use case. However, I'm not using it very often (I prefer broadcasting), but I've seen it often enough when reviewing code. This is mainly to point out that it could be a popular function (that maybe shouldn't be deprecated).

https://github.com/search?utf8=%E2%9C%93&q=np.outer  416914

Josef

On Wed, Apr 15, 2015 at 5:29 PM, josef.p...@gmail.com wrote:
> [...]
> I'm just looking at this thread. I see outer used quite often:
>
>     corrcoef = cov / np.outer(std, std)
>
> (even I use it sometimes instead of cov / std[:,None] / std)
>
> Josef
Re: [Numpy-discussion] Consider improving numpy.outer's behavior with zero-dimensional vectors
Does it work for you to set outer = np.multiply.outer ? It's actually faster on my machine. On Wed, Apr 15, 2015 at 5:29 PM, josef.p...@gmail.com wrote: On Wed, Apr 15, 2015 at 7:35 AM, Neil Girdhar mistersh...@gmail.com wrote: Yes, I totally agree. If I get started on the PR to deprecate np.outer, maybe I can do it as part of the same PR? On Wed, Apr 15, 2015 at 4:32 AM, Sebastian Berg sebast...@sipsolutions.net wrote: Just a general thing, if someone has a few minutes, I think it would make sense to add the ufunc.reduce thing to all of these functions at least in the See Also or Notes section in the documentation. These special attributes are not that well known, and I think that might be a nice way to make it easier to find. - Sebastian On Di, 2015-04-14 at 22:18 -0400, Nathaniel Smith wrote: I am, yes. On Apr 14, 2015 9:17 PM, Neil Girdhar mistersh...@gmail.com wrote: Ok, I didn't know that. Are you at pycon by any chance? On Tue, Apr 14, 2015 at 7:16 PM, Nathaniel Smith n...@pobox.com wrote: On Tue, Apr 14, 2015 at 3:48 PM, Neil Girdhar mistersh...@gmail.com wrote: Yes, I totally agree with you regarding np.sum and np.product, which is why I didn't suggest np.add.reduce, np.multiply.reduce. I wasn't sure whether cumsum and cumprod might be on the line in your judgment. Ah, I see. I think we should treat them the same for now -- all the comments I made apply to a lesser or greater extent (in particular, cumsum and cumprod both do the thing where they dispatch to .cumsum() .cumprod() method). -n -- Nathaniel J. 
Re: [Numpy-discussion] Consider improving numpy.outer's behavior with zero-dimensional vectors
On Wed, Apr 15, 2015 at 7:35 AM, Neil Girdhar mistersh...@gmail.com wrote: Yes, I totally agree. If I get started on the PR to deprecate np.outer, maybe I can do it as part of the same PR?

I'm just looking at this thread. I see outer used quite often:

corrcoef = cov / np.outer(std, std)

(even I use it sometimes instead of cov / std[:,None] / std)

Josef
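The two spellings Josef mentions are equivalent for a 1-d vector of standard deviations; a quick check with made-up random data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 100))   # 4 variables, 100 observations
cov = np.cov(x)
std = np.sqrt(np.diag(cov))

# Josef's two spellings of the correlation matrix:
corr_outer = cov / np.outer(std, std)
corr_bcast = cov / std[:, None] / std

assert np.allclose(corr_outer, corr_bcast)
assert np.allclose(corr_outer, np.corrcoef(x))
```

Both divide row i and column j of the covariance by std[i] * std[j]; np.outer just materializes that rank-1 matrix explicitly, while the broadcasting version avoids it.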
Re: [Numpy-discussion] Consider improving numpy.outer's behavior with zero-dimensional vectors
Just a general thing: if someone has a few minutes, I think it would make sense to add the ufunc.reduce thing to all of these functions, at least in the See Also or Notes sections of the documentation. These special attributes are not that well known, and I think that might be a nice way to make them easier to find.

- Sebastian
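For readers who haven't met the "special attributes" Sebastian refers to: every binary ufunc carries reduce, accumulate, and outer methods, and the familiar convenience functions give the same results. A small illustration:

```python
import numpy as np

x = np.array([1, 2, 3, 4])

# The well-known reductions match the ufunc machinery:
assert np.add.reduce(x) == np.sum(x)            # both give 10
assert np.multiply.reduce(x) == np.prod(x)      # both give 24
assert np.array_equal(np.add.accumulate(x), np.cumsum(x))
assert np.array_equal(np.multiply.accumulate(x), np.cumprod(x))

# And outer is the third ufunc method discussed in this thread:
assert np.array_equal(np.multiply.outer(x, x), np.outer(x, x))
```

Cross-referencing these in the docs for sum, prod, cumsum, cumprod, and outer would make the general pattern much more discoverable.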
Re: [Numpy-discussion] Consider improving numpy.outer's behavior with zero-dimensional vectors
Yes, I totally agree. If I get started on the PR to deprecate np.outer, maybe I can do it as part of the same PR?

On Wed, Apr 15, 2015 at 4:32 AM, Sebastian Berg sebast...@sipsolutions.net wrote: Just a general thing, if someone has a few minutes, I think it would make sense to add the ufunc.reduce thing to all of these functions at least in the See Also or Notes section in the documentation. These special attributes are not that well known, and I think that might be a nice way to make them easier to find. - Sebastian

On Di, 2015-04-14 at 22:18 -0400, Nathaniel Smith wrote: I am, yes.

On Apr 14, 2015 9:17 PM, Neil Girdhar mistersh...@gmail.com wrote: Ok, I didn't know that. Are you at PyCon by any chance?

On Tue, Apr 14, 2015 at 7:16 PM, Nathaniel Smith n...@pobox.com wrote: On Tue, Apr 14, 2015 at 3:48 PM, Neil Girdhar mistersh...@gmail.com wrote: Yes, I totally agree with you regarding np.sum and np.product, which is why I didn't suggest np.add.reduce and np.multiply.reduce. I wasn't sure whether cumsum and cumprod might be on the line in your judgment.

Ah, I see. I think we should treat them the same for now -- all the comments I made apply to a lesser or greater extent (in particular, cumsum and cumprod both dispatch to the .cumsum() and .cumprod() methods). -n -- Nathaniel J. Smith -- http://vorpus.org
Re: [Numpy-discussion] Automatic number of bins for numpy histograms
Yeah, I'm not arguing, I'm just curious about your reasoning. That explains why not C++. Why would you want to do this in C and not Python?

On Wed, Apr 15, 2015 at 1:48 AM, Jaime Fernández del Río jaime.f...@gmail.com wrote: On Tue, Apr 14, 2015 at 6:16 PM, Neil Girdhar mistersh...@gmail.com wrote: If you're going to C, is there a reason not to go to C++ and include the already-written Boost code? Otherwise, why not use Python?

I think we have an explicit rule against C++, although I may be wrong. I'm not sure how much of Boost we would have to make part of numpy to use that -- the whole accumulators library, I'm guessing? That seems like an awful lot given what we are after.

Jaime -- (\__/) ( O.o) ( ) This is Bunny. Copy Bunny into your signature and help him with his plans for world domination.
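For context, the simpler alternative Jaime raised earlier in the thread -- getting exact equal-count bin edges in one shot with np.partition instead of streaming approximate percentiles -- might look like this sketch (equal_count_edges is a hypothetical helper name, not an existing numpy function):

```python
import numpy as np

def equal_count_edges(data, nbins):
    # Hypothetical helper: exact equal-frequency bin edges from a single
    # np.partition call, rather than the approximate streaming P^2 estimate.
    n = len(data)
    # Positions of the order statistics at each interior bin boundary.
    idx = (np.arange(1, nbins) * n) // nbins
    part = np.partition(data, idx)   # partially sorts at those positions
    return np.concatenate(([data.min()], part[idx], [data.max()]))

rng = np.random.default_rng(0)
data = rng.standard_normal(10_000)
edges = equal_count_edges(data, 10)        # 11 edges -> 10 bins
counts, _ = np.histogram(data, bins=edges)  # each bin holds ~n/nbins samples
```

This needs all the data in memory and is O(n) per call, which is why P^2 keeps its algorithmic edge for streaming input -- but for medium-sized in-memory arrays a single partition pass is both exact and simple.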