Re: [Numpy-discussion] Rewrite np.histogram in c?
Hope this isn't too off-topic, but it would be very nice if np.histogram and np.histogram2d supported masked arrays. Is this out of scope outside the numpy.ma package?

On Mon, Mar 16, 2015 at 2:35 PM, Robert McGibbon rmcgi...@gmail.com wrote:

> Hi, It sounds like putting together a PR makes sense then. I'll try hacking on this a bit.
>
> -Robert

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Rewrite np.histogram in c?
On Mon, Mar 23, 2015 at 2:59 PM, Daniel da Silva var.mail.dan...@gmail.com wrote:

> Hope this isn't too off-topic: but it would be very nice if np.histogram and np.histogram2d supported masked arrays. Is this out of scope for outside the numpy.ma package?

Right now it looks like there's no histogram function at all for masked arrays - it would be good to improve that situation. If it's as easy as adding to np.histogram something like:

    if isinstance(a, np.ma.MaskedArray):
        a = a.data[~a.mask]

then it makes sense to add that, I think.

Ralf
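Editorial sketch of the suggestion above, not code from NumPy itself: the masked entries are dropped before calling np.histogram. The helper name `histogram_masked` is hypothetical, and `compressed()` is used here in place of `a.data[~a.mask]` as the documented way to get the unmasked values as a 1-D array. Weights are not handled in this version.

```python
import numpy as np

def histogram_masked(a, bins=10, range=None):
    """Histogram of the unmasked entries of `a` (hypothetical helper)."""
    if isinstance(a, np.ma.MaskedArray):
        # compressed() returns the unmasked values as a plain 1-D ndarray
        a = a.compressed()
    return np.histogram(a, bins=bins, range=range)

# The masked outlier (100.0) is ignored
data = np.ma.masked_array([1.0, 2.0, 3.0, 100.0],
                          mask=[False, False, False, True])
counts, edges = histogram_masked(data, bins=3, range=(0.0, 3.0))
```

Plain ndarrays pass through unchanged, so such a check would not alter the behaviour for existing callers.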
Re: [Numpy-discussion] Rewrite np.histogram in c?
On 2015/03/23 7:36 AM, Ralf Gommers wrote:

> On Mon, Mar 23, 2015 at 2:59 PM, Daniel da Silva var.mail.dan...@gmail.com wrote:
>
>> Hope this isn't too off-topic: but it would be very nice if np.histogram and np.histogram2d supported masked arrays. Is this out of scope for outside the numpy.ma package?
>
> Right now it looks like there's no histogram function at all for masked arrays - would be good to improve that situation. If it's as easy as adding to np.histogram something like:
>
>     if isinstance(a, np.ma.MaskedArray):
>         a = a.data[~a.mask]

It looks like it requires a little more than that, but not much. For full support, a new mask would need to be made from the logical_or of the mask of a and the mask of weights, and then used to compress both a and weights.

Eric

> then it makes sense to add that I think.
>
> Ralf
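Editorial sketch of the combined-mask idea described above, under the assumption that `a` and `weights` have the same shape. The helper name `histogram_masked_weighted` is hypothetical; `np.ma.getmaskarray` and `np.ma.getdata` are used so that plain ndarrays are accepted as well.

```python
import numpy as np

def histogram_masked_weighted(a, weights, bins=10, range=None):
    """Weighted histogram that skips entries masked in `a` or in `weights`
    (hypothetical helper; assumes both inputs have the same shape)."""
    # getmaskarray always returns a full boolean array, even for unmasked input
    mask = np.logical_or(np.ma.getmaskarray(a), np.ma.getmaskarray(weights))
    keep = ~mask
    # Apply the same selection to both arrays so they stay aligned
    a_valid = np.ma.getdata(a)[keep]
    w_valid = np.ma.getdata(weights)[keep]
    return np.histogram(a_valid, bins=bins, range=range, weights=w_valid)

a = np.ma.masked_array([0.5, 1.5, 2.5, 2.5],
                       mask=[False, True, False, False])
w = np.ma.masked_array([1.0, 1.0, 2.0, 4.0],
                       mask=[False, False, False, True])
hist, edges = histogram_masked_weighted(a, w, bins=3, range=(0.0, 3.0))
```

Only the first and third samples survive both masks here, so the second sample and the weight 4.0 do not contribute.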
Re: [Numpy-discussion] Rewrite np.histogram in c?
On Mar 23, 2015 6:59 AM, Daniel da Silva var.mail.dan...@gmail.com wrote:

> Hope this isn't too off-topic: but it would be very nice if np.histogram and np.histogram2d supported masked arrays. Is this out of scope for outside the numpy.ma package?

Usually the way this kind of thing is handled is by adding an np.ma.histogram function.

-n
Re: [Numpy-discussion] Rewrite np.histogram in c?
On Sun, Mar 15, 2015 at 11:06 PM, Robert McGibbon rmcgi...@gmail.com wrote:

> It might make sense to dispatch to difference c implements if the bins are equally spaced (as created by using an integer for the np.histogram bins argument), vs. non-equally-spaced bins.

Dispatching to a different method seems like a no-brainer indeed. The question is whether we really need to do this in C. Maybe for some very specific case or cases it makes sense to have a super fast C path, e.g. no weights and bins is an integer. Even then, rather than rewriting the whole thing in C, it may be a better idea to leave the parsing of the inputs in Python, and have a C helper function wrapped and privately exposed, similarly to how `np.core.multiarray.interp` is used by `np.interp`. But I would still first give it a try in Python...

Jaime

--
(\__/)
( O.o)
( )
This is Bunny. Copy Bunny into your signature and help him with his plans for world domination.
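Editorial sketch of the dispatch idea above, written in Python only. All three function names are hypothetical and nothing here is taken from NumPy's source: input parsing stays in one Python function, which hands off to a fast path when `bins` is an integer and to the existing general implementation otherwise. The sketch is unweighted, assumes a non-empty input, and casts to float64; results on values that fall exactly on an interior bin edge may differ from np.histogram because of floating-point rounding.

```python
import numpy as np

def _hist_uniform(a, nbins, lo, hi):
    """Fast path (hypothetical): equal-width bins, index computed arithmetically."""
    a = a[(a >= lo) & (a <= hi)]
    idx = ((a - lo) * (nbins / (hi - lo))).astype(np.intp)
    # A value equal to `hi` belongs to the last bin, as in np.histogram
    np.minimum(idx, nbins - 1, out=idx)
    return np.bincount(idx, minlength=nbins)

def _hist_general(a, edges):
    """General path: arbitrary edges, deferred to the existing implementation."""
    return np.histogram(a, bins=edges)[0]

def histogram_dispatch(a, bins=10):
    """Parse inputs in Python, then pick a code path (sketch, unweighted only)."""
    a = np.asarray(a, dtype=np.float64).ravel()
    if isinstance(bins, (int, np.integer)):
        lo, hi = a.min(), a.max()      # assumes `a` is non-empty
        if lo == hi:                   # avoid a zero-width range
            lo, hi = lo - 0.5, hi + 0.5
        return _hist_uniform(a, int(bins), lo, hi)
    return _hist_general(a, np.asarray(bins))

uniform_counts = histogram_dispatch(np.arange(10), bins=5)
general_counts = histogram_dispatch(np.arange(10), bins=[0, 1, 5, 10])
```

In this layout only `_hist_uniform` would be a candidate for a compiled helper; the argument handling in `histogram_dispatch` would stay in Python.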
Re: [Numpy-discussion] Rewrite np.histogram in c?
On Mon, 16 Mar 2015 06:56:58 -0700 Jaime Fernández del Río jaime.f...@gmail.com wrote:

> Dispatching to a different method seems like a no brainer indeed. The question is whether we really need to do this in C.

I need to do both unweighted and weighted histograms, and we got a factor of 5 using (simple) Cython: it is in the proceedings of EuroSciPy, last year. http://arxiv.org/pdf/1412.6367.pdf

We got much faster, but that's another story. In fact, many people coming from IDL or Matlab are surprised by the poor performance of numpy's histogram.

Cheers
--
Jérôme Kieffer
tel +33 476 882 445
Re: [Numpy-discussion] Rewrite np.histogram in c?
On Mon, Mar 16, 2015 at 9:28 AM, Jerome Kieffer jerome.kief...@esrf.fr wrote:

> On Mon, 16 Mar 2015 06:56:58 -0700 Jaime Fernández del Río jaime.f...@gmail.com wrote:
>
>> Dispatching to a different method seems like a no brainer indeed. The question is whether we really need to do this in C.
>
> I need to do both unweighted weighted histograms and we got a factor 5 using (simple) cython: it is in the proceedings of Euroscipy, last year. http://arxiv.org/pdf/1412.6367.pdf

If I read your paper and code properly, you got 5x faster, mostly because you combined the weighted and unweighted histograms into a single search of the array, and because you used an algorithm that can only be applied to equal-sized bins, similarly to the 10x speed-up Robert was reporting.

I think that having a special path for equal-sized bins is a great idea: let's do it, PRs are always welcome! Similarly, getting the counts together with the weights seems like a very good idea.

I also think that writing it in Python is going to take us 80% of the way there: most of the improvements both of you have reported are not likely to be coming from the language chosen, but from the algorithm used. And if C proves to be sufficiently faster to warrant using it, it should be confined to the number crunching: I don't think there is any point in rewriting argument parsing in C.

Also, keep in mind `np.histogram` can now handle arrays of just about **any** dtype. Handling that complexity in C is not a ride in the park. Other functions like `np.bincount` and `np.digitize` cheat by only handling `double` typed arrays, a luxury that histogram probably can't afford at this point in time.

Jaime
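Editorial sketch of the two ideas discussed above, in pure NumPy: with equal-sized bins the bin index can be computed arithmetically, and the same index array can be reused to get both the counts and the weighted sums. This is not the code from the paper or from the mdtraj PR; the function name and its parameters are hypothetical, and the inputs are cast to float64, which is exactly the kind of shortcut mentioned above for `np.bincount`.

```python
import numpy as np

def uniform_hist_counts_and_sums(a, weights, nbins, lo, hi):
    """Counts and weighted sums over `nbins` equal-width bins on [lo, hi]
    (hypothetical helper; assumes hi > lo and matching input lengths)."""
    a = np.asarray(a, dtype=np.float64).ravel()
    weights = np.asarray(weights, dtype=np.float64).ravel()
    # Drop samples outside the requested range
    inside = (a >= lo) & (a <= hi)
    a, weights = a[inside], weights[inside]
    # One arithmetic pass gives the bin index of every sample
    idx = ((a - lo) * (nbins / (hi - lo))).astype(np.intp)
    np.minimum(idx, nbins - 1, out=idx)   # a == hi goes in the last bin
    # The same index array serves both histograms
    counts = np.bincount(idx, minlength=nbins)
    sums = np.bincount(idx, weights=weights, minlength=nbins)
    return counts, sums

counts, sums = uniform_hist_counts_and_sums(
    [0.1, 0.2, 0.6, 1.0], [1.0, 2.0, 3.0, 4.0], nbins=2, lo=0.0, hi=1.0)
```

The cost is linear in the number of samples and independent of the number of bins, whereas a search over arbitrary edges needs a sort or a binary search per sample.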
Re: [Numpy-discussion] Rewrite np.histogram in c?
My apologies for the typo: 'implements' -> 'implementations'

-Robert

On Sun, Mar 15, 2015 at 11:06 PM, Robert McGibbon rmcgi...@gmail.com wrote:

> It might make sense to dispatch to difference c implements if the bins are equally spaced (as created by using an integer for the np.histogram bins argument), vs. non-equally-spaced bins. In that case, getting the bigger speedup may be easier, at least for one common use case.
>
> -Robert
Re: [Numpy-discussion] Rewrite np.histogram in c?
On Sun, Mar 15, 2015 at 9:32 PM, Robert McGibbon rmcgi...@gmail.com wrote:

> Hi, Numpy.histogram is implemented in python, and is a little sluggish. This has been discussed previously on the mailing list, [1, 2]. It came up in a project that I maintain, where a new feature is bottlenecked by numpy.histogram, and one developer suggested a faster implementation in cython [3]. Would it make sense to reimplement this function in c? or cython? Is moving functions like this from python to c to improve performance within the scope of the development roadmap for numpy? I started implementing this a little bit in c, [4] but I figured I should check in here first.

Where do you think the performance gains will come from? The PR in your project that claims a 10x speed-up uses a method that is only fit for equally spaced bins. I want to think that implementing that exact same algorithm in Python with NumPy would be comparably fast, say within 2x.

For the general case, NumPy is already doing most of the heavy lifting (the sorting and the searching) in C: simply replicating the same algorithmic approach entirely in C is unlikely to provide any major speed-up. And if the change is to the algorithm, then we should first try it out in Python.

That said, if you can speed things up 10x, I don't think there is going to be much opposition to moving it to C!

Jaime
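Editorial sketch of the sort-and-search approach for the general case mentioned above, assuming monotonically increasing edges. It is a simplified illustration, not NumPy's actual implementation, and the function name is hypothetical. Both `np.sort` and `np.searchsorted` run in compiled code, which is why porting this same algorithm to C would be unlikely to gain much.

```python
import numpy as np

def hist_sort_search(a, edges):
    """Unweighted histogram for arbitrary increasing `edges`
    via sorting and binary search (hypothetical helper)."""
    a = np.sort(np.asarray(a).ravel())
    edges = np.asarray(edges)
    # Cumulative number of samples below each edge; the last edge is
    # searched from the right so that the final bin is closed
    cum = np.concatenate([
        np.searchsorted(a, edges[:-1], side="left"),
        np.searchsorted(a, edges[-1:], side="right"),
    ])
    # Differences of the cumulative counts are the per-bin counts
    return np.diff(cum)

counts = hist_sort_search([3, 1, 2, 2, 5, 0], [0, 2, 4, 5])
```

The sort makes this O(n log n) in the number of samples, compared with the linear cost available when the bins are equally spaced.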
Re: [Numpy-discussion] Rewrite np.histogram in c?
It might make sense to dispatch to difference c implements if the bins are equally spaced (as created by using an integer for the np.histogram bins argument), vs. non-equally-spaced bins. In that case, getting the bigger speedup may be easier, at least for one common use case.

-Robert

On Sun, Mar 15, 2015 at 11:00 PM, Jaime Fernández del Río jaime.f...@gmail.com wrote:

> Where do you think the performance gains will come from? The PR in your project that claims a 10x speed-up uses a method that is only fit for equally spaced bins. I want to think that implementing that exact same algorithm in Python with NumPy would be comparably fast, say within 2x.
>
> [...]
Re: [Numpy-discussion] Rewrite np.histogram in c?
Hi,

It sounds like putting together a PR makes sense then. I'll try hacking on this a bit.

-Robert

On Mar 16, 2015 11:20 AM, Jaime Fernández del Río jaime.f...@gmail.com wrote:

> [...]
>
> I think that having a special path for equal sized bins is a great idea: let's do it, PRs are always welcome! Similarly, getting the counts together with the weights seems like a very good idea.
>
> [...]
[Numpy-discussion] Rewrite np.histogram in c?
Hi,

numpy.histogram is implemented in Python, and is a little sluggish. This has been discussed previously on the mailing list [1, 2]. It came up in a project that I maintain, where a new feature is bottlenecked by numpy.histogram, and one developer suggested a faster implementation in Cython [3].

Would it make sense to reimplement this function in C or Cython? Is moving functions like this from Python to C to improve performance within the scope of the development roadmap for numpy?

I started implementing this a little bit in C [4], but I figured I should check in here first.

-Robert

[1] http://scipy-user.10969.n7.nabble.com/numpy-histogram-is-slow-td17208.html
[2] http://numpy-discussion.10968.n7.nabble.com/Fast-histogram-td9359.html
[3] https://github.com/mdtraj/mdtraj/pull/734
[4] https://github.com/rmcgibbo/numpy/tree/histogram