Re: [Numpy-discussion] Rewrite np.histogram in c?

2015-03-23 Thread Daniel da Silva
Hope this isn't too off-topic, but it would be very nice if np.histogram
and np.histogram2d supported masked arrays. Or is that out of scope for
anything outside the numpy.ma package?


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Rewrite np.histogram in c?

2015-03-23 Thread Ralf Gommers
On Mon, Mar 23, 2015 at 2:59 PM, Daniel da Silva var.mail.dan...@gmail.com
wrote:

 Hope this isn't too off-topic: but it would be very nice if np.histogram
 and np.histogram2d supported masked arrays. Is this out of scope for
 outside the numpy.ma package?


Right now it looks like there's no histogram function at all for masked
arrays - would be good to improve that situation.

If it's as easy as adding to np.histogram something like:

    if isinstance(a, np.ma.MaskedArray):
        a = a.data[~a.mask]

then it makes sense to add that I think.

Ralf





Re: [Numpy-discussion] Rewrite np.histogram in c?

2015-03-23 Thread Eric Firing
On 2015/03/23 7:36 AM, Ralf Gommers wrote:


 On Mon, Mar 23, 2015 at 2:59 PM, Daniel da Silva
 var.mail.dan...@gmail.com wrote:

 Hope this isn't too off-topic: but it would be very nice if
 np.histogram and np.histogram2d supported masked arrays. Is this out
 of scope for outside the numpy.ma package?


 Right now it looks like there's no histogram function at all for masked
 arrays - would be good to improve that situation.

 If it's as easy as adding to np.histogram something like:

     if isinstance(a, np.ma.MaskedArray):
         a = a.data[~a.mask]

It looks like it requires a little more than that, but not much. For
full support, a new mask would need to be made from the logical_or of the
mask of a and the mask of weights, and then used to compress both a and
weights.
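
A minimal sketch of the approach described above, assuming a hypothetical
helper named masked_histogram (it is not an existing NumPy function): the
masks of a and weights are combined with logical_or, and the combined mask
is used to drop entries from both before handing them to np.histogram.

```python
import numpy as np

def masked_histogram(a, bins=10, range=None, weights=None):
    """Histogram that skips masked entries of `a` and `weights` (hypothetical helper)."""
    a = np.ma.asarray(a)
    # getmaskarray always returns a full boolean array, even if nothing is masked.
    mask = np.ma.getmaskarray(a)
    if weights is not None:
        weights = np.ma.asarray(weights)
        if weights.shape != a.shape:
            raise ValueError("weights must have the same shape as a")
        # An entry is dropped if it is masked in either input.
        mask = np.logical_or(mask, np.ma.getmaskarray(weights))
        weights = weights.data[~mask]
    return np.histogram(a.data[~mask], bins=bins, range=range, weights=weights)
```

np.ma.getmaskarray is used here rather than a.mask because a.mask can be
the scalar np.ma.nomask when no element is masked.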

Eric


 then it makes sense to add that I think.

 Ralf





Re: [Numpy-discussion] Rewrite np.histogram in c?

2015-03-23 Thread Nathaniel Smith
On Mar 23, 2015 6:59 AM, Daniel da Silva var.mail.dan...@gmail.com
wrote:

 Hope this isn't too off-topic: but it would be very nice if np.histogram
and np.histogram2d supported masked arrays. Is this out of scope for
outside the numpy.ma package?

Usually the way this kind of thing is handled is by adding an
np.ma.histogram function.

-n


Re: [Numpy-discussion] Rewrite np.histogram in c?

2015-03-16 Thread Jaime Fernández del Río
On Sun, Mar 15, 2015 at 11:06 PM, Robert McGibbon rmcgi...@gmail.com
wrote:

 It might make sense to dispatch to difference c implements if the bins are
 equally spaced (as created by using an integer for the np.histogram bins
 argument), vs. non-equally-spaced bins.


Dispatching to a different method seems like a no brainer indeed. The
question is whether we really need to do this in C. Maybe for some very
specific case or cases it makes sense to have a super fast C path, e.g. no
weights and `bins` given as an integer. Even then, rather than rewriting the
whole thing in C, it may be a better idea to leave the parsing of the inputs
in Python, and have a C helper function wrapped and privately exposed,
similarly to how `np.core.multiarray.interp` is used by `np.interp`.

But I would still first give it a try in Python...
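
As a rough illustration of trying this in Python first, here is a sketch of
dispatching on the type of `bins`. The names histogram_dispatch and
equal_width_counts are hypothetical, and the sketch assumes a non-empty
input and `bins` given either as an int or as a sequence of edges. Samples
that sit within rounding error of a bin edge may land in a different bin
than with np.histogram.

```python
import numpy as np

def equal_width_counts(a, n_bins, lo, hi):
    """Count samples in n_bins equal-width bins spanning [lo, hi] (hypothetical helper)."""
    a = a[(a >= lo) & (a <= hi)]           # drop out-of-range samples
    # Map each sample straight to its bin index: no sorting or searching needed.
    idx = ((a - lo) * (n_bins / (hi - lo))).astype(np.intp)
    idx = np.minimum(idx, n_bins - 1)      # samples equal to hi go in the last bin
    return np.bincount(idx, minlength=n_bins)

def histogram_dispatch(a, bins=10, range=None):
    """Pick an algorithm from the type of `bins` (illustrative only)."""
    a = np.asarray(a, dtype=float).ravel()
    if np.ndim(bins) == 0:
        # Integer bins: the bins are equally spaced, so the fast path applies.
        lo, hi = (a.min(), a.max()) if range is None else range
        if lo == hi:                       # degenerate range: widen it
            lo, hi = lo - 0.5, hi + 0.5
        edges = np.linspace(lo, hi, bins + 1)
        return equal_width_counts(a, bins, lo, hi), edges
    # Explicit edges may be unevenly spaced: defer to the general implementation.
    return np.histogram(a, bins=bins)
```

Only the number crunching in equal_width_counts would be a candidate for a
C helper; the argument handling stays in Python.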

Jaime

-- 
(\__/)
( O.o)
( > <) This is Bunny. Copy Bunny into your signature and help him with his
plans for world domination.


Re: [Numpy-discussion] Rewrite np.histogram in c?

2015-03-16 Thread Jerome Kieffer
On Mon, 16 Mar 2015 06:56:58 -0700
Jaime Fernández del Río jaime.f...@gmail.com wrote:

 Dispatching to a different method seems like a no brainer indeed. The
 question is whether we really need to do this in C.

I need to do both unweighted & weighted histograms, and we got a factor of 5
using (simple) Cython; it is in the proceedings of last year's EuroSciPy:
http://arxiv.org/pdf/1412.6367.pdf

We have since gotten much faster, but that's another story.

In fact, many people coming from IDL or Matlab are surprised by the
poor performance of numpy's histogram.

Cheers

-- 
Jérôme Kieffer
tel +33 476 882 445


Re: [Numpy-discussion] Rewrite np.histogram in c?

2015-03-16 Thread Jaime Fernández del Río
On Mon, Mar 16, 2015 at 9:28 AM, Jerome Kieffer jerome.kief...@esrf.fr
wrote:

 On Mon, 16 Mar 2015 06:56:58 -0700
 Jaime Fernández del Río jaime.f...@gmail.com wrote:

  Dispatching to a different method seems like a no brainer indeed. The
  question is whether we really need to do this in C.

 I need to do both unweighted & weighted histograms and we got a factor 5
 using (simple) cython:
 it is in the proceedings of Euroscipy, last year.
 http://arxiv.org/pdf/1412.6367.pdf


If I read your paper and code properly, you got 5x faster mostly because
you combined the weighted and unweighted histograms into a single search of
the array, and because you used an algorithm that can only be applied to
equal-sized bins, similar to the 10x speed-up Robert was reporting.

I think that having a special path for equal-sized bins is a great idea:
let's do it, PRs are always welcome!
Similarly, getting the counts together with the weights seems like a very
good idea.
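
To make the "counts together with the weights" idea concrete, here is a
small sketch for equal-sized bins. The function name
counts_and_weighted_sums is hypothetical and this is not the code from the
paper; it only shows the bin indices being computed once and reused for both
results.

```python
import numpy as np

def counts_and_weighted_sums(a, weights, n_bins, lo, hi):
    """Unweighted and weighted histograms from one index computation (hypothetical)."""
    a = np.asarray(a, dtype=float).ravel()
    weights = np.asarray(weights, dtype=float).ravel()
    keep = (a >= lo) & (a <= hi)           # ignore samples outside [lo, hi]
    # Equal-sized bins: the bin index follows from arithmetic alone.
    idx = ((a[keep] - lo) * (n_bins / (hi - lo))).astype(np.intp)
    idx = np.minimum(idx, n_bins - 1)      # the right edge belongs to the last bin
    counts = np.bincount(idx, minlength=n_bins)
    sums = np.bincount(idx, weights=weights[keep], minlength=n_bins)
    return counts, sums
```

With both arrays available, a per-bin mean is simply sums / counts wherever
counts is non-zero.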

I also think that writing it in Python is going to take us 80% of the way
there: most of the improvements both of you have reported are likely to
come not from the language chosen, but from the algorithm used. And if C
proves to be sufficiently faster to warrant using it, it should be confined
to the number crunching: I don't think there is any point in rewriting
argument parsing in C.

Also, keep in mind `np.histogram` can now handle arrays of just about
**any** dtype. Handling that complexity in C is no walk in the park.
Other functions like `np.bincount` and `np.digitize` cheat by only handling
`double`-typed arrays, a luxury that histogram probably can't afford at
this point in time.
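
A small illustration of the dtype point, under the assumption of a
reasonably recent NumPy: np.bincount converts its weights to double, while
np.histogram accepts, for example, complex weights and returns a complex
result.

```python
import numpy as np

idx = np.array([0, 1, 1])
int_weights = np.array([1, 2, 3])
# bincount works on double weights, so integer weights come back as float64.
binned = np.bincount(idx, weights=int_weights)
print(binned, binned.dtype)

complex_weights = np.array([1 + 1j, 2 + 0j, 0 + 3j])
# histogram keeps the complex dtype of the weights in its output.
hist, edges = np.histogram([0.1, 0.2, 0.9], bins=2, range=(0, 1),
                           weights=complex_weights)
print(hist, hist.dtype)
```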

Jaime



Re: [Numpy-discussion] Rewrite np.histogram in c?

2015-03-16 Thread Robert McGibbon
My apologies for the typo: 'implements' -> 'implementations'

-Robert

On Sun, Mar 15, 2015 at 11:06 PM, Robert McGibbon rmcgi...@gmail.com
wrote:

 It might make sense to dispatch to difference c implements if the bins are
 equally spaced (as created by using an integer for the np.histogram bins
 argument), vs. non-equally-spaced bins.

 In that case, getting the bigger speedup may be easier, at least for one
 common use case.

 -Robert



Re: [Numpy-discussion] Rewrite np.histogram in c?

2015-03-16 Thread Jaime Fernández del Río
On Sun, Mar 15, 2015 at 9:32 PM, Robert McGibbon rmcgi...@gmail.com wrote:

 Hi,

 Numpy.histogram is implemented in python, and is a little sluggish. This
 has been discussed previously on the mailing list, [1, 2]. It came up in a
 project that I maintain, where a new feature is bottlenecked by
 numpy.histogram, and one developer suggested a faster implementation in
 cython [3].

 Would it make sense to reimplement this function in c? or cython? Is
 moving functions like this from python to c to improve performance within
 the scope of the development roadmap for numpy? I started implementing this
 a little bit in c, [4] but I figured I should check in here first.


Where do you think the performance gains will come from? The PR in your
project that claims a 10x speed-up uses a method that is only fit for
equally spaced bins. I want to think that implementing that exact same
algorithm in Python with NumPy would be comparably fast, say within 2x.

For the general case, NumPy is already doing most of the heavy lifting (the
sorting and the searching) in C: simply replicating the same algorithmic
approach entirely in C is unlikely to provide any major speed-up. And if
the change is to the algorithm, then we should first try it out in Python.
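
For reference, the general sort-and-search approach can be sketched in a few
lines of NumPy. This is an illustration of the idea, not a copy of NumPy's
source, and the name histogram_by_sorting is hypothetical. Both the sort and
the searches run in compiled code, which is why a straight port to C is
unlikely to gain much.

```python
import numpy as np

def histogram_by_sorting(a, edges):
    """Counts per bin for arbitrary increasing edges (hypothetical helper)."""
    a = np.sort(np.asarray(a).ravel())     # the sorting happens in C
    edges = np.asarray(edges)
    # Number of samples below each edge; the last edge is inclusive.
    cum = np.concatenate([
        a.searchsorted(edges[:-1], side="left"),
        a.searchsorted(edges[-1:], side="right"),
    ])
    # Differences of the cumulative counts give the per-bin counts.
    return np.diff(cum)
```

Each bin is half-open, [left, right), except the last one, which also
includes its right edge.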

That said, if you can speed things up 10x, I don't think there is going to
be much opposition to moving it to C!

Jaime



Re: [Numpy-discussion] Rewrite np.histogram in c?

2015-03-16 Thread Robert McGibbon
It might make sense to dispatch to difference c implements if the bins are
equally spaced (as created by using an integer for the np.histogram bins
argument), vs. non-equally-spaced bins.

In that case, getting the bigger speedup may be easier, at least for one
common use case.

-Robert




Re: [Numpy-discussion] Rewrite np.histogram in c?

2015-03-16 Thread Robert McGibbon
Hi,

It sounds like putting together a PR makes sense then. I'll try hacking on
this a bit.

-Robert



[Numpy-discussion] Rewrite np.histogram in c?

2015-03-15 Thread Robert McGibbon
Hi,

numpy.histogram is implemented in Python and is a little sluggish. This
has been discussed previously on the mailing list [1, 2]. It came up in a
project that I maintain, where a new feature is bottlenecked by
numpy.histogram, and one developer suggested a faster implementation in
Cython [3].

Would it make sense to reimplement this function in C or Cython? Is moving
functions like this from Python to C to improve performance within the
scope of the development roadmap for NumPy? I started implementing this a
little bit in C [4], but I figured I should check in here first.

-Robert

[1] http://scipy-user.10969.n7.nabble.com/numpy-histogram-is-slow-td17208.html
[2] http://numpy-discussion.10968.n7.nabble.com/Fast-histogram-td9359.html
[3] https://github.com/mdtraj/mdtraj/pull/734
[4] https://github.com/rmcgibbo/numpy/tree/histogram