Re: [Numpy-discussion] Implementing a find first style function

2013-03-28 Thread Phil Elson
I've specifically not tuned it, primarily because to get the best tuning
you need to know the likelihood of finding a match. One option would be to
allow users to specify a probability parameter which would chunk the
array into size*probability chunks; an additional parameter could then be
exposed to limit the maximum chunk size, giving the user control of the
maximum memory overhead that the routine could use.
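To make that concrete, here is a purely hypothetical sketch of deriving a chunk size from a user-supplied probability with a cap on the chunk size (the name `chunk_size_for` and all of its parameters are illustrative only, not a proposed API):

```python
def chunk_size_for(size, probability, max_chunk_size=2 ** 20, min_chunk_size=256):
    """One reading of the suggestion above: chunk the array into pieces of
    roughly size * probability elements, capped by max_chunk_size so the
    temporary boolean mask stays small.  Purely illustrative sketch.
    """
    # Estimate a chunk size; fall back to 1 element if the product rounds to 0.
    estimate = int(round(size * probability)) or 1
    # Clamp into [min_chunk_size, max_chunk_size], but never exceed the array.
    return min(max(min_chunk_size, min(estimate, max_chunk_size)), size)

chunk_size_for(10 ** 8, 1e-5)   # small expected hit rate -> 1000-element chunks
chunk_size_for(10 ** 8, 0.5)    # huge estimate, capped at max_chunk_size
```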

I'll submit a PR and we can discuss inline.

Thanks for the response Nathaniel.





___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Implementing a find first style function

2013-03-27 Thread Nathaniel Smith

I don't think anyone has a strong opinion either way :-). It seems
like a fairly general interface that people might find useful, so I
don't see an immediate objection to including it in principle. It
would help to see the actual numbers from a tuned version though to
know how much benefit there is to get...

-n


Re: [Numpy-discussion] Implementing a find first style function

2013-03-26 Thread Phil Elson
Bump.

I'd be interested to know whether this is a desirable feature for numpy
(specifically the 1D find functionality rather than the any/all also
discussed). If so, I'd be more than happy to submit a PR, but I don't want
to put in the effort if the principle isn't desirable in the core of numpy.

Cheers,




Re: [Numpy-discussion] Implementing a find first style function

2013-03-08 Thread Phil Elson
Interesting. I hadn't thought of those. I've implemented (very roughly,
without a sound logic check) and benchmarked:

def my_any(a, predicate, chunk_size=2048):
    try:
        next(find(a, predicate, chunk_size))
        return True
    except StopIteration:
        return False

def my_all(a, predicate, chunk_size=2048):
    return not my_any(a, lambda a: ~predicate(a), chunk_size)
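As a quick sanity check of the De Morgan identity behind my_all (all(p) is equivalent to not any(not p)), here is a self-contained version; the `find` below is a minimal stand-in for the chunked implementation from #2269, not the real code:

```python
import numpy as np

def find(a, predicate, chunk_size=2048):
    # Minimal stand-in for the chunked ``find`` from numpy/numpy#2269:
    # apply the vectorized predicate chunk by chunk, yield flat indices.
    for start in range(0, len(a), chunk_size):
        for idx in np.flatnonzero(predicate(a[start:start + chunk_size])):
            yield start + idx

def my_any(a, predicate, chunk_size=2048):
    try:
        next(find(a, predicate, chunk_size))
        return True
    except StopIteration:
        return False

def my_all(a, predicate, chunk_size=2048):
    # all(p) == not any(not p); ``~`` negates the boolean result array
    return not my_any(a, lambda a: ~predicate(a), chunk_size)

a = np.array([0.5, 2.0, 0.25])
print(my_any(a, lambda a: a > 1.0), (a > 1.0).any())   # both True
print(my_all(a, lambda a: a > 0.3), (a > 0.3).all())   # both False
```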


With the following setup:

import numpy as np
import numpy.random

np.random.seed(1)
a = np.random.randn(1e8)


For a low frequency *any*:

In [12]: %timeit (np.abs(a) > 6).any()
1 loops, best of 3: 1.29 s per loop

In [13]: %timeit my_any(a, lambda a: np.abs(a) > 6)
1 loops, best of 3: 792 ms per loop

In [14]: %timeit my_any(a, lambda a: np.abs(a) > 6, chunk_size=1)
1 loops, best of 3: 654 ms per loop

For a False *any*:

In [16]: %timeit (np.abs(a) > 7).any()
1 loops, best of 3: 1.22 s per loop

In [17]: %timeit my_any(a, lambda a: np.abs(a) > 7)
1 loops, best of 3: 2.4 s per loop

For a high probability *any*:

In [28]: %timeit (np.abs(a) > 1).any()
1 loops, best of 3: 972 ms per loop

In [27]: %timeit my_any(a, lambda a: np.abs(a) > 1)
1 loops, best of 3: 67 us per loop

---

For a low probability *all*:

In [18]: %timeit (np.abs(a) < 6).all()
1 loops, best of 3: 1.16 s per loop

In [19]: %timeit my_all(a, lambda a: np.abs(a) < 6)
1 loops, best of 3: 880 ms per loop

In [20]: %timeit my_all(a, lambda a: np.abs(a) < 6, chunk_size=1)
1 loops, best of 3: 706 ms per loop

For a True *all*:

In [22]: %timeit (np.abs(a) < 7).all()
1 loops, best of 3: 1.47 s per loop

In [23]: %timeit my_all(a, lambda a: np.abs(a) < 7)
1 loops, best of 3: 2.65 s per loop

For a high probability *all*:

In [25]: %timeit (np.abs(a) < 1).all()
1 loops, best of 3: 978 ms per loop

In [26]: %timeit my_all(a, lambda a: np.abs(a) < 1)
1 loops, best of 3: 73.6 us per loop








Re: [Numpy-discussion] Implementing a find first style function

2013-03-06 Thread Benjamin Root
On Tue, Mar 5, 2013 at 9:15 AM, Phil Elson pelson@gmail.com wrote:


In the interest of generalizing code and such, could such approaches be
used for functions like np.any() and np.all() to short-circuit as soon as
a True or False (respectively) is found? I wonder what other sorts of
functions in NumPy might benefit from this?

Ben Root


[Numpy-discussion] Implementing a find first style function

2013-03-05 Thread Phil Elson
The ticket https://github.com/numpy/numpy/issues/2269 discusses the
possibility of implementing a find first style function which can
optimise the process of finding the first value(s) which match a predicate
in a given 1D array. For example:


>>> a = np.sin(np.linspace(0, np.pi, 200))
>>> print find_first(a, lambda a: a > 0.9)
((71, ), 0.900479032457)


This has been discussed in several locations:

https://github.com/numpy/numpy/issues/2269
https://github.com/numpy/numpy/issues/2333
http://stackoverflow.com/questions/7632963/numpy-array-how-to-find-index-of-first-occurrence-of-item


*Rationale*

For small arrays there is no real reason to avoid doing:

>>> a = np.sin(np.linspace(0, np.pi, 200))
>>> ind = (a > 0.9).nonzero()[0][0]
>>> print (ind, ), a[ind]
(71,) 0.900479032457


But for larger arrays, this can lead to massive amounts of work even if the
result is one of the first to be computed. Example:

>>> a = np.arange(1e8)
>>> print (a == 5).nonzero()[0][0]
5


So a function which terminates when the first matching value is found is
desirable.

As mentioned in #2269, it is possible to define a consistent ordering which
allows this functionality for >1D arrays, but IMHO it overcomplicates the
problem and was not a case that I personally needed, so I've limited the
scope to 1D arrays only.


*Implementation*

My initial assumption was that to get any kind of performance I would need
to write the *find* function in C; however, after prototyping with some
array chunking it became apparent that a trivial Python function would be
quick enough for my needs.

The approach I've implemented in the code found in #2269 simply breaks the
array into sub-arrays of maximum length *chunk_size* (2048 by default,
though there is no real science to this number), applies the given
predicate function, and yields the results from *nonzero()*. The given
function should be a Python function which operates on the whole of the
sub-array element-wise (i.e. the function should be vectorized). Returning
a generator also has the benefit of allowing users to get the first
*n* matching values/indices.
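For illustration, the chunking approach described above can be sketched roughly as follows; the actual code in the #2269 comment is the reference and may differ in details:

```python
import numpy as np

def find(a, predicate, chunk_size=2048):
    """Yield ((index,), value) pairs for elements of the 1D array ``a``
    where the vectorized ``predicate`` is True, scanning chunk by chunk
    so iteration can stop as soon as the first match is found.
    Sketch only -- not the implementation from #2269.
    """
    for start in range(0, len(a), chunk_size):
        chunk = a[start:start + chunk_size]
        # nonzero/flatnonzero on the small boolean chunk is cheap,
        # and matches are yielded in index order.
        for idx in np.flatnonzero(predicate(chunk)):
            yield ((int(start + idx),), chunk[idx])

a = np.sin(np.linspace(0, np.pi, 200))
index, value = next(find(a, lambda a: a > 0.9))
print(index, value)   # (71,) and a value just above 0.9
```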


*Results*


I timed the implementation of *find* found in my comment at
https://github.com/numpy/numpy/issues/2269#issuecomment-14436725 with an
obvious test:


In [1]: from np_utils import find

In [2]: import numpy as np

In [3]: import numpy.random

In [4]: np.random.seed(1)

In [5]: a = np.random.randn(1e8)

In [6]: a.min(), a.max()
Out[6]: (-6.1194900990552776, 5.9632246301166321)

In [7]: next(find(a, lambda a: np.abs(a) > 6))
Out[7]: ((33105441,), -6.1194900990552776)

In [8]: (np.abs(a) > 6).nonzero()
Out[8]: (array([33105441]),)

In [9]: %timeit (np.abs(a) > 6).nonzero()
1 loops, best of 3: 1.51 s per loop

In [10]: %timeit next(find(a, lambda a: np.abs(a) > 6))
1 loops, best of 3: 912 ms per loop

In [11]: %timeit next(find(a, lambda a: np.abs(a) > 6, chunk_size=10))
1 loops, best of 3: 470 ms per loop

In [12]: %timeit next(find(a, lambda a: np.abs(a) > 6, chunk_size=100))
1 loops, best of 3: 483 ms per loop


This shows that picking a sensible *chunk_size* can yield massive speed-ups
(*nonzero* is 3x slower in one case). A similar example with a much smaller
1D array shows similar promise:

In [41]: a = np.random.randn(1e4)

In [42]: %timeit next(find(a, lambda a: np.abs(a) > 3))
1 loops, best of 3: 35.8 us per loop

In [43]: %timeit (np.abs(a) > 3).nonzero()
1 loops, best of 3: 148 us per loop
1 loops, best of 3: 148 us per loop


As I commented on the issue tracker, if you think this function is worth
taking forward, I'd be happy to open up a pull request.

Feedback gratefully received.

Cheers,

Phil