Re: [Numpy-discussion] Implementing a find first style function
I've specifically not tuned it, primarily because the get the best tuning you need to know the likelihood of finding a match. One option would be to allow users to specify a probability parameter which would chunk the array into size*probability chunks - an additional parameter could then be exposed to limit the maximum chunk size to give the user control of the maximum memory overhead that the routine could use. I'll submit a PR and we can discuss inline. Thanks for the response Nathaniel. On 27 March 2013 12:19, Nathaniel Smith n...@pobox.com wrote: On Tue, Mar 26, 2013 at 9:20 AM, Phil Elson pelson@gmail.com wrote: Bump. I'd be interested to know if this is a desirable feature for numpy? (specifically the 1D find functionality rather than the any/all also discussed) If so, I'd be more than happy to submit a PR, but I don't want to put in the effort if the principle isn't desirable in the core of numpy. I don't think anyone has a strong opinion either way :-). It seems like a fairly general interface that people might find useful, so I don't see an immediate objection to including it in principle. It would help to see the actual numbers from a tuned version though to know how much benefit there is to get... -n ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Implementing a find first style function
On Tue, Mar 26, 2013 at 9:20 AM, Phil Elson pelson@gmail.com wrote: Bump. I'd be interested to know if this is a desirable feature for numpy? (specifically the 1D find functionality rather than the any/all also discussed) If so, I'd be more than happy to submit a PR, but I don't want to put in the effort if the principle isn't desirable in the core of numpy. I don't think anyone has a strong opinion either way :-). It seems like a fairly general interface that people might find useful, so I don't see an immediate objection to including it in principle. It would help to see the actual numbers from a tuned version though to know how much benefit there is to get... -n ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Implementing a find first style function
Bump. I'd be interested to know if this is a desirable feature for numpy? (specifically the 1D find functionality rather than the any/all also discussed) If so, I'd be more than happy to submit a PR, but I don't want to put in the effort if the principle isn't desirable in the core of numpy. Cheers, On 8 March 2013 17:38, Phil Elson pelson@gmail.com wrote: Interesting. I hadn't thought of those. I've implemented (very roughly without a sound logic check) and benchmarked: def my_any(a, predicate, chunk_size=2048): try: next(find(a, predicate, chunk_size)) return True except StopIteration: return False def my_all(a, predicate, chunk_size=2048): return not my_any(a, lambda a: ~predicate(a), chunk_size) With the following setup: import numpy as np import numpy.random np.random.seed(1) a = np.random.randn(1e8) For a low frequency *any*: In [12]: %timeit (np.abs(a) 6).any() 1 loops, best of 3: 1.29 s per loop In [13]: %timeit my_any(a, lambda a: np.abs(a) 6) 1 loops, best of 3: 792 ms per loop In [14]: %timeit my_any(a, lambda a: np.abs(a) 6, chunk_size=1) 1 loops, best of 3: 654 ms per loop For a False *any*: In [16]: %timeit (np.abs(a) 7).any() 1 loops, best of 3: 1.22 s per loop In [17]: %timeit my_any(a, lambda a: np.abs(a) 7) 1 loops, best of 3: 2.4 s per loop For a high probability *any*: In [28]: %timeit (np.abs(a) 1).any() 1 loops, best of 3: 972 ms per loop In [27]: %timeit my_any(a, lambda a: np.abs(a) 1) 1 loops, best of 3: 67 us per loop --- For a low probability *all*: In [18]: %timeit (np.abs(a) 6).all() 1 loops, best of 3: 1.16 s per loop In [19]: %timeit my_all(a, lambda a: np.abs(a) 6) 1 loops, best of 3: 880 ms per loop In [20]: %timeit my_all(a, lambda a: np.abs(a) 6, chunk_size=1) 1 loops, best of 3: 706 ms per loop For a True *all*: In [22]: %timeit (np.abs(a) 7).all() 1 loops, best of 3: 1.47 s per loop In [23]: %timeit my_all(a, lambda a: np.abs(a) 7) 1 loops, best of 3: 2.65 s per loop For a high probability *all*: In [25]: %timeit (np.abs(a) 1).all() 1 loops, best of 3: 978 ms per loop In [26]: %timeit my_all(a, lambda a: np.abs(a) 1) 1 loops, best of 3: 73.6 us per loop On 6 March 2013 21:16, Benjamin Root ben.r...@ou.edu wrote: On Tue, Mar 5, 2013 at 9:15 AM, Phil Elson pelson@gmail.com wrote: The ticket https://github.com/numpy/numpy/issues/2269 discusses the possibility of implementing a find first style function which can optimise the process of finding the first value(s) which match a predicate in a given 1D array. For example: a = np.sin(np.linspace(0, np.pi, 200)) print find_first(a, lambda a: a 0.9) ((71, ), 0.900479032457) This has been discussed in several locations: https://github.com/numpy/numpy/issues/2269 https://github.com/numpy/numpy/issues/2333 http://stackoverflow.com/questions/7632963/numpy-array-how-to-find-index-of-first-occurrence-of-item *Rationale* For small arrays there is no real reason to avoid doing: a = np.sin(np.linspace(0, np.pi, 200)) ind = (a 0.9).nonzero()[0][0] print (ind, ), a[ind] (71,) 0.900479032457 But for larger arrays, this can lead to massive amounts of work even if the result is one of the first to be computed. Example: a = np.arange(1e8) print (a == 5).nonzero()[0][0] 5 So a function which terminates when the first matching value is found is desirable. As mentioned in #2269, it is possible to define a consistent ordering which allows this functionality for 1D arrays, but IMHO it overcomplicates the problem and was not a case that I personally needed, so I've limited the scope to 1D arrays only. *Implementation* My initial assumption was that to get any kind of performance I would need to write the *find* function in C, however after prototyping with some array chunking it became apparent that a trivial python function would be quick enough for my needs. The approach I've implemented in the code found in #2269 simply breaks the array into sub-arrays of maximum length *chunk_size* (2048 by default, though there is no real science to this number), applies the given predicating function, and yields the results from *nonzero()*. The given function should be a python function which operates on the whole of the sub-array element-wise (i.e. the function should be vectorized). Returning a generator also has the benefit of allowing users to get the first *n* matching values/indices. *Results* I timed the implementation of *find* found in my comment at https://github.com/numpy/numpy/issues/2269#issuecomment-14436725 with an obvious test: In [1]: from np_utils import find In [2]: import numpy as np In [3]: import numpy.random In [4]: np.random.seed(1) In [5]: a = np.random.randn(1e8) In [6]: a.min(), a.max() Out[6]: (-6.1194900990552776, 5.9632246301166321) In [7]: next(find(a,
Re: [Numpy-discussion] Implementing a find first style function
Interesting. I hadn't thought of those. I've implemented (very roughly without a sound logic check) and benchmarked: def my_any(a, predicate, chunk_size=2048): try: next(find(a, predicate, chunk_size)) return True except StopIteration: return False def my_all(a, predicate, chunk_size=2048): return not my_any(a, lambda a: ~predicate(a), chunk_size) With the following setup: import numpy as np import numpy.random np.random.seed(1) a = np.random.randn(1e8) For a low frequency *any*: In [12]: %timeit (np.abs(a) 6).any() 1 loops, best of 3: 1.29 s per loop In [13]: %timeit my_any(a, lambda a: np.abs(a) 6) 1 loops, best of 3: 792 ms per loop In [14]: %timeit my_any(a, lambda a: np.abs(a) 6, chunk_size=1) 1 loops, best of 3: 654 ms per loop For a False *any*: In [16]: %timeit (np.abs(a) 7).any() 1 loops, best of 3: 1.22 s per loop In [17]: %timeit my_any(a, lambda a: np.abs(a) 7) 1 loops, best of 3: 2.4 s per loop For a high probability *any*: In [28]: %timeit (np.abs(a) 1).any() 1 loops, best of 3: 972 ms per loop In [27]: %timeit my_any(a, lambda a: np.abs(a) 1) 1 loops, best of 3: 67 us per loop --- For a low probability *all*: In [18]: %timeit (np.abs(a) 6).all() 1 loops, best of 3: 1.16 s per loop In [19]: %timeit my_all(a, lambda a: np.abs(a) 6) 1 loops, best of 3: 880 ms per loop In [20]: %timeit my_all(a, lambda a: np.abs(a) 6, chunk_size=1) 1 loops, best of 3: 706 ms per loop For a True *all*: In [22]: %timeit (np.abs(a) 7).all() 1 loops, best of 3: 1.47 s per loop In [23]: %timeit my_all(a, lambda a: np.abs(a) 7) 1 loops, best of 3: 2.65 s per loop For a high probability *all*: In [25]: %timeit (np.abs(a) 1).all() 1 loops, best of 3: 978 ms per loop In [26]: %timeit my_all(a, lambda a: np.abs(a) 1) 1 loops, best of 3: 73.6 us per loop On 6 March 2013 21:16, Benjamin Root ben.r...@ou.edu wrote: On Tue, Mar 5, 2013 at 9:15 AM, Phil Elson pelson@gmail.com wrote: The ticket https://github.com/numpy/numpy/issues/2269 discusses the possibility of implementing a find first style function which can optimise the process of finding the first value(s) which match a predicate in a given 1D array. For example: a = np.sin(np.linspace(0, np.pi, 200)) print find_first(a, lambda a: a 0.9) ((71, ), 0.900479032457) This has been discussed in several locations: https://github.com/numpy/numpy/issues/2269 https://github.com/numpy/numpy/issues/2333 http://stackoverflow.com/questions/7632963/numpy-array-how-to-find-index-of-first-occurrence-of-item *Rationale* For small arrays there is no real reason to avoid doing: a = np.sin(np.linspace(0, np.pi, 200)) ind = (a 0.9).nonzero()[0][0] print (ind, ), a[ind] (71,) 0.900479032457 But for larger arrays, this can lead to massive amounts of work even if the result is one of the first to be computed. Example: a = np.arange(1e8) print (a == 5).nonzero()[0][0] 5 So a function which terminates when the first matching value is found is desirable. As mentioned in #2269, it is possible to define a consistent ordering which allows this functionality for 1D arrays, but IMHO it overcomplicates the problem and was not a case that I personally needed, so I've limited the scope to 1D arrays only. *Implementation* My initial assumption was that to get any kind of performance I would need to write the *find* function in C, however after prototyping with some array chunking it became apparent that a trivial python function would be quick enough for my needs. The approach I've implemented in the code found in #2269 simply breaks the array into sub-arrays of maximum length *chunk_size* (2048 by default, though there is no real science to this number), applies the given predicating function, and yields the results from *nonzero()*. The given function should be a python function which operates on the whole of the sub-array element-wise (i.e. the function should be vectorized). Returning a generator also has the benefit of allowing users to get the first *n*matching values/indices. *Results* I timed the implementation of *find* found in my comment at https://github.com/numpy/numpy/issues/2269#issuecomment-14436725 with an obvious test: In [1]: from np_utils import find In [2]: import numpy as np In [3]: import numpy.random In [4]: np.random.seed(1) In [5]: a = np.random.randn(1e8) In [6]: a.min(), a.max() Out[6]: (-6.1194900990552776, 5.9632246301166321) In [7]: next(find(a, lambda a: np.abs(a) 6)) Out[7]: ((33105441,), -6.1194900990552776) In [8]: (np.abs(a) 6).nonzero() Out[8]: (array([33105441]),) In [9]: %timeit (np.abs(a) 6).nonzero() 1 loops, best of 3: 1.51 s per loop In [10]: %timeit next(find(a, lambda a: np.abs(a) 6)) 1 loops, best of 3: 912 ms per loop In [11]: %timeit next(find(a, lambda a: np.abs(a) 6, chunk_size=10)) 1 loops, best of 3: 470 ms per loop
Re: [Numpy-discussion] Implementing a find first style function
On Tue, Mar 5, 2013 at 9:15 AM, Phil Elson pelson@gmail.com wrote: The ticket https://github.com/numpy/numpy/issues/2269 discusses the possibility of implementing a find first style function which can optimise the process of finding the first value(s) which match a predicate in a given 1D array. For example: a = np.sin(np.linspace(0, np.pi, 200)) print find_first(a, lambda a: a 0.9) ((71, ), 0.900479032457) This has been discussed in several locations: https://github.com/numpy/numpy/issues/2269 https://github.com/numpy/numpy/issues/2333 http://stackoverflow.com/questions/7632963/numpy-array-how-to-find-index-of-first-occurrence-of-item *Rationale* For small arrays there is no real reason to avoid doing: a = np.sin(np.linspace(0, np.pi, 200)) ind = (a 0.9).nonzero()[0][0] print (ind, ), a[ind] (71,) 0.900479032457 But for larger arrays, this can lead to massive amounts of work even if the result is one of the first to be computed. Example: a = np.arange(1e8) print (a == 5).nonzero()[0][0] 5 So a function which terminates when the first matching value is found is desirable. As mentioned in #2269, it is possible to define a consistent ordering which allows this functionality for 1D arrays, but IMHO it overcomplicates the problem and was not a case that I personally needed, so I've limited the scope to 1D arrays only. *Implementation* My initial assumption was that to get any kind of performance I would need to write the *find* function in C, however after prototyping with some array chunking it became apparent that a trivial python function would be quick enough for my needs. The approach I've implemented in the code found in #2269 simply breaks the array into sub-arrays of maximum length *chunk_size* (2048 by default, though there is no real science to this number), applies the given predicating function, and yields the results from *nonzero()*. The given function should be a python function which operates on the whole of the sub-array element-wise (i.e. the function should be vectorized). Returning a generator also has the benefit of allowing users to get the first *n*matching values/indices. *Results* I timed the implementation of *find* found in my comment at https://github.com/numpy/numpy/issues/2269#issuecomment-14436725 with an obvious test: In [1]: from np_utils import find In [2]: import numpy as np In [3]: import numpy.random In [4]: np.random.seed(1) In [5]: a = np.random.randn(1e8) In [6]: a.min(), a.max() Out[6]: (-6.1194900990552776, 5.9632246301166321) In [7]: next(find(a, lambda a: np.abs(a) 6)) Out[7]: ((33105441,), -6.1194900990552776) In [8]: (np.abs(a) 6).nonzero() Out[8]: (array([33105441]),) In [9]: %timeit (np.abs(a) 6).nonzero() 1 loops, best of 3: 1.51 s per loop In [10]: %timeit next(find(a, lambda a: np.abs(a) 6)) 1 loops, best of 3: 912 ms per loop In [11]: %timeit next(find(a, lambda a: np.abs(a) 6, chunk_size=10)) 1 loops, best of 3: 470 ms per loop In [12]: %timeit next(find(a, lambda a: np.abs(a) 6, chunk_size=100)) 1 loops, best of 3: 483 ms per loop This shows that picking a sensible *chunk_size* can yield massive speed-ups (nonzero is x3 slower in one case). A similar example with a much smaller 1D array shows similar promise: In [41]: a = np.random.randn(1e4) In [42]: %timeit next(find(a, lambda a: np.abs(a) 3)) 1 loops, best of 3: 35.8 us per loop In [43]: %timeit (np.abs(a) 3).nonzero() 1 loops, best of 3: 148 us per loop As I commented on the issue tracker, if you think this function is worth taking forward, I'd be happy to open up a pull request. Feedback greatfully received. Cheers, Phil In the interest of generalizing code and such, could such approaches be used for functions like np.any() and np.all() for short-circuiting if True or False (respectively) are found? I wonder what other sort of functions in NumPy might benefit from this? Ben Root ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
[Numpy-discussion] Implementing a find first style function
The ticket https://github.com/numpy/numpy/issues/2269 discusses the possibility of implementing a find first style function which can optimise the process of finding the first value(s) which match a predicate in a given 1D array. For example: a = np.sin(np.linspace(0, np.pi, 200)) print find_first(a, lambda a: a 0.9) ((71, ), 0.900479032457) This has been discussed in several locations: https://github.com/numpy/numpy/issues/2269 https://github.com/numpy/numpy/issues/2333 http://stackoverflow.com/questions/7632963/numpy-array-how-to-find-index-of-first-occurrence-of-item *Rationale* For small arrays there is no real reason to avoid doing: a = np.sin(np.linspace(0, np.pi, 200)) ind = (a 0.9).nonzero()[0][0] print (ind, ), a[ind] (71,) 0.900479032457 But for larger arrays, this can lead to massive amounts of work even if the result is one of the first to be computed. Example: a = np.arange(1e8) print (a == 5).nonzero()[0][0] 5 So a function which terminates when the first matching value is found is desirable. As mentioned in #2269, it is possible to define a consistent ordering which allows this functionality for 1D arrays, but IMHO it overcomplicates the problem and was not a case that I personally needed, so I've limited the scope to 1D arrays only. *Implementation* My initial assumption was that to get any kind of performance I would need to write the *find* function in C, however after prototyping with some array chunking it became apparent that a trivial python function would be quick enough for my needs. The approach I've implemented in the code found in #2269 simply breaks the array into sub-arrays of maximum length *chunk_size* (2048 by default, though there is no real science to this number), applies the given predicating function, and yields the results from *nonzero()*. The given function should be a python function which operates on the whole of the sub-array element-wise (i.e. the function should be vectorized). Returning a generator also has the benefit of allowing users to get the first *n*matching values/indices. *Results* I timed the implementation of *find* found in my comment at https://github.com/numpy/numpy/issues/2269#issuecomment-14436725 with an obvious test: In [1]: from np_utils import find In [2]: import numpy as np In [3]: import numpy.random In [4]: np.random.seed(1) In [5]: a = np.random.randn(1e8) In [6]: a.min(), a.max() Out[6]: (-6.1194900990552776, 5.9632246301166321) In [7]: next(find(a, lambda a: np.abs(a) 6)) Out[7]: ((33105441,), -6.1194900990552776) In [8]: (np.abs(a) 6).nonzero() Out[8]: (array([33105441]),) In [9]: %timeit (np.abs(a) 6).nonzero() 1 loops, best of 3: 1.51 s per loop In [10]: %timeit next(find(a, lambda a: np.abs(a) 6)) 1 loops, best of 3: 912 ms per loop In [11]: %timeit next(find(a, lambda a: np.abs(a) 6, chunk_size=10)) 1 loops, best of 3: 470 ms per loop In [12]: %timeit next(find(a, lambda a: np.abs(a) 6, chunk_size=100)) 1 loops, best of 3: 483 ms per loop This shows that picking a sensible *chunk_size* can yield massive speed-ups (nonzero is x3 slower in one case). A similar example with a much smaller 1D array shows similar promise: In [41]: a = np.random.randn(1e4) In [42]: %timeit next(find(a, lambda a: np.abs(a) 3)) 1 loops, best of 3: 35.8 us per loop In [43]: %timeit (np.abs(a) 3).nonzero() 1 loops, best of 3: 148 us per loop As I commented on the issue tracker, if you think this function is worth taking forward, I'd be happy to open up a pull request. Feedback greatfully received. Cheers, Phil ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion