Re: [Numpy-discussion] fast_any_all , a trivial but fast/useful helper function for numpy
From my previous mail:

this has the same performance as your code:
a = empty([3] list(A.shape)

For anyone that is interested, I ran a benchmark on the code after Julian kindly provided me with a correction to the listing he posted.

a = empty([3] + list(A.shape))
a[0] = A5; a[1] = B2; a[2] = A10;
np.any(a, 0)

Julian also suggested trying the idiom np.vstack([A,B,C]) instead of [A,B,C].

Revised benchmarks here. I've moved the [A5, B2, A10] creation outside the timing loop in all cases since it was distorting results due to array creation, which shouldn't be part of the any() timing measurement. I'm also now using separate test arrays to avoid the possibility of side effects between tests of different functions.

The following results are produced consistently:

np.any(): 2.68s
np.any() with Julian's first idiom above: 0.24s
faa.any() (original version): 0.2s
np.any() with vstack(): 0.14s
faa.any() with vstack: 0.1s
faa.any() without vstack: 0.08s
(alternative faa implementations: 0.11-0.12s)

Conclusion:

fast_any_all is 30x faster than numpy.any() 1.7
fast_any_all is 43% faster than numpy.any() 1.7 with the vstack() idiom, which I understand to be the basis for the new approach in numpy.any() 1.8 development branch.

I'd be really interested to see the benchmarks under the current 1.8 master branch of numpy. Please can someone try this and send me the file?

# git clone https://github.com/gbb/numpy-fast-any-all.git
(read the source code to make sure I'm not evil)
# cd numpy-fast-any-all
# python test_fast_any_all.py BENCHMARK.txt

Incidentally, this is an appropriate example of a case where a 'performance idiom' becomes a 'penalty idiom' unexpectedly when the underlying implementation changes (vstack).

Thanks for your suggestions, Julian.

Graeme.

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
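For readers who want to try the preallocation idiom from the corrected listing above, here is a minimal runnable sketch. The function name `any_preallocated`, the test arrays, and the comparison `A > 15` are assumptions made for illustration; the comparisons in the original listing did not survive in the archive, so they are not reproduced here.

```python
import numpy as np

def any_preallocated(layers):
    """Elementwise OR of same-shaped boolean arrays via a preallocated stack."""
    first = layers[0]
    # Allocate (n_layers, *layer_shape) once, then copy each layer into its slot.
    stacked = np.empty([len(layers)] + list(first.shape), dtype=bool)
    for i, layer in enumerate(layers):
        stacked[i] = layer
    # Reduce over the leading axis so the result keeps the layer shape.
    return np.any(stacked, 0)

# Hypothetical test data; the thresholds are illustrative only.
rng = np.random.RandomState(0)
A = rng.randint(0, 20, size=(4, 5))
B = rng.randint(0, 20, size=(4, 5))
conditions = [A == 2, B == 5, A > 15]

result = any_preallocated(conditions)
reference = np.any(conditions, 0)
```

One thing worth checking when comparing against the vstack idiom: for 2-D layers, np.vstack([A, B, C]) concatenates along the existing first axis rather than adding a new one, so np.any(..., 0) on it returns a different shape than the preallocated form does.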
Re: [Numpy-discussion] fast_any_all , a trivial but fast/useful helper function for numpy
Hi Robert,

Thanks for proposing an alternative implementation approach. However, did you test your proposal before you made the assertion about its behaviour?

reduce(np.logical_or, inputs, False)
reduce(np.logical_and, inputs, True)

This code consistently benchmarks 20% slower than the method I use (tested on two different machines several times).

"Your fast_logic() is basically reduce()."

No, it isn't.

Updated benchmarks for your proposal, and also for another alternative implementation using boolean indexing, are at:
https://github.com/gbb/numpy-fast-any-all/blob/master/BENCHMARK.md

Three general points arising from this:

1 - idioms don't have test coverage

Generally, by using idioms rather than functions, you risk mistyping or misusing the form of the idiom and thus introducing a bug. You also lose out on explicit testing and implicit 'real world testing' that tends to build up around library functions.

2 - idioms aren't maintained or updated (and they have an unknown shelf life)

An idiom might be fast today (or not), it may be correct today, but tomorrow is unknown. A key problem is that the relative performance of the parts of a library like numpy will keep changing - sometimes substantially - and idiomatic approaches to overcome performance difficulties in the short term tend to become outdated and even harmful very quickly. As in this example, they can even be harmful from the moment they're written. Browsing a site like stackoverflow should show you both new and experienced users often taking inefficient approaches because of outdated idiomatic advice.

3 - idioms are OK, but functions are better, because implementation hiding and abstraction are good things.

If you use a benchmarked/tested function which acknowledges a range of alternative implementations, you have a reasonable degree of confidence that you're getting the best performance and correct behaviour, because you can actually see the effects of the alternative implementations in benchmarks/test output.

It's a lot more sensible to use a function from a publicly available library - any library - than to manually maintain a set of idioms and have to continually search your software for the idioms, benchmark them to see if they're still beneficial, and modify them when they're not.

Graeme
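The point about checking alternative implementations against each other can be sketched as a small harness. This is an illustration only, not the benchmark from the repository: the function names, array sizes, and repeat count are assumptions, and the timings it prints will vary by machine and numpy version, so none are quoted here.

```python
import timeit
from functools import reduce

import numpy as np

def any_reduce(inputs):
    # The reduce() form quoted in the message above.
    return reduce(np.logical_or, inputs, False)

def all_reduce(inputs):
    return reduce(np.logical_and, inputs, True)

def any_axis0(inputs):
    # Baseline: let numpy build the stacked array and reduce over axis 0.
    return np.any(inputs, 0)

def any_ufunc_reduce(inputs):
    # Another spelling: the ufunc's own reduce over an explicit stack.
    return np.logical_or.reduce(np.array(inputs), axis=0)

# Hypothetical test data.
rng = np.random.RandomState(0)
data = rng.randint(0, 10, size=(200, 200))
inputs = [data == k for k in (2, 5, 7)]

reference = any_axis0(inputs)
for fn in (any_reduce, any_axis0, any_ufunc_reduce):
    # Correctness first, then timing.
    assert np.array_equal(fn(inputs), reference)
    seconds = timeit.timeit(lambda: fn(inputs), number=20)
    print("%-18s %.4f s for 20 calls" % (fn.__name__, seconds))
```

Keeping the equality check next to the timing means a faster candidate cannot quietly return a different answer.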
Re: [Numpy-discussion] fast_any_all , a trivial but fast/useful helper function for numpy
Hi Julian,

Thanks for the post. It's great to hear that the main numpy function is improving in 1.8, though I think there is still plenty of value here for performance junkies :-)

I don't have 1.8beta installed (and I can't conveniently install it on my machines just now). If you have time, and have the beta installed, could you try this and mail me the output from the benchmark? I'm curious to know.

# git clone https://github.com/gbb/numpy-fast-any-all.git
# cd numpy-fast-any-all
# python test-fast-any-all.py

Graeme

On Sep 4, 2013, at 7:38 PM, Julian Taylor jtaylor.deb...@googlemail.com wrote:

The result is 14 to 17x faster than np.any() for this use case.*

any/all and boolean operations have been significantly sped up by vectorization in numpy 1.8 [0]. They are now around 10 times faster than before, especially if the boolean array fits into one of the cpu caching layers.
Re: [Numpy-discussion] fast_any_all , a trivial but fast/useful helper function for numpy
This is good stuff, but I can't help thinking that if I needed to do an any/all test on a number of arrays with common and/or combos -- I'd probably write a Cython function to do it.

It could be a bit tricky to make it really general, but not bad for a couple of specific dtypes / use cases.

-just a thought...

Also -- how does this work with numexpr? It would be nice if it could handle these kinds of cases.

-Chris

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

chris.bar...@noaa.gov
Re: [Numpy-discussion] fast_any_all , a trivial but fast/useful helper function for numpy
hi,

it's not np.any that is slow in this case, it's np.array([A, B, C]).

np.dstack([A, B, C]) is better, but writing it like this has the same performance as your code:

a = empty([3] + list(A.shape))
a[0] = A5; a[1] = B2; a[2] = A10;
np.any(a, 0)

I'll check if creating an array from a sequence can be improved for this case.
Re: [Numpy-discussion] fast_any_all , a trivial but fast/useful helper function for numpy
On Wed, Sep 4, 2013 at 11:05 AM, Graeme B. Bell g...@skogoglandskap.no wrote:

In my current GIS raster work I often have a situation where I generate code something like this:

np.any([A4, A==2, B==5, ...])

However, np.any() is quite slow.

It's possible to use np.logical_or to solve the problem, but then you get nested logical_or's, since logical_or combines only two parameters. It's also possible to use integer maths e.g. (A4)+(A==2)+(B==5)0.

The question is: which is best (syntactically, in terms of performance, etc)?

I've written a little helper function to provide a faster version of any() and all(). It's embarrassingly simple - just a for loop. However, I think there's a syntactic advantage to using a helper function for this situation rather than writing it idiomatically each time; and it reduces the chance of a bug in idiomatic implementation.

However, the code does not cover all the use cases currently addressed by np.any() and np.all().

I benchmarked to pick the fastest underlying implementation (logical_or rather than integer maths). The result is 14 to 17x faster than np.any() for this use case.*

Code benchmark here:
https://github.com/gbb/numpy-fast-any-all

Please feel welcome to use it or improve it :-)

Try the following:

any(map(np.any, inputs))
all(map(np.all, inputs))

--
Robert Kern
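The three spellings discussed in the quoted message (nested logical_or, integer maths, and a loop-based helper) can be compared side by side. The sketch below is an illustration under assumptions: `any_by_loop` is a hypothetical stand-in for the helper and is not the code from the repository, and the comparison `A > 4` is assumed, since the operators in the quoted expressions were lost in the archive.

```python
import numpy as np

# Hypothetical raster layers; `A > 4` is an assumed stand-in condition.
rng = np.random.RandomState(1)
A = rng.randint(0, 10, size=(3, 4))
B = rng.randint(0, 10, size=(3, 4))
conditions = [A > 4, A == 2, B == 5]

# 1. Nested logical_or: each call combines exactly two operands.
nested = np.logical_or(np.logical_or(conditions[0], conditions[1]), conditions[2])

# 2. Integer maths: count how many conditions hold, then test the count.
counted = sum(c.astype(np.intp) for c in conditions) > 0

# 3. Loop-based helper (a sketch of the "just a for loop" idea).
def any_by_loop(inputs):
    result = np.array(inputs[0], dtype=bool)  # copy, so the input is untouched
    for layer in inputs[1:]:
        np.logical_or(result, layer, out=result)
    return result

looped = any_by_loop(conditions)
reference = np.any(conditions, 0)
```

All three produce an array with the shape of the input layers, which is the behaviour being compared in the thread.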
Re: [Numpy-discussion] fast_any_all , a trivial but fast/useful helper function for numpy
Sorry, I should have been more clear.

As shown in the benchmark/example, the method is replacing the behaviour of

np.any(inputs, 0)

not the behaviour of

np.any(inputs)

Here, where I'm making decisions based on overlaying layers of raster data in the same shape, I don't want to map the entire dataset to a single boolean, rather I want to preserve the layers' shape but identify if a condition was matched in any of the overlaid layers, generating a mask.

For example, this type of reasoning:

def mask():
    for all pixel locations in the images, A, B and C:
        if A[location] is 3, 19, or between 21 and 30
          AND B[location] is any value
          AND C[location] is 1-4, 9-13...
            pixel=True

This naturally fits the any/all metaphor.

Will update the description on github.

Graeme.

On Sep 4, 2013, at 12:05 PM, Graeme Bell g...@skogoglandskap.no wrote:

In my current GIS raster work I often have a situation where I generate code something like this:

np.any([A4, A==2, B==5, ...])

However, np.any() is quite slow.

It's possible to use np.logical_or to solve the problem, but then you get nested logical_or's, since logical_or combines only two parameters. It's also possible to use integer maths e.g. (A4)+(A==2)+(B==5)0.

The question is: which is best (syntactically, in terms of performance, etc)?

I've written a little helper function to provide a faster version of any() and all(). It's embarrassingly simple - just a for loop. However, I think there's a syntactic advantage to using a helper function for this situation rather than writing it idiomatically each time; and it reduces the chance of a bug in idiomatic implementation.

However, the code does not cover all the use cases currently addressed by np.any() and np.all().

I benchmarked to pick the fastest underlying implementation (logical_or rather than integer maths). The result is 14 to 17x faster than np.any() for this use case.*

Code benchmark here:
https://github.com/gbb/numpy-fast-any-all

Please feel welcome to use it or improve it :-)

Graeme.

* (Should this become an execution path in np.any()... ?)
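The difference between the per-pixel mask and the single boolean can be made concrete with a short sketch of the mask reasoning above. The layer contents are hypothetical, the range bounds are assumed to be inclusive, and only the value ranges actually listed in the pseudocode are used (whatever followed the trailing "..." is not guessed at). The B layer is omitted because the pseudocode accepts any value for it.

```python
import numpy as np

# Hypothetical raster layers of the same shape.
rng = np.random.RandomState(2)
A = rng.randint(0, 40, size=(4, 6))
C = rng.randint(0, 20, size=(4, 6))

# "A is 3, 19, or between 21 and 30" (bounds assumed inclusive).
a_ok = np.any([A == 3, A == 19, (A >= 21) & (A <= 30)], 0)

# "C is 1-4, 9-13" (bounds assumed inclusive).
c_ok = np.any([(C >= 1) & (C <= 4), (C >= 9) & (C <= 13)], 0)

# Per-pixel mask: reducing over axis 0 keeps the layer shape.
mask = np.all([a_ok, c_ok], 0)

# Without the axis argument everything collapses to one boolean,
# which is also what any(map(np.any, inputs)) computes.
collapsed = np.any([a_ok, c_ok])
collapsed_via_map = any(map(np.any, [a_ok, c_ok]))
```

Here `mask` has the same shape as A and C, while `collapsed` and `collapsed_via_map` are single truth values, which is the distinction drawn in the message.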
Re: [Numpy-discussion] fast_any_all , a trivial but fast/useful helper function for numpy
For the record, I started a discussion about 6 months ago about a find_first type function which avoided running the logic over the whole array (using lambdas instead). This spilled into a discussion about implementing a short-circuiting any or all function:

http://numpy-discussion.10968.n7.nabble.com/Implementing-a-find-first-style-function-tp33085.html

with some interesting results.

Nothing more has been done with those discussions, but you may find it of interest. (And I'd still be interested in taking it forwards if you have any comments)

Cheers,
Re: [Numpy-discussion] fast_any_all , a trivial but fast/useful helper function for numpy
On 04.09.2013 12:05, Graeme B. Bell wrote:

In my current GIS raster work I often have a situation where I generate code something like this:

np.any([A4, A==2, B==5, ...])

However, np.any() is quite slow.

It's possible to use np.logical_or to solve the problem, but then you get nested logical_or's, since logical_or combines only two parameters. It's also possible to use integer maths e.g. (A4)+(A==2)+(B==5)0.

The question is: which is best (syntactically, in terms of performance, etc)?

I've written a little helper function to provide a faster version of any() and all(). It's embarrassingly simple - just a for loop. However, I think there's a syntactic advantage to using a helper function for this situation rather than writing it idiomatically each time; and it reduces the chance of a bug in idiomatic implementation.

However, the code does not cover all the use cases currently addressed by np.any() and np.all().

I benchmarked to pick the fastest underlying implementation (logical_or rather than integer maths). The result is 14 to 17x faster than np.any() for this use case.*

any/all and boolean operations have been significantly sped up by vectorization in numpy 1.8 [0]. They are now around 10 times faster than before, especially if the boolean array fits into one of the cpu caching layers. If it doesn't, I recommend using a blocking utility function, something like:

for i in range(0, n, blocksize):
    view = d[i:i+blocksize]
    # dostuff on view

With this method and the new vectorizations in numpy you are almost as fast as numexpr for floats and probably a lot faster with bools.

[0]
http://www.onerussian.com/tmp/numpy-vbench/vb_vb_ufunc.html#numpy-and-bool
http://www.onerussian.com/tmp/numpy-vbench/vb_vb_reduce.html#numpy-any-slow
(the dip before 1.7 was part of the NA branch and never released, 1.8 adds some of its optimizations back)
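The blocking loop above can be fleshed out into a small utility. This is a sketch under assumptions, not code from the thread: the name `blocked_any`, the default `blocksize`, and the choice to flatten the layers are all illustrative, and whether blocking pays off depends on array size and cache size, so no timings are claimed.

```python
import numpy as np

def blocked_any(layers, blocksize=65536):
    """Elementwise OR of same-shaped arrays, processed one block at a time."""
    shape = layers[0].shape
    # np.ravel returns a view for contiguous inputs and a copy otherwise.
    flat = [np.ravel(layer) for layer in layers]
    n = flat[0].size
    out = np.zeros(n, dtype=bool)
    for i in range(0, n, blocksize):
        view = out[i:i + blocksize]  # a view, so writes land in `out`
        for layer in flat:
            np.logical_or(view, layer[i:i + blocksize], out=view)
    return out.reshape(shape)

# Hypothetical data whose size is deliberately not a multiple of the block size.
rng = np.random.RandomState(3)
d = rng.randint(0, 10, size=(7, 9))
layers = [d == 2, d == 5, d > 7]

blocked = blocked_any(layers, blocksize=10)
reference = np.any(layers, 0)
```

The slice `out[i:i + blocksize]` simply comes back shorter for the final block, so a size that is not a multiple of `blocksize` needs no special case.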