Re: [Numpy-discussion] Array vectorization in numpy
Hi. That is really amazing. I checked out that numexpr branch and saw some strange results when evaluating expressions on a multi-core i7 processor. Running numexpr.test() yields a few 'F's, which I suppose are failing tests. I tried to let the tests finish, but it takes more than 20 min; is there any way to run the tests individually? Is there a specific mailing list for numexpr, so I can avoid 'spamming' numpy? Thanks!

-- Carlos Becker

On Wed, Jul 20, 2011 at 8:01 PM, Mark Wiebe mwwi...@gmail.com wrote:
> On Wed, Jul 20, 2011 at 5:52 PM, srean srean.l...@gmail.com wrote:
>>> I think this is essential to speed up numpy. Maybe numexpr could handle this in the future? Right now the general use of numexpr is result = numexpr.evaluate(whatever), so the same problem seems to be there. With this I am not saying that numpy is not worth it, just that for many applications (especially with huge matrices/arrays), pre-allocation does make a huge difference, especially if we want to attract more people to using numpy.
>>
>>> The ufuncs and many scipy functions take an out parameter where you can specify a pre-allocated array. It can be a little awkward writing expressions that way, but the capability is there.
>>
>> This is a slight digression: is there a way to have out-parameter-like semantics with numexpr? I have always used it as a[:] = numexpr(expression), but I don't think numexpr builds the value in place. Is it possible to have side effects with numexpr, as opposed to obtaining values, for example a = a * b + c? The documentation is not clear about this. Oh, and I do not find the out parameter awkward at all. It's very handy. Furthermore, if I may, here is a request that the Blitz++ source be updated. It seems like there is a lot of activity on the Blitz++ repository, and weave is very handy too and can be used as easily as numexpr.
>
> In order to make sure the 1.6 nditer supports multithreading, I adapted numexpr to use it.
> The branch which does this is here:
>
> http://code.google.com/p/numexpr/source/browse/#svn%2Fbranches%2Fnewiter
>
> This supports out, order, and casting parameters, visible here:
>
> http://code.google.com/p/numexpr/source/browse/branches/newiter/numexpr/necompiler.py#615
>
> It's pretty much ready to go, just needs someone to do the release management.
>
> -Mark

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
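For readers unfamiliar with the out parameter discussed in this thread, here is a minimal pure-NumPy sketch (the numexpr out= keyword in the newiter branch is intended to behave analogously; the array shapes and values below are arbitrary, chosen only for illustration):

```python
import numpy as np

a = np.full((1000, 1000), 2.0)
b = np.full_like(a, 3.0)
c = np.full_like(a, 4.0)

# Ordinary expression: NumPy allocates a temporary for a * b and
# another array for the final result.
r1 = a * b + c

# Explicit out= arguments: each step writes directly into a
# pre-allocated buffer; no temporary is made for the final result.
out = np.empty_like(a)
np.multiply(a, b, out=out)
np.add(out, c, out=out)

assert np.allclose(r1, out)  # both compute a * b + c
```

The second form is more verbose, which is the "little awkward" trade-off Ben and Mark mention; the payoff is avoiding per-iteration allocation when the same buffer is reused many times.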
Re: [Numpy-discussion] Array vectorization in numpy
> This is a slight digression: is there a way to have out-parameter-like semantics with numexpr? I have always used it as a[:] = numexpr(expression)

> In order to make sure the 1.6 nditer supports multithreading, I adapted numexpr to use it. The branch which does this is here:
>
> http://code.google.com/p/numexpr/source/browse/#svn%2Fbranches%2Fnewiter
>
> This supports out, order, and casting parameters, visible here:
>
> http://code.google.com/p/numexpr/source/browse/branches/newiter/numexpr/necompiler.py#615
>
> It's pretty much ready to go, just needs someone to do the release management.
>
> -Mark

Oh excellent, I did not know that the out parameter was available. Hope this gets in soon.
Re: [Numpy-discussion] Array vectorization in numpy
There is a total lack of vectorization in your code, so you are right about the lack of vectorization. What happens is that you see the result of the Matlab JIT compiler speeding up the loop. With a vectorized array expression, there will hardly be any difference.

Sturla

On 19.07.2011 11:05, Carlos Becker wrote:
> Hi, I started with numpy a few days ago. I was timing some array operations and found that numpy takes 3 or 4 times longer than Matlab on a simple array-minus-scalar operation. This looks as if there is a lack of vectorization, even though this is just a guess. I hope this is not a repost; I tried searching the mailing list archive but did not find anything related specifically to a problem like this one.
>
> Here is the Python test code:
>
>     from datetime import datetime
>     import numpy as np
>
>     def test():
>         m = np.ones([2000, 2000], float)
>         N = 100
>         t1 = datetime.now()
>         for x in range(N):
>             k = m - 0.5
>         t2 = datetime.now()
>         print (t2 - t1).total_seconds() / N
>
>     test()
>
> And Matlab:
>
>     m = rand(2000,2000);
>     N = 100;
>     tic;
>     for I=1:N
>         k = m - 0.5;
>     end
>     toc / N
>
> I have the impression that the speed boost with Matlab is not related to Matlab optimizations, since single runs also render similar timings. I tried compiling ATLAS for SSE2 and didn't observe any difference. Any clues?
>
> Thanks, Carlos
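As an aside, the same measurement can be made with the standard timeit module, which later posts in this thread use; a minimal sketch (the array size matches the original post, and the absolute timings will of course vary by machine):

```python
import timeit

# Build the test array once in the setup, so allocation of m is not timed.
setup = "import numpy as np; m = np.ones((2000, 2000))"
stmt = "k = m - 0.5"

# Best of 5 repeats, 10 executions each, reported per single execution.
per_run = min(timeit.repeat(stmt, setup=setup, repeat=5, number=10)) / 10
print("seconds per run: %g" % per_run)
```

Taking the minimum of several repeats is the usual convention, since it is the measurement least disturbed by other processes.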
Re: [Numpy-discussion] Array vectorization in numpy
Those are very interesting examples. I think that pre-allocation is very important, and something similar happens in Matlab if no pre-allocation is done: it takes 3-4x longer than with pre-allocation. The main difference is that Matlab is able to take a pre-allocated array/matrix into account, probably avoiding the creation of a temporary and writing the results directly into the pre-allocated array. I think this is essential to speed up numpy. Maybe numexpr could handle this in the future? Right now the general use of numexpr is result = numexpr.evaluate(whatever), so the same problem seems to be there. With this I am not saying that numpy is not worth it, just that for many applications (especially with huge matrices/arrays), pre-allocation does make a huge difference, especially if we want to attract more people to using numpy.

-- Carlos Becker

On Wed, Jul 20, 2011 at 1:42 AM, Chad Netzer chad.net...@gmail.com wrote:
> On Tue, Jul 19, 2011 at 6:10 PM, Pauli Virtanen p...@iki.fi wrote:
>> k = m - 0.5
>> does here the same thing as
>>     k = np.empty_like(m)
>>     np.subtract(m, 0.5, out=k)
>> The memory allocation (empty_like and the subsequent deallocation) costs essentially nothing, and there are no temporaries or copying in `subtract`.
>
> As verification:
>
> >>> import timeit
> >>> import numpy as np
> >>> t = timeit.Timer('k = m - 0.5', setup='import numpy as np; m = np.ones([8092,8092], float)')
> >>> np.mean(t.repeat(repeat=10, number=1))
> 0.53904647827148433
> >>> t = timeit.Timer('k = np.empty_like(m); np.subtract(m, 0.5, out=k)', setup='import numpy as np; m = np.ones([8092,8092], float)')
> >>> np.mean(t.repeat(repeat=10, number=1))
> 0.54006035327911373
>
> The trivial difference is expected as extra Python parsing overhead, I think. Which leads me to apologize, since in my previous post I clearly meant to type m -= 0.5, not m =- 0.5, which is *quite* a different operation... Carlos, and Lutz, please take heed. :) In fact, as Lutz pointed out, that example was not at all what I intended to show anyway.
> So, just to demonstrate how it was wrong:
>
> >>> t = timeit.Timer('m =- 0.5', setup='import numpy as np; m = np.ones([8092,8092], float)')
> >>> np.mean(t.repeat(repeat=10, number=1))
> 0.058299207687377931
> >>> t = timeit.Timer('m -= 0.5', setup='import numpy as np; m = np.ones([8092,8092], float)')
> >>> np.mean(t.repeat(repeat=10, number=1))
> 0.28192551136016847
> >>> t = timeit.Timer('np.subtract(m, 0.5, m)', setup='import numpy as np; m = np.ones([8092,8092], float)')
> >>> np.mean(t.repeat(repeat=10, number=1))
> 0.27014491558074949
> >>> t = timeit.Timer('np.subtract(m, 0.5, k)', setup='import numpy as np; m = np.ones([8092,8092], float); k = np.empty_like(m)')
> >>> np.mean(t.repeat(repeat=10, number=1))
> 0.54962997436523442
>
> Perhaps the difference in the last two simply comes down to cache effects (having to iterate over two different large memory blocks, rather than one)?
>
> -Chad
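The `m =- 0.5` vs `m -= 0.5` trap Chad apologizes for is worth spelling out: the first is parsed as `m = -0.5`, a plain rebinding of the name to a scalar (which is why it times so fast), while the second subtracts in place on the array. A small sketch:

```python
import numpy as np

m = np.ones((4, 4))
m =- 0.5          # parsed as m = -0.5: rebinds m to a Python float!
assert m == -0.5 and not isinstance(m, np.ndarray)

m = np.ones((4, 4))
m -= 0.5          # in-place subtraction: every element becomes 0.5
assert isinstance(m, np.ndarray) and (m == 0.5).all()
```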
Re: [Numpy-discussion] Array vectorization in numpy
On 19.07.2011 17:49, Carlos Becker wrote:
> - Matlab: 0.0089
> - Numpy: 0.051
> - Numpy with blitz: 0.0043
>
> Now blitz is considerably faster! Anyways, I am concerned about numpy being much slower, in this case taking 2x the time of the previous operation. I guess this is because of the way that Python operands/arguments are passed. Should I always use weave.blitz?

That depends on how many milliseconds you intend to save. CPU time is expensive, so you cannot afford to lose any...

Sturla
Re: [Numpy-discussion] Array vectorization in numpy
On 20.07.2011 08:49, Carlos Becker wrote:
> The main difference is that Matlab is able to take into account a pre-allocated array/matrix, probably avoiding the creation of a temporary and writing the results directly in the pre-allocated array. I think this is essential to speed up numpy.

As for speed, I think those who need Fortran, C or Cython know where to find it. Yes, in certain situations you can make Matlab run faster than NumPy, and vice versa. But I want to see an example of a real problem where it really matters -- not just "my faulty loop was a few milliseconds faster in Matlab". NumPy could use more intelligent memory management and reuse some storage space, but I am not sure how much difference it would make.

Sturla
Re: [Numpy-discussion] Array vectorization in numpy
On Wednesday, July 20, 2011, Carlos Becker carlosbec...@gmail.com wrote:
> Those are very interesting examples. I think that pre-allocation is very important, and something similar happens in Matlab if no pre-allocation is done: it takes 3-4x longer than with pre-allocation. The main difference is that Matlab is able to take into account a pre-allocated array/matrix, probably avoiding the creation of a temporary and writing the results directly in the pre-allocated array. I think this is essential to speed up numpy. Maybe numexpr could handle this in the future? Right now the general use of numexpr is result = numexpr.evaluate(whatever), so the same problem seems to be there. With this I am not saying that numpy is not worth it, just that for many applications (especially with huge matrices/arrays), pre-allocation does make a huge difference, especially if we want to attract more people to using numpy.

The ufuncs and many scipy functions take an out parameter where you can specify a pre-allocated array. It can be a little awkward writing expressions that way, but the capability is there.

But, ultimately, I think the main value of Python and numpy is not its speed, but rather the ease of use and how quickly one can develop working code with it. If you want to squeeze out every CPU resource, you could program in assembly, but good luck getting that linear solver done in time. Don't get me wrong, there is always room for improvement, and I would love to see numpy go even faster. However, I doubt that converting Matlab users would *require* speed to be the main selling point. Ease of development and full-featured, high-quality standard and third-party libraries have always been the top selling points for me.

Ben Root
Re: [Numpy-discussion] Array vectorization in numpy
Hi all. Thanks for the feedback. My point is not to start a war on Matlab/numpy. This comes out of my wish to switch from Matlab to something more appealing. I like numpy and Python, it being a proper language (not like Matlab scripts, whose syntax is patched and destroyed as new versions come up). I am impressed at how complete numpy is, and with numexpr as well.

In my case, sometimes it is required to process 1k images or more, and a 2x speed improvement in this case means 2 hours of processing vs 4. Someone would say 'switch to C/C++', but this is not my point. This thread came up when comparing Matlab and Python to find whether performance is somehow similar, and even if numpy were a bit slower, I would not mind. However, 3x-4x is an important difference.

I tried the operations with an output argument and didn't see much difference. I have to try that further with other examples. If I find something new I will let you know; I would be glad to switch to numpy soon ;)

Cheers, Carlos

On Wed, Jul 20, 2011 at 9:21 AM, Benjamin Root ben.r...@ou.edu wrote:
> On Wednesday, July 20, 2011, Carlos Becker carlosbec...@gmail.com wrote:
>> Those are very interesting examples. I think that pre-allocation is very important, and something similar happens in Matlab if no pre-allocation is done: it takes 3-4x longer than with pre-allocation. The main difference is that Matlab is able to take into account a pre-allocated array/matrix, probably avoiding the creation of a temporary and writing the results directly in the pre-allocated array. I think this is essential to speed up numpy. Maybe numexpr could handle this in the future? Right now the general use of numexpr is result = numexpr.evaluate(whatever), so the same problem seems to be there. With this I am not saying that numpy is not worth it, just that for many applications (especially with huge matrices/arrays), pre-allocation does make a huge difference, especially if we want to attract more people to using numpy.
> The ufuncs and many scipy functions take an out parameter where you can specify a pre-allocated array. It can be a little awkward writing expressions that way, but the capability is there.
>
> But, ultimately, I think the main value of Python and numpy is not its speed, but rather the ease of use and how quickly one can develop working code with it. If you want to squeeze out every CPU resource, you could program in assembly, but good luck getting that linear solver done in time. Don't get me wrong, there is always room for improvement, and I would love to see numpy go even faster. However, I doubt that converting Matlab users would *require* speed to be the main selling point. Ease of development and full-featured, high-quality standard and third-party libraries have always been the top selling points for me.
>
> Ben Root
Re: [Numpy-discussion] Array vectorization in numpy
On 19.07.2011 11:05, Carlos Becker wrote:
> m = rand(2000,2000);
> N = 100;
> tic;
> for I=1:N
>     k = m - 0.5;
> end
> toc / N

Here, Matlab's JIT compiler can probably hoist the loop invariant out of the loop, and just do

    I = N;
    k = m - 0.5;

Try this instead:

    for I=1:N
        k = I - 0.5;
    end

S.M.
Re: [Numpy-discussion] Array vectorization in numpy
On 20.07.2011 09:35, Carlos Becker wrote:
> In my case, sometimes it is required to process 1k images or more, and a 2x speed improvement in this case means 2 hours of processing vs 4.

Can you demonstrate that Matlab is faster than NumPy for this task?

Sturla
Re: [Numpy-discussion] Array vectorization in numpy
Wed, 20 Jul 2011 08:49:21 +0200, Carlos Becker wrote:
> Those are very interesting examples. I think that pre-allocation is very important, and something similar happens in Matlab if no pre-allocation is done: it takes 3-4x longer than with pre-allocation. The main difference is that Matlab is able to take into account a pre-allocated array/matrix, probably avoiding the creation of a temporary and writing the results directly in the pre-allocated array.

You have not demonstrated that the difference you see comes from pre-allocation. If it came from pre-allocation, how come I get the same speed as an equivalent C implementation, which *does* do pre-allocation, using exactly the same benchmark codes as you have posted?

-- Pauli Virtanen
Re: [Numpy-discussion] Array vectorization in numpy
Hi,

On Wed, Jul 20, 2011 at 2:42 AM, Chad Netzer chad.net...@gmail.com wrote:
> On Tue, Jul 19, 2011 at 6:10 PM, Pauli Virtanen p...@iki.fi wrote:
>> k = m - 0.5
>> does here the same thing as
>>     k = np.empty_like(m)
>>     np.subtract(m, 0.5, out=k)
>> The memory allocation (empty_like and the subsequent deallocation) costs essentially nothing, and there are no temporaries or copying in `subtract`.
>
> [timeit verification quoted earlier in the thread snipped]
>
> So, just to demonstrate how it was wrong

Perhaps slightly OT, but there is something very odd going on here. I would expect the performance to be in a totally different ballpark.
> >>> t = timeit.Timer('m =- 0.5', setup='import numpy as np; m = np.ones([8092,8092], float)')
> >>> np.mean(t.repeat(repeat=10, number=1))
> 0.058299207687377931

More like:

In []: %timeit m =- .5
1000 loops, best of 3: 35 ns per loop

-eat
Re: [Numpy-discussion] Array vectorization in numpy
On Tue, Jul 19, 2011 at 11:49 PM, Carlos Becker carlosbec...@gmail.com wrote:
> Those are very interesting examples.

Cool.

> I think that pre-allocation is very important, and something similar happens in Matlab if no pre-allocation is done: it takes 3-4x longer than with pre-allocation.

Can you provide a simple example of this in Matlab code? I'd like to see the code you are testing with, and the numbers you are reporting, all in one post (please). So far we've seen some code in your first post and some numbers in your follow-up, but being spread out makes it hard to know what exactly you are asserting.

> The main difference is that Matlab is able to take into account a pre-allocated array/matrix, probably avoiding the creation of a temporary and writing the results directly in the pre-allocated array.

Now I believe you are guessing. My last example showed the effect of only using a pre-allocated result array in numpy; it was still slower than an in-place operation (i.e. overwriting the array used to calculate the result), which may be due to machine memory considerations. The simple operation you are testing (an array operated on by a scalar) is dominated by the memory access speeds of reading and writing the large arrays. With a separate, pre-allocated result array, there is twice the memory to read and write, and hence twice the time. At least that's my guess; are you saying Matlab does this 3-4 times faster than numpy? I'd really like to see the *exact* code you are testing, with the specific numbers you are getting for that code, if it's not too much trouble.

> With this I am not saying that numpy is not worth it, just that for many applications (especially with huge matrices/arrays), pre-allocation does make a huge difference, especially if we want to attract more people to using numpy.

What do you mean by 'pre-allocated'? It is certainly perfectly feasible to pre-allocate numpy arrays and use them as the target of operations, as my examples showed.
And you can also easily do sums and multiplies using in-place array operations, with Python nomenclature. It's true that you have to do some work at optimizing some expressions if you wish to avoid temporary array objects being created during multi-term expression evaluations, but the manual discusses this and gives the reasons why. Is this what you mean by pre-allocation? I'm still not sure where exactly you are seeing a problem; can you show us exactly what Matlab code cannot be made to run as efficiently with numpy?

-Chad
Re: [Numpy-discussion] Array vectorization in numpy
I will be away from my computer for a week, but what I could try today shows that the Matlab JIT is doing some tricks, so the results I have shown previously for Matlab are likely to be wrong. In this sense, it seems that timings are similar between numpy and Matlab if JIT tricks are avoided. Next week I will run more tests. I am planning to summarize the results and put them somewhere on the web, since in some cases numpy + numexpr greatly outperforms Matlab. However, I will first make sure that the JIT is not shadowing the conclusions.

On 20/07/2011, at 11:04, Pauli Virtanen p...@iki.fi wrote:
> Wed, 20 Jul 2011 08:49:21 +0200, Carlos Becker wrote:
>> Those are very interesting examples. I think that pre-allocation is very important, and something similar happens in Matlab if no pre-allocation is done: it takes 3-4x longer than with pre-allocation. The main difference is that Matlab is able to take into account a pre-allocated array/matrix, probably avoiding the creation of a temporary and writing the results directly in the pre-allocated array.
>
> You have not demonstrated that the difference you see comes from pre-allocation. If it came from pre-allocation, how come I get the same speed as an equivalent C implementation, which *does* do pre-allocation, using exactly the same benchmark codes as you have posted?
>
> -- Pauli Virtanen
Re: [Numpy-discussion] Array vectorization in numpy
Wed, 20 Jul 2011 09:04:09, Pauli Virtanen wrote:
> You have not demonstrated that the difference you see comes from pre-allocation. If it came from pre-allocation, how come I get the same speed as an equivalent C implementation, which *does* do pre-allocation, using exactly the same benchmark codes as you have posted?

Ok, I looked at it on a different machine, and in the end I'll have to agree with you :)

Some interesting data points: on my Eee laptop (with the test codes below), I get

    Numpy: 0.0771
    Numpy (preallocated): 0.0383

On a different machine, on the other hand:

    Numpy: 0.0288
    Numpy (preallocated): 0.0283

For larger array sizes the situation starts to change (4000x4000):

    Numpy (allocation zeroing): 0.161
    Numpy: 0.1953
    Numpy (preallocated): 0.1153

Also interestingly,

    Zeroing (memset, per element): 1.04313e-08    # 4000x4000 array
    Zeroing (memset, per element): 1.05333e-08    # 3000x3000 array
    Zeroing (memset, per element): 1.04427e-08    # 2048x2048 array
    Zeroing (memset, per element): 2.24223e-09    # 2048x2047 array
    Zeroing (memset, per element): 2.1e-09        # 2000x2000 array
    Zeroing (memset, per element): 1.75e-09       # 200x200 array

    Zeroing (memset, preallocated, per element): 2.06667e-09  # 3000x3000
    Zeroing (memset, preallocated, per element): 2.0504e-09   # 2048x2048
    Zeroing (memset, preallocated, per element): 1.94e-09     # 200x200

There is a sharp order-of-magnitude change in the speed of malloc+memset of an array, which is not present in memset itself.
(This is then also reflected in the Numpy performance -- floating point operations probably don't cost much compared to memory access speed.) It seems that either the kernel or the C library changes the way it handles allocation at that point. So yes, it seems that you were right after all: for large arrays pre-allocation may be an issue. But the exact size limit depends on the machine etc. in question, which is why I didn't manage to see this at first.

***

In this particular case, we are somewhat limited by Python in what optimizations we can do. It is not possible to have the expression k = m - 0.5 be translated into `np.subtract(m, 0.5, k)` instead of `k = np.subtract(m, 0.5)`, because this translation is done by Python itself. If the translation is taken away from Python, e.g., by switching to lazy evaluation or via Numexpr, then things can be improved. There have been some ideas around on implementing matrix algebra lazily in this way, with the ability of reusing temporary buffers. The issue of reusing temporaries in expressions such as `a = 0.3*x + y + z` should however be possible to address within Numpy.
Pauli

#include <stdlib.h>
#include <time.h>
#include <stdio.h>
#include <string.h>

int main()
{
    double *a, *b;
    int N = 2000*2000, M = 100;
    int j;
    int k;
    clock_t start, end;

    a = (double*)malloc(sizeof(double)*N);
    b = (double*)malloc(sizeof(double)*N);

    start = clock();
    for (k = 0; k < M; ++k) {
        memset(b, '\0', sizeof(double)*N);
    }
    end = clock();
    printf("Zeroing (memset, preallocated, per element): %g\n",
           ((double)(end-start))/CLOCKS_PER_SEC/M/N);
    free(b);

    start = clock();
    for (k = 0; k < M; ++k) {
        b = (double*)malloc(sizeof(double)*N);
        memset(b, '\0', sizeof(double)*N);
        free(b);
    }
    end = clock();
    printf("Zeroing (memset, per element): %g\n",
           ((double)(end-start))/CLOCKS_PER_SEC/M/N);

    b = (double*)malloc(sizeof(double)*N);
    start = clock();
    for (k = 0; k < M; ++k) {
        for (j = 0; j < N; ++j) {
            b[j] = a[j] - 0.5;
        }
    }
    end = clock();
    printf("Operation in C (preallocated): %g\n",
           ((double)(end-start))/CLOCKS_PER_SEC/M);
    free(b);

    start = clock();
    for (k = 0; k < M; ++k) {
        b = (double*)malloc(sizeof(double)*N);
        for (j = 0; j < N; ++j) {
            b[j] = a[j] - 0.5;
        }
        free(b);
    }
    end = clock();
    printf("Operation in C: %g\n",
           ((double)(end-start))/CLOCKS_PER_SEC/M);

    return 0;
}

import time
import numpy as np

print np.__version__, np.__file__

m = np.ones([2000, 2000], float)
N = 100

print (m.size * m.dtype.itemsize) / 1e6

t1 = time.clock()
for x in range(N):
    k = np.zeros_like(m)
t2 = time.clock()
print "Numpy (allocation zeroing):", (t2 - t1) / N

t1 = time.clock()
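Pauli's `a = 0.3*x + y + z` example above can already be transformed by hand to avoid the two intermediate temporaries, at some cost in readability; a sketch of the manual transformation (not a NumPy-internal optimization; smaller arrays than the benchmarks, purely for illustration):

```python
import numpy as np

x = np.ones((1000, 1000))
y = np.ones((1000, 1000))
z = np.ones((1000, 1000))

# Naive form: allocates a temporary for 0.3*x and another for (0.3*x + y).
a_naive = 0.3 * x + y + z

# Manual form: one output allocation, remaining operations in place.
a = np.multiply(x, 0.3)  # the only allocation
a += y                   # in place
a += z                   # in place

assert np.allclose(a, a_naive)
```

For large arrays the in-place form touches less memory, which is exactly where the allocation threshold effects discussed above start to matter.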
Re: [Numpy-discussion] Array vectorization in numpy
On Wed, Jul 20, 2011 at 3:57 AM, eat e.antero.ta...@gmail.com wrote:
> Perhaps slightly OT, but there is something very odd going on here. I would expect the performance to be in a totally different ballpark.
>
>> >>> t = timeit.Timer('m =- 0.5', setup='import numpy as np; m = np.ones([8092,8092], float)')
>> >>> np.mean(t.repeat(repeat=10, number=1))
>> 0.058299207687377931
>
> More like:
>
> In []: %timeit m =- .5
> 1000 loops, best of 3: 35 ns per loop
>
> -eat

I think that's the effect of the timer having a low resolution, and the repeat value being 10, instead of the default 100. For the huge array operations, a small repeat value wasn't a problem. But my mistake made it a simple Python assignment, and for such a quick operation you need to repeat it a great many times between timer calls to get a meaningful result:

In [1]: %timeit m = -0.5
1000 loops, best of 3: 39.1 ns per loop

In [2]: import timeit
In [3]: t = timeit.Timer('m = -0.5')
In [4]: t.timeit(number=10**9)
Out[35]: 38.36219096183777

So, directly repeating the assignment a billion times puts it into 'nanoseconds per assignment' units, and the results from ipython %timeit and the timeit() call are comparable (approx 39 nanoseconds per loop).

-Chad
Re: [Numpy-discussion] Array vectorization in numpy
Wed, 20 Jul 2011 11:31:41, Pauli Virtanen wrote:
[clip]
> There is a sharp order-of-magnitude change in the speed of malloc+memset of an array, which is not present in memset itself. (This is then also reflected in the Numpy performance -- floating point operations probably don't cost much compared to memory access speed.) It seems that either the kernel or the C library changes the way it handles allocation at that point.

The explanation seems to be the following:

(a) When the process adjusts the size of its heap, the kernel must zero new pages it gives to the process (because they might contain sensitive information from other processes) [1]

(b) GNU libc hangs onto some memory even after free() is called, so that the heap size doesn't need to be adjusted continuously. This is controlled by parameters that can be tuned with the mallopt() function. [2]

Because of (a), there is a performance hit probably equivalent to `memset(buf, 0, size)` or more (kernel overheads?) for using newly allocated memory the first time. But because of (b), this hit mainly applies to buffers larger than some threshold. Preallocating can get rid of this overhead, but it probably only matters in places where you reuse the same memory many times, and the operations done are not much more expensive than whatever the kernel needs to do. Alternatively, you can call

    mallopt(M_TRIM_THRESHOLD, N);
    mallopt(M_TOP_PAD, N);
    mallopt(M_MMAP_MAX, 0);

with a large enough `N`, and let libc manage the memory reuse for you.

.. [1] http://stackoverflow.com/questions/1327261
.. [2] http://www.gnu.org/s/hello/manual/libc/Malloc-Tunable-Parameters.html

-- Pauli Virtanen
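The mallopt() calls Pauli suggests can even be tried from a Python process via ctypes; a glibc-specific sketch (the tunable constants below are glibc's values from malloc.h, and `tune_malloc` is a hypothetical helper name chosen here; on systems without glibc's mallopt this is simply a no-op):

```python
import ctypes
import ctypes.util

# glibc malloc tunables (numeric values from glibc's <malloc.h>;
# assumption: a glibc-based system -- elsewhere mallopt is absent)
M_TRIM_THRESHOLD = -1
M_TOP_PAD = -2
M_MMAP_MAX = -4

def tune_malloc(n_bytes):
    """Ask the C allocator to hang on to up to n_bytes of freed memory,
    so repeated large-array allocations reuse the same heap pages."""
    libc = ctypes.CDLL(ctypes.util.find_library("c") or "libc.so.6")
    if not hasattr(libc, "mallopt"):
        return False  # not glibc; nothing to tune
    ok = True
    ok &= bool(libc.mallopt(M_TRIM_THRESHOLD, n_bytes))
    ok &= bool(libc.mallopt(M_TOP_PAD, n_bytes))
    ok &= bool(libc.mallopt(M_MMAP_MAX, 0))  # don't use mmap for malloc
    return ok

tuned = tune_malloc(64 * 1024 * 1024)  # keep up to 64 MB of freed heap
```

Since ctypes tunes the allocator of the running process itself, calling this once at startup affects all subsequent NumPy array allocations served by malloc.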
Re: [Numpy-discussion] Array vectorization in numpy
> I think this is essential to speed up numpy. Maybe numexpr could handle this in the future? Right now the general use of numexpr is result = numexpr.evaluate(whatever), so the same problem seems to be there. With this I am not saying that numpy is not worth it, just that for many applications (especially with huge matrices/arrays), pre-allocation does make a huge difference, especially if we want to attract more people to using numpy.

> The ufuncs and many scipy functions take an out parameter where you can specify a pre-allocated array. It can be a little awkward writing expressions that way, but the capability is there.

This is a slight digression: is there a way to have out-parameter-like semantics with numexpr? I have always used it as

    a[:] = numexpr(expression)

but I don't think numexpr builds the value in place. Is it possible to have side effects with numexpr, as opposed to obtaining values, for example a = a * b + c? The documentation is not clear about this. Oh, and I do not find the out parameter awkward at all. It's very handy.

Furthermore, if I may, here is a request that the Blitz++ source be updated. It seems like there is a lot of activity on the Blitz++ repository, and weave is very handy too and can be used as easily as numexpr.
Re: [Numpy-discussion] Array vectorization in numpy
On Wed, Jul 20, 2011 at 5:52 PM, srean srean.l...@gmail.com wrote:

[clip: quoted discussion of pre-allocation and an out parameter for numexpr]

In order to make sure the 1.6 nditer supports multithreading, I adapted numexpr to use it. The branch which does this is here:

http://code.google.com/p/numexpr/source/browse/#svn%2Fbranches%2Fnewiter

This supports out, order, and casting parameters, visible here:

http://code.google.com/p/numexpr/source/browse/branches/newiter/numexpr/necompiler.py#615

It's pretty much ready to go, just needs someone to do the release management.

-Mark
[Numpy-discussion] Array vectorization in numpy
Hi, I started with numpy a few days ago. I was timing some array operations and found that numpy takes 3 or 4 times longer than Matlab on a simple array-minus-scalar operation. This looks as if there is a lack of vectorization, even though this is just a guess. I hope this is not a repost; I tried searching the mailing list archive but did not find anything related specifically to a problem like this one.

Here is the Python test code:

    from datetime import datetime
    import numpy as np

    def test():
        m = np.ones([2000, 2000], float)
        N = 100
        t1 = datetime.now()
        for x in range(N):
            k = m - 0.5
        t2 = datetime.now()
        print (t2 - t1).total_seconds() / N

And the Matlab code:

    m = rand(2000, 2000);
    N = 100;
    tic;
    for I = 1:N
        k = m - 0.5;
    end
    toc / N

I have the impression that the speed boost with Matlab is not related to Matlab optimizations, since single runs also give similar timings. I tried compiling ATLAS for SSE2 and didn't observe any difference. Any clues? Thanks, Carlos
Re: [Numpy-discussion] Array vectorization in numpy
Tue, 19 Jul 2011 11:05:18 +0200, Carlos Becker wrote: [clip]

I see essentially no performance difference:

    Matlab [7.10.0.499 (R2010a)]: 0.0321
    Numpy [1.6.0]: 0.03117567

If later versions of Matlab can parallelize the computation across multiple processors, that could be one possibility for the difference you see. Alternatively, you may have compiled Numpy with optimizations turned off.
Re: [Numpy-discussion] Array vectorization in numpy
Hi Pauli, thanks for the quick answer. Is there a way to check the optimization flags of numpy after installation? I am away from a Matlab installation now, but I remember I saw a single processor active with Matlab. I will check it again soon. Thanks!

On 19/07/2011, at 13:10, Pauli Virtanen p...@iki.fi wrote: [clip]
Re: [Numpy-discussion] Array vectorization in numpy
I made more tests with the same operation, restricting Matlab to use a single processing unit. I got:

- Matlab: 0.0063 sec avg
- Numpy: 0.026 sec avg
- Numpy with weave.blitz: 0.0041

Note that weave.blitz is even (slightly) faster than Matlab. I tried on an older computer, and I got similar results between Matlab and numpy without weave.blitz, so maybe it has to do with 'new' vectorization opcodes. Anyhow, even though these results are not very promising, it gets worse if I try to do something like:

    result = (m - 0.5)*0.3

and I get the following timings:

- Matlab: 0.0089
- Numpy: 0.051
- Numpy with blitz: 0.0043

Now blitz is considerably faster! Anyway, I am concerned about numpy being much slower, in this case taking 2x the time of the previous operation. I guess this is because of the way that python operands/arguments are passed. Should I always use weave.blitz?

Carlos

On Tue, Jul 19, 2011 at 1:40 PM, Carlos Becker carlosbec...@gmail.com wrote: [clip]
Re: [Numpy-discussion] Array vectorization in numpy
On Tue, Jul 19, 2011 at 9:49 AM, Carlos Becker carlosbec...@gmail.com wrote: [clip: single-core timings for Matlab, numpy, and weave.blitz, and the question "Should I always use weave.blitz?"]

Out of curiosity, what os/architecture are you running on? What version of numpy are you using? By and large, you shouldn't spend time programming in blitz; it will ruin the whole point of using numpy in the first place. If there is an inefficiency somewhere, it is better to fix the core problem, whatever it is.

Chuck
Re: [Numpy-discussion] Array vectorization in numpy
On Tue, Jul 19, 2011 at 11:19 AM, Charles R Harris charlesr.har...@gmail.com wrote: [clip]

Also, what version of Matlab were you using?

-Chris JS
Re: [Numpy-discussion] Array vectorization in numpy
Hi, everything was run on Linux. I am using numpy 2.0.0.dev-64fce7c, but I tried an older version (cannot remember which one now) and saw similar results. Matlab is R2011a, and I used taskset to assign its process to a single core. Linux is 32-bit, on an Intel Core i7-2630QM.

Besides the matlab/numpy comparison, I think that there is an inherent problem with how expressions are handled, in terms of efficiency. For instance,

    k = (m - 0.5)*0.3

takes 52 msec average here (2000x2000 array), while

    k = (m - 0.5)*0.3*0.2

takes 79 msec, and

    k = (m - 0.5)*0.3*0.2*0.1

takes 101 msec. Placing parentheses around the scalar multipliers shows that it seems to have to do with how expressions are handled. Is there something that can be done about this so that numpy can deal with expressions rather than single operations chained by python itself? Maybe I am missing the point as well.

-- Carlos Becker

On Tue, Jul 19, 2011 at 6:44 PM, Christopher Jordan-Squire cjord...@uw.edu wrote: [clip]
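[A small illustration of the grouping effect Carlos is hinting at: multiplying the scalars together first means one fewer pass over the array, while the chained left-to-right form does one full array pass per operator.]

```python
import numpy as np

m = np.ones((2000, 2000))

# Chained form: evaluated left to right, each '*' produces a full
# temporary array, so the 2000x2000 data is walked three times.
k1 = (m - 0.5) * 0.3 * 0.2

# Grouped form: 0.3*0.2 is folded to a single scalar first, so the
# array is walked only twice (one subtract, one multiply).
k2 = (m - 0.5) * (0.3 * 0.2)

# Algebraically identical, up to floating-point rounding.
print(np.allclose(k1, k2))
```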
Re: [Numpy-discussion] Array vectorization in numpy
On Tue, Jul 19, 2011 at 4:05 AM, Carlos Becker carlosbec...@gmail.com wrote: [clip]

Doing these kinds of timings correctly is a tricky issue, and the method you used is at fault. It is testing more than just the vectorized array-minus-scalar operation; it is also timing a range() call and list creation for the loop, as well as result object creation and deletion time, all of which add constant overhead to the result (which is itself rather small and susceptible to overhead bias). Whereas the Matlab loop-range equivalent is part of the syntax itself, and can therefore be optimized better. And depending on the type of garbage collection Matlab uses, it may defer the destruction of the temporaries until after the timing is done (i.e. when it exits, whereas Python has to destruct the object on each loop iteration the way you've written it).

First of all, use the 'timeit' module for timing:

    % python
    >>> import timeit
    >>> t = timeit.Timer('k = m - 0.5', setup='import numpy as np; m = np.ones([2000,2000],float)')
    >>> np.mean(t.repeat(repeat=100, number=1))
    0.022081942558288575

That will at least give you a more accurate timing of just the expression itself, and not the loop overhead. Furthermore, you can also reuse the m array for the result, rather than allocating a new one, which will give you a better idea of just the vectorized subtraction time:

    >>> t = timeit.Timer('m -= 0.5', setup='import numpy as np; m = np.ones([2000,2000],float)')
    >>> np.mean(t.repeat(repeat=100, number=1))
    0.015955450534820555

Note that the value has dropped considerably. In the end, what you are attempting to time is fairly simple, so any extra overhead you add that is not actually the vectorized operation will bias your results. You have to be extremely careful with these timing comparisons, since you may be comparing apples to oranges.
At the least, try to give the vectorized code much more work to do; for example, above you are operating over only about 32 MB. Try about half a gig, and compare that with Matlab, in order to reduce the percentage of overhead in your timings:

    >>> t = timeit.Timer('m -= 0.5', setup='import numpy as np; m = np.ones([8092,8092],float)')
    >>> np.mean(t.repeat(repeat=100, number=1))
    0.26796033143997194

Try comparing these examples to your existing Matlab timings, and you should find Python w/ numpy comparing favorably (or even beating Matlab). Of course, then you could improve your Matlab timings; in the end they should be almost the same when done properly. If not, by all means let us know.

-Chad
Re: [Numpy-discussion] Array vectorization in numpy
On Tue, Jul 19, 2011 at 2:27 PM, Carlos Becker carlosbec...@gmail.com wrote: [clip] Placing parentheses around the scalar multipliers shows that it seems to have to do with how expressions are handled. Is there something that can be done about this so that numpy can deal with expressions rather than single operations chained by python itself?

Numpy is constrained (when using scalars) to Python's normal expression ordering rules, which tend to evaluate left to right. So, once an expression gets promoted to an array, adding more scalars will do array math rather than being able to collapse all the scalar math. To extend my example from before:

    >>> t = timeit.Timer('k = m - 0.5 + 0.4 - 0.3 + 0.2 - 0.1', setup='import numpy as np; m = np.ones([8092,8092],float)')
    >>> np.mean(t.repeat(repeat=10, number=1))
    2.9083001852035522
    >>> t = timeit.Timer('k = 0.5 + 0.4 - 0.3 + 0.2 - 0.1 + m', setup='import numpy as np; m = np.ones([8092,8092],float)')
    >>> np.mean(t.repeat(repeat=10, number=1))
    0.52074816226959231

In the second case, the first four operations are done in scalar math, effectively collapsing the work down to '0.7 + m'; in the first case, however, the whole expression is upconverted to an array computation right from the start, making the total amount of work much greater.

If Python had a way of exposing its expression tree to objects *during execution* and allowed for delayed expression evaluation, such quirks might be avoidable. But it's a complex issue, and not always a problem in practice. In general, as with all computational numerics, you have to be aware of the underlying evaluation order and associativity to fully understand your results, and if you understand that, you can optimize it yourself.
So, to show the difference with your example:

    >>> t = timeit.Timer('k = (m - 0.5)*0.3*0.2', setup='import numpy as np; m = np.ones([8092,8092],float)')
    >>> np.mean(t.repeat(repeat=10, number=1))
    1.682368244019
    >>> t = timeit.Timer('k = 0.2*0.3*(m - 0.5)', setup='import numpy as np; m = np.ones([8092,8092],float)')
    >>> np.mean(t.repeat(repeat=10, number=1))
    1.1084311008453369

-C
Re: [Numpy-discussion] Array vectorization in numpy
On Tue, 19 Jul 2011 17:49:14 +0200, Carlos Becker wrote: [clip: single-core timings for Matlab, numpy, and weave.blitz]

To check if it's an issue with building without optimizations, look at the build log:

    C compiler: gcc -pthread -fno-strict-aliasing -ggdb -fPIC ...
    gcc: build/src.linux-x86_64-2.7/numpy/core/src/umath/umathmodule.c

I.e., look at the 'C compiler:' line nearest to the umathmodule compilation. The above is an example with no optimization.

***

For me, compared to zeroing the memory via memset and a plain C implementation (Numpy 1.6.0 / gcc):

    Blitz: 0.00746664
    Numpy: 0.00711051
    Zeroing (memset): 0.0026
    Operation in C: 0.00706667

with gcc -O3 -ffast-math -march=native -mfpmath=sse optimizations for the C code (involving SSE2 vectorization and whatnot, looking at the assembler output). Numpy is already going essentially at the maximum speed.

-------------------------------------------------------

    #include <stdlib.h>
    #include <time.h>
    #include <stdio.h>
    #include <string.h>

    int main()
    {
        double *a, *b;
        int N = 2000*2000, M = 300;
        int j;
        int k;
        clock_t start, end;

        a = (double*)malloc(sizeof(double)*N);
        b = (double*)malloc(sizeof(double)*N);

        start = clock();
        for (k = 0; k < M; ++k) {
            memset(a, '\0', sizeof(double)*N);
        }
        end = clock();
        printf("Zeroing (memset): %g\n",
               ((double)(end-start))/CLOCKS_PER_SEC/M);

        start = clock();
        for (k = 0; k < M; ++k) {
            for (j = 0; j < N; ++j) {
                b[j] = a[j] - 0.5;
            }
        }
        end = clock();
        printf("Operation in C: %g\n",
               ((double)(end-start))/CLOCKS_PER_SEC/M);

        return 0;
    }
Re: [Numpy-discussion] Array vectorization in numpy
For such expressions you should try the numexpr package. It allows the same type of optimisation as Matlab does: run a single loop over the matrix elements instead of repeated loops and intermediate object creation.

Nadav

[clip: quoted timings for k = (m - 0.5)*0.3 and its longer variants]
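[A minimal sketch of the suggestion. It assumes the third-party numexpr package is installed; a plain-numpy fallback with the scalars pre-grouped is included so the snippet runs either way.]

```python
import numpy as np

m = np.ones((2000, 2000))

try:
    import numexpr as ne
    # numexpr compiles the whole expression and evaluates it blockwise
    # in a single fused loop, without Python-level temporaries.
    k = ne.evaluate('(m - 0.5) * 0.3 * 0.2')
except ImportError:
    # Fallback: plain numpy with the scalar product folded by hand.
    k = (m - 0.5) * (0.3 * 0.2)

print(np.allclose(k, (m - 0.5) * 0.3 * 0.2))
```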
Re: [Numpy-discussion] Array vectorization in numpy
Carlos Becker wrote: [clip] Is there something that can be done about this so that numpy can deal with expressions rather than single operations chained by python itself?

Well, it is Python, and Python itself does not know anything about array math -- so you need to be careful to do that correctly yourself. Python aside, understanding how parentheses affect computation, even for algebraically equal expressions, is a good thing to understand. Aside from issues with scalars, a Python expression like:

    a = a * b * c

does:

- multiply a and b and put the result in a temporary
- multiply that temporary by c and put the result in a temporary
- assign the final temporary to a

so it does, in fact, create two unnecessary temporaries for this simple expression. If you are concerned about performance, there are ways to control that. In-place operators are one:

    a *= b
    a *= c

will not create any temporaries, and will probably be faster. It still does two loops through the data, though. If your arrays are too big to fit in cache, that could affect performance. To get around that you need to get fancy. One option is numexpr:

    http://code.google.com/p/numexpr/

numexpr takes an entire expression as input, and can thus optimize some of this at the expression level, make good use of cache, etc. -- it's pretty cool. There are a few other options, including weave, of course.

-Chris

--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/ORR            (206) 526-6959 voice
7600 Sand Point Way NE  (206) 526-6329 fax
Seattle, WA 98115       (206) 526-6317 main reception
chris.bar...@noaa.gov
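[A quick sketch of the temporaries-vs-in-place contrast described above; the array values are arbitrary, chosen so the result is easy to check.]

```python
import numpy as np

a = np.full(1000, 2.0)
b = np.full(1000, 3.0)
c = np.full(1000, 4.0)

# Expression form: builds a temporary for a*b, another for (a*b)*c,
# then binds the name 'expr' to the final temporary.
expr = a * b * c

# In-place form: no temporaries; each step overwrites a's own buffer.
a *= b
a *= c

print(np.allclose(expr, a))
```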
Re: [Numpy-discussion] Array vectorization in numpy
Thanks Chad for the explanation on those details. I am new to python and I still have a lot to learn; this was very useful. Now I get similar results between Matlab and numpy when I re-use the memory allocated for m with 'm -= 0.5'. However, if I don't, I obtain this 4x penalty with numpy, even with the 8092x8092 array. Would it be possible to do k = m - 0.5 and pre-allocate k such that python does not have to waste time on that?

Another interesting case is something like k = m + n + p, which I guess should be run with numexpr in order to accelerate it.

Regarding operator evaluation, I thought of the same thing, and it is what would happen in other languages as well if no expression 'templates' are used (such as what the Eigen library uses). I will look at numexpr. I hope I can replace all my matlab image processing and computational needs with python, and interface with C++ code that uses Eigen if I need an extra speed boost.

Right now I am trying spyder as a debugger, which looks very nice. If a good debugger is available, it could totally replace matlab/octave for researchers/engineers/etc for some specific needs. I will try numexpr now to see the performance gain. Thanks to all that replied to this topic, it was very useful.

On Tue, Jul 19, 2011 at 10:11 PM, Nadav Horesh nad...@visionsense.com wrote: [clip]
Re: [Numpy-discussion] Array vectorization in numpy
On Tue, Jul 19, 2011 at 3:35 PM, Carlos Becker carlosbec...@gmail.com wrote: [clip] However, if I don't, I obtain this 4x penalty with numpy, even with the 8092x8092 array. Would it be possible to do k = m - 0.5 and pre-allocate k such that python does not have to waste time on that?

I suspect the 4x penalty is related to the expression evaluation overhead (temporaries and copying), so hopefully numexpr() will help, or just remembering to use the in-place operators whenever appropriate. To answer your question, though: you can allocate an array, without initializing it, with the empty() function. Note - if you aren't absolutely sure you are going to overwrite every single element of the array, this could leave you with uninitialized values in your array. I'd just go ahead and use the zeros() function instead, to be safe (it's initialized outside the timeit() timing loop):

    % python
    >>> import timeit
    >>> import numpy as np
    >>> t = timeit.Timer('k = m - 0.5', setup='import numpy as np; m = np.ones([8092,8092],float); k = np.zeros(m.size, m.dtype)')
    >>> np.mean(t.repeat(repeat=10, number=1))
    0.58557529449462886
    >>> t = timeit.Timer('k = m - 0.5', setup='import numpy as np; m = np.ones([8092,8092],float)')
    >>> np.mean(t.repeat(repeat=10, number=1))
    0.53153839111328127
    >>> t = timeit.Timer('m =- 0.5', setup='import numpy as np; m = np.ones([8092,8092],float)')
    >>> np.mean(t.repeat(repeat=10, number=1))
    0.038648796081542966

As you can see, preallocation doesn't seem to affect the results all that much; it's the overhead of creating a temporary, then copying it to the result, that seems to matter here. The in-place operation was much faster.
Here we see that just copying m to k takes up more time than the 'k = m - 0.5' operation:

    >>> t = timeit.Timer('k = m.copy()', setup='import numpy as np; m = np.ones([8092,8092],float)')
    >>> np.mean(t.repeat(repeat=10, number=1))
    0.63301105499267574

Possibly that is because 8K*8K matrices are a bit too big for this kind of benchmark; I recommend also trying it with 4K*4K, and your original 2K*2K, to see if the results are consistent. Remember, the timeit() setup is hiding the initial allocation time of m from the results, but it still exists, and should be accounted for in determining the overall execution time of the in-place operation results. Also, with these large array sizes, make sure these tests are in a fresh python instance, so that the process address space isn't tainted with old object allocations (which may cause your OS to 'swap' the now unused memory, and ruin your timing values).

-Chad
Re: [Numpy-discussion] Array vectorization in numpy
On Jul 19, 2011, at 3:15 PM, Chad Netzer wrote:

    >>> t = timeit.Timer('k = m - 0.5', setup='import numpy as np; m = np.ones([8092,8092],float); k = np.zeros(m.size, m.dtype)')
    >>> np.mean(t.repeat(repeat=10, number=1))
    0.58557529449462886
    >>> t = timeit.Timer('k = m - 0.5', setup='import numpy as np; m = np.ones([8092,8092],float)')
    >>> np.mean(t.repeat(repeat=10, number=1))
    0.53153839111328127

I am surprised that there is any difference between these two approaches at all. I would have thought that in both cases a temporary array holding the result of m - 0.5 is created, which is then assigned to the variable k. However, it seems that the second approach is about 10% faster (I see a similar difference on my machine). Why would that be?

On the other hand, if I actually use the preallocated space for k, and replace the assignment by

    k[...] = m - 0.5

then this takes about 40% more time, presumably because the content of the temporary array is copied into k (which must be initialized with m.shape instead of m.size in this case).

Thanks, Lutz
Re: [Numpy-discussion] Array vectorization in numpy
On Tue, 19 Jul 2011 17:15:47 -0500, Chad Netzer wrote: [clip] I suspect the 4x penalty is related to the expression evaluation overhead (temporaries and copying), so hopefully numexpr() will help, or just remembering to use the in-place operators whenever appropriate.

Doubtful:

    k = m - 0.5

does here the same thing as

    k = np.empty_like(m)
    np.subtract(m, 0.5, out=k)

The memory allocation (empty_like and the subsequent deallocation) costs essentially nothing, and there are no temporaries or copying in `subtract`.

***

There's something else going on -- on my machine, the Numpy operation runs at exactly the same speed as C, so this issue must have a platform-dependent explanation.

-- Pauli Virtanen
Re: [Numpy-discussion] Array vectorization in numpy
On Tue, Jul 19, 2011 at 6:10 PM, Pauli Virtanen p...@iki.fi wrote: [clip] The memory allocation (empty_like and the subsequent deallocation) costs essentially nothing, and there are no temporaries or copying in `subtract`.

As verification:

    >>> import timeit
    >>> import numpy as np
    >>> t = timeit.Timer('k = m - 0.5', setup='import numpy as np; m = np.ones([8092,8092],float)')
    >>> np.mean(t.repeat(repeat=10, number=1))
    0.53904647827148433
    >>> t = timeit.Timer('k = np.empty_like(m); np.subtract(m, 0.5, out=k)', setup='import numpy as np; m = np.ones([8092,8092],float)')
    >>> np.mean(t.repeat(repeat=10, number=1))
    0.54006035327911373

The trivial difference is expected as extra python parsing overhead, I think. Which leads me to apologize, since in my previous post I clearly meant to type 'm -= 0.5', not 'm =- 0.5', which is *quite* a different operation... Carlos, and Lutz, please take heed. :) In fact, as Lutz pointed out, that example was not at all what I intended to show anyway. So, just to demonstrate how it was wrong:

    >>> t = timeit.Timer('m =- 0.5', setup='import numpy as np; m = np.ones([8092,8092],float)')
    >>> np.mean(t.repeat(repeat=10, number=1))
    0.058299207687377931
    >>> t = timeit.Timer('m -= 0.5', setup='import numpy as np; m = np.ones([8092,8092],float)')
    >>> np.mean(t.repeat(repeat=10, number=1))
    0.28192551136016847
    >>> t = timeit.Timer('np.subtract(m, 0.5, m)', setup='import numpy as np; m = np.ones([8092,8092],float)')
    >>> np.mean(t.repeat(repeat=10, number=1))
    0.27014491558074949
    >>> t = timeit.Timer('np.subtract(m, 0.5, k)', setup='import numpy as np; m = np.ones([8092,8092],float); k = np.empty_like(m)')
    >>> np.mean(t.repeat(repeat=10, number=1))
    0.54962997436523442

Perhaps the difference in the last two simply comes down to cache effects (having to iterate over two different large memory blocks, rather than one)?

-Chad