Re: [Numpy-discussion] Array vectorization in numpy
Hi. That is really amazing. I checked out that numexpr branch and saw some strange results when evaluating expressions on a multi-core i7 processor. Running numexpr.test() yields a few 'F's, which I suppose are failing tests. I tried to let the tests finish, but it takes more than 20 min; is there any way to run the tests individually? Is there a specific mailing list for numexpr, so I can avoid 'spamming' numpy? Thanks!

-- Carlos Becker

On Wed, Jul 20, 2011 at 8:01 PM, Mark Wiebe mwwi...@gmail.com wrote:
> On Wed, Jul 20, 2011 at 5:52 PM, srean srean.l...@gmail.com wrote:
>>> I think this is essential to speed up numpy. Maybe numexpr could handle this in the future? Right now the general use of numexpr is result = numexpr.evaluate(whatever), so the same problem seems to be there. With this I am not saying that numpy is not worth it, just that for many applications (especially with huge matrices/arrays), pre-allocation does make a huge difference, especially if we want to attract more people to using numpy.
>>
>>> The ufuncs and many scipy functions take an out parameter where you can specify a pre-allocated array. It can be a little awkward writing expressions that way, but the capability is there.
>>
>> This is a slight digression: is there a way to have out-parameter-like semantics with numexpr? I have always used it as a[:] = numexpr(expression), but I don't think numexpr builds the value in place. Is it possible to have side effects with numexpr, as opposed to obtaining values, for example a = a * b + c? The documentation is not clear about this. Oh, and I do not find the out parameter awkward at all. It's very handy. Furthermore, if I may, here is a request that the Blitz++ source be updated. It seems like there is a lot of activity on the Blitz++ repository, and weave is very handy too and can be used as easily as numexpr.
>
> In order to make sure the 1.6 nditer supports multithreading, I adapted numexpr to use it.
> The branch which does this is here:
>
> http://code.google.com/p/numexpr/source/browse/#svn%2Fbranches%2Fnewiter
>
> This supports out, order, and casting parameters, visible here:
>
> http://code.google.com/p/numexpr/source/browse/branches/newiter/numexpr/necompiler.py#615
>
> It's pretty much ready to go, just needs someone to do the release management.
>
> -Mark

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
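For readers unfamiliar with the out parameter discussed in this thread, here is a minimal pure-NumPy sketch (the numexpr out= keyword in the newiter branch is intended to behave analogously; the array shapes and values below are arbitrary, chosen only for illustration):

```python
import numpy as np

a = np.full((1000, 1000), 2.0)
b = np.full_like(a, 3.0)
c = np.full_like(a, 4.0)

# Ordinary expression: NumPy allocates a temporary for a * b and
# another array for the final result.
r1 = a * b + c

# Explicit out= arguments: each step writes directly into a
# pre-allocated buffer; no temporary is made for the final result.
out = np.empty_like(a)
np.multiply(a, b, out=out)
np.add(out, c, out=out)

assert np.allclose(r1, out)  # both compute a * b + c
```

The second form is more verbose, which is the "little awkward" trade-off Ben and Mark mention; the payoff is avoiding per-iteration allocation when the same buffer is reused many times.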
Re: [Numpy-discussion] Array vectorization in numpy
> This is a slight digression: is there a way to have out-parameter-like semantics with numexpr? I have always used it as a[:] = numexpr(expression)

> In order to make sure the 1.6 nditer supports multithreading, I adapted numexpr to use it. The branch which does this is here:
>
> http://code.google.com/p/numexpr/source/browse/#svn%2Fbranches%2Fnewiter
>
> This supports out, order, and casting parameters, visible here:
>
> http://code.google.com/p/numexpr/source/browse/branches/newiter/numexpr/necompiler.py#615
>
> It's pretty much ready to go, just needs someone to do the release management.
>
> -Mark

Oh excellent, I did not know that the out parameter was available. Hope this gets in soon.
Re: [Numpy-discussion] Array vectorization in numpy
There is a total lack of vectorization in your code, so you are right about the lack of vectorization. What happens is that you see the result of the Matlab JIT compiler speeding up the loop. With a vectorized array expression, there will hardly be any difference.

Sturla

On 19.07.2011 11:05, Carlos Becker wrote:
> Hi, I started with numpy a few days ago. I was timing some array operations and found that numpy takes 3 or 4 times longer than Matlab on a simple array-minus-scalar operation. This looks as if there is a lack of vectorization, even though this is just a guess. I hope this is not a repost; I tried searching the mailing list archive but did not find anything related specifically to a problem like this one.
>
> Here is the Python test code:
>
>     from datetime import datetime
>     import numpy as np
>
>     def test():
>         m = np.ones([2000, 2000], float)
>         N = 100
>         t1 = datetime.now()
>         for x in range(N):
>             k = m - 0.5
>         t2 = datetime.now()
>         print (t2 - t1).total_seconds() / N
>
>     test()
>
> And Matlab:
>
>     m = rand(2000,2000);
>     N = 100;
>     tic;
>     for I=1:N
>         k = m - 0.5;
>     end
>     toc / N
>
> I have the impression that the speed boost with Matlab is not related to Matlab optimizations, since single runs also render similar timings. I tried compiling ATLAS for SSE2 and didn't observe any difference. Any clues?
>
> Thanks, Carlos
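As an aside, the same measurement can be made with the standard timeit module, which later posts in this thread use; a minimal sketch (the array size matches the original post, and the absolute timings will of course vary by machine):

```python
import timeit

# Build the test array once in the setup, so allocation of m is not timed.
setup = "import numpy as np; m = np.ones((2000, 2000))"
stmt = "k = m - 0.5"

# Best of 5 repeats, 10 executions each, reported per single execution.
per_run = min(timeit.repeat(stmt, setup=setup, repeat=5, number=10)) / 10
print("seconds per run: %g" % per_run)
```

Taking the minimum of several repeats is the usual convention, since it is the measurement least disturbed by other processes.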
Re: [Numpy-discussion] Array vectorization in numpy
Those are very interesting examples. I think that pre-allocation is very important, and something similar happens in Matlab if no pre-allocation is done: it takes 3-4x longer than with pre-allocation. The main difference is that Matlab is able to take a pre-allocated array/matrix into account, probably avoiding the creation of a temporary and writing the results directly into the pre-allocated array. I think this is essential to speed up numpy. Maybe numexpr could handle this in the future? Right now the general use of numexpr is result = numexpr.evaluate(whatever), so the same problem seems to be there. With this I am not saying that numpy is not worth it, just that for many applications (especially with huge matrices/arrays), pre-allocation does make a huge difference, especially if we want to attract more people to using numpy.

-- Carlos Becker

On Wed, Jul 20, 2011 at 1:42 AM, Chad Netzer chad.net...@gmail.com wrote:
> On Tue, Jul 19, 2011 at 6:10 PM, Pauli Virtanen p...@iki.fi wrote:
>> k = m - 0.5
>> does here the same thing as
>>     k = np.empty_like(m)
>>     np.subtract(m, 0.5, out=k)
>> The memory allocation (empty_like and the subsequent deallocation) costs essentially nothing, and there are no temporaries or copying in `subtract`.
>
> As verification:
>
> >>> import timeit
> >>> import numpy as np
> >>> t = timeit.Timer('k = m - 0.5', setup='import numpy as np; m = np.ones([8092,8092], float)')
> >>> np.mean(t.repeat(repeat=10, number=1))
> 0.53904647827148433
> >>> t = timeit.Timer('k = np.empty_like(m); np.subtract(m, 0.5, out=k)', setup='import numpy as np; m = np.ones([8092,8092], float)')
> >>> np.mean(t.repeat(repeat=10, number=1))
> 0.54006035327911373
>
> The trivial difference is expected as extra Python parsing overhead, I think. Which leads me to apologize, since in my previous post I clearly meant to type m -= 0.5, not m =- 0.5, which is *quite* a different operation... Carlos, and Lutz, please take heed. :) In fact, as Lutz pointed out, that example was not at all what I intended to show anyway.
> So, just to demonstrate how it was wrong:
>
> >>> t = timeit.Timer('m =- 0.5', setup='import numpy as np; m = np.ones([8092,8092], float)')
> >>> np.mean(t.repeat(repeat=10, number=1))
> 0.058299207687377931
> >>> t = timeit.Timer('m -= 0.5', setup='import numpy as np; m = np.ones([8092,8092], float)')
> >>> np.mean(t.repeat(repeat=10, number=1))
> 0.28192551136016847
> >>> t = timeit.Timer('np.subtract(m, 0.5, m)', setup='import numpy as np; m = np.ones([8092,8092], float)')
> >>> np.mean(t.repeat(repeat=10, number=1))
> 0.27014491558074949
> >>> t = timeit.Timer('np.subtract(m, 0.5, k)', setup='import numpy as np; m = np.ones([8092,8092], float); k = np.empty_like(m)')
> >>> np.mean(t.repeat(repeat=10, number=1))
> 0.54962997436523442
>
> Perhaps the difference in the last two simply comes down to cache effects (having to iterate over two different large memory blocks, rather than one)?
>
> -Chad
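The `m =- 0.5` vs `m -= 0.5` trap Chad apologizes for is worth spelling out: the first is parsed as `m = -0.5`, a plain rebinding of the name to a scalar (which is why it times so fast), while the second subtracts in place on the array. A small sketch:

```python
import numpy as np

m = np.ones((4, 4))
m =- 0.5          # parsed as m = -0.5: rebinds m to a Python float!
assert m == -0.5 and not isinstance(m, np.ndarray)

m = np.ones((4, 4))
m -= 0.5          # in-place subtraction: every element becomes 0.5
assert isinstance(m, np.ndarray) and (m == 0.5).all()
```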
Re: [Numpy-discussion] Array vectorization in numpy
On 19.07.2011 17:49, Carlos Becker wrote:
> - Matlab: 0.0089
> - Numpy: 0.051
> - Numpy with blitz: 0.0043
>
> Now blitz is considerably faster! Anyways, I am concerned about numpy being much slower, in this case taking 2x the time of the previous operation. I guess this is because of the way that Python operands/arguments are passed. Should I always use weave.blitz?

That depends on how many milliseconds you intend to save. CPU time is expensive, so you cannot afford to lose any...

Sturla
Re: [Numpy-discussion] Array vectorization in numpy
On 20.07.2011 08:49, Carlos Becker wrote:
> The main difference is that Matlab is able to take into account a pre-allocated array/matrix, probably avoiding the creation of a temporary and writing the results directly in the pre-allocated array. I think this is essential to speed up numpy.

As for speed, I think those who need Fortran, C or Cython know where to find it. Yes, in certain situations you can make Matlab run faster than NumPy, and vice versa. But I want to see an example of a real problem where it really matters -- not just "my faulty loop was a few milliseconds faster in Matlab". NumPy could use more intelligent memory management and reuse some storage space, but I am not sure how much difference it would make.

Sturla
Re: [Numpy-discussion] Array vectorization in numpy
On Wednesday, July 20, 2011, Carlos Becker carlosbec...@gmail.com wrote:
> Those are very interesting examples. I think that pre-allocation is very important, and something similar happens in Matlab if no pre-allocation is done: it takes 3-4x longer than with pre-allocation. The main difference is that Matlab is able to take into account a pre-allocated array/matrix, probably avoiding the creation of a temporary and writing the results directly in the pre-allocated array. I think this is essential to speed up numpy. Maybe numexpr could handle this in the future? Right now the general use of numexpr is result = numexpr.evaluate(whatever), so the same problem seems to be there. With this I am not saying that numpy is not worth it, just that for many applications (especially with huge matrices/arrays), pre-allocation does make a huge difference, especially if we want to attract more people to using numpy.

The ufuncs and many scipy functions take an out parameter where you can specify a pre-allocated array. It can be a little awkward writing expressions that way, but the capability is there.

But, ultimately, I think the main value of Python and numpy is not its speed, but rather the ease of use and how quickly one can develop working code with it. If you want to squeeze out every CPU resource, you could program in assembly, but good luck getting that linear solver done in time. Don't get me wrong, there is always room for improvement, and I would love to see numpy go even faster. However, I doubt that converting Matlab users would *require* speed to be the main selling point. Ease of development and full-featured, high-quality standard and third-party libraries have always been the top selling points for me.

Ben Root
Re: [Numpy-discussion] Array vectorization in numpy
Hi all. Thanks for the feedback. My point is not to start a war on Matlab/numpy. This comes out of my wish to switch from Matlab to something more appealing. I like numpy and Python, it being a proper language (not like Matlab scripts, whose syntax is patched and destroyed as new versions come up). I am impressed at how complete numpy is, and with numexpr as well.

In my case, sometimes it is required to process 1k images or more, and a 2x speed improvement in this case means 2 hours of processing vs 4. Someone would say 'switch to C/C++', but this is not my point. This thread came up when comparing Matlab and Python to find whether performance is somehow similar, and even if numpy were a bit slower, I would not mind. However, 3x-4x is an important difference.

I tried the operations with an output argument and didn't see much difference. I have to try that further with other examples. If I find something new I will let you know; I would be glad to switch to numpy soon ;)

Cheers, Carlos

On Wed, Jul 20, 2011 at 9:21 AM, Benjamin Root ben.r...@ou.edu wrote:
> On Wednesday, July 20, 2011, Carlos Becker carlosbec...@gmail.com wrote:
>> Those are very interesting examples. I think that pre-allocation is very important, and something similar happens in Matlab if no pre-allocation is done: it takes 3-4x longer than with pre-allocation. The main difference is that Matlab is able to take into account a pre-allocated array/matrix, probably avoiding the creation of a temporary and writing the results directly in the pre-allocated array. I think this is essential to speed up numpy. Maybe numexpr could handle this in the future? Right now the general use of numexpr is result = numexpr.evaluate(whatever), so the same problem seems to be there. With this I am not saying that numpy is not worth it, just that for many applications (especially with huge matrices/arrays), pre-allocation does make a huge difference, especially if we want to attract more people to using numpy.
> The ufuncs and many scipy functions take an out parameter where you can specify a pre-allocated array. It can be a little awkward writing expressions that way, but the capability is there.
>
> But, ultimately, I think the main value of Python and numpy is not its speed, but rather the ease of use and how quickly one can develop working code with it. If you want to squeeze out every CPU resource, you could program in assembly, but good luck getting that linear solver done in time. Don't get me wrong, there is always room for improvement, and I would love to see numpy go even faster. However, I doubt that converting Matlab users would *require* speed to be the main selling point. Ease of development and full-featured, high-quality standard and third-party libraries have always been the top selling points for me.
>
> Ben Root
Re: [Numpy-discussion] Array vectorization in numpy
On 19.07.2011 11:05, Carlos Becker wrote:
> m = rand(2000,2000);
> N = 100;
> tic;
> for I=1:N
>     k = m - 0.5;
> end
> toc / N

Here, Matlab's JIT compiler can probably hoist the loop invariant out of the loop, and just do

    I = N;
    k = m - 0.5;

Try this instead:

    for I=1:N
        k = I - 0.5;
    end

S.M.
Re: [Numpy-discussion] Array vectorization in numpy
On 20.07.2011 09:35, Carlos Becker wrote:
> In my case, sometimes it is required to process 1k images or more, and a 2x speed improvement in this case means 2 hours of processing vs 4.

Can you demonstrate that Matlab is faster than NumPy for this task?

Sturla
Re: [Numpy-discussion] Array vectorization in numpy
Wed, 20 Jul 2011 08:49:21 +0200, Carlos Becker wrote:
> Those are very interesting examples. I think that pre-allocation is very important, and something similar happens in Matlab if no pre-allocation is done: it takes 3-4x longer than with pre-allocation. The main difference is that Matlab is able to take into account a pre-allocated array/matrix, probably avoiding the creation of a temporary and writing the results directly in the pre-allocated array.

You have not demonstrated that the difference you see comes from pre-allocation. If it came from pre-allocation, how come I get the same speed as an equivalent C implementation, which *does* do pre-allocation, using exactly the same benchmark codes as you have posted?

-- Pauli Virtanen
Re: [Numpy-discussion] Array vectorization in numpy
Hi,

On Wed, Jul 20, 2011 at 2:42 AM, Chad Netzer chad.net...@gmail.com wrote:
> On Tue, Jul 19, 2011 at 6:10 PM, Pauli Virtanen p...@iki.fi wrote:
>> k = m - 0.5
>> does here the same thing as
>>     k = np.empty_like(m)
>>     np.subtract(m, 0.5, out=k)
>> The memory allocation (empty_like and the subsequent deallocation) costs essentially nothing, and there are no temporaries or copying in `subtract`.
>
> [timeit verification quoted earlier in the thread snipped]
>
> So, just to demonstrate how it was wrong

Perhaps slightly OT, but there is something very odd going on here. I would expect the performance to be in a totally different ballpark.
> >>> t = timeit.Timer('m =- 0.5', setup='import numpy as np; m = np.ones([8092,8092], float)')
> >>> np.mean(t.repeat(repeat=10, number=1))
> 0.058299207687377931

More like:

In []: %timeit m =- .5
1000 loops, best of 3: 35 ns per loop

-eat
Re: [Numpy-discussion] Array vectorization in numpy
On Tue, Jul 19, 2011 at 11:49 PM, Carlos Becker carlosbec...@gmail.com wrote:
> Those are very interesting examples.

Cool.

> I think that pre-allocation is very important, and something similar happens in Matlab if no pre-allocation is done: it takes 3-4x longer than with pre-allocation.

Can you provide a simple example of this in Matlab code? I'd like to see the code you are testing with, and the numbers you are reporting, all in one post (please). So far we've seen some code in your first post and some numbers in your follow-up, but being spread out makes it hard to know what exactly you are asserting.

> The main difference is that Matlab is able to take into account a pre-allocated array/matrix, probably avoiding the creation of a temporary and writing the results directly in the pre-allocated array.

Now I believe you are guessing. My last example showed the effect of only using a pre-allocated result array in numpy; it was still slower than an in-place operation (i.e. overwriting the array used to calculate the result), which may be due to machine memory considerations. The simple operation you are testing (an array operated on by a scalar) is dominated by the memory access speeds of reading and writing the large arrays. With a separate, pre-allocated result array, there is twice the memory to read and write, and hence twice the time. At least that's my guess; are you saying Matlab does this 3-4 times faster than numpy? I'd really like to see the *exact* code you are testing, with the specific numbers you are getting for that code, if it's not too much trouble.

> With this I am not saying that numpy is not worth it, just that for many applications (especially with huge matrices/arrays), pre-allocation does make a huge difference, especially if we want to attract more people to using numpy.

What do you mean by 'pre-allocated'? It is certainly perfectly feasible to pre-allocate numpy arrays and use them as the target of operations, as my examples showed.
And you can also easily do sums and multiplies using in-place array operations, with Python nomenclature. It's true that you have to do some work at optimizing some expressions if you wish to avoid temporary array objects being created during multi-term expression evaluations, but the manual discusses this and gives the reasons why. Is this what you mean by pre-allocation? I'm still not sure where exactly you are seeing a problem; can you show us exactly what Matlab code cannot be made to run as efficiently with numpy?

-Chad
Re: [Numpy-discussion] Array vectorization in numpy
I will be away from my computer for a week, but what I could try today shows that the Matlab JIT is doing some tricks, so the results I have shown previously for Matlab are likely to be wrong. In this sense, it seems that timings are similar between numpy and Matlab if JIT tricks are avoided. Next week I will run more tests. I am planning to summarize the results and put them somewhere on the web, since in some cases numpy + numexpr greatly outperforms Matlab. However, I will first make sure that the JIT is not shadowing the conclusions.

On 20/07/2011, at 11:04, Pauli Virtanen p...@iki.fi wrote:
> Wed, 20 Jul 2011 08:49:21 +0200, Carlos Becker wrote:
>> Those are very interesting examples. I think that pre-allocation is very important, and something similar happens in Matlab if no pre-allocation is done: it takes 3-4x longer than with pre-allocation. The main difference is that Matlab is able to take into account a pre-allocated array/matrix, probably avoiding the creation of a temporary and writing the results directly in the pre-allocated array.
>
> You have not demonstrated that the difference you see comes from pre-allocation. If it came from pre-allocation, how come I get the same speed as an equivalent C implementation, which *does* do pre-allocation, using exactly the same benchmark codes as you have posted?
>
> -- Pauli Virtanen
Re: [Numpy-discussion] Array vectorization in numpy
Wed, 20 Jul 2011 09:04:09, Pauli Virtanen wrote:
> You have not demonstrated that the difference you see comes from pre-allocation. If it came from pre-allocation, how come I get the same speed as an equivalent C implementation, which *does* do pre-allocation, using exactly the same benchmark codes as you have posted?

Ok, I looked at it on a different machine, and in the end I'll have to agree with you :)

Some interesting data points: on my Eee laptop (with the test codes below), I get

    Numpy: 0.0771
    Numpy (preallocated): 0.0383

On a different machine, on the other hand:

    Numpy: 0.0288
    Numpy (preallocated): 0.0283

For larger array sizes the situation starts to change (4000x4000):

    Numpy (allocation zeroing): 0.161
    Numpy: 0.1953
    Numpy (preallocated): 0.1153

Also interestingly,

    Zeroing (memset, per element): 1.04313e-08    # 4000x4000 array
    Zeroing (memset, per element): 1.05333e-08    # 3000x3000 array
    Zeroing (memset, per element): 1.04427e-08    # 2048x2048 array
    Zeroing (memset, per element): 2.24223e-09    # 2048x2047 array
    Zeroing (memset, per element): 2.1e-09        # 2000x2000 array
    Zeroing (memset, per element): 1.75e-09       # 200x200 array

    Zeroing (memset, preallocated, per element): 2.06667e-09  # 3000x3000
    Zeroing (memset, preallocated, per element): 2.0504e-09   # 2048x2048
    Zeroing (memset, preallocated, per element): 1.94e-09     # 200x200

There is a sharp order-of-magnitude change in the speed of malloc+memset of an array, which is not present in memset itself.
(This is then also reflected in the Numpy performance -- floating point operations probably don't cost much compared to memory access speed.) It seems that either the kernel or the C library changes the way it handles allocation at that point. So yes, it seems that you were right after all: for large arrays pre-allocation may be an issue. But the exact size limit depends on the machine etc. in question, which is why I didn't manage to see this at first.

***

In this particular case, we are somewhat limited by Python in what optimizations we can do. It is not possible to have the expression k = m - 0.5 be translated into `np.subtract(m, 0.5, k)` instead of `k = np.subtract(m, 0.5)`, because this translation is done by Python itself. If the translation is taken away from Python, e.g., by switching to lazy evaluation or via Numexpr, then things can be improved. There have been some ideas around on implementing matrix algebra lazily in this way, with the ability of reusing temporary buffers. The issue of reusing temporaries in expressions such as `a = 0.3*x + y + z` should however be possible to address within Numpy.
Pauli

#include <stdlib.h>
#include <time.h>
#include <stdio.h>
#include <string.h>

int main()
{
    double *a, *b;
    int N = 2000*2000, M = 100;
    int j;
    int k;
    clock_t start, end;

    a = (double*)malloc(sizeof(double)*N);
    b = (double*)malloc(sizeof(double)*N);

    start = clock();
    for (k = 0; k < M; ++k) {
        memset(b, '\0', sizeof(double)*N);
    }
    end = clock();
    printf("Zeroing (memset, preallocated, per element): %g\n",
           ((double)(end-start))/CLOCKS_PER_SEC/M/N);
    free(b);

    start = clock();
    for (k = 0; k < M; ++k) {
        b = (double*)malloc(sizeof(double)*N);
        memset(b, '\0', sizeof(double)*N);
        free(b);
    }
    end = clock();
    printf("Zeroing (memset, per element): %g\n",
           ((double)(end-start))/CLOCKS_PER_SEC/M/N);

    b = (double*)malloc(sizeof(double)*N);
    start = clock();
    for (k = 0; k < M; ++k) {
        for (j = 0; j < N; ++j) {
            b[j] = a[j] - 0.5;
        }
    }
    end = clock();
    printf("Operation in C (preallocated): %g\n",
           ((double)(end-start))/CLOCKS_PER_SEC/M);
    free(b);

    start = clock();
    for (k = 0; k < M; ++k) {
        b = (double*)malloc(sizeof(double)*N);
        for (j = 0; j < N; ++j) {
            b[j] = a[j] - 0.5;
        }
        free(b);
    }
    end = clock();
    printf("Operation in C: %g\n",
           ((double)(end-start))/CLOCKS_PER_SEC/M);

    return 0;
}

import time
import numpy as np

print np.__version__, np.__file__

m = np.ones([2000, 2000], float)
N = 100

print (m.size * m.dtype.itemsize) / 1e6

t1 = time.clock()
for x in range(N):
    k = np.zeros_like(m)
t2 = time.clock()
print "Numpy (allocation zeroing):", (t2 - t1) / N

t1 = time.clock()
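Pauli's `a = 0.3*x + y + z` example above can already be transformed by hand to avoid the two intermediate temporaries, at some cost in readability; a sketch of the manual transformation (not a NumPy-internal optimization; smaller arrays than the benchmarks, purely for illustration):

```python
import numpy as np

x = np.ones((1000, 1000))
y = np.ones((1000, 1000))
z = np.ones((1000, 1000))

# Naive form: allocates a temporary for 0.3*x and another for (0.3*x + y).
a_naive = 0.3 * x + y + z

# Manual form: one output allocation, remaining operations in place.
a = np.multiply(x, 0.3)  # the only allocation
a += y                   # in place
a += z                   # in place

assert np.allclose(a, a_naive)
```

For large arrays the in-place form touches less memory, which is exactly where the allocation threshold effects discussed above start to matter.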
Re: [Numpy-discussion] Array vectorization in numpy
On Wed, Jul 20, 2011 at 3:57 AM, eat e.antero.ta...@gmail.com wrote:
> Perhaps slightly OT, but there is something very odd going on here. I would expect the performance to be in a totally different ballpark.
>
>> >>> t = timeit.Timer('m =- 0.5', setup='import numpy as np; m = np.ones([8092,8092], float)')
>> >>> np.mean(t.repeat(repeat=10, number=1))
>> 0.058299207687377931
>
> More like:
>
> In []: %timeit m =- .5
> 1000 loops, best of 3: 35 ns per loop
>
> -eat

I think that's the effect of the timer having a low resolution, and the repeat value being 10, instead of the default 100. For the huge array operations, a small repeat value wasn't a problem. But my mistake made it a simple Python assignment, and for such a quick operation you need to repeat it a great many times between timer calls to get a meaningful result:

In [1]: %timeit m = -0.5
1000 loops, best of 3: 39.1 ns per loop

In [2]: import timeit
In [3]: t = timeit.Timer('m = -0.5')
In [4]: t.timeit(number=10**9)
Out[35]: 38.36219096183777

So, directly repeating the assignment a billion times puts it into 'nanoseconds per assignment' units, and the results from ipython %timeit and the timeit() call are comparable (approx 39 nanoseconds per loop).

-Chad
Re: [Numpy-discussion] Array vectorization in numpy
Wed, 20 Jul 2011 11:31:41, Pauli Virtanen wrote:
[clip]
> There is a sharp order-of-magnitude change in the speed of malloc+memset of an array, which is not present in memset itself. (This is then also reflected in the Numpy performance -- floating point operations probably don't cost much compared to memory access speed.) It seems that either the kernel or the C library changes the way it handles allocation at that point.

The explanation seems to be the following:

(a) When the process adjusts the size of its heap, the kernel must zero new pages it gives to the process (because they might contain sensitive information from other processes) [1]

(b) GNU libc hangs onto some memory even after free() is called, so that the heap size doesn't need to be adjusted continuously. This is controlled by parameters that can be tuned with the mallopt() function. [2]

Because of (a), there is a performance hit probably equivalent to `memset(buf, 0, size)` or more (kernel overheads?) for using newly allocated memory the first time. But because of (b), this hit mainly applies to buffers larger than some threshold. Preallocating can get rid of this overhead, but it probably only matters in places where you reuse the same memory many times, and the operations done are not much more expensive than whatever the kernel needs to do. Alternatively, you can call

    mallopt(M_TRIM_THRESHOLD, N);
    mallopt(M_TOP_PAD, N);
    mallopt(M_MMAP_MAX, 0);

with a large enough `N`, and let libc manage the memory reuse for you.

.. [1] http://stackoverflow.com/questions/1327261
.. [2] http://www.gnu.org/s/hello/manual/libc/Malloc-Tunable-Parameters.html

-- Pauli Virtanen
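The mallopt() calls Pauli suggests can even be tried from a Python process via ctypes; a glibc-specific sketch (the tunable constants below are glibc's values from malloc.h, and `tune_malloc` is a hypothetical helper name chosen here; on systems without glibc's mallopt this is simply a no-op):

```python
import ctypes
import ctypes.util

# glibc malloc tunables (numeric values from glibc's <malloc.h>;
# assumption: a glibc-based system -- elsewhere mallopt is absent)
M_TRIM_THRESHOLD = -1
M_TOP_PAD = -2
M_MMAP_MAX = -4

def tune_malloc(n_bytes):
    """Ask the C allocator to hang on to up to n_bytes of freed memory,
    so repeated large-array allocations reuse the same heap pages."""
    libc = ctypes.CDLL(ctypes.util.find_library("c") or "libc.so.6")
    if not hasattr(libc, "mallopt"):
        return False  # not glibc; nothing to tune
    ok = True
    ok &= bool(libc.mallopt(M_TRIM_THRESHOLD, n_bytes))
    ok &= bool(libc.mallopt(M_TOP_PAD, n_bytes))
    ok &= bool(libc.mallopt(M_MMAP_MAX, 0))  # don't use mmap for malloc
    return ok

tuned = tune_malloc(64 * 1024 * 1024)  # keep up to 64 MB of freed heap
```

Since ctypes tunes the allocator of the running process itself, calling this once at startup affects all subsequent NumPy array allocations served by malloc.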
Re: [Numpy-discussion] Array vectorization in numpy
> I think this is essential to speed up numpy. Maybe numexpr could handle this in the future? Right now the general use of numexpr is result = numexpr.evaluate(whatever), so the same problem seems to be there. With this I am not saying that numpy is not worth it, just that for many applications (especially with huge matrices/arrays), pre-allocation does make a huge difference, especially if we want to attract more people to using numpy.

> The ufuncs and many scipy functions take an out parameter where you can specify a pre-allocated array. It can be a little awkward writing expressions that way, but the capability is there.

This is a slight digression: is there a way to have out-parameter-like semantics with numexpr? I have always used it as

    a[:] = numexpr(expression)

but I don't think numexpr builds the value in place. Is it possible to have side effects with numexpr, as opposed to obtaining values, for example a = a * b + c? The documentation is not clear about this. Oh, and I do not find the out parameter awkward at all. It's very handy.

Furthermore, if I may, here is a request that the Blitz++ source be updated. It seems like there is a lot of activity on the Blitz++ repository, and weave is very handy too and can be used as easily as numexpr.
Re: [Numpy-discussion] Array vectorization in numpy
On Wed, Jul 20, 2011 at 5:52 PM, srean srean.l...@gmail.com wrote:

[clip: quoted discussion of pre-allocation and an out parameter for numexpr]

In order to make sure the 1.6 nditer supports multithreading, I adapted numexpr to use it. The branch which does this is here:

http://code.google.com/p/numexpr/source/browse/#svn%2Fbranches%2Fnewiter

This supports out, order, and casting parameters, visible here:

http://code.google.com/p/numexpr/source/browse/branches/newiter/numexpr/necompiler.py#615

It's pretty much ready to go, just needs someone to do the release management.

-Mark
[Numpy-discussion] Array vectorization in numpy
Hi, I started with numpy a few days ago. I was timing some array operations and found that numpy takes 3 or 4 times longer than Matlab on a simple array-minus-scalar operation. This looks as if there is a lack of vectorization, even though this is just a guess. I hope this is not a repost; I tried searching the mailing list archive but did not find anything related specifically to a problem like this one.

Here is the Python test code:

    from datetime import datetime
    import numpy as np

    def test():
        m = np.ones([2000, 2000], float)
        N = 100
        t1 = datetime.now()
        for x in range(N):
            k = m - 0.5
        t2 = datetime.now()
        print (t2 - t1).total_seconds() / N

And the Matlab code:

    m = rand(2000, 2000);
    N = 100;
    tic;
    for I = 1:N
        k = m - 0.5;
    end
    toc / N

I have the impression that the speed boost with Matlab is not related to Matlab optimizations, since single runs also give similar timings. I tried compiling ATLAS for SSE2 and didn't observe any difference. Any clues? Thanks, Carlos
Re: [Numpy-discussion] Array vectorization in numpy
Tue, 19 Jul 2011 11:05:18 +0200, Carlos Becker wrote: [clip]

I see essentially no performance difference:

    Matlab [7.10.0.499 (R2010a)]: 0.0321
    Numpy [1.6.0]: 0.03117567

If later versions of Matlab can parallelize the computation across multiple processors, that could be one possibility for the difference you see. Alternatively, you may have compiled Numpy with optimizations turned off.
Re: [Numpy-discussion] Array vectorization in numpy
Hi Pauli, thanks for the quick answer. Is there a way to check the optimization flags of numpy after installation? I am away from a Matlab installation now, but I remember I saw a single processor active with Matlab. I will check it again soon. Thanks!

On 19/07/2011, at 13:10, Pauli Virtanen p...@iki.fi wrote: [clip]
Re: [Numpy-discussion] Array vectorization in numpy
I made more tests with the same operation, restricting Matlab to use a single processing unit. I got:

- Matlab: 0.0063 sec avg
- Numpy: 0.026 sec avg
- Numpy with weave.blitz: 0.0041

Note that weave.blitz is even (slightly) faster than Matlab. I tried on an older computer, and I got similar results between Matlab and numpy without weave.blitz, so maybe it has to do with 'new' vectorization opcodes. Anyhow, even though these results are not very promising, it gets worse if I try to do something like:

    result = (m - 0.5)*0.3

and I get the following timings:

- Matlab: 0.0089
- Numpy: 0.051
- Numpy with blitz: 0.0043

Now blitz is considerably faster! Anyway, I am concerned about numpy being much slower, in this case taking 2x the time of the previous operation. I guess this is because of the way that python operands/arguments are passed. Should I always use weave.blitz?

Carlos

On Tue, Jul 19, 2011 at 1:40 PM, Carlos Becker carlosbec...@gmail.com wrote: [clip]
Re: [Numpy-discussion] Array vectorization in numpy
On Tue, Jul 19, 2011 at 9:49 AM, Carlos Becker carlosbec...@gmail.com wrote: [clip: single-core timings for Matlab, numpy, and weave.blitz, and the question "Should I always use weave.blitz?"]

Out of curiosity, what os/architecture are you running on? What version of numpy are you using? By and large, you shouldn't spend time programming in blitz; it will ruin the whole point of using numpy in the first place. If there is an inefficiency somewhere, it is better to fix the core problem, whatever it is.

Chuck
Re: [Numpy-discussion] Array vectorization in numpy
On Tue, Jul 19, 2011 at 11:19 AM, Charles R Harris charlesr.har...@gmail.com wrote: [clip]

Also, what version of Matlab were you using?

-Chris JS
Re: [Numpy-discussion] Array vectorization in numpy
Hi, everything was run on Linux. I am using numpy 2.0.0.dev-64fce7c, but I tried an older version (cannot remember which one now) and saw similar results. Matlab is R2011a, and I used taskset to assign its process to a single core. Linux is 32-bit, on an Intel Core i7-2630QM.

Besides the matlab/numpy comparison, I think that there is an inherent problem with how expressions are handled, in terms of efficiency. For instance,

    k = (m - 0.5)*0.3

takes 52 msec average here (2000x2000 array), while

    k = (m - 0.5)*0.3*0.2

takes 79 msec, and

    k = (m - 0.5)*0.3*0.2*0.1

takes 101 msec. Placing parentheses around the scalar multipliers shows that it seems to have to do with how expressions are handled. Is there something that can be done about this so that numpy can deal with expressions rather than single operations chained by python itself? Maybe I am missing the point as well.

-- Carlos Becker

On Tue, Jul 19, 2011 at 6:44 PM, Christopher Jordan-Squire cjord...@uw.edu wrote: [clip]
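[A small illustration of the grouping effect Carlos is hinting at: multiplying the scalars together first means one fewer pass over the array, while the chained left-to-right form does one full array pass per operator.]

```python
import numpy as np

m = np.ones((2000, 2000))

# Chained form: evaluated left to right, each '*' produces a full
# temporary array, so the 2000x2000 data is walked three times.
k1 = (m - 0.5) * 0.3 * 0.2

# Grouped form: 0.3*0.2 is folded to a single scalar first, so the
# array is walked only twice (one subtract, one multiply).
k2 = (m - 0.5) * (0.3 * 0.2)

# Algebraically identical, up to floating-point rounding.
print(np.allclose(k1, k2))
```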
Re: [Numpy-discussion] Array vectorization in numpy
On Tue, Jul 19, 2011 at 4:05 AM, Carlos Becker carlosbec...@gmail.com wrote: [clip]

Doing these kinds of timings correctly is a tricky issue, and the method you used is at fault. It is testing more than just the vectorized array-minus-scalar operation; it is also timing a range() call and list creation for the loop, as well as result object creation and deletion time, all of which add constant overhead to the result (which is itself rather small and susceptible to overhead bias). Whereas the Matlab loop-range equivalent is part of the syntax itself, and can therefore be optimized better. And depending on the type of garbage collection Matlab uses, it may defer the destruction of the temporaries until after the timing is done (i.e. when it exits, whereas Python has to destruct the object on each loop iteration the way you've written it).

First of all, use the 'timeit' module for timing:

    % python
    >>> import timeit
    >>> t = timeit.Timer('k = m - 0.5', setup='import numpy as np; m = np.ones([2000,2000],float)')
    >>> np.mean(t.repeat(repeat=100, number=1))
    0.022081942558288575

That will at least give you a more accurate timing of just the expression itself, and not the loop overhead. Furthermore, you can also reuse the m array for the result, rather than allocating a new one, which will give you a better idea of just the vectorized subtraction time:

    >>> t = timeit.Timer('m -= 0.5', setup='import numpy as np; m = np.ones([2000,2000],float)')
    >>> np.mean(t.repeat(repeat=100, number=1))
    0.015955450534820555

Note that the value has dropped considerably. In the end, what you are attempting to time is fairly simple, so any extra overhead you add that is not actually the vectorized operation will bias your results. You have to be extremely careful with these timing comparisons, since you may be comparing apples to oranges.
At the least, try to give the vectorized code much more work to do; for example, above you are operating over only about 32 MB. Try about half a gig, and compare that with Matlab, in order to reduce the percentage of overhead in your timings:

    >>> t = timeit.Timer('m -= 0.5', setup='import numpy as np; m = np.ones([8092,8092],float)')
    >>> np.mean(t.repeat(repeat=100, number=1))
    0.26796033143997194

Try comparing these examples to your existing Matlab timings, and you should find Python w/ numpy comparing favorably (or even beating Matlab). Of course, then you could improve your Matlab timings; in the end they should be almost the same when done properly. If not, by all means let us know.

-Chad
Re: [Numpy-discussion] Array vectorization in numpy
On Tue, Jul 19, 2011 at 2:27 PM, Carlos Becker carlosbec...@gmail.com wrote: [clip] Placing parentheses around the scalar multipliers shows that it seems to have to do with how expressions are handled. Is there something that can be done about this so that numpy can deal with expressions rather than single operations chained by python itself?

Numpy is constrained (when using scalars) to Python's normal expression ordering rules, which tend to evaluate left to right. So, once an expression gets promoted to an array, adding more scalars will do array math rather than being able to collapse all the scalar math. To extend my example from before:

    >>> t = timeit.Timer('k = m - 0.5 + 0.4 - 0.3 + 0.2 - 0.1', setup='import numpy as np; m = np.ones([8092,8092],float)')
    >>> np.mean(t.repeat(repeat=10, number=1))
    2.9083001852035522
    >>> t = timeit.Timer('k = 0.5 + 0.4 - 0.3 + 0.2 - 0.1 + m', setup='import numpy as np; m = np.ones([8092,8092],float)')
    >>> np.mean(t.repeat(repeat=10, number=1))
    0.52074816226959231

In the second case, the first four operations are done in scalar math, effectively collapsing the work down to '0.7 + m'; in the first case, however, the whole expression is upconverted to an array computation right from the start, making the total amount of work much greater.

If Python had a way of exposing its expression tree to objects *during execution* and allowed for delayed expression evaluation, such quirks might be avoidable. But it's a complex issue, and not always a problem in practice. In general, as with all computational numerics, you have to be aware of the underlying evaluation order and associativity to fully understand your results, and if you understand that, you can optimize it yourself.
So, to show the difference with your example:

    >>> t = timeit.Timer('k = (m - 0.5)*0.3*0.2', setup='import numpy as np; m = np.ones([8092,8092],float)')
    >>> np.mean(t.repeat(repeat=10, number=1))
    1.682368244019
    >>> t = timeit.Timer('k = 0.2*0.3*(m - 0.5)', setup='import numpy as np; m = np.ones([8092,8092],float)')
    >>> np.mean(t.repeat(repeat=10, number=1))
    1.1084311008453369

-C
Re: [Numpy-discussion] Array vectorization in numpy
On Tue, 19 Jul 2011 17:49:14 +0200, Carlos Becker wrote: [clip: single-core timings for Matlab, numpy, and weave.blitz]

To check if it's an issue with building without optimizations, look at the build log:

    C compiler: gcc -pthread -fno-strict-aliasing -ggdb -fPIC ...
    gcc: build/src.linux-x86_64-2.7/numpy/core/src/umath/umathmodule.c

I.e., look at the 'C compiler:' line nearest to the umathmodule compilation. The above is an example with no optimization.

***

For me, compared to zeroing the memory via memset and a plain C implementation (Numpy 1.6.0 / gcc):

    Blitz: 0.00746664
    Numpy: 0.00711051
    Zeroing (memset): 0.0026
    Operation in C: 0.00706667

with gcc -O3 -ffast-math -march=native -mfpmath=sse optimizations for the C code (involving SSE2 vectorization and whatnot, looking at the assembler output). Numpy is already going essentially at the maximum speed.

-------------------------------------------------------

    #include <stdlib.h>
    #include <time.h>
    #include <stdio.h>
    #include <string.h>

    int main()
    {
        double *a, *b;
        int N = 2000*2000, M = 300;
        int j;
        int k;
        clock_t start, end;

        a = (double*)malloc(sizeof(double)*N);
        b = (double*)malloc(sizeof(double)*N);

        start = clock();
        for (k = 0; k < M; ++k) {
            memset(a, '\0', sizeof(double)*N);
        }
        end = clock();
        printf("Zeroing (memset): %g\n",
               ((double)(end-start))/CLOCKS_PER_SEC/M);

        start = clock();
        for (k = 0; k < M; ++k) {
            for (j = 0; j < N; ++j) {
                b[j] = a[j] - 0.5;
            }
        }
        end = clock();
        printf("Operation in C: %g\n",
               ((double)(end-start))/CLOCKS_PER_SEC/M);

        return 0;
    }
Re: [Numpy-discussion] Array vectorization in numpy
For such expressions you should try the numexpr package. It allows the same type of optimisation as Matlab does: run a single loop over the matrix elements instead of repeated loops and intermediate object creation.

Nadav

[clip: quoted timings for k = (m - 0.5)*0.3 and its longer variants]
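[A minimal sketch of the suggestion. It assumes the third-party numexpr package is installed; a plain-numpy fallback with the scalars pre-grouped is included so the snippet runs either way.]

```python
import numpy as np

m = np.ones((2000, 2000))

try:
    import numexpr as ne
    # numexpr compiles the whole expression and evaluates it blockwise
    # in a single fused loop, without Python-level temporaries.
    k = ne.evaluate('(m - 0.5) * 0.3 * 0.2')
except ImportError:
    # Fallback: plain numpy with the scalar product folded by hand.
    k = (m - 0.5) * (0.3 * 0.2)

print(np.allclose(k, (m - 0.5) * 0.3 * 0.2))
```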
Re: [Numpy-discussion] Array vectorization in numpy
Carlos Becker wrote: [clip] Is there something that can be done about this so that numpy can deal with expressions rather than single operations chained by python itself?

Well, it is Python, and Python itself does not know anything about array math -- so you need to be careful to do that correctly yourself. Python aside, understanding how parentheses affect computation, even for algebraically equal expressions, is a good thing to understand. Aside from issues with scalars, a Python expression like:

    a = a * b * c

does:

- multiply a and b and put the result in a temporary
- multiply that temporary by c and put the result in a temporary
- assign the final temporary to a

so it does, in fact, create two unnecessary temporaries for this simple expression. If you are concerned about performance, there are ways to control that. In-place operators are one:

    a *= b
    a *= c

will not create any temporaries, and will probably be faster. It still does two loops through the data, though. If your arrays are too big to fit in cache, that could affect performance. To get around that you need to get fancy. One option is numexpr:

    http://code.google.com/p/numexpr/

numexpr takes an entire expression as input, and can thus optimize some of this at the expression level, make good use of cache, etc. -- it's pretty cool. There are a few other options, including weave, of course.

-Chris

--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/ORR            (206) 526-6959 voice
7600 Sand Point Way NE  (206) 526-6329 fax
Seattle, WA 98115       (206) 526-6317 main reception
chris.bar...@noaa.gov
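[A quick sketch of the temporaries-vs-in-place contrast described above; the array values are arbitrary, chosen so the result is easy to check.]

```python
import numpy as np

a = np.full(1000, 2.0)
b = np.full(1000, 3.0)
c = np.full(1000, 4.0)

# Expression form: builds a temporary for a*b, another for (a*b)*c,
# then binds the name 'expr' to the final temporary.
expr = a * b * c

# In-place form: no temporaries; each step overwrites a's own buffer.
a *= b
a *= c

print(np.allclose(expr, a))
```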
Re: [Numpy-discussion] Array vectorization in numpy
Thanks Chad for the explanation on those details. I am new to python and I still have a lot to learn; this was very useful. Now I get similar results between Matlab and numpy when I re-use the memory allocated for m with 'm -= 0.5'. However, if I don't, I obtain this 4x penalty with numpy, even with the 8092x8092 array. Would it be possible to do k = m - 0.5 and pre-allocate k such that python does not have to waste time on that?

Another interesting case is something like k = m + n + p, which I guess should be run with numexpr in order to accelerate it.

Regarding operator evaluation, I thought of the same thing, and it is what would happen in other languages as well if no expression 'templates' are used (such as what the Eigen library uses). I will look at numexpr. I hope I can replace all my matlab image processing and computational needs with python, and interface with C++ code that uses Eigen if I need an extra speed boost.

Right now I am trying spyder as a debugger, which looks very nice. If a good debugger is available, it could totally replace matlab/octave for researchers/engineers/etc for some specific needs. I will try numexpr now to see the performance gain. Thanks to all that replied to this topic, it was very useful.

On Tue, Jul 19, 2011 at 10:11 PM, Nadav Horesh nad...@visionsense.com wrote: [clip]
Re: [Numpy-discussion] Array vectorization in numpy
On Tue, Jul 19, 2011 at 3:35 PM, Carlos Becker carlosbec...@gmail.com wrote: [clip] However, if I don't, I obtain this 4x penalty with numpy, even with the 8092x8092 array. Would it be possible to do k = m - 0.5 and pre-allocate k such that python does not have to waste time on that?

I suspect the 4x penalty is related to the expression evaluation overhead (temporaries and copying), so hopefully numexpr() will help, or just remembering to use the in-place operators whenever appropriate. To answer your question, though: you can allocate an array, without initializing it, with the empty() function. Note - if you aren't absolutely sure you are going to overwrite every single element of the array, this could leave you with uninitialized values in your array. I'd just go ahead and use the zeros() function instead, to be safe (it's initialized outside the timeit() timing loop):

    % python
    >>> import timeit
    >>> import numpy as np
    >>> t = timeit.Timer('k = m - 0.5', setup='import numpy as np; m = np.ones([8092,8092],float); k = np.zeros(m.size, m.dtype)')
    >>> np.mean(t.repeat(repeat=10, number=1))
    0.58557529449462886
    >>> t = timeit.Timer('k = m - 0.5', setup='import numpy as np; m = np.ones([8092,8092],float)')
    >>> np.mean(t.repeat(repeat=10, number=1))
    0.53153839111328127
    >>> t = timeit.Timer('m =- 0.5', setup='import numpy as np; m = np.ones([8092,8092],float)')
    >>> np.mean(t.repeat(repeat=10, number=1))
    0.038648796081542966

As you can see, preallocation doesn't seem to affect the results all that much; it's the overhead of creating a temporary, then copying it to the result, that seems to matter here. The in-place operation was much faster.
Here we see that just copying m to k takes up more time than the 'k = m - 0.5' operation:

    >>> t = timeit.Timer('k = m.copy()', setup='import numpy as np; m = np.ones([8092,8092],float)')
    >>> np.mean(t.repeat(repeat=10, number=1))
    0.63301105499267574

Possibly that is because 8K*8K matrices are a bit too big for this kind of benchmark; I recommend also trying it with 4K*4K, and your original 2K*2K, to see if the results are consistent. Remember, the timeit() setup is hiding the initial allocation time of m from the results, but it still exists, and should be accounted for in determining the overall execution time of the in-place operation results. Also, with these large array sizes, make sure these tests are in a fresh python instance, so that the process address space isn't tainted with old object allocations (which may cause your OS to 'swap' the now unused memory, and ruin your timing values).

-Chad
Re: [Numpy-discussion] Array vectorization in numpy
On Jul 19, 2011, at 3:15 PM, Chad Netzer wrote:

    >>> t = timeit.Timer('k = m - 0.5', setup='import numpy as np; m = np.ones([8092,8092],float); k = np.zeros(m.size, m.dtype)')
    >>> np.mean(t.repeat(repeat=10, number=1))
    0.58557529449462886
    >>> t = timeit.Timer('k = m - 0.5', setup='import numpy as np; m = np.ones([8092,8092],float)')
    >>> np.mean(t.repeat(repeat=10, number=1))
    0.53153839111328127

I am surprised that there is any difference between these two approaches at all. I would have thought that in both cases a temporary array holding the result of m - 0.5 is created, which is then assigned to the variable k. However, it seems that the second approach is about 10% faster (I see a similar difference on my machine). Why would that be?

On the other hand, if I actually use the preallocated space for k, and replace the assignment by

    k[...] = m - 0.5

then this takes about 40% more time, presumably because the content of the temporary array is copied into k (which must be initialized with m.shape instead of m.size in this case).

Thanks, Lutz
Re: [Numpy-discussion] Array vectorization in numpy
On Tue, 19 Jul 2011 17:15:47 -0500, Chad Netzer wrote: [clip] I suspect the 4x penalty is related to the expression evaluation overhead (temporaries and copying), so hopefully numexpr() will help, or just remembering to use the in-place operators whenever appropriate.

Doubtful:

    k = m - 0.5

does here the same thing as

    k = np.empty_like(m)
    np.subtract(m, 0.5, out=k)

The memory allocation (empty_like and the subsequent deallocation) costs essentially nothing, and there are no temporaries or copying in `subtract`.

***

There's something else going on -- on my machine, the Numpy operation runs at exactly the same speed as C, so this issue must have a platform-dependent explanation.

-- Pauli Virtanen
Re: [Numpy-discussion] Array vectorization in numpy
On Tue, Jul 19, 2011 at 6:10 PM, Pauli Virtanen p...@iki.fi wrote: [clip] The memory allocation (empty_like and the subsequent deallocation) costs essentially nothing, and there are no temporaries or copying in `subtract`.

As verification:

    >>> import timeit
    >>> import numpy as np
    >>> t = timeit.Timer('k = m - 0.5', setup='import numpy as np; m = np.ones([8092,8092],float)')
    >>> np.mean(t.repeat(repeat=10, number=1))
    0.53904647827148433
    >>> t = timeit.Timer('k = np.empty_like(m); np.subtract(m, 0.5, out=k)', setup='import numpy as np; m = np.ones([8092,8092],float)')
    >>> np.mean(t.repeat(repeat=10, number=1))
    0.54006035327911373

The trivial difference is expected as extra python parsing overhead, I think. Which leads me to apologize, since in my previous post I clearly meant to type 'm -= 0.5', not 'm =- 0.5', which is *quite* a different operation... Carlos, and Lutz, please take heed. :) In fact, as Lutz pointed out, that example was not at all what I intended to show anyway. So, just to demonstrate how it was wrong:

    >>> t = timeit.Timer('m =- 0.5', setup='import numpy as np; m = np.ones([8092,8092],float)')
    >>> np.mean(t.repeat(repeat=10, number=1))
    0.058299207687377931
    >>> t = timeit.Timer('m -= 0.5', setup='import numpy as np; m = np.ones([8092,8092],float)')
    >>> np.mean(t.repeat(repeat=10, number=1))
    0.28192551136016847
    >>> t = timeit.Timer('np.subtract(m, 0.5, m)', setup='import numpy as np; m = np.ones([8092,8092],float)')
    >>> np.mean(t.repeat(repeat=10, number=1))
    0.27014491558074949
    >>> t = timeit.Timer('np.subtract(m, 0.5, k)', setup='import numpy as np; m = np.ones([8092,8092],float); k = np.empty_like(m)')
    >>> np.mean(t.repeat(repeat=10, number=1))
    0.54962997436523442

Perhaps the difference in the last two simply comes down to cache effects (having to iterate over two different large memory blocks, rather than one)?

-Chad