Re: [Numpy-discussion] Numpy and OpenMP

2008-03-19 Thread David Cournapeau
Charles R Harris wrote:


 Image processing may be special in that many cases are almost 
 embarrassingly parallel. Perhaps some special libraries for that sort 
 of application could be put together and just bits of C code be run on 
 different processors. Not that I know much about parallel processing, 
 but that would be my first take.

For me, the basic problem is that there is no support for this kind of 
thing in numpy right now (loading a specific implementation at runtime). I 
think it would be a worthwhile goal for 1.1: the ability to load 
different implementations at runtime (for example, loading a multi-core 
BLAS on a multi-core CPU); instead of linking atlas/mkl, they would be 
used as plug-ins. This would require significant work, though.
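
Very roughly, I am thinking of something along these lines (a hypothetical,
minimal ctypes sketch; the library names and the selection logic are made up
for illustration and are not an existing numpy API):

    import ctypes
    import ctypes.util
    import os

    def load_blas_plugin():
        # Hypothetical plug-in selection: prefer a multi-threaded BLAS build
        # on multi-core machines, fall back to a serial one, else to None
        # (i.e. numpy's own C loops).  The names "blas_mt"/"blas" are made up.
        ncpu = os.cpu_count() or 1
        candidates = ["blas_mt", "blas"] if ncpu > 1 else ["blas"]
        for name in candidates:
            path = ctypes.util.find_library(name)
            if path is not None:
                return ctypes.CDLL(path)   # loaded at runtime, not at link time
        return None

    blas = load_blas_plugin()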

cheers,

David


Re: [Numpy-discussion] Numpy and OpenMP

2008-03-17 Thread Christopher Barker
   Plus a certain amount of numpy code depends on order of
   evaluation:
  
   a[:-1] = 2*a[1:]

I'm confused here. My understanding of how it now works is that the 
above translates to:

1) create a new array (call it temp1) from a[1:], which shares a's data 
block.
2) create a temp2 array by multiplying 2 times each of the elements in 
temp1, and writing them into a new array, with a new data block
3) copy that temporary array into a[:-1]

Why couldn't step (2) be parallelized? Why isn't it already, with BLAS? 
Doesn't BLAS have such simple routines?
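
Just to make those three steps concrete, here is a tiny numpy check, as I
understand it, that the right-hand side really is evaluated into a fresh
temporary before anything is written back:

    import numpy as np

    a = np.arange(6.0)
    temp1 = a[1:]                         # step 1: a view sharing a's data block
    temp2 = 2 * temp1                     # step 2: element-wise multiply into a new data block
    print(np.may_share_memory(a, temp1))  # True:  the view shares a's memory
    print(np.may_share_memory(a, temp2))  # False: the temporary does not
    a[:-1] = temp2                        # step 3: copy the temporary into a[:-1]
    print(a)                              # [ 2.  4.  6.  8. 10.  5.]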

Also, maybe numexpr could benefit from this?

-Chris




-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR             (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

[EMAIL PROTECTED]


Re: [Numpy-discussion] Numpy and OpenMP

2008-03-17 Thread Robert Kern
On Mon, Mar 17, 2008 at 12:06 PM, Christopher Barker
[EMAIL PROTECTED] wrote:
Plus a certain amount of numpy code depends on order of
 evaluation:

 a[:-1] = 2*a[1:]

  I'm confused here. My understanding of how it now works is that the
  above translates to:

  1) create a new array (call it temp1) from a[1:], which shares a's data
  block.
  2) create a temp2 array by multiplying 2 times each of the elements in
  temp1, and writing them into a new array, with a new data block
  3) copy that temporary array into a[:-1]

  Why couldn't step (2) be parallelized? Why isn't it already, with BLAS?
  Doesn't BLAS have such simple routines?

Yes, but they are rarely optimized. We only (optionally) use the BLAS
to accelerate dot(). Using the BLAS in more fundamental parts of numpy
would be problematic from a build standpoint (or conversely a code
complexity standpoint if it remains optional).
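
For what it's worth, you can check from Python which BLAS/LAPACK, if any,
a given numpy build is linked against:

    import numpy as np
    np.show_config()   # prints the BLAS/LAPACK build information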

  Also, maybe numexpr could benefit from this?

Possibly. You can answer this definitively by writing the code to try it out.

-- 
Robert Kern

I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth.
 -- Umberto Eco


Re: [Numpy-discussion] Numpy and OpenMP

2008-03-17 Thread Francesc Altet
On Monday, 17 March 2008, Christopher Barker wrote:
Plus a certain amount of numpy code depends on order of
evaluation:
   
a[:-1] = 2*a[1:]

 I'm confused here. My understanding of how it now works is that the
 above translates to:

 1) create a new array (call it temp1) from a[1:], which shares a's
 data block.
 2) create a temp2 array by multiplying 2 times each of the elements
 in temp1, and writing them into a new array, with a new data block
 3) copy that temporary array into a[:-1]

 Why couldn't step (2) be parallelized? Why isn't it already, with
 BLAS? Doesn't BLAS have such simple routines?

Probably yes, but the problem is that this kind of operation, namely 
vector-to-vector (usually found in the BLAS1 subset of BLAS), is 
normally memory-bound, so you can take little advantage from using 
BLAS, especially on modern processors, where the gap between 
CPU throughput and memory bandwidth is quite high (and increasing).
On modern machines, the use of BLAS is more interesting for vector-matrix 
(BLAS2) computations, but it is in matrix-matrix (BLAS3) ones 
(which is where the opportunities for cache reuse are highest) that the 
speedups can really be very good.
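
As a rough illustration of the memory-bound vs. cache-friendly distinction
(timings are machine- and BLAS-dependent, of course), one can compare an
element-wise add with a matrix-matrix product on the same arrays:

    import numpy as np
    from timeit import timeit

    n = 1000
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)

    # BLAS1-like: one pass over memory, very little arithmetic per element.
    t_add = timeit(lambda: a + b, number=20)
    # BLAS3: O(n**3) flops on O(n**2) data, lots of cache reuse (uses BLAS via dot).
    t_dot = timeit(lambda: np.dot(a, b), number=20)

    print("a + b    : %.3f s for 20 runs" % t_add)
    print("dot(a, b): %.3f s for 20 runs" % t_dot)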

 Also, maybe numexpr could benefit from this?

Maybe, but unfortunately it wouldn't be able to achieve high speedups.  
Right now, numexpr is focused on accelerating mainly vector-vector 
operations (or matrix-matrix, but element-wise, much like NumPy, so 
that the cache cannot be reused), with some smart optimizations for 
strided and unaligned arrays (in this scenario, it can be 2x or 3x 
faster than NumPy, even for very simple operations like 'a+b').
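
For reference, a minimal numexpr usage (assuming the numexpr package is
installed) looks like:

    import numpy as np
    import numexpr as ne

    a = np.random.rand(1000000)
    b = np.random.rand(1000000)

    # The whole expression is evaluated chunk-by-chunk by numexpr's virtual
    # machine, so only one pass is made over the operands and no large
    # temporaries are created.
    c = ne.evaluate("2*a + 3*b")

    assert np.allclose(c, 2*a + 3*b)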

In a similar way, OpenMP (or whatever parallel paradigm) will only 
generally be useful when you have to deal with lots of data, and your 
algorithm has the opportunity to structure the data so that small 
portions of it can be reused many times.

Cheers,

-- 
0,0   Francesc Altet     http://www.carabos.com/
V   V   Cárabos Coop. V.   Enjoy Data
 -


Re: [Numpy-discussion] Numpy and OpenMP

2008-03-17 Thread Charles R Harris
On Mon, Mar 17, 2008 at 1:59 PM, Gnata Xavier [EMAIL PROTECTED]
wrote:

 Francesc Altet wrote:
 [...]

 Well, linear algebra is another topic.

 What I can see from IDL (for instance) is that it provides the user
 with a TOTAL function which takes advantage of several CPUs when the
 number of elements is large. It also provides a very simple way to set a
 max number of threads.

 I really, really would like to see something like that in numpy (just to
 be able to tell someone "switch to numpy, it is free and you will get
 exactly the same"). For now, I have a problem when they ask for parallel
 functions like TOTAL.

 For now, we can do that using C inline threaded code, but it is *complex*,
 and 2000x2000 images are now common. It is not a corner case any more.


Image processing may be special in that many cases are almost
embarrassingly parallel. Perhaps some special libraries for that sort of
application could be put together and just bits of C code be run on
different processors. Not that I know much about parallel processing, but
that would be my first take.

Chuck


Re: [Numpy-discussion] Numpy and OpenMP

2008-03-17 Thread Gnata Xavier
Charles R Harris wrote:


 On Mon, Mar 17, 2008 at 1:59 PM, Gnata Xavier [EMAIL PROTECTED] wrote:
 [...]

 Image processing may be special in that many cases are almost 
 embarrassingly parallel.

Yes, but who likes to do that?
One trivial case: divide an image by its mean:
compute the mean of the image,
then divide the image by its mean.

It should be three small lines of code, no more.
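
In numpy that is indeed just something like (with img standing in for the
image array):

    import numpy as np

    img = np.random.rand(2000, 2000)   # stand-in for a 2000x2000 image
    m = img.mean()                     # reduction over the whole image
    img /= m                           # element-wise division, in place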

Using the embarrassingly parallel paradigm to compute that, I would 
have to store the partial results and then run another executable to read them. 
Ugly. Ugly, but very common in the prototyping phase. Or it can be pipes or 
sockets or... wait, just write it in C/MPI if you want to do that. Tuning 
that C/MPI code, you will get the best performance.

Ok, fine. Fine, but in a few months quad-cores will be cheap. Using 
numpy, I know I never get the best performance on a multi-core machine 
and I do not care. I just get the best 
performance/time_needed_to_code_that ratio, by far, and that is why IMHO 
numpy is great :). The problem is that on a multi-core machine, this 
ratio is not that high, because there is no way to perform s = sum(A) in 
a maybe-sub-optimal but not mono-core way. Sublinear scaling (let's say 
real-life scaling) will always be better than nothing.
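
The closest thing I can do today from pure Python is the hand-rolled
embarrassingly parallel split, something like the sketch below; note that
with multiprocessing the chunks get copied to the worker processes, so for
something as cheap as a sum the copying can easily eat the gain, which is
exactly my point:

    import numpy as np
    from multiprocessing import Pool

    def partial_sum(chunk):
        return chunk.sum()

    if __name__ == "__main__":
        A = np.random.rand(4000, 4000)
        chunks = np.array_split(A, 4)               # split along the first axis
        with Pool(4) as pool:
            s = sum(pool.map(partial_sum, chunks))  # partial sums, then reduce
        print(s, A.sum())                           # should agree (up to rounding)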
 

Xavier

 Perhaps some special libraries for that sort of application could be 
 put together and just bits of c code be run on different processors. 
 Not that I know much about parallel processing, but that would be my 
 first take.

 Chuck

 



Re: [Numpy-discussion] Numpy and OpenMP

2008-03-17 Thread Robert Kern
On Mon, Mar 17, 2008 at 6:03 PM, Gnata Xavier [EMAIL PROTECTED] wrote:

  Ok, fine. Fine, but in a few months quad-cores will be cheap. Using
  numpy, I know I never get the best performance on a multi-core machine
  and I do not care. I just get the best
  performance/time_needed_to_code_that ratio, by far, and that is why IMHO
  numpy is great :). The problem is that on a multi-core machine, this
  ratio is not that high, because there is no way to perform s = sum(A) in
  a maybe-sub-optimal but not mono-core way. Sublinear scaling (let's say
  real-life scaling) will always be better than nothing.

Please, by all means go for it.

-- 
Robert Kern

I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth.
 -- Umberto Eco


[Numpy-discussion] Numpy and OpenMP

2008-03-15 Thread Gnata Xavier
Hi,

Numpy is great: I can see several IDL/matlab projects switching to numpy :)
However, it would be so nice to be able to put some OpenMP into the 
numpy code.

It would be nice to be able to use several CPUs using the 
numpy syntax, i.e. A = sqrt(B).

Ok, we can use some inline C/C++ code, but it is not so easy.
Ok, we can split the data over several python executables (one per CPU), 
but A = sqrt(B) is so simple...

numpy + recent gcc with OpenMP -- :) ?
Any comments?

Xavier


Re: [Numpy-discussion] Numpy and OpenMP

2008-03-15 Thread Robert Kern
On Sat, Mar 15, 2008 at 2:48 PM, Gnata Xavier [EMAIL PROTECTED] wrote:
 Hi,

   Numpy is great: I can see several IDL/matlab projects switching to numpy :)
   However, it would be so nice to be able to put some OpenMP into the
   numpy code.

   It would be nice to be able to use several CPUs using the
   numpy syntax, i.e. A = sqrt(B).

   Ok, we can use some inline C/C++ code, but it is not so easy.
   Ok, we can split the data over several python executables (one per CPU),
   but A = sqrt(B) is so simple...

   numpy + recent gcc with OpenMP -- :) ?
   Any comments?

Eric Jones tried to use multithreading to split the computation of
ufuncs across CPUs. Ultimately, the overhead of locking and unlocking
made it prohibitive for medium-sized arrays and yielded only somewhat
disappointing improvements in performance for quite large arrays. I'm
not familiar enough with OpenMP to determine if this result would be
applicable to it. If you would like to try, we can certainly give you
pointers as to where to start.

-- 
Robert Kern

I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth.
 -- Umberto Eco


Re: [Numpy-discussion] Numpy and OpenMP

2008-03-15 Thread Anne Archibald
On 15/03/2008, Damian Eads [EMAIL PROTECTED] wrote:
 Robert Kern wrote:
   Eric Jones tried to use multithreading to split the computation of
   ufuncs across CPUs. Ultimately, the overhead of locking and unlocking
   made it prohibitive for medium-sized arrays and only somewhat
   disappointing improvements in performance for quite large arrays. I'm
   not familiar enough with OpenMP to determine if this result would be
   applicable to it. If you would like to try, we can certainly give you
   pointers as to where to start.

 Perhaps I'm missing something. How is locking and synchronization an
  issue when each thread is writing to a mutually exclusive part of the
  output buffer?

The trick is to efficiently allocate these output buffers. If you
simply give each thread 1/n-th of the job and one CPU is otherwise
occupied, it doubles your computation time. If you break the job into
many pieces and let threads grab them, you need to worry about locking
to keep two threads from grabbing the same piece of data. Plus,
depending on where things are in memory you can kill performance by
abusing the caches (maintaining cache consistency across CPUs can be a
challenge). Plus a certain amount of numpy code depends on order of
evaluation:

a[:-1] = 2*a[1:]

Correctly handling all this can take a lot of overhead, and require a
lot of knowledge about hardware. OpenMP tries to take care of some of
this in a way that's easy on the programmer.

To answer the OP's question, there is a relatively small number of C
inner loops that could be marked up with OpenMP #pragmas to cover most
matrix operations. Matrix linear algebra is a separate question, since
numpy/scipy prefers to use optimized third-party libraries - in these
cases one would need to use parallel linear algebra libraries (which
do exist, I think, and are plug-compatible). So parallelizing numpy is
probably feasible, and probably not too difficult, and would be
valuable. The biggest catch, I think, would be compilation issues - is
it possible to link an OpenMP-compiled shared library into a normal
executable?

Anne


Re: [Numpy-discussion] Numpy and OpenMP

2008-03-15 Thread Scott Ransom
On Sat, Mar 15, 2008 at 07:33:51PM -0400, Anne Archibald wrote:
 ...
 To answer the OP's question, there is a relatively small number of C
 inner loops that could be marked up with OpenMP #pragmas to cover most
 matrix operations. Matrix linear algebra is a separate question, since
 numpy/scipy prefers to use optimized third-party libraries - in these
 cases one would need to use parallel linear algebra libraries (which
 do exist, I think, and are plug-compatible). So parallelizing numpy is
 probably feasible, and probably not too difficult, and would be
 valuable.

OTOH, there are reasons to _not_ want numpy to automatically use
OpenMP.  I personally have a lot of multi-core CPUs and/or
multi-processor servers that I use numpy on.  The way I use numpy
is to run a bunch of (embarrassingly) parallel numpy jobs, one for
each CPU core.  If OpenMP became standard (and it does work well
in gcc 4.2 and 4.3), we would definitely want to have control over
how it is used...
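
For example, if numpy ever grew OpenMP support, I would want my job scripts
to be able to pin each job to a single thread and run one job per core,
roughly like this (OMP_NUM_THREADS is the standard OpenMP environment
variable; the inline worker here is just a toy):

    import os
    import subprocess
    import sys

    njobs = os.cpu_count() or 1
    env = dict(os.environ, OMP_NUM_THREADS="1")    # keep each job single-threaded

    worker = "import numpy; print(numpy.arange(10).sum())"   # toy per-core job
    procs = [subprocess.Popen([sys.executable, "-c", worker], env=env)
             for _ in range(njobs)]
    for p in procs:
        p.wait()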

 The biggest catch, I think, would be compilation issues - is
 it possible to link an OpenMP-compiled shared library into a normal
 executable?

I think so.  The new gcc compilers use the libgomp library to
provide the OpenMP functionality.  I'm pretty sure it works just
like any other library.

S

-- 
Scott M. Ransom            Address:  NRAO
Phone:  (434) 296-0320               520 Edgemont Rd.
email:  [EMAIL PROTECTED]            Charlottesville, VA 22903 USA
GPG Fingerprint: 06A9 9553 78BE 16DB 407B  FFCA 9BFA B6FF FFD3 2989


Re: [Numpy-discussion] Numpy and OpenMP

2008-03-15 Thread Gnata Xavier
Scott Ransom wrote:
 On Sat, Mar 15, 2008 at 07:33:51PM -0400, Anne Archibald wrote:
   
 ...
 To answer the OP's question, there is a relatively small number of C
 inner loops that could be marked up with OpenMP #pragmas to cover most
 matrix operations. Matrix linear algebra is a separate question, since
 numpy/scipy prefers to use optimized third-party libraries - in these
 cases one would need to use parallel linear algebra libraries (which
 do exist, I think, and are plug-compatible). So parallelizing numpy is
 probably feasible, and probably not too difficult, and would be
 valuable.
 

 OTOH, there are reasons to _not_ want numpy to automatically use
 OpenMP.  I personally have a lot of multi-core CPUs and/or
 multi-processor servers that I use numpy on.  The way I use numpy
 is to run a bunch of (embarrassingly) parallel numpy jobs, one for
 each CPU core.  If OpenMP became standard (and it does work well
 in gcc 4.2 and 4.3), we would definitely want to have control over
 how it is used...

   

Embarrassingly parallel splitting is just fine in some cases (KISS), but IMHO 
there is a point to getting OpenMP into numpy.
Look at the g++ people: they have added a parallel version of the C++ STL into 
gcc 4.3. Of course the non-parallel one is still the standard/default one, but 
that is the trend.
For now we have no easy way to perform A = B + C on more than one CPU in numpy 
(except the limited embarrassingly parallel paradigm).

Yes, we want to be able to tune and to switch off (by default?) the numpy 
threading capability, but IMHO having this threading capability will always be 
better than a fully non-parallel numpy.

 

 The biggest catch, I think, would be compilation issues - is
 it possible to link an OpenMP-compiled shared library into a normal
 executable?
 

 I think so.  The new gcc compilers use the libgomp libraries to
 provide the OpenMP functionality.  I'm pretty sure those work just
 like any other libraries.

 S

   



Re: [Numpy-discussion] Numpy and OpenMP

2008-03-15 Thread Damian Eads
Anne,

Sure. I've found multi-threaded scientific computation to give mixed 
results. For some things, it results in very significant performance 
gains, and for other things, it's not worth the trouble at all. It really 
does depend on what you're doing. But I don't think it's fair to paint 
multithreaded programming with the same brush just because there exist 
pathologies.

Robert: what benchmarks were performed showing less than pleasing 
performance gains?

Anne Archibald wrote:
 On 15/03/2008, Damian Eads [EMAIL PROTECTED] wrote:
 Robert Kern wrote:
   Eric Jones tried to use multithreading to split the computation of
   ufuncs across CPUs. Ultimately, the overhead of locking and unlocking
   made it prohibitive for medium-sized arrays and only somewhat
   disappointing improvements in performance for quite large arrays. I'm
   not familiar enough with OpenMP to determine if this result would be
   applicable to it. If you would like to try, we can certainly give you
   pointers as to where to start.

 Perhaps I'm missing something. How is locking and synchronization an
  issue when each thread is writing to a mutually exclusive part of the
  output buffer?
 
 The trick is to efficiently allocate these output buffers. If you
 simply give each thread 1/n th of the job, if one CPU is otherwise
 occupied it doubles your computation time. If you break the job into
 many pieces and let threads grab them, you need to worry about locking
 to keep two threads from grabbing the same piece of data.

For element-wise unary and binary array operations, there would never be 
two threads reading from the same memory at the same time. When 
performing matrix multiplication, two or more threads will access the 
same memory, but this is fine as long as their accesses are read-only. 
The moment there is a chance one thread might need to write to the same 
buffer that one or more threads are reading from, use a read/write lock 
(pthreads supports this).

As far as coordinating the work for the threads, there are several 
possible approaches (this is not a complete list):

   1. Assign to each thread the part of the buffer to work on 
beforehand. This assumes each thread will compute at the same rate and 
will finish equal-sized chunks in roughly the same amount of time. This is 
not always a valid assumption.

   2. Assign smaller chunks, leaving a large amount of unassigned work. 
As threads complete computation of a chunk, assign them another chunk 
(a minimal Python-level sketch of this approach follows this list). 
This requires some memory to keep track of the chunks assigned and 
unassigned. Since it is possible for multiple threads to try to access 
(with at least one modifying thread) this chunk assignment structure at 
the same time, you need synchronization. In some cases, the overhead for 
doing this synchronization is minimal.

   3. Use approach #2, but assign chunks of random sizes to reduce 
contention between threads trying to access the chunk assignment 
structure at the same time.

   4. For very large jobs, have a chunk assignment server. Some of my 
experiments take several weeks and are spread across 64 processors (8 
machines, 8 processors per machine). Individual units of computation 
take anywhere from 30 minutes to 8 hours. The cost of asking the chunk 
assignment server for a new chunk is minimal relative to the amount of 
time it takes to compute on the chunk. By not assigning all the 
computation up front, most processors are working 
nearly all the time. It's only during the last day or two of the 
experiment that there are processors with nothing to do.
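
A Python-level analogue of approach #2 (not the C implementation numpy
itself would need, just an illustration of dynamic chunk hand-out) can be
written with a thread pool, whose internal queue plays the role of the
synchronized chunk-assignment structure; numpy releases the GIL inside many
of its C loops, so the threads can sometimes genuinely overlap:

    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    def process_chunk(chunk):
        # Placeholder element-wise work on one chunk of the input.
        return np.sqrt(chunk)

    A = np.random.rand(1000000)
    chunks = np.array_split(A, 64)        # many small pieces (approach #2)

    with ThreadPoolExecutor(max_workers=4) as pool:
        # map() hands chunks to whichever worker is free; results keep order.
        out = np.concatenate(list(pool.map(process_chunk, chunks)))

    assert np.allclose(out, np.sqrt(A))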

 Plus,
 depending on where things are in memory you can kill performance by
 abusing the caches (maintaining cache consistency across CPUs can be a
 challenge). Plus a certain amount of numpy code depends on order of
 evaluation:
 
 a[:-1] = 2*a[1:]

Yes, but there are many, many instances when the order of evaluation in 
an array is sequential. I'm not advocating that a numpy tool be devised to 
handle the parallelization of arbitrary computation, just common kinds 
of computation where performance gains might be realized.

 Correctly handling all this can take a lot of overhead, and require a
 lot of knowledge about hardware. OpenMP tries to take care of some of
 this in a way that's easy on the programmer.
 
 To answer the OP's question, there is a relatively small number of C
 inner loops that could be marked up with OpenMP #pragmas to cover most
 matrix operations. Matrix linear algebra is a separate question, since
 numpy/scipy prefers to use optimized third-party libraries - in these
 cases one would need to use parallel linear algebra libraries (which
 do exist, I think, and are plug-compatible).  So parallelizing numpy is
 probably feasible, and probably not too difficult, and would be
 valuable.

Yes, but there is a limit to the parallelization that can be achieved 
with vanilla numpy. numpy evaluates Python expressions, one at a time; 
thus, expressions like

   sqrt(0.5 * B 

Re: [Numpy-discussion] Numpy and OpenMP

2008-03-15 Thread Robert Kern
On Sat, Mar 15, 2008 at 8:25 PM, Damian Eads [EMAIL PROTECTED] wrote:
  Robert: what benchmarks were performed showing less than pleasing
  performance gains?

The implementation is in the multicore branch. This particular file is
the main benchmark Eric was using.

http://svn.scipy.org/svn/numpy/branches/multicore/benchmarks/time_thread.py

-- 
Robert Kern

I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth.
 -- Umberto Eco