Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-24 Thread Matthew Brett
Hi,

  Note that the plug-in idea is just my own idea, it is not something
  agreed by anyone else. So maybe it won't be done for numpy 1.1, or at
  all. It depends on the main maintainers of numpy.

I'm +3 for the plugin idea - it would have huge benefits for
installation and automatic optimization.  What needs to be done?  Who
could do it?

Matthew


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-24 Thread Gnata Xavier
David Cournapeau wrote:
 Gnata Xavier wrote:
   
 Ok, I will try to see what I can do, but it is sure that we do need the 
 plug-in system first (read: before the threads, in the numpy release). 
 During the devel of 1.1, I will try to find some time to understand 
 where I should put some pragmas into the ufuncs using a very conservative 
 approach. Anyone with some OpenMP knowledge is welcome, because I'm 
 not an OpenMP expert but only an OpenMP user in my C/C++ codes.
 

 Note that the plug-in idea is just my own idea, it is not something 
 agreed by anyone else. So maybe it won't be done for numpy 1.1, or at 
 all. It depends on the main maintainers of numpy.

   
 and the results :

 size         loops     with OpenMP (s)   without (s)
 10000000        80          10.308471       30.007250
 1000000        160           1.902563        5.800172
 100000         320           0.543008        1.123274
 10000          640           0.206823        0.223031
 1000          1280           0.088898        0.044268
 100           2560           0.150429        0.008880
 10            5120           0.289589        0.002084

  --- On this machine, we should start to use threads *in this testcase* 
 iff size >= 10000 (a 100*100 image is a very very small one :))
 

 Maybe openMP can be more clever, but it tends to show that openMP, when 
 used naively, can *not* decide how many threads to use. That's really 
 the core problem: again, I don't know much about openMP, but almost any 
 project using multi-threading/multi-processing and not being embarrassingly 
 parallel has the problem that it makes things much slower in the many cases 
 where thread creation/management and the like have a lot of overhead 
 relative to the computation. The problem is to determine that N, 
 dynamically, or in a way which works well for most cases. OpenMP was 
 created for HPC, where you have very large data; it is not so obvious to 
 me that it is adapted to numpy, which has to be much more flexible. Being 
 fast on a given problem is easy; being fast on a whole range, that's 
 another story: the problem really is to be as fast as before on small 
 arrays.

 The fact that matlab, while having many more resources than us, took 
 years to do it makes me extremely skeptical about the efficient use of 
 multi-threading in numpy without real benchmarks. They have a dedicated 
 team, who developed a JIT for matlab, which inserts multi-threaded code 
 on the fly (for m-files, not when you are in the interpreter), and which 
 uses multi-threaded blas/lapack (which is already available in numpy 
 depending on the blas/lapack you are using).

 But again, and that's really the only thing I have to say: prove me wrong :)

 David
   
I can't :) And for a simple reason; quoting the IDL documentation:

There are instances when allowing IDL to use its default thread pool 
settings can lead to undesired results. In some instances, a 
multithreaded implementation using the thread pool may actually take 
longer to complete a given job than a single-threaded implementation.
http://idlastro.gsfc.nasa.gov/idl_html_help/The_IDL_Thread_Pool.html

To prevent the use of the thread pool for computations that involve too 
few data elements, IDL supports a minimum threshold value for thread 
pool computations. The minimum threshold value is contained in the 
TPOOL_MIN_ELTS field of the !CPU system variable. See the following 
sections for details on modifying this value.

At work, I can see people switching from IDL to numpy/scipy/pylab.
They are very happy with numpy, but they would like to find this thread pool 
capability in numpy.
All these guys come from C (or from Fortran), often from C/Fortran MPI 
or OpenMP.
They know which parts of a code should be threaded and which parts should 
not. As a result, they are very happy with the IDL thread pool.

I'm just thinking about how to translate that into numpy.
Now I have to have a close look at the ufunc code and figure out how 
to add -fopenmp.

 From a very pragmatic point of view :

What is the best/simplest way to use inline C or whatever to do this :

I have a large array A and, at some points of my nice numpy code, I 
would like to compute, let's say, the threaded sum or the sine of this 
array, assuming that I know how to write it in C/OpenMP code. (The 
background is that I really know that in my case it is much faster... and I 
asked my boss for a multi-core machine ;)).
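
To make the question concrete, here is the sort of C I have in mind -- a minimal
sketch only, assuming gcc with OpenMP support; the file name, function name and
build line are illustrative, not an existing numpy API:

/* threaded_sum.c -- illustrative sketch only, not numpy code.
 * A contiguous double array is summed with an OpenMP reduction: each
 * thread accumulates a private partial sum and OpenMP combines them.
 * Build (assuming gcc with OpenMP support):
 *   gcc -O2 -fopenmp -shared -fPIC threaded_sum.c -o libthreaded_sum.so
 */
double threaded_sum(const double *data, long n)
{
    double total = 0.0;
    long i;
#pragma omp parallel for reduction(+:total)
    for (i = 0; i < n; i++) {
        total += data[i];
    }
    return total;
}

From Python one could then load the shared library with ctypes and hand it
A.ctypes.data and A.size; the sine case would be the same pattern, with a plain
parallel for writing into an output array.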


Cheers,
Xavier







Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-24 Thread Joe Harrington
A couple of thoughts on parallelism:

1. Can someone come up with a small set of cases and time them on
numpy, IDL, Matlab, and C, using various parallel schemes, for each of
a representative set of architectures?  We're comparing a benchmark to
itself on different architectures, rather than seeing whether the
thread capability is helping our competition on the same architecture.
If it's mostly not helping them, we can forget it for the time being.
I suspect that it is, in fact, helping them, or at least not hurting
them.

2. Would it slow things much to have some state that the routines
check before deciding whether to run a parallel implementation or not?
It could default to single thread except in the cases where
parallelism always helps, but the user can configure it to multithread
beyond certain thresholds of, say, number of elements.  Then, in the
short term, a savvy user can tweak that state to get parallelism for
more than N elements.  In the longer term, there could be a test
routine that would run on install and configure the state for that
particular machine.  When numpy started it would read the saved file
and computation would be optimized for that machine.  The user could
always override it.

3. We should remember the first rule of parallel programming, which
Anne quotes as "premature optimization is the root of all evil".
There is a lot to fix in numpy that is more fundamental than speed.  I
am the first to want things fast (I would love my secondary eclipse
analysis to run in less than a week), but we have gaping holes in
documentation and other areas that one would expect to have been
filled before a 1.0 release.  I hope we can get them filled for 1.1.
It bears repeating that our main resource shortage is in person-hours,
and we'll get more of those as the community grows.  Right now our
deficit in documentation is hurting us badly, while our deficit in
parallelism is not.  There is no faster way of growing the community
than making it trivial to learn how to use numpy without hand-holding
from an experienced user.  Let's explore parallelism to assess when
and how it might be right to do it, but let's stay focussed on the
fundamentals until we have those nailed.

--jh--


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-24 Thread Gnata Xavier

 A couple of thoughts on parallelism:

 1. Can someone come up with a small set of cases and time them on
 numpy, IDL, Matlab, and C, using various parallel schemes, for each of
 a representative set of architectures?  We're comparing a benchmark to
 itself on different architectures, rather than seeing whether the
 thread capability is helping our competition on the same architecture.
 If it's mostly not helping them, we can forget it for the time being.
 I suspect that it is, in fact, helping them, or at least not hurting
 them.

   
Well, I could ask some IDL users to provide you with benchmarks.
In C/OpenMP I have posted a trivial test case.

 2. Would it slow things much to have some state that the routines
 check before deciding whether to run a parallel implementation or not?
 It could default to single thread except in the cases where
 parallelism always helps, but the user can configure it to multithread
 beyond certain thresholds of, say, number of elements.  Then, in the
 short term, a savvy user can tweak that state to get parallelism for
 more than N elements.  In the longer term, there could be a test
 routine that would run on install and configure the state for that
 particular machine.  When numpy started it would read the saved file
 and computation would be optimized for that machine.  The user could
 always override it.

   
No, it wouldn't cost that much, and that is exactly the way IDL (for 
instance) works.
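
Just to sketch what such a state could look like (purely illustrative; the
variable and environment-variable names are invented for this example, this is
not an existing numpy interface):

/* illustrative sketch only: a runtime threshold in the spirit of IDL's
   TPOOL_MIN_ELTS; the names are made up for this example */
#include <stdlib.h>

static long mt_min_elts = 100000;  /* default: stay single-threaded below this */

/* read the threshold once, e.g. at import time */
void init_mt_threshold(void)
{
    const char *s = getenv("NUMPY_MT_MIN_ELTS");  /* hypothetical variable */
    if (s != NULL)
        mt_min_elts = atol(s);
}

/* each loop asks this before spawning threads */
int use_threads(long n)
{
    return n >= mt_min_elts;
}

An install-time test routine could simply write the measured break-even value
into a file that init_mt_threshold() reads instead of the environment.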

 3. We should remember the first rule of parallel programming, which
 Anne quotes as "premature optimization is the root of all evil".
 There is a lot to fix in numpy that is more fundamental than speed.  I
 am the first to want things fast (I would love my secondary eclipse
 analysis to run in less than a week), but we have gaping holes in
 documentation and other areas that one would expect to have been
 filled before a 1.0 release.  I hope we can get them filled for 1.1.
 It bears repeating that our main resource shortage is in person-hours,
 and we'll get more of those as the community grows.  Right now our
 deficit in documentation is hurting us badly, while our deficit in
 parallelism is not.  There is no faster way of growing the community
 than making it trivial to learn how to use numpy without hand-holding
 from an experienced user.  Let's explore parallelism to assess when
 and how it might be right to do it, but let's stay focussed on the
 fundamentals until we have those nailed.

   
That is well put and clear.
It is also clear that our deficit in parallelism is not hurting us that 
badly.
It is a real problem in some communities, like astronomers and image 
processing people, but the lack of documentation is the more pressing one, that 
is true.

 
Xavier


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-24 Thread Matthieu Brucher

 It is a real problem in some communities, like astronomers and image
 processing people, but the lack of documentation is the more pressing one,
 that is true.


Even in those communities, I think that a lot could be done at a higher
level, as IPython1 does (task parallelism).

Matthieu
-- 
French PhD student
Website : http://matthieu-brucher.developpez.com/
Blogs : http://matt.eifelle.com and http://blog.developpez.com/?blog=92
LinkedIn : http://www.linkedin.com/in/matthieubrucher


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-24 Thread Robert Kern
On Sat, Mar 22, 2008 at 4:25 PM, Charles R Harris
[EMAIL PROTECTED] wrote:

 On Sat, Mar 22, 2008 at 2:59 PM, Robert Kern [EMAIL PROTECTED] wrote:
 
  On Sat, Mar 22, 2008 at 2:04 PM, Charles R Harris
  [EMAIL PROTECTED] wrote:
 
   Maybe it's time to revisit the template subsystem I pulled out of
 Django.
 
  I am still -lots on using the Django template system. Please, please,
  please, look at Jinja or another templating package that could be
  dropped in without *any* modification.
 

 Well, I have a script that pulls the relevant parts out of Django. I know
 you had a bad experience, but...
 That said, Jinja looks interesting. It uses the Django syntax, which was one
 of the things I liked most about Django templates. In fact, it looks pretty
 much like Django templates made into a standalone application, which is what
 I was after. However, it's big, the installed egg is about 1 MiB, which is
 roughly 12x the size of my cutdown version of Django, and it has some
 C code, so would need building.

The C code is optional.

 On the other hand, it also looks like it
 contains a lot of extraneous stuff, like translations, that could be
 removed. Would you be averse to adding it in if it looks useful?

I would still *really* prefer that you use a single-file templating
module at the expense of template aesthetics and even features. I am
still unconvinced that we need more features. You haven't shown me any
concrete examples. If we do need the features of a larger package that we
need to cut down, I'd prefer a package that we can cut down by simply
removing files, not one that requires the modification of files.

-- 
Robert Kern

I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth.
 -- Umberto Eco


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-24 Thread Gnata Xavier
Matthieu Brucher wrote:

  It is a real problem in some communities, like astronomers and image
  processing people, but the lack of documentation is the more pressing one,
  that is true.


 Even in those communities, I think that a lot could be done at a 
 higher level, as what IPython1 does (tasks parallelism).

 Matthieu

Well, it is not that easy. We have a lot of numpy code that goes like this :
1) open a large data file to get a numpy array
2) perform computations on this array (I'm only talking of the numpy 
part here; scipy is something else)
3) write the result to another large file

It is so simple to write using numpy :)
Now, if I want to have several executables, step 3 is often a problem. The only 
simple way to speed this up is to split step 2 into threads (assuming 
that there is no other possible optimisation like sse, which is not quite true 
but is out of the scope of numpy users).

Using C, we can do that with OpenMP pragmas. It may not be optimal, but 
its speedup/time_to_code ratio is very large :) Now, we are switching 
from C to numpy because we cannot spend that much time playing with 
gdb/pointers just to open an image anymore.


Xavier
ps : I have seen your blog and you can send me an email off line about 
this topic and what you are doing :)


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-24 Thread Charles R Harris
On Mon, Mar 24, 2008 at 10:35 AM, Robert Kern [EMAIL PROTECTED] wrote:

 On Sat, Mar 22, 2008 at 4:25 PM, Charles R Harris
 [EMAIL PROTECTED] wrote:
 
  On Sat, Mar 22, 2008 at 2:59 PM, Robert Kern [EMAIL PROTECTED]
 wrote:
  
   On Sat, Mar 22, 2008 at 2:04 PM, Charles R Harris
   [EMAIL PROTECTED] wrote:
  
Maybe it's time to revisit the template subsystem I pulled out of
  Django.
  
   I am still -lots on using the Django template system. Please, please,
   please, look at Jinja or another templating package that could be
   dropped in without *any* modification.
  
 
  Well, I have a script that pulls the relevant parts out of Django. I
 know
  you had a bad experience, but...
  That said, Jinja looks interesting. It uses the Django syntax, which was
 one
  of the things I liked most about Django templates. In fact, it looks
 pretty
  much like Django templates made into a standalone application, which is
 what
  I was after. However, it's big, the installed egg is about 1 MiB, which is
  roughly 12x the size of my cutdown version of Django, and it has some
  C code, so would need building.

 The C code is optional.

  On the other hand, it also looks like it
  contains a lot of extraneous stuff, like translations, that could be
  removed. Would you be averse to adding it in if it looks useful?

 I would still *really* prefer that you use a single-file templating
 module at the expense of template aesthetics and even features. I am
 still unconvinced that we need more features. You haven't shown me any
  concrete examples. If we do need the features of a larger package that we
 need to cut down, I'd prefer a package that we can cut down by simply
 removing files, not one that requires the modification of files.


If you simply pull out the template subsystem, it is about 1 MiB, which is
why Jinja looks like Django to me. If you remove extraneous files from
Django, and probably Jinja, it comes down to about 400K. If you go on to
remove extraneous capabilities it drops down to under 100K. It could all be made
into a single file. In fact I had a lot of Django rewritten with that in
mind. Well, that and the fact I can't leave code untouched. What I liked
about the idea, besides the syntax with if statements, nested for loops,
includes, and a few filters, is that it allowed moving some of the common
code out of the source and into higher level build files where variable
values and flags could be set, making the whole build process more
transparent and adaptable. That said, it is hard to beat the compactness of
the current version.

Chuck


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-24 Thread David Cournapeau
Matthew Brett wrote:

 I'm +3 for the plugin idea - it would have huge benefits for
 installation and automatic optimization.  What needs to be done?  Who
 could do it?

The main issues are portability and reliability, I think. All OSes 
supported by numpy have more or less some dynamic library loading support 
(that's how Python itself works, after all, unless you compile 
everything statically), but the devil is in the details. In particular:
- I am not sure whether plugin unloading is supported by all OSes (the Mac 
OS X POSIX API does not enable unloading, for example; you have to use a 
specific API which I do not know anything about). Maybe we do not need 
it, I don't know; unloading sounds really difficult to support reliably 
anyway (how to make sure numpy won't use the functions of the plugins ?)
- build issues: it is really the same thing as ctypes at the build 
level, I think
- an API so that it can be used throughout numpy: there should be a 
clear interface for the plugins, which does not sound trivial either. 
That's the most difficult part in my mind (well, maybe not difficult, 
but time-consuming at least).

That's one of the reasons why I was thinking about a gradual move of most 
of the core functionality toward a separate C library, with a 
simple and crystal-clear interface, without any reference to any Python 
API, just plain C with plain pointers. We could then force this core, 
pure C library to be used only through dereferencing of an additional 
pointer, thus enabling dynamic change of the actual functions (at least 
when numpy is started).
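
To make the idea a bit more concrete, something along these lines (a sketch
only; the struct and function names are invented for illustration, this is not
actual numpy code):

/* sketch only: a plugin fills in a table of function pointers; core code
   always calls through the table, so the implementation (reference, SSE,
   OpenMP, ...) can be swapped when numpy starts.  Names are invented. */
typedef struct {
    void   (*add_double)(const double *a, const double *b, double *out, long n);
    double (*sum_double)(const double *a, long n);
} core_ops;

/* default, portable implementations */
static void add_double_ref(const double *a, const double *b, double *out, long n)
{
    long i;
    for (i = 0; i < n; i++) out[i] = a[i] + b[i];
}
static double sum_double_ref(const double *a, long n)
{
    double s = 0.0;
    long i;
    for (i = 0; i < n; i++) s += a[i];
    return s;
}

/* the single extra pointer everything dereferences, e.g. ops.sum_double(a, n) */
static core_ops ops = { add_double_ref, sum_double_ref };

/* a plugin loaded at start-up replaces entries in the table */
void register_core_ops(const core_ops *plugin) { ops = *plugin; }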

I have to say I really like the idea of a more explicit separation between 
the actual computation code and the Python wrapping; it can only help if 
we decide to write some of the wrappers in Cython/ctypes/whatever 
instead of pure C as today. It has many advantages in terms of 
reliability, maintainability and performance (testing performance would 
be much easier, I think, since it could be done in pure C).

cheers,

David


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-24 Thread Robert Kern
On Mon, Mar 24, 2008 at 12:12 PM, Gnata Xavier [EMAIL PROTECTED] wrote:

  Well, it is not that easy. We have a lot of numpy code that goes like this :
  1) open a large data file to get a numpy array
  2) perform computations on this array (I'm only talking of the numpy
  part here; scipy is something else)
  3) write the result to another large file

  It is so simple to write using numpy :)
  Now, if I want to have several executables, step 3 is often a problem.

If that large file can be accessed by memory-mapping, then step 3 can
actually be quite easy. You have one program make the empty file of
the given size (f.seek(FILE_SIZE); f.write('\0'); f.seek(0,0)) and
then make each of the parallel programs memory map the file and only
write to their respective portions.
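
In case it helps, here is the same trick sketched in C (illustrative only;
error handling is omitted and the fill computation is just a placeholder):

/* sketch only (POSIX): worker `rank` of `nworkers` memory-maps the
   pre-sized output file and fills just its own slice */
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

void write_my_slice(const char *path, long n, int rank, int nworkers)
{
    int fd = open(path, O_RDWR);
    size_t nbytes = (size_t)n * sizeof(double);
    double *buf = (double *) mmap(NULL, nbytes, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0);
    long lo = rank * n / nworkers;       /* this worker's slice */
    long hi = (rank + 1) * n / nworkers;
    long i;
    for (i = lo; i < hi; i++)
        buf[i] = 0.5 * i;                /* placeholder for the real result */
    munmap(buf, nbytes);
    close(fd);
}

Creating the file at full size beforehand can be done with ftruncate(), and
from Python, numpy.memmap gives you the same thing without writing any C.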

-- 
Robert Kern

I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth.
 -- Umberto Eco


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-24 Thread Gnata Xavier
Robert Kern wrote:
 On Mon, Mar 24, 2008 at 12:12 PM, Gnata Xavier [EMAIL PROTECTED] wrote:

   
   Well, it is not that easy. We have a lot of numpy code that goes like this :
   1) open a large data file to get a numpy array
   2) perform computations on this array (I'm only talking of the numpy
   part here; scipy is something else)
   3) write the result to another large file

   It is so simple to write using numpy :)
   Now, if I want to have several executables, step 3 is often a problem.
 

 If that large file can be accessed by memory-mapping, then step 3 can
 actually be quite easy. You have one program make the empty file of
 the given size (f.seek(FILE_SIZE); f.write('\0'); f.seek(0,0)) and
 then make each of the parallel programs memory map the file and only
 write to their respective portions.

   
Yep, but that is the best case.
Our standard case is a quite long sequence of simple computations on 
arrays.
Some parts are clearly thread candidates, but not all of them.
For instance, at step N+1 I have to multiply foo by the sum of a large 
array computed at step N-1.
I can split the sum computation over several executables, but it is not 
convenient at all and not that easy to get the sum at the end (I know 
ugly ways to do that. ugly).

One-step large computations can be split into several executables. Multi-step 
large ones are another story :(

Xavier







Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-23 Thread Charles R Harris
On Sat, Mar 22, 2008 at 10:59 PM, David Cournapeau 
[EMAIL PROTECTED] wrote:

 Charles R Harris wrote:
 
  It looks like memory access is the bottleneck, otherwise running 4
  floats through in parallel should go a lot faster. I need to modify
  the program a bit and see how it works for doubles.

 I am not sure the benchmark is really meaningful: it does not use
 aligned buffers (16-byte alignment), and because of that, does not
 give a good idea of what can be expected from SSE. It shows why it is
 not so easy to get good performance, and why just throwing a few
 optimized loops won't work, though. Using sse/sse2 from unaligned
 buffers is a waste of time. Without this alignment, you need to take
 the alignment into account (using _mm_loadu_ps vs _mm_load_ps), and
 that's extremely slow, basically killing most of the speed increase you
 can expect from using sse.


Yep, but I expect the compilers to take care of alignment, say by inserting
a few single ops when needed. So I would just as soon leave vectorization to
the compilers and wait until they get that good. The only thing needed then
is contiguous data.

Chuck


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-23 Thread David Cournapeau
Charles R Harris wrote:

 Yep, but I expect the compilers to take care of alignment, say by 
 inserting a few single ops when needed.

The other solution would be to have aligned allocators (they won't solve 
all cases, of course), because the compilers will never be able to take 
care of the cases where you call external libraries (fftw, where we 
could have between 50 % and more than 100 % speed increase if we had 
aligned data buffers by default).
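
To be explicit about what I mean by an aligned allocator, a minimal sketch
using posix_memalign (the same call the modified benchmark uses); this is just
an illustration, not what numpy does today:

/* sketch only: return a 16-byte aligned buffer, or NULL on failure;
   release with free() as usual */
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>

void *aligned_malloc16(size_t nbytes)
{
    void *p = NULL;
    if (posix_memalign(&p, 16, nbytes) != 0)
        return NULL;
    return p;
}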

 So I would just as soon leave vectorization to the compilers and wait 
 until they get that good. The only thing needed then is contiguous data.

For non-contiguous data, things will be extremely slow anyway, so I 
don't think those need a lot of attention. If you care about 
performance, you should not use non-contiguous data.

cheers,

David


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-23 Thread Emanuele Olivetti
James Philbin wrote:
 OK, i've written a simple benchmark which implements an elementwise
 multiply (A=B*C) in three different ways (standard C, intrinsics, hand
 coded assembly). On the face of things the results seem to indicate
 that the vectorization works best on medium sized inputs. If people
 could post the results of running the benchmark on their machines
 (takes ~1min) along with the output of gcc --version and their chip
 model, that wd be v useful.

 It should be compiled with: gcc -msse -O2 vec_bench.c -o vec_bench


CPU: Intel(R) Core(TM)2 CPU  T7400  @ 2.16GHz
(macbook, intel core 2 duo)

gcc (GCC) 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)
(ubuntu gutsy gibbon 7.10)

$ ./vec_bench
Testing methods...
All OK

Problem size              Simple              Intrin              Inline
         100   0.0003ms (100.0%)   0.0002ms ( 68.3%)   0.0002ms ( 75.6%)
        1000   0.0023ms (100.0%)   0.0018ms ( 76.7%)   0.0020ms ( 87.1%)
       10000   0.0361ms (100.0%)   0.0193ms ( 53.4%)   0.0338ms ( 93.7%)
      100000   0.2839ms (100.0%)   0.1351ms ( 47.6%)   0.0937ms ( 33.0%)
     1000000   4.2108ms (100.0%)   4.1234ms ( 97.9%)   4.0886ms ( 97.1%)
    10000000  45.3192ms (100.0%)  45.5359ms (100.5%)  45.3466ms (100.1%)


Note that there is some variance in the results. Here is a second run to have
an idea (look at Inline, size=10000):

$ ./vec_bench
Testing methods...
All OK

Problem size              Simple              Intrin              Inline
         100   0.0003ms (100.0%)   0.0002ms ( 69.5%)   0.0002ms ( 74.1%)
        1000   0.0024ms (100.0%)   0.0018ms ( 75.9%)   0.0020ms ( 86.4%)
       10000   0.0324ms (100.0%)   0.0186ms ( 57.3%)   0.0226ms ( 69.6%)
      100000   0.2840ms (100.0%)   0.1171ms ( 41.2%)   0.0939ms ( 33.1%)
     1000000   4.4034ms (100.0%)   4.3657ms ( 99.1%)   4.0465ms ( 91.9%)
    10000000  44.4854ms (100.0%)  43.9502ms ( 98.8%)  43.6824ms ( 98.2%)


HTH

Emanuele



Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-23 Thread James Philbin
Wow, a much more varied set of results than I was expecting. Could
someone who has gcc 4.3 installed compile it with:
gcc -msse -O2 -ftree-vectorize -ftree-vectorizer-verbose=5 -S
vec_bench.c -o vec_bench.s

And attach vec_bench.s and the verbose output from gcc.

James


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-23 Thread David Cournapeau
Gnata Xavier wrote:

 Hi,

 I have a very limited knowledge of openmp but please consider this 
 testcase :

   

Honestly, if it were that simple, it would have been done a long time 
ago. The problem is that your test case is not even remotely close 
to how things have to be done in numpy.

cheers,

David


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-23 Thread Francesc Altet
On Sunday 23 March 2008, Charles R Harris wrote:
 gcc --version: gcc (GCC) 4.1.2 20070925 (Red Hat 4.1.2-33)
 cpu:  Intel(R) Core(TM)2 CPU  6600  @ 2.40GHz

 Problem size              Simple              Intrin              Inline
          100   0.0002ms (100.0%)   0.0001ms ( 68.7%)   0.0001ms ( 74.8%)
         1000   0.0015ms (100.0%)   0.0011ms ( 72.0%)   0.0012ms ( 80.4%)
        10000   0.0154ms (100.0%)   0.0111ms ( 72.1%)   0.0122ms ( 79.1%)
       100000   0.1081ms (100.0%)   0.0759ms ( 70.2%)   0.0811ms ( 75.0%)
      1000000   2.7778ms (100.0%)   2.8172ms (101.4%)   2.7929ms (100.5%)
     10000000  28.1577ms (100.0%)  28.7332ms (102.0%)  28.4669ms (101.1%)

I'm mystified about your machine requiring just 28 ms for completing the 
10 million test, while most of the other, similar processors (some faster 
than yours) in this thread fall pretty far from your figure.  What 
sort of memory subsystem are you using?

 It looks like memory access is the bottleneck, otherwise running 4
 floats through in parallel should go a lot faster.

Yes, that's probably right.  This test is mainly measuring the memory 
access speed of machines for large datasets.  For small ones, my guess 
is that the data is directly placed in caches, so there is no need to 
transport it to the CPU prior to doing the calculations.  However, I'm 
not sure whether this kind of optimization for small datasets would be 
very useful in practice (read: general NumPy calculations), but I'm 
rather sceptical about this.

Cheers,

-- 
0,0   Francesc Altet     http://www.carabos.com/
V   V   Cárabos Coop. V.   Enjoy Data
 -


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-23 Thread Francesc Altet
On Sunday 23 March 2008, David Cournapeau wrote:
 Gnata Xavier wrote:
  Hi,
 
  I have a very limited knowledge of openmp but please consider this
  testcase :

 Honestly, if it was that simple, it would already have been done for
 a long time. The problem is that your test-case is not even remotely
 close to how things have to be done in numpy.

Why not?  IMHO, complex operations requiring a great deal of operations 
per word, like trigonometric, exponential, etc..., are the best 
candidates to take advantage of several cores or even SSE instructions 
(not sure whether SSE supports this sort of operations, though).

At any rate, this is exactly the kind of parallel optimizations that 
make sense in Numexpr, in the sense that you could obtain decent 
speedups with multicore processors.

Cheers,

-- 
0,0   Francesc Altet     http://www.carabos.com/
V   V   Cárabos Coop. V.   Enjoy Data
 -


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-23 Thread David Cournapeau
Francesc Altet wrote:

 Why not?  IMHO, complex operations requiring a great deal of operations 
 per word, like trigonometric, exponential, etc..., are the best 
 candidates to take advantage of several cores or even SSE instructions 
 (not sure whether SSE supports this sort of operations, though).

I was talking about the general issue of using OpenMP in the numpy context. 
If it were just adding one line at one place in the source code, someone 
would already have done it, no ? But there are build issues, for 
example: you have to add support for OpenMP at compile and link time, and you 
have to make sure it works with compilers which do not support it.

Even without taking into account the build issues, there is the problem 
of correctly annotating the source code depending on the context. For 
example, many of the interesting places to use OpenMP in numpy would need 
more than just the parallel for pragma. From what I know of 
OpenMP, the annotations may depend on the kind of operation you are 
doing (independent element-wise operations or not). Also, the test case 
posted before uses a really big N, where you are sure that using 
multiple threads is efficient. What happens if N is small ? Basically, the 
posted test is the best situation that can happen (big N, known 
operation with known context, etc...). That's proof that OpenMP works, 
not that it can work for numpy.

I find the example of sse rather enlightening: in theory, you should 
expect a 100-300 % speed increase using sse, but even with pure C code 
in a controlled manner, on one platform (linux + gcc), with varying, 
recent CPU, the results are fundamentally different. So what would 
happen in numpy, where you don't control things that much ?

cheers,

David


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-23 Thread Matthieu Brucher

 I find the example of sse rather enlightening: in theory, you should
 expect a 100-300 % speed increase using sse, but even with pure C code
 in a controlled manner, on one platform (linux + gcc), with varying,
 recent CPU, the results are fundamentally different. So what would
 happen in numpy, where you don't control things that much ?


This means that what we measure is not what we think we measure. The time we
get is not only dependent on the number of instructions. Did someone make a
complete instrumented profile of the code that everyone is testing with
callgrind or the Visual Studio profiler ? This will tell us excatly what is
happening :
- instructions
- cache issues (that is likely to be the bottleneck, but without a proof,
nothing should be done about it)
- SSE efficiency
- ...

I think that to be really efficient, one would have to use a dynamic
prefetcher, but these things are not available on x86, and even if they were,
they would never make it to the general public because they can't be proof
tested (binary modifications on the fly). But they are really efficient when
going through an array.

Matthieu
-- 
French PhD student
Website : http://matthieu-brucher.developpez.com/
Blogs : http://matt.eifelle.com and http://blog.developpez.com/?blog=92
LinkedIn : http://www.linkedin.com/in/matthieubrucher


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-23 Thread Gnata Xavier
David Cournapeau wrote:
 Francesc Altet wrote:
   
 Why not?  IMHO, complex operations requiring a great deal of operations 
 per word, like trigonometric, exponential, etc..., are the best 
 candidates to take advantage of several cores or even SSE instructions 
 (not sure whether SSE supports this sort of operations, though).
 

 I was talking about the general using openmp thing in numpy context. 
 If it was just adding one line at one place in the source code, someone 
 would already have done it, no ? But there are build issues, for 
 example: you have to add support for openmp at compilation and link, you 
 have to make sure it works with compilers which do not support it.

 Even without taking into account the build issues, there is the problem 
 of correctly annotating the source code depending on the context. For 
 example, many interesting places where to use openmp in numpy would need 
 more than just using the parallel for pragma. From what I know of 
 openMP, the annotations may depend on the kind of operation you are 
 doing (independent element-wise operations or not). Also, the test case 
 posted before use a really big N, where you are sure that using 
 multi-thread is efficient. What happens if N is small ? Basically, the 
 posted test is the best situation which can happen (big N, known 
 operation with known context, etc...). That's a proof that openMP works, 
 not that it can work for numpy.

 I find the example of sse rather enlightening: in theory, you should 
 expect a 100-300 % speed increase using sse, but even with pure C code 
 in a controlled manner, on one platform (linux + gcc), with varying, 
   
 recent CPU, the results are fundamentally different. So what would 
 happen in numpy, where you don't control things that much ?

 cheers,

 David
   
Well, of course my goal was not to say that my simple testcase can be 
copied/pasted into numpy :)
Of course it is one of the best cases for OpenMP.
Of course pragmas can be more complex than that (you can say which variables 
can/cannot be shared, for instance).

The size : Using OpenMP will be slower on small arrays, that is clear, 
but the user doing very large computations is smart enough to know when 
he needs to split his job into threads. The obvious solution is to 
provide the user with parallel (//) and non-parallel functions.

sse : sse can help a lot, but multithreading just scales where sse 
mono-thread based solutions don't.

Build/link : It is an issue. It has to be tested. I do not know because 
I haven't even tried.

So, IMHO it would be nice to try to put some simple OpenMP pragmas into 
numpy *only to see what is going on*.

Even if it only works with gcc, or even if... I do not know... It would be 
a first step. Step by step :)

If the performance is that bad, ok, forget about it... but it would be 
sad, because the next-generation CPUs will not be more powerful, they 
will only have more than one or two cores on the same chip.

Xavier


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-23 Thread David Cournapeau
Gnata Xavier wrote:
 Well, of course my goal was not to say that my simple testcase can be 
 copied/pasted into numpy :)
 Of course it is one of the best cases for OpenMP.
 Of course pragmas can be more complex than that (you can say which variables 
 can/cannot be shared, for instance).

 The size : Using OpenMP will be slower on small arrays, that is clear, 
 but the user doing very large computations is smart enough to know when 
 he needs to split his job into threads. The obvious solution is to 
 provide the user with parallel (//) and non-parallel functions.

IMHO, that's a really bad solution. It should be dynamically enabled 
(like in matlab, if I remember correctly). And this means having a plug-in 
subsystem to load/unload different implementations... that is one of the 
things I was interested in getting done for numpy 1.1 (or above).

For small arrays: how much slower ? Does it make the code slower than 
without OpenMP ? For example, what does your code say when N is 10, 
100, 1000 ?


 sse : sse can help a lot but multithreading just scales where sse 
 mono-thread based solutions don't.

It depends: it scales pretty well if you use several processes, and if 
you can design your application in a multi-process way.


 Build/link : It is an issue. It has to be tested. I do not know because 
 I haven't even tried.

 So, IMHO it would be nice to try to put some OpenMP simple pragmas into 
 numpy *only to see what is going on*.

 Even if it only work with gcc or even if...I do not know... It would be 
 a first step. step by step :)

I agree about the step-by-step approach; I am just not sure I agree with 
your steps, that's all. Personally, I would first try getting a plug-in 
system working with numpy. But really, prove me wrong. Do it, try 
putting some pragmas at some places in the ufunc machinery or somewhere 
else; as I said earlier, I would be happy to add support for OpenMP at 
the build level, at least in numscons. I would love being proven wrong 
and having a numpy which scales well with multi-core :)

cheers,

David


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-23 Thread Matthieu Brucher

 If the performances are so bad, ok, forget about itbut it would be
 sad because the next generation CPU will not be more powerfull, they
 will only have more that one or two cores on the same chip.


I don't think this is the worst that will happen. The worst is what has been
seen for decades: CPU raw power rising faster than memory speed
(bandwidth and latency). With the next generation of Intel's CPUs, the memory
controller will at last be on the CPU and correctly shared between cores,
but for the moment, with our issues, splitting these kinds of parallel jobs
(additions, subtractions, ...) will not enhance speed, as the bottleneck is
the memory controller/system bus that is already used at 100% by one core.

Matthieu
-- 
French PhD student
Website : http://matthieu-brucher.developpez.com/
Blogs : http://matt.eifelle.com and http://blog.developpez.com/?blog=92
LinkedIn : http://www.linkedin.com/in/matthieubrucher


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-23 Thread Scott Ransom
Hi David et al,

Very interesting.  I thought that the 64-bit gcc's automatically
aligned memory on 16-bit (or 32-bit) boundaries.  But apparently
not.  Because running your code certainly made the intrinsic code
quite a bit faster.  However, another thing that I noticed was
that the simple code was _much_ faster using gcc-4.3 with -O3
than with -O2.  I've noticed this with some other code recently as
well -- the auto loop-unrolling really helps for this type of
code.

You can see my benchmarks here (posted there to avoid line wrap
issues):
http://www.cv.nrao.edu/~sransom/vec_results.txt

Scott


On Sun, Mar 23, 2008 at 01:59:39PM +0900, David Cournapeau wrote:
 Charles R Harris wrote:
 
  It looks like memory access is the bottleneck, otherwise running 4 
  floats through in parallel should go a lot faster. I need to modify 
  the program a bit and see how it works for doubles.
 
 I am not sure the benchmark is really meaningful: it does not use 
 aligned buffers (16-byte alignment), and because of that, does not 
 give a good idea of what can be expected from SSE. It shows why it is 
 not so easy to get good performance, and why just throwing a few 
 optimized loops won't work, though. Using sse/sse2 from unaligned 
 buffers is a waste of time. Without this alignment, you need to take 
 the alignment into account (using _mm_loadu_ps vs _mm_load_ps), and 
 that's extremely slow, basically killing most of the speed increase you 
 can expect from using sse.
 
 Here is what I get with the above benchmark:
 
           100   0.0002ms (100.0%)   0.0001ms ( 71.5%)   0.0001ms ( 85.0%)
          1000   0.0014ms (100.0%)   0.0010ms ( 70.6%)   0.0013ms ( 96.8%)
         10000   0.0162ms (100.0%)   0.0095ms ( 58.2%)   0.0128ms ( 78.7%)
        100000   0.4189ms (100.0%)   0.4135ms ( 98.7%)   0.4149ms ( 99.0%)
       1000000   5.9523ms (100.0%)   5.8933ms ( 99.0%)   5.8910ms ( 99.0%)
      10000000  58.9645ms (100.0%)  58.2620ms ( 98.8%)  58.7443ms ( 99.6%)
 
 Basically, no help at all: this is on a P4, whose FPU is extremely slow 
 if not used with optimized SSE.
 
 Now, if I use posix_memalign, replace the intrinsics for aligned access, 
 and use an accurate cycle counter (cycle.h, provided by fftw).
 
 Compiled as is:
 
 Testing methods...
 All OK
 
 Problem size                  Simple                     Intrin                     Inline
          100   4.16e+02 cycles (100.0%)   4.04e+02 cycles ( 97.1%)   4.92e+02 cycles (118.3%)
         1000   3.66e+03 cycles (100.0%)   3.11e+03 cycles ( 84.8%)   4.10e+03 cycles (111.9%)
        10000   3.47e+04 cycles (100.0%)   3.01e+04 cycles ( 86.7%)   4.06e+04 cycles (116.8%)
       100000   1.36e+06 cycles (100.0%)   1.34e+06 cycles ( 98.7%)   1.45e+06 cycles (106.7%)
      1000000   1.92e+07 cycles (100.0%)   1.87e+07 cycles ( 97.1%)   1.89e+07 cycles ( 98.2%)
     10000000   1.86e+08 cycles (100.0%)   1.80e+08 cycles ( 96.8%)   1.81e+08 cycles ( 97.4%)
 
 Compiled with -DALIGNED, which uses aligned access intrinsics:
 
 Testing methods...
 All OK
 
 Problem size                  Simple                     Intrin                     Inline
          100   4.16e+02 cycles (100.0%)   1.96e+02 cycles ( 47.1%)   4.92e+02 cycles (118.3%)
         1000   3.82e+03 cycles (100.0%)   1.56e+03 cycles ( 40.8%)   4.22e+03 cycles (110.4%)
        10000   3.46e+04 cycles (100.0%)   1.92e+04 cycles ( 55.5%)   4.13e+04 cycles (119.4%)
       100000   1.32e+06 cycles (100.0%)   1.12e+06 cycles ( 85.0%)   1.16e+06 cycles ( 87.8%)
      1000000   1.95e+07 cycles (100.0%)   1.92e+07 cycles ( 98.3%)   1.95e+07 cycles (100.2%)
     10000000   1.82e+08 cycles (100.0%)   1.79e+08 cycles ( 98.4%)   1.81e+08 cycles ( 99.3%)
 
 This gives a drastic difference (I did not touch inline code, because it 
 is sunday and I am lazy). If I use this on a sane CPU (core 2 duo, 
 macbook) instead of my pentium4, I get better results (in particular, 
 sse code is never slower, and I get a double speed increase as long as 
 the buffer can be in cache).
 
 It looks like using prefetch also gives some improvements when on the 
 edge of the cache size (my P4 has a 512 kB L2 cache):
 
 Testing methods...
 All OK
 
 Problem size                  Simple                     Intrin                     Inline
          100   4.16e+02 cycles (100.0%)   2.52e+02 cycles ( 60.6%)   4.92e+02 cycles (118.3%)
         1000   3.55e+03 cycles (100.0%)   1.85e+03 cycles ( 52.2%)   4.21e+03 cycles (118.7%)
        10000   3.48e+04 cycles (100.0%)   1.76e+04 cycles ( 50.6%)   4.13e+04 cycles (118.9%)
       100000   1.11e+06 cycles (100.0%)

Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-23 Thread David Cournapeau
Scott Ransom wrote:
 Hi David et al,

 Very interesting.  I thought that the 64-bit gcc's automatically
 aligned memory on 16-bit (or 32-bit) boundaries. 

Note that I am talking about bytes, not bits. Default alignment depends 
on many parameters, like the OS and C runtime. For example, on Mac OS X, 
malloc defaults to 16-byte alignment (I guess this comes from the PPC days, 
where the only way to keep up with x86 was to aggressively use AltiVec). 
With glibc, it is 8-byte aligned; for big sizes (where big is linked to 
the mmap threshold), it is almost never 16-byte aligned (there was a 
discussion on this on the numpy ML, initiated by Steven G. Johnson, one of the 
main FFTW developers). I don't know about the dependency on 64-bit archs.

IMHO, the only real solution for this point is to have some support for 
aligned buffers in numpy, with aligned memory allocators.

cheers,

David


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-23 Thread Charles R Harris
On Sun, Mar 23, 2008 at 6:41 AM, Francesc Altet [EMAIL PROTECTED] wrote:

 On Sunday 23 March 2008, Charles R Harris wrote:
  gcc --version: gcc (GCC) 4.1.2 20070925 (Red Hat 4.1.2-33)
  cpu:  Intel(R) Core(TM)2 CPU  6600  @ 2.40GHz
 
  Problem size              Simple              Intrin              Inline
           100   0.0002ms (100.0%)   0.0001ms ( 68.7%)   0.0001ms ( 74.8%)
          1000   0.0015ms (100.0%)   0.0011ms ( 72.0%)   0.0012ms ( 80.4%)
         10000   0.0154ms (100.0%)   0.0111ms ( 72.1%)   0.0122ms ( 79.1%)
        100000   0.1081ms (100.0%)   0.0759ms ( 70.2%)   0.0811ms ( 75.0%)
       1000000   2.7778ms (100.0%)   2.8172ms (101.4%)   2.7929ms (100.5%)
      10000000  28.1577ms (100.0%)  28.7332ms (102.0%)  28.4669ms (101.1%)

  I'm mystified about your machine requiring just 28 ms for completing the
  10 million test, while most of the other, similar processors (some faster
  than yours) in this thread fall pretty far from your figure.  What
  sort of memory subsystem are you using?


Yeah, I noticed that ;) The cpu is an E6600, which was the low end of the
performance core duo processors before the recent Intel releases, the north
bridge (memory controller) is a P35, and the memory is DDR2 running at 800
MHz with 4-4-4-12 timing. The only things I tweaked were the memory voltage
and timings. Raising the memory speed from 667 to 800 made a noticeable
difference in my perception of speed, which is remarkable in itself. The
motherboard was cheap, it goes for $70 these days.

I've seen folks overclocking the E6600 up to 3.8 GHz and over 3GHz is
common. Sometimes it's almost tempting...

Chuck


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-23 Thread James Philbin
OK, i'm really impressed with the improvements in vectorization for
gcc 4.3. It really seems like it's able to work with real loops which
wasn't the case with 4.1. I think Chuck's right that we should simply
special case contiguous data and allow the auto-vectorizer to do the
rest. Something like this for the ufuncs:

 /**begin repeat

    #TYPE=(BOOL, BYTE,UBYTE,SHORT,USHORT,INT,UINT,LONG,ULONG,LONGLONG,ULONGLONG,FLOAT,DOUBLE,LONGDOUBLE)*2#
    #OP=||, +*13, ^, -*13#
    #kind=add*14, subtract*14#
    #typ=(Bool, byte, ubyte, short, ushort, int, uint, long, ulong, longlong, ulonglong, float, double, longdouble)*2#
 */

/* contiguous (unit-stride) special case: a plain loop the auto-vectorizer
   can handle */
static void
@TYPE@_@kind@_contig(@typ@ *i1, @typ@ *i2, @typ@ *op, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        op[i] = i1[i] @OP@ i2[i];
    }
}

static void
@TYPE@_@kind@(char **args, intp *dimensions, intp *steps, void *func)
{
    register intp i;
    intp is1=steps[0], is2=steps[1], os=steps[2], n=dimensions[0];
    char *i1=args[0], *i2=args[1], *op=args[2];

    /* dispatch to the contiguous loop when all strides are unity */
    if (is1==1 && is2==1 && os==1)
        return @TYPE@_@kind@_contig((@typ@ *)i1, (@typ@ *)i2, (@typ@ *)op, n);

    /* general strided loop, as today */
    for (i=0; i<n; i++, i1+=is1, i2+=is2, op+=os) {
        *((@typ@ *)op) = *((@typ@ *)i1) @OP@ *((@typ@ *)i2);
    }
}
/**end repeat**/

We also need to add -ftree-vectorize to the standard compile flags at
least for the ufuncs.

James


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-23 Thread Anne Archibald
On 23/03/2008, David Cournapeau [EMAIL PROTECTED] wrote:
 Gnata Xavier wrote:
  
   Hi,
  
   I have a very limited knowledge of openmp but please consider this
   testcase :
  
  


 Honestly, if it was that simple, it would already have been done for a
  long time. The problem is that your test-case is not even remotely close
  to how things have to be done in numpy.

Actually, there are a few places where a parallel for would serve to
accelerate all ufuncs. There are build issues, yes, though they are
mild; we would also want to provide some API to turn parallelization
on and off, and we'd have to verify that OpenMP did not slow down
small arrays, but that would be it. (And I suspect that OpenMP is
smart enough to use single threads without locking when multiple
threads won't help. Certainly all the information is available to
OpenMP to make such decisions.)

This is why I suggested making this change: it should be a low-cost,
high-gain change.
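
For example, OpenMP's if clause lets the programmer state that decision
explicitly on the pragma, so small arrays stay serial; a minimal sketch (the
threshold value here is arbitrary, not a measured number):

/* sketch only: the loop runs in parallel only when the supplied condition
   holds, so small arrays avoid thread start-up costs */
void scaled_add(const double *a, const double *b, double *out, long n)
{
    long i;
#pragma omp parallel for if (n > 100000)
    for (i = 0; i < n; i++) {
        out[i] = a[i] + 2.0 * b[i];
    }
}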

Anne


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-23 Thread Matthieu Brucher

 (And I suspect that OpenMP is
 smart enough to use single threads without locking when multiple
 threads won't help. Certainly all the information is available to
 OpenMP to make such decisions.)


Unfortunately, I don't think there is such a thing. For instance, the number
of threads used by MKL is set by an environment variable.

Matthieu
-- 
French PhD student
Website : http://matthieu-brucher.developpez.com/
Blogs : http://matt.eifelle.com and http://blog.developpez.com/?blog=92
LinkedIn : http://www.linkedin.com/in/matthieubrucher


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-23 Thread David Cournapeau
Anne Archibald wrote:

 Actually, there are a few places where a parallel for would serve to
 accelerate all ufuncs. There are build issues, yes, though they are
 mild;

Maybe, maybe not. Anyway, I said that I would step in to resolve those 
issues if someone else does the coding.

  we would also want to provide some API to turn parallelization
 on and off, and we'd have to verify that OpenMP did not slow down
 small arrays, but that would be it. (And I suspect that OpenMP is
 smart enough to use single threads without locking when multiple
 threads won't help. Certainly all the information is available to
 OpenMP to make such decisions.)
   

How so ? Maybe you're right, but that's not so obvious to me. But since 
several people seem to know OpenMP and are eager to add it to numpy, it 
should be easy: all they have to do is get the numpy 
sources from svn and start coding :) With numscons at least, they can 
manually add the -fopenmp and -lgomp flags from the command line to 
quickly do a prototype (it should not be much more difficult with 
distutils).

cheers,

David


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-23 Thread David Cournapeau
Gnata Xavier wrote:
 Ok, I will try to see what I can do, but it is sure that we do need the 
 plug-in system first (read: before the threads, in the numpy release). 
 During the devel of 1.1, I will try to find some time to understand 
 where I should put some pragmas into the ufuncs using a very conservative 
 approach. Anyone with some OpenMP knowledge is welcome, because I'm 
 not an OpenMP expert but only an OpenMP user in my C/C++ codes.

Note that the plug-in idea is just my own idea, it is not something 
agreed by anyone else. So maybe it won't be done for numpy 1.1, or at 
all. It depends on the main maintainers of numpy.



 and the results :

 size         loops     with OpenMP (s)   without (s)
 10000000        80          10.308471       30.007250
 1000000        160           1.902563        5.800172
 100000         320           0.543008        1.123274
 10000          640           0.206823        0.223031
 1000          1280           0.088898        0.044268
 100           2560           0.150429        0.008880
 10            5120           0.289589        0.002084

  --- On this machine, we should start to use threads *in this testcase* 
 iff size >= 10000 (a 100*100 image is a very very small one :))

Maybe openMP can be more clever, but it tends to show that openMP, when 
used naively, can *not* decide how many threads to use. That's really 
the core problem: again, I don't know much about openMP, but almost any 
project using multi-threading/multi-processing and not being embarrassingly 
parallel has the problem that it makes things much slower in the many cases 
where thread creation/management and the like have a lot of overhead 
relative to the computation. The problem is to determine that N, 
dynamically, or in a way which works well for most cases. OpenMP was 
created for HPC, where you have very large data; it is not so obvious to 
me that it is adapted to numpy, which has to be much more flexible. Being 
fast on a given problem is easy; being fast on a whole range, that's 
another story: the problem really is to be as fast as before on small 
arrays.

The fact that matlab, while having many more resources than us, took 
years to do it makes me extremely skeptical about the efficient use of 
multi-threading in numpy without real benchmarks. They have a dedicated 
team, who developed a JIT for matlab, which inserts multi-threaded code 
on the fly (for m-files, not when you are in the interpreter), and which 
uses multi-threaded blas/lapack (which is already available in numpy 
depending on the blas/lapack you are using).

But again, and that's really the only thing I have to say: prove me wrong :)

David


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-22 Thread David Cournapeau
Matthieu Brucher wrote:
 Hi,

 It seems complicated to add OpenMP to the code; I don't think many 
 people have the knowledge to do this, not to mention the fact that 
 there are a lot of Python calls in the different functions.

Yes, this makes potential optimizations harder, at least for someone 
like me who does not know the numpy internals well. That's still 
something I have not thought a lot about, but that's an example of why I 
like the idea of splitting the numpy C code into core C / wrappers: you would 
only use OpenMP in the core C library, and everything would be 
transparent at higher levels (if I understand correctly how OpenMP 
works, which may very well not be true :) ).

OpenMP, sse, etc... those are different views of the same underlying 
problem, in this context. But I do not know enough about the numpy 
internals yet (in particular, how the number protocol works, and the 
relationship with the ufunc machinery) to know if it is feasible in a 
reasonable number of hours, or even if it is feasible at all :)

 The multicore Matlab seems far more related to the underlying 
 libraries than to something they did.


Yes, that's why I put the matlab link: actually, most of the parallel 
thing it does is related to the mkl and co. That's something which is 
much easier to handle, and possible right now if I understand right ?

cheers,

David
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-22 Thread James Philbin
Personally, I think that the time would be better spent optimizing
routines for single-threaded code and relying on BLAS and LAPACK
libraries to use multiple cores for more complex calculations. In
particular, doing some basic loop unrolling and SSE versions of the
ufuncs would be beneficial. I have some experience writing SSE code
using intrinsics and would be happy to give it a shot if people tell
me what functions I should focus on.

James
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-22 Thread Neal Becker
James Philbin wrote:

 Personally, I think that the time would be better spent optimizing
 routines for single-threaded code and relying on BLAS and LAPACK
 libraries to use multiple cores for more complex calculations. In
 particular, doing some basic loop unrolling and SSE versions of the
 ufuncs would be beneficial. I have some experience writing SSE code
 using intrinsics and would be happy to give it a shot if people tell
 me what functions I should focus on.
 
 James

gcc keeps advancing autovectorization.  Is manual vectorization worth the
trouble?

___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-22 Thread James Philbin
  gcc keeps advancing autovectorization.  Is manual vectorization worth the
  trouble?

Well, the way that the ufuncs are written at the moment,
-ftree-vectorize will never kick in due to the non-constant strides.
To get this to work, one has to special case out unary strides. Even
with constant strides -ftree-vectorize often produces sub-optimal code
as it has to make very conservative assumptions about the content of
variables (it can do better if -fstrict-aliasing is used, but I think
numpy is not compiled with this flag). So in other words, yes, I think
it is worth hand vectorizing (and unrolling) the most common cases.
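
For comparison, this is the sort of loop the vectorizer is happy with (illustrative only, not numpy code): unit stride, concrete element types, and restrict-qualified pointers so the compiler can assume no aliasing:

/* illustrative only: a unit-stride, non-aliasing loop that
 * -ftree-vectorize can handle (restrict needs -std=c99) */
void mul_contig(float * restrict out, const float * restrict a,
                const float * restrict b, int n)
{
    int i;
    for (i = 0; i < n; i++)
        out[i] = a[i] * b[i];
}

The generic ufunc loop walks char pointers by a runtime stride, so none of those guarantees are visible to the compiler.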

James
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-22 Thread Charles R Harris
On Sat, Mar 22, 2008 at 11:43 AM, Neal Becker [EMAIL PROTECTED] wrote:

 James Philbin wrote:

  Personally, I think that the time would be better spent optimizing
  routines for single-threaded code and relying on BLAS and LAPACK
  libraries to use multiple cores for more complex calculations. In
  particular, doing some basic loop unrolling and SSE versions of the
  ufuncs would be beneficial. I have some experience writing SSE code
  using intrinsics and would be happy to give it a shot if people tell
  me what functions I should focus on.
 
  James

 gcc keeps advancing autovectorization.  Is manual vectorization worth the
 trouble?


The inner loop of a unary ufunc looks like

/*UFUNC_API*/
static void
PyUFunc_d_d(char **args, intp *dimensions, intp *steps, void *func)
{
    intp i;
    char *ip1 = args[0], *op = args[1];
    for (i = 0; i < *dimensions; i++, ip1 += steps[0], op += steps[1]) {
        *(double *)op = ((DoubleUnaryFunc *)func)(*(double *)ip1);
    }
}


While it might help the compiler to put the steps on the stack as constants,
it is hard to see how the compiler could vectorize the loop given the
information available and the fact that the input data might not be aligned
or contiguous. I suppose one could make a small local buffer, copy the data
into it, and then use sse, and that might actually help for some things. But
it is also likely that the function itself won't deal gracefully with
vectorized data.
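
A rough sketch of the local-buffer idea (purely illustrative; the chunk size and function name are made up):

#define CHUNK 64   /* made-up chunk size */

/* gather a strided input into a contiguous buffer, do the work there,
 * and scatter the results back; the middle loop is the part that could
 * then be written with SSE or left to the compiler */
static void
unary_op_buffered(char *in, char *out, long n, long istep, long ostep,
                  double (*f)(double))
{
    double buf[CHUNK];
    long i, j, m;

    for (i = 0; i < n; i += m) {
        m = (n - i < CHUNK) ? n - i : CHUNK;
        for (j = 0; j < m; j++)                 /* gather */
            buf[j] = *(double *)(in + (i + j)*istep);
        for (j = 0; j < m; j++)                 /* contiguous work */
            buf[j] = f(buf[j]);
        for (j = 0; j < m; j++)                 /* scatter */
            *(double *)(out + (i + j)*ostep) = buf[j];
    }
}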

Chuck
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-22 Thread Charles R Harris
On Sat, Mar 22, 2008 at 12:01 PM, James Philbin [EMAIL PROTECTED] wrote:

   gcc keeps advancing autovectorization.  Is manual vectorization worth
 the
   trouble?

 Well, the way that the ufuncs are written at the moment,
 -ftree-vectorize will never kick in due to the non-constant strides.
 To get this to work, one has to special case out unary strides. Even
 with constant strides -ftree-vectorize often produces sub-optimal code
 as it has to make very conservative assumptions about the content of
 variables (it can do better if -fstrict-aliasing is used, but I think


Numpy would die with strict-aliasing because it relies on casting char
pointers to various types. It might be possible to use void pointers, but
that would be a major do over and it isn't clear what the strides and
offsets, currently in char units, would become.

Chuck
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-22 Thread Travis E. Oliphant
James Philbin wrote:
 Personally, I think that the time would be better spent optimizing
 routines for single-threaded code and relying on BLAS and LAPACK
 libraries to use multiple cores for more complex calculations. In
 particular, doing some basic loop unrolling and SSE versions of the
 ufuncs would be beneficial. I have some experience writing SSE code
 using intrinsics and would be happy to give it a shot if people tell
 me what functions I should focus on.

   
Fabulous!   This is on my Project List of todo items for NumPy.  See 
http://projects.scipy.org/scipy/numpy/wiki/ProjectIdeas I should spend 
some time refactoring the ufunc loops so that the templating does not 
get in the way of doing this on a case by case basis.

1) You should focus on the math operations:  add, subtract, multiply, 
divide, and so forth.
2) Then for combined operations we should expose the functionality at 
a high level, so that somebody could write code to take advantage of it. 

It would be easiest to use intrinsics which would then work for AMD, 
Intel, on multiple compilers.

-Travis O.


___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-22 Thread Travis E. Oliphant
Charles R Harris wrote:


 On Sat, Mar 22, 2008 at 11:43 AM, Neal Becker [EMAIL PROTECTED] 
 mailto:[EMAIL PROTECTED] wrote:

 James Philbin wrote:

  Personally, I think that the time would be better spent optimizing
  routines for single-threaded code and relying on BLAS and LAPACK
  libraries to use multiple cores for more complex calculations. In
  particular, doing some basic loop unrolling and SSE versions of the
  ufuncs would be beneficial. I have some experience writing SSE code
  using intrinsics and would be happy to give it a shot if people tell
  me what functions I should focus on.
 
  James

 gcc keeps advancing autovectorization.  Is manual vectorization
 worth the
 trouble?


 The inner loop of a unary ufunc looks like

 /*UFUNC_API*/
 static void
 PyUFunc_d_d(char **args, intp *dimensions, intp *steps, void *func)
 {
     intp i;
     char *ip1 = args[0], *op = args[1];
     for (i = 0; i < *dimensions; i++, ip1 += steps[0], op += steps[1]) {
         *(double *)op = ((DoubleUnaryFunc *)func)(*(double *)ip1);
     }
 }


 While it might help the compiler to put the steps on the stack as 
 constants, it is hard to see how the compiler could vectorize the loop 
 given the information available and the fact that the input data might 
 not be aligned or contiguous. I suppose one could make a small local 
 buffer, copy the data into it, and then use sse, and that might 
 actually help for some things. But it is also likely that the function 
 itself won't deal gracefully with vectorized data.

I think the thing to do is to special-case the code so that if the 
strides work for vectorization, then a different bit of code is executed 
and this current code is used as the final special-case.

Something like this would be relatively straightforward, if a bit 
tedious, to do.
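
A rough sketch of that special-casing, using the PyUFunc_d_d loop quoted above as the fallback (not an actual patch, just the shape of the idea):

static void
PyUFunc_d_d_specialized(char **args, intp *dimensions, intp *steps, void *func)
{
    intp i, n = dimensions[0];
    char *ip1 = args[0], *op = args[1];
    DoubleUnaryFunc *f = (DoubleUnaryFunc *)func;

    if (steps[0] == sizeof(double) && steps[1] == sizeof(double)) {
        /* contiguous special case: a plain array loop the compiler
         * (or hand-written SSE) can vectorize */
        const double *in = (const double *)ip1;
        double *out = (double *)op;
        for (i = 0; i < n; i++)
            out[i] = f(in[i]);
    }
    else {
        /* generic strided fallback -- the current loop */
        for (i = 0; i < n; i++, ip1 += steps[0], op += steps[1])
            *(double *)op = f(*(double *)ip1);
    }
}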

-Travis



___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-22 Thread James Philbin
OK, so a few questions:
1. I'm not familiar with the format of the code generators. Should I
pull the special case out of the /**begin repeat blocks or should I do a
conditional inside the repeats (how does one do this?).
2. I don't have access to Windows+VisualC, so I will need some help
testing for Windows.
3. Should patches be posted to the mailing list or checked into svn?

James
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-22 Thread Thomas Grill
On 22.03.2008, at 19:20, Travis E. Oliphant wrote:

I think the thing to do is to special-case the code so that if the
strides work for vectorization, then a different bit of code is executed
and this current code is used as the final special-case.

Something like this would be relatively straightforward, if a bit
tedious, to do.


I've experimented with branching the ufuncs into different constant
strides and aligned/unaligned cases to be able to use SSE using
compiler intrinsics.
I expected a considerable gain as I was using float32 with stride 1
most of the time.
However, profiling revealed that hardly anything was gained because of
1) non-alignment of the vectors - this _could_ be handled by
shuffled loading of the values, though
2) the fact that my application used relatively large vectors that
wouldn't fit into the CPU cache, hence the memory transfer slowed down
the CPU.

I found the latter to be a real showstopper for most of my experiments
with SIMD. It's especially a problem for numpy because smaller vectors
have a lot of Python/numpy overhead, and larger ones don't really
benefit due to cache exhaustion.
I'm curious whether OpenMP gives better results, as multi-cores often
share their caches.

greetings,
Thomas
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-22 Thread Anne Archibald
On 22/03/2008, Thomas Grill [EMAIL PROTECTED] wrote:

 I've experimented with branching the ufuncs into different constant
  strides and aligned/unaligned cases to be able to use SSE using
  compiler intrinsics.
  I expected a considerable gain as i was using float32 with stride 1
  most of the time.
  However, profiling revealed that hardly anything was gained because of
  1) non-alignment of the vectors this _could_ be handled by
  shuffled loading of the values though
  2) the fact that my application used relatively large vectors that
  wouldn't fit into the CPU cache, hence the memory transfer slowed down
  the CPU.

  I found the latter to be a real showstopper for most of my experiments
  with SIMD. It's especially a problem for numpy because smaller vectors
  have a lot of Python/numpy overhead, and larger ones don't really
  benefit due to cache exhaustion.

This particular issue can sometimes be reduced by clever use of the
prefetching intrinsics. I'm not totally sure it's going to help inside
most ufuncs, though, since the runtime is so dominated by memory
reads. In a program I was writing I had time to do a 128-point real
FFT in the time it took to load the next 64 floats...
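
Something along these lines, for instance (illustrative only; the prefetch distance is a guess and would need tuning per machine):

#include <xmmintrin.h>

/* multiply two float arrays, asking the CPU to start fetching data a
 * few cache lines ahead; prefetching past the end of the array is
 * harmless since it is only a hint */
void multiply_with_prefetch(const float *a, const float *b, float *c, int n)
{
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        _mm_prefetch((const char *)(a + i + 64), _MM_HINT_T0);
        _mm_prefetch((const char *)(b + i + 64), _MM_HINT_T0);
        _mm_storeu_ps(c + i, _mm_mul_ps(_mm_loadu_ps(a + i),
                                        _mm_loadu_ps(b + i)));
    }
    for ( ; i < n; i++)
        c[i] = a[i]*b[i];
}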

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-22 Thread Anne Archibald
On 22/03/2008, Travis E. Oliphant [EMAIL PROTECTED] wrote:
 James Philbin wrote:
   Personally, I think that the time would be better spent optimizing
   routines for single-threaded code and relying on BLAS and LAPACK
   libraries to use multiple cores for more complex calculations. In
   particular, doing some basic loop unrolling and SSE versions of the
   ufuncs would be beneficial. I have some experience writing SSE code
   using intrinsics and would be happy to give it a shot if people tell
   me what functions I should focus on.

 Fabulous!   This is on my Project List of todo items for NumPy.  See
  http://projects.scipy.org/scipy/numpy/wiki/ProjectIdeas I should spend
  some time refactoring the ufunc loops so that the templating does not
  get in the way of doing this on a case by case basis.

  1) You should focus on the math operations:  add, subtract, multiply,
  divide, and so forth.
  2) Then for combined operations we should expose the functionality at
  a high-level.  So, that somebody could write code to take advantage of it.

  It would be easiest to use intrinsics which would then work for AMD,
  Intel, on multiple compilers.

I think even heavier use of code generation would be a good idea here.
There are so many different versions of each loop, and the fastest way
to run each one is going to be different for different versions and
different platforms, that a routine that assembled the code from
chunks and picked the fastest combination for each instance might make
a big difference - this is roughly what FFTW and ATLAS do.

There are also some optimizations to be made at a higher level that
might give these optimizations more traction. For example:

A = randn(100*100)
A.shape = (100,100)
A*A

There's no reason the multiply ufunc couldn't flatten A and use a
single unstrided loop to do the multiplication.

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-22 Thread James Philbin
  However, profiling revealed that hardly anything was gained because of
  1) non-alignment of the vectors this _could_ be handled by
  shuffled loading of the values though
  2) the fact that my application used relatively large vectors that
  wouldn't fit into the CPU cache, hence the memory transfer slowed down
  the CPU.
I've had generally positive results from vectorizing code in the past,
admittedly on architectures with fast memory buses (Xeon 5100s). Naive
implementations of most simple vector operations (dot,+,-,etc) were
sped up by around ~20%. I also haven't found aligned accesses to make
much difference (~2-3%), but this might be dependent on the
architecture.

James
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-22 Thread Charles R Harris
On Sat, Mar 22, 2008 at 12:54 PM, Anne Archibald [EMAIL PROTECTED]
wrote:

 On 22/03/2008, Travis E. Oliphant [EMAIL PROTECTED] wrote:
  James Philbin wrote:
Personally, I think that the time would be better spent optimizing
routines for single-threaded code and relying on BLAS and LAPACK
libraries to use multiple cores for more complex calculations. In
particular, doing some basic loop unrolling and SSE versions of the
ufuncs would be beneficial. I have some experience writing SSE code
using intrinsics and would be happy to give it a shot if people tell
me what functions I should focus on.
 
  Fabulous!   This is on my Project List of todo items for NumPy.  See
   http://projects.scipy.org/scipy/numpy/wiki/ProjectIdeas I should spend
   some time refactoring the ufunc loops so that the templating does not
   get in the way of doing this on a case by case basis.
 
   1) You should focus on the math operations:  add, subtract, multiply,
   divide, and so forth.
   2) Then for combined operations we should expose the functionality at
   a high-level.  So, that somebody could write code to take advantage of
 it.
 
   It would be easiest to use intrinsics which would then work for AMD,
   Intel, on multiple compilers.

 I think even heavier use of code generation would be a good idea here.
 There are so many different versions of each loop, and the fastest way
 to run each one is going to be different for different versions and
 different platforms, that a routine that assembled the code from
 chunks and picked the fastest combination for each instance might make
 a big difference - this is roughly what FFTW and ATLAS do.


Maybe it's time to revisit the template subsystem I pulled out of Django.

Chuck
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-22 Thread Travis E. Oliphant
Anne Archibald wrote:
 On 22/03/2008, Travis E. Oliphant [EMAIL PROTECTED] wrote:
   
 James Philbin wrote:
   Personally, I think that the time would be better spent optimizing
   routines for single-threaded code and relying on BLAS and LAPACK
   libraries to use multiple cores for more complex calculations. In
   particular, doing some basic loop unrolling and SSE versions of the
   ufuncs would be beneficial. I have some experience writing SSE code
   using intrinsics and would be happy to give it a shot if people tell
   me what functions I should focus on.

 Fabulous!   This is on my Project List of todo items for NumPy.  See
  http://projects.scipy.org/scipy/numpy/wiki/ProjectIdeas I should spend
  some time refactoring the ufunc loops so that the templating does not
  get in the way of doing this on a case by case basis.

  1) You should focus on the math operations:  add, subtract, multiply,
  divide, and so forth.
  2) Then for combined operations we should expose the functionality at
  a high-level.  So, that somebody could write code to take advantage of it.

  It would be easiest to use intrinsics which would then work for AMD,
  Intel, on multiple compilers.
 

 I think even heavier use of code generation would be a good idea here.
 There are so many different versions of each loop, and the fastest way
 to run each one is going to be different for different versions and
 different platforms, that a routine that assembled the code from
 chunks and picked the fastest combination for each instance might make
 a big difference - this is roughly what FFTW and ATLAS do.

 There are also some optimizations to be made at a higher level that
 might give these optimizations more traction. For example:

 A = randn(100*100)
 A.shape = (100,100)
 A*A

 There's no reason the multiply ufunc couldn't flatten A and use a
 single unstrided loop to do the multiplication.
   
Good idea,  it does already do that :-)  The ufunc machinery is also a 
good place for an optional thread pool.

Perhaps we could drum up interest in a Need for Speed Sprint on NumPy 
sometime over the next few months.


-Travis O.

___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-22 Thread Robert Kern
On Sat, Mar 22, 2008 at 2:04 PM, Charles R Harris
[EMAIL PROTECTED] wrote:

 Maybe it's time to revisit the template subsystem I pulled out of Django.

I am still -lots on using the Django template system. Please, please,
please, look at Jinja or another templating package that could be
dropped in without *any* modification.

-- 
Robert Kern

I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth.
 -- Umberto Eco
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-22 Thread Charles R Harris
On Sat, Mar 22, 2008 at 2:59 PM, Robert Kern [EMAIL PROTECTED] wrote:

 On Sat, Mar 22, 2008 at 2:04 PM, Charles R Harris
 [EMAIL PROTECTED] wrote:

  Maybe it's time to revisit the template subsystem I pulled out of
 Django.

 I am still -lots on using the Django template system. Please, please,
 please, look at Jinja or another templating package that could be
 dropped in without *any* modification.


Well, I have a script that pulls the relevant parts out of Django. I know
you had a bad experience, but...
That said, Jinja looks interesting. It uses the Django syntax, which was one
of the things I liked most about Django templates. In fact, it looks pretty
much like Django templates made into a standalone application, which is what
I was after. However, it's big: the installed egg is about 1 MiB, which is
roughly 12x the size of my cut-down version of Django, and it has some
C code, so it would need building. On the other hand, it also looks like it
contains a lot of extraneous stuff, like translations, that could be
removed. Would you be averse to adding it in if it looks useful?

Chuck
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-22 Thread Stéfan van der Walt
On Sat, Mar 22, 2008 at 8:16 PM, Travis E. Oliphant
[EMAIL PROTECTED] wrote:
  Perhaps we could drum up interest in a Need for Speed Sprint on NumPy
  sometime over the next few months.

I guess we'd all like our computations to complete more quickly, as
long as they still give valid results.  I suggest we make sure that we
have very decent test coverage of the C code before doing any further
optimization.  The regression tests cover a number of important corner
cases, but we don't have all that many tests covering everyday usage.
The ideal place for these would be inside the docstrings -- and if we
get the wiki <-> docstring roundtripping working properly, anyone
would be able to contribute towards that goal.

I know that it is possible to track coverage using gcov (you have to
compile numpy into the python binary), but if anyone has a better way,
I'd like to hear about it.

Regards
Stéfan
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-22 Thread James Philbin
OK, i've written a simple benchmark which implements an elementwise
multiply (A=B*C) in three different ways (standard C, intrinsics, hand
coded assembly). On the face of things the results seem to indicate
that the vectorization works best on medium sized inputs. If people
could post the results of running the benchmark on their machines
(takes ~1min) along with the output of gcc --version and their chip
model, that wd be v useful.

It should be compiled with: gcc -msse -O2 vec_bench.c -o vec_bench

Here's two:

CPU: Core Duo T2500 @ 2GHz
gcc --version: gcc (GCC) 4.1.2 (Ubuntu 4.1.2-0ubuntu4)
Problem size              Simple              Intrin              Inline
         100   0.0003ms (100.0%)   0.0002ms ( 67.7%)   0.0002ms ( 50.6%)
        1000   0.0030ms (100.0%)   0.0021ms ( 69.2%)   0.0015ms ( 50.6%)
       10000   0.0370ms (100.0%)   0.0267ms ( 72.0%)   0.0279ms ( 75.4%)
      100000   0.2258ms (100.0%)   0.1469ms ( 65.0%)   0.1273ms ( 56.4%)
     1000000   4.5690ms (100.0%)   4.4616ms ( 97.6%)   4.4185ms ( 96.7%)
    10000000  47.0022ms (100.0%)  45.4100ms ( 96.6%)  44.4437ms ( 94.6%)

CPU: Intel Xeon E5345 @ 2.33Ghz
gcc --version: gcc (GCC) 4.1.2 20070925 (Red Hat 4.1.2-33)
Problem size              Simple              Intrin              Inline
         100   0.0001ms (100.0%)   0.0001ms ( 69.2%)   0.0001ms ( 77.4%)
        1000   0.0010ms (100.0%)   0.0008ms ( 78.1%)   0.0009ms ( 86.6%)
       10000   0.0108ms (100.0%)   0.0088ms ( 81.2%)   0.0086ms ( 79.6%)
      100000   0.1131ms (100.0%)   0.0897ms ( 79.3%)   0.0872ms ( 77.1%)
     1000000   5.2103ms (100.0%)   3.9153ms ( 75.1%)   3.8328ms ( 73.6%)
    10000000  54.1815ms (100.0%)  51.8286ms ( 95.7%)  51.4366ms ( 94.9%)

James
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <sys/time.h>   /* for gettimeofday() */
#include <xmmintrin.h>

int sizes[6] = {100, 1000, 10000, 100000, 1000000, 10000000};
int nsizes = 6;
int repeats = 300;

double
accurate_time()
{
  struct timeval t;
  gettimeofday(&t, NULL);
  return (double)t.tv_sec + t.tv_usec*0.000001;
}

void
rand_vector(int sz, float* vec)
{
  int i;
  for (i = 0; i < sz; i++) {
    vec[i] = (float)rand()/RAND_MAX;
  }
}

typedef void (*vec_func)(float*, float*, float*, int);

double
time_func(vec_func func, int sz)
{
  int i;
  double t1, t2;
  float *v1, *v2, *v3;

  v1 = (float*)malloc(sizeof(float)*sz);
  v2 = (float*)malloc(sizeof(float)*sz);
  v3 = (float*)malloc(sizeof(float)*sz);
  rand_vector(sz, v1);
  rand_vector(sz, v2);

  t1 = accurate_time();
  for (i = 0; i < repeats; i++) {
    func(v1, v2, v3, sz);
  }
  t2 = accurate_time();
 
  free(v1);
  free(v2);
  free(v3);

  return (t2-t1)/repeats;
}

void
multiply_float_simple(float* a, float* b, float* c, int sz)
{
  int i;
  for (i = 0; i < sz; i++) {
    c[i] = a[i]*b[i];
  }
}

void
multiply_float_intrin(float* a, float* b, float* c, int sz)
{
  __m128 a_, b_, c_;
  int i;
  for (i = 0; i < (sz & -4); i += 4) {
    a_ = _mm_loadu_ps(a + i);
    b_ = _mm_loadu_ps(b + i);
    c_ = _mm_mul_ps(a_, b_);
    _mm_storeu_ps(c + i, c_);
  }
  for ( ; i < sz; i++) {
    a_ = _mm_load_ss(a + i);
    b_ = _mm_load_ss(b + i);
    c_ = _mm_mul_ss(a_, b_);
    _mm_store_ss(c + i, c_);
  }
}

void
multiply_float_inline(float* a, float* b, float* c, int sz)
{
#ifndef __SSE__
#error Use -msse
#else
  __asm__ __volatile__(
   "mov %[sz], %%eax\n\t"
   "test $-4, %%eax\n\t"
   "jz .L_11_%=\n\t"
   ".L_10_%=:\n\t"

   "movlps 0(%[a]), %%xmm0\n\t" // xmm0 = a
   "movhps 8(%[a]), %%xmm0\n\t"

   "movlps 0(%[b]), %%xmm1\n\t" // xmm1 = b
   "movhps 8(%[b]), %%xmm1\n\t"

   "mulps %%xmm0, %%xmm1\n\t" // xmm1 = a*b

   "movlps %%xmm1, 0(%[c])\n\t"
   "movhps %%xmm1, 8(%[c])\n\t"

   "add $16, %[a]\n\t"
   "add $16, %[b]\n\t"
   "add $16, %[c]\n\t"

   "sub $4, %%eax\n\t"
   "test $-4, %%eax\n\t"
   "jnz .L_10_%=\n\t"
   ".L_11_%=:\n\t"

   "test $-1, %%eax\n\t"
   "jz .L_13_%=\n\t"
   ".L_12_%=:\n\t"

   "movss 0(%[a]), %%xmm0\n\t"
   "movss 0(%[b]), %%xmm1\n\t"
   "mulss %%xmm0, %%xmm1\n\t"
   "movss %%xmm1, 0(%[c])\n\t"

   "add $4, %[a]\n\t"
   "add $4, %[b]\n\t"
   "add $4, %[c]\n\t"

   "sub $1, %%eax\n\t"
   "test $-1, %%eax\n\t"
   "jnz .L_12_%=\n\t"
   ".L_13_%=:\n\t"
  : // Outputs
   [a] "=r" (a),
   [b] "=r" (b),
   [c] "=r" (c)
  : // Inputs
   "0" (a),
   "1" (b),
   "2" (c),
   [sz] "m" (sz)
  : "%eax", "%xmm0", "%xmm1"
  );
#endif
}

vec_func methods[3] = {multiply_float_simple, multiply_float_intrin, multiply_float_inline};
char *method_names[3] = {"Simple", "Intrin", "Inline"};
int nmethods = 3;

void
test_methods(void)
{
  int test_size = 10003; // Not divisible by 4
  int i, j;
  float *v1, *v2, *v3, *v4;
  v1 = (float*)malloc(test_size*sizeof(float));
  v2 = (float*)malloc(test_size*sizeof(float));
  v3 = (float*)malloc(test_size*sizeof(float));
  v4 = (float*)malloc(test_size*sizeof(float));

  rand_vector(test_size, v1);
  rand_vector(test_size, v2);

  methods[0](v1, v2, v3, test_size);

  for (i = 1; i < nmethods; i++) {
methods[i](v1, v2, v4, 

Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-22 Thread Neal Becker
gcc --version
gcc (GCC) 4.1.2 20070925 (Red Hat 4.1.2-33)
Copyright (C) 2006 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

[EMAIL PROTECTED] ~]$ cat /proc/cpuinfo
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model   : 15
model name  : Intel(R) Core(TM)2 Duo CPU T7500  @ 2.20GHz
stepping: 11
cpu MHz : 2201.000
cache size  : 4096 KB
physical id : 0
siblings: 2
core id : 0
cpu cores   : 2
fpu : yes
fpu_exception   : yes
cpuid level : 10
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm 
constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx est tm2 
ssse3 cx16 xtpr lahf_lm ida
bogomips: 4393.14
clflush size: 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor   : 1
vendor_id   : GenuineIntel
cpu family  : 6
model   : 15
model name  : Intel(R) Core(TM)2 Duo CPU T7500  @ 2.20GHz
stepping: 11
cpu MHz : 2201.000
cache size  : 4096 KB
physical id : 0
siblings: 2
core id : 1
cpu cores   : 2
fpu : yes
fpu_exception   : yes
cpuid level : 10
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm 
constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx est tm2 
ssse3 cx16 xtpr lahf_lm ida
bogomips: 4389.47
clflush size: 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

[EMAIL PROTECTED] ~]$ gcc -O2 vec_bench.c -o vec_bench
[EMAIL PROTECTED] ~]$ ./vec_bench
Testing methods...
All OK

Problem size              Simple              Intrin              Inline
         100   0.0003ms (100.0%)   0.0003ms ( 78.3%)   0.0003ms ( 75.5%)
        1000   0.0029ms (100.0%)   0.0022ms ( 75.9%)   0.0026ms ( 87.0%)
       10000   0.0131ms (100.0%)   0.0085ms ( 65.0%)   0.0092ms ( 70.3%)
      100000   0.1210ms (100.0%)   0.0875ms ( 72.3%)   0.0932ms ( 77.0%)
     1000000   4.2518ms (100.0%)   7.5801ms (178.3%)   7.6278ms (179.4%)
    10000000  81.6962ms (100.0%)  79.8668ms ( 97.8%)  81.6365ms ( 99.9%)
[EMAIL PROTECTED] ~]$ gcc -O3 -ffast-math vec_bench.c -o vec_bench
[EMAIL PROTECTED] ~]$ ./vec_bench
Testing methods...
All OK

Problem size              Simple              Intrin              Inline
         100   0.0003ms (100.0%)   0.0002ms ( 68.4%)   0.0003ms ( 74.2%)
        1000   0.0029ms (100.0%)   0.0023ms ( 77.2%)   0.0025ms ( 86.9%)
       10000   0.0353ms (100.0%)   0.0086ms ( 24.5%)   0.0092ms ( 26.1%)
      100000   0.1497ms (100.0%)   0.1013ms ( 67.6%)   0.1146ms ( 76.6%)
     1000000   4.4004ms (100.0%)   7.5651ms (171.9%)   7.6200ms (173.2%)
    10000000  81.3631ms (100.0%)  83.3591ms (102.5%)  79.8199ms ( 98.1%)
[EMAIL PROTECTED] ~]$ gcc -O3 -msse4a vec_bench.c -o vec_bench
[EMAIL PROTECTED] ~]$ ./vec_bench
Testing methods...
All OK

Problem size              Simple              Intrin              Inline
         100   0.0001ms (100.0%)   0.0001ms ( 67.5%)   0.0001ms ( 74.8%)
        1000   0.0011ms (100.0%)   0.0008ms ( 78.0%)   0.0009ms ( 86.4%)
       10000   0.0116ms (100.0%)   0.0085ms ( 73.2%)   0.0092ms ( 79.1%)
      100000   0.1500ms (100.0%)   0.0873ms ( 58.2%)   0.0931ms ( 62.1%)
     1000000   4.2654ms (100.0%)   7.5623ms (177.3%)   7.5713ms (177.5%)
    10000000  79.4805ms (100.0%)  81.0649ms (102.0%)  81.1859ms (102.1%)

___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-22 Thread Charles R Harris
On Sat, Mar 22, 2008 at 5:03 PM, James Philbin [EMAIL PROTECTED] wrote:

 OK, i've written a simple benchmark which implements an elementwise
 multiply (A=B*C) in three different ways (standard C, intrinsics, hand
 coded assembly). On the face of things the results seem to indicate
 that the vectorization works best on medium sized inputs. If people
 could post the results of running the benchmark on their machines
 (takes ~1min) along with the output of gcc --version and their chip
 model, that wd be v useful.

 It should be compiled with: gcc -msse -O2 vec_bench.c -o vec_bench

 Here's two:

 CPU: Core Duo T2500 @ 2GHz
 gcc --version: gcc (GCC) 4.1.2 (Ubuntu 4.1.2-0ubuntu4)
Problem size  Simple  Intrin
  Inline
 100   0.0003ms (100.0%)   0.0002ms ( 67.7%)   0.0002ms (
 50.6%)
1000   0.0030ms (100.0%)   0.0021ms ( 69.2%)   0.0015ms (
 50.6%)
   1   0.0370ms (100.0%)   0.0267ms ( 72.0%)   0.0279ms (
 75.4%)
  10   0.2258ms (100.0%)   0.1469ms ( 65.0%)   0.1273ms (
 56.4%)
 100   4.5690ms (100.0%)   4.4616ms ( 97.6%)   4.4185ms (
 96.7%)
1000  47.0022ms (100.0%)  45.4100ms ( 96.6%)  44.4437ms (
 94.6%)

 CPU: Intel Xeon E5345 @ 2.33Ghz
 gcc --version: gcc (GCC) 4.1.2 20070925 (Red Hat 4.1.2-33)
Problem size  Simple  Intrin
  Inline
 100   0.0001ms (100.0%)   0.0001ms ( 69.2%)   0.0001ms (
 77.4%)
1000   0.0010ms (100.0%)   0.0008ms ( 78.1%)   0.0009ms (
 86.6%)
   1   0.0108ms (100.0%)   0.0088ms ( 81.2%)   0.0086ms (
 79.6%)
  10   0.1131ms (100.0%)   0.0897ms ( 79.3%)   0.0872ms (
 77.1%)
 100   5.2103ms (100.0%)   3.9153ms ( 75.1%)   3.8328ms (
 73.6%)
1000  54.1815ms (100.0%)  51.8286ms ( 95.7%)  51.4366ms (
 94.9%)


gcc --version: gcc (GCC) 4.1.2 20070925 (Red Hat 4.1.2-33)
cpu:  Intel(R) Core(TM)2 CPU  6600  @ 2.40GHz

Problem size              Simple              Intrin              Inline
         100   0.0002ms (100.0%)   0.0001ms ( 68.7%)   0.0001ms ( 74.8%)
        1000   0.0015ms (100.0%)   0.0011ms ( 72.0%)   0.0012ms ( 80.4%)
       10000   0.0154ms (100.0%)   0.0111ms ( 72.1%)   0.0122ms ( 79.1%)
      100000   0.1081ms (100.0%)   0.0759ms ( 70.2%)   0.0811ms ( 75.0%)
     1000000   2.7778ms (100.0%)   2.8172ms (101.4%)   2.7929ms (100.5%)
    10000000  28.1577ms (100.0%)  28.7332ms (102.0%)  28.4669ms (101.1%)

It looks like memory access is the bottleneck, otherwise running 4 floats
through in parallel should go a lot faster. I need to modify the program a
bit and see how it works for doubles.

Chuck
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-22 Thread Thomas Grill
Hi,
here's my results:

Intel Core 2 Duo, 2.16GHz, 667MHz bus, 4MB Cache
running under OSX 10.5.2

please note that the auto-vectorizer of gcc-4.3 is doing really well

gr~~~

-

gcc version 4.0.1 (Apple Inc. build 5465)

xbook-2:temp thomas$ gcc -msse -O2 vec_bench.c -o vec_bench
xbook-2:temp thomas$ ./vec_bench
Testing methods...
All OK

Problem size              Simple              Intrin              Inline
         100   0.0002ms (100.0%)   0.0001ms ( 83.2%)   0.0001ms ( 85.1%)
        1000   0.0014ms (100.0%)   0.0014ms ( 99.5%)   0.0014ms ( 97.6%)
       10000   0.0180ms (100.0%)   0.0137ms ( 76.1%)   0.0103ms ( 56.9%)
      100000   0.1307ms (100.0%)   0.1153ms ( 88.2%)   0.0952ms ( 72.8%)
     1000000   4.0309ms (100.0%)   4.1641ms (103.3%)   4.0129ms ( 99.6%)
    10000000  43.2557ms (100.0%)  43.5919ms (100.8%)  42.6391ms ( 98.6%)



gcc version 4.3.0 20080125 (experimental) (GCC)

xbook-2:temp thomas$ gcc-4.3 -msse -O2 vec_bench.c -o vec_bench
xbook-2:temp thomas$ ./vec_bench
Testing methods...
All OK

Problem size              Simple              Intrin              Inline
         100   0.0002ms (100.0%)   0.0001ms ( 77.4%)   0.0001ms ( 72.0%)
        1000   0.0017ms (100.0%)   0.0014ms ( 84.4%)   0.0014ms ( 79.4%)
       10000   0.0173ms (100.0%)   0.0148ms ( 85.4%)   0.0104ms ( 59.9%)
      100000   0.1276ms (100.0%)   0.1243ms ( 97.4%)   0.0952ms ( 74.6%)
     1000000   4.0466ms (100.0%)   4.1168ms (101.7%)   4.0348ms ( 99.7%)
    10000000  43.1842ms (100.0%)  43.2989ms (100.3%)  44.2171ms (102.4%)

xbook-2:temp thomas$ gcc-4.3 -msse -O2 -ftree-vectorize vec_bench.c -o vec_bench
xbook-2:temp thomas$ ./vec_bench
Testing methods...
All OK

Problem size              Simple              Intrin              Inline
         100   0.0001ms (100.0%)   0.0001ms (126.6%)   0.0001ms (120.3%)
        1000   0.0011ms (100.0%)   0.0014ms (136.3%)   0.0014ms (127.9%)
       10000   0.0144ms (100.0%)   0.0153ms (106.3%)   0.0103ms ( 72.0%)
      100000   0.1027ms (100.0%)   0.1243ms (121.0%)   0.0953ms ( 92.8%)
     1000000   3.9691ms (100.0%)   4.1197ms (103.8%)   4.0252ms (101.4%)
    10000000  42.1922ms (100.0%)  43.6721ms (103.5%)  43.4035ms (102.9%)
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-22 Thread Charles R Harris
On Sat, Mar 22, 2008 at 5:32 PM, Charles R Harris [EMAIL PROTECTED]
wrote:



 On Sat, Mar 22, 2008 at 5:03 PM, James Philbin [EMAIL PROTECTED] wrote:

  OK, i've written a simple benchmark which implements an elementwise
  multiply (A=B*C) in three different ways (standard C, intrinsics, hand
  coded assembly). On the face of things the results seem to indicate
  that the vectorization works best on medium sized inputs. If people
  could post the results of running the benchmark on their machines
  (takes ~1min) along with the output of gcc --version and their chip
  model, that wd be v useful.
 
  It should be compiled with: gcc -msse -O2 vec_bench.c -o vec_bench
 
  Here's two:
 
  CPU: Core Duo T2500 @ 2GHz
  gcc --version: gcc (GCC) 4.1.2 (Ubuntu 4.1.2-0ubuntu4)
 Problem size  Simple  Intrin
   Inline
  100   0.0003ms (100.0%)   0.0002ms ( 67.7%)   0.0002ms (
  50.6%)
 1000   0.0030ms (100.0%)   0.0021ms ( 69.2%)   0.0015ms (
  50.6%)
1   0.0370ms (100.0%)   0.0267ms ( 72.0%)   0.0279ms (
  75.4%)
   10   0.2258ms (100.0%)   0.1469ms ( 65.0%)   0.1273ms (
  56.4%)
  100   4.5690ms (100.0%)   4.4616ms ( 97.6%)   4.4185ms (
  96.7%)
 1000  47.0022ms (100.0%)  45.4100ms ( 96.6%)  44.4437ms (
  94.6%)
 
  CPU: Intel Xeon E5345 @ 2.33Ghz
  gcc --version: gcc (GCC) 4.1.2 20070925 (Red Hat 4.1.2-33)
 Problem size  Simple  Intrin
   Inline
  100   0.0001ms (100.0%)   0.0001ms ( 69.2%)   0.0001ms (
  77.4%)
 1000   0.0010ms (100.0%)   0.0008ms ( 78.1%)   0.0009ms (
  86.6%)
1   0.0108ms (100.0%)   0.0088ms ( 81.2%)   0.0086ms (
  79.6%)
   10   0.1131ms (100.0%)   0.0897ms ( 79.3%)   0.0872ms (
  77.1%)
  100   5.2103ms (100.0%)   3.9153ms ( 75.1%)   3.8328ms (
  73.6%)
 1000  54.1815ms (100.0%)  51.8286ms ( 95.7%)  51.4366ms (
  94.9%)
 

 gcc --version: gcc (GCC) 4.1.2 20070925 (Red Hat 4.1.2-33)
 cpu:  Intel(R) Core(TM)2 CPU  6600  @ 2.40GHz

 Problem size  Simple  Intrin
 Inline
  100   0.0002ms (100.0%)   0.0001ms ( 68.7%)   0.0001ms (
 74.8%)
 1000   0.0015ms (100.0%)   0.0011ms ( 72.0%)   0.0012ms (
 80.4%)
1   0.0154ms (100.0%)   0.0111ms ( 72.1%)   0.0122ms (
 79.1%)
   10   0.1081ms (100.0%)   0.0759ms ( 70.2%)   0.0811ms (
 75.0%)
  100   2.7778ms (100.0%)   2.8172ms (101.4%)   2.7929ms (
 100.5%)
 1000  28.1577ms (100.0%)  28.7332ms (102.0%)  28.4669ms (
 101.1%)

 It looks like memory access is the bottleneck, otherwise running 4 floats
 through in parallel should go a lot faster. I need to modify the program a
 bit and see how it works for doubles.


Doubles don't look so good running on a 32 bit OS. Maybe alignment would
help.
Compiled with gcc -msse2 -mfpmath=sse -O2 vec_bench_dbl.c -o vec_bench_dbl


Problem size              Simple              Intrin
         100   0.0002ms (100.0%)   0.0002ms (149.5%)
        1000   0.0015ms (100.0%)   0.0024ms (159.0%)
       10000   0.0219ms (100.0%)   0.0180ms ( 81.9%)
      100000   0.1518ms (100.0%)   0.1686ms (111.1%)
     1000000   5.5588ms (100.0%)   5.8145ms (104.6%)
    10000000  56.7152ms (100.0%)  59.3139ms (104.6%)


Chuck
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-22 Thread Charles R Harris
On Sat, Mar 22, 2008 at 6:34 PM, Charles R Harris [EMAIL PROTECTED]
wrote:

I've attached a double version.  Compile with
gcc -msse2 -mfpmath=sse -O2 vec_bench_dbl.c -o vec_bench_dbl

Chuck
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <sys/time.h>   /* for gettimeofday() */
#include <emmintrin.h>

int sizes[6] = {100, 1000, 10000, 100000, 1000000, 10000000};
int nsizes = 6;
int repeats = 300;

double
accurate_time()
{
  struct timeval t;
  gettimeofday(&t, NULL);
  return (double)t.tv_sec + t.tv_usec*0.000001;
}

void
rand_vector(int sz, double* vec)
{
  int i;
  for (i = 0; i < sz; i++) {
    vec[i] = (double)rand()/RAND_MAX;
  }
}

typedef void (*vec_func)(double*, double*, double*, int);

double
time_func(vec_func func, int sz)
{
  int i;
  double t1, t2;
  double *v1, *v2, *v3;

  v1 = (double*)malloc(sizeof(double)*sz);
  v2 = (double*)malloc(sizeof(double)*sz);
  v3 = (double*)malloc(sizeof(double)*sz);
  rand_vector(sz, v1);
  rand_vector(sz, v2);

  t1 = accurate_time();
  for (i = 0; i < repeats; i++) {
    func(v1, v2, v3, sz);
  }
  t2 = accurate_time();

  free(v1);
  free(v2);
  free(v3);

  return (t2-t1)/repeats;
}

void
multiply_double_simple(double* a, double* b, double* c, int sz)
{
  int i;
  for (i = 0; i < sz; i++) {
    c[i] = a[i]*b[i];
  }
}

void
multiply_double_intrin(double* a, double* b, double* c, int sz)
{
  __m128d a_, b_, c_;
  int i;
  for (i = 0; i < (sz & -2); i += 2) {
    a_ = _mm_loadu_pd(a + i);
    b_ = _mm_loadu_pd(b + i);
    c_ = _mm_mul_pd(a_, b_);
    _mm_storeu_pd(c + i, c_);
  }
  for ( ; i < sz; i++) {
    a_ = _mm_load_sd(a + i);
    b_ = _mm_load_sd(b + i);
    c_ = _mm_mul_sd(a_, b_);
    _mm_store_sd(c + i, c_);
  }
}

/*
void
multiply_float_inline(float* a, float* b, float* c, int sz)
{
#ifndef __SSE__
#error Use -msse
#else
  __asm__ __volatile__(
   mov %[sz], %%eax\n\t
   test $-4, %%eax\n\t
   jz .L_11_%=\n\t
   .L_10_%=:\n\t

   movlps 0(%[a]), %%xmm0\n\t // xmm0 = a
   movhps 8(%[a]), %%xmm0\n\t

   movlps 0(%[b]), %%xmm1\n\t // xmm1 = b
   movhps 8(%[b]), %%xmm1\n\t

   mulps %%xmm0, %%xmm1\n\t // xmm1 = a*b

   movlps %%xmm1, 0(%[c])\n\t
   movhps %%xmm1, 8(%[c])\n\t

   add $16, %[a]\n\t
   add $16, %[b]\n\t
   add $16, %[c]\n\t

   sub $4, %%eax\n\t
   test $-4, %%eax\n\t
   jnz .L_10_%=\n\t
   .L_11_%=:\n\t

   test $-1, %%eax\n\t
   jz .L_13_%=\n\t
   .L_12_%=:\n\t

   movss 0(%[a]), %%xmm0\n\t
   movss 0(%[b]), %%xmm1\n\t
   mulss %%xmm0, %%xmm1\n\t
   movss %%xmm1, 0(%[c])\n\t
   
   add $4, %[a]\n\t
   add $4, %[b]\n\t
   add $4, %[c]\n\t

   sub $1, %%eax\n\t
   test $-1, %%eax\n\t
   jnz .L_12_%=\n\t
   .L_13_%=:\n\t
  : // Outputs
   [a] =r (a),
   [b] =r (b),
   [c] =r (c)
  : // Inputs
   0 (a),
   1 (b),
   2 (c),
   [sz] m (sz)
  : %eax, %xmm0, %xmm1
  );
#endif
}
*/

vec_func methods[3] = {multiply_double_simple, multiply_double_intrin};
char *method_names[3] = {"Simple", "Intrin"};
int nmethods = 2;

void
test_methods(void)
{
  int test_size = 10003; // Not divisible by 4
  int i, j;
  double *v1, *v2, *v3, *v4;
  v1 = (double*)malloc(test_size*sizeof(double));
  v2 = (double*)malloc(test_size*sizeof(double));
  v3 = (double*)malloc(test_size*sizeof(double));
  v4 = (double*)malloc(test_size*sizeof(double));

  rand_vector(test_size, v1);
  rand_vector(test_size, v2);

  methods[0](v1, v2, v3, test_size);

  for (i = 1; i < nmethods; i++) {
    methods[i](v1, v2, v4, test_size);
    for (j = 0; j < test_size; j++) {
      assert(v4[j] == v3[j]);
    }
  }

  free(v1);
  free(v2);
  free(v3);
  free(v4);
}

int
main(void)
{
  int i, j;
  double dt;
  double dt_simp;

  printf("Testing methods...\n");
  test_methods();
  printf("All OK\n\n");

  printf("%20s", "Problem size");
  for (i = 0; i < nmethods; i++) {
    printf("%20s", method_names[i]);
  }
  printf("\n");

  for (i = 0; i < nsizes; i++) {
    printf("%20d", sizes[i]);
    dt_simp = time_func(methods[0], sizes[i])*1000.0;
    printf("  %7.4fms (%5.1f%%)", dt_simp, 100.0);
    for (j = 1; j < nmethods; j++) {
      dt = time_func(methods[j], sizes[i])*1000.0;
      printf("  %7.4fms (%5.1f%%)", dt, (100.0*dt)/dt_simp);
    }
    printf("\n");
  }
}
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-22 Thread Scott Ransom
Here are results under 64-bit linux using gcc-4.3 (which by
default turns on the various sse flags).  Note that -O3 is
significantly better than -O2 for the simple calls:

nimrod:~$ cat /proc/cpuinfo | grep model name | head -1
model name  : Intel(R) Xeon(R) CPU   E5450  @ 3.00GHz

nimrod:~$ gcc-4.3 --version
gcc-4.3 (Debian 4.3.0-1) 4.3.1 20080309 (prerelease)

nimrod:~$ gcc-4.3 -O2 vec_bench.c -o vec_bench
nimrod:~$ ./vec_bench
Testing methods...
All OK
Problem size              Simple              Intrin              Inline
         100   0.0001ms (100.0%)   0.0001ms ( 70.8%)   0.0001ms ( 74.3%)
        1000   0.0008ms (100.0%)   0.0006ms ( 70.3%)   0.0007ms ( 80.3%)
       10000   0.0085ms (100.0%)   0.0061ms ( 72.0%)   0.0067ms ( 78.8%)
      100000   0.0882ms (100.0%)   0.0627ms ( 71.1%)   0.0677ms ( 76.7%)
     1000000   3.6748ms (100.0%)   3.3312ms ( 90.7%)   3.3139ms ( 90.2%)
    10000000  37.1154ms (100.0%)  35.9762ms ( 96.9%)  36.1126ms ( 97.3%)

nimrod:~$ gcc-4.3 -O3 vec_bench.c -o vec_bench
nimrod:~$ ./vec_bench
Testing methods...
All OK
Problem size              Simple              Intrin              Inline
         100   0.0001ms (100.0%)   0.0001ms (111.1%)   0.0001ms (116.7%)
        1000   0.0005ms (100.0%)   0.0006ms (111.3%)   0.0007ms (126.8%)
       10000   0.0056ms (100.0%)   0.0061ms (108.6%)   0.0067ms (118.9%)
      100000   0.0581ms (100.0%)   0.0626ms (107.8%)   0.0677ms (116.5%)
     1000000   3.4549ms (100.0%)   3.3339ms ( 96.5%)   3.3255ms ( 96.3%)
    10000000  34.8186ms (100.0%)  35.9767ms (103.3%)  36.1099ms (103.7%)


nimrod:~$ ./vec_bench_dbl
Testing methods...
All OK
Problem size              Simple              Intrin
         100   0.0001ms (100.0%)   0.0001ms (132.5%)
        1000   0.0009ms (100.0%)   0.0012ms (134.5%)
       10000   0.0119ms (100.0%)   0.0124ms (104.1%)
      100000   0.1226ms (100.0%)   0.1276ms (104.1%)
     1000000   7.0047ms (100.0%)   6.6654ms ( 95.2%)
    10000000  70.0060ms (100.0%)  71.9692ms (102.8%)

nimrod:~$ gcc-4.3 -O3 vec_bench_dbl.c -o vec_bench_dbl
nimrod:~$ ./vec_bench_dbl
Testing methods...
All OK
Problem size              Simple              Intrin
         100   0.0001ms (100.0%)   0.0002ms (289.8%)
        1000   0.0007ms (100.0%)   0.0012ms (172.7%)
       10000   0.0114ms (100.0%)   0.0124ms (109.4%)
      100000   0.1159ms (100.0%)   0.1278ms (110.3%)
     1000000   6.9252ms (100.0%)   6.6585ms ( 96.1%)
    10000000  69.1913ms (100.0%)  71.9664ms (104.0%)


On Sat, Mar 22, 2008 at 06:41:31PM -0600, Charles R Harris wrote:
 On Sat, Mar 22, 2008 at 6:34 PM, Charles R Harris [EMAIL PROTECTED]
 wrote:
 
 I've attached a double version.  Compile with
 gcc -msse2 -mfpmath=sse -O2 vec_bench_dbl.c -o vec_bench_dbl
 
 Chuck


 ___
 Numpy-discussion mailing list
 Numpy-discussion@scipy.org
 http://projects.scipy.org/mailman/listinfo/numpy-discussion


-- 
-- 
Scott M. RansomAddress:  NRAO
Phone:  (434) 296-0320   520 Edgemont Rd.
email:  [EMAIL PROTECTED] Charlottesville, VA 22903 USA
GPG Fingerprint: 06A9 9553 78BE 16DB 407B  FFCA 9BFA B6FF FFD3 2989
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-22 Thread Neal Becker
Thomas Grill wrote:

 Hi,
 here's my results:
 
 Intel Core 2 Duo, 2.16GHz, 667MHz bus, 4MB Cache
 running under OSX 10.5.2
 
 please note that the auto-vectorizer of gcc-4.3 is doing really well
 
 gr~~~
 
 -
 
 gcc version 4.0.1 (Apple Inc. build 5465)
 
 xbook-2:temp thomas$ gcc -msse -O2 vec_bench.c -o vec_bench
 xbook-2:temp thomas$ ./vec_bench
 Testing methods...
 All OK
 
 Problem size  Simple  Intrin 
 Inline
  100   0.0002ms (100.0%)   0.0001ms ( 83.2%)   0.0001ms (
  85.1%)
 1000   0.0014ms (100.0%)   0.0014ms ( 99.5%)   0.0014ms (
 97.6%)
1   0.0180ms (100.0%)   0.0137ms ( 76.1%)   0.0103ms (
56.9%)
   10   0.1307ms (100.0%)   0.1153ms ( 88.2%)   0.0952ms (
   72.8%)
  100   4.0309ms (100.0%)   4.1641ms (103.3%)   4.0129ms (
  99.6%)
 1000  43.2557ms (100.0%)  43.5919ms (100.8%)  42.6391ms (
 98.6%)
 
 
 
 gcc version 4.3.0 20080125 (experimental) (GCC)
 
 xbook-2:temp thomas$ gcc-4.3 -msse -O2 vec_bench.c -o vec_bench
 xbook-2:temp thomas$ ./vec_bench
 Testing methods...
 All OK
 
 Problem size  Simple  Intrin 
 Inline
  100   0.0002ms (100.0%)   0.0001ms ( 77.4%)   0.0001ms (
  72.0%)
 1000   0.0017ms (100.0%)   0.0014ms ( 84.4%)   0.0014ms (
 79.4%)
1   0.0173ms (100.0%)   0.0148ms ( 85.4%)   0.0104ms (
59.9%)
   10   0.1276ms (100.0%)   0.1243ms ( 97.4%)   0.0952ms (
   74.6%)
  100   4.0466ms (100.0%)   4.1168ms (101.7%)   4.0348ms (
  99.7%)
 1000  43.1842ms (100.0%)  43.2989ms (100.3%)  44.2171ms
 (102.4%)
 
 xbook-2:temp thomas$ gcc-4.3 -msse -O2 -ftree-vectorize vec_bench.c -o
 vec_bench xbook-2:temp thomas$ ./vec_bench
 Testing methods...
 All OK
 
 Problem size  Simple  Intrin 
 Inline
  100   0.0001ms (100.0%)   0.0001ms (126.6%)   0.0001ms
  (120.3%)
 1000   0.0011ms (100.0%)   0.0014ms (136.3%)   0.0014ms
 (127.9%)
1   0.0144ms (100.0%)   0.0153ms (106.3%)   0.0103ms (
72.0%)
   10   0.1027ms (100.0%)   0.1243ms (121.0%)   0.0953ms (
   92.8%)
  100   3.9691ms (100.0%)   4.1197ms (103.8%)   4.0252ms
  (101.4%)
 1000  42.1922ms (100.0%)  43.6721ms (103.5%)  43.4035ms
 (102.9%)
gcc version 4.3.0 20080307 (Red Hat 4.3.0-2) (GCC) 
gcc -msse -O2 -ftree-vectorize vec_bench.c -o vec_bench
mock-chroot ./vec_bench
Testing methods...
All OK

Problem size              Simple              Intrin              Inline
         100   0.0001ms (100.0%)   0.0001ms (141.6%)   0.0001ms (108.0%)
        1000   0.0008ms (100.0%)   0.0011ms (149.9%)   0.0008ms (100.4%)
       10000   0.0135ms (100.0%)   0.0197ms (145.8%)   0.0133ms ( 98.8%)
      100000   0.6415ms (100.0%)   0.4918ms ( 76.7%)   0.5052ms ( 78.8%)
     1000000   7.5364ms (100.0%)   7.9987ms (106.1%)   7.4832ms ( 99.3%)
    10000000  76.3927ms (100.0%)  76.8933ms (100.7%)  75.1002ms ( 98.3%)
model name  : AMD Athlon(tm) 64 Processor 3200+
stepping: 10
cpu MHz : 2000.068
cache size  : 1024 KB

Now same, but with  gcc --version
gcc (GCC) 4.1.2 20070925 (Red Hat 4.1.2-33)
Testing methods...
All OK

Problem size              Simple              Intrin              Inline
         100   0.0002ms (100.0%)   0.0001ms ( 77.2%)   0.0001ms ( 58.7%)
        1000   0.0015ms (100.0%)   0.0011ms ( 73.5%)   0.0008ms ( 52.6%)
       10000   0.0214ms (100.0%)   0.0195ms ( 90.9%)   0.0363ms (169.3%)
      100000   0.6620ms (100.0%)   0.5614ms ( 84.8%)   0.5527ms ( 83.5%)
     1000000   7.5975ms (100.0%)   7.3826ms ( 97.2%)   7.3380ms ( 96.6%)
    10000000  75.8361ms (100.0%)  84.0476ms (110.8%)  77.2884ms (101.9%)

___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-22 Thread Charles R Harris
On Sat, Mar 22, 2008 at 7:35 PM, Scott Ransom [EMAIL PROTECTED] wrote:

 Here are results under 64-bit linux using gcc-4.3 (which by
 default turns on the various sse flags).  Note that -O3 is
 significantly better than -O2 for the simple calls:

 nimrod:~$ cat /proc/cpuinfo | grep model name | head -1
 model name  : Intel(R) Xeon(R) CPU   E5450  @ 3.00GHz

 nimrod:~$ gcc-4.3 --version
 gcc-4.3 (Debian 4.3.0-1) 4.3.1 20080309 (prerelease)

 nimrod:~$ gcc-4.3 -O2 vec_bench.c -o vec_bench
 nimrod:~$ ./vec_bench
 Testing methods...
 All OK
 Problem size Simple  Intrin  Inline
  100   0.0001ms (100.0%)   0.0001ms ( 70.8%)   0.0001ms ( 74.3%)
1000   0.0008ms (100.0%)   0.0006ms ( 70.3%)   0.0007ms ( 80.3%)
   1   0.0085ms (100.0%)   0.0061ms ( 72.0%)   0.0067ms ( 78.8%)
  10   0.0882ms (100.0%)   0.0627ms ( 71.1%)   0.0677ms ( 76.7%)
  100   3.6748ms (100.0%)   3.3312ms ( 90.7%)   3.3139ms ( 90.2%)
 1000  37.1154ms (100.0%)  35.9762ms ( 96.9%)  36.1126ms ( 97.3%)

 nimrod:~$ gcc-4.3 -O3 vec_bench.c -o vec_bench
 nimrod:~$ ./vec_bench
 Testing methods...
 All OK
 Problem size Simple  Intrin  Inline
  100   0.0001ms (100.0%)   0.0001ms (111.1%)   0.0001ms (116.7%)
1000   0.0005ms (100.0%)   0.0006ms (111.3%)   0.0007ms (126.8%)
   1   0.0056ms (100.0%)   0.0061ms (108.6%)   0.0067ms (118.9%)
  10   0.0581ms (100.0%)   0.0626ms (107.8%)   0.0677ms (116.5%)
  100   3.4549ms (100.0%)   3.3339ms ( 96.5%)   3.3255ms ( 96.3%)
 1000  34.8186ms (100.0%)  35.9767ms (103.3%)  36.1099ms (103.7%)


 nimrod:~$ ./vec_bench_dbl
 Testing methods...
 All OK
 Problem size  Simple  Intrin
 100   0.0001ms (100.0%)   0.0001ms (132.5%)
1000   0.0009ms (100.0%)   0.0012ms (134.5%)
   1   0.0119ms (100.0%)   0.0124ms (104.1%)
  10   0.1226ms (100.0%)   0.1276ms (104.1%)
 100   7.0047ms (100.0%)   6.6654ms ( 95.2%)
1000  70.0060ms (100.0%)  71.9692ms (102.8%)

 nimrod:~$ gcc-4.3 -O3 vec_bench_dbl.c -o vec_bench_dbl
 nimrod:~$ ./vec_bench_dbl
 Testing methods...
 All OK
 Problem size  Simple  Intrin
 100   0.0001ms (100.0%)   0.0002ms (289.8%)
1000   0.0007ms (100.0%)   0.0012ms (172.7%)
   1   0.0114ms (100.0%)   0.0124ms (109.4%)
  10   0.1159ms (100.0%)   0.1278ms (110.3%)
 100   6.9252ms (100.0%)   6.6585ms ( 96.1%)
1000  69.1913ms (100.0%)  71.9664ms (104.0%)


It looks to me like the best approach here is to generate operator specific
loops for arithmetic, then check the step size in the loop for contiguous
data, and if found branch to a block where the pointers have been cast to
the right type. The loop itself could even check for operator type by
switching on the function address so that the code modifications could be
localized. The compiler can do the rest.

Chuck
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

2008-03-22 Thread David Cournapeau
Charles R Harris wrote:

 It looks like memory access is the bottleneck, otherwise running 4 
 floats through in parallel should go a lot faster. I need to modify 
 the program a bit and see how it works for doubles.

I am not sure the benchmark is really meaningful: it does not use 
aligned buffers (16-byte alignment), and because of that it does not 
give a good idea of what can be expected from SSE. It does show why it is 
not so easy to get good performance, and why just throwing in a few 
optimized loops won't work, though. Using SSE/SSE2 on unaligned 
buffers is a waste of time: without that alignment you have to use the 
unaligned load intrinsics (_mm_loadu_ps instead of _mm_load_ps), and 
those are much slower, killing most of the speed increase you 
can expect from SSE.
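
For reference, a sketch of the aligned variant (error handling stripped; the function names here are mine, not the benchmark's):

#define _XOPEN_SOURCE 600   /* for posix_memalign */
#include <stdlib.h>
#include <xmmintrin.h>

/* allocate a 16-byte aligned float buffer */
float *alloc_aligned_floats(size_t n)
{
    void *p = NULL;
    if (posix_memalign(&p, 16, n*sizeof(float)) != 0)
        return NULL;
    return (float *)p;
}

/* with aligned buffers the aligned load/store intrinsics can be used */
void multiply_float_aligned(const float *a, const float *b, float *c, int sz)
{
    int i;
    for (i = 0; i + 4 <= sz; i += 4)
        _mm_store_ps(c + i, _mm_mul_ps(_mm_load_ps(a + i),
                                       _mm_load_ps(b + i)));
    for ( ; i < sz; i++)
        c[i] = a[i]*b[i];
}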

Here is what I get with the above benchmark:

         100   0.0002ms (100.0%)   0.0001ms ( 71.5%)   0.0001ms ( 85.0%)
        1000   0.0014ms (100.0%)   0.0010ms ( 70.6%)   0.0013ms ( 96.8%)
       10000   0.0162ms (100.0%)   0.0095ms ( 58.2%)   0.0128ms ( 78.7%)
      100000   0.4189ms (100.0%)   0.4135ms ( 98.7%)   0.4149ms ( 99.0%)
     1000000   5.9523ms (100.0%)   5.8933ms ( 99.0%)   5.8910ms ( 99.0%)
    10000000  58.9645ms (100.0%)  58.2620ms ( 98.8%)  58.7443ms ( 99.6%)

Basically, no help at all: this is on a P4, whose FPU is extremely slow 
if not used with optimized SSE.

Now, if I use posix_memalign, replace the intrinsics with their 
aligned-access versions, and use an accurate cycle counter (cycle.h, 
provided by fftw), I get the following.

Compiled as is:

Testing methods...
All OK

Problem size                    Simple                    Intrin                    Inline
         100    4.16e+02 cycles (100.0%)    4.04e+02 cycles ( 97.1%)    4.92e+02 cycles (118.3%)
        1000    3.66e+03 cycles (100.0%)    3.11e+03 cycles ( 84.8%)    4.10e+03 cycles (111.9%)
       10000    3.47e+04 cycles (100.0%)    3.01e+04 cycles ( 86.7%)    4.06e+04 cycles (116.8%)
      100000    1.36e+06 cycles (100.0%)    1.34e+06 cycles ( 98.7%)    1.45e+06 cycles (106.7%)
     1000000    1.92e+07 cycles (100.0%)    1.87e+07 cycles ( 97.1%)    1.89e+07 cycles ( 98.2%)
    10000000    1.86e+08 cycles (100.0%)    1.80e+08 cycles ( 96.8%)    1.81e+08 cycles ( 97.4%)

Compiled with -DALIGNED, which uses the aligned access intrinsics:

Testing methods...
All OK

Problem size                    Simple                    Intrin                    Inline
         100    4.16e+02 cycles (100.0%)    1.96e+02 cycles ( 47.1%)    4.92e+02 cycles (118.3%)
        1000    3.82e+03 cycles (100.0%)    1.56e+03 cycles ( 40.8%)    4.22e+03 cycles (110.4%)
       10000    3.46e+04 cycles (100.0%)    1.92e+04 cycles ( 55.5%)    4.13e+04 cycles (119.4%)
      100000    1.32e+06 cycles (100.0%)    1.12e+06 cycles ( 85.0%)    1.16e+06 cycles ( 87.8%)
     1000000    1.95e+07 cycles (100.0%)    1.92e+07 cycles ( 98.3%)    1.95e+07 cycles (100.2%)
    10000000    1.82e+08 cycles (100.0%)    1.79e+08 cycles ( 98.4%)    1.81e+08 cycles ( 99.3%)

This gives a drastic difference (I did not touch inline code, because it 
is sunday and I am lazy). If I use this on a sane CPU (core 2 duo, 
macbook) instead of my pentium4, I get better results (in particular, 
sse code is never slower, and I get a double speed increase as long as 
the buffer can be in cache).

It looks like using prefetch also gives some improvement when on the 
edge of the cache size (my P4 has a 512 kB L2 cache):

Testing methods...
All OK

Problem size                    Simple                    Intrin                    Inline
         100    4.16e+02 cycles (100.0%)    2.52e+02 cycles ( 60.6%)    4.92e+02 cycles (118.3%)
        1000    3.55e+03 cycles (100.0%)    1.85e+03 cycles ( 52.2%)    4.21e+03 cycles (118.7%)
       10000    3.48e+04 cycles (100.0%)    1.76e+04 cycles ( 50.6%)    4.13e+04 cycles (118.9%)
      100000    1.11e+06 cycles (100.0%)    7.20e+05 cycles ( 64.8%)    1.12e+06 cycles (101.3%)
     1000000    1.91e+07 cycles (100.0%)    1.98e+07 cycles (103.4%)    1.91e+07 cycles (100.0%)
    10000000    1.83e+08 cycles (100.0%)    1.90e+08 cycles (103.9%)    1.82e+08 cycles ( 99.3%)

The code can be seen there:

http://www.ar.media.kyoto-u.ac.jp/members/david/archives/t2/vec_bench.c
http://www.ar.media.kyoto-u.ac.jp/members/david/archives/t2/Makefile
http://www.ar.media.kyoto-u.ac.jp/members/david/archives/t2/cycle.h

Another thing that I have not seen mentioned but may be worth pursuing is 
using SSE in element-wise operations: you can have extremely fast exp, 
sin, cos and co using sse. Those are much easier to include in numpy 
(but much more difficult