Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-20 Thread Francesc Alted
On Sunday 20 February 2011 00:01:59, Sturla Molden wrote:
 pthreads will give you better control than OpenMP, but are messy and
 painful to work with.
 
 With MPI you have separate processes, so everything is completely
 isolated. It's more difficult to program and debug than OpenMP code,
 but will usually perform better.

To be more specific, MPI will perform better if you don't need to share 
the memory of your working set among all your processes, but in case you 
need to do this, threads (pthreads, OpenMP) lets you access all parts of 
the working set in memory much more easily.  In fact, all the threads of 
a process can access the complete working set transparently and 
efficiently (although this is precisely why they are trickier to 
program: you have to explicitly avoid simultaneous writing in the same 
memory area), not to mention that threads are much cheaper to create 
than processes.

Generally speaking, if your problem is large, CPU intensive and not very 
memory bound, MPI usually leads to better results.  Otherwise threads 
tend to do a better job.  The thing is that memory is increasingly 
becoming the bottleneck nowadays, so threads are here to 
stay for a long, long time (which is certainly unfortunate for 
programmers, especially pthreads programmers :).
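
To make the simultaneous-write caveat concrete, here is a minimal, hypothetical
OpenMP sketch (not from this thread): every thread reads the shared input
freely, and the write rule is satisfied by construction because each thread
owns the out[i] elements for its own i values.

/* Hedged sketch: `in` is shared read-only, and no two threads ever
 * write the same out[i], so no locking is needed. */
void scale_all(const double *in, double *out, int n, double factor)
{
    int i;
#pragma omp parallel for
    for (i = 0; i < n; i++)
        out[i] = factor * in[i];   /* shared reads, disjoint writes */
}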

-- 
Francesc Alted
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-19 Thread Sebastian Haase
Thanks a lot. Very informative. I guess what you say about the cache line
being dirtied is related to the info I got with valgrind (see my email
in this thread: "L1 Data Write Miss 3636").
Can one assume that the cache line is always a few megabytes?

Thanks,
Sebastian

On Sat, Feb 19, 2011 at 12:40 AM, Sturla Molden stu...@molden.no wrote:
 On 17.02.2011 16:31, Matthieu Brucher wrote:

 It may also be the sizes of the chunk OMP uses. You can/should specify
 them in the OMP pragma so that it is a multiple of the cache line size or
 something close.

 Matthieu

 Also beware of false sharing among the threads. When one processor updates
 the array dist in Sebastian's code, the cache line is dirtied for the
 other processors:

   #pragma omp parallel for private(j, i, ax, ay, dif_x, dif_y)
   for(i=0; i<na; i++) {
      ax = a_ps[i*nx1];
      ay = a_ps[i*nx1+1];
      for(j=0; j<nb; j++) {
          dif_x = ax - b_ps[j*nx2];
          dif_y = ay - b_ps[j*nx2+1];

          /* update shared memory */

          dist[2*i+j] = sqrt(dif_x*dif_x + dif_y*dif_y);

          /* ... and poof the cache is dirty */
      }
   }

 Whenever this happens, the processors must stop whatever they are doing to
 resynchronize their cache lines. False sharing can therefore work as an
 invisible GIL inside OpenMP code. The processors can appear to run in
 syrup, and there is excessive traffic on the memory bus.

 This is also why MPI programs often scale better than OpenMP programs,
 despite the IPC overhead.

 A piece of advice when working with OpenMP is to let each thread write to
 private data arrays, and only share read-only arrays.

 One can e.g. use OpenMP's reduction pragma to achieve this. E.g. initialize
 the array dist with zeros, and use reduction(+:dist) in the OpenMP pragma
 line.

 Sturla

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-19 Thread Pauli Virtanen
On Sat, 19 Feb 2011 18:13:44 +0100, Sebastian Haase wrote:
 Thanks a lot. Very informative. I guess what you say about the cache line
 being dirtied is related to the info I got with valgrind (see my email in
 this thread: "L1 Data Write Miss 3636"). Can one assume that the cache
 line is always a few megabytes?

Cache lines are typically much smaller, 16-512 bytes.

In this specific case, since the stride of the `i` loop is only 
2*sizeof(double) = 16 bytes, much smaller than the cache line size, threads 
running with different `i` tend to write to the same cache lines.

-- 
Pauli Virtanen

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-19 Thread Sturla Molden
On 19.02.2011 18:13, Sebastian Haase wrote:
 Can one assume that the cache line is always a few mega bytes ?

Don't confuse the size of a cache with the size of a cache line.

A cache line (which is the unit that gets marked dirty) is typically 
8-512 bytes.

Make sure your OpenMP threads stay off each other's cache lines, and it 
will scale nicely.

For example, you can specify a chunk-size to the schedule clause to 
force them apart; it applies to the loop index, so you must work out 
what block size that implies on the shared write buffer. If you use 
reduction(+:dist) the write buffer will be completely private, but you 
get the summations after the loop. That is the limited amount of control 
you get with OpenMP.
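
As a hedged, generic illustration of that block-size calculation (a
hypothetical example, not the distance code from this thread; it assumes a
64-byte cache line and a 64-byte-aligned output buffer): each iteration below
writes one 8-byte double, so 8 iterations share a cache line, and handing out
static chunks of 8 keeps each thread's writes on its own lines.

#include <stddef.h>

#define CACHE_LINE 64
#define CHUNK (CACHE_LINE / sizeof(double))   /* 8 iterations per cache line (assumed) */

void row_sums(const double *a, int n, int m, double *out)
{
    int i, j;
#pragma omp parallel for private(j) schedule(static, CHUNK)
    for (i = 0; i < n; i++) {
        double s = 0.0;
        for (j = 0; j < m; j++)
            s += a[(size_t)i * m + j];
        out[i] = s;   /* adjacent i values share cache lines in out */
    }
}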

pthreads will give you better control than OpenMP, but are messy and 
painful to work with.

With MPI you have separate processes, so everything is completely 
isolated. It's more difficult to program and debug than OpenMP code, but 
will usually perform better.

Sturla


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-18 Thread Sturla Molden

On 17.02.2011 16:31, Matthieu Brucher wrote:
It may also be the sizes of the chunk OMP uses. You can/should specify 
them in the OMP pragma so that it is a multiple of the cache line size or 
something close.

Matthieu


Also beware of false sharing among the threads. When one processor 
updates the array dist in Sebastian's code, the cache line is dirtied 
for the other processors:


  #pragma omp parallel for private(j, i, ax, ay, dif_x, dif_y)
  for(i=0; i<na; i++) {
     ax = a_ps[i*nx1];
     ay = a_ps[i*nx1+1];
     for(j=0; j<nb; j++) {
         dif_x = ax - b_ps[j*nx2];
         dif_y = ay - b_ps[j*nx2+1];

         /* update shared memory */

         dist[2*i+j] = sqrt(dif_x*dif_x + dif_y*dif_y);

         /* ... and poof the cache is dirty */
     }
  }

Whenever this happens, the processors must stop whatever they are doing 
to resynchronize their cache lines. False sharing can therefore work 
as an invisible GIL inside OpenMP code. The processors can appear to 
run in syrup, and there is excessive traffic on the memory bus.


This is also why MPI programs often scale better than OpenMP programs, 
despite the IPC overhead.


A piece of advice when working with OpenMP is to let each thread write to 
private data arrays, and only share read-only arrays.


One can e.g. use OpenMP's reduction pragma to achieve this. E.g. 
initialize the array dist with zeros, and use reduction(+:dist) in the 
OpenMP pragma line.
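
A hedged sketch of that reduction idea, with a hypothetical function name
(array-section reductions like this need OpenMP 4.5 or newer, which postdates
this thread; dist is assumed zero-initialized by the caller, indexed row-major
as dist[i*nb + j] here, and each thread pays for a full private copy of dist):

#include <math.h>

void dists2d_reduce(const double *a_ps, int na,
                    const double *b_ps, int nb, double *dist)
{
    int i, j;
    /* Every thread accumulates into its own private copy of dist; OpenMP
     * adds the copies into the shared array once, after the loop, so the
     * threads never dirty each other's cache lines while computing. */
#pragma omp parallel for private(j) reduction(+: dist[0:na*nb])
    for (i = 0; i < na; i++) {
        for (j = 0; j < nb; j++) {
            double dx = a_ps[2*i]     - b_ps[2*j];
            double dy = a_ps[2*i + 1] - b_ps[2*j + 1];
            dist[i*nb + j] += sqrt(dx*dx + dy*dy);
        }
    }
}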


Sturla
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-17 Thread Sebastian Haase
Eric,
thanks for insisting on this. I noticed that, when I saw it first,
just to forget about it again ...
The new timings on my machine are:
$: gcc -O3 -c the_lib.c -fPIC -fopenmp -ffast-math
$: gcc -shared -o the_lib.so the_lib.o -lgomp -lm
$: python2.5 the_python_prog.py
c_threads 1  time  0.000897128582001
c_threads 2  time  0.000540800094604
c_threads 3  time  0.00035933971405
c_threads 4  time  0.000529370307922
c_threads 5  time  0.00049122095108
c_threads 6  time  0.000540502071381
c_threads 7  time  0.00058007911
c_threads 8  time  0.000643739700317
c_threads 9  time  0.000622930526733
c_threads 10  time  0.000680360794067
c_threads 11  time  0.000613269805908
c_threads 12  time  0.000633401870728

That is, your OpenMP version is again fastest using 3 threads on my 4 core CPU.
It is now 2.34x faster than my non-OpenMP code (which compares
to scipy...cdist).
And, it is (only !?) 7% slower than the non-OpenMP code, when running
on 1 thread.
(The speedup 3 threads vs. 1 thread is 2.5x)

So, that is pretty good !! What I don't understand:
why did you start your first post with
"I don't have the slightest idea what I'm doing"?
;-)
Do you think, one could get even better ?
And, where does the 7% slow-down (for single thread) come from ?
Is it possible to have the OpenMP option in a code, without _any_
penalty for 1 core machines ?

Thanks,
- Sebastian

On Thu, Feb 17, 2011 at 2:12 AM, Eric Carlson ecarl...@eng.ua.edu wrote:
 Sebastian,
 Optimization appears to be important here. I used no optimization in my
 previous post, so you could try the -O3 compile option:

  gcc -O3 -c my_lib.c -fPIC -fopenmp -ffast-math

 for na=329 and nb=340 I get (about 7.5 speedup)
 c_threads 1  time  0.00103106021881
 c_threads 2  time  0.000528309345245
 c_threads 3  time  0.000362541675568
 c_threads 4  time  0.00028993844986
 c_threads 5  time  0.000287840366364
 c_threads 6  time  0.000264899730682
 c_threads 7  time  0.000244019031525
 c_threads 8  time  0.000242137908936
 c_threads 9  time  0.000232398509979
 c_threads 10  time  0.000227460861206
 c_threads 11  time  0.00021938085556
 c_threads 12  time  0.000216970443726
 c_threads 13  time  0.000215198993683
 c_threads 14  time  0.00021940946579
 c_threads 15  time  0.000204219818115
 c_threads 16  time  0.000216958522797
 c_threads 17  time  0.000219728946686
 c_threads 18  time  0.00010272522
 c_threads 19  time  0.000157492160797
 c_threads 20  time  0.000171000957489
 c_threads 21  time  0.000147500038147
 c_threads 22  time  0.000141770839691
 c_threads 23  time  0.000137741565704

 for na=3290 and nb=3400 (about 11.5 speedup)
 c_threads 1  time  0.100258581638
 c_threads 2  time  0.0501346611977
 c_threads 3  time  0.0335096096992
 c_threads 4  time  0.0253720903397
 c_threads 5  time  0.0208190107346
 c_threads 6  time  0.0173784399033
 c_threads 7  time  0.0148811817169
 c_threads 8  time  0.0130474209785
 c_threads 9  time  0.011598110199
 c_threads 10  time  0.0104278612137
 c_threads 11  time  0.00950778007507
 c_threads 12  time  0.00870131969452
 c_threads 13  time  0.015882730484
 c_threads 14  time  0.0148504400253
 c_threads 15  time  0.0139465212822
 c_threads 16  time  0.0130301308632
 c_threads 17  time  0.012240819931
 c_threads 18  time  0.011567029953
 c_threads 19  time  0.0109891605377
 c_threads 20  time  0.0104281497002
 c_threads 21  time  0.00992572069168
 c_threads 22  time  0.00957406997681
 c_threads 23  time  0.00936627149582


 for na=329 and nb=340, cdist comes in at 0.00111914873123 which is
 1.085x slower than the c version for my system.

 for na=3290 and nb=3400 cdist gives  0.143441538811

 Cheers,
 Eric


 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-17 Thread Matthieu Brucher
 Do you think, one could get even better ?
 And, where does the 7% slow-down (for single thread) come from ?
 Is it possible to have the OpenMP option in a code, without _any_
 penalty for 1 core machines ?


There will always be a penalty for parallel code that runs on one core. You
have at least the overhead for splitting the data.

Matthieu
-- 
Information System Engineer, Ph.D.
Blog: http://matt.eifelle.com
LinkedIn: http://www.linkedin.com/in/matthieubrucher
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-17 Thread Matthieu Brucher
 Then, where does the overhead come from ? --
 The call to omp_set_dynamic(dynamic);
 Or the
 #pragma omp parallel for private(j, i,ax,ay, dif_x, dif_y)


It may be this. You initialize a thread pool, even if it has only one
thread, and there is the dynamic part, so OpenMP may create several chunks
instead of one big chunk.
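
A hedged sketch of one way to keep that penalty out of the single-thread path
(hypothetical function name, simplified signature): OpenMP's if() clause skips
the parallel machinery entirely when its condition is false.

#include <math.h>
#include <omp.h>

void dists2d_if(const double *a_ps, int na,
                const double *b_ps, int nb,
                double *dist, int nthreads)
{
    int i, j;
    if (nthreads > 1)
        omp_set_num_threads(nthreads);   /* as in the code in this thread */

    /* With nthreads <= 1 the if() clause is false, so the loop runs
     * serially: no thread pool, no chunking, (almost) no overhead. */
#pragma omp parallel for private(j) if(nthreads > 1)
    for (i = 0; i < na; i++) {
        for (j = 0; j < nb; j++) {
            double dx = a_ps[2*i]     - b_ps[2*j];
            double dy = a_ps[2*i + 1] - b_ps[2*j + 1];
            dist[i*nb + j] = sqrt(dx*dx + dy*dy);
        }
    }
}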

Matthieu
-- 
Information System Engineer, Ph.D.
Blog: http://matt.eifelle.com
LinkedIn: http://www.linkedin.com/in/matthieubrucher
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-17 Thread Matthieu Brucher
It may also be the sizes of the chunk OMP uses. You can/should specify
them.in the OMP pragma so that it is a multiple of the cache line size or
something close.

Matthieu

2011/2/17 Sebastian Haase seb.ha...@gmail.com

 Hi,
 More surprises:
 shaase@iris:~/code/SwiggedDistOMP: gcc -O3 -c the_lib.c -fPIC -fopenmp
 -ffast-math
 shaase@iris:~/code/SwiggedDistOMP: gcc -shared -o the_lib.so the_lib.o
 -lgomp -lm
 shaase@iris:~/code/SwiggedDistOMP: priithon the_python_prog.py
 c_threads 0  time  0.000437839031219    # this is now, without
 #pragma omp parallel for ...
 c_threads 1  time  0.000865449905396
 c_threads 2  time  0.000520548820496
 c_threads 3  time  0.00033704996109
 c_threads 4  time  0.000620169639587
 c_threads 5  time  0.000465350151062
 c_threads 6  time  0.000696349143982

 This corrects the timing: max OpenMP speed (3 threads) vs. no
 OpenMP now gives a speedup of (only!) 1.3x,
 not 2.33x (which was the number I got when comparing OpenMP to the
 cdist function).
 The c code is now:

 the_lib.c

 --
  #include <stdio.h>
  #include <time.h>
  #include <omp.h>
  #include <math.h>

  void dists2d(  double *a_ps, int na,
                 double *b_ps, int nb,
                 double *dist, int num_threads)
  {
    int i, j;
    double ax, ay, dif_x, dif_y;
    int nx1=2;
    int nx2=2;

    if(num_threads>0)
      {
        int dynamic=0;
        omp_set_dynamic(dynamic);
        omp_set_num_threads(num_threads);

  #pragma omp parallel for private(j, i, ax, ay, dif_x, dif_y)
        for(i=0; i<na; i++)
          {
            ax = a_ps[i*nx1];
            ay = a_ps[i*nx1+1];
            for(j=0; j<nb; j++)
              {
                dif_x = ax - b_ps[j*nx2];
                dif_y = ay - b_ps[j*nx2+1];
                dist[2*i+j] = sqrt(dif_x*dif_x + dif_y*dif_y);
              }
          }
      } else {
        for(i=0; i<na; i++)
          {
            ax = a_ps[i*nx1];
            ay = a_ps[i*nx1+1];
            for(j=0; j<nb; j++)
              {
                dif_x = ax - b_ps[j*nx2];
                dif_y = ay - b_ps[j*nx2+1];
                dist[2*i+j] = sqrt(dif_x*dif_x + dif_y*dif_y);
              }
          }
      }
  }
 --
 $ gcc -O3 -c the_lib.c -fPIC -fopenmp -ffast-math
 $ gcc -shared -o the_lib.so the_lib.o -lgomp -lm

 So, I guess I found a way of getting rid of the OpenMP overhead when
 running with 1 thread,
 and found that - if measured correctly, using the same compiler settings
 and so on - the speedup is so small that there is no point in doing
 OpenMP - again.
 (For my case, having (only) 4 cores)


 Cheers,
 Sebastian.



 On Thu, Feb 17, 2011 at 10:57 AM, Matthieu Brucher
 matthieu.bruc...@gmail.com wrote:
 
  Then, where does the overhead come from ? --
  The call to omp_set_dynamic(dynamic);
  Or the
  #pragma omp parallel for private(j, i,ax,ay, dif_x, dif_y)
 
  It may be this. You initialize a thread pool, even if it has only one
  thread, and there is the dynamic part, so OpenMP may create several
 chunks
  instead of one big chunk.
 
  Matthieu
  --
  Information System Engineer, Ph.D.
  Blog: http://matt.eifelle.com
  LinkedIn: http://www.linkedin.com/in/matthieubrucher
 
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
  http://mail.scipy.org/mailman/listinfo/numpy-discussion
 
 
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion




-- 
Information System Engineer, Ph.D.
Blog: http://matt.eifelle.com
LinkedIn: http://www.linkedin.com/in/matthieubrucher
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-17 Thread Eric Carlson
For 4 cores, on your system, your conclusion makes some sense. That 
said, I played around with this on both a core 2 duo and the 12 core 
system. For the 12-core system, in my tests the 0-thread (no-OpenMP) case ran 
extremely close to the 2-thread case for all my sizes.

The core 2 duo runs windows 7, and after downloading pthreadsGC2.dll 
from the pthreads project, I was able to use openmp under a year-old 
(32-bit) pythonxy distribution with mingw. The result: 0 threads comes in 
slightly faster than one thread, .00102 versus .00106 s, and 2 threads 
took .00060 s.

My current theory is that gcc under linux uses some background trick to 
get two thread-like streams going. As I assess scale-up under linux, I 
will need to consider this behavior.

Creating optimal codes with OpenMP certainly requires a considerable 
commitment. Given the problem-specific fine tuning required, I would not 
expect much gain in general-purpose routines. In specific routines like 
cdist, it might make more sense. I talked to a Dell HPC rep today, and 
he said that squeezing out an extra 15% performance boost on an Intel 
CPU was a pleasant surprise, so the 30% improvement is maybe not so bad.

Cheers,
Eric

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-16 Thread Sebastian Haase
Chris,
OK, sorry -- I misread (the cdist doc says A and B must have the same number
of columns(!), not rows).
On my machine I got the exact same timing as my (non-OpenMP) C code.
That is really good, compared to normal ufunc-based numpy code.
But my question in this thread is, how to get better than that,  using either
SSE or OpenMP (or CUDA?)

Thanks again for telling me about scipy.distance,
Sebastian



On Wed, Feb 16, 2011 at 2:28 AM, Chris Colbert sccolb...@gmail.com wrote:

 The `cdist` function in scipy spatial does what you want, and takes ~ 1ms on
 my machine.

 In [1]: import numpy as np
 In [2]: from scipy.spatial.distance import cdist
 In [3]: a = np.random.random((340, 2))
 In [4]: b = np.random.random((329, 2))
 In [5]: c = cdist(a, b)
 In [6]: c.shape
 Out[6]: (340, 329)
 In [7]: %timeit cdist(a, b)
 1000 loops, best of 3: 1.18 ms per loop


 On Tue, Feb 15, 2011 at 4:42 PM, Sebastian Haase seb.ha...@gmail.com
 wrote:

 Hi Eat,
 I will surely try these routines tomorrow,
 but I still think that neither scipy function does the complete
 distance calculation of all possible pairs as done by my C code.
 For 2 arrays, X and Y, of nX and nY 2d coordinates respectively, I
 need to get nX times nY distances computed.
 From the online documentation I understand that
 pdist calculates something like nX square distances for a single array
 X of nX coordinates,
 and cdist would calculate nX distances, where nX is required to equal nY.

 Thanks for looking into this,
 Sebastian

 On Tue, Feb 15, 2011 at 9:11 PM, eat e.antero.ta...@gmail.com wrote:
  Hi,
 
  On Tue, Feb 15, 2011 at 5:50 PM, Sebastian Haase seb.ha...@gmail.com
  wrote:
 
  Hi,
  I assume that someone here could maybe help me, and I'm hoping it's
  not too much off topic.
  I have 2 arrays of 2d point coordinates and would like to calculate
  all pairwise distances as fast as possible.
  Going from Python/Numpy to a (Swigged) C extension already gave me a
  55x speedup.
  (.9ms vs. 50ms for arrays of length 329 and 340).
 
  With my very modest machine (Intel Dual CPU E2200, 2.2 GHz), utilizing
  scipy.spatial.distance.pdist will take some 1.5 ms for such arrays. Even
  the simple linear algebra based full matrix calculations can be done in
  less than 5 ms.
 
  My two cents,
  eat
 
  I'm using gcc on Linux.
  Now I'm wondering if I could go even faster !?
  My hope that the compiler might automagically do some SSE2
  optimization got disappointed.
  Since I have a 4 core CPU I thought OpenMP might be an option;
  I never used that, and after some playing around I managed to get
  (only) 50% slowdown(!) :-(
 
  My code in short is this:
  (My SWIG typemaps use obj_to_array_no_conversion() from numpy.i)
  ---Ccode --
  void dists2d(
                    double *a_ps, int nx1, int na,
                    double *b_ps, int nx2, int nb,
                    double *dist, int nx3, int ny3)  throw (char*)
  {
   if(nx1 != 2)  throw (char*) "a must be of shape (n,2)";
   if(nx2 != 2)  throw (char*) "b must be of shape (n,2)";
   if(nx3 != nb || ny3 != na)  throw (char*) "c must be of shape (na,nb)";

   double *dist_;
   int i, j;

  #pragma omp parallel private(dist_, j, i)
   {
  #pragma omp for nowait
     for(i=0; i<na; i++)
       {
         //num_threads=omp_get_num_threads();  --> 4
         dist_ = dist + i*nb;        // dist_ is only introduced for OpenMP
         for(j=0; j<nb; j++)
           {
             *dist_++ = sqrt( sq(a_ps[i*nx1]   - b_ps[j*nx2]) +
                              sq(a_ps[i*nx1+1] - b_ps[j*nx2+1]) );
           }
       }
   }
  }
  ---/Ccode --
  There is probably a simple mistake in this code - as I said I never
  used OpenMP before.
  It should be not too difficult to use OpenMP correctly here
   or -  maybe better -
  is there a simple SSE(2,3,4) version that might be even better than
  OpenMP... !?
 
  I supposed, that I did not get the #pragma omp lines right - any idea ?
  Or is it in general not possible to speed this kind of code up using
  OpenMP !?
 
  Thanks,
  Sebastian Haase
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
  http://mail.scipy.org/mailman/listinfo/numpy-discussion
 
 
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
  http://mail.scipy.org/mailman/listinfo/numpy-discussion
 
 
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-16 Thread Sebastian Haase
Eric,
this is amazing !! Thanks very much, I have rarely seen such a compact
source example that just worked.
The timings I get are:
c_threads  1  time  0.00155731916428
c_threads  2  time  0.000829789638519
c_threads  3  time  0.00061688839
c_threads  4  time  0.000704760551453
c_threads  5  time  0.000933599472046
c_threads  6  time  0.000809240341187
c_threads  7  time  0.000837240219116
c_threads  8  time  0.000817658901215
c_threads  9  time  0.00084393026
c_threads 10  time  0.000861320495605
c_threads 11  time  0.000936930179596
c_threads 12  time  0.000847370624542

The optimum for my
Intel(R) Core(TM)2 Quad CPUQ9550  @ 2.83GHz
seems to be at 3 threads with .6 ms for the given test case.

I just reran my normal non-OpenMP C based code, and it takes
.84 ms   (1.35x slower).
from scipy.spatial import distance
distance.cdist(a,b)   takes
.9 ms  -- which includes the allocation of the output array, because
there is no `out` option available.


So I'm happy that OpenMP works,
but apparently on my CPU the speed increase is not overwhelming (yet)...

Thanks,
-- Sebastian


On Wed, Feb 16, 2011 at 4:50 AM, Eric Carlson ecarl...@eng.ua.edu wrote:
 I don't have the slightest idea what I'm doing, but



 
 file name - the_lib.c
 ___
 #include <stdio.h>
 #include <time.h>
 #include <omp.h>
 #include <math.h>

 void dists2d(      double *a_ps, int na,
                   double *b_ps, int nb,
                   double *dist, int num_threads)
 {

    int i, j;
    int dynamic=0;
    omp_set_dynamic(dynamic);
    omp_set_num_threads(num_threads);
    double ax,ay, dif_x, dif_y;
    int nx1=2;
    int nx2=2;


 #pragma omp parallel for private(j, i,ax,ay, dif_x, dif_y)
        for(i=0; i<na; i++)
          {
                ax = a_ps[i*nx1];
                ay = a_ps[i*nx1+1];
                for(j=0; j<nb; j++)
                  {
                        dif_x = ax - b_ps[j*nx2];
                        dif_y = ay - b_ps[j*nx2+1];
                        dist[2*i+j] = sqrt(dif_x*dif_x + dif_y*dif_y);
                  }
          }
 }


 


 COMPILE:
 __
 gcc -c the_lib.c -fPIC -fopenmp -ffast-math
 gcc -shared -o the_lib.so the_lib.o -lgomp -lm


 

 the_python_prog.py
 _

 from ctypes import *
 my_lib=CDLL('the_lib.so') #or full path to lib
 import numpy as np
 import time

 na=329
 nb=340
 a=np.random.rand(na,2)
 b=np.random.rand(nb,2)
 c=np.zeros(na*nb)
 trials=100
 max_threads = 24
 for k in range(1,max_threads):
     n_threads =c_int(k)
     na2=c_int(na)
     nb2=c_int(nb)

     start = time.time()
     for k1 in range(trials):
          ret = my_lib.dists2d(a.ctypes.data_as(c_void_p), na2,
                               b.ctypes.data_as(c_void_p), nb2,
                               c.ctypes.data_as(c_void_p), n_threads)
      print "c_threads", k, " time ", (time.time()-start)/trials



 
 Results on my machine, dual xeon, 12 cores
 na=329
 nb=340
 

 100 trials each:
 c_threads 1  time  0.00109949827194
 c_threads 2  time  0.0005726313591
 c_threads 3  time  0.000429179668427
 c_threads 4  time  0.000349278450012
 c_threads 5  time  0.000287139415741
 c_threads 6  time  0.000252468585968
 c_threads 7  time  0.000222821235657
 c_threads 8  time  0.000206289291382
 c_threads 9  time  0.000187981128693
 c_threads 10  time  0.000172770023346
 c_threads 11  time  0.00016461853
 c_threads 12  time  0.000157740116119

 
 
 Results on my machine, dual xeon, 12 cores
 na=3290
 nb=3400
 __
 100 trials each:
 c_threads 1  time  0.10744508028
 c_threads 2  time  0.054223771
 c_threads 3  time  0.037127559185
 c_threads 4  time  0.0280736112595
 c_threads 5  time  0.0228648614883
 c_threads 6  time  0.0194904088974
 c_threads 7  time  0.0165715909004
 c_threads 8  time  0.0145838689804
 c_threads 9  time  0.0130002498627
 c_threads 10  time  0.0116940999031
 c_threads 11  time  0.0107557415962
 c_threads 12  time  0.00990005016327 (speedup almost 11)

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-16 Thread Sebastian Haase
Update:
I just noticed that using Eric's OpenMP code gave me only a 1.35x
speedup when comparing 3 threads vs. my non OpenMP code. However, when
comparing 3 threads vs. 1 thread, I could call this a 2.55x speedup.
This obviously sounds much better, but is obviously not the number
that matters...
(Ergo, Eric's 11x speedup on 12 cores might also be more like 5x or 6x
when compared to simple C w/o any OpenMP overhead.)

Finally, I have to report that when plugging the C code (from
scipy.spatial.distance.cdist) into my real application,
which, for one sample file, compares point pairs from a sequence of 1000
microscopy images,
the speed gain appears to be a total of 2x (only !!) -- compared to
my original app that used the simple numpy version implemented like
this:

def pyDist2(a_ps, b_ps, ab_dist):
    ab_dist = N.zeros(shape=(a_ps.shape[0], b_ps.shape[0]), dtype=N.float64)
    for i in range(a_ps.shape[0]):
        d = b_ps - a_ps[i]
        ab_dist[i] = N.sqrt( N.sum(d**2, -1) )
    return ab_dist

(The test I was using before was just testing one pair of two arrays
of length 329 and 340, whereas the average length for the 1000 images
is 260.
It could also be that I have substantial constant overhead from some
OpenGL lists that I'm generating in my app.)

Nevertheless, I'm happy having learned something about OpenMP.
I'm still trying to get more info from valgrind.
And, maybe someone could comment, if SSE would help for the function
in question.
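
For what it's worth, a hedged SSE2 sketch (not from this thread; hypothetical
function name, simplified signature, dist indexed row-major, needs -msse2 on
32-bit x86) that computes two distances per inner iteration with
double-precision intrinsics:

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <math.h>

void dists2d_sse2(const double *a_ps, int na,
                  const double *b_ps, int nb, double *dist)
{
    int i, j;
    for (i = 0; i < na; i++) {
        const __m128d ax = _mm_set1_pd(a_ps[2*i]);
        const __m128d ay = _mm_set1_pd(a_ps[2*i + 1]);
        double *row = dist + (size_t)i * nb;
        for (j = 0; j + 1 < nb; j += 2) {
            __m128d p0 = _mm_loadu_pd(b_ps + 2*j);       /* (bx_j,   by_j)   */
            __m128d p1 = _mm_loadu_pd(b_ps + 2*j + 2);   /* (bx_j+1, by_j+1) */
            __m128d bx = _mm_unpacklo_pd(p0, p1);        /* (bx_j, bx_j+1)   */
            __m128d by = _mm_unpackhi_pd(p0, p1);        /* (by_j, by_j+1)   */
            __m128d dx = _mm_sub_pd(ax, bx);
            __m128d dy = _mm_sub_pd(ay, by);
            __m128d d2 = _mm_add_pd(_mm_mul_pd(dx, dx), _mm_mul_pd(dy, dy));
            _mm_storeu_pd(row + j, _mm_sqrt_pd(d2));     /* two distances at once */
        }
        for (; j < nb; j++) {                            /* scalar remainder if nb is odd */
            double dx = a_ps[2*i] - b_ps[2*j];
            double dy = a_ps[2*i + 1] - b_ps[2*j + 1];
            row[j] = sqrt(dx*dx + dy*dy);
        }
    }
}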

Thanks,
Sebastian.



On Wed, Feb 16, 2011 at 1:50 PM, Sebastian Haase seb.ha...@gmail.com wrote:
 Eric,
 this is amazing !! Thanks very much, I have rarely seen such a compact
 source example that just worked.
 The timings I get are:
 c_threads  1  time  0.00155731916428
 c_threads  2  time  0.000829789638519
 c_threads  3  time  0.00061688839
 c_threads  4  time  0.000704760551453
 c_threads  5  time  0.000933599472046
 c_threads  6  time  0.000809240341187
 c_threads  7  time  0.000837240219116
 c_threads  8  time  0.000817658901215
 c_threads  9  time  0.00084393026
 c_threads 10  time  0.000861320495605
 c_threads 11  time  0.000936930179596
 c_threads 12  time  0.000847370624542

 The optimum for my
 Intel(R) Core(TM)2 Quad CPU    Q9550  @ 2.83GHz
 seems to be at 3 threads with .6 ms for the given test case.

 I just reran my normal non-OpenMP C based code, and it takes
 .84 ms   (1.35x slower).
 from scipy.spatial import distance
 distance.cdist(a,b)   takes
 .9 ms  -- which includes the allocation of the output array, because
 there is no `out` option available.


 So I'm happy that OpenMP works,
 but apparently on my CPU the speed increase is not overwhelming (yet)...

 Thanks,
 -- Sebastian


 On Wed, Feb 16, 2011 at 4:50 AM, Eric Carlson ecarl...@eng.ua.edu wrote:
 I don't have the slightest idea what I'm doing, but



 
 file name - the_lib.c
 ___
  #include <stdio.h>
  #include <time.h>
  #include <omp.h>
  #include <math.h>

 void dists2d(      double *a_ps, int na,
                   double *b_ps, int nb,
                   double *dist, int num_threads)
 {

    int i, j;
    int dynamic=0;
    omp_set_dynamic(dynamic);
    omp_set_num_threads(num_threads);
    double ax,ay, dif_x, dif_y;
    int nx1=2;
    int nx2=2;


 #pragma omp parallel for private(j, i,ax,ay, dif_x, dif_y)
        for(i=0; i<na; i++)
          {
                ax = a_ps[i*nx1];
                ay = a_ps[i*nx1+1];
                for(j=0; j<nb; j++)
                  {
                        dif_x = ax - b_ps[j*nx2];
                        dif_y = ay - b_ps[j*nx2+1];
                        dist[2*i+j] = sqrt(dif_x*dif_x + dif_y*dif_y);
                  }
          }
 }


 


 COMPILE:
 __
 gcc -c the_lib.c -fPIC -fopenmp -ffast-math
 gcc -shared -o the_lib.so the_lib.o -lgomp -lm


 

 the_python_prog.py
 _

 from ctypes import *
 my_lib=CDLL('the_lib.so') #or full path to lib
 import numpy as np
 import time

 na=329
 nb=340
 a=np.random.rand(na,2)
 b=np.random.rand(nb,2)
 c=np.zeros(na*nb)
 trials=100
 max_threads = 24
 for k in range(1,max_threads):
     n_threads =c_int(k)
     na2=c_int(na)
     nb2=c_int(nb)

     start = time.time()
     for k1 in range(trials):
          ret = my_lib.dists2d(a.ctypes.data_as(c_void_p), na2,
                               b.ctypes.data_as(c_void_p), nb2,
                               c.ctypes.data_as(c_void_p), n_threads)
      print "c_threads", k, " time ", (time.time()-start)/trials



 
 Results on my machine, dual xeon, 12 cores
 na=329
 nb=340
 

 100 trials each:
 c_threads 1  time  0.00109949827194
 c_threads 2  time  0.0005726313591
 c_threads 3  time  0.000429179668427
 c_threads 4  time  0.000349278450012
 c_threads 5  time  0.000287139415741
 c_threads 6  time  0.000252468585968
 c_threads 7  time  0.000222821235657
 c_threads 8  time  0.000206289291382
 c_threads 9  time  0.000187981128693
 c_threads 10  time  0.000172770023346
 c_threads 11  time  0.00016461853
 c_threads 12  time  0.000157740116119

 
 
 Results on my machine, dual xeon, 12 cores
 na=3290
 

Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-16 Thread Eric Carlson
Sebastian,
Optimization appears to be important here. I used no optimization in my 
previous post, so you could try the -O3 compile option:

  gcc -O3 -c my_lib.c -fPIC -fopenmp -ffast-math

for na=329 and nb=340 I get (about 7.5 speedup)
c_threads 1  time  0.00103106021881
c_threads 2  time  0.000528309345245
c_threads 3  time  0.000362541675568
c_threads 4  time  0.00028993844986
c_threads 5  time  0.000287840366364
c_threads 6  time  0.000264899730682
c_threads 7  time  0.000244019031525
c_threads 8  time  0.000242137908936
c_threads 9  time  0.000232398509979
c_threads 10  time  0.000227460861206
c_threads 11  time  0.00021938085556
c_threads 12  time  0.000216970443726
c_threads 13  time  0.000215198993683
c_threads 14  time  0.00021940946579
c_threads 15  time  0.000204219818115
c_threads 16  time  0.000216958522797
c_threads 17  time  0.000219728946686
c_threads 18  time  0.00010272522
c_threads 19  time  0.000157492160797
c_threads 20  time  0.000171000957489
c_threads 21  time  0.000147500038147
c_threads 22  time  0.000141770839691
c_threads 23  time  0.000137741565704

for na=3290 and nb=3400 (about 11.5 speedup)
c_threads 1  time  0.100258581638
c_threads 2  time  0.0501346611977
c_threads 3  time  0.0335096096992
c_threads 4  time  0.0253720903397
c_threads 5  time  0.0208190107346
c_threads 6  time  0.0173784399033
c_threads 7  time  0.0148811817169
c_threads 8  time  0.0130474209785
c_threads 9  time  0.011598110199
c_threads 10  time  0.0104278612137
c_threads 11  time  0.00950778007507
c_threads 12  time  0.00870131969452
c_threads 13  time  0.015882730484
c_threads 14  time  0.0148504400253
c_threads 15  time  0.0139465212822
c_threads 16  time  0.0130301308632
c_threads 17  time  0.012240819931
c_threads 18  time  0.011567029953
c_threads 19  time  0.0109891605377
c_threads 20  time  0.0104281497002
c_threads 21  time  0.00992572069168
c_threads 22  time  0.00957406997681
c_threads 23  time  0.00936627149582


for na=329 and nb=340, cdist comes in at 0.00111914873123 which is 
1.085x slower than the c version for my system.

for na=3290 and nb=3400 cdist gives  0.143441538811

Cheers,
Eric


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-15 Thread Sebastian Haase
Hi,
I assume that someone here could maybe help me, and I'm hoping it's
not too much off topic.
I have 2 arrays of 2d point coordinates and would like to calculate
all pairwise distances as fast as possible.
Going from Python/Numpy to a (Swigged) C extension already gave me a
55x speedup.
(.9ms vs. 50ms for arrays of length 329 and 340).
I'm using gcc on Linux.
Now I'm wondering if I could go even faster !?
My hope that the compiler might automagically do some SSE2
optimization got disappointed.
Since I have a 4 core CPU I thought OpenMP might be an option;
I never used that, and after some playing around I managed to get
(only) 50% slowdown(!) :-(

My code in short is this:
(My SWIG typemaps use obj_to_array_no_conversion() from numpy.i)
---Ccode --
void dists2d(
                  double *a_ps, int nx1, int na,
                  double *b_ps, int nx2, int nb,
                  double *dist, int nx3, int ny3)  throw (char*)
{
  if(nx1 != 2)  throw (char*) "a must be of shape (n,2)";
  if(nx2 != 2)  throw (char*) "b must be of shape (n,2)";
  if(nx3 != nb || ny3 != na)  throw (char*) "c must be of shape (na,nb)";

  double *dist_;
  int i, j;

#pragma omp parallel private(dist_, j, i)
  {
#pragma omp for nowait
    for(i=0; i<na; i++)
      {
        //num_threads=omp_get_num_threads();  --> 4
        dist_ = dist + i*nb;             // dist_ is only introduced for OpenMP
        for(j=0; j<nb; j++)
          {
            *dist_++ = sqrt( sq(a_ps[i*nx1]   - b_ps[j*nx2]) +
                             sq(a_ps[i*nx1+1] - b_ps[j*nx2+1]) );
          }
      }
  }
}
---/Ccode --
There is probably a simple mistake in this code - as I said I never
used OpenMP before.
It should be not too difficult to use OpenMP correctly here
 or -  maybe better -
is there a simple SSE(2,3,4) version that might be even better than OpenMP... !?

I supposed, that I did not get the #pragma omp lines right - any idea ?
Or is it in general not possible to speed this kind of code up using OpenMP !?

Thanks,
Sebastian Haase
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-15 Thread Matthieu Brucher
Hi,

My first move would be to add a restrict keyword to dist (i.e. dist is the
only pointer to the specific memory location), and then declare dist_ inside
the first loop, also with a restrict.
Then, I would run valgrind or a PAPI profile on your code to see what causes
the issue (false sharing, ...).
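
A hedged sketch of what the restrict suggestion could look like (hypothetical
function name, simplified signature; not the exact code from this thread):

#include <math.h>

/* C99 restrict tells the compiler that dist is not aliased by a_ps/b_ps,
 * which can help it vectorize the inner loop. */
void dists2d_restrict(const double *restrict a_ps,
                      const double *restrict b_ps,
                      double *restrict dist, int na, int nb)
{
    for (int i = 0; i < na; i++) {
        double *restrict dist_ = dist + i * nb;  /* declared inside the loop, also restrict */
        const double ax = a_ps[2*i], ay = a_ps[2*i + 1];
        for (int j = 0; j < nb; j++) {
            const double dx = ax - b_ps[2*j];
            const double dy = ay - b_ps[2*j + 1];
            dist_[j] = sqrt(dx*dx + dy*dy);
        }
    }
}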

Matthieu

2011/2/15 Sebastian Haase seb.ha...@gmail.com

 Hi,
 I assume that someone here could maybe help me, and I'm hoping it's
 not too much off topic.
 I have 2 arrays of 2d point coordinates and would like to calculate
 all pairwise distances as fast as possible.
 Going from Python/Numpy to a (Swigged) C extension already gave me a
 55x speedup.
 (.9ms vs. 50ms for arrays of length 329 and 340).
 I'm using gcc on Linux.
 Now I'm wondering if I could go even faster !?
 My hope that the compiler might automagically do some SSE2
 optimization got disappointed.
 Since I have a 4 core CPU I thought OpenMP might be an option;
 I never used that, and after some playing around I managed to get
 (only) 50% slowdown(!) :-(

 My code in short is this:
 (My SWIG typemaps use obj_to_array_no_conversion() from numpy.i)
 ---Ccode --
 void dists2d(
                   double *a_ps, int nx1, int na,
                   double *b_ps, int nx2, int nb,
                   double *dist, int nx3, int ny3)  throw (char*)
 {
  if(nx1 != 2)  throw (char*) "a must be of shape (n,2)";
  if(nx2 != 2)  throw (char*) "b must be of shape (n,2)";
  if(nx3 != nb || ny3 != na)  throw (char*) "c must be of shape (na,nb)";

  double *dist_;
  int i, j;

 #pragma omp parallel private(dist_, j, i)
  {
 #pragma omp for nowait
    for(i=0; i<na; i++)
      {
        //num_threads=omp_get_num_threads();  --> 4
        dist_ = dist + i*nb;          // dist_ is only introduced for OpenMP
        for(j=0; j<nb; j++)
          {
            *dist_++ = sqrt( sq(a_ps[i*nx1]   - b_ps[j*nx2]) +
                             sq(a_ps[i*nx1+1] - b_ps[j*nx2+1]) );
          }
      }
  }
 }
 ---/Ccode --
 There is probably a simple mistake in this code - as I said I never
 used OpenMP before.
 It should be not too difficult to use OpenMP correctly here
  or -  maybe better -
 is there a simple SSE(2,3,4) version that might be even better than
 OpenMP... !?

 I supposed, that I did not get the #pragma omp lines right - any idea ?
 Or is it in general not possible to speed this kind of code up using OpenMP
 !?

 Thanks,
 Sebastian Haase
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion




-- 
Information System Engineer, Ph.D.
Blog: http://matt.eifelle.com
LinkedIn: http://www.linkedin.com/in/matthieubrucher
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-15 Thread Sebastian Haase
Thanks Matthieu,
using __restrict__ with g++ did not change anything. How do I use
valgrind with C extensions?
I don't know what PAPI profil is ...?
-Sebastian


On Tue, Feb 15, 2011 at 4:54 PM, Matthieu Brucher
matthieu.bruc...@gmail.com wrote:
 Hi,
 My first move would be to add a restrict keyword to dist (i.e. dist is the
 only pointer to the specific memory location), and then declare dist_ inside
 the first loop also with a restrict.
 Then, I would run valgrind or a PAPI profil on your code to see what causes
 the issue (false sharing, ...)
 Matthieu

 2011/2/15 Sebastian Haase seb.ha...@gmail.com

 Hi,
 I assume that someone here could maybe help me, and I'm hoping it's
 not too much off topic.
 I have 2 arrays of 2d point coordinates and would like to calculate
 all pairwise distances as fast as possible.
 Going from Python/Numpy to a (Swigged) C extension already gave me a
 55x speedup.
 (.9ms vs. 50ms for arrays of length 329 and 340).
 I'm using gcc on Linux.
 Now I'm wondering if I could go even faster !?
 My hope that the compiler might automagically do some SSE2
 optimization got disappointed.
 Since I have a 4 core CPU I thought OpenMP might be an option;
 I never used that, and after some playing around I managed to get
 (only) 50% slowdown(!) :-(

 My code in short is this:
 (My SWIG typemaps use obj_to_array_no_conversion() from numpy.i)
 ---Ccode --
 void dists2d(
                   double *a_ps, int nx1, int na,
                   double *b_ps, int nx2, int nb,
                   double *dist, int nx3, int ny3)  throw (char*)
 {
  if(nx1 != 2)  throw (char*) "a must be of shape (n,2)";
  if(nx2 != 2)  throw (char*) "b must be of shape (n,2)";
  if(nx3 != nb || ny3 != na)  throw (char*) "c must be of shape (na,nb)";

  double *dist_;
  int i, j;

 #pragma omp parallel private(dist_, j, i)
  {
 #pragma omp for nowait
    for(i=0; i<na; i++)
      {
        //num_threads=omp_get_num_threads();  --> 4
        dist_ = dist + i*nb;          // dist_ is only introduced for OpenMP
        for(j=0; j<nb; j++)
          {
            *dist_++ = sqrt( sq(a_ps[i*nx1]   - b_ps[j*nx2]) +
                             sq(a_ps[i*nx1+1] - b_ps[j*nx2+1]) );
          }
      }
  }
 }
 ---/Ccode --
 There is probably a simple mistake in this code - as I said I never
 used OpenMP before.
 It should be not too difficult to use OpenMP correctly here
  or -  maybe better -
 is there a simple SSE(2,3,4) version that might be even better than
 OpenMP... !?

 I supposed, that I did not get the #pragma omp lines right - any idea ?
 Or is it in general not possible to speed this kind of code up using
 OpenMP !?

 Thanks,
 Sebastian Haase
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion



 --
 Information System Engineer, Ph.D.
 Blog: http://matt.eifelle.com
 LinkedIn: http://www.linkedin.com/in/matthieubrucher

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-15 Thread Matthieu Brucher
Use restrict directly in C99 mode (__restrict does not have exactly the same
semantics).

For a valgrind profile, you can check my blog (
http://matt.eifelle.com/2009/04/07/profiling-with-valgrind/).
Basically, if you have a python script, you can run valgrind <options in my
blog> python myscript.py

For PAPI, you have to install several packages (the perf module for the kernel,
for instance) and a GUI to analyze the results (in Eclipse, it should be
possible).

Matthieu

2011/2/15 Sebastian Haase seb.ha...@gmail.com

 Thanks Matthieu,
 using __restrict__ with g++ did not change anything. How do I use
 valgrind with C extensions?
 I don't know what PAPI profil is ...?
 -Sebastian


 On Tue, Feb 15, 2011 at 4:54 PM, Matthieu Brucher
 matthieu.bruc...@gmail.com wrote:
  Hi,
  My first move would be to add a restrict keyword to dist (i.e. dist is
 the
  only pointer to the specific memory location), and then declare dist_
 inside
  the first loop also with a restrict.
  Then, I would run valgrind or a PAPI profil on your code to see what
 causes
  the issue (false sharing, ...)
  Matthieu
 
  2011/2/15 Sebastian Haase seb.ha...@gmail.com
 
  Hi,
  I assume that someone here could maybe help me, and I'm hoping it's
  not too much off topic.
  I have 2 arrays of 2d point coordinates and would like to calculate
  all pairwise distances as fast as possible.
  Going from Python/Numpy to a (Swigged) C extension already gave me a
  55x speedup.
  (.9ms vs. 50ms for arrays of length 329 and 340).
  I'm using gcc on Linux.
  Now I'm wondering if I could go even faster !?
  My hope that the compiler might automagically do some SSE2
  optimization got disappointed.
  Since I have a 4 core CPU I thought OpenMP might be an option;
  I never used that, and after some playing around I managed to get
  (only) 50% slowdown(!) :-(
 
  My code in short is this:
  (My SWIG typemaps use obj_to_array_no_conversion() from numpy.i)
  ---Ccode --
  void dists2d(
                    double *a_ps, int nx1, int na,
                    double *b_ps, int nx2, int nb,
                    double *dist, int nx3, int ny3)  throw (char*)
  {
   if(nx1 != 2)  throw (char*) "a must be of shape (n,2)";
   if(nx2 != 2)  throw (char*) "b must be of shape (n,2)";
   if(nx3 != nb || ny3 != na)  throw (char*) "c must be of shape (na,nb)";

   double *dist_;
   int i, j;

  #pragma omp parallel private(dist_, j, i)
   {
  #pragma omp for nowait
     for(i=0; i<na; i++)
       {
         //num_threads=omp_get_num_threads();  --> 4
         dist_ = dist + i*nb;         // dist_ is only introduced for OpenMP
         for(j=0; j<nb; j++)
           {
             *dist_++ = sqrt( sq(a_ps[i*nx1]   - b_ps[j*nx2]) +
                              sq(a_ps[i*nx1+1] - b_ps[j*nx2+1]) );
           }
       }
   }
  }
  ---/Ccode --
  There is probably a simple mistake in this code - as I said I never
  used OpenMP before.
  It should be not too difficult to use OpenMP correctly here
   or -  maybe better -
  is there a simple SSE(2,3,4) version that might be even better than
  OpenMP... !?
 
  I supposed, that I did not get the #pragma omp lines right - any idea ?
  Or is it in general not possible to speed this kind of code up using
  OpenMP !?
 
  Thanks,
  Sebastian Haase
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
  http://mail.scipy.org/mailman/listinfo/numpy-discussion
 
 
 
  --
  Information System Engineer, Ph.D.
  Blog: http://matt.eifelle.com
  LinkedIn: http://www.linkedin.com/in/matthieubrucher
 
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
  http://mail.scipy.org/mailman/listinfo/numpy-discussion
 
 
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion




-- 
Information System Engineer, Ph.D.
Blog: http://matt.eifelle.com
LinkedIn: http://www.linkedin.com/in/matthieubrucher
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-15 Thread Wes McKinney
On Tue, Feb 15, 2011 at 11:25 AM, Matthieu Brucher
matthieu.bruc...@gmail.com wrote:
 Use directly restrict in C99 mode (__restrict does not have exactly the same
 semantics).
 For a valgrind profil, you can check my blog
 (http://matt.eifelle.com/2009/04/07/profiling-with-valgrind/)
 Basically, if you have a python script, you can valgrind --optionsinmyblog
 python myscript.py
 For PAPI, you have to install several packages (perf module for kernel for
 instance) and a GUI to analyze the results (in Eclispe, it should be
 possible).
 Matthieu
 2011/2/15 Sebastian Haase seb.ha...@gmail.com

 Thanks Matthieu,
 using __restrict__ with g++ did not change anything. How do I use
 valgrind with C extensions?
 I don't know what PAPI profil is ...?
 -Sebastian


 On Tue, Feb 15, 2011 at 4:54 PM, Matthieu Brucher
 matthieu.bruc...@gmail.com wrote:
  Hi,
  My first move would be to add a restrict keyword to dist (i.e. dist is
  the
  only pointer to the specific memory location), and then declare dist_
  inside
  the first loop also with a restrict.
  Then, I would run valgrind or a PAPI profil on your code to see what
  causes
  the issue (false sharing, ...)
  Matthieu
 
  2011/2/15 Sebastian Haase seb.ha...@gmail.com
 
  Hi,
  I assume that someone here could maybe help me, and I'm hoping it's
  not too much off topic.
  I have 2 arrays of 2d point coordinates and would like to calculate
  all pairwise distances as fast as possible.
  Going from Python/Numpy to a (Swigged) C extension already gave me a
  55x speedup.
  (.9ms vs. 50ms for arrays of length 329 and 340).
  I'm using gcc on Linux.
  Now I'm wondering if I could go even faster !?
  My hope that the compiler might automagically do some SSE2
  optimization got disappointed.
  Since I have a 4 core CPU I thought OpenMP might be an option;
  I never used that, and after some playing around I managed to get
  (only) 50% slowdown(!) :-(
 
  My code in short is this:
  (My SWIG typemaps use obj_to_array_no_conversion() from numpy.i)
  ---Ccode --
  void dists2d(
                    double *a_ps, int nx1, int na,
                    double *b_ps, int nx2, int nb,
                    double *dist, int nx3, int ny3)  throw (char*)
  {
   if(nx1 != 2)  throw (char*) "a must be of shape (n,2)";
   if(nx2 != 2)  throw (char*) "b must be of shape (n,2)";
   if(nx3 != nb || ny3 != na)  throw (char*) "c must be of shape (na,nb)";

   double *dist_;
   int i, j;

  #pragma omp parallel private(dist_, j, i)
   {
  #pragma omp for nowait
     for(i=0; i<na; i++)
       {
         //num_threads=omp_get_num_threads();  --> 4
         dist_ = dist + i*nb;         // dist_ is only introduced for OpenMP
         for(j=0; j<nb; j++)
           {
             *dist_++ = sqrt( sq(a_ps[i*nx1]   - b_ps[j*nx2]) +
                              sq(a_ps[i*nx1+1] - b_ps[j*nx2+1]) );
           }
       }
   }
  }
  ---/Ccode --
  There is probably a simple mistake in this code - as I said I never
  used OpenMP before.
  It should be not too difficult to use OpenMP correctly here
   or -  maybe better -
  is there a simple SSE(2,3,4) version that might be even better than
  OpenMP... !?
 
  I supposed, that I did not get the #pragma omp lines right - any idea ?
  Or is it in general not possible to speed this kind of code up using
  OpenMP !?
 
  Thanks,
  Sebastian Haase
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
  http://mail.scipy.org/mailman/listinfo/numpy-discussion
 
 
 
  --
  Information System Engineer, Ph.D.
  Blog: http://matt.eifelle.com
  LinkedIn: http://www.linkedin.com/in/matthieubrucher
 
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
  http://mail.scipy.org/mailman/listinfo/numpy-discussion
 
 
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion



 --
 Information System Engineer, Ph.D.
 Blog: http://matt.eifelle.com
 LinkedIn: http://www.linkedin.com/in/matthieubrucher

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion



As an aside, do you have a GPU-equipped machine? This function looks
like it would be an easy CUDA target. It depends on who the users of the
function are (e.g. do they also have the CUDA runtime) whether you
would want to go that route.
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-15 Thread Sebastian Haase
Wes,
I think I should have a couple of GPUs. I would be ready for anything
... if you think that I could do some easy(!) CUDA programming here,
maybe you could guide me into the right direction...
Thanks,
Sebastian.


On Tue, Feb 15, 2011 at 5:26 PM, Wes McKinney wesmck...@gmail.com wrote:
 On Tue, Feb 15, 2011 at 11:25 AM, Matthieu Brucher
 matthieu.bruc...@gmail.com wrote:
 Use directly restrict in C99 mode (__restrict does not have exactly the same
 semantics).
 For a valgrind profil, you can check my blog
 (http://matt.eifelle.com/2009/04/07/profiling-with-valgrind/)
 Basically, if you have a python script, you can valgrind --optionsinmyblog
 python myscript.py
 For PAPI, you have to install several packages (perf module for kernel for
 instance) and a GUI to analyze the results (in Eclispe, it should be
 possible).
 Matthieu
 2011/2/15 Sebastian Haase seb.ha...@gmail.com

 Thanks Matthieu,
 using __restrict__ with g++ did not change anything. How do I use
 valgrind with C extensions?
 I don't know what PAPI profil is ...?
 -Sebastian


 On Tue, Feb 15, 2011 at 4:54 PM, Matthieu Brucher
 matthieu.bruc...@gmail.com wrote:
  Hi,
  My first move would be to add a restrict keyword to dist (i.e. dist is
  the
  only pointer to the specific memory location), and then declare dist_
  inside
  the first loop also with a restrict.
  Then, I would run valgrind or a PAPI profil on your code to see what
  causes
  the issue (false sharing, ...)
  Matthieu
 
  2011/2/15 Sebastian Haase seb.ha...@gmail.com
 
  Hi,
  I assume that someone here could maybe help me, and I'm hoping it's
  not too much off topic.
  I have 2 arrays of 2d point coordinates and would like to calculate
  all pairwise distances as fast as possible.
  Going from Python/Numpy to a (Swigged) C extension already gave me a
  55x speedup.
  (.9ms vs. 50ms for arrays of length 329 and 340).
  I'm using gcc on Linux.
  Now I'm wondering if I could go even faster !?
  My hope that the compiler might automagically do some SSE2
  optimization got disappointed.
  Since I have a 4 core CPU I thought OpenMP might be an option;
  I never used that, and after some playing around I managed to get
  (only) 50% slowdown(!) :-(
 
  My code in short is this:
  (My SWIG typemaps use obj_to_array_no_conversion() from numpy.i)
  ---Ccode --
  void dists2d(
                    double *a_ps, int nx1, int na,
                    double *b_ps, int nx2, int nb,
                    double *dist, int nx3, int ny3)  throw (char*)
  {
   if(nx1 != 2)  throw (char*) "a must be of shape (n,2)";
   if(nx2 != 2)  throw (char*) "b must be of shape (n,2)";
   if(nx3 != nb || ny3 != na)  throw (char*) "c must be of shape (na,nb)";

   double *dist_;
   int i, j;

  #pragma omp parallel private(dist_, j, i)
   {
  #pragma omp for nowait
     for(i=0; i<na; i++)
       {
         //num_threads=omp_get_num_threads();  --> 4
         dist_ = dist + i*nb;         // dist_ is only introduced for OpenMP
         for(j=0; j<nb; j++)
           {
             *dist_++ = sqrt( sq(a_ps[i*nx1]   - b_ps[j*nx2]) +
                              sq(a_ps[i*nx1+1] - b_ps[j*nx2+1]) );
           }
       }
   }
  }
  ---/Ccode --
  There is probably a simple mistake in this code - as I said I never
  used OpenMP before.
  It should be not too difficult to use OpenMP correctly here
   or -  maybe better -
  is there a simple SSE(2,3,4) version that might be even better than
  OpenMP... !?
 
  I supposed, that I did not get the #pragma omp lines right - any idea ?
  Or is it in general not possible to speed this kind of code up using
  OpenMP !?
 
  Thanks,
  Sebastian Haase
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
  http://mail.scipy.org/mailman/listinfo/numpy-discussion
 
 
 
  --
  Information System Engineer, Ph.D.
  Blog: http://matt.eifelle.com
  LinkedIn: http://www.linkedin.com/in/matthieubrucher
 
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
  http://mail.scipy.org/mailman/listinfo/numpy-discussion
 
 
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion



 --
 Information System Engineer, Ph.D.
 Blog: http://matt.eifelle.com
 LinkedIn: http://www.linkedin.com/in/matthieubrucher

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion



 As an aside, do you have a GPU-equipped machine? This function looks
 like it would be an easy CUDA target. Depends on who the users of the
 function are (e.g. do they also have the CUDA runtime) if whether you
 wanted to go that route.
 ___
 NumPy-Discussion mailing list
 

Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-15 Thread Wes McKinney
On Tue, Feb 15, 2011 at 11:33 AM, Sebastian Haase seb.ha...@gmail.com wrote:
 Wes,
 I think I should have a couple of GPUs. I would be ready for anything
 ... if you think that I could do some easy(!) CUDA programming here,
 maybe you could guide me into the right direction...
 Thanks,
 Sebastian.


 On Tue, Feb 15, 2011 at 5:26 PM, Wes McKinney wesmck...@gmail.com wrote:
 On Tue, Feb 15, 2011 at 11:25 AM, Matthieu Brucher
 matthieu.bruc...@gmail.com wrote:
 Use directly restrict in C99 mode (__restrict does not have exactly the same
 semantics).
 For a valgrind profil, you can check my blog
 (http://matt.eifelle.com/2009/04/07/profiling-with-valgrind/)
 Basically, if you have a python script, you can valgrind --optionsinmyblog
 python myscript.py
 For PAPI, you have to install several packages (perf module for kernel for
 instance) and a GUI to analyze the results (in Eclispe, it should be
 possible).
 Matthieu
 2011/2/15 Sebastian Haase seb.ha...@gmail.com

 Thanks Matthieu,
 using __restrict__ with g++ did not change anything. How do I use
 valgrind with C extensions?
 I don't know what PAPI profil is ...?
 -Sebastian


 On Tue, Feb 15, 2011 at 4:54 PM, Matthieu Brucher
 matthieu.bruc...@gmail.com wrote:
  Hi,
  My first move would be to add a restrict keyword to dist (i.e. dist is
  the
  only pointer to the specific memory location), and then declare dist_
  inside
  the first loop also with a restrict.
  Then, I would run valgrind or a PAPI profil on your code to see what
  causes
  the issue (false sharing, ...)
  Matthieu
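
To illustrate what that restrict advice might look like in practice (a sketch
written for this thread under the assumption that dist is the only pointer
into the output buffer; the function name dists2d_restrict is made up here):

#include <math.h>

/* C99 restrict sketch: promises the compiler that a_ps, b_ps and dist
   never alias each other, which can let it vectorize the inner loop.
   Compile with -std=c99 (or use __restrict__ under g++). */
void dists2d_restrict(const double *restrict a_ps, int na,
                      const double *restrict b_ps, int nb,
                      double *restrict dist)
{
    for (int i = 0; i < na; i++) {
        double *restrict row = &dist[i * nb];  /* row pointer declared
                                                  inside the loop, as
                                                  suggested */
        for (int j = 0; j < nb; j++) {
            double dx = a_ps[2*i]     - b_ps[2*j];
            double dy = a_ps[2*i + 1] - b_ps[2*j + 1];
            row[j] = sqrt(dx*dx + dy*dy);
        }
    }
}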
 
  2011/2/15 Sebastian Haase seb.ha...@gmail.com
 
  Hi,
  I assume that someone here could maybe help me, and I'm hoping it's
  not too much off topic.
  I have 2 arrays of 2d point coordinates and would like to calculate
  all pairwise distances as fast as possible.
  Going from Python/Numpy to a (Swigged) C extension already gave me a
  55x speedup.
  (.9ms vs. 50ms for arrays of length 329 and 340).
  I'm using gcc on Linux.
  Now I'm wondering if I could go even faster !?
  My hope that the compiler might automagically do some SSE2
   optimization was disappointed.
  Since I have a 4 core CPU I thought OpenMP might be an option;
  I never used that, and after some playing around I managed to get
  (only) 50% slowdown(!) :-(
 
  My code in short is this:
  (My SWIG typemaps use obj_to_array_no_conversion() from numpy.i)
  ---Ccode --
   void dists2d(
                     double *a_ps, int nx1, int na,
                     double *b_ps, int nx2, int nb,
                     double *dist, int nx3, int ny3)  throw (char*)
   {
    if(nx1 != 2)  throw (char*) "a must be of shape (n,2)";
    if(nx2 != 2)  throw (char*) "b must be of shape (n,2)";
    if(nx3 != nb || ny3 != na)    throw (char*) "c must be of shape (na,nb)";
  
    double *dist_;
    int i, j;
  
   #pragma omp parallel private(dist_, j, i)
    {
   #pragma omp for nowait
          for(i=0;i<na;i++)
            {
                  //num_threads=omp_get_num_threads();  --> 4
                  dist_ = dist+i*nb;                 // dists_ is only introduced for OpenMP
                  for(j=0;j<nb;j++)
                    {
                          *dist_++  = sqrt( sq(a_ps[i*nx1]   - b_ps[j*nx2]) +
                                            sq(a_ps[i*nx1+1] - b_ps[j*nx2+1]) );
                    }
            }
    }
   }
  ---/Ccode --
   There is probably a simple mistake in this code - as I said, I have never
   used OpenMP before.
   It should not be too difficult to use OpenMP correctly here
    or - maybe better -
   is there a simple SSE(2,3,4) version that might be even better than
   OpenMP...!?
  
   I suppose that I did not get the #pragma omp lines right - any ideas?
   Or is it in general not possible to speed up this kind of code using
   OpenMP!?
 
  Thanks,
  Sebastian Haase
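
For the SSE question, one way to hand-vectorize the inner loop with SSE2
intrinsics (a rough sketch only; row_dists_sse2 is a name made up here,
unaligned loads are assumed, and an odd trailing point falls back to scalar
code):

#include <emmintrin.h>   /* SSE2 */
#include <math.h>

/* Distances from one point (ax, ay) to all nb points in b_ps
   (interleaved x0,y0,x1,y1,...), computed two at a time. */
static void row_dists_sse2(double ax, double ay,
                           const double *b_ps, int nb, double *row)
{
    const __m128d vax = _mm_set1_pd(ax);
    const __m128d vay = _mm_set1_pd(ay);
    int j = 0;
    for (; j + 1 < nb; j += 2) {
        __m128d p0 = _mm_loadu_pd(b_ps + 2*j);      /* [x_j  , y_j  ] */
        __m128d p1 = _mm_loadu_pd(b_ps + 2*j + 2);  /* [x_j+1, y_j+1] */
        __m128d xs = _mm_unpacklo_pd(p0, p1);       /* [x_j, x_j+1]   */
        __m128d ys = _mm_unpackhi_pd(p0, p1);       /* [y_j, y_j+1]   */
        __m128d dx = _mm_sub_pd(vax, xs);
        __m128d dy = _mm_sub_pd(vay, ys);
        __m128d d2 = _mm_add_pd(_mm_mul_pd(dx, dx), _mm_mul_pd(dy, dy));
        _mm_storeu_pd(row + j, _mm_sqrt_pd(d2));
    }
    for (; j < nb; j++) {                           /* scalar tail */
        double dx = ax - b_ps[2*j];
        double dy = ay - b_ps[2*j + 1];
        row[j] = sqrt(dx*dx + dy*dy);
    }
}

Whether this beats what gcc already autovectorizes at -O3 would of course have
to be measured.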
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
  http://mail.scipy.org/mailman/listinfo/numpy-discussion
 
 
 
  --
  Information System Engineer, Ph.D.
  Blog: http://matt.eifelle.com
  LinkedIn: http://www.linkedin.com/in/matthieubrucher
 
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
  http://mail.scipy.org/mailman/listinfo/numpy-discussion
 
 
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion



 --
 Information System Engineer, Ph.D.
 Blog: http://matt.eifelle.com
 LinkedIn: http://www.linkedin.com/in/matthieubrucher

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion



 As an aside, do you have a GPU-equipped machine? This function looks
 like it would be an easy CUDA target. Whether you'd want to go that
 route depends on who the users of the function are (e.g. do they also
 have the CUDA runtime).
 

Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-15 Thread eat
Hi,

On Tue, Feb 15, 2011 at 5:50 PM, Sebastian Haase seb.ha...@gmail.comwrote:

 Hi,
 I assume that someone here could maybe help me, and I'm hoping it's
 not too much off topic.
 I have 2 arrays of 2d point coordinates and would like to calculate
 all pairwise distances as fast as possible.
 Going from Python/Numpy to a (Swigged) C extension already gave me a
 55x speedup.
 (.9ms vs. 50ms for arrays of length 329 and 340).

With my very modest machine (Intel Dual CPU E2200, 2.2 GHz), utilizing
scipy.spatial.distance.pdist will take some 1.5 ms for such arrays. Even the
simple linear-algebra-based full matrix calculation can be done in less than
5 ms.
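
Presumably the "linear algebra based" route refers to the usual expansion of
the squared distance into norms and a matrix product (an assumption; eat does
not spell it out):

    % Squared pairwise distances between rows a_i of A (n_a x 2)
    % and rows b_j of B (n_b x 2):
    D_{ij}^2 = \lVert a_i \rVert^2 + \lVert b_j \rVert^2 - 2\, a_i \cdot b_j ,
    \qquad
    D^{\circ 2} = \lVert a \rVert^2 \mathbf{1}^\top
                  + \mathbf{1}\, (\lVert b \rVert^2)^\top - 2\, A B^\top

so the whole n_a x n_b matrix comes out of one matrix multiplication plus two
broadcast additions, followed by an elementwise square root.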


My two cents,
eat

 I'm using gcc on Linux.
 Now I'm wondering if I could go even faster !?
 My hope that the compiler might automagically do some SSE2
 optimization was disappointed.
 Since I have a 4 core CPU I thought OpenMP might be an option;
 I never used that, and after some playing around I managed to get
 (only) 50% slowdown(!) :-(

 My code in short is this:
 (My SWIG typemaps use obj_to_array_no_conversion() from numpy.i)
 ---Ccode --
 void dists2d(
                   double *a_ps, int nx1, int na,
                   double *b_ps, int nx2, int nb,
                   double *dist, int nx3, int ny3)  throw (char*)
 {
  if(nx1 != 2)  throw (char*) "a must be of shape (n,2)";
  if(nx2 != 2)  throw (char*) "b must be of shape (n,2)";
  if(nx3 != nb || ny3 != na)    throw (char*) "c must be of shape (na,nb)";

  double *dist_;
  int i, j;

 #pragma omp parallel private(dist_, j, i)
  {
 #pragma omp for nowait
        for(i=0;i<na;i++)
          {
                //num_threads=omp_get_num_threads();  --> 4
                dist_ = dist+i*nb;                 // dists_ is only introduced for OpenMP
                for(j=0;j<nb;j++)
                  {
                        *dist_++  = sqrt( sq(a_ps[i*nx1]   - b_ps[j*nx2]) +
                                          sq(a_ps[i*nx1+1] - b_ps[j*nx2+1]) );
                  }
          }
  }
 }
 ---/Ccode --
 There is probably a simple mistake in this code - as I said, I have never
 used OpenMP before.
 It should not be too difficult to use OpenMP correctly here
  or - maybe better -
 is there a simple SSE(2,3,4) version that might be even better than
 OpenMP...!?

 I suppose that I did not get the #pragma omp lines right - any ideas?
 Or is it in general not possible to speed up this kind of code using
 OpenMP!?

 Thanks,
 Sebastian Haase
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-15 Thread Sebastian Haase
Hi Eat,
I will surely try these routines tomorrow,
but I still think that neither scipy function does the complete
distance calculation of all possible pairs as done by my C code.
For 2 arrays, X and Y, of nX and nY 2d coordinates respectively, I
need to get nX times nY distances computed.
From the online documentation I understand that
pdist calculates something like nX-squared distances for a single array
X of nX coordinates,
and cdist would calculate nX distances, where nX is required to equal nY.

Thanks for looking into this,
Sebastian

On Tue, Feb 15, 2011 at 9:11 PM, eat e.antero.ta...@gmail.com wrote:
 Hi,

 On Tue, Feb 15, 2011 at 5:50 PM, Sebastian Haase seb.ha...@gmail.com
 wrote:

 Hi,
 I assume that someone here could maybe help me, and I'm hoping it's
 not too much off topic.
 I have 2 arrays of 2d point coordinates and would like to calculate
 all pairwise distances as fast as possible.
 Going from Python/Numpy to a (Swigged) C extension already gave me a
 55x speedup.
 (.9ms vs. 50ms for arrays of length 329 and 340).

 With my very modest machine (Intel Dual CPU E2200, 2.2 GHz), utilizing
 scipy.spatial.distance.pdist will take some 1.5 ms for such arrays. Even the
 simple linear-algebra-based full matrix calculation can be done in less than
 5 ms.

 My two cents,
 eat

 I'm using gcc on Linux.
 Now I'm wondering if I could go even faster !?
 My hope that the compiler might automagically do some SSE2
  optimization was disappointed.
 Since I have a 4 core CPU I thought OpenMP might be an option;
 I never used that, and after some playing around I managed to get
 (only) 50% slowdown(!) :-(

 My code in short is this:
 (My SWIG typemaps use obj_to_array_no_conversion() from numpy.i)
 ---Ccode --
 void dists2d(
                   double *a_ps, int nx1, int na,
                   double *b_ps, int nx2, int nb,
                   double *dist, int nx3, int ny3)  throw (char*)
 {
  if(nx1 != 2)  throw (char*) "a must be of shape (n,2)";
  if(nx2 != 2)  throw (char*) "b must be of shape (n,2)";
  if(nx3 != nb || ny3 != na)    throw (char*) "c must be of shape (na,nb)";

  double *dist_;
  int i, j;

 #pragma omp parallel private(dist_, j, i)
  {
 #pragma omp for nowait
        for(i=0;i<na;i++)
          {
                //num_threads=omp_get_num_threads();  --> 4
                dist_ = dist+i*nb;                 // dists_ is only introduced for OpenMP
                for(j=0;j<nb;j++)
                  {
                        *dist_++  = sqrt( sq(a_ps[i*nx1]   - b_ps[j*nx2]) +
                                          sq(a_ps[i*nx1+1] - b_ps[j*nx2+1]) );
                  }
          }
  }
 }
 ---/Ccode --
 There is probably a simple mistake in this code - as I said, I have never
 used OpenMP before.
 It should not be too difficult to use OpenMP correctly here
  or - maybe better -
 is there a simple SSE(2,3,4) version that might be even better than
 OpenMP...!?

 I suppose that I did not get the #pragma omp lines right - any ideas?
 Or is it in general not possible to speed up this kind of code using
 OpenMP!?

 Thanks,
 Sebastian Haase
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-15 Thread Chris Colbert
The `cdist` function in scipy spatial does what you want, and takes ~ 1ms on
my machine.


In [1]: import numpy as np

In [2]: from scipy.spatial.distance import cdist

In [3]: a = np.random.random((340, 2))

In [4]: b = np.random.random((329, 2))

In [5]: c = cdist(a, b)

In [6]: c.shape
Out[6]: (340, 329)

In [7]: %timeit cdist(a, b)
1000 loops, best of 3: 1.18 ms per loop



On Tue, Feb 15, 2011 at 4:42 PM, Sebastian Haase seb.ha...@gmail.comwrote:

 Hi Eat,
 I will surely try these routines tomorrow,
 but I still think that neither scipy function does the complete
 distance calculation of all possible pairs as done by my C code.
 For 2 arrays, X and Y, of nX and nY 2d coordinates respectively, I
 need to get nX times nY distances computed.
 From the online documentation I understand that
 pdist calculates something like nX-squared distances for a single array
 X of nX coordinates,
 and cdist would calculate nX distances, where nX is required to equal nY.

 Thanks for looking into this,
 Sebastian

 On Tue, Feb 15, 2011 at 9:11 PM, eat e.antero.ta...@gmail.com wrote:
  Hi,
 
  On Tue, Feb 15, 2011 at 5:50 PM, Sebastian Haase seb.ha...@gmail.com
  wrote:
 
  Hi,
  I assume that someone here could maybe help me, and I'm hoping it's
  not too much off topic.
  I have 2 arrays of 2d point coordinates and would like to calculate
  all pairwise distances as fast as possible.
  Going from Python/Numpy to a (Swigged) C extension already gave me a
  55x speedup.
  (.9ms vs. 50ms for arrays of length 329 and 340).
 
  With my very modest machine (Intel Dual CPU E2200, 2.2 GHz), utilizing
  scipy.spatial.distance.pdist will take some 1.5 ms for such arrays. Even the
  simple linear-algebra-based full matrix calculation can be done in less than
  5 ms.
 
  My two cents,
  eat
 
  I'm using gcc on Linux.
  Now I'm wondering if I could go even faster !?
  My hope that the compiler might automagically do some SSE2
   optimization was disappointed.
  Since I have a 4 core CPU I thought OpenMP might be an option;
  I never used that, and after some playing around I managed to get
  (only) 50% slowdown(!) :-(
 
  My code in short is this:
  (My SWIG typemaps use obj_to_array_no_conversion() from numpy.i)
  ---Ccode --
  void dists2d(
                    double *a_ps, int nx1, int na,
                    double *b_ps, int nx2, int nb,
                    double *dist, int nx3, int ny3)  throw (char*)
  {
   if(nx1 != 2)  throw (char*) "a must be of shape (n,2)";
   if(nx2 != 2)  throw (char*) "b must be of shape (n,2)";
   if(nx3 != nb || ny3 != na)    throw (char*) "c must be of shape (na,nb)";

   double *dist_;
   int i, j;

  #pragma omp parallel private(dist_, j, i)
   {
  #pragma omp for nowait
         for(i=0;i<na;i++)
           {
                 //num_threads=omp_get_num_threads();  --> 4
                 dist_ = dist+i*nb;                 // dists_ is only introduced for OpenMP
                 for(j=0;j<nb;j++)
                   {
                         *dist_++  = sqrt( sq(a_ps[i*nx1]   - b_ps[j*nx2]) +
                                           sq(a_ps[i*nx1+1] - b_ps[j*nx2+1]) );
                   }
           }
   }
  }
  ---/Ccode --
  There is probably a simple mistake in this code - as I said, I have never
  used OpenMP before.
  It should not be too difficult to use OpenMP correctly here
   or - maybe better -
  is there a simple SSE(2,3,4) version that might be even better than
  OpenMP...!?

  I suppose that I did not get the #pragma omp lines right - any ideas?
  Or is it in general not possible to speed up this kind of code using
  OpenMP!?
 
  Thanks,
  Sebastian Haase
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
  http://mail.scipy.org/mailman/listinfo/numpy-discussion
 
 
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
  http://mail.scipy.org/mailman/listinfo/numpy-discussion
 
 
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-15 Thread Jonathan Taylor
Take a look at a nice project coming out of my department:

http://code.google.com/p/cudamat/

Best,
Jon.


On Tue, Feb 15, 2011 at 11:33 AM, Sebastian Haase seb.ha...@gmail.com wrote:
 Wes,
 I think I should have a couple of GPUs. I would be ready for anything
 ... if you think that I could do some easy(!) CUDA programming here,
 maybe you could guide me in the right direction...
 Thanks,
 Sebastian.


 On Tue, Feb 15, 2011 at 5:26 PM, Wes McKinney wesmck...@gmail.com wrote:
 On Tue, Feb 15, 2011 at 11:25 AM, Matthieu Brucher
 matthieu.bruc...@gmail.com wrote:
 Use restrict directly in C99 mode (__restrict does not have exactly the same
 semantics).
 For a valgrind profile, you can check my blog
 (http://matt.eifelle.com/2009/04/07/profiling-with-valgrind/)
 Basically, if you have a python script, you can run valgrind --optionsinmyblog
 python myscript.py
 For PAPI, you have to install several packages (the perf module for the kernel,
 for instance) and a GUI to analyze the results (in Eclipse, it should be
 possible).
 Matthieu
 2011/2/15 Sebastian Haase seb.ha...@gmail.com

 Thanks Matthieu,
 using __restrict__ with g++ did not change anything. How do I use
 valgrind with C extensions?
 I don't know what a PAPI profile is...?
 -Sebastian


 On Tue, Feb 15, 2011 at 4:54 PM, Matthieu Brucher
 matthieu.bruc...@gmail.com wrote:
  Hi,
  My first move would be to add a restrict keyword to dist (i.e. dist is
  the
  only pointer to the specific memory location), and then declare dist_
  inside
  the first loop also with a restrict.
  Then, I would run valgrind or a PAPI profile on your code to see what
  causes
  the issue (false sharing, ...)
  Matthieu
 
  2011/2/15 Sebastian Haase seb.ha...@gmail.com
 
  Hi,
  I assume that someone here could maybe help me, and I'm hoping it's
  not too much off topic.
  I have 2 arrays of 2d point coordinates and would like to calculate
  all pairwise distances as fast as possible.
  Going from Python/Numpy to a (Swigged) C extension already gave me a
  55x speedup.
  (.9ms vs. 50ms for arrays of length 329 and 340).
  I'm using gcc on Linux.
  Now I'm wondering if I could go even faster !?
  My hope that the compiler might automagically do some SSE2
   optimization was disappointed.
  Since I have a 4 core CPU I thought OpenMP might be an option;
  I never used that, and after some playing around I managed to get
  (only) 50% slowdown(!) :-(
 
  My code in short is this:
  (My SWIG typemaps use obj_to_array_no_conversion() from numpy.i)
  ---Ccode --
  void dists2d(
                    double *a_ps, int nx1, int na,
                    double *b_ps, int nx2, int nb,
                    double *dist, int nx3, int ny3)  throw (char*)
  {
   if(nx1 != 2)  throw (char*) "a must be of shape (n,2)";
   if(nx2 != 2)  throw (char*) "b must be of shape (n,2)";
   if(nx3 != nb || ny3 != na)    throw (char*) "c must be of shape (na,nb)";

   double *dist_;
   int i, j;

  #pragma omp parallel private(dist_, j, i)
   {
  #pragma omp for nowait
         for(i=0;i<na;i++)
           {
                 //num_threads=omp_get_num_threads();  --> 4
                 dist_ = dist+i*nb;                 // dists_ is only introduced for OpenMP
                 for(j=0;j<nb;j++)
                   {
                         *dist_++  = sqrt( sq(a_ps[i*nx1]   - b_ps[j*nx2]) +
                                           sq(a_ps[i*nx1+1] - b_ps[j*nx2+1]) );
                   }
           }
   }
  }
  ---/Ccode --
  There is probably a simple mistake in this code - as I said, I have never
  used OpenMP before.
  It should not be too difficult to use OpenMP correctly here
   or - maybe better -
  is there a simple SSE(2,3,4) version that might be even better than
  OpenMP...!?

  I suppose that I did not get the #pragma omp lines right - any ideas?
  Or is it in general not possible to speed up this kind of code using
  OpenMP!?
 
  Thanks,
  Sebastian Haase
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
  http://mail.scipy.org/mailman/listinfo/numpy-discussion
 
 
 
  --
  Information System Engineer, Ph.D.
  Blog: http://matt.eifelle.com
  LinkedIn: http://www.linkedin.com/in/matthieubrucher
 
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
  http://mail.scipy.org/mailman/listinfo/numpy-discussion
 
 
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion



 --
 Information System Engineer, Ph.D.
 Blog: http://matt.eifelle.com
 LinkedIn: http://www.linkedin.com/in/matthieubrucher

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion



 As an aside, do you have a GPU-equipped machine? This function looks
 like it would be an easy CUDA target. Depends on who the users of the
 

Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-15 Thread Eric Carlson
I don't have the slightest idea what I'm doing, but




file name - the_lib.c
___
#include <stdio.h>
#include <time.h>
#include <omp.h>
#include <math.h>

void dists2d(  double *a_ps, int na,
               double *b_ps, int nb,
               double *dist, int num_threads)
{

    int i, j;
    int dynamic=0;
    omp_set_dynamic(dynamic);
    omp_set_num_threads(num_threads);
    double ax,ay, dif_x, dif_y;
    int nx1=2;
    int nx2=2;


#pragma omp parallel for private(j, i, ax, ay, dif_x, dif_y)
    for(i=0;i<na;i++)
      {
        ax=a_ps[i*nx1];
        ay=a_ps[i*nx1+1];
        for(j=0;j<nb;j++)
          {
            dif_x = ax - b_ps[j*nx2];
            dif_y = ay - b_ps[j*nx2+1];
            dist[2*i+j]  = sqrt(dif_x*dif_x+dif_y*dif_y);
          }
      }
}





COMPILE:
__
gcc -c the_lib.c -fPIC -fopenmp -ffast-math
gcc -shared -o the_lib.so the_lib.o -lgomp -lm




the_python_prog.py
_

from ctypes import *
my_lib=CDLL('the_lib.so') #or full path to lib
import numpy as np
import time

na=329
nb=340
a=np.random.rand(na,2)
b=np.random.rand(nb,2)
c=np.zeros(na*nb)
trials=100
max_threads = 24
for k in range(1,max_threads):
    n_threads = c_int(k)
    na2=c_int(na)
    nb2=c_int(nb)

    start = time.time()
    for k1 in range(trials):
        ret = my_lib.dists2d(a.ctypes.data_as(c_void_p),na2,b.ctypes.data_as(c_void_p),nb2,c.ctypes.data_as(c_void_p),n_threads)
    print "c_threads",k, " time ", (time.time()-start)/trials




Results on my machine, dual xeon, 12 cores
na=329
nb=340


100 trials each:
c_threads 1  time  0.00109949827194
c_threads 2  time  0.0005726313591
c_threads 3  time  0.000429179668427
c_threads 4  time  0.000349278450012
c_threads 5  time  0.000287139415741
c_threads 6  time  0.000252468585968
c_threads 7  time  0.000222821235657
c_threads 8  time  0.000206289291382
c_threads 9  time  0.000187981128693
c_threads 10  time  0.000172770023346
c_threads 11  time  0.00016461853
c_threads 12  time  0.000157740116119



Results on my machine, dual xeon, 12 cores
na=3290
nb=3400
__
100 trials each:
c_threads 1  time  0.10744508028
c_threads 2  time  0.054223771
c_threads 3  time  0.037127559185
c_threads 4  time  0.0280736112595
c_threads 5  time  0.0228648614883
c_threads 6  time  0.0194904088974
c_threads 7  time  0.0165715909004
c_threads 8  time  0.0145838689804
c_threads 9  time  0.0130002498627
c_threads 10  time  0.0116940999031
c_threads 11  time  0.0107557415962
c_threads 12  time  0.00990005016327 (speedup almost 11)

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion