Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?
A Sunday 20 February 2011 00:01:59 Sturla Molden escrigué: pthreads will give you better control than OpenMP, but are messy and painful to work with. With MPI you have separate processes, so everything is completely isolated. It's more difficult to program and debug than OpenMP code, but will usually perform better. To be more specific, MPI will perform better if you don't need to share the memory of your working set among all your processes, but in case you need to do this, threads (pthreads, OpenMP) lets you access all parts of the working set in memory much more easily. In fact, all the threads of a process can access the complete working set transparently and efficiently (although this is precisely why they are trickier to program: you have to explicitly avoid simultaneous writing in the same memory area), not to mention that threads are much cheaper to create than processes. Generally speaking, if your problem is large, CPU intensive and not very memory bounded, MPI usually leads to better results. Otherwise threads tend to do a better job. The thing is that memory is increasingly becoming more and more a bottleneck nowadays, so threads are here to stay for a long, long time (which is certainly unfortunate for programmers, specially for pthreads ones :). -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?
Thanks a lot. Very informative. I guess what you say about cache line is dirtied is related to the info I got with valgrind (see my email in this thread: L1 Data Write Miss 3636). Can one assume that the cache line is always a few mega bytes ? Thanks, Sebastian On Sat, Feb 19, 2011 at 12:40 AM, Sturla Molden stu...@molden.no wrote: Den 17.02.2011 16:31, skrev Matthieu Brucher: It may also be the sizes of the chunk OMP uses. You can/should specify them.in Matthieu the OMP pragma so that it is a multiple of the cache line size or something close. Also beware of false sharing among the threads. When one processor updates the array dist in Sebastian's code, the cache line is dirtied for the other processors: #pragma omp parallel for private(j, i,ax,ay, dif_x, dif_y) for(i=0;ina;i++) { ax=a_ps[i*nx1]; ay=a_ps[i*nx1+1]; for(j=0;jnb;j++) { dif_x = ax - b_ps[j*nx2]; dif_y = ay - b_ps[j*nx2+1]; /* update shared memory */ dist[2*i+j] = sqrt(dif_x*dif_x+dif_y*dif_y); /* ... and poof the cache is dirty */ } } Whenever this happens, the processors must stop whatever they are doing to resynchronize their cache lines. False sharing can therefore work as an invisible GIL inside OpenMP code.The processors can appear to run in syrup, and there is excessive traffic on the memory bus. This is also why MPI programs often scale better than OpenMP programs, despite the IPC overhead. An advice when working with OpenMP is to let each thread write to private data arrays, and only share read-only arrays. One can e.g. use OpenMP's reduction pragma to achieve this. E.g. intialize the array dist with zeros, and use reduction(+:dist) in the OpenMP pragma line. Sturla ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?
On Sat, 19 Feb 2011 18:13:44 +0100, Sebastian Haase wrote: Thanks a lot. Very informative. I guess what you say about cache line is dirtied is related to the info I got with valgrind (see my email in this thread: L1 Data Write Miss 3636). Can one assume that the cache line is always a few mega bytes ? Cache lines are typically much smaller, 16-512 bytes. In this specific case, since the stride of the `i` loop is only 2*sizeof(float) = 16 bytes cache line size, threads running with different `i` tend to write to the same cache lines. -- Pauli Virtanen ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?
Den 19.02.2011 18:13, skrev Sebastian Haase: Can one assume that the cache line is always a few mega bytes ? Don't confuse the size of a cache with the size of a cache line. A cache line (which is the unit that gets marked dirty) is typically 8-512 bytes. Make sure your OpenMP threads stay off each others cache lines, and it will scale nicely. For example, you can specify a chunk-size to the schedule pragma to force them apart; it applies to the loop index, so you must do calculations for the block size on the shared write buffer. If you use reduction(+:dist) the write buffer will be completely private, but you get summations after the loop. That is the limited amount of control you get with OpenMP. pthreads will give you better control than OpenMP, but are messy and painful to work with. With MPI you have separate processes, so everything is completely isolated. It's more difficult to program and debug than OpenMP code, but will usually perform better. Sturla ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?
Den 17.02.2011 16:31, skrev Matthieu Brucher: It may also be the sizes of the chunk OMP uses. You can/should specify them.in http://them.in Matthieu the OMP pragma so that it is a multiple of the cache line size or something close. Also beware of false sharing among the threads. When one processor updates the array dist in Sebastian's code, the cache line is dirtied for the other processors: #pragma omp parallel for private(j, i,ax,ay, dif_x, dif_y) for(i=0;ina;i++){ ax=a_ps[i*nx1]; ay=a_ps[i*nx1+1]; for(j=0;jnb;j++) { dif_x = ax - b_ps[j*nx2]; dif_y = ay - b_ps[j*nx2+1]; /* update shared memory */ dist[2*i+j] = sqrt(dif_x*dif_x+dif_y*dif_y); /* ... and poof the cache is dirty */ } } Whenever this happens, the processors must stop whatever they are doing to resynchronize their cache lines. False sharing can therefore work as an invisible GIL inside OpenMP code.The processors can appear to run in syrup, and there is excessive traffic on the memory bus. This is also why MPI programs often scale better than OpenMP programs, despite the IPC overhead. An advice when working with OpenMP is to let each thread write to private data arrays, and only share read-only arrays. One can e.g. use OpenMP's reduction pragma to achieve this. E.g. intialize the array dist with zeros, and use reduction(+:dist) in the OpenMP pragma line. Sturla ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?
Eric, thanks for insisting on this. I noticed that, when I saw it first, just to forget about it again ... The new timings on my machine are: $: gcc -O3 -c the_lib.c -fPIC -fopenmp -ffast-math $: gcc -shared -o the_lib.so the_lib.o -lgomp -lm $: python2.5 the_python_prog.py c_threads 1 time 0.000897128582001 c_threads 2 time 0.000540800094604 c_threads 3 time 0.00035933971405 c_threads 4 time 0.000529370307922 c_threads 5 time 0.00049122095108 c_threads 6 time 0.000540502071381 c_threads 7 time 0.00058007911 c_threads 8 time 0.000643739700317 c_threads 9 time 0.000622930526733 c_threads 10 time 0.000680360794067 c_threads 11 time 0.000613269805908 c_threads 12 time 0.000633401870728 That is, your OpenMP version is again fastest using 3 threads on my 4 core CPU. It is now 2.34x times faster than my non-OpenMP code (which compares to scipy...cdist). And, it is (only !?) 7% slower than then non-OpenMP code, when running on 1 thread. (The speedup 3 threads vs. 1 thread is 2.5x) So, that is pretty good !! What I don't understand, why did you start your first post with I don't have the slightest idea what I'm doing ;-) Do you think, one could get even better ? And, where does the 7% slow-down (for single thread) come from ? Is it possible to have the OpenMP option in a code, without _any_ penalty for 1 core machines ? Thanks, - Sebastian On Thu, Feb 17, 2011 at 2:12 AM, Eric Carlson ecarl...@eng.ua.edu wrote: Sebastian, Optimization appears to be important here. I used no optimization in my previous post, so you could try the -O3 compile option: gcc -O3 -c my_lib.c -fPIC -fopenmp -ffast-math for na=329 and nb=340 I get (about 7.5 speedup) c_threads 1 time 0.00103106021881 c_threads 2 time 0.000528309345245 c_threads 3 time 0.000362541675568 c_threads 4 time 0.00028993844986 c_threads 5 time 0.000287840366364 c_threads 6 time 0.000264899730682 c_threads 7 time 0.000244019031525 c_threads 8 time 0.000242137908936 c_threads 9 time 0.000232398509979 c_threads 10 time 0.000227460861206 c_threads 11 time 0.00021938085556 c_threads 12 time 0.000216970443726 c_threads 13 time 0.000215198993683 c_threads 14 time 0.00021940946579 c_threads 15 time 0.000204219818115 c_threads 16 time 0.000216958522797 c_threads 17 time 0.000219728946686 c_threads 18 time 0.00010272522 c_threads 19 time 0.000157492160797 c_threads 20 time 0.000171000957489 c_threads 21 time 0.000147500038147 c_threads 22 time 0.000141770839691 c_threads 23 time 0.000137741565704 for na=3290 and nb=3400 (about 11.5 speedup) c_threads 1 time 0.100258581638 c_threads 2 time 0.0501346611977 c_threads 3 time 0.0335096096992 c_threads 4 time 0.0253720903397 c_threads 5 time 0.0208190107346 c_threads 6 time 0.0173784399033 c_threads 7 time 0.0148811817169 c_threads 8 time 0.0130474209785 c_threads 9 time 0.011598110199 c_threads 10 time 0.0104278612137 c_threads 11 time 0.00950778007507 c_threads 12 time 0.00870131969452 c_threads 13 time 0.015882730484 c_threads 14 time 0.0148504400253 c_threads 15 time 0.0139465212822 c_threads 16 time 0.0130301308632 c_threads 17 time 0.012240819931 c_threads 18 time 0.011567029953 c_threads 19 time 0.0109891605377 c_threads 20 time 0.0104281497002 c_threads 21 time 0.00992572069168 c_threads 22 time 0.00957406997681 c_threads 23 time 0.00936627149582 for na=329 and nb=340, cdist comes in at 0.00111914873123 which is 1.085x slower than the c version for my system. for na=3290 and nb=3400 cdist gives 0.143441538811 Cheers, Eric ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?
Do you think, one could get even better ? And, where does the 7% slow-down (for single thread) come from ? Is it possible to have the OpenMP option in a code, without _any_ penalty for 1 core machines ? There will always be a penalty for parallel code that runs on one core. You have at least the overhead for splitting the data. Matthieu -- Information System Engineer, Ph.D. Blog: http://matt.eifelle.com LinkedIn: http://www.linkedin.com/in/matthieubrucher ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?
Then, where does the overhead come from ? -- The call toomp_set_dynamic(dynamic); Or the #pragma omp parallel for private(j, i,ax,ay, dif_x, dif_y) It may be this. You initialize a thread pool, even if it has only one thread, and there is the dynamic part, so OpenMP may create several chunks instead of one big chunk. Matthieu -- Information System Engineer, Ph.D. Blog: http://matt.eifelle.com LinkedIn: http://www.linkedin.com/in/matthieubrucher ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?
It may also be the sizes of the chunk OMP uses. You can/should specify them.in the OMP pragma so that it is a multiple of the cache line size or something close. Matthieu 2011/2/17 Sebastian Haase seb.ha...@gmail.com Hi, More surprises: shaase@iris:~/code/SwiggedDistOMP: gcc -O3 -c the_lib.c -fPIC -fopenmp -ffast-math shaase@iris:~/code/SwiggedDistOMP: gcc -shared -o the_lib.so the_lib.o -lgomp -lm shaase@iris:~/code/SwiggedDistOMP: priithon the_python_prog.py c_threads 0 time 0.000437839031219# this is now, without #pragma omp parallel for ... c_threads 1 time 0.000865449905396 c_threads 2 time 0.000520548820496 c_threads 3 time 0.00033704996109 c_threads 4 time 0.000620169639587 c_threads 5 time 0.000465350151062 c_threads 6 time 0.000696349143982 This correct now the timing of, max OpenMP speed (3 threads) vs. no OpenMP to speedup of (only!) 1.3x Not 2.33x (which was the number I got when comparing OpenMP to the cdist function). The c code is now: the_lib.c -- #include stdio.h #include time.h #include omp.h #include math.h void dists2d( double *a_ps, int na, double *b_ps, int nb, double *dist, int num_threads) { int i, j; double ax,ay, dif_x, dif_y; int nx1=2; int nx2=2; if(num_threads0) { int dynamic=0; omp_set_dynamic(dynamic); omp_set_num_threads(num_threads); #pragma omp parallel for private(j, i,ax,ay, dif_x, dif_y) for(i=0;ina;i++) { ax=a_ps[i*nx1]; ay=a_ps[i*nx1+1]; for(j=0;jnb;j++) { dif_x = ax - b_ps[j*nx2]; dif_y = ay - b_ps[j*nx2+1]; dist[2*i+j] = sqrt(dif_x*dif_x+dif_y*dif_y); } } } else { for(i=0;ina;i++) { ax=a_ps[i*nx1]; ay=a_ps[i*nx1+1]; for(j=0;jnb;j++) { dif_x = ax - b_ps[j*nx2]; dif_y = ay - b_ps[j*nx2+1]; dist[2*i+j] = sqrt(dif_x*dif_x+dif_y*dif_y); } } } } -- $ gcc -O3 -c the_lib.c -fPIC -fopenmp -ffast-math $ gcc -shared -o the_lib.so the_lib.o -lgomp -lm So, I guess I found a way of getting rid of the OpenMP overhead when run with 1 thread, and found that - if measured correctly, using same compiler settings and so on - the speedup is so small that there no point in doing OpenMP - again. (For my case, having (only) 4 cores) Cheers, Sebastian. On Thu, Feb 17, 2011 at 10:57 AM, Matthieu Brucher matthieu.bruc...@gmail.com wrote: Then, where does the overhead come from ? -- The call toomp_set_dynamic(dynamic); Or the #pragma omp parallel for private(j, i,ax,ay, dif_x, dif_y) It may be this. You initialize a thread pool, even if it has only one thread, and there is the dynamic part, so OpenMP may create several chunks instead of one big chunk. Matthieu -- Information System Engineer, Ph.D. Blog: http://matt.eifelle.com LinkedIn: http://www.linkedin.com/in/matthieubrucher ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion -- Information System Engineer, Ph.D. Blog: http://matt.eifelle.com LinkedIn: http://www.linkedin.com/in/matthieubrucher ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?
For 4 cores, on your system, your conclusion makes some sense. That said, I played around with this on both a core 2 duo and the 12 core system. For the 12-core system, on my tests the 0 case ran extremely close to the 2-thread case for all my sizes. The core 2 duo runs windows 7, and after downloading pthreadsGC2.dll from the pthreads project, I was able to use openmp under a year-old (32-bit) pythonxy distribution with mingw. The result, 0 threads come in slightly faster than one thread, .00102 versus .00106, and 2 threads took .00060. My current theory is that gcc under linux uses some background trick to get two thread-like streams going. As I assess scale-up under linux, I will need to consider this behavior. Creating optimal codes with OpenMP certainly requires a considerable commitment. Given the problem-specific fine tuning required, I would not expect much gain in general-purpose routines. In specific routines like cdist, it might make more sense. I talked to a Dell HPC rep today, and he said that squeezing out an extra 15% performance boost on an Intel CPU was a pleasant surprise, so the 30% improvement is maybe not so bad. Cheers, Eric ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?
Chris, OK, sorry -- I miss read (cdist doc says A and B must have same number of columns(!) not rows). On my machine I got the exact same timing as my (non OpenMP) C code. That is really got, compared to normal ufunc based numpy code. But my question in this thread is, how to get better than that, using either SSE or OpenMP (or CUDA?) Thanks again for telling me about scipy.distance, Sebastian On Wed, Feb 16, 2011 at 2:28 AM, Chris Colbert sccolb...@gmail.com wrote: The `cdist` function in scipy spatial does what you want, and takes ~ 1ms on my machine. In [1]: import numpy as np In [2]: from scipy.spatial.distance import cdist In [3]: a = np.random.random((340, 2)) In [4]: b = np.random.random((329, 2)) In [5]: c = cdist(a, b) In [6]: c.shape Out[6]: (340, 329) In [7]: %timeit cdist(a, b) 1000 loops, best of 3: 1.18 ms per loop On Tue, Feb 15, 2011 at 4:42 PM, Sebastian Haase seb.ha...@gmail.com wrote: Hi Eat, I will surely try these routines tomorrow, but I still think that neither scipy function does the complete distance calculation of all possible pairs as done by my C code. For 2 arrays, X and Y, of nX and nY 2d coordinates respectively, I need to get nX times nY distances computed. From the online documentation I understand that pdist calculates something like nX square distances for a single array X of nX coordinates, and cdist would calculate nX distances, where nX is required to equal nY. Thanks for looking into this, Sebastian On Tue, Feb 15, 2011 at 9:11 PM, eat e.antero.ta...@gmail.com wrote: Hi, On Tue, Feb 15, 2011 at 5:50 PM, Sebastian Haase seb.ha...@gmail.com wrote: Hi, I assume that someone here could maybe help me, and I'm hoping it's not too much off topic. I have 2 arrays of 2d point coordinates and would like to calculate all pairwise distances as fast as possible. Going from Python/Numpy to a (Swigged) C extension already gave me a 55x speedup. (.9ms vs. 50ms for arrays of length 329 and 340). With my very modest machine (Intel Dual CPU E2200, 2.2 Ghz) utilizing scipy.spatial.distance.pdist will take some 1.5 ms for such arrays. Even the simple linear algebra based full matrix calculations can be done less than 5 ms. My two cents, eat I'm using gcc on Linux. Now I'm wondering if I could go even faster !? My hope that the compiler might automagically do some SSE2 optimization got disappointed. Since I have a 4 core CPU I thought OpenMP might be an option; I never used that, and after some playing around I managed to get (only) 50% slowdown(!) :-( My code in short is this: (My SWIG typemaps use obj_to_array_no_conversion() from numpy.i) ---Ccode -- void dists2d( double *a_ps, int nx1, int na, double *b_ps, int nx2, int nb, double *dist, int nx3, int ny3) throw (char*) { if(nx1 != 2) throw (char*) a must be of shape (n,2); if(nx2 != 2) throw (char*) b must be of shape (n,2); if(nx3 != nb || ny3 != na) throw (char*) c must be of shape (na,nb); double *dist_; int i, j; #pragma omp parallel private(dist_, j, i) { #pragma omp for nowait for(i=0;ina;i++) { //num_threads=omp_get_num_threads(); -- 4 dist_ = dist+i*nb; // dists_ is only introduced for OpenMP for(j=0;jnb;j++) { *dist_++ = sqrt( sq(a_ps[i*nx1] - b_ps[j*nx2]) + sq(a_ps[i*nx1+1] - b_ps[j*nx2+1]) ); } } } } ---/Ccode -- There is probably a simple mistake in this code - as I said I never used OpenMP before. It should be not too difficult to use OpenMP correctly here or - maybe better - is there a simple SSE(2,3,4) version that might be even better than OpenMP... !? I supposed, that I did not get the #pragma omp lines right - any idea ? Or is it in general not possible to speed this kind of code up using OpenMP !? Thanks, Sebastian Haase ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?
Eric, this is amazing !! Thanks very much, I have rarely seen such a compact source example that just worked. The timings I get are: c_threads 1 time 0.00155731916428 c_threads 2 time 0.000829789638519 c_threads 3 time 0.00061688839 c_threads 4 time 0.000704760551453 c_threads 5 time 0.000933599472046 c_threads 6 time 0.000809240341187 c_threads 7 time 0.000837240219116 c_threads 8 time 0.000817658901215 c_threads 9 time 0.00084393026 c_threads 10 time 0.000861320495605 c_threads 11 time 0.000936930179596 c_threads 12 time 0.000847370624542 The optimum for my Intel(R) Core(TM)2 Quad CPUQ9550 @ 2.83GHz seems to be at 3 threads with .6 ms for the given test case. I just reran my normal non-OpenMP C based code, and it takes .84 ms (1.35x slower). from scipy.spatial import distance distance.cdist(a,b) takes .9 ms -- which includes the allocation of the output array, because there is no `out` option available. So I'm happy that OpenMP works, but apparently on my CPU the speed increase is not overwhelming (yet)... Thanks, -- Sebastian On Wed, Feb 16, 2011 at 4:50 AM, Eric Carlson ecarl...@eng.ua.edu wrote: I don't have the slightest idea what I'm doing, but file name - the_lib.c ___ #include stdio.h #include time.h #include omp.h #include math.h void dists2d( double *a_ps, int na, double *b_ps, int nb, double *dist, int num_threads) { int i, j; int dynamic=0; omp_set_dynamic(dynamic); omp_set_num_threads(num_threads); double ax,ay, dif_x, dif_y; int nx1=2; int nx2=2; #pragma omp parallel for private(j, i,ax,ay, dif_x, dif_y) for(i=0;ina;i++) { ax=a_ps[i*nx1]; ay=a_ps[i*nx1+1]; for(j=0;jnb;j++) { dif_x = ax - b_ps[j*nx2]; dif_y = ay - b_ps[j*nx2+1]; dist[2*i+j] = sqrt(dif_x*dif_x+dif_y*dif_y); } } } COMPILE: __ gcc -c the_lib.c -fPIC -fopenmp -ffast-math gcc -shared -o the_lib.so the_lib.o -lgomp -lm the_python_prog.py _ from ctypes import * my_lib=CDLL('the_lib.so') #or full path to lib import numpy as np import time na=329 nb=340 a=np.random.rand(na,2) b=np.random.rand(nb,2) c=np.zeros(na*nb) trials=100 max_threads = 24 for k in range(1,max_threads): n_threads =c_int(k) na2=c_int(na) nb2=c_int(nb) start = time.time() for k1 in range(trials): ret = my_lib.dists2d(a.ctypes.data_as(c_void_p),na2,b.ctypes.data_as(c_void_p),nb2,c.ctypes.data_as(c_void_p),n_threads) print c_threads,k, time , (time.time()-start)/trials Results on my machine, dual xeon, 12 cores na=329 nb=340 100 trials each: c_threads 1 time 0.00109949827194 c_threads 2 time 0.0005726313591 c_threads 3 time 0.000429179668427 c_threads 4 time 0.000349278450012 c_threads 5 time 0.000287139415741 c_threads 6 time 0.000252468585968 c_threads 7 time 0.000222821235657 c_threads 8 time 0.000206289291382 c_threads 9 time 0.000187981128693 c_threads 10 time 0.000172770023346 c_threads 11 time 0.00016461853 c_threads 12 time 0.000157740116119 Results on my machine, dual xeon, 12 cores na=3290 nb=3400 __ 100 trials each: c_threads 1 time 0.10744508028 c_threads 2 time 0.054223771 c_threads 3 time 0.037127559185 c_threads 4 time 0.0280736112595 c_threads 5 time 0.0228648614883 c_threads 6 time 0.0194904088974 c_threads 7 time 0.0165715909004 c_threads 8 time 0.0145838689804 c_threads 9 time 0.0130002498627 c_threads 10 time 0.0116940999031 c_threads 11 time 0.0107557415962 c_threads 12 time 0.00990005016327 (speedup almost 11) ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?
Update: I just noticed that using Eric's OpenMP code gave me only a 1.35x speedup when comparing 3 threads vs. my non OpenMP code. However, when comparing 3 threads vs. 1 thread, I could call this a 2.55x speedup. This obviously sounds much better, but is obviously not the number that matters... (Ergo Eric's 11x speedup on 12 cores might also be more like 5x or 6x when compared to simple C. w/o any OpenMP overhead). Finally, I have to report that when plugging the C code (from scipy.spatial.distance.cdist) into my real application, which compares for one sample file point pairs from a sequence of 1000 microscopy images, the speed gain appears to be a total of 2x (only !!) -- compared to my original app that used the simple numpy version implemented like this: def pyDist2(a_ps,b_ps, ab_dist): ab_dist = N.zeros(shape=(a_ps.shape[0],b_ps.shape[0]), dtype= N.float64) for i in range(a_ps.shape[0]): d = b_ps- a_ps[i] ab_dist[i] = N.sqrt( N.sum(d**2, -1) ) return ab_dist (The test I was using before was just testing one pair of two arrays of length 329 and 340, whereas the average length for the 1000 images is 260 It could also be that I have substantial constant over from some OpenGL lists that I'm generating in my app.) Nevertheless, I'm happy having learned something about OpenMP. I'm still trying to get more info from valgrind. And, maybe someone could comment, if SSE would help for the function in question. Thanks, Sebastian. On Wed, Feb 16, 2011 at 1:50 PM, Sebastian Haase seb.ha...@gmail.com wrote: Eric, this is amazing !! Thanks very much, I have rarely seen such a compact source example that just worked. The timings I get are: c_threads 1 time 0.00155731916428 c_threads 2 time 0.000829789638519 c_threads 3 time 0.00061688839 c_threads 4 time 0.000704760551453 c_threads 5 time 0.000933599472046 c_threads 6 time 0.000809240341187 c_threads 7 time 0.000837240219116 c_threads 8 time 0.000817658901215 c_threads 9 time 0.00084393026 c_threads 10 time 0.000861320495605 c_threads 11 time 0.000936930179596 c_threads 12 time 0.000847370624542 The optimum for my Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz seems to be at 3 threads with .6 ms for the given test case. I just reran my normal non-OpenMP C based code, and it takes .84 ms (1.35x slower). from scipy.spatial import distance distance.cdist(a,b) takes .9 ms -- which includes the allocation of the output array, because there is no `out` option available. So I'm happy that OpenMP works, but apparently on my CPU the speed increase is not overwhelming (yet)... Thanks, -- Sebastian On Wed, Feb 16, 2011 at 4:50 AM, Eric Carlson ecarl...@eng.ua.edu wrote: I don't have the slightest idea what I'm doing, but file name - the_lib.c ___ #include stdio.h #include time.h #include omp.h #include math.h void dists2d( double *a_ps, int na, double *b_ps, int nb, double *dist, int num_threads) { int i, j; int dynamic=0; omp_set_dynamic(dynamic); omp_set_num_threads(num_threads); double ax,ay, dif_x, dif_y; int nx1=2; int nx2=2; #pragma omp parallel for private(j, i,ax,ay, dif_x, dif_y) for(i=0;ina;i++) { ax=a_ps[i*nx1]; ay=a_ps[i*nx1+1]; for(j=0;jnb;j++) { dif_x = ax - b_ps[j*nx2]; dif_y = ay - b_ps[j*nx2+1]; dist[2*i+j] = sqrt(dif_x*dif_x+dif_y*dif_y); } } } COMPILE: __ gcc -c the_lib.c -fPIC -fopenmp -ffast-math gcc -shared -o the_lib.so the_lib.o -lgomp -lm the_python_prog.py _ from ctypes import * my_lib=CDLL('the_lib.so') #or full path to lib import numpy as np import time na=329 nb=340 a=np.random.rand(na,2) b=np.random.rand(nb,2) c=np.zeros(na*nb) trials=100 max_threads = 24 for k in range(1,max_threads): n_threads =c_int(k) na2=c_int(na) nb2=c_int(nb) start = time.time() for k1 in range(trials): ret = my_lib.dists2d(a.ctypes.data_as(c_void_p),na2,b.ctypes.data_as(c_void_p),nb2,c.ctypes.data_as(c_void_p),n_threads) print c_threads,k, time , (time.time()-start)/trials Results on my machine, dual xeon, 12 cores na=329 nb=340 100 trials each: c_threads 1 time 0.00109949827194 c_threads 2 time 0.0005726313591 c_threads 3 time 0.000429179668427 c_threads 4 time 0.000349278450012 c_threads 5 time 0.000287139415741 c_threads 6 time 0.000252468585968 c_threads 7 time 0.000222821235657 c_threads 8 time 0.000206289291382 c_threads 9 time 0.000187981128693 c_threads 10 time 0.000172770023346 c_threads 11 time 0.00016461853 c_threads 12 time 0.000157740116119 Results on my machine, dual xeon, 12 cores na=3290
Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?
Sebastian, Optimization appears to be important here. I used no optimization in my previous post, so you could try the -O3 compile option: gcc -O3 -c my_lib.c -fPIC -fopenmp -ffast-math for na=329 and nb=340 I get (about 7.5 speedup) c_threads 1 time 0.00103106021881 c_threads 2 time 0.000528309345245 c_threads 3 time 0.000362541675568 c_threads 4 time 0.00028993844986 c_threads 5 time 0.000287840366364 c_threads 6 time 0.000264899730682 c_threads 7 time 0.000244019031525 c_threads 8 time 0.000242137908936 c_threads 9 time 0.000232398509979 c_threads 10 time 0.000227460861206 c_threads 11 time 0.00021938085556 c_threads 12 time 0.000216970443726 c_threads 13 time 0.000215198993683 c_threads 14 time 0.00021940946579 c_threads 15 time 0.000204219818115 c_threads 16 time 0.000216958522797 c_threads 17 time 0.000219728946686 c_threads 18 time 0.00010272522 c_threads 19 time 0.000157492160797 c_threads 20 time 0.000171000957489 c_threads 21 time 0.000147500038147 c_threads 22 time 0.000141770839691 c_threads 23 time 0.000137741565704 for na=3290 and nb=3400 (about 11.5 speedup) c_threads 1 time 0.100258581638 c_threads 2 time 0.0501346611977 c_threads 3 time 0.0335096096992 c_threads 4 time 0.0253720903397 c_threads 5 time 0.0208190107346 c_threads 6 time 0.0173784399033 c_threads 7 time 0.0148811817169 c_threads 8 time 0.0130474209785 c_threads 9 time 0.011598110199 c_threads 10 time 0.0104278612137 c_threads 11 time 0.00950778007507 c_threads 12 time 0.00870131969452 c_threads 13 time 0.015882730484 c_threads 14 time 0.0148504400253 c_threads 15 time 0.0139465212822 c_threads 16 time 0.0130301308632 c_threads 17 time 0.012240819931 c_threads 18 time 0.011567029953 c_threads 19 time 0.0109891605377 c_threads 20 time 0.0104281497002 c_threads 21 time 0.00992572069168 c_threads 22 time 0.00957406997681 c_threads 23 time 0.00936627149582 for na=329 and nb=340, cdist comes in at 0.00111914873123 which is 1.085x slower than the c version for my system. for na=3290 and nb=3400 cdist gives 0.143441538811 Cheers, Eric ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
[Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?
Hi, I assume that someone here could maybe help me, and I'm hoping it's not too much off topic. I have 2 arrays of 2d point coordinates and would like to calculate all pairwise distances as fast as possible. Going from Python/Numpy to a (Swigged) C extension already gave me a 55x speedup. (.9ms vs. 50ms for arrays of length 329 and 340). I'm using gcc on Linux. Now I'm wondering if I could go even faster !? My hope that the compiler might automagically do some SSE2 optimization got disappointed. Since I have a 4 core CPU I thought OpenMP might be an option; I never used that, and after some playing around I managed to get (only) 50% slowdown(!) :-( My code in short is this: (My SWIG typemaps use obj_to_array_no_conversion() from numpy.i) ---Ccode -- void dists2d( double *a_ps, int nx1, int na, double *b_ps, int nx2, int nb, double *dist, int nx3, int ny3) throw (char*) { if(nx1 != 2) throw (char*) a must be of shape (n,2); if(nx2 != 2) throw (char*) b must be of shape (n,2); if(nx3 != nb || ny3 != na)throw (char*) c must be of shape (na,nb); double *dist_; int i, j; #pragma omp parallel private(dist_, j, i) { #pragma omp for nowait for(i=0;ina;i++) { //num_threads=omp_get_num_threads(); -- 4 dist_ = dist+i*nb; // dists_ is only introduced for OpenMP for(j=0;jnb;j++) { *dist_++ = sqrt( sq(a_ps[i*nx1] - b_ps[j*nx2]) + sq(a_ps[i*nx1+1] - b_ps[j*nx2+1]) ); } } } } ---/Ccode -- There is probably a simple mistake in this code - as I said I never used OpenMP before. It should be not too difficult to use OpenMP correctly here or - maybe better - is there a simple SSE(2,3,4) version that might be even better than OpenMP... !? I supposed, that I did not get the #pragma omp lines right - any idea ? Or is it in general not possible to speed this kind of code up using OpenMP !? Thanks, Sebastian Haase ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?
Hi, My first move would be to add a restrict keyword to dist (i.e. dist is the only pointer to the specific memory location), and then declare dist_ inside the first loop also with a restrict. Then, I would run valgrind or a PAPI profil on your code to see what causes the issue (false sharing, ...) Matthieu 2011/2/15 Sebastian Haase seb.ha...@gmail.com Hi, I assume that someone here could maybe help me, and I'm hoping it's not too much off topic. I have 2 arrays of 2d point coordinates and would like to calculate all pairwise distances as fast as possible. Going from Python/Numpy to a (Swigged) C extension already gave me a 55x speedup. (.9ms vs. 50ms for arrays of length 329 and 340). I'm using gcc on Linux. Now I'm wondering if I could go even faster !? My hope that the compiler might automagically do some SSE2 optimization got disappointed. Since I have a 4 core CPU I thought OpenMP might be an option; I never used that, and after some playing around I managed to get (only) 50% slowdown(!) :-( My code in short is this: (My SWIG typemaps use obj_to_array_no_conversion() from numpy.i) ---Ccode -- void dists2d( double *a_ps, int nx1, int na, double *b_ps, int nx2, int nb, double *dist, int nx3, int ny3) throw (char*) { if(nx1 != 2) throw (char*) a must be of shape (n,2); if(nx2 != 2) throw (char*) b must be of shape (n,2); if(nx3 != nb || ny3 != na)throw (char*) c must be of shape (na,nb); double *dist_; int i, j; #pragma omp parallel private(dist_, j, i) { #pragma omp for nowait for(i=0;ina;i++) { //num_threads=omp_get_num_threads(); -- 4 dist_ = dist+i*nb; // dists_ is only introduced for OpenMP for(j=0;jnb;j++) { *dist_++ = sqrt( sq(a_ps[i*nx1] - b_ps[j*nx2]) + sq(a_ps[i*nx1+1] - b_ps[j*nx2+1]) ); } } } } ---/Ccode -- There is probably a simple mistake in this code - as I said I never used OpenMP before. It should be not too difficult to use OpenMP correctly here or - maybe better - is there a simple SSE(2,3,4) version that might be even better than OpenMP... !? I supposed, that I did not get the #pragma omp lines right - any idea ? Or is it in general not possible to speed this kind of code up using OpenMP !? Thanks, Sebastian Haase ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion -- Information System Engineer, Ph.D. Blog: http://matt.eifelle.com LinkedIn: http://www.linkedin.com/in/matthieubrucher ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?
Thanks Matthieu, using __restrict__ with g++ did not change anything. How do I use valgrind with C extensions? I don't know what PAPI profil is ...? -Sebastian On Tue, Feb 15, 2011 at 4:54 PM, Matthieu Brucher matthieu.bruc...@gmail.com wrote: Hi, My first move would be to add a restrict keyword to dist (i.e. dist is the only pointer to the specific memory location), and then declare dist_ inside the first loop also with a restrict. Then, I would run valgrind or a PAPI profil on your code to see what causes the issue (false sharing, ...) Matthieu 2011/2/15 Sebastian Haase seb.ha...@gmail.com Hi, I assume that someone here could maybe help me, and I'm hoping it's not too much off topic. I have 2 arrays of 2d point coordinates and would like to calculate all pairwise distances as fast as possible. Going from Python/Numpy to a (Swigged) C extension already gave me a 55x speedup. (.9ms vs. 50ms for arrays of length 329 and 340). I'm using gcc on Linux. Now I'm wondering if I could go even faster !? My hope that the compiler might automagically do some SSE2 optimization got disappointed. Since I have a 4 core CPU I thought OpenMP might be an option; I never used that, and after some playing around I managed to get (only) 50% slowdown(!) :-( My code in short is this: (My SWIG typemaps use obj_to_array_no_conversion() from numpy.i) ---Ccode -- void dists2d( double *a_ps, int nx1, int na, double *b_ps, int nx2, int nb, double *dist, int nx3, int ny3) throw (char*) { if(nx1 != 2) throw (char*) a must be of shape (n,2); if(nx2 != 2) throw (char*) b must be of shape (n,2); if(nx3 != nb || ny3 != na) throw (char*) c must be of shape (na,nb); double *dist_; int i, j; #pragma omp parallel private(dist_, j, i) { #pragma omp for nowait for(i=0;ina;i++) { //num_threads=omp_get_num_threads(); -- 4 dist_ = dist+i*nb; // dists_ is only introduced for OpenMP for(j=0;jnb;j++) { *dist_++ = sqrt( sq(a_ps[i*nx1] - b_ps[j*nx2]) + sq(a_ps[i*nx1+1] - b_ps[j*nx2+1]) ); } } } } ---/Ccode -- There is probably a simple mistake in this code - as I said I never used OpenMP before. It should be not too difficult to use OpenMP correctly here or - maybe better - is there a simple SSE(2,3,4) version that might be even better than OpenMP... !? I supposed, that I did not get the #pragma omp lines right - any idea ? Or is it in general not possible to speed this kind of code up using OpenMP !? Thanks, Sebastian Haase ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion -- Information System Engineer, Ph.D. Blog: http://matt.eifelle.com LinkedIn: http://www.linkedin.com/in/matthieubrucher ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?
Use directly restrict in C99 mode (__restrict does not have exactly the same semantics). For a valgrind profil, you can check my blog ( http://matt.eifelle.com/2009/04/07/profiling-with-valgrind/) Basically, if you have a python script, you can valgrind --optionsinmyblog python myscript.py For PAPI, you have to install several packages (perf module for kernel for instance) and a GUI to analyze the results (in Eclispe, it should be possible). Matthieu 2011/2/15 Sebastian Haase seb.ha...@gmail.com Thanks Matthieu, using __restrict__ with g++ did not change anything. How do I use valgrind with C extensions? I don't know what PAPI profil is ...? -Sebastian On Tue, Feb 15, 2011 at 4:54 PM, Matthieu Brucher matthieu.bruc...@gmail.com wrote: Hi, My first move would be to add a restrict keyword to dist (i.e. dist is the only pointer to the specific memory location), and then declare dist_ inside the first loop also with a restrict. Then, I would run valgrind or a PAPI profil on your code to see what causes the issue (false sharing, ...) Matthieu 2011/2/15 Sebastian Haase seb.ha...@gmail.com Hi, I assume that someone here could maybe help me, and I'm hoping it's not too much off topic. I have 2 arrays of 2d point coordinates and would like to calculate all pairwise distances as fast as possible. Going from Python/Numpy to a (Swigged) C extension already gave me a 55x speedup. (.9ms vs. 50ms for arrays of length 329 and 340). I'm using gcc on Linux. Now I'm wondering if I could go even faster !? My hope that the compiler might automagically do some SSE2 optimization got disappointed. Since I have a 4 core CPU I thought OpenMP might be an option; I never used that, and after some playing around I managed to get (only) 50% slowdown(!) :-( My code in short is this: (My SWIG typemaps use obj_to_array_no_conversion() from numpy.i) ---Ccode -- void dists2d( double *a_ps, int nx1, int na, double *b_ps, int nx2, int nb, double *dist, int nx3, int ny3) throw (char*) { if(nx1 != 2) throw (char*) a must be of shape (n,2); if(nx2 != 2) throw (char*) b must be of shape (n,2); if(nx3 != nb || ny3 != na)throw (char*) c must be of shape (na,nb); double *dist_; int i, j; #pragma omp parallel private(dist_, j, i) { #pragma omp for nowait for(i=0;ina;i++) { //num_threads=omp_get_num_threads(); -- 4 dist_ = dist+i*nb; // dists_ is only introduced for OpenMP for(j=0;jnb;j++) { *dist_++ = sqrt( sq(a_ps[i*nx1] - b_ps[j*nx2]) + sq(a_ps[i*nx1+1] - b_ps[j*nx2+1]) ); } } } } ---/Ccode -- There is probably a simple mistake in this code - as I said I never used OpenMP before. It should be not too difficult to use OpenMP correctly here or - maybe better - is there a simple SSE(2,3,4) version that might be even better than OpenMP... !? I supposed, that I did not get the #pragma omp lines right - any idea ? Or is it in general not possible to speed this kind of code up using OpenMP !? Thanks, Sebastian Haase ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion -- Information System Engineer, Ph.D. Blog: http://matt.eifelle.com LinkedIn: http://www.linkedin.com/in/matthieubrucher ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion -- Information System Engineer, Ph.D. Blog: http://matt.eifelle.com LinkedIn: http://www.linkedin.com/in/matthieubrucher ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?
On Tue, Feb 15, 2011 at 11:25 AM, Matthieu Brucher matthieu.bruc...@gmail.com wrote: Use directly restrict in C99 mode (__restrict does not have exactly the same semantics). For a valgrind profil, you can check my blog (http://matt.eifelle.com/2009/04/07/profiling-with-valgrind/) Basically, if you have a python script, you can valgrind --optionsinmyblog python myscript.py For PAPI, you have to install several packages (perf module for kernel for instance) and a GUI to analyze the results (in Eclispe, it should be possible). Matthieu 2011/2/15 Sebastian Haase seb.ha...@gmail.com Thanks Matthieu, using __restrict__ with g++ did not change anything. How do I use valgrind with C extensions? I don't know what PAPI profil is ...? -Sebastian On Tue, Feb 15, 2011 at 4:54 PM, Matthieu Brucher matthieu.bruc...@gmail.com wrote: Hi, My first move would be to add a restrict keyword to dist (i.e. dist is the only pointer to the specific memory location), and then declare dist_ inside the first loop also with a restrict. Then, I would run valgrind or a PAPI profil on your code to see what causes the issue (false sharing, ...) Matthieu 2011/2/15 Sebastian Haase seb.ha...@gmail.com Hi, I assume that someone here could maybe help me, and I'm hoping it's not too much off topic. I have 2 arrays of 2d point coordinates and would like to calculate all pairwise distances as fast as possible. Going from Python/Numpy to a (Swigged) C extension already gave me a 55x speedup. (.9ms vs. 50ms for arrays of length 329 and 340). I'm using gcc on Linux. Now I'm wondering if I could go even faster !? My hope that the compiler might automagically do some SSE2 optimization got disappointed. Since I have a 4 core CPU I thought OpenMP might be an option; I never used that, and after some playing around I managed to get (only) 50% slowdown(!) :-( My code in short is this: (My SWIG typemaps use obj_to_array_no_conversion() from numpy.i) ---Ccode -- void dists2d( double *a_ps, int nx1, int na, double *b_ps, int nx2, int nb, double *dist, int nx3, int ny3) throw (char*) { if(nx1 != 2) throw (char*) a must be of shape (n,2); if(nx2 != 2) throw (char*) b must be of shape (n,2); if(nx3 != nb || ny3 != na) throw (char*) c must be of shape (na,nb); double *dist_; int i, j; #pragma omp parallel private(dist_, j, i) { #pragma omp for nowait for(i=0;ina;i++) { //num_threads=omp_get_num_threads(); -- 4 dist_ = dist+i*nb; // dists_ is only introduced for OpenMP for(j=0;jnb;j++) { *dist_++ = sqrt( sq(a_ps[i*nx1] - b_ps[j*nx2]) + sq(a_ps[i*nx1+1] - b_ps[j*nx2+1]) ); } } } } ---/Ccode -- There is probably a simple mistake in this code - as I said I never used OpenMP before. It should be not too difficult to use OpenMP correctly here or - maybe better - is there a simple SSE(2,3,4) version that might be even better than OpenMP... !? I supposed, that I did not get the #pragma omp lines right - any idea ? Or is it in general not possible to speed this kind of code up using OpenMP !? Thanks, Sebastian Haase ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion -- Information System Engineer, Ph.D. Blog: http://matt.eifelle.com LinkedIn: http://www.linkedin.com/in/matthieubrucher ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion -- Information System Engineer, Ph.D. Blog: http://matt.eifelle.com LinkedIn: http://www.linkedin.com/in/matthieubrucher ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion As an aside, do you have a GPU-equipped machine? This function looks like it would be an easy CUDA target. Depends on who the users of the function are (e.g. do they also have the CUDA runtime) if whether you wanted to go that route. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?
Wes, I think I should have a couple of GPUs. I would be ready for anything ... if you think that I could do some easy(!) CUDA programming here, maybe you could guide me into the right direction... Thanks, Sebastian. On Tue, Feb 15, 2011 at 5:26 PM, Wes McKinney wesmck...@gmail.com wrote: On Tue, Feb 15, 2011 at 11:25 AM, Matthieu Brucher matthieu.bruc...@gmail.com wrote: Use directly restrict in C99 mode (__restrict does not have exactly the same semantics). For a valgrind profil, you can check my blog (http://matt.eifelle.com/2009/04/07/profiling-with-valgrind/) Basically, if you have a python script, you can valgrind --optionsinmyblog python myscript.py For PAPI, you have to install several packages (perf module for kernel for instance) and a GUI to analyze the results (in Eclispe, it should be possible). Matthieu 2011/2/15 Sebastian Haase seb.ha...@gmail.com Thanks Matthieu, using __restrict__ with g++ did not change anything. How do I use valgrind with C extensions? I don't know what PAPI profil is ...? -Sebastian On Tue, Feb 15, 2011 at 4:54 PM, Matthieu Brucher matthieu.bruc...@gmail.com wrote: Hi, My first move would be to add a restrict keyword to dist (i.e. dist is the only pointer to the specific memory location), and then declare dist_ inside the first loop also with a restrict. Then, I would run valgrind or a PAPI profil on your code to see what causes the issue (false sharing, ...) Matthieu 2011/2/15 Sebastian Haase seb.ha...@gmail.com Hi, I assume that someone here could maybe help me, and I'm hoping it's not too much off topic. I have 2 arrays of 2d point coordinates and would like to calculate all pairwise distances as fast as possible. Going from Python/Numpy to a (Swigged) C extension already gave me a 55x speedup. (.9ms vs. 50ms for arrays of length 329 and 340). I'm using gcc on Linux. Now I'm wondering if I could go even faster !? My hope that the compiler might automagically do some SSE2 optimization got disappointed. Since I have a 4 core CPU I thought OpenMP might be an option; I never used that, and after some playing around I managed to get (only) 50% slowdown(!) :-( My code in short is this: (My SWIG typemaps use obj_to_array_no_conversion() from numpy.i) ---Ccode -- void dists2d( double *a_ps, int nx1, int na, double *b_ps, int nx2, int nb, double *dist, int nx3, int ny3) throw (char*) { if(nx1 != 2) throw (char*) a must be of shape (n,2); if(nx2 != 2) throw (char*) b must be of shape (n,2); if(nx3 != nb || ny3 != na) throw (char*) c must be of shape (na,nb); double *dist_; int i, j; #pragma omp parallel private(dist_, j, i) { #pragma omp for nowait for(i=0;ina;i++) { //num_threads=omp_get_num_threads(); -- 4 dist_ = dist+i*nb; // dists_ is only introduced for OpenMP for(j=0;jnb;j++) { *dist_++ = sqrt( sq(a_ps[i*nx1] - b_ps[j*nx2]) + sq(a_ps[i*nx1+1] - b_ps[j*nx2+1]) ); } } } } ---/Ccode -- There is probably a simple mistake in this code - as I said I never used OpenMP before. It should be not too difficult to use OpenMP correctly here or - maybe better - is there a simple SSE(2,3,4) version that might be even better than OpenMP... !? I supposed, that I did not get the #pragma omp lines right - any idea ? Or is it in general not possible to speed this kind of code up using OpenMP !? Thanks, Sebastian Haase ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion -- Information System Engineer, Ph.D. Blog: http://matt.eifelle.com LinkedIn: http://www.linkedin.com/in/matthieubrucher ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion -- Information System Engineer, Ph.D. Blog: http://matt.eifelle.com LinkedIn: http://www.linkedin.com/in/matthieubrucher ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion As an aside, do you have a GPU-equipped machine? This function looks like it would be an easy CUDA target. Depends on who the users of the function are (e.g. do they also have the CUDA runtime) if whether you wanted to go that route. ___ NumPy-Discussion mailing list
Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?
On Tue, Feb 15, 2011 at 11:33 AM, Sebastian Haase seb.ha...@gmail.com wrote: Wes, I think I should have a couple of GPUs. I would be ready for anything ... if you think that I could do some easy(!) CUDA programming here, maybe you could guide me into the right direction... Thanks, Sebastian. On Tue, Feb 15, 2011 at 5:26 PM, Wes McKinney wesmck...@gmail.com wrote: On Tue, Feb 15, 2011 at 11:25 AM, Matthieu Brucher matthieu.bruc...@gmail.com wrote: Use directly restrict in C99 mode (__restrict does not have exactly the same semantics). For a valgrind profil, you can check my blog (http://matt.eifelle.com/2009/04/07/profiling-with-valgrind/) Basically, if you have a python script, you can valgrind --optionsinmyblog python myscript.py For PAPI, you have to install several packages (perf module for kernel for instance) and a GUI to analyze the results (in Eclispe, it should be possible). Matthieu 2011/2/15 Sebastian Haase seb.ha...@gmail.com Thanks Matthieu, using __restrict__ with g++ did not change anything. How do I use valgrind with C extensions? I don't know what PAPI profil is ...? -Sebastian On Tue, Feb 15, 2011 at 4:54 PM, Matthieu Brucher matthieu.bruc...@gmail.com wrote: Hi, My first move would be to add a restrict keyword to dist (i.e. dist is the only pointer to the specific memory location), and then declare dist_ inside the first loop also with a restrict. Then, I would run valgrind or a PAPI profil on your code to see what causes the issue (false sharing, ...) Matthieu 2011/2/15 Sebastian Haase seb.ha...@gmail.com Hi, I assume that someone here could maybe help me, and I'm hoping it's not too much off topic. I have 2 arrays of 2d point coordinates and would like to calculate all pairwise distances as fast as possible. Going from Python/Numpy to a (Swigged) C extension already gave me a 55x speedup. (.9ms vs. 50ms for arrays of length 329 and 340). I'm using gcc on Linux. Now I'm wondering if I could go even faster !? My hope that the compiler might automagically do some SSE2 optimization got disappointed. Since I have a 4 core CPU I thought OpenMP might be an option; I never used that, and after some playing around I managed to get (only) 50% slowdown(!) :-( My code in short is this: (My SWIG typemaps use obj_to_array_no_conversion() from numpy.i) ---Ccode -- void dists2d( double *a_ps, int nx1, int na, double *b_ps, int nx2, int nb, double *dist, int nx3, int ny3) throw (char*) { if(nx1 != 2) throw (char*) a must be of shape (n,2); if(nx2 != 2) throw (char*) b must be of shape (n,2); if(nx3 != nb || ny3 != na) throw (char*) c must be of shape (na,nb); double *dist_; int i, j; #pragma omp parallel private(dist_, j, i) { #pragma omp for nowait for(i=0;ina;i++) { //num_threads=omp_get_num_threads(); -- 4 dist_ = dist+i*nb; // dists_ is only introduced for OpenMP for(j=0;jnb;j++) { *dist_++ = sqrt( sq(a_ps[i*nx1] - b_ps[j*nx2]) + sq(a_ps[i*nx1+1] - b_ps[j*nx2+1]) ); } } } } ---/Ccode -- There is probably a simple mistake in this code - as I said I never used OpenMP before. It should be not too difficult to use OpenMP correctly here or - maybe better - is there a simple SSE(2,3,4) version that might be even better than OpenMP... !? I supposed, that I did not get the #pragma omp lines right - any idea ? Or is it in general not possible to speed this kind of code up using OpenMP !? Thanks, Sebastian Haase ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion -- Information System Engineer, Ph.D. Blog: http://matt.eifelle.com LinkedIn: http://www.linkedin.com/in/matthieubrucher ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion -- Information System Engineer, Ph.D. Blog: http://matt.eifelle.com LinkedIn: http://www.linkedin.com/in/matthieubrucher ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion As an aside, do you have a GPU-equipped machine? This function looks like it would be an easy CUDA target. Depends on who the users of the function are (e.g. do they also have the CUDA runtime) if whether you wanted to go that route.
Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?
Hi, On Tue, Feb 15, 2011 at 5:50 PM, Sebastian Haase seb.ha...@gmail.comwrote: Hi, I assume that someone here could maybe help me, and I'm hoping it's not too much off topic. I have 2 arrays of 2d point coordinates and would like to calculate all pairwise distances as fast as possible. Going from Python/Numpy to a (Swigged) C extension already gave me a 55x speedup. (.9ms vs. 50ms for arrays of length 329 and 340). With my very modest machine (Intel Dual CPU E2200, 2.2 Ghz) utilizing scipy.spatial.distance.pdist will take some 1.5 ms for such arrays. Even the simple linear algebra based full matrix calculations can be done less than 5 ms. My two cents, eat I'm using gcc on Linux. Now I'm wondering if I could go even faster !? My hope that the compiler might automagically do some SSE2 optimization got disappointed. Since I have a 4 core CPU I thought OpenMP might be an option; I never used that, and after some playing around I managed to get (only) 50% slowdown(!) :-( My code in short is this: (My SWIG typemaps use obj_to_array_no_conversion() from numpy.i) ---Ccode -- void dists2d( double *a_ps, int nx1, int na, double *b_ps, int nx2, int nb, double *dist, int nx3, int ny3) throw (char*) { if(nx1 != 2) throw (char*) a must be of shape (n,2); if(nx2 != 2) throw (char*) b must be of shape (n,2); if(nx3 != nb || ny3 != na)throw (char*) c must be of shape (na,nb); double *dist_; int i, j; #pragma omp parallel private(dist_, j, i) { #pragma omp for nowait for(i=0;ina;i++) { //num_threads=omp_get_num_threads(); -- 4 dist_ = dist+i*nb; // dists_ is only introduced for OpenMP for(j=0;jnb;j++) { *dist_++ = sqrt( sq(a_ps[i*nx1] - b_ps[j*nx2]) + sq(a_ps[i*nx1+1] - b_ps[j*nx2+1]) ); } } } } ---/Ccode -- There is probably a simple mistake in this code - as I said I never used OpenMP before. It should be not too difficult to use OpenMP correctly here or - maybe better - is there a simple SSE(2,3,4) version that might be even better than OpenMP... !? I supposed, that I did not get the #pragma omp lines right - any idea ? Or is it in general not possible to speed this kind of code up using OpenMP !? Thanks, Sebastian Haase ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?
Hi Eat, I will surely try these routines tomorrow, but I still think that neither scipy function does the complete distance calculation of all possible pairs as done by my C code. For 2 arrays, X and Y, of nX and nY 2d coordinates respectively, I need to get nX times nY distances computed. From the online documentation I understand that pdist calculates something like nX square distances for a single array X of nX coordinates, and cdist would calculate nX distances, where nX is required to equal nY. Thanks for looking into this, Sebastian On Tue, Feb 15, 2011 at 9:11 PM, eat e.antero.ta...@gmail.com wrote: Hi, On Tue, Feb 15, 2011 at 5:50 PM, Sebastian Haase seb.ha...@gmail.com wrote: Hi, I assume that someone here could maybe help me, and I'm hoping it's not too much off topic. I have 2 arrays of 2d point coordinates and would like to calculate all pairwise distances as fast as possible. Going from Python/Numpy to a (Swigged) C extension already gave me a 55x speedup. (.9ms vs. 50ms for arrays of length 329 and 340). With my very modest machine (Intel Dual CPU E2200, 2.2 Ghz) utilizing scipy.spatial.distance.pdist will take some 1.5 ms for such arrays. Even the simple linear algebra based full matrix calculations can be done less than 5 ms. My two cents, eat I'm using gcc on Linux. Now I'm wondering if I could go even faster !? My hope that the compiler might automagically do some SSE2 optimization got disappointed. Since I have a 4 core CPU I thought OpenMP might be an option; I never used that, and after some playing around I managed to get (only) 50% slowdown(!) :-( My code in short is this: (My SWIG typemaps use obj_to_array_no_conversion() from numpy.i) ---Ccode -- void dists2d( double *a_ps, int nx1, int na, double *b_ps, int nx2, int nb, double *dist, int nx3, int ny3) throw (char*) { if(nx1 != 2) throw (char*) a must be of shape (n,2); if(nx2 != 2) throw (char*) b must be of shape (n,2); if(nx3 != nb || ny3 != na) throw (char*) c must be of shape (na,nb); double *dist_; int i, j; #pragma omp parallel private(dist_, j, i) { #pragma omp for nowait for(i=0;ina;i++) { //num_threads=omp_get_num_threads(); -- 4 dist_ = dist+i*nb; // dists_ is only introduced for OpenMP for(j=0;jnb;j++) { *dist_++ = sqrt( sq(a_ps[i*nx1] - b_ps[j*nx2]) + sq(a_ps[i*nx1+1] - b_ps[j*nx2+1]) ); } } } } ---/Ccode -- There is probably a simple mistake in this code - as I said I never used OpenMP before. It should be not too difficult to use OpenMP correctly here or - maybe better - is there a simple SSE(2,3,4) version that might be even better than OpenMP... !? I supposed, that I did not get the #pragma omp lines right - any idea ? Or is it in general not possible to speed this kind of code up using OpenMP !? Thanks, Sebastian Haase ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?
The `cdist` function in scipy spatial does what you want, and takes ~ 1ms on my machine. In [1]: import numpy as np In [2]: from scipy.spatial.distance import cdist In [3]: a = np.random.random((340, 2)) In [4]: b = np.random.random((329, 2)) In [5]: c = cdist(a, b) In [6]: c.shape Out[6]: (340, 329) In [7]: %timeit cdist(a, b) 1000 loops, best of 3: 1.18 ms per loop On Tue, Feb 15, 2011 at 4:42 PM, Sebastian Haase seb.ha...@gmail.comwrote: Hi Eat, I will surely try these routines tomorrow, but I still think that neither scipy function does the complete distance calculation of all possible pairs as done by my C code. For 2 arrays, X and Y, of nX and nY 2d coordinates respectively, I need to get nX times nY distances computed. From the online documentation I understand that pdist calculates something like nX square distances for a single array X of nX coordinates, and cdist would calculate nX distances, where nX is required to equal nY. Thanks for looking into this, Sebastian On Tue, Feb 15, 2011 at 9:11 PM, eat e.antero.ta...@gmail.com wrote: Hi, On Tue, Feb 15, 2011 at 5:50 PM, Sebastian Haase seb.ha...@gmail.com wrote: Hi, I assume that someone here could maybe help me, and I'm hoping it's not too much off topic. I have 2 arrays of 2d point coordinates and would like to calculate all pairwise distances as fast as possible. Going from Python/Numpy to a (Swigged) C extension already gave me a 55x speedup. (.9ms vs. 50ms for arrays of length 329 and 340). With my very modest machine (Intel Dual CPU E2200, 2.2 Ghz) utilizing scipy.spatial.distance.pdist will take some 1.5 ms for such arrays. Even the simple linear algebra based full matrix calculations can be done less than 5 ms. My two cents, eat I'm using gcc on Linux. Now I'm wondering if I could go even faster !? My hope that the compiler might automagically do some SSE2 optimization got disappointed. Since I have a 4 core CPU I thought OpenMP might be an option; I never used that, and after some playing around I managed to get (only) 50% slowdown(!) :-( My code in short is this: (My SWIG typemaps use obj_to_array_no_conversion() from numpy.i) ---Ccode -- void dists2d( double *a_ps, int nx1, int na, double *b_ps, int nx2, int nb, double *dist, int nx3, int ny3) throw (char*) { if(nx1 != 2) throw (char*) a must be of shape (n,2); if(nx2 != 2) throw (char*) b must be of shape (n,2); if(nx3 != nb || ny3 != na)throw (char*) c must be of shape (na,nb); double *dist_; int i, j; #pragma omp parallel private(dist_, j, i) { #pragma omp for nowait for(i=0;ina;i++) { //num_threads=omp_get_num_threads(); -- 4 dist_ = dist+i*nb; // dists_ is only introduced for OpenMP for(j=0;jnb;j++) { *dist_++ = sqrt( sq(a_ps[i*nx1] - b_ps[j*nx2]) + sq(a_ps[i*nx1+1] - b_ps[j*nx2+1]) ); } } } } ---/Ccode -- There is probably a simple mistake in this code - as I said I never used OpenMP before. It should be not too difficult to use OpenMP correctly here or - maybe better - is there a simple SSE(2,3,4) version that might be even better than OpenMP... !? I supposed, that I did not get the #pragma omp lines right - any idea ? Or is it in general not possible to speed this kind of code up using OpenMP !? Thanks, Sebastian Haase ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?
Take a look at a nice project coming out of my department: http://code.google.com/p/cudamat/ Best, Jon. On Tue, Feb 15, 2011 at 11:33 AM, Sebastian Haase seb.ha...@gmail.com wrote: Wes, I think I should have a couple of GPUs. I would be ready for anything ... if you think that I could do some easy(!) CUDA programming here, maybe you could guide me into the right direction... Thanks, Sebastian. On Tue, Feb 15, 2011 at 5:26 PM, Wes McKinney wesmck...@gmail.com wrote: On Tue, Feb 15, 2011 at 11:25 AM, Matthieu Brucher matthieu.bruc...@gmail.com wrote: Use directly restrict in C99 mode (__restrict does not have exactly the same semantics). For a valgrind profil, you can check my blog (http://matt.eifelle.com/2009/04/07/profiling-with-valgrind/) Basically, if you have a python script, you can valgrind --optionsinmyblog python myscript.py For PAPI, you have to install several packages (perf module for kernel for instance) and a GUI to analyze the results (in Eclispe, it should be possible). Matthieu 2011/2/15 Sebastian Haase seb.ha...@gmail.com Thanks Matthieu, using __restrict__ with g++ did not change anything. How do I use valgrind with C extensions? I don't know what PAPI profil is ...? -Sebastian On Tue, Feb 15, 2011 at 4:54 PM, Matthieu Brucher matthieu.bruc...@gmail.com wrote: Hi, My first move would be to add a restrict keyword to dist (i.e. dist is the only pointer to the specific memory location), and then declare dist_ inside the first loop also with a restrict. Then, I would run valgrind or a PAPI profil on your code to see what causes the issue (false sharing, ...) Matthieu 2011/2/15 Sebastian Haase seb.ha...@gmail.com Hi, I assume that someone here could maybe help me, and I'm hoping it's not too much off topic. I have 2 arrays of 2d point coordinates and would like to calculate all pairwise distances as fast as possible. Going from Python/Numpy to a (Swigged) C extension already gave me a 55x speedup. (.9ms vs. 50ms for arrays of length 329 and 340). I'm using gcc on Linux. Now I'm wondering if I could go even faster !? My hope that the compiler might automagically do some SSE2 optimization got disappointed. Since I have a 4 core CPU I thought OpenMP might be an option; I never used that, and after some playing around I managed to get (only) 50% slowdown(!) :-( My code in short is this: (My SWIG typemaps use obj_to_array_no_conversion() from numpy.i) ---Ccode -- void dists2d( double *a_ps, int nx1, int na, double *b_ps, int nx2, int nb, double *dist, int nx3, int ny3) throw (char*) { if(nx1 != 2) throw (char*) a must be of shape (n,2); if(nx2 != 2) throw (char*) b must be of shape (n,2); if(nx3 != nb || ny3 != na) throw (char*) c must be of shape (na,nb); double *dist_; int i, j; #pragma omp parallel private(dist_, j, i) { #pragma omp for nowait for(i=0;ina;i++) { //num_threads=omp_get_num_threads(); -- 4 dist_ = dist+i*nb; // dists_ is only introduced for OpenMP for(j=0;jnb;j++) { *dist_++ = sqrt( sq(a_ps[i*nx1] - b_ps[j*nx2]) + sq(a_ps[i*nx1+1] - b_ps[j*nx2+1]) ); } } } } ---/Ccode -- There is probably a simple mistake in this code - as I said I never used OpenMP before. It should be not too difficult to use OpenMP correctly here or - maybe better - is there a simple SSE(2,3,4) version that might be even better than OpenMP... !? I supposed, that I did not get the #pragma omp lines right - any idea ? Or is it in general not possible to speed this kind of code up using OpenMP !? Thanks, Sebastian Haase ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion -- Information System Engineer, Ph.D. Blog: http://matt.eifelle.com LinkedIn: http://www.linkedin.com/in/matthieubrucher ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion -- Information System Engineer, Ph.D. Blog: http://matt.eifelle.com LinkedIn: http://www.linkedin.com/in/matthieubrucher ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion As an aside, do you have a GPU-equipped machine? This function looks like it would be an easy CUDA target. Depends on who the users of the
Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?
I don't have the slightest idea what I'm doing, but file name - the_lib.c ___ #include stdio.h #include time.h #include omp.h #include math.h void dists2d( double *a_ps, int na, double *b_ps, int nb, double *dist, int num_threads) { int i, j; int dynamic=0; omp_set_dynamic(dynamic); omp_set_num_threads(num_threads); double ax,ay, dif_x, dif_y; int nx1=2; int nx2=2; #pragma omp parallel for private(j, i,ax,ay, dif_x, dif_y) for(i=0;ina;i++) { ax=a_ps[i*nx1]; ay=a_ps[i*nx1+1]; for(j=0;jnb;j++) { dif_x = ax - b_ps[j*nx2]; dif_y = ay - b_ps[j*nx2+1]; dist[2*i+j] = sqrt(dif_x*dif_x+dif_y*dif_y); } } } COMPILE: __ gcc -c the_lib.c -fPIC -fopenmp -ffast-math gcc -shared -o the_lib.so the_lib.o -lgomp -lm the_python_prog.py _ from ctypes import * my_lib=CDLL('the_lib.so') #or full path to lib import numpy as np import time na=329 nb=340 a=np.random.rand(na,2) b=np.random.rand(nb,2) c=np.zeros(na*nb) trials=100 max_threads = 24 for k in range(1,max_threads): n_threads =c_int(k) na2=c_int(na) nb2=c_int(nb) start = time.time() for k1 in range(trials): ret = my_lib.dists2d(a.ctypes.data_as(c_void_p),na2,b.ctypes.data_as(c_void_p),nb2,c.ctypes.data_as(c_void_p),n_threads) print c_threads,k, time , (time.time()-start)/trials Results on my machine, dual xeon, 12 cores na=329 nb=340 100 trials each: c_threads 1 time 0.00109949827194 c_threads 2 time 0.0005726313591 c_threads 3 time 0.000429179668427 c_threads 4 time 0.000349278450012 c_threads 5 time 0.000287139415741 c_threads 6 time 0.000252468585968 c_threads 7 time 0.000222821235657 c_threads 8 time 0.000206289291382 c_threads 9 time 0.000187981128693 c_threads 10 time 0.000172770023346 c_threads 11 time 0.00016461853 c_threads 12 time 0.000157740116119 Results on my machine, dual xeon, 12 cores na=3290 nb=3400 __ 100 trials each: c_threads 1 time 0.10744508028 c_threads 2 time 0.054223771 c_threads 3 time 0.037127559185 c_threads 4 time 0.0280736112595 c_threads 5 time 0.0228648614883 c_threads 6 time 0.0194904088974 c_threads 7 time 0.0165715909004 c_threads 8 time 0.0145838689804 c_threads 9 time 0.0130002498627 c_threads 10 time 0.0116940999031 c_threads 11 time 0.0107557415962 c_threads 12 time 0.00990005016327 (speedup almost 11) ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion