On Wed, 21 Sep 2005, Konstantin Kudin wrote:

KK> Hi,

kostya,

KK> I've done some parallel benchmarks for the CP code so I thought I'd
KK> share them with the rest of the group. The system we have is a cluster
KK> of dual Opterons 2.0 GHz with 1Gbit ethernet.

please keep in mind that for reasonable scaling with car-parrinello MD
you usually need a better interconnect than gigabit ethernet. some time
ago i summarized some test results for the CPMD code at:

http://www.theochem.ruhr-uni-bochum.de/~axel.kohlmeyer/cpmd-bench.html#parallel

the general issues apply to the espresso CP codes as well as to CPMD.

with gigabit (or TCP/IP for that matter) you suffer most from the very
high latencies. this is especially bad for the all-to-all communications
that are needed for the FFTs, and it is more visible with dual-cpu nodes.
with quad-cpu nodes you should also hit the bandwidth limit.

how visible this becomes depends a lot on the size of the job. we have
a cluster of dual-opteron 246 (2.0GHz) with two gigabit networks (data
and MPI separately) and it usually does not pay to run jobs across more
than 3-4 nodes. and even then you already 'waste' about 20% of your
cpu power. the only saving grace is the fact that a better interconnect
will cost much more than one wasted node.

KK> I looked at 2 different measures of time, CPU time, and wall time
KK> computed as the difference between "This run was started" and "This run
KK> was terminated". By the way, such wall time could probably be printed
KK> by the code directly to be readily available.

probably, but you can get the number just as easily by using the
'time' command to start the jobs.

KK> The system is a reasonably sized simulation cell with 20 CP
KK> (electronic+ionic) steps total.
KK>
KK> The compiler is IFC 9.0, GOTO library is for BLAS, and mpich 1.2.6
KK> used for the MPI. The CP version is the CVS from Aug. 20, 2005.
KK>
KK> What is crazy is that even for 2 cpus sitting in the same box there is
KK> lots of cpu time just lost somewhere.
KK> The strange thing is that the quad we have at 2.2 GHz seems to lose
KK> just as much wall time as 2 duals talking across the network. And note
KK> how 4 cpus are barely better than 2x compared to single cpu
KK> performance if the wall clock time is considered.

please check whether your MPI library uses shared memory communication
properly, and that your kernel supports setting the proper CPU and
memory affinity (and that you actually set it). i have seen some numbers
where this makes over 20% difference on a dual machine and i would
expect it to matter even more on quad machines.

KK> I know Nicola Marzari has done some parallel benchmarks, but I do not
KK> think that wall times were being paid attention to ...
KK>
KK> Kostya
KK>
KK> P.S. Any suggestions what might be going on here?

you also have to take into account that when you are running a
gamma point only calculation, you are missing the most efficient
parallelization (across k-points), which helps running, e.g., pw.x
rather efficiently on 'not so high'-performance networks.

best regards,
    axel.

KK> Ncpu      CPU time     Wall time
KK> 1         1h22m        1h24m
KK> 2         45m33.41s    57m13s
KK> 4         27m30.80s    44m21s
KK> 6         18m22.71s    43m18s
KK> 8         14m53.91s    45m56s
KK>
KK> 4(quad)   37m18.56s    45m32s

--
=======================================================================
Dr. Axel Kohlmeyer   e-mail: axel.kohlmeyer at theochem.ruhr-uni-bochum.de
Lehrstuhl fuer Theoretische Chemie          Phone: ++49 (0)234/32-26673
Ruhr-Universitaet Bochum - NC 03/53         Fax:   ++49 (0)234/32-14045
D-44780 Bochum  http://www.theochem.ruhr-uni-bochum.de/~axel.kohlmeyer/
=======================================================================
If you make something idiot-proof, the universe creates a better idiot.
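
addendum: to put numbers on the 'wasted' cpu power discussed above, here
is a minimal sketch that turns the quoted timing table into wall-clock
speedup and parallel efficiency. the helper names (to_seconds, scaling)
are made up for this illustration and are not part of any espresso or
CPMD tool. the last column assumes that the reported CPU time is
per-process, which the original post does not state.

```python
import re

def to_seconds(stamp: str) -> float:
    """Convert a time string such as '1h24m' or '45m33.41s' to seconds."""
    units = {"h": 3600.0, "m": 60.0, "s": 1.0}
    return sum(float(value) * units[unit]
               for value, unit in re.findall(r"([\d.]+)([hms])", stamp))

# (ncpu, CPU time, wall time) copied from the table quoted above.
# The 4(quad) run is left out because it ran on different hardware.
RUNS = [
    (1, "1h22m", "1h24m"),
    (2, "45m33.41s", "57m13s"),
    (4, "27m30.80s", "44m21s"),
    (6, "18m22.71s", "43m18s"),
    (8, "14m53.91s", "45m56s"),
]

def scaling(runs):
    """Return (ncpu, speedup, efficiency, cpu/wall ratio) for each run.

    speedup    = wall time on 1 cpu / wall time on N cpus
    efficiency = speedup / N
    cpu/wall   = reported CPU time / wall time
                 (ASSUMPTION: the CPU time is per-process)
    """
    base_wall = to_seconds(runs[0][2])
    rows = []
    for ncpu, cpu, wall in runs:
        wall_s = to_seconds(wall)
        speedup = base_wall / wall_s
        rows.append((ncpu, speedup, speedup / ncpu, to_seconds(cpu) / wall_s))
    return rows

if __name__ == "__main__":
    print(f"{'Ncpu':>4} {'speedup':>8} {'effic.':>7} {'cpu/wall':>9}")
    for ncpu, speedup, eff, ratio in scaling(RUNS):
        print(f"{ncpu:>4} {speedup:>8.2f} {eff:>7.2f} {ratio:>9.2f}")
```

measured by wall time, the 4 cpu run comes out at about 1.9x the single
cpu run, and going to 6 or 8 cpus does not improve on that.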
