On Fri, Feb 27, 2009 at 6:34 PM, Vo, Trinh <trinh.vo at jpl.nasa.gov> wrote:
> Dear Axel,
>
> Thanks for the clarification.
>
> About the benchmarks, I simply want to see how well the cluster we
> bought performs in terms of scaling with QE. I sent some plots to
> you, but the email did not go through because of the size restriction
> (larger than 40K).
>
> Currently, I am not happy that the difference between CPU time and
> wall time is so large. When I ran a longer job, which took ~2h of CPU
> time, the wall time was ~7h when I ran it from the head node, and ~4h when I
that probably means you ran a job that was too big for the machine's
memory and was thus swapping all the time.

for your reference, here are some numbers from one of our local
clusters. each node has 2x Intel Xeon E5430 @ 2.66GHz, 8GB of RAM,
and a 2xDDR infiniband interconnect.

this first block is runs across four nodes with different -npernode
settings:

h2o-32-4x2.out: CP : 18.33s CPU time, 18.79s wall time
h2o-32-4x4.out: CP : 16.75s CPU time, 17.50s wall time
h2o-32-4x8.out: CP : 25.31s CPU time, 25.94s wall time
h2o-64-4x1.out: CP : 2m50.18s CPU time, 3m18.88s wall time
h2o-64-4x2.out: CP : 1m29.72s CPU time, 1m33.60s wall time
h2o-64-4x4.out: CP : 1m12.42s CPU time, 1m13.70s wall time
h2o-64-4x8.out: CP : 1m19.53s CPU time, 1m20.86s wall time

as you can see, the same as with cp2k, using 8 cores per node hurts
performance, especially for smaller jobs, and using 4 cores per node
is a much better choice.

and here are the corresponding single-node times (run on the
frontend):

h2o-32-np1.out: CP : 2m24.38s CPU time, 2m39.38s wall time
h2o-32-np2.out: CP : 1m24.22s CPU time, 1m42.09s wall time
h2o-32-np4.out: CP : 48.92s CPU time, 51.58s wall time
h2o-32-np8.out: CP : 41.89s CPU time, 42.72s wall time
h2o-64-np2.out: CP : 6m39.17s CPU time, 7m49.54s wall time
h2o-64-np4.out: CP : 4m19.69s CPU time, 5m14.73s wall time
h2o-64-np8.out: CP : 4m12.16s CPU time, 4m24.57s wall time

here the saturation of the memory bandwidth becomes apparent (little
gain going from 4 mpi tasks to 8 mpi tasks).

you have to keep in mind that on the intel quad cores the difference
between using 4 and 8 cores is especially drastic: those cpus share an
L2 cache between pairs of cores, so with 4 mpi tasks per node each
task effectively has twice the L2 cache it would have with 8. it would
be interesting to see somebody do a similar test with AMD quad cores,
since those are true quad cores.

you should also note that those timings contain some non-parallel
overhead that happens when starting a job. for testing production
speed you should run a 20-step and a 10-step job and subtract the time
for the 10-step job from the 20-step job; the difference is the cost
of 10 production steps, with the startup overhead canceled out.

HTH,
   axel.

> _______________________________________________
> Pw_forum mailing list
> Pw_forum at pwscf.org
> http://www.democritos.it/mailman/listinfo/pw_forum
>
>

-- 
=======================================================================
Axel Kohlmeyer   akohlmey at cmm.chem.upenn.edu   http://www.cmm.upenn.edu
   Center for Molecular Modeling   --   University of Pennsylvania
Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
tel: 1-215-898-1582,  fax: 1-215-573-6233,  office-tel: 1-215-898-5425
=======================================================================
If you make something idiot-proof, the universe creates a better idiot.
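
a minimal sketch of how runs like h2o-64-4x4 above could be launched
with OpenMPI's mpirun (the input file name is an assumption, and other
MPI implementations spell the tasks-per-node option differently):

  # 4 nodes x 4 mpi tasks each = 16 tasks, matching the "4x4" runs
  mpirun -np 16 -npernode 4 ./cp.x < h2o-64.in > h2o-64-4x4.out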
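
and a sketch of the 20-step/10-step subtraction, assuming two cp.x
output files (the names here are hypothetical) that each end in a
timing line like the ones quoted above:

  #!/bin/sh
  # estimate per-step cost by subtracting a 10-step run from a
  # 20-step run, so the serial startup overhead cancels out.

  wall_seconds () {
    # grab the last CP timing line and convert the wall time
    # ("XmY.YYs" or "Y.YYs") into plain seconds
    grep ' CP ' "$1" | tail -1 | \
      sed 's/.*, *//; s/ *wall time.*//; s/m/ /; s/s//' | \
      awk 'NF == 2 { print $1 * 60 + $2 } NF == 1 { print $1 }'
  }

  t20=$(wall_seconds h2o-64-20steps.out)   # hypothetical file name
  t10=$(wall_seconds h2o-64-10steps.out)   # hypothetical file name
  # the difference is the cost of 10 production steps
  echo "10 production steps: $(echo "$t20 - $t10" | bc) s"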
