On Tue, Feb 24, 2009 at 1:45 AM, Huiqun Zhou <hqzhou at nju.edu.cn> wrote:
> Dear list users:
hi all,

> I happened to test duration times of calculating the system I'm
> investigating against number of pools used. There are totally
> 36 k points. But the results surprised me quite a lot.
>
> no pool: 6m21.02s CPU time, 6m45.88s wall time
> 2 pools: 7m19.39s CPU time, 7m38.99s wall time
> 4 pools: 11m59.09s CPU time, 12m14.66s wall time
> 8 pools: 21m28.77s CPU time, 21m38.71s wall time
>
> The machine I'm using is an AMD box with 2 quad core shanghai.
>
> Is my understanding of usage of pool wrong?

sorry for replying to an old mail in this thread, but it has the proper
times to compare to. the input you sent me does not seem to be exactly
the same as the one you used for the benchmarks (rather a bit larger),
but i reduced the number of k-points to yield 36 and have some numbers
here. this is on dual intel quad core E5430 @ 2.66GHz cpus with 8GB
DDR2 ram. i also modified the input to set wfcdir to the local scratch
rather than my working directory (which sits on an NFS server) and ran
each case with disk_io='high' and 'low' (a short sketch of how these
settings go together follows further below).

on a single node (always with 8 MPI tasks) i get:

1node-1pools-high.out: PWSCF : 18m55.62s CPU time, 26m 7.20s wall time
1node-2pools-high.out: PWSCF : 14m46.03s CPU time, 18m 0.26s wall time
1node-4pools-high.out: PWSCF : 14m 5.27s CPU time, 16m44.03s wall time
1node-8pools-high.out: PWSCF : 32m29.71s CPU time, 35m 0.35s wall time
1node-1pools-low.out: PWSCF : 18m36.88s CPU time, 19m24.71s wall time
1node-2pools-low.out: PWSCF : 15m 0.98s CPU time, 15m42.56s wall time
1node-4pools-low.out: PWSCF : 14m 6.97s CPU time, 14m55.57s wall time
1node-8pools-low.out: PWSCF : 31m51.68s CPU time, 32m46.77s wall time

so the result is not quite as drastic as yours, but with 8 pools on one
node the machine is clearly suffering. one can also see that
disk_io='low' helps to reduce waiting time (disk_io='high' still writes
files into the working directory, which is on slow NFS). so for my
machine it looks as if 4 pools is the optimal compromise.

to further investigate whether pools or g-space parallelization is more
efficient, i then ran the same job across multiple nodes, using only 4
cores per node, i.e. the total number of mpi tasks is still 8.

2node-1pools-high.out: PWSCF : 12m 0.88s CPU time, 17m42.01s wall time
2node-2pools-high.out: PWSCF : 8m42.96s CPU time, 11m44.88s wall time
2node-4pools-high.out: PWSCF : 6m26.72s CPU time, 8m54.83s wall time
2node-8pools-high.out: PWSCF : 12m47.61s CPU time, 15m18.67s wall time
2node-1pools-low.out: PWSCF : 10m53.87s CPU time, 11m35.94s wall time
2node-2pools-low.out: PWSCF : 8m37.37s CPU time, 9m23.17s wall time
2node-4pools-low.out: PWSCF : 6m22.87s CPU time, 7m11.22s wall time
2node-8pools-low.out: PWSCF : 13m 7.30s CPU time, 13m57.71s wall time

in the next test, i doubled the number of nodes again, but this time
kept 4 mpi tasks per node (16 mpi tasks in total); from here on i am
only using disk_io='low'.

4node-4pools-low.out: PWSCF : 4m52.92s CPU time, 5m38.90s wall time
4node-8pools-low.out: PWSCF : 4m29.73s CPU time, 5m17.86s wall time

interesting, now the striking difference between 4 pools and 8 pools is
gone. since i doubled the number of nodes, the memory consumption per
mpi task in the 8 pools case should have dropped to a similar level as
in the 4 pools case on 2 nodes. to confirm this, let's run the same job
with 16 pools:

4node-16pools-low.out: PWSCF : 10m54.57s CPU time, 11m53.59s wall time

bingo! the only explanation for this is cache memory.
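as an aside, in case somebody wants to repeat this kind of scan on their
own machine, here is a rough sketch of how one of the runs above was
launched. the prefix and directories are placeholders and the actual
36-k-point input is not reproduced here; the parts i was actually varying
are the value after -npool and the wfcdir and disk_io settings.

  # one of the single-node runs: 8 mpi tasks, 4 pools, disk_io='low'
  # (hostfile / batch system details omitted)
  mpirun -np 8 pw.x -npool 4 < job.scf.in > 1node-4pools-low.out

  ! relevant piece of the &CONTROL namelist in job.scf.in
  &CONTROL
    calculation = 'scf'
    prefix      = 'job'              ! placeholder
    outdir      = '/home/axel/work'  ! working directory (NFS in my case)
    wfcdir      = '/scratch/axel'    ! node-local scratch for wavefunctions
    disk_io     = 'low'              ! 'high' for the *-high.out runs
  /

for the multi-node runs the only things that change are the total number
of mpi tasks, how they are spread over the nodes (hostfile or batch
system), and the value after -npool.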
coming back to the cache argument: in this specific case, up to about
"half a wavefunction" of memory consumption per node, the caching of
the cpu is much more effective. so the "more pools is better" rule has
to be augmented by "unless it makes the cpu cache less efficient".

since 36 k-points is evenly divisible by 6 but not by 8, now a test
with 6 nodes (24 mpi tasks):

6node-4pools-low.out: PWSCF : 3m41.65s CPU time, 4m25.15s wall time
6node-6pools-low.out: PWSCF : 3m40.12s CPU time, 4m23.33s wall time
6node-8pools-low.out: PWSCF : 3m14.13s CPU time, 3m57.76s wall time
6node-12pools-low.out: PWSCF : 3m37.96s CPU time, 4m25.91s wall time
6node-24pools-low.out: PWSCF : 10m55.18s CPU time, 11m47.87s wall time

so 6 pools is more efficient than 4, but 8 pools beats both 6 and 12,
even though those divide the 36 k-points evenly and should thus lead to
a better distribution of the work. so the modified "rule" from above
seems to hold.

ok, can we get any faster? ~4min walltime for a single point run with
21 scf cycles is already pretty good, and the serial overhead (and
wf_collect=.true.) should start to kick in. so now with 8 nodes and 32
mpi tasks:

8node-4pools-low.out: PWSCF : 3m22.02s CPU time, 4m 7.06s wall time
8node-8pools-low.out: PWSCF : 3m14.52s CPU time, 3m58.86s wall time
8node-16pools-low.out: PWSCF : 3m36.18s CPU time, 4m24.21s wall time

hmmm, not much better. now for the final test: since we have 36
k-points and we need at least two mpi tasks per pool to get good
performance, let's try 18 nodes with 4 mpi tasks each:

18node-9pools-low.out: PWSCF : 1m57.06s CPU time, 3m37.31s wall time
18node-18pools-low.out: PWSCF : 2m 2.62s CPU time, 2m45.51s wall time
18node-36pools-low.out: PWSCF : 2m45.61s CPU time, 3m33.00s wall time

not spectacular scaling, but still improving. it looks like writing the
final wavefunction costs about 45 seconds or more, as indicated by the
difference between cpu time and walltime. at this level you had better
not use disk_io='high', as that puts a _severe_ disk load on the
machine that carries the working directory (particularly bad for NFS
servers): in this case the code generates and continuously rewrites 144
files, and the walltime to cputime ratio quickly rises (a factor of 5
in my case, so i stopped the job before the NFS server died).

in summary, it is obviously getting more complicated to define a "rule"
for what gives the best performance. some experimentation is always
required, and sometimes there will be surprises. i have not touched the
issue of network speed (all tests were done across a 4xDDR infiniband
network).

i hope this little benchmark excursion was as interesting and thought
provoking for you as it was for me. thanks to everybody who gave their
input to this discussion.

cheers,
   axel.

p.s.: perhaps at some point it might be interesting to organize a
workshop on "post-compilation optimization" of pw.x for different types
of jobs and hardware.

> Huiqun Zhou
> @Nanjing University, China
> _______________________________________________
> Pw_forum mailing list
> Pw_forum at pwscf.org
> http://www.democritos.it/mailman/listinfo/pw_forum
>

-- 
=======================================================================
Axel Kohlmeyer   akohlmey at cmm.chem.upenn.edu   http://www.cmm.upenn.edu
Center for Molecular Modeling -- University of Pennsylvania
Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
tel: 1-215-898-1582, fax: 1-215-573-6233, office-tel: 1-215-898-5425
=======================================================================
If you make something idiot-proof, the universe creates a better idiot.
