Axel and list users, I'm terribly sorry for my delayed response. I want to thank Axel personally for his thorough investigation, to-the-point analysis, and detailed report; every bit of this experience will benefit our future computational work.
As I said, the test machine I used is an AMD box with 2-way quad-core Shanghai (Opteron 23xx, 2.3 GHz), which has only 1 MB of L2 cache, so according to your findings the test case may render the CPU cache inefficient more quickly than on Intel processors (usually 4-6 MB of L2 cache). By the way, the input file I sent you is exactly the same as the one I used here. I have run several tests on both AMD's Shanghai and Intel's Nehalem, and both are much better than their previous processors. It seems Axel needs to connect more powerful machines to his high-end InfiniBand network (^o^). Thanks again, Axel!

Dr. Huiqun Zhou
@Earth Sciences, Nanjing University, China

----- Original Message -----
From: "Axel Kohlmeyer" <[email protected]>
To: "PWSCF Forum" <pw_forum at pwscf.org>
Sent: Wednesday, March 04, 2009 10:51 AM
Subject: Re: [Pw_forum] Use of pool

> On Tue, Feb 24, 2009 at 1:45 AM, Huiqun Zhou <hqzhou at nju.edu.cn> wrote:
>> Dear list users:
>
> hi all,
>
>> I happened to test the duration of calculating the system I'm
>> investigating against the number of pools used. There are in total
>> 36 k-points. But the results surprised me quite a lot.
>>
>> no pool:  6m21.02s CPU time,  6m45.88s wall time
>> 2 pools:  7m19.39s CPU time,  7m38.99s wall time
>> 4 pools: 11m59.09s CPU time, 12m14.66s wall time
>> 8 pools: 21m28.77s CPU time, 21m38.71s wall time
>>
>> The machine I'm using is an AMD box with 2 quad-core Shanghai CPUs.
>>
>> Is my understanding of the usage of pools wrong?
>
> sorry for replying to an old mail in this thread, but it has the
> proper times to compare to. the input you sent me does not seem to be
> exactly the same as the one you used for the benchmarks (rather a bit
> larger), but i reduced the number of k-points to yield 36 and have
> some numbers here. this is on dual intel quad-core E5430 @ 2.66GHz
> cpus with 8GB DDR2 ram. i also modified the input to set wfcdir to use
> the local scratch rather than my working directory (which is on an NFS
> server) and tested with disk_io='high' and 'low'.
> on a single node (always with 8 MPI tasks) i get:
>
> 1node-1pools-high.out:  PWSCF : 18m55.62s CPU time, 26m 7.20s wall time
> 1node-2pools-high.out:  PWSCF : 14m46.03s CPU time, 18m 0.26s wall time
> 1node-4pools-high.out:  PWSCF : 14m 5.27s CPU time, 16m44.03s wall time
> 1node-8pools-high.out:  PWSCF : 32m29.71s CPU time, 35m 0.35s wall time
>
> 1node-1pools-low.out:   PWSCF : 18m36.88s CPU time, 19m24.71s wall time
> 1node-2pools-low.out:   PWSCF : 15m 0.98s CPU time, 15m42.56s wall time
> 1node-4pools-low.out:   PWSCF : 14m 6.97s CPU time, 14m55.57s wall time
> 1node-8pools-low.out:   PWSCF : 31m51.68s CPU time, 32m46.77s wall time
>
> so the result is not quite as drastic, but with 8 pools on the node
> the machine is suffering. one can also see that disk_io='low' helps to
> reduce waiting time (disk_io='high' still writes files into the
> working directory, which is on slow NFS). so for my machine it looks
> as if 4 pools is the optimal compromise.
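(An aside from me on the bookkeeping: all of the timings quoted in this thread are taken from the final "PWSCF : ... CPU time, ... wall time" line of each output file. A few lines of Python along the lines below are enough to tabulate them for a comparison like the one above; this is only a sketch of mine, not part of Axel's setup, and it assumes the file naming scheme shown here and runs shorter than one hour.)

    import glob
    import re

    # pw.x reports its totals like "18m55.62s CPU time, 26m 7.20s wall time";
    # this simple pattern assumes runs shorter than one hour (no hours field).
    pattern = re.compile(r"PWSCF\s*:\s*(\d+)m\s*([\d.]+)s CPU time,"
                         r"\s*(\d+)m\s*([\d.]+)s wall time")

    def seconds(minutes, secs):
        """Convert a (minutes, seconds) pair of strings to seconds."""
        return 60 * int(minutes) + float(secs)

    # file names such as 1node-4pools-low.out follow the scheme used above
    for path in sorted(glob.glob("*node-*pools-*.out")):
        with open(path) as out:
            match = pattern.search(out.read())
        if match:
            cpu = seconds(match.group(1), match.group(2))
            wall = seconds(match.group(3), match.group(4))
            print("%-24s cpu %7.1f s   wall %7.1f s" % (path, cpu, wall))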
> to further investigate whether pool or g-space parallelization is
> more efficient, i then started to run the same job across multiple
> nodes. this uses only 4 cores per node, i.e. the total number of mpi
> tasks is still 8.
>
> 2node-1pools-high.out:  PWSCF : 12m 0.88s CPU time, 17m42.01s wall time
> 2node-2pools-high.out:  PWSCF :  8m42.96s CPU time, 11m44.88s wall time
> 2node-4pools-high.out:  PWSCF :  6m26.72s CPU time,  8m54.83s wall time
> 2node-8pools-high.out:  PWSCF : 12m47.61s CPU time, 15m18.67s wall time
>
> 2node-1pools-low.out:   PWSCF : 10m53.87s CPU time, 11m35.94s wall time
> 2node-2pools-low.out:   PWSCF :  8m37.37s CPU time,  9m23.17s wall time
> 2node-4pools-low.out:   PWSCF :  6m22.87s CPU time,  7m11.22s wall time
> 2node-8pools-low.out:   PWSCF : 13m 7.30s CPU time, 13m57.71s wall time
>
> in the next test i doubled the number of nodes again, but this time
> kept 4 mpi tasks per node; i'm also using only disk_io='low'.
>
> 4node-4pools-low.out:   PWSCF :  4m52.92s CPU time,  5m38.90s wall time
> 4node-8pools-low.out:   PWSCF :  4m29.73s CPU time,  5m17.86s wall time
>
> interesting, now the striking difference between 4 pools and 8 pools
> is gone. since i doubled the number of nodes, the memory consumption
> per mpi task in the 8 pools case should have dropped to a similar
> level as in the 4 pools case with 2 nodes. to confirm this, let's run
> the same job with 16 pools:
>
> 4node-16pools-low.out:  PWSCF : 10m54.57s CPU time, 11m53.59s wall time
>
> bingo! the only explanation for this is cache memory. so in this
> specific case, up to about "half a wavefunction" of memory consumption
> per node, the caching of the cpu is much more effective. so the "more
> pools is better" rule has to be augmented by "unless it makes the cpu
> cache less efficient".
>
> since 36 k-points is evenly divisible by 6 but not by 8, now a test
> with 6 nodes.
>
> 6node-4pools-low.out:   PWSCF :  3m41.65s CPU time,  4m25.15s wall time
> 6node-6pools-low.out:   PWSCF :  3m40.12s CPU time,  4m23.33s wall time
> 6node-8pools-low.out:   PWSCF :  3m14.13s CPU time,  3m57.76s wall time
> 6node-12pools-low.out:  PWSCF :  3m37.96s CPU time,  4m25.91s wall time
> 6node-24pools-low.out:  PWSCF : 10m55.18s CPU time, 11m47.87s wall time
>
> so 6 pools is more efficient than 4, and 8 even more so than 6 or 12,
> which should lead to a better distribution of the work. so the
> modified "rule" from above seems to hold.
> ok, can we get any faster? ~4 min walltime for a 21-scf-cycle single
> point run is already pretty good, and the serial overhead (and
> wf_collect=.true.) should kick in. so now with 8 nodes and 32 mpi
> tasks.
>
> 8node-4pools-low.out:   PWSCF :  3m22.02s CPU time,  4m 7.06s wall time
> 8node-8pools-low.out:   PWSCF :  3m14.52s CPU time,  3m58.86s wall time
> 8node-16pools-low.out:  PWSCF :  3m36.18s CPU time,  4m24.21s wall time
>
> hmmm, not much better, but now for the final test. since we have 36
> k-points and we need at least two mpi tasks per pool to get good
> performance, let's try 18 nodes with 4 mpi tasks each:
>
> 18node-9pools-low.out:  PWSCF :  1m57.06s CPU time,  3m37.31s wall time
> 18node-18pools-low.out: PWSCF :  2m 2.62s CPU time,  2m45.51s wall time
> 18node-36pools-low.out: PWSCF :  2m45.61s CPU time,  3m33.00s wall time
>
> not spectacular scaling, but still improving. it looks, however, as if
> writing the final wavefunction costs about 45 seconds or more, as
> indicated by the difference between cpu time and walltime.
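(One more aside, just to make the divisibility argument concrete: with 36 k-points, only pool counts that divide 36 give every pool the same number of k-points; otherwise the pools that receive an extra k-point set the pace while the others wait. The snippet below is my own back-of-the-envelope sketch, not anything from Axel's runs, and as his numbers show, cache efficiency can still matter more than a perfectly even split.)

    # how 36 k-points are spread over various numbers of pools
    nk = 36
    for npool in (4, 6, 8, 9, 12, 16, 18, 24, 36):
        per_pool, leftover = divmod(nk, npool)
        heaviest = per_pool + (1 if leftover else 0)  # k-points in the busiest pool
        balance = nk / float(npool * heaviest)        # 1.0 means a perfectly even split
        print("%2d pools: up to %d k-points per pool, load balance %.2f"
              % (npool, heaviest, balance))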
> at this level you had better not use disk_io='high', as that will put
> a _severe_ disk load on the machine that hosts the working directory
> (particularly bad for NFS servers); in this case the code will
> generate and continuously rewrite 144 files, and the
> walltime-to-cputime ratio quickly rises (a factor of 5 in my case, so
> i stopped the job before the NFS server would die).
>
> in summary, it is obviously getting more complicated to define a
> "rule" for what gives the best performance. some experimentation is
> always required, and sometimes there will be surprises. i have not
> touched the issue of network speed (all tests were done across a
> 4x DDR infiniband network).
>
> i hope this little benchmark excursion was as interesting and thought
> provoking for you as it was for me. thanks to everybody who gave
> their input to this discussion.
>
> cheers,
> axel.
>
> p.s.: perhaps at some point it might be interesting to organize a
> workshop on "post-compilation optimization" of pw.x for different
> types of jobs and hardware.
>
>> Huiqun Zhou
>> @Nanjing University, China
>
> --
> =======================================================================
> Axel Kohlmeyer  akohlmey at cmm.chem.upenn.edu  http://www.cmm.upenn.edu
> Center for Molecular Modeling -- University of Pennsylvania
> Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
> tel: 1-215-898-1582, fax: 1-215-573-6233, office-tel: 1-215-898-5425
> =======================================================================
> If you make something idiot-proof, the universe creates a better idiot.
