Axel,

Thank you so much for the tip; '-assume buffered_io' did the trick. Now all the restart info is dumped in ~53 sec when I use all 4 processors on each node. Using only 3 processors per node (with the total number of processors fixed) gives only a very minor improvement.
Thanks!
Silviu.

Axel Kohlmeyer wrote:
> On Wed, 8 Feb 2006, Silviu Zilberman wrote:
>
> hi,
>
> SZ> Paolo Giannozzi wrote:
> SZ> > On Wednesday 08 February 2006 12:18, Silviu Zilberman wrote:
> SZ> >
> SZ> >> I have been running calculations on Lemieux which is an alpha cluster
> SZ> >> supercomputer in Pittsburgh. For some reason that is still mysterious
> SZ> >> to me, writing these density files on the scratch space took a very long
> SZ> >> time, ~30 (!) minutes for a 68MB file.
> SZ> >>
> SZ> >
> SZ> > maybe you should rename your machine "Lepire" :-)
>
> paolo, you should be careful with remarks like this.
> people in pittsburgh take sports _very_ seriously and
> don't like others making fun of the names of their idol
> from the pittsburgh penguins. you may get away with it
> since the steelers just won the superbowl last weekend...
>
> SZ> > The charge density file should be written only when the wavefunctions
> SZ> > are written, every "isave" steps and at the end of the run. If it is
> SZ> > written at each time step, this is definitely wrong.
> SZ> >
> SZ> The charge density is written only every isave steps, and I set it to a
> SZ> very large number to avoid this time-consuming i/o. But even if I write
> SZ> it just once at the end of the calculation, it would still require ~90
> SZ> minutes for 3 files, which is completely crazy, given that the maximal
> SZ> time allocated per job is 12 hours on this supercomputer.
>
> lemieux is an alpha and the dec compiler has the unfortunate property
> to do synchronous i/o by default. this will have a disastrous effect
> on a networked filesystem used by (too) many users.
> please try compiling with '-assume buffered_io' and let me know
> if that helps.
>
> SZ> > The charge density in a parallel calculation is collected to a single
> SZ> > node and written from there. Since it is not wise to collect it into a
> SZ> > single array, each slice from each processor is collected and written.
> SZ> > Maybe this algorithm is not optimal (maybe it is even "pessimal"). You
> SZ> > should try to understand where exactly the machine spends all this time
> SZ> > and why
>
> another recommendation from the PSC staff is only use 3 processors per
> node for the actual job (which generally is the performance limit for
> memory bandwidth consuming jobs like DFT/PW/PP codes) so that there is
> some cpu capacity left for asynchronous operation (e.g. kernel i/o and
> the MPI and NFS threads).
>
> regards,
> axel.
>
> SZ> >
> SZ> I may do it, but for the time being, these files are not very useful for
> SZ> me. I can change the code to respect again the disk_io parameter and
> SZ> avoid writing these files altogether. However I would like to know
> SZ> first if there was some reasoning behind dumping these files by default
> SZ> without user control over it.
> SZ>
> SZ> Thanks, Silviu.
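
For readers wondering why a buffering flag can matter this much: 68 MB in ~30 minutes is only about 38 kB/s, which suggests that the cost came from the number of write requests and not from the amount of data. The Python sketch below is only an illustration of that general effect, not a model of the DEC Fortran runtime or of the PWscf output routines. The class `CountingRaw`, the function names, the 8-byte record size and the 64 KiB buffer are all hypothetical choices made for the example. It counts how many write calls reach the underlying "device" when many small records are written with and without a buffer in between.

```python
import io


class CountingRaw(io.RawIOBase):
    """In-memory stand-in for a file: counts the write calls that reach it."""

    def __init__(self):
        super().__init__()
        self.calls = 0    # number of write requests seen by the "device"
        self.nbytes = 0   # total bytes accepted

    def writable(self):
        return True

    def write(self, b):
        n = len(b)
        self.calls += 1
        self.nbytes += n
        return n          # report that the whole chunk was written


def write_records(stream, n_records, record_size):
    """Write n_records small records, one write call per record."""
    record = b"\0" * record_size
    for _ in range(n_records):
        stream.write(record)
    stream.flush()


def count_device_writes(n_records=10_000, record_size=8, buffer_size=64 * 1024):
    """Return (unbuffered calls, buffered calls, unbuffered bytes, buffered bytes)."""
    # Case 1: every record goes straight to the device.
    unbuffered = CountingRaw()
    write_records(unbuffered, n_records, record_size)

    # Case 2: records are collected in memory and passed on in large chunks.
    raw = CountingRaw()
    buffered = io.BufferedWriter(raw, buffer_size=buffer_size)
    write_records(buffered, n_records, record_size)

    return unbuffered.calls, raw.calls, unbuffered.nbytes, raw.nbytes


if __name__ == "__main__":
    u_calls, b_calls, u_bytes, b_bytes = count_device_writes()
    print(f"unbuffered: {u_calls} device writes for {u_bytes} bytes")
    print(f"buffered:   {b_calls} device writes for {b_bytes} bytes")
```

The same bytes arrive in both cases, but the buffered path issues far fewer requests. On a shared networked filesystem, where each request may involve a round trip, that difference in request count is plausibly what dominates the wall-clock time.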
