On Tue, 2013-02-26 at 11:00 +0000, Karttunen Antti wrote:
> Dear Andrea,
> 
> I tried your new only_init keyword by running the new 
> GRID_example/run_example_3 and with my own test jobs. It worked great, but 
> only after I sorted out one really strange issue. The example was crashing 
> for apparently random (q,irr) combinations right at the beginning of ph.x. 
> The error occurred in the subroutine check_directory_phsave and came 
> from the following call:
> 
> CALL iotk_open_read(iunout, FILE = TRIM(filename1), &
>                                         BINARY = .FALSE., IERR = ierr )
> which leads to:
> IF (ierr /= 0) CALL errore('check_directory_phsave','opening file',1)
> 
> ierr from iotk was always 2.
> 
> I tried to see which file caused the problem, and it was always 
> dynmat.1.0.xml (maybe because it's the first file that is read in the 
> loop). Now, the strange thing was that this occurred randomly: for about 20% 
> of the (q,irr) combinations within the example. And the crashing combinations 
> changed between runs. Furthermore, my own tests showed that even if some 
> (q,irr) combination crashed, ph.x would run successfully if I re-ran the same 
> input several times (something like 5 times). I'm running the tests on a 
> local filesystem, so it's not an NFS issue. I guess it could be a compiler 
> issue with iotk (I used ifort 12.1.5), but at least compiling iotk without 
> any optimization flags did not help. I'm really puzzled by the (apparently) 
> random nature of the crashes; I could not figure out what the 
> non-deterministic factor is here (I sure hope it's not my hard drive...).
> 

Thank you for your help in identifying bugs. I have now committed a bug
fix to check_directory_phsave. 

> In any case, I could avoid the crashes by replacing the
> IF (ierr /= 0 ) GOTO 100
> with 
> IF (ierr /= 0 ) CYCLE
> in check_directory_phsave and uncommenting the CALL errore after the loop. I 
> think this is also how things were done before code revision 9858. We 
> were previously using revision 9772 and this problem never occurred there 
> (maybe the problem is unrelated to only_init and just surfaced now when I 
> changed the revision?).
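
For illustration, the skip-on-error pattern described above can be sketched
as follows. This is only a minimal sketch and not the actual
check_directory_phsave code: plain Fortran OPEN with IOSTAT stands in for
iotk_open_read, and the file-name pattern, the loop bounds and the
variable nfail are hypothetical.

```fortran
PROGRAM skip_unreadable_files
  ! Sketch only: plain OPEN/IOSTAT replaces iotk_open_read, and the
  ! file-name pattern and loop bounds below are hypothetical.
  IMPLICIT NONE
  INTEGER, PARAMETER :: nqs = 2, nirr = 3   ! hypothetical loop bounds
  INTEGER :: iq, irr, ierr, iunout, nfail
  CHARACTER(LEN=256) :: filename1

  nfail = 0
  DO iq = 1, nqs
     DO irr = 0, nirr
        ! Build a name such as dynmat.1.0.xml (assumed pattern)
        WRITE(filename1, '(A,I0,A,I0,A)') 'dynmat.', iq, '.', irr, '.xml'
        OPEN(NEWUNIT=iunout, FILE=TRIM(filename1), STATUS='OLD', &
             ACTION='READ', FORM='FORMATTED', IOSTAT=ierr)
        IF (ierr /= 0) THEN
           nfail = nfail + 1
           CYCLE          ! skip this file and keep checking the others
        END IF
        ! ... the status of this (q,irr) would be read here ...
        CLOSE(iunout)
     END DO
  END DO

  ! Report after the loop instead of leaving it at the first failure
  IF (nfail > 0) WRITE(*, '(A,I0,A)') 'could not open ', nfail, ' file(s)'
END PROGRAM skip_unreadable_files
```

With CYCLE, one unreadable file no longer ends the scan of the remaining
(q,irr) files; whether a failure should still abort the run after the loop
is a separate choice that the sketch leaves open.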
> 
> It would be interesting to know what was causing the random iotk_open_read 
> errors. Furthermore, maybe the whole loop over nqs and irr_iq in 
> check_directory_phsave could be skipped for cases where start_q=last_q and 
> start_irr=last_irr? This is the normal case for grid jobs and for large 
> systems this could help to avoid hundreds of filesystem operations for every 
> job (since we are going to do just one (q,irr), it's not that interesting 
> whether the other ones have been done or not). But maybe this would have some 
> side-effects, so at the moment I'll just use the above trick.
> 
OK, I will see if I can do something for this problem.

Andrea

> Best wishes,
> Antti
> 
> -- 
> Dr. Antti Karttunen
> Department of Chemistry
> University of Jyväskylä, Finland
> Tel: +358-50-3473475
> WWW: http://www.iki.fi/ankarttu 
> 
> 
> -----Original Message-----
> From: pw_forum-bounces at pwscf.org [mailto:pw_forum-bounces at pwscf.org] On 
> Behalf Of Andrea Dal Corso
> Sent: Monday, February 25, 2013 7:36 PM
> To: PWSCF Forum
> Subject: Re: [Pw_forum] ph.x: Avoiding the recalculation of the band 
> structure in distributed phonon dispersion jobs
> 
> 
> On Mon, 2013-02-11 at 18:30 +0000, Karttunen Antti wrote:
> > Dear all, 
> > 
> > We are using the start_q/last_q and start_irr/last_irr keywords to execute 
> > phonon dispersion jobs within a HPC grid service. The scheme works really 
> > nicely and we are able to run fairly large phonon dispersion calculations 
> > very efficiently. However it would be great to know if we could further 
> > increase the efficiency by avoiding the recalculation of the band structure 
> > at all irreps for every q.
> > 
> > A concrete example: We are using a 4x4x4 q-point grid to investigate the 
> > phonon dispersions of cubic silicon clathrate (FCC structure with 34 atoms 
> > in the primitive cell), requiring the calculation of 8 q-points in total. 
> > While the number of symmetry-independent q-points is rather low, the 
> > individual q-points can contain as many as 101 irreps (558 (q,irrep) 
> > calculations in total). While in "normal" phonon dispersion calculations 
> > the band structure is solved once for every q, in the distributed phonon 
> > dispersion calculations every single (q,irrep) job calculates the band 
> > structure before doing the actual phonon calculation (except q=1). So, the 
> > band structure is "re-calculated" numerous times in the distributed scheme. 
> > The overhead is not negligible: For a single (q,irrep) job at the q-points 
> > with the lowest symmetry, the band structure calculation can typically 
> > take ~10 CPU hours of the total execution time of ~60 CPU hours (we are 
> > running the jobs in the grid with just one CPU).
> > 
> > For systems like this, it would be really great if we could do something 
> > like this:
> > 1) Precalculate the band structure for every q (for example, for irrep=1),
> > 2) Write the results of the band structure calculation to a file for every 
> > q 
> > 3) For all other irreps, just read the precalculated band structure from 
> > the file.
> > 
> > We are already using a similar scheme to avoid the re-calculation of the 
> > dielectric constant for all q=1 irreps:
> > 1) Precalculate the dielectric constant for (q=1,irrep=1)
> > 2) Use data-file.1.xml with DIELECTRIC_CONSTANT and EFFECTIVE_CHARGES as 
> > the starting point for other q=1 irreps.
> > 3) With recover=.true., the re-calculation of the dielectric constant is 
> > avoided
> > 
> > However, we have not been able to devise a similar scheme to avoid the 
> > re-calculation of the band structure for q>1. I've been reading the source 
> > code but at least based on check_initial_status.f90 it seems that reading 
> > the bands is only possible if there is a restart file available (i.e. the 
> > calculation has been interrupted). So, while the built-in logic supports 
> > restarting "normal" phonon dispersion calculations, we haven't been able to 
> > find a way to read the band structure into a single (q,irrep) job. 
> > 
> 
> I thought that this procedure was already working if you copied all the
> files produced by ph.x using start_irr=1 and last_irr=1 as a preparatory
> run, but there were still some problems. I committed a script to the SVN
> version inspired by your suggestion (see GRID_example/run_example_3).
> Hopefully you can adapt it to your cases. 
> 
> HTH,
> 
> Andrea
> 
> 
> 
> 
> > We would really appreciate any comments or ideas on how to avoid the 
> > overhead from the band structure calculations in the above scenario.
> > 
> > Best wishes,
> > Antti Karttunen
> > 
-- 
Andrea Dal Corso                    Tel. 0039-040-3787428
SISSA, Via Bonomea 265              Fax. 0039-040-3787249
I-34136 Trieste (Italy)             e-mail: dalcorso at sissa.it

