On Tue, 2013-02-26 at 11:00 +0000, Karttunen Antti wrote:
> Dear Andrea,
>
> I tried your new only_init keyword by running the new
> GRID_example/run_example_3 and with my own test jobs. It worked great, but
> only after I sorted out one really strange issue. The example was crashing
> for apparently random (q,irr) combinations right in the beginning of ph.x.
> The error occurred at subroutine check_directory_phsave and it was coming
> from the following call:
>
> CALL iotk_open_read(iunout, FILE = TRIM(filename1), &
> BINARY = .FALSE., IERR = ierr )
> which leads to:
> IF (ierr /= 0) CALL errore('check_directory_phsave','opening file',1)
>
> ierr from iotk was always 2.
>
> I tried to see which file caused the problem and it was always
> dynmat.1.0.xml (maybe because it's the first file that is read in in the
> loop). Now, the strange thing was that this occurred randomly: for about 20%
> of the (q,irr) combinations within the example. And the crashing combinations
> changed between runs. Furthermore, my own tests showed that even if some
> (q,irr) combination crashed, ph.x would run successfully if I re-ran the same
> input several times (something like 5 times). I'm running the tests on a
> local filesystem, so it's not an NFS issue. I guess it could be a compiler
> issue with iotk (I used ifort 12.1.5), but at least compiling iotk without
> any optimization flags did not help. I'm really puzzled by the (apparently)
> random nature of the crashes; I could not figure out what the
> non-deterministic factor is here (I sure hope it's not my hard drive...).
>
Thank you for your help in identifying bugs. I have now committed a bug
fix to check_directory_phsave.
> In any case, I could avoid the crashes by replacing the
> IF (ierr /= 0 ) GOTO 100
> with
> IF (ierr /= 0 ) CYCLE
> in check_directory_phsave and uncommenting the CALL errore after the loop. I
> think this is also how the things were done before code revision 9858. We
> were previously using revision 9772 and this problem never occurred there
> (maybe the problem is unrelated to only_init and just surfaced now when I
> changed the revision?).
>
> It would be interesting to know what was causing the random iotk_open_read
> errors. Furthermore, maybe the whole loop over nqs and irr_iq in
> check_directory_phsave could be skipped for cases where start_q=last_q and
> start_irr=last_irr? This is the normal case for grid jobs and for large
> systems this could help to avoid hundreds of filesystem operations for every
> job (since we are going to do just one (q,irr), it's not that interesting
> whether the other ones have been done or not). But maybe this would have some
> side-effects, so at the moment I'll just use the above trick.
>
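The two ideas in the quoted message (skip a file that cannot be opened
instead of aborting, and bypass the scan entirely for a single (q,irr) job)
can be sketched as follows. This is only a Python analogue, not the Fortran
in check_directory_phsave: the function names scan_phsave and is_single_job
are hypothetical, and the file-name pattern dynmat.<q>.<irr>.xml is assumed
from the single name dynmat.1.0.xml mentioned above.

```python
import os


def scan_phsave(directory, n_q, n_irr, skip_unreadable=True):
    """Return the set of (q, irr) pairs whose dynmat file could be opened.

    Hypothetical analogue of the loop over nqs and irr_iq; the file-name
    pattern is an assumption based on 'dynmat.1.0.xml'.
    """
    done = set()
    for iq in range(1, n_q + 1):
        for irr in range(0, n_irr + 1):
            path = os.path.join(directory, f"dynmat.{iq}.{irr}.xml")
            if not os.path.exists(path):
                continue  # nothing written yet for this (q, irr)
            try:
                with open(path, "r") as handle:
                    handle.read(1)
            except OSError:
                if skip_unreadable:
                    continue  # analogue of CYCLE: move on to the next file
                raise  # analogue of treating the failed open as fatal
            done.add((iq, irr))
    return done


def is_single_job(start_q, last_q, start_irr, last_irr):
    """True when the job covers exactly one (q, irr) pair, so the scan
    over all other pairs could in principle be skipped."""
    return start_q == last_q and start_irr == last_irr
```

Whether skipping the scan has side effects inside ph.x is not something
this sketch can answer; it only shows the control flow under discussion.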
OK, I will see if I can do something for this problem.
Andrea
> Best wishes,
> Antti
>
> --
> Dr. Antti Karttunen
> Department of Chemistry
> University of Jyväskylä, Finland
> Tel: +358-50-3473475
> WWW: http://www.iki.fi/ankarttu
>
>
> -----Original Message-----
> From: pw_forum-bounces at pwscf.org [mailto:pw_forum-bounces at pwscf.org] On
> Behalf Of Andrea Dal Corso
> Sent: Monday, February 25, 2013 7:36 PM
> To: PWSCF Forum
> Subject: Re: [Pw_forum] ph.x: Avoiding the recalculation of the band
> structure in distributed phonon dispersion jobs
>
>
> On Mon, 2013-02-11 at 18:30 +0000, Karttunen Antti wrote:
> > Dear all,
> >
> > We are using the start_q/last_q and start_irr/last_irr keywords to execute
> > phonon dispersion jobs within a HPC grid service. The scheme works really
> > nicely and we are able to run fairly large phonon dispersion calculations
> > very efficiently. However, it would be great to know if we could further
> > increase the efficiency by avoiding the recalculation of the band structure
> > at all irreps for every q.
> >
> > A concrete example: We are using a 4x4x4 q-point grid to investigate the
> > phonon dispersions of cubic silicon clathrate (FCC structure with 34 atoms
> > in the primitive cell), requiring the calculation of 8 q-points in total.
> > While the number of symmetry-independent q-points is rather low, the
> > individual q-points can contain as many as 101 irreps (558 (q,irrep)
> > calculations in total). While in "normal" phonon dispersion calculations
> > the band structure is solved once for every q, in the distributed phonon
> > dispersion calculations every single (q,irrep) job calculates the band
> > structure before doing the actual phonon calculation (except q=1). So, the
> > band structure is "re-calculated" numerous times in the distributed scheme.
> > The overhead is not negligible: For a single (q,irrep) job at the q-points
> > with the lowest symmetry, the band structure calculation can typically
> > take ~10 CPU hours of the total execution time of ~60 CPU hours (we are
> > running the jobs in the grid with just one CPU).
> >
> > For systems like this, it would be really great if we could do something
> > like this:
> > 1) Precalculate the band structure for every q (for example, for irrep=1),
> > 2) Write the results of the band structure calculation to a file for every
> > q
> > 3) For all other irreps, just read the precalculated band structure from
> > the file.
> >
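The three-step scheme proposed above amounts to splitting the run into one
preparatory job per q (irrep=1) and dependent jobs for all remaining irreps.
A minimal Python sketch of such a job plan follows; plan_jobs and
job_keywords are hypothetical helper names, the irrep counts in the usage
note are made up for illustration, and only the four keyword names
(start_q, last_q, start_irr, last_irr) come from the message itself.

```python
def plan_jobs(irreps_per_q):
    """Split a distributed run into preparatory and dependent jobs.

    irreps_per_q maps a q-point index to its number of irreps.
    Preparatory jobs (irrep 1) would compute and store the band structure;
    dependent jobs would reuse it, assuming ph.x can be made to read it.
    """
    qs = sorted(irreps_per_q)
    preparatory = [(q, 1) for q in qs]
    dependent = [
        (q, irr) for q in qs for irr in range(2, irreps_per_q[q] + 1)
    ]
    return preparatory, dependent


def job_keywords(q, irr):
    """Keyword values selecting a single (q, irrep) job."""
    return {"start_q": q, "last_q": q, "start_irr": irr, "last_irr": irr}
```

For example, with the illustrative counts plan_jobs({1: 3, 2: 2}), the
preparatory jobs are (1,1) and (2,1), and the dependent jobs are (1,2),
(1,3) and (2,2); the dependent jobs for a given q can only start once the
preparatory job for that q has finished.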
> > We are already using a similar scheme to avoid the re-calculation of the
> > dielectric constant for all q=1 irreps:
> > 1) Precalculate the dielectric constant for (q=1,irrep=1)
> > 2) Use data-file.1.xml with DIELECTRIC_CONSTANT and EFFECTIVE_CHARGES as
> > the starting point for other q=1 irreps.
> > 3) With recover=.true., the re-calculation of the dielectric constant is
> > avoided
> >
> > However, we have not been able to devise a similar scheme to avoid the
> > re-calculation of the band structure for q>1. I've been reading the source
> > code but at least based on check_initial_status.f90 it seems that reading
> > the bands is only possible if there is a restart file available (i.e. the
> > calculation has been interrupted). So, while the built-in logic supports
> > restarting "normal" phonon dispersion calculations, we haven't been able to
> > find out a way to read the band structure into a single (q,irrep) job.
> >
>
> I thought that this procedure was already working if you copied all the
> files produced by ph.x using start_irr=1 and last_irr=1 as a preparatory
> run, but there were still some problems. I committed a script to the SVN
> version inspired by your suggestion (see GRID_example/run_example_3).
> Hopefully you can adapt it to your cases.
>
> HTH,
>
> Andrea
>
>
>
>
> > We would really appreciate any comments or ideas on how to avoid the
> > overhead from the band structure calculations in the above scenario.
> >
> > Best wishes,
> > Antti Karttunen
> >
--
Andrea Dal Corso Tel. 0039-040-3787428
SISSA, Via Bonomea 265 Fax. 0039-040-3787249
I-34136 Trieste (Italy) e-mail: dalcorso at sissa.it