Dear Andrea,
I tried your new only_init keyword by running the new
GRID_example/run_example_3 and with my own test jobs. It worked great, but only
after I sorted out one really strange issue. The example was crashing for
apparently random (q,irr) combinations right at the beginning of ph.x. The
error occurred in the subroutine check_directory_phsave and came from the
following call:
CALL iotk_open_read(iunout, FILE = TRIM(filename1), &
BINARY = .FALSE., IERR = ierr )
which leads to:
IF (ierr /= 0) CALL errore('check_directory_phsave','opening file',1)
ierr from iotk was always 2.
I tried to see which file caused the problem, and it was always
dynmat.1.0.xml (maybe because it's the first file that is read in the loop).
Now, the strange thing was that this occurred randomly: for about 20% of the
(q,irr) combinations within the example. And the crashing combinations changed
between runs. Furthermore, my own tests showed that even if some (q,irr)
combination crashed, ph.x would run successfully if I re-ran the same input
several times (something like 5 times). I'm running the tests on a local
filesystem, so it's not an NFS issue. I guess it could be a compiler issue with
iotk (I used ifort 12.1.5), but at least compiling iotk without any
optimization flags did not help. I'm really puzzled by the (apparently) random
nature of the crashes: I could not figure out what the non-deterministic
factor is here (I sure hope it's not my hard drive...).
In any case, I could avoid the crashes by replacing the
IF (ierr /= 0 ) GOTO 100
with
IF (ierr /= 0 ) CYCLE
in check_directory_phsave and uncommenting the CALL errore after the loop. I
think this is also how things were done before code revision 9858. We were
previously using revision 9772 and this problem never occurred there (maybe the
problem is unrelated to only_init and just surfaced now when I changed the
revision?).
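To make the difference between the two statements concrete, here is a minimal
standalone sketch. It does not use iotk or the real check_directory_phsave:
the open_status array and the files_read function are hypothetical stand-ins
for the IERR values returned while scanning the files, and EXIT stands in for
the jump to label 100.

```fortran
program cycle_vs_goto
  implicit none
  integer, parameter :: nfiles = 5
  ! Hypothetical open status for each file in the scan; 2 mimics the
  ! ierr value reported above, 0 means the file opened fine.
  integer, parameter :: open_status(nfiles) = [0, 2, 0, 0, 2]

  print '(a,i0)', 'files read when stopping at first error: ', files_read(.true.)
  print '(a,i0)', 'files read when using CYCLE: ', files_read(.false.)

contains

  ! Count how many files the scan gets to read under each policy.
  integer function files_read(stop_on_error) result(n)
    logical, intent(in) :: stop_on_error
    integer :: i, ierr

    n = 0
    do i = 1, nfiles
      ierr = open_status(i)        ! stand-in for IERR from iotk_open_read
      if (ierr /= 0) then
        if (stop_on_error) exit    ! like GOTO 100: abandon the whole scan
        cycle                      ! workaround: skip only this file
      end if
      n = n + 1                    ! the file contents would be read here
    end do
  end function files_read

end program cycle_vs_goto
```

With the assumed status values the first policy reads 1 file and the second
reads 3, so a single failed open no longer prevents the remaining files from
being checked.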
It would be interesting to know what was causing the random iotk_open_read
errors. Furthermore, maybe the whole loop over nqs and irr_iq in
check_directory_phsave could be skipped for cases where start_q=last_q and
start_irr=last_irr? This is the normal case for grid jobs and for large systems
this could help to avoid hundreds of filesystem operations for every job (since
we are going to do just one (q,irr), it's not that interesting whether the
other ones have been done or not). But maybe this would have some side-effects,
so at the moment I'll just use the above trick.
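The proposed shortcut could look roughly like the sketch below. It is only an
illustration under stated assumptions: the irr_iq values and the
files_to_check function are hypothetical, and the count assumes one file per
irr = 0..irr_iq(iq) for every q, which is a guess based on the dynmat.1.0.xml
file name rather than something taken from the ph.x source.

```fortran
program skip_scan_sketch
  implicit none
  ! Hypothetical number of irreps at each of three q-points.
  integer, parameter :: irr_iq(3) = [3, 6, 12]

  print '(a,i0)', 'files checked, full run: ', &
        files_to_check(1, 3, 1, 12)
  print '(a,i0)', 'files checked, single (q,irr) grid job: ', &
        files_to_check(2, 2, 4, 4)

contains

  ! Number of files the scan would touch for the requested range.
  integer function files_to_check(start_q, last_q, start_irr, last_irr) result(n)
    integer, intent(in) :: start_q, last_q, start_irr, last_irr

    if (start_q == last_q .and. start_irr == last_irr) then
      n = 0                        ! proposed shortcut: skip the scan entirely
    else
      n = sum(irr_iq + 1)          ! irr = 0..irr_iq(iq) for every q (assumed)
    end if
  end function files_to_check

end program skip_scan_sketch
```

For these made-up numbers the full scan touches 24 files and the single-task
job touches none. The sketch says nothing about possible side-effects of
skipping the scan in the real code.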
Best wishes,
Antti
--
Dr. Antti Karttunen
Department of Chemistry
University of Jyväskylä, Finland
Tel: +358-50-3473475
WWW: http://www.iki.fi/ankarttu
-----Original Message-----
From: pw_forum-bounces at pwscf.org On Behalf Of Andrea Dal Corso
Sent: Monday, February 25, 2013 7:36 PM
To: PWSCF Forum
Subject: Re: [Pw_forum] ph.x: Avoiding the recalculation of the band structure
in distributed phonon dispersion jobs
On Mon, 2013-02-11 at 18:30 +0000, Karttunen Antti wrote:
> Dear all,
>
> We are using the start_q/last_q and start_irr/last_irr keywords to execute
> phonon dispersion jobs within a HPC grid service. The scheme works really
> nicely and we are able to run fairly large phonon dispersion calculations
> very efficiently. However, it would be great to know if we could further
> increase the efficiency by avoiding the recalculation of the band structure
> at all irreps for every q.
>
> A concrete example: We are using a 4x4x4 q-point grid to investigate the
> phonon dispersions of cubic silicon clathrate (FCC structure with 34 atoms in
> the primitive cell), requiring the calculation of 8 q-points in total. While
> the number of symmetry-independent q-points is rather low, the individual
> q-points can contain as many as 101 irreps (558 (q,irrep) calculations in
> total). While in "normal" phonon dispersion calculations the band structure
> is solved once for every q, in the distributed phonon dispersion calculations
> every single (q,irrep) job calculates the band structure before doing the
> actual phonon calculation (except q=1). So, the band structure is
> "re-calculated" numerous times in the distributed scheme. The overhead is not
> negligible: For a single (q,irrep) job at the q-points with the lowest
> symmetry, the band structure calculation can typically take ~10 CPU hours of
> the total execution time of ~60 CPU hours (we are running the jobs in the
> grid with just one CPU).
>
> For systems like this, it would be really great if we could do something like
> this:
> 1) Precalculate the band structure for every q (for example, for irrep=1),
> 2) Write the results of the band structure calculation to a file for every q
> 3) For all other irreps, just read the precalculated band structure from the
> file.
>
> We are already using a similar scheme to avoid the re-calculation of the
> dielectric constant for all q=1 irreps:
> 1) Precalculate the dielectric constant for (q=1,irrep=1)
> 2) Use data-file.1.xml with DIELECTRIC_CONSTANT and EFFECTIVE_CHARGES as the
> starting point for other q=1 irreps.
> 3) With recover=.true., the re-calculation of the dielectric constant is
> avoided
>
> However, we have not been able to devise a similar scheme to avoid the
> re-calculation of the band structure for q>1. I've been reading the source
> code but at least based on check_initial_status.f90 it seems that reading the
> bands is only possible if there is a restart file available (i.e. the
> calculation has been interrupted). So, while the built-in logic supports
> restarting "normal" phonon dispersion calculations, we haven't been able to
> find out a way to read the band structure into a single (q,irrep) job.
>
I thought that this procedure was already working if you copied all the
files produced by ph.x using start_irr=1 and last_irr=1 as a preparatory
run, but there were still some problems. I committed a script in the SVN
version inspired by your suggestion (see GRID_example/run_example_3).
Hopefully you can adapt it to your cases.
HTH,
Andrea
> We would really appreciate any comments or ideas on how to avoid the overhead
> from the band structure calculations in the above scenario.
>
> Best wishes,
> Antti Karttunen
>
--
Andrea Dal Corso Tel. 0039-040-3787428
SISSA, Via Bonomea 265 Fax. 0039-040-3787249
I-34136 Trieste (Italy) e-mail: dalcorso at sissa.it