1) No, that's not correct. prefix.EXIT is a file that the user creates to make 
the program stop before completion. When the program finds this file in the 
outdir or in the working directory, it stops, writes the restart files, and 
deletes prefix.EXIT, so the file is practically never present after the program 
has stopped unless something has gone wrong. To restart a relaxation you just 
need the files contained in the prefix.save directory and possibly the restart 
files.
>> Thanks for letting me know. May I ask how to create the prefix.EXIT file? Is 
>> it just an empty file with that name, which I can create with a shell 
>> command in the submission script?
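
A minimal sketch of creating the stop file from a shell. It assumes that an 
empty file is enough, since the reply above only says that the program reacts 
when it finds the file. The outdir path and the prefix used here are 
placeholders, not values from this thread's actual job.

```shell
# Placeholders: use the outdir and prefix from your own input file.
outdir=$(mktemp -d)        # stand-in for the real outdir
prefix="SimulationA"

# Assumption: the file's content is irrelevant, only its presence matters,
# so an empty file created with touch should be sufficient.
touch "$outdir/${prefix}.EXIT"

ls "$outdir"
```

In a submission script the same touch command could be issued from a second 
shell while the job is running.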

Also, I think my previous jobs were not "cleanly stopped", because I neither 
used "max_seconds" nor created a prefix.EXIT file at any point. So that is why 
I think my stopped jobs cannot be continued.

But let me try to continue this one. Inside the prefix.save folder of the 
corresponding job, I can find only 3 files: charge-density.dat, 
data-file-schema.xml, and paw.txt. So I need to copy those files to the outdir 
location, submit the restart job with different names for the .in and .out 
files in the same folder, and set restart_mode to restart. Right?



2) max_seconds uses the same time that is printed as WALL time, which is the 
time elapsed since the job started. CPU time is the time actually used by the 
CPU. They differ because CPU usage is not always 100%: it may be less, but if 
you use multithreading it may also be much larger than 100%. Consider only WALL 
time to keep things simple.
Just look at how many seconds the program takes to complete an scf loop and set 
max_seconds to one week minus that time. This is already very conservative; 
there is no need to leave a larger margin.
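
The arithmetic can be sketched as follows. The one-week wall time comes from 
this thread; the duration of one scf loop (3600 s) is a made-up value that you 
would replace with the one read from your own output file.

```shell
# Wall time requested from the job manager: one week, in seconds.
walltime=$((7 * 24 * 3600))     # 604800

# Hypothetical duration of one scf loop, in seconds.
scf_loop=3600

# Leave one scf loop of headroom so that a checkpoint is reached in time.
max_seconds=$((walltime - scf_loop))

echo "max_seconds = $max_seconds"
```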

3) Do not change the prefix name.
To restart, the program looks for a directory called prefix.save; if you 
change the prefix, the program will not be able to read anything.
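
A small sketch of why the prefix must stay the same. The helper can_restart is 
hypothetical (it is not part of QE); it only checks whether a prefix.save 
directory exists in outdir, which is what the restart is described as looking 
for. The directory created here is a stand-in for the data of a first run.

```shell
# Hypothetical helper: succeed only if <outdir>/<prefix>.save exists.
can_restart() {
  [ -d "$1/$2.save" ]
}

outdir=$(mktemp -d)                    # stand-in for the real outdir
mkdir -p "$outdir/SimulationA.save"    # stand-in for data from the first run

for prefix in SimulationA SimulationA_restart1; do
  if can_restart "$outdir" "$prefix"; then
    echo "$prefix: found $prefix.save, the restart has something to read"
  else
    echo "$prefix: no $prefix.save in outdir, nothing to read"
  fi
done
```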


>> Thank you so much for the answers. Are there any links that explain in 
>> detail how to restart QE jobs? I searched the user manual and the input file 
>> description webpage, but I couldn't find any useful info...

Thank you again!!




________________________________
From: users <[email protected]> on behalf of SISSA 
<[email protected]>
Sent: Monday, July 8, 2019 2:38:00 PM
To: Quantum ESPRESSO users Forum
Subject: Re: [QE-users] Question about restarting relaxation jobs

1) No, that's not correct. prefix.EXIT is a file that the user creates to make 
the program stop before completion. When the program finds this file in the 
outdir or in the working directory, it stops, writes the restart files, and 
deletes prefix.EXIT, so the file is practically never present after the program 
has stopped unless something has gone wrong. To restart a relaxation you just 
need the files contained in the prefix.save directory and possibly the restart 
files.

2) max_seconds uses the same time that is printed as WALL time, which is the 
time elapsed since the job started. CPU time is the time actually used by the 
CPU. They differ because CPU usage is not always 100%: it may be less, but if 
you use multithreading it may also be much larger than 100%. Consider only WALL 
time to keep things simple.
Just look at how many seconds the program takes to complete an scf loop and set 
max_seconds to one week minus that time. This is already very conservative; 
there is no need to leave a larger margin.

3) Do not change the prefix name.
To restart, the program looks for a directory called prefix.save; if you 
change the prefix, the program will not be able to read anything.

On 8 Jul 2019 6:43 PM, "Yeon, Jejoon" <[email protected]> wrote:

Thank you so much, Pietro.


May I ask one more question?


1) This is just a double-checking question. I checked the folder where the 
relaxation was stopped by the cluster due to the wall time limit (I didn't set 
max_seconds). In the output folder, I can see the prefix.save/ and pwscf.save/ 
folders, and the prefix.update and prefix.bfgs files. But because I have 
neither a prefix.EXIT folder nor a prefix.EXIT file, I cannot restart this 
simulation. Is this correct?


2) Now I'm setting "max_seconds" in all my QE DFT jobs. But I found that CPU 
time and wall time are slightly different. At the end of the output file of my 
most recently finished calculation, it says:

PWSCF        :   4d21h44m CPU   4d22h12m WALL
   This run was terminated on:  21:27:26   1Jul2019

I used 30 cores and set 7 days of wall time. The simulation finished before the 
wall time limit, but I'm not sure why this slight difference between CPU time 
and wall time occurs.
In this case, what would be a good value of max_seconds in CPU time compared 
to the wall time? If I request 7 days of wall time from the cluster, would it 
be "safer" to set 6 days or 6.5 days of CPU time for max_seconds?
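
For scale, the two timings quoted above can be converted to seconds. This is 
only a sketch: to_seconds is a hypothetical helper that assumes the compact 
"4d22h12m" form with no spaces or leading zeros.

```shell
# Hypothetical helper: convert a "<D>d<H>h<M>m" duration to seconds.
# Assumes no spaces and no leading zeros in the fields.
to_seconds() {
  d=${1%%d*}; rest=${1#*d}      # days, then the part after "d"
  h=${rest%%h*}; rest=${rest#*h} # hours, then the part after "h"
  m=${rest%%m*}                  # minutes
  echo $(( d * 86400 + h * 3600 + m * 60 ))
}

cpu=$(to_seconds 4d21h44m)
wall=$(to_seconds 4d22h12m)
gap=$((wall - cpu))

echo "CPU:  $cpu s"
echo "WALL: $wall s"
echo "gap:  $gap s"
```

The gap is 1680 s, i.e. 28 minutes over a run of almost five days.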


3) This is also a double-checking question. If I wish to run the restart in the 
same folder, I had better change the prefix from "SimulationA" to 
"SimulationA_restart1"; would that be OK? Also, if I wish to use a different 
folder, I need to copy all the files inside the prefix.EXIT folder to the new 
restart folder. Is this correct?


Thank you so much for the friendly answers to my beginner questions!!

________________________________
From: users <[email protected]> on behalf of Pietro 
Davide Delugas <[email protected]>
Sent: Monday, July 8, 2019 4:14:06 AM
To: [email protected]
Subject: Re: [QE-users] Question about restarting relaxation jobs

Hello

1) and 2) PW writes the restart files only when it terminates before 
convergence is reached: either because the max number of steps has been reached 
(the max number may be either the number of electronic steps during scf or the 
number of ionic steps during structural relaxation), or because the execution 
time exceeds the max_seconds specified in the input, or because the user has 
stopped the calculation by creating a file called prefix.EXIT in the outdir.

If restart_mode in &control is set to "restart", pw will try to restart the 
relaxation from the last positions saved in the prefix.save directory, using 
the last saved charge density and wave functions. If it finds the restart 
files, it will use them as well. This mechanism works fine if positions, charge 
density and wave functions have been saved regularly, but if the calculation is 
stopped abruptly, for example by the job manager, there is no way to prevent 
the stop from arriving while the program is writing these data. The safer way 
to go when you are using a job manager is to set the max_seconds variable to a 
number considerably lower than the time allocated by the job manager. The 
difference between these two times should be enough to allow the program to 
pass through one of the checkpoints at which, during execution, it checks 
whether the execution time has exceeded max_seconds or the user has created a 
prefix.EXIT file. To estimate how large the difference between max_seconds and 
the scheduled execution time should be, check how long the program takes to 
complete an scf loop. This is a very safe estimate; you could reduce this time 
significantly and things should still work.



3) I don't understand what you want to do. You create the prefix.EXIT file when 
you want to stop your calculation and want it to finish smoothly, saving all 
restart information so that it can restart from more or less the same point at 
which it was interrupted. It makes no sense to rename the output file to 
prefix.EXIT, because that will make the program stop as soon as a checkpoint 
detects the file, and the file will be deleted. The only things that you have 
to do when restarting a calculation are:

  *     Specify restart_mode = 'restart' in the input.in file.

  *     Take care that the information saved in output.out is not overwritten 
by the new execution: either use something like     mpirun pw.x  < input.in >> 
output.out     which appends the new output to the old one, or redirect the 
output to a file with a different name.
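
The two steps above can be sketched as follows. The &control block is a minimal 
stand-in rather than a complete input, and the file names, prefix and outdir 
are placeholders. The sed pattern assumes restart_mode is already written in 
the input with single quotes.

```shell
# Work in a scratch directory; a stand-in for the real job directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Minimal stand-in for the &control block of the original input.in.
cat > input.in <<'EOF'
&control
  calculation  = 'relax'
  restart_mode = 'from_scratch'
  prefix       = 'SimulationA'
  outdir       = './out'
/
EOF

# Switch restart_mode to 'restart'; prefix and outdir are left untouched.
sed -i.bak "s/restart_mode *= *'[^']*'/restart_mode = 'restart'/" input.in

cat input.in

# Then rerun, appending (>>) so the output of the first run is kept:
#   mpirun pw.x < input.in >> output.out
```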

4) outdir must be the same. If you want to use a different one, you have to 
create the new outdir before restarting and copy into it all the data of the 
previous calculation, i.e. the prefix.save directory.
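
A sketch of moving to a new outdir, assuming the data to carry over is the 
prefix.save directory as stated above. All paths and the prefix are 
placeholders, and the files created under old_outdir only stand in for the 
output of a real first run.

```shell
prefix="SimulationA"
old_outdir=$(mktemp -d)                 # stand-in for the original outdir
new_outdir="$(mktemp -d)/restart_out"   # hypothetical new outdir

# Stand-in for what the first run left behind.
mkdir -p "$old_outdir/${prefix}.save"
touch "$old_outdir/${prefix}.save/charge-density.dat" \
      "$old_outdir/${prefix}.save/data-file-schema.xml"

# Create the new outdir first, then copy the whole prefix.save directory.
mkdir -p "$new_outdir"
cp -r "$old_outdir/${prefix}.save" "$new_outdir/"

ls "$new_outdir/${prefix}.save"
```

The outdir entry in the input file would then have to point to the new 
location, while the prefix stays unchanged.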


5) Don't complicate things too much.



Pietro


On 7/6/19 3:59 PM, Yeon, Jejoon wrote:

Hello


I have very little experience using QE, so please excuse my beginner 
questions. I'm about to start the relaxation of a big crystal structure, and I 
wish to make my QE relaxation jobs ready for restart. Here are my questions:


1) According to the "restarting" section of the manual 
(https://www.quantum-espresso.org/Doc/pw_user_guide/node20.html), it seems that 
QE does not create a dedicated restart file. Is this correct?

2) If I set the "max_seconds" option to 604800 seconds (1 week) and request 1 
week of wall time from the server, will my calculation jobs be ready to restart 
after 1 week? (1 week is just an example, but our cluster has a maximum 
walltime limit, and I don't think any of my relaxation jobs will finish within 
that time.) Also, is the "max_seconds" option required in order to restart?

3) When I execute QE in the submit script, I use something like:
mpirun pw.x  < input.in > output.out
In this case, if the relaxation job is killed due to the wall time limit 
(without setting max_seconds), can I just rename output.out to prefix.EXIT (of 
course I set the prefix in the input file), then include 
restart_mode = "restart" in the input file, and submit a job for restart?
I have old files from jobs that ended on reaching the wall time limit without 
the "max_seconds" option, and I'm curious whether I can use those files to 
restart.

4) I also use the outdir option in the input file. Should the outdir option be 
the same when restarting?

5) Are there any other things or useful hints that I need to consider when 
restarting?

Thank you




_______________________________________________
Quantum ESPRESSO is supported by MaX 
(www.max-centre.eu/quantum-espresso<http://www.max-centre.eu/quantum-espresso>)
users mailing list 
[email protected]<mailto:[email protected]>
https://lists.quantum-espresso.org/mailman/listinfo/users

