 * Yes, you can create the exit file with just a touch command.

 * No, don't create it in the submission script; there is no sensible
   reason to do that. Maybe I have not been clear: the program stops
   as soon as it finds the exit file. If you created it in the
   submission script, the program would stop just after starting,
   without doing anything. What for?

 * If you have doubts about the reliability of your saved data, it is
   probably better to copy the last positions obtained by your previous
   run into the input and restart from scratch from those coordinates.

 * If you really want to change the prefix (which is actually not
   needed very often), just copy the whole prefix.save directory into
   a new_prefix.save directory.
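
As a minimal sketch of the first point: the exit file is an empty file named after the prefix, created in the outdir while the job is running. The prefix `SimulationA` and the directory `./demo_outdir` are assumptions for illustration only; use the prefix and outdir from your own input file.

```shell
#!/bin/sh
# Assumed values for illustration; they must match prefix and outdir in the pw.x input.
prefix="SimulationA"
outdir="./demo_outdir"   # stand-in for the real outdir of the running job
mkdir -p "$outdir"

# Create an empty file named <prefix>.EXIT in outdir.
# Do this by hand while pw.x is running, not from the submission script.
touch "$outdir/${prefix}.EXIT"

# Confirm that the file exists and is empty.
if [ -f "$outdir/${prefix}.EXIT" ] && [ ! -s "$outdir/${prefix}.EXIT" ]; then
    echo "created empty ${prefix}.EXIT"
fi
```

At its next checkpoint pw.x should notice the file, write its restart data, delete the file and stop.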





On 08/07/19 21:22, Yeon, Jejoon wrote:
1) No, that is not correct. prefix.EXIT is a file that the user creates to make the program stop before completion. When the program finds this file in the outdir or in the working directory, it stops, writes the restart files and deletes prefix.EXIT, so the file is practically never present after the program has stopped unless something has gone wrong. To restart a relaxation you just need the files contained in the prefix.save directory and possibly the restart files.

>> Thanks for letting me know. May I ask how to create the prefix.EXIT file? Is it just an empty file with that name, which I can create with a shell command in the submission script?

Also, I think my previous jobs were not "cleanly stopped", because I neither used "max_seconds" nor created a prefix.EXIT file at any point. That is why I think my stopped jobs cannot be continued.

But let me try to continue this one. Inside the prefix.save folder of the corresponding job I can find only 3 files: charge-density.dat, data-file-schema.xml, and paw.txt. So I need to copy those files to the outdir location, submit the restart job in the same folder with different names for the .in and .out files, and set restart_mode to restart. Right?



2) max_seconds uses the same time as the one printed as WALL_TIME, which is the time elapsed since the job started. CPU_TIME is the time actually used by the CPU. They differ because CPU usage is not always 100%: it may be less, but if you use multithreading it may also be much larger than 100%. Consider only WALL time to keep things simple. Just look at how many seconds the program takes to complete an scf loop and set max_seconds to one week minus that time. This is already very conservative; there is no need to allow for a longer time.

3) Do not change the prefix name.
To restart, the program will look for a directory called prefix.save. If you change the prefix, the program will not be able to read anything.

>> Thank you so much for the answers. Are there any links that explain in detail how to restart QE jobs? I searched the user manual and the input file description webpage, but I couldn't find any useful info...


Thank you again!!



------------------------------------------------------------------------
*From:* users <[email protected]> on behalf of SISSA <[email protected]>
*Sent:* Monday, July 8, 2019 2:38:00 PM
*To:* Quantum ESPRESSO users Forum
*Subject:* Re: [QE-users] Question about restarting relaxation jobs
1) No, that is not correct. prefix.EXIT is a file that the user creates to make the program stop before completion. When the program finds this file in the outdir or in the working directory, it stops, writes the restart files and deletes prefix.EXIT, so the file is practically never present after the program has stopped unless something has gone wrong. To restart a relaxation you just need the files contained in the prefix.save directory and possibly the restart files.

2) max_seconds uses the same time as the one printed as WALL_TIME, which is the time elapsed since the job started. CPU_TIME is the time actually used by the CPU. They differ because CPU usage is not always 100%: it may be less, but if you use multithreading it may also be much larger than 100%. Consider only WALL time to keep things simple. Just look at how many seconds the program takes to complete an scf loop and set max_seconds to one week minus that time. This is already very conservative; there is no need to allow for a longer time.

3) Do not change the prefix name.
To restart, the program will look for a directory called prefix.save. If you change the prefix, the program will not be able to read anything.
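
The arithmetic in point 2 can be sketched as follows. The one-week wall time comes from the thread; the one-hour scf loop duration is a made-up value for illustration and should be replaced by the time read from one of your own outputs.

```shell
#!/bin/sh
# Assumed values for illustration.
walltime_days=7     # wall time requested from the job manager
scf_seconds=3600    # hypothetical duration of one scf loop, taken from a previous output

# Wall time limit in seconds: 7 days = 604800 s.
walltime_seconds=$((walltime_days * 24 * 3600))

# Leave one scf loop of margin so pw.x can reach a checkpoint and save its data.
max_seconds=$((walltime_seconds - scf_seconds))

echo "max_seconds = $max_seconds"
```

With these numbers the script prints `max_seconds = 601200`, which is the value that would go into the &control namelist.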

On 8 Jul 2019 6:43 PM, "Yeon, Jejoon" <[email protected]> wrote:

    Thank you so much Pietro


    May I ask one more question?


    1) This is just a double-checking question. I checked the folder
    where the relaxation was stopped by the cluster due to the wall
    time limit (I didn't set max_seconds). In the output folder, I can
    see a prefix.save/ folder and a pwscf.save/ folder, and the
    prefix.update and prefix.bfgs files. But because I have neither a
    prefix.EXIT folder nor a prefix.EXIT file, I cannot restart this
    simulation. Is this correct?


    2) Now I'm setting "max_seconds" in all my QE DFT calculations.
    But I found that CPU time and wall time are slightly different.
    At the end of the output file of a recently finished calculation,
    it is written:

    PWSCF        :  4d21h44m CPU   4d22h12m WALL
       This run was terminated on:  21:27:26   1Jul2019

    I used 30 cores and set 7 days of wall time. The simulation
    finished before the wall time limit, but I'm not sure why this
    slight difference between CPU time and wall time occurs.
    In this case, what would be a good value of max_seconds in CPU
    time compared to the wall time? If I request 7 days of wall time
    from the cluster, would it be "safer" to set 6 days or 6.5 days
    of CPU time for max_seconds?


    3) This is also a double-checking question. If I wish to run the
    restart in the same folder, I had better change the prefix from
    "SimulationA" to "SimulationA_restart1"; would that be OK? Also,
    if I wish to use a different folder, I need to copy all the files
    inside the prefix.EXIT folder to the new restart folder, is this
    correct?


    Thank you so much for the friendly answers to my beginner questions!!

    ------------------------------------------------------------------------
    *From:* users <[email protected]> on behalf
    of Pietro Davide Delugas <[email protected]>
    *Sent:* Monday, July 8, 2019 4:14:06 AM
    *To:* [email protected]
    *Subject:* Re: [QE-users] Question about restarting relaxation jobs
    Hello

    1) and 2) PW writes the restart files only when it terminates
    before convergence is reached: either because the maximum number
    of steps has been reached (this may be either the number of
    electronic steps during scf or the number of ionic steps during
    structural relaxation), or because the execution time exceeds the
    max_seconds specified in input, or because the user has stopped
    the calculation by creating a file called prefix.EXIT in the
    outdir.

    If restart_mode in &control is set to "restart", pw will try to
    restart the relaxation from the last positions saved in the
    prefix.save directory, using the last saved charge density and
    wave functions. If it finds the restart files, it will use them
    too. This mechanism works fine if positions, charge density and
    wave functions have been saved regularly, but if the calculation
    is stopped abruptly, for example by the job manager, there is no
    way to prevent the stop from arriving while the program is writing
    these data. The safer way to go when you are using a job manager
    is to set the max_seconds variable to a number consistently lower
    than the time allocated by the job manager. The difference between
    these two times should be enough to allow the program to pass
    through one of the check_points at which, during execution, it
    checks whether the execution time has exceeded max_seconds or the
    user has created a prefix.EXIT file. To estimate how large the
    difference between max_seconds and the scheduled execution time
    should be, check how long it takes the program to complete an scf
    loop. This will be a very safe estimate; you could reduce this
    time significantly and things should still work.



    3) I don't understand what you want to do. You create the
    prefix.EXIT file when you want to stop your calculation and you
    want it to finish smoothly, saving all restart information so that
    it can restart from more or less the same point at which it was
    interrupted. It makes no sense to rename the output file to
    prefix.EXIT, because that will make the program stop as soon as a
    check_point detects the file, and the file will be deleted. The
    only things that you have to do when restarting a calculation are:

      *   Specify restart_mode = 'restart' in the input.in file.

      *   Take care that the information saved in output.out is not
        overwritten by the new execution: either use something like
        mpirun pw.x < input.in >> output.out, which appends the new
        output to the old one, or redirect the output to a file with a
        different name.
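
The two steps above can be sketched in shell. The file name `restart_demo.in` and the &control fragment written into it are hypothetical stand-ins for a real input file (the fragment is not a complete pw.x input), and the launch command is left as a comment so the sketch runs without QE installed.

```shell
#!/bin/sh
# Hypothetical &control fragment standing in for the one in the real input file.
cat > restart_demo.in <<'EOF'
&control
   calculation = 'relax'
   restart_mode = 'from_scratch'
   prefix = 'SimulationA'
   outdir = './out'
/
EOF

# Switch restart_mode to 'restart'; every other variable is left untouched.
sed "s/restart_mode *= *'from_scratch'/restart_mode = 'restart'/" \
    restart_demo.in > restart_demo.tmp
mv restart_demo.tmp restart_demo.in

# Show the edited line.
grep restart_mode restart_demo.in

# Then launch as usual, appending (>>) so the output of the interrupted run is kept:
#   mpirun pw.x < input.in >> output.out
```

Using `>` instead of `>>` in the last command would truncate output.out and lose the record of the previous run.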

    4) outdir must be the same. If you want to use a different one,
    you have to create the new outdir before restarting and copy into
    it all the data of the previous calculation, i.e. the prefix.save
    directory.
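
A minimal sketch of moving to a different outdir, as described in point 4. The prefix `SimulationA` and both directory names are assumptions for illustration, and the empty data-file-schema.xml is only a placeholder for the data a real run leaves behind.

```shell
#!/bin/sh
# Assumed names for illustration; replace them with your own prefix and directories.
prefix="SimulationA"
old_outdir="./demo_old_outdir"
new_outdir="./demo_new_outdir"

# Stand-in for the data left by the previous run (an empty placeholder file only).
mkdir -p "$old_outdir/${prefix}.save"
: > "$old_outdir/${prefix}.save/data-file-schema.xml"

# Create the new outdir before restarting and copy the whole .save directory into it.
mkdir -p "$new_outdir"
cp -r "$old_outdir/${prefix}.save" "$new_outdir/"

# List what the restart run will find in the new outdir.
ls "$new_outdir/${prefix}.save"
```

The outdir variable in the restart input then has to point to the new directory, while the prefix stays the same.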


    5) don't complicate things too much



    Pietro



    On 7/6/19 3:59 PM, Yeon, Jejoon wrote:

        Hello


        I have very little experience using QE, so please
        excuse my beginner questions. I'm about to start the relaxation
        of a big crystal structure, and I wish to make my QE relaxation
        jobs ready for restart. Here are my questions:


        1) According to the "restarting" section of the manual
        (https://www.quantum-espresso.org/Doc/pw_user_guide/node20.html),
        it seems that QE does not create a dedicated restart file.
        Is this correct?


        2) If I set the "max_seconds" option to 604800 seconds (1
        week) and request 1 week of wall time from the server, will my
        calculation jobs be ready to restart after 1 week? (1 week is
        just an example, but our server cluster has a maximum
        walltime limit, and I don't think any of my relaxation
        jobs will finish within that time.) Also, is the
        "max_seconds" option required in order to restart?

        3) When I execute QE in the submit script, I use something
        like:
        mpirun pw.x  < input.in > output.out
        In this case, if the relaxation job is killed due to the wall
        time limit (without setting max_seconds), can I just rename
        output.out to prefix.EXIT (of course I set the prefix in the
        input file), then include restart_mode = "restart" in the
        input file, and submit a job for restart?
        I have old jobs that ended on reaching the wall time limit
        without the "max_seconds" option, and I'm curious whether I
        can use those files to restart.

        4) I also use the outdir option in the input file. Should the
        outdir option be the same when restarting?

        5) Are there any other things or useful hints that I need to
        consider when restarting?

        Thank you


        _______________________________________________
        Quantum ESPRESSO is supported by MaX (www.max-centre.eu/quantum-espresso)
        users mailing list [email protected]
        https://lists.quantum-espresso.org/mailman/listinfo/users




