A Quantum ESPRESSO multi-node, multi-process MPI job was terminated
with the following messages in the log file:

     total cpu time spent up to now is    63540.4 secs

     total energy              =  -14004.61932175 Ry
     Harris-Foulkes estimate   =  -14004.73511665 Ry
     estimated scf accuracy    <       0.84597958 Ry

     iteration #  7     ecut=    48.95 Ry     beta= 0.70
     Davidson diagonalization with overlap
ORTE has lost communication with a remote daemon.

  HNP daemon   : [[7952,0],0] on node compute-0-0
  Remote daemon: [[7952,0],1] on node compute-0-1

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.

The Slurm script for the job is:

#SBATCH --job-name=myQE
#SBATCH --output=mos2.rlx.out
#SBATCH --ntasks=14
#SBATCH --mem-per-cpu=17G
#SBATCH --nodes=6
#SBATCH --partition=QUARTZ
#SBATCH --account=z5
mpirun pw.x -i mos2.rlx.in

The job is running on Slurm 18.08 and Rocks 7 with its default OpenMPI.

Other jobs with OpenMPI, Slurm, and QE run fine, so I want to know how
I can narrow my search to find the root cause of this specific problem.
For example, I don't know whether the QE calculation had diverged or
not. Is there any way to find more information about that?
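One thing I have tried to check the divergence question is pulling the per-iteration "estimated scf accuracy" values out of the QE output and watching whether they keep shrinking (a minimal sketch; `sample.out` stands in for the real `mos2.rlx.out`, and the two sample lines are made up for illustration):

```shell
# Sketch: list the "estimated scf accuracy" value from each SCF
# iteration; a sequence that stops decreasing suggests the SCF
# loop was struggling or diverging before the crash.
cat > sample.out <<'EOF'
     estimated scf accuracy    <       2.13400000 Ry
     estimated scf accuracy    <       0.84597958 Ry
EOF
grep 'estimated scf accuracy' sample.out | awk '{print NR, $5}'
```

In my run the values were still decreasing up to iteration 7, which makes me suspect the node/daemon side rather than the physics, but I am not sure.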

Any idea?
