A Quantum ESPRESSO multi-node, multi-process MPI job was terminated
with the following messages in the log file:

     total cpu time spent up to now is    63540.4 secs

     total energy              =  -14004.61932175 Ry
     Harris-Foulkes estimate   =  -14004.73511665 Ry
     estimated scf accuracy    <       0.84597958 Ry

     iteration #  7     ecut=    48.95 Ry     beta= 0.70
     Davidson diagonalization with overlap
ORTE has lost communication with a remote daemon.

  HNP daemon   : [[7952,0],0] on node compute-0-0
  Remote daemon: [[7952,0],1] on node compute-0-1

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.

The Slurm script for the job is:

#SBATCH --job-name=myQE
#SBATCH --output=mos2.rlx.out
#SBATCH --ntasks=14
#SBATCH --mem-per-cpu=17G
#SBATCH --nodes=6
#SBATCH --partition=QUARTZ
#SBATCH --account=z5
mpirun pw.x -i mos2.rlx.in

The job is running on Slurm 18.08 and Rocks 7 with its default OpenMPI.

Other jobs with OpenMPI, Slurm, and QE run fine, so I want to know how
I can narrow my search to find the root cause of this specific problem.
For example, I don't know whether the QE calculation had diverged or
not. Is there any way to find more information about that?
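One thing I have tried to check the divergence question is pulling the per-iteration "estimated scf accuracy" values out of the QE output and watching whether they keep shrinking (a minimal sketch; `sample.out` stands in for the real `mos2.rlx.out`, and the two sample lines are made up for illustration):

```shell
# Sketch: list the "estimated scf accuracy" value from each SCF
# iteration; a sequence that stops decreasing suggests the SCF
# loop was struggling or diverging before the crash.
cat > sample.out <<'EOF'
     estimated scf accuracy    <       2.13400000 Ry
     estimated scf accuracy    <       0.84597958 Ry
EOF
grep 'estimated scf accuracy' sample.out | awk '{print NR, $5}'
```

In my run the values were still decreasing up to iteration 7, which makes me suspect the node/daemon side rather than the physics, but I am not sure.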

Any idea?
