Hi,
A Quantum ESPRESSO multi-node, multi-process MPI job was terminated
with the following messages in the log file:


     total cpu time spent up to now is    63540.4 secs

     total energy              =  -14004.61932175 Ry
     Harris-Foulkes estimate   =  -14004.73511665 Ry
     estimated scf accuracy    <       0.84597958 Ry

     iteration #  7     ecut=    48.95 Ry     beta= 0.70
     Davidson diagonalization with overlap
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

  HNP daemon   : [[7952,0],0] on node compute-0-0
  Remote daemon: [[7952,0],1] on node compute-0-1

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------




The Slurm script for that job is:

#!/bin/bash
#SBATCH --job-name=myQE
#SBATCH --output=mos2.rlx.out
#SBATCH --ntasks=14
#SBATCH --mem-per-cpu=17G
#SBATCH --nodes=6
#SBATCH --partition=QUARTZ
#SBATCH --account=z5
mpirun pw.x -i mos2.rlx.in


The job is running on Slurm 18.08 and Rocks 7 with its default OpenMPI
2.1.1.

Other jobs with OpenMPI, Slurm, and QE run fine. So, I would like to know how
I can narrow my search to find the root cause of this specific failure. For
example, I don't know whether the QE calculation had diverged or not. Is there
any way to find more information about that?
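So far the only concrete checks I can think of are along these lines (the job
ID below is a placeholder, the node name is the one from the ORTE message, and
I am not sure these are the right things to look at):

# Follow the SCF convergence history in the QE output, to see whether
# the calculation was drifting before the crash:
grep "estimated scf accuracy" mos2.rlx.out

# Ask Slurm what it recorded for the job (state, exit code, peak memory, nodes):
sacct -j <jobid> --format=JobID,State,ExitCode,MaxRSS,NodeList

# Check the state of the remote node named in the ORTE message:
scontrol show node compute-0-1

# Look for out-of-memory kills in the kernel log on that node:
ssh compute-0-1 'dmesg -T | grep -i oom'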

Any idea?

Regards,
Mahmood