Dear all,

I get the following error when I try to checkpoint an MPI application using an sbatch script:
srun_cr: error: slurm_receive_msgs: Socket timed out on send/recv operation
srun_cr: error: failed to checkpoint step tasks
- chkpt_watchdog: 'srun_cr' (tgid/pid 29156/29156) exited with signal 9 during checkpoint
Checkpoint cancelled by application: unable to checkpoint
slurmstepd: get_exit_code task 0 died by signal
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: *** STEP 494.0 ON head-node CANCELLED AT 2016-05-31T07:59:30 ***
srun: error: head-node: tasks 0-7: Terminated

This is my sbatch script:

#!/bin/bash
#SBATCH -J mm
#SBATCH -o output/mm%j.out
#SBATCH -A necis
#SBATCH -N 1
#SBATCH -n 8
#SBATCH --time=01:30:00
#SBATCH --checkpoint=6
#SBATCH --checkpoint-dir=output
#SBATCH [email protected]
#SBATCH --mail-type=begin
#SBATCH --mail-type=end
srun_cr --mpi=pmi2 ./mm.o

mm.o is just a simple MPI matrix multiplication program. I use mpich-3.2 on Ubuntu 14.04 with kernel 3.13.0-24-generic.

This is my mpich-3.2 configure command:

./configure --enable-checkpointing --prefix=/usr/local --with-slurm=/usr/local --with-blcr=/usr/local

And this is my Slurm configure command:

./configure --prefix=/usr/local --with-mysql_config=/usr/bin --with-blcr=/usr/local

Can anyone please tell me how to solve this?

Thank you in advance.

Regards,
Husen
