Dear all,

I get the following error when I try to checkpoint an MPI application
from an sbatch script:

srun_cr: error: slurm_receive_msgs: Socket timed out on send/recv operation
srun_cr: error: failed to checkpoint step tasks
- chkpt_watchdog: 'srun_cr' (tgid/pid 29156/29156) exited with signal 9 during checkpoint
Checkpoint cancelled by application: unable to checkpoint
slurmstepd: get_exit_code task 0 died by signal
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: *** STEP 494.0 ON head-node CANCELLED AT 2016-05-31T07:59:30 ***
srun: error: head-node: tasks 0-7: Terminated


This is my sbatch script:

#!/bin/bash
#SBATCH -J mm
#SBATCH -o output/mm%j.out
#SBATCH -A necis
#SBATCH -N 1
#SBATCH -n 8
#SBATCH --time=01:30:00
#SBATCH --checkpoint=6
#SBATCH --checkpoint-dir=output
#SBATCH [email protected]
#SBATCH --mail-type=begin
#SBATCH --mail-type=end

srun_cr --mpi=pmi2 ./mm.o

mm.o is just a simple MPI matrix multiplication program.
I am using mpich-3.2 on Ubuntu 14.04 with kernel 3.13.0-24-generic.

This is my mpich-3.2 configure command:

./configure --enable-checkpointing --prefix=/usr/local \
    --with-slurm=/usr/local --with-blcr=/usr/local

And this is my Slurm configure command:

./configure --prefix=/usr/local --with-mysql_config=/usr/bin \
    --with-blcr=/usr/local

Could anyone please tell me how to solve this?
Thank you in advance.


Regards,

Husen
