Hello,
I tried the BLCR checkpoint/restart plugin for SLURM (slurm-2.2.0) and ran into several problems:

1) When I enabled the checkpoint/blcr plugin by adding the line CheckpointType=checkpoint/blcr to slurm.conf and restarted the SLURM daemons, all jobs (both running and pending) disappeared from the queue. Is that the expected behaviour when a checkpoint plugin is enabled?

2) I checkpointed and restarted a simple job (submitted via sbatch) with a step like this:
srun_cr --mpi=none sleep 1000
It was successful.
But I could not checkpoint an OpenMPI job consisting of MPI_Init(), sleep( 1000 ) and MPI_Finalize(): dmesg showed a lot of call traces from the blcr modules, and all processes of the job hung (the kernel and the node stayed alive). The test program was linked against the openmpi-1.5 library, and the latter was built with BLCR support.

I saw in the SLURM docs that the blcr plugin was verified with mvapich2. Has anybody tried it with the OpenMPI library? I am going to build the mvapich2 library. How should I run my job steps to be able to checkpoint them: via srun_cr, srun or mpirun?

Thanks.

--
Best regards, Dennis.
