Hello,
I've tried the BLCR checkpoint/restart plugin for SLURM (slurm-2.2.0) and
ran into several problems:
1) When I activated the checkpoint/blcr plugin by adding the appropriate
line to slurm.conf (CheckpointType=checkpoint/blcr) and restarted the
slurm daemons, all jobs (running and pending) disappeared from the
queue. Is this the expected behaviour when a checkpoint plugin is
activated?
2) I tried to checkpoint and restart a simple job (submitted via
sbatch) with a step like this:
srun_cr --mpi=none sleep 1000
It was successful.
But I could not checkpoint an OpenMPI job that calls MPI_Init(),
sleep(1000) and MPI_Finalize(): dmesg showed a lot of call traces from
the blcr modules and all job processes hung (the kernel and the node
stayed alive). The test program was linked against the openmpi-1.5
library, which was built with blcr support.
I saw in the slurm docs that the blcr plugin was verified with mvapich2.
Has anybody tried it with the openmpi library?
I am going to build the mvapich2 library. How should I run my job steps
to be able to checkpoint them: via srun_cr, srun or mpirun?
Thanks.
--
Best regards, Dennis.