I've found two things: first, you could try srun_cr instead of srun; second, does your job run for more than 5 minutes?
But I'm not sure, so you may try it and post the result.

Hello Danny,

I have tried to restart using "scontrol checkpoint restart <jobid>", but it
does not work.
In addition, the "<jobid>.0" directory and its contents do not exist in my
The following is my batch job:

=====================batch job===================

#SBATCH -o mm-%j.out
#SBATCH -A pro
#SBATCH -n 24
#SBATCH --checkpoint=5
#SBATCH --checkpoint-dir=/mirror/source/cr
#SBATCH --time=01:30:00
#SBATCH --mail-user=hus...@gmail.com
#SBATCH --mail-type=begin
#SBATCH --mail-type=end

srun --mpi=pmi2 ./mm.o

===================end batch job================

Is there something that prevents me from getting the right directory
structure?



Usually, the directory specified by --checkpoint-dir should have
the following structure:
|__ script.ckpt
|__ <jobid>.0
      |__ task.0.ckpt
      |__ task.1.ckpt
      |__ ...

But you only have to run the following command to restart your batch job:
scontrol checkpoint restart <jobid>
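
Before attempting a restart, it may help to confirm that the checkpoint
directory really matches the layout above. The following is only a sketch:
the function name check_ckpt_layout is made up for illustration, and it
assumes the layout is exactly as described in this message (script.ckpt next
to a <jobid>.0 directory holding task.N.ckpt files), which may differ
between Slurm versions.

```shell
# check_ckpt_layout DIR JOBID
# Hypothetical helper: reports whether DIR looks like the layout described
# above. Returns 0 if everything expected is present, 1 otherwise.
check_ckpt_layout() {
    dir=$1
    jobid=$2
    status=0

    # The batch script image is expected directly inside DIR.
    if [ ! -f "$dir/script.ckpt" ]; then
        echo "missing: $dir/script.ckpt"
        status=1
    fi

    # The first job step is expected in a "<jobid>.0" subdirectory.
    stepdir="$dir/$jobid.0"
    if [ ! -d "$stepdir" ]; then
        echo "missing: $stepdir/"
        status=1
    else
        # Count per-task images (task.0.ckpt, task.1.ckpt, ...).
        count=0
        for f in "$stepdir"/task.*.ckpt; do
            if [ -f "$f" ]; then
                count=$((count + 1))
            fi
        done
        if [ "$count" -eq 0 ]; then
            echo "missing: $stepdir/task.N.ckpt"
            status=1
        fi
    fi

    return "$status"
}
```

For example, "check_ckpt_layout /mirror/source/cr/51 51" would print one
"missing: ..." line per absent item and return a non-zero status if anything
is absent.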

I have only tried batch jobs so far, and I am currently trying to build
MVAPICH2 with BLCR and Slurm support, because that MPI library is
explicitly mentioned in the Slurm documentation.

A colleague also tested DMTCP, but without success.

Kind regards
TU Dresden

Hi all,
Thank you for your reply.

Danny :
I have installed BLCR and Slurm successfully.
I have also configured CheckpointType, --checkpoint, --checkpoint-dir and
JobCheckpointDir so that Slurm supports checkpointing.

I have tried to checkpoint a simple MPI parallel application many times on
my small cluster and, as you said, after the checkpoint is completed there
is a directory named after the jobid in --checkpoint-dir. In that directory
there is a file named "script.ckpt". I tried to restart directly using the
srun command below:

srun --mpi=pmi2 --restart-dir=/mirror/source/cr/51 ./mm.o

where --restart-dir is the directory that contains "script.ckpt".
Unfortunately, I got the following error :

Failed to open(/mirror/source/cr/51/task.0.ckpt, O_RDONLY): No such file
srun: error: compute-node: task 0: Exited with exit code 255

As the error message above shows, there was no "task.0.ckpt" file. I don't
know how to get such a file. The files that I got from the checkpoint
operation are a file named "script.ckpt" in --checkpoint-dir and
two files in JobCheckpointDir named "<jobid>.ckpt" and "<jobid>.ckpt.old".
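
To compare what a checkpoint actually produced with what is expected, the
checkpoint files in both directories can be listed in one go. This is just a
sketch; list_ckpt_files is a hypothetical helper name, and it only looks for
files ending in ".ckpt" or ".ckpt.old", as named in this message.

```shell
# list_ckpt_files DIR...
# Hypothetical helper: prints every *.ckpt and *.ckpt.old file found under
# the given directories, one path per line, in a stable sorted order.
list_ckpt_files() {
    find "$@" -type f \( -name '*.ckpt' -o -name '*.ckpt.old' \) | LC_ALL=C sort
}
```

It would be called with the --checkpoint-dir path and the JobCheckpointDir
path as arguments, for example "list_ckpt_files /mirror/source/cr <your
JobCheckpointDir>".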

According to the srun section of
http://slurm.schedmd.com/checkpoint_blcr.html, after the checkpoint is
completed there should be checkpoint files of the form "<jobid>.ckpt" and
"<jobid>.<stepid>.ckpt" in --checkpoint-dir.

Any ideas on how to solve this?

Manuel :

Yes, BLCR doesn't support checkpoint/restart of parallel/distributed
applications by itself (
But it can be used by other software to do that (I hope the software is

I have tried to restart an MPI application using DMTCP, but it doesn't
Would you please tell me how to do that?

Thank you in advance,



I forgot to add something: you have to create a directory for the
checkpoint metadata, which is by default located in
mkdir -p /var/slurm/checkpoint
chown -R slurm /var/slurm
or you define your own directory in slurm.conf:
JobCheckpointDir=<your directory>

You can check the parameters with:
scontrol show config | grep -i checkpoint
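
To pick out just the two settings mentioned in this thread, the output can
also be filtered by key name. This is a minimal sketch, assuming that
"scontrol show config" prints one "Key = value" pair per line;
show_ckpt_settings is a hypothetical helper name.

```shell
# show_ckpt_settings
# Hypothetical helper: reads `scontrol show config` output on stdin and
# prints only the CheckpointType and JobCheckpointDir entries as Key=value.
show_ckpt_settings() {
    awk -F'[ \t]*=[ \t]*' '
        $1 == "CheckpointType" || $1 == "JobCheckpointDir" {
            print $1 "=" $2
        }'
}
```

Usage would be "scontrol show config | show_ckpt_settings". If either line
is missing from the output, that parameter was not reported by scontrol.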

Kind regards,
TU Dresden

We have not gotten it to work either, but we have already built Slurm with BLCR.

You first have to install the BLCR library, which is described on the
following website:

Then we built and installed Slurm from source, with BLCR checkpointing
included.

After that you have to set at least one parameter in the file

There are two ways to create a checkpoint. You can either create one
with the following command from outside your job:
scontrol checkpoint create <jobid>
or you can let Slurm create periodic checkpoints with the following
sbatch parameter:
#SBATCH --checkpoint <minutes>
We also tried:
#SBATCH --checkpoint <minutes>:<seconds>
#SBATCH --checkpoint 0:10
to test it, but it doesn't work for us.
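
One thing that can be checked up front is whether the checkpoint interval is
shorter than the job's time limit, since otherwise no periodic checkpoint
can happen before the job ends. The sketch below is an illustration only:
to_seconds and interval_fits are hypothetical helper names, and only the
"minutes", "minutes:seconds" and "hours:minutes:seconds" forms that appear
in this thread are handled (anything else, such as a days-hours form, is
rejected).

```shell
# to_seconds TIME
# Hypothetical helper: converts "minutes", "minutes:seconds" or
# "hours:minutes:seconds" to seconds. Fails on any other form.
to_seconds() {
    printf '%s\n' "$1" | awk -F: '
        /[^0-9:]/ || NF < 1 || NF > 3 { exit 1 }   # unsupported form
        NF == 1 { print $1 * 60 }
        NF == 2 { print $1 * 60 + $2 }
        NF == 3 { print $1 * 3600 + $2 * 60 + $3 }'
}

# interval_fits INTERVAL LIMIT
# Succeeds (status 0) if INTERVAL is positive and shorter than LIMIT.
# Returns 2 if either value cannot be parsed.
interval_fits() {
    interval=$(to_seconds "$1") || return 2
    limit=$(to_seconds "$2") || return 2
    [ "$interval" -gt 0 ] && [ "$interval" -lt "$limit" ]
}
```

For instance, "interval_fits 5 01:30:00" compares a 5-minute interval with a
90-minute limit. Note that the job's actual run time, not only its limit,
also has to exceed the interval for a periodic checkpoint to be written.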

We also set the parameter for the checkpoint directory:
#SBATCH --checkpoint-dir <directory>

After you have created a checkpoint and your checkpoint directory contains
a directory named after your jobid, you can restart the job with the
following command:
scontrol checkpoint restart <jobid>

We tested some sequential and OpenMP programs with different parameters
and it works (checkpoint creation and restarting),
but *we have not gotten any MPI library to work*. We have already tested
some programs built with Open MPI and Intel MPI.
The checkpoint is created, but we get the following error when we
try to restart them:
- Failed to open file '/'
- cr_restore_all_files [28534]:  Unable to restore fd 3 (type=1,err=-21)
- cr_rstrt_child [28534]:  Unable to restore files!  (err=-21)
Restart failed: Is a directory
srun: error: taurusi4010: task 0: Exited with exit code 21

So it would be great if you could confirm our problems; maybe then
SchedMD will raise the priority of such mails ;-)
If you get it to work, please help us to understand how.

Kind regards,
TU Dresden

Hi all,
Based on the information in this link
Slurm is able to checkpoint whole batch jobs and then restart
batch jobs and job steps from checkpoint files.

Could anyone please tell me how to do that?
I need help.

Thank you in advance.


Husen Rusdiansyah
University of Indonesia

