
We have not got it to work either, but we have already built Slurm with BLCR.

First you have to install the BLCR library; the installation is described on the following website:

Then we built and installed Slurm from source with BLCR checkpointing included.

After that you have to set at least one parameter in the file "slurm.conf":

There are two ways to create checkpoints. You can either create a checkpoint with the following command from outside your job:
scontrol checkpoint create <jobid>
or you can let Slurm create periodic checkpoints with the following sbatch parameter:
#SBATCH --checkpoint <minutes>
We also tried:
#SBATCH --checkpoint <minutes>:<seconds>
#SBATCH --checkpoint 0:10
to test it, but that did not work for us.

We also set the parameter for the checkpoint directory:
#SBATCH --checkpoint-dir <directory>
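
As a rough sketch, the two sbatch parameters above could be combined in one batch script like this. The 10-minute interval, the directory path, the function name run_steps and the variables STEPS and STEP_SLEEP are placeholders we made up for illustration. The loop only stands in for a sequential program.

```shell
#!/bin/sh
# Periodic checkpoint interval in minutes (placeholder value).
#SBATCH --checkpoint 10
# Directory for the checkpoint files (placeholder path).
#SBATCH --checkpoint-dir /path/to/checkpoint-dir

# Hypothetical stand-in for a sequential program: one line per step.
# $1 = number of steps, $2 = seconds to sleep between steps.
run_steps() {
    n=$1
    pause=$2
    i=1
    while [ "$i" -le "$n" ]; do
        echo "step $i of $n"
        sleep "$pause"
        i=$((i + 1))
    done
}

# The defaults are tiny so the script also finishes quickly outside Slurm;
# raise STEPS and STEP_SLEEP so the job lives long enough to be checkpointed.
run_steps "${STEPS:-3}" "${STEP_SLEEP:-1}"
```

For a real test the job has to run longer than the checkpoint interval, so the loop would need far more steps than the defaults shown here.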

After you have created a checkpoint and a directory named after your jobid has appeared in your checkpoint directory, you can restart the job with the following command:
scontrol checkpoint restart <jobid>
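
The restart step could be wrapped in a small helper that first checks for the jobid directory described above. This is only a sketch: the function names checkpoint_ready and restart_from_checkpoint and the SCONTROL override are our own inventions, not part of Slurm. Only the "scontrol checkpoint restart <jobid>" call itself comes from the command above.

```shell
#!/bin/sh
# Command used to talk to Slurm; overridable (hypothetical) for dry runs,
# e.g. SCONTROL=echo prints the arguments instead of running scontrol.
SCONTROL=${SCONTROL:-scontrol}

# Succeeds if <checkpoint-dir>/<jobid> exists as a directory.
# $1 = checkpoint directory, $2 = jobid
checkpoint_ready() {
    [ -d "$1/$2" ]
}

# Restart a job only if its checkpoint directory is present.
# $1 = checkpoint directory, $2 = jobid
restart_from_checkpoint() {
    if checkpoint_ready "$1" "$2"; then
        $SCONTROL checkpoint restart "$2"
    else
        echo "no checkpoint directory $1/$2" >&2
        return 1
    fi
}
```

After sourcing the file, a call would look like "restart_from_checkpoint /path/to/checkpoint-dir <jobid>", with the same directory that was passed to --checkpoint-dir.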

We tested some sequential and OpenMP programs with different parameters, and both checkpoint creation and restarting work. However, *we have not got any MPI library to work*; we have already tested programs built with Open MPI and Intel MPI. The checkpoint is created, but we get the following error when we try to restart the job:
- Failed to open file '/'
- cr_restore_all_files [28534]:  Unable to restore fd 3 (type=1,err=-21)
- cr_rstrt_child [28534]:  Unable to restore files!  (err=-21)
Restart failed: Is a directory
srun: error: taurusi4010: task 0: Exited with exit code 21

So it would be great if you could confirm our problems; maybe then SchedMD will raise the priority of such mails ;-)
If you get it to work, please help us to understand how.

Kind regards,
TU Dresden

On 11.04.2016 at 10:09, Husen R wrote:
Hi all,

Based on the information in this link,
Slurm is able to checkpoint whole batch jobs and then restart execution of
batch jobs and job steps from checkpoint files.

Can anyone please tell me how to do that?
I need help.

Thank you in advance.


Husen Rusdiansyah
University of Indonesia
