Hello,

We have not gotten it fully working either, but we have already built Slurm with BLCR support.

You first have to install the BLCR library, which is described on the following website:
https://upc-bugs.lbl.gov/blcr/doc/html/BLCR_Admin_Guide.html

Then we built and installed Slurm from source, with BLCR checkpointing included.

After that you have to set at least one parameter in the file "slurm.conf":
CheckpointType=checkpoint/blcr

There are two ways to create checkpoints. You can either trigger a checkpoint from outside your job with the following command:
scontrol checkpoint create <jobid>
or you can let Slurm take periodic checkpoints with the following sbatch parameter:
#SBATCH --checkpoint <minutes>
We also tried:
#SBATCH --checkpoint <minutes>:<seconds>
e.g.
#SBATCH --checkpoint 0:10
to test it, but it doesn't work for us.

We also set the parameter for the checkpoint directory:
#SBATCH --checkpoint-dir <directory>
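
A minimal sketch of a batch script that combines the two options above. The job name, the 10-minute interval, the directory /scratch/ckpt and the program ./my_program are placeholders chosen for illustration, not values from our setup:

```shell
#!/bin/bash
#SBATCH --job-name=ckpt-test           # placeholder job name
#SBATCH --checkpoint 10                # periodic checkpoint every 10 minutes (example value)
#SBATCH --checkpoint-dir /scratch/ckpt # hypothetical directory; must be writable by the job

# ./my_program stands in for your own sequential or OpenMP binary.
srun ./my_program
```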

After you have created a checkpoint and a directory named after your job ID has appeared in your checkpoint directory, you can restart the job with the following command:
scontrol checkpoint restart <jobid>
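
As a rough sketch, the check for the job ID directory can be scripted before attempting the restart. The function names has_checkpoint and restart_if_checkpointed are hypothetical helpers, and the layout <checkpoint-dir>/<jobid> is assumed from the description above:

```shell
#!/bin/sh
# Sketch only: both function names are hypothetical helpers, not part of Slurm.

# Succeeds if <checkpoint-dir>/<jobid> exists as a directory.
has_checkpoint() {
    [ -d "$1/$2" ]
}

# Restart the job only if a checkpoint directory was found for it.
restart_if_checkpointed() {
    if has_checkpoint "$1" "$2"; then
        scontrol checkpoint restart "$2"
    else
        echo "no checkpoint found for job $2 in $1" >&2
        return 1
    fi
}
```

Example call, with a made-up directory and job ID: restart_if_checkpointed /scratch/ckpt 12345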

We tested some sequential and OpenMP programs with different parameters, and both checkpoint creation and restarting work. However, *we have not gotten any MPI library to work*; we have already tested programs built with Open MPI and Intel MPI. The checkpoint is created, but we get the following error when we try to restart them:
- Failed to open file '/'
- cr_restore_all_files [28534]:  Unable to restore fd 3 (type=1,err=-21)
- cr_rstrt_child [28534]:  Unable to restore files!  (err=-21)
Restart failed: Is a directory
srun: error: taurusi4010: task 0: Exited with exit code 21

So it would be great if you could confirm our problems; maybe then SchedMD will raise the priority of such mails ;-)
If you get it to work, please help us to understand how.

Kind regards,
Danny
TU Dresden
Germany

On 11.04.2016 at 10:09, Husen R wrote:
Hi all,

Based on the information in this link
http://slurm.schedmd.com/checkpoint_blcr.html,
Slurm is able to checkpoint whole batch jobs and then restart execution of
batch jobs and job steps from checkpoint files.

Could anyone please tell me how to do that?
I need help.

Thank you in advance.

Regards,


Husen Rusdiansyah
University of Indonesia
