Hello, we have not gotten it to work either, but we have already built Slurm with BLCR support.
You first have to install the BLCR library, which is described on the following website:
https://upc-bugs.lbl.gov/blcr/doc/html/BLCR_Admin_Guide.html

Then we built and installed Slurm from source, with BLCR checkpointing included.
After that, you have to set at least one parameter in the file "slurm.conf":

CheckpointType=checkpoint/blcr

There are two ways to create checkpoints. You can either create a checkpoint from outside your job with the following command:
scontrol checkpoint create <jobid>

or you can let Slurm take periodic checkpoints with the following sbatch parameter:
#SBATCH --checkpoint <minutes>

We also tried

#SBATCH --checkpoint <minutes>:<seconds>

(e.g. #SBATCH --checkpoint 0:10) to test it, but that does not work for us. We also set the parameter for the checkpoint directory:

#SBATCH --checkpoint-dir <directory>

After you have created a checkpoint, and a directory named after your job ID has been created in your checkpoint directory, you can restart the job with the following command:
scontrol checkpoint restart <jobid>

We tested some sequential and OpenMP programs with different parameters, and both checkpoint creation and restarting work. However, *we have not gotten any MPI library to work*; we have already tested programs built with Open MPI and Intel MPI. The checkpoint is created, but we get the following error when we try to restart the job:
- Failed to open file '/'
- cr_restore_all_files [28534]: Unable to restore fd 3 (type=1,err=-21)
- cr_rstrt_child [28534]: Unable to restore files! (err=-21)
Restart failed: Is a directory
srun: error: taurusi4010: task 0: Exited with exit code 21

So it would be great if you could confirm our problems; maybe then SchedMD will raise the priority of such mails ;-)
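
For reference, the steps described above can be collected in one place. The following is only a minimal sketch, not our actual setup: the file name job.sh, the program ./my_program, the 10-minute interval, and the directory /scratch/checkpoints are placeholders chosen for illustration. The snippet just writes an example batch script; the scontrol commands are left as comments because they need a real job ID and a Slurm installation with checkpoint/blcr enabled.

```shell
#!/bin/sh
# Sketch only: write an example batch script that asks Slurm for periodic
# BLCR checkpoints. All names and values below are placeholders.
CKPT_DIR="${CKPT_DIR:-/scratch/checkpoints}"   # hypothetical checkpoint directory
CKPT_MINUTES="${CKPT_MINUTES:-10}"             # hypothetical interval in minutes

# Unquoted heredoc delimiter, so the two variables above are expanded.
cat > job.sh <<EOF
#!/bin/bash
#SBATCH --checkpoint $CKPT_MINUTES
#SBATCH --checkpoint-dir $CKPT_DIR
srun ./my_program
EOF

echo "wrote job.sh"

# Manual steps afterwards (need a working Slurm + BLCR installation):
#   sbatch job.sh                         # submit and note the job ID
#   scontrol checkpoint create <jobid>    # optional checkpoint from outside the job
#   scontrol checkpoint restart <jobid>   # restart from the checkpoint
```

As described above, this sequence works for us with sequential and OpenMP programs, while the restart step fails for MPI programs.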
If you get it to work, please help us to understand how.

Kind regards,
Danny
TU Dresden
Germany

On 11.04.2016 at 10:09, Husen R wrote:
Hi all,

Based on the information at http://slurm.schedmd.com/checkpoint_blcr.html, Slurm is able to checkpoint whole batch jobs and then restart execution of batch jobs and job steps from checkpoint files.

Can anyone please tell me how to do that? I need help.

Thank you in advance.

Regards,
Husen Rusdiansyah
University of Indonesia