Hi Danny, all,

As far as I know, BLCR unfortunately does not have MPI support. At least I haven't been able to get it working.
On the other hand, DMTCP ( http://dmtcp.sourceforge.net/ ) does work with MPI. My team is very interested in having a reliable checkpoint/restart mechanism in Slurm, so we are now working on a plugin to integrate it. We are facing some technical problems, but we are working together with the DMTCP team to solve them, and we are confident that the integration will be ready soon.

Anyway, I'll send a mail to this list when it's ready.

Cheers,
Manuel

2016-04-14 7:03 GMT+02:00 Danny Rotscher <[email protected]>:
> I forgot to add something: you have to create a directory for the checkpoint
> metadata, which is located by default in /var/slurm/checkpoint:
> mkdir -p /var/slurm/checkpoint
> chown -R slurm /var/slurm
> or you define your own directory in slurm.conf:
> JobCheckpointDir=<your directory>
>
> You can check the parameters with:
> scontrol show config | grep checkpoint
>
> Kind regards,
> Danny
> TU Dresden
> Germany
>
> On 14.04.2016 at 06:41, Danny Rotscher wrote:
>>
>> Hello,
>>
>> we haven't gotten it to work either, but we have already built Slurm
>> with BLCR.
>>
>> You first have to install the BLCR library, which is described on the
>> following website:
>> https://upc-bugs.lbl.gov/blcr/doc/html/BLCR_Admin_Guide.html
>>
>> Then we built and installed Slurm from source, and BLCR checkpointing
>> was included.
>>
>> After that you have to set at least one parameter in the file
>> "slurm.conf":
>> CheckpointType=checkpoint/blcr
>>
>> There are two ways to create checkpoints. You can either make a
>> checkpoint with the following command from outside your job:
>> scontrol checkpoint create <jobid>
>> or you can let Slurm take periodic checkpoints with the following
>> sbatch parameter:
>> #SBATCH --checkpoint <minutes>
>> We also tried:
>> #SBATCH --checkpoint <minutes>:<seconds>
>> e.g.
>> #SBATCH --checkpoint 0:10
>> to test it, but it doesn't work for us.
>>
>> We also set the parameter for the checkpoint directory:
>> #SBATCH --checkpoint-dir <directory>
>>
>> After you create a checkpoint, and a directory named after your jobid
>> has been created in your checkpoint directory, you can restart the job
>> with the following command:
>> scontrol checkpoint restart <jobid>
>>
>> We tested some sequential and OpenMP programs with different parameters
>> and it works (checkpoint creation and restarting),
>> but *we don't get any mpi library to work*; we have already tested some
>> programs built with Open MPI and Intel MPI.
>> The checkpoint is created, but we get the following error when we try
>> to restart them:
>> - Failed to open file '/'
>> - cr_restore_all_files [28534]: Unable to restore fd 3 (type=1,err=-21)
>> - cr_rstrt_child [28534]: Unable to restore files! (err=-21)
>> Restart failed: Is a directory
>> srun: error: taurusi4010: task 0: Exited with exit code 21
>>
>> So it would be great if you could confirm our problems; maybe then
>> SchedMD will raise the priority of such mails ;-)
>> If you get it to work, please help us to understand how.
>>
>> Kind regards,
>> Danny
>> TU Dresden
>> Germany
>>
>> On 11.04.2016 at 10:09, Husen R wrote:
>>>
>>> Hi all,
>>>
>>> Based on the information in this link
>>> http://slurm.schedmd.com/checkpoint_blcr.html,
>>> Slurm is able to checkpoint whole batch jobs and then restart execution
>>> of batch jobs and job steps from checkpoint files.
>>>
>>> Could anyone please tell me how to do that?
>>> I need help.
>>>
>>> Thank you in advance.
>>>
>>> Regards,
>>>
>>>
>>> Husen Rusdiansyah
>>> University of Indonesia
>>
>>
>
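
For anyone following along, the steps from the quoted mails can be put together roughly as below. This is only a sketch under stated assumptions: the program name ./my_app, the 10-minute interval and the checkpoint directory under $HOME are placeholders that do not come from the thread, and the --checkpoint and --checkpoint-dir options are used exactly as quoted above, so they presumably require a Slurm build with checkpoint/blcr enabled. The script only writes a batch script and prints the commands; it does not submit anything.

```shell
#!/bin/sh
# Sketch only: collects the steps quoted in this thread into one place.
# Placeholders (assumptions, not from the thread): ./my_app, the 10-minute
# interval, and the checkpoint directory under $HOME.

WORK_DIR="$(mktemp -d)"                  # scratch directory for the generated script
JOB_SCRIPT="$WORK_DIR/checkpoint_job.sh"
CKPT_DIR="${HOME:-/tmp}/checkpoints"

# Batch script using the sbatch options quoted in the thread.
cat > "$JOB_SCRIPT" <<EOF
#!/bin/bash
#SBATCH --checkpoint 10
#SBATCH --checkpoint-dir $CKPT_DIR
srun ./my_app
EOF

# Print the commands instead of running them; <jobid> is the id sbatch reports.
echo "mkdir -p $CKPT_DIR"
echo "sbatch $JOB_SCRIPT"
echo "scontrol checkpoint create <jobid>"
echo "scontrol checkpoint restart <jobid>"
```

Note that, per the report above, only sequential and OpenMP programs restarted successfully this way; for MPI programs the restart still failed with the "Is a directory" error.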
