Hi all,

Thank you for your reply.

Danny: I have installed BLCR and SLURM successfully. I have also configured CheckpointType, --checkpoint, --checkpoint-dir and JobCheckpointDir so that Slurm supports checkpointing.
I have tried many times to checkpoint a simple MPI parallel application on my small cluster. As you said, after the checkpoint completes there is a directory named after the job ID in --checkpoint-dir, and in that directory there is a file named "script.ckpt". I tried to restart directly with the srun command below:

srun --mpi=pmi2 --restart-dir=/mirror/source/cr/51 ./mm.o

where --restart-dir is the directory that contains "script.ckpt". Unfortunately, I got the following error:

Failed to open(/mirror/source/cr/51/task.0.ckpt, O_RDONLY): No such file or directory
srun: error: compute-node: task 0: Exited with exit code 255

As the error message shows, there is no "task.0.ckpt" file, and I don't know how to get one. The only files I got from the checkpoint operation are "script.ckpt" in --checkpoint-dir and two files in JobCheckpointDir named "<jobid>.ckpt" and "<jobid>.ckpt.old".

According to the srun section of http://slurm.schedmd.com/checkpoint_blcr.html, after a checkpoint completes there should be checkpoint files of the form "<jobid>.ckpt" and "<jobid>.<stepid>.ckpt" in --checkpoint-dir.

Any idea how to solve this?

Manuel: Yes, BLCR doesn't support checkpoint/restart of parallel/distributed applications by itself (https://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#mpi), but it can be used by other software to do that (I hope that software is SLURM..huhu). I have tried to restart an MPI application using DMTCP, but it didn't work. Would you please tell me how to do that?
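
To make the missing-file situation above easier to reproduce, here is a minimal shell sketch that reports which checkpoint image files are present in a restart directory. The function name check_restart_dir is hypothetical (it is not part of Slurm or BLCR), and the file names "script.ckpt" and "task.<n>.ckpt" are simply the ones that appear in the directory listing and the error message above; I am assuming one "task.<n>.ckpt" per task, numbered from 0.

```shell
#!/bin/sh
# Hypothetical helper, not part of Slurm or BLCR.
# Usage: check_restart_dir <restart-dir> [number-of-tasks]
# Reports whether "script.ckpt" and each "task.<n>.ckpt" exist in <restart-dir>.
# Returns 0 only if every task.<n>.ckpt file is present.
check_restart_dir() {
    dir=$1
    ntasks=${2:-1}
    status=0

    # Image of the batch script, as seen in --checkpoint-dir above.
    if [ -f "$dir/script.ckpt" ]; then
        echo "found: script.ckpt"
    else
        echo "missing: script.ckpt"
    fi

    # Per-task images, the files srun --restart-dir tried to open above.
    n=0
    while [ "$n" -lt "$ntasks" ]; do
        if [ -f "$dir/task.$n.ckpt" ]; then
            echo "found: task.$n.ckpt"
        else
            echo "missing: task.$n.ckpt"
            status=1
        fi
        n=$((n + 1))
    done

    return "$status"
}
```

For the case above it would be called as check_restart_dir /mirror/source/cr/51 1. A non-zero return value means at least one per-task image is absent, which matches the situation where srun --restart-dir fails to open "task.0.ckpt".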

Thank you in advance,

Regards,

Husen

On Thu, Apr 14, 2016 at 12:03 PM, Danny Rotscher <[email protected]> wrote:

> I forgot to add something: you have to create a directory for the
> checkpoint metadata, which by default is located in /var/slurm/checkpoint:
> mkdir -p /var/slurm/checkpoint
> chown -R slurm /var/slurm
> or you define your own directory in slurm.conf:
> JobCheckpointDir=<your directory>
>
> You can check the parameters with:
> scontrol show config | grep checkpoint
>
> Kind regards,
> Danny
> TU Dresden
> Germany
>
> On 14.04.2016 at 06:41, Danny Rotscher wrote:
>
>> Hello,
>>
>> we haven't got it to work either, but we have already built Slurm with BLCR.
>>
>> You first have to install the BLCR library, which is described on the
>> following website:
>> https://upc-bugs.lbl.gov/blcr/doc/html/BLCR_Admin_Guide.html
>>
>> Then we built and installed Slurm from source, and BLCR checkpointing
>> was included.
>>
>> After that you have to set at least one parameter in the file
>> "slurm.conf":
>> CheckpointType=checkpoint/blcr
>>
>> There are two ways to create a checkpoint. You can either create a
>> checkpoint with the following command from outside your job:
>> scontrol checkpoint create <jobid>
>> or you can let Slurm take periodic checkpoints with the following
>> sbatch parameter:
>> #SBATCH --checkpoint <minutes>
>> We also tried:
>> #SBATCH --checkpoint <minutes>:<seconds>
>> e.g.
>> #SBATCH --checkpoint 0:10
>> to test it, but it doesn't work for us.
>>
>> We also set the parameter for the checkpoint directory:
>> #SBATCH --checkpoint-dir <directory>
>>
>> After you create a checkpoint and a directory named after your job ID
>> has been created in your checkpoint directory, you can restart the job
>> with the following command:
>> scontrol checkpoint restart <jobid>
>>
>> We tested some sequential and OpenMP programs with different parameters
>> and it works (checkpoint creation and restarting),
>> but *we don't get any mpi library to work*; we have already tested some
>> programs built with openmpi and intelmpi.
>> The checkpoint is created, but we get the following error when we
>> want to restart them:
>> - Failed to open file '/'
>> - cr_restore_all_files [28534]: Unable to restore fd 3 (type=1,err=-21)
>> - cr_rstrt_child [28534]: Unable to restore files! (err=-21)
>> Restart failed: Is a directory
>> srun: error: taurusi4010: task 0: Exited with exit code 21
>>
>> So it would be great if you could confirm our problems; maybe then
>> SchedMD will raise the priority of such mails ;-)
>> If you get it to work, please help us to understand how.
>>
>> Kind regards,
>> Danny
>> TU Dresden
>> Germany
>>
>> On 11.04.2016 at 10:09, Husen R wrote:
>>
>>> Hi all,
>>>
>>> Based on the information in this link
>>> http://slurm.schedmd.com/checkpoint_blcr.html,
>>> Slurm is able to checkpoint whole batch jobs and then restart execution
>>> of batch jobs and job steps from checkpoint files.
>>>
>>> Could anyone please tell me how to do that?
>>> I need help.
>>>
>>> Thank you in advance.
>>>
>>> Regards,
>>>
>>>
>>> Husen Rusdiansyah
>>> University of Indonesia
>>>
>>
>>
>
