There is a good tutorial on how to use DMTCP on the project's GitHub page: https://github.com/dmtcp/dmtcp/blob/master/QUICK-START.md
I would start there.

Anyway, this Slurm mailing list is probably not the best place to ask for that information.

Best regards,
Manuel

2016-04-14 11:01 GMT+02:00 Husen R <[email protected]>:
> Hi all,
> Thank you for your reply.
>
> Danny:
> I have installed BLCR and Slurm successfully.
> I have also configured CheckpointType, --checkpoint, --checkpoint-dir and
> JobCheckpointDir so that Slurm supports checkpointing.
>
> I have tried to checkpoint a simple MPI parallel application many times in
> my small cluster and, as you said, after the checkpoint completes there is
> a directory named after the job ID in --checkpoint-dir. That directory
> contains a file named "script.ckpt". I tried to restart directly using the
> srun command below:
>
> srun --mpi=pmi2 --restart-dir=/mirror/source/cr/51 ./mm.o
>
> where --restart-dir is the directory that contains "script.ckpt".
> Unfortunately, I got the following error:
>
> Failed to open(/mirror/source/cr/51/task.0.ckpt, O_RDONLY): No such file or directory
> srun: error: compute-node: task 0: Exited with exit code 255
>
> As the error message shows, there was no "task.0.ckpt" file.
> I don't know how to get such a file. The files I got from the checkpoint
> operation are a file named "script.ckpt" in --checkpoint-dir and two files in
> JobCheckpointDir named "<jobid>.ckpt" and "<jobid>.ckpt.old".
>
> According to the srun section of
> http://slurm.schedmd.com/checkpoint_blcr.html, after the checkpoint completes
> there should be checkpoint files of the form "<jobid>.ckpt" and
> "<jobid>.<stepid>.ckpt" in --checkpoint-dir.
>
> Any idea how to solve this?
>
> Manuel:
>
> Yes, BLCR doesn't support checkpoint/restart of parallel/distributed
> applications by itself
> (https://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#mpi). But it can be used by
> other software to do that (I hope that software is Slurm..huhu).
>
> I have tried to restart an MPI application using DMTCP, but it doesn't
> work.
> Would you please tell me how to do that?
>
> Thank you in advance,
>
> Regards,
>
> Husen
>
> On Thu, Apr 14, 2016 at 12:03 PM, Danny Rotscher
> <[email protected]> wrote:
>>
>> I forgot to add something: you have to create a directory for the
>> checkpoint metadata, which by default is located in /var/slurm/checkpoint:
>> mkdir -p /var/slurm/checkpoint
>> chown -R slurm /var/slurm
>> or you can define your own directory in slurm.conf:
>> JobCheckpointDir=<your directory>
>>
>> You can check the parameters with:
>> scontrol show config | grep checkpoint
>>
>> Kind regards,
>> Danny
>> TU Dresden
>> Germany
>>
>> On 14.04.2016 at 06:41, Danny Rotscher wrote:
>>>
>>> Hello,
>>>
>>> we haven't gotten it to work either, but we have already built Slurm
>>> with BLCR.
>>>
>>> You first have to install the BLCR library, as described on the
>>> following website:
>>> https://upc-bugs.lbl.gov/blcr/doc/html/BLCR_Admin_Guide.html
>>>
>>> Then we built and installed Slurm from source, and BLCR checkpointing
>>> was included.
>>>
>>> After that you have to set at least one parameter in the file
>>> "slurm.conf":
>>> CheckpointType=checkpoint/blcr
>>>
>>> There are two ways to create checkpoints: you can either create a
>>> checkpoint from outside your job with the following command:
>>> scontrol checkpoint create <jobid>
>>> or you can let Slurm take periodic checkpoints with the following
>>> sbatch parameter:
>>> #SBATCH --checkpoint <minutes>
>>> We also tried:
>>> #SBATCH --checkpoint <minutes>:<seconds>
>>> e.g.
>>> #SBATCH --checkpoint 0:10
>>> to test it, but it doesn't work for us.
>>>
>>> We also set the parameter for the checkpoint directory:
>>> #SBATCH --checkpoint-dir <directory>
>>>
>>> After you create a checkpoint and a directory named after your job ID
>>> has been created in your checkpoint directory, you can restart the job
>>> with the following command:
>>> scontrol checkpoint restart <jobid>
>>>
>>> We tested some sequential and OpenMP programs with different parameters
>>> and it works (checkpoint creation and restarting),
>>> but *we can't get any MPI library to work*; we have already tested some
>>> programs built with Open MPI and Intel MPI.
>>> The checkpoint is created, but we get the following error when we
>>> try to restart them:
>>> - Failed to open file '/'
>>> - cr_restore_all_files [28534]: Unable to restore fd 3 (type=1,err=-21)
>>> - cr_rstrt_child [28534]: Unable to restore files! (err=-21)
>>> Restart failed: Is a directory
>>> srun: error: taurusi4010: task 0: Exited with exit code 21
>>>
>>> So it would be great if you could confirm our problems; maybe then
>>> SchedMD will raise the priority of such mails ;-)
>>> If you get it to work, please help us understand how.
>>>
>>> Kind regards,
>>> Danny
>>> TU Dresden
>>> Germany
>>>
>>> On 11.04.2016 at 10:09, Husen R wrote:
>>>>
>>>> Hi all,
>>>>
>>>> Based on the information at
>>>> http://slurm.schedmd.com/checkpoint_blcr.html,
>>>> Slurm is able to checkpoint whole batch jobs and then restart execution
>>>> of batch jobs and job steps from checkpoint files.
>>>>
>>>> Could anyone please tell me how to do that?
>>>> I need help.
>>>>
>>>> Thank you in advance.
>>>>
>>>> Regards,
>>>>
>>>> Husen Rusdiansyah
>>>> University of Indonesia
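
The srun error quoted above shows the restart looking for "task.0.ckpt" inside the --restart-dir directory. Before retrying a restart, it may help to list which checkpoint files are actually present. The following is only a minimal sketch and not part of Slurm or BLCR: the function name check_ckpt_dir is hypothetical, and the "task.<rank>.ckpt" naming for ranks other than 0 is an assumption extrapolated from that single error message.

```shell
#!/bin/sh
# Hypothetical helper (not a Slurm or BLCR tool): report which checkpoint
# files exist in a restart directory.
# Usage: check_ckpt_dir DIR [NTASKS]
# Assumption: per-task images are named task.<rank>.ckpt, extrapolated from
# the "task.0.ckpt" path in the srun error message.
check_ckpt_dir() {
    dir=$1
    ntasks=${2:-1}
    missing=0

    # script.ckpt is the file the thread reports finding after a checkpoint.
    if [ -f "$dir/script.ckpt" ]; then
        echo "found: $dir/script.ckpt"
    else
        echo "missing: $dir/script.ckpt"
    fi

    # Check one per-task image for each rank from 0 to NTASKS-1.
    rank=0
    while [ "$rank" -lt "$ntasks" ]; do
        if [ -f "$dir/task.$rank.ckpt" ]; then
            echo "found: $dir/task.$rank.ckpt"
        else
            echo "missing: $dir/task.$rank.ckpt"
            missing=$((missing + 1))
        fi
        rank=$((rank + 1))
    done

    # Succeed only if every per-task image is present.
    [ "$missing" -eq 0 ]
}
```

For the directory from the thread, `check_ckpt_dir /mirror/source/cr/51 1` would report which of the two files exist and return a non-zero status if "task.0.ckpt" is absent, which is the situation described in the error message.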
