Hello Danny,

I have tried to restart the job using "scontrol checkpoint restart <jobid>", but it doesn't work. In addition, the "<jobid>.0" directory and its contents do not exist in my --checkpoint-dir. The following is my batch job:
=====================batch job===================
#!/bin/bash
#SBATCH -J MatMul
#SBATCH -o mm-%j.out
#SBATCH -A pro
#SBATCH -N 3
#SBATCH -n 24
#SBATCH --checkpoint=5
#SBATCH --checkpoint-dir=/mirror/source/cr
#SBATCH --time=01:30:00
#SBATCH --mail-user=hus...@gmail.com
#SBATCH --mail-type=begin
#SBATCH --mail-type=end

srun --mpi=pmi2 ./mm.o
===================end batch job================

Is there something that prevents me from getting the right directory structure?

Regards,
Husen

On Thu, Apr 14, 2016 at 5:36 PM, Danny Rotscher <danny.rotsc...@tu-dresden.de> wrote:

> Hello,
>
> usually the directory specified by --checkpoint-dir should have
> the following structure:
> <jobid>
> |__ script.ckpt
> |__ <jobid>.0
>     |__ task.0.ckpt
>     |__ task.1.ckpt
>     |__ ...
>
> But you only have to run the following command to restart your batch job:
> scontrol checkpoint restart <jobid>
>
> I have only tried batch jobs, and I am currently trying to build MVAPICH2
> with BLCR and Slurm support, because that MPI library is explicitly
> mentioned in the Slurm documentation.
>
> A colleague also tested DMTCP, but without success.
>
> Kind regards
> Danny
> TU Dresden
> Germany
>
> On 14.04.2016 at 11:01, Husen R wrote:
>
>> Hi all,
>> Thank you for your reply.
>>
>> Danny:
>> I have installed BLCR and SLURM successfully.
>> I have also configured CheckpointType, --checkpoint, --checkpoint-dir and
>> JobCheckpointDir so that Slurm supports checkpointing.
>>
>> I have tried to checkpoint a simple MPI parallel application many times on
>> my small cluster and, as you said, after the checkpoint is completed there
>> is a directory named after the jobid in --checkpoint-dir. In that directory
>> there is a file named "script.ckpt". I tried to restart directly using the
>> srun command below:
>>
>> srun --mpi=pmi2 --restart-dir=/mirror/source/cr/51 ./mm.o
>>
>> where --restart-dir is the directory that contains "script.ckpt".
>> Unfortunately, I got the following error:
>>
>> Failed to open(/mirror/source/cr/51/task.0.ckpt, O_RDONLY): No such file or directory
>> srun: error: compute-node: task 0: Exited with exit code 255
>>
>> As the error message shows, there was no "task.0.ckpt" file, and I don't
>> know how to get one. The files I got from the checkpoint operation are
>> a file named "script.ckpt" in --checkpoint-dir and two files in
>> JobCheckpointDir named "<jobid>.ckpt" and "<jobid>.ckpt.old".
>>
>> According to the srun section of
>> http://slurm.schedmd.com/checkpoint_blcr.html, after a checkpoint is
>> completed there should be checkpoint files of the form "<jobid>.ckpt" and
>> "<jobid>.<stepid>.ckpt" in --checkpoint-dir.
>>
>> Any idea how to solve this?
>>
>> Manuel:
>>
>> Yes, BLCR doesn't support checkpoint/restart of parallel/distributed
>> applications by itself
>> (https://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#mpi).
>> But it can be used by other software to do that (I hope that software is
>> SLURM..huhu).
>>
>> I have tried to restart an MPI application using DMTCP, but it doesn't
>> work.
>> Would you please tell me how to do that?
>>
>> Thank you in advance,
>>
>> Regards,
>>
>> Husen
>>
>> On Thu, Apr 14, 2016 at 12:03 PM, Danny Rotscher <danny.rotsc...@tu-dresden.de> wrote:
>>
>>> I forgot to add something: you have to create a directory for the
>>> checkpoint metadata, which by default is located in
>>> /var/slurm/checkpoint:
>>> mkdir -p /var/slurm/checkpoint
>>> chown -R slurm /var/slurm
>>> or you define your own directory in slurm.conf:
>>> JobCheckpointDir=<your directory>
>>>
>>> You can check the parameters with:
>>> scontrol show config | grep checkpoint
>>>
>>> Kind regards,
>>> Danny
>>> TU Dresden
>>> Germany
>>>
>>> On 14.04.2016 at 06:41, Danny Rotscher wrote:
>>>
>>>> Hello,
>>>>
>>>> we have not gotten it to work either, but we have already built Slurm with BLCR.
>>>>
>>>> You first have to install the BLCR library, which is described on the
>>>> following website:
>>>> https://upc-bugs.lbl.gov/blcr/doc/html/BLCR_Admin_Guide.html
>>>>
>>>> Then we built and installed Slurm from source, and BLCR checkpointing
>>>> was included.
>>>>
>>>> After that you have to set at least one parameter in the file
>>>> "slurm.conf":
>>>> CheckpointType=checkpoint/blcr
>>>>
>>>> There are two ways to create checkpoints. You can either create a
>>>> checkpoint with the following command from outside your job:
>>>> scontrol checkpoint create <jobid>
>>>> or you can let Slurm take periodic checkpoints with the following
>>>> sbatch parameter:
>>>> #SBATCH --checkpoint <minutes>
>>>> We also tried:
>>>> #SBATCH --checkpoint <minutes>:<seconds>
>>>> e.g.
>>>> #SBATCH --checkpoint 0:10
>>>> to test it, but it doesn't work for us.
>>>>
>>>> We also set the parameter for the checkpoint directory:
>>>> #SBATCH --checkpoint-dir <directory>
>>>>
>>>> After you have created a checkpoint and a directory named after your
>>>> jobid has been created in your checkpoint directory, you can restart
>>>> the job with the following command:
>>>> scontrol checkpoint restart <jobid>
>>>>
>>>> We tested some sequential and OpenMP programs with different parameters
>>>> and it works (checkpoint creation and restarting),
>>>> but *we don't get any MPI library to work*; we have already tested some
>>>> programs built with Open MPI and Intel MPI.
>>>> The checkpoint is created, but we get the following error when we
>>>> try to restart them:
>>>> - Failed to open file '/'
>>>> - cr_restore_all_files [28534]: Unable to restore fd 3 (type=1,err=-21)
>>>> - cr_rstrt_child [28534]: Unable to restore files! (err=-21)
>>>> Restart failed: Is a directory
>>>> srun: error: taurusi4010: task 0: Exited with exit code 21
>>>>
>>>> So it would be great if you could confirm our problems; maybe then
>>>> SchedMD will raise the priority of such mails ;-)
>>>> If you get it to work, please help us understand how.
>>>>
>>>> Kind regards,
>>>> Danny
>>>> TU Dresden
>>>> Germany
>>>>
>>>> On 11.04.2016 at 10:09, Husen R wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> Based on the information at
>>>>> http://slurm.schedmd.com/checkpoint_blcr.html,
>>>>> Slurm is able to checkpoint whole batch jobs and then restart
>>>>> execution of batch jobs and job steps from checkpoint files.
>>>>>
>>>>> Could anyone please tell me how to do that?
>>>>> I need help.
>>>>>
>>>>> Thank you in advance.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Husen Rusdiansyah
>>>>> University of Indonesia
>
> --
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Danny Rotscher
> HPC-Support
>
> Technische Universität Dresden
> Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH)
> 01062 Dresden
> Tel.: +49 351 463-35853
> Fax : +49 351 463-37773
> E-Mail: danny.rotsc...@tu-dresden.de
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
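
For anyone following this thread: the directory layout Danny describes can be checked before attempting a restart. The following is only a minimal sketch. It assumes the task images live under <jobid>/<jobid>.0 (step 0), as his message suggests, and that the number of tasks is known from the batch script. The name check_ckpt_layout is a hypothetical helper written for illustration; it is not a Slurm or BLCR command.

```shell
#!/bin/bash
# Sketch: check for the checkpoint files described in this thread.
# check_ckpt_layout is a hypothetical helper, not part of Slurm or BLCR.
# Assumed layout (from Danny's message, single job step 0):
#   <checkpoint-dir>/<jobid>/script.ckpt
#   <checkpoint-dir>/<jobid>/<jobid>.0/task.<N>.ckpt
# Usage: check_ckpt_layout <checkpoint-dir> <jobid> <ntasks>
check_ckpt_layout() {
    local dir="$1"
    local jobid="$2"
    local ntasks="$3"
    local job_dir="$dir/$jobid"
    local step_dir="$job_dir/${jobid}.0"
    local missing=0
    local i=0

    # Image of the batch script itself.
    if [ ! -f "$job_dir/script.ckpt" ]; then
        echo "missing: $job_dir/script.ckpt" >&2
        missing=1
    fi

    # One image per task of step 0.
    while [ "$i" -lt "$ntasks" ]; do
        if [ ! -f "$step_dir/task.$i.ckpt" ]; then
            echo "missing: $step_dir/task.$i.ckpt" >&2
            missing=1
        fi
        i=$((i + 1))
    done

    # 0 if every expected file exists, 1 otherwise.
    return "$missing"
}
```

With the values from Husen's job (checkpoint directory /mirror/source/cr, job 51, 24 tasks), the call would be `check_ckpt_layout /mirror/source/cr 51 24`. Any file it reports as missing is one the restart is expected to look for, such as the task.0.ckpt in the error above.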