Danny:

I'm unable to use the srun_cr command. I got this error message in the slurmctld log file after submitting srun_cr with sbatch:

[2016-04-14T19:22:42.719] job_complete: JobID=67 State=0x1 NodeCnt=2 WEXITSTATUS 255

Any idea how to fix this?

And yes, my job needs more than 5 minutes.

Andy:

Yes, the /mirror directory is shared across my cluster. I have configured it using NFS.

Regards,

Husen

On Thu, Apr 14, 2016 at 6:15 PM, Danny Rotscher <[email protected]> wrote:

> I've found two things: first, you could try srun_cr instead of srun, and
> second, does your job need more than 5 minutes?
> But I'm not sure, so you may try it and post the result.
>
> On 14.04.2016 at 12:56, Husen R wrote:
>
>> Hello Danny,
>>
>> I have tried to restart using "scontrol checkpoint restart <jobid>" but it
>> doesn't work.
>> In addition, the "<jobid>.0" directory and its contents don't exist in my
>> --checkpoint-dir.
>> The following is my batch job:
>>
>> =====================batch job===================
>>
>> #!/bin/bash
>> #SBATCH -J MatMul
>> #SBATCH -o mm-%j.out
>> #SBATCH -A pro
>> #SBATCH -N 3
>> #SBATCH -n 24
>> #SBATCH --checkpoint=5
>> #SBATCH --checkpoint-dir=/mirror/source/cr
>> #SBATCH --time=01:30:00
>> #SBATCH [email protected]
>> #SBATCH --mail-type=begin
>> #SBATCH --mail-type=end
>>
>> srun --mpi=pmi2 ./mm.o
>>
>> ===================end batch job================
>>
>> Is there something that prevents me from getting the right directory
>> structure?
>>
>> Regards,
>>
>> Husen
>>
>> On Thu, Apr 14, 2016 at 5:36 PM, Danny Rotscher <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> usually the directory specified by --checkpoint-dir should have
>>> the following structure:
>>> <jobid>
>>> |__ script.ckpt
>>> |__ <jobid>.0
>>>     |__ task.0.ckpt
>>>     |__ task.1.ckpt
>>>     |__ ...
>>>
>>> But you only have to run the following command to restart your batch job:
>>> scontrol checkpoint restart <jobid>
>>>
>>> I have only tried batch jobs, and I am currently trying to build MVAPICH2
>>> with BLCR and Slurm support, because that MPI library is explicitly
>>> mentioned in the Slurm documentation.
>>>
>>> A colleague also tested DMTCP, but without success.
>>>
>>> Kind regards,
>>> Danny
>>> TU Dresden
>>> Germany
>>>
>>> On 14.04.2016 at 11:01, Husen R wrote:
>>>
>>>> Hi all,
>>>> Thank you for your reply.
>>>>
>>>> Danny:
>>>> I have installed BLCR and SLURM successfully.
>>>> I have also configured CheckpointType, --checkpoint, --checkpoint-dir and
>>>> JobCheckpointDir so that Slurm supports checkpointing.
>>>>
>>>> I have tried to checkpoint a simple MPI parallel application many times
>>>> on my small cluster and, as you said, after the checkpoint is completed
>>>> there is a directory named after the jobid in --checkpoint-dir. In that
>>>> directory there is a file named "script.ckpt". I tried to restart
>>>> directly using the srun command below:
>>>>
>>>> srun --mpi=pmi2 --restart-dir=/mirror/source/cr/51 ./mm.o
>>>>
>>>> where --restart-dir is the directory that contains "script.ckpt".
>>>> Unfortunately, I got the following error:
>>>>
>>>> Failed to open(/mirror/source/cr/51/task.0.ckpt, O_RDONLY): No such file or directory
>>>> srun: error: compute-node: task 0: Exited with exit code 255
>>>>
>>>> As the error message above shows, there was no "task.0.ckpt" file.
>>>> I don't know how to get such a file. The files that I got from the
>>>> checkpoint operation are a file named "script.ckpt" in --checkpoint-dir
>>>> and two files in JobCheckpointDir named "<jobid>.ckpt" and
>>>> "<jobid>.ckpt.old".
>>>>
>>>> According to the srun section of
>>>> http://slurm.schedmd.com/checkpoint_blcr.html, after the checkpoint is
>>>> completed there should be checkpoint files of the form "<jobid>.ckpt" and
>>>> "<jobid>.<stepid>.ckpt" in --checkpoint-dir.
>>>>
>>>> Any idea how to solve this?
>>>>
>>>> Manuel:
>>>>
>>>> Yes, BLCR doesn't support checkpoint/restart of parallel/distributed
>>>> applications by itself
>>>> (https://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#mpi).
>>>> But it can be used by other software to do that (I hope that software is
>>>> SLURM..huhu)
>>>>
>>>> I once tried to restart an MPI application using DMTCP, but it didn't
>>>> work.
>>>> Would you please tell me how to do that?
>>>>
>>>> Thank you in advance,
>>>>
>>>> Regards,
>>>>
>>>> Husen
>>>>
>>>> On Thu, Apr 14, 2016 at 12:03 PM, Danny Rotscher <[email protected]> wrote:
>>>>
>>>>> I forgot to add something: you have to create a directory for the
>>>>> checkpoint metadata, which by default is located in
>>>>> /var/slurm/checkpoint:
>>>>> mkdir -p /var/slurm/checkpoint
>>>>> chown -R slurm /var/slurm
>>>>> or you define your own directory in slurm.conf:
>>>>> JobCheckpointDir=<your directory>
>>>>>
>>>>> You can check the parameters with:
>>>>> scontrol show config | grep checkpoint
>>>>>
>>>>> Kind regards,
>>>>> Danny
>>>>> TU Dresden
>>>>> Germany
>>>>>
>>>>> On 14.04.2016 at 06:41, Danny Rotscher wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> we haven't got it to work either, but we have already built Slurm
>>>>>> with BLCR.
>>>>>>
>>>>>> You first have to install the BLCR library, which is described on the
>>>>>> following website:
>>>>>> https://upc-bugs.lbl.gov/blcr/doc/html/BLCR_Admin_Guide.html
>>>>>>
>>>>>> Then we built and installed Slurm from source, and BLCR checkpointing
>>>>>> was included.
>>>>>>
>>>>>> After that you have to set at least one parameter in the file
>>>>>> "slurm.conf":
>>>>>> CheckpointType=checkpoint/blcr
>>>>>>
>>>>>> There are two ways to create checkpoints: you can either make a
>>>>>> checkpoint with the following command from outside your job:
>>>>>> scontrol checkpoint create <jobid>
>>>>>> or you can let Slurm do periodic checkpoints with the following
>>>>>> sbatch parameter:
>>>>>> #SBATCH --checkpoint <minutes>
>>>>>> We also tried:
>>>>>> #SBATCH --checkpoint <minutes>:<seconds>
>>>>>> e.g.
>>>>>> #SBATCH --checkpoint 0:10
>>>>>> to test it, but it doesn't work for us.
>>>>>>
>>>>>> We also set the parameter for the checkpoint directory:
>>>>>> #SBATCH --checkpoint-dir <directory>
>>>>>>
>>>>>> After you create a checkpoint and a directory named after your jobid
>>>>>> has been created in your checkpoint directory, you can restart the
>>>>>> job with the following command:
>>>>>> scontrol checkpoint restart <jobid>
>>>>>>
>>>>>> We tested some sequential and OpenMP programs with different
>>>>>> parameters and it works (checkpoint creation and restarting),
>>>>>> but *we don't get any MPI library to work*; we have already tested
>>>>>> some programs built with openmpi and intelmpi.
>>>>>> The checkpoint is created, but we get the following error when we
>>>>>> want to restart them:
>>>>>> - Failed to open file '/'
>>>>>> - cr_restore_all_files [28534]: Unable to restore fd 3 (type=1,err=-21)
>>>>>> - cr_rstrt_child [28534]: Unable to restore files! (err=-21)
>>>>>> Restart failed: Is a directory
>>>>>> srun: error: taurusi4010: task 0: Exited with exit code 21
>>>>>>
>>>>>> So, it would be great if you could confirm our problems; maybe then
>>>>>> SchedMD will raise the priority of such mails ;-)
>>>>>> If you get it to work, please help us to understand how.
>>>>>>
>>>>>> Kind regards,
>>>>>> Danny
>>>>>> TU Dresden
>>>>>> Germany
>>>>>>
>>>>>> On 11.04.2016 at 10:09, Husen R wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> Based on the information at
>>>>>>> http://slurm.schedmd.com/checkpoint_blcr.html,
>>>>>>> Slurm is able to checkpoint whole batch jobs and then restart
>>>>>>> execution of batch jobs and job steps from checkpoint files.
>>>>>>>
>>>>>>> Could anyone please tell me how to do that?
>>>>>>> I need help.
>>>>>>>
>>>>>>> Thank you in advance.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Husen Rusdiansyah
>>>>>>> University of Indonesia
>>>>>>>
>>> --
>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>> Danny Rotscher
>>> HPC-Support
>>>
>>> Technische Universität Dresden
>>> Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH)
>>> 01062 Dresden
>>> Tel.: +49 351 463-35853
>>> Fax : +49 351 463-37773
>>> E-Mail: [email protected]
>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
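
The directory layout discussed in this thread can be checked before attempting a restart. The following is only a minimal sketch, not a Slurm or BLCR tool: the function name check_ckpt_layout is hypothetical, and it assumes that the task.N.ckpt files live inside the "<jobid>.0" directory, which is inferred from the messages above rather than confirmed by the Slurm documentation.

```shell
#!/bin/bash
# Sketch: verify that a checkpoint directory has the layout described in the
# thread before trying "scontrol checkpoint restart <jobid>".
# ASSUMPTION: task.N.ckpt files are nested under <jobid>/<jobid>.0/.

# check_ckpt_layout <checkpoint-dir> <jobid>   (hypothetical helper)
# Prints each missing piece and returns 0 only if script.ckpt and at least
# one task.N.ckpt file are present.
check_ckpt_layout() {
    local ckpt_dir="$1" jobid="$2"
    local job_dir="${ckpt_dir}/${jobid}"
    local step_dir="${job_dir}/${jobid}.0"
    local status=0

    # The batch script checkpoint written for the job itself.
    if [ ! -f "${job_dir}/script.ckpt" ]; then
        echo "missing: ${job_dir}/script.ckpt"
        status=1
    fi

    # The per-step directory and its per-task checkpoint files.
    if [ ! -d "${step_dir}" ]; then
        echo "missing: ${step_dir}/"
        status=1
    elif ! ls "${step_dir}"/task.*.ckpt >/dev/null 2>&1; then
        echo "missing: ${step_dir}/task.N.ckpt"
        status=1
    fi

    return "${status}"
}
```

With the paths from the messages above, it could be used as "check_ckpt_layout /mirror/source/cr 51 && scontrol checkpoint restart 51". In the case reported earlier in the thread, where only script.ckpt exists, it would report the missing "51.0" directory instead of attempting the restart.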
