But I'm not sure, so you may try it and post the result.
On 14.04.2016 at 12:56, Husen R wrote:
Hello Danny,

I have tried to restart using "scontrol checkpoint restart <jobid>", but it doesn't work. In addition, the "<jobid>.0" directory and its contents do not exist in my --checkpoint-dir.

The following is my batch job:

=====================batch job===================
#!/bin/bash
#SBATCH -J MatMul
#SBATCH -o mm-%j.out
#SBATCH -A pro
#SBATCH -N 3
#SBATCH -n 24
#SBATCH --checkpoint=5
#SBATCH --checkpoint-dir=/mirror/source/cr
#SBATCH --time=01:30:00
#SBATCH --mail-user=hus...@gmail.com
#SBATCH --mail-type=begin
#SBATCH --mail-type=end

srun --mpi=pmi2 ./mm.o
===================end batch job================

Is there something that prevents me from getting the right directory structure?

Regards,
Husen

On Thu, Apr 14, 2016 at 5:36 PM, Danny Rotscher <danny.rotsc...@tu-dresden.de> wrote:

Hello,

usually the directory specified by --checkpoint-dir should have the following structure:

<jobid>
|__ script.ckpt
|__ <jobid>.0
    |__ task.0.ckpt
    |__ task.1.ckpt
    |__ ...

But you only have to run the following command to restart your batch job:

scontrol checkpoint restart <jobid>

I have only tried batch jobs so far, and I am currently trying to build MVAPICH2 with BLCR and Slurm support, because that MPI library is explicitly mentioned in the Slurm documentation. A colleague also tested DMTCP, but without success.

Kind regards,
Danny
TU Dresden
Germany

On 14.04.2016 at 11:01, Husen R wrote:

Hi all,

Thank you for your reply.

Danny:
I have installed BLCR and SLURM successfully. I have also configured CheckpointType, --checkpoint, --checkpoint-dir and JobCheckpointDir so that Slurm supports checkpointing.

I have tried to checkpoint a simple MPI parallel application many times on my small cluster and, as you said, after the checkpoint completes there is a directory named after the job ID in --checkpoint-dir. That directory contains a file named "script.ckpt".
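The layout Danny describes (a <jobid> directory holding script.ckpt plus a <jobid>.0 directory with task.N.ckpt files) can be compared against what is actually on disk. The following is only a minimal sketch: check_ckpt_layout is a hypothetical helper, not part of Slurm or BLCR, and it assumes the job has a single step 0, as in the layout quoted in this thread.

```shell
#!/bin/bash
# Hypothetical helper (not a Slurm or BLCR tool): report which parts of the
# checkpoint layout described in this thread exist under a --checkpoint-dir.
# Assumes a single job step 0, as in the layout quoted in the thread.
# Usage: check_ckpt_layout <checkpoint-dir> <jobid>
check_ckpt_layout() {
    local ckpt_dir="$1" jobid="$2"
    local job_dir="$ckpt_dir/$jobid"
    local step_dir="$job_dir/$jobid.0"
    local status=0
    local count

    if [ ! -d "$job_dir" ]; then
        echo "missing: $job_dir"
        return 1
    fi

    # Image of the batch script itself.
    if [ -f "$job_dir/script.ckpt" ]; then
        echo "found: script.ckpt"
    else
        echo "missing: script.ckpt"
        status=1
    fi

    # Per-task images of step 0 (task.0.ckpt, task.1.ckpt, ...).
    if [ -d "$step_dir" ]; then
        count=$(find "$step_dir" -maxdepth 1 -type f -name 'task.*.ckpt' | wc -l)
        count=$((count))   # normalise padding that some wc versions print
        echo "found: $jobid.0 with $count task image(s)"
        [ "$count" -gt 0 ] || status=1
    else
        echo "missing: $jobid.0"
        status=1
    fi

    return "$status"
}

# Example call, using the path and job ID mentioned in this thread:
#   check_ckpt_layout /mirror/source/cr 51
```

A non-zero return only says that the files are not where the thread expects them; it does not explain why the step images were not written.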
I tried to restart directly using the srun command below:

srun --mpi=pmi2 --restart-dir=/mirror/source/cr/51 ./mm.o

where --restart-dir is the directory that contains "script.ckpt". Unfortunately, I got the following error:

Failed to open(/mirror/source/cr/51/task.0.ckpt, O_RDONLY): No such file or directory
srun: error: compute-node: task 0: Exited with exit code 255

As the error message shows, there was no "task.0.ckpt" file. I don't know how to get such a file. The files I got from the checkpoint operation are a file named "script.ckpt" in --checkpoint-dir and two files in JobCheckpointDir named "<jobid>.ckpt" and "<jobid>.ckpt.old".

According to the srun section of http://slurm.schedmd.com/checkpoint_blcr.html, after a checkpoint completes there should be checkpoint files of the form "<jobid>.ckpt" and "<jobid>.<stepid>.ckpt" in --checkpoint-dir.

Any idea how to solve this?

Manuel:
Yes, BLCR doesn't support checkpoint/restart of parallel/distributed applications by itself (https://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#mpi). But it can be used by other software to do that (I hope that software is SLURM..huhu).

I have tried to restart an MPI application using DMTCP, but it doesn't work. Would you please tell me how to do that?

Thank you in advance,

Regards,
Husen

On Thu, Apr 14, 2016 at 12:03 PM, Danny Rotscher <danny.rotsc...@tu-dresden.de> wrote:

I forgot to add something: you have to create a directory for the checkpoint metadata, which by default is located in /var/slurm/checkpoint:

mkdir -p /var/slurm/checkpoint
chown -R slurm /var/slurm

or you define your own directory in slurm.conf:

JobCheckpointDir=<your directory>

You can check the parameters with:

scontrol show config | grep checkpoint

Kind regards,
Danny
TU Dresden
Germany

On 14.04.2016 at 06:41, Danny Rotscher wrote:

Hello,

we have not got it to work either, but we have already built Slurm with BLCR.
You first have to install the BLCR library, which is described on the following website:
https://upc-bugs.lbl.gov/blcr/doc/html/BLCR_Admin_Guide.html

Then we built and installed Slurm from source, and BLCR checkpointing was included. After that you have to set at least one parameter in the file "slurm.conf":

CheckpointType=checkpoint/blcr

There are two ways to create checkpoints. You can either create a checkpoint with the following command from outside your job:

scontrol checkpoint create <jobid>

or you can let Slurm take periodic checkpoints with the following sbatch parameter:

#SBATCH --checkpoint <minutes>

We also tried:

#SBATCH --checkpoint <minutes>:<seconds>
e.g.
#SBATCH --checkpoint 0:10

to test it, but it doesn't work for us.

We also set the parameter for the checkpoint directory:

#SBATCH --checkpoint-dir <directory>

After you create a checkpoint and a directory named after your job ID has been created in your checkpoint directory, you can restart the job with the following command:

scontrol checkpoint restart <jobid>

We tested some sequential and OpenMP programs with different parameters and it works (checkpoint creation and restarting), but *we have not got any MPI library to work*; we have already tested some programs built with Open MPI and Intel MPI. The checkpoint is created, but we get the following error when we try to restart them:

- Failed to open file '/'
- cr_restore_all_files [28534]: Unable to restore fd 3 (type=1,err=-21)
- cr_rstrt_child [28534]: Unable to restore files! (err=-21)
Restart failed: Is a directory
srun: error: taurusi4010: task 0: Exited with exit code 21

So it would be great if you could confirm our problems; maybe then SchedMD will raise the priority of such mails ;-)
If you get it to work, please help us understand how.
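The restart step above has one stated precondition: a directory named after the job ID must exist in the checkpoint directory. A minimal sketch of that guard follows. The function name restart_from_checkpoint and its arguments are hypothetical; the only Slurm call used is the "scontrol checkpoint restart <jobid>" command quoted in this thread, and the sketch does nothing about the MPI restart failure described above.

```shell
#!/bin/bash
# Hypothetical wrapper (name and arguments are illustrative, not a Slurm tool).
# It restarts a job only if the per-job checkpoint directory exists.
# Usage: restart_from_checkpoint <checkpoint-dir> <jobid>
restart_from_checkpoint() {
    local ckpt_dir="$1" jobid="$2"

    if [ -z "$ckpt_dir" ] || [ -z "$jobid" ]; then
        echo "usage: restart_from_checkpoint <checkpoint-dir> <jobid>" >&2
        return 2
    fi

    # Precondition from the thread: a directory named after the job ID
    # must have been created inside the checkpoint directory.
    if [ ! -d "$ckpt_dir/$jobid" ]; then
        echo "no checkpoint directory $ckpt_dir/$jobid; not restarting" >&2
        return 1
    fi

    # Restart command exactly as quoted in the thread.
    scontrol checkpoint restart "$jobid"
}

# Example call, using the path and job ID mentioned in this thread:
#   restart_from_checkpoint /mirror/source/cr 51
```

The guard only avoids calling the restart when no checkpoint was written at all; whether the restart itself succeeds still depends on the checkpoint images.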
Kind regards,
Danny
TU Dresden
Germany

On 11.04.2016 at 10:09, Husen R wrote:

Hi all,

Based on the information at http://slurm.schedmd.com/checkpoint_blcr.html, Slurm is able to checkpoint whole batch jobs and then restart execution of batch jobs and job steps from checkpoint files.

Could anyone please tell me how to do that? I need help.

Thank you in advance.

Regards,

Husen Rusdiansyah
University of Indonesia

--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Danny Rotscher
HPC-Support

Technische Universität Dresden
Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH)
01062 Dresden
Tel.: +49 351 463-35853
Fax : +49 351 463-37773
E-Mail: danny.rotsc...@tu-dresden.de
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~