Is your /mirror directory shared across your cluster?
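One way to answer this is to create a marker file on the submit host and test for it from every node of an allocation. The following is only a minimal sketch, assuming it is run where `srun` can start tasks on the compute nodes; the helper name `check_shared_dir` and the marker file name are hypothetical, and `/mirror/source/cr` is the checkpoint directory from the batch script quoted below.

```shell
#!/bin/sh
# Sketch: verify that a directory is visible on every node of an allocation.
# check_shared_dir and the marker name are illustrative, not part of Slurm.
check_shared_dir() (
    dir=$1
    nodes=$2
    marker="$dir/.shared_check.$$"

    # Create the marker locally; fail early if the directory is not writable.
    : > "$marker" || exit 1

    # Run one task per node; each task tests for the marker file.
    # srun returns non-zero if any task fails.
    if srun -N "$nodes" --ntasks-per-node=1 test -e "$marker"; then
        echo "shared: $dir"
        status=0
    else
        echo "NOT shared on all nodes: $dir"
        status=1
    fi
    rm -f "$marker"
    exit "$status"
)
```

For the three-node job in this thread, the call would be `check_shared_dir /mirror/source/cr 3`.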
 
 On 04/14/2016 06:56 AM, Husen R wrote:
   Re: [slurm-dev] Re: Slurm Checkpoint/Restart example
   
   Hello Danny,
     I have tried to restart using "scontrol checkpoint restart <jobid>",
     but it doesn't work.
     In addition, the "<jobid>.0" directory and its contents do not
     exist in my --checkpoint-dir.
     The following is my batch job:
     =====================batch job===================
       #!/bin/bash
       #SBATCH -J MatMul
       #SBATCH -o mm-%j.out
       #SBATCH -A pro
       #SBATCH -N 3
       #SBATCH -n 24
       #SBATCH --checkpoint=5
       #SBATCH --checkpoint-dir=/mirror/source/cr
       #SBATCH --time=01:30:00
       #SBATCH [email protected]
       #SBATCH --mail-type=begin
       #SBATCH --mail-type=end
       srun --mpi=pmi2 ./mm.o
       ===================end batch job================
       Is there something that prevents me from getting the right
       directory structure?
       Regards,
       Husen
     On Thu, Apr 14, 2016 at 5:36 PM, Danny Rotscher <[email protected]> wrote:
       Hello,
         
         Usually, the directory specified by --checkpoint-dir
         should have the following structure:
         <jobid>
         |__ script.ckpt
         |__ <jobid>.0
              |__ task.0.ckpt
              |__ task.1.ckpt
              |__ ...
         
         But you only have to run the following command to restart
         your batch job:
         scontrol checkpoint restart <jobid>
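
A checkpoint directory can be compared against the structure above before attempting a restart. This is a minimal sketch, not a Slurm tool: the helper name `check_ckpt_layout` is hypothetical, and it assumes the layout is exactly the one described above (a `<jobid>` directory holding `script.ckpt` and a `<jobid>.0` directory with `task.N.ckpt` files).

```shell
#!/bin/sh
# Sketch: report which parts of the expected checkpoint layout are missing.
# check_ckpt_layout is an illustrative helper, not a Slurm command.
check_ckpt_layout() (
    ckpt_dir=$1
    jobid=$2
    job_dir="$ckpt_dir/$jobid"
    step_dir="$job_dir/${jobid}.0"
    missing=0

    [ -d "$job_dir" ] || { echo "missing directory: $job_dir"; missing=1; }
    [ -f "$job_dir/script.ckpt" ] || { echo "missing file: $job_dir/script.ckpt"; missing=1; }
    [ -d "$step_dir" ] || { echo "missing directory: $step_dir"; missing=1; }

    # At least one per-task image (task.0.ckpt, task.1.ckpt, ...) is expected.
    # If the glob matches nothing, $1 is the literal pattern and -e fails.
    set -- "$step_dir"/task.*.ckpt
    [ -e "$1" ] || { echo "missing files: $step_dir/task.N.ckpt"; missing=1; }

    [ "$missing" -eq 0 ] && echo "layout looks complete for job $jobid"
    exit "$missing"
)
```

For example, `check_ckpt_layout /mirror/source/cr 51` would list the missing `51.0` directory and task files in a directory that contains only `script.ckpt`.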
         
         I have only tried batch jobs, and I am currently trying to
         build MVAPICH2 with BLCR and Slurm support, because that MPI
         library is explicitly mentioned in the Slurm documentation.
         
         A colleague also tested DMTCP, but without success.
         
         Kind regards
         Danny
         TU Dresden
         Germany
             On 14.04.2016 at 11:01, Husen R wrote:
             
               Hi all,
               Thank you for your reply.
               
               Danny:
               I have installed BLCR and Slurm successfully.
               I have also configured CheckpointType, --checkpoint,
               --checkpoint-dir, and JobCheckpointDir so that Slurm
               supports checkpointing.
               
               I have tried to checkpoint a simple MPI parallel
               application many times on my small cluster and, as you
               said, after the checkpoint completes there is a directory
               named after the job ID in --checkpoint-dir. That
               directory contains a file named "script.ckpt". I tried to
               restart directly using the srun command below:
               
               srun --mpi=pmi2 --restart-dir=/mirror/source/cr/51 ./mm.o
               
               where --restart-dir is the directory that contains
               "script.ckpt".
               Unfortunately, I got the following error:
               
               Failed to open(/mirror/source/cr/51/task.0.ckpt, O_RDONLY): No such file or directory
               srun: error: compute-node: task 0: Exited with exit code 255
               
               As the error message above shows, there was no
               "task.0.ckpt" file, and I don't know how to obtain one.
               The only files I got from the checkpoint operation are a
               file named "script.ckpt" in --checkpoint-dir and two
               files in JobCheckpointDir named "<jobid>.ckpt" and
               "<jobid>.ckpt.old".
               
               According to the srun section of
               http://slurm.schedmd.com/checkpoint_blcr.html,
               after a checkpoint completes there should be checkpoint
               files of the form "<jobid>.ckpt" and
               "<jobid>.<stepid>.ckpt" in --checkpoint-dir.
               
               Any idea how to solve this?
               
               Manuel:
               
               Yes, BLCR doesn't support checkpoint/restart of
               parallel/distributed applications by itself
               (https://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#mpi).
               But it can be used by other software to do that (I
               hope that software is Slurm..huhu).
               
               I have also tried to restart an MPI application using
               DMTCP, but it didn't work.
               Would you please tell me how to do that?
               Thank you in advance,
               
               Regards,
               Husen
               On Thu, Apr 14, 2016 at 12:03 PM, Danny Rotscher <[email protected]> wrote:
                 I forgot to add something: you have to create a
                 directory for the checkpoint metadata, which by
                 default is located in /var/slurm/checkpoint:
                 mkdir -p /var/slurm/checkpoint
                 chown -R slurm /var/slurm
                 Alternatively, you can define your own directory in
                 slurm.conf:
                 JobCheckpointDir=<your directory>
                 
                 You can check the parameters with:
                 scontrol show config | grep checkpoint
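
Because the parameter names are capitalized (CheckpointType, JobCheckpointDir), a case-sensitive grep for "checkpoint" only matches lines whose value happens to contain the lowercase word, so a custom JobCheckpointDir path could be missed. A minimal sketch of a case-insensitive variant follows; the helper names `filter_ckpt_config` and `get_slurm_param` are hypothetical, and the sketch assumes `scontrol show config` prints one `Name = value` pair per line.

```shell
#!/bin/sh
# Sketch: inspect checkpoint-related settings from "Name = value" lines
# read on stdin. Both helpers are illustrative, not part of Slurm.

# Print every line that mentions "checkpoint", ignoring letter case.
filter_ckpt_config() {
    grep -i 'checkpoint'
}

# Print the value of one parameter, e.g. JobCheckpointDir.
# Assumes the value itself contains no "=" character.
get_slurm_param() {
    awk -F' *= *' -v key="$1" '$1 == key { print $2 }'
}

# Usage on a cluster:
#   scontrol show config | filter_ckpt_config
#   scontrol show config | get_slurm_param JobCheckpointDir
```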
                 
                 Kind regards,
                 Danny
                 TU Dresden
                 Germany
                 
                 On 14.04.2016 at 06:41, Danny Rotscher wrote:
                   Hello,
                   
                   We have not gotten it to work either, but we have
                   already built Slurm with BLCR.
                   
                   You first have to install the BLCR library, as
                   described on the following website:
                   https://upc-bugs.lbl.gov/blcr/doc/html/BLCR_Admin_Guide.html
                   
                   Then we built and installed Slurm from source, and
                   BLCR checkpointing was included.
                   
                   After that you have to set at least one parameter
                   in the file "slurm.conf":
                   CheckpointType=checkpoint/blcr
                   
                   There are two ways to create checkpoints: you can
                   either create a checkpoint with the following
                   command from outside your job:
                   scontrol checkpoint create <jobid>
                   or you can let Slurm take periodic checkpoints with
                   the following sbatch parameter:
                   #SBATCH --checkpoint <minutes>
                   We also tried:
                   #SBATCH --checkpoint <minutes>:<seconds>
                   e.g.
                   #SBATCH --checkpoint 0:10
                   to test it, but it doesn't work for us.
                   
                   We also set the parameter for the checkpoint
                   directory:
                   #SBATCH --checkpoint-dir <directory>
                   
                   After you have created a checkpoint and a directory
                   named after your job ID has been created in your
                   checkpoint directory, you can restart the job with
                   the following command:
                   scontrol checkpoint restart <jobid>
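
The precondition in the paragraph above (a directory named after the job ID must exist in the checkpoint directory) can be checked before calling the restart command. This is only a sketch under that assumption; `restart_if_checkpointed` is a hypothetical helper name, and only the `scontrol checkpoint restart` call comes from the message itself.

```shell
#!/bin/sh
# Sketch: restart a job only if its checkpoint directory exists.
# restart_if_checkpointed is an illustrative helper, not a Slurm command.
restart_if_checkpointed() (
    jobid=$1
    ckpt_dir=$2

    if [ ! -d "$ckpt_dir/$jobid" ]; then
        echo "no checkpoint directory $ckpt_dir/$jobid; not restarting" >&2
        exit 1
    fi

    # Same command as above; its exit status becomes the helper's status.
    scontrol checkpoint restart "$jobid"
)
```

A call would look like `restart_if_checkpointed <jobid> <directory>`, with the same directory that was passed to --checkpoint-dir.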
                   
                   We tested some sequential and OpenMP programs with
                   different parameters, and it works (checkpoint
                   creation and restarting), but *we have not gotten
                   any MPI library to work*; we have already tested
                   some programs built with Open MPI and Intel MPI.
                   The checkpoint is created, but we get the following
                   error when we try to restart them:
                   - Failed to open file '/'
                   - cr_restore_all_files [28534]:  Unable to restore fd 3 (type=1,err=-21)
                   - cr_rstrt_child [28534]:  Unable to restore files!  (err=-21)
                   Restart failed: Is a directory
                   srun: error: taurusi4010: task 0: Exited with exit code 21
                   
                   So it would be great if you could confirm our
                   problems; maybe then SchedMD will raise the priority
                   of such mails ;-)
                   If you get it to work, please help us to
                   understand how.
                   
                   Kind regards,
                   Danny
                   TU Dresden
                   Germany
                   
                   On 11.04.2016 at 10:09, Husen R wrote:
                     Hi all,
                     
                     Based on the information at
                     http://slurm.schedmd.com/checkpoint_blcr.html,
                     Slurm is able to checkpoint whole batch jobs and
                     then restart execution of batch jobs and job steps
                     from checkpoint files.
                     
                     Could anyone please tell me how to do that?
                     I need help.
                     
                     Thank you in advance.
                     
                     Regards,
                     Husen Rusdiansyah
                     University of Indonesia
             -- 
             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
             Danny Rotscher
             HPC-Support
             
             Technische Universität Dresden
             Zentrum für Informationsdienste und
             Hochleistungsrechnen (ZIH)
             01062 Dresden
             Tel.: +49 351 463-35853
             Fax : +49 351 463-37773
             E-Mail: [email protected]
             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
