[slurm-dev] Re: Slurm Checkpoint/Restart example

Husen R Thu, 14 Apr 2016 03:56:19 -0700

Hello Danny,

I have tried to restart using "scontrol checkpoint restart <jobid>", but it
does not work.
In addition, the "<jobid>.0" directory and its contents do not exist in my
--checkpoint-dir.
The following is my batch job:


=====================batch job===================

#!/bin/bash
#SBATCH -J MatMul
#SBATCH -o mm-%j.out
#SBATCH -A pro
#SBATCH -N 3
#SBATCH -n 24
#SBATCH --checkpoint=5
#SBATCH --checkpoint-dir=/mirror/source/cr
#SBATCH --time=01:30:00
#SBATCH --mail-user=hus...@gmail.com
#SBATCH --mail-type=begin
#SBATCH --mail-type=end

srun --mpi=pmi2 ./mm.o

===================end batch job================
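
To make the missing pieces concrete, here is a minimal sketch that compares a
checkpoint directory against the layout Danny describes in his reply quoted
below (<jobid>/script.ckpt plus <jobid>/<jobid>.0/task.N.ckpt). The function
name check_ckpt_layout is hypothetical, and the expected layout is taken from
that reply only; it has not been verified against the Slurm source.

```shell
#!/bin/bash
# Sketch: report which parts of the expected BLCR checkpoint layout exist.
# Expected layout (assumption, taken from Danny's description in this thread):
#   <checkpoint-dir>/<jobid>/script.ckpt
#   <checkpoint-dir>/<jobid>/<jobid>.0/task.N.ckpt
# check_ckpt_layout is a hypothetical helper, not a Slurm command.
check_ckpt_layout() {
    local ckpt_dir=$1 jobid=$2
    local job_dir="${ckpt_dir}/${jobid}"
    local step_dir="${job_dir}/${jobid}.0"
    local status=0 count=0 f

    # The batch script image should sit directly in the job directory.
    if [ -f "${job_dir}/script.ckpt" ]; then
        echo "found: ${job_dir}/script.ckpt"
    else
        echo "missing: ${job_dir}/script.ckpt"
        status=1
    fi

    # The step directory should hold one image per task.
    if [ -d "$step_dir" ]; then
        for f in "$step_dir"/task.*.ckpt; do
            # An unmatched glob stays literal, so test each candidate.
            if [ -f "$f" ]; then
                count=$((count + 1))
            fi
        done
        echo "found: ${step_dir} (${count} task checkpoint file(s))"
        if [ "$count" -eq 0 ]; then
            status=1
        fi
    else
        echo "missing: ${step_dir}"
        status=1
    fi

    return "$status"
}
```

With the directory from my batch job and job 51 from my earlier message, the
call would be "check_ckpt_layout /mirror/source/cr 51". It returns non-zero
whenever script.ckpt, the step directory, or the task files are absent.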

Is there something that prevents me from getting the right directory
structure?
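
One thing that can be ruled out first is the configuration. Danny suggested
"scontrol show config | grep checkpoint" (quoted below). The following sketch
narrows that down to the two settings named in this thread, CheckpointType and
JobCheckpointDir. The helper name show_ckpt_settings is hypothetical, and it
assumes the input uses either "Key=value" (slurm.conf) or "Key = value"
(scontrol output) lines.

```shell
#!/bin/bash
# Sketch: pull the checkpoint-related settings out of Slurm configuration text.
# Reads stdin, so it accepts slurm.conf contents ("Key=value") as well as
# text formatted as "Key   = value" (assumed format of `scontrol show config`).
# show_ckpt_settings is a hypothetical helper name, not part of Slurm.
show_ckpt_settings() {
    # Keep only CheckpointType and JobCheckpointDir lines; commented-out
    # lines start with "#" and therefore do not match. Then trim the line
    # and normalise the spacing around the first "=".
    grep -iE '^[[:space:]]*(CheckpointType|JobCheckpointDir)[[:space:]]*=' \
        | sed -E 's/^[[:space:]]*//; s/[[:space:]]*=[[:space:]]*/=/; s/[[:space:]]*$//'
}
```

It can be fed from either source, for example "scontrol show config |
show_ckpt_settings" or "show_ckpt_settings < slurm.conf". If CheckpointType is
not checkpoint/blcr, or JobCheckpointDir points at a directory that does not
exist, that would be the first thing to fix.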


Regards,



Husen




On Thu, Apr 14, 2016 at 5:36 PM, Danny Rotscher <
danny.rotsc...@tu-dresden.de> wrote:

> Hello,
>
> usually the directory, which is specified by --checkpoint-dir, should have
> the following structure:
> <jobid>
> |__ script.ckpt
> |__ <jobid>.0
>      |__ task.0.ckpt
>      |__ task.1.ckpt
>      |__ ...
>
> But you only have to run the following command to restart your batch job:
> scontrol checkpoint restart <jobid>
>
> I have only tried batch jobs, and I am currently trying to build MVAPICH2
> with BLCR and Slurm support, because that MPI library is explicitly
> mentioned in the Slurm documentation.
>
> A colleague also tested DMTCP, but without success.
>
> Kind regards
> Danny
> TU Dresden
> Germany
>
>
> On 14.04.2016 at 11:01, Husen R wrote:
>
>> Hi all,
>> Thank you for your reply
>>
>> Danny :
>> I have installed BLCR and SLURM successfully.
>> I also have configured CheckpointType, --checkpoint, --checkpoint-dir and
>> JobCheckpointDir in order for slurm to support checkpoint.
>>
>> I have tried to checkpoint a simple MPI parallel application many times in
>> my small cluster and, as you said, after the checkpoint completes there is
>> a directory named after the jobid in --checkpoint-dir. In that directory
>> there is a file named "script.ckpt". I tried to restart directly using the
>> srun command below:
>>
>> srun --mpi=pmi2 --restart-dir=/mirror/source/cr/51 ./mm.o
>>
>> where --restart-dir is the directory that contains "script.ckpt".
>> Unfortunately, I got the following error:
>>
>> Failed to open(/mirror/source/cr/51/task.0.ckpt, O_RDONLY): No such file or directory
>> srun: error: compute-node: task 0: Exited with exit code 255
>>
>> As we can see from the error message above, there was no "task.0.ckpt"
>> file. I don't know how to obtain such a file. The files that I got from the
>> checkpoint operation are a file named "script.ckpt" in --checkpoint-dir and
>> two files in JobCheckpointDir named "<jobid>.ckpt" and "<jobid>.ckpt.old".
>>
>> According to the information in section srun in this link
>> http://slurm.schedmd.com/checkpoint_blcr.html, after checkpoint is
>> completed there should be checkpoint files of the form "<jobid>.ckpt" and
>> "<jobid>.<stepid>.ckpt" in --checkpoint-dir.
>>
>> Any idea how to solve this?
>>
>> Manuel :
>>
>> Yes, BLCR doesn't support checkpoint/restart of parallel/distributed
>> applications by itself (
>> https://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#mpi).
>> But it can be used by other software to do that (I hope the software is
>> SLURM..huhu)
>>
>> I once tried to restart an MPI application using DMTCP, but it did not
>> work.
>> Would you please tell me how to do that?
>>
>>
>> Thank you in advance,
>>
>> Regards,
>>
>>
>> Husen
>>
>>
>>
>>
>>
>> On Thu, Apr 14, 2016 at 12:03 PM, Danny Rotscher <
>> danny.rotsc...@tu-dresden.de> wrote:
>>
>>> I forgot to add something: you have to create a directory for the
>>> checkpoint metadata, which is located by default in
>>> /var/slurm/checkpoint:
>>> mkdir -p /var/slurm/checkpoint
>>> chown -R slurm /var/slurm
>>> or you define your own directory in slurm.conf:
>>> JobCheckpointDir=<your directory>
>>>
>>> You can check the parameters with:
>>> scontrol show config | grep checkpoint
>>>
>>> Kind regards,
>>> Danny
>>> TU Dresden
>>> Germany
>>>
>>> On 14.04.2016 at 06:41, Danny Rotscher wrote:
>>>
>>>> Hello,
>>>>
>>>> we have not gotten it to work either, but we have already built Slurm
>>>> with BLCR.
>>>>
>>>> You first have to install the BLCR library, which is described on the
>>>> following website:
>>>> https://upc-bugs.lbl.gov/blcr/doc/html/BLCR_Admin_Guide.html
>>>>
>>>> Then we built and installed Slurm from source, and BLCR checkpointing
>>>> was included.
>>>>
>>>> After that you have to set at least one parameter in the file
>>>> "slurm.conf":
>>>> CheckpointType=checkpoint/blcr
>>>>
>>>> There are two ways to create a checkpoint: you can either trigger one
>>>> with the following command from outside your job:
>>>> scontrol checkpoint create <jobid>
>>>> or you can let Slurm take periodic checkpoints with the following
>>>> sbatch parameter:
>>>> #SBATCH --checkpoint <minutes>
>>>> We also tried:
>>>> #SBATCH --checkpoint <minutes>:<seconds>
>>>> e.g.
>>>> #SBATCH --checkpoint 0:10
>>>> to test it, but it doesn't work for us.
>>>>
>>>> We also set the parameter for the checkpoint directory:
>>>> #SBATCH --checkpoint-dir <directory>
>>>>
>>>> After you create a checkpoint and a directory named after your jobid has
>>>> been created in your checkpoint directory, you can restart the job with
>>>> the following command:
>>>> scontrol checkpoint restart <jobid>
>>>>
>>>> We tested some sequential and OpenMP programs with different parameters
>>>> and it works (checkpoint creation and restarting),
>>>> but *we cannot get any MPI library to work*; we have already tested some
>>>> programs built with Open MPI and Intel MPI.
>>>> The checkpoint is created, but we get the following error when we
>>>> try to restart them:
>>>> - Failed to open file '/'
>>>> - cr_restore_all_files [28534]:  Unable to restore fd 3 (type=1,err=-21)
>>>> - cr_rstrt_child [28534]:  Unable to restore files!  (err=-21)
>>>> Restart failed: Is a directory
>>>> srun: error: taurusi4010: task 0: Exited with exit code 21
>>>>
>>>> So, it would be great if you could confirm our problems; maybe then
>>>> SchedMD will raise the priority of such mails ;-)
>>>> If you get it to work, please help us to understand how.
>>>>
>>>> Kind regards,
>>>> Danny
>>>> TU Dresden
>>>> Germany
>>>>
>>>> On 11.04.2016 at 10:09, Husen R wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> Based on the information in this link
>>>>> http://slurm.schedmd.com/checkpoint_blcr.html,
>>>>> Slurm is able to checkpoint whole batch jobs and then restart
>>>>> execution of batch jobs and job steps from checkpoint files.
>>>>>
>>>>> Could anyone please tell me how to do that?
>>>>> I need help.
>>>>>
>>>>> Thank you in advance.
>>>>>
>>>>> Regards,
>>>>>
>>>>>
>>>>> Husen Rusdiansyah
>>>>> University of Indonesia
>>>>>
>>>>>
>>>>
> --
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Danny Rotscher
> HPC-Support
>
> Technische Universität Dresden
> Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH)
> 01062 Dresden
> Tel.: +49 351 463-35853
> Fax : +49 351 463-37773
> E-Mail: danny.rotsc...@tu-dresden.de
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
>
>
