Danny :

I'm unable to use the srun_cr command. I got this error message in the
slurmctld log file after submitting a job that uses srun_cr with sbatch:

[2016-04-14T19:22:42.719] job_complete: JobID=67 State=0x1 NodeCnt=2 WEXITSTATUS 255

Any idea how to fix this?

And yes, my job needs more than 5 minutes.
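
For anyone debugging the same problem: a minimal sketch, assuming the
directory layout Danny described earlier in this thread (<jobid>/script.ckpt
plus <jobid>/<jobid>.0/task.N.ckpt under --checkpoint-dir), of a check that
reports which of those files are missing. The function name
check_ckpt_layout and its two arguments are made up for illustration; this
is not a Slurm or BLCR tool.

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm or BLCR): check whether a job's
# checkpoint directory has the layout described in this thread:
#   <checkpoint-dir>/<jobid>/script.ckpt
#   <checkpoint-dir>/<jobid>/<jobid>.0/task.N.ckpt
# Usage: check_ckpt_layout <checkpoint-dir> <jobid>
check_ckpt_layout() {
    local dir="$1" jobid="$2"
    local job_dir="${dir}/${jobid}"
    local step_dir="${job_dir}/${jobid}.0"
    local status=0

    # The batch script checkpoint written for the job itself.
    if [ ! -f "${job_dir}/script.ckpt" ]; then
        echo "missing: ${job_dir}/script.ckpt"
        status=1
    fi

    # The per-task checkpoints of job step 0.
    if [ ! -d "${step_dir}" ]; then
        echo "missing: ${step_dir}/"
        status=1
    elif ! ls "${step_dir}"/task.*.ckpt >/dev/null 2>&1; then
        echo "missing: ${step_dir}/task.N.ckpt"
        status=1
    fi

    if [ "${status}" -eq 0 ]; then
        echo "layout ok: ${job_dir}"
    fi
    return "${status}"
}
```

For example, "check_ckpt_layout /mirror/source/cr 51" would inspect the job
directory used in the earlier restart attempt quoted below.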

Andy :

Yes, the /mirror directory is shared across my cluster. I have set it up
using NFS.

Regards,



Husen



On Thu, Apr 14, 2016 at 6:15 PM, Danny Rotscher <
[email protected]> wrote:

> I've found two things: first, you could try srun_cr instead of srun, and
> second, does your job need more than 5 minutes?
> But I'm not sure, so please try it and post the result.
>
>
> On 14.04.2016 at 12:56, Husen R wrote:
>
>> Hello Danny,
>>
>> I have tried to restart using "scontrol checkpoint restart <jobid>", but it
>> doesn't work.
>> In addition, the "<jobid>.0" directory and its contents don't exist in my
>> --checkpoint-dir.
>> The following is my batch job:
>>
>> =====================batch job===================
>>
>> #!/bin/bash
>> #SBATCH -J MatMul
>> #SBATCH -o mm-%j.out
>> #SBATCH -A pro
>> #SBATCH -N 3
>> #SBATCH -n 24
>> #SBATCH --checkpoint=5
>> #SBATCH --checkpoint-dir=/mirror/source/cr
>> #SBATCH --time=01:30:00
>> #SBATCH --mail-user=[email protected]
>> #SBATCH --mail-type=begin
>> #SBATCH --mail-type=end
>>
>> srun --mpi=pmi2 ./mm.o
>>
>> ===================end batch job================
>>
>> Is there something that prevents me from getting the right directory
>> structure?
>>
>>
>> Regards,
>>
>>
>>
>> Husen
>>
>>
>>
>>
>> On Thu, Apr 14, 2016 at 5:36 PM, Danny Rotscher <
>> [email protected]> wrote:
>>
>>> Hello,
>>>
>>> usually the directory, which is specified by --checkpoint-dir, should have
>>> the following structure:
>>> <jobid>
>>> |__ script.ckpt
>>> |__ <jobid>.0
>>>       |__ task.0.ckpt
>>>       |__ task.1.ckpt
>>>       |__ ...
>>>
>>> But you only have to run the following command to restart your batch job:
>>> scontrol checkpoint restart <jobid>
>>>
>>> I have only tried batch jobs, and I am currently trying to build MVAPICH2
>>> with BLCR and Slurm support, because that MPI library is explicitly
>>> mentioned in the Slurm documentation.
>>>
>>> A colleague also tested DMTCP, but without success.
>>>
>>> Kind regards
>>> Danny
>>> TU Dresden
>>> Germany
>>>
>>>
>>> On 14.04.2016 at 11:01, Husen R wrote:
>>>
>>>> Hi all,
>>>> Thank you for your reply.
>>>>
>>>> Danny :
>>>> I have installed BLCR and SLURM successfully.
>>>> I have also configured CheckpointType, --checkpoint, --checkpoint-dir and
>>>> JobCheckpointDir so that Slurm supports checkpointing.
>>>>
>>>> I have tried to checkpoint a simple MPI parallel application many times
>>>> on my small cluster and, as you said, after the checkpoint completes
>>>> there is a directory named after the job ID in --checkpoint-dir. That
>>>> directory contains a file named "script.ckpt". I tried to restart
>>>> directly using the srun command below:
>>>>
>>>> srun --mpi=pmi2 --restart-dir=/mirror/source/cr/51 ./mm.o
>>>>
>>>> where --restart-dir is the directory that contains "script.ckpt".
>>>> Unfortunately, I got the following error:
>>>>
>>>> Failed to open(/mirror/source/cr/51/task.0.ckpt, O_RDONLY): No such file or directory
>>>> srun: error: compute-node: task 0: Exited with exit code 255
>>>>
>>>> As the error message above shows, there was no "task.0.ckpt" file. I
>>>> don't know how to get such a file. The only files I got from the
>>>> checkpoint operation are a file named "script.ckpt" in --checkpoint-dir
>>>> and two files in JobCheckpointDir named "<jobid>.ckpt" and
>>>> "<jobid>.ckpt.old".
>>>>
>>>> According to the srun section of
>>>> http://slurm.schedmd.com/checkpoint_blcr.html, after the checkpoint
>>>> completes there should be checkpoint files of the form "<jobid>.ckpt"
>>>> and "<jobid>.<stepid>.ckpt" in --checkpoint-dir.
>>>>
>>>> Any idea how to solve this?
>>>>
>>>> Manuel :
>>>>
>>>> Yes, BLCR doesn't support checkpoint/restart of parallel/distributed
>>>> applications by itself (
>>>> https://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#mpi).
>>>> But it can be used by other software to do that (I hope that software is
>>>> SLURM..huhu)
>>>>
>>>> I once tried to restart an MPI application using DMTCP, but it didn't
>>>> work.
>>>> Would you please tell me how to do that?
>>>>
>>>>
>>>> Thank you in advance,
>>>>
>>>> Regards,
>>>>
>>>>
>>>> Husen
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Apr 14, 2016 at 12:03 PM, Danny Rotscher <
>>>> [email protected]> wrote:
>>>>
>>>>> I forgot to add something: you have to create a directory for the
>>>>> checkpoint metadata, which by default is located in
>>>>> /var/slurm/checkpoint:
>>>>> mkdir -p /var/slurm/checkpoint
>>>>> chown -R slurm /var/slurm
>>>>> or you can define your own directory in slurm.conf:
>>>>> JobCheckpointDir=<your directory>
>>>>>
>>>>> You can check the parameters with:
>>>>> scontrol show config | grep -i checkpoint
>>>>>
>>>>> Kind regards,
>>>>> Danny
>>>>> TU Dresden
>>>>> Germany
>>>>>
>>>>> On 14.04.2016 at 06:41, Danny Rotscher wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> we haven't gotten it to work either, but we have already built Slurm
>>>>>> with BLCR.
>>>>>>
>>>>>> You first have to install the BLCR library, which is described on the
>>>>>> following website:
>>>>>> https://upc-bugs.lbl.gov/blcr/doc/html/BLCR_Admin_Guide.html
>>>>>>
>>>>>> Then we built and installed Slurm from source, and BLCR checkpointing
>>>>>> was included.
>>>>>>
>>>>>> After that you have to set at least one parameter in the file
>>>>>> "slurm.conf":
>>>>>> CheckpointType=checkpoint/blcr
>>>>>>
>>>>>> There are two ways to create a checkpoint: you can either create one
>>>>>> from outside your job with the following command:
>>>>>> scontrol checkpoint create <jobid>
>>>>>> or you can let Slurm take periodic checkpoints with the following
>>>>>> sbatch parameter:
>>>>>> #SBATCH --checkpoint <minutes>
>>>>>> We also tried:
>>>>>> #SBATCH --checkpoint <minutes>:<seconds>
>>>>>> e.g.
>>>>>> #SBATCH --checkpoint 0:10
>>>>>> to test it, but that doesn't work for us.
>>>>>>
>>>>>> We also set the parameter for the checkpoint directory:
>>>>>> #SBATCH --checkpoint-dir <directory>
>>>>>>
>>>>>> After you have created a checkpoint and a directory named after your
>>>>>> job ID has been created in your checkpoint directory, you can restart
>>>>>> the job with the following command:
>>>>>> scontrol checkpoint restart <jobid>
>>>>>>
>>>>>> We tested some sequential and OpenMP programs with different parameters
>>>>>> and it works (checkpoint creation and restarting),
>>>>>> but *we can't get any MPI library to work*; we have already tested some
>>>>>> programs built with Open MPI and Intel MPI.
>>>>>> The checkpoint is created, but we get the following error when we try
>>>>>> to restart them:
>>>>>> - Failed to open file '/'
>>>>>> - cr_restore_all_files [28534]:  Unable to restore fd 3 (type=1,err=-21)
>>>>>> - cr_rstrt_child [28534]:  Unable to restore files!  (err=-21)
>>>>>> Restart failed: Is a directory
>>>>>> srun: error: taurusi4010: task 0: Exited with exit code 21
>>>>>>
>>>>>> So, it would be great if you could confirm our problems; maybe then
>>>>>> SchedMD will raise the priority of such mails ;-)
>>>>>> If you get it to work, please help us understand how.
>>>>>>
>>>>>> Kind regards,
>>>>>> Danny
>>>>>> TU Dresden
>>>>>> Germany
>>>>>>
>>>>>> On 11.04.2016 at 10:09, Husen R wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> Based on the information at
>>>>>>> http://slurm.schedmd.com/checkpoint_blcr.html,
>>>>>>> Slurm is able to checkpoint whole batch jobs and then restart execution
>>>>>>> of batch jobs and job steps from checkpoint files.
>>>>>>>
>>>>>>> Could anyone please tell me how to do that?
>>>>>>> I need help.
>>>>>>>
>>>>>>> Thank you in advance.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>>
>>>>>>> Husen Rusdiansyah
>>>>>>> University of Indonesia
>>>>>>>
>>>>>>>
>>> --
>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>> Danny Rotscher
>>> HPC-Support
>>>
>>> Technische Universität Dresden
>>> Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH)
>>> 01062 Dresden
>>> Tel.: +49 351 463-35853
>>> Fax : +49 351 463-37773
>>> E-Mail: [email protected]
>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>
>>>
>>>
>>>
