There is a good tutorial on how to use DMTCP on its GitHub page:

https://github.com/dmtcp/dmtcp/blob/master/QUICK-START.md

I would start there. That said, the Slurm mailing list is probably not
the best place to ask about DMTCP.
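
As a rough sketch of the basic workflow in that quick start, the three
DMTCP commands can be wrapped as below. The wrapper names (ckpt_launch,
ckpt_request, ckpt_restart) are hypothetical and exist only for this
illustration; dmtcp_launch, dmtcp_command and dmtcp_restart are the DMTCP
tools themselves. This covers a single process only, so an MPI job will
most likely need the extra steps described in the DMTCP documentation.

```shell
#!/bin/sh
# Thin, hypothetical wrappers around the DMTCP command-line tools.
# They only forward their arguments; DMTCP must be installed and on PATH.

# Start a program under DMTCP checkpoint control.
ckpt_launch() {
    dmtcp_launch "$@"
}

# Ask the DMTCP coordinator to checkpoint the processes it manages.
ckpt_request() {
    dmtcp_command --checkpoint
}

# Resume from one or more checkpoint images (files named ckpt_*.dmtcp).
ckpt_restart() {
    dmtcp_restart "$@"
}
```

For example, "ckpt_launch ./a.out" in one terminal, "ckpt_request" in
another, and later "ckpt_restart ckpt_*.dmtcp" from the directory that
holds the images.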

Best regards,

Manuel

2016-04-14 11:01 GMT+02:00 Husen R <[email protected]>:
> Hi all,
> Thank you for your reply.
>
> Danny:
> I have installed BLCR and Slurm successfully.
> I have also configured CheckpointType, --checkpoint, --checkpoint-dir and
> JobCheckpointDir so that Slurm supports checkpointing.
>
> I have tried to checkpoint a simple MPI parallel application many times on
> my small cluster and, as you said, after the checkpoint completes there is
> a directory named after the job ID in --checkpoint-dir. That directory
> contains a file named "script.ckpt". I tried to restart directly with the
> srun command below:
>
> srun --mpi=pmi2 --restart-dir=/mirror/source/cr/51 ./mm.o
>
> where --restart-dir is the directory that contains "script.ckpt".
> Unfortunately, I got the following error:
>
> Failed to open(/mirror/source/cr/51/task.0.ckpt, O_RDONLY): No such file or
> directory
> srun: error: compute-node: task 0: Exited with exit code 255
>
> As the error message shows, there is no "task.0.ckpt" file, and I don't
> know how to obtain one. The only files the checkpoint operation gave me are
> "script.ckpt" in --checkpoint-dir and two files in JobCheckpointDir named
> "<jobid>.ckpt" and "<jobid>.ckpt.old".
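
Before calling srun it may help to confirm which per-task images are
actually in the restart directory. The function below is a minimal sketch
of such a check. The name missing_task_ckpts is hypothetical, and the
task.<rank>.ckpt naming is an assumption extrapolated from the single
file name in the error message above.

```shell
#!/bin/sh
# Report per-task checkpoint images missing from a restart directory.
# Usage: missing_task_ckpts <restart-dir> <ntasks>
# Assumption: images are named task.<rank>.ckpt, as in the srun error.
missing_task_ckpts() {
    dir=$1
    ntasks=$2
    rank=0
    missing=0
    while [ "$rank" -lt "$ntasks" ]; do
        if [ ! -f "$dir/task.$rank.ckpt" ]; then
            echo "missing: $dir/task.$rank.ckpt"
            missing=$((missing + 1))
        fi
        rank=$((rank + 1))
    done
    # Return success only if every expected image is present.
    [ "$missing" -eq 0 ]
}
```

For example, "missing_task_ckpts /mirror/source/cr/51 1" would print the
missing task.0.ckpt path and return non-zero for the directory described
above, so it can guard the srun --restart-dir call.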
>
> According to the srun section of
> http://slurm.schedmd.com/checkpoint_blcr.html, once the checkpoint has
> completed there should be checkpoint files of the form "<jobid>.ckpt" and
> "<jobid>.<stepid>.ckpt" in --checkpoint-dir.
>
> Any idea how to solve this?
>
> Manuel:
>
> Yes, BLCR by itself doesn't support checkpoint/restart of
> parallel/distributed applications (
> https://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#mpi). But it can be used by
> other software to do that (I hope that software is Slurm... huhu).
>
> I have tried to restart an MPI application using DMTCP before, but it
> didn't work.
> Would you please tell me how to do that?
>
>
> Thank you in advance,
>
> Regards,
>
>
> Husen
>
> On Thu, Apr 14, 2016 at 12:03 PM, Danny Rotscher
> <[email protected]> wrote:
>>
>> I forgot to add something: you have to create a directory for the
>> checkpoint metadata, which by default is located in /var/slurm/checkpoint:
>> mkdir -p /var/slurm/checkpoint
>> chown -R slurm /var/slurm
>> or you can define your own directory in slurm.conf:
>> JobCheckpointDir=<your directory>
>>
>> You can check the parameters with (-i because the parameter names are
>> capitalized):
>> scontrol show config | grep -i checkpoint
>>
>> Kind regards,
>> Danny
>> TU Dresden
>> Germany
>>
>> On 14.04.2016 at 06:41, Danny Rotscher wrote:
>>>
>>> Hello,
>>>
>>> We haven't got it to work either, but we have already built Slurm with BLCR.
>>>
>>> You first have to install the BLCR library, as described on the
>>> following website:
>>> https://upc-bugs.lbl.gov/blcr/doc/html/BLCR_Admin_Guide.html
>>>
>>> Then we built and installed Slurm from source, and BLCR checkpointing was
>>> included.
>>>
>>> After that you have to set at least one parameter in the file
>>> "slurm.conf":
>>> CheckpointType=checkpoint/blcr
>>>
>>> There are two ways to create checkpoints. You can either trigger a
>>> checkpoint with the following command from outside your job:
>>> scontrol checkpoint create <jobid>
>>> or you can let Slurm take periodic checkpoints with the following
>>> sbatch parameter:
>>> #SBATCH --checkpoint <minutes>
>>> We also tried:
>>> #SBATCH --checkpoint <minutes>:<seconds>
>>> e.g.
>>> #SBATCH --checkpoint 0:10
>>> to test it, but it doesn't work for us.
>>>
>>> We also set the parameter for the checkpoint directory:
>>> #SBATCH --checkpoint-dir <directory>
>>>
>>> Once you have created a checkpoint and a directory named after your job ID
>>> has appeared in your checkpoint directory, you can restart the job with the
>>> following command:
>>> scontrol checkpoint restart <jobid>
>>>
>>> We tested some sequential and OpenMP programs with different parameters
>>> and it works (checkpoint creation and restarting),
>>> but *we can't get any MPI library to work*; we have already tested some
>>> programs built with Open MPI and Intel MPI.
>>> The checkpoint is created, but we get the following error when we
>>> try to restart them:
>>> - Failed to open file '/'
>>> - cr_restore_all_files [28534]:  Unable to restore fd 3 (type=1,err=-21)
>>> - cr_rstrt_child [28534]:  Unable to restore files!  (err=-21)
>>> Restart failed: Is a directory
>>> srun: error: taurusi4010: task 0: Exited with exit code 21
>>>
>>> So it would be great if you could confirm our problems; maybe then
>>> SchedMD will raise the priority of such mails ;-)
>>> If you get it to work, please help us understand how.
>>>
>>> Kind regards,
>>> Danny
>>> TU Dresden
>>> Germany
>>>
>>> On 11.04.2016 at 10:09, Husen R wrote:
>>>>
>>>> Hi all,
>>>>
>>>> Based on the information at
>>>> http://slurm.schedmd.com/checkpoint_blcr.html,
>>>> Slurm is able to checkpoint whole batch jobs and then restart execution
>>>> of batch jobs and job steps from checkpoint files.
>>>>
>>>> Could anyone please tell me how to do that?
>>>> I need help.
>>>>
>>>> Thank you in advance.
>>>>
>>>> Regards,
>>>>
>>>>
>>>> Husen Rusdiansyah
>>>> University of Indonesia
>>>
>>>
>>
>
