Hi all,
Thank you for your replies.

Danny:
I have installed BLCR and SLURM successfully.
I have also configured CheckpointType, --checkpoint, --checkpoint-dir and
JobCheckpointDir so that Slurm supports checkpointing.

I have tried many times to checkpoint a simple MPI parallel application on
my small cluster and, as you said, after the checkpoint completes there is
a directory named after the job ID in --checkpoint-dir. That directory
contains a file named "script.ckpt". I tried to restart directly using the
srun command below:

srun --mpi=pmi2 --restart-dir=/mirror/source/cr/51 ./mm.o

where --restart-dir is the directory that contains "script.ckpt".
Unfortunately, I got the following error:

Failed to open(/mirror/source/cr/51/task.0.ckpt, O_RDONLY): No such file or
directory
srun: error: compute-node: task 0: Exited with exit code 255

As the error message above shows, there was no "task.0.ckpt"
file, and I don't know how to obtain one. The files that I got from the
checkpoint operation are "script.ckpt" in --checkpoint-dir and
two files in JobCheckpointDir named "<jobid>.ckpt" and "<jobid>.ckpt.old".
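
To make the missing-file problem easier to spot before calling srun, a small
pre-flight check can help. The sketch below is only an illustration:
check_restart_dir is a hypothetical helper name, and the task.<rank>.ckpt
naming is assumed from the error message above rather than taken from the
Slurm documentation.

```shell
#!/bin/sh
# check_restart_dir DIR NTASKS   (hypothetical helper, not part of Slurm or BLCR)
# Looks for one task.<rank>.ckpt file per task in DIR.
# Prints a "missing: <path>" line for every absent file and returns 1
# if at least one is absent, 0 otherwise.
# Assumption: the per-task file name pattern is inferred from the srun
# error message "Failed to open(.../task.0.ckpt, O_RDONLY)".
check_restart_dir() {
    dir=$1
    ntasks=$2
    missing=0
    rank=0
    while [ "$rank" -lt "$ntasks" ]; do
        f="$dir/task.$rank.ckpt"
        if [ ! -f "$f" ]; then
            echo "missing: $f"
            missing=1
        fi
        rank=$((rank + 1))
    done
    return "$missing"
}
```

It could be used as "check_restart_dir /mirror/source/cr/51 1" before the srun
restart command; with the files I currently have, it would report
task.0.ckpt as missing.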

According to the srun section of
http://slurm.schedmd.com/checkpoint_blcr.html, after a checkpoint
completes there should be checkpoint files of the form "<jobid>.ckpt" and
"<jobid>.<stepid>.ckpt" in --checkpoint-dir.

Any idea how to solve this?

Manuel:

Yes, BLCR doesn't support checkpoint/restart of parallel/distributed
applications by itself (https://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#mpi).
But it can be used by other software to do that (I hope that software is
SLURM..huhu)

I once tried to restart an MPI application using DMTCP, but it didn't
work.
Would you please tell me how to do that?


Thank you in advance,

Regards,


Husen





On Thu, Apr 14, 2016 at 12:03 PM, Danny Rotscher <
[email protected]> wrote:

> I forgot to add something: you have to create a directory for the
> checkpoint metadata, which by default is located in /var/slurm/checkpoint:
> mkdir -p /var/slurm/checkpoint
> chown -R slurm /var/slurm
> or you can define your own directory in slurm.conf:
> JobCheckpointDir=<your directory>
>
> You can check the parameters with:
> scontrol show config | grep checkpoint
>
> Kind regards,
> Danny
> TU Dresden
> Germany
>
> On 14.04.2016 at 06:41, Danny Rotscher wrote:
>
>> Hello,
>>
>> we haven't got it to work either, but we have already built Slurm with BLCR.
>>
>> You first have to install the BLCR library, which is described on the
>> following website:
>> https://upc-bugs.lbl.gov/blcr/doc/html/BLCR_Admin_Guide.html
>>
>> Then we built and installed Slurm from source, and BLCR checkpointing was
>> included.
>>
>> After that you have to set at least one parameter in the file
>> "slurm.conf":
>> CheckpointType=checkpoint/blcr
>>
>> There are two ways to create checkpoints: you can either trigger a
>> checkpoint with the following command from outside your job:
>> scontrol checkpoint create <jobid>
>> or you can let Slurm take periodic checkpoints with the following
>> sbatch parameter:
>> #SBATCH --checkpoint <minutes>
>> We also tried:
>> #SBATCH --checkpoint <minutes>:<seconds>
>> e.g.
>> #SBATCH --checkpoint 0:10
>> to test it, but that doesn't work for us.
>>
>> We also set the parameter for the checkpoint directory:
>> #SBATCH --checkpoint-dir <directory>
>>
>> After you create a checkpoint and a directory named after your job ID has
>> been created in your checkpoint directory, you can restart the job with the
>> following command:
>> scontrol checkpoint restart <jobid>
>>
>> We tested some sequential and OpenMP programs with different parameters
>> and it works (checkpoint creation and restarting),
>> but *we don't get any MPI library to work*; we already tested some
>> programs built with Open MPI and Intel MPI.
>> The checkpoint is created, but we get the following error when we
>> try to restart them:
>> - Failed to open file '/'
>> - cr_restore_all_files [28534]:  Unable to restore fd 3 (type=1,err=-21)
>> - cr_rstrt_child [28534]:  Unable to restore files!  (err=-21)
>> Restart failed: Is a directory
>> srun: error: taurusi4010: task 0: Exited with exit code 21
>>
>> So, it would be great if you could confirm our problems; maybe then
>> SchedMD will raise the priority of such mails ;-)
>> If you get it to work, please help us understand how.
>>
>> Kind regards,
>> Danny
>> TU Dresden
>> Germany
>>
>> On 11.04.2016 at 10:09, Husen R wrote:
>>
>>> Hi all,
>>>
>>> Based on the information at
>>> http://slurm.schedmd.com/checkpoint_blcr.html,
>>> Slurm is able to checkpoint whole batch jobs and then restart execution
>>> of batch jobs and job steps from checkpoint files.
>>>
>>> Could anyone please tell me how to do that?
>>> I need help.
>>>
>>> Thank you in advance.
>>>
>>> Regards,
>>>
>>>
>>> Husen Rusdiansyah
>>> University of Indonesia
>>>
>>
>>
>
