I've found two things: first, you could try srun_cr instead of srun; second, does your job run for more than 5 minutes? (Your script sets --checkpoint=5, i.e. a 5-minute interval.)
But I'm not sure, so please try it and post the result.

On 14.04.2016 at 12:56, Husen R wrote:
Hello Danny,

I have tried to restart using "scontrol checkpoint restart <jobid>", but it
doesn't work.
In addition, the "<jobid>.0" directory and its contents do not exist in my
--checkpoint-dir.
The following is my batch job :

=====================batch job===================

#!/bin/bash
#SBATCH -J MatMul
#SBATCH -o mm-%j.out
#SBATCH -A pro
#SBATCH -N 3
#SBATCH -n 24
#SBATCH --checkpoint=5
#SBATCH --checkpoint-dir=/mirror/source/cr
#SBATCH --time=01:30:00
#SBATCH --mail-user=hus...@gmail.com
#SBATCH --mail-type=begin
#SBATCH --mail-type=end

srun --mpi=pmi2 ./mm.o

===================end batch job================

Is there something that prevents me from getting the right directory
structure?


Regards,



Husen




On Thu, Apr 14, 2016 at 5:36 PM, Danny Rotscher <
danny.rotsc...@tu-dresden.de> wrote:

Hello,

usually the directory, which is specified by --checkpoint-dir, should have
the following structure:
<jobid>
|__ script.ckpt
|__ <jobid>.0
      |__ task.0.ckpt
      |__ task.1.ckpt
      |__ ...
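
The layout above can be checked with a small shell function before attempting a restart. This is only a sketch: the function name check_ckpt_layout and its messages are hypothetical, and the layout it tests is simply the one described above (script.ckpt plus a <jobid>.0 directory holding task.N.ckpt files). It does not call Slurm or BLCR.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): report which parts of the
# expected checkpoint layout are missing.
# Usage: check_ckpt_layout <checkpoint-dir> <jobid>
check_ckpt_layout() {
    ckpt_dir=$1
    jobid=$2
    job_dir="$ckpt_dir/$jobid"
    step_dir="$job_dir/$jobid.0"   # assumes job step 0, as in the layout above
    status=0

    if [ ! -d "$job_dir" ]; then
        echo "missing job directory: $job_dir"
        return 1
    fi
    if [ ! -f "$job_dir/script.ckpt" ]; then
        echo "missing batch script image: $job_dir/script.ckpt"
        status=1
    fi
    if [ ! -d "$step_dir" ]; then
        echo "missing step directory: $step_dir"
        return 1
    fi

    # Count the per-task images (task.0.ckpt, task.1.ckpt, ...).
    count=0
    for f in "$step_dir"/task.*.ckpt; do
        if [ -f "$f" ]; then
            count=$((count + 1))
        fi
    done
    if [ "$count" -eq 0 ]; then
        echo "no task.*.ckpt files in $step_dir"
        status=1
    else
        echo "found $count task image(s) in $step_dir"
    fi
    return $status
}
```

For example, `check_ckpt_layout /mirror/source/cr 51` would report a missing step directory if only script.ckpt had been written for job 51.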

But you only have to run the following command to restart your batch job:
scontrol checkpoint restart <jobid>

So far I have only tried batch jobs, and I am currently trying to build
MVAPICH2 with BLCR and Slurm support, because that MPI library is explicitly
mentioned in the Slurm documentation.

A colleague also tested DMTCP, but without success.

Kind regards
Danny
TU Dresden
Germany


On 14.04.2016 at 11:01, Husen R wrote:

Hi all,
Thank you for your reply

Danny :
I have installed BLCR and SLURM successfully.
I also have configured CheckpointType, --checkpoint, --checkpoint-dir and
JobCheckpointDir in order for slurm to support checkpoint.

I have tried to checkpoint a simple MPI parallel application many times in
my small cluster and, as you said, after the checkpoint is completed there is
a directory named after the job ID in --checkpoint-dir. In that directory there
is a file named "script.ckpt". I tried to restart directly using the srun
command below:

srun --mpi=pmi2 --restart-dir=/mirror/source/cr/51 ./mm.o

where --restart-dir is the directory that contains "script.ckpt".
Unfortunately, I got the following error:

Failed to open(/mirror/source/cr/51/task.0.ckpt, O_RDONLY): No such file or directory
srun: error: compute-node: task 0: Exited with exit code 255

As we can see from the error message above, there was no "task.0.ckpt"
file. I don't know how to get such a file. The files that I got from the
checkpoint operation are a file named "script.ckpt" in --checkpoint-dir and
two files in JobCheckpointDir named "<jobid>.ckpt" and "<jobid>.ckpt.old".

According to the srun section of
http://slurm.schedmd.com/checkpoint_blcr.html, after the checkpoint is
completed there should be checkpoint files of the form "<jobid>.ckpt" and
"<jobid>.<stepid>.ckpt" in --checkpoint-dir.

Any idea how to solve this?

Manuel :

Yes, BLCR doesn't support checkpoint/restart of parallel/distributed
applications by itself (
https://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#mpi).
But it can be used by other software to do that (I hope that software is
SLURM..huhu)

I have tried to restart an MPI application using DMTCP, but it doesn't
work.
Would you please tell me how to do that?


Thank you in advance,

Regards,


Husen





On Thu, Apr 14, 2016 at 12:03 PM, Danny Rotscher <
danny.rotsc...@tu-dresden.de> wrote:

I forgot to add something: you have to create a directory for the
checkpoint metadata, which is located by default in
/var/slurm/checkpoint:
mkdir -p /var/slurm/checkpoint
chown -R slurm /var/slurm
or you can define your own directory in slurm.conf:
JobCheckpointDir=<your directory>

You can check the parameters with:
scontrol show config | grep checkpoint
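
Note that the grep above is case-sensitive, so a line such as JobCheckpointDir only shows up if its value happens to contain the lowercase word "checkpoint"; "grep -i checkpoint" avoids that. As a sketch, a single value can also be extracted with a small helper. The function name get_conf_value is hypothetical, and it assumes the "Key = value" line format that "scontrol show config" prints; it only reads text from standard input.

```shell
#!/bin/sh
# Hypothetical helper: read "Key = value" lines on stdin and print the
# value of one key (key match is case-insensitive).
# Usage: scontrol show config | get_conf_value JobCheckpointDir
get_conf_value() {
    awk -v key="$1" -F'=' '
        {
            k = $1
            gsub(/[ \t]/, "", k)              # strip padding around the key
            if (tolower(k) == tolower(key)) {
                v = $0
                sub(/^[^=]*=[ \t]*/, "", v)   # drop everything up to the first "="
                print v
            }
        }'
}
```

The returned directory can then be tested for existence and ownership before submitting a job with --checkpoint.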

Kind regards,
Danny
TU Dresden
Germany

On 14.04.2016 at 06:41, Danny Rotscher wrote:

Hello,
we have not gotten it to work either, but we have already built Slurm with BLCR.

You first have to install the BLCR library, which is described on the
following website:
https://upc-bugs.lbl.gov/blcr/doc/html/BLCR_Admin_Guide.html

Then we built and installed Slurm from source, and BLCR checkpointing was
included.

After that you have to set at least one parameter in the file
"slurm.conf":
CheckpointType=checkpoint/blcr

There are two ways to create checkpoints: you can either trigger a
checkpoint from outside your job with the following command:
scontrol checkpoint create <jobid>
or you can have Slurm take periodic checkpoints with the following
sbatch parameter:
#SBATCH --checkpoint <minutes>
We also tried:
#SBATCH --checkpoint <minutes>:<seconds>
e.g.
#SBATCH --checkpoint 0:10
to test it, but it doesn't work for us.

We also set the parameter for the checkpoint directory:
#SBATCH --checkpoint-dir <directory>

After you create a checkpoint and a directory named after your job ID has
been created in your checkpoint directory, you can restart the job with the
following command:
scontrol checkpoint restart <jobid>

We tested some sequential and OpenMP programs with different parameters
and it works (checkpoint creation and restarting),
but *we have not gotten any MPI library to work*; we have already tested some
programs built with Open MPI and Intel MPI.
The checkpoint is created, but we get the following error when we
try to restart:
- Failed to open file '/'
- cr_restore_all_files [28534]:  Unable to restore fd 3 (type=1,err=-21)
- cr_rstrt_child [28534]:  Unable to restore files!  (err=-21)
Restart failed: Is a directory
srun: error: taurusi4010: task 0: Exited with exit code 21

So, it would be great if you could confirm our problems; maybe then
SchedMD will raise the priority of such mails ;-)
If you get it to work, please help us understand how.

Kind regards,
Danny
TU Dresden
Germany

On 11.04.2016 at 10:09, Husen R wrote:

Hi all,
Based on the information in this link
http://slurm.schedmd.com/checkpoint_blcr.html,
Slurm is able to checkpoint whole batch jobs and then restart execution
of batch jobs and job steps from checkpoint files.

Could anyone please tell me how to do that?
I need help.

Thank you in advance.

Regards,


Husen Rusdiansyah
University of Indonesia


--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Danny Rotscher
HPC-Support

Technische Universität Dresden
Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH)
01062 Dresden
Tel.: +49 351 463-35853
Fax : +49 351 463-37773
E-Mail: danny.rotsc...@tu-dresden.de
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~





