Is your /mirror directory shared across your cluster?
On 04/14/2016 06:56 AM, Husen R wrote:
Re: [slurm-dev] Re: Slurm Checkpoint/Restart example
Hello Danny,
I have tried to restart using "scontrol checkpoint restart
<jobid>" but it doesn't work.
In addition, the "<jobid>.0" directory and its contents do not
exist in my --checkpoint-dir.
The following is my batch job:
===================== batch job ===================
#!/bin/bash
#SBATCH -J MatMul
#SBATCH -o mm-%j.out
#SBATCH -A pro
#SBATCH -N 3
#SBATCH -n 24
#SBATCH --checkpoint=5
#SBATCH --checkpoint-dir=/mirror/source/cr
#SBATCH --time=01:30:00
#SBATCH [email protected]
#SBATCH --mail-type=begin
#SBATCH --mail-type=end
srun --mpi=pmi2 ./mm.o
=================== end batch job ================
Is there something that prevents me from getting the
right directory structure?
Regards,
Husen
On Thu, Apr 14, 2016 at 5:36 PM, Danny
Rotscher <[email protected]>
wrote:
Hello,
usually the directory, which is specified by
--checkpoint-dir, should have the following structure:
<jobid>
|__ script.ckpt
|__ <jobid>.0
     |__ task.0.ckpt
     |__ task.1.ckpt
     |__ ...
But you only have to run the following command to restart
your batch job:
scontrol checkpoint restart <jobid>
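To compare an actual checkpoint directory against the structure above, a small check like the following may help. This is only a sketch: `check_ckpt_layout` is a hypothetical helper (not part of Slurm or BLCR), and it assumes the job has a single step, so it looks only for `<jobid>.0`.

```shell
# Sketch: report which parts of the expected checkpoint layout are missing.
# check_ckpt_layout is a hypothetical helper, not a Slurm or BLCR command.
# Arguments: <checkpoint-dir> <jobid>
check_ckpt_layout() {
    local ckpt_dir="$1" jobid="$2"
    local job_dir="$ckpt_dir/$jobid"
    local step_dir="$job_dir/$jobid.0"   # assumption: only step 0 exists
    local status=0

    # The batch script checkpoint should sit directly in <jobid>/
    if [ ! -f "$job_dir/script.ckpt" ]; then
        echo "missing: $job_dir/script.ckpt"
        status=1
    fi

    # The per-task checkpoints should sit in <jobid>/<jobid>.0/
    if [ ! -d "$step_dir" ]; then
        echo "missing: $step_dir"
        status=1
    elif ! ls "$step_dir"/task.*.ckpt >/dev/null 2>&1; then
        echo "missing: $step_dir/task.*.ckpt"
        status=1
    fi

    return "$status"
}
```

For example, `check_ckpt_layout /mirror/source/cr 51` would print one line per missing piece and return a non-zero status if anything is absent.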
So far I have only tried batch jobs. I am currently trying to build
MVAPICH2 with BLCR and Slurm support, because that MPI
library is explicitly mentioned in the Slurm documentation.
A colleague also tested DMTCP, but without success.
Kind regards
Danny
TU Dresden
Germany
On 14.04.2016 at 11:01, Husen R wrote:
Hi all,
Thank you for your reply
Danny :
I have installed BLCR and SLURM successfully.
I have also configured CheckpointType, --checkpoint,
--checkpoint-dir and JobCheckpointDir so that Slurm supports
checkpointing.
I have tried to checkpoint a simple MPI parallel
application many times on my small cluster and, like you said,
after the checkpoint completes there is a directory named after
the job ID in --checkpoint-dir. In that directory there
is a file named "script.ckpt". I tried to restart
directly using the srun command below:
srun --mpi=pmi2 --restart-dir=/mirror/source/cr/51 ./mm.o
where --restart-dir is the directory that contains
"script.ckpt".
Unfortunately, I got the following error:
Failed to open(/mirror/source/cr/51/task.0.ckpt, O_RDONLY): No such file or directory
srun: error: compute-node: task 0: Exited with exit code 255
As the error message above shows, there was no "task.0.ckpt"
file, and I don't know how to get one. The only files
I got from the checkpoint operation are "script.ckpt" in
--checkpoint-dir and two files in JobCheckpointDir named
"<jobid>.ckpt" and "<jobid>.ckpt.old".
According to the srun section of
http://slurm.schedmd.com/checkpoint_blcr.html,
after the checkpoint completes there should be checkpoint files
of the form "<jobid>.ckpt" and "<jobid>.<stepid>.ckpt" in
--checkpoint-dir.
Any idea how to solve this?
Manuel :
Yes, BLCR doesn't support checkpoint/restart of
parallel/distributed applications by itself (
https://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#mpi).
But it can be used by other software to do that (I
hope that software is SLURM..huhu)
I have also tried to restart an MPI application using
DMTCP, but it doesn't work.
Would you please tell me how to do that?
Thank you in advance,
Regards,
Husen
On Thu, Apr 14, 2016 at 12:03 PM, Danny Rotscher <
[email protected]>
wrote:
I forgot to add something: you have to create a
directory for the checkpoint metadata, which by default
is located in /var/slurm/checkpoint:
mkdir -p /var/slurm/checkpoint
chown -R slurm /var/slurm
or you can define your own directory in slurm.conf:
JobCheckpointDir=<your directory>
You can check the parameters with:
scontrol show config | grep checkpoint
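Note that a plain `grep checkpoint` is case-sensitive, so it matches these lines through their values (for example `checkpoint/blcr`) and not through key names such as CheckpointType. A slightly more forgiving filter could look like the sketch below. `show_ckpt_settings` is a hypothetical helper, and the `Key = value` line format is an assumption about what `scontrol show config` prints.

```shell
# Sketch: filter checkpoint-related settings from "Key = value" lines on stdin.
# show_ckpt_settings is a hypothetical helper; the input format is assumed.
show_ckpt_settings() {
    # -i also matches key names such as CheckpointType or JobCheckpointDir;
    # sed collapses the padding around the first "=" to get Key=value.
    grep -i 'checkpoint' | sed 's/[[:space:]]*=[[:space:]]*/=/'
}
```

It would be used as `scontrol show config | show_ckpt_settings`.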
Kind regards,
Danny
TU Dresden
Germany
On 14.04.2016 at 06:41, Danny Rotscher wrote:
Hello,
we have not got it to work either, but we have already built
Slurm with BLCR.
You first have to install the BLCR library, which
is described on the following website:
https://upc-bugs.lbl.gov/blcr/doc/html/BLCR_Admin_Guide.html
Then we built and installed Slurm from source, and
BLCR checkpointing was included.
After that you have to set at least one parameter
in the file "slurm.conf":
CheckpointType=checkpoint/blcr
There are two ways to create a checkpoint. You
can either create one with the following command
from outside your job:
scontrol checkpoint create <jobid>
or you can let Slurm create periodic
checkpoints with the following sbatch parameter:
#SBATCH --checkpoint <minutes>
We also tried:
#SBATCH --checkpoint <minutes>:<seconds>
e.g.
#SBATCH --checkpoint 0:10
to test it, but it doesn't work for us.
We also set the parameter for the checkpoint
directory:
#SBATCH --checkpoint-dir <directory>
After you create a checkpoint and a directory named
after your job ID has been created in your checkpoint
directory, you can restart the job with the
following command:
scontrol checkpoint restart <jobid>
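Because the restart depends on the job directory having appeared in the checkpoint directory, a small wait loop can guard the restart command. This is a sketch only: `wait_for_ckpt_dir` is a hypothetical helper, and the one-second polling interval and 60-second default timeout are arbitrary assumptions.

```shell
# Sketch: wait until <checkpoint-dir>/<jobid> exists, or give up after a timeout.
# wait_for_ckpt_dir is a hypothetical helper, not a Slurm command.
# Arguments: <checkpoint-dir> <jobid> [timeout-seconds, default 60]
wait_for_ckpt_dir() {
    local ckpt_dir="$1" jobid="$2" timeout="${3:-60}"
    local waited=0
    while [ ! -d "$ckpt_dir/$jobid" ]; do
        # Stop polling once the timeout has been reached
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Usage sketch, with the commands from this mail (placeholders left as-is):
#   scontrol checkpoint create <jobid>
#   wait_for_ckpt_dir <directory> <jobid> 120 && scontrol checkpoint restart <jobid>
```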
We tested some sequential and OpenMP programs with
different parameters, and both checkpoint creation and
restarting work, but *we don't get any MPI library to work*;
we have already tested some programs built with
Open MPI and Intel MPI.
The checkpoint is created, but we get the
following error when we try to restart them:
- Failed to open file '/'
- cr_restore_all_files [28534]: Unable to restore fd 3 (type=1,err=-21)
- cr_rstrt_child [28534]: Unable to restore files! (err=-21)
Restart failed: Is a directory
srun: error: taurusi4010: task 0: Exited with exit code 21
So it would be great if you could confirm our
problems; maybe then SchedMD will raise the
priority of such mails ;-)
If you get it to work, please help us to
understand how.
Kind regards,
Danny
TU Dresden
Germany
On 11.04.2016 at 10:09, Husen R wrote:
Hi all,
Based on the information in this link
http://slurm.schedmd.com/checkpoint_blcr.html,
Slurm is able to checkpoint whole batch jobs
and then restart execution of batch jobs and
job steps from checkpoint files.
Could anyone please tell me how to do that?
I need help.
Thank you in advance.
Regards,
Husen Rusdiansyah
University of Indonesia
--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Danny Rotscher
HPC-Support
Technische Universität Dresden
Zentrum für Informationsdienste und
Hochleistungsrechnen (ZIH)
01062 Dresden
Tel.: +49 351 463-35853
Fax : +49 351 463-37773
E-Mail: [email protected]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~