[slurm-dev] Re: Slurm Checkpoint/Restart example

2016-04-14 Thread Husen R
Danny :

I'm unable to use the srun_cr command. I got this error message in the
slurmctld log file after submitting srun_cr through sbatch:

[2016-04-14T19:22:42.719] job_complete: JobID=67 State=0x1 NodeCnt=2 WEXITSTATUS 255

Any idea how to fix this?

And yes, my job needs more than 5 minutes.
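
For anyone hitting the same kind of entry, here is a minimal sketch for pulling failed jobs out of the slurmctld log. It assumes each job_complete entry sits on a single line in the real log file (the line format is taken from the entry above), and `failed_jobs` is a hypothetical helper name, not a Slurm tool.

```bash
#!/bin/bash
# failed_jobs: read slurmctld log lines on stdin and print "JOBID EXITSTATUS"
# for every job_complete entry whose WEXITSTATUS is non-zero.
# Hypothetical helper; the line format is assumed from the entry quoted above.
failed_jobs() {
    sed -n 's/.*job_complete: JobID=\([0-9][0-9]*\) .*WEXITSTATUS \([0-9][0-9]*\).*/\1 \2/p' |
        awk '$2 != 0'
}
```

It would be fed the log on stdin, for example `failed_jobs < slurmctld.log`, where the actual log path depends on the SlurmctldLogFile setting of the site.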

Andy :

Yes, the /mirror directory is shared across my cluster. I have configured it
using NFS.

Regards,



Husen



On Thu, Apr 14, 2016 at 6:15 PM, Danny Rotscher <
danny.rotsc...@tu-dresden.de> wrote:

> I've found two things: first, you could try srun_cr instead of srun; second,
> does your job need more than 5 minutes?
> But I'm not sure, so please try it and post the result.

[slurm-dev] Re: Slurm Checkpoint/Restart example

2016-04-14 Thread Andy Riebs
Is your /mirror directory shared across your cluster?
 

[slurm-dev] Re: Slurm Checkpoint/Restart example

2016-04-14 Thread Danny Rotscher
I've found two things: first, you could try srun_cr instead of srun; second,
does your job need more than 5 minutes?

But I'm not sure, so please try it and post the result.
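
The second point can be checked mechanically: with --checkpoint=5, as in the batch script in this thread, a periodic checkpoint would only be due once the job has run for longer than that interval. A minimal sketch, assuming the interval is given in minutes and the elapsed time as M, M:S or H:M:S; `to_minutes` and `ran_past_interval` are hypothetical helper names:

```bash
#!/bin/bash
# to_minutes: convert a time string (M, M:S or H:M:S) to whole minutes.
# Only these three forms are handled in this sketch.
to_minutes() {
    echo "$1" | awk -F: '
        NF == 3 { print $1 * 60 + $2 + int($3 / 60); next }
        NF == 2 { print $1 + int($2 / 60); next }
                { print $1 + 0 }'
}

# ran_past_interval ELAPSED INTERVAL
# Succeeds if the elapsed run time is longer than the checkpoint interval.
ran_past_interval() {
    [ "$(to_minutes "$1")" -gt "$(to_minutes "$2")" ]
}
```

The elapsed time could come from something like `squeue -h -j <jobid> -o %M`; a job that finishes before the interval has passed would not be expected to leave a periodic checkpoint behind.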


[slurm-dev] Re: Slurm Checkpoint/Restart example

2016-04-14 Thread Husen R
Hello Danny,

I have tried to restart using "scontrol checkpoint restart <jobid>" but it
doesn't work.
In addition, the "<jobid>.0" directory and its contents don't exist in my
--checkpoint-dir.
The following is my batch job:

=batch job===

#!/bin/bash
#SBATCH -J MatMul
#SBATCH -o mm-%j.out
#SBATCH -A pro
#SBATCH -N 3
#SBATCH -n 24
#SBATCH --checkpoint=5
#SBATCH --checkpoint-dir=/mirror/source/cr
#SBATCH --time=01:30:00
#SBATCH --mail-user=hus...@gmail.com
#SBATCH --mail-type=begin
#SBATCH --mail-type=end

srun --mpi=pmi2 ./mm.o

===end batch job

Is there something that prevents me from getting the right directory
structure?


Regards,



Husen





[slurm-dev] Re: Slurm Checkpoint/Restart example

2016-04-14 Thread Danny Rotscher

Hello,

Usually the directory specified by --checkpoint-dir should
have the following structure:

<jobid>
|__ script.ckpt
|__ <jobid>.0
    |__ task.0.ckpt
    |__ task.1.ckpt
    |__ ...

But you only have to run the following command to restart your batch job:
scontrol checkpoint restart <jobid>
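
Before attempting a restart, it may help to check whether the task images are actually there. A minimal sketch, assuming the layout described above (script.ckpt in the per-job directory, task.N.ckpt files one level below it); it does not assume the name of the step subdirectory, and `count_task_images` is a hypothetical helper name:

```bash
#!/bin/bash
# count_task_images JOB_DIR
# JOB_DIR is the per-job directory inside --checkpoint-dir.
# Prints the number of task.N.ckpt files found exactly one level below
# JOB_DIR; fails if script.ckpt is missing or no task image exists.
count_task_images() {
    _job_dir=$1
    if [ ! -f "$_job_dir/script.ckpt" ]; then
        echo "missing $_job_dir/script.ckpt" >&2
        return 1
    fi
    # tr strips the padding that some wc implementations add
    _n=$(find "$_job_dir" -mindepth 2 -maxdepth 2 -type f -name 'task.*.ckpt' | wc -l | tr -d ' ')
    echo "$_n"
    [ "$_n" -gt 0 ]
}
```

A result of 0 with script.ckpt present would match the situation reported in this thread, where only the batch script image was written and no per-task images.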

I have only tried batch jobs so far, and I am currently trying to build MVAPICH2
with BLCR and Slurm support, because that MPI library is explicitly mentioned in
the Slurm documentation.

A colleague also tested DMTCP, but without success.

Kind regards
Danny
TU Dresden
Germany


[slurm-dev] Re: Slurm Checkpoint/Restart example

2016-04-14 Thread Manuel Rodríguez Pascual

There is a good tutorial on how to use DMTCP on their GitHub page:

https://github.com/dmtcp/dmtcp/blob/master/QUICK-START.md

I would start there. Anyway, this Slurm mailing list is probably not
the best place to ask for that information.

Best regards,

Manuel


[slurm-dev] Re: Slurm Checkpoint/Restart example

2016-04-14 Thread Husen R
Hi all,
Thank you for your replies.

Danny :
I have installed BLCR and SLURM successfully.
I have also configured CheckpointType, --checkpoint, --checkpoint-dir and
JobCheckpointDir so that Slurm supports checkpointing.

I have tried to checkpoint a simple MPI parallel application many times on
my small cluster and, as you said, after the checkpoint completes there is
a directory named after the job ID in --checkpoint-dir. In that directory there
is a file named "script.ckpt". I tried to restart directly using the srun
command below:

srun --mpi=pmi2 --restart-dir=/mirror/source/cr/51 ./mm.o

where --restart-dir is the directory that contains "script.ckpt".
Unfortunately, I got the following error:

Failed to open(/mirror/source/cr/51/task.0.ckpt, O_RDONLY): No such file or directory
srun: error: compute-node: task 0: Exited with exit code 255

As the error message shows, there was no "task.0.ckpt" file, and I don't
know how to get one. The only files I got from the checkpoint operation are
"script.ckpt" in --checkpoint-dir and two files in JobCheckpointDir named
"<jobid>.ckpt" and "<jobid>.ckpt.old".

According to the srun section of
http://slurm.schedmd.com/checkpoint_blcr.html, after a checkpoint
completes there should be checkpoint files of the form "<jobid>.ckpt" and
"<jobid>.<stepid>.ckpt" in --checkpoint-dir.

Any idea how to solve this?

Manuel :

Yes, BLCR doesn't support checkpoint/restart of parallel/distributed
applications by itself ( https://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#mpi).
But it can be used by other software to do that (I hope that software is
SLURM..huhu)

I have tried to restart an MPI application using DMTCP before, but it didn't
work.
Would you please tell me how to do that?


Thank you in advance,

Regards,


Husen





On Thu, Apr 14, 2016 at 12:03 PM, Danny Rotscher <
danny.rotsc...@tu-dresden.de> wrote:

> I forgot to add something: you have to create a directory for the
> checkpoint metadata, which is located by default in /var/slurm/checkpoint:
> mkdir -p /var/slurm/checkpoint
> chown -R slurm /var/slurm
> or you can define your own directory in slurm.conf:
> JobCheckpointDir=
>
> You can check the parameters with:
> scontrol show config | grep checkpoint
>
> Kind regards,
> Danny
> TU Dresden
> Germany
>
> Am 14.04.2016 um 06:41 schrieb Danny Rotscher:
>
>> Hello,
>>
>> we haven't got it to work either, but we have already built Slurm with BLCR.
>>
>> You first have to install the BLCR library, which is described on the
>> following website:
>> https://upc-bugs.lbl.gov/blcr/doc/html/BLCR_Admin_Guide.html
>>
>> Then we built and installed Slurm from source, and BLCR checkpointing was
>> included.
>>
>> After that you have to set at least one parameter in the file
>> "slurm.conf":
>> CheckpointType=checkpoint/blcr
>>
>> There are two ways to create checkpoints: you can either make a
>> checkpoint with the following command from outside your job:
>> scontrol checkpoint create <jobid>
>> or you can let Slurm take periodic checkpoints with the following
>> sbatch parameter:
>> #SBATCH --checkpoint 
>> We also tried:
>> #SBATCH --checkpoint :
>> e.g.
>> #SBATCH --checkpoint 0:10
>> to test it, but it doesn't work for us.
>>
>> We also set the parameter for the checkpoint directory:
>> #SBATCH --checkpoint-dir 
>>
>> Once you have created a checkpoint and a directory named after your job ID
>> has appeared in your checkpoint directory, you can restart the job with the
>> following command:
>> scontrol checkpoint restart <jobid>
>>
>> We tested some sequential and OpenMP programs with different parameters
>> and it works (checkpoint creation and restarting),
>> but *we can't get any MPI library to work*; we have already tested some
>> programs built with Open MPI and Intel MPI.
>> The checkpoint is created, but we get the following error when we
>> try to restart it:
>> - Failed to open file '/'
>> - cr_restore_all_files [28534]:  Unable to restore fd 3 (type=1,err=-21)
>> - cr_rstrt_child [28534]:  Unable to restore files!  (err=-21)
>> Restart failed: Is a directory
>> srun: error: taurusi4010: task 0: Exited with exit code 21
>>
>> So, it would be great if you could confirm our problems, maybe then
>> schedmd higher up the priority of such mails;-)
>> If you get it to work, please help us to understand how.
>>
>> Kind reagards,
>> Danny
>> TU Dresden
>> Germany
>>
>> Am 11.04.2016 um 10:09 schrieb Husen R:
>>
>>> Hi all,
>>>
>>> Based on the information in this link
>>> http://slurm.schedmd.com/checkpoint_blcr.html,
>>> Slurm able to checkpoint the whole batch jobs and then Restart execution
>>> of
>>> batch jobs and job steps from checkpoint files.
>>>
>>> Anyone please tell me how to do that ?
>>> I need help.
>>>
>>> Thank you in advance.
>>>
>>> Regards,
>>>
>>>
>>> Husen Rusdiansyah
>>> University of Indonesia
>>>
>>
>>
>


[slurm-dev] Re: Slurm Checkpoint/Restart example

2016-04-14 Thread Manuel Rodríguez Pascual

Hi Danny, all,

As far as I know, BLCR unfortunately does not support MPI.
At least I haven't been able to get it to work.

On the other hand, DMTCP ( http://dmtcp.sourceforge.net/ ) does work
with MPI. My team is very interested in having a reliable
checkpoint/restart mechanism in Slurm, so we are now working on a plugin to
integrate it. We are facing some technical problems, but we are working
together with the DMTCP team to solve them, and we are confident of having
the integration ready soon.

Anyway, I'll send a mail to this list when it's ready.

Cheers,


Manuel




[slurm-dev] Re: Slurm Checkpoint/Restart example

2016-04-13 Thread Danny Rotscher
I forgot to add something: you have to create a directory for the
checkpoint metadata, which by default is located in /var/slurm/checkpoint:

mkdir -p /var/slurm/checkpoint
chown -R slurm /var/slurm

Or you can define your own directory in slurm.conf:
JobCheckpointDir=

You can check the parameters with:
scontrol show config | grep checkpoint
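
Note that grep is case-sensitive here, so the command above only matches
lines that contain the lowercase string "checkpoint", for example in a value
such as /var/slurm/checkpoint. A minimal sketch for reading the configured
directory whatever its value is, assuming that "scontrol show config" prints
one "Key = Value" pair per line. The function name checkpoint_dir_from_config
is made up for this example and is not part of Slurm:

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): reads `scontrol show config`
# output on stdin and prints the value of JobCheckpointDir.
# Assumes the "Key = Value" layout, one setting per line.
checkpoint_dir_from_config() {
    sed -n 's/^ *JobCheckpointDir *= *//p'
}

# Usage on a cluster (not run here):
#   scontrol show config | checkpoint_dir_from_config
```

If nothing is printed, the parameter is not present in the output, and the
assumed layout should be checked against the real output first.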

Kind regards,
Danny
TU Dresden
Germany









[slurm-dev] Re: Slurm Checkpoint/Restart example

2016-04-13 Thread Danny Rotscher

Hello,

we haven't got it to work either, but we have already built Slurm with BLCR support.

First you have to install the BLCR library, as described on the
following website:

https://upc-bugs.lbl.gov/blcr/doc/html/BLCR_Admin_Guide.html

Then we built and installed Slurm from source, and BLCR checkpointing was
included.


After that you have to set at least one parameter in the file "slurm.conf":
CheckpointType=checkpoint/blcr

There are two ways to create checkpoints. You can either create a
checkpoint with the following command from outside your job:

scontrol checkpoint create 

or you can let Slurm create periodic checkpoints with the following
sbatch parameter:

#SBATCH --checkpoint 

We also tried:
#SBATCH --checkpoint :
e.g.
#SBATCH --checkpoint 0:10
to test it, but it doesn't work for us.

We also set the parameter for the checkpoint directory:
#SBATCH --checkpoint-dir 
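
A minimal sketch of how the two sbatch directives mentioned above could be
combined in one batch script. The function name make_checkpoint_job, the
interval 5, the directory /tmp/ckpt and the program ./my_program are
placeholders invented for this example; only the --checkpoint and
--checkpoint-dir options come from this thread:

```shell
#!/bin/sh
# Hypothetical helper: print a batch script that asks Slurm for periodic
# checkpoints. Arguments: interval, checkpoint directory, then the command
# that srun should start.
make_checkpoint_job() {
    interval=$1
    ckpt_dir=$2
    shift 2
    printf '%s\n' '#!/bin/bash'
    printf '#SBATCH --checkpoint=%s\n' "$interval"
    printf '#SBATCH --checkpoint-dir=%s\n' "$ckpt_dir"
    printf 'srun %s\n' "$*"
}

# Usage (placeholder values, not run here):
#   make_checkpoint_job 5 /tmp/ckpt ./my_program > job.sh
#   sbatch job.sh
```

Generating the script this way only keeps the interval and the directory in
one place; writing the same directives by hand into a batch script is
equivalent.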

After you have created a checkpoint and a directory named after your
job ID has been created in your checkpoint directory, you can restart the
job with the following command:

scontrol checkpoint restart 
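
Before restarting, it may help to confirm that the per-job directory
described above really exists and is not empty. A minimal sketch, assuming
the layout "checkpoint directory / job ID"; the function name has_checkpoint,
the directory /tmp/ckpt and the job ID 67 are placeholders for this example:

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the checkpoint directory contains
# a non-empty subdirectory named after the job ID.
has_checkpoint() {
    ckpt_dir=$1
    job_id=$2
    [ -d "$ckpt_dir/$job_id" ] && [ -n "$(ls -A "$ckpt_dir/$job_id")" ]
}

# Usage (placeholder directory and job ID, not run here):
#   has_checkpoint /tmp/ckpt 67 && scontrol checkpoint restart 67
```

This only checks that some files are present, not that the checkpoint images
themselves are complete or restorable.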

We tested some sequential and OpenMP programs with different parameters,
and it works (checkpoint creation and restarting),
but *we can't get any MPI library to work*. We have already tested some
programs built with Open MPI and Intel MPI.
The checkpoint is created, but we get the following error when we
try to restart them:

- Failed to open file '/'
- cr_restore_all_files [28534]:  Unable to restore fd 3 (type=1,err=-21)
- cr_rstrt_child [28534]:  Unable to restore files!  (err=-21)
Restart failed: Is a directory
srun: error: taurusi4010: task 0: Exited with exit code 21

So it would be great if you could confirm our problems; maybe then
SchedMD will raise the priority of such mails ;-)

If you get it to work, please help us to understand how.

Kind regards,
Danny
TU Dresden
Germany

Am 11.04.2016 um 10:09 schrieb Husen R:

Hi all,

Based on the information in this link
http://slurm.schedmd.com/checkpoint_blcr.html,
Slurm is able to checkpoint whole batch jobs and then restart execution of
batch jobs and job steps from checkpoint files.

Could anyone please tell me how to do that?
I need help.

Thank you in advance.

Regards,


Husen Rusdiansyah
University of Indonesia

