[slurm-dev] Re: Slurm Checkpoint/Restart example
Danny:

I'm unable to use the srun_cr command. After submitting a job that calls srun_cr with sbatch, I got this message in the slurmctld log file:

[2016-04-14T19:22:42.719] job_complete: JobID=67 State=0x1 NodeCnt=2 WEXITSTATUS 255

Any idea how to fix this? And yes, my job needs more than 5 minutes.

Andy:

Yes, the /mirror directory is shared across my cluster. I have configured it using NFS.

Regards,

Husen
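When scanning a long slurmctld log for failed jobs like the one above, it can help to reduce each job_complete line to its job ID and exit status. The following is only a minimal sketch: parse_job_complete is a hypothetical helper (not part of Slurm), and it assumes the log lines have exactly the shape quoted in this message.

```shell
# Hypothetical helper, not a Slurm tool: reads slurmctld log lines on stdin
# and prints "<jobid> <exit status>" for every job_complete line that has
# the same shape as the one quoted in this thread.
parse_job_complete() {
    sed -n 's/.*job_complete: JobID=\([0-9][0-9]*\) .*WEXITSTATUS \([0-9][0-9]*\).*/\1 \2/p'
}
```

It would be fed the slurmctld log on stdin; where that log lives depends on the SlurmctldLogFile setting of the cluster, so no path is assumed here.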
[slurm-dev] Re: Slurm Checkpoint/Restart example
Is your /mirror directory shared across your cluster?
[slurm-dev] Re: Slurm Checkpoint/Restart example
I've found two things: first, you could try srun_cr instead of srun, and second, does your job need more than 5 minutes? But I'm not sure, so please try it and post the result.
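The question about the 5 minutes refers to the --checkpoint=5 interval in the batch script: a job that finishes before the first interval elapses never gets a periodic checkpoint. A small sketch of that arithmetic follows; expected_checkpoints is a hypothetical helper, and it assumes that one checkpoint is taken each time a full interval has passed.

```shell
# Hypothetical helper: number of periodic checkpoints that fit into a run,
# ASSUMING one checkpoint is taken whenever a full interval has elapsed.
# $1 = job runtime in minutes, $2 = checkpoint interval in minutes
expected_checkpoints() {
    local runtime_min="$1"
    local interval_min="$2"
    if [ "$interval_min" -le 0 ]; then
        echo "interval must be positive" >&2
        return 1
    fi
    # Integer division: a partial interval does not produce a checkpoint.
    echo $(( runtime_min / interval_min ))
}
```

Under that assumption, a 4-minute run with a 5-minute interval yields 0 checkpoints, while a run that uses the full 90 minutes requested by --time=01:30:00 yields 18.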
[slurm-dev] Re: Slurm Checkpoint/Restart example
Hello Danny,

I have tried to restart using "scontrol checkpoint restart " but it doesn't work.
In addition, the ".0" directory and its contents do not exist in my --checkpoint-dir.
The following is my batch job:

===batch job===

#!/bin/bash
#SBATCH -J MatMul
#SBATCH -o mm-%j.out
#SBATCH -A pro
#SBATCH -N 3
#SBATCH -n 24
#SBATCH --checkpoint=5
#SBATCH --checkpoint-dir=/mirror/source/cr
#SBATCH --time=01:30:00
#SBATCH --mail-user=hus...@gmail.com
#SBATCH --mail-type=begin
#SBATCH --mail-type=end

srun --mpi=pmi2 ./mm.o

===end batch job===

Is there something that prevents me from getting the right directory structure?

Regards,

Husen
[slurm-dev] Re: Slurm Checkpoint/Restart example
Hello,

usually the directory specified by --checkpoint-dir should have the following structure:

|__ script.ckpt
|__ .0
     |__ task.0.ckpt
     |__ task.1.ckpt
     |__ ...

But you only have to run the following command to restart your batch job:
scontrol checkpoint restart

I have only tried batch jobs, and I am currently trying to build MVAPICH2 with BLCR and Slurm support, because that MPI library is explicitly mentioned in the Slurm documentation.

A colleague also tested DMTCP, but without success.

Kind regards,
Danny
TU Dresden
Germany
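Before attempting a restart, the layout described above can be checked by listing which of the expected files are absent. This is only a sketch: missing_ckpt_files is a hypothetical helper, and because the name of the per-step directory is not spelled out in this thread, both directories are passed in as arguments instead of being guessed.

```shell
# Hypothetical helper: print every expected checkpoint file that is missing.
# $1 = job checkpoint directory (expected to hold script.ckpt)
# $2 = directory expected to hold the per-task files task.<n>.ckpt
# $3 = number of tasks in the job
missing_ckpt_files() {
    local job_dir="$1"
    local step_dir="$2"
    local ntasks="$3"
    local i=0
    [ -f "$job_dir/script.ckpt" ] || echo "$job_dir/script.ckpt"
    while [ "$i" -lt "$ntasks" ]; do
        [ -f "$step_dir/task.$i.ckpt" ] || echo "$step_dir/task.$i.ckpt"
        i=$((i + 1))
    done
}
```

An empty result means all files are present. For the failing restart reported in this thread, where srun looked for task.0.ckpt directly inside the --restart-dir, both directory arguments would be /mirror/source/cr/51 and the task count 24.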
[slurm-dev] Re: Slurm Checkpoint/Restart example
There is a good tutorial on how to use DMTCP on their GitHub page:
https://github.com/dmtcp/dmtcp/blob/master/QUICK-START.md

I would start there. In any case, this Slurm mailing list is probably not the best place to ask for that information.

Best regards,

Manuel
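As a rough illustration of the restart step only: DMTCP writes checkpoint images with names of the form ckpt_*.dmtcp, and dmtcp_restart takes those image files as arguments. The sketch below uses a hypothetical helper, dmtcp_restart_cmd, that merely prints the command line for the images found in a directory; it does not run anything and says nothing about the MPI-specific setup, for which the quick-start guide remains the reference.

```shell
# Hypothetical helper: print (do not run) a dmtcp_restart command line for
# the DMTCP checkpoint images found in directory $1.
dmtcp_restart_cmd() {
    local dir="$1"
    # Load the matching image paths into the function's positional
    # parameters; if nothing matches, the pattern stays as a literal string.
    set -- "$dir"/ckpt_*.dmtcp
    if [ ! -e "$1" ]; then
        echo "no DMTCP checkpoint images in $dir" >&2
        return 1
    fi
    echo "dmtcp_restart $*"
}
```

Printing instead of executing keeps the sketch safe to try on a login node; the printed line can be inspected before it is used inside a job.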
[slurm-dev] Re: Slurm Checkpoint/Restart example
Hi all,

Thank you for your reply.

Danny:

I have installed BLCR and Slurm successfully. I have also configured CheckpointType, --checkpoint, --checkpoint-dir and JobCheckpointDir so that Slurm supports checkpointing.

I have tried to checkpoint a simple MPI parallel application many times on my small cluster and, as you said, after the checkpoint completes there is a directory named after the job ID in --checkpoint-dir. That directory contains a file named "script.ckpt". I tried to restart directly using the srun command below:

srun --mpi=pmi2 --restart-dir=/mirror/source/cr/51 ./mm.o

where --restart-dir is the directory that contains "script.ckpt". Unfortunately, I got the following error:

Failed to open(/mirror/source/cr/51/task.0.ckpt, O_RDONLY): No such file or directory
srun: error: compute-node: task 0: Exited with exit code 255

As the error message shows, there is no "task.0.ckpt" file, and I don't know how to get one. The checkpoint operation produced only a file named "script.ckpt" in --checkpoint-dir and two files in JobCheckpointDir named ".ckpt" and ".ckpt.old".

According to the srun section of http://slurm.schedmd.com/checkpoint_blcr.html, after a checkpoint completes there should be checkpoint files of the form ".ckpt" and "..ckpt" in --checkpoint-dir.

Any idea how to solve this?

Manuel:

Yes, BLCR doesn't support checkpoint/restart of parallel/distributed applications by itself (https://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#mpi). But it can be used by other software to do that (I hope that software is Slurm, huhu).

I have tried to restart an MPI application using DMTCP before, but it didn't work. Would you please tell me how to do that?

Thank you in advance,

Regards,

Husen
[slurm-dev] Re: Slurm Checkpoint/Restart example
Hi Danny, all,

As far as I know, BLCR unfortunately does not include MPI support. At least I haven't been able to get it to work. On the other hand, DMTCP ( http://dmtcp.sourceforge.net/ ) does work with MPI.

My team is very interested in having a reliable checkpoint/restart mechanism in Slurm, so we are now developing a plugin to integrate it. We are facing some technical problems, but we are working together with the DMTCP team to solve them, and we are confident that the integration will be ready soon.

Anyway, I'll send a mail to this list when it's ready.

Cheers,
Manuel
[slurm-dev] Re: Slurm Checkpoint/Restart example
I forgot to add something: you have to create a directory for the checkpoint metadata, which by default is located in /var/slurm/checkpoint:

    mkdir -p /var/slurm/checkpoint
    chown -R slurm /var/slurm

Alternatively, you can define your own directory in slurm.conf:

    JobCheckpointDir=

You can check the parameters with:

    scontrol show config | grep checkpoint

Kind regards,
Danny
TU Dresden
Germany
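As a rough illustration of the directory setup described in the message above, the following shell sketch checks whether a checkpoint metadata directory exists and is writable. The function name ckpt_dir_ok and the CKPT_DIR variable are hypothetical and not part of Slurm; the default path is the one quoted in the message. The check only covers the user running the script, whereas on a real cluster the directory has to be writable by the slurm user.

```shell
#!/bin/sh
# Hypothetical helper (not a Slurm command): succeed only if the given
# directory exists and the current user can write to it.
ckpt_dir_ok() {
    dir=$1
    [ -d "$dir" ] && [ -w "$dir" ]
}

# Assumed default location, as quoted in the message above; override by
# exporting CKPT_DIR before running the script.
CKPT_DIR=${CKPT_DIR:-/var/slurm/checkpoint}

if ckpt_dir_ok "$CKPT_DIR"; then
    echo "checkpoint directory ready: $CKPT_DIR"
else
    echo "checkpoint directory missing or not writable: $CKPT_DIR"
fi
```

If a custom JobCheckpointDir is set in slurm.conf, the same check can be pointed at that path by setting CKPT_DIR accordingly.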
[slurm-dev] Re: Slurm Checkpoint/Restart example
Hello,

we haven't got it to work either, but we have already built Slurm with BLCR.

First you have to install the BLCR library, as described on the following website:
https://upc-bugs.lbl.gov/blcr/doc/html/BLCR_Admin_Guide.html

Then we built and installed Slurm from source, and BLCR checkpointing was included.

After that you have to set at least one parameter in the file "slurm.conf":

    CheckpointType=checkpoint/blcr

There are two ways to create checkpoints. You can either create a checkpoint with the following command from outside your job:

    scontrol checkpoint create

or you can let Slurm create periodic checkpoints with the following sbatch parameter:

    #SBATCH --checkpoint

We also tried:

    #SBATCH --checkpoint :

e.g.

    #SBATCH --checkpoint 0:10

to test it, but it doesn't work for us.

We also set the parameter for the checkpoint directory:

    #SBATCH --checkpoint-dir

After you have created a checkpoint and a directory named after your job ID has been created in your checkpoint directory, you can restart the job with the following command:

    scontrol checkpoint restart

We tested some sequential and OpenMP programs with different parameters and it works (checkpoint creation and restarting), but *we haven't got any MPI library to work*; we have already tested some programs built with Open MPI and Intel MPI. The checkpoint is created, but we get the following error when we try to restart them:

- Failed to open file '/'
- cr_restore_all_files [28534]: Unable to restore fd 3 (type=1,err=-21)
- cr_rstrt_child [28534]: Unable to restore files! (err=-21)
Restart failed: Is a directory
srun: error: taurusi4010: task 0: Exited with exit code 21

So it would be great if you could confirm our problems; maybe then SchedMD will raise the priority of such mails ;-)
If you get it to work, please help us understand how.

Kind regards,
Danny
TU Dresden
Germany

On 11.04.2016 at 10:09, Husen R wrote:

> Hi all,
>
> Based on the information at this link, http://slurm.schedmd.com/checkpoint_blcr.html, Slurm is able to checkpoint whole batch jobs and then restart execution of batch jobs and job steps from checkpoint files.
>
> Could anyone please tell me how to do that?
> I need help.
>
> Thank you in advance.
>
> Regards,
> Husen Rusdiansyah
> University of Indonesia
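The restart step in Danny's message above depends on a per-job directory existing inside the checkpoint directory. A minimal shell sketch of that precondition follows. The function names has_batch_checkpoint and restart_hint are hypothetical and not part of Slurm. The script.ckpt file name is the one reported elsewhere in this thread, and passing the job ID as the argument to the restart command is an assumption based on the scontrol syntax of Slurm releases that shipped the checkpoint/blcr plugin. The command is only printed, not executed, so the sketch can be tried on a machine without Slurm.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): succeed only if the checkpoint
# directory contains a subdirectory named after the job ID with a
# script.ckpt file inside it (the layout reported in this thread).
has_batch_checkpoint() {
    ckpt_dir=$1
    job_id=$2
    [ -f "$ckpt_dir/$job_id/script.ckpt" ]
}

# Print (rather than run) the restart command for a job, or report on
# stderr that no checkpoint was found and return a non-zero status.
restart_hint() {
    ckpt_dir=$1
    job_id=$2
    if has_batch_checkpoint "$ckpt_dir" "$job_id"; then
        echo "scontrol checkpoint restart $job_id"
    else
        echo "no checkpoint found for job $job_id in $ckpt_dir" >&2
        return 1
    fi
}
```

For example, calling restart_hint with the directory given to --checkpoint-dir and a job ID prints the command to run only when a batch checkpoint for that job is actually present. It says nothing about whether the per-task checkpoint files needed for an MPI restart exist.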