It is not clear to me from the documentation
(https://computing.llnl.gov/linux/slurm/checkpoint_blcr.html) how I am
supposed to checkpoint jobs launched via SLURM.

Say I have launched an MPI job with the following command:

srun_cr -N2 -n24 --checkpoint 1 --checkpoint-dir /home/arjun/Checkpoint MPIJobBinary

Does this command by itself ensure that checkpoints are written to the
/home/arjun/Checkpoint directory at 1-minute intervals? If so, it
doesn't work for me.


Or do I need to run

ps -U arjun | grep srun

to find the pid of the srun_cr process and then issue

cr_checkpoint sruns_pid

to checkpoint the job?
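
The two manual steps above could be scripted roughly as follows. This is only a sketch of the same procedure, not something taken from the SLURM documentation: the function name checkpoint_srun_cr is hypothetical, and it assumes that pgrep is installed and that the process really appears under the name srun_cr.

```shell
#!/bin/sh
# Sketch only: find the srun_cr process owned by a user and ask BLCR to
# checkpoint it. The function name checkpoint_srun_cr is hypothetical.
checkpoint_srun_cr() {
    user="$1"
    # -u: match process owner, -x: exact process name, -n: newest match only.
    # "|| true" keeps the script alive under "set -e" when nothing matches.
    pid=$(pgrep -u "$user" -x -n srun_cr || true)
    if [ -z "$pid" ]; then
        echo "no srun_cr process found for user $user" >&2
        return 1
    fi
    # Same call as the manual step: cr_checkpoint <pid>
    cr_checkpoint "$pid"
}
```

It would be called as "checkpoint_srun_cr arjun". It only automates the pid lookup, so it should behave exactly like the manual cr_checkpoint call.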

But when I run cr_checkpoint this way, I get the following error:
- chkpt_watchdog: 'srun_cr'  exited with signal 9 during checkpoint
Checkpoint cancelled by application : Unable to checkpoint

The same thing happens when I try a serial job, for example:

srun_cr -N1 -n1 --checkpoint 1 --checkpoint-dir /home/arjun/Checkpoint SerialBinary
