The documentation at https://computing.llnl.gov/linux/slurm/checkpoint_blcr.html does not make it clear how I am supposed to checkpoint jobs launched via SLURM.
Say I have launched an MPI job with the following command:

    srun_cr -N2 -n24 --checkpoint 1 --checkpoint-dir /home/arjun/Checkpoint MPIJobBinary

Does this command by itself ensure that checkpoints are written to the /home/arjun/Checkpoint directory at 1-minute intervals? If so, it doesn't work for me.

Or do I need to run

    ps -U arjun | grep srun

to find the PID of the srun_cr process, and then issue

    cr_checkpoint sruns_pid

to checkpoint the job? When I do this, I get the following error:

    chkpt_watchdog: 'srun_cr' exited with signal 9 during checkpoint
    Checkpoint cancelled by application : Unable to checkpoint

The same thing happens when I try something like:

    srun_cr -N1 -n1 --checkpoint 1 --checkpoint-dir /home/arjun/Checkpoint SerialBinary
