I don't know if there's any difference in the documentation, but the first thing most people recommend is ignoring the top links in Google and heading straight over to schedmd.com for documentation. No idea how old that LLNL stuff is.
http://slurm.schedmd.com/checkpoint_blcr.html

So yes, you should see checkpoints every minute with that command. They should end up in that directory... I don't think there's anything wrong with the command. The signal suggests there's something a bit deeper going wrong.

Now, we don't use cr_checkpoint as indicated in the documentation; instead we use scontrol:

    scontrol checkpoint create <jobid>

It's a lot easier than digging around for PIDs. This is working for us (slurm 2.6.2, blcr 0.8.5).

Have you increased debugging on slurmd and slurmctld? I think around level 5 you would see some interesting messages.

Best
Michael

On Thu, Mar 13, 2014 at 12:47 AM, Arjun J Rao <[email protected]> wrote:

> It is not exactly clear from the documentation here (
> https://computing.llnl.gov/linux/slurm/checkpoint_blcr.html) how it is
> that I am supposed to checkpoint my jobs launched via SLURM.
>
> Say I have launched an MPI Job with the following command
> srun_cr -N2 -n24 --checkpoint 1 --checkpoint-dir /home/arjun/Checkpoint
> MPIJobBinary
>
> Does this command in itself ensure that the checkpoints will be taken out
> at 1 minute intervals to the /home/arjun/Checkpoint directory ? If so, it
> doesn't work for me.
>
>
> Or do we need to do a
>
> ps -U arjun | grep srun
> to find out the pid of the srun_cr process and then issue
>
> cr_checkpoint sruns_pid
> to checkpoint the job ?
>
> But when I do this I get the following error :
> - chkpt_watchdog: 'srun_cr' exited with signal 9 during checkpoint
> Checkpoint cancelled by application : Unable to checkpoint
>
> Similar is the case when i try to do something like :
> srun_cr -N1 -n1 --checkpoint 1 --checkpoint-dir /home/arjun/Checkpoint
> SerialBinary
>
>
>

--
Hey! Somebody punched the foley guy!
- Crow, MST3K ep. 508
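
P.S. In case it helps, here is a minimal sketch of wrapping the scontrol call from the reply above in a small shell function. The function name checkpoint_job and the DRY_RUN switch are made up for illustration and are not part of Slurm; the only Slurm command used is the `scontrol checkpoint create <jobid>` form mentioned above. It also assumes job IDs are plain numbers, which may not hold on every site.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): request a checkpoint of a
# running job by its job ID, using the scontrol form quoted in this thread.
# Set DRY_RUN=1 to print the command instead of running it.
checkpoint_job() {
    jobid=$1
    # Assumption: job IDs are all digits. Reject empty or non-numeric input.
    case $jobid in
        ''|*[!0-9]*)
            echo "usage: checkpoint_job <numeric jobid>" >&2
            return 2
            ;;
    esac
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Dry run: show what would be executed.
        echo "scontrol checkpoint create $jobid"
    else
        scontrol checkpoint create "$jobid"
    fi
}
```

With DRY_RUN=1 set, `checkpoint_job 12345` only prints `scontrol checkpoint create 12345`, so you can check the command before pointing it at a real job.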
