I don't know whether the documentation differs, but the first thing most
people recommend is to ignore the top links in Google and head straight
over to schedmd.com for documentation.  I have no idea how old that LLNL
material is.

http://slurm.schedmd.com/checkpoint_blcr.html

So yes, you should see a checkpoint every minute with that command, and the
images should end up in that directory.  I don't think there's anything
wrong with the command itself.  The signal suggests that something a bit
deeper is going wrong.

Now, we don't use cr_checkpoint as indicated in the documentation; instead
we use scontrol:

scontrol checkpoint create <jobid>

It's a lot easier than digging around for PIDs.  This is working for us
(slurm 2.6.2, blcr 0.8.5).
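
For anyone who wants to script this, here is a minimal sketch of a wrapper
around the scontrol call above.  The function name checkpoint_job and the
SCONTROL override variable are my own hypothetical names, not part of Slurm;
the only Slurm command used is the "scontrol checkpoint create" shown above.

```shell
#!/bin/sh
# Sketch only: wraps "scontrol checkpoint create <jobid>".
# checkpoint_job and SCONTROL are illustrative names, not Slurm features.
# SCONTROL lets you substitute another binary (e.g. echo) for a dry run.

checkpoint_job() {
    jobid=$1

    # Accept a numeric job ID, optionally with a step suffix such as 1234.0.
    case "$jobid" in
        ''|*[!0-9.]*)
            echo "usage: checkpoint_job <jobid>" >&2
            return 2
            ;;
    esac

    # Ask slurmctld to checkpoint the job; no PID hunting required.
    "${SCONTROL:-scontrol}" checkpoint create "$jobid"
}
```

Usage would be something like "checkpoint_job 1234", with the job ID taken
from squeue.  Setting SCONTROL=echo prints the command instead of running it.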

Have you increased the debug level on slurmd and slurmctld?  I think at
around level 5 you would see some interesting messages.
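
As a rough sketch of one way to do that at runtime: "scontrol setdebug"
changes the slurmctld log level without a restart.  The function name
raise_ctld_debug and the SCONTROL override are hypothetical names of mine,
and I am assuming the numeric level 5 is accepted by your version.  For
slurmd, the level comes from the SlurmdDebug parameter in slurm.conf.

```shell
#!/bin/sh
# Sketch only: raise the slurmctld debug level at runtime.
# raise_ctld_debug and SCONTROL are illustrative names, not Slurm features.
# Assumption: the running Slurm version accepts a numeric level such as 5.

raise_ctld_debug() {
    level=${1:-5}   # default to the level suggested above

    # "scontrol setdebug" affects slurmctld only; slurmd takes its level
    # from SlurmdDebug in slurm.conf.
    "${SCONTROL:-scontrol}" setdebug "$level"
}
```

Run "raise_ctld_debug" (or "raise_ctld_debug 5") as a Slurm administrator,
then watch the slurmctld log while reproducing the failed checkpoint.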

Best

Michael


On Thu, Mar 13, 2014 at 12:47 AM, Arjun J Rao <[email protected]> wrote:

>  It is not exactly clear from the documentation here (
> https://computing.llnl.gov/linux/slurm/checkpoint_blcr.html) how it is
> that I am supposed to checkpoint my jobs launched via SLURM.
>
> Say I have launched an MPI Job with the following command
> srun_cr -N2 -n24 --checkpoint 1 --checkpoint-dir /home/arjun/Checkpoint
> MPIJobBinary
>
> Does this command in itself ensure that the checkpoints will be taken out
> at 1 minute intervals to the /home/arjun/Checkpoint directory ? If so, it
> doesn't work for me.
>
>
> Or do we need to do a
>
> ps -U arjun | grep srun
> to find out the pid of the srun_cr process and then issue
>
> cr_checkpoint sruns_pid
> to checkpoint the job ?
>
> But when I do this I get the following error :
> - chkpt_watchdog: 'srun_cr'  exited with signal 9 during checkpoint
> Checkpoint cancelled by application : Unable to checkpoint
>
> Similar is the case when i try to do something like :
> srun_cr -N1 -n1 --checkpoint 1 --checkpoint-dir /home/arjun/Checkpoint
> SerialBinary
>
>
>


-- 
Hey! Somebody punched the foley guy!
   - Crow, MST3K ep. 508
