A little more research:

When PreemptMode=REQUEUE, the job does not start from a checkpoint even if
one exists. Is there any way to change this behavior?

Further, in an attempt to work around this, I added the following commands
at the beginning of the job script:

if [ -z "$SLURM_RESTART_COUNT" ]; then
    SLURM_RESTART_COUNT=0
fi
if [ "$SLURM_RESTART_COUNT" -ge 1 ]; then
    scontrol checkpoint restart "$SLURM_JOB_ID"
else
    ... <proceed>
fi

This doesn't help either. What I get is the following error:

scontrol_checkpoint error: Requested operation is presently disabled
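
The restart detection itself can be checked in isolation from Slurm. Below
is a minimal sketch, assuming only that Slurm exports SLURM_RESTART_COUNT
for requeued jobs. The function name is_requeued is hypothetical and not
part of Slurm, and the sketch does not address the scontrol error above.

```shell
#!/bin/bash
# Hypothetical helper (not a Slurm command): succeeds when the job has been
# requeued at least once, judged by SLURM_RESTART_COUNT. An unset or empty
# variable is treated as 0, i.e. a first run.
is_requeued() {
    local count="${SLURM_RESTART_COUNT:-0}"
    [ "$count" -ge 1 ]
}

if is_requeued; then
    echo "restarted"
else
    echo "first run"
fi
```

Setting SLURM_RESTART_COUNT by hand before running the script exercises
both branches without submitting a job.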

Any idea?

Yoel


On Thu, Jan 8, 2015 at 4:32 PM, Yoel Jacobsen <[email protected]>
wrote:

> Hello there,
>
> Is there a way to combine preemption, checkpointing and automatic requeue?
>
> The behavior I am trying to configure is:
>
> - Submit a batch job with checkpointing (based on BLCR)
> - On preemption, take a checkpoint and kill the job (like the CHECKPOINT
> mechanism in PreemptMode)
> - Resubmit the job (which should start from the last checkpoint)
>
> The documentation is clear that "Checkpointed jobs are not automatically
> restarted", so PreemptMode=CHECKPOINT isn't a solution.
>
> Is there any way to hook into the process and resubmit the jobs?
>
> Thank you,
>   Yoel
>
