After a little more research:
When PreemptMode=REQUEUE, the job does not start from a checkpoint even if
one exists. Is there any way to change this behavior?
Further, in an attempt to work around this, I added the following commands
at the beginning of the job script:
if [ -z "$SLURM_RESTART_COUNT" ]; then
    SLURM_RESTART_COUNT=0
fi
if [ "$SLURM_RESTART_COUNT" -ge 1 ]; then
    scontrol checkpoint restart "$SLURM_JOB_ID"
else
    ... <proceed>
This doesn't help either. What I get is the following error:
scontrol_checkpoint error: Requested operation is presently disabled
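For reference, the restart check above can be factored into a small helper that also verifies that a checkpoint file is actually present before taking the restart path. This is only a sketch under assumptions: choose_start_mode and its checkpoint-path argument are hypothetical names of mine, not anything Slurm provides, and it does not address the scontrol error itself. It relies only on SLURM_RESTART_COUNT, which Slurm sets once a job has been requeued.

```shell
#!/bin/bash
# Hypothetical helper: print "restart" when the job has been requeued at
# least once AND a checkpoint file exists at the given path; otherwise
# print "fresh". The path argument is an assumption, adjust it to wherever
# your checkpoints are written.
choose_start_mode() {
    local ckpt="$1"
    # SLURM_RESTART_COUNT is unset on the first run, so default it to 0.
    local restarts="${SLURM_RESTART_COUNT:-0}"
    if [ "$restarts" -ge 1 ] && [ -f "$ckpt" ]; then
        echo "restart"
    else
        echo "fresh"
    fi
}
```

In a job script this could be used as mode=$(choose_start_mode /path/to/checkpoint), branching on "$mode" in place of the bare restart-count test, so a requeued job with no checkpoint falls back to a clean start.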
Any ideas?
Yoel
On Thu, Jan 8, 2015 at 4:32 PM, Yoel Jacobsen <[email protected]>
wrote:
> Hello there,
>
> Is there a way to combine preemption, checkpointing and automatic requeue?
>
> The behavior I am trying to configure is:
>
> - Submit a batch job with checkpointing (based on BLCR)
> - On preemption - take a checkpoint and kill (like the CHECKPOINT
> mechanism in PreemptMode)
> - Resubmit the job (which should start from the last checkpoint)
>
> The documentation is clear about "Checkpointed jobs are not automatically
> restarted." so PreemptMode=CHECKPOINT isn't a solution.
>
> Is there any way to hook into the process and resubmit the jobs?
>
> Thank you,
> Yoel
>