I've been working with the BLCR checkpoint/restart plugin under Slurm 16.05.3
and have run into two issues.  The first is a minor issue with "scontrol
checkpoint restart"; the second is a more significant problem with
checkpoint-based job preemption.

1.  "scontrol checkpoint restart ID" returns 'Duplicate job id' for about 5 minutes.

When I manually checkpoint or vacate a job with "scontrol checkpoint ID"
or "scontrol vacate ID", the checkpoint is successful.  But a successful
restart after the job exits takes some time.

Immediately after vacate or job termination:

    scontrol: checkpoint restart 138
    scontrol_checkpoint error: Duplicate job id

At this point, "scontrol show job ID" shows the job in a completed state.

After about 5 minutes (it varies, but seems to take at least 300 seconds)
I CAN successfully restart the job with "scontrol checkpoint restart ID".
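To find out when the restart starts working, I've been polling with a small
shell helper along these lines (my own illustrative snippet, not part of
Slurm; the job id and timeout are examples):

```shell
# retry_until: run a command every 5 seconds until it succeeds or
# TIMEOUT seconds elapse.  Example use:
#   retry_until 600 scontrol checkpoint restart 138
retry_until() {
    _timeout=$1; shift
    _elapsed=0
    until "$@"; do
        sleep 5
        _elapsed=$((_elapsed + 5))
        if [ "$_elapsed" -ge "$_timeout" ]; then
            return 1    # gave up: command never succeeded in time
        fi
    done
    return 0            # command eventually succeeded
}
```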

I also noticed that once the delay has passed, running "scontrol show job ID"
BEFORE restarting the job returns an error, "Invalid job id specified":

    scontrol: show job 138
    slurm_load_jobs error: Invalid job id specified

This isn't a big problem, but I do wonder what the reason is for this
behavior.  Is it a bug?  I didn't see anything in the documentation about
restart needing time to "breathe" after a checkpoint is taken.  (The delay
looks suspiciously close to the default MinJobAge of 300 seconds, so perhaps
the completed job record has to be purged before a restart with the same id
is allowed?)

2.  Checkpoint-based preemption doesn't resume jobs.

I've configured priority-based preemption via slurm.conf so that jobs in one
preemptable partition can be preempted by jobs in another.  There are two
partitions set up, a regular partition and a lowprio partition with a
PriorityTier of 0.

PartitionName=regular Nodes=pcp-d-[8-10] Default=YES MaxTime=INFINITE State=UP
PartitionName=lowprio Nodes=pcp-d-[8-10] PriorityTier=0 MaxTime=INFINITE State=UP
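The preemption-related slurm.conf entries are along these lines (paraphrased
from memory; PreemptType and PreemptMode are the real Slurm parameter names,
but the exact values shown here may not match my config verbatim):

```
PreemptType=preempt/partition_prio
PreemptMode=CHECKPOINT
```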

Jobs submitted to the lowprio partition ARE preempted by jobs in the regular
partition, but the lowprio jobs are never resumed.

So how do you configure Slurm so that preempted (checkpointed) jobs are
automatically resumed?

