I've made progress with the preemption issue that I reported in my last post. I
hadn't assigned a PriorityTier to the regular partition. I tested again with
    PartitionName=regular Nodes=pcp-d-[8-10] Default=YES MaxTime=INFINITE State=UP
    PartitionName=lowprio Nodes=pcp-d-[8-10] MaxTime=INFINITE State=UP
and in this case jobs resume after they are preempted.
So it looks like (all?) partitions need a PriorityTier configured when
slurm.conf is set up for checkpoint-based preemption.
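For reference, here is a slurm.conf sketch with an explicit PriorityTier on
every partition. The PreemptType/PreemptMode lines and the specific tier
numbers are my assumptions about a checkpoint-preemption setup, not values
copied from the cluster above:

```
# Sketch only: preemption settings and tier values are assumptions.
PreemptType=preempt/partition_prio
PreemptMode=CHECKPOINT

# Explicit PriorityTier on both partitions; regular outranks lowprio.
PartitionName=regular Nodes=pcp-d-[8-10] Default=YES MaxTime=INFINITE State=UP PriorityTier=2
PartitionName=lowprio Nodes=pcp-d-[8-10] MaxTime=INFINITE State=UP PriorityTier=1
```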
On Mon, Sep 19, 2016 at 11:06:31AM -0700, Eric Roman wrote:
> I've been working with the BLCR checkpoint/restart plugin with Slurm 16.05.3
> and ran into some issues. The first is an issue with "scontrol checkpoint
> restart"; the second is a more significant problem with checkpoint-based job
> preemption.
> 1. scontrol checkpoint restart ID returns 'Duplicate job id' for 5 minutes.
> When I manually checkpoint or vacate a job with "scontrol checkpoint ID"
> or "scontrol vacate ID", the checkpoint is successful. But a successful
> restart after the job exits takes some time.
> Immediately after vacate or job termination:
> scontrol: checkpoint restart 138
> scontrol_checkpoint error: Duplicate job id
> At this point, "scontrol show job ID" shows the job in a completed state.
> After about 5 minutes (it varies, but seems to take at least 300 seconds)
> I CAN successfully restart the job with "scontrol checkpoint restart ID".
> I noticed that BEFORE restarting the job, if I run "scontrol show job ID"
> I get an error: "Invalid job id specified"
> scontrol: show job 138
> slurm_load_jobs error: Invalid job id specified
> This isn't a big problem, but I do wonder what the reason is for this
> behavior. Is it a bug? I didn't see anything in the documentation about
> restart needing time to "breathe" after a checkpoint is taken.
> 2. Checkpoint-based preemption doesn't resume jobs.
> I've configured priority-based preemption via slurm.conf to create one
> preemptable partition. There are two partitions set up: a regular partition
> and a lowprio partition with a PriorityTier of 0.
> PartitionName=regular Nodes=pcp-d-[8-10] Default=YES MaxTime=INFINITE State=UP
> PartitionName=lowprio Nodes=pcp-d-[8-10] Default=YES MaxTime=INFINITE State=UP PriorityTier=0
> Jobs submitted to the lowprio partition ARE preempted by jobs in the regular
> partition, but the lowprio jobs are never resumed.
> So how do you configure slurm so that preempted (checkpointed) jobs are
> automatically resumed?
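One note on the restart delay in item 1 above: until that ~300-second window is
explained, a retry wrapper is a blunt workaround. This is only a sketch under
assumptions — the 600-second budget and 10-second poll interval are guesses,
not documented limits, and "scontrol checkpoint restart" requires the BLCR
plugin to be loaded:

```shell
#!/bin/sh
# Sketch: retry a command until it succeeds or a deadline (in seconds) passes.
# Intended use on a Slurm cluster:
#   retry_until_ok 600 scontrol checkpoint restart 138
# The 600s budget and 10s poll interval are assumptions, not documented limits.
retry_until_ok() {
  deadline=$(( $(date +%s) + $1 ))
  shift
  until "$@"; do
    # Give up once the deadline has passed.
    [ "$(date +%s)" -ge "$deadline" ] && return 1
    sleep 10
  done
}
```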