I've made progress with the preemption issue that I reported in my last post. I hadn't assigned a PriorityTier to the regular partition. I tested again with

    PartitionName=regular Nodes=pcp-d-[8-10] Default=YES MaxTime=INFINITE State=UP PriorityTier=100
    PartitionName=lowprio Nodes=pcp-d-[8-10] MaxTime=INFINITE State=UP PriorityTier=0

and in this case jobs resume after they are preempted. So it looks like all
partitions need a PriorityTier configured when slurm.conf is set up with:

    PreemptMode=CHECKPOINT,GANG
    PreemptType=preempt/partition_prio

Eric

On Mon, Sep 19, 2016 at 11:06:31AM -0700, Eric Roman wrote:
> Hi,
>
> I've been working with the BLCR checkpoint/restart plugin with Slurm 16.05.3
> and ran into some issues. The first is an issue with "scontrol checkpoint
> restart"; the second is a more significant problem with checkpoint-based
> job preemption.
>
> 1. "scontrol checkpoint restart ID" returns 'Duplicate job id' for 5 minutes.
>
> When I manually checkpoint or vacate a job with "scontrol checkpoint create ID"
> or "scontrol checkpoint vacate ID", the checkpoint is successful. But a
> successful restart after the job exits takes some time.
>
> Immediately after vacate or job termination:
>
>     scontrol: checkpoint restart 138
>     scontrol_checkpoint error: Duplicate job id
>
> At this point, "scontrol show job ID" shows the job in a completed state.
>
> After about 5 minutes (it varies, but seems to take at least 300 seconds)
> I CAN successfully restart the job with "scontrol checkpoint restart ID".
>
> I noticed that BEFORE restarting the job, if I run "scontrol show job ID"
> I get an error: "Invalid job id specified":
>
>     scontrol: show job 138
>     slurm_load_jobs error: Invalid job id specified
>
> This isn't a big problem, but I do wonder what the reason is for this
> behavior. Is it a bug? I didn't see anything in the documentation about
> restart needing time to "breathe" after a checkpoint is taken.
>
> 2. Checkpoint-based preemption doesn't resume jobs.
>
> I've configured preemption via slurm.conf to create one preemptable
> partition through priority-based preemption.
> There are two partitions set up, a regular partition and a lowprio
> partition with a PriorityTier of 0.
>
>     PreemptMode=CHECKPOINT,GANG
>     PreemptType=preempt/partition_prio
>     PartitionName=regular Nodes=pcp-d-[8-10] Default=YES MaxTime=INFINITE State=UP
>     PartitionName=lowprio Nodes=pcp-d-[8-10] Default=YES MaxTime=INFINITE State=UP PriorityTier=0
>
> Jobs submitted to the lowprio partition ARE preempted by jobs in the regular
> partition, but the lowprio jobs are never resumed.
>
> So how do you configure Slurm so that preempted (checkpointed) jobs are
> automatically resumed?
>
> Thanks,
> Eric
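
P.S. For anyone finding this in the archives: here are the relevant slurm.conf
settings from the test that worked, collected in one place. The node names and
the specific PriorityTier values (100 and 0) are just what I used on my test
cluster; what matters is that every partition gets an explicit PriorityTier.

    # Preempt lower-priority partitions by checkpointing their jobs
    PreemptType=preempt/partition_prio
    PreemptMode=CHECKPOINT,GANG

    # Give EVERY partition an explicit PriorityTier, including the default one
    PartitionName=regular Nodes=pcp-d-[8-10] Default=YES MaxTime=INFINITE State=UP PriorityTier=100
    PartitionName=lowprio Nodes=pcp-d-[8-10] MaxTime=INFINITE State=UP PriorityTier=0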