Chris,

It is technically possible to requeue salloc and srun jobs, but probably not desirable.

Batch jobs have a script that can simply be re-run. In the case of salloc and srun, there is typically a person interacting with the job who is better placed to judge what to do. For example, salloc usually spawns a shell in which the user can do pre-processing, run parallel tasks, and then post-processing. The post-processing typically does not require access to compute nodes, so cancelling the job after the parallel tasks finish would have no effect, while requeuing the job (killing the running shell and starting a new shell with new environment variables) would likely prove problematic for the user.
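For reference, a batch job can be submitted so that it is eligible for requeue when preempted; a minimal sketch (the script name, job id, and partition name here are illustrative, not from this thread):

```
# Illustrative only: "job.sh" and the "low" partition are example names.
sbatch --requeue -p low job.sh   # batch job that may be requeued on preemption
scontrol requeue 1234            # a batch job can also be requeued manually by id
```

Since the script is self-contained, requeuing it just re-runs it from the top; there is no interactive shell state to lose.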

Moe Jette
SchedMD LLC

Quoting Chris Scheller <[email protected]>:

Alan Orth wrote on Aug 04:
Ah!  Thanks for getting back to me on this.  You are indeed right, I
now see that this does in fact work if the job to be
preempted/requeued is submitted with sbatch.  I admit I was confused
on the different job submission tools (srun, salloc, sbatch)...

I've been banging my head on the same problem recently (wish I had
checked the list sooner). Why are srun and salloc jobs not preempted?
Will this change anytime soon?


Thanks again,

Alan

On Tue, Aug 2, 2011 at 8:34 PM,  <[email protected]> wrote:
> Only batch jobs can be requeued, your salloc job would need to be killed.
>
> When the allocation is killed, that should kill all of the processes on the
> compute nodes (as identified using SLURM's Proctrack plugin), but the salloc
> command (running on a login node) would not be killed. If your salloc
> command isn't spawning anything on the compute nodes using srun, there would
> be no processes to kill.
>
>
> Quoting Alan Orth <[email protected]>:
>
>> Ok, either I'm missing something, or I've hit a bug... In a test
>> cluster with 4 CPUs and two partitions, "low" and "high"...
>>
>> $ salloc -n4 -p low openssl speed
>> salloc: Granted job allocation 14
>>
>> $ salloc -n4 -p high openssl speed
>> salloc: Pending job allocation 15
>> salloc: job 15 queued and waiting for resources
>> salloc: job 15 has been allocated resources
>> salloc: Granted job allocation 15
>>
>> After submitting the high-priority job I see this printed in the
>> slurmctld log file:
>> "preempted job 14 had to be killed"
>>
>> But the "killed" job keeps running (even though its allocation is
>> revoked).  What gives?
>>
>> $ scontrol show partitions
>> PartitionName=low
>>   AllocNodes=ALL AllowGroups=ALL Default=NO
>>   DefaultTime=NONE DisableRootJobs=NO Hidden=NO
>>   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1
>>   Nodes=noma
>>   Priority=10 RootOnly=NO Shared=NO PreemptMode=REQUEUE
>>   State=UP TotalCPUs=4 TotalNodes=1
>>
>> PartitionName=high
>>   AllocNodes=ALL AllowGroups=ALL Default=NO
>>   DefaultTime=NONE DisableRootJobs=NO Hidden=NO
>>   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1
>>   Nodes=noma
>>   Priority=20 RootOnly=NO Shared=NO PreemptMode=REQUEUE
>>   State=UP TotalCPUs=4 TotalNodes=1
>>
>> slurm.conf
>> PreemptMode             = REQUEUE
>> PreemptType             = preempt/partition_prio
>> ProctrackType           = proctrack/pgid
>> SchedulerType=sched/backfill
>> SelectType=select/linear
>>
>> Thanks!
>>
>> On 07/15/2011 06:07 PM, [email protected] wrote:
>>>
>>> Quoting Alan Orth <[email protected]>:
>>>
>>>> Moe,
>>>>
>>>> Thanks for the quick response.  I've just updated my configuration to
>>>> include some of your tips, but I'm still having problems.  I can
>>>> confirm the same behavior happens with the linear select plugin; with
>>>> "PreemptMode=REQUEUE" the resource allocation is revoked from the job
>>>> in lower-priority partition, but the job continues to run (consuming
>>>> CPU resources).
>>>
>>> Is the job state CG (completing)? If so, then the problem isn't in the
>>> preemption logic, but in the configuration or communications (i.e. the
>>> slurmd daemon on the compute nodes isn't doing what the slurmctld (the
>>> slurm control daemon) is telling it to do). Alternately it might not be
>>> finding all of the processes due to the ProctrackType configuration. If
>>> that's the case, the SlurmdLogFile and SlurmctldLogFile should help to
>>> diagnose the problem. Running "scontrol show config" will show you what all
>>> of these values are.
>>>
>>>
>>>> With "PreemptMode=SUSPEND,GANG" the job submitted in
>>>> the higher-priority partition simply waits until there are free slots.
>>>> The behavior doesn't seem to change with either select/linear or
>>>> select/cons_res.
>>>>
>>>> Again, relevant slurm.conf sections from my slurm 2.2.7 test
>>>> installation (on Ubuntu 11.04):
>>>>
>>>> SchedulerType=sched/backfill
>>>> SelectType=select/linear
>>>> PreemptType=preempt/partition_prio
>>>> NodeName=noma CoresPerSocket=4 ThreadsPerCore=1 Sockets=1 State=UNKNOWN
>>>> PartitionName=batch Nodes=noma Default=NO DefaultTime=INFINITE
>>>> MaxTime=INFINITE State=UP Priority=10 Shared=NO
>>>> PartitionName=interactive Nodes=noma Default=NO MaxTime=INFINITE
>>>> State=UP Priority=20 Shared=NO
>>>>
>>>> Regarding my previous use of the "Shared=Force:1" option in the
>>>> low-priority partition, I had specified it because the
>>>> documentation[1] mentions "By default the max_share value is 4. In
>>>> order to preempt jobs (and not gang schedule them), always set
>>>> max_share to 1."
>>>
>>> That is the correct configuration for preempting a job through
>>> suspending it, but not if you want its resources to be relinquished before
>>> starting another job on the same resources (i.e. with PreemptMode=Cancel or
>>> Requeue). In the latter case, you need Shared=NO.
>>>
>>>> Cheers and thanks,
>>>>
>>>> Alan
>>>>
>>>> [1] https://computing.llnl.gov/linux/slurm/preempt.html
>>>>
>>>> On Thu, Jul 14, 2011 at 6:15 PM, <[email protected]> wrote:
>>>>>
>>>>> Alan,
>>>>>
>>>>> I believe that you need "Shared=NO" for both partitions and preemption
>>>>> modes PreemptMode=CANCEL or REQUEUE. For PreemptMode=Suspend, it seems
>>>>> to work fine for SelectType=select/linear, but not for
>>>>> SelectType=select/cons_res. I'll make a note of this bug in the
>>>>> select/cons_res plugin, but I'm not sure when it will get fixed.
>>>>>
>>>>> Moe Jette
>>>>>
>>>>>
>>>>> Quoting Alan Orth <[email protected]>:
>>>>>
>>>>>> I'm having problems getting basic partition-based preemption working.
>>>>>> For testing purposes I've set up a cluster with 4 CPUs and two
>>>>>> partitions (each with different priorities). I can't figure out how to
>>>>>> get the higher-priority partition to preempt the lower-priority
>>>>>> partition.  This test configuration has 4 CPU slots.
>>>>>>
>>>>>> First, ask for 4 CPUs, in the batch partition.
>>>>>> $ salloc -n4 -p batch openssl speed
>>>>>> salloc: Granted job allocation 68
>>>>>> Doing md2 for 3s on 16 size blocks: 305643 md2's in 2.97s
>>>>>>
>>>>>> Second, ask for 4 CPUs, in the interactive partition:
>>>>>> $ salloc -n4 -p interactive openssl speed
>>>>>> salloc: Pending job allocation 71
>>>>>> salloc: job 71 queued and waiting for resources
>>>>>>
>>>>>> With PreemptMode=SUSPEND it will wait until the low-priority job
>>>>>> finishes (as shown above).  If PreemptMode=CANCEL or REQUEUE, the
>>>>>> low-priority job allocation is "revoked", but the job keeps running!!!
>>>>>> Have I misread or misunderstood something about Preemption in
>>>>>> partitions?
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> Here are the relevant configuration options I've set:
>>>>>>
>>>>>> From slurm.conf:
>>>>>> SchedulerType=sched/backfill
>>>>>> SelectType=select/cons_res
>>>>>> SelectTypeParameters=CR_CPU
>>>>>> PreemptMode=SUSPEND,GANG
>>>>>> PreemptType=preempt/partition_prio
>>>>>> NodeName=noma CoresPerSocket=4 ThreadsPerCore=1 Sockets=1
>>>>>> State=UNKNOWN
>>>>>> PartitionName=batch Nodes=noma Default=NO DefaultTime=INFINITE
>>>>>> MaxTime=INFINITE State=UP Priority=10 Shared=Force:1
>>>>>> PartitionName=interactive Nodes=noma Default=NO MaxTime=INFINITE
>>>>>> State=UP Priority=20 Shared=NO
>>>>>>
>>>>>> --
>>>>>> Alan Orth
>>>>>> [email protected]
>>>>>> http://alaninkenya.org
>>>>>> http://mjanja.co.ke
>>>>>> "You cannot simultaneously prevent and prepare for war." -Albert
>>>>>> Einstein
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Alan Orth
>>>> [email protected]
>>>> http://alaninkenya.org
>>>> http://mjanja.co.ke
>>>> "You cannot simultaneously prevent and prepare for war." -Albert
>>>> Einstein
>>>>
>>>
>>>
>>>
>>
>>
>> --
>> Alan Orth
>> [email protected]
>> http://alaninkenya.org
>> "I have always wished for my computer to be as easy to use as my
>> telephone; my wish has come true because I can no longer figure out how
>> to use my telephone." -Bjarne Stroustrup, inventor of C++
>
>
>
>



--
Alan Orth
[email protected]
http://alaninkenya.org
http://mjanja.co.ke
"In heaven all the interesting people are missing." -Friedrich Nietzsche


--
Chris Scheller | http://www.pobox.com/~schelcj
----------------------------------------------
A candidate is a person who gets money from the rich and votes from the
poor to protect them from each other.
