Joshua Baker-LePain <j...@salilab.org> writes:
> I think that my initial question was too complex/detailed. Let me ask
> a more open-ended one. Do folks have any strategies they'd like to
> share on partition setups that favor paying customers while also
> allowing for usage of spare resources by non-paying users? Thanks!
We are using preemption to achieve something like that on one of our clusters.

Users on that cluster typically belong to a single account linked to their department (in some cases multiple, if they do cross-department work). Jobs submitted using those accounts go into a single partition, but are subject to GrpCPUs limits based on how much of the cluster the department has paid for.

But: users are also able to submit "risk jobs", a term kept from the dark ages before we were using SLURM. They submit risk jobs using a specific QoS by adding "-q risk" (otherwise they get the QoS "normal"). Jobs with QoS "risk" have lower priority than QoS "normal" jobs, so normal jobs will start before them. Also, normal jobs will preempt risk jobs (with PreemptMode=REQUEUE).

The submit plugin does some more work behind the user's back, setting the account to "dept_a_risk" if their normal account was "dept_a", to make sure the risk jobs are not affected by the GrpCPUs limit on the normal account (there are no such limits on the dept_*_risk accounts).

The end effect is that any user can use up remaining nodes on the cluster using risk jobs, but as soon as a normal job (allowed to run within its limits) is blocked from running due to a lack of nodes and there are risk jobs running, one or more risk jobs will be killed to make the nodes available.

Short risk jobs requiring few nodes might be submitted with just "-q risk". If preempted, they are simply killed and have to be resubmitted. The typical use, though, is to add "--requeue" to make the job go back into the queue if preempted. To make this useful, users of risk jobs have to make sure that their code makes progress even if the job is preempted one or more times. Some use various state-saving/checkpointing mechanisms. Others have problems that make this easy (for example, a job that processes 10 years' worth of data one month at a time and skips already-processed months at the start of a new run, based on the output files that are already present).

This is a rather crude way to accomplish it, but it is easy to explain to users, and there is no way for dormant low-priority jobs to mess with higher-priority ones by holding on to memory or other resources.

Best Regards,
/ Kent Engström, NSC
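For readers who want to try something along these lines, a rough sketch of the admin-side pieces described above could look like the following. This is only an illustration, not NSC's actual configuration: the QoS names ("normal", "risk"), the account names (dept_a, dept_a_risk), the GrpCPUs value and the user name are just the example names from the mail above.

    # In slurm.conf: preemption decided by QoS, preempted jobs are requeued
    #   PreemptType=preempt/qos
    #   PreemptMode=REQUEUE
    # (with the multifactor priority plugin, PriorityWeightQOS must be
    #  non-zero for the QoS priorities below to influence scheduling)

    # QoS setup: "normal" has higher priority and may preempt "risk"
    sacctmgr add qos risk
    sacctmgr modify qos risk set Priority=0
    sacctmgr modify qos normal set Priority=100 Preempt=risk

    # Per-department accounts: the paid-for share is capped with GrpCPUs,
    # the parallel *_risk account has no such limit
    # (on recent Slurm versions the limit is written GrpTRES=cpu=N)
    sacctmgr add account dept_a
    sacctmgr modify account dept_a set GrpCPUs=512
    sacctmgr add account dept_a_risk

    # Users get associations on both accounts and access to the risk QoS
    sacctmgr add user alice account=dept_a
    sacctmgr add user alice account=dept_a_risk
    sacctmgr modify user alice set qos+=risk DefaultQOS=normal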
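The account rewriting done by the submit plugin could, for example, be implemented with the job_submit/lua plugin. The sketch below is written under that assumption and simply appends "_risk" to the account of any job submitted with the risk QoS; NSC's actual plugin is not shown here and may work differently.

    -- job_submit.lua: minimal sketch of the account rewriting described
    -- above, assuming JobSubmitPlugins=lua is set in slurm.conf.

    function slurm_job_submit(job_desc, part_list, submit_uid)
       -- Jobs submitted with the "risk" QoS are moved to the matching
       -- "<dept>_risk" account, so the GrpCPUs limit on the paid-for
       -- account does not apply to them.
       if job_desc.qos == "risk" and job_desc.account ~= nil then
          if string.sub(job_desc.account, -5) ~= "_risk" then
             job_desc.account = job_desc.account .. "_risk"
          end
       end
       -- A real plugin would also have to handle jobs submitted without
       -- an explicit account (job_desc.account == nil), e.g. by looking
       -- up the user's default account.
       return slurm.SUCCESS
    end

    function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
       return slurm.SUCCESS
    end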
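On the user side, the "skip already processed months" pattern mentioned above might look like this hypothetical batch script (the process_month program, the results/ directory and the resource requests are made up for the example). Writing to a temporary file and renaming it on success means a month interrupted by preemption is redone rather than counted as finished.

    #!/bin/bash
    #SBATCH --qos=risk
    #SBATCH --requeue
    #SBATCH --nodes=1
    #SBATCH --time=24:00:00

    # Process 10 years of data one month at a time; a requeued job skips
    # months whose output already exists and continues where it left off.
    for month in $(seq -w 1 120); do
        out="results/month_${month}.dat"
        [ -f "$out" ] && continue
        ./process_month "$month" > "$out.tmp" && mv "$out.tmp" "$out"
    done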