[slurm-dev] Re: QOS with a "guaranteed" time to start up jobs

Steven Young Fri, 08 Apr 2016 04:29:11 -0700

Apologies for multiple copies of this email. It was resent from various email addresses while I was changing my list subscription to my new email address. Steve.


On 08/04/16 03:09, Steven Young wrote:


Hi,

We have a requirement to allow a set of users to have high priority
access to a proportion of our cluster.  The extra degree of difficulty
is that they currently have bursty usage, so when they do submit work,
they want a guarantee that their jobs will start within 12 hours on
their proportion of the cluster.

In our cluster the relevant partition for this discussion is our compute
partition which has our compute nodes.  MaxTime for the compute
partition is greater than 12 hours.  (In fact currently it is 5 days).
We currently use Multifactor Priority with Fair Tree Fairshare.  We also
currently have two QOS defined: normal and priority.  We have Priority
Weighting so that Fairshare and QOS are equally weighted, then Age and
JobSize are weighted less.  Partition weighting is currently zero.
Setting up a superpriority QOS which has GrpNodes set to the
required value will allow us to provide higher priority access to the
required proportion of the cluster, but won't allow us to guarantee the
12 hour start time, since we normally have a backlog of jobs asking for
multiple days of walltime.
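
To illustrate the weighting described above, here is a minimal sketch of
the multifactor calculation as a weighted sum of factors in the range 0
to 1.  The weight values and the two example jobs are hypothetical; only
the ratios (Fairshare equal to QOS, Age and JobSize lower, Partition
zero) follow the description.

```python
# Hypothetical weights: only their ratios mirror the setup described above.
WEIGHTS = {
    "fairshare": 10000,
    "qos": 10000,
    "age": 1000,
    "job_size": 1000,
    "partition": 0,
}


def job_priority(factors, weights=WEIGHTS):
    """Return the weighted sum of the factors, each expected in [0.0, 1.0]."""
    return sum(weights[name] * factors.get(name, 0.0) for name in weights)


# Hypothetical jobs: an old backlog job on the normal QOS and a freshly
# submitted job on a top-ranked QOS, both with the same fairshare factor.
backlog_job = {"fairshare": 0.5, "qos": 0.0, "age": 1.0, "job_size": 0.0}
urgent_job = {"fairshare": 0.5, "qos": 1.0, "age": 0.0, "job_size": 0.0}

print(job_priority(backlog_job), job_priority(urgent_job))
```

The higher value only moves the job to the front of the pending queue.
It does nothing about multi-day jobs that are already running, which is
why priority alone cannot give the 12 hour guarantee.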

I was recently re-reading the SLURM documentation on Reservations
(http://slurm.schedmd.com/reservations.html), specifically about
Reservations Floating Through Time.  I.e., we could create a reservation
that has Flags=TIME_FLOAT and StartTime=now+12hours, and the nodes
assigned to this reservation would only accept jobs requesting a
TimeLimit of 12 hours or less.  That gets us part way to meeting the
requirement.
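
As a rough sketch of the reservation described above, the following
builds the scontrol command line without running it.  The reservation
name, node list, owning user and the unlimited duration are assumptions
made for illustration; only Flags=TIME_FLOAT and StartTime=now+12hours
come from the description.

```python
def float_reservation_cmd(name, nodes, user="root", offset_hours=12):
    """Build an scontrol command for a reservation that floats ahead of now.

    name, nodes and user are placeholders; adjust them to the site.
    """
    return [
        "scontrol", "create", "reservation",
        f"ReservationName={name}",
        f"StartTime=now+{offset_hours}hours",  # always this far in the future
        "Duration=UNLIMITED",                  # assumption: no fixed end
        "Flags=TIME_FLOAT",
        f"Nodes={nodes}",
        f"Users={user}",
    ]


# Hypothetical reservation name and node range.
cmd = float_reservation_cmd("hipri_float", "node[001-016]")
print(" ".join(cmd))
```

The command is returned as a list so that it can be inspected first and
later passed to subprocess.run() once the values have been checked.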

Having read about these reservations I am now wondering whether there is
any way SLURM could be "improved", so that users from the specific SLURM
accounts that should have high priority access can be allowed to run on
the reservation.

I'm currently thinking about the idea of having a TIME_FLOAT reservation
and also setting up a cron script to "watch" for pending superpriority
QOS jobs which then reduces the reservation to allow the pending jobs to
run.  Seems like it could work but there are loads of details that feel
a bit messy.  Or am I barking up the wrong tree?
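
A minimal sketch of what the "watching" part of such a cron script might
look like, under several assumptions: the QOS is called superpriority,
squeue is on the PATH, and its %D field prints a plain integer node
count for pending jobs.  Actually resizing the reservation is left out,
since that is where the messy details are.

```python
import subprocess


def pending_superpriority_output(qos="superpriority"):
    """Ask squeue for pending jobs in the QOS: one 'jobid nodes' line each."""
    result = subprocess.run(
        ["squeue", "--noheader", "--states=PENDING",
         f"--qos={qos}", "--format=%i %D"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def pending_nodes(squeue_output):
    """Sum the node counts requested by the pending jobs."""
    total = 0
    for line in squeue_output.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            total += int(fields[1])
    return total


def reduced_reservation_size(reserved_nodes, wanted_nodes):
    """Nodes to keep reserved after handing some to the pending jobs."""
    return max(0, reserved_nodes - wanted_nodes)


# Hypothetical squeue output for two pending jobs wanting 4 and 2 nodes.
sample = "101 4\n102 2\n"
print(reduced_reservation_size(16, pending_nodes(sample)))
```

The parsing and arithmetic are kept separate from the squeue call so
they can be tested without a running cluster.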

Failing the possibility of these time-floating reservations being able
to "automatically" meet our requirement, does anyone have any other
thoughts about how we might meet our "high priority" requirement with
"guaranteed" start times?

Any input would be very welcome.

Cheers,
Steven.

--
Steven Young, Advanced Research Computing http://www.arc.ox.ac.uk
          University of Oxford IT Services http://www.it.ox.ac.uk

