Apologies for multiple copies of this email. Resent from various email addresses while I was changing list subscription to my new email address. Steve.
On 08/04/16 03:09, Steven Young wrote:
Hi,

We have a requirement to allow a set of users to have high priority access to a proportion of our cluster. The extra degree of difficulty is that they currently have bursty usage, so when they do submit work, they want a guarantee that their jobs will start within 12 hours on their proportion of the cluster.

In our cluster the relevant partition for this discussion is our compute partition, which has our compute nodes. MaxTime for the compute partition is greater than 12 hours (in fact it is currently 5 days). We currently use Multifactor Priority with Fair Tree Fairshare. We also currently have two QOS defined: normal and priority. We have Priority Weighting set so that Fairshare and QOS are equally weighted, with Age and JobSize weighted less. Partition weighting is currently zero.

Setting up a superpriority QOS with GrpNodes set to the required value will allow us to provide higher priority access to the required proportion of the cluster, but won't allow us to guarantee the 12 hour start time, since we normally have a backlog of jobs asking for multiple days of walltime.

I was recently re-reading the SLURM documentation on Reservations (http://slurm.schedmd.com/reservations.html), specifically about Reservations Floating Through Time. That is, we could create a reservation that has Flags=TIME_FLOAT and StartTime=now+12hours, and the nodes assigned to this reservation would only allow jobs with a requested TimeLimit of 12 hours or less. That gets us part way to meeting the requirement.

Having read about these reservations I am now wondering whether there is any way SLURM could be "improved", so that users from the specific SLURM accounts that should have high priority access can be allowed to run on the reservation.

I'm currently thinking about the idea of having a TIME_FLOAT reservation and also setting up a cron script to "watch" for pending superpriority QOS jobs, which then reduces the reservation to allow the pending jobs to run.
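To make the cron idea a bit more concrete, here is a rough sketch of what such a watcher might look like. It is untested against a real cluster and rests on several assumptions: the reservation name, the account names passed in, the node count of 16 and the Duration value are all made up for illustration, and keeping at least one node in the reservation is my own guess at a sensible floor. The scontrol and squeue options used are the standard ones, but whether shrinking the reservation actually lets the pending jobs start in time is exactly the messy part.

```python
import subprocess

RESERVATION = "superprio_float"  # hypothetical reservation name
QOS = "superpriority"            # QOS name proposed in this email
FULL_NODE_COUNT = 16             # hypothetical size of the guaranteed share


def create_reservation_cmd(name, node_count, accounts,
                           partition="compute", duration="UNLIMITED"):
    """Build the scontrol command for a reservation floating 12 hours ahead.

    The duration default is an assumption; pick whatever suits the site.
    """
    return [
        "scontrol", "create", "reservation",
        f"ReservationName={name}",
        "StartTime=now+12hours",
        f"Duration={duration}",
        "Flags=TIME_FLOAT",
        f"PartitionName={partition}",
        f"NodeCnt={node_count}",
        f"Accounts={','.join(accounts)}",
    ]


def pending_nodes(squeue_output):
    """Sum node counts from `squeue --format=%D` output, one job per line.

    Entries that are not plain integers are skipped, which is a
    simplification a real script would want to handle properly.
    """
    return sum(int(tok) for tok in squeue_output.split() if tok.isdigit())


def shrunk_node_count(full_count, needed):
    """Nodes left in the reservation after releasing `needed` nodes.

    Assumption: never shrink below one node, so the reservation stays valid.
    """
    return max(full_count - needed, 1)


def update_reservation_cmd(name, node_count):
    """Build the scontrol command that resizes the reservation."""
    return ["scontrol", "update",
            f"ReservationName={name}", f"NodeCnt={node_count}"]


def watch_once():
    """One cron iteration: count pending superpriority nodes, then resize."""
    out = subprocess.run(
        ["squeue", "--noheader", "--states=PENDING",
         f"--qos={QOS}", "--format=%D"],
        check=True, capture_output=True, text=True,
    ).stdout
    target = shrunk_node_count(FULL_NODE_COUNT, pending_nodes(out))
    subprocess.run(update_reservation_cmd(RESERVATION, target), check=True)
```

The command builders are kept separate from watch_once() so the resizing logic can be checked without touching the scheduler. Growing the reservation back to full size once the superpriority jobs have started is not handled here.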
Seems like it could work but there are loads of details that feel a bit messy. Or am I barking up the wrong tree?

Failing the possibility of these time-floating reservations being able to "automatically" meet our requirement, does anyone have any other thoughts about how we might meet our "high priority" requirement with "guaranteed" start times?

Any input would be very welcome.

Cheers,
Steven.

--
Steven Young, Advanced Research Computing http://www.arc.ox.ac.uk
University of Oxford IT Services http://www.it.ox.ac.uk
