Joshua Baker-LePain <j...@salilab.org> writes:
> I think that my initial question was too complex/detailed. Let me ask
> a more open-ended one. Do folks have any strategies they'd like to
> share on partition setups that favor paying customers while also
> allowing for usage of spare resources by non-paying users? Thanks!
We are using preemption to achieve something like that on one of our clusters.

Users on that cluster typically belong to a single account linked to their department (in some cases multiple, if they do cross-department work). Jobs submitted using those accounts go into a single partition, but are subject to GrpCPUs limits based on how much of the cluster the department has paid for.

But: users are also able to submit "risk jobs", a term kept from the dark ages before we were using SLURM. They submit risk jobs using a specific QoS by adding "-q risk" (otherwise they get the QoS "normal"). Jobs with QoS "risk" have lower priority than QoS "normal" jobs, so normal jobs will start before them. Also, normal jobs will preempt risk jobs (with PreemptMode=REQUEUE).

The submit plugin does some more work behind the user's back, setting the account to "dept_a_risk" if their normal account was "dept_a", to make sure the risk jobs are not affected by the GrpCPUs limit on the normal account (there are no such limits on the dept_*_risk accounts).

The end effect is that any user can use up remaining nodes on the cluster using risk jobs, but as soon as a normal job (allowed to run within its limits) is blocked from running due to a lack of nodes and there are risk jobs running, one or more risk jobs will be killed to make the nodes available.

Short risk jobs requiring few nodes might be submitted with just "-q risk". If preempted, they are simply killed and have to be resubmitted. The typical use, though, is to add "--requeue" to make the job go back into the queue if preempted. To make this useful, users of risk jobs have to make sure that their code makes progress even if the job is preempted one or more times. Some use various state-saving/checkpointing mechanisms. Others have problems that make this easy (for example, a job that processes 10 years' worth of data one month at a time and skips already-processed months at the start of a new run, based on the output files that are already present).

This is a rather crude way to accomplish it, but it is easy to explain to users, and there is no way for dormant low-priority jobs to mess with higher-priority ones by holding on to memory or other resources.

Best Regards,
/ Kent Engström, NSC
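For readers who want to try something along these lines, a rough sketch of the admin-side pieces described above could look like the following. This is only an illustration, not NSC's actual configuration: the QoS names ("normal", "risk"), the account names (dept_a, dept_a_risk), the GrpCPUs value and the user name are just the example names from the mail above.

    # In slurm.conf: preemption decided by QoS, preempted jobs are requeued
    #   PreemptType=preempt/qos
    #   PreemptMode=REQUEUE
    # (with the multifactor priority plugin, PriorityWeightQOS must be
    #  non-zero for the QoS priorities below to influence scheduling)

    # QoS setup: "normal" has higher priority and may preempt "risk"
    sacctmgr add qos risk
    sacctmgr modify qos risk set Priority=0
    sacctmgr modify qos normal set Priority=100 Preempt=risk

    # Per-department accounts: the paid-for share is capped with GrpCPUs,
    # the parallel *_risk account has no such limit
    # (on recent Slurm versions the limit is written GrpTRES=cpu=N)
    sacctmgr add account dept_a
    sacctmgr modify account dept_a set GrpCPUs=512
    sacctmgr add account dept_a_risk

    # Users get associations on both accounts and access to the risk QoS
    sacctmgr add user alice account=dept_a
    sacctmgr add user alice account=dept_a_risk
    sacctmgr modify user alice set qos+=risk DefaultQOS=normal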
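The account rewriting done by the submit plugin could, for example, be implemented with the job_submit/lua plugin. The sketch below is written under that assumption and simply appends "_risk" to the account of any job submitted with the risk QoS; NSC's actual plugin is not shown here and may work differently.

    -- job_submit.lua: minimal sketch of the account rewriting described
    -- above, assuming JobSubmitPlugins=lua is set in slurm.conf.

    function slurm_job_submit(job_desc, part_list, submit_uid)
       -- Jobs submitted with the "risk" QoS are moved to the matching
       -- "<dept>_risk" account, so the GrpCPUs limit on the paid-for
       -- account does not apply to them.
       if job_desc.qos == "risk" and job_desc.account ~= nil then
          if string.sub(job_desc.account, -5) ~= "_risk" then
             job_desc.account = job_desc.account .. "_risk"
          end
       end
       -- A real plugin would also have to handle jobs submitted without
       -- an explicit account (job_desc.account == nil), e.g. by looking
       -- up the user's default account.
       return slurm.SUCCESS
    end

    function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
       return slurm.SUCCESS
    end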
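On the user side, the "skip already processed months" pattern mentioned above might look like this hypothetical batch script (the process_month program, the results/ directory and the resource requests are made up for the example). Writing to a temporary file and renaming it on success means a month interrupted by preemption is redone rather than counted as finished.

    #!/bin/bash
    #SBATCH --qos=risk
    #SBATCH --requeue
    #SBATCH --nodes=1
    #SBATCH --time=24:00:00

    # Process 10 years of data one month at a time; a requeued job skips
    # months whose output already exists and continues where it left off.
    for month in $(seq -w 1 120); do
        out="results/month_${month}.dat"
        [ -f "$out" ] && continue
        ./process_month "$month" > "$out.tmp" && mv "$out.tmp" "$out"
    done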