[slurm-dev] Re: Oversubscription and running job priority

Joshua Baker-LePain Mon, 25 Jul 2016 13:47:40 -0700

I think that my initial question was too complex/detailed. Let me ask a more open-ended one. Do folks have any strategies they'd like to share on partition setups that favor paying customers while also allowing for usage of spare resources by non-paying users? Thanks!


On Fri, 15 Jul 2016 at 3:56pm, Joshua Baker-LePain wrote

We currently run a moderately sized (5000+ cores) cluster using SGE. We're looking to move to slurm and have a test setup, but I have some questions about how best to implement/improve on our current setup.
Our setup is a co-op model. We have users who "own shares" of the cluster as well as non-contributing users. We try to guarantee contributing users access to their "share" of the cluster while also maximizing utilization via the following setup:
 o There are 3 queues on each node, and on each node each queue has a
   number of slots equal to the number of real cores on the node (nodes
   with hyperthreading have that feature turned on)

 o Our "lab" queue is for contributing users.  Jobs in this queue run
   un-niced, and each lab has a number of slots in this queue equal to
   their share of the cluster.

 o Our "long" queue is for all users.  Jobs in this queue run "nice -19".

 o We also have a "short" queue for quick jobs.  These jobs run at "nice
   -10" and are limited to 30 minutes.

 o We use np_load_avg on the queues to control oversubscription.  A node
   full of lab queue jobs will not launch jobs in the other queues.
   However, a node full of long queue jobs can still launch lab queue
   jobs, up until both lab and long queues on that node are full.
As a starting point for our new setup, I'm trying to somewhat replicate this. Is gang scheduling what I'm looking for? Do folks have issues with jobs continually being suspended and resumed?
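For concreteness, a rough and untested slurm.conf sketch of the partition-priority idea might look like the following. The node names, core counts, and PriorityTier values are placeholders, not our real configuration, and the mapping from our SGE queues to partitions is only an assumption. Note also that this suspends lower-tier jobs rather than running them niced alongside lab jobs, so it is not an exact equivalent of the np_load_avg behavior described above.

```
# Illustrative only: all names and numbers below are placeholders.
SelectType=select/cons_res
SelectTypeParameters=CR_Core

# Jobs in a higher-PriorityTier partition may preempt jobs in a lower one;
# preempted jobs are suspended and later resumed by the gang scheduler.
PreemptType=preempt/partition_prio
PreemptMode=SUSPEND,GANG

NodeName=node[001-100] CPUs=16

# "lab": contributing users, highest tier, never preempted.
PartitionName=lab   Nodes=node[001-100] PriorityTier=30 PreemptMode=OFF
# "short": quick jobs, limited to 30 minutes.
PartitionName=short Nodes=node[001-100] PriorityTier=20 MaxTime=00:30:00
# "long": open to all users, lowest tier.
PartitionName=long  Nodes=node[001-100] PriorityTier=10 Default=YES
```

The per-lab slot limits are not covered by this sketch; presumably they would have to come from accounting limits on each lab's account rather than from the partition definitions.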
Any pointers or hints would be much appreciated. And feel free to ask for clarification and/or tell me I'm on the completely wrong track. Thanks.


--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
