We currently run a moderately sized cluster (5000+ cores) using SGE.
We're looking to move to Slurm and have a test setup running, but I
have some questions about how best to implement (and improve on) our
current setup.
Our setup is a co-op model. We have users who "own shares" of the cluster
as well as non-contributing users. We try to guarantee contributing users
access to their "share" of the cluster while also maximizing utilization
via the following setup:
o There are 3 queues on each node, and each queue has a number of slots
equal to the number of physical cores on the node (hyperthreading is
enabled on the nodes that support it).
o Our "lab" queue is for contributing users. Jobs in this queue run
un-niced, and each lab has a number of slots in this queue equal to
its share of the cluster.
o Our "long" queue is for all users. Jobs in this queue run "nice -19".
o We also have a "short" queue for quick jobs. These jobs run at "nice
-10" and are limited to 30 minutes.
o We use np_load_avg on the queues to control oversubscription. A node
full of lab queue jobs will not launch jobs in the other queues.
However, a node full of long queue jobs can still launch lab queue
jobs, up until both lab and long queues on that node are full. (See
the rough slurm.conf sketch just after this list.)
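For concreteness, here's my rough first pass at translating that layout
into slurm.conf. The node/partition names and core counts below are
made up for illustration, and I'm not at all sure the sharing and
priority settings are right:

    # Schedule by physical core so hyperthreads don't become extra slots
    SelectType=select/cons_res
    SelectTypeParameters=CR_Core
    NodeName=node[001-200] Sockets=2 CoresPerSocket=8 ThreadsPerCore=2

    # One partition per old SGE queue, all overlapping on the same nodes;
    # PriorityTier mirrors the old nice levels (lab > short > long)
    PartitionName=lab   Nodes=node[001-200] PriorityTier=3 OverSubscribe=FORCE:1
    PartitionName=short Nodes=node[001-200] PriorityTier=2 OverSubscribe=FORCE:1 MaxTime=00:30:00
    PartitionName=long  Nodes=node[001-200] PriorityTier=1 OverSubscribe=FORCE:1

I haven't worked out how to express the per-lab slot counts yet; my
guess is that's a job for accounts and fairshare (sacctmgr) rather than
partition definitions.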
That's the behavior I'm trying to somewhat replicate as a starting
point for our new setup. Is gang scheduling what I'm looking for? Do
folks have issues with jobs continually being suspended and resumed?
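From my reading of the preemption and gang scheduling docs, the
relevant cluster-wide knobs seem to be roughly the following (again,
just my guess at a starting point, not something we've tested at
scale):

    # Jobs in higher PriorityTier partitions preempt lower-tier jobs
    PreemptType=preempt/partition_prio
    # Preempted jobs are suspended and later resumed by the gang scheduler
    PreemptMode=SUSPEND,GANG
    # Length of the gang scheduler's time slice, in seconds
    SchedulerTimeSlice=30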
Any pointers or hints would be much appreciated. And feel free to ask for
clarification and/or tell me I'm on the completely wrong track. Thanks.
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF