Hello all,
I'm fairly new to Slurm and have a couple of conceptual questions about
accounting, partitioning, and reservations in Slurm. I'm looking to
replace a Moab/Torque cluster. The cluster is general purpose for our campus
(large parallel, memory bound, many serial, gpu), but follows the condo
model for investment: many researchers invest in the machine in
order to have access to a larger system in general. It has been a good
experience from that aspect. However, with this model, researchers across the
machine have asked us to enable preemption on their investment when they need
to do software runs. The preempted jobs need to be re-queued automatically
unless the job is interactive or the job script states the job should not be
restartable. In addition to this, our general research office also funds us to
support the computationally driven researchers and requests information about
the kinds of research being done and how many compute hours, which resource
types, etc. have been dedicated to that specific research (fairly
standard university record keeping). The current scheduler is not achieving
all that we would like to accomplish, so I'm curious whether Slurm could help
with this. Here is our current high-level setup for account management:
BATCH_QUEUE (or perhaps partition in Slurm speak) {
InvestorQualityOfService 1 (Investment Level)
--Department research group (such as ComputationalFluidDynamics of Mechanical
Engineering)
----Users in the department research group
--Department research group (such as SolidMechanics of Mechanical Engineering)
----Users
InvestorQualityOfService 2
--Department research group (CompSci Artificial Intelligence)
----Users
GeneralQualityOfService
--Department 1 test drive (math or something)
----Users
--Department 2 test drive (i.e., chemistry)
----Users
}
Additional queues with similar setup for debugging and visualization.
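To make the reporting requirement concrete, here is a minimal sketch in Python
of the hierarchy above and the kind of roll-up our research office asks for.
This is only an illustration of the bookkeeping, not Slurm configuration or a
Slurm API: the user names, the job-record tuple layout, and the function names
are all hypothetical placeholders.

```python
from collections import defaultdict

# Hierarchy from the outline above: QOS -> research group -> users.
# Group keys are shortened labels; user names are hypothetical placeholders.
HIERARCHY = {
    "InvestorQualityOfService1": {
        "ComputationalFluidDynamics": ["user_a"],
        "SolidMechanics": ["user_b"],
    },
    "InvestorQualityOfService2": {
        "ArtificialIntelligence": ["user_c"],
    },
    "GeneralQualityOfService": {
        "Department1TestDrive": ["user_d"],
        "Department2TestDrive": ["user_e"],
    },
}


def owner_of(user):
    """Return the (qos, research_group) pair a user belongs to."""
    for qos, groups in HIERARCHY.items():
        for group, users in groups.items():
            if user in users:
                return qos, group
    raise KeyError(f"unknown user: {user}")


def cpu_hours_by_group(jobs):
    """Sum CPU-hours per (qos, research_group, resource_type).

    jobs is an iterable of (user, cpus, hours, resource_type) tuples;
    this record layout is an assumption made for the sketch.
    """
    totals = defaultdict(float)
    for user, cpus, hours, resource_type in jobs:
        qos, group = owner_of(user)
        totals[(qos, group, resource_type)] += cpus * hours
    return dict(totals)
```

Feeding it finished-job records keyed by user would give totals per research
group and resource type, which is the shape of report we currently assemble.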
So with that view of the accounting hierarchy, the operating constraints follow:
- Users from investor 2 should be able to run on the investor2, general, and
investor1 resources, but be preempted from the investor1 resources by investor1
users only.
- Jobs from users should try to fit into their own parent investment first,
then attempt the general investment, then look for other investment resources
- Currently, there are also processor limits on the batch queue, which keep
all research groups in line so that they do not swamp the cluster, because we
utilize backfilling as well
- The kicker (which is perhaps hard) is that the investments are to be fluid:
the investor does not own a node, but owns an amount of a resource type, so
that during a hardware failure the investor still has the same amount of
investment (which comes from our general investment until the hardware is
replaced). From recent conversations on the list,
I think that it is possible to use a QOS with GrpCPU and get a limit. That
looks promising, but I have an additional problem: our cluster is also
heterogeneous, with GPUs.
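To pin down the placement and preemption rules in the list above, here is a
minimal sketch in Python. It models only the policy I'm after, not how Slurm
would implement it. The pool names and function names are hypothetical, and
two points are my own assumptions rather than settled policy: nothing is
preempted on the general pool, and general users running on an investor's
resources are preempted the same way investor 2 users would be.

```python
GENERAL = "general"  # hypothetical name for the general investment pool


def placement_order(own_pool, all_pools):
    """Order in which a job tries pools: own investment first,
    then the general investment, then every other investment."""
    order = [own_pool]
    if own_pool != GENERAL and GENERAL in all_pools:
        order.append(GENERAL)
    order += [pool for pool in all_pools if pool not in order]
    return order


def may_preempt(incoming_pool, running_pool, resource_pool):
    """True if an incoming job may preempt a running job.

    Only the investor who owns the resources may preempt, and only
    jobs that belong to someone else. Assumption: no preemption on
    the general pool.
    """
    return (
        resource_pool != GENERAL
        and incoming_pool == resource_pool
        and running_pool != resource_pool
    )


def should_requeue(interactive, restartable):
    """A preempted job is re-queued unless it is interactive or its
    job script marked it as not restartable."""
    return (not interactive) and restartable


pools = ["investor1", "investor2", GENERAL]
print(placement_order("investor2", pools))
print(may_preempt("investor1", "investor2", "investor1"))
```

For an investor 2 job this tries investor2, then general, then investor1, and
an investor 1 job is allowed to push it off investor1 resources only.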
There was a mention of reservations using something similar to:
`scontrol create reservation nodes=ALL features=gpu ...`
These reservations would need to be unlimited in time and allow all users and
accounts, but allow preemption by a QOS user. This is what I'm not
sure is possible, and it seems to be our breaking point with Moab currently.
Another question: can a single job overlap multiple reservations when it
needs to be larger than a single reservation would allow?
[I've been given quite an interesting challenge with this particular cluster,
or at least I think so...]
One more issue is that the processor limits on a QOS need to have
multiple stages, such that perhaps a group can run with 160 processors if the
cluster is saturated, but perhaps 320 when cluster utilization is below a
certain percentage (such as 80% utilization).
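As a minimal sketch of the staged limit I have in mind (plain Python, not a
Slurm feature I know to exist; the function and parameter names are
hypothetical, and the 160/320/80% figures are just the example values above):

```python
def group_cpu_limit(busy_cpus, total_cpus,
                    saturated_limit=160, relaxed_limit=320,
                    threshold=0.80):
    """Return the per-group processor limit for the current load.

    Below the utilization threshold the group gets the relaxed limit;
    at or above it, the saturated limit applies.
    """
    if total_cpus <= 0:
        raise ValueError("total_cpus must be positive")
    utilization = busy_cpus / total_cpus
    return relaxed_limit if utilization < threshold else saturated_limit


print(group_cpu_limit(500, 1000))  # 50% utilized, relaxed limit applies
print(group_cpu_limit(900, 1000))  # 90% utilized, saturated limit applies
```

Whether a limit like this can be re-evaluated as utilization changes, rather
than being fixed on the QOS, is the part I'm unsure about.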
Thanks again. I'm just putting my feelers out there to see whether anybody
thinks this is completely possible or sees issues with an implementation in
Slurm. I feel I've already reached the complexity breaking point and that I
may need to give up something, but any rants, opinions, or suggestions are
welcome. And if I didn't explain something well, apologies.
Regards,
Jared
ARCC
PS: If anybody going to SC14 would like to have a discussion about this setup,
I wouldn't mind having a sit-down either. Thanks again!