Hello all,

I'm fairly new to Slurm and have a couple of conceptual questions about
accounting, partitioning, and reservations. I'm looking to replace a
Moab/Torque cluster. The cluster is general purpose for our campus (large
parallel, memory bound, many serial, GPU) but follows the condo model for
investment: many researchers invest in the machine in order to have access to
a larger system overall. It has been a good experience from that aspect.

However, with this model, researchers across the machine have asked for
preemption on their investment when they need to do software runs. Preempted
jobs need to be requeued automatically unless the job is interactive or the
job script states that the job should not be restarted.

In addition, our general research office funds us to support computationally
driven researchers and requests information about the kinds of research being
done and how many compute hours, which resource types, etc. have been
dedicated to that specific research (fairly standard university record
keeping).

The current scheduler does not achieve all that we would like to accomplish,
so I'm curious whether Slurm could help with this. Here is our current
high-level setup for account management:

BATCH_QUEUE (or perhaps partition in slurm speak) {

InvestorQualityOfService 1 (Investment Level)
--Department research group (such as ComputationalFluidDynamics of Mechanical Engineering)
----Users in the department research group
--Department research group (such as SolidMechanics of Mechanical Engineering)
----Users

InvestorQualityOfService 2
--Department research group (CompSci Artificial Intelligence)
----Users

GeneralQualityOfService
--Department 1 test drive (math or something)
----Users
--Department 2 test drive (e.g., chemistry)
----Users

}

Additional queues with similar setup for debugging and visualization.
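
To make the hierarchy above concrete, here is a rough sketch in Python. It is not Slurm syntax, only a way to pin down the structure and the kind of per-group usage report our research office asks for. All group names, user names, and the `(user, resource_type, hours)` record layout are made up for illustration.

```python
from collections import defaultdict

# QOS (investment level) -> research group -> users.
# All group and user names are hypothetical.
HIERARCHY = {
    "investor1": {
        "me_cfd": ["alice", "bob"],
        "me_solid_mechanics": ["carol"],
    },
    "investor2": {"cs_ai": ["dave"]},
    "general": {
        "math_testdrive": ["erin"],
        "chem_testdrive": ["frank"],
    },
}


def locate(user):
    """Return the (qos, group) pair a user belongs to."""
    for qos, groups in HIERARCHY.items():
        for group, users in groups.items():
            if user in users:
                return qos, group
    raise KeyError(f"unknown user: {user}")


def hours_by_group(records):
    """Sum compute hours per (group, resource_type).

    `records` is an iterable of (user, resource_type, hours) tuples.
    """
    totals = defaultdict(float)
    for user, resource_type, hours in records:
        _, group = locate(user)
        totals[(group, resource_type)] += hours
    return dict(totals)
```

Whatever the scheduler's accounting ends up looking like, a report of this shape (hours per research group and resource type) is what we need to be able to produce.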


So with that view of the accounting hierarchy, the operating constraints follow:


-    Users from investor2 should be able to run on the investor2, general, and 
investor1 resources, but be preempted from the investor1 resources by investor1 
users only.

-    Jobs should try to fit into their own parent investment first, then the 
general investment, and only then look for other investment resources.

-    Currently, there are also processor limits on the batch queue, which keep 
all research groups in line so that no single group swamps the cluster, since 
we use backfilling as well.

-    The kicker (which is perhaps hard) is that the investments are meant to be 
fluid: the investor does not own a node but owns an amount of a resource type, 
so that during a hardware failure the investor still has the same amount of 
investment (which comes from our general investment until the hardware is 
replaced).

From recent conversations on the list, I think it is possible to use a QOS with 
GrpCPU to get a limit. That looks promising, but I have an additional problem: 
our cluster is also heterogeneous, with GPUs.
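
To pin down the operating constraints listed above independently of any scheduler, here is a minimal sketch in Python. It is not Slurm configuration. The names `Job`, `placement_order`, `may_preempt`, `should_requeue`, and `general_cores_available` are hypothetical, and counting a single "cores" resource is a simplifying assumption; a heterogeneous cluster would repeat the bookkeeping per resource type (for example CPU cores and GPUs).

```python
from dataclasses import dataclass

GENERAL = "general"


@dataclass(frozen=True)
class Job:
    owner: str                 # investment the submitting user belongs to
    interactive: bool = False
    restartable: bool = True   # False if the job script opts out of restarts


def placement_order(owner, investments):
    """Own investment first, then general, then the other investments."""
    order = [] if owner == GENERAL else [owner]
    order.append(GENERAL)
    order += [name for name in investments if name not in order]
    return order


def may_preempt(incoming, running, pool):
    """Only the investor that owns `pool` may preempt, and only guest jobs."""
    return pool != GENERAL and incoming.owner == pool and running.owner != pool


def should_requeue(job):
    """Preempted jobs are requeued unless interactive or non-restartable."""
    return job.restartable and not job.interactive


def general_cores_available(general_total, purchased, healthy):
    """Fluid investments: cores lost to hardware failure come out of general.

    `purchased` and `healthy` map investment name -> core count.
    """
    shortfall = sum(
        max(0, purchased[name] - healthy.get(name, 0)) for name in purchased
    )
    return max(0, general_total - shortfall)
```

For example, `placement_order("investor2", ["investor1", "investor2"])` tries investor2, then general, then investor1, and a job owned by investor1 may preempt an investor2 job only on the investor1 pool.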

There was also a mention of reservations, using something similar to:



  `scontrol create reservation nodes=ALL features=gpu ...`



These reservations would need to be unlimited in time and open to all users and 
accounts, but still allow preemption by a QOS user. This is what I'm not sure 
is possible, and it seems to be our breaking point with Moab currently. 
Another question: can a single job span multiple reservations when it needs to 
be larger than one reservation would allow?



[I've been given quite an interesting challenge with this particular cluster, 
or at least I think so...]



One more issue: the processor limits on a QOS need to have multiple stages, so 
that a group can run with, say, 160 processors when the cluster is saturated 
but 320 when cluster utilization is below a certain percentage (such as 80% 
utilization).
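
The staged limit described above could be expressed roughly as follows. This is only a sketch of the policy in Python, not a Slurm feature. The function name `group_cpu_cap` and its parameters are hypothetical, and the 160/320/80% values are just the example numbers from this message.

```python
def group_cpu_cap(busy_cpus, total_cpus,
                  saturated_cap=160, relaxed_cap=320, threshold=0.80):
    """Return the per-group processor limit for the current utilization.

    Below `threshold` utilization the relaxed cap applies; at or above
    it the cluster counts as saturated and the lower cap applies.
    """
    if total_cpus <= 0:
        raise ValueError("total_cpus must be positive")
    utilization = busy_cpus / total_cpus
    return relaxed_cap if utilization < threshold else saturated_cap
```

An open question with a policy like this is what happens to a group already running 320 processors when utilization climbs past the threshold; the sketch only decides the cap for newly started jobs.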



Thanks. Again, I'm just putting feelers out to see whether anybody thinks this 
is feasible or sees obvious issues with an implementation in Slurm. I feel I've 
already reached the complexity breaking point and may need to give something 
up, but any rants, opinions, or suggestions are welcome. And if I didn't 
explain something well, apologies.



Regards,



Jared

ARCC



PS: If anybody going to SC14 would like to discuss this setup, I wouldn't mind 
having a sit-down either. Thanks again!


