I am trying to set up a cluster where two user groups each have a maximum number of nodes they can use. I thought I was going to be able to use GrpNodes on accounts for this, but ran into problems.

Base case: An 80 node cluster for one of the groups
---------------------------------------------------

Let us simulate two users in the same user group competing for the 80 nodes available in our cluster, which has no jobs running at the time. We do not care about the other user group at the moment.

User ua1 does:

    sbatch -N30 -t 1:0:0 sleep.sh 1000
    sbatch -N30 -t 1:0:0 sleep.sh 1000

User ua2 then does:

    sbatch -N60 -t 1:0:0 sleep.sh 1000

User ua1 does:

    sbatch -N30 -t 1:0:0 sleep.sh 1000

An squeue output shows the following:

    JOBID PARTITION     NAME USER ST  TIME NODES NODELIST(REASON)
     1247        c6 sleep.sh  ua2 PD  0:00    60 (Resources)
     1248        c6 sleep.sh  ua1 PD  0:00    30 (Priority)
     1246        c6 sleep.sh  ua1  R  0:07    30 n[31-60]
     1245        c6 sleep.sh  ua1  R  0:10    30 n[1-30]

The first two 30-node jobs from ua1 have started. No other job can start due to the lack of nodes.

We now cancel 1245. There are now enough nodes to start 1248, but that does not happen, as job 1247 has priority and 1248 cannot be backfilled.

We now cancel 1246. There are now nodes available to start 1247, while 1248 is still waiting:

    JOBID PARTITION     NAME USER ST  TIME NODES NODELIST(REASON)
     1248        c6 sleep.sh  ua1 PD  0:00    30 (Resources)
     1247        c6 sleep.sh  ua2  R  0:01    60 n[1-60]

This is as expected.

Trying another way: An account with GrpNodes=80 on a larger cluster
-------------------------------------------------------------------

We now try to achieve the same on a larger 240 node cluster, by setting GrpNodes=80 on the account "ga" that both users ua1 and ua2 belong to.
We assume that the other user group also has an account with GrpNodes set, but let's focus on the behaviour within the account "ga":

User ua1 does:

    sbatch -N30 -t 1:0:0 sleep.sh 1000
    sbatch -N30 -t 1:0:0 sleep.sh 1000

User ua2 then does:

    sbatch -N60 -t 1:0:0 sleep.sh 1000

User ua1 does:

    sbatch -N30 -t 1:0:0 sleep.sh 1000

An squeue output shows the following:

    JOBID PARTITION     NAME USER ST  TIME NODES NODELIST(REASON)
     1252        c6 sleep.sh  ua2 PD  0:00    60 (AssociationResourceLimit)
     1253        c6 sleep.sh  ua1 PD  0:00    30 (AssociationResourceLimit)
     1251        c6 sleep.sh  ua1  R  0:19    30 n[31-60]
     1250        c6 sleep.sh  ua1  R  0:23    30 n[1-30]

As before, the first two 30-node jobs from ua1 have started. No other job can start without exceeding the account's GrpNodes limit of 80 nodes.

And sprio shows:

    JOBID   PRIORITY  AGE        QOS
     1252 1000000171  172 1000000000
     1253 1000000160  160 1000000000

We now cancel job 1250. Now the last 30-node job from ua1 is allowed to start, while the higher-priority 60-node job submitted earlier is still in the queue:

    JOBID PARTITION     NAME USER ST  TIME NODES NODELIST(REASON)
     1252        c6 sleep.sh  ua2 PD  0:00    60 (AssociationResourceLimit)
     1253        c6 sleep.sh  ua1  R  0:02    30 n[1-30]
     1251        c6 sleep.sh  ua1  R  3:10    30 n[31-60]

If user ua1 keeps submitting 30-node jobs, he can starve user ua2, who wants to run 60-node jobs. That would lead to support cases...

Questions
---------

Is it terribly naive of me to expect the GrpNodes case to respect priority just like the first case? :-)

Do I need to use separate partitions for group A and group B to achieve my goals? Or should I approach this from some other angle?

Configuration
-------------

I am simulating the nodes by compiling with --enable-front-end and running one slurmctld and one slurmd (and one slurmdbd) on the test system. The SLURM source code was checked out from git today (commit 5d9b141800b314d45facb1f9c526cfe8fb8ec285).

The parameters that I think could be relevant:

    SchedulerType=sched/backfill
    SelectType=select/cons_res
    SelectTypeParameters=CR_Core_Memory
    PriorityType=priority/multifactor
    PriorityDecayHalfLife=21-0:0:0
    PriorityCalcPeriod=0:1:00
    #PriorityFavorSmall=
    PriorityMaxAge=7-0
    #PriorityUsageResetPeriod=
    PriorityWeightAge=1000000
    #PriorityWeightFairshare=
    #PriorityWeightJobSize=
    #PriorityWeightPartition=
    PriorityWeightQOS=1000000000
    DefaultStorageType=slurmdbd
    AccountingStorageEnforce=associations,limits,qos
    NodeName=n[1-240] NodeHostName=localhost CPUs=1 State=UNKNOWN
    #PartitionName=c6 Nodes=n[1-80] Default=YES MaxTime=INFINITE State=UP Shared=EXCLUSIVE
    PartitionName=c6 Nodes=n[1-240] Default=YES MaxTime=INFINITE State=UP Shared=EXCLUSIVE

sacctmgr config:

    ...
    Account - ga:Description='group a':Organization='ga':Fairshare=1:GrpNodes=80
    Parent - ga
    User - ua1:DefaultAccount='ga':Fairshare=1
    User - ua2:DefaultAccount='ga':Fairshare=1
    ...

(no GrpNodes=80 in the base case)
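
To make the difference between the two cases concrete, here is a toy model in Python. It is only a sketch of the behaviour I observed, not of how Slurm is actually implemented: the `Job` class, the `schedule` and `replay` functions, the job ids 1 to 4 and the priority values are all made up for illustration. With `skip_blocked=False` the scheduler stops at the first job that does not fit (the base case); with `skip_blocked=True` a job that would exceed the limit is passed over and lower-priority jobs are still considered (what I see with GrpNodes).

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int      # hypothetical ids 1..4, in submission order
    user: str
    nodes: int
    priority: int    # made-up values; earlier submission = higher priority
    running: bool = False


def schedule(jobs, limit, skip_blocked):
    """Start pending jobs in priority order against a node limit."""
    used = sum(j.nodes for j in jobs if j.running)
    pending = sorted((j for j in jobs if not j.running),
                     key=lambda j: j.priority, reverse=True)
    for job in pending:
        if used + job.nodes <= limit:
            job.running = True
            used += job.nodes
        elif not skip_blocked:
            # Base case: the blocked job holds back everything behind it.
            break
        # Otherwise the blocked job is passed over and the loop continues.


def replay(skip_blocked, limit=80):
    """Replay the submissions from this post, then cancel the first job."""
    jobs = [
        Job(1, "ua1", 30, 400),
        Job(2, "ua1", 30, 300),
        Job(3, "ua2", 60, 200),
        Job(4, "ua1", 30, 100),
    ]
    schedule(jobs, limit, skip_blocked)
    jobs = [j for j in jobs if j.job_id != 1]   # cancel the first 30-node job
    schedule(jobs, limit, skip_blocked)
    return sorted(j.job_id for j in jobs if j.running)


if __name__ == "__main__":
    print("strict priority order:     ", replay(skip_blocked=False))
    print("limit-blocked jobs skipped:", replay(skip_blocked=True))
```

In the first mode only job 2 is running after the cancel and the 60-node job 3 is next in line. In the second mode job 4 starts alongside job 2, and job 3 stays pending for as long as ua1 keeps feeding 30-node jobs into the queue.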
