Hi Kent,

In the example that fails to work properly (job 1253 starting before job 1252), the problem is that the backfill scheduler does not account for all resources and limits. Specifically, the backfill scheduler simulates what resources become available as various jobs begin and end going forward in time. It accounts for CPUs, memory, various limits and job preemption. It does not currently account for group limits or licenses. So when the backfill scheduler tries to determine when job 1252 can start, it notes the association limit but fails to recognize that the job will be able to start in 57 minutes (when job 1251 terminates, affecting the group limit). It therefore fails to reserve those resources, and that reservation is what would have prevented the initiation of job 1253.
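
The missing bookkeeping can be illustrated with a small Python sketch. This is not Slurm code: the names Running, earliest_start and can_backfill are hypothetical, and the model assumes a single group limit (GrpNodes) and jobs that run until their time limit. It only shows the idea of walking the group's node usage forward in time to find when the higher priority job could start, and refusing to backfill a job that would still hold nodes at that moment.

```python
from dataclasses import dataclass


@dataclass
class Running:
    """Hypothetical record for a running job charged to the group."""
    nodes: int
    ends_in: int  # minutes until the job's time limit expires


def earliest_start(running, nodes_needed, grp_nodes):
    """Minutes from now until nodes_needed fits under grp_nodes, or None."""
    in_use = sum(job.nodes for job in running)
    if in_use + nodes_needed <= grp_nodes:
        return 0
    # Walk forward in time, releasing nodes as each running job ends.
    for job in sorted(running, key=lambda j: j.ends_in):
        in_use -= job.nodes
        if in_use + nodes_needed <= grp_nodes:
            return job.ends_in
    return None  # the request exceeds the group limit outright


def can_backfill(running, cand_nodes, cand_minutes, reserved_nodes, grp_nodes):
    """True if the candidate can start now without delaying the reserved job."""
    in_use = sum(job.nodes for job in running)
    if in_use + cand_nodes > grp_nodes:
        return False  # candidate does not fit under the limit right now
    start = earliest_start(running, reserved_nodes, grp_nodes)
    if start is None or cand_minutes <= start:
        return True  # nothing to protect, or candidate ends in time
    # Candidate would still be running when the reserved job is due to start.
    still_running = sum(job.nodes for job in running if job.ends_in > start)
    return still_running + cand_nodes + reserved_nodes <= grp_nodes


# Situation after job 1250 is cancelled: job 1251 holds 30 nodes for 57 more
# minutes, GrpNodes=80, job 1252 wants 60 nodes, job 1253 wants 30 for 60 min.
running = [Running(nodes=30, ends_in=57)]
print(earliest_start(running, 60, 80))        # 57
print(can_backfill(running, 30, 60, 60, 80))  # False
```

With this kind of tracking, job 1253 would be held back because it would still occupy 30 nodes of the group limit when job 1252 becomes eligible to start.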

There is no simple fix for this problem. It would require adding new logic to track the group limits into the future to better determine when and where pending jobs can be initiated.

Moe

Quoting Kent Engström <[email protected]>:

>
> I am trying to set up a cluster where two user groups each have a
> maximum number of nodes they can use. I thought I was going to be able
> to use GrpNodes on accounts for this, but ran into problems.
>
> Base case: An 80 node cluster for one of the groups
> ---------------------------------------------------
>
> Let us simulate two users in the same user group competing for the 80
> nodes available in our cluster, which has no jobs running at the time.
> We do not care about the other user group at the moment.
>
> User ua1 does:
>   sbatch -N30 -t 1:0:0 sleep.sh 1000
>   sbatch -N30 -t 1:0:0 sleep.sh 1000
>
> User ua2 then does:
>   sbatch -N60 -t 1:0:0 sleep.sh 1000
>
> User ua1 does:
>   sbatch -N30 -t 1:0:0 sleep.sh 1000
>
> An squeue output shows the following:
>
>   JOBID PARTITION     NAME USER ST  TIME NODES NODELIST(REASON)
>    1247        c6 sleep.sh  ua2 PD  0:00    60 (Resources)
>    1248        c6 sleep.sh  ua1 PD  0:00    30 (Priority)
>    1246        c6 sleep.sh  ua1  R  0:07    30 n[31-60]
>    1245        c6 sleep.sh  ua1  R  0:10    30 n[1-30]
>
> The first two 30-node jobs from ua1 have started. No other job can
> start due to the lack of nodes.
>
> We now cancel 1245. There are now enough nodes to start 1248, but that
> does not happen, as job 1247 has priority and 1248 cannot be backfilled.
>
> We now cancel 1246. There are now nodes available to start 1247,
> while 1248 is still waiting:
>
>   JOBID PARTITION     NAME USER ST  TIME NODES NODELIST(REASON)
>    1248        c6 sleep.sh  ua1 PD  0:00    30 (Resources)
>    1247        c6 sleep.sh  ua2  R  0:01    60 n[1-60]
>
> This is as expected.
>
> Trying another way: An account with GrpNodes=80 on a larger cluster
> -------------------------------------------------------------------
>
> We now try to achieve the same on a larger 240 node cluster, by
> setting GrpNodes=80 on the account "ga" that both users ua1 and ua2
> belong to. We assume that the other user group also has an account
> with GrpNodes set, but let's focus on the behaviour within the
> account "ga":
>
> User ua1 does:
>   sbatch -N30 -t 1:0:0 sleep.sh 1000
>   sbatch -N30 -t 1:0:0 sleep.sh 1000
>
> User ua2 then does:
>   sbatch -N60 -t 1:0:0 sleep.sh 1000
>
> User ua1 does:
>   sbatch -N30 -t 1:0:0 sleep.sh 1000
>
> An squeue output shows the following:
>
>   JOBID PARTITION     NAME USER ST  TIME NODES NODELIST(REASON)
>    1252        c6 sleep.sh  ua2 PD  0:00    60 (AssociationResourceLimit)
>    1253        c6 sleep.sh  ua1 PD  0:00    30 (AssociationResourceLimit)
>    1251        c6 sleep.sh  ua1  R  0:19    30 n[31-60]
>    1250        c6 sleep.sh  ua1  R  0:23    30 n[1-30]
>
> As before, the first two 30-node jobs from ua1 have started. No other
> job can start due to the lack of nodes.
>
> And sprio shows:
>
>   JOBID   PRIORITY  AGE         QOS
>    1252 1000000171  172  1000000000
>    1253 1000000160  160  1000000000
>
> We now cancel job 1250. Now, the last 30-node job from ua1 is
> allowed to start, while the higher priority 60-node job submitted
> earlier is still in the queue:
>
>   JOBID PARTITION     NAME USER ST  TIME NODES NODELIST(REASON)
>    1252        c6 sleep.sh  ua2 PD  0:00    60 (AssociationResourceLimit)
>    1253        c6 sleep.sh  ua1  R  0:02    30 n[1-30]
>    1251        c6 sleep.sh  ua1  R  3:10    30 n[31-60]
>
> If user ua1 keeps submitting 30-node jobs, he can starve user ua2,
> who wants to run 60-node jobs.
> That would lead to support cases...
>
> Questions
> ---------
>
> Is it terribly naive of me to expect the GrpNodes case to respect
> priority just like the first case? :-)
>
> Do I need to use separate partitions for group A and group B to
> achieve my goals? Or should I approach this from some other angle?
>
>
> Configuration
> -------------
>
> I am simulating the nodes by compiling with --enable-front-end and
> running one slurmctld and one slurmd (and one slurmdbd) on the test
> system.
>
> The SLURM source code was checked out from git today
> (commit 5d9b141800b314d45facb1f9c526cfe8fb8ec285).
>
> The parameters that I think could be relevant:
>
>   SchedulerType=sched/backfill
>   SelectType=select/cons_res
>   SelectTypeParameters=CR_Core_Memory
>
>   PriorityType=priority/multifactor
>   PriorityDecayHalfLife=21-0:0:0
>   PriorityCalcPeriod=0:1:00
>   #PriorityFavorSmall=
>   PriorityMaxAge=7-0
>   #PriorityUsageResetPeriod=
>   PriorityWeightAge=1000000
>   #PriorityWeightFairshare=
>   #PriorityWeightJobSize=
>   #PriorityWeightPartition=
>   PriorityWeightQOS=1000000000
>
>   DefaultStorageType=slurmdbd
>   AccountingStorageEnforce=associations,limits,qos
>
>   NodeName=n[1-240] NodeHostName=localhost CPUs=1 State=UNKNOWN
>   #PartitionName=c6 Nodes=n[1-80] Default=YES MaxTime=INFINITE State=UP Shared=EXCLUSIVE
>   PartitionName=c6 Nodes=n[1-240] Default=YES MaxTime=INFINITE State=UP Shared=EXCLUSIVE
>
>
> sacctmgr config:
>
>   ...
>   Account - ga:Description='group a':Organization='ga':Fairshare=1:GrpNodes=80
>   Parent - ga
>   User - ua1:DefaultAccount='ga':Fairshare=1
>   User - ua2:DefaultAccount='ga':Fairshare=1
>   ...
>
> (no GrpNodes=80 in the base case)
