Hi Kent,

In the example that fails to work properly (job 1253 starting before
job 1252), the problem is that the backfill scheduler does not account
for all resources and limits. Specifically, the backfill scheduler
simulates which resources become available as various jobs begin and
end going forward in time. It accounts for CPUs, memory, various
limits and job preemption. It does not currently account for group
limits or licenses. So when the backfill scheduler tries to determine
when job 1252 can start, it notes the association limit, but fails to
recognize that the job will be able to start in 57 minutes (when job
1251 terminates, affecting the group limit). It therefore fails to
reserve those resources, and that reservation is what would have
prevented the initiation of job 1253.

There is no simple fix for this problem. It would require adding
new logic to track the group limits into the future to better
determine when and where pending jobs can be initiated.
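
To illustrate the kind of tracking that would be needed, here is a toy
sketch in Python. It is not Slurm code: the function names and the data
layout are made up for this example, and it assumes that the only jobs
counted against the group limit are the ones listed.

```python
def earliest_start(running, needed_nodes, grp_nodes):
    """Minutes from now until needed_nodes fit under the group limit.

    running: list of (nodes, minutes_remaining) for jobs currently
             running in the same association (hypothetical layout).
    Returns None if the request can never fit under grp_nodes.
    """
    in_use = sum(nodes for nodes, _ in running)
    if in_use + needed_nodes <= grp_nodes:
        return 0
    # Release nodes in order of job completion until the request fits.
    for nodes, remaining in sorted(running, key=lambda job: job[1]):
        in_use -= nodes
        if in_use + needed_nodes <= grp_nodes:
            return remaining
    return None


def delays_reservation(cand_nodes, cand_minutes,
                       reserved_nodes, reserved_start, grp_nodes):
    """True if starting the candidate now would push back the reserved job.

    Simplifying assumption: at reserved_start the candidate is the only
    other job in the association that could still be running.
    """
    still_running = cand_minutes > reserved_start
    return still_running and cand_nodes + reserved_nodes > grp_nodes


# State after job 1250 is cancelled: job 1251 (30 nodes) has 57 minutes left.
running = [(30, 57)]
start_1252 = earliest_start(running, 60, 80)   # 60-node job 1252
start_1253 = earliest_start(running, 30, 80)   # 30-node job 1253
print(start_1252, start_1253)

# Job 1253 has a 60-minute time limit, so it would still hold 30 nodes
# when job 1252 is due to start.
print(delays_reservation(30, 60, 60, start_1252, 80))
```

With a reservation for job 1252 at the 57-minute mark, the second check
is what would stop job 1253 from being backfilled ahead of it.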

Moe


Quoting Kent Engström <[email protected]>:

>
> I am trying to set up a cluster where two user groups each have a
> maximum number of nodes they can use. I thought I was going to be able
> to use GrpNodes on accounts for this, but ran into problems.
>
> Base case: An 80 node cluster for one of the groups
> ---------------------------------------------------
>
> Let us simulate two users in the same user group competing for the 80
> nodes available in our cluster, that has no jobs running at the time.
> We do not care about the other user group at the moment.
>
> User ua1 does:
>   sbatch -N30 -t 1:0:0 sleep.sh 1000
>   sbatch -N30 -t 1:0:0 sleep.sh 1000
>
> User ua2 then does:
>   sbatch -N60 -t 1:0:0 sleep.sh 1000
>
> User ua1 does:
>   sbatch -N30 -t 1:0:0 sleep.sh 1000
>
> An squeue output shows the following:
>
>   JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
>    1247        c6 sleep.sh      ua2  PD       0:00     60 (Resources)
>    1248        c6 sleep.sh      ua1  PD       0:00     30 (Priority)
>    1246        c6 sleep.sh      ua1   R       0:07     30 n[31-60]
>    1245        c6 sleep.sh      ua1   R       0:10     30 n[1-30]
>
> The two first 30-node jobs from ua1 have started. No other job can
> start due to the lack of nodes.
>
> We now cancel 1245. There are now enough nodes to start 1248, but that
> does not happen, as job 1247 has priority and 1248 cannot be backfilled.
>
> We now cancel 1246. There are now nodes available to start 1247,
> while 1248 is still waiting:
>
>   JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
>    1248        c6 sleep.sh      ua1  PD       0:00     30 (Resources)
>    1247        c6 sleep.sh      ua2   R       0:01     60 n[1-60]
>
> This is as expected.
>
> Trying another way: An account with GrpNodes=80 on a larger cluster
> -------------------------------------------------------------------
>
> We now try to achieve the same on a larger 240 node cluster, by
> setting GrpNodes=80 on the account "ga" that both users ua1 and ua2
> belong to. We assume that the other user group also has an account
> with GrpNodes set, but let's focus on the behaviour within the account "ga":
>
> User ua1 does:
>   sbatch -N30 -t 1:0:0 sleep.sh 1000
>   sbatch -N30 -t 1:0:0 sleep.sh 1000
>
> User ua2 then does:
>   sbatch -N60 -t 1:0:0 sleep.sh 1000
>
> User ua1 does:
>   sbatch -N30 -t 1:0:0 sleep.sh 1000
>
> An squeue output shows the following:
>
>   JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
>    1252        c6 sleep.sh      ua2  PD       0:00     60 (AssociationResourceLimit)
>    1253        c6 sleep.sh      ua1  PD       0:00     30 (AssociationResourceLimit)
>    1251        c6 sleep.sh      ua1   R       0:19     30 n[31-60]
>    1250        c6 sleep.sh      ua1   R       0:23     30 n[1-30]
>
> As before, the two first 30-node jobs from ua1 have started. No other
> job can start, this time because either pending job would push the
> account past its GrpNodes limit.
>
> And sprio shows:
>
>   JOBID   PRIORITY        AGE        QOS
>    1252 1000000171        172 1000000000
>    1253 1000000160        160 1000000000
>
> We now cancel job 1250. Now, the last 30-node job from ua1 is allowed
> to start, while the higher priority 60-node job submitted earlier is
> still in the queue:
>
>   JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
>    1252        c6 sleep.sh      ua2  PD       0:00     60 (AssociationResourceLimit)
>    1253        c6 sleep.sh      ua1   R       0:02     30 n[1-30]
>    1251        c6 sleep.sh      ua1   R       3:10     30 n[31-60]
>
> If user ua1 keeps submitting 30-node jobs, he can starve user ua2, who
> wants to run 60-node jobs. That would lead to support cases...
>
> Questions
> ---------
>
> Is it terribly naive of me to expect the GrpNodes case to respect
> priority just like the first case? :-)
>
> Do I need to use separate partitions for group A and group B to
> achieve my goals? Or should I approach this from some other angle?
>
>
> Configuration
> -------------
>
> I am simulating the nodes by compiling with --enable-front-end and
> running one slurmctld and one slurmd (and one slurmdbd) on the test
> system.
>
> The SLURM source code was checked out from git today
> (commit 5d9b141800b314d45facb1f9c526cfe8fb8ec285).
>
> The parameters that I think could be relevant:
>
> SchedulerType=sched/backfill
> SelectType=select/cons_res
> SelectTypeParameters=CR_Core_Memory
>
> PriorityType=priority/multifactor
> PriorityDecayHalfLife=21-0:0:0
> PriorityCalcPeriod=0:1:00
> #PriorityFavorSmall=
> PriorityMaxAge=7-0
> #PriorityUsageResetPeriod=
> PriorityWeightAge=1000000
> #PriorityWeightFairshare=
> #PriorityWeightJobSize=
> #PriorityWeightPartition=
> PriorityWeightQOS=1000000000
>
> DefaultStorageType=slurmdbd
> AccountingStorageEnforce=associations,limits,qos
>
> NodeName=n[1-240] NodeHostName=localhost CPUs=1 State=UNKNOWN
> #PartitionName=c6 Nodes=n[1-80] Default=YES MaxTime=INFINITE State=UP Shared=EXCLUSIVE
> PartitionName=c6 Nodes=n[1-240] Default=YES MaxTime=INFINITE State=UP Shared=EXCLUSIVE
>
>
> sacctmgr config:
>
> ...
> Account - ga:Description='group a':Organization='ga':Fairshare=1:GrpNodes=80
> Parent - ga
> User - ua1:DefaultAccount='ga':Fairshare=1
> User - ua2:DefaultAccount='ga':Fairshare=1
> ...
>
> (no GrpNodes=80 in the base case)
