I am trying to set up a cluster where two user groups each have a
maximum number of nodes they can use. I thought I was going to be able
to use GrpNodes on accounts for this, but ran into problems.

Base case: An 80 node cluster for one of the groups
---------------------------------------------------

Let us simulate two users in the same user group competing for the 80
nodes available in our cluster, which has no jobs running at the time.
We do not care about the other user group at the moment.

User ua1 does:
  sbatch -N30 -t 1:0:0 sleep.sh 1000
  sbatch -N30 -t 1:0:0 sleep.sh 1000

User ua2 then does:
  sbatch -N60 -t 1:0:0 sleep.sh 1000

User ua1 does:
  sbatch -N30 -t 1:0:0 sleep.sh 1000

An squeue output shows the following:

  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
   1247        c6 sleep.sh      ua2  PD       0:00     60 (Resources)
   1248        c6 sleep.sh      ua1  PD       0:00     30 (Priority)
   1246        c6 sleep.sh      ua1   R       0:07     30 n[31-60]
   1245        c6 sleep.sh      ua1   R       0:10     30 n[1-30]

The first two 30-node jobs from ua1 have started. No other job can
start due to the lack of nodes.

We now cancel 1245. There are now enough nodes to start 1248, but that does
not happen, as job 1247 has priority and 1248 cannot be backfilled.

We now cancel 1246. There are now nodes available to start 1247,
while 1248 is still waiting:

  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
   1248        c6 sleep.sh      ua1  PD       0:00     30 (Resources)
   1247        c6 sleep.sh      ua2   R       0:01     60 n[1-60]

This is as expected.
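
To spell out why 1248 cannot be backfilled after 1245 is cancelled, here is a
small Python sketch. It is only a toy model of the reasoning, not Slurm's
backfill code. The function can_backfill and its arguments are made up for
illustration, and the 3590 s remaining on job 1246 is an assumed figure.

```python
def can_backfill(total_nodes, running, top_nodes, cand_nodes, cand_limit, now=0):
    """Toy check: may a lower-priority candidate start now without delaying
    the top-priority job?  `running` is a list of (nodes, end_time) pairs."""
    free_now = total_nodes - sum(nodes for nodes, _ in running)
    if cand_nodes > free_now:
        return False  # candidate does not even fit right now

    # Find when the top job could start if nothing else were started.
    free = free_now
    top_start = now
    for nodes, end_time in sorted(running, key=lambda job: job[1]):
        if free >= top_nodes:
            break
        free += nodes
        top_start = end_time
    if free < top_nodes:
        return False  # top job never fits in this toy model; be conservative

    # The candidate is harmless if it ends before the top job's start ...
    if now + cand_limit <= top_start:
        return True
    # ... or if the top job still fits next to it at that time.
    return free - cand_nodes >= top_nodes


# State after cancelling 1245: job 1246 holds 30 of the 80 nodes.
# Assumption: 1246 has 3590 s of its one-hour limit left.
running = [(30, 3590)]
print(can_backfill(80, running, top_nodes=60, cand_nodes=30, cand_limit=3600))
```

This prints False: 1248 would still hold 30 nodes when 1246 ends, leaving only
50 for the 60-node job. With a shorter limit such as cand_limit=600, the same
call returns True in this model.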

Trying another way: An account with GrpNodes=80 on a larger cluster
-------------------------------------------------------------------

We now try to achieve the same on a larger 240 node cluster, by
setting GrpNodes=80 on the account "ga" that both users ua1 and ua2
belong to. We assume that the other user group also has an account
with GrpNodes set, but let's focus on the behaviour within the account "ga":

User ua1 does:
  sbatch -N30 -t 1:0:0 sleep.sh 1000
  sbatch -N30 -t 1:0:0 sleep.sh 1000

User ua2 then does:
  sbatch -N60 -t 1:0:0 sleep.sh 1000

User ua1 does:
  sbatch -N30 -t 1:0:0 sleep.sh 1000

An squeue output shows the following:

  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
   1252        c6 sleep.sh      ua2  PD       0:00     60 (AssociationResourceLimit)
   1253        c6 sleep.sh      ua1  PD       0:00     30 (AssociationResourceLimit)
   1251        c6 sleep.sh      ua1   R       0:19     30 n[31-60]
   1250        c6 sleep.sh      ua1   R       0:23     30 n[1-30]

As before, the first two 30-node jobs from ua1 have started. No other
job can start, this time because it would take account "ga" past its
GrpNodes=80 limit.

And sprio shows:

  JOBID   PRIORITY        AGE        QOS
   1252 1000000171        172 1000000000
   1253 1000000160        160 1000000000

We now cancel job 1250. Now the last 30-node job from ua1 is allowed to
start, while the higher-priority 60-node job submitted earlier is still
in the queue:

  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
   1252        c6 sleep.sh      ua2  PD       0:00     60 (AssociationResourceLimit)
   1253        c6 sleep.sh      ua1   R       0:02     30 n[1-30]
   1251        c6 sleep.sh      ua1   R       3:10     30 n[31-60]

If user ua1 keeps submitting 30-node jobs, he can starve user ua2, who
wants to run 60-node jobs. That would lead to support cases...
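
The behaviour I am seeing can be reproduced with a few lines of Python. This is
only a toy model that matches the squeue output above, not a claim about how
Slurm's scheduler is written. The function scheduling_pass and its arguments
are made up for illustration.

```python
def scheduling_pass(pending, nodes_in_use, grp_nodes):
    """Toy model of the observed behaviour: walk the queue in priority order
    and skip, rather than wait for, jobs that would exceed the group limit.
    `pending` is a list of (jobid, priority, nodes) tuples."""
    started = []
    for jobid, _priority, nodes in sorted(pending, key=lambda job: -job[1]):
        if nodes_in_use + nodes > grp_nodes:
            continue  # stays pending (AssociationResourceLimit), blocks nobody
        nodes_in_use += nodes
        started.append(jobid)
    return started


pending = [(1252, 1000000171, 60), (1253, 1000000160, 30)]

# Jobs 1250 and 1251 running: 60 nodes in use, nothing fits under the limit.
print(scheduling_pass(pending, nodes_in_use=60, grp_nodes=80))  # []
# After cancelling 1250: 30 nodes in use, only the 30-node job fits.
print(scheduling_pass(pending, nodes_in_use=30, grp_nodes=80))  # [1253]
```

In this model the 60-node job can only start once the account uses 20 nodes or
fewer, so as long as ua1 keeps at least one 30-node job running, 1252 never
starts.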

Questions
---------

Is it terribly naive of me to expect the GrpNodes case to respect
priority just like the first case? :-)

Do I need to use separate partitions for group A and group B to
achieve my goals? Or should I approach this from some other angle?


Configuration
-------------

I am simulating the nodes by compiling with --enable-front-end and
running one slurmctld and one slurmd (and one slurmdbd) on the test
system.

The SLURM source code was checked out from git today
(commit 5d9b141800b314d45facb1f9c526cfe8fb8ec285).

The parameters that I think could be relevant:

SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory

PriorityType=priority/multifactor
PriorityDecayHalfLife=21-0:0:0
PriorityCalcPeriod=0:1:00
#PriorityFavorSmall=
PriorityMaxAge=7-0
#PriorityUsageResetPeriod=
PriorityWeightAge=1000000
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
PriorityWeightQOS=1000000000

DefaultStorageType=slurmdbd
AccountingStorageEnforce=associations,limits,qos

NodeName=n[1-240] NodeHostName=localhost CPUs=1 State=UNKNOWN
#PartitionName=c6 Nodes=n[1-80] Default=YES MaxTime=INFINITE State=UP Shared=EXCLUSIVE
PartitionName=c6 Nodes=n[1-240] Default=YES MaxTime=INFINITE State=UP Shared=EXCLUSIVE


sacctmgr config:

...
Account - ga:Description='group a':Organization='ga':Fairshare=1:GrpNodes=80
Parent - ga
User - ua1:DefaultAccount='ga':Fairshare=1
User - ua2:DefaultAccount='ga':Fairshare=1
...

(no GrpNodes=80 in the base case)
