Hi,
After some fun incidents with accidental monopolization of the cluster, we
decided to enforce some QOS.
I have read the documentation. So far, the only related thing I had done
during setup was to assign "share" values when I created each
association.
The cluster already had a QOS called normal.
I adjusted normal to have MaxTRESPerUser=cpu=90 and created a new QOS called
firstclass for a special set of associations that need better access.
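For reference, the QOS changes described above could be made with commands along the following lines. This is a sketch rather than a record of what was actually typed: the Priority values are read off the sacctmgr show qos output in this message, and the run wrapper is a hypothetical helper that only prints each command unless APPLY=1 is set.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints the command unless APPLY=1 is set.
run() {
  if [ "${APPLY:-0}" = "1" ]; then
    "$@"
  else
    echo "$*"
  fi
}

# Cap each user running under the existing "normal" QOS at 90 CPUs.
# Priority=10 is an assumption taken from the QOS listing.
run sacctmgr -i modify qos normal set MaxTRESPerUser=cpu=90 Priority=10

# Create the higher-priority QOS (Priority=100 is also read off the listing).
run sacctmgr -i add qos firstclass Priority=100
```

Run as-is, the script only echoes the two sacctmgr commands, so they can be reviewed before anything is changed in the accounting database.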
sacctmgr show qos format=Name,Priority,PreemptMode,UsageFactor,MaxTRESPerUser
      Name   Priority PreemptMode UsageFactor     MaxTRESPU
---------- ---------- ----------- ----------- -------------
    normal         10     cluster    1.000000        cpu=90
firstclass        100     cluster    1.000000
sacctmgr show assoc format=Cluster,Account,User,Partition,QOS
   Cluster    Account       User  Partition                  QOS
---------- ---------- ---------- ---------- --------------------
  rosalind        dev                                     normal
  rosalind        dev  test_user      debug               normal
  rosalind        dev  test_user  bcl2fastq               normal
  rosalind   pipeline                                 firstclass
  rosalind  pathology                                 firstclass
  rosalind  pathology     bioinf   pipeline           firstclass
  rosalind  pathology     bioinf  bcl2fastq           firstclass
  rosalind  pathology     bioinf  pathology           firstclass
  rosalind   research                                 firstclass
  rosalind   research     bioinf  pathology           firstclass
  rosalind   research     bioinf   pipeline           firstclass
  rosalind   research     bioinf  bcl2fastq           firstclass
  rosalind   reynolds                                     normal
  rosalind   reynolds ysun@pete+       prod               normal
  rosalind   reynolds ysun@pete+      debug               normal
  rosalind      users                                     normal
  rosalind       bacg                                     normal
  rosalind       bacg akumar@pe+      debug               normal
  rosalind       bacg akumar@pe+       prod               normal
  rosalind       bacg apapenfus+      debug               normal
  rosalind       bacg apapenfus+       prod               normal
  rosalind       bacg dgoode@pe+      debug               normal
  rosalind       bacg dgoode@pe+       prod               normal
  rosalind       bacg ivergara@+      debug               normal
  rosalind       bacg ivergara@+       prod               normal
  rosalind       bacg jmarkham@+      debug               normal
etc
I then assigned firstclass to the associations that need it (on different
partitions) and adjusted slurm.conf accordingly: I confirmed that
PriorityType was multifactor and set PriorityWeightQOS=1000. I distributed
the file to all nodes, restarted slurmctld, and ran scontrol reconfigure.
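One way to confirm that the running controller picked up these settings is to filter the output of scontrol show config, as sketched below. The qos_settings function name is made up for this example. AccountingStorageEnforce is not mentioned anywhere above and is included as an assumption: going by the slurm.conf documentation, limits set on a QOS are only enforced when that option includes limits, so it seems worth looking at alongside the priority settings.

```shell
#!/bin/sh
# Hypothetical helper: reads "Key = Value" lines on stdin (the layout printed
# by `scontrol show config`) and keeps only the settings relevant here.
qos_settings() {
  # AccountingStorageEnforce is an assumption, not a setting named in the post.
  grep -E '^(PriorityType|PriorityWeightQOS|AccountingStorageEnforce)[[:space:]]*='
}

# Usage on a node with Slurm installed:
#   scontrol show config | qos_settings
```

Checking the live configuration rather than the file on disk also shows whether the restart and reconfigure actually took effect.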
Yet the CPU limit doesn't seem to have taken effect: almost immediately,
someone used more than 90 CPUs.
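A quick way to see who is over the limit is to sum the allocated CPUs of running jobs per user. The sketch below assumes input lines of the form "user cpus", which squeue can produce with -o "%u %C". The over_limit function name is hypothetical, and the sum ignores which QOS each job ran under, so it is only a rough check.

```shell
#!/bin/sh
# Hypothetical helper: reads "user cpus" lines on stdin, totals the CPUs per
# user, and prints every user whose total exceeds the limit given as $1.
over_limit() {
  awk -v limit="$1" '
    { cpus[$1] += $2 }
    END { for (u in cpus) if (cpus[u] > limit) print u, cpus[u] }'
}

# Usage on a node with Slurm installed (90 is the normal QOS limit above):
#   squeue -h -t RUNNING -o "%u %C" | over_limit 90
```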
What have I done wrong? I re-read the documentation this morning, but I
can't see anything that might be preventing the QOS from being applied,
except perhaps a QOS hierarchy issue. However, I've only set up the two QOS,
and they apply to distinct associations and partitions.
cheers
L.
------
The most dangerous phrase in the language is, "We've always done it this
way."
- Grace Hopper