Dear All,

I'm trying to enforce a strict limit on GrpTRESMins for each user. The intended effect is that once a user's limit of GPU minutes is reached, no new jobs can start: no decay and no automatic replenishment of resources. After exhausting the GPU minutes, each user should have to ask for more. But despite exceeding the limit, users *can* still run new jobs.
* When adding a user to the cluster I set:
sacctmgr --immediate add user name=...
...
QOS=2gpu2d
GrpTRESMins=gres/gpu=20000
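For completeness, this is roughly the full invocation I use (the user, account, and partition names below are placeholders, not the real ones):

```shell
# Sketch of the add-user command; "jdoe", "myaccount" and "a6000"
# are hypothetical stand-ins for the redacted values.
sacctmgr --immediate add user name=jdoe account=myaccount \
    partition=a6000 qos=2gpu2d grptresmins=gres/gpu=20000
```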
* In "slurm.conf" ("safe" implicitly enables the "limits" and
"associations" enforcement options). Accounting storage is MariaDB
behind SlurmDBD:
GresTypes=gpu
AccountingStorageTRES=gres/gpu
AccountingStorageEnforce=qos,safe
# These two settings disable replenishing of GPU minutes.
PriorityDecayHalfLife=0
PriorityUsageResetPeriod=NONE
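As a sanity check (a sketch; exact output formatting may vary by Slurm version), the running controller's view of these settings can be confirmed after a reconfigure:

```shell
# Verify that slurmctld actually runs with the intended enforcement
# and decay settings, rather than stale pre-reconfigure values.
scontrol show config | grep -E \
    'AccountingStorageEnforce|PriorityDecayHalfLife|PriorityUsageResetPeriod'
```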
But when I look at the user's association and usage, the limits
appear not to be enforced:
   Account       User  Partition        QOS      GrpTRESMins
---------- ---------- ---------- ---------- ----------------
  redacted   redacted      a6000     2gpu2d   gres/gpu=10000
--------------------------------------------------------------------------------
Top 1 Users 2024-01-05T00:00:00 - 2024-01-17T19:59:59 (1108800 secs)
Usage reported in TRES Minutes
--------------------------------------------------------------------------------
Login Used TRES Name
------------ -------- ----------------
redacted 184311 gres/gpu
redacted 1558558 cpu
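One more check I ran (sketched here; "redacted" stands for the real login name): the scheduler's in-memory association records can be inspected directly, and as far as I understand, the GrpTRESMins line there reports the limit with the accrued usage in parentheses.

```shell
# Dump the controller's live association record for one user;
# if the usage tracked against GrpTRESMins stays at zero, the
# limit is not being charged to this association at all.
scontrol show assoc_mgr users=redacted flags=assoc
```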
Could someone explain where the problem might be? Am I missing
something? Apparently yes :)
Kind regards
--
Kamil Wilczek [https://keys.openpgp.org/]
[D415917E84B8DA5A60E853B6E676ED061316B69B]