This is on slurm 2.2.3. I've attached our slurm.conf below.
There is a possible bug that prevents the owner of a job from running scontrol
hold (or scontrol release) when the job's qos has a GrpCPUs limit. (It
might be the same for other types of limits; I haven't checked.)
An example:
*** First without limits:
teflon 478(2)# sacctmgr show qos staff
      Name   Priority    Preempt PreemptMode                                    Flags UsageThres  GrpCPUs  GrpCPUMins GrpJobs GrpNodes GrpSubmit     GrpWall  MaxCPUs  MaxCPUMins MaxJobs MaxNodes MaxSubmit     MaxWall
---------- ---------- ---------- ----------- ---------------------------------------- ---------- -------- ----------- ------- -------- --------- ----------- -------- ----------- ------- -------- --------- -----------
     staff      10000     lowpri     cluster                                            0.000000
login-0-0 635(1)$ bjob -j 1058
JOBID    NAME USER ACCOUNT PARTITI   QOS ST     PRIORITY(PRIOR) TIME TIME_LEFT CPU NOD MIN_MEM MIN_TMP NODELIST(REASON)
 1058 1019.12  bhm   staff  normal staff PD 0.0000002414( 1037) 0:00      5:00   1   1     400       0 (Priority)
("bjob" is just a local wrapper around squeue)
login-0-0 636(1)$ whoami
bhm
login-0-0 637(1)$ scontrol hold 1058
login-0-0 638(1)$ bjob -j 1058
JOBID    NAME USER ACCOUNT PARTITI   QOS ST     PRIORITY(PRIOR) TIME TIME_LEFT CPU NOD MIN_MEM MIN_TMP NODELIST(REASON)
 1058 1019.12  bhm   staff  normal staff PD 0.0000000000(    0) 0:00      5:00   1   1     400       0 (JobHeldUser)
login-0-0 639(1)$ scontrol release 1058
login-0-0 640(1)$ bjob -j 1058
JOBID    NAME USER ACCOUNT PARTITI   QOS ST     PRIORITY(PRIOR) TIME TIME_LEFT CPU NOD MIN_MEM MIN_TMP NODELIST(REASON)
 1058 1019.12  bhm   staff  normal staff PD 0.0000002419( 1039) 0:00      5:00   1   1     400       0 (Priority)
*** Everything ok.
*** Then turn limits on:
teflon 479(2)# sacctmgr modify qos staff set grpcpus=1000
Modified qos...
staff
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y
teflon 480(2)# sacctmgr show qos staff
      Name   Priority    Preempt PreemptMode                                    Flags UsageThres  GrpCPUs  GrpCPUMins GrpJobs GrpNodes GrpSubmit     GrpWall  MaxCPUs  MaxCPUMins MaxJobs MaxNodes MaxSubmit     MaxWall
---------- ---------- ---------- ----------- ---------------------------------------- ---------- -------- ----------- ------- -------- --------- ----------- -------- ----------- ------- -------- --------- -----------
     staff      10000     lowpri     cluster                                            0.000000     1000
login-0-0 641(1)$ bjob -j 1058
JOBID    NAME USER ACCOUNT PARTITI   QOS ST     PRIORITY(PRIOR) TIME TIME_LEFT CPU NOD MIN_MEM MIN_TMP NODELIST(REASON)
 1058 1019.12  bhm   staff  normal staff PD 0.0000002419( 1039) 0:00      5:00   1   1     400       0 (Priority)
login-0-0 642(1)$ scontrol hold 1058
slurm_update_job error: Job violates accounting policy (job submit limit, user's size and/or time limits)
The slurmctld.log says this:
[2011-03-15T15:56:14] job submit for user bhm(4294967294): min cpu request 4294967294 exceeds group max cpu limit 1000 for qos 'staff'
[2011-03-15T15:56:14] update_job: exceeded association's cpu, node or time limit for user 4294967294
[2011-03-15T15:56:14] error: _slurm_rpc_update_job JobId=1058 uid=10231: Job violates accounting policy (job submit limit, user's size and/or time limits)
The uid of 4294967294 (2^32 - 2, i.e. -2 cast to an unsigned 32-bit integer,
which also happens to be Slurm's NO_VAL sentinel) instead of 10231, and the cpu
request of 4294967294, perhaps indicate uninitialized variables or mixed
integer types?
--
Bjørn-Helge Mevik, dr. scient,
Research Computing Services, University of Oslo