This is on slurm 2.2.3.  I've attached our slurm.conf below.

There is a possible bug that prevents the owner of a job from running scontrol
hold (or scontrol release) when the job's qos has a GrpCPUs limit.  (It
might be the same for other types of limits; I haven't checked.)

An example:

*** First without limits:

teflon 478(2)# sacctmgr show qos staff
      Name   Priority    Preempt PreemptMode                                    
Flags UsageThres  GrpCPUs  GrpCPUMins GrpJobs GrpNodes GrpSubmit     GrpWall  
MaxCPUs  MaxCPUMins MaxJobs MaxNodes MaxSubmit     MaxWall 
---------- ---------- ---------- ----------- 
---------------------------------------- ---------- -------- ----------- 
------- -------- --------- ----------- -------- ----------- ------- -------- 
--------- ----------- 
     staff      10000     lowpri     cluster                                    
        0.000000                                                                
                                                         

login-0-0 635(1)$ bjob -j 1058
JOBID NAME       USER   ACCOUNT PARTITI QOS    ST     PRIORITY(PRIOR)  TIME 
TIME_LEFT CPU NOD MIN_MEM MIN_TMP NODELIST(REASON)
 1058 1019.12    bhm    staff   normal  staff  PD 0.0000002414( 1037)  0:00     
 5:00   1   1     400       0 (Priority)
(the "bjob" is just a wrapper around squeue)
login-0-0 636(1)$ whoami
bhm
login-0-0 637(1)$ scontrol hold 1058
login-0-0 638(1)$ bjob -j 1058
JOBID NAME       USER   ACCOUNT PARTITI QOS    ST     PRIORITY(PRIOR)  TIME 
TIME_LEFT CPU NOD MIN_MEM MIN_TMP NODELIST(REASON)
 1058 1019.12    bhm    staff   normal  staff  PD 0.0000000000(    0)  0:00     
 5:00   1   1     400       0 (JobHeldUser)
login-0-0 639(1)$ scontrol release 1058
login-0-0 640(1)$ bjob -j 1058
JOBID NAME       USER   ACCOUNT PARTITI QOS    ST     PRIORITY(PRIOR)  TIME 
TIME_LEFT CPU NOD MIN_MEM MIN_TMP NODELIST(REASON)
 1058 1019.12    bhm    staff   normal  staff  PD 0.0000002419( 1039)  0:00     
 5:00   1   1     400       0 (Priority)

*** Everything ok.

*** Then turn limits on:

teflon 479(2)# sacctmgr modify qos staff set grpcpus=1000
 Modified qos...
  staff
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y
teflon 480(2)# sacctmgr show qos staff
      Name   Priority    Preempt PreemptMode                                    
Flags UsageThres  GrpCPUs  GrpCPUMins GrpJobs GrpNodes GrpSubmit     GrpWall  
MaxCPUs  MaxCPUMins MaxJobs MaxNodes MaxSubmit     MaxWall 
---------- ---------- ---------- ----------- 
---------------------------------------- ---------- -------- ----------- 
------- -------- --------- ----------- -------- ----------- ------- -------- 
--------- ----------- 
     staff      10000     lowpri     cluster                                    
        0.000000     1000                                                       
                      

login-0-0 641(1)$ bjob -j 1058
JOBID NAME       USER   ACCOUNT PARTITI QOS    ST     PRIORITY(PRIOR)  TIME 
TIME_LEFT CPU NOD MIN_MEM MIN_TMP NODELIST(REASON)
 1058 1019.12    bhm    staff   normal  staff  PD 0.0000002419( 1039)  0:00     
 5:00   1   1     400       0 (Priority)
login-0-0 642(1)$ scontrol hold 1058
slurm_update_job error: Job violates accounting policy (job submit limit, 
user's size and/or time limits)

The slurmctld.log says this:

[2011-03-15T15:56:14] job submit for user bhm(4294967294): min cpu request 
4294967294 exceeds group max cpu limit 1000 for qos 'staff'
[2011-03-15T15:56:14] update_job: exceeded association's cpu, node or time 
limit for user 4294967294
[2011-03-15T15:56:14] error: _slurm_rpc_update_job JobId=1058 uid=10231: Job 
violates accounting policy (job submit limit, user's size and/
or time limits)

The uid of 4294967294 (2^32 - 2, i.e. 0xfffffffe) instead of 10231, and the
cpu request of 4294967294, perhaps indicate uninitialized variables or mixed
integer types?


-- 
Bjørn-Helge Mevik, dr. scient,
Research Computing Services, University of Oslo
