[slurm-dev] MaxJobs on association not being respected

Will Dennis Fri, 10 Mar 2017 10:29:02 -0800

Hi all,


Generally new to Slurm here, so please forgive any ignorance...


We have a test cluster (three compute nodes) running Slurm 16.05.4, with the 
‘multifactor’ priority plugin in use. We have set up slurmdbd and created 
associations for the users on the cluster’s partitions, as follows:


[root@ml43 ~]# sacctmgr show associations

   Cluster    Account       User  Partition     Share GrpJobs       GrpTRES 
GrpSubmit     GrpWall   GrpTRESMins MaxJobs       MaxTRES MaxTRESPerNode 
MaxSubmit     MaxWall   MaxTRESMins                  QOS   Def QOS GrpTRESRunMin

---------- ---------- ---------- ---------- --------- ------- ------------- 
--------- ----------- ------------- ------- ------------- -------------- 
--------- ----------- ------------- -------------------- --------- -------------

ml-cluster       root                               1                           
                                                                                
                                       normal

ml-cluster       root       root                    1                           
                                                                                
                                       normal

ml-cluster         ml                               1                           
                                                                                
                                       normal

ml-cluster         ml       alex  scavenger         1                           
                                                                                
                                       normal

ml-cluster         ml       alex      batch         1                           
                                                                                
                                       normal

ml-cluster         ml       alex       long         1                           
                                      1                                         
                                       normal

ml-cluster         ml       iain  scavenger         1                           
                                                                                
                                       normal

ml-cluster         ml       iain      batch         1                           
                                                                                
                                       normal

ml-cluster         ml       iain       long         1                           
                                                                                
                                       normal


As you may notice, we have set up a “MaxJobs” limit of “1” for the ‘alex’ user 
on the ‘long’ partition. What we want to do is enforce a maximum of one job 
running at a time per user for the ‘long’ partition. However, when the user 
‘alex’ submitted a number of jobs to this partition, three of them ran at once:

[root@ml43 ~]# squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               324      long   tmp.sh     alex PD       0:00      1 (Resources)
               321      long   tmp.sh     alex  R       1:56      1 ml46
               323      long   tmp.sh     alex  R       0:33      1 ml53
               322      long   tmp.sh     alex  R       0:36      1 ml48
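
To make the violation explicit, running jobs can be tallied per user and 
partition from squeue-style output. This is only a minimal sketch: it assumes 
the default column order shown above (USER in column 4, ST in column 5), and 
the function name count_running is hypothetical, not part of Slurm.

```shell
# Count running jobs (ST == "R") per user and partition from squeue-style
# text on stdin. Assumes the default column order:
#   JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
# Header lines are skipped naturally because their 5th field is not "R".
count_running() {
    awk '$5 == "R" { n[$4 " " $2]++ }
         END { for (k in n) print k, n[k] }'
}
```

Used as `squeue | count_running`, it prints one line per user/partition pair 
with the number of running jobs; any count above the association's MaxJobs 
value means the limit is not being applied.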

From the output of “sshare” we verified that the usage was charged to the right 
association (‘alex’ on the ‘long’ partition):

[root@ml43 ~]# sshare -am
             Account       User    Partition  RawShares  NormShares    RawUsage  EffectvUsage  FairShare
-------------------- ---------- ------------ ---------- ----------- ----------- ------------- ----------
root                                                       1.000000        7977      1.000000   0.500000
 root                      root                       1    0.500000           0      0.000000   1.000000
 ml                                                   1    0.500000        7977      1.000000   0.250000
  ml                       alex    scavenger          1    0.083333           0      0.166667   0.250000
  ml                       alex        batch          1    0.083333           0      0.166667   0.250000
  ml                       alex         long          1    0.083333        7977      1.000000   0.000244
  ml                       iain    scavenger          1    0.083333           0      0.166667   0.250000
  ml                       iain        batch          1    0.083333           0      0.166667   0.250000
  ml                       iain         long          1    0.083333           0      0.166667   0.250000

Why doesn’t the “MaxJobs” limit prevent this user from running more than one 
job at a time?

Thanks,
Will
