Thanks Lyn. I've added both DenyOnLimit and EnforceUsageThreshold using sacctmgr modify on qos=long. When I submit a job again, there is no difference. It seems as though the limits don't apply to me.
Do the options require a restart of slurmd? I wouldn't think so, and in the past a restart has been detrimental given the use of cpusets on this SMP machine.
Bill

On 02/18/2014 10:29 AM, Lyn Gerner wrote:
Hi Bill,

Check out Flags=DenyOnLimit in the sacctmgr man page.

Best,
Lyn

On Tue, Feb 18, 2014 at 4:50 AM, Bill Wichser <[email protected]> wrote:

We have activated a few settings in the database for QOS. For brevity, let's just look at one.

    sacctmgr add qos long priority=10 GrpCpus=516 MaxCpusPerUser=258

My expectation was that only 516 cores could be used under this QOS and that the largest job a user could submit would require 258 cores. (This is an SMP machine with 1500+ cores.)

The QOS is assigned in the job_submit.lua script. But when testing with an explicit

    #SBATCH --qos=long

directive, nothing changes. I submit a job requiring 522 cores; it is accepted and left pending on resources:

# scontrol show job 163
JobId=163 Name=hello.slurm
   UserId=bill(14119) GroupId=cses(20121)
   Priority=1680 Account=all QOS=long
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=1-17:40:00 TimeMin=N/A
   SubmitTime=2014-02-18T09:33:27 EligibleTime=2014-02-18T09:33:27
   StartTime=2014-02-19T09:42:19 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=normal AllocNode:Sid=hecate:1104586
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=522 CPUs/Task=1 ReqS:C:T=*:*:*
   MinCPUsNode=1 MinMemoryCPU=5000M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/bill/mpi/hello.slurm
   WorkDir=/home/bill/mpi

I would have expected either that the job was rejected or that it had a Reason != Resources. Also, the total number of cores in use under qos=long (by others) exceeds the GrpCpus=516 limit.

Obviously I have missed something here. My goals would be 1) to reject outright jobs exceeding the QOS limit of MaxCpusPerUser (maybe I also need a MaxCpusPerJob?) and 2) to hold jobs which would exceed the GrpCpus limit.

Thanks,
Bill
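To make the two goals above concrete, here is a minimal Python sketch of the behaviour the original post hopes for. It is only a toy model of the stated expectations, not Slurm's actual enforcement logic. The names QosLimits and expected_decision are hypothetical and exist only for this illustration. The rule that a job is held when the user's running total would pass MaxCpusPerUser is an assumption; the post itself only asks for rejection of oversized jobs and holding of jobs that would exceed GrpCpus.

```python
from dataclasses import dataclass


@dataclass
class QosLimits:
    """Hypothetical container for the two limits set on qos=long."""
    grp_cpus: int           # GrpCpus: cores usable by all jobs in the QOS combined
    max_cpus_per_user: int  # MaxCpusPerUser: cores one user may hold at once


def expected_decision(req_cpus, user_cpus_in_use, qos_cpus_in_use, limits):
    """Return 'reject', 'hold', or 'run' according to the goals in the post."""
    # Goal 1: a request larger than the per-user cap can never run, so reject it.
    if req_cpus > limits.max_cpus_per_user:
        return "reject"
    # Assumption: a job that fits the cap but not the user's current usage waits.
    if user_cpus_in_use + req_cpus > limits.max_cpus_per_user:
        return "hold"
    # Goal 2: hold a job that would push the QOS past its group-wide cap.
    if qos_cpus_in_use + req_cpus > limits.grp_cpus:
        return "hold"
    return "run"


# Limits from: sacctmgr add qos long priority=10 GrpCpus=516 MaxCpusPerUser=258
long_qos = QosLimits(grp_cpus=516, max_cpus_per_user=258)

print(expected_decision(522, 0, 0, long_qos))    # the 522-core job from the post
print(expected_decision(200, 0, 400, long_qos))  # QOS already using 400 cores
print(expected_decision(200, 0, 100, long_qos))  # plenty of room left
```

Under this model the 522-core job is rejected at submit time, rather than accepted with Reason=Resources as in the scontrol output above.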
