Thanks, Lyn. I've added both the DenyOnLimit and EnforceUsageThreshold flags to qos=long using sacctmgr modify. When I resubmit the job, nothing changes. The limits still don't seem to apply to me.

Do these options require restarting slurmd? I wouldn't think so, and restarts have been detrimental in the past given the use of cpusets on this SMP machine.

Bill

On 02/18/2014 10:29 AM, Lyn Gerner wrote:
Hi Bill,

Check out Flags=DenyOnLimit in the sacctmgr man page.

Best,
Lyn


On Tue, Feb 18, 2014 at 4:50 AM, Bill Wichser wrote:


    We have activated a few settings in the database for QOS.  For
    brevity, let's just look at one.

    sacctmgr add qos long priority=10 GrpCpus=516 MaxCpusPerUser=258

    My expectation was that only 516 cores could be in use under this
    QOS at any one time, and that the largest job a user could submit
    would require 258 cores.  (This is an SMP machine with 1500+
    cores.)

    The QOS is assigned in the job_submit.lua script.  But when testing
    with an explicit #SBATCH --qos=long directive, nothing changes.

    When I submit a job requiring 522 cores, Slurm accepts it and
    leaves it pending on resources:

    # scontrol show job 163
    JobId=163 Name=hello.slurm
        UserId=bill(14119) GroupId=cses(20121)
        Priority=1680 Account=all QOS=long
        JobState=PENDING Reason=Resources Dependency=(null)
        Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
        RunTime=00:00:00 TimeLimit=1-17:40:00 TimeMin=N/A
        SubmitTime=2014-02-18T09:33:27 EligibleTime=2014-02-18T09:33:27
        StartTime=2014-02-19T09:42:19 EndTime=Unknown
        PreemptTime=None SuspendTime=None SecsPreSuspend=0
        Partition=normal AllocNode:Sid=hecate:1104586
        ReqNodeList=(null) ExcNodeList=(null)
        NodeList=(null)
        NumNodes=1 NumCPUs=522 CPUs/Task=1 ReqS:C:T=*:*:*
        MinCPUsNode=1 MinMemoryCPU=5000M MinTmpDiskNode=0
        Features=(null) Gres=(null) Reservation=(null)
        Shared=OK Contiguous=0 Licenses=(null) Network=(null)
        Command=/home/bill/mpi/hello.slurm
        WorkDir=/home/bill/mpi


    I would have expected the job either to be rejected or to be held
    with a Reason != Resources.

    Also, the total number of cores in use under qos=long (by other
    users) already exceeds the GrpCpus=516 limit.

    Obviously I have missed something here.

    My goals are 1) to reject outright any job exceeding the QOS limit
    of MaxCpusPerUser (maybe I also need a MaxCpusPerJob?) and 2) to
    hold jobs that would exceed the GrpCpus limit.

    Thanks,
    Bill
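
The behaviour expected in the original message can be written down as a small decision function. This is only a sketch of the poster's stated goals (reject jobs that can never fit under MaxCpusPerUser, hold jobs that would push usage past GrpCpus), not a description of how Slurm itself evaluates QOS limits. The names QosLimits and expected_decision are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QosLimits:
    """Hypothetical container for the two limits set on qos=long."""
    grp_cpus: int           # total cores usable under the QOS (GrpCpus)
    max_cpus_per_user: int  # cores one user may hold at once (MaxCpusPerUser)


def expected_decision(limits, job_cpus, user_cpus_in_use, qos_cpus_in_use):
    """Return "reject", "hold", or "run" according to the stated goals.

    This models the poster's expectation only; it is an assumption,
    not Slurm's actual scheduling logic.
    """
    # A job that could never fit within the limits is rejected outright.
    if job_cpus > limits.max_cpus_per_user or job_cpus > limits.grp_cpus:
        return "reject"
    # A job that fits in principle, but not with current usage, is held.
    if user_cpus_in_use + job_cpus > limits.max_cpus_per_user:
        return "hold"
    if qos_cpus_in_use + job_cpus > limits.grp_cpus:
        return "hold"
    return "run"


# Limits from: sacctmgr add qos long priority=10 GrpCpus=516 MaxCpusPerUser=258
long_qos = QosLimits(grp_cpus=516, max_cpus_per_user=258)

# The 522-core job from the scontrol output above.
print(expected_decision(long_qos, job_cpus=522, user_cpus_in_use=0, qos_cpus_in_use=0))
```

Under this model the 522-core job is rejected at submission, whereas the job shown above was accepted and left pending with Reason=Resources.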

