Have you set up the limits in slurm.conf?
AccountingStorageEnforce=limits,qos
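
In case it helps, here is a minimal sketch for checking which value is actually set in the config file. The helper name enforce_value and the fallback path /etc/slurm/slurm.conf are assumptions for illustration; the real location depends on your installation (Slurm honours the SLURM_CONF environment variable).

```shell
#!/bin/sh
# Sketch only: print the AccountingStorageEnforce value found in a slurm.conf.
# enforce_value is a hypothetical helper, not part of Slurm.
enforce_value() {
    # Skip commented-out lines, keep the last active setting,
    # and strip surrounding whitespace from the value.
    grep -i '^[[:space:]]*AccountingStorageEnforce[[:space:]]*=' "$1" \
        | tail -n 1 | cut -d= -f2 | tr -d '[:space:]'
}

# Assumed default path; override with SLURM_CONF if yours differs.
conf="${SLURM_CONF:-/etc/slurm/slurm.conf}"
if [ -r "$conf" ]; then
    echo "AccountingStorageEnforce in $conf: $(enforce_value "$conf")"
fi
```

The file only shows what is configured on disk. To see what the running controller is using, "scontrol show config | grep AccountingStorageEnforce" should report the active value.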


Regards.

Juan Pancorbo Armada
[email protected]
http://www.lrz.de


Leibniz-Rechenzentrum
Department: High Performance Systems
Boltzmannstrasse 1, 85748 Garching
Phone:    +49 (0) 89 35831-8735
Fax:      +49 (0) 89 35831-8535

-----Original Message-----
From: Bill Wichser [mailto:[email protected]]
Sent: Tuesday, 18 February 2014 16:52
To: slurm-dev
Subject: [slurm-dev] Re: QOS does not appear to be working for me


Thanks Lyn.  I've added both DenyOnLimit and EnforceUsageThreshold by
modifying qos=long.  When I submit a job again, nothing changes.  It
seems as though the limits don't apply to me.

Do the options require some restart of the slurmd?  I wouldn't think so and in 
the past this has been detrimental given the use of cpusets on this SMP machine.

Bill

On 02/18/2014 10:29 AM, Lyn Gerner wrote:
> Hi Bill,
>
> Check out Flags=DenyOnLimit in the sacctmgr man page.
>
> Best,
> Lyn
>
>
> On Tue, Feb 18, 2014 at 4:50 AM, Bill Wichser <[email protected]> wrote:
>
>
>     We have activated a few settings in the database for QOS.  For
>     brevity, let's just look at one.
>
>     sacctmgr add qos long priority=10 GrpCpus=516 MaxCpusPerUser=258
>
>     My expectation was that at most 516 cores could be used under this
>     QOS, and that the largest job a user could submit would require
>     258 cores.  (This is an SMP machine with 1500+ cores.)
>
>     The QOS is assigned in the job_submit.lua script.  But when testing
>     with an explicit #SBATCH --qos=long directive, nothing changes.
>
>     I submit a job requiring 522 cores, it accepts it and leaves it
>     pending on resources:
>
>     # scontrol show job 163
>     JobId=163 Name=hello.slurm
>         UserId=bill(14119) GroupId=cses(20121)
>         Priority=1680 Account=all QOS=long
>         JobState=PENDING Reason=Resources Dependency=(null)
>         Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
>         RunTime=00:00:00 TimeLimit=1-17:40:00 TimeMin=N/A
>         SubmitTime=2014-02-18T09:33:27 EligibleTime=2014-02-18T09:33:27
>         StartTime=2014-02-19T09:42:19 EndTime=Unknown
>         PreemptTime=None SuspendTime=None SecsPreSuspend=0
>         Partition=normal AllocNode:Sid=hecate:1104586
>         ReqNodeList=(null) ExcNodeList=(null)
>         NodeList=(null)
>         NumNodes=1 NumCPUs=522 CPUs/Task=1 ReqS:C:T=*:*:*
>         MinCPUsNode=1 MinMemoryCPU=5000M MinTmpDiskNode=0
>         Features=(null) Gres=(null) Reservation=(null)
>         Shared=OK Contiguous=0 Licenses=(null) Network=(null)
>         Command=/home/bill/mpi/hello.slurm
>         WorkDir=/home/bill/mpi
>
>
>     I would have expected the job either to be rejected or to show a
>     Reason other than Resources.
>
>     Also, the total number of cores in use under qos=long (by others)
>     already exceeds this GrpCpus=516 limit.
>
>     Obviously I have missed something here.
>
>     My goals would be 1) to reject outright jobs exceeding the QOS
>     limit of MaxCpusPerUser (maybe I also need a MaxCpusPerJob?) and
>     2) to hold jobs which would exceed the GrpCpus limit.
>
>     Thanks,
>     Bill
>
>
