Have you set up the limits in slurm.conf?

AccountingStorageEnforce=limits,qos
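
If it helps, here is a minimal Python sketch for checking that setting in a copy of slurm.conf. It is only an illustration: the helper names (`enforce_options`, `missing_options`) are made up for this example, it looks only for the literal words `limits` and `qos`, and it assumes that a later AccountingStorageEnforce line replaces an earlier one. It does not model any options that may imply others.

```python
import re

# Hypothetical helpers for illustration; they are not part of Slurm.
ENFORCE_RE = re.compile(r"AccountingStorageEnforce\s*=\s*(\S+)", re.IGNORECASE)


def enforce_options(conf_text):
    """Return the set of AccountingStorageEnforce options found in slurm.conf text."""
    options = set()
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # ignore comments and surrounding spaces
        match = ENFORCE_RE.match(line)
        if match:
            # Assumption of this sketch: the last matching line wins.
            options = {opt.lower() for opt in match.group(1).split(",") if opt}
    return options


def missing_options(conf_text, wanted=("limits", "qos")):
    """List the wanted options that are not present in the config text."""
    found = enforce_options(conf_text)
    return [opt for opt in wanted if opt not in found]
```

Calling `missing_options(open(path).read())` on your own slurm.conf (the path depends on your installation) returns an empty list when both `limits` and `qos` are listed.
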
Regards.

Juan Pancorbo Armada
[email protected]
http://www.lrz.de

Leibniz-Rechenzentrum
Department: High Performance Systems
Boltzmannstrasse 1, 85748 Garching
Phone: +49 (0) 89 35831-8735
Fax: +49 (0) 89 35831-8535

-----Original Message-----
From: Bill Wichser [mailto:[email protected]]
Sent: Tuesday, 18 February 2014 16:52
To: slurm-dev
Subject: [slurm-dev] Re: QOS does not appear to be working for me

Thanks Lyn. I've added both DenyOnLimit and EnforceUsageThreshold using modify on qos=long. When I submit a job again, there is no difference. It seems as though the limits don't apply to me.

Do the options require some restart of the slurmd? I wouldn't think so, and in the past this has been detrimental given the use of cpusets on this SMP machine.

Bill

On 02/18/2014 10:29 AM, Lyn Gerner wrote:
> Hi Bill,
>
> Check out Flags=DenyOnLimit in the sacctmgr man page.
>
> Best,
> Lyn
>
>
> On Tue, Feb 18, 2014 at 4:50 AM, Bill Wichser <[email protected]
> <mailto:[email protected]>> wrote:
>
>
>     We have activated a few settings in the database for QOS. For
>     brevity, let's just look at one.
>
>     sacctmgr add qos long priority=10 GrpCpus=516 MaxCpusPerUser=258
>
>     My expectation was that only 516 cores could be used under this
>     QOS and that the largest job a user could submit would require
>     258 cores. (This is an SMP machine with 1500+ cores.)
>
>     The QOS is assigned in the job_submit.lua script. But when testing
>     with an explicit #SBATCH --qos=long directive, nothing changes.
>
>     I submit a job requiring 522 cores; it accepts it and leaves it
>     pending on resources:
>
>     # scontrol show job 163
>     JobId=163 Name=hello.slurm
>         UserId=bill(14119) GroupId=cses(20121)
>         Priority=1680 Account=all QOS=long
>         JobState=PENDING Reason=Resources Dependency=(null)
>         Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
>         RunTime=00:00:00 TimeLimit=1-17:40:00 TimeMin=N/A
>         SubmitTime=2014-02-18T09:33:27 EligibleTime=2014-02-18T09:33:27
>         StartTime=2014-02-19T09:42:19 EndTime=Unknown
>         PreemptTime=None SuspendTime=None SecsPreSuspend=0
>         Partition=normal AllocNode:Sid=hecate:1104586
>         ReqNodeList=(null) ExcNodeList=(null)
>         NodeList=(null)
>         NumNodes=1 NumCPUs=522 CPUs/Task=1 ReqS:C:T=*:*:*
>         MinCPUsNode=1 MinMemoryCPU=5000M MinTmpDiskNode=0
>         Features=(null) Gres=(null) Reservation=(null)
>         Shared=OK Contiguous=0 Licenses=(null) Network=(null)
>         Command=/home/bill/mpi/hello.slurm
>         WorkDir=/home/bill/mpi
>
>
>     I would have expected either that the job was rejected or that it
>     had a Reason != Resources.
>
>     Also, the total number of cores being used with qos=long (by
>     others) exceeds this GrpCpus=516 limit.
>
>     Obviously I have missed something here.
>
>     My goals would be 1) to reject outright jobs exceeding the QOS
>     limit of MaxCpusPerUser (maybe I also need a MaxCpusPerJob?) and
>     2) to hold jobs which would exceed this GrpCpus limit.
>
>     Thanks,
>     Bill
