So this will require a restart of the daemon. Since the machine uses cpusets for every job, this is a non-trivial operation, so it will need to wait until our scheduled downtime next month. Sigh....

On 02/18/2014 11:08 AM, Bill Wichser wrote:

AccountingStorageEnforce=limits,qos has already been set, yes.  I had
done a scontrol reconfigure when adding it, but not a restart.



On 02/18/2014 10:56 AM, Pancorbo, Juan wrote:
Have you set up the limits in slurm.conf?
AccountingStorageEnforce=limits,qos
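
A quick way to confirm that the setting is actually present in the file is sketched below. It is a minimal sketch, assuming slurm.conf uses plain Key=Value lines with # comments; the function name enforce_flags and the sample excerpt are hypothetical. It only inspects the text on disk and says nothing about whether the running daemon has picked the value up.

```python
def enforce_flags(conf_text):
    """Return the set of AccountingStorageEnforce values found in slurm.conf text."""
    flags = set()
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        key, sep, value = line.partition("=")
        if sep and key.strip().lower() == "accountingstorageenforce":
            flags = {v.strip().lower() for v in value.split(",") if v.strip()}
    return flags

# Hypothetical excerpt, not the poster's real file.
sample_conf = """
ClusterName=example
AccountingStorageEnforce=limits,qos   # enforce association and QOS limits
"""

flags = enforce_flags(sample_conf)
print({"limits", "qos"} <= flags)
```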


Regards.

Juan Pancorbo Armada
[email protected]
http://www.lrz.de


Leibniz-Rechenzentrum
Department: High Performance Systems
Boltzmannstrasse 1, 85748 Garching
Phone:    +49 (0) 89 35831-8735
Fax:      +49 (0) 89 35831-8535

-----Original Message-----
From: Bill Wichser [mailto:[email protected]]
Sent: Tuesday, 18 February 2014 16:52
To: slurm-dev
Subject: [slurm-dev] Re: QOS does not appear to be working for me


Thanks Lyn.  I've added both DenyOnLimit and EnforceUsageThreshold
to qos=long using modify.  When I submit a job again, there is
no difference.  It seems as though the limits don't apply to me.

Do the options require some restart of the slurmd?  I wouldn't think
so and in the past this has been detrimental given the use of cpusets
on this SMP machine.

Bill

On 02/18/2014 10:29 AM, Lyn Gerner wrote:
Hi Bill,

Check out Flags=DenyOnLimit in the sacctmgr man page.

Best,
Lyn


On Tue, Feb 18, 2014 at 4:50 AM, Bill Wichser <[email protected]> wrote:


     We have activated a few settings in the database for QOS.  For
     brevity, let's just look at one.

     sacctmgr add qos long priority=10 GrpCpus=516 MaxCpusPerUser=258

     My expectation was that only 516 cores could be used under this
     QOS and that the largest job a user could submit would require 258
     cores.  (This is an SMP machine with 1500+ cores.)

     The QOS is assigned in the job_submit.lua script.  But when testing
     with an explicit #SBATCH --qos=long directive, nothing changes.

     I submit a job requiring 522 cores, it accepts it and leaves it
     pending on resources:

     # scontrol show job 163
     JobId=163 Name=hello.slurm
         UserId=bill(14119) GroupId=cses(20121)
         Priority=1680 Account=all QOS=long
         JobState=PENDING Reason=Resources Dependency=(null)
         Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
         RunTime=00:00:00 TimeLimit=1-17:40:00 TimeMin=N/A
         SubmitTime=2014-02-18T09:33:27 EligibleTime=2014-02-18T09:33:27
         StartTime=2014-02-19T09:42:19 EndTime=Unknown
         PreemptTime=None SuspendTime=None SecsPreSuspend=0
         Partition=normal AllocNode:Sid=hecate:1104586
         ReqNodeList=(null) ExcNodeList=(null)
         NodeList=(null)
         NumNodes=1 NumCPUs=522 CPUs/Task=1 ReqS:C:T=*:*:*
         MinCPUsNode=1 MinMemoryCPU=5000M MinTmpDiskNode=0
         Features=(null) Gres=(null) Reservation=(null)
         Shared=OK Contiguous=0 Licenses=(null) Network=(null)
         Command=/home/bill/mpi/hello.slurm
         WorkDir=/home/bill/mpi
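
The fields that matter in the output above can be pulled out programmatically. This is a minimal sketch, assuming every field is a whitespace-separated Key=Value token (it would mis-split values that contain spaces); the helper name parse_scontrol_job is hypothetical, and the sample is an excerpt of the output shown above.

```python
def parse_scontrol_job(text):
    """Split `scontrol show job` output into a {key: value} dict."""
    fields = {}
    for token in text.split():
        key, sep, value = token.partition("=")  # split on the first '=' only
        if sep:
            fields[key] = value
    return fields

# Excerpt of the job record quoted in this message.
sample = """
JobId=163 Name=hello.slurm
   Priority=1680 Account=all QOS=long
   JobState=PENDING Reason=Resources Dependency=(null)
   NumNodes=1 NumCPUs=522 CPUs/Task=1 ReqS:C:T=*:*:*
"""

job = parse_scontrol_job(sample)
MAX_CPUS_PER_USER = 258  # from the sacctmgr command in this message

# True means the job asks for more cores than the per-user QOS cap.
over_limit = int(job["NumCPUs"]) > MAX_CPUS_PER_USER
print(job["QOS"], job["NumCPUs"], job["Reason"], over_limit)
```

Here the job requests 522 cores against a cap of 258, yet the Reason field still reads Resources rather than anything related to the QOS.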


     I would have expected the job either to be rejected or to show a
     Reason other than Resources.

     Also, the total number of cores in use under qos=long (by others)
     exceeds the GrpCpus=516 limit.

     Obviously I have missed something here.

     My goals would be 1) to reject outright jobs exceeding the QOS
     limit of MaxCpusPerUser (maybe I also need a MaxCpusPerJob?) and
     2) to hold jobs which would exceed the GrpCpus limit.
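
The two goals can be written down as a small decision rule. This is only a toy model of the behaviour hoped for in this message, not Slurm's actual scheduling logic; the function name expected_outcome and its parameters are hypothetical, and the two constants come from the sacctmgr command quoted earlier.

```python
GRP_CPUS = 516           # QOS-wide cap from the sacctmgr command
MAX_CPUS_PER_USER = 258  # per-user cap from the same command

def expected_outcome(job_cpus, qos_cpus_in_use):
    """Return the outcome hoped for in this message (not Slurm's own logic)."""
    if job_cpus > MAX_CPUS_PER_USER:
        return "reject"  # goal 1: can never fit under the per-user cap
    if qos_cpus_in_use + job_cpus > GRP_CPUS:
        return "hold"    # goal 2: wait until the QOS has room again
    return "run"

print(expected_outcome(522, 0))    # the 522-core test job
print(expected_outcome(200, 400))  # fits per user, but the QOS is nearly full
print(expected_outcome(200, 100))  # fits under both caps
```

Under this model the 522-core test job would be refused at submission, since 522 exceeds 258 regardless of what else is running.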

     Thanks,
     Bill

