So this will require a restart of the daemon. Since the machine uses
cpusets for every job, that is a non-trivial operation, so it will need to
wait until our scheduled downtime next month. Sigh....
On 02/18/2014 11:08 AM, Bill Wichser wrote:
AccountingStorageEnforce=limits,qos has already been set, yes. I did
an scontrol reconfigure when adding it, but not a restart.
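A quick way to double-check what the configuration file itself says is sketched below. The helper names enforce_value and enforces_qos_limits and the throwaway file are assumptions made for illustration. The check only reads a file, so it says nothing about what the running controller has loaded; for that, `scontrol show config` prints the live AccountingStorageEnforce value.

```shell
#!/bin/sh
# Hypothetical helper: print the last AccountingStorageEnforce value
# found in a slurm.conf-style file (commented-out lines are ignored).
enforce_value() {
    grep -i '^[[:space:]]*AccountingStorageEnforce=' "$1" \
        | tail -n 1 \
        | cut -d= -f2- \
        | sed 's/[[:space:]]*#.*$//; s/[[:space:]]*$//'
}

# Hypothetical helper: succeed only if the value lists both "limits" and "qos".
enforces_qos_limits() {
    value=$(enforce_value "$1" | tr 'A-Z' 'a-z')
    case ",$value," in *,limits,*) ;; *) return 1 ;; esac
    case ",$value," in *,qos,*) ;; *) return 1 ;; esac
    return 0
}

# Example against a throwaway file; the real slurm.conf path is site-specific.
conf=$(mktemp)
printf 'ClusterName=example\nAccountingStorageEnforce=limits,qos\n' > "$conf"
if enforces_qos_limits "$conf"; then
    echo "limits and qos enforcement configured"
fi
rm -f "$conf"
```

If the file and the live value disagree, that would be consistent with the controller not having picked up the change yet.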
On 02/18/2014 10:56 AM, Pancorbo, Juan wrote:
Have you set up the limits in slurm.conf?
AccountingStorageEnforce=limits,qos
Regards.
Juan Pancorbo Armada
[email protected]
http://www.lrz.de
Leibniz-Rechenzentrum
Department: High Performance Systems
Boltzmannstrasse 1, 85748 Garching
Telephone: +49 (0) 89 35831-8735
Fax: +49 (0) 89 35831-8535
-----Original Message-----
From: Bill Wichser [mailto:[email protected]]
Sent: Tuesday, 18 February 2014 16:52
To: slurm-dev
Subject: [slurm-dev] Re: QOS does not appear to be working for me
Thanks Lyn. I've added both DenyOnLimit and EnforceUsageThreshold
by modifying qos=long. When I submit the job again, there is
no difference. It seems as though the limits don't apply to me.
Do the options require a restart of slurmd? I wouldn't think
so, and in the past a restart has been detrimental given the use of
cpusets on this SMP machine.
Bill
On 02/18/2014 10:29 AM, Lyn Gerner wrote:
Hi Bill,
Check out Flags=DenyOnLimit in the sacctmgr man page.
Best,
Lyn
On Tue, Feb 18, 2014 at 4:50 AM, Bill Wichser <[email protected]> wrote:
We have activated a few settings in the database for QOS. For
brevity, let's just look at one.
sacctmgr add qos long priority=10 GrpCpus=516 MaxCpusPerUser=258
My expectation was that only 516 cores could be used under this QOS
and that no user could submit a job requiring more than 258 cores.
(This is an SMP machine with 1500+ cores.)
The QOS is assigned in the job_submit.lua script. But when testing
with an explicit #SBATCH --qos=long directive, nothing changes.
When I submit a job requiring 522 cores, Slurm accepts it and leaves it
pending on resources:
# scontrol show job 163
JobId=163 Name=hello.slurm
UserId=bill(14119) GroupId=cses(20121)
Priority=1680 Account=all QOS=long
JobState=PENDING Reason=Resources Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
RunTime=00:00:00 TimeLimit=1-17:40:00 TimeMin=N/A
SubmitTime=2014-02-18T09:33:27
EligibleTime=2014-02-18T09:33:27
StartTime=2014-02-19T09:42:19 EndTime=Unknown
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=normal AllocNode:Sid=hecate:1104586
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=1 NumCPUs=522 CPUs/Task=1 ReqS:C:T=*:*:*
MinCPUsNode=1 MinMemoryCPU=5000M MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/bill/mpi/hello.slurm
WorkDir=/home/bill/mpi
I would have expected the job either to be rejected or to show a
Reason other than Resources.
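To make the mismatch concrete, here is a small sketch that pulls the relevant fields out of a job record and compares them with the two QOS limits. The helper name job_field is an assumption for illustration, the record is a shortened excerpt of the output above, and the limits are the values from the sacctmgr command earlier in this message.

```shell
#!/bin/sh
# Hypothetical helper: extract one Key=Value field from "scontrol show job"
# style text read on stdin. Field names containing "/" are not handled.
job_field() {
    tr ' ' '\n' | sed -n "s/^$1=//p" | head -n 1
}

# Shortened excerpt of the job record shown above.
record='JobId=163 Name=hello.slurm
Priority=1680 Account=all QOS=long
JobState=PENDING Reason=Resources Dependency=(null)
NumNodes=1 NumCPUs=522 CPUs/Task=1 ReqS:C:T=*:*:*'

cpus=$(printf '%s\n' "$record" | job_field NumCPUs)
reason=$(printf '%s\n' "$record" | job_field Reason)
max_per_user=258   # MaxCpusPerUser from the sacctmgr command above
grp_limit=516      # GrpCpus from the same command

if [ "$cpus" -gt "$max_per_user" ]; then
    echo "job asks for $cpus CPUs, above MaxCpusPerUser=$max_per_user (Reason=$reason)"
fi
if [ "$cpus" -gt "$grp_limit" ]; then
    echo "job asks for $cpus CPUs, above GrpCpus=$grp_limit"
fi
```

Both comparisons fire for job 163, which is why a pending state with Reason=Resources looks wrong here.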
Also, the total number of cores currently in use under qos=long
(by others) exceeds this GrpCpus=516 limit.
Obviously I have missed something here.
My goals would be 1) to reject outright jobs exceeding the QOS
limit of MaxCpusPerUser (maybe I also need a MaxCpusPerJob?) and
2) to hold jobs which would exceed the GrpCpus limit.
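The two goals can be written down as a toy decision function. This is only a model of the policy being asked for, not Slurm's scheduler logic; the function name qos_decision and its argument order are assumptions for illustration.

```shell
#!/bin/sh
# Toy model of the desired policy, NOT Slurm's implementation.
# $1 = CPUs requested by the job
# $2 = per-job cap (e.g. a MaxCpusPerJob-style limit)
# $3 = group cap (GrpCpus)
# $4 = CPUs already in use under the QOS
qos_decision() {
    if [ "$1" -gt "$2" ] || [ "$1" -gt "$3" ]; then
        echo reject      # could never run under the QOS, even on an idle machine
    elif [ $(( $4 + $1 )) -gt "$3" ]; then
        echo hold        # fits on its own, but not alongside current usage
    else
        echo run
    fi
}

qos_decision 522 258 516 0      # the 522-core job from this message
qos_decision 200 258 516 400    # would push group usage past 516
qos_decision 100 258 516 100    # fits within both limits
```

Under this model the three calls print reject, hold, and run, in that order.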
Thanks,
Bill