We had a pretty substantial number of jobs killed here when the
/etc/init.d/slurm restart was triggered. I'm VERY gun-shy at the moment!
On 02/18/2014 02:45 PM, Danny Auble wrote:
Why would cpusets matter when restarting the slurmd? I wouldn't expect
any issue at all.
Only restarting the slurmctld would be needed for
AccountingStorageEnforce to be updated, by the way. You will get messages
about the slurm.conf files being out of sync when the slurmds register, but
that would be expected. You could probably restart them as well. I
would not expect any issue. Perhaps you have seen some in the past?
Danny
On 02/18/14 11:40, Bill Wichser wrote:
So this will require a restart of the daemon. Since the machine uses
cpusets for every job, this is a non-trivial operation, so it will need
to wait until we have scheduled downtime next month. Sigh....
On 02/18/2014 11:08 AM, Bill Wichser wrote:
AccountingStorageEnforce=limits,qos has already been set, yes. I had
done a scontrol reconfigure when adding these, but not a restart.
On 02/18/2014 10:56 AM, Pancorbo, Juan wrote:
Have you set up the limits in slurm.conf?
AccountingStorageEnforce=limits,qos
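
A quick way to confirm that both values are present in the file is sketched below. This is only an illustration: the helper name enforce_options and the sample text are hypothetical, and the parsing is deliberately naive (it ignores Include directives and anything else a real slurm.conf may contain).

```python
def enforce_options(conf_text):
    """Return the set of AccountingStorageEnforce values found in conf_text."""
    options = set()
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Compare the key case-insensitively; the last occurrence wins.
        if key.strip().lower() == "accountingstorageenforce":
            options = {v.strip().lower() for v in value.split(",") if v.strip()}
    return options


# Hypothetical excerpt, not the real file from this thread.
sample = """
# excerpt of a hypothetical slurm.conf
ClusterName=example
AccountingStorageEnforce=limits,qos
"""

missing = {"limits", "qos"} - enforce_options(sample)
print("missing:", sorted(missing))
```

Note that this only checks the file on disk; it says nothing about whether the running slurmctld has picked the setting up.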
Regards.
Juan Pancorbo Armada
[email protected]
http://www.lrz.de
Leibniz-Rechenzentrum
Abteilung: Hochleistungssysteme
Boltzmannstrasse 1, 85748 Garching
Telefon: +49 (0) 89 35831-8735
Fax: +49 (0) 89 35831-8535
-----Original Message-----
From: Bill Wichser [mailto:[email protected]]
Sent: Tuesday, 18 February 2014 16:52
To: slurm-dev
Subject: [slurm-dev] Re: QOS does not appear to be working for me
Thanks Lyn. I've added both DenyOnLimit and EnforceUsageThreshold
using the modify on the qos=long. When I submit a job again, there is
no difference. It seems as though the limits don't apply to me.
Do the options require some restart of the slurmd? I wouldn't think
so and in the past this has been detrimental given the use of cpusets
on this SMP machine.
Bill
On 02/18/2014 10:29 AM, Lyn Gerner wrote:
Hi Bill,
Check out Flags=DenyOnLimit in the sacctmgr man page.
Best,
Lyn
On Tue, Feb 18, 2014 at 4:50 AM, Bill Wichser <[email protected]> wrote:
We have activated a few settings in the database for QOS. For
brevity, let's just look at one.
sacctmgr add qos long priority=10 GrpCpus=516 MaxCpusPerUser=258
My expectation was that only 516 cores could be used under this QOS
and that users would only be able to submit a job requiring at most
258 cores. (This is an SMP machine with 1500+ cores.)
The QOS is assigned in the job_submit.lua script. But when testing
with an explicit #SBATCH --qos=long directive, nothing changes.
When I submit a job requiring 522 cores, it is accepted and left
pending on resources:
# scontrol show job 163
JobId=163 Name=hello.slurm
UserId=bill(14119) GroupId=cses(20121)
Priority=1680 Account=all QOS=long
JobState=PENDING Reason=Resources Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
RunTime=00:00:00 TimeLimit=1-17:40:00 TimeMin=N/A
SubmitTime=2014-02-18T09:33:27
EligibleTime=2014-02-18T09:33:27
StartTime=2014-02-19T09:42:19 EndTime=Unknown
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=normal AllocNode:Sid=hecate:1104586
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=1 NumCPUs=522 CPUs/Task=1 ReqS:C:T=*:*:*
MinCPUsNode=1 MinMemoryCPU=5000M MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/bill/mpi/hello.slurm
WorkDir=/home/bill/mpi
I would have expected either that the job was rejected or that it
had a Reason != Resources.
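
For checking the relevant fields without reading the whole dump, the output above can be split into key=value pairs. The following is a rough sketch only: parse_scontrol is a hypothetical helper, the sample is a few lines copied from the output above, and the naive whitespace split would break on values that contain spaces.

```python
def parse_scontrol(text):
    """Split whitespace-separated key=value tokens into a dict."""
    fields = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that contain no '='
            fields[key] = value
    return fields


# A few lines taken from the scontrol show job output quoted above.
sample = """
JobId=163 Name=hello.slurm
Priority=1680 Account=all QOS=long
JobState=PENDING Reason=Resources Dependency=(null)
NumNodes=1 NumCPUs=522 CPUs/Task=1 ReqS:C:T=*:*:*
"""

job = parse_scontrol(sample)
print(job["QOS"], job["JobState"], job["Reason"], job["NumCPUs"])
```

On the real system the text would come from scontrol show job rather than a pasted string.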
Also, the total number of cores being used with qos=long (by others)
exceeds this GrpCpus=516 limit.
Obviously I have missed something here.
My goals would be 1) to reject outright jobs exceeding the QOS limit
of MaxCpusPerUser (maybe I also need a MaxCpusPerJob?) and 2) to hold
jobs which would exceed the GrpCpus limit.
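
To make the intended policy concrete, here is a toy model of the two goals. It is only a sketch of the behaviour being asked for, not how Slurm implements or reports limits: QosLimits, classify, and the labels "reject", "hold", and "eligible" are all hypothetical names, and the in-use figures in the examples are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QosLimits:
    grp_cpus: int           # CPUs all jobs in the QOS may use at once
    max_cpus_per_user: int  # CPUs a single user may use at once


def classify(job_cpus, user_cpus_in_use, qos_cpus_in_use, limits):
    """Return 'reject', 'hold', or 'eligible' for a new job (toy policy)."""
    # Goal 1: a job that could never fit under either cap is rejected outright.
    if job_cpus > limits.max_cpus_per_user or job_cpus > limits.grp_cpus:
        return "reject"
    # Goal 2: a job that fits in principle, but not right now, is held.
    if (user_cpus_in_use + job_cpus > limits.max_cpus_per_user
            or qos_cpus_in_use + job_cpus > limits.grp_cpus):
        return "hold"
    return "eligible"


# Limits taken from the sacctmgr command above.
long_qos = QosLimits(grp_cpus=516, max_cpus_per_user=258)

print(classify(522, 0, 0, long_qos))    # the 522-core test job
print(classify(200, 0, 400, long_qos))  # fits alone, but the QOS is nearly full
print(classify(200, 0, 100, long_qos))  # fits under both caps
```

Under this model the 522-core job falls into the first branch, which is the outcome the first goal describes.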
Thanks,
Bill