I am guessing there was something else that changed in the slurm.conf that made that happen. Knowing the version of Slurm and the log files from the slurmctld and at least one of the slurmd's that cancelled a job would be helpful to determine what happened. Under normal circumstances you should be able to restart as needed with no problems.

On 02/18/14 12:27, Bill Wichser wrote:

We had a pretty substantial killing happen here when the /etc/init.d/slurm restart was triggered. I'm VERY gun-shy at the moment!

On 02/18/2014 02:45 PM, Danny Auble wrote:

Why would cpusets matter when restarting the slurmd?  I wouldn't expect
any issue at all.

Only restarting the slurmctld would be needed for
AccountingStorageEnforce to be updated, btw.  You will get messages about
the slurm.conf files being out of sync when the slurmd's register, but
that would be expected.  You could probably restart them as well.  I
would not expect any issue.  Perhaps you have seen some in the past?

Danny

On 02/18/14 11:40, Bill Wichser wrote:

So this will require a reboot of the daemon.  Since the machine uses
cpusets for every job, this is a non-trivial operation so will need to
wait until we have scheduled downtime next month. Sigh....

On 02/18/2014 11:08 AM, Bill Wichser wrote:

AccountingStorageEnforce=limits,qos have already been set, yes. I had
done a scontrol reconfigure when adding these but not a restart.



On 02/18/2014 10:56 AM, Pancorbo, Juan wrote:
Have you set up the limits in slurm.conf?
AccountingStorageEnforce=limits,qos
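
A quick way to check what the file actually says, as a minimal sketch: the helper name enforce_value and the fallback path /etc/slurm/slurm.conf are assumptions for illustration (the thread does not say where slurm.conf lives), and the match is case-sensitive for simplicity.

```shell
#!/bin/sh
# enforce_value FILE  (hypothetical helper, not part of Slurm)
# Print the value of the last uncommented AccountingStorageEnforce line in FILE.
enforce_value() {
    sed -n 's/^[[:space:]]*AccountingStorageEnforce[[:space:]]*=[[:space:]]*\([^#[:space:]]*\).*/\1/p' "$1" | tail -n 1
}

# Assumed location; SLURM_CONF overrides it if set.
CONF="${SLURM_CONF:-/etc/slurm/slurm.conf}"
if [ -r "$CONF" ]; then
    enforce_value "$CONF"
fi
```

The value the running controller has loaded can be compared against this with `scontrol show config | grep AccountingStorageEnforce`; if the two differ, the slurmctld has probably not picked up the change yet.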


Regards.

Juan Pancorbo Armada
[email protected]
http://www.lrz.de


Leibniz-Rechenzentrum
Department: High Performance Systems
Boltzmannstrasse 1, 85748 Garching
Phone:    +49 (0) 89 35831-8735
Fax:      +49 (0) 89 35831-8535

-----Original Message-----
From: Bill Wichser [mailto:[email protected]]
Sent: Tuesday, 18 February 2014 16:52
To: slurm-dev
Subject: [slurm-dev] Re: QOS does not appear to be working for me


Thanks Lyn.  I've added both DenyOnLimit and EnforceUsageThreshold
using the modify on the qos=long. When I submit a job again, there is
no difference.  It seems as though the limits don't apply to me.

Do the options require some restart of the slurmd?  I wouldn't think
so and in the past this has been detrimental given the use of cpusets
on this SMP machine.

Bill

On 02/18/2014 10:29 AM, Lyn Gerner wrote:
Hi Bill,

Check out Flags=DenyOnLimit in the sacctmgr man page.

Best,
Lyn
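
For reference, setting the flag on the QOS from this thread might look like the sketch below. The function name deny_on_limit_cmds is hypothetical; it only prints the sacctmgr commands so they can be reviewed before anything is changed. The -i option (apply without the confirmation prompt) and the format field names should be checked against the sacctmgr man page for the installed version.

```shell
#!/bin/sh
# deny_on_limit_cmds QOS  (hypothetical helper, not part of Slurm)
# Print the commands that would set DenyOnLimit on QOS and then display it.
deny_on_limit_cmds() {
    qos=$1
    # Note: "set Flags=" replaces any flags already on the QOS.
    echo "sacctmgr -i modify qos $qos set Flags=DenyOnLimit"
    echo "sacctmgr show qos $qos format=Name,Priority,Flags"
}

# Dry run: print the commands for the "long" QOS.
deny_on_limit_cmds long
# To apply them on a host with Slurm accounting access:
#   deny_on_limit_cmds long | sh
```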


On Tue, Feb 18, 2014 at 4:50 AM, Bill Wichser <[email protected]> wrote:


     We have activated a few settings in the database for QOS.  For
     brevity, let's just look at one.

sacctmgr add qos long priority=10 GrpCpus=516 MaxCpusPerUser=258

     My expectation was that only 516 cores could be used under this
     QOS and that the largest job a user could submit would require
     258 cores.  (This is an SMP machine with 1500+ cores.)

     The QOS is assigned in the job_submit.lua script. But when testing
     with an explicit #SBATCH --qos=long directive, nothing changes.

     I submit a job requiring 522 cores, it accepts it and leaves it
     pending on resources:

     # scontrol show job 163
     JobId=163 Name=hello.slurm
         UserId=bill(14119) GroupId=cses(20121)
         Priority=1680 Account=all QOS=long
         JobState=PENDING Reason=Resources Dependency=(null)
         Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
         RunTime=00:00:00 TimeLimit=1-17:40:00 TimeMin=N/A
         SubmitTime=2014-02-18T09:33:27 EligibleTime=2014-02-18T09:33:27
         StartTime=2014-02-19T09:42:19 EndTime=Unknown
         PreemptTime=None SuspendTime=None SecsPreSuspend=0
         Partition=normal AllocNode:Sid=hecate:1104586
         ReqNodeList=(null) ExcNodeList=(null)
         NodeList=(null)
         NumNodes=1 NumCPUs=522 CPUs/Task=1 ReqS:C:T=*:*:*
         MinCPUsNode=1 MinMemoryCPU=5000M MinTmpDiskNode=0
         Features=(null) Gres=(null) Reservation=(null)
         Shared=OK Contiguous=0 Licenses=(null) Network=(null)
         Command=/home/bill/mpi/hello.slurm
         WorkDir=/home/bill/mpi


     I would have expected the job either to be rejected or to have a
     Reason != Resources.

     Also, the total number of cores being used with qos=long (by
     others) exceeds this GrpCpus=516 limit.

     Obviously I have missed something here.

     My goals would be 1) to reject outright jobs exceeding the QOS
     limit of MaxCpusPerUser (maybe I also need a MaxCpusPerJob?) and
     2) to hold jobs which would exceed this GrpCpus limit.

     Thanks,
     Bill
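
The two goals stated in this message can be written down as a small decision rule. This is only a sketch of the behaviour being asked for, not of Slurm's internal logic: classify_job is a hypothetical helper, and the limits are copied from the sacctmgr command in the message.

```shell
#!/bin/sh
# Limits from: sacctmgr add qos long priority=10 GrpCpus=516 MaxCpusPerUser=258
GRP_CPUS=516
MAX_CPUS_PER_USER=258

# classify_job REQUESTED IN_USE  (hypothetical helper, not a Slurm command)
#   REQUESTED: cores the new job asks for
#   IN_USE:    cores already running under the QOS
classify_job() {
    requested=$1
    in_use=$2
    if [ "$requested" -gt "$MAX_CPUS_PER_USER" ]; then
        echo reject   # goal 1: exceeds the per-user maximum outright
    elif [ $((in_use + requested)) -gt "$GRP_CPUS" ]; then
        echo hold     # goal 2: would push the QOS past its group limit
    else
        echo run
    fi
}

classify_job 522 0     # the 522-core job from this message: prints "reject"
classify_job 200 400   # fits per user, but 600 > 516: prints "hold"
```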

