We have our basic Slurm setup running with a couple of queues set up through
QOS (a main and a long "queue"). What we would like to do now is set up
preemption for our stakeholders. When we have preemption set on the partition
(suspend), though, it looks like any job submitted with --exclusive will
suspend the job running on the next available node with enough free cores
(since jobs are block-scheduled to fill free cores before going on to a new
node). The only way to stop it seems to be to put PreemptMode=off on the
partition (which defeats the purpose). How do we keep preemption enabled
without this happening, short of telling everyone never to use "--exclusive"?
There seem to be a number of descriptions online of setting up preemption,
some using different partitions, but when we use different partitions we always
get oversubscribed nodes (since it seems that scheduling isn't coordinated
between partitions). Thanks in advance - any help is appreciated.
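For reference, the partition-based descriptions generally look something like the sketch below. This is not our running config: the partition names "general" and "stakeholder" are hypothetical, the node list is just a subset of ours, and it uses the older Priority= and Shared= partition keywords (newer Slurm releases call these PriorityTier= and OverSubscribe=).

```
# Sketch only, assuming partition-priority preemption instead of QOS preemption.
PreemptType=preempt/partition_prio
PreemptMode=SUSPEND,GANG

# Lower-priority partition: jobs here are the ones that get suspended.
PartitionName=general Default=YES Priority=1 Shared=FORCE:1 Nodes=compute-0-[0-8]

# Higher-priority partition over the same nodes: jobs here preempt "general".
PartitionName=stakeholder Priority=2 Shared=FORCE:1 Nodes=compute-0-[0-8]
```

The idea is that both partitions cover the same nodes and differ only in priority, with Shared=FORCE:1 so that at most one job per resource is actually running at a time while the suspended one waits.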
This is in our slurm.conf:
PreemptType=preempt/qos
PreemptMode=SUSPEND,GANG
…
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
Our partition:
PartitionName=main Default=yes Priority=1 State=UP TRESBillingWeights="CPU=1.0,Mem=0.25G" nodes=compute-0-[0-8],compute-5-[0-1],compute-20-0 DefMemPerCPU=4096 QOS=main PreemptMode=suspend
And this is our qos:
Name  Priority  GraceTime  Preempt  PreemptMode  Flags        UsageFactor  Limits (GrpTRES..MinTRES)
----  --------  ---------  -------  -----------  -----------  -----------  -------------------------
long  0         00:00:00   master   cluster      OverPartQOS  1.000000     cpu=132, 7-00:00:00
main  0         00:00:00            cluster                   1.000000
Deborah Crocker, PhD
Systems Engineer III
Office of Information Technology
The University of Alabama
Box 870346
Tuscaloosa, AL 36587
Office 205-348-3758 | Fax 205-348-9393
[email protected]
From: Husen R [mailto:[email protected]]
Sent: Wednesday, April 06, 2016 9:05 PM
To: slurm-dev
Subject: [slurm-dev] Re: Failed to access munge.socket.2
Hello Lachlan, Chris,
Thank you for your reply.
I don't know why "/usr/local" is prepended to the path.
I tried to find munge.socket.2 manually using the locate command, and the file
indeed does not exist.
The directory /usr/local/var/run/munge is empty.
There is no munge directory in /var/run. I don't know why the munge directory
is located in /usr/local/var/run instead of in /var/run.
I had previously installed slurm-llnl from the repository before installing it
from source. Could this be the cause of the problem?
Regards,
Husen
On Thu, Apr 7, 2016 at 8:26 AM, Christopher Samuel <[email protected]> wrote:
On 06/04/16 19:50, Husen R wrote:
> however, when I tried to run sbatch I get the following error message:
>
> Failed to access "/usr/local/var/run/munge/munge.socket.2": No such file
> or directory
Is that path really correct?
On our systems it's: /var/run/munge/munge.socket.2
Best of luck,
Chris
--
Christopher Samuel Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: [email protected] Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci