We have a basic Slurm setup running, with two "queues" (main and long) 
implemented through QOS. We would now like to set up preemption for our 
stakeholders. With preemption (suspend) enabled on the partition, though, it 
looks like any job submitted with --exclusive will suspend the job running on 
the next node that has enough free cores (since jobs are block-scheduled to 
fill free cores before moving on to a new node). The only way to stop this 
seems to be setting PreemptMode=off on the partition, which defeats the 
purpose. How do we do this, short of telling everyone never to use 
--exclusive?

There seem to be a number of descriptions online of setting up preemption, 
some using different partitions, but when we use different partitions we 
always get oversubscribed nodes (it seems that scheduling isn't coordinated 
between partitions). Thanks in advance - any help is appreciated.

This is in our slurm.conf:

PreemptType=preempt/qos
PreemptMode=SUSPEND,GANG
…
SelectType=select/cons_res
SelectTypeParameters=CR_CPU

Our partition:
PartitionName=main Default=yes Priority=1 State=UP TRESBillingWeights="CPU=1.0,Mem=0.25G" nodes=compute-0-[0-8],compute-5-[0-1],compute-20-0 DefMemPerCPU=4096 QOS=main PreemptMode=suspend


And this is our qos:
      Name   Priority  GraceTime    Preempt PreemptMode       Flags UsageFactor   GrpTRES     MaxWall
---------- ---------- ---------- ---------- ----------- ----------- ----------- --------- -----------
      long          0   00:00:00     master     cluster OverPartQOS    1.000000   cpu=132  7-00:00:00
      main          0   00:00:00                cluster                1.000000
(all other columns are empty and omitted here)
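
The full QOS listing is very wide, so for anyone who wants to compare against 
their own cluster, here is a minimal Python sketch that pulls the same fields 
in pipe-delimited form. It assumes sacctmgr is on the PATH and uses its 
--parsable2 output; the helper names parse_parsable2 and show_qos are 
hypothetical, made up for this illustration.

```python
import subprocess

# QOS columns of interest for a preemption setup (sacctmgr format field names).
FIELDS = ["Name", "Priority", "GraceTime", "Preempt", "PreemptMode",
          "Flags", "UsageFactor", "GrpTRES", "MaxWall"]


def parse_parsable2(text):
    """Turn pipe-delimited text (header line first) into a list of dicts."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split("|")
    return [dict(zip(header, ln.split("|"))) for ln in lines[1:]]


def show_qos(fields=FIELDS):
    """Run sacctmgr and return one dict per QOS.

    Assumption: this needs a working Slurm installation with accounting
    enabled; it is not exercised anywhere in this sketch.
    """
    cmd = ["sacctmgr", "--parsable2", "show", "qos",
           "format=" + ",".join(fields)]
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return parse_parsable2(result.stdout)
```

On a Slurm login node, calling show_qos() and printing each returned dict 
should give one compact line per QOS that does not wrap in mail.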

Deborah Crocker, PhD
Systems Engineer III
Office of Information Technology
The University of Alabama
Box 870346
Tuscaloosa, AL 36587
Office 205-348-3758 | Fax 205-348-9393
[email protected]

From: Husen R [mailto:[email protected]]
Sent: Wednesday, April 06, 2016 9:05 PM
To: slurm-dev
Subject: [slurm-dev] Re: Failed to access munge.socket.2

Hello Lachlan, Chris,

Thank you for your reply.

I don't know why "/usr/local" is prepended to the path.
I tried to find munge.socket.2 manually using the locate command, and the file 
indeed does not exist.
The directory /usr/local/var/run/munge is empty.

There is no munge directory in /var/run. I don't know why the munge directory 
is located in /usr/local/var/run instead of in /var/run.

I previously installed slurm-llnl from the repository before installing it 
from source. Could this be the cause of the problem?

Regards,

Husen

On Thu, Apr 7, 2016 at 8:26 AM, Christopher Samuel <[email protected]> wrote:

On 06/04/16 19:50, Husen R wrote:

> however, when I tried to run sbatch I get the following error message:
>
> Failed to access "/usr/local/var/run/munge/munge.socket.2": No such file
> or directory

Is that path really correct?

On our systems it's: /var/run/munge/munge.socket.2

Best of luck,
Chris
--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: [email protected] Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
