Just to further clarify: I'm running on CentOS 7.

slurmctld is running as the slurm user (SlurmUser) and each node's slurmd is 
running as root.

Slurm.conf is:

# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.

ControlMachine=slurm-head
ControlAddr=115.146.87.234

AuthType=auth/munge
CacheGroups=0
CryptoType=crypto/munge
MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=2
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurmd
SwitchType=switch/none

TaskPlugin=task/none

InactiveLimit=0
KillWait=30
MinJobAge=300


SlurmctldTimeout=120
SlurmdTimeout=300

Waittime=0

FastSchedule=1

SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/linear

AccountingStorageType=accounting_storage/none
AccountingStoreJobComment=YES
ClusterName=cluster

JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none

SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm/slurm-ctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm/slurm-d.log
# COMPUTE NODES
NodeName=slurm-test,slurm-w1,slurm-w2,slurm-w3 CPUs=8 RealMemory=32014 Sockets=8 CoresPerSocket=1 ThreadsPerCore=1 State=UNKNOWN
PartitionName=debug Nodes=slurm-test,slurm-w1,slurm-w2,slurm-w3 Default=YES MaxTime=INFINITE State=UP
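
As a quick sanity check on the node definition above, here is a minimal Python sketch that splits the line into key/value pairs and confirms that CPUs equals Sockets x CoresPerSocket x ThreadsPerCore. The helper name `parse_node_line` is hypothetical (it is not part of Slurm), and the parsing assumes simple space-separated `Key=Value` fields as in this file.

```python
def parse_node_line(line):
    """Split a slurm.conf node definition into a {key: value} dict."""
    # Assumes space-separated Key=Value fields with no quoting.
    return dict(field.split("=", 1) for field in line.split())


node = parse_node_line(
    "NodeName=slurm-test,slurm-w1,slurm-w2,slurm-w3 CPUs=8 RealMemory=32014 "
    "Sockets=8 CoresPerSocket=1 ThreadsPerCore=1 State=UNKNOWN"
)

# Logical CPU count implied by the socket/core/thread layout.
logical_cpus = (
    int(node["Sockets"]) * int(node["CoresPerSocket"]) * int(node["ThreadsPerCore"])
)
print(logical_cpus == int(node["CPUs"]))  # prints True
```

This only checks that the definition is internally consistent. It says nothing about whether the values match what each node actually reports.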

-----Original Message-----
From: Simpson Lachlan 
Sent: Friday, 18 December 2015 2:54 PM
To: '[email protected]'
Subject: Worker nodes down, draining?

Not 100% sure what I'm doing wrong, but I don't seem to be able to get SLURM 
working in a real sense. I'm getting more "SUCCESS" responses in the testing.

The symptoms I'm seeing are similar but inconsistent: there always seem to be 
a couple of nodes DOWN or DRAINING, and they never come back up. It seems to 
be a different node every time I reboot or restart the services.


Here is some of the output I'm seeing.

Questions, based on the outputs below:

1. Why am I getting "Cray node selection plugin loaded" - I don't remember 
setting this, and I'm not using a Cray?
2. What do down and drain represent, and what does the addition of a * mean?
3. What does "Low RealMemory" mean and how can I fix this?
4. "error: If munged is up, restart with --num-threads=10" - I see this a lot 
in each node's slurm-d log, but I can't see anything online re how to fix it 
and sudo systemctl start munge --num-threads=10 fails with "/bin/systemctl: 
unrecognised option '--num-threads=10'". Sure enough, I can't see that option 
anywhere in the init scripts. 

Cheers
L.

===================================================
[ec2-user@slurm-head slurm-administration]$ sudo sinfo -a -R -l -v
-----------------------------
dead        = false
exact       = 0
filtering   = true
format      = %20E %12U %19H %6t %N
iterate     = 0
long        = true
no_header   = false
node_field  = false
node_format = false
nodes       = n/a
part_field  = false
partition   = n/a
responding  = false
states      = down,drain,error
sort        = (null)
summarize   = false
verbose     = 1
-----------------------------
all_flag        = true
alloc_mem_flag  = false
avail_flag      = false
bg_flag         = false
cpus_flag       = false
default_time_flag =false
disk_flag       = false
features_flag   = false
groups_flag     = false
gres_flag       = false
job_size_flag   = false
max_time_flag   = false
memory_flag     = false
partition_flag  = false
priority_flag   = false
reason_flag     = true
reason_timestamp_flag = true
reason_user_flag = true
reservation_flag = false
root_flag       = false
share_flag      = false
state_flag      = true
weight_flag     = false
-----------------------------

Fri Dec 18 03:28:35 2015
sinfo: Cray node selection plugin loaded
REASON               USER         TIMESTAMP           STATE  NODELIST
Not responding       slurm(1001)  2015-12-18T02:52:28 down*  slurm-w1
Low RealMemory       slurm(1001)  2015-12-18T03:09:08 drain  slurm-test,slurm-w2

===================================================

[ec2-user@slurm-head slurm-administration]$ sudo scontrol show node slurm-w2
NodeName=slurm-w2 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=8 CPULoad=0.01 Features=(null)
   Gres=(null)
   NodeAddr=slurm-w2 NodeHostName=slurm-w2 Version=15.08
   OS=Linux RealMemory=32014 AllocMem=0 FreeMem=31412 Sockets=8 Boards=1
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
   BootTime=2015-12-17T23:51:43 SlurmdStartTime=2015-12-17T23:51:55
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Low RealMemory [slurm@2015-12-18T03:09:08]

[ec2-user@slurm-head slurm-administration]$ sudo scontrol show node slurm-w3
NodeName=slurm-w3 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=8 CPUErr=0 CPUTot=8 CPULoad=0.01 Features=(null)
   Gres=(null)
   NodeAddr=slurm-w3 NodeHostName=slurm-w3 Version=15.08
   OS=Linux RealMemory=32014 AllocMem=0 FreeMem=31404 Sockets=8 Boards=1
   State=ALLOCATED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
   BootTime=2015-12-17T23:51:54 SlurmdStartTime=2015-12-17T23:52:04
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   

[ec2-user@slurm-head slurm-administration]$ sudo scontrol show node slurm-w1
NodeName=slurm-w1 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=8 CPULoad=0.01 Features=(null)
   Gres=(null)
   NodeAddr=slurm-w1 NodeHostName=slurm-w1 Version=15.08
   OS=Linux RealMemory=32014 AllocMem=0 FreeMem=31426 Sockets=8 Boards=1
   State=DOWN* ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
   BootTime=2015-12-17T23:51:37 SlurmdStartTime=2015-12-17T23:51:48
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Not responding [slurm@2015-12-18T02:52:28]

[ec2-user@slurm-head slurm-administration]$ sudo scontrol show node slurm-test 
NodeName=slurm-test Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=8 CPULoad=0.01 Features=(null)
   Gres=(null)
   NodeAddr=slurm-test NodeHostName=slurm-test Version=15.08
   OS=Linux RealMemory=32014 AllocMem=0 FreeMem=31438 Sockets=8 Boards=1
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
   BootTime=2015-12-17T23:53:30 SlurmdStartTime=2015-12-17T23:53:40
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Low RealMemory [slurm@2015-12-18T03:09:08]

===================================================

What I'm seeing in the logs:

from slurm-d.log

[2015-12-18T02:37:29.573] error: Unable to register: Protocol authentication error
[2015-12-18T02:37:30.705] error: If munged is up, restart with --num-threads=10
[2015-12-18T02:37:30.705] error: Munge decode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory
[2015-12-18T02:37:30.705] error: authentication: Socket communication error
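
The "No such file or directory" message suggests the munge socket was not there when slurmd tried to authenticate, which may simply mean munged was not running on that node at the time. As far as I can tell, --num-threads is an option of the munged daemon itself rather than of systemctl. A minimal Python sketch for checking the socket on a node follows. The function name `munge_socket_status` is hypothetical, and the default path is just the one from the log above.

```python
import os
import stat


def munge_socket_status(path="/var/run/munge/munge.socket.2"):
    """Return 'missing', 'not-a-socket', or 'present' for the given path."""
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        # Same condition the slurmd log reports: the socket path does not exist.
        return "missing"
    return "present" if stat.S_ISSOCK(mode) else "not-a-socket"


print(munge_socket_status())
```

A result of "missing" on a worker would be consistent with the log above. It does not by itself say why munged is not providing the socket.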



from slurm-ctld.log

[2015-12-18T03:42:28.584] error: Node slurm-w3 has low real_memory size (32013 < 32014)
[2015-12-18T03:42:28.584] error: Node slurm-w2 has low real_memory size (32013 < 32014)
[2015-12-18T03:42:28.585] error: Node slurm-w1 has low real_memory size (32013 < 32014)
[2015-12-18T03:42:28.586] error: Node slurm-test has low real_memory size (32013 < 32014)
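
Each of these messages compares two numbers: 32013 on the left and 32014 on the right, the latter being the RealMemory value from slurm.conf. A minimal Python sketch that pulls the pairs out of the log text follows, assuming the message format shown above. The name `low_memory_nodes` is hypothetical, and reading the left-hand number as the memory the node itself reports is my assumption.

```python
import re

# Matches the slurmctld message format shown in the log above.
LOW_MEM = re.compile(
    r"Node (?P<node>\S+) has low real_memory size "
    r"\((?P<reported>\d+) < (?P<configured>\d+)\)"
)


def low_memory_nodes(log_text):
    """Return {node: (reported_mb, configured_mb)} for each low real_memory error."""
    # Mail clients may wrap a message mid-line, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    return {
        m.group("node"): (int(m.group("reported")), int(m.group("configured")))
        for m in LOW_MEM.finditer(flat)
    }


sample = """
[2015-12-18T03:42:28.584] error: Node slurm-w3 has low real_memory size (32013
< 32014) [2015-12-18T03:42:28.584] error: Node slurm-w2 has low real_memory
size (32013 < 32014) [2015-12-18T03:42:28.585] error: Node slurm-w1 has low
real_memory size (32013 < 32014) [2015-12-18T03:42:28.586] error: Node
slurm-test has low real_memory size (32013 < 32014)
"""

nodes = low_memory_nodes(sample)
# Smallest value seen on the left-hand side across all nodes.
smallest_reported = min(reported for reported, _ in nodes.values())
print(sorted(nodes), smallest_reported)
```

If that reading is right, every node is 1 MB short of the configured RealMemory.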


