I'm not 100% sure what I'm doing wrong, but I can't seem to get SLURM working
in any real sense, even though I'm getting more "SUCCESS" responses in testing.

The symptoms are similar from run to run but not consistent: there always seem
to be a couple of nodes DOWN or DRAINING, and they never come back up. It seems
to be a different node every time I reboot or restart the services.


Here is some of the info I'm seeing, with my questions first and the outputs
below them:

1. Why am I getting "Cray node selection plugin loaded" - I don't remember 
setting this, and I'm not using a Cray?
2. What do down and drain represent, and what does the addition of a * mean?
3. What does "Low RealMemory" mean and how can I fix this?
4. "error: If munged is up, restart with --num-threads=10" - I see this a lot 
in each node's slurm-d log, but I can't find anything online about how to fix 
it, and sudo systemctl start munge --num-threads=10 fails with "/bin/systemctl: 
unrecognised option '--num-threads=10'". I can't see that option anywhere in 
the init scripts either.

Cheers
L.

===================================================
[ec2-user@slurm-head slurm-administration]$ sudo sinfo -a -R -l -v
-----------------------------
dead        = false
exact       = 0
filtering   = true
format      = %20E %12U %19H %6t %N
iterate     = 0
long        = true
no_header   = false
node_field  = false
node_format = false
nodes       = n/a
part_field  = false
partition   = n/a
responding  = false
states      = down,drain,error
sort        = (null)
summarize   = false
verbose     = 1
-----------------------------
all_flag        = true
alloc_mem_flag  = false
avail_flag      = false
bg_flag         = false
cpus_flag       = false
default_time_flag =false
disk_flag       = false
features_flag   = false
groups_flag     = false
gres_flag       = false
job_size_flag   = false
max_time_flag   = false
memory_flag     = false
partition_flag  = false
priority_flag   = false
reason_flag     = true
reason_timestamp_flag = true
reason_user_flag = true
reservation_flag = false
root_flag       = false
share_flag      = false
state_flag      = true
weight_flag     = false
-----------------------------

Fri Dec 18 03:28:35 2015
sinfo: Cray node selection plugin loaded
REASON               USER         TIMESTAMP           STATE  NODELIST
Not responding       slurm(1001)  2015-12-18T02:52:28 down*  slurm-w1
Low RealMemory       slurm(1001)  2015-12-18T03:09:08 drain  slurm-test,slurm-w2

===================================================

[ec2-user@slurm-head slurm-administration]$ sudo scontrol show node slurm-w2
NodeName=slurm-w2 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=8 CPULoad=0.01 Features=(null)
   Gres=(null)
   NodeAddr=slurm-w2 NodeHostName=slurm-w2 Version=15.08
   OS=Linux RealMemory=32014 AllocMem=0 FreeMem=31412 Sockets=8 Boards=1
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
   BootTime=2015-12-17T23:51:43 SlurmdStartTime=2015-12-17T23:51:55
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Low RealMemory [slurm@2015-12-18T03:09:08]

[ec2-user@slurm-head slurm-administration]$ sudo scontrol show node slurm-w3
NodeName=slurm-w3 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=8 CPUErr=0 CPUTot=8 CPULoad=0.01 Features=(null)
   Gres=(null)
   NodeAddr=slurm-w3 NodeHostName=slurm-w3 Version=15.08
   OS=Linux RealMemory=32014 AllocMem=0 FreeMem=31404 Sockets=8 Boards=1
   State=ALLOCATED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
   BootTime=2015-12-17T23:51:54 SlurmdStartTime=2015-12-17T23:52:04
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   

[ec2-user@slurm-head slurm-administration]$ sudo scontrol show node slurm-w1
NodeName=slurm-w1 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=8 CPULoad=0.01 Features=(null)
   Gres=(null)
   NodeAddr=slurm-w1 NodeHostName=slurm-w1 Version=15.08
   OS=Linux RealMemory=32014 AllocMem=0 FreeMem=31426 Sockets=8 Boards=1
   State=DOWN* ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
   BootTime=2015-12-17T23:51:37 SlurmdStartTime=2015-12-17T23:51:48
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Not responding [slurm@2015-12-18T02:52:28]

[ec2-user@slurm-head slurm-administration]$ sudo scontrol show node slurm-test
NodeName=slurm-test Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=8 CPULoad=0.01 Features=(null)
   Gres=(null)
   NodeAddr=slurm-test NodeHostName=slurm-test Version=15.08
   OS=Linux RealMemory=32014 AllocMem=0 FreeMem=31438 Sockets=8 Boards=1
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
   BootTime=2015-12-17T23:53:30 SlurmdStartTime=2015-12-17T23:53:40
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Low RealMemory [slurm@2015-12-18T03:09:08]

===================================================

What I'm seeing in the logs:

from slurm-d.log

[2015-12-18T02:37:29.573] error: Unable to register: Protocol authentication error
[2015-12-18T02:37:30.705] error: If munged is up, restart with --num-threads=10
[2015-12-18T02:37:30.705] error: Munge decode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory
[2015-12-18T02:37:30.705] error: authentication: Socket communication error



from slurm-ctl.log

[2015-12-18T03:42:28.584] error: Node slurm-w3 has low real_memory size (32013 < 32014)
[2015-12-18T03:42:28.584] error: Node slurm-w2 has low real_memory size (32013 < 32014)
[2015-12-18T03:42:28.585] error: Node slurm-w1 has low real_memory size (32013 < 32014)
[2015-12-18T03:42:28.586] error: Node slurm-test has low real_memory size (32013 < 32014)
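
To make the memory mismatch easier to see across nodes, here is a minimal 
Python sketch that pulls the two numbers out of the slurm-ctl messages above. 
It assumes the message format shown in this log, and it assumes the first 
number is what the node reported and the second is the configured RealMemory 
(both in MB). The names LOW_MEM and low_memory_nodes are made up for this 
example and are not part of SLURM.

```python
import re

# Matches the slurm-ctl message quoted above. The field names are an
# assumption: first number = memory the node reported, second = the
# configured RealMemory value.
LOW_MEM = re.compile(
    r"error: Node (?P<node>\S+) has low real_memory size "
    r"\((?P<reported>\d+) < (?P<configured>\d+)\)"
)


def low_memory_nodes(log_text):
    """Return {node: (reported, configured)} for each matching log entry."""
    # Collapse all whitespace so entries wrapped across lines still match.
    flat = re.sub(r"\s+", " ", log_text)
    found = {}
    for m in LOW_MEM.finditer(flat):
        found[m.group("node")] = (
            int(m.group("reported")),
            int(m.group("configured")),
        )
    return found


sample = """\
[2015-12-18T03:42:28.584] error: Node slurm-w3 has low real_memory size (32013
< 32014)
[2015-12-18T03:42:28.586] error: Node slurm-test has low real_memory size
(32013 < 32014)
"""

for node, (reported, configured) in sorted(low_memory_nodes(sample).items()):
    shortfall = configured - reported
    print(f"{node}: reported {reported}, configured {configured}, short by {shortfall}")
```

On the entries above, every node comes out exactly 1 below the second value, 
which fits all four nodes being flagged with the same message.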

