Just to further clarify: I'm running on CentOS 7. slurmctld is running as the slurm user (SlurmUser) and each node's slurmd is running as root.
slurm.conf is:

# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
ControlMachine=slurm-head
ControlAddr=115.146.87.234
AuthType=auth/munge
CacheGroups=0
CryptoType=crypto/munge
MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=2
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurmd
SwitchType=switch/none
TaskPlugin=task/none
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
FastSchedule=1
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/linear
AccountingStorageType=accounting_storage/none
AccountingStoreJobComment=YES
ClusterName=cluster
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm/slurm-ctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm/slurm-d.log
# COMPUTE NODES
NodeName=slurm-test,slurm-w1,slurm-w2,slurm-w3 CPUs=8 RealMemory=32014 Sockets=8 CoresPerSocket=1 ThreadsPerCore=1 State=UNKNOWN
PartitionName=debug Nodes=slurm-test,slurm-w1,slurm-w2,slurm-w3 Default=YES MaxTime=INFINITE State=UP

-----Original Message-----
From: Simpson Lachlan
Sent: Friday, 18 December 2015 2:54 PM
To: '[email protected]'
Subject: Worker nodes down, draining?

Not 100% sure what I'm doing wrong, but I don't seem to be able to get SLURM working in a real sense. I'm getting more "SUCCESS" responses in the testing.

The symptoms I'm seeing are similar but inconsistent: there always seem to be a couple of nodes DOWN or DRAINING, and they never come back up. It seems to be a different node every time I reboot or restart the services.

Here is some of the info I'm seeing. Questions, from the outputs below:

1. Why am I getting "Cray node selection plugin loaded"? I don't remember setting this, and I'm not using a Cray.
2. What do down and drain represent, and what does the addition of a * mean?
3. What does "Low RealMemory" mean and how can I fix this?
4. "error: If munged is up, restart with --num-threads=10" - I see this a lot in each node's slurm-d log, but I can't find anything online about how to fix it, and sudo systemctl start munge --num-threads=10 fails with "/bin/systemctl: unrecognised option '--num-threads=10'". Sure enough, I can't see that option anywhere in the init scripts.

Cheers
L.

===================================================
[ec2-user@slurm-head slurm-administration]$ sudo sinfo -a -R -l -v
-----------------------------
dead = false
exact = 0
filtering = true
format = %20E %12U %19H %6t %N
iterate = 0
long = true
no_header = false
node_field = false
node_format = false
nodes = n/a
part_field = false
partition = n/a
responding = false
states = down,drain,error
sort = (null)
summarize = false
verbose = 1
-----------------------------
all_flag = true
alloc_mem_flag = false
avail_flag = false
bg_flag = false
cpus_flag = false
default_time_flag =false
disk_flag = false
features_flag = false
groups_flag = false
gres_flag = false
job_size_flag = false
max_time_flag = false
memory_flag = false
partition_flag = false
priority_flag = false
reason_flag = true
reason_timestamp_flag = true
reason_user_flag = true
reservation_flag = false
root_flag = false
share_flag = false
state_flag = true
weight_flag = false
-----------------------------

Fri Dec 18 03:28:35 2015
sinfo: Cray node selection plugin loaded
REASON               USER         TIMESTAMP           STATE  NODELIST
Not responding       slurm(1001)  2015-12-18T02:52:28 down*  slurm-w1
Low RealMemory       slurm(1001)  2015-12-18T03:09:08 drain  slurm-test,slurm-w2

===================================================
[ec2-user@slurm-head slurm-administration]$ sudo scontrol show node slurm-w2
NodeName=slurm-w2 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=8 CPULoad=0.01 Features=(null)
   Gres=(null)
   NodeAddr=slurm-w2 NodeHostName=slurm-w2 Version=15.08
   OS=Linux RealMemory=32014 AllocMem=0 FreeMem=31412 Sockets=8 Boards=1
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
   BootTime=2015-12-17T23:51:43 SlurmdStartTime=2015-12-17T23:51:55
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Low RealMemory [slurm@2015-12-18T03:09:08]

[ec2-user@slurm-head slurm-administration]$ sudo scontrol show node slurm-w3
NodeName=slurm-w3 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=8 CPUErr=0 CPUTot=8 CPULoad=0.01 Features=(null)
   Gres=(null)
   NodeAddr=slurm-w3 NodeHostName=slurm-w3 Version=15.08
   OS=Linux RealMemory=32014 AllocMem=0 FreeMem=31404 Sockets=8 Boards=1
   State=ALLOCATED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
   BootTime=2015-12-17T23:51:54 SlurmdStartTime=2015-12-17T23:52:04
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

[ec2-user@slurm-head slurm-administration]$ sudo scontrol show node slurm-w1
NodeName=slurm-w1 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=8 CPULoad=0.01 Features=(null)
   Gres=(null)
   NodeAddr=slurm-w1 NodeHostName=slurm-w1 Version=15.08
   OS=Linux RealMemory=32014 AllocMem=0 FreeMem=31426 Sockets=8 Boards=1
   State=DOWN* ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
   BootTime=2015-12-17T23:51:37 SlurmdStartTime=2015-12-17T23:51:48
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Not responding [slurm@2015-12-18T02:52:28]

[ec2-user@slurm-head slurm-administration]$ sudo scontrol show node slurm-test
NodeName=slurm-test Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=8 CPULoad=0.01 Features=(null)
   Gres=(null)
   NodeAddr=slurm-test NodeHostName=slurm-test Version=15.08
   OS=Linux RealMemory=32014 AllocMem=0 FreeMem=31438 Sockets=8 Boards=1
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
   BootTime=2015-12-17T23:53:30 SlurmdStartTime=2015-12-17T23:53:40
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Low RealMemory [slurm@2015-12-18T03:09:08]

===================================================
What I'm seeing in the logs:

from slurm-d.log
[2015-12-18T02:37:29.573] error: Unable to register: Protocol authentication error
[2015-12-18T02:37:30.705] error: If munged is up, restart with --num-threads=10
[2015-12-18T02:37:30.705] error: Munge decode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory
[2015-12-18T02:37:30.705] error: authentication: Socket communication error

from slurm-ctld.log
[2015-12-18T03:42:28.584] error: Node slurm-w3 has low real_memory size (32013 < 32014)
[2015-12-18T03:42:28.584] error: Node slurm-w2 has low real_memory size (32013 < 32014)
[2015-12-18T03:42:28.585] error: Node slurm-w1 has low real_memory size (32013 < 32014)
[2015-12-18T03:42:28.586] error: Node slurm-test has low real_memory size (32013 < 32014)
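
To make the "low real_memory size" errors from the slurmctld log easier to read side by side, here is a minimal Python sketch that pulls the node name and the two numbers out of each line. The regex, the function name low_memory_nodes, and the SAMPLE text are my own illustration and not part of Slurm. I'm also assuming that the first number is what the node reported and the second is the configured value, since the second matches RealMemory=32014 in the slurm.conf above.

```python
import re

# Pattern for the "low real_memory size" error lines quoted in this post.
# Assumption: first number = memory the node reported (MB),
#             second number = RealMemory configured in slurm.conf (MB).
LOW_MEM = re.compile(
    r"Node (?P<node>\S+) has low real_memory size "
    r"\((?P<reported>\d+) < (?P<configured>\d+)\)"
)


def low_memory_nodes(log_text):
    """Return {node: (reported_mb, configured_mb)} for every matching line."""
    found = {}
    for line in log_text.splitlines():
        match = LOW_MEM.search(line)
        if match:
            found[match.group("node")] = (
                int(match.group("reported")),
                int(match.group("configured")),
            )
    return found


# The four log lines from this post, copied verbatim.
SAMPLE = """\
[2015-12-18T03:42:28.584] error: Node slurm-w3 has low real_memory size (32013 < 32014)
[2015-12-18T03:42:28.584] error: Node slurm-w2 has low real_memory size (32013 < 32014)
[2015-12-18T03:42:28.585] error: Node slurm-w1 has low real_memory size (32013 < 32014)
[2015-12-18T03:42:28.586] error: Node slurm-test has low real_memory size (32013 < 32014)
"""

if __name__ == "__main__":
    for node, (reported, configured) in sorted(low_memory_nodes(SAMPLE).items()):
        shortfall = configured - reported
        print(f"{node}: reported {reported}, configured {configured}, short by {shortfall}")
```

On the lines above, every node comes out one unit short of the configured figure, which is the same gap for all four workers.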
