I’m running Ubuntu in Azure with a total of 5 hosts and one master. The SLURM
user exists on all servers and has permissions on all folders. The master will
not start slurmctld or slurmd.
Ubuntu Version: 16.10
SLURM Version – Can’t connect to controller
IPTABLES are empty:
root@master:~# iptables -L
Chain INPUT (policy ACCEPT)
target prot opt source destination
Chain FORWARD (policy ACCEPT)
target prot opt source destination
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
Systemctl status slurmd.service
[....] Starting slurmd (via systemctl): slurmd.service
Job for slurmd.service failed because a timeout was exceeded.
See "systemctl status slurmd.service" and "journalctl -xe" for details.
failed!
root@master:~# systemctl status slurmd.service
● slurmd.service - Slurm node daemon
Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset:
enabled)
Active: failed (Result: timeout) since Mon 2017-06-12 10:25:21 PDT; 2min 3s
ago
Docs: man:slurmd(8)
Process: 6604 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited,
status=0/SUCCESS)
Jun 12 10:23:41 master systemd[1]: Starting Slurm node daemon...
Jun 12 10:25:11 master systemd[1]: slurmd.service: Start operation timed out.
Terminating.
Jun 12 10:25:21 master systemd[1]: Failed to start Slurm node daemon.
Jun 12 10:25:21 master systemd[1]: slurmd.service: Unit entered failed state.
Jun 12 10:25:21 master systemd[1]: slurmd.service: Failed with result 'timeout'.
Systemctl status slurmctld.service
Jun 12 10:20:11 master slurmctld[6527]: Running as primary controller
Jun 12 10:20:11 master slurmctld[6527]: No parameter for mcs plugin, default
values set
Jun 12 10:20:11 master slurmctld[6527]: mcs: MCSParameters = (null). ondemand
set.
Jun 12 10:21:11 master slurmctld[6527]:
SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=0
Jun 12 10:21:41 master systemd[1]: slurmctld.service: Start operation timed
out. Terminating.
Jun 12 10:21:41 master slurmctld[6527]: Terminate signal (SIGINT or SIGTERM)
received
Jun 12 10:21:41 master slurmctld[6527]: Saving all slurm state
Jun 12 10:21:41 master systemd[1]: Failed to start Slurm controller daemon.
Jun 12 10:21:41 master systemd[1]: slurmctld.service: Unit entered failed state.
Jun 12 10:21:41 master systemd[1]: slurmctld.service: Failed with result
'timeout'.
This is my config file:
#slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=master
ControlAddr=10.0.0.254
#
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/spool/slurmd
SwitchType=switch/none
TaskPlugin=task/none
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
FastSchedule=0
SchedulerType=sched/backfill
#SchedulerPort=7321
SelectType=select/linear
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=cluster
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
#SlurmctldDebug=3
#SlurmctldLogFile=
#SlurmdDebug=3
#SlurmdLogFile=
#
#
# COMPUTE NODES
NodeName=master,worker[0-3] CPUs=4 State=UNKNOWN
PartitionName=debug Nodes=master,worker[0-3] Default=YES MaxTime=INFINITE State=UP
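
Since the file has to be identical on every node, it may help to list only the directives that are actually in effect (uncommented) and compare that list between the master and the workers. The following is a minimal sketch, assuming the plain Key=Value layout shown above. `active_settings` is a hypothetical helper, not part of SLURM, and it treats keys as case-sensitive, which SLURM itself does not.

```python
def active_settings(conf_text: str) -> dict:
    """Collect uncommented Key=Value directives from slurm.conf-style text.

    For lines that carry several pairs (NodeName=..., PartitionName=...),
    the first key is used and the rest of the line is kept as its value.
    """
    settings = {}
    for raw in conf_text.splitlines():
        # Drop anything after '#', which starts a comment in this file.
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings
```

To use it, read the config file on each node (its location is an assumption here; the slurmd output in this message refers to /etc/slurm-llnl/), pass the text to `active_settings`, and compare the resulting dictionaries, for example the `ControlMachine` and `ControlAddr` entries.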
slurmd -Dvvvv results
slurmd: debug3: successfully opened slurm listen port *:6818
slurmd: slurmd started on Mon, 12 Jun 2017 10:18:41 -0700
slurmd: CPUs=4 Boards=1 Sockets=1 Cores=4 Threads=1 Memory=28133
TmpDisk=15713234 Uptime=429657 CPUSpecList=(null) FeaturesAvail=(null)
FeaturesActive=(null)
slurmd: debug3: Trying to load plugin
/usr/lib/x86_64-linux-gnu/slurm/acct_gather_energy_none.so
slurmd: debug: AcctGatherEnergy NONE plugin loaded
slurmd: debug3: Success.
slurmd: debug3: Trying to load plugin
/usr/lib/x86_64-linux-gnu/slurm/acct_gather_profile_none.so
slurmd: debug: AcctGatherProfile NONE plugin loaded
slurmd: debug3: Success.
slurmd: debug3: Trying to load plugin
/usr/lib/x86_64-linux-gnu/slurm/acct_gather_infiniband_none.so
slurmd: debug: AcctGatherInfiniband NONE plugin loaded
slurmd: debug3: Success.
slurmd: debug3: Trying to load plugin
/usr/lib/x86_64-linux-gnu/slurm/acct_gather_filesystem_none.so
slurmd: debug: AcctGatherFilesystem NONE plugin loaded
slurmd: debug3: Success.
slurmd: debug2: No acct_gather.conf file (/etc/slurm-llnl/acct_gather.conf)
slurmd: debug2: slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 127.0.0.1:6817:
Connection refused
slurmd: debug: Failed to contact primary controller: Connection refused
slurmd: debug2: slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 127.0.0.1:6817:
Connection refused
slurmd: debug: Failed to contact primary controller: Connection refused
slurmd: debug2: slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 127.0.0.1:6817:
Connection refused
slurmd: debug: Failed to contact primary controller: Connection refused
slurmd: debug2: slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 127.0.0.1:6817:
Connection refused
slurmd: debug: Failed to contact primary controller: Connection refused
slurmd: debug2: slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 127.0.0.1:6817:
Connection refused
slurmd: debug: Failed to contact primary controller: Connection refused
slurmd: debug2: slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 127.0.0.1:6817:
Connection refused
slurmd: debug: Failed to contact primary controller: Connection refused
slurmd: debug2: slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 127.0.0.1:6817:
Connection refused
slurmd: debug: Failed to contact primary controller: Connection refused
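
The "Connection refused" lines above mean nothing was accepting TCP connections on 127.0.0.1:6817 at that moment. A quick way to repeat that check outside of slurmd is sketched below. `port_open` is a hypothetical helper name. Port 6817 is the port slurmd reports trying in the output above; it also appears as the commented-out SlurmctldPort in the config.

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

For example, calling `port_open("127.0.0.1", 6817)` and `port_open("10.0.0.254", 6817)` on the master while slurmctld is starting would show whether the controller is listening on either address. This only tests reachability; it does not explain why a daemon is not listening.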
SLURMCTLD and SLURMD logs