I’m running Ubuntu in Azure with a total of 5 hosts and one master. The slurm 
user exists on all servers and has permissions on all folders. The master will 
not start slurmctld or slurmd.

Ubuntu Version: 16.10
SLURM Version – Can’t connect to controller
IPTABLES are empty:
root@master:~# iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination




systemctl status slurmd.service:
[....] Starting slurmd (via systemctl): slurmd.serviceJob for slurmd.service 
failed because a timeout was exceeded.
See "systemctl status slurmd.service" and "journalctl -xe" for details.
failed!
root@master:~# systemctl status slurmd.service
● slurmd.service - Slurm node daemon
   Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: 
enabled)
   Active: failed (Result: timeout) since Mon 2017-06-12 10:25:21 PDT; 2min 3s 
ago
     Docs: man:slurmd(8)
  Process: 6604 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, 
status=0/SUCCESS)

Jun 12 10:23:41 master systemd[1]: Starting Slurm node daemon...
Jun 12 10:25:11 master systemd[1]: slurmd.service: Start operation timed out. 
Terminating.
Jun 12 10:25:21 master systemd[1]: Failed to start Slurm node daemon.
Jun 12 10:25:21 master systemd[1]: slurmd.service: Unit entered failed state.
Jun 12 10:25:21 master systemd[1]: slurmd.service: Failed with result 'timeout'.

systemctl status slurmctld.service:
Jun 12 10:20:11 master slurmctld[6527]: Running as primary controller
Jun 12 10:20:11 master slurmctld[6527]: No parameter for mcs plugin, default 
values set
Jun 12 10:20:11 master slurmctld[6527]: mcs: MCSParameters = (null). ondemand 
set.
Jun 12 10:21:11 master slurmctld[6527]: 
SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=0
Jun 12 10:21:41 master systemd[1]: slurmctld.service: Start operation timed 
out. Terminating.
Jun 12 10:21:41 master slurmctld[6527]: Terminate signal (SIGINT or SIGTERM) 
received
Jun 12 10:21:41 master slurmctld[6527]: Saving all slurm state
Jun 12 10:21:41 master systemd[1]: Failed to start Slurm controller daemon.
Jun 12 10:21:41 master systemd[1]: slurmctld.service: Unit entered failed state.
Jun 12 10:21:41 master systemd[1]: slurmctld.service: Failed with result 
'timeout'.

This is my config file


#slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=master
ControlAddr=10.0.0.254
#
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/spool/slurmd
SwitchType=switch/none
TaskPlugin=task/none
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
FastSchedule=0
SchedulerType=sched/backfill
#SchedulerPort=7321
SelectType=select/linear
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=cluster
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
#SlurmctldDebug=3
#SlurmctldLogFile=
#SlurmdDebug=3
#SlurmdLogFile=
#
#
# COMPUTE NODES
NodeName=master,worker[0-3] CPUs=4 State=UNKNOWN
PartitionName=debug Nodes=master,worker[0-3] Default=YES MaxTime=INFINITE State=UP
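
The systemctl output above shows ExecStart exiting with status=0/SUCCESS and then a start timeout. One thing worth ruling out (an assumption, not something the logs here confirm) is that the unit file under /lib/systemd/system declares a PIDFile= that differs from SlurmdPidFile or SlurmctldPidFile in slurm.conf, in which case systemd would wait for a PID file that never appears. A minimal Python sketch of that comparison follows; the helper names are hypothetical, and the unit-file path used in the usage note is only an illustrative value.

```python
def unit_pidfile(unit_text):
    """Return the PIDFile= value from a systemd unit file's text, or None."""
    for line in unit_text.splitlines():
        line = line.strip()
        if line.startswith("PIDFile="):
            return line.split("=", 1)[1].strip()
    return None


def slurm_conf_value(conf_text, key):
    """Return the value of `key` in slurm.conf text, skipping comments.

    Key names are compared case-insensitively.
    """
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if name.strip().lower() == key.lower():
            return value.strip()
    return None


def pidfiles_match(unit_text, conf_text, key):
    """True only if the unit declares a PIDFile and it equals slurm.conf's value."""
    unit = unit_pidfile(unit_text)
    conf = slurm_conf_value(conf_text, key)
    return unit is not None and unit == conf
```

For example, reading /lib/systemd/system/slurmd.service and the slurm.conf above into strings and calling pidfiles_match(unit_text, conf_text, "SlurmdPidFile") returns False if the unit expects a path other than /var/run/slurmd.pid. The same call with "SlurmctldPidFile" covers the controller unit.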


slurmd -Dvvvv results

slurmd: debug3: successfully opened slurm listen port *:6818
slurmd: slurmd started on Mon, 12 Jun 2017 10:18:41 -0700
slurmd: CPUs=4 Boards=1 Sockets=1 Cores=4 Threads=1 Memory=28133 
TmpDisk=15713234 Uptime=429657 CPUSpecList=(null) FeaturesAvail=(null) 
FeaturesActive=(null)
slurmd: debug3: Trying to load plugin 
/usr/lib/x86_64-linux-gnu/slurm/acct_gather_energy_none.so
slurmd: debug:  AcctGatherEnergy NONE plugin loaded
slurmd: debug3: Success.
slurmd: debug3: Trying to load plugin 
/usr/lib/x86_64-linux-gnu/slurm/acct_gather_profile_none.so
slurmd: debug:  AcctGatherProfile NONE plugin loaded
slurmd: debug3: Success.
slurmd: debug3: Trying to load plugin 
/usr/lib/x86_64-linux-gnu/slurm/acct_gather_infiniband_none.so
slurmd: debug:  AcctGatherInfiniband NONE plugin loaded
slurmd: debug3: Success.
slurmd: debug3: Trying to load plugin 
/usr/lib/x86_64-linux-gnu/slurm/acct_gather_filesystem_none.so
slurmd: debug:  AcctGatherFilesystem NONE plugin loaded
slurmd: debug3: Success.
slurmd: debug2: No acct_gather.conf file (/etc/slurm-llnl/acct_gather.conf)
slurmd: debug2: slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 127.0.0.1:6817: 
Connection refused
slurmd: debug:  Failed to contact primary controller: Connection refused
slurmd: debug2: slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 127.0.0.1:6817: 
Connection refused
slurmd: debug:  Failed to contact primary controller: Connection refused
slurmd: debug2: slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 127.0.0.1:6817: 
Connection refused
slurmd: debug:  Failed to contact primary controller: Connection refused
slurmd: debug2: slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 127.0.0.1:6817: 
Connection refused
slurmd: debug:  Failed to contact primary controller: Connection refused
slurmd: debug2: slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 127.0.0.1:6817: 
Connection refused
slurmd: debug:  Failed to contact primary controller: Connection refused
slurmd: debug2: slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 127.0.0.1:6817: 
Connection refused
slurmd: debug:  Failed to contact primary controller: Connection refused
slurmd: debug2: slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 127.0.0.1:6817: 
Connection refused
slurmd: debug:  Failed to contact primary controller: Connection refused
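
The debug output above shows slurmd trying 127.0.0.1:6817, while slurm.conf sets ControlAddr=10.0.0.254. To test both addresses independently of slurmd, a plain TCP connect is enough. This is a minimal Python sketch; the function name and the default timeout are hypothetical choices, and port 6817 is taken from the log lines above.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name-resolution failures.
        return False
```

Running can_connect("127.0.0.1", 6817) and can_connect("10.0.0.254", 6817) on the master while slurmctld is up shows whether the controller is listening at all and on which address. socket.gethostbyname("master") additionally shows what the ControlMachine name resolves to on that host.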

SLURMCTLD and SLURMD logs


