This may help, especially the section "Networking and configuration problems":
http://slurm.schedmd.com/troubleshoot.html
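
One quick check along these lines is to confirm that the Slurm TCP ports are reachable in both directions between the controller and an affected node, since ssh and ping working does not guarantee that. The following is only a rough sketch, assuming Python 3 is available. The hostnames in the comments are hypothetical, and 6817 and 6818 are the default SlurmctldPort and SlurmdPort values, so substitute whatever your slurm.conf actually sets.

```python
import socket

# Default Slurm ports (assumption: verify SlurmctldPort / SlurmdPort in slurm.conf).
SLURMCTLD_PORT = 6817
SLURMD_PORT = 6818


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def report(checks):
    """Return one 'host:port ok' or 'host:port FAILED' line per (host, port) pair."""
    lines = []
    for host, port in checks:
        status = "ok" if can_connect(host, port) else "FAILED"
        lines.append(f"{host}:{port} {status}")
    return lines


# Example use (hypothetical hostnames):
#   on a compute node:  print(report([("slurm-head", SLURMCTLD_PORT)]))
#   on the controller:  print(report([("node042", SLURMD_PORT)]))
```

If the check fails in only one direction, or only for nodes behind the new switches, that would point at the network path rather than at the Slurm configuration.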
Quoting Bob Healey <hea...@rpi.edu>:
Hello.
I've spent a week or two trying to figure this one out. Recently, I
replaced two of my four dying SMC 8648T switches (purchased 2005)
in a 136 node Sun V20z Opteron cluster (also purchased 2005) with
Netgear M4100 switches. Yes, I know the hardware is ancient beyond
all belief. I've been directed to keep it running. Since replacing
those two switches, nodes on the new switches will randomly lose
communication with the slurmctld daemon, while otherwise being fully
accessible via other protocols, and can even run assorted Slurm s* commands
for cluster information, while still appearing as down in sinfo. I'm
not seeing any errors on the switches. Spanning tree is disabled,
flow control is disabled, green features are disabled. Running
Slurm 14.11.3 on RHEL 5.11. I'm tempted to drop the old switches
back in and see if the problem goes away. Before I admit to my boss
I can't handle networking worth a damn, anything obvious spring to
mind that I should check? I have not changed anything else; I have only
rebooted the cluster.
--
Bob Healey
Systems Administrator
Biocomputation and Bioinformatics Constellation
and Molecularium
hea...@rpi.edu
(518) 276-4407
--
Morris "Moe" Jette
CTO, SchedMD LLC
Commercial Slurm Development and Support