This may help, especially the section "Networking and configuration problems":
http://slurm.schedmd.com/troubleshoot.html
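
As a complement to that page, a quick check is whether the Slurm daemons' TCP ports are reachable across the new switches. The following is only a minimal sketch, not part of the original reply. It assumes the default ports (SlurmctldPort 6817, SlurmdPort 6818), which a site may override in slurm.conf, and a Python 3 interpreter, which the stock Python on RHEL 5 is not. The function names are hypothetical.

```python
import socket

# Default Slurm ports (SlurmctldPort / SlurmdPort in slurm.conf).
# Assumption: the site has not overridden them.
SLURMCTLD_PORT = 6817
SLURMD_PORT = 6818


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_nodes(nodes, port=SLURMD_PORT, timeout=3.0):
    """Map each node name to whether the given port accepts connections."""
    return {node: tcp_reachable(node, port, timeout) for node in nodes}
```

Run from the controller, `check_nodes(["node001", "node002"])` (placeholder node names) probes each node's slurmd port. Run from a compute node, `tcp_reachable("controller", SLURMCTLD_PORT)` (placeholder hostname) probes the reverse direction. A node that answers in one direction but not the other would point at the network path rather than at Slurm itself.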


Quoting Bob Healey <hea...@rpi.edu>:

Hello.

I've spent a week or two trying to figure this one out. Recently, I replaced two of my four dying SMC 8648T switches (purchased 2005) in a 136-node Sun V20z Opteron cluster (also purchased 2005) with Netgear M4100 switches. Yes, I know the hardware is ancient beyond all belief. I've been directed to keep it running.

Since replacing those two switches, nodes on the new switches randomly lose communication with the slurmctld daemon. They remain fully accessible via other protocols and can even run assorted s commands for cluster information, yet they still appear as down in sinfo.

I'm not seeing any errors on the switches. Spanning tree is disabled, flow control is disabled, and green features are disabled. I'm running Slurm version 14.11.3 on RHEL 5.11.

I'm tempted to drop the old switches back in and see if the problem goes away. Before I admit to my boss that I can't handle networking worth a damn, does anything obvious spring to mind that I should check? I have not changed anything else, and have only just rebooted the cluster.

--
Bob Healey
Systems Administrator
Biocomputation and Bioinformatics Constellation
and Molecularium
hea...@rpi.edu
(518) 276-4407


--
Morris "Moe" Jette
CTO, SchedMD LLC
Commercial Slurm Development and Support
