Did you try to ping the compute node from the controller node, and the
other way around?
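Note that ping only exercises ICMP, so it may succeed even when the TCP
ports used by the Slurm daemons are blocked. A minimal sketch of a TCP-level
check follows, assuming the default ports 6817 (SlurmctldPort) and 6818
(SlurmdPort) from slurm.conf; adjust them if your configuration differs. The
hostnames "controller" and "compute01" are hypothetical placeholders, and the
helper names are mine, not part of Slurm.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def check_slurm_ports(targets, timeout=3.0):
    """Map each (host, port) pair to whether it accepted a TCP connection."""
    return {(host, port): tcp_port_open(host, port, timeout)
            for host, port in targets}


# Example usage (hypothetical hostnames, assumed default ports):
#   On the compute node:    check_slurm_ports([("controller", 6817)])
#   On the controller node: check_slurm_ports([("compute01", 6818)])
```

Run it in both directions, since the controller has to reach slurmd on the
compute node and the compute node has to reach slurmctld on the controller.
If either direction returns False while ping works, something is filtering
TCP between the two hosts.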
On 07/05/2017 01:07 PM, Said Mohamed Said wrote:
Thank you all for the suggestions. I turned off the firewall on both
machines, but still no luck. I can confirm that no managed switch is
preventing the nodes from communicating. If you check the log file, there
is communication for about 4 minutes and then the node state goes down.
Any other ideas?
------------------------------------------------------------------------
*From:* Ole Holm Nielsen
*Sent:* Wednesday, July 5, 2017 7:07:15 PM
*To:* slurm-dev
*Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
On 07/05/2017 11:40 AM, Felix Willenborg wrote:
> In my network I found that managed switches were blocking network
> communication between the nodes, which SLURM relies on. You should
> check whether you are using managed switches to connect the nodes to
> the network and, if so, whether they are blocking traffic on the
> Slurm ports.
Managed switches should forward Ethernet (layer 2) traffic just like unmanaged
switches! We only have managed Ethernet switches, and they work without
problems.
Perhaps you meant that Ethernet switches may perform some firewall
functions by themselves?
Firewalls must be off between the Slurm compute nodes, as well as between
the compute nodes and the controller host. See
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons
/Ole
--
Mehdi Denou
Bull/Atos international HPC support
+336 45 57 66 56