'sinfo -R' gives "NODE IS NOT RESPONDING", but ping gives successful results from both nodes.
I really cannot figure out what is causing the problem.

Regards,
Said

________________________________
From: Felix Willenborg <[email protected]>
Sent: Wednesday, July 5, 2017 9:07:05 PM
To: slurm-dev
Subject: [slurm-dev] Re: SLURM ERROR! NEED HELP

When the nodes change to the down state, what is 'sinfo -R' saying? Sometimes it gives you a reason for that.

Best,
Felix

On 05.07.2017 at 13:16, Said Mohamed Said wrote:

Thank you Adam,

For NTP I did that as well before posting, but it didn't fix the issue.

Regards,
Said

________________________________
From: Adam Huffman <[email protected]>
Sent: Wednesday, July 5, 2017 8:11:03 PM
To: slurm-dev
Subject: [slurm-dev] Re: SLURM ERROR! NEED HELP

I've seen something similar when node clocks were skewed. Worth checking that NTP is running and they're all synchronised.

On Wed, Jul 5, 2017 at 12:06 PM, Said Mohamed Said <[email protected]> wrote:
> Thank you all for suggestions. I turned off firewall on both machines but
> still no luck. I can confirm that no managed switch is preventing the nodes
> from communicating. If you check the log file, there is communication for
> about 4mins and then the node state goes down.
> Any other idea?
> ________________________________
> From: Ole Holm Nielsen <[email protected]>
> Sent: Wednesday, July 5, 2017 7:07:15 PM
> To: slurm-dev
> Subject: [slurm-dev] Re: SLURM ERROR! NEED HELP
>
>
> On 07/05/2017 11:40 AM, Felix Willenborg wrote:
>> in my network I encountered that managed switches were preventing
>> necessary network communication between the nodes, on which SLURM
>> relies. You should check if you're using managed switches to connect
>> nodes to the network and if so, if they're blocking communication on
>> slurm ports.
>
> Managed switches should permit IP layer 2 traffic just like unmanaged
> switches! We only have managed Ethernet switches, and they work without
> problems.
>
> Perhaps you meant that Ethernet switches may perform some firewall
> functions by themselves?
>
> Firewalls must be off between Slurm compute nodes as well as the
> controller host. See
> https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons
>
> /Ole
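
A successful ping only shows that ICMP gets through; it does not show that the TCP ports used by the Slurm daemons are reachable, which is what the firewall and switch suggestions in this thread are about. A minimal sketch of such a check follows. It assumes the stock defaults SlurmctldPort=6817 and SlurmdPort=6818 (check slurm.conf, since a site may override them), and the function name tcp_port_open is made up for this illustration.

```python
import socket

# Assumption: stock Slurm defaults. A site's slurm.conf may set
# SlurmctldPort / SlurmdPort to different values.
SLURMCTLD_PORT = 6817  # controller daemon (slurmctld)
SLURMD_PORT = 6818     # compute node daemon (slurmd)


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established.

    Hypothetical helper for illustration. It only tests that the port
    accepts a connection; it does not speak the Slurm protocol.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False
```

The check is only meaningful when run in both directions: from the controller, call tcp_port_open(node, SLURMD_PORT) for each compute node, and from each compute node, call tcp_port_open(controller, SLURMCTLD_PORT), substituting the real host names. A False result in either direction would point to a firewall or filtering problem rather than to Slurm itself; a True result in both directions does not rule out other causes such as clock skew.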
