Which OS are you using on both nodes, and how exactly did you turn off the firewall?

Best,
Felix

On 05.07.2017 at 16:23, Said Mohamed Said wrote:
> Sinfo -R gives "NODE IS NOT RESPONDING"
> ping gives successful results from both nodes
>
> I really can not figure out what is causing the problem.
>
> Regards,
> Said
> ------------------------------------------------------------------------
> *From:* Felix Willenborg <[email protected]>
> *Sent:* Wednesday, July 5, 2017 9:07:05 PM
> *To:* slurm-dev
> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>
> When the nodes change to the down state, what is 'sinfo -R' saying?
> Sometimes it gives you a reason for that.
>
> Best,
> Felix
>
> On 05.07.2017 at 13:16, Said Mohamed Said wrote:
>> Thank you Adam, For NTP I did that as well before posting but didn't
>> fix the issue.
>>
>> Regards,
>> Said
>> ------------------------------------------------------------------------
>> *From:* Adam Huffman <[email protected]>
>> *Sent:* Wednesday, July 5, 2017 8:11:03 PM
>> *To:* slurm-dev
>> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>>
>>
>> I've seen something similar when node clocks were skewed.
>>
>> Worth checking that NTP is running and they're all synchronised.
>>
>> On Wed, Jul 5, 2017 at 12:06 PM, Said Mohamed Said
>> <[email protected]> wrote:
>> > Thank you all for suggestions. I turned off firewall on both machines but
>> > still no luck. I can confirm that No managed switch is preventing the nodes
>> > from communicating. If you check the log file, there is communication for
>> > about 4mins and then the node state goes down.
>> > Any other idea?
>> > ________________________________
>> > From: Ole Holm Nielsen <[email protected]>
>> > Sent: Wednesday, July 5, 2017 7:07:15 PM
>> > To: slurm-dev
>> > Subject: [slurm-dev] Re: SLURM ERROR! NEED HELP
>> >
>> >
>> > On 07/05/2017 11:40 AM, Felix Willenborg wrote:
>> >> in my network I encountered that managed switches were preventing
>> >> necessary network communication between the nodes, on which SLURM
>> >> relies. You should check if you're using managed switches to connect
>> >> nodes to the network and if so, if they're blocking communication on
>> >> slurm ports.
>> >
>> > Managed switches should permit IP layer 2 traffic just like unmanaged
>> > switches! We only have managed Ethernet switches, and they work without
>> > problems.
>> >
>> > Perhaps you meant that Ethernet switches may perform some firewall
>> > functions by themselves?
>> >
>> > Firewalls must be off between Slurm compute nodes as well as the
>> > controller host. See
>> > https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons
>> >
>> > /Ole
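
A successful ping does not show that the Slurm daemons can reach each other, since a firewall can pass ICMP while blocking TCP. One way to test the firewall advice quoted above is to attempt a TCP connection to the daemon ports directly. The following is only a minimal sketch under some assumptions: it uses the default ports from slurm.conf (SlurmctldPort 6817, SlurmdPort 6818), which may be overridden on a given cluster, and the function names `port_open` and `check_slurm_ports` are made up for illustration and are not part of Slurm.

```python
import socket

# Assumed defaults for SlurmctldPort and SlurmdPort; change these if
# slurm.conf on your cluster sets different values.
SLURMCTLD_PORT = 6817
SLURMD_PORT = 6818


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_slurm_ports(controller, compute_nodes, timeout=3.0):
    """Map 'host:port' to reachability as seen from the machine running this.

    The controller is probed on the slurmctld port and every compute node
    on the slurmd port.
    """
    results = {
        f"{controller}:{SLURMCTLD_PORT}": port_open(controller, SLURMCTLD_PORT, timeout)
    }
    for node in compute_nodes:
        results[f"{node}:{SLURMD_PORT}"] = port_open(node, SLURMD_PORT, timeout)
    return results
```

Calling `check_slurm_ports("head", ["node01"])` (hypothetical hostnames) returns a dictionary of `host:port` keys with True or False values. The check would need to be run from the controller and from each compute node, because a firewall can block one direction only. A False result means either that the port is filtered or that the daemon is not listening, so it narrows the problem down rather than identifying it.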
