Said, a problem like this always has a simple cause. We share your frustration, and several people here have offered help. So please do not get discouraged. We have all been in your situation!
The only way to handle problems like this is to:
a) start at the beginning and read the manuals and webpages closely
b) start at the lowest level, i.e. here the network, and do NOT assume that any component is working
c) look at all the log files closely
d) start daemon processes in a terminal with any 'verbose' flags set
e) then start on more low-level diagnostics, such as tcpdump of network adapters, and straces and gstacks of the processes

You have been doing steps a, b and c very well. I suggest staying with these. I myself am going for Adam Huffman's suggestion of the NTP clock times.
Are you SURE that on all nodes you have run the 'date' command and also 'ntpq -p'?
Are you SURE the master node and the node OBU-N6 are both connecting to an NTP server? 'ntpq -p' will tell you that.

And do not lose heart. This is how we all learn.

On 5 July 2017 at 16:23, Said Mohamed Said <[email protected]> wrote:

> Sinfo -R gives "NODE IS NOT RESPONDING"
> ping gives successful results from both nodes
>
> I really can not figure out what is causing the problem.
>
> Regards,
> Said
> ------------------------------
> *From:* Felix Willenborg <[email protected]>
> *Sent:* Wednesday, July 5, 2017 9:07:05 PM
>
> *To:* slurm-dev
> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>
> When the nodes change to the down state, what is 'sinfo -R' saying?
> Sometimes it gives you a reason for that.
>
> Best,
> Felix
>
> On 05.07.2017 at 13:16, Said Mohamed Said wrote:
>
> Thank you Adam, For NTP I did that as well before posting but didn't fix
> the issue.
>
> Regards,
> Said
> ------------------------------
> *From:* Adam Huffman <[email protected]> <[email protected]>
> *Sent:* Wednesday, July 5, 2017 8:11:03 PM
> *To:* slurm-dev
> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>
>
> I've seen something similar when node clocks were skewed.
>
> Worth checking that NTP is running and they're all synchronised.
>
> On Wed, Jul 5, 2017 at 12:06 PM, Said Mohamed Said <[email protected]> <[email protected]> wrote:
> > Thank you all for suggestions. I turned off firewall on both machines but
> > still no luck. I can confirm that No managed switch is preventing the nodes
> > from communicating. If you check the log file, there is communication for
> > about 4mins and then the node state goes down.
> > Any other idea?
> > ________________________________
> > From: Ole Holm Nielsen <[email protected]> <[email protected]>
> > Sent: Wednesday, July 5, 2017 7:07:15 PM
> > To: slurm-dev
> > Subject: [slurm-dev] Re: SLURM ERROR! NEED HELP
> >
> >
> > On 07/05/2017 11:40 AM, Felix Willenborg wrote:
> >> in my network I encountered that managed switches were preventing
> >> necessary network communication between the nodes, on which SLURM
> >> relies. You should check if you're using managed switches to connect
> >> nodes to the network and if so, if they're blocking communication on
> >> slurm ports.
> >
> > Managed switches should permit IP layer 2 traffic just like unmanaged
> > switches! We only have managed Ethernet switches, and they work without
> > problems.
> >
> > Perhaps you meant that Ethernet switches may perform some firewall
> > functions by themselves?
> >
> > Firewalls must be off between Slurm compute nodes as well as the
> > controller host. See
> > https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons
> >
> > /Ole
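
P.S. For the NTP check suggested at the top of this message, here is a minimal Python sketch, not a definitive tool. It assumes the usual ten-column 'ntpq -p' layout, in which the peer that ntpd has selected is marked with a leading '*' and the second-to-last column is the offset in milliseconds. The function names and the 100 ms threshold are my own illustrative choices, not Slurm or NTP settings.

```python
def selected_peer_offset_ms(ntpq_output):
    """Return the offset (ms) of the peer marked '*' in 'ntpq -p' text, or None."""
    for line in ntpq_output.splitlines():
        if line.startswith("*"):
            fields = line.split()
            # Assumed columns: remote refid st t when poll reach delay offset jitter
            if len(fields) >= 10:
                return float(fields[-2])
    return None


def clock_report(ntpq_output, max_offset_ms=100.0):
    """Summarise one node's sync state; max_offset_ms is a hypothetical threshold."""
    offset = selected_peer_offset_ms(ntpq_output)
    if offset is None:
        # No '*' line: ntpd has not selected any server to synchronise to.
        return "NOT SYNCED: no peer selected"
    if abs(offset) > max_offset_ms:
        return f"SKEWED: offset {offset} ms"
    return f"OK: offset {offset} ms"
```

Capture the output of 'ntpq -p' on the master node and on OBU-N6 (for example over ssh) and pass each one to clock_report(). A "NOT SYNCED" result on either node means that node is not actually locked to an NTP server, even if ntpd is running.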
