I came across the same problem some time back. It generally happens when one of the controller or the compute node can reach the other, but not vice versa.
Have a look at the following points:
- The controller and the compute node can ping each other
- Both share the same slurm.conf
- slurm.conf has the location of both the controller and the compute node
- Slurm services are running on the compute node when the controller says it's down
- TCP connections are not being dropped
- The ports used for communication are accessible, specifically the response ports
- Check the routing rules, if any
- Clocks are synced across the nodes
- There hopefully isn't a version mismatch, but have a look anyway (nodes are not recognized across major version differences)

Hope this helps.

Best,
Rajul

On Wed, Jul 5, 2017 at 10:52 AM, John Hearns <[email protected]> wrote:

> Said,
> a problem like this always has a simple cause. We share your
> frustration, and several people here have offered help.
> So please do not get discouraged. We have all been in your situation!
>
> The only way to handle problems like this is
> a) start at the beginning and read the manuals and webpages closely
> b) start at the lowest level, i.e. here the network, and do NOT assume that
> any component is working
> c) look at all the log files closely
> d) start daemon processes in a terminal with any 'verbose' flags set
> e) then start on more low-level diagnostics, such as tcpdump of network
> adapters and straces of the processes and gstacks
>
> You have been doing steps a, b and c very well.
> I suggest staying with these - I myself am going for Adam Huffman's
> suggestion of the NTP clock times.
> Are you SURE that on all nodes you have run the 'date' command and also
> 'ntpq -p'?
> Are you SURE the master node and the node OBU-N6 are both connecting to
> an NTP server? ntpq -p will tell you that.
>
> And do not lose heart. This is how we all learn.
>
> On 5 July 2017 at 16:23, Said Mohamed Said <[email protected]> wrote:
>
>> sinfo -R gives "NODE IS NOT RESPONDING"
>> ping gives successful results from both nodes
>>
>> I really cannot figure out what is causing the problem.
>>
>> Regards,
>> Said
>> ------------------------------
>> *From:* Felix Willenborg <[email protected]>
>> *Sent:* Wednesday, July 5, 2017 9:07:05 PM
>> *To:* slurm-dev
>> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>>
>> When the nodes change to the down state, what is 'sinfo -R' saying?
>> Sometimes it gives you a reason for that.
>>
>> Best,
>> Felix
>>
>> On 05.07.2017 at 13:16, Said Mohamed Said wrote:
>>
>> Thank you Adam. For NTP, I did that as well before posting, but it didn't
>> fix the issue.
>>
>> Regards,
>> Said
>> ------------------------------
>> *From:* Adam Huffman <[email protected]>
>> *Sent:* Wednesday, July 5, 2017 8:11:03 PM
>> *To:* slurm-dev
>> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>>
>> I've seen something similar when node clocks were skewed.
>>
>> Worth checking that NTP is running and they're all synchronised.
>>
>> On Wed, Jul 5, 2017 at 12:06 PM, Said Mohamed Said <[email protected]> wrote:
>> > Thank you all for suggestions. I turned off firewall on both machines but
>> > still no luck. I can confirm that no managed switch is preventing the nodes
>> > from communicating. If you check the log file, there is communication for
>> > about 4 mins and then the node state goes down.
>> > Any other idea?
>> > ________________________________
>> > From: Ole Holm Nielsen <[email protected]>
>> > Sent: Wednesday, July 5, 2017 7:07:15 PM
>> > To: slurm-dev
>> > Subject: [slurm-dev] Re: SLURM ERROR! NEED HELP
>> >
>> > On 07/05/2017 11:40 AM, Felix Willenborg wrote:
>> >> in my network I encountered that managed switches were preventing
>> >> necessary network communication between the nodes, on which SLURM
>> >> relies. You should check if you're using managed switches to connect
>> >> nodes to the network and if so, if they're blocking communication on
>> >> slurm ports.
>> >
>> > Managed switches should permit IP layer 2 traffic just like unmanaged
>> > switches! We only have managed Ethernet switches, and they work without
>> > problems.
>> >
>> > Perhaps you meant that Ethernet switches may perform some firewall
>> > functions by themselves?
>> >
>> > Firewalls must be off between Slurm compute nodes as well as the
>> > controller host. See
>> > https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons
>> >
>> > /Ole
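
For the "ports are accessible" and "reach each other in both directions" points from the checklist at the top of this message, here is a minimal sketch of a TCP reachability check. It is only an illustration under some assumptions: the port numbers are the usual Slurm defaults (6817 for slurmctld, 6818 for slurmd) and should be replaced by whatever SlurmctldPort and SlurmdPort are set to in your slurm.conf, the hostname "controller" is a placeholder, and the helper names can_connect and check_direction are made up for this example. Since ping succeeding does not prove that the TCP ports are open, it may be worth running it once on the controller and once on the compute node.

```python
import socket

# Assumed Slurm default ports; the real values come from SlurmctldPort and
# SlurmdPort in slurm.conf and may differ on your cluster.
SLURMCTLD_PORT = 6817
SLURMD_PORT = 6818


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_direction(targets, timeout=3.0):
    """Map each label in (label, host, port) triples to a reachability result."""
    return {label: can_connect(host, port, timeout) for label, host, port in targets}


if __name__ == "__main__":
    # "controller" is a hypothetical hostname; OBU-N6 is the node named
    # earlier in this thread. Adjust both to match slurm.conf.
    targets = [
        ("slurmctld on controller", "controller", SLURMCTLD_PORT),
        ("slurmd on OBU-N6", "OBU-N6", SLURMD_PORT),
    ]
    for label, ok in check_direction(targets).items():
        print(f"{label}: {'reachable' if ok else 'NOT reachable'}")
```

If one direction reports "NOT reachable" while ping works, that would point towards a firewall, routing rule or a daemon that is not listening, rather than a basic network outage.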
