Sorry for the typo. It's generally when one of the controller or compute node can reach the other one, but *not* vice versa.
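A minimal sketch of how the one-way reachability described above could be checked, assuming the default Slurm ports (6817 for slurmctld, 6818 for slurmd) are in use; if slurm.conf sets SlurmctldPort or SlurmdPort to something else, substitute those values. The helper name `can_connect` is made up for illustration and is not part of Slurm.

```python
import socket

# Assumed defaults for SlurmctldPort / SlurmdPort; change these if
# slurm.conf overrides them.
SLURMCTLD_PORT = 6817
SLURMD_PORT = 6818


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable host names.
        return False
```

Run it in both directions: on the compute node call `can_connect("<controller>", SLURMCTLD_PORT)`, and on the controller call `can_connect("<compute node>", SLURMD_PORT)`. If only one direction returns True, that matches the situation described above. Note that a successful ping does not imply these TCP ports are reachable.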

On Wed, Jul 5, 2017 at 11:38 AM, Rajul Kumar <[email protected]> wrote:

> I came across the same problem sometime back. It's generally when one of
> the controller or compute can reach to other one but it's happening
> vice-versa.
>
> Have a look at the following points:
> - controller and compute can ping each other
> - both share the same slurm.conf
> - slurm.conf has the location of both controller and compute
> - slurm services are running on the compute node when the controller says
>   it's down
> - TCP connections are not being dropped
> - Ports that are to be used for communication are accessible, specifically
>   response ports
> - Check the routing rules, if any
> - Clocks are synced across the nodes
> - Hope there isn't any version mismatch, but still have a look (it doesn't
>   recognize the nodes for major version differences)
>
> Hope this helps.
>
> Best,
> Rajul
>
> On Wed, Jul 5, 2017 at 10:52 AM, John Hearns <[email protected]> wrote:
>
>> Said,
>> a problem like this always has a simple cause. We share your
>> frustration, and several people here have offered help.
>> So please do not get discouraged. We have all been in your situation!
>>
>> The only way to handle problems like this is
>> a) start at the beginning and read the manuals and webpages closely
>> b) start at the lowest level, i.e. here the network, and do NOT assume
>>    that any component is working
>> c) look at all the log files closely
>> d) start daemon processes in a terminal with any 'verbose' flags set
>> e) then start on more low-level diagnostics, such as tcpdump of network
>>    adapters and straces of the processes and gstacks
>>
>> You have been doing steps a, b and c very well.
>> I suggest staying with these - I myself am going for Adam Huffman's
>> suggestion of the NTP clock times.
>> Are you SURE that on all nodes you have run the 'date' command and also
>> 'ntpq -p'?
>> Are you SURE the master node and the node OBU-N6 are both connecting to
>> an NTP server? ntpq -p will tell you that.
>>
>> And do not lose heart. This is how we all learn.
>>
>> On 5 July 2017 at 16:23, Said Mohamed Said <[email protected]> wrote:
>>
>>> sinfo -R gives "NODE IS NOT RESPONDING"
>>> ping gives successful results from both nodes
>>>
>>> I really can not figure out what is causing the problem.
>>>
>>> Regards,
>>> Said
>>> ------------------------------
>>> *From:* Felix Willenborg <[email protected]>
>>> *Sent:* Wednesday, July 5, 2017 9:07:05 PM
>>> *To:* slurm-dev
>>> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>>>
>>> When the nodes change to the down state, what is 'sinfo -R' saying?
>>> Sometimes it gives you a reason for that.
>>>
>>> Best,
>>> Felix
>>>
>>> On 05.07.2017 at 13:16, Said Mohamed Said wrote:
>>>
>>> Thank you Adam. For NTP, I did that as well before posting, but it
>>> didn't fix the issue.
>>>
>>> Regards,
>>> Said
>>> ------------------------------
>>> *From:* Adam Huffman <[email protected]>
>>> *Sent:* Wednesday, July 5, 2017 8:11:03 PM
>>> *To:* slurm-dev
>>> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>>>
>>> I've seen something similar when node clocks were skewed.
>>>
>>> Worth checking that NTP is running and they're all synchronised.
>>>
>>> On Wed, Jul 5, 2017 at 12:06 PM, Said Mohamed Said
>>> <[email protected]> wrote:
>>> > Thank you all for suggestions. I turned off firewall on both machines
>>> > but still no luck. I can confirm that no managed switch is preventing
>>> > the nodes from communicating. If you check the log file, there is
>>> > communication for about 4 mins and then the node state goes down.
>>> > Any other idea?
>>> > ________________________________
>>> > From: Ole Holm Nielsen <[email protected]>
>>> > Sent: Wednesday, July 5, 2017 7:07:15 PM
>>> > To: slurm-dev
>>> > Subject: [slurm-dev] Re: SLURM ERROR! NEED HELP
>>> >
>>> > On 07/05/2017 11:40 AM, Felix Willenborg wrote:
>>> >> in my network I encountered that managed switches were preventing
>>> >> necessary network communication between the nodes, on which SLURM
>>> >> relies. You should check if you're using managed switches to connect
>>> >> nodes to the network and if so, if they're blocking communication on
>>> >> slurm ports.
>>> >
>>> > Managed switches should permit IP layer 2 traffic just like unmanaged
>>> > switches! We only have managed Ethernet switches, and they work
>>> > without problems.
>>> >
>>> > Perhaps you meant that Ethernet switches may perform some firewall
>>> > functions by themselves?
>>> >
>>> > Firewalls must be off between Slurm compute nodes as well as the
>>> > controller host. See
>>> > https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons
>>> >
>>> > /Ole
