Said, thank you for letting us know. I'm going to blame this one on systemd. Just because I can.
On 6 July 2017 at 13:22, Said Mohamed Said <[email protected]> wrote:
> John and others,
>
> Thank you very much for your support. The problem is finally solved.
>
> After installing nmap, I realized that some ports were blocked even
> with the firewall daemon stopped and disabled. It turned out that
> iptables was on and enabled. After stopping iptables, everything works
> just fine.
>
> Best regards,
>
> Said.
> ------------------------------
> *From:* John Hearns <[email protected]>
> *Sent:* Thursday, July 6, 2017 6:47:48 PM
> *To:* slurm-dev
> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>
> Said, you are not out of ideas.
>
> I would suggest 'nmap' as a good tool to start with. Install nmap on
> your compute node and see which ports are open on the controller node.
>
> Also, do we have a DNS name resolution problem here?
> I always remember Sun Grid Engine as being notoriously sensitive to name
> resolution, and that was my first question when any SGE problem was
> reported.
> So a couple of questions:
>
> On the controller node and on the compute node, run this:
> hostname
> hostname -f
>
> Do the cluster controller node or the compute nodes have more than one
> network interface? I bet the cluster controller node does! From the
> compute node, do an nslookup or a dig and see what the COMPUTE NODE
> thinks are the names of both of those interfaces.
>
> Also, as Rajul says: how are you making sure that both controller and
> compute nodes have the same slurm.conf file?
> Actually, if the slurm.conf files are different this will be logged when
> the compute node starts up, but let us check everything.
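(A minimal sketch of the checks John describes, run from the compute node; "obu-controller" is a placeholder hostname, 6817/6818 are the default slurmctld/slurmd ports, and the slurm.conf path assumes the usual RPM layout:)

    # Are the Slurm ports on the controller reachable from the compute node?
    nmap -p 6817,6818 obu-controller
    # Name resolution, on both nodes; the names must match slurm.conf:
    hostname
    hostname -f
    dig +short obu-controller
    # Same slurm.conf everywhere? Compare this hash on controller and compute:
    md5sum /etc/slurm/slurm.conf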
> On 6 July 2017 at 11:37, Said Mohamed Said <[email protected]> wrote:
>
>> Even after reinstalling everything from the beginning, the problem is
>> still there. Right now I am out of ideas.
>>
>> Best regards,
>>
>> Said.
>> ------------------------------
>> *From:* Said Mohamed Said
>> *Sent:* Thursday, July 6, 2017 2:23:05 PM
>> *To:* slurm-dev
>> *Subject:* Re: [slurm-dev] Re: SLURM ERROR! NEED HELP
>>
>> Thank you all for your suggestions. The only thing I can do for now is to
>> uninstall and install from the beginning, and I will use the most recent
>> version of Slurm on both nodes.
>>
>> For Felix, who asked: the OS is CentOS 7.3 on both machines.
>>
>> I will let you know if that solves the issue.
>> ------------------------------
>> *From:* Rajul Kumar <[email protected]>
>> *Sent:* Thursday, July 6, 2017 12:41:51 AM
>> *To:* slurm-dev
>> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>>
>> Sorry for the typo.
>> It's generally when one of the controller or compute can reach the other
>> one but it's *not* happening vice versa.
>>
>> On Wed, Jul 5, 2017 at 11:38 AM, Rajul Kumar <[email protected]> wrote:
>>
>>> I came across the same problem some time back. It's generally when one
>>> of the controller or compute can reach the other one but it's happening
>>> vice versa.
>>>
>>> Have a look at the following points:
>>> - controller and compute can ping each other
>>> - both share the same slurm.conf
>>> - slurm.conf has the location of both controller and compute
>>> - Slurm services are running on the compute node when the controller
>>>   says it's down
>>> - TCP connections are not being dropped
>>> - the ports used for communication are accessible, specifically the
>>>   response ports
>>> - check the routing rules, if any
>>> - clocks are synced across the nodes
>>> - hopefully there isn't any version mismatch, but still have a look
>>>   (nodes are not recognized across major version differences)
>>>
>>> Hope this helps.
>>>
>>> Best,
>>> Rajul
>>>
>>> On Wed, Jul 5, 2017 at 10:52 AM, John Hearns <[email protected]> wrote:
>>>
>>>> Said,
>>>> a problem like this always has a simple cause. We share your
>>>> frustration, and several people here have offered help.
>>>> So please do not get discouraged. We have all been in your situation!
>>>>
>>>> The only way to handle problems like this is:
>>>> a) start at the beginning and read the manuals and web pages closely
>>>> b) start at the lowest level, i.e. here the network, and do NOT assume
>>>>    that any component is working
>>>> c) look at all the log files closely
>>>> d) start daemon processes in a terminal with any 'verbose' flags set
>>>> e) then move on to more low-level diagnostics, such as tcpdump of
>>>>    network adapters, straces of the processes, and gstacks
>>>>
>>>> You have been doing steps a, b, and c very well.
>>>> I suggest staying with these. I myself am going for Adam Huffman's
>>>> suggestion of the NTP clock times.
>>>> Are you SURE that on all nodes you have run the 'date' command and also
>>>> 'ntpq -p'?
>>>> Are you SURE the master node and the node OBU-N6 are both connecting
>>>> to an NTP server? 'ntpq -p' will tell you that.
>>>>
>>>> And do not lose heart. This is how we all learn.
>>>>
>>>> On 5 July 2017 at 16:23, Said Mohamed Said <[email protected]> wrote:
>>>>
>>>>> sinfo -R gives "NODE IS NOT RESPONDING".
>>>>> ping gives successful results from both nodes.
>>>>>
>>>>> I really cannot figure out what is causing the problem.
>>>>>
>>>>> Regards,
>>>>> Said
>>>>> ------------------------------
>>>>> *From:* Felix Willenborg <[email protected]>
>>>>> *Sent:* Wednesday, July 5, 2017 9:07:05 PM
>>>>> *To:* slurm-dev
>>>>> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>>>>>
>>>>> When the nodes change to the down state, what is 'sinfo -R' saying?
>>>>> Sometimes it gives you a reason for that.
>>>>>
>>>>> Best,
>>>>> Felix
>>>>>
>>>>> On 05.07.2017 at 13:16, Said Mohamed Said wrote:
>>>>>
>>>>> Thank you, Adam. For NTP, I did that as well before posting, but it
>>>>> didn't fix the issue.
>>>>>
>>>>> Regards,
>>>>> Said
>>>>> ------------------------------
>>>>> *From:* Adam Huffman <[email protected]>
>>>>> *Sent:* Wednesday, July 5, 2017 8:11:03 PM
>>>>> *To:* slurm-dev
>>>>> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>>>>>
>>>>> I've seen something similar when node clocks were skewed.
>>>>>
>>>>> Worth checking that NTP is running and they're all synchronised.
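(A minimal sketch of that clock check, run on every node; note that CentOS 7 uses chronyd by default, so chronyc is shown alongside ntpq:)

    date                 # wall clocks should agree across nodes to within seconds
    ntpq -p              # if ntpd is in use: lists peers and their sync state
    chronyc sources -v   # if chronyd is in use (the CentOS 7 default)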
>>>>> On Wed, Jul 5, 2017 at 12:06 PM, Said Mohamed Said <[email protected]> wrote:
>>>>> > Thank you all for the suggestions. I turned off the firewall on both
>>>>> > machines, but still no luck. I can confirm that no managed switch is
>>>>> > preventing the nodes from communicating. If you check the log file,
>>>>> > there is communication for about 4 minutes and then the node state
>>>>> > goes down.
>>>>> > Any other idea?
>>>>> > ________________________________
>>>>> > From: Ole Holm Nielsen <[email protected]>
>>>>> > Sent: Wednesday, July 5, 2017 7:07:15 PM
>>>>> > To: slurm-dev
>>>>> > Subject: [slurm-dev] Re: SLURM ERROR! NEED HELP
>>>>> >
>>>>> > On 07/05/2017 11:40 AM, Felix Willenborg wrote:
>>>>> >> In my network I encountered managed switches preventing necessary
>>>>> >> network communication between the nodes, on which SLURM relies. You
>>>>> >> should check if you're using managed switches to connect nodes to
>>>>> >> the network and, if so, whether they're blocking communication on
>>>>> >> Slurm ports.
>>>>> >
>>>>> > Managed switches should permit layer-2 traffic just like unmanaged
>>>>> > switches! We only have managed Ethernet switches, and they work
>>>>> > without problems.
>>>>> >
>>>>> > Perhaps you meant that Ethernet switches may perform some firewall
>>>>> > functions by themselves?
>>>>> >
>>>>> > Firewalls must be off between Slurm compute nodes as well as the
>>>>> > controller host. See
>>>>> > https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons
>>>>> >
>>>>> > /Ole
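(For the record, a minimal sketch of the firewall checks that, per Said's closing message at the top, finally cracked this one; on CentOS 7 firewalld and the iptables service are separate units, so both need checking. Disabling iptables is shown as a test only; the cleaner fix is to open the Slurm ports, as the wiki page above describes:)

    systemctl status firewalld    # the CentOS 7 default firewall daemon
    systemctl status iptables     # a separate service; can be enabled on its own
    iptables -L -n                # rules may be loaded even with firewalld stopped
    systemctl stop iptables       # test only; then re-run nmap from the compute node
    systemctl disable iptables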
