Alternatively, you can:

  systemctl disable firewalld.service

  systemctl mask firewalld.service

  yum install iptables-services

  systemctl enable iptables.service ip6tables.service

and configure iptables in /etc/sysconfig/iptables and 
/etc/sysconfig/ip6tables, then

  systemctl start iptables.service ip6tables.service
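
For reference, a minimal /etc/sysconfig/iptables might look like the sketch 
below. This is only a sketch: 6817 and 6818 are the Slurm defaults 
(SlurmctldPort/SlurmdPort), 10.0.0.0/24 is just a placeholder for your 
cluster subnet, and if you set SrunPortRange in slurm.conf those ports need 
to be opened as well:

  *filter
  :INPUT ACCEPT [0:0]
  :FORWARD ACCEPT [0:0]
  :OUTPUT ACCEPT [0:0]
  -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
  -A INPUT -p icmp -j ACCEPT
  -A INPUT -i lo -j ACCEPT
  -A INPUT -p tcp -m state --state NEW -m tcp --dport 22 -j ACCEPT
  # Slurm daemons: slurmctld (6817) and slurmd (6818), cluster subnet only
  -A INPUT -s 10.0.0.0/24 -p tcp -m state --state NEW -m tcp --dport 6817:6818 -j ACCEPT
  -A INPUT -j REJECT --reject-with icmp-host-prohibited
  -A FORWARD -j REJECT --reject-with icmp-host-prohibited
  COMMIT

/etc/sysconfig/ip6tables takes the same format (use icmp6-adm-prohibited as 
the reject type).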



The crucial part is to ensure that either firewalld *or* iptables is running, 
but not both. Or you could run without a firewall at
all *if* you trust your network…
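
A quick sanity check of which of the two is actually active (plain systemd 
and iptables commands, nothing Slurm-specific):

  systemctl is-active firewalld.service iptables.service ip6tables.service

  iptables -S    # rules currently loaded in the kernel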




On 06.07.2017 at 14:12, Ole Holm Nielsen wrote:
> 
> Firewall problems, like I suggested initially!  Nmap is a great tool for 
> probing open ports!
> 
> iptables *must not* be configured on CentOS 7; you *must* use firewalld.  
> See
> https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons
>  for Slurm configurations.
> 
> /Ole
> 
> On 07/06/2017 01:22 PM, Said Mohamed Said wrote:
>> John and Others,
>>
>>
>> Thank you very much for your support. The problem is finally solved.
>>
>>
>> After installing nmap, I realized that some ports were blocked even 
>> with the firewall daemon (firewalld) stopped and disabled. It turned out
>> that iptables was on and enabled. After stopping iptables, everything works 
>> just fine.
>>
>>
>>
>> Best Regards,
>>
>>
>> Said.
>>
>> ------------------------------------------------------------------------
>> *From:* John Hearns <[email protected]>
>> *Sent:* Thursday, July 6, 2017 6:47:48 PM
>> *To:* slurm-dev
>> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>> Said, you are not out of ideas.
>>
>> I would suggest 'nmap' as a good tool to start with.  Install nmap on your 
>> compute node and see which ports are open on the controller node.
>>
>> Also, do we have a DNS name resolution problem here?
>> I always remember Sun Grid Engine as being notoriously sensitive to name 
>> resolution, and that was my first question when any SGE problem was
>> reported.
>> So a couple of questions:
>>
>> On the controller node and on the compute node run this:
>> hostname
>> hostname -f
>>
>> Do the cluster controller node or the compute nodes have more than one 
>> network interface? I bet the cluster controller node does!  From the 
>> compute node, do an nslookup or a dig and see what the COMPUTE NODE thinks
>> are the names of both of those interfaces.
>>
>> Also, as Rajul says, how are you making sure that both controller and 
>> compute nodes have the same slurm.conf file?
>> Actually, if the slurm.conf files are different this will be logged when the 
>> compute node starts up, but let us check everything.
>>
>>
>> On 6 July 2017 at 11:37, Said Mohamed Said <[email protected]> wrote:
>>
>>     Even after reinstalling everything from the beginning, the problem is
>>     still there. Right now I am out of ideas.
>>
>>
>>
>>
>>     Best Regards,
>>
>>
>>     Said.
>>
>>     ------------------------------------------------------------------------
>>     *From:* Said Mohamed Said
>>     *Sent:* Thursday, July 6, 2017 2:23:05 PM
>>     *To:* slurm-dev
>>     *Subject:* Re: [slurm-dev] Re: SLURM ERROR! NEED HELP
>>
>>     Thank you all for your suggestions. The only thing I can do for now
>>     is to uninstall and install from the beginning, and I will use the
>>     most recent version of Slurm on both nodes.
>>
>>     For Felix who asked, the OS is CentOS 7.3 on both machines.
>>
>>     I will let you know if that can solve the issue.
>>     ------------------------------------------------------------------------
>>     *From:* Rajul Kumar <[email protected]>
>>     *Sent:* Thursday, July 6, 2017 12:41:51 AM
>>     *To:* slurm-dev
>>     *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>>     Sorry for the typo.
>>     It's generally when one of the controller or compute nodes can reach
>>     the other one but it's *not* happening vice-versa.
>>
>>
>>     On Wed, Jul 5, 2017 at 11:38 AM, Rajul Kumar
>>     <[email protected]> wrote:
>>
>>         I came across the same problem some time back. It's generally
>>         when one of the controller or compute nodes can reach the other one
>>         but it's happening vice-versa.
>>
>>         Have a look at the following points:
>>         - controller and compute can ping to each other
>>         - both share the same slurm.conf
>>         - slurm.conf has the location of both controller and compute
>>         - slurm services are running on the compute node when the
>>         controller says it's down
>>         - TCP connections are not being dropped
>>         - Ports are accessible that are to be used for communication,
>>         specifically response ports
>>         - Check the routing rules if any
>>         - Clocks are synced across nodes
>>         - Hope there isn't any version mismatch, but still have a look
>>         (nodes aren't recognized across major version differences)
>>
>>         Hope this helps.
>>
>>         Best,
>>         Rajul
>>
>>         On Wed, Jul 5, 2017 at 10:52 AM, John Hearns
>>         <[email protected]> wrote:
>>
>>             Said,
>>                 a problem like this always has a simple cause. We share
>>             your frustration, and several people here have offered help.
>>             So please do not get discouraged. We have all been in your
>>             situation!
>>
>>             The only way to handle problems like this is
>>             a) start at the beginning and read the manuals and webpages
>>             closely
>>             b) start at the lowest level, i.e. here the network, and do NOT
>>             assume that any component is working
>>             c) look at all the log files closely
>>             d) start daemon processes in a terminal with any 'verbose'
>>             flags set
>>             e) then start on more low-level diagnostics, such as tcpdump
>>             of network adapters and straces of the processes and gstacks
>>
>>
>>             You have been doing steps a, b and c very well.
>>             I suggest staying with these - I myself am going for Adam
>>             Huffman's suggestion of the NTP clock times.
>>             Are you SURE that on all nodes you have run the 'date'
>>             command and also 'ntpq -p'?
>>             Are you SURE the master node and the node OBU-N6 are both
>>             connecting to an NTP server?  ntpq -p will tell you that.
>>
>>
>>             And do not lose heart.  This is how we all learn.
>>
>>             On 5 July 2017 at 16:23, Said Mohamed Said
>>             <[email protected]> wrote:
>>
>>                 sinfo -R gives "NODE IS NOT RESPONDING".
>>                 ping gives successful results from both nodes.
>>
>>                 I really cannot figure out what is causing the problem.
>>
>>                 Regards,
>>                 Said
>>                 
>> ------------------------------------------------------------------------
>>                 *From:* Felix Willenborg
>>                 <[email protected]>
>>                 *Sent:* Wednesday, July 5, 2017 9:07:05 PM
>>
>>                 *To:* slurm-dev
>>                 *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>>                 When the nodes change to the down state, what is 'sinfo
>>                 -R' saying? Sometimes it gives you a reason for that.
>>
>>                 Best,
>>                 Felix
>>
>>                 On 05.07.2017 at 13:16, Said Mohamed Said wrote:
>>>                 Thank you, Adam. For NTP I did that as well before
>>>                 posting, but it didn't fix the issue.
>>>
>>>                 Regards,
>>>                 Said
>>>                 
>>> ------------------------------------------------------------------------
>>>                 *From:* Adam Huffman <[email protected]>
>>>                 *Sent:* Wednesday, July 5, 2017 8:11:03 PM
>>>                 *To:* slurm-dev
>>>                 *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>>>
>>>                 I've seen something similar when node clocks were skewed.
>>>
>>>                 Worth checking that NTP is running and they're all
>>>                 synchronised.
>>>
>>>                 On Wed, Jul 5, 2017 at 12:06 PM, Said Mohamed Said
>>>                 <[email protected]>
>>>                 wrote:
>>>                 > Thank you all for suggestions. I turned off the firewall
>>>                 > on both machines but still no luck. I can confirm that no
>>>                 > managed switch is preventing the nodes from communicating.
>>>                 > If you check the log file, there is communication for
>>>                 > about 4 minutes and then the node state goes down.
>>>                 > Any other idea?
>>>                 > ________________________________
>>>                 > From: Ole Holm Nielsen <[email protected]>
>>>                 > Sent: Wednesday, July 5, 2017 7:07:15 PM
>>>                 > To: slurm-dev
>>>                 > Subject: [slurm-dev] Re: SLURM ERROR! NEED HELP
>>>                 >
>>>                 >
>>>                 > On 07/05/2017 11:40 AM, Felix Willenborg wrote:
>>>                 >> In my network I encountered managed switches that were
>>>                 >> preventing necessary network communication between the
>>>                 >> nodes, on which SLURM relies. You should check whether
>>>                 >> you're using managed switches to connect nodes to the
>>>                 >> network and, if so, whether they're blocking
>>>                 >> communication on Slurm ports.
>>>                 >
>>>                 > Managed switches should permit layer 2 traffic just like
>>>                 > unmanaged switches!  We only have managed Ethernet
>>>                 > switches, and they work without problems.
>>>                 >
>>>                 > Perhaps you meant that Ethernet switches may perform some
>>>                 > firewall functions by themselves?
>>>                 >
>>>                 > Firewalls must be off between Slurm compute nodes as well
>>>                 > as the controller host.  See
>>>                 > https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons
>>>                 >
>>>                 > /Ole
>>>                 > /Ole
>>
>>
>>
>>
