Alternatively you can:

systemctl disable firewalld.service
systemctl mask firewalld.service
yum install iptables-services
systemctl enable iptables.service ip6tables.service

then configure iptables in /etc/sysconfig/iptables and /etc/sysconfig/ip6tables, and finally:

systemctl start iptables.service ip6tables.service

The crucial part is to ensure that either firewalld *or* iptables is running, but not both. Or you could run without a firewall at all *if* you trust your network…

Am 06.07.2017 um 14:12 schrieb Ole Holm Nielsen:
>
> Firewall problems, like I suggested initially! Nmap is a great tool for
> probing open ports!
>
> iptables *must not* be configured on CentOS 7; you *must* use firewalld.
> See
> https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons
> for Slurm configurations.
>
> /Ole
>
> On 07/06/2017 01:22 PM, Said Mohamed Said wrote:
>> John and Others,
>>
>> Thank you very much for your support. The problem is finally solved.
>>
>> After installing nmap, I realized that some ports were blocked even
>> with the firewall daemon stopped and disabled. It turned out that
>> iptables was on and enabled. After stopping iptables, everything works
>> just fine.
>>
>> Best Regards,
>>
>> Said.
>>
>> ------------------------------------------------------------------------
>> *From:* John Hearns <[email protected]>
>> *Sent:* Thursday, July 6, 2017 6:47:48 PM
>> *To:* slurm-dev
>> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>> Said, you are not out of ideas.
>>
>> I would suggest 'nmap' as a good tool to start with. Install nmap on your
>> compute node and see which ports are open on the controller node.
>>
>> Also, do we have a DNS name resolution problem here?
>> I always remember Sun Grid Engine as being notoriously sensitive to name
>> resolution, and that was my first question when any SGE problem was
>> reported.
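[John's nmap suggestion can be made concrete along these lines — a sketch only: 6817 (slurmctld) and 6818 (slurmd) are Slurm's default ports, and `controller` is a placeholder for your controller's hostname; check SlurmctldPort/SlurmdPort in slurm.conf if yours differ.]

```shell
# From the compute node, probe the Slurm ports on the controller.
# "controller" is a placeholder hostname.
nmap -p 6817,6818 controller

# Quick alternative without installing nmap, using bash's /dev/tcp:
timeout 2 bash -c 'cat < /dev/null > /dev/tcp/controller/6817' \
  && echo "6817 open" || echo "6817 blocked or closed"
```

If nmap reports a port as "filtered", a firewall (firewalld or iptables) is dropping the traffic; "closed" means nothing is blocking it but no daemon is listening there.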
>> So a couple of questions:
>>
>> On the controller node and on the compute node run this:
>>   hostname
>>   hostname -f
>>
>> Do the cluster controller node or the compute nodes have more than one
>> network interface?
>> I bet the cluster controller node does! From the compute node, do an
>> nslookup or a dig and see what the COMPUTE NODE thinks are the names of
>> both of those interfaces.
>>
>> Also, as Rajul says - how are you making sure that both controller and
>> compute nodes have the same slurm.conf file?
>> Actually, if the slurm.conf files are different this will be logged when
>> the compute node starts up, but let us check everything.
>>
>> On 6 July 2017 at 11:37, Said Mohamed Said <[email protected]
>> <mailto:[email protected]>> wrote:
>>
>> Even after reinstalling everything from the beginning the problem is
>> still there. Right now I am out of ideas.
>>
>> Best Regards,
>>
>> Said.
>>
>> ------------------------------------------------------------------------
>> *From:* Said Mohamed Said
>> *Sent:* Thursday, July 6, 2017 2:23:05 PM
>> *To:* slurm-dev
>> *Subject:* Re: [slurm-dev] Re: SLURM ERROR! NEED HELP
>>
>> Thank you all for your suggestions; the only thing I can do for now
>> is to uninstall and install from the beginning, and I will use the
>> most recent version of Slurm on both nodes.
>>
>> For Felix who asked, the OS is CentOS 7.3 on both machines.
>>
>> I will let you know if that solves the issue.
>> ------------------------------------------------------------------------
>> *From:* Rajul Kumar <[email protected]
>> <mailto:[email protected]>>
>> *Sent:* Thursday, July 6, 2017 12:41:51 AM
>> *To:* slurm-dev
>> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>> Sorry for the typo.
>> It's generally when one of the controller or compute can reach the
>> other one but it's *not* happening vice-versa.
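[The hostname and name-resolution checks John asks for above can be run in one go on each node — a sketch; note that `getent hosts` resolves through /etc/nsswitch.conf (i.e. /etc/hosts first, then DNS) the way the Slurm daemons do, whereas nslookup and dig query DNS only.]

```shell
# Run on BOTH the controller and the compute node, and compare results.
hostname            # short host name
hostname -f         # fully qualified name
# Resolve the short name the way system libraries (and the Slurm
# daemons) do:
getent hosts "$(hostname)"
```

If the two nodes disagree about which address a name maps to (say, one resolves the controller to its public interface and the other to a private one), replies can leave via the wrong interface and the controller will mark the node down.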
>>
>> On Wed, Jul 5, 2017 at 11:38 AM, Rajul Kumar
>> <[email protected] <mailto:[email protected]>> wrote:
>>
>> I came across the same problem some time back. It's generally
>> when one of the controller or compute can reach to other one but
>> it's happening vice-versa.
>>
>> Have a look at the following points:
>> - controller and compute can ping each other
>> - both share the same slurm.conf
>> - slurm.conf has the location of both controller and compute
>> - slurm services are running on the compute node when the
>>   controller says it's down
>> - TCP connections are not being dropped
>> - the ports to be used for communication are accessible,
>>   specifically response ports
>> - check the routing rules, if any
>> - clocks are synced across nodes
>> - hopefully there isn't any version mismatch, but still have a look
>>   (nodes aren't recognized across major version differences)
>>
>> Hope this helps.
>>
>> Best,
>> Rajul
>>
>> On Wed, Jul 5, 2017 at 10:52 AM, John Hearns
>> <[email protected] <mailto:[email protected]>> wrote:
>>
>> Said,
>> a problem like this always has a simple cause. We share
>> your frustration, and several people here have offered help.
>> So please do not get discouraged. We have all been in your
>> situation!
>>
>> The only way to handle problems like this is:
>> a) start at the beginning and read the manuals and webpages closely
>> b) start at the lowest level, i.e. here the network, and do NOT
>>    assume that any component is working
>> c) look at all the log files closely
>> d) start daemon processes in a terminal with any 'verbose' flags set
>> e) then move on to more low-level diagnostics, such as tcpdump of
>>    network adapters, straces of the processes, and gstacks
>>
>> You have been doing steps a, b and c very well.
>> I suggest staying with these - I myself am going for Adam
>> Huffman's suggestion of the NTP clock times.
>> Are you SURE that on all nodes you have run the 'date'
>> command and also 'ntpq -p'?
>> Are you SURE the master node and the node OBU-N6 are both
>> connecting to an NTP server? ntpq -p will tell you that.
>>
>> And do not lose heart. This is how we all learn.
>>
>> On 5 July 2017 at 16:23, Said Mohamed Said
>> <[email protected] <mailto:[email protected]>> wrote:
>>
>> sinfo -R gives "NODE IS NOT RESPONDING"
>> ping gives successful results from both nodes
>>
>> I really cannot figure out what is causing the problem.
>>
>> Regards,
>> Said
>>
>> ------------------------------------------------------------------------
>> *From:* Felix Willenborg
>> <[email protected]
>> <mailto:[email protected]>>
>> *Sent:* Wednesday, July 5, 2017 9:07:05 PM
>> *To:* slurm-dev
>> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>> When the nodes change to the down state, what is 'sinfo -R'
>> saying? Sometimes it gives you a reason for that.
>>
>> Best,
>> Felix
>>
>> Am 05.07.2017 um 13:16 schrieb Said Mohamed Said:
>>> Thank you Adam. For NTP I did that as well before
>>> posting, but it didn't fix the issue.
>>>
>>> Regards,
>>> Said
>>>
>>> ------------------------------------------------------------------------
>>> *From:* Adam Huffman <[email protected]>
>>> <mailto:[email protected]>
>>> *Sent:* Wednesday, July 5, 2017 8:11:03 PM
>>> *To:* slurm-dev
>>> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>>>
>>> I've seen something similar when node clocks were skewed.
>>>
>>> Worth checking that NTP is running and they're all synchronised.
>>>
>>> On Wed, Jul 5, 2017 at 12:06 PM, Said Mohamed Said
>>> <[email protected]> <mailto:[email protected]> wrote:
>>> > Thank you all for the suggestions. I turned off the firewall on
>>> > both machines but still no luck. I can confirm that no managed
>>> > switch is preventing the nodes from communicating.
>>> > If you check the log file, there is communication for
>>> > about 4 minutes and then the node state goes down.
>>> > Any other idea?
>>> > ________________________________
>>> > From: Ole Holm Nielsen <[email protected]>
>>> <mailto:[email protected]>
>>> > Sent: Wednesday, July 5, 2017 7:07:15 PM
>>> > To: slurm-dev
>>> > Subject: [slurm-dev] Re: SLURM ERROR! NEED HELP
>>> >
>>> > On 07/05/2017 11:40 AM, Felix Willenborg wrote:
>>> >> in my network I encountered that managed switches were preventing
>>> >> necessary network communication between the nodes, on which SLURM
>>> >> relies. You should check if you're using managed switches to connect
>>> >> nodes to the network and, if so, whether they're blocking
>>> >> communication on Slurm ports.
>>> >
>>> > Managed switches should permit IP layer 2 traffic just like unmanaged
>>> > switches! We only have managed Ethernet switches, and they work
>>> > without problems.
>>> >
>>> > Perhaps you meant that Ethernet switches may perform some firewall
>>> > functions by themselves?
>>> >
>>> > Firewalls must be off between Slurm compute nodes as well as the
>>> > controller host. See
>>> > https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons
>>> >
>>> > /Ole
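[For anyone who follows Ole's advice and keeps firewalld running rather than masking it, opening the Slurm daemon ports is the usual first step — a sketch only: 6817/6818 are Slurm's defaults, so check SlurmctldPort and SlurmdPort in your slurm.conf before copying these.]

```shell
# On the controller: allow slurmctld's listening port.
firewall-cmd --permanent --add-port=6817/tcp
# On every compute node: allow slurmd's listening port.
firewall-cmd --permanent --add-port=6818/tcp
# Reload so the permanent rules take effect.
firewall-cmd --reload
```

Note Rajul's point about response ports: interactive srun traffic comes back on ephemeral ports, so a firewalled cluster may additionally need a fixed SrunPortRange in slurm.conf with that range opened as well.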
