Said, thank you for letting us know. I'm going to blame this one on systemd. Just because I can.
On 6 July 2017 at 13:22, Said Mohamed Said <[email protected]> wrote:
> John and others,
>
> Thank you very much for your support. The problem is finally solved.
>
> After installing nmap, I realized that some ports were blocked even
> with the firewall daemon stopped and disabled. It turned out that
> iptables was on and enabled. After stopping iptables, everything works
> just fine.
>
> Best regards,
>
> Said.
> ------------------------------
> *From:* John Hearns <[email protected]>
> *Sent:* Thursday, July 6, 2017 6:47:48 PM
> *To:* slurm-dev
> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>
> Said, you are not out of ideas.
>
> I would suggest 'nmap' as a good tool to start with. Install nmap on
> your compute node and see which ports are open on the controller node.
>
> Also, do we have a DNS name resolution problem here?
> I always remember Sun Grid Engine as being notoriously sensitive to name
> resolution, and that was my first question when any SGE problem was
> reported.
> So a couple of questions:
>
> On the controller node and on the compute node, run this:
> hostname
> hostname -f
>
> Do the cluster controller node or the compute nodes have more than one
> network interface? I bet the cluster controller node does! From the
> compute node, do an nslookup or a dig and see what the COMPUTE NODE
> thinks are the names of both of those interfaces.
>
> Also, as Rajul says: how are you making sure that both controller and
> compute nodes have the same slurm.conf file?
> Actually, if the slurm.conf files are different this will be logged when
> the compute node starts up, but let us check everything.
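(A minimal sketch of the checks John describes, run from the compute node; "obu-controller" is a placeholder hostname, 6817/6818 are the default slurmctld/slurmd ports, and the slurm.conf path assumes the usual RPM layout:)

    # Are the Slurm ports on the controller reachable from the compute node?
    nmap -p 6817,6818 obu-controller
    # Name resolution, on both nodes; the names must match slurm.conf:
    hostname
    hostname -f
    dig +short obu-controller
    # Same slurm.conf everywhere? Compare this hash on controller and compute:
    md5sum /etc/slurm/slurm.conf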
> On 6 July 2017 at 11:37, Said Mohamed Said <[email protected]> wrote:
>
>> Even after reinstalling everything from the beginning, the problem is
>> still there. Right now I am out of ideas.
>>
>> Best regards,
>>
>> Said.
>> ------------------------------
>> *From:* Said Mohamed Said
>> *Sent:* Thursday, July 6, 2017 2:23:05 PM
>> *To:* slurm-dev
>> *Subject:* Re: [slurm-dev] Re: SLURM ERROR! NEED HELP
>>
>> Thank you all for your suggestions. The only thing I can do for now is to
>> uninstall and install from the beginning, and I will use the most recent
>> version of Slurm on both nodes.
>>
>> For Felix, who asked: the OS is CentOS 7.3 on both machines.
>>
>> I will let you know if that solves the issue.
>> ------------------------------
>> *From:* Rajul Kumar <[email protected]>
>> *Sent:* Thursday, July 6, 2017 12:41:51 AM
>> *To:* slurm-dev
>> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>>
>> Sorry for the typo.
>> It's generally when one of the controller or compute can reach the other
>> one but it's *not* happening vice versa.
>>
>> On Wed, Jul 5, 2017 at 11:38 AM, Rajul Kumar <[email protected]> wrote:
>>
>>> I came across the same problem some time back. It's generally when one
>>> of the controller or compute can reach the other one but it's happening
>>> vice versa.
>>>
>>> Have a look at the following points:
>>> - controller and compute can ping each other
>>> - both share the same slurm.conf
>>> - slurm.conf has the location of both controller and compute
>>> - Slurm services are running on the compute node when the controller
>>>   says it's down
>>> - TCP connections are not being dropped
>>> - the ports used for communication are accessible, specifically the
>>>   response ports
>>> - check the routing rules, if any
>>> - clocks are synced across the nodes
>>> - hopefully there isn't any version mismatch, but still have a look
>>>   (nodes are not recognized across major version differences)
>>>
>>> Hope this helps.
>>>
>>> Best,
>>> Rajul
>>>
>>> On Wed, Jul 5, 2017 at 10:52 AM, John Hearns <[email protected]> wrote:
>>>
>>>> Said,
>>>> a problem like this always has a simple cause. We share your
>>>> frustration, and several people here have offered help.
>>>> So please do not get discouraged. We have all been in your situation!
>>>>
>>>> The only way to handle problems like this is:
>>>> a) start at the beginning and read the manuals and web pages closely
>>>> b) start at the lowest level, i.e. here the network, and do NOT assume
>>>>    that any component is working
>>>> c) look at all the log files closely
>>>> d) start daemon processes in a terminal with any 'verbose' flags set
>>>> e) then move on to more low-level diagnostics, such as tcpdump of
>>>>    network adapters, straces of the processes, and gstacks
>>>>
>>>> You have been doing steps a, b, and c very well.
>>>> I suggest staying with these. I myself am going for Adam Huffman's
>>>> suggestion of the NTP clock times.
>>>> Are you SURE that on all nodes you have run the 'date' command and also
>>>> 'ntpq -p'?
>>>> Are you SURE the master node and the node OBU-N6 are both connecting
>>>> to an NTP server? 'ntpq -p' will tell you that.
>>>>
>>>> And do not lose heart. This is how we all learn.
>>>>
>>>> On 5 July 2017 at 16:23, Said Mohamed Said <[email protected]> wrote:
>>>>
>>>>> sinfo -R gives "NODE IS NOT RESPONDING".
>>>>> ping gives successful results from both nodes.
>>>>>
>>>>> I really cannot figure out what is causing the problem.
>>>>>
>>>>> Regards,
>>>>> Said
>>>>> ------------------------------
>>>>> *From:* Felix Willenborg <[email protected]>
>>>>> *Sent:* Wednesday, July 5, 2017 9:07:05 PM
>>>>> *To:* slurm-dev
>>>>> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>>>>>
>>>>> When the nodes change to the down state, what is 'sinfo -R' saying?
>>>>> Sometimes it gives you a reason for that.
>>>>>
>>>>> Best,
>>>>> Felix
>>>>>
>>>>> On 05.07.2017 at 13:16, Said Mohamed Said wrote:
>>>>>
>>>>> Thank you, Adam. For NTP, I did that as well before posting, but it
>>>>> didn't fix the issue.
>>>>>
>>>>> Regards,
>>>>> Said
>>>>> ------------------------------
>>>>> *From:* Adam Huffman <[email protected]>
>>>>> *Sent:* Wednesday, July 5, 2017 8:11:03 PM
>>>>> *To:* slurm-dev
>>>>> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>>>>>
>>>>> I've seen something similar when node clocks were skewed.
>>>>>
>>>>> Worth checking that NTP is running and they're all synchronised.
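(A minimal sketch of that clock check, run on every node; note that CentOS 7 uses chronyd by default, so chronyc is shown alongside ntpq:)

    date                 # wall clocks should agree across nodes to within seconds
    ntpq -p              # if ntpd is in use: lists peers and their sync state
    chronyc sources -v   # if chronyd is in use (the CentOS 7 default)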
>>>>> On Wed, Jul 5, 2017 at 12:06 PM, Said Mohamed Said <[email protected]> wrote:
>>>>> > Thank you all for the suggestions. I turned off the firewall on both
>>>>> > machines, but still no luck. I can confirm that no managed switch is
>>>>> > preventing the nodes from communicating. If you check the log file,
>>>>> > there is communication for about 4 minutes and then the node state
>>>>> > goes down.
>>>>> > Any other idea?
>>>>> > ________________________________
>>>>> > From: Ole Holm Nielsen <[email protected]>
>>>>> > Sent: Wednesday, July 5, 2017 7:07:15 PM
>>>>> > To: slurm-dev
>>>>> > Subject: [slurm-dev] Re: SLURM ERROR! NEED HELP
>>>>> >
>>>>> > On 07/05/2017 11:40 AM, Felix Willenborg wrote:
>>>>> >> In my network I encountered managed switches preventing necessary
>>>>> >> network communication between the nodes, on which SLURM relies. You
>>>>> >> should check if you're using managed switches to connect nodes to
>>>>> >> the network and, if so, whether they're blocking communication on
>>>>> >> Slurm ports.
>>>>> >
>>>>> > Managed switches should permit layer-2 traffic just like unmanaged
>>>>> > switches! We only have managed Ethernet switches, and they work
>>>>> > without problems.
>>>>> >
>>>>> > Perhaps you meant that Ethernet switches may perform some firewall
>>>>> > functions by themselves?
>>>>> >
>>>>> > Firewalls must be off between Slurm compute nodes as well as the
>>>>> > controller host. See
>>>>> > https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons
>>>>> >
>>>>> > /Ole
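(For the record, a minimal sketch of the firewall checks that, per Said's closing message at the top, finally cracked this one; on CentOS 7 firewalld and the iptables service are separate units, so both need checking. Disabling iptables is shown as a test only; the cleaner fix is to open the Slurm ports, as the wiki page above describes:)

    systemctl status firewalld    # the CentOS 7 default firewall daemon
    systemctl status iptables     # a separate service; can be enabled on its own
    iptables -L -n                # rules may be loaded even with firewalld stopped
    systemctl stop iptables       # test only; then re-run nmap from the compute node
    systemctl disable iptables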
