[slurm-dev] Re: SLURM ERROR! NEED HELP

Rajul Kumar Wed, 05 Jul 2017 08:40:33 -0700

Sorry for the typo.
It generally happens when one of the controller or the compute node can
reach the other one, but *not* vice versa.
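
A minimal sketch of how that asymmetry could be tested at the TCP level, to be run once on the controller and once on the compute node. It assumes the default Slurm ports (6817 for slurmctld, 6818 for slurmd; check SlurmctldPort and SlurmdPort in your slurm.conf), and the helper name can_connect is made up for illustration. A successful ping does not show that these TCP ports are reachable.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


# Assumed defaults; override if slurm.conf sets other values.
SLURMCTLD_PORT = 6817  # SlurmctldPort
SLURMD_PORT = 6818     # SlurmdPort
```

On the compute node, can_connect("<controller hostname>", SLURMCTLD_PORT) should return True; on the controller, can_connect("<compute hostname>", SLURMD_PORT) should return True. If only one direction works, that points at a firewall, routing, or name-resolution problem in the other direction.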



On Wed, Jul 5, 2017 at 11:38 AM, Rajul Kumar <[email protected]>
wrote:

> I came across the same problem sometime back. It's generally when one of
> the controller or compute can reach to other one but it's happening
> vice-versa.
>
> Have a look at the following points:
> - controller and compute can ping each other
> - both share the same slurm.conf
> - slurm.conf has the location of both controller and compute
> - slurm services are running on the compute node when the controller says
> it's down
> - TCP connections are not being dropped
> - Ports that are to be used for communication are accessible, specifically
> response ports
> - Check the routing rules, if any
> - Clocks are synced across all nodes
> - There is probably no version mismatch, but still have a look (nodes are
> not recognized across major version differences)
>
> Hope this helps.
>
> Best,
> Rajul
>
> On Wed, Jul 5, 2017 at 10:52 AM, John Hearns <[email protected]>
> wrote:
>
>> Said,
>>    a problem like this always has a simple cause. We share your
>> frustration, and several people here have offered help.
>> So please do not get discouraged. We have all been in your situation!
>>
>> The only way to handle problems like this is
>> a) start at the beginning and read the manuals and webpages closely
>> b) start at the lowest level, i.e. here the network, and do NOT assume that
>> any component is working
>> c) look at all the log files closely
>> d) start daemon processes in a terminal with any 'verbose' flags set
>> e) then start on more low-level diagnostics, such as tcpdump of network
>> adapters and straces of the processes and gstacks
>>
>>
>> You have been doing steps a, b and c very well.
>> I suggest staying with these - I myself am going for Adam Huffman's
>> suggestion of the NTP clock times.
>> Are you SURE that on all nodes you have run the 'date' command and also
>> 'ntpq -p'?
>> Are you SURE the master node and the node OBU-N6 are both connecting to
>> an NTP server? ntpq -p will tell you that.
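
One way to compare the 'date' output across nodes, as a minimal sketch: collect the output of `date +%s` from each node at roughly the same moment and flag any node whose clock differs from a reference node by more than a tolerance. The function name, the node name "master" and the 60-second tolerance are illustrative assumptions, not Slurm settings.

```python
def skewed_nodes(epoch_by_node, reference, tolerance_s=60):
    """Return {node: offset_seconds} for nodes whose clock differs from
    the reference node's clock by more than tolerance_s seconds."""
    ref = epoch_by_node[reference]
    return {
        node: ts - ref
        for node, ts in epoch_by_node.items()
        if node != reference and abs(ts - ref) > tolerance_s
    }
```

For example, skewed_nodes({"master": 1000, "OBU-N6": 1400}, "master") reports OBU-N6 as 400 seconds ahead. The timestamps are only as simultaneous as the way they were collected, so treat small offsets as noise.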
>>
>>
>> And do not lose heart.  This is how we all learn.
>>
>> On 5 July 2017 at 16:23, Said Mohamed Said <[email protected]> wrote:
>>
>>> sinfo -R gives "NODE IS NOT RESPONDING"
>>> ping gives successful results from both nodes
>>>
>>> I really cannot figure out what is causing the problem.
>>>
>>> Regards,
>>> Said
>>> ------------------------------
>>> *From:* Felix Willenborg <[email protected]>
>>> *Sent:* Wednesday, July 5, 2017 9:07:05 PM
>>>
>>> *To:* slurm-dev
>>> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>>>
>>> When the nodes change to the down state, what is 'sinfo -R' saying?
>>> Sometimes it gives you a reason for that.
>>>
>>> Best,
>>> Felix
>>>
>>> Am 05.07.2017 um 13:16 schrieb Said Mohamed Said:
>>>
>>> Thank you Adam. I checked NTP as well before posting, but it didn't fix
>>> the issue.
>>>
>>> Regards,
>>> Said
>>> ------------------------------
>>> *From:* Adam Huffman <[email protected]>
>>> *Sent:* Wednesday, July 5, 2017 8:11:03 PM
>>> *To:* slurm-dev
>>> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>>>
>>>
>>> I've seen something similar when node clocks were skewed.
>>>
>>> Worth checking that NTP is running and they're all synchronised.
>>>
>>> On Wed, Jul 5, 2017 at 12:06 PM, Said Mohamed Said
>>> <[email protected]> wrote:
>>> > Thank you all for the suggestions. I turned off the firewall on both
>>> > machines but still no luck. I can confirm that no managed switch is
>>> > preventing the nodes from communicating. If you check the log file,
>>> > there is communication for about 4 minutes and then the node state
>>> > goes down.
>>> > Any other ideas?
>>> > ________________________________
>>> > From: Ole Holm Nielsen <[email protected]>
>>> > Sent: Wednesday, July 5, 2017 7:07:15 PM
>>> > To: slurm-dev
>>> > Subject: [slurm-dev] Re: SLURM ERROR! NEED HELP
>>> >
>>> >
>>> > On 07/05/2017 11:40 AM, Felix Willenborg wrote:
>>> >> in my network I encountered that managed switches were preventing
>>> >> necessary network communication between the nodes, on which SLURM
>>> >> relies. You should check if you're using managed switches to connect
>>> >> nodes to the network and if so, if they're blocking communication on
>>> >> slurm ports.
>>> >
>>> > Managed switches should permit IP layer 2 traffic just like unmanaged
>>> > switches!  We only have managed Ethernet switches, and they work
>>> > without problems.
>>> >
>>> > Perhaps you meant that Ethernet switches may perform some firewall
>>> > functions by themselves?
>>> >
>>> > Firewalls must be off between Slurm compute nodes as well as the
>>> > controller host.  See
>>> > https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons
>>> >
>>> > /Ole
>>>
>>>
>>>
>>
>
