[slurm-dev] Re: SLURM ERROR! NEED HELP

Rajul Kumar Wed, 05 Jul 2017 08:39:12 -0700

I came across the same problem some time back. It generally happens when
either the controller or the compute node can reach the other, but not
vice versa.


Have a look at the following points:
- controller and compute can ping each other
- both share the same slurm.conf
- slurm.conf has the location of both controller and compute
- slurm services are running on the compute node when the controller says
it's down
- TCP connections are not being dropped
- The ports used for communication are accessible, specifically the
response ports
- Check the routing rules, if any
- Clocks are synced across all nodes
- There is hopefully no version mismatch, but have a look anyway (nodes
are not recognized when the major versions differ)
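
For the port check, a minimal Python sketch along these lines may help. It
assumes the default Slurm ports (SlurmctldPort 6817 and SlurmdPort 6818);
adjust them if your slurm.conf sets different values. The helper names
port_open and check_targets are my own and not part of Slurm, and the
hostname "controller" in the usage comment is a placeholder.

```python
import socket

# Default Slurm ports (SlurmctldPort / SlurmdPort in slurm.conf).
# Assumption: your slurm.conf does not override them.
SLURMCTLD_PORT = 6817
SLURMD_PORT = 6818


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_targets(targets, timeout=3.0):
    """Map each (host, port) pair to its True/False reachability."""
    return {(host, port): port_open(host, port, timeout) for host, port in targets}


# Usage (hostnames are placeholders, replace with your own):
#   on the controller:   check_targets([("OBU-N6", SLURMD_PORT)])
#   on the compute node: check_targets([("controller", SLURMCTLD_PORT)])
```

Run it from both sides, since a ping that succeeds does not show that the
TCP ports are reachable in each direction.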

Hope this helps.

Best,
Rajul

On Wed, Jul 5, 2017 at 10:52 AM, John Hearns <[email protected]> wrote:

> Said,
>    a problem like this always has a simple cause. We share your
> frustration, and several people here have offered help.
> So please do not get discouraged. We have all been in your situation!
>
> The only way to handle problems like this is
> a) start at the beginning and read the manuals and webpages closely
> b) start at the lowest level, ie here the network and do NOT assume that
> any component is working
> c) look at all the log files closely
> d) start daemon processes in a terminal with any 'verbose' flags set
> e) then start on more low-level diagnostics, such as tcpdump of network
> adapters and straces of the processes and gstacks
>
>
> You have been doing steps a, b and c very well.
> I suggest staying with these - I myself am going for Adam Huffman's
> suggestion of the NTP clock times.
> Are you SURE that on all nodes you have run the 'date' command and also
> 'ntpq -p'?
> Are you SURE the master node and the node OBU-N6 are both connecting to
> an NTP server? ntpq -p will tell you that.
>
>
> And do not lose heart.  This is how we all learn.
>
> On 5 July 2017 at 16:23, Said Mohamed Said <[email protected]> wrote:
>
>> sinfo -R gives "NODE IS NOT RESPONDING"
>> ping gives successful results from both nodes
>>
>> I really can not figure out what is causing the problem.
>>
>> Regards,
>> Said
>> ------------------------------
>> *From:* Felix Willenborg <[email protected]>
>> *Sent:* Wednesday, July 5, 2017 9:07:05 PM
>>
>> *To:* slurm-dev
>> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>>
>> When the nodes change to the down state, what is 'sinfo -R' saying?
>> Sometimes it gives you a reason for that.
>>
>> Best,
>> Felix
>>
>> Am 05.07.2017 um 13:16 schrieb Said Mohamed Said:
>>
>> Thank you, Adam. I checked NTP as well before posting, but it didn't fix
>> the issue.
>>
>> Regards,
>> Said
>> ------------------------------
>> *From:* Adam Huffman <[email protected]>
>> *Sent:* Wednesday, July 5, 2017 8:11:03 PM
>> *To:* slurm-dev
>> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>>
>>
>> I've seen something similar when node clocks were skewed.
>>
>> Worth checking that NTP is running and they're all synchronised.
>>
>> On Wed, Jul 5, 2017 at 12:06 PM, Said Mohamed Said <[email protected]> wrote:
>> > Thank you all for the suggestions. I turned off the firewall on both
>> > machines but still no luck. I can confirm that no managed switch is
>> > preventing the nodes from communicating. If you check the log file, there
>> > is communication for about 4 minutes and then the node state goes down.
>> > Any other idea?
>> > ________________________________
>> > From: Ole Holm Nielsen <[email protected]>
>> > Sent: Wednesday, July 5, 2017 7:07:15 PM
>> > To: slurm-dev
>> > Subject: [slurm-dev] Re: SLURM ERROR! NEED HELP
>> >
>> >
>> > On 07/05/2017 11:40 AM, Felix Willenborg wrote:
>> >> in my network I encountered that managed switches were preventing
>> >> necessary network communication between the nodes, on which SLURM
>> >> relies. You should check if you're using managed switches to connect
>> >> nodes to the network and if so, if they're blocking communication on
>> >> slurm ports.
>> >
>> > Managed switches should permit IP layer 2 traffic just like unmanaged
>> > switches!  We only have managed Ethernet switches, and they work without
>> > problems.
>> >
>> > Perhaps you meant that Ethernet switches may perform some firewall
>> > functions by themselves?
>> >
>> > Firewalls must be off between Slurm compute nodes as well as the
>> > controller host.  See
>> > https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons
>> >
>> > /Ole
>>
>>
>>
>
