John and Others,
Thank you very much for your support. The problem is finally solved.
Installing nmap made me realize that some ports were still blocked
even with the firewall daemon stopped and disabled. It turned out that
iptables was on and enabled. After stopping iptables, everything works
just fine.
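For anyone hitting the same symptom, here is a quick way to check from the compute node whether the controller's Slurm ports are reachable, without installing anything, using bash's built-in /dev/tcp redirection. This is a sketch assuming Slurm's default ports 6817 (slurmctld) and 6818 (slurmd); the demo probes 127.0.0.1 so it is self-contained, but on a real cluster you would point it at the controller's hostname.

```shell
#!/usr/bin/env bash
# probe_port HOST PORT -> prints "HOST:PORT open" or "HOST:PORT closed"
probe_port() {
  local host=$1 port=$2
  if timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "${host}:${port} open"
  else
    echo "${host}:${port} closed"
  fi
}

# Default Slurm ports; on a real cluster replace 127.0.0.1 with the
# controller's hostname and run this from the compute node.
probe_port 127.0.0.1 6817   # slurmctld
probe_port 127.0.0.1 6818   # slurmd
```

If both firewalls claim to be off but a port still shows "closed", something else (another packet filter, a switch ACL) is in the path.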
Best Regards,
Said.
------------------------------------------------------------------------
*From:* John Hearns <hear...@googlemail.com>
*Sent:* Thursday, July 6, 2017 6:47:48 PM
*To:* slurm-dev
*Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
Said, you are not out of ideas.
I would suggest 'nmap' as a good tool to start with. Install nmap on
your compute node and see which ports are open on the controller node.
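A minimal sketch of that scan, with a fallback if nmap isn't installed yet. It assumes Slurm's default ports 6817/6818; "localhost" in the demo call is just a stand-in for the controller's hostname so the sketch runs anywhere.

```shell
#!/usr/bin/env bash
# scan_slurm_ports HOST - check Slurm's default ports with nmap if available
scan_slurm_ports() {
  local host=$1
  if command -v nmap >/dev/null 2>&1; then
    # -Pn skips the ping probe, which firewalls often block anyway
    nmap -Pn -p 6817,6818 "$host"
  else
    echo "nmap not installed on this host"
  fi
}

# Run from the compute node against the controller's hostname.
scan_slurm_ports localhost
```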
Also do we have a DNS name resolution problem here?
I always remember Sun Grid Engine as being notoriously sensitive to name
resolution, and that was my first question whenever any SGE problem was
reported.
So a couple of questions:
On the controller node and on the compute node, run:
hostname
hostname -f
Does the cluster controller node or do the compute nodes have more than
one network interface?
I bet the cluster controller node does! From the compute node, do an
nslookup or a dig and see what the COMPUTE NODE thinks are the names of
both of those interfaces.
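The name checks above can be bundled into one small script to run on each node. This is just a sketch: `getent` is used here instead of nslookup/dig because it consults /etc/hosts and DNS in the order nsswitch.conf specifies, which is what the resolver (and hence Slurm) actually sees.

```shell
#!/usr/bin/env bash
# Print the node's short name, FQDN, and what the resolver returns for it.
# Run on the controller and on every compute node; the results must be
# consistent with the node names used in slurm.conf.
report_names() {
  local short fqdn
  short=$(hostname)
  fqdn=$(hostname -f 2>/dev/null || echo "unknown")
  echo "short name : $short"
  echo "fqdn       : $fqdn"
  # Shows the address/name pair the resolver hands back for this host.
  getent hosts "$short" || echo "no resolver entry for $short"
}

report_names
```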
Also, as Rajul says: how are you making sure that both controller and
compute nodes have the same slurm.conf file?
Actually, if the slurm.conf files are different this will be logged when
the compute node starts up, but let us check everything.
On 6 July 2017 at 11:37, Said Mohamed Said <said.moha...@oist.jp> wrote:
Even after reinstalling everything from the beginning the problem is
still there. Right now I am out of ideas.
Best Regards,
Said.
------------------------------------------------------------------------
*From:* Said Mohamed Said
*Sent:* Thursday, July 6, 2017 2:23:05 PM
*To:* slurm-dev
*Subject:* Re: [slurm-dev] Re: SLURM ERROR! NEED HELP
Thank you all for your suggestions. The only thing I can do for now
is to uninstall and reinstall from the beginning, and I will use the
most recent version of Slurm on both nodes.
For Felix who asked, the OS is CentOS 7.3 on both machines.
I will let you know if that can solve the issue.
------------------------------------------------------------------------
*From:* Rajul Kumar <kumar.r...@husky.neu.edu>
*Sent:* Thursday, July 6, 2017 12:41:51 AM
*To:* slurm-dev
*Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
Sorry for the typo
It's generally when one of the controller or compute nodes can reach
the other one but it's *not* happening vice-versa.
On Wed, Jul 5, 2017 at 11:38 AM, Rajul Kumar <kumar.r...@husky.neu.edu> wrote:
I came across the same problem sometime back. It's generally
when one of the controller or compute can reach to other one but
it's happening vice-versa.
Have a look at the following points:
- controller and compute can ping each other
- both share the same slurm.conf
- slurm.conf has the locations of both controller and compute
- slurm services are running on the compute node when the
controller says it's down
- TCP connections are not being dropped
- ports that are to be used for communication are accessible,
specifically response ports
- check the routing rules, if any
- clocks are synced across nodes
- check that there isn't any version mismatch (nodes aren't
recognized across major version differences)
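One concrete way to verify the "same slurm.conf" item in the list above is to compare checksums. The demo below compares two local temp copies so it is self-contained; on a real cluster you would compare `md5sum /etc/slurm/slurm.conf` on the controller against `ssh <node> md5sum /etc/slurm/slurm.conf` (that path is an assumption, adjust to your install).

```shell
#!/usr/bin/env bash
# same_conf FILE1 FILE2 -> "match" or "MISMATCH", based on md5 checksums.
same_conf() {
  local a b
  a=$(md5sum "$1" | awk '{print $1}')
  b=$(md5sum "$2" | awk '{print $1}')
  if [ "$a" = "$b" ]; then echo "match"; else echo "MISMATCH"; fi
}

# Self-contained demo with two temp copies (hypothetical config content).
tmp1=$(mktemp) && tmp2=$(mktemp)
printf 'ControlMachine=ctrl\nNodeName=n[1-2]\n' > "$tmp1"
cp "$tmp1" "$tmp2"
same_conf "$tmp1" "$tmp2"          # prints "match"
printf '\n# extra line\n' >> "$tmp2"
same_conf "$tmp1" "$tmp2"          # prints "MISMATCH"
rm -f "$tmp1" "$tmp2"
```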
Hope this helps.
Best,
Rajul
On Wed, Jul 5, 2017 at 10:52 AM, John Hearns <hear...@googlemail.com> wrote:
Said,
a problem like this always has a simple cause. We share
your frustration, and several people here have offered help.
So please do not get discouraged. We have all been in your
situation!
The only way to handle problems like this is to:
a) start at the beginning and read the manuals and webpages
closely
b) start at the lowest level, i.e. here the network, and do NOT
assume that any component is working
c) look at all the log files closely
d) start daemon processes in a terminal with any 'verbose'
flags set
e) then move on to more low-level diagnostics, such as tcpdump
of network adapters and straces of the processes and gstacks
You have been doing steps a, b and c very well.
I suggest staying with these - I myself am going for Adam
Huffman's suggestion of the NTP clock times.
Are you SURE that on all nodes you have run the 'date'
command and also 'ntpq -p'?
Are you SURE the master node and the node OBU-N6 are both
connecting to an NTP server? 'ntpq -p' will tell you that.
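As a sketch of what "checking the clocks" amounts to: compare epoch seconds between nodes and flag any skew beyond a few seconds. The real comparison would be `check_skew "$(date +%s)" "$(ssh OBU-N6 date +%s)"`; the demo takes both timestamps locally so it runs anywhere, and the 5-second tolerance is an arbitrary illustrative choice.

```shell
#!/usr/bin/env bash
# check_skew SECS_A SECS_B - flag clock skew above a small tolerance.
check_skew() {
  local a=$1 b=$2 skew
  skew=$(( a > b ? a - b : b - a ))
  if [ "$skew" -le 5 ]; then
    echo "clocks within ${skew}s"
  else
    echo "skew of ${skew}s - fix NTP before debugging further"
  fi
}

# On a real cluster the second timestamp would come from the other node.
check_skew "$(date +%s)" "$(date +%s)"
```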
And do not lose heart. This is how we all learn.
On 5 July 2017 at 16:23, Said Mohamed Said <said.moha...@oist.jp> wrote:
sinfo -R gives "NODE IS NOT RESPONDING"
ping gives successful results from both nodes
I really cannot figure out what is causing the problem.
Regards,
Said
------------------------------------------------------------------------
*From:* Felix Willenborg <felix.willenb...@uni-oldenburg.de>
*Sent:* Wednesday, July 5, 2017 9:07:05 PM
*To:* slurm-dev
*Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
When the nodes change to the down state, what is 'sinfo
-R' saying? Sometimes it gives you a reason for that.
Best,
Felix
Am 05.07.2017 um 13:16 schrieb Said Mohamed Said:
Thank you, Adam. For NTP I did that as well before
posting, but it didn't fix the issue.
Regards,
Said
------------------------------------------------------------------------
*From:* Adam Huffman <adam.huff...@gmail.com>
*Sent:* Wednesday, July 5, 2017 8:11:03 PM
*To:* slurm-dev
*Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
I've seen something similar when node clocks were skewed.
Worth checking that NTP is running and they're all
synchronised.
On Wed, Jul 5, 2017 at 12:06 PM, Said Mohamed Said
<said.moha...@oist.jp> wrote:
> Thank you all for the suggestions. I turned off the firewall on both
> machines but still no luck. I can confirm that no managed switch is
> preventing the nodes from communicating. If you check the log file,
> there is communication for about 4 mins and then the node state goes
> down.
> Any other idea?
> ________________________________
> From: Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk>
> Sent: Wednesday, July 5, 2017 7:07:15 PM
> To: slurm-dev
> Subject: [slurm-dev] Re: SLURM ERROR! NEED HELP
>
>
> On 07/05/2017 11:40 AM, Felix Willenborg wrote:
>> in my network I encountered that managed switches were preventing
>> necessary network communication between the nodes, on which SLURM
>> relies. You should check if you're using managed switches to connect
>> nodes to the network and, if so, whether they're blocking
>> communication on slurm ports.
>
> Managed switches should permit layer 2 traffic just like unmanaged
> switches! We only have managed Ethernet switches, and they work
> without problems.
>
> Perhaps you meant that Ethernet switches may perform some firewall
> functions by themselves?
>
> Firewalls must be off between Slurm compute nodes as well as the
> controller host. See
>
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons
>
> /Ole
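As a footnote to Ole's link: on CentOS 7 there can be two firewall services to account for, firewalld and the legacy iptables service, and the thread's eventual fix was the latter. A hedged sketch of the relevant commands, assuming Slurm's default ports (requires root, and the port range must match your slurm.conf):

```shell
# Check BOTH firewall services - disabling firewalld is not enough if the
# legacy iptables service is also installed and enabled (the culprit here).
systemctl status firewalld iptables

# Either disable both on the cluster-internal network entirely...
systemctl stop firewalld iptables
systemctl disable firewalld iptables

# ...or, if a firewall must stay on, open the default Slurm ports
# (6817 slurmctld, 6818 slurmd - adjust to your slurm.conf):
firewall-cmd --permanent --add-port=6817-6818/tcp
firewall-cmd --reload
```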