We provision our nodes with Warewulf, so they all have the same image and packages. As for the firewall rules, they should be the same, but I will ask someone to check.
Thanks

Jackie

Sent from my iPhone

> On May 28, 2014, at 2:56 AM, Vsevolod Nikonorov <[email protected]> wrote:
>
> If slurmd is actually up (/etc/init.d/slurmd status should tell if it
> is), maybe you should check your firewall configuration on the troubled
> nodes. Some systems allow ssh and icmp by default, but block other
> traffic. Judging by your rpcinfo output I am most likely wrong, though.
>
> Also, are there equivalent versions of slurm on all the nodes?
>
> Jacqueline Scoggins wrote on 2014-05-28 08:30:
>> I set tree width much higher than that based off our last
>> conversation. The output from the node was included in the email. The
>> slurmctld log shows communication connection failure, node not
>> responding. First slurm takes it off and then root owns it.
>>
>> Can we chat tomorrow?
>>
>> Jackie
>>
>> Sent from my iPhone
>>
>>> On May 27, 2014, at 8:35 PM, Danny Auble <[email protected]> wrote:
>>>
>>> Is there anything in the slurmctld log about node n0169.lr3?
>>>
>>> All your nodes in the slurm.conf can talk to each other,
>>> correct? I am pretty sure that is the case, but just to verify.
>>>
>>> debug4 is quite high; I don't think you would need to go higher.
>>>
>>> If you have debug2 on your slurmctld you could see the tree fanout
>>> and see which node is trying to talk to it.
>>>
>>> Just out of curiosity, if you set TreeWidth=1831 does everything
>>> register?
>>>
>>> On 05/27/2014 07:50 PM, Jacqueline Scoggins wrote:
>>> Re: [slurm-dev] Re: migration and node communication error
>>> I can ping the node and ssh onto the node. The log file on the
>>> node does not report any communication issues.
>>>
>>> i.e.
>>>
>>> ssh n0169.lr3
>>>
>>> uptime
>>>
>>>  19:44:56 up 99 days, 1:28, 1 user, load average: 2.00, 2.00, 2.00
>>>
>>> sinfo -R |grep n0169.lr3
>>>
>>> Not responding    root   2014-05-27T16:35:47
>>>
>>> [2014-05-27T11:16:44.883] slurmd version 2.6.4 started
>>> [2014-05-27T11:16:44.883] Job accounting gather LINUX plugin loaded
>>> [2014-05-27T11:16:44.883] switch NONE plugin loaded
>>> [2014-05-27T11:16:44.883] slurmd started on Tue, 27 May 2014 11:16:44 -0700
>>> [2014-05-27T11:16:44.883] CPUs=20 Boards=1 Sockets=2 Cores=10 Threads=1 Memory=64498 TmpDisk=7693 Uptime=8528433
>>> [2014-05-27T11:16:44.883] AcctGatherEnergy NONE plugin loaded
>>> [2014-05-27T11:16:44.883] AcctGatherProfile NONE plugin loaded
>>> [2014-05-27T11:16:44.883] AcctGatherInfiniband NONE plugin loaded
>>> [2014-05-27T11:16:44.883] AcctGatherFilesystem NONE plugin loaded
>>> [2014-05-27T13:02:17.632] got shutdown request
>>> [2014-05-27T13:02:17.632] all threads complete
>>> [2014-05-27T13:02:17.634] Consumable Resources (CR) Node Selection plugin shutting down ...
>>> [2014-05-27T13:02:17.635] Munge cryptographic signature plugin unloaded
>>> [2014-05-27T13:02:17.635] Slurmd shutdown completing
>>> [2014-05-27T13:02:19.050] topology tree plugin loaded
>>> [2014-05-27T13:02:19.661] Warning: Note very large processing time from slurm_topo_build_config: usec=611478 began=13:02:19.050
>>> [2014-05-27T13:02:19.662] Gathering cpu frequency information for 20 cpus
>>> [2014-05-27T13:02:19.663] task NONE plugin loaded
>>> [2014-05-27T13:02:19.663] auth plugin for Munge (http://code.google.com/p/munge/ [1]) loaded
>>> [2014-05-27T13:02:19.663] Munge cryptographic signature plugin loaded
>>> [2014-05-27T13:02:19.664] Warning: Core limit is only 0 KB
>>> [2014-05-27T13:02:19.664] slurmd version 2.6.4 started
>>> [2014-05-27T13:02:19.664] Job accounting gather LINUX plugin loaded
>>> [2014-05-27T13:02:19.664] switch NONE plugin loaded
>>> [2014-05-27T13:02:19.664] slurmd started on Tue, 27 May 2014 13:02:19 -0700
>>> [2014-05-27T13:02:19.664] CPUs=20 Boards=1 Sockets=2 Cores=10 Threads=1 Memory=64498 TmpDisk=7693 Uptime=8534767
>>> [2014-05-27T13:02:19.664] AcctGatherEnergy NONE plugin loaded
>>> [2014-05-27T13:02:19.664] AcctGatherProfile NONE plugin loaded
>>> [2014-05-27T13:02:19.664] AcctGatherInfiniband NONE plugin loaded
>>> [2014-05-27T13:02:19.664] AcctGatherFilesystem NONE plugin loaded
>>>
>>> So it should be up and running. There is a list of nodes on this
>>> cluster having problems. I could speak via munge but slurm is
>>> having problems. What can I run to test if rpc is the issue?
>>>
>>> rpcinfo n0169.lr3
>>>
>>>   program version netid  address                service     owner
>>>   100000  4       tcp6   ::.0.111               portmapper  superuser
>>>   100000  3       tcp6   ::.0.111               portmapper  superuser
>>>   100000  4       udp6   ::.0.111               portmapper  superuser
>>>   100000  3       udp6   ::.0.111               portmapper  superuser
>>>   100000  4       tcp    0.0.0.0.0.111          portmapper  superuser
>>>   100000  3       tcp    0.0.0.0.0.111          portmapper  superuser
>>>   100000  2       tcp    0.0.0.0.0.111          portmapper  superuser
>>>   100000  4       udp    0.0.0.0.0.111          portmapper  superuser
>>>   100000  3       udp    0.0.0.0.0.111          portmapper  superuser
>>>   100000  2       udp    0.0.0.0.0.111          portmapper  superuser
>>>   100000  4       local  /var/run/rpcbind.sock  portmapper  superuser
>>>   100000  3       local  /var/run/rpcbind.sock  portmapper  superuser
>>>   100024  1       udp    0.0.0.0.181.183        status      29
>>>   100024  1       tcp    0.0.0.0.215.135        status      29
>>>   100024  1       udp6   ::.238.33              status      29
>>>   100024  1       tcp6   ::.153.169             status      29
>>>   100021  1       udp    0.0.0.0.168.20         nlockmgr    superuser
>>>   100021  3       udp    0.0.0.0.168.20         nlockmgr    superuser
>>>   100021  4       udp    0.0.0.0.168.20         nlockmgr    superuser
>>>   100021  1       tcp    0.0.0.0.179.21         nlockmgr    superuser
>>>   100021  3       tcp    0.0.0.0.179.21         nlockmgr    superuser
>>>   100021  4       tcp    0.0.0.0.179.21         nlockmgr    superuser
>>>   100021  1       udp6   ::.155.84              nlockmgr    superuser
>>>   100021  3       udp6   ::.155.84              nlockmgr    superuser
>>>   100021  4       udp6   ::.155.84              nlockmgr    superuser
>>>   100021  1       tcp6   ::.212.199             nlockmgr    superuser
>>>   100021  3       tcp6   ::.212.199             nlockmgr    superuser
>>>   100021  4       tcp6   ::.212.199             nlockmgr    superuser
>>>
>>> Is there something that is not running that should be running?
>>>
>>> I even changed logging to debug4 and I still did not see any reason
>>> why. Should I raise the logging level further?
>>>
>>> Thanks
>>>
>>> Jackie
>>>
>>> On Tue, May 27, 2014 at 7:24 PM, Danny Auble <[email protected]> wrote:
>>>
>>> Jackie, what does the slurmd log look like on one of these nodes?
>>> The * means just what you thought, no communication.
>>>
>>> Make sure you can ping the address from the slurmctld.
>>>
>>> Your timeout should be fine.
>>>
>>> Danny
>>>
>>> On May 27, 2014 4:40:23 PM PDT, Jacqueline Scoggins
>>> <[email protected]> wrote:
>>> I just migrated over 611 nodes to slurm from moab/torque, the last
>>> set of our nodes, and noticed that a subset of the nodes, around 39
>>> or so, show down with a * after the word down. I have tried to
>>> change the state to IDLE but the log files show "Communication
>>> connection failure rpc:1008" errors and I can't seem to see what is
>>> causing this.
>>>
>>> Any ideas of what to troubleshoot would be helpful. I tried
>>> munge -n | ssh nodename unmunge, so munge is communicating just fine.
>>> Does it have anything to do with any of the scheduler parameters?
>>> My thought is that the message timeout is too low for a cluster of
>>> this size: 1831 nodes.
>>>
>>> Current setting is MessageTimeout      = 60 sec
>>>
>>> Should I increase it to 5 minutes or at least 2 minutes?
>>>
>>> Jackie
>>
>> Links:
>> ------
>> [1] http://code.google.com/p/munge/
>
> --
> Никоноров Всеволод Дмитриевич, ОИТТиС, НИКИЭТ
>
> Vsevolod D. Nikonorov, JSC NIKET
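
Following up on the firewall suggestion in the thread: one quick way to see whether something is blocking slurmd traffic is to try a plain TCP connection to the slurmd port from the slurmctld host. The sketch below is only an illustration and is not something posted in the thread. It assumes Python 3 is available on the controller and that slurmd listens on 6818, which is Slurm's default SlurmdPort (check slurm.conf if it was changed). The function names are made up for this example.

```python
import socket

# Assumption: 6818 is Slurm's default SlurmdPort; adjust if slurm.conf sets another value.
DEFAULT_SLURMD_PORT = 6818


def port_is_reachable(host, port=DEFAULT_SLURMD_PORT, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def unreachable_nodes(hosts, port=DEFAULT_SLURMD_PORT, timeout=5.0):
    """Return the subset of hosts whose port could not be reached."""
    return [h for h in hosts if not port_is_reachable(h, port, timeout)]


if __name__ == "__main__":
    import sys

    # Node names are taken from the command line, e.g. n0169.lr3
    for node in unreachable_nodes(sys.argv[1:]):
        print(node)
```

Saved under a hypothetical name such as check_slurmd_port.py and run on the slurmctld host as `python3 check_slurmd_port.py n0169.lr3`, it prints a node only when the connection fails. A successful connection shows that the port is reachable through any firewall in between; it does not prove that slurmd itself is healthy.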
