We provision our nodes with Warewulf, so they all have the same image
and packages.  As for the firewall rules, they should be the same,
but I will ask someone to check that.
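
Something like this (assuming pdsh is available, iptables is what is in
use, and it is run as root) would be a quick spot check that the rules
really are identical on a couple of nodes:

  pdsh -w n0169.lr3,n0170.lr3 'iptables -L -n | md5sum' | dshbak -c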

Thanks

Jackie


Sent from my iPhone

> On May 28, 2014, at 2:56 AM, Vsevolod Nikonorov <[email protected]> wrote:
>
>
> If slurmd is actually up (/etc/init.d/slurmd status should tell you if
> it is), maybe you should check your firewall configuration on the
> troubled nodes. Some systems allow ssh and icmp by default but block
> other traffic. Judging by your rpcinfo output, I am most likely wrong,
> though.
>
> Also, are there equivalent versions of slurm on all the nodes?
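>
> For example (assuming pdsh and dshbak are installed; substitute your
> own node list for the range below), something like this would show any
> version mismatch at a glance:
>
>   pdsh -w n0[161-200].lr3 'slurmd -V' | dshbak -c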
>
> Jacqueline Scoggins wrote on 2014-05-28 08:30:
>> I set TreeWidth much higher than that based on our last
>> conversation.  The output from the node was included in the email.
>> The slurmctld log shows "Communication connection failure" and "Node
>> not responding".  First slurm marks the node down, and then root shows
>> as the owner of the down reason.
>>
>> Can we chat tomorrow?
>>
>> Jackie
>>
>> Sent from my iPhone
>>
>>> On May 27, 2014, at 8:35 PM, Danny Auble <[email protected]> wrote:
>>>
>>> Is there anything in the slurmctld log about node n0169.lr3?
>>>
>>> All your nodes in the slurm.conf can talk to each other,
>>> correct?  I am pretty sure that is the case, but just to verify.
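>>>
>>> A quick sweep from the slurmctld host, something along these lines
>>> (node list pulled from sinfo), would confirm basic reachability:
>>>
>>>   for n in $(sinfo -N -h -o %N | sort -u); do
>>>       ping -c 1 -W 1 $n > /dev/null || echo "$n unreachable"
>>>   done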
>>>
>>> debug4 is quite high; I don't think you would need to go higher.
>>>
>>> If you have debug2 on your slurmctld, you could see the tree fanout
>>> and which node is trying to talk to it.
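>>>
>>> You could bump it at runtime without a restart, e.g. (the log path
>>> here is just an example; use whatever SlurmctldLogFile points to):
>>>
>>>   scontrol setdebug debug2
>>>   grep n0169.lr3 /var/log/slurm/slurmctld.log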
>>>
>>> Just out of curiosity, if you set TreeWidth=1831 does everything
>>> register?
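>>>
>>> i.e. something like this in the slurm.conf shared by the controller
>>> and the nodes (1831 being your total node count, so the controller
>>> would contact every slurmd directly), then a reconfigure or daemon
>>> restart:
>>>
>>>   TreeWidth=1831
>>>   scontrol reconfigure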
>>>
>>> On 05/27/2014 07:50 PM, Jacqueline Scoggins wrote:
>>> Re: [slurm-dev] Re: migration and node communication error
>>> I can ping the node and ssh onto the node.  The log file on the
>>> node does not report any communication issues.
>>>
>>> e.g.
>>>
>>> ssh n0169.lr3
>>>
>>> uptime
>>>  19:44:56 up 99 days,  1:28,  1 user,  load average: 2.00, 2.00, 2.00
>>>
>>> sinfo -R | grep n0169.lr3
>>> Not responding       root      2014-05-27T16:35:47
>>>
>>>
>>> [2014-05-27T11:16:44.883] slurmd version 2.6.4 started
>>>
>>> [2014-05-27T11:16:44.883] Job accounting gather LINUX plugin loaded
>>>
>>> [2014-05-27T11:16:44.883] switch NONE plugin loaded
>>>
>>> [2014-05-27T11:16:44.883] slurmd started on Tue, 27 May 2014
>>> 11:16:44 -0700
>>>
>>> [2014-05-27T11:16:44.883] CPUs=20 Boards=1 Sockets=2 Cores=10
>>> Threads=1 Memory=64498 TmpDisk=7693 Uptime=8528433
>>>
>>> [2014-05-27T11:16:44.883] AcctGatherEnergy NONE plugin loaded
>>>
>>> [2014-05-27T11:16:44.883] AcctGatherProfile NONE plugin loaded
>>>
>>> [2014-05-27T11:16:44.883] AcctGatherInfiniband NONE plugin loaded
>>>
>>> [2014-05-27T11:16:44.883] AcctGatherFilesystem NONE plugin loaded
>>>
>>> [2014-05-27T13:02:17.632] got shutdown request
>>>
>>> [2014-05-27T13:02:17.632] all threads complete
>>>
>>> [2014-05-27T13:02:17.634] Consumable Resources (CR) Node Selection
>>> plugin shutting down ...
>>>
>>> [2014-05-27T13:02:17.635] Munge cryptographic signature plugin
>>> unloaded
>>>
>>> [2014-05-27T13:02:17.635] Slurmd shutdown completing
>>>
>>> [2014-05-27T13:02:19.050] topology tree plugin loaded
>>>
>>> [2014-05-27T13:02:19.661] Warning: Note very large processing time
>>> from slurm_topo_build_config: usec=611478 began=13:02:19.050
>>>
>>> [2014-05-27T13:02:19.662] Gathering cpu frequency information for 20
>>> cpus
>>>
>>> [2014-05-27T13:02:19.663] task NONE plugin loaded
>>>
>>> [2014-05-27T13:02:19.663] auth plugin for Munge
>>> (http://code.google.com/p/munge/ [1]) loaded
>>>
>>> [2014-05-27T13:02:19.663] Munge cryptographic signature plugin
>>> loaded
>>>
>>> [2014-05-27T13:02:19.664] Warning: Core limit is only 0 KB
>>>
>>> [2014-05-27T13:02:19.664] slurmd version 2.6.4 started
>>>
>>> [2014-05-27T13:02:19.664] Job accounting gather LINUX plugin loaded
>>>
>>>
>>> [2014-05-27T13:02:19.664] switch NONE plugin loaded
>>>
>>> [2014-05-27T13:02:19.664] slurmd started on Tue, 27 May 2014
>>> 13:02:19 -0700
>>>
>>> [2014-05-27T13:02:19.664] CPUs=20 Boards=1 Sockets=2 Cores=10
>>> Threads=1 Memory=64498 TmpDisk=7693 Uptime=8534767
>>>
>>> [2014-05-27T13:02:19.664] AcctGatherEnergy NONE plugin loaded
>>>
>>> [2014-05-27T13:02:19.664] AcctGatherProfile NONE plugin loaded
>>>
>>> [2014-05-27T13:02:19.664] AcctGatherInfiniband NONE plugin loaded
>>>
>>> [2014-05-27T13:02:19.664] AcctGatherFilesystem NONE plugin loaded
>>>
>>> So it should be up and running.  There is a list of nodes on this
>>> cluster having problems.  I can communicate via munge, but slurm is
>>> having problems.  What can I run to test whether rpc is the issue?
>>>
>>> rpcinfo n0169.lr3
>>>    program version netid     address                service    owner
>>>     100000    4    tcp6      ::.0.111               portmapper superuser
>>>     100000    3    tcp6      ::.0.111               portmapper superuser
>>>     100000    4    udp6      ::.0.111               portmapper superuser
>>>     100000    3    udp6      ::.0.111               portmapper superuser
>>>     100000    4    tcp       0.0.0.0.0.111          portmapper superuser
>>>     100000    3    tcp       0.0.0.0.0.111          portmapper superuser
>>>     100000    2    tcp       0.0.0.0.0.111          portmapper superuser
>>>     100000    4    udp       0.0.0.0.0.111          portmapper superuser
>>>     100000    3    udp       0.0.0.0.0.111          portmapper superuser
>>>     100000    2    udp       0.0.0.0.0.111          portmapper superuser
>>>     100000    4    local     /var/run/rpcbind.sock  portmapper superuser
>>>     100000    3    local     /var/run/rpcbind.sock  portmapper superuser
>>>     100024    1    udp       0.0.0.0.181.183        status     29
>>>     100024    1    tcp       0.0.0.0.215.135        status     29
>>>     100024    1    udp6      ::.238.33              status     29
>>>     100024    1    tcp6      ::.153.169             status     29
>>>     100021    1    udp       0.0.0.0.168.20         nlockmgr   superuser
>>>     100021    3    udp       0.0.0.0.168.20         nlockmgr   superuser
>>>     100021    4    udp       0.0.0.0.168.20         nlockmgr   superuser
>>>     100021    1    tcp       0.0.0.0.179.21         nlockmgr   superuser
>>>     100021    3    tcp       0.0.0.0.179.21         nlockmgr   superuser
>>>     100021    4    tcp       0.0.0.0.179.21         nlockmgr   superuser
>>>     100021    1    udp6      ::.155.84              nlockmgr   superuser
>>>     100021    3    udp6      ::.155.84              nlockmgr   superuser
>>>     100021    4    udp6      ::.155.84              nlockmgr   superuser
>>>     100021    1    tcp6      ::.212.199             nlockmgr   superuser
>>>     100021    3    tcp6      ::.212.199             nlockmgr   superuser
>>>     100021    4    tcp6      ::.212.199             nlockmgr   superuser
>>>
>>> Is there something that is not running that should be running?
>>>
>>> I even changed logging to debug4 and I still did not see any reason
>>> why.  Should I up the logging higher?
>>>
>>> Thanks
>>>
>>> Jackie
>>>
>>> On Tue, May 27, 2014 at 7:24 PM, Danny Auble <[email protected]> wrote:
>>>
>>> Jackie, what does the slurmd log look like on one of these nodes?
>>> The * means just what you thought, no communication.
>>>
>>> Make sure you can ping the address from the slurmctld.
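>>>
>>> It is also worth verifying that the slurm ports themselves are open
>>> in both directions, e.g. (assuming the default ports, 6818 for slurmd
>>> and 6817 for slurmctld):
>>>
>>>   nc -zv n0169.lr3 6818    # from the slurmctld host to the node
>>>   nc -zv $(scontrol show config | awk '/^ControlMachine/ {print $3}') 6817    # from the node back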
>>>
>>> Your timeout should be fine.
>>>
>>> Danny
>>>
>>> On May 27, 2014 4:40:23 PM PDT, Jacqueline Scoggins
>>> <[email protected]> wrote:
>>> I just migrated over 611 nodes to slurm from moab/torque.  This was
>>> the last set of our nodes, and I noticed that a subset of them,
>>> around 39 or so, show down with a * after the state (down*).  I have
>>> tried to change the state to IDLE, but the log file shows
>>> "Communication connection failure" rpc:1008 errors and I can't seem
>>> to see what is causing this.
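>>>
>>> (The state change was attempted with something along the lines of
>>> "scontrol update nodename=n0169.lr3 state=idle".)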
>>>
>>> Any ideas of what to troubleshoot would be helpful.  I tried
>>> munge -n | ssh nodename unmunge, so munge is communicating just fine.
>>> Does it have anything to do with any of the scheduler parameters?
>>> My thought is that the message timeout is too low for a cluster of
>>> this size: 1831 nodes.
>>>
>>> Current setting is MessageTimeout          = 60 sec
>>>
>>> Should I increase it to 5 minutes, or at least 2 minutes?
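>>>
>>> If so, I assume it would just be a matter of something like this in
>>> slurm.conf, followed by an scontrol reconfigure:
>>>
>>>   MessageTimeout=120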
>>>
>>> Jackie
>>
>>
>>
>>
>> Links:
>> ------
>> [1] http://code.google.com/p/munge/
>
> --
> Vsevolod D. Nikonorov, OITTiS, NIKIET
>
> Vsevolod D. Nikonorov, JSC NIKET
