We provision our nodes with Warewulf, so they all have the same image
and packages.  As for the firewall rules, they should be the same,
but I will ask someone to check that.
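
Something like this (assuming pdsh is available, iptables is what is in
use, and it is run as root) would be a quick spot check that the rules
really are identical on a couple of nodes:

  pdsh -w n0169.lr3,n0170.lr3 'iptables -L -n | md5sum' | dshbak -c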

Thanks

Jackie


Sent from my iPhone

> On May 28, 2014, at 2:56 AM, Vsevolod Nikonorov <[email protected]> wrote:
>
>
> If slurmd is actually up (/etc/init.d/slurmd status should tell you if
> it is), maybe you should check your firewall configuration on the
> troubled nodes. Some systems allow ssh and icmp by default but block
> other traffic. Judging by your rpcinfo output, I am most likely wrong,
> though.
>
> Also, are there equivalent versions of slurm on all the nodes?
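>
> For example (assuming pdsh and dshbak are installed; substitute your
> own node list for the range below), something like this would show any
> version mismatch at a glance:
>
>   pdsh -w n0[161-200].lr3 'slurmd -V' | dshbak -c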
>
> Jacqueline Scoggins wrote on 2014-05-28 08:30:
>> I set TreeWidth much higher than that based on our last
>> conversation.  The output from the node was included in the email.
>> The slurmctld log shows "Communication connection failure" and "Node
>> not responding".  First slurm marks the node down, and then root shows
>> as the owner of the down reason.
>>
>> Can we chat tomorrow?
>>
>> Jackie
>>
>> Sent from my iPhone
>>
>>> On May 27, 2014, at 8:35 PM, Danny Auble <[email protected]> wrote:
>>>
>>> Is there anything in the slurmctld log about node n0169.lr3?
>>>
>>> All your nodes in the slurm.conf can talk to each other,
>>> correct?  I am pretty sure that is the case, but just to verify.
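>>>
>>> A quick sweep from the slurmctld host, something along these lines
>>> (node list pulled from sinfo), would confirm basic reachability:
>>>
>>>   for n in $(sinfo -N -h -o %N | sort -u); do
>>>       ping -c 1 -W 1 $n > /dev/null || echo "$n unreachable"
>>>   done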
>>>
>>> debug4 is quite high; I don't think you would need to go higher.
>>>
>>> If you have debug2 on your slurmctld, you could see the tree fanout
>>> and which node is trying to talk to it.
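>>>
>>> You could bump it at runtime without a restart, e.g. (the log path
>>> here is just an example; use whatever SlurmctldLogFile points to):
>>>
>>>   scontrol setdebug debug2
>>>   grep n0169.lr3 /var/log/slurm/slurmctld.log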
>>>
>>> Just out of curiosity, if you set TreeWidth=1831 does everything
>>> register?
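>>>
>>> i.e. something like this in the slurm.conf shared by the controller
>>> and the nodes (1831 being your total node count, so the controller
>>> would contact every slurmd directly), then a reconfigure or daemon
>>> restart:
>>>
>>>   TreeWidth=1831
>>>   scontrol reconfigure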
>>>
>>> On 05/27/2014 07:50 PM, Jacqueline Scoggins wrote:
>>> Re: [slurm-dev] Re: migration and node communication error
>>> I can ping the node and ssh onto the node.  The log file on the
>>> node does not report any communication issues.
>>>
>>> e.g.
>>>
>>> ssh n0169.lr3
>>>
>>> uptime
>>>  19:44:56 up 99 days,  1:28,  1 user,  load average: 2.00, 2.00, 2.00
>>>
>>> sinfo -R | grep n0169.lr3
>>> Not responding       root      2014-05-27T16:35:47
>>>
>>>
>>> [2014-05-27T11:16:44.883] slurmd version 2.6.4 started
>>>
>>> [2014-05-27T11:16:44.883] Job accounting gather LINUX plugin loaded
>>>
>>> [2014-05-27T11:16:44.883] switch NONE plugin loaded
>>>
>>> [2014-05-27T11:16:44.883] slurmd started on Tue, 27 May 2014
>>> 11:16:44 -0700
>>>
>>> [2014-05-27T11:16:44.883] CPUs=20 Boards=1 Sockets=2 Cores=10
>>> Threads=1 Memory=64498 TmpDisk=7693 Uptime=8528433
>>>
>>> [2014-05-27T11:16:44.883] AcctGatherEnergy NONE plugin loaded
>>>
>>> [2014-05-27T11:16:44.883] AcctGatherProfile NONE plugin loaded
>>>
>>> [2014-05-27T11:16:44.883] AcctGatherInfiniband NONE plugin loaded
>>>
>>> [2014-05-27T11:16:44.883] AcctGatherFilesystem NONE plugin loaded
>>>
>>> [2014-05-27T13:02:17.632] got shutdown request
>>>
>>> [2014-05-27T13:02:17.632] all threads complete
>>>
>>> [2014-05-27T13:02:17.634] Consumable Resources (CR) Node Selection
>>> plugin shutting down ...
>>>
>>> [2014-05-27T13:02:17.635] Munge cryptographic signature plugin
>>> unloaded
>>>
>>> [2014-05-27T13:02:17.635] Slurmd shutdown completing
>>>
>>> [2014-05-27T13:02:19.050] topology tree plugin loaded
>>>
>>> [2014-05-27T13:02:19.661] Warning: Note very large processing time
>>> from slurm_topo_build_config: usec=611478 began=13:02:19.050
>>>
>>> [2014-05-27T13:02:19.662] Gathering cpu frequency information for 20
>>> cpus
>>>
>>> [2014-05-27T13:02:19.663] task NONE plugin loaded
>>>
>>> [2014-05-27T13:02:19.663] auth plugin for Munge
>>> (http://code.google.com/p/munge/ [1]) loaded
>>>
>>> [2014-05-27T13:02:19.663] Munge cryptographic signature plugin
>>> loaded
>>>
>>> [2014-05-27T13:02:19.664] Warning: Core limit is only 0 KB
>>>
>>> [2014-05-27T13:02:19.664] slurmd version 2.6.4 started
>>>
>>> [2014-05-27T13:02:19.664] Job accounting gather LINUX plugin loaded
>>>
>>>
>>> [2014-05-27T13:02:19.664] switch NONE plugin loaded
>>>
>>> [2014-05-27T13:02:19.664] slurmd started on Tue, 27 May 2014
>>> 13:02:19 -0700
>>>
>>> [2014-05-27T13:02:19.664] CPUs=20 Boards=1 Sockets=2 Cores=10
>>> Threads=1 Memory=64498 TmpDisk=7693 Uptime=8534767
>>>
>>> [2014-05-27T13:02:19.664] AcctGatherEnergy NONE plugin loaded
>>>
>>> [2014-05-27T13:02:19.664] AcctGatherProfile NONE plugin loaded
>>>
>>> [2014-05-27T13:02:19.664] AcctGatherInfiniband NONE plugin loaded
>>>
>>> [2014-05-27T13:02:19.664] AcctGatherFilesystem NONE plugin loaded
>>>
>>> So it should be up and running.  There is a list of nodes on this
>>> cluster having problems.  I can communicate via munge, but slurm is
>>> having problems.  What can I run to test whether rpc is the issue?
>>>
>>> rpcinfo n0169.lr3
>>>    program version netid     address                service    owner
>>>     100000    4    tcp6      ::.0.111               portmapper superuser
>>>     100000    3    tcp6      ::.0.111               portmapper superuser
>>>     100000    4    udp6      ::.0.111               portmapper superuser
>>>     100000    3    udp6      ::.0.111               portmapper superuser
>>>     100000    4    tcp       0.0.0.0.0.111          portmapper superuser
>>>     100000    3    tcp       0.0.0.0.0.111          portmapper superuser
>>>     100000    2    tcp       0.0.0.0.0.111          portmapper superuser
>>>     100000    4    udp       0.0.0.0.0.111          portmapper superuser
>>>     100000    3    udp       0.0.0.0.0.111          portmapper superuser
>>>     100000    2    udp       0.0.0.0.0.111          portmapper superuser
>>>     100000    4    local     /var/run/rpcbind.sock  portmapper superuser
>>>     100000    3    local     /var/run/rpcbind.sock  portmapper superuser
>>>     100024    1    udp       0.0.0.0.181.183        status     29
>>>     100024    1    tcp       0.0.0.0.215.135        status     29
>>>     100024    1    udp6      ::.238.33              status     29
>>>     100024    1    tcp6      ::.153.169             status     29
>>>     100021    1    udp       0.0.0.0.168.20         nlockmgr   superuser
>>>     100021    3    udp       0.0.0.0.168.20         nlockmgr   superuser
>>>     100021    4    udp       0.0.0.0.168.20         nlockmgr   superuser
>>>     100021    1    tcp       0.0.0.0.179.21         nlockmgr   superuser
>>>     100021    3    tcp       0.0.0.0.179.21         nlockmgr   superuser
>>>     100021    4    tcp       0.0.0.0.179.21         nlockmgr   superuser
>>>     100021    1    udp6      ::.155.84              nlockmgr   superuser
>>>     100021    3    udp6      ::.155.84              nlockmgr   superuser
>>>     100021    4    udp6      ::.155.84              nlockmgr   superuser
>>>     100021    1    tcp6      ::.212.199             nlockmgr   superuser
>>>     100021    3    tcp6      ::.212.199             nlockmgr   superuser
>>>     100021    4    tcp6      ::.212.199             nlockmgr   superuser
>>>
>>> Is there something that is not running that should be running?
>>>
>>> I even changed logging to debug4 and I still did not see any reason
>>> why.  Should I up the logging higher?
>>>
>>> Thanks
>>>
>>> Jackie
>>>
>>> On Tue, May 27, 2014 at 7:24 PM, Danny Auble <[email protected]> wrote:
>>>
>>> Jackie, what does the slurmd log look like on one of these nodes?
>>> The * means just what you thought, no communication.
>>>
>>> Make sure you can ping the address from the slurmctld.
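>>>
>>> It is also worth verifying that the slurm ports themselves are open
>>> in both directions, e.g. (assuming the default ports, 6818 for slurmd
>>> and 6817 for slurmctld):
>>>
>>>   nc -zv n0169.lr3 6818    # from the slurmctld host to the node
>>>   nc -zv $(scontrol show config | awk '/^ControlMachine/ {print $3}') 6817    # from the node back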
>>>
>>> Your timeout should be fine.
>>>
>>> Danny
>>>
>>> On May 27, 2014 4:40:23 PM PDT, Jacqueline Scoggins
>>> <[email protected]> wrote:
>>> I just migrated over 611 nodes to slurm from moab/torque.  This was
>>> the last set of our nodes, and I noticed that a subset of them,
>>> around 39 or so, show down with a * after the state (down*).  I have
>>> tried to change the state to IDLE, but the log file shows
>>> "Communication connection failure" rpc:1008 errors and I can't seem
>>> to see what is causing this.
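>>>
>>> (The state change was attempted with something along the lines of
>>> "scontrol update nodename=n0169.lr3 state=idle".)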
>>>
>>> Any ideas of what to troubleshoot would be helpful.  I tried
>>> munge -n | ssh nodename unmunge, so munge is communicating just fine.
>>> Does it have anything to do with any of the scheduler parameters?
>>> My thought is that the message timeout is too low for a cluster of
>>> this size: 1831 nodes.
>>>
>>> Current setting is MessageTimeout          = 60 sec
>>>
>>> Should I increase it to 5 minutes, or at least 2 minutes?
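>>>
>>> If so, I assume it would just be a matter of something like this in
>>> slurm.conf, followed by an scontrol reconfigure:
>>>
>>>   MessageTimeout=120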
>>>
>>> Jackie
>>
>>
>>
>>
>> Links:
>> ------
>> [1] http://code.google.com/p/munge/
>
> --
> Vsevolod D. Nikonorov, OITTiS, NIKIET
>
> Vsevolod D. Nikonorov, JSC NIKET
