Problem fixed. I found an error in a node IP address in slurm.conf,
fixed the hosts entries, and now the nodes are running fine.
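
For reference, the mismatch looked roughly like this (the addresses below
are illustrative, not our real ones): the NodeAddr slurm.conf carried for
a node disagreed with what the hosts file resolved, e.g.

  # slurm.conf
  NodeName=n0169.lr3 NodeAddr=10.0.1.169 ...

  # /etc/hosts (stale entry, since corrected)
  10.0.2.169   n0169.lr3

Once the two agreed, the nodes registered normally.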

Thanks

Jackie



On Tue, May 27, 2014 at 8:35 PM, Danny Auble <[email protected]> wrote:

>  Is there anything in the slurmctld log about node n0169.lr3?
>
> All your nodes in the slurm.conf can talk to each other, correct?  I am
> pretty sure that is the case, but just to verify.
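>
> A quick reachability sweep, run from the slurmctld host and from a
> compute node or two, would confirm it (a rough sketch; adjust the ping
> options as needed):
>
>   # report any node slurm knows about that does not answer a single ping
>   for n in $(sinfo -N -h -o '%N' | sort -u); do
>       ping -c 1 -W 2 "$n" > /dev/null || echo "unreachable: $n"
>   done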
>
> debug4 is quite high; I don't think you would need to go higher.
>
> If you have debug2 on your slurmctld, you could see the tree fanout and
> which node is trying to talk to it.
>
> Just out of curiosity, if you set TreeWidth=1831, does everything register?
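>
> Setting it to the node count makes slurmctld contact every slurmd
> directly instead of fanning out through the message tree.  An
> illustrative slurm.conf fragment, picked up with an "scontrol
> reconfigure" (or a daemon restart):
>
>   TreeWidth=1831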
>
>
>
> On 05/27/2014 07:50 PM, Jacqueline Scoggins wrote:
>
> I can ping the node and ssh onto the node, and the log file on the node
> does not report any communication issues, i.e.:
>
>   ssh n0169.lr3
>   uptime
>    19:44:56 up 99 days,  1:28,  1 user,  load average: 2.00, 2.00, 2.00
>
>   sinfo -R | grep n0169.lr3
>   Not responding       root      2014-05-27T16:35:47
>
> The slurmd log on the node shows:
>
> [2014-05-27T11:16:44.883] slurmd version 2.6.4 started
> [2014-05-27T11:16:44.883] Job accounting gather LINUX plugin loaded
> [2014-05-27T11:16:44.883] switch NONE plugin loaded
> [2014-05-27T11:16:44.883] slurmd started on Tue, 27 May 2014 11:16:44 -0700
> [2014-05-27T11:16:44.883] CPUs=20 Boards=1 Sockets=2 Cores=10 Threads=1 Memory=64498 TmpDisk=7693 Uptime=8528433
> [2014-05-27T11:16:44.883] AcctGatherEnergy NONE plugin loaded
> [2014-05-27T11:16:44.883] AcctGatherProfile NONE plugin loaded
> [2014-05-27T11:16:44.883] AcctGatherInfiniband NONE plugin loaded
> [2014-05-27T11:16:44.883] AcctGatherFilesystem NONE plugin loaded
> [2014-05-27T13:02:17.632] got shutdown request
> [2014-05-27T13:02:17.632] all threads complete
> [2014-05-27T13:02:17.634] Consumable Resources (CR) Node Selection plugin shutting down ...
> [2014-05-27T13:02:17.635] Munge cryptographic signature plugin unloaded
> [2014-05-27T13:02:17.635] Slurmd shutdown completing
> [2014-05-27T13:02:19.050] topology tree plugin loaded
> [2014-05-27T13:02:19.661] Warning: Note very large processing time from slurm_topo_build_config: usec=611478 began=13:02:19.050
> [2014-05-27T13:02:19.662] Gathering cpu frequency information for 20 cpus
> [2014-05-27T13:02:19.663] task NONE plugin loaded
> [2014-05-27T13:02:19.663] auth plugin for Munge (http://code.google.com/p/munge/) loaded
> [2014-05-27T13:02:19.663] Munge cryptographic signature plugin loaded
> [2014-05-27T13:02:19.664] Warning: Core limit is only 0 KB
> [2014-05-27T13:02:19.664] slurmd version 2.6.4 started
> [2014-05-27T13:02:19.664] Job accounting gather LINUX plugin loaded
> [2014-05-27T13:02:19.664] switch NONE plugin loaded
> [2014-05-27T13:02:19.664] slurmd started on Tue, 27 May 2014 13:02:19 -0700
> [2014-05-27T13:02:19.664] CPUs=20 Boards=1 Sockets=2 Cores=10 Threads=1 Memory=64498 TmpDisk=7693 Uptime=8534767
> [2014-05-27T13:02:19.664] AcctGatherEnergy NONE plugin loaded
> [2014-05-27T13:02:19.664] AcctGatherProfile NONE plugin loaded
> [2014-05-27T13:02:19.664] AcctGatherInfiniband NONE plugin loaded
> [2014-05-27T13:02:19.664] AcctGatherFilesystem NONE plugin loaded
>
>
> So the node should be up and running. There is a list of nodes on this
> cluster having the same problem.  Munge can communicate just fine, but
> slurm is having problems.  What can I run to test whether RPC is the
> issue?
>
>
>   rpcinfo n0169.lr3
>
>    program version netid     address                service    owner
>     100000    4    tcp6      ::.0.111               portmapper superuser
>     100000    3    tcp6      ::.0.111               portmapper superuser
>     100000    4    udp6      ::.0.111               portmapper superuser
>     100000    3    udp6      ::.0.111               portmapper superuser
>     100000    4    tcp       0.0.0.0.0.111          portmapper superuser
>     100000    3    tcp       0.0.0.0.0.111          portmapper superuser
>     100000    2    tcp       0.0.0.0.0.111          portmapper superuser
>     100000    4    udp       0.0.0.0.0.111          portmapper superuser
>     100000    3    udp       0.0.0.0.0.111          portmapper superuser
>     100000    2    udp       0.0.0.0.0.111          portmapper superuser
>     100000    4    local     /var/run/rpcbind.sock  portmapper superuser
>     100000    3    local     /var/run/rpcbind.sock  portmapper superuser
>     100024    1    udp       0.0.0.0.181.183        status     29
>     100024    1    tcp       0.0.0.0.215.135        status     29
>     100024    1    udp6      ::.238.33              status     29
>     100024    1    tcp6      ::.153.169             status     29
>     100021    1    udp       0.0.0.0.168.20         nlockmgr   superuser
>     100021    3    udp       0.0.0.0.168.20         nlockmgr   superuser
>     100021    4    udp       0.0.0.0.168.20         nlockmgr   superuser
>     100021    1    tcp       0.0.0.0.179.21         nlockmgr   superuser
>     100021    3    tcp       0.0.0.0.179.21         nlockmgr   superuser
>     100021    4    tcp       0.0.0.0.179.21         nlockmgr   superuser
>     100021    1    udp6      ::.155.84              nlockmgr   superuser
>     100021    3    udp6      ::.155.84              nlockmgr   superuser
>     100021    4    udp6      ::.155.84              nlockmgr   superuser
>     100021    1    tcp6      ::.212.199             nlockmgr   superuser
>     100021    3    tcp6      ::.212.199             nlockmgr   superuser
>     100021    4    tcp6      ::.212.199             nlockmgr   superuser
>
>
>  Is there something that is not running that should be running?
>
>
> I even raised the logging level to debug4 and still did not see any
> reason why.  Should I raise it higher?
>
>
>  Thanks
>
>
>  Jackie
>
>
>
>
> On Tue, May 27, 2014 at 7:24 PM, Danny Auble <[email protected]> wrote:
>
>> Jackie, what does the slurmd log look like on one of these nodes?  The *
>> means just what you thought: no communication.
>>
>> Make sure you can ping the address from the slurmctld.
>>
>> Your timeout should be fine.
>>
>> Danny
>>
>> On May 27, 2014 4:40:23 PM PDT, Jacqueline Scoggins <[email protected]>
>> wrote:
>>>
>>> I just migrated over 611 nodes to slurm from moab/torque, the last set
>>> of our nodes, and noticed that a subset of them, around 39 or so, show
>>> down with a * after the word down.  I have tried to change the state to
>>> IDLE, but the log files show "Communication connection failure" rpc:1008
>>> errors and I can't see what is causing this.
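>>>
>>> (To be exact, I was clearing them with a command along these lines,
>>> where <nodelist> stands for the affected nodes:
>>>
>>>   scontrol update NodeName=<nodelist> State=IDLE
>>>
>>> and they drop straight back to down*.)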
>>>
>>>
>>>  Any ideas of what to troubleshoot would be helpful.  I tried
>>> munge -n | ssh nodename unmunge, so munge is communicating just fine.
>>> Does it have anything to do with any of the scheduler parameters?  My
>>> thought is that the message timeout may be too low for a cluster of
>>> this size: 1831 nodes.
>>>
>>>  Current setting is MessageTimeout          = 60 sec
>>>
>>>  Should I increase it to 5 minutes, or at least 2 minutes?
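>>>
>>> That would just mean changing this line in slurm.conf and pushing the
>>> new config out to the daemons (120 here being the two-minute option):
>>>
>>>   MessageTimeout=120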
>>>
>>>  Jackie
>>>
>>
>
>
