I can ping the node and ssh into it. The slurmd log file on the node does not report any communication issues.
For example:

ssh n0169.lr3 uptime
 19:44:56 up 99 days, 1:28, 1 user, load average: 2.00, 2.00, 2.00

sinfo -R | grep n0169.lr3
Not responding      root      2014-05-27T16:35:47

[2014-05-27T11:16:44.883] slurmd version 2.6.4 started
[2014-05-27T11:16:44.883] Job accounting gather LINUX plugin loaded
[2014-05-27T11:16:44.883] switch NONE plugin loaded
[2014-05-27T11:16:44.883] slurmd started on Tue, 27 May 2014 11:16:44 -0700
[2014-05-27T11:16:44.883] CPUs=20 Boards=1 Sockets=2 Cores=10 Threads=1 Memory=64498 TmpDisk=7693 Uptime=8528433
[2014-05-27T11:16:44.883] AcctGatherEnergy NONE plugin loaded
[2014-05-27T11:16:44.883] AcctGatherProfile NONE plugin loaded
[2014-05-27T11:16:44.883] AcctGatherInfiniband NONE plugin loaded
[2014-05-27T11:16:44.883] AcctGatherFilesystem NONE plugin loaded
[2014-05-27T13:02:17.632] got shutdown request
[2014-05-27T13:02:17.632] all threads complete
[2014-05-27T13:02:17.634] Consumable Resources (CR) Node Selection plugin shutting down ...
[2014-05-27T13:02:17.635] Munge cryptographic signature plugin unloaded
[2014-05-27T13:02:17.635] Slurmd shutdown completing
[2014-05-27T13:02:19.050] topology tree plugin loaded
[2014-05-27T13:02:19.661] Warning: Note very large processing time from slurm_topo_build_config: usec=611478 began=13:02:19.050
[2014-05-27T13:02:19.662] Gathering cpu frequency information for 20 cpus
[2014-05-27T13:02:19.663] task NONE plugin loaded
[2014-05-27T13:02:19.663] auth plugin for Munge (http://code.google.com/p/munge/) loaded
[2014-05-27T13:02:19.663] Munge cryptographic signature plugin loaded
[2014-05-27T13:02:19.664] Warning: Core limit is only 0 KB
[2014-05-27T13:02:19.664] slurmd version 2.6.4 started
[2014-05-27T13:02:19.664] Job accounting gather LINUX plugin loaded
[2014-05-27T13:02:19.664] switch NONE plugin loaded
[2014-05-27T13:02:19.664] slurmd started on Tue, 27 May 2014 13:02:19 -0700
[2014-05-27T13:02:19.664] CPUs=20 Boards=1 Sockets=2 Cores=10 Threads=1 Memory=64498 TmpDisk=7693 Uptime=8534767
[2014-05-27T13:02:19.664] AcctGatherEnergy NONE plugin loaded
[2014-05-27T13:02:19.664] AcctGatherProfile NONE plugin loaded
[2014-05-27T13:02:19.664] AcctGatherInfiniband NONE plugin loaded
[2014-05-27T13:02:19.664] AcctGatherFilesystem NONE plugin loaded

So slurmd should be up and running. There is a whole list of nodes on this cluster with the same problem. I can communicate via munge, but Slurm is still having problems. What can I run to test whether RPC is the issue?
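As a sanity check on Slurm's own communication path (as far as I know Slurm's RPCs do not go through rpcbind at all), something like the following sketch should work; it assumes the default SlurmctldPort=6817 and SlurmdPort=6818 from slurm.conf, so adjust if those are overridden:

# on the node: confirm it can reach the controller over Slurm's own protocol
scontrol ping

# from the slurmctld host: confirm the node's slurmd port accepts TCP connections
nc -z -v n0169.lr3 6818

If scontrol ping reports the primary controller UP and nc can connect to port 6818, the TCP path slurmctld uses to ping slurmd should be open.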
rpcinfo n0169.lr3
   program version netid     address                service    owner
    100000    4    tcp6      ::.0.111               portmapper superuser
    100000    3    tcp6      ::.0.111               portmapper superuser
    100000    4    udp6      ::.0.111               portmapper superuser
    100000    3    udp6      ::.0.111               portmapper superuser
    100000    4    tcp       0.0.0.0.0.111          portmapper superuser
    100000    3    tcp       0.0.0.0.0.111          portmapper superuser
    100000    2    tcp       0.0.0.0.0.111          portmapper superuser
    100000    4    udp       0.0.0.0.0.111          portmapper superuser
    100000    3    udp       0.0.0.0.0.111          portmapper superuser
    100000    2    udp       0.0.0.0.0.111          portmapper superuser
    100000    4    local     /var/run/rpcbind.sock  portmapper superuser
    100000    3    local     /var/run/rpcbind.sock  portmapper superuser
    100024    1    udp       0.0.0.0.181.183        status     29
    100024    1    tcp       0.0.0.0.215.135        status     29
    100024    1    udp6      ::.238.33              status     29
    100024    1    tcp6      ::.153.169             status     29
    100021    1    udp       0.0.0.0.168.20         nlockmgr   superuser
    100021    3    udp       0.0.0.0.168.20         nlockmgr   superuser
    100021    4    udp       0.0.0.0.168.20         nlockmgr   superuser
    100021    1    tcp       0.0.0.0.179.21         nlockmgr   superuser
    100021    3    tcp       0.0.0.0.179.21         nlockmgr   superuser
    100021    4    tcp       0.0.0.0.179.21         nlockmgr   superuser
    100021    1    udp6      ::.155.84              nlockmgr   superuser
    100021    3    udp6      ::.155.84              nlockmgr   superuser
    100021    4    udp6      ::.155.84              nlockmgr   superuser
    100021    1    tcp6      ::.212.199             nlockmgr   superuser
    100021    3    tcp6      ::.212.199             nlockmgr   superuser
    100021    4    tcp6      ::.212.199             nlockmgr   superuser

Is there something that is not running that should be running? I even changed the logging to debug4 and still did not see any reason why. Should I raise the logging level higher?

Thanks,
Jackie

On Tue, May 27, 2014 at 7:24 PM, Danny Auble <[email protected]> wrote:

> Jackie, what does the slurmd log look like on one of these nodes? The *
> means just what you thought: no communication.
>
> Make sure you can ping the address from the slurmctld.
>
> Your timeout should be fine.
>
> Danny
>
> On May 27, 2014 4:40:23 PM PDT, Jacqueline Scoggins <[email protected]>
> wrote:
>>
>> I just migrated over 611 nodes to Slurm from Moab/Torque. These were the
>> last set of our nodes, and I noticed that a subset of them, around 39 or
>> so, show down with a * after the word down. I have tried to change the
>> state to IDLE, but the log file shows "Communication connection failure"
>> rpc:1008 errors, and I can't seem to see what is causing them.
>>
>> Any ideas of what to troubleshoot would be helpful. I tried munge -n |
>> ssh nodename unmunge, so munge is communicating just fine. Does it have
>> anything to do with any of the scheduler parameters? My thought is that
>> the MessageTimeout may be too low for a cluster of this size: 1831 nodes.
>>
>> The current setting is MessageTimeout = 60 sec.
>>
>> Should I increase it to 5 minutes, or at least 2 minutes?
>>
>> Jackie
>>
>
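P.S. For reference, this is roughly what I have been running on the slurmctld host to check the timeouts and to try clearing the DOWN* state (n0169.lr3 stands in for any of the affected nodes):

# current timeout settings from slurm.conf
scontrol show config | grep -i timeout

# try returning the node to service (I used State=IDLE; State=RESUME should work as well)
scontrol update NodeName=n0169.lr3 State=IDLE
scontrol show node n0169.lr3 | grep -iE 'State|Reason'

The state change only seems to stick if slurmctld can actually reach slurmd on the node; otherwise the node goes back to DOWN* once SlurmdTimeout expires.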
