I can ping the node and ssh into it. The slurmd log file on the node does not report any communication issues.
For example:

ssh n0169.lr3 uptime
 19:44:56 up 99 days, 1:28, 1 user, load average: 2.00, 2.00, 2.00

sinfo -R | grep n0169.lr3
Not responding      root      2014-05-27T16:35:47

[2014-05-27T11:16:44.883] slurmd version 2.6.4 started
[2014-05-27T11:16:44.883] Job accounting gather LINUX plugin loaded
[2014-05-27T11:16:44.883] switch NONE plugin loaded
[2014-05-27T11:16:44.883] slurmd started on Tue, 27 May 2014 11:16:44 -0700
[2014-05-27T11:16:44.883] CPUs=20 Boards=1 Sockets=2 Cores=10 Threads=1 Memory=64498 TmpDisk=7693 Uptime=8528433
[2014-05-27T11:16:44.883] AcctGatherEnergy NONE plugin loaded
[2014-05-27T11:16:44.883] AcctGatherProfile NONE plugin loaded
[2014-05-27T11:16:44.883] AcctGatherInfiniband NONE plugin loaded
[2014-05-27T11:16:44.883] AcctGatherFilesystem NONE plugin loaded
[2014-05-27T13:02:17.632] got shutdown request
[2014-05-27T13:02:17.632] all threads complete
[2014-05-27T13:02:17.634] Consumable Resources (CR) Node Selection plugin shutting down ...
[2014-05-27T13:02:17.635] Munge cryptographic signature plugin unloaded
[2014-05-27T13:02:17.635] Slurmd shutdown completing
[2014-05-27T13:02:19.050] topology tree plugin loaded
[2014-05-27T13:02:19.661] Warning: Note very large processing time from slurm_topo_build_config: usec=611478 began=13:02:19.050
[2014-05-27T13:02:19.662] Gathering cpu frequency information for 20 cpus
[2014-05-27T13:02:19.663] task NONE plugin loaded
[2014-05-27T13:02:19.663] auth plugin for Munge (http://code.google.com/p/munge/) loaded
[2014-05-27T13:02:19.663] Munge cryptographic signature plugin loaded
[2014-05-27T13:02:19.664] Warning: Core limit is only 0 KB
[2014-05-27T13:02:19.664] slurmd version 2.6.4 started
[2014-05-27T13:02:19.664] Job accounting gather LINUX plugin loaded
[2014-05-27T13:02:19.664] switch NONE plugin loaded
[2014-05-27T13:02:19.664] slurmd started on Tue, 27 May 2014 13:02:19 -0700
[2014-05-27T13:02:19.664] CPUs=20 Boards=1 Sockets=2 Cores=10 Threads=1 Memory=64498 TmpDisk=7693 Uptime=8534767
[2014-05-27T13:02:19.664] AcctGatherEnergy NONE plugin loaded
[2014-05-27T13:02:19.664] AcctGatherProfile NONE plugin loaded
[2014-05-27T13:02:19.664] AcctGatherInfiniband NONE plugin loaded
[2014-05-27T13:02:19.664] AcctGatherFilesystem NONE plugin loaded

So slurmd should be up and running. There is a whole list of nodes on this cluster with the same problem. I can communicate via munge, but Slurm is still having problems. What can I run to test whether RPC is the issue?
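As a sanity check on Slurm's own communication path (as far as I know Slurm's RPCs do not go through rpcbind at all), something like the following sketch should work; it assumes the default SlurmctldPort=6817 and SlurmdPort=6818 from slurm.conf, so adjust if those are overridden:

# on the node: confirm it can reach the controller over Slurm's own protocol
scontrol ping

# from the slurmctld host: confirm the node's slurmd port accepts TCP connections
nc -z -v n0169.lr3 6818

If scontrol ping reports the primary controller UP and nc can connect to port 6818, the TCP path slurmctld uses to ping slurmd should be open.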
rpcinfo n0169.lr3
   program version netid     address                service    owner
    100000    4    tcp6      ::.0.111               portmapper superuser
    100000    3    tcp6      ::.0.111               portmapper superuser
    100000    4    udp6      ::.0.111               portmapper superuser
    100000    3    udp6      ::.0.111               portmapper superuser
    100000    4    tcp       0.0.0.0.0.111          portmapper superuser
    100000    3    tcp       0.0.0.0.0.111          portmapper superuser
    100000    2    tcp       0.0.0.0.0.111          portmapper superuser
    100000    4    udp       0.0.0.0.0.111          portmapper superuser
    100000    3    udp       0.0.0.0.0.111          portmapper superuser
    100000    2    udp       0.0.0.0.0.111          portmapper superuser
    100000    4    local     /var/run/rpcbind.sock  portmapper superuser
    100000    3    local     /var/run/rpcbind.sock  portmapper superuser
    100024    1    udp       0.0.0.0.181.183        status     29
    100024    1    tcp       0.0.0.0.215.135        status     29
    100024    1    udp6      ::.238.33              status     29
    100024    1    tcp6      ::.153.169             status     29
    100021    1    udp       0.0.0.0.168.20         nlockmgr   superuser
    100021    3    udp       0.0.0.0.168.20         nlockmgr   superuser
    100021    4    udp       0.0.0.0.168.20         nlockmgr   superuser
    100021    1    tcp       0.0.0.0.179.21         nlockmgr   superuser
    100021    3    tcp       0.0.0.0.179.21         nlockmgr   superuser
    100021    4    tcp       0.0.0.0.179.21         nlockmgr   superuser
    100021    1    udp6      ::.155.84              nlockmgr   superuser
    100021    3    udp6      ::.155.84              nlockmgr   superuser
    100021    4    udp6      ::.155.84              nlockmgr   superuser
    100021    1    tcp6      ::.212.199             nlockmgr   superuser
    100021    3    tcp6      ::.212.199             nlockmgr   superuser
    100021    4    tcp6      ::.212.199             nlockmgr   superuser

Is there something that is not running that should be running? I even changed the logging to debug4 and still did not see any reason why. Should I raise the logging level higher?

Thanks,
Jackie

On Tue, May 27, 2014 at 7:24 PM, Danny Auble <[email protected]> wrote:

> Jackie, what does the slurmd log look like on one of these nodes? The *
> means just what you thought: no communication.
>
> Make sure you can ping the address from the slurmctld.
>
> Your timeout should be fine.
>
> Danny
>
> On May 27, 2014 4:40:23 PM PDT, Jacqueline Scoggins <[email protected]>
> wrote:
>>
>> I just migrated over 611 nodes to Slurm from Moab/Torque. These were the
>> last set of our nodes, and I noticed that a subset of them, around 39 or
>> so, show down with a * after the word down. I have tried to change the
>> state to IDLE, but the log file shows "Communication connection failure"
>> rpc:1008 errors, and I can't seem to see what is causing them.
>>
>> Any ideas of what to troubleshoot would be helpful. I tried munge -n |
>> ssh nodename unmunge, so munge is communicating just fine. Does it have
>> anything to do with any of the scheduler parameters? My thought is that
>> the MessageTimeout may be too low for a cluster of this size: 1831 nodes.
>>
>> The current setting is MessageTimeout = 60 sec.
>>
>> Should I increase it to 5 minutes, or at least 2 minutes?
>>
>> Jackie
>>
>
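P.S. For reference, this is roughly what I have been running on the slurmctld host to check the timeouts and to try clearing the DOWN* state (n0169.lr3 stands in for any of the affected nodes):

# current timeout settings from slurm.conf
scontrol show config | grep -i timeout

# try returning the node to service (I used State=IDLE; State=RESUME should work as well)
scontrol update NodeName=n0169.lr3 State=IDLE
scontrol show node n0169.lr3 | grep -iE 'State|Reason'

The state change only seems to stick if slurmctld can actually reach slurmd on the node; otherwise the node goes back to DOWN* once SlurmdTimeout expires.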
