If slurmd is actually up (/etc/init.d/slurmd status should tell if it
is), maybe you should check your firewall configuration on the troubled
nodes. Some systems allow ssh and icmp by default, but block other
traffic. Judging by your rpcinfo output I am most likely wrong, though.
Also, are there equivalent versions of slurm on all the nodes?
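For example (just a sketch, the exact commands depend on your distribution), on a troubled node:
iptables -L -n | grep 6818
to see whether anything mentions the slurmd port (6818 by default), and to compare versions on the controller and a node:
sinfo -V
slurmd -V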
Jacqueline Scoggins wrote on 2014-05-28 08:30:
I set tree width much higher than that based on our last
conversation. The output from the node was included in the email. The
slurmctld log shows a communication connection failure, Node not
responding. First slurm takes it offline and then root owns it.
Can we chat tomorrow?
Jackie
Sent from my iPhone
On May 27, 2014, at 8:35 PM, Danny Auble <[email protected]> wrote:
Is there anything in the slurmctld log about node n0169.lr3?
All your nodes in the slurm.conf can talk to each other,
correct? I am pretty sure that is the case, but just to verify.
debug4 is quite high; I don't think you would need to go higher.
If you have debug2 on your slurmctld you could see the tree fanout
and see which node is trying to talk to it.
Just out of curiosity, if you set TreeWidth=31 does everything
register?
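Something along these lines should do it (treat it as a sketch):
scontrol setdebug debug2
and for the TreeWidth test, TreeWidth=31 in slurm.conf followed by
scontrol reconfigure
(restart slurmctld if the change does not seem to take).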
On 05/27/2014 07:50 PM, Jacqueline Scoggins wrote:
Re: [slurm-dev] Re: migration and node communication error
I can ping the node and ssh onto the node. The log file on the
node does not report any communication issues.
i.e.
ssh n0169.lr3
uptime
 19:44:56 up 99 days, 1:28, 1 user, load average: 2.00, 2.00, 2.00
sinfo -R |grep n0169.lr3
Not responding    root    2014-05-27T16:35:47
[2014-05-27T11:16:44.883] slurmd version 2.6.4 started
[2014-05-27T11:16:44.883] Job accounting gather LINUX plugin loaded
[2014-05-27T11:16:44.883] switch NONE plugin loaded
[2014-05-27T11:16:44.883] slurmd started on Tue, 27 May 2014 11:16:44 -0700
[2014-05-27T11:16:44.883] CPUs=20 Boards=1 Sockets=2 Cores=10 Threads=1 Memory=64498 TmpDisk=7693 Uptime=8528433
[2014-05-27T11:16:44.883] AcctGatherEnergy NONE plugin loaded
[2014-05-27T11:16:44.883] AcctGatherProfile NONE plugin loaded
[2014-05-27T11:16:44.883] AcctGatherInfiniband NONE plugin loaded
[2014-05-27T11:16:44.883] AcctGatherFilesystem NONE plugin loaded
[2014-05-27T13:02:17.632] got shutdown request
[2014-05-27T13:02:17.632] all threads complete
[2014-05-27T13:02:17.634] Consumable Resources (CR) Node Selection plugin shutting down ...
[2014-05-27T13:02:17.635] Munge cryptographic signature plugin unloaded
[2014-05-27T13:02:17.635] Slurmd shutdown completing
[2014-05-27T13:02:19.050] topology tree plugin loaded
[2014-05-27T13:02:19.661] Warning: Note very large processing time from slurm_topo_build_config: usec=611478 began=13:02:19.050
[2014-05-27T13:02:19.662] Gathering cpu frequency information for 20 cpus
[2014-05-27T13:02:19.663] task NONE plugin loaded
[2014-05-27T13:02:19.663] auth plugin for Munge (http://code.google.com/p/munge/) loaded
[2014-05-27T13:02:19.663] Munge cryptographic signature plugin loaded
[2014-05-27T13:02:19.664] Warning: Core limit is only 0 KB
[2014-05-27T13:02:19.664] slurmd version 2.6.4 started
[2014-05-27T13:02:19.664] Job accounting gather LINUX plugin loaded
[2014-05-27T13:02:19.664] switch NONE plugin loaded
[2014-05-27T13:02:19.664] slurmd started on Tue, 27 May 2014 13:02:19 -0700
[2014-05-27T13:02:19.664] CPUs=20 Boards=1 Sockets=2 Cores=10 Threads=1 Memory=64498 TmpDisk=7693 Uptime=8534767
[2014-05-27T13:02:19.664] AcctGatherEnergy NONE plugin loaded
[2014-05-27T13:02:19.664] AcctGatherProfile NONE plugin loaded
[2014-05-27T13:02:19.664] AcctGatherInfiniband NONE plugin loaded
[2014-05-27T13:02:19.664] AcctGatherFilesystem NONE plugin loaded
So it should be up and running. There is a list of nodes on this
cluster having problems. I can communicate via munge, but slurm is
having problems. What can I run to test if rpc is the issue?
rpcinfo n0169.lr3
   program version netid     address                service    owner
    100000    4    tcp6      ::.0.111               portmapper superuser
    100000    3    tcp6      ::.0.111               portmapper superuser
    100000    4    udp6      ::.0.111               portmapper superuser
    100000    3    udp6      ::.0.111               portmapper superuser
    100000    4    tcp       0.0.0.0.0.111          portmapper superuser
    100000    3    tcp       0.0.0.0.0.111          portmapper superuser
    100000    2    tcp       0.0.0.0.0.111          portmapper superuser
    100000    4    udp       0.0.0.0.0.111          portmapper superuser
    100000    3    udp       0.0.0.0.0.111          portmapper superuser
    100000    2    udp       0.0.0.0.0.111          portmapper superuser
    100000    4    local     /var/run/rpcbind.sock  portmapper superuser
    100000    3    local     /var/run/rpcbind.sock  portmapper superuser
    100024    1    udp       0.0.0.0.181.183        status     29
    100024    1    tcp       0.0.0.0.215.135        status     29
    100024    1    udp6      ::.238.33              status     29
    100024    1    tcp6      ::.153.169             status     29
    100021    1    udp       0.0.0.0.168.20         nlockmgr   superuser
    100021    3    udp       0.0.0.0.168.20         nlockmgr   superuser
    100021    4    udp       0.0.0.0.168.20         nlockmgr   superuser
    100021    1    tcp       0.0.0.0.179.21         nlockmgr   superuser
    100021    3    tcp       0.0.0.0.179.21         nlockmgr   superuser
    100021    4    tcp       0.0.0.0.179.21         nlockmgr   superuser
    100021    1    udp6      ::.155.84              nlockmgr   superuser
    100021    3    udp6      ::.155.84              nlockmgr   superuser
    100021    4    udp6      ::.155.84              nlockmgr   superuser
    100021    1    tcp6      ::.212.199             nlockmgr   superuser
    100021    3    tcp6      ::.212.199             nlockmgr   superuser
    100021    4    tcp6      ::.212.199             nlockmgr   superuser
Is there something that is not running that should be running?
I even changed logging to debug4 and I still did not see any reason
why. Should I raise the logging level higher?
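Would something like this from the slurmctld host be a reasonable test (I am assuming the default SlurmdPort of 6818 here)?
nc -vz n0169.lr3 6818
scontrol show node n0169.lr3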
Thanks
Jackie
On Tue, May 27, 2014 at 7:24 PM, Danny Auble <[email protected]> wrote:
Jackie, what does the slurmd log look like on one of these nodes?
The * means just what you thought, no communication.
Make sure you can ping the address from the slurmctld.
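For example, from the slurmctld host:
ping -c 3 n0169.lr3
and on the node, something like (the path depends on your SlurmdLogFile setting):
tail -f /var/log/slurm/slurmd.log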
Your timeout should be fine.
Danny
On May 27, 2014 4:40:23 PM PDT, Jacqueline Scoggins
<[email protected]> wrote:
I just migrated over 611 nodes to slurm from moab/torque.
This was the last set of our nodes, and I noticed that a subset of them,
around 39 or so, show down with a * after the word down. I have
tried to change the state to IDLE, but the log files show
Communication connection failure rpc:1008 errors and I can't seem to
see what is causing this.
Any ideas of what to troubleshoot would be helpful. I tried
munge -n | ssh nodename unmunge, so munge is communicating just fine.
Does it have anything to do with any of the scheduler
parameters? My thought is that the message timeout
is too low for a cluster of this size: 1831 nodes.
Current setting is MessageTimeout = 60 sec.
Should I increase it to 5 minutes, or at least 2 minutes?
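If so, would a change along these lines in slurm.conf, followed by a reconfigure (or a daemon restart), be the right way to try it? The 120 below is just a value I picked for the test:
MessageTimeout=120
scontrol reconfigure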
Jackie
--
Vsevolod D. Nikonorov, OITTiS, JSC NIKIET