I finally got some time to debug this a bit more. I used scontrol to turn up the debug logging on the slurmctld.
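For reference, that was essentially the following (the restore level assumes the usual SlurmctldDebug=info default; adjust if yours differs):

    scontrol setdebug debug    # raise slurmctld log verbosity
    # ...reproduce the problem, watch slurmctld.log...
    scontrol setdebug info     # back to the normal level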
I'm seeing messages like this:

[2017-02-16T13:18:52.599] debug: Spawning ping agent for j-[001-050],t-[019-052]
[2017-02-16T13:18:52.599] debug: Spawning registration agent for j-[101-128],r-[001-004],t-[001-018] 50 hosts
[2017-02-16T13:18:52.621] agent/is_node_resp: node:t-019 rpc:1008 : Can't find an address, check slurm.conf

t-019 is one of my nodes that's frequently "down" according to slurm but really isn't. What is that "Can't find an address" about? DNS lookups seem to be working fine in a shell on the same machine. Sometimes I see larger groups of nodes showing the same message, but they quickly become responsive again:

[2017-02-16T13:27:12.540] debug: Spawning ping agent for j-[051-079,094,097-100],t-[036-064]
[2017-02-16T13:27:12.554] agent/is_node_resp: node:t-036 rpc:1008 : Can't find an address, check slurm.conf
[2017-02-16T13:27:12.554] agent/is_node_resp: node:t-037 rpc:1008 : Can't find an address, check slurm.conf
[2017-02-16T13:27:12.554] agent/is_node_resp: node:t-038 rpc:1008 : Can't find an address, check slurm.conf
[2017-02-16T13:27:12.554] agent/is_node_resp: node:t-039 rpc:1008 : Can't find an address, check slurm.conf
[2017-02-16T13:27:12.554] agent/is_node_resp: node:t-040 rpc:1008 : Can't find an address, check slurm.conf
[2017-02-16T13:27:12.554] agent/is_node_resp: node:t-042 rpc:1008 : Can't find an address, check slurm.conf
[2017-02-16T13:27:12.554] agent/is_node_resp: node:t-041 rpc:1008 : Can't find an address, check slurm.conf
[2017-02-16T13:27:12.554] agent/is_node_resp: node:t-043 rpc:1008 : Can't find an address, check slurm.conf
[2017-02-16T13:27:12.554] agent/is_node_resp: node:t-044 rpc:1008 : Can't find an address, check slurm.conf
[2017-02-16T13:27:12.554] agent/is_node_resp: node:t-045 rpc:1008 : Can't find an address, check slurm.conf
[2017-02-16T13:27:12.554] agent/is_node_resp: node:t-046 rpc:1008 : Can't find an address, check slurm.conf
[2017-02-16T13:27:12.554] agent/is_node_resp: node:t-047 rpc:1008 : Can't find an address, check slurm.conf
[2017-02-16T13:27:12.554] agent/is_node_resp: node:t-048 rpc:1008 : Can't find an address, check slurm.conf
[2017-02-16T13:27:12.554] agent/is_node_resp: node:t-049 rpc:1008 : Can't find an address, check slurm.conf
[2017-02-16T13:27:12.554] agent/is_node_resp: node:t-050 rpc:1008 : Can't find an address, check slurm.conf
[2017-02-16T13:27:12.554] agent/is_node_resp: node:t-051 rpc:1008 : Can't find an address, check slurm.conf
[2017-02-16T13:27:12.554] agent/is_node_resp: node:t-052 rpc:1008 : Can't find an address, check slurm.conf
[2017-02-16T13:27:13.541] error: Nodes t-[036-052] not responding
[...]
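My guess is that "check slurm.conf" refers to the NodeName/NodeAddr mapping that slurmctld builds from the node definitions rather than to DNS itself. For context, the node lines here are of the usual form, roughly like this (the names and hardware numbers below are illustrative placeholders, not my real lines):

    # NodeAddr/NodeHostname are not set, so they default to the NodeName,
    # which slurmctld then resolves itself (placeholder CPU/memory figures)
    NodeName=t-[001-064] CPUs=8 RealMemory=16000 State=UNKNOWN
    PartitionName=tpart Nodes=t-[001-064] Default=NO MaxTime=INFINITE State=UP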
A minute or so later they all come back on their own:

[2017-02-16T13:28:52.659] debug: Spawning ping agent for j-[001-050,101-128],r-[001-004],t-[001-052]
[2017-02-16T13:28:52.694] Node t-036 now responding
[2017-02-16T13:28:52.694] Node t-044 now responding
[2017-02-16T13:28:52.695] Node t-042 now responding
[2017-02-16T13:28:52.695] Node t-047 now responding
[2017-02-16T13:28:52.695] Node t-037 now responding
[2017-02-16T13:28:52.695] Node t-046 now responding
[2017-02-16T13:28:52.695] Node t-049 now responding
[2017-02-16T13:28:52.695] Node t-045 now responding
[2017-02-16T13:28:52.695] Node t-052 now responding
[2017-02-16T13:28:52.695] Node t-039 now responding
[2017-02-16T13:28:52.695] Node t-050 now responding
[2017-02-16T13:28:52.695] Node t-038 now responding
[2017-02-16T13:28:52.695] Node t-048 now responding
[2017-02-16T13:28:52.695] Node t-043 now responding
[2017-02-16T13:28:52.695] Node t-040 now responding
[2017-02-16T13:28:52.695] Node t-041 now responding
[2017-02-16T13:28:52.695] Node t-051 now responding

All the "t-nnn" nodes are in one cluster, and these are the only nodes showing this problem.

Thanks,

Allan

Allan Streib <astr...@indiana.edu> writes:

> They are all running ntpd and clocks are in sync.
>
> In this slurmctld there are a total of 226 nodes, in several different
> partitions. The cluster of 64 is the only one where I see this
> happening. Unless that number of nodes is pushing the limit for a single
> slurmctld (which I doubt), I'd be inclined to think it's more likely a
> network issue, but in that case I'd expect wireshark to show an attempt
> by slurmctld to contact the node and then no response. What I'm actually
> seeing is no traffic either way for these nodes, on the same interval as
> the others.
>
> Allan
>
> Lachlan Musicman <data...@gmail.com> writes:
>
>> Check that they are all on the same time, or ntpd against the same server. I
>> found that the nodes that kept going down had their time out of sync.
>>
>> Cheers
>> L.
>>
>> ------
>> The most dangerous phrase in the language is, "We've always done it this
>> way."
>>
>> - Grace Hopper
>>
>> On 25 January 2017 at 05:49, Allan Streib <astr...@indiana.edu> wrote:
>>
>> I have a cluster of 64 nodes, and nodes 1 and 19 keep getting
>> marked as down with a reason of "not responding". They are up,
>> pingable, slurmd is running, etc.; everything looks normal.
>>
>> Using wireshark on the slurmctld I looked at traffic for node 1
>> and node 2. I can see traffic between the slurmctld node and node 2
>> at intervals of about every 300 seconds, but for node 1 the
>> interval is sometimes as much as 1800 seconds.
>>
>> Any reason why these nodes might be getting "pinged" less often
>> than the others? The slurm.conf is identical, and contains these
>> timer settings (which I think are all defaults):
>>
>> # TIMERS
>> InactiveLimit=0
>> KillWait=30
>> MinJobAge=300
>> SlurmctldTimeout=120
>> SlurmdTimeout=300
>> Waittime=0
>>
>> Slurm version 14.11.7.
>>
>> Allan
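P.S. For anyone who wants to watch the same traffic I mentioned earlier in the thread, the slurmctld-to-slurmd pings go to the SlurmdPort (6818 unless overridden in slurm.conf), so on the controller something like this is the command-line equivalent of my wireshark capture:

    # eth0 and the node name are placeholders; 6818 is the default SlurmdPort
    tcpdump -ni eth0 'tcp port 6818 and host t-019'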