Hi,
I'm having trouble with CLOUD-type nodes lingering after they have
crashed or disappeared by means other than Slurm's SuspendProgram.

After some time, the nodes typically get marked as *down*, but most of
the time they are never actually removed. They still show up in sinfo:

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
multi        up 2-00:20:00      2  down* jetstream-iu-elastic[2,4]


Here's the log, showing that the controller keeps failing to communicate
with the node but never relinquishes it:
$ tail -n 500 slurmctld.log | grep -i 'power\|error'
[2016-09-26T20:58:57.204] error: Nodes jetstream-iu-elastic4 not responding
[2016-09-26T21:00:37.068] error: Unable to resolve "jetstream-iu-elastic4":
Unknown host
[2016-09-26T21:00:37.068] error: fwd_tree_thread: can't find address for
host jetstream-iu-elastic4, check slurm.conf
[2016-09-26T21:00:37.312] error: Nodes jetstream-iu-elastic4 not responding
[2016-09-26T21:00:57.022] Power save mode: 2 nodes
[2016-09-26T21:02:17.250] error: Unable to resolve "jetstream-iu-elastic4":
Unknown host
[2016-09-26T21:02:17.250] error: fwd_tree_thread: can't find address for
host jetstream-iu-elastic4, check slurm.conf
[2016-09-26T21:02:17.521] error: Nodes jetstream-iu-elastic4 not responding
[2016-09-26T21:02:40.255] powering down node jetstream-iu-elastic4
[2016-09-26T21:02:41.533] error: agent waited too long for nodes to
respond, sending batch request anyway...
[2016-09-26T21:02:42.288] error: Unable to resolve "jetstream-iu-elastic4":
Unknown host
[2016-09-26T21:02:42.288] error: fwd_tree_thread: can't find address for
host jetstream-iu-elastic4, check slurm.conf
[2016-09-26T21:02:42.551] error: Nodes jetstream-iu-elastic4 not responding
[2016-09-26T21:02:43.083] error: Unable to resolve "jetstream-iu-elastic4":
Unknown host
[2016-09-26T21:02:43.083] error: fwd_tree_thread: can't find address for
host jetstream-iu-elastic4, check slurm.conf
[2016-09-26T21:02:43.552] error: Nodes jetstream-iu-elastic4 not responding
[2016-09-26T21:03:57.354] error: Unable to resolve "jetstream-iu-elastic4":
Unknown host
[2016-09-26T21:03:57.354] error: fwd_tree_thread: can't find address for
host jetstream-iu-elastic4, check slurm.conf
[2016-09-26T21:03:57.590] error: Nodes jetstream-iu-elastic4 not responding
[2016-09-26T21:05:37.402] error: Unable to resolve "jetstream-iu-elastic4":
Unknown host
[2016-09-26T21:05:37.402] error: fwd_tree_thread: can't find address for
host jetstream-iu-elastic4, check slurm.conf
[2016-09-26T21:05:37.638] error: Nodes jetstream-iu-elastic4 not responding
[2016-09-26T21:07:17.652] error: Unable to resolve "jetstream-iu-elastic4":
Unknown host
[2016-09-26T21:07:17.652] error: fwd_tree_thread: can't find address for
host jetstream-iu-elastic4, check slurm.conf
[2016-09-26T21:07:17.887] error: Nodes jetstream-iu-elastic4 not responding
[2016-09-26T21:08:57.726] error: Unable to resolve "jetstream-iu-elastic4":
Unknown host
[2016-09-26T21:08:57.726] error: fwd_tree_thread: can't find address for
host jetstream-iu-elastic4, check slurm.conf
[2016-09-26T21:08:57.962] error: Nodes jetstream-iu-elastic4 not responding
[2016-09-26T21:10:36.011] error: Nodes jetstream-iu-elastic4 not
responding, setting DOWN
[2016-09-26T21:11:37.058] Power save mode: 2 nodes
[2016-09-26T21:12:17.024] error: Unable to resolve "jetstream-iu-elastic4":
Unknown host
[2016-09-26T21:12:17.024] error: fwd_tree_thread: can't find address for
host jetstream-iu-elastic4, check slurm.conf
[2016-09-26T21:13:57.095] error: Unable to resolve "jetstream-iu-elastic4":
Unknown host
[2016-09-26T21:13:57.096] error: fwd_tree_thread: can't find address for
host jetstream-iu-elastic4, check slurm.conf
[2016-09-26T21:14:00.096] debug2: Error connecting slurm stream socket at
10.0.0.50:7003: No route to host
[2016-09-26T21:15:37.157] error: Unable to resolve "jetstream-iu-elastic4":
Unknown host
[2016-09-26T21:15:37.157] error: fwd_tree_thread: can't find address for
host jetstream-iu-elastic4, check slurm.conf


In slurm.conf, the following are set:
SuspendTime=180
SlurmdTimeout=300
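
For context, this sits inside a fairly standard power-save block; a
sketch of the relevant slurm.conf parameters is below (the program paths
and timeout values other than the two above are illustrative placeholders,
not my actual settings):

SuspendTime=180
SuspendProgram=/usr/local/sbin/suspend_node.sh    # placeholder path
ResumeProgram=/usr/local/sbin/resume_node.sh      # placeholder path
SuspendTimeout=60
ResumeTimeout=300
SlurmdTimeout=300
ReturnToService=2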


I've tried the following command, but it just cycles the node's status
from down to completing and back to down:
$ scontrol update nodename=jetstream-iu-elastic4 state=power_down
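
As a stopgap I've considered scripting this cleanup. A minimal sketch is
below; it assumes per-node output in the shape produced by
`sinfo -h -N -o "%N %T"`, and the state it emits is just the same
power_down guess as above (it only prints the commands, it doesn't run
them):

```python
def stuck_nodes(sinfo_text):
    """Return names of nodes reported as down and not responding.

    Expects one "nodename state" pair per line, as from
    `sinfo -h -N -o "%N %T"`; a trailing '*' on the state means
    the node is not responding.
    """
    nodes = []
    for line in sinfo_text.strip().splitlines():
        name, state = line.split()
        if state.endswith('*') and state.rstrip('*') == 'down':
            nodes.append(name)
    return nodes


# Illustrative sample of sinfo output, mirroring my cluster above.
sample = """\
jetstream-iu-elastic2 down*
jetstream-iu-elastic4 down*
jetstream-iu-elastic1 idle
"""

for node in stuck_nodes(sample):
    print(f"scontrol update nodename={node} state=power_down")
```
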


Is there a way to tell/force Slurm to forget/discard a reference to a node
that's disappeared?

Thanks,
Enis
