Hi, I'm having trouble with CLOUD-type nodes lingering after they have crashed or disappeared by means other than Slurm's SuspendProgram.
After some time, the nodes typically get marked down*, but most of the time they are never actually removed. They show up in sinfo:

$ sinfo
PARTITION AVAIL  TIMELIMIT   NODES STATE  NODELIST
multi     up     2-00:20:00      2 down*  jetstream-iu-elastic[2,4]

Here's the log, showing that the control process is failing to communicate with the node but never relinquishes it:

$ tail -n 500 slurmctld.log | grep -i 'power\|error'
[2016-09-26T20:58:57.204] error: Nodes jetstream-iu-elastic4 not responding
[2016-09-26T21:00:37.068] error: Unable to resolve "jetstream-iu-elastic4": Unknown host
[2016-09-26T21:00:37.068] error: fwd_tree_thread: can't find address for host jetstream-iu-elastic4, check slurm.conf
[2016-09-26T21:00:37.312] error: Nodes jetstream-iu-elastic4 not responding
[2016-09-26T21:00:57.022] Power save mode: 2 nodes
[2016-09-26T21:02:17.250] error: Unable to resolve "jetstream-iu-elastic4": Unknown host
[2016-09-26T21:02:17.250] error: fwd_tree_thread: can't find address for host jetstream-iu-elastic4, check slurm.conf
[2016-09-26T21:02:17.521] error: Nodes jetstream-iu-elastic4 not responding
[2016-09-26T21:02:40.255] powering down node jetstream-iu-elastic4
[2016-09-26T21:02:41.533] error: agent waited too long for nodes to respond, sending batch request anyway...
[2016-09-26T21:02:42.288] error: Unable to resolve "jetstream-iu-elastic4": Unknown host
[2016-09-26T21:02:42.288] error: fwd_tree_thread: can't find address for host jetstream-iu-elastic4, check slurm.conf
[2016-09-26T21:02:42.551] error: Nodes jetstream-iu-elastic4 not responding
[2016-09-26T21:02:43.083] error: Unable to resolve "jetstream-iu-elastic4": Unknown host
[2016-09-26T21:02:43.083] error: fwd_tree_thread: can't find address for host jetstream-iu-elastic4, check slurm.conf
[2016-09-26T21:02:43.552] error: Nodes jetstream-iu-elastic4 not responding
[2016-09-26T21:03:57.354] error: Unable to resolve "jetstream-iu-elastic4": Unknown host
[2016-09-26T21:03:57.354] error: fwd_tree_thread: can't find address for host jetstream-iu-elastic4, check slurm.conf
[2016-09-26T21:03:57.590] error: Nodes jetstream-iu-elastic4 not responding
[2016-09-26T21:05:37.402] error: Unable to resolve "jetstream-iu-elastic4": Unknown host
[2016-09-26T21:05:37.402] error: fwd_tree_thread: can't find address for host jetstream-iu-elastic4, check slurm.conf
[2016-09-26T21:05:37.638] error: Nodes jetstream-iu-elastic4 not responding
[2016-09-26T21:07:17.652] error: Unable to resolve "jetstream-iu-elastic4": Unknown host
[2016-09-26T21:07:17.652] error: fwd_tree_thread: can't find address for host jetstream-iu-elastic4, check slurm.conf
[2016-09-26T21:07:17.887] error: Nodes jetstream-iu-elastic4 not responding
[2016-09-26T21:08:57.726] error: Unable to resolve "jetstream-iu-elastic4": Unknown host
[2016-09-26T21:08:57.726] error: fwd_tree_thread: can't find address for host jetstream-iu-elastic4, check slurm.conf
[2016-09-26T21:08:57.962] error: Nodes jetstream-iu-elastic4 not responding
[2016-09-26T21:10:36.011] error: Nodes jetstream-iu-elastic4 not responding, setting DOWN
[2016-09-26T21:11:37.058] Power save mode: 2 nodes
[2016-09-26T21:12:17.024] error: Unable to resolve "jetstream-iu-elastic4": Unknown host
[2016-09-26T21:12:17.024] error: fwd_tree_thread: can't find address for host jetstream-iu-elastic4, check slurm.conf
[2016-09-26T21:13:57.095] error: Unable to resolve "jetstream-iu-elastic4": Unknown host
[2016-09-26T21:13:57.096] error: fwd_tree_thread: can't find address for host jetstream-iu-elastic4, check slurm.conf
[2016-09-26T21:14:00.096] debug2: Error connecting slurm stream socket at 10.0.0.50:7003: No route to host
[2016-09-26T21:15:37.157] error: Unable to resolve "jetstream-iu-elastic4": Unknown host
[2016-09-26T21:15:37.157] error: fwd_tree_thread: can't find address for host jetstream-iu-elastic4, check slurm.conf

In slurm.conf, the following are set:

SuspendTime=180
SlurmdTimeout=300

I've tried the following command, but that just cycles the node's status from down to completing and back to down:

$ scontrol update nodename=jetstream-iu-elastic4 state=power_down

Is there a way to tell/force Slurm to forget/discard a reference to a node that has disappeared?

Thanks,
Enis