Just to close this issue...

> after an upgrade from 15.08.12 to 16.05.04 we get these new messages:
>
> slurmctld error: fwd_tree_thread: taurusi4181 failed to forward the
> message, expecting 51 ret got only 1
>
> Where do they come from? - How can we get rid of them?

We had not deleted the "node_state" file during the upgrade, and that is what got us into trouble: in this file, Slurm remembers the most recent version of the message protocol that the slurmd on each compute node was using. In our case, nodes that had not been in production for a long time (drained, with no daemon running) were assumed to be running version 14.11 (or similar). Even though these nodes were not going to join Slurm, slurmctld kept the overall communication protocol version as low as 6912. The slurmds complained about this old version, and the tree-like message forwarding could not work.

In the end, we did the following:

* save the draining reasons,
* stop all Slurm daemons,
* DELETE node_state,
* start slurmdbd and slurmctld,
* load the draining reasons,
* start slurmd on the nodes.

This is a bit drastic, and we lost the original timestamps of the draining events, but now everything runs fine.

Best regards,
Ulf

-- 
___________________________________________________________________
Dr. Ulf Markwardt

Technische Universität Dresden
Center for Information Services and High Performance Computing (ZIH)
01062 Dresden, Germany

Phone: (+49) 351/463-33640
WWW: http://www.tu-dresden.de/zih
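
The "save the draining reasons" and "load draining reasons" steps could be scripted roughly as below. This is only a sketch, not the commands actually used at the site: it assumes the reasons were dumped beforehand with something like `sinfo -h -N -R -o "%N|%E"` (one `node|reason` pair per line), and the helper names `parse_drain_reasons` and `build_restore_commands` are hypothetical. The generated `scontrol update` calls set a fresh drain timestamp, which matches the loss of the original timestamps mentioned above.

```python
from typing import Dict, List


def parse_drain_reasons(sinfo_output: str) -> Dict[str, str]:
    """Parse 'node|reason' lines (assumed sinfo dump format) into a dict."""
    reasons: Dict[str, str] = {}
    for line in sinfo_output.splitlines():
        line = line.strip()
        if not line or "|" not in line:
            continue  # skip blank or malformed lines
        # Split only on the first '|' so a reason may itself contain '|'.
        node, reason = line.split("|", 1)
        reasons[node.strip()] = reason.strip()
    return reasons


def build_restore_commands(reasons: Dict[str, str]) -> List[List[str]]:
    """Build one 'scontrol update' argv list per node to re-drain it."""
    return [
        ["scontrol", "update", f"NodeName={node}", "State=DRAIN", f"Reason={reason}"]
        for node, reason in reasons.items()
    ]


# Example with made-up drain reasons (the node name is taken from the log line above).
saved = "taurusi4181|bad DIMM\n"
for argv in build_restore_commands(parse_drain_reasons(saved)):
    print(" ".join(argv))
```

Each argv list can be passed to `subprocess.run(argv, check=True)` once slurmctld is running again with a fresh node_state; building the commands first makes it easy to review them before anything is changed.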