Just to close this issue...

> after an upgrade from 15.08.12 to 16.05.04 we get these new messages:
> 
> slurmctld error: fwd_tree_thread: taurusi4181 failed to forward the
> message, expecting 51 ret got only 1
> 
> Where do they come from? - How can we get rid of them?

We had not deleted the "node_state" file, and that is what got us into trouble:

In this file, Slurm remembers the most recent version of the message
protocol that the slurmd on each compute node was using. In our case,
nodes that had been out of production (drained, no daemon running) for a
long time were assumed to run version 14.11 (or the like). Even though
these nodes were not going to join Slurm, slurmctld kept the overall
communication protocol version as low as 6912. The slurmds complained
about this old version, and the tree-like message forwarding could not
work.
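
To make the number 6912 a bit more tangible, here is a minimal sketch of
how such a value can be unpacked. It assumes that Slurm packs the
protocol version into a 16-bit integer with a release counter in the
high byte and a minor number in the low byte; that encoding and the
function name are my assumptions, and which release a given counter
belongs to should be checked against the Slurm source headers.

```python
def decode_protocol_version(value: int) -> tuple[int, int]:
    """Split a packed protocol version into (high byte, low byte).

    Assumption: the value is encoded as (release_counter << 8) | minor.
    """
    return value >> 8, value & 0xFF


# 6912 is the value slurmctld had fallen back to in our case.
release_counter, minor = decode_protocol_version(6912)
print(release_counter, minor)  # prints "27 0"
```

If the assumption holds, 6912 is simply 27 * 256, which means a single
stale entry is enough to pin the controller to an old release counter.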

In the end, we did the following:
* save the draining reasons,
* stop all Slurm daemons,
* DELETE node_state,
* start slurmdbd and slurmctld,
* load the draining reasons,
* start slurmd on the nodes.
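
The "save" and "load" steps for the draining reasons can be sketched
roughly as below. This is not the exact script we used, just an
illustration under assumptions: the reasons are assumed to have been
dumped beforehand into "node|reason" lines (for example with something
like `sinfo -h -N -R -o "%N|%E"`), the reason text in the sample is made
up, and both function names are hypothetical. The sketch only builds the
`scontrol update` commands and prints them; it does not run anything.

```python
import shlex


def parse_drain_reasons(text: str) -> dict[str, str]:
    """Parse 'node|reason' lines (assumed dump format) into a dict."""
    reasons = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "|" not in line:
            continue  # skip empty or malformed lines
        node, reason = line.split("|", 1)
        reasons[node.strip()] = reason.strip()
    return reasons


def restore_commands(reasons: dict[str, str]) -> list[list[str]]:
    """Build one 'scontrol update' argument list per drained node."""
    return [
        ["scontrol", "update", f"NodeName={node}",
         "State=DRAIN", f"Reason={reason}"]
        for node, reason in sorted(reasons.items())
    ]


# Hypothetical sample: the node name is from the log above,
# the reason text is invented for illustration.
saved = parse_drain_reasons("taurusi4181|hardware check pending\n")
for cmd in restore_commands(saved):
    print(shlex.join(cmd))  # review the commands before running them
```

Printing the commands first makes it easy to check them by hand before
feeding them to a shell once slurmctld is up again. As noted below, the
original drain timestamps are not recovered this way.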

This is a bit drastic, and we lost the original timestamps of the
draining events, but now everything runs fine.

Best regards,
Ulf



-- 
___________________________________________________________________
Dr. Ulf Markwardt

Technische Universität Dresden
Center for Information Services and High Performance Computing (ZIH)
01062 Dresden, Germany

Phone: (+49) 351/463-33640      WWW:  http://www.tu-dresden.de/zih

