Won't it be easier to:

- set the node in standby
- stop the node
- remove the node
- add it again with the new hostname
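Roughly, with crmsh (the node name "atlas0" is illustrative; pcs has equivalents), a sketch of those four steps:

```shell
# Sketch only, assuming crmsh; "atlas0" is a placeholder node name.
crm node standby atlas0                          # 1. drain resources off the node
ssh atlas0 'systemctl stop pacemaker corosync'   # 2. stop cluster services there
crm node delete atlas0                           # 3. drop the node entry from the CIB
# 4. rename the host, update corosync.conf on all nodes, then start
#    corosync and pacemaker on the renamed node; it joins under the new name.
```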
Best Regards,
Strahil Nikolov

On 18 August 2020 17:15:49 GMT+03:00, Ken Gaillot <[email protected]> wrote:
>On Tue, 2020-08-18 at 14:35 +0200, Kadlecsik József wrote:
>> Hi,
>>
>> On Mon, 17 Aug 2020, Ken Gaillot wrote:
>>
>> > On Mon, 2020-08-17 at 12:12 +0200, Kadlecsik József wrote:
>> > >
>> > > While upgrading a corosync/pacemaker/libvirt/KVM cluster from Debian
>> > > stretch to buster, all the node utilization attributes were erased
>> > > from the configuration. However, the same attributes were kept at
>> > > the VirtualDomain resources. This resulted in all resources with
>> > > utilization attributes being stopped.
>> >
>> > Ouch :(
>> >
>> > There are two types of node attributes, transient and permanent.
>> > Transient attributes last only until pacemaker is next stopped on
>> > the node, while permanent attributes persist between
>> > reboots/restarts.
>> >
>> > If you configured the utilization attributes with crm_attribute -z/
>> > --utilization, it will default to permanent, but it's possible to
>> > override that with -l/--lifetime reboot (or equivalently, -t/--type
>> > status).
>>
>> The attributes were defined by "crm configure edit", simply stating:
>>
>> node 1084762113: atlas0 \
>>     utilization hv_memory=192 cpu=32 \
>>     attributes standby=off
>> ...
>> node 1084762119: atlas6 \
>>     utilization hv_memory=192 cpu=32
>>
>> But I believe now that corosync caused the problem, because the nodes
>> had been renumbered:
>
>Ah yes, that would do it. Pacemaker would consider them different nodes
>with the same names. The "other" node's attributes would not apply to
>the "new" node.
>
>The upgrade procedure would be similar, except that you would start
>corosync by itself after each upgrade. After all nodes were upgraded,
>you would modify the CIB on one node (while pacemaker is not running)
>with:
>
>CIB_file=/var/lib/pacemaker/cib/cib.xml cibadmin --modify --scope=nodes
>-X '...'
>
>where '...' is a <node> XML entry from the CIB with the "id" value
>changed to the new ID, and repeat that for each node. Then, start
>pacemaker on that node and wait for it to come up, then start pacemaker
>on the other nodes.
>
>> node 3232245761: atlas0
>> ...
>> node 3232245767: atlas6
>>
>> The upgrade process was:
>>
>> for each node do
>>     set the "hold" mark on the corosync package
>>     put the node in standby
>>     wait for the resources to be migrated off
>>     upgrade from stretch to buster
>>     reboot
>>     put the node online
>>     wait for the resources to be migrated (back)
>> done
>>
>> Up to this point all resources were running fine.
>>
>> In order to upgrade corosync, we followed these steps:
>>
>> enable maintenance mode
>> stop pacemaker and corosync on all nodes
>> for each node do
>>     delete the hold mark and upgrade corosync
>>     install the new config file (nodeid not specified)
>>     restart corosync, start pacemaker
>> done
>>
>> We could see that all resources were running unmanaged. When we
>> disabled maintenance mode, they were stopped.
>>
>> So I think corosync renumbered the nodes, and I suspect the reason
>> was that "clear_node_high_bit: yes" was not specified in the new
>> config file. It was an admin error, then.
>>
>> Best regards,
>> Jozsef
>> --
>> E-mail : [email protected]
>> PGP key: https://wigner.hu/~kadlec/pgp_public_key.txt
>> Address: Wigner Research Centre for Physics
>>          H-1525 Budapest 114, POB. 49, Hungary
>--
>Ken Gaillot <[email protected]>
>
>_______________________________________________
>Manage your subscription:
>https://lists.clusterlabs.org/mailman/listinfo/users
>
>ClusterLabs home: https://www.clusterlabs.org/
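Concretely, the per-node CIB fix Ken describes would look something like this (a sketch only; the ids and unames below are the ones quoted in this thread, and the command must run while pacemaker is stopped on that node):

```shell
# Point cibadmin at the on-disk CIB and rewrite each <node> entry,
# keeping the uname but substituting the new corosync node id.
# Ids/names here are the ones from this thread; adjust to your cluster.
CIB_file=/var/lib/pacemaker/cib/cib.xml cibadmin --modify --scope=nodes \
    -X '<node id="3232245761" uname="atlas0"/>'
# ...repeat for atlas1..atlas6 with their new ids, then start pacemaker
# on this node first, and on the remaining nodes once it is up.
```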
