For clarity, my findings so far are:
 - The package upgrade stops pacemaker

 - After 30 seconds (TimeoutStopSec, lowered from the upstream 30min by
charm-hacluster), the stop times out and is reported as finished, but
pacemaker is left running (because SendSIGKILL=no is intentionally set
in the upstream .service file to prevent fencing)

 - Pacemaker is then started again, but the new instance fails to start
because the old copy is still running; it exits and the systemd service
is left 'stopped'

 - The original "unmanaged" pacemaker copy eventually exits some time
later (usually once the resources have all transitioned away), leaving
no running pacemaker at all
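
The states described above can be told apart on a node by comparing what
systemd reports with whether a pacemakerd process exists. The following
is a minimal sketch, not something from the charm or the package: the
function names and the state labels ("orphaned", "down", ...) are made
up for illustration, and it assumes the daemon process is named
pacemakerd and that systemctl and pgrep are available.

```python
import subprocess


def classify(unit_state: str, daemon_running: bool) -> str:
    """Map (systemd unit state, pacemakerd process present) to a diagnosis."""
    unit_up = unit_state in ("active", "activating", "deactivating")
    if unit_up and daemon_running:
        return "managed"       # normal: systemd tracks the running daemon
    if not unit_up and daemon_running:
        return "orphaned"      # stop timed out, the old copy is still running
    if not unit_up and not daemon_running:
        return "down"          # old copy has exited and nothing restarted it
    return "inconsistent"      # unit claims to be up but no daemon was found


def probe() -> str:
    """Query the live system (assumes systemctl and pgrep are installed)."""
    unit_state = subprocess.run(
        ["systemctl", "is-active", "pacemaker"],
        capture_output=True, text=True,
    ).stdout.strip()
    daemon_running = subprocess.run(
        ["pgrep", "-x", "pacemakerd"],
        stdout=subprocess.DEVNULL,
    ).returncode == 0
    return classify(unit_state, daemon_running)
```

Calling probe() shortly after the upgrade should report "orphaned" while
the old copy is still draining resources, and "down" once it has exited.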

Compounding this issue is that:
 - Pacemaker won't stop until it confirms all local services have stopped and
transitioned away to other nodes (and possibly that it won't destroy quorum by
going down, but I am not sure about that bit). In some cases this just takes
more than 30 seconds; in other cases the cluster may be in such a state that it
will never happen, e.g. another node was already down or trying to shut down.

 - All unattended-upgrades happen within a randomized 60 minute window
(apt-daily-upgrade.timer), and each node just tries to stop pacemaker
without regard to whether that is possible or likely to succeed. After
a while all 3 nodes will be attempting to stop, so none of them can
succeed.
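
To get a feel for how likely it is that stop attempts collide inside
that window, here is a rough Monte Carlo sketch. It is a simplified
model under assumptions of my own: each node starts its stop at an
independent, uniformly random point in the 60 minute window, and every
stop attempt lasts the same fixed time. Real stops depend on cluster
state, so treat the result as an order-of-magnitude illustration only.
The function name and parameters are hypothetical.

```python
import random


def overlap_probability(stop_minutes: float, nodes: int = 3,
                        window_minutes: float = 60.0,
                        trials: int = 100_000, seed: int = 0) -> float:
    """Estimate the chance that at least two stop attempts overlap in time."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # One random start time per node inside the upgrade window.
        starts = sorted(rng.uniform(0.0, window_minutes) for _ in range(nodes))
        # Two attempts overlap when consecutive start times are closer
        # together than the time a single stop attempt takes.
        if any(b - a < stop_minutes for a, b in zip(starts, starts[1:])):
            hits += 1
    return hits / trials
```

For example, overlap_probability(5.0) estimates the chance that two of
the three nodes are stopping pacemaker at the same time if each stop
takes about five minutes.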

Current Thoughts:
 - Raise the charm-hacluster stop timeout (TimeoutStopSec=30) back to some
larger value (possibly the upstream default of 30min), after testing that
this does not break the charm's deploy/scale-up/scale-down [the timeout was
originally added for a case noted in previous bugs, but that case was
supposedly fixed by adding the cluster_count option].

 - Consider whether we need to override SendSIGKILL in the charm;
changing it as a global package default seems like a bad idea.

 - Research an improvement to the pacemaker dpkg scripts to do something
smarter than just running stop, for example the preinst script could ask
for a transition away without actually running stop on pacemaker and/or
abort the upgrade if it is obvious that that transition will fail.

 - As a related note, the patch to set BindsTo=corosync on
pacemaker.service was removed in Groovy due to debate with Debian over
this change (but still exists in Xenial-Focal). This is something that
will need to be dealt with for the next LTS. This override should
probably be added to charm-hacluster at a minimum.
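
As an illustration of what such overrides could look like, the sketch
below renders the text of a systemd drop-in for pacemaker.service. This
is not the charm's actual code: render_override and its parameters are
hypothetical, and the values passed in are examples rather than
recommendations. TimeoutStopSec, SendSIGKILL and BindsTo are the real
systemd directives; options left out keep their packaged values.

```python
from typing import Optional


def render_override(stop_timeout: Optional[str] = None,
                    send_sigkill: Optional[bool] = None,
                    binds_to: Optional[str] = None) -> str:
    """Build the text of a systemd drop-in from the options that are set."""
    unit, service = [], []
    if binds_to is not None:
        unit.append(f"BindsTo={binds_to}")            # belongs in [Unit]
    if stop_timeout is not None:
        service.append(f"TimeoutStopSec={stop_timeout}")
    if send_sigkill is not None:
        service.append(f"SendSIGKILL={'yes' if send_sigkill else 'no'}")
    sections = []
    if unit:
        sections.append("\n".join(["[Unit]"] + unit))
    if service:
        sections.append("\n".join(["[Service]"] + service))
    return "\n\n".join(sections) + "\n"


# Example values only: restore a 30min stop timeout and bind to corosync.
print(render_override(stop_timeout="30min", binds_to="corosync.service"))
```

The resulting text would go in a file under
/etc/systemd/system/pacemaker.service.d/ (the file name is arbitrary as
long as it ends in .conf), followed by a systemctl daemon-reload.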

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1903745

Title:
  pacemaker left stopped after unattended-upgrade of pacemaker
  (1.1.14-2ubuntu1.8 -> 1.1.14-2ubuntu1.9)
