[Bug 1903745] Re: pacemaker left stopped after unattended-upgrade of pacemaker (1.1.14-2ubuntu1.8 -> 1.1.14-2ubuntu1.9)
Charm fix was released in the 21.01 release with this patch:
https://review.opendev.org/c/openstack/charm-hacluster/+/763077

** Changed in: charm-hacluster
    Milestone: None => 21.01

** Changed in: charm-hacluster
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1903745

Title:
  pacemaker left stopped after unattended-upgrade of pacemaker (1.1.14-2ubuntu1.8 -> 1.1.14-2ubuntu1.9)

To manage notifications about this bug go to:
https://bugs.launchpad.net/charm-hacluster/+bug/1903745/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
[Bug 1903745] Re: pacemaker left stopped after unattended-upgrade of pacemaker (1.1.14-2ubuntu1.8 -> 1.1.14-2ubuntu1.9)
Hello,

I cannot share an sosreport because it contains sensitive information. The symptoms are as follows.

Upgrading pacemaker on Ubuntu 16.04 to pacemaker 1.9 (1.1.14-2ubuntu1.9) crashes the VMs. In addition, the pacemaker error log and syslog are flooded with thousands of messages per second, so the disk fills up quite quickly. Running reboot -f helps (the VMs boot as well). An ordinary reboot (without -f) hangs for about 30 minutes, then the machine boots and the problem is gone.

Snippet from syslog:

194551 Jan 8 09:14:31 te-primary crmd[3948]: notice: Transition aborted by lrm_rsc_op.te-res_last_failure_0: Event failed (cib=0.0.0, source=match_graph_event:381, path=/create_request_adv/crm_xml/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='rte_re s']/lrm_rsc_op[@id='te-res_last_failure_0'], 0)
194552 Jan 8 09:14:31 te-primary crmd[3948]: warning: FSA: Input I_FAIL from get_lrm_resource() received in state S_TRANSITION_ENGINE
194553 Jan 8 09:14:31 te-primary crmd[3948]: notice: Transition aborted: Peer Cancelled (source=do_te_invoke:161, 0)
194554 Jan 8 09:14:31 te-primary crmd[3948]: notice: Transition 6451494 (Complete=3, Pending=0, Fired=0, Skipped=4, Incomplete=28, Source=/var/lib/pacemaker/pengine/pe-input-19.bz2): Stopped
194555 Jan 8 09:14:31 te-primary pengine[3947]: notice: On loss of DCM Quorum: Ignore
194556 Jan 8 09:14:31 te-primary pengine[3947]: notice: Start te-res#011(te-primary)
194557 Jan 8 09:14:31 te-primary pengine[3947]: notice: Start fs_res#011(te-primary)
194558 Jan 8 09:14:31 te-primary pengine[3947]: notice: Start VM_DCM_res#011(te-primary)
194559 Jan 8 09:14:31 te-primary pengine[3947]: notice: Start VM_DAC_1_res#011(te-primary)
194560 Jan 8 09:14:31 te-primary pengine[3947]: notice: Start drbd_res:0#011(te-primary)
194561 Jan 8 09:14:31 te-primary pengine[3947]: notice: Promote drbd_res:0#011(Stopped -> Master te-primary)
194562 Jan 8 09:14:31 te-primary pengine[3947]: notice: Calculated Transition 6451495: /var/lib/pacemaker/pengine/pe-input-19.bz2
194563 Jan 8 09:14:31 te-primary crmd[3948]: warning: bad input
194564 Jan 8 09:14:31 te-primary crmd[3948]: warning: bad input
194565 Jan 8 09:14:31 te-primary crmd[3948]: warning: bad input
194566 Jan 8 09:14:31 te-primary crmd[3948]: warning: bad input
194567 Jan 8 09:14:31 te-primary crmd[3948]: warning: bad input
194568 Jan 8 09:14:31 te-primary crmd[3948]: warning: bad input
194569 Jan 8 09:14:31 te-primary crmd[3948]: warning: bad input
194570 Jan 8 09:14:31 te-primary crmd[3948]: warning: bad input
194571 Jan 8 09:14:31 te-primary crmd[3948]: notice: Transition aborted by lrm_rsc_op.te-res_last_failure_0: Event failed (cib=0.0.0, source=match_graph_event:381, path=/create_request_adv/crm_xml/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='rte_re s']/lrm_rsc_op[@id='te-res_last_failure_0'], 0)

As this seems to be a different error, I will open a new bug report.

Thanks a lot,
Martin
[Bug 1903745] Re: pacemaker left stopped after unattended-upgrade of pacemaker (1.1.14-2ubuntu1.8 -> 1.1.14-2ubuntu1.9)
@Martin - hi, given the state of this bug, it is not considered a fixable bug in pacemaker itself; in this case a fix will be deployed via the charms controlling it.

I must admit I don't know enough about your particular case, but is it really the very same one? Would it make sense to report a new bug with full details on the setup, what happens and why, and what fails?
[Bug 1903745] Re: pacemaker left stopped after unattended-upgrade of pacemaker (1.1.14-2ubuntu1.8 -> 1.1.14-2ubuntu1.9)
Hi,

when can we expect the pacemaker 1.10 version? I work for a company that develops and delivers cloud software, and upgrading pacemaker to version 1.9 crashes our VMs. The problem occurs very frequently (8-9 times out of 10). Our temporary solution is to blacklist pacemaker 1.9, so we are staying on 1.8 and waiting for 1.10.

Thanks,
Martin
[Bug 1903745] Re: pacemaker left stopped after unattended-upgrade of pacemaker (1.1.14-2ubuntu1.8 -> 1.1.14-2ubuntu1.9)
Fix proposed to branch: master
Review: https://review.opendev.org/763077

** Changed in: charm-hacluster
       Status: Confirmed => In Progress
[Bug 1903745] Re: pacemaker left stopped after unattended-upgrade of pacemaker (1.1.14-2ubuntu1.8 -> 1.1.14-2ubuntu1.9)
Based on discussion with Trent, who has access to more logs and data than I currently do, all signs do point to the timeout overrides provided by the charm itself. A viable work-around is to raise the service_stop_timeout config option on the hacluster charm above its 60 second default. Setting it to 1800 restores the package's default.

I am also going to invalidate the pacemaker task, as this wasn't caused by the change to pacemaker; a more targeted bug about whether the package should start/stop the service should be raised instead. Possible alternatives for handling upgrades and the maintenance mode of the cluster should be investigated outside the bounds of this particular bug.

As a work-around is available, I'll reduce the severity (subscribe field-high, remove field-critical) while working on a patch to change the service timeout defaults.

** Changed in: pacemaker (Ubuntu)
       Status: Confirmed => Invalid

** Changed in: charm-hacluster
   Importance: Undecided => Critical
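As a rough illustration of the work-around described above, the following Python sketch builds the juju command line that raises service_stop_timeout. The application name "hacluster" is an assumption (deployments often name it differently, e.g. per principal service), and the helper name is hypothetical; only the option name and the value 1800 come from this thread.

```python
import shlex
from typing import List


def juju_config_command(application: str, timeout_seconds: int) -> List[str]:
    """Build the argv that sets service_stop_timeout on a hacluster application.

    `application` is whatever name the hacluster charm was deployed under;
    "hacluster" below is only an assumed example.
    """
    if timeout_seconds <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    return ["juju", "config", application,
            f"service_stop_timeout={timeout_seconds}"]


if __name__ == "__main__":
    # Print the command rather than running it, so it can be reviewed first.
    print(shlex.join(juju_config_command("hacluster", 1800)))
```

The sketch only prints the command; running it (for example via subprocess.run) against a live model is left to the operator.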
[Bug 1903745] Re: pacemaker left stopped after unattended-upgrade of pacemaker (1.1.14-2ubuntu1.8 -> 1.1.14-2ubuntu1.9)
For clarity, my findings so far are that:
 - The package upgrade stops pacemaker.
 - After 30 seconds (customised down from 30 minutes by charm-hacluster), the stop times out and pretends to have finished, but leaves pacemaker running (due to SendSIGKILL=no in the .service file, intentionally set upstream to prevent fencing).
 - Pacemaker is started again but fails to start because the old copy is still running, so it exits and the systemd service is left 'stopped'.
 - The original "unmanaged" pacemaker copy eventually exits some time later (usually once the resources have all transitioned away), leaving no running pacemaker at all.

Compounding this issue is that:
 - Pacemaker won't stop until it confirms all local services have stopped and transitioned away to other nodes (and possibly that it won't destroy quorum by going down, but I am not sure about that bit). In some cases this just takes more than 30 seconds; in other cases the cluster may be in such a state that it will never happen, e.g. another node was already down or trying to shut down.
 - All unattended-upgrades happen within a randomized 60 minute window (apt-daily-upgrade.timer), and they all just try to stop pacemaker without regard to whether that is possible or likely to succeed. After a while all 3 will be attempting to stop, so none of them would succeed.

Current Thoughts:
 - Adjust the charm-hacluster StopTimeout=30 back to some higher value (possibly the default) after testing that this does not break the charm's deploy/scale-up/scale-down operations [as noted in previous bugs where it was originally added, but the original case was supposedly fixed by adding the cluster_count option].
 - Consider whether we need to override SendSIGKILL in the charm. Changing it as a global package default seems like a bad idea.
 - Research an improvement to the pacemaker dpkg scripts to do something smarter than just running stop. For example, the preinst script could ask for a transition away without actually running stop on pacemaker, and/or abort the upgrade if it is obvious that the transition will fail.
 - As a related note, the patch to set BindsTo=corosync on pacemaker.service was removed in Groovy due to debate with Debian over this change (but still exists in Xenial-Focal). This is something that will need to be dealt with for the next LTS. This override should probably be added to charm-hacluster at a minimum.
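To make the override ideas above more concrete, here is a minimal Python sketch that renders a systemd drop-in for pacemaker.service. It assumes that "StopTimeout" in the message refers to systemd's TimeoutStopSec directive and that BindsTo would target corosync.service; the function name is hypothetical, and how charm-hacluster actually writes its override may differ.

```python
def render_pacemaker_override(timeout_stop_sec: int,
                              binds_to_corosync: bool = False) -> str:
    """Render the text of a systemd drop-in for pacemaker.service.

    Assumptions: TimeoutStopSec is the directive behind the "StopTimeout"
    mentioned in the thread, and corosync.service is the unit to bind to.
    """
    if timeout_stop_sec <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    sections = []
    if binds_to_corosync:
        # BindsTo is a [Unit] directive, so it needs its own section.
        sections.append("[Unit]\nBindsTo=corosync.service\n")
    sections.append(f"[Service]\nTimeoutStopSec={timeout_stop_sec}\n")
    return "\n".join(sections)


if __name__ == "__main__":
    # 1800 seconds is the 30 minute package default mentioned in the thread.
    print(render_pacemaker_override(1800, binds_to_corosync=True))
```

A drop-in like this would normally live in a directory such as /etc/systemd/system/pacemaker.service.d/ and take effect after systemctl daemon-reload; SendSIGKILL is deliberately left untouched here, in line with the caution expressed above.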
[Bug 1903745] Re: pacemaker left stopped after unattended-upgrade of pacemaker (1.1.14-2ubuntu1.8 -> 1.1.14-2ubuntu1.9)
With regard to Billy's comment #18, my analysis of that bionic sosreport is in comment #8, where I found that the specific sosreport didn't experience this issue. Most likely that node was suffering from the issue occurring on the MySQL nodes it was connected to, and the service couldn't connect to MySQL as a result. We'd need the full logs (sosreport --all-logs) from all related keystone nodes and mysql nodes in the environment to be sure, but I am 95% sure that is the case there.

I think there is some argument to be made for improving the package restart process for the pacemaker package itself. However, based on the logs here and in a couple of environments I analysed, I am finding that the primary problem is specifically related to the reduced StopTimeout set by charm-hacluster. So I think we should focus on that issue here. If we decide it makes sense to improve the pacemaker package process itself, that should be opened as a separate bug, as I haven't seen any evidence of that issue in the logs here so far.

For anyone else experiencing this bug, please take a *full* copy of /var/log (or sosreport --all-logs) from -all- nodes in that specific pacemaker cluster and upload them; I am happy to analyse them. If you need a non-public location to share the files, feel free to e-mail them to me. It would be great to receive that from any nodes already recovered so we can ensure we fully understand all the cases that happened.
[Bug 1903745] Re: pacemaker left stopped after unattended-upgrade of pacemaker (1.1.14-2ubuntu1.8 -> 1.1.14-2ubuntu1.9)
I should note that one of the reasons I do not suspect the charm is at fault here is that, in the bionic sosreport linked in the bug, I do not see the delay that I would expect in a timeout scenario. If the stop timeout were coming into play, I would expect to see a long duration on the restart. However, we can see the package was upgraded at 06:17:34:

...
Start-Date: 2020-11-10 06:17:34
Commandline: /usr/bin/unattended-upgrade
Upgrade: pacemaker:amd64 (1.1.18-0ubuntu1.1, 1.1.18-0ubuntu1.3)
End-Date: 2020-11-10 06:17:36
...

The pacemaker service is stopped at 06:17:34 with a SIGTERM:

Nov 10 06:17:34 [51765] juju-caae6f-19-lxd-6 pacemakerd: notice: crm_signal_dispatch: Caught 'Terminated' signal | 15 (invoking handler)

and is running again by 06:17:36:

Nov 10 06:17:36 [41195] juju-caae6f-19-lxd-6 pacemakerd: info: crm_log_init:Changed active directory to /var/lib/pacemaker/cores
Nov 10 06:17:36 [41195] juju-caae6f-19-lxd-6 pacemakerd: info: get_cluster_type:Detected an active 'corosync' cluster
Nov 10 06:17:36 [41195] juju-caae6f-19-lxd-6 pacemakerd:error: sysrq_init: Cannot write to /proc/sys/kernel/sysrq: Permission denied (13)

This doesn't look like it's hindered by the timeout configuration.
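The timing argument above can be checked with a few lines of Python. This is only a sketch: the helper names are hypothetical, the timestamp format is assumed from the apt history excerpt quoted in this message, and the 60 second figure is the charm default mentioned earlier in the thread.

```python
from datetime import datetime

# Format of the Start-Date/End-Date values as quoted in this message.
APT_HISTORY_FORMAT = "%Y-%m-%d %H:%M:%S"


def upgrade_duration_seconds(start_date: str, end_date: str) -> float:
    """Return the seconds elapsed between two apt history timestamps."""
    # Collapse any repeated whitespace before parsing.
    start = datetime.strptime(" ".join(start_date.split()), APT_HISTORY_FORMAT)
    end = datetime.strptime(" ".join(end_date.split()), APT_HISTORY_FORMAT)
    return (end - start).total_seconds()


def stop_timeout_plausible(duration_seconds: float,
                           stop_timeout_seconds: float) -> bool:
    """A stop that hit the timeout should take at least the timeout itself."""
    return duration_seconds >= stop_timeout_seconds


if __name__ == "__main__":
    duration = upgrade_duration_seconds("2020-11-10 06:17:34",
                                        "2020-11-10 06:17:36")
    print(duration, stop_timeout_plausible(duration, 60))
```

With the timestamps quoted above, the whole upgrade takes 2 seconds, far short of any stop timeout discussed in this bug.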
[Bug 1903745] Re: pacemaker left stopped after unattended-upgrade of pacemaker (1.1.14-2ubuntu1.8 -> 1.1.14-2ubuntu1.9)
I'm not sure why the pacemaker task was marked as invalid. The issues that Trent identified in comment #15 are a problem, but I'm not entirely convinced that it's *the* problem that was encountered here (as also evidenced by Pedro in comment #16).

While packages do typically restart services automatically, not all package upgrades will trigger this particular behavior. For example, the ceph packages will upgrade the services but not actually restart them, as that could be very disruptive to the storage provided.

I agree with Ante in comment #9 that the packages aren't doing all of the necessary steps to properly manage the upgrade. While I'm not necessarily convinced this is something that should be handled by unattended-upgrades, I'm also not convinced that it will be easy to add this logic to the packages. For example, pacemaker depends on corosync, but so does dlm-controld, and it is not reasonable for the corosync package to make assumptions about how to treat software built on top of it.

Rather, I think we should consider changing the behavior of the corosync/pacemaker packages so that they do not automatically restart the services, or providing an option through which the operator can control this behavior.

** Changed in: pacemaker (Ubuntu)
       Status: Invalid => Confirmed

** Changed in: charm-hacluster
     Assignee: (unassigned) => Billy Olsen (billy-olsen)
[Bug 1903745] Re: pacemaker left stopped after unattended-upgrade of pacemaker (1.1.14-2ubuntu1.8 -> 1.1.14-2ubuntu1.9)
I have an environment with the same behavior in a Bionic VM that does not use the charm.

Preparing to unpack .../pacemaker_1.1.18-0ubuntu1.3_amd64.deb ...
Unpacking pacemaker (1.1.18-0ubuntu1.3) over (1.1.18-0ubuntu1.1) ...
Log ended: 2020-11-11 06:54:05

After the upgrade, there was an IP (master/vip) switch between the nodes that left the environment unavailable.