[Bug 1903745] Re: pacemaker left stopped after unattended-upgrade of pacemaker (1.1.14-2ubuntu1.8 -> 1.1.14-2ubuntu1.9)

2021-02-18 Thread Billy Olsen
The charm fix was released in the 21.01 release with this patch:
https://review.opendev.org/c/openstack/charm-hacluster/+/763077

** Changed in: charm-hacluster
Milestone: None => 21.01

** Changed in: charm-hacluster
   Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1903745

Title:
  pacemaker left stopped after unattended-upgrade of pacemaker
  (1.1.14-2ubuntu1.8 -> 1.1.14-2ubuntu1.9)

To manage notifications about this bug go to:
https://bugs.launchpad.net/charm-hacluster/+bug/1903745/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

[Bug 1903745] Re: pacemaker left stopped after unattended-upgrade of pacemaker (1.1.14-2ubuntu1.8 -> 1.1.14-2ubuntu1.9)

2021-01-25 Thread Martin Benicek
Hello,
I cannot share an sosreport because it contains sensitive information. The
symptoms are as follows.
Upgrading pacemaker on Ubuntu 16 to the 1.9 package revision crashes the VMs.
In addition, the pacemaker error log and syslog are flooded with thousands of
messages per second, so the disk fills up quite quickly.
A forced reboot (reboot -f) helps, and the VMs boot as well. An ordinary
reboot (without -f) hangs for about 30 minutes; then the machine boots and
the problem is gone.

Snippet of logs from syslog:
Jan  8 09:14:31 te-primary crmd[3948]:   notice: Transition aborted by lrm_rsc_op.te-res_last_failure_0: Event failed (cib=0.0.0, source=match_graph_event:381, path=/create_request_adv/crm_xml/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='rte_res']/lrm_rsc_op[@id='te-res_last_failure_0'], 0)
Jan  8 09:14:31 te-primary crmd[3948]:  warning: FSA: Input I_FAIL from get_lrm_resource() received in state S_TRANSITION_ENGINE
Jan  8 09:14:31 te-primary crmd[3948]:   notice: Transition aborted: Peer Cancelled (source=do_te_invoke:161, 0)
Jan  8 09:14:31 te-primary crmd[3948]:   notice: Transition 6451494 (Complete=3, Pending=0, Fired=0, Skipped=4, Incomplete=28, Source=/var/lib/pacemaker/pengine/pe-input-19.bz2): Stopped
Jan  8 09:14:31 te-primary pengine[3947]:   notice: On loss of DCM Quorum: Ignore
Jan  8 09:14:31 te-primary pengine[3947]:   notice: Start   te-res#011(te-primary)
Jan  8 09:14:31 te-primary pengine[3947]:   notice: Start   fs_res#011(te-primary)
Jan  8 09:14:31 te-primary pengine[3947]:   notice: Start   VM_DCM_res#011(te-primary)
Jan  8 09:14:31 te-primary pengine[3947]:   notice: Start   VM_DAC_1_res#011(te-primary)
Jan  8 09:14:31 te-primary pengine[3947]:   notice: Start   drbd_res:0#011(te-primary)
Jan  8 09:14:31 te-primary pengine[3947]:   notice: Promote drbd_res:0#011(Stopped -> Master te-primary)
Jan  8 09:14:31 te-primary pengine[3947]:   notice: Calculated Transition 6451495: /var/lib/pacemaker/pengine/pe-input-19.bz2
Jan  8 09:14:31 te-primary crmd[3948]:  warning: bad input
Jan  8 09:14:31 te-primary crmd[3948]:  warning: bad input
Jan  8 09:14:31 te-primary crmd[3948]:  warning: bad input
Jan  8 09:14:31 te-primary crmd[3948]:  warning: bad input
Jan  8 09:14:31 te-primary crmd[3948]:  warning: bad input
Jan  8 09:14:31 te-primary crmd[3948]:  warning: bad input
Jan  8 09:14:31 te-primary crmd[3948]:  warning: bad input
Jan  8 09:14:31 te-primary crmd[3948]:  warning: bad input
Jan  8 09:14:31 te-primary crmd[3948]:   notice: Transition aborted by lrm_rsc_op.te-res_last_failure_0: Event failed (cib=0.0.0, source=match_graph_event:381, path=/create_request_adv/crm_xml/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='rte_res']/lrm_rsc_op[@id='te-res_last_failure_0'], 0)
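
A minimal sketch, assuming syslog lines in the classic "Mon DD HH:MM:SS host
proc[pid]:" shape, of how the flood rate could be measured without sharing
the logs themselves. The function name and the sample list are hypothetical;
only the line format is taken from the snippet above.

```python
import re
from collections import Counter

# Classic syslog prefix, e.g.
#   "Jan  8 09:14:31 te-primary crmd[3948]:  warning: bad input"
# An optional leading number (as added by some log viewers) is tolerated.
SYSLOG_RE = re.compile(
    r"^(?:\d+\s+)?"
    r"(?P<ts>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<proc>[\w.-]+)\[\d+\]:"
)


def messages_per_second(lines):
    """Count log lines per (timestamp, process) pair; unparsable lines are skipped."""
    counts = Counter()
    for line in lines:
        match = SYSLOG_RE.match(line)
        if match:
            counts[(match.group("ts"), match.group("proc"))] += 1
    return counts


# Hypothetical sample built from lines of the shape shown in the snippet.
sample = [
    "Jan  8 09:14:31 te-primary crmd[3948]:  warning: bad input",
    "Jan  8 09:14:31 te-primary crmd[3948]:  warning: bad input",
    "Jan  8 09:14:31 te-primary pengine[3947]:   notice: On loss of DCM Quorum: Ignore",
]

for (timestamp, process), count in sorted(messages_per_second(sample).items()):
    print(timestamp, process, count)
```

Run against the real syslog (for example by passing an open file object as
`lines`), the per-second counts for crmd would show how fast the log grows.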


As this seems to be a different error, I will open a new bug report.
Thanks a lot
sincerely
Martin


[Bug 1903745] Re: pacemaker left stopped after unattended-upgrade of pacemaker (1.1.14-2ubuntu1.8 -> 1.1.14-2ubuntu1.9)

2021-01-05 Thread Christian Ehrhardt 
@Martin - hi, given the state of the bug, this isn't considered an issue
or fixable bug in pacemaker itself, and a fix will be deployed via the charms
controlling it in this case.

I must admit I don't know enough about your particular case, but is it really
the very same one?
Would it make sense to report a new bug with plenty of detail on the setup,
what happens and why, and what fails?


[Bug 1903745] Re: pacemaker left stopped after unattended-upgrade of pacemaker (1.1.14-2ubuntu1.8 -> 1.1.14-2ubuntu1.9)

2021-01-05 Thread Martin Benicek
Hi, when can we expect the pacemaker 1.10 version? I work for a company
developing and delivering cloud software, and upgrading pacemaker to version
1.9 is causing VMs to crash. The problem occurs very frequently (8-9 times out
of 10).
Our temporary solution is to blacklist pacemaker 1.9, so we are staying on
1.8 and waiting for 1.10.
thanks
Martin


[Bug 1903745] Re: pacemaker left stopped after unattended-upgrade of pacemaker (1.1.14-2ubuntu1.8 -> 1.1.14-2ubuntu1.9)

2020-11-17 Thread OpenStack Infra
Fix proposed to branch: master
Review: https://review.opendev.org/763077

** Changed in: charm-hacluster
   Status: Confirmed => In Progress


[Bug 1903745] Re: pacemaker left stopped after unattended-upgrade of pacemaker (1.1.14-2ubuntu1.8 -> 1.1.14-2ubuntu1.9)

2020-11-17 Thread Billy Olsen
Based on discussion with Trent, who has access to more logs and data
than I currently do for this, all signs are indeed pointing to the
override timeouts provided by the charm itself.

A viable work-around to prevent this is to raise the
service_stop_timeout config on the hacluster charm above the
60 second default. Setting it to 1800 would restore the package's
default.
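
A minimal sketch of that work-around, assuming the charm is deployed under
the application name "hacluster" (an assumption; real deployments often use
a different name). It only builds the `juju config` command line and prints
it rather than executing anything; the helper name is hypothetical.

```python
import shlex

CHARM_DEFAULT_STOP_TIMEOUT = 60      # seconds, charm default quoted above
PACKAGE_DEFAULT_STOP_TIMEOUT = 1800  # seconds, package default quoted above


def juju_stop_timeout_command(application, timeout=PACKAGE_DEFAULT_STOP_TIMEOUT):
    """Return the argv that would raise service_stop_timeout on the given application."""
    if timeout <= CHARM_DEFAULT_STOP_TIMEOUT:
        raise ValueError("timeout should be higher than the 60 second charm default")
    return ["juju", "config", application, f"service_stop_timeout={timeout}"]


# "hacluster" is a placeholder application name.
print(shlex.join(juju_stop_timeout_command("hacluster")))
```

The printed command can be reviewed and run by hand against the model that
contains the hacluster application.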

I am also going to invalidate the pacemaker task as it wasn't caused by
the change to pacemaker and a more targeted bug to tweak the behavior of
whether the service starts/stops should be raised instead.

An investigation on possible alternatives for dealing with the upgrades
and maintenance mode of the cluster should be pursued outside the bounds
of this particular bug.

As a work-around is available, I'll reduce the escalation (subscribe
field-high, remove field-critical) while working on a patch to change the
service timeout defaults.

** Changed in: pacemaker (Ubuntu)
   Status: Confirmed => Invalid

** Changed in: charm-hacluster
   Importance: Undecided => Critical


[Bug 1903745] Re: pacemaker left stopped after unattended-upgrade of pacemaker (1.1.14-2ubuntu1.8 -> 1.1.14-2ubuntu1.9)

2020-11-16 Thread Trent Lloyd
For clarity, my findings so far are that:
 - The package upgrade stops pacemaker

 - After 30 seconds (customised down from 30min by charm-hacluster), the
stop times out and pretends to have finished, but leaves pacemaker
running (due to SendSIGKILL=no in the .service, intentionally set
upstream to prevent fencing)

 - Pacemaker is started again, but fails to start because the old copy
is still running, so it exits and the systemd service is left 'stopped'

 - The original "unmanaged" pacemaker copy eventually exits some time
later (usually once the resources have all transitioned away), leaving no
running pacemaker at all
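
The sequence above can be sketched as a toy model. This is a deliberate
simplification for reasoning about the outcome, not a simulation of systemd
or pacemaker; the names `Outcome` and `upgrade_outcome` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Outcome:
    unit_state: str          # what systemd ends up reporting for pacemaker.service
    old_process_alive: bool  # whether the pre-upgrade pacemaker is still running


def upgrade_outcome(seconds_to_stop: Optional[float], stop_timeout: float) -> Outcome:
    """Model the end state after a package upgrade restarts pacemaker.

    seconds_to_stop: time pacemaker needs for a clean shutdown,
                     or None if the cluster state means it never finishes.
    stop_timeout:    the stop timeout configured on the service, in seconds.
    """
    if seconds_to_stop is not None and seconds_to_stop <= stop_timeout:
        # Clean stop within the timeout: the start after the upgrade succeeds.
        return Outcome("running", old_process_alive=False)
    # The stop timed out. With SendSIGKILL=no the old copy keeps running, the
    # new start fails against it, and the unit is left stopped. The old copy
    # exits later only if its shutdown can ever complete.
    return Outcome("stopped", old_process_alive=seconds_to_stop is None)


print(upgrade_outcome(45, 30))    # shutdown needs longer than the reduced timeout
print(upgrade_outcome(45, 1800))  # same shutdown with the 30 minute timeout
```

In this model, a shutdown that needs 45 seconds ends with no pacemaker
running under a 30 second timeout, and with a normal restart under 1800.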

Compounding this issue is that:
 - Pacemaker won't stop until it confirms all local services have stopped and
transitioned away to other nodes (and possibly that it won't destroy quorum by
going down, but I am not sure about that bit). In some cases this just takes
more than 30 seconds; in other cases the cluster may be in such a state that it
will never happen, e.g. another node was already down or trying to shut down.

 - All unattended-upgrades happen within a randomized 60 minute window
(apt-daily-upgrade.timer), and they all just try to stop pacemaker
without regard to whether that is possible or likely to succeed. After
a while all 3 will be attempting to stop, so none of them would succeed.

Current Thoughts:
 - Adjust the charm-hacluster StopTimeout=30 back to some higher value
(possibly the default) after testing that this does not break the charm when
doing deploy/scale-up/scale-down [as noted in previous bugs where it was
originally added, but the original case was supposedly fixed by adding the
cluster_count option].

 - Consider whether we need to override SendSIGKILL in the charm -
changing it as a global package default seems like a bad idea

 - Research an improvement to the pacemaker dpkg scripts to do something
smarter than just running stop, for example the preinst script could ask
for a transition away without actually running stop on pacemaker and/or
abort the upgrade if it is obvious that that transition will fail.

 - As a related note, the patch to set BindsTo=corosync on
pacemaker.service was removed in Groovy due to debate with Debian over
this change (but still exists in Xenial-Focal). This is something that
will need to be dealt with for the next LTS. This override should
probably be added to charm-hacluster at a minimum.
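
As a rough illustration of the overrides discussed in this list, here is a
minimal sketch that renders a systemd drop-in for pacemaker.service. It is
not the charm's actual template: the helper name and the drop-in path in the
comment are assumptions, and the 1800 second default mirrors the 30 minute
value mentioned above. TimeoutStopSec=, SendSIGKILL= and BindsTo= are
standard systemd directives.

```python
def render_pacemaker_override(stop_timeout_sec=1800, binds_to_corosync=True,
                              send_sigkill=None):
    """Render drop-in text for pacemaker.service.

    send_sigkill=None leaves the package's SendSIGKILL setting untouched.
    """
    if stop_timeout_sec <= 0:
        raise ValueError("stop_timeout_sec must be positive")
    lines = []
    if binds_to_corosync:
        # Stop pacemaker whenever corosync stops.
        lines += ["[Unit]", "BindsTo=corosync.service", ""]
    lines += ["[Service]", f"TimeoutStopSec={stop_timeout_sec}"]
    if send_sigkill is not None:
        lines.append("SendSIGKILL=" + ("yes" if send_sigkill else "no"))
    return "\n".join(lines) + "\n"


# Hypothetical destination:
#   /etc/systemd/system/pacemaker.service.d/override.conf
print(render_pacemaker_override())
```

Leaving SendSIGKILL out by default reflects the concern above that changing
it could lead to fencing; any real change would need testing against
deploy, scale-up and scale-down.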


[Bug 1903745] Re: pacemaker left stopped after unattended-upgrade of pacemaker (1.1.14-2ubuntu1.8 -> 1.1.14-2ubuntu1.9)

2020-11-16 Thread Trent Lloyd
With regard to Billy's Comment #18, my analysis for that bionic
sosreport is in Comment #8, where I found that specific sosreport didn't
experience this issue. Most likely that node was suffering from the
issue occurring on the MySQL nodes it was connected to, and the
service couldn't connect to MySQL as a result. We'd need the full logs
(sosreport --all-logs) from all related keystone nodes and mysql nodes
in the environment to be sure, but I am 95% sure that is the case there.

I think there is some argument to be made for improving the package restart
process for the pacemaker package itself; however, I am finding, based on
the logs here and in a couple of environments I analysed, that the
primary problem is specifically related to the reduced StopTimeout set
by charm-hacluster. So I think we should focus on that issue here, and if
we decide it makes sense to make improvements to the pacemaker package
process itself, that should be opened as a separate bug, as I haven't seen
any evidence of that issue in the logs here so far.

For anyone else experiencing this bug, please take a *full* copy of
/var/log (or sosreport --all-logs) from -all- nodes in that specific
pacemaker cluster and upload them and I am happy to analyse them - if
you need a non-public location to share the files feel free to e-mail
them to me. It would be great to receive that from any nodes already
recovered so we can ensure we fully understand all the cases that
happened.


[Bug 1903745] Re: pacemaker left stopped after unattended-upgrade of pacemaker (1.1.14-2ubuntu1.8 -> 1.1.14-2ubuntu1.9)

2020-11-16 Thread Billy Olsen
I should note that one of the reasons I do not suspect the charm is at
fault here is that in the bionic sosreport linked in the bug, I do not
see the delay that I would expect in a timeout scenario. If the stop
timeout were coming into play, I would expect to see a long duration on
the restart.

However, we can see the package was upgraded at 06:17:34:

...
Start-Date: 2020-11-10  06:17:34
Commandline: /usr/bin/unattended-upgrade
Upgrade: pacemaker:amd64 (1.1.18-0ubuntu1.1, 1.1.18-0ubuntu1.3)
End-Date: 2020-11-10  06:17:36
...

The pacemaker service is stopped at 06:17:34 with a SIGTERM:

Nov 10 06:17:34 [51765] juju-caae6f-19-lxd-6 pacemakerd:   notice: crm_signal_dispatch: Caught 'Terminated' signal | 15 (invoking handler)

and is running again by 06:17:36:

Nov 10 06:17:36 [41195] juju-caae6f-19-lxd-6 pacemakerd: info: crm_log_init: Changed active directory to /var/lib/pacemaker/cores
Nov 10 06:17:36 [41195] juju-caae6f-19-lxd-6 pacemakerd: info: get_cluster_type: Detected an active 'corosync' cluster
Nov 10 06:17:36 [41195] juju-caae6f-19-lxd-6 pacemakerd: error: sysrq_init: Cannot write to /proc/sys/kernel/sysrq: Permission denied (13)

This doesn't look like it's hindered by the timeout configuration.
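
The timing argument can be checked with a little arithmetic. A minimal
sketch, assuming the Start-Date/End-Date format shown in the apt history
excerpt above; the helper names are hypothetical, and the 30 second timeout
used in the example is an assumption rather than a value read from this
node.

```python
from datetime import datetime


def apt_timestamp(value):
    """Parse an apt history timestamp such as '2020-11-10  06:17:34'."""
    # The date and time are separated by two spaces; split() accepts any run.
    date_part, time_part = value.split()
    return datetime.strptime(f"{date_part} {time_part}", "%Y-%m-%d %H:%M:%S")


def stop_timeout_reached(start, end, stop_timeout_sec):
    """True if the upgrade lasted at least as long as the stop timeout."""
    return (end - start).total_seconds() >= stop_timeout_sec


start = apt_timestamp("2020-11-10  06:17:34")
end = apt_timestamp("2020-11-10  06:17:36")
elapsed = (end - start).total_seconds()

# 30 is an assumed stop timeout, used only for illustration.
print(elapsed, stop_timeout_reached(start, end, 30))
```

An upgrade that completes in a couple of seconds cannot have waited out a
stop timeout of tens of seconds, which is the point being made above.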


[Bug 1903745] Re: pacemaker left stopped after unattended-upgrade of pacemaker (1.1.14-2ubuntu1.8 -> 1.1.14-2ubuntu1.9)

2020-11-16 Thread Billy Olsen
I'm not sure why the pacemaker task was marked as invalid. The issues
that Trent identified in comment #15 are a problem, but I'm not entirely
convinced that it's *the* problem that was encountered here (as also
evidenced by Pedro in comment #16).

While packages do typically restart services automatically, not all
package upgrades will trigger this particular behavior. For example, the
ceph packages will upgrade the services but not actually restart
services as that could be very disruptive to the storage provided.

I agree with Ante in comment #9 that the packages aren't doing all of
the necessary steps to properly manage the upgrade. While I'm not
necessarily convinced this is something that should be handled by
unattended-upgrades, I'm also not convinced that it will be easy to add
this logic to the packages. For example, pacemaker depends on corosync,
but so does dlm-controld, and it is not reasonable for the corosync
package to make assumptions on how to treat software built on top of it.

Rather, I think it should be considered to change the behavior of the
corosync/pacemaker packages to not automatically restart the services,
or to provide an option in which the operator can control this behavior.

** Changed in: pacemaker (Ubuntu)
   Status: Invalid => Confirmed

** Changed in: charm-hacluster
 Assignee: (unassigned) => Billy Olsen (billy-olsen)


[Bug 1903745] Re: pacemaker left stopped after unattended-upgrade of pacemaker (1.1.14-2ubuntu1.8 -> 1.1.14-2ubuntu1.9)

2020-11-14 Thread Pedro Victor Lourenço Fragola
I have an environment with the same behavior in a Bionic VM that does
not use the charm.

Preparing to unpack .../pacemaker_1.1.18-0ubuntu1.3_amd64.deb ...
Unpacking pacemaker (1.1.18-0ubuntu1.3) over (1.1.18-0ubuntu1.1) ...
Log ended: 2020-11-11 06:54:05 

After the upgrade, there was an IP(master/vip) switch between the nodes
that left the environment unavailable.
