Root cause:

1) When corosync is restarted, it may take up to a minute to finish setting up.

2) The systemd timeout value is exceeded.

Jan 10 18:57:49 juju-39e3e2-percona-3 systemd[1]: Failed to start Corosync Cluster Engine.
Jan 10 18:57:49 juju-39e3e2-percona-3 systemd[1]: corosync.service: Unit entered failed state.
Jan 10 18:57:49 juju-39e3e2-percona-3 systemd[1]: corosync.service: Failed with result 'timeout'.

3) Pacemaker is then started. The pacemaker systemd script has a dependency on corosync, which may still be in the process of coming up.

4) Pacemaker fails to start due to the dependency.

Jan 10 18:57:49 juju-39e3e2-percona-3 systemd[1]: pacemaker.service: Job pacemaker.service/start failed with result 'dependency'.

5) Pacemaker remains down.

6) Subsequently, the charm checks for pacemaker health by running `crm node list` in a loop until it succeeds.

7) Because pacemaker never starts, this loop never terminates.

Solutions:

1) Adding corosync to this bug for a systemd script timeout change.

2) The charm needs to better handle validation of service restarts and better communicate to the end user when an error has occurred.

Current work in progress:
https://review.openstack.org/#/c/419204/

** Also affects: corosync (Ubuntu)
Importance: Undecided
Status: New

--
You received this bug notification because you are a member of Ubuntu High Availability Team, which is subscribed to corosync in Ubuntu.
https://bugs.launchpad.net/bugs/1654403

Title:
Race condition in hacluster charm that leaves pacemaker down

Status in corosync package in Ubuntu:
New
Status in hacluster package in Juju Charms Collection:
Triaged

Bug description:
Symptom: one or more hacluster nodes are left in an executing state.

Observing the process list on the affected nodes, the command 'crm node list' is in an infinite loop and pacemaker is not started. On nodes that complete 'crm node list' and other crm commands, pacemaker is started.
See the artefacts from this run:
https://openstack-ci-reports.ubuntu.com/artifacts/test_charm_pipeline/openstack/charm-percona-cluster/417131/1/1873/index.html

Hypothesis: There is a race that leads to 'crm node list' being executed before pacemaker is started. It is also possible that something causes pacemaker to fail to start.

Suggest a check for pacemaker health before any crm commands are run.
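
To illustrate the suggested health check, here is a minimal sketch in Python. It is not the charm's actual code: the names wait_for_pacemaker, run_ok and PacemakerNotReady, as well as the timeout and interval values, are assumptions made for this example. The idea is to confirm that the pacemaker unit is active before running `crm node list`, and to give up with an explicit error after a deadline instead of looping forever.

```python
import subprocess
import time


class PacemakerNotReady(Exception):
    """Raised when pacemaker is not healthy before the deadline (hypothetical)."""


def run_ok(cmd):
    """Return True if cmd exits with status 0 (hypothetical helper)."""
    try:
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
    except OSError:
        # The binary is missing or cannot be executed.
        return False
    return result.returncode == 0


def wait_for_pacemaker(timeout=180, interval=5, run=run_ok,
                       sleep=time.sleep, clock=time.monotonic):
    """Block until pacemaker is active and `crm node list` succeeds.

    The timeout and interval defaults are illustrative assumptions.
    Raises PacemakerNotReady once the deadline passes, so the caller
    can report the failure instead of hanging.
    """
    deadline = clock() + timeout
    while True:
        # Check the service state first; only then try a crm command.
        if (run(["systemctl", "is-active", "--quiet", "pacemaker"])
                and run(["crm", "node", "list"])):
            return
        if clock() >= deadline:
            raise PacemakerNotReady(
                "pacemaker not healthy after %s seconds" % timeout
            )
        sleep(interval)
```

A caller could catch PacemakerNotReady and surface the failure to the end user rather than leaving the node in an executing state.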

