Root cause:

1) When corosync is restarted it may take up to a minute for it to
finish setting up.

2) The systemd timeout value is exceeded.
Jan 10 18:57:49 juju-39e3e2-percona-3 systemd[1]: Failed to start Corosync Cluster Engine.
Jan 10 18:57:49 juju-39e3e2-percona-3 systemd[1]: corosync.service: Unit entered failed state.
Jan 10 18:57:49 juju-39e3e2-percona-3 systemd[1]: corosync.service: Failed with result 'timeout'.

3) Pacemaker is then started. The pacemaker systemd unit has a dependency
on corosync, which may still be in the process of coming up.

4) Pacemaker fails to start because its corosync dependency failed:
Jan 10 18:57:49 juju-39e3e2-percona-3 systemd[1]: pacemaker.service: Job pacemaker.service/start failed with result 'dependency'.

5) Pacemaker remains down.

6) Subsequently, the charm checks for pacemaker health by running `crm
node list` in a loop until it succeeds.

7) Because pacemaker never comes up, this loop never terminates.


Solutions

1) Adding corosync to this bug for systemd script timeout change

2) The charm needs to validate service restarts more carefully and
communicate more clearly to the end user when an error has occurred
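
As a rough illustration of item 2 (not the charm's actual code), the
unbounded `crm node list` loop could be replaced with a retry that gives up
after a deadline and raises an error the charm can report. The names
`wait_for_command` and `ClusterNotReady`, and the timeout and interval
values, are hypothetical.

```python
import subprocess
import time


class ClusterNotReady(Exception):
    """Raised when the command keeps failing until the deadline passes."""


def wait_for_command(cmd, timeout=300.0, interval=5.0,
                     run=subprocess.call, sleep=time.sleep,
                     clock=time.monotonic):
    """Retry `cmd` until it exits 0 or `timeout` seconds have elapsed.

    Returns the number of attempts made. The `run`, `sleep` and `clock`
    parameters exist only so the loop can be exercised without a cluster.
    """
    deadline = clock() + timeout
    attempts = 0
    while True:
        attempts += 1
        if run(cmd) == 0:
            return attempts
        if clock() >= deadline:
            # Give up instead of looping forever, so the failure is visible.
            raise ClusterNotReady(
                "%r still failing after %d attempts" % (cmd, attempts))
        sleep(interval)
```

A caller would use something like `wait_for_command(["crm", "node", "list"])`
and turn `ClusterNotReady` into a blocked/error status message for the user.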


Current work in progress:
https://review.openstack.org/#/c/419204/


** Also affects: corosync (Ubuntu)
   Importance: Undecided
       Status: New

-- 
You received this bug notification because you are a member of Ubuntu
High Availability Team, which is subscribed to corosync in Ubuntu.
https://bugs.launchpad.net/bugs/1654403

Title:
  Race condition in hacluster charm that leaves pacemaker down

Status in corosync package in Ubuntu:
  New
Status in hacluster package in Juju Charms Collection:
  Triaged

Bug description:
  Symptom: one or more hacluster nodes are left in an executing state.
  In the process list on the affected nodes, the command 'crm node list'
  is in an infinite loop and pacemaker is not started. On nodes that
  complete 'crm node list' and the other crm commands, pacemaker is started.

  See the artefacts from this run:
  https://openstack-ci-reports.ubuntu.com/artifacts/test_charm_pipeline/openstack/charm-percona-cluster/417131/1/1873/index.html

  Hypothesis: There is a race that leads to crm node list being executed
  before pacemaker is started. It is also possible that something causes
  pacemaker to fail to start.

  Suggest a check for pacemaker health before any crm commands are run.
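
  One possible shape for such a check, as a sketch only: ask systemd
  whether the pacemaker unit is active before calling crm. The function
  names `service_is_active` and `crm_node_list` are hypothetical and not
  taken from the charm; `systemctl is-active --quiet` exits 0 only when
  the unit is active.

```python
import subprocess


def service_is_active(name, run=subprocess.call):
    """Return True if systemd reports the unit `name` as active."""
    # `systemctl is-active --quiet <unit>` exits 0 only for an active unit.
    return run(["systemctl", "is-active", "--quiet", name]) == 0


def crm_node_list(run=subprocess.call,
                  check_output=subprocess.check_output):
    """Run `crm node list`, but only if pacemaker is active.

    `run` and `check_output` are injectable so the guard can be tested
    on a machine without systemd or crm installed.
    """
    if not service_is_active("pacemaker", run=run):
        raise RuntimeError("pacemaker is not active; not running crm")
    return check_output(["crm", "node", "list"])
```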

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/corosync/+bug/1654403/+subscriptions
