Additional information from the charm:

Without cluster_count set to NUM_UNITS, a race occurs: the relation to the
last hacluster node is not yet established, which leads to an attempt to
start corosync and pacemaker with only n-1 of n nodes.
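The gating that cluster_count is meant to provide can be sketched as a simple readiness check: do not render corosync.conf or start the services until this unit plus its visible peers reach the expected cluster size. This is a minimal illustration, not the hacluster charm's actual code. The function name ready_to_start_cluster and its peer_units input (the unit names seen on the hanode peer relation) are hypothetical.

```python
from typing import Sequence


def ready_to_start_cluster(peer_units: Sequence[str], cluster_count: int) -> bool:
    """Return True only when every expected cluster member is visible.

    peer_units: names of the *other* units on the hanode peer relation
    (hypothetical input; how a charm gathers them is not shown here).
    cluster_count: expected total number of units, i.e. NUM_UNITS.
    """
    if cluster_count < 1:
        raise ValueError("cluster_count must be at least 1")
    # This unit plus its visible peers must reach the configured size
    # before corosync and pacemaker are configured and started.
    return len(peer_units) + 1 >= cluster_count


# The situation from this bug: the last unit sees only hacluster/0.
print(ready_to_start_cluster(["hacluster/0"], cluster_count=3))  # False
```

With both peers visible (["hacluster/0", "hacluster/1"]) the check returns True. The gate only helps if cluster_count is actually set to the number of units, which is why the setting matters.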

The last node is aware of only one relation when there should be two:
relation-list -r hanode:0
hacluster/0

corosync.conf lists only two nodes when there should be three:

nodelist {

        node {
                ring0_addr: 10.5.35.235
                nodeid: 1000
        }

        node {
                ring0_addr: 10.5.35.237
                nodeid: 1001
        }

}
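A quick diagnostic for this state is to count the node entries in the rendered nodelist and compare them with the expected cluster size. The sketch below is a rough text scan, not a full corosync.conf parser; it assumes one ring0_addr line per node block, as in the file above, and the function name count_nodelist_entries is hypothetical.

```python
import re

# One ring0_addr line per node block, matching the layout shown above.
RING0_RE = re.compile(r"^[ \t]*ring0_addr[ \t]*:", re.MULTILINE)


def count_nodelist_entries(corosync_conf: str) -> int:
    """Count ring0_addr entries in a corosync.conf text (rough scan)."""
    return len(RING0_RE.findall(corosync_conf))


SAMPLE = """\
nodelist {
        node {
                ring0_addr: 10.5.35.235
                nodeid: 1000
        }
        node {
                ring0_addr: 10.5.35.237
                nodeid: 1001
        }
}
"""

expected = 3
found = count_nodelist_entries(SAMPLE)
if found != expected:
    print(f"nodelist has {found} node(s), expected {expected}")
```

Run against the nodelist above, this reports 2 nodes where 3 were expected.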

The services themselves (not the charm) fail:
corosync logs thousands of RETRANSMIT errors.
pacemaker eventually times out after waiting on corosync.

Adding more documentation to encourage setting cluster_count, and
updating the amulet tests to set it.

-- 
You received this bug notification because you are a member of Ubuntu
High Availability Team, which is subscribed to corosync in Ubuntu.
https://bugs.launchpad.net/bugs/1654403

Title:
  Race condition in hacluster charm that leaves pacemaker down

Status in corosync package in Ubuntu:
  New
Status in hacluster package in Juju Charms Collection:
  Triaged

Bug description:
  Symptom: one or more hacluster nodes are left in an executing state.
  The process list on the affected nodes shows the command 'crm node list'
  in an infinite loop, and pacemaker is not started. On nodes where
  'crm node list' and the other crm commands complete, pacemaker is started.

  See the artefacts from this run:
  
https://openstack-ci-reports.ubuntu.com/artifacts/test_charm_pipeline/openstack/charm-percona-cluster/417131/1/1873/index.html

  Hypothesis: there is a race that leads to 'crm node list' being executed
  before pacemaker is started. It is also possible that something causes
  pacemaker to fail to start.

  Suggestion: check pacemaker health before running any crm commands.
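The suggested health check could take the form of a bounded wait that runs before any crm command, so a node fails with a clear error instead of looping. This is a sketch under stated assumptions: it treats a zero exit status from 'crm_mon -1' (one-shot status) as "pacemaker is answering", which is an assumed probe rather than the charm's own logic, and the names pacemaker_is_healthy and wait_for_pacemaker are hypothetical. The probe, sleep and clock are injectable so the loop can be exercised without a cluster.

```python
import subprocess
import time
from typing import Callable


def pacemaker_is_healthy() -> bool:
    """Assumed probe: 'crm_mon -1' exits 0 once pacemaker answers."""
    try:
        result = subprocess.run(
            ["crm_mon", "-1"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=30,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        # crm_mon missing or hung: treat as not healthy yet.
        return False
    return result.returncode == 0


def wait_for_pacemaker(
    check: Callable[[], bool] = pacemaker_is_healthy,
    timeout: float = 300.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `check` until it succeeds or `timeout` seconds have passed.

    Returns True if pacemaker became healthy, False on timeout, so the
    caller can refuse to run crm commands instead of looping forever.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A caller would invoke wait_for_pacemaker() first and raise an error (or defer the hook) when it returns False, rather than entering 'crm node list'. Bounding the wait turns the silent infinite loop described above into a visible failure.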

