Here is the patch fixing the corosync misbehavior described above.

Description: Remove buggy logic meant to prevent secondary DC fencing

In the logic before commit 82aa2d8d17, the node responsible for fencing the DC (the executioner) was also responsible for updating the CIB. If this update failed (for example, because the executioner itself failed), the DC would be fenced a second time, because the cluster would not know the fencing result.

Upstream commit 82aa2d8d17 introduced logic that tries to avoid this second DC fencing. A node that was not the DC's fence executioner would keep the fenced DC's name in a list. With that name, if the executioner died and this node became the new DC, it could update the CIB with the result of the last DC fencing.

The problem is that this list is never cleaned, so there are cases where a wrong CIB update is made (when a DC takeover has to run). The result is very bad: the same resource running on different nodes.

For an SRU it is much more acceptable to restore the old behavior, which is known to be safe even if it implies killing the DC twice, than to backport several pieces of code to implement logic that was not in the stable release.

** Patch added: "precise_pacemaker_1.1.6-2ubuntu3.3.diff"
   https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1312156/+attachment/4103593/+files/precise_pacemaker_1.1.6-2ubuntu3.3.diff

-- 
You received this bug notification because you are a member of Ubuntu High Availability Team, which is subscribed to pacemaker in Ubuntu.
https://bugs.launchpad.net/bugs/1312156

Title:
[Precise] Potential for data corruption

Status in “pacemaker” package in Ubuntu:
In Progress
Status in “pacemaker” source package in Precise:
New

Bug description:
Under certain conditions, faulty logic in the function tengine_stonith_notify() can incorrectly add successfully fenced nodes to a list, causing Pacemaker to erase that node’s status section when the next DC (Designated Controller) election occurs. With the status section erased, the cluster considers that node to be down and starts the corresponding services on other nodes.
Multiple instances of the same service can cause data corruption.

Conditions:

1. The fenced node must have been the previous DC and been sufficiently functional to request its own fencing.
2. The fencing notification must arrive after the new DC has been elected but before it invokes the policy engine.

Pacemaker versions affected:

1.1.6 - 1.1.9

Stable Ubuntu releases affected:

Ubuntu 12.04 LTS
Ubuntu 12.10 (EOL?)

Fix:

https://github.com/ClusterLabs/pacemaker/commit/f30e1e43

References:

https://www.mail-archive.com/[email protected]/msg19509.html
http://blog.clusterlabs.org/blog/2014/potential-for-data-corruption-in-pacemaker-1-dot-1-6-through-1-dot-1-9/

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1312156/+subscriptions
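
The failure mode described in this report (a remembered list of fenced nodes that is never cleaned, later replayed when a node takes over as DC) can be illustrated with a toy model. This is only a sketch under stated assumptions: it is written in Python rather than Pacemaker's C, and every class, method, and node name in it is hypothetical. It does not reproduce tengine_stonith_notify() or any real Pacemaker data structure.

```python
# Toy model only: class, method, and node names are hypothetical and do not
# mirror Pacemaker's actual C implementation.

class Node:
    def __init__(self, name):
        self.name = name
        self.is_dc = False
        # Names of fenced nodes remembered for a later CIB update.
        self.cleanup_list = []

    def on_fence_notify(self, target, executioner):
        # A node that did not execute the fencing remembers the target so it
        # could report the result if it later becomes DC.
        if executioner != self.name:
            self.cleanup_list.append(target)

    def on_elected_dc(self, cib_status):
        self.is_dc = True
        # Flaw being illustrated: every remembered name has its status
        # erased, and the list is never emptied afterwards.
        for target in self.cleanup_list:
            cib_status.pop(target, None)


def nodes_considered_down(cib_status, members):
    # In this model, a member with no status section is treated as down.
    return [m for m in members if m not in cib_status]


members = ["node1", "node2", "node3"]
cib_status = {m: {"resources": []} for m in members}

node2 = Node("node2")

# node1 (the old DC) is fenced by node3; node2 only records the name.
node2.on_fence_notify(target="node1", executioner="node3")

# node1 reboots, rejoins, and starts a resource.
cib_status["node1"] = {"resources": ["db"]}

# Later, node2 wins a DC election and replays its stale list.
node2.on_elected_dc(cib_status)

down = nodes_considered_down(cib_status, members)
print(down)  # ['node1']
```

In this model node1 is still running "db" when its status is erased, so a cluster that now believes node1 is down would start "db" on another node. That is the duplicate-resource situation the report warns about. Dropping the remembered list altogether, as the proposed patch does by restoring the old behavior, removes the stale replay at the cost of a possible second fencing of the DC.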

