Hello. I've been fighting, so far without success, a bug [0] in the custom OCF agent of the Fuel for OpenStack project. It shows up in a destructive test case in which one node out of 3 or 5 goes down and then comes back. The bug is tricky (it rarely reproduces), the report is long, and it has many duplicates, so I link only to the latest comment here.
As that comment says, at some point, after the rabbit OCF monitor has reported an error followed by several "not running" reports (see the crmd log snippet [1]), pacemaker starts "thinking" everything is fine with the resource and shows it as "running". In fact the resource is completely dead, and a manually triggered OCF monitor action confirms that (not running). But *why* does pacemaker show the resource as running and never call the monitor action again?

I have no idea how to get to the root cause of this pacemaker behaviour. So I'm asking for guidance: any recommendations on how to debug and troubleshoot this strange situation, and which log patterns to look for (and where).

Thank you in advance!

PS. This is Pacemaker 1.1.12, Corosync 2.3.4, libqb0 0.17.0 from Ubuntu vivid. The Corosync & Pacemaker cluster itself looks healthy, and I can find no log records saying otherwise.

[0] https://bugs.launchpad.net/fuel/+bug/1472230/comments/32
[1] http://pastebin.com/0UuBvzzz

--
Best regards,
Bogdan Dobrelya,
Irc #bogdando