https://gocardless.com/blog/incident-review-api-and-dashboard-outage-on-10th-october/
It's a great write-up, although it's a little frustrating that it is still not fully understood why a -inf colocation failed whereas a +inf succeeded. (I actually have a vague memory of discovering something very similar a while back, but I can't find the details.)

IMHO this serves as a good example of the difficulty Pacemaker faces, and consequently as valuable feedback on how Pacemaker needs to improve: it's all too easy to make one tiny misconfiguration which can potentially bring the whole house of cards tumbling down, and it's often really hard to understand what went wrong.

So FWIW, my personal view is that more than anything else right now, Pacemaker needs to be made easier to understand. I know this is a big ask since HA is unavoidably complex, but I'm sure there are actionable items which would serve as relatively manageable yet very worthwhile steps towards this goal.

I alluded to this during my presentation at the Clusterlabs Summit, e.g. see https://aspiers.github.io/clusterlabs-summit-2017-openstack-ha/#/debugging and the following slide. In fact, I remember some really good discussions on this during the summit too, but I'm not sure if they led anywhere.

Hope this feedback is useful!