My objective is a two-node active/passive DRBD setup that fails over automatically. A secondary objective is to use standard, stock, supported software distributions and repositories, with as little customization as possible.
I'm using Ubuntu 18.04.3 with the DRBD, Corosync, and Pacemaker packages from the stock (LTS) repositories: drbdadm reports version 8.9.10, Corosync is 2.4.3, and Pacemaker is 0.9.164.

My test scenario: with both nodes up and running, I reboot, disconnect, or shut down one node, and after a delay the other node should take over. That's the case I want to cover: unexpected loss of a node. The application is supplementary, not life-safety or mission-critical, but uptime is measured, and the goal is to stay above four nines annually.

All of this works for me up to a point. I can fail over manually by telling pcs to move my resource from one node to the other. But if I reboot the primary node, the failover does not complete until the primary is back online. Occasionally these hard kills produce split-brain, which requires manual recovery.

I added STONITH and a watchdog using SBD with an iSCSI block device and softdog, and I added a qdevice to get an odd-numbered quorum. When I run crm_simulate, the simulation says that if I down the primary node, the resource will be promoted on the secondary. And yet I still see the same behavior: after crashing the primary there is no promotion until the primary returns online, after which the secondary is smoothly promoted and the primary demoted.

Getting each component of this stack configured and running has involved substantial challenges with compatibility, documentation, integration bugs, and so on. I see other people reporting problems similar to mine, so I'm wondering whether there's a general approach, or perhaps I need a nudge in a new direction:

* Should I continue to focus on the existing Pacemaker configuration? Perhaps some hidden or missing order/colocation constraint or weighting is causing this behavior?
* Should I dig harder into the DRBD configuration? Is it something about the fencing scripts?
* Should I try stripping this back down to something more basic? Can I get reliable failover without STONITH, SBD, and an odd-numbered quorum?
* It seems possible that moving to DRBD 9.x would take some of the problem off Pacemaker altogether, since it apparently has built-in failover. Is that an easier win?
* Should I move to another stack? I'm trying to stay on LTS releases for stability, but perhaps I'd get better integration with RHEL 7, CentOS 7, a newer Ubuntu release, or some other distribution?

Thank you for your consideration!
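For context, the Pacemaker side is the standard master/slave DRBD arrangement; the resource and device names here (r0, drbd_r0, ms_drbd, fs_data, /dev/drbd0, /mnt/data) are placeholders for my actual ones, but the shape is roughly:

```shell
# DRBD master/slave resource, plus a filesystem that must follow the master
pcs resource create drbd_r0 ocf:linbit:drbd drbd_resource=r0 op monitor interval=30s
pcs resource master ms_drbd drbd_r0 master-max=1 master-node-max=1 \
    clone-max=2 clone-node-max=1 notify=true
pcs resource create fs_data ocf:heartbeat:Filesystem \
    device=/dev/drbd0 directory=/mnt/data fstype=ext4

# Keep the filesystem on the DRBD master, and only start it after promotion
pcs constraint colocation add fs_data with master ms_drbd INFINITY
pcs constraint order promote ms_drbd then start fs_data
```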
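On the DRBD side, the fencing wiring I'm asking about is the usual Pacemaker integration in the resource file. Mine looks roughly like this (r0 is a placeholder resource name):

```
resource r0 {
  disk {
    # Freeze I/O and invoke the fence-peer handler when the peer disconnects
    fencing resource-and-stonith;
  }
  handlers {
    # Scripts shipped with drbd-utils: add a Pacemaker location constraint
    # against the outdated peer, and remove it after resync completes
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}
```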
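To make the first question concrete, this is the kind of check I've been doing (node names node1/node2 are placeholders):

```shell
# List every constraint with its score, to spot a missing order/colocation
pcs constraint --full

# Show current placement scores from the live CIB
crm_simulate -sL

# Simulate losing the primary and see whether a promotion gets scheduled
crm_simulate -SL --node-down=node1
```

The simulated transition does promote the secondary, which is why the live behavior surprises me.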
_______________________________________________
Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home: https://www.clusterlabs.org/