Hello all,

I have a two-node Pacemaker cluster configured with a high-availability PostgreSQL resource. Occasionally, when the Master/primary DB is under significant write load, the cluster "monitor" operation times out, which increments the failcount for the ha-db resource. This happened again recently: the primary DB was demoted and then re-promoted on the same node. What I'm having trouble understanding is why the running Master/primary DB was demoted at all, since even after the monitor operation timed out, the failcount for the ha-db resource was still below the configured "migration-threshold", which is set to 5.
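For reference, this is roughly how I've been checking the fail counts and the configured defaults (stock Pacemaker/pcs CLI on RHEL 7, so the exact output format may differ on other versions):

    # one-shot cluster status, including per-resource fail counts
    crm_mon -1 -f

    # fail count for the ha-db resource on each node
    pcs resource failcount show ha-db

    # currently configured resource defaults (migration-threshold, stickiness)
    pcs resource defaults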
Here's some information on the cluster and the scenario:

OS: RHEL 7.9
Corosync: 2.4.5-7
Pacemaker: 1.1.23-1
PAF resource agent: 2.3.0
Master DB node: dbsrv2
Standby DB node: dbsrv1
Other resources: fencing and VIP
Cluster resource defaults: migration-threshold=5, resource-stickiness=10

The overall scenario is this:

* The Master/primary DB is running on node dbsrv2
* A monitoring timeout occurs on node dbsrv2
* The failcount for the ha-db resource on node dbsrv2 is incremented from 2 to 3 (the two earlier failures happened days before)
* The Master DB is demoted to standby (i.e. stopped and restarted) on node dbsrv2
* Node dbsrv2 is then re-promoted to Master/primary, with the failcount remaining at 3

My expectation was that no demotion should occur until the migration threshold was reached. Any help in explaining why the running primary was demoted would be appreciated!

I've attached the corosync log covering the time period when this occurred; the scenario I described starts at 20:55 on 2/16/22 in the log.

Thanks,
Larry
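P.S. In case it matters, the resource defaults were set with pcs; from memory it was something like the following (old pcs syntax on RHEL 7, so treat it as approximate):

    pcs resource defaults migration-threshold=5
    pcs resource defaults resource-stickiness=10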
corosync.log-20220217.gz
