Hello all,

I have a two-node Pacemaker cluster configured with a high-availability
PostgreSQL resource. Occasionally, when the Master/primary DB is under
significant write load, the cluster "monitor" operation times out, which
increments the failcount for the ha-db resource. This happened again recently,
and the running primary DB was demoted and then re-promoted on the same node.
What I'm having trouble understanding is why the running Master/primary DB was
demoted at all: after the monitor operation timed out, the failcount for the
ha-db resource was still less than the configured "migration-threshold", which
is set to 5.

Here's some information on the cluster and the scenario:

OS:  RHEL 7.9
Corosync: 2.4.5-7
Pacemaker:  1.1.23-1
PAF resource agent: 2.3.0
Master DB node:  dbsrv2
Standby DB node: dbsrv1
Other resources:  fencing and VIP
Cluster resource defaults:
  migration-threshold=5
  resource-stickiness=10
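
For completeness, those defaults were set with something along these lines
(exact commands reconstructed from memory, pcs 0.9 syntax):

  pcs resource defaults migration-threshold=5
  pcs resource defaults resource-stickiness=10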


The overall scenario is this:


  *   The Master/primary DB is running on node dbsrv2
  *   A monitoring timeout occurs on node dbsrv2
  *   The failcount for the ha-db master resource on node dbsrv2 is
incremented from 2 to 3 (there were 2 other failures days before)
  *   The master DB on node dbsrv2 is demoted to standby (i.e., PostgreSQL is
stopped and restarted as a standby)
  *   Node dbsrv2 is then re-promoted to Master/primary, with the failcount
remaining at 3
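
If it helps with the analysis, my understanding is that the decision can be
replayed from the saved policy-engine input for that transition, along these
lines (the pe-input file number below is just a placeholder; the actual file
for this transition is named in the attached corosync log):

  # Replay the saved transition and show the allocation scores
  crm_simulate --simulate --show-scores \
      --xml-file /var/lib/pacemaker/pengine/pe-input-123.bz2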


My expectation was that no demotion would occur until the migration-threshold
was reached. Any help explaining why the running primary was demoted would be
appreciated!

I've attached the corosync log covering the time period when this occurred.
The scenario I described starts at 20:55 on 2/16/22 in the log.

Thanks,

Larry

Attachment: corosync.log-20220217.gz
