> On May 27, 2018, at 2:28 PM, Ken Gaillot <kgail...@redhat.com> wrote:
>
>> May 22 23:57:24 [2196] d-gp2-dbpg0-2 pengine: info: determine_op_status: Operation monitor found resource postgresql-10-main:2 active on d-gp2-dbpg0-2
>
>> May 22 23:57:24 [2196] d-gp2-dbpg0-2 pengine: notice: LogActions: Demote postgresql-10-main:1 (Master -> Slave d-gp2-dbpg0-1)
>> May 22 23:57:24 [2196] d-gp2-dbpg0-2 pengine: notice: LogActions: Recover postgresql-10-main:1 (Master d-gp2-dbpg0-1)
>
> From the above, we can see that the initial probe after the node
> rejoined found that the resource was already running in master mode
> there (at least, that's what the agent thinks). So, the cluster wants
> to demote it, stop it, and start it again as a slave.
Are you sure you're reading the above correctly? The first line you quoted says the resource is already active on node 2, which is not the node that was restarted; it is the node that took over as master after I powered node 1 off.

Anyway, I enabled debug logging in corosync.conf, and I now see the following:

May 29 20:59:28 [10583] d-gp2-dbpg0-2 crm_resource: debug: determine_op_status: postgresql-10-main_monitor_0 on d-gp2-dbpg0-1 returned 'master (failed)' (9) instead of the expected value: 'not running' (7)
May 29 20:59:28 [10583] d-gp2-dbpg0-2 crm_resource: warning: unpack_rsc_op_failure: Processing failed op monitor for postgresql-10-main:1 on d-gp2-dbpg0-1: master (failed) (9)
May 29 20:59:28 [10583] d-gp2-dbpg0-2 crm_resource: debug: determine_op_status: postgresql-10-main_monitor_0 on d-gp2-dbpg0-1 returned 'master (failed)' (9) instead of the expected value: 'not running' (7)
May 29 20:59:28 [10583] d-gp2-dbpg0-2 crm_resource: warning: unpack_rsc_op_failure: Processing failed op monitor for postgresql-10-main:1 on d-gp2-dbpg0-1: master (failed) (9)

I'm not sure why these lines appear twice (the same question I've had in the past about some log messages), but whatever check it is running against the resource, it is correctly determining that PostgreSQL failed while in the master state rather than being shut down cleanly. Why this results in the node being fenced is beyond me.

I don't feel that I'm trying to do anything complex: just a simple cluster that handles PostgreSQL failover. I'm not trying to do anything fancy, and I'm pretty much following the PAF docs, plus the addition of the fencing resource (which the docs say is required for PAF to work properly; if this is "properly", I don't understand what goal it is trying to achieve...). I'm getting really frustrated with Pacemaker, as I've been fighting for two months now to get it working and still feel in the dark about why it's behaving the way it is.
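For reference, the numbers in parentheses in those log lines are standard OCF resource agent exit codes: 7 is "not running" and 9 is "master (failed)". A minimal shell sketch of the relevant mapping (the `ocf_code_name` helper is made up for illustration; the code values themselves are from the OCF resource agent API):

```shell
#!/bin/sh
# Map an OCF resource-agent exit code to its symbolic name.
# Codes 8 and 9 are only used by promotable (master/slave)
# resources such as pgsqlms.
ocf_code_name() {
    case "$1" in
        0) echo "OCF_SUCCESS" ;;          # running (Slave, for promotable)
        7) echo "OCF_NOT_RUNNING" ;;      # cleanly stopped
        8) echo "OCF_RUNNING_MASTER" ;;   # running as Master
        9) echo "OCF_FAILED_MASTER" ;;    # failed while Master
        *) echo "OCF_ERR (code $1)" ;;    # generic/other errors
    esac
}

# The probe in the log returned 9 ('master (failed)') where the
# cluster expected 7 ('not running'):
ocf_code_name 9   # OCF_FAILED_MASTER
ocf_code_name 7   # OCF_NOT_RUNNING
```

So the agent is telling Pacemaker the instance previously died while it was master, which Pacemaker treats as a master failure on that node rather than a clean stop.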
I'm sorry if I seem like an idiot... this definitely makes me feel like one.

Here is my configuration again, in case it helps:

Cluster Name: d-gp2-dbpg0

Corosync Nodes:
 d-gp2-dbpg0-1 d-gp2-dbpg0-2 d-gp2-dbpg0-3
Pacemaker Nodes:
 d-gp2-dbpg0-1 d-gp2-dbpg0-2 d-gp2-dbpg0-3

Resources:
 Resource: postgresql-master-vip (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: ip=10.124.164.250 cidr_netmask=22
  Operations: start interval=0s timeout=20s (postgresql-master-vip-start-interval-0s)
              stop interval=0s timeout=20s (postgresql-master-vip-stop-interval-0s)
              monitor interval=10s (postgresql-master-vip-monitor-interval-10s)
 Master: postgresql-ha
  Meta Attrs: notify=true
  Resource: postgresql-10-main (class=ocf provider=heartbeat type=pgsqlms)
   Attributes: bindir=/usr/lib/postgresql/10/bin pgdata=/var/lib/postgresql/10/main pghost=/var/run/postgresql pgport=5432 recovery_template=/etc/postgresql/10/main/recovery.conf start_opts="-c config_file=/etc/postgresql/10/main/postgresql.conf"
   Operations: start interval=0s timeout=60s (postgresql-10-main-start-interval-0s)
               stop interval=0s timeout=60s (postgresql-10-main-stop-interval-0s)
               promote interval=0s timeout=30s (postgresql-10-main-promote-interval-0s)
               demote interval=0s timeout=120s (postgresql-10-main-demote-interval-0s)
               monitor interval=15s role=Master timeout=10s (postgresql-10-main-monitor-interval-15s)
               monitor interval=16s role=Slave timeout=10s (postgresql-10-main-monitor-interval-16s)
               notify interval=0s timeout=60s (postgresql-10-main-notify-interval-0s)

Stonith Devices:
 Resource: vfencing (class=stonith type=external/vcenter)
  Attributes: VI_SERVER=10.124.137.100 VI_CREDSTORE=/etc/pacemaker/vicredentials.xml HOSTLIST=d-gp2-dbpg0-1;d-gp2-dbpg0-2;d-gp2-dbpg0-3 RESETPOWERON=1
  Operations: monitor interval=60s (vfencing-monitor-60s)
Fencing Levels:

Location Constraints:
Ordering Constraints:
  promote postgresql-ha then start postgresql-master-vip (kind:Mandatory) (non-symmetrical) (id:order-postgresql-ha-postgresql-master-vip-Mandatory)
  demote postgresql-ha then stop postgresql-master-vip (kind:Mandatory) (non-symmetrical) (id:order-postgresql-ha-postgresql-master-vip-Mandatory-1)
Colocation Constraints:
  postgresql-master-vip with postgresql-ha (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master) (id:colocation-postgresql-master-vip-postgresql-ha-INFINITY)

Resources Defaults:
 migration-threshold: 5
 resource-stickiness: 10
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: d-gp2-dbpg0
 dc-version: 1.1.14-70404b0
 have-watchdog: false
 stonith-enabled: true

Node Attributes:
 d-gp2-dbpg0-1: master-postgresql-10-main=-1000
 d-gp2-dbpg0-2: master-postgresql-10-main=1001
 d-gp2-dbpg0-3: master-postgresql-10-main=1000

Thanks,
--
Casey

_______________________________________________
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
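For anyone reading along who wants to reproduce a comparable setup, the configuration dump above corresponds roughly to pcs commands along these lines. This is a sketch only, assuming the pcs 0.9 syntax contemporary with pacemaker 1.1.14; resource names, addresses, and timeouts are taken from the config above and should be adjusted to your environment:

```shell
# Sketch: approximate pcs 0.9-era commands matching the config above.
# Not run against a live cluster; verify each against your pcs version.

# The promotable PostgreSQL resource (PAF's pgsqlms agent), wrapped in a master.
pcs resource create postgresql-10-main ocf:heartbeat:pgsqlms \
    bindir=/usr/lib/postgresql/10/bin \
    pgdata=/var/lib/postgresql/10/main \
    pghost=/var/run/postgresql pgport=5432 \
    recovery_template=/etc/postgresql/10/main/recovery.conf \
    start_opts="-c config_file=/etc/postgresql/10/main/postgresql.conf" \
    op start timeout=60s op stop timeout=60s \
    op promote timeout=30s op demote timeout=120s \
    op monitor interval=15s role=Master timeout=10s \
    op monitor interval=16s role=Slave timeout=10s \
    op notify timeout=60s
pcs resource master postgresql-ha postgresql-10-main notify=true

# The floating master VIP.
pcs resource create postgresql-master-vip ocf:heartbeat:IPaddr2 \
    ip=10.124.164.250 cidr_netmask=22 op monitor interval=10s

# VIP follows the master; non-symmetrical ordering on promote/demote.
pcs constraint colocation add postgresql-master-vip with master postgresql-ha INFINITY
pcs constraint order promote postgresql-ha then start postgresql-master-vip symmetrical=false
pcs constraint order demote postgresql-ha then stop postgresql-master-vip symmetrical=false

# Fencing via vCenter, as PAF requires stonith to be enabled.
pcs stonith create vfencing external/vcenter \
    VI_SERVER=10.124.137.100 \
    VI_CREDSTORE=/etc/pacemaker/vicredentials.xml \
    HOSTLIST="d-gp2-dbpg0-1;d-gp2-dbpg0-2;d-gp2-dbpg0-3" \
    RESETPOWERON=1 op monitor interval=60s

# Resource defaults from the dump.
pcs resource defaults migration-threshold=5
pcs resource defaults resource-stickiness=10
```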