> On May 27, 2018, at 2:28 PM, Ken Gaillot <kgail...@redhat.com> wrote:
>
>> May 22 23:57:24 [2196] d-gp2-dbpg0-2 pengine: info: determine_op_status: Operation monitor found resource postgresql-10-main:2 active on d-gp2-dbpg0-2
>
>> May 22 23:57:24 [2196] d-gp2-dbpg0-2 pengine: notice: LogActions: Demote postgresql-10-main:1 (Master -> Slave d-gp2-dbpg0-1)
>> May 22 23:57:24 [2196] d-gp2-dbpg0-2 pengine: notice: LogActions: Recover postgresql-10-main:1 (Master d-gp2-dbpg0-1)
>
> From the above, we can see that the initial probe after the node
> rejoined found that the resource was already running in master mode
> there (at least, that's what the agent thinks). So, the cluster wants
> to demote it, stop it, and start it again as a slave.
Are you sure you're reading the above correctly? The first line you quoted says the resource is already active on node 2, which is not the node that was restarted; it is the node that took over as master after I powered node 1 off.

Anyway, I enabled debug logging in corosync.conf, and I now see the following:

May 29 20:59:28 [10583] d-gp2-dbpg0-2 crm_resource: debug: determine_op_status: postgresql-10-main_monitor_0 on d-gp2-dbpg0-1 returned 'master (failed)' (9) instead of the expected value: 'not running' (7)
May 29 20:59:28 [10583] d-gp2-dbpg0-2 crm_resource: warning: unpack_rsc_op_failure: Processing failed op monitor for postgresql-10-main:1 on d-gp2-dbpg0-1: master (failed) (9)
May 29 20:59:28 [10583] d-gp2-dbpg0-2 crm_resource: debug: determine_op_status: postgresql-10-main_monitor_0 on d-gp2-dbpg0-1 returned 'master (failed)' (9) instead of the expected value: 'not running' (7)
May 29 20:59:28 [10583] d-gp2-dbpg0-2 crm_resource: warning: unpack_rsc_op_failure: Processing failed op monitor for postgresql-10-main:1 on d-gp2-dbpg0-1: master (failed) (9)

I'm not sure why these lines appear twice (the same question I've had in the past about some log messages), but whatever check it is running against the resource, it is correctly determining that PostgreSQL failed while in the master state rather than being shut down cleanly. Why this results in the node being fenced is beyond me.

I don't feel that I'm trying to do anything complex: just a simple cluster that handles PostgreSQL failover. I'm not trying to do anything fancy, and I'm pretty much following the PAF docs, plus the addition of the fencing resource (which the docs say is required for PAF to work properly; if this is "properly", I don't understand what goal it is trying to achieve...). I'm getting really frustrated with Pacemaker, as I've been fighting for two months now to get it working and still feel in the dark about why it's behaving the way it is.
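For reference, the numbers in parentheses in those log lines are standard OCF resource agent exit codes: 7 is "not running" and 9 is "master (failed)". A minimal shell sketch of the relevant mapping (the `ocf_code_name` helper is made up for illustration; the code values themselves are from the OCF resource agent API):

```shell
#!/bin/sh
# Map an OCF resource-agent exit code to its symbolic name.
# Codes 8 and 9 are only used by promotable (master/slave)
# resources such as pgsqlms.
ocf_code_name() {
    case "$1" in
        0) echo "OCF_SUCCESS" ;;          # running (Slave, for promotable)
        7) echo "OCF_NOT_RUNNING" ;;      # cleanly stopped
        8) echo "OCF_RUNNING_MASTER" ;;   # running as Master
        9) echo "OCF_FAILED_MASTER" ;;    # failed while Master
        *) echo "OCF_ERR (code $1)" ;;    # generic/other errors
    esac
}

# The probe in the log returned 9 ('master (failed)') where the
# cluster expected 7 ('not running'):
ocf_code_name 9   # OCF_FAILED_MASTER
ocf_code_name 7   # OCF_NOT_RUNNING
```

So the agent is telling Pacemaker the instance previously died while it was master, which Pacemaker treats as a master failure on that node rather than a clean stop.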
I'm sorry if I seem like an idiot... this definitely makes me feel like one.

Here is my configuration again, in case it helps:

Cluster Name: d-gp2-dbpg0

Corosync Nodes:
 d-gp2-dbpg0-1 d-gp2-dbpg0-2 d-gp2-dbpg0-3
Pacemaker Nodes:
 d-gp2-dbpg0-1 d-gp2-dbpg0-2 d-gp2-dbpg0-3

Resources:
 Resource: postgresql-master-vip (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: ip=10.124.164.250 cidr_netmask=22
  Operations: start interval=0s timeout=20s (postgresql-master-vip-start-interval-0s)
              stop interval=0s timeout=20s (postgresql-master-vip-stop-interval-0s)
              monitor interval=10s (postgresql-master-vip-monitor-interval-10s)
 Master: postgresql-ha
  Meta Attrs: notify=true
  Resource: postgresql-10-main (class=ocf provider=heartbeat type=pgsqlms)
   Attributes: bindir=/usr/lib/postgresql/10/bin pgdata=/var/lib/postgresql/10/main pghost=/var/run/postgresql pgport=5432 recovery_template=/etc/postgresql/10/main/recovery.conf start_opts="-c config_file=/etc/postgresql/10/main/postgresql.conf"
   Operations: start interval=0s timeout=60s (postgresql-10-main-start-interval-0s)
               stop interval=0s timeout=60s (postgresql-10-main-stop-interval-0s)
               promote interval=0s timeout=30s (postgresql-10-main-promote-interval-0s)
               demote interval=0s timeout=120s (postgresql-10-main-demote-interval-0s)
               monitor interval=15s role=Master timeout=10s (postgresql-10-main-monitor-interval-15s)
               monitor interval=16s role=Slave timeout=10s (postgresql-10-main-monitor-interval-16s)
               notify interval=0s timeout=60s (postgresql-10-main-notify-interval-0s)

Stonith Devices:
 Resource: vfencing (class=stonith type=external/vcenter)
  Attributes: VI_SERVER=10.124.137.100 VI_CREDSTORE=/etc/pacemaker/vicredentials.xml HOSTLIST=d-gp2-dbpg0-1;d-gp2-dbpg0-2;d-gp2-dbpg0-3 RESETPOWERON=1
  Operations: monitor interval=60s (vfencing-monitor-60s)
Fencing Levels:

Location Constraints:
Ordering Constraints:
  promote postgresql-ha then start postgresql-master-vip (kind:Mandatory) (non-symmetrical) (id:order-postgresql-ha-postgresql-master-vip-Mandatory)
  demote postgresql-ha then stop postgresql-master-vip (kind:Mandatory) (non-symmetrical) (id:order-postgresql-ha-postgresql-master-vip-Mandatory-1)
Colocation Constraints:
  postgresql-master-vip with postgresql-ha (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master) (id:colocation-postgresql-master-vip-postgresql-ha-INFINITY)

Resources Defaults:
 migration-threshold: 5
 resource-stickiness: 10
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: d-gp2-dbpg0
 dc-version: 1.1.14-70404b0
 have-watchdog: false
 stonith-enabled: true

Node Attributes:
 d-gp2-dbpg0-1: master-postgresql-10-main=-1000
 d-gp2-dbpg0-2: master-postgresql-10-main=1001
 d-gp2-dbpg0-3: master-postgresql-10-main=1000

Thanks,
--
Casey

_______________________________________________
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
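For anyone reading along who wants to reproduce a comparable setup, the configuration dump above corresponds roughly to pcs commands along these lines. This is a sketch only, assuming the pcs 0.9 syntax contemporary with pacemaker 1.1.14; resource names, addresses, and timeouts are taken from the config above and should be adjusted to your environment:

```shell
# Sketch: approximate pcs 0.9-era commands matching the config above.
# Not run against a live cluster; verify each against your pcs version.

# The promotable PostgreSQL resource (PAF's pgsqlms agent), wrapped in a master.
pcs resource create postgresql-10-main ocf:heartbeat:pgsqlms \
    bindir=/usr/lib/postgresql/10/bin \
    pgdata=/var/lib/postgresql/10/main \
    pghost=/var/run/postgresql pgport=5432 \
    recovery_template=/etc/postgresql/10/main/recovery.conf \
    start_opts="-c config_file=/etc/postgresql/10/main/postgresql.conf" \
    op start timeout=60s op stop timeout=60s \
    op promote timeout=30s op demote timeout=120s \
    op monitor interval=15s role=Master timeout=10s \
    op monitor interval=16s role=Slave timeout=10s \
    op notify timeout=60s
pcs resource master postgresql-ha postgresql-10-main notify=true

# The floating master VIP.
pcs resource create postgresql-master-vip ocf:heartbeat:IPaddr2 \
    ip=10.124.164.250 cidr_netmask=22 op monitor interval=10s

# VIP follows the master; non-symmetrical ordering on promote/demote.
pcs constraint colocation add postgresql-master-vip with master postgresql-ha INFINITY
pcs constraint order promote postgresql-ha then start postgresql-master-vip symmetrical=false
pcs constraint order demote postgresql-ha then stop postgresql-master-vip symmetrical=false

# Fencing via vCenter, as PAF requires stonith to be enabled.
pcs stonith create vfencing external/vcenter \
    VI_SERVER=10.124.137.100 \
    VI_CREDSTORE=/etc/pacemaker/vicredentials.xml \
    HOSTLIST="d-gp2-dbpg0-1;d-gp2-dbpg0-2;d-gp2-dbpg0-3" \
    RESETPOWERON=1 op monitor interval=60s

# Resource defaults from the dump.
pcs resource defaults migration-threshold=5
pcs resource defaults resource-stickiness=10
```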