Hello,

In my test environment I am hitting an issue with Pacemaker: when a new node 
is added to the cluster, the master node restarts. This leaves the cluster 
without a master for a while whenever a node is added, so the system is out of 
service during that time. Could you please tell me how to debug this kind of 
issue?

I have a Pacemaker master/slave cluster as shown below. pgsql-ha is a resource; 
I copied the agent script from /usr/lib/ocf/resource.d/heartbeat/Dummy and added 
some simple code to support promote/demote (a stripped-down sketch of the agent 
is included below).
When I run "pcs cluster stop" on db1, db1 goes to stopped status and db2 stays 
master.
The problem is that when I then run "pcs cluster start" on db1, the status of 
db2 changes as follows: master -> slave -> stopped -> slave -> master. Why does 
db2 restart?
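
The sketch (not my exact script; the state-file location and the master scores 
are only illustrative, and meta-data/validate-all are trimmed):

#!/bin/sh
#
# Doctor: master/slave test agent based on ocf:heartbeat:Dummy.
# Sketch only; state-file location and master scores are illustrative.

: ${OCF_ROOT=/usr/lib/ocf}
: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs

STATEFILE="${HA_RSCTMP}/Doctor-${OCF_RESOURCE_INSTANCE}.state"

doctor_start() {
    echo slave > "$STATEFILE"
    crm_master -l reboot -v 5      # offer this node as a promotion candidate
    return $OCF_SUCCESS
}

doctor_stop() {
    crm_master -l reboot -D        # withdraw the master score
    rm -f "$STATEFILE"
    return $OCF_SUCCESS
}

doctor_promote() {
    echo master > "$STATEFILE"
    crm_master -l reboot -v 10
    return $OCF_SUCCESS
}

doctor_demote() {
    echo slave > "$STATEFILE"
    crm_master -l reboot -v 5
    return $OCF_SUCCESS
}

doctor_monitor() {
    [ -f "$STATEFILE" ] || return $OCF_NOT_RUNNING
    grep -q master "$STATEFILE" && return $OCF_RUNNING_MASTER
    return $OCF_SUCCESS
}

case "$1" in
    start)    doctor_start ;;
    stop)     doctor_stop ;;
    promote)  doctor_promote ;;
    demote)   doctor_demote ;;
    monitor)  doctor_monitor ;;
    notify)   exit $OCF_SUCCESS ;;
    meta-data|validate-all) exit $OCF_SUCCESS ;;   # trimmed for brevity
    *)        exit $OCF_ERR_UNIMPLEMENTED ;;
esac
exit $?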

CENTOS7:
======================================================
2 nodes and 7 resources configured

Online: [ db1 db2 ]

Full list of resources:

Clone Set: dlm-clone [dlm]
     Started: [ db1 db2 ]
Clone Set: clvmd-clone [clvmd]
     Started: [ db1 db2 ]
scsi-stonith-device    (stonith:fence_scsi):   Started db2
Master/Slave Set: pgsql-ha [pgsqld]
     Masters: [ db2 ]
     Slaves: [ db1 ]

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
[root@db1 heartbeat]#
==========================================================
/var/log/messages:
Dec 27 00:52:50 db2 cib[3290]:  notice: Purged 1 peers with id=1 and/or 
uname=db1 from the membership cache
Dec 27 00:52:51 db2 kernel: dlm: closing connection to node 1
Dec 27 00:52:51 db2 corosync[3268]: [TOTEM ] A new membership 
(192.168.199.199:372) was formed. Members left: 1
Dec 27 00:52:51 db2 corosync[3268]: [QUORUM] Members[1]: 2
Dec 27 00:52:51 db2 corosync[3268]: [MAIN  ] Completed service synchronization, 
ready to provide service.
Dec 27 00:52:51 db2 crmd[3295]:  notice: Node db1 state is now lost
Dec 27 00:52:51 db2 crmd[3295]:  notice: do_shutdown of peer db1 is complete
Dec 27 00:52:51 db2 pacemakerd[3289]:  notice: Node db1 state is now lost
Dec 27 00:52:57 db2 Doctor(pgsqld)[6671]: INFO: pgsqld monitor : 8
Dec 27 00:53:12 db2 Doctor(pgsqld)[6681]: INFO: pgsqld monitor : 8
Dec 27 00:53:27 db2 Doctor(pgsqld)[6746]: INFO: pgsqld monitor : 8
Dec 27 00:53:33 db2 corosync[3268]: [TOTEM ] A new membership 
(192.168.199.197:376) was formed. Members joined: 1
Dec 27 00:53:33 db2 corosync[3268]: [QUORUM] Members[2]: 1 2
Dec 27 00:53:33 db2 corosync[3268]: [MAIN  ] Completed service synchronization, 
ready to provide service.
Dec 27 00:53:33 db2 crmd[3295]:  notice: Node db1 state is now member
Dec 27 00:53:33 db2 pacemakerd[3289]:  notice: Node db1 state is now member
Dec 27 00:53:33 db2 crmd[3295]:  notice: do_shutdown of peer db1 is complete
Dec 27 00:53:33 db2 crmd[3295]:  notice: State transition S_IDLE -> 
S_INTEGRATION
Dec 27 00:53:33 db2 pengine[3294]:  notice: Calculated transition 17, saving 
inputs in /var/lib/pacemaker/pengine/pe-input-116.bz2
Dec 27 00:53:33 db2 crmd[3295]:  notice: Transition 17 (Complete=0, Pending=0, 
Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-input-116.bz2): Complete
Dec 27 00:53:33 db2 crmd[3295]:  notice: State transition S_TRANSITION_ENGINE 
-> S_IDLE
Dec 27 00:53:33 db2 stonith-ng[3291]:  notice: Node db1 state is now member
Dec 27 00:53:33 db2 attrd[3293]:  notice: Node db1 state is now member
Dec 27 00:53:33 db2 cib[3290]:  notice: Node db1 state is now member
Dec 27 00:53:34 db2 crmd[3295]:  notice: State transition S_IDLE -> 
S_INTEGRATION
Dec 27 00:53:37 db2 crmd[3295]: warning: No reason to expect node 2 to be down
Dec 27 00:53:38 db2 pengine[3294]:  notice: Unfencing db1: node discovery
Dec 27 00:53:38 db2 pengine[3294]:  notice: Start   dlm:1#011(db1)
Dec 27 00:53:38 db2 pengine[3294]:  notice: Start   clvmd:1#011(db1)
Dec 27 00:53:38 db2 pengine[3294]:  notice: Restart pgsqld:0#011(Master db2)
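
I assume I can replay the policy-engine input that was saved when db1 rejoined 
to see why the Restart of pgsqld gets scheduled (using the pe-input file named 
in the log above as an example):

crm_simulate -S -x /var/lib/pacemaker/pengine/pe-input-116.bz2

I can post the resulting transition summary if that is useful.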


/var/log/cluster/corosync.log:

Dec 27 00:53:37 [3290] db2        cib:     info: cib_process_request:   
Completed cib_modify operation for section status: OK (rc=0, 
origin=db2/crmd/99, version=0.60.29)
Dec 27 00:53:37 [3290] db2        cib:     info: cib_process_request:   
Forwarding cib_delete operation for section //node_state[@uname='db2']/lrm to 
all (origin=local/crmd/100)
Dec 27 00:53:37 [3295] db2       crmd:     info: do_state_transition:   State 
transition S_FINALIZE_JOIN -> S_POLICY_ENGINE | input=I_FINALIZED 
cause=C_FSA_INTERNAL origin=check_join_state
Dec 27 00:53:37 [3295] db2       crmd:     info: abort_transition_graph:        
Transition aborted: Peer Cancelled | source=do_te_invoke:161 complete=true
Dec 27 00:53:37 [3293] db2      attrd:     info: attrd_client_refresh:  
Updating all attributes
Dec 27 00:53:37 [3293] db2      attrd:     info: write_attribute:       Sent 
update 12 with 2 changes for shutdown, id=<n/a>, set=(null)
Dec 27 00:53:37 [3293] db2      attrd:     info: write_attribute:       Sent 
update 13 with 1 changes for last-failure-pgsqld, id=<n/a>, set=(null)
Dec 27 00:53:37 [3293] db2      attrd:     info: write_attribute:       Sent 
update 14 with 2 changes for terminate, id=<n/a>, set=(null)
Dec 27 00:53:37 [3293] db2      attrd:     info: write_attribute:       Sent 
update 15 with 1 changes for fail-count-pgsqld, id=<n/a>, set=(null)
Dec 27 00:53:37 [3290] db2        cib:     info: cib_process_request:   
Forwarding cib_modify operation for section status to all 
(origin=local/crmd/101)
Dec 27 00:53:37 [3290] db2        cib:     info: cib_perform_op:        Diff: 
--- 0.60.29 2
Dec 27 00:53:37 [3290] db2        cib:     info: cib_perform_op:        Diff: 
+++ 0.60.30 (null)
Dec 27 00:53:37 [3290] db2        cib:     info: cib_perform_op:        -- 
/cib/status/node_state[@id='2']/lrm[@id='2']
Dec 27 00:53:37 [3290] db2        cib:     info: cib_perform_op:        +  
/cib:  @num_updates=30
Dec 27 00:53:37 [3295] db2       crmd:  warning: match_down_event:      No 
reason to expect node 2 to be down
Dec 27 00:53:37 [3295] db2       crmd:     info: abort_transition_graph:        
Transition aborted by deletion of lrm[@id='2']: Resource state removal | 
cib=0.60.30 source=abort_unless_down:343 
path=/cib/status/node_state[@id='2']/lrm[@id='2'] complete=true
Dec 27 00:53:37 [3290] db2        cib:     info: cib_process_request:   
Completed cib_delete operation for section //node_state[@uname='db2']/lrm: OK 
(rc=0, origin=db2/crmd/100, version=0.60.30)
Dec 27 00:53:37 [3290] db2        cib:     info: cib_perform_op:        Diff: 
--- 0.60.30 2
Dec 27 00:53:37 [3290] db2        cib:     info: cib_perform_op:        Diff: 
+++ 0.60.31 (null)
Dec 27 00:53:37 [3290] db2        cib:     info: cib_perform_op:        +  
/cib:  @num_updates=31
Dec 27 00:53:37 [3290] db2        cib:     info: cib_perform_op:        +  
/cib/status/node_state[@id='2']:  @crm-debug-origin=do_lrm_query_internal
Dec 27 00:53:37 [3290] db2        cib:     info: cib_perform_op:        ++ 
/cib/status/node_state[@id='2']:  <lrm id="2"/>

I use these commands to create the resource:
pcs resource create pgsqld ocf:heartbeat:Doctor \
    op start timeout=60s op stop timeout=60s \
    op promote timeout=30s op demote timeout=120s \
    op monitor interval=15s timeout=10s role="Master" \
    op monitor interval=16s timeout=10s role="Slave" \
    op notify timeout=60s
pcs resource master pgsql-ha pgsqld notify=true
pcs constraint order start clvmd-clone then pgsql-ha
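
If it helps, I can also attach the full configuration; I assume the relevant 
pieces can be dumped with:

pcs config
pcs constraint show --full
pcs resource show pgsql-ha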