Hi All,

We discovered a problem in a cluster configured without quorum control and STONITH.
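For reference, "without quorum control and STONITH" corresponds to cluster properties along the following lines (a sketch in crmsh syntax; the actual trac3437.crm loaded below is not shown, so treat these as assumptions):

```shell
# Hedged sketch: typical properties for a cluster with fencing disabled
# and quorum loss ignored (not the literal contents of trac3437.crm).
crm configure property stonith-enabled=false
crm configure property no-quorum-policy=ignore
```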
We can confirm the problem with the following procedure.

Step1) Build the cluster.

[root@rh72-01 ~]# crm configure load update trac3437.crm
[root@rh72-01 ~]# crm_mon -1 -Af
Stack: corosync
Current DC: rh72-01 (version 1.1.15-e174ec8) - partition with quorum
Last updated: Mon Sep 26 13:00:22 2016
Last change: Mon Sep 26 12:59:52 2016 by root via cibadmin on rh72-01

2 nodes and 1 resource configured

Online: [ rh72-01 rh72-02 ]

 Resource Group: grpDummy
     prmDummy   (ocf::pacemaker:Dummy): Started rh72-01

Node Attributes:
* Node rh72-01:
* Node rh72-02:

Migration Summary:
* Node rh72-01:
* Node rh72-02:

Step2) Edit the Dummy resource agent so that stop fails.

(snip)
dummy_stop() {
    return $OCF_ERR_GENERIC
    dummy_monitor
    if [ $? -eq $OCF_SUCCESS ]; then
        rm ${OCF_RESKEY_state}
    fi
    rm -f "${VERIFY_SERIALIZED_FILE}"
    return $OCF_SUCCESS
}
(snip)

Step3) Stop Pacemaker on the node. The stop failure occurs.

[root@rh72-01 ~]# systemctl stop pacemaker
[root@rh72-01 ~]# crm_mon -1 -Af
Stack: corosync
Current DC: rh72-01 (version 1.1.15-e174ec8) - partition with quorum
Last updated: Mon Sep 26 13:01:33 2016
Last change: Mon Sep 26 12:59:52 2016 by root via cibadmin on rh72-01

2 nodes and 1 resource configured

Online: [ rh72-01 rh72-02 ]

 Resource Group: grpDummy
     prmDummy   (ocf::pacemaker:Dummy): FAILED rh72-01 (blocked)

Node Attributes:
* Node rh72-01:
* Node rh72-02:

Migration Summary:
* Node rh72-01:
   prmDummy: migration-threshold=1 fail-count=1000000 last-failure='Mon Sep 26 13:01:18 2016'
* Node rh72-02:

Failed Actions:
* prmDummy_stop_0 on rh72-01 'unknown error' (1): call=8, status=complete, exitreason='none',
    last-rc-change='Mon Sep 26 13:01:18 2016', queued=0ms, exec=33ms

Step4) Restore the Dummy resource agent to its original state.

(snip)
dummy_stop() {
    dummy_monitor
    if [ $? -eq $OCF_SUCCESS ]; then
        rm ${OCF_RESKEY_state}
    fi
    rm -f "${VERIFY_SERIALIZED_FILE}"
    return $OCF_SUCCESS
}
(snip)

Step5) Clean up the Dummy resource failure.
[root@rh72-01 ~]# crm_resource -C -r prmDummy -H rh72-01 -f
Cleaning up prmDummy on rh72-01, removing fail-count-prmDummy
Waiting for 1 replies from the CRMd. OK

Step6) Fail-over completes. However, the Dummy resource is never actually stopped on the rh72-01 node.

[root@rh72-02 ~]# crm_mon -1 -Af
Stack: corosync
Current DC: rh72-02 (version 1.1.15-e174ec8) - partition WITHOUT quorum
Last updated: Mon Sep 26 13:02:32 2016
Last change: Mon Sep 26 13:02:20 2016 by hacluster via crmd on rh72-01

2 nodes and 1 resource configured

Online: [ rh72-02 ]
OFFLINE: [ rh72-01 ]

 Resource Group: grpDummy
     prmDummy   (ocf::pacemaker:Dummy): Started rh72-02

Node Attributes:
* Node rh72-02:

Migration Summary:
* Node rh72-02:

[root@rh72-01 ~]# ls -lt /var/run/Dummy-prmDummy.state
-rw-r-----. 1 root root 0 Sep 26 2016 /var/run/Dummy-prmDummy.state

-------------
Sep 26 13:02:21 rh72-01 crmd[1584]: warning: Action 2 (prmDummy_monitor_0) on rh72-01 failed (target: 7 vs. rc: 0): Error
Sep 26 13:02:21 rh72-01 crmd[1584]: notice: Transition aborted by operation prmDummy_monitor_0 'create' on rh72-01: Event failed | magic=0:0;2:6:7:196faae4-4faf-42a5-9ffb-9dcf6272e3fb cib=0.6.2 source=match_graph_event:310 complete=false
Sep 26 13:02:21 rh72-01 crmd[1584]: info: Action prmDummy_monitor_0 (2) confirmed on rh72-01 (rc=0)
Sep 26 13:02:21 rh72-01 crmd[1584]: info: Detected action (6.2) prmDummy_monitor_0.13=ok: failed
Sep 26 13:02:21 rh72-01 crmd[1584]: warning: Action 2 (prmDummy_monitor_0) on rh72-01 failed (target: 7 vs. rc: 0): Error
Sep 26 13:02:21 rh72-01 crmd[1584]: info: Transition aborted by operation prmDummy_monitor_0 'create' on rh72-01: Event failed | magic=0:0;2:6:7:196faae4-4faf-42a5-9ffb-9dcf6272e3fb cib=0.6.2 source=match_graph_event:310 complete=false
Sep 26 13:02:21 rh72-01 crmd[1584]: info: Action prmDummy_monitor_0 (2) confirmed on rh72-01 (rc=0)
Sep 26 13:02:21 rh72-01 crmd[1584]: info: Detected action (6.2) prmDummy_monitor_0.13=ok: failed
Sep 26 13:02:21 rh72-01 crmd[1584]: notice: Transition 6 (Complete=3, Pending=0, Fired=0, Skipped=0, Incomplete=3, Source=/var/lib/pacemaker/pengine/pe-input-6.bz2): Complete
Sep 26 13:02:21 rh72-01 crmd[1584]: info: Input I_STOP received in state S_TRANSITION_ENGINE from notify_crmd
Sep 26 13:02:21 rh72-01 crmd[1584]: info: State transition S_TRANSITION_ENGINE -> S_STOPPING | input=I_STOP cause=C_FSA_INTERNAL origin=notify_crmd
Sep 26 13:02:21 rh72-01 crmd[1584]: info: DC role released
Sep 26 13:02:21 rh72-01 crmd[1584]: info: Connection to the Policy Engine released
Sep 26 13:02:21 rh72-01 cib[1579]: info: Forwarding cib_modify operation for section status to all (origin=local/crmd/56)
Sep 26 13:02:21 rh72-01 cib[1579]: info: Diff: --- 0.6.2 2
Sep 26 13:02:21 rh72-01 cib[1579]: info: Diff: +++ 0.6.3 (null)
Sep 26 13:02:21 rh72-01 cib[1579]: info: + /cib: @num_updates=3
Sep 26 13:02:21 rh72-01 cib[1579]: info: + /cib/status/node_state[@id='1']: @crm-debug-origin=do_dc_release, @expected=down
Sep 26 13:02:21 rh72-01 cib[1579]: info: Completed cib_modify operation for section status: OK (rc=0, origin=rh72-01/crmd/56, version=0.6.3)
Sep 26 13:02:21 rh72-01 crmd[1584]: info: Transitioner is now inactive
Sep 26 13:02:21 rh72-01 crmd[1584]: info: Disconnecting STONITH...
Sep 26 13:02:21 rh72-01 crmd[1584]: info: Fencing daemon disconnected
-------------

The problem is that fail-over takes place without the resource ever being stopped. When we put the rh72-01 node into standby instead, there is no problem.
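The leftover state file also explains the "target: 7 vs. rc: 0" monitor results in the log above: the Dummy agent reports "running" whenever its state file exists. Below is a minimal sketch of that check (the monitor logic is paraphrased from ocf:pacemaker:Dummy; the /tmp path is ours for illustration, the real agent used /var/run/Dummy-prmDummy.state):

```shell
#!/bin/sh
# Sketch of how ocf:pacemaker:Dummy decides whether it is "running":
# the resource counts as started as long as its state file exists.
OCF_SUCCESS=0
OCF_NOT_RUNNING=7
OCF_RESKEY_state="/tmp/Dummy-prmDummy.state"  # stand-in for /var/run/...

dummy_monitor() {
    if [ -f "$OCF_RESKEY_state" ]; then
        return $OCF_SUCCESS
    fi
    return $OCF_NOT_RUNNING
}

touch "$OCF_RESKEY_state"   # leftover file, as seen on rh72-01
dummy_monitor
echo "monitor rc with state file present: $?"   # 0, i.e. still "running"

rm -f "$OCF_RESKEY_state"
dummy_monitor
echo "monitor rc after file removal: $?"        # 7, i.e. not running
```

So as long as /var/run/Dummy-prmDummy.state survives on rh72-01, any probe there reports the resource as running, which matches the failed probes (target: 7 vs. rc: 0) logged after the cleanup.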
There seems to be a problem somewhere in how Pacemaker handles its own shutdown.

I registered this issue in Bugzilla:
- http://bugs.clusterlabs.org/show_bug.cgi?id=5300
I attached a crm_report to the Bugzilla entry.

Best Regards,
Hideo Yamauchi.

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org