Hi Andrew,

The fix still seems to have a problem. The cluster is still waiting for the demote, and the master-group resource cannot move.

[root@bl460g8n3 ~]# crm_mon -1 -Af
Last updated: Tue Aug 18 11:13:39 2015          Last change: Tue Aug 18 11:11:01 2015 by root via crm_resource on bl460g8n4
Stack: corosync
Current DC: bl460g8n3 (version 1.1.13-7d0cac0) - partition with quorum
4 nodes and 10 resources configured

Online: [ bl460g8n3 bl460g8n4 ]
GuestOnline: [ pgsr02@bl460g8n4 ]

prmDB2 (ocf::heartbeat:VirtualDomain): Started bl460g8n4
Resource Group: grpStonith1
    prmStonith1-2 (stonith:external/ipmi): Started bl460g8n4
Resource Group: grpStonith2
    prmStonith2-2 (stonith:external/ipmi): Started bl460g8n3
Master/Slave Set: msPostgresql [pgsql]
    Masters: [ pgsr02 ]

Node Attributes:
* Node bl460g8n3:
* Node bl460g8n4:
* Node pgsr02@bl460g8n4:
    + master-pgsql : 10

Migration Summary:
* Node bl460g8n3:
   pgsr01: migration-threshold=1 fail-count=1 last-failure='Tue Aug 18 11:12:03 2015'
* Node bl460g8n4:
* Node pgsr02@bl460g8n4:

Failed Actions:
* pgsr01_monitor_30000 on bl460g8n3 'unknown error' (1): call=2, status=Error, exitreason='none',
    last-rc-change='Tue Aug 18 11:12:03 2015', queued=0ms, exec=0ms
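(As an aside: when I retry the test, I first clear the recorded failure. This is plain crm_resource usage with the resource and node names taken from the output above; it only resets the fail-count shown in the Migration Summary, it does not address the stuck demote itself.)

# clear the recorded failure of the guest-node resource so it can be retried
[root@bl460g8n3 ~]# crm_resource --cleanup --resource pgsr01 --node bl460g8n3
# then re-check the cluster state
[root@bl460g8n3 ~]# crm_mon -1 -Af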
(snip)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Container prmDB1 and the resources within it have failed 1 times on bl460g8n3
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Forcing prmDB1 away from bl460g8n3 after 1 failures (max=1)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: pgsr01 has failed 1 times on bl460g8n3
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Forcing pgsr01 away from bl460g8n3 after 1 failures (max=1)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: prmDB1: Rolling back scores from pgsr01
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Resource prmDB1 cannot run anywhere
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Resource pgsr01 cannot run anywhere
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: pgsql:0: Rolling back scores from vip-master
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Resource pgsql:0 cannot run anywhere
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Promoting pgsql:1 (Master pgsr02)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: msPostgresql: Promoted 1 instances of a possible 1 to master
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Action vip-master_stop_0 on pgsr01 is unrunnable (offline)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Start recurring monitor (10s) for vip-master on pgsr02
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Action vip-rep_stop_0 on pgsr01 is unrunnable (offline)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Start recurring monitor (10s) for vip-rep on pgsr02
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Action pgsql:0_demote_0 on pgsr01 is unrunnable (offline)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Action pgsql:0_stop_0 on pgsr01 is unrunnable (offline)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Action pgsql:0_demote_0 on pgsr01 is unrunnable (offline)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Action pgsql:0_stop_0 on pgsr01 is unrunnable (offline)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Start recurring monitor (9s) for pgsql:1 on pgsr02
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Action pgsql:0_demote_0 on pgsr01 is unrunnable (offline)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Action pgsql:0_stop_0 on pgsr01 is unrunnable (offline)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Action pgsql:0_demote_0 on pgsr01 is unrunnable (offline)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Action pgsql:0_stop_0 on pgsr01 is unrunnable (offline)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Start recurring monitor (9s) for pgsql:1 on pgsr02
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Impliying node pgsr01 is down when container prmDB1 is stopped ((nil))
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Leave prmDB1 (Stopped)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Leave prmDB2 (Started bl460g8n4)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Leave prmStonith1-2 (Started bl460g8n4)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Leave prmStonith2-2 (Started bl460g8n3)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: notice: Stop vip-master (Started pgsr01 - blocked)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: notice: Stop vip-rep (Started pgsr01 - blocked)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: notice: Demote pgsql:0 (Master -> Stopped pgsr01 - blocked)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Leave pgsql:1 (Master pgsr02)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Leave pgsr01 (Stopped)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Leave pgsr02 (Started bl460g8n4)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: crit: Cannot shut down node 'pgsr01' because of pgsql:0: blocked failed
Aug 18 11:12:07 bl460g8n3 pengine[10325]: crit: Cannot shut down node 'pgsr01' because of vip-rep: blocked failed
Aug 18 11:12:07 bl460g8n3 pengine[10325]: crit: Cannot shut down node 'pgsr01' because of vip-master: blocked failed
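(As a cross-check of the fencing side, one can ask stonithd directly whether anything was ever attempted against the guest node, and request a fence by hand; a sketch with stonith_admin, using the node name from the log above:)

# show any fencing history recorded for pgsr01
[root@bl460g8n3 ~]# stonith_admin --history pgsr01
# request the fence manually
[root@bl460g8n3 ~]# stonith_admin --reboot pgsr01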
* http://bugs.clusterlabs.org/show_bug.cgi?id=5247#c3

Best Regards,
Hideo Yamauchi.

----- Original Message -----
>From: Andrew Beekhof <and...@beekhof.net>
>To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics related to open-source clustering welcomed <users@clusterlabs.org>
>Date: 2015/8/18, Tue 10:17
>Subject: Re: [ClusterLabs] [Question:pacemaker_remote] By the operation that remote node cannot carry out a cluster, the resource does not move. (STONITH is not carried out, too)
>
>Should be fixed now. Thanks for the report!
>
>> On 12 Aug 2015, at 1:20 pm, renayama19661...@ybb.ne.jp wrote:
>>
>> Hi All,
>>
>> We confirmed the behavior of pacemaker_remote (version: pacemaker-ad1f397a8228a63949f86c96597da5cecc3ed977).
>>
>> The cluster consists of the following nodes:
>> * bl460g8n3 (KVM host)
>> * bl460g8n4 (KVM host)
>> * pgsr01 (guest on the bl460g8n3 host)
>> * pgsr02 (guest on the bl460g8n4 host)
>>
>>
>> Step 1) I set up a cluster with a simple resource configuration.
>>
>> [root@bl460g8n3 ~]# crm_mon -1 -Af
>> Last updated: Wed Aug 12 11:52:27 2015          Last change: Wed Aug 12 11:51:47 2015 by root via crm_resource on bl460g8n4
>> Stack: corosync
>> Current DC: bl460g8n3 (version 1.1.13-ad1f397) - partition with quorum
>> 4 nodes and 10 resources configured
>>
>> Online: [ bl460g8n3 bl460g8n4 ]
>> GuestOnline: [ pgsr01@bl460g8n3 pgsr02@bl460g8n4 ]
>>
>> prmDB1 (ocf::heartbeat:VirtualDomain): Started bl460g8n3
>> prmDB2 (ocf::heartbeat:VirtualDomain): Started bl460g8n4
>> Resource Group: grpStonith1
>>     prmStonith1-2 (stonith:external/ipmi): Started bl460g8n4
>> Resource Group: grpStonith2
>>     prmStonith2-2 (stonith:external/ipmi): Started bl460g8n3
>> Resource Group: master-group
>>     vip-master (ocf::heartbeat:Dummy): Started pgsr02
>>     vip-rep (ocf::heartbeat:Dummy): Started pgsr02
>> Master/Slave Set: msPostgresql [pgsql]
>>     Masters: [ pgsr02 ]
>>     Slaves: [ pgsr01 ]
>>
>> Node Attributes:
>> * Node bl460g8n3:
>> * Node bl460g8n4:
>> * Node pgsr01@bl460g8n3:
>>     + master-pgsql : 5
>> * Node pgsr02@bl460g8n4:
>>     + master-pgsql : 10
>>
>> Migration Summary:
>> * Node bl460g8n4:
>> * Node bl460g8n3:
>> * Node pgsr02@bl460g8n4:
>> * Node pgsr01@bl460g8n3:
>>
>>
>> Step 2) I cause a failure of pacemaker_remote on pgsr02.
>>
>> [root@pgsr02 ~]# ps -ef | grep remote
>> root      1171     1  0 11:52 ?        00:00:00 /usr/sbin/pacemaker_remoted
>> root      1428  1377  0 11:53 pts/0    00:00:00 grep --color=auto remote
>> [root@pgsr02 ~]# kill -9 1171
>>
>>
>> Step 3) After the failure, the master-group resource does not start on pgsr01.
>>
>> [root@bl460g8n3 ~]# crm_mon -1 -Af
>> Last updated: Wed Aug 12 11:54:04 2015          Last change: Wed Aug 12 11:51:47 2015 by root via crm_resource on bl460g8n4
>> Stack: corosync
>> Current DC: bl460g8n3 (version 1.1.13-ad1f397) - partition with quorum
>> 4 nodes and 10 resources configured
>>
>> Online: [ bl460g8n3 bl460g8n4 ]
>> GuestOnline: [ pgsr01@bl460g8n3 ]
>>
>> prmDB1 (ocf::heartbeat:VirtualDomain): Started bl460g8n3
>> prmDB2 (ocf::heartbeat:VirtualDomain): FAILED bl460g8n4
>> Resource Group: grpStonith1
>>     prmStonith1-2 (stonith:external/ipmi): Started bl460g8n4
>> Resource Group: grpStonith2
>>     prmStonith2-2 (stonith:external/ipmi): Started bl460g8n3
>> Master/Slave Set: msPostgresql [pgsql]
>>     Masters: [ pgsr01 ]
>>
>> Node Attributes:
>> * Node bl460g8n3:
>> * Node bl460g8n4:
>> * Node pgsr01@bl460g8n3:
>>     + master-pgsql : 10
>>
>> Migration Summary:
>> * Node bl460g8n4:
>>    pgsr02: migration-threshold=1 fail-count=1 last-failure='Wed Aug 12 11:53:39 2015'
>> * Node bl460g8n3:
>> * Node pgsr01@bl460g8n3:
>>
>> Failed Actions:
>> * pgsr02_monitor_30000 on bl460g8n4 'unknown error' (1): call=2, status=Error, exitreason='none',
>>     last-rc-change='Wed Aug 12 11:53:39 2015', queued=0ms, exec=0ms
>>
>>
>> This seems to be caused by the fact that STONITH is not carried out for some reason.
>> The demote operation, which the cluster cannot execute, seems to block the start on pgsr01.
>> --------------------------------------------------------------------------------------
>> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: Graph 10 with 20 actions: batch-limit=20 jobs, network-delay=0ms
>> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action 4]: Pending rsc op prmDB2_stop_0 on bl460g8n4 (priority: 0, waiting: 70)
>> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action 36]: Completed pseudo op master-group_stop_0 on N/A (priority: 0, waiting: none)
>> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action 34]: Completed pseudo op master-group_start_0 on N/A (priority: 0, waiting: none)
>> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action 82]: Completed rsc op pgsql_post_notify_demote_0 on pgsr01 (priority: 1000000, waiting: none)
>> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action 81]: Completed rsc op pgsql_pre_notify_demote_0 on pgsr01 (priority: 0, waiting: none)
>> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action 78]: Completed rsc op pgsql_post_notify_stop_0 on pgsr01 (priority: 1000000, waiting: none)
>> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action 77]: Completed rsc op pgsql_pre_notify_stop_0 on pgsr01 (priority: 0, waiting: none)
>> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action 67]: Completed pseudo op msPostgresql_confirmed-post_notify_demoted_0 on N/A (priority: 1000000, waiting: none)
>> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action 66]: Completed pseudo op msPostgresql_post_notify_demoted_0 on N/A (priority: 1000000, waiting: none)
>> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action 65]: Completed pseudo op msPostgresql_confirmed-pre_notify_demote_0 on N/A (priority: 0, waiting: none)
>> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action 64]: Completed pseudo op msPostgresql_pre_notify_demote_0 on N/A (priority: 0, waiting: none)
>> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action 63]: Completed pseudo op msPostgresql_demoted_0 on N/A (priority: 1000000, waiting: none)
>> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action 62]: Completed pseudo op msPostgresql_demote_0 on N/A (priority: 0, waiting: none)
>> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action 55]: Completed pseudo op msPostgresql_confirmed-post_notify_stopped_0 on N/A (priority: 1000000, waiting: none)
>> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action 54]: Completed pseudo op msPostgresql_post_notify_stopped_0 on N/A (priority: 1000000, waiting: none)
>> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action 53]: Completed pseudo op msPostgresql_confirmed-pre_notify_stop_0 on N/A (priority: 0, waiting: none)
>> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action 52]: Completed pseudo op msPostgresql_pre_notify_stop_0 on N/A (priority: 0, waiting: none)
>> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action 51]: Completed pseudo op msPostgresql_stopped_0 on N/A (priority: 1000000, waiting: none)
>> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action 50]: Completed pseudo op msPostgresql_stop_0 on N/A (priority: 0, waiting: none)
>> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action 70]: Pending rsc op pgsr02_stop_0 on bl460g8n4 (priority: 0, waiting: none)
>> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: * [Input 38]: Unresolved dependency rsc op pgsql_demote_0 on pgsr02
>> Aug 12 12:08:40 bl460g8n3 crmd[9427]: info: FSA: Input I_TE_SUCCESS from notify_crmd() received in state S_TRANSITION_ENGINE
>> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
>> --------------------------------------------------------------------------------------
>>
>> Is there a setting that lets the cluster carry out STONITH properly?
>> Or is this a bug in pacemaker_remote?
>>
>> * I registered this issue in Bugzilla (http://bugs.clusterlabs.org/show_bug.cgi?id=5247).
>> * In addition, I attached a crm_report to Bugzilla.
>>
>> Best Regards,
>> Hideo Yamauchi.
>>
>> _______________________________________________
>> Users mailing list: Users@clusterlabs.org
>> http://clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org