>>> Niu Sibo <nius...@linux.vnet.ibm.com> wrote on 07.11.2016 at 16:59 in
message <5820a4cc.9030...@linux.vnet.ibm.com>:
> Hi Ken,
>
> Thanks for the clarification. Now I have another real problem that needs
> your advice.
>
> The cluster consists of 5 nodes, and one of the nodes had a 1-second
> network failure, which resulted in one of the VirtualDomain resources
> starting on two nodes at the same time. The cluster property
> no_quorum_policy is set to stop.
>
> At 16:13:34, this happened:
> 16:13:34 zs95kj attrd[133000]: notice: crm_update_peer_proc: Node
> zs93KLpcs1[5] - state is now lost (was member)
> 16:13:34 zs95kj corosync[132974]: [CPG ] left_list[0]
> group:pacemakerd\x00, ip:r(0) ip(10.20.93.13) , pid:28721
> 16:13:34 zs95kj crmd[133002]: warning: No match for shutdown action on 5
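With no_quorum_policy=stop, a partition that loses quorum is expected to
stop its resources rather than keep them running. A quick sanity check of
what the cluster actually has configured and how many votes it currently
sees (assuming the stock pcs and corosync tools; the exact commands may
differ on your distribution):

  pcs property show no-quorum-policy   # effective value of the property
  corosync-quorumtool -s               # current votes and quorum state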
Usually the node would be fenced now. In the meantime the node might
_try_ to stop the resources.

> 16:13:34 zs95kj attrd[133000]: notice: Removing all zs93KLpcs1
> attributes for attrd_peer_change_cb
> 16:13:34 zs95kj corosync[132974]: [CPG ] left_list_entries:1
> 16:13:34 zs95kj crmd[133002]: notice: Stonith/shutdown of zs93KLpcs1
> not matched
> ...
> 16:13:35 zs95kj attrd[133000]: notice: crm_update_peer_proc: Node
> zs93KLpcs1[5] - state is now member (was (null))

Where are the logs from the other node? I don't see where resources are
_started_.

>
> From the DC:
> [root@zs95kj ~]# crm_simulate --xml-file
> /var/lib/pacemaker/pengine/pe-input-3288.bz2 |grep 110187
> zs95kjg110187_res (ocf::heartbeat:VirtualDomain): Started
> zs93KLpcs1 <---------- This is the baseline where everything works normally
>
> [root@zs95kj ~]# crm_simulate --xml-file
> /var/lib/pacemaker/pengine/pe-input-3289.bz2 |grep 110187
> zs95kjg110187_res (ocf::heartbeat:VirtualDomain): Stopped
> <----------- Here the node zs93KLpcs1 lost its network for 1 sec,
> which resulted in this state.
>
> [root@zs95kj ~]# crm_simulate --xml-file
> /var/lib/pacemaker/pengine/pe-input-3290.bz2 |grep 110187
> zs95kjg110187_res (ocf::heartbeat:VirtualDomain): Stopped
>
> [root@zs95kj ~]# crm_simulate --xml-file
> /var/lib/pacemaker/pengine/pe-input-3291.bz2 |grep 110187
> zs95kjg110187_res (ocf::heartbeat:VirtualDomain): Stopped
>
>
> From the DC's pengine log:
> 16:05:01 zs95kj pengine[133001]: notice: Calculated Transition 238:
> /var/lib/pacemaker/pengine/pe-input-3288.bz2
> ...
> 16:13:41 zs95kj pengine[133001]: notice: Start
> zs95kjg110187_res#011(zs90kppcs1)
> ...
> 16:13:41 zs95kj pengine[133001]: notice: Calculated Transition 239:
> /var/lib/pacemaker/pengine/pe-input-3289.bz2
>
> From the DC's crmd log:
> Sep 9 16:05:25 zs95kj crmd[133002]: notice: Transition 238
> (Complete=48, Pending=0, Fired=0, Skipped=0, Incomplete=0,
> Source=/var/lib/pacemaker/pengine/pe-input-3288.bz2): Complete
> ...
> Sep 9 16:13:42 zs95kj crmd[133002]: notice: Initiating action 752:
> start zs95kjg110187_res_start_0 on zs90kppcs1
> ...
> Sep 9 16:13:56 zs95kj crmd[133002]: notice: Transition 241
> (Complete=81, Pending=0, Fired=0, Skipped=172, Incomplete=341,
> Source=/var/lib/pacemaker/pengine/pe-input-3291.bz2): Stopped
>
> Here I do not see any log about pe-input-3289.bz2 and pe-input-3290.bz2.
> Why is this?
>
> From the log on zs93KLpcs1 where guest 110187 was running, I do not see
> any message regarding stopping this resource after it lost its
> connection to the cluster.
>
> Any ideas where to look for the possible cause?
>
> On 11/3/2016 1:02 AM, Ken Gaillot wrote:
>> On 11/02/2016 11:17 AM, Niu Sibo wrote:
>>> Hi all,
>>>
>>> I have a general question regarding the fencing logic in Pacemaker.
>>>
>>> I have set up a three-node cluster with Pacemaker 1.1.13 and the
>>> cluster property no_quorum_policy set to ignore. When two nodes lose
>>> the NIC that corosync runs on at the same time, it looks like the two
>>> nodes are getting fenced one by one, even though I have three fence
>>> devices defined, one for each node.
>>>
>>> What should I be expecting in this case?
>> It's probably a coincidence that the fencing happens serially; there is
>> nothing enforcing that for separate fence devices. There are many steps
>> in a fencing request, so they can easily take different times to complete.
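For reference, "separate fence devices" here would typically mean one
stonith resource per node, along these lines (the resource name and the
agent parameters below are hypothetical; substitute your own BMC details):

  pcs stonith create fence-zs93kl fence_ipmilan pcmk_host_list="zs93KLpcs1" \
      ipaddr=<bmc-ip> login=<user> passwd=<password>
  pcs constraint location fence-zs93kl avoids zs93KLpcs1

Nothing in such a configuration serializes fencing of different nodes; each
request simply takes however long its device takes.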
>>
>>> I noticed if the node rejoins the cluster before the cluster starts the
>>> fence actions, some resources will get activated on 2 nodes at the
>>> same time. This is really not good if the resource happens to be a
>>> VirtualGuest. Thanks for any suggestions.
>> Since you're ignoring quorum, there's nothing stopping the disconnected
>> node from starting all resources on its own. It can even fence the other
>> nodes, unless the downed NIC is used for fencing. From that node's point
>> of view, it's the other two nodes that are lost.
>>
>> Quorum is the only solution I know of to prevent that. Fencing will
>> correct the situation, but it won't prevent it.
>>
>> See the votequorum(5) man page for various options that can affect how
>> quorum is calculated. Also, the very latest version of corosync supports
>> qdevice (a lightweight daemon that runs on a host outside the cluster
>> strictly for the purposes of quorum).

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org