Hi,

On Tue, Aug 30, 2016 at 10:03 PM, Scott Greenlese <swgre...@us.ibm.com> wrote:
> Added an appropriate subject line (was blank). Thanks...
>
> Scott Greenlese ... IBM z/BX Solutions Test, Poughkeepsie, N.Y.
> INTERNET: swgre...@us.ibm.com
> PHONE: 8/293-7301 (845-433-7301) M/S: POK 42HA/P966
>
> ----- Forwarded by Scott Greenlese/Poughkeepsie/IBM on 08/30/2016 03:59 PM -----
>
> From: Scott Greenlese/Poughkeepsie/IBM@IBMUS
> To: Cluster Labs - All topics related to open-source clustering welcomed <users@clusterlabs.org>
> Date: 08/29/2016 06:36 PM
> Subject: [ClusterLabs] (no subject)
> ------------------------------
>
> Hi folks,
>
> I'm assigned to system test Pacemaker/Corosync on the KVM on System Z
> platform with pacemaker-1.1.13-10 and corosync-2.3.4-7.

It would be good to see your full cluster configuration (corosync.conf and the CIB) - but my first guess is: no fencing at all ... and what is your "no-quorum-policy" in Pacemaker?

Regards,
Andreas
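P.S. For reference, both are quick to check from any node with the pcs that ships alongside pacemaker-1.1.13 - a sketch; empty output from the first command means the property is unset and the default ("stop") applies:

    # current quorum policy (stop | freeze | ignore | suicide)
    pcs property show no-quorum-policy

    # configured fencing devices - no output here means no fencing
    pcs stonith show

    # full cluster configuration, handy for posting to the list
    pcs config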
> I have a cluster with 5 KVM hosts, and a total of 200
> ocf:heartbeat:VirtualDomain resources defined to run
> across the 5 cluster nodes (symmetrical is true for this cluster).
>
> The heartbeat network is communicating over vlan1293, which is hung off a
> network device, 0230.
>
> In general, pacemaker does a good job of distributing my virtual guest
> resources evenly across the hypervisors in the cluster. These resources
> are a mixed bag:
>
> - "opaque" and remote "guest nodes" managed by the cluster
> - allow-migrate=false and allow-migrate=true
> - qcow2 (file-based) guests and LUN-based guests
> - SLES and Ubuntu OS
>
> [root@zs95kj ]# pcs status |less
> Cluster name: test_cluster_2
> Last updated: Mon Aug 29 17:02:08 2016  Last change: Mon Aug 29 16:37:31 2016 by root via crm_resource on zs93kjpcs1
> Stack: corosync
> Current DC: zs95kjpcs1 (version 1.1.13-10.el7_2.ibm.1-44eb2dd) - partition with quorum
> 103 nodes and 300 resources configured
>
> Node zs90kppcs1: standby
> Online: [ zs93KLpcs1 zs93kjpcs1 zs95KLpcs1 zs95kjpcs1 ]
>
> This morning, our system admin team performed a "non-disruptive"
> (concurrent) microcode load on the OSA, which (to our surprise) dropped
> the network connection for 13 seconds on the S93 CEC, from 11:18:34am to
> 11:18:47am, to be exact. This temporary outage caused the two cluster
> nodes on S93 (zs93kjpcs1 and zs93KLpcs1) to drop out of the cluster, as
> expected.
>
> However, pacemaker didn't handle this too well. The end result was
> numerous VirtualDomain resources in FAILED state:
>
> [root@zs95kj log]# date; pcs status |grep VirtualD |grep zs93 |grep FAILED
> Mon Aug 29 12:33:32 EDT 2016
> zs95kjg110104_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110092_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110099_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110102_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110106_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110112_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110115_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110118_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110124_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110127_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110130_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110136_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110139_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110142_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110148_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110152_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110155_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110161_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110164_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110167_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110173_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110176_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110179_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110185_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg109106_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
>
> As well as several VirtualDomain resources showing "Started" on two
> cluster nodes:
>
> zs95kjg110079_res (ocf::heartbeat:VirtualDomain): Started[ zs93kjpcs1 zs93KLpcs1 ]
> zs95kjg110108_res (ocf::heartbeat:VirtualDomain): Started[ zs93kjpcs1 zs93KLpcs1 ]
> zs95kjg110186_res (ocf::heartbeat:VirtualDomain): Started[ zs93kjpcs1 zs93KLpcs1 ]
> zs95kjg110188_res (ocf::heartbeat:VirtualDomain): Started[ zs93kjpcs1 zs93KLpcs1 ]
> zs95kjg110198_res (ocf::heartbeat:VirtualDomain): Started[ zs93kjpcs1 zs93KLpcs1 ]
>
> The virtual machines themselves were, in fact, "running" on both hosts.
> For example:
>
> [root@zs93kl ~]# virsh list |grep zs95kjg110079
> 70 zs95kjg110079 running
>
> [root@zs93kj cli]# virsh list |grep zs95kjg110079
> 18 zs95kjg110079 running
>
> On this particular VM, there was file corruption of this file-based qcow2
> guest's image, such that you could not ping or ssh to it, and if you
> opened a virsh console, you got an "initramfs" prompt.
>
> To recover, we had to mount the volume on another VM and then run fsck to
> recover it.
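(Side note: a convenient way to do that kind of offline repair without attaching the disk to a second VM - sketch only, the image path is made up, and the guest must be fully stopped on every host first:

    # export the qcow2 image as a block device
    modprobe nbd max_part=8
    qemu-nbd --connect=/dev/nbd0 /var/lib/libvirt/images/zs95kjg110079.qcow2

    # repair the guest's filesystem (partition 1 here; adjust to the actual layout)
    fsck -y /dev/nbd0p1

    # detach the image again
    qemu-nbd --disconnect /dev/nbd0
)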
>
> I walked through the system log on the two S93 hosts to see how
> zs95kjg110079 ended up running on two cluster nodes. (Some entries were
> omitted; I saved the logs for future reference.)
>
> zs93kjpcs1 system log - (shows membership changes after the network failure at 11:18:34)
>
> Aug 29 11:18:33 zs93kl kernel: qeth 0.0.0230: The qeth device driver failed to recover an error on the device
> Aug 29 11:18:33 zs93kl kernel: qeth: irb 00000000: 00 c2 40 17 01 51 90 38 00 04 00 00 00 00 00 00 ..@..Q.8........
> Aug 29 11:18:33 zs93kl kernel: qeth: irb 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> Aug 29 11:18:33 zs93kl kernel: qeth: irb 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> Aug 29 11:18:33 zs93kl kernel: qeth: irb 00000030: 00 00 00 00 00 00 00 00 00 00 00 34 00 1f 00 07 ...........4....
> Aug 29 11:18:33 zs93kl kernel: qeth 0.0.0230: A recovery process has been started for the device
> Aug 29 11:18:33 zs93kl corosync[19281]: [TOTEM ] The token was lost in the OPERATIONAL state.
> Aug 29 11:18:33 zs93kl corosync[19281]: [TOTEM ] A processor failed, forming new configuration.
> Aug 29 11:18:33 zs93kl corosync[19281]: [TOTEM ] entering GATHER state from 2(The token was lost in the OPERATIONAL state.).
> Aug 29 11:18:34 zs93kl kernel: qeth 0.0.0230: The qeth device driver failed to recover an error on the device
> Aug 29 11:18:34 zs93kl kernel: qeth: irb 00000000: 00 00 11 01 00 00 00 00 00 04 00 00 00 00 00 00 ................
> Aug 29 11:18:34 zs93kl kernel: qeth: irb 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> Aug 29 11:18:34 zs93kl kernel: qeth: irb 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> Aug 29 11:18:34 zs93kl kernel: qeth: irb 00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
>
> Aug 29 11:18:37 zs93kj attrd[21400]: notice: crm_update_peer_proc: Node zs95kjpcs1[2] - state is now lost (was member)
> Aug 29 11:18:37 zs93kj attrd[21400]: notice: Removing all zs95kjpcs1 attributes for attrd_peer_change_cb
> Aug 29 11:18:37 zs93kj cib[21397]: notice: crm_update_peer_proc: Node zs95kjpcs1[2] - state is now lost (was member)
> Aug 29 11:18:37 zs93kj cib[21397]: notice: Removing zs95kjpcs1/2 from the membership list
> Aug 29 11:18:37 zs93kj cib[21397]: notice: Purged 1 peers with id=2 and/or uname=zs95kjpcs1 from the membership cache
> Aug 29 11:18:37 zs93kj attrd[21400]: notice: Removing zs95kjpcs1/2 from the membership list
> Aug 29 11:18:37 zs93kj cib[21397]: notice: crm_update_peer_proc: Node zs95KLpcs1[3] - state is now lost (was member)
> Aug 29 11:18:37 zs93kj attrd[21400]: notice: Purged 1 peers with id=2 and/or uname=zs95kjpcs1 from the membership cache
> Aug 29 11:18:37 zs93kj cib[21397]: notice: Removing zs95KLpcs1/3 from the membership list
> Aug 29 11:18:37 zs93kj attrd[21400]: notice: crm_update_peer_proc: Node zs95KLpcs1[3] - state is now lost (was member)
> Aug 29 11:18:37 zs93kj cib[21397]: notice: Purged 1 peers with id=3 and/or uname=zs95KLpcs1 from the membership cache
> Aug 29 11:18:37 zs93kj cib[21397]: notice: crm_update_peer_proc: Node zs93KLpcs1[5] - state is now lost (was member)
> Aug 29 11:18:37 zs93kj cib[21397]: notice: Removing zs93KLpcs1/5 from the membership list
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] entering GATHER state from 0(consensus timeout).
> Aug 29 11:18:37 zs93kj cib[21397]: notice: Purged 1 peers with id=5 and/or uname=zs93KLpcs1 from the membership cache
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] Creating commit token because I am the rep.
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] Saving state aru 32 high seq received 32
> Aug 29 11:18:37 zs93kj corosync[20562]: [MAIN ] Storing new sequence id for ring 300
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] entering COMMIT state.
> Aug 29 11:18:37 zs93kj crmd[21402]: notice: Membership 768: quorum lost (1)
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] got commit token
> Aug 29 11:18:37 zs93kj attrd[21400]: notice: Removing all zs95KLpcs1 attributes for attrd_peer_change_cb
> Aug 29 11:18:37 zs93kj attrd[21400]: notice: Removing zs95KLpcs1/3 from the membership list
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] entering RECOVERY state.
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] TRANS [0] member 10.20.93.11:
> Aug 29 11:18:37 zs93kj pacemakerd[21143]: notice: Membership 768: quorum lost (1)
> Aug 29 11:18:37 zs93kj stonith-ng[21398]: notice: crm_update_peer_proc: Node zs95kjpcs1[2] - state is now lost (was member)
> Aug 29 11:18:37 zs93kj crmd[21402]: notice: crm_reap_unseen_nodes: Node zs95KLpcs1[3] - state is now lost (was member)
> Aug 29 11:18:37 zs93kj crmd[21402]: warning: No match for shutdown action on 3
> Aug 29 11:18:37 zs93kj attrd[21400]: notice: Purged 1 peers with id=3 and/or uname=zs95KLpcs1 from the membership cache
> Aug 29 11:18:37 zs93kj stonith-ng[21398]: notice: Removing zs95kjpcs1/2 from the membership list
> Aug 29 11:18:37 zs93kj crmd[21402]: notice: Stonith/shutdown of zs95KLpcs1 not matched
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] position [0] member 10.20.93.11:
> Aug 29 11:18:37 zs93kj attrd[21400]: notice: crm_update_peer_proc: Node zs93KLpcs1[5] - state is now lost (was member)
> Aug 29 11:18:37 zs93kj stonith-ng[21398]: notice: Purged 1 peers with id=2 and/or uname=zs95kjpcs1 from the membership cache
> Aug 29 11:18:37 zs93kj crmd[21402]: notice: crm_reap_unseen_nodes: Node zs95kjpcs1[2] - state is now lost (was member)
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] previous ring seq 2fc rep 10.20.93.11
> Aug 29 11:18:37 zs93kj attrd[21400]: notice: Removing all zs93KLpcs1 attributes for attrd_peer_change_cb
> Aug 29 11:18:37 zs93kj stonith-ng[21398]: notice: crm_update_peer_proc: Node zs95KLpcs1[3] - state is now lost (was member)
> Aug 29 11:18:37 zs93kj crmd[21402]: warning: No match for shutdown action on 2
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] aru 32 high delivered 32 received flag 1
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] Did not need to originate any messages in recovery.
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] got commit token
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] Sending initial ORF token
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru 0
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] install seq 0 aru 0 high seq received 0
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] install seq 0 aru 0 high seq received 0
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 0
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] install seq 0 aru 0 high seq received 0
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] install seq 0 aru 0 high seq received 0
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] retrans flag count 4 token aru 0 install seq 0 aru 0 0
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] Resetting old ring state
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] recovery to regular 1-0
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] Marking UDPU member 10.20.93.12 inactive
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] Marking UDPU member 10.20.93.13 inactive
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] Marking UDPU member 10.20.93.14 inactive
> Aug 29 11:18:37 zs93kj corosync[20562]: [MAIN ] Member left: r(0) ip(10.20.93.12)
> Aug 29 11:18:37 zs93kj corosync[20562]: [MAIN ] Member left: r(0) ip(10.20.93.13)
> Aug 29 11:18:37 zs93kj corosync[20562]: [MAIN ] Member left: r(0) ip(10.20.93.14)
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] waiting_trans_ack changed to 1
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] entering OPERATIONAL state.
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] A new membership (10.20.93.11:768) was formed. Members left: 2 5 3
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] Failed to receive the leave message. failed: 2 5 3
> Aug 29 11:18:37 zs93kj corosync[20562]: [SYNC ] Committing synchronization for corosync configuration map access
> Aug 29 11:18:37 zs93kj corosync[20562]: [CMAP ] Not first sync -> no action
> Aug 29 11:18:37 zs93kj corosync[20562]: [CPG ] comparing: sender r(0) ip(10.20.93.11) ; members(old:4 left:3)
> Aug 29 11:18:37 zs93kj corosync[20562]: [CPG ] chosen downlist: sender r(0) ip(10.20.93.11) ; members(old:4 left:3)
>
> Aug 29 11:18:43 zs93kj corosync[20562]: [TOTEM ] Marking UDPU member 10.20.93.12 active
> Aug 29 11:18:43 zs93kj corosync[20562]: [TOTEM ] Marking UDPU member 10.20.93.14 active
> Aug 29 11:18:43 zs93kj corosync[20562]: [MAIN ] Member joined: r(0) ip(10.20.93.12)
> Aug 29 11:18:43 zs93kj corosync[20562]: [MAIN ] Member joined: r(0) ip(10.20.93.14)
> Aug 29 11:18:43 zs93kj corosync[20562]: [TOTEM ] entering OPERATIONAL state.
> Aug 29 11:18:43 zs93kj corosync[20562]: [TOTEM ] A new membership (10.20.93.11:772) was formed. Members joined: 2 3
> Aug 29 11:18:43 zs93kj corosync[20562]: [SYNC ] Committing synchronization for corosync configuration map access
> Aug 29 11:18:43 zs93kj corosync[20562]: [CMAP ] Not first sync -> no action
> Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] got joinlist message from node 0x1
> Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] got joinlist message from node 0x2
> Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] comparing: sender r(0) ip(10.20.93.14) ; members(old:2 left:0)
> Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] comparing: sender r(0) ip(10.20.93.12) ; members(old:2 left:0)
> Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] comparing: sender r(0) ip(10.20.93.11) ; members(old:1 left:0)
> Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] chosen downlist: sender r(0) ip(10.20.93.12) ; members(old:2 left:0)
> Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] got joinlist message from node 0x3
> Aug 29 11:18:43 zs93kj corosync[20562]: [SYNC ] Committing synchronization for corosync cluster closed process group service v1.01
> Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[0] group:crmd\x00, ip:r(0) ip(10.20.93.14) , pid:21491
> Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[1] group:attrd\x00, ip:r(0) ip(10.20.93.14) , pid:21489
> Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[2] group:stonith-ng\x00, ip:r(0) ip(10.20.93.14) , pid:21487
> Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[3] group:cib\x00, ip:r(0) ip(10.20.93.14) , pid:21486
> Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[4] group:pacemakerd\x00, ip:r(0) ip(10.20.93.14) , pid:21485
> Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[5] group:crmd\x00, ip:r(0) ip(10.20.93.12) , pid:24499
> Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[6] group:attrd\x00, ip:r(0) ip(10.20.93.12) , pid:24497
> Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[7] group:stonith-ng\x00, ip:r(0) ip(10.20.93.12) , pid:24495
> Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[8] group:cib\x00, ip:r(0) ip(10.20.93.12) , pid:24494
> Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[9] group:pacemakerd\x00, ip:r(0) ip(10.20.93.12) , pid:24491
> Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[10] group:crmd\x00, ip:r(0) ip(10.20.93.11) , pid:21402
> Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[11] group:attrd\x00, ip:r(0) ip(10.20.93.11) , pid:21400
> Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[12] group:stonith-ng\x00, ip:r(0) ip(10.20.93.11) , pid:21398
> Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[13] group:cib\x00, ip:r(0) ip(10.20.93.11) , pid:21397
> Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[14] group:pacemakerd\x00, ip:r(0) ip(10.20.93.11) , pid:21143
> Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] flags: quorate: No Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
> Aug 29 11:18:43 zs93kj corosync[20562]: [QB ] IPC credentials authenticated (20562-21400-28)
> Aug 29 11:18:43 zs93kj corosync[20562]: [QB ] connecting to client [21400]
> Aug 29 11:18:43 zs93kj corosync[20562]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168
> Aug 29 11:18:43 zs93kj corosync[20562]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168
> Aug 29 11:18:43 zs93kj pacemakerd[21143]: notice: Membership 772: quorum acquired (3)
>
> Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] quorum regained, resuming activity
> Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] got nodeinfo message from cluster node 3
> Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0
> Aug 29 11:18:43 zs93kj corosync[20562]: [SYNC ] Committing synchronization for corosync vote quorum service v1.0
> Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] total_votes=3, expected_votes=5
> Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] node 1 state=1, votes=1, expected=5
> Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] node 2 state=1, votes=1, expected=5
> Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] node 3 state=1, votes=1, expected=5
> Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] node 4 state=2, votes=1, expected=5
> Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] node 5 state=2, votes=1, expected=5
> Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] lowest node id: 1 us: 1
> Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] highest node id: 3 us: 1
> Aug 29 11:18:43 zs93kj corosync[20562]: [QUORUM] This node is within the primary component and will provide service.
> Aug 29 11:18:43 zs93kj pacemakerd[21143]: notice: pcmk_quorum_notification: Node zs95KLpcs1[3] - state is now member (was lost)
> Aug 29 11:18:43 zs93kj attrd[21400]: notice: crm_update_peer_proc: Node zs95KLpcs1[3] - state is now member (was (null))
> Aug 29 11:18:43 zs93kj corosync[20562]: [QUORUM] Members[3]: 1 2 3
> Aug 29 11:18:43 zs93kj stonith-ng[21398]: warning: Node names with capitals are discouraged, consider changing 'zs95KLpcs1' to something else
> Aug 29 11:18:43 zs93kj corosync[20562]: [MAIN ] Completed service synchronization, ready to provide service.
> Aug 29 11:18:43 zs93kj stonith-ng[21398]: notice: crm_update_peer_proc: Node zs95KLpcs1[3] - state is now member (was (null))
> Aug 29 11:18:43 zs93kj attrd[21400]: notice: crm_update_peer_proc: Node zs95kjpcs1[2] - state is now member (was (null))
>
> The story of zs95kjg110079 starts on ZS93KL, when it seemed to be already running on ZS93KJ -
>
> System log on zs93KLpcs1:
>
> Aug 29 11:20:58 zs93kl pengine[19997]: notice: Start zs95kjg110079_res#011(zs93KLpcs1)
>
> Aug 29 11:21:56 zs93kl crmd[20001]: notice: Initiating action 520: start zs95kjg110079_res_start_0 on zs93KLpcs1 (local)
>
> Aug 29 11:21:56 zs93kl systemd-machined: New machine qemu-70-zs95kjg110079.
> Aug 29 11:21:56 zs93kl systemd: Started Virtual Machine qemu-70-zs95kjg110079.
> Aug 29 11:21:56 zs93kl systemd: Starting Virtual Machine qemu-70-zs95kjg110079.
>
> Aug 29 11:21:59 zs93kl crmd[20001]: notice: Operation zs95kjg110079_res_start_0: ok (node=zs93KLpcs1, call=1036, rc=0, cib-update=735, confirmed=true)
>
> Aug 29 11:22:07 zs93kl crmd[20001]: warning: Action 238 (zs95kjg110079_res_monitor_0) on zs93kjpcs1 failed (target: 7 vs. rc: 0): Error
> Aug 29 11:22:07 zs93kl crmd[20001]: notice: Transition aborted by zs95kjg110079_res_monitor_0 'create' on zs93kjpcs1: Event failed (magic=0:0;238:13:7:236d078a-9063-4092-9660-cfae048f3627, cib=0.2437.3212, source=match_graph_event:381, 0)
>
> Aug 29 11:22:15 zs93kl pengine[19997]: error: Resource zs95kjg110079_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 11:22:15 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
> Aug 29 11:22:15 zs93kl pengine[19997]: notice: Restart zs95kjg110079_res#011(Started zs93kjpcs1)
>
> Aug 29 11:22:23 zs93kl pengine[19997]: error: Resource zs95kjg110079_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 11:22:23 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
> Aug 29 11:22:23 zs93kl pengine[19997]: notice: Restart zs95kjg110079_res#011(Started zs93kjpcs1)
>
> Aug 29 11:30:31 zs93kl pengine[19997]: error: Resource zs95kjg110079_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 11:30:31 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
> Aug 29 11:30:31 zs93kl pengine[19997]: error: Resource zs95kjg110108_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 11:30:31 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
>
> Aug 29 11:55:41 zs93kl pengine[19997]: error: Resource zs95kjg110079_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 11:55:41 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
> Aug 29 11:55:41 zs93kl pengine[19997]: error: Resource zs95kjg110108_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 11:55:41 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
> Aug 29 11:55:41 zs93kl pengine[19997]: error: Resource zs95kjg110186_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 11:55:41 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
>
> Aug 29 11:58:53 zs93kl pengine[19997]: error: Resource zs95kjg110079_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 11:58:53 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
> Aug 29 11:58:53 zs93kl pengine[19997]: error: Resource zs95kjg110108_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 11:58:53 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
> Aug 29 11:58:53 zs93kl pengine[19997]: error: Resource zs95kjg110186_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 11:58:53 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
> Aug 29 11:58:53 zs93kl pengine[19997]: error: Resource zs95kjg110188_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 11:58:53 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
>
> Aug 29 12:00:00 zs93kl pengine[19997]: error: Resource zs95kjg110079_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 12:00:00 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
> Aug 29 12:00:00 zs93kl pengine[19997]: error: Resource zs95kjg110108_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 12:00:00 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
> Aug 29 12:00:00 zs93kl pengine[19997]: error: Resource zs95kjg110186_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 12:00:00 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
> Aug 29 12:00:00 zs93kl pengine[19997]: error: Resource zs95kjg110188_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 12:00:00 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
> Aug 29 12:00:00 zs93kl pengine[19997]: error: Resource zs95kjg110198_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 12:00:00 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
>
> Aug 29 12:03:24 zs93kl pengine[19997]: error: Resource zs95kjg110079_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 12:03:24 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
> Aug 29 12:03:24 zs93kl pengine[19997]: error: Resource zs95kjg110108_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 12:03:24 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
> Aug 29 12:03:24 zs93kl pengine[19997]: error: Resource zs95kjg110186_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 12:03:24 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
> Aug 29 12:03:24 zs93kl pengine[19997]: error: Resource zs95kjg110188_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 12:03:24 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
> Aug 29 12:03:24 zs93kl pengine[19997]: error: Resource zs95kjg110198_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 12:03:24 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
> Aug 29 12:03:24 zs93kl pengine[19997]: notice: Restart zs95kjg110079_res#011(Started zs93kjpcs1)
>
> Aug 29 12:36:27 zs93kl pengine[19997]: error: Resource zs95kjg110079_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 12:36:27 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
> Aug 29 12:36:27 zs93kl pengine[19997]: error: Resource zs95kjg110108_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 12:36:27 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
> Aug 29 12:36:27 zs93kl pengine[19997]: error: Resource zs95kjg110186_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 12:36:27 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
> Aug 29 12:36:27 zs93kl pengine[19997]: error: Resource zs95kjg110188_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 12:36:27 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
> Aug 29 12:36:27 zs93kl pengine[19997]: error: Resource zs95kjg110198_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 12:36:27 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
> Aug 29 12:36:27 zs93kl pengine[19997]: error: Resource zs95kjg110210_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 12:36:27 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
> Aug 29 12:36:27 zs93kl pengine[19997]: notice: Restart zs95kjg110079_res#011(Started zs93kjpcs1)
>
> Aug 29 12:44:41 zs93kl crmd[20001]: warning: Transition 84 (Complete=108, Pending=0, Fired=0, Skipped=0, Incomplete=77, Source=/var/lib/pacemaker/pengine/pe-error-106.bz2): Terminated
> Aug 29 12:44:41 zs93kl crmd[20001]: warning: Transition failed: terminated
> Aug 29 12:44:41 zs93kl crmd[20001]: notice: Graph 84 with 185 actions: batch-limit=185 jobs, network-delay=0ms
> Aug 29 12:44:41 zs93kl crmd[20001]: notice: [Action 410]: Pending rsc op zs95kjg110079_res_monitor_30000 on zs93kjpcs1 (priority: 0, waiting: 409)
> Aug 29 12:44:41 zs93kl crmd[20001]: notice: [Action 409]: Pending rsc op zs95kjg110079_res_start_0 on zs93kjpcs1 (priority: 0, waiting: 408)
> Aug 29 12:44:41 zs93kl crmd[20001]: notice: [Action 408]: Pending pseudo op zs95kjg110079_res_stop_0 on N/A (priority: 0, waiting: 439 470 496 521 546)
> Aug 29 12:44:41 zs93kl crmd[20001]: notice: [Action 407]: Completed pseudo op zs95kjg110079_res_stop_0 on N/A (priority: 0, waiting: none)
>
> Aug 29 12:59:42 zs93kl crmd[20001]: notice: Initiating action 428: stop zs95kjg110079_res_stop_0 on zs93kjpcs1
> Aug 29 12:59:42 zs93kl crmd[20001]: notice: Initiating action 495: stop zs95kjg110108_res_stop_0 on zs93kjpcs1
> Aug 29 12:59:44 zs93kl crmd[20001]: notice: Initiating action 660: stop zs95kjg110186_res_stop_0 on zs93kjpcs1
>
> Aug 29 13:00:04 zs93kl crmd[20001]: notice: [Action 431]: Pending rsc op zs95kjg110079_res_monitor_30000 on zs93kjpcs1 (priority: 0, waiting: 430)
> Aug 29 13:00:04 zs93kl crmd[20001]: notice: [Action 430]: Pending rsc op zs95kjg110079_res_start_0 on zs93kjpcs1 (priority: 0, waiting: 429)
> Aug 29 13:00:04 zs93kl crmd[20001]: notice: [Action 429]: Pending pseudo op zs95kjg110079_res_stop_0 on N/A (priority: 0, waiting: 460 491 517 542 567)
> Aug 29 13:00:04 zs93kl crmd[20001]: notice: [Action 428]: Completed rsc op zs95kjg110079_res_stop_0 on zs93kjpcs1 (priority: 0, waiting: none)
>
> System log on zs93kjpcs1:
>
> Aug 29 11:20:48 zs93kj crmd[21402]: notice: Recurring action zs95kjg110079_res:817 (zs95kjg110079_res_monitor_30000) incomplete at shutdown
>
> Aug 29 11:22:07 zs93kj crmd[259639]: notice: Operation zs95kjg110079_res_monitor_0: ok (node=zs93kjpcs1, call=1223, rc=0, cib-update=104, confirmed=true)
>
> Aug 29 12:59:42 zs93kj VirtualDomain(zs95kjg110079_res)[9148]: INFO: Issuing graceful shutdown request for domain zs95kjg110079.
>
> Finally, zs95kjg110079 shuts down on ZS93KJ at 12:59.
>
> ===================
>
> Does this "active on two nodes" recovery process look right?
>
> What is the recommended procedure to "undo" the resource failures and
> dual host assignments? It took several hours (short of stopping/starting
> the entire cluster) to recover them... resource disable, cleanup, enable
> was the basis... but it seemed that I would fix one resource and two more
> would fall out.
>
> This seems to be one of the pitfalls of configuring resources in
> symmetrical mode.
>
> I would appreciate any best practice guidelines you have to offer. I
> saved the system logs on all hosts in case anyone needs more detailed
> information. I also have the pacemaker.log files.
>
> Thanks in advance!
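(For readers hitting the same situation: the disable/cleanup/enable sequence Scott describes looks like this with pcs - a sketch only, using one of the affected resources as the example. Any stray copy of the domain still running on a node the cluster is no longer tracking has to be stopped by hand before re-enabling, otherwise the next probe finds it again:

    # stop managing/running the resource cluster-wide
    pcs resource disable zs95kjg110079_res

    # on the host still running an untracked copy of the guest
    virsh destroy zs95kjg110079

    # clear the resource's failed-operation history so its state is recomputed
    pcs resource cleanup zs95kjg110079_res

    # allow the cluster to start it again - on one node only this time
    pcs resource enable zs95kjg110079_res

Note that this only cleans up after the fact; preventing the dual-active state in the first place is exactly what fencing is for.)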
>
> Scott Greenlese ... IBM z/BX Solutions Test, Poughkeepsie, N.Y.
> INTERNET: swgre...@us.ibm.com
> PHONE: 8/293-7301 (845-433-7301) M/S: POK 42HA/P966
_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org