Hi,

On Tue, Aug 30, 2016 at 10:03 PM, Scott Greenlese <swgre...@us.ibm.com> wrote:
> Added an appropriate subject line (was blank). Thanks...
>
> Scott Greenlese ... IBM z/BX Solutions Test, Poughkeepsie, N.Y.
> INTERNET: swgre...@us.ibm.com
> PHONE: 8/293-7301 (845-433-7301) M/S: POK 42HA/P966
>
> ----- Forwarded by Scott Greenlese/Poughkeepsie/IBM on 08/30/2016 03:59 PM -----
>
> From: Scott Greenlese/Poughkeepsie/IBM@IBMUS
> To: Cluster Labs - All topics related to open-source clustering welcomed <users@clusterlabs.org>
> Date: 08/29/2016 06:36 PM
> Subject: [ClusterLabs] (no subject)
> ------------------------------
>
> Hi folks,
>
> I'm assigned to system test Pacemaker/Corosync on the KVM on System Z
> platform with pacemaker-1.1.13-10 and corosync-2.3.4-7.

It would be good to see your full cluster configuration (corosync.conf and the CIB) - but my first guess is: no fencing at all ... and what is your "no-quorum-policy" in Pacemaker?

Regards,
Andreas
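P.S. For reference, both are quick to check from any node with the pcs that ships alongside pacemaker-1.1.13 - a sketch; empty output from the first command means the property is unset and the default ("stop") applies:

    # current quorum policy (stop | freeze | ignore | suicide)
    pcs property show no-quorum-policy

    # configured fencing devices - no output here means no fencing
    pcs stonith show

    # full cluster configuration, handy for posting to the list
    pcs config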
> I have a cluster with 5 KVM hosts, and a total of 200
> ocf:heartbeat:VirtualDomain resources defined to run
> across the 5 cluster nodes (symmetrical is true for this cluster).
>
> The heartbeat network is communicating over vlan1293, which is hung off a
> network device, 0230.
>
> In general, pacemaker does a good job of distributing my virtual guest
> resources evenly across the hypervisors in the cluster. These resources
> are a mixed bag:
>
> - "opaque" and remote "guest nodes" managed by the cluster
> - allow-migrate=false and allow-migrate=true
> - qcow2 (file-based) guests and LUN-based guests
> - SLES and Ubuntu OS
>
> [root@zs95kj ]# pcs status |less
> Cluster name: test_cluster_2
> Last updated: Mon Aug 29 17:02:08 2016  Last change: Mon Aug 29 16:37:31 2016 by root via crm_resource on zs93kjpcs1
> Stack: corosync
> Current DC: zs95kjpcs1 (version 1.1.13-10.el7_2.ibm.1-44eb2dd) - partition with quorum
> 103 nodes and 300 resources configured
>
> Node zs90kppcs1: standby
> Online: [ zs93KLpcs1 zs93kjpcs1 zs95KLpcs1 zs95kjpcs1 ]
>
> This morning, our system admin team performed a "non-disruptive"
> (concurrent) microcode load on the OSA, which (to our surprise) dropped
> the network connection for 13 seconds on the S93 CEC, from 11:18:34am to
> 11:18:47am, to be exact. This temporary outage caused the two cluster
> nodes on S93 (zs93kjpcs1 and zs93KLpcs1) to drop out of the cluster, as
> expected.
>
> However, pacemaker didn't handle this too well. The end result was
> numerous VirtualDomain resources in FAILED state:
>
> [root@zs95kj log]# date; pcs status |grep VirtualD |grep zs93 |grep FAILED
> Mon Aug 29 12:33:32 EDT 2016
> zs95kjg110104_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110092_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110099_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110102_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110106_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110112_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110115_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110118_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110124_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110127_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110130_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110136_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110139_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110142_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110148_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110152_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110155_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110161_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110164_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110167_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110173_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110176_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110179_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110185_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg109106_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
>
> As well as several VirtualDomain resources showing "Started" on two
> cluster nodes:
>
> zs95kjg110079_res (ocf::heartbeat:VirtualDomain): Started[ zs93kjpcs1 zs93KLpcs1 ]
> zs95kjg110108_res (ocf::heartbeat:VirtualDomain): Started[ zs93kjpcs1 zs93KLpcs1 ]
> zs95kjg110186_res (ocf::heartbeat:VirtualDomain): Started[ zs93kjpcs1 zs93KLpcs1 ]
> zs95kjg110188_res (ocf::heartbeat:VirtualDomain): Started[ zs93kjpcs1 zs93KLpcs1 ]
> zs95kjg110198_res (ocf::heartbeat:VirtualDomain): Started[ zs93kjpcs1 zs93KLpcs1 ]
>
> The virtual machines themselves were, in fact, "running" on both hosts.
> For example:
>
> [root@zs93kl ~]# virsh list |grep zs95kjg110079
> 70 zs95kjg110079 running
>
> [root@zs93kj cli]# virsh list |grep zs95kjg110079
> 18 zs95kjg110079 running
>
> On this particular VM, there was file corruption of this file-based qcow2
> guest's image, such that you could not ping or ssh to it, and if you
> opened a virsh console, you got an "initramfs" prompt.
>
> To recover, we had to mount the volume on another VM and then run fsck to
> recover it.
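(Side note: a convenient way to do that kind of offline repair without attaching the disk to a second VM - sketch only, the image path is made up, and the guest must be fully stopped on every host first:

    # export the qcow2 image as a block device
    modprobe nbd max_part=8
    qemu-nbd --connect=/dev/nbd0 /var/lib/libvirt/images/zs95kjg110079.qcow2

    # repair the guest's filesystem (partition 1 here; adjust to the actual layout)
    fsck -y /dev/nbd0p1

    # detach the image again
    qemu-nbd --disconnect /dev/nbd0
)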
>
> I walked through the system log on the two S93 hosts to see how
> zs95kjg110079 ended up running on two cluster nodes. (Some entries were
> omitted; I saved the logs for future reference.)
>
> zs93kjpcs1 system log - (shows membership changes after the network failure at 11:18:34)
>
> Aug 29 11:18:33 zs93kl kernel: qeth 0.0.0230: The qeth device driver failed to recover an error on the device
> Aug 29 11:18:33 zs93kl kernel: qeth: irb 00000000: 00 c2 40 17 01 51 90 38 00 04 00 00 00 00 00 00 ..@..Q.8........
> Aug 29 11:18:33 zs93kl kernel: qeth: irb 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> Aug 29 11:18:33 zs93kl kernel: qeth: irb 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> Aug 29 11:18:33 zs93kl kernel: qeth: irb 00000030: 00 00 00 00 00 00 00 00 00 00 00 34 00 1f 00 07 ...........4....
> Aug 29 11:18:33 zs93kl kernel: qeth 0.0.0230: A recovery process has been started for the device
> Aug 29 11:18:33 zs93kl corosync[19281]: [TOTEM ] The token was lost in the OPERATIONAL state.
> Aug 29 11:18:33 zs93kl corosync[19281]: [TOTEM ] A processor failed, forming new configuration.
> Aug 29 11:18:33 zs93kl corosync[19281]: [TOTEM ] entering GATHER state from 2(The token was lost in the OPERATIONAL state.).
> Aug 29 11:18:34 zs93kl kernel: qeth 0.0.0230: The qeth device driver failed to recover an error on the device
> Aug 29 11:18:34 zs93kl kernel: qeth: irb 00000000: 00 00 11 01 00 00 00 00 00 04 00 00 00 00 00 00 ................
> Aug 29 11:18:34 zs93kl kernel: qeth: irb 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> Aug 29 11:18:34 zs93kl kernel: qeth: irb 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> Aug 29 11:18:34 zs93kl kernel: qeth: irb 00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
>
> Aug 29 11:18:37 zs93kj attrd[21400]: notice: crm_update_peer_proc: Node zs95kjpcs1[2] - state is now lost (was member)
> Aug 29 11:18:37 zs93kj attrd[21400]: notice: Removing all zs95kjpcs1 attributes for attrd_peer_change_cb
> Aug 29 11:18:37 zs93kj cib[21397]: notice: crm_update_peer_proc: Node zs95kjpcs1[2] - state is now lost (was member)
> Aug 29 11:18:37 zs93kj cib[21397]: notice: Removing zs95kjpcs1/2 from the membership list
> Aug 29 11:18:37 zs93kj cib[21397]: notice: Purged 1 peers with id=2 and/or uname=zs95kjpcs1 from the membership cache
> Aug 29 11:18:37 zs93kj attrd[21400]: notice: Removing zs95kjpcs1/2 from the membership list
> Aug 29 11:18:37 zs93kj cib[21397]: notice: crm_update_peer_proc: Node zs95KLpcs1[3] - state is now lost (was member)
> Aug 29 11:18:37 zs93kj attrd[21400]: notice: Purged 1 peers with id=2 and/or uname=zs95kjpcs1 from the membership cache
> Aug 29 11:18:37 zs93kj cib[21397]: notice: Removing zs95KLpcs1/3 from the membership list
> Aug 29 11:18:37 zs93kj attrd[21400]: notice: crm_update_peer_proc: Node zs95KLpcs1[3] - state is now lost (was member)
> Aug 29 11:18:37 zs93kj cib[21397]: notice: Purged 1 peers with id=3 and/or uname=zs95KLpcs1 from the membership cache
> Aug 29 11:18:37 zs93kj cib[21397]: notice: crm_update_peer_proc: Node zs93KLpcs1[5] - state is now lost (was member)
> Aug 29 11:18:37 zs93kj cib[21397]: notice: Removing zs93KLpcs1/5 from the membership list
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] entering GATHER state from 0(consensus timeout).
> Aug 29 11:18:37 zs93kj cib[21397]: notice: Purged 1 peers with id=5 and/or uname=zs93KLpcs1 from the membership cache
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] Creating commit token because I am the rep.
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] Saving state aru 32 high seq received 32
> Aug 29 11:18:37 zs93kj corosync[20562]: [MAIN ] Storing new sequence id for ring 300
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] entering COMMIT state.
> Aug 29 11:18:37 zs93kj crmd[21402]: notice: Membership 768: quorum lost (1)
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] got commit token
> Aug 29 11:18:37 zs93kj attrd[21400]: notice: Removing all zs95KLpcs1 attributes for attrd_peer_change_cb
> Aug 29 11:18:37 zs93kj attrd[21400]: notice: Removing zs95KLpcs1/3 from the membership list
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] entering RECOVERY state.
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] TRANS [0] member 10.20.93.11:
> Aug 29 11:18:37 zs93kj pacemakerd[21143]: notice: Membership 768: quorum lost (1)
> Aug 29 11:18:37 zs93kj stonith-ng[21398]: notice: crm_update_peer_proc: Node zs95kjpcs1[2] - state is now lost (was member)
> Aug 29 11:18:37 zs93kj crmd[21402]: notice: crm_reap_unseen_nodes: Node zs95KLpcs1[3] - state is now lost (was member)
> Aug 29 11:18:37 zs93kj crmd[21402]: warning: No match for shutdown action on 3
> Aug 29 11:18:37 zs93kj attrd[21400]: notice: Purged 1 peers with id=3 and/or uname=zs95KLpcs1 from the membership cache
> Aug 29 11:18:37 zs93kj stonith-ng[21398]: notice: Removing zs95kjpcs1/2 from the membership list
> Aug 29 11:18:37 zs93kj crmd[21402]: notice: Stonith/shutdown of zs95KLpcs1 not matched
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] position [0] member 10.20.93.11:
> Aug 29 11:18:37 zs93kj attrd[21400]: notice: crm_update_peer_proc: Node zs93KLpcs1[5] - state is now lost (was member)
> Aug 29 11:18:37 zs93kj stonith-ng[21398]: notice: Purged 1 peers with id=2 and/or uname=zs95kjpcs1 from the membership cache
> Aug 29 11:18:37 zs93kj crmd[21402]: notice: crm_reap_unseen_nodes: Node zs95kjpcs1[2] - state is now lost (was member)
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] previous ring seq 2fc rep 10.20.93.11
> Aug 29 11:18:37 zs93kj attrd[21400]: notice: Removing all zs93KLpcs1 attributes for attrd_peer_change_cb
> Aug 29 11:18:37 zs93kj stonith-ng[21398]: notice: crm_update_peer_proc: Node zs95KLpcs1[3] - state is now lost (was member)
> Aug 29 11:18:37 zs93kj crmd[21402]: warning: No match for shutdown action on 2
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] aru 32 high delivered 32 received flag 1
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] Did not need to originate any messages in recovery.
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] got commit token
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] Sending initial ORF token
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru 0
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] install seq 0 aru 0 high seq received 0
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] install seq 0 aru 0 high seq received 0
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 0
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] install seq 0 aru 0 high seq received 0
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] install seq 0 aru 0 high seq received 0
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] retrans flag count 4 token aru 0 install seq 0 aru 0 0
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] Resetting old ring state
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] recovery to regular 1-0
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] Marking UDPU member 10.20.93.12 inactive
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] Marking UDPU member 10.20.93.13 inactive
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] Marking UDPU member 10.20.93.14 inactive
> Aug 29 11:18:37 zs93kj corosync[20562]: [MAIN ] Member left: r(0) ip(10.20.93.12)
> Aug 29 11:18:37 zs93kj corosync[20562]: [MAIN ] Member left: r(0) ip(10.20.93.13)
> Aug 29 11:18:37 zs93kj corosync[20562]: [MAIN ] Member left: r(0) ip(10.20.93.14)
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] waiting_trans_ack changed to 1
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] entering OPERATIONAL state.
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] A new membership (10.20.93.11:768) was formed. Members left: 2 5 3
> Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] Failed to receive the leave message. failed: 2 5 3
> Aug 29 11:18:37 zs93kj corosync[20562]: [SYNC ] Committing synchronization for corosync configuration map access
> Aug 29 11:18:37 zs93kj corosync[20562]: [CMAP ] Not first sync -> no action
> Aug 29 11:18:37 zs93kj corosync[20562]: [CPG ] comparing: sender r(0) ip(10.20.93.11) ; members(old:4 left:3)
> Aug 29 11:18:37 zs93kj corosync[20562]: [CPG ] chosen downlist: sender r(0) ip(10.20.93.11) ; members(old:4 left:3)
>
> Aug 29 11:18:43 zs93kj corosync[20562]: [TOTEM ] Marking UDPU member 10.20.93.12 active
> Aug 29 11:18:43 zs93kj corosync[20562]: [TOTEM ] Marking UDPU member 10.20.93.14 active
> Aug 29 11:18:43 zs93kj corosync[20562]: [MAIN ] Member joined: r(0) ip(10.20.93.12)
> Aug 29 11:18:43 zs93kj corosync[20562]: [MAIN ] Member joined: r(0) ip(10.20.93.14)
> Aug 29 11:18:43 zs93kj corosync[20562]: [TOTEM ] entering OPERATIONAL state.
> Aug 29 11:18:43 zs93kj corosync[20562]: [TOTEM ] A new membership (10.20.93.11:772) was formed. Members joined: 2 3
> Aug 29 11:18:43 zs93kj corosync[20562]: [SYNC ] Committing synchronization for corosync configuration map access
> Aug 29 11:18:43 zs93kj corosync[20562]: [CMAP ] Not first sync -> no action
> Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] got joinlist message from node 0x1
> Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] got joinlist message from node 0x2
> Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] comparing: sender r(0) ip(10.20.93.14) ; members(old:2 left:0)
> Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] comparing: sender r(0) ip(10.20.93.12) ; members(old:2 left:0)
> Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] comparing: sender r(0) ip(10.20.93.11) ; members(old:1 left:0)
> Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] chosen downlist: sender r(0) ip(10.20.93.12) ; members(old:2 left:0)
> Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] got joinlist message from node 0x3
> Aug 29 11:18:43 zs93kj corosync[20562]: [SYNC ] Committing synchronization for corosync cluster closed process group service v1.01
> Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[0] group:crmd\x00, ip:r(0) ip(10.20.93.14) , pid:21491
> Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[1] group:attrd\x00, ip:r(0) ip(10.20.93.14) , pid:21489
> Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[2] group:stonith-ng\x00, ip:r(0) ip(10.20.93.14) , pid:21487
> Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[3] group:cib\x00, ip:r(0) ip(10.20.93.14) , pid:21486
> Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[4] group:pacemakerd\x00, ip:r(0) ip(10.20.93.14) , pid:21485
> Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[5] group:crmd\x00, ip:r(0) ip(10.20.93.12) , pid:24499
> Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[6] group:attrd\x00, ip:r(0) ip(10.20.93.12) , pid:24497
> Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[7] group:stonith-ng\x00, ip:r(0) ip(10.20.93.12) , pid:24495
> Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[8] group:cib\x00, ip:r(0) ip(10.20.93.12) , pid:24494
> Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[9] group:pacemakerd\x00, ip:r(0) ip(10.20.93.12) , pid:24491
> Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[10] group:crmd\x00, ip:r(0) ip(10.20.93.11) , pid:21402
> Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[11] group:attrd\x00, ip:r(0) ip(10.20.93.11) , pid:21400
> Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[12] group:stonith-ng\x00, ip:r(0) ip(10.20.93.11) , pid:21398
> Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[13] group:cib\x00, ip:r(0) ip(10.20.93.11) , pid:21397
> Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[14] group:pacemakerd\x00, ip:r(0) ip(10.20.93.11) , pid:21143
> Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] flags: quorate: No Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
> Aug 29 11:18:43 zs93kj corosync[20562]: [QB ] IPC credentials authenticated (20562-21400-28)
> Aug 29 11:18:43 zs93kj corosync[20562]: [QB ] connecting to client [21400]
> Aug 29 11:18:43 zs93kj corosync[20562]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168
> Aug 29 11:18:43 zs93kj corosync[20562]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168
> Aug 29 11:18:43 zs93kj pacemakerd[21143]: notice: Membership 772: quorum acquired (3)
>
> Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] quorum regained, resuming activity
> Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] got nodeinfo message from cluster node 3
> Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0
> Aug 29 11:18:43 zs93kj corosync[20562]: [SYNC ] Committing synchronization for corosync vote quorum service v1.0
> Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] total_votes=3, expected_votes=5
> Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] node 1 state=1, votes=1, expected=5
> Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] node 2 state=1, votes=1, expected=5
> Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] node 3 state=1, votes=1, expected=5
> Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] node 4 state=2, votes=1, expected=5
> Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] node 5 state=2, votes=1, expected=5
> Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] lowest node id: 1 us: 1
> Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] highest node id: 3 us: 1
> Aug 29 11:18:43 zs93kj corosync[20562]: [QUORUM] This node is within the primary component and will provide service.
> Aug 29 11:18:43 zs93kj pacemakerd[21143]: notice: pcmk_quorum_notification: Node zs95KLpcs1[3] - state is now member (was lost)
> Aug 29 11:18:43 zs93kj attrd[21400]: notice: crm_update_peer_proc: Node zs95KLpcs1[3] - state is now member (was (null))
> Aug 29 11:18:43 zs93kj corosync[20562]: [QUORUM] Members[3]: 1 2 3
> Aug 29 11:18:43 zs93kj stonith-ng[21398]: warning: Node names with capitals are discouraged, consider changing 'zs95KLpcs1' to something else
> Aug 29 11:18:43 zs93kj corosync[20562]: [MAIN ] Completed service synchronization, ready to provide service.
> Aug 29 11:18:43 zs93kj stonith-ng[21398]: notice: crm_update_peer_proc: Node zs95KLpcs1[3] - state is now member (was (null))
> Aug 29 11:18:43 zs93kj attrd[21400]: notice: crm_update_peer_proc: Node zs95kjpcs1[2] - state is now member (was (null))
>
> The story of zs95kjg110079 starts on ZS93KL, when it seemed to be already running on ZS93KJ -
>
> System log on zs93KLpcs1:
>
> Aug 29 11:20:58 zs93kl pengine[19997]: notice: Start zs95kjg110079_res#011(zs93KLpcs1)
>
> Aug 29 11:21:56 zs93kl crmd[20001]: notice: Initiating action 520: start zs95kjg110079_res_start_0 on zs93KLpcs1 (local)
>
> Aug 29 11:21:56 zs93kl systemd-machined: New machine qemu-70-zs95kjg110079.
> Aug 29 11:21:56 zs93kl systemd: Started Virtual Machine qemu-70-zs95kjg110079.
> Aug 29 11:21:56 zs93kl systemd: Starting Virtual Machine qemu-70-zs95kjg110079.
>
> Aug 29 11:21:59 zs93kl crmd[20001]: notice: Operation zs95kjg110079_res_start_0: ok (node=zs93KLpcs1, call=1036, rc=0, cib-update=735, confirmed=true)
>
> Aug 29 11:22:07 zs93kl crmd[20001]: warning: Action 238 (zs95kjg110079_res_monitor_0) on zs93kjpcs1 failed (target: 7 vs. rc: 0): Error
> Aug 29 11:22:07 zs93kl crmd[20001]: notice: Transition aborted by zs95kjg110079_res_monitor_0 'create' on zs93kjpcs1: Event failed (magic=0:0;238:13:7:236d078a-9063-4092-9660-cfae048f3627, cib=0.2437.3212, source=match_graph_event:381, 0)
>
> Aug 29 11:22:15 zs93kl pengine[19997]: error: Resource zs95kjg110079_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 11:22:15 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
> Aug 29 11:22:15 zs93kl pengine[19997]: notice: Restart zs95kjg110079_res#011(Started zs93kjpcs1)
>
> Aug 29 11:22:23 zs93kl pengine[19997]: error: Resource zs95kjg110079_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 11:22:23 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
> Aug 29 11:22:23 zs93kl pengine[19997]: notice: Restart zs95kjg110079_res#011(Started zs93kjpcs1)
>
> Aug 29 11:30:31 zs93kl pengine[19997]: error: Resource zs95kjg110079_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 11:30:31 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
> Aug 29 11:30:31 zs93kl pengine[19997]: error: Resource zs95kjg110108_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 11:30:31 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
>
> Aug 29 11:55:41 zs93kl pengine[19997]: error: Resource zs95kjg110079_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 11:55:41 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
> Aug 29 11:55:41 zs93kl pengine[19997]: error: Resource zs95kjg110108_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 11:55:41 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
> Aug 29 11:55:41 zs93kl pengine[19997]: error: Resource zs95kjg110186_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 11:55:41 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
>
> Aug 29 11:58:53 zs93kl pengine[19997]: error: Resource zs95kjg110079_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 11:58:53 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
> Aug 29 11:58:53 zs93kl pengine[19997]: error: Resource zs95kjg110108_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 11:58:53 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
> Aug 29 11:58:53 zs93kl pengine[19997]: error: Resource zs95kjg110186_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 11:58:53 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
> Aug 29 11:58:53 zs93kl pengine[19997]: error: Resource zs95kjg110188_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 11:58:53 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
>
> Aug 29 12:00:00 zs93kl pengine[19997]: error: Resource zs95kjg110079_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 12:00:00 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
> Aug 29 12:00:00 zs93kl pengine[19997]: error: Resource zs95kjg110108_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 12:00:00 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
> Aug 29 12:00:00 zs93kl pengine[19997]: error: Resource zs95kjg110186_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 12:00:00 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
> Aug 29 12:00:00 zs93kl pengine[19997]: error: Resource zs95kjg110188_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 12:00:00 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
> Aug 29 12:00:00 zs93kl pengine[19997]: error: Resource zs95kjg110198_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 12:00:00 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
>
> Aug 29 12:03:24 zs93kl pengine[19997]: error: Resource zs95kjg110079_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 12:03:24 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
> Aug 29 12:03:24 zs93kl pengine[19997]: error: Resource zs95kjg110108_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 12:03:24 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
> Aug 29 12:03:24 zs93kl pengine[19997]: error: Resource zs95kjg110186_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 12:03:24 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
> Aug 29 12:03:24 zs93kl pengine[19997]: error: Resource zs95kjg110188_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 12:03:24 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
> Aug 29 12:03:24 zs93kl pengine[19997]: error: Resource zs95kjg110198_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 12:03:24 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
> Aug 29 12:03:24 zs93kl pengine[19997]: notice: Restart zs95kjg110079_res#011(Started zs93kjpcs1)
>
> Aug 29 12:36:27 zs93kl pengine[19997]: error: Resource zs95kjg110079_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 12:36:27 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
> Aug 29 12:36:27 zs93kl pengine[19997]: error: Resource zs95kjg110108_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 12:36:27 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
> Aug 29 12:36:27 zs93kl pengine[19997]: error: Resource zs95kjg110186_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 12:36:27 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
> Aug 29 12:36:27 zs93kl pengine[19997]: error: Resource zs95kjg110188_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 12:36:27 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
> Aug 29 12:36:27 zs93kl pengine[19997]: error: Resource zs95kjg110198_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 12:36:27 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
> Aug 29 12:36:27 zs93kl pengine[19997]: error: Resource zs95kjg110210_res (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Aug 29 12:36:27 zs93kl pengine[19997]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
> Aug 29 12:36:27 zs93kl pengine[19997]: notice: Restart zs95kjg110079_res#011(Started zs93kjpcs1)
>
> Aug 29 12:44:41 zs93kl crmd[20001]: warning: Transition 84 (Complete=108, Pending=0, Fired=0, Skipped=0, Incomplete=77, Source=/var/lib/pacemaker/pengine/pe-error-106.bz2): Terminated
> Aug 29 12:44:41 zs93kl crmd[20001]: warning: Transition failed: terminated
> Aug 29 12:44:41 zs93kl crmd[20001]: notice: Graph 84 with 185 actions: batch-limit=185 jobs, network-delay=0ms
> Aug 29 12:44:41 zs93kl crmd[20001]: notice: [Action 410]: Pending rsc op zs95kjg110079_res_monitor_30000 on zs93kjpcs1 (priority: 0, waiting: 409)
> Aug 29 12:44:41 zs93kl crmd[20001]: notice: [Action 409]: Pending rsc op zs95kjg110079_res_start_0 on zs93kjpcs1 (priority: 0, waiting: 408)
> Aug 29 12:44:41 zs93kl crmd[20001]: notice: [Action 408]: Pending pseudo op zs95kjg110079_res_stop_0 on N/A (priority: 0, waiting: 439 470 496 521 546)
> Aug 29 12:44:41 zs93kl crmd[20001]: notice: [Action 407]: Completed pseudo op zs95kjg110079_res_stop_0 on N/A (priority: 0, waiting: none)
>
> Aug 29 12:59:42 zs93kl crmd[20001]: notice: Initiating action 428: stop zs95kjg110079_res_stop_0 on zs93kjpcs1
> Aug 29 12:59:42 zs93kl crmd[20001]: notice: Initiating action 495: stop zs95kjg110108_res_stop_0 on zs93kjpcs1
> Aug 29 12:59:44 zs93kl crmd[20001]: notice: Initiating action 660: stop zs95kjg110186_res_stop_0 on zs93kjpcs1
>
> Aug 29 13:00:04 zs93kl crmd[20001]: notice: [Action 431]: Pending rsc op zs95kjg110079_res_monitor_30000 on zs93kjpcs1 (priority: 0, waiting: 430)
> Aug 29 13:00:04 zs93kl crmd[20001]: notice: [Action 430]: Pending rsc op zs95kjg110079_res_start_0 on zs93kjpcs1 (priority: 0, waiting: 429)
> Aug 29 13:00:04 zs93kl crmd[20001]: notice: [Action 429]: Pending pseudo op zs95kjg110079_res_stop_0 on N/A (priority: 0, waiting: 460 491 517 542 567)
> Aug 29 13:00:04 zs93kl crmd[20001]: notice: [Action 428]: Completed rsc op zs95kjg110079_res_stop_0 on zs93kjpcs1 (priority: 0, waiting: none)
>
> System log on zs93kjpcs1:
>
> Aug 29 11:20:48 zs93kj crmd[21402]: notice: Recurring action zs95kjg110079_res:817 (zs95kjg110079_res_monitor_30000) incomplete at shutdown
>
> Aug 29 11:22:07 zs93kj crmd[259639]: notice: Operation zs95kjg110079_res_monitor_0: ok (node=zs93kjpcs1, call=1223, rc=0, cib-update=104, confirmed=true)
>
> Aug 29 12:59:42 zs93kj VirtualDomain(zs95kjg110079_res)[9148]: INFO: Issuing graceful shutdown request for domain zs95kjg110079.
>
> Finally, zs95kjg110079 shuts down on ZS93KJ at 12:59.
>
> ===================
>
> Does this "active on two nodes" recovery process look right?
>
> What is the recommended procedure to "undo" the resource failures and
> dual host assignments? It took several hours (short of stopping/starting
> the entire cluster) to recover them... resource disable, cleanup, enable
> was the basis... but it seemed that I would fix one resource and two more
> would fall out.
>
> This seems to be one of the pitfalls of configuring resources in
> symmetrical mode.
>
> I would appreciate any best practice guidelines you have to offer. I
> saved the system logs on all hosts in case anyone needs more detailed
> information. I also have the pacemaker.log files.
>
> Thanks in advance!
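(For readers hitting the same situation: the disable/cleanup/enable sequence Scott describes looks like this with pcs - a sketch only, using one of the affected resources as the example. Any stray copy of the domain still running on a node the cluster is no longer tracking has to be stopped by hand before re-enabling, otherwise the next probe finds it again:

    # stop managing/running the resource cluster-wide
    pcs resource disable zs95kjg110079_res

    # on the host still running an untracked copy of the guest
    virsh destroy zs95kjg110079

    # clear the resource's failed-operation history so its state is recomputed
    pcs resource cleanup zs95kjg110079_res

    # allow the cluster to start it again - on one node only this time
    pcs resource enable zs95kjg110079_res

Note that this only cleans up after the fact; preventing the dual-active state in the first place is exactly what fencing is for.)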
>
> Scott Greenlese ... IBM z/BX Solutions Test, Poughkeepsie, N.Y.
> INTERNET: swgre...@us.ibm.com
> PHONE: 8/293-7301 (845-433-7301) M/S: POK 42HA/P966
_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org