Re: [ClusterLabs] "VirtualDomain is active on 2 nodes" due to transient network failure

2016-09-09 Thread Ken Gaillot
On 09/09/2016 02:47 PM, Scott Greenlese wrote:
> Hi Ken ,
> 
> Below where you commented,
> 
> "It's considered good practice to stop
> pacemaker+corosync before rebooting a node intentionally (for even more
> safety, you can put the node into standby first)."
> 
> .. is this something that we document anywhere?

Not in any official documentation that I'm aware of; it's more a general
custom than a strong recommendation.

> Our 'reboot' action performs a halt (deactivate lpar) and then activate.
> Do I run the risk
> of guest instances running on multiple hosts in my case? I'm performing
> various recovery
> scenarios and want to avoid this procedure (reboot without first
> stopping cluster), if it's not supported.

By "intentionally" I mean via normal system administration, not fencing.
When fencing, it's always acceptable (and desirable) to do an immediate
cutoff, without any graceful stopping of anything.

When doing a graceful reboot/shutdown, the OS typically asks all running
processes to terminate, then waits a while for them to do so. There's
nothing really wrong with pacemaker still running at that point -- as
long as everything goes well.

If the OS gets impatient and terminates pacemaker before it finishes
stopping, the rest of the cluster will want to fence the node. Also, if
something goes wrong while resources are stopping, it might be harder to
troubleshoot if the whole system is shutting down at the same time. So,
stopping pacemaker first makes sure that all the resources stop cleanly,
and that the cluster will ignore the node.

Putting the node in standby first is not as important; I would say the main
benefit is that the node comes back up in standby when it rejoins, so you
have more control over when resources start being placed back on it. You can
bring up the node, start pacemaker, and make sure everything is good before
allowing resources back on it (especially helpful if you just upgraded
pacemaker or any of its dependencies, changed the host's network
configuration, etc.).
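
As a rough sketch of that whole sequence with the pcs syntax of this
vintage (pcs 0.9.x; newer pcs spells it "pcs node standby"), using one of
your node names purely as an example:

From any node, so resources stay off it when it rejoins:
    pcs cluster standby zs93kjpcs1
On the node itself, stop pacemaker+corosync cleanly, then reboot:
    pcs cluster stop
    reboot
Once it is back up and looks healthy:
    pcs cluster start
    pcs cluster unstandby zs93kjpcs1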

There shouldn't be any chance of multiple-active instances if fencing is
configured. Pacemaker shouldn't recover the resource elsewhere until it
confirms that either the resource stopped successfully on the node, or
the node was fenced.
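
If you want to sanity-check that, something like the following (again,
pcs 0.9.x syntax) should confirm fencing is configured and enabled -- an
empty stonith-enabled just means the default, which is true:

    pcs property show stonith-enabled
    pcs stonith show --full

Deliberately fencing a test node with "pcs stonith fence <node>" is the
usual way to prove the device really works before relying on it.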


> 
> By the way, I always put the node in cluster standby before an
> intentional reboot.
> 
> Thanks!
> 
> Scott Greenlese ... IBM Solutions Test, Poughkeepsie, N.Y.
> INTERNET: swgre...@us.ibm.com
> PHONE: 8/293-7301 (845-433-7301) M/S: POK 42HA/P966
> 
> 
> 
> From: Ken Gaillot 
> To: users@clusterlabs.org
> Date: 09/02/2016 10:01 AM
> Subject: Re: [ClusterLabs] "VirtualDomain is active on 2 nodes" due to
> transient network failure
> 
> 
> 
> 
> 
> On 09/01/2016 09:39 AM, Scott Greenlese wrote:
>> Andreas,
>>
>> You wrote:
>>
>> /"Would be good to see your full cluster configuration (corosync.conf
>> and cib) - but first guess is: no fencing at all  and what is your
>> "no-quorum-policy" in Pacemaker?/
>>
>> /Regards,/
>> /Andreas"/
>>
>> Thanks for your interest. I actually do have a stonith device configured
>> which maps all 5 cluster nodes in the cluster:
>>
>> [root@zs95kj ~]# date;pcs stonith show fence_S90HMC1
>> Thu Sep 1 10:11:25 EDT 2016
>> Resource: fence_S90HMC1 (class=stonith type=fence_ibmz)
>> Attributes: ipaddr=9.12.35.134 login=stonith passwd=lnx4ltic
>>
> pcmk_host_map=zs95KLpcs1:S95/KVL;zs93KLpcs1:S93/KVL;zs93kjpcs1:S93/KVJ;zs95kjpcs1:S95/KVJ;zs90kppcs1:S90/PACEMAKER
>> pcmk_host_list="zs95KLpcs1 zs93KLpcs1 zs93kjpcs1 zs95kjpcs1 zs90kppcs1"
>> pcmk_list_timeout=300 pcmk_off_timeout=600 pcmk_reboot_action=off
>> pcmk_reboot_timeout=600
>> Operations: monitor interval=60s (fence_S90HMC1-monitor-interval-60s)
>>
>> This fencing device works, too well actually. It seems extremely
>> sensitive to node "failures", and I'm not sure how to tune that. Stonith
>> reboot action is 'off', and the general stonith action (cluster config)
>> is also 'off'. In fact, often if I reboot a cluster node (i.e. reboot
>> command) that is an active member in the cluster... stonith will power
>> off that node while it's on its way back up. (perhaps requires a
>> separate issue thread on this forum?).
> 
> That depends on what a reboot does in your OS ... if it shuts down the
> cluster services cleanly, you shouldn't get a fence, but if it kills
> anything still running, then the cluster will see the node as failed,
> and fencing is appropriate. It's considered good practice to stop
> pacemaker+corosync before rebooting a node intentionally (for even more
> safety, you can put the node into standby first).
> 
>>
>> My no-quorum-policy is: no-quorum-policy: stop

Re: [ClusterLabs] "VirtualDomain is active on 2 nodes" due to transient network failure

2016-09-09 Thread Scott Greenlese

Hi Ken ,

Below where you commented,

"It's considered good practice to stop
pacemaker+corosync before rebooting a node intentionally (for even more
safety, you can put the node into standby first)."

.. is this something that we document anywhere?

Our 'reboot' action performs a halt (deactivate lpar) and then activate.
Do I run the risk
of guest instances running on multiple hosts in my case?  I'm performing
various recovery
scenarios and want to avoid this procedure (reboot without first stopping
cluster), if it's not supported.

By the way, I always put the node in cluster standby before an intentional
reboot.

Thanks!

Scott Greenlese ... IBM Solutions Test,  Poughkeepsie, N.Y.
  INTERNET:  swgre...@us.ibm.com
  PHONE:  8/293-7301 (845-433-7301)    M/S:  POK 42HA/P966




From:   Ken Gaillot 
To: users@clusterlabs.org
Date:   09/02/2016 10:01 AM
Subject:Re: [ClusterLabs] "VirtualDomain is active on 2 nodes" due to
transient network failure



On 09/01/2016 09:39 AM, Scott Greenlese wrote:
> Andreas,
>
> You wrote:
>
> /"Would be good to see your full cluster configuration (corosync.conf
> and cib) - but first guess is: no fencing at all  and what is your
> "no-quorum-policy" in Pacemaker?/
>
> /Regards,/
> /Andreas"/
>
> Thanks for your interest. I actually do have a stonith device configured
> which maps all 5 cluster nodes in the cluster:
>
> [root@zs95kj ~]# date;pcs stonith show fence_S90HMC1
> Thu Sep 1 10:11:25 EDT 2016
> Resource: fence_S90HMC1 (class=stonith type=fence_ibmz)
> Attributes: ipaddr=9.12.35.134 login=stonith passwd=lnx4ltic
>
pcmk_host_map=zs95KLpcs1:S95/KVL;zs93KLpcs1:S93/KVL;zs93kjpcs1:S93/KVJ;zs95kjpcs1:S95/KVJ;zs90kppcs1:S90/PACEMAKER

> pcmk_host_list="zs95KLpcs1 zs93KLpcs1 zs93kjpcs1 zs95kjpcs1 zs90kppcs1"
> pcmk_list_timeout=300 pcmk_off_timeout=600 pcmk_reboot_action=off
> pcmk_reboot_timeout=600
> Operations: monitor interval=60s (fence_S90HMC1-monitor-interval-60s)
>
> This fencing device works, too well actually. It seems extremely
> sensitive to node "failures", and I'm not sure how to tune that. Stonith
> reboot action is 'off', and the general stonith action (cluster config)
> is also 'off'. In fact, often if I reboot a cluster node (i.e. reboot
> command) that is an active member in the cluster... stonith will power
> off that node while it's on its way back up. (perhaps requires a
> separate issue thread on this forum?).

That depends on what a reboot does in your OS ... if it shuts down the
cluster services cleanly, you shouldn't get a fence, but if it kills
anything still running, then the cluster will see the node as failed,
and fencing is appropriate. It's considered good practice to stop
pacemaker+corosync before rebooting a node intentionally (for even more
safety, you can put the node into standby first).

>
> My no-quorum-policy is: no-quorum-policy: stop
>
> I don't think I should have lost quorum; only two of the five cluster
> nodes lost their corosync ring connection.

Those two nodes lost quorum, so they should have stopped all their
resources. And the three remaining nodes should have fenced them.

I'd check the logs around the time of the incident. Do the two affected
nodes detect the loss of quorum? Do they attempt to stop their
resources? Do those stops succeed? Do the other three nodes detect the
loss of the two nodes? Does the DC attempt to fence them? Do the fence
attempts succeed?
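
A rough way to dig for those answers from the logs (paths are assumptions
based on the corosync.conf below; pacemaker may instead log via syslog to
/var/log/messages or to /var/log/pacemaker.log depending on the build):

On zs93kjpcs1 and zs93KLpcs1, look for the quorum loss and the stop attempts:
    grep -i quorum /var/log/corosync/corosync.log
    grep -iE 'stop|error' /var/log/corosync/corosync.log
On the surviving nodes (especially the DC), look for fencing activity:
    grep -iE 'stonith|fence' /var/log/corosync/corosync.log /var/log/messages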

> Here's the full configuration:
>
>
> [root@zs95kj ~]# cat /etc/corosync/corosync.conf
> totem {
> version: 2
> secauth: off
> cluster_name: test_cluster_2
> transport: udpu
> }
>
> nodelist {
> node {
> ring0_addr: zs93kjpcs1
> nodeid: 1
> }
>
> node {
> ring0_addr: zs95kjpcs1
> nodeid: 2
> }
>
> node {
> ring0_addr: zs95KLpcs1
> nodeid: 3
> }
>
> node {
> ring0_addr: zs90kppcs1
> nodeid: 4
> }
>
> node {
> ring0_addr: zs93KLpcs1
> nodeid: 5
> }
> }
>
> quorum {
> provider: corosync_votequorum
> }
>
> logging {
> #Log to a specified file
> to_logfile: yes
> logfile: /var/log/corosync/corosync.log
> #Log timestamp as well
> timestamp: on
>
> #Facility in syslog
> syslog_facility: daemon
>
> logger_subsys {
> #Enable debug for this logger.
>
> debug: off
>
> #This specifies the subsystem identity (name) for which logging is
specified
>
> subsys: QUORUM
>
> }
> #Log to syslog
> to_syslog: yes
>
> #Whether or not turning on the debug information in the log
> debug: on
> }
> [root@zs95kj ~]#
>
>
>
> The full CIB (see attachment)
>
> [root@zs95kj ~]# pcs cluster cib > /tmp/scotts_cib_Sep1_2016.out
>
> /(See attached file: scotts_cib_Sep1_2016.out)/
>
>
> A few excerpts from the CIB:
>
> [root@zs95kj ~]# pcs cluster cib |less
>  num_updates="19" admin_epoch="0" cib-last-written="Wed Aug 31 15:59:31
> 2016" update-origin="zs93kjpcs1" update-client="crm_resource"
> update-user="root" have-quorum="1" dc-uuid="2">
> 
> 
> 
>  value="false"/>
>  value="1.1.13-10.el7_2.ibm.1-44eb2dd"/>
>  name="cluster-infrastructure" 

Re: [ClusterLabs] "VirtualDomain is active on 2 nodes" due to transient network failure

2016-09-02 Thread Ken Gaillot
On 09/01/2016 09:39 AM, Scott Greenlese wrote:
> Andreas,
> 
> You wrote:
> 
> /"Would be good to see your full cluster configuration (corosync.conf
> and cib) - but first guess is: no fencing at all  and what is your
> "no-quorum-policy" in Pacemaker?/
> 
> /Regards,/
> /Andreas"/
> 
> Thanks for your interest. I actually do have a stonith device configured
> which maps all 5 cluster nodes in the cluster:
> 
> [root@zs95kj ~]# date;pcs stonith show fence_S90HMC1
> Thu Sep 1 10:11:25 EDT 2016
> Resource: fence_S90HMC1 (class=stonith type=fence_ibmz)
> Attributes: ipaddr=9.12.35.134 login=stonith passwd=lnx4ltic
> pcmk_host_map=zs95KLpcs1:S95/KVL;zs93KLpcs1:S93/KVL;zs93kjpcs1:S93/KVJ;zs95kjpcs1:S95/KVJ;zs90kppcs1:S90/PACEMAKER
> pcmk_host_list="zs95KLpcs1 zs93KLpcs1 zs93kjpcs1 zs95kjpcs1 zs90kppcs1"
> pcmk_list_timeout=300 pcmk_off_timeout=600 pcmk_reboot_action=off
> pcmk_reboot_timeout=600
> Operations: monitor interval=60s (fence_S90HMC1-monitor-interval-60s)
> 
> This fencing device works, too well actually. It seems extremely
> sensitive to node "failures", and I'm not sure how to tune that. Stonith
> reboot action is 'off', and the general stonith action (cluster config)
> is also 'off'. In fact, often if I reboot a cluster node (i.e. reboot
> command) that is an active member in the cluster... stonith will power
> off that node while it's on its way back up. (perhaps requires a
> separate issue thread on this forum?).

That depends on what a reboot does in your OS ... if it shuts down the
cluster services cleanly, you shouldn't get a fence, but if it kills
anything still running, then the cluster will see the node as failed,
and fencing is appropriate. It's considered good practice to stop
pacemaker+corosync before rebooting a node intentionally (for even more
safety, you can put the node into standby first).

> 
> My no-quorum-policy is: no-quorum-policy: stop
> 
> I don't think I should have lost quorum; only two of the five cluster
> nodes lost their corosync ring connection.

Those two nodes lost quorum, so they should have stopped all their
resources. And the three remaining nodes should have fenced them.

I'd check the logs around the time of the incident. Do the two affected
nodes detect the loss of quorum? Do they attempt to stop their
resources? Do those stops succeed? Do the other three nodes detect the
loss of the two nodes? Does the DC attempt to fence them? Do the fence
attempts succeed?
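
If the DC did try to fence them, stonithd also keeps an in-memory fencing
history you can query directly (node names are just examples; the history
does not survive a restart of the fencing daemon):

    stonith_admin --history zs93kjpcs1
    stonith_admin --history zs93KLpcs1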

> Here's the full configuration:
> 
> 
> [root@zs95kj ~]# cat /etc/corosync/corosync.conf
> totem {
> version: 2
> secauth: off
> cluster_name: test_cluster_2
> transport: udpu
> }
> 
> nodelist {
> node {
> ring0_addr: zs93kjpcs1
> nodeid: 1
> }
> 
> node {
> ring0_addr: zs95kjpcs1
> nodeid: 2
> }
> 
> node {
> ring0_addr: zs95KLpcs1
> nodeid: 3
> }
> 
> node {
> ring0_addr: zs90kppcs1
> nodeid: 4
> }
> 
> node {
> ring0_addr: zs93KLpcs1
> nodeid: 5
> }
> }
> 
> quorum {
> provider: corosync_votequorum
> }
> 
> logging {
> #Log to a specified file
> to_logfile: yes
> logfile: /var/log/corosync/corosync.log
> #Log timestamp as well
> timestamp: on
> 
> #Facility in syslog
> syslog_facility: daemon
> 
> logger_subsys {
> #Enable debug for this logger.
> 
> debug: off
> 
> #This specifies the subsystem identity (name) for which logging is specified
> 
> subsys: QUORUM
> 
> }
> #Log to syslog
> to_syslog: yes
> 
> #Whether or not turning on the debug information in the log
> debug: on
> }
> [root@zs95kj ~]#
> 
> 
> 
> The full CIB (see attachment)
> 
> [root@zs95kj ~]# pcs cluster cib > /tmp/scotts_cib_Sep1_2016.out
> 
> /(See attached file: scotts_cib_Sep1_2016.out)/
> 
> 
> A few excerpts from the CIB:
> 
> [root@zs95kj ~]# pcs cluster cib |less
>  num_updates="19" admin_epoch="0" cib-last-written="Wed Aug 31 15:59:31
> 2016" update-origin="zs93kjpcs1" update-client="crm_resource"
> update-user="root" have-quorum="1" dc-uuid="2">
> 
> 
> 
>  value="false"/>
>  value="1.1.13-10.el7_2.ibm.1-44eb2dd"/>
>  name="cluster-infrastructure" value="corosync"/>
>  value="test_cluster_2"/>
>  name="no-quorum-policy" value="stop"/>
>  name="last-lrm-refresh" value="1472595716"/>
>  value="off"/>
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
>  type="VirtualDomain">
> 
>  value="/guestxml/nfs1/zs95kjg109062.xml"/>
>  name="hypervisor" value="qemu:///system"/>
>  name="migration_transport" value="ssh"/>
> 
> 
>  name="allow-migrate" value="true"/>
> 
> 
>  timeout="90"/>
>  timeout="90"/>
>  name="monitor"/>
>  name="migrate-from" timeout="1200"/>
> 
> 
> 
>  value="2048"/>
> 
> 
> 
> ( I OMITTED THE OTHER, SIMILAR 199 VIRTUALDOMAIN PRIMITIVE ENTRIES FOR
> THE SAKE OF SPACE, BUT IF THEY ARE OF
> INTEREST, I CAN ADD THEM)
> 
> .
> .
> .
> 
> 
> 
> 
>  operation="eq" value="container"/>
> 
> 
> 
> (I DEFINED THIS LOCATION CONSTRAINT RULE TO PREVENT OPAQUE GUEST VIRTUAL
> DOMAIN RESOUCES FROM BEING
> ASSIGNED TO REMOTE NODE VIRTUAL DOMAIN RESOURCES. I ALSO OMITTED THE
> NUMEROUS, SIMILAR ENTRIES BELOW).
> 

Re: [ClusterLabs] "VirtualDomain is active on 2 nodes" due to transient network failure

2016-08-30 Thread Andreas Kurz
Hi,

On Tue, Aug 30, 2016 at 10:03 PM, Scott Greenlese 
wrote:

> Added an appropriate subject line (was blank). Thanks...
>
>
> Scott Greenlese ... IBM z/BX Solutions Test, Poughkeepsie, N.Y.
> INTERNET: swgre...@us.ibm.com
> PHONE: 8/293-7301 (845-433-7301) M/S: POK 42HA/P966
>
> - Forwarded by Scott Greenlese/Poughkeepsie/IBM on 08/30/2016 03:59 PM
> -
>
> From: Scott Greenlese/Poughkeepsie/IBM@IBMUS
> To: Cluster Labs - All topics related to open-source clustering welcomed <
> users@clusterlabs.org>
> Date: 08/29/2016 06:36 PM
> Subject: [ClusterLabs] (no subject)
> --
>
>
>
> Hi folks,
>
> I'm assigned to system test Pacemaker/Corosync on the KVM on System Z
> platform
> with pacemaker-1.1.13-10 and corosync-2.3.4-7 .
>

Would be good to see your full cluster configuration (corosync.conf and
cib) - but first guess is: no fencing at all  and what is your
"no-quorum-policy" in Pacemaker?

Regards,
Andreas


>
> I have a cluster with 5 KVM hosts, and a total of 200
> ocf:heartbeat:VirtualDomain resources defined to run
> across the 5 cluster nodes (symmetrical is true for this cluster).
>
> The heartbeat network is communicating over vlan1293, which is hung off a
> network device, 0230.
>
> In general, pacemaker does a good job of distributing my virtual guest
> resources evenly across the hypervisors
> in the cluster. These resources are a mixed bag:
>
> - "opaque" and remote "guest nodes" managed by the cluster.
> - allow-migrate=false and allow-migrate=true
> - qcow2 (file based) guests and LUN based guests
> - Sles and Ubuntu OS
>
> [root@zs95kj ]# pcs status |less
> Cluster name: test_cluster_2
> Last updated: Mon Aug 29 17:02:08 2016 Last change: Mon Aug 29 16:37:31
> 2016 by root via crm_resource on zs93kjpcs1
> Stack: corosync
> Current DC: zs95kjpcs1 (version 1.1.13-10.el7_2.ibm.1-44eb2dd) - partition
> with quorum
> 103 nodes and 300 resources configured
>
> Node zs90kppcs1: standby
> Online: [ zs93KLpcs1 zs93kjpcs1 zs95KLpcs1 zs95kjpcs1 ]
>
> This morning, our system admin team performed a "non-disruptive"
> (concurrent) microcode load on the OSA, which
> (to our surprise) dropped the network connection for 13 seconds on the S93
> CEC, from 11:18:34am to 11:18:47am , to be exact.
> This temporary outage caused the two cluster nodes on S93 (zs93kjpcs1 and
> zs93KLpcs1) to drop out of the cluster,
> as expected.
>
> However, pacemaker didn't handle this too well. The end result was
> numerous VirtualDomain resources in FAILED state:
>
> [root@zs95kj log]# date;pcs status |grep VirtualD |grep zs93 |grep FAILED
> Mon Aug 29 12:33:32 EDT 2016
> zs95kjg110104_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110092_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110099_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110102_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110106_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110112_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110115_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110118_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110124_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110127_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110130_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110136_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110139_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110142_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110148_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110152_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110155_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110161_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110164_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110167_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110173_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110176_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110179_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110185_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg109106_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
>
>
> As well as several VirtualDomain resources showing "Started" on two
> cluster nodes:
>
> zs95kjg110079_res (ocf::heartbeat:VirtualDomain): Started[ zs93kjpcs1
> zs93KLpcs1 ]
> zs95kjg110108_res (ocf::heartbeat:VirtualDomain): Started[ zs93kjpcs1
> zs93KLpcs1 ]
> zs95kjg110186_res (ocf::heartbeat:VirtualDomain): Started[ zs93kjpcs1
> zs93KLpcs1 ]
> zs95kjg110188_res (ocf::heartbeat:VirtualDomain): Started[ zs93kjpcs1
> zs93KLpcs1 ]
> zs95kjg110198_res (ocf::heartbeat:VirtualDomain): Started[ zs93kjpcs1
> zs93KLpcs1 ]
>
>
> The virtual machines themselves were in fact, 

[ClusterLabs] "VirtualDomain is active on 2 nodes" due to transient network failure

2016-08-30 Thread Scott Greenlese


Added an appropriate subject line (was blank).  Thanks...


Scott Greenlese ... IBM z/BX Solutions Test,  Poughkeepsie, N.Y.
  INTERNET:  swgre...@us.ibm.com
  PHONE:  8/293-7301 (845-433-7301)    M/S:  POK 42HA/P966

- Forwarded by Scott Greenlese/Poughkeepsie/IBM on 08/30/2016 03:59 PM
-

From:   Scott Greenlese/Poughkeepsie/IBM@IBMUS
To: Cluster Labs - All topics related to open-source clustering
welcomed 
Date:   08/29/2016 06:36 PM
Subject:[ClusterLabs] (no subject)



Hi folks,

I'm assigned to system test Pacemaker/Corosync on the KVM on System Z
platform
with pacemaker-1.1.13-10 and corosync-2.3.4-7 .

I have a cluster with 5 KVM hosts, and a total of 200
ocf:heartbeat:VirtualDomain resources defined to run
across the 5 cluster nodes (symmetrical is true for this cluster).

The heartbeat network is communicating over vlan1293, which is hung off a
network device, 0230.

In general, pacemaker does a good job of distributing my virtual guest
resources evenly across the hypervisors
in the cluster. These resources are a mixed bag:

- "opaque" and remote "guest nodes" managed by the cluster.
- allow-migrate=false and allow-migrate=true
- qcow2 (file based) guests and LUN based guests
- Sles and Ubuntu OS

[root@zs95kj ]# pcs status |less
Cluster name: test_cluster_2
Last updated: Mon Aug 29 17:02:08 2016 Last change: Mon Aug 29 16:37:31
2016 by root via crm_resource on zs93kjpcs1
Stack: corosync
Current DC: zs95kjpcs1 (version 1.1.13-10.el7_2.ibm.1-44eb2dd) - partition
with quorum
103 nodes and 300 resources configured

Node zs90kppcs1: standby
Online: [ zs93KLpcs1 zs93kjpcs1 zs95KLpcs1 zs95kjpcs1 ]

This morning, our system admin team performed a
"non-disruptive" (concurrent) microcode code load on the OSA, which
(to our surprise) dropped the network connection for 13 seconds on the S93
CEC, from 11:18:34am to 11:18:47am , to be exact.
This temporary outage caused the two cluster nodes on S93 (zs93kjpcs1 and
zs93KLpcs1) to drop out of the cluster,
as expected.

However, pacemaker didn't handle this too well. The end result was numerous
VirtualDomain resources in FAILED state:

[root@zs95kj log]# date;pcs status |grep VirtualD |grep zs93 |grep FAILED
Mon Aug 29 12:33:32 EDT 2016
zs95kjg110104_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
zs95kjg110092_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
zs95kjg110099_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
zs95kjg110102_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
zs95kjg110106_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
zs95kjg110112_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
zs95kjg110115_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
zs95kjg110118_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
zs95kjg110124_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
zs95kjg110127_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
zs95kjg110130_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
zs95kjg110136_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
zs95kjg110139_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
zs95kjg110142_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
zs95kjg110148_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
zs95kjg110152_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
zs95kjg110155_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
zs95kjg110161_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
zs95kjg110164_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
zs95kjg110167_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
zs95kjg110173_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
zs95kjg110176_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
zs95kjg110179_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
zs95kjg110185_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
zs95kjg109106_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1


As well as several VirtualDomain resources showing "Started" on two
cluster nodes:

zs95kjg110079_res (ocf::heartbeat:VirtualDomain): Started[ zs93kjpcs1
zs93KLpcs1 ]
zs95kjg110108_res (ocf::heartbeat:VirtualDomain): Started[ zs93kjpcs1
zs93KLpcs1 ]
zs95kjg110186_res (ocf::heartbeat:VirtualDomain): Started[ zs93kjpcs1
zs93KLpcs1 ]
zs95kjg110188_res (ocf::heartbeat:VirtualDomain): Started[ zs93kjpcs1
zs93KLpcs1 ]
zs95kjg110198_res (ocf::heartbeat:VirtualDomain): Started[ zs93kjpcs1
zs93KLpcs1 ]


The virtual machines themselves were, in fact, "running" on both hosts. For
example:

[root@zs93kl ~]# virsh list |grep zs95kjg110079
70 zs95kjg110079 running

[root@zs93kj cli]# virsh list |grep zs95kjg110079
18 zs95kjg110079 running


On this particular VM, there was file corruption of this file-based qcow2
guest's image, such that you could not ping or ssh it,
and if you opened a virsh console, you got an "initramfs" prompt.

To recover, we had to mount the volume on another VM and then run fsck to
repair it.
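
For what it's worth, a host-side alternative to attaching the disk to
another VM is to export the qcow2 image with qemu-nbd and fsck it there.
A minimal sketch -- the image path is hypothetical, and the guest must be
shut down on every host first:

    modprobe nbd max_part=8
    qemu-nbd --connect=/dev/nbd0 /guestxml/nfs1/zs95kjg110079.qcow2
    fsck -y /dev/nbd0p1      (partition number depends on the guest's layout)
    qemu-nbd --disconnect /dev/nbd0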

I walked