Re: [ClusterLabs] "VirtualDomain is active on 2 nodes" due to transient network failure
On 09/09/2016 02:47 PM, Scott Greenlese wrote:
> Hi Ken,
>
> Below where you commented,
>
> "It's considered good practice to stop pacemaker+corosync before
> rebooting a node intentionally (for even more safety, you can put the
> node into standby first)."
>
> .. is this something that we document anywhere?

Not in any official documentation that I'm aware of; it's more a general
custom than a strong recommendation.

> Our 'reboot' action performs a halt (deactivate lpar) and then
> activate. Do I run the risk of guest instances running on multiple
> hosts in my case? I'm performing various recovery scenarios and want
> to avoid this procedure (reboot without first stopping the cluster),
> if it's not supported.

By "intentionally" I mean via normal system administration, not fencing.
When fencing, it's always acceptable (and desirable) to do an immediate
cutoff, without gracefully stopping anything.

When doing a graceful reboot/shutdown, the OS typically asks all running
processes to terminate, then waits a while for them to do so. There's
nothing really wrong with pacemaker running at that point -- as long as
everything goes well. If the OS gets impatient and terminates pacemaker
before it finishes stopping, the rest of the cluster will want to fence
the node. Also, if something goes wrong while resources are stopping, it
may be harder to troubleshoot if the whole system is shutting down at
the same time.

So, stopping pacemaker first makes sure that all the resources stop
cleanly, and that the cluster will ignore the node. Putting the node in
standby is not as important; I would say the main benefit is that the
node comes back up in standby when it rejoins, so you have more control
over when resources start being placed back on it.
You can bring up the node and start pacemaker, and make sure everything
is good before allowing resources back on it (especially helpful if you
just upgraded pacemaker or any of its dependencies, changed the host's
network configuration, etc.).

There shouldn't be any chance of multiple-active instances if fencing is
configured. Pacemaker shouldn't recover the resource elsewhere until it
confirms that either the resource stopped successfully on the node, or
the node was fenced.

> By the way, I always put the node in cluster standby before an
> intentional reboot.
>
> Thanks!
>
> Scott Greenlese ... IBM Solutions Test, Poughkeepsie, N.Y.
> INTERNET: swgre...@us.ibm.com
> PHONE: 8/293-7301 (845-433-7301)  M/S: POK 42HA/P966
>
> From: Ken Gaillot
> To: users@clusterlabs.org
> Date: 09/02/2016 10:01 AM
> Subject: Re: [ClusterLabs] "VirtualDomain is active on 2 nodes" due to
> transient network failure
>
> On 09/01/2016 09:39 AM, Scott Greenlese wrote:
>> Andreas,
>>
>> You wrote:
>>
>> "Would be good to see your full cluster configuration (corosync.conf
>> and cib) - but first guess is: no fencing at all and what is your
>> "no-quorum-policy" in Pacemaker?
>>
>> Regards,
>> Andreas"
>>
>> Thanks for your interest.
>> I actually do have a stonith device configured which maps all 5
>> cluster nodes in the cluster:
>>
>> [root@zs95kj ~]# date; pcs stonith show fence_S90HMC1
>> Thu Sep 1 10:11:25 EDT 2016
>> Resource: fence_S90HMC1 (class=stonith type=fence_ibmz)
>>  Attributes: ipaddr=9.12.35.134 login=stonith passwd=lnx4ltic
>>   pcmk_host_map=zs95KLpcs1:S95/KVL;zs93KLpcs1:S93/KVL;zs93kjpcs1:S93/KVJ;zs95kjpcs1:S95/KVJ;zs90kppcs1:S90/PACEMAKER
>>   pcmk_host_list="zs95KLpcs1 zs93KLpcs1 zs93kjpcs1 zs95kjpcs1 zs90kppcs1"
>>   pcmk_list_timeout=300 pcmk_off_timeout=600 pcmk_reboot_action=off
>>   pcmk_reboot_timeout=600
>>  Operations: monitor interval=60s (fence_S90HMC1-monitor-interval-60s)
>>
>> This fencing device works, too well actually. It seems extremely
>> sensitive to node "failures", and I'm not sure how to tune that.
>> Stonith reboot action is 'off', and the general stonith action
>> (cluster config) is also 'off'. In fact, often if I reboot a cluster
>> node (i.e. reboot command) that is an active member in the cluster,
>> stonith will power off that node while it's on its way back up.
>> (Perhaps this requires a separate thread on this forum?)
>
> That depends on what a reboot does in your OS ... if it shuts down the
> cluster services cleanly, you shouldn't get a fence, but if it kills
> anything still running, then the cluster will see the node as failed,
> and fencing is appropriate. It's considered good practice to stop
> pacemaker+corosync before rebooting a node intentionally (for even
> more safety, you can put the node into standby first).
>
>> My no-quorum-policy is: no-quorum-policy: stop
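[For reference, the planned-reboot sequence discussed above can be
sketched with pcs commands. This is a minimal sketch, not an official
procedure; the node name zs93kjpcs1 is just an example, substitute your
own:]

```shell
# Optional but recommended: put the node in standby so its resources
# migrate off first, and so it comes back up in standby for more
# control over when resources return.
pcs cluster standby zs93kjpcs1

# Stop pacemaker and corosync gracefully, so all resources stop cleanly
# and the rest of the cluster ignores the node instead of fencing it.
pcs cluster stop zs93kjpcs1

# Now the OS reboot cannot race against running cluster resources.
reboot

# After the node is back: rejoin, verify health, then allow resources
# to be placed on it again.
pcs cluster start zs93kjpcs1
pcs status
pcs cluster unstandby zs93kjpcs1
```

[These commands only make sense on a live cluster node, so treat them as
an illustration of the ordering, not a copy-paste script.]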
Re: [ClusterLabs] "VirtualDomain is active on 2 nodes" due to transient network failure
Hi Ken,

Below where you commented,

"It's considered good practice to stop pacemaker+corosync before
rebooting a node intentionally (for even more safety, you can put the
node into standby first)."

.. is this something that we document anywhere? Our 'reboot' action
performs a halt (deactivate lpar) and then activate. Do I run the risk
of guest instances running on multiple hosts in my case? I'm performing
various recovery scenarios and want to avoid this procedure (reboot
without first stopping the cluster), if it's not supported.

By the way, I always put the node in cluster standby before an
intentional reboot.

Thanks!

Scott Greenlese ... IBM Solutions Test, Poughkeepsie, N.Y.
INTERNET: swgre...@us.ibm.com
PHONE: 8/293-7301 (845-433-7301)  M/S: POK 42HA/P966

From: Ken Gaillot
To: users@clusterlabs.org
Date: 09/02/2016 10:01 AM
Subject: Re: [ClusterLabs] "VirtualDomain is active on 2 nodes" due to
transient network failure

On 09/01/2016 09:39 AM, Scott Greenlese wrote:
> Andreas,
>
> You wrote:
>
> "Would be good to see your full cluster configuration (corosync.conf
> and cib) - but first guess is: no fencing at all and what is your
> "no-quorum-policy" in Pacemaker?
>
> Regards,
> Andreas"
>
> Thanks for your interest. I actually do have a stonith device
> configured which maps all 5 cluster nodes in the cluster:
>
> [root@zs95kj ~]# date; pcs stonith show fence_S90HMC1
> Thu Sep 1 10:11:25 EDT 2016
> Resource: fence_S90HMC1 (class=stonith type=fence_ibmz)
>  Attributes: ipaddr=9.12.35.134 login=stonith passwd=lnx4ltic
>   pcmk_host_map=zs95KLpcs1:S95/KVL;zs93KLpcs1:S93/KVL;zs93kjpcs1:S93/KVJ;zs95kjpcs1:S95/KVJ;zs90kppcs1:S90/PACEMAKER
>   pcmk_host_list="zs95KLpcs1 zs93KLpcs1 zs93kjpcs1 zs95kjpcs1 zs90kppcs1"
>   pcmk_list_timeout=300 pcmk_off_timeout=600 pcmk_reboot_action=off
>   pcmk_reboot_timeout=600
>  Operations: monitor interval=60s (fence_S90HMC1-monitor-interval-60s)
>
> This fencing device works, too well actually.
> It seems extremely sensitive to node "failures", and I'm not sure how
> to tune that. Stonith reboot action is 'off', and the general stonith
> action (cluster config) is also 'off'. In fact, often if I reboot a
> cluster node (i.e. reboot command) that is an active member in the
> cluster, stonith will power off that node while it's on its way back
> up. (Perhaps this requires a separate thread on this forum?)

That depends on what a reboot does in your OS ... if it shuts down the
cluster services cleanly, you shouldn't get a fence, but if it kills
anything still running, then the cluster will see the node as failed,
and fencing is appropriate. It's considered good practice to stop
pacemaker+corosync before rebooting a node intentionally (for even more
safety, you can put the node into standby first).

> My no-quorum-policy is: no-quorum-policy: stop
>
> I don't think I should have lost quorum, only two of the five cluster
> nodes lost their corosync ring connection.

Those two nodes lost quorum, so they should have stopped all their
resources. And the three remaining nodes should have fenced them.

I'd check the logs around the time of the incident. Do the two affected
nodes detect the loss of quorum? Do they attempt to stop their
resources? Do those stops succeed? Do the other three nodes detect the
loss of the two nodes? Does the DC attempt to fence them? Do the fence
attempts succeed?
> Here's the full configuration:
>
> [root@zs95kj ~]# cat /etc/corosync/corosync.conf
> totem {
>     version: 2
>     secauth: off
>     cluster_name: test_cluster_2
>     transport: udpu
> }
>
> nodelist {
>     node {
>         ring0_addr: zs93kjpcs1
>         nodeid: 1
>     }
>
>     node {
>         ring0_addr: zs95kjpcs1
>         nodeid: 2
>     }
>
>     node {
>         ring0_addr: zs95KLpcs1
>         nodeid: 3
>     }
>
>     node {
>         ring0_addr: zs90kppcs1
>         nodeid: 4
>     }
>
>     node {
>         ring0_addr: zs93KLpcs1
>         nodeid: 5
>     }
> }
>
> quorum {
>     provider: corosync_votequorum
> }
>
> logging {
>     # Log to a specified file
>     to_logfile: yes
>     logfile: /var/log/corosync/corosync.log
>     # Log timestamp as well
>     timestamp: on
>
>     # Facility in syslog
>     syslog_facility: daemon
>
>     logger_subsys {
>         # Enable debug for this logger.
>         debug: off
>
>         # This specifies the subsystem identity (name) for which
>         # logging is specified
>         subsys: QUORUM
>     }
>
>     # Log to syslog
>     to_syslog: yes
>
>     # Whether or not turning on the debug information in the log
>     debug: on
> }
> [root@zs95kj ~]#
>
> The full CIB (see attachment)
>
> [root@zs95kj ~]# pcs cluster cib > /tmp/scotts_cib_Sep1_2016.out
>
> (See attached file: scotts_cib_Sep1_2016.out)
>
> A few excerpts from the CIB:
>
> [root@zs95kj ~]# pcs cluster cib | less
> num_updates="19" admin_epoch="0" cib-last-written="Wed Aug 31 15:59:31
> 2016" update-origin="zs93kjpcs1" update-client="crm_resource"
> update-user="root" have-quorum="1" dc-uuid="2">
> value="false"/>
> value="1.1.13-10.el7_2.ibm.1-44eb2dd"/>
> name="cluster-infrastructure"
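[One tuning knob worth mentioning for the original symptom -- a ~13
second network blip declaring nodes dead -- is corosync's totem token
timeout, which controls how long a silent node is tolerated before the
membership drops it. The fragment below is an illustrative sketch only,
not a recommendation: the 15000 ms value is an assumption chosen to
ride out a blip of that length, and a longer token also delays
detection of real failures (and therefore fencing and recovery):]

```
totem {
    version: 2
    secauth: off
    cluster_name: test_cluster_2
    transport: udpu

    # Illustrative: tolerate ~15s of token silence before declaring a
    # node failed. Trade-off: genuine node failures are detected, and
    # fencing begun, this much later.
    token: 15000
}
```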
Re: [ClusterLabs] "VirtualDomain is active on 2 nodes" due to transient network failure
On 09/01/2016 09:39 AM, Scott Greenlese wrote:
> Andreas,
>
> You wrote:
>
> "Would be good to see your full cluster configuration (corosync.conf
> and cib) - but first guess is: no fencing at all and what is your
> "no-quorum-policy" in Pacemaker?
>
> Regards,
> Andreas"
>
> Thanks for your interest. I actually do have a stonith device
> configured which maps all 5 cluster nodes in the cluster:
>
> [root@zs95kj ~]# date; pcs stonith show fence_S90HMC1
> Thu Sep 1 10:11:25 EDT 2016
> Resource: fence_S90HMC1 (class=stonith type=fence_ibmz)
>  Attributes: ipaddr=9.12.35.134 login=stonith passwd=lnx4ltic
>   pcmk_host_map=zs95KLpcs1:S95/KVL;zs93KLpcs1:S93/KVL;zs93kjpcs1:S93/KVJ;zs95kjpcs1:S95/KVJ;zs90kppcs1:S90/PACEMAKER
>   pcmk_host_list="zs95KLpcs1 zs93KLpcs1 zs93kjpcs1 zs95kjpcs1 zs90kppcs1"
>   pcmk_list_timeout=300 pcmk_off_timeout=600 pcmk_reboot_action=off
>   pcmk_reboot_timeout=600
>  Operations: monitor interval=60s (fence_S90HMC1-monitor-interval-60s)
>
> This fencing device works, too well actually. It seems extremely
> sensitive to node "failures", and I'm not sure how to tune that.
> Stonith reboot action is 'off', and the general stonith action
> (cluster config) is also 'off'. In fact, often if I reboot a cluster
> node (i.e. reboot command) that is an active member in the cluster,
> stonith will power off that node while it's on its way back up.
> (Perhaps this requires a separate thread on this forum?)

That depends on what a reboot does in your OS ... if it shuts down the
cluster services cleanly, you shouldn't get a fence, but if it kills
anything still running, then the cluster will see the node as failed,
and fencing is appropriate. It's considered good practice to stop
pacemaker+corosync before rebooting a node intentionally (for even more
safety, you can put the node into standby first).
> My no-quorum-policy is: no-quorum-policy: stop
>
> I don't think I should have lost quorum, only two of the five cluster
> nodes lost their corosync ring connection.

Those two nodes lost quorum, so they should have stopped all their
resources. And the three remaining nodes should have fenced them.

I'd check the logs around the time of the incident. Do the two affected
nodes detect the loss of quorum? Do they attempt to stop their
resources? Do those stops succeed? Do the other three nodes detect the
loss of the two nodes? Does the DC attempt to fence them? Do the fence
attempts succeed?

> Here's the full configuration:
>
> [root@zs95kj ~]# cat /etc/corosync/corosync.conf
> totem {
>     version: 2
>     secauth: off
>     cluster_name: test_cluster_2
>     transport: udpu
> }
>
> nodelist {
>     node {
>         ring0_addr: zs93kjpcs1
>         nodeid: 1
>     }
>
>     node {
>         ring0_addr: zs95kjpcs1
>         nodeid: 2
>     }
>
>     node {
>         ring0_addr: zs95KLpcs1
>         nodeid: 3
>     }
>
>     node {
>         ring0_addr: zs90kppcs1
>         nodeid: 4
>     }
>
>     node {
>         ring0_addr: zs93KLpcs1
>         nodeid: 5
>     }
> }
>
> quorum {
>     provider: corosync_votequorum
> }
>
> logging {
>     # Log to a specified file
>     to_logfile: yes
>     logfile: /var/log/corosync/corosync.log
>     # Log timestamp as well
>     timestamp: on
>
>     # Facility in syslog
>     syslog_facility: daemon
>
>     logger_subsys {
>         # Enable debug for this logger.
>         debug: off
>
>         # This specifies the subsystem identity (name) for which
>         # logging is specified
>         subsys: QUORUM
>     }
>
>     # Log to syslog
>     to_syslog: yes
>
>     # Whether or not turning on the debug information in the log
>     debug: on
> }
> [root@zs95kj ~]#
>
> The full CIB (see attachment)
>
> [root@zs95kj ~]# pcs cluster cib > /tmp/scotts_cib_Sep1_2016.out
>
> (See attached file: scotts_cib_Sep1_2016.out)
>
> A few excerpts from the CIB:
>
> [root@zs95kj ~]# pcs cluster cib | less
> num_updates="19" admin_epoch="0" cib-last-written="Wed Aug 31 15:59:31
> 2016" update-origin="zs93kjpcs1" update-client="crm_resource"
> update-user="root" have-quorum="1" dc-uuid="2">
> value="false"/>
> value="1.1.13-10.el7_2.ibm.1-44eb2dd"/>
> name="cluster-infrastructure" value="corosync"/>
> value="test_cluster_2"/>
> name="no-quorum-policy" value="stop"/>
> name="last-lrm-refresh" value="1472595716"/>
> value="off"/>
>
> type="VirtualDomain">
> value="/guestxml/nfs1/zs95kjg109062.xml"/>
> name="hypervisor" value="qemu:///system"/>
> name="migration_transport" value="ssh"/>
> name="allow-migrate" value="true"/>
> timeout="90"/>
> timeout="90"/>
> name="monitor"/>
> name="migrate-from" timeout="1200"/>
> value="2048"/>
>
> (I OMITTED THE OTHER, SIMILAR 199 VIRTUALDOMAIN PRIMITIVE ENTRIES FOR
> THE SAKE OF SPACE, BUT IF THEY ARE OF INTEREST, I CAN ADD THEM)
>
> .
> .
> .
>
> operation="eq" value="container"/>
>
> (I DEFINED THIS LOCATION CONSTRAINT RULE TO PREVENT OPAQUE GUEST
> VIRTUAL DOMAIN RESOURCES FROM BEING ASSIGNED TO REMOTE NODE VIRTUAL
> DOMAIN RESOURCES. I ALSO OMITTED THE NUMEROUS, SIMILAR ENTRIES BELOW.)
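[A sketch of one way to work through the log questions above, given the
corosync.conf here logs to /var/log/corosync/corosync.log; the grep
patterns are deliberately broad since exact message wording varies
between pacemaker/corosync versions:]

```shell
# On each of the two affected nodes: did they see the quorum loss, and
# did they attempt (and complete) resource stops?
grep -Ei 'quorum|stop' /var/log/corosync/corosync.log | less

# On the three surviving nodes, especially the DC: did they notice the
# membership change, and was fencing attempted and did it succeed?
grep -Ei 'membership|fence|stonith' /var/log/corosync/corosync.log | less
```

[Narrowing each grep to the incident's timestamp window first makes the
output far more manageable.]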
Re: [ClusterLabs] "VirtualDomain is active on 2 nodes" due to transient network failure
Hi,

On Tue, Aug 30, 2016 at 10:03 PM, Scott Greenlese wrote:
> Added an appropriate subject line (was blank). Thanks...
>
> Scott Greenlese ... IBM z/BX Solutions Test, Poughkeepsie, N.Y.
> INTERNET: swgre...@us.ibm.com
> PHONE: 8/293-7301 (845-433-7301)  M/S: POK 42HA/P966
>
> - Forwarded by Scott Greenlese/Poughkeepsie/IBM on 08/30/2016 03:59 PM -
>
> From: Scott Greenlese/Poughkeepsie/IBM@IBMUS
> To: Cluster Labs - All topics related to open-source clustering
> welcomed <users@clusterlabs.org>
> Date: 08/29/2016 06:36 PM
> Subject: [ClusterLabs] (no subject)
>
> Hi folks,
>
> I'm assigned to system test Pacemaker/Corosync on the KVM on System Z
> platform with pacemaker-1.1.13-10 and corosync-2.3.4-7.

Would be good to see your full cluster configuration (corosync.conf and
cib) - but first guess is: no fencing at all and what is your
"no-quorum-policy" in Pacemaker?

Regards,
Andreas

> I have a cluster with 5 KVM hosts, and a total of 200
> ocf:heartbeat:VirtualDomain resources defined to run across the 5
> cluster nodes (symmetrical is true for this cluster).
>
> The heartbeat network is communicating over vlan1293, which is hung
> off a network device, 0230.
>
> In general, pacemaker does a good job of distributing my virtual guest
> resources evenly across the hypervisors in the cluster. These
> resources are a mixed bag:
>
> - "opaque" and remote "guest nodes" managed by the cluster.
> - allow-migrate=false and allow-migrate=true
> - qcow2 (file based) guests and LUN based guests
> - Sles and Ubuntu OS
>
> [root@zs95kj ]# pcs status | less
> Cluster name: test_cluster_2
> Last updated: Mon Aug 29 17:02:08 2016
> Last change: Mon Aug 29 16:37:31 2016 by root via crm_resource on zs93kjpcs1
> Stack: corosync
> Current DC: zs95kjpcs1 (version 1.1.13-10.el7_2.ibm.1-44eb2dd) -
> partition with quorum
> 103 nodes and 300 resources configured
>
> Node zs90kppcs1: standby
> Online: [ zs93KLpcs1 zs93kjpcs1 zs95KLpcs1 zs95kjpcs1 ]
>
> This morning, our system admin team performed a "non-disruptive"
> (concurrent) microcode load on the OSA, which (to our surprise)
> dropped the network connection for 13 seconds on the S93 CEC, from
> 11:18:34am to 11:18:47am, to be exact. This temporary outage caused
> the two cluster nodes on S93 (zs93kjpcs1 and zs93KLpcs1) to drop out
> of the cluster, as expected.
>
> However, pacemaker didn't handle this too well. The end result was
> numerous VirtualDomain resources in FAILED state:
>
> [root@zs95kj log]# date; pcs status | grep VirtualD | grep zs93 | grep FAILED
> Mon Aug 29 12:33:32 EDT 2016
> zs95kjg110104_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110092_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110099_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110102_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110106_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110112_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110115_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110118_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110124_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110127_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110130_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110136_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110139_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110142_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110148_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110152_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110155_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110161_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110164_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110167_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110173_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110176_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110179_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg110185_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
> zs95kjg109106_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
>
> As well as, several VirtualDomain resources showing "Started" on two
> cluster nodes:
>
> zs95kjg110079_res (ocf::heartbeat:VirtualDomain): Started[ zs93kjpcs1 zs93KLpcs1 ]
> zs95kjg110108_res (ocf::heartbeat:VirtualDomain): Started[ zs93kjpcs1 zs93KLpcs1 ]
> zs95kjg110186_res (ocf::heartbeat:VirtualDomain): Started[ zs93kjpcs1 zs93KLpcs1 ]
> zs95kjg110188_res (ocf::heartbeat:VirtualDomain): Started[ zs93kjpcs1 zs93KLpcs1 ]
> zs95kjg110198_res (ocf::heartbeat:VirtualDomain): Started[ zs93kjpcs1 zs93KLpcs1 ]
>
> The virtual machines themselves were in fact "running" on both hosts.
[ClusterLabs] "VirtualDomain is active on 2 nodes" due to transient network failure
Added an appropriate subject line (was blank). Thanks...

Scott Greenlese ... IBM z/BX Solutions Test, Poughkeepsie, N.Y.
INTERNET: swgre...@us.ibm.com
PHONE: 8/293-7301 (845-433-7301)  M/S: POK 42HA/P966

- Forwarded by Scott Greenlese/Poughkeepsie/IBM on 08/30/2016 03:59 PM -

From: Scott Greenlese/Poughkeepsie/IBM@IBMUS
To: Cluster Labs - All topics related to open-source clustering
welcomed <users@clusterlabs.org>
Date: 08/29/2016 06:36 PM
Subject: [ClusterLabs] (no subject)

Hi folks,

I'm assigned to system test Pacemaker/Corosync on the KVM on System Z
platform with pacemaker-1.1.13-10 and corosync-2.3.4-7.

I have a cluster with 5 KVM hosts, and a total of 200
ocf:heartbeat:VirtualDomain resources defined to run across the 5
cluster nodes (symmetrical is true for this cluster).

The heartbeat network is communicating over vlan1293, which is hung off
a network device, 0230.

In general, pacemaker does a good job of distributing my virtual guest
resources evenly across the hypervisors in the cluster. These resources
are a mixed bag:

- "opaque" and remote "guest nodes" managed by the cluster.
- allow-migrate=false and allow-migrate=true
- qcow2 (file based) guests and LUN based guests
- Sles and Ubuntu OS

[root@zs95kj ]# pcs status | less
Cluster name: test_cluster_2
Last updated: Mon Aug 29 17:02:08 2016
Last change: Mon Aug 29 16:37:31 2016 by root via crm_resource on zs93kjpcs1
Stack: corosync
Current DC: zs95kjpcs1 (version 1.1.13-10.el7_2.ibm.1-44eb2dd) -
partition with quorum
103 nodes and 300 resources configured

Node zs90kppcs1: standby
Online: [ zs93KLpcs1 zs93kjpcs1 zs95KLpcs1 zs95kjpcs1 ]

This morning, our system admin team performed a "non-disruptive"
(concurrent) microcode load on the OSA, which (to our surprise) dropped
the network connection for 13 seconds on the S93 CEC, from 11:18:34am
to 11:18:47am, to be exact. This temporary outage caused the two
cluster nodes on S93 (zs93kjpcs1 and zs93KLpcs1) to drop out of the
cluster, as expected.
However, pacemaker didn't handle this too well. The end result was
numerous VirtualDomain resources in FAILED state:

[root@zs95kj log]# date; pcs status | grep VirtualD | grep zs93 | grep FAILED
Mon Aug 29 12:33:32 EDT 2016
zs95kjg110104_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
zs95kjg110092_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
zs95kjg110099_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
zs95kjg110102_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
zs95kjg110106_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
zs95kjg110112_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
zs95kjg110115_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
zs95kjg110118_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
zs95kjg110124_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
zs95kjg110127_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
zs95kjg110130_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
zs95kjg110136_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
zs95kjg110139_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
zs95kjg110142_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
zs95kjg110148_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
zs95kjg110152_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
zs95kjg110155_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
zs95kjg110161_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
zs95kjg110164_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
zs95kjg110167_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
zs95kjg110173_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
zs95kjg110176_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
zs95kjg110179_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
zs95kjg110185_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
zs95kjg109106_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1

As well as, several VirtualDomain resources showing "Started" on two
cluster nodes:

zs95kjg110079_res (ocf::heartbeat:VirtualDomain): Started[ zs93kjpcs1 zs93KLpcs1 ]
zs95kjg110108_res (ocf::heartbeat:VirtualDomain): Started[ zs93kjpcs1 zs93KLpcs1 ]
zs95kjg110186_res (ocf::heartbeat:VirtualDomain): Started[ zs93kjpcs1 zs93KLpcs1 ]
zs95kjg110188_res (ocf::heartbeat:VirtualDomain): Started[ zs93kjpcs1 zs93KLpcs1 ]
zs95kjg110198_res (ocf::heartbeat:VirtualDomain): Started[ zs93kjpcs1 zs93KLpcs1 ]

The virtual machines themselves were in fact "running" on both hosts.
For example:

[root@zs93kl ~]# virsh list | grep zs95kjg110079
 70 zs95kjg110079 running

[root@zs93kj cli]# virsh list | grep zs95kjg110079
 18 zs95kjg110079 running

On this particular VM, there was file corruption of this file-based
qcow2 guest's image, such that you could not ping or ssh, and if you
open a virsh console, you get an "initramfs" prompt. To recover, we had
to mount the volume on another VM and then run fsck to recover it.

I walked
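[The double-active condition above can also be detected from outside
pacemaker. A hypothetical helper, assuming you can collect one
"host guest" line per running domain from each hypervisor (e.g. via
ssh + virsh list --name); the host/guest names below are sample data,
and the collection step is up to you:]

```shell
# Flag any guest that appears under more than one host. Input format:
# one "host guest" pair per line.
awk '{count[$2]++; hosts[$2] = hosts[$2] " " $1}
     END {for (vm in count) if (count[vm] > 1) print vm ":" hosts[vm]}' <<'EOF'
zs93kj zs95kjg110079
zs93kl zs95kjg110079
zs93kj zs95kjg110104
EOF
# → zs95kjg110079: zs93kj zs93kl
```

[Running something like this right after any network incident would
catch a doubly-active qcow2 guest before the image is corrupted.]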