Dear Gianluca,

Thanks a lot for this valuable and very helpful simulation. 

best regards,

samuel



Do Right Thing (做正确的事) / Pursue Excellence (追求卓越) / Help Others Succeed (成就他人)
 
From: Gianluca Cecchi
Date: 2022-10-12 10:02
To: Martin Perina
CC: users
Subject: [ovirt-users] Re: How to configure HA virtual machine on hosts without IPMI
On Wed, Oct 12, 2022 at 9:14 AM Martin Perina <mper...@redhat.com> wrote:


On Tue, Oct 11, 2022 at 1:42 PM Klaas Demter <klaasdem...@gmail.com> wrote:
Don't storage leases solve that problem? 

Not entirely. You are not able to kill a VM via a storage lease; you can only 
detect that, even though we lost the connection from the engine to the host (and 
that means also to its VMs), the host/VM leases are still being refreshed, and if 
so we do not try to restart the VM on a different host.
I seem to recall an HA VM also works (gets restarted on another node) when a 
hypervisor completely loses power, i.e. there is no response from the fencing 
device. I'd expect it to work the same without a fencing device.

So if that happens, it's not a completely correct setup. If you want reliable 
power management, then your power management network should be independent of 
your data network, so that if there is an issue with the data network, you can 
still use the power management network to check the power status and perform a 
reboot if needed. Of course, if both networks are down, then you have a problem, 
but that should be a rare case.


Greetings
Klaas



In September 2021 I simulated some Active-Active DR tests on an environment 
based on RHV 4.4.x, with 1 host in Site A and 1 host in Site B.
See also here:
https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.4/html/disaster_recovery_guide/active_active

Cluster configuration:
. enable fencing --> yes
. skip fencing if host has live lease --> yes
. skip fencing on cluster connectivity issues --> yes with threshold 50%
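
Purely as an illustration, the same three options should also be settable through 
the Python SDK (ovirtsdk4). This is only a rough sketch: the engine URL, 
credentials and cluster name are placeholders, and I am assuming skip_if_sd_active 
is the SDK counterpart of "skip fencing if host has live lease":

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholder engine URL and credentials
connection = sdk.Connection(url='https://engine.example.com/ovirt-engine/api',
                            username='admin@internal', password='secret',
                            ca_file='ca.pem')

clusters_service = connection.system_service().clusters_service()
cluster = clusters_service.list(search='name=MyCluster')[0]

# Enable fencing, skip it if the host still holds a live storage lease,
# and skip it when at least 50% of the hosts have connectivity issues
clusters_service.cluster_service(cluster.id).update(
    types.Cluster(
        fencing_policy=types.FencingPolicy(
            enabled=True,
            skip_if_sd_active=types.SkipIfSdActive(enabled=True),
            skip_if_connectivity_broken=types.SkipIfConnectivityBroken(
                enabled=True,
                threshold=50,
            ),
        ),
    ),
)

connection.close()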

I simulated (through iptables rules) the unreachability of the host in Site B and 
of its IPMI device.
One HA VM (ha-vm) and one non-HA VM (non-ha-vm) were running on the host in Site B.
I then generated a kernel panic on the host in Site B (so that it doesn't renew 
its leases).
(The host in Site B, based on RHEL 8, reboots automatically after the crash dump, 
so I stopped it in the BIOS boot phase to keep the server from coming up again.)
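
In this kind of setup the HA VM would typically carry a VM lease on a storage 
domain (that is what the "skip fencing if host has live lease" policy and the 
storage lease discussion above refer to). A rough sketch of how such a VM can be 
configured via the Python SDK; the VM name ha-vm comes from the test, while the 
engine URL, credentials and the storage domain name mydata are placeholders:

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(url='https://engine.example.com/ovirt-engine/api',
                            username='admin@internal', password='secret',
                            ca_file='ca.pem')

vms_service = connection.system_service().vms_service()
vm = vms_service.list(search='name=ha-vm')[0]

# Placeholder storage domain that will hold the VM lease
sds_service = connection.system_service().storage_domains_service()
sd = sds_service.list(search='name=mydata')[0]

# Mark the VM as highly available and put its lease on that storage domain
vms_service.vm_service(vm.id).update(
    types.Vm(
        high_availability=types.HighAvailability(enabled=True),
        lease=types.StorageDomainLease(
            storage_domain=types.StorageDomain(id=sd.id),
        ),
    ),
)

connection.close()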
The VM ha-vm was correctly restarted on the host in Site A after the defined 
timeout:

Sep 10, 2021, 6:09:51 PM Host rhvh1 is not responding. It will stay in 
Connecting state for a grace period of 81 seconds and after that an attempt to 
fence the host will be issued.
Sep 10, 2021, 6:09:51 PM VDSM rhvh1 command Get Host Statistics failed: 
Connection timeout for host 'rhvh1', last response arrived 22501 ms ago.
...
Sep 10, 2021, 6:11:25 PM VM ha-vm was set to the Unknown status.
Sep 10, 2021, 6:11:25 PM VM non-ha-vm was set to the Unknown status.
Sep 10, 2021, 6:11:25 PM Host rhvh1 became non responsive and was not restarted 
due to Fencing Policy: 50 percents of the Hosts in the Cluster have 
connectivity issues.
...
Sep 10, 2021, 6:13:43 PM Trying to restart VM ha-vm on Host rhvh2

And the VM ha-vm becomes active and operational.
Note that the non-HA VM non-ha-vm remains in Unknown status.
If I remove the iptables rules and let rhvh1 boot, it correctly rejoins the 
cluster without trying to restart the VM.
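
If you prefer to check the restart from a script rather than from the events 
pane, something like the following sketch reports on which host the HA VM is 
running (same placeholder connection details as above, VM and host names from 
the test):

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(url='https://engine.example.com/ovirt-engine/api',
                            username='admin@internal', password='secret',
                            ca_file='ca.pem')

vms_service = connection.system_service().vms_service()
vm = vms_service.list(search='name=ha-vm')[0]

if vm.status == types.VmStatus.UP and vm.host is not None:
    # Resolve the run host reference to its name (rhvh1 or rhvh2 in my test)
    host = connection.system_service().hosts_service().host_service(vm.host.id).get()
    print('ha-vm is up on host', host.name)
else:
    print('ha-vm status:', vm.status)

connection.close()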

The only limitation is that if the site with the isolation problems is the one 
where the SPM host is running, you still have HA for the VMs, but you cannot 
elect a new SPM.
So you cannot, for example, add new disks or change the size of existing ones.
But this is an acceptable temporary situation in the kind of DR action I was 
simulating.

If you try to force rhvh2 to become SPM you get:
Error while executing action: Cannot force select SPM. Unknown Data Center 
status.
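
The same refusal shows up if you attempt it via the API. A hedged sketch of the 
call (host name rhvh2 from my test, connection details are placeholders); while 
the Data Center status is unknown it fails with an error equivalent to the 
message above:

import ovirtsdk4 as sdk

connection = sdk.Connection(url='https://engine.example.com/ovirt-engine/api',
                            username='admin@internal', password='secret',
                            ca_file='ca.pem')

hosts_service = connection.system_service().hosts_service()
host = hosts_service.list(search='name=rhvh2')[0]

try:
    # Ask the engine to make rhvh2 the SPM; in this state it is rejected
    hosts_service.host_service(host.id).force_select_spm()
except sdk.Error as e:
    print('force_select_spm failed:', e)

connection.close()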

To get a new SPM (on rhvh2 in my case), in a real scenario (which I simulated 
before letting rhvh1 boot into the OS) you have to verify the real state of Site 
B and that everything there has been powered off (to prevent future data 
corruption if Site B comes up again), and then select

"confirm host has been rebooted" on rhvh1

you get a window with "Are you sure?"

Please make sure the Host 'rhvh1' has been manually shut down or rebooted.
This Host is the SPM. Executing this operation on a Host that was not properly 
manually rebooted could lead to Storage corruption condition!
If the host has not been manually rebooted hit 'Cancel'.
Confirm Operation --> check the box
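
If I'm not mistaken, the same "confirm host has been rebooted" operation is 
exposed in the API as a manual fence on the host. A rough sketch (connection 
details are placeholders, host name from my test); as in the GUI flow, only do 
this after verifying Site B is really powered off:

import ovirtsdk4 as sdk

connection = sdk.Connection(url='https://engine.example.com/ovirt-engine/api',
                            username='admin@internal', password='secret',
                            ca_file='ca.pem')

hosts_service = connection.system_service().hosts_service()
host = hosts_service.list(search='name=rhvh1')[0]

# Equivalent of "Confirm host has been rebooted": tells the engine the host was
# really powered off/rebooted, releasing the SPM role and the VMs it was running
hosts_service.host_service(host.id).fence(fence_type='manual')

connection.close()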

At this point rhvh2 becomes the new SPM, the non-HA VM non-ha-vm transitions 
from Unknown status to Down, and the DC becomes up.
From an events point of view you get:

Sep 10, 2021, 6:23:40 PM Vm non-ha-vm was shut down due to rhvh1 host reboot or 
manual fence
Sep 10, 2021, 6:23:41 PM All VMs' status on Non Responsive Host rhvh1 were 
changed to 'Down' by user@internal
Sep 10, 2021, 6:23:41 PM Manual fence for host rhvh1 was started.
Sep 10, 2021, 6:23:43 PM Storage Pool Manager runs on Host rhvh2 (Address: 
rhvh2), Data Center MYDC.

At this point you can start the non-ha-vm VM:
Sep 10, 2021, 6:24:44 PM VM non-ha-vm was started by user@internal (Host: 
rhvh2).
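
That start can of course be scripted as well; a minimal sketch with the VM name 
from the test and the same placeholder connection details as before:

import ovirtsdk4 as sdk

connection = sdk.Connection(url='https://engine.example.com/ovirt-engine/api',
                            username='admin@internal', password='secret',
                            ca_file='ca.pem')

vms_service = connection.system_service().vms_service()
vm = vms_service.list(search='name=non-ha-vm')[0]

# Start the non-HA VM now that the DC is up again and the VM is in Down status
vms_service.vm_service(vm.id).start()

connection.close()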

During these tests I opened a case because the SPM-related limitation was not 
documented in the DR guide, and I got it added (see paragraph 2.3, Storage 
Considerations).

What is described above should be applicable to oVirt > 4.4 too for DR, and it 
could somehow be applied to cover HA needs when IPMI is missing.
But for sure it is only a sort of workaround, to be avoided in production.

I suggest you test all the scenarios that you want to manage, to verify the 
expected behavior.

HIH digging more,
Gianluca

_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/YWVEHSGEG3WXEYAMCNGZDCBXCMBEDSXO/
