GitHub user TadiosAbebe added a comment to the discussion: VM HA and Host HA

>From what I understand, VM HA is enabled by default as long as you use a 
>compute offering with HA enabled or you can create a compute offering by 
>checking Offer HA toggle. To enable Host HA, however, you need to configure 
>Out-of-Band Management by going to Infrastructure -> Hosts -> Configure 
>Out-of-Band Management

That said, it’s important to first understand the difference between VM HA and 
Host HA before going down the rabbit hole. Within the CloudStack community, it 
has even been suggested to avoid using Host HA in some scenarios: 
https://github.com/apache/cloudstack/discussions/12139#discussioncomment-15087060

According to the CloudStack documentation:
- The HA feature works with iSCSI or NFS primary storage.
- Host HA is applicable only to KVM clusters.

You can verify this in the official docs: 
https://docs.cloudstack.apache.org/en/latest/adminguide/reliability.html

In my understanding, VM HA is a framework that monitors virtual machines and 
restarts them when they are detected as down. This can happen if:
- The hypervisor host goes down.
- The VM crashes.
- The VM is shut down from inside the OS.
- The VM is manually stopped (via virsh).

The main challenge with any HA system is reliable failure detection. If you 
cannot accurately determine that a component is truly down, you risk serious 
issues such as split-brain scenarios.

This is where Host HA comes in. To reliably fence a host (the hypervisor, not 
the VM), CloudStack uses OOBM such as IPMI or Redfish. In theory, if CloudStack 
detects that a host is unreachable, it first tries to recover the host using 
OOBM (for example, by issuing a reboot command). If the host still does not 
recover, CloudStack fences it by issuing a shutdown command through OOBM. Only 
after fencing does CloudStack restart the VMs on another host.
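
The recover, fence, restart ordering described above can be sketched roughly 
like this. This is only an illustration of the sequence, not CloudStack code: 
`FakeOobm`, `HostState` and `handle_unreachable_host` are hypothetical names I 
made up, and a real OOBM driver (IPMI, Redfish) behaves in more complicated 
ways.

```python
from enum import Enum


class HostState(Enum):
    RECOVERED = "recovered"
    FENCED = "fenced"


class FakeOobm:
    """Hypothetical stand-in for an IPMI/Redfish client (not a CloudStack API)."""

    def __init__(self, recovers_after_reboot):
        self.recovers_after_reboot = recovers_after_reboot
        self.commands = []  # records the order of issued commands

    def reboot(self):
        self.commands.append("reboot")
        return self.recovers_after_reboot

    def power_off(self):
        self.commands.append("power_off")
        return True


def handle_unreachable_host(oobm, vms):
    """Return (final host state, VMs that are safe to restart elsewhere)."""
    # Step 1: try to recover the host through OOBM.
    if oobm.reboot():
        return HostState.RECOVERED, []
    # Step 2: recovery failed, so fence the host by powering it off.
    if not oobm.power_off():
        raise RuntimeError("fencing failed; not safe to restart VMs")
    # Step 3: only after a confirmed fence may the VMs be restarted elsewhere.
    return HostState.FENCED, list(vms)
```

The point of the sketch is that the list of VMs to restart stays empty until 
the power-off has succeeded.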

Without a reliable fencing mechanism, the following can happen. Let's say you 
have three KVM hosts in a cluster. An HA-enabled VM runs on KVM1. The network 
to KVM1 goes down, but the host and VM are actually still running. CloudStack 
thinks KVM1 is down and restarts the VM on KVM2 or KVM3. When the network to 
KVM1 is restored, the same VM is now running in two places (I have actually 
tested this). In theory, with Host HA and OOBM, CloudStack ensures that KVM1 is 
properly powered off before restarting the VM elsewhere, so only one active 
instance of the VM exists and accesses the shared storage.
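
As a toy model of the three-host scenario above (purely illustrative: the 
function name and the `vm-1` identifier are made up, and nothing here talks to 
CloudStack), the difference fencing makes looks like this:

```python
def running_copies_after_partition(fence_before_restart):
    """Simulate a network partition on KVM1 and return the hosts running vm-1."""
    hosts = {"KVM1": {"vm-1"}, "KVM2": set(), "KVM3": set()}

    # The network to KVM1 drops here, but KVM1 and vm-1 keep running.
    if fence_before_restart:
        # Powering KVM1 off through OOBM stops every VM on it.
        hosts["KVM1"].clear()

    # The management server believes KVM1 is down and restarts vm-1 on KVM2.
    hosts["KVM2"].add("vm-1")

    # The network to KVM1 comes back: list every host that runs vm-1 now.
    return sorted(name for name, vms in hosts.items() if "vm-1" in vms)
```

Without fencing the function returns two hosts for the same VM, which is the 
duplicate-instance situation described above; with fencing only KVM2 remains.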

Since I have tested VM HA more, let me share my observations.
- VM HA with Non-NFS environment (Ceph as primary storage)

For a host running an HA-enabled VM, CloudStack monitors the VM and logs 
activity such as c.c.a.m.A.MonitorTask. When the VM is down, 
c.c.a.m.A.MonitorTask reports that the agent is behind on ping, and the 
HighAvailabilityManagerExtImpl triggers a series of investigators: 
SimpleInvestigator, XenServerInvestigator, KVMInvestigator, HypervInvestigator, 
VMwareInvestigator, PingInvestigator, ManagementIPSysVMInvestigator, 
Ovm3Investigator. Eventually, the host goes into the Alert state with the message 
`state cannot be determined for more than alert.wait (1800) seconds, will go to 
Alert state`

In an environment without NFS primary storage, CloudStack cannot reliably 
determine whether the host is truly down. As a result, the host transitions to 
Alert, and VMs are not restarted or migrated to another host.

- VM HA with NFS-based environment (Primary storage is Ceph, but an additional 
NFS primary storage is present)

When NFS is available, CloudStack creates a VMHA directory on the NFS primary 
storage. Each compute host writes a heartbeat file there. If the host running 
an HA-enabled VM goes down, the KVMInvestigator detects the failure via the 
heartbeat mechanism. The host is marked as Down and, after a short time, 
CloudStack restarts the VM on another host.
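
To make the heartbeat idea concrete, here is a minimal sketch of a file-based 
liveness check on shared storage. It is not CloudStack's actual 
implementation: the `hb-<host>` file name, the timestamp content and the 
60-second timeout are all assumptions chosen for the example.

```python
from pathlib import Path


def write_heartbeat(hb_dir, host, now):
    """Each host periodically records the current time in its own file."""
    path = Path(hb_dir) / f"hb-{host}"  # hypothetical naming scheme
    path.write_text(str(int(now)))


def host_looks_alive(hb_dir, host, now, timeout=60):
    """An investigator treats a host as alive if its heartbeat is recent."""
    path = Path(hb_dir) / f"hb-{host}"
    if not path.exists():
        return False  # the host never wrote a heartbeat
    last_beat = int(path.read_text().strip())
    return (now - last_beat) <= timeout  # timeout value is an assumption
```

A host that has lost only its management network can still write to the shared 
directory, so a check like this can tell "unreachable" apart from "down", which 
a ping alone cannot.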

In this case, VM HA works as expected because the shared NFS storage provides a 
partially reliable way to detect host failure.

This is based on the threads below that I read and on hands-on experience:

- https://cwiki.apache.org/confluence/display/cloudstack/host+ha
- https://cwiki.apache.org/confluence/display/CLOUDSTACK/High+Availability+Developer%27s+Guide
- https://github.com/apache/cloudstack/pull/4978
- https://github.com/apache/cloudstack/pull/5862
- https://www.mail-archive.com/[email protected]/msg35692.html
- https://www.mail-archive.com/[email protected]/msg28366.html
- https://www.mail-archive.com/[email protected]/msg25077.html
- https://www.mail-archive.com/[email protected]/msg20426.html
- https://www.mail-archive.com/[email protected]/msg39338.html
- https://www.mail-archive.com/[email protected]/msg27278.html
- https://github.com/apache/cloudstack/issues/10477#issuecomment-2753247589
- https://github.com/apache/cloudstack/discussions/12139#discussioncomment-15092627

GitHub link: 
https://github.com/apache/cloudstack/discussions/12400#discussioncomment-15465444
