GitHub user TadiosAbebe added a comment to the discussion: VM HA and Host HA
> From what I understand, VM HA is enabled by default as long as you use a compute offering with HA enabled, or you can create a compute offering by checking the Offer HA toggle. To enable Host HA, however, you need to configure Out-of-Band Management by going to Infrastructure -> Hosts -> Configure Out-of-Band Management.

That said, it’s important to first understand the difference between VM HA and Host HA before going down the rabbit hole. Within the CloudStack community, it has even been suggested to avoid using Host HA in some scenarios: https://github.com/apache/cloudstack/discussions/12139#discussioncomment-15087060

According to the CloudStack documentation:

- The HA feature works with iSCSI or NFS primary storage.
- Host HA is applicable only to KVM clusters.

You can verify this in the official docs: https://docs.cloudstack.apache.org/en/latest/adminguide/reliability.html

In my understanding, VM HA is a framework that monitors virtual machines and starts or restarts them when they are detected as down. This can happen if:

- The hypervisor host goes down
- The VM crashes
- The VM is shut down from inside the OS
- The VM is manually stopped (via virsh)

The main challenge with any HA system is reliable failure detection. If you cannot accurately determine that a component is truly down, you risk serious issues such as split-brain scenarios.

This is where Host HA comes in. To reliably fence a host (the hypervisor, not the VM), CloudStack uses out-of-band management (OOBM) such as IPMI or Redfish. In theory, if CloudStack detects that a host is unreachable:

- It first tries to recover the host using OOBM (for example, by issuing a reboot command).
- If the host still does not recover, CloudStack fences it by issuing a shutdown command through OOBM.
- Only after fencing does CloudStack restart the VMs on another host.

Without a reliable fencing mechanism, the following can happen. Let's say you have three KVM hosts in a cluster, and an HA-enabled VM runs on KVM1.
The network to KVM1 goes down, but the host and VM are actually still running. CloudStack thinks KVM1 is down and restarts the VM on KVM2 or KVM3. When the network to KVM1 is restored, the same VM is now running in two places (I have actually tested this). In theory, with Host HA and OOBM, CloudStack ensures that KVM1 is properly powered off before restarting the VM elsewhere, so only one active instance of the VM exists and accesses the shared storage.

Since I have tested VM HA more, let me give you my observations.

- VM HA in a non-NFS environment (Ceph as primary storage)

  For a host running an HA-enabled VM, CloudStack monitors the VM and logs activity such as c.c.a.m.A.MonitorTask. When the VM is down, c.c.a.m.A.MonitorTask reports the VM as behind on ping, and HighAvailabilityManagerExtImpl triggers a series of investigators: SimpleInvestigator, XenServerInvestigator, KVMInvestigator, HypervInvestigator, VMwareInvestigator, PingInvestigator, ManagementIPSysVMInvestigator, and Ovm3Investigator. Eventually, the host goes into the Alert state with the message `state cannot be determined for more than alert.wait (1800) seconds, will go to Alert state`.

  In an environment without NFS primary storage, CloudStack cannot reliably determine whether the host is truly down, so the host transitions to Alert and its VMs are not restarted or migrated to another host.

- VM HA in an NFS-based environment (primary storage is Ceph, but an additional NFS primary storage is present)

  When NFS is available, CloudStack creates a KVMHA directory on the NFS primary storage, and each compute host writes a heartbeat file there. If the host running an HA-enabled VM goes down, the KVMInvestigator detects the failure via the heartbeat mechanism. The host is marked as Down and, after a short time, CloudStack restarts the VM on another host.

  In this case, VM HA works as expected because the shared NFS storage provides a partially reliable way to detect host failure.

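
The two observations above can be summarized as a small decision function. This is only an illustrative sketch, not CloudStack's actual investigator code: `HostCheck`, `investigate`, `should_restart_vms`, and the 60-second `HEARTBEAT_TIMEOUT_S` are hypothetical names and values, and treating an unreachable host with a fresh heartbeat as Alert is my assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class HostState(Enum):
    UP = "Up"
    ALERT = "Alert"
    DOWN = "Down"


@dataclass
class HostCheck:
    # True when the management server can still reach the host agent.
    agent_reachable: bool
    # Seconds since the host last updated its heartbeat file on shared NFS
    # storage; None when there is no NFS primary storage, so no heartbeat.
    heartbeat_age_s: Optional[float]


# Hypothetical threshold for illustration only, not a CloudStack default.
HEARTBEAT_TIMEOUT_S = 60.0


def investigate(check: HostCheck) -> HostState:
    """Classify a host from agent reachability and heartbeat freshness."""
    if check.agent_reachable:
        return HostState.UP
    if check.heartbeat_age_s is None:
        # No independent evidence: the host may only be cut off from the
        # network, so its state cannot be determined.
        return HostState.ALERT
    if check.heartbeat_age_s > HEARTBEAT_TIMEOUT_S:
        # The host stopped writing heartbeats: treat it as really down.
        return HostState.DOWN
    # Assumption: a fresh heartbeat means the host is alive but unreachable.
    return HostState.ALERT


def should_restart_vms(state: HostState) -> bool:
    """HA-enabled VMs are restarted elsewhere only for a host marked Down."""
    return state is HostState.DOWN


if __name__ == "__main__":
    ceph_only = HostCheck(agent_reachable=False, heartbeat_age_s=None)
    with_nfs = HostCheck(agent_reachable=False, heartbeat_age_s=300.0)
    for name, check in (("Ceph only", ceph_only), ("Ceph + NFS", with_nfs)):
        state = investigate(check)
        print(name, state.value, "restart VMs:", should_restart_vms(state))
```

Note that the heartbeat only tells you the host stopped writing to shared storage. It does not power the host off, which is why fencing through OOBM is still the stronger guarantee against the same VM running twice.
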
This is based on my hands-on experience and the threads below:

- https://cwiki.apache.org/confluence/display/cloudstack/host+ha
- https://cwiki.apache.org/confluence/display/CLOUDSTACK/High+Availability+Developer%27s+Guide
- https://github.com/apache/cloudstack/pull/4978
- https://github.com/apache/cloudstack/pull/5862
- https://www.mail-archive.com/[email protected]/msg35692.html
- https://www.mail-archive.com/[email protected]/msg28366.html
- https://www.mail-archive.com/[email protected]/msg25077.html
- https://www.mail-archive.com/[email protected]/msg20426.html
- https://www.mail-archive.com/[email protected]/msg39338.html
- https://www.mail-archive.com/[email protected]/msg27278.html
- https://github.com/apache/cloudstack/issues/10477#issuecomment-2753247589
- https://github.com/apache/cloudstack/discussions/12139#discussioncomment-15092627

GitHub link: https://github.com/apache/cloudstack/discussions/12400#discussioncomment-15465444
