GitHub user jpt1624 created a discussion: KVM HA not functioning

### problem
Hello, I am having issues with getting HA to function with my two KVM hosts. 
The cluster, the two hosts, and a test virtual machine each have HA enabled. 

The KVM hosts have OOB management configured using ipmitool.

For testing, I have a virtual machine with an HA supported policy running on 
KVM-02. I power off KVM-02 abruptly to see if the virtual machine will 
automatically migrate over to KVM-01. 

What occurs is the following:

**The CloudStack management server determines that KVM-02 is Disconnected:**

Host 
{"id":46,"name":"kvm-02","type":"Routing","uuid":"d3b323d6-e3bb-4d06-917a-75fba36a5adf"}
 has the status [Disconnected].

**KVM-02 is then set to the DOWN status (supposedly):**

_{"id":46,"name":"kvm-02","type":"Routing","uuid":"d3b323d6-e3bb-4d06-917a-75fba36a5adf"}
 has the status [Down]._

**KVM-01 then checks connectivity with KVM-02, which also returns a status of 
DOWN:**

_Neighbouring Host 
{"id":43,"name":"kvm-01","type":"Routing","uuid":"025ccefd-1696-43c9-9a2c-e045968d2efa"}
 returned status [Down] for the investigated Host 
{"id":46,"name":"kvm-02","type":"Routing","uuid":"d3b323d6-e3bb-4d06-917a-75fba36a5adf"}._

**The shared storage volume mounted onto KVM-02 is checked for any recent 
writes:**

_Checking VM activity for Host 
{"id":46,"name":"kvm-02","type":"Routing","uuid":"d3b323d6-e3bb-4d06-917a-75fba36a5adf"}
 on storage pool [StoragePool 
{"id":48,"name":"Cloud-KVM-SSD-01","poolType":"NetworkFilesystem","uuid":"f8e97832-44a9-3031-aa1d-0acfc9e32648"}]._

_Host 
{"id":46,"name":"kvm-02","type":"Routing","uuid":"d3b323d6-e3bb-4d06-917a-75fba36a5adf"}
 does not have activity on storage pool [StoragePool 
{"id":48,"name":"Cloud-KVM-SSD-01","poolType":"NetworkFilesystem","uuid":"f8e97832-44a9-3031-aa1d-0acfc9e32648"}]_

**While these checks are running, the API still reports the status of KVM-02 as 
UP:**

<img width="549" height="215" alt="Image" src="https://github.com/user-attachments/assets/d5de6cb1-7a0e-4db0-bfbe-eeacfefe74c8" />

After about 10-15 minutes, KVM-02 moves to the ALERT state. I am not sure why 
it takes this long, because the setting shown here is configured for 5 checks:

<img width="920" height="61" alt="Image" src="https://github.com/user-attachments/assets/b3f8af7e-96e0-475f-8c15-0c01d908b6fd" />

Once the host is in the ALERT state, the HA task tries to power off KVM-02 
(presumably to prevent split brain before moving the virtual machines over):

<img width="925" height="260" alt="Image" src="https://github.com/user-attachments/assets/65a4a7c8-8be4-4cdf-8915-cf476fee37e7" />

This command fails because KVM-02 is already OFF.

<img width="249" height="103" alt="Image" src="https://github.com/user-attachments/assets/099ef8b5-0a4d-42f3-a10d-4230dbe3f158" />
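
What I expected instead is that a host whose chassis is already off counts as fenced. Below is a minimal sketch of that expectation, not of how CloudStack implements fencing. The function names are hypothetical, `host`, `user`, and `password` are placeholders for the BMC details, and the parser assumes the usual `Chassis Power is on` / `Chassis Power is off` output of `ipmitool chassis power status`.

```python
import subprocess


def parse_power_state(output):
    """Map `ipmitool chassis power status` output to 'on', 'off', or None."""
    text = output.strip().lower()
    if text.endswith("is on"):
        return "on"
    if text.endswith("is off"):
        return "off"
    return None  # unrecognised output, treat the state as unknown


def query_power_state(host, user, password):
    """Ask the BMC for the chassis power state (placeholder credentials)."""
    cmd = ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-P", password,
           "chassis", "power", "status"]
    done = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    return parse_power_state(done.stdout) if done.returncode == 0 else None


def fence_succeeded(state_before, power_off_ok):
    """Hypothetical rule: a host that is already off counts as fenced."""
    return state_before == "off" or power_off_ok
```

With this rule, `fence_succeeded("off", False)` is `True`, so a failed power-off against a host that is already off would not block the HA workflow.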

CloudStack keeps trying to power KVM-02 off until I manually issue the OOB 
power-on command. CloudStack's power-off command then succeeds, and KVM-02 is 
finally marked as DOWN:

<img width="925" height="350" alt="Image" src="https://github.com/user-attachments/assets/2e689194-6824-4223-b892-001702f788a4" />

<img width="928" height="93" alt="Image" src="https://github.com/user-attachments/assets/343ce057-32ea-417f-ab47-03b039245f39" />

<img width="431" height="107" alt="Image" src="https://github.com/user-attachments/assets/6e59b2ee-a18f-4aaf-b91c-06c030908963" />

Here are the investigators configured:

<img width="1431" height="261" alt="Image" src="https://github.com/user-attachments/assets/a2072b97-8e19-472b-93b7-69e0f36bbcd6" />


### versions
CloudStack: 4.22.0.0
KVM-01: 4.22.0.0
KVM-02: 4.22.0.0


### The steps to reproduce the bug

1. Create a KVM cluster.
2. Add two KVM hosts to the cluster.
3. Enable HA for the cluster and both KVM hosts.
4. Configure OOB management for the KVM hosts.
5. Create a test VM with an HA supported policy on one of the KVM hosts.
6. Assign SimpleInvestigator, PingInvestigator, and KVMInvestigator in the HA 
investigators order.
7. Power off that KVM host abruptly to simulate a failure.


  

### What to do about it?
I am not sure whether my configuration is incorrect or there is an underlying 
issue. The behavior is confusing. Please let me know if I can provide anything 
else.

Thanks!

GitHub link: https://github.com/apache/cloudstack/discussions/12139
