GitHub user andrijapanicsb added a comment to the discussion: ACS is not able 
to restart VM during HA process

Copy-pasting an in-depth analysis/conclusions (that I had to do on the other 
side):

As expected with recent QEMU versions, because qcow2 images carry exclusive 
locks, it is impossible to end up in a situation (e.g. in some split-brain 
scenario) where VM1 is still running on HOST-A, but ACS thinks HOST-A is down 
and that VM1 is down, and thus attempts to start it on HOST-B and CAUSES 
CORRUPTION by successfully starting it there. Starting it on HOST-B fails 
because QEMU holds a lock on the qcow2 file while the VM is still running (on 
HOST-A).
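As a rough illustration of the above (the domain name and qcow2 path are made 
up, and the exact error text varies between QEMU/libvirt versions), this is 
what the failed second start looks like:

```console
# HOST-B mounts the same NFS primary storage; VM1 is still running on HOST-A.
# Attempting to start a second guest that points at the same qcow2:
virsh start vm1-clone
# Fails with a QEMU image-locking error along the lines of:
#   Failed to get "write" lock
#   Is another process using the image [/mnt/primary/vm1-root.qcow2]?
```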

This WAS POSSIBLE many years ago, when there were no locks on qcow2 images; 
today it is impossible (tested on NFS v3, since Lucian mentioned that NFS v4 
might behave differently with respect to locking, or that the qcow2 might not 
be locked at all; but image locking is a QEMU feature, and it sounds 
“impossible” to me that qcow2 files would NOT be locked…).

This was a reason that host-HA was implemented: to ensure that if we SUSPECT a 
VM is down, because the management server cannot communicate with the host, we 
do all kinds of checks, including checking the qcow2 access timestamps, etc., 
and make sure the host running VM1 is killed/STONITHed (or rebooted, per the 
official docs!) in order to avoid corruption of the qcow2 file (and the 
filesystem inside it).

Here is a screenshot of an example VM: it was deployed on HOST-A and left 
running (see the RIGHT part of the image/screenshot). You can see that even a 
basic qemu-img info command (read-only access to the file) does not work 
(still on the RIGHT side of the screen) unless you force it with the -U 
parameter.

**When trying to start a clone of this VM on HOST-B, referencing/using the SAME 
qcow2 file, it refuses to start (see the LEFT side of the screenshot), because 
it can NOT get a lock on the qcow2 file (which is already locked because it is 
in use on HOST-A):**

![image](https://github.com/user-attachments/assets/2e5adf78-231b-4aef-99ed-71d26db09008)
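For reference, this is roughly the command sequence shown on the RIGHT side of 
the screenshot (the path is illustrative); -U / --force-share tells qemu-img 
to skip taking the lock, so it is only safe for read-only inspection:

```console
# HOST-A, while VM1 is running: plain read-only inspection is refused
qemu-img info /mnt/primary/vm1-root.qcow2
# qemu-img: Could not open '/mnt/primary/vm1-root.qcow2':
#   Failed to get shared "write" lock
#   Is another process using the image?

# Forcing shared access works (read-only, no lock taken)
qemu-img info -U /mnt/primary/vm1-root.qcow2
```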

So host-HA, which was ALSO developed to improve VM-HA and avoid qcow2 
corruption (besides trying to recover the host, or, if that is not possible, 
STONITH it and keep it stopped, and then start its VMs on another host), IS 
NOT NEEDED for VM-HA (i.e. it is not needed to ENSURE that there is no 
filesystem corruption from two copies of a VM using/writing to the same qcow2).

 

So for proper VM-HA, at least with qcow2 files (NFS/local/shared mount 
points), there is no risk of causing corruption this way, so host-HA seems 
unneeded for that purpose.

Just sharing the findings.


GitHub link: 
https://github.com/apache/cloudstack/discussions/10690#discussioncomment-12789976
