GitHub user andrijapanicsb added a comment to the discussion: ACS is not able to restart VM during HA process
Copy-pasting an in-depth analysis and its conclusions (which I had to do on the other side):

As expected, with recent QEMU versions it is impossible, because qcow2 images are opened with exclusive locks, to reach a situation (e.g. in some split-brain scenario) where VM1 is still running on HOST-A, but ACS thinks that HOST-A and VM1 are both down, attempts to start VM1 on HOST-B, and CAUSES CORRUPTION by successfully starting it there. VM1 cannot be started on HOST-B because the qcow2 file stays locked for as long as the VM is still running on HOST-A.

This WAS POSSIBLE many years ago, when there were no locks on qcow2 images. Today it is impossible. I tested this on NFS v3, since Lucian mentioned that NFS v4 might behave differently, with locks behaving differently or qcow2 not being locked at all. Locking is a QEMU feature, so it sounds "impossible" to me that qcow2 files would NOT be locked.

This was the reason host-HA was implemented: if we SUSPECT that a VM is down because the mgmt. server cannot communicate with the host, we do all kinds of checks, including checking the qcow2 access timestamps, etc., and ensure that the host running VM1 is killed/STONITH'd (or rebooted, based on the official docs!) to avoid corruption of the qcow2 file (the FS inside it).

Here is a screenshot of an example VM: it was deployed on host-A and left running (see the RIGHT part of the screenshot). You can see that even a basic qemu-img info command (read-only access to the file) doesn't work (still on the RIGHT side of the screen) unless you force it with the -U parameter.
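
For anyone who wants to repeat the qemu-img check without the screenshot, here is a minimal Python sketch of the same idea: run a plain `qemu-img info` against the image and treat a lock-related failure as "still in use somewhere". The function names are hypothetical, and the error-message matching is an assumption based on the wording recent QEMU versions tend to print ("Failed to get ... lock"), so verify it against your own QEMU build before relying on it.

```python
import shutil
import subprocess


def qemu_img_info_cmd(image_path, force_share=False):
    """Build the argument list for `qemu-img info`.

    With force_share=True the -U flag is added, which asks qemu-img to
    open the image even though another process holds the lock.
    """
    cmd = ["qemu-img", "info"]
    if force_share:
        cmd.append("-U")
    cmd.append(image_path)
    return cmd


def looks_like_lock_error(stderr_text):
    """Heuristic only: the exact wording is an assumption and may differ
    between QEMU versions."""
    return "Failed to get" in stderr_text and "lock" in stderr_text


def image_appears_locked(image_path):
    """Return True if a plain `qemu-img info` fails with what looks like a
    lock error, False if it does not, and None if qemu-img is not installed."""
    if shutil.which("qemu-img") is None:
        return None
    result = subprocess.run(
        qemu_img_info_cmd(image_path),
        capture_output=True,
        text=True,
    )
    return result.returncode != 0 and looks_like_lock_error(result.stderr)
```

Calling `image_appears_locked()` with the path of a running VM's root disk on the primary storage mount would be expected to return `True` on a host other than the one running the VM, while `qemu_img_info_cmd(path, force_share=True)` gives the `-U` variant that still reads the metadata.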
**When trying to start a clone of this VM on host-B, referencing/using the SAME QCOW2 file, it refuses to start (see the LEFT side of the screenshot), because it can NOT get a lock on the qcow2 file (which is already locked because it is in use on host-A).**

So host-HA, which was ALSO developed to improve VM-HA and avoid qcow2 corruption (besides trying to recover the host or, if that is not possible, STONITH it, keep it stopped, and then start the VMs on another host), IS NOT NEEDED for VM-HA, i.e. it is not needed to ENSURE that there is no FS corruption from two copies of a VM writing to the same QCOW2.

So for proper VM-HA, at least with qcow2 files (NFS/local/shared mount points), there is no risk of causing corruption, and thus HOST-HA seems unneeded.

Just sharing the findings.

GitHub link: https://github.com/apache/cloudstack/discussions/10690#discussioncomment-12789976