On Sun, Mar 17, 2019 at 12:56 PM <le...@mydream.com.hk> wrote:
>
> Hi, I have twice experienced a total outage of a 3-node hyper-converged oVirt 4.2.8 cluster because vdsm reactivated an unresponsive node and thereby triggered multiple glusterfs daemon restarts. As a result, all VMs were paused and some disk images were corrupted.
>
> At the very beginning, one oVirt node was overloaded with high memory and CPU usage, so the hosted engine had trouble collecting status from vdsm, marked the node as unresponsive, and started migrating its workload to a healthy node. During that migration, a second node became unresponsive, because vdsm tried to reactivate the first unresponsive node and restarted its glusterd. The gluster domain therefore lost quorum and sat waiting for the timeout.
>
> Had the first node's reactivation succeeded and every other node survived the timeout, all would have been well. Unfortunately, the second node could not pick up the VMs being migrated because of gluster I/O timeouts, so it too was marked unresponsive at that point, and so on: vdsm restarted glusterd on the second node, which caused a disaster. All nodes were racing on gluster volume self-healing, and I could not put the cluster into maintenance mode either. All I could do was resume the paused VMs via virsh and issue a shutdown for each domain, plus a hard shutdown for the un-resumable VMs.
>
> After shutting down a number of VMs and waiting for gluster healing to complete, the cluster state returned to normal, and I tried to start the VMs that had been stopped manually. Most of them started normally, but several had crashed or would not start at all. I then found that the image files of the un-startable VMs were owned by root (I can't explain why); those could be started again after a chmod. Two of them still would not start, failing with a "bad volume specification" error. One of them could reach the boot loader, but its LVM metadata was lost.
>
> The impact is huge when vdsm restarts glusterd without human intervention.
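For anyone in the same situation, the manual recovery step described above (resume the paused domains via virsh, hard-stop the ones that refuse) can be sketched roughly as follows. This is only an illustrative Python wrapper around standard virsh commands, assuming the usual three-column `virsh list --all` output; the `resume_all` helper is not part of oVirt or libvirt:

```python
import subprocess

def paused_domains(virsh_list_output):
    """Return the names of domains whose state column reads 'paused'."""
    names = []
    for line in virsh_list_output.splitlines():
        parts = line.split()
        # Data rows look like: " 3   vm-name   paused"
        if len(parts) >= 3 and parts[-1] == "paused":
            names.append(parts[1])
    return names

def resume_all():
    """Hypothetical helper: resume every paused domain, destroy stragglers."""
    out = subprocess.run(["virsh", "list", "--all"],
                         capture_output=True, text=True, check=True).stdout
    for name in paused_domains(out):
        # Try a clean resume first; fall back to a hard stop if it fails.
        if subprocess.run(["virsh", "resume", name]).returncode != 0:
            subprocess.run(["virsh", "destroy", name])
```

Resumed guests should still be shut down cleanly from inside (or via `virsh shutdown`) before gluster healing finishes, as the poster did.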
Is this even with the fencing policies set to ensure gluster quorum is not lost? There are two policies that you need to enable at the cluster level:

- Skip fencing if Gluster bricks are UP
- Skip fencing if Gluster quorum not met

> _______________________________________________
> Users mailing list -- users@ovirt.org
> To unsubscribe send an email to users-le...@ovirt.org
> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
> oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
> List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/NIPHD7COR5ZBVQROOUU6R4Q45SDFAJ5K/

_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/WL3F7PLMAGUBX6VFYKFRJE6YHWAQHFHU/
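Besides ticking those two checkboxes in the Administration Portal (Cluster > Edit > Fencing Policy), the same policies can be set programmatically. The sketch below assumes the oVirt Python SDK (ovirtsdk4) from the 4.2 era, where `FencingPolicy` exposes `skip_if_gluster_bricks_up` and `skip_if_gluster_quorum_not_met`; verify the attribute names against your installed SDK version, and note that the engine URL and credentials are placeholders:

```python
# The two gluster-aware fencing skips the reply refers to.
GLUSTER_FENCING_POLICY = {
    "skip_if_gluster_bricks_up": True,
    "skip_if_gluster_quorum_not_met": True,
}

def enable_gluster_fencing_policy(url, username, password, cluster_name):
    """Hedged sketch: enable both gluster fencing skips on one cluster."""
    # Assumes: pip install ovirt-engine-sdk-python
    import ovirtsdk4 as sdk
    import ovirtsdk4.types as types

    connection = sdk.Connection(url=url, username=username,
                                password=password, insecure=True)
    try:
        clusters = connection.system_service().clusters_service()
        cluster = clusters.list(search="name=%s" % cluster_name)[0]
        clusters.cluster_service(cluster.id).update(
            types.Cluster(
                fencing_policy=types.FencingPolicy(
                    skip_if_gluster_bricks_up=True,
                    skip_if_gluster_quorum_not_met=True,
                )
            )
        )
    finally:
        connection.close()
```

With both skips enabled, the engine should refuse to fence (and thus restart glusterd on) a host whose bricks are still up or whose removal would break gluster quorum, which is exactly the cascade described in the original report.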