On Sun, Mar 17, 2019 at 12:56 PM <le...@mydream.com.hk> wrote:
>
> Hi, I have twice experienced a total outage of a 3-node hyper-converged 
> oVirt 4.2.8 cluster, caused by VDSM reactivating an unresponsive node and 
> thereby restarting multiple glusterfs daemons. As a result, all VMs were 
> paused and some disk images were corrupted.
>
> It started when one of the oVirt nodes became overloaded (high memory and 
> CPU usage). The hosted engine could no longer collect status from VDSM, 
> marked the node as unresponsive, and began migrating its workload to a 
> healthy node. However, while the migration was running, a second node went 
> unresponsive, because VDSM tried to reactivate the first unresponsive node 
> and restarted its glusterd. The gluster storage domain was then trying to 
> re-establish quorum while waiting for the timeout.
>
> In the ideal case, the reactivation of the first node succeeds and every 
> other node survives the timeout. Unfortunately, the second node could not 
> pick up the VMs being migrated because of the gluster I/O timeout, so it 
> was marked as unresponsive in turn, and so on... VDSM then restarted 
> glusterd on the second node, which turned this into a disaster. All nodes 
> were racing on gluster volume self-healing, and I could not put the 
> cluster into maintenance mode either. All I could do was resume the paused 
> VMs via virsh and issue a shutdown for each domain, plus a hard shutdown 
> for the VMs that could not be resumed.
>
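
(Side note for the archives: that virsh work can also be scripted against
libvirt directly. A minimal, untested sketch using the libvirt-python
bindings, assuming local qemu:///system access and that it is safe to resume
every paused guest, which is only sensible once the storage is healthy again:

import libvirt

# Connect to the local libvirt daemon (the same target virsh uses).
conn = libvirt.open('qemu:///system')
try:
    for dom in conn.listAllDomains():
        state, _reason = dom.state()
        if state == libvirt.VIR_DOMAIN_PAUSED:
            print('resuming', dom.name())
            dom.resume()        # equivalent to `virsh resume <name>`
            # dom.shutdown() / dom.destroy() would be the graceful / hard
            # shutdown counterparts described above.
finally:
    conn.close()
)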
> After shutting down a number of VMs and waiting for the gluster healing to 
> complete, the cluster state went back to normal and I tried to start the 
> VMs I had stopped manually. Most of them started normally, but a number of 
> them had crashed or would not start. I quickly found that the image files 
> of the un-startable VMs were owned by root (I can't explain why), and they 
> could be started again after a chmod. Two of them still would not start, 
> failing with a "bad volume specification" error. One of them gets as far 
> as the boot loader, but its LVM metadata was lost.
>
> The impact is huge when VDSM restarts glusterd without human intervention.

Is this even with the fencing policies set to ensure gluster quorum
is not lost?

There are two policies that you need to enable at the cluster level:
- Skip fencing if Gluster bricks are UP
- Skip fencing if Gluster quorum not met
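
Both can be ticked in the Admin Portal (Cluster -> Edit -> Fencing Policy)
or set through the REST API / Python SDK. A rough, untested sketch with
ovirtsdk4 -- the engine URL, credentials and cluster name are placeholders,
and the FencingPolicy attribute names are from memory, so please verify them
against your SDK version:

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Connect to the engine API (placeholder URL and credentials).
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    ca_file='ca.pem',
)
try:
    clusters_service = connection.system_service().clusters_service()
    # Look up the hyper-converged cluster by name (placeholder name).
    cluster = clusters_service.list(search='name=Default')[0]
    cluster_service = clusters_service.cluster_service(cluster.id)
    # Enable the two gluster-aware skip conditions in the fencing policy.
    cluster_service.update(
        types.Cluster(
            fencing_policy=types.FencingPolicy(
                skip_if_gluster_bricks_up=True,
                skip_if_gluster_quorum_not_met=True,
            ),
        ),
    )
finally:
    connection.close()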


_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/WL3F7PLMAGUBX6VFYKFRJE6YHWAQHFHU/