[ovirt-users] Re: vdsm should decouple with managed glusterfs services

2019-03-18 Thread Sahina Bose
On Mon, Mar 18, 2019 at 4:15 PM levin  wrote:
>
> Hi Sahina,
>
> My cluster does not have fencing enabled. Is there somewhere I can
> disable the restart policy in vdsm completely? That way I can observe
> this case next time and do a first investigation on the unresponsive
> node.
>
+Martin Perina - do you know if this is possible?

> Regards,
> Levin
>
>
> On 18/3/2019, 17:40, "Sahina Bose"  wrote:
>
> On Sun, Mar 17, 2019 at 12:56 PM  wrote:
> >
> > Hi, I have twice experienced a total outage of a 3-node hyper-converged
> > oVirt 4.2.8 cluster because vdsm reactivated an unresponsive node and
> > caused multiple glusterfs daemon restarts. As a result, all VMs were
> > paused and some disk images were corrupted.
> >
> > At the very beginning, one of the oVirt nodes was overloaded with high
> > memory and CPU usage. The hosted engine had trouble collecting status
> > from vdsm, marked the node as unresponsive, and started migrating its
> > workload to a healthy node. However, while the migration was running,
> > the second oVirt node also became unresponsive, because vdsm tried to
> > reactivate the first unresponsive node and restarted its glusterd. The
> > gluster domain was then re-establishing quorum and waiting for the
> > timeout.
> >
> > If the first node's reactivation had succeeded and every other node had
> > survived the timeout, that would have been the ideal case. Unfortunately,
> > the second node could not pick up the VMs being migrated because of
> > gluster I/O timeouts, so it was marked as unresponsive as well, and so
> > on... vdsm then restarted glusterd on the second node, which caused a
> > disaster. All nodes were racing on gluster volume self-healing, and I
> > could not put the cluster into maintenance mode either. All I could do
> > was resume the paused VMs via virsh and issue a shutdown for each
> > domain, plus a hard shutdown for the un-resumable VMs.
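[For reference, the manual recovery described above -- resuming the paused
guests and then shutting each one down -- can also be scripted against
libvirt. Below is a rough, untested sketch using libvirt-python; it assumes
you can authenticate to libvirtd on the host the same way virsh does (oVirt
nodes normally protect libvirt access), and the hard power-off is left
commented out on purpose.]

    #!/usr/bin/env python3
    # Sketch: resume every paused guest, then request a clean shutdown.
    import libvirt

    conn = libvirt.open("qemu:///system")  # same URI virsh uses locally
    for dom in conn.listAllDomains(libvirt.VIR_CONNECT_LIST_DOMAINS_ACTIVE):
        state, _reason = dom.state()
        if state == libvirt.VIR_DOMAIN_PAUSED:
            print("resuming", dom.name())
            dom.resume()
        print("requesting shutdown of", dom.name())
        dom.shutdown()      # clean (ACPI) shutdown request
        # dom.destroy()     # hard power-off for guests that never come back
    conn.close()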
> >
> > After shutting down a number of VMs and waiting for the gluster healing
> > to complete, the cluster state went back to normal and I tried to start
> > the VMs I had stopped manually. Most of them started normally, but a
> > number of VMs had crashed or were un-startable. I found that the image
> > files of the un-startable VMs were owned by root (I can't explain why),
> > and they could be started again after fixing the file permissions. Two
> > of them still cannot start, failing with a "bad volume specification"
> > error. One of them gets as far as the boot loader, but its LVM metadata
> > was lost.
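[For reference, the root-owned image files mentioned above can be located
with a small script before deciding how to repair them. A sketch is below;
the storage-domain mount prefix is an assumption (adjust it to the actual
gluster mount on your nodes), the expected owner vdsm:kvm is the oVirt
default (uid/gid 36), and the chown that would fix a file is commented out
so the script only reports.]

    #!/usr/bin/env python3
    # Sketch: list image files whose owner is not vdsm:kvm.
    import grp
    import os
    import pwd

    DOMAIN_PATH = "/rhev/data-center/mnt/glusterSD"   # assumed mount prefix
    VDSM_UID = pwd.getpwnam("vdsm").pw_uid            # 36 on oVirt nodes
    KVM_GID = grp.getgrnam("kvm").gr_gid              # 36 on oVirt nodes

    for root, _dirs, files in os.walk(DOMAIN_PATH):
        for name in files:
            path = os.path.join(root, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue                    # file vanished or unreadable
            if st.st_uid != VDSM_UID or st.st_gid != KVM_GID:
                print(f"wrong owner {st.st_uid}:{st.st_gid}  {path}")
                # os.chown(path, VDSM_UID, KVM_GID)   # uncomment to fix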
> >
> > The impact was huge when vdsm restarted glusterd without human
> > intervention.
>
> Is this happening even with the fencing policies set for ensuring that
> gluster quorum is not lost?
>
> There are 2 policies that you need to enable at the cluster level -
> Skip fencing if Gluster bricks are UP
> Skip fencing if Gluster quorum not met
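[For reference, the two options above correspond to the
skip_if_gluster_bricks_up and skip_if_gluster_quorum_not_met flags of the
cluster's fencing policy in the engine REST API, so they can also be turned
on with the oVirt Python SDK. A minimal, untested sketch follows; the engine
URL, credentials and cluster name are placeholders, and the attribute names
should be checked against your ovirtsdk4 version.]

    #!/usr/bin/env python3
    # Sketch: enable the gluster-aware fencing policies on one cluster.
    import ovirtsdk4 as sdk
    import ovirtsdk4.types as types

    connection = sdk.Connection(
        url="https://engine.example.com/ovirt-engine/api",  # placeholder
        username="admin@internal",
        password="CHANGE_ME",                                # placeholder
        insecure=True,          # or pass ca_file= for a verified TLS setup
    )
    clusters_service = connection.system_service().clusters_service()
    cluster = clusters_service.list(search="name=Default")[0]  # placeholder
    clusters_service.cluster_service(cluster.id).update(
        types.Cluster(
            fencing_policy=types.FencingPolicy(
                enabled=True,
                skip_if_gluster_bricks_up=True,
                skip_if_gluster_quorum_not_met=True,
            )
        )
    )
    connection.close()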
___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/Z6OBWXWFMIWILZWEMZLEJNRMHL3VLBDF/

