Hi Jim,

On Fri, Jul 6, 2018 at 4:22 PM Jim Kusznir <[email protected]> wrote:

> hi all:
>
> Once again my production ovirt cluster is collapsing in on itself.  My
> servers are intermittently unavailable or degrading, customers are noticing
> and calling in.  This seems to be yet another gluster failure that I
> haven't been able to pin down.
>
> I posted about this a while ago, but didn't get anywhere (no replies that
> I found).
>

cc'ing some people that might be able to assist.


>   The problem started out as a glusterfsd process consuming large amounts
> of ram (up to the point where ram and swap were exhausted and the kernel
> OOM killer killed off the glusterfsd process).  For reasons not clear to me
> at this time, that caused any VMs running on that host and on that
> gluster volume to be paused with I/O error (the glusterfs process is
> usually unharmed; why it didn't continue I/O with other servers is
> confusing to me).
>
> I have 3 servers and a total of 4 gluster volumes (engine, iso, data, and
> data-hdd).  The first 3 are replica 2+arb; the 4th (data-hdd) is replica
> 3.  The first 3 are backed by an LVM partition (some thin provisioned) on
> an SSD; the 4th is on a seagate hybrid disk (hdd + some internal flash for
> acceleration).  data-hdd is the only thing on the disk.  Servers are Dell
> R610 with the PERC/6i raid card, with the disks individually passed through
> to the OS (no raid enabled).
>
> The above RAM usage issue came from the data-hdd volume.  Yesterday, I
> caught one of the glusterfsd processes at high RAM usage before the OOM killer had to
> run.  I was able to migrate the VMs off the machine and for good measure,
> reboot the entire machine (after taking this opportunity to run the
> software updates that ovirt said were pending).  Upon booting back up, the
> necessary volume healing began.  However, this time, the healing caused all
> three servers to go to very, very high load averages (I saw just under 200
> on one server; typically they've been 40-70) with top reporting IO Wait at
> 7-20%.  Network for this volume is a dedicated gig network.  According to
> bwm-ng, initially the network bandwidth would hit 50MB/s (yes, bytes), but
> tailed off to mostly in the kB/s for a while.  All machines' load averages
> were still 40+ and gluster volume heal data-hdd info reported 5 items
> needing healing.  Servers were intermittently experiencing IO issues, even
> on the 3 gluster volumes that appeared largely unaffected.  Even the OS
> activities on the hosts themselves (logging in, running commands) would often
> be very delayed.  The ovirt engine was seemingly randomly throwing engine
> down / engine up / engine failed notifications.  Responsiveness on ANY VM
> was horrific most of the time, with random VMs being inaccessible.
>
> I let the gluster heal run overnight.  By morning, there were still 5
> items needing healing, all three servers were still experiencing high load,
> and servers were still largely unstable.
>
> I've noticed that all of my ovirt outages (and I've had a lot, way more
> than is acceptable for a production cluster) have come from gluster.  I
> still have 3 VMs whose hard disk images have become corrupted by my last
> gluster crash that I haven't had time to repair / rebuild yet (I believe
> this crash was caused by the OOM issue previously mentioned, but I didn't
> know it at the time).
>
> Is gluster really ready for production yet?  It seems so unstable to
> me....  I'm looking at replacing gluster with a dedicated NFS server, likely
> FreeNAS.  Any suggestions?  What is the "right" way to do production
> storage on this (3 node cluster)?  Can I get this gluster volume stable
> enough to get my VMs to run reliably again until I can deploy another
> storage solution?
>
> --Jim
> _______________________________________________
> Users mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
> oVirt Code of Conduct:
> https://www.ovirt.org/community/about/community-guidelines/
> List Archives:
> https://lists.ovirt.org/archives/list/[email protected]/message/YQX3LQFQQPW4JTCB7B6FY2LLR6NA2CB3/
>


-- 

GREG SHEREMETA

SENIOR SOFTWARE ENGINEER - TEAM LEAD - RHV UX

Red Hat NA

<https://www.redhat.com/>

[email protected]    IRC: gshereme
<https://red.ht/sig>