Hi Jim,

On Fri, Jul 6, 2018 at 4:22 PM Jim Kusznir <[email protected]> wrote:
> hi all:
>
> Once again my production ovirt cluster is collapsing in on itself. My
> servers are intermittently unavailable or degrading, customers are noticing
> and calling in. This seems to be yet another gluster failure that I
> haven't been able to pin down.
>
> I posted about this a while ago, but didn't get anywhere (no replies that
> I found).

cc'ing some people that might be able to assist.

> The problem started out as a glusterfsd process consuming large amounts
> of ram (up to the point where ram and swap were exhausted and the kernel
> OOM killer killed off the glusterfsd process). For reasons not clear to me
> at this time, that resulted in any VMs running on that host and that
> gluster volume being paused with I/O error (the glusterfs process is
> usually unharmed; why it didn't continue I/O with other servers is
> confusing to me).
>
> I have 3 servers and a total of 4 gluster volumes (engine, iso, data, and
> data-hdd). The first 3 are replica 2+arb; the 4th (data-hdd) is replica
> 3. The first 3 are backed by an LVM partition (some thin provisioned) on
> an SSD; the 4th is on a Seagate hybrid disk (hdd + some internal flash for
> acceleration). data-hdd is the only thing on the disk. Servers are Dell
> R610 with the PERC/6i raid card, with the disks individually passed through
> to the OS (no raid enabled).
>
> The above RAM usage issue came from the data-hdd volume. Yesterday, I
> caught one of the glusterfsd high ram usage episodes before the OOM killer
> had to run. I was able to migrate the VMs off the machine and, for good
> measure, reboot the entire machine (after taking this opportunity to run
> the software updates that ovirt said were pending). Upon booting back up,
> the necessary volume healing began. However, this time, the healing caused
> all three servers to go to very, very high load averages (I saw just under
> 200 on one server; typically they've been 40-70), with top reporting IO
> wait at 7-20%.
> Network for this volume is a dedicated gig network. According to
> bwm-ng, initially the network bandwidth would hit 50MB/s (yes, bytes), but
> tailed off to mostly the kB/s range for a while. All machines' load
> averages were still 40+, and gluster volume heal data-hdd info reported 5
> items needing healing. Servers were intermittently experiencing IO issues,
> even on the 3 gluster volumes that appeared largely unaffected. Even the
> OS activities on the hosts themselves (logging in, running commands) would
> often be very delayed. The ovirt engine was seemingly randomly throwing
> engine down / engine up / engine failed notifications. Responsiveness on
> ANY VM was horrific most of the time, with random VMs being inaccessible.
>
> I let the gluster heal run overnight. By morning, there were still 5
> items needing healing, all three servers were still experiencing high load,
> and servers were still largely unstable.
>
> I've noticed that all of my ovirt outages (and I've had a lot, way more
> than is acceptable for a production cluster) have come from gluster. I
> still have 3 VMs whose hard disk images have become corrupted by my last
> gluster crash that I haven't had time to repair / rebuild yet (I believe
> this crash was caused by the OOM issue previously mentioned, but I didn't
> know it at the time).
>
> Is gluster really ready for production yet? It seems so unstable to
> me.... I'm looking at replacing gluster with a dedicated NFS server,
> likely FreeNAS. Any suggestions? What is the "right" way to do production
> storage on this (3 node cluster)? Can I get this gluster volume stable
> enough to get my VMs to run reliably again until I can deploy another
> storage solution?
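While the heal is running, it can help to reduce the output of gluster volume heal data-hdd info to a single pending-entry count and watch it over time, to see whether those 5 items are actually moving. A minimal sketch, run here against illustrative sample output (the brick paths and file names below are made up, not from your cluster; in production the awk would read the live command's output):

```shell
#!/bin/sh
# Sum the "Number of entries:" lines across all bricks to get one
# pending-heal count. Sample output is illustrative only.
sample_output='Brick ovirt1:/gluster/brick3/data-hdd
/images/vm1.img
/images/vm2.img
Status: Connected
Number of entries: 2

Brick ovirt2:/gluster/brick3/data-hdd
/images/vm1.img
Status: Connected
Number of entries: 1

Brick ovirt3:/gluster/brick3/data-hdd
/images/vm1.img
/images/vm2.img
Status: Connected
Number of entries: 2'

# Live version would pipe the real command instead of the sample:
#   gluster volume heal data-hdd info | awk '...'
pending=$(printf '%s\n' "$sample_output" |
    awk '/^Number of entries:/ { total += $NF } END { print total+0 }')
echo "pending heal entries: $pending"
```

Logging that number every few minutes (cron or a watch loop) shows whether the heal is stalled on the same entries or slowly progressing, which changes what the next debugging step should be.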
> --Jim
> _______________________________________________
> Users mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
> oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
> List Archives: https://lists.ovirt.org/archives/list/[email protected]/message/YQX3LQFQQPW4JTCB7B6FY2LLR6NA2CB3/

--
GREG SHEREMETA
SENIOR SOFTWARE ENGINEER - TEAM LEAD - RHV UX
Red Hat NA <https://www.redhat.com/>
[email protected]
IRC: gshereme <https://red.ht/sig>
_______________________________________________
Users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/[email protected]/message/QXJYWM6W4SNEYLQHQLV6CD3ZZRR2S7ED/
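On the RAM side, until the underlying glusterfsd leak is found, one stopgap is to watch the process's resident set size and alert well before the OOM killer fires, giving time to migrate VMs off the host as described above. A rough sketch, assuming Linux /proc; the 8 GiB threshold is an arbitrary example, and the demo reads the current shell's own pid since no glusterfsd is running here:

```shell
#!/bin/sh
# Read a process's resident set size (kB) from /proc so a loop or cron
# job can flag a runaway glusterfsd before the kernel OOM killer does.
rss_kb() {
    # VmRSS line of /proc/<pid>/status; Linux-specific.
    awk '/^VmRSS:/ { print $2 }' "/proc/$1/status"
}

# Demo target: this shell. In production the pid would come from
# something like: pid=$(pidof glusterfsd)
pid=$$
limit_kb=$((8 * 1024 * 1024))   # example threshold: 8 GiB

rss=$(rss_kb "$pid")
if [ "$rss" -gt "$limit_kb" ]; then
    echo "WARN: pid $pid RSS ${rss} kB exceeds ${limit_kb} kB"
else
    echo "OK: pid $pid RSS ${rss} kB"
fi
```

Wiring the WARN branch to mail or to the monitoring system already watching these hosts would turn the silent climb toward swap exhaustion into an actionable alert.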

