In case it matters, the data-hdd gluster volume uses these hard drives: https://www.amazon.com/gp/product/B01M1NHCZT/ref=oh_aui_detailpage_o05_s00?ie=UTF8&psc=1
This is in a Dell R610 with a PERC6/i (one drive per server, configured as a single-drive volume to pass it through as its own /dev/sd* device). Inside the OS, it's partitioned with LVM thin provisioning, then an LVM volume formatted with XFS and mounted as /gluster/brick3, with the data-hdd volume created inside that.

--Jim

On Fri, Jul 6, 2018 at 10:45 PM, Jim Kusznir <[email protected]> wrote:
> So, I'm still at a loss... It sounds like it's either insufficient RAM/swap, or insufficient network. It seems to be neither now. At this point, it appears that gluster is just "broke" and killing my systems for no discernible reason. Here are details, all from the same system (currently running 3 VMs):
>
> [root@ovirt3 ~]# w
>  22:26:53 up 36 days, 4:34, 1 user, load average: 42.78, 55.98, 53.31
> USER     TTY      FROM            LOGIN@   IDLE   JCPU   PCPU WHAT
> root     pts/0    192.168.8.90    22:26    2.00s  0.12s  0.11s w
>
> bwm-ng reports the highest data usage was about 6MB/s during this test (and that was combined; I have two different gig networks: the gluster network (primary VM storage) runs on one, and the other network handles everything else).
>
> [root@ovirt3 ~]# free -m
>               total        used        free      shared  buff/cache   available
> Mem:          31996       13236         232          18       18526       18195
> Swap:         16383        1475       14908
>
> top - 22:32:56 up 36 days, 4:41, 1 user, load average: 17.99, 39.69, 47.66
> Tasks: 407 total, 1 running, 405 sleeping, 1 stopped, 0 zombie
> %Cpu(s): 8.6 us, 2.1 sy, 0.0 ni, 87.6 id, 1.6 wa, 0.0 hi, 0.1 si, 0.0 st
> KiB Mem : 32764284 total, 228296 free, 13541952 used, 18994036 buff/cache
> KiB Swap: 16777212 total, 15246200 free, 1531012 used. 18643960 avail Mem
>
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
> 30036 qemu      20   0 6872324   5.2g  13532 S 144.6 16.5 216:14.55 /usr/libexec/qemu-kvm -name guest=BillingWin,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/v+
> 28501 qemu      20   0 5034968   3.6g  12880 S  16.2 11.7  73:44.99 /usr/libexec/qemu-kvm -name guest=FusionPBX,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/va+
>  2694 root      20   0 2169224  12164   3108 S   5.0  0.0   3290:42 /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id data.ovirt3.nwfiber.com.gluster-brick2-data -p /var/run/+
> 14293 root      15  -5  944700  13356   4436 S   4.0  0.0  16:32.15 /usr/sbin/glusterfs --volfile-server=192.168.8.11 --volfile-server=192.168.8.12 --volfile-server=192.168.8.13 --+
> 25100 vdsm       0 -20 6747440 107868  12836 S   2.3  0.3  21:35.20 /usr/bin/python2 /usr/share/vdsm/vdsmd
> 28971 qemu      20   0 2842592   1.5g  13548 S   1.7  4.7 241:46.49 /usr/libexec/qemu-kvm -name guest=unifi.palousetech.com,debug-threads=on -S -object secret,id=masterKey0,format=+
> 12095 root      20   0  162276   2836   1868 R   1.3  0.0   0:00.25 top
>  2708 root      20   0 1906040  12404   3080 S   1.0  0.0   1083:33 /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id engine.ovirt3.nwfiber.com.gluster-brick1-engine -p /var/+
> 28623 qemu      20   0 4749536   1.7g  12896 S   0.7  5.5   4:30.64 /usr/libexec/qemu-kvm -name guest=billing.nwfiber.com,debug-threads=on -S -object secret,id=masterKey0,format=ra+
>    10 root      20   0       0      0      0 S   0.3  0.0 215:54.72 [rcu_sched]
>  1030 sanlock   rt   0  773804  27908   2744 S   0.3  0.1  35:55.61 /usr/sbin/sanlock daemon
>  1890 zabbix    20   0   83904   1696   1612 S   0.3  0.0  24:30.63 /usr/sbin/zabbix_agentd: collector [idle 1 sec]
>  2722 root      20   0 1298004   6148   2580 S   0.3  0.0  38:10.82 /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id iso.ovirt3.nwfiber.com.gluster-brick4-iso -p /var/run/gl+
>  6340 root      20   0       0      0      0 S   0.3  0.0   0:04.30 [kworker/7:0]
> 10652 root      20   0       0      0      0 S   0.3  0.0   0:00.23 [kworker/u64:2]
> 14724 root      20   0 1076344  17400   3200 S   0.3  0.1  10:04.13 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -+
> 22011 root      20   0       0      0      0 S   0.3  0.0   0:05.04 [kworker/10:1]
>
> Not sure why the system load dropped other than I was trying to take a picture of it :)
>
> In any case, it appears that at this time I have plenty of swap, RAM, and network capacity, and yet things are still running very sluggishly. I'm still getting e-mails from servers complaining about loss of communication with something or other; I still get e-mails from the engine about bad engine status, then recovery, etc.
>
> I've shut down 2/3 of my VMs, too... just trying to keep the critical ones operating.
>
> At this point, I don't believe the problem is the memory leak, but it seems to be triggered by the memory leak, in that all my problems started when I got low-RAM warnings from one of my 3 nodes and began recovery efforts from that.
>
> I do really like the idea / concept behind glusterfs, but I really have to figure out why it has performed so poorly from day one, and why it has caused 95% of my outages (including several large ones lately). If I can get it stable, reliable, and well performing, then I'd love to keep it. If I can't, then perhaps NFS is the way to go? I don't like the single-point-of-failure aspect of it, but my other NAS boxes that I run for clients (central storage for Windows boxes) have been very solid; if I could get that kind of reliability for my oVirt stack, it would be a substantial improvement. Currently, it seems about every other month I have a gluster-induced outage.
>
> Sometimes I wonder if hyperconverged itself is the issue, but my infrastructure doesn't justify three servers at the same location... I might be able to do two, but even that seems like it's pushing it.
>
> Looks like I can upgrade to 10G for about $900. I can order a dual-Xeon Supermicro 12-disk server, loaded with 2TB WD Enterprise disks and a pair of SSDs for the OS, 32GB RAM, and 2.67GHz CPUs for about $720 delivered. I've got to do something to improve my reliability; I can't keep going the way I have been....
>
> --Jim
>
> On Fri, Jul 6, 2018 at 9:13 PM, Johan Bernhardsson <[email protected]> wrote:
>
>> Load like that is mostly I/O based: either the machine is swapping or the network is too slow. Check I/O wait in top.
>>
>> And about the problem where the OOM killer kills off gluster: does that mean you don't monitor RAM usage on the servers? Either gluster is eating all your RAM, swap gets really I/O intensive, and the process is then killed off, or you have the wrong swap settings in sysctl.conf (there are tons of broken guides that recommend setting swappiness to 0, but that disables swap on newer kernels; the proper swappiness for swapping only when necessary is 1 or a sufficiently low number like 10, the default being 60).
>>
>> Moving to NFS will not improve things. You will get more memory back since gluster isn't running, and that is good, but you will have a single node that can fail with all your storage, it would still be on 1 gigabit only, and your three-node cluster would easily saturate that link.
>>
>> On July 7, 2018 04:13:13 Jim Kusznir <[email protected]> wrote:
>>
>>> So far it does not appear to be helping much. I'm still getting VMs locking up and all kinds of notices from the oVirt engine about non-responsive hosts.
>>> I'm still seeing load averages in the 20-30 range.
>>>
>>> Jim
>>>
>>> On Fri, Jul 6, 2018, 3:13 PM Jim Kusznir <[email protected]> wrote:
>>>
>>>> Thank you for the advice and help.
>>>>
>>>> I do plan on going to 10Gbps networking; I haven't quite jumped off that cliff yet, though.
>>>>
>>>> I did put my data-hdd (main VM storage volume) onto a dedicated 1Gbps network, and I've watched throughput on that and never seen more than 60MB/s achieved (as reported by bwm-ng). I have a separate 1Gbps network for communication and oVirt migration, but I wanted to break that up further (separate out VM traffic from migration/mgmt traffic). My three SSD-backed gluster volumes run on the main network too, as I haven't been able to get them to move to the new network (which I was trying to use for all gluster traffic). I tried bonding, but that seemed to reduce performance rather than improve it.
>>>>
>>>> --Jim
>>>>
>>>> On Fri, Jul 6, 2018 at 2:52 PM, Jamie Lawrence <[email protected]> wrote:
>>>>
>>>>> Hi Jim,
>>>>>
>>>>> I don't have any targeted suggestions, because there isn't much to latch on to. I can say that Gluster replica three (no arbiters) on dedicated servers serving a couple of oVirt VM clusters here has not had these sorts of issues.
>>>>>
>>>>> I suspect your long heal times (and the resultant long periods of high load) are at least partly related to 1G networking. That is just a matter of I/O: heals of VMs involve moving a lot of bits. My cluster uses 10G bonded NICs on the gluster and oVirt boxes for storage traffic and separate bonded 1G for ovirtmgmt and communication with other machines/people, and we're occasionally hitting the bandwidth ceiling on the storage network. I'm starting to think about 40/100G, different ways of splitting up intensive systems, and considering iSCSI for specific volumes, although I really don't want to go there.
>>>>>
>>>>> I don't run FreeNAS[1], but I do run FreeBSD as storage servers for their excellent ZFS implementation, mostly for backups. ZFS will make your `heal` problem go away, but not your bandwidth problems, which become worse (because of fewer NICs pushing traffic). 10G hardware is not exactly in impulse-buy territory, but if you can, I'd recommend doing some testing using it. I think at least some of your problems are related.
>>>>>
>>>>> If that's not possible, my next stop would be optimizing everything I could about sharding and healing, and tuning around the shard size, to squeeze as much performance out of 1G as I could, but that will only go so far.
>>>>>
>>>>> -j
>>>>>
>>>>> [1] FreeNAS is just a storage-tuned FreeBSD with a GUI.
>>>>>
>>>>> > On Jul 6, 2018, at 1:19 PM, Jim Kusznir <[email protected]> wrote:
>>>>> >
>>>>> > Hi all:
>>>>> >
>>>>> > Once again my production oVirt cluster is collapsing in on itself. My servers are intermittently unavailable or degrading, customers are noticing and calling in. This seems to be yet another gluster failure that I haven't been able to pin down.
>>>>> >
>>>>> > I posted about this a while ago, but didn't get anywhere (no replies that I found). The problem started out as a glusterfsd process consuming large amounts of RAM (up to the point where RAM and swap were exhausted and the kernel OOM killer killed off the glusterfsd process).
>>>>> > For reasons not clear to me at this time, that resulted in any VMs running on that host and that gluster volume being paused with an I/O error (the glusterfs process is usually unharmed; why it didn't continue I/O with the other servers is confusing to me).
>>>>> >
>>>>> > I have 3 servers and a total of 4 gluster volumes (engine, iso, data, and data-hdd). The first 3 are replica 2+arb; the 4th (data-hdd) is replica 3. The first 3 are backed by an LVM partition (some thin provisioned) on an SSD; the 4th is on a Seagate hybrid disk (HDD plus some internal flash for acceleration). data-hdd is the only thing on that disk. Servers are Dell R610s with the PERC6/i RAID card, with the disks individually passed through to the OS (no RAID enabled).
>>>>> >
>>>>> > The above RAM usage issue came from the data-hdd volume. Yesterday, I caught one of the glusterfsd processes at high RAM usage before the OOM killer had to run. I was able to migrate the VMs off the machine and, for good measure, reboot the entire machine (after taking the opportunity to run the software updates that oVirt said were pending). Upon booting back up, the necessary volume healing began. However, this time the healing caused all three servers to go to very, very high load averages (I saw just under 200 on one server; typically they've been 40-70), with top reporting I/O wait at 7-20%. The network for this volume is a dedicated gig network. According to bwm-ng, initially the network bandwidth would hit 50MB/s (yes, bytes), but it tailed off to mostly kB/s for a while. All machines' load averages were still 40+, and gluster volume heal data-hdd info reported 5 items needing healing. Servers were intermittently experiencing I/O issues, even on the 3 gluster volumes that appeared largely unaffected. Even OS activities on the hosts themselves (logging in, running commands) would often be very delayed. The oVirt engine was seemingly randomly throwing engine down / engine up / engine failed notifications. Responsiveness on ANY VM was horrific most of the time, with random VMs being inaccessible.
>>>>> >
>>>>> > I let the gluster heal run overnight. By morning, there were still 5 items needing healing, all three servers were still experiencing high load, and servers were still largely unstable.
>>>>> >
>>>>> > I've noticed that all of my oVirt outages (and I've had a lot, way more than is acceptable for a production cluster) have come from gluster. I still have 3 VMs whose hard disk images were corrupted by my last gluster crash and that I haven't had time to repair / rebuild yet (I believe that crash was caused by the OOM issue previously mentioned, but I didn't know it at the time).
>>>>> >
>>>>> > Is gluster really ready for production yet? It seems so unstable to me.... I'm looking at replacing gluster with a dedicated NFS server, likely FreeNAS. Any suggestions? What is the "right" way to do production storage on this 3-node cluster? Can I get this gluster volume stable enough to get my VMs to run reliably again until I can deploy another storage solution?
>>>>> >
>>>>> > --Jim
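
A quick aside on chasing heals like the ones described in the post above: the commands below are a minimal sketch of how the stuck data-hdd heal could be watched and, cautiously, sped up from any of the gluster nodes. The volume name comes from the thread; the tuning values are illustrative rather than taken from these hosts, and the option names assume a Gluster 3.x-era CLI.

    # show the entries still pending heal, and a running count per brick
    gluster volume heal data-hdd info
    gluster volume heal data-hdd statistics heal-count

    # confirm every brick and self-heal daemon is actually online
    gluster volume status data-hdd

    # optionally let the self-heal daemon work harder while a big heal runs
    # (defaults are 1 thread and a queue length of 1024)
    gluster volume set data-hdd cluster.shd-max-threads 4
    gluster volume set data-hdd cluster.shd-wait-qlength 2048

Whether more self-heal threads actually help, or just push a 1G storage network even harder, is exactly the trade-off Jamie raises above.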
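Johan's swappiness comment is the easiest item in the thread to act on. The snippet below is a sketch of the conservative setting he describes; the file name is arbitrary and the value is from his suggested range, not something read off these servers.

    # /etc/sysctl.d/90-swappiness.conf  (any file name under sysctl.d works)
    # swap only under real memory pressure; per Johan's note, 0 disables swap on newer kernels
    vm.swappiness = 10

    # apply now and verify
    sysctl -p /etc/sysctl.d/90-swappiness.conf
    cat /proc/sys/vm/swappiness
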
_______________________________________________
Users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/[email protected]/message/5KZZ6WMCDKFC62ACTYWEA4LBRUBL3AVY/
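
For completeness, here is a rough sketch of the brick layout Jim describes at the top of the thread (one passed-through disk, thin-provisioned LVM, XFS, mounted at /gluster/brick3). The device name, volume group name, and sizes are hypothetical; only the mount point and the general shape come from the thread.

    # the passed-through PERC6/i single-drive volume is assumed to appear as /dev/sdb
    pvcreate /dev/sdb
    vgcreate gluster_vg_hdd /dev/sdb

    # thin pool over most of the VG, then a thin LV carved out of it
    lvcreate --type thin-pool -l 90%FREE -n tp_hdd gluster_vg_hdd
    lvcreate -V 1800G -n brick3 --thinpool tp_hdd gluster_vg_hdd

    # XFS with the 512-byte inode size commonly recommended for gluster bricks
    mkfs.xfs -i size=512 /dev/gluster_vg_hdd/brick3
    mkdir -p /gluster/brick3
    mount /dev/gluster_vg_hdd/brick3 /gluster/brick3

The data-hdd brick directory would then live under /gluster/brick3, with the gluster volume created on top of it, as described in Jim's message.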

