I think I should throw one more thing out there: the current batch of problems started essentially today, and I did apply the updates waiting in the oVirt repos (through the oVirt mgmt interface: install updates). Perhaps something from that is now breaking things.
On Fri, Jul 6, 2018 at 10:51 PM, Jim Kusznir <[email protected]> wrote:

In case it matters, the data-hdd gluster volume uses these hard drives:

https://www.amazon.com/gp/product/B01M1NHCZT/ref=oh_aui_detailpage_o05_s00?ie=UTF8&psc=1

This is in a Dell R610 with a PERC6/i (one drive per server, configured as a single-drive volume to pass it through as its own /dev/sd* device). Inside the OS, it's partitioned with LVM thin provisioning, then an LVM volume formatted with XFS and mounted as /gluster/brick3, with the data-hdd volume created inside that.

--Jim
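For reference, a brick laid out the way Jim describes could be built roughly like this. This is a minimal sketch: the device (/dev/sdb), the VG/LV names, and the sizes are placeholders, not details taken from this thread.

  # Passed-through disk as an LVM physical volume (device name is an example)
  pvcreate /dev/sdb
  vgcreate gluster_vg /dev/sdb

  # Thin pool, then a thin LV carved out of it (sizes are illustrative)
  lvcreate -l 100%FREE -T gluster_vg/brick3_pool
  lvcreate -V 2T -T gluster_vg/brick3_pool -n brick3_lv

  # XFS with 512-byte inodes (the common recommendation for gluster bricks)
  mkfs.xfs -i size=512 /dev/gluster_vg/brick3_lv
  mkdir -p /gluster/brick3
  mount /dev/gluster_vg/brick3_lv /gluster/brick3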
On Fri, Jul 6, 2018 at 10:45 PM, Jim Kusznir <[email protected]> wrote:

So, I'm still at a loss... It sounds like it's either insufficient ram/swap or insufficient network. It seems to be neither now. At this point, it appears that gluster is just "broke" and killing my systems for no discernible reason. Here are details, all from the same system (currently running 3 VMs):

[root@ovirt3 ~]# w
 22:26:53 up 36 days,  4:34,  1 user,  load average: 42.78, 55.98, 53.31
USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
root     pts/0    192.168.8.90     22:26    2.00s  0.12s  0.11s w

bwm-ng reports the highest data usage was about 6MB/s during this test (and that was combined; I have two different gig networks. One gluster network (primary VM storage) runs on one; the other network handles everything else).

[root@ovirt3 ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:          31996       13236         232          18       18526       18195
Swap:         16383        1475       14908

top - 22:32:56 up 36 days, 4:41, 1 user, load average: 17.99, 39.69, 47.66
Tasks: 407 total, 1 running, 405 sleeping, 1 stopped, 0 zombie
%Cpu(s): 8.6 us, 2.1 sy, 0.0 ni, 87.6 id, 1.6 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem : 32764284 total, 228296 free, 13541952 used, 18994036 buff/cache
KiB Swap: 16777212 total, 15246200 free, 1531012 used. 18643960 avail Mem

  PID USER    PR  NI    VIRT    RES   SHR S  %CPU %MEM     TIME+ COMMAND
30036 qemu    20   0 6872324   5.2g 13532 S 144.6 16.5 216:14.55 /usr/libexec/qemu-kvm -name guest=BillingWin,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/v+
28501 qemu    20   0 5034968   3.6g 12880 S  16.2 11.7  73:44.99 /usr/libexec/qemu-kvm -name guest=FusionPBX,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/va+
 2694 root    20   0 2169224  12164  3108 S   5.0  0.0   3290:42 /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id data.ovirt3.nwfiber.com.gluster-brick2-data -p /var/run/+
14293 root    15  -5  944700  13356  4436 S   4.0  0.0  16:32.15 /usr/sbin/glusterfs --volfile-server=192.168.8.11 --volfile-server=192.168.8.12 --volfile-server=192.168.8.13 --+
25100 vdsm     0 -20 6747440 107868 12836 S   2.3  0.3  21:35.20 /usr/bin/python2 /usr/share/vdsm/vdsmd
28971 qemu    20   0 2842592   1.5g 13548 S   1.7  4.7 241:46.49 /usr/libexec/qemu-kvm -name guest=unifi.palousetech.com,debug-threads=on -S -object secret,id=masterKey0,format=+
12095 root    20   0  162276   2836  1868 R   1.3  0.0   0:00.25 top
 2708 root    20   0 1906040  12404  3080 S   1.0  0.0   1083:33 /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id engine.ovirt3.nwfiber.com.gluster-brick1-engine -p /var/+
28623 qemu    20   0 4749536   1.7g 12896 S   0.7  5.5   4:30.64 /usr/libexec/qemu-kvm -name guest=billing.nwfiber.com,debug-threads=on -S -object secret,id=masterKey0,format=ra+
   10 root    20   0       0      0     0 S   0.3  0.0 215:54.72 [rcu_sched]
 1030 sanlock rt   0  773804  27908  2744 S   0.3  0.1  35:55.61 /usr/sbin/sanlock daemon
 1890 zabbix  20   0   83904   1696  1612 S   0.3  0.0  24:30.63 /usr/sbin/zabbix_agentd: collector [idle 1 sec]
 2722 root    20   0 1298004   6148  2580 S   0.3  0.0  38:10.82 /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id iso.ovirt3.nwfiber.com.gluster-brick4-iso -p /var/run/gl+
 6340 root    20   0       0      0     0 S   0.3  0.0   0:04.30 [kworker/7:0]
10652 root    20   0       0      0     0 S   0.3  0.0   0:00.23 [kworker/u64:2]
14724 root    20   0 1076344  17400  3200 S   0.3  0.1  10:04.13 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -+
22011 root    20   0       0      0     0 S   0.3  0.0   0:05.04 [kworker/10:1]

Not sure why the system load dropped other than I was trying to take a picture of it :)

In any case, it appears that at this time I have plenty of swap, ram, and network capacity, and yet things are still running very sluggish; I'm still getting e-mails from servers complaining about loss of communication with something or another; I still get e-mails from the engine about bad engine status, then recovery, etc.

I've shut down 2/3 of my VMs, too... just trying to keep the critical ones operating.

At this point, I don't believe the problem is the memory leak itself, but it seems to be triggered by the memory leak, in that all my problems started when I got low ram warnings from one of my 3 nodes and began recovery efforts from that.

I do really like the idea / concept behind glusterfs, but I really have to figure out why it's been so poor performing from day one, and why it's caused 95% of my outages (including several large ones lately). If I can get it stable, reliable, and well performing, then I'd love to keep it. If I can't, then perhaps NFS is the way to go? I don't like the single-point-of-failure aspect of it, but the other NAS boxes I run for clients (central storage for windows boxes) have been very solid; if I could get that kind of reliability for my ovirt stack, it would be a substantial improvement. Currently, it seems about every other month I have a gluster-induced outage.

Sometimes I wonder if hyperconverged itself is the issue, but my infrastructure doesn't justify three servers at the same location... I might be able to do two, but even that seems like it's pushing it.

Looks like I can upgrade to 10G for about $900. I can order a dual-Xeon supermicro 12-disk server, loaded with 2TB WD Enterprise disks and a pair of SSDs for the OS, 32GB ram, 2.67GHz CPUs, for about $720 delivered. I've got to do something to improve my reliability; I can't keep going the way I have been...

--Jim
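A load average in the 40s with 87% idle CPU and low network traffic usually means processes are blocked on disk I/O rather than burning CPU. A few stock tools (sysstat/procps, nothing oVirt-specific) can confirm where the time is going; these are generic diagnostics, not commands from the thread:

  # Per-device latency and utilization; high await/%util on the brick
  # device implicates the disk rather than the network (sysstat package)
  iostat -xz 5

  # 'b' column = processes blocked on I/O; 'wa' = I/O wait percentage
  vmstat 5

  # Processes stuck in uninterruptible sleep (state D) inflate load average
  ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/'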
On Fri, Jul 6, 2018 at 9:13 PM, Johan Bernhardsson <[email protected]> wrote:

Load like that is mostly I/O-based: either the machine is swapping or the network is too slow. Check I/O wait in top.

And the problem where the OOM killer kills off gluster: does that mean you don't monitor ram usage on the servers? Either something is eating all your ram, swap gets really I/O-intensive, and then gluster is killed off; or you have the wrong swap settings in sysctl.conf. (There are tons of broken guides that recommend setting swappiness to 0, but that disables swap on newer kernels. The proper swappiness for swapping only when necessary is 1, or a sufficiently low number like 10; the default is 60.)

Moving to NFS will not improve things. You will get more memory back since gluster isn't running, and that is good. But you will have a single node that can fail with all your storage, it would still be on 1 gigabit only, and your three-node cluster would easily saturate that link.
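If swappiness is the culprit, it takes a minute to check and fix. A minimal sketch, using a value in the 1-10 range Johan suggests:

  # Current value (default 60; 0 effectively disables swap on newer kernels)
  sysctl vm.swappiness

  # Apply immediately
  sysctl -w vm.swappiness=10

  # Persist across reboots
  echo 'vm.swappiness = 10' > /etc/sysctl.d/99-swappiness.conf
  sysctl --system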
On July 7, 2018 04:13:13 Jim Kusznir <[email protected]> wrote:

So far it does not appear to be helping much. I'm still getting VMs locking up and all kinds of notices from the oVirt engine about non-responsive hosts. I'm still seeing load averages in the 20-30 range.

--Jim

On Fri, Jul 6, 2018, 3:13 PM Jim Kusznir <[email protected]> wrote:

Thank you for the advice and help.

I do plan on going 10Gbps networking; haven't quite jumped off that cliff yet, though.

I did put my data-hdd (main VM storage volume) onto a dedicated 1Gbps network, and I've watched throughput on that and never seen more than 60MB/s achieved (as reported by bwm-ng). I have a separate 1Gbps network for communication and ovirt migration, but I wanted to break that up further (separate out VM traffic from migration/mgmt traffic). My three SSD-backed gluster volumes run on the main network too, as I haven't been able to get them to move to the new network (which I was trying to use as all gluster). I tried bonding, but that seemed to reduce performance rather than improve it.

--Jim

On Fri, Jul 6, 2018 at 2:52 PM, Jamie Lawrence <[email protected]> wrote:

Hi Jim,

I don't have any targeted suggestions, because there isn't much to latch on to. I can say Gluster replica three (no arbiters) on dedicated servers serving a couple Ovirt VM clusters here have not had these sorts of issues.

I suspect your long heal times (and the resultant long periods of high load) are at least partly related to 1G networking. That is just a matter of IO - heals of VMs involve moving a lot of bits. My cluster uses 10G bonded NICs on the gluster and ovirt boxes for storage traffic and separate bonded 1G for ovirtmgmt and communication with other machines/people, and we're occasionally hitting the bandwidth ceiling on the storage network. I'm starting to think about 40/100G, different ways of splitting up intensive systems, and considering iSCSI for specific volumes, although I really don't want to go there.

I don't run FreeNAS[1], but I do run FreeBSD as storage servers for their excellent ZFS implementation, mostly for backups. ZFS will make your `heal` problem go away, but not your bandwidth problems, which become worse (because of fewer NICs pushing traffic). 10G hardware is not exactly in impulse-buy territory, but if you can, I'd recommend doing some testing using it. I think at least some of your problems are related.

If that's not possible, my next stops would be optimizing everything I could about sharding and healing, and tuning for the shard size to squeeze as much performance out of 1G as I could, but that will only go so far.

-j

[1] FreeNAS is just a storage-tuned FreeBSD with a GUI.
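The sharding and heal knobs Jamie mentions are ordinary gluster volume options. The sketch below shows how to inspect and set them on the volume from this thread; the specific values are illustrative examples rather than tested recommendations, and note that shard-block-size only affects newly created files:

  # Inspect current sharding settings on the volume
  gluster volume get data-hdd features.shard
  gluster volume get data-hdd features.shard-block-size

  # Examples of options people tune for VM-image workloads over slow links
  gluster volume set data-hdd cluster.shd-max-threads 4
  gluster volume set data-hdd cluster.data-self-heal-algorithm full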
On Jul 6, 2018, at 1:19 PM, Jim Kusznir <[email protected]> wrote:

hi all:

Once again my production ovirt cluster is collapsing in on itself. My servers are intermittently unavailable or degrading, customers are noticing and calling in. This seems to be yet another gluster failure that I haven't been able to pin down.

I posted about this a while ago, but didn't get anywhere (no replies that I found). The problem started out as a glusterfsd process consuming large amounts of ram (up to the point where ram and swap were exhausted and the kernel OOM killer killed off the glusterfsd process). For reasons not clear to me at this time, that resulted in any VMs running on that host and that gluster volume being paused with I/O error (the glusterfs process is usually unharmed; why it didn't continue I/O with other servers is confusing to me).

I have 3 servers and a total of 4 gluster volumes (engine, iso, data, and data-hdd). The first 3 are replica 2+arb; the 4th (data-hdd) is replica 3. The first 3 are backed by an LVM partition (some thin provisioned) on an SSD; the 4th is on a Seagate hybrid disk (hdd + some internal flash for acceleration). data-hdd is the only thing on the disk. Servers are Dell R610 with the PERC/6i raid card, with the disks individually passed through to the OS (no raid enabled).

The above RAM usage issue came from the data-hdd volume. Yesterday, I caught one of the glusterfsd high-ram-usage episodes before the OOM killer had to run. I was able to migrate the VMs off the machine and, for good measure, reboot the entire machine (after taking this opportunity to run the software updates that ovirt said were pending). Upon booting back up, the necessary volume healing began. However, this time, the healing caused all three servers to go to very, very high load averages (I saw just under 200 on one server; typically they've been 40-70) with top reporting IO wait at 7-20%. Network for this volume is a dedicated gig network. According to bwm-ng, initially the network bandwidth would hit 50MB/s (yes, bytes), but tailed off to mostly in the kB/s range for a while. All machines' load averages were still 40+, and "gluster volume heal data-hdd info" reported 5 items needing healing. Servers were intermittently experiencing IO issues, even on the 3 gluster volumes that appeared largely unaffected. Even OS activities on the hosts themselves (logging in, running commands) would often be very delayed. The ovirt engine was seemingly randomly throwing engine down / engine up / engine failed notifications. Responsiveness on ANY VM was horrific most of the time, with random VMs being inaccessible.

I let the gluster heal run overnight. By morning, there were still 5 items needing healing, all three servers were still experiencing high load, and servers were still largely unstable.

I've noticed that all of my ovirt outages (and I've had a lot, way more than is acceptable for a production cluster) have come from gluster. I still have 3 VMs whose hard disk images have become corrupted by my last gluster crash that I haven't had time to repair / rebuild yet (I believe this crash was caused by the OOM issue previously mentioned, but I didn't know it at the time).

Is gluster really ready for production yet? It seems so unstable to me... I'm looking at replacing gluster with a dedicated NFS server, likely FreeNAS. Any suggestions? What is the "right" way to do production storage on this (3 node cluster)? Can I get this gluster volume stable enough to get my VMs to run reliably again until I can deploy another storage solution?

--Jim
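For anyone following along, the heal status Jim quotes comes from the standard gluster CLI. These are the usual commands for watching a heal on this volume:

  # Files/gfids still pending heal, per brick
  gluster volume heal data-hdd info

  # Condensed per-brick counts (available in newer gluster releases)
  gluster volume heal data-hdd info summary

  # Entries gluster considers split-brain, if any
  gluster volume heal data-hdd info split-brain

  # Brick and self-heal daemon health
  gluster volume status data-hdd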
_______________________________________________
Users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/[email protected]/message/4Y3JCIKDFNSWH2T25PUKZRU2TISJF4W5/

