I think I should throw one more thing out there: the current batch of problems started essentially today, and I did apply the updates waiting in the oVirt repos (through the oVirt mgmt interface: install updates). Perhaps something from that is now breaking things.
On Fri, Jul 6, 2018 at 10:51 PM, Jim Kusznir <[email protected]> wrote:

In case it matters, the data-hdd gluster volume uses these hard drives:

https://www.amazon.com/gp/product/B01M1NHCZT/ref=oh_aui_detailpage_o05_s00?ie=UTF8&psc=1

This is in a Dell R610 with a PERC6/i (one drive per server, configured as a single-drive volume to pass it through as its own /dev/sd* device). Inside the OS, it's partitioned with LVM thin provisioning, then an LVM volume formatted with XFS and mounted as /gluster/brick3, with the data-hdd volume created inside that.

--Jim
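For reference, a brick laid out the way Jim describes could be built roughly like this. This is a minimal sketch: the device (/dev/sdb), the VG/LV names, and the sizes are placeholders, not details taken from this thread.

  # Passed-through disk as an LVM physical volume (device name is an example)
  pvcreate /dev/sdb
  vgcreate gluster_vg /dev/sdb

  # Thin pool, then a thin LV carved out of it (sizes are illustrative)
  lvcreate -l 100%FREE -T gluster_vg/brick3_pool
  lvcreate -V 2T -T gluster_vg/brick3_pool -n brick3_lv

  # XFS with 512-byte inodes (the common recommendation for gluster bricks)
  mkfs.xfs -i size=512 /dev/gluster_vg/brick3_lv
  mkdir -p /gluster/brick3
  mount /dev/gluster_vg/brick3_lv /gluster/brick3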
On Fri, Jul 6, 2018 at 10:45 PM, Jim Kusznir <[email protected]> wrote:

So, I'm still at a loss... It sounds like it's either insufficient ram/swap or insufficient network. It seems to be neither now. At this point, it appears that gluster is just "broke" and killing my systems for no discernible reason. Here are details, all from the same system (currently running 3 VMs):

[root@ovirt3 ~]# w
 22:26:53 up 36 days,  4:34,  1 user,  load average: 42.78, 55.98, 53.31
USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
root     pts/0    192.168.8.90     22:26    2.00s  0.12s  0.11s w

bwm-ng reports the highest data usage was about 6MB/s during this test (and that was combined; I have two different gig networks. One gluster network (primary VM storage) runs on one; the other network handles everything else).

[root@ovirt3 ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:          31996       13236         232          18       18526       18195
Swap:         16383        1475       14908

top - 22:32:56 up 36 days, 4:41, 1 user, load average: 17.99, 39.69, 47.66
Tasks: 407 total, 1 running, 405 sleeping, 1 stopped, 0 zombie
%Cpu(s): 8.6 us, 2.1 sy, 0.0 ni, 87.6 id, 1.6 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem : 32764284 total, 228296 free, 13541952 used, 18994036 buff/cache
KiB Swap: 16777212 total, 15246200 free, 1531012 used. 18643960 avail Mem

  PID USER    PR  NI    VIRT    RES   SHR S  %CPU %MEM     TIME+ COMMAND
30036 qemu    20   0 6872324   5.2g 13532 S 144.6 16.5 216:14.55 /usr/libexec/qemu-kvm -name guest=BillingWin,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/v+
28501 qemu    20   0 5034968   3.6g 12880 S  16.2 11.7  73:44.99 /usr/libexec/qemu-kvm -name guest=FusionPBX,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/va+
 2694 root    20   0 2169224  12164  3108 S   5.0  0.0   3290:42 /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id data.ovirt3.nwfiber.com.gluster-brick2-data -p /var/run/+
14293 root    15  -5  944700  13356  4436 S   4.0  0.0  16:32.15 /usr/sbin/glusterfs --volfile-server=192.168.8.11 --volfile-server=192.168.8.12 --volfile-server=192.168.8.13 --+
25100 vdsm     0 -20 6747440 107868 12836 S   2.3  0.3  21:35.20 /usr/bin/python2 /usr/share/vdsm/vdsmd
28971 qemu    20   0 2842592   1.5g 13548 S   1.7  4.7 241:46.49 /usr/libexec/qemu-kvm -name guest=unifi.palousetech.com,debug-threads=on -S -object secret,id=masterKey0,format=+
12095 root    20   0  162276   2836  1868 R   1.3  0.0   0:00.25 top
 2708 root    20   0 1906040  12404  3080 S   1.0  0.0   1083:33 /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id engine.ovirt3.nwfiber.com.gluster-brick1-engine -p /var/+
28623 qemu    20   0 4749536   1.7g 12896 S   0.7  5.5   4:30.64 /usr/libexec/qemu-kvm -name guest=billing.nwfiber.com,debug-threads=on -S -object secret,id=masterKey0,format=ra+
   10 root    20   0       0      0     0 S   0.3  0.0 215:54.72 [rcu_sched]
 1030 sanlock rt   0  773804  27908  2744 S   0.3  0.1  35:55.61 /usr/sbin/sanlock daemon
 1890 zabbix  20   0   83904   1696  1612 S   0.3  0.0  24:30.63 /usr/sbin/zabbix_agentd: collector [idle 1 sec]
 2722 root    20   0 1298004   6148  2580 S   0.3  0.0  38:10.82 /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id iso.ovirt3.nwfiber.com.gluster-brick4-iso -p /var/run/gl+
 6340 root    20   0       0      0     0 S   0.3  0.0   0:04.30 [kworker/7:0]
10652 root    20   0       0      0     0 S   0.3  0.0   0:00.23 [kworker/u64:2]
14724 root    20   0 1076344  17400  3200 S   0.3  0.1  10:04.13 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -+
22011 root    20   0       0      0     0 S   0.3  0.0   0:05.04 [kworker/10:1]

Not sure why the system load dropped other than I was trying to take a picture of it :)

In any case, it appears that at this time I have plenty of swap, ram, and network capacity, and yet things are still running very sluggish; I'm still getting e-mails from servers complaining about loss of communication with something or another; I still get e-mails from the engine about bad engine status, then recovery, etc.

I've shut down 2/3 of my VMs, too... just trying to keep the critical ones operating.

At this point, I don't believe the problem is the memory leak itself, but it seems to be triggered by the memory leak, in that all my problems started when I got low ram warnings from one of my 3 nodes and began recovery efforts from that.

I do really like the idea / concept behind glusterfs, but I really have to figure out why it's been so poor performing from day one, and why it's caused 95% of my outages (including several large ones lately). If I can get it stable, reliable, and well performing, then I'd love to keep it. If I can't, then perhaps NFS is the way to go? I don't like the single-point-of-failure aspect of it, but the other NAS boxes I run for clients (central storage for windows boxes) have been very solid; if I could get that kind of reliability for my ovirt stack, it would be a substantial improvement. Currently, it seems about every other month I have a gluster-induced outage.

Sometimes I wonder if hyperconverged itself is the issue, but my infrastructure doesn't justify three servers at the same location... I might be able to do two, but even that seems like it's pushing it.

Looks like I can upgrade to 10G for about $900. I can order a dual-Xeon supermicro 12-disk server, loaded with 2TB WD Enterprise disks and a pair of SSDs for the OS, 32GB ram, 2.67GHz CPUs, for about $720 delivered. I've got to do something to improve my reliability; I can't keep going the way I have been...

--Jim
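A load average in the 40s with 87% idle CPU and low network traffic usually means processes are blocked on disk I/O rather than burning CPU. A few stock tools (sysstat/procps, nothing oVirt-specific) can confirm where the time is going; these are generic diagnostics, not commands from the thread:

  # Per-device latency and utilization; high await/%util on the brick
  # device implicates the disk rather than the network (sysstat package)
  iostat -xz 5

  # 'b' column = processes blocked on I/O; 'wa' = I/O wait percentage
  vmstat 5

  # Processes stuck in uninterruptible sleep (state D) inflate load average
  ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/'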
On Fri, Jul 6, 2018 at 9:13 PM, Johan Bernhardsson <[email protected]> wrote:

Load like that is mostly I/O-based: either the machine is swapping or the network is too slow. Check I/O wait in top.

And the problem where the OOM killer kills off gluster: does that mean you don't monitor ram usage on the servers? Either something is eating all your ram, swap gets really I/O-intensive, and then gluster is killed off; or you have the wrong swap settings in sysctl.conf. (There are tons of broken guides that recommend setting swappiness to 0, but that disables swap on newer kernels. The proper swappiness for swapping only when necessary is 1, or a sufficiently low number like 10; the default is 60.)

Moving to NFS will not improve things. You will get more memory back since gluster isn't running, and that is good. But you will have a single node that can fail with all your storage, it would still be on 1 gigabit only, and your three-node cluster would easily saturate that link.
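If swappiness is the culprit, it takes a minute to check and fix. A minimal sketch, using a value in the 1-10 range Johan suggests:

  # Current value (default 60; 0 effectively disables swap on newer kernels)
  sysctl vm.swappiness

  # Apply immediately
  sysctl -w vm.swappiness=10

  # Persist across reboots
  echo 'vm.swappiness = 10' > /etc/sysctl.d/99-swappiness.conf
  sysctl --system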
On July 7, 2018 04:13:13 Jim Kusznir <[email protected]> wrote:

So far it does not appear to be helping much. I'm still getting VMs locking up and all kinds of notices from the oVirt engine about non-responsive hosts. I'm still seeing load averages in the 20-30 range.

--Jim

On Fri, Jul 6, 2018, 3:13 PM Jim Kusznir <[email protected]> wrote:

Thank you for the advice and help.

I do plan on going 10Gbps networking; haven't quite jumped off that cliff yet, though.

I did put my data-hdd (main VM storage volume) onto a dedicated 1Gbps network, and I've watched throughput on that and never seen more than 60MB/s achieved (as reported by bwm-ng). I have a separate 1Gbps network for communication and ovirt migration, but I wanted to break that up further (separate out VM traffic from migration/mgmt traffic). My three SSD-backed gluster volumes run on the main network too, as I haven't been able to get them to move to the new network (which I was trying to use as all gluster). I tried bonding, but that seemed to reduce performance rather than improve it.

--Jim

On Fri, Jul 6, 2018 at 2:52 PM, Jamie Lawrence <[email protected]> wrote:

Hi Jim,

I don't have any targeted suggestions, because there isn't much to latch on to. I can say Gluster replica three (no arbiters) on dedicated servers serving a couple Ovirt VM clusters here have not had these sorts of issues.

I suspect your long heal times (and the resultant long periods of high load) are at least partly related to 1G networking. That is just a matter of IO - heals of VMs involve moving a lot of bits. My cluster uses 10G bonded NICs on the gluster and ovirt boxes for storage traffic and separate bonded 1G for ovirtmgmt and communication with other machines/people, and we're occasionally hitting the bandwidth ceiling on the storage network. I'm starting to think about 40/100G, different ways of splitting up intensive systems, and considering iSCSI for specific volumes, although I really don't want to go there.

I don't run FreeNAS[1], but I do run FreeBSD as storage servers for their excellent ZFS implementation, mostly for backups. ZFS will make your `heal` problem go away, but not your bandwidth problems, which become worse (because of fewer NICs pushing traffic). 10G hardware is not exactly in impulse-buy territory, but if you can, I'd recommend doing some testing using it. I think at least some of your problems are related.

If that's not possible, my next stops would be optimizing everything I could about sharding and healing, and tuning for the shard size to squeeze as much performance out of 1G as I could, but that will only go so far.

-j

[1] FreeNAS is just a storage-tuned FreeBSD with a GUI.
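The sharding and heal knobs Jamie mentions are ordinary gluster volume options. The sketch below shows how to inspect and set them on the volume from this thread; the specific values are illustrative examples rather than tested recommendations, and note that shard-block-size only affects newly created files:

  # Inspect current sharding settings on the volume
  gluster volume get data-hdd features.shard
  gluster volume get data-hdd features.shard-block-size

  # Examples of options people tune for VM-image workloads over slow links
  gluster volume set data-hdd cluster.shd-max-threads 4
  gluster volume set data-hdd cluster.data-self-heal-algorithm full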
On Jul 6, 2018, at 1:19 PM, Jim Kusznir <[email protected]> wrote:

hi all:

Once again my production ovirt cluster is collapsing in on itself. My servers are intermittently unavailable or degrading, customers are noticing and calling in. This seems to be yet another gluster failure that I haven't been able to pin down.

I posted about this a while ago, but didn't get anywhere (no replies that I found). The problem started out as a glusterfsd process consuming large amounts of ram (up to the point where ram and swap were exhausted and the kernel OOM killer killed off the glusterfsd process). For reasons not clear to me at this time, that resulted in any VMs running on that host and that gluster volume being paused with I/O error (the glusterfs process is usually unharmed; why it didn't continue I/O with other servers is confusing to me).

I have 3 servers and a total of 4 gluster volumes (engine, iso, data, and data-hdd). The first 3 are replica 2+arb; the 4th (data-hdd) is replica 3. The first 3 are backed by an LVM partition (some thin provisioned) on an SSD; the 4th is on a Seagate hybrid disk (hdd + some internal flash for acceleration). data-hdd is the only thing on the disk. Servers are Dell R610 with the PERC/6i raid card, with the disks individually passed through to the OS (no raid enabled).

The above RAM usage issue came from the data-hdd volume. Yesterday, I caught one of the glusterfsd high-ram-usage episodes before the OOM killer had to run. I was able to migrate the VMs off the machine and, for good measure, reboot the entire machine (after taking this opportunity to run the software updates that ovirt said were pending). Upon booting back up, the necessary volume healing began. However, this time, the healing caused all three servers to go to very, very high load averages (I saw just under 200 on one server; typically they've been 40-70) with top reporting IO wait at 7-20%. Network for this volume is a dedicated gig network. According to bwm-ng, initially the network bandwidth would hit 50MB/s (yes, bytes), but tailed off to mostly in the kB/s range for a while. All machines' load averages were still 40+, and "gluster volume heal data-hdd info" reported 5 items needing healing. Servers were intermittently experiencing IO issues, even on the 3 gluster volumes that appeared largely unaffected. Even OS activities on the hosts themselves (logging in, running commands) would often be very delayed. The ovirt engine was seemingly randomly throwing engine down / engine up / engine failed notifications. Responsiveness on ANY VM was horrific most of the time, with random VMs being inaccessible.

I let the gluster heal run overnight. By morning, there were still 5 items needing healing, all three servers were still experiencing high load, and servers were still largely unstable.

I've noticed that all of my ovirt outages (and I've had a lot, way more than is acceptable for a production cluster) have come from gluster. I still have 3 VMs whose hard disk images have become corrupted by my last gluster crash that I haven't had time to repair / rebuild yet (I believe this crash was caused by the OOM issue previously mentioned, but I didn't know it at the time).

Is gluster really ready for production yet? It seems so unstable to me... I'm looking at replacing gluster with a dedicated NFS server, likely FreeNAS. Any suggestions? What is the "right" way to do production storage on this (3 node cluster)? Can I get this gluster volume stable enough to get my VMs to run reliably again until I can deploy another storage solution?

--Jim
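For anyone following along, the heal status Jim quotes comes from the standard gluster CLI. These are the usual commands for watching a heal on this volume:

  # Files/gfids still pending heal, per brick
  gluster volume heal data-hdd info

  # Condensed per-brick counts (available in newer gluster releases)
  gluster volume heal data-hdd info summary

  # Entries gluster considers split-brain, if any
  gluster volume heal data-hdd info split-brain

  # Brick and self-heal daemon health
  gluster volume status data-hdd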
_______________________________________________
Users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/[email protected]/message/4Y3JCIKDFNSWH2T25PUKZRU2TISJF4W5/

