In case it matters, the data-hdd gluster volume uses these hard drives: https://www.amazon.com/gp/product/B01M1NHCZT/ref=oh_aui_detailpage_o05_s00?ie=UTF8&psc=1
This is in a Dell R610 with a PERC6/i (one drive per server, configured as a single-drive volume to pass it through as its own /dev/sd* device). Inside the OS, it's partitioned with LVM thin provisioning, then an LVM volume formatted with XFS and mounted as /gluster/brick3, with the data-hdd volume created inside that.

--Jim

On Fri, Jul 6, 2018 at 10:45 PM, Jim Kusznir <[email protected]> wrote:
> So, I'm still at a loss... It sounds like it's either insufficient RAM/swap, or insufficient network. It seems to be neither now. At this point, it appears that gluster is just "broke" and killing my systems for no discernible reason. Here are details, all from the same system (currently running 3 VMs):
>
> [root@ovirt3 ~]# w
>  22:26:53 up 36 days, 4:34, 1 user, load average: 42.78, 55.98, 53.31
> USER     TTY      FROM            LOGIN@   IDLE   JCPU   PCPU WHAT
> root     pts/0    192.168.8.90    22:26    2.00s  0.12s  0.11s w
>
> bwm-ng reports the highest data usage was about 6MB/s during this test (and that was combined; I have two different gig networks: the gluster network (primary VM storage) runs on one, and the other network handles everything else).
>
> [root@ovirt3 ~]# free -m
>               total        used        free      shared  buff/cache   available
> Mem:          31996       13236         232          18       18526       18195
> Swap:         16383        1475       14908
>
> top - 22:32:56 up 36 days, 4:41, 1 user, load average: 17.99, 39.69, 47.66
> Tasks: 407 total, 1 running, 405 sleeping, 1 stopped, 0 zombie
> %Cpu(s): 8.6 us, 2.1 sy, 0.0 ni, 87.6 id, 1.6 wa, 0.0 hi, 0.1 si, 0.0 st
> KiB Mem : 32764284 total, 228296 free, 13541952 used, 18994036 buff/cache
> KiB Swap: 16777212 total, 15246200 free, 1531012 used. 18643960 avail Mem
>
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
> 30036 qemu      20   0 6872324   5.2g  13532 S 144.6 16.5 216:14.55 /usr/libexec/qemu-kvm -name guest=BillingWin,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/v+
> 28501 qemu      20   0 5034968   3.6g  12880 S  16.2 11.7  73:44.99 /usr/libexec/qemu-kvm -name guest=FusionPBX,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/va+
>  2694 root      20   0 2169224  12164   3108 S   5.0  0.0   3290:42 /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id data.ovirt3.nwfiber.com.gluster-brick2-data -p /var/run/+
> 14293 root      15  -5  944700  13356   4436 S   4.0  0.0  16:32.15 /usr/sbin/glusterfs --volfile-server=192.168.8.11 --volfile-server=192.168.8.12 --volfile-server=192.168.8.13 --+
> 25100 vdsm       0 -20 6747440 107868  12836 S   2.3  0.3  21:35.20 /usr/bin/python2 /usr/share/vdsm/vdsmd
> 28971 qemu      20   0 2842592   1.5g  13548 S   1.7  4.7 241:46.49 /usr/libexec/qemu-kvm -name guest=unifi.palousetech.com,debug-threads=on -S -object secret,id=masterKey0,format=+
> 12095 root      20   0  162276   2836   1868 R   1.3  0.0   0:00.25 top
>  2708 root      20   0 1906040  12404   3080 S   1.0  0.0   1083:33 /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id engine.ovirt3.nwfiber.com.gluster-brick1-engine -p /var/+
> 28623 qemu      20   0 4749536   1.7g  12896 S   0.7  5.5   4:30.64 /usr/libexec/qemu-kvm -name guest=billing.nwfiber.com,debug-threads=on -S -object secret,id=masterKey0,format=ra+
>    10 root      20   0       0      0      0 S   0.3  0.0 215:54.72 [rcu_sched]
>  1030 sanlock   rt   0  773804  27908   2744 S   0.3  0.1  35:55.61 /usr/sbin/sanlock daemon
>  1890 zabbix    20   0   83904   1696   1612 S   0.3  0.0  24:30.63 /usr/sbin/zabbix_agentd: collector [idle 1 sec]
>  2722 root      20   0 1298004   6148   2580 S   0.3  0.0  38:10.82 /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id iso.ovirt3.nwfiber.com.gluster-brick4-iso -p /var/run/gl+
>  6340 root      20   0       0      0      0 S   0.3  0.0   0:04.30 [kworker/7:0]
> 10652 root      20   0       0      0      0 S   0.3  0.0   0:00.23 [kworker/u64:2]
> 14724 root      20   0 1076344  17400   3200 S   0.3  0.1  10:04.13 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -+
> 22011 root      20   0       0      0      0 S   0.3  0.0   0:05.04 [kworker/10:1]
>
> Not sure why the system load dropped other than I was trying to take a picture of it :)
>
> In any case, it appears that at this time I have plenty of swap, RAM, and network capacity, and yet things are still running very sluggishly. I'm still getting e-mails from servers complaining about loss of communication with something or other; I still get e-mails from the engine about bad engine status, then recovery, etc.
>
> I've shut down 2/3 of my VMs, too... just trying to keep the critical ones operating.
>
> At this point, I don't believe the problem is the memory leak, but it seems to be triggered by the memory leak, in that all my problems started when I got low-RAM warnings from one of my 3 nodes and began recovery efforts from that.
>
> I do really like the idea / concept behind glusterfs, but I really have to figure out why it has performed so poorly from day one, and why it has caused 95% of my outages (including several large ones lately). If I can get it stable, reliable, and well performing, then I'd love to keep it. If I can't, then perhaps NFS is the way to go? I don't like the single-point-of-failure aspect of it, but my other NAS boxes that I run for clients (central storage for Windows boxes) have been very solid; if I could get that kind of reliability for my oVirt stack, it would be a substantial improvement. Currently, it seems about every other month I have a gluster-induced outage.
>
> Sometimes I wonder if hyperconverged itself is the issue, but my infrastructure doesn't justify three servers at the same location... I might be able to do two, but even that seems like it's pushing it.
>
> Looks like I can upgrade to 10G for about $900. I can order a dual-Xeon Supermicro 12-disk server, loaded with 2TB WD Enterprise disks and a pair of SSDs for the OS, 32GB RAM, and 2.67GHz CPUs for about $720 delivered. I've got to do something to improve my reliability; I can't keep going the way I have been....
>
> --Jim
>
> On Fri, Jul 6, 2018 at 9:13 PM, Johan Bernhardsson <[email protected]> wrote:
>
>> Load like that is mostly I/O based: either the machine is swapping or the network is too slow. Check I/O wait in top.
>>
>> And about the problem where the OOM killer kills off gluster: does that mean you don't monitor RAM usage on the servers? Either gluster is eating all your RAM, swap gets really I/O intensive, and the process is then killed off, or you have the wrong swap settings in sysctl.conf (there are tons of broken guides that recommend setting swappiness to 0, but that disables swap on newer kernels; the proper swappiness for swapping only when necessary is 1 or a sufficiently low number like 10, the default being 60).
>>
>> Moving to NFS will not improve things. You will get more memory back since gluster isn't running, and that is good, but you will have a single node that can fail with all your storage, it would still be on 1 gigabit only, and your three-node cluster would easily saturate that link.
>>
>> On July 7, 2018 04:13:13 Jim Kusznir <[email protected]> wrote:
>>
>>> So far it does not appear to be helping much. I'm still getting VMs locking up and all kinds of notices from the oVirt engine about non-responsive hosts.
>>> I'm still seeing load averages in the 20-30 range.
>>>
>>> Jim
>>>
>>> On Fri, Jul 6, 2018, 3:13 PM Jim Kusznir <[email protected]> wrote:
>>>
>>>> Thank you for the advice and help.
>>>>
>>>> I do plan on going to 10Gbps networking; I haven't quite jumped off that cliff yet, though.
>>>>
>>>> I did put my data-hdd (main VM storage volume) onto a dedicated 1Gbps network, and I've watched throughput on that and never seen more than 60MB/s achieved (as reported by bwm-ng). I have a separate 1Gbps network for communication and oVirt migration, but I wanted to break that up further (separate out VM traffic from migration/mgmt traffic). My three SSD-backed gluster volumes run on the main network too, as I haven't been able to get them to move to the new network (which I was trying to use for all gluster traffic). I tried bonding, but that seemed to reduce performance rather than improve it.
>>>>
>>>> --Jim
>>>>
>>>> On Fri, Jul 6, 2018 at 2:52 PM, Jamie Lawrence <[email protected]> wrote:
>>>>
>>>>> Hi Jim,
>>>>>
>>>>> I don't have any targeted suggestions, because there isn't much to latch on to. I can say that Gluster replica three (no arbiters) on dedicated servers serving a couple of oVirt VM clusters here has not had these sorts of issues.
>>>>>
>>>>> I suspect your long heal times (and the resultant long periods of high load) are at least partly related to 1G networking. That is just a matter of I/O: heals of VMs involve moving a lot of bits. My cluster uses 10G bonded NICs on the gluster and oVirt boxes for storage traffic and separate bonded 1G for ovirtmgmt and communication with other machines/people, and we're occasionally hitting the bandwidth ceiling on the storage network. I'm starting to think about 40/100G, different ways of splitting up intensive systems, and considering iSCSI for specific volumes, although I really don't want to go there.
>>>>>
>>>>> I don't run FreeNAS[1], but I do run FreeBSD as storage servers for their excellent ZFS implementation, mostly for backups. ZFS will make your `heal` problem go away, but not your bandwidth problems, which become worse (because of fewer NICs pushing traffic). 10G hardware is not exactly in impulse-buy territory, but if you can, I'd recommend doing some testing using it. I think at least some of your problems are related.
>>>>>
>>>>> If that's not possible, my next stop would be optimizing everything I could about sharding and healing, and tuning around the shard size, to squeeze as much performance out of 1G as I could, but that will only go so far.
>>>>>
>>>>> -j
>>>>>
>>>>> [1] FreeNAS is just a storage-tuned FreeBSD with a GUI.
>>>>>
>>>>> > On Jul 6, 2018, at 1:19 PM, Jim Kusznir <[email protected]> wrote:
>>>>> >
>>>>> > Hi all:
>>>>> >
>>>>> > Once again my production oVirt cluster is collapsing in on itself. My servers are intermittently unavailable or degrading, customers are noticing and calling in. This seems to be yet another gluster failure that I haven't been able to pin down.
>>>>> >
>>>>> > I posted about this a while ago, but didn't get anywhere (no replies that I found). The problem started out as a glusterfsd process consuming large amounts of RAM (up to the point where RAM and swap were exhausted and the kernel OOM killer killed off the glusterfsd process).
>>>>> > For reasons not clear to me at this time, that resulted in any VMs running on that host and that gluster volume being paused with an I/O error (the glusterfs process is usually unharmed; why it didn't continue I/O with the other servers is confusing to me).
>>>>> >
>>>>> > I have 3 servers and a total of 4 gluster volumes (engine, iso, data, and data-hdd). The first 3 are replica 2+arb; the 4th (data-hdd) is replica 3. The first 3 are backed by an LVM partition (some thin provisioned) on an SSD; the 4th is on a Seagate hybrid disk (HDD plus some internal flash for acceleration). data-hdd is the only thing on that disk. Servers are Dell R610s with the PERC6/i RAID card, with the disks individually passed through to the OS (no RAID enabled).
>>>>> >
>>>>> > The above RAM usage issue came from the data-hdd volume. Yesterday, I caught one of the glusterfsd processes at high RAM usage before the OOM killer had to run. I was able to migrate the VMs off the machine and, for good measure, reboot the entire machine (after taking the opportunity to run the software updates that oVirt said were pending). Upon booting back up, the necessary volume healing began. However, this time the healing caused all three servers to go to very, very high load averages (I saw just under 200 on one server; typically they've been 40-70), with top reporting I/O wait at 7-20%. The network for this volume is a dedicated gig network. According to bwm-ng, initially the network bandwidth would hit 50MB/s (yes, bytes), but it tailed off to mostly kB/s for a while. All machines' load averages were still 40+, and gluster volume heal data-hdd info reported 5 items needing healing. Servers were intermittently experiencing I/O issues, even on the 3 gluster volumes that appeared largely unaffected. Even OS activities on the hosts themselves (logging in, running commands) would often be very delayed. The oVirt engine was seemingly randomly throwing engine down / engine up / engine failed notifications. Responsiveness on ANY VM was horrific most of the time, with random VMs being inaccessible.
>>>>> >
>>>>> > I let the gluster heal run overnight. By morning, there were still 5 items needing healing, all three servers were still experiencing high load, and servers were still largely unstable.
>>>>> >
>>>>> > I've noticed that all of my oVirt outages (and I've had a lot, way more than is acceptable for a production cluster) have come from gluster. I still have 3 VMs whose hard disk images were corrupted by my last gluster crash and that I haven't had time to repair / rebuild yet (I believe that crash was caused by the OOM issue previously mentioned, but I didn't know it at the time).
>>>>> >
>>>>> > Is gluster really ready for production yet? It seems so unstable to me.... I'm looking at replacing gluster with a dedicated NFS server, likely FreeNAS. Any suggestions? What is the "right" way to do production storage on this 3-node cluster? Can I get this gluster volume stable enough to get my VMs to run reliably again until I can deploy another storage solution?
>>>>> >
>>>>> > --Jim
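
A quick aside on chasing heals like the ones described in the post above: the commands below are a minimal sketch of how the stuck data-hdd heal could be watched and, cautiously, sped up from any of the gluster nodes. The volume name comes from the thread; the tuning values are illustrative rather than taken from these hosts, and the option names assume a Gluster 3.x-era CLI.

    # show the entries still pending heal, and a running count per brick
    gluster volume heal data-hdd info
    gluster volume heal data-hdd statistics heal-count

    # confirm every brick and self-heal daemon is actually online
    gluster volume status data-hdd

    # optionally let the self-heal daemon work harder while a big heal runs
    # (defaults are 1 thread and a queue length of 1024)
    gluster volume set data-hdd cluster.shd-max-threads 4
    gluster volume set data-hdd cluster.shd-wait-qlength 2048

Whether more self-heal threads actually help, or just push a 1G storage network even harder, is exactly the trade-off Jamie raises above.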
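Johan's swappiness comment is the easiest item in the thread to act on. The snippet below is a sketch of the conservative setting he describes; the file name is arbitrary and the value is from his suggested range, not something read off these servers.

    # /etc/sysctl.d/90-swappiness.conf  (any file name under sysctl.d works)
    # swap only under real memory pressure; per Johan's note, 0 disables swap on newer kernels
    vm.swappiness = 10

    # apply now and verify
    sysctl -p /etc/sysctl.d/90-swappiness.conf
    cat /proc/sys/vm/swappiness
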
_______________________________________________
Users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/[email protected]/message/5KZZ6WMCDKFC62ACTYWEA4LBRUBL3AVY/
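
For completeness, here is a rough sketch of the brick layout Jim describes at the top of the thread (one passed-through disk, thin-provisioned LVM, XFS, mounted at /gluster/brick3). The device name, volume group name, and sizes are hypothetical; only the mount point and the general shape come from the thread.

    # the passed-through PERC6/i single-drive volume is assumed to appear as /dev/sdb
    pvcreate /dev/sdb
    vgcreate gluster_vg_hdd /dev/sdb

    # thin pool over most of the VG, then a thin LV carved out of it
    lvcreate --type thin-pool -l 90%FREE -n tp_hdd gluster_vg_hdd
    lvcreate -V 1800G -n brick3 --thinpool tp_hdd gluster_vg_hdd

    # XFS with the 512-byte inode size commonly recommended for gluster bricks
    mkfs.xfs -i size=512 /dev/gluster_vg_hdd/brick3
    mkdir -p /gluster/brick3
    mount /dev/gluster_vg_hdd/brick3 /gluster/brick3

The data-hdd brick directory would then live under /gluster/brick3, with the gluster volume created on top of it, as described in Jim's message.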

