I encountered this after upgrading clients to 3.12.9 as well. It’s not present 
in 3.12.8 or 3.12.6. I’ve added some data I had to that bug, and can produce more 
if needed. I forgot to mention that my server cluster is at 3.12.9 and is not 
showing any problems; it’s just the clients.

A test cluster on 3.12.11 also shows it, just more slowly because it’s got fewer 
clients on it.


> From: Sahina Bose <[email protected]>
> Subject: [ovirt-users] Re: Ovirt cluster unstable; gluster to blame (again)
> Date: July 9, 2018 at 10:42:15 AM CDT
> To: Edward Clay; Jim Kusznir
> Cc: users
> 
> see the response about the bug at 
> https://lists.ovirt.org/archives/list/[email protected]/thread/WRYEBOLNHJZGKKJUNF77TJ7WMBS66ZYK/
> which seems to indicate the referenced bug is fixed in 3.12.2 and higher.
> 
> Could you attach the statedump of the process to the bug 
> https://bugzilla.redhat.com/show_bug.cgi?id=1593826 as requested?
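For anyone following along, a statedump can usually be produced in one of two ways; a minimal sketch (the `data-hdd` volume name is borrowed from later in this thread, and paths assume gluster defaults):

```shell
# Server-side (brick) statedump command, printed for review rather than run:
volume=data-hdd                      # example volume from this thread
echo "gluster volume statedump $volume"
# Client-side (fuse mount): SIGUSR1 asks a gluster process to write a
# statedump, by default under /var/run/gluster/:
#   kill -USR1 "$(pgrep -f "glusterfs.*$volume" | head -n1)"
#   ls /var/run/gluster/*.dump.*
```

The resulting dump files are what the bug report asks to have attached.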
> 
> 
> 
> On Mon, Jul 9, 2018 at 8:38 PM, Edward Clay <[email protected]> wrote:
> Just to add my $.02 here.  I've opened a bug on this issue where HVs/hosts 
> connected to glusterfs volumes are running out of RAM.  This seemed to be a 
> bug fixed in gluster 3.13, but that patch doesn't seem to be available any 
> longer, and 3.12 is what ovirt is using.  For example, I have a host that was 
> showing 72% memory consumption with 3 VMs running on it.  If I migrate 
> those VMs to another host, memory consumption drops to 52%.  If I put this 
> host into maintenance and then activate it, it drops down to 2% or so.  Since 
> I ran into this issue I've been manually watching memory consumption on each 
> host and migrating VMs from it to others to keep things from dying.  I'm 
> hoping that with the announcement of gluster 3.12 end of life and the move to 
> gluster 4.1, this will get fixed, or that the patch from 3.13 can get 
> backported so this problem will go away.
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1593826
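The manual memory watch described above can be roughed out as a script; this is a sketch only, with an assumed threshold (the message gives none) and the usual gluster process names:

```shell
#!/bin/sh
# Warn when a gluster process's resident set crosses a threshold, so VMs can
# be migrated off before the OOM killer runs. The 24 GiB limit is an assumption.
threshold_kb=$((24 * 1024 * 1024))    # 24 GiB expressed in kB
ps -C glusterfs,glusterfsd -o pid=,rss=,comm= 2>/dev/null |
while read -r pid rss comm; do
    if [ "$rss" -gt "$threshold_kb" ]; then
        echo "WARN: $comm (pid $pid) rss=${rss} kB over threshold"
    fi
done
```

Run from cron, this at least turns the "manually watching" step into an alert.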
> 
> On 07/07/2018 11:49 AM, Jim Kusznir wrote:
>> 
>> This host has NO VMs running on it, only 3 running cluster-wide (including 
>> the engine, which is on its own storage):
>> 
>> top - 10:44:41 up 1 day, 17:10,  1 user,  load average: 15.86, 14.33, 13.39
>> Tasks: 381 total,   1 running, 379 sleeping,   1 stopped,   0 zombie
>> %Cpu(s):  2.7 us,  2.1 sy,  0.0 ni, 89.0 id,  6.1 wa,  0.0 hi,  0.2 si,  0.0 
>> st
>> KiB Mem : 32764284 total,   338232 free,   842324 used, 31583728 buff/cache
>> KiB Swap: 12582908 total, 12258660 free,   324248 used. 31076748 avail Mem 
>> 
>>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
>> 13279 root      20   0 2380708  37628   4396 S  51.7  0.1   3768:03 glusterfsd
>> 13273 root      20   0 2233212  20460   4380 S  17.2  0.1 105:50.44 glusterfsd
>> 13287 root      20   0 2233212  20608   4340 S   4.3  0.1  34:27.20 glusterfsd
>> 16205 vdsm       0 -20 5048672  88940  13364 S   1.3  0.3   0:32.69 vdsmd
>> 16300 vdsm      20   0  608488  25096   5404 S   1.3  0.1   0:05.78 python
>>  1109 vdsm      20   0 3127696  44228   8552 S   0.7  0.1  18:49.76 ovirt-ha-broker
>> 25555 root      20   0       0      0      0 S   0.7  0.0   0:00.13 kworker/u64:3
>>    10 root      20   0       0      0      0 S   0.3  0.0   4:22.36 rcu_sched
>>   572 root       0 -20       0      0      0 S   0.3  0.0   0:12.02 kworker/1:1H
>>   797 root      20   0       0      0      0 S   0.3  0.0   1:59.59 kdmwork-253:2
>>   877 root       0 -20       0      0      0 S   0.3  0.0   0:11.34 kworker/3:1H
>>  1028 root      20   0       0      0      0 S   0.3  0.0   0:35.35 xfsaild/dm-10
>>  1869 root      20   0 1496472  10540   6564 S   0.3  0.0   2:15.46 python
>>  3747 root      20   0       0      0      0 D   0.3  0.0   0:01.21 kworker/u64:1
>> 10979 root      15  -5  723504  15644   3920 S   0.3  0.0  22:46.27 glusterfs
>> 15085 root      20   0  680884  10792   4328 S   0.3  0.0   0:01.13 glusterd
>> 16102 root      15  -5 1204216  44948  11160 S   0.3  0.1   0:18.61 supervdsmd
>> 
>> At the moment, the engine is barely usable, my other VMs appear to be 
>> unresponsive.  Two on one host, one on another, and none on the third.
>> 
>> 
>> 
>> On Sat, Jul 7, 2018 at 10:38 AM, Jim Kusznir <[email protected]> wrote:
>> I run 4-7 VMs, and most of them are 2GB RAM.  I have 2 VMs with 4GB.
>> 
>> RAM hasn't been an issue until recent ovirt/gluster upgrades.  Storage has 
>> always been slow, especially with these drives.  However, even watching 
>> network utilization on my switch, the gig-e links never max out.
>> 
>> The loadavg issues and unresponsive behavior started with yesterday's ovirt 
>> updates.  I now have one VM with low I/O that lives on a separate storage 
>> volume (data, fully SSD-backed, instead of data-hdd, which was having the 
>> issues).  I moved it to an ovirt host with no other VMs on it, one that had 
>> freshly been rebooted.  Before it had this one VM on it, loadavg was <0.5.  
>> Now it's up in the 20's, with only one low-disk-I/O, 4GB-RAM VM on the host.
>> 
>> This to me says there's now a new problem separate from Gluster.  I don't 
>> have any non-gluster storage available to test with.  I did notice that the 
>> last update included a new kernel, and it appears it's the qemu-kvm processes 
>> that are consuming way more CPU than they used to.
>> 
>> Are there any known issues?  I'm going to reboot into my previous kernel to 
>> see if it's kernel-caused.
>> 
>> --Jim
>> 
>> 
>> 
>> On Fri, Jul 6, 2018 at 11:07 PM, Johan Bernhardsson <[email protected]> wrote:
>> That is a single SATA drive that is slow on random I/O and that has to be 
>> synced with 2 other servers.  Gluster works synchronously, so one write has to 
>> be written and acknowledged on all three nodes.
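Rough arithmetic (with assumed numbers) for why that matters on gig-e: with replica 3, the fuse client itself sends each write to all three bricks, so the client NIC carries about three times the guest write rate.

```shell
link_mbit=1000                  # 1 GbE
replicas=3                      # replica 3, as in this thread
guest_write_MBps=40             # assumed sustained guest write rate
wire_MBps=$(( guest_write_MBps * replicas ))
link_MBps=$(( link_mbit / 8 ))
echo "${wire_MBps} MB/s on the wire vs ~${link_MBps} MB/s available"
# -> 120 MB/s on the wire vs ~125 MB/s available
```

A single busy guest can come close to saturating the storage link before drive latency is even considered.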
>> 
>> So you have a bottleneck in I/O on the drives and one on the network, and 
>> depending on how many virtual servers you have and how much RAM they take, 
>> you might have a memory bottleneck too.
>> 
>> Load spikes when you have a wait somewhere and are overusing capacity.  But 
>> it's not only CPU that load is counted on: it is waiting for resources, so 
>> it can be memory or network or drives.
>> 
>> How many virtual servers do you run and how much RAM do they consume?
>> 
>> On July 7, 2018 09:51:42 Jim Kusznir <[email protected]> wrote:
>> 
>>> In case it matters, the data-hdd gluster volume uses these hard drives:
>>> 
>>> https://www.amazon.com/gp/product/B01M1NHCZT/ref=oh_aui_detailpage_o05_s00?ie=UTF8&psc=1
>>> 
>>> This is in a Dell R610 with a PERC6/i (one drive per server, configured as a 
>>> single-drive volume to pass it through as its own /dev/sd* device).  Inside 
>>> the OS, it's partitioned with lvm_thin, then an lvm volume formatted with 
>>> XFS and mounted as /gluster/brick3, with the data-hdd volume created inside 
>>> that.
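A sketch of that layering (thin pool, thin LV, XFS, mounted at /gluster/brick3); the device name and sizes are invented for illustration, and the commands are printed rather than executed so they can be checked against the real hardware first:

```shell
# Dry run: emit the commands for review. All names/sizes below are assumptions.
dev=/dev/sdb; vg=gluster_vg; pool=brickpool; lv=brick3; size=900G
cat <<EOF
pvcreate $dev
vgcreate $vg $dev
lvcreate -l 100%FREE -T $vg/$pool
lvcreate -V $size -T $vg/$pool -n $lv
mkfs.xfs -i size=512 /dev/$vg/$lv
mkdir -p /gluster/brick3 && mount /dev/$vg/$lv /gluster/brick3
EOF
```

The `-i size=512` inode size is the commonly recommended XFS setting for gluster bricks.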
>>> 
>>> --Jim
>>> 
>>> On Fri, Jul 6, 2018 at 10:45 PM, Jim Kusznir <[email protected]> wrote:
>>> So, I'm still at a loss... It sounds like it's either insufficient RAM/swap 
>>> or insufficient network.  It seems to be neither now.  At this point, it 
>>> appears that gluster is just "broke" and killing my systems for no 
>>> discernible reason.  Here are details, all from the same system (currently 
>>> running 3 VMs):
>>> 
>>> [root@ovirt3 ~]# w
>>>  22:26:53 up 36 days,  4:34,  1 user,  load average: 42.78, 55.98, 53.31
>>> USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
>>> root     pts/0    192.168.8.90     22:26    2.00s  0.12s  0.11s w
>>> 
>>> bwm-ng reports the highest data usage was about 6MB/s during this test (and 
>>> that was combined; I have two different gig networks.  One gluster network 
>>> (primary VM storage) runs on one, the other network handles everything 
>>> else).
>>> 
>>> [root@ovirt3 ~]# free -m
>>>               total        used        free      shared  buff/cache   available
>>> Mem:          31996       13236         232          18       18526       18195
>>> Swap:         16383        1475       14908
>>> 
>>> top - 22:32:56 up 36 days,  4:41,  1 user,  load average: 17.99, 39.69, 47.66
>>> Tasks: 407 total,   1 running, 405 sleeping,   1 stopped,   0 zombie
>>> %Cpu(s):  8.6 us,  2.1 sy,  0.0 ni, 87.6 id,  1.6 wa,  0.0 hi,  0.1 si,  0.0 st
>>> KiB Mem : 32764284 total,   228296 free, 13541952 used, 18994036 buff/cache
>>> KiB Swap: 16777212 total, 15246200 free,  1531012 used. 18643960 avail Mem 
>>> 
>>>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
>>> 30036 qemu      20   0 6872324   5.2g  13532 S 144.6 16.5 216:14.55 /usr/libexec/qemu-kvm -name guest=BillingWin,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/v+
>>> 28501 qemu      20   0 5034968   3.6g  12880 S  16.2 11.7  73:44.99 /usr/libexec/qemu-kvm -name guest=FusionPBX,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/va+
>>>  2694 root      20   0 2169224  12164   3108 S   5.0  0.0   3290:42 /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id data.ovirt3.nwfiber.com.gluster-brick2-data -p /var/run/+
>>> 14293 root      15  -5  944700  13356   4436 S   4.0  0.0  16:32.15 /usr/sbin/glusterfs --volfile-server=192.168.8.11 --volfile-server=192.168.8.12 --volfile-server=192.168.8.13 --+
>>> 25100 vdsm       0 -20 6747440 107868  12836 S   2.3  0.3  21:35.20 /usr/bin/python2 /usr/share/vdsm/vdsmd
>>> 28971 qemu      20   0 2842592   1.5g  13548 S   1.7  4.7 241:46.49 /usr/libexec/qemu-kvm -name guest=unifi.palousetech.com,debug-threads=on -S -object secret,id=masterKey0,format=+
>>> 12095 root      20   0  162276   2836   1868 R   1.3  0.0   0:00.25 top
>>>  2708 root      20   0 1906040  12404   3080 S   1.0  0.0   1083:33 /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id engine.ovirt3.nwfiber.com.gluster-brick1-engine -p /var/+
>>> 28623 qemu      20   0 4749536   1.7g  12896 S   0.7  5.5   4:30.64 /usr/libexec/qemu-kvm -name guest=billing.nwfiber.com,debug-threads=on -S -object secret,id=masterKey0,format=ra+
>>>    10 root      20   0       0      0      0 S   0.3  0.0 215:54.72 [rcu_sched]
>>>  1030 sanlock   rt   0  773804  27908   2744 S   0.3  0.1  35:55.61 /usr/sbin/sanlock daemon
>>>  1890 zabbix    20   0   83904   1696   1612 S   0.3  0.0  24:30.63 /usr/sbin/zabbix_agentd: collector [idle 1 sec]
>>>  2722 root      20   0 1298004   6148   2580 S   0.3  0.0  38:10.82 /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id iso.ovirt3.nwfiber.com.gluster-brick4-iso -p /var/run/gl+
>>>  6340 root      20   0       0      0      0 S   0.3  0.0   0:04.30 [kworker/7:0]
>>> 10652 root      20   0       0      0      0 S   0.3  0.0   0:00.23 [kworker/u64:2]
>>> 14724 root      20   0 1076344  17400   3200 S   0.3  0.1  10:04.13 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -+
>>> 22011 root      20   0       0      0      0 S   0.3  0.0   0:05.04 [kworker/10:1]
>>> 
>>> Not sure why the system load dropped other than I was trying to take a 
>>> picture of it :)
>>> 
>>> In any case, it appears that at this time, I have plenty of swap, ram, and 
>>> network capacity, and yet things are still running very sluggish; I'm still 
>>> getting e-mails from servers complaining about loss of communication with 
>>> something or another; I still get e-mails from the engine about bad engine 
>>> status, then recovery, etc.
>>> 
>>> I've shut down 2/3 of my VMs, too....just trying to keep the critical ones 
>>> operating.
>>> 
>>> At this point, I don't believe the problem is the memory leak, but it seems 
>>> to be triggered by the memory leak, as in all my problems started when I 
>>> got low ram warnings from one of my 3 nodes and began recovery efforts from 
>>> that.
>>> 
>>> I do really like the idea / concept behind glusterfs, but I really have to 
>>> figure out why it's performed so poorly from day one, and why it's caused 95% 
>>> of my outages (including several large ones lately).  If I can get it 
>>> stable, reliable, and well performing, then I'd love to keep it.  If I 
>>> can't, then perhaps NFS is the way to go?  I don't like its single point 
>>> of failure, but the other NAS boxes I run for clients (central 
>>> storage for windows boxes) have been very solid; if I could get that kind 
>>> of reliability for my ovirt stack, it would be a substantial improvement.  
>>> Currently, it seems about every other month I have a gluster-induced outage.
>>> 
>>> Sometimes I wonder if hyperconverged itself is the issue, but my 
>>> infrastructure doesn't justify three servers at the same location... I might 
>>> be able to do two, but even that seems like it's pushing it.
>>> 
>>> Looks like I can upgrade to 10G for about $900.  I can order a dual-Xeon 
>>> supermicro 12-disk server, loaded with 2TB WD Enterprise disks and a pair 
>>> of SSDs for the OS, 32GB RAM, 2.67GHz CPUs, for about $720 delivered.  I've 
>>> got to do something to improve my reliability; I can't keep going the way I 
>>> have been....
>>> 
>>> --Jim
>>> 
>>> 
>>> On Fri, Jul 6, 2018 at 9:13 PM, Johan Bernhardsson <[email protected]> wrote:
>>> Load like that is mostly I/O-based: either the machine is swapping or the 
>>> network is too slow.  Check I/O wait in top.
>>> 
>>> And the problem where you get the OOM killer to kill off gluster: that means 
>>> you don't monitor RAM usage on the servers?  Either it's eating all 
>>> your RAM and swap gets really I/O intensive and then it is killed off, or you 
>>> have the wrong swap settings in sysctl.conf.  (There are tons of broken 
>>> guides that recommend setting swappiness to 0, but that disables swap on 
>>> newer kernels.  The proper swappiness for only swapping when necessary is 1, 
>>> or a sufficiently low number like 10; the default is 60.)
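Concretely, checking and persisting that might look like the following (the drop-in file name is arbitrary; the value 10 is the "sufficiently low number" from the message):

```shell
swappiness=10                         # low but nonzero, per the advice above
line="vm.swappiness = $swappiness"
echo "$line"                          # the sysctl line to persist
# To check and apply on a real host:
#   cat /proc/sys/vm/swappiness       # current value
#   echo "$line" | sudo tee /etc/sysctl.d/99-swappiness.conf
#   sudo sysctl --system
```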
>>> 
>>> 
>>> Moving to NFS will not improve things.  You will get more memory back, since 
>>> gluster won't be running, and that is good.  But you will have a single node 
>>> that can fail with all your storage on it, it would still be on 1 gigabit 
>>> only, and your three-node cluster would easily saturate that link.
>>> 
>>> On July 7, 2018 04:13:13 Jim Kusznir <[email protected]> wrote:
>>> 
>>>> So far it does not appear to be helping much.  I'm still getting VMs 
>>>> locking up and all kinds of notices from the ovirt engine about 
>>>> non-responsive hosts.  I'm still seeing load averages in the 20-30 range.
>>>> 
>>>> Jim
>>>> 
>>>> On Fri, Jul 6, 2018, 3:13 PM Jim Kusznir <[email protected]> wrote:
>>>> Thank you for the advice and help
>>>> 
>>>> I do plan on going 10Gbps networking; haven't quite jumped off that cliff 
>>>> yet, though.
>>>> 
>>>> I did put my data-hdd (main VM storage volume) onto a dedicated 1Gbps 
>>>> network, and I've watched throughput on that and never seen more than 
>>>> 60MB/s achieved (as reported by bwm-ng).  I have a separate 1Gbps network 
>>>> for communication and ovirt migration, but I wanted to break that up 
>>>> further (separate out VM traffic from migration/mgmt traffic).  My three 
>>>> SSD-backed gluster volumes run on the main network too, as I haven't been 
>>>> able to get them to move to the new network (which I was trying to use as 
>>>> all gluster).  I tried bonding, but that seemed to reduce performance 
>>>> rather than improve it.
>>>> 
>>>> --Jim
>>>> 
>>>> On Fri, Jul 6, 2018 at 2:52 PM, Jamie Lawrence <[email protected]> wrote:
>>>> Hi Jim,
>>>> 
>>>> I don't have any targeted suggestions, because there isn't much to latch 
>>>> on to.  I can say Gluster replica three (no arbiters) on dedicated servers 
>>>> serving a couple of Ovirt VM clusters here has not had these sorts of 
>>>> issues.
>>>> 
>>>> I suspect your long heal times (and the resultant long periods of high 
>>>> load) are at least partly related to 1G networking. That is just a matter 
>>>> of IO - heals of VMs involve moving a lot of bits. My cluster uses 10G 
>>>> bonded NICs on the gluster and ovirt boxes for storage traffic and 
>>>> separate bonded 1G for ovirtmgmt and communication with other 
>>>> machines/people, and we're occasionally hitting the bandwidth ceiling on 
>>>> the storage network. I'm starting to think about 40/100G, different ways 
>>>> of splitting up intensive systems, and considering iSCSI for specific 
>>>> volumes, although I really don't want to go there.
>>>> 
>>>> I don't run FreeNAS[1], but I do run FreeBSD as storage servers for its 
>>>> excellent ZFS implementation, mostly for backups.  ZFS will make your 
>>>> `heal` problem go away, but not your bandwidth problems, which become 
>>>> worse (because of fewer NICs pushing traffic).  10G hardware is not exactly 
>>>> in impulse-buy territory, but if you can, I'd recommend doing some 
>>>> testing using it.  I think at least some of your problems are related.
>>>> 
>>>> If that's not possible, my next stops would be optimizing everything I 
>>>> could about sharding, healing and optimizing for serving the shard size to 
>>>> squeeze as much performance out of 1G as I could, but that will only go so 
>>>> far.
>>>> 
>>>> -j
>>>> 
>>>> [1] FreeNAS is just a storage-tuned FreeBSD with a GUI.
>>>> 
>>>> > On Jul 6, 2018, at 1:19 PM, Jim Kusznir <[email protected]> wrote:
>>>> > 
>>>> > hi all:
>>>> > 
>>>> > Once again my production ovirt cluster is collapsing in on itself.  My 
>>>> > servers are intermittently unavailable or degrading, customers are 
>>>> > noticing and calling in.  This seems to be yet another gluster failure 
>>>> > that I haven't been able to pin down.
>>>> > 
>>>> > I posted about this a while ago, but didn't get anywhere (no replies 
>>>> > that I found).  The problem started out as a glusterfsd process 
>>>> > consuming large amounts of RAM (up to the point where RAM and swap were 
>>>> > exhausted and the kernel OOM killer killed off the glusterfsd process).  
>>>> > For reasons not clear to me at this time, that resulted in any VMs 
>>>> > running on that host and that gluster volume being paused with I/O errors 
>>>> > (the glusterfs process is usually unharmed; why it didn't continue I/O 
>>>> > with the other servers is confusing to me).
>>>> > 
>>>> > I have 3 servers and a total of 4 gluster volumes (engine, iso, data, 
>>>> > and data-hdd).  The first 3 are replica 2+arb; the 4th (data-hdd) is 
>>>> > replica 3.  The first 3 are backed by an LVM partition (some thin 
>>>> > provisioned) on an SSD; the 4th is on a seagate hybrid disk (hdd + some 
>>>> > internal flash for acceleration).  data-hdd is the only thing on the 
>>>> > disk.  Servers are Dell R610 with the PERC/6i raid card, with the disks 
>>>> > individually passed through to the OS (no raid enabled).
>>>> > 
>>>> > The above RAM usage issue came from the data-hdd volume.  Yesterday, I 
>>>> > caught one of the glusterfsd high-RAM-usage episodes before the OOM 
>>>> > killer had to run.  I was able to migrate the VMs off the machine and, 
>>>> > for good measure, reboot the entire machine (after taking the opportunity 
>>>> > to run the software updates that ovirt said were pending).  Upon booting 
>>>> > back up, the necessary volume healing began.  However, this time the 
>>>> > healing caused all three servers to go to very, very high load averages 
>>>> > (I saw just under 200 on one server; typically they've been 40-70) with 
>>>> > top reporting I/O wait at 7-20%.  The network for this volume is a 
>>>> > dedicated gig network.  According to bwm-ng, initially the network 
>>>> > bandwidth would hit 50MB/s (yes, bytes), but it tailed off to mostly 
>>>> > kB/s for a while.  All machines' load averages were still 40+, and 
>>>> > gluster volume heal data-hdd info reported 5 items needing healing.  
>>>> > Servers were intermittently experiencing I/O issues, even on the 3 
>>>> > gluster volumes that appeared largely unaffected.  Even OS activities on 
>>>> > the hosts themselves (logging in, running commands) would often be very 
>>>> > delayed.  The ovirt engine was seemingly randomly throwing engine down / 
>>>> > engine up / engine failed notifications.  Responsiveness on ANY VM was 
>>>> > horrific most of the time, with random VMs being inaccessible.
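The polling described above (`gluster volume heal data-hdd info`) can be scripted rather than run by hand; a sketch with an assumed interval, written as a dry run so the command can be reviewed first:

```shell
vol=data-hdd                 # volume name from this thread
interval=60                  # seconds between checks (an assumption)
cmd="gluster volume heal $vol info"
echo "would run every ${interval}s: $cmd"
# On a real node:
#   while sleep "$interval"; do date; $cmd; done
```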
>>>> > 
>>>> > I let the gluster heal run overnight.  By morning, there were still 5 
>>>> > items needing healing, all three servers were still experiencing high 
>>>> > load, and servers were still largely unstable.
>>>> > 
>>>> > I've noticed that all of my ovirt outages (and I've had a lot, way more 
>>>> > than is acceptable for a production cluster) have come from gluster.  I 
>>>> > still have 3 VMs whose hard disk images were corrupted by my last 
>>>> > gluster crash and that I haven't had time to repair / rebuild yet (I 
>>>> > believe this crash was caused by the OOM issue previously mentioned, but 
>>>> > I didn't know it at the time).
>>>> > 
>>>> > Is gluster really ready for production yet?  It seems so unstable to 
>>>> > me....  I'm looking at replacing gluster with a dedicated NFS server, 
>>>> > likely FreeNAS.  Any suggestions?  What is the "right" way to do 
>>>> > production storage on this (3-node cluster)?  Can I get this gluster 
>>>> > volume stable enough to get my VMs to run reliably again until I can 
>>>> > deploy another storage solution?
>>>> > 
>>>> > --Jim
>>>> > _______________________________________________
>>>> > Users mailing list -- [email protected]
>>>> > To unsubscribe send an email to [email protected]
>>>> > Privacy Statement: https://www.ovirt.org/site/privacy-policy/
>>>> > oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
>>>> > List Archives: https://lists.ovirt.org/archives/list/[email protected]/message/YQX3LQFQQPW4JTCB7B6FY2LLR6NA2CB3/
>>>> 
>>>> 
>>> 
>>> 
>>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
> 
> Edward Clay 
> Systems Administrator
> The Hut Group
> 
> Email: [email protected]
> 
> 
> 
> 

