Re: [ovirt-users] gluster self-heal takes cluster offline

2018-03-23 Thread Darrell Budic
What version of oVirt and Gluster? This sounds like something I just saw with 
gluster 3.12.x. Are you using libgfapi or just fuse mounts?
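For anyone unsure which access method their hosts are using, a rough way to check (a sketch, assuming a standard oVirt setup; `engine-config` runs on the engine VM, and the exact output varies by version):

```shell
# On a host: FUSE-mounted Gluster storage shows up as type fuse.glusterfs
mount | grep fuse.glusterfs

# On the engine VM: see whether libgfapi access is enabled per
# cluster compatibility level
engine-config -g LibgfApiSupported
```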

> From: Sahina Bose <sab...@redhat.com>
> Subject: Re: [ovirt-users] gluster self-heal takes cluster offline
> Date: March 23, 2018 at 1:26:01 AM CDT
> To: Jim Kusznir
> Cc: Ravishankar Narayanankutty; users
> 
> 
> 
> On Fri, Mar 16, 2018 at 2:45 AM, Jim Kusznir <j...@palousetech.com> wrote:
> Hi all:
> 
> I'm trying to understand why/how (and most importantly, how to fix) a 
> substantial issue I had last night.  This happened one other time, but I 
> didn't know/understand all the parts associated with it until last night.
> 
> I have a 3-node hyperconverged (self-hosted engine, Gluster on each node) 
> cluster.  Gluster is Replica 2 + arbiter.  Current network configuration is 
> 2x GigE in a load-balance bond ("LAG group" on the switch), plus one GigE from each 
> server on a separate VLAN, intended for Gluster (but not used).  Server 
> hardware is Dell R610's; each server has an SSD in it.  Servers 1 and 2 have 
> the full replica; server 3 is the arbiter.
> 
> I put server 2 into maintenance so I could work on the hardware, including turning 
> it off and such.  In the course of the work, I found that I needed to 
> reconfigure the SSD's partitioning somewhat, and that resulted in wiping the 
> data partition (storing VM images).  I figured it was no big deal; gluster would 
> rebuild it in short order.  I did take care of the extended attribute settings 
> and the like, and when I booted it up, gluster came up as expected and began 
> rebuilding the disk.
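As an aside: rather than recreating the extended attributes by hand, recent Gluster releases can reintroduce a wiped brick with `reset-brick`, which reuses the same brick path and kicks off the heal itself. A sketch, with `VOLNAME` and the brick path as placeholders:

```shell
# Take the brick offline, then hand the (now-empty) path back to the volume;
# "commit force" re-registers it and triggers self-heal onto it
gluster volume reset-brick VOLNAME server2:/gluster/brick start
gluster volume reset-brick VOLNAME server2:/gluster/brick \
    server2:/gluster/brick commit force
```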
> 
> How big was the data on this partition? What was the shard size set on the 
> gluster volume?
> Out of curiosity, how long did it take to heal and come back to operational?
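Both answers can be read off the volume directly; a sketch with `VOLNAME` as a placeholder:

```shell
# Shard size in effect on the volume (Gluster's default is 64MB;
# oVirt hyperconverged setups often use 512MB)
gluster volume get VOLNAME features.shard-block-size

# Rough heal backlog: entries still pending per brick
gluster volume heal VOLNAME statistics heal-count
```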
> 
> 
> The problem is that suddenly my entire cluster got very sluggish.  The engine 
> was marking nodes and VMs as failed and then unfailing them throughout the system, 
> fairly randomly.  It didn't matter what node the engine or VM was on.  At one 
> point, it power cycled server 1 for being "non-responsive" (even though everything 
> was running on it, and the gluster rebuild was working on it).  As a result 
> of this, about 6 VMs were killed and my entire gluster system went down hard 
> (suspending all remaining VMs and the engine), as there were no remaining 
> full copies of the data.  After several minutes (these are Dell servers, 
> after all...), server 1 came back up, gluster resumed the rebuild, and the node 
> came back online in the cluster.  I had to manually (via a virsh command) unpause the 
> engine, and then struggle through getting critical VMs back up.  
> Everything was super slow, and load averages on the servers were often seen 
> in excess of 80 (these are 8 core / 16 thread boxes).  Actual CPU usage 
> (reported by top) was rarely above 40% (inclusive of all CPUs) for any one 
> server.  Glusterfs was often seen using 180%-350% of a CPU on servers 1 and 2.
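For reference, the manual unpause looks roughly like this (the domain name varies; `HostedEngine` is the usual default, and on oVirt hosts virsh may ask for the vdsm SASL credentials):

```shell
virsh -r list --all       # read-only listing: find the paused domain
virsh resume HostedEngine # resume ("unpause") the engine VM
```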
> 
> I ended up putting the cluster in global HA maintenance mode and disabling 
> power fencing on the nodes until the process finished.  On at 
> least two occasions a functional node was marked bad; had fencing not 
> been disabled, a node would have rebooted, further exacerbating the 
> problem.
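For the record, global HA maintenance is toggled from any host running the hosted-engine agent (fencing/power management itself is disabled per host in the engine UI, so only the maintenance half has a CLI):

```shell
hosted-engine --set-maintenance --mode=global  # HA agents stop acting on VM/host state
hosted-engine --vm-status                      # should report global maintenance
hosted-engine --set-maintenance --mode=none    # back to normal when done
```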
> 
> It's clear that the gluster rebuild overloaded things and caused the problem.  
> I don't know why the load was so high (even IOWait was low), but load 
> averages were definitely tied to the glusterfs CPU utilization %.  At no 
> point did I have any problems pinging any machine (host or VM) unless the 
> engine decided it was dead and killed it.
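One thing worth trying before the next heal: high CPU with low IOWait is consistent with checksumming work, and the self-heal behavior is tunable per volume. A hedged sketch with `VOLNAME` as a placeholder; check current values first, since defaults vary by release:

```shell
# For a brick that was wiped completely, a full copy can avoid the
# CPU-heavy checksum comparison the "diff" algorithm does per shard
gluster volume set VOLNAME cluster.data-self-heal-algorithm full

# Inspect (and if it was raised, lower) the heal daemon's parallelism
gluster volume get VOLNAME cluster.shd-max-threads
gluster volume get VOLNAME cluster.shd-wait-qlength
```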
> 
> Why did my system bite it so hard with the rebuild?  I babied it along until 
> the rebuild was complete, after which it returned to normal operation.
> 
> As of this event, all networking (host/engine management, gluster, and VM 
> network) was on the same VLAN.  I'd love to move things off, but so far any 
> attempt to do so breaks my cluster.  How can I move my management interfaces 
> to a separate VLAN/IP space?  I also want to move Gluster to its own private 
> space, but it seems that if I change anything in the peers file, the entire 
> gluster cluster goes down.  The dedicated gluster network is listed as a 
> secondary hostname for all peers already.
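On that last point: glusterd treats the peer files as its own authoritative state, which is why hand-editing them takes the cluster down. The supported way to register an additional address is to probe an existing peer again under its new name (this is how those secondary hostnames get recorded); a sketch with hypothetical hostnames:

```shell
# server2 is already a peer under its management hostname; probing its
# gluster-VLAN name adds it as an alternate address for the same peer
gluster peer probe server2-gluster.example.com
gluster peer status   # the peer entry should now list both hostnames
```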
> 
> Will the above network reconfigurations be enough?  I got the impression that 
> the issue may not have been purely network based, but possibly server IO 
> overload.  Is this likely / right?
> 
> I appreciate input.  I don't think gluster's recovery is supposed to do as 
> much damage as it did the last two or three times any healing was required.
> 
> Thanks!

[ovirt-users] gluster self-heal takes cluster offline

2018-03-15 Thread Jim Kusznir
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users