I probably will wait... the 'Shedding' seems like a race condition bug; if you grep for it in the log I've sent, you will see what I mean.
-Jack

On Mon, Oct 4, 2010 at 11:15 PM, Stack <[email protected]> wrote:
> I mean TRUNK.
>
> 0.89s have been cut from TRUNK every 3 or 4 weeks or so.
>
> J-D is about to put up our next 0.89. It does not have the new
> load balancer. The next release we hope will be 0.90.0RC1. That'll
> have the new balancer. Feature freeze is this Weds. Hopefully it'll
> be up not too long after that (week or two?)
>
> As to running out of TRUNK, you could, but it'd be super risky. I'd
> say if you are up for risk, wait a little while. We're busy doing
> stabilization at the mo.
>
> St.Ack
>
> On Mon, Oct 4, 2010 at 11:00 PM, Jack Levin <[email protected]> wrote:
>> By trunk, you mean 0.89 or 0.20.6?
>>
>> -Jack
>>
>> On Mon, Oct 4, 2010 at 10:59 PM, Jack Levin <[email protected]> wrote:
>>> Full stop of all region servers, restart of master, is what brings it all
>>> back:
>>>
>>> Please see attached. Lots of data there; search for 'Shedding'.
>>>
>>> -Jack
>>>
>>> On Mon, Oct 4, 2010 at 9:42 PM, Stack <[email protected]> wrote:
>>>> So, it required a start/stop to fix the balance issue?
>>>>
>>>> Can I see the master log from around the problematic time?
>>>>
>>>> (The load balancer has been completely redone in TRUNK)
>>>>
>>>> St.Ack
>>>>
>>>> On Mon, Oct 4, 2010 at 6:23 PM, Jack Levin <[email protected]> wrote:
>>>>> http://pastebin.com/suw2QVYg this is the OOME event.
>>>>>
>>>>> When I started it up, the master eventually stopped shedding, at 14
>>>>> regions each (it used to be 700 on 10 servers), and stayed there for a
>>>>> while. I waited 10 minutes, then stopped/started all region servers,
>>>>> and they came up in 5 minutes.
>>>>>
>>>>> -Jack
>>>>>
>>>>> On Mon, Oct 4, 2010 at 5:48 PM, Jack Levin <[email protected]> wrote:
>>>>>> 2010-10-04 17:47:25,449 DEBUG
>>>>>> org.apache.hadoop.hbase.master.RegionManager: Server(s) are carrying
>>>>>> only 2 regions. Server mtab5.prod.imageshack.com,60020,1285878100774
>>>>>> is most loaded (290). Shedding 32 regions to pass to least loaded
>>>>>> (numMoveToLowLoaded=177)
>>>>>>
>>>>>>
>>>>>> I observe that the number of loaded regions sheds pretty much to zero
>>>>>> before starting back up (taking a long time in the process), even though
>>>>>> I had the server that OOME'ed started up again. It seems there
>>>>>> might be a bug in the rebalancing logic?
>>>>>>
>>>>>> -Jack
>>>>>>
>>>>>
>>>>
>>>
>>
>
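The symptom described above (assigned region counts collapsing nearly to zero before climbing back) is consistent with a balancer that computes its per-server target from regions *currently assigned*, while shed regions are still sitting in an unassigned pool. Below is a minimal Python sketch of that failure mode. It is purely illustrative and hypothetical, not the actual HBase RegionManager code; the server names (`rs0`..`rs9`) and region counts are made up to mirror the numbers in this thread.

```python
# Hypothetical sketch (NOT the actual HBase RegionManager): models a
# balancer whose per-server target average is computed only from
# currently *assigned* regions, while shed regions wait in an
# unassigned pool. If reassignment lags behind shedding, the average
# keeps falling, so the balancer keeps shedding on every pass.

def balance_round(assigned, unassigned):
    """One balancing pass: shed the most loaded server down to the average."""
    avg = sum(assigned.values()) // len(assigned)   # ignores the unassigned pool
    most = max(assigned, key=assigned.get)
    shed = max(0, assigned[most] - max(avg, 1))
    assigned[most] -= shed
    return assigned, unassigned + shed

# One hot server (e.g. holding the regions of an OOME'd peer) and nine
# lightly loaded ones.
assigned = {"rs0": 700, **{f"rs{i}": 2 for i in range(1, 10)}}
unassigned = 0
for _ in range(10):
    assigned, unassigned = balance_round(assigned, unassigned)

# With reassignment stalled, nearly every region ends up unassigned and
# each server is left "carrying only 2 regions", as in the DEBUG log.
```

In this toy model the target average drops from 71 to 8 to 2 across successive passes, because each pass recomputes it after the previous shed, so almost all 718 regions end up unassigned before anything is handed back out. Whether the real balancer behaved this way would have to be confirmed against the master log and the RegionManager source of that release.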
