I probably will wait... the 'Shedding' seems like a race condition
bug; if you grep for it in the log I've sent, you will see what I mean.
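
To make the race I mean concrete, here's a toy sketch (hypothetical, not
the actual RegionManager code) of how a shed count computed from a stale
load snapshot keeps pulling regions off the most loaded server until it
hits zero:

import java.util.HashMap;
import java.util.Map;

public class ShedRace {
    public static void main(String[] args) {
        // Loads roughly matching the log line: one heavy server, two light.
        Map<String, Integer> load = new HashMap<>();
        load.put("mtab5", 290);
        load.put("rs2", 2);
        load.put("rs3", 2);

        int avg = load.values().stream().mapToInt(Integer::intValue).sum()
                  / load.size();   // target: ~98 regions per server

        // The shed count is derived from a snapshot taken once, but the
        // balance pass runs again before the snapshot refreshes.
        int snapshot = load.get("mtab5");
        for (int pass = 1; pass <= 3; pass++) {
            int toShed = snapshot - avg;  // stale: stays ~192 every pass
            int now = Math.max(0, load.get("mtab5") - toShed);
            System.out.printf("pass %d: shedding %d, mtab5 down to %d%n",
                              pass, toShed, now);
            load.put("mtab5", now);       // real load races toward zero
        }
    }
}

With fresh load numbers each pass it would settle around the average (~98
here) instead of racing to zero.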

-Jack

On Mon, Oct 4, 2010 at 11:15 PM, Stack <[email protected]> wrote:
> I mean TRUNK.
>
> 0.89s have been cut from TRUNK every 3 or 4 weeks or so.
>
> J-D is about to put up our next 0.89.  It does not have the new
> load balancer.  The next release, we hope, will be 0.90.0RC1.  That'll
> have the new balancer.  Feature freeze is this Weds.  Hopefully it'll
> be up not too long after that (a week or two?).
>
> As to running off TRUNK, you could, but it'd be super risky.  I'd
> say if you are up for the risk, wait a little while.  We're busy doing
> stabilization at the mo.
>
> St.Ack
>
> On Mon, Oct 4, 2010 at 11:00 PM, Jack Levin <[email protected]> wrote:
>> By trunk, you mean 0.89 or 0.20.6?
>>
>> -Jack
>>
>> On Mon, Oct 4, 2010 at 10:59 PM, Jack Levin <[email protected]> wrote:
>>> A full stop of all region servers, then a restart of the master, is
>>> what brings it all back.
>>>
>>> Please see attached.  Lots of data there; search for 'Shedding'.
>>>
>>> -Jack
>>>
>>> On Mon, Oct 4, 2010 at 9:42 PM, Stack <[email protected]> wrote:
>>>> So, required a start/stop to fix balance issue?
>>>>
>>>> Can I see master log from around problematic time?
>>>>
>>>> (The load balancer has been completely redone in TRUNK)
>>>>
>>>> St.Ack
>>>>
>>>> On Mon, Oct 4, 2010 at 6:23 PM, Jack Levin <[email protected]> wrote:
>>>>> http://pastebin.com/suw2QVYg is the OOME event.
>>>>>
>>>>> When I started it up, the master eventually stopped shedding at 14
>>>>> regions each (used to be 700 on 10 servers), and stayed there for a
>>>>> while.  I waited 10 minutes, then stopped/started all region servers,
>>>>> and they came up in 5 minutes.
>>>>>
>>>>> -Jack
>>>>>
>>>>> On Mon, Oct 4, 2010 at 5:48 PM, Jack Levin <[email protected]> wrote:
>>>>>> 2010-10-04 17:47:25,449 DEBUG
>>>>>> org.apache.hadoop.hbase.master.RegionManager: Server(s) are carrying
>>>>>> only 2 regions. Server mtab5.prod.imageshack.com,60020,1285878100774
>>>>>> is most loaded (290). Shedding 32 regions to pass to  least loaded
>>>>>> (numMoveToLowLoaded=177)
>>>>>>
>>>>>>
>>>>>> I observe that the number of loaded regions sheds pretty much to zero
>>>>>> before climbing back up (taking a long time in the process), even though
>>>>>> I had the server that OOME'd started up again.  It seems there
>>>>>> might be a bug in the rebalancing logic?
>>>>>>
>>>>>> -Jack
>>>>>>
>>>>>
>>>>
>>>
>>
>
