Based on #2 in the initial email, hbase:meta might not be the cause of the hotspot.

Saad:
Can you pastebin a stack trace of the hot region server when this happens again?

Thanks

> On Dec 2, 2016, at 4:48 AM, Saad Mufti <saad.mu...@gmail.com> wrote:
> 
> We used a pre-split into 1024 regions at the start, but we miscalculated our
> data size, so there were still auto-split storms at the beginning as the data
> size stabilized; it has ended up at around 9500 or so regions, plus a few
> thousand regions for a few other tables (much smaller). But we haven't had
> any new auto-splits in a couple of months, and the hotspots only started
> happening recently.
> 
> Our hashing scheme is very simple: we take the MD5 of the key, then form a
> 4-digit prefix from the first two bytes of the MD5, normalized to the range
> 0-1023. I am fairly confident about this scheme, especially since even during
> the hotspot we see no evidence so far that any particular region is taking
> disproportionate traffic (based on Cloudera Manager per-region charts on the
> hotspot server). Does that look like a reasonable scheme to randomize which
> region any given key goes to? And the start of the hotspot doesn't seem to
> correspond to any region-split or region-move activity.
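> 
> Roughly, the prefix computation looks like this (a simplified sketch for
> illustration, not our exact production code; the class name is made up):
> 
> import java.nio.charset.StandardCharsets;
> import java.security.MessageDigest;
> 
> public class SaltedKey {
>     // Illustrative only: prepend a 4-digit bucket in 0-1023 derived from
>     // the first two bytes of the MD5 of the key.
>     public static String salt(String key) throws Exception {
>         byte[] md5 = MessageDigest.getInstance("MD5")
>                 .digest(key.getBytes(StandardCharsets.UTF_8));
>         int bucket = (((md5[0] & 0xFF) << 8) | (md5[1] & 0xFF)) % 1024;
>         return String.format("%04d", bucket) + key;
>     }
> }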
> 
> Thanks.
> 
> ----
> Saad
> 
> 
>> On Thu, Dec 1, 2016 at 3:32 PM, John Leach <jle...@splicemachine.com> wrote:
>> 
>> Saad,
>> 
>> Region move or split causes client connections to simultaneously refresh
>> their meta.
>> 
>> The key word is "supposed". We have seen meta hotspotting from time to time,
>> and on different versions, at Splice Machine.
>> 
>> How confident are you in your hashing algorithm?
>> 
>> Regards,
>> John Leach
>> 
>> 
>> 
>>> On Dec 1, 2016, at 2:25 PM, Saad Mufti <saad.mu...@gmail.com> wrote:
>>> 
>>> No, never thought about that. I just figured out how to locate the server
>>> for that table after you mentioned it. We'll have to keep an eye on it next
>>> time we have a hotspot to see if it coincides with the hotspot server.
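>>> 
>>> For reference, locating it can be done with something along these lines (a
>>> quick sketch using the standard client API, just to show the idea):
>>> 
>>> import org.apache.hadoop.hbase.HBaseConfiguration;
>>> import org.apache.hadoop.hbase.HRegionLocation;
>>> import org.apache.hadoop.hbase.TableName;
>>> import org.apache.hadoop.hbase.client.Connection;
>>> import org.apache.hadoop.hbase.client.ConnectionFactory;
>>> import org.apache.hadoop.hbase.client.RegionLocator;
>>> 
>>> public class FindMetaServer {
>>>     public static void main(String[] args) throws Exception {
>>>         try (Connection conn =
>>>                 ConnectionFactory.createConnection(HBaseConfiguration.create());
>>>              RegionLocator locator =
>>>                 conn.getRegionLocator(TableName.META_TABLE_NAME)) {
>>>             // Print which server currently hosts each hbase:meta region.
>>>             for (HRegionLocation loc : locator.getAllRegionLocations()) {
>>>                 System.out.println(loc.getRegionInfo().getRegionNameAsString()
>>>                         + " -> " + loc.getServerName());
>>>             }
>>>         }
>>>     }
>>> }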
>>> 
>>> What would be the theory for how it could become a hotspot? Isn't the
>>> client supposed to cache it and only go back for a refresh if it hits a
>>> region that is not in its expected location?
>>> 
>>> ----
>>> Saad
>>> 
>>> 
>>> On Thu, Dec 1, 2016 at 2:56 PM, John Leach <jle...@splicemachine.com> wrote:
>>> 
>>>> Saad,
>>>> 
>>>> Did you validate that Meta is not on the “Hot” region server?
>>>> 
>>>> Regards,
>>>> John Leach
>>>> 
>>>> 
>>>> 
>>>>> On Dec 1, 2016, at 1:50 PM, Saad Mufti <saad.mu...@gmail.com> wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> We are using HBase 1.0 on CDH 5.5.2. We have taken great care to avoid
>>>>> hotspotting due to inadvertent data patterns by prepending an MD5-based
>>>>> 4-digit hash prefix to all our data keys. This works fine most of the
>>>>> time, but more and more often recently (as much as once or twice a day)
>>>>> we have occasions where one region server suddenly becomes "hot" (CPU
>>>>> above or around 95% in various monitoring tools). When it happens it
>>>>> lasts for hours; occasionally the hotspot might jump to another region
>>>>> server as the master decides the region server is unresponsive and gives
>>>>> its regions to another server.
>>>>> 
>>>>> For the longest time, we thought this must be some single rogue key in
>>>>> our input data that is being hammered. All attempts to track this down
>>>>> have failed, though, and the following behavior argues against this being
>>>>> application based:
>>>>> 
>>>>> 1. Plotting the Get and Put rate by region on the "hot" region server in
>>>>> Cloudera Manager charts shows no single region is an outlier.
>>>>> 
>>>>> 2. Cleanly restarting just the region server process causes its regions
>>>>> to randomly migrate to other region servers, then it gets new ones from
>>>>> the HBase master, basically a sort of shuffling, and then the hotspot
>>>>> goes away. If it were application based, you'd expect the hotspot to
>>>>> just jump to another region server.
>>>>> 
>>>>> 3. We have pored through the region server logs and can't see anything
>>>>> out of the ordinary happening.
>>>>> 
>>>>> The only other pertinent thing to mention might be that we have a
>>>>> special process of our own, running outside the cluster, that does
>>>>> cluster-wide major compaction in a rolling fashion, where each batch
>>>>> consists of one region from each region server, and it waits until one
>>>>> batch is completely done before starting another. We have seen no real
>>>>> impact on the hotspot from shutting this down, and in normal times it
>>>>> doesn't impact our read or write performance much.
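>>>>> 
>>>>> In rough outline, that process does something like the following (a
>>>>> simplified sketch against the standard Admin API; the real tool adds
>>>>> scheduling, throttling and error handling, and the class name here is
>>>>> made up):
>>>>> 
>>>>> import java.util.*;
>>>>> import org.apache.hadoop.hbase.HBaseConfiguration;
>>>>> import org.apache.hadoop.hbase.HRegionInfo;
>>>>> import org.apache.hadoop.hbase.ServerName;
>>>>> import org.apache.hadoop.hbase.client.Admin;
>>>>> import org.apache.hadoop.hbase.client.Connection;
>>>>> import org.apache.hadoop.hbase.client.ConnectionFactory;
>>>>> 
>>>>> public class RollingMajorCompact {
>>>>>     public static void main(String[] args) throws Exception {
>>>>>         try (Connection conn =
>>>>>                 ConnectionFactory.createConnection(HBaseConfiguration.create());
>>>>>              Admin admin = conn.getAdmin()) {
>>>>>             // One queue of online regions per region server.
>>>>>             Map<ServerName, Deque<HRegionInfo>> queues = new HashMap<>();
>>>>>             for (ServerName sn : admin.getClusterStatus().getServers()) {
>>>>>                 queues.put(sn, new ArrayDeque<>(admin.getOnlineRegions(sn)));
>>>>>             }
>>>>>             boolean more = true;
>>>>>             while (more) {
>>>>>                 more = false;
>>>>>                 // Each batch takes at most one region from each server.
>>>>>                 List<byte[]> batch = new ArrayList<>();
>>>>>                 for (Deque<HRegionInfo> q : queues.values()) {
>>>>>                     if (!q.isEmpty()) {
>>>>>                         batch.add(q.poll().getRegionName());
>>>>>                         more = true;
>>>>>                     }
>>>>>                 }
>>>>>                 for (byte[] region : batch) {
>>>>>                     admin.majorCompactRegion(region); // async request
>>>>>                 }
>>>>>                 // Wait here until the whole batch has finished compacting
>>>>>                 // (e.g. by polling compaction state) before starting the
>>>>>                 // next batch.
>>>>>             }
>>>>>         }
>>>>>     }
>>>>> }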
>>>>> 
>>>>> We are at our wit's end; does anyone have experience with a scenario
>>>>> like this? Any help/guidance would be most appreciated.
>>>>> 
>>>>> -----
>>>>> Saad
>>>> 
>>>> 
>> 
>> 
