Thanks everyone for the feedback. We tracked this down to a bad design
using dynamic columns: a few (very few) rows had accumulated up to
200,000 dynamic columns. Any activity that caused us to read one of
these rows resulted in a hot region server.

Follow-up question: we are now cleaning up those rows as we identify
them, but some are so big that trying to read them during cleanup kills
the process with out-of-memory exceptions. Is there any way to identify
rows with too many columns without actually reading them in full?
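
In case it helps frame the question, the workaround we're considering is
a keys-only scan with a small column batch size, so no single wide row
is ever materialized in one piece. Rough, untested sketch against the
HBase 1.x Java client (the table name and the 100,000-cell threshold are
just placeholders):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.KeyOnlyFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class WideRowFinder {
  // Placeholder cutoff for what we would consider "too wide".
  private static final long THRESHOLD = 100000L;

  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    Connection conn = ConnectionFactory.createConnection(conf);
    Table table = conn.getTable(TableName.valueOf("my_table")); // placeholder name

    Scan scan = new Scan();
    scan.setFilter(new KeyOnlyFilter()); // ship keys only, never the values
    scan.setBatch(10000);                // cap cells per Result so wide rows arrive in chunks
    scan.setCacheBlocks(false);          // don't churn the block cache with this sweep

    byte[] currentRow = null;
    long cells = 0;
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        // With setBatch, a wide row comes back as several consecutive
        // Results sharing the same row key; sum them up per row.
        if (currentRow == null || !Bytes.equals(currentRow, r.getRow())) {
          report(currentRow, cells);
          currentRow = r.getRow();
          cells = 0;
        }
        cells += r.rawCells().length;
      }
      report(currentRow, cells); // don't forget the last row
    } finally {
      scanner.close();
      table.close();
      conn.close();
    }
  }

  private static void report(byte[] row, long cells) {
    if (row != null && cells > THRESHOLD) {
      System.out.println(Bytes.toStringBinary(row) + " ~" + cells + " cells");
    }
  }
}

The region servers would still walk every cell of those rows (just
without shipping the values back), so it doesn't truly avoid the read,
which is why we're asking whether there's a cheaper way to flag
oversized rows.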

Thanks.

----
Saad


On Sat, Dec 3, 2016 at 3:20 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> I took a look at the stack trace.
>
> Region server log would give us more detail on the frequency and duration
> of compactions.
>
> Cheers
>
> On Sat, Dec 3, 2016 at 7:39 AM, Jeremy Carroll <phobos...@gmail.com>
> wrote:
>
> > I would check compaction, investigate throttling if it's causing high
> CPU.
> >
> > On Sat, Dec 3, 2016 at 6:20 AM Saad Mufti <saad.mu...@gmail.com> wrote:
> >
> > > No.
> > >
> > > ----
> > > Saad
> > >
> > >
> > > On Fri, Dec 2, 2016 at 3:27 PM, Ted Yu <ted...@yahoo.com.invalid>
> wrote:
> > >
> > > > Somehow I couldn't access the pastebin (I am in China now).
> > > > Did the region server showing the hotspot host hbase:meta?
> > > > Thanks
> > > >
> > > >     On Friday, December 2, 2016 11:53 AM, Saad Mufti <
> > > saad.mu...@gmail.com>
> > > > wrote:
> > > >
> > > >
> > > >  We're in AWS with D2.4xLarge instances. Each instance has 12
> > independent
> > > > spindles/disks from what I can tell.
> > > >
> > > > We have charted get_rate and mutate_rate by host and
> > > >
> > > > a) mutate_rate shows no real outliers
> > > > b) read_rate shows the overall rate on the "hotspot" region server
> is a
> > > bit
> > > > higher than every other server, not severely but enough that it is a
> > bit
> > > > noticeable. But when we chart get_rate on that server by region, no
> one
> > > > region stands out.
> > > >
> > > > get_rate chart by host:
> > > >
> > > > https://snag.gy/hmoiDw.jpg
> > > >
> > > > mutate_rate chart by host:
> > > >
> > > > https://snag.gy/jitdMN.jpg
> > > >
> > > > ----
> > > > Saad
> > > >
> > > >
> > > > On Fri, Dec 2, 2016 at 2:34 PM, John Leach <jle...@splicemachine.com
> >
> > > > wrote:
> > > >
> > > > > Here is what I see...
> > > > >
> > > > >
> > > > > * Short Compaction Running on Heap
> > > > > "regionserver/ip-10-99-181-146.aolp-prd.us-east-1.ec2.aolcloud.net/10.99.181.146:60020-shortCompactions-1480229281547" - Thread t@242
> > > > >    java.lang.Thread.State: RUNNABLE
> > > > >    at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder.compressSingleKeyValue(FastDiffDeltaEncoder.java:270)
> > > > >    at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder.internalEncode(FastDiffDeltaEncoder.java:245)
> > > > >    at org.apache.hadoop.hbase.io.encoding.BufferedDataBlockEncoder.encode(BufferedDataBlockEncoder.java:987)
> > > > >    at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder.encode(FastDiffDeltaEncoder.java:58)
> > > > >    at org.apache.hadoop.hbase.io.hfile.HFileDataBlockEncoderImpl.encode(HFileDataBlockEncoderImpl.java:97)
> > > > >    at org.apache.hadoop.hbase.io.hfile.HFileBlock$Writer.write(HFileBlock.java:866)
> > > > >    at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(HFileWriterV2.java:270)
> > > > >    at org.apache.hadoop.hbase.io.hfile.HFileWriterV3.append(HFileWriterV3.java:87)
> > > > >    at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.append(StoreFile.java:949)
> > > > >    at org.apache.hadoop.hbase.regionserver.compactions.Compactor.performCompaction(Compactor.java:282)
> > > > >    at org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:105)
> > > > >    at org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.compact(DefaultStoreEngine.java:124)
> > > > >    at org.apache.hadoop.hbase.regionserver.HStore.compact(HStore.java:1233)
> > > > >    at org.apache.hadoop.hbase.regionserver.HRegion.compact(HRegion.java:1770)
> > > > >    at org.apache.hadoop.hbase.regionserver.CompactSplitThread$CompactionRunner.run(CompactSplitThread.java:520)
> > > > >    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> > > > >    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> > > > >    at java.lang.Thread.run(Thread.java:745)
> > > > >
> > > > >
> > > > > * WAL Syncs waiting…  ALL 5
> > > > > "sync.0" - Thread t@202
> > > > >    java.lang.Thread.State: TIMED_WAITING
> > > > >    at java.lang.Object.wait(Native Method)
> > > > >    - waiting on <67ba892d> (a java.util.LinkedList)
> > > > >    at org.apache.hadoop.hdfs.DFSOutputStream.waitForAckedSeqno(DFSOutputStream.java:2337)
> > > > >    at org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:2224)
> > > > >    at org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:2116)
> > > > >    at org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:130)
> > > > >    at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> > > > >    at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1379)
> > > > >    at java.lang.Thread.run(Thread.java:745)
> > > > >
> > > > > * Mutations backing up very badly...
> > > > >
> > > > > "B.defaultRpcServer.handler=103,queue=7,port=60020" - Thread t@155
> > > > >    java.lang.Thread.State: TIMED_WAITING
> > > > >    at java.lang.Object.wait(Native Method)
> > > > >    - waiting on <6ab54ea3> (a org.apache.hadoop.hbase.regionserver.wal.SyncFuture)
> > > > >    at org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:167)
> > > > >    at org.apache.hadoop.hbase.regionserver.wal.FSHLog.blockOnSync(FSHLog.java:1504)
> > > > >    at org.apache.hadoop.hbase.regionserver.wal.FSHLog.publishSyncThenBlockOnCompletion(FSHLog.java:1498)
> > > > >    at org.apache.hadoop.hbase.regionserver.wal.FSHLog.sync(FSHLog.java:1632)
> > > > >    at org.apache.hadoop.hbase.regionserver.HRegion.syncOrDefer(HRegion.java:7737)
> > > > >    at org.apache.hadoop.hbase.regionserver.HRegion.processRowsWithLocks(HRegion.java:6504)
> > > > >    at org.apache.hadoop.hbase.regionserver.HRegion.mutateRowsWithLocks(HRegion.java:6352)
> > > > >    at org.apache.hadoop.hbase.regionserver.HRegion.mutateRowsWithLocks(HRegion.java:6334)
> > > > >    at org.apache.hadoop.hbase.regionserver.HRegion.mutateRow(HRegion.java:6325)
> > > > >    at org.apache.hadoop.hbase.regionserver.RSRpcServices.mutateRows(RSRpcServices.java:418)
> > > > >    at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:1916)
> > > > >    at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32213)
> > > > >    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2034)
> > > > >    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
> > > > >    at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
> > > > >    at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
> > > > >    at java.lang.Thread.run(Thread.java:745)
> > > > >
> > > > >
> > > > > Too many writers being blocked attempting to write to WAL.
> > > > >
> > > > > What does your disk infrastructure look like?  Can you get away
> with
> > > > > Multi-wal?  Ugh...
> > > > >
> > > > > Regards,
> > > > > John Leach
> > > > >
> > > > >
> > > > > > On Dec 2, 2016, at 1:20 PM, Saad Mufti <saad.mu...@gmail.com>
> > wrote:
> > > > > >
> > > > > > Hi Ted,
> > > > > >
> > > > > > Finally we have another hotspot going on, same symptoms as
> before,
> > > here
> > > > > is
> > > > > > the pastebin for the stack trace from the region server that I
> > > obtained
> > > > > via
> > > > > > VisualVM:
> > > > > >
> > > > > > http://pastebin.com/qbXPPrXk
> > > > > >
> > > > > > Would really appreciate any insight you or anyone else can
> provide.
> > > > > >
> > > > > > Thanks.
> > > > > >
> > > > > > ----
> > > > > > Saad
> > > > > >
> > > > > >
> > > > > > On Thu, Dec 1, 2016 at 6:08 PM, Saad Mufti <saad.mu...@gmail.com
> >
> > > > wrote:
> > > > > >
> > > > > >> Sure will, the next time it happens.
> > > > > >>
> > > > > >> Thanks!!!
> > > > > >>
> > > > > >> ----
> > > > > >> Saad
> > > > > >>
> > > > > >>
> > > > > >> On Thu, Dec 1, 2016 at 5:01 PM, Ted Yu <ted...@yahoo.com.invalid
> >
> > > > > wrote:
> > > > > >>
> > > > > >>> From #2 in the initial email, the hbase:meta might not be the
> > cause
> > > > for
> > > > > >>> the hotspot.
> > > > > >>>
> > > > > >>> Saad:
> > > > > >>> Can you pastebin stack trace of the hot region server when this
> > > > happens
> > > > > >>> again ?
> > > > > >>>
> > > > > >>> Thanks
> > > > > >>>
> > > > > >>>> On Dec 2, 2016, at 4:48 AM, Saad Mufti <saad.mu...@gmail.com>
> > > > wrote:
> > > > > >>>>
> > > > > >>>> We used a pre-split into 1024 regions at the start but we
> > > > > miscalculated
> > > > > >>> our
> > > > > >>>> data size, so there were still auto-splits storms at the
> > beginning
> > > > as
> > > > > >>> data
> > > > > >>>> size stabilized, it has ended up at around 9500 or so regions,
> > > plus
> > > > a
> > > > > >>> few
> > > > > >>>> thousand regions for a few other tables (much smaller). But
> > > haven't
> > > > > had
> > > > > >>> any
> > > > > >>>> new auto-splits in a couple of months. And the hotspots only
> > > started
> > > > > >>>> happening recently.
> > > > > >>>>
> > > > > >>>> Our hashing scheme is very simple, we take the MD5 of the key,
> > > then
> > > > > >>> form a
> > > > > >>>> 4 digit prefix based on the first two bytes of the MD5
> > normalized
> > > to
> > > > > be
> > > > > >>>> within the range 0-1023 . I am fairly confident about this
> > scheme
> > > > > >>>> especially since even during the hotspot we see no evidence so
> > far
> > > > > that
> > > > > >>> any
> > > > > >>>> particular region is taking disproportionate traffic (based on
> > > > > Cloudera
> > > > > >>>> Manager per region charts on the hotspot server). Does that
> look
> > > > like
> > > > > a
> > > > > >>>> reasonable scheme to randomize which region any give key goes
> > to?
> > > > And
> > > > > >>> the
> > > > > >>>> start of the hotspot doesn't seem to correspond to any region
> > > > > splitting
> > > > > >>> or
> > > > > >>>> moving from one server to another activity.
> > > > > >>>>
> > > > > >>>> Thanks.
> > > > > >>>>
> > > > > >>>> ----
> > > > > >>>> Saad
> > > > > >>>>
> > > > > >>>>
> > > > > >>>>> On Thu, Dec 1, 2016 at 3:32 PM, John Leach <
> > > > jle...@splicemachine.com
> > > > > >
> > > > > >>> wrote:
> > > > > >>>>>
> > > > > >>>>> Saad,
> > > > > >>>>>
> > > > > >>>>> Region move or split causes client connections to
> > simultaneously
> > > > > >>> refresh
> > > > > >>>>> their meta.
> > > > > >>>>>
> > > > > >>>>> Key word is supposed.  We have seen meta hot spotting from
> time
> > > to
> > > > > time
> > > > > >>>>> and on different versions at Splice Machine.
> > > > > >>>>>
> > > > > >>>>> How confident are you in your hashing algorithm?
> > > > > >>>>>
> > > > > >>>>> Regards,
> > > > > >>>>> John Leach
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>>> On Dec 1, 2016, at 2:25 PM, Saad Mufti <
> saad.mu...@gmail.com>
> > > > > wrote:
> > > > > >>>>>>
> > > > > >>>>>> No never thought about that. I just figured out how to
> locate
> > > the
> > > > > >>> server
> > > > > >>>>>> for that table after you mentioned it. We'll have to keep an
> > eye
> > > > on
> > > > > it
> > > > > >>>>> next
> > > > > >>>>>> time we have a hotspot to see if it coincides with the
> hotspot
> > > > > server.
> > > > > >>>>>>
> > > > > >>>>>> What would be the theory for how it could become a hotspot?
> > > Isn't
> > > > > the
> > > > > >>>>>> client supposed to cache it and only go back for a refresh
> if
> > it
> > > > > hits
> > > > > >>> a
> > > > > >>>>>> region that is not in its expected location?
> > > > > >>>>>>
> > > > > >>>>>> ----
> > > > > >>>>>> Saad
> > > > > >>>>>>
> > > > > >>>>>>
> > > > > >>>>>> On Thu, Dec 1, 2016 at 2:56 PM, John Leach <
> > > > > jle...@splicemachine.com>
> > > > > >>>>> wrote:
> > > > > >>>>>>
> > > > > >>>>>>> Saad,
> > > > > >>>>>>>
> > > > > >>>>>>> Did you validate that Meta is not on the “Hot” region
> server?
> > > > > >>>>>>>
> > > > > >>>>>>> Regards,
> > > > > >>>>>>> John Leach
> > > > > >>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>>> On Dec 1, 2016, at 1:50 PM, Saad Mufti <
> > saad.mu...@gmail.com>
> > > > > >>> wrote:
> > > > > >>>>>>>>
> > > > > >>>>>>>> Hi,
> > > > > >>>>>>>>
> > > > > >>>>>>>> We are using HBase 1.0 on CDH 5.5.2 . We have taken great
> > care
> > > > to
> > > > > >>> avoid
> > > > > >>>>>>>> hotspotting due to inadvertent data patterns by prepending
> > an
> > > > MD5
> > > > > >>>>> based 4
> > > > > >>>>>>>> digit hash prefix to all our data keys. This works fine
> most
> > > of
> > > > > the
> > > > > >>>>>>> times,
> > > > > >>>>>>>> but more and more (as much as once or twice a day)
> recently
> > we
> > > > > have
> > > > > >>>>>>>> occasions where one region server suddenly becomes "hot"
> > (CPU
> > > > > above
> > > > > >>> or
> > > > > >>>>>>>> around 95% in various monitoring tools). When it happens
> it
> > > > lasts
> > > > > >>> for
> > > > > >>>>>>>> hours, occasionally the hotspot might jump to another
> region
> > > > > server
> > > > > >>> as
> > > > > >>>>>>> the
> > > > > >>>>>>>> master decides the region is unresponsive and gives its
> > region
> > > to
> > > > > >>>>> another
> > > > > >>>>>>>> server.
> > > > > >>>>>>>>
> > > > > >>>>>>>> For the longest time, we thought this must be some single
> > > rogue
> > > > > key
> > > > > >>> in
> > > > > >>>>>>> our
> > > > > >>>>>>>> input data that is being hammered. All attempts to track
> > this
> > > > down
> > > > > >>> have
> > > > > >>>>>>>> failed though, and the following behavior argues against
> > this
> > > > > being
> > > > > >>>>>>>> application based:
> > > > > >>>>>>>>
> > > > > >>>>>>>> 1. plotted Get and Put rate by region on the "hot" region
> > > server
> > > > > in
> > > > > >>>>>>>> Cloudera Manager Charts, shows no single region is an
> > outlier.
> > > > > >>>>>>>>
> > > > > >>>>>>>> 2. cleanly restarting just the region server process
> causes
> > > its
> > > > > >>> regions
> > > > > >>>>>>> to
> > > > > >>>>>>>> randomly migrate to other region servers, then it gets new
> > > ones
> > > > > from
> > > > > >>>>> the
> > > > > >>>>>>>> HBase master, basically a sort of shuffling, then the
> > hotspot
> > > > goes
> > > > > >>>>> away.
> > > > > >>>>>>> If
> > > > > >>>>>>>> it were application based, you'd expect the hotspot to
> just
> > > jump
> > > > > to
> > > > > >>>>>>> another
> > > > > >>>>>>>> region server.
> > > > > >>>>>>>>
> > > > > >>>>>>>> 3. have pored through region server logs and can't see
> > > anything
> > > > > out
> > > > > >>> of
> > > > > >>>>>>> the
> > > > > >>>>>>>> ordinary happening
> > > > > >>>>>>>>
> > > > > >>>>>>>> The only other pertinent thing to mention might be that we
> > > have
> > > > a
> > > > > >>>>> special
> > > > > >>>>>>>> process of our own running outside the cluster that does
> > > cluster
> > > > > >>> wide
> > > > > >>>>>>> major
> > > > > >>>>>>>> compaction in a rolling fashion, where each batch consists
> > of
> > > > one
> > > > > >>>>> region
> > > > > >>>>>>>> from each region server, and it waits before one batch is
> > > > > completely
> > > > > >>>>> done
> > > > > >>>>>>>> before starting another. We have seen no real impact on
> the
> > > > > hotspot
> > > > > >>>>> from
> > > > > >>>>>>>> shutting this down and in normal times it doesn't impact
> our
> > > > read
> > > > > or
> > > > > >>>>>>> write
> > > > > >>>>>>>> performance much.
> > > > > >>>>>>>>
> > > > > >>>>>>>> We are at our wit's end, anyone have experience with a
> > > scenario
> > > > > like
> > > > > >>>>>>> this?
> > > > > >>>>>>>> Any help/guidance would be most appreciated.
> > > > > >>>>>>>>
> > > > > >>>>>>>> -----
> > > > > >>>>>>>> Saad
> > > > > >>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>
> > > > > >>
> > > > > >>
> > > > >
> > > > >
> > > >
> > > >
> > > >
> > >
> >
>
