Re: Hot Region Server With No Hot Region

2016-12-13 Thread Stack
On Tue, Dec 13, 2016 at 12:47 PM, Saad Mufti  wrote:

> Thanks everyone for the feedback. We tracked this down to a bad design
> using dynamic columns: there were a few (very few) rows that accumulated up
> to 200,000 dynamic columns. Any activity that caused us to try to read one
> of these rows resulted in a hot region server.
>
> Follow-up question: we are now in the process of cleaning up the rows we
> identified, but some are so big that trying to read them in the cleanup
> process kills it with out-of-memory exceptions. Is there any way to
> identify rows with too many columns without actually reading them all?
>
>
Can you upgrade and then read with partials enabled?
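
For context, "partials" here means the partial-Result support that arrived with
the 1.1 client: with Scan#setAllowPartialResults the scanner hands a very wide
row back in chunks instead of materializing it whole. A rough sketch of what the
cleanup read could look like after such an upgrade (the table and row key names
below are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PartialRowRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("my_table"))) {
      // Scan exactly one oversized row; the trailing zero byte makes "fat-row"
      // the only row inside the [start, stop) range.
      Scan scan = new Scan(Bytes.toBytes("fat-row"), Bytes.toBytes("fat-row\0"));
      scan.setAllowPartialResults(true);       // rows come back in partial chunks
      scan.setMaxResultSize(2 * 1024 * 1024);  // cap each RPC at roughly 2 MB of cells
      long cells = 0;
      try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result chunk : scanner) {
          cells += chunk.rawCells().length;
          // process or delete this chunk of qualifiers instead of the whole row
        }
      }
      System.out.println("cells seen: " + cells);
    }
  }
}

That way nothing close to 200,000 columns ever has to sit in client memory at once.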

How are you doing your cleaning?

(In the past I've heard of folks narrowing down the culprit storefiles and
then offline rewriting hfiles with a variant on ./hbase/bin/hbase --config
~/conf_hbase org.apache.hadoop.hbase.io.hfile.HFile)

St.Ack







> Thanks.
>
> 
> Saad
>
>
> On Sat, Dec 3, 2016 at 3:20 PM, Ted Yu  wrote:
>
> > I took a look at the stack trace.
> >
> > Region server log would give us more detail on the frequency and duration
> > of compactions.
> >
> > Cheers
> >
> > On Sat, Dec 3, 2016 at 7:39 AM, Jeremy Carroll 
> > wrote:
> >
> > > I would check compaction, investigate throttling if it's causing high
> > CPU.
> > >
> > > On Sat, Dec 3, 2016 at 6:20 AM Saad Mufti 
> wrote:
> > >
> > > > No.
> > > >
> > > > 
> > > > Saad
> > > >
> > > >
> > > > On Fri, Dec 2, 2016 at 3:27 PM, Ted Yu 
> > wrote:
> > > >
> > > > > Somehow I couldn't access the pastebin (I am in China now).
> > > > > Did the region server showing the hotspot host hbase:meta?
> > > > > Thanks
> > > > >
> > > > > On Friday, December 2, 2016 11:53 AM, Saad Mufti <
> > > > saad.mu...@gmail.com>
> > > > > wrote:
> > > > >
> > > > >
> > > > >  We're in AWS with D2.4xLarge instances. Each instance has 12
> > > independent
> > > > > spindles/disks from what I can tell.
> > > > >
> > > > > We have charted get_rate and mutate_rate by host and
> > > > >
> > > > > a) mutate_rate shows no real outliers
> > > > > b) read_rate shows the overall rate on the "hotspot" region server
> > is a
> > > > bit
> > > > > higher than every other server, not severely but enough that it is
> a
> > > bit
> > > > > noticeable. But when we chart get_rate on that server by region, no
> > one
> > > > > region stands out.
> > > > >
> > > > > get_rate chart by host:
> > > > >
> > > > > https://snag.gy/hmoiDw.jpg
> > > > >
> > > > > mutate_rate chart by host:
> > > > >
> > > > > https://snag.gy/jitdMN.jpg
> > > > >
> > > > > 
> > > > > Saad
> > > > >
> > > > >
> > > > > 
> > > > > Saad
> > > > >
> > > > >
> > > > > On Fri, Dec 2, 2016 at 2:34 PM, John Leach <
> jle...@splicemachine.com
> > >
> > > > > wrote:
> > > > >
> > > > > > Here is what I see...
> > > > > >
> > > > > >
> > > > > > * Short Compaction Running on Heap
> > > > > > "regionserver/ip-10-99-181-146.aolp-prd.us-east-1.ec2.
> > > > > > aolcloud.net/10.99.181.146:60020-shortCompactions-1480229281547"
> -
> > > > > Thread
> > > > > > t@242
> > > > > >java.lang.Thread.State: RUNNABLE
> > > > > >at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder.
> > > > > > compressSingleKeyValue(FastDiffDeltaEncoder.java:270)
> > > > > >at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder.
> > > > > > internalEncode(FastDiffDeltaEncoder.java:245)
> > > > > >at org.apache.hadoop.hbase.io.encoding.
> > BufferedDataBlockEncoder.
> > > > > > encode(BufferedDataBlockEncoder.java:987)
> > > > > >at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder.
> > > > > > encode(FastDiffDeltaEncoder.java:58)
> > > > > >at org.apache.hadoop.hbase.io
> > > > .hfile.HFileDataBlockEncoderImpl.encode(
> > > > > > HFileDataBlockEncoderImpl.java:97)
> > > > > >at org.apache.hadoop.hbase.io.hfile.HFileBlock$Writer.write(
> > > > > > HFileBlock.java:866)
> > > > > >at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(
> > > > > > HFileWriterV2.java:270)
> > > > > >at org.apache.hadoop.hbase.io.hfile.HFileWriterV3.append(
> > > > > > HFileWriterV3.java:87)
> > > > > >at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.
> > > > > > append(StoreFile.java:949)
> > > > > >at org.apache.hadoop.hbase.regionserver.compactions.
> > > > > > Compactor.performCompaction(Compactor.java:282)
> > > > > >at org.apache.hadoop.hbase.regionserver.compactions.
> > > > > > DefaultCompactor.compact(DefaultCompactor.java:105)
> > > > > >at org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$
> > > > > > DefaultCompactionContext.compact(DefaultStoreEngine.java:124)
> > > > > >at org.apache.hadoop.hbase.regionserver.HStore.compact(
> > > > > > HStore.java:1233)
> > > > > >at org.apache.hadoop.hbase.regionserver.HRegion.compact(
> > > > > > 

Re: Hot Region Server With No Hot Region

2016-12-13 Thread Ted Yu
I was looking at CellCounter but it doesn't provide what you are looking
for.

Maybe we can enhance it such that, given a threshold on the number of
qualifiers in a row (say 100,000), it outputs the rows which have at least
that many qualifiers.
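
In the meantime, a client-side approximation of the same idea works without
touching CellCounter: strip the values with KeyOnlyFilter and cap the cells per
Result with Scan#setBatch, so no oversized row is ever loaded whole. A rough
sketch, with the table name and threshold as placeholders:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.KeyOnlyFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class WideRowFinder {
  public static void main(String[] args) throws Exception {
    final long threshold = 100000L;
    Scan scan = new Scan();
    scan.setFilter(new KeyOnlyFilter());  // drop values server-side, keep the keys
    scan.setBatch(10000);                 // at most 10k cells per Result chunk
    scan.setCaching(10);                  // a handful of chunks per RPC
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("my_table"));
         ResultScanner scanner = table.getScanner(scan)) {
      byte[] currentRow = null;
      long cells = 0;
      for (Result chunk : scanner) {
        // Consecutive chunks share the same row key until the row is exhausted.
        if (currentRow == null || !Bytes.equals(currentRow, chunk.getRow())) {
          if (currentRow != null && cells >= threshold) {
            System.out.println(Bytes.toStringBinary(currentRow) + "\t" + cells);
          }
          currentRow = chunk.getRow();
          cells = 0;
        }
        cells += chunk.rawCells().length;
      }
      if (currentRow != null && cells >= threshold) {
        System.out.println(Bytes.toStringBinary(currentRow) + "\t" + cells);
      }
    }
  }
}

It still has to walk every cell key on the region servers, but the client never
holds more than one small chunk at a time.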

On Tue, Dec 13, 2016 at 12:47 PM, Saad Mufti  wrote:

> Thanks everyone for the feedback. We tracked this down to a bad design
> using dynamic columns: there were a few (very few) rows that accumulated up
> to 200,000 dynamic columns. Any activity that caused us to try to read one
> of these rows resulted in a hot region server.
>
> Follow-up question: we are now in the process of cleaning up the rows we
> identified, but some are so big that trying to read them in the cleanup
> process kills it with out-of-memory exceptions. Is there any way to
> identify rows with too many columns without actually reading them all?
>
> Thanks.
>
> 
> Saad
>
>
> On Sat, Dec 3, 2016 at 3:20 PM, Ted Yu  wrote:
>
> > I took a look at the stack trace.
> >
> > Region server log would give us more detail on the frequency and duration
> > of compactions.
> >
> > Cheers
> >
> > On Sat, Dec 3, 2016 at 7:39 AM, Jeremy Carroll 
> > wrote:
> >
> > > I would check compaction, investigate throttling if it's causing high
> > CPU.
> > >
> > > On Sat, Dec 3, 2016 at 6:20 AM Saad Mufti 
> wrote:
> > >
> > > > No.
> > > >
> > > > 
> > > > Saad
> > > >
> > > >
> > > > On Fri, Dec 2, 2016 at 3:27 PM, Ted Yu 
> > wrote:
> > > >
> > > > > Somehow I couldn't access the pastebin (I am in China now).
> > > > > Did the region server showing the hotspot host hbase:meta?
> > > > > Thanks
> > > > >
> > > > > On Friday, December 2, 2016 11:53 AM, Saad Mufti <
> > > > saad.mu...@gmail.com>
> > > > > wrote:
> > > > >
> > > > >
> > > > >  We're in AWS with D2.4xLarge instances. Each instance has 12
> > > independent
> > > > > spindles/disks from what I can tell.
> > > > >
> > > > > We have charted get_rate and mutate_rate by host and
> > > > >
> > > > > a) mutate_rate shows no real outliers
> > > > > b) read_rate shows the overall rate on the "hotspot" region server
> > is a
> > > > bit
> > > > > higher than every other server, not severely but enough that it is
> a
> > > bit
> > > > > noticeable. But when we chart get_rate on that server by region, no
> > one
> > > > > region stands out.
> > > > >
> > > > > get_rate chart by host:
> > > > >
> > > > > https://snag.gy/hmoiDw.jpg
> > > > >
> > > > > mutate_rate chart by host:
> > > > >
> > > > > https://snag.gy/jitdMN.jpg
> > > > >
> > > > > 
> > > > > Saad
> > > > >
> > > > >
> > > > > 
> > > > > Saad
> > > > >
> > > > >
> > > > > On Fri, Dec 2, 2016 at 2:34 PM, John Leach <
> jle...@splicemachine.com
> > >
> > > > > wrote:
> > > > >
> > > > > > Here is what I see...
> > > > > >
> > > > > >
> > > > > > * Short Compaction Running on Heap
> > > > > > "regionserver/ip-10-99-181-146.aolp-prd.us-east-1.ec2.
> > > > > > aolcloud.net/10.99.181.146:60020-shortCompactions-1480229281547"
> -
> > > > > Thread
> > > > > > t@242
> > > > > >java.lang.Thread.State: RUNNABLE
> > > > > >at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder.
> > > > > > compressSingleKeyValue(FastDiffDeltaEncoder.java:270)
> > > > > >at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder.
> > > > > > internalEncode(FastDiffDeltaEncoder.java:245)
> > > > > >at org.apache.hadoop.hbase.io.encoding.
> > BufferedDataBlockEncoder.
> > > > > > encode(BufferedDataBlockEncoder.java:987)
> > > > > >at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder.
> > > > > > encode(FastDiffDeltaEncoder.java:58)
> > > > > >at org.apache.hadoop.hbase.io
> > > > .hfile.HFileDataBlockEncoderImpl.encode(
> > > > > > HFileDataBlockEncoderImpl.java:97)
> > > > > >at org.apache.hadoop.hbase.io.hfile.HFileBlock$Writer.write(
> > > > > > HFileBlock.java:866)
> > > > > >at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(
> > > > > > HFileWriterV2.java:270)
> > > > > >at org.apache.hadoop.hbase.io.hfile.HFileWriterV3.append(
> > > > > > HFileWriterV3.java:87)
> > > > > >at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.
> > > > > > append(StoreFile.java:949)
> > > > > >at org.apache.hadoop.hbase.regionserver.compactions.
> > > > > > Compactor.performCompaction(Compactor.java:282)
> > > > > >at org.apache.hadoop.hbase.regionserver.compactions.
> > > > > > DefaultCompactor.compact(DefaultCompactor.java:105)
> > > > > >at org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$
> > > > > > DefaultCompactionContext.compact(DefaultStoreEngine.java:124)
> > > > > >at org.apache.hadoop.hbase.regionserver.HStore.compact(
> > > > > > HStore.java:1233)
> > > > > >at org.apache.hadoop.hbase.regionserver.HRegion.compact(
> > > > > > HRegion.java:1770)
> > > > > >at 

Re: Hot Region Server With No Hot Region

2016-12-13 Thread Saad Mufti
Thanks everyone for the feedback. We tracked this down to a bad design using
dynamic columns: there were a few (very few) rows that accumulated up to
200,000 dynamic columns. Any activity that caused us to try to read one of
these rows resulted in a hot region server.

Follow-up question: we are now in the process of cleaning up the rows we
identified, but some are so big that trying to read them in the cleanup
process kills it with out-of-memory exceptions. Is there any way to identify
rows with too many columns without actually reading them all?

Thanks.


Saad


On Sat, Dec 3, 2016 at 3:20 PM, Ted Yu  wrote:

> I took a look at the stack trace.
>
> Region server log would give us more detail on the frequency and duration
> of compactions.
>
> Cheers
>
> On Sat, Dec 3, 2016 at 7:39 AM, Jeremy Carroll 
> wrote:
>
> > I would check compaction, investigate throttling if it's causing high
> CPU.
> >
> > On Sat, Dec 3, 2016 at 6:20 AM Saad Mufti  wrote:
> >
> > > No.
> > >
> > > 
> > > Saad
> > >
> > >
> > > On Fri, Dec 2, 2016 at 3:27 PM, Ted Yu 
> wrote:
> > >
> > > > Somehow I couldn't access the pastebin (I am in China now).
> > > > Did the region server showing the hotspot host hbase:meta?
> > > > Thanks
> > > >
> > > > On Friday, December 2, 2016 11:53 AM, Saad Mufti <
> > > saad.mu...@gmail.com>
> > > > wrote:
> > > >
> > > >
> > > >  We're in AWS with D2.4xLarge instances. Each instance has 12
> > independent
> > > > spindles/disks from what I can tell.
> > > >
> > > > We have charted get_rate and mutate_rate by host and
> > > >
> > > > a) mutate_rate shows no real outliers
> > > > b) read_rate shows the overall rate on the "hotspot" region server
> is a
> > > bit
> > > > higher than every other server, not severely but enough that it is a
> > bit
> > > > noticeable. But when we chart get_rate on that server by region, no
> one
> > > > region stands out.
> > > >
> > > > get_rate chart by host:
> > > >
> > > > https://snag.gy/hmoiDw.jpg
> > > >
> > > > mutate_rate chart by host:
> > > >
> > > > https://snag.gy/jitdMN.jpg
> > > >
> > > > 
> > > > Saad
> > > >
> > > >
> > > > 
> > > > Saad
> > > >
> > > >
> > > > On Fri, Dec 2, 2016 at 2:34 PM, John Leach  >
> > > > wrote:
> > > >
> > > > > Here is what I see...
> > > > >
> > > > >
> > > > > * Short Compaction Running on Heap
> > > > > "regionserver/ip-10-99-181-146.aolp-prd.us-east-1.ec2.
> > > > > aolcloud.net/10.99.181.146:60020-shortCompactions-1480229281547" -
> > > > Thread
> > > > > t@242
> > > > >java.lang.Thread.State: RUNNABLE
> > > > >at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder.
> > > > > compressSingleKeyValue(FastDiffDeltaEncoder.java:270)
> > > > >at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder.
> > > > > internalEncode(FastDiffDeltaEncoder.java:245)
> > > > >at org.apache.hadoop.hbase.io.encoding.
> BufferedDataBlockEncoder.
> > > > > encode(BufferedDataBlockEncoder.java:987)
> > > > >at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder.
> > > > > encode(FastDiffDeltaEncoder.java:58)
> > > > >at org.apache.hadoop.hbase.io
> > > .hfile.HFileDataBlockEncoderImpl.encode(
> > > > > HFileDataBlockEncoderImpl.java:97)
> > > > >at org.apache.hadoop.hbase.io.hfile.HFileBlock$Writer.write(
> > > > > HFileBlock.java:866)
> > > > >at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(
> > > > > HFileWriterV2.java:270)
> > > > >at org.apache.hadoop.hbase.io.hfile.HFileWriterV3.append(
> > > > > HFileWriterV3.java:87)
> > > > >at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.
> > > > > append(StoreFile.java:949)
> > > > >at org.apache.hadoop.hbase.regionserver.compactions.
> > > > > Compactor.performCompaction(Compactor.java:282)
> > > > >at org.apache.hadoop.hbase.regionserver.compactions.
> > > > > DefaultCompactor.compact(DefaultCompactor.java:105)
> > > > >at org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$
> > > > > DefaultCompactionContext.compact(DefaultStoreEngine.java:124)
> > > > >at org.apache.hadoop.hbase.regionserver.HStore.compact(
> > > > > HStore.java:1233)
> > > > >at org.apache.hadoop.hbase.regionserver.HRegion.compact(
> > > > > HRegion.java:1770)
> > > > >at org.apache.hadoop.hbase.regionserver.CompactSplitThread$
> > > > > CompactionRunner.run(CompactSplitThread.java:520)
> > > > >at java.util.concurrent.ThreadPoolExecutor.runWorker(
> > > > > ThreadPoolExecutor.java:1142)
> > > > >at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> > > > > ThreadPoolExecutor.java:617)
> > > > >at java.lang.Thread.run(Thread.java:745)
> > > > >
> > > > >
> > > > > * WAL Syncs waiting…  ALL 5
> > > > > "sync.0" - Thread t@202
> > > > >java.lang.Thread.State: TIMED_WAITING
> > > > >at java.lang.Object.wait(Native Method)
> > > > >- waiting on 

Re: Hot Region Server With No Hot Region

2016-12-03 Thread Ted Yu
I took a look at the stack trace.

Region server log would give us more detail on the frequency and duration
of compactions.

Cheers

On Sat, Dec 3, 2016 at 7:39 AM, Jeremy Carroll  wrote:

> I would check compaction, investigate throttling if it's causing high CPU.
>
> On Sat, Dec 3, 2016 at 6:20 AM Saad Mufti  wrote:
>
> > No.
> >
> > 
> > Saad
> >
> >
> > On Fri, Dec 2, 2016 at 3:27 PM, Ted Yu  wrote:
> >
> > > Somehow I couldn't access the pastebin (I am in China now).
> > > Did the region server showing the hotspot host hbase:meta?
> > > Thanks
> > >
> > > On Friday, December 2, 2016 11:53 AM, Saad Mufti <
> > saad.mu...@gmail.com>
> > > wrote:
> > >
> > >
> > >  We're in AWS with D2.4xLarge instances. Each instance has 12
> independent
> > > spindles/disks from what I can tell.
> > >
> > > We have charted get_rate and mutate_rate by host and
> > >
> > > a) mutate_rate shows no real outliers
> > > b) read_rate shows the overall rate on the "hotspot" region server is a
> > bit
> > > higher than every other server, not severely but enough that it is a
> bit
> > > noticeable. But when we chart get_rate on that server by region, no one
> > > region stands out.
> > >
> > > get_rate chart by host:
> > >
> > > https://snag.gy/hmoiDw.jpg
> > >
> > > mutate_rate chart by host:
> > >
> > > https://snag.gy/jitdMN.jpg
> > >
> > > 
> > > Saad
> > >
> > >
> > > 
> > > Saad
> > >
> > >
> > > On Fri, Dec 2, 2016 at 2:34 PM, John Leach 
> > > wrote:
> > >
> > > > Here is what I see...
> > > >
> > > >
> > > > * Short Compaction Running on Heap
> > > > "regionserver/ip-10-99-181-146.aolp-prd.us-east-1.ec2.
> > > > aolcloud.net/10.99.181.146:60020-shortCompactions-1480229281547" -
> > > Thread
> > > > t@242
> > > >java.lang.Thread.State: RUNNABLE
> > > >at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder.
> > > > compressSingleKeyValue(FastDiffDeltaEncoder.java:270)
> > > >at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder.
> > > > internalEncode(FastDiffDeltaEncoder.java:245)
> > > >at org.apache.hadoop.hbase.io.encoding.BufferedDataBlockEncoder.
> > > > encode(BufferedDataBlockEncoder.java:987)
> > > >at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder.
> > > > encode(FastDiffDeltaEncoder.java:58)
> > > >at org.apache.hadoop.hbase.io
> > .hfile.HFileDataBlockEncoderImpl.encode(
> > > > HFileDataBlockEncoderImpl.java:97)
> > > >at org.apache.hadoop.hbase.io.hfile.HFileBlock$Writer.write(
> > > > HFileBlock.java:866)
> > > >at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(
> > > > HFileWriterV2.java:270)
> > > >at org.apache.hadoop.hbase.io.hfile.HFileWriterV3.append(
> > > > HFileWriterV3.java:87)
> > > >at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.
> > > > append(StoreFile.java:949)
> > > >at org.apache.hadoop.hbase.regionserver.compactions.
> > > > Compactor.performCompaction(Compactor.java:282)
> > > >at org.apache.hadoop.hbase.regionserver.compactions.
> > > > DefaultCompactor.compact(DefaultCompactor.java:105)
> > > >at org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$
> > > > DefaultCompactionContext.compact(DefaultStoreEngine.java:124)
> > > >at org.apache.hadoop.hbase.regionserver.HStore.compact(
> > > > HStore.java:1233)
> > > >at org.apache.hadoop.hbase.regionserver.HRegion.compact(
> > > > HRegion.java:1770)
> > > >at org.apache.hadoop.hbase.regionserver.CompactSplitThread$
> > > > CompactionRunner.run(CompactSplitThread.java:520)
> > > >at java.util.concurrent.ThreadPoolExecutor.runWorker(
> > > > ThreadPoolExecutor.java:1142)
> > > >at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> > > > ThreadPoolExecutor.java:617)
> > > >at java.lang.Thread.run(Thread.java:745)
> > > >
> > > >
> > > > * WAL Syncs waiting…  ALL 5
> > > > "sync.0" - Thread t@202
> > > >java.lang.Thread.State: TIMED_WAITING
> > > >at java.lang.Object.wait(Native Method)
> > > >- waiting on <67ba892d> (a java.util.LinkedList)
> > > >at org.apache.hadoop.hdfs.DFSOutputStream.waitForAckedSeqno(
> > > > DFSOutputStream.java:2337)
> > > >at org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(
> > > > DFSOutputStream.java:2224)
> > > >at org.apache.hadoop.hdfs.DFSOutputStream.hflush(
> > > > DFSOutputStream.java:2116)
> > > >at org.apache.hadoop.fs.FSDataOutputStream.hflush(
> > > > FSDataOutputStream.java:130)
> > > >at org.apache.hadoop.hbase.regionserver.wal.
> ProtobufLogWriter.sync(
> > > > ProtobufLogWriter.java:173)
> > > >at org.apache.hadoop.hbase.regionserver.wal.FSHLog$
> > > > SyncRunner.run(FSHLog.java:1379)
> > > >at java.lang.Thread.run(Thread.java:745)
> > > >
> > > > * Mutations backing up very badly...
> > > >
> > > > "B.defaultRpcServer.handler=103,queue=7,port=60020" - Thread t@155
> > > >java.lang.Thread.State: TIMED_WAITING
> > > >at 

Re: Hot Region Server With No Hot Region

2016-12-03 Thread Jeremy Carroll
I would check compaction, investigate throttling if it's causing high CPU.

On Sat, Dec 3, 2016 at 6:20 AM Saad Mufti  wrote:

> No.
>
> 
> Saad
>
>
> On Fri, Dec 2, 2016 at 3:27 PM, Ted Yu  wrote:
>
> > Somehow I couldn't access the pastebin (I am in China now).
> > Did the region server showing the hotspot host hbase:meta?
> > Thanks
> >
> > On Friday, December 2, 2016 11:53 AM, Saad Mufti <
> saad.mu...@gmail.com>
> > wrote:
> >
> >
> >  We're in AWS with D2.4xLarge instances. Each instance has 12 independent
> > spindles/disks from what I can tell.
> >
> > We have charted get_rate and mutate_rate by host and
> >
> > a) mutate_rate shows no real outliers
> > b) read_rate shows the overall rate on the "hotspot" region server is a
> bit
> > higher than every other server, not severely but enough that it is a bit
> > noticeable. But when we chart get_rate on that server by region, no one
> > region stands out.
> >
> > get_rate chart by host:
> >
> > https://snag.gy/hmoiDw.jpg
> >
> > mutate_rate chart by host:
> >
> > https://snag.gy/jitdMN.jpg
> >
> > 
> > Saad
> >
> >
> > 
> > Saad
> >
> >
> > On Fri, Dec 2, 2016 at 2:34 PM, John Leach 
> > wrote:
> >
> > > Here is what I see...
> > >
> > >
> > > * Short Compaction Running on Heap
> > > "regionserver/ip-10-99-181-146.aolp-prd.us-east-1.ec2.
> > > aolcloud.net/10.99.181.146:60020-shortCompactions-1480229281547" -
> > Thread
> > > t@242
> > >java.lang.Thread.State: RUNNABLE
> > >at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder.
> > > compressSingleKeyValue(FastDiffDeltaEncoder.java:270)
> > >at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder.
> > > internalEncode(FastDiffDeltaEncoder.java:245)
> > >at org.apache.hadoop.hbase.io.encoding.BufferedDataBlockEncoder.
> > > encode(BufferedDataBlockEncoder.java:987)
> > >at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder.
> > > encode(FastDiffDeltaEncoder.java:58)
> > >at org.apache.hadoop.hbase.io
> .hfile.HFileDataBlockEncoderImpl.encode(
> > > HFileDataBlockEncoderImpl.java:97)
> > >at org.apache.hadoop.hbase.io.hfile.HFileBlock$Writer.write(
> > > HFileBlock.java:866)
> > >at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(
> > > HFileWriterV2.java:270)
> > >at org.apache.hadoop.hbase.io.hfile.HFileWriterV3.append(
> > > HFileWriterV3.java:87)
> > >at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.
> > > append(StoreFile.java:949)
> > >at org.apache.hadoop.hbase.regionserver.compactions.
> > > Compactor.performCompaction(Compactor.java:282)
> > >at org.apache.hadoop.hbase.regionserver.compactions.
> > > DefaultCompactor.compact(DefaultCompactor.java:105)
> > >at org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$
> > > DefaultCompactionContext.compact(DefaultStoreEngine.java:124)
> > >at org.apache.hadoop.hbase.regionserver.HStore.compact(
> > > HStore.java:1233)
> > >at org.apache.hadoop.hbase.regionserver.HRegion.compact(
> > > HRegion.java:1770)
> > >at org.apache.hadoop.hbase.regionserver.CompactSplitThread$
> > > CompactionRunner.run(CompactSplitThread.java:520)
> > >at java.util.concurrent.ThreadPoolExecutor.runWorker(
> > > ThreadPoolExecutor.java:1142)
> > >at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> > > ThreadPoolExecutor.java:617)
> > >at java.lang.Thread.run(Thread.java:745)
> > >
> > >
> > > * WAL Syncs waiting…  ALL 5
> > > "sync.0" - Thread t@202
> > >java.lang.Thread.State: TIMED_WAITING
> > >at java.lang.Object.wait(Native Method)
> > >- waiting on <67ba892d> (a java.util.LinkedList)
> > >at org.apache.hadoop.hdfs.DFSOutputStream.waitForAckedSeqno(
> > > DFSOutputStream.java:2337)
> > >at org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(
> > > DFSOutputStream.java:2224)
> > >at org.apache.hadoop.hdfs.DFSOutputStream.hflush(
> > > DFSOutputStream.java:2116)
> > >at org.apache.hadoop.fs.FSDataOutputStream.hflush(
> > > FSDataOutputStream.java:130)
> > >at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(
> > > ProtobufLogWriter.java:173)
> > >at org.apache.hadoop.hbase.regionserver.wal.FSHLog$
> > > SyncRunner.run(FSHLog.java:1379)
> > >at java.lang.Thread.run(Thread.java:745)
> > >
> > > * Mutations backing up very badly...
> > >
> > > "B.defaultRpcServer.handler=103,queue=7,port=60020" - Thread t@155
> > >java.lang.Thread.State: TIMED_WAITING
> > >at java.lang.Object.wait(Native Method)
> > >- waiting on <6ab54ea3> (a org.apache.hadoop.hbase.
> > > regionserver.wal.SyncFuture)
> > >at org.apache.hadoop.hbase.regionserver.wal.SyncFuture.
> > > get(SyncFuture.java:167)
> > >at org.apache.hadoop.hbase.regionserver.wal.FSHLog.
> > > blockOnSync(FSHLog.java:1504)
> > >at org.apache.hadoop.hbase.regionserver.wal.FSHLog.
> > > publishSyncThenBlockOnCompletion(FSHLog.java:1498)
> > >at 

Re: Hot Region Server With No Hot Region

2016-12-03 Thread Saad Mufti
No.


Saad


On Fri, Dec 2, 2016 at 3:27 PM, Ted Yu  wrote:

> Somehow I couldn't access the pastebin (I am in China now).
> Did the region server showing the hotspot host hbase:meta?
> Thanks
>
> On Friday, December 2, 2016 11:53 AM, Saad Mufti 
> wrote:
>
>
>  We're in AWS with D2.4xLarge instances. Each instance has 12 independent
> spindles/disks from what I can tell.
>
> We have charted get_rate and mutate_rate by host and
>
> a) mutate_rate shows no real outliers
> b) read_rate shows the overall rate on the "hotspot" region server is a bit
> higher than every other server, not severely but enough that it is a bit
> noticeable. But when we chart get_rate on that server by region, no one
> region stands out.
>
> get_rate chart by host:
>
> https://snag.gy/hmoiDw.jpg
>
> mutate_rate chart by host:
>
> https://snag.gy/jitdMN.jpg
>
> 
> Saad
>
>
> 
> Saad
>
>
> On Fri, Dec 2, 2016 at 2:34 PM, John Leach 
> wrote:
>
> > Here is what I see...
> >
> >
> > * Short Compaction Running on Heap
> > "regionserver/ip-10-99-181-146.aolp-prd.us-east-1.ec2.
> > aolcloud.net/10.99.181.146:60020-shortCompactions-1480229281547" -
> Thread
> > t@242
> >java.lang.Thread.State: RUNNABLE
> >at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder.
> > compressSingleKeyValue(FastDiffDeltaEncoder.java:270)
> >at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder.
> > internalEncode(FastDiffDeltaEncoder.java:245)
> >at org.apache.hadoop.hbase.io.encoding.BufferedDataBlockEncoder.
> > encode(BufferedDataBlockEncoder.java:987)
> >at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder.
> > encode(FastDiffDeltaEncoder.java:58)
> >at org.apache.hadoop.hbase.io.hfile.HFileDataBlockEncoderImpl.encode(
> > HFileDataBlockEncoderImpl.java:97)
> >at org.apache.hadoop.hbase.io.hfile.HFileBlock$Writer.write(
> > HFileBlock.java:866)
> >at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(
> > HFileWriterV2.java:270)
> >at org.apache.hadoop.hbase.io.hfile.HFileWriterV3.append(
> > HFileWriterV3.java:87)
> >at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.
> > append(StoreFile.java:949)
> >at org.apache.hadoop.hbase.regionserver.compactions.
> > Compactor.performCompaction(Compactor.java:282)
> >at org.apache.hadoop.hbase.regionserver.compactions.
> > DefaultCompactor.compact(DefaultCompactor.java:105)
> >at org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$
> > DefaultCompactionContext.compact(DefaultStoreEngine.java:124)
> >at org.apache.hadoop.hbase.regionserver.HStore.compact(
> > HStore.java:1233)
> >at org.apache.hadoop.hbase.regionserver.HRegion.compact(
> > HRegion.java:1770)
> >at org.apache.hadoop.hbase.regionserver.CompactSplitThread$
> > CompactionRunner.run(CompactSplitThread.java:520)
> >at java.util.concurrent.ThreadPoolExecutor.runWorker(
> > ThreadPoolExecutor.java:1142)
> >at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> > ThreadPoolExecutor.java:617)
> >at java.lang.Thread.run(Thread.java:745)
> >
> >
> > * WAL Syncs waiting…  ALL 5
> > "sync.0" - Thread t@202
> >java.lang.Thread.State: TIMED_WAITING
> >at java.lang.Object.wait(Native Method)
> >- waiting on <67ba892d> (a java.util.LinkedList)
> >at org.apache.hadoop.hdfs.DFSOutputStream.waitForAckedSeqno(
> > DFSOutputStream.java:2337)
> >at org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(
> > DFSOutputStream.java:2224)
> >at org.apache.hadoop.hdfs.DFSOutputStream.hflush(
> > DFSOutputStream.java:2116)
> >at org.apache.hadoop.fs.FSDataOutputStream.hflush(
> > FSDataOutputStream.java:130)
> >at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(
> > ProtobufLogWriter.java:173)
> >at org.apache.hadoop.hbase.regionserver.wal.FSHLog$
> > SyncRunner.run(FSHLog.java:1379)
> >at java.lang.Thread.run(Thread.java:745)
> >
> > * Mutations backing up very badly...
> >
> > "B.defaultRpcServer.handler=103,queue=7,port=60020" - Thread t@155
> >java.lang.Thread.State: TIMED_WAITING
> >at java.lang.Object.wait(Native Method)
> >- waiting on <6ab54ea3> (a org.apache.hadoop.hbase.
> > regionserver.wal.SyncFuture)
> >at org.apache.hadoop.hbase.regionserver.wal.SyncFuture.
> > get(SyncFuture.java:167)
> >at org.apache.hadoop.hbase.regionserver.wal.FSHLog.
> > blockOnSync(FSHLog.java:1504)
> >at org.apache.hadoop.hbase.regionserver.wal.FSHLog.
> > publishSyncThenBlockOnCompletion(FSHLog.java:1498)
> >at org.apache.hadoop.hbase.regionserver.wal.FSHLog.sync(
> > FSHLog.java:1632)
> >at org.apache.hadoop.hbase.regionserver.HRegion.
> > syncOrDefer(HRegion.java:7737)
> >at org.apache.hadoop.hbase.regionserver.HRegion.
> > processRowsWithLocks(HRegion.java:6504)
> >at org.apache.hadoop.hbase.regionserver.HRegion.
> > mutateRowsWithLocks(HRegion.java:6352)
> >at 

Re: Hot Region Server With No Hot Region

2016-12-02 Thread Ted Yu
Somehow I couldn't access the pastebin (I am in China now).
Did the region server showing the hotspot host hbase:meta?
Thanks 

On Friday, December 2, 2016 11:53 AM, Saad Mufti  
wrote:
 

 We're in AWS with D2.4xLarge instances. Each instance has 12 independent
spindles/disks from what I can tell.

We have charted get_rate and mutate_rate by host and

a) mutate_rate shows no real outliers
b) read_rate shows the overall rate on the "hotspot" region server is a bit
higher than every other server, not severely but enough that it is a bit
noticeable. But when we chart get_rate on that server by region, no one
region stands out.

get_rate chart by host:

https://snag.gy/hmoiDw.jpg

mutate_rate chart by host:

https://snag.gy/jitdMN.jpg


Saad



Saad


On Fri, Dec 2, 2016 at 2:34 PM, John Leach  wrote:

> Here is what I see...
>
>
> * Short Compaction Running on Heap
> "regionserver/ip-10-99-181-146.aolp-prd.us-east-1.ec2.
> aolcloud.net/10.99.181.146:60020-shortCompactions-1480229281547" - Thread
> t@242
>    java.lang.Thread.State: RUNNABLE
>    at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder.
> compressSingleKeyValue(FastDiffDeltaEncoder.java:270)
>    at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder.
> internalEncode(FastDiffDeltaEncoder.java:245)
>    at org.apache.hadoop.hbase.io.encoding.BufferedDataBlockEncoder.
> encode(BufferedDataBlockEncoder.java:987)
>    at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder.
> encode(FastDiffDeltaEncoder.java:58)
>    at org.apache.hadoop.hbase.io.hfile.HFileDataBlockEncoderImpl.encode(
> HFileDataBlockEncoderImpl.java:97)
>    at org.apache.hadoop.hbase.io.hfile.HFileBlock$Writer.write(
> HFileBlock.java:866)
>    at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(
> HFileWriterV2.java:270)
>    at org.apache.hadoop.hbase.io.hfile.HFileWriterV3.append(
> HFileWriterV3.java:87)
>    at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.
> append(StoreFile.java:949)
>    at org.apache.hadoop.hbase.regionserver.compactions.
> Compactor.performCompaction(Compactor.java:282)
>    at org.apache.hadoop.hbase.regionserver.compactions.
> DefaultCompactor.compact(DefaultCompactor.java:105)
>    at org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$
> DefaultCompactionContext.compact(DefaultStoreEngine.java:124)
>    at org.apache.hadoop.hbase.regionserver.HStore.compact(
> HStore.java:1233)
>    at org.apache.hadoop.hbase.regionserver.HRegion.compact(
> HRegion.java:1770)
>    at org.apache.hadoop.hbase.regionserver.CompactSplitThread$
> CompactionRunner.run(CompactSplitThread.java:520)
>    at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1142)
>    at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:617)
>    at java.lang.Thread.run(Thread.java:745)
>
>
> * WAL Syncs waiting…  ALL 5
> "sync.0" - Thread t@202
>    java.lang.Thread.State: TIMED_WAITING
>    at java.lang.Object.wait(Native Method)
>    - waiting on <67ba892d> (a java.util.LinkedList)
>    at org.apache.hadoop.hdfs.DFSOutputStream.waitForAckedSeqno(
> DFSOutputStream.java:2337)
>    at org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(
> DFSOutputStream.java:2224)
>    at org.apache.hadoop.hdfs.DFSOutputStream.hflush(
> DFSOutputStream.java:2116)
>    at org.apache.hadoop.fs.FSDataOutputStream.hflush(
> FSDataOutputStream.java:130)
>    at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(
> ProtobufLogWriter.java:173)
>    at org.apache.hadoop.hbase.regionserver.wal.FSHLog$
> SyncRunner.run(FSHLog.java:1379)
>    at java.lang.Thread.run(Thread.java:745)
>
> * Mutations backing up very badly...
>
> "B.defaultRpcServer.handler=103,queue=7,port=60020" - Thread t@155
>    java.lang.Thread.State: TIMED_WAITING
>    at java.lang.Object.wait(Native Method)
>    - waiting on <6ab54ea3> (a org.apache.hadoop.hbase.
> regionserver.wal.SyncFuture)
>    at org.apache.hadoop.hbase.regionserver.wal.SyncFuture.
> get(SyncFuture.java:167)
>    at org.apache.hadoop.hbase.regionserver.wal.FSHLog.
> blockOnSync(FSHLog.java:1504)
>    at org.apache.hadoop.hbase.regionserver.wal.FSHLog.
> publishSyncThenBlockOnCompletion(FSHLog.java:1498)
>    at org.apache.hadoop.hbase.regionserver.wal.FSHLog.sync(
> FSHLog.java:1632)
>    at org.apache.hadoop.hbase.regionserver.HRegion.
> syncOrDefer(HRegion.java:7737)
>    at org.apache.hadoop.hbase.regionserver.HRegion.
> processRowsWithLocks(HRegion.java:6504)
>    at org.apache.hadoop.hbase.regionserver.HRegion.
> mutateRowsWithLocks(HRegion.java:6352)
>    at org.apache.hadoop.hbase.regionserver.HRegion.
> mutateRowsWithLocks(HRegion.java:6334)
>    at org.apache.hadoop.hbase.regionserver.HRegion.
> mutateRow(HRegion.java:6325)
>    at org.apache.hadoop.hbase.regionserver.RSRpcServices.
> mutateRows(RSRpcServices.java:418)
>    at org.apache.hadoop.hbase.regionserver.RSRpcServices.
> multi(RSRpcServices.java:1916)
>    

Re: Hot Region Server With No Hot Region

2016-12-02 Thread Saad Mufti
We're in AWS with D2.4xLarge instances. Each instance has 12 independent
spindles/disks from what I can tell.

We have charted get_rate and mutate_rate by host and

a) mutate_rate shows no real outliers
b) read_rate shows the overall rate on the "hotspot" region server is a bit
higher than every other server, not severely but enough that it is a bit
noticeable. But when we chart get_rate on that server by region, no one
region stands out.

get_rate chart by host:

https://snag.gy/hmoiDw.jpg

mutate_rate chart by host:

https://snag.gy/jitdMN.jpg


Saad



Saad


On Fri, Dec 2, 2016 at 2:34 PM, John Leach  wrote:

> Here is what I see...
>
>
> * Short Compaction Running on Heap
> "regionserver/ip-10-99-181-146.aolp-prd.us-east-1.ec2.
> aolcloud.net/10.99.181.146:60020-shortCompactions-1480229281547" - Thread
> t@242
>java.lang.Thread.State: RUNNABLE
> at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder.
> compressSingleKeyValue(FastDiffDeltaEncoder.java:270)
> at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder.
> internalEncode(FastDiffDeltaEncoder.java:245)
> at org.apache.hadoop.hbase.io.encoding.BufferedDataBlockEncoder.
> encode(BufferedDataBlockEncoder.java:987)
> at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder.
> encode(FastDiffDeltaEncoder.java:58)
> at org.apache.hadoop.hbase.io.hfile.HFileDataBlockEncoderImpl.encode(
> HFileDataBlockEncoderImpl.java:97)
> at org.apache.hadoop.hbase.io.hfile.HFileBlock$Writer.write(
> HFileBlock.java:866)
> at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(
> HFileWriterV2.java:270)
> at org.apache.hadoop.hbase.io.hfile.HFileWriterV3.append(
> HFileWriterV3.java:87)
> at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.
> append(StoreFile.java:949)
> at org.apache.hadoop.hbase.regionserver.compactions.
> Compactor.performCompaction(Compactor.java:282)
> at org.apache.hadoop.hbase.regionserver.compactions.
> DefaultCompactor.compact(DefaultCompactor.java:105)
> at org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$
> DefaultCompactionContext.compact(DefaultStoreEngine.java:124)
> at org.apache.hadoop.hbase.regionserver.HStore.compact(
> HStore.java:1233)
> at org.apache.hadoop.hbase.regionserver.HRegion.compact(
> HRegion.java:1770)
> at org.apache.hadoop.hbase.regionserver.CompactSplitThread$
> CompactionRunner.run(CompactSplitThread.java:520)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
>
> * WAL Syncs waiting…   ALL 5
> "sync.0" - Thread t@202
>java.lang.Thread.State: TIMED_WAITING
> at java.lang.Object.wait(Native Method)
> - waiting on <67ba892d> (a java.util.LinkedList)
> at org.apache.hadoop.hdfs.DFSOutputStream.waitForAckedSeqno(
> DFSOutputStream.java:2337)
> at org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(
> DFSOutputStream.java:2224)
> at org.apache.hadoop.hdfs.DFSOutputStream.hflush(
> DFSOutputStream.java:2116)
> at org.apache.hadoop.fs.FSDataOutputStream.hflush(
> FSDataOutputStream.java:130)
> at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(
> ProtobufLogWriter.java:173)
> at org.apache.hadoop.hbase.regionserver.wal.FSHLog$
> SyncRunner.run(FSHLog.java:1379)
> at java.lang.Thread.run(Thread.java:745)
>
> * Mutations backing up very badly...
>
> "B.defaultRpcServer.handler=103,queue=7,port=60020" - Thread t@155
>java.lang.Thread.State: TIMED_WAITING
> at java.lang.Object.wait(Native Method)
> - waiting on <6ab54ea3> (a org.apache.hadoop.hbase.
> regionserver.wal.SyncFuture)
> at org.apache.hadoop.hbase.regionserver.wal.SyncFuture.
> get(SyncFuture.java:167)
> at org.apache.hadoop.hbase.regionserver.wal.FSHLog.
> blockOnSync(FSHLog.java:1504)
> at org.apache.hadoop.hbase.regionserver.wal.FSHLog.
> publishSyncThenBlockOnCompletion(FSHLog.java:1498)
> at org.apache.hadoop.hbase.regionserver.wal.FSHLog.sync(
> FSHLog.java:1632)
> at org.apache.hadoop.hbase.regionserver.HRegion.
> syncOrDefer(HRegion.java:7737)
> at org.apache.hadoop.hbase.regionserver.HRegion.
> processRowsWithLocks(HRegion.java:6504)
> at org.apache.hadoop.hbase.regionserver.HRegion.
> mutateRowsWithLocks(HRegion.java:6352)
> at org.apache.hadoop.hbase.regionserver.HRegion.
> mutateRowsWithLocks(HRegion.java:6334)
> at org.apache.hadoop.hbase.regionserver.HRegion.
> mutateRow(HRegion.java:6325)
> at org.apache.hadoop.hbase.regionserver.RSRpcServices.
> mutateRows(RSRpcServices.java:418)
> at org.apache.hadoop.hbase.regionserver.RSRpcServices.
> multi(RSRpcServices.java:1916)
> at org.apache.hadoop.hbase.protobuf.generated.
> ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32213)
> at 

Re: Hot Region Server With No Hot Region

2016-12-02 Thread John Leach
Here is what I see...


* Short Compaction Running on Heap
"regionserver/ip-10-99-181-146.aolp-prd.us-east-1.ec2.aolcloud.net/10.99.181.146:60020-shortCompactions-1480229281547"
 - Thread t@242
   java.lang.Thread.State: RUNNABLE
at 
org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder.compressSingleKeyValue(FastDiffDeltaEncoder.java:270)
at 
org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder.internalEncode(FastDiffDeltaEncoder.java:245)
at 
org.apache.hadoop.hbase.io.encoding.BufferedDataBlockEncoder.encode(BufferedDataBlockEncoder.java:987)
at 
org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder.encode(FastDiffDeltaEncoder.java:58)
at 
org.apache.hadoop.hbase.io.hfile.HFileDataBlockEncoderImpl.encode(HFileDataBlockEncoderImpl.java:97)
at 
org.apache.hadoop.hbase.io.hfile.HFileBlock$Writer.write(HFileBlock.java:866)
at 
org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(HFileWriterV2.java:270)
at 
org.apache.hadoop.hbase.io.hfile.HFileWriterV3.append(HFileWriterV3.java:87)
at 
org.apache.hadoop.hbase.regionserver.StoreFile$Writer.append(StoreFile.java:949)
at 
org.apache.hadoop.hbase.regionserver.compactions.Compactor.performCompaction(Compactor.java:282)
at 
org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:105)
at 
org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.compact(DefaultStoreEngine.java:124)
at org.apache.hadoop.hbase.regionserver.HStore.compact(HStore.java:1233)
at org.apache.hadoop.hbase.regionserver.HRegion.compact(HRegion.java:1770)
at 
org.apache.hadoop.hbase.regionserver.CompactSplitThread$CompactionRunner.run(CompactSplitThread.java:520)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)


* WAL Syncs waiting…   ALL 5
"sync.0" - Thread t@202
   java.lang.Thread.State: TIMED_WAITING
at java.lang.Object.wait(Native Method)
- waiting on <67ba892d> (a java.util.LinkedList)
at 
org.apache.hadoop.hdfs.DFSOutputStream.waitForAckedSeqno(DFSOutputStream.java:2337)
at 
org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:2224)
at org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:2116)
at 
org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:130)
at 
org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
at 
org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1379)
at java.lang.Thread.run(Thread.java:745)

* Mutations backing up very badly...

"B.defaultRpcServer.handler=103,queue=7,port=60020" - Thread t@155
   java.lang.Thread.State: TIMED_WAITING
at java.lang.Object.wait(Native Method)
- waiting on <6ab54ea3> (a 
org.apache.hadoop.hbase.regionserver.wal.SyncFuture)
at 
org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:167)
at 
org.apache.hadoop.hbase.regionserver.wal.FSHLog.blockOnSync(FSHLog.java:1504)
at 
org.apache.hadoop.hbase.regionserver.wal.FSHLog.publishSyncThenBlockOnCompletion(FSHLog.java:1498)
at org.apache.hadoop.hbase.regionserver.wal.FSHLog.sync(FSHLog.java:1632)
at 
org.apache.hadoop.hbase.regionserver.HRegion.syncOrDefer(HRegion.java:7737)
at 
org.apache.hadoop.hbase.regionserver.HRegion.processRowsWithLocks(HRegion.java:6504)
at 
org.apache.hadoop.hbase.regionserver.HRegion.mutateRowsWithLocks(HRegion.java:6352)
at 
org.apache.hadoop.hbase.regionserver.HRegion.mutateRowsWithLocks(HRegion.java:6334)
at org.apache.hadoop.hbase.regionserver.HRegion.mutateRow(HRegion.java:6325)
at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.mutateRows(RSRpcServices.java:418)
at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:1916)
at 
org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32213)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2034)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
at 
org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
at java.lang.Thread.run(Thread.java:745)


Too many writers being blocked attempting to write to WAL.

What does your disk infrastructure look like? Can you get away with multi-WAL?
Ugh...
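
For reference, multi-WAL is the RegionServer-side provider that stripes WAL
writes across several pipelines, so all the handlers aren't funneled through a
single sync queue. Roughly, it is enabled in hbase-site.xml as sketched below;
the property names are taken from the HBase 1.x reference guide, so verify them
(and pick a sensible group count) against your exact CDH release before relying
on this:

<!-- sketch only: stripe WAL writes over multiple pipelines -->
<property>
  <name>hbase.wal.provider</name>
  <value>multiwal</value>
</property>
<property>
  <!-- number of WAL groups per RegionServer (bounded grouping strategy) -->
  <name>hbase.wal.regiongrouping.numgroups</name>
  <value>4</value>
</property>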

Regards,
John Leach


> On Dec 2, 2016, at 1:20 PM, Saad Mufti  wrote:
> 
> Hi Ted,
> 
> Finally we have another hotspot going on, with the same symptoms as before.
> Here is the pastebin for the stack trace from the region server, which I
> obtained via VisualVM:
> 
> http://pastebin.com/qbXPPrXk
> 
> Would really appreciate any insight you or anyone else can provide.
> 
> 

Re: Hot Region Server With No Hot Region

2016-12-02 Thread Saad Mufti
Hi Ted,

Finally we have another hotspot going on, with the same symptoms as before.
Here is the pastebin for the stack trace from the region server, which I
obtained via VisualVM:

http://pastebin.com/qbXPPrXk

Would really appreciate any insight you or anyone else can provide.

Thanks.


Saad


On Thu, Dec 1, 2016 at 6:08 PM, Saad Mufti  wrote:

> Sure will, the next time it happens.
>
> Thanks!!!
>
> 
> Saad
>
>
> On Thu, Dec 1, 2016 at 5:01 PM, Ted Yu  wrote:
>
>> From #2 in the initial email, the hbase:meta might not be the cause for
>> the hotspot.
>>
>> Saad:
>> Can you pastebin the stack trace of the hot region server when this happens
>> again?
>>
>> Thanks
>>
>> > On Dec 2, 2016, at 4:48 AM, Saad Mufti  wrote:
>> >
>> > We used a pre-split into 1024 regions at the start but we miscalculated
>> our
>> > data size, so there were still auto-split storms at the beginning as
>> data
>> > size stabilized, it has ended up at around 9500 or so regions, plus a
>> few
>> > thousand regions for a few other tables (much smaller). But haven't had
>> any
>> > new auto-splits in a couple of months. And the hotspots only started
>> > happening recently.
>> >
>> > Our hashing scheme is very simple, we take the MD5 of the key, then
>> form a
>> > 4 digit prefix based on the first two bytes of the MD5 normalized to be
>> > within the range 0-1023 . I am fairly confident about this scheme
>> > especially since even during the hotspot we see no evidence so far that
>> any
>> > particular region is taking disproportionate traffic (based on Cloudera
>> > Manager per region charts on the hotspot server). Does that look like a
>> > reasonable scheme to randomize which region any given key goes to? And
>> the
>> > start of the hotspot doesn't seem to correspond to any region splitting
>> or
>> > moving from one server to another activity.
>> >
>> > Thanks.
>> >
>> > 
>> > Saad
>> >
>> >
>> >> On Thu, Dec 1, 2016 at 3:32 PM, John Leach 
>> wrote:
>> >>
>> >> Saad,
>> >>
>> >> Region move or split causes client connections to simultaneously
>> refresh
>> >> their meta.
>> >>
>> >> Key word is supposed.  We have seen meta hot spotting from time to time
>> >> and on different versions at Splice Machine.
>> >>
>> >> How confident are you in your hashing algorithm?
>> >>
>> >> Regards,
>> >> John Leach
>> >>
>> >>
>> >>
>> >>> On Dec 1, 2016, at 2:25 PM, Saad Mufti  wrote:
>> >>>
>> >>> No never thought about that. I just figured out how to locate the
>> server
>> >>> for that table after you mentioned it. We'll have to keep an eye on it
>> >> next
>> >>> time we have a hotspot to see if it coincides with the hotspot server.
>> >>>
>> >>> What would be the theory for how it could become a hotspot? Isn't the
>> >>> client supposed to cache it and only go back for a refresh if it hits
>> a
>> >>> region that is not in its expected location?
>> >>>
>> >>> 
>> >>> Saad
>> >>>
>> >>>
>> >>> On Thu, Dec 1, 2016 at 2:56 PM, John Leach 
>> >> wrote:
>> >>>
>>  Saad,
>> 
>>  Did you validate that Meta is not on the “Hot” region server?
>> 
>>  Regards,
>>  John Leach
>> 
>> 
>> 
>> > On Dec 1, 2016, at 1:50 PM, Saad Mufti 
>> wrote:
>> >
>> > Hi,
>> >
>> > We are using HBase 1.0 on CDH 5.5.2 . We have taken great care to
>> avoid
>> > hotspotting due to inadvertent data patterns by prepending an MD5
>> >> based 4
>> > digit hash prefix to all our data keys. This works fine most of the
>>  times,
>> > but more and more (as much as once or twice a day) recently we have
>> > occasions where one region server suddenly becomes "hot" (CPU above
>> or
>> > around 95% in various monitoring tools). When it happens it lasts
>> for
>> > hours, occasionally the hotspot might jump to another region server
>> as
>>  the
>> > master decide the region is unresponsive and gives its region to
>> >> another
>> > server.
>> >
>> > For the longest time, we thought this must be some single rogue key
>> in
>>  our
>> > input data that is being hammered. All attempts to track this down
>> have
>> > failed though, and the following behavior argues against this being
>> > application based:
>> >
>> > 1. plotted Get and Put rate by region on the "hot" region server in
>> > Cloudera Manager Charts, shows no single region is an outlier.
>> >
>> > 2. cleanly restarting just the region server process causes its
>> regions
>>  to
>> > randomly migrate to other region servers, then it gets new ones from
>> >> the
>> > HBase master, basically a sort of shuffling, then the hotspot goes
>> >> away.
>>  If
>> > it were application based, you'd expect the hotspot to just jump to
>>  another
>> > region server.
>> >
>> > 3. have pored 

Re: Hot Region Server With No Hot Region

2016-12-01 Thread Saad Mufti
Sure will, the next time it happens.

Thanks!!!


Saad


On Thu, Dec 1, 2016 at 5:01 PM, Ted Yu  wrote:

> From #2 in the initial email, the hbase:meta might not be the cause for
> the hotspot.
>
> Saad:
> Can you pastebin the stack trace of the hot region server when this happens
> again?
>
> Thanks
>
> > On Dec 2, 2016, at 4:48 AM, Saad Mufti  wrote:
> >
> > We used a pre-split into 1024 regions at the start but we miscalculated
> our
> > data size, so there were still auto-split storms at the beginning as
> data
> > size stabilized, it has ended up at around 9500 or so regions, plus a few
> > thousand regions for a few other tables (much smaller). But haven't had
> any
> > new auto-splits in a couple of months. And the hotspots only started
> > happening recently.
> >
> > Our hashing scheme is very simple, we take the MD5 of the key, then form
> a
> > 4 digit prefix based on the first two bytes of the MD5 normalized to be
> > within the range 0-1023 . I am fairly confident about this scheme
> > especially since even during the hotspot we see no evidence so far that
> any
> > particular region is taking disproportionate traffic (based on Cloudera
> > Manager per region charts on the hotspot server). Does that look like a
> > reasonable scheme to randomize which region any given key goes to? And the
> > start of the hotspot doesn't seem to correspond to any region splitting
> or
> > moving from one server to another activity.
> >
> > Thanks.
> >
> > 
> > Saad
> >
> >
> >> On Thu, Dec 1, 2016 at 3:32 PM, John Leach 
> wrote:
> >>
> >> Saad,
> >>
> >> Region move or split causes client connections to simultaneously refresh
> >> their meta.
> >>
> >> Key word is supposed.  We have seen meta hot spotting from time to time
> >> and on different versions at Splice Machine.
> >>
> >> How confident are you in your hashing algorithm?
> >>
> >> Regards,
> >> John Leach
> >>
> >>
> >>
> >>> On Dec 1, 2016, at 2:25 PM, Saad Mufti  wrote:
> >>>
> >>> No never thought about that. I just figured out how to locate the
> server
> >>> for that table after you mentioned it. We'll have to keep an eye on it
> >> next
> >>> time we have a hotspot to see if it coincides with the hotspot server.
> >>>
> >>> What would be the theory for how it could become a hotspot? Isn't the
> >>> client supposed to cache it and only go back for a refresh if it hits a
> >>> region that is not in its expected location?
> >>>
> >>> 
> >>> Saad
> >>>
> >>>
> >>> On Thu, Dec 1, 2016 at 2:56 PM, John Leach 
> >> wrote:
> >>>
>  Saad,
> 
>  Did you validate that Meta is not on the “Hot” region server?
> 
>  Regards,
>  John Leach
> 
> 
> 
> > On Dec 1, 2016, at 1:50 PM, Saad Mufti  wrote:
> >
> > Hi,
> >
> > We are using HBase 1.0 on CDH 5.5.2 . We have taken great care to
> avoid
> > hotspotting due to inadvertent data patterns by prepending an MD5
> >> based 4
> > digit hash prefix to all our data keys. This works fine most of the
>  times,
> > but more and more (as much as once or twice a day) recently we have
> > occasions where one region server suddenly becomes "hot" (CPU above
> or
> > around 95% in various monitoring tools). When it happens it lasts for
> > hours, occasionally the hotspot might jump to another region server
> as
>  the
> > master decide the region is unresponsive and gives its region to
> >> another
> > server.
> >
> > For the longest time, we thought this must be some single rogue key
> in
>  our
> > input data that is being hammered. All attempts to track this down
> have
> > failed though, and the following behavior argues against this being
> > application based:
> >
> > 1. plotted Get and Put rate by region on the "hot" region server in
> > Cloudera Manager Charts, shows no single region is an outlier.
> >
> > 2. cleanly restarting just the region server process causes its
> regions
>  to
> > randomly migrate to other region servers, then it gets new ones from
> >> the
> > HBase master, basically a sort of shuffling, then the hotspot goes
> >> away.
>  If
> > it were application based, you'd expect the hotspot to just jump to
>  another
> > region server.
> >
> > 3. have pored through region server logs and can't see anything out
> of
>  the
> > ordinary happening
> >
> > The only other pertinent thing to mention might be that we have a
> >> special
> > process of our own running outside the cluster that does cluster wide
>  major
> > compaction in a rolling fashion, where each batch consists of one
> >> region
> > from each region server, and it waits before one batch is completely
> >> done
> > before starting another. We have seen no real impact on the 

Re: Hot Region Server With No Hot Region

2016-12-01 Thread Ted Yu
From #2 in the initial email, the hbase:meta might not be the cause for the 
hotspot. 

Saad:
Can you pastebin the stack trace of the hot region server when this happens again?

Thanks

> On Dec 2, 2016, at 4:48 AM, Saad Mufti  wrote:
> 
> We used a pre-split into 1024 regions at the start but we miscalculated our
> data size, so there were still auto-split storms at the beginning as data
> size stabilized, it has ended up at around 9500 or so regions, plus a few
> thousand regions for a few other tables (much smaller). But haven't had any
> new auto-splits in a couple of months. And the hotspots only started
> happening recently.
> 
> Our hashing scheme is very simple, we take the MD5 of the key, then form a
> 4 digit prefix based on the first two bytes of the MD5 normalized to be
> within the range 0-1023 . I am fairly confident about this scheme
> especially since even during the hotspot we see no evidence so far that any
> particular region is taking disproportionate traffic (based on Cloudera
> Manager per region charts on the hotspot server). Does that look like a
> reasonable scheme to randomize which region any given key goes to? And the
> start of the hotspot doesn't seem to correspond to any region splitting or
> moving from one server to another activity.
> 
> Thanks.
> 
> 
> Saad
> 
> 


Re: Hot Region Server With No Hot Region

2016-12-01 Thread Saad Mufti
We used a pre-split into 1024 regions at the start, but we miscalculated our
data size, so there were still auto-split storms at the beginning as the
data size stabilized. It has ended up at around 9,500 regions, plus a few
thousand regions for a few other (much smaller) tables. But we haven't had
any new auto-splits in a couple of months, and the hotspots only started
happening recently.

Our hashing scheme is very simple: we take the MD5 of the key, then form a
4-digit prefix based on the first two bytes of the MD5, normalized to the
range 0-1023. I am fairly confident about this scheme, especially since
even during the hotspot we see no evidence so far that any particular
region is taking disproportionate traffic (based on the Cloudera Manager
per-region charts on the hotspot server). Does that look like a reasonable
scheme to randomize which region any given key goes to? And the start of
the hotspot doesn't seem to correspond to any region splitting, or to any
region moving from one server to another.
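
In code, the scheme amounts to something like the sketch below (an
illustration of the description above, not the actual implementation; in
particular, whether the normalization is a modulo or a scaling of the first
two bytes is an assumption here):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class SaltedKey {
  // MD5 the logical key, take the first two bytes as an unsigned 16-bit
  // value, normalize it into 0..1023, and prepend it as a zero-padded
  // 4-digit prefix.
  public static String rowKey(String key) throws NoSuchAlgorithmException {
    byte[] md5 = MessageDigest.getInstance("MD5").digest(key.getBytes(StandardCharsets.UTF_8));
    int firstTwo = ((md5[0] & 0xFF) << 8) | (md5[1] & 0xFF); // 0..65535
    int bucket = firstTwo % 1024;                            // normalized to 0..1023
    return String.format("%04d", bucket) + key;              // e.g. "0517" + original key
  }
}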

Thanks.


Saad


On Thu, Dec 1, 2016 at 3:32 PM, John Leach  wrote:

> Saad,
>
> Region move or split causes client connections to simultaneously refresh
> their meta.
>
> The key word is "supposed".  We have seen meta hotspotting from time to
> time, and on different versions, at Splice Machine.
>
> How confident are you in your hashing algorithm?
>
> Regards,
> John Leach
>
>
>


Re: Hot Region Server With No Hot Region

2016-12-01 Thread John Leach
Saad,

Region move or split causes client connections to simultaneously refresh their 
meta.

The key word is "supposed".  We have seen meta hotspotting from time to time,
and on different versions, at Splice Machine.
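
One rough way to check for that is to watch the per-region read/write
counters for the meta region across the cluster. The sketch below uses the
HBase 1.x client API; it is only an illustration, and the counters are
cumulative since region open, so they need to be diffed over time:

import java.io.IOException;
import java.util.Map;
import org.apache.hadoop.hbase.ClusterStatus;
import org.apache.hadoop.hbase.RegionLoad;
import org.apache.hadoop.hbase.ServerLoad;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.client.Admin;

public class MetaLoadCheck {
  // Prints read/write request counters for the hbase:meta region on
  // whichever server currently hosts it.
  public static void print(Admin admin) throws IOException {
    ClusterStatus status = admin.getClusterStatus();
    for (ServerName sn : status.getServers()) {
      ServerLoad load = status.getLoad(sn);
      for (Map.Entry<byte[], RegionLoad> e : load.getRegionsLoad().entrySet()) {
        RegionLoad rl = e.getValue();
        if (rl.getNameAsString().startsWith("hbase:meta")) {
          System.out.println(sn + " " + rl.getNameAsString()
              + " reads=" + rl.getReadRequestsCount()
              + " writes=" + rl.getWriteRequestsCount());
        }
      }
    }
  }
}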

How confident are you in your hashing algorithm?

Regards,
John Leach



> On Dec 1, 2016, at 2:25 PM, Saad Mufti  wrote:
> 
> No, never thought about that. I just figured out how to locate the server
> for that table after you mentioned it. We'll have to keep an eye on it next
> time we have a hotspot to see if it coincides with the hotspot server.
> 
> What would be the theory for how it could become a hotspot? Isn't the
> client supposed to cache it and only go back for a refresh if it hits a
> region that is not in its expected location?
> 
> 
> Saad
> 
> 



Re: Hot Region Server With No Hot Region

2016-12-01 Thread Saad Mufti
No, never thought about that. I just figured out how to locate the server
for that table after you mentioned it. We'll have to keep an eye on it next
time we have a hotspot to see if it coincides with the hotspot server.
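
(For anyone else who needs it, a minimal sketch of one way to do that lookup
with the HBase 1.x client API, assuming a standard client Configuration is on
the classpath; in 1.x hbase:meta is a single region, so any row resolves to
its location.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;

public class FindMetaServer {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         RegionLocator locator = conn.getRegionLocator(TableName.META_TABLE_NAME)) {
      // Any row key resolves to the single hbase:meta region.
      HRegionLocation loc = locator.getRegionLocation(HConstants.EMPTY_START_ROW);
      System.out.println("hbase:meta is on " + loc.getServerName());
    }
  }
}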

What would be the theory for how it could become a hotspot? Isn't the
client supposed to cache it and only go back for a refresh if it hits a
region that is not in its expected location?


Saad


On Thu, Dec 1, 2016 at 2:56 PM, John Leach  wrote:

> Saad,
>
> Did you validate that Meta is not on the “Hot” region server?
>
> Regards,
> John Leach
>
>
>


Re: Hot Region Server With No Hot Region

2016-12-01 Thread John Leach
Saad,

Did you validate that Meta is not on the “Hot” region server?  

Regards,
John Leach



> On Dec 1, 2016, at 1:50 PM, Saad Mufti  wrote:
> 
> Hi,
> 
> We are using HBase 1.0 on CDH 5.5.2. We have taken great care to avoid
> hotspotting due to inadvertent data patterns by prepending an MD5-based
> 4-digit hash prefix to all our data keys. This works fine most of the time,
> but recently, more and more often (as much as once or twice a day), we have
> occasions where one region server suddenly becomes "hot" (CPU above or
> around 95% in various monitoring tools). When it happens it lasts for
> hours; occasionally the hotspot might jump to another region server when
> the master decides the region server is unresponsive and reassigns its
> regions to another server.
> 
> For the longest time, we thought this must be some single rogue key in our
> input data that is being hammered. All attempts to track this down have
> failed, though, and the following behavior argues against this being
> application-based:
> 
> 1. We plotted the Get and Put rate by region on the "hot" region server in
> Cloudera Manager charts; no single region is an outlier.
> 
> 2. Cleanly restarting just the region server process causes its regions to
> randomly migrate to other region servers; then it gets new ones from the
> HBase master (basically a sort of shuffling), and the hotspot goes away. If
> it were application-based, you'd expect the hotspot to just jump to another
> region server.
> 
> 3. We have pored through the region server logs and can't see anything out
> of the ordinary happening.
> 
> The only other pertinent thing to mention might be that we have a special
> process of our own, running outside the cluster, that does cluster-wide
> major compaction in a rolling fashion, where each batch consists of one
> region from each region server, and it waits until one batch is completely
> done before starting another. We have seen no real impact on the hotspot
> from shutting this down, and in normal times it doesn't impact our read or
> write performance much.
> 
> We are at our wit's end. Does anyone have experience with a scenario like
> this? Any help/guidance would be most appreciated.
> 
> -
> Saad
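
For context, the rolling major-compaction process described above could be
driven by something like the sketch below, written against the HBase 1.x
Admin API: each batch takes one region from every region server and waits
for the whole batch to finish before starting the next. The batching and
wait logic here are assumptions for illustration, not the actual tool.

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.protobuf.generated.AdminProtos.GetRegionInfoResponse.CompactionState;

public class RollingMajorCompact {
  public static void run(Admin admin) throws Exception {
    // Snapshot the regions currently hosted by each region server.
    List<List<HRegionInfo>> perServer = new ArrayList<>();
    for (ServerName sn : admin.getClusterStatus().getServers()) {
      perServer.add(admin.getOnlineRegions(sn));
    }
    int batch = 0;
    boolean more = true;
    while (more) {
      more = false;
      List<byte[]> inFlight = new ArrayList<>();
      // One region per server in this batch.
      for (List<HRegionInfo> regions : perServer) {
        if (batch < regions.size()) {
          byte[] name = regions.get(batch).getRegionName();
          admin.majorCompactRegion(name); // asynchronous request
          inFlight.add(name);
          more = true;
        }
      }
      // Wait for every region in this batch to finish compacting.
      for (byte[] name : inFlight) {
        while (admin.getCompactionStateForRegion(name) != CompactionState.NONE) {
          Thread.sleep(10_000L);
        }
      }
      batch++;
    }
  }
}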