Meetup next Thursday at Splice: https://www.meetup.com/hbaseusergroup/events/235542241/

2016-12-02 Thread Stack
Come on out Thursday night, December 8th. We're having a nice meetup w/
some friends in town visiting to talk about Omid2 and Accordion, the
in-memory compaction work that has been ongoing for hbase2.

It's going to be a good night,
Yours,
S


Re: Hot Region Server With No Hot Region

2016-12-02 Thread Ted Yu
Somehow I couldn't access the pastebin (I am in China now).
Did the region server showing the hotspot host meta?
Thanks 


Re: Hot Region Server With No Hot Region

2016-12-02 Thread Saad Mufti
We're in AWS with D2.4xLarge instances. Each instance has 12 independent
spindles/disks from what I can tell.

We have charted get_rate and mutate_rate by host and

a) mutate_rate shows no real outliers
b) get_rate shows the overall rate on the "hotspot" region server is a bit
higher than every other server, not severely but enough to be noticeable.
But when we chart get_rate on that server by region, no one region stands
out.

get_rate chart by host:

https://snag.gy/hmoiDw.jpg

mutate_rate chart by host:

https://snag.gy/jitdMN.jpg


Saad



Re: Hot Region Server With No Hot Region

2016-12-02 Thread John Leach
Here is what I see...


* Short Compaction Running on Heap
"regionserver/ip-10-99-181-146.aolp-prd.us-east-1.ec2.aolcloud.net/10.99.181.146:60020-shortCompactions-1480229281547" - Thread t@242
   java.lang.Thread.State: RUNNABLE
    at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder.compressSingleKeyValue(FastDiffDeltaEncoder.java:270)
    at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder.internalEncode(FastDiffDeltaEncoder.java:245)
    at org.apache.hadoop.hbase.io.encoding.BufferedDataBlockEncoder.encode(BufferedDataBlockEncoder.java:987)
    at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder.encode(FastDiffDeltaEncoder.java:58)
    at org.apache.hadoop.hbase.io.hfile.HFileDataBlockEncoderImpl.encode(HFileDataBlockEncoderImpl.java:97)
    at org.apache.hadoop.hbase.io.hfile.HFileBlock$Writer.write(HFileBlock.java:866)
    at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(HFileWriterV2.java:270)
    at org.apache.hadoop.hbase.io.hfile.HFileWriterV3.append(HFileWriterV3.java:87)
    at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.append(StoreFile.java:949)
    at org.apache.hadoop.hbase.regionserver.compactions.Compactor.performCompaction(Compactor.java:282)
    at org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:105)
    at org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.compact(DefaultStoreEngine.java:124)
    at org.apache.hadoop.hbase.regionserver.HStore.compact(HStore.java:1233)
    at org.apache.hadoop.hbase.regionserver.HRegion.compact(HRegion.java:1770)
    at org.apache.hadoop.hbase.regionserver.CompactSplitThread$CompactionRunner.run(CompactSplitThread.java:520)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)


* WAL Syncs waiting…   ALL 5
"sync.0" - Thread t@202
   java.lang.Thread.State: TIMED_WAITING
    at java.lang.Object.wait(Native Method)
    - waiting on <67ba892d> (a java.util.LinkedList)
    at org.apache.hadoop.hdfs.DFSOutputStream.waitForAckedSeqno(DFSOutputStream.java:2337)
    at org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:2224)
    at org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:2116)
    at org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:130)
    at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
    at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1379)
    at java.lang.Thread.run(Thread.java:745)

* Mutations backing up very badly...

"B.defaultRpcServer.handler=103,queue=7,port=60020" - Thread t@155
   java.lang.Thread.State: TIMED_WAITING
    at java.lang.Object.wait(Native Method)
    - waiting on <6ab54ea3> (a org.apache.hadoop.hbase.regionserver.wal.SyncFuture)
    at org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:167)
    at org.apache.hadoop.hbase.regionserver.wal.FSHLog.blockOnSync(FSHLog.java:1504)
    at org.apache.hadoop.hbase.regionserver.wal.FSHLog.publishSyncThenBlockOnCompletion(FSHLog.java:1498)
    at org.apache.hadoop.hbase.regionserver.wal.FSHLog.sync(FSHLog.java:1632)
    at org.apache.hadoop.hbase.regionserver.HRegion.syncOrDefer(HRegion.java:7737)
    at org.apache.hadoop.hbase.regionserver.HRegion.processRowsWithLocks(HRegion.java:6504)
    at org.apache.hadoop.hbase.regionserver.HRegion.mutateRowsWithLocks(HRegion.java:6352)
    at org.apache.hadoop.hbase.regionserver.HRegion.mutateRowsWithLocks(HRegion.java:6334)
    at org.apache.hadoop.hbase.regionserver.HRegion.mutateRow(HRegion.java:6325)
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.mutateRows(RSRpcServices.java:418)
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:1916)
    at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32213)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2034)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
    at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
    at java.lang.Thread.run(Thread.java:745)


Too many writers being blocked attempting to write to WAL.

What does your disk infrastructure look like? Can you get away with
multi-WAL?  Ugh...
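For reference, multi-WAL is switched on through the WAL provider setting in hbase-site.xml; a sketch (verify the exact provider value against your HBase/CDH version's documentation before relying on it):

```xml
<!-- hbase-site.xml: use multiple WALs per region server -->
<property>
  <name>hbase.wal.provider</name>
  <value>multiwal</value>
</property>
```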

Regards,
John Leach



Re: Hot Region Server With No Hot Region

2016-12-02 Thread Saad Mufti
Hi Ted,

Finally we have another hotspot going on, same symptoms as before, here is
the pastebin for the stack trace from the region server that I obtained via
VisualVM:

http://pastebin.com/qbXPPrXk

Would really appreciate any insight you or anyone else can provide.

Thanks.


Saad


On Thu, Dec 1, 2016 at 6:08 PM, Saad Mufti  wrote:

> Sure will, the next time it happens.
>
> Thanks!!!
>
> 
> Saad
>
>
> On Thu, Dec 1, 2016 at 5:01 PM, Ted Yu  wrote:
>
>> From #2 in the initial email, the hbase:meta might not be the cause for
>> the hotspot.
>>
>> Saad:
>> Can you pastebin stack trace of the hot region server when this happens
>> again ?
>>
>> Thanks
>>
>> > On Dec 2, 2016, at 4:48 AM, Saad Mufti  wrote:
>> >
>> > We used a pre-split into 1024 regions at the start but we miscalculated
>> > our data size, so there were still auto-split storms at the beginning
>> > as data size stabilized; it has ended up at around 9500 or so regions,
>> > plus a few thousand regions for a few other tables (much smaller). But
>> > we haven't had any new auto-splits in a couple of months. And the
>> > hotspots only started happening recently.
>> >
>> > Our hashing scheme is very simple: we take the MD5 of the key, then
>> > form a 4 digit prefix based on the first two bytes of the MD5
>> > normalized to be within the range 0-1023. I am fairly confident about
>> > this scheme, especially since even during the hotspot we see no
>> > evidence so far that any particular region is taking disproportionate
>> > traffic (based on Cloudera Manager per-region charts on the hotspot
>> > server). Does that look like a reasonable scheme to randomize which
>> > region any given key goes to? And the start of the hotspot doesn't seem
>> > to correspond to any region splitting or moving from one server to
>> > another.
>> >
>> > Thanks.
>> >
>> > Saad
>> >
>> >
>> >> On Thu, Dec 1, 2016 at 3:32 PM, John Leach  wrote:
>> >>
>> >> Saad,
>> >>
>> >> Region move or split causes client connections to simultaneously
>> >> refresh their meta.
>> >>
>> >> Key word is supposed.  We have seen meta hot spotting from time to time
>> >> and on different versions at Splice Machine.
>> >>
>> >> How confident are you in your hashing algorithm?
>> >>
>> >> Regards,
>> >> John Leach
>> >>
>> >>
>> >>
>> >>> On Dec 1, 2016, at 2:25 PM, Saad Mufti  wrote:
>> >>>
>> >>> No never thought about that. I just figured out how to locate the
>> >>> server for that table after you mentioned it. We'll have to keep an
>> >>> eye on it next time we have a hotspot to see if it coincides with the
>> >>> hotspot server.
>> >>>
>> >>> What would be the theory for how it could become a hotspot? Isn't
>> >>> the client supposed to cache it and only go back for a refresh if it
>> >>> hits a region that is not in its expected location?
>> >>>
>> >>> 
>> >>> Saad
>> >>>
>> >>>
>> >>> On Thu, Dec 1, 2016 at 2:56 PM, John Leach  wrote:
>> >>>
>>  Saad,
>> 
>>  Did you validate that Meta is not on the “Hot” region server?
>> 
>>  Regards,
>>  John Leach
>> 
>> 
>> 
>> > On Dec 1, 2016, at 1:50 PM, Saad Mufti  wrote:
>> >
>> > Hi,
>> >
>> > We are using HBase 1.0 on CDH 5.5.2. We have taken great care to avoid
>> > hotspotting due to inadvertent data patterns by prepending an MD5-based
>> > 4 digit hash prefix to all our data keys. This works fine most of the
>> > time, but more and more (as much as once or twice a day) recently we
>> > have occasions where one region server suddenly becomes "hot" (CPU
>> > above or around 95% in various monitoring tools). When it happens it
>> > lasts for hours; occasionally the hotspot might jump to another region
>> > server as the master decides the region is unresponsive and gives its
>> > region to another server.
>> >
>> > For the longest time, we thought this must be some single rogue key in
>> > our input data that is being hammered. All attempts to track this down
>> > have failed though, and the following behavior argues against this
>> > being application based:
>> >
>> > 1. plotted Get and Put rate by region on the "hot" region server in
>> > Cloudera Manager Charts, shows no single region is an outlier.
>> >
>> > 2. cleanly restarting just the region server process causes its regions
>> > to randomly migrate to other region servers, then it gets new ones from
>> > the HBase master, basically a sort of shuffling, then the hotspot goes
>> > away. If it were application based, you'd expect the hotspot to just
>> > jump to another region server.
>> >
>> > 3. have pored 

Re: Creating HBase table with presplits

2016-12-02 Thread Saad Mufti
Forgot to mention: in the above example you would presplit into 1024
regions, starting from "" to "1023" (start keys).
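For concreteness, fixed-width 4-digit start keys like these could be generated with a short sketch (Python; the helper name is mine, not from the thread — the prefixes are zero-padded so they sort correctly as bytes):

```python
def presplit_keys(num_regions=1024):
    """Explicit split points for a pre-split table.

    The first region's start key is the implicit empty key, so only
    regions 1..num_regions-1 get an explicit start key.
    """
    return [b"%04d" % i for i in range(1, num_regions)]

keys = presplit_keys()  # 1023 split points: b"0001" ... b"1023"
```

These byte strings would then be passed as the split keys when creating the table (e.g. via the admin API of your HBase client).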

Cheers.


Saad




Re: Creating HBase table with presplits

2016-12-02 Thread Saad Mufti
One way to do this without knowing your data (still need some idea of size
of keyspace) is to prepend a fixed numeric prefix from a suitable range
based on a good hash like MD5. For example, let us say you can predict your
data will fit in about 1024 regions. You can decide to prepend a prefix
from  to 1024 to all your keys based on a suitable hash.
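A minimal sketch of this kind of MD5-based prefixing (the helper name is hypothetical; it assumes the prefix is derived from the first two digest bytes reduced to the 0-1023 range, as described elsewhere in this thread):

```python
import hashlib

def prefixed_key(key: bytes) -> bytes:
    # First two bytes of the MD5 digest, normalized to 0-1023,
    # prepended as a fixed-width 4-digit prefix.
    digest = hashlib.md5(key).digest()
    bucket = ((digest[0] << 8) | digest[1]) % 1024
    return b"%04d" % bucket + key
```

The same key always maps to the same bucket, so point reads can recompute the prefix; a full scan, however, has to fan out over all 1024 prefixes.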

The pros:

1. you get to pre-split without knowing your keyspace
2. very hard if not impossible for unknown data providers to send you data
in some order that generates hotspots (unless of course the same key is
repeated over and over, still have to watch out for that)

The cons:

1. you lose the ability to scan in the "natural" sorted order of your
keyspace, as that order is no longer preserved in HBase
2. if you miscalculate your keyspace size by a lot, you are stuck with the
hash function and range you selected even if you later get more regions,
unless you're willing to do a complete migration to a new table

Hope above helps.


Saad


On Tue, Nov 29, 2016 at 4:28 AM, Sachin Jain  wrote:

> Thanks Dave for your suggestions!
> Will let you know if I find some approach to tackle this situation.
>
> Regards
>
> On Mon, Nov 28, 2016 at 9:05 PM, Dave Latham  wrote:
>
> > If you truly have no way to predict anything about the distribution of
> > your data across the row key space, then you are correct that there is
> > no way to presplit your regions in an effective way. Either you need to
> > make some starting guess, such as a small number of uniform splits, or
> > wait until you have some information about what the data will look like.
> >
> > Dave
> >
> > On Mon, Nov 28, 2016 at 12:42 AM, Sachin Jain 
> > wrote:
> >
> > > Hi,
> > >
> > > I was going through the pre-splitting a table article [0] and it is
> > > mentioned that it is generally best practice to presplit your table.
> > > But don't we need to know the data in advance in order to presplit it?
> > >
> > > Question: What should be the best practice when we don't know what
> > > data is going to be inserted into HBase? Essentially I don't know the
> > > key range, so if I specify wrong splits, then either the first or last
> > > split can be a hot region in my system.
> > >
> > > [0]: https://hbase.apache.org/book.html#rowkey.regionsplits
> > >
> > > Thanks
> > > -Sachin
> > >
> >
>