I forgot to mention that in the example above you would presplit into 1024
regions, with start keys running from "0000" to "1023".
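
To make the example concrete, here is a minimal Python sketch of the prefix
calculation and the matching split points. It is an illustration only: the
function names (salted_key, split_keys), the choice of the first four bytes
of the MD5 digest, and the modulo step are assumptions made for this sketch,
not anything HBase prescribes. Note that a table with 1024 regions needs only
1023 split keys ("0001" to "1023"), because the first region always starts at
the empty key and so already covers the "0000" prefix.

```python
import hashlib

NUM_BUCKETS = 1024  # assumed region count, taken from the example in this thread


def salted_key(row_key, buckets=NUM_BUCKETS):
    """Prepend a zero-padded 4-digit bucket derived from an MD5 hash of the key."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    # Assumption: the first 4 digest bytes, reduced modulo the bucket count,
    # are good enough to spread keys evenly across buckets.
    bucket = int.from_bytes(digest[:4], "big") % buckets
    return "{:04d}{}".format(bucket, row_key)


def split_keys(buckets=NUM_BUCKETS):
    """Split points for a pre-split table: one fewer than the number of regions."""
    # The first region has an empty start key, so splits begin at "0001".
    return ["{:04d}".format(b) for b in range(1, buckets)]


if __name__ == "__main__":
    print(salted_key("user42"))   # 4-digit prefix followed by the original key
    print(len(split_keys()))      # 1023 split points, i.e. 1024 regions
```

Reads have to go through the same function: a point get recomputes the prefix
from the original key, while a scan over a range of original keys has to be
issued once per prefix.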

Cheers.

----
Saad


On Fri, Dec 2, 2016 at 8:47 AM, Saad Mufti <saad.mu...@gmail.com> wrote:

> One way to do this without knowing your data (you still need some idea of
> the size of the keyspace) is to prepend a fixed-width numeric prefix, drawn
> from a suitable range and derived from a good hash like MD5. For example,
> let us say you can predict your data will fit in about 1024 regions. You
> can decide to prepend a prefix from 0000 to 1023 to all your keys, computed
> from a suitable hash of each key.
>
> The pros:
>
> 1. you get to pre-split without knowing your keyspace
> 2. very hard if not impossible for unknown data providers to send you data
> in some order that generates hotspots (unless of course the same key is
> repeated over and over, still have to watch out for that)
>
> The cons:
>
> 1. you lose the ability to scan in the "natural" sorted order of your
> keyspace, as that order is no longer preserved in HBase
> 2. if you miscalculate your keyspace size by a lot, you are stuck with the
> hash function and range you selected, even if you later get more regions,
> unless you're willing to do a complete migration to a new table
>
> Hope the above helps.
>
> ----
> Saad
>
>
> On Tue, Nov 29, 2016 at 4:28 AM, Sachin Jain <sachinjain...@gmail.com>
> wrote:
>
>> Thanks Dave for your suggestions!
>> Will let you know if I find some approach to tackle this situation.
>>
>> Regards
>>
>> On Mon, Nov 28, 2016 at 9:05 PM, Dave Latham <lat...@davelink.net> wrote:
>>
>> > If you truly have no way to predict anything about the distribution of
>> > your data across the row key space, then you are correct that there is
>> > no way to presplit your regions in an effective way. Either you need to
>> > make some starting guess, such as a small number of uniform splits, or
>> > wait until you have some information about what the data will look like.
>> >
>> > Dave
>> >
>> > On Mon, Nov 28, 2016 at 12:42 AM, Sachin Jain <sachinjain...@gmail.com>
>> > wrote:
>> >
>> > > Hi,
>> > >
>> > > I was going through the article on pre-splitting a table [0], and it
>> > > mentions that it is generally best practice to presplit your table.
>> > > But don't we need to know the data in advance in order to presplit it?
>> > >
>> > > Question: What is the best practice when we don't know what data is
>> > > going to be inserted into HBase? Essentially, I don't know the key
>> > > range, so if I specify the wrong splits, then either the first or the
>> > > last split can become a hot region in my system.
>> > >
>> > > [0]: https://hbase.apache.org/book.html#rowkey.regionsplits
>> > >
>> > > Thanks
>> > > -Sachin
>> > >
>> >
>>
>
>
