One way to do this without knowing your data (you still need some idea of the size of the keyspace) is to prepend a fixed-width numeric prefix to every key, drawn from a suitable range and derived from a good hash of the key such as MD5. For example, say you can predict your data will fit in about 1024 regions. You can then prepend a prefix from 0000 to 1023 to all your keys, computed from the hash (for instance, the hash value modulo 1024).
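
To make the prefix idea concrete, here is a rough sketch in Python. It is only an illustration under my own assumptions: the names `salted_key`, `split_points`, `NUM_BUCKETS` and `PREFIX_WIDTH` are made up for this example and are not part of any HBase API, and taking the first four bytes of the MD5 digest modulo the bucket count is just one reasonable way to map a hash into the range. With 1024 buckets the prefixes run from 0000 through 1023.

```python
import hashlib

NUM_BUCKETS = 1024   # assumed estimate of how many regions the data will need
PREFIX_WIDTH = 4     # fixed width so prefixes sort correctly as bytes


def salted_key(key: bytes, num_buckets: int = NUM_BUCKETS) -> bytes:
    """Prepend a zero-padded bucket number derived from the MD5 of the key."""
    digest = hashlib.md5(key).digest()
    # Interpret the first 4 digest bytes as an integer and fold into the range.
    bucket = int.from_bytes(digest[:4], "big") % num_buckets
    prefix = f"{bucket:0{PREFIX_WIDTH}d}".encode("ascii")
    return prefix + key


def split_points(num_buckets: int = NUM_BUCKETS) -> list:
    """Boundaries between buckets: N buckets need N - 1 split keys."""
    return [f"{b:0{PREFIX_WIDTH}d}".encode("ascii") for b in range(1, num_buckets)]


if __name__ == "__main__":
    print(salted_key(b"user-42"))
    print(len(split_points()), split_points()[0], split_points()[-1])
```

The split points would be handed to whatever table-creation call your client offers, and every read of a single row has to recompute the same prefix from the original key before doing the lookup.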

The pros:

1. You get to pre-split without knowing your keyspace.
2. It is very hard, if not impossible, for unknown data providers to send you data in an order that generates hotspots (unless of course the same key is repeated over and over; you still have to watch out for that).

The cons:

1. You lose the ability to scan in the "natural" sorted order of your keyspace, as that order is no longer preserved in HBase.
2. If you miscalculate your keyspace size by a lot, you are stuck with the hash function and range you selected, even if you later get more regions, unless you're willing to do a complete migration to a new table.

Hope the above helps.

----
Saad

On Tue, Nov 29, 2016 at 4:28 AM, Sachin Jain <[email protected]> wrote:

> Thanks Dave for your suggestions!
> Will let you know if I find some approach to tackle this situation.
>
> Regards
>
> On Mon, Nov 28, 2016 at 9:05 PM, Dave Latham <[email protected]> wrote:
>
> > If you truly have no way to predict anything about the distribution of your
> > data across the row key space, then you are correct that there is no way to
> > presplit your regions in an effective way. Either you need to make some
> > starting guess, such as a small number of uniform splits, or wait until you
> > have some information about what the data will look like.
> >
> > Dave
> >
> > On Mon, Nov 28, 2016 at 12:42 AM, Sachin Jain <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > I was going through pre-splitting a table article [0] and it is mentioned
> > > that it is generally best practice to presplit your table. But don't we
> > > need to know the data in advance in order to presplit it?
> > >
> > > Question: What should be the best practice when we don't know what data is
> > > going to be inserted into HBase? Essentially I don't know the key range, so
> > > if I specify wrong splits, then either the first or last split can be a hot
> > > region in my system.
> > >
> > > [0]: https://hbase.apache.org/book.html#rowkey.regionsplits
> > >
> > > Thanks
> > > -Sachin
