Thanks Saad! This is very similar to what I had planned to implement, i.e. to map the unknown keyspace onto a known keyspace by using a hash algorithm like MD5, and then presplit the table on that known keyspace. Thanks once again!
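A minimal sketch of the hash-prefix idea discussed in this thread, assuming 1024 buckets and a four-digit decimal prefix ("0000" through "1023"). The names salted_key and split_points are hypothetical helpers for illustration, not part of any HBase API, and taking the first four bytes of the MD5 digest modulo the bucket count is just one reasonable way to pick a bucket:

```python
import hashlib

NUM_BUCKETS = 1024   # assumed number of regions, as in the example in this thread
PREFIX_WIDTH = 4     # zero-padded decimal prefix: "0000" .. "1023"


def salted_key(row_key, num_buckets=NUM_BUCKETS):
    """Prepend a fixed-width bucket prefix derived from an MD5 hash of the key."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    # Use the first 4 bytes of the digest as an unsigned integer, then fold
    # it into the bucket range. The same key always maps to the same bucket.
    bucket = int.from_bytes(digest[:4], "big") % num_buckets
    return f"{bucket:0{PREFIX_WIDTH}d}{row_key}"


def split_points(num_buckets=NUM_BUCKETS):
    """Split keys for presplitting: one boundary per bucket except the first.

    The first region starts at the empty key, so num_buckets - 1 split
    points give num_buckets regions.
    """
    return [f"{b:0{PREFIX_WIDTH}d}" for b in range(1, num_buckets)]


if __name__ == "__main__":
    print(salted_key("user-12345"))
    points = split_points()
    print(len(points), points[0], points[-1])
```

The list returned by split_points could be passed as the split keys when creating the table. To read a row back, the client recomputes the same prefix from the original key, which is why the hash function and bucket count cannot be changed later without migrating the table.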
On Fri, Dec 2, 2016 at 7:18 PM, Saad Mufti <[email protected]> wrote:

> Forgot to mention in above example you would presplit into 1024 regions,
> starting from "0000" to "1023" (start keys).
>
> Cheers.
>
> ----
> Saad
>
>
> On Fri, Dec 2, 2016 at 8:47 AM, Saad Mufti <[email protected]> wrote:
>
> > One way to do this without knowing your data (still need some idea of size
> > of keyspace) is to prepend a fixed numeric prefix from a suitable range
> > based on a good hash like MD5. For example, let us say you can predict your
> > data will fit in about 1024 regions. You can decide to prepend a prefix
> > from 0000 to 1024 to all you keys based on a suitable hash.
> >
> > The pros:
> >
> > 1. you get to pre-split without knowing your keyspace
> > 2. very hard if not impossible for unknown data providers to send you data
> > in some order that generates hotspots (unless of course the same key is
> > repeated over and over, still have to watch out for that)
> >
> > The cons:
> >
> > 1. lose the ability to do scan in "natural" sorted order of your keyspace
> > as that order is not preserved anymore in HBase
> > 2. if you miscalculate your keyspace size by a lot, you are stuck with the
> > hash function and range you selected even if you later get more regions
> > unless you're willing to do complete migration to a new table
> >
> > Hope above helps.
> >
> > ----
> > Saad
> >
> >
> > On Tue, Nov 29, 2016 at 4:28 AM, Sachin Jain <[email protected]> wrote:
> >
> >> Thanks Dave for your suggestions!
> >> Will let you know if I find some approach to tackle this situation.
> >>
> >> Regards
> >>
> >> On Mon, Nov 28, 2016 at 9:05 PM, Dave Latham <[email protected]> wrote:
> >>
> >> > If you truly have no way to predict anything about the distribution of
> >> > your data across the row key space, then you are correct that there is
> >> > no way to presplit your regions in an effective way. Either you need to
> >> > make some starting guess, such as a small number of uniform splits, or
> >> > wait until you have some information about what the data will look like.
> >> >
> >> > Dave
> >> >
> >> > On Mon, Nov 28, 2016 at 12:42 AM, Sachin Jain <[email protected]>
> >> > wrote:
> >> >
> >> > > Hi,
> >> > >
> >> > > I was going though pre-splitting a table article [0] and it is
> >> > > mentioned that it is generally best practice to presplit your table.
> >> > > But don't we need to know the data in advance in order to presplit it.
> >> > >
> >> > > Question: What should be the best practice when we don't know what
> >> > > data is going to be inserted into HBase. Essentially I don't know the
> >> > > key range so if I specify wrong splits, then either first or last
> >> > > split can be a hot region in my system.
> >> > >
> >> > > [0]: https://hbase.apache.org/book.html#rowkey.regionsplits
> >> > >
> >> > > Thanks
> >> > > -Sachin
