Forgot to mention: in the above example you would presplit into 1024 regions, with start keys running from "0000" to "1023".
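
A minimal sketch of the prefixing scheme, written in Python for brevity rather than against the HBase Java client. The helper names (`bucket_for`, `salted_key`, `split_points`) are hypothetical, and the bucket count of 1024 and 4-digit width are just the values from the example. One assumption worth stating: HBase creates one more region than the number of split keys it is given, because the first region has an empty start key, so the sketch emits 1023 split points ("0001" to "1023") and the first region covers the "0000" prefix.

```python
import hashlib

NUM_BUCKETS = 1024   # assumed number of regions, taken from the example
PREFIX_WIDTH = 4     # prefixes are zero-padded: "0000" .. "1023"


def bucket_for(key: bytes) -> int:
    """Map a row key to a bucket in [0, NUM_BUCKETS) using MD5."""
    digest = hashlib.md5(key).digest()
    # Use the first 4 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:4], "big") % NUM_BUCKETS


def salted_key(key: bytes) -> bytes:
    """Prepend the zero-padded bucket number to the original row key."""
    prefix = f"{bucket_for(key):0{PREFIX_WIDTH}d}".encode("ascii")
    return prefix + key


def split_points() -> list:
    """Split keys for pre-splitting: one per bucket boundary, 1..NUM_BUCKETS-1."""
    return [f"{b:0{PREFIX_WIDTH}d}".encode("ascii") for b in range(1, NUM_BUCKETS)]


if __name__ == "__main__":
    print(salted_key(b"user42"))   # 4-digit prefix followed by the original key
    print(len(split_points()))     # number of split keys handed to table creation
```

Because the prefix is derived from the key itself, a point read can recompute it and go straight to the right region. A range scan over the original keyspace cannot, and would have to be issued once per prefix and merged on the client.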
Cheers.

----
Saad

On Fri, Dec 2, 2016 at 8:47 AM, Saad Mufti <saad.mu...@gmail.com> wrote:

> One way to do this without knowing your data (you still need some idea of
> the size of the keyspace) is to prepend a fixed numeric prefix from a
> suitable range, based on a good hash like MD5. For example, let us say you
> can predict your data will fit in about 1024 regions. You can decide to
> prepend a prefix from 0000 to 1023 to all your keys based on a suitable hash.
>
> The pros:
>
> 1. you get to pre-split without knowing your keyspace
> 2. it is very hard, if not impossible, for unknown data providers to send
>    you data in some order that generates hotspots (unless of course the same
>    key is repeated over and over; you still have to watch out for that)
>
> The cons:
>
> 1. you lose the ability to scan in the "natural" sorted order of your
>    keyspace, as that order is no longer preserved in HBase
> 2. if you miscalculate your keyspace size by a lot, you are stuck with the
>    hash function and range you selected, even if you later get more regions,
>    unless you're willing to do a complete migration to a new table
>
> Hope the above helps.
>
> ----
> Saad
>
>
> On Tue, Nov 29, 2016 at 4:28 AM, Sachin Jain <sachinjain...@gmail.com>
> wrote:
>
>> Thanks Dave for your suggestions!
>> Will let you know if I find some approach to tackle this situation.
>>
>> Regards
>>
>> On Mon, Nov 28, 2016 at 9:05 PM, Dave Latham <lat...@davelink.net> wrote:
>>
>> > If you truly have no way to predict anything about the distribution of
>> > your data across the row key space, then you are correct that there is no
>> > way to presplit your regions in an effective way. Either you need to make
>> > some starting guess, such as a small number of uniform splits, or wait
>> > until you have some information about what the data will look like.
>> >
>> > Dave
>> >
>> > On Mon, Nov 28, 2016 at 12:42 AM, Sachin Jain <sachinjain...@gmail.com>
>> > wrote:
>> >
>> > > Hi,
>> > >
>> > > I was going through the article on pre-splitting a table [0], and it
>> > > mentions that it is generally best practice to presplit your table. But
>> > > don't we need to know the data in advance in order to presplit it?
>> > >
>> > > Question: What is the best practice when we don't know what data is
>> > > going to be inserted into HBase? Essentially I don't know the key range,
>> > > so if I specify the wrong splits, then either the first or the last
>> > > split can become a hot region in my system.
>> > >
>> > > [0]: https://hbase.apache.org/book.html#rowkey.regionsplits
>> > >
>> > > Thanks
>> > > -Sachin
>> > >
>> >
>> >
>