Hi, I am just reading about region splitting. By default, as I understand it, HBase handles region splitting itself. What I can't picture is which key it chooses to split a region on.
1) For example, if I use MD5 hashes as row keys, they are most probably evenly distributed from 000000... to FFFFF..., right? When HBase starts with one region, all the writes go into that region. When the HFile gets too big, does HBase just take, for example, the median of the stored keys and split the region there?

2) I want to bulk load tons of data with put operations from the HBase Java client API, and I want it to perform well. My keys are sequential numeric values, which, as I learned from this post, I should not load into HBase in sequential order, because the HBase tables are going to be sad: http://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-increasing-values-are-bad/
So I thought I would pre-split the table into regions and load the data in randomized order. That way I should get a good distribution of network IO across the region servers from the beginning. Is that a good idea?

3) What if my row keys are not evenly distributed in the keyspace, but show some peaks or bursts? E.g. the keys range over 000-999, but most of them gather around the values 020 and 060. Is it a good idea to put the pre-split points at those peaks?

Thanks in advance,
Pal
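P.S. To make the pre-splitting idea from question 2 more concrete, here is a rough sketch of how I imagine computing evenly spaced split points for row keys that are hex-encoded MD5 hashes. It is plain Java with no HBase dependency. The class and method names (SplitPoints, hexSplitPoints) are made up for illustration, and I am assuming lowercase hex keys of a fixed width. As far as I can tell, the Java client has a createTable overload that takes the split keys as a byte[][], so these strings would be converted to bytes and passed there, but I have not verified that part.

```java
import java.math.BigInteger;

public class SplitPoints {

    /**
     * Returns numRegions - 1 boundaries that cut the keyspace
     * [00...0, ff...f] of fixed-width lowercase hex strings into
     * numRegions ranges of (almost) equal size.
     * Assumption: row keys are lowercase hex strings of hexDigits characters.
     */
    public static String[] hexSplitPoints(int numRegions, int hexDigits) {
        if (numRegions < 2 || hexDigits < 1) {
            throw new IllegalArgumentException("need at least 2 regions and 1 hex digit");
        }
        // Size of the keyspace: 16^hexDigits = 2^(4 * hexDigits).
        BigInteger keyspace = BigInteger.ONE.shiftLeft(4 * hexDigits);
        BigInteger regions = BigInteger.valueOf(numRegions);

        String[] splits = new String[numRegions - 1];
        for (int i = 1; i < numRegions; i++) {
            // i-th boundary = keyspace * i / numRegions (integer division).
            BigInteger boundary = keyspace.multiply(BigInteger.valueOf(i)).divide(regions);
            // Zero-pad so that string order matches numeric order.
            splits[i - 1] = String.format("%0" + hexDigits + "x", boundary);
        }
        return splits;
    }

    public static void main(String[] args) {
        // 4 regions over 32 hex digits (the length of an MD5 hex string).
        for (String split : hexSplitPoints(4, 32)) {
            System.out.println(split);
        }
    }
}
```

For 4 regions this prints three boundaries, starting with 4, 8 and c and followed by 31 zeros each. For the skewed case in question 3, I suppose the boundaries would have to come from a sample of the real keys instead of from simple arithmetic like this.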
