The answer to your first question is yes: the midkey of the key range would be chosen as the split key.
For #2, can you tell us how you plan to randomize the loading? Bulk load normally means preparing HFiles which would be loaded directly into your table.

Cheers

On Apr 20, 2013, at 1:11 PM, Pal Konyves <[email protected]> wrote:

> Hi Ted,
> Only one family, my data is very simple key-value, although I want to do
> sequential scans, so making a hash of the key is not an option.
>
> On Sat, Apr 20, 2013 at 10:07 PM, Ted Yu <[email protected]> wrote:
>
>> How many column families do you have?
>>
>> For #3, pre-splitting the table at the row keys corresponding to peaks
>> makes sense.
>>
>> On Apr 20, 2013, at 10:52 AM, Pal Konyves <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I am just reading about region splitting. By default, as I understand,
>>> HBase handles splitting the regions. I just don't know how to imagine on
>>> which key it splits the regions.
>>>
>>> 1) For example, when I write MD5 hashes of rowkeys, they are most probably
>>> evenly distributed from 000000... to FFFFF..., right? When HBase starts
>>> with one region, all the writes go into that region, and when the HFile
>>> gets too big, it just takes, for example, the median value of the stored
>>> keys and splits the region by this?
>>>
>>> 2) I want to bulk load tons of data with the HBase Java client API put
>>> operations. I want it to perform well. My keys are numeric sequential
>>> values (which I know from this post I cannot load into HBase sequentially,
>>> because the HBase tables are going to be sad:
>>> http://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-increasing-values-are-bad/
>>> ). So I thought I would pre-split the table into regions, and load the
>>> data randomized. This way I will get good distribution among region
>>> servers in terms of network IO from the beginning. Is that a good idea?
>>>
>>> 3) If my rowkeys are not evenly distributed in the keyspace, but they show
>>> some peaks or bursts, e.g. 000-999, but most of the keys gather around
>>> the 020 and 060 values, is it a good idea to have the pre-region splits
>>> at those peaks?
>>>
>>> Thanks in advance,
>>> Pal
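
To make the pre-splitting idea from #3 concrete, here is a minimal sketch of one way to pick split keys from a sample of row keys. It uses sample quantiles, so dense key ranges (such as the peaks around 020 and 060) end up with more, narrower regions. This is an illustrative assumption rather than something prescribed in the thread, and the function names (`quantile_split_keys`, `region_index`, `rows_per_region`) and the synthetic sample are hypothetical. It does not talk to HBase; it only computes candidate keys.

```python
from bisect import bisect_right
from collections import Counter


def quantile_split_keys(sample_keys, num_regions):
    """Pick up to num_regions - 1 split keys at the quantiles of a key sample."""
    ordered = sorted(sample_keys)
    if num_regions < 2 or not ordered:
        return []
    splits = []
    for i in range(1, num_regions):
        # Key sitting at the i-th quantile boundary of the sorted sample.
        key = ordered[i * len(ordered) // num_regions]
        # Split keys must be strictly increasing, and a split at the smallest
        # key would leave an empty first region, so skip those candidates.
        lower_bound = splits[-1] if splits else ordered[0]
        if key > lower_bound:
            splits.append(key)
    return splits


def region_index(row_key, split_keys):
    """Index of the region holding row_key; a region covers [start, end)."""
    return bisect_right(split_keys, row_key)


def rows_per_region(keys, split_keys):
    """Count how many of the given keys fall into each region."""
    counts = Counter(region_index(k, split_keys) for k in keys)
    return [counts.get(i, 0) for i in range(len(split_keys) + 1)]


if __name__ == "__main__":
    # Hypothetical sample: keys 000-999 with heavy clusters near 020 and 060.
    sample = (["%03d" % k for k in range(15, 26)] * 30
              + ["%03d" % k for k in range(55, 66)] * 30
              + ["%03d" % k for k in range(0, 1000, 5)])
    splits = quantile_split_keys(sample, 8)
    print("split keys:", splits)
    print("rows per region:", rows_per_region(sample, splits))
```

Keys computed this way could then be supplied as the split keys when the table is created (for example through the `splitKeys` argument of `createTable` in the Java client). If the data is written with puts rather than as HFiles, shuffling the write order, for instance with `random.shuffle`, would spread the writes across those regions from the start.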
