I am writing a paper for school about HBase, so the data I chose is not a real-world example. I am familiar with GTFS, which is a de facto standard for storing public transportation schedules: when a vehicle arrives at a stop and where it is headed.
I chose to generate the rows on the fly, where each row represents the sequence of 'bus' stops that make up a route from the first stop to the last stop, e.g. [first_stop_id,last_stop_id],string_sequence_of_stops, where the part within [...] is the rowkey. So, long story short, I generate the data. I want to use the HBase Java client API to store the rows with Put. I plan to randomize the load by picking random first_stop_id-s and by using multiple threads. The rowkeys will still come in runs, because the generator outputs about 100-1000 rows that share the same first_stop_id within the rowkey. There will be billions of rows in total, taking up about 1 TB. (Two sketches of what I have in mind follow at the bottom of this mail.)

On Sat, Apr 20, 2013 at 10:54 PM, Ted Yu <[email protected]> wrote:

> The answer to your first question is yes - the midkey of the key range
> would be chosen as the split key.
>
> For #2, can you tell us how you plan to randomize the loading?
> Bulk load normally means preparing HFiles which would be loaded directly
> into your table.
>
> Cheers
>
> On Apr 20, 2013, at 1:11 PM, Pal Konyves <[email protected]> wrote:
>
> > Hi Ted,
> > Only one family. My data is very simple key-value, although I want to
> > make sequential scans, so making a hash of the key is not an option.
> >
> > On Sat, Apr 20, 2013 at 10:07 PM, Ted Yu <[email protected]> wrote:
> >
> >> How many column families do you have?
> >>
> >> For #3, pre-splitting the table at the row keys corresponding to the
> >> peaks makes sense.
> >>
> >> On Apr 20, 2013, at 10:52 AM, Pal Konyves <[email protected]> wrote:
> >>
> >>> Hi,
> >>>
> >>> I am just reading about region splitting. By default - as I
> >>> understand - HBase handles splitting the regions. I just don't know
> >>> how to imagine on which key it splits a region.
> >>>
> >>> 1) For example, when I write MD5 hashes of the rowkeys, they are most
> >>> probably evenly distributed from 000000... to FFFFF..., right? When
> >>> HBase starts with one region, all the writes go into that region, and
> >>> when the HFile gets too big, does it just take, for example, the
> >>> median value of the stored keys and split the region there?
> >>>
> >>> 2) I want to bulk load tons of data with the HBase Java client API
> >>> put operations, and I want it to perform well. My keys are numeric
> >>> sequential values (which, I know from this post, I cannot load into
> >>> HBase sequentially, because the HBase tables are going to be sad:
> >>> http://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-increasing-values-are-bad/
> >>> ). So I thought I would pre-split the table into regions and load the
> >>> data in randomized order. This way I would get a good distribution
> >>> across the region servers in terms of network IO from the beginning.
> >>> Is that a good idea?
> >>>
> >>> 3) If my rowkeys are not evenly distributed in the keyspace, but show
> >>> peaks or bursts - e.g. they run from 000 to 999, but most of the keys
> >>> gather around the 020 and 060 values - is it a good idea to place the
> >>> pre-split points at those peaks?
> >>>
> >>> Thanks in advance,
> >>> Pal
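Re #2 and #3, to make it concrete, this is roughly how I plan to create
the pre-split table. It is only a minimal sketch against the 0.94 client
API; the table name 'routes', the single family 'd' and the split points
are invented for the example:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor("routes"); // invented table name
    desc.addFamily(new HColumnDescriptor("d"));             // the single column family

    // Split points clustered around the hot spots of the keyspace
    // (the 020 and 060 peaks from question #3), so the dense key runs
    // are spread over several regions from the start.
    byte[][] splitKeys = new byte[][] {
      Bytes.toBytes("010"),
      Bytes.toBytes("020"),
      Bytes.toBytes("030"),
      Bytes.toBytes("050"),
      Bytes.toBytes("060"),
      Bytes.toBytes("070"),
    };
    // Resulting regions: [start,010), [010,020), ..., [070,end)
    admin.createTable(desc, splitKeys);
    admin.close();
  }
}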
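And this is the randomized, multi-threaded loader I had in mind. Again
just a sketch: Route, generateRoutes() and the numbers are stand-ins for
my real generator. Each thread opens its own HTable because HTable is not
thread-safe, and autoflush is off so the puts are batched in the
client-side write buffer:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class RandomizedLoader {
  static final byte[] FAMILY = Bytes.toBytes("d");       // the single column family
  static final byte[] QUALIFIER = Bytes.toBytes("stops");

  // Stand-in for one generated row: rowkey parts plus the stop sequence.
  static class Route {
    final int firstStopId, lastStopId;
    final String stopSequence;
    Route(int f, int l, String seq) { firstStopId = f; lastStopId = l; stopSequence = seq; }
  }

  // Stand-in generator: the real one emits the 100-1000 routes that
  // share the given first_stop_id.
  static List<Route> generateRoutes(int firstStopId) {
    List<Route> routes = new ArrayList<Route>();
    routes.add(new Route(firstStopId, firstStopId + 1, "stopA,stopB"));
    return routes;
  }

  public static void main(String[] args) throws Exception {
    // Shuffle the first_stop_ids up front: this is what breaks up the
    // sequential key runs, so the threads write to different regions.
    final List<Integer> firstStopIds = new ArrayList<Integer>();
    for (int id = 0; id < 100000; id++) firstStopIds.add(id);
    Collections.shuffle(firstStopIds);

    final int nThreads = 8;
    ExecutorService pool = Executors.newFixedThreadPool(nThreads);
    for (int t = 0; t < nThreads; t++) {
      final int offset = t;
      pool.submit(new Runnable() {
        public void run() {
          try {
            Configuration conf = HBaseConfiguration.create();
            // One HTable per thread; HTable instances are not thread-safe.
            HTable table = new HTable(conf, "routes");
            table.setAutoFlush(false);                 // buffer puts client-side
            table.setWriteBufferSize(8 * 1024 * 1024);
            for (int i = offset; i < firstStopIds.size(); i += nThreads) {
              for (Route r : generateRoutes(firstStopIds.get(i))) {
                byte[] rowkey = Bytes.toBytes(r.firstStopId + "," + r.lastStopId);
                Put put = new Put(rowkey);
                put.add(FAMILY, QUALIFIER, Bytes.toBytes(r.stopSequence));
                table.put(put);
              }
            }
            table.flushCommits(); // push whatever is left in the buffer
            table.close();
          } catch (IOException e) {
            e.printStackTrace();
          }
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.DAYS);
  }
}

The rows sharing one first_stop_id still go out as a small sequential
burst, but at any moment the threads are working on unrelated prefixes,
which should spread the load over the pre-split regions.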
