Hi, no, the time is not important, only the stops are.
On Sun, Apr 21, 2013 at 3:34 AM, Ted Yu <[email protected]> wrote:

> Thanks for sharing the information below.
>
> How do you plan to store time (when the bus gets to each stop) in the
> row? Or maybe it is not of importance to you?
>
> On Sat, Apr 20, 2013 at 2:24 PM, Pal Konyves <[email protected]> wrote:
>
> > I am writing a paper for school about HBase, so the data I chose is
> > not a real usable example. I am familiar with GTFS, which is a de
> > facto standard for storing information about public transportation
> > schedules: when a vehicle arrives at a stop and where it goes next.
> >
> > I chose to generate the rows on the fly, where each row represents a
> > sequence of 'bus' stops that make up a route from the first stop to
> > the last stop, e.g.
> > [first_stop_id,last_stop_id],string_sequence_of_stops
> > where the part within the [...] is the rowkey.
> >
> > So long story short, I generate the data. I want to use the HBase
> > Java client API to store the rows with Put. I plan to randomize the
> > load by picking random first_stop_id-s and by using several threads.
> >
> > The rowkeys will still contain sequences, because the way I generate
> > the rows will output about 100-1000 rows starting with the same
> > first_stop_id within the rowkey. The total amount of rows will be in
> > the billions and would take up about 1 TB.
> >
> > On Sat, Apr 20, 2013 at 10:54 PM, Ted Yu <[email protected]> wrote:
> >
> > > The answer to your first question is yes - the midkey of the key
> > > range would be chosen as the split key.
> > >
> > > For #2, can you tell us how you plan to randomize the loading?
> > > Bulk load normally means preparing HFiles which would be loaded
> > > directly into your table.
> > >
> > > Cheers
> > >
> > > On Apr 20, 2013, at 1:11 PM, Pal Konyves <[email protected]> wrote:
> > >
> > > > Hi Ted,
> > > > Only one family. My data is a very simple key-value layout, but I
> > > > want to make sequential scans, so hashing the key is not an
> > > > option.
> > > >
> > > > On Sat, Apr 20, 2013 at 10:07 PM, Ted Yu <[email protected]> wrote:
> > > >
> > > > > How many column families do you have?
> > > > >
> > > > > For #3, pre-splitting the table at the row keys corresponding
> > > > > to the peaks makes sense.
> > > > >
> > > > > On Apr 20, 2013, at 10:52 AM, Pal Konyves <[email protected]> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I am just reading about region splitting. By default - as I
> > > > > > understand - HBase handles splitting the regions itself. I
> > > > > > just don't know how to imagine on which key it splits a
> > > > > > region.
> > > > > >
> > > > > > 1) For example, when I write MD5 hashes of rowkeys, they are
> > > > > > most probably evenly distributed from 000000... to FFFFF...,
> > > > > > right? When HBase starts with one region, all the writes go
> > > > > > into that region, and when the HFile gets too big, does HBase
> > > > > > just take, for example, the median value of the stored keys
> > > > > > and split the region by that?
> > > > > >
> > > > > > 2) I want to bulk load tons of data with the HBase Java
> > > > > > client API put operations. I want it to perform well.
> > > > > > My keys are numeric sequential values (which I know from
> > > > > > this post I cannot load into HBase sequentially, because the
> > > > > > HBase tables are going to be sad:
> > > > > > http://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-increasing-values-are-bad/
> > > > > > ). So I thought I would pre-split the table into regions and
> > > > > > load the data in randomized order. This way I will get a good
> > > > > > distribution among the region servers in terms of network IO
> > > > > > from the beginning. Is that a good idea?
> > > > > >
> > > > > > 3) If my rowkeys are not evenly distributed in the keyspace
> > > > > > but show some peaks or bursts - e.g. the keyspace is 000-999,
> > > > > > but most of the keys gather around the 020 and 060 values -
> > > > > > is it a good idea to put the pre-split points at those peaks?
> > > > > >
> > > > > > Thanks in advance,
> > > > > > Pal
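
To make #2 and #3 concrete, here is a minimal sketch of creating a
pre-split table with the 0.94-era HBase Java client API. The table
name, family name, and exact split points are illustrative assumptions,
with the splits clustered around the hypothetical 020 and 060 peaks:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor("routes"); // hypothetical table name
    desc.addFamily(new HColumnDescriptor("d"));             // the single family mentioned above

    // Finer-grained split points around the expected hot ranges (020
    // and 060), coarser elsewhere; each value starts a new region.
    byte[][] splits = new byte[][] {
        Bytes.toBytes("015"), Bytes.toBytes("020"), Bytes.toBytes("025"),
        Bytes.toBytes("055"), Bytes.toBytes("060"), Bytes.toBytes("065"),
        Bytes.toBytes("500"),
    };
    admin.createTable(desc, splits);
    admin.close();
  }
}

For evenly distributed keys (the MD5 case in #1), the RegionSplitter
utility's split algorithms would be the usual alternative; explicit
split points like the above fit the skewed keyspace described in #3.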

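And a rough sketch of the randomized, multi-threaded Put loading
described in the thread, under the same assumptions (table 'routes',
single family 'd'; the qualifier name, thread count, row counts, and
the stop-sequence generator are placeholders). Turning auto-flush off
lets the client buffer Puts into larger round trips, which matters at
this volume:

import java.util.Random;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class RandomizedLoader {
  static final byte[] FAMILY = Bytes.toBytes("d");
  static final byte[] QUALIFIER = Bytes.toBytes("stops"); // hypothetical qualifier

  public static void main(String[] args) throws Exception {
    final Configuration conf = HBaseConfiguration.create();
    int threads = 8; // arbitrary
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    for (int t = 0; t < threads; t++) {
      pool.submit(new Runnable() {
        public void run() {
          try {
            // One HTable per thread: HTable instances are not thread-safe.
            HTable table = new HTable(conf, "routes");
            table.setAutoFlush(false); // batch Puts in the client write buffer
            Random rnd = new Random();
            for (int i = 0; i < 1000000; i++) {
              // Random first_stop_id scatters writes across the pre-split regions.
              String first = String.format("%03d", rnd.nextInt(1000));
              String last = String.format("%03d", rnd.nextInt(1000));
              Put put = new Put(Bytes.toBytes("[" + first + "," + last + "]"));
              put.add(FAMILY, QUALIFIER,
                  Bytes.toBytes(stopSequence(first, last)));
              table.put(put);
            }
            table.close(); // flushes any Puts left in the buffer
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      });
    }
    pool.shutdown();
  }

  // Stand-in for the on-the-fly route generation described above.
  static String stopSequence(String first, String last) {
    return first + "->" + last;
  }
}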