Hi Otis,

This approach looks much better than the uuid approach. I will definitely try
it out. Thanks for the contribution. (Sketches of the bucketing, filtering,
and pre-splitting ideas discussed in the thread follow below the quoted
messages.)
Sam

----- Original Message ----
From: Otis Gospodnetic <[email protected]>
To: [email protected]
Sent: Wed, June 8, 2011 6:02:16 PM
Subject: Re: hbase hashing algorithm and schema design

Sam, would HBaseWD help you here? See
http://search-hadoop.com/m/AQ7CG2GkiO/hbasewd&subj=+ANN+HBaseWD+Distribute+Sequential+Writes+in+HBase

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Hadoop - HBase
Hadoop ecosystem search :: http://search-hadoop.com/

----- Original Message ----
> From: Sam Seigal <[email protected]>
> To: [email protected]
> Cc: [email protected]; [email protected]
> Sent: Wed, June 8, 2011 4:54:24 PM
> Subject: Re: hbase hashing algorithm and schema design
>
> On Wed, Jun 8, 2011 at 12:40 AM, tsuna <[email protected]> wrote:
>
> > On Tue, Jun 7, 2011 at 7:56 PM, Kjew Jned <[email protected]> wrote:
> > > I was studying the OpenTSDB example, where they also prefix the row
> > > keys with event id.
> > >
> > > I further modified my row keys to have this ->
> > >
> > > <eventid> <uuid> <yyyy-mm-dd>
> > >
> > > The uuid is fairly unique and random.
> > > Does appending a uuid to the event id help the distribution?
> >
> > Yes, it will help the distribution, but it will also make certain query
> > patterns harder. You can no longer scan a time range for a given
> > eventid. How to solve this problem depends on how you generate the
> > UUIDs.
> >
> > I wouldn't recommend doing this unless you've already tried simpler
> > approaches and reached the conclusion that they don't work. Many
> > people seem to be afraid of creating hot spots in their tables without
> > having first-hand evidence that the hot spots would actually be a
> > problem.
>
> Can I not use regex row filters to query for date ranges? There is added
> overhead for the client to order the results, and it is not an efficient
> query, but it is certainly possible to do, isn't it? Am I wrong?
>
> > > Let us say I have 4 region servers to start off with and I start the
> >
> > If you have only 4 region servers, your goal should be to have roughly
> > 25% of writes going to each server. It doesn't matter whether the 25%
> > slice of one server is going to a single region or not. As long as
> > all the writes don't go to the same row (which would cause lock
> > contention on that row), you'll get the same kind of performance.
>
> I am worried about the following scenario, hence putting a lot of thought
> into how to design this schema.
>
> For example, for simplicity's sake, say I only have two event ids, A and
> B, and the traffic is equally distributed between them, i.e. 50% of my
> traffic is event A and 50% is event B. I have two region servers running
> on two physical nodes, with the following schema:
>
> <eventid> <timestamp>
>
> Ideally, I now have all of A's traffic going into region server A and all
> of B's traffic going into region server B. The cluster is able to hold
> this traffic, and the write load is distributed 50-50.
>
> However, I now reach a point where I need to scale, since the two region
> servers are no longer able to cope with the write traffic. Adding extra
> region servers to the cluster is not going to make any difference, since
> only the physical machine holding the tail end of each region will
> receive the traffic. Most of the rest of my cluster is going to be idle.
> To generalize: if I want to scale to where the # of machines is greater
> than the # of unique event ids, I have no way to distribute the load
> efficiently, since I cannot distribute the load of a single event id
> across multiple machines (without adding a uuid somewhere in the middle
> and sacrificing data locality on ordered timestamps).
>
> Do you think my concern is genuine? Thanks a lot for your help.
>
> > > workload, how does HBase decide how many regions it is going to
> > > create, and what key is going to go into what region?
> >
> > Your table starts as a single region. As this region fills up, it'll
> > split. Where it splits is chosen by HBase. HBase tries to split the
> > region "in the middle", so that roughly half of the keys end up in
> > each new daughter region.
> >
> > You can also manually pre-split a table (from the shell). This can be
> > advantageous in certain situations where you know what your table
> > will look like and you have a very high write volume coupled with
> > aggressive latency requirements for the >95th percentile.
> >
> > > I could have gone with something like
> > >
> > > <uuid><eventid><yyyy-mm-dd>, but would not like to, since my queries
> > > are always going to be against a particular event id type, and I
> > > would like them to be spatially co-located.
> >
> > If you have a lot of data per <eventid>, then putting the <uuid> in
> > between the <eventid> and the <yyyy-mm-dd> will screw up data locality
> > anyway. But the exact details depend on how you pick the <uuid>.
> >
> > --
> > Benoit "tsuna" Sigoure
> > Software Engineer @ www.StumbleUpon.com
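
For reference, a minimal sketch of the bucket-prefix ("salting") idea behind
HBaseWD, written against the plain HBase Java client. The bucket count, the
<bucket><eventid><timestamp> key layout, and every name below are
illustrative assumptions, not HBaseWD's actual API. The leading bucket byte
spreads sequential writes for one event id across several key ranges, and a
time-range read fans out into one small scan per bucket:

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class SaltedEventKeys {
    // Illustrative bucket count; pick a small multiple of the number
    // of region servers.
    static final int BUCKETS = 16;

    // Key layout: <bucket><eventid><timestamp>. The bucket byte is
    // derived deterministically from the rest of the key, so writes
    // for one event id spread over BUCKETS distinct key ranges.
    static byte[] rowKey(String eventId, long timestamp) {
        int bucket = ((eventId.hashCode() ^ (int) timestamp) & 0x7fffffff)
                % BUCKETS;
        return Bytes.add(new byte[] { (byte) bucket },
                Bytes.toBytes(eventId), Bytes.toBytes(timestamp));
    }

    // A time-range query for one event id becomes BUCKETS small scans
    // (stop row exclusive), merged client-side.
    static List<Result> timeRange(HTable table, String eventId,
            long startTs, long stopTs) throws Exception {
        List<Result> out = new ArrayList<Result>();
        for (int b = 0; b < BUCKETS; b++) {
            byte[] start = Bytes.add(new byte[] { (byte) b },
                    Bytes.toBytes(eventId), Bytes.toBytes(startTs));
            byte[] stop = Bytes.add(new byte[] { (byte) b },
                    Bytes.toBytes(eventId), Bytes.toBytes(stopTs));
            ResultScanner scanner = table.getScanner(new Scan(start, stop));
            try {
                for (Result r : scanner) {
                    out.add(r);
                }
            } finally {
                scanner.close();
            }
        }
        return out;
    }
}

Results come back time-ordered within each bucket but not across buckets, so
the client still has to merge-sort them; that is the cost of spreading the
sequential writes.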
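On the regex-row-filter question from the thread: it is possible, but a
RowFilter is evaluated against every row on the server, so it is a full-table
scan rather than a ranged read, and it gets slower as the table grows. A
sketch, assuming keys laid out as <eventid>|<uuid>|<yyyy-mm-dd> with a
literal '|' separator (the separator and names are made up), using the older
filter API:

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.RegexStringComparator;
import org.apache.hadoop.hbase.filter.RowFilter;

public class RegexDateScan {
    // Rows for event id "A" in June 2011. The filter runs server-side,
    // but every row in the table is still read and tested, so the cost
    // scales with table size, not with result size.
    static Scan juneScanForEventA() {
        Scan scan = new Scan();
        scan.setFilter(new RowFilter(
                CompareFilter.CompareOp.EQUAL,
                new RegexStringComparator("^A\\|[^|]+\\|2011-06-\\d{2}$")));
        return scan;
    }
}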
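Finally, a sketch of the manual pre-splitting mentioned above, done here
through the Java admin API of that era rather than the shell, assuming the
<eventid><timestamp> layout with event ids A, B, and C (the table name
"events" and column family "d" are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitEvents {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor desc = new HTableDescriptor("events");
        desc.addFamily(new HColumnDescriptor("d"));

        // Three regions from day one, one per event id:
        // (-inf, "B"), ["B", "C"), ["C", +inf).
        admin.createTable(desc, new byte[][] {
                Bytes.toBytes("B"),
                Bytes.toBytes("C"),
        });
    }
}

With split points at B and C, each event id gets its own region from the
first write, instead of waiting for HBase to split a single hot region.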
