These are both good suggestions -- thanks! Varun's suggestion will work if I can do all of my lookups at once. In fact, I wouldn't even need a reduce phase; I could just prepend all my keys with hashes, add them to some kind of sorted array, then pop them off one-by-one. Not sure if I can do all my lookups at once, though, for reasons I left out of my example for the sake of brevity.
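For anyone following along, the prepend-a-hash-then-sort idea can be sketched in a few lines of Python. This is only an illustration of the key-salting technique, not code from the thread — the 4-character md5 prefix and the `_` separator are my own assumptions:

```python
import hashlib

def salted_key(key, prefix_len=4):
    # Prepend a truncated hash of the key so that sort order (and hence
    # which region each lookup lands on) spreads out evenly instead of
    # every worker starting at the same point in the keyspace.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return digest[:prefix_len] + "_" + key

def original_key(salted):
    # Recover the real key by stripping the hash prefix. Splitting on
    # the first "_" only, so keys containing "_" survive intact.
    return salted.split("_", 1)[1]

words = ["apple", "banana", "cherry"]
queue = sorted(salted_key(w) for w in words)

# Pop the salted keys off one-by-one; successive lookups now walk a
# hash-distributed order rather than the natural key order.
while queue:
    w = original_key(queue.pop(0))
    # ... issue the lookup for `w` here ...
```

The same trick is what Varun and Mike Segel are debating further down-thread: a truncated hash of the key itself (rather than an arbitrary salt) keeps the prefix deterministic, so you can always reconstruct it for point gets.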
Michael's suggestion will work if I can find a good way to partition the words such that they're evenly distributed. This may or may not be possible. In any case, I like both suggestions more than my original idea of using 4000 tables, so thank you both for the help!

--Jeremy

On Tue, Sep 24, 2013 at 10:16 PM, Michael Webster <[email protected]> wrote:

> The biggest issue I see with so many tables is the region counts could
> get quite large. With 4000 tables, you will need at least that many
> regions, not even accounting for splitting the regions/growth.
>
> Forgive the speculation, but it almost sounds like you want an inverted
> index. Could you not just store the word as the rowkey and have each
> position be a column? 4000 columns isn't really that many, and if this
> is text it will probably be pretty sparse, i.e. most words won't appear
> in all 4K positions. When you set up the scanner for your MR job, add
> the column for the position that you want.
>
> It seems that the downside of trying to push the reads into the reduce
> phase by prepending a hash/salt to each real rowkey is, assuming the
> hashes are well distributed, each reducer will get a random slice of the
> data and will have to issue individual gets for their rowkeys instead of
> taking advantage of HBase's scan performance.
>
>
> On Tue, Sep 24, 2013 at 8:57 PM, Michael Segel <[email protected]> wrote:
>
> > Since different people use different terms... Salting is BAD. (You
> > need to understand what is implied by the term salt.)
> >
> > What you really want to do is take the hash of the key, and then
> > truncate the hash. Use that instead of a salt.
> >
> > Much better than a salt.
> >
> > Sent from a remote device. Please excuse any typos...
> >
> > Mike Segel
> >
> > > On Sep 24, 2013, at 5:17 PM, "Varun Sharma" <[email protected]> wrote:
> > >
> > > It's better to do some "salting" in your keys for the reduce phase.
> > > Basically, make your key be something like "KeyHash + Key" and then
> > > decode it in your reducer and write to HBase. This way you avoid the
> > > hotspotting problem on HBase due to MapReduce sorting.
> > >
> > >
> > > On Tue, Sep 24, 2013 at 2:50 PM, Jean-Marc Spaggiari <[email protected]> wrote:
> > >
> > >> Hi Jeremy,
> > >>
> > >> I don't see any issue for HBase to handle 4000 tables. However, I
> > >> don't think it's the best solution for your use case.
> > >>
> > >> JM
> > >>
> > >>
> > >> 2013/9/24 jeremy p <[email protected]>
> > >>
> > >>> Short description: I'd like to have 4000 tables in my HBase
> > >>> cluster. Will this be a problem? In general, what problems do you
> > >>> run into when you try to host thousands of tables in a cluster?
> > >>>
> > >>> Long description: I'd like the performance advantage of pre-split
> > >>> tables, and I'd also like to do filtered range scans. Imagine a
> > >>> keyspace where the key consists of [POSITION]_[WORD], where
> > >>> POSITION is a number from 1 to 4000, and WORD is a string
> > >>> consisting of 96 characters. The value in the cell would be a
> > >>> single integer. My app will examine a 'document', where each
> > >>> 'line' consists of 4000 WORDs. For each WORD, it'll do a filtered
> > >>> regex lookup. Only problem? Say I have 200 mappers and they all
> > >>> start at POSITION 1 -- my region servers would get hotspotted like
> > >>> crazy. So my idea is to break it into 4000 tables (one for each
> > >>> POSITION), and then pre-split the tables such that each region
> > >>> gets an equal amount of the traffic. In this scenario, the key
> > >>> would just be WORD. Dunno if this is a bad idea; I'd be open to
> > >>> suggestions.
> > >>>
> > >>> Thanks!
> > >>>
> > >>> --J
>
> --
> *Michael Webster*, Software Engineer
> Bronto Software
> [email protected]
> bronto.com <http://www.bronto.com/>
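Michael's inverted-index layout is also easy to picture with plain dicts standing in for the table (a sketch only — the `pos:N` column-qualifier naming is my own, and a real scan would of course go through the HBase client rather than a dict):

```python
# rowkey = word, one column per position. Sparse: most words appear
# at only a few of the 4000 positions, so most columns are absent.
table = {
    "hbase":  {"pos:17": 3, "pos:921": 1},
    "region": {"pos:17": 2},
    "salt":   {"pos:3600": 5},
}

def scan_position(table, position):
    # Equivalent of adding a single column to the MR job's scanner:
    # only rows that actually have a value at `position` come back.
    col = "pos:%d" % position
    return {word: cols[col] for word, cols in table.items() if col in cols}

print(scan_position(table, 17))  # -> {'hbase': 3, 'region': 2}
```

The appeal over the 4000-tables scheme is that one table (and its regions) serves every position, and a scan restricted to a single column qualifier skips rows with no data there.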
