Re: Is there a problem with having 4000 tables in a cluster?

jeremy p Tue, 24 Sep 2013 15:53:25 -0700

Varun : I'm familiar with that method of salting.  However, in this case, I
need to do filtered range scans.  When I do a lookup for a given WORD at a
given POSITION, I'll actually be doing a regex on a range of WORDs at that
POSITION.  If I salt the keys with a hash, the WORDs will no longer be
sorted, and so I would need to do a full table scan for every lookup.
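
To make the sort-order point concrete, here is a minimal Python sketch. It is
not HBase client code: the sorted list merely stands in for HBase's
lexicographically ordered row keys. The zero-padded POSITION, the 4-character
MD5 salt, the helper names, and the short sample WORDs are all assumptions
made for illustration.

```python
import bisect
import hashlib


def plain_key(position: int, word: str) -> str:
    # [POSITION]_[WORD]; zero-padding POSITION is an assumption so that
    # string order matches numeric order.
    return f"{position:04d}_{word}"


def salted_key(position: int, word: str) -> str:
    # Hypothetical "KeyHash + Key" scheme: prefix the key with a short
    # hash of the key itself.
    key = plain_key(position, word)
    salt = hashlib.md5(key.encode("utf-8")).hexdigest()[:4]
    return f"{salt}_{key}"


def range_scan(sorted_keys, start, stop):
    # Return every key k with start <= k < stop, the way a scan with a
    # start row and a stop row walks a sorted key space.
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]


words = ["apple", "apply", "apricot", "banana", "cherry"]
plain = sorted(plain_key(1, w) for w in words)
salted = sorted(salted_key(1, w) for w in words)

# All WORDs starting with "ap" at POSITION 1 form one contiguous slice.
plain_hits = range_scan(plain, "0001_ap", "0001_aq")

# The same start/stop rows find nothing among the salted keys, because
# each key now begins with its hash rather than with POSITION and WORD.
salted_hits = range_scan(salted, "0001_ap", "0001_aq")
```

With unsalted keys the scan returns the three "ap" WORDs as one contiguous
range. With salted keys the matching rows are scattered across the key space,
so the only way to find them is to read every row and filter.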


Jean-Marc : What problems do you see with my solution?  Do you have a
better suggestion?

--Jeremy


On Tue, Sep 24, 2013 at 3:16 PM, Varun Sharma <[email protected]> wrote:

> It's better to do some "salting" in your keys for the reduce phase.
> Basically, make your key be something like "KeyHash + Key" and then decode it
> in your reducer and write to HBase. This way you avoid the hotspotting
> problem on HBase due to MapReduce sorting.
>
>
> On Tue, Sep 24, 2013 at 2:50 PM, Jean-Marc Spaggiari <[email protected]> wrote:
>
> > Hi Jeremy,
> >
> > I don't see any issue for HBase to handle 4000 tables. However, I don't
> > think it's the best solution for your use case.
> >
> > JM
> >
> >
> > 2013/9/24 jeremy p <[email protected]>
> >
> > > Short description : I'd like to have 4000 tables in my HBase cluster.
> > > Will this be a problem?  In general, what problems do you run into
> > > when you try to host thousands of tables in a cluster?
> > >
> > > Long description : I'd like the performance advantage of pre-split
> > > tables, and I'd also like to do filtered range scans.  Imagine a
> > > keyspace where the key consists of : [POSITION]_[WORD] , where POSITION
> > > is a number from 1 to 4000, and WORD is a string consisting of 96
> > > characters.  The value in the cell would be a single integer.  My app
> > > will examine a 'document', where each 'line' consists of 4000 WORDs.
> > > For each WORD, it'll do a filtered regex lookup.  Only problem?  Say I
> > > have 200 mappers and they all start at POSITION 1, my region servers
> > > would get hotspotted like crazy. So my idea is to break it into 4000
> > > tables (one for each POSITION), and then pre-split the tables such that
> > > each region gets an equal amount of the traffic.  In this scenario, the
> > > key would just be WORD.  Dunno if this is a bad idea, would be open to
> > > suggestions.
> > >
> > > Thanks!
> > >
> > > --J
> > >
> >
>
