So you salt the keys for the reduce phase, but you do not salt the keys in HBase. That means the reducers do not see the keys in sorted order, but they do still see all the values for a given key together.

The hash is essentially a trick that stays within MapReduce and never makes it into HBase. That keeps you from hotspotting a region, and since the hash is not written into HBase, you can still do your prefix/regex scans.
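Roughly, the pattern is a mapper that prepends a small salt to the row key and a reducer that strips it again before the Put. This is only a sketch of that idea, not your actual job: the class names, the 32-bucket salt, the "f"/"count" family and qualifier, and the assumed "POSITION_WORD<TAB>count" input format are all placeholders.

import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SaltedHBaseLoad {

  // Mapper: prefix the real row key with a small hash so the shuffle spreads
  // consecutive keys across reducers instead of feeding HBase one hot range
  // in strictly sorted order.
  public static class SaltingMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final int BUCKETS = 32; // hypothetical salt range

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      // Assumed input format: "<POSITION>_<WORD>\t<count>"
      String[] parts = line.toString().split("\t");
      String rowKey = parts[0];
      int salt = (rowKey.hashCode() & Integer.MAX_VALUE) % BUCKETS;
      // The salt exists only in the MapReduce key; it never reaches HBase.
      // Same row key -> same salt, so all its values still meet in one reducer.
      context.write(new Text(String.format("%02d_%s", salt, rowKey)),
                    new IntWritable(Integer.parseInt(parts[1])));
    }
  }

  // Reducer: strip the salt and write the original, unsalted row key to
  // HBase, so prefix/regex scans on POSITION_WORD keep working.
  public static class UnsaltingReducer
      extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {

    @Override
    protected void reduce(Text saltedKey, Iterable<IntWritable> values,
                          Context context)
        throws IOException, InterruptedException {
      String rowKey = saltedKey.toString().substring(3); // drop the "NN_" salt
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      Put put = new Put(Bytes.toBytes(rowKey));
      put.add(Bytes.toBytes("f"), Bytes.toBytes("count"), Bytes.toBytes(sum));
      context.write(new ImmutableBytesWritable(Bytes.toBytes(rowKey)), put);
    }
  }
}

You would wire it up with TableMapReduceUtil.initTableReducerJob() as usual; the only thing that changes versus a normal load job is the salt prefix that lives and dies inside the shuffle.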
On Tue, Sep 24, 2013 at 4:16 PM, Jean-Marc Spaggiari <[email protected]> wrote:

> If you have a fixed length like:
>
> XXXX_AAAAAAA
>
> Where XXXX is a number from 0000 to 4000 and AAAA is your word, then simply
> split by the number?
>
> Then when you insert each line, it will write to 4000 different regions,
> which can be hosted in 4000 different servers if you have that. And there
> will be no hot-spotting?
>
> Then when you run an MR job, you will have one mapper per region. Each
> region will be evenly loaded. And if you start 200 jobs, then each region
> will have 200 mappers, still evenly loaded, no?
>
> Or I might be missing something from your use case.
>
> JM
>
> 2013/9/24 jeremy p <[email protected]>
>
> > Varun: I'm familiar with that method of salting. However, in this case,
> > I need to do filtered range scans. When I do a lookup for a given WORD
> > at a given POSITION, I'll actually be doing a regex on a range of WORDs
> > at that POSITION. If I salt the keys with a hash, the WORDs will no
> > longer be sorted, and so I would need to do a full table scan for every
> > lookup.
> >
> > Jean-Marc: What problems do you see with my solution? Do you have a
> > better suggestion?
> >
> > --Jeremy
> >
> > On Tue, Sep 24, 2013 at 3:16 PM, Varun Sharma <[email protected]> wrote:
> >
> > > It's better to do some "salting" in your keys for the reduce phase.
> > > Basically, make your key be something like "KeyHash + Key" and then
> > > decode it in your reducer and write to HBase. This way you avoid the
> > > hotspotting problem on HBase due to MapReduce sorting.
> > >
> > > On Tue, Sep 24, 2013 at 2:50 PM, Jean-Marc Spaggiari
> > > <[email protected]> wrote:
> > >
> > > > Hi Jeremy,
> > > >
> > > > I don't see any issue for HBase to handle 4000 tables. However, I
> > > > don't think it's the best solution for your use case.
> > > >
> > > > JM
> > > >
> > > > 2013/9/24 jeremy p <[email protected]>
> > > >
> > > > > Short description: I'd like to have 4000 tables in my HBase
> > > > > cluster. Will this be a problem? In general, what problems do you
> > > > > run into when you try to host thousands of tables in a cluster?
> > > > >
> > > > > Long description: I'd like the performance advantage of pre-split
> > > > > tables, and I'd also like to do filtered range scans. Imagine a
> > > > > keyspace where the key consists of [POSITION]_[WORD], where
> > > > > POSITION is a number from 1 to 4000, and WORD is a string
> > > > > consisting of 96 characters. The value in the cell would be a
> > > > > single integer. My app will examine a 'document', where each
> > > > > 'line' consists of 4000 WORDs. For each WORD, it'll do a filtered
> > > > > regex lookup. Only problem? Say I have 200 mappers and they all
> > > > > start at POSITION 1; my region servers would get hotspotted like
> > > > > crazy. So my idea is to break it into 4000 tables (one for each
> > > > > POSITION), and then pre-split the tables such that each region
> > > > > gets an equal amount of the traffic. In this scenario, the key
> > > > > would just be WORD. Dunno if this is a bad idea; would be open to
> > > > > suggestions.
> > > > >
> > > > > Thanks!
> > > > >
> > > > > --J
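For the record, here is roughly what the single pre-split table Jean-Marc describes in the quoted thread above could look like, with fixed-width POSITION_WORD keys. This is only a sketch: the "positions" table name, the "f" family, the 3999 split points, and the 0.96-style admin API are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PresplitPositions {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("positions"));
    desc.addFamily(new HColumnDescriptor("f"));

    // One region per POSITION prefix: region N holds every "NNNN_*" key,
    // so inserts for different positions land on different regions and a
    // lookup for one position never leaves its region.
    int positions = 4000;
    byte[][] splits = new byte[positions - 1][];
    for (int i = 1; i < positions; i++) {
      splits[i - 1] = Bytes.toBytes(String.format("%04d_", i));
    }
    admin.createTable(desc, splits);
    admin.close();
  }
}

A filtered lookup for one position then stays an ordinary bounded scan, e.g. start row "0017_" and stop row "0018_", plus a RowFilter with a RegexStringComparator for the WORD part, and it only touches the region that owns that prefix.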
