If you have a fixed length like:
XXXX_AAAAAAA

Where XXXX is a number from 0000 to 4000 and AAAA is your word, then simply
split by the number?

Then when you will instead each line, it will write to 4000 different
regions, which can be hosted in 4000 different servers if you have that.
And there will be no hot-spotting?

Then when you run MR job, you will have one mapper per region. Each region
will be evenly loaded. And if you start 200 jobs, then each region will
have 200 mappers, still evenly loaded, no?

Or I might be missing something from your usecase.

JM


2013/9/24 jeremy p <[email protected]>

> Varun : I'm familiar with that method of salting.  However, in this case, I
> need to do filtered range scans.  When I do a lookup for a given WORD at a
> given POSITION, I'll actually be doing a regex on a range of WORDs at that
> POSITION.  If I salt the keys with a hash, the WORDs will no longer be
> sorted, and so I would need to do a full table scan for every lookup.
>
> Jean-Marc : What problems do you see with my solution?  Do you have a
> better suggestion?
>
> --Jeremy
>
>
> On Tue, Sep 24, 2013 at 3:16 PM, Varun Sharma <[email protected]> wrote:
>
> > Its better to do some "salting" in your keys for the reduce phase.
> > Basically, make ur key be something like "KeyHash + Key" and then decode
> it
> > in your reducer and write to HBase. This way you avoid the hotspotting
> > problem on HBase due to MapReduce sorting.
> >
> >
> > On Tue, Sep 24, 2013 at 2:50 PM, Jean-Marc Spaggiari <
> > [email protected]> wrote:
> >
> > > Hi Jeremy,
> > >
> > > I don't see any issue for HBase to handle 4000 tables. However, I don't
> > > think it's the best solution for your use case.
> > >
> > > JM
> > >
> > >
> > > 2013/9/24 jeremy p <[email protected]>
> > >
> > > > Short description : I'd like to have 4000 tables in my HBase cluster.
> > >  Will
> > > > this be a problem?  In general, what problems do you run into when
> you
> > > try
> > > > to host thousands of tables in a cluster?
> > > >
> > > > Long description : I'd like the performance advantage of pre-split
> > > tables,
> > > > and I'd also like to do filtered range scans.  Imagine a keyspace
> where
> > > the
> > > > key consists of : [POSITION]_[WORD] , where POSITION is a number
> from 1
> > > to
> > > > 4000, and WORD is a string consisting of 96 characters.  The value in
> > the
> > > > cell would be a single integer.  My app will examine a 'document',
> > where
> > > > each 'line' consists of 4000 WORDs.  For each WORD, it'll do a
> filtered
> > > > regex lookup.  Only problem?  Say I have 200 mappers and they all
> start
> > > at
> > > > POSITION 1, my region servers would get hotspotted like crazy. So my
> > idea
> > > > is to break it into 4000 tables (one for each POSITION), and then
> > > pre-split
> > > > the tables such that each region gets an equal amount of the traffic.
> >  In
> > > > this scenario, the key would just be WORD.  Dunno if this a bad idea,
> > > would
> > > > be open to suggestions
> > > >
> > > > Thanks!
> > > >
> > > > --J
> > > >
> > >
> >
>

Reply via email to