I hunted around for some info on how to force a table to split, but I
didn't find what I was looking for. Is there a command I can issue from
the HBase shell that would force every existing region to divide in
half? That would be quite useful. If not, what's the next best way to
force splits?

thanks!
-g

-----Original Message-----
From: Michael Segel [mailto:michael_se...@hotmail.com] 
Sent: Thursday, February 10, 2011 8:15 AM
To: user@hbase.apache.org
Cc: hbase-u...@hadoop.apache.org
Subject: RE: getSplits question


Ryan,

Just to point out the obvious...

On smaller tables where you don't get enough parallelism, you can
manually force the table's regions to be split.
My understanding is that if/when the table grows, it will then go back
to splitting normally.

This way, if you have a 'small' look-up table that is relatively
static, you can manually split it to the 'right' size for your cloud.
If you are seeding a system, you can do the splits up front to get good
parallelism and avoid overloading a single region with inserts, then
let the table go back to its normal growth and split pattern.

This would solve the OP's issue and as you point out, not worry about
getSplits().
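
For example, here is a minimal sketch of forcing the splits from the
Java client, assuming a 0.90-era API and a made-up table name (where
available, the shell's split command does essentially the same thing):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class ForceSplit {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);
            // Optionally flush first so each region has something on disk
            // to pick a split point from.
            admin.flush("lookup_table");
            // A table name asks each region of that table to split; a region
            // name splits just that one region. Regions choose their own
            // split points, and empty regions may simply not split.
            admin.split("lookup_table");
        }
    }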

Does this make sense, or am I missing something?

-Mike

> Date: Wed, 9 Feb 2011 23:54:19 -0800
> Subject: Re: getSplits question
> From: ryano...@gmail.com
> To: user@hbase.apache.org
> CC: hbase-u...@hadoop.apache.org
> 
> By default each map gets the contents of one region. A region is by
> default a maximum of 256MB. There is no trivial way to bisect a region
> in terms of row count just by knowing what we know (the start and end
> keys).
> 
> For very large tables that have > 100 regions, this algorithm works
> really well and you get good parallelism.  If you want to see a lot of
> parallelism out of one region, you might have to work a lot harder, or
> reduce your region size and have more regions.  Be warned, though,
> that having more regions has performance costs in other areas
> (specifically server startup/shutdown/assignment times).  So you
> probably don't want 50,000 32MB regions.
> 
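> If shrinking regions for just one table is enough, here is a rough
> sketch of doing that per-table: lower MAX_FILESIZE on the table rather
> than changing hbase.hregion.max.filesize for the whole cluster. This
> assumes a 0.90-era API, a hypothetical table name, the usual
> HBaseAdmin/HTableDescriptor/HBaseConfiguration/Bytes imports, and that
> the table can be disabled briefly for the alter:
>
>     HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
>     HTableDescriptor desc = admin.getTableDescriptor(Bytes.toBytes("my_table"));
>     // 64MB per region instead of the 256MB default, for this table only.
>     desc.setMaxFileSize(64L * 1024 * 1024);
>     admin.disableTable("my_table");
>     admin.modifyTable(Bytes.toBytes("my_table"), desc);
>     admin.enableTable("my_table");
>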
> -ryan
> 
> > On Wed, Feb 9, 2011 at 11:46 PM, Geoff Hendrey <ghend...@decarta.com>
> > wrote:
> > > Oh, I definitely don't *need* my own getSplits() to run MapReduce.
> > > However, if I want to control the number of records handled by each
> > > mapper (splitsize) as well as the startrow and endrow, then I thought
> > > I had to write my own getSplits(). Is there another way to accomplish
> > > this? I do need the combination of a controlled splitsize and
> > > start/endrow.
> >
> > -geoff
> >
> > -----Original Message-----
> > From: Ryan Rawson [mailto:ryano...@gmail.com]
> > Sent: Wednesday, February 09, 2011 11:43 PM
> > To: user@hbase.apache.org
> > Cc: hbase-u...@hadoop.apache.org
> > Subject: Re: getSplits question
> >
> > You shouldn't need to write your own getSplits() method to run a
> > MapReduce job; at least I never did...
> >
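> > For reference, here is a rough sketch of the stock route (hypothetical
> > table and row names): give TableInputFormat a Scan with the start and
> > stop rows set, and it creates one split per region in that range. You
> > get startrow/endrow control without touching getSplits(); the number
> > of records per mapper then follows region boundaries rather than a
> > splitsize setting.
> >
> >     import org.apache.hadoop.conf.Configuration;
> >     import org.apache.hadoop.hbase.HBaseConfiguration;
> >     import org.apache.hadoop.hbase.client.Result;
> >     import org.apache.hadoop.hbase.client.Scan;
> >     import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
> >     import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
> >     import org.apache.hadoop.hbase.mapreduce.TableMapper;
> >     import org.apache.hadoop.hbase.util.Bytes;
> >     import org.apache.hadoop.io.NullWritable;
> >     import org.apache.hadoop.mapreduce.Job;
> >
> >     public class ScanRangeJob {
> >         // Visits every row in [startrow, stoprow); real per-row work goes here.
> >         static class RangeMapper extends TableMapper<NullWritable, NullWritable> {
> >             protected void map(ImmutableBytesWritable key, Result value, Context ctx) {
> >                 // process one row
> >             }
> >         }
> >
> >         public static void main(String[] args) throws Exception {
> >             Configuration conf = HBaseConfiguration.create();
> >             Job job = new Job(conf, "scan-range");
> >             job.setJarByClass(ScanRangeJob.class);
> >
> >             Scan scan = new Scan();
> >             scan.setStartRow(Bytes.toBytes("row-000100"));   // inclusive
> >             scan.setStopRow(Bytes.toBytes("row-000200"));    // exclusive
> >             scan.setCaching(500);   // rows fetched per RPC, not rows per mapper
> >
> >             TableMapReduceUtil.initTableMapperJob("my_table", scan,
> >                     RangeMapper.class, NullWritable.class, NullWritable.class, job);
> >             job.setNumReduceTasks(0);
> >             job.waitForCompletion(true);
> >         }
> >     }
> >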
> > -ryan
> >
> > On Wed, Feb 9, 2011 at 11:36 PM, Geoff Hendrey <ghend...@decarta.com>
> > wrote:
> >> Are endrows inclusive or exclusive? The docs say exclusive, but then
> >> the question arises as to how to form the last split in getSplits().
> >> The code below runs fine, but I believe it is omitting some rows,
> >> perhaps because of the exclusive end row. For the final split, should
> >> the endrow be null? I tried that, and got what appeared to be a final
> >> split without an endrow at all. I would appreciate a pointer to a
> >> correct implementation of getSplits() in which I can provide a
> >> startrow, endrow, and splitsize. Apparently this isn't it :)
> >>
> >>     int splitSize = context.getConfiguration().getInt("splitsize", 1000);
> >>     byte[] splitStop = null;
> >>     String hostname = null;
> >>
> >>     while ((results = resultScanner.next(splitSize)).length > 0) {
> >>         // System.out.println("results :-------------------------- " + results);
> >>         byte[] splitStart = results[0].getRow();
> >>         // I think this is a problem...we don't actually include this row in
> >>         // the split since it's exclusive..revisit this and correct
> >>         splitStop = results[results.length - 1].getRow();
> >>         HRegionLocation location = table.getRegionLocation(splitStart);
> >>         hostname = location.getServerAddress().getHostname();
> >>         InputSplit split = new TableSplit(table.getTableName(), splitStart,
> >>                 splitStop, hostname);
> >>         splits.add(split);
> >>         System.out.println("initializing splits: " + split.toString());
> >>     }
> >>     resultScanner.close();
> >>
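> >> One idea for avoiding the dropped rows (an untested sketch, using the
> >> same variables as above plus a caller-supplied endRow): reuse each
> >> batch's last row as the start of the next split, and close the range
> >> with a final split that ends at the supplied endrow so the tail isn't
> >> lost. Does that look right?
> >>
> >>     byte[] splitStart = null;
> >>     byte[] splitStop = null;
> >>     while ((results = resultScanner.next(splitSize)).length > 0) {
> >>         if (splitStart == null) {
> >>             splitStart = results[0].getRow();   // very first row of the scan
> >>         }
> >>         splitStop = results[results.length - 1].getRow();
> >>         String host = table.getRegionLocation(splitStart)
> >>                 .getServerAddress().getHostname();
> >>         // [splitStart, splitStop): splitStop itself becomes the start of
> >>         // the next split, so it is no longer dropped.
> >>         splits.add(new TableSplit(table.getTableName(), splitStart, splitStop, host));
> >>         splitStart = splitStop;
> >>     }
> >>     if (splitStart != null) {
> >>         // Final split covers the last excluded row up to the supplied endrow.
> >>         String host = table.getRegionLocation(splitStart)
> >>                 .getServerAddress().getHostname();
> >>         splits.add(new TableSplit(table.getTableName(), splitStart, endRow, host));
> >>     }
> >>     resultScanner.close();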
> >>
> >>
> >>
> >>
> >> -g
> >>
> >>
> >
