There's the "split" command in the shell.

HBaseAdmin has that same method.

On the table's page in the master's web UI, there's a "split" button.

Finally, when creating a table, you can pre-specify all the split keys
with this method:
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HBaseAdmin.html#createTable(org.apache.hadoop.hbase.HTableDescriptor, byte[][])
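
For example (a minimal sketch against the 0.90-era client API; the table
name, family name, and split keys below are just placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplitTableExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        // Placeholder table/family names; adjust to your schema.
        HTableDescriptor desc = new HTableDescriptor("my_table");
        desc.addFamily(new HColumnDescriptor("cf"));

        // Pre-specify the region boundaries at creation time; the table
        // starts out with splitKeys.length + 1 regions instead of one.
        byte[][] splitKeys = new byte[][] {
            Bytes.toBytes("row-aaaa"),
            Bytes.toBytes("row-gggg"),
            Bytes.toBytes("row-nnnn"),
            Bytes.toBytes("row-tttt")
        };
        admin.createTable(desc, splitKeys);
      }
    }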

J-D

On Thu, Feb 10, 2011 at 8:48 AM, Geoff Hendrey <ghend...@decarta.com> wrote:
> I hunted around for some info on how to force a table to split, but I
> didn't find what I was looking for. Is there a command I can issue from
> the HBase shell that would force every existing region to divide in
> half? That would be quite useful. If not, what's the next best way to
> force splits?
>
> thanks!
> -g
>
> -----Original Message-----
> From: Michael Segel [mailto:michael_se...@hotmail.com]
> Sent: Thursday, February 10, 2011 8:15 AM
> To: user@hbase.apache.org
> Cc: hbase-u...@hadoop.apache.org
> Subject: RE: getSplits question
>
>
> Ryan,
>
> Just to point out the obvious...
>
> On smaller tables where you don't get enough parallelism, you can
> manually force the table's regions to be split.
> My understanding is that if/when the table grows, it will then go back
> to splitting normally.
>
> This way, if you have a 'small' lookup table that is relatively static,
> you can manually split it to the 'right' size for your cloud.
> If you are seeding a system, you can do the splits up front to get good
> parallelism and avoid overloading a single region with inserts, then let
> the table go back to its normal growth and splitting pattern.
>
> This would solve the OP's issue and, as you point out, avoid having to
> worry about getSplits().
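>
> Roughly what that could look like from the Java side (a sketch; the table
> name is a placeholder, and the split request is asynchronous):
>
>     import org.apache.hadoop.hbase.HBaseConfiguration;
>     import org.apache.hadoop.hbase.client.HBaseAdmin;
>
>     public class ForceSplit {
>       public static void main(String[] args) throws Exception {
>         HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
>         // Ask for a split of the table's regions (a single region name
>         // can be passed instead); the splits themselves happen
>         // asynchronously on the region servers.
>         admin.split("my_lookup_table");
>       }
>     }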
>
> Does this make sense, or am I missing something?
>
> -Mike
>
>> Date: Wed, 9 Feb 2011 23:54:19 -0800
>> Subject: Re: getSplits question
>> From: ryano...@gmail.com
>> To: user@hbase.apache.org
>> CC: hbase-u...@hadoop.apache.org
>>
>> By default each map gets the contents of 1 region. A region is by
>> default a maximum of 256MB. There is no trivial way to generally
>> split a region in half, in terms of row count, by just knowing what
>> we know (start key, end key).
>>
>> For very large tables that have > 100 regions, this algorithm works
>> really well and you get good parallelism.  If you want to see a lot
>> of parallelism out of 1 region, you might have to work a lot harder.
>> Or reduce your region size and have more regions.  Be warned, though,
>> that having more regions has performance costs in other areas
>> (specifically server startup/shutdown/assignment times), so you
>> probably don't want 50,000 32MB regions.
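>>
>> One per-table alternative to changing the cluster-wide region size (a
>> sketch; table/family names are placeholders, and whether ~64MB makes
>> sense depends on the trade-off above):
>>
>>     import org.apache.hadoop.hbase.HBaseConfiguration;
>>     import org.apache.hadoop.hbase.HColumnDescriptor;
>>     import org.apache.hadoop.hbase.HTableDescriptor;
>>     import org.apache.hadoop.hbase.client.HBaseAdmin;
>>
>>     public class SmallRegionTable {
>>       public static void main(String[] args) throws Exception {
>>         HTableDescriptor desc = new HTableDescriptor("my_table");
>>         desc.addFamily(new HColumnDescriptor("cf"));
>>         // Per-table override of hbase.hregion.max.filesize: a region of
>>         // this table splits once a store file grows past ~64MB, instead
>>         // of the cluster default.
>>         desc.setMaxFileSize(64L * 1024 * 1024);
>>         new HBaseAdmin(HBaseConfiguration.create()).createTable(desc);
>>       }
>>     }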
>>
>> -ryan
>>
>> On Wed, Feb 9, 2011 at 11:46 PM, Geoff Hendrey <ghend...@decarta.com> wrote:
>> > Oh, I definitely don't *need* my own getSplits() to run MapReduce.
>> > However, if I want to control the number of records handled by each
>> > mapper (splitsize) and the startrow and endrow, then I thought I had to
>> > write my own getSplits(). Is there another way to accomplish this? I do
>> > need the combination of a controlled splitsize and start/endrow.
>> >
>> > -geoff
>> >
>> > -----Original Message-----
>> > From: Ryan Rawson [mailto:ryano...@gmail.com]
>> > Sent: Wednesday, February 09, 2011 11:43 PM
>> > To: user@hbase.apache.org
>> > Cc: hbase-u...@hadoop.apache.org
>> > Subject: Re: getSplits question
>> >
>> > You shouldn't need to write your own getSplits() method to run a
>> > MapReduce job; I never did, at least...
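>> >
>> > For reference, the stock setup looks roughly like this (a sketch using
>> > TableMapReduceUtil; the scan's start/stop rows bound what the whole job
>> > reads, and the splits still come from the table's regions; the table
>> > name, row keys, and mapper here are placeholders):
>> >
>> >     import org.apache.hadoop.conf.Configuration;
>> >     import org.apache.hadoop.hbase.HBaseConfiguration;
>> >     import org.apache.hadoop.hbase.client.Result;
>> >     import org.apache.hadoop.hbase.client.Scan;
>> >     import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
>> >     import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
>> >     import org.apache.hadoop.hbase.mapreduce.TableMapper;
>> >     import org.apache.hadoop.hbase.util.Bytes;
>> >     import org.apache.hadoop.io.NullWritable;
>> >     import org.apache.hadoop.mapreduce.Job;
>> >     import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;
>> >
>> >     public class BoundedScanJob {
>> >
>> >       // Minimal mapper: emits each row key and drops the row contents.
>> >       static class RowKeyMapper
>> >           extends TableMapper<ImmutableBytesWritable, NullWritable> {
>> >         @Override
>> >         protected void map(ImmutableBytesWritable rowKey, Result columns,
>> >             Context context)
>> >             throws java.io.IOException, InterruptedException {
>> >           context.write(rowKey, NullWritable.get());
>> >         }
>> >       }
>> >
>> >       public static void main(String[] args) throws Exception {
>> >         Configuration conf = HBaseConfiguration.create();
>> >         Job job = new Job(conf, "bounded-scan");
>> >         job.setJarByClass(BoundedScanJob.class);
>> >
>> >         Scan scan = new Scan();
>> >         scan.setStartRow(Bytes.toBytes("row-0001")); // inclusive
>> >         scan.setStopRow(Bytes.toBytes("row-5000"));  // exclusive
>> >         scan.setCaching(500); // rows per RPC, not rows per mapper
>> >
>> >         TableMapReduceUtil.initTableMapperJob("my_table", scan,
>> >             RowKeyMapper.class, ImmutableBytesWritable.class,
>> >             NullWritable.class, job);
>> >         job.setNumReduceTasks(0);
>> >         job.setOutputFormatClass(NullOutputFormat.class);
>> >
>> >         System.exit(job.waitForCompletion(true) ? 0 : 1);
>> >       }
>> >     }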
>> >
>> > -ryan
>> >
>> > On Wed, Feb 9, 2011 at 11:36 PM, Geoff Hendrey <ghend...@decarta.com> wrote:
>> >> Are endrows inclusive or exclusive? The docs say exclusive, but then the
>> >> question arises as to how to form the last split for getSplits(). The
>> >> code below runs fine, but I believe it is omitting some rows, perhaps
>> >> because of the exclusive end row. For the final split, should the endrow
>> >> be null? I tried that, and got what appeared to be a final split without
>> >> an endrow at all. Would appreciate a pointer to the correct
>> >> implementation of getSplits in which I can provide a startrow, endrow,
>> >> and splitsize. Apparently this isn't it :) :
>> >>
>> >>
>> >>
>> >> int splitSize = context.getConfiguration().getInt("splitsize", 1000);
>> >>
>> >> byte[] splitStop = null;
>> >> String hostname = null;
>> >>
>> >> while ((results = resultScanner.next(splitSize)).length > 0) {
>> >>     // System.out.println("results :-------------------------- " + results);
>> >>     byte[] splitStart = results[0].getRow();
>> >>     splitStop = results[results.length - 1].getRow();
>> >>     // I think this is a problem...we don't actually include this row
>> >>     // in the split since it's exclusive..revisit this and correct
>> >>
>> >>     HRegionLocation location = table.getRegionLocation(splitStart);
>> >>     hostname = location.getServerAddress().getHostname();
>> >>
>> >>     InputSplit split = new TableSplit(table.getTableName(), splitStart,
>> >>         splitStop, hostname);
>> >>     splits.add(split);
>> >>     System.out.println("initializing splits: " + split.toString());
>> >> }
>> >>
>> >> resultScanner.close();
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> -g
>> >>
>> >>
>> >
>
>
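
On the exclusive-endrow question in the quoted code: one way to cover every
row in a hand-rolled getSplits() is to close each split at the *next*
batch's first row, and to give the final split the job's overall stop row
(or HConstants.EMPTY_END_ROW when scanning to the end of the table). A
sketch only, reusing the fields and imports of the fragment above; the
addSplit() helper and jobStopRow are hypothetical names introduced here:

    // Inside getSplits():
    int splitSize = context.getConfiguration().getInt("splitsize", 1000);
    // Hypothetical: the overall (exclusive) stop row for the job; use
    // HConstants.EMPTY_END_ROW to mean "to the end of the table".
    byte[] jobStopRow = org.apache.hadoop.hbase.HConstants.EMPTY_END_ROW;

    Result[] results;
    byte[] pendingStart = null; // start row of the split not yet emitted

    while ((results = resultScanner.next(splitSize)).length > 0) {
        byte[] batchStart = results[0].getRow();
        if (pendingStart != null) {
            // Close the previous split at this batch's first row, so the
            // previous batch's last row is not lost to the exclusive
            // stop-row semantics.
            addSplit(pendingStart, batchStart);
        }
        pendingStart = batchStart;
    }
    if (pendingStart != null) {
        // Final split: run to the job's overall stop row rather than to
        // the last row seen, which would otherwise be excluded.
        addSplit(pendingStart, jobStopRow);
    }
    resultScanner.close();

    // Elsewhere in the same class: a hypothetical helper mirroring the
    // body of the original loop.
    private void addSplit(byte[] start, byte[] stop) throws IOException {
        HRegionLocation location = table.getRegionLocation(start);
        String hostname = location.getServerAddress().getHostname();
        InputSplit split = new TableSplit(table.getTableName(), start, stop, hostname);
        splits.add(split);
        System.out.println("initializing splits: " + split);
    }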
