By default each map gets the contents of 1 region. A region is by
default a maximum of 256MB. There is no trivial way to generally
bisect a region in half, in terms of row count, by knowing only what
we know (the start and end keys).
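To see why: from just the start and end keys you can bisect the key *space* (a byte-wise midpoint), but that says nothing about how many *rows* land on each side, since the row distribution is unknown. A minimal self-contained sketch of the key-space midpoint (the `midpointKey` helper is hypothetical, not an HBase API):

```java
import java.util.Arrays;

public class KeyMidpoint {
    // Byte-wise midpoint of two row keys, treating each key as a
    // big-endian unsigned integer: (start + end) / 2. This halves the
    // key SPACE, not the row COUNT -- without knowing the data
    // distribution, either half could hold nearly all of the rows.
    static byte[] midpointKey(byte[] start, byte[] end) {
        int len = Math.max(start.length, end.length);
        byte[] a = Arrays.copyOf(start, len);
        byte[] b = Arrays.copyOf(end, len);

        // Sum the two keys, keeping one extra leading digit for carry.
        int[] sum = new int[len + 1];
        int carry = 0;
        for (int i = len - 1; i >= 0; i--) {
            int s = (a[i] & 0xFF) + (b[i] & 0xFF) + carry;
            sum[i + 1] = s & 0xFF;
            carry = s >> 8;
        }
        sum[0] = carry;

        // Halve the sum, propagating the remainder downward.
        byte[] mid = new byte[len];
        int rem = sum[0];
        for (int i = 1; i <= len; i++) {
            int cur = rem * 256 + sum[i];
            mid[i - 1] = (byte) (cur / 2);
            rem = cur % 2;
        }
        return mid;
    }

    public static void main(String[] args) {
        // Midpoint of keys "a" and "c" is "b" -- but nothing says half
        // the table's rows sort before "b".
        System.out.println(new String(midpointKey("a".getBytes(), "c".getBytes())));
    }
}
```
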

For very large tables that have > 100 regions, this algorithm works
really well and you get good parallelism.  If you want to see a lot of
parallelism out of 1 region, you might have to work a lot harder.  Or
reduce your region size and have more regions.  Be warned, though,
that more regions carry performance hits in other areas (specifically
server startup/shutdown/assignment times).  So you probably don't want
50,000 32MB regions.
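For reference, the default behavior described above (one map per region) amounts to turning each region's [startKey, endKey) pair into one input split, so parallelism is capped by the region count. A minimal self-contained sketch of that mapping, with hypothetical `Region`/`regionSplits` names (the real logic lives in `TableInputFormatBase.getSplits()`, which gets region boundaries from the table):

```java
import java.util.ArrayList;
import java.util.List;

public class RegionSplits {
    // Stand-in for a region's key range; in the real client these
    // boundaries come from the table's region metadata.
    static class Region {
        final String startKey, endKey; // endKey is exclusive
        Region(String s, String e) { startKey = s; endKey = e; }
    }

    // Default mapping: exactly one split per region, regardless of how
    // many rows each region holds.
    static List<String> regionSplits(List<Region> regions) {
        List<String> splits = new ArrayList<>();
        for (Region r : regions) {
            splits.add("[" + r.startKey + ", " + r.endKey + ")");
        }
        return splits;
    }

    public static void main(String[] args) {
        // Three regions (empty key = table boundary) -> three map tasks.
        List<Region> regions = List.of(
            new Region("", "g"), new Region("g", "p"), new Region("p", ""));
        System.out.println(regionSplits(regions).size());
    }
}
```
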

-ryan

On Wed, Feb 9, 2011 at 11:46 PM, Geoff Hendrey <ghend...@decarta.com> wrote:
> Oh, I definitely don't *need* my own to run mapreduce. However, if I want to 
> control the number of records handled by each mapper (splitsize) and the 
> startrow and endrow, then I thought I had to write my own getSplits(). Is 
> there another way to accomplish this, because I do need the combination of 
> controlled splitsize and start/endrow.
>
> -geoff
>
> -----Original Message-----
> From: Ryan Rawson [mailto:ryano...@gmail.com]
> Sent: Wednesday, February 09, 2011 11:43 PM
> To: user@hbase.apache.org
> Cc: hbase-u...@hadoop.apache.org
> Subject: Re: getSplits question
>
> You shouldn't need to write your own getSplits() method to run a map
> reduce, I never did at least...
>
> -ryan
>
> On Wed, Feb 9, 2011 at 11:36 PM, Geoff Hendrey <ghend...@decarta.com> wrote:
>> Are endrows inclusive or exclusive? The docs say exclusive, but then the
>> question arises as to how to form the last split for getSplits(). The
>> code below runs fine, but I believe it is omitting some rows, perhaps
>> b/c of the exclusive end row. For the final split, should the endrow be
>> null? I tried that, and got what appeared to be a final split without an
>> endrow at all. Would appreciate a pointer to the correct implementation
>> of getSplits in which I desire to provide a startrow, endrow, and
>> splitsize. Apparently this isn't it :) :
>>
>>
>>
>> int splitSize = context.getConfiguration().getInt("splitsize", 1000);
>>
>> byte[] splitStop = null;
>> String hostname = null;
>>
>> while ((results = resultScanner.next(splitSize)).length > 0) {
>>
>>     // System.out.println("results :-------------------------- " + results);
>>
>>     byte[] splitStart = results[0].getRow();
>>
>>     splitStop = results[results.length - 1].getRow();
>>     // I think this is a problem... we don't actually include this row in
>>     // the split since it's exclusive.. revisit this and correct
>>
>>     HRegionLocation location = table.getRegionLocation(splitStart);
>>     hostname = location.getServerAddress().getHostname();
>>
>>     InputSplit split = new TableSplit(table.getTableName(), splitStart,
>>         splitStop, hostname);
>>     splits.add(split);
>>
>>     System.out.println("initializing splits: " + split.toString());
>> }
>>
>> resultScanner.close();
>>
>>
>>
>>
>>
>> -g
>>
>>
>
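On the exclusive-endrow question in the quoted code: since the split's stop row is exclusive, using the last retrieved row's key as splitStop drops that row from the split. One common workaround is to use the lexicographically smallest key strictly greater than the last row, i.e. the row key with a zero byte appended, as the stop row. A self-contained sketch of that idea (the `nextRowKey`/`inRange` helpers are hypothetical illustrations, not HBase APIs):

```java
import java.util.Arrays;

public class ExclusiveStop {
    // Smallest row key strictly greater than `row`: append one 0x00
    // byte. Using it as an exclusive stop row makes the range
    // [start, nextRowKey(last)) include `last` itself.
    static byte[] nextRowKey(byte[] row) {
        // copyOf zero-fills the extra trailing byte.
        return Arrays.copyOf(row, row.length + 1);
    }

    // Exclusive-range membership over lexicographic byte order,
    // mirroring how a scan treats [startRow, stopRow).
    static boolean inRange(byte[] row, byte[] start, byte[] stop) {
        return compare(start, row) <= 0 && compare(row, stop) < 0;
    }

    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        byte[] last = "row-0999".getBytes();
        byte[] start = "row-0000".getBytes();
        // With splitStop = last, the last row falls outside the split...
        System.out.println(inRange(last, start, last));              // false
        // ...with splitStop = nextRowKey(last), it is included.
        System.out.println(inRange(last, start, nextRowKey(last)));  // true
    }
}
```

For the final split, an empty stop row in HBase conventionally means "scan to the end of the table", which would explain the endrow-less final split observed when passing null.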
