By default each map gets the contents of 1 region. A region is by default a maximum of 256MB. There is no trivial way to bisect a region in half by row count, knowing only what we know (the start and end keys).
For very large tables that have > 100 regions, this algorithm works really well and you get some good parallelism. If you want to see a lot of parallelism out of 1 region, you might have to work a lot harder. Or reduce your region size and have more regions. Be warned, though, that more regions have performance costs in other areas (specifically server startup/shutdown/assignment times). So you probably don't want 50,000 32MB regions.

-ryan

On Wed, Feb 9, 2011 at 11:46 PM, Geoff Hendrey <ghend...@decarta.com> wrote:
> Oh, I definitely don't *need* my own to run mapreduce. However, if I want to
> control the number of records handled by each mapper (splitsize) and the
> startrow and endrow, then I thought I had to write my own getSplits(). Is
> there another way to accomplish this? Because I do need the combination of
> controlled splitsize and start/endrow.
>
> -geoff
>
> -----Original Message-----
> From: Ryan Rawson [mailto:ryano...@gmail.com]
> Sent: Wednesday, February 09, 2011 11:43 PM
> To: user@hbase.apache.org
> Cc: hbase-u...@hadoop.apache.org
> Subject: Re: getSplits question
>
> You shouldn't need to write your own getSplits() method to run a map
> reduce; I never did, at least...
>
> -ryan
>
> On Wed, Feb 9, 2011 at 11:36 PM, Geoff Hendrey <ghend...@decarta.com> wrote:
>> Are endrows inclusive or exclusive? The docs say exclusive, but then the
>> question arises as to how to form the last split for getSplits(). The
>> code below runs fine, but I believe it is omitting some rows, perhaps
>> b/c of the exclusive end row. For the final split, should the endrow be
>> null? I tried that, and got what appeared to be a final split without an
>> endrow at all. I would appreciate a pointer to a correct implementation
>> of getSplits in which I provide a startrow, endrow, and splitsize.
>> Apparently this isn't it :)
>>
>> int splitSize = context.getConfiguration().getInt("splitsize", 1000);
>> byte[] splitStop = null;
>> String hostname = null;
>>
>> while ((results = resultScanner.next(splitSize)).length > 0) {
>>     // System.out.println("results: -------------------------- " + results);
>>     byte[] splitStart = results[0].getRow();
>>     splitStop = results[results.length - 1].getRow();
>>     // I think this is a problem... we don't actually include this row in the
>>     // split since it's exclusive... revisit this and correct
>>     HRegionLocation location = table.getRegionLocation(splitStart);
>>     hostname = location.getServerAddress().getHostname();
>>     InputSplit split = new TableSplit(table.getTableName(), splitStart, splitStop, hostname);
>>     splits.add(split);
>>     System.out.println("initializing splits: " + split.toString());
>> }
>>
>> resultScanner.close();
>>
>> -g
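
[Editor's note: a minimal, HBase-free sketch of the fix the thread is circling around. Because the stop row of a split is exclusive, using a batch's own last row as the stop key drops that row; the usual remedy is to use the first row of the *next* batch as the stop key, and an empty stop key for the final split (HBase's convention for "scan to the end of the table"). The `Range` class and `makeSplits` helper below are hypothetical names for illustration, not HBase API.]

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitSketch {
    /** A [start, stop) key range; an empty stop means "to end of table". */
    static class Range {
        final String start, stop;
        Range(String start, String stop) { this.start = start; this.stop = stop; }
        @Override public String toString() {
            return "[" + start + ", " + (stop.isEmpty() ? "<end>" : stop) + ")";
        }
    }

    /**
     * Partition sorted row keys into splits of at most splitSize rows.
     * Each split's stop key is the first key of the NEXT batch, so the
     * exclusive end does not drop the batch's own last row; the final
     * split gets the empty stop key, meaning "scan to the end".
     */
    static List<Range> makeSplits(List<String> sortedKeys, int splitSize) {
        List<Range> splits = new ArrayList<>();
        for (int i = 0; i < sortedKeys.size(); i += splitSize) {
            String start = sortedKeys.get(i);
            int next = i + splitSize;
            String stop = next < sortedKeys.size() ? sortedKeys.get(next) : "";
            splits.add(new Range(start, stop));
        }
        return splits;
    }

    public static void main(String[] args) {
        // With an exclusive stop, "b" still belongs to the first split,
        // because its stop key is "c" (the next batch's start), not "b".
        List<String> keys = Arrays.asList("a", "b", "c", "d", "e");
        for (Range r : makeSplits(keys, 2)) {
            System.out.println(r);   // [a, c)  [c, e)  [e, <end>)
        }
    }
}
```

The same idea carries over to the scanner loop above: remember the first row returned by the following `resultScanner.next(splitSize)` call and use it as the previous split's stop key, emitting the last split with an empty stop key instead of the last row seen.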