I took a very similar approach and it worked fine for me. Just spot-check the regions afterwards to make sure the keys look lexicographically sorted. I used ImmutableBytesWritable as my key, and the default Hadoop sorting for it turned out to sort lexicographically, as required. Our HBase rows varied in size, so instead of counting rows, we tallied up KeyValue.getLength() for each KeyValue in a row until the running total reached a certain limit.
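
For illustration, here is a rough sketch of the size-based tally and the sorted-order spot check in plain Java, with no Hadoop or HBase dependency. All names in it (SplitKeySketch, Row, pickSplitKeys, isSorted, compareUnsigned) are made up for this example, and sizeBytes stands in for the summed KeyValue.getLength() of a row. It assumes the rows arrive already sorted, as they would at a single reducer, and that "lexicographic" means comparing bytes as unsigned values.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of Hadoop or HBase. */
final class SplitKeySketch {

    /** One row: its key bytes plus the summed KeyValue lengths for that row. */
    static final class Row {
        final byte[] key;
        final long sizeBytes;

        Row(byte[] key, long sizeBytes) {
            this.key = key;
            this.sizeBytes = sizeBytes;
        }
    }

    /** Lexicographic comparison that treats each byte as unsigned (0..255). */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        // Equal common prefix: the shorter array sorts first.
        return a.length - b.length;
    }

    /**
     * Walks rows that are already in sorted order and records the key of the
     * row that follows each point where the running size reached the limit.
     */
    static List<byte[]> pickSplitKeys(List<Row> sortedRows, long limitBytes) {
        List<byte[]> splits = new ArrayList<>();
        long running = 0;
        for (Row row : sortedRows) {
            if (running >= limitBytes) {
                splits.add(row.key); // this row opens the next region
                running = 0;
            }
            running += row.sizeBytes;
        }
        return splits;
    }

    /** Spot check: true if every key is strictly greater than the previous one. */
    static boolean isSorted(List<byte[]> keys) {
        for (int i = 1; i < keys.size(); i++) {
            if (compareUnsigned(keys.get(i - 1), keys.get(i)) >= 0) {
                return false;
            }
        }
        return true;
    }
}
```

In a real job the tally would sit in the reducer and the limit would be whatever region size you are aiming for; the sketch only shows the bookkeeping.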
On Sat, May 12, 2012 at 7:21 PM, Something Something <[email protected]> wrote:

> Hello,
>
> This is really a MapReduce question, but the output from this will be used
> to create regions for an HBase table. Here's what I want to do:
>
> Take an input file that contains data about users.
> Sort this file by a key (which consists of a few fields from the row)
> After every x # of rows write the key.
>
>
> Here's how I was going to structure my MapReduce:
>
> public Splitter {
>
>   static int counter;
>
>   private Mapper {
>     map() {
>       Build key by concatenating fields
>       Write key
>       increment counter;
>     }
>   }
>
>   // # of reducers will be set to 1. My understanding is that this will
> send the lines to reducer in sorted order one at a time - is this a correct
> assumption?
>   private Reducer {
>     static long i;
>     reduce() {
>       static long splitSize = counter / 300; // 300 is region size
>       if (i == 0 || i == splitSize) {
>         Write key; // this will be used as a 'startkey'.
>         i = 0;
>       }
>       i++;
>     }
>   }
> }
>
> To summarize, there are 2 questions:
>
> 1) I am passing # of rows processed by Mapper to Reducer via a static
> counter. Would this work? Is there a better way?
> 2) If I set # of reducers to 1, would the lines be sent to reducer in
> sorted order one at a time?
>
> Thanks in advance for the help.
>
