I believe the number of reducers will not guarantee an appropriate splitting strategy. The splitting strategy is a property of the HBase table, which should be set up manually during table creation. Something like:

1. Region 1: keys from 0 to 03
2. Region 2: keys from 03 to 06
3. Region 3: keys from 06 to infinity

Splitting the table from the HBase API after writing out is an option. The LoadIncrementalHFiles.doBulkLoad command will resplit the files automatically, but this process takes a lot of time. I got something like 10 minutes to generate the HFiles plus 120 minutes to resplit them during the doBulkLoad command… So I believe the best option is to split the files properly during HFile generation. However, we need to find a way to do this :)

Thanks,
Dmitry.

From: David Ortiz <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, January 11, 2017 at 19:01
To: "[email protected]" <[email protected]>
Subject: Re: How to generate 1 HFile per 1 HBase region from crunch?

Dmitry,

You could use the public void configure() method in the DoFn to manually set the number of reducers. You could also manually split the table from the HBase API after writing out.

Dave

On Wed, Jan 11, 2017 at 10:47 AM Dmitry Gorbatsevich <[email protected]> wrote:

Hey,

I am trying to use Crunch to bulk load data into HBase. If you are using plain MR, you can push HBase to control the number of reducers (1 HFile per region) using the following code:

HFileOutputFormat.configureIncrementalLoad(job, table)

However, I did not manage to find anything similar in the Crunch classes (HFileTarget). Without this, LoadIncrementalHFiles.doBulkLoad(new Path(hBasePath), hTable); takes a lot of time because of resplitting the files… I am wondering how I can push Crunch to use the same strategy as pure MR uses with HFileOutputFormat.configureIncrementalLoad? Is it possible?
Here is the sample code that I use to write data into HFiles:

PCollection<Cell> cellsUsers = users.parallelDo(new DoFn<Pair<String, Integer>, Cell>() {
    @Override
    public void process(Pair<String, Integer> input, Emitter<Cell> emitter) {
        byte[] row = input.first().getBytes();
        byte[] value = String.valueOf(input.second()).getBytes();
        byte[] family = "cf1".getBytes();
        byte[] qualifier = "q1".getBytes();
        long timestamp = System.currentTimeMillis();
        Cell cell = CellUtil.createCell(row, family, qualifier, timestamp,
                KeyValue.Type.Put.getCode(), value);
        emitter.emit(cell);
    }
}, cells());
cellsUsers.write(new HFileTarget(hBaseFullPath), WriteMode.OVERWRITE);

Thanks in advance,
Dmitry.
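[Editorial note] The pre-split strategy discussed above can be sketched without an HBase cluster. Roughly speaking, HFileOutputFormat.configureIncrementalLoad arranges a total-order partition over the table's region start keys with one reducer per region, so each reducer produces HFiles for exactly one region; HBase compares row keys as unsigned lexicographic byte arrays. The class and method names below are illustrative, not part of any Crunch or HBase API; the comparison mirrors what HBase's Bytes.compareTo does for row keys, using the split points "03" and "06" from the thread:

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch (names are ours): which region owns a given row key,
// for a table pre-split at "03" and "06", i.e. regions
// [-inf, "03"), ["03", "06"), ["06", +inf).
public class RegionSplitSketch {

    // Unsigned lexicographic byte comparison, the order HBase uses for
    // row keys (same idea as org.apache.hadoop.hbase.util.Bytes.compareTo).
    public static int compare(byte[] a, byte[] b) {
        int len = Math.min(a.length, b.length);
        for (int i = 0; i < len; i++) {
            int x = a[i] & 0xFF;  // mask to 0..255: bytes compare unsigned
            int y = b[i] & 0xFF;
            if (x != y) return x - y;
        }
        // A prefix sorts before any longer key extending it.
        return a.length - b.length;
    }

    // Be explicit about the charset when turning strings into row keys.
    public static byte[] key(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // 0-based index of the region owning rowKey, given sorted split points.
    // A partitioner doing this lookup, paired with one reducer per region,
    // yields one set of HFiles per region and avoids the resplitting phase.
    public static int regionFor(byte[] rowKey, byte[][] splitPoints) {
        int region = 0;
        for (byte[] split : splitPoints) {
            if (compare(rowKey, split) >= 0) region++;
            else break;
        }
        return region;
    }

    public static void main(String[] args) {
        byte[][] splits = { key("03"), key("06") };
        for (String k : new String[] {"01", "03", "05", "07"}) {
            System.out.println(k + " -> region " + regionFor(key(k), splits));
        }
    }
}
```

Running this prints region 0 for "01", region 1 for "03" and "05", and region 2 for "07", matching the three regions listed in the thread. Note the unsigned comparison: a key byte of 0x80 sorts after 0x7F, which a signed Java byte comparison would get wrong when choosing split points.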
