I believe the number of reducers will not guarantee an appropriate splitting strategy. The splitting strategy is a property of the HBase table, which should be set up manually during table creation. Something like:

1. Region 1: keys from 0 to 03
2. Region 2: keys from 03 to 06
3. Region 3: keys from 06 to infinity

Splitting the table from the HBase API after writing out is an option. The LoadIncrementalHFiles.doBulkLoad command will resplit the files automatically, but this process takes a lot of time. I got something like 10 minutes to generate the HFiles plus 120 minutes to resplit them during the doBulkLoad command… So I believe the best option is to split the files properly during HFile generation. However, we need to find a way to do this :)

Thanks,
Dmitry.

From: David Ortiz <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, January 11, 2017 at 19:01
To: "[email protected]" <[email protected]>
Subject: Re: How to generate 1 HFile per 1 HBase region from crunch?

Dmitry,

You could use the public void configure() method in the DoFn to manually set the number of reducers. You could also manually split the table from the HBase API after writing out.

Dave

On Wed, Jan 11, 2017 at 10:47 AM Dmitry Gorbatsevich <[email protected]> wrote:

Hey,

I am trying to use Crunch to bulk load data into HBase. If you are using plain MR, you can push HBase to control the number of reducers (1 HFile per region) using the following code:

HFileOutputFormat.configureIncrementalLoad(job, table)

However, I did not manage to find anything similar in the Crunch classes (HFileTarget). Without this, LoadIncrementalHFiles.doBulkLoad(new Path(hBasePath), hTable); takes a lot of time because of resplitting the files… I am wondering how I can push Crunch to use the same strategy as pure MR uses with HFileOutputFormat.configureIncrementalLoad? Is it possible?
Here is the sample code that I use to write data into HFiles:

PCollection<Cell> cellsUsers = users.parallelDo(new DoFn<Pair<String, Integer>, Cell>() {
    @Override
    public void process(Pair<String, Integer> input, Emitter<Cell> emitter) {
        byte[] row = input.first().getBytes();
        byte[] value = String.valueOf(input.second()).getBytes();
        byte[] family = "cf1".getBytes();
        byte[] qualifier = "q1".getBytes();
        long timestamp = System.currentTimeMillis();
        Cell cell = CellUtil.createCell(row, family, qualifier, timestamp,
                KeyValue.Type.Put.getCode(), value);
        emitter.emit(cell);
    }
}, cells());
cellsUsers.write(new HFileTarget(hBaseFullPath), WriteMode.OVERWRITE);

Thanks in advance,
Dmitry.
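[Editorial note] The pre-split strategy discussed above can be sketched without an HBase cluster. Roughly speaking, HFileOutputFormat.configureIncrementalLoad arranges a total-order partition over the table's region start keys with one reducer per region, so each reducer produces HFiles for exactly one region; HBase compares row keys as unsigned lexicographic byte arrays. The class and method names below are illustrative, not part of any Crunch or HBase API; the comparison mirrors what HBase's Bytes.compareTo does for row keys, using the split points "03" and "06" from the thread:

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch (names are ours): which region owns a given row key,
// for a table pre-split at "03" and "06", i.e. regions
// [-inf, "03"), ["03", "06"), ["06", +inf).
public class RegionSplitSketch {

    // Unsigned lexicographic byte comparison, the order HBase uses for
    // row keys (same idea as org.apache.hadoop.hbase.util.Bytes.compareTo).
    public static int compare(byte[] a, byte[] b) {
        int len = Math.min(a.length, b.length);
        for (int i = 0; i < len; i++) {
            int x = a[i] & 0xFF;  // mask to 0..255: bytes compare unsigned
            int y = b[i] & 0xFF;
            if (x != y) return x - y;
        }
        // A prefix sorts before any longer key extending it.
        return a.length - b.length;
    }

    // Be explicit about the charset when turning strings into row keys.
    public static byte[] key(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // 0-based index of the region owning rowKey, given sorted split points.
    // A partitioner doing this lookup, paired with one reducer per region,
    // yields one set of HFiles per region and avoids the resplitting phase.
    public static int regionFor(byte[] rowKey, byte[][] splitPoints) {
        int region = 0;
        for (byte[] split : splitPoints) {
            if (compare(rowKey, split) >= 0) region++;
            else break;
        }
        return region;
    }

    public static void main(String[] args) {
        byte[][] splits = { key("03"), key("06") };
        for (String k : new String[] {"01", "03", "05", "07"}) {
            System.out.println(k + " -> region " + regionFor(key(k), splits));
        }
    }
}
```

Running this prints region 0 for "01", region 1 for "03" and "05", and region 2 for "07", matching the three regions listed in the thread. Note the unsigned comparison: a key byte of 0x80 sorts after 0x7F, which a signed Java byte comparison would get wrong when choosing split points.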
