Fake the key sorting portion with a custom partitioner? GroupingOptions.builder().partitionerClass()
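To make the partitioner idea concrete, here is a minimal sketch of the routing logic such a class would need, stripped of Hadoop dependencies so it stands alone. In a real pipeline this would extend org.apache.hadoop.mapreduce.Partitioner and be registered via GroupingOptions.builder().partitionerClass(...); the split keys "03" and "06" are taken from the region layout Dmitry describes below and would in practice come from the table's region start keys:

```java
// Sketch of a partitioner that routes each row key to the reducer matching
// its HBase region, mirroring what HFileOutputFormat.configureIncrementalLoad
// achieves with TotalOrderPartitioner. Class and method names are illustrative.
public class RegionBoundaryPartitioner {
    private final byte[][] splitKeys; // sorted start keys of regions 2..N

    public RegionBoundaryPartitioner(byte[][] splitKeys) {
        this.splitKeys = splitKeys;
    }

    // Returns the index of the region (and thus reducer/HFile) for this row.
    public int getPartition(byte[] rowKey) {
        int idx = 0;
        for (byte[] split : splitKeys) {
            if (compare(rowKey, split) < 0) {
                return idx;
            }
            idx++;
        }
        return idx; // last region: keys >= the highest split key
    }

    // Unsigned lexicographic comparison, the order HBase uses for row keys.
    private static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = (a[i] & 0xff) - (b[i] & 0xff);
            if (cmp != 0) return cmp;
        }
        return a.length - b.length;
    }
}
```

With splits {"03", "06"}, row "01" maps to partition 0, "04" to 1, and "07" to 2, matching the three-region layout in the thread.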
On Wed, Jan 11, 2017 at 1:17 PM Josh Wills <[email protected]> wrote:
> I think Dmitry is right, we don't have a good abstraction to do what he
> wants right now, especially as our current HFileTarget doesn't use the
> HFileOutputFormat, and Crunch needs the ability to control which reducer
> class it uses, whereas HFileOutputFormat.configureIncrementalLoad uses
> custom reducer implementations for different value types (KeyValue vs. Put,
> etc.).
>
> On Wed, Jan 11, 2017 at 8:10 AM, Dmitry Gorbatsevich <
> [email protected]> wrote:
>
> I believe the number of reducers alone will not guarantee an appropriate
> splitting strategy. The splitting strategy is a property of the HBase
> table, which should be set up manually during table creation. Something
> like:
>
> 1. Region 1: keys from 0 to 03
> 2. Region 2: keys from 03 to 06
> 3. Region 3: keys from 06 to infinity
>
> Splitting the table from the HBase API after writing out is an option.
> The LoadIncrementalHFiles.doBulkLoad command will resplit the files
> automatically, but that process takes a lot of time. I got something like
> 10 minutes to generate the HFiles plus 120 minutes to resplit them during
> the doBulkLoad command…
>
> So, I believe the best option is to split the files properly during HFile
> generation. However, we need to find a way to do this :)
>
> Thanks,
> Dmitry.
>
> From: David Ortiz <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Wednesday, January 11, 2017 at 19:01
> To: "[email protected]" <[email protected]>
> Subject: Re: How to generate 1 HFile per 1 HBase region from crunch?
>
> Dmitry,
>
> You could use the public void configure() method in the DoFn to manually
> set the number of reducers. You could also manually split the table from
> the HBase API after writing out.
>
> Dave
>
> On Wed, Jan 11, 2017 at 10:47 AM Dmitry Gorbatsevich <
> [email protected]> wrote:
>
> Hey,
>
> I am trying to use crunch to bulk load data into HBase.
> If you are using plain MR, you can push HBase to control the number of
> reducers (1 HFile per region) using the following code:
>
> HFileOutputFormat.configureIncrementalLoad(job, table)
>
> However, I did not manage to find anything similar in the crunch classes
> (HFileTarget). Without this, LoadIncrementalHFiles.doBulkLoad(new
> Path(hBasePath), hTable); takes a lot of time because of resplitting
> files…
> I am wondering how I can push crunch to use the same strategy as pure MR
> uses with HFileOutputFormat.configureIncrementalLoad? Is it possible?
>
> Here is the sample code that I use to write data into HFiles:
>
> PCollection<Cell> cellsUsers = users.parallelDo(new DoFn<Pair<String,
> Integer>, Cell>() {
>
>     @Override
>     public void process(Pair<String, Integer> input, Emitter<Cell> emitter) {
>         byte[] row = input.first().getBytes();
>         byte[] value = String.valueOf(input.second()).getBytes();
>         byte[] family = "cf1".getBytes();
>         byte[] qualifier = "q1".getBytes();
>         long timestamp = System.currentTimeMillis();
>
>         Cell cell = CellUtil.createCell(row, family, qualifier, timestamp,
>                 KeyValue.Type.Put.getCode(), value);
>
>         emitter.emit(cell);
>     }
> }, cells());
>
> cellsUsers.write(new HFileTarget(hBaseFullPath), WriteMode.OVERWRITE);
>
> Thanks in advance,
> Dmitry.
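Since the thread's conclusion is that the split strategy belongs to the table itself and should be fixed at creation time, here is a hedged sketch of one way to derive split keys from a sample of row keys so the table can be pre-created along those boundaries (e.g. via the HBase 1.x client call admin.createTable(descriptor, splitKeys)). The quantile approach, class name, and sample data are assumptions for illustration, not anything the thread prescribes:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch: pick (numRegions - 1) split keys at even quantiles of a sorted
// sample of row keys. The resulting splits would be handed to the HBase
// admin API when creating the table, and the same boundaries used to
// partition the HFiles, so doBulkLoad has nothing to resplit.
public class SplitKeyPicker {
    public static List<String> pickSplits(List<String> sampleKeys, int numRegions) {
        List<String> sorted = new ArrayList<>(sampleKeys);
        Collections.sort(sorted);
        List<String> splits = new ArrayList<>();
        for (int i = 1; i < numRegions; i++) {
            // Take the key sitting at the i-th quantile boundary.
            splits.add(sorted.get(i * sorted.size() / numRegions));
        }
        return splits;
    }
}
```

For a sample of keys "00" through "08" and three regions, this yields the splits "03" and "06", reproducing the example region layout quoted above.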
