Need to import HipChat's palette for email, clearly.

On Thu, Jan 12, 2017 at 11:11 AM Josh Wills <[email protected]> wrote:
There's never a facepalm emoji around when you need one.

On Wed, Jan 11, 2017 at 11:52 PM Gabriel Reid <[email protected]> wrote:

Doesn't o.a.c.io.hbase.HFileUtils#writeToHFilesForIncrementalLoad do exactly this?

If you have a PCollection of Cells and call HFileUtils.writeToHFilesForIncrementalLoad with the PCollection and an output path, the PCollection will be written to the output path partitioned on region, and LoadIncrementalHFiles.doBulkLoad will be able to load the files directly without any re-splitting (unless the composition of regions has changed in the meantime).

- Gabriel

On Wed, Jan 11, 2017 at 7:26 PM, David Ortiz <[email protected]> wrote:

Fake the key sorting portion with a custom partitioner? GroupingOptions.builder().partitionerClass()

On Wed, Jan 11, 2017 at 1:17 PM Josh Wills <[email protected]> wrote:

I think Dmitry is right; we don't have a good abstraction to do what he wants right now, especially as our current HFileTarget doesn't use the HFileOutputFormat, and Crunch needs the ability to control which reducer class it uses, whereas HFileOutputFormat.configureIncrementalLoad uses custom reducer implementations for different value types (KeyValue vs. Put, etc.).

On Wed, Jan 11, 2017 at 8:10 AM, Dmitry Gorbatsevich <[email protected]> wrote:

I believe the number of reducers will not guarantee an appropriate splitting strategy. The splitting strategy is a property of the HBase table, which should be set up manually during table creation. Something like:

Region 1: keys from 0 to 03
Region 2: keys from 03 to 06
Region 3: keys from 06 to infinity

Splitting the table via the HBase API after writing out is an option; the LoadIncrementalHFiles.doBulkLoad command will resplit files automatically. But this process takes a lot of time. I got something like 10 minutes to generate the HFiles plus 120 minutes to resplit them during the doBulkLoad command…

So I believe the best option is to split the files properly during HFile generation. However, we need to find a way to do this :)

Thanks,
Dmitry.

From: David Ortiz <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, January 11, 2017 at 19:01
To: "[email protected]" <[email protected]>
Subject: Re: How to generate 1 HFile per 1 HBase region from crunch?

Dmitry,

You could use the public void configure() method in the DoFn to manually set the number of reducers. You could also manually split the table from the HBase API after writing out.

Dave

On Wed, Jan 11, 2017 at 10:47 AM Dmitry Gorbatsevich <[email protected]> wrote:

Hey,

I am trying to use Crunch to bulk load data into HBase. If you are using plain MR, you can have HBase control the number of reducers (1 HFile per region) using the following code:

HFileOutputFormat.configureIncrementalLoad(job, table)

However, I did not manage to find anything similar among the Crunch classes (HFileTarget). Without this, LoadIncrementalHFiles.doBulkLoad(new Path(hBasePath), hTable) takes a lot of time because of resplitting files… I am wondering how I can get Crunch to use the same strategy that pure MR uses with HFileOutputFormat.configureIncrementalLoad? Is it possible?
Here is the sample code that I use to write data into HFiles:

PCollection<Cell> cellsUsers = users.parallelDo(new DoFn<Pair<String, Integer>, Cell>() {

  @Override
  public void process(Pair<String, Integer> input, Emitter<Cell> emitter) {
    byte[] row = input.first().getBytes();
    byte[] value = String.valueOf(input.second()).getBytes();
    byte[] family = "cf1".getBytes();
    byte[] qualifier = "q1".getBytes();
    long timestamp = System.currentTimeMillis();

    Cell cell = CellUtil.createCell(row, family, qualifier, timestamp,
        KeyValue.Type.Put.getCode(), value);

    emitter.emit(cell);
  }
}, cells());

cellsUsers.write(new HFileTarget(hBaseFullPath), WriteMode.OVERWRITE);

Thanks in advance,
Dmitry.
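[Editor's note] Gabriel's suggestion above can be sketched roughly as follows. This is a minimal sketch, not tested against a cluster: it assumes a Crunch version whose HFileUtils exposes writeToHFilesForIncrementalLoad(PCollection<C extends Cell>, HTable, Path), and the class name BulkLoadSketch, the method name, and the table/path parameters are all illustrative.

```java
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.io.hbase.HFileUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class BulkLoadSketch {

  /**
   * Writes cells out as one HFile per region, then bulk-loads them
   * without the expensive resplitting step.
   */
  public static void writeAndLoad(Pipeline pipeline, PCollection<Cell> cells,
      String tableName, String outputDir) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, tableName);
    Path hfilePath = new Path(outputDir);

    // Partitions the cells on the table's current region boundaries,
    // producing one HFile per region under outputDir.
    HFileUtils.writeToHFilesForIncrementalLoad(cells, table, hfilePath);
    pipeline.done();

    // The files already line up with the region layout, so doBulkLoad
    // loads them directly (unless regions changed since the write).
    new LoadIncrementalHFiles(conf).doBulkLoad(hfilePath, table);
    table.close();
  }
}
```

The cells argument would be the PCollection<Cell> built in the sample code above (cellsUsers), replacing the plain HFileTarget write.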
