Need to import HipChat's palette for email, clearly.

On Thu, Jan 12, 2017 at 11:11 AM Josh Wills <[email protected]> wrote:
There's never a facepalm emoji around when you need one.

On Wed, Jan 11, 2017 at 11:52 PM Gabriel Reid <[email protected]> wrote:

Doesn't o.a.c.io.hbase.HFileUtils#writeToHFilesForIncrementalLoad do exactly this?

If you have a PCollection of Cells and call HFileUtils.writeToHFilesForIncrementalLoad with the PCollection and an output path, the PCollection will be written to the output path partitioned on region, and LoadIncrementalHFiles.doBulkLoad will be able to load the files directly without any re-splitting (unless the composition of regions has changed in the meantime).

- Gabriel

On Wed, Jan 11, 2017 at 7:26 PM, David Ortiz <[email protected]> wrote:

Fake the key sorting portion with a custom partitioner? GroupingOptions.builder().partitionerClass()

On Wed, Jan 11, 2017 at 1:17 PM Josh Wills <[email protected]> wrote:

I think Dmitry is right; we don't have a good abstraction to do what he wants right now, especially as our current HFileTarget doesn't use the HFileOutputFormat, and Crunch needs the ability to control which reducer class it uses, whereas HFileOutputFormat.configureIncrementalLoad uses custom reducer implementations for different value types (KeyValue vs. Put, etc.).

On Wed, Jan 11, 2017 at 8:10 AM, Dmitry Gorbatsevich <[email protected]> wrote:

I believe the number of reducers will not guarantee an appropriate splitting strategy. The splitting strategy is a property of the HBase table, which should be set up manually during table creation. Something like:

Region 1: keys from 0 to 03
Region 2: keys from 03 to 06
Region 3: keys from 06 to infinity

Splitting the table via the HBase API after writing out is an option; the LoadIncrementalHFiles.doBulkLoad command will resplit files automatically. But this process takes a lot of time. I got something like 10 minutes to generate the HFiles plus 120 minutes to resplit them during the doBulkLoad command…

So I believe the best option is to split the files properly during HFile generation. However, we need to find a way to do this :)

Thanks,
Dmitry.

From: David Ortiz <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, January 11, 2017 at 19:01
To: "[email protected]" <[email protected]>
Subject: Re: How to generate 1 HFile per 1 HBase region from crunch?

Dmitry,

You could use the public void configure() method in the DoFn to manually set the number of reducers. You could also manually split the table from the HBase API after writing out.

Dave

On Wed, Jan 11, 2017 at 10:47 AM Dmitry Gorbatsevich <[email protected]> wrote:

Hey,

I am trying to use Crunch to bulk load data into HBase. If you are using plain MR, you can have HBase control the number of reducers (1 HFile per region) using the following code:

HFileOutputFormat.configureIncrementalLoad(job, table)

However, I did not manage to find anything similar among the Crunch classes (HFileTarget). Without this, LoadIncrementalHFiles.doBulkLoad(new Path(hBasePath), hTable) takes a lot of time because of resplitting files… I am wondering how I can get Crunch to use the same strategy that pure MR uses with HFileOutputFormat.configureIncrementalLoad? Is it possible?
Here is the sample code that I use to write data into HFiles:

PCollection<Cell> cellsUsers = users.parallelDo(new DoFn<Pair<String, Integer>, Cell>() {

  @Override
  public void process(Pair<String, Integer> input, Emitter<Cell> emitter) {
    byte[] row = input.first().getBytes();
    byte[] value = String.valueOf(input.second()).getBytes();
    byte[] family = "cf1".getBytes();
    byte[] qualifier = "q1".getBytes();
    long timestamp = System.currentTimeMillis();

    Cell cell = CellUtil.createCell(row, family, qualifier, timestamp,
        KeyValue.Type.Put.getCode(), value);

    emitter.emit(cell);
  }
}, cells());

cellsUsers.write(new HFileTarget(hBaseFullPath), WriteMode.OVERWRITE);

Thanks in advance,
Dmitry.
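[Editor's note] Gabriel's suggestion above can be sketched roughly as follows. This is a minimal sketch, not tested against a cluster: it assumes a Crunch version whose HFileUtils exposes writeToHFilesForIncrementalLoad(PCollection<C extends Cell>, HTable, Path), and the class name BulkLoadSketch, the method name, and the table/path parameters are all illustrative.

```java
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.io.hbase.HFileUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class BulkLoadSketch {

  /**
   * Writes cells out as one HFile per region, then bulk-loads them
   * without the expensive resplitting step.
   */
  public static void writeAndLoad(Pipeline pipeline, PCollection<Cell> cells,
      String tableName, String outputDir) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, tableName);
    Path hfilePath = new Path(outputDir);

    // Partitions the cells on the table's current region boundaries,
    // producing one HFile per region under outputDir.
    HFileUtils.writeToHFilesForIncrementalLoad(cells, table, hfilePath);
    pipeline.done();

    // The files already line up with the region layout, so doBulkLoad
    // loads them directly (unless regions changed since the write).
    new LoadIncrementalHFiles(conf).doBulkLoad(hfilePath, table);
    table.close();
  }
}
```

The cells argument would be the PCollection<Cell> built in the sample code above (cellsUsers), replacing the plain HFileTarget write.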
