Thanks, guys! Now I have some performance issues using this approach, but that's another story :)
Dmitry.

From: David Ortiz <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Thursday, January 12, 2017 at 19:13
To: "[email protected]" <[email protected]>
Subject: Re: How to generate 1 HFile per 1 HBase region from crunch?

Need to import hipchat's palette for email, clearly.

On Thu, Jan 12, 2017 at 11:11 AM Josh Wills <[email protected]> wrote:

There's never a facepalm emoji around when you need one.

On Wed, Jan 11, 2017 at 11:52 PM Gabriel Reid <[email protected]> wrote:

Doesn't o.a.c.io.hbase.HFileUtils#writeToHFilesForIncrementalLoad do exactly this?

If you have a PCollection of Cells and call HFileUtils.writeToHFilesForIncrementalLoad with the PCollection and an output path, the PCollection will be written to the output path partitioned by region, and LoadIncrementalHFiles.doBulkLoad will be able to load the files directly without any re-splitting (unless the composition of regions has changed in the meantime).

- Gabriel

On Wed, Jan 11, 2017 at 7:26 PM, David Ortiz <[email protected]> wrote:
> Fake the key sorting portion with a custom partitioner?
> GroupingOptions.builder().partitionerClass()
>
> On Wed, Jan 11, 2017 at 1:17 PM Josh Wills <[email protected]> wrote:
>>
>> I think Dmitry is right: we don't have a good abstraction to do what he
>> wants right now, especially as our current HFileTarget doesn't use
>> HFileOutputFormat, and Crunch needs the ability to control which reducer
>> class it uses, whereas HFileOutputFormat.configureIncrementalLoad uses
>> custom reducer implementations for different value types (KeyValue vs. Put,
>> etc.).
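[Editor's note: Gabriel's suggestion can be sketched end-to-end as below. This is a sketch only, not runnable standalone — it assumes a live HBase cluster, the crunch-hbase and hbase-client jars on the classpath, and the `writeToHFilesForIncrementalLoad(PCollection, HTable, Path)` shape of the API from the Crunch/HBase versions of that era (check the javadoc for your version); the wrapper name `writeAndBulkLoad` is ours.]

```java
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.io.hbase.HFileUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class BulkLoadSketch {

  // Hypothetical wrapper: write one HFile per region, then bulk-load them.
  static void writeAndBulkLoad(Pipeline pipeline, PCollection<Cell> cells,
                               HTable table, Path out, Configuration conf)
      throws Exception {
    // Partitions `cells` on the table's current region boundaries, so the
    // output HFiles line up with the regions.
    HFileUtils.writeToHFilesForIncrementalLoad(cells, table, out);

    // The write above is deferred until the pipeline actually runs.
    pipeline.done();

    // Because the files already match region boundaries, doBulkLoad just
    // moves them into place instead of re-splitting them.
    new LoadIncrementalHFiles(conf).doBulkLoad(out, table);
  }
}
```

As Gabriel notes, this only holds if the region layout has not changed between writing and loading.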
>>
>> On Wed, Jan 11, 2017 at 8:10 AM, Dmitry Gorbatsevich <[email protected]> wrote:
>>>
>>> I believe the number of reducers will not guarantee an appropriate splitting
>>> strategy. The splitting strategy is a property of the HBase table, which
>>> should be set up manually during table creation. Something like:
>>>
>>> Region 1: keys from 0 to 03
>>> Region 2: keys from 03 to 06
>>> Region 3: keys from 06 to infinity
>>>
>>> Splitting the table from the HBase API after writing out is an option.
>>> The LoadIncrementalHFiles.doBulkLoad command will re-split the files
>>> automatically, but this process takes a lot of time. I got something like
>>> 10 minutes to generate the HFiles + 120 minutes to re-split them during the
>>> doBulkLoad command…
>>>
>>> So I believe the best option is to split the files properly during HFile
>>> generation. However, we need to find a way to do this :)
>>>
>>> Thanks,
>>> Dmitry.
>>>
>>> From: David Ortiz <[email protected]>
>>> Reply-To: "[email protected]" <[email protected]>
>>> Date: Wednesday, January 11, 2017 at 19:01
>>> To: "[email protected]" <[email protected]>
>>> Subject: Re: How to generate 1 HFile per 1 HBase region from crunch?
>>>
>>> Dmitry,
>>>
>>> You could use the public void configure() method in the DoFn to manually
>>> set the number of reducers. You could also manually split the table from
>>> the HBase API after writing out.
>>>
>>> Dave
>>>
>>> On Wed, Jan 11, 2017 at 10:47 AM Dmitry Gorbatsevich <[email protected]> wrote:
>>>>
>>>> Hey,
>>>>
>>>> I am trying to use crunch to bulk-load data into HBase.
>>>> If you are using plain MR, you can push HBase to control the number of
>>>> reducers (1 HFile per region) using the following code:
>>>>
>>>> HFileOutputFormat.configureIncrementalLoad(job, table)
>>>>
>>>> However, I did not manage to find anything similar among the crunch
>>>> classes (HFileTarget). Without this, LoadIncrementalHFiles.doBulkLoad(new
>>>> Path(hBasePath), hTable); takes a lot of time because of re-splitting the
>>>> files…
>>>> I am wondering how I can push crunch to use the same strategy as plain MR
>>>> uses with HFileOutputFormat.configureIncrementalLoad. Is it possible?
>>>>
>>>> Here is the sample code that I use to write data into HFiles:
>>>>
>>>> PCollection<Cell> cellsUsers = users.parallelDo(new DoFn<Pair<String, Integer>, Cell>() {
>>>>
>>>>   @Override
>>>>   public void process(Pair<String, Integer> input, Emitter<Cell> emitter) {
>>>>     byte[] row = input.first().getBytes();
>>>>     byte[] value = String.valueOf(input.second()).getBytes();
>>>>     byte[] family = "cf1".getBytes();
>>>>     byte[] qualifier = "q1".getBytes();
>>>>     long timestamp = System.currentTimeMillis();
>>>>
>>>>     Cell cell = CellUtil.createCell(row, family, qualifier, timestamp,
>>>>         KeyValue.Type.Put.getCode(), value);
>>>>
>>>>     emitter.emit(cell);
>>>>   }
>>>> }, cells());
>>>>
>>>> cellsUsers.write(new HFileTarget(hBaseFullPath), WriteMode.OVERWRITE);
>>>>
>>>> Thanks in advance,
>>>> Dmitry.
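[Editor's note: the pre-split layout Dmitry describes earlier in the thread (region 1: keys up to "03", region 2: "03" to "06", region 3: "06" onward) can be created up front at table-creation time instead of being re-split during the load. A minimal sketch, assuming two-digit zero-padded ASCII row-key prefixes; the helper name `asciiSplits` and the step/region parameters are illustrative, not from the thread. Only the split-key computation is self-contained; the `createTable` call shown in a comment needs a cluster and the hbase-client jar.]

```java
import java.util.ArrayList;
import java.util.List;

public class SplitKeys {

  // N regions need N-1 split keys; each key becomes the start key of a region.
  // For step=3, regions=3 this yields "03" and "06", matching the example
  // layout in the thread (0..03, 03..06, 06..infinity).
  static byte[][] asciiSplits(int step, int regions) {
    List<byte[]> splits = new ArrayList<>();
    for (int i = 1; i < regions; i++) {
      // Two-digit zero-padded ASCII prefix, e.g. "03", "06".
      splits.add(String.format("%02d", i * step).getBytes());
    }
    return splits.toArray(new byte[0][]);
  }

  // Against a live cluster, these keys would be passed when creating the
  // table, e.g. (HBase 1.x admin API):
  //   admin.createTable(tableDescriptor, asciiSplits(3, 3));
}
```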
