Fake the key sorting portion with a custom partitioner? GroupingOptions.builder().partitionerClass()
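To make the partitioner idea concrete, here is a minimal sketch of the routing logic such a class would need, stripped of Hadoop dependencies so it stands alone. In a real pipeline this would extend org.apache.hadoop.mapreduce.Partitioner and be registered via GroupingOptions.builder().partitionerClass(...); the split keys "03" and "06" are taken from the region layout Dmitry describes below and would in practice come from the table's region start keys:

```java
// Sketch of a partitioner that routes each row key to the reducer matching
// its HBase region, mirroring what HFileOutputFormat.configureIncrementalLoad
// achieves with TotalOrderPartitioner. Class and method names are illustrative.
public class RegionBoundaryPartitioner {
    private final byte[][] splitKeys; // sorted start keys of regions 2..N

    public RegionBoundaryPartitioner(byte[][] splitKeys) {
        this.splitKeys = splitKeys;
    }

    // Returns the index of the region (and thus reducer/HFile) for this row.
    public int getPartition(byte[] rowKey) {
        int idx = 0;
        for (byte[] split : splitKeys) {
            if (compare(rowKey, split) < 0) {
                return idx;
            }
            idx++;
        }
        return idx; // last region: keys >= the highest split key
    }

    // Unsigned lexicographic comparison, the order HBase uses for row keys.
    private static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = (a[i] & 0xff) - (b[i] & 0xff);
            if (cmp != 0) return cmp;
        }
        return a.length - b.length;
    }
}
```

With splits {"03", "06"}, row "01" maps to partition 0, "04" to 1, and "07" to 2, matching the three-region layout in the thread.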
On Wed, Jan 11, 2017 at 1:17 PM Josh Wills <[email protected]> wrote:
> I think Dmitry is right, we don't have a good abstraction to do what he
> wants right now, especially as our current HFileTarget doesn't use the
> HFileOutputFormat, and Crunch needs the ability to control which reducer
> class it uses, whereas HFileOutputFormat.configureIncrementalLoad uses
> custom reducer implementations for different value types (KeyValue vs. Put,
> etc.).
>
> On Wed, Jan 11, 2017 at 8:10 AM, Dmitry Gorbatsevich <
> [email protected]> wrote:
>
> I believe the number of reducers alone will not guarantee an appropriate
> splitting strategy. The splitting strategy is a property of the HBase
> table, which should be set up manually during table creation. Something
> like:
>
> 1. Region 1: keys from 0 to 03
> 2. Region 2: keys from 03 to 06
> 3. Region 3: keys from 06 to infinity
>
> Splitting the table from the HBase API after writing out is an option.
> The LoadIncrementalHFiles.doBulkLoad command will resplit the files
> automatically, but that process takes a lot of time. I got something like
> 10 minutes to generate the HFiles plus 120 minutes to resplit them during
> the doBulkLoad command…
>
> So, I believe the best option is to split the files properly during HFile
> generation. However, we need to find a way to do this :)
>
> Thanks,
> Dmitry.
>
> From: David Ortiz <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Wednesday, January 11, 2017 at 19:01
> To: "[email protected]" <[email protected]>
> Subject: Re: How to generate 1 HFile per 1 HBase region from crunch?
>
> Dmitry,
>
> You could use the public void configure() method in the DoFn to manually
> set the number of reducers. You could also manually split the table from
> the HBase API after writing out.
>
> Dave
>
> On Wed, Jan 11, 2017 at 10:47 AM Dmitry Gorbatsevich <
> [email protected]> wrote:
>
> Hey,
>
> I am trying to use crunch to bulk load data into HBase.
> If you are using plain MR, you can push HBase to control the number of
> reducers (1 HFile per region) using the following code:
>
> HFileOutputFormat.configureIncrementalLoad(job, table)
>
> However, I did not manage to find anything similar in the crunch classes
> (HFileTarget). Without this, LoadIncrementalHFiles.doBulkLoad(new
> Path(hBasePath), hTable); takes a lot of time because of resplitting
> files…
> I am wondering how I can push crunch to use the same strategy as pure MR
> uses with HFileOutputFormat.configureIncrementalLoad? Is it possible?
>
> Here is the sample code that I use to write data into HFiles:
>
> PCollection<Cell> cellsUsers = users.parallelDo(new DoFn<Pair<String,
> Integer>, Cell>() {
>
>     @Override
>     public void process(Pair<String, Integer> input, Emitter<Cell> emitter) {
>         byte[] row = input.first().getBytes();
>         byte[] value = String.valueOf(input.second()).getBytes();
>         byte[] family = "cf1".getBytes();
>         byte[] qualifier = "q1".getBytes();
>         long timestamp = System.currentTimeMillis();
>
>         Cell cell = CellUtil.createCell(row, family, qualifier, timestamp,
>                 KeyValue.Type.Put.getCode(), value);
>
>         emitter.emit(cell);
>     }
> }, cells());
>
> cellsUsers.write(new HFileTarget(hBaseFullPath), WriteMode.OVERWRITE);
>
> Thanks in advance,
> Dmitry.
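Since the thread's conclusion is that the split strategy belongs to the table itself and should be fixed at creation time, here is a hedged sketch of one way to derive split keys from a sample of row keys so the table can be pre-created along those boundaries (e.g. via the HBase 1.x client call admin.createTable(descriptor, splitKeys)). The quantile approach, class name, and sample data are assumptions for illustration, not anything the thread prescribes:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch: pick (numRegions - 1) split keys at even quantiles of a sorted
// sample of row keys. The resulting splits would be handed to the HBase
// admin API when creating the table, and the same boundaries used to
// partition the HFiles, so doBulkLoad has nothing to resplit.
public class SplitKeyPicker {
    public static List<String> pickSplits(List<String> sampleKeys, int numRegions) {
        List<String> sorted = new ArrayList<>(sampleKeys);
        Collections.sort(sorted);
        List<String> splits = new ArrayList<>();
        for (int i = 1; i < numRegions; i++) {
            // Take the key sitting at the i-th quantile boundary.
            splits.add(sorted.get(i * sorted.size() / numRegions));
        }
        return splits;
    }
}
```

For a sample of keys "00" through "08" and three regions, this yields the splits "03" and "06", reproducing the example region layout quoted above.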
