Have you reviewed this? http://hbase.apache.org/book.html#mapreduce.example
I'm planning to add more examples in this chapter, but there is some sample
code to review.

On 8/18/11 4:18 AM, "abhay ratnaparkhi" <[email protected]> wrote:

>Thank you for all this information.
>Can you give me any example where I have only a map task and I can put
>data in HBase from the map?
>I tried the following settings.
>
>  job = new Job(conf, "Bulk Processing - Only Map.");
>  job.setNumReduceTasks(0);
>  job.setJarByClass(MyBulkDataLoader.class);
>  //job.setMapOutputKeyClass(ImmutableBytesWritable.class);
>  //job.setMapOutputValueClass(ImmutableBytesWritable.class);
>  job.setOutputKeyClass(ImmutableBytesWritable.class);
>  job.setOutputValueClass(Put.class);
>  job.setOutputFormatClass(TableOutputFormat.class);
>  Scan scan = new Scan();
>  TableMapReduceUtil.initTableMapperJob(INPUT_TABLE_NAME, scan,
>      MyBulkLoaderMapper.class, ImmutableBytesWritable.class,
>      Put.class, job);
>  //TableMapReduceUtil.initTableReducerJob(OUTPUT_TABLE_NAME,
>  //    IdentityTableReducer.class, job);
>  LOG.info("Started " + INPUT_TABLE_NAME);
>  job.waitForCompletion(true);
>
>From the map class I am doing...
>  context.write(new
>      ImmutableBytesWritable(Bytes.toBytes(OUTPUT_TABLE_NAME)),
>      p);  // p is an instance of Put.
>
>Previously I was using "IdentityTableReducer". As a reduce step is not
>required for bulk loading, I only need to insert data into HBase through
>the map phase.
>Where can I give the output table name?
>If you can give me any example that only has a map task and HBase as a
>source and sink, that will be helpful.
>
>Thank you,
>Abhay
>
>On Tue, Aug 9, 2011 at 4:51 AM, Stack <[email protected]> wrote:
>
>> The doc here suggests avoiding reduce:
>>
>> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html#sink
>>
>> St.Ack
>>
>> On Fri, Aug 5, 2011 at 2:19 AM, Doug Meil
>> <[email protected]> wrote:
>> >
>> > It's not obvious to a lot of newer folks that an MR job can exist
>> > minus the R.
>> >
>> > On 8/4/11 5:52 PM, "Michael Segel" <[email protected]> wrote:
>> >
>> >> Uhm, silly question...
>> >>
>> >> Why would you ever need a reduce step when you're writing to an
>> >> HBase table?
>> >>
>> >> Now I'm sure that there may be some fringe case, but in the past two
>> >> years I've never come across a case where you would need a reducer
>> >> when you're writing to HBase.
>> >>
>> >> So what am I missing?
>> >>
>> >>> From: [email protected]
>> >>> To: [email protected]
>> >>> Date: Thu, 4 Aug 2011 11:18:57 -0400
>> >>> Subject: Re: loading data in HBase table using APIs
>> >>>
>> >>> David, thanks for the tip on this. I just checked in a reorg of the
>> >>> performance chapter and included this tip.
>> >>>
>> >>> Stack does the website updating so it's not visible yet, but this
>> >>> tip is in there.
>> >>>
>> >>> Thanks!
>> >>>
>> >>> On 7/18/11 6:18 PM, "Buttler, David" <[email protected]> wrote:
>> >>>
>> >>> >After a quick scan of the performance section, I didn't see what I
>> >>> >consider to be a huge performance consideration:
>> >>> >If at all possible, don't do a reduce on your puts. The
>> >>> >shuffle/sort part of the map/reduce paradigm is often useless if
>> >>> >all you are trying to do is insert/update data in HBase. From the
>> >>> >OP's description it sounds like he doesn't need any kind of reduce
>> >>> >phase [and may be a great candidate for bulk loading and the
>> >>> >pre-creation of regions]. In any case, don't reduce if you can
>> >>> >avoid it.
>> >>> >
>> >>> >Dave
>> >>> >
>> >>> >-----Original Message-----
>> >>> >From: Doug Meil [mailto:[email protected]]
>> >>> >Sent: Sunday, July 17, 2011 4:40 PM
>> >>> >To: [email protected]
>> >>> >Subject: Re: loading data in HBase table using APIs
>> >>> >
>> >>> >Hi there-
>> >>> >
>> >>> >Take a look at this for starters:
>> >>> >http://hbase.apache.org/book.html#schema
>> >>> >
>> >>> >1) Double-check your row-keys (sanity check); that's in the Schema
>> >>> >Design chapter.
>> >>> >
>> >>> >http://hbase.apache.org/book.html#performance
>> >>> >
>> >>> >2) If not using bulk-load, pre-create regions; do this regardless
>> >>> >of using MR or non-MR.
>> >>> >
>> >>> >3) If not using an MR job and you are using multiple threads with
>> >>> >the Java API, take a look at HTableUtil. It's on trunk, but that
>> >>> >utility can help you.
>> >>> >
>> >>> >On 7/17/11 4:08 PM, "abhay ratnaparkhi"
>> >>> ><[email protected]> wrote:
>> >>> >
>> >>> >>Hello,
>> >>> >>
>> >>> >>I am loading lots of data through the API into an HBase table.
>> >>> >>I am using the HBase Java API to do this.
>> >>> >>If I convert this code to a map-reduce task and use the
>> >>> >>*TableOutputFormat* class, will I get any performance improvement?
>> >>> >>
>> >>> >>As I am not getting input data from an existing HBase table or
>> >>> >>HDFS files, there will not be any input to the map task.
>> >>> >>The only advantage is that multiple map tasks running
>> >>> >>simultaneously might make processing faster.
>> >>> >>
>> >>> >>Thanks!
>> >>> >>Regards,
>> >>> >>Abhay
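For reference, the map-only "HBase as source and sink" pattern the thread asks about can be sketched roughly as below. This is a sketch against the HBase client API of that era, not code from the thread: the table names `sourceTable` and `targetTable` and the class names `MapOnlyCopy`/`CopyMapper` are placeholders. The key point for the OP's question is that the output table name is given to `TableMapReduceUtil.initTableReducerJob()` (a `null` reducer class wires up `TableOutputFormat` without adding a reduce phase), not encoded in the mapper's output key.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;

public class MapOnlyCopy {

  // Reads rows from the source table and emits Puts directly; no reducer.
  static class CopyMapper extends TableMapper<ImmutableBytesWritable, Put> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value,
        Context context) throws IOException, InterruptedException {
      Put put = new Put(row.get());
      // Copy every cell of the source row; replace with real transform logic.
      for (KeyValue kv : value.list()) {
        put.add(kv);
      }
      context.write(row, put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "Map-only HBase copy");
    job.setJarByClass(MapOnlyCopy.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // larger scanner caching for MR scans
    scan.setCacheBlocks(false);  // don't pollute the block cache

    // Source table + mapper.
    TableMapReduceUtil.initTableMapperJob("sourceTable", scan,
        CopyMapper.class, ImmutableBytesWritable.class, Put.class, job);

    // Sink: the output table name goes here. A null reducer class sets up
    // TableOutputFormat without requiring an actual reduce step.
    TableMapReduceUtil.initTableReducerJob("targetTable", null, job);
    job.setNumReduceTasks(0);    // map-only: Puts go straight to the sink

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Note that this is the shape the package-summary doc linked by Stack describes: the Puts emitted by the mapper are written by `TableOutputFormat` on the map side, so no shuffle/sort happens at all.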
