The doc here suggests avoiding reduce:
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html#sink

St.Ack

On Fri, Aug 5, 2011 at 2:19 AM, Doug Meil <[email protected]> wrote:
>
> It's not obvious to a lot of newer folks that an MR job can exist minus
> the R.
>
> On 8/4/11 5:52 PM, "Michael Segel" <[email protected]> wrote:
>
>>
>>Uhm Silly question...
>>
>>Why would you ever need a reduce step when you're writing to an HBase
>>table?
>>
>>Now I'm sure that there may be some fringe case, but in the past two
>>years, I've never come across a case where you would need to do a reducer
>>when you're writing to HBase.
>>
>>So what am I missing?
>>
>>> From: [email protected]
>>> To: [email protected]
>>> Date: Thu, 4 Aug 2011 11:18:57 -0400
>>> Subject: Re: loading data in HBase table using APIs
>>>
>>> David, thanks for the tip on this. I just checked in a reorg to the
>>> performance chapter and included this tip.
>>>
>>> Stack does the website updating so it's not visible yet, but this tip is
>>> in there.
>>>
>>> Thanks!
>>>
>>> On 7/18/11 6:18 PM, "Buttler, David" <[email protected]> wrote:
>>>
>>> >After a quick scan of the performance section, I didn't see what I
>>> >consider to be a huge performance consideration:
>>> >If at all possible, don't do a reduce on your puts. The shuffle/sort
>>> >part of the map/reduce paradigm is often useless if all you are trying
>>> >to do is insert/update data in HBase. From the OP's description it
>>> >sounds like he doesn't need to have any kind of reduce phase [and may
>>> >be a great candidate for bulk loading and the pre-creation of regions].
>>> >In any case, don't reduce if you can avoid it.
>>> >
>>> >Dave
>>> >
>>> >-----Original Message-----
>>> >From: Doug Meil [mailto:[email protected]]
>>> >Sent: Sunday, July 17, 2011 4:40 PM
>>> >To: [email protected]
>>> >Subject: Re: loading data in HBase table using APIs
>>> >
>>> >Hi there-
>>> >
>>> >Take a look at this for starters:
>>> >http://hbase.apache.org/book.html#schema
>>> >
>>> >1) double-check your row-keys (sanity check), that's in the Schema
>>> >Design chapter.
>>> >
>>> >http://hbase.apache.org/book.html#performance
>>> >
>>> >2) if not using bulk-load - pre-create regions, do this regardless of
>>> >using MR or non-MR.
>>> >
>>> >3) if not using MR job and are using multiple threads with the Java
>>> >API, take a look at HTableUtil. It's on trunk, but that utility can
>>> >help you.
>>> >
>>> >On 7/17/11 4:08 PM, "abhay ratnaparkhi" <[email protected]>
>>> >wrote:
>>> >
>>> >>Hello,
>>> >>
>>> >>I am loading lots of data through API in HBase table.
>>> >>I am using HBase Java API to do this.
>>> >>If I convert this code to map-reduce task and use *TableOutputFormat*
>>> >>class then will I get any performance improvement?
>>> >>
>>> >>As I am not getting input data from existing HBase table or HDFS files
>>> >>there will not be any input to map task.
>>> >>The only advantage is multiple map tasks running simultaneously might
>>> >>make processing faster.
>>> >>
>>> >>Thanks!
>>> >>Regards,
>>> >>Abhay
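
On the region pre-creation point raised in the quoted thread, here is a minimal sketch of how split keys might be computed before the table is created. It assumes (the thread does not say this) that the first byte of each row key is roughly uniformly distributed, for example because keys are hashed or salted. RegionSplitKeys and evenSplits are hypothetical names for illustration, not part of HBase.

```java
/**
 * Hypothetical helper (not an HBase class): computes single-byte split keys
 * that divide the first-byte key space 0x00..0xFF into equal-sized ranges.
 * Assumption: the first byte of each row key is uniformly distributed.
 */
final class RegionSplitKeys {

    private RegionSplitKeys() {
        // Static utility; no instances.
    }

    /**
     * Returns numRegions - 1 split keys, in ascending unsigned-byte order.
     * Each key is the inclusive start of the next region.
     */
    static byte[][] evenSplits(int numRegions) {
        if (numRegions < 1 || numRegions > 256) {
            throw new IllegalArgumentException(
                "numRegions must be in [1, 256]: " + numRegions);
        }
        byte[][] splits = new byte[numRegions - 1][];
        for (int i = 1; i < numRegions; i++) {
            // Boundary i sits at i/numRegions of the 256-value byte range.
            int boundary = (i * 256) / numRegions;
            splits[i - 1] = new byte[] {(byte) boundary};
        }
        return splits;
    }

    public static void main(String[] args) {
        // Print the boundaries for a table pre-split into four regions.
        for (byte[] key : evenSplits(4)) {
            System.out.printf("split key: 0x%02X%n", key[0] & 0xFF);
        }
    }
}
```

For four regions the boundaries come out as 0x40, 0x80 and 0xC0. An array like this is the kind of argument one would pass to the HBase admin createTable overload that accepts split keys as a byte[][]; check the API docs for your version before relying on that. If your row keys are not evenly spread over the first byte, these boundaries will not balance the load and you would want to derive splits from a sample of real keys instead.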
