Uhm, silly question...

Why would you ever need a reduce step when you're writing to an HBase table?

Now I'm sure there may be some fringe case, but in the past two years 
I've never come across a case where you would need a reducer when you're 
writing to HBase.

So what am I missing?
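
To make the point concrete, here is a rough sketch in plain Java, not real HBase code. FakeTable and mapTask are hypothetical names made up for illustration; the TreeMap only stands in for the fact that HBase keeps rows sorted by row key on the server side. The idea is that puts arriving from map tasks in any order still end up in row-key order, so a shuffle/sort before the write buys nothing. In an actual Hadoop job, the map-only setup would be requested with job.setNumReduceTasks(0).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class MapOnlyWriteSketch {

    // Hypothetical stand-in for an HBase table: rows are kept sorted by row key,
    // no matter in which order the puts arrive.
    static final class FakeTable {
        private final TreeMap<String, String> rows = new TreeMap<>();

        synchronized void put(String rowKey, String value) {
            rows.put(rowKey, value);
        }

        List<String> rowKeysInOrder() {
            return new ArrayList<>(rows.keySet());
        }
    }

    // Hypothetical "map task": writes each {rowKey, value} pair straight to the
    // table, with no shuffle, sort, or reduce step in between.
    static void mapTask(String[][] records, FakeTable table) {
        for (String[] record : records) {
            table.put(record[0], record[1]);
        }
    }

    public static void main(String[] args) {
        FakeTable table = new FakeTable();
        // Two "map tasks" emit rows in arbitrary order.
        mapTask(new String[][] {{"row3", "c"}, {"row1", "a"}}, table);
        mapTask(new String[][] {{"row2", "b"}}, table);
        // The table is sorted by row key anyway.
        System.out.println(table.rowKeysInOrder()); // prints [row1, row2, row3]
    }
}
```

A reducer would only earn its keep if several input records had to be combined into one value before the write, which is the kind of fringe case I mean.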



> From: [email protected]
> To: [email protected]
> Date: Thu, 4 Aug 2011 11:18:57 -0400
> Subject: Re: loading data in HBase table using APIs
> 
> 
> David, thanks for the tip on this.  I just checked in a reorg to the
> performance chapter and included this tip.
> 
> Stack does the website updating so it's not visible yet, but this tip is
> in there.
> 
> Thanks!
> 
> 
> 
> 
> On 7/18/11 6:18 PM, "Buttler, David" <[email protected]> wrote:
> 
> >After a quick scan of the performance section, I didn't see what I
> >consider to be a huge performance consideration:
> >If at all possible, don't do a reduce on your puts.  The shuffle/sort
> >part of the map/reduce paradigm is often useless if all you are trying to
> >do is insert/update data in HBase.  From the OP's description it sounds
> >like he doesn't need to have any kind of reduce phase [and may be a great
> >candidate for bulk loading and the pre-creation of regions].  In any
> >case, don't reduce if you can avoid it.
> >
> >Dave
> >
> >-----Original Message-----
> >From: Doug Meil [mailto:[email protected]]
> >Sent: Sunday, July 17, 2011 4:40 PM
> >To: [email protected]
> >Subject: Re: loading data in HBase table using APIs
> >
> >
> >Hi there-
> >
> >Take a look at this for starters:
> >http://hbase.apache.org/book.html#schema
> >
> >1)  double-check your row-keys (sanity check), that's in the Schema Design
> >chapter.
> >
> >http://hbase.apache.org/book.html#performance
> >
> >
> >2)  if not using bulk-load - pre-create regions, do this regardless of
> >using MR or non-MR.
> >
> >3)  if you are not using an MR job and are using multiple threads with the
> >Java API, take a look at HTableUtil.  It's on trunk, but that utility can
> >help you.
> >
> >
> >
> >
> >
> >
> >On 7/17/11 4:08 PM, "abhay ratnaparkhi" <[email protected]>
> >wrote:
> >
> >>Hello,
> >>
> >>I am loading lots of data into an HBase table using the HBase Java API.
> >>If I convert this code to a map-reduce job and use the *TableOutputFormat*
> >>class, will I get any performance improvement?
> >>
> >>As I am not getting input data from an existing HBase table or HDFS files,
> >>there will not be any input to the map task.
> >>The only advantage is that multiple map tasks running simultaneously might
> >>make processing faster.
> >>
> >>Thanks!
> >>Regards,
> >>Abhay
> >
> 
                                          
