We ran across a use case this week. During inserts into the table, one field was populated from hand-crafted HTML and should contain only a small range of values (e.g. a primary color). We wanted to keep a log of all of the unique values found there, so the values were the map job output and were then sorted and counted in the reduce phase. It was a handy way for us to debug the HTML into a persistent file (we could have just used counters, but those disappear after a while unless you manually copy them).
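The pattern above (map emits the field value, reduce sorts and counts the uniques) can be sketched in plain Python; this is a simulation of the logic only, not Hadoop API code, and the `color` field name and sample rows are made up for illustration:

```python
from collections import defaultdict

def map_phase(rows):
    """Emit (value, 1) for the field of interest in each row.
    'color' is a hypothetical field name extracted from the HTML."""
    for row in rows:
        yield (row["color"], 1)

def reduce_phase(pairs):
    """Group by key and count occurrences, as the shuffle/sort/reduce
    would, yielding one count per unique value."""
    counts = defaultdict(int)
    for key, one in pairs:
        counts[key] += one
    return dict(sorted(counts.items()))

rows = [{"color": "red"}, {"color": "blue"},
        {"color": "red"}, {"color": "REd"}]
print(reduce_phase(map_phase(rows)))
```

Note how the malformed value `"REd"` shows up as its own key in the output, which is exactly the kind of HTML bug this log surfaces.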
-----Original Message-----
From: Michael Segel [mailto:[email protected]]
Sent: Friday, March 25, 2011 8:26 AM
To: [email protected]
Subject: RE: How could I re-calculate every entry in hbase efficiently through mapreduce?

Yeah... Uhm, I don't know of many use cases where you would want or need a reducer step when dealing with HBase. I'm sure one may exist, but from past practical experience... you shouldn't need one.

----------------------------------------
> From: [email protected]
> To: [email protected]
> Date: Fri, 25 Mar 2011 08:20:45 -0700
> Subject: RE: How could I re-calculate every entry in hbase efficiently
> through mapreduce?
>
> There is no reason to use a reducer in this scenario. I frequently do
> map-only update jobs. Skipping the reduce step saves a lot of unnecessary
> work.
>
> Dave
>
> -----Original Message-----
> From: Stanley Xu [mailto:[email protected]]
> Sent: Thursday, March 24, 2011 7:37 PM
> To: [email protected]
> Subject: How could I re-calculate every entry in hbase efficiently through
> mapreduce?
>
> Dear Buddies,
>
> I need to re-calculate the entries in an HBase table every day, e.g. let
> x = 0.9x daily, so that time has an impact on the entry values.
>
> So I wrote a TableMapper to get each entry, recalculate the result, and
> use Context.write(key, put) to put the update operation in the context, and
> then use an IdentityTableReducer to write it directly back to HBase. In
> order to finish the job in a short time, I use the HRegionPartitioner to
> increase the reducer count to 50.
>
> But I have two doubts here:
> 1. It looks like the partitioner does a lot of shuffling. I am wondering why
> it couldn't just do the put on the local region, since the read and write on
> the same entry should be on the same region, shouldn't they?
>
> 2. If the job fails for any reason (like a timeout), HBase might be left in
> a partially updated state, mightn't it?
>
> Is there any suggestion for how I could avoid these two problems?
>
> Thanks.
>
> Best wishes,
> Stanley Xu
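The map-only approach Dave and Michael describe can be sketched as a simulation in Python; in the real job each mapper would scan rows from its local region with a TableMapper and issue a Put back to the table from inside map(), with no reducer configured at all. The table contents and decay factor here are illustrative:

```python
DECAY = 0.9  # the daily x = 0.9x factor from Stanley's question

def map_only_update(table):
    """Simulate a map-only update pass: each 'mapper' reads a row and
    writes the decayed value straight back. Because nothing is emitted
    to a reducer, there is no shuffle or partitioning at all."""
    for key, value in table.items():
        # In HBase this would be a Put issued from inside map(),
        # landing on the same region the row was read from.
        table[key] = value * DECAY
    return table

scores = {"row1": 100.0, "row2": 50.0}
print(map_only_update(scores))
```

This also makes doubt #1 moot: with no reduce phase there is nothing to shuffle, and each write naturally stays on the region that served the read.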
