Dear all,

Thanks for all your suggestions.

I think I now understand that using a reduce step (e.g. to sort on the reduce side) is pointless most of the time here. So what I do now is the following:

1. Use a Mapper with no Reducer to go through the whole table, recalculate the scores, make each output value a Put, and write the output to a SequenceFile.
2. After the first job finishes successfully, use a second mapper to read the first job's output and write it to HBase, either through TableOutputFormat or through my own mapper that submits the Puts in a list, so the updates are applied in a batch.

The reason I use two mappers is that doing the update inside a single Mapper has side effects on failure or retry. For example, with x = 0.9x, if Hadoop creates two tasks doing the same map work, x might be decayed twice and end up as 0.81x.

I checked the speculative execution that Stack mentioned, and found this:
http://developer.yahoo.com/hadoop/tutorial/module4.html

As I understand it, turning that feature off only guarantees that each mapper runs exactly once; it cannot prevent a "partially updated" table. That is, if something makes the update fail (say a node crashes, or the job takes too long), we cannot tell which parts have been updated and which have not.

Is my understanding correct? Or is there an even better solution?

Thanks.

Best wishes,
Stanley Xu


On Sat, Mar 26, 2011 at 4:16 AM, Michael Segel <[email protected]> wrote:

>
>
> Well there goes my weekend. :-P
> ----------------------------------------
> > From: [email protected]
> > To: [email protected]
> > Date: Fri, 25 Mar 2011 10:00:26 -0700
> > Subject: RE: How could I re-calculate every entries in hbase efficiently through mapreduce?
> >
> > I would certainly find it useful if you wrote such a blog post.
> > Dave
> >
> > -----Original Message-----
> > From: Michael Segel [mailto:[email protected]]
> > Sent: Friday, March 25, 2011 8:55 AM
> > To: [email protected]
> > Subject: RE: How could I re-calculate every entries in hbase efficiently through mapreduce?
> >
> > "During inserts into the table, there was one field that was populated
> > from hand-crafted HTML that should only have a small range of values
> > (e.g. a primary color). We wanted to keep a log of all of the unique
> > values that were found here, and so the values were the map job output
> > and then sorted and counted in the reduce phase."
> >
> > Ahhh, have you heard about dynamic counters?
> > You don't need a reducer; all you have to do is dump the counters in your main job after your mappers run.
> >
> > Maybe I should write a blog entry where you can do your word counter app using just dynamic counters and no reducers?
> >
> > HTH
> >
> > -Mike
> >
> >
> > ----------------------------------------
> > > From: [email protected]
> > > To: [email protected]
> > > Date: Fri, 25 Mar 2011 08:44:12 -0700
> > > Subject: RE: How could I re-calculate every entries in hbase efficiently through mapreduce?
> > >
> > > We ran across a use-case this week. During inserts into the table, there was one field that was populated from hand-crafted HTML that should only have a small range of values (e.g. a primary color). We wanted to keep a log of all of the unique values that were found here, so the values were the map job output and were then sorted and counted in the reduce phase. It was a handy way for us to debug the HTML into a persistent file (we could have just used counters, but those disappear after a while unless you manually copy them).
> > >
> > > -----Original Message-----
> > > From: Michael Segel [mailto:[email protected]]
> > > Sent: Friday, March 25, 2011 8:26 AM
> > > To: [email protected]
> > > Subject: RE: How could I re-calculate every entries in hbase efficiently through mapreduce?
> > >
> > > Yeah...
> > > Uhm, I don't know of many use cases where you would want or need a reducer step when dealing with HBase.
> > > I'm sure one may exist, but from past practical experience... you shouldn't need one.
> > >
> > > ----------------------------------------
> > > > From: [email protected]
> > > > To: [email protected]
> > > > Date: Fri, 25 Mar 2011 08:20:45 -0700
> > > > Subject: RE: How could I re-calculate every entries in hbase efficiently through mapreduce?
> > > >
> > > > There is no reason to use a reducer in this scenario. I frequently do map-only update jobs. Skipping the reduce step saves a lot of unnecessary work.
> > > >
> > > > Dave
> > > >
> > > > -----Original Message-----
> > > > From: Stanley Xu [mailto:[email protected]]
> > > > Sent: Thursday, March 24, 2011 7:37 PM
> > > > To: [email protected]
> > > > Subject: How could I re-calculate every entries in hbase efficiently through mapreduce?
> > > >
> > > > Dear Buddies,
> > > >
> > > > I need to re-calculate the entries in an HBase table every day, like letting x = 0.9x daily, so that time has an impact on the entry values.
> > > >
> > > > So I wrote a TableMapper to get each entry and recalculate the result, used context.write(key, put) to put the update operation in the context, and then used an IdentityTableReducer to write it directly back to HBase. In order to finish the job in a short time, I used the HRegionPartitioner to increase the reducer count to 50.
> > > >
> > > > But I have two doubts here:
> > > > 1. It looks like the partitioner does a lot of shuffling. I am wondering why it couldn't just do the Put on the local region, since the read and write on the same entry should be on the same region, shouldn't they?
> > > >
> > > > 2. If the job fails for any reason (like a timeout), HBase might be left in a partially-updated state, mightn't it?
> > > >
> > > > Is there any suggestion for how I could avoid these two problems?
> > > >
> > > > Thanks.
> > > >
> > > > Best wishes,
> > > > Stanley Xu
> > > >
> > >
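[Editor's note: a toy illustration of the retry hazard discussed in this thread. This is a sketch, not code from any of the posters; the function names are made up. An in-place decay x = 0.9x applied twice by a retried task yields 0.81x, while recomputing the value from a stored base score and elapsed days is idempotent, so re-running the job for the same day is harmless.]

```python
# Why in-place decay (x = 0.9 * x) is not safe under task retries,
# and one idempotent alternative: derive the value from immutable inputs.

DECAY = 0.9

def decay_in_place(value):
    # Non-idempotent: a speculative or retried task that re-applies
    # this to an already-decayed value double-counts the decay.
    return DECAY * value

def decayed_value(base, days_elapsed):
    # Idempotent: computed only from the stored base score and the
    # number of days elapsed, so recomputation gives the same result.
    return base * (DECAY ** days_elapsed)

once = decay_in_place(100.0)
twice = decay_in_place(once)   # a retry double-applies the decay
print(once)                    # 90.0
print(twice)                   # 81.0 -- the 0.81x problem from the thread

# Recomputing from the base is safe to run any number of times.
print(decayed_value(100.0, 1)) # 90.0
print(decayed_value(100.0, 1)) # 90.0 again
```

The same idea carries over to the two-job approach above: materializing the Puts first (job 1) and then replaying them (job 2) makes the write phase re-runnable, because replaying the same Put is idempotent, whereas re-reading and re-decaying the live table is not.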
