Hi Harsh,

The hidden map operation, which is applied to the reduced partition at one stage, can generate keys that fall outside the range covered by that particular reducer, so I still need the many-to-many communication from reduce step k to reduce step k+1. Otherwise, I think the ChainReducer would do the job and apply multiple maps to each isolated partition produced by the reducer.
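To make that concrete, here is a minimal, self-contained sketch (plain Java, no Hadoop dependencies; the graph keys and the neighbor-based re-keying are invented for illustration, and the partition function just mimics the hash-mod-R idea of Hadoop's HashPartitioner) showing why re-keyed output from one reducer usually belongs to several other partitions:

```java
import java.util.*;

public class PartitionEscape {
    // Same idea as Hadoop's HashPartitioner: partition = hash(key) mod R.
    static int partition(int key, int numReducers) {
        return Math.floorMod(Integer.hashCode(key), numReducers);
    }

    public static void main(String[] args) {
        int R = 4;
        // A toy adjacency list (hypothetical graph data).
        Map<Integer, int[]> neighbors = Map.of(
                0, new int[]{1, 6},
                4, new int[]{3, 7},
                8, new int[]{2, 9});

        // Keys 0, 4, 8 all hash to partition 0, so a single reducer sees them.
        for (int k : new int[]{0, 4, 8})
            System.out.println("key " + k + " -> partition " + partition(k, R));

        // The "hidden map" inside that reducer re-keys each record by neighbor id.
        Set<Integer> targetPartitions = new TreeSet<>();
        for (int[] ns : neighbors.values())
            for (int n : ns)
                targetPartitions.add(partition(n, R));

        // The re-keyed records now belong to several *other* partitions, so
        // reduce step k+1 still needs the many-to-many shuffle; a local
        // ChainReducer-style map cannot deliver them to the right reducers.
        System.out.println("neighbor keys spread over partitions: " + targetPartitions);
        // -> neighbor keys spread over partitions: [1, 2, 3]
    }
}
```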
Jim

On Fri, Oct 5, 2012 at 12:54 PM, Harsh J <[email protected]> wrote:
> Would it then be right to assume that the keys produced by the reduced
> partition at one stage would be isolated to its partition alone and
> not occur in any of the other partition outputs? I'm guessing not,
> based on the nature of your data?
>
> I'm trying to understand why shuffling is good to be avoided here, and
> if it can be avoided in some way, given the data. As I see it, you need a
> re-sort based on the new key per partition, but not the shuffle? Or am
> I wrong?
>
> On Fri, Oct 5, 2012 at 11:13 PM, Jim Twensky <[email protected]> wrote:
>> Hi Harsh,
>>
>> Yes, there is actually a "hidden" map stage that generates new
>> <key,value> pairs based on the last reduce output, but I can create
>> those records during the reduce step instead and get rid of the
>> intermediate map computation completely. The idea is to apply the map
>> function to each output of the reduce inside the reduce class and emit
>> the result as the output of the reducer.
>>
>> Jim
>>
>> On Fri, Oct 5, 2012 at 12:18 PM, Harsh J <[email protected]> wrote:
>>> Hey Jim,
>>>
>>> Are you looking to re-sort or re-partition your data by a different
>>> key or key combo after each output from reduce?
>>>
>>> On Fri, Oct 5, 2012 at 10:01 PM, Jim Twensky <[email protected]> wrote:
>>>> Hi,
>>>>
>>>> I have a complex Hadoop job that iterates over large graph data
>>>> multiple times until some convergence condition is met. I know that
>>>> the map output goes to the local disk of each particular mapper first,
>>>> and is then fetched by the reducers before the reduce tasks start. I can
>>>> see that this is an overhead, and in theory we could ship the data
>>>> directly from mappers to reducers, without serializing it to the local
>>>> disk first. I understand that this step is necessary for fault
>>>> tolerance and is an essential building block of MapReduce.
>>>>
>>>> In my application, the map phase consists of identity mappers that
>>>> read the input from HDFS and ship it to the reducers. Essentially, what I
>>>> am doing is applying a chain of reduce jobs until the algorithm
>>>> converges. My question is: can I bypass the serialization to local
>>>> disk and ship the data from mappers to reducers immediately (as soon as I
>>>> call context.write() in my mapper class)? If not, are there any other
>>>> MR platforms that can do this? I've been searching around and couldn't
>>>> find anything similar to what I need. Hadoop Online is a prototype with
>>>> some similar functionality, but it hasn't been updated in a while.
>>>>
>>>> Note: I know about the ChainMapper and ChainReducer classes, but I don't
>>>> want to chain multiple mappers on the same local node. I want to chain
>>>> multiple reduce functions globally, so the data flow looks like: Map ->
>>>> Reduce -> Reduce -> Reduce, which means each reduce operation is
>>>> followed by a shuffle and sort, essentially bypassing the map
>>>> operation.
>>>
>>>
>>>
>>> --
>>> Harsh J
>
>
>
> --
> Harsh J
