The question is not how to sequence all the jobs; Cascading could indeed help in that case.
The question is how to skip the map phase and do the split/local sort directly at the end of the reduce, so that the next reduce only needs to merge the sorted files obtained from the previous reduce. This is basically a performance optimization (avoiding unnecessary network/disk transfers). Cascading is not equipped to do it; it will only compile the flow into a sequence of map-reduce jobs.

Regards

Bertrand

On Mon, Oct 8, 2012 at 12:44 PM, Fabio Pitzolu <[email protected]> wrote:

> Isn't it also of some help to use Cascading (http://www.cascading.org/)?
>
> *Fabio Pitzolu*
> Consultant - BI & Infrastructure
>
> Mob. +39 3356033776
> Telefono 02 87157239
> Fax. 02 93664786
>
> *Gruppo Consulenza Innovazione - http://www.gr-ci.com*
>
>
> 2012/10/8 Bertrand Dechoux <[email protected]>
>
>> Have you looked at graph processing for Hadoop, like Hama
>> (http://hama.apache.org/) or Giraph (http://incubator.apache.org/giraph/)?
>> I can't say for sure it would help you, but it seems to be in the same
>> problem domain.
>>
>> With regard to the chained-reducers issue, this is indeed a general
>> implementation decision of Hadoop 1. From a purely functional point of
>> view, regardless of performance, I guess it could be shown that a
>> map/reduce/map can be done with a reduce only, and that a sequence of maps
>> can be done with a single map. Of course, with Hadoop the picture is a bit
>> more complex due to the sort phase.
>>
>> map -> sort -> reduce : operations in the map/reduce cannot generally be
>> moved across the sort, because it 'blocks' them when they are related to
>> the sort key
>> reduce -> map : all operations can be performed in the reduce
>>
>> So
>> map -> sort -> reduce -> map -> sort -> reduce -> map -> sort -> reduce
>> can generally be implemented as
>> map -> sort -> reduce -> sort -> reduce -> sort -> reduce
>> if you are willing to give up the possibility of having different scaling
>> options for maps and reduces.
>>
>> And that's what you are asking. But with Hadoop 1 the map phase is not
>> optional (even though you could use the identity mapper, that's not a wise
>> option with regard to performance, as you said). The picture might be
>> changing with Hadoop 2/YARN. I can't provide the details, but it may be
>> worth looking into.
>>
>> Regards
>>
>> Bertrand
>>
>>
>> On Fri, Oct 5, 2012 at 8:02 PM, Jim Twensky <[email protected]> wrote:
>>
>>> Hi Harsh,
>>>
>>> The hidden map operation which is applied to the reduced partition at
>>> one stage can generate keys that are outside of the range covered by
>>> that particular reducer. I still need to have the many-to-many
>>> communication from reduce step k to reduce step k+1. Otherwise, I
>>> think the ChainReducer would do the job and apply multiple maps to
>>> each isolated partition produced by the reducer.
>>>
>>> Jim
>>>
>>> On Fri, Oct 5, 2012 at 12:54 PM, Harsh J <[email protected]> wrote:
>>> > Would it then be right to assume that the keys produced by the reduced
>>> > partition at one stage would be isolated to its partition alone and
>>> > not occur in any of the other partitions' outputs? I'm guessing not,
>>> > based on the nature of your data?
>>> >
>>> > I'm trying to understand why the shuffle is best avoided here, and
>>> > whether it can be, in some way, given the data. As I see it, you need
>>> > a re-sort based on the new key per partition, but not the shuffle? Or
>>> > am I wrong?
>>> >
>>> > On Fri, Oct 5, 2012 at 11:13 PM, Jim Twensky <[email protected]> wrote:
>>> >> Hi Harsh,
>>> >>
>>> >> Yes, there is actually a "hidden" map stage that generates new
>>> >> <key, value> pairs based on the last reduce output, but I can create
>>> >> those records during the reduce step instead and get rid of the
>>> >> intermediate map computation completely. The idea is to apply the map
>>> >> function to each output of the reduce inside the reduce class and emit
>>> >> the result as the output of the reducer.
>>> >>
>>> >> Jim
>>> >>
>>> >> On Fri, Oct 5, 2012 at 12:18 PM, Harsh J <[email protected]> wrote:
>>> >>> Hey Jim,
>>> >>>
>>> >>> Are you looking to re-sort or re-partition your data by a different
>>> >>> key or key combination after each output from reduce?
>>> >>>
>>> >>> On Fri, Oct 5, 2012 at 10:01 PM, Jim Twensky <[email protected]> wrote:
>>> >>>> Hi,
>>> >>>>
>>> >>>> I have a complex Hadoop job that iterates over large graph data
>>> >>>> multiple times until some convergence condition is met. I know that
>>> >>>> the map output goes to the local disk of each particular mapper
>>> >>>> first, and is then fetched by the reducers before the reduce tasks
>>> >>>> start. I can see that this is an overhead, and in theory we could
>>> >>>> ship the data directly from mappers to reducers, without serializing
>>> >>>> it to the local disk first. I understand that this step is necessary
>>> >>>> for fault tolerance and is an essential building block of MapReduce.
>>> >>>>
>>> >>>> In my application, the map process consists of identity mappers
>>> >>>> which read the input from HDFS and ship it to the reducers.
>>> >>>> Essentially, what I am doing is applying chains of reduce jobs until
>>> >>>> the algorithm converges. My question is, can I bypass the
>>> >>>> serialization of the local data and ship it from mappers to reducers
>>> >>>> immediately (as soon as I call context.write() in my mapper class)?
>>> >>>> If not, are there any other MR platforms that can do this? I've been
>>> >>>> searching around and couldn't find anything similar to what I need.
>>> >>>> Hadoop Online is a prototype with some similar functionality, but it
>>> >>>> hasn't been updated for a while.
>>> >>>>
>>> >>>> Note: I know about the ChainMapper and ChainReducer classes, but I
>>> >>>> don't want to chain multiple mappers on the same local node. I want
>>> >>>> to chain multiple reduce functions globally, so the data flow looks
>>> >>>> like: Map -> Reduce -> Reduce -> Reduce, which means each reduce
>>> >>>> operation is followed by a shuffle and sort, essentially bypassing
>>> >>>> the map operation.
>>> >>>
>>> >>> --
>>> >>> Harsh J
>>> >
>>> > --
>>> > Harsh J
>>
>> --
>> Bertrand Dechoux
>
--
Bertrand Dechoux
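
To make concrete what the thread converges on, here is a minimal sketch, assuming the Hadoop 2 mapreduce API: a reducer that folds the "hidden" map into the reduce (as Jim describes), and a driver that chains such jobs with the stock identity mapper until convergence. All names (FoldedMapReducer, IterativeDriver), the Text/LongWritable types, the SequenceFile formats, and the convergence counter are hypothetical stand-ins, not code from the thread.

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Hypothetical reducer that folds the "hidden" map step into the reduce:
    // after aggregating the values for a key, it applies the next iteration's
    // map logic and emits the (possibly re-keyed) records directly, so no
    // separate map-only pass is needed between iterations.
    public class FoldedMapReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {

        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;                       // placeholder aggregation
            for (LongWritable value : values) {
                sum += value.get();
            }
            // The "hidden map": derive the key for the next iteration here.
            // The new key may fall outside this reducer's key range, which is
            // why a shuffle is still needed before the next reduce.
            String nextKey = key.toString() + ":" + sum;   // illustrative re-keying only
            context.write(new Text(nextKey), new LongWritable(sum));
        }
    }

A driver could then run one such job per iteration, feeding each round's output back in as the next round's input:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class IterativeDriver {

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path input = new Path(args[0]);
            int iteration = 0;
            boolean converged = false;

            while (!converged) {
                Path output = new Path(args[1] + "/iter-" + iteration);

                Job job = Job.getInstance(conf, "reduce-chain-" + iteration);
                job.setJarByClass(IterativeDriver.class);

                // Plain Mapper.class is the identity mapper: records pass
                // through unchanged, so each round is effectively
                // shuffle + sort + reduce.
                job.setMapperClass(Mapper.class);
                job.setReducerClass(FoldedMapReducer.class);

                job.setMapOutputKeyClass(Text.class);
                job.setMapOutputValueClass(LongWritable.class);
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(LongWritable.class);

                job.setInputFormatClass(SequenceFileInputFormat.class);
                job.setOutputFormatClass(SequenceFileOutputFormat.class);
                FileInputFormat.setInputPaths(job, input);
                FileOutputFormat.setOutputPath(job, output);

                if (!job.waitForCompletion(true)) {
                    System.exit(1);
                }

                // Hypothetical convergence test: a counter the reducer would
                // increment whenever a record still changes in this round.
                long changed = job.getCounters()
                                  .findCounter("ITERATION", "CHANGED").getValue();
                converged = (changed == 0);

                input = output;   // this round's output feeds the next round
                iteration++;
            }
        }
    }

Note that this still pays for an identity map task and a full shuffle on every iteration; as Bertrand points out above, Hadoop 1 offers no supported way to drop that step, so the sketch only removes the separate map-only pass between reduces, not the shuffle itself.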
