Chaining Multiple Reducers: Reduce -> Reduce -> Reduce

Jim Twensky Fri, 05 Oct 2012 09:31:47 -0700

Hi,

I have a complex Hadoop job that iterates over large graph data
multiple times until some convergence condition is met. I know that
the map output goes to the local disk of each particular mapper first,
and is then fetched by the reducers before the reduce tasks start. I can
see that this is an overhead, and in theory we could ship the data
directly from mappers to reducers, without serializing it to the local
disk first. I understand that this step is necessary for fault
tolerance and it is an essential building block of MapReduce.


In my application, the map process consists of identity mappers which
read the input from HDFS and ship it to reducers. Essentially, what I
am doing is applying chains of reduce jobs until the algorithm
converges. My question is, can I bypass the serialization of the local
data and ship it from mappers to reducers immediately (as soon as I
call context.write() in my mapper class)? If not, are there any other
MR platforms that can do this? I've been searching around and couldn't
find anything similar to what I need. Hadoop Online is a prototype that
has some similar functionality, but it hasn't been updated for a while.

Note: I know about the ChainMapper and ChainReducer classes, but I don't
want to chain multiple mappers on the same local node. I want to chain
multiple reduce functions globally so the data flow looks like: Map ->
Reduce -> Reduce -> Reduce, which means each reduce operation is
followed by a shuffle and sort, essentially bypassing the map
operation.
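
To make the intended data flow concrete, here is a minimal in-memory sketch
in plain Java. It is not Hadoop code and does not answer the question of
bypassing the local disk; it only illustrates the Map -> Reduce -> Reduce
pattern, where each round's output is shuffled, sorted, and fed straight
into the next reducer. All names (ReduceChain, shuffle, reduceRound,
runUntilConverged, minLabel) are hypothetical, and "the output stopped
changing" is an assumed stand-in for the real convergence condition.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;

class ReduceChain {
    // Shuffle and sort: group all values by key, with keys in sorted order.
    static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> records) {
        TreeMap<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> r : records) {
            groups.computeIfAbsent(r.getKey(), k -> new ArrayList<>()).add(r.getValue());
        }
        return groups;
    }

    // One Reduce stage: shuffle the previous output, then call the reducer once per key.
    static List<Map.Entry<String, Integer>> reduceRound(
            List<Map.Entry<String, Integer>> input,
            BiFunction<String, List<Integer>, List<Map.Entry<String, Integer>>> reducer) {
        List<Map.Entry<String, Integer>> output = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(input).entrySet()) {
            output.addAll(reducer.apply(group.getKey(), group.getValue()));
        }
        return output;
    }

    // Identity map, then Reduce -> Reduce -> ... until a round leaves the data
    // unchanged (assumed convergence test) or maxRounds is reached.
    static List<Map.Entry<String, Integer>> runUntilConverged(
            List<Map.Entry<String, Integer>> input,
            BiFunction<String, List<Integer>, List<Map.Entry<String, Integer>>> reducer,
            int maxRounds) {
        List<Map.Entry<String, Integer>> current = new ArrayList<>(input); // identity map
        for (int round = 0; round < maxRounds; round++) {
            List<Map.Entry<String, Integer>> next = reduceRound(current, reducer);
            if (next.equals(current)) {
                return next;
            }
            current = next;
        }
        return current;
    }

    // Toy reducer (hypothetical): keep only the smallest label seen for each node.
    static List<Map.Entry<String, Integer>> minLabel(String node, List<Integer> labels) {
        return List.of(Map.entry(node, Collections.min(labels)));
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> input = List.of(
                Map.entry("b", 3), Map.entry("a", 5), Map.entry("a", 2), Map.entry("b", 7));
        System.out.println(runUntilConverged(input, ReduceChain::minLabel, 10));
    }
}
```

In this sketch the first round collapses each node to its minimum label and
the second round reproduces the same output, so the loop stops there. In a
real cluster each round would still be a separate job or stage, which is
exactly the overhead I would like to avoid.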

Chaining Multiple Reducers: Reduce -> Reduce -> Reduce
