Hi Harsh,

The hidden map operation, which is applied to the reduced partition at one stage, can generate keys that fall outside the range covered by that particular reducer, so I still need the many-to-many communication from reduce step k to reduce step k+1. Otherwise, I think the ChainReducer would do the job and apply multiple maps to each isolated partition produced by the reducer.
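To make that concrete, here is a minimal, self-contained sketch (plain Java, no Hadoop dependencies; the graph keys and the neighbor-based re-keying are invented for illustration, and the partition function just mimics the hash-mod-R idea of Hadoop's HashPartitioner) showing why re-keyed output from one reducer usually belongs to several other partitions:

```java
import java.util.*;

public class PartitionEscape {
    // Same idea as Hadoop's HashPartitioner: partition = hash(key) mod R.
    static int partition(int key, int numReducers) {
        return Math.floorMod(Integer.hashCode(key), numReducers);
    }

    public static void main(String[] args) {
        int R = 4;
        // A toy adjacency list (hypothetical graph data).
        Map<Integer, int[]> neighbors = Map.of(
                0, new int[]{1, 6},
                4, new int[]{3, 7},
                8, new int[]{2, 9});

        // Keys 0, 4, 8 all hash to partition 0, so a single reducer sees them.
        for (int k : new int[]{0, 4, 8})
            System.out.println("key " + k + " -> partition " + partition(k, R));

        // The "hidden map" inside that reducer re-keys each record by neighbor id.
        Set<Integer> targetPartitions = new TreeSet<>();
        for (int[] ns : neighbors.values())
            for (int n : ns)
                targetPartitions.add(partition(n, R));

        // The re-keyed records now belong to several *other* partitions, so
        // reduce step k+1 still needs the many-to-many shuffle; a local
        // ChainReducer-style map cannot deliver them to the right reducers.
        System.out.println("neighbor keys spread over partitions: " + targetPartitions);
        // -> neighbor keys spread over partitions: [1, 2, 3]
    }
}
```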
Jim

On Fri, Oct 5, 2012 at 12:54 PM, Harsh J <[email protected]> wrote:
> Would it then be right to assume that the keys produced by the reduced
> partition at one stage would be isolated to its partition alone and
> not occur in any of the other partition outputs? I'm guessing not,
> based on the nature of your data?
>
> I'm trying to understand why shuffling is good to be avoided here, and
> if it can be avoided in some way, given the data. As I see it, you need a
> re-sort based on the new key per partition, but not the shuffle? Or am
> I wrong?
>
> On Fri, Oct 5, 2012 at 11:13 PM, Jim Twensky <[email protected]> wrote:
>> Hi Harsh,
>>
>> Yes, there is actually a "hidden" map stage that generates new
>> <key,value> pairs based on the last reduce output, but I can create
>> those records during the reduce step instead and get rid of the
>> intermediate map computation completely. The idea is to apply the map
>> function to each output of the reduce inside the reduce class and emit
>> the result as the output of the reducer.
>>
>> Jim
>>
>> On Fri, Oct 5, 2012 at 12:18 PM, Harsh J <[email protected]> wrote:
>>> Hey Jim,
>>>
>>> Are you looking to re-sort or re-partition your data by a different
>>> key or key combo after each output from reduce?
>>>
>>> On Fri, Oct 5, 2012 at 10:01 PM, Jim Twensky <[email protected]> wrote:
>>>> Hi,
>>>>
>>>> I have a complex Hadoop job that iterates over large graph data
>>>> multiple times until some convergence condition is met. I know that
>>>> the map output goes to the local disk of each particular mapper first,
>>>> and is then fetched by the reducers before the reduce tasks start. I can
>>>> see that this is an overhead, and in theory we could ship the data
>>>> directly from mappers to reducers, without serializing it to the local
>>>> disk first. I understand that this step is necessary for fault
>>>> tolerance and is an essential building block of MapReduce.
>>>>
>>>> In my application, the map phase consists of identity mappers that
>>>> read the input from HDFS and ship it to the reducers. Essentially, what I
>>>> am doing is applying a chain of reduce jobs until the algorithm
>>>> converges. My question is: can I bypass the serialization to local
>>>> disk and ship the data from mappers to reducers immediately (as soon as I
>>>> call context.write() in my mapper class)? If not, are there any other
>>>> MR platforms that can do this? I've been searching around and couldn't
>>>> find anything similar to what I need. Hadoop Online is a prototype with
>>>> some similar functionality, but it hasn't been updated in a while.
>>>>
>>>> Note: I know about the ChainMapper and ChainReducer classes, but I don't
>>>> want to chain multiple mappers on the same local node. I want to chain
>>>> multiple reduce functions globally, so the data flow looks like: Map ->
>>>> Reduce -> Reduce -> Reduce, which means each reduce operation is
>>>> followed by a shuffle and sort, essentially bypassing the map
>>>> operation.
>>>
>>>
>>>
>>> --
>>> Harsh J
>
>
>
> --
> Harsh J
