Jake is absolutely right here: the combiner is also applied on the reducers. I forgot to mention that.
The shuffle phase in Hadoop is basically a distributed merge sort. When the reducers start to merge the mapper outputs, they can also apply the combiner. However, this doesn't help reduce network traffic, because the map outputs have already been transferred to the reducer at that point. The 'Shuffle and Sort' section in 'Hadoop: The Definitive Guide' describes this process in detail.

--sebastian

On 27.09.2012 11:11, Sigurd Spieckermann wrote:
> OK, I see. Makes sense. Thank you!
>
> 2012/9/27 Sean Owen <[email protected]>
>
>> I think he means that it is not only applied to the output of the
>> mapper, but to output of the combiners many times as well. It is not
>> used at the reducer.
>>
>> On Thu, Sep 27, 2012 at 9:56 AM, Sigurd Spieckermann
>> <[email protected]> wrote:
>>> @Jake: Could you please elaborate on how exactly the combiner can be called
>>> before the reducer gets the data? Do you mean the combiner is called at the
>>> datanode that instantiates reducer tasks? I thought the combiner is just
>>> called after the map task has finished and still on that datanode.
>>>
>>> 2012/9/26 Jake Mannix <[email protected]>
>>>
>>>> It should also be noted that the Combiner does not only run for the mappers -
>>>> they can be used one (or more) times after mapping, and then one or more
>>>> times before the reducer gets the results. It's not quite so simple as to
>>>> say that you get combiners used only (and always) on the outputs of each map task.
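
To make the reduce-side merge described at the top of this mail a bit more concrete, here is a rough sketch in plain Python. It is not Hadoop code: the names `combine`, `merge_runs`, `reduce_sum` and the sample `map_outputs` are made up for illustration, and I assume a simple summing combiner (as in word count). It only shows that combining during the merge shrinks what the reducer has to process, not what was fetched, and that the final result is the same either way.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def combine(sorted_pairs):
    """Hypothetical summing combiner: collapse adjacent equal keys in a key-sorted stream."""
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield key, sum(value for _, value in group)


def merge_runs(runs, use_combiner):
    """Merge key-sorted runs into one sorted list, optionally combining during the merge."""
    merged = heapq.merge(*runs, key=itemgetter(0))
    return list(combine(merged)) if use_combiner else list(merged)


def reduce_sum(sorted_pairs):
    """Hypothetical reducer: sums all values per key (same operation as the combiner)."""
    return dict(combine(sorted_pairs))


# Assumed sample data: key-sorted outputs of three map tasks,
# already fetched over the network by a single reducer.
map_outputs = [
    [("a", 1), ("a", 1), ("b", 1)],
    [("a", 1), ("c", 1)],
    [("b", 1), ("b", 1), ("c", 1)],
]

# Records transferred to the reducer; the reduce-side combiner cannot change this.
fetched = sum(len(run) for run in map_outputs)

plain = merge_runs(map_outputs, use_combiner=False)
combined = merge_runs(map_outputs, use_combiner=True)

print("records fetched:", fetched)
print("records after plain merge:", len(plain))
print("records after combining merge:", len(combined))
print("same final result:", reduce_sum(plain) == reduce_sum(combined))
```

The result is only guaranteed to match because summing is associative and commutative, which is also why a combiner must tolerate being run zero, one, or many times.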
