A reduce task can't process its complete data set until it has fetched all of its partitions, and any map may produce a partition for any reducer. Hence, we generally wait until all maps have finished and their partition outputs have been copied over to the reduces before we begin to group and process the keys.
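To make that concrete, here is a toy word-count simulation in plain Python. It is only a sketch of the idea, not Hadoop code: the names `partition`, `run_map`, `run_reduce` and the sample splits are made up for illustration, and the partitioner is a simple stand-in for a hash partitioner. It shows that a reduce run on the output of only some maps gets a different (wrong) total than one run after every map has finished.

```python
from collections import defaultdict

NUM_REDUCERS = 2  # hypothetical number of reduce tasks


def partition(key, num_reducers=NUM_REDUCERS):
    """Toy deterministic partitioner: maps a key to a reducer index."""
    return sum(key.encode()) % num_reducers


def run_map(split):
    """Word-count map: returns {reducer_index: [(word, 1), ...]}."""
    out = defaultdict(list)
    for word in split.split():
        out[partition(word)].append((word, 1))
    return out


def run_reduce(reducer_index, map_outputs):
    """Sums the values per key over the partitions fetched so far."""
    totals = defaultdict(int)
    for out in map_outputs:
        for key, value in out.get(reducer_index, []):
            totals[key] += value
    return dict(totals)


def run_job(map_outputs):
    """Runs every reducer over the given map outputs and merges the results."""
    result = {}
    for r in range(NUM_REDUCERS):
        result.update(run_reduce(r, map_outputs))
    return result


splits = ["a b a", "b c", "a c c"]  # made-up input splits, one per map
map_outputs = [run_map(s) for s in splits]

# Reducing after only the first two maps: the third map still holds
# records for keys "a" and "c", so these counts are incomplete.
partial = run_job(map_outputs[:2])

# Reducing after all maps have finished gives the correct counts.
full = run_job(map_outputs)

print("after 2 of 3 maps:", partial)
print("after all maps:   ", full)
```

Note that every map here emits records for more than one reducer, which is why no reducer can know its input is complete until the last map is done.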
However, given that you began thinking about this, this paper on "Online" Hadoop may interest you: http://www.neilconway.org/docs/nsdi2010_hop.pdf

On Sat, Dec 22, 2012 at 6:55 PM, Lin Ma <[email protected]> wrote:
> Hi guys,
>
> Supposing in a Hadoop job, there are both mappers and reducers. My question
> is, reducer tasks cannot begin until all mapper tasks complete? If so, why
> designed in this way?
>
> thanks in advance,
> Lin

--
Harsh J
