Let's say that you're running a MapReduce job with *M* map tasks, *R* reduce tasks, and *K* machines. Each map task will produce *R* shuffle outputs (so *M*×*R* shuffle blocks total). When the reduce phase starts, pending reduce tasks are pulled off a queue and scheduled on executors. Reduce tasks aren't assigned to particular machines in advance; they're scheduled as executors become free.
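To make the block assignment concrete, here's a minimal sketch of hash partitioning, the default way a map task decides which of its *R* shuffle blocks a record belongs to. This is an illustrative standalone function, not Spark's actual HashPartitioner source, though it follows the same idea (non-negative hashCode modulo the number of reducers):

```scala
// Sketch: each map task hash-partitions its output keys into R buckets,
// writing one shuffle block per reducer. Reducer r later fetches bucket r
// from every map task's output.
def bucketFor(key: Any, numReducers: Int): Int = {
  // hashCode can be negative, so normalize the modulo to [0, numReducers).
  val mod = key.hashCode % numReducers
  if (mod < 0) mod + numReducers else mod
}

// With 4 reduce tasks, a given key always hashes to the same bucket,
// no matter which map task emitted it -- so all values for that key
// end up at the same reducer.
val keys = Seq("apple", "banana", "cherry", "apple")
val assignments = keys.map(k => k -> bucketFor(k, 4))
```

So each reducer is responsible for a subset of shuffle blocks (one per map task), which in turn corresponds to a subset of the key space.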
If you have more reduce tasks than machines (*R* > *K*), then some machines will run multiple reduce tasks. You might want to run more reduce tasks than machines to (a) limit an individual reduce task's memory requirements, or (b) adapt to skew and stragglers. With smaller, more granular reduce tasks, slower machines simply run fewer tasks while the remaining work is divided among the other machines. The trade-off is increased scheduling overhead and more reduce output partitions, although the scheduling overhead may be negligible in many cases, and the small post-shuffle outputs can be combined using coalesce().

On Mon, Nov 11, 2013 at 8:54 AM, Umar Javed <[email protected]> wrote:

> Say that you have a taskSet of maps, each operating on one Hadoop
> partition. How does the scheduler decide which mapTask output (i.e., a
> shuffle block) goes to what reducer? Are the shuffle blocks evenly split
> among reducers?
>
>
> On Sun, Nov 10, 2013 at 9:50 PM, Aaron Davidson <[email protected]> wrote:
>
>> It is responsible for a subset of shuffle blocks. MapTasks split up their
>> data, creating one shuffle block for every reducer. During the shuffle
>> phase, the reducer will fetch all shuffle blocks that were intended for it.
>>
>>
>> On Sun, Nov 10, 2013 at 9:38 PM, Umar Javed <[email protected]> wrote:
>>
>>> I was wondering how does the scheduler assign the ShuffledRDD locations
>>> to the reduce tasks? Say that you have 4 reduce tasks, and a number of
>>> shuffle blocks across two machines. Is each reduce task responsible for a
>>> subset of individual keys or a subset of shuffle blocks?
>>>
>>> Umar
