In a shuffle/reduce, the portion of the intermediate results destined for each of the R reducers and produced by a task run on each of the N partitions of the RDD needs to be materialized and sent to that reducer. So, N tasks each producing and materializing R intermediate results implies N*R files created.
On Tue, Dec 10, 2013 at 7:47 AM, Grega Kešpret <[email protected]> wrote: > Hi Aaron, > thanks for the explanation, I also find it very helpful. > > On Mon, Dec 9, 2013 at 9:28 PM, Aaron Davidson <[email protected]> wrote: > >> If you have N map partitions and R reducers, we create N*R files on disk >> across the cluster in order to do the group by. > > > Do you mind giving a link or explaining why N*R files are created? Thanks! > > Grega >
