Re: groupBy() with really big groups fails

Mark Hamstra Tue, 10 Dec 2013 09:10:19 -0800

In a shuffle/reduce, the portion of the intermediate results destined for
each of the R reducers and produced by a task run on each of the N
partitions of the RDD needs to be materialized and sent to that reducer.
So, N tasks each producing and materializing R intermediate results implies
N*R files created.



On Tue, Dec 10, 2013 at 7:47 AM, Grega Kešpret <[email protected]> wrote:

> Hi Aaron,
> thanks for the explanation, I also find it very helpful.
>
> On Mon, Dec 9, 2013 at 9:28 PM, Aaron Davidson <[email protected]> wrote:
>
>> If you have N map partitions and R reducers, we create N*R files on disk
>> across the cluster in order to do the group by.
>
>
> Do you mind giving a link or explaining why N*R files are created? Thanks!
>
> Grega
>

Re: groupBy() with really big groups fails

Reply via email to