Hi Muler,

Shuffle data is always written to disk, no matter how much memory you have.
Larger memory can only alleviate shuffle spill, where extra temporary files
are generated when memory is not enough; the final shuffle output still goes
to disk.
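
If it helps, here is a minimal sketch of the kind of job you are describing
(the input/output paths and the executor memory value are just placeholders):
the reduceByKey triggers a shuffle, the map-side shuffle files always land on
local disk, and executor memory only controls how much aggregation can stay
in memory before spilling extra temporary files.

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("WordCount")
          // More executor memory reduces spilling of intermediate data,
          // but the final map-side shuffle output is still files on disk.
          .set("spark.executor.memory", "100g")
        val sc = new SparkContext(conf)

        sc.textFile("hdfs:///path/to/input")            // placeholder path
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)                           // shuffle happens here
          .saveAsTextFile("hdfs:///path/to/output")     // placeholder path

        sc.stop()
      }
    }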

Yes, each node writes its shuffle data to local files, and in the reduce
stage the data is pulled from those files over the network framework (the
default is Netty).
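
If you want to see or change which transport the reduce side uses when
fetching those shuffle files, these are the settings I would look at (names
from the 1.x configuration docs, please double check against your version):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // Transport used by reduce tasks to fetch shuffle blocks from other
      // executors; "netty" is the default, "nio" is the legacy option.
      .set("spark.shuffle.blockTransferService", "netty")
      // Rough cap on in-flight fetched shuffle data per reduce task.
      .set("spark.reducer.maxSizeInFlight", "48m")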

Thanks
Saisai

On Thu, Aug 6, 2015 at 7:10 AM, Muler <mulugeta.abe...@gmail.com> wrote:

> Hi,
>
> Consider I'm running WordCount with 100m of data on a 4-node cluster.
> Assuming my RAM size on each node is 200g and I'm giving my executors 100g
> (just enough memory for the 100m of data):
>
>
>    1. If I have enough memory, can Spark avoid writing to disk entirely?
>    2. During the shuffle, where results have to be collected from the nodes,
>    does each node write to disk and then the results are pulled from disk?
>    If not, what API is being used to pull data from nodes across the
>    cluster? (I'm wondering which Scala or Java packages would allow you to
>    read in-memory data from other machines.)
>
> Thanks,
>
