Thanks for the reply.
This will reduce the shuffle write to disk to an extent, but for a
long-running job (multiple days) the shuffle write would still occupy a
lot of space on disk. Why do we need to keep the data from older map
tasks in memory?

On Tue, Mar 31, 2015 at 1:19 PM, Bijay Pathak <bijay.pat...@cloudwick.com>
wrote:

> The Spark Sort-Based Shuffle (the default since 1.1) keeps the data
> from each map task in memory until it can't fit, after which it is
> sorted and spilled to disk. You can reduce the shuffle write to disk
> by increasing spark.shuffle.memoryFraction (default 0.2).
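>
> A minimal sketch of bumping that fraction (assuming Spark 1.x; the
> 0.4 value and app name are just for illustration):
>
>   import org.apache.spark.{SparkConf, SparkContext}
>
>   // Let the shuffle use more of the executor heap before spilling.
>   // The default 0.2 leaves the rest for storage and execution.
>   val conf = new SparkConf()
>     .setAppName("shuffle-tuning-sketch")
>     .set("spark.shuffle.memoryFraction", "0.4")
>   val sc = new SparkContext(conf)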
>
> By writing the shuffle output to disk, the Spark lineage can be
> truncated when the RDDs are already materialized as a side effect of
> an earlier shuffle. This is an under-the-hood optimization in Spark
> which is only possible because the shuffle output is written to disk.
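>
> A rough illustration of that effect (the input path here is made up):
> running the same action twice on a shuffled RDD reuses the shuffle
> files, so the map stage shows up as skipped in the UI the second time:
>
>   val pairs = sc.textFile("hdfs:///logs").map(line => (line.length, 1))
>   val counts = pairs.reduceByKey(_ + _)  // shuffle boundary
>   counts.count()  // job 1: map stage runs, shuffle files written
>   counts.count()  // job 2: map stage skipped, shuffle output reused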
>
> You can set spark.shuffle.spill to false if you don't want to spill
> to disk, assuming you have enough heap memory.
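>
> For example, with spark-submit (the memory size and jar name are just
> placeholders):
>
>   spark-submit \
>     --conf spark.shuffle.spill=false \
>     --executor-memory 8g \
>     your-app.jar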
>
> On Tue, Mar 31, 2015 at 12:35 PM, Udit Mehta <ume...@groupon.com> wrote:
> > I have noticed a similar issue when using Spark Streaming. The
> > Spark shuffle write size increases to a large size (in GB) and then
> > the app crashes saying:
> > java.io.FileNotFoundException:
> > /data/vol0/nodemanager/usercache/$user/appcache/application_1427480955913_0339/spark-local-20150330231234-db1a/0b/temp_shuffle_1b23808f-f285-40b2-bec7-1c6790050d7f
> > (No such file or directory)
> >
> > I don't understand why the shuffle size increases to such a large
> > value for long-running jobs.
> >
> > Thanks,
> > Udit
> >
> > On Mon, Mar 30, 2015 at 4:01 AM, shahab <shahab.mok...@gmail.com> wrote:
> >>
> >> Thanks Saisai. I will try your solution, but I still don't
> >> understand why the filesystem should be used when there is plenty
> >> of memory available!
> >>
> >>
> >>
> >> On Mon, Mar 30, 2015 at 11:22 AM, Saisai Shao <sai.sai.s...@gmail.com>
> >> wrote:
> >>>
> >>> Shuffle write will finally spill the data into the filesystem as
> >>> a bunch of files. If you want to avoid disk writes, you can mount
> >>> a ramdisk and configure "spark.local.dir" to point to it. The
> >>> shuffle output will then be written to a memory-based FS and will
> >>> not introduce disk IO.
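> >>>
> >>> Roughly like this (the mount point and size are only an example):
> >>>
> >>>   # on each worker node, mount a tmpfs to back spark.local.dir
> >>>   sudo mount -t tmpfs -o size=16g tmpfs /mnt/spark-ramdisk
> >>>
> >>>   # then in spark-defaults.conf:
> >>>   spark.local.dir  /mnt/spark-ramdisk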
> >>>
> >>> Thanks
> >>> Jerry
> >>>
> >>> 2015-03-30 17:15 GMT+08:00 shahab <shahab.mok...@gmail.com>:
> >>>>
> >>>> Hi,
> >>>>
> >>>> I was looking at the Spark UI, Executors tab, and I noticed that
> >>>> I have 597 MB for "Shuffle Write" while I am using a cached
> >>>> temp-table and Spark had 2 GB of free memory (the number under
> >>>> Memory Used is 597 MB / 2.6 GB)?!
> >>>>
> >>>> Shouldn't the Shuffle Write be zero, with all (map/reduce) tasks
> >>>> done in memory?
> >>>>
> >>>> best,
> >>>>
> >>>> /Shahab
> >>>
> >>>
> >>
> >
>
