The intermediate shuffle output is written to disk, but it often lands in the OS buffer cache because it is not explicitly fsync'ed, so in many cases it stays entirely in memory. The shuffle behaves the same whether the base RDD is cached or on disk.
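To make the write path concrete, here is a toy sketch in plain Python. It is not Spark's code: the function names (`run_map_task`, `run_reduce_task`, `shuffle_sum_by_key`), the `shuffle_<map>_<reduce>` file naming, and the use of pickle and CRC32 partitioning are all assumptions made for illustration. It only shows the idea that each map task writes its per-reducer output to files without calling fsync, so a reducer that reads them soon afterwards will usually be served from the OS buffer cache.

```python
import os
import pickle
import tempfile
import zlib
from collections import defaultdict


def partition_for(key, num_reducers):
    # Hypothetical partitioner: stable hash of the key modulo reducer count.
    return zlib.crc32(repr(key).encode("utf-8")) % num_reducers


def run_map_task(map_id, records, num_reducers, shuffle_dir):
    """Bucket records by reducer and write one file per bucket."""
    buckets = defaultdict(list)
    for key, value in records:  # records may be any iterable, cached or not
        buckets[partition_for(key, num_reducers)].append((key, value))
    for reduce_id in range(num_reducers):
        path = os.path.join(shuffle_dir, f"shuffle_{map_id}_{reduce_id}")
        with open(path, "wb") as f:
            pickle.dump(buckets[reduce_id], f)
            # Deliberately no os.fsync(f.fileno()): the data is handed to the
            # OS, which may keep it in its buffer cache and flush it later.


def run_reduce_task(reduce_id, num_maps, shuffle_dir):
    """Read this reducer's bucket from every map task and sum values per key."""
    totals = defaultdict(int)
    for map_id in range(num_maps):
        path = os.path.join(shuffle_dir, f"shuffle_{map_id}_{reduce_id}")
        with open(path, "rb") as f:
            for key, value in pickle.load(f):
                totals[key] += value
    return dict(totals)


def shuffle_sum_by_key(partitions, num_reducers):
    """Run all map tasks, then all reduce tasks, and merge the results."""
    with tempfile.TemporaryDirectory() as shuffle_dir:
        for map_id, records in enumerate(partitions):
            run_map_task(map_id, records, num_reducers, shuffle_dir)
        merged = {}
        for reduce_id in range(num_reducers):
            merged.update(run_reduce_task(reduce_id, len(partitions), shuffle_dir))
        return merged
```

Calling `shuffle_sum_by_key([[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]], num_reducers=2)` sums the values per key across both input partitions. The map side never looks at where its records came from, which is the sense in which the shuffle does not care whether the input was cached or read from disk.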
For on-disk RDDs or inputs, the shuffle path still has some key differences from Hadoop's implementation, including that it doesn't sort on the map side before shuffling.

- Patrick

On Thu, Jan 16, 2014 at 6:24 AM, suman bharadwaj <[email protected]> wrote:
> Hi,
>
> Is this behavior the same when the data is in memory ?
> If the data is stored to disk, then how is it different than Hadoop map
> reduce ?
>
> Regards,
> SB
>
>
> On Thu, Jan 16, 2014 at 5:11 PM, Archit Thakur <[email protected]>
> wrote:
>>
>> For any shuffle operation, groupByKey, etc. it does write map output to
>> disk before performing the reduce task on the data.
>>
>>
>> On Thu, Jan 16, 2014 at 4:03 PM, suman bharadwaj <[email protected]>
>> wrote:
>>>
>>> Hi,
>>>
>>> I'm new to spark. And wanted to understand more on how shuffle works in
>>> spark
>>>
>>> In Hadoop map reduce, while performing a reduce operation, the
>>> intermediate data from map gets written to disk. How does the same happen in
>>> Spark ?
>>>
>>> Does spark write the intermediate data to disk ?
>>>
>>> Thanks in advance.
>>>
>>> Regards,
>>> SB
>>
>>
>
