Shuffle data gets written to files for fault tolerance, so that if a reduce task fails we can re-run it and re-fetch the files. Otherwise, we’d have to re-run *all* the map tasks whenever one reduce task fails. However, these files usually remain in the OS buffer cache, so they are written and read at memory speed. In the future we might add a setting that skips this and uses Spark’s memory store for shuffle data instead.
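As a toy illustration of the fault-tolerance point above, here is a small Python sketch. It is not Spark's shuffle code: the function names, the file naming scheme, and the word-count job are all made up for the example. Each map task writes one file per reducer, and a "failed" reduce task is retried by re-reading those files without running any map task again.

```python
import os
import tempfile
from collections import Counter


def partition_for(word, num_reducers):
    # Deterministic toy partitioner (hypothetical; Python's hash() is salted per process).
    return sum(map(ord, word)) % num_reducers


def run_map_task(map_id, records, num_reducers, shuffle_dir):
    """Count words and write one output file per reducer."""
    buckets = [Counter() for _ in range(num_reducers)]
    for word in records:
        buckets[partition_for(word, num_reducers)][word] += 1
    for reduce_id, bucket in enumerate(buckets):
        path = os.path.join(shuffle_dir, f"map{map_id}_reduce{reduce_id}.txt")
        with open(path, "w") as f:
            for word, count in bucket.items():
                f.write(f"{word}\t{count}\n")


def run_reduce_task(reduce_id, num_maps, shuffle_dir):
    """Fetch this reducer's file from every map task and merge the counts."""
    totals = Counter()
    for map_id in range(num_maps):
        path = os.path.join(shuffle_dir, f"map{map_id}_reduce{reduce_id}.txt")
        with open(path) as f:
            for line in f:
                word, count = line.rstrip("\n").split("\t")
                totals[word] += int(count)
    return totals


def demo():
    inputs = [["a", "b", "a"], ["b", "c"], ["a", "c", "c"]]
    num_reducers = 2
    map_runs = 0
    with tempfile.TemporaryDirectory() as shuffle_dir:
        for map_id, records in enumerate(inputs):
            run_map_task(map_id, records, num_reducers, shuffle_dir)
            map_runs += 1
        first = run_reduce_task(0, len(inputs), shuffle_dir)
        # Pretend reduce task 0 failed: retry it. The map output files are
        # still on disk, so no map task has to run a second time.
        retry = run_reduce_task(0, len(inputs), shuffle_dir)
    return first, retry, map_runs


if __name__ == "__main__":
    print(demo())
```

Without the files, the retry would have nothing to read, and all three map tasks would have to run again just to feed one reducer.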
On the reduce side there’s no use of disk except in Spark 0.9, where we added the option to spill to disk if the reduce’s inputs don’t fit in memory.

Matei

On Jan 23, 2014, at 2:25 PM, suman bharadwaj <[email protected]> wrote:

> Hi,
>
> Sorry for the confusion.
>
> So let me rephrase my question.
>
> Why does SPARK have to write the intermediate data to disk when there is a
> shuffle dependency? Can't the communication happen directly just like Giraph ?
> And does data get written at reducer side as well ?
>
> Again please feel free to correct me, in case my understanding is incorrect.
>
> Regards,
> SB
>
>
> On Fri, Jan 24, 2014 at 3:44 AM, Jey Kottalam <[email protected]> wrote:
> Hi Suman,
>
> Spark does indeed do in-memory computation, and does not require
> spilling to disk after every map task. Could you explain where you
> "see that intermediate map outputs gets written to disk"? Perhaps
> you're seeing some intermediate results during a shuffle phase? In
> that case, you may want to look into the
> "spark.shuffle.consolidateFiles" option:
> https://spark.incubator.apache.org/docs/0.8.1/configuration.html
>
> -Jey
>
> On Thu, Jan 23, 2014 at 1:10 PM, suman bharadwaj <[email protected]> wrote:
> > Hi,
> >
> > I might be wrong, but need your help.
> >
> > My understanding in Giraph is that, it doesn't write the intermediate data
> > to disk while sending messages to different machines. But in SPARK, I see
> > that intermediate map outputs gets written to disk. Why does SPARK write
> > intermediate data to disk ?
> >
> > What happens at reducer side ? Does SPARK write the data again to disk ? How
> > does it differ from Hadoop MR ?
> >
> > Can't SPARK communicate everything in memory ?
> >
> > If my understanding is wrong. Please do correct me.
> >
> > Regards,
> > Suman Bharadwaj S
> >
