Hi, sorry for the confusion.
So let me rephrase my question. Why does Spark have to write the intermediate
data to disk when there is a shuffle dependency? Can't the communication
happen directly, as in Giraph? And does data get written to disk on the
reducer side as well?

Again, please feel free to correct me if my understanding is incorrect.

Regards,
SB

On Fri, Jan 24, 2014 at 3:44 AM, Jey Kottalam <[email protected]> wrote:
> Hi Suman,
>
> Spark does indeed do in-memory computation, and does not require
> spilling to disk after every map task. Could you explain where you
> "see that intermediate map outputs gets written to disk"? Perhaps
> you're seeing some intermediate results during a shuffle phase? In
> that case, you may want to look into the
> "spark.shuffle.consolidateFiles" option:
> https://spark.incubator.apache.org/docs/0.8.1/configuration.html
>
> -Jey
>
> On Thu, Jan 23, 2014 at 1:10 PM, suman bharadwaj <[email protected]> wrote:
> > Hi,
> >
> > I might be wrong, but need your help.
> >
> > My understanding in Giraph is that, it doesn't write the intermediate data
> > to disk while sending messages to different machines. But in SPARK, I see
> > that intermediate map outputs gets written to disk. Why does SPARK write
> > intermediate data to disk ?
> >
> > What happens at reducer side ? Does SPARK write the data again to disk ? How
> > does it differ from Hadoop MR ?
> >
> > Can't SPARK communicate everything in memory ?
> >
> > If my understanding is wrong. Please do correct me.
> >
> > Regards,
> > Suman Bharadwaj S
