Thanks, guys. Really appreciate it! Things get clarified at memory speed on this forum :)
Regards,
SB

On Fri, Jan 24, 2014 at 4:11 AM, Matei Zaharia <[email protected]> wrote:

> The data gets written to files for fault tolerance, in case we need to
> re-run a reduce task and re-fetch the files after. Otherwise, we’d have to
> re-run *all* the map tasks whenever one reduce task fails. However, these
> files usually remain in the OS buffer cache so they are written and read at
> memory speed. In the future we might add a setting that skips this and uses
> Spark’s memory store for shuffle data instead.
>
> On the reduce side there’s no use of disk except in Spark 0.9, where we
> added the option to spill to disk if the reduce’s inputs don’t fit in
> memory.
>
> Matei
>
> On Jan 23, 2014, at 2:25 PM, suman bharadwaj <[email protected]> wrote:
>
> Hi,
>
> Sorry for the confusion.
>
> So let me rephrase my question.
>
> Why does SPARK have to write the intermediate data to disk when there is a
> shuffle dependency? Can't the communication happen directly just like
> Giraph ?
> And does data get written at reducer side as well ?
>
> Again please feel free to correct me, in case my understanding is
> incorrect.
>
> Regards,
> SB
>
>
> On Fri, Jan 24, 2014 at 3:44 AM, Jey Kottalam <[email protected]> wrote:
>
>> Hi Suman,
>>
>> Spark does indeed do in-memory computation, and does not require
>> spilling to disk after every map task. Could you explain where you
>> "see that intermediate map outputs gets written to disk"? Perhaps
>> you're seeing some intermediate results during a shuffle phase? In
>> that case, you may want to look into the
>> "spark.shuffle.consolidateFiles" option:
>> https://spark.incubator.apache.org/docs/0.8.1/configuration.html
>>
>> -Jey
>>
>> On Thu, Jan 23, 2014 at 1:10 PM, suman bharadwaj <[email protected]>
>> wrote:
>> > Hi,
>> >
>> > I might be wrong, but need your help.
>> >
>> > My understanding in Giraph is that, it doesn't write the intermediate
>> > data to disk while sending messages to different machines. But in
>> > SPARK, I see that intermediate map outputs gets written to disk. Why
>> > does SPARK write intermediate data to disk ?
>> >
>> > What happens at reducer side ? Does SPARK write the data again to
>> > disk ? How does it differ from Hadoop MR ?
>> >
>> > Can't SPARK communicate everything in memory ?
>> >
>> > If my understanding is wrong. Please do correct me.
>> >
>> > Regards,
>> > Suman Bharadwaj S
>> >
>
>
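
For anyone reading this thread later, here is a toy model of the fault-tolerance argument Matei gives: if map outputs are kept somewhere a retried reduce task can re-fetch them, a reduce failure costs no extra map work; if they are not kept, every map task has to run again. This is plain Python, not Spark code. The names `run_job`, `partition`, `persist`, and `failing_reduce` are made up for illustration, and an in-memory list stands in for the shuffle files on local disk.

```python
import zlib
from collections import defaultdict


def partition(key, num_reduces):
    # Deterministic hash partitioner (hypothetical helper, not Spark's).
    return zlib.crc32(key.encode("utf-8")) % num_reduces


def run_job(records, num_maps, num_reduces, persist, failing_reduce=None):
    """Sum values per key; return (result, number of map task runs).

    persist=True models map outputs written to shuffle files, so a
    retried reduce task can re-fetch them. persist=False models purely
    in-flight data that is lost when a reduce task fails.
    """
    map_runs = 0

    def run_map(m):
        nonlocal map_runs
        map_runs += 1
        buckets = defaultdict(list)  # reduce id -> [(key, value)]
        for key, value in records[m::num_maps]:
            buckets[partition(key, num_reduces)].append((key, value))
        return buckets

    # Map phase: every map task runs exactly once.
    outputs = [run_map(m) for m in range(num_maps)]

    result = {}
    for r in range(num_reduces):
        if r == failing_reduce and not persist:
            # The first attempt of this reduce task failed and its inputs
            # were never stored, so all map tasks must be re-run.
            outputs = [run_map(m) for m in range(num_maps)]
        # With persist=True the retry simply re-reads the stored outputs.
        totals = defaultdict(int)
        for buckets in outputs:
            for key, value in buckets.get(r, []):
                totals[key] += value
        result.update(totals)
    return result, map_runs


if __name__ == "__main__":
    data = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
    print(run_job(data, 2, 2, persist=True, failing_reduce=1)[1])   # 2 map runs
    print(run_job(data, 2, 2, persist=False, failing_reduce=1)[1])  # 4 map runs
```

The results are identical in both modes; only the amount of repeated map work differs, which is the trade-off described in the reply above.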
