Thanks guys. Really appreciate it!! Things get clarified at memory speed
in this forum :)

Regards,
SB


On Fri, Jan 24, 2014 at 4:11 AM, Matei Zaharia <[email protected]> wrote:

> The data gets written to files for fault tolerance, in case we need to
> re-run a reduce task and re-fetch the files after. Otherwise, we’d have to
> re-run *all* the map tasks whenever one reduce task fails. However, these
> files usually remain in the OS buffer cache so they are written and read at
> memory speed. In the future we might add a setting that skips this and uses
> Spark’s memory store for shuffle data instead.
>
> On the reduce side there’s no use of disk except in Spark 0.9, where we
> added the option to spill to disk if the reduce’s inputs don’t fit in
> memory.
>
> Matei
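
The fault-tolerance argument above can be illustrated with a toy sketch in plain Python. This is not Spark's implementation: the function names, the JSON file layout, and the byte-sum partitioner are all hypothetical, chosen only to show that once map outputs are materialized as files, a failed reduce task can be re-run by re-reading those files, without re-running any map task.

```python
import json
import os
import tempfile
from collections import defaultdict

NUM_REDUCERS = 2  # hypothetical fixed number of reduce tasks


def run_map_task(task_id, words, shuffle_dir, stats):
    """Partition (word, 1) pairs by reducer and write one file per reducer."""
    stats["map_runs"] += 1  # record that this map task really executed
    buckets = defaultdict(list)
    for word in words:
        # Toy deterministic partitioner (an assumption, not Spark's hashing).
        buckets[sum(word.encode()) % NUM_REDUCERS].append((word, 1))
    for reducer_id in range(NUM_REDUCERS):
        path = os.path.join(shuffle_dir, f"map{task_id}_reduce{reducer_id}.json")
        with open(path, "w") as f:
            json.dump(buckets[reducer_id], f)


def run_reduce_task(reducer_id, num_maps, shuffle_dir):
    """Fetch this reducer's bucket from every map output file and sum counts."""
    counts = defaultdict(int)
    for task_id in range(num_maps):
        path = os.path.join(shuffle_dir, f"map{task_id}_reduce{reducer_id}.json")
        with open(path) as f:
            for word, n in json.load(f):
                counts[word] += n
    return dict(counts)


def demo():
    inputs = [["a", "b", "a"], ["b", "c"]]
    stats = {"map_runs": 0}
    with tempfile.TemporaryDirectory() as shuffle_dir:
        for task_id, words in enumerate(inputs):
            run_map_task(task_id, words, shuffle_dir, stats)
        first = run_reduce_task(0, len(inputs), shuffle_dir)
        # Pretend reduce task 0 failed: the retry only re-reads the files.
        retry = run_reduce_task(0, len(inputs), shuffle_dir)
        other = run_reduce_task(1, len(inputs), shuffle_dir)
    return first, retry, other, stats["map_runs"]
```

Calling `demo()` runs each map task once, even though reduce task 0 runs twice. Without the files, the retry would have nothing to fetch and every map task would have to run again, which is the cost described above.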
>
> On Jan 23, 2014, at 2:25 PM, suman bharadwaj <[email protected]> wrote:
>
> Hi,
>
> Sorry for the confusion.
>
> So let me rephrase my question.
>
> Why does SPARK have to write the intermediate data to disk when there is a
> shuffle dependency? Can't the communication happen directly, just like in
> Giraph?
> And does data get written on the reducer side as well?
>
> Again please feel free to correct me, in case my understanding is
> incorrect.
>
> Regards,
> SB
>
>
> On Fri, Jan 24, 2014 at 3:44 AM, Jey Kottalam <[email protected]> wrote:
>
>> Hi Suman,
>>
>> Spark does indeed do in-memory computation, and does not require
>> spilling to disk after every map task. Could you explain where you
>> "see that intermediate map outputs gets written to disk"? Perhaps
>> you're seeing some intermediate results during a shuffle phase? In
>> that case, you may want to look into the
>> "spark.shuffle.consolidateFiles" option:
>> https://spark.incubator.apache.org/docs/0.8.1/configuration.html
>>
>> -Jey
>>
>> On Thu, Jan 23, 2014 at 1:10 PM, suman bharadwaj <[email protected]>
>> wrote:
>> > Hi,
>> >
>> > I might be wrong, but need your help.
>> >
>> > My understanding of Giraph is that it doesn't write the intermediate
>> > data to disk while sending messages to different machines. But in SPARK,
>> > I see that intermediate map outputs gets written to disk. Why does SPARK
>> > write intermediate data to disk?
>> >
>> > What happens on the reducer side? Does SPARK write the data to disk
>> > again? How does it differ from Hadoop MR?
>> >
>> > Can't SPARK communicate everything in memory?
>> >
>> > If my understanding is wrong, please do correct me.
>> >
>> > Regards,
>> > Suman Bharadwaj S
>>
>
>
>
