On Thu, Jan 23, 2014 at 5:41 PM, Matei Zaharia <[email protected]> wrote:

> The data gets written to files for fault tolerance, in case we need to
> re-run a reduce task and re-fetch the files after. Otherwise, we’d have to
> re-run *all* the map tasks whenever one reduce task fails. However, these
> files usually remain in the OS buffer cache so they are written and read at
> memory speed.
>

Then how is it fault tolerant? Does the checkpoint API have a sync, in case
I am really paranoid about a specific step that has a high fan-in of mappers?

Thanks.
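
For anyone following along, the trade-off described above can be sketched as a toy model in plain Python. This is not Spark code: the function names, the `shuffle_store` dict (standing in for the per-reducer shuffle files), and the simulated failure are all assumptions made up for illustration. It only shows why keeping map outputs around lets a failed reduce task be retried without re-running any map task.

```python
from collections import defaultdict


def run_map_task(map_id, records, num_reducers, shuffle_store, counters):
    """Partition one map task's output by reducer and persist each bucket."""
    counters["map_runs"] += 1
    buckets = defaultdict(list)
    for key, value in records:
        buckets[hash(key) % num_reducers].append((key, value))
    for reduce_id in range(num_reducers):
        # Hypothetical stand-in for writing one shuffle file per (map, reducer).
        shuffle_store[(map_id, reduce_id)] = buckets[reduce_id]


def run_reduce_task(reduce_id, num_maps, shuffle_store):
    """Fetch this reducer's bucket from every map output and sum per key."""
    totals = defaultdict(int)
    for map_id in range(num_maps):
        for key, value in shuffle_store[(map_id, reduce_id)]:
            totals[key] += value
    return dict(totals)


def simulate(inputs, num_reducers, failing_reducer):
    """Run all maps once, then reduces; one reducer fails on its first attempt."""
    shuffle_store = {}
    counters = {"map_runs": 0}
    for map_id, records in enumerate(inputs):
        run_map_task(map_id, records, num_reducers, shuffle_store, counters)

    results = {}
    for reduce_id in range(num_reducers):
        attempt = 0
        while True:
            attempt += 1
            try:
                if reduce_id == failing_reducer and attempt == 1:
                    raise RuntimeError("simulated reduce failure")
                results.update(run_reduce_task(reduce_id, len(inputs), shuffle_store))
                break
            except RuntimeError:
                # Retry only this reduce task; the persisted map outputs are reused.
                continue
    return counters["map_runs"], results


inputs = [[("a", 1), ("b", 1)], [("a", 1), ("c", 1)]]
map_runs, results = simulate(inputs, num_reducers=2, failing_reducer=0)
print(map_runs, sorted(results.items()))
```

In this sketch each map task runs exactly once even though one reduce task is retried. Without the persisted buckets, the retry would have to recompute every map output first. Whether the real files survive a machine crash (as opposed to a task failure) is a separate question, which is what the sync question above is getting at.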



> On Jan 23, 2014, at 2:25 PM, suman bharadwaj <[email protected]> wrote:
>
> Hi,
>
> Sorry for the confusion.
>
> So let me rephrase my question.
>
> Why does Spark have to write the intermediate data to disk when there is a
> shuffle dependency? Can't the communication happen directly, just like in
> Giraph?
> And does data get written on the reducer side as well?
>
> Again please feel free to correct me, in case my understanding is
> incorrect.
>
> Regards,
> SB
>
>
> On Fri, Jan 24, 2014 at 3:44 AM, Jey Kottalam <[email protected]> wrote:
>
>> Hi Suman,
>>
>> Spark does indeed do in-memory computation, and does not require
>> spilling to disk after every map task. Could you explain where you
>> "see that intermediate map outputs gets written to disk"? Perhaps
>> you're seeing some intermediate results during a shuffle phase? In
>> that case, you may want to look into the
>> "spark.shuffle.consolidateFiles" option:
>> https://spark.incubator.apache.org/docs/0.8.1/configuration.html
>>
>> -Jey
>>
>> On Thu, Jan 23, 2014 at 1:10 PM, suman bharadwaj <[email protected]> wrote:
>> > Hi,
>> >
>> > I might be wrong, but need your help.
>> >
>> > My understanding of Giraph is that it doesn't write the intermediate
>> > data to disk while sending messages to different machines. But in Spark,
>> > I see that intermediate map outputs gets written to disk. Why does Spark
>> > write intermediate data to disk?
>> >
>> > What happens on the reducer side? Does Spark write the data to disk
>> > again? How does it differ from Hadoop MR?
>> >
>> > Can't Spark communicate everything in memory?
>> >
>> > If my understanding is wrong, please do correct me.
>> >
>> > Regards,
>> > Suman Bharadwaj S
>>
>
>
>
