why does shuffle in spark write shuffle data to disk by default?

2015-11-23 Thread huan zhang
Hi All,
I'm wondering why shuffle in Spark writes shuffle data to disk by
default.
On Stack Overflow, someone said it's for fault tolerance (FTS), but a node
going down is the most common kind of fault, and writing to local disk
cannot provide fault tolerance in that case either.
So why not use a ramdisk by default instead of only SSD or HDD?

Thanks
Hubert Zhang


Re: why does shuffle in spark write shuffle data to disk by default?

2015-11-23 Thread Reynold Xin
I think for most jobs the bottleneck isn't in writing shuffle data to disk,
since shuffle data needs to be "shuffled" and sent across the network.

You can always use a ramdisk yourself. Requiring ramdisk by default would
significantly complicate configuration and platform portability.
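
To make the "use a ramdisk yourself" option concrete, here is a minimal
sketch that builds a spark-submit command pointing Spark's scratch
directory at a RAM-backed mount. spark.local.dir is the Spark setting for
scratch space, which includes shuffle map output files. The mount point
/mnt/ramdisk/spark, the application name my_job.py, and the helper function
are assumptions made for illustration only. Note that a cluster manager may
override this setting (SPARK_LOCAL_DIRS in standalone mode, LOCAL_DIRS on
YARN).

```python
import shlex


def spark_submit_with_ramdisk(app, ramdisk_dir="/mnt/ramdisk/spark"):
    """Build a spark-submit command whose scratch space is a RAM-backed dir.

    Hypothetical helper: ramdisk_dir is assumed to be a tmpfs mount that
    already exists on every worker, created for example with
        sudo mount -t tmpfs -o size=8g tmpfs /mnt/ramdisk
    """
    return [
        "spark-submit",
        # spark.local.dir holds shuffle map output and spilled data.
        "--conf", f"spark.local.dir={ramdisk_dir}",
        app,
    ]


cmd = spark_submit_with_ramdisk("my_job.py")
print(shlex.join(cmd))
```

Shuffle files written this way compete with executors for memory and are
lost on reboot, so the mount should be sized with that in mind.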


On Mon, Nov 23, 2015 at 5:36 PM, huan zhang wrote:

> Hi All,
> I'm wondering why shuffle in Spark writes shuffle data to disk by
> default.
> On Stack Overflow, someone said it's for fault tolerance (FTS), but a node
> going down is the most common kind of fault, and writing to local disk
> cannot provide fault tolerance in that case either.
> So why not use a ramdisk by default instead of only SSD or HDD?
>
> Thanks
> Hubert Zhang
>