Re: Performance advantage by loading data from local node over S3.

Akhil Das Thu, 30 Apr 2015 01:56:37 -0700

If the dataset is very large and sits in S3, reading it means a lot of
network traffic. If the data is instead available in HDFS (with proper
replication), loading will be faster, because most of the time the data
will be available to the executor as PROCESS_LOCAL/NODE_LOCAL.
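
To get a feel for how much this matters for a given job, here is a rough
back-of-the-envelope sketch in Python. Every number in it (data size, node
count, per-node read throughput, compute time) is a made-up assumption for
illustration, not a measurement of S3, HDFS, or EMR; plug in figures from
your own cluster. It only models a job as "read everything once in
parallel, then compute".

```python
def load_seconds(data_gb, nodes, per_node_mb_per_s):
    """Seconds to read data_gb once when all nodes read their share in parallel."""
    total_mb = data_gb * 1024
    return total_mb / (nodes * per_node_mb_per_s)


def load_share(load_s, compute_s):
    """Fraction of the total job time that is spent loading data."""
    return load_s / (load_s + compute_s)


# Hypothetical inputs: replace with values observed on your own cluster.
DATA_GB = 500          # assumed input size
NODES = 10             # assumed number of worker nodes
S3_MB_PER_S = 50       # assumed per-node read throughput from S3
HDFS_MB_PER_S = 200    # assumed per-node read throughput from node-local HDFS
COMPUTE_S = 600        # assumed in-memory compute time after loading

s3_load = load_seconds(DATA_GB, NODES, S3_MB_PER_S)
hdfs_load = load_seconds(DATA_GB, NODES, HDFS_MB_PER_S)

print("S3 load (s):", s3_load, "share of job:", load_share(s3_load, COMPUTE_S))
print("HDFS load (s):", hdfs_load, "share of job:", load_share(hdfs_load, COMPUTE_S))
```

If the load share comes out small even for S3, moving the data to HDFS will
not change the runtime much; if it dominates, locality is worth the effort.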


Thanks
Best Regards

On Wed, Apr 29, 2015 at 10:50 PM, Nisrina Luthfiyati <
nisrina.luthfiy...@gmail.com> wrote:

> Hi all,
> I'm new to Spark, so I'm sorry if the question is too vague. I'm currently
> trying to deploy a Spark cluster using YARN on an Amazon EMR cluster. For
> data storage I'm currently using S3, but would loading the data from HDFS
> on the local nodes give a considerable performance advantage over loading
> from S3?
> Would the reduced network latency during data load have a large effect on
> the runtime, considering most of the computation is done in memory?
>
> Thank you,
> Nisrina.
>
