Re: Performance advantage by loading data from local node over S3.

2015-04-30 Thread Akhil Das
If the data set is very large and stored in S3, reading it means a lot of network traffic. If the data is instead available in HDFS (with proper replication), reads will be faster, because most of the time the data will be available to the executor as PROCESS_LOCAL/NODE_LOCAL. Thanks, Best Regards On Wed, Apr
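
To make the locality point above concrete, here is a toy sketch in plain Python. It is not Spark's scheduler and not any Spark API: the function `locality_level`, the host names, and the reduction to three outcomes are all assumptions made for illustration. The idea it shows is only that an HDFS block has replica hosts an executor can match, while an S3 object has no host to prefer.

```python
from typing import List


def locality_level(executor_host: str, block_hosts: List[str]) -> str:
    """Toy classification of a task's data locality (hypothetical helper).

    block_hosts lists the hosts holding a replica of the input block.
    For S3 input the list is empty, since objects are fetched over the
    network and carry no host placement.
    """
    if not block_hosts:
        return "NO_PREF"      # nothing to be local to, e.g. S3 input
    if executor_host in block_hosts:
        return "NODE_LOCAL"   # a replica lives on the executor's node
    return "ANY"              # replica exists, but on some other node


# Hypothetical 3-way replicated HDFS block versus an S3 object.
hdfs_replicas = ["node1", "node2", "node3"]
print(locality_level("node1", hdfs_replicas))
print(locality_level("node4", hdfs_replicas))
print(locality_level("node1", []))
```

With more replicas, more executors can find a local copy, which is why replication matters in the reply above. Real Spark also distinguishes PROCESS_LOCAL and RACK_LOCAL, which this sketch leaves out.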

Performance advantage by loading data from local node over S3.

2015-04-29 Thread Nisrina Luthfiyati
Hi all, I'm new to Spark, so I'm sorry if the question is too vague. I'm currently trying to deploy a Spark cluster using YARN on an Amazon EMR cluster. For data storage I'm currently using S3, but would loading the data into HDFS on the local nodes give a considerable performance advantage over