> When Spark loads data from S3 (sc.textFile('s3://...')), how will all the
> data be spread across the workers?

The data is read by the workers. Just make sure that the data is splittable,
either by using a splittable format or by passing a list of files,
sc.textFile('s3://.../*.txt'), to achieve full parallelism. Otherwise (e.g.,
when reading a single gzipped file) only one worker will read the data.
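For illustration, a minimal PySpark sketch of that point (the bucket and file
names here are made up; the exact partition counts depend on file sizes and
the S3 connector in use):

    from pyspark import SparkContext

    sc = SparkContext(appName="s3-splittability-demo")

    # Plain-text files matched by a glob: each file (and each block of a
    # large file) can become its own partition, so all workers share the read.
    many = sc.textFile('s3://my-bucket/logs/*.txt')
    print(many.getNumPartitions())   # typically >= number of matched files

    # A single gzipped file: gzip is not splittable, so the whole file is
    # read as one partition by one worker.
    single = sc.textFile('s3://my-bucket/logs/big-dump.txt.gz')
    print(single.getNumPartitions())  # 1

    # A common remedy is to repartition after the read, so at least the
    # downstream processing runs in parallel.
    parallel = single.repartition(sc.defaultParallelism)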
> So it might be a trade-off compared to HDFS?

Accessing data on S3 from Hadoop is usually slower than HDFS, cf.
https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#Other_issues

> In what respects is S3 better than HDFS?

It is independent of your Hadoop cluster: it is easier to share, you don't
have to take care of the data when maintaining your cluster, ...

Sebastian

On 12/13/2017 09:39 AM, Philip Lee wrote:
> Hi,
>
> I have a few questions about the structure of HDFS and S3 when a system
> like Spark loads data from these two kinds of storage.
>
> Generally, when Spark loads data from HDFS, HDFS supports data locality
> and already holds the files distributed across its datanodes, right? So
> Spark can just process the data on the workers.
>
> What about S3? Many people in this field use S3 for storage or load data
> from it remotely. When Spark loads data from S3 (sc.textFile('s3://...')),
> how will all the data be spread across the workers? Is the master node
> responsible for this task, i.e., does it read all the data from S3 and
> then spread it to the workers? So it might be a trade-off compared to
> HDFS? Or did I get this wrong?
>
> In what respects is S3 better than HDFS?
>
> Thanks in advance