Re: Spark loads data from HDFS or S3

2017-12-13 Thread Jörn Franke
On Amazon, S3 is usually cheaper to operate than HDFS.

As you correctly describe, it does not support data locality; the data is 
shipped over the network to the workers.

Depending on your use case, it can make sense to use HDFS as a temporary 
“cache” for S3 data.
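
For illustration, a minimal PySpark sketch of that caching pattern; the s3a
scheme, bucket name, and HDFS paths are placeholders, not from this thread:

  # Hypothetical example: stage the S3 data onto HDFS once, then run
  # repeated jobs against the local copy to regain data locality.
  from pyspark import SparkContext

  sc = SparkContext(appName="s3-hdfs-cache-sketch")

  # One-time, network-bound copy from S3 to HDFS.
  sc.textFile("s3a://some-bucket/input/*.txt") \
    .saveAsTextFile("hdfs:///tmp/s3-cache/input")

  # Later jobs read the HDFS copy and benefit from data locality.
  cached = sc.textFile("hdfs:///tmp/s3-cache/input")
  print(cached.count())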

> On 13. Dec 2017, at 09:39, Philip Lee  wrote:
> 
> Hi,
> 
> I have a few questions about how Spark loads data from HDFS versus S3.
> 
> Generally, when Spark loads data from HDFS, HDFS supports data locality and 
> the files are already distributed across the DataNodes, right? So Spark can 
> just process the data on the workers.
> 
> What about S3? Many people in this field use S3 for storage or for loading 
> data remotely. When Spark loads data from S3 (sc.textFile('s3://...')), how 
> will the data be spread across the workers? Is the master node responsible 
> for this, i.e., does it read all the data from S3 and then distribute it to 
> the workers? So it might be a trade-off compared to HDFS? Or have I got this 
> wrong?
> 
> In what respects is S3 better than HDFS?
> 
> Thanks in advance


Re: Spark loads data from HDFS or S3

2017-12-13 Thread Sebastian Nagel
> When Spark loads data from S3 (sc.textFile('s3://...')), how will the data 
> be spread across the workers?

The data is read by the workers. Just make sure that the input is splittable, 
either by using a splittable format or by passing a glob that matches many 
files, e.g.
  sc.textFile('s3://.../*.txt')
so that you get full parallelism. Otherwise (e.g., when reading a single 
gzipped file) only one worker will read the data.
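
To make this concrete, a small sketch (bucket and paths are placeholders,
assuming a working s3a setup):

  from pyspark import SparkContext

  sc = SparkContext(appName="splittability-sketch")

  # A glob matching many text files: each file becomes at least one
  # partition, so all workers read in parallel.
  parallel = sc.textFile("s3a://some-bucket/logs/*.txt")

  # A single gzipped file: gzip is not splittable, so the whole file
  # ends up in a single partition read by one worker.
  single = sc.textFile("s3a://some-bucket/logs/big.log.gz")

  print(parallel.getNumPartitions(), single.getNumPartitions())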

> So it might be a trade-off compared to HDFS?

Accessing data on S3 from Hadoop is usually slower than HDFS, cf.
  https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#Other_issues
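
As a side note, S3 access from Spark typically goes through the s3a connector
from hadoop-aws; a minimal, hypothetical configuration sketch (the credential
values and paths are placeholders):

  from pyspark import SparkConf, SparkContext

  # spark.hadoop.* settings are forwarded to the Hadoop configuration,
  # where fs.s3a.* is picked up by the s3a filesystem.
  conf = (SparkConf()
          .setAppName("s3a-config-sketch")
          .set("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")
          .set("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY"))
  sc = SparkContext(conf=conf)

  rdd = sc.textFile("s3a://some-bucket/data/part-*")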

> In what respects is S3 better than HDFS?

It's independent of your Hadoop cluster: the data is easier to share, and you
don't have to care for it when maintaining your cluster, ...

Sebastian



Spark loads data from HDFS or S3

2017-12-13 Thread Philip Lee
Hi,

I have a few questions about how Spark loads data from HDFS versus S3.

Generally, when Spark loads data from HDFS, HDFS supports data locality and
the files are already distributed across the DataNodes, right? So Spark can
just process the data on the workers.

What about S3? Many people in this field use S3 for storage or for loading
data remotely. When Spark loads data from S3 (sc.textFile('s3://...')), how
will the data be spread across the workers? Is the master node responsible
for this, i.e., does it read all the data from S3 and then distribute it to
the workers? So it might be a trade-off compared to HDFS? Or have I got this
wrong?

In what respects is S3 better than HDFS?

Thanks in advance