> When Spark loads data from S3 (sc.textFile('s3://...')), how will all the
> data be spread across the workers?

The data is read by the workers. Just make sure that the data is splittable,
either by using a splittable format or by passing a list of files,
sc.textFile('s3://.../*.txt'), to achieve full parallelism. Otherwise (e.g.,
when reading a single gzipped file) only one worker will read the data.
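For illustration, a minimal PySpark sketch of that point (the bucket and file
names here are made up; the exact partition counts depend on file sizes and
the S3 connector in use):

    from pyspark import SparkContext

    sc = SparkContext(appName="s3-splittability-demo")

    # Plain-text files matched by a glob: each file (and each block of a
    # large file) can become its own partition, so all workers share the read.
    many = sc.textFile('s3://my-bucket/logs/*.txt')
    print(many.getNumPartitions())   # typically >= number of matched files

    # A single gzipped file: gzip is not splittable, so the whole file is
    # read as one partition by one worker.
    single = sc.textFile('s3://my-bucket/logs/big-dump.txt.gz')
    print(single.getNumPartitions())  # 1

    # A common remedy is to repartition after the read, so at least the
    # downstream processing runs in parallel.
    parallel = single.repartition(sc.defaultParallelism)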
> So it might be a trade-off compared to HDFS?

Accessing data on S3 from Hadoop is usually slower than HDFS, cf.
https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#Other_issues

> In what respects is S3 better than HDFS?

It is independent of your Hadoop cluster: it is easier to share, you don't
have to take care of the data when maintaining your cluster, ...

Sebastian

On 12/13/2017 09:39 AM, Philip Lee wrote:
> Hi,
>
> I have a few questions about the structure of HDFS and S3 when a system
> like Spark loads data from these two kinds of storage.
>
> Generally, when Spark loads data from HDFS, HDFS supports data locality
> and already holds the files distributed across its datanodes, right? So
> Spark can just process the data on the workers.
>
> What about S3? Many people in this field use S3 for storage or load data
> from it remotely. When Spark loads data from S3 (sc.textFile('s3://...')),
> how will all the data be spread across the workers? Is the master node
> responsible for this task, i.e., does it read all the data from S3 and
> then spread it to the workers? So it might be a trade-off compared to
> HDFS? Or did I get this wrong?
>
> In what respects is S3 better than HDFS?
>
> Thanks in advance