HDFS, as the name implies, is a distributed file system. A file stored on HDFS is already split into blocks and distributed across the cluster. So when you create an RDD from an HDFS file, the RDD just points to those file partitions on the different nodes; the data is not copied to every node, and nothing is read or moved until an action such as count() triggers a job.
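For example, in the spark-shell (a minimal sketch; the HDFS path and port below are placeholders, not your actual cluster):

val textFile = sc.textFile("hdfs://namenode:8020/user/ashok/README.md")  // lazy: only records which HDFS blocks make up the file, reads nothing yet
val lines = textFile.count()  // action: tasks run on the nodes holding those blocks and read them locally
println("Line count: " + lines)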
You can read more about HDFS here:
http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html

Mohammed
Author: Big Data Analytics with Spark <http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>

From: Ashok Kumar [mailto:ashok34...@yahoo.com.INVALID]
Sent: Friday, February 26, 2016 9:41 AM
To: User
Subject: Clarification on RDD

Hi,

The Spark docs say that Spark's primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs, for example:

val textFile = sc.textFile("README.md")

My question is: when an RDD is created like the above from a file stored on HDFS, does that mean the data is distributed among all the nodes in the cluster, or is the data from the md file copied to each node of the cluster so that each node has a complete copy? Is the data actually moved around, or is it not copied over until an action like count() is performed on the RDD?

Thanks