HDFS, as the name implies, is a distributed file system. A file stored on HDFS 
is already distributed across the cluster. So if you create an RDD from an HDFS 
file, the RDD just points to the file partitions (HDFS blocks) on the different 
nodes; the data is not copied again.
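To make that concrete, here is a minimal spark-shell sketch (the HDFS path is only an 
illustrative placeholder, and sc is the shell's SparkContext):

// Creating the RDD does not read any data; it only records where the blocks live.
val lines = sc.textFile("hdfs:///user/hadoop/README.md")

// By default there is one partition per HDFS block, and each partition stays
// on the node(s) holding that block.
println(lines.partitions.length)

// Only an action such as count() triggers the distributed read; tasks run on
// the nodes where the blocks are stored whenever possible.
println(lines.count())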

You can read more about HDFS here.

http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html

Mohammed
Author: Big Data Analytics with Spark 
<http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>

From: Ashok Kumar [mailto:ashok34...@yahoo.com.INVALID]
Sent: Friday, February 26, 2016 9:41 AM
To: User
Subject: Clarification on RDD

Hi,

The Spark documentation says:

Spark’s primary abstraction is a distributed collection of items called a 
Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop 
InputFormats (such as HDFS files) or by transforming other RDDs

example:

val textFile = sc.textFile("README.md")


My question is: when an RDD is created like the above from a file stored on HDFS, 
does that mean the data is distributed among all the nodes in the cluster, or is 
the md file copied to each node of the cluster so that every node has a complete 
copy of the data? Also, is the data actually moved at creation time, or is nothing 
copied over until an action like count() is performed on the RDD?

Thanks
