Hi,
The Spark doco says:
Spark’s primary abstraction is a distributed collection of items called a 
Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop 
InputFormats (such as HDFS files) or by transforming other RDDs
example:
val textFile = sc.textFile("README.md")

My question is: when an RDD is created like this from a file stored on HDFS, does that mean the data is distributed among all the nodes in the cluster, or is the data from the .md file copied to every node so that each node has a complete copy? And is the data actually moved around when the RDD is created, or is nothing read until an action like count() is performed on the RDD?
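To make the scenario concrete, this is the kind of spark-shell session I have in mind (the HDFS path is just a made-up example, and partitions.length is only there to look at how the file gets split up):

// spark-shell: sc is the SparkContext the shell provides
val textFile = sc.textFile("hdfs:///user/me/README.md")  // just creating the RDD -- does any data move at this point?

// number of partitions the file is split into
println(textFile.partitions.length)

// an action -- is this the point where each node reads its share of the data?
println(textFile.count())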
Thanks
