Hi, the Spark documentation says Spark's primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs, for example: val textFile = sc.textFile("README.md")
My question is: when an RDD is created like above from a file stored on HDFS, is the data distributed among all the nodes in the cluster, or is the data from the file copied to every node so that each node holds a complete copy? And is any data actually moved at creation time, or is nothing read until an action like count() is performed on the RDD? Thanks
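To make the question concrete, here is a minimal sketch (assuming a running SparkContext `sc` and a hypothetical HDFS path) showing the two points where I'm unsure what happens:

```scala
// Step 1: create the RDD from a file on HDFS.
// Question: does any data move or get copied at this point,
// or is this just a lazy description of the dataset?
val textFile = sc.textFile("hdfs:///README.md")

// Step 2: run an action such as count().
// Question: is it only here that the data is read, and is it
// partitioned across the nodes or fully replicated to each one?
val n = textFile.count()
```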