Hi,

The data (in this example README.md) is kept in the Hadoop Distributed File System (HDFS), spread across the datanodes of the Hadoop cluster. The metadata used to locate the blocks of this file is kept in the namenode. Your data is always stored in HDFS.
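As a rough illustration, here is a minimal spark-shell sketch (the HDFS path, the variable names and the Line case class are just made-up examples; it assumes the usual spark-shell session where sc and sqlContext are already defined). Creating the RDD does not move or copy any data, nothing is read until an action such as count() runs, and the toDF call shows the RDD-to-DataFrame conversion discussed below:

  // nothing is read here; the RDD only records how to find the file's partitions (HDFS blocks)
  val textFile = sc.textFile("hdfs://namenode:8020/user/hduser/README.md")

  // still nothing read; transformations are lazy
  val sparkLines = textFile.filter(line => line.contains("Spark"))

  // count() is an action, so only now does each executor read the partitions local to it
  println(sparkLines.count())

  // RDD -> DataFrame: give the rows a schema so the Catalyst optimizer can plan the query
  import sqlContext.implicits._
  case class Line(text: String)
  val df = textFile.map(Line(_)).toDF()
  df.filter(df("text").contains("Spark")).count()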
Spark is an application that can access this data and do something useful with it. An RDD is a Spark construct ("construct" used loosely here): it holds pointers to the partitions of that file, which are distributed throughout HDFS. In rough-and-ready terms it is an interface between your file and your Spark application.

One of the most important properties of RDDs is that they are immutable. This means that, given the same RDD and the same operations, we will always get the same answer, which also allows Spark to make some optimizations under the hood. If a task fails, Spark simply performs the operation again; there is no state (beyond the current step it is performing) that Spark needs to keep track of.

You can go from an RDD to a DataFrame (if the RDD is in a tabular format) via the toDF method, as in the sketch above. In general it is recommended to use a DataFrame where possible because of the built-in query optimization. A DataFrame is equivalent to a table in an RDBMS and can be manipulated in similar ways to the "native" distributed collections in RDDs. Unlike RDDs, DataFrames keep track of the schema and support various relational operations that lead to more optimized execution. Each DataFrame object represents a logical plan, but because of their "lazy" nature no execution occurs until the user calls a specific output operation (an action).

HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com

On 27 February 2016 at 01:37, Mohammed Guller <moham...@glassbeam.com> wrote:

> HDFS, as the name implies, is a distributed file system. A file stored on
> HDFS is already distributed. So if you create an RDD from an HDFS file, the
> created RDD just points to the file partitions on different nodes.
>
> You can read more about HDFS here:
> http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
>
> Mohammed
> Author: Big Data Analytics with Spark
> <http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>
>
> *From:* Ashok Kumar [mailto:ashok34...@yahoo.com.INVALID]
> *Sent:* Friday, February 26, 2016 9:41 AM
> *To:* User
> *Subject:* Clarification on RDD
>
> Hi,
>
> Spark doco says:
>
> "Spark's primary abstraction is a distributed collection of items called a
> Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop
> InputFormats (such as HDFS files) or by transforming other RDDs."
>
> example:
>
> val textFile = sc.textFile("README.md")
>
> My question is: when an RDD is created like above from a file stored on HDFS,
> does that mean the data is distributed among all the nodes in the cluster, or
> is the data from the md file copied to each node of the cluster so that each
> node has a complete copy of the data? Is the data actually moved around, or is
> it not copied over until an action like count() is performed on the RDD?
>
> Thanks