Hi,

The data (in this example README.md) is kept in the Hadoop Distributed File System (HDFS), spread across the datanodes of the Hadoop cluster. The metadata used to locate the blocks of this file is kept in the namenode. Your data is always stored in HDFS.
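As a rough illustration, here is a minimal spark-shell sketch (the HDFS path, the variable names and the Line case class are just made-up examples; it assumes the usual spark-shell session where sc and sqlContext are already defined). Creating the RDD does not move or copy any data, nothing is read until an action such as count() runs, and the toDF call shows the RDD-to-DataFrame conversion discussed below:

  // nothing is read here; the RDD only records how to find the file's partitions (HDFS blocks)
  val textFile = sc.textFile("hdfs://namenode:8020/user/hduser/README.md")

  // still nothing read; transformations are lazy
  val sparkLines = textFile.filter(line => line.contains("Spark"))

  // count() is an action, so only now does each executor read the partitions local to it
  println(sparkLines.count())

  // RDD -> DataFrame: give the rows a schema so the Catalyst optimizer can plan the query
  import sqlContext.implicits._
  case class Line(text: String)
  val df = textFile.map(Line(_)).toDF()
  df.filter(df("text").contains("Spark")).count()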
Spark is an application that can access this data and do something useful with it. An RDD is a Spark construct ("construct" used loosely here): it holds pointers to the partitions of that file, which are distributed throughout HDFS. In rough-and-ready terms it is an interface between your file and your Spark application.

One of the most important properties of RDDs is that they are immutable. This means that, given the same RDD and the same operations, we will always get the same answer, which also allows Spark to make some optimizations under the hood. If a task fails, Spark simply performs the operation again; there is no state (beyond the current step it is performing) that Spark needs to keep track of.

You can go from an RDD to a DataFrame (if the RDD is in a tabular format) via the toDF method, as in the sketch above. In general it is recommended to use a DataFrame where possible because of the built-in query optimization. A DataFrame is equivalent to a table in an RDBMS and can be manipulated in similar ways to the "native" distributed collections in RDDs. Unlike RDDs, DataFrames keep track of the schema and support various relational operations that lead to more optimized execution. Each DataFrame object represents a logical plan, but because of their "lazy" nature no execution occurs until the user calls a specific output operation (an action).

HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com

On 27 February 2016 at 01:37, Mohammed Guller <moham...@glassbeam.com> wrote:

> HDFS, as the name implies, is a distributed file system. A file stored on
> HDFS is already distributed. So if you create an RDD from an HDFS file, the
> created RDD just points to the file partitions on different nodes.
>
> You can read more about HDFS here:
> http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
>
> Mohammed
> Author: Big Data Analytics with Spark
> <http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>
>
> *From:* Ashok Kumar [mailto:ashok34...@yahoo.com.INVALID]
> *Sent:* Friday, February 26, 2016 9:41 AM
> *To:* User
> *Subject:* Clarification on RDD
>
> Hi,
>
> Spark doco says:
>
> "Spark's primary abstraction is a distributed collection of items called a
> Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop
> InputFormats (such as HDFS files) or by transforming other RDDs."
>
> example:
>
> val textFile = sc.textFile("README.md")
>
> My question is: when an RDD is created like above from a file stored on HDFS,
> does that mean the data is distributed among all the nodes in the cluster, or
> is the data from the md file copied to each node of the cluster so that each
> node has a complete copy of the data? Is the data actually moved around, or is
> it not copied over until an action like count() is performed on the RDD?
>
> Thanks