I think there can be a performance reason: RDDs can be faster than Datasets.
For example, check the query plan for this code:

    spark.range(100).map(_ * 2).filter(_ < 100).map(_ * 2).collect()

There are two serialize/deserialize pairs. Then compare with the RDD equivalent:

    sc.parallelize(1 to 100).map(_ * 2).filter(_ < 100).map(_ * 2).collect()

Regards,
M

2016-09-01 18:15 GMT+02:00 Sean Owen <so...@cloudera.com>:
> On Thu, Sep 1, 2016 at 4:56 PM, Mich Talebzadeh
> <mich.talebza...@gmail.com> wrote:
> > Data Frame is built on top of RDD to create a tabular format that we all
> > love, to make the original build easily usable (say SQL-like queries,
> > column headings etc). The drawback is it restricts what you can do with
> > a Data Frame (now that you have done RDD.toDF)
>
> DataFrame is a Dataset[Row], literally, rather than based on an RDD.
>
> > DataSet is the new RDD with improvements on RDD. As I understand from
> > Sean's explanation they add some optimisation on top of the common RDD.
>
> At the moment I don't think there's any particular reason to use RDDs
> except to interoperate with code that uses RDDs -- which is entirely
> valid. I believe new code would generally touch only Dataset and
> DataFrame otherwise. So I don't think there are really 3 elemental
> concepts in play as of Spark 2.x.

--
Maciek Bryński
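To see the serialize/deserialize pairs in the plan yourself, you can call `explain()` on the Dataset instead of `collect()`. A minimal sketch, assuming a local SparkSession (the untyped `select`/`filter` variant at the end is my addition, showing how Column expressions avoid the object round-trips that typed lambdas force):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object PlanDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("plan-demo")
      .getOrCreate()
    import spark.implicits._

    // Typed lambdas: each map/filter works on JVM objects, so the plan
    // shows DeserializeToObject / SerializeFromObject pairs around them.
    spark.range(100).map(_ * 2).filter(_ < 100).map(_ * 2).explain()

    // Untyped Column expressions: Catalyst can optimize these and stay in
    // Tungsten's binary row format -- no serialize/deserialize steps.
    spark.range(100)
      .select(col("id") * 2 as "v")
      .filter(col("v") < 100)
      .select(col("v") * 2)
      .explain()

    spark.stop()
  }
}
```

The RDD version pays no such cost because it works on JVM objects end to end, which is one reason the typed-lambda Dataset pipeline can lose to it.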