I think there can be a performance reason: RDDs can be faster than Datasets.
For example, check the query plan for this code:

    spark.range(100).map(_ * 2).filter(_ < 100).map(_ * 2).collect()

There are two serialize/deserialize pairs. Then compare with the RDD equivalent:

    sc.parallelize(1 to 100).map(_ * 2).filter(_ < 100).map(_ * 2).collect()

Regards,
M

2016-09-01 18:15 GMT+02:00 Sean Owen <so...@cloudera.com>:
> On Thu, Sep 1, 2016 at 4:56 PM, Mich Talebzadeh
> <mich.talebza...@gmail.com> wrote:
> > Data Frame is built on top of RDD to create a tabular format that we all
> > love, to make the original build easily usable (say SQL-like queries,
> > column headings etc). The drawback is it restricts what you can do with
> > a Data Frame (now that you have done RDD.toDF)
>
> DataFrame is a Dataset[Row], literally, rather than based on an RDD.
>
> > DataSet is the new RDD with improvements on RDD. As I understand from
> > Sean's explanation they add some optimisation on top of the common RDD.
>
> At the moment I don't think there's any particular reason to use RDDs
> except to interoperate with code that uses RDDs -- which is entirely
> valid. I believe new code would generally touch only Dataset and
> DataFrame otherwise. So I don't think there are really 3 elemental
> concepts in play as of Spark 2.x.

--
Maciek Bryński
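To see the serialize/deserialize pairs in the plan yourself, you can call `explain()` on the Dataset instead of `collect()`. A minimal sketch, assuming a local SparkSession (the untyped `select`/`filter` variant at the end is my addition, showing how Column expressions avoid the object round-trips that typed lambdas force):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object PlanDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("plan-demo")
      .getOrCreate()
    import spark.implicits._

    // Typed lambdas: each map/filter works on JVM objects, so the plan
    // shows DeserializeToObject / SerializeFromObject pairs around them.
    spark.range(100).map(_ * 2).filter(_ < 100).map(_ * 2).explain()

    // Untyped Column expressions: Catalyst can optimize these and stay in
    // Tungsten's binary row format -- no serialize/deserialize steps.
    spark.range(100)
      .select(col("id") * 2 as "v")
      .filter(col("v") < 100)
      .select(col("v") * 2)
      .explain()

    spark.stop()
  }
}
```

The RDD version pays no such cost because it works on JVM objects end to end, which is one reason the typed-lambda Dataset pipeline can lose to it.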