Re: Difference between Data set and Data Frame in Spark 2

Mich Talebzadeh Thu, 01 Sep 2016 13:02:50 -0700

Yes, I tested that. It sounds like the RDD version is faster.

Having said that, I think Datasets still have advantages over RDDs.


Will RDD be phased out?

Thanks


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 1 September 2016 at 19:11, Maciej Bryński <mac...@brynski.pl> wrote:

> I think there could be a performance reason.
> RDDs can be faster than Datasets.
>
> For example, check the query plan for this code:
> spark.range(100).map(_ * 2).filter(_ < 100).map(_ * 2).collect()
>
> There are two serialize / deserialize pairs.
>
> Then compare it with the RDD equivalent:
> sc.parallelize(1 to 100).map(_ * 2).filter(_ < 100).map(_ * 2).collect()
>
> Regards,
> M
>
>
> 2016-09-01 18:15 GMT+02:00 Sean Owen <so...@cloudera.com>:
>
>> On Thu, Sep 1, 2016 at 4:56 PM, Mich Talebzadeh
>> <mich.talebza...@gmail.com> wrote:
>> > Data Frame is built on top of RDD to create the tabular format that we
>> > all love, to make the original build easily usable (say SQL-like
>> > queries, column headings etc). The drawback is that it restricts what
>> > you can do with a Data Frame (now that you have done RDD.toDF)
>>
>> DataFrame is a Dataset[Row], literally, rather than based on an RDD.
>>
>> > DataSet is the new RDD, with improvements on RDD. As I understand from
>> > Sean's explanation, they add some optimisation on top of the common RDD.
>>
>> At the moment I don't think there's any particular reason to use RDDs
>> except to interoperate with code that uses RDDs -- which is entirely
>> valid. I believe new code would generally touch only Dataset and
>> DataFrame otherwise. So I don't think there are really 3 elemental
>> concepts in play as of Spark 2.x.
>>
>
>
> --
> Maciek Bryński
>
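
The plan comparison Maciej suggests above can be sketched as a small standalone program. This is only an illustration, not something posted in the thread: the object name, the app name and the local master setting are assumptions, and it needs Spark 2.x on the classpath. It prints the Dataset query plan with explain(true) and the RDD lineage with toDebugString, so the serialize / deserialize steps he mentions can be looked for in the first output (they typically show up as SerializeFromObject and DeserializeToObject operators) and their absence checked in the second.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical object name, chosen only for this sketch.
object PlanComparison {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough to inspect the plans.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("plan-comparison")
      .getOrCreate()
    import spark.implicits._ // encoders needed by the typed map calls

    // Dataset pipeline from the thread, without collect(), so the plan
    // can be printed instead of running the job.
    val ds = spark.range(100).map(_ * 2).filter(_ < 100).map(_ * 2)
    ds.explain(true) // parsed, analyzed, optimized and physical plans

    // RDD pipeline from the thread; there is no query plan here,
    // only the lineage of plain function calls on JVM objects.
    val rdd = spark.sparkContext
      .parallelize(1 to 100)
      .map(_ * 2)
      .filter(_ < 100)
      .map(_ * 2)
    println(rdd.toDebugString)

    spark.stop()
  }
}
```

The same two pipelines can also be pasted into spark-shell, where spark and sc already exist. How many serialize / deserialize steps appear may depend on the Spark version, so the output is worth checking on the release actually in use.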

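Sean's two points, that a DataFrame is literally a Dataset[Row] and that RDDs remain useful for interoperating with existing RDD code, can be illustrated with a minimal sketch. The Person case class, the sample values and the object name are hypothetical and exist only for this example; the conversions used (toDS, toDF, rdd, as) are the standard ones in Spark 2.x.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

// Hypothetical object name, chosen only for this sketch.
object DatasetRddInterop {
  // Hypothetical record type, defined outside main so an encoder can be derived.
  case class Person(name: String, age: Int)

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for the illustration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dataset-rdd-interop")
      .getOrCreate()
    import spark.implicits._ // toDS / toDF / as[...] support

    val ds: Dataset[Person] = Seq(Person("Ann", 30), Person("Bob", 25)).toDS()

    // DataFrame is a type alias for Dataset[Row], so both annotations
    // below describe the same type and the assignment compiles.
    val df: DataFrame = ds.toDF()
    val rows: Dataset[Row] = df

    // Dropping down to an RDD for code that still expects one.
    val rdd: RDD[Person] = ds.rdd

    // Coming back: from an RDD to a typed Dataset, or from the
    // untyped DataFrame to a typed Dataset.
    val fromRdd: Dataset[Person] = rdd.toDS()
    val typedAgain: Dataset[Person] = rows.as[Person]

    fromRdd.show()
    typedAgain.show()

    spark.stop()
  }
}
```

Because the conversions go both ways, existing RDD-based code does not have to be rewritten before new code starts using Dataset and DataFrame.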