Re: Is there a difference between df.cache() vs df.rdd.cache()

2017-10-14 Thread Supun Nakandala
Hi Weichen, Thank you very much for the explanation. On Fri, Oct 13, 2017 at 6:56 PM, Weichen Xu wrote: > Hi Supun, > > The DataFrame API is NOT using the old RDD implementation under the covers; > DataFrame has its own implementation. (DataFrame uses a binary row format

Re: Is there a difference between df.cache() vs df.rdd.cache()

2017-10-13 Thread Weichen Xu
Hi Supun, The DataFrame API is NOT using the old RDD implementation under the covers; DataFrame has its own implementation. (DataFrame uses a binary row format, and columnar storage when cached.) So the DataFrame has no relationship with the `RDD[Row]` you want to get. When calling `df.rdd`, and then cache, it

Re: Is there a difference between df.cache() vs df.rdd.cache()

2017-10-13 Thread Stephen Boesch
@Vadim Would it be true to say that `.rdd` *may* create a new job, depending on whether the DataFrame/Dataset had already been materialized via an action or checkpoint? If the only prior operations on the DataFrame had been transformations, then the DataFrame would still not have been

Re: Is there a difference between df.cache() vs df.rdd.cache()

2017-10-13 Thread Vadim Semenov
When you do `Dataset.rdd` you actually create a new job. Here you can see what it does internally: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2816-L2828 On Fri, Oct 13, 2017 at 5:24 PM, Supun Nakandala

Re: Is there a difference between df.cache() vs df.rdd.cache()

2017-10-13 Thread Supun Nakandala
Hi Weichen, Thank you for the reply. My understanding was that the DataFrame API uses the old RDD implementation under the covers, though it presents a different API, and that calling df.rdd simply gives access to the underlying RDD. Is this assumption wrong? I would appreciate if you can shed more

Re: Is there a difference between df.cache() vs df.rdd.cache()

2017-10-13 Thread Weichen Xu
You should use `df.cache()`. `df.rdd.cache()` won't work, because `df.rdd` generates a new RDD from the original `df` and then caches that new RDD. On Fri, Oct 13, 2017 at 3:35 PM, Supun Nakandala wrote: > Hi all, > > I have been experimenting with
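
The point in the reply above can be illustrated with a small toy model. This is only a sketch in plain Python, not Spark code: `ToyDataFrame` and `ToyRDD` are hypothetical stand-ins invented for illustration, and the model ignores laziness, storage levels, and actual materialization. It only shows that marking a derived object as cached leaves the object it was derived from untouched.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: holds rows and a cached flag."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.is_cached = False

    def cache(self):
        # Marks only this object as cached.
        self.is_cached = True
        return self


class ToyDataFrame:
    """Hypothetical stand-in for a DataFrame with its own cached flag."""

    def __init__(self, rows):
        self._rows = list(rows)
        self.is_cached = False

    def cache(self):
        # Marks the DataFrame itself as cached.
        self.is_cached = True
        return self

    @property
    def rdd(self):
        # Assumption for this toy: the RDD is a separate object
        # derived from the DataFrame's rows.
        return ToyRDD(self._rows)


df = ToyDataFrame([(1, "a"), (2, "b")])

derived = df.rdd.cache()              # caches the derived object only
cached_after_rdd_cache = df.is_cached

df.cache()                            # caches the DataFrame itself
cached_after_df_cache = df.is_cached

print(derived.is_cached, cached_after_rdd_cache, cached_after_df_cache)
```

In this toy, only the call on `df` itself changes the DataFrame's cached state, which mirrors the advice to use `df.cache()` rather than `df.rdd.cache()`.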

Is there a difference between df.cache() vs df.rdd.cache()

2017-10-13 Thread Supun Nakandala
Hi all, I have been experimenting with the cache/persist/unpersist methods in both the DataFrame and RDD APIs. However, I am seeing different behavior in the DataFrame API compared to the RDD API, such as DataFrames not getting cached when count() is called. Is there a difference between how