Re: Is there a difference between df.cache() vs df.rdd.cache()

Supun Nakandala Fri, 13 Oct 2017 14:26:00 -0700

Hi Weichen,

Thank you for the reply.

My understanding was that the DataFrame API uses the old RDD implementation
under the covers, though it presents a different API, and that calling
df.rdd simply gives access to the underlying RDD. Is this assumption
wrong? I would appreciate it if you could shed more light on this issue or
point me to documentation where I can learn more.

Thank you in advance.

On Fri, Oct 13, 2017 at 3:19 AM, Weichen Xu <weichen...@databricks.com>
wrote:

> You should use `df.cache()`.
> `df.rdd.cache()` won't work, because `df.rdd` generates a new RDD from the
> original `df`, and `cache()` then caches that new RDD rather than `df`.
>
> On Fri, Oct 13, 2017 at 3:35 PM, Supun Nakandala <
> supun.nakand...@gmail.com> wrote:
>
>> Hi all,
>>
>> I have been experimenting with the cache/persist/unpersist methods with
>> respect to both the DataFrame and RDD APIs. However, I am seeing
>> different behavior with the DataFrame API compared to the RDD API, such as
>> DataFrames not getting cached when count() is called.
>>
>> Is there a difference between how these operations act with respect to
>> the DataFrame and RDD APIs?
>>
>> Thank You.
>> -Supun
>>
>
>
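
For readers of the archive, the distinction Weichen describes can be sketched
with a toy model in plain Python. This is not Spark code: ToyDataFrame and
ToyRDD are hypothetical stand-ins, and the model only assumes what the reply
states, namely that `df.rdd` produces a new RDD derived from `df`, so calling
cache() on it marks that derived RDD and leaves `df` itself untouched.

```python
# Toy model only: these classes are hypothetical stand-ins, not Spark's API.

class ToyRDD:
    """Stand-in for an RDD that was derived from a DataFrame."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.is_cached = False

    def cache(self):
        # Marks only this RDD object as cached.
        self.is_cached = True
        return self


class ToyDataFrame:
    """Stand-in for a DataFrame with its own, separate cache flag."""

    def __init__(self, rows):
        self._rows = list(rows)
        self.is_cached = False

    def cache(self):
        # Marks the DataFrame itself as cached.
        self.is_cached = True
        return self

    @property
    def rdd(self):
        # Assumption taken from the reply: this yields a new, derived RDD,
        # not a handle onto the DataFrame's own storage.
        return ToyRDD(self._rows)


df = ToyDataFrame([1, 2, 3])

# Caching the derived RDD does not mark the DataFrame as cached.
derived = df.rdd.cache()
rdd_only = (derived.is_cached, df.is_cached)

# Caching the DataFrame directly does.
df.cache()
after_df_cache = df.is_cached

print(rdd_only, after_df_cache)
```

In this model, caching the derived RDD leaves the frame's own flag False, and
only df.cache() sets it, which matches the advice to call `df.cache()`.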
