Since DF2 depends on DF1, and DF3 depends on both DF1 and DF2, without
caching Spark will read the CSV twice: once to load it for DF1, and once to
load it for DF2. When you add a cache on DF1 or DF2, it reads the CSV only
once.
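
A minimal sketch of the pipeline with a cache on DF1, assuming a local
SparkSession. The column names (key, value, label) and the in-memory rows are
hypothetical stand-ins for the two CSV reads, which the thread does not show.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.sum

object CacheSketch {
  // Builds DF3 following the shape described in the thread.
  def buildDf3(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Stand-in for the first "read csv"; cache() marks DF1 for reuse.
    val df1 = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "value").cache()

    // DF2 is derived from DF1 (hypothetical aggregation).
    val df2 = df1.groupBy("key").agg(sum("value").as("total"))

    // Stand-in for the second "read csv".
    val other = Seq(("a", "x"), ("b", "y")).toDF("key", "label")

    // DF3 references DF1 directly and again through DF2.
    other.join(df1, "key").join(df2, "key")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("cache-sketch")
      .getOrCreate()

    // The single action; both uses of DF1 can be served from the cache.
    buildDf3(spark).show()
    spark.stop()
  }
}
```

In a real job the cached DataFrame would normally be released with
unpersist() once the save has finished.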

You might want to look at doing a windowed query on DF1 to avoid joining DF1
with DF2. This should give you better or similar performance compared to
caching, because Spark will optimize by caching the data during the shuffle.
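
A rough sketch of the windowed alternative, assuming DF2 is a per-key
aggregate of DF1. The column names (key, value, total) and the sum aggregate
are hypothetical, since the original groupBy/agg arguments are elided.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum}

object WindowSketch {
  // Attaches the per-key aggregate to every DF1 row, so no separate DF2
  // and no DF1-to-DF2 join is needed.
  def withTotals(df1: DataFrame): DataFrame = {
    val byKey = Window.partitionBy("key")
    df1.withColumn("total", sum(col("value")).over(byKey))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("window-sketch")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for the "read csv" that produces DF1.
    val df1 = Seq(("a", 1), ("a", 2), ("b", 5)).toDF("key", "value")

    withTotals(df1).show()
    spark.stop()
  }
}
```

Whether this beats caching depends on the real aggregation and data sizes, so
it is worth comparing both on the actual job.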

From: Amit Sharma <resolve...@gmail.com>
Reply-To: "resolve...@gmail.com" <resolve...@gmail.com>
Date: Monday, December 7, 2020 at 12:47 PM
To: Theodoros Gkountouvas <theo.gkountou...@futurewei.com>, 
"user@spark.apache.org" <user@spark.apache.org>
Subject: RE: [EXTERNAL] Caching




Thanks for the information. I am using Spark 2.3.3. I have a few more questions.

1. Yes, I am using DF1 twice, but in the end there is only one action, on DF3.
In that case, should DF1 be computed just once, or does it depend on how many
times the DataFrame is used in transformations?

I believe that even if we use a DataFrame multiple times in transformations,
the decision to cache should be based on actions. In my case the only action
is one save call on DF3. Please correct me if I am wrong.

Thanks
Amit

On Mon, Dec 7, 2020 at 11:54 AM Theodoros Gkountouvas 
<theo.gkountou...@futurewei.com> wrote:
Hi Amit,

One action might use the same DataFrame more than once. You can look at your 
logical plan by executing DF3.explain (the arguments differ depending on the 
version of Spark you are using) and see how many times DF2 or DF1 needs to be 
computed. Given the information you have provided, I suspect that DF1 is used 
more than once (once in DF2 and once in DF3). So, with caching, Spark is going 
to cache it the first time and load it from the cache instead of running it 
again the second time.
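
A small sketch of inspecting the plan, assuming a local SparkSession. The
column names and in-memory rows are hypothetical stand-ins for the CSV reads.
explain(true) prints the logical and physical plans; without a cache, the
scan behind DF1 is expected to show up once for each place DF1 is used.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.sum

object ExplainSketch {
  // Same shape as the pipeline in the thread, deliberately without cache().
  def buildDf3(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val df1 = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")
    val df2 = df1.groupBy("key").agg(sum("value").as("total"))
    val other = Seq(("a", "x"), ("b", "y")).toDF("key", "label")
    other.join(df1, "key").join(df2, "key")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("explain-sketch")
      .getOrCreate()

    // Prints the parsed, analyzed, optimized and physical plans.
    buildDf3(spark).explain(true)
    spark.stop()
  }
}
```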

I hope this helped,
Theo.

From: Amit Sharma <resolve...@gmail.com>
Sent: Monday, December 7, 2020 11:32 AM
To: user@spark.apache.org
Subject: Caching

Hi All, I am using caching in my code. I have DataFrames like
val DF1 = read csv.
val DF2 = DF1.groupBy().agg().select(.....)

val DF3 = read csv .join(DF1).join(DF2)
DF3.save.

If I do not cache DF2 or DF1, it takes longer. But I am performing only one 
action, so why do I need to cache?

Thanks
Amit

