Thanks for the information. I am using Spark 2.3.3. There are a few more questions:
1. Yes, I am using DF1 two times, but in the end there is only one action, on DF3. In that case, should the action count for DF1 be just one, or does it depend on how many times the DataFrame is used in transformations? I believe that even if we use a DataFrame multiple times in transformations, whether to cache should be based on actions. In my case the only action is one save call on DF3. Please correct me if I am wrong.

Thanks
Amit

On Mon, Dec 7, 2020 at 11:54 AM Theodoros Gkountouvas <theo.gkountou...@futurewei.com> wrote:

> Hi Amit,
>
> One action might use the same DataFrame more than once. You can look at
> your LogicalPlan by executing DF3.explain (arguments differ depending on
> the version of Spark you are using) and see how many times you need to
> compute DF2 or DF1. Given the information you have provided, I suspect
> that DF1 is used more than once (one time at DF2 and another at DF3). So,
> Spark is going to cache it the first time, and it will load it from the
> cache instead of running it again the second time.
>
> I hope this helped,
> Theo.
>
> *From:* Amit Sharma <resolve...@gmail.com>
> *Sent:* Monday, December 7, 2020 11:32 AM
> *To:* user@spark.apache.org
> *Subject:* Caching
>
> Hi All, I am using caching in my code. I have DFs like:
>
> val DF1 = read csv
> val DF2 = DF1.groupBy().agg().select(.....)
>
> val DF3 = read csv .join(DF1).join(DF2)
> DF3.save
>
> If I do not cache DF2 or DF1, it takes longer. But I am doing only one
> action, so why do I need to cache?
>
> Thanks
> Amit
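For reference, the pattern Theo describes can be sketched like this in Spark 2.3.x Scala. This is a minimal sketch, not the original code: the join column `key`, the file paths, and the output format are placeholder assumptions. The point is that a single action on DF3 still triggers DF1's lineage twice (once via DF2, once via the direct join) unless DF1 is cached, and `explain` lets you verify that from the plan:

```scala
import org.apache.spark.sql.SparkSession

object CachingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("caching-sketch")
      .getOrCreate()

    // Hypothetical input paths; substitute your own CSV sources.
    val DF1 = spark.read.option("header", "true").csv("/path/to/first.csv")

    // DF1 feeds both DF2 and DF3, so the one action on DF3 would otherwise
    // recompute DF1's lineage twice. Caching avoids the second computation.
    DF1.cache()

    // Placeholder aggregation; the original used groupBy().agg().select(...).
    val DF2 = DF1.groupBy("key").count()

    val DF3 = spark.read.option("header", "true").csv("/path/to/second.csv")
      .join(DF1, Seq("key"))
      .join(DF2, Seq("key"))

    // Inspect the logical/physical plan to see how many times DF1 appears.
    DF3.explain(true)

    // The single action: the scans, aggregation, and joins all run here.
    DF3.write.mode("overwrite").parquet("/path/to/output")

    DF1.unpersist()
    spark.stop()
  }
}
```

Note that the number of actions only determines how many times the whole plan runs; whether caching helps depends on how many times a DataFrame appears inside that one plan.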