Jayesh, but during logical planning Spark would know that the same DF is used twice, so it should optimize the query.
Thanks
Amit

On Mon, Dec 7, 2020 at 1:16 PM Lalwani, Jayesh <jlalw...@amazon.com> wrote:

> Since DF2 is dependent on DF1, and DF3 is dependent on both DF1 and DF2, without caching, Spark will read the CSV twice: once to load it for DF1, and once to load it for DF2. When you add a cache on DF1 or DF2, it reads the CSV only once.
>
> You might want to look at doing a windowed query on DF1 to avoid joining DF1 with DF2. This should give you better or similar performance compared to caching, because Spark will optimize by caching the data during the shuffle.
>
> *From:* Amit Sharma <resolve...@gmail.com>
> *Reply-To:* "resolve...@gmail.com" <resolve...@gmail.com>
> *Date:* Monday, December 7, 2020 at 12:47 PM
> *To:* Theodoros Gkountouvas <theo.gkountou...@futurewei.com>, "user@spark.apache.org" <user@spark.apache.org>
> *Subject:* RE: [EXTERNAL] Caching
>
> Thanks for the information. I am using Spark 2.3.3. There are a few more questions:
>
> 1. Yes, I am using DF1 two times, but in the end there is only one action, on DF3. In that case, should DF1 count as one action, or does it depend on how many times the dataframe is used in transformations?
>
> I believe that even if we use a dataframe multiple times in transformations, the decision to cache should be based on actions. In my case there is one action: a save call on DF3. Please correct me if I am wrong.
>
> Thanks
> Amit
>
> On Mon, Dec 7, 2020 at 11:54 AM Theodoros Gkountouvas <theo.gkountou...@futurewei.com> wrote:
>
> Hi Amit,
>
> One action might use the same DataFrame more than once. You can look at your logical plan by executing DF3.explain (arguments differ depending on the version of Spark you are using) and see how many times you need to compute DF2 or DF1. Given the information you have provided, I suspect that DF1 is used more than once (once for DF2 and again for DF3). So Spark is going to cache it the first time, and it will load it from the cache instead of computing it again the second time.
>
> I hope this helped,
> Theo.
>
> *From:* Amit Sharma <resolve...@gmail.com>
> *Sent:* Monday, December 7, 2020 11:32 AM
> *To:* user@spark.apache.org
> *Subject:* Caching
>
> Hi All, I am using caching in my code. I have DFs like:
>
> val DF1 = read csv
> val DF2 = DF1.groupBy().agg().select(.....)
> val DF3 = read csv .join(DF1).join(DF2)
> DF3.save
>
> If I do not cache DF2 or DF1 it takes longer. But I am doing only one action, so why do I need to cache?
>
> Thanks
> Amit
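[Editor's note] The pipeline discussed in the thread can be sketched concretely. This is a minimal, hypothetical reconstruction: the column names, the in-memory data standing in for the CSV read, and the local master are invented for illustration, not taken from the thread. It shows where a cache() on DF1 keeps the source from being computed once per branch of the plan, and how explain() reveals the reuse that Theo suggests checking for:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().master("local[*]").appName("cache-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the "read csv" in the thread.
val df1 = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "value").cache()

// DF2 depends on DF1: a per-key aggregate.
val df2 = df1.groupBy("key").agg(sum("value").as("total"))

// DF3 depends on both DF1 and DF2. Without the cache() above, the
// DF1 source would be computed once per branch of this plan.
val df3 = df1.join(df2, "key")

// The physical plan shows an InMemoryRelation where the cached DF1 is reused.
df3.explain()
```

In spark-shell the same lines can be pasted as-is, since getOrCreate() returns the shell's existing session.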
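[Editor's note] Jayesh's suggestion of a windowed query could look like the sketch below (again with invented column names). Instead of aggregating DF1 into DF2 and joining the result back, a window function computes the per-key aggregate alongside each row in a single pass, so DF1 is only consumed once and no join is needed:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().master("local[*]").appName("window-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the "read csv" in the thread.
val df1 = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")

// Equivalent result to df1.join(df1.groupBy("key").agg(sum("value")), "key"),
// but computed with a window over one shuffle instead of an aggregate plus a join.
val withTotal = df1.withColumn("total", sum($"value").over(Window.partitionBy($"key")))

withTotal.show()
```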