You call persist() before the action, not after, if you want the result computed by that action to be cached for reuse. If anything, swap lines 2 and 3 in your second version so that persist() comes before count(). However, there's no point in the count() here at all, and once it's gone the write is the only action left, so caching buys you nothing in this example.
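To illustrate the pattern where caching does help (two actions over the same DataFrame), here is a minimal sketch; the SQL, partition count, and output path are placeholders, not taken from your mail:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder query standing in for "some sql on huge dataset"
df = spark.sql("SELECT * FROM some_huge_table")

# Mark df for caching BEFORE the first action; persist() is lazy,
# and the first action materializes the plan and fills the cache.
df.persist()

# First action: runs the SQL and populates the cache.
row_count = df.count()

# Second action: reads from the cache instead of re-running the SQL.
df.repartition(200).write.mode("overwrite").parquet("/tmp/out")

df.unpersist()

But again, if you drop the count(), the write is the only action, and the persist()/unpersist() pair above just adds overhead without any benefit.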
On Thu, Apr 21, 2022 at 2:26 AM Sid <flinkbyhe...@gmail.com> wrote:
> Hi Folks,
>
> I am working with the Spark DataFrame API, where I am doing the following:
>
> 1) df = spark.sql("some sql on huge dataset").persist()
> 2) df1 = df.count()
> 3) df.repartition().write.mode().parquet("")
>
> AFAIK, persist should be used after the count statement, if it is needed
> at all, since Spark is lazily evaluated and if I call any action it will
> recompute the code above, and hence there is no use in persisting it before the action.
>
> Therefore, it should be something like the below, which should give better
> performance:
> 1) df = spark.sql("some sql on huge dataset")
> 2) df1 = df.count()
> 3) df.persist()
> 4) df.repartition().write.mode().parquet("")
>
> So please help me understand how it should be done exactly, and why, if I am
> not correct.
>
> Thanks,
> Sid