You call persist() before the action, not after, if you want the result computed by that action to be cached for reuse. If anything, swap lines 2 and 3 in your second version so that persist() comes before count(). However, there's no point in the count() here at all, and once it's gone the write is the only action left, so caching buys you nothing in this example.
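To illustrate the pattern where caching does help (two actions over the same DataFrame), here is a minimal sketch; the SQL, partition count, and output path are placeholders, not taken from your mail:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder query standing in for "some sql on huge dataset"
df = spark.sql("SELECT * FROM some_huge_table")

# Mark df for caching BEFORE the first action; persist() is lazy,
# and the first action materializes the plan and fills the cache.
df.persist()

# First action: runs the SQL and populates the cache.
row_count = df.count()

# Second action: reads from the cache instead of re-running the SQL.
df.repartition(200).write.mode("overwrite").parquet("/tmp/out")

df.unpersist()

But again, if you drop the count(), the write is the only action, and the persist()/unpersist() pair above just adds overhead without any benefit.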
On Thu, Apr 21, 2022 at 2:26 AM Sid <flinkbyhe...@gmail.com> wrote:
> Hi Folks,
>
> I am working with the Spark DataFrame API, where I am doing the following:
>
> 1) df = spark.sql("some sql on huge dataset").persist()
> 2) df1 = df.count()
> 3) df.repartition().write.mode().parquet("")
>
> AFAIK, persist should be used after the count statement, if it is needed
> at all, since Spark is lazily evaluated and if I call any action it will
> recompute the code above, and hence there is no use in persisting it before the action.
>
> Therefore, it should be something like the below, which should give better
> performance:
> 1) df = spark.sql("some sql on huge dataset")
> 2) df1 = df.count()
> 3) df.persist()
> 4) df.repartition().write.mode().parquet("")
>
> So please help me understand how it should be done exactly, and why, if I am
> not correct.
>
> Thanks,
> Sid