Hello folks,

I have a use case where I save two PySpark dataframes as parquet files and
then later load them to join with each other or with other tables and
perform multiple aggregations.

Since I know the column that will be used in the downstream joins and
groupBy, I was hoping I could co-partition the two dataframes when saving
them and avoid a shuffle later.

I repartitioned the two dataframes before writing, using the same number of
partitions and the same column for both.
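
Roughly, the write side looks like this (the column name "join_key", the
partition count 200, and the paths are placeholders, not my actual values):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df1 = spark.read.parquet("/data/table1")  # placeholder paths
    df2 = spark.read.parquet("/data/table2")

    # Repartition both dataframes on the join key with the same
    # partition count, then persist them as parquet.
    (df1.repartition(200, "join_key")
        .write.mode("overwrite")
        .parquet("/data/table1_repart"))
    (df2.repartition(200, "join_key")
        .write.mode("overwrite")
        .parquet("/data/table2_repart"))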

While I'm seeing an *improvement in execution time* with the above
approach, how do I confirm that a shuffle is actually NOT happening (maybe
through the Spark UI)?
The Spark plan and the shuffle read/write metrics are the same in the two
scenarios (I compare them as sketched below the list):
1. Using the repartitioned dataframes to perform the join + aggregation
2. Using the base dataframes themselves (without explicit repartitioning) to
perform the join + aggregation
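
For context, this is roughly how I inspect the plans (same placeholder
names as above). I'm looking for Exchange operators, since an
"Exchange hashpartitioning(...)" node in the physical plan means a shuffle
happens at runtime:

    df1 = spark.read.parquet("/data/table1_repart")
    df2 = spark.read.parquet("/data/table2_repart")

    joined = (df1.join(df2, "join_key")
                 .groupBy("join_key")
                 .count())

    # Print the physical plan; any "Exchange hashpartitioning(join_key, ...)"
    # operator here indicates a shuffle will occur.
    joined.explain()  # or joined.explain("formatted") on Spark 3.x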

I have a StackOverflow post with more details on this:
https://stackoverflow.com/q/74771971/14741697

Thanks in advance, appreciate your help!

Regards,
Shivam
