Hello folks, I have a use case where I save two PySpark DataFrames as Parquet files and later join them with each other or with other tables and perform multiple aggregations.
Since I know the column used in the downstream joins and group-bys, I was hoping to co-partition the two DataFrames when saving them and avoid a shuffle later. I repartitioned both DataFrames before writing, using the same number of partitions and the same column.

While I'm seeing an *improvement in execution time* with the above approach, how do I confirm that a shuffle is actually NOT happening (maybe through the Spark UI)? The Spark plan and the shuffle read/write metrics are identical in the two scenarios:

1. Using the repartitioned DataFrames to perform the join + aggregation
2. Using the base DataFrames themselves (without explicit repartitioning) to perform the join + aggregation

I have a StackOverflow post with more details: https://stackoverflow.com/q/74771971/14741697

Thanks in advance, appreciate your help!

Regards,
Shivam