Hi All,

If a dataframe is repartitioned in memory by the (date, id) columns, and I then
apply multiple window functions whose partition-by clause uses the same (date,
id) columns, I believe Spark can avoid shuffling/sorting the data again. Can
someone confirm this?

However, what happens when the dataframe was repartitioned by (date, id), but
the window function that follows the repartition uses a partition-by clause
with (date, id, col3, col4)? Would Spark reshuffle the data, or would it know
to reuse the data already partitioned/shuffled by date/id (since date and id
are common partition keys)?

-- 
Regards,

Rishi Shah
