Hi All,

Any suggestions?
Thanks,
-Rishi

On Sun, Oct 20, 2019 at 12:56 AM Rishi Shah <rishishah.s...@gmail.com> wrote:

> Hi All,
>
> I have a use case where I need to apply nested window functions to a
> data frame to derive the final set of columns. Example:
>
> w1 = Window.partitionBy('col1')
> df = df.withColumn('sum1', F.sum('val').over(w1))
>
> w2 = Window.partitionBy('col1', 'col2')
> df = df.withColumn('sum2', F.sum('val').over(w2))
>
> w3 = Window.partitionBy('col1', 'col2', 'col3')
> df = df.withColumn('sum3', F.sum('val').over(w3))
>
> These 3 partitions are not huge at all; however, the data size is 2 TB of
> snappy-compressed parquet, and the job throws a lot of out-of-memory
> errors.
>
> I would like some advice on whether nested window functions are a good
> idea in PySpark. I wanted to avoid multiple filter + join steps to get to
> the final state, since joins can create a huge shuffle.
>
> Any suggestions would be appreciated!
>
> --
> Regards,
>
> Rishi Shah

--
Regards,
Rishi Shah
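For reference, below is a minimal, self-contained version of the pattern
described in the quoted message. The SparkSession setup, the toy rows, and
the app name are illustrative assumptions; only the window definitions and
the column names ('col1', 'col2', 'col3', 'val') come from the original
snippet.

    # Sketch of the nested-window pattern from the message above.
    # The toy DataFrame and SparkSession setup are assumptions for
    # illustration; the windows mirror the original snippet.
    from pyspark.sql import SparkSession, Window
    import pyspark.sql.functions as F

    spark = SparkSession.builder.appName("nested-window-sums").getOrCreate()

    df = spark.createDataFrame(
        [("a", "x", 1, 10.0),
         ("a", "x", 2, 5.0),
         ("a", "y", 1, 7.0),
         ("b", "x", 1, 3.0)],
        ["col1", "col2", "col3", "val"],
    )

    # Three windows over progressively finer partition keys. Each distinct
    # partitioning can introduce its own shuffle, and a window must bring
    # all rows of a partition together on one executor, which is where the
    # memory pressure tends to come from on large or skewed inputs.
    w1 = Window.partitionBy("col1")
    w2 = Window.partitionBy("col1", "col2")
    w3 = Window.partitionBy("col1", "col2", "col3")

    df = (
        df.withColumn("sum1", F.sum("val").over(w1))
          .withColumn("sum2", F.sum("val").over(w2))
          .withColumn("sum3", F.sum("val").over(w3))
    )

    df.show()

On real data, df.explain() shows how many Exchange steps the three windows
add to the physical plan; heavily skewed values in 'col1' are a common
cause of executor out-of-memory errors with this pattern.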