Hi, also, if you are using Spark 3.2.x, please see the documentation on handling skew via Spark settings.
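For reference, the adaptive-execution skew settings in Spark 3.x look like the sketch below (values shown are the documented defaults; note that AQE skew handling targets shuffle joins, so it may not help a pure groupBy aggregation):

```properties
# Enable adaptive query execution (on by default since Spark 3.2)
spark.sql.adaptive.enabled=true
# Let AQE split skewed shuffle partitions during sort-merge joins
spark.sql.adaptive.skewJoin.enabled=true
# A partition is "skewed" if it is this many times larger than the median...
spark.sql.adaptive.skewJoin.skewedPartitionFactor=5
# ...and also larger than this absolute threshold
spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes=256MB
```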
Regards,
Gourav Sengupta

On Tue, Dec 14, 2021 at 6:01 PM David Diebold <davidjdieb...@gmail.com> wrote:

> Hello all,
>
> I was wondering if it is possible to encounter out-of-memory exceptions on
> Spark executors when doing some aggregation on a skewed dataset.
> Let's say we have a dataset with two columns:
> - key : int
> - value : float
> and I want to aggregate values by key.
> Let's say we have tons of rows whose key equals 0.
>
> Is it possible to encounter an out-of-memory exception after the shuffle?
> My expectation would be that the executor responsible for aggregating the
> '0' partition could indeed hit an OOM exception if it tried to load all
> the files of this partition into memory before processing them.
> But why would it need to hold them in memory when doing an aggregation? It
> looks to me that aggregation can be performed in a streaming fashion, so I
> would not expect any OOM at all.
>
> Thank you in advance for your lights :)
> David
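To illustrate David's intuition, here is a minimal sketch in plain Python (not Spark internals) of why a sum-by-key aggregation can indeed run in a streaming fashion: memory grows with the number of distinct keys, not with the number of rows, so a single hot key does not by itself need more than one accumulator. (In practice Spark's hash aggregation can still spill to disk when the accumulator map itself grows too large.)

```python
from collections import defaultdict

def streaming_sum_by_key(rows):
    """Aggregate (key, value) pairs one at a time.

    Memory is O(number of distinct keys), not O(number of rows):
    even if every row carries key 0, we keep a single accumulator for it.
    """
    acc = defaultdict(float)
    for key, value in rows:
        acc[key] += value  # fold each row into its key's accumulator
    return dict(acc)

# A heavily skewed input: one million rows for key 0, two rows for key 1.
skewed = [(0, 1.0)] * 1_000_000 + [(1, 2.0), (1, 3.0)]
result = streaming_sum_by_key(skewed)
# Only two accumulators exist despite a million rows sharing key 0.
```

The skew problem for a plain sum is therefore less about heap usage on the aggregating executor and more about that one task doing almost all the work (a straggler); aggregations that must buffer per-key state, such as collect_list, are a different story.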