> You can change spark.shuffle.safetyFraction , but that is a really big
> margin to add.
The problem is that the estimated size of the used memory is inaccurate. I
dug into the code and found that SizeEstimator.visitArray randomly selects
at most 100 cells of an array and uses them to estimate the memory size of
the whole array. In our case, most of the biggest cells are not selected by
SizeEstimator.visitArray, which causes the big gap between the estimated
size (27M) and the real one (5G). spark.shuffle.safetyFraction is useless
in this case because the gap is really huge. The sample size of 100 is
controlled by SizeEstimator.ARRAY_SAMPLE_SIZE, and there is no
configuration to change it.

> What are you doing with the data after you co-group it? Are you doing a
> reduction? You could try doing a union and a reduceByKey

Actually, it's a join. However, we will try to change our code to avoid it.
I asked here because I want to check whether anyone has a similar problem
and a better solution.

Best Regards,
Shixiong Zhu

2014-10-28 0:13 GMT+08:00 Holden Karau <hol...@pigscanfly.ca>:

> On Monday, October 27, 2014, Shixiong Zhu <zsxw...@gmail.com> wrote:
>
>> We encountered some special OOM cases of "cogroup" when the data in one
>> partition is not balanced.
>>
>> 1. The estimated size of used memory is inaccurate. For example, there
>> are too many values for some special keys. Because
>> SizeEstimator.visitArray only samples at most 100 cells for an array, it
>> may miss most of these special keys and get a wrong estimated size of
>> used memory. In our case, it reports a CompactBuffer is only 27M, but
>> actually it's more than 5G. Since the estimated value is wrong, the
>> CompactBuffer won't be spilled, which causes OOM.
>>
> You can change spark.shuffle.safetyFraction , but that is a really big
> margin to add.
>
>> 2. There are too many values for a special key and these values cannot
>> fit into memory. Spilling data to disk helps nothing because cogroup
>> needs to read all values for a key into memory.
>>
> What are you doing with the data after you co-group it? Are you doing a
> reduction? You could try doing a union and a reduceByKey
>
>> Any suggestion to solve these OOM cases? Thank you.
>>
>> Best Regards,
>> Shixiong Zhu
>>
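P.S. The sampling failure described above is easy to reproduce outside
Spark. The standalone sketch below (not Spark's actual SizeEstimator code;
the cell sizes and seed are made up for illustration) samples 100 random
cells of a skewed array and extrapolates, the way
SizeEstimator.visitArray does with ARRAY_SAMPLE_SIZE = 100:

```scala
import scala.util.Random

object SamplingSketch {
  def main(args: Array[String]): Unit = {
    // A skewed array: 999,990 cells of "size" 100 bytes, plus 10 rare
    // cells of 500 MB each. The rare cells dominate the true total.
    val sizes = Array.fill(999990)(100L) ++ Array.fill(10)(500000000L)
    val actual = sizes.sum // 5,099,999,000 bytes, roughly 5 GB

    // Sample 100 random cells and extrapolate the average to the whole
    // array, mimicking SizeEstimator.visitArray.
    val rng = new Random(42)
    val sample = Array.fill(100)(sizes(rng.nextInt(sizes.length)))
    val estimated = sample.sum / 100 * sizes.length

    println(s"actual = $actual, estimated = $estimated")
  }
}
```

Since only 10 cells in a million are huge, a 100-cell sample misses all of
them with probability around 99.9%, so the extrapolated estimate is about
100 MB against a true size of about 5 GB: the same failure mode as the 27M
vs 5G gap reported above.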
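P.P.S. For anyone finding this thread later: the union + reduceByKey
suggestion works when the downstream computation is an aggregation,
because values for a key are folded pairwise instead of being buffered all
at once the way cogroup requires. A minimal sketch of the idea with plain
Scala collections (keys and values are hypothetical; with Spark RDDs the
same shape is `(a union b).reduceByKey(_ + _)`):

```scala
object UnionReduceSketch {
  def main(args: Array[String]): Unit = {
    val a = Seq(("k1", 1L), ("k2", 2L), ("k1", 4L))
    val b = Seq(("k1", 10L), ("k3", 3L))

    // cogroup-style: materialize every value for a key before combining.
    // With a heavily skewed key, this per-key buffer is what blows up.
    val cogrouped =
      (a ++ b).groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

    // union + reduceByKey-style: fold values into one running total per
    // key, so no buffer of all of a key's values is ever built.
    val reduced = (a ++ b).foldLeft(Map.empty[String, Long]) {
      case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0L) + v)
    }

    assert(cogrouped == reduced)
    println(reduced.toSeq.sorted.mkString(", "))
  }
}
```

This only helps when the per-key computation is associative (a sum, count,
max, etc.); a true join that must pair up individual values still needs
all of them, which is case 2 above.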