You can change spark.shuffle.safetyFraction, but that is a really big
margin to add.

The problem is that the estimated size of used memory is inaccurate. I dug
into the code and found that SizeEstimator.visitArray randomly selects 100
cells and uses them to estimate the memory size of the whole array. In our
case, most of the biggest cells are not selected by SizeEstimator.visitArray,
which causes the huge gap between the estimated size (27M) and the real one
(5G). spark.shuffle.safetyFraction is useless in this case because the gap
is so large. The sample size of 100 is controlled by
SizeEstimator.ARRAY_SAMPLE_SIZE, and there is no configuration to change it.
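For reference, the sampling looks roughly like this (a condensed sketch,
not the actual Spark source; sizeOf stands in for the real per-object
estimator):

    import scala.util.Random

    // Condensed sketch of the sampling in SizeEstimator.visitArray:
    // measure ARRAY_SAMPLE_SIZE random elements and extrapolate linearly.
    val ARRAY_SAMPLE_SIZE = 100

    def estimateArraySize[T](array: Array[T], sizeOf: T => Long): Long = {
      val rand = new Random(42)
      var sampled = 0L
      for (_ <- 0 until ARRAY_SAMPLE_SIZE) {
        sampled += sizeOf(array(rand.nextInt(array.length)))
      }
      // If the few huge elements are never drawn, the extrapolation is
      // off by orders of magnitude (27M vs. 5G in our case).
      sampled * array.length / ARRAY_SAMPLE_SIZE
    }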

What are you doing with the data after you co-group it? Are you doing a
reduction? You could try doing a union and a reduceByKey.

Actually, it’s a join. However, we will try to change our code to avoid
it. I asked here because I wanted to check whether anyone has had a similar
problem and found a better solution.
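To illustrate the union + reduceByKey suggestion (a sketch with made-up
inputs; the sum stands in for whatever reduction would follow the cogroup):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._

    val sc = new SparkContext(
      new SparkConf().setAppName("sketch").setMaster("local[2]"))
    val a = sc.parallelize(Seq(("k1", 1L), ("k2", 2L)))
    val b = sc.parallelize(Seq(("k1", 10L), ("k3", 3L)))

    // cogroup buffers every value of a key before the reduction runs:
    val viaCogroup = a.cogroup(b).mapValues { case (as, bs) => as.sum + bs.sum }

    // union + reduceByKey combines values incrementally (map-side too),
    // so no key's values ever all sit in one in-memory buffer:
    val viaReduce = a.union(b).reduceByKey(_ + _)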

Best Regards,
Shixiong Zhu

2014-10-28 0:13 GMT+08:00 Holden Karau <hol...@pigscanfly.ca>:

>
>
> On Monday, October 27, 2014, Shixiong Zhu <zsxw...@gmail.com> wrote:
>
>> We encountered some special OOM cases of "cogroup" when the data in one
>> partition is not balanced.
>>
>> 1. The estimated size of used memory is inaccurate. For example, there
>> may be too many values for a few special keys. Because
>> SizeEstimator.visitArray samples at most 100 cells of an array, it may
>> miss most of these special keys and produce a wrong estimate of the used
>> memory. In our case, it reports that a CompactBuffer is only 27M when it
>> is actually more than 5G. Since the estimated value is wrong, the
>> CompactBuffer won't be spilled, causing the OOM.
>>
> You can change spark.shuffle.safetyFraction, but that is a really big
> margin to add.
>
>>
>> 2. There are too many values for a single key, and these values cannot
>> fit into memory. Spilling data to disk doesn't help, because cogroup
>> needs to read all of a key's values into memory.
>>
> What are you doing with the data after you co-group it? Are you doing a
> reduction? You could try doing a union and a reduceByKey.
>
>> Any suggestions for solving these OOM cases? Thank you.
>>
>> Best Regards,
>> Shixiong Zhu
>>
>
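For the hot-key case in point 2 above, one common workaround, not tried
in this thread and only applicable when the per-key operation is an
associative reduction rather than a join, is to salt the skewed key so
that no single task has to hold all of its values. A rough sketch with a
hypothetical skewed input:

    import scala.util.Random
    import org.apache.spark.SparkContext._

    // Assumes a SparkContext sc as in the earlier sketch.
    // Hypothetical skew: the key "hot" carries almost all of the values.
    val skewed = sc.parallelize(1 to 1000000).map(_ => ("hot", 1L))

    val numSalts = 16 // tune to the degree of skew
    val result = skewed
      .map { case (k, v) => ((k, Random.nextInt(numSalts)), v) } // spread the key
      .reduceByKey(_ + _)                 // partial reduce per (key, salt)
      .map { case ((k, _), v) => (k, v) }
      .reduceByKey(_ + _)                 // merge the numSalts partials per key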
