Hi,
I'm having trouble serializing tasks for this code:
val rddC = (rddA join rddB)
  .map { case (x, (y, z)) => z -> y }
  .reduceByKey({ (y1, y2) => Semigroup.plus(y1, y2) }, 1000)
Somehow when running on a small data set the size of the serialized task is
about 650KB, which is very big, and when running on a big data set it's
about 1.3MB, which is huge. I can't find what's causing it. The only
reference to an object outside the scope of the closure is a static
method on Semigroup, which should serialize fine. The evidence that it's
okay to call Semigroup this way is that another operation in the same
job uses it and its task serializes at a reasonable size (~30KB).
The part I really don't get is why the size of the serialized task
depends on the size of the RDD. I haven't used "parallelize" to create
rddA and rddB, but rather derived them from transformations of hive tables.
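In case it helps narrow things down, here is the standalone sketch I used to
convince myself that plain Java serialization (which is what Spark's default
JavaSerializer uses) only balloons when the closure actually captures a big
object. The class and interface names here are mine, just for illustration;
no Spark involved:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.IntUnaryOperator;

public class ClosureSize {

    // A lambda is only serializable if its target type is Serializable.
    interface SerializableOp extends IntUnaryOperator, Serializable {}

    // Rough size, in bytes, of what Java serialization would ship
    // for the given object graph.
    static int sizeOf(Object obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        return bytes.toByteArray().length;
    }

    public static void main(String[] args) throws IOException {
        // A closure that captures nothing serializes small.
        SerializableOp small = i -> i + 1;

        // A closure that (accidentally) captures a large array
        // drags the whole array into the serialized form.
        int[] bigData = new int[100_000];
        SerializableOp big = i -> i + bigData.length;

        System.out.println("small: " + sizeOf(small) + " bytes");
        System.out.println("big:   " + sizeOf(big) + " bytes");
    }
}
```

The non-capturing closure comes out at a few hundred bytes, while the
capturing one is over 400KB because the int array rides along. So my working
assumption is that something in my closure (or in an enclosing object it
pulls in) is capturing driver-side state whose size scales with the input,
but I can't see what.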
Thanks,
- Sebastien