Hello!

I am running Spark with the following options:

-Dspark.serializer=org.apache.spark.serializer.KryoSerializer -Dspark.executor.memory=100g

Now, when I load my dataset, apply some one-to-one transformations, and try to cache the resulting RDD, it runs really slowly and then runs out of memory. When I remove the Kryo serializer and fall back to the default Java serialization, it works just fine and is able to load and cache the 700 GB of resulting data. (By the way, I am not registering my classes with Kryo yet, but I don't think that should be worse than Java serialization - should it?)

Here's a summary of all the experiments I ran:

[image]

Any explanation for this behavior? Also, I noticed that even in the cases where caching succeeded, the "Size in Memory" would climb to a certain level, then drop, and then climb back up. Why does that happen?

Regards,
Vipul
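
P.S. For clarity, this is roughly what I understand class registration with Kryo to look like once I get around to adding it. The record class below is just a placeholder for my actual dataset types, and spark.kryo.registrator would point at the fully qualified name of the registrator class:

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

// Placeholder record type standing in for the real classes in my dataset
case class MyRecord(id: Long, payload: Array[Byte])

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    // Register every class that actually flows through the cached RDD
    kryo.register(classOf[MyRecord])
    kryo.register(classOf[Array[MyRecord]])
  }
}

// Passed alongside the existing options, e.g.:
// -Dspark.kryo.registrator=MyRegistrator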

