Hello, I have written a standalone Spark job which I run through the Ooyala Job Server. The program works correctly; now I'm looking into how to optimize it.
My program took 4 hours to run without any optimization. The first round of optimization (switching to KryoSerializer, and compiling regex patterns once and reusing them) reduced the running time to 2.8 hours. I was looking into the stages to understand what was going on when I came across this:

Duration    GC Time    Result Ser Time
3.4 min     2.8 min
3.3 min     2.8 min    10 ms
3.4 min     2.8 min    1 ms
3.3 min     2.8 min    1 ms
3.3 min     2.8 min
3.4 min     2.8 min
3.3 min     2.8 min    1 ms
3.3 min     2.8 min
3.4 min     2.9 min

Is this an expected amount of time for a program to spend in GC? Or is it time for me to dig into how GC is behaving for my program? Or would it be easier if I just serialized the RDD onto an SSD and worked off-heap (using Tachyon)?

I'm still relatively new to Spark, and there are several ways of tuning which I find confusing, so please forgive any dumb questions.

What I'm doing: read n files into n RDDs, cogroup them into 1, then do a foreach to transform the objects into strings (a simplified sketch is below).

Setup and settings:
- 1 machine, 16 cores, 128 GB RAM
- Driver memory (Ooyala Job Server): 90 GB via -Xmx
- spark.executor.memory = 90g
- master = local[16]
- spark.storage.memoryFraction = 0.3
- spark.shuffle.memoryFraction = 0.6
- spark.local.dir = SSD; input and output directories are also on the SSD
- spark.default.parallelism = 48

Any suggestions for optimization are welcome.
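For context, here is a rough sketch of what the job does, in Scala. The paths, the key-extraction function, and the string transform are placeholders for illustration, not my real logic, and I've shown 3 input files instead of n; the sketch also uses a map followed by saveAsTextFile for the "transform to string" step.

    import org.apache.spark.{SparkConf, SparkContext}

    object CogroupJobSketch {
      // placeholder paths; the real job reads n files from the SSD
      val pathA = "/ssd/input/a.txt"
      val pathB = "/ssd/input/b.txt"
      val pathC = "/ssd/input/c.txt"
      val outputPath = "/ssd/output"

      // placeholder key extraction; the real parsing is more involved
      def key(line: String): String = line.split(",")(0)

      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setMaster("local[16]")
          .setAppName("cogroup-job-sketch")
          .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
          .set("spark.storage.memoryFraction", "0.3")
          .set("spark.shuffle.memoryFraction", "0.6")
          .set("spark.default.parallelism", "48")
        val sc = new SparkContext(conf)

        // read n files (here 3) into n keyed RDDs
        val a = sc.textFile(pathA).map(line => (key(line), line))
        val b = sc.textFile(pathB).map(line => (key(line), line))
        val c = sc.textFile(pathC).map(line => (key(line), line))

        // cogroup them into a single RDD keyed by the join key
        val grouped = a.cogroup(b, c)

        // turn each cogrouped record into one output string and write it out
        grouped
          .map { case (k, (as, bs, cs)) => s"$k\t${(as ++ bs ++ cs).mkString("|")}" }
          .saveAsTextFile(outputPath)

        sc.stop()
      }
    }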