Hi spark-user,

I am using Spark 1.6 to build a reverse index for one month of Twitter data
(~50GB). The HDFS split size is 1GB, so by default sc.textFile creates 50
partitions. I'd like to increase parallelism by increasing the number of
input partitions, so I use textFile(..., 200) to get 200 partitions.
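In case the partition counts matter here: as I understand it, sc.textFile's minPartitions argument is handed down to Hadoop's old-API FileInputFormat, which computes each split size as roughly max(minSize, min(totalSize/numSplits, blockSize)). A standalone sketch of that arithmetic (object and method names are my own, not the actual Hadoop source):

```scala
// Sketch of Hadoop's (mapred) FileInputFormat split-size computation,
// which sc.textFile(path, minPartitions) ultimately delegates to.
object SplitSizeSketch {
  // totalSize: total input bytes; numSplits: requested minimum partitions;
  // blockSize: HDFS block/split size; minSize: configured minimum split size
  def computeSplitSize(totalSize: Long, numSplits: Int,
                       blockSize: Long, minSize: Long = 1L): Long = {
    val goalSize = totalSize / math.max(numSplits, 1)
    math.max(minSize, math.min(goalSize, blockSize))
  }

  def main(args: Array[String]): Unit = {
    val totalSize = 50L * 1024 * 1024 * 1024 // ~50 GB of input
    val blockSize = 1L * 1024 * 1024 * 1024  // 1 GB HDFS split size
    // Default (50 requested): goal = 1 GB, capped at blockSize -> 50 partitions
    println(totalSize / computeSplitSize(totalSize, 50, blockSize))  // 50
    // 200 requested: goal = 256 MB < blockSize -> ~200 partitions
    println(totalSize / computeSplitSize(totalSize, 200, blockSize)) // 200
  }
}
```

So requesting 200 partitions should give ~256MB splits rather than re-reading 1GB blocks, which matches what I observe in the UI.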

I see significant GC overhead in the stage that builds the reverse index
(which involves a large shuffle): more than 80% of task time is spent in GC.
Decreasing the number of cores per executor from 8 to 5 reduced the GC time,
but it is still high (more than 50% of task time). With the default number
of partitions (50), however, there is no noticeable GC overhead at all.

The machines running the executors have more than 100GB of memory, and I set
the executor memory to 32GB. I can confirm that no more than one executor
runs on each machine.
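For reference, the setup described above corresponds to a spark-submit invocation roughly like the following (master URL, class, and jar path are placeholders; the extraJavaOptions line just turns on GC logging, which is how I am measuring GC time on the executors):

```shell
spark-submit \
  --master yarn \
  --executor-memory 32g \
  --executor-cores 5 \
  --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  --class com.example.ReverseIndex \
  reverse-index.jar
```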

I am wondering: why does increasing the number of input partitions cause such
significant GC overhead?

Thanks,
J.