Have you tried -XX:+UseG1GC in place of -XX:+UseConcMarkSweepGC? This article 
really helped me with GC tuning a few weeks ago:

https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
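A minimal sketch of what the switch might look like at submit time (the jar name and the InitiatingHeapOccupancyPercent value are assumptions, not values from your setup; the article above discusses tuning that threshold):

```shell
# Hypothetical spark-submit flags: swap CMS for G1 on the executors.
# -XX:InitiatingHeapOccupancyPercent starts concurrent marking earlier
# than the default (45), which can help if the old generation fills
# faster than G1 collects it. PrintGCDetails keeps the GC log you were
# already reading per executor.
spark-submit \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:+PrintGCDetails -XX:+PrintGCDateStamps" \
  --conf "spark.driver.extraJavaOptions=-XX:+UseG1GC" \
  your-streaming-app.jar
```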
 

David Newberger

-----Original Message-----
From: Marco1982 [mailto:marco.plata...@yahoo.it] 
Sent: Friday, June 3, 2016 2:19 PM
To: user@spark.apache.org
Subject: Spark Streaming - long garbage collection time

Hi all,

I'm running a Spark Streaming application with 1-hour batches to join two data 
feeds and write the output to disk. The total size of one data feed is about 40 
GB per hour (split in multiple files), while the size of the second data feed 
is about 600-800 MB per hour (also split in multiple files). Due to application 
constraints, I may not be able to run smaller batches.
Currently, it takes about 20 minutes to produce the output in a cluster with
140 cores and 700 GB of RAM. I'm running 7 workers and 28 executors, each with 
5 cores and 22 GB of RAM.

I execute mapToPair(), filter(), and reduceByKeyAndWindow(1 hour batch) on the 
40 GB data feed. Most of the computation time is spent on these operations. 
What worries me is the garbage collection (GC) time per executor, which ranges 
from 25 seconds to 9.2 minutes. I attach two screenshots below: one lists the 
GC time per executor and one shows the GC log for a single executor. Note that 
the executor that spends 9.2 minutes in garbage collection is eventually killed 
by the Spark driver.

I think these numbers are too high. Do you have any suggestions for keeping GC 
time low? I'm already using the Kryo serializer, -XX:+UseConcMarkSweepGC, and 
spark.rdd.compress=true.
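For reference, the configuration described above corresponds to something like the following (the jar name is a placeholder; the executor sizing comes from the numbers earlier in this message):

```shell
# Approximate current setup: Kryo serialization, RDD compression,
# CMS collector on the executors, 28 executors x 5 cores x 22 GB.
spark-submit \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.rdd.compress=true \
  --conf "spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC" \
  --num-executors 28 \
  --executor-cores 5 \
  --executor-memory 22g \
  app.jar
```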

Is there anything else that would help?

Thanks
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n27087/gc_time.png>
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n27087/executor_16.png>
 




---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
