Wei Deng created SPARK-14652:
--------------------------------

             Summary: pyspark streaming driver unable to cleanup metadata for 
cached RDDs leading to driver OOM
                 Key: SPARK-14652
                 URL: https://issues.apache.org/jira/browse/SPARK-14652
             Project: Spark
          Issue Type: Bug
          Components: PySpark, Streaming
    Affects Versions: 1.6.1
         Environment: pyspark 1.6.1
python 2.7.6
Ubuntu 14.04.2 LTS
Oracle JDK 1.8.0_77

            Reporter: Wei Deng


ContextCleaner was introduced in SPARK-1103 and according to its PR 
[here|https://github.com/apache/spark/pull/126]:

{quote}
RDD cleanup:
{{ContextCleaner}} calls {{RDD.unpersist()}} to clean up persisted RDDs. 
Regarding metadata, the DAGScheduler automatically cleans up all metadata 
related to an RDD after all jobs have completed. Only 
{{SparkContext.persistentRDDs}} keeps strong references to persisted RDDs. The 
{{TimeStampedHashMap}} used for that has been replaced by 
{{TimeStampedWeakValueHashMap}}, which keeps only weak references to the RDDs, 
allowing them to be garbage collected.
{quote}
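
For readers less familiar with the mechanism, the expectation the PR describes is 
analogous to Python's {{weakref.WeakValueDictionary}}: once no strong reference to 
a value remains, its entry disappears, which is what lets the RDD be garbage 
collected and subsequently unpersisted. A minimal Python sketch of that 
expectation (an analogy only, not Spark code):

{code:python}
import gc
import weakref

class CachedRDD(object):
    """Stand-in for a persisted RDD (illustration only)."""
    def __init__(self, rdd_id):
        self.rdd_id = rdd_id

# Analogue of TimeStampedWeakValueHashMap: values are held only weakly.
persistent_rdds = weakref.WeakValueDictionary()

rdd = CachedRDD(42)
persistent_rdds[42] = rdd
print(len(persistent_rdds))   # 1 -- a strong reference ('rdd') still exists

del rdd                       # drop the last strong reference
gc.collect()
print(len(persistent_rdds))   # 0 -- entry is gone, so a cleaner could unpersist it
{code}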

However, we have observed that this is not the case for a cached RDD in pyspark 
streaming code, even on the current latest Spark 1.6.1. This shows up as a 
forever-growing number of RDDs in the {{Storage}} tab of the Spark Streaming 
application's UI once the pyspark streaming code starts to run. We used the 
[writemetrics.py|https://github.com/weideng1/energyiot/blob/f74d3a8b5b01639e6ff53ac461b87bb8a7b1976f/analytics/writemetrics.py]
 code to reproduce the problem: every time, after running for 20+ hours, the 
driver's JVM starts to show signs of OOM, with the old generation filling up and 
the JVM stuck in full GC cycles that free no old-gen space, until eventually the 
driver crashes with OOM.
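
For reference, a stripped-down sketch of the pattern that triggers the build-up 
(the socket source, batch interval and names here are illustrative, not taken from 
writemetrics.py):

{code:python}
# Minimal reproduction sketch: each batch caches its RDD and relies on
# ContextCleaner to reclaim it; the Storage tab keeps accumulating entries instead.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="cached-rdd-leak-repro")
ssc = StreamingContext(sc, batchDuration=10)

lines = ssc.socketTextStream("localhost", 9999)

def process(rdd):
    rdd.cache()    # cache this batch's RDD
    rdd.count()    # force materialization
    # no explicit rdd.unpersist() here on purpose

lines.foreachRDD(process)

ssc.start()
ssc.awaitTermination()
{code}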

We have collected a heap dump taken right before the OOM happened and can make it 
available for analysis if that would be useful. However, it may be easier to 
simply monitor the growth of the number of RDDs in the {{Storage}} tab of the 
Spark application's UI to confirm this is happening. To illustrate the problem, we 
also set {{--conf spark.cleaner.periodicGC.interval=10s}} on the spark-submit 
command line of the pyspark code and enabled DEBUG level logging in the driver's 
logback.xml, and confirmed that even when the cleaner is triggered as frequently 
as every 10 seconds, none of the cached RDDs are unpersisted automatically by 
ContextCleaner.
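
The same interval can also be set programmatically; a minimal sketch, assuming the 
setting is passed through SparkConf rather than on the spark-submit command line 
(which is what we actually used):

{code:python}
# Equivalent to passing --conf spark.cleaner.periodicGC.interval=10s to spark-submit.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("cleaner-interval-check")
        .set("spark.cleaner.periodicGC.interval", "10s"))
sc = SparkContext(conf=conf)
{code}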

Currently we have resorted to manually calling unpersist() to work around the 
problem. However, this goes against the spirit of SPARK-1103, i.e. automated 
garbage collection in the SparkContext.
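
A minimal sketch of that workaround, assuming a foreachRDD-style job where the 
previous batch's RDD can be released once the next batch arrives (the bookkeeping 
list and names are illustrative):

{code:python}
# Workaround sketch: explicitly unpersist the previous batch's cached RDD.
previous = []

def process(rdd):
    rdd.cache()
    rdd.count()
    if previous:
        previous.pop().unpersist()   # manually release the prior batch's RDD
    previous.append(rdd)

# lines.foreachRDD(process)   # wired up as in the reproduction sketch above
{code}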

We also conducted a simple test with equivalent Scala code, again setting {{--conf 
spark.cleaner.periodicGC.interval=10s}}, and found that the Scala code cleaned up 
the RDDs every 10 seconds as expected, so this appears to be a pyspark-specific 
issue. We suspect it has something to do with Python not exposing those 
out-of-scope RDDs as weak references to the ContextCleaner.
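
One rough way to probe that hypothesis from the driver side (a sketch only, not 
something we have validated): drop the Python-side reference to the cached RDD and 
force a Python GC before the next ContextCleaner run, then watch whether the 
{{Storage}} tab shrinks.

{code:python}
import gc

def process(rdd):
    rdd.cache()
    rdd.count()
    del rdd        # drop the only Python-side strong reference to the RDD wrapper
    gc.collect()   # force Python GC; if weak references reach the JVM-side
                   # ContextCleaner, the cached RDD should eventually be unpersisted
{code}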


