Igor Amelin created SPARK-35262:
-----------------------------------

             Summary: Memory leak when dataset is being persisted
                 Key: SPARK-35262
                 URL: https://issues.apache.org/jira/browse/SPARK-35262
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.1.1
            Reporter: Igor Amelin


If a Java or Scala application with a SparkSession runs for a long time and 
persists a lot of datasets, it can crash because of a memory leak.
I've noticed the following: when we persist a dataset, the SparkSession used to 
load it is cloned in CacheManager, and the clone is added as a listener to 
`listenersPlusTimers` in `ListenerBus`. The clone is never removed from the 
list of listeners afterwards, not even when the dataset is unpersisted. If we 
persist a lot of datasets, the SparkSession is cloned and added to 
`ListenerBus` many times. This leads to a memory leak, since the 
`listenersPlusTimers` list grows very large.
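For illustration, here is a minimal Scala sketch of the loop pattern that 
triggers the leak (the object name and iteration count are arbitrary); every 
pass through the loop leaves one more cloned session on the listener bus:

```scala
import org.apache.spark.sql.SparkSession

object PersistLeakRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("persist-leak-repro")
      .getOrCreate()

    import spark.implicits._

    // Each persist() clones the SparkSession in CacheManager and registers
    // the clone on the ListenerBus; the clone is never removed, not even by
    // unpersist(), so listenersPlusTimers grows on every iteration.
    for (i <- 1 to 100000) {
      val ds = Seq(i, i + 1, i + 2).toDS()
      ds.persist()
      ds.count()     // materialize the cached plan
      ds.unpersist() // does not deregister the cloned session listener
    }

    spark.stop()
  }
}
```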

I've found out that the SparkSession is cloned in CacheManager when at least 
one of the parameters `spark.sql.sources.bucketing.autoBucketedScan.enabled` 
and `spark.sql.adaptive.enabled` is true. The first one is true by default, so 
the default configuration triggers the problem. When auto bucketed scan is 
disabled, the SparkSession isn't cloned and there are no duplicates in 
ListenerBus, so the memory leak doesn't occur.
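As a workaround until this is fixed, disabling auto bucketed scan at session 
build time avoids the cloning. A sketch, assuming adaptive execution is left at 
its default setting:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("persist-without-leak")
  // Workaround: with auto bucketed scan disabled, CacheManager does not
  // clone the SparkSession on persist(), so duplicate listeners don't
  // accumulate on the ListenerBus.
  .config("spark.sql.sources.bucketing.autoBucketedScan.enabled", "false")
  .getOrCreate()
```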

Here is a small Java application to reproduce the memory leak: 
[https://github.com/iamelin/spark-memory-leak]


