Hidayat Teonadi created SPARK-26020:
---------------------------------------

             Summary: shuffle data from spark streaming not cleaned up when 
External Shuffle Service is enabled
                 Key: SPARK-26020
                 URL: https://issues.apache.org/jira/browse/SPARK-26020
             Project: Spark
          Issue Type: Bug
          Components: Block Manager, Spark Core
    Affects Versions: 2.3.0
            Reporter: Hidayat Teonadi


Hi, I'm running Spark Streaming on YARN with dynamic allocation and the External 
Spark Shuffle Service enabled. I'm noticing that over the lifetime of my Spark 
Streaming application, the NodeManager appcache folder fills up with blockmgr 
directories (containing shuffle_*.data files).
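For reference, this setup corresponds to roughly the following configuration (a minimal sketch of the settings described above; the port is Spark's default and an assumption about this cluster):

```properties
# spark-defaults.conf (sketch)
spark.dynamicAllocation.enabled=true
spark.shuffle.service.enabled=true
# The external shuffle service runs inside the YARN NodeManager; 7337 is Spark's default port
spark.shuffle.service.port=7337
```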

I understand why the data is not cleaned up immediately under dynamic executor 
allocation, but will these directories ever be cleaned up during the lifetime of 
the Spark Streaming application? Some of this shuffle data was generated by 
Spark jobs/stages that have already completed.

I designed the application to run perpetually, but without any cleanup the 
cluster will eventually run out of disk space and the application will crash.

[https://stackoverflow.com/questions/52923386/spark-streaming-job-doesnt-delete-shuffle-files]
 suggests a stop-gap solution of cleaning up via cron.
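That cron stop-gap could be sketched roughly as follows. This is only a sketch, not an official fix: the base path and the retention window are assumptions about a particular cluster (point it at your yarn.nodemanager.local-dirs), and an aggressive window risks deleting shuffle files that live executors still need.

```shell
# cleanup_shuffle DIR MINUTES: delete shuffle data/index files under any
# blockmgr-* directory below DIR whose mtime is older than MINUTES minutes.
cleanup_shuffle() {
  find "$1" -path '*blockmgr-*' -type f \
       \( -name 'shuffle_*.data' -o -name 'shuffle_*.index' \) \
       -mmin +"$2" -delete
}

# Example cron entry (hypothetical path and 7-day retention):
#   0 * * * * /opt/scripts/cleanup_shuffle.sh /yarn/nm/usercache 10080
```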

YARN-8991 is the ticket I filed against YARN; the YARN folks suggested I file a 
ticket for Spark. Appreciate any help.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
