[ https://issues.apache.org/jira/browse/SPARK-47764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wenchen Fan reassigned SPARK-47764:
-----------------------------------
    Assignee: Bo Zhang

> Cleanup shuffle dependencies for Spark Connect SQL executions
> -------------------------------------------------------------
>
>                 Key: SPARK-47764
>                 URL: https://issues.apache.org/jira/browse/SPARK-47764
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core, SQL
>    Affects Versions: 4.0.0
>            Reporter: Bo Zhang
>            Assignee: Bo Zhang
>            Priority: Major
>              Labels: pull-request-available
>
> Shuffle dependencies are created by shuffle map stages and consist of files on disk plus the corresponding references in Spark JVM heap memory. Currently Spark cleans up unused shuffle dependencies through JVM GCs, and periodic GCs are triggered once every 30 minutes (see ContextCleaner). However, we still found cases in which the shuffle data files are too large, which makes shuffle data migration slow.
>
> We do have chances to clean up shuffle dependencies, especially for SQL queries created by Spark Connect, since we have better control of the DataFrame instances there. Even if DataFrame instances are reused on the client side, the instances are still recreated on the server side.
>
> We might also provide the option to 1. clean up eagerly after each query execution, or 2. only mark the shuffle dependencies and not migrate them during node decommission.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
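The GC-driven mechanism the issue describes can be illustrated with a small sketch. This is not Spark code: `ShuffleTracker` and `ShuffleDependency` here are hypothetical stand-ins that mimic how ContextCleaner tracks dependencies via weak references and cleans up their shuffle files when the referencing objects are garbage collected, and how an eager `cleanup` call (option 1 in the issue) would skip the wait for a GC.

```python
import weakref

class ShuffleDependency:
    """Stand-in for a shuffle dependency created by a shuffle map stage."""
    def __init__(self, shuffle_id):
        self.shuffle_id = shuffle_id

class ShuffleTracker:
    """Hypothetical tracker mimicking ContextCleaner's design: shuffle
    files are removed when the dependency object becomes unreachable
    (GC-driven), or immediately on an explicit eager cleanup call."""
    def __init__(self):
        self._files = {}  # shuffle_id -> simulated on-disk shuffle file
        self._refs = {}   # shuffle_id -> weak reference to the dependency

    def register(self, dep):
        self._files[dep.shuffle_id] = f"shuffle_{dep.shuffle_id}.data"
        # The weakref callback fires when the dependency is garbage
        # collected, analogous to ContextCleaner draining its reference
        # queue and deleting the associated shuffle data.
        self._refs[dep.shuffle_id] = weakref.ref(
            dep, lambda _, sid=dep.shuffle_id: self.cleanup(sid))

    def cleanup(self, shuffle_id):
        # Eager path: delete the shuffle data right away instead of
        # waiting for a (possibly 30-minute) periodic GC to trigger it.
        self._files.pop(shuffle_id, None)

tracker = ShuffleTracker()
dep = ShuffleDependency(0)
tracker.register(dep)
assert 0 in tracker._files
tracker.cleanup(0)          # eager cleanup after a query execution
assert 0 not in tracker._files
```

With Spark Connect, the server controls the DataFrame lifecycle, so it knows when a query execution is done and could invoke the eager path at that point rather than relying on heap pressure to reclaim the references.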