Ziqi Liu created SPARK-46995:
--------------------------------

             Summary: Allow AQE coalesce final stage in SQL cached plan
                 Key: SPARK-46995
                 URL: https://issues.apache.org/jira/browse/SPARK-46995
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 4.0.0
            Reporter: Ziqi Liu


[https://github.com/apache/spark/pull/43435] and 
[https://github.com/apache/spark/pull/43760] fix a correctness issue that is 
triggered when AQE is applied to a cached query plan, specifically when AQE 
coalesces the final result stage of the cached plan.

 

The current semantics of 
{{spark.sql.optimizer.canChangeCachedPlanOutputPartitioning}}

([source 
code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala#L403-L411]):
 * when true, we enable AQE but disable coalescing of the final stage ({*}default{*})

 * when false, we disable AQE

 

But let’s revisit the semantics of this config: for the caller, the only 
thing that matters is whether we change the output partitioning of the cached 
plan, and we should apply AQE whenever possible. Thus we want to modify 
the semantics of {{spark.sql.optimizer.canChangeCachedPlanOutputPartitioning}}:
 * when true, we enable AQE and allow coalescing of the final stage: this might 
lead to a perf regression, because it can introduce an extra shuffle

 * when false, we enable AQE but disable coalescing of the final stage. *(this is 
actually the `true` semantics of the old behavior)*

Also, to keep the default behavior unchanged, we might want to flip the default 
value of {{spark.sql.optimizer.canChangeCachedPlanOutputPartitioning}} to 
`false`.
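
The mapping between the current and proposed semantics can be sketched as a small truth table. This is only an illustrative model in plain Python, not Spark code: the names {{CachedPlanAqe}}, {{current_semantics}} and {{proposed_semantics}} are hypothetical and do not exist in Spark. They just encode the two bullet lists above.

```python
from typing import NamedTuple

# Real config key from this issue; everything else below is a hypothetical model.
CONF_KEY = "spark.sql.optimizer.canChangeCachedPlanOutputPartitioning"


class CachedPlanAqe(NamedTuple):
    """How AQE is applied when planning a cached query (illustrative only)."""
    aqe_enabled: bool
    coalesce_final_stage: bool


def current_semantics(flag: bool) -> CachedPlanAqe:
    # Current behavior as described in this issue:
    #   true  -> AQE on, final-stage coalescing off (default)
    #   false -> AQE off entirely
    return CachedPlanAqe(aqe_enabled=flag, coalesce_final_stage=False)


def proposed_semantics(flag: bool) -> CachedPlanAqe:
    # Proposed behavior: AQE is always on, and the flag only decides whether
    # the final stage may be coalesced (which changes output partitioning).
    return CachedPlanAqe(aqe_enabled=True, coalesce_final_stage=flag)


if __name__ == "__main__":
    for flag in (True, False):
        print(CONF_KEY, "=", flag)
        print("  current: ", current_semantics(flag))
        print("  proposed:", proposed_semantics(flag))
```

Under this model, {{current_semantics(True)}} equals {{proposed_semantics(False)}}, which is why flipping the default to `false` would keep today's default behavior.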

 



