[spark] branch branch-3.3 updated: [SPARK-28090][SQL] Improve `replaceAliasButKeepName` performance

wenchen Tue, 05 Apr 2022 23:49:56 -0700

This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git



The following commit(s) were added to refs/heads/branch-3.3 by this push:
     new 23fa70e9b2b [SPARK-28090][SQL] Improve `replaceAliasButKeepName` 
performance
23fa70e9b2b is described below

commit 23fa70e9b2b5c896a12f95173dd581d9044b85a7
Author: Peter Toth <peter.t...@gmail.com>
AuthorDate: Sun Apr 3 00:47:40 2022 -0700

    [SPARK-28090][SQL] Improve `replaceAliasButKeepName` performance
    
    ### What changes were proposed in this pull request?
    
    SPARK-28090 ticket description contains an example query with multiple 
nested struct creation and field extraction. The following is is an example of 
the query when the sample code range is set to only 3:
    ```
    Project [struct(num1, numerics#23.num1, num10, numerics#23.num10, num11, 
numerics#23.num11, num12, numerics#23.num12, num13, numerics#23.num13, num14, 
numerics#23.num14, num15, numerics#23.num15, num2, numerics#23.num2, num3, 
numerics#23.num3, num4, numerics#23.num4, num5, numerics#23.num5, num6, 
numerics#23.num6, num7, numerics#23.num7, num8, numerics#23.num8, num9, 
numerics#23.num9, out_num1, numerics#23.out_num1, out_num2, -numerics#23.num2) 
AS numerics#42]
    +- Project [struct(num1, numerics#5.num1, num10, numerics#5.num10, num11, 
numerics#5.num11, num12, numerics#5.num12, num13, numerics#5.num13, num14, 
numerics#5.num14, num15, numerics#5.num15, num2, numerics#5.num2, num3, 
numerics#5.num3, num4, numerics#5.num4, num5, numerics#5.num5, num6, 
numerics#5.num6, num7, numerics#5.num7, num8, numerics#5.num8, num9, 
numerics#5.num9, out_num1, -numerics#5.num1) AS numerics#23]
       +- LogicalRDD [numerics#5], false
    ```
    If the level of nesting reaches 7 the query plan generation becomes 
extremely slow on Spark 2.4. That is because both
    - `CollapseProject` rule is slow and
    - some of the expression optimization rules running on the huge, not yet 
simplified expression tree of the single, collapsed `Project` node are slow.
    
    On Spark 3.3, after SPARK-36718, `CollapseProject` doesn't collapse such 
plans so the above issues don't occur,
    but `PhysicalOperation` extractor has an issue that it also builds up that 
huge expression tree and then traverses and modifies it in 
`AliasHelper.replaceAliasButKeepName()`. With a small change in that function 
we can avoid such costly operations.
    
    ### Why are the changes needed?
    The suggested change reduced the plan generation time of the example query 
from minutes (range = 7) or hours (range = 8+) to seconds.
    
    ### Does this PR introduce _any_ user-facing change?
    The example query can be executed.
    
    ### How was this patch tested?
    Existing UTs + manual test of the example query in the ticket description.
    
    Closes #35382 from peter-toth/SPARK-28090-improve-replacealiasbutkeepname.
    
    Authored-by: Peter Toth <peter.t...@gmail.com>
    Signed-off-by: Dongjoon Hyun <dongj...@apache.org>
---
 .../spark/sql/catalyst/expressions/AliasHelper.scala       | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AliasHelper.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AliasHelper.scala
index dea7ea0f144..888a9869e07 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AliasHelper.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AliasHelper.scala
@@ -70,11 +70,17 @@ trait AliasHelper {
   protected def replaceAliasButKeepName(
      expr: NamedExpression,
      aliasMap: AttributeMap[Alias]): NamedExpression = {
-    // Use transformUp to prevent infinite recursion when the replacement 
expression
-    // redefines the same ExprId,
-    trimNonTopLevelAliases(expr.transformUp {
+    expr match {
+      // We need to keep the `Alias` if we replace a top-level Attribute, so 
that it's still a
+      // `NamedExpression`. We also need to keep the name of the original 
Attribute.
       case a: Attribute => aliasMap.get(a).map(_.withName(a.name)).getOrElse(a)
-    }).asInstanceOf[NamedExpression]
+      case o =>
+        // Use transformUp to prevent infinite recursion when the replacement 
expression
+        // redefines the same ExprId.
+        o.mapChildren(_.transformUp {
+          case a: Attribute => aliasMap.get(a).map(_.child).getOrElse(a)
+        }).asInstanceOf[NamedExpression]
+    }
   }
 
   protected def trimAliases(e: Expression): Expression = {


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org

[spark] branch branch-3.3 updated: [SPARK-28090][SQL] Improve `replaceAliasButKeepName` performance

Reply via email to