This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch branch-3.3 in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-3.3 by this push: new 23fa70e9b2b [SPARK-28090][SQL] Improve `replaceAliasButKeepName` performance 23fa70e9b2b is described below commit 23fa70e9b2b5c896a12f95173dd581d9044b85a7 Author: Peter Toth <peter.t...@gmail.com> AuthorDate: Sun Apr 3 00:47:40 2022 -0700 [SPARK-28090][SQL] Improve `replaceAliasButKeepName` performance ### What changes were proposed in this pull request? SPARK-28090 ticket description contains an example query with multiple nested struct creation and field extraction. The following is is an example of the query when the sample code range is set to only 3: ``` Project [struct(num1, numerics#23.num1, num10, numerics#23.num10, num11, numerics#23.num11, num12, numerics#23.num12, num13, numerics#23.num13, num14, numerics#23.num14, num15, numerics#23.num15, num2, numerics#23.num2, num3, numerics#23.num3, num4, numerics#23.num4, num5, numerics#23.num5, num6, numerics#23.num6, num7, numerics#23.num7, num8, numerics#23.num8, num9, numerics#23.num9, out_num1, numerics#23.out_num1, out_num2, -numerics#23.num2) AS numerics#42] +- Project [struct(num1, numerics#5.num1, num10, numerics#5.num10, num11, numerics#5.num11, num12, numerics#5.num12, num13, numerics#5.num13, num14, numerics#5.num14, num15, numerics#5.num15, num2, numerics#5.num2, num3, numerics#5.num3, num4, numerics#5.num4, num5, numerics#5.num5, num6, numerics#5.num6, num7, numerics#5.num7, num8, numerics#5.num8, num9, numerics#5.num9, out_num1, -numerics#5.num1) AS numerics#23] +- LogicalRDD [numerics#5], false ``` If the level of nesting reaches 7 the query plan generation becomes extremely slow on Spark 2.4. That is because both - `CollapseProject` rule is slow and - some of the expression optimization rules running on the huge, not yet simplified expression tree of the single, collapsed `Project` node are slow. On Spark 3.3, after SPARK-36718, `CollapseProject` doesn't collapse such plans so the above issues don't occur, but `PhysicalOperation` extractor has an issue that it also builds up that huge expression tree and then traverses and modifies it in `AliasHelper.replaceAliasButKeepName()`. With a small change in that function we can avoid such costly operations. ### Why are the changes needed? The suggested change reduced the plan generation time of the example query from minutes (range = 7) or hours (range = 8+) to seconds. ### Does this PR introduce _any_ user-facing change? The example query can be executed. ### How was this patch tested? Existing UTs + manual test of the example query in the ticket description. Closes #35382 from peter-toth/SPARK-28090-improve-replacealiasbutkeepname. Authored-by: Peter Toth <peter.t...@gmail.com> Signed-off-by: Dongjoon Hyun <dongj...@apache.org> --- .../spark/sql/catalyst/expressions/AliasHelper.scala | 14 ++++++++++---- 1 file changed, 10 insertions(+), 4 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AliasHelper.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AliasHelper.scala index dea7ea0f144..888a9869e07 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AliasHelper.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AliasHelper.scala @@ -70,11 +70,17 @@ trait AliasHelper { protected def replaceAliasButKeepName( expr: NamedExpression, aliasMap: AttributeMap[Alias]): NamedExpression = { - // Use transformUp to prevent infinite recursion when the replacement expression - // redefines the same ExprId, - trimNonTopLevelAliases(expr.transformUp { + expr match { + // We need to keep the `Alias` if we replace a top-level Attribute, so that it's still a + // `NamedExpression`. We also need to keep the name of the original Attribute. case a: Attribute => aliasMap.get(a).map(_.withName(a.name)).getOrElse(a) - }).asInstanceOf[NamedExpression] + case o => + // Use transformUp to prevent infinite recursion when the replacement expression + // redefines the same ExprId. + o.mapChildren(_.transformUp { + case a: Attribute => aliasMap.get(a).map(_.child).getOrElse(a) + }).asInstanceOf[NamedExpression] + } } protected def trimAliases(e: Expression): Expression = { --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org