[GitHub] [spark] cloud-fan commented on a diff in pull request #36530: [SPARK-39172][SQL] Remove outer join if all output come from streamed side and buffered side keys exist unique key
cloud-fan commented on code in PR #36530: URL: https://github.com/apache/spark/pull/36530#discussion_r873577805 ## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OuterJoinEliminationSuite.scala: ## @@ -268,4 +268,59 @@ class OuterJoinEliminationSuite extends PlanTest { comparePlans(optimized, originalQuery.analyze) } + + test("SPARK-39172: Remove left outer join if only left-side columns being selected and " + Review Comment: Remove left/right outer join if only left/right side columns are selected and the join keys on the other side are unique. ## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OuterJoinEliminationSuite.scala: ## @@ -268,4 +268,59 @@ class OuterJoinEliminationSuite extends PlanTest { comparePlans(optimized, originalQuery.analyze) } + + test("SPARK-39172: Remove left outer join if only left-side columns being selected and " + Review Comment: The same to PR title -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #36530: [SPARK-39172][SQL] Remove outer join if all output come from streamed side and buffered side keys exist unique key
cloud-fan commented on code in PR #36530: URL: https://github.com/apache/spark/pull/36530#discussion_r873578814 ## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OuterJoinEliminationSuite.scala: ## @@ -268,4 +268,59 @@ class OuterJoinEliminationSuite extends PlanTest { comparePlans(optimized, originalQuery.analyze) } + + test("SPARK-39172: Remove left outer join if only left-side columns being selected and " + Review Comment: The PR title can be `Remove left/right outer join if only left/right side columns are selected and the join keys on the other side are unique` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #36530: [SPARK-39172][SQL] Remove outer join if all output come from streamed side and buffered side keys exist unique key
cloud-fan commented on code in PR #36530: URL: https://github.com/apache/spark/pull/36530#discussion_r873577805 ## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OuterJoinEliminationSuite.scala: ## @@ -268,4 +268,59 @@ class OuterJoinEliminationSuite extends PlanTest { comparePlans(optimized, originalQuery.analyze) } + + test("SPARK-39172: Remove left outer join if only left-side columns being selected and " + Review Comment: Remove left outer join if only left-side columns are selected and the join keys on the other side are unique. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #36530: [SPARK-39172][SQL] Remove outer join if all output come from streamed side and buffered side keys exist unique key
cloud-fan commented on code in PR #36530: URL: https://github.com/apache/spark/pull/36530#discussion_r873576622 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala: ## @@ -462,6 +462,7 @@ package object dsl { Window(windowExpressions, partitionSpec, orderSpec, logicalPlan) def subquery(alias: Symbol): LogicalPlan = SubqueryAlias(alias.name, logicalPlan) Review Comment: I don't know why do we need to accept a `Symbol` here. We can probably do a cleanup later and remove this method. The same to `def as(alias: Symbol): NamedExpression = Alias(expr, alias.name)()` in this file. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #36530: [SPARK-39172][SQL] Remove outer join if all output come from streamed side and buffered side keys exist unique key
cloud-fan commented on code in PR #36530: URL: https://github.com/apache/spark/pull/36530#discussion_r873520469 ## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OuterJoinEliminationSuite.scala: ## @@ -268,4 +268,54 @@ class OuterJoinEliminationSuite extends PlanTest { comparePlans(optimized, originalQuery.analyze) } + + test("SPARK-39172: Remove outer join if all output come from streamed side and buffered side " + +"keys exist unique key") { +val x = testRelation.subquery(Symbol("x")) +val y = testRelation1.subquery(Symbol("y")) + +// left outer +comparePlans(Optimize.execute( + x.join(y.groupBy($"d")($"d"), LeftOuter, Some($"a" === $"d")) +.select($"a", $"b", $"c").analyze), + x.select($"a", $"b", $"c").analyze +) + +comparePlans(Optimize.execute( + x.join(y.groupBy($"d")($"d", count($"d").as("x")), LeftOuter, +Some($"a" === $"d" && $"b" === $"x")) +.select($"a", $"b", $"c").analyze), + x.select($"a", $"b", $"c").analyze +) + +// right outer +comparePlans(Optimize.execute( + x.groupBy($"a")($"a").join(y, RightOuter, Some($"a" === $"d")) +.select($"d", $"e", $"f").analyze), + y.select($"d", $"e", $"f").analyze +) + +comparePlans(Optimize.execute( + x.groupBy($"a")($"a", count($"a").as("x")).join(y, RightOuter, +Some($"a" === $"d" && $"x" === $"e")) +.select($"d", $"e", $"f").analyze), + y.select($"d", $"e", $"f").analyze +) + +// negative case +// not a equi-join +val p1 = x.join(y.groupBy($"d")($"d"), LeftOuter, Some($"a" > $"d")) + .select($"a").analyze +comparePlans(Optimize.execute(p1), p1) + +// do not exist unique key +val p2 = x.join(y.groupBy($"d", $"e")($"d", $"e"), LeftOuter, Some($"a" === $"d")) + .select($"a").analyze +comparePlans(Optimize.execute(p2), p2) + +// output comes from buffered side Review Comment: ```suggestion // output comes from the right side of a left outer join ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #36530: [SPARK-39172][SQL] Remove outer join if all output come from streamed side and buffered side keys exist unique key
cloud-fan commented on code in PR #36530: URL: https://github.com/apache/spark/pull/36530#discussion_r873517906 ## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OuterJoinEliminationSuite.scala: ## @@ -268,4 +268,54 @@ class OuterJoinEliminationSuite extends PlanTest { comparePlans(optimized, originalQuery.analyze) } + + test("SPARK-39172: Remove outer join if all output come from streamed side and buffered side " + Review Comment: please update the test name as well ## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OuterJoinEliminationSuite.scala: ## @@ -268,4 +268,54 @@ class OuterJoinEliminationSuite extends PlanTest { comparePlans(optimized, originalQuery.analyze) } + + test("SPARK-39172: Remove outer join if all output come from streamed side and buffered side " + +"keys exist unique key") { +val x = testRelation.subquery(Symbol("x")) Review Comment: ```suggestion val x = testRelation.subquery("x") ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #36530: [SPARK-39172][SQL] Remove outer join if all output come from streamed side and buffered side keys exist unique key
cloud-fan commented on code in PR #36530: URL: https://github.com/apache/spark/pull/36530#discussion_r873517334 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala: ## @@ -139,6 +139,17 @@ object ReorderJoin extends Rule[LogicalPlan] with PredicateHelper { * SELECT t1.c1, max(t1.c2) FROM t1 GROUP BY t1.c1 * }}} * + * 3. Remove outer join if: + * - For a left outer join with only left-side columns being selected and the right side join + * keys are unique. + * - For a right outer join with only right-side columns being selected and the left side join + * keys are unique. + * + * {{{ + * SELECT t1.* FROM t1 LEFT JOIN (SELECT DISTINCT c1 as c1 FROM t)t2 ON t1.c1 = t2.c1 ==> Review Comment: ```suggestion * SELECT t1.* FROM t1 LEFT JOIN (SELECT DISTINCT c1 as c1 FROM t) t2 ON t1.c1 = t2.c1 ==> ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #36530: [SPARK-39172][SQL] Remove outer join if all output come from streamed side and buffered side keys exist unique key
cloud-fan commented on code in PR #36530: URL: https://github.com/apache/spark/pull/36530#discussion_r873346931 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala: ## @@ -211,6 +219,15 @@ object EliminateOuterJoin extends Rule[LogicalPlan] with PredicateHelper { if projectList.forall(_.deterministic) && p.references.subsetOf(right.outputSet) && allDuplicateAgnostic(aggExprs) => a.copy(child = p.copy(child = right)) + +case p @ Project(_, ExtractEquiJoinKeys(LeftOuter, _, rightKeys, _, _, left, right, _)) +if right.distinctKeys.exists(_.subsetOf(ExpressionSet(rightKeys))) && + p.references.subsetOf(left.outputSet) => + p.copy(child = left) Review Comment: For a left outer join with only left-side columns being selected, the join can only change the result if we can find more than one matched row on the right side. If the right side join keys are unique, apparently we can't find more than one match. So this optimization LGTM. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #36530: [SPARK-39172][SQL] Remove outer join if all output come from streamed side and buffered side keys exist unique key
cloud-fan commented on code in PR #36530: URL: https://github.com/apache/spark/pull/36530#discussion_r873344595 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala: ## @@ -139,6 +139,14 @@ object ReorderJoin extends Rule[LogicalPlan] with PredicateHelper { * SELECT t1.c1, max(t1.c2) FROM t1 GROUP BY t1.c1 * }}} * + * 3. Remove outer join if all output comes from streamed side and the join keys from buffered side + * exist unique key. Review Comment: it looks a bit weird to talk about stream side and buffer side in the logical plan phase. Can we explain this optimization in a different way? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org