[GitHub] [spark] cloud-fan commented on a diff in pull request #36530: [SPARK-39172][SQL] Remove outer join if all output come from streamed side and buffered side keys exist unique key

2022-05-16 Thread GitBox


cloud-fan commented on code in PR #36530:
URL: https://github.com/apache/spark/pull/36530#discussion_r873577805


##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OuterJoinEliminationSuite.scala:
##
@@ -268,4 +268,59 @@ class OuterJoinEliminationSuite extends PlanTest {
 
 comparePlans(optimized, originalQuery.analyze)
   }
+
+  test("SPARK-39172: Remove left outer join if only left-side columns being 
selected and " +

Review Comment:
   Remove left/right outer join if only left/right side columns are selected 
and the join keys on the other side are unique.



##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OuterJoinEliminationSuite.scala:
##
@@ -268,4 +268,59 @@ class OuterJoinEliminationSuite extends PlanTest {
 
 comparePlans(optimized, originalQuery.analyze)
   }
+
+  test("SPARK-39172: Remove left outer join if only left-side columns being 
selected and " +

Review Comment:
   The same to PR title



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #36530: [SPARK-39172][SQL] Remove outer join if all output come from streamed side and buffered side keys exist unique key

2022-05-16 Thread GitBox


cloud-fan commented on code in PR #36530:
URL: https://github.com/apache/spark/pull/36530#discussion_r873578814


##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OuterJoinEliminationSuite.scala:
##
@@ -268,4 +268,59 @@ class OuterJoinEliminationSuite extends PlanTest {
 
 comparePlans(optimized, originalQuery.analyze)
   }
+
+  test("SPARK-39172: Remove left outer join if only left-side columns being 
selected and " +

Review Comment:
   The PR title can be `Remove left/right outer join if only left/right side 
columns are selected and the join keys on the other side are unique`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #36530: [SPARK-39172][SQL] Remove outer join if all output come from streamed side and buffered side keys exist unique key

2022-05-16 Thread GitBox


cloud-fan commented on code in PR #36530:
URL: https://github.com/apache/spark/pull/36530#discussion_r873577805


##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OuterJoinEliminationSuite.scala:
##
@@ -268,4 +268,59 @@ class OuterJoinEliminationSuite extends PlanTest {
 
 comparePlans(optimized, originalQuery.analyze)
   }
+
+  test("SPARK-39172: Remove left outer join if only left-side columns being 
selected and " +

Review Comment:
   Remove left outer join if only left-side columns are selected and the join 
keys on the other side are unique.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #36530: [SPARK-39172][SQL] Remove outer join if all output come from streamed side and buffered side keys exist unique key

2022-05-16 Thread GitBox


cloud-fan commented on code in PR #36530:
URL: https://github.com/apache/spark/pull/36530#discussion_r873576622


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala:
##
@@ -462,6 +462,7 @@ package object dsl {
 Window(windowExpressions, partitionSpec, orderSpec, logicalPlan)
 
   def subquery(alias: Symbol): LogicalPlan = SubqueryAlias(alias.name, 
logicalPlan)

Review Comment:
   I don't know why do we need to accept a `Symbol` here. We can probably do a 
cleanup later and remove this method. The same to `def as(alias: Symbol): 
NamedExpression = Alias(expr, alias.name)()` in this file.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #36530: [SPARK-39172][SQL] Remove outer join if all output come from streamed side and buffered side keys exist unique key

2022-05-16 Thread GitBox


cloud-fan commented on code in PR #36530:
URL: https://github.com/apache/spark/pull/36530#discussion_r873520469


##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OuterJoinEliminationSuite.scala:
##
@@ -268,4 +268,54 @@ class OuterJoinEliminationSuite extends PlanTest {
 
 comparePlans(optimized, originalQuery.analyze)
   }
+
+  test("SPARK-39172: Remove outer join if all output come from streamed side 
and buffered side " +
+"keys exist unique key") {
+val x = testRelation.subquery(Symbol("x"))
+val y = testRelation1.subquery(Symbol("y"))
+
+// left outer
+comparePlans(Optimize.execute(
+  x.join(y.groupBy($"d")($"d"), LeftOuter, Some($"a" === $"d"))
+.select($"a", $"b", $"c").analyze),
+  x.select($"a", $"b", $"c").analyze
+)
+
+comparePlans(Optimize.execute(
+  x.join(y.groupBy($"d")($"d", count($"d").as("x")), LeftOuter,
+Some($"a" === $"d" && $"b" === $"x"))
+.select($"a", $"b", $"c").analyze),
+  x.select($"a", $"b", $"c").analyze
+)
+
+// right outer
+comparePlans(Optimize.execute(
+  x.groupBy($"a")($"a").join(y, RightOuter, Some($"a" === $"d"))
+.select($"d", $"e", $"f").analyze),
+  y.select($"d", $"e", $"f").analyze
+)
+
+comparePlans(Optimize.execute(
+  x.groupBy($"a")($"a", count($"a").as("x")).join(y, RightOuter,
+Some($"a" === $"d" && $"x" === $"e"))
+.select($"d", $"e", $"f").analyze),
+  y.select($"d", $"e", $"f").analyze
+)
+
+// negative case
+// not a equi-join
+val p1 = x.join(y.groupBy($"d")($"d"), LeftOuter, Some($"a" > $"d"))
+  .select($"a").analyze
+comparePlans(Optimize.execute(p1), p1)
+
+// do not exist unique key
+val p2 = x.join(y.groupBy($"d", $"e")($"d", $"e"), LeftOuter, Some($"a" 
=== $"d"))
+  .select($"a").analyze
+comparePlans(Optimize.execute(p2), p2)
+
+// output comes from buffered side

Review Comment:
   ```suggestion
   // output comes from the right side of a left outer join
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #36530: [SPARK-39172][SQL] Remove outer join if all output come from streamed side and buffered side keys exist unique key

2022-05-16 Thread GitBox


cloud-fan commented on code in PR #36530:
URL: https://github.com/apache/spark/pull/36530#discussion_r873517906


##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OuterJoinEliminationSuite.scala:
##
@@ -268,4 +268,54 @@ class OuterJoinEliminationSuite extends PlanTest {
 
 comparePlans(optimized, originalQuery.analyze)
   }
+
+  test("SPARK-39172: Remove outer join if all output come from streamed side 
and buffered side " +

Review Comment:
   please update the test name as well



##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OuterJoinEliminationSuite.scala:
##
@@ -268,4 +268,54 @@ class OuterJoinEliminationSuite extends PlanTest {
 
 comparePlans(optimized, originalQuery.analyze)
   }
+
+  test("SPARK-39172: Remove outer join if all output come from streamed side 
and buffered side " +
+"keys exist unique key") {
+val x = testRelation.subquery(Symbol("x"))

Review Comment:
   ```suggestion
   val x = testRelation.subquery("x")
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #36530: [SPARK-39172][SQL] Remove outer join if all output come from streamed side and buffered side keys exist unique key

2022-05-16 Thread GitBox


cloud-fan commented on code in PR #36530:
URL: https://github.com/apache/spark/pull/36530#discussion_r873517334


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala:
##
@@ -139,6 +139,17 @@ object ReorderJoin extends Rule[LogicalPlan] with 
PredicateHelper {
  *   SELECT t1.c1, max(t1.c2) FROM t1 GROUP BY t1.c1
  * }}}
  *
+ * 3. Remove outer join if:
+ *   - For a left outer join with only left-side columns being selected and 
the right side join
+ * keys are unique.
+ *   - For a right outer join with only right-side columns being selected and 
the left side join
+ * keys are unique.
+ *
+ * {{{
+ *   SELECT t1.* FROM t1 LEFT JOIN (SELECT DISTINCT c1 as c1 FROM t)t2 ON 
t1.c1 = t2.c1  ==>

Review Comment:
   ```suggestion
*   SELECT t1.* FROM t1 LEFT JOIN (SELECT DISTINCT c1 as c1 FROM t) t2 ON 
t1.c1 = t2.c1  ==>
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #36530: [SPARK-39172][SQL] Remove outer join if all output come from streamed side and buffered side keys exist unique key

2022-05-15 Thread GitBox


cloud-fan commented on code in PR #36530:
URL: https://github.com/apache/spark/pull/36530#discussion_r873346931


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala:
##
@@ -211,6 +219,15 @@ object EliminateOuterJoin extends Rule[LogicalPlan] with 
PredicateHelper {
 if projectList.forall(_.deterministic) && 
p.references.subsetOf(right.outputSet) &&
   allDuplicateAgnostic(aggExprs) =>
   a.copy(child = p.copy(child = right))
+
+case p @ Project(_, ExtractEquiJoinKeys(LeftOuter, _, rightKeys, _, _, 
left, right, _))
+if right.distinctKeys.exists(_.subsetOf(ExpressionSet(rightKeys))) &&
+  p.references.subsetOf(left.outputSet) =>
+  p.copy(child = left)

Review Comment:
   For a left outer join with only left-side columns being selected, the join 
can only change the result if we can find more than one matched row on the 
right side. If the right side join keys are unique, apparently we can't find 
more than one match. So this optimization LGTM.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #36530: [SPARK-39172][SQL] Remove outer join if all output come from streamed side and buffered side keys exist unique key

2022-05-15 Thread GitBox


cloud-fan commented on code in PR #36530:
URL: https://github.com/apache/spark/pull/36530#discussion_r873344595


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala:
##
@@ -139,6 +139,14 @@ object ReorderJoin extends Rule[LogicalPlan] with 
PredicateHelper {
  *   SELECT t1.c1, max(t1.c2) FROM t1 GROUP BY t1.c1
  * }}}
  *
+ * 3. Remove outer join if all output comes from streamed side and the join 
keys from buffered side
+ * exist unique key.

Review Comment:
   it looks a bit weird to talk about stream side and buffer side in the 
logical plan phase. Can we explain this optimization in a different way?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org