[GitHub] spark pull request #20333: [SPARK-23087][SQL] CheckCartesianProduct too rest...

2018-01-21 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20333


---




[GitHub] spark pull request #20333: [SPARK-23087][SQL] CheckCartesianProduct too rest...

2018-01-21 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20333#discussion_r162807813
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala ---
@@ -274,4 +274,18 @@ class DataFrameJoinSuite extends QueryTest with SharedSQLContext {
     checkAnswer(innerJoin, Row(1) :: Nil)
   }
 
+  test("SPARK-23087: don't throw Analysis Exception in CheckCartesianProduct when join condition " +
+    "is false or null") {
+    val df = spark.range(10)
--- End diff --

shouldn't it be false?
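For reference, a minimal sketch (not the PR's actual code) of the test shape being discussed, assuming the intent is that the rule must not throw even while cross joins stay disabled, which is the default:

```scala
// Sketch only; assumes `import org.apache.spark.sql.internal.SQLConf` and the
// suite's testImplicits. The condition `id === null` folds to null during
// optimization, and forcing optimization must not trigger CheckCartesianProducts.
test("SPARK-23087: no AnalysisException when the join condition is false or null") {
  withSQLConf(SQLConf.CROSS_JOINS_ENABLED.key -> "false") {
    val df = spark.range(10)
    val dfNull = spark.range(10).select(lit(null).as("b"))
    df.join(dfNull, $"id" === $"b", "left").queryExecution.optimizedPlan
  }
}
```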


---




[GitHub] spark pull request #20333: [SPARK-23087][SQL] CheckCartesianProduct too rest...

2018-01-20 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20333#discussion_r162794983
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala ---
@@ -274,4 +274,18 @@ class DataFrameJoinSuite extends QueryTest with SharedSQLContext {
     checkAnswer(innerJoin, Row(1) :: Nil)
   }
 
+  test("SPARK-23087: don't throw Analysis Exception in CheckCartesianProduct when join condition " +
+    "is false or null") {
+    val df = spark.range(10)
--- End diff --

> `withSQLConf(CROSS_JOINS_ENABLED.key -> "true") {`


---




[GitHub] spark pull request #20333: [SPARK-23087][SQL] CheckCartesianProduct too rest...

2018-01-20 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20333#discussion_r162794939
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -1108,15 +1108,19 @@ object CheckCartesianProducts extends Rule[LogicalPlan] with PredicateHelper {
    */
   def isCartesianProduct(join: Join): Boolean = {
     val conditions = join.condition.map(splitConjunctivePredicates).getOrElse(Nil)
-    !conditions.map(_.references).exists(refs => refs.exists(join.left.outputSet.contains)
-      && refs.exists(join.right.outputSet.contains))
+
+    conditions match {
+      case Seq(Literal.FalseLiteral) | Seq(Literal(null, BooleanType)) => false
+      case _ => !conditions.map(_.references).exists(refs =>
+        refs.exists(join.left.outputSet.contains) && refs.exists(join.right.outputSet.contains))
+    }
   }
 
   def apply(plan: LogicalPlan): LogicalPlan =
     if (SQLConf.get.crossJoinEnabled) {
       plan
     } else plan transform {
-      case j @ Join(left, right, Inner | LeftOuter | RightOuter | FullOuter, condition)
+      case j @ Join(left, right, Inner | LeftOuter | RightOuter | FullOuter, _)
--- End diff --

Yeah. For outer joins, it makes sense to remove this check.


---




[GitHub] spark pull request #20333: [SPARK-23087][SQL] CheckCartesianProduct too rest...

2018-01-20 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20333#discussion_r162793942
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -1108,15 +1108,19 @@ object CheckCartesianProducts extends Rule[LogicalPlan] with PredicateHelper {
    */
   def isCartesianProduct(join: Join): Boolean = {
     val conditions = join.condition.map(splitConjunctivePredicates).getOrElse(Nil)
-    !conditions.map(_.references).exists(refs => refs.exists(join.left.outputSet.contains)
-      && refs.exists(join.right.outputSet.contains))
+
+    conditions match {
+      case Seq(Literal.FalseLiteral) | Seq(Literal(null, BooleanType)) => false
+      case _ => !conditions.map(_.references).exists(refs =>
+        refs.exists(join.left.outputSet.contains) && refs.exists(join.right.outputSet.contains))
+    }
   }
 
   def apply(plan: LogicalPlan): LogicalPlan =
     if (SQLConf.get.crossJoinEnabled) {
       plan
     } else plan transform {
-      case j @ Join(left, right, Inner | LeftOuter | RightOuter | FullOuter, condition)
+      case j @ Join(left, right, Inner | LeftOuter | RightOuter | FullOuter, _)
--- End diff --

Why are you saying that the size of the result set is the same? Suppose a relation A (of size n, let's say 1M rows) is outer-joined with a relation B (of size m, let's say 1M rows). If the condition is true, the output relation has 1M * 1M rows (i.e. n * m); if the condition is false, the result has 1M rows (n) for a left join, 1M (m) for a right join, and 1M + 1M (n + m) for a full outer join. Therefore the size is not the same at all. But maybe you meant something different; am I missing something?
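To make the size argument concrete, here is a rough spark-shell sketch with toy sizes n = 3 and m = 4 (assuming `spark.sql.crossJoin.enabled` is set to `true` so the check itself does not get in the way):

```scala
import org.apache.spark.sql.functions.lit

val a = spark.range(3).toDF("x")        // n = 3
val b = spark.range(4).toDF("y")        // m = 4

a.join(b, lit(true), "left").count()    // 12 = n * m: every row matches every row
a.join(b, lit(false), "left").count()   // 3  = n: each left row padded with nulls
a.join(b, lit(false), "right").count()  // 4  = m
a.join(b, lit(false), "full").count()   // 7  = n + m
```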


---




[GitHub] spark pull request #20333: [SPARK-23087][SQL] CheckCartesianProduct too rest...

2018-01-19 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20333#discussion_r162759088
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -1108,15 +1108,19 @@ object CheckCartesianProducts extends Rule[LogicalPlan] with PredicateHelper {
    */
   def isCartesianProduct(join: Join): Boolean = {
     val conditions = join.condition.map(splitConjunctivePredicates).getOrElse(Nil)
-    !conditions.map(_.references).exists(refs => refs.exists(join.left.outputSet.contains)
-      && refs.exists(join.right.outputSet.contains))
+
+    conditions match {
+      case Seq(Literal.FalseLiteral) | Seq(Literal(null, BooleanType)) => false
+      case _ => !conditions.map(_.references).exists(refs =>
+        refs.exists(join.left.outputSet.contains) && refs.exists(join.right.outputSet.contains))
+    }
   }
 
   def apply(plan: LogicalPlan): LogicalPlan =
     if (SQLConf.get.crossJoinEnabled) {
       plan
     } else plan transform {
-      case j @ Join(left, right, Inner | LeftOuter | RightOuter | FullOuter, condition)
+      case j @ Join(left, right, Inner | LeftOuter | RightOuter | FullOuter, _)
--- End diff --

For inner joins, we will not hit this, because the join is already optimized to an empty relation. For the other (outer) join types, we face exactly the same issue as when the condition is true; that is, the size of the join result set is still the same.
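The inner-join part of this can be checked quickly; a sketch, assuming `spark.sql.crossJoin.enabled=true` so the rule under discussion does not interfere:

```scala
import org.apache.spark.sql.functions.lit

val left  = spark.range(10)
val right = spark.range(10).toDF("b")
// Per the comment above, the always-false inner join is expected to collapse to
// an empty LocalRelation in the optimized plan, so CheckCartesianProducts never
// sees a Join node for this case.
println(left.join(right, lit(false), "inner").queryExecution.optimizedPlan)
```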


---




[GitHub] spark pull request #20333: [SPARK-23087][SQL] CheckCartesianProduct too rest...

2018-01-19 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20333#discussion_r162758553
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala ---
@@ -274,4 +274,18 @@ class DataFrameJoinSuite extends QueryTest with SharedSQLContext {
     checkAnswer(innerJoin, Row(1) :: Nil)
   }
 
+  test("SPARK-23087: don't throw Analysis Exception in CheckCartesianProduct when join condition " +
+    "is false or null") {
+    val df = spark.range(10)
+    val dfNull = spark.range(10).select(lit(null).as("b"))
+    val planNull = df.join(dfNull, $"id" === $"b", "left").queryExecution.analyzed
+
+    spark.sessionState.executePlan(planNull).optimizedPlan
+
+    val dfOne = df.select(lit(1).as("a"))
+    val dfTwo = spark.range(10).select(lit(2).as("a"))
--- End diff --

`a` -> `b`
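For clarity, a sketch of the snippet after the suggested rename; the join condition shown is an assumption about the test's intent (with distinct names, `$"a" === $"b"` constant-folds to `1 === 2`, i.e. an always-false condition):

```scala
val dfOne = df.select(lit(1).as("a"))
val dfTwo = spark.range(10).select(lit(2).as("b"))  // "a" renamed to "b"
val planFalse = dfOne.join(dfTwo, $"a" === $"b", "left").queryExecution.analyzed
// Optimizing must not throw from CheckCartesianProducts.
spark.sessionState.executePlan(planFalse).optimizedPlan
```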


---




[GitHub] spark pull request #20333: [SPARK-23087][SQL] CheckCartesianProduct too rest...

2018-01-19 Thread mgaido91
GitHub user mgaido91 opened a pull request:

https://github.com/apache/spark/pull/20333

[SPARK-23087][SQL] CheckCartesianProduct too restrictive when condition is false/null

## What changes were proposed in this pull request?

CheckCartesianProduct raises an AnalysisException even when the join condition is always false or null. In that case we shouldn't raise it, since the result cannot be a Cartesian product.
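A minimal reproduction sketch of the behavior described above (default `spark.sql.crossJoin.enabled=false`; names are illustrative):

```scala
import org.apache.spark.sql.functions.lit

val left  = spark.range(10)
val right = spark.range(10).select(lit(null).as("b"))
// `id = b` folds to null during optimization, so the join can never produce a
// cartesian product, yet CheckCartesianProducts currently raises an
// AnalysisException when the plan is optimized.
left.join(right, left("id") === right("b"), "left").queryExecution.optimizedPlan
```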

## How was this patch tested?

Added a unit test.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mgaido91/spark SPARK-23087

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20333.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20333


commit 9c88781dcd4cd301373927bfbe7f3530c80f4f05
Author: Marco Gaido 
Date:   2018-01-19T20:45:29Z

[SPARK-23087][SQL] CheckCartesianProduct too restrictive when condition is false/null




---
