[GitHub] spark pull request #22702: [SPARK-25714] Fix Null Handling in the Optimizer ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22702#discussion_r224950875 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala --- @@ -276,15 +276,31 @@ object BooleanSimplification extends Rule[LogicalPlan] with PredicateHelper { case a And b if a.semanticEquals(b) => a case a Or b if a.semanticEquals(b) => a - case a And (b Or c) if Not(a).semanticEquals(b) => And(a, c) - case a And (b Or c) if Not(a).semanticEquals(c) => And(a, b) - case (a Or b) And c if a.semanticEquals(Not(c)) => And(b, c) - case (a Or b) And c if b.semanticEquals(Not(c)) => And(a, c) - - case a Or (b And c) if Not(a).semanticEquals(b) => Or(a, c) - case a Or (b And c) if Not(a).semanticEquals(c) => Or(a, b) - case (a And b) Or c if a.semanticEquals(Not(c)) => Or(b, c) - case (a And b) Or c if b.semanticEquals(Not(c)) => Or(a, c) + // The following optimization is applicable only when the operands are nullable, --- End diff -- fixed --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22702: [SPARK-25714] Fix Null Handling in the Optimizer ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22702#discussion_r224950588 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala --- @@ -276,15 +276,31 @@ object BooleanSimplification extends Rule[LogicalPlan] with PredicateHelper { case a And b if a.semanticEquals(b) => a case a Or b if a.semanticEquals(b) => a - case a And (b Or c) if Not(a).semanticEquals(b) => And(a, c) - case a And (b Or c) if Not(a).semanticEquals(c) => And(a, b) - case (a Or b) And c if a.semanticEquals(Not(c)) => And(b, c) - case (a Or b) And c if b.semanticEquals(Not(c)) => And(a, c) - - case a Or (b And c) if Not(a).semanticEquals(b) => Or(a, c) - case a Or (b And c) if Not(a).semanticEquals(c) => Or(a, b) - case (a And b) Or c if a.semanticEquals(Not(c)) => Or(b, c) - case (a And b) Or c if b.semanticEquals(Not(c)) => Or(a, c) + // The following optimization is applicable only when the operands are nullable, --- End diff -- typo: `only when the operands are not nullable` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22702: [SPARK-25714] Fix Null Handling in the Optimizer ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22702 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22702: [SPARK-25714] Fix Null Handling in the Optimizer ...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/22702#discussion_r224748765 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala --- @@ -276,15 +276,15 @@ object BooleanSimplification extends Rule[LogicalPlan] with PredicateHelper { case a And b if a.semanticEquals(b) => a case a Or b if a.semanticEquals(b) => a - case a And (b Or c) if Not(a).semanticEquals(b) => And(a, c) - case a And (b Or c) if Not(a).semanticEquals(c) => And(a, b) - case (a Or b) And c if a.semanticEquals(Not(c)) => And(b, c) - case (a Or b) And c if b.semanticEquals(Not(c)) => And(a, c) - - case a Or (b And c) if Not(a).semanticEquals(b) => Or(a, c) - case a Or (b And c) if Not(a).semanticEquals(c) => Or(a, b) - case (a And b) Or c if a.semanticEquals(Not(c)) => Or(b, c) - case (a And b) Or c if b.semanticEquals(Not(c)) => Or(a, c) + case a And (b Or c) if !a.nullable && Not(a).semanticEquals(b) => And(a, c) + case a And (b Or c) if !a.nullable && Not(a).semanticEquals(c) => And(a, b) + case (a Or b) And c if !a.nullable && a.semanticEquals(Not(c)) => And(b, c) + case (a Or b) And c if !b.nullable && b.semanticEquals(Not(c)) => And(a, c) + + case a Or (b And c) if !a.nullable && Not(a).semanticEquals(b) => Or(a, c) --- End diff -- oh, yes you're right, this might be a problem indeed if the expression is inside a `not`. Sorry, thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22702: [SPARK-25714] Fix Null Handling in the Optimizer ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22702#discussion_r224745249 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala --- @@ -276,15 +276,15 @@ object BooleanSimplification extends Rule[LogicalPlan] with PredicateHelper { case a And b if a.semanticEquals(b) => a case a Or b if a.semanticEquals(b) => a - case a And (b Or c) if Not(a).semanticEquals(b) => And(a, c) - case a And (b Or c) if Not(a).semanticEquals(c) => And(a, b) - case (a Or b) And c if a.semanticEquals(Not(c)) => And(b, c) - case (a Or b) And c if b.semanticEquals(Not(c)) => And(a, c) - - case a Or (b And c) if Not(a).semanticEquals(b) => Or(a, c) - case a Or (b And c) if Not(a).semanticEquals(c) => Or(a, b) - case (a And b) Or c if a.semanticEquals(Not(c)) => Or(b, c) - case (a And b) Or c if b.semanticEquals(Not(c)) => Or(a, c) + case a And (b Or c) if !a.nullable && Not(a).semanticEquals(b) => And(a, c) + case a And (b Or c) if !a.nullable && Not(a).semanticEquals(c) => And(a, b) + case (a Or b) And c if !a.nullable && a.semanticEquals(Not(c)) => And(b, c) + case (a Or b) And c if !b.nullable && b.semanticEquals(Not(c)) => And(a, c) + + case a Or (b And c) if !a.nullable && Not(a).semanticEquals(b) => Or(a, c) --- End diff -- > when a is null, And(a, c) returns null This is not always the case, `null && false` is `false` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22702: [SPARK-25714] Fix Null Handling in the Optimizer ...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/22702#discussion_r224709234 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala --- @@ -276,15 +276,15 @@ object BooleanSimplification extends Rule[LogicalPlan] with PredicateHelper { case a And b if a.semanticEquals(b) => a case a Or b if a.semanticEquals(b) => a - case a And (b Or c) if Not(a).semanticEquals(b) => And(a, c) - case a And (b Or c) if Not(a).semanticEquals(c) => And(a, b) - case (a Or b) And c if a.semanticEquals(Not(c)) => And(b, c) - case (a Or b) And c if b.semanticEquals(Not(c)) => And(a, c) - - case a Or (b And c) if Not(a).semanticEquals(b) => Or(a, c) - case a Or (b And c) if Not(a).semanticEquals(c) => Or(a, b) - case (a And b) Or c if a.semanticEquals(Not(c)) => Or(b, c) - case (a And b) Or c if b.semanticEquals(Not(c)) => Or(a, c) + case a And (b Or c) if !a.nullable && Not(a).semanticEquals(b) => And(a, c) + case a And (b Or c) if !a.nullable && Not(a).semanticEquals(c) => And(a, b) + case (a Or b) And c if !a.nullable && a.semanticEquals(Not(c)) => And(b, c) + case (a Or b) And c if !b.nullable && b.semanticEquals(Not(c)) => And(a, c) + + case a Or (b And c) if !a.nullable && Not(a).semanticEquals(b) => Or(a, c) --- End diff -- Sorry, it is the other case where the change is not needed, right? `a And (b Or c)` -> `And(a, c)` when `a` is `null`, `And(a, c)` returns `null` (I got a bit confused earlier, sorry). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22702: [SPARK-25714] Fix Null Handling in the Optimizer ...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/22702#discussion_r224708102 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala --- @@ -276,15 +276,15 @@ object BooleanSimplification extends Rule[LogicalPlan] with PredicateHelper { case a And b if a.semanticEquals(b) => a case a Or b if a.semanticEquals(b) => a - case a And (b Or c) if Not(a).semanticEquals(b) => And(a, c) - case a And (b Or c) if Not(a).semanticEquals(c) => And(a, b) - case (a Or b) And c if a.semanticEquals(Not(c)) => And(b, c) - case (a Or b) And c if b.semanticEquals(Not(c)) => And(a, c) - - case a Or (b And c) if Not(a).semanticEquals(b) => Or(a, c) - case a Or (b And c) if Not(a).semanticEquals(c) => Or(a, b) - case (a And b) Or c if a.semanticEquals(Not(c)) => Or(b, c) - case (a And b) Or c if b.semanticEquals(Not(c)) => Or(a, c) + case a And (b Or c) if !a.nullable && Not(a).semanticEquals(b) => And(a, c) + case a And (b Or c) if !a.nullable && Not(a).semanticEquals(c) => And(a, b) + case (a Or b) And c if !a.nullable && a.semanticEquals(Not(c)) => And(b, c) + case (a Or b) And c if !b.nullable && b.semanticEquals(Not(c)) => And(a, c) + + case a Or (b And c) if !a.nullable && Not(a).semanticEquals(b) => Or(a, c) --- End diff -- I see now, sorry. Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22702: [SPARK-25714] Fix Null Handling in the Optimizer ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22702#discussion_r224706383 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala --- @@ -276,15 +276,15 @@ object BooleanSimplification extends Rule[LogicalPlan] with PredicateHelper { case a And b if a.semanticEquals(b) => a case a Or b if a.semanticEquals(b) => a - case a And (b Or c) if Not(a).semanticEquals(b) => And(a, c) - case a And (b Or c) if Not(a).semanticEquals(c) => And(a, b) - case (a Or b) And c if a.semanticEquals(Not(c)) => And(b, c) - case (a Or b) And c if b.semanticEquals(Not(c)) => And(a, c) - - case a Or (b And c) if Not(a).semanticEquals(b) => Or(a, c) - case a Or (b And c) if Not(a).semanticEquals(c) => Or(a, b) - case (a And b) Or c if a.semanticEquals(Not(c)) => Or(b, c) - case (a And b) Or c if b.semanticEquals(Not(c)) => Or(a, c) + case a And (b Or c) if !a.nullable && Not(a).semanticEquals(b) => And(a, c) + case a And (b Or c) if !a.nullable && Not(a).semanticEquals(c) => And(a, b) + case (a Or b) And c if !a.nullable && a.semanticEquals(Not(c)) => And(b, c) + case (a Or b) And c if !b.nullable && b.semanticEquals(Not(c)) => And(a, c) + + case a Or (b And c) if !a.nullable && Not(a).semanticEquals(b) => Or(a, c) --- End diff -- the problem is when a is null, c is true --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22702: [SPARK-25714] Fix Null Handling in the Optimizer ...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/22702#discussion_r224704266 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala --- @@ -276,15 +276,15 @@ object BooleanSimplification extends Rule[LogicalPlan] with PredicateHelper { case a And b if a.semanticEquals(b) => a case a Or b if a.semanticEquals(b) => a - case a And (b Or c) if Not(a).semanticEquals(b) => And(a, c) - case a And (b Or c) if Not(a).semanticEquals(c) => And(a, b) - case (a Or b) And c if a.semanticEquals(Not(c)) => And(b, c) - case (a Or b) And c if b.semanticEquals(Not(c)) => And(a, c) - - case a Or (b And c) if Not(a).semanticEquals(b) => Or(a, c) - case a Or (b And c) if Not(a).semanticEquals(c) => Or(a, b) - case (a And b) Or c if a.semanticEquals(Not(c)) => Or(b, c) - case (a And b) Or c if b.semanticEquals(Not(c)) => Or(a, c) + case a And (b Or c) if !a.nullable && Not(a).semanticEquals(b) => And(a, c) + case a And (b Or c) if !a.nullable && Not(a).semanticEquals(c) => And(a, b) + case (a Or b) And c if !a.nullable && a.semanticEquals(Not(c)) => And(b, c) + case (a Or b) And c if !b.nullable && b.semanticEquals(Not(c)) => And(a, c) + + case a Or (b And c) if !a.nullable && Not(a).semanticEquals(b) => Or(a, c) --- End diff -- these shouldn't be a problem, since if `a` is `true`, then `a Or b` is true, regardless of `b`'s value/nullability, isn't it? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22702: [SPARK-25714] Fix Null Handling in the Optimizer ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22702#discussion_r224658881 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala --- @@ -276,15 +276,15 @@ object BooleanSimplification extends Rule[LogicalPlan] with PredicateHelper { case a And b if a.semanticEquals(b) => a case a Or b if a.semanticEquals(b) => a - case a And (b Or c) if Not(a).semanticEquals(b) => And(a, c) - case a And (b Or c) if Not(a).semanticEquals(c) => And(a, b) - case (a Or b) And c if a.semanticEquals(Not(c)) => And(b, c) - case (a Or b) And c if b.semanticEquals(Not(c)) => And(a, c) - - case a Or (b And c) if Not(a).semanticEquals(b) => Or(a, c) - case a Or (b And c) if Not(a).semanticEquals(c) => Or(a, b) - case (a And b) Or c if a.semanticEquals(Not(c)) => Or(b, c) - case (a And b) Or c if b.semanticEquals(Not(c)) => Or(a, c) + case a And (b Or c) if !a.nullable && Not(a).semanticEquals(b) => And(a, c) --- End diff -- after more thoughts, `a And (b Or c)` should be better than `If(IsNull(a), null, And(a, c))`, as it's more likely to get pushed down to data source, so the changes here LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22702: [SPARK-25714] Fix Null Handling in the Optimizer ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22702#discussion_r224655860 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala --- @@ -276,15 +276,15 @@ object BooleanSimplification extends Rule[LogicalPlan] with PredicateHelper { case a And b if a.semanticEquals(b) => a case a Or b if a.semanticEquals(b) => a - case a And (b Or c) if Not(a).semanticEquals(b) => And(a, c) - case a And (b Or c) if Not(a).semanticEquals(c) => And(a, b) - case (a Or b) And c if a.semanticEquals(Not(c)) => And(b, c) - case (a Or b) And c if b.semanticEquals(Not(c)) => And(a, c) - - case a Or (b And c) if Not(a).semanticEquals(b) => Or(a, c) - case a Or (b And c) if Not(a).semanticEquals(c) => Or(a, b) - case (a And b) Or c if a.semanticEquals(Not(c)) => Or(b, c) - case (a And b) Or c if b.semanticEquals(Not(c)) => Or(a, c) + case a And (b Or c) if !a.nullable && Not(a).semanticEquals(b) => And(a, c) --- End diff -- Since this is complicated, shall we put a comment to explain it? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22702: [SPARK-25714] Fix Null Handling in the Optimizer ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22702#discussion_r224655771 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala --- @@ -276,15 +276,15 @@ object BooleanSimplification extends Rule[LogicalPlan] with PredicateHelper { case a And b if a.semanticEquals(b) => a case a Or b if a.semanticEquals(b) => a - case a And (b Or c) if Not(a).semanticEquals(b) => And(a, c) - case a And (b Or c) if Not(a).semanticEquals(c) => And(a, b) - case (a Or b) And c if a.semanticEquals(Not(c)) => And(b, c) - case (a Or b) And c if b.semanticEquals(Not(c)) => And(a, c) - - case a Or (b And c) if Not(a).semanticEquals(b) => Or(a, c) - case a Or (b And c) if Not(a).semanticEquals(c) => Or(a, b) - case (a And b) Or c if a.semanticEquals(Not(c)) => Or(b, c) - case (a And b) Or c if b.semanticEquals(Not(c)) => Or(a, c) + case a And (b Or c) if !a.nullable && Not(a).semanticEquals(b) => And(a, c) --- End diff -- assuming a is null, then b is also null. If c is null: `a And (b Or c)` -> null, And(a, c) -> null If c is true: `a And (b Or c)` -> null, And(a, c) -> null if c is false: `a And (b Or c)` -> null, And(a, c) -> false So yes this is a bug, and we should rewrite it to `If(IsNull(a), a, And(a, c))` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22702: [SPARK-25714] Fix Null Handling in the Optimizer ...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/22702 [SPARK-25714] Fix Null Handling in the Optimizer rule BooleanSimplification ## What changes were proposed in this pull request? ```Scala val df1 = Seq(("abc", 1), (null, 2)).toDF("col1", "col2") df1.write.mode(SaveMode.Overwrite).parquet("/tmp/test1") val df2 = spark.read.parquet("/tmp/test1") df2.filter("col1 = 'abc' OR (col1 != 'abc' AND col2 == 3)").show() ``` Before the PR, it returns both rows. After the fix, it returns `Row ("abc", 1))`. This is to fix the bug in NULL handling in BooleanSimplification. This is a bug introduced in Spark 1.6 release. ## How was this patch tested? Added test cases You can merge this pull request into a Git repository by running: $ git pull https://github.com/gatorsmile/spark fixBooleanSimplify2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22702.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22702 commit 5d1dde12b6cb1b3a61f32d678b952ac4ce5b6c0f Author: gatorsmile Date: 2018-10-11T21:38:38Z fix commit a9359abff62017f46f33ef18d7f56f97c885af3d Author: gatorsmile Date: 2018-10-11T21:40:44Z style --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org