[GitHub] spark pull request #22702: [SPARK-25714] Fix Null Handling in the Optimizer ...

2018-10-13 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22702#discussion_r224950875
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
 ---
@@ -276,15 +276,31 @@ object BooleanSimplification extends 
Rule[LogicalPlan] with PredicateHelper {
   case a And b if a.semanticEquals(b) => a
   case a Or b if a.semanticEquals(b) => a
 
-  case a And (b Or c) if Not(a).semanticEquals(b) => And(a, c)
-  case a And (b Or c) if Not(a).semanticEquals(c) => And(a, b)
-  case (a Or b) And c if a.semanticEquals(Not(c)) => And(b, c)
-  case (a Or b) And c if b.semanticEquals(Not(c)) => And(a, c)
-
-  case a Or (b And c) if Not(a).semanticEquals(b) => Or(a, c)
-  case a Or (b And c) if Not(a).semanticEquals(c) => Or(a, b)
-  case (a And b) Or c if a.semanticEquals(Not(c)) => Or(b, c)
-  case (a And b) Or c if b.semanticEquals(Not(c)) => Or(a, c)
+  // The following optimization is applicable only when the operands 
are nullable,
--- End diff --

fixed


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22702: [SPARK-25714] Fix Null Handling in the Optimizer ...

2018-10-12 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22702#discussion_r224950588
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
 ---
@@ -276,15 +276,31 @@ object BooleanSimplification extends 
Rule[LogicalPlan] with PredicateHelper {
   case a And b if a.semanticEquals(b) => a
   case a Or b if a.semanticEquals(b) => a
 
-  case a And (b Or c) if Not(a).semanticEquals(b) => And(a, c)
-  case a And (b Or c) if Not(a).semanticEquals(c) => And(a, b)
-  case (a Or b) And c if a.semanticEquals(Not(c)) => And(b, c)
-  case (a Or b) And c if b.semanticEquals(Not(c)) => And(a, c)
-
-  case a Or (b And c) if Not(a).semanticEquals(b) => Or(a, c)
-  case a Or (b And c) if Not(a).semanticEquals(c) => Or(a, b)
-  case (a And b) Or c if a.semanticEquals(Not(c)) => Or(b, c)
-  case (a And b) Or c if b.semanticEquals(Not(c)) => Or(a, c)
+  // The following optimization is applicable only when the operands 
are nullable,
--- End diff --

typo: `only when the operands are not nullable`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22702: [SPARK-25714] Fix Null Handling in the Optimizer ...

2018-10-12 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22702


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22702: [SPARK-25714] Fix Null Handling in the Optimizer ...

2018-10-12 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22702#discussion_r224748765
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
 ---
@@ -276,15 +276,15 @@ object BooleanSimplification extends 
Rule[LogicalPlan] with PredicateHelper {
   case a And b if a.semanticEquals(b) => a
   case a Or b if a.semanticEquals(b) => a
 
-  case a And (b Or c) if Not(a).semanticEquals(b) => And(a, c)
-  case a And (b Or c) if Not(a).semanticEquals(c) => And(a, b)
-  case (a Or b) And c if a.semanticEquals(Not(c)) => And(b, c)
-  case (a Or b) And c if b.semanticEquals(Not(c)) => And(a, c)
-
-  case a Or (b And c) if Not(a).semanticEquals(b) => Or(a, c)
-  case a Or (b And c) if Not(a).semanticEquals(c) => Or(a, b)
-  case (a And b) Or c if a.semanticEquals(Not(c)) => Or(b, c)
-  case (a And b) Or c if b.semanticEquals(Not(c)) => Or(a, c)
+  case a And (b Or c) if !a.nullable && Not(a).semanticEquals(b) => 
And(a, c)
+  case a And (b Or c) if !a.nullable && Not(a).semanticEquals(c) => 
And(a, b)
+  case (a Or b) And c if !a.nullable && a.semanticEquals(Not(c)) => 
And(b, c)
+  case (a Or b) And c if !b.nullable && b.semanticEquals(Not(c)) => 
And(a, c)
+
+  case a Or (b And c) if !a.nullable && Not(a).semanticEquals(b) => 
Or(a, c)
--- End diff --

oh, yes you're right, this might be a problem indeed if the expression is 
inside a `not`. Sorry, thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22702: [SPARK-25714] Fix Null Handling in the Optimizer ...

2018-10-12 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22702#discussion_r224745249
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
 ---
@@ -276,15 +276,15 @@ object BooleanSimplification extends 
Rule[LogicalPlan] with PredicateHelper {
   case a And b if a.semanticEquals(b) => a
   case a Or b if a.semanticEquals(b) => a
 
-  case a And (b Or c) if Not(a).semanticEquals(b) => And(a, c)
-  case a And (b Or c) if Not(a).semanticEquals(c) => And(a, b)
-  case (a Or b) And c if a.semanticEquals(Not(c)) => And(b, c)
-  case (a Or b) And c if b.semanticEquals(Not(c)) => And(a, c)
-
-  case a Or (b And c) if Not(a).semanticEquals(b) => Or(a, c)
-  case a Or (b And c) if Not(a).semanticEquals(c) => Or(a, b)
-  case (a And b) Or c if a.semanticEquals(Not(c)) => Or(b, c)
-  case (a And b) Or c if b.semanticEquals(Not(c)) => Or(a, c)
+  case a And (b Or c) if !a.nullable && Not(a).semanticEquals(b) => 
And(a, c)
+  case a And (b Or c) if !a.nullable && Not(a).semanticEquals(c) => 
And(a, b)
+  case (a Or b) And c if !a.nullable && a.semanticEquals(Not(c)) => 
And(b, c)
+  case (a Or b) And c if !b.nullable && b.semanticEquals(Not(c)) => 
And(a, c)
+
+  case a Or (b And c) if !a.nullable && Not(a).semanticEquals(b) => 
Or(a, c)
--- End diff --

> when a is null, And(a, c) returns null

This is not always the case, `null && false`  is `false`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22702: [SPARK-25714] Fix Null Handling in the Optimizer ...

2018-10-12 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22702#discussion_r224709234
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
 ---
@@ -276,15 +276,15 @@ object BooleanSimplification extends 
Rule[LogicalPlan] with PredicateHelper {
   case a And b if a.semanticEquals(b) => a
   case a Or b if a.semanticEquals(b) => a
 
-  case a And (b Or c) if Not(a).semanticEquals(b) => And(a, c)
-  case a And (b Or c) if Not(a).semanticEquals(c) => And(a, b)
-  case (a Or b) And c if a.semanticEquals(Not(c)) => And(b, c)
-  case (a Or b) And c if b.semanticEquals(Not(c)) => And(a, c)
-
-  case a Or (b And c) if Not(a).semanticEquals(b) => Or(a, c)
-  case a Or (b And c) if Not(a).semanticEquals(c) => Or(a, b)
-  case (a And b) Or c if a.semanticEquals(Not(c)) => Or(b, c)
-  case (a And b) Or c if b.semanticEquals(Not(c)) => Or(a, c)
+  case a And (b Or c) if !a.nullable && Not(a).semanticEquals(b) => 
And(a, c)
+  case a And (b Or c) if !a.nullable && Not(a).semanticEquals(c) => 
And(a, b)
+  case (a Or b) And c if !a.nullable && a.semanticEquals(Not(c)) => 
And(b, c)
+  case (a Or b) And c if !b.nullable && b.semanticEquals(Not(c)) => 
And(a, c)
+
+  case a Or (b And c) if !a.nullable && Not(a).semanticEquals(b) => 
Or(a, c)
--- End diff --

Sorry, it is the other case where the change is not needed, right?
`a And (b Or c)` -> `And(a, c)` when `a` is `null`, `And(a, c)` returns 
`null` (I got a bit confused earlier, sorry).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22702: [SPARK-25714] Fix Null Handling in the Optimizer ...

2018-10-12 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22702#discussion_r224708102
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
 ---
@@ -276,15 +276,15 @@ object BooleanSimplification extends 
Rule[LogicalPlan] with PredicateHelper {
   case a And b if a.semanticEquals(b) => a
   case a Or b if a.semanticEquals(b) => a
 
-  case a And (b Or c) if Not(a).semanticEquals(b) => And(a, c)
-  case a And (b Or c) if Not(a).semanticEquals(c) => And(a, b)
-  case (a Or b) And c if a.semanticEquals(Not(c)) => And(b, c)
-  case (a Or b) And c if b.semanticEquals(Not(c)) => And(a, c)
-
-  case a Or (b And c) if Not(a).semanticEquals(b) => Or(a, c)
-  case a Or (b And c) if Not(a).semanticEquals(c) => Or(a, b)
-  case (a And b) Or c if a.semanticEquals(Not(c)) => Or(b, c)
-  case (a And b) Or c if b.semanticEquals(Not(c)) => Or(a, c)
+  case a And (b Or c) if !a.nullable && Not(a).semanticEquals(b) => 
And(a, c)
+  case a And (b Or c) if !a.nullable && Not(a).semanticEquals(c) => 
And(a, b)
+  case (a Or b) And c if !a.nullable && a.semanticEquals(Not(c)) => 
And(b, c)
+  case (a Or b) And c if !b.nullable && b.semanticEquals(Not(c)) => 
And(a, c)
+
+  case a Or (b And c) if !a.nullable && Not(a).semanticEquals(b) => 
Or(a, c)
--- End diff --

I see now, sorry. Thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22702: [SPARK-25714] Fix Null Handling in the Optimizer ...

2018-10-12 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22702#discussion_r224706383
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
 ---
@@ -276,15 +276,15 @@ object BooleanSimplification extends 
Rule[LogicalPlan] with PredicateHelper {
   case a And b if a.semanticEquals(b) => a
   case a Or b if a.semanticEquals(b) => a
 
-  case a And (b Or c) if Not(a).semanticEquals(b) => And(a, c)
-  case a And (b Or c) if Not(a).semanticEquals(c) => And(a, b)
-  case (a Or b) And c if a.semanticEquals(Not(c)) => And(b, c)
-  case (a Or b) And c if b.semanticEquals(Not(c)) => And(a, c)
-
-  case a Or (b And c) if Not(a).semanticEquals(b) => Or(a, c)
-  case a Or (b And c) if Not(a).semanticEquals(c) => Or(a, b)
-  case (a And b) Or c if a.semanticEquals(Not(c)) => Or(b, c)
-  case (a And b) Or c if b.semanticEquals(Not(c)) => Or(a, c)
+  case a And (b Or c) if !a.nullable && Not(a).semanticEquals(b) => 
And(a, c)
+  case a And (b Or c) if !a.nullable && Not(a).semanticEquals(c) => 
And(a, b)
+  case (a Or b) And c if !a.nullable && a.semanticEquals(Not(c)) => 
And(b, c)
+  case (a Or b) And c if !b.nullable && b.semanticEquals(Not(c)) => 
And(a, c)
+
+  case a Or (b And c) if !a.nullable && Not(a).semanticEquals(b) => 
Or(a, c)
--- End diff --

the problem is when a is null, c is true


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22702: [SPARK-25714] Fix Null Handling in the Optimizer ...

2018-10-12 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22702#discussion_r224704266
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
 ---
@@ -276,15 +276,15 @@ object BooleanSimplification extends 
Rule[LogicalPlan] with PredicateHelper {
   case a And b if a.semanticEquals(b) => a
   case a Or b if a.semanticEquals(b) => a
 
-  case a And (b Or c) if Not(a).semanticEquals(b) => And(a, c)
-  case a And (b Or c) if Not(a).semanticEquals(c) => And(a, b)
-  case (a Or b) And c if a.semanticEquals(Not(c)) => And(b, c)
-  case (a Or b) And c if b.semanticEquals(Not(c)) => And(a, c)
-
-  case a Or (b And c) if Not(a).semanticEquals(b) => Or(a, c)
-  case a Or (b And c) if Not(a).semanticEquals(c) => Or(a, b)
-  case (a And b) Or c if a.semanticEquals(Not(c)) => Or(b, c)
-  case (a And b) Or c if b.semanticEquals(Not(c)) => Or(a, c)
+  case a And (b Or c) if !a.nullable && Not(a).semanticEquals(b) => 
And(a, c)
+  case a And (b Or c) if !a.nullable && Not(a).semanticEquals(c) => 
And(a, b)
+  case (a Or b) And c if !a.nullable && a.semanticEquals(Not(c)) => 
And(b, c)
+  case (a Or b) And c if !b.nullable && b.semanticEquals(Not(c)) => 
And(a, c)
+
+  case a Or (b And c) if !a.nullable && Not(a).semanticEquals(b) => 
Or(a, c)
--- End diff --

these shouldn't be a problem, since if `a` is `true`, then `a Or b` is 
true, regardless of `b`'s value/nullability, isn't it?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22702: [SPARK-25714] Fix Null Handling in the Optimizer ...

2018-10-11 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22702#discussion_r224658881
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
 ---
@@ -276,15 +276,15 @@ object BooleanSimplification extends 
Rule[LogicalPlan] with PredicateHelper {
   case a And b if a.semanticEquals(b) => a
   case a Or b if a.semanticEquals(b) => a
 
-  case a And (b Or c) if Not(a).semanticEquals(b) => And(a, c)
-  case a And (b Or c) if Not(a).semanticEquals(c) => And(a, b)
-  case (a Or b) And c if a.semanticEquals(Not(c)) => And(b, c)
-  case (a Or b) And c if b.semanticEquals(Not(c)) => And(a, c)
-
-  case a Or (b And c) if Not(a).semanticEquals(b) => Or(a, c)
-  case a Or (b And c) if Not(a).semanticEquals(c) => Or(a, b)
-  case (a And b) Or c if a.semanticEquals(Not(c)) => Or(b, c)
-  case (a And b) Or c if b.semanticEquals(Not(c)) => Or(a, c)
+  case a And (b Or c) if !a.nullable && Not(a).semanticEquals(b) => 
And(a, c)
--- End diff --

after more thoughts, `a And (b Or c)` should be better than `If(IsNull(a), 
null, And(a, c))`, as it's more likely to get pushed down to data source, so 
the changes here LGTM


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22702: [SPARK-25714] Fix Null Handling in the Optimizer ...

2018-10-11 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22702#discussion_r224655860
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
 ---
@@ -276,15 +276,15 @@ object BooleanSimplification extends 
Rule[LogicalPlan] with PredicateHelper {
   case a And b if a.semanticEquals(b) => a
   case a Or b if a.semanticEquals(b) => a
 
-  case a And (b Or c) if Not(a).semanticEquals(b) => And(a, c)
-  case a And (b Or c) if Not(a).semanticEquals(c) => And(a, b)
-  case (a Or b) And c if a.semanticEquals(Not(c)) => And(b, c)
-  case (a Or b) And c if b.semanticEquals(Not(c)) => And(a, c)
-
-  case a Or (b And c) if Not(a).semanticEquals(b) => Or(a, c)
-  case a Or (b And c) if Not(a).semanticEquals(c) => Or(a, b)
-  case (a And b) Or c if a.semanticEquals(Not(c)) => Or(b, c)
-  case (a And b) Or c if b.semanticEquals(Not(c)) => Or(a, c)
+  case a And (b Or c) if !a.nullable && Not(a).semanticEquals(b) => 
And(a, c)
--- End diff --

Since this is complicated, shall we put a comment to explain it?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22702: [SPARK-25714] Fix Null Handling in the Optimizer ...

2018-10-11 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22702#discussion_r224655771
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
 ---
@@ -276,15 +276,15 @@ object BooleanSimplification extends 
Rule[LogicalPlan] with PredicateHelper {
   case a And b if a.semanticEquals(b) => a
   case a Or b if a.semanticEquals(b) => a
 
-  case a And (b Or c) if Not(a).semanticEquals(b) => And(a, c)
-  case a And (b Or c) if Not(a).semanticEquals(c) => And(a, b)
-  case (a Or b) And c if a.semanticEquals(Not(c)) => And(b, c)
-  case (a Or b) And c if b.semanticEquals(Not(c)) => And(a, c)
-
-  case a Or (b And c) if Not(a).semanticEquals(b) => Or(a, c)
-  case a Or (b And c) if Not(a).semanticEquals(c) => Or(a, b)
-  case (a And b) Or c if a.semanticEquals(Not(c)) => Or(b, c)
-  case (a And b) Or c if b.semanticEquals(Not(c)) => Or(a, c)
+  case a And (b Or c) if !a.nullable && Not(a).semanticEquals(b) => 
And(a, c)
--- End diff --

assuming a is null, then b is also null.
If c is null: `a And (b Or c)` -> null, And(a, c) -> null
If c is true: `a And (b Or c)` -> null, And(a, c) -> null
if c is false: `a And (b Or c)` -> null, And(a, c) -> false

So yes this is a bug, and we should rewrite it to `If(IsNull(a), a, And(a, 
c))`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22702: [SPARK-25714] Fix Null Handling in the Optimizer ...

2018-10-11 Thread gatorsmile
GitHub user gatorsmile opened a pull request:

https://github.com/apache/spark/pull/22702

[SPARK-25714] Fix Null Handling in the Optimizer rule BooleanSimplification

## What changes were proposed in this pull request?
```Scala
val df1 = Seq(("abc", 1), (null, 2)).toDF("col1", "col2")
df1.write.mode(SaveMode.Overwrite).parquet("/tmp/test1")
val df2 = spark.read.parquet("/tmp/test1")
df2.filter("col1 = 'abc' OR (col1 != 'abc' AND col2 == 3)").show()
```

Before the PR, it returns both rows. After the fix, it returns `Row ("abc", 
1))`. This is to fix the bug in NULL handling in BooleanSimplification. This is 
a bug introduced in Spark 1.6 release.

## How was this patch tested?
Added test cases

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gatorsmile/spark fixBooleanSimplify2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22702.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22702


commit 5d1dde12b6cb1b3a61f32d678b952ac4ce5b6c0f
Author: gatorsmile 
Date:   2018-10-11T21:38:38Z

fix

commit a9359abff62017f46f33ef18d7f56f97c885af3d
Author: gatorsmile 
Date:   2018-10-11T21:40:44Z

style




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org