[GitHub] spark pull request #23176: [SPARK-26211][SQL] Fix InSet for binary, and stru...

2018-11-30 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/23176#discussion_r237826533
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/PredicateSuite.scala
 ---
@@ -293,6 +293,54 @@ class PredicateSuite extends SparkFunSuite with 
ExpressionEvalHelper {
 }
   }
 
+  test("INSET: binary") {
--- End diff --

Submitted #23187.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23176: [SPARK-26211][SQL] Fix InSet for binary, and stru...

2018-11-29 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/23176#discussion_r237771176
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/PredicateSuite.scala
 ---
@@ -293,6 +293,54 @@ class PredicateSuite extends SparkFunSuite with 
ExpressionEvalHelper {
 }
   }
 
+  test("INSET: binary") {
--- End diff --

Sure, I'll do it later. Thanks!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23176: [SPARK-26211][SQL] Fix InSet for binary, and stru...

2018-11-29 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/23176#discussion_r237770687
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/PredicateSuite.scala
 ---
@@ -293,6 +293,54 @@ class PredicateSuite extends SparkFunSuite with 
ExpressionEvalHelper {
 }
   }
 
+  test("INSET: binary") {
--- End diff --

good idea! we should test `In` and `InSet` together


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23176: [SPARK-26211][SQL] Fix InSet for binary, and stru...

2018-11-29 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/23176#discussion_r237766198
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/PredicateSuite.scala
 ---
@@ -293,6 +293,54 @@ class PredicateSuite extends SparkFunSuite with 
ExpressionEvalHelper {
 }
   }
 
+  test("INSET: binary") {
--- End diff --

Regarding the semantics, InSet is equal to In. Could we combine the test 
cases? Test both?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23176: [SPARK-26211][SQL] Fix InSet for binary, and stru...

2018-11-29 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/23176


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23176: [SPARK-26211][SQL] Fix InSet for binary, and stru...

2018-11-29 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/23176#discussion_r237400321
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala
 ---
@@ -367,11 +367,29 @@ case class InSet(child: Expression, hset: Set[Any]) 
extends UnaryExpression with
   }
 
   @transient lazy val set: Set[Any] = child.dataType match {
-case _: AtomicType => hset
+case t: AtomicType if !t.isInstanceOf[BinaryType] => hset
 case _: NullType => hset
 case _ =>
+  val ord = TypeUtils.getInterpretedOrdering(child.dataType)
+  val ordering = if (hasNull) {
+new Ordering[Any] {
+  override def compare(x: Any, y: Any): Int = {
+if (x == null && y == null) {
+  0
+} else if (x == null) {
+  -1
+} else if (y == null) {
+  1
+} else {
+  ord.compare(x, y)
+}
+  }
+}
+  } else {
+ord
+  }
   // for structs use interpreted ordering to be able to compare 
UnsafeRows with non-UnsafeRows
-  TreeSet.empty(TypeUtils.getInterpretedOrdering(child.dataType)) ++ 
hset
+  TreeSet.empty(ordering) ++ hset
--- End diff --

Actually we are using `nullSafeEval`, so we don't need to update it.
Instead, I'm updating to use `nullSafeCodeGen` for codegen path.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23176: [SPARK-26211][SQL] Fix InSet for binary, and stru...

2018-11-29 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/23176#discussion_r237399670
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala
 ---
@@ -367,11 +367,29 @@ case class InSet(child: Expression, hset: Set[Any]) 
extends UnaryExpression with
   }
 
   @transient lazy val set: Set[Any] = child.dataType match {
-case _: AtomicType => hset
+case t: AtomicType if !t.isInstanceOf[BinaryType] => hset
 case _: NullType => hset
 case _ =>
+  val ord = TypeUtils.getInterpretedOrdering(child.dataType)
+  val ordering = if (hasNull) {
+new Ordering[Any] {
+  override def compare(x: Any, y: Any): Int = {
--- End diff --

Thanks! yeah, I'm updating as @cloud-fan's idea.
Also we can use `nullSafeCodeGen` for codegen path, I'll update it as well.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23176: [SPARK-26211][SQL] Fix InSet for binary, and stru...

2018-11-29 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/23176#discussion_r237398085
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala
 ---
@@ -367,11 +367,29 @@ case class InSet(child: Expression, hset: Set[Any]) 
extends UnaryExpression with
   }
 
   @transient lazy val set: Set[Any] = child.dataType match {
-case _: AtomicType => hset
+case t: AtomicType if !t.isInstanceOf[BinaryType] => hset
 case _: NullType => hset
 case _ =>
+  val ord = TypeUtils.getInterpretedOrdering(child.dataType)
+  val ordering = if (hasNull) {
+new Ordering[Any] {
+  override def compare(x: Any, y: Any): Int = {
--- End diff --

Or simply filter out null from the tree set as @cloud-fan's idea.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23176: [SPARK-26211][SQL] Fix InSet for binary, and stru...

2018-11-29 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/23176#discussion_r237397442
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala
 ---
@@ -367,11 +367,29 @@ case class InSet(child: Expression, hset: Set[Any]) 
extends UnaryExpression with
   }
 
   @transient lazy val set: Set[Any] = child.dataType match {
-case _: AtomicType => hset
+case t: AtomicType if !t.isInstanceOf[BinaryType] => hset
 case _: NullType => hset
 case _ =>
+  val ord = TypeUtils.getInterpretedOrdering(child.dataType)
+  val ordering = if (hasNull) {
+new Ordering[Any] {
+  override def compare(x: Any, y: Any): Int = {
--- End diff --

InSet overrides nullSafeEval, and for codegen we look into `set` only if 
`!ev.isNull`, so I think we only need to consider the case one side is null.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23176: [SPARK-26211][SQL] Fix InSet for binary, and stru...

2018-11-29 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/23176#discussion_r237383211
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/PredicateSuite.scala
 ---
@@ -293,6 +293,54 @@ class PredicateSuite extends SparkFunSuite with 
ExpressionEvalHelper {
 }
   }
 
+  test("INSET: binary") {
+val hS = HashSet[Any]() + Array(1.toByte, 2.toByte) + Array(3.toByte)
+val nS = HashSet[Any]() + Array(1.toByte, 2.toByte) + Array(3.toByte) 
+ null
+val onetwo = Literal(Array(1.toByte, 2.toByte))
+val three = Literal(Array(3.toByte))
+val threefour = Literal(Array(3.toByte, 4.toByte))
+val nl = Literal(null, onetwo.dataType)
+checkEvaluation(InSet(onetwo, hS), true)
+checkEvaluation(InSet(three, hS), true)
+checkEvaluation(InSet(three, nS), true)
--- End diff --

this line is duplicated


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23176: [SPARK-26211][SQL] Fix InSet for binary, and stru...

2018-11-29 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/23176#discussion_r237382990
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala
 ---
@@ -367,11 +367,29 @@ case class InSet(child: Expression, hset: Set[Any]) 
extends UnaryExpression with
   }
 
   @transient lazy val set: Set[Any] = child.dataType match {
-case _: AtomicType => hset
+case t: AtomicType if !t.isInstanceOf[BinaryType] => hset
 case _: NullType => hset
 case _ =>
+  val ord = TypeUtils.getInterpretedOrdering(child.dataType)
+  val ordering = if (hasNull) {
+new Ordering[Any] {
+  override def compare(x: Any, y: Any): Int = {
+if (x == null && y == null) {
+  0
+} else if (x == null) {
+  -1
+} else if (y == null) {
+  1
+} else {
+  ord.compare(x, y)
+}
+  }
+}
+  } else {
+ord
+  }
   // for structs use interpreted ordering to be able to compare 
UnsafeRows with non-UnsafeRows
-  TreeSet.empty(TypeUtils.getInterpretedOrdering(child.dataType)) ++ 
hset
+  TreeSet.empty(ordering) ++ hset
--- End diff --

and udpate eval to
```
if (value == null) {
  null
} else if (set.contains(value)) {
  true
} else if (hasNull) {
  null
} else {
  false
}
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23176: [SPARK-26211][SQL] Fix InSet for binary, and stru...

2018-11-29 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/23176#discussion_r237382322
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala
 ---
@@ -367,11 +367,29 @@ case class InSet(child: Expression, hset: Set[Any]) 
extends UnaryExpression with
   }
 
   @transient lazy val set: Set[Any] = child.dataType match {
-case _: AtomicType => hset
+case t: AtomicType if !t.isInstanceOf[BinaryType] => hset
 case _: NullType => hset
 case _ =>
+  val ord = TypeUtils.getInterpretedOrdering(child.dataType)
+  val ordering = if (hasNull) {
+new Ordering[Any] {
+  override def compare(x: Any, y: Any): Int = {
+if (x == null && y == null) {
+  0
+} else if (x == null) {
+  -1
+} else if (y == null) {
+  1
+} else {
+  ord.compare(x, y)
+}
+  }
+}
+  } else {
+ord
+  }
   // for structs use interpreted ordering to be able to compare 
UnsafeRows with non-UnsafeRows
-  TreeSet.empty(TypeUtils.getInterpretedOrdering(child.dataType)) ++ 
hset
+  TreeSet.empty(ordering) ++ hset
--- End diff --

shall we just filter out nulls when building the tree set?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23176: [SPARK-26211][SQL] Fix InSet for binary, and stru...

2018-11-28 Thread ueshin
GitHub user ueshin opened a pull request:

https://github.com/apache/spark/pull/23176

[SPARK-26211][SQL] Fix InSet for binary, and struct and array with null.

## What changes were proposed in this pull request?

Currently `InSet` doesn't work properly for binary type, or struct and 
array type with null value in the set.
Because, as for binary type, the `HashSet` doesn't work properly for 
`Array[Byte]`, and as for struct and array type with null value in the set, the 
`ordering` will throw a `NPE`.

## How was this patch tested?

Added a few tests.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ueshin/apache-spark issues/SPARK-26211/inset

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23176.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23176


commit 277c48f63b9d36028b5dfbb3c850418713197cc4
Author: Takuya UESHIN 
Date:   2018-11-29T06:47:29Z

Fix InSet for binary, and struct and array with null.




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org