[GitHub] spark pull request #23176: [SPARK-26211][SQL] Fix InSet for binary, and stru...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/23176#discussion_r237826533 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/PredicateSuite.scala --- @@ -293,6 +293,54 @@ class PredicateSuite extends SparkFunSuite with ExpressionEvalHelper { } } + test("INSET: binary") { --- End diff -- Submitted #23187. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23176: [SPARK-26211][SQL] Fix InSet for binary, and stru...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/23176#discussion_r237771176 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/PredicateSuite.scala --- @@ -293,6 +293,54 @@ class PredicateSuite extends SparkFunSuite with ExpressionEvalHelper { } } + test("INSET: binary") { --- End diff -- Sure, I'll do it later. Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23176: [SPARK-26211][SQL] Fix InSet for binary, and stru...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/23176#discussion_r237770687 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/PredicateSuite.scala --- @@ -293,6 +293,54 @@ class PredicateSuite extends SparkFunSuite with ExpressionEvalHelper { } } + test("INSET: binary") { --- End diff -- good idea! we should test `In` and `InSet` together --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23176: [SPARK-26211][SQL] Fix InSet for binary, and stru...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/23176#discussion_r237766198 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/PredicateSuite.scala --- @@ -293,6 +293,54 @@ class PredicateSuite extends SparkFunSuite with ExpressionEvalHelper { } } + test("INSET: binary") { --- End diff -- Regarding the semantics, InSet is equal to In. Could we combine the test cases? Test both? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23176: [SPARK-26211][SQL] Fix InSet for binary, and stru...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/23176 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23176: [SPARK-26211][SQL] Fix InSet for binary, and stru...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/23176#discussion_r237400321 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala --- @@ -367,11 +367,29 @@ case class InSet(child: Expression, hset: Set[Any]) extends UnaryExpression with } @transient lazy val set: Set[Any] = child.dataType match { -case _: AtomicType => hset +case t: AtomicType if !t.isInstanceOf[BinaryType] => hset case _: NullType => hset case _ => + val ord = TypeUtils.getInterpretedOrdering(child.dataType) + val ordering = if (hasNull) { +new Ordering[Any] { + override def compare(x: Any, y: Any): Int = { +if (x == null && y == null) { + 0 +} else if (x == null) { + -1 +} else if (y == null) { + 1 +} else { + ord.compare(x, y) +} + } +} + } else { +ord + } // for structs use interpreted ordering to be able to compare UnsafeRows with non-UnsafeRows - TreeSet.empty(TypeUtils.getInterpretedOrdering(child.dataType)) ++ hset + TreeSet.empty(ordering) ++ hset --- End diff -- Actually we are using `nullSafeEval`, so we don't need to update it. Instead, I'm updating to use `nullSafeCodeGen` for codegen path. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23176: [SPARK-26211][SQL] Fix InSet for binary, and stru...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/23176#discussion_r237399670 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala --- @@ -367,11 +367,29 @@ case class InSet(child: Expression, hset: Set[Any]) extends UnaryExpression with } @transient lazy val set: Set[Any] = child.dataType match { -case _: AtomicType => hset +case t: AtomicType if !t.isInstanceOf[BinaryType] => hset case _: NullType => hset case _ => + val ord = TypeUtils.getInterpretedOrdering(child.dataType) + val ordering = if (hasNull) { +new Ordering[Any] { + override def compare(x: Any, y: Any): Int = { --- End diff -- Thanks! yeah, I'm updating as @cloud-fan's idea. Also we can use `nullSafeCodeGen` for codegen path, I'll update it as well. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23176: [SPARK-26211][SQL] Fix InSet for binary, and stru...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/23176#discussion_r237398085 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala --- @@ -367,11 +367,29 @@ case class InSet(child: Expression, hset: Set[Any]) extends UnaryExpression with } @transient lazy val set: Set[Any] = child.dataType match { -case _: AtomicType => hset +case t: AtomicType if !t.isInstanceOf[BinaryType] => hset case _: NullType => hset case _ => + val ord = TypeUtils.getInterpretedOrdering(child.dataType) + val ordering = if (hasNull) { +new Ordering[Any] { + override def compare(x: Any, y: Any): Int = { --- End diff -- Or simply filter out null from the tree set as @cloud-fan's idea. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23176: [SPARK-26211][SQL] Fix InSet for binary, and stru...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/23176#discussion_r237397442 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala --- @@ -367,11 +367,29 @@ case class InSet(child: Expression, hset: Set[Any]) extends UnaryExpression with } @transient lazy val set: Set[Any] = child.dataType match { -case _: AtomicType => hset +case t: AtomicType if !t.isInstanceOf[BinaryType] => hset case _: NullType => hset case _ => + val ord = TypeUtils.getInterpretedOrdering(child.dataType) + val ordering = if (hasNull) { +new Ordering[Any] { + override def compare(x: Any, y: Any): Int = { --- End diff -- InSet overrides nullSafeEval, and for codegen we look into `set` only if `!ev.isNull`, so I think we only need to consider the case one side is null. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23176: [SPARK-26211][SQL] Fix InSet for binary, and stru...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/23176#discussion_r237383211 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/PredicateSuite.scala --- @@ -293,6 +293,54 @@ class PredicateSuite extends SparkFunSuite with ExpressionEvalHelper { } } + test("INSET: binary") { +val hS = HashSet[Any]() + Array(1.toByte, 2.toByte) + Array(3.toByte) +val nS = HashSet[Any]() + Array(1.toByte, 2.toByte) + Array(3.toByte) + null +val onetwo = Literal(Array(1.toByte, 2.toByte)) +val three = Literal(Array(3.toByte)) +val threefour = Literal(Array(3.toByte, 4.toByte)) +val nl = Literal(null, onetwo.dataType) +checkEvaluation(InSet(onetwo, hS), true) +checkEvaluation(InSet(three, hS), true) +checkEvaluation(InSet(three, nS), true) --- End diff -- this line is duplicated --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23176: [SPARK-26211][SQL] Fix InSet for binary, and stru...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/23176#discussion_r237382990 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala --- @@ -367,11 +367,29 @@ case class InSet(child: Expression, hset: Set[Any]) extends UnaryExpression with } @transient lazy val set: Set[Any] = child.dataType match { -case _: AtomicType => hset +case t: AtomicType if !t.isInstanceOf[BinaryType] => hset case _: NullType => hset case _ => + val ord = TypeUtils.getInterpretedOrdering(child.dataType) + val ordering = if (hasNull) { +new Ordering[Any] { + override def compare(x: Any, y: Any): Int = { +if (x == null && y == null) { + 0 +} else if (x == null) { + -1 +} else if (y == null) { + 1 +} else { + ord.compare(x, y) +} + } +} + } else { +ord + } // for structs use interpreted ordering to be able to compare UnsafeRows with non-UnsafeRows - TreeSet.empty(TypeUtils.getInterpretedOrdering(child.dataType)) ++ hset + TreeSet.empty(ordering) ++ hset --- End diff -- and udpate eval to ``` if (value == null) { null } else if (set.contains(value)) { true } else if (hasNull) { null } else { false } ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23176: [SPARK-26211][SQL] Fix InSet for binary, and stru...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/23176#discussion_r237382322 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala --- @@ -367,11 +367,29 @@ case class InSet(child: Expression, hset: Set[Any]) extends UnaryExpression with } @transient lazy val set: Set[Any] = child.dataType match { -case _: AtomicType => hset +case t: AtomicType if !t.isInstanceOf[BinaryType] => hset case _: NullType => hset case _ => + val ord = TypeUtils.getInterpretedOrdering(child.dataType) + val ordering = if (hasNull) { +new Ordering[Any] { + override def compare(x: Any, y: Any): Int = { +if (x == null && y == null) { + 0 +} else if (x == null) { + -1 +} else if (y == null) { + 1 +} else { + ord.compare(x, y) +} + } +} + } else { +ord + } // for structs use interpreted ordering to be able to compare UnsafeRows with non-UnsafeRows - TreeSet.empty(TypeUtils.getInterpretedOrdering(child.dataType)) ++ hset + TreeSet.empty(ordering) ++ hset --- End diff -- shall we just filter out nulls when building the tree set? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23176: [SPARK-26211][SQL] Fix InSet for binary, and stru...
GitHub user ueshin opened a pull request: https://github.com/apache/spark/pull/23176 [SPARK-26211][SQL] Fix InSet for binary, and struct and array with null. ## What changes were proposed in this pull request? Currently `InSet` doesn't work properly for binary type, or struct and array type with null value in the set. Because, as for binary type, the `HashSet` doesn't work properly for `Array[Byte]`, and as for struct and array type with null value in the set, the `ordering` will throw a `NPE`. ## How was this patch tested? Added a few tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/ueshin/apache-spark issues/SPARK-26211/inset Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23176.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23176 commit 277c48f63b9d36028b5dfbb3c850418713197cc4 Author: Takuya UESHIN Date: 2018-11-29T06:47:29Z Fix InSet for binary, and struct and array with null. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org