[jira] [Commented] (SPARK-19017) NOT IN subquery with more than one column may return incorrect results
[ https://issues.apache.org/jira/browse/SPARK-19017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15801624#comment-15801624 ] Nattavut Sutyanyong commented on SPARK-19017: - One way to interpret the behaviour of the comparison of tuples is it operates strictly on 2-value logic of the comparison of each element of the tuples. That is, Case 1: (a1, a2) = (b1, b2) is interpreted as IF (a1 = b1) is true THEN true ELSE false AND IF (a2 = b2) is true THEN true ELSE false Case 2: (a1, a2) <> (b1, b2) is interpreted as a negation of the equality operator on the 2-value logic of each element. IF (a1 = b1) is true THEN false ELSE true OR IF (a2 = b2) is true THEN false ELSE true The NOT IN semantics is different that it preserves the 3-value logic in the comparison of each element. I don't know that we should align the two usages to the same semantics or not. > NOT IN subquery with more than one column may return incorrect results > -- > > Key: SPARK-19017 > URL: https://issues.apache.org/jira/browse/SPARK-19017 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Nattavut Sutyanyong > > When putting more than one column in the NOT IN, the query may not return > correctly if there is a null data. We can demonstrate the problem with the > following data set and query: > {code} > Seq((2,1)).toDF("a1","b1").createOrReplaceTempView("t1") > Seq[(java.lang.Integer,java.lang.Integer)]((1,null)).toDF("a2","b2").createOrReplaceTempView("t2") > sql("select * from t1 where (a1,b1) not in (select a2,b2 from t2)").show > +---+---+ > | a1| b1| > +---+---+ > +---+---+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19017) NOT IN subquery with more than one column may return incorrect results
[ https://issues.apache.org/jira/browse/SPARK-19017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15799388#comment-15799388 ] Nattavut Sutyanyong commented on SPARK-19017: - On c1: (1 <> 2) or (2 <> null) => true or unknown => true On c2: (1 <> 1) or (2 <> null) => false or unknown => unknown <-- This is correct, right? In (P1 or P2), if P1 is true, we can short-circuit and conclude the result is true. In the latter case, if P1 is false, we have to continue evaluating P2 and since P2 is unknown, the result is unknown. I may confess it took me a few iterations to work out the whole story as well. > NOT IN subquery with more than one column may return incorrect results > -- > > Key: SPARK-19017 > URL: https://issues.apache.org/jira/browse/SPARK-19017 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Nattavut Sutyanyong > > When putting more than one column in the NOT IN, the query may not return > correctly if there is a null data. We can demonstrate the problem with the > following data set and query: > {code} > Seq((2,1)).toDF("a1","b1").createOrReplaceTempView("t1") > Seq[(java.lang.Integer,java.lang.Integer)]((1,null)).toDF("a2","b2").createOrReplaceTempView("t2") > sql("select * from t1 where (a1,b1) not in (select a2,b2 from t2)").show > +---+---+ > | a1| b1| > +---+---+ > +---+---+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19017) NOT IN subquery with more than one column may return incorrect results
[ https://issues.apache.org/jira/browse/SPARK-19017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15799374#comment-15799374 ] Herman van Hovell commented on SPARK-19017: --- Thanks! > NOT IN subquery with more than one column may return incorrect results > -- > > Key: SPARK-19017 > URL: https://issues.apache.org/jira/browse/SPARK-19017 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Nattavut Sutyanyong > > When putting more than one column in the NOT IN, the query may not return > correctly if there is a null data. We can demonstrate the problem with the > following data set and query: > {code} > Seq((2,1)).toDF("a1","b1").createOrReplaceTempView("t1") > Seq[(java.lang.Integer,java.lang.Integer)]((1,null)).toDF("a2","b2").createOrReplaceTempView("t2") > sql("select * from t1 where (a1,b1) not in (select a2,b2 from t2)").show > +---+---+ > | a1| b1| > +---+---+ > +---+---+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19017) NOT IN subquery with more than one column may return incorrect results
[ https://issues.apache.org/jira/browse/SPARK-19017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15799369#comment-15799369 ] Herman van Hovell commented on SPARK-19017: --- I agree that they are equal. It just seems weird to me that in some cases it is ok that the tuples in the subquery can have null values. That being said, I am convinced that your approach is correct. I have also checked if comparing structs returns the same results as the underlying fields. In some cases it does not: {noformat} scala> sql("select (2, 2) <> (2, cast(null as int)) as c1, 2 <> 1 or 2 <> cast(null as int) as c2").show +++ | c1| c2| +++ |true|true| +++ scala> sql("select (1, 2) <> (2, cast(null as int)) as c1, 1 <> 1 or 2 <> cast(null as int) as c2").show +++ | c1| c2| +++ |true|null| <-- Result for struct is wrong. +++ {noformat} We fortunately do not use this, but this is still a bug. > NOT IN subquery with more than one column may return incorrect results > -- > > Key: SPARK-19017 > URL: https://issues.apache.org/jira/browse/SPARK-19017 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Nattavut Sutyanyong > > When putting more than one column in the NOT IN, the query may not return > correctly if there is a null data. We can demonstrate the problem with the > following data set and query: > {code} > Seq((2,1)).toDF("a1","b1").createOrReplaceTempView("t1") > Seq[(java.lang.Integer,java.lang.Integer)]((1,null)).toDF("a2","b2").createOrReplaceTempView("t2") > sql("select * from t1 where (a1,b1) not in (select a2,b2 from t2)").show > +---+---+ > | a1| b1| > +---+---+ > +---+---+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19017) NOT IN subquery with more than one column may return incorrect results
[ https://issues.apache.org/jira/browse/SPARK-19017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15799335#comment-15799335 ] Nattavut Sutyanyong commented on SPARK-19017: - I also have the output from a MySQL system. Note that I changed the tuple in {{t2}} to {{(3, null)}} to avoid having the value {{1}} twice but it should not make any difference. {code} $ mysql -V mysql Ver 14.14 Distrib 5.1.73, for redhat-linux-gnu (x86_64) using readline 5.1 mysql> create table mydb.t1(t1a int, t1b int); Query OK, 0 rows affected (0.08 sec) mysql> create table mydb.t2(t2a int, t2b int); Query OK, 0 rows affected (0.11 sec) mysql> insert into mydb.t1 values(2,1); Query OK, 1 row affected (0.00 sec) mysql> insert into mydb.t2 values(3, null); Query OK, 1 row affected (0.00 sec) mysql> select * from mydb.t1 where(t1a, t1b) not in(select t2a, t2b from mydb.t2); +--+--+ | t1a | t1b | +--+--+ |2 |1 | +--+--+ 1 row in set (0.00 sec) {code} > NOT IN subquery with more than one column may return incorrect results > -- > > Key: SPARK-19017 > URL: https://issues.apache.org/jira/browse/SPARK-19017 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Nattavut Sutyanyong > > When putting more than one column in the NOT IN, the query may not return > correctly if there is a null data. We can demonstrate the problem with the > following data set and query: > {code} > Seq((2,1)).toDF("a1","b1").createOrReplaceTempView("t1") > Seq[(java.lang.Integer,java.lang.Integer)]((1,null)).toDF("a2","b2").createOrReplaceTempView("t2") > sql("select * from t1 where (a1,b1) not in (select a2,b2 from t2)").show > +---+---+ > | a1| b1| > +---+---+ > +---+---+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19017) NOT IN subquery with more than one column may return incorrect results
[ https://issues.apache.org/jira/browse/SPARK-19017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15799055#comment-15799055 ] Nattavut Sutyanyong commented on SPARK-19017: - I think we both agree that the result of the expression {{(2, 1) = (2, null)}} is unknown (null). It is the negation of this expression that is the centre of this discussion. If {{a = (2, 1)}} and {{b = (1, null)}}, what is the semantics of {{a <> b}}? You have suggested two formulas {{not(a.x1 = b.x1 and a.x2 = b.x2)}} and {{a.x1 <> b.x1 or a.x2 <> b.x2}} Would you agree that these two formulas are equivalent? The second formula is just the result of applying De Morgan's laws on the first formula, i.e., {{not (x and y) == (not x) or (not y)}}. How does Spark evaluate the formulas today? Here is my test: {code} scala> Seq(1).toDF("t1a").createOrReplaceTempView("t1") scala> sql("select t1a from t1").show +---+ |t1a| +---+ | 1| +---+ scala> sql("select t1a from t1 where not (2=1 and 1=null)").show +---+ |t1a| +---+ | 1| +---+ scala> sql("select t1a from t1 where 2<>1 or 1<>null").show +---+ |t1a| +---+ | 1| +---+ {code} > NOT IN subquery with more than one column may return incorrect results > -- > > Key: SPARK-19017 > URL: https://issues.apache.org/jira/browse/SPARK-19017 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Nattavut Sutyanyong > > When putting more than one column in the NOT IN, the query may not return > correctly if there is a null data. We can demonstrate the problem with the > following data set and query: > {code} > Seq((2,1)).toDF("a1","b1").createOrReplaceTempView("t1") > Seq[(java.lang.Integer,java.lang.Integer)]((1,null)).toDF("a2","b2").createOrReplaceTempView("t2") > sql("select * from t1 where (a1,b1) not in (select a2,b2 from t2)").show > +---+---+ > | a1| b1| > +---+---+ > +---+---+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19017) NOT IN subquery with more than one column may return incorrect results
[ https://issues.apache.org/jira/browse/SPARK-19017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15798758#comment-15798758 ] Herman van Hovell commented on SPARK-19017: --- Ok, my bad. Lets try this again. If I follow the NAAJ explanation in the section 6.1 of the Enhanced Subquery Optimizations in Oracle paper (http://www.vldb.org/pvldb/2/vldb09-423.pdf). Then I can rewrite the following query {{select * from tbl_a where a not in (select b from tbl_b)}} into the following (given that tbl_b has n elements): {noformat} SELECT * FROMTBL_a WHERE a <> b1 AND a <> b2 AND ... AND a <> bn {noformat} My basic thought here is that if we are comparing a complete tuple, for instance (2, 1), to an incomplete tuple, for instance (2, null), then the result should be unknown (null). Following this it would be obvious that a conjunctive predicate containing such an incomplete comparison should also evaluate to unknown. It now boils down to how we should compare the tuples; i.e. how is {{a <> b}} evaluated. Note that I do not have an answer here. The following two things could apply: * {{not(a.x1 = b.x1 and a.x2 = b.x2 and ... and a.xn = b.xn)}} - I would lean towards this option. * {{a.x1 <> b.x1 or a.x2 <> b.x2 or ... or a.xn <> b.xn)}} - Spark currently implements this option if you compare structs. > NOT IN subquery with more than one column may return incorrect results > -- > > Key: SPARK-19017 > URL: https://issues.apache.org/jira/browse/SPARK-19017 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Nattavut Sutyanyong > > When putting more than one column in the NOT IN, the query may not return > correctly if there is a null data. We can demonstrate the problem with the > following data set and query: > {code} > Seq((2,1)).toDF("a1","b1").createOrReplaceTempView("t1") > Seq[(java.lang.Integer,java.lang.Integer)]((1,null)).toDF("a2","b2").createOrReplaceTempView("t2") > sql("select * from t1 where (a1,b1) not in (select a2,b2 from t2)").show > +---+---+ > | a1| b1| > +---+---+ > +---+---+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19017) NOT IN subquery with more than one column may return incorrect results
[ https://issues.apache.org/jira/browse/SPARK-19017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15796839#comment-15796839 ] Apache Spark commented on SPARK-19017: -- User 'nsyca' has created a pull request for this issue: https://github.com/apache/spark/pull/16467 > NOT IN subquery with more than one column may return incorrect results > -- > > Key: SPARK-19017 > URL: https://issues.apache.org/jira/browse/SPARK-19017 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Nattavut Sutyanyong > > When putting more than one column in the NOT IN, the query may not return > correctly if there is a null data. We can demonstrate the problem with the > following data set and query: > {code} > Seq((2,1)).toDF("a1","b1").createOrReplaceTempView("t1") > Seq[(java.lang.Integer,java.lang.Integer)]((1,null)).toDF("a2","b2").createOrReplaceTempView("t2") > sql("select * from t1 where (a1,b1) not in (select a2,b2 from t2)").show > +---+---+ > | a1| b1| > +---+---+ > +---+---+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19017) NOT IN subquery with more than one column may return incorrect results
[ https://issues.apache.org/jira/browse/SPARK-19017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15785496#comment-15785496 ] Nattavut Sutyanyong commented on SPARK-19017: - In 3-value logic, true OR unknown = true. Using your formula above, we will have (2,1) NOT IN (1,null) evaluated as (2 <> 1) OR (1 <> null) which is true. > NOT IN subquery with more than one column may return incorrect results > -- > > Key: SPARK-19017 > URL: https://issues.apache.org/jira/browse/SPARK-19017 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Nattavut Sutyanyong > > When putting more than one column in the NOT IN, the query may not return > correctly if there is a null data. We can demonstrate the problem with the > following data set and query: > {code} > Seq((2,1)).toDF("a1","b1").createOrReplaceTempView("t1") > Seq[(java.lang.Integer,java.lang.Integer)]((1,null)).toDF("a2","b2").createOrReplaceTempView("t2") > sql("select * from t1 where (a1,b1) not in (select a2,b2 from t2)").show > +---+---+ > | a1| b1| > +---+---+ > +---+---+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19017) NOT IN subquery with more than one column may return incorrect results
[ https://issues.apache.org/jira/browse/SPARK-19017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15785494#comment-15785494 ] Nattavut Sutyanyong commented on SPARK-19017: - In 3-value logic, true OR unknown = true. Using your formula above, we will have (2,1) NOT IN (1,null) evaluated as (2 <> 1) OR (1 <> null) which is true. > NOT IN subquery with more than one column may return incorrect results > -- > > Key: SPARK-19017 > URL: https://issues.apache.org/jira/browse/SPARK-19017 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Nattavut Sutyanyong > > When putting more than one column in the NOT IN, the query may not return > correctly if there is a null data. We can demonstrate the problem with the > following data set and query: > {code} > Seq((2,1)).toDF("a1","b1").createOrReplaceTempView("t1") > Seq[(java.lang.Integer,java.lang.Integer)]((1,null)).toDF("a2","b2").createOrReplaceTempView("t2") > sql("select * from t1 where (a1,b1) not in (select a2,b2 from t2)").show > +---+---+ > | a1| b1| > +---+---+ > +---+---+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19017) NOT IN subquery with more than one column may return incorrect results
[ https://issues.apache.org/jira/browse/SPARK-19017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15785436#comment-15785436 ] Herman van Hovell commented on SPARK-19017: --- Ok, that is fair. Let me correct my mistake. {{NOT IN}} can be rewritten into, in to a sequence of NOT equals statements. Each statement contains one tuple of the subquery relation. So we would get something like: {noformat} WHERE (NOT (a1 = a2(1) AND b1 = b2(1))) AND (NOT (a1 = a2(2) AND b1 = b2(2))) AND ... AND (NOT (a1 = a2(n) AND b1 = b2(n))) {noformat} Which can be rewritten into: {noformat} WHERE (a1 <> a2(1) OR b1 <> b2(1)) AND (a1 <> a2(2) OR b1 <> b2(2)) AND ... AND (a1 <> a2(n) OR b1 <> b2(n)) {noformat} This would evaluate to null if one of the tuples in the subquery relation contains a null. > NOT IN subquery with more than one column may return incorrect results > -- > > Key: SPARK-19017 > URL: https://issues.apache.org/jira/browse/SPARK-19017 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Nattavut Sutyanyong > > When putting more than one column in the NOT IN, the query may not return > correctly if there is a null data. We can demonstrate the problem with the > following data set and query: > {code} > Seq((2,1)).toDF("a1","b1").createOrReplaceTempView("t1") > Seq[(java.lang.Integer,java.lang.Integer)]((1,null)).toDF("a2","b2").createOrReplaceTempView("t2") > sql("select * from t1 where (a1,b1) not in (select a2,b2 from t2)").show > +---+---+ > | a1| b1| > +---+---+ > +---+---+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19017) NOT IN subquery with more than one column may return incorrect results
[ https://issues.apache.org/jira/browse/SPARK-19017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15785417#comment-15785417 ] Nattavut Sutyanyong commented on SPARK-19017: - Using your interpretation, (2,1) not in (2,0) would be evaluated to false. Spark returns (2,1). So do many other SQL engines. > NOT IN subquery with more than one column may return incorrect results > -- > > Key: SPARK-19017 > URL: https://issues.apache.org/jira/browse/SPARK-19017 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Nattavut Sutyanyong > > When putting more than one column in the NOT IN, the query may not return > correctly if there is a null data. We can demonstrate the problem with the > following data set and query: > {code} > Seq((2,1)).toDF("a1","b1").createOrReplaceTempView("t1") > Seq[(java.lang.Integer,java.lang.Integer)]((1,null)).toDF("a2","b2").createOrReplaceTempView("t2") > sql("select * from t1 where (a1,b1) not in (select a2,b2 from t2)").show > +---+---+ > | a1| b1| > +---+---+ > +---+---+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19017) NOT IN subquery with more than one column may return incorrect results
[ https://issues.apache.org/jira/browse/SPARK-19017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15783937#comment-15783937 ] Herman van Hovell commented on SPARK-19017: --- [~nsyca] Why is this incorrect? If I rewrite the NOT IN into a WHERE statement this would become: {noformat} select * from t1 where (a1 <> 1 AND b1 <> NULL) {noformat} There WHERE would evaluate to NULL, and it would never return a result. > NOT IN subquery with more than one column may return incorrect results > -- > > Key: SPARK-19017 > URL: https://issues.apache.org/jira/browse/SPARK-19017 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Nattavut Sutyanyong > > When putting more than one column in the NOT IN, the query may not return > correctly if there is a null data. We can demonstrate the problem with the > following data set and query: > {code} > Seq((2,1)).toDF("a1","b1").createOrReplaceTempView("t1") > Seq[(java.lang.Integer,java.lang.Integer)]((1,null)).toDF("a2","b2").createOrReplaceTempView("t2") > sql("select * from t1 where (a1,b1) not in (select a2,b2 from t2)").show > +---+---+ > | a1| b1| > +---+---+ > +---+---+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19017) NOT IN subquery with more than one column may return incorrect results
[ https://issues.apache.org/jira/browse/SPARK-19017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15783653#comment-15783653 ] Nattavut Sutyanyong commented on SPARK-19017: - The semantics of the NOT IN for multiple columns T1(a1, b1, ... ) NOT IN T2(a2, b2, ...) is # For any rows of T1 if a1 <> ALL (T2.a2), those rows are returned. # For any rows of T1 if a1 = ANY (T2.a2), take the qualified rows from T1 and T2 and compare the values from the next pair of columns with the similar condition in 1. -- if b1 <> ALL (T2.b2), those rows are returned. # Repeat the steps until the last pair in the column list. > NOT IN subquery with more than one column may return incorrect results > -- > > Key: SPARK-19017 > URL: https://issues.apache.org/jira/browse/SPARK-19017 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Nattavut Sutyanyong > > When putting more than one column in the NOT IN, the query may not return > correctly if there is a null data. We can demonstrate the problem with the > following data set and query: > {code} > Seq((2,1)).toDF("a1","b1").createOrReplaceTempView("t1") > Seq[(java.lang.Integer,java.lang.Integer)]((1,null)).toDF("a2","b2").createOrReplaceTempView("t2") > sql("select * from t1 where (a1,b1) not in (select a2,b2 from t2)").show > +---+---+ > | a1| b1| > +---+---+ > +---+---+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org