[jira] [Commented] (SPARK-19017) NOT IN subquery with more than one column may return incorrect results

2017-01-05 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15801624#comment-15801624
 ] 

Nattavut Sutyanyong commented on SPARK-19017:
-

One way to interpret the behaviour of the comparison of tuples is it operates 
strictly on 2-value logic of the comparison of each element of the tuples.

That is,

Case 1:
(a1, a2) = (b1, b2)

is interpreted as

IF (a1 = b1) is true THEN true ELSE false
AND
IF (a2 = b2) is true THEN true ELSE false

Case 2:
(a1, a2) <> (b1, b2)

is interpreted as a negation of the equality operator on the 2-value logic of 
each element.

IF (a1 = b1) is true THEN false ELSE true
OR
IF (a2 = b2) is true THEN false ELSE true


The NOT IN semantics is different that it preserves the 3-value logic in the 
comparison of each element.

I don't know that we should align the two usages to the same semantics or not. 

> NOT IN subquery with more than one column may return incorrect results
> --
>
> Key: SPARK-19017
> URL: https://issues.apache.org/jira/browse/SPARK-19017
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Nattavut Sutyanyong
>
> When putting more than one column in the NOT IN, the query may not return 
> correctly if there is a null data. We can demonstrate the problem with the 
> following data set and query:
> {code}
> Seq((2,1)).toDF("a1","b1").createOrReplaceTempView("t1")
> Seq[(java.lang.Integer,java.lang.Integer)]((1,null)).toDF("a2","b2").createOrReplaceTempView("t2")
> sql("select * from t1 where (a1,b1) not in (select a2,b2 from t2)").show
> +---+---+
> | a1| b1|
> +---+---+
> +---+---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19017) NOT IN subquery with more than one column may return incorrect results

2017-01-04 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15799388#comment-15799388
 ] 

Nattavut Sutyanyong commented on SPARK-19017:
-

On c1: (1 <> 2) or (2 <> null) => true or unknown => true
On c2: (1 <> 1) or (2 <> null) => false or unknown => unknown <-- This is 
correct, right?

In (P1 or P2), if P1 is true, we can short-circuit and conclude the result is 
true. In the latter case, if P1 is false, we have to continue evaluating P2 and 
since P2 is unknown, the result is unknown.

I may confess it took me a few iterations to work out the whole story as well.

> NOT IN subquery with more than one column may return incorrect results
> --
>
> Key: SPARK-19017
> URL: https://issues.apache.org/jira/browse/SPARK-19017
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Nattavut Sutyanyong
>
> When putting more than one column in the NOT IN, the query may not return 
> correctly if there is a null data. We can demonstrate the problem with the 
> following data set and query:
> {code}
> Seq((2,1)).toDF("a1","b1").createOrReplaceTempView("t1")
> Seq[(java.lang.Integer,java.lang.Integer)]((1,null)).toDF("a2","b2").createOrReplaceTempView("t2")
> sql("select * from t1 where (a1,b1) not in (select a2,b2 from t2)").show
> +---+---+
> | a1| b1|
> +---+---+
> +---+---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19017) NOT IN subquery with more than one column may return incorrect results

2017-01-04 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15799374#comment-15799374
 ] 

Herman van Hovell commented on SPARK-19017:
---

Thanks!

> NOT IN subquery with more than one column may return incorrect results
> --
>
> Key: SPARK-19017
> URL: https://issues.apache.org/jira/browse/SPARK-19017
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Nattavut Sutyanyong
>
> When putting more than one column in the NOT IN, the query may not return 
> correctly if there is a null data. We can demonstrate the problem with the 
> following data set and query:
> {code}
> Seq((2,1)).toDF("a1","b1").createOrReplaceTempView("t1")
> Seq[(java.lang.Integer,java.lang.Integer)]((1,null)).toDF("a2","b2").createOrReplaceTempView("t2")
> sql("select * from t1 where (a1,b1) not in (select a2,b2 from t2)").show
> +---+---+
> | a1| b1|
> +---+---+
> +---+---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19017) NOT IN subquery with more than one column may return incorrect results

2017-01-04 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15799369#comment-15799369
 ] 

Herman van Hovell commented on SPARK-19017:
---

I agree that they are equal.

It just seems weird to me that in some cases it is ok that the tuples in the 
subquery can have null values. That being said, I am convinced that your 
approach is correct.

I have also checked if comparing structs returns the same results as the 
underlying fields. In some cases it does not:
{noformat}
scala> sql("select (2, 2) <> (2, cast(null as int)) as c1, 2 <> 1 or 2 <> 
cast(null as int) as c2").show
+++
|  c1|  c2|
+++
|true|true|
+++

scala> sql("select (1, 2) <> (2, cast(null as int)) as c1, 1 <> 1 or 2 <> 
cast(null as int) as c2").show
+++
|  c1|  c2|
+++
|true|null| <-- Result for struct is wrong.
+++
{noformat}
We fortunately do not use this, but this is still a bug.

> NOT IN subquery with more than one column may return incorrect results
> --
>
> Key: SPARK-19017
> URL: https://issues.apache.org/jira/browse/SPARK-19017
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Nattavut Sutyanyong
>
> When putting more than one column in the NOT IN, the query may not return 
> correctly if there is a null data. We can demonstrate the problem with the 
> following data set and query:
> {code}
> Seq((2,1)).toDF("a1","b1").createOrReplaceTempView("t1")
> Seq[(java.lang.Integer,java.lang.Integer)]((1,null)).toDF("a2","b2").createOrReplaceTempView("t2")
> sql("select * from t1 where (a1,b1) not in (select a2,b2 from t2)").show
> +---+---+
> | a1| b1|
> +---+---+
> +---+---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19017) NOT IN subquery with more than one column may return incorrect results

2017-01-04 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15799335#comment-15799335
 ] 

Nattavut Sutyanyong commented on SPARK-19017:
-

I also have the output from a MySQL system. Note that I changed the tuple in 
{{t2}} to {{(3, null)}} to avoid having the value {{1}} twice but it should not 
make any difference.

{code}
$ mysql -V
mysql  Ver 14.14 Distrib 5.1.73, for redhat-linux-gnu (x86_64) using readline 
5.1

mysql> create table mydb.t1(t1a int, t1b int);
Query OK, 0 rows affected (0.08 sec)

mysql> create table mydb.t2(t2a int, t2b int);
Query OK, 0 rows affected (0.11 sec)

mysql> insert into mydb.t1 values(2,1);
Query OK, 1 row affected (0.00 sec)

mysql> insert into mydb.t2 values(3, null);
Query OK, 1 row affected (0.00 sec)

mysql> select * from mydb.t1 where(t1a, t1b) not in(select t2a, t2b from 
mydb.t2);
+--+--+
| t1a  | t1b  |
+--+--+
|2 |1 |
+--+--+
1 row in set (0.00 sec)
{code}

> NOT IN subquery with more than one column may return incorrect results
> --
>
> Key: SPARK-19017
> URL: https://issues.apache.org/jira/browse/SPARK-19017
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Nattavut Sutyanyong
>
> When putting more than one column in the NOT IN, the query may not return 
> correctly if there is a null data. We can demonstrate the problem with the 
> following data set and query:
> {code}
> Seq((2,1)).toDF("a1","b1").createOrReplaceTempView("t1")
> Seq[(java.lang.Integer,java.lang.Integer)]((1,null)).toDF("a2","b2").createOrReplaceTempView("t2")
> sql("select * from t1 where (a1,b1) not in (select a2,b2 from t2)").show
> +---+---+
> | a1| b1|
> +---+---+
> +---+---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19017) NOT IN subquery with more than one column may return incorrect results

2017-01-04 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15799055#comment-15799055
 ] 

Nattavut Sutyanyong commented on SPARK-19017:
-

I think we both agree that the result of the expression {{(2, 1) = (2, null)}} 
is unknown (null). It is the negation of this expression that is the centre of 
this discussion.

If {{a = (2, 1)}} and {{b = (1, null)}}, what is the semantics of {{a <> b}}? 
You have suggested two formulas

{{not(a.x1 = b.x1 and a.x2 = b.x2)}}

and

{{a.x1 <> b.x1 or a.x2 <> b.x2}}

Would you agree that these two formulas are equivalent? The second formula is 
just the result of applying De Morgan's laws on the first formula, i.e., {{not 
(x and y) == (not x) or (not y)}}.

How does Spark evaluate the formulas today? Here is my test:

{code}
scala> Seq(1).toDF("t1a").createOrReplaceTempView("t1")

scala> sql("select t1a from t1").show
+---+
|t1a|
+---+
|  1|
+---+

scala> sql("select t1a from t1 where not (2=1 and 1=null)").show
+---+
|t1a|
+---+
|  1|
+---+


scala> sql("select t1a from t1 where 2<>1 or 1<>null").show
+---+
|t1a|
+---+
|  1|
+---+
{code}


> NOT IN subquery with more than one column may return incorrect results
> --
>
> Key: SPARK-19017
> URL: https://issues.apache.org/jira/browse/SPARK-19017
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Nattavut Sutyanyong
>
> When putting more than one column in the NOT IN, the query may not return 
> correctly if there is a null data. We can demonstrate the problem with the 
> following data set and query:
> {code}
> Seq((2,1)).toDF("a1","b1").createOrReplaceTempView("t1")
> Seq[(java.lang.Integer,java.lang.Integer)]((1,null)).toDF("a2","b2").createOrReplaceTempView("t2")
> sql("select * from t1 where (a1,b1) not in (select a2,b2 from t2)").show
> +---+---+
> | a1| b1|
> +---+---+
> +---+---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19017) NOT IN subquery with more than one column may return incorrect results

2017-01-04 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15798758#comment-15798758
 ] 

Herman van Hovell commented on SPARK-19017:
---

Ok, my bad. Lets try this again.

If I follow the NAAJ explanation in the section 6.1 of the Enhanced Subquery 
Optimizations in Oracle paper 
(http://www.vldb.org/pvldb/2/vldb09-423.pdf). Then I can rewrite the 
following query {{select * from tbl_a where a not in (select b from tbl_b)}} 
into the following (given that tbl_b has n elements):
{noformat}
SELECT *
FROMTBL_a
WHERE  a <> b1 AND a <> b2 AND ... AND a <> bn
{noformat}

My basic thought here is that if we are comparing a complete tuple, for 
instance (2, 1), to an incomplete tuple, for instance (2, null), then the 
result should be unknown (null). Following this it would be obvious that a 
conjunctive predicate containing such an incomplete comparison should also 
evaluate to unknown.

It now boils down to how we should compare the tuples; i.e. how is {{a <> b}} 
evaluated. Note that I do not have an answer here. The following two things 
could apply:
* {{not(a.x1 = b.x1 and a.x2 = b.x2 and ... and a.xn = b.xn)}}  - I would lean 
towards this option.
* {{a.x1 <> b.x1 or a.x2 <> b.x2 or ... or a.xn <> b.xn)}} - Spark currently 
implements this option if you compare structs.




> NOT IN subquery with more than one column may return incorrect results
> --
>
> Key: SPARK-19017
> URL: https://issues.apache.org/jira/browse/SPARK-19017
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Nattavut Sutyanyong
>
> When putting more than one column in the NOT IN, the query may not return 
> correctly if there is a null data. We can demonstrate the problem with the 
> following data set and query:
> {code}
> Seq((2,1)).toDF("a1","b1").createOrReplaceTempView("t1")
> Seq[(java.lang.Integer,java.lang.Integer)]((1,null)).toDF("a2","b2").createOrReplaceTempView("t2")
> sql("select * from t1 where (a1,b1) not in (select a2,b2 from t2)").show
> +---+---+
> | a1| b1|
> +---+---+
> +---+---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19017) NOT IN subquery with more than one column may return incorrect results

2017-01-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15796839#comment-15796839
 ] 

Apache Spark commented on SPARK-19017:
--

User 'nsyca' has created a pull request for this issue:
https://github.com/apache/spark/pull/16467

> NOT IN subquery with more than one column may return incorrect results
> --
>
> Key: SPARK-19017
> URL: https://issues.apache.org/jira/browse/SPARK-19017
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Nattavut Sutyanyong
>
> When putting more than one column in the NOT IN, the query may not return 
> correctly if there is a null data. We can demonstrate the problem with the 
> following data set and query:
> {code}
> Seq((2,1)).toDF("a1","b1").createOrReplaceTempView("t1")
> Seq[(java.lang.Integer,java.lang.Integer)]((1,null)).toDF("a2","b2").createOrReplaceTempView("t2")
> sql("select * from t1 where (a1,b1) not in (select a2,b2 from t2)").show
> +---+---+
> | a1| b1|
> +---+---+
> +---+---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19017) NOT IN subquery with more than one column may return incorrect results

2016-12-29 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15785496#comment-15785496
 ] 

Nattavut Sutyanyong commented on SPARK-19017:
-

In 3-value logic, true OR unknown = true. Using your formula above, we will 
have (2,1) NOT IN (1,null) evaluated as (2 <> 1) OR (1 <> null) which is true.

> NOT IN subquery with more than one column may return incorrect results
> --
>
> Key: SPARK-19017
> URL: https://issues.apache.org/jira/browse/SPARK-19017
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Nattavut Sutyanyong
>
> When putting more than one column in the NOT IN, the query may not return 
> correctly if there is a null data. We can demonstrate the problem with the 
> following data set and query:
> {code}
> Seq((2,1)).toDF("a1","b1").createOrReplaceTempView("t1")
> Seq[(java.lang.Integer,java.lang.Integer)]((1,null)).toDF("a2","b2").createOrReplaceTempView("t2")
> sql("select * from t1 where (a1,b1) not in (select a2,b2 from t2)").show
> +---+---+
> | a1| b1|
> +---+---+
> +---+---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19017) NOT IN subquery with more than one column may return incorrect results

2016-12-29 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15785494#comment-15785494
 ] 

Nattavut Sutyanyong commented on SPARK-19017:
-

In 3-value logic, true OR unknown = true. Using your formula above, we will 
have (2,1) NOT IN (1,null) evaluated as (2 <> 1) OR (1 <> null) which is true.

> NOT IN subquery with more than one column may return incorrect results
> --
>
> Key: SPARK-19017
> URL: https://issues.apache.org/jira/browse/SPARK-19017
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Nattavut Sutyanyong
>
> When putting more than one column in the NOT IN, the query may not return 
> correctly if there is a null data. We can demonstrate the problem with the 
> following data set and query:
> {code}
> Seq((2,1)).toDF("a1","b1").createOrReplaceTempView("t1")
> Seq[(java.lang.Integer,java.lang.Integer)]((1,null)).toDF("a2","b2").createOrReplaceTempView("t2")
> sql("select * from t1 where (a1,b1) not in (select a2,b2 from t2)").show
> +---+---+
> | a1| b1|
> +---+---+
> +---+---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19017) NOT IN subquery with more than one column may return incorrect results

2016-12-29 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15785436#comment-15785436
 ] 

Herman van Hovell commented on SPARK-19017:
---

Ok, that is fair. Let me correct my mistake.

{{NOT IN}} can be rewritten into, in to a sequence of NOT equals statements. 
Each statement contains one tuple of the subquery relation. So we would get 
something like:
{noformat}
WHERE (NOT (a1 = a2(1) AND  b1 = b2(1))) AND (NOT (a1 = a2(2) AND  b1 = b2(2))) 
AND ... AND (NOT (a1 = a2(n) AND  b1 = b2(n)))
{noformat}
Which can be rewritten into:
{noformat}
WHERE  (a1 <> a2(1) OR  b1 <> b2(1)) AND (a1 <> a2(2) OR  b1 <> b2(2)) AND ... 
AND (a1 <> a2(n) OR  b1 <> b2(n))
{noformat}
This would evaluate to null if one of the tuples in the subquery relation 
contains a null.

> NOT IN subquery with more than one column may return incorrect results
> --
>
> Key: SPARK-19017
> URL: https://issues.apache.org/jira/browse/SPARK-19017
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Nattavut Sutyanyong
>
> When putting more than one column in the NOT IN, the query may not return 
> correctly if there is a null data. We can demonstrate the problem with the 
> following data set and query:
> {code}
> Seq((2,1)).toDF("a1","b1").createOrReplaceTempView("t1")
> Seq[(java.lang.Integer,java.lang.Integer)]((1,null)).toDF("a2","b2").createOrReplaceTempView("t2")
> sql("select * from t1 where (a1,b1) not in (select a2,b2 from t2)").show
> +---+---+
> | a1| b1|
> +---+---+
> +---+---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19017) NOT IN subquery with more than one column may return incorrect results

2016-12-29 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15785417#comment-15785417
 ] 

Nattavut Sutyanyong commented on SPARK-19017:
-

Using your interpretation, (2,1) not in (2,0) would be evaluated to false. 
Spark returns (2,1). So do many other SQL engines.


> NOT IN subquery with more than one column may return incorrect results
> --
>
> Key: SPARK-19017
> URL: https://issues.apache.org/jira/browse/SPARK-19017
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Nattavut Sutyanyong
>
> When putting more than one column in the NOT IN, the query may not return 
> correctly if there is a null data. We can demonstrate the problem with the 
> following data set and query:
> {code}
> Seq((2,1)).toDF("a1","b1").createOrReplaceTempView("t1")
> Seq[(java.lang.Integer,java.lang.Integer)]((1,null)).toDF("a2","b2").createOrReplaceTempView("t2")
> sql("select * from t1 where (a1,b1) not in (select a2,b2 from t2)").show
> +---+---+
> | a1| b1|
> +---+---+
> +---+---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19017) NOT IN subquery with more than one column may return incorrect results

2016-12-28 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15783937#comment-15783937
 ] 

Herman van Hovell commented on SPARK-19017:
---

[~nsyca] Why is this incorrect?

If I rewrite the NOT IN into a WHERE statement this would become:
{noformat}
select * from t1 where (a1 <> 1 AND b1 <> NULL)
{noformat}
There WHERE would evaluate to NULL, and it would never return a result.

> NOT IN subquery with more than one column may return incorrect results
> --
>
> Key: SPARK-19017
> URL: https://issues.apache.org/jira/browse/SPARK-19017
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Nattavut Sutyanyong
>
> When putting more than one column in the NOT IN, the query may not return 
> correctly if there is a null data. We can demonstrate the problem with the 
> following data set and query:
> {code}
> Seq((2,1)).toDF("a1","b1").createOrReplaceTempView("t1")
> Seq[(java.lang.Integer,java.lang.Integer)]((1,null)).toDF("a2","b2").createOrReplaceTempView("t2")
> sql("select * from t1 where (a1,b1) not in (select a2,b2 from t2)").show
> +---+---+
> | a1| b1|
> +---+---+
> +---+---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19017) NOT IN subquery with more than one column may return incorrect results

2016-12-28 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15783653#comment-15783653
 ] 

Nattavut Sutyanyong commented on SPARK-19017:
-

The semantics of the NOT IN for multiple columns T1(a1, b1, ... ) NOT IN T2(a2, 
b2, ...) is

# For any rows of T1 if a1 <> ALL (T2.a2), those rows are returned.
# For any rows of T1 if a1 = ANY (T2.a2), take the qualified rows from T1 and 
T2 and compare the values from the next pair of columns with the similar 
condition in 1. -- if b1 <> ALL (T2.b2), those rows are returned.
# Repeat the steps until the last pair in the column list.

> NOT IN subquery with more than one column may return incorrect results
> --
>
> Key: SPARK-19017
> URL: https://issues.apache.org/jira/browse/SPARK-19017
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Nattavut Sutyanyong
>
> When putting more than one column in the NOT IN, the query may not return 
> correctly if there is a null data. We can demonstrate the problem with the 
> following data set and query:
> {code}
> Seq((2,1)).toDF("a1","b1").createOrReplaceTempView("t1")
> Seq[(java.lang.Integer,java.lang.Integer)]((1,null)).toDF("a2","b2").createOrReplaceTempView("t2")
> sql("select * from t1 where (a1,b1) not in (select a2,b2 from t2)").show
> +---+---+
> | a1| b1|
> +---+---+
> +---+---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org