[jira] [Commented] (SPARK-22307) NOT condition working incorrectly

2017-10-20 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16213432#comment-16213432
 ] 

kevin yu commented on SPARK-22307:
--

This is correct behavior under the SQL standard, as Marco said. Your table has 
623 records: col1 evaluates to null for 617 of them, to true for 2, and to 
false for 4. So the filter on not(col1) returns 4. 
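
To make the arithmetic concrete, here is a minimal sketch in plain Scala (no 
Spark needed). It models each row's col1 result as an Option[Boolean], with 
None standing in for SQL null. The object and helper names are made up for 
illustration; only the three counts come from this issue.

```scala
object FilterCounts {
  // One entry per row: the value col1 evaluates to. None stands in for SQL NULL.
  val results: Seq[Option[Boolean]] =
    Seq.fill[Option[Boolean]](2)(Some(true)) ++
      Seq.fill[Option[Boolean]](4)(Some(false)) ++
      Seq.fill[Option[Boolean]](617)(None)

  // SQL NOT under three-valued logic: NULL stays NULL.
  def not3(v: Option[Boolean]): Option[Boolean] = v.map(!_)

  // A filter keeps a row only when its condition is exactly TRUE.
  def keep(v: Option[Boolean]): Boolean = v.contains(true)

  val matched: Int    = results.count(keep)               // rows kept by filter(col1)
  val notMatched: Int = results.count(v => keep(not3(v))) // rows kept by filter(not(col1))
  val unknown: Int    = results.count(_.isEmpty)          // rows dropped by both filters
}
```

Here matched is 2, notMatched is 4 and unknown is 617, which add up to the 623 
rows reported by ctg.count.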

> NOT condition working incorrectly
> -
>
> Key: SPARK-22307
> URL: https://issues.apache.org/jira/browse/SPARK-22307
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.1.1
>Reporter: Andrey Yakovenko
> Attachments: Catalog.json.gz
>
>
> Suggested test case: a table with x records, filtered by expression expr, 
> returns y records (< x), but not(expr) does not return x-y records. 
> Workaround: when(expr, false).otherwise(true) works fine.
> {code}
> val ctg = spark.sqlContext.read.json("/user/Catalog.json")
> scala> ctg.printSchema
> root
>  |-- Id: string (nullable = true)
>  |-- Name: string (nullable = true)
>  |-- Parent: struct (nullable = true)
>  |    |-- Id: string (nullable = true)
>  |    |-- Name: string (nullable = true)
>  |    |-- Parent: struct (nullable = true)
>  |    |    |-- Id: string (nullable = true)
>  |    |    |-- Name: string (nullable = true)
>  |    |    |-- Parent: struct (nullable = true)
>  |    |    |    |-- Id: string (nullable = true)
>  |    |    |    |-- Name: string (nullable = true)
>  |    |    |    |-- Parent: string (nullable = true)
>  |    |    |    |-- SKU: string (nullable = true)
>  |    |    |-- SKU: string (nullable = true)
>  |    |-- SKU: string (nullable = true)
>  |-- SKU: string (nullable = true)
> val col1 = expr("((((Id IN ('13MXIIAA4', '13MXIBAA4')) " +
>   "OR (Parent.Id IN ('13MXIIAA4', '13MXIBAA4'))) " +
>   "OR (Parent.Parent.Id IN ('13MXIIAA4', '13MXIBAA4'))) " +
>   "OR (Parent.Parent.Parent.Id IN ('13MXIIAA4', '13MXIBAA4')))")
> col1: org.apache.spark.sql.Column = ((((Id IN (13MXIIAA4, 13MXIBAA4)) OR 
> (Parent.Id IN (13MXIIAA4, 13MXIBAA4))) OR (Parent.Parent.Id IN (13MXIIAA4, 
> 13MXIBAA4))) OR (Parent.Parent.Parent.Id IN (13MXIIAA4, 13MXIBAA4)))
> scala> ctg.count
> res5: Long = 623
> scala> ctg.filter(col1).count
> res2: Long = 2
> scala> ctg.filter(not(col1)).count
> res3: Long = 4
> scala> ctg.filter(when(col1, false).otherwise(true)).count
> res4: Long = 621
> {code}
> The table is a hierarchy-like structure whose records have different numbers 
> of levels filled in. I suspect that, because the hierarchy is only partly 
> filled, the condition sometimes returns null/undefined/failed/NaN (neither 
> true nor false).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22307) NOT condition working incorrectly

2017-10-19 Thread Andrey Yakovenko (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211354#comment-16211354
 ] 

Andrey Yakovenko commented on SPARK-22307:
--

Id is not null for any record, but Parent.Id and Parent.Parent.Id could be 
null for some records. I was expecting that when Parent is null, 
Parent.Id IN (something) evaluates to null => false, and then 
not(Parent.Id IN (something)) => true. I'm not a guru in SQL standards, so you 
are probably right. 
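
For reference, the negation I was expecting can be sketched in plain Scala 
like this (no Spark involved). None stands in for SQL null, and the object and 
function names are hypothetical. In Spark the same idea could presumably be 
spelled not(coalesce(col1, lit(false))), using coalesce, lit and not from 
org.apache.spark.sql.functions; that spelling is my assumption and was not 
tested in this thread.

```scala
object NullSafeNot {
  // None stands in for SQL NULL.

  // Plain SQL NOT: NULL stays NULL, so a filter still drops the row.
  def not3(v: Option[Boolean]): Option[Boolean] = v.map(!_)

  // Null-safe negation: treat NULL as false first, then negate.
  // Same mapping as when(expr, false).otherwise(true): only TRUE becomes false.
  def nullSafeNot(v: Option[Boolean]): Boolean = !v.getOrElse(false)

  private val inputs: Seq[Option[Boolean]] = Seq(Some(true), Some(false), None)

  val plain: Seq[Option[Boolean]] = inputs.map(not3)     // NULL input stays NULL
  val safe: Seq[Boolean]          = inputs.map(nullSafeNot) // NULL input becomes true
}
```

With the null-safe version, both the false rows and the null rows pass the 
filter, which matches the 621 rows returned by the 
when(col1, false).otherwise(true) workaround.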




[jira] [Commented] (SPARK-22307) NOT condition working incorrectly

2017-10-19 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211239#comment-16211239
 ] 

Marco Gaido commented on SPARK-22307:
-

Have you checked whether the missing records contain null as the value of 
`col1`? If so, there is no bug and this is expected behavior according to the 
SQL standard: operations involving nulls generally evaluate to null, and a 
null condition is treated as false by a filter. Thus the null rows are 
correctly filtered out in both cases.
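
As a rough illustration of how a null can come out of an OR of IN checks, here 
is a small plain-Scala model of SQL three-valued logic (no Spark required). 
None stands in for SQL null; the names in3 and or3 and the sample Id values 
other than the two from the issue are hypothetical.

```scala
object KleeneOr {
  // x IN (values): NULL when x itself is NULL, otherwise a plain membership test.
  // (Assumes the value list contains no NULLs, as in the issue's expression.)
  def in3(x: Option[String], values: Set[String]): Option[Boolean] =
    x.map(values.contains)

  // SQL OR: TRUE if either side is TRUE, FALSE only if both are FALSE, else NULL.
  def or3(a: Option[Boolean], b: Option[Boolean]): Option[Boolean] = (a, b) match {
    case (Some(true), _) | (_, Some(true)) => Some(true)
    case (Some(false), Some(false))        => Some(false)
    case _                                 => None
  }

  val wanted: Set[String] = Set("13MXIIAA4", "13MXIBAA4")

  // Id does not match and Parent.Id is NULL: FALSE OR NULL is NULL.
  val missingParent: Option[Boolean] = or3(in3(Some("OTHER"), wanted), in3(None, wanted))

  // Both levels present and neither matches: FALSE OR FALSE is FALSE.
  val bothPresent: Option[Boolean] =
    or3(in3(Some("OTHER"), wanted), in3(Some("ALSO_OTHER"), wanted))

  // Id matches: TRUE OR NULL is TRUE, even though Parent.Id is NULL.
  val matchWithNull: Option[Boolean] = or3(in3(Some("13MXIIAA4"), wanted), in3(None, wanted))
}
```

Under this model a record yields false only when every level in the chain is 
filled in and none matches, which would be consistent with so few rows passing 
the not(col1) filter. Note that a null operand does not always produce null: 
TRUE OR NULL is still TRUE.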
