[jira] [Updated] (SPARK-31519) Cast in having aggregate expressions returns the wrong result

2020-04-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31519:
--
Affects Version/s: 2.2.3

> Cast in having aggregate expressions returns the wrong result
> -
>
> Key: SPARK-31519
> URL: https://issues.apache.org/jira/browse/SPARK-31519
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.3, 2.3.4, 2.4.5, 3.0.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.4.6, 3.0.0
>
>
> A CAST in an aggregate query with a HAVING clause returns the wrong result.
> See the tests below. The two queries differ only in the CAST, yet the second returns no rows:
> {code:java}
> scala> spark.sql("create temp view t(a, b) as values (1,10), (2, 20)")
> res0: org.apache.spark.sql.DataFrame = []
> scala> val query = """
>  | select sum(a) as b, '2020-01-01' as fake
>  | from t
>  | group by b
>  | having b > 10;"""
> scala> spark.sql(query).show()
> +---+----------+
> |  b|      fake|
> +---+----------+
> |  2|2020-01-01|
> +---+----------+
> scala> val query = """
>  | select sum(a) as b, cast('2020-01-01' as date) as fake
>  | from t
>  | group by b
>  | having b > 10;"""
> scala> spark.sql(query).show()
> +---+----+
> |  b|fake|
> +---+----+
> +---+----+
> {code}
> The SQL parser in Spark creates Filter(..., Aggregate(...)) for the HAVING 
> query, and Spark has a special analyzer rule ResolveAggregateFunctions to 
> resolve the aggregate functions and grouping columns in the Filter operator.
>  
> This works for simple cases, but only in a fragile way, because it relies on rule execution 
> order:
> 1. Rule ResolveReferences hits the Aggregate operator and resolves the attributes 
> inside aggregate functions, but the function itself is still unresolved because 
> it is an UnresolvedFunction. This stops the Filter operator from being resolved, as its 
> child Aggregate operator is still unresolved.
> 2. Rule ResolveFunctions resolves the UnresolvedFunction. This makes the Aggregate 
> operator resolved.
> 3. Rule ResolveAggregateFunctions resolves the Filter operator if its child 
> is a resolved Aggregate. This rule can correctly resolve the grouping columns.
>  
> In the example query, I added a CAST, which needs to be resolved by rule 
> ResolveTimeZone, which runs after ResolveAggregateFunctions. This breaks step 
> 3, as the Aggregate operator is still unresolved at that time. The analyzer then 
> starts the next round, and the Filter operator is resolved by ResolveReferences, 
> which wrongly resolves the grouping columns.
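
The ordering dependency described above can be sketched as a toy model in plain Scala. Nothing here is Spark code: the Plan case class, its fields, and the four rule functions are hypothetical stand-ins named after the analyzer rules in the description. The assumption that ResolveReferences binds the HAVING clause's b to the sum(a) alias is inferred from the empty result and is not stated in the report.

```scala
// Toy model only. The names mirror the rule names in the description,
// but none of this is Spark's actual analyzer code.
object RuleOrderSketch {
  // The parts of the Filter(Aggregate) plan state that matter for the bug.
  final case class Plan(
      attrsResolved: Boolean,    // attributes inside the Aggregate are resolved
      functionResolved: Boolean, // sum(a) is no longer an UnresolvedFunction
      castResolved: Boolean,     // CAST has a time zone (true when there is no CAST)
      having: Option[String]     // how HAVING's `b` was bound, if at all
  ) {
    def aggregateResolved: Boolean =
      attrsResolved && functionResolved && castResolved
  }

  type Rule = Plan => Plan

  // Resolves the Aggregate's attributes first. Only on a later visit, once the
  // Aggregate is resolved, does it bind HAVING's `b` (assumed: to the alias).
  val resolveReferences: Rule = p =>
    if (!p.attrsResolved) p.copy(attrsResolved = true)
    else if (p.aggregateResolved && p.having.isEmpty) p.copy(having = Some("alias sum(a)"))
    else p

  val resolveFunctions: Rule = p => p.copy(functionResolved = true)

  // Binds HAVING's `b` to the grouping column, but only if the child
  // Aggregate is already resolved when this rule runs.
  val resolveAggregateFunctions: Rule = p =>
    if (p.aggregateResolved && p.having.isEmpty) p.copy(having = Some("grouping column t.b"))
    else p

  val resolveTimeZone: Rule = p => p.copy(castResolved = true)

  // Rule order within one round, as described in the report.
  val batch: List[Rule] =
    List(resolveReferences, resolveFunctions, resolveAggregateFunctions, resolveTimeZone)

  // Run the batch repeatedly until the plan stops changing (fixed point).
  @annotation.tailrec
  def analyze(p: Plan): Plan = {
    val next = batch.foldLeft(p)((plan, rule) => rule(plan))
    if (next == p) p else analyze(next)
  }

  def main(args: Array[String]): Unit = {
    val withoutCast = Plan(attrsResolved = false, functionResolved = false,
      castResolved = true, having = None)
    val withCast = withoutCast.copy(castResolved = false)
    println(analyze(withoutCast).having) // Some(grouping column t.b)
    println(analyze(withCast).having)    // Some(alias sum(a))
  }
}
```

In this sketch the CAST delays the Aggregate's resolution past ResolveAggregateFunctions in the first round, so ResolveReferences gets to the Filter first in the second round. That is the ordering hazard the report describes.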



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31519) Cast in having aggregate expressions returns the wrong result

2020-04-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31519:
--
Priority: Blocker  (was: Major)

> Cast in having aggregate expressions returns the wrong result
> -
>
> Key: SPARK-31519
> URL: https://issues.apache.org/jira/browse/SPARK-31519
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.5, 3.0.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.4.6, 3.0.0
>
>



[jira] [Updated] (SPARK-31519) Cast in having aggregate expressions returns the wrong result

2020-04-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31519:
--
Affects Version/s: 2.3.4

> Cast in having aggregate expressions returns the wrong result
> -
>
> Key: SPARK-31519
> URL: https://issues.apache.org/jira/browse/SPARK-31519
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.5, 3.0.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
>  Labels: correctness
> Fix For: 2.4.6, 3.0.0
>
>



[jira] [Updated] (SPARK-31519) Cast in having aggregate expressions returns the wrong result

2020-04-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31519:
--
Fix Version/s: 2.4.6

> Cast in having aggregate expressions returns the wrong result
> -
>
> Key: SPARK-31519
> URL: https://issues.apache.org/jira/browse/SPARK-31519
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
>  Labels: correctness
> Fix For: 2.4.6, 3.0.0
>
>



[jira] [Updated] (SPARK-31519) Cast in having aggregate expressions returns the wrong result

2020-04-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31519:
--
Affects Version/s: 2.4.5

> Cast in having aggregate expressions returns the wrong result
> -
>
> Key: SPARK-31519
> URL: https://issues.apache.org/jira/browse/SPARK-31519
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
>  Labels: correctness
> Fix For: 2.4.6, 3.0.0
>
>



[jira] [Updated] (SPARK-31519) Cast in having aggregate expressions returns the wrong result

2020-04-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31519:
--
Labels: correctness  (was: )

> Cast in having aggregate expressions returns the wrong result
> -
>
> Key: SPARK-31519
> URL: https://issues.apache.org/jira/browse/SPARK-31519
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
>  Labels: correctness
> Fix For: 3.0.0
>
>