[jira] [Updated] (SPARK-25914) Separate projection from grouping and aggregate in logical Aggregate

2020-12-07 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-25914:
-
Target Version/s:   (was: 3.0.0)

> Separate projection from grouping and aggregate in logical Aggregate
> 
>
> Key: SPARK-25914
> URL: https://issues.apache.org/jira/browse/SPARK-25914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wei Xue
>Priority: Major
>
> Currently the Spark SQL logical Aggregate has two expression fields: 
> {{groupingExpressions}} and {{aggregateExpressions}}, in which 
> {{aggregateExpressions}} is actually the result expressions, or in other 
> words, the project list in the SELECT clause.
>   
>  This would cause an exception while processing the following query:
> {code:java}
> SELECT concat('x', concat(a, 's'))
> FROM testData2
> GROUP BY concat(a, 's'){code}
>  After optimization, the query becomes:
> {code:java}
> SELECT concat('x', a, 's')
> FROM testData2
> GROUP BY concat(a, 's'){code}
> The optimization rule {{CombineConcats}} optimizes the expressions by 
> flattening "concat" and causes the query to fail since the expression 
> {{concat('x', a, 's')}} in the SELECT clause is neither referencing a 
> grouping expression nor a aggregate expression.
>   
>  The problem is that we try to mix two operations in one operator, and worse, 
> in one field: the group-and-aggregate operation and the project operation. 
> There are two ways to solve this problem:
>  1. Break the two operations into two logical operators, which means a 
> group-by query can usually be mapped into a Project-over-Aggregate pattern.
>  2. Break the two operations into multiple fields in the Aggregate operator, 
> the same way we do for physical aggregate classes (e.g., 
> {{HashAggregateExec}}, or {{SortAggregateExec}}). Thus, 
> {{groupingExpressions}} would still be the expressions from the GROUP BY 
> clause (as before), but {{aggregateExpressions}} would contain aggregate 
> functions only, and {{resultExpressions}} would be the project list in the 
> SELECT clause holding references to either {{groupingExpressions}} or 
> {{aggregateExpressions}}.
>   
>  I would say option 1 is even clearer, but it would be more likely to break 
> the pattern matching in existing optimization rules and thus require more 
> changes in the compiler. So we'd probably wanna go with option 2. That said, 
> I suggest we achieve this goal through two iterative steps:
>   
>  Phase 1: Keep the current fields of logical Aggregate as 
> {{groupingExpressions}} and {{aggregateExpressions}}, but change the 
> semantics of {{aggregateExpressions}} by replacing the grouping expressions 
> with corresponding references to expressions in {{groupingExpressions}}. The 
> aggregate expressions in  {{aggregateExpressions}} will remain the same.
>   
>  Phase 2: Add {{resultExpressions}} for the project list, and keep only 
> aggregate expressions in {{aggregateExpressions}}.
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25914) Separate projection from grouping and aggregate in logical Aggregate

2018-11-11 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25914:

Target Version/s: 3.0.0

> Separate projection from grouping and aggregate in logical Aggregate
> 
>
> Key: SPARK-25914
> URL: https://issues.apache.org/jira/browse/SPARK-25914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maryann Xue
>Assignee: Dilip Biswal
>Priority: Major
>
> Currently the Spark SQL logical Aggregate has two expression fields: 
> {{groupingExpressions}} and {{aggregateExpressions}}, in which 
> {{aggregateExpressions}} is actually the result expressions, or in other 
> words, the project list in the SELECT clause.
>   
>  This would cause an exception while processing the following query:
> {code:java}
> SELECT concat('x', concat(a, 's'))
> FROM testData2
> GROUP BY concat(a, 's'){code}
>  After optimization, the query becomes:
> {code:java}
> SELECT concat('x', a, 's')
> FROM testData2
> GROUP BY concat(a, 's'){code}
> The optimization rule {{CombineConcats}} optimizes the expressions by 
> flattening "concat" and causes the query to fail since the expression 
> {{concat('x', a, 's')}} in the SELECT clause is neither referencing a 
> grouping expression nor a aggregate expression.
>   
>  The problem is that we try to mix two operations in one operator, and worse, 
> in one field: the group-and-aggregate operation and the project operation. 
> There are two ways to solve this problem:
>  1. Break the two operations into two logical operators, which means a 
> group-by query can usually be mapped into a Project-over-Aggregate pattern.
>  2. Break the two operations into multiple fields in the Aggregate operator, 
> the same way we do for physical aggregate classes (e.g., 
> {{HashAggregateExec}}, or {{SortAggregateExec}}). Thus, 
> {{groupingExpressions}} would still be the expressions from the GROUP BY 
> clause (as before), but {{aggregateExpressions}} would contain aggregate 
> functions only, and {{resultExpressions}} would be the project list in the 
> SELECT clause holding references to either {{groupingExpressions}} or 
> {{aggregateExpressions}}.
>   
>  I would say option 1 is even clearer, but it would be more likely to break 
> the pattern matching in existing optimization rules and thus require more 
> changes in the compiler. So we'd probably wanna go with option 2. That said, 
> I suggest we achieve this goal through two iterative steps:
>   
>  Phase 1: Keep the current fields of logical Aggregate as 
> {{groupingExpressions}} and {{aggregateExpressions}}, but change the 
> semantics of {{aggregateExpressions}} by replacing the grouping expressions 
> with corresponding references to expressions in {{groupingExpressions}}. The 
> aggregate expressions in  {{aggregateExpressions}} will remain the same.
>   
>  Phase 2: Add {{resultExpressions}} for the project list, and keep only 
> aggregate expressions in {{aggregateExpressions}}.
>   



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25914) Separate projection from grouping and aggregate in logical Aggregate

2018-11-11 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-25914:
-
Target Version/s:   (was: 3.0.0)

> Separate projection from grouping and aggregate in logical Aggregate
> 
>
> Key: SPARK-25914
> URL: https://issues.apache.org/jira/browse/SPARK-25914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maryann Xue
>Assignee: Dilip Biswal
>Priority: Major
>
> Currently the Spark SQL logical Aggregate has two expression fields: 
> {{groupingExpressions}} and {{aggregateExpressions}}, in which 
> {{aggregateExpressions}} is actually the result expressions, or in other 
> words, the project list in the SELECT clause.
>   
>  This would cause an exception while processing the following query:
> {code:java}
> SELECT concat('x', concat(a, 's'))
> FROM testData2
> GROUP BY concat(a, 's'){code}
>  After optimization, the query becomes:
> {code:java}
> SELECT concat('x', a, 's')
> FROM testData2
> GROUP BY concat(a, 's'){code}
> The optimization rule {{CombineConcats}} optimizes the expressions by 
> flattening "concat" and causes the query to fail since the expression 
> {{concat('x', a, 's')}} in the SELECT clause is neither referencing a 
> grouping expression nor a aggregate expression.
>   
>  The problem is that we try to mix two operations in one operator, and worse, 
> in one field: the group-and-aggregate operation and the project operation. 
> There are two ways to solve this problem:
>  1. Break the two operations into two logical operators, which means a 
> group-by query can usually be mapped into a Project-over-Aggregate pattern.
>  2. Break the two operations into multiple fields in the Aggregate operator, 
> the same way we do for physical aggregate classes (e.g., 
> {{HashAggregateExec}}, or {{SortAggregateExec}}). Thus, 
> {{groupingExpressions}} would still be the expressions from the GROUP BY 
> clause (as before), but {{aggregateExpressions}} would contain aggregate 
> functions only, and {{resultExpressions}} would be the project list in the 
> SELECT clause holding references to either {{groupingExpressions}} or 
> {{aggregateExpressions}}.
>   
>  I would say option 1 is even clearer, but it would be more likely to break 
> the pattern matching in existing optimization rules and thus require more 
> changes in the compiler. So we'd probably wanna go with option 2. That said, 
> I suggest we achieve this goal through two iterative steps:
>   
>  Phase 1: Keep the current fields of logical Aggregate as 
> {{groupingExpressions}} and {{aggregateExpressions}}, but change the 
> semantics of {{aggregateExpressions}} by replacing the grouping expressions 
> with corresponding references to expressions in {{groupingExpressions}}. The 
> aggregate expressions in  {{aggregateExpressions}} will remain the same.
>   
>  Phase 2: Add {{resultExpressions}} for the project list, and keep only 
> aggregate expressions in {{aggregateExpressions}}.
>   



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25914) Separate projection from grouping and aggregate in logical Aggregate

2018-11-01 Thread Maryann Xue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maryann Xue updated SPARK-25914:

Description: 
Currently the Spark SQL logical Aggregate has two expression fields: 
{{groupingExpressions}} and {{aggregateExpressions}}, in which 
{{aggregateExpressions}} is actually the result expressions, or in other words, 
the project list in the SELECT clause.
  
 This would cause an exception while processing the following query:
{code:java}
SELECT concat('x', concat(a, 's'))
FROM testData2
GROUP BY concat(a, 's'){code}
 After optimization, the query becomes:
{code:java}
SELECT concat('x', a, 's')
FROM testData2
GROUP BY concat(a, 's'){code}
The optimization rule {{CombineConcats}} optimizes the expressions by 
flattening "concat" and causes the query to fail since the expression 
{{concat('x', a, 's')}} in the SELECT clause is neither referencing a grouping 
expression nor a aggregate expression.
  
 The problem is that we try to mix two operations in one operator, and worse, 
in one field: the group-and-aggregate operation and the project operation. 
There are two ways to solve this problem:
 1. Break the two operations into two logical operators, which means a group-by 
query can usually be mapped into a Project-over-Aggregate pattern.
 2. Break the two operations into multiple fields in the Aggregate operator, 
the same way we do for physical aggregate classes (e.g., {{HashAggregateExec}}, 
or {{SortAggregateExec}}). Thus, {{groupingExpressions}} would still be the 
expressions from the GROUP BY clause (as before), but {{aggregateExpressions}} 
would contain aggregate functions only, and {{resultExpressions}} would be the 
project list in the SELECT clause holding references to either 
{{groupingExpressions}} or {{aggregateExpressions}}.
  
 I would say option 1 is even clearer, but it would be more likely to break the 
pattern matching in existing optimization rules and thus require more changes 
in the compiler. So we'd probably wanna go with option 2. That said, I suggest 
we achieve this goal through two iterative steps:
  
 Phase 1: Keep the current fields of logical Aggregate as 
{{groupingExpressions}} and {{aggregateExpressions}}, but change the semantics 
of {{aggregateExpressions}} by replacing the grouping expressions with 
corresponding references to expressions in {{groupingExpressions}}. The 
aggregate expressions in  {{aggregateExpressions}} will remain the same.
  
 Phase 2: Add {{resultExpressions}} for the project list, and keep only 
aggregate expressions in {{aggregateExpressions}}.
  

  was:
Currently the Spark SQL logical Aggregate has two expression fields: 
{{groupingExpressions}} and {{aggregateExpressions}}, in which 
{{aggregateExpressions}} is actually the result expressions, or in other words, 
the project list in the SELECT clause.
  
 This would cause an exception while processing the following query:
{code:java}
SELECT concat('x', concat(a, 's'))
FROM testData2
GROUP BY concat(a, 's'){code}
 After optimization, the query becomes:
{code:java}
SELECT concat('x', a, 's')
FROM testData2
GROUP BY concat(a, 's'){code}
The optimization rule {{CombineConcats}} optimizes the expressions by 
flattening "concat" and causes the query to fail since the expression 
{{concat('x', a, 's')}} in the SELECT clause is neither referencing a grouping 
expression nor a aggregate expression.
  
 The problem is that we try to mix two operations in one operator, and worse, 
in one field: the group-and-aggregate operation and the project operation. 
There are two ways to solve this problem:
 1. Break the two operations into two logical operators, which means a group-by 
query can usually to be mapped into a Project-over-Aggregate pattern.
 2. Break the two operations into multiple fields in the Aggregate operator, 
the same way we do for physical aggregate classes (e.g., {{HashAggregateExec}}, 
or {{SortAggregateExec}}). Thus, {{groupingExpressions}} would still be the 
expressions from the GROUP BY clause (as before), but {{aggregateExpressions}} 
would contain aggregate functions only, and {{resultExpressions}} would be the 
project list in the SELECT clause holding references to either 
{{groupingExpressions}} or {{aggregateExpressions}}.
  
 I would say option 1 is even clearer, but it would be more likely to break the 
pattern matching in existing optimization rules and thus require more changes 
in the compiler. So we'd probably wanna go with option 2. That said, I suggest 
we achieve this goal through two iterative steps:
  
 Phase 1: Keep the current fields of logical Aggregate as 
{{groupingExpressions}} and {{aggregateExpressions}}, but change the semantics 
of {{aggregateExpressions}} by replacing the grouping expressions with 
corresponding references to expressions in {{groupingExpressions}}. The 
aggregate expressions in  {{aggregateExpressions}} will remain the same.
  
 Phase 

[jira] [Updated] (SPARK-25914) Separate projection from grouping and aggregate in logical Aggregate

2018-11-01 Thread Maryann Xue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maryann Xue updated SPARK-25914:

Description: 
Currently the Spark SQL logical Aggregate has two expression fields: 
{{groupingExpressions}} and {{aggregateExpressions}}, in which 
{{aggregateExpressions}} is actually the result expressions, or in other words, 
the project list in the SELECT clause.
  
 This would cause an exception while processing the following query:
{code:java}
SELECT concat('x', concat(a, 's'))
FROM testData2
GROUP BY concat(a, 's'){code}
 After optimization, the query becomes:
{code:java}
SELECT concat('x', a, 's')
FROM testData2
GROUP BY concat(a, 's'){code}
The optimization rule {{CombineConcats}} optimizes the expressions by 
flattening "concat" and causes the query to fail since the expression 
{{concat('x', a, 's')}} in the SELECT clause is neither referencing a grouping 
expression nor a aggregate expression.
  
 The problem is that we try to mix two operations in one operator, and worse, 
in one field: the group-and-aggregate operation and the project operation. 
There are two ways to solve this problem:
 1. Break the two operations into two logical operators, which means a group-by 
query can usually to be mapped into a Project-over-Aggregate pattern.
 2. Break the two operations into multiple fields in the Aggregate operator, 
the same way we do for physical aggregate classes (e.g., {{HashAggregateExec}}, 
or {{SortAggregateExec}}). Thus, {{groupingExpressions}} would still be the 
expressions from the GROUP BY clause (as before), but {{aggregateExpressions}} 
would contain aggregate functions only, and {{resultExpressions}} would be the 
project list in the SELECT clause holding references to either 
{{groupingExpressions}} or {{aggregateExpressions}}.
  
 I would say option 1 is even clearer, but it would be more likely to break the 
pattern matching in existing optimization rules and thus require more changes 
in the compiler. So we'd probably wanna go with option 2. That said, I suggest 
we achieve this goal through two iterative steps:
  
 Phase 1: Keep the current fields of logical Aggregate as 
{{groupingExpressions}} and {{aggregateExpressions}}, but change the semantics 
of {{aggregateExpressions}} by replacing the grouping expressions with 
corresponding references to expressions in {{groupingExpressions}}. The 
aggregate expressions in  {{aggregateExpressions}} will remain the same.
  
 Phase 2: Add {{resultExpressions}} for the project list, and keep only 
aggregate expressions in {{aggregateExpressions}}.
  

  was:
Currently the Spark SQL logical Aggregate has two expression fields: 
{{groupingExpressions}} and {{aggregateExpressions}}, in which 
{{aggregateExpressions}} is actually the result expressions, or in other words, 
the project list in the SELECT clause.
 
This would cause an exception while processing the following query:
 
  SELECT concat('x', concat(a, 's'))
  FROM testData2
  GROUP BY concat(a, 's')
 
After optimization, the query becomes:
 
  SELECT concat('x', a, 's')
  FROM testData2
  GROUP BY concat(a, 's')
 
The optimization rule {{CombineConcats}} optimizes the expressions by 
flattening "concat" and causes the query to fail since the expression 
{{concat('x', a, 's')}} in the SELECT clause is neither referencing a grouping 
expression nor a aggregate expression.
 
The problem is that we try to mix two operations in one operator, and worse, in 
one field: the group-and-aggregate operation and the project operation. There 
are two ways to solve this problem:
1. Break the two operations into two logical operators, which means a group-by 
query can usually to be mapped into a Project-over-Aggregate pattern.
2. Break the two operations into multiple fields in the Aggregate operator, the 
same way we do for physical aggregate classes (e.g., {{HashAggregateExec}}, or 
{{SortAggregateExec}}). Thus, {{groupingExpressions}} would still be the 
expressions from the GROUP BY clause (as before), but {{aggregateExpressions}} 
would contain aggregate functions only, and {{resultExpressions}} would be the 
project list in the SELECT clause holding references to either 
{{groupingExpressions}} or {{aggregateExpressions}}.
 
I would say option 1 is even clearer, but it would be more likely to break the 
pattern matching in existing optimization rules and thus require more changes 
in the compiler. So we'd probably wanna go with option 2. That said, I suggest 
we achieve this goal through two iterative steps:
 
Phase 1: Keep the current fields of logical Aggregate as 
{{groupingExpressions}} and {{aggregateExpressions}}, but change the semantics 
of {{aggregateExpressions}} by replacing the grouping expressions with 
corresponding references to expressions in {{groupingExpressions}}. The 
aggregate expressions in  {{aggregateExpressions}} will remain the same.
 
Phase 2: Add