[jira] [Commented] (SPARK-12725) SQL generation suffers from name conflicts introduced by some analysis rules
[ https://issues.apache.org/jira/browse/SPARK-12725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15130481#comment-15130481 ]

Apache Spark commented on SPARK-12725:
--------------------------------------

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/11050

> SQL generation suffers from name conflicts introduced by some analysis rules
> ----------------------------------------------------------------------------
>
>        Key: SPARK-12725
>        URL: https://issues.apache.org/jira/browse/SPARK-12725
>    Project: Spark
> Issue Type: Sub-task
> Components: SQL
>   Reporter: Cheng Lian
>   Assignee: Xiao Li
>
> Some analysis rules generate auxiliary attribute references with the same
> name but different expression IDs. For example, {{ResolveAggregateFunctions}}
> introduces {{havingCondition}} and {{aggOrder}}, and
> {{DistinctAggregationRewriter}} introduces {{gid}}.
> This is OK for normal query execution, since these attribute references are
> distinguished by their expression IDs. However, it is troublesome when
> converting resolved query plans back to SQL query strings, since expression
> IDs are erased.
> Here's an example Spark 1.6.0 snippet for illustration:
> {code}
> sqlContext.range(10).select('id as 'a, 'id as 'b).registerTempTable("t")
> sqlContext.sql("SELECT SUM(a) FROM t GROUP BY a, b ORDER BY COUNT(a), COUNT(b)").explain(true)
> {code}
> The above code produces the following resolved plan:
> {noformat}
> == Analyzed Logical Plan ==
> _c0: bigint
> Project [_c0#101L]
> +- Sort [aggOrder#102L ASC,aggOrder#103L ASC], true
>    +- Aggregate [a#47L,b#48L], [(sum(a#47L),mode=Complete,isDistinct=false) AS _c0#101L,(count(a#47L),mode=Complete,isDistinct=false) AS aggOrder#102L,(count(b#48L),mode=Complete,isDistinct=false) AS aggOrder#103L]
>       +- Subquery t
>          +- Project [id#46L AS a#47L,id#46L AS b#48L]
>             +- LogicalRDD [id#46L], MapPartitionsRDD[44] at range at :26
> {noformat}
> Here we can see that both aggregate expressions in {{ORDER BY}} are extracted
> into an {{Aggregate}} operator, and both of them are named {{aggOrder}} with
> different expression IDs.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
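The collision described in the issue can be reproduced without Spark. The following is a minimal sketch, assuming simplified stand-in types: `ExprId`, `Attr`, and `toSqlName` are hypothetical names for illustration, not Catalyst's real classes or its SQL generation code.

```scala
// Simplified stand-ins for Catalyst's attribute types; these are NOT Spark's real classes.
final case class ExprId(id: Long)
final case class Attr(name: String, exprId: ExprId)

// Hypothetical SQL rendering: only the name survives, the expression ID is erased.
def toSqlName(attr: Attr): String = attr.name

// The two sort keys from the analyzed plan: same name, different expression IDs.
val sortKeys = Seq(Attr("aggOrder", ExprId(102L)), Attr("aggOrder", ExprId(103L)))

// Inside the plan the two attributes are distinct, because equality includes the ID.
val distinctInPlan = sortKeys.distinct.size

// Rendered as SQL identifiers, they collapse into a single ambiguous name.
val distinctInSql = sortKeys.map(toSqlName).distinct.size
```

With these definitions the two attributes stay distinct while they carry expression IDs, and become indistinguishable once only their names are emitted.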
[jira] [Commented] (SPARK-12725) SQL generation suffers from name conflicts introduced by some analysis rules
[ https://issues.apache.org/jira/browse/SPARK-12725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15127817#comment-15127817 ]

Xiao Li commented on SPARK-12725:
---------------------------------

Let me work on this first. I will submit a PR tomorrow. :) Thank you!
[jira] [Commented] (SPARK-12725) SQL generation suffers from name conflicts introduced by some analysis rules
[ https://issues.apache.org/jira/browse/SPARK-12725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15125754#comment-15125754 ]

Xiao Li commented on SPARK-12725:
---------------------------------

You are right. :)
[jira] [Commented] (SPARK-12725) SQL generation suffers from name conflicts introduced by some analysis rules
[ https://issues.apache.org/jira/browse/SPARK-12725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15125587#comment-15125587 ]

Cheng Lian commented on SPARK-12725:
------------------------------------

There are other analysis rules that may use generated attributes (e.g., {{DistinctAggregationRewriter}}). I think a generic approach is better than special-casing them one by one.
[jira] [Commented] (SPARK-12725) SQL generation suffers from name conflicts introduced by some analysis rules
[ https://issues.apache.org/jira/browse/SPARK-12725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15125252#comment-15125252 ]

Xiao Li commented on SPARK-12725:
---------------------------------

I have recently been working on a PR related to ResolveAggregateFunctions. Could we just change the rule ResolveAggregateFunctions so that it generates a unique alias name for each extracted expression? It would be a very simple fix, if it works. As [~lian cheng] said, we can put the expression ID in the name, since the generated name is not exposed to users. This idea is already used in the code.
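The rule-level fix proposed in this comment could look roughly like the sketch below. It is only an illustration under stated assumptions: `ExprIds`, `Alias`, and `uniqueAlias` are hypothetical simplified stand-ins, not the actual Catalyst definitions or the code in ResolveAggregateFunctions.

```scala
import java.util.concurrent.atomic.AtomicLong

// Hypothetical stand-in for an expression ID generator; not Spark's real one.
object ExprIds {
  private val counter = new AtomicLong(0L)
  def next(): Long = counter.getAndIncrement()
}

// Simplified alias: the aliased expression is kept as a plain string for illustration.
final case class Alias(expr: String, name: String, exprId: Long)

// Hypothetical helper: allocate an ID and embed it in the generated name,
// so the name stays unique even after expression IDs are erased.
def uniqueAlias(expr: String, prefix: String): Alias = {
  val id = ExprIds.next()
  Alias(expr, s"${prefix}_$id", id)
}

// The two ORDER BY aggregates from the issue, aliased by the same rule.
val orderAliases = Seq(
  uniqueAlias("count(a)", "aggOrder"),
  uniqueAlias("count(b)", "aggOrder"))

val aliasNames = orderAliases.map(_.name)
```

Because the name itself carries the ID, the two aliases no longer share an identifier. The trade-off raised later in the thread is that every rule introducing generated attributes would need the same change.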
[jira] [Commented] (SPARK-12725) SQL generation suffers from name conflicts introduced by some analysis rules
[ https://issues.apache.org/jira/browse/SPARK-12725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15122043#comment-15122043 ]

Cheng Lian commented on SPARK-12725:
------------------------------------

Thanks, this also sounds good to me. Will try this approach first.
[jira] [Commented] (SPARK-12725) SQL generation suffers from name conflicts introduced by some analysis rules
[ https://issues.apache.org/jira/browse/SPARK-12725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15122016#comment-15122016 ]

Michael Armbrust commented on SPARK-12725:
------------------------------------------

Why don't we just add a flag to AttributeReference to say whether it's generated? We have wanted that in the past anyway, since generated attributes should not be resolvable.
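A flag-based design along these lines might be sketched as follows. The `AttributeReference` below is a simplified model rather than Spark's actual class, and the field name `isGenerated`, the `sql` rendering, and the `resolve` helper are assumptions made for illustration.

```scala
// Simplified model of an attribute with a "generated" flag; not Spark's actual class.
final case class AttributeReference(name: String, exprId: Long, isGenerated: Boolean = false) {
  // Assumed rendering: generated attributes append their ID to stay unique in SQL text.
  def sql: String = if (isGenerated) s"${name}_$exprId" else name
}

// Hypothetical name resolution that never matches generated attributes.
def resolve(name: String, candidates: Seq[AttributeReference]): Option[AttributeReference] =
  candidates.find(a => !a.isGenerated && a.name == name)

val attrs = Seq(
  AttributeReference("a", 47L),
  AttributeReference("aggOrder", 102L, isGenerated = true),
  AttributeReference("aggOrder", 103L, isGenerated = true))

val resolvedA = resolve("a", attrs)
val resolvedAggOrder = resolve("aggOrder", attrs)
val sqlNames = attrs.map(_.sql)
```

In this sketch a user-written reference to `aggOrder` resolves to nothing, while the generated attributes still render to distinct identifiers.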
[jira] [Commented] (SPARK-12725) SQL generation suffers from name conflicts introduced by some analysis rules
[ https://issues.apache.org/jira/browse/SPARK-12725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15122013#comment-15122013 ]

Cheng Lian commented on SPARK-12725:
------------------------------------

One possible solution I was thinking about is to add a new {{Attribute}} class named {{GeneratedAttributeRef}}, which is exactly the same as {{AttributeReference}} except that its {{sql}} representation includes the expression ID (e.g. {{gid_42}} instead of {{gid}}). To avoid code duplication, we can extract the common code into an abstract class, say {{AbstractAttributeRef}}.

[~yhuai] [~rxin] [~marmbrus] What do you think?
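The class hierarchy proposed in this comment can be sketched in a few lines. The class names come from the comment itself, but the members shown here (`name`, `exprId`, `sql`) are a simplified assumption and do not reflect the real Catalyst signatures.

```scala
// Common code shared by both attribute kinds, as the comment suggests.
abstract class AbstractAttributeRef {
  def name: String
  def exprId: Long
  // Default SQL rendering: the bare name.
  def sql: String = name
}

// Ordinary attribute: renders as its name only.
final case class AttributeReference(name: String, exprId: Long) extends AbstractAttributeRef

// Generated attribute: identical except that its SQL form includes the expression ID.
final case class GeneratedAttributeRef(name: String, exprId: Long) extends AbstractAttributeRef {
  override def sql: String = s"${name}_$exprId"
}

val refs: Seq[AbstractAttributeRef] =
  Seq(AttributeReference("a", 47L), GeneratedAttributeRef("gid", 42L))

val rendered = refs.map(_.sql)
```

Compared with a flag on a single class, this variant encodes the distinction in the type, at the cost of an extra class that pattern matches elsewhere would need to handle.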