[jira] [Commented] (SPARK-17653) Optimizer should remove unnecessary distincts (in multiple unions)
[ https://issues.apache.org/jira/browse/SPARK-17653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15524175#comment-15524175 ]

Xiao Li commented on SPARK-17653:
---------------------------------

Since Simon already submitted the PR, I will not continue the investigation. Thanks for answering my original question.

> Optimizer should remove unnecessary distincts (in multiple unions)
> ------------------------------------------------------------------
>
>                 Key: SPARK-17653
>                 URL: https://issues.apache.org/jira/browse/SPARK-17653
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Reynold Xin
>
> Query:
> {code}
> select 1 a union select 2 b union select 3 c
> {code}
> Explain plan:
> {code}
> == Physical Plan ==
> *HashAggregate(keys=[a#13], functions=[])
> +- Exchange hashpartitioning(a#13, 200)
>    +- *HashAggregate(keys=[a#13], functions=[])
>       +- Union
>          :- *HashAggregate(keys=[a#13], functions=[])
>          :  +- Exchange hashpartitioning(a#13, 200)
>          :     +- *HashAggregate(keys=[a#13], functions=[])
>          :        +- Union
>          :           :- *Project [1 AS a#13]
>          :           :  +- Scan OneRowRelation[]
>          :           +- *Project [2 AS b#14]
>          :              +- Scan OneRowRelation[]
>          +- *Project [3 AS c#15]
>             +- Scan OneRowRelation[]
> {code}
> Only one distinct should be necessary. This makes a bunch of unions slower
> than a bunch of union alls followed by a distinct.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
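The claim in the issue description, that a chain of unions needs only one distinct, can be checked on a small scale. The sketch below is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for Spark SQL (an assumption, not what the issue ran), and it changes the third literal to 1 so that a duplicate actually occurs. It compares the chained UNION query with a rewrite that uses UNION ALL followed by a single DISTINCT.

```python
import sqlite3

# In-memory SQLite database, used here only as a stand-in for Spark SQL.
conn = sqlite3.connect(":memory:")

# Chained UNIONs: each UNION deduplicates on its own.
chained = "SELECT 1 AS a UNION SELECT 2 AS b UNION SELECT 1 AS c"

# Same inputs combined with UNION ALL, deduplicated once at the top.
single = (
    "SELECT DISTINCT a FROM "
    "(SELECT 1 AS a UNION ALL SELECT 2 AS b UNION ALL SELECT 1 AS c)"
)


def rows(sql):
    """Run a query and return its rows in sorted order for comparison."""
    return sorted(conn.execute(sql).fetchall())


chained_rows = rows(chained)
single_rows = rows(single)
print(chained_rows == single_rows)
```

Both forms return the same set of rows, which is why the intermediate distincts in the plan above are redundant work.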
[jira] [Commented] (SPARK-17653) Optimizer should remove unnecessary distincts (in multiple unions)
[ https://issues.apache.org/jira/browse/SPARK-17653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15521950#comment-15521950 ]

Apache Spark commented on SPARK-17653:
--------------------------------------

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/15238
[jira] [Commented] (SPARK-17653) Optimizer should remove unnecessary distincts (in multiple unions)
[ https://issues.apache.org/jira/browse/SPARK-17653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15519953#comment-15519953 ]

Xiao Li commented on SPARK-17653:
---------------------------------

Yeah. You are right. It does not work. :)
[jira] [Commented] (SPARK-17653) Optimizer should remove unnecessary distincts (in multiple unions)
[ https://issues.apache.org/jira/browse/SPARK-17653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15519950#comment-15519950 ]

Xiao Li commented on SPARK-17653:
---------------------------------

I see. After rethinking it, Union is special. My PR is not applicable to it. We are unable to eliminate the Distinct in this pattern.

I think what you said is correct. We can do it for UNION. Do you want me to try it? Or has somebody else already started it? Thanks!

BTW, in traditional RDBMSs, many optimizer rules are based on unique constraints. However, Spark SQL does not have the concept of primary keys or unique constraints. If we allow users to specify unique constraints using hints, we could further optimize the plan and the execution. Do you think adding such a hint to Spark SQL is OK?
[jira] [Commented] (SPARK-17653) Optimizer should remove unnecessary distincts (in multiple unions)
[ https://issues.apache.org/jira/browse/SPARK-17653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15519923#comment-15519923 ]

Reynold Xin commented on SPARK-17653:
-------------------------------------

[~smilegator] - I just took a quick look at #11930. It looks to me like it mainly propagates the uniqueness property up the plan. In this case we want to remove distincts down a subtree. How would it work in your case?
[jira] [Commented] (SPARK-17653) Optimizer should remove unnecessary distincts (in multiple unions)
[ https://issues.apache.org/jira/browse/SPARK-17653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15519710#comment-15519710 ]

Reynold Xin commented on SPARK-17653:
-------------------------------------

There are different ways to fix this, from fairly general ones to more surgical ones. The most surgical fix I can think of is to just match a bunch of Distinct(Union(Distinct(Union(...)))) and combine them into a single Distinct(Union(...)).

If the more general fix is simple enough, that could be a good idea too.

cc [~vssrinath]
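The surgical fix described in this comment can be sketched on a toy plan tree. This is not Spark's Catalyst code and not the rule from any pull request. The classes `Relation`, `Union`, and `Distinct` and the function `combine_distinct_unions` are hypothetical names invented for illustration, and `Union` here means UNION ALL (no deduplication), on the assumption that a SQL UNION is represented as a Distinct over a Union.

```python
from dataclasses import dataclass
from typing import Tuple


# Hypothetical toy plan nodes; these are NOT Spark's Catalyst classes.
@dataclass(frozen=True)
class Relation:
    name: str


@dataclass(frozen=True)
class Union:
    # UNION ALL semantics: concatenates children without deduplicating.
    children: Tuple[object, ...]


@dataclass(frozen=True)
class Distinct:
    child: object


def _flatten(children):
    """Inline nested Union and Distinct(Union) children.

    Only called for children that sit under a Distinct(Union(...)),
    where the outer Distinct makes inner deduplication redundant.
    """
    out = []
    for c in children:
        if isinstance(c, Distinct) and isinstance(c.child, Union):
            out.extend(_flatten(c.child.children))
        elif isinstance(c, Union):
            out.extend(_flatten(c.children))
        else:
            out.append(combine_distinct_unions(c))
    return out


def combine_distinct_unions(plan):
    """Collapse Distinct(Union(Distinct(Union(...)))) into one Distinct(Union(...))."""
    if isinstance(plan, Distinct) and isinstance(plan.child, Union):
        return Distinct(Union(tuple(_flatten(plan.child.children))))
    if isinstance(plan, Distinct):
        return Distinct(combine_distinct_unions(plan.child))
    if isinstance(plan, Union):
        # No Distinct on top: inner Distincts change the result, so keep them.
        return Union(tuple(combine_distinct_unions(c) for c in plan.children))
    return plan


# Toy version of: select 1 a union select 2 b union select 3 c
query = Distinct(Union((
    Distinct(Union((Relation("a"), Relation("b")))),
    Relation("c"),
)))
rewritten = combine_distinct_unions(query)
print(rewritten)
```

The inner Distinct is dropped only when a Distinct sits directly above the enclosing Union. Under a bare Union (UNION ALL) the inner Distinct is kept, because removing it there could change the result.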
[jira] [Commented] (SPARK-17653) Optimizer should remove unnecessary distincts (in multiple unions)
[ https://issues.apache.org/jira/browse/SPARK-17653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15518577#comment-15518577 ]

Xiao Li commented on SPARK-17653:
---------------------------------

[~rxin] I submitted a PR, https://github.com/apache/spark/pull/11930, to resolve a related issue. If you think that is the right direction, I will continue to enhance it and write the design doc.