[jira] [Commented] (SPARK-17653) Optimizer should remove unnecessary distincts (in multiple unions)

2016-09-26 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15524175#comment-15524175
 ] 

Xiao Li commented on SPARK-17653:
-

Since Simon already submitted the PR, I will not continue the investigation. 
Thanks for answering my original question. 

> Optimizer should remove unnecessary distincts (in multiple unions)
> --
>
> Key: SPARK-17653
> URL: https://issues.apache.org/jira/browse/SPARK-17653
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Reynold Xin
>
> Query:
> {code}
> select 1 a union select 2 b union select 3 c
> {code}
> Explain plan:
> {code}
> == Physical Plan ==
> *HashAggregate(keys=[a#13], functions=[])
> +- Exchange hashpartitioning(a#13, 200)
>    +- *HashAggregate(keys=[a#13], functions=[])
>       +- Union
>          :- *HashAggregate(keys=[a#13], functions=[])
>          :  +- Exchange hashpartitioning(a#13, 200)
>          :     +- *HashAggregate(keys=[a#13], functions=[])
>          :        +- Union
>          :           :- *Project [1 AS a#13]
>          :           :  +- Scan OneRowRelation[]
>          :           +- *Project [2 AS b#14]
>          :              +- Scan OneRowRelation[]
>          +- *Project [3 AS c#15]
>             +- Scan OneRowRelation[]
> {code}
> Only one distinct should be necessary. This makes a bunch of unions slower 
> than a bunch of union alls followed by a distinct.
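
The equivalence behind "only one distinct should be necessary" can be checked on a small scale. The sketch below is only an illustration: it uses SQLite (through Python's standard sqlite3 module) as a stand-in for Spark SQL, so it says nothing about Spark's plans or timings. It only shows that the chained UNION query and the "UNION ALL plus one DISTINCT" rewrite return the same rows. The second pair of queries, with duplicates, is an added assumption and not from the issue.

```python
import sqlite3

# SQLite is a stand-in here: it checks result equivalence only,
# not Spark's physical plans or execution time.
conn = sqlite3.connect(":memory:")


def rows(sql):
    """Run a query and return its rows in a deterministic order."""
    return sorted(conn.execute(sql).fetchall())


# The query from the issue: every UNION implies its own duplicate elimination.
chained_union = "select 1 a union select 2 b union select 3 c"

# The desired shape: UNION ALL everywhere, one DISTINCT on top.
single_distinct = (
    "select distinct a from "
    "(select 1 a union all select 2 b union all select 3 c)"
)

# Same comparison with duplicate inputs (illustrative, not from the issue).
dup_chained = "select 1 a union select 1 union select 2 union select 2"
dup_single = (
    "select distinct a from "
    "(select 1 a union all select 1 union all select 2 union all select 2)"
)

if __name__ == "__main__":
    print(rows(chained_union) == rows(single_distinct))
    print(rows(dup_chained) == rows(dup_single))
```

Both comparisons hold because an outer duplicate elimination makes every inner one redundant for the final result.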



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17653) Optimizer should remove unnecessary distincts (in multiple unions)

2016-09-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15521950#comment-15521950
 ] 

Apache Spark commented on SPARK-17653:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/15238




[jira] [Commented] (SPARK-17653) Optimizer should remove unnecessary distincts (in multiple unions)

2016-09-24 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15519953#comment-15519953
 ] 

Xiao Li commented on SPARK-17653:
-

Yeah. You are right. It does not work. : )




[jira] [Commented] (SPARK-17653) Optimizer should remove unnecessary distincts (in multiple unions)

2016-09-24 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15519950#comment-15519950
 ] 

Xiao Li commented on SPARK-17653:
-

I see. After rethinking it, Union is special. My PR is not applicable to it, so 
it cannot eliminate the Distinct in this pattern. I think what you said is 
correct: we can do it for UNION. Do you want me to try it, or has somebody else 
already started on it? Thanks!

BTW, in traditional RDBMSs, many optimizer rules are based on unique 
constraints. However, Spark SQL does not have the concept of primary keys or 
unique constraints. If we allowed users to specify unique constraints using 
hints, we could further optimize the plan and the execution. Do you think it is 
OK to add such a hint to Spark SQL?
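
To make the hint idea concrete, here is a minimal sketch of how a user-declared unique key could let an optimizer drop a Distinct. It is a toy model written in Python, not Spark code: the Relation, Project and Distinct classes, the unique_keys field and both functions are hypothetical names invented for this illustration, and Spark SQL has no such hint as far as this thread states.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class Relation:
    name: str
    columns: Tuple[str, ...]
    # Hypothetical user-supplied hint: column sets declared unique.
    unique_keys: FrozenSet[FrozenSet[str]] = frozenset()


@dataclass(frozen=True)
class Project:
    columns: Tuple[str, ...]
    child: object


@dataclass(frozen=True)
class Distinct:
    child: object


def known_unique_keys(plan):
    """Propagate uniqueness upward: column sets known unique in plan's output."""
    if isinstance(plan, Relation):
        return plan.unique_keys
    if isinstance(plan, Project):
        kept = set(plan.columns)
        # A key survives a projection only if all of its columns are kept.
        return frozenset(k for k in known_unique_keys(plan.child) if k <= kept)
    if isinstance(plan, Distinct):
        child = plan.child
        cols = child.columns if isinstance(child, (Relation, Project)) else ()
        extra = frozenset({frozenset(cols)}) if cols else frozenset()
        return known_unique_keys(child) | extra
    return frozenset()


def remove_redundant_distinct(plan):
    """Drop a Distinct whose input is already known to have unique rows."""
    if isinstance(plan, Distinct):
        child = remove_redundant_distinct(plan.child)
        return child if known_unique_keys(child) else Distinct(child)
    if isinstance(plan, Project):
        return Project(plan.columns, remove_redundant_distinct(plan.child))
    return plan


if __name__ == "__main__":
    users = Relation("users", ("id", "name"), frozenset({frozenset({"id"})}))
    # The key column is kept, so the Distinct is dropped.
    print(remove_redundant_distinct(Distinct(Project(("id", "name"), users))))
    # The key column is projected away, so the Distinct must stay.
    print(remove_redundant_distinct(Distinct(Project(("name",), users))))
```

This also illustrates why the approach does not carry over to Union: two inputs that are each unique can still contain the same row, so no key survives a union and the Distinct above it cannot be dropped this way.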




[jira] [Commented] (SPARK-17653) Optimizer should remove unnecessary distincts (in multiple unions)

2016-09-24 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15519923#comment-15519923
 ] 

Reynold Xin commented on SPARK-17653:
-

[~smilegator] - I just took a quick look at #11930. It looks to me like it 
mainly propagates the uniqueness property up the plan. In this case we want to 
remove distincts further down a subtree. How would it work in your case?





[jira] [Commented] (SPARK-17653) Optimizer should remove unnecessary distincts (in multiple unions)

2016-09-24 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15519710#comment-15519710
 ] 

Reynold Xin commented on SPARK-17653:
-

There are different ways to fix this, from fairly general ones to more surgical 
ones. The most surgical fix I can think of is to just match a chain of 
Distinct(Union(Distinct(Union(...)))) and combine it into a single 
Distinct(Union(...)).
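
As a rough illustration of that surgical rewrite, the following is a minimal sketch on a toy plan tree. It is written in Python rather than against Catalyst, and the Relation, Union and Distinct classes and the function names are hypothetical stand-ins, not Spark's actual API or the rule from any pull request.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Relation:
    name: str


@dataclass(frozen=True)
class Union:
    # Models UNION ALL: concatenates its children without deduplication.
    children: Tuple[object, ...]


@dataclass(frozen=True)
class Distinct:
    child: object


def union_inputs(plan):
    """Inputs of plan when it sits below an outer Distinct(Union(...))."""
    # A nested Distinct(Union(...)) is redundant under the outer Distinct,
    # so look through it; nested plain Unions are merged as well.
    if isinstance(plan, Distinct) and isinstance(plan.child, Union):
        plan = plan.child
    if isinstance(plan, Union):
        return [leaf for child in plan.children for leaf in union_inputs(child)]
    return [plan]


def combine_distinct_unions(plan):
    """Collapse Distinct(Union(Distinct(Union(...)))) into one Distinct(Union(...))."""
    if isinstance(plan, Distinct) and isinstance(plan.child, Union):
        inputs = union_inputs(plan.child)
        return Distinct(Union(tuple(combine_distinct_unions(p) for p in inputs)))
    if isinstance(plan, Distinct):
        return Distinct(combine_distinct_unions(plan.child))
    if isinstance(plan, Union):
        return Union(tuple(combine_distinct_unions(c) for c in plan.children))
    return plan


if __name__ == "__main__":
    a, b, c = Relation("a"), Relation("b"), Relation("c")
    # Shape produced by: select ... union select ... union select ...
    nested = Distinct(Union((Distinct(Union((a, b))), c)))
    print(combine_distinct_unions(nested))
```

The sketch deliberately leaves a Distinct alone when it is not sitting on top of a Union, which keeps it close to the narrow pattern match described here rather than a general redundancy analysis.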

If the more general fix is simple enough, that could be a good idea too.

cc [~vssrinath]




[jira] [Commented] (SPARK-17653) Optimizer should remove unnecessary distincts (in multiple unions)

2016-09-24 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15518577#comment-15518577
 ] 

Xiao Li commented on SPARK-17653:
-

[~rxin] I submitted a PR https://github.com/apache/spark/pull/11930 to 
resolve a related issue. If you think that is the right direction, I will 
continue/enhance it and write the design doc. 
