[ 
https://issues.apache.org/jira/browse/SPARK-23839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoju Wu updated SPARK-23839:
------------------------------
    Description: 
The cost-based JoinReorder rule was introduced in Spark 2.2 and improved with 
histograms in Spark 2.3. However, it does not take the cost of the different 
join implementations into account. For example:

TableA JOIN TableB JOIN TableC

TableA  will output 10,000 rows after filter and projection. 

TableB  will output 10,000 rows after filter and projection. 

TableC  will output 8,000 rows after filter and projection. 

The current JoinReorder rule may reorder the plan to join TableC with TableA 
first and then TableB. But if TableA and TableB are bucketed tables that can be 
joined with a bucket join, joining them first could be cheaper because it 
avoids a shuffle. 
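
To illustrate the point, here is a minimal Python sketch of a toy cost model. 
It is not Spark's actual implementation. The row counts are the ones above; 
everything else is an assumption made for illustration: TableA and TableB are 
bucketed on the join key and TableC is not, a join's output is estimated as the 
smaller input, a join avoids the shuffle only when both inputs are bucketed, 
and plan_cost, best_order and shuffle_weight are hypothetical names.

```python
from itertools import permutations

# Row counts from the example; the bucketing flags are an assumption.
ROWS = {"A": 10_000, "B": 10_000, "C": 8_000}
BUCKETED = {"A": True, "B": True, "C": False}


def plan_cost(order, shuffle_weight):
    """Toy cost of the left-deep plan ((order[0] JOIN order[1]) JOIN order[2])."""
    left_rows = ROWS[order[0]]
    left_bucketed = BUCKETED[order[0]]
    total = 0.0
    for name in order[1:]:
        right_rows = ROWS[name]
        # Simplification: the shuffle is skipped only if both sides are bucketed.
        if not (left_bucketed and BUCKETED[name]):
            total += shuffle_weight * (left_rows + right_rows)
        # Toy cardinality estimate: the join outputs the smaller input.
        left_rows = min(left_rows, right_rows)
        # Simplification: a join result is treated as not bucketed.
        left_bucketed = False
        total += left_rows
    return total


def best_order(shuffle_weight):
    """Return the cheapest left-deep join order under the toy model."""
    return min(permutations(sorted(ROWS)), key=lambda o: plan_cost(o, shuffle_weight))


if __name__ == "__main__":
    print("cardinality only:", best_order(0.0))
    print("with shuffle cost:", best_order(1.0))
```

With shuffle_weight set to 0 the model sees only cardinality and starts with a 
join that involves TableC. With a non-zero weight it prefers to join the two 
bucketed tables first.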

 

Also, to support bucket joins of more than two tables when one table's bucket 
number is a multiple of another's (SPARK-17570), whether the bucket join can 
take effect depends on the result of JoinReorder. For example, for "A join B 
join C" with bucket numbers 8, 4, and 12, the JoinReorder rule should choose 
the order "A join B join C" rather than "C join A join B" so that the bucket 
join takes effect. 
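
The dependency on join order can be sketched in Python as follows. This is an 
illustration only, not the SPARK-17570 implementation: it assumes two inputs 
can be bucket-joined when one bucket number is a multiple of the other, and 
that the join result is treated as having the smaller bucket number. The 
function name bucket_join_applies is hypothetical.

```python
def bucket_join_applies(bucket_counts):
    """Check a left-deep join order given the bucket number of each table.

    Assumption: two inputs are compatible when the larger bucket number is a
    multiple of the smaller one, and the result keeps the smaller number.
    """
    current = bucket_counts[0]
    for count in bucket_counts[1:]:
        larger, smaller = max(current, count), min(current, count)
        if larger % smaller != 0:
            return False
        current = smaller
    return True


if __name__ == "__main__":
    # A = 8, B = 4, C = 12 buckets, as in the example.
    print("A join B join C:", bucket_join_applies([8, 4, 12]))
    print("C join A join B:", bucket_join_applies([12, 8, 4]))
```

Under these assumptions "A join B join C" passes the check, while "C join A 
join B" fails at the first join because neither 12 nor 8 is a multiple of the 
other.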

 

Based on the current CBO JoinReorder, there are possibly two parts to change:
 # The CostBasedJoinReorder rule is applied in the optimizer phase, while join 
selection is done in the planner phase and the bucket join optimization in 
EnsureRequirements, which is in the preparation phase. Both happen after the 
optimizer. 
 # The current statistics and join cost formula are based on data selectivity 
and cardinality. We need to add statistics that represent the cost of the join 
method, such as shuffle, sort, and hash, and add them to the formula that 
estimates the join cost. 

  was:
The cost-based JoinReorder rule was introduced in Spark 2.2 and improved with 
histograms in Spark 2.3. However, it does not take the cost of the different 
join implementations into account. For example:

TableA JOIN TableB JOIN TableC

TableA  will output 10,000 rows after filter and projection. 

TableB  will output 10,000 rows after filter and projection. 

TableC  will output 8,000 rows after filter and projection. 

The current JoinReorder rule may reorder the plan to join TableC with TableA 
first and then TableB. But if TableA and TableB are bucketed tables that can be 
joined with a bucket join, joining them first could be cheaper because it 
avoids a shuffle. 

 

Also, to support bucket joins of more than two tables when one table's bucket 
number is a multiple of another's (SPARK-17570), whether the bucket join can 
take effect depends on the result of JoinReorder. For example, for "a join b 
join c" with bucket numbers 8, 12, and 4, the JoinReorder rule should reorder 
the join to "c join a join b" so that the bucket join takes effect. 

 

Based on the current CBO JoinReorder, there are possibly two parts to change:
 # The CostBasedJoinReorder rule is applied in the optimizer phase, while join 
selection is done in the planner phase and the bucket join optimization in 
EnsureRequirements, which is in the preparation phase. Both happen after the 
optimizer. 
 # The current statistics and join cost formula are based on data selectivity 
and cardinality. We need to add statistics that represent the cost of the join 
method, such as shuffle, sort, and hash, and add them to the formula that 
estimates the join cost. 


> consider bucket join in cost-based JoinReorder rule
> ---------------------------------------------------
>
>                 Key: SPARK-23839
>                 URL: https://issues.apache.org/jira/browse/SPARK-23839
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Xiaoju Wu
>            Priority: Minor
>
> The cost-based JoinReorder rule was introduced in Spark 2.2 and improved 
> with histograms in Spark 2.3. However, it does not take the cost of the 
> different join implementations into account. For example:
> TableA JOIN TableB JOIN TableC
> TableA  will output 10,000 rows after filter and projection. 
> TableB  will output 10,000 rows after filter and projection. 
> TableC  will output 8,000 rows after filter and projection. 
> The current JoinReorder rule may reorder the plan to join TableC with TableA 
> first and then TableB. But if TableA and TableB are bucketed tables that can 
> be joined with a bucket join, joining them first could be cheaper because it 
> avoids a shuffle. 
>  
> Also, to support bucket joins of more than two tables when one table's 
> bucket number is a multiple of another's (SPARK-17570), whether the bucket 
> join can take effect depends on the result of JoinReorder. For example, for 
> "A join B join C" with bucket numbers 8, 4, and 12, the JoinReorder rule 
> should choose the order "A join B join C" rather than "C join A join B" so 
> that the bucket join takes effect. 
>  
> Based on the current CBO JoinReorder, there are possibly two parts to change:
>  # The CostBasedJoinReorder rule is applied in the optimizer phase, while 
> join selection is done in the planner phase and the bucket join optimization 
> in EnsureRequirements, which is in the preparation phase. Both happen after 
> the optimizer. 
>  # The current statistics and join cost formula are based on data selectivity 
> and cardinality. We need to add statistics that represent the cost of the 
> join method, such as shuffle, sort, and hash, and add them to the formula 
> that estimates the join cost. 



