[ https://issues.apache.org/jira/browse/SPARK-23839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16422228#comment-16422228 ]
Xiaoju Wu edited comment on SPARK-23839 at 4/2/18 3:05 PM: ----------------------------------------------------------- Yes, bucketing is one of the cases to say that the cost of shuffling and sorting should be taken into consideration of the cost formula. I agree the simple the better. Seems we need to think about a simple solution, which will not lead the join order to worse performance than that optimized after current CostBasedJoinReorder rule. Any suggestion? was (Author: xiaojuwu): Yes, bucketing is one of the cases to say that the cost of shuffling and sorting should be taken into consideration of the cost formula. I agree the simple the better. Seems we need to think about a simple solution, which will not lead the join order to worse performance than that optimized after current CostBasedJoinReorder rule. > consider bucket join in cost-based JoinReorder rule > --------------------------------------------------- > > Key: SPARK-23839 > URL: https://issues.apache.org/jira/browse/SPARK-23839 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.3.0 > Reporter: Xiaoju Wu > Priority: Minor > > Since spark 2.2, the cost-based JoinReorder rule is implemented and in Spark > 2.3 released, it is improved with histogram. While it doesn't take the cost > of the different join implementations. For example: > TableA JOIN TableB JOIN TableC > TableA will output 10,000 rows after filter and projection. > TableB will output 10,000 rows after filter and projection. > TableC will output 8,000 rows after filter and projection. > The current JoinReorder rule will possibly optimize the plan to join TableC > with TableA firstly and then TableB. But if the TableA and TableB are bucket > tables and can be applied with BucketJoin, it could be a different story. > > Also, to support bucket join of more than 2 tables when table bucket number > is multiple of another (SPARK-17570), whether bucket join can take effect > depends on the result of JoinReorder. For example of "A join B join C" which > has bucket number like 8, 4, 12, JoinReorder rule should optimize the order > to "A join B join C“ to make the bucket join take effect instead of "C join A > join B". > > Based on current CBO JoinReorder, there are possibly 2 part to be changed: > # CostBasedJoinReorder rule is applied in optimizer phase while we do Join > selection in planner phase and bucket join optimization in EnsureRequirements > which is in preparation phase. Both are after optimizer. > # Current statistics and join cost formula are based data selectivity and > cardinality, we need to add statistics for present the join method cost like > shuffle, sort, hash and etc. Also we need to add the statistics into the > formula to estimate the join cost. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org