[jira] [Commented] (HIVE-10153) CBO (Calcite Return Path): TPC-DS Q15 in-efficient join order
[ https://issues.apache.org/jira/browse/HIVE-10153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15407306#comment-15407306 ] Nemon Lou commented on HIVE-10153: -- This filter prevents joining date_dim first: {noformat} ( substr(ca_zip,1,5) in ('85669', '86197','88274','83405','86475', '85392', '85460', '80348', '81792') or customer_address.ca_state in ('CA','WA','GA') or catalog_sales.cs_sales_price > 500) {noformat} With this filter, table date_dim can not be combined into the same mutijoin RelNode with other 3 tables. {code} private boolean canCombine(RelNode input, boolean nullGenerating) { return input instanceof MultiJoin && !((MultiJoin) input).isFullOuterJoin() && !((MultiJoin) input).containsOuter() && !nullGenerating; } {code} The input is a filter RelNode instead of MultiJoin. {noformat} 2016-08-04 14:23:38,637 | DEBUG | HiveServer2-Handler-Pool: Thread-123 | Original Plan: HiveSort(fetch=[100]) HiveSort(sort0=[$0], dir0=[ASC]) HiveProject(ca_zip=[$0], _o__c1=[$1]) HiveAggregate(group=[{0}], agg#0=[sum($1)]) HiveProject($f0=[$67], $f1=[$20]) HiveFilter(condition=[AND(=($2, $37), =($41, $58), =($33, $74), OR(in(substr($67, 1, 5), '85669', '86197', '88274', '83405', '86475', '85392', '85460', '80348', '81792'), in($66, 'CA', 'WA', 'GA'), >($20, 5E2)), =($84, 2), =($80, 2000))]) HiveJoin(condition=[true], joinType=[inner], algorithm=[none], cost=[not available]) HiveJoin(condition=[true], joinType=[inner], algorithm=[none], cost=[not available]) HiveJoin(condition=[true], joinType=[inner], algorithm=[none], cost=[not available]) HiveTableScan(table=[[tpcds_bin_partitioned_orc_10.catalog_sales]]) HiveTableScan(table=[[tpcds_bin_partitioned_orc_10.customer]]) HiveTableScan(table=[[tpcds_bin_partitioned_orc_10.customer_address]]) HiveTableScan(table=[[tpcds_bin_partitioned_orc_10.date_dim]]) | org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.apply(CalcitePlanner.java:911) 2016-08-04 14:23:38,654 | DEBUG | HiveServer2-Handler-Pool: Thread-123 | Plan After PPD, PartPruning, ColumnPruning: HiveSort(fetch=[100]) HiveSort(sort0=[$0], dir0=[ASC]) HiveAggregate(group=[{0}], agg#0=[sum($1)]) HiveProject($f0=[$7], $f1=[$1]) HiveJoin(condition=[=($2, $8)], joinType=[inner], algorithm=[none], cost=[not available]) HiveFilter(condition=[OR(in(substr($7, 1, 5), '85669', '86197', '88274', '83405', '86475', '85392', '85460', '80348', '81792'), in($6, 'CA', 'WA', 'GA'), >($1, 5E2))]) HiveJoin(condition=[=($4, $5)], joinType=[inner], algorithm=[none], cost=[not available]) HiveJoin(condition=[=($0, $3)], joinType=[inner], algorithm=[none], cost=[not available]) HiveProject(cs_bill_customer_sk=[$2], cs_sales_price=[$20], cs_sold_date_sk=[$33]) HiveTableScan(table=[[tpcds_bin_partitioned_orc_10.catalog_sales]]) HiveProject(c_customer_sk=[$0], c_current_addr_sk=[$4]) HiveTableScan(table=[[tpcds_bin_partitioned_orc_10.customer]]) HiveProject(ca_address_sk=[$0], ca_state=[$8], ca_zip=[$9]) HiveTableScan(table=[[tpcds_bin_partitioned_orc_10.customer_address]]) HiveProject(d_date_sk=[$0], d_year=[$6], d_qoy=[$10]) HiveFilter(condition=[AND(=($10, 2), =($6, 2000))]) HiveTableScan(table=[[tpcds_bin_partitioned_orc_10.date_dim]]) | org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.apply(CalcitePlanner.java:912) {noformat} Replacing 'or' with 'and' can help. {noformat} ( substr(ca_zip,1,5) in ('85669', '86197','88274','83405','86475', '85392', '85460', '80348', '81792') and customer_address.ca_state in ('CA','WA','GA') and catalog_sales.cs_sales_price > 500) {noformat} > CBO (Calcite Return Path): TPC-DS Q15 in-efficient join order > -- > > Key: HIVE-10153 > URL: https://issues.apache.org/jira/browse/HIVE-10153 > Project: Hive > Issue Type: Bug > Components: CBO >Affects Versions: cbo-branch >Reporter: Mostafa Mokhtar >Assignee: Laljo John Pullokkaran > Fix For: cbo-branch > > > TPC-DS Q15 joins catalog_sales with date_dim last where it should be the > first join. > Query > {code} > select ca_zip >,sum(cs_sales_price) > from catalog_sales > ,customer > ,customer_address > ,date_dim > where catalog_sales.cs_bill_customer_sk = customer.c_customer_sk > and customer.c_current_addr_sk = customer_address.ca_address_sk > and ( substr(ca_zip,1,5) in ('85669', '86197','88274','83405','86475', >'85392', '85460', '80348',
[jira] [Commented] (HIVE-10153) CBO (Calcite Return Path): TPC-DS Q15 in-efficient join order
[ https://issues.apache.org/jira/browse/HIVE-10153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14387752#comment-14387752 ] Mostafa Mokhtar commented on HIVE-10153: [~jpullokkaran] I looked at HiveJoin.java:chooseJoinAlgorithmAndGetCost and noticed catalog_sales x date_dim is not one of the joins that get costed, which means that re-ordering happened already and this join wasn't considered selective. As you mentioned the catalog_sales x date_dim join must have been trimmed earlier. {code} 2015-03-30 20:56:30,676 DEBUG [main]: reloperators.HiveJoin (HiveJoin.java:chooseJoinAlgorithmAndGetCost(160)) - HiveJoin(condition=[=($0, $3)], joinType=[inner], joinAlgorithm=[none], cost=[null]) HiveTableScan(table=[[tpcds_bin_partitioned_orc_200_1.catalog_sales]]) HiveTableScan(table=[[tpcds_bin_partitioned_orc_200_1.customer]]) 2015-03-30 20:56:30,694 DEBUG [main]: stats.StatsUtils (StatsUtils.java:setUnknownRcDsToAverage(383)) - Estimated average row size: 8 2015-03-30 20:56:31,030 DEBUG [main]: reloperators.HiveJoin (HiveJoin.java:chooseJoinAlgorithmAndGetCost(169)) - COMMONJOIN possible 2 2015-03-30 20:56:31,031 DEBUG [main]: reloperators.HiveJoin (HiveJoin.java:chooseJoinAlgorithmAndGetCost(170)) - COMMONJOIN cost: {2.88149727E8 rows, 5.891110492737591E9 cpu, 3.1120170516E12 io} 2015-03-30 20:56:31,046 DEBUG [main]: reloperators.HiveJoin (HiveJoin.java:chooseJoinAlgorithmAndGetCost(180)) - MAPJOIN possible 2015-03-30 20:56:31,046 DEBUG [main]: reloperators.HiveJoin (HiveJoin.java:chooseJoinAlgorithmAndGetCost(181)) - MAPJOIN cost: {2.88149727E8 rows, 2.89749727E8 cpu, 3.84E10 io} 2015-03-30 20:56:31,049 INFO [main]: log.PerfLogger (PerfLogger.java:PerfLogBegin(121)) - PERFLOG method=partition-retrieving from=org.apache.hadoop.hive.ql.optimizer.ppr.PartitionPruner 2015-03-30 20:56:31,049 INFO [main]: log.PerfLogger (PerfLogger.java:PerfLogEnd(148)) - /PERFLOG method=partition-retrieving start=1427763391049 end=1427763391049 duration=0 from=org.apache.hadoop.hive.ql.optimizer.ppr.PartitionPruner 2015-03-30 20:56:31,072 DEBUG [main]: reloperators.HiveJoin (HiveJoin.java:chooseJoinAlgorithmAndGetCost(160)) - HiveJoin(condition=[=($1, $2)], joinType=[inner], joinAlgorithm=[none], cost=[null]) HiveTableScan(table=[[tpcds_bin_partitioned_orc_200_1.customer]]) HiveTableScan(table=[[tpcds_bin_partitioned_orc_200_1.customer_address]]) 2015-03-30 20:56:31,093 DEBUG [main]: reloperators.HiveJoin (HiveJoin.java:chooseJoinAlgorithmAndGetCost(169)) - COMMONJOIN possible 2 2015-03-30 20:56:31,093 DEBUG [main]: reloperators.HiveJoin (HiveJoin.java:chooseJoinAlgorithmAndGetCost(170)) - COMMONJOIN cost: {240.0 rows, 3.613071630485607E7 cpu, 2.89314882E10 io} 2015-03-30 20:56:31,101 DEBUG [main]: reloperators.HiveJoin (HiveJoin.java:chooseJoinAlgorithmAndGetCost(180)) - MAPJOIN possible 2015-03-30 20:56:31,101 DEBUG [main]: reloperators.HiveJoin (HiveJoin.java:chooseJoinAlgorithmAndGetCost(181)) - MAPJOIN cost: {240.0 rows, 320.0 cpu, 0.0 io} 2015-03-30 20:56:31,108 DEBUG [main]: reloperators.HiveJoin (HiveJoin.java:chooseJoinAlgorithmAndGetCost(160)) - HiveJoin(condition=[=($0, $3)], joinType=[inner], joinAlgorithm=[none], cost=[null]) HiveTableScan(table=[[tpcds_bin_partitioned_orc_200_1.catalog_sales]]) HiveJoin(condition=[=($1, $2)], joinType=[inner], joinAlgorithm=[map_join], cost=[{240.0 rows, 320.0 cpu, 0.0 io}]) HiveTableScan(table=[[tpcds_bin_partitioned_orc_200_1.customer]]) HiveTableScan(table=[[tpcds_bin_partitioned_orc_200_1.customer_address]]) 2015-03-30 20:56:31,109 DEBUG [main]: reloperators.HiveJoin (HiveJoin.java:chooseJoinAlgorithmAndGetCost(169)) - COMMONJOIN possible 2 2015-03-30 20:56:31,109 DEBUG [main]: reloperators.HiveJoin (HiveJoin.java:chooseJoinAlgorithmAndGetCost(170)) - COMMONJOIN cost: {2.88358116798421E8 rows, 5.894517244918468E9 cpu, 3.140605701919576E12 io} 2015-03-30 20:56:31,117 DEBUG [main]: reloperators.HiveJoin (HiveJoin.java:chooseJoinAlgorithmAndGetCost(180)) - MAPJOIN possible 2015-03-30 20:56:31,117 DEBUG [main]: reloperators.HiveJoin (HiveJoin.java:chooseJoinAlgorithmAndGetCost(181)) - MAPJOIN cost: {2.88358116798421E8 rows, 5.74907843798421E8 cpu, 1.82245626372E13 io} 2015-03-30 20:56:31,123 DEBUG [main]: reloperators.HiveJoin (HiveJoin.java:chooseJoinAlgorithmAndGetCost(160)) - HiveJoin(condition=[=($4, $5)], joinType=[inner], joinAlgorithm=[none], cost=[null]) HiveJoin(condition=[=($0, $3)], joinType=[inner], joinAlgorithm=[map_join], cost=[{2.88149727E8 rows, 2.89749727E8 cpu, 3.84E10 io}]) HiveTableScan(table=[[tpcds_bin_partitioned_orc_200_1.catalog_sales]]) HiveTableScan(table=[[tpcds_bin_partitioned_orc_200_1.customer]]) HiveTableScan(table=[[tpcds_bin_partitioned_orc_200_1.customer_address]]) 2015-03-30 20:56:31,124 DEBUG [main]: reloperators.HiveJoin