[jira] [Commented] (HIVE-10153) CBO (Calcite Return Path): TPC-DS Q15 in-efficient join order

2016-08-04 Thread Nemon Lou (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-10153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15407306#comment-15407306
 ] 

Nemon Lou commented on HIVE-10153:
--

This filter prevents joining date_dim first:
{noformat}
( substr(ca_zip,1,5) in ('85669', '86197','88274','83405','86475',
 '85392', '85460', '80348', '81792')
 or customer_address.ca_state in ('CA','WA','GA')
 or catalog_sales.cs_sales_price > 500)
{noformat}
With this filter, table date_dim can not be combined into the same mutijoin 
RelNode with other 3 tables. 
{code}
  private boolean canCombine(RelNode input, boolean nullGenerating) {
return input instanceof MultiJoin
&& !((MultiJoin) input).isFullOuterJoin()
&& !((MultiJoin) input).containsOuter()
&& !nullGenerating;
  }
{code}
The input is a filter RelNode instead of MultiJoin.
{noformat}
2016-08-04 14:23:38,637 | DEBUG | HiveServer2-Handler-Pool: Thread-123 | 
Original Plan:
HiveSort(fetch=[100])
  HiveSort(sort0=[$0], dir0=[ASC])
HiveProject(ca_zip=[$0], _o__c1=[$1])
  HiveAggregate(group=[{0}], agg#0=[sum($1)])
HiveProject($f0=[$67], $f1=[$20])
  HiveFilter(condition=[AND(=($2, $37), =($41, $58), =($33, $74), 
OR(in(substr($67, 1, 5), '85669', '86197', '88274', '83405', '86475', '85392', 
'85460', '80348', '81792'), in($66, 'CA', 'WA', 'GA'), >($20, 5E2)), =($84, 2), 
=($80, 2000))])
HiveJoin(condition=[true], joinType=[inner], algorithm=[none], 
cost=[not available])
  HiveJoin(condition=[true], joinType=[inner], algorithm=[none], 
cost=[not available])
HiveJoin(condition=[true], joinType=[inner], algorithm=[none], 
cost=[not available])
  
HiveTableScan(table=[[tpcds_bin_partitioned_orc_10.catalog_sales]])
  HiveTableScan(table=[[tpcds_bin_partitioned_orc_10.customer]])

HiveTableScan(table=[[tpcds_bin_partitioned_orc_10.customer_address]])
  HiveTableScan(table=[[tpcds_bin_partitioned_orc_10.date_dim]])
 | 
org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.apply(CalcitePlanner.java:911)
2016-08-04 14:23:38,654 | DEBUG | HiveServer2-Handler-Pool: Thread-123 | Plan 
After PPD, PartPruning, ColumnPruning:
HiveSort(fetch=[100])
  HiveSort(sort0=[$0], dir0=[ASC])
HiveAggregate(group=[{0}], agg#0=[sum($1)])
  HiveProject($f0=[$7], $f1=[$1])
HiveJoin(condition=[=($2, $8)], joinType=[inner], algorithm=[none], 
cost=[not available])
  HiveFilter(condition=[OR(in(substr($7, 1, 5), '85669', '86197', 
'88274', '83405', '86475', '85392', '85460', '80348', '81792'), in($6, 'CA', 
'WA', 'GA'), >($1, 5E2))])
HiveJoin(condition=[=($4, $5)], joinType=[inner], algorithm=[none], 
cost=[not available])
  HiveJoin(condition=[=($0, $3)], joinType=[inner], 
algorithm=[none], cost=[not available])
HiveProject(cs_bill_customer_sk=[$2], cs_sales_price=[$20], 
cs_sold_date_sk=[$33])
  
HiveTableScan(table=[[tpcds_bin_partitioned_orc_10.catalog_sales]])
HiveProject(c_customer_sk=[$0], c_current_addr_sk=[$4])
  HiveTableScan(table=[[tpcds_bin_partitioned_orc_10.customer]])
  HiveProject(ca_address_sk=[$0], ca_state=[$8], ca_zip=[$9])

HiveTableScan(table=[[tpcds_bin_partitioned_orc_10.customer_address]])
  HiveProject(d_date_sk=[$0], d_year=[$6], d_qoy=[$10])
HiveFilter(condition=[AND(=($10, 2), =($6, 2000))])
  HiveTableScan(table=[[tpcds_bin_partitioned_orc_10.date_dim]])
 | 
org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.apply(CalcitePlanner.java:912)
{noformat}

Replacing 'or' with 'and' can help.
{noformat}
 ( substr(ca_zip,1,5) in ('85669', '86197','88274','83405','86475',
 '85392', '85460', '80348', '81792')
 and customer_address.ca_state in ('CA','WA','GA')
 and catalog_sales.cs_sales_price > 500)
{noformat}

> CBO (Calcite Return Path): TPC-DS Q15 in-efficient join order 
> --
>
> Key: HIVE-10153
> URL: https://issues.apache.org/jira/browse/HIVE-10153
> Project: Hive
>  Issue Type: Bug
>  Components: CBO
>Affects Versions: cbo-branch
>Reporter: Mostafa Mokhtar
>Assignee: Laljo John Pullokkaran
> Fix For: cbo-branch
>
>
> TPC-DS Q15 joins catalog_sales with date_dim last where it should be the 
> first join.
> Query 
> {code}
> select  ca_zip
>,sum(cs_sales_price)
>  from catalog_sales
>  ,customer
>  ,customer_address
>  ,date_dim
>  where catalog_sales.cs_bill_customer_sk = customer.c_customer_sk
>   and customer.c_current_addr_sk = customer_address.ca_address_sk 
>   and ( substr(ca_zip,1,5) in ('85669', '86197','88274','83405','86475',
>'85392', '85460', '80348', 

[jira] [Commented] (HIVE-10153) CBO (Calcite Return Path): TPC-DS Q15 in-efficient join order

2015-03-30 Thread Mostafa Mokhtar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-10153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14387752#comment-14387752
 ] 

Mostafa Mokhtar commented on HIVE-10153:


[~jpullokkaran]

I looked at HiveJoin.java:chooseJoinAlgorithmAndGetCost and noticed 
catalog_sales x date_dim is not one of the joins that get costed, which means 
that re-ordering happened already and this join wasn't considered selective.
As you mentioned the catalog_sales x date_dim join must have been trimmed 
earlier.

{code}
2015-03-30 20:56:30,676 DEBUG [main]: reloperators.HiveJoin 
(HiveJoin.java:chooseJoinAlgorithmAndGetCost(160)) - HiveJoin(condition=[=($0, 
$3)], joinType=[inner], joinAlgorithm=[none], cost=[null])
  HiveTableScan(table=[[tpcds_bin_partitioned_orc_200_1.catalog_sales]])
  HiveTableScan(table=[[tpcds_bin_partitioned_orc_200_1.customer]])

2015-03-30 20:56:30,694 DEBUG [main]: stats.StatsUtils 
(StatsUtils.java:setUnknownRcDsToAverage(383)) - Estimated average row size: 8
2015-03-30 20:56:31,030 DEBUG [main]: reloperators.HiveJoin 
(HiveJoin.java:chooseJoinAlgorithmAndGetCost(169)) - COMMONJOIN possible 2
2015-03-30 20:56:31,031 DEBUG [main]: reloperators.HiveJoin 
(HiveJoin.java:chooseJoinAlgorithmAndGetCost(170)) - COMMONJOIN cost: 
{2.88149727E8 rows, 5.891110492737591E9 cpu, 3.1120170516E12 io}
2015-03-30 20:56:31,046 DEBUG [main]: reloperators.HiveJoin 
(HiveJoin.java:chooseJoinAlgorithmAndGetCost(180)) - MAPJOIN possible
2015-03-30 20:56:31,046 DEBUG [main]: reloperators.HiveJoin 
(HiveJoin.java:chooseJoinAlgorithmAndGetCost(181)) - MAPJOIN cost: 
{2.88149727E8 rows, 2.89749727E8 cpu, 3.84E10 io}
2015-03-30 20:56:31,049 INFO  [main]: log.PerfLogger 
(PerfLogger.java:PerfLogBegin(121)) - PERFLOG method=partition-retrieving 
from=org.apache.hadoop.hive.ql.optimizer.ppr.PartitionPruner
2015-03-30 20:56:31,049 INFO  [main]: log.PerfLogger 
(PerfLogger.java:PerfLogEnd(148)) - /PERFLOG method=partition-retrieving 
start=1427763391049 end=1427763391049 duration=0 
from=org.apache.hadoop.hive.ql.optimizer.ppr.PartitionPruner
2015-03-30 20:56:31,072 DEBUG [main]: reloperators.HiveJoin 
(HiveJoin.java:chooseJoinAlgorithmAndGetCost(160)) - HiveJoin(condition=[=($1, 
$2)], joinType=[inner], joinAlgorithm=[none], cost=[null])
  HiveTableScan(table=[[tpcds_bin_partitioned_orc_200_1.customer]])
  HiveTableScan(table=[[tpcds_bin_partitioned_orc_200_1.customer_address]])

2015-03-30 20:56:31,093 DEBUG [main]: reloperators.HiveJoin 
(HiveJoin.java:chooseJoinAlgorithmAndGetCost(169)) - COMMONJOIN possible 2
2015-03-30 20:56:31,093 DEBUG [main]: reloperators.HiveJoin 
(HiveJoin.java:chooseJoinAlgorithmAndGetCost(170)) - COMMONJOIN cost: 
{240.0 rows, 3.613071630485607E7 cpu, 2.89314882E10 io}
2015-03-30 20:56:31,101 DEBUG [main]: reloperators.HiveJoin 
(HiveJoin.java:chooseJoinAlgorithmAndGetCost(180)) - MAPJOIN possible
2015-03-30 20:56:31,101 DEBUG [main]: reloperators.HiveJoin 
(HiveJoin.java:chooseJoinAlgorithmAndGetCost(181)) - MAPJOIN cost: {240.0 
rows, 320.0 cpu, 0.0 io}
2015-03-30 20:56:31,108 DEBUG [main]: reloperators.HiveJoin 
(HiveJoin.java:chooseJoinAlgorithmAndGetCost(160)) - HiveJoin(condition=[=($0, 
$3)], joinType=[inner], joinAlgorithm=[none], cost=[null])
  HiveTableScan(table=[[tpcds_bin_partitioned_orc_200_1.catalog_sales]])
  HiveJoin(condition=[=($1, $2)], joinType=[inner], joinAlgorithm=[map_join], 
cost=[{240.0 rows, 320.0 cpu, 0.0 io}])
HiveTableScan(table=[[tpcds_bin_partitioned_orc_200_1.customer]])
HiveTableScan(table=[[tpcds_bin_partitioned_orc_200_1.customer_address]])

2015-03-30 20:56:31,109 DEBUG [main]: reloperators.HiveJoin 
(HiveJoin.java:chooseJoinAlgorithmAndGetCost(169)) - COMMONJOIN possible 2
2015-03-30 20:56:31,109 DEBUG [main]: reloperators.HiveJoin 
(HiveJoin.java:chooseJoinAlgorithmAndGetCost(170)) - COMMONJOIN cost: 
{2.88358116798421E8 rows, 5.894517244918468E9 cpu, 3.140605701919576E12 io}
2015-03-30 20:56:31,117 DEBUG [main]: reloperators.HiveJoin 
(HiveJoin.java:chooseJoinAlgorithmAndGetCost(180)) - MAPJOIN possible
2015-03-30 20:56:31,117 DEBUG [main]: reloperators.HiveJoin 
(HiveJoin.java:chooseJoinAlgorithmAndGetCost(181)) - MAPJOIN cost: 
{2.88358116798421E8 rows, 5.74907843798421E8 cpu, 1.82245626372E13 io}
2015-03-30 20:56:31,123 DEBUG [main]: reloperators.HiveJoin 
(HiveJoin.java:chooseJoinAlgorithmAndGetCost(160)) - HiveJoin(condition=[=($4, 
$5)], joinType=[inner], joinAlgorithm=[none], cost=[null])
  HiveJoin(condition=[=($0, $3)], joinType=[inner], joinAlgorithm=[map_join], 
cost=[{2.88149727E8 rows, 2.89749727E8 cpu, 3.84E10 io}])
HiveTableScan(table=[[tpcds_bin_partitioned_orc_200_1.catalog_sales]])
HiveTableScan(table=[[tpcds_bin_partitioned_orc_200_1.customer]])
  HiveTableScan(table=[[tpcds_bin_partitioned_orc_200_1.customer_address]])

2015-03-30 20:56:31,124 DEBUG [main]: reloperators.HiveJoin