[ https://issues.apache.org/jira/browse/HIVE-10194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sushanth Sowmyan updated HIVE-10194: ------------------------------------ Fix Version/s: (was: 1.2.0) > CBO (Calcite Return Path): Equi join followed by theta join produces a cross > product > ------------------------------------------------------------------------------------ > > Key: HIVE-10194 > URL: https://issues.apache.org/jira/browse/HIVE-10194 > Project: Hive > Issue Type: Sub-task > Components: CBO > Reporter: Mostafa Mokhtar > Assignee: Laljo John Pullokkaran > > Query > {code} > SELECT count(distinct ws_order_number) as order_count, > sum(ws_ext_ship_cost) as total_shipping_cost, > sum(ws_net_profit) as total_net_profit > FROM web_sales ws1 > JOIN customer_address ca ON (ws1.ws_ship_addr_sk = ca.ca_address_sk) > JOIN web_site s ON (ws1.ws_web_site_sk = s.web_site_sk) > JOIN date_dim d ON (ws1.ws_ship_date_sk = d.d_date_sk) > LEFT SEMI JOIN (SELECT ws2.ws_order_number as ws_order_number > FROM web_sales ws2 JOIN web_sales ws3 > ON (ws2.ws_order_number = ws3.ws_order_number) > WHERE ws2.ws_warehouse_sk <> > ws3.ws_warehouse_sk > ) ws_wh1 > ON (ws1.ws_order_number = ws_wh1.ws_order_number) > LEFT OUTER JOIN web_returns wr1 ON (ws1.ws_order_number = wr1.wr_order_number) > WHERE d.d_date between '1999-05-01' and '1999-07-01' and > ca.ca_state = 'TX' and > s.web_company_name = 'pri' and > wr1.wr_order_number is null > limit 100 > {code} > Plan > {code} > OK > Time taken: 0.23 seconds > Warning: Map Join MAPJOIN[83][bigTable=ws1] in task 'Map 2' is a cross product > OK > STAGE DEPENDENCIES: > Stage-1 is a root stage > Stage-0 depends on stages: Stage-1 > STAGE PLANS: > Stage: Stage-1 > Tez > Edges: > Map 2 <- Map 1 (BROADCAST_EDGE) > Map 8 <- Reducer 4 (BROADCAST_EDGE) > Reducer 3 <- Map 2 (SIMPLE_EDGE), Map 5 (BROADCAST_EDGE), Map 6 > (BROADCAST_EDGE), Map 7 (SIMPLE_EDGE) > Reducer 4 <- Map 10 (SIMPLE_EDGE), Reducer 3 (SIMPLE_EDGE) > Reducer 9 <- Map 8 (SIMPLE_EDGE) > DagName: mmokhtar_20150402132417_1bc8688b-59a0-4909-82a4-b9d386065bbd:3 > Vertices: > Map 1 > Map Operator Tree: > TableScan > alias: ws1 > filterExpr: (((ws_ship_addr_sk = ws_order_number) and > (ws_ship_date_sk <> ws_web_site_sk)) and ws_ship_addr_sk is not null) (type: > boolean) > Statistics: Num rows: 143966864 Data size: 33110363004 > Basic stats: COMPLETE Column stats: COMPLETE > Filter Operator > predicate: (((ws_ship_addr_sk = ws_order_number) and > (ws_ship_date_sk <> ws_web_site_sk)) and ws_ship_addr_sk is not null) (type: > boolean) > Statistics: Num rows: 71974471 Data size: 1151483592 > Basic stats: COMPLETE Column stats: COMPLETE > Select Operator > expressions: ws_ship_addr_sk (type: int) > outputColumnNames: _col1 > Statistics: Num rows: 71974471 Data size: 287862044 > Basic stats: COMPLETE Column stats: COMPLETE > Reduce Output Operator > sort order: > Statistics: Num rows: 71974471 Data size: 287862044 > Basic stats: COMPLETE Column stats: COMPLETE > value expressions: _col1 (type: int) > Execution mode: vectorized > Map 10 > Map Operator Tree: > TableScan > alias: wr1 > Statistics: Num rows: 13749816 Data size: 2585240312 Basic > stats: COMPLETE Column stats: COMPLETE > Reduce Output Operator > key expressions: wr_order_number (type: int) > sort order: + > Map-reduce partition columns: wr_order_number (type: int) > Statistics: Num rows: 13749816 Data size: 2585240312 > Basic stats: COMPLETE Column stats: COMPLETE > Execution mode: vectorized > Map 2 > Map Operator Tree: > TableScan > alias: ws1 > Statistics: Num rows: 143966864 Data size: 33110363004 > Basic stats: COMPLETE Column stats: COMPLETE > Map Join Operator > condition map: > Inner Join 0 to 1 > keys: > 0 > 1 > outputColumnNames: _col1 > input vertices: > 0 Map 1 > Statistics: Num rows: 5180969438964472 Data size: > 20723877755857888 Basic stats: COMPLETE Column stats: COMPLETE > Select Operator > expressions: _col1 (type: int) > outputColumnNames: _col0 > Statistics: Num rows: 5180969438964472 Data size: > 20723877755857888 Basic stats: COMPLETE Column stats: COMPLETE > Group By Operator > keys: _col0 (type: int) > mode: hash > outputColumnNames: _col0 > Statistics: Num rows: 73333928460636 Data size: > 293335713842544 Basic stats: COMPLETE Column stats: COMPLETE > Reduce Output Operator > key expressions: _col0 (type: int) > sort order: + > Map-reduce partition columns: _col0 (type: int) > Statistics: Num rows: 73333928460636 Data size: > 293335713842544 Basic stats: COMPLETE Column stats: COMPLETE > Execution mode: vectorized > Map 5 > Map Operator Tree: > TableScan > alias: ca > filterExpr: ((ca_state = 'TX') and ca_address_sk is not > null) (type: boolean) > Statistics: Num rows: 800000 Data size: 811903688 Basic > stats: COMPLETE Column stats: COMPLETE > Filter Operator > predicate: ((ca_state = 'TX') and ca_address_sk is not > null) (type: boolean) > Statistics: Num rows: 15686 Data size: 1411740 Basic > stats: COMPLETE Column stats: COMPLETE > Select Operator > expressions: ca_address_sk (type: int) > outputColumnNames: _col0 > Statistics: Num rows: 15686 Data size: 62744 Basic > stats: COMPLETE Column stats: COMPLETE > Reduce Output Operator > key expressions: _col0 (type: int) > sort order: + > Map-reduce partition columns: _col0 (type: int) > Statistics: Num rows: 15686 Data size: 62744 Basic > stats: COMPLETE Column stats: COMPLETE > Execution mode: vectorized > Map 6 > Map Operator Tree: > TableScan > alias: d > filterExpr: (d_date BETWEEN '1999-05-01' AND '1999-07-01' > and d_date_sk is not null) (type: boolean) > Statistics: Num rows: 73049 Data size: 81741831 Basic > stats: COMPLETE Column stats: COMPLETE > Filter Operator > predicate: (d_date BETWEEN '1999-05-01' AND '1999-07-01' > and d_date_sk is not null) (type: boolean) > Statistics: Num rows: 36524 Data size: 3579352 Basic > stats: COMPLETE Column stats: COMPLETE > Select Operator > expressions: d_date_sk (type: int) > outputColumnNames: _col0 > Statistics: Num rows: 36524 Data size: 146096 Basic > stats: COMPLETE Column stats: COMPLETE > Reduce Output Operator > key expressions: _col0 (type: int) > sort order: + > Map-reduce partition columns: _col0 (type: int) > Statistics: Num rows: 36524 Data size: 146096 Basic > stats: COMPLETE Column stats: COMPLETE > Execution mode: vectorized > Map 7 > Map Operator Tree: > TableScan > alias: ws1 > filterExpr: (((ws_order_number is not null and > ws_ship_addr_sk is not null) and ws_ship_date_sk is not null) and > ws_web_site_sk is not null) (type: boolean) > Statistics: Num rows: 143966864 Data size: 33110363004 > Basic stats: COMPLETE Column stats: COMPLETE > Filter Operator > predicate: (((ws_order_number is not null and > ws_ship_addr_sk is not null) and ws_ship_date_sk is not null) and > ws_web_site_sk is not null) (type: boolean) > Statistics: Num rows: 143912892 Data size: 3453621540 > Basic stats: COMPLETE Column stats: COMPLETE > Reduce Output Operator > key expressions: ws_order_number (type: int) > sort order: + > Map-reduce partition columns: ws_order_number (type: > int) > Statistics: Num rows: 143912892 Data size: 3453621540 > Basic stats: COMPLETE Column stats: COMPLETE > value expressions: ws_ship_date_sk (type: int), > ws_ship_addr_sk (type: int), ws_web_site_sk (type: int), ws_ext_ship_cost > (type: float), ws_net_profit (type: float) > Execution mode: vectorized > Map 8 > Map Operator Tree: > TableScan > alias: s > filterExpr: ((web_company_name = 'pri') and web_site_sk is > not null) (type: boolean) > Statistics: Num rows: 38 Data size: 70614 Basic stats: > COMPLETE Column stats: COMPLETE > Filter Operator > predicate: ((web_company_name = 'pri') and web_site_sk is > not null) (type: boolean) > Statistics: Num rows: 5 Data size: 460 Basic stats: > COMPLETE Column stats: COMPLETE > Select Operator > expressions: web_site_sk (type: int) > outputColumnNames: _col0 > Statistics: Num rows: 5 Data size: 20 Basic stats: > COMPLETE Column stats: COMPLETE > Map Join Operator > condition map: > Inner Join 0 to 1 > keys: > 0 _col30 (type: int) > 1 _col0 (type: int) > outputColumnNames: _col31, _col32, _col33 > input vertices: > 0 Reducer 4 > Statistics: Num rows: 5 Data size: 22 Basic stats: > COMPLETE Column stats: NONE > Select Operator > expressions: _col31 (type: int), _col32 (type: > float), _col33 (type: float) > outputColumnNames: _col0, _col1, _col2 > Statistics: Num rows: 5 Data size: 22 Basic stats: > COMPLETE Column stats: NONE > Group By Operator > aggregations: count(DISTINCT _col0), sum(_col1), > sum(_col2) > keys: _col0 (type: int) > mode: hash > outputColumnNames: _col0, _col1, _col2, _col3 > Statistics: Num rows: 5 Data size: 22 Basic > stats: COMPLETE Column stats: NONE > Reduce Output Operator > key expressions: _col0 (type: int) > sort order: + > Statistics: Num rows: 5 Data size: 22 Basic > stats: COMPLETE Column stats: NONE > TopN Hash Memory Usage: 0.04 > value expressions: _col2 (type: double), _col3 > (type: double) > Execution mode: vectorized > Reducer 3 > Reduce Operator Tree: > Merge Join Operator > condition map: > Left Semi Join 0 to 1 > keys: > 0 ws_order_number (type: int) > 1 _col0 (type: int) > outputColumnNames: _col2, _col11, _col13, _col17, _col28, > _col33 > Statistics: Num rows: 718857633877870 Data size: > 17252583213068880 Basic stats: COMPLETE Column stats: COMPLETE > Map Join Operator > condition map: > Inner Join 0 to 1 > keys: > 0 _col11 (type: int) > 1 _col0 (type: int) > outputColumnNames: _col2, _col13, _col17, _col28, _col33 > input vertices: > 1 Map 5 > Statistics: Num rows: 14094999486464 Data size: > 281899989729280 Basic stats: COMPLETE Column stats: COMPLETE > Map Join Operator > condition map: > Inner Join 0 to 1 > keys: > 0 _col2 (type: int) > 1 _col0 (type: int) > outputColumnNames: _col13, _col17, _col28, _col33 > input vertices: > 1 Map 6 > Statistics: Num rows: 7047403274240 Data size: > 112758452387840 Basic stats: COMPLETE Column stats: COMPLETE > Select Operator > expressions: _col13 (type: int), _col17 (type: int), > _col28 (type: float), _col33 (type: float) > outputColumnNames: _col2, _col3, _col4, _col5 > Statistics: Num rows: 7047403274240 Data size: > 112758452387840 Basic stats: COMPLETE Column stats: COMPLETE > Reduce Output Operator > key expressions: _col3 (type: int) > sort order: + > Map-reduce partition columns: _col3 (type: int) > Statistics: Num rows: 7047403274240 Data size: > 112758452387840 Basic stats: COMPLETE Column stats: COMPLETE > value expressions: _col2 (type: int), _col4 (type: > float), _col5 (type: float) > Reducer 4 > Reduce Operator Tree: > Merge Join Operator > condition map: > Right Outer Join0 to 1 > keys: > 0 wr_order_number (type: int) > 1 _col3 (type: int) > outputColumnNames: _col13, _col30, _col31, _col32, _col33 > Statistics: Num rows: 12100482980189 Data size: > 242009659603780 Basic stats: COMPLETE Column stats: COMPLETE > Filter Operator > predicate: _col13 is null (type: boolean) > Statistics: Num rows: 0 Data size: 0 Basic stats: NONE > Column stats: COMPLETE > Reduce Output Operator > key expressions: _col30 (type: int) > sort order: + > Map-reduce partition columns: _col30 (type: int) > Statistics: Num rows: 0 Data size: 0 Basic stats: NONE > Column stats: COMPLETE > value expressions: _col31 (type: int), _col32 (type: > float), _col33 (type: float) > Reducer 9 > Reduce Operator Tree: > Group By Operator > aggregations: count(DISTINCT KEY._col0:0._col0), > sum(VALUE._col1), sum(VALUE._col2) > mode: mergepartial > outputColumnNames: _col0, _col1, _col2 > Statistics: Num rows: 1 Data size: 32 Basic stats: COMPLETE > Column stats: NONE > Limit > Number of rows: 100 > Statistics: Num rows: 1 Data size: 32 Basic stats: COMPLETE > Column stats: NONE > File Output Operator > compressed: false > Statistics: Num rows: 1 Data size: 32 Basic stats: > COMPLETE Column stats: NONE > table: > input format: org.apache.hadoop.mapred.TextInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > serde: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > Stage: Stage-0 > Fetch Operator > limit: 100 > Processor Tree: > ListSink > {code} > Logical plan > {code} > HiveSort(fetch=[100]): rowcount = 1.0, cumulative cost = {4.594004456323317E8 > rows, 5.882012320482085E9 cpu, 9.353802456E12 io}, id = 1080 > HiveAggregate(group=[{}], agg#0=[count(DISTINCT $0)], agg#1=[sum($1)], > agg#2=[sum($2)]): rowcount = 1.0, cumulative cost = {4.594004336323317E8 > rows, 5.882012319482085E9 cpu, 9.353802456E12 io}, id = 1078 > HiveProject($f0=[$3], $f1=[$4], $f2=[$5]): rowcount = 464.4485504122314, > cumulative cost = {4.594004336323317E8 rows, 5.882012319482085E9 cpu, > 9.353802456E12 io}, id = 1076 > HiveFilter(condition=[isnull($12)]): rowcount = 464.4485504122314, > cumulative cost = {4.594004336323317E8 rows, 5.882012319482085E9 cpu, > 9.353802456E12 io}, id = 1074 > HiveProject(ws_ship_date_sk=[$1], ws_ship_addr_sk=[$2], > ws_web_site_sk=[$3], ws_order_number=[$4], ws_ext_ship_cost=[$5], > ws_net_profit=[$6], ca_address_sk=[$7], ca_state=[$8], web_site_sk=[$11], > web_company_name=[$12], d_date_sk=[$9], d_date=[$10], wr_order_number=[$0]): > rowcount = 1229395.3129411766, cumulative cost = {4.594004336323317E8 rows, > 5.882012319482085E9 cpu, 9.353802456E12 io}, id = 1185 > HiveJoin(condition=[=($3, $11)], joinType=[inner], > joinAlgorithm=[map_join], cost=[{1.3749818235294119E7 rows, > 1.3749820470588237E7 cpu, 0.0 io}]): rowcount = 1229395.3129411766, > cumulative cost = {4.594004336323317E8 rows, 5.882012319482085E9 cpu, > 9.353802456E12 io}, id = 1183 > HiveJoin(condition=[=($4, $0)], joinType=[right], > joinAlgorithm=[map_join], cost=[{1.3749817E7 rows, 2.7499633E7 cpu, > 2.47496688E10 io}]): rowcount = 1.3749816E7, cumulative cost = > {4.456506153970376E8 rows, 5.868262499011497E9 cpu, 9.353802456E12 io}, id = > 1181 > HiveTableScan(table=[[tpcds_bin_orc_200.web_returns]]): > rowcount = 1.3749816E7, cumulative cost = {0}, id = 974 > HiveJoin(condition=[=($0, $8)], joinType=[inner], > joinAlgorithm=[map_join], cost=[{204.39703763146147 rows, 205.39703763146147 > cpu, 0.0 io}]): rowcount = 1.0, cumulative cost = {4.319007983970376E8 rows, > 5.840762866011497E9 cpu, 9.3290527872E12 io}, id = 1179 > HiveJoin(condition=[=($1, $6)], joinType=[inner], > joinAlgorithm=[map_join], cost=[{1.43966865E8 rows, 1.43966866E8 cpu, 0.0 > io}]): rowcount = 203.39703763146147, cumulative cost = {4.31900594E8 rows, > 5.840762660614459E9 cpu, 9.3290527872E12 io}, id = 1110 > SemiJoin(condition=[=($3, $6)], joinType=[inner]): rowcount > = 1.43966864E8, cumulative cost = {2.87933729E8 rows, 5.696795794614459E9 > cpu, 9.3290527872E12 io}, id = 1058 > HiveTableScan(table=[[tpcds_bin_orc_200.web_sales]]): > rowcount = 1.43966864E8, cumulative cost = {0}, id = 958 > HiveProject(ws_order_number=[$1]): rowcount = > 1.655760644524185E11, cumulative cost = {2.87933728E8 rows, > 5.696795793614459E9 cpu, 9.3290527872E12 io}, id = 1056 > HiveFilter(condition=[<>($0, $2)]): rowcount = > 1.655760644524185E11, cumulative cost = {2.87933728E8 rows, > 5.696795793614459E9 cpu, 9.3290527872E12 io}, id = 1054 > HiveJoin(condition=[=($1, $3)], joinType=[inner], > joinAlgorithm=[common_join], cost=[{2.87933728E8 rows, 5.696795793614459E9 > cpu, 9.3290527872E12 io}]): rowcount = 1.6564734127740878E11, cumulative cost > = {2.87933728E8 rows, 5.696795793614459E9 cpu, 9.3290527872E12 io}, id = 1103 > > HiveTableScan(table=[[tpcds_bin_orc_200.web_sales]]): rowcount = > 1.43966864E8, cumulative cost = {0}, id = 958 > > HiveTableScan(table=[[tpcds_bin_orc_200.web_sales]]): rowcount = > 1.43966864E8, cumulative cost = {0}, id = 958 > HiveFilter(condition=[=($1, 'TX')]): rowcount = 1.0, > cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 1060 > > HiveTableScan(table=[[tpcds_bin_orc_200.customer_address]]): rowcount = > 800000.0, cumulative cost = {0}, id = 965 > HiveFilter(condition=[between(false, $1, '1999-05-01', > '1999-07-01')]): rowcount = 1.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 > io}, id = 1068 > HiveTableScan(table=[[tpcds_bin_orc_200.date_dim]]): > rowcount = 73049.0, cumulative cost = {0}, id = 971 > HiveFilter(condition=[=($1, 'pri')]): rowcount = > 2.235294117647059, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 1064 > HiveTableScan(table=[[tpcds_bin_orc_200.web_site]]): rowcount = > 38.0, cumulative cost = {0}, id = 968 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)