[jira] [Commented] (HIVE-10194) CBO (Calcite Return Path): Equi join followed by theta join produces a cross product

Sushanth Sowmyan (JIRA) Fri, 08 May 2015 15:27:12 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-10194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535764#comment-14535764
 ]


Sushanth Sowmyan commented on HIVE-10194:
-----------------------------------------

Removing fix version of 1.2.0 in preparation of release, since this is not a 
blocker for 1.2.0.

> CBO (Calcite Return Path): Equi join followed by theta join produces a cross 
> product
> ------------------------------------------------------------------------------------
>
>                 Key: HIVE-10194
>                 URL: https://issues.apache.org/jira/browse/HIVE-10194
>             Project: Hive
>          Issue Type: Sub-task
>          Components: CBO
>            Reporter: Mostafa Mokhtar
>            Assignee: Laljo John Pullokkaran
>
> Query 
> {code}
> SELECT count(distinct ws_order_number) as order_count,
>                sum(ws_ext_ship_cost) as total_shipping_cost,
>                sum(ws_net_profit) as total_net_profit
> FROM web_sales ws1
> JOIN customer_address ca ON (ws1.ws_ship_addr_sk = ca.ca_address_sk)
> JOIN web_site s ON (ws1.ws_web_site_sk = s.web_site_sk)
> JOIN date_dim d ON (ws1.ws_ship_date_sk = d.d_date_sk)
> LEFT SEMI JOIN (SELECT ws2.ws_order_number as ws_order_number
>                                FROM web_sales ws2 JOIN web_sales ws3
>                                ON (ws2.ws_order_number = ws3.ws_order_number)
>                                WHERE ws2.ws_warehouse_sk <> 
> ws3.ws_warehouse_sk
>                       ) ws_wh1
> ON (ws1.ws_order_number = ws_wh1.ws_order_number)
> LEFT OUTER JOIN web_returns wr1 ON (ws1.ws_order_number = wr1.wr_order_number)
> WHERE d.d_date between '1999-05-01' and '1999-07-01' and
>                ca.ca_state = 'TX' and
>                s.web_company_name = 'pri' and
>                wr1.wr_order_number is null
> limit 100
> {code}
> Plan
> {code}
> OK
> Time taken: 0.23 seconds
> Warning: Map Join MAPJOIN[83][bigTable=ws1] in task 'Map 2' is a cross product
> OK
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-1
>     Tez
>       Edges:
>         Map 2 <- Map 1 (BROADCAST_EDGE)
>         Map 8 <- Reducer 4 (BROADCAST_EDGE)
>         Reducer 3 <- Map 2 (SIMPLE_EDGE), Map 5 (BROADCAST_EDGE), Map 6 
> (BROADCAST_EDGE), Map 7 (SIMPLE_EDGE)
>         Reducer 4 <- Map 10 (SIMPLE_EDGE), Reducer 3 (SIMPLE_EDGE)
>         Reducer 9 <- Map 8 (SIMPLE_EDGE)
>       DagName: mmokhtar_20150402132417_1bc8688b-59a0-4909-82a4-b9d386065bbd:3
>       Vertices:
>         Map 1
>             Map Operator Tree:
>                 TableScan
>                   alias: ws1
>                   filterExpr: (((ws_ship_addr_sk = ws_order_number) and 
> (ws_ship_date_sk <> ws_web_site_sk)) and ws_ship_addr_sk is not null) (type: 
> boolean)
>                   Statistics: Num rows: 143966864 Data size: 33110363004 
> Basic stats: COMPLETE Column stats: COMPLETE
>                   Filter Operator
>                     predicate: (((ws_ship_addr_sk = ws_order_number) and 
> (ws_ship_date_sk <> ws_web_site_sk)) and ws_ship_addr_sk is not null) (type: 
> boolean)
>                     Statistics: Num rows: 71974471 Data size: 1151483592 
> Basic stats: COMPLETE Column stats: COMPLETE
>                     Select Operator
>                       expressions: ws_ship_addr_sk (type: int)
>                       outputColumnNames: _col1
>                       Statistics: Num rows: 71974471 Data size: 287862044 
> Basic stats: COMPLETE Column stats: COMPLETE
>                       Reduce Output Operator
>                         sort order:
>                         Statistics: Num rows: 71974471 Data size: 287862044 
> Basic stats: COMPLETE Column stats: COMPLETE
>                         value expressions: _col1 (type: int)
>             Execution mode: vectorized
>         Map 10
>             Map Operator Tree:
>                 TableScan
>                   alias: wr1
>                   Statistics: Num rows: 13749816 Data size: 2585240312 Basic 
> stats: COMPLETE Column stats: COMPLETE
>                   Reduce Output Operator
>                     key expressions: wr_order_number (type: int)
>                     sort order: +
>                     Map-reduce partition columns: wr_order_number (type: int)
>                     Statistics: Num rows: 13749816 Data size: 2585240312 
> Basic stats: COMPLETE Column stats: COMPLETE
>             Execution mode: vectorized
>         Map 2
>             Map Operator Tree:
>                 TableScan
>                   alias: ws1
>                   Statistics: Num rows: 143966864 Data size: 33110363004 
> Basic stats: COMPLETE Column stats: COMPLETE
>                   Map Join Operator
>                     condition map:
>                          Inner Join 0 to 1
>                     keys:
>                       0
>                       1
>                     outputColumnNames: _col1
>                     input vertices:
>                       0 Map 1
>                     Statistics: Num rows: 5180969438964472 Data size: 
> 20723877755857888 Basic stats: COMPLETE Column stats: COMPLETE
>                     Select Operator
>                       expressions: _col1 (type: int)
>                       outputColumnNames: _col0
>                       Statistics: Num rows: 5180969438964472 Data size: 
> 20723877755857888 Basic stats: COMPLETE Column stats: COMPLETE
>                       Group By Operator
>                         keys: _col0 (type: int)
>                         mode: hash
>                         outputColumnNames: _col0
>                         Statistics: Num rows: 73333928460636 Data size: 
> 293335713842544 Basic stats: COMPLETE Column stats: COMPLETE
>                         Reduce Output Operator
>                           key expressions: _col0 (type: int)
>                           sort order: +
>                           Map-reduce partition columns: _col0 (type: int)
>                           Statistics: Num rows: 73333928460636 Data size: 
> 293335713842544 Basic stats: COMPLETE Column stats: COMPLETE
>             Execution mode: vectorized
>         Map 5
>             Map Operator Tree:
>                 TableScan
>                   alias: ca
>                   filterExpr: ((ca_state = 'TX') and ca_address_sk is not 
> null) (type: boolean)
>                   Statistics: Num rows: 800000 Data size: 811903688 Basic 
> stats: COMPLETE Column stats: COMPLETE
>                   Filter Operator
>                     predicate: ((ca_state = 'TX') and ca_address_sk is not 
> null) (type: boolean)
>                     Statistics: Num rows: 15686 Data size: 1411740 Basic 
> stats: COMPLETE Column stats: COMPLETE
>                     Select Operator
>                       expressions: ca_address_sk (type: int)
>                       outputColumnNames: _col0
>                       Statistics: Num rows: 15686 Data size: 62744 Basic 
> stats: COMPLETE Column stats: COMPLETE
>                       Reduce Output Operator
>                         key expressions: _col0 (type: int)
>                         sort order: +
>                         Map-reduce partition columns: _col0 (type: int)
>                         Statistics: Num rows: 15686 Data size: 62744 Basic 
> stats: COMPLETE Column stats: COMPLETE
>             Execution mode: vectorized
>         Map 6
>             Map Operator Tree:
>                 TableScan
>                   alias: d
>                   filterExpr: (d_date BETWEEN '1999-05-01' AND '1999-07-01' 
> and d_date_sk is not null) (type: boolean)
>                   Statistics: Num rows: 73049 Data size: 81741831 Basic 
> stats: COMPLETE Column stats: COMPLETE
>                   Filter Operator
>                     predicate: (d_date BETWEEN '1999-05-01' AND '1999-07-01' 
> and d_date_sk is not null) (type: boolean)
>                     Statistics: Num rows: 36524 Data size: 3579352 Basic 
> stats: COMPLETE Column stats: COMPLETE
>                     Select Operator
>                       expressions: d_date_sk (type: int)
>                       outputColumnNames: _col0
>                       Statistics: Num rows: 36524 Data size: 146096 Basic 
> stats: COMPLETE Column stats: COMPLETE
>                       Reduce Output Operator
>                         key expressions: _col0 (type: int)
>                         sort order: +
>                         Map-reduce partition columns: _col0 (type: int)
>                         Statistics: Num rows: 36524 Data size: 146096 Basic 
> stats: COMPLETE Column stats: COMPLETE
>             Execution mode: vectorized
>         Map 7
>             Map Operator Tree:
>                 TableScan
>                   alias: ws1
>                   filterExpr: (((ws_order_number is not null and 
> ws_ship_addr_sk is not null) and ws_ship_date_sk is not null) and 
> ws_web_site_sk is not null) (type: boolean)
>                   Statistics: Num rows: 143966864 Data size: 33110363004 
> Basic stats: COMPLETE Column stats: COMPLETE
>                   Filter Operator
>                     predicate: (((ws_order_number is not null and 
> ws_ship_addr_sk is not null) and ws_ship_date_sk is not null) and 
> ws_web_site_sk is not null) (type: boolean)
>                     Statistics: Num rows: 143912892 Data size: 3453621540 
> Basic stats: COMPLETE Column stats: COMPLETE
>                     Reduce Output Operator
>                       key expressions: ws_order_number (type: int)
>                       sort order: +
>                       Map-reduce partition columns: ws_order_number (type: 
> int)
>                       Statistics: Num rows: 143912892 Data size: 3453621540 
> Basic stats: COMPLETE Column stats: COMPLETE
>                       value expressions: ws_ship_date_sk (type: int), 
> ws_ship_addr_sk (type: int), ws_web_site_sk (type: int), ws_ext_ship_cost 
> (type: float), ws_net_profit (type: float)
>             Execution mode: vectorized
>         Map 8
>             Map Operator Tree:
>                 TableScan
>                   alias: s
>                   filterExpr: ((web_company_name = 'pri') and web_site_sk is 
> not null) (type: boolean)
>                   Statistics: Num rows: 38 Data size: 70614 Basic stats: 
> COMPLETE Column stats: COMPLETE
>                   Filter Operator
>                     predicate: ((web_company_name = 'pri') and web_site_sk is 
> not null) (type: boolean)
>                     Statistics: Num rows: 5 Data size: 460 Basic stats: 
> COMPLETE Column stats: COMPLETE
>                     Select Operator
>                       expressions: web_site_sk (type: int)
>                       outputColumnNames: _col0
>                       Statistics: Num rows: 5 Data size: 20 Basic stats: 
> COMPLETE Column stats: COMPLETE
>                       Map Join Operator
>                         condition map:
>                              Inner Join 0 to 1
>                         keys:
>                           0 _col30 (type: int)
>                           1 _col0 (type: int)
>                         outputColumnNames: _col31, _col32, _col33
>                         input vertices:
>                           0 Reducer 4
>                         Statistics: Num rows: 5 Data size: 22 Basic stats: 
> COMPLETE Column stats: NONE
>                         Select Operator
>                           expressions: _col31 (type: int), _col32 (type: 
> float), _col33 (type: float)
>                           outputColumnNames: _col0, _col1, _col2
>                           Statistics: Num rows: 5 Data size: 22 Basic stats: 
> COMPLETE Column stats: NONE
>                           Group By Operator
>                             aggregations: count(DISTINCT _col0), sum(_col1), 
> sum(_col2)
>                             keys: _col0 (type: int)
>                             mode: hash
>                             outputColumnNames: _col0, _col1, _col2, _col3
>                             Statistics: Num rows: 5 Data size: 22 Basic 
> stats: COMPLETE Column stats: NONE
>                             Reduce Output Operator
>                               key expressions: _col0 (type: int)
>                               sort order: +
>                               Statistics: Num rows: 5 Data size: 22 Basic 
> stats: COMPLETE Column stats: NONE
>                               TopN Hash Memory Usage: 0.04
>                               value expressions: _col2 (type: double), _col3 
> (type: double)
>             Execution mode: vectorized
>         Reducer 3
>             Reduce Operator Tree:
>               Merge Join Operator
>                 condition map:
>                      Left Semi Join 0 to 1
>                 keys:
>                   0 ws_order_number (type: int)
>                   1 _col0 (type: int)
>                 outputColumnNames: _col2, _col11, _col13, _col17, _col28, 
> _col33
>                 Statistics: Num rows: 718857633877870 Data size: 
> 17252583213068880 Basic stats: COMPLETE Column stats: COMPLETE
>                 Map Join Operator
>                   condition map:
>                        Inner Join 0 to 1
>                   keys:
>                     0 _col11 (type: int)
>                     1 _col0 (type: int)
>                   outputColumnNames: _col2, _col13, _col17, _col28, _col33
>                   input vertices:
>                     1 Map 5
>                   Statistics: Num rows: 14094999486464 Data size: 
> 281899989729280 Basic stats: COMPLETE Column stats: COMPLETE
>                   Map Join Operator
>                     condition map:
>                          Inner Join 0 to 1
>                     keys:
>                       0 _col2 (type: int)
>                       1 _col0 (type: int)
>                     outputColumnNames: _col13, _col17, _col28, _col33
>                     input vertices:
>                       1 Map 6
>                     Statistics: Num rows: 7047403274240 Data size: 
> 112758452387840 Basic stats: COMPLETE Column stats: COMPLETE
>                     Select Operator
>                       expressions: _col13 (type: int), _col17 (type: int), 
> _col28 (type: float), _col33 (type: float)
>                       outputColumnNames: _col2, _col3, _col4, _col5
>                       Statistics: Num rows: 7047403274240 Data size: 
> 112758452387840 Basic stats: COMPLETE Column stats: COMPLETE
>                       Reduce Output Operator
>                         key expressions: _col3 (type: int)
>                         sort order: +
>                         Map-reduce partition columns: _col3 (type: int)
>                         Statistics: Num rows: 7047403274240 Data size: 
> 112758452387840 Basic stats: COMPLETE Column stats: COMPLETE
>                         value expressions: _col2 (type: int), _col4 (type: 
> float), _col5 (type: float)
>         Reducer 4
>             Reduce Operator Tree:
>               Merge Join Operator
>                 condition map:
>                      Right Outer Join0 to 1
>                 keys:
>                   0 wr_order_number (type: int)
>                   1 _col3 (type: int)
>                 outputColumnNames: _col13, _col30, _col31, _col32, _col33
>                 Statistics: Num rows: 12100482980189 Data size: 
> 242009659603780 Basic stats: COMPLETE Column stats: COMPLETE
>                 Filter Operator
>                   predicate: _col13 is null (type: boolean)
>                   Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
> Column stats: COMPLETE
>                   Reduce Output Operator
>                     key expressions: _col30 (type: int)
>                     sort order: +
>                     Map-reduce partition columns: _col30 (type: int)
>                     Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
> Column stats: COMPLETE
>                     value expressions: _col31 (type: int), _col32 (type: 
> float), _col33 (type: float)
>         Reducer 9
>             Reduce Operator Tree:
>               Group By Operator
>                 aggregations: count(DISTINCT KEY._col0:0._col0), 
> sum(VALUE._col1), sum(VALUE._col2)
>                 mode: mergepartial
>                 outputColumnNames: _col0, _col1, _col2
>                 Statistics: Num rows: 1 Data size: 32 Basic stats: COMPLETE 
> Column stats: NONE
>                 Limit
>                   Number of rows: 100
>                   Statistics: Num rows: 1 Data size: 32 Basic stats: COMPLETE 
> Column stats: NONE
>                   File Output Operator
>                     compressed: false
>                     Statistics: Num rows: 1 Data size: 32 Basic stats: 
> COMPLETE Column stats: NONE
>                     table:
>                         input format: org.apache.hadoop.mapred.TextInputFormat
>                         output format: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                         serde: 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>   Stage: Stage-0
>     Fetch Operator
>       limit: 100
>       Processor Tree:
>         ListSink
> {code}
> Logical plan 
> {code}
> HiveSort(fetch=[100]): rowcount = 1.0, cumulative cost = {4.594004456323317E8 
> rows, 5.882012320482085E9 cpu, 9.353802456E12 io}, id = 1080
>   HiveAggregate(group=[{}], agg#0=[count(DISTINCT $0)], agg#1=[sum($1)], 
> agg#2=[sum($2)]): rowcount = 1.0, cumulative cost = {4.594004336323317E8 
> rows, 5.882012319482085E9 cpu, 9.353802456E12 io}, id = 1078
>     HiveProject($f0=[$3], $f1=[$4], $f2=[$5]): rowcount = 464.4485504122314, 
> cumulative cost = {4.594004336323317E8 rows, 5.882012319482085E9 cpu, 
> 9.353802456E12 io}, id = 1076
>       HiveFilter(condition=[isnull($12)]): rowcount = 464.4485504122314, 
> cumulative cost = {4.594004336323317E8 rows, 5.882012319482085E9 cpu, 
> 9.353802456E12 io}, id = 1074
>         HiveProject(ws_ship_date_sk=[$1], ws_ship_addr_sk=[$2], 
> ws_web_site_sk=[$3], ws_order_number=[$4], ws_ext_ship_cost=[$5], 
> ws_net_profit=[$6], ca_address_sk=[$7], ca_state=[$8], web_site_sk=[$11], 
> web_company_name=[$12], d_date_sk=[$9], d_date=[$10], wr_order_number=[$0]): 
> rowcount = 1229395.3129411766, cumulative cost = {4.594004336323317E8 rows, 
> 5.882012319482085E9 cpu, 9.353802456E12 io}, id = 1185
>           HiveJoin(condition=[=($3, $11)], joinType=[inner], 
> joinAlgorithm=[map_join], cost=[{1.3749818235294119E7 rows, 
> 1.3749820470588237E7 cpu, 0.0 io}]): rowcount = 1229395.3129411766, 
> cumulative cost = {4.594004336323317E8 rows, 5.882012319482085E9 cpu, 
> 9.353802456E12 io}, id = 1183
>             HiveJoin(condition=[=($4, $0)], joinType=[right], 
> joinAlgorithm=[map_join], cost=[{1.3749817E7 rows, 2.7499633E7 cpu, 
> 2.47496688E10 io}]): rowcount = 1.3749816E7, cumulative cost = 
> {4.456506153970376E8 rows, 5.868262499011497E9 cpu, 9.353802456E12 io}, id = 
> 1181
>               HiveTableScan(table=[[tpcds_bin_orc_200.web_returns]]): 
> rowcount = 1.3749816E7, cumulative cost = {0}, id = 974
>               HiveJoin(condition=[=($0, $8)], joinType=[inner], 
> joinAlgorithm=[map_join], cost=[{204.39703763146147 rows, 205.39703763146147 
> cpu, 0.0 io}]): rowcount = 1.0, cumulative cost = {4.319007983970376E8 rows, 
> 5.840762866011497E9 cpu, 9.3290527872E12 io}, id = 1179
>                 HiveJoin(condition=[=($1, $6)], joinType=[inner], 
> joinAlgorithm=[map_join], cost=[{1.43966865E8 rows, 1.43966866E8 cpu, 0.0 
> io}]): rowcount = 203.39703763146147, cumulative cost = {4.31900594E8 rows, 
> 5.840762660614459E9 cpu, 9.3290527872E12 io}, id = 1110
>                   SemiJoin(condition=[=($3, $6)], joinType=[inner]): rowcount 
> = 1.43966864E8, cumulative cost = {2.87933729E8 rows, 5.696795794614459E9 
> cpu, 9.3290527872E12 io}, id = 1058
>                     HiveTableScan(table=[[tpcds_bin_orc_200.web_sales]]): 
> rowcount = 1.43966864E8, cumulative cost = {0}, id = 958
>                     HiveProject(ws_order_number=[$1]): rowcount = 
> 1.655760644524185E11, cumulative cost = {2.87933728E8 rows, 
> 5.696795793614459E9 cpu, 9.3290527872E12 io}, id = 1056
>                       HiveFilter(condition=[<>($0, $2)]): rowcount = 
> 1.655760644524185E11, cumulative cost = {2.87933728E8 rows, 
> 5.696795793614459E9 cpu, 9.3290527872E12 io}, id = 1054
>                         HiveJoin(condition=[=($1, $3)], joinType=[inner], 
> joinAlgorithm=[common_join], cost=[{2.87933728E8 rows, 5.696795793614459E9 
> cpu, 9.3290527872E12 io}]): rowcount = 1.6564734127740878E11, cumulative cost 
> = {2.87933728E8 rows, 5.696795793614459E9 cpu, 9.3290527872E12 io}, id = 1103
>                           
> HiveTableScan(table=[[tpcds_bin_orc_200.web_sales]]): rowcount = 
> 1.43966864E8, cumulative cost = {0}, id = 958
>                           
> HiveTableScan(table=[[tpcds_bin_orc_200.web_sales]]): rowcount = 
> 1.43966864E8, cumulative cost = {0}, id = 958
>                   HiveFilter(condition=[=($1, 'TX')]): rowcount = 1.0, 
> cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 1060
>                     
> HiveTableScan(table=[[tpcds_bin_orc_200.customer_address]]): rowcount = 
> 800000.0, cumulative cost = {0}, id = 965
>                 HiveFilter(condition=[between(false, $1, '1999-05-01', 
> '1999-07-01')]): rowcount = 1.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 
> io}, id = 1068
>                   HiveTableScan(table=[[tpcds_bin_orc_200.date_dim]]): 
> rowcount = 73049.0, cumulative cost = {0}, id = 971
>             HiveFilter(condition=[=($1, 'pri')]): rowcount = 
> 2.235294117647059, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 1064
>               HiveTableScan(table=[[tpcds_bin_orc_200.web_site]]): rowcount = 
> 38.0, cumulative cost = {0}, id = 968
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (HIVE-10194) CBO (Calcite Return Path): Equi join followed by theta join produces a cross product

Reply via email to