Hi all, I encounter problems when trying to run map-join for large tables by using bucketing tables. The table is partitioned into 20 buckets: create table if not exists ip_c_bucket (country string, ip_from bigint, ip_to bigint) clustered by(ip_from) into 20 buckets; For test, I let this table join itself: select * from ip_c_bucket a join ip_c_bucket b on a.ip_from=b.ip_from; It's found that the parameter hive.optimize.bucketmapjoin has nothing to do with the query plan. Whether this parameter is set to true or not, the query is the same. However, it seems that map-join has been used from the query plan. Since the table is joining itself, two tables of course have the same number of buckets. Therefore I try sort-merge join by setting parameters : set hive.optimize.bucketmapjoin = true; set hive.optimize.bucketmapjoin.sortedmerge = true; set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat; However the query plan is still the same. Any explanation about these problems?
Here is the query plan: STAGE DEPENDENCIES: Stage-7 is a root stage , consists of Stage-9, Stage-10, Stage-1 Stage-9 has a backup stage: Stage-1 Stage-5 depends on stages: Stage-9 Stage-4 depends on stages: Stage-1, Stage-5, Stage-6 , consists of Stage-8 Stage-8 Stage-3 depends on stages: Stage-8 Stage-10 has a backup stage: Stage-1 Stage-6 depends on stages: Stage-10 Stage-1 Stage-0 is a root stage STAGE PLANS: Stage: Stage-7 Conditional Operator Stage: Stage-9 Map Reduce Local Work Alias -> Map Local Tables: b Fetch Operator limit: -1 Alias -> Map Local Operator Tree: b TableScan alias: b HashTable Sink Operator condition expressions: 0 {country} {ip_from} {ip_to} 1 {country} {ip_from} {ip_to} handleSkewJoin: false keys: 0 [Column[ip_from]] 1 [Column[ip_from]] Position of Big Table: 0 Stage: Stage-5 Map Reduce Alias -> Map Operator Tree: a TableScan alias: a Map Join Operator condition map: Inner Join 0 to 1 condition expressions: 0 {country} {ip_from} {ip_to} 1 {country} {ip_from} {ip_to} handleSkewJoin: false keys: 0 [Column[ip_from]] 1 [Column[ip_from]] outputColumnNames: _col0, _col1, _col2, _col5, _col6, _col7 Position of Big Table: 0 Select Operator expressions: expr: _col0 type: string expr: _col1 type: bigint expr: _col2 type: bigint expr: _col5 type: string expr: _col6 type: bigint expr: _col7 type: bigint outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5 File Output Operator compressed: false GlobalTableId: 0 table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Local Work: Map Reduce Local Work Stage: Stage-4 Conditional Operator Stage: Stage-8 Map Reduce Local Work Alias -> Map Local Tables: 1 Fetch Operator limit: -1 Alias -> Map Local Operator Tree: 1 HashTable Sink Operator condition expressions: 0 {0_VALUE_0} {0_VALUE_1} {0_VALUE_2} 1 {1_VALUE_0} {1_VALUE_1} {1_VALUE_2} handleSkewJoin: false keys: 0 [Column[joinkey0]] 1 [Column[joinkey0]] Position of Big Table: 0 Stage: Stage-3 Map Reduce Alias -> Map Operator Tree: 0 Map Join Operator condition map: Inner Join 0 to 1 condition expressions: 0 {0_VALUE_0} {0_VALUE_1} {0_VALUE_2} 1 {1_VALUE_0} {1_VALUE_1} {1_VALUE_2} handleSkewJoin: false keys: 0 [Column[joinkey0]] 1 [Column[joinkey0]] outputColumnNames: _col0, _col1, _col2, _col5, _col6, _col7 Position of Big Table: 0 Select Operator expressions: expr: _col0 type: string expr: _col1 type: bigint expr: _col2 type: bigint expr: _col5 type: string expr: _col6 type: bigint expr: _col7 type: bigint outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5 File Output Operator compressed: false GlobalTableId: 0 table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Local Work: Map Reduce Local Work Stage: Stage-10 Map Reduce Local Work Alias -> Map Local Tables: a Fetch Operator limit: -1 Alias -> Map Local Operator Tree: a TableScan alias: a HashTable Sink Operator condition expressions: 0 {country} {ip_from} {ip_to} 1 {country} {ip_from} {ip_to} handleSkewJoin: false keys: 0 [Column[ip_from]] 1 [Column[ip_from]] Position of Big Table: 1 Stage: Stage-6 Map Reduce Alias -> Map Operator Tree: b TableScan alias: b Map Join Operator condition map: Inner Join 0 to 1 condition expressions: 0 {country} {ip_from} {ip_to} 1 {country} {ip_from} {ip_to} handleSkewJoin: false keys: 0 [Column[ip_from]] 1 [Column[ip_from]] outputColumnNames: _col0, _col1, _col2, _col5, _col6, _col7 Position of Big Table: 1 Select Operator expressions: expr: _col0 type: string expr: _col1 type: bigint expr: _col2 type: bigint expr: _col5 type: string expr: _col6 type: bigint expr: _col7 type: bigint outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5 File Output Operator compressed: false GlobalTableId: 0 table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Local Work: Map Reduce Local Work Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: a TableScan alias: a Reduce Output Operator key expressions: expr: ip_from type: bigint sort order: + Map-reduce partition columns: expr: ip_from type: bigint tag: 0 value expressions: expr: country type: string expr: ip_from type: bigint expr: ip_to type: bigint b TableScan alias: b Reduce Output Operator key expressions: expr: ip_from type: bigint sort order: + Map-reduce partition columns: expr: ip_from type: bigint tag: 1 value expressions: expr: country type: string expr: ip_from type: bigint expr: ip_to type: bigint Reduce Operator Tree: Join Operator condition map: Inner Join 0 to 1 condition expressions: 0 {VALUE._col0} {VALUE._col1} {VALUE._col2} 1 {VALUE._col0} {VALUE._col1} {VALUE._col2} handleSkewJoin: true outputColumnNames: _col0, _col1, _col2, _col5, _col6, _col7 Select Operator expressions: expr: _col0 type: string expr: _col1 type: bigint expr: _col2 type: bigint expr: _col5 type: string expr: _col6 type: bigint expr: _col7 type: bigint outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5 File Output Operator compressed: false GlobalTableId: 0 table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Stage: Stage-0 Fetch Operator limit: -1