about bucket map join

sky88088 Wed, 21 May 2014 22:52:28 -0700

Hi all, I encounter problems when trying to run map-join for large tables by 
using bucketing tables.            The table is partitioned into 20 buckets:    
   create table if not exists ip_c_bucket (country string, ip_from bigint, 
ip_to bigint) clustered by(ip_from) into 20 buckets;
        For test, I let this table join itself: select * from ip_c_bucket a 
join ip_c_bucket b on a.ip_from=b.ip_from;
        It's found that the parameter hive.optimize.bucketmapjoin has nothing 
to do with the query plan. Whether this parameter is set to true or not, the 
query is the same. However, it seems that map-join has been used from the query 
plan. 
        Since the table is joining itself, two tables of course have the same 
number of buckets. Therefore I try sort-merge join by setting parameters :      
  set hive.optimize.bucketmapjoin = true;         set 
hive.optimize.bucketmapjoin.sortedmerge = true;     set 
hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat; 
        However the query plan is still the same.
        Any explanation about these problems?



        Here is the query plan:
        STAGE DEPENDENCIES:  Stage-7 is a root stage , consists of Stage-9, 
Stage-10, Stage-1  Stage-9 has a backup stage: Stage-1  Stage-5 depends on 
stages: Stage-9  Stage-4 depends on stages: Stage-1, Stage-5, Stage-6 , 
consists of Stage-8  Stage-8  Stage-3 depends on stages: Stage-8  Stage-10 has 
a backup stage: Stage-1  Stage-6 depends on stages: Stage-10  Stage-1  Stage-0 
is a root stage
STAGE PLANS:  Stage: Stage-7    Conditional Operator
  Stage: Stage-9    Map Reduce Local Work      Alias -> Map Local Tables:       
 b           Fetch Operator            limit: -1      Alias -> Map Local 
Operator Tree:        b           TableScan            alias: b            
HashTable Sink Operator              condition expressions:                0 
{country} {ip_from} {ip_to}                1 {country} {ip_from} {ip_to}        
      handleSkewJoin: false              keys:                0 
[Column[ip_from]]                1 [Column[ip_from]]              Position of 
Big Table: 0
  Stage: Stage-5    Map Reduce      Alias -> Map Operator Tree:        a        
   TableScan            alias: a            Map Join Operator              
condition map:                   Inner Join 0 to 1              condition 
expressions:                0 {country} {ip_from} {ip_to}                1 
{country} {ip_from} {ip_to}              handleSkewJoin: false              
keys:                0 [Column[ip_from]]                1 [Column[ip_from]]     
         outputColumnNames: _col0, _col1, _col2, _col5, _col6, _col7            
  Position of Big Table: 0              Select Operator                
expressions:                      expr: _col0                      type: string 
                     expr: _col1                      type: bigint              
        expr: _col2                      type: bigint                      
expr: _col5                      type: string                      expr: _col6  
                    type: bigint                      expr: _col7               
       type: bigint                outputColumnNames: _col0, _col1, _col2, 
_col3, _col4, _col5                File Output Operator                  
compressed: false                  GlobalTableId: 0                  table:     
                 input format: org.apache.hadoop.mapred.TextInputFormat         
             output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat      Local Work:     
   Map Reduce Local Work
  Stage: Stage-4    Conditional Operator
  Stage: Stage-8    Map Reduce Local Work      Alias -> Map Local Tables:       
 1           Fetch Operator            limit: -1      Alias -> Map Local 
Operator Tree:        1             HashTable Sink Operator              
condition expressions:                0 {0_VALUE_0} {0_VALUE_1} {0_VALUE_2}     
           1 {1_VALUE_0} {1_VALUE_1} {1_VALUE_2}              handleSkewJoin: 
false              keys:                0 [Column[joinkey0]]                1 
[Column[joinkey0]]              Position of Big Table: 0
  Stage: Stage-3    Map Reduce      Alias -> Map Operator Tree:        0        
     Map Join Operator              condition map:                   Inner Join 
0 to 1              condition expressions:                0 {0_VALUE_0} 
{0_VALUE_1} {0_VALUE_2}                1 {1_VALUE_0} {1_VALUE_1} {1_VALUE_2}    
          handleSkewJoin: false              keys:                0 
[Column[joinkey0]]                1 [Column[joinkey0]]              
outputColumnNames: _col0, _col1, _col2, _col5, _col6, _col7              
Position of Big Table: 0              Select Operator                
expressions:                      expr: _col0                      type: string 
                     expr: _col1                      type: bigint              
        expr: _col2                      type: bigint                      
expr: _col5                      type: string                      expr: _col6  
                    type: bigint                      expr: _col7               
       type: bigint                outputColumnNames: _col0, _col1, _col2, 
_col3, _col4, _col5                File Output Operator                  
compressed: false                  GlobalTableId: 0                  table:     
                 input format: org.apache.hadoop.mapred.TextInputFormat         
             output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat      Local Work:     
   Map Reduce Local Work
  Stage: Stage-10    Map Reduce Local Work      Alias -> Map Local Tables:      
  a           Fetch Operator            limit: -1      Alias -> Map Local 
Operator Tree:        a           TableScan            alias: a            
HashTable Sink Operator              condition expressions:                0 
{country} {ip_from} {ip_to}                1 {country} {ip_from} {ip_to}        
      handleSkewJoin: false              keys:                0 
[Column[ip_from]]                1 [Column[ip_from]]              Position of 
Big Table: 1
  Stage: Stage-6    Map Reduce      Alias -> Map Operator Tree:        b        
   TableScan            alias: b            Map Join Operator              
condition map:                   Inner Join 0 to 1              condition 
expressions:                0 {country} {ip_from} {ip_to}                1 
{country} {ip_from} {ip_to}              handleSkewJoin: false              
keys:                0 [Column[ip_from]]                1 [Column[ip_from]]     
         outputColumnNames: _col0, _col1, _col2, _col5, _col6, _col7            
  Position of Big Table: 1              Select Operator                
expressions:                      expr: _col0                      type: string 
                     expr: _col1                      type: bigint              
        expr: _col2                      type: bigint                      
expr: _col5                      type: string                      expr: _col6  
                    type: bigint                      expr: _col7               
       type: bigint                outputColumnNames: _col0, _col1, _col2, 
_col3, _col4, _col5                File Output Operator                  
compressed: false                  GlobalTableId: 0                  table:     
                 input format: org.apache.hadoop.mapred.TextInputFormat         
             output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat      Local Work:     
   Map Reduce Local Work
  Stage: Stage-1    Map Reduce      Alias -> Map Operator Tree:        a        
   TableScan            alias: a            Reduce Output Operator              
key expressions:                    expr: ip_from                    type: 
bigint              sort order: +              Map-reduce partition columns:    
                expr: ip_from                    type: bigint              tag: 
0              value expressions:                    expr: country              
      type: string                    expr: ip_from                    type: 
bigint                    expr: ip_to                    type: bigint        b  
         TableScan            alias: b            Reduce Output Operator        
      key expressions:                    expr: ip_from                    
type: bigint              sort order: +              Map-reduce partition 
columns:                    expr: ip_from                    type: bigint       
       tag: 1              value expressions:                    expr: country  
                  type: string                    expr: ip_from                 
   type: bigint                    expr: ip_to                    type: bigint  
    Reduce Operator Tree:        Join Operator          condition map:          
     Inner Join 0 to 1          condition expressions:            0 
{VALUE._col0} {VALUE._col1} {VALUE._col2}            1 {VALUE._col0} 
{VALUE._col1} {VALUE._col2}          handleSkewJoin: true          
outputColumnNames: _col0, _col1, _col2, _col5, _col6, _col7          Select 
Operator            expressions:                  expr: _col0                  
type: string                  expr: _col1                  type: bigint         
         expr: _col2                  type: bigint                  expr: _col5 
                 type: string                  expr: _col6                  
type: bigint                  expr: _col7                  type: bigint         
   outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5            File 
Output Operator              compressed: false              GlobalTableId: 0    
          table:                  input format: 
org.apache.hadoop.mapred.TextInputFormat                  output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
  Stage: Stage-0    Fetch Operator      limit: -1

about bucket map join

Reply via email to