Hi everyone,

I know "hive.groupby.skewindata" can be used when the data is skewed, but I
cannot understand how it works.


I ran a test.


SQL:


select platform, count(distinct did) uv
from acm_expose_v1
where visit_date = '2016-11-06'
group by platform
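
A plan like the one below can presumably be reproduced along these lines. This is only a sketch: it assumes the setting is switched on for the session with SET before running EXPLAIN, and it reuses the table and column names from the query above.

```sql
-- Assumption: skew handling is enabled per session (it is false by default).
SET hive.groupby.skewindata=true;

-- Print the execution plan instead of running the query.
EXPLAIN
SELECT platform, count(DISTINCT did) AS uv
FROM acm_expose_v1
WHERE visit_date = '2016-11-06'
GROUP BY platform;
```

Running the same EXPLAIN with the setting at false should make it easier to see which stage the setting adds.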


EXPLAIN output:


STAGE DEPENDENCIES:
 Stage-1 is a root stage
 Stage-2 depends on stages: Stage-1
 Stage-0 depends on stages: Stage-2


STAGE PLANS:
 Stage: Stage-1
  Map Reduce
   Map Operator Tree:
     TableScan
      alias: acm_expose_v1
      Statistics: Num rows: 653422251 Data size: 130684450268 Basic stats: COMPLETE Column stats: NONE
      Select Operator
       expressions: platform (type: string), did (type: string)
       outputColumnNames: _col0, _col1
       Statistics: Num rows: 653422251 Data size: 130684450268 Basic stats: COMPLETE Column stats: NONE
       Group By Operator
        aggregations: count(DISTINCT _col1)
        keys: _col0 (type: string), _col1 (type: string)
        mode: hash
        outputColumnNames: _col0, _col1, _col2
        Statistics: Num rows: 653422251 Data size: 130684450268 Basic stats: COMPLETE Column stats: NONE
        Reduce Output Operator
         key expressions: _col0 (type: string), _col1 (type: string)
         sort order: ++
         Map-reduce partition columns: _col0 (type: string)
         Statistics: Num rows: 653422251 Data size: 130684450268 Basic stats: COMPLETE Column stats: NONE
   Reduce Operator Tree:
    Group By Operator
     aggregations: count(DISTINCT KEY._col1:0._col0)
     keys: KEY._col0 (type: string)
     mode: partials
     outputColumnNames: _col0, _col1
     Statistics: Num rows: 653422251 Data size: 130684450268 Basic stats: COMPLETE Column stats: NONE
     File Output Operator
      compressed: true
      table:
        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
        serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe


 Stage: Stage-2
  Map Reduce
   Map Operator Tree:
     TableScan
      Reduce Output Operator
       key expressions: _col0 (type: string)
       sort order: +
       Map-reduce partition columns: _col0 (type: string)
       Statistics: Num rows: 653422251 Data size: 130684450268 Basic stats: COMPLETE Column stats: NONE
       value expressions: _col1 (type: bigint)
   Reduce Operator Tree:
    Group By Operator
     aggregations: count(VALUE._col0)
     keys: KEY._col0 (type: string)
     mode: final
     outputColumnNames: _col0, _col1
     Statistics: Num rows: 326711125 Data size: 65342225034 Basic stats: COMPLETE Column stats: NONE
     File Output Operator
      compressed: true
      Statistics: Num rows: 326711125 Data size: 65342225034 Basic stats: COMPLETE Column stats: NONE
      table:
        input format: org.apache.hadoop.mapred.TextInputFormat
        output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe


 Stage: Stage-0
  Fetch Operator
   limit: -1
   Processor Tree:
    ListSink
   
   
I can't see how Stage-2 solves the skew problem. Can anyone explain it to
me? Thanks so much!
