Loading data into partition taking seven times total of (map+reduce) on highly skewed data

Stephen Boesch Fri, 20 Sep 2013 14:47:17 -0700

We have a small (3GB / 280M rows) table with 435 partitions that is highly
skewed: one partition has nearly 200M rows, two others have nearly 40M apiece,
and the remaining 432 together hold less than 1% of the total table size.


So .. the skew is something to be addressed. However, even given that,
why would the following occur?


Table Structure:

# Partition Information
# col_name            data_type           comment
derived_create_dt     string              None

# Detailed Table Information
..
Protect Mode:         None
Retention:            0
..
Table Type:           MANAGED_TABLE
Table Parameters:
    SORTBUCKETCOLSPREFIX     TRUE
    transient_lastDdlTime    1379678551

# Storage Information
SerDe Library:        org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe
InputFormat:          org.apache.hadoop.hive.ql.io.RCFileInputFormat
OutputFormat:         org.apache.hadoop.hive.ql.io.RCFileOutputFormat
Compressed:           No
Num Buckets:          64
Bucket Columns:       [station_id]
Sort Columns:         [Order(col:station_id, order:1)]
Storage Desc Params:
    serialization.format     1

HIGHLY SKEWED data. This particular load:
    300M rows
    4GB
    435 partitions
    Over 99% of data in just 3 out of the 435 partitions:
        2013-09-18    26733990
        2013-09-19   191634067
        2013-09-20    63790065



Map takes 10 min
Reduce 13 mins
Loading into partitions takes 3 hours 27 minutes
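
For reference, the arithmetic behind those timings can be checked with a short Python sketch. It uses only the figures quoted above; the variable names are illustrative, and the share is computed against the three large partitions only, since the exact total row count is not given. Taken at face value, these timings put the load step at about nine times the combined map and reduce time.

```python
# Timings quoted in the message, in minutes.
map_min = 10
reduce_min = 13
load_min = 3 * 60 + 27  # 3 hours 27 minutes

mapreduce_min = map_min + reduce_min
ratio = load_min / mapreduce_min

# Row counts of the three large partitions listed in the message.
big_partitions = {
    "2013-09-18": 26733990,
    "2013-09-19": 191634067,
    "2013-09-20": 63790065,
}
big_total = sum(big_partitions.values())

# Share of the single largest partition among these three (not of the whole
# table, whose exact row count is not stated).
largest_share = max(big_partitions.values()) / big_total

print(f"map+reduce: {mapreduce_min} min, load: {load_min} min, ratio: {ratio:.1f}x")
print(f"rows in the three large partitions: {big_total:,}")
print(f"largest partition's share of those rows: {largest_share:.1%}")
```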
