We have a small (3GB /280M rows) table with 435 partitions that is highly
skewed: one partition has nearly 200M, two others have nearly 40M apiece,
then the remaining 432 have all together less than 1% of total table size.
So .. the skew is something to be addressed. However - even give that -
why would the following occur?
Table Structure:
# Partition Information
# col_name data_type comment
derived_create_dt string None
# Detailed Table Information
..
Protect Mode: None
Retention: 0
..
Table Type: MANAGED_TABLE
Table Parameters:
SORTBUCKETCOLSPREFIX TRUE
transient_lastDdlTime 1379678551
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe
InputFormat: org.apache.hadoop.hive.ql.io.RCFileInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.RCFileOutputFormat
Compressed: No
Num Buckets: 64
Bucket Columns: [station_id]
Sort Columns: [Order(col:station_id, order:1)]
Storage Desc Params:
serialization.format 1
HIGHLY SKEWED data: although
This particular load:
300M rows
4GB
435 partitions
Over 99% of data in just 3 out of the 435 partitons
2013-09-18 26733990
2013-09-19 191634067
2013-09-20 63790065
Map takes 10 min
Reduce 13 mins
Loading into partitions takes 3 hours 27 minutes