ShengJun Zheng created KYLIN-4945:
-------------------------------------

             Summary: Repartition encoded dataset to avoid data skew caused by 
a single column
                 Key: KYLIN-4945
                 URL: https://issues.apache.org/jira/browse/KYLIN-4945
             Project: Kylin
          Issue Type: Improvement
          Components: Job Engine
    Affects Versions: v4.0.0-beta
            Reporter: ShengJun Zheng
             Fix For: v4.0.0-GA
         Attachments: image-2021-03-24-17-37-57-505.png

In Kylin 4, the global dictionary is split into several buckets. To encode the 
flat source table more efficiently, the source dataset is repartitioned into the 
same number of partitions as the dictionary has buckets. This sometimes has a 
side effect, because repartitioning by a single column is more likely to cause 
data skew.
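
The effect can be reproduced outside Spark with a small simulation. This is 
only a sketch: the column values, the bucket count of 8, and the SHA-256 based 
bucket_of function are made up for illustration and are not Kylin's actual 
partitioner.

```python
import hashlib
from collections import Counter

NUM_BUCKETS = 8  # hypothetical dictionary bucket count


def bucket_of(value, num_buckets):
    """Stand-in for a hash partitioner: equal values always map to the same bucket."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


# Hypothetical skewed column: 9 out of every 10 rows carry the same value.
column_values = ["hot_seller" if i % 10 else f"seller_{i}" for i in range(10000)]

# Repartitioning by this single column sends every "hot_seller" row to one partition.
rows_per_partition = Counter(bucket_of(v, NUM_BUCKETS) for v in column_values)
largest = max(rows_per_partition.values())

print(f"largest partition holds {largest} of {len(column_values)} rows")
```

Because all 9000 rows with the hot value hash to the same bucket, one partition 
holds at least 9000 rows no matter how many buckets are configured.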

We have a case where a topN/count_distinct measure has serious data skew. The 
dataset becomes skewed after the repartition, so a single task takes up the 
majority of the time in the first layer's cuboid build job.

!image-2021-03-24-17-37-57-505.png!

To improve this case, we added a step that repartitions the encoded dataset by 
all RowKey columns, which reduced the first layer's build time from 20 min to 
4 min.
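
Why the extra repartition helps can be sketched with a stdlib-only simulation. 
The column names (seller_id, order_id), the row values, and the partition count 
are hypothetical, and the SHA-256 based partition_of is a stand-in rather than 
the partitioner the build job uses. In Spark itself, a repartition of this kind 
would typically go through Dataset.repartition with the RowKey columns as 
partition expressions.

```python
import hashlib
from collections import Counter

NUM_PARTITIONS = 8  # hypothetical partition count
ROWKEY_COLUMNS = ["seller_id", "order_id"]  # hypothetical RowKey columns


def partition_of(key_parts, num_partitions):
    """Stand-in hash partitioner over one or more column values."""
    data = "\x00".join(str(p) for p in key_parts).encode("utf-8")
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def partition_sizes(rows, key_columns, num_partitions):
    """Return the row count of every partition when hashing on key_columns."""
    counts = Counter(
        partition_of([row[c] for c in key_columns], num_partitions) for row in rows
    )
    return [counts.get(i, 0) for i in range(num_partitions)]


# Hypothetical encoded rows: seller_id is heavily skewed, order_id is unique.
encoded_rows = [
    {"seller_id": 1 if i % 10 else 1000 + i, "order_id": i} for i in range(10000)
]

by_single = partition_sizes(encoded_rows, ["seller_id"], NUM_PARTITIONS)
by_rowkey = partition_sizes(encoded_rows, ROWKEY_COLUMNS, NUM_PARTITIONS)

print("largest partition, single column:", max(by_single))
print("largest partition, all RowKey columns:", max(by_rowkey))
```

Hashing on the combination of RowKey columns spreads the rows that share the 
hot value across partitions, as long as the other columns vary. Rows that are 
identical in every RowKey column would still land in the same partition.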

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
