That’s great to know about step 2!

How would you define or determine an even distribution? This is a four-node 
HDFS cluster, and the bz2 files used as the data source (external table) have a 
DFS distribution of 2. I’d imagine the distribution would not be horrible on a 
small cluster. 
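
To make "even" concrete, here is a minimal sketch of one way I could measure it, assuming row counts per input file (or per key) have already been pulled out of Hive. The function name and the sample counts are hypothetical, not anything Kylin provides:

```python
from statistics import mean

def skew_ratio(rows_per_bucket):
    """Return max/mean of the row counts; 1.0 means perfectly even."""
    counts = list(rows_per_bucket.values())
    if not counts:
        raise ValueError("no row counts given")
    return max(counts) / mean(counts)

# Hypothetical counts, e.g. from a GROUP BY on the input file name or a key
even = {"f1": 100, "f2": 100, "f3": 100, "f4": 100}
skewed = {"f1": 370, "f2": 10, "f3": 10, "f4": 10}

print(skew_ratio(even))
print(skew_ratio(skewed))
```

A ratio close to 1 would suggest the data is already spread evenly; what threshold counts as "even enough" to skip Step 2 is a judgment call I don't have a firm number for.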

On the reducer count: this is a Spark setup, so on YARN I see this step running 
as a Spark job. Does a MapReduce setting such as this apply? If so, what counts 
as a larger value? I think the default here is 1 ... should it be 2, 5, 10, or 
100? It’s a 4-node cluster with 10 CPUs and ~550 GB RAM. 
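
For reference, this is how I understand the two settings from the quoted reply would look in kylin.properties. The property names are the ones quoted below; the value 5 is only a placeholder I picked, not a recommendation, and whether the second setting takes effect with the Spark engine is exactly my open question:

```properties
# Skip Step 2 (Redistribute Flat Hive Table) if the data is already evenly distributed
kylin.source.hive.redistribute-flat-table=false
# Use more reducers for UHC dimensions in Step 3 (placeholder value)
kylin.engine.mr.uhc-reducer-count=5
```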

> On Dec 20, 2018, at 7:24 PM, Chao Long <[email protected]> wrote:
> 
> Hi,
>   If the data has an even distribution, you can set 
> "kylin.source.hive.redistribute-flat-table=false" to skip Step 2. As for 
> Step 3, if you have many UHC dimensions, you can set 
> "kylin.engine.mr.uhc-reducer-count" to a larger value to use more reducers to 
> handle the dictionaries.
> 
> ------------------
> Best Regards,
> Chao Long
> ------------------ Original Message ------------------
> From: "Jon Shoberg"<[email protected]>;
> Sent: Thursday, December 20, 2018, 10:20 PM
> To: "user"<[email protected]>;
> Subject: Kylin w/ Spark - Build 626min - Steps 1/2/3 455min - Steps 4-8 - 171min
> 
> Question ...
> 
>   Is there a way to optimize the first three steps of a Kylin build?
> 
>   Total build time of a development cube is 626 minutes. The breakdown by 
> step:
> 87  min - Create Intermediate Flat Hive Table
> 207 min -  Redistribute Flat Hive Table
> 248 min -  Extract Fact Table Distinct Columns
> 0   min
> 0   min
> 62  min -  Build Cube with Spark
> 19  min -  Convert Cuboid Data to HFile
> 0   min
> 0   min
> 0   min
> 0   min
>    The data set is summary files (~35M records) and detail files (~4B records 
> - 40GB compressed).
> 
>    There is a join needed for the final data which is handled in a view 
> within hive.  So I do expect a performance cost there.
> 
>    However, staging the data in other ways (loading to sequence/ORC files vs. 
> an external table over bz2 files) yields no net gain.
> 
>    This means pre-processing the data externally can make Kylin run a little 
> faster, but the overall time from absolute start to finish is still ~600 min.
> 
>    Steps 1/2 seem redundant given how my data is structured; the 
> HiveQL/SQL commands Kylin sends to Hive could be run before the build process.
> 
>    Is it possible to optimize steps 1/2/3? Is it possible to skip steps 1/2 
> and jump to step 3 if the data was staged as-needed/correctly beforehand?
> 
>    My guess is there are mostly 'no' answers here (which is fine), but I 
> thought I'd ask.
> 
>    (The test lab is getting doubled in size today, so I'm not ultimately 
> worried, but I'm seeking other improvements vs. only adding hardware and 
> networking.)
> 
> Thanks! J