That’s great to know about step 2! How would you define or determine an even distribution? This is a four-node HDFS cluster, and the bz2 files used as the data source (external table) have a dfs distribution of 2. I’d imagine the distribution would not be horrible on a small cluster.
On the reducer count: this is a Spark setup, so on YARN I see this step running as a Spark job. Does a MapReduce setting such as this apply? If so, what is a larger value? I think the default here is 1 ... should it be 2, 5, 10, or 100? It’s a 4-node cluster with 10 CPUs and ~550 GB RAM.

Sent from my iPhoneX

> On Dec 20, 2018, at 7:24 PM, Chao Long <[email protected]> wrote:
>
> Hi,
> If the data have an even distribution, you can set
> "kylin.source.hive.redistribute-flat-table=false" to skip Step 2. And about
> Step 3, if you have many UHC dimensions, you can set
> "kylin.engine.mr.uhc-reducer-count" to a larger value to use more reducers to
> handle the dictionaries.
>
> ------------------
> Best Regards,
> Chao Long
>
> ------------------ Original Message ------------------
> From: "Jon Shoberg"<[email protected]>;
> Sent: Thursday, December 20, 2018, 10:20 PM
> To: "user"<[email protected]>;
> Subject: Kylin w/ Spark - Build 626min - Steps 1/2/3 455min - Steps 4-8 - 171min
>
> Question ...
>
> Is there a way to optimize the first three steps of a Kylin build?
>
> Total build time of a development cube is 626 minutes, and a breakdown by
> steps:
> 87 min - Create Intermediate Flat Hive Table
> 207 min - Redistribute Flat Hive Table
> 248 min - Extract Fact Table Distinct Columns
> 0 min
> 0 min
> 62 min - Build Cube with Spark
> 19 min - Convert Cuboid Data to HFile
> 0 min
> 0 min
> 0 min
> 0 min
>
> The data set is summary files (~35M records) and detail files (~4B records
> - 40GB compressed).
>
> There is a join needed for the final data, which is handled in a view
> within Hive. So I do expect a performance cost there.
>
> However, staging the data other ways (loading to sequence/ORC files vs.
> an external table over bz2 files) gives no net gain.
>
> This means pre-processing the data externally can make Kylin run a little
> faster, but the overall time from absolute start to finish is still ~600min.
>
> Steps 1/2 seem to be a redundancy given how my data is structured; the
> HQL/SQL commands Kylin sends to Hive could be done before the build process.
>
> Is it possible to optimize steps 1/2/3? Is it possible to skip steps 1/2
> and jump to step 3 if the data was staged as-needed/correctly beforehand?
>
> My guess is there are mostly 'no' answers here (which is fine) but
> thought I'd ask.
>
> (The test lab is getting doubled in size today so I'm not ultimately
> worried, but I'm seeking other improvements vs. only adding hardware and
> networking)
>
> Thanks! J
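
For reference, a minimal sketch of how the two settings named in Chao Long's reply might look in conf/kylin.properties. The property names are taken from the reply; the reducer count of 5 is only a placeholder picked from the values asked about above, not a recommendation from the thread, and whether the second setting has any effect when the step runs as a Spark job is exactly the open question.

```properties
# Skip Step 2 (Redistribute Flat Hive Table).
# Per the reply, only appropriate if the data is already evenly distributed.
kylin.source.hive.redistribute-flat-table=false

# Use more reducers for UHC dimension dictionaries in Step 3
# (Extract Fact Table Distinct Columns).
# The value 5 is an assumed placeholder; tune it for the cluster.
kylin.engine.mr.uhc-reducer-count=5
```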
