An even distribution means the data is not skewed. If data skew happens, some 
tasks' execution times can be much longer than the average. The 
RedistributeFlatHiveTableStep exists to avoid data skew as far as possible; 
for more details see 
https://issues.apache.org/jira/browse/KYLIN-1656
https://issues.apache.org/jira/browse/KYLIN-1677
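
To illustrate what that step does (a rough sketch, not the exact statement 
Kylin generates, and the table name here is made up), it rewrites the flat 
table so rows are spread evenly across reducers, along the lines of:

    INSERT OVERWRITE TABLE kylin_intermediate_flat_table
    SELECT * FROM kylin_intermediate_flat_table
    DISTRIBUTE BY RAND();

DISTRIBUTE BY RAND() sends rows to reducers at random, so no single reducer 
ends up with a skewed share of the data.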


And the parameter "kylin.engine.mr.uhc-reducer-count" works for both MapReduce 
and Spark. In Spark, a larger value means more tasks are allocated. To choose 
a value, look at the task execution state of the "Extract Fact Table Distinct 
Columns" job in the Spark UI, identify the most time-consuming tasks, and set 
this parameter accordingly. I don't know the exact value it should be.
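
For example, in kylin.properties (the value below is only an illustration, not 
a recommendation; tune it against what you see in the Spark UI):

    kylin.engine.mr.uhc-reducer-count=5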



------------------
Best Regards,
Chao Long


------------------ Original Message ------------------
From: "Jon Shoberg"<[email protected]>;
Sent: Dec 21, 2018 (Friday) 10:34
To: "user"<[email protected]>;

Subject: Re: Re: Kylin w/ Spark - Build 626min - Steps 1/2/3 455min - Steps 4-8 
- 171min



That's great to know about step 2!

How would you define or determine an even distribution? This is a four-node 
HDFS cluster, and the bz2 files used as the data source (external table) have 
a DFS replication factor of 2. I'd imagine the distribution would not be 
horrible on a small cluster. 


On the reducer count: this is a Spark setup, so on YARN I see this step running 
as a Spark job. Does a MapReduce setting such as this apply? If so, what counts 
as a larger value? I think the default here is 1 ... should it be 2, 5, 10, or 
100? It's a 4-node cluster with 10 CPUs and ~550GB RAM. 

Sent from my iPhoneX

On Dec 20, 2018, at 7:24 PM, Chao Long <[email protected]> wrote:


Hi,
  If the data has an even distribution, you can set 
"kylin.source.hive.redistribute-flat-table=false" to skip Step 2. And for 
Step 3, if you have many UHC dimensions, you can set 
"kylin.engine.mr.uhc-reducer-count" to a larger value to use more reducers to 
build the dictionaries.
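
For example, in kylin.properties (the reducer count below is only an 
illustration):

    kylin.source.hive.redistribute-flat-table=false
    kylin.engine.mr.uhc-reducer-count=5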


------------------
Best Regards,
Chao Long


------------------ Original Message ------------------
From: "Jon Shoberg"<[email protected]>;
Sent: Dec 20, 2018 (Thursday) 10:20
To: "user"<[email protected]>;

Subject: Kylin w/ Spark - Build 626min - Steps 1/2/3 455min - Steps 4-8 - 171min



Question ...

  Is there a way to optimize the first three steps of a Kylin build?


   Total build time of a development cube is 626 minutes, broken down by step:

87  min - Create Intermediate Flat Hive Table
207 min - Redistribute Flat Hive Table
248 min - Extract Fact Table Distinct Columns
0   min
0   min
62  min - Build Cube with Spark
19  min - Convert Cuboid Data to HFile
0   min
0   min
0   min
0   min

   The data set is summary files (~35M records) and detail files (~4B records - 
40GB compressed).


   There is a join needed for the final data, which is handled in a view within 
Hive, so I do expect a performance cost there.


   However, staging the data other ways (loading to sequence/ORC files vs. an 
external table over bz2 files) yields no net gain.


   This means pre-processing the data externally can make Kylin run a little 
faster, but the overall time from absolute start to finish is still ~600min.


   Steps 1/2 seem to be redundant given how my data is structured; the HQL/SQL 
commands Kylin sends to Hive could be run before the build process.


   Is it possible to optimize steps 1/2/3? Is it possible to skip steps 1/2 and 
jump to step 3 if the data was staged as-needed/correctly beforehand?


   My guess is these are mostly 'no' answers here (which is fine), but I 
thought I'd ask.


   (The test lab is getting doubled in size today so I'm not ultimately worried 
but I'm seeking other improvements vs. only adding hardware and networking)


Thanks! J
