Hi,
  If the data has an even distribution, you can set 
"kylin.source.hive.redistribute-flat-table=false" to skip Step 2. As for 
Step 3, if you have many UHC dimensions, you can set 
"kylin.engine.mr.uhc-reducer-count" to a larger value so that more reducers 
are used to build the dictionaries.
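
Both settings go in conf/kylin.properties (or as cube-level configuration
overrides). A minimal sketch; the reducer count of 5 below is only an
illustrative value, not a recommendation:

```properties
# Skip "Redistribute Flat Hive Table" (Step 2) when the flat table
# is already evenly distributed.
kylin.source.hive.redistribute-flat-table=false

# Use more reducers when building dictionaries for ultra-high-cardinality
# (UHC) columns in "Extract Fact Table Distinct Columns" (Step 3).
# The default is 1; 5 here is just an example.
kylin.engine.mr.uhc-reducer-count=5
```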


------------------
Best Regards,
Chao Long


------------------ Original Message ------------------
From: "Jon Shoberg" <[email protected]>
Date: December 20, 2018 (Thursday), 10:20
To: "user" <[email protected]>

Subject: Kylin w/ Spark - Build 626min - Steps 1/2/3 455min - Steps 4-8 - 171min



Question ...

  Is there a way to optimize the first three steps of a Kylin build?


  Total build time of a development cube is 626 minutes and a break down by 
steps:

87  min - Create Intermediate Flat Hive Table
207 min - Redistribute Flat Hive Table
248 min - Extract Fact Table Distinct Columns
0   min
0   min
62  min - Build Cube with Spark
19  min - Convert Cuboid Data to HFile
0   min
0   min
0   min
0   min

   The data set is summary files (~35M records) and detail files (~4B records - 
40GB compressed).


   There is a join needed for the final data, which is handled in a view 
within Hive, so I do expect a performance cost there.
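
One option is to pay that join cost once, outside the build, by materializing
the view into a plain table and pointing the Kylin model at it. A hypothetical
sketch (table and column names are made up):

```sql
-- Hypothetical: pre-materialize the join the Hive view performs,
-- so Step 1 reads a plain ORC table instead of re-running the join.
CREATE TABLE fact_joined STORED AS ORC AS
SELECT d.*, s.summary_col
FROM detail d
JOIN summary s ON d.join_key = s.join_key;
```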


   However, staging the data in other ways (loading into a sequence/ORC file 
vs. an external table over bz2 files) yields no net gain.


   This means pre-processing the data externally can make Kylin itself run a 
little faster, but the overall time from absolute start to finish is still 
~600 min.


   Steps 1/2 seem redundant given how my data is structured; the HiveQL 
commands Kylin sends to Hive could be run before the build process.


   Is it possible to optimize steps 1/2/3? Is it possible to skip steps 1/2 
and jump to step 3 if the data were staged correctly beforehand?


   My guess is the answers are mostly 'no' (which is fine), but I thought I'd 
ask.


   (The test lab is getting doubled in size today, so I'm not ultimately 
worried, but I'm seeking improvements beyond just adding hardware and 
networking.)


Thanks! J
