Question ... is it better to build dimensions in Kylin or Hive? Source data arrives as bzip files, ~67 of them totaling 40GB compressed and 35B records.
Previously I've been working in Hive to separate the source data into a star schema:

- Load the bzip files to HDFS
- Connect Hive to the files as an external table
- Script the creation of five dimension tables
- Script the creation of a final fact table

(A simplified sketch of steps 2-4 is in the P.S. below.) Within Kylin I then set up the table joins so the dimension values are attached to the fact table. However, scaling out to the full data set, and seeing how steps 1-4 create intermediate data, the work above seems redundant. Would it be more efficient to let Kylin build the star schema, i.e.:

- Load the bzip files to HDFS
- Connect Hive to the files as an external table
- Move the data into a compressed SequenceFile table (Kylin seems to work best with SequenceFiles; second sketch in the P.S.)
- In the cube build, select the dimension and fact columns straight from the source data
- Let Kylin's intermediate processing and subsequent steps organize the source data to finish the cube build

Is it worthwhile to create the tables of dimension values and a fully normalized fact table before going into cube design? Or is it 'better' to do everything in the Kylin cube design, given that my source data ultimately has all the required values (no external joins)? Once the data set is processed it will be static with no further updates. Analysis will likely be done via the Kylin ODBC driver with Tableau and/or a custom app to be developed.

Thanks!
J
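P.S. In case it helps to see what I mean, here is a simplified sketch of the kind of Hive scripts I'm running today for steps 2-4. Table and column names are made up for illustration (the real schema has more fields and four more dimensions):

```sql
-- Simplified sketch only: names and columns are placeholders, not the real schema.
-- Step 2: external table over the bzip files already sitting in HDFS
CREATE EXTERNAL TABLE raw_events (
    event_ts STRING,
    country  STRING,
    product  STRING,
    amount   DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/raw/events';   -- Hive decompresses the .bz2 text files transparently

-- Step 3 (one of the five dimensions): distinct values plus a surrogate key
CREATE TABLE dim_country AS
SELECT row_number() OVER (ORDER BY country) AS country_key,
       country
FROM (SELECT DISTINCT country FROM raw_events) c;

-- Step 4: fact table that swaps the raw values for dimension keys
CREATE TABLE fact_events AS
SELECT d.country_key,
       r.product,
       r.event_ts,
       r.amount
FROM raw_events r
JOIN dim_country d ON r.country = d.country;
```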
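And the SequenceFile conversion in the second approach would be something along these lines (codec choice and names are again just placeholders):

```sql
-- Illustrative only: write one flat, compressed SequenceFile table for Kylin
-- to use directly as its (denormalized) fact source.
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.type=BLOCK;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

CREATE TABLE events_seq
STORED AS SEQUENCEFILE
AS SELECT * FROM raw_events;
```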
