Question ... is it better to build dimensions in Kylin or Hive? Source data arrives as bzip files, ~67 of them totaling 40GB compressed and 35B records.
Previously I've been working in Hive to separate the source data into a star schema:

- Load the bzip files to HDFS
- Connect Hive to the files as an external table
- Script the creation of five dimension tables
- Script the creation of a final fact table

(A simplified sketch of steps 2-4 is in the P.S. below.) Within Kylin I then set up the table joins so the dimension values are attached to the fact table. However, scaling out to the full data set, and seeing how steps 1-4 create intermediate data, the work above seems redundant. Would it be more efficient to let Kylin build the star schema, i.e.:

- Load the bzip files to HDFS
- Connect Hive to the files as an external table
- Move the data into a compressed SequenceFile table (Kylin seems to work best with SequenceFiles; second sketch in the P.S.)
- In the cube build, select the dimension and fact columns straight from the source data
- Let Kylin's intermediate processing and subsequent steps organize the source data to finish the cube build

Is it worthwhile to create the tables of dimension values and a fully normalized fact table before going into cube design? Or is it 'better' to do everything in the Kylin cube design, given that my source data ultimately has all the required values (no external joins)? Once the data set is processed it will be static with no further updates. Analysis will likely be done via the Kylin ODBC driver with Tableau and/or a custom app to be developed.

Thanks!
J
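P.S. In case it helps to see what I mean, here is a simplified sketch of the kind of Hive scripts I'm running today for steps 2-4. Table and column names are made up for illustration (the real schema has more fields and four more dimensions):

```sql
-- Simplified sketch only: names and columns are placeholders, not the real schema.
-- Step 2: external table over the bzip files already sitting in HDFS
CREATE EXTERNAL TABLE raw_events (
    event_ts STRING,
    country  STRING,
    product  STRING,
    amount   DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/raw/events';   -- Hive decompresses the .bz2 text files transparently

-- Step 3 (one of the five dimensions): distinct values plus a surrogate key
CREATE TABLE dim_country AS
SELECT row_number() OVER (ORDER BY country) AS country_key,
       country
FROM (SELECT DISTINCT country FROM raw_events) c;

-- Step 4: fact table that swaps the raw values for dimension keys
CREATE TABLE fact_events AS
SELECT d.country_key,
       r.product,
       r.event_ts,
       r.amount
FROM raw_events r
JOIN dim_country d ON r.country = d.country;
```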
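And the SequenceFile conversion in the second approach would be something along these lines (codec choice and names are again just placeholders):

```sql
-- Illustrative only: write one flat, compressed SequenceFile table for Kylin
-- to use directly as its (denormalized) fact source.
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.type=BLOCK;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

CREATE TABLE events_seq
STORED AS SEQUENCEFILE
AS SELECT * FROM raw_events;
```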
