Hi Jon,

Dimension tables exist so that they can be reused across different scenarios and are easier to maintain. If you don't have those requirements, you can simply keep the dimension columns in the fact table. Kylin supports a single fact table as well.
Kylin's first one or two build steps may seem redundant in some cases, but they are there to simplify the subsequent processing. For example, the source table may be a virtual view, or stored in a new file format that MapReduce doesn't support; materializing it into a consistent file format makes the subsequent processing much simpler.

Best regards,

Shaofeng Shi 史少锋
Apache Kylin PMC
Work email: [email protected]
Kyligence Inc: https://kyligence.io/

Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: [email protected]
Join Kylin dev mail group: [email protected]

Jon Shoberg <[email protected]> wrote on Tue, Dec 11, 2018 at 4:34 AM:

> Question ... is it better to build dimensions in Kylin or Hive?
>
> Source data arrives as bzip files, ~67 of them totaling 40GB compressed
> and 35B records.
>
> Previously I've been working in Hive to separate source data into a star
> schema:
>
> - Load bzip files to HDFS
> - Connect Hive to files as external table
> - Script the creation of five dimensions
> - Script the creation of a final fact table
>
> Within Kylin I set up the table joins to link the dimension values to the
> fact table.
>
> However, scaling out to the full data and seeing how Steps 1 - 4 create
> intermediate data, the above work seems redundant.
>
> Would it be more efficient to let Kylin build the star schema, such as:
>
> - Load bzip files to HDFS
> - Connect Hive to files as external table
> - Move data to sequencefile with compression (Kylin seems to work best
> with sequencefiles)
> - In the cube build select dimension and fact columns from source data
> - Let Kylin's intermediate processing and further steps organize the
> source data to finish the cube build
>
> Is it worthwhile to create the tables of dimension values and a fully
> normalized fact table before going into cube design?
>
> Or is it 'better' to do everything in the Kylin cube design given that
> my source data ultimately has all the required values (no external joins).
>
> Once the data set is processed it's going to be static, with no further
> updates. Analysis is likely to be done via Kylin ODBC with Tableau and/or
> a custom app to be developed.
>
> Thanks! J
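
For the single-flat-table route discussed in this thread, the "move data to sequencefile with compression" step could be scripted roughly as below. This is only a sketch under stated assumptions: the table names (`raw_events_ext`, `events_flat_seq`), the column names, and the choice of the Snappy codec are all hypothetical and do not come from the thread. The script only builds the HiveQL text; running it against a cluster (for example through the Hive CLI or Beeline) is left out.

```python
def build_flat_table_statements(source_table, target_table, columns):
    """Return HiveQL statements that copy the selected columns from an
    external table into a block-compressed SequenceFile table.

    All table and column names passed in are placeholders; the Snappy
    codec is an assumption and any codec installed on the cluster works.
    """
    if not columns:
        raise ValueError("at least one column is required")
    column_list = ", ".join(columns)
    return [
        # Ask Hive to compress the final job output.
        "SET hive.exec.compress.output=true",
        # Block compression is the usual choice for SequenceFiles.
        "SET mapreduce.output.fileoutputformat.compress.type=BLOCK",
        "SET mapreduce.output.fileoutputformat.compress.codec="
        "org.apache.hadoop.io.compress.SnappyCodec",
        # CTAS: materialize the flat table in SequenceFile format.
        f"CREATE TABLE {target_table} STORED AS SEQUENCEFILE AS "
        f"SELECT {column_list} FROM {source_table}",
    ]


if __name__ == "__main__":
    statements = build_flat_table_statements(
        "raw_events_ext",                      # hypothetical external table
        "events_flat_seq",                     # hypothetical target table
        ["event_date", "region", "amount"],    # hypothetical columns
    )
    for statement in statements:
        print(statement + ";")
```

The resulting flat table would then be the single fact table loaded into Kylin, with dimensions and measures picked from its columns in the cube design.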
