Hi Jon,

Dimension tables exist so that they can be reused across different scenarios
and are easier to maintain. If you don't have those requirements, you can
just keep the dimension columns in the fact table. Kylin supports a single
fact table as well.

Kylin's first one or two steps may seem redundant in some cases, but they
are there to simplify the subsequent processing. For example, the source
table may be a virtual view, or stored in a newer file format that MapReduce
doesn't support; materializing it into a consistent file format makes the
subsequent processing much simpler.
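
As a rough illustration of what "materializing" means here, the small Python
sketch below builds the kind of Hive CREATE TABLE ... AS SELECT statement that
copies the selected columns into one flat table in a single file format. It
is not the exact statement Kylin generates; the function name, table names,
and column names are all made up for the example.

```python
def flat_table_ddl(flat_table, fact_table, columns,
                   storage_format="SEQUENCEFILE", joins=()):
    """Build a Hive CTAS statement that materializes the selected columns
    into a single flat table (hypothetical helper, for illustration only)."""
    select_list = ",\n    ".join(columns)
    lines = [
        f"CREATE TABLE {flat_table}",
        f"STORED AS {storage_format} AS",
        "SELECT",
        f"    {select_list}",
        f"FROM {fact_table}",
    ]
    # Optional lookup joins; with a single fact table this stays empty.
    for dim_table, condition in joins:
        lines.append(f"LEFT JOIN {dim_table} ON {condition}")
    return "\n".join(lines) + ";"


# Single fact table, no dimension tables (all names are hypothetical).
print(flat_table_ddl(
    "events_flat",
    "events_raw",
    ["event_date", "country", "bytes_sent"],
))
```

With a single fact table the statement has no joins at all, which is the
case you describe; passing `joins` shows how separate dimension tables would
be folded into the same flat table.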

Best regards,

Shaofeng Shi 史少锋
Apache Kylin PMC
Work email: [email protected]
Kyligence Inc: https://kyligence.io/

Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: [email protected]
Join Kylin dev mail group: [email protected]




Jon Shoberg <[email protected]> wrote on Tue, Dec 11, 2018 at 4:34 AM:

> Question ... is it better to build dimensions in Kylin or Hive?
>
> Source data arrives as bzip files, ~67 of them totaling 40GB compressed
> and 35B records.
>
> Previously I've been working in Hive to separate source data into a star
> schema:
>
>    - Load bzip files to HDFS
>    - Connect Hive to files as external table
>    - Script the creation of five dimensions
>    - Script the creation of a final fact table
>
> Within Kylin I set up the table joins to connect the dimension values to
> the fact table.
>
> However, after scaling out to the full data and seeing how Steps 1 - 4
> create intermediate data, the above work seems redundant.
>
> Would it be more efficient to let Kylin build the star schema such as:
>
>    - Load bzip files to HDFS
>    - Connect Hive to files as external table
>    - Move data to sequencefile with compression (Kylin seems to work best
>    with sequencefiles)
>    - In the cube build select dimension and fact columns from source data
>    - Let Kylin's intermediate processing and further steps organize the
>    source data to finish the cube build
>
>   Is it worthwhile to create the tables of dimension values and a fully
> normalized fact table before going into cube design?
>
>   Or is it 'better' to do everything in the Kylin cube design, given that
> my source data ultimately has all the required values (no external joins)?
>
>   Once the data set is processed it's going to be static with no further
> updates.  Analysis will likely be done via Kylin ODBC with Tableau and/or a
> custom app to be developed.
>
> Thanks! J
>
>
>
