Re: Best Practice? Build dimensions in Kylin or Hive?

ShaoFeng Shi Wed, 12 Dec 2018 22:34:06 -0800

Hi Jon,

Separate dimension tables are useful for reuse across different scenarios
and for ease of maintenance. If you don't have those requirements, you can
just keep the dimension columns in the fact table. Kylin supports a single
fact table as well.


Kylin's first one or two build steps may seem redundant in some cases, but
they are there to simplify the subsequent processing. For example, the
source table may be a virtual view, or stored in a new file format that
MapReduce doesn't support; by materializing the data into a consistent file
format, the subsequent processing can be much simpler.
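
As a rough illustration of what that materialization amounts to, here is a
minimal Python sketch that assembles a Hive CREATE TABLE ... AS SELECT
statement copying the chosen columns into one flat table in a single file
format. The helper flat_table_sql and all table and column names are
hypothetical, and this is not the SQL that Kylin itself generates.

```python
def flat_table_sql(flat_table, fact_table, columns, joins=(),
                   storage_format="SEQUENCEFILE"):
    """Build HiveQL that materializes selected columns into one flat table.

    Every name passed in is a hypothetical example; this is only a sketch
    of the idea, not Kylin's actual generated statement.
    """
    select_list = ",\n  ".join(columns)
    lines = [
        f"CREATE TABLE {flat_table}",
        f"STORED AS {storage_format} AS",
        "SELECT",
        f"  {select_list}",
        f"FROM {fact_table}",
    ]
    # Optional lookup-table joins; leave empty for a single fact table.
    for lookup_table, condition in joins:
        lines.append(f"LEFT JOIN {lookup_table} ON {condition}")
    return "\n".join(lines) + ";"


# Single fact table, no joins (hypothetical names).
sql = flat_table_sql(
    "events_flat",
    "events_raw",
    ["events_raw.event_date", "events_raw.country", "events_raw.amount"],
)
print(sql)
```

Whatever the source looks like (a view, an external table over bzip files,
an unusual file format), the later steps then only ever read that one flat
table.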

Best regards,

Shaofeng Shi 史少锋
Apache Kylin PMC
Work email: [email protected]
Kyligence Inc: https://kyligence.io/

Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: [email protected]
Join Kylin dev mail group: [email protected]




Jon Shoberg <[email protected]> wrote on Tue, Dec 11, 2018 at 4:34 AM:

> Question ... is it better to build dimensions in Kylin or Hive?
>
> Source data arrives as bzip files, ~67 of them totaling 40GB compressed
> and 35B records.
>
> Previously I've been working in Hive to separate source data into a star
> schema:
>
>    - Load bzip files to HDFS
>    - Connect Hive to files as external table
>    - Script the creation of five dimensions
>    - Script the creation of a final fact table
>
> Within Kylin I set up the table joins to link the dimension values to the
> fact table.
>
> However, scaling out to the full data and seeing how Steps 1 - 4 create
> intermediate data, the above work seems redundant.
>
> Would it be more efficient to let Kylin build the star schema such as:
>
>    - Load bzip files to HDFS
>    - Connect Hive to files as external table
>    - Move data to sequencefile with compression (Kylin seems to work best
>    with sequencefiles)
>    - In the cube build select dimension and fact columns from source data
>    - Let Kylin's intermediate processing and further steps organize the
>    source data to finish the cube build
>
>   Is it worthwhile to create the tables of dimension values and a fully
> normalized fact table before going into cube design?
>
>   Or is it 'better' to do everything in the Kylin cube design, given that
> my source data ultimately has all the required values (no external joins)?
>
>   Once the data set is processed, it's going to be static with no further
> updates.  Analysis is likely done via Kylin ODBC with Tableau and/or a
> custom app to be developed.
>
> Thanks! J
>
>
>
