Thank you very much, Billy. Now I understand how aggregation works in queries that involve different aggregation groups.

As you suggest, TopN looks like a very good solution for handling a high-cardinality dimension (HCD) when Top N measures are what you need. As I read in the link you sent, this is because the Top N aggregates are pre-calculated and stored in the base cuboid of the fact table, avoiding cuboid combinations with the HCD dimension. I am going to try it.
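For example, a query of the shape below is the kind the TopN measure is meant to answer from its pre-calculated aggregates. The table and column names (SALES_FACT, CUSTOMER_ID, SALES_INCOME, PART_DT) are only illustrative, not our real model:

    -- Top 10 customers by sales income for a given period; with a TopN
    -- measure on SUM(SALES_INCOME) grouped by CUSTOMER_ID, Kylin can
    -- serve this from the pre-calculated TopN aggregates instead of
    -- building cuboids that combine the high-cardinality CUSTOMER_ID
    -- column.
    SELECT CUSTOMER_ID, SUM(SALES_INCOME)
    FROM SALES_FACT
    WHERE PART_DT BETWEEN '2017-03-01' AND '2017-03-31'
    GROUP BY CUSTOMER_ID
    ORDER BY SUM(SALES_INCOME) DESC
    LIMIT 10;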

However, in addition to the Top N Sales Income measure, I also need to retrieve the sum of Sales Income for one or more customer IDs (usually just one), as in the sketch below. This sum measure is used less often than the other measures, so I thought that defining it in a different aggregation group could be a solution. That way I can still query the HCD dimension when I need to, achieving a trade-off between cube size and query latency. Moreover, since I have more dimensions than the 3 I described in the previous email (11 normal dimensions + 10 derived dimensions) and my cluster is not too big (3 nodes with 21 GB RAM and 4 vCores each), this is the only way I can build this cube in a reasonable time (about 6 hours).
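A minimal sketch of that second kind of query, again with illustrative names only, would be a plain filtered sum with no ORDER BY ... LIMIT, which is why the TopN measure alone does not cover it:

    -- Total sales income for a single customer (names are illustrative).
    SELECT SUM(SALES_INCOME)
    FROM SALES_FACT
    WHERE CUSTOMER_ID = '0001234'
      AND PART_DT BETWEEN '2017-01-01' AND '2017-03-31';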

However, after your email, I think I can achieve the same goal without including the Customer dimension in any aggregation group. I thought it was mandatory to include every dimension in at least one aggregation group, but I have checked that it is possible to leave a defined dimension out of all aggregation groups.

Do you think this is a good strategy for this use case (a trade-off between cube build time, cube size and query latency)?

Moreover, in order to optimize the cube build process (https://kylin.apache.org/docs16/howto/howto_optimize_build.html), I read that it is recommended to mark the rowkey of the HCD dimension as "Shard By". What do you think about this strategy for our case study?

Kind regards,

On 24/04/2017 at 15:18, Billy Liu wrote:
Thanks, Roberto, for the questions. These are quite good ones.

Regarding the aggregation groups in your case: if none of the aggregation groups can answer your query, the query will be executed against the base cuboid, which includes all dimensions. That also explains why performance degrades so much, since the query then relies on post-processing at run time.
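For instance, a query that groups by dimensions from both of your aggregation groups can only be served by the base cuboid. A minimal sketch, using illustrative column names rather than your actual model:

    -- No single aggregation group contains Time, Company and Customer
    -- together, so Kylin answers this from the base cuboid and
    -- aggregates at query time, which is much slower.
    SELECT PART_DT, COMPANY_ID, CUSTOMER_ID, SUM(SALES_INCOME)
    FROM SALES_FACT
    GROUP BY PART_DT, COMPANY_ID, CUSTOMER_ID;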

The TopN measure is a good way to handle the HCD case: http://kylin.apache.org/blog/2016/03/19/approximate-topn-measure/ But if the filter condition has no ordering rule, the TopN measure cannot be used, and the aggregation group will not help either.

2017-04-24 19:14 GMT+08:00 Roberto Tardío Olmos <[email protected]>:

    Hi Kylin Community,

    I have some doubts about how aggregation groups work and about
    good practices for HCD dimensions:

    1. I have created a cube with two aggregation groups. In the
    first aggregation group I have included the Time and Company
    dimensions. In the second aggregation group I have included the
    Customer dimension, which has a cardinality of about 1 million
    rows. This dimension is used less frequently than the dimensions
    in the first aggregation group, and a filter (on some IDs) is
    always applied to it.

    After the cube build, I can execute queries that combine
    dimensions from the two aggregation groups if I need to. However,
    the query latency is much worse than when I define the three
    dimensions together in the same aggregation group. I guess this
    is because the aggregation happens at execution time, since it is
    not pre-calculated as it is when the three dimensions are in the
    same aggregation group.

    How are the two aggregation groups combined at query execution?
    I suppose that the FK reference of any fact involved in the query
    result is stored and known by both aggregation groups. I would
    like to know in more detail how this works.

    2. If I have an HCD dimension that is rarely used in queries,
    and a Customer ID filter is always applied to get data only for
    some customers, is it a good practice to define it in a separate
    aggregation group?

    Regards,


--
*Roberto Tardío Olmos*
/Senior Big Data & Business Intelligence Consultant/

Avenida de Brasil, 17, Planta 16.

28020 Madrid

Landline: 91.788.34.10
