Thank you very much, Billy. Now I understand how aggregation works in queries that involve different aggregation groups.

As you suggest, TopN looks like a very good solution for handling a high-cardinality dimension (HCD) when Top N measures are what you need. As I read in the link you sent, this is because the Top N aggregates are pre-calculated and stored in the base cuboid of the fact table, avoiding cuboid combinations with the HCD dimension. I am going to try it.
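For example, a query of the shape below is the kind the TopN measure is meant to answer from its pre-calculated aggregates. The table and column names (SALES_FACT, CUSTOMER_ID, SALES_INCOME, PART_DT) are only illustrative, not our real model:

    -- Top 10 customers by sales income for a given period; with a TopN
    -- measure on SUM(SALES_INCOME) grouped by CUSTOMER_ID, Kylin can
    -- serve this from the pre-calculated TopN aggregates instead of
    -- building cuboids that combine the high-cardinality CUSTOMER_ID
    -- column.
    SELECT CUSTOMER_ID, SUM(SALES_INCOME)
    FROM SALES_FACT
    WHERE PART_DT BETWEEN '2017-03-01' AND '2017-03-31'
    GROUP BY CUSTOMER_ID
    ORDER BY SUM(SALES_INCOME) DESC
    LIMIT 10;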

However, in addition to the Top N Sales Income measure, I also need to retrieve the sum of Sales Income for one or more customer IDs (usually just one), as in the sketch below. This sum measure is used less often than the other measures, so I thought that defining it in a different aggregation group could be a solution. That way I can still query the HCD dimension when I need to, achieving a trade-off between cube size and query latency. Moreover, since I have more dimensions than the 3 I described in the previous email (11 normal dimensions + 10 derived dimensions) and my cluster is not too big (3 nodes with 21 GB RAM and 4 vCores each), this is the only way I can build this cube in a reasonable time (about 6 hours).
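A minimal sketch of that second kind of query, again with illustrative names only, would be a plain filtered sum with no ORDER BY ... LIMIT, which is why the TopN measure alone does not cover it:

    -- Total sales income for a single customer (names are illustrative).
    SELECT SUM(SALES_INCOME)
    FROM SALES_FACT
    WHERE CUSTOMER_ID = '0001234'
      AND PART_DT BETWEEN '2017-01-01' AND '2017-03-31';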

However, after your email, I think I can achieve the same goal without including the Customer dimension in any aggregation group. I thought it was mandatory to include every dimension in at least one aggregation group, but I have checked that it is possible to leave a defined dimension out of all aggregation groups.

Do you think this is a good strategy for this use case (a trade-off between cube build time, cube size and query latency)?

Moreover, in order to optimize the cube build process (https://kylin.apache.org/docs16/howto/howto_optimize_build.html), I read that it is recommended to mark the rowkey of the HCD dimension as "Shard By". What do you think about this strategy for our case study?

Kind regards,

On 24/04/2017 at 15:18, Billy Liu wrote:
Thanks, Roberto, for the questions. These are quite good ones.

Regarding the aggregation groups in your case: if none of the aggregation groups can answer your query, the query will be executed against the base cuboid, which includes all dimensions. That also explains why performance degrades so much, since the query then relies on post-processing at run time.
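For instance, a query that groups by dimensions from both of your aggregation groups can only be served by the base cuboid. A minimal sketch, using illustrative column names rather than your actual model:

    -- No single aggregation group contains Time, Company and Customer
    -- together, so Kylin answers this from the base cuboid and
    -- aggregates at query time, which is much slower.
    SELECT PART_DT, COMPANY_ID, CUSTOMER_ID, SUM(SALES_INCOME)
    FROM SALES_FACT
    GROUP BY PART_DT, COMPANY_ID, CUSTOMER_ID;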

The TopN measure is a good way to handle the HCD case: http://kylin.apache.org/blog/2016/03/19/approximate-topn-measure/ But if the filter condition has no ordering rule, the TopN measure cannot be used, and the aggregation group will not help either.

2017-04-24 19:14 GMT+08:00 Roberto Tardío Olmos <[email protected]>:

    Hi Kylin Community,

    I have some doubts about how aggregation groups work and about
    good practices for HCD dimensions:

    1. I have created a cube with two aggregation groups. In the
    first aggregation group I have included the Time and Company
    dimensions. In the second aggregation group I have included the
    Customer dimension, which has a cardinality of about 1 million
    rows. This dimension is used less frequently than the dimensions
    in the first aggregation group, and a filter (on some IDs) is
    always applied to it.

    After the cube build, I can execute queries that combine
    dimensions from the two aggregation groups if I need to. However,
    the query latency is much worse than when I define the three
    dimensions together in the same aggregation group. I guess this
    is because the aggregation happens at execution time, since it is
    not pre-calculated as it is when the three dimensions are in the
    same aggregation group.

    How are the two aggregation groups combined at query execution?
    I suppose that the FK reference of any fact involved in the query
    result is stored and known by both aggregation groups. I would
    like to know in more detail how this works.

    2. If I have an HCD dimension that is rarely used in queries,
    and a Customer ID filter is always applied to get data only for
    some customers, is it a good practice to define it in a separate
    aggregation group?

    Regards,


--
*Roberto Tardío Olmos*
/Senior Big Data & Business Intelligence Consultant/

Avenida de Brasil, 17, Planta 16.

28020 Madrid

Landline: 91.788.34.10
