Hi, Na: Thanks for your reply. Let me explain why I created two cube for the same fact table.
Normally the query is for accessing aggregated data such as SUM(<data-column>). But sometimes we need to query some days better defined with TOPN function, especially when th cardinality of the group by dimension in the top n is hugh. In that case, using normal cube would be asking kyli n to store high fibroid table. So I was hoping the kylin cube candidate algorithm can based on the query and pick the right cube candidate. I am trying to study how kylin assigns cost and decided multiplier to each cube candidate. Since I don't much about kylin internal terminology, it is not obvious to me what the java code is trying to accomplish. Ideally, if in the query we can find the ingredient of applying top n function, then we should assign lower cost to top n cube,otherwise, we should avoid using top n for non-stop query. So I thought, the basic ingredient to apply top n cube is (1) select list contain s something like sum(x); (2) group by list include y; (3) there's should be order by sum(x); (4) there should be limit n. That should warrant the apply of topn(x), group by y metric. Thanks for any help to understand the cost evaluation of cube candidate. Kang-sen ________________________________ From: Na Zhai <[email protected]> Sent: Wednesday, January 9, 2019 3:12:07 AM To: [email protected] Subject: 答复: question about how kylin chooses a specific cube design over others Hi, Kang-Sen Lu. If multiple cubes contain the same table, Kylin will do the following thing. 1. Kylin will choose the cube that contains all the dimensions in your SQL. 2. if there are still more than one cubes, Kylin will choose the cube that has less cost(it depends on the dimension, measure and inner join number). 发送自 Windows 10 版邮件<https://go.microsoft.com/fwlink/?LinkId=550986>应用 ________________________________ 发件人: Kang-Sen Lu <[email protected]> 发送时间: Monday, January 7, 2019 11:11:09 PM 收件人: [email protected] 主题: RE: question about how kylin chooses a specific cube design over others Let me clarify my topn query problem clearly. Here is the query: SELECT (COUNT(DISTINCT ZETTICSDW.A_MA_HOURLY_V.SUBSCRIBER_ID)), (SUM(ZETTICSDW.A_MA_HOURLY_V.HITS)), (SUM(ZETTICSDW.A_MA_HOURLY_V.PAGE_VIEWS)), (SUM(ZETTICSDW.A_MA_HOURLY_V.SESSIONS)), (SUM(ZETTICSDW.A_MA_HOURLY_V.SESSION_TIME)), (SUM(ZETTICSDW.A_MA_HOURLY_V.DOWN_BYTES)), (SUM(ZETTICSDW.A_MA_HOURLY_V.DATA_CONSUMED)), (SUM(ZETTICSDW.A_MA_HOURLY_V.UP_BYTES)) FROM ZETTICSDW.A_MA_HOURLY_V WHERE ((ZETTICSDW.A_MA_HOURLY_V.THEDATE >= '20180501') AND (ZETTICSDW.A_MA_HOURLY_V.THEDATE <= '20180501')) I have a data model, ma_aggs_model. I have created two cube designs: ma_aggs_cube, ma_aggs_topn_cube. In ma_aggs_cube, I have configured metric like “COUNT_DISTINCT(SUBSCRIBER_ID), SUM(HITS),… In ma_aggs_topn_cube, I have configured metric like SUM(HITS), TOPN(HITS, GROUP BY SUBSCRIBER_ID), … But no COUNT_DISTINCT(SUBSCRIBER_ID) metric. When a query contains COUNT(DISTINCT(SUBSCRIBER_ID) is issued, kylin favored to use ma_aggs_topn_cube over ma_aggs_cube. If I disabled ma_aggs_topc_cube, then kylin chooses ma_aggs_cube. The question I have is why COUNT_DISTINCT(xyz) in select phrase would always favor the topn cube, even if there is no “LIMIT xyz” and no “ORDER BY SUBSCRIBER_ID INSC” in the query. What is the criterion in kylin which favors topn cube over non-topn cube? Thanks. Kang-sen From: Kang-Sen Lu <[email protected]> Sent: Monday, January 07, 2019 8:21 AM To: [email protected] Subject: RE: question about how kylin chooses a specific cube design over others I want to add another data point: if I disable the topn cube design, then the same query would go through smoothly and correctly. Her eis the kylin log: 2019-01-07 08:18:16,256 INFO [Query a78321d0-5913-758b-6eb1-db5a39f92735-134702] service.QueryService:387 : The original query: SELECT (COUNT(DISTINCT ZETTICSDW.A_MA_HOURLY_V.SUBSCRIBER_ID)), (SUM(ZETTICSDW.A_MA_HOURLY_V.HITS)), (SUM(ZETTICSDW.A_MA_HOURLY_V.PAGE_VIEWS)), (SUM(ZETTICSDW.A_MA_HOURLY_V.SESSIONS)), (SUM(ZETTICSDW.A_MA_HOURLY_V.SESSION_TIME)), (SUM(ZETTICSDW.A_MA_HOURLY_V.DOWN_BYTES)), (SUM(ZETTICSDW.A_MA_HOURLY_V.DATA_CONSUMED)), (SUM(ZETTICSDW.A_MA_HOURLY_V.UP_BYTES)) FROM ZETTICSDW.A_MA_HOURLY_V WHERE ((ZETTICSDW.A_MA_HOURLY_V.THEDATE >= '20180501') AND (ZETTICSDW.A_MA_HOURLY_V.THEDATE <= '20180501')) 2019-01-07 08:18:16,263 INFO [Query a78321d0-5913-758b-6eb1-db5a39f92735-134702] service.QueryService:563 : The corrected query: SELECT (COUNT(DISTINCT ZETTICSDW.A_MA_HOURLY_V.SUBSCRIBER_ID)), (SUM(ZETTICSDW.A_MA_HOURLY_V.HITS)), (SUM(ZETTICSDW.A_MA_HOURLY_V.PAGE_VIEWS)), (SUM(ZETTICSDW.A_MA_HOURLY_V.SESSIONS)), (SUM(ZETTICSDW.A_MA_HOURLY_V.SESSION_TIME)), (SUM(ZETTICSDW.A_MA_HOURLY_V.DOWN_BYTES)), (SUM(ZETTICSDW.A_MA_HOURLY_V.DATA_CONSUMED)), (SUM(ZETTICSDW.A_MA_HOURLY_V.UP_BYTES)) FROM ZETTICSDW.A_MA_HOURLY_V WHERE ((ZETTICSDW.A_MA_HOURLY_V.THEDATE >= '20180501') AND (ZETTICSDW.A_MA_HOURLY_V.THEDATE <= '20180501')) LIMIT 50000 2019-01-07 08:18:16,284 INFO [Query a78321d0-5913-758b-6eb1-db5a39f92735-134702] routing.QueryRouter:58 : Find candidates by table ZETTICSDW.A_MA_HOURLY_V and project=Anovadata : CUBE[name=ma_aggs_cube_6] 2019-01-07 08:18:16,285 INFO [Query a78321d0-5913-758b-6eb1-db5a39f92735-134702] routing.QueryRouter:51 : Applying rule: class org.apache.kylin.query.routing.rules.RemoveBlackoutRealizationsRule, realizations before: [CUBE[name=ma_aggs_cube_6]], realizations after: [CUBE[name=ma_aggs_cube_6]] 2019-01-07 08:18:16,285 INFO [Query a78321d0-5913-758b-6eb1-db5a39f92735-134702] routing.QueryRouter:51 : Applying rule: class org.apache.kylin.query.routing.rules.RemoveUncapableRealizationsRule, realizations before: [CUBE[name=ma_aggs_cube_6]], realizations after: [CUBE[name=ma_aggs_cube_6]] 2019-01-07 08:18:16,286 INFO [Query a78321d0-5913-758b-6eb1-db5a39f92735-134702] rules.RealizationSortRule:40 : CUBE[name=ma_aggs_cube_6] priority 1 cost 279. Kang-sen From: Kang-Sen Lu <[email protected]<mailto:[email protected]>> Sent: Monday, January 07, 2019 8:06 AM To: [email protected]<mailto:[email protected]> Subject: question about how kylin chooses a specific cube design over others I am using kylin 2.5.1. I have a data model and two cube designs on top of that data model. One data model is used to perform aggregation over a set of aggregation groups. So the metric are all using “COUNT DISTINCT”, and “SUM” functions. To speed up TOPN application, I have created another smaller cube design which addresses the TOPN application specifically, i.e. the metric contains only “COUNT DISTINCT”, and “TOPN”, but no “SUM” aggregation. Here is a normal query, and somehow I am surprised that the TOPN cube was sleected. That caused over 10,000,000 row of data being returned and failed. From the kylin.log, you can see the “SELECT” statement and the cost evaluation data. I am not sure what has caused the wrongly selection of the cube. I hope someone can provide me with some hint or references. 2019-01-07 07:55:46,137 INFO [Query 06ddbf4a-8cef-1515-463e-e1067eaaae3a-134709] service.QueryService:387 : The original query: SELECT (COUNT(DISTINCT ZETTICSDW.A_MA_HOURLY_V.SUBSCRIBER_ID)), (SUM(ZETTICSDW.A_MA_HOURLY_V.HITS)), (SUM(ZETTICSDW.A_MA_HOURLY_V.PAGE_VIEWS)), (SUM(ZETTICSDW.A_MA_HOURLY_V.SESSIONS)), (SUM(ZETTICSDW.A_MA_HOURLY_V.SESSION_TIME)), (SUM(ZETTICSDW.A_MA_HOURLY_V.DOWN_BYTES)), (SUM(ZETTICSDW.A_MA_HOURLY_V.DATA_CONSUMED)), (SUM(ZETTICSDW.A_MA_HOURLY_V.UP_BYTES)) FROM ZETTICSDW.A_MA_HOURLY_V WHERE ((ZETTICSDW.A_MA_HOURLY_V.THEDATE >= '20180501') AND (ZETTICSDW.A_MA_HOURLY_V.THEDATE <= '20180501')) 2019-01-07 07:55:46,184 INFO [Query 06ddbf4a-8cef-1515-463e-e1067eaaae3a-134709] routing.QueryRouter:58 : Find candidates by table ZETTICSDW.A_MA_HOURLY_V and project=Anovadata : CUBE[name=ma_aggs_cube_6],CUBE[name=ma_aggs_topn_cube] 2019-01-07 07:55:46,185 INFO [Query 06ddbf4a-8cef-1515-463e-e1067eaaae3a-134709] routing.QueryRouter:51 : Applying rule: class org.apache.kylin.query.routing.rules.RemoveBlackoutRealizationsRule, realizations before: [CUBE[name=ma_aggs_cube_6],CUBE[name=ma_aggs_topn_cube]], realizations after: [CUBE[name=ma_aggs_cube_6],CUBE[name=ma_aggs_topn_cube]] 2019-01-07 07:55:46,185 INFO [Query 06ddbf4a-8cef-1515-463e-e1067eaaae3a-134709] routing.QueryRouter:51 : Applying rule: class org.apache.kylin.query.routing.rules.RemoveUncapableRealizationsRule, realizations before: [CUBE[name=ma_aggs_cube_6],CUBE[name=ma_aggs_topn_cube]], realizations after: [CUBE[name=ma_aggs_cube_6],CUBE[name=ma_aggs_topn_cube]] 2019-01-07 07:55:46,185 INFO [Query 06ddbf4a-8cef-1515-463e-e1067eaaae3a-134709] rules.RealizationSortRule:40 : CUBE[name=ma_aggs_cube_6] priority 1 cost 279. CUBE[name=ma_aggs_topn_cube] priority 1 cost 105. 2019-01-07 07:55:46,186 INFO [Query 06ddbf4a-8cef-1515-463e-e1067eaaae3a-134709] routing.QueryRouter:51 : Applying rule: class org.apache.kylin.query.routing.rules.RealizationSortRule, realizations before: [CUBE[name=ma_aggs_cube_6],CUBE[name=ma_aggs_topn_cube]], realizations after: [CUBE[name=ma_aggs_topn_cube],CUBE[name=ma_aggs_cube_6]] 2019-01-07 07:55:46,186 INFO [Query 06ddbf4a-8cef-1515-463e-e1067eaaae3a-134709] routing.QueryRouter:95 : Adjust DimensionAsMeasure for FunctionDesc [expression=COUNT_DISTINCT, parameter=ZETTICSDW.A_MA_HOURLY_V.SUBSCRIBER_ID, returnType=null] 2019-01-07 07:55:46,186 INFO [Query 06ddbf4a-8cef-1515-463e-e1067eaaae3a-134709] routing.QueryRouter:75 : The realizations remaining: [CUBE[name=ma_aggs_topn_cube],CUBE[name=ma_aggs_cube_6]],and the final chosen one for current olap context 0 is CUBE[name=ma_aggs_topn_cube] Thanks. Kang-sen
