I want to add another data point: if I disable the topn cube design, then the same query would go through smoothly and correctly. Her eis the kylin log:
2019-01-07 08:18:16,256 INFO [Query a78321d0-5913-758b-6eb1-db5a39f92735-134702] service.QueryService:387 : The original query: SELECT (COUNT(DISTINCT ZETTICSDW.A_MA_HOURLY_V.SUBSCRIBER_ID)), (SUM(ZETTICSDW.A_MA_HOURLY_V.HITS)), (SUM(ZETTICSDW.A_MA_HOURLY_V.PAGE_VIEWS)), (SUM(ZETTICSDW.A_MA_HOURLY_V.SESSIONS)), (SUM(ZETTICSDW.A_MA_HOURLY_V.SESSION_TIME)), (SUM(ZETTICSDW.A_MA_HOURLY_V.DOWN_BYTES)), (SUM(ZETTICSDW.A_MA_HOURLY_V.DATA_CONSUMED)), (SUM(ZETTICSDW.A_MA_HOURLY_V.UP_BYTES)) FROM ZETTICSDW.A_MA_HOURLY_V WHERE ((ZETTICSDW.A_MA_HOURLY_V.THEDATE >= '20180501') AND (ZETTICSDW.A_MA_HOURLY_V.THEDATE <= '20180501')) 2019-01-07 08:18:16,263 INFO [Query a78321d0-5913-758b-6eb1-db5a39f92735-134702] service.QueryService:563 : The corrected query: SELECT (COUNT(DISTINCT ZETTICSDW.A_MA_HOURLY_V.SUBSCRIBER_ID)), (SUM(ZETTICSDW.A_MA_HOURLY_V.HITS)), (SUM(ZETTICSDW.A_MA_HOURLY_V.PAGE_VIEWS)), (SUM(ZETTICSDW.A_MA_HOURLY_V.SESSIONS)), (SUM(ZETTICSDW.A_MA_HOURLY_V.SESSION_TIME)), (SUM(ZETTICSDW.A_MA_HOURLY_V.DOWN_BYTES)), (SUM(ZETTICSDW.A_MA_HOURLY_V.DATA_CONSUMED)), (SUM(ZETTICSDW.A_MA_HOURLY_V.UP_BYTES)) FROM ZETTICSDW.A_MA_HOURLY_V WHERE ((ZETTICSDW.A_MA_HOURLY_V.THEDATE >= '20180501') AND (ZETTICSDW.A_MA_HOURLY_V.THEDATE <= '20180501')) LIMIT 50000 2019-01-07 08:18:16,284 INFO [Query a78321d0-5913-758b-6eb1-db5a39f92735-134702] routing.QueryRouter:58 : Find candidates by table ZETTICSDW.A_MA_HOURLY_V and project=Anovadata : CUBE[name=ma_aggs_cube_6] 2019-01-07 08:18:16,285 INFO [Query a78321d0-5913-758b-6eb1-db5a39f92735-134702] routing.QueryRouter:51 : Applying rule: class org.apache.kylin.query.routing.rules.RemoveBlackoutRealizationsRule, realizations before: [CUBE[name=ma_aggs_cube_6]], realizations after: [CUBE[name=ma_aggs_cube_6]] 2019-01-07 08:18:16,285 INFO [Query a78321d0-5913-758b-6eb1-db5a39f92735-134702] routing.QueryRouter:51 : Applying rule: class org.apache.kylin.query.routing.rules.RemoveUncapableRealizationsRule, realizations before: [CUBE[name=ma_aggs_cube_6]], realizations after: [CUBE[name=ma_aggs_cube_6]] 2019-01-07 08:18:16,286 INFO [Query a78321d0-5913-758b-6eb1-db5a39f92735-134702] rules.RealizationSortRule:40 : CUBE[name=ma_aggs_cube_6] priority 1 cost 279. Kang-sen From: Kang-Sen Lu <[email protected]> Sent: Monday, January 07, 2019 8:06 AM To: [email protected] Subject: question about how kylin chooses a specific cube design over others I am using kylin 2.5.1. I have a data model and two cube designs on top of that data model. One data model is used to perform aggregation over a set of aggregation groups. So the metric are all using "COUNT DISTINCT", and "SUM" functions. To speed up TOPN application, I have created another smaller cube design which addresses the TOPN application specifically, i.e. the metric contains only "COUNT DISTINCT", and "TOPN", but no "SUM" aggregation. Here is a normal query, and somehow I am surprised that the TOPN cube was sleected. That caused over 10,000,000 row of data being returned and failed. >From the kylin.log, you can see the "SELECT" statement and the cost evaluation >data. I am not sure what has caused the wrongly selection of the cube. I hope >someone can provide me with some hint or references. 2019-01-07 07:55:46,137 INFO [Query 06ddbf4a-8cef-1515-463e-e1067eaaae3a-134709] service.QueryService:387 : The original query: SELECT (COUNT(DISTINCT ZETTICSDW.A_MA_HOURLY_V.SUBSCRIBER_ID)), (SUM(ZETTICSDW.A_MA_HOURLY_V.HITS)), (SUM(ZETTICSDW.A_MA_HOURLY_V.PAGE_VIEWS)), (SUM(ZETTICSDW.A_MA_HOURLY_V.SESSIONS)), (SUM(ZETTICSDW.A_MA_HOURLY_V.SESSION_TIME)), (SUM(ZETTICSDW.A_MA_HOURLY_V.DOWN_BYTES)), (SUM(ZETTICSDW.A_MA_HOURLY_V.DATA_CONSUMED)), (SUM(ZETTICSDW.A_MA_HOURLY_V.UP_BYTES)) FROM ZETTICSDW.A_MA_HOURLY_V WHERE ((ZETTICSDW.A_MA_HOURLY_V.THEDATE >= '20180501') AND (ZETTICSDW.A_MA_HOURLY_V.THEDATE <= '20180501')) 2019-01-07 07:55:46,184 INFO [Query 06ddbf4a-8cef-1515-463e-e1067eaaae3a-134709] routing.QueryRouter:58 : Find candidates by table ZETTICSDW.A_MA_HOURLY_V and project=Anovadata : CUBE[name=ma_aggs_cube_6],CUBE[name=ma_aggs_topn_cube] 2019-01-07 07:55:46,185 INFO [Query 06ddbf4a-8cef-1515-463e-e1067eaaae3a-134709] routing.QueryRouter:51 : Applying rule: class org.apache.kylin.query.routing.rules.RemoveBlackoutRealizationsRule, realizations before: [CUBE[name=ma_aggs_cube_6],CUBE[name=ma_aggs_topn_cube]], realizations after: [CUBE[name=ma_aggs_cube_6],CUBE[name=ma_aggs_topn_cube]] 2019-01-07 07:55:46,185 INFO [Query 06ddbf4a-8cef-1515-463e-e1067eaaae3a-134709] routing.QueryRouter:51 : Applying rule: class org.apache.kylin.query.routing.rules.RemoveUncapableRealizationsRule, realizations before: [CUBE[name=ma_aggs_cube_6],CUBE[name=ma_aggs_topn_cube]], realizations after: [CUBE[name=ma_aggs_cube_6],CUBE[name=ma_aggs_topn_cube]] 2019-01-07 07:55:46,185 INFO [Query 06ddbf4a-8cef-1515-463e-e1067eaaae3a-134709] rules.RealizationSortRule:40 : CUBE[name=ma_aggs_cube_6] priority 1 cost 279. CUBE[name=ma_aggs_topn_cube] priority 1 cost 105. 2019-01-07 07:55:46,186 INFO [Query 06ddbf4a-8cef-1515-463e-e1067eaaae3a-134709] routing.QueryRouter:51 : Applying rule: class org.apache.kylin.query.routing.rules.RealizationSortRule, realizations before: [CUBE[name=ma_aggs_cube_6],CUBE[name=ma_aggs_topn_cube]], realizations after: [CUBE[name=ma_aggs_topn_cube],CUBE[name=ma_aggs_cube_6]] 2019-01-07 07:55:46,186 INFO [Query 06ddbf4a-8cef-1515-463e-e1067eaaae3a-134709] routing.QueryRouter:95 : Adjust DimensionAsMeasure for FunctionDesc [expression=COUNT_DISTINCT, parameter=ZETTICSDW.A_MA_HOURLY_V.SUBSCRIBER_ID, returnType=null] 2019-01-07 07:55:46,186 INFO [Query 06ddbf4a-8cef-1515-463e-e1067eaaae3a-134709] routing.QueryRouter:75 : The realizations remaining: [CUBE[name=ma_aggs_topn_cube],CUBE[name=ma_aggs_cube_6]],and the final chosen one for current olap context 0 is CUBE[name=ma_aggs_topn_cube] Thanks. Kang-sen
