Re: question about how kylin chooses a specific cube design over others

Kang-Sen Lu Wed, 09 Jan 2019 17:35:13 -0800

Hi, Na:

Thanks for your reply. Let me explain why I created two cube for the same fact 
table.


Normally the query is for accessing aggregated data such as SUM(<data-column>). 
But sometimes we need to query some days better defined with TOPN function, 
especially when th cardinality of the group by dimension in the top n is hugh. 
In that case, using normal cube would be asking kyli n to store high  fibroid 
table.

So I was hoping the kylin cube candidate algorithm can based on the query and 
pick the right cube candidate.

I am trying to study how kylin assigns cost and decided multiplier to each cube 
candidate. Since I don't much about kylin internal terminology, it is not 
obvious to me what the java code is trying to accomplish.

Ideally, if in the query we can find the ingredient of applying top n function, 
then we should assign lower cost to top n cube,otherwise, we should avoid using 
top n for non-stop query. So I thought, the basic ingredient to apply top n 
cube is (1) select list contain s something like sum(x); (2) group by list 
include y; (3) there's should be order by sum(x); (4) there should be limit n. 
That should warrant the apply of topn(x), group by y metric.

Thanks for any help to understand the cost evaluation of cube candidate.

Kang-sen
________________________________
From: Na Zhai <[email protected]>
Sent: Wednesday, January 9, 2019 3:12:07 AM
To: [email protected]
Subject: 答复: question about how kylin chooses a specific cube design over others


Hi, Kang-Sen Lu.



If multiple cubes contain the same table, Kylin will do the following thing.



1. Kylin will choose the cube that contains all the dimensions in your SQL.

2. if there are still more than one cubes, Kylin will choose the cube that has 
less cost(it depends on the dimension, measure and inner join number).



发送自 Windows 10 版邮件<https://go.microsoft.com/fwlink/?LinkId=550986>应用



________________________________
发件人: Kang-Sen Lu <[email protected]>
发送时间: Monday, January 7, 2019 11:11:09 PM
收件人: [email protected]
主题: RE: question about how kylin chooses a specific cube design over others


Let me clarify my topn query problem clearly. Here is the query:



SELECT  (COUNT(DISTINCT ZETTICSDW.A_MA_HOURLY_V.SUBSCRIBER_ID)), 
(SUM(ZETTICSDW.A_MA_HOURLY_V.HITS)), (SUM(ZETTICSDW.A_MA_HOURLY_V.PAGE_VIEWS)), 
(SUM(ZETTICSDW.A_MA_HOURLY_V.SESSIONS)), 
(SUM(ZETTICSDW.A_MA_HOURLY_V.SESSION_TIME)), 
(SUM(ZETTICSDW.A_MA_HOURLY_V.DOWN_BYTES)), 
(SUM(ZETTICSDW.A_MA_HOURLY_V.DATA_CONSUMED)), 
(SUM(ZETTICSDW.A_MA_HOURLY_V.UP_BYTES))  FROM  ZETTICSDW.A_MA_HOURLY_V   WHERE  
((ZETTICSDW.A_MA_HOURLY_V.THEDATE >= '20180501') AND 
(ZETTICSDW.A_MA_HOURLY_V.THEDATE <= '20180501'))



I have a data model, ma_aggs_model.



I have created two cube designs: ma_aggs_cube, ma_aggs_topn_cube.



In ma_aggs_cube, I have configured metric like “COUNT_DISTINCT(SUBSCRIBER_ID), 
SUM(HITS),…



In ma_aggs_topn_cube, I have configured metric like SUM(HITS), TOPN(HITS, GROUP 
BY SUBSCRIBER_ID), … But no COUNT_DISTINCT(SUBSCRIBER_ID) metric.



When a query contains COUNT(DISTINCT(SUBSCRIBER_ID) is issued, kylin favored to 
use ma_aggs_topn_cube over ma_aggs_cube. If I disabled ma_aggs_topc_cube, then 
kylin chooses ma_aggs_cube.



The question I have is why COUNT_DISTINCT(xyz) in select phrase would always 
favor the topn cube, even if there is no “LIMIT xyz” and no “ORDER BY 
SUBSCRIBER_ID INSC” in the query. What is the criterion in kylin which favors 
topn cube over non-topn cube?



Thanks.



Kang-sen



From: Kang-Sen Lu <[email protected]>
Sent: Monday, January 07, 2019 8:21 AM
To: [email protected]
Subject: RE: question about how kylin chooses a specific cube design over others



I want to add another data point: if I disable the topn cube design, then the 
same query would go through smoothly and correctly. Her eis the kylin log:



2019-01-07 08:18:16,256 INFO  [Query 
a78321d0-5913-758b-6eb1-db5a39f92735-134702] service.QueryService:387 : The 
original query:   SELECT  (COUNT(DISTINCT 
ZETTICSDW.A_MA_HOURLY_V.SUBSCRIBER_ID)), (SUM(ZETTICSDW.A_MA_HOURLY_V.HITS)), 
(SUM(ZETTICSDW.A_MA_HOURLY_V.PAGE_VIEWS)), 
(SUM(ZETTICSDW.A_MA_HOURLY_V.SESSIONS)), 
(SUM(ZETTICSDW.A_MA_HOURLY_V.SESSION_TIME)), 
(SUM(ZETTICSDW.A_MA_HOURLY_V.DOWN_BYTES)), 
(SUM(ZETTICSDW.A_MA_HOURLY_V.DATA_CONSUMED)), 
(SUM(ZETTICSDW.A_MA_HOURLY_V.UP_BYTES))  FROM  ZETTICSDW.A_MA_HOURLY_V   WHERE  
((ZETTICSDW.A_MA_HOURLY_V.THEDATE >= '20180501') AND 
(ZETTICSDW.A_MA_HOURLY_V.THEDATE <= '20180501'))



2019-01-07 08:18:16,263 INFO  [Query 
a78321d0-5913-758b-6eb1-db5a39f92735-134702] service.QueryService:563 : The 
corrected query: SELECT  (COUNT(DISTINCT 
ZETTICSDW.A_MA_HOURLY_V.SUBSCRIBER_ID)), (SUM(ZETTICSDW.A_MA_HOURLY_V.HITS)), 
(SUM(ZETTICSDW.A_MA_HOURLY_V.PAGE_VIEWS)), 
(SUM(ZETTICSDW.A_MA_HOURLY_V.SESSIONS)), 
(SUM(ZETTICSDW.A_MA_HOURLY_V.SESSION_TIME)), 
(SUM(ZETTICSDW.A_MA_HOURLY_V.DOWN_BYTES)), 
(SUM(ZETTICSDW.A_MA_HOURLY_V.DATA_CONSUMED)), 
(SUM(ZETTICSDW.A_MA_HOURLY_V.UP_BYTES))  FROM  ZETTICSDW.A_MA_HOURLY_V   WHERE  
((ZETTICSDW.A_MA_HOURLY_V.THEDATE >= '20180501') AND 
(ZETTICSDW.A_MA_HOURLY_V.THEDATE <= '20180501'))

LIMIT 50000

2019-01-07 08:18:16,284 INFO  [Query 
a78321d0-5913-758b-6eb1-db5a39f92735-134702] routing.QueryRouter:58 : Find 
candidates by table ZETTICSDW.A_MA_HOURLY_V and project=Anovadata : 
CUBE[name=ma_aggs_cube_6]

2019-01-07 08:18:16,285 INFO  [Query 
a78321d0-5913-758b-6eb1-db5a39f92735-134702] routing.QueryRouter:51 : Applying 
rule: class 
org.apache.kylin.query.routing.rules.RemoveBlackoutRealizationsRule, 
realizations before: [CUBE[name=ma_aggs_cube_6]], realizations after: 
[CUBE[name=ma_aggs_cube_6]]

2019-01-07 08:18:16,285 INFO  [Query 
a78321d0-5913-758b-6eb1-db5a39f92735-134702] routing.QueryRouter:51 : Applying 
rule: class 
org.apache.kylin.query.routing.rules.RemoveUncapableRealizationsRule, 
realizations before: [CUBE[name=ma_aggs_cube_6]], realizations after: 
[CUBE[name=ma_aggs_cube_6]]

2019-01-07 08:18:16,286 INFO  [Query 
a78321d0-5913-758b-6eb1-db5a39f92735-134702] rules.RealizationSortRule:40 : 
CUBE[name=ma_aggs_cube_6] priority 1 cost 279.



Kang-sen



From: Kang-Sen Lu <[email protected]<mailto:[email protected]>>
Sent: Monday, January 07, 2019 8:06 AM
To: [email protected]<mailto:[email protected]>
Subject: question about how kylin chooses a specific cube design over others



I am using kylin 2.5.1. I have a data model and two cube designs on top of that 
data model.



One data model is used to perform aggregation over a set of aggregation groups. 
So the metric are all using “COUNT DISTINCT”, and “SUM” functions. To speed up 
TOPN application, I have created another smaller cube design which addresses 
the TOPN application specifically, i.e. the metric contains only “COUNT 
DISTINCT”, and “TOPN”, but no “SUM” aggregation.



Here is a normal query, and somehow I am surprised that the TOPN cube was 
sleected. That caused over 10,000,000 row of data being returned and failed.



From the kylin.log, you can see the “SELECT” statement and the cost evaluation 
data. I am not sure what has caused the wrongly selection of the cube. I hope 
someone can provide me with some hint or references.





2019-01-07 07:55:46,137 INFO  [Query 
06ddbf4a-8cef-1515-463e-e1067eaaae3a-134709] service.QueryService:387 : The 
original query:  SELECT  (COUNT(DISTINCT 
ZETTICSDW.A_MA_HOURLY_V.SUBSCRIBER_ID)), (SUM(ZETTICSDW.A_MA_HOURLY_V.HITS)), 
(SUM(ZETTICSDW.A_MA_HOURLY_V.PAGE_VIEWS)), 
(SUM(ZETTICSDW.A_MA_HOURLY_V.SESSIONS)), 
(SUM(ZETTICSDW.A_MA_HOURLY_V.SESSION_TIME)), 
(SUM(ZETTICSDW.A_MA_HOURLY_V.DOWN_BYTES)), 
(SUM(ZETTICSDW.A_MA_HOURLY_V.DATA_CONSUMED)), 
(SUM(ZETTICSDW.A_MA_HOURLY_V.UP_BYTES))  FROM  ZETTICSDW.A_MA_HOURLY_V   WHERE  
((ZETTICSDW.A_MA_HOURLY_V.THEDATE >= '20180501') AND 
(ZETTICSDW.A_MA_HOURLY_V.THEDATE <= '20180501'))



2019-01-07 07:55:46,184 INFO  [Query 
06ddbf4a-8cef-1515-463e-e1067eaaae3a-134709] routing.QueryRouter:58 : Find 
candidates by table ZETTICSDW.A_MA_HOURLY_V and project=Anovadata : 
CUBE[name=ma_aggs_cube_6],CUBE[name=ma_aggs_topn_cube]

2019-01-07 07:55:46,185 INFO  [Query 
06ddbf4a-8cef-1515-463e-e1067eaaae3a-134709] routing.QueryRouter:51 : Applying 
rule: class 
org.apache.kylin.query.routing.rules.RemoveBlackoutRealizationsRule, 
realizations before: [CUBE[name=ma_aggs_cube_6],CUBE[name=ma_aggs_topn_cube]], 
realizations after: [CUBE[name=ma_aggs_cube_6],CUBE[name=ma_aggs_topn_cube]]

2019-01-07 07:55:46,185 INFO  [Query 
06ddbf4a-8cef-1515-463e-e1067eaaae3a-134709] routing.QueryRouter:51 : Applying 
rule: class 
org.apache.kylin.query.routing.rules.RemoveUncapableRealizationsRule, 
realizations before: [CUBE[name=ma_aggs_cube_6],CUBE[name=ma_aggs_topn_cube]], 
realizations after: [CUBE[name=ma_aggs_cube_6],CUBE[name=ma_aggs_topn_cube]]

2019-01-07 07:55:46,185 INFO  [Query 
06ddbf4a-8cef-1515-463e-e1067eaaae3a-134709] rules.RealizationSortRule:40 : 
CUBE[name=ma_aggs_cube_6] priority 1 cost 279. CUBE[name=ma_aggs_topn_cube] 
priority 1 cost 105.

2019-01-07 07:55:46,186 INFO  [Query 
06ddbf4a-8cef-1515-463e-e1067eaaae3a-134709] routing.QueryRouter:51 : Applying 
rule: class org.apache.kylin.query.routing.rules.RealizationSortRule, 
realizations before: [CUBE[name=ma_aggs_cube_6],CUBE[name=ma_aggs_topn_cube]], 
realizations after: [CUBE[name=ma_aggs_topn_cube],CUBE[name=ma_aggs_cube_6]]

2019-01-07 07:55:46,186 INFO  [Query 
06ddbf4a-8cef-1515-463e-e1067eaaae3a-134709] routing.QueryRouter:95 : Adjust 
DimensionAsMeasure for FunctionDesc [expression=COUNT_DISTINCT, 
parameter=ZETTICSDW.A_MA_HOURLY_V.SUBSCRIBER_ID, returnType=null]

2019-01-07 07:55:46,186 INFO  [Query 
06ddbf4a-8cef-1515-463e-e1067eaaae3a-134709] routing.QueryRouter:75 : The 
realizations remaining: 
[CUBE[name=ma_aggs_topn_cube],CUBE[name=ma_aggs_cube_6]],and the final chosen 
one for current olap context 0 is CUBE[name=ma_aggs_topn_cube]



Thanks.



Kang-sen

Re: question about how kylin chooses a specific cube design over others

Reply via email to