Thanks Tingmao, we will handle that.

2017-05-15 19:22 GMT+08:00 Tingmao Lin <[email protected]>:
> Yes, the serializer just doesn't work with negative counters. That explains
> why the result is inaccurate and it's not a bug.
>
> About the SUM auto-rewrite behavior, I have created a JIRA:
> https://issues.apache.org/jira/browse/KYLIN-2620
>
> Thanks to Shaofeng Shi, Li Yang and Billy Liu for helping resolve the
> questions.
>
> ------------------------------
> *From:* ShaoFeng Shi <[email protected]>
> *Sent:* Monday, May 15, 2017 10:36 AM
> *To:* user
> *Subject:* Re: Questions about SUM behavior when rewritten as TOPN
>
> OK, then that explains it.
>
> The Top-N algorithm is based on the assumption that your data follows a
> Zipf distribution (https://en.wikipedia.org/wiki/Zipf%27s_law). If not, the
> result can be very inaccurate. Your sample is such a case: with only 1 key
> and negative counters, the error accumulates and you finally see a
> deviation from the real value, which is not a bug I think. Please use a SUM
> measure instead of abusing TopN.
>
> 2017-05-15 16:55 GMT+08:00 Tingmao Lin <[email protected]>:
>
>> We found something about serialization precision that might be related to
>> the inaccurate result problem.
>>
>> The TopNCounterSerializer uses DoubleDeltaSerializer to serialize the
>> double array. DoubleDeltaSerializer writes double values as
>> int(d*1000+0.5)/1000, so with each serialization pass a negative integer
>> value increases by 0.001.
>>
>> Test case (slightly modified from
>> TopNCounterSerializerTest.testSerialization):
>>
>> @Test
>> public void testNegativeSerialization() {
>>     TopNCounter<ByteArray> vs = new TopNCounter<ByteArray>(50);
>>     Integer[] stream = {1};
>>     double[] incValues = {-1};
>>     for (int counter = 0; counter < stream.length; ++counter) {
>>         Integer i = stream[counter];
>>         double v = incValues[counter];
>>         vs.offer(new ByteArray(Bytes.toBytes(i)), v);
>>     }
>>     vs.sortAndRetain();
>>     ByteBuffer out = ByteBuffer.allocate(1024);
>>     serializer.serialize(vs, out);
>>
>>     byte[] copyBytes = new byte[out.position()];
>>     System.arraycopy(out.array(), 0, copyBytes, 0, out.position());
>>
>>     ByteBuffer in = ByteBuffer.wrap(copyBytes);
>>     TopNCounter<ByteArray> vsNew = serializer.deserialize(in);
>>
>>     Assert.assertEquals(vs.toString(), vsNew.toString());
>> }
>>
>> and the result is:
>>
>> org.junit.ComparisonFailure:
>> Expected :[\x00\x00\x00\x01:-1.0]
>> Actual   :[\x00\x00\x00\x01:-0.999]
>>
>> So if we set the type of measure1 to double and build the cube using a
>> single line {measure1 = -1.0}, then "select sum(measure1) from table group
>> by dim2_id" will return "-0.994" instead of "-1".
>>
>> ------------------------------
>> *From:* Tingmao Lin <[email protected]>
>> *Sent:* Monday, May 15, 2017 8:44 AM
>> *To:* [email protected]
>> *Subject:* Re: Questions about SUM behavior when rewritten as TOPN
>>
>> The query log and the cube definition are here:
>>
>> Query without rewrite:
>> 2017-05-15 14:59:50,741 INFO [pool-8-thread-1] threadpool.DefaultScheduler:118 : Job Fetcher: 0 should running, 0 actual running, 0 ready, 577 already succeed, 19 error, 11 discarded, 0 others
>> 2017-05-15 14:59:51,637 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] service.QueryService:336 : Using project: TEST
>> 2017-05-15 14:59:51,638 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] service.QueryService:337 : The original query: select sum(v),count(v) from test group by lv2_id
>> 2017-05-15 14:59:51,638 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] service.QueryService:440 : The corrected query: select sum(v),count(v) from test group by lv2_id LIMIT 50000
>> 2017-05-15 14:59:51,674 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] routing.QueryRouter:48 : The project manager's reference is org.apache.kylin.metadata.project.ProjectManager@ac216f8
>> 2017-05-15 14:59:51,674 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] routing.QueryRouter:60 : Find candidates by table DEFAULT.TEST and project=TEST : CUBE[name=TEST1ROWS1DIM]
>> 2017-05-15 14:59:51,674 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] routing.QueryRouter:51 : Applying rule: class org.apache.kylin.query.routing.rules.RemoveBlackoutRealizationsRule, realizations before: [TEST1ROWS1DIM(CUBE)], realizations after: [TEST1ROWS1DIM(CUBE)]
>> 2017-05-15 14:59:51,674 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] routing.QueryRouter:51 : Applying rule: class org.apache.kylin.query.routing.rules.RemoveUncapableRealizationsRule, realizations before: [TEST1ROWS1DIM(CUBE)], realizations after: [TEST1ROWS1DIM(CUBE)]
>> 2017-05-15 14:59:51,675 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] rules.RealizationSortRule:40 : CUBE[name=TEST1ROWS1DIM] priority 1 cost 73.
>> 2017-05-15 14:59:51,675 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] routing.QueryRouter:51 : Applying rule: class org.apache.kylin.query.routing.rules.RealizationSortRule, realizations before: [TEST1ROWS1DIM(CUBE)], realizations after: [TEST1ROWS1DIM(CUBE)]
>> 2017-05-15 14:59:51,675 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] routing.QueryRouter:72 : The realizations remaining: [TEST1ROWS1DIM(CUBE)] And the final chosen one is the first one
>> 2017-05-15 14:59:51,707 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] gtrecord.GTCubeStorageQueryBase:105 : Cuboid identified: cube=TEST1ROWS1DIM, cuboidId=1, groupsD=[DEFAULT.TEST.LV2_ID], otherDimsD=[]
>> 2017-05-15 14:59:51,707 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] gtrecord.GTCubeStorageQueryBase:230 : GroupD :[DEFAULT.TEST.LV2_ID]
>> 2017-05-15 14:59:51,707 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] gtrecord.GTCubeStorageQueryBase:231 : SingleValueD :[]
>> 2017-05-15 14:59:51,707 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] gtrecord.GTCubeStorageQueryBase:232 : Cuboid columns :[DEFAULT.TEST.LV2_ID]
>> 2017-05-15 14:59:51,707 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] gtrecord.GTCubeStorageQueryBase:238 : Does not need storage aggregation
>> 2017-05-15 14:59:51,708 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] storage.StorageContext:134 : Not enabling limit push down because the limit(including offset) 50000 is larger than kylin.query.pushdown.limit.max 10000
>> 2017-05-15 14:59:51,708 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] gtrecord.CubeSegmentScanner:57 : Init CubeSegmentScanner for segment 29542_29543
>> 2017-05-15 14:59:51,763 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] v2.CubeHBaseEndpointRPC:139 : Serialized scanRequestBytes 311 bytes, rawScanBytesString 40 bytes
>> 2017-05-15 14:59:51,764 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] v2.CubeHBaseEndpointRPC:141 : The scan 6fdd1fd2 for segment TEST1ROWS1DIM[29542_29543] is as below with 1 separate raw scans, shard part of start/end key is set to 0
>> 2017-05-15 14:59:51,764 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] v2.CubeHBaseRPC:278 : Visiting hbase table KYLIN_EBM57UYZMQ: cuboid exact match, from 1 to 1 Start: \x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00 (\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00) Stop: \x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\xFF\x00 (\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\xFF\x00), No Fuzzy Key
>> 2017-05-15 14:59:51,768 INFO [kylin-coproc--pool14-t14] v2.CubeHBaseEndpointRPC:200 : <sub-thread for Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20 GTScanRequest 6fdd1fd2>Endpoint RPC returned from HTable KYLIN_EBM57UYZMQ Shard \x4B\x59\x4C\x49\x4E\x5F\x45\x42\x4D\x35\x37\x55\x59\x5A\x4D\x51\x2C\x2C\x31\x34\x39\x34\x35\x37\x33\x38\x39\x30\x38\x33\x35\x2E\x37\x33\x35\x38\x33\x62\x36\x38\x30\x39\x62\x65\x39\x61\x37\x37\x66\x61\x32\x64\x65\x65\x62\x63\x62\x34\x35\x31\x39\x38\x64\x30\x2E on host: W168PC03. Total scanned row: 1. Total filtered/aggred row: 0. Time elapsed in EP: 1(ms). Server CPU usage: 0.04998115611708333, server physical mem left: 8.621008896E10, server swap mem left: 0.0. Etc message: start latency: 58@0,agg done@1,compress done@1,server stats done@1, debugGitTag:0fca07178088320fda44513c63a1134b7d39427c;.Normal Complete: true.
>> 2017-05-15 14:59:51,771 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] service.QueryService:576 : Scan count for each storageContext: 1,
>> 2017-05-15 14:59:51,771 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] service.QueryService:359 : Stats of SQL response: isException: false, duration: 133, total scan count 1
>> 2017-05-15 14:59:51,771 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] service.QueryService:284 :
>> ==========================[QUERY]===============================
>> Query Id: cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20
>> SQL: select sum(v),count(v) from test group by lv2_id
>> LIMIT 50000
>> User: ADMIN
>> Success: true
>> Duration: 0.133
>> Project: TEST
>> Realization Names: [TEST1ROWS1DIM]
>> Cuboid Ids: [1]
>> Total scan count: 1
>> Result row count: 1
>> Accept Partial: true
>> Is Partial Result: false
>> Hit Exception Cache: false
>> Storage cache used: false
>> Message: null
>> ==========================[QUERY]===============================
>>
>> Query with rewrite:
>> 2017-05-15 14:59:34,156 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] service.QueryService:336 : Using project: TEST
>> 2017-05-15 14:59:34,157 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] service.QueryService:337 : The original query: select sum(v) from test group by lv2_id
>> 2017-05-15 14:59:34,157 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] service.QueryService:440 : The corrected query: select sum(v) from test group by lv2_id LIMIT 50000
>> 2017-05-15 14:59:34,185 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] routing.QueryRouter:48 : The project manager's reference is org.apache.kylin.metadata.project.ProjectManager@ac216f8
>> 2017-05-15 14:59:34,185 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] routing.QueryRouter:60 : Find candidates by table DEFAULT.TEST and project=TEST : CUBE[name=TEST1ROWS1DIM]
>> 2017-05-15 14:59:34,185 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] routing.QueryRouter:51 : Applying rule: class org.apache.kylin.query.routing.rules.RemoveBlackoutRealizationsRule, realizations before: [TEST1ROWS1DIM(CUBE)], realizations after: [TEST1ROWS1DIM(CUBE)]
>> 2017-05-15 14:59:34,186 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] cube.CubeCapabilityChecker:191 : Cube CUBE[name=TEST1ROWS1DIM] CapabilityInfluences: TOP10@class org.apache.kylin.measure.topn.TopNMeasureType
>> 2017-05-15 14:59:34,186 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] routing.QueryRouter:51 : Applying rule: class org.apache.kylin.query.routing.rules.RemoveUncapableRealizationsRule, realizations before: [TEST1ROWS1DIM(CUBE)], realizations after: [TEST1ROWS1DIM(CUBE)]
>> 2017-05-15 14:59:34,186 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] rules.RealizationSortRule:40 : CUBE[name=TEST1ROWS1DIM] priority 1 cost 21.
>> 2017-05-15 14:59:34,186 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] routing.QueryRouter:51 : Applying rule: class org.apache.kylin.query.routing.rules.RealizationSortRule, realizations before: [TEST1ROWS1DIM(CUBE)], realizations after: [TEST1ROWS1DIM(CUBE)]
>> 2017-05-15 14:59:34,186 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] routing.QueryRouter:72 : The realizations remaining: [TEST1ROWS1DIM(CUBE)] And the final chosen one is the first one
>> 2017-05-15 14:59:34,217 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] topn.TopNMeasureType:338 : Rewrite function FunctionDesc [expression=SUM, parameter=ParameterDesc [type=column, value=V, nextParam=null], returnType=bigint] to FunctionDesc [expression=TOP_N, parameter=ParameterDesc [type=column, value=V, nextParam=ParameterDesc [type=column, value=LV2_ID, nextParam=null]], returnType=topn(10)]
>> 2017-05-15 14:59:34,217 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] gtrecord.GTCubeStorageQueryBase:105 : Cuboid identified: cube=TEST1ROWS1DIM, cuboidId=1, groupsD=[], otherDimsD=[]
>> 2017-05-15 14:59:34,217 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] gtrecord.GTCubeStorageQueryBase:230 : GroupD :[]
>> 2017-05-15 14:59:34,217 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] gtrecord.GTCubeStorageQueryBase:231 : SingleValueD :[]
>> 2017-05-15 14:59:34,218 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] gtrecord.GTCubeStorageQueryBase:232 : Cuboid columns :[DEFAULT.TEST.LV2_ID]
>> 2017-05-15 14:59:34,218 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] gtrecord.GTCubeStorageQueryBase:241 : Need storage aggregation
>> 2017-05-15 14:59:34,218 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] storage.StorageContext:134 : Not enabling limit push down because the limit(including offset) 50000 is larger than kylin.query.pushdown.limit.max 10000
>> 2017-05-15 14:59:34,218 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] gtrecord.GTCubeStorageQueryBase:334 : Memory budget is set to 805306 rows
>> 2017-05-15 14:59:34,218 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] gtrecord.CubeSegmentScanner:57 : Init CubeSegmentScanner for segment 29542_29543
>> 2017-05-15 14:59:34,219 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] dict.DictionaryManager:420 : DictionaryManager(1833236614) loading DictionaryInfo(loadDictObj:true) at /dict/DEFAULT.TEST/LV2_ID/7be3d6b1-0ed4-4b2d-9015-0c92b0757725.dict
>> 2017-05-15 14:59:34,221 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] v2.CubeHBaseEndpointRPC:139 : Serialized scanRequestBytes 306 bytes, rawScanBytesString 40 bytes
>> 2017-05-15 14:59:34,221 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] v2.CubeHBaseEndpointRPC:141 : The scan 37ca24c9 for segment TEST1ROWS1DIM[29542_29543] is as below with 1 separate raw scans, shard part of start/end key is set to 0
>> 2017-05-15 14:59:34,221 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] v2.CubeHBaseRPC:278 : Visiting hbase table KYLIN_EBM57UYZMQ: cuboid require post aggregation, from 0 to 1 Start: \x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00 (\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00) Stop: \x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\xFF\x00 (\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\xFF\x00), No Fuzzy Key
>> 2017-05-15 14:59:34,256 INFO [kylin-coproc--pool14-t13] v2.CubeHBaseEndpointRPC:200 : <sub-thread for Query 54ac350a-7676-4601-a109-04dec543e56e GTScanRequest 37ca24c9>Endpoint RPC returned from HTable KYLIN_EBM57UYZMQ Shard \x4B\x59\x4C\x49\x4E\x5F\x45\x42\x4D\x35\x37\x55\x59\x5A\x4D\x51\x2C\x2C\x31\x34\x39\x34\x35\x37\x33\x38\x39\x30\x38\x33\x35\x2E\x37\x33\x35\x38\x33\x62\x36\x38\x30\x39\x62\x65\x39\x61\x37\x37\x66\x61\x32\x64\x65\x65\x62\x63\x62\x34\x35\x31\x39\x38\x64\x30\x2E on host: W168PC03. Total scanned row: 1. Total filtered/aggred row: 0. Time elapsed in EP: 2(ms). Server CPU usage: 0.07331964572432702, server physical mem left: 8.619999232E10, server swap mem left: 0.0. Etc message: start latency: 33@1,agg done@1,compress done@1,server stats done@2, debugGitTag:0fca07178088320fda44513c63a1134b7d39427c;.Normal Complete: true.
>> 2017-05-15 14:59:34,259 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] service.QueryService:576 : Scan count for each storageContext: 1,
>> 2017-05-15 14:59:34,259 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] service.QueryService:359 : Stats of SQL response: isException: false, duration: 102, total scan count 1
>> 2017-05-15 14:59:34,260 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] service.QueryService:284 :
>> ==========================[QUERY]===============================
>> Query Id: 54ac350a-7676-4601-a109-04dec543e56e
>> SQL: select sum(v) from test group by lv2_id
>> LIMIT 50000
>> User: ADMIN
>> Success: true
>> Duration: 0.102
>> Project: TEST
>> Realization Names: [TEST1ROWS1DIM]
>> Cuboid Ids: [1]
>> Total scan count: 1
>> Result row count: 1
>> Accept Partial: true
>> Is Partial Result: false
>> Hit Exception Cache: false
>> Storage cache used: false
>> Message: null
>> ==========================[QUERY]===============================
>>
>> Cube definition:
>>
>> {
>>   "uuid": "4a7fae8c-b518-4828-8f75-e6cd04689d75",
>>   "last_modified": 1494573765171,
>>   "version": "1.6.0",
>>   "name": "TEST1ROWS1DIM",
>>   "model_name": "TEST",
>>   "description": "",
>>   "null_string": null,
>>   "dimensions": [
>>     { "name": "LV2_ID", "table": "DEFAULT.TEST", "column": "LV2_ID", "derived": null },
>>     { "name": "MINUTE_START", "table": "DEFAULT.TEST", "column": "MINUTE_START", "derived": null }
>>   ],
>>   "measures": [
>>     {
>>       "name": "_COUNT_",
>>       "function": {
>>         "expression": "COUNT",
>>         "parameter": { "type": "constant", "value": "1", "next_parameter": null },
>>         "returntype": "bigint"
>>       },
>>       "dependent_measure_ref": null
>>     },
>>     {
>>       "name": "COUNT",
>>       "function": {
>>         "expression": "SUM",
>>         "parameter": { "type": "column", "value": "V", "next_parameter": null },
>>         "returntype": "bigint"
>>       },
>>       "dependent_measure_ref": null
>>     },
>>     {
>>       "name": "TOP10",
>>       "function": {
>>         "expression": "TOP_N",
>>         "parameter": {
>>           "type": "column",
>>           "value": "V",
>>           "next_parameter": { "type": "column", "value": "LV2_ID", "next_parameter": null }
>>         },
>>         "returntype": "topn(10)",
>>         "configuration": { "topn.encoding.LV2_ID": "dict" }
>>       },
>>       "dependent_measure_ref": null
>>     }
>>   ],
>>   "dictionaries": [],
>>   "rowkey": {
>>     "rowkey_columns": [
>>       { "column": "MINUTE_START", "encoding": "time", "isShardBy": false },
>>       { "column": "LV2_ID", "encoding": "dict", "isShardBy": false }
>>     ]
>>   },
>>   "hbase_mapping": {
>>     "column_family": [
>>       { "name": "F1", "columns": [ { "qualifier": "M", "measure_refs": [ "_COUNT_", "COUNT", "TOP10" ] } ] }
>>     ]
>>   },
>>   "aggregation_groups": [
>>     {
>>       "includes": [ "MINUTE_START", "LV2_ID" ],
>>       "select_rule": { "hierarchy_dims": [], "mandatory_dims": [], "joint_dims": [] }
>>     }
>>   ],
>>   "signature": "o75jqwjKPFF859EyU6j9eA==",
>>   "notify_list": [],
>>   "status_need_notify": [ "ERROR", "DISCARDED", "SUCCEED" ],
>>   "partition_date_start": 0,
>>   "partition_date_end": 3153600000000,
>>   "auto_merge_time_ranges": [ 604800000, 2419200000 ],
>>   "retention_range": 0,
>>   "engine_type": 2,
>>   "storage_type": 2,
>>   "override_kylin_properties": {}
>> }
>>
>> ------------------------------
>> *From:* ShaoFeng Shi <[email protected]>
>> *Sent:* Monday, May 15, 2017 1:37 AM
>> *To:* user
>> *Subject:* Re: Questions about SUM behavior when rewritten as TOPN
>>
>> Hi Tingmao,
>>
>> Your finding is correct; Kylin doesn't check whether there is an "order by
>> ... limit" pattern or not. The main reason is that the SQL parser didn't
>> push the "order by limit" info down here (at least when TopN was
>> developed), so I couldn't leverage it.
>> Another reason is that usually we won't put the "group and order" column
>> in as a normal cube dimension (though Kylin doesn't stop you from doing
>> that).
>>
>> Please feel free to open a JIRA on checking the "order by limit" pattern;
>> I can look at it later.
>>
>> 2017-05-15 2:37 GMT+08:00 Tingmao Lin <[email protected]>:
>>
>>> Yeah, we looked into the TopNCounter source code and found that for the
>>> low-cardinality scenario, m1=m2=0 in TopNCounter, so it just sums up the
>>> values. But the result still goes wrong. I will collect more information
>>> for investigation ASAP :)
>>>
>>> I have a question: as Billy Liu said in this thread, Kylin will check
>>> the ORDER BY clause to determine whether to rewrite. But I didn't find
>>> any access to SQLDigest.sortColumns in
>>> TopNMeasureType.influenceCapabilityCheck(). Does Kylin check it
>>> elsewhere? It seems that if sum(measure) is the only measure in the
>>> query and the group by column matches, then
>>> TopNMeasureType.isTopNCompatibleSum() will pass and the query gets
>>> rewritten. This confuses users, since they may expect an accurate result
>>> for every distinct value of the group by column(s).
>>>
>>> ------------------------------
>>> *From:* ShaoFeng Shi <[email protected]>
>>> *Sent:* Sunday, May 14, 2017 2:13 PM
>>> *To:* user
>>> *Subject:* Re: Questions about SUM behavior when rewritten as TOPN
>>>
>>> Agree with Yang's points. When the cardinality is smaller than the TopN
>>> counter's capacity, the result should be accurate. I checked the 1.6
>>> source code and didn't find a clue. Please share more information (cube
>>> definition and query logs) for investigation, thanks!
>>>
>>> As for negative numbers in TopN: actually, they aren't recommended, as
>>> they go against TopN's purpose, which is to count occurrences of
>>> something. When merging two TopN counters, one counter will use the
>>> other's last element's count (if the other is full) to accumulate on its
>>> own elements (as a supplement).
If the last element is close to 0, its impaction will be >>> minor. But if the last element is a big negative value, you know it's >>> impaction will be considerable! It doesn't make sense to reduce existing >>> element's counting value if the last element's value is negative. So please >>> use it properly in your scenario. Ofcourse, I think Kylin should also add >>> more checks there. >>> >>> 2017-05-14 17:18 GMT+08:00 Tingmao Lin <[email protected]>: >>> >>>> The SQL in the original email is exactly what we input in the "insight" >>>> tab in kylin admin UI. >>>> >>>> I do not have access to the host running kylin now ,and I will post the >>>> detailed log output tomorrow. >>>> >>>> >>>> We reproduced the inaccurate result behavior using a source table with >>>> <10 rows and I should be able to write a reproduce step tomorrow. >>>> >>>> ------------------------------ >>>> *From:* Billy Liu <[email protected]> >>>> *Sent:* Sunday, May 14, 2017 12:21 AM >>>> *To:* user >>>> *Subject:* Re: Questions about SUM behavior when rewritten as TOPN >>>> >>>> Thanks Tingmao for the report. >>>> >>>> Could you show us the complete SQL? In your SQL, there is no order by >>>> statement. If no ORDER BY, the query should not be rewritten into TopN >>>> measure. >>>> >>>> 2017-05-12 23:52 GMT+08:00 Tingmao Lin <[email protected]>: >>>> >>>>> Hi, >>>>> >>>>> We found that SUM() query on a cardinality 1 dimension is not accurate >>>>> (or "not correct") when automatically rewritten as TOPN. >>>>> Is that the expected behavior of kylin or there are any other issue? >>>>> >>>>> We built a cube on a table ( measure1: bigint, dim1_id:varchar, >>>>> dim2_id:varchar, ... ) using kylin 1.6.0 (Kafka streaming source) >>>>> >>>>> The cube has two measures: SUM(measure1) and >>>>> TOPN(10,sum-orderby(measure1),group by dim2_id) . 
(other measures >>>>> omitted) >>>>> and two dimensions dim1_id, dim2_id (other dims omitted) >>>>> >>>>> About the source table data: >>>>> The cardinality of dim1_id is 1 (same dim1_id for all rows in the >>>>> source table) >>>>> The cardinality of dim2_id is 1 (same dim2_id for all rows in the >>>>> source table) >>>>> The possible value of measure1 is [1,0,-1] >>>>> >>>>> When we query >>>>> "select SUM(measure1) FROM table GROUP BY dim2_id" >>>>> => the result has one row:"sum=7", >>>>> from the kylin logs we found that the query has been >>>>> automatically rewritten as TOPN(measure1,sum-orderby(measure1),group >>>>> by dim2_id) >>>>> >>>>> When we write another query to prevent TOPN rewrite, for example: >>>>> >>>>> "select SUM(measure1),count(*) FROM table GROUP BY dim2_id" => >>>>> one row -- "sum=-2,count=24576" >>>>> >>>>> "select SUM(measure1),count(*) FROM table" >>>>> => one row -- "sum=-2,count=24576" >>>>> >>>>> >>>>> The result is different (7 and -2) when rewritting to TOPN or not. >>>>> >>>>> >>>>> My question is: are the following behavior "works as expected" ,or >>>>> TOPN algorithm does not support negative counter values very well , or any >>>>> issue there? >>>>> >>>>> >>>>> 1. SUM() query automatically rewritten as TOPN and gives approximated >>>>> result when no TOPN present in the query. >>>>> >>>>> 2. When cardinality is 1, TOPN does not give accurate result. >>>>> >>>>> >>>>> >>>>> >>>>> Thanks. >>>>> >>>>> >>>>> >>>>> >>>> >>> >>> >>> -- >>> Best regards, >>> >>> Shaofeng Shi 史少锋 >>> >>> >> >> >> -- >> Best regards, >> >> Shaofeng Shi 史少锋 >> >> > > > -- > Best regards, > > Shaofeng Shi 史少锋 > > -- Best regards, Shaofeng Shi 史少锋

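Shaofeng's point about merging can be illustrated with a toy model. This is a simplified sketch of the merge rule he describes (when the other counter is full, its smallest count is added as a supplement to elements it is missing), not Kylin's TopNCounter implementation; `ToyTopNMerge` and its names are invented for illustration, and the real algorithm also trims back to capacity after merging.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ToyTopNMerge {
    // Merge counter b into a. If b is full (at capacity), any key that a has
    // but b lacks is assumed to have been seen up to b's smallest count, so
    // that count is added as a supplement -- the rule described in the thread.
    static Map<String, Double> merge(Map<String, Double> a, Map<String, Double> b, int capacity) {
        double bFloor = b.size() >= capacity
                ? b.values().stream().min(Double::compare).orElse(0.0)
                : 0.0;
        Map<String, Double> out = new LinkedHashMap<>(a);
        for (Map.Entry<String, Double> e : b.entrySet()) {
            out.merge(e.getKey(), e.getValue(), Double::sum);
        }
        for (String k : a.keySet()) {
            if (!b.containsKey(k)) {
                out.merge(k, bFloor, Double::sum); // supplement from b's last element
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Double> a = new LinkedHashMap<>();
        a.put("p", 100.0);
        a.put("q", 50.0);
        Map<String, Double> b = new LinkedHashMap<>();
        b.put("r", -40.0);
        b.put("s", -60.0);
        // b is full (capacity 2) and its smallest count is -60, so p and q
        // each absorb -60 -- a "considerable impact" from a negative tail.
        System.out.println(merge(a, b, 2)); // {p=40.0, q=-10.0, r=-40.0, s=-60.0}
    }
}
```

With non-negative counts the supplement can only nudge estimates upward within a bounded error; with a negative tail it actively shrinks existing counts, which is why negative values break the accuracy guarantee.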