Thanks Tingmao, we will handle that.

2017-05-15 19:22 GMT+08:00 Tingmao Lin <[email protected]>:
> Yes, the serializer just doesn't work with negative counters. That explains
> why the result is inaccurate and it's not a bug.
>
> About the SUM auto-rewrite behavior, I have created a JIRA:
> https://issues.apache.org/jira/browse/KYLIN-2620
>
> Thanks to Shaofeng Shi, Li Yang and Billy Liu for helping resolve the
> questions.
>
> ------------------------------
> *From:* ShaoFeng Shi <[email protected]>
> *Sent:* Monday, May 15, 2017 10:36 AM
> *To:* user
> *Subject:* Re: Questions about SUM behavior when rewritten as TOPN
>
> OK, then that explains it.
>
> The Top-N algorithm is based on the assumption that your data follows a
> Zipf distribution (https://en.wikipedia.org/wiki/Zipf%27s_law). If not, the
> result can be very inaccurate. Your sample is such a case: with only 1 key
> and negative counters, the error accumulates and you finally see a
> deviation from the real value, which is not a bug I think. Please use a SUM
> measure instead of abusing TopN.
>
> 2017-05-15 16:55 GMT+08:00 Tingmao Lin <[email protected]>:
>
>> We found something about serialization precision that might be related to
>> the inaccurate result problem.
>>
>> The TopNCounterSerializer uses DoubleDeltaSerializer to serialize the
>> double array. DoubleDeltaSerializer writes double values as
>> int(d*1000+0.5)/1000, so with each serialization pass a negative integer
>> value increases by 0.001.
>>
>> Test case (slightly modified from
>> TopNCounterSerializerTest.testSerialization):
>>
>> @Test
>> public void testNegativeSerialization() {
>>     TopNCounter<ByteArray> vs = new TopNCounter<ByteArray>(50);
>>     Integer[] stream = {1};
>>     double[] incValues = {-1};
>>     for (int counter = 0; counter < stream.length; ++counter) {
>>         Integer i = stream[counter];
>>         double v = incValues[counter];
>>         vs.offer(new ByteArray(Bytes.toBytes(i)), v);
>>     }
>>     vs.sortAndRetain();
>>     ByteBuffer out = ByteBuffer.allocate(1024);
>>     serializer.serialize(vs, out);
>>
>>     byte[] copyBytes = new byte[out.position()];
>>     System.arraycopy(out.array(), 0, copyBytes, 0, out.position());
>>
>>     ByteBuffer in = ByteBuffer.wrap(copyBytes);
>>     TopNCounter<ByteArray> vsNew = serializer.deserialize(in);
>>
>>     Assert.assertEquals(vs.toString(), vsNew.toString());
>> }
>>
>> and the result is:
>>
>> org.junit.ComparisonFailure:
>> Expected :[\x00\x00\x00\x01:-1.0]
>> Actual   :[\x00\x00\x00\x01:-0.999]
>>
>> So if we set the type of measure1 to double and build the cube using a
>> single line {measure1 = -1.0}, then "select sum(measure1) from table group
>> by dim2_id" will return "-0.994" instead of "-1".
>>
>> ------------------------------
>> *From:* Tingmao Lin <[email protected]>
>> *Sent:* Monday, May 15, 2017 8:44 AM
>> *To:* [email protected]
>> *Subject:* Re: Questions about SUM behavior when rewritten as TOPN
>>
>> The query log and the cube definition are here:
>>
>> Query without rewrite:
>> 2017-05-15 14:59:50,741 INFO [pool-8-thread-1] threadpool.DefaultScheduler:118 : Job Fetcher: 0 should running, 0 actual running, 0 ready, 577 already succeed, 19 error, 11 discarded, 0 others
>> 2017-05-15 14:59:51,637 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] service.QueryService:336 : Using project: TEST
>> 2017-05-15 14:59:51,638 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] service.QueryService:337 : The original query: select sum(v),count(v) from test group by lv2_id
>> 2017-05-15 14:59:51,638 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] service.QueryService:440 : The corrected query: select sum(v),count(v) from test group by lv2_id LIMIT 50000
>> 2017-05-15 14:59:51,674 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] routing.QueryRouter:48 : The project manager's reference is org.apache.kylin.metadata.project.ProjectManager@ac216f8
>> 2017-05-15 14:59:51,674 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] routing.QueryRouter:60 : Find candidates by table DEFAULT.TEST and project=TEST : CUBE[name=TEST1ROWS1DIM]
>> 2017-05-15 14:59:51,674 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] routing.QueryRouter:51 : Applying rule: class org.apache.kylin.query.routing.rules.RemoveBlackoutRealizationsRule, realizations before: [TEST1ROWS1DIM(CUBE)], realizations after: [TEST1ROWS1DIM(CUBE)]
>> 2017-05-15 14:59:51,674 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] routing.QueryRouter:51 : Applying rule: class org.apache.kylin.query.routing.rules.RemoveUncapableRealizationsRule, realizations before: [TEST1ROWS1DIM(CUBE)], realizations after: [TEST1ROWS1DIM(CUBE)]
>> 2017-05-15 14:59:51,675 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] rules.RealizationSortRule:40 : CUBE[name=TEST1ROWS1DIM] priority 1 cost 73.
>> 2017-05-15 14:59:51,675 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] routing.QueryRouter:51 : Applying rule: class org.apache.kylin.query.routing.rules.RealizationSortRule, realizations before: [TEST1ROWS1DIM(CUBE)], realizations after: [TEST1ROWS1DIM(CUBE)]
>> 2017-05-15 14:59:51,675 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] routing.QueryRouter:72 : The realizations remaining: [TEST1ROWS1DIM(CUBE)] And the final chosen one is the first one
>> 2017-05-15 14:59:51,707 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] gtrecord.GTCubeStorageQueryBase:105 : Cuboid identified: cube=TEST1ROWS1DIM, cuboidId=1, groupsD=[DEFAULT.TEST.LV2_ID], otherDimsD=[]
>> 2017-05-15 14:59:51,707 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] gtrecord.GTCubeStorageQueryBase:230 : GroupD :[DEFAULT.TEST.LV2_ID]
>> 2017-05-15 14:59:51,707 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] gtrecord.GTCubeStorageQueryBase:231 : SingleValueD :[]
>> 2017-05-15 14:59:51,707 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] gtrecord.GTCubeStorageQueryBase:232 : Cuboid columns :[DEFAULT.TEST.LV2_ID]
>> 2017-05-15 14:59:51,707 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] gtrecord.GTCubeStorageQueryBase:238 : Does not need storage aggregation
>> 2017-05-15 14:59:51,708 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] storage.StorageContext:134 : Not enabling limit push down because the limit(including offset) 50000 is larger than kylin.query.pushdown.limit.max 10000
>> 2017-05-15 14:59:51,708 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] gtrecord.CubeSegmentScanner:57 : Init CubeSegmentScanner for segment 29542_29543
>> 2017-05-15 14:59:51,763 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] v2.CubeHBaseEndpointRPC:139 : Serialized scanRequestBytes 311 bytes, rawScanBytesString 40 bytes
>> 2017-05-15 14:59:51,764 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] v2.CubeHBaseEndpointRPC:141 : The scan 6fdd1fd2 for segment TEST1ROWS1DIM[29542_29543] is as below with 1 separate raw scans, shard part of start/end key is set to 0
>> 2017-05-15 14:59:51,764 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] v2.CubeHBaseRPC:278 : Visiting hbase table KYLIN_EBM57UYZMQ: cuboid exact match, from 1 to 1 Start: \x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00 (\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00) Stop: \x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\xFF\x00 (\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\xFF\x00), No Fuzzy Key
>> 2017-05-15 14:59:51,768 INFO [kylin-coproc--pool14-t14] v2.CubeHBaseEndpointRPC:200 : <sub-thread for Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20 GTScanRequest 6fdd1fd2>Endpoint RPC returned from HTable KYLIN_EBM57UYZMQ Shard \x4B\x59\x4C\x49\x4E\x5F\x45\x42\x4D\x35\x37\x55\x59\x5A\x4D\x51\x2C\x2C\x31\x34\x39\x34\x35\x37\x33\x38\x39\x30\x38\x33\x35\x2E\x37\x33\x35\x38\x33\x62\x36\x38\x30\x39\x62\x65\x39\x61\x37\x37\x66\x61\x32\x64\x65\x65\x62\x63\x62\x34\x35\x31\x39\x38\x64\x30\x2E on host: W168PC03. Total scanned row: 1. Total filtered/aggred row: 0. Time elapsed in EP: 1(ms). Server CPU usage: 0.04998115611708333, server physical mem left: 8.621008896E10, server swap mem left: 0.0. Etc message: start latency: 58@0,agg done@1,compress done@1,server stats done@1, debugGitTag:0fca07178088320fda44513c63a1134b7d39427c;.Normal Complete: true.
>> 2017-05-15 14:59:51,771 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] service.QueryService:576 : Scan count for each storageContext: 1,
>> 2017-05-15 14:59:51,771 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] service.QueryService:359 : Stats of SQL response: isException: false, duration: 133, total scan count 1
>> 2017-05-15 14:59:51,771 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] service.QueryService:284 :
>> ==========================[QUERY]===============================
>> Query Id: cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20
>> SQL: select sum(v),count(v) from test group by lv2_id
>> LIMIT 50000
>> User: ADMIN
>> Success: true
>> Duration: 0.133
>> Project: TEST
>> Realization Names: [TEST1ROWS1DIM]
>> Cuboid Ids: [1]
>> Total scan count: 1
>> Result row count: 1
>> Accept Partial: true
>> Is Partial Result: false
>> Hit Exception Cache: false
>> Storage cache used: false
>> Message: null
>> ==========================[QUERY]===============================
>>
>> Query with rewrite:
>> 2017-05-15 14:59:34,156 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] service.QueryService:336 : Using project: TEST
>> 2017-05-15 14:59:34,157 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] service.QueryService:337 : The original query: select sum(v) from test group by lv2_id
>> 2017-05-15 14:59:34,157 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] service.QueryService:440 : The corrected query: select sum(v) from test group by lv2_id LIMIT 50000
>> 2017-05-15 14:59:34,185 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] routing.QueryRouter:48 : The project manager's reference is org.apache.kylin.metadata.project.ProjectManager@ac216f8
>> 2017-05-15 14:59:34,185 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] routing.QueryRouter:60 : Find candidates by table DEFAULT.TEST and project=TEST : CUBE[name=TEST1ROWS1DIM]
>> 2017-05-15 14:59:34,185 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] routing.QueryRouter:51 : Applying rule: class org.apache.kylin.query.routing.rules.RemoveBlackoutRealizationsRule, realizations before: [TEST1ROWS1DIM(CUBE)], realizations after: [TEST1ROWS1DIM(CUBE)]
>> 2017-05-15 14:59:34,186 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] cube.CubeCapabilityChecker:191 : Cube CUBE[name=TEST1ROWS1DIM] CapabilityInfluences: TOP10@class org.apache.kylin.measure.topn.TopNMeasureType
>> 2017-05-15 14:59:34,186 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] routing.QueryRouter:51 : Applying rule: class org.apache.kylin.query.routing.rules.RemoveUncapableRealizationsRule, realizations before: [TEST1ROWS1DIM(CUBE)], realizations after: [TEST1ROWS1DIM(CUBE)]
>> 2017-05-15 14:59:34,186 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] rules.RealizationSortRule:40 : CUBE[name=TEST1ROWS1DIM] priority 1 cost 21.
>> 2017-05-15 14:59:34,186 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] routing.QueryRouter:51 : Applying rule: class org.apache.kylin.query.routing.rules.RealizationSortRule, realizations before: [TEST1ROWS1DIM(CUBE)], realizations after: [TEST1ROWS1DIM(CUBE)]
>> 2017-05-15 14:59:34,186 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] routing.QueryRouter:72 : The realizations remaining: [TEST1ROWS1DIM(CUBE)] And the final chosen one is the first one
>> 2017-05-15 14:59:34,217 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] topn.TopNMeasureType:338 : Rewrite function FunctionDesc [expression=SUM, parameter=ParameterDesc [type=column, value=V, nextParam=null], returnType=bigint] to FunctionDesc [expression=TOP_N, parameter=ParameterDesc [type=column, value=V, nextParam=ParameterDesc [type=column, value=LV2_ID, nextParam=null]], returnType=topn(10)]
>> 2017-05-15 14:59:34,217 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] gtrecord.GTCubeStorageQueryBase:105 : Cuboid identified: cube=TEST1ROWS1DIM, cuboidId=1, groupsD=[], otherDimsD=[]
>> 2017-05-15 14:59:34,217 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] gtrecord.GTCubeStorageQueryBase:230 : GroupD :[]
>> 2017-05-15 14:59:34,217 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] gtrecord.GTCubeStorageQueryBase:231 : SingleValueD :[]
>> 2017-05-15 14:59:34,218 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] gtrecord.GTCubeStorageQueryBase:232 : Cuboid columns :[DEFAULT.TEST.LV2_ID]
>> 2017-05-15 14:59:34,218 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] gtrecord.GTCubeStorageQueryBase:241 : Need storage aggregation
>> 2017-05-15 14:59:34,218 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] storage.StorageContext:134 : Not enabling limit push down because the limit(including offset) 50000 is larger than kylin.query.pushdown.limit.max 10000
>> 2017-05-15 14:59:34,218 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] gtrecord.GTCubeStorageQueryBase:334 : Memory budget is set to 805306 rows
>> 2017-05-15 14:59:34,218 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] gtrecord.CubeSegmentScanner:57 : Init CubeSegmentScanner for segment 29542_29543
>> 2017-05-15 14:59:34,219 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] dict.DictionaryManager:420 : DictionaryManager(1833236614) loading DictionaryInfo(loadDictObj:true) at /dict/DEFAULT.TEST/LV2_ID/7be3d6b1-0ed4-4b2d-9015-0c92b0757725.dict
>> 2017-05-15 14:59:34,221 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] v2.CubeHBaseEndpointRPC:139 : Serialized scanRequestBytes 306 bytes, rawScanBytesString 40 bytes
>> 2017-05-15 14:59:34,221 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] v2.CubeHBaseEndpointRPC:141 : The scan 37ca24c9 for segment TEST1ROWS1DIM[29542_29543] is as below with 1 separate raw scans, shard part of start/end key is set to 0
>> 2017-05-15 14:59:34,221 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] v2.CubeHBaseRPC:278 : Visiting hbase table KYLIN_EBM57UYZMQ: cuboid require post aggregation, from 0 to 1 Start: \x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00 (\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00) Stop: \x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\xFF\x00 (\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\xFF\x00), No Fuzzy Key
>> 2017-05-15 14:59:34,256 INFO [kylin-coproc--pool14-t13] v2.CubeHBaseEndpointRPC:200 : <sub-thread for Query 54ac350a-7676-4601-a109-04dec543e56e GTScanRequest 37ca24c9>Endpoint RPC returned from HTable KYLIN_EBM57UYZMQ Shard \x4B\x59\x4C\x49\x4E\x5F\x45\x42\x4D\x35\x37\x55\x59\x5A\x4D\x51\x2C\x2C\x31\x34\x39\x34\x35\x37\x33\x38\x39\x30\x38\x33\x35\x2E\x37\x33\x35\x38\x33\x62\x36\x38\x30\x39\x62\x65\x39\x61\x37\x37\x66\x61\x32\x64\x65\x65\x62\x63\x62\x34\x35\x31\x39\x38\x64\x30\x2E on host: W168PC03. Total scanned row: 1. Total filtered/aggred row: 0. Time elapsed in EP: 2(ms). Server CPU usage: 0.07331964572432702, server physical mem left: 8.619999232E10, server swap mem left: 0.0. Etc message: start latency: 33@1,agg done@1,compress done@1,server stats done@2, debugGitTag:0fca07178088320fda44513c63a1134b7d39427c;.Normal Complete: true.
>> 2017-05-15 14:59:34,259 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] service.QueryService:576 : Scan count for each storageContext: 1,
>> 2017-05-15 14:59:34,259 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] service.QueryService:359 : Stats of SQL response: isException: false, duration: 102, total scan count 1
>> 2017-05-15 14:59:34,260 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] service.QueryService:284 :
>> ==========================[QUERY]===============================
>> Query Id: 54ac350a-7676-4601-a109-04dec543e56e
>> SQL: select sum(v) from test group by lv2_id
>> LIMIT 50000
>> User: ADMIN
>> Success: true
>> Duration: 0.102
>> Project: TEST
>> Realization Names: [TEST1ROWS1DIM]
>> Cuboid Ids: [1]
>> Total scan count: 1
>> Result row count: 1
>> Accept Partial: true
>> Is Partial Result: false
>> Hit Exception Cache: false
>> Storage cache used: false
>> Message: null
>> ==========================[QUERY]===============================
>>
>> Cube definition:
>>
>> {
>>   "uuid": "4a7fae8c-b518-4828-8f75-e6cd04689d75",
>>   "last_modified": 1494573765171,
>>   "version": "1.6.0",
>>   "name": "TEST1ROWS1DIM",
>>   "model_name": "TEST",
>>   "description": "",
>>   "null_string": null,
>>   "dimensions": [
>>     { "name": "LV2_ID", "table": "DEFAULT.TEST", "column": "LV2_ID", "derived": null },
>>     { "name": "MINUTE_START", "table": "DEFAULT.TEST", "column": "MINUTE_START", "derived": null }
>>   ],
>>   "measures": [
>>     {
>>       "name": "_COUNT_",
>>       "function": {
>>         "expression": "COUNT",
>>         "parameter": { "type": "constant", "value": "1", "next_parameter": null },
>>         "returntype": "bigint"
>>       },
>>       "dependent_measure_ref": null
>>     },
>>     {
>>       "name": "COUNT",
>>       "function": {
>>         "expression": "SUM",
>>         "parameter": { "type": "column", "value": "V", "next_parameter": null },
>>         "returntype": "bigint"
>>       },
>>       "dependent_measure_ref": null
>>     },
>>     {
>>       "name": "TOP10",
>>       "function": {
>>         "expression": "TOP_N",
>>         "parameter": {
>>           "type": "column",
>>           "value": "V",
>>           "next_parameter": { "type": "column", "value": "LV2_ID", "next_parameter": null }
>>         },
>>         "returntype": "topn(10)",
>>         "configuration": { "topn.encoding.LV2_ID": "dict" }
>>       },
>>       "dependent_measure_ref": null
>>     }
>>   ],
>>   "dictionaries": [],
>>   "rowkey": {
>>     "rowkey_columns": [
>>       { "column": "MINUTE_START", "encoding": "time", "isShardBy": false },
>>       { "column": "LV2_ID", "encoding": "dict", "isShardBy": false }
>>     ]
>>   },
>>   "hbase_mapping": {
>>     "column_family": [
>>       { "name": "F1", "columns": [ { "qualifier": "M", "measure_refs": [ "_COUNT_", "COUNT", "TOP10" ] } ] }
>>     ]
>>   },
>>   "aggregation_groups": [
>>     {
>>       "includes": [ "MINUTE_START", "LV2_ID" ],
>>       "select_rule": { "hierarchy_dims": [], "mandatory_dims": [], "joint_dims": [] }
>>     }
>>   ],
>>   "signature": "o75jqwjKPFF859EyU6j9eA==",
>>   "notify_list": [],
>>   "status_need_notify": [ "ERROR", "DISCARDED", "SUCCEED" ],
>>   "partition_date_start": 0,
>>   "partition_date_end": 3153600000000,
>>   "auto_merge_time_ranges": [ 604800000, 2419200000 ],
>>   "retention_range": 0,
>>   "engine_type": 2,
>>   "storage_type": 2,
>>   "override_kylin_properties": {}
>> }
>>
>> ------------------------------
>> *From:* ShaoFeng Shi <[email protected]>
>> *Sent:* Monday, May 15, 2017 1:37 AM
>> *To:* user
>> *Subject:* Re: Questions about SUM behavior when rewritten as TOPN
>>
>> Hi Tingmao,
>>
>> Your finding is correct; Kylin doesn't check whether there is an "order by
>> ... limit" pattern or not. The main reason is that the SQL parser didn't
>> push the "order by limit" info down here (at least when TopN was
>> developed), so I couldn't leverage it.
>> Another reason is that usually we won't put the "group and order" column
>> in as a normal cube dimension (though Kylin doesn't stop you from doing
>> that).
>>
>> Please feel free to open a JIRA on checking the "order by limit" pattern;
>> I can look at it later.
>>
>> 2017-05-15 2:37 GMT+08:00 Tingmao Lin <[email protected]>:
>>
>>> Yeah, we looked into the TopNCounter source code and found that for the
>>> low-cardinality scenario, m1=m2=0 in TopNCounter, so it just sums up the
>>> values. But the result still goes wrong. I will collect more information
>>> for investigation ASAP :)
>>>
>>> I have a question: as Billy Liu said in this thread, Kylin will check
>>> the ORDER BY clause to determine whether to rewrite. But I didn't find
>>> any access to SQLDigest.sortColumns in
>>> TopNMeasureType.influenceCapabilityCheck(). Does Kylin check it
>>> elsewhere? It seems that if sum(measure) is the only measure in the
>>> query and the group by column matches, then
>>> TopNMeasureType.isTopNCompatibleSum() will pass and the query gets
>>> rewritten. This confuses users, since they may expect an accurate result
>>> for every distinct value of the group by column(s).
>>>
>>> ------------------------------
>>> *From:* ShaoFeng Shi <[email protected]>
>>> *Sent:* Sunday, May 14, 2017 2:13 PM
>>> *To:* user
>>> *Subject:* Re: Questions about SUM behavior when rewritten as TOPN
>>>
>>> Agree with Yang's points. When the cardinality is smaller than the TopN
>>> counter's capacity, the result should be accurate. I checked the 1.6
>>> source code and didn't find a clue. Please share more information (cube
>>> definition and query logs) for investigation, thanks!
>>>
>>> As for negative numbers in TopN: actually, they aren't recommended, as
>>> they go against TopN's purpose, which is to count occurrences of
>>> something. When merging two TopN counters, one counter will use the
>>> other's last element's count (if the other is full) to accumulate on its
>>> own elements (as a supplement).
If the last element is close to 0, its impaction will be >>> minor. But if the last element is a big negative value, you know it's >>> impaction will be considerable! It doesn't make sense to reduce existing >>> element's counting value if the last element's value is negative. So please >>> use it properly in your scenario. Ofcourse, I think Kylin should also add >>> more checks there. >>> >>> 2017-05-14 17:18 GMT+08:00 Tingmao Lin <[email protected]>: >>> >>>> The SQL in the original email is exactly what we input in the "insight" >>>> tab in kylin admin UI. >>>> >>>> I do not have access to the host running kylin now ,and I will post the >>>> detailed log output tomorrow. >>>> >>>> >>>> We reproduced the inaccurate result behavior using a source table with >>>> <10 rows and I should be able to write a reproduce step tomorrow. >>>> >>>> ------------------------------ >>>> *From:* Billy Liu <[email protected]> >>>> *Sent:* Sunday, May 14, 2017 12:21 AM >>>> *To:* user >>>> *Subject:* Re: Questions about SUM behavior when rewritten as TOPN >>>> >>>> Thanks Tingmao for the report. >>>> >>>> Could you show us the complete SQL? In your SQL, there is no order by >>>> statement. If no ORDER BY, the query should not be rewritten into TopN >>>> measure. >>>> >>>> 2017-05-12 23:52 GMT+08:00 Tingmao Lin <[email protected]>: >>>> >>>>> Hi, >>>>> >>>>> We found that SUM() query on a cardinality 1 dimension is not accurate >>>>> (or "not correct") when automatically rewritten as TOPN. >>>>> Is that the expected behavior of kylin or there are any other issue? >>>>> >>>>> We built a cube on a table ( measure1: bigint, dim1_id:varchar, >>>>> dim2_id:varchar, ... ) using kylin 1.6.0 (Kafka streaming source) >>>>> >>>>> The cube has two measures: SUM(measure1) and >>>>> TOPN(10,sum-orderby(measure1),group by dim2_id) . 
(other measures >>>>> omitted) >>>>> and two dimensions dim1_id, dim2_id (other dims omitted) >>>>> >>>>> About the source table data: >>>>> The cardinality of dim1_id is 1 (same dim1_id for all rows in the >>>>> source table) >>>>> The cardinality of dim2_id is 1 (same dim2_id for all rows in the >>>>> source table) >>>>> The possible value of measure1 is [1,0,-1] >>>>> >>>>> When we query >>>>> "select SUM(measure1) FROM table GROUP BY dim2_id" >>>>> => the result has one row:"sum=7", >>>>> from the kylin logs we found that the query has been >>>>> automatically rewritten as TOPN(measure1,sum-orderby(measure1),group >>>>> by dim2_id) >>>>> >>>>> When we write another query to prevent TOPN rewrite, for example: >>>>> >>>>> "select SUM(measure1),count(*) FROM table GROUP BY dim2_id" => >>>>> one row -- "sum=-2,count=24576" >>>>> >>>>> "select SUM(measure1),count(*) FROM table" >>>>> => one row -- "sum=-2,count=24576" >>>>> >>>>> >>>>> The result is different (7 and -2) when rewritting to TOPN or not. >>>>> >>>>> >>>>> My question is: are the following behavior "works as expected" ,or >>>>> TOPN algorithm does not support negative counter values very well , or any >>>>> issue there? >>>>> >>>>> >>>>> 1. SUM() query automatically rewritten as TOPN and gives approximated >>>>> result when no TOPN present in the query. >>>>> >>>>> 2. When cardinality is 1, TOPN does not give accurate result. >>>>> >>>>> >>>>> >>>>> >>>>> Thanks. >>>>> >>>>> >>>>> >>>>> >>>> >>> >>> >>> -- >>> Best regards, >>> >>> Shaofeng Shi 史少锋 >>> >>> >> >> >> -- >> Best regards, >> >> Shaofeng Shi 史少锋 >> >> > > > -- > Best regards, > > Shaofeng Shi 史少锋 > > -- Best regards, Shaofeng Shi 史少锋

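Shaofeng's point about merging can be illustrated with a toy model. This is a simplified sketch of the merge rule he describes (when the other counter is full, its smallest count is added as a supplement to elements it is missing), not Kylin's TopNCounter implementation; `ToyTopNMerge` and its names are invented for illustration, and the real algorithm also trims back to capacity after merging.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ToyTopNMerge {
    // Merge counter b into a. If b is full (at capacity), any key that a has
    // but b lacks is assumed to have been seen up to b's smallest count, so
    // that count is added as a supplement -- the rule described in the thread.
    static Map<String, Double> merge(Map<String, Double> a, Map<String, Double> b, int capacity) {
        double bFloor = b.size() >= capacity
                ? b.values().stream().min(Double::compare).orElse(0.0)
                : 0.0;
        Map<String, Double> out = new LinkedHashMap<>(a);
        for (Map.Entry<String, Double> e : b.entrySet()) {
            out.merge(e.getKey(), e.getValue(), Double::sum);
        }
        for (String k : a.keySet()) {
            if (!b.containsKey(k)) {
                out.merge(k, bFloor, Double::sum); // supplement from b's last element
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Double> a = new LinkedHashMap<>();
        a.put("p", 100.0);
        a.put("q", 50.0);
        Map<String, Double> b = new LinkedHashMap<>();
        b.put("r", -40.0);
        b.put("s", -60.0);
        // b is full (capacity 2) and its smallest count is -60, so p and q
        // each absorb -60 -- a "considerable impact" from a negative tail.
        System.out.println(merge(a, b, 2)); // {p=40.0, q=-10.0, r=-40.0, s=-60.0}
    }
}
```

With non-negative counts the supplement can only nudge estimates upward within a bounded error; with a negative tail it actively shrinks existing counts, which is why negative values break the accuracy guarantee.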