OK, then that explains it. The TopN algorithm is based on the assumption that your data follows a Zipf distribution (https://en.wikipedia.org/wiki/Zipf%27s_law). If it doesn't, the result can be very inaccurate. Your sample is such a case: with only 1 key and negative counters, the error accumulates and finally shows a variation from the real value, which I don't think is a bug. Please use the SUM measure instead of abusing TopN.
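As a side note for readers of this archived thread: the drift discussed here can be reproduced outside Kylin using the fixed-point rounding that Tingmao reports below for DoubleDeltaSerializer, int(d*1000+0.5)/1000. This is a minimal standalone sketch; the class and method names are invented for illustration, and only the rounding expression comes from the thread.

```java
public class DeltaRoundTripDemo {
    // Fixed-point rounding as reported below for DoubleDeltaSerializer:
    // a double d is written as int(d * 1000 + 0.5) / 1000. In Java the
    // (int) cast truncates toward zero, so -999.5 becomes -999, not -1000.
    static double roundTrip(double d) {
        return ((int) (d * 1000 + 0.5)) / 1000.0;
    }

    public static void main(String[] args) {
        double v = -1.0;
        // Every serialize/deserialize cycle shifts a negative value by +0.001.
        for (int cycle = 1; cycle <= 6; cycle++) {
            v = roundTrip(v);
            System.out.println("after cycle " + cycle + ": " + v);
        }
    }
}
```

One cycle turns -1.0 into -0.999 (the failing assertion in the test case quoted below), and six cycles reach -0.994, consistent with the query result Tingmao observed — suggesting the drift comes from repeated serialization rather than from the TopN algorithm itself.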
2017-05-15 16:55 GMT+08:00 Tingmao Lin <[email protected]>:

> We found something about serialization precision that might be related to
> the inaccurate result problem.
>
> The TopNCounterSerializer uses DoubleDeltaSerializer to serialize the
> double array. The DoubleDeltaSerializer writes double values as
> int(d*1000+0.5)/1000, so with each serialization pass a negative integer
> value increases by 0.001.
>
> Test case (slightly modified from TopNCounterSerializerTest.testSerialization):
>
> @Test
> public void testNegativeSerialization() {
>     TopNCounter<ByteArray> vs = new TopNCounter<ByteArray>(50);
>     Integer[] stream = {1};
>     double[] incValues = {-1};
>     for (int counter = 0; counter < stream.length; ++counter) {
>         Integer i = stream[counter];
>         double v = incValues[counter];
>         vs.offer(new ByteArray(Bytes.toBytes(i)), v);
>     }
>     vs.sortAndRetain();
>     ByteBuffer out = ByteBuffer.allocate(1024);
>     serializer.serialize(vs, out);
>
>     byte[] copyBytes = new byte[out.position()];
>     System.arraycopy(out.array(), 0, copyBytes, 0, out.position());
>
>     ByteBuffer in = ByteBuffer.wrap(copyBytes);
>     TopNCounter<ByteArray> vsNew = serializer.deserialize(in);
>
>     Assert.assertEquals(vs.toString(), vsNew.toString());
> }
>
> And the result is:
>
> org.junit.ComparisonFailure:
> Expected :[\x00\x00\x00\x01:-1.0]
> Actual   :[\x00\x00\x00\x01:-0.999]
>
> So if we set the type of measure1 to double and build the cube from a
> single line {measure1 = -1.0}, then
> "select sum(measure1) from table group by dim2_id" will return
> "-0.994" instead of "-1".
>
> ------------------------------
> *From:* Tingmao Lin <[email protected]>
> *Sent:* Monday, May 15, 2017 8:44 AM
> *To:* [email protected]
> *Subject:* Re: Questions about SUM behavior when rewritten as TOPN
>
> The query log and the cube definition are here:
>
> Query without rewrite:
> 2017-05-15 14:59:50,741 INFO [pool-8-thread-1] > threadpool.DefaultScheduler:118 : Job Fetcher: 0 should running, 0 actual > running,
0 ready, 577 already succeed, 19 error, 11 discarded, 0 others > 2017-05-15 14:59:51,637 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] > service.QueryService:336 : Using project: TEST > 2017-05-15 14:59:51,638 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] > service.QueryService:337 : The original query: select sum(v),count(v) from > test group by lv2_id > 2017-05-15 14:59:51,638 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] > service.QueryService:440 : The corrected query: select sum(v),count(v) from > test group by lv2_id > LIMIT 50000 > 2017-05-15 14:59:51,674 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] > routing.QueryRouter:48 : The project manager's reference is > org.apache.kylin.metadata.project.ProjectManager@ac216f8 > 2017-05-15 14:59:51,674 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] > routing.QueryRouter:60 : Find candidates by table DEFAULT.TEST and > project=TEST : CUBE[name=TEST1ROWS1DIM] > 2017-05-15 14:59:51,674 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] > routing.QueryRouter:51 : Applying rule: class org.apache.kylin.query. > routing.rules.RemoveBlackoutRealizationsRule, realizations before: > [TEST1ROWS1DIM(CUBE)], realizations after: [TEST1ROWS1DIM(CUBE)] > 2017-05-15 14:59:51,674 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] > routing.QueryRouter:51 : Applying rule: class org.apache.kylin.query. > routing.rules.RemoveUncapableRealizationsRule, realizations before: > [TEST1ROWS1DIM(CUBE)], realizations after: [TEST1ROWS1DIM(CUBE)] > 2017-05-15 14:59:51,675 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] > rules.RealizationSortRule:40 : CUBE[name=TEST1ROWS1DIM] priority 1 cost 73. > 2017-05-15 14:59:51,675 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] > routing.QueryRouter:51 : Applying rule: class org.apache.kylin.query. 
> routing.rules.RealizationSortRule, realizations before: > [TEST1ROWS1DIM(CUBE)], realizations after: [TEST1ROWS1DIM(CUBE)] > 2017-05-15 14:59:51,675 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] > routing.QueryRouter:72 : The realizations remaining: [TEST1ROWS1DIM(CUBE)] > And the final chosen one is the first one > 2017-05-15 14:59:51,707 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] > gtrecord.GTCubeStorageQueryBase:105 : Cuboid identified: > cube=TEST1ROWS1DIM, cuboidId=1, groupsD=[DEFAULT.TEST.LV2_ID], otherDimsD=[] > 2017-05-15 14:59:51,707 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] > gtrecord.GTCubeStorageQueryBase:230 : GroupD :[DEFAULT.TEST.LV2_ID] > 2017-05-15 14:59:51,707 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] > gtrecord.GTCubeStorageQueryBase:231 : SingleValueD :[] > 2017-05-15 14:59:51,707 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] > gtrecord.GTCubeStorageQueryBase:232 : Cuboid columns > :[DEFAULT.TEST.LV2_ID] > 2017-05-15 14:59:51,707 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] > gtrecord.GTCubeStorageQueryBase:238 : Does not need storage aggregation > 2017-05-15 14:59:51,708 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] > storage.StorageContext:134 : Not enabling limit push down because the > limit(including offset) 50000 is larger than kylin.query.pushdown.limit.max > 10000 > 2017-05-15 14:59:51,708 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] > gtrecord.CubeSegmentScanner:57 : Init CubeSegmentScanner for segment > 29542_29543 > 2017-05-15 14:59:51,763 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] > v2.CubeHBaseEndpointRPC:139 : Serialized scanRequestBytes 311 bytes, > rawScanBytesString 40 bytes > 2017-05-15 14:59:51,764 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] > v2.CubeHBaseEndpointRPC:141 : The scan 6fdd1fd2 for segment > TEST1ROWS1DIM[29542_29543] is as below with 1 separate raw scans, shard > part of start/end key is set to 0 > 2017-05-15 
14:59:51,764 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] > v2.CubeHBaseRPC:278 : Visiting hbase table KYLIN_EBM57UYZMQ: cuboid exact > match, from 1 to 1 Start: \x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00 > (\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00) Stop: > \x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\xFF\x00 > (\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\xFF\x00), No Fuzzy Key > 2017-05-15 14:59:51,768 INFO [kylin-coproc--pool14-t14] > v2.CubeHBaseEndpointRPC:200 : <sub-thread for Query > cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20 > GTScanRequest 6fdd1fd2>Endpoint RPC returned from HTable KYLIN_EBM57UYZMQ > Shard \x4B\x59\x4C\x49\x4E\x5F\x45\x42\x4D\x35\x37\x55\x59\x5A\ > x4D\x51\x2C\x2C\x31\x34\x39\x34\x35\x37\x33\x38\x39\x30\ > x38\x33\x35\x2E\x37\x33\x35\x38\x33\x62\x36\x38\x30\x39\ > x62\x65\x39\x61\x37\x37\x66\x61\x32\x64\x65\x65\x62\x63\ > x62\x34\x35\x31\x39\x38\x64\x30\x2E on host: W168PC03.Total scanned row: > 1. Total filtered/aggred row: 0. Time elapsed in EP: 1(ms). Server CPU > usage: 0.04998115611708333, server physical mem left: 8.621008896E10, > server swap mem left:0.0.Etc message: start latency: 58@0,agg done@1,compress > done@1,server stats done@1, > debugGitTag:0fca07178088320fda44513c63a1134b7d39427c;.Normal > Complete: true. 
> 2017-05-15 14:59:51,771 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] > service.QueryService:576 : Scan count for each storageContext: 1, > 2017-05-15 14:59:51,771 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] > service.QueryService:359 : Stats of SQL response: isException: false, > duration: 133, total scan count 1 > 2017-05-15 14:59:51,771 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] > service.QueryService:284 : > ==========================[QUERY]=============================== > Query Id: cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20 > SQL: select sum(v),count(v) from test group by lv2_id > LIMIT 50000 > User: ADMIN > Success: true > Duration: 0.133 > Project: TEST > Realization Names: [TEST1ROWS1DIM] > Cuboid Ids: [1] > Total scan count: 1 > Result row count: 1 > Accept Partial: true > Is Partial Result: false > Hit Exception Cache: false > Storage cache used: false > Message: null > ==========================[QUERY]=============================== > > > > > > Query with rewrite: > 2017-05-15 14:59:34,156 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] > service.QueryService:336 : Using project: TEST > 2017-05-15 14:59:34,157 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] > service.QueryService:337 : The original query: select sum(v) from test > group by lv2_id > 2017-05-15 14:59:34,157 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] > service.QueryService:440 : The corrected query: select sum(v) from test > group by lv2_id > LIMIT 50000 > 2017-05-15 14:59:34,185 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] > routing.QueryRouter:48 : The project manager's reference is > org.apache.kylin.metadata.project.ProjectManager@ac216f8 > 2017-05-15 14:59:34,185 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] > routing.QueryRouter:60 : Find candidates by table DEFAULT.TEST and > project=TEST : CUBE[name=TEST1ROWS1DIM] > 2017-05-15 14:59:34,185 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] > routing.QueryRouter:51 
: Applying rule: class org.apache.kylin.query. > routing.rules.RemoveBlackoutRealizationsRule, realizations before: > [TEST1ROWS1DIM(CUBE)], realizations after: [TEST1ROWS1DIM(CUBE)] > 2017-05-15 14:59:34,186 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] > cube.CubeCapabilityChecker:191 : Cube CUBE[name=TEST1ROWS1DIM] > CapabilityInfluences: TOP10@class org.apache.kylin.measure.topn. > TopNMeasureType > 2017-05-15 14:59:34,186 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] > routing.QueryRouter:51 : Applying rule: class org.apache.kylin.query. > routing.rules.RemoveUncapableRealizationsRule, realizations before: > [TEST1ROWS1DIM(CUBE)], realizations after: [TEST1ROWS1DIM(CUBE)] > 2017-05-15 14:59:34,186 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] > rules.RealizationSortRule:40 : CUBE[name=TEST1ROWS1DIM] priority 1 cost 21. > 2017-05-15 14:59:34,186 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] > routing.QueryRouter:51 : Applying rule: class org.apache.kylin.query. 
> routing.rules.RealizationSortRule, realizations before: > [TEST1ROWS1DIM(CUBE)], realizations after: [TEST1ROWS1DIM(CUBE)] > 2017-05-15 14:59:34,186 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] > routing.QueryRouter:72 : The realizations remaining: [TEST1ROWS1DIM(CUBE)] > And the final chosen one is the first one > 2017-05-15 14:59:34,217 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] > topn.TopNMeasureType:338 : Rewrite function FunctionDesc [expression=SUM, > parameter=ParameterDesc [type=column, value=V, nextParam=null], > returnType=bigint] to FunctionDesc [expression=TOP_N, > parameter=ParameterDesc [type=column, value=V, nextParam=ParameterDesc > [type=column, value=LV2_ID, nextParam=null]], returnType=topn(10)] > 2017-05-15 14:59:34,217 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] > gtrecord.GTCubeStorageQueryBase:105 : Cuboid identified: > cube=TEST1ROWS1DIM, cuboidId=1, groupsD=[], otherDimsD=[] > 2017-05-15 14:59:34,217 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] > gtrecord.GTCubeStorageQueryBase:230 : GroupD :[] > 2017-05-15 14:59:34,217 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] > gtrecord.GTCubeStorageQueryBase:231 : SingleValueD :[] > 2017-05-15 14:59:34,218 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] > gtrecord.GTCubeStorageQueryBase:232 : Cuboid columns > :[DEFAULT.TEST.LV2_ID] > 2017-05-15 14:59:34,218 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] > gtrecord.GTCubeStorageQueryBase:241 : Need storage aggregation > 2017-05-15 14:59:34,218 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] > storage.StorageContext:134 : Not enabling limit push down because the > limit(including offset) 50000 is larger than kylin.query.pushdown.limit.max > 10000 > 2017-05-15 14:59:34,218 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] > gtrecord.GTCubeStorageQueryBase:334 : Memory budget is set to 805306 rows > 2017-05-15 14:59:34,218 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] > 
gtrecord.CubeSegmentScanner:57 : Init CubeSegmentScanner for segment > 29542_29543 > 2017-05-15 14:59:34,219 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] > dict.DictionaryManager:420 : DictionaryManager(1833236614) loading > DictionaryInfo(loadDictObj:true) at /dict/DEFAULT.TEST/LV2_ID/ > 7be3d6b1-0ed4-4b2d-9015-0c92b0757725.dict > 2017-05-15 14:59:34,221 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] > v2.CubeHBaseEndpointRPC:139 : Serialized scanRequestBytes 306 bytes, > rawScanBytesString 40 bytes > 2017-05-15 14:59:34,221 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] > v2.CubeHBaseEndpointRPC:141 : The scan 37ca24c9 for segment > TEST1ROWS1DIM[29542_29543] is as below with 1 separate raw scans, shard > part of start/end key is set to 0 > 2017-05-15 14:59:34,221 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] > v2.CubeHBaseRPC:278 : Visiting hbase table KYLIN_EBM57UYZMQ: cuboid require > post aggregation, from 0 to 1 Start: > \x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00 > (\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00) Stop: > \x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\xFF\x00 > (\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\xFF\x00), No Fuzzy Key > 2017-05-15 14:59:34,256 INFO [kylin-coproc--pool14-t13] > v2.CubeHBaseEndpointRPC:200 : <sub-thread for Query > 54ac350a-7676-4601-a109-04dec543e56e > GTScanRequest 37ca24c9>Endpoint RPC returned from HTable KYLIN_EBM57UYZMQ > Shard \x4B\x59\x4C\x49\x4E\x5F\x45\x42\x4D\x35\x37\x55\x59\x5A\ > x4D\x51\x2C\x2C\x31\x34\x39\x34\x35\x37\x33\x38\x39\x30\ > x38\x33\x35\x2E\x37\x33\x35\x38\x33\x62\x36\x38\x30\x39\ > x62\x65\x39\x61\x37\x37\x66\x61\x32\x64\x65\x65\x62\x63\ > x62\x34\x35\x31\x39\x38\x64\x30\x2E on host: W168PC03.Total scanned row: > 1. Total filtered/aggred row: 0. Time elapsed in EP: 2(ms). 
> Server CPU usage: 0.07331964572432702, server physical mem left:
> 8.619999232E10, server swap mem left: 0.0. Etc message: start latency:
> 33@1, agg done@1, compress done@1, server stats done@2,
> debugGitTag:0fca07178088320fda44513c63a1134b7d39427c;. Normal Complete: true.
> 2017-05-15 14:59:34,259 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108]
> service.QueryService:576 : Scan count for each storageContext: 1,
> 2017-05-15 14:59:34,259 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108]
> service.QueryService:359 : Stats of SQL response: isException: false,
> duration: 102, total scan count 1
> 2017-05-15 14:59:34,260 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108]
> service.QueryService:284 :
> ==========================[QUERY]===============================
> Query Id: 54ac350a-7676-4601-a109-04dec543e56e
> SQL: select sum(v) from test group by lv2_id
> LIMIT 50000
> User: ADMIN
> Success: true
> Duration: 0.102
> Project: TEST
> Realization Names: [TEST1ROWS1DIM]
> Cuboid Ids: [1]
> Total scan count: 1
> Result row count: 1
> Accept Partial: true
> Is Partial Result: false
> Hit Exception Cache: false
> Storage cache used: false
> Message: null
> ==========================[QUERY]===============================
>
> Cube definition:
>
> {
>   "uuid": "4a7fae8c-b518-4828-8f75-e6cd04689d75",
>   "last_modified": 1494573765171,
>   "version": "1.6.0",
>   "name": "TEST1ROWS1DIM",
>   "model_name": "TEST",
>   "description": "",
>   "null_string": null,
>   "dimensions": [
>     {
>       "name": "LV2_ID",
>       "table": "DEFAULT.TEST",
>       "column": "LV2_ID",
>       "derived": null
>     },
>     {
>       "name": "MINUTE_START",
>       "table": "DEFAULT.TEST",
>       "column": "MINUTE_START",
>       "derived": null
>     }
>   ],
>   "measures": [
>     {
>       "name": "_COUNT_",
>       "function": {
>         "expression": "COUNT",
>         "parameter": {
>           "type": "constant",
>           "value": "1",
>           "next_parameter": null
>         },
>         "returntype": "bigint"
>       },
>       "dependent_measure_ref": null
>     },
>     {
>       "name": "COUNT",
>       "function": {
>         "expression": "SUM",
>         "parameter": {
>           "type": "column",
>           "value": "V",
>           "next_parameter": null
>         },
>         "returntype": "bigint"
>       },
>       "dependent_measure_ref": null
>     },
>     {
>       "name": "TOP10",
>       "function": {
>         "expression": "TOP_N",
>         "parameter": {
>           "type": "column",
>           "value": "V",
>           "next_parameter": {
>             "type": "column",
>             "value": "LV2_ID",
>             "next_parameter": null
>           }
>         },
>         "returntype": "topn(10)",
>         "configuration": {
>           "topn.encoding.LV2_ID": "dict"
>         }
>       },
>       "dependent_measure_ref": null
>     }
>   ],
>   "dictionaries": [],
>   "rowkey": {
>     "rowkey_columns": [
>       {
>         "column": "MINUTE_START",
>         "encoding": "time",
>         "isShardBy": false
>       },
>       {
>         "column": "LV2_ID",
>         "encoding": "dict",
>         "isShardBy": false
>       }
>     ]
>   },
>   "hbase_mapping": {
>     "column_family": [
>       {
>         "name": "F1",
>         "columns": [
>           {
>             "qualifier": "M",
>             "measure_refs": [
>               "_COUNT_",
>               "COUNT",
>               "TOP10"
>             ]
>           }
>         ]
>       }
>     ]
>   },
>   "aggregation_groups": [
>     {
>       "includes": [
>         "MINUTE_START",
>         "LV2_ID"
>       ],
>       "select_rule": {
>         "hierarchy_dims": [],
>         "mandatory_dims": [],
>         "joint_dims": []
>       }
>     }
>   ],
>   "signature": "o75jqwjKPFF859EyU6j9eA==",
>   "notify_list": [],
>   "status_need_notify": [
>     "ERROR",
>     "DISCARDED",
>     "SUCCEED"
>   ],
>   "partition_date_start": 0,
>   "partition_date_end": 3153600000000,
>   "auto_merge_time_ranges": [
>     604800000,
>     2419200000
>   ],
>   "retention_range": 0,
>   "engine_type": 2,
>   "storage_type": 2,
>   "override_kylin_properties": {}
> }
>
> ------------------------------
> *From:* ShaoFeng Shi <[email protected]>
> *Sent:* Monday, May 15, 2017 1:37 AM
> *To:* user
> *Subject:* Re: Questions about SUM behavior when rewritten as TOPN
>
> Hi Tingmao,
>
> Your finding is correct; Kylin didn't check whether there is an "order by
> limit" pattern or not. The main reason is that the SQL parser didn't push
> down the "order by limit" info here (at least when TopN was developed), so
> I couldn't leverage that.
> Another reason is that usually we won't put the "group and order" column
> as a normal cube dimension (but Kylin doesn't stop you from doing that).
>
> Please feel free to open a JIRA on checking the "order by limit" pattern;
> I can look into that later.
>
> 2017-05-15 2:37 GMT+08:00 Tingmao Lin <[email protected]>:
>
>> Yeah, we looked into the TopNCounter source code and found that for the
>> low-cardinality scenario, in TopNCounter, m1=m2=0, so it just sums up the
>> values. But the result still goes wrong. I will collect more information
>> for investigation ASAP :)
>>
>> I have a question: as Billy Liu said in this thread, Kylin will check the
>> ORDER BY clause to determine whether to rewrite. But I didn't find any
>> access to SQLDigest.sortColumns in TopNMeasureType.influenceCapabilityCheck().
>> Does Kylin check it elsewhere? It seems that if sum(measure) is the only
>> measure in the query and the group by column matches, then
>> TopNMeasureType.isTopNCompatibleSum() will pass and the query gets
>> rewritten. This confuses the user, since they may expect an accurate
>> result for every distinct value of the group by column(s).
>>
>> ------------------------------
>> *From:* ShaoFeng Shi <[email protected]>
>> *Sent:* Sunday, May 14, 2017 2:13 PM
>> *To:* user
>> *Subject:* Re: Questions about SUM behavior when rewritten as TOPN
>>
>> Agree with Yang's points; when cardinality is smaller than the TopN
>> counter's capacity, the result should be accurate. I checked the 1.6
>> source code and didn't find a clue. Please share more information (cube
>> definition and query logs) for investigation, thanks!
>>
>> For negative numbers in TopN, actually that isn't recommended, as it goes
>> against TopN's purpose, which is to count occurrences of something. When
>> merging two TopN counters, one counter will use the other's last element's
>> number to accumulate (if the other is full) on its elements (as a
>> supplement).
>> If the last element is close to 0, its impact will be minor. But if the
>> last element is a big negative value, its impact will be considerable!
>> It doesn't make sense to reduce existing elements' counting values when
>> the last element's value is negative. So please use it properly in your
>> scenario. Of course, I think Kylin should also add more checks there.
>>
>> 2017-05-14 17:18 GMT+08:00 Tingmao Lin <[email protected]>:
>>
>>> The SQL in the original email is exactly what we input in the "Insight"
>>> tab in the Kylin admin UI.
>>>
>>> I do not have access to the host running Kylin now, and I will post the
>>> detailed log output tomorrow.
>>>
>>> We reproduced the inaccurate result behavior using a source table with
>>> <10 rows, and I should be able to write up reproduction steps tomorrow.
>>>
>>> ------------------------------
>>> *From:* Billy Liu <[email protected]>
>>> *Sent:* Sunday, May 14, 2017 12:21 AM
>>> *To:* user
>>> *Subject:* Re: Questions about SUM behavior when rewritten as TOPN
>>>
>>> Thanks Tingmao for the report.
>>>
>>> Could you show us the complete SQL? In your SQL, there is no ORDER BY
>>> statement. If there is no ORDER BY, the query should not be rewritten
>>> into the TopN measure.
>>>
>>> 2017-05-12 23:52 GMT+08:00 Tingmao Lin <[email protected]>:
>>>
>>>> Hi,
>>>>
>>>> We found that a SUM() query on a cardinality-1 dimension is not
>>>> accurate (or "not correct") when automatically rewritten as TOPN.
>>>> Is that the expected behavior of Kylin, or is there some other issue?
>>>>
>>>> We built a cube on a table ( measure1: bigint, dim1_id: varchar,
>>>> dim2_id: varchar, ... ) using Kylin 1.6.0 (Kafka streaming source).
>>>>
>>>> The cube has two measures: SUM(measure1) and
>>>> TOPN(10, sum-orderby(measure1), group by dim2_id)
>>>> (other measures omitted)
>>>> and two dimensions, dim1_id and dim2_id (other dims omitted).
>>>>
>>>> About the source table data:
>>>> The cardinality of dim1_id is 1 (same dim1_id for all rows in the
>>>> source table).
>>>> The cardinality of dim2_id is 1 (same dim2_id for all rows in the
>>>> source table).
>>>> The possible values of measure1 are [1, 0, -1].
>>>>
>>>> When we query
>>>> "select SUM(measure1) FROM table GROUP BY dim2_id"
>>>> => the result has one row: "sum=7";
>>>> from the Kylin logs we found that the query has been automatically
>>>> rewritten as TOPN(measure1, sum-orderby(measure1), group by dim2_id).
>>>>
>>>> When we write another query to prevent the TOPN rewrite, for example:
>>>>
>>>> "select SUM(measure1),count(*) FROM table GROUP BY dim2_id"
>>>> => one row -- "sum=-2,count=24576"
>>>>
>>>> "select SUM(measure1),count(*) FROM table"
>>>> => one row -- "sum=-2,count=24576"
>>>>
>>>> The result is different (7 vs. -2) depending on whether the query is
>>>> rewritten to TOPN or not.
>>>>
>>>> My question is: is the following behavior working as expected, or does
>>>> the TOPN algorithm not support negative counter values very well, or is
>>>> there some other issue?
>>>>
>>>> 1. A SUM() query is automatically rewritten as TOPN and gives an
>>>> approximated result even when no TOPN is present in the query.
>>>>
>>>> 2. When cardinality is 1, TOPN does not give an accurate result.
>>>>
>>>> Thanks.
>>>>
>>>
>>
>> --
>> Best regards,
>>
>> Shaofeng Shi 史少锋
>>
>
> --
> Best regards,
>
> Shaofeng Shi 史少锋

--
Best regards,

Shaofeng Shi 史少锋
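A closing note for readers of this archive: the merge rule ShaoFeng describes earlier in the thread (when merging a full counter, its last element's value is credited to keys missing from the other side) can be sketched in a few lines. This is a deliberately simplified illustration with made-up names, not Kylin's actual TopNCounter code, and it assumes both counters are full.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified illustration of the merge rule described in this thread
// (hypothetical names; not Kylin's TopNCounter implementation). A key
// missing from one side is credited with that side's last (smallest)
// element's value as a stand-in.
public class TopNMergeSketch {
    static Map<String, Double> merge(Map<String, Double> a, double aLastValue,
                                     Map<String, Double> b, double bLastValue) {
        Map<String, Double> merged = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : a.entrySet()) {
            // if the key is absent from b, b's last element stands in for it
            double fromB = b.getOrDefault(e.getKey(), bLastValue);
            merged.put(e.getKey(), e.getValue() + fromB);
        }
        for (Map.Entry<String, Double> e : b.entrySet()) {
            // keys only in b get a's last element as the stand-in
            merged.putIfAbsent(e.getKey(), e.getValue() + aLastValue);
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Double> a = new LinkedHashMap<>();
        a.put("k1", 5.0);
        Map<String, Double> b = new LinkedHashMap<>();
        b.put("k2", 3.0);
        // b's last element is a big negative value: k1's merged estimate is
        // dragged from 5 down to -5, illustrating the "considerable impact".
        System.out.println(merge(a, 5.0, b, -10.0));
    }
}
```

With non-negative counts the stand-in can only overestimate, which the Space-Saving family of algorithms tolerates by design; a negative last element instead pulls existing keys' counts down, which is why the thread discourages negative counters in TopN.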
