Re: Questions about SUM behavior when rewritten as TOPN

ShaoFeng Shi Sun, 14 May 2017 18:38:51 -0700

Hi Tingmao,

Your finding is correct; Kylin didn't check whether there is "order by
limit" pattern or not. The main reason is, the SQL parser didn't push down
the "order by limit" info here (at least when TopN was developed), so I
couldn't leverage that.  Another reason is usually we won't put the "group
and order" column as a normal cube dimension (but Kylin didn't stop you
from doing that).


Please feel free to open a JIRA on the checking "order by limit" pattern, I
can check that later.



2017-05-15 2:37 GMT+08:00 Tingmao Lin <[email protected]>:

> Yeah, we looked into the TopNCounter source code and found that for low 
> cardinality
> scenario, in TopNCounter, m1=m2=0 so it just sum up the values. But the
> result still goes wrong. I will collect more information for
> investigation ASAP :)
>
>
> I have a question: as Billy Liu said in this thread kylin will check the
> ORDER BY clause to determine whether to rewrite. But  I didn't find any
> access to SQLdigest.sortColumns  in  
> TopNMeasureType.influenceCapabilityCheck().
>  Does kylin check it elsewhere ?  It seems that  if   sum(measure) is the
> only measure in the query  and the group by  column matches, then
> TopNMeasureType.isTopNCompatibleSum()    will pass and the query get
> rewritten. This confuses the user since they may expect a accurate result
> for every distinct value of group by column(s).
>
> ------------------------------
> *From:* ShaoFeng Shi <[email protected]>
> *Sent:* Sunday, May 14, 2017 2:13 PM
>
> *To:* user
> *Subject:* Re: Questions about SUM behavior when rewritten as TOPN
>
> Agree with Yang's points; When cardinality is small than the TopN
> counter's capacity, the result should be accurate. I checked the 1.6 source
> code and didn't find clue. Please share more information (cube definition
> and query logs) for investigation, thanks!
>
> For negtive number in TopN, actually that isn't recommended, as it goes
> against TopN's purpose, which is to counting something happened. When
> merging two TopN counters, one counter will use another's last element's
> number to accumulate (if another is full) on its elements (as a
> supplement). If the last element is close to 0, its impaction will be
> minor. But if the last element is a big negative value, you know it's
> impaction will be considerable!  It doesn't make sense to reduce existing
> element's counting value if the last element's value is negative. So please
> use it properly in your scenario. Ofcourse, I think Kylin should also add
> more checks there.
>
> 2017-05-14 17:18 GMT+08:00 Tingmao Lin <[email protected]>:
>
>> The SQL in the original email is exactly what we input in the "insight"
>> tab in kylin admin UI.
>>
>> I do not have access to the host running kylin now ,and I will post the
>> detailed log output tomorrow.
>>
>>
>> We reproduced the inaccurate result behavior using a source table with
>> <10 rows and  I should be able to write a reproduce step tomorrow.
>>
>> ------------------------------
>> *From:* Billy Liu <[email protected]>
>> *Sent:* Sunday, May 14, 2017 12:21 AM
>> *To:* user
>> *Subject:* Re: Questions about SUM behavior when rewritten as TOPN
>>
>> Thanks Tingmao for the report.
>>
>> Could you show us the complete SQL? In your SQL, there is no order by
>> statement. If no ORDER BY, the query should not be rewritten into TopN
>> measure.
>>
>> 2017-05-12 23:52 GMT+08:00 Tingmao Lin <[email protected]>:
>>
>>> Hi,
>>>
>>> We found that SUM() query on a cardinality 1 dimension is not accurate
>>> (or "not correct") when automatically  rewritten as TOPN.
>>> Is that the expected behavior of kylin or there are any other issue?
>>>
>>> We built a cube on a table ( measure1: bigint, dim1_id:varchar,
>>> dim2_id:varchar, ... ) using kylin 1.6.0 (Kafka streaming source)
>>>
>>> The cube has two measures: SUM(measure1) and
>>> TOPN(10,sum-orderby(measure1),group by dim2_id) . (other measures
>>> omitted)
>>> and two dimensions  dim1_id, dim2_id   (other dims omitted)
>>>
>>> About the source table data:
>>> The cardinality of dim1_id  is 1 (same dim1_id for all rows in the
>>> source table)
>>> The cardinality of dim2_id  is 1 (same dim2_id for all rows in the
>>> source table)
>>> The possible value of measure1 is [1,0,-1]
>>>
>>> When we query
>>>     "select SUM(measure1) FROM table GROUP BY dim2_id"
>>>  =>     the result has one row:"sum=7",
>>>       from the kylin logs we found that the query has been
>>> automatically  rewritten as TOPN(measure1,sum-orderby(measure1),group
>>> by dim2_id)
>>>
>>> When we write another query to prevent TOPN rewrite, for example:
>>>
>>>    "select SUM(measure1),count(*) FROM table GROUP BY dim2_id"     =>   one
>>> row -- "sum=-2,count=24576"
>>>
>>>    "select SUM(measure1),count(*) FROM table"
>>>              =>   one row -- "sum=-2,count=24576"
>>>
>>>
>>> The result is different (7 and -2) when rewritting to TOPN or not.
>>>
>>>
>>> My question is: are the following behavior "works as expected" ,or TOPN
>>> algorithm does not support negative counter values very well , or any issue
>>> there?
>>>
>>>
>>> 1. SUM() query  automatically rewritten as TOPN and gives approximated
>>> result when no TOPN present in the query.
>>>
>>> 2. When cardinality is 1, TOPN does not give accurate result.
>>>
>>>
>>>
>>>
>>> Thanks.
>>>
>>>
>>>
>>>
>>
>
>
> --
> Best regards,
>
> Shaofeng Shi 史少锋
>
>


-- 
Best regards,

Shaofeng Shi 史少锋

Re: Questions about SUM behavior when rewritten as TOPN

Reply via email to