Re: doubt about measure of processedRowCount

JiaTao Tao Tue, 06 Nov 2018 17:59:00 -0800

Thanks, Shaofeng, for your affirmation :).

ShaoFeng Shi <[email protected]> wrote on Wed, Nov 7, 2018 at 9:29 AM:


> Good job, Jiatao! I appreciate your support for the community!
>
> JiaTao Tao <[email protected]> wrote on Wed, Nov 7, 2018 at 9:17 AM:
>
>> Very glad that my reply was helpful. I have already opened a JIRA to add
>> logs for "*GTStreamAggregateScanner*", so next time this should be much
>> easier to track down :).
>>
>> cheney <[email protected]> wrote on Tue, Nov 6, 2018 at 11:57 PM:
>>
>>> Hi, JiaTao, thank you very much! The statistics are right when I set
>>> "kylin.query.stream-aggregate-enabled=false".
>>> You are right. Records are pre-aggregated by GTStreamAggregateScanner.
>>>
>>>
>>> ------------------ Original Message ------------------
>>> *From:* "JiaTao Tao"<[email protected]>;
>>> *Sent:* Tuesday, November 6, 2018, 10:50 PM
>>> *To:* "user"<[email protected]>;
>>> *Subject:* Re: doubt about measure of processedRowCount
>>>
>>> One possible place I can find in the code is the use of
>>> *GTStreamAggregateScanner* (in "*SegmentCubeTupleIterator.java#111*").
>>> You can see that it does aggregate in
>>> "*GTStreamAggregateScanner.AbstractStreamMergeIterator#next*", so it will
>>> reduce the number of input rows. But there is no logging in this class,
>>> as you can see, so it is pretty hard to confirm. Try setting
>>> "kylin.query.stream-aggregate-enabled=false" and run the scenario again
>>> to see whether anything changes.
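
As a minimal sketch of why pre-aggregation on the query server could lower the counted rows (this is not Kylin's code; `stream_aggregate` and the sample rows are hypothetical), assume the records arrive sorted by dimension key and adjacent records with the same key are merged by summing a measure:

```python
from itertools import groupby
from operator import itemgetter


def stream_aggregate(sorted_rows):
    """Merge adjacent rows that share a key by summing their measure.

    Assumes the input is already sorted by key, so equal keys are adjacent.
    """
    for key, group in groupby(sorted_rows, key=itemgetter(0)):
        yield key, sum(measure for _, measure in group)


# Hypothetical (key, measure) rows, already sorted by key.
rows = [("a", 1), ("a", 2), ("b", 5), ("c", 1), ("c", 4), ("c", 2)]
merged = list(stream_aggregate(rows))

print(len(rows), len(merged))
```

With these made-up inputs, six rows go in and three come out, so a counter placed after such a merge would see fewer rows than the storage layer returned.
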
>>>
>>> cheney <[email protected]> wrote on Mon, Nov 5, 2018 at 6:55 PM:
>>>
>>>> Yes. The log is as follows.
>>>>
>>>> 2018-11-02 22:25:34,980 DEBUG [Query
>>>> 03ea4f21-29ed-4b74-8faa-c57ecd44f412-198914]
>>>> gtrecord.StorageResponseGTScatter:88 : Using
>>>> SortMergedPartitionResultIterator to merge 103 partition results
>>>> 2018-11-02 22:25:34,982 INFO  [Query
>>>> 03ea4f21-29ed-4b74-8faa-c57ecd44f412-198914]
>>>> gtrecord.SequentialCubeTupleIterator:73 : Using Iterators.concat *to
>>>> merge segment results*
>>>> 2018-11-02 22:25:34,982 DEBUG [Query
>>>> 03ea4f21-29ed-4b74-8faa-c57ecd44f412-198914] enumerator.OLAPEnumerator:122
>>>> : return TupleIterator...
>>>> 2018-11-02 22:25:34,991 INFO  [Query
>>>> 03ea4f21-29ed-4b74-8faa-c57ecd44f412-198914] service.QueryService:897 : 
>>>> *Processed
>>>> rows for each storageContext*: 366
>>>> 2018-11-02 22:25:34,991 INFO  [Query
>>>> 03ea4f21-29ed-4b74-8faa-c57ecd44f412-198914] service.QueryService:422 :
>>>> Stats of SQL response: isException: false, duration: 20, *total scan
>>>> count 1552*
>>>>
>>>> According to the log, *valueA* = 366 and *valueB* = (total scan count)
>>>> 1552 - (total aggregated/filtered in HBase) 270 = 1282.
>>>> *valueB* is much larger than *valueA*.
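
The comparison can be written out with the numbers quoted from the log. This is only a sketch: the function name `rows_returned` is hypothetical, and the formula is the one proposed in the original question (scanned rows minus filtered/aggregated rows).

```python
def rows_returned(total_scanned: int, total_filtered_or_aggregated: int) -> int:
    """Rows HBase hands back, per the formula in the original question."""
    return total_scanned - total_filtered_or_aggregated


value_a = 366                       # "Processed rows for each storageContext"
value_b = rows_returned(1552, 270)  # total scan count minus filtered/aggregated
gap = value_b - value_a             # rows unaccounted for on the query server

print(value_b, gap)
```

The gap of 916 rows is what a later merge step on the query server would have to absorb for the two figures to agree.
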
>>>>
>>>>
>>>>
>>>> ------------------ Original Message ------------------
>>>> *From:* "JiaTao Tao"<[email protected]>;
>>>> *Sent:* Monday, November 5, 2018, 2:41 PM
>>>> *To:* "user"<[email protected]>;
>>>> *Subject:* Re: doubt about measure of processedRowCount
>>>>
>>>> Can you grep the logs for lines like "to merge segment results" in that
>>>> scenario?
>>>>
>>>> cheney <[email protected]> wrote on Sat, Nov 3, 2018 at 4:15 PM:
>>>>
>>>>> Thank you for replying, but I am sure there is only one OlapContext
>>>>> in the query in my scenario.
>>>>> ---Original---
>>>>> *From:* "JiaTao Tao"<[email protected]>
>>>>> *Date:* Sat, Nov 3, 2018 10:42 AM
>>>>> *To:* "user"<[email protected]>;
>>>>> *Subject:* Re: doubt about measure of processedRowCount
>>>>>
>>>>> Summing all the *valueA* entries may be more appropriate, because
>>>>> there may be more than one OlapContext in the query (one OlapContext
>>>>> corresponds to one storageContext).
>>>>>
>>>>> Here are two good blog posts about Kylin's query engine that you may
>>>>> want to look at :).
>>>>>
>>>>> https://blog.csdn.net/yu616568/article/details/50838504
>>>>>
>>>>> https://zhuanlan.zhihu.com/p/30613434
>>>>>
>>>>> cheney <[email protected]> wrote on Fri, Nov 2, 2018 at 11:10 PM:
>>>>>
>>>>>> Hi, guys
>>>>>>
>>>>>>         When I execute a SQL query in Kylin, the Kylin server logs
>>>>>> some query statistics. For example:
>>>>>>
>>>>>>        "Processed rows for each storageContext: *valueA*". *valueA* is
>>>>>> processedRowCount.
>>>>>>
>>>>>>        My understanding is that processedRowCount is the number of
>>>>>> rows returned by HBase.
>>>>>>
>>>>>>        The HBase coprocessor logs region stats, including "*Total
>>>>>> scanned row*" and "Total filtered/aggred row".
>>>>>>
>>>>>>         For one region, final records returned by HBase = *Total
>>>>>> scanned row* - Total filtered/aggred row.
>>>>>>        Suppose this query needs to scan 10 regions in HBase. We can
>>>>>> get the stats for every region, so we can get the total number of
>>>>>> records returned by HBase, *valueB*, by summing the final records of
>>>>>> all 10 regions.
>>>>>>
>>>>>>       In general, *valueA* is equal to *valueB*, but sometimes
>>>>>> *valueB* is much larger than *valueA*. Why?
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>>
>>>>> Regards!
>>>>>
>>>>> Aron Tao
>>>>>
>
>
> --
> Best regards,
>
> Shaofeng Shi 史少锋
>
>

-- 


Regards!

Aron Tao
