Dayue Gao created KYLIN-2501:
--------------------------------

             Summary: Stream Aggregate GTRecords at Query Server
                 Key: KYLIN-2501
                 URL: https://issues.apache.org/jira/browse/KYLIN-2501
             Project: Kylin
          Issue Type: Improvement
          Components: Query Engine, Storage - HBase
    Affects Versions: v1.6.0
            Reporter: Dayue Gao
            Assignee: Dayue Gao


*Problem*
When the query server needs to handle millions of records from storage, 
CubeTupleConverter can become a performance bottleneck.
An experiment shows that converting 5 million records takes ~11s, which 
accounts for 50% of the total query time.

*Motivation*
Records returned from each storage partition are guaranteed to be ordered. 
Therefore we could reduce the number of records passed to CubeTupleConverter by:
# merging the sorted records from all partitions, similar to what we have done 
in KYLIN-1787
# using a [stream aggregate|https://blogs.msdn.microsoft.com/craigfr/2006/09/13/stream-aggregate/] 
algorithm on the merged stream to aggregate records that share the same key
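
The merge step in the first item can be illustrated with a heap-based k-way merge. This is only a minimal sketch under stated assumptions: the class SortedMergeSketch, the Cursor helper, and the merge method are hypothetical names, not Kylin's SortedIteratorMergerWithLimit, and the result is collected into a list for brevity where a real operator would presumably expose an iterator.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only; not part of the Kylin code base.
final class SortedMergeSketch {

    // Pairs the current head record of one partition with the rest of that partition.
    private static final class Cursor<T> {
        T head;
        final Iterator<T> rest;

        Cursor(T head, Iterator<T> rest) {
            this.head = head;
            this.rest = rest;
        }
    }

    // Merges partitions that are each already sorted by `order` into one sorted list.
    static <T> List<T> merge(List<Iterator<T>> partitions, Comparator<? super T> order) {
        Comparator<Cursor<T>> byHead = (a, b) -> order.compare(a.head, b.head);
        PriorityQueue<Cursor<T>> heap = new PriorityQueue<>(byHead);

        // Seed the heap with the first record of every non-empty partition.
        for (Iterator<T> partition : partitions) {
            if (partition.hasNext()) {
                heap.add(new Cursor<>(partition.next(), partition));
            }
        }

        List<T> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            // The smallest head across all partitions is the next output record.
            Cursor<T> smallest = heap.poll();
            merged.add(smallest.head);
            // Advance that partition and re-insert its cursor if records remain.
            if (smallest.rest.hasNext()) {
                smallest.head = smallest.rest.next();
                heap.add(smallest);
            }
        }
        return merged;
    }
}
```

The cursor object is reused when a partition advances, so the sketch passes record references through without copying them.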

*Proposal*
# Add a new physical operator GTStreamAggregateScanner which implements the 
stream aggregate algorithm
# Refine SortedIteratorMergerWithLimit, which is used to merge-sort records from 
different partitions. The existing implementation has performance issues 
(KYLIN-2483) due to expensive record cloning
# Leverage GTStreamAggregateScanner to aggregate records on merged stream
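
For readers unfamiliar with the algorithm, a stream aggregate over key-sorted input can be sketched as below. This is not the proposed GTStreamAggregateScanner: the class name, the use of String keys with Long values, and SUM as the only aggregate function are all assumptions made for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the proposed GTStreamAggregateScanner.
final class StreamAggregateSketch {

    // Sums the values of adjacent records that share a key.
    // Precondition: the input is sorted by key, so equal keys are contiguous.
    static List<Map.Entry<String, Long>> aggregate(Iterator<Map.Entry<String, Long>> sorted) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        String currentKey = null;
        long sum = 0;
        boolean groupOpen = false;

        while (sorted.hasNext()) {
            Map.Entry<String, Long> record = sorted.next();
            // A key change closes the current group; only one group is ever held in memory.
            if (groupOpen && !record.getKey().equals(currentKey)) {
                out.add(new SimpleEntry<>(currentKey, sum));
                sum = 0;
            }
            currentKey = record.getKey();
            sum += record.getValue();
            groupOpen = true;
        }
        // Emit the last group, if the input was non-empty.
        if (groupOpen) {
            out.add(new SimpleEntry<>(currentKey, sum));
        }
        return out;
    }
}
```

Because only the running group is kept, memory use does not grow with the number of distinct keys, and groups are emitted in key order.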

*Scope*
Stream aggregate has some good properties, such as low memory usage and 
streamable ordered output, which make it better than hash- or sort-based 
alternatives when the input is already sorted. So I bet the new 
GTStreamAggregateScanner operator can also be used to accelerate cubing and 
the coprocessor in certain cases. I will leave that as future work.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
