Hi Kirill:

Would you want to optimize cube design or SQL statement? 

You can design cube according the SQL statement you want to query, such as add 
BATCH_ID and SEGMENT_SIZE to dimension, add sum(VALUE) as a measure. These can 
make your query  faster.

> 在 2020年7月29日,21:50,Kirill Bogdanov <[email protected]> 写道:
> 
> To answer my own questions, I have came up with the following SQL statement 
> that seems to work correctly. 
> 
> WITH myview (sgmnt_size, total_batch_val, msg_count) as (SELECT BATCH_SIZE, 
> sum(VALUE), count(MESSAGE_ID) from TABLE2 group by BATCH_ID, SEGMENT_SIZE)
> SELECT sum(total_batch_val) FROM myview WHERE msg_count=sgmnt_size
> 
> Could you please comment on the performance characteristics of the above 
> statement in the context of Kylin? Is there a better way of achieving the 
> same?
> 
> Thanks!
> 
> On Wed, 29 Jul 2020 at 14:15, Kirill Bogdanov <[email protected] 
> <mailto:[email protected]>> wrote:
> Hi, 
> 
> I am working on a real time data analytics and evaluating the possibility of 
> using Kylin for our project. To date, I was able to connect Kafka with Kylin 
> and run basic queries on cubes. However, I have a specific functionality 
> requirements that I currently don't know how to achieve in Kylin. 
> 
> My incoming Kafka data stream receives batches of messages. Main columns look 
> as follows:
> BATCH_ID (int)-  unique increasing number (cube's dimension). All messages 
> within one batch have the same BATCH_ID
> BATCH_SIZE (int) - defines number of expected messages in this batch, an 
> integer in the range of 1 to 10000 (cube's dimension)
> MESSAGE_ID (int) - message's sequence number within the batch (any number 
> from 1 to BATCH_SIZE), unique within its batch.  (cube's dimension)
> VALUE - cube's metrics for which I want to compute the sum. 
> 
> I would like to write a query that would aggregate total VALUE of all 
> received messages (e.g., SELECT sum(value) from TABLE ....), however I only 
> want to count messages that belong to complete batches. A batch is considered 
> to be completed if all messages of that batch have been received (i.e., 
> aggregated in the cube). For example if BATCH_ID 123 has BATCH_SIZE = 100 
> then we should consider VALUEs only if we have 100 messages with BATCH_ID == 
> 100.  
> 
> What would be an SQL statement in Kylin to achieve this functionality? Any 
> specific optimisations that we could consider?
> 
> Thanks!
> Kirill

Reply via email to