To answer my own question, I have come up with the following SQL statement, which seems to work correctly:
WITH myview (batch_size, total_batch_val, msg_count) AS (
    SELECT BATCH_SIZE, SUM(VALUE), COUNT(MESSAGE_ID)
    FROM TABLE2
    GROUP BY BATCH_ID, BATCH_SIZE
)
SELECT SUM(total_batch_val) FROM myview WHERE msg_count = batch_size

Could you please comment on the performance characteristics of the above statement in the context of Kylin? Is there a better way of achieving the same result?

Thanks!

On Wed, 29 Jul 2020 at 14:15, Kirill Bogdanov <[email protected]> wrote:
> Hi,
>
> I am working on a real-time data analytics project and evaluating the
> possibility of using Kylin for it. To date, I have been able to connect
> Kafka to Kylin and run basic queries on cubes. However, I have a specific
> functional requirement that I currently don't know how to meet in Kylin.
>
> My incoming Kafka data stream receives batches of messages. The main
> columns are as follows:
> BATCH_ID (int) - a unique, increasing number (cube dimension). All
> messages within one batch have the same BATCH_ID.
> BATCH_SIZE (int) - the number of messages expected in the batch, an
> integer in the range 1 to 10000 (cube dimension).
> MESSAGE_ID (int) - the message's sequence number within the batch (any
> number from 1 to BATCH_SIZE), unique within its batch (cube dimension).
> VALUE - the cube metric whose sum I want to compute.
>
> I would like to write a query that aggregates the total VALUE of all
> received messages (e.g., SELECT sum(VALUE) FROM TABLE ...); however, I
> only want to count messages that belong to complete batches. A batch is
> considered complete once all of its messages have been received (i.e.,
> aggregated in the cube). For example, if BATCH_ID 123 has BATCH_SIZE =
> 100, then we should consider its VALUEs only once we have received 100
> messages with BATCH_ID == 123.
>
> What would be an SQL statement in Kylin to achieve this? Are there any
> specific optimisations we should consider?
>
> Thanks!
> Kirill
>
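For anyone following along, here is a minimal sketch of the complete-batch logic using Python's built-in sqlite3 with made-up data. SQLite's SQL dialect differs from Kylin's (Calcite-based) dialect, so this only checks the shape of the CTE, not Kylin behaviour; the table and row contents are hypothetical.

```python
import sqlite3

# In-memory database with the same columns as the TABLE2 described above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TABLE2 (BATCH_ID INT, BATCH_SIZE INT, MESSAGE_ID INT, VALUE INT)"
)

rows = [
    # Batch 1: BATCH_SIZE = 2, both messages received -> complete.
    (1, 2, 1, 10),
    (1, 2, 2, 20),
    # Batch 2: BATCH_SIZE = 3, only two messages received -> incomplete.
    (2, 3, 1, 5),
    (2, 3, 2, 7),
]
conn.executemany("INSERT INTO TABLE2 VALUES (?, ?, ?, ?)", rows)

# Per-batch aggregation, then keep only batches whose received message
# count equals the declared batch size.
query = """
WITH myview (batch_size, total_batch_val, msg_count) AS (
    SELECT BATCH_SIZE, SUM(VALUE), COUNT(MESSAGE_ID)
    FROM TABLE2
    GROUP BY BATCH_ID, BATCH_SIZE
)
SELECT SUM(total_batch_val) FROM myview WHERE msg_count = batch_size
"""
(total,) = conn.execute(query).fetchone()
print(total)  # only batch 1 is complete, so its VALUEs sum to 30
```

The incomplete batch (id 2) contributes nothing to the result, which is exactly the semantics asked for in the original mail.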
