Hi,

I am working on a real-time data analytics project and evaluating the
possibility of using Kylin for it. So far I have been able to connect
Kafka with Kylin and run basic queries on cubes. However, I have a
specific functional requirement that I don't currently know how to
achieve in Kylin.

My incoming Kafka data stream receives batches of messages. The main
columns are as follows:
BATCH_ID (int) - unique, increasing number (cube dimension). All messages
within one batch have the same BATCH_ID.
BATCH_SIZE (int) - the number of messages expected in this batch, an
integer in the range 1 to 10000 (cube dimension).
MESSAGE_ID (int) - the message's sequence number within the batch (any
number from 1 to BATCH_SIZE), unique within its batch (cube dimension).
VALUE - the cube measure for which I want to compute the sum.

I would like to write a query that aggregates the total VALUE of all
received messages (e.g., SELECT sum(value) from TABLE ....), but I only
want to include messages that belong to complete batches. A batch is
considered complete if all messages of that batch have been received
(i.e., aggregated in the cube). For example, if BATCH_ID 123 has
BATCH_SIZE = 100, then we should include its VALUEs only if we have
received 100 messages with BATCH_ID == 123.
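
To make the requirement concrete, here is a minimal sketch of the
semantics I am after, written in plain Python rather than Kylin SQL. The
function name sum_complete_batches and the tuple layout are only
illustrative assumptions; this is not how the data is actually stored in
the cube.

```python
from collections import defaultdict


def sum_complete_batches(messages):
    """Sum VALUE over messages that belong to complete batches.

    `messages` is an iterable of hypothetical tuples
    (batch_id, batch_size, message_id, value).
    """
    seen_ids = defaultdict(set)   # batch_id -> distinct MESSAGE_IDs received
    totals = defaultdict(int)     # batch_id -> running sum of VALUE
    sizes = {}                    # batch_id -> expected BATCH_SIZE

    for batch_id, batch_size, message_id, value in messages:
        sizes[batch_id] = batch_size
        # MESSAGE_ID is unique within a batch, so skip accidental repeats.
        if message_id in seen_ids[batch_id]:
            continue
        seen_ids[batch_id].add(message_id)
        totals[batch_id] += value

    # A batch is complete when the number of received messages
    # equals its BATCH_SIZE.
    return sum(
        totals[batch_id]
        for batch_id, size in sizes.items()
        if len(seen_ids[batch_id]) == size
    )


if __name__ == "__main__":
    sample = [
        (1, 2, 1, 10),  # batch 1: both of its 2 messages arrive
        (1, 2, 2, 5),
        (2, 3, 1, 7),   # batch 2: only 2 of its 3 messages arrive
        (2, 3, 3, 1),
    ]
    print(sum_complete_batches(sample))
```

In this sample only batch 1 is complete, so only its VALUEs contribute to
the total.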

What SQL statement in Kylin would achieve this? Are there any specific
optimisations that we should consider?

Thanks!
Kirill
