Hi, I am working on a real time data analytics and evaluating the possibility of using Kylin for our project. To date, I was able to connect Kafka with Kylin and run basic queries on cubes. However, I have a specific functionality requirements that I currently don't know how to achieve in Kylin.
My incoming Kafka data stream receives batches of messages. Main columns look as follows: BATCH_ID (int)- unique increasing number (cube's dimension). All messages within one batch have the same BATCH_ID BATCH_SIZE (int) - defines number of expected messages in this batch, an integer in the range of 1 to 10000 (cube's dimension) MESSAGE_ID (int) - message's sequence number within the batch (any number from 1 to BATCH_SIZE), unique within its batch. (cube's dimension) VALUE - cube's metrics for which I want to compute the sum. I would like to write a query that would aggregate total VALUE of all received messages (e.g., SELECT sum(value) from TABLE ....), however I only want to count messages that belong to complete batches. A batch is considered to be completed if all messages of that batch have been received (i.e., aggregated in the cube). For example if BATCH_ID 123 has BATCH_SIZE = 100 then we should consider VALUEs only if we have 100 messages with BATCH_ID == 100. What would be an SQL statement in Kylin to achieve this functionality? Any specific optimisations that we could consider? Thanks! Kirill
