You can try setting a time window to purge the historical data, but there is 
no way to define an ID window based on batch_id.

If you can estimate how long it takes to receive a complete batch that you 
are interested in, you can have data older than that period cleaned up; 
this can be set here:
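For reference, this corresponds to the "Retention Threshold" on the cube's Refresh Setting page, which is persisted in the cube descriptor. A minimal sketch of the relevant fragment (the cube name is hypothetical, and the exact field name and unit may vary by Kylin version):

```json
{
  "name": "my_streaming_cube",
  "retention_range": 2
}
```

Segments that fall entirely outside the retention range are then dropped automatically.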


There are also some documents about Kylin real-time OLAP that may help you:

http://kylin.apache.org/docs/tutorial/realtime_olap.html
http://kylin.apache.org/blog/2019/04/12/rt-streaming-design/
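The completed-batch query from the thread below can also be sanity-checked outside Kylin. This is a minimal sketch using Python's built-in SQLite (not Kylin's query engine, so it says nothing about cube performance); the sample rows are hypothetical:

```python
import sqlite3

# Hypothetical sample data: batch 1 is complete (BATCH_SIZE=2, 2 messages
# received), batch 2 is incomplete (BATCH_SIZE=3, only 1 message so far).
rows = [
    # (BATCH_ID, BATCH_SIZE, MESSAGE_ID, VALUE)
    (1, 2, 1, 10),
    (1, 2, 2, 20),
    (2, 3, 1, 99),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TABLE2 (BATCH_ID INT, BATCH_SIZE INT, MESSAGE_ID INT, VALUE INT)"
)
conn.executemany("INSERT INTO TABLE2 VALUES (?, ?, ?, ?)", rows)

# Same shape as the query in the thread: per-batch totals, then keep only
# batches whose received message count equals the declared BATCH_SIZE.
query = """
WITH myview (sgmnt_size, total_batch_val, msg_count) AS (
    SELECT BATCH_SIZE, SUM(VALUE), COUNT(MESSAGE_ID)
    FROM TABLE2
    GROUP BY BATCH_ID, BATCH_SIZE
)
SELECT SUM(total_batch_val) FROM myview WHERE msg_count = sgmnt_size
"""
total = conn.execute(query).fetchone()[0]
print(total)  # only batch 1 is complete, so 10 + 20 = 30
```

Only batch 1 is complete (2 of 2 messages received), so the query returns 30; the incomplete batch 2 is excluded from the sum.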

> On Jul 31, 2020, at 00:31, Kirill Bogdanov <[email protected]> wrote:
> 
> Hi Yaqian, 
> 
> Thanks for your reply. After running a few more tests, I noticed that the 
> query time is increasing with the size of the cube. 
> 
> As per my SQL query I am only interested in the sum(VALUE) of the completed 
> BATCH_IDs and I don't need to differentiate between individual BATCH_IDs. 
> Basically, the moment a given batch is completed I can 'reduce that 
> dimension'. Could you please advise on how this can be achieved? 
> 
> Thanks & best regards,
> Kirill
> 
> 
> 
> On Thu, 30 Jul 2020 at 05:06, Yaqian Zhang <[email protected]> wrote:
> Hi Kirill:
> 
> Do you want to optimize the cube design or the SQL statement? 
> 
> You can design the cube according to the SQL statement you want to query, 
> for example by adding BATCH_ID and BATCH_SIZE as dimensions and sum(VALUE) 
> as a measure. This can make your query faster.
> 
>> On Jul 29, 2020, at 21:50, Kirill Bogdanov <[email protected]> wrote:
>> 
>> To answer my own questions, I have come up with the following SQL statement, 
>> which seems to work correctly. 
>> 
>> WITH myview (sgmnt_size, total_batch_val, msg_count) AS (SELECT BATCH_SIZE, 
>> sum(VALUE), count(MESSAGE_ID) FROM TABLE2 GROUP BY BATCH_ID, BATCH_SIZE)
>> SELECT sum(total_batch_val) FROM myview WHERE msg_count = sgmnt_size
>> 
>> Could you please comment on the performance characteristics of the above 
>> statement in the context of Kylin? Is there a better way of achieving the 
>> same?
>> 
>> Thanks!
>> 
>> On Wed, 29 Jul 2020 at 14:15, Kirill Bogdanov <[email protected] 
>> <mailto:[email protected]>> wrote:
>> Hi, 
>> 
>> I am working on real-time data analytics and evaluating the possibility of 
>> using Kylin for our project. To date, I have been able to connect Kafka to 
>> Kylin and run basic queries on cubes. However, I have specific functionality 
>> requirements that I currently don't know how to achieve in Kylin. 
>> 
>> My incoming Kafka data stream receives batches of messages. Main columns 
>> look as follows:
>> BATCH_ID (int) - a unique, increasing number (cube's dimension). All messages 
>> within one batch share the same BATCH_ID.
>> BATCH_SIZE (int) - the number of expected messages in this batch, an 
>> integer in the range 1 to 10000 (cube's dimension).
>> MESSAGE_ID (int) - the message's sequence number within the batch (any number 
>> from 1 to BATCH_SIZE), unique within its batch (cube's dimension).
>> VALUE - the cube's metric, for which I want to compute the sum. 
>> 
>> I would like to write a query that aggregates the total VALUE of all 
>> received messages (e.g., SELECT sum(VALUE) FROM TABLE ...), however I only 
>> want to count messages that belong to complete batches. A batch is 
>> considered complete when all of its messages have been received 
>> (i.e., aggregated in the cube). For example, if BATCH_ID 123 has BATCH_SIZE = 
>> 100, then we should count its VALUEs only once we have 100 messages with 
>> BATCH_ID == 123.
>> 
>> What would be an SQL statement in Kylin to achieve this functionality? Any 
>> specific optimisations that we could consider?
>> 
>> Thanks!
>> Kirill
> 
