Hi Yaqian,

Thanks for your reply. After running a few more tests, I noticed that the
query time is increasing with the size of the cube.

As per my SQL query, I am only interested in the sum(VALUE) of completed
BATCH_IDs and don't need to differentiate between individual BATCH_IDs.
Basically, the moment a given batch is complete I can 'reduce that
dimension'. Could you please advise how this can be achieved?
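
For illustration, the result I am ultimately after would look roughly like
the sketch below, where IS_COMPLETE is a hypothetical flag (it does not exist
in my stream) standing in for "this batch has received all BATCH_SIZE of its
messages":

SELECT sum(VALUE)
FROM TABLE2
WHERE IS_COMPLETE = 1  -- hypothetical completeness flag, not a real column

In other words, once a batch is known to be complete, I no longer need
BATCH_ID or MESSAGE_ID as dimensions for this particular query.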

Thanks & best regards,
Kirill



On Thu, 30 Jul 2020 at 05:06, Yaqian Zhang <[email protected]> wrote:

> Hi Kirill:
>
> Do you want to optimize the cube design or the SQL statement?
>
> You can design the cube according to the SQL statement you want to run, for
> example by adding BATCH_ID and BATCH_SIZE as dimensions and sum(VALUE) as a
> measure. This should make your query faster.
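>
> For example, with BATCH_ID and BATCH_SIZE as dimensions and sum(VALUE)
> (plus a count on MESSAGE_ID) as measures, the per-batch roll-up below
> should be served directly from a cuboid (just a sketch, based on the TABLE2
> schema in your message below):
>
> SELECT BATCH_ID, BATCH_SIZE, sum(VALUE), count(MESSAGE_ID)
> FROM TABLE2
> GROUP BY BATCH_ID, BATCH_SIZE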
>
> On 29 Jul 2020, at 21:50, Kirill Bogdanov <[email protected]> wrote:
>
> To answer my own questions, I have come up with the following SQL
> statement, which seems to work correctly.
>
> WITH myview (batch_size, total_batch_val, msg_count) AS (
>   SELECT BATCH_SIZE, sum(VALUE), count(MESSAGE_ID)
>   FROM TABLE2
>   GROUP BY BATCH_ID, BATCH_SIZE
> )
> SELECT sum(total_batch_val) FROM myview WHERE msg_count = batch_size
>
> Could you please comment on the performance characteristics of the above
> statement in the context of Kylin? Is there a better way to achieve the
> same result?
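>
> For comparison, an equivalent formulation that folds the completeness check
> into the grouped subquery might look like the sketch below (I have not
> verified which form Kylin plans better):
>
> SELECT sum(batch_val)
> FROM (
>   SELECT BATCH_ID, sum(VALUE) AS batch_val
>   FROM TABLE2
>   GROUP BY BATCH_ID, BATCH_SIZE
>   HAVING count(MESSAGE_ID) = BATCH_SIZE
> ) t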
>
> Thanks!
>
> On Wed, 29 Jul 2020 at 14:15, Kirill Bogdanov <[email protected]> wrote:
>
>> Hi,
>>
>> I am working on a real-time data analytics project and evaluating the
>> possibility of using Kylin. So far, I have been able to connect Kafka to
>> Kylin and run basic queries on cubes. However, I have a specific
>> functional requirement that I currently don't know how to achieve in
>> Kylin.
>>
>> My incoming Kafka data stream receives batches of messages. The main
>> columns are as follows:
>> BATCH_ID (int) - a unique, increasing number (cube dimension). All
>> messages within one batch share the same BATCH_ID.
>> BATCH_SIZE (int) - the number of messages expected in this batch, an
>> integer in the range 1 to 10000 (cube dimension).
>> MESSAGE_ID (int) - the message's sequence number within the batch (any
>> number from 1 to BATCH_SIZE), unique within its batch (cube dimension).
>> VALUE - the cube measure whose sum I want to compute.
>>
>> I would like to write a query that aggregates the total VALUE of all
>> received messages (e.g., SELECT sum(VALUE) from TABLE ....), however I
>> only want to count messages that belong to complete batches. A batch is
>> considered complete once all of its messages have been received (i.e.,
>> aggregated in the cube). For example, if BATCH_ID 123 has BATCH_SIZE = 100,
>> then its VALUEs should be counted only once we have 100 messages with
>> BATCH_ID == 123.
>>
>> What would be an SQL statement in Kylin to achieve this functionality?
>> Any specific optimisations that we could consider?
>>
>> Thanks!
>> Kirill
>>
>
>
