[ 
https://issues.apache.org/jira/browse/KAFKA-13842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias J. Sax updated KAFKA-13842:
------------------------------------
    Description: 
Kafka Streams follows a continuous refinement model for aggregation. For this 
reason, we never implement a pre-aggregation step before data repartitioning, 
because it won't help much to reduce repartition cost (there is no natural 
boundary when a pre-aggregation is finished and when to emit it downstream for 
the actual aggregation roll-up).

With https://issues.apache.org/jira/browse/KAFKA-13785 we introduce a 
per-aggregation "emit final" feature (different to suppress()) that changes the 
continuous refinement model and thus it seems to be a good optimization to add 
a pre-aggregation step if this new feature is used.

We might want to give user control over inserting the pre-aggregation step 
because there is no free lunch... If we have X distinct keys, pre-aggregation 
implies that the upstream RocksDB store will need to store up to X rows to hold 
the pre-aggregate. Thus, given N input partitions, we need to hold N*X rows 
(upstream) plus X rows (in the final donwstream aggregation). – In contrast, a 
direct repartition step will only require to hold X rows downstream. It's a 
tradeoff between (much) higher disk usage vs network/Kafka traffic.

  was:
Kafka Streams follows a continuous refinement model for aggregation. For this 
reason, we never implement a pre-aggregation step before data repartitioning, 
because it won't help much to reduce repartition cost (there is no natural 
boundary when a pre-aggregation is finished and when to emit it downstream for 
the actual aggregation roll-up).

With https://issues.apache.org/jira/browse/KAFKA-13785 we introduce a 
per-aggregation "emit final" feature (different to suppress()) that changes the 
continuous refinement model and thus it seems to be a good optimization to add 
a pre-aggregation step if this new feature is used.


> Add per-aggregation step before repartitioning
> ----------------------------------------------
>
>                 Key: KAFKA-13842
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13842
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>            Reporter: Matthias J. Sax
>            Priority: Major
>              Labels: needs-kip
>
> Kafka Streams follows a continuous refinement model for aggregation. For this 
> reason, we never implement a pre-aggregation step before data repartitioning, 
> because it won't help much to reduce repartition cost (there is no natural 
> boundary when a pre-aggregation is finished and when to emit it downstream 
> for the actual aggregation roll-up).
> With https://issues.apache.org/jira/browse/KAFKA-13785 we introduce a 
> per-aggregation "emit final" feature (different to suppress()) that changes 
> the continuous refinement model and thus it seems to be a good optimization 
> to add a pre-aggregation step if this new feature is used.
> We might want to give user control over inserting the pre-aggregation step 
> because there is no free lunch... If we have X distinct keys, pre-aggregation 
> implies that the upstream RocksDB store will need to store up to X rows to 
> hold the pre-aggregate. Thus, given N input partitions, we need to hold N*X 
> rows (upstream) plus X rows (in the final donwstream aggregation). – In 
> contrast, a direct repartition step will only require to hold X rows 
> downstream. It's a tradeoff between (much) higher disk usage vs network/Kafka 
> traffic.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

Reply via email to