Re: What does kafka streams groupBy does internally?

2024-01-30 Thread Matthias J. Sax

Did reply on SO.

-Matthias

On 1/24/24 2:18 AM, warrior2...@gmail.com wrote:
Let's say there's a topic in which chunks of different files are all 
mixed up represented by a tuple |(FileId, Chunk)|.


Chunks of a same file also can be a little out of order.

The task is to aggregate all files and store them into some store.

The number of files is unbound.

In pseudo stream DSL that might look like

|topic('chunks') .groupByKey((fileId, chunk) -> fileId) .sortBy((fileId, 
chunk) -> chunk.offset) .aggregate((fileId, chunk) -> 
store.append(fileId, chunk)); |


I want to understand whether kafka streams can solve this efficiently. 
Since the number of files is unbound how would kafka manage intermediate 
topics for groupBy operation? How many partitions will it use etc? Can't 
find this details in the docs. Also let's say chunk has a flag that 
indicates EOF. How to indicate that specific group will no longer have 
any new data?



That’s a copy of my stack overflow question.
apple-touch-i...@2.png
What does kafka streams groupBy does internally? 
<https://stackoverflow.com/questions/77870807/what-does-kafka-streams-groupby-does-internally>
stackoverflow.com 
<https://stackoverflow.com/questions/77870807/what-does-kafka-streams-groupby-does-internally>


<https://stackoverflow.com/questions/77870807/what-does-kafka-streams-groupby-does-internally>


—
Michael


What does kafka streams groupBy does internally?

2024-01-24 Thread warrior2031
Let's say there's a topic in which chunks of different files are all mixed up represented by a tuple (FileId, Chunk).Chunks of a same file also can be a little out of order.The task is to aggregate all files and store them into some store.The number of files is unbound.In pseudo stream DSL that might look liketopic('chunks')
.groupByKey((fileId, chunk) -> fileId)
.sortBy((fileId, chunk) -> chunk.offset)
.aggregate((fileId, chunk) -> store.append(fileId, chunk));
I want to understand whether kafka streams can solve this efficiently. Since the number of files is unbound how would kafka manage intermediate topics for groupBy operation? How many partitions will it use etc? Can't find this details in the docs. Also let's say chunk has a flag that indicates EOF. How to indicate that specific group will no longer have any new data?That’s a copy of my stack overflow question.What does kafka streams groupBy does internally?stackoverflow.com—Michael