Let's say there's a topic in which chunks of different files are all mixed up, each record represented as a tuple (FileId, Chunk).

Chunks of the same file can also arrive slightly out of order.

The task is to reassemble each file from its chunks and write it to some store.

The number of files is unbounded.

In a pseudo stream DSL that might look like:

topic('chunks')
    .groupByKey((fileId, chunk) -> fileId)
    .sortBy((fileId, chunk) -> chunk.offset)
    .aggregate((fileId, chunk) -> store.append(fileId, chunk));

I want to understand whether Kafka Streams can solve this efficiently. Since the number of files is unbounded, how would Kafka manage the intermediate topics for the groupBy operation? How many partitions would it use, etc.? I can't find these details in the docs. Also, suppose a chunk carries a flag that indicates EOF. How can I signal that a specific group will receive no more data?
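To make the intended semantics of the aggregate step concrete, here is a plain-Java sketch (no Kafka involved; the `Chunk` and `ChunkBuffer` names are made up for illustration) of the per-file logic it would need: buffer slightly out-of-order chunks keyed by offset, append them to the store once they become contiguous, and mark the file complete when the EOF chunk has been appended:

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Illustrative chunk record: byte offset within the file, payload, EOF flag.
class Chunk {
    final long offset;
    final byte[] data;
    final boolean eof; // true on the last chunk of the file
    Chunk(long offset, byte[] data, boolean eof) {
        this.offset = offset; this.data = data; this.eof = eof;
    }
}

// Per-file re-ordering buffer: the kind of state the aggregate step
// would keep for each FileId. Not a Kafka Streams API, just a sketch.
class ChunkBuffer {
    private final SortedMap<Long, Chunk> pending = new TreeMap<>();
    private long nextOffset = 0;       // next offset we expect to append
    private boolean complete = false;  // set once the EOF chunk is appended
    private final StringBuilder store = new StringBuilder(); // stand-in for a real store

    // Accept a chunk in any order; append it and any now-contiguous
    // buffered chunks to the store.
    void add(Chunk chunk) {
        pending.put(chunk.offset, chunk);
        while (!pending.isEmpty() && pending.firstKey() == nextOffset) {
            Chunk next = pending.remove(pending.firstKey());
            store.append(new String(next.data));
            nextOffset = next.offset + next.data.length;
            if (next.eof) complete = true;
        }
    }

    boolean isComplete() { return complete; }
    String contents() { return store.toString(); }
}
```

The `complete` flag is what the EOF indicator would drive: once it flips, the group could be finalized and its state evicted, which is exactly the "no more data for this group" signal I'm asking about.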


That's a copy of my Stack Overflow question.


Michael
