Kafka Streams LEFT JOIN multiple tables

Karsten Stöckmann Wed, 24 Jan 2024 03:55:21 -0800

Hi,

we're currently in the process of evaluating Debezium and Kafka as a CDC
system for our Postgres database in order to build an indexing solution
(i.e. Solr or OpenSearch).


Debezium captures changes per table and propagates them into dedicated
Kafka topics each. The ingested tables originally feature multiple
relationships of all kinds (1:1, 1:n, n:m) - aggregated data from those
tables should eventually reflect in composite documents.
To illustrate this, assume the following heavily simplified table layout:

   - customer(id, name, firstname),
   - communication(id, customer_id, value),
   - label(id, customer_id, value),
   - (maybe more dependent tables related to customers 1:n or 1:1 or even
   n:m - not considered here for brevity).

All tables are streamed into Kafka topics as pointed out above. Now in
order to aggregate an output customer with all associated communication and
label values (i.e. aggregated as lists), what would be the most elegant
solution leveraging Kafka Streams? Note that customers do not necessisarily
have any communication or label at all, thus non-key joins are out of the
game as far as I understand.

Our initial (naive) solution was to re-key the dependent KTables and then
left joining, but that involves a lot of steps and intermediate State
Stores, especially when considering the stated example is heavily
simplified:

KTable<Long, Customer> customers = streamBuilder.table(customerTopic,
Consumed.with(<Serdes>));
KTable<Long, Comm> comm = streamBuilder.table(commTopic,
Consumed.with(<Serdes>));
KTable<Long, Label> labels = streamBuilder.table(labelTopic,
Consumed.with(<Serdes>));
// Re-Keying
KTable<Long, GroupedComm> groupedComm =
communications.groupBy(...).aggregate(...);
KTable<Long, GroupedLabel> groupedLabels =
labels.groupBy(...).aggregate(...);
// Join
KTable<Long, AggregateCustomer> aggregated = customers
.leftJoin(groupedComm, ...)
.leftJoin(groupedLabel, ...)
.groupBy(...)
.aggregate(...);

Are there more efficient / less naive / more elegant / simpler solutions to
this? Co-grouping input streams did not yield expected results...

Best wishes,

Karsten

Kafka Streams LEFT JOIN multiple tables

Reply via email to