My team is building a set of services that use Kafka Connect and Debezium
to forward data changes from our Postgres database to Kafka, and then use
Kafka Streams (via Spring Cloud Stream) to process this data and output an
aggregate of the source data.
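For context, our binder setup looks roughly like the following (a minimal sketch; the function and topic names here are hypothetical placeholders, but the property names come from the Spring Cloud Stream Kafka Streams binder):

```yaml
spring:
  cloud:
    stream:
      function:
        definition: aggregate            # hypothetical function name
      bindings:
        aggregate-in-0:
          destination: source-changes    # hypothetical input topic (Debezium output)
        aggregate-out-0:
          destination: source-aggregates # hypothetical output topic
      kafka:
        streams:
          binder:
            brokers: kafka:9092
```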

We have been trying to track down an issue where the stream processors are
not correctly configured when the application starts up before Kafka is
available. Specifically, all of the processor nodes are created correctly
except for the KTABLE-SINK-000000000# node. As a result, the services
consume messages but never forward their changes to their output topics:
data processing stops while the consumer offsets continue to advance, so we
lose messages and have to roll back the offsets and reprocess a large
amount of data.

This happens in both our Kubernetes environment and our Chef-managed
environment. In the Chef environment, simply restarting the server is
enough to trigger this issue, since Kafka takes longer to start up than the
application. In Kubernetes, I can reliably trigger the issue by removing
the application's dependency on Kafka, stopping Kafka, restarting the
application, and then starting Kafka.

I have tested with Kafka 3.0.0 and 3.3.1, and the behavior does not change.
We are using the latest Spring dependencies (Spring Cloud 2021.0.5, Spring
Cloud Stream 3.2.6).
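For reference, a sketch of the relevant build coordinates, assuming a
standard Maven setup with the Spring Cloud BOM (the artifact names are from
the Spring Cloud Stream project):

```xml
<dependencyManagement>
  <dependencies>
    <!-- Spring Cloud release train BOM; manages the Spring Cloud Stream version -->
    <dependency>
      <groupId>org.springframework.cloud</groupId>
      <artifactId>spring-cloud-dependencies</artifactId>
      <version>2021.0.5</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>
<dependencies>
  <!-- Kafka Streams binder for Spring Cloud Stream -->
  <dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-stream-binder-kafka-streams</artifactId>
  </dependency>
</dependencies>
```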

This may be a "heisenbug": a bug that only occurs when it is not being
observed. I spent most of yesterday stepping through the Kafka setup code
during application startup, both with and without the Kafka broker running,
and could not reproduce the bug under the debugger, yet could consistently
reproduce it as soon as I stopped debugging the startup process. I suspect
this points to a race condition in the parts of the application that only
run after connecting to the Kafka broker, but I'm not familiar enough with
the Kafka source to go much further on this.

Although I understand that this is a bit of an edge case (the Kafka broker
should generally not go down), the result here is missing/invalid data, and
there is no practical way to detect the bad state: the only ways to confirm
it are to observe that a consumer consumed a message but didn't forward it
to its output topic, or to attach a debugger and inspect the
ProcessorContext. That means we can't reasonably implement a health check
that restarts the application when it enters this state. (Even if we could
detect the state, there's no guarantee that messages weren't already lost
while the application was in it.)

I've done a fair amount of searching for this sort of issue, but have been
unable to find anyone I can confirm has hit the same problem. I am not
certain whether this is an issue in Kafka itself or in Spring Cloud Stream.

Any guidance or suggestions would be appreciated.
