My team is building a set of services that use Kafka Connect and Debezium to forward data changes from our Postgres database to Kafka, and then use Kafka Streams (via Spring Cloud Stream) to process this data and output an aggregate of the source data.
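For reference, our processors follow the functional model and look roughly like the sketch below. This is a simplified illustration rather than our actual code: the binding name (`aggregateOrders`), the types, and the `count()` call are stand-ins for our real domain types and aggregation logic.

```java
import java.util.function.Function;

import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Materialized;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class OrderAggregateProcessor {

    // Debezium change events arrive on the input binding; we aggregate them
    // and emit the aggregate to the output binding. count() stands in for
    // the real aggregation logic.
    @Bean
    public Function<KStream<String, String>, KStream<String, Long>> aggregateOrders() {
        return changes -> changes
                .groupByKey()
                .count(Materialized.as("order-counts"))
                .toStream();
    }
}
```

The input and output bindings (`aggregateOrders-in-0` / `aggregateOrders-out-0`) are mapped to the Debezium topic and the aggregate topic in application.yml.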
We have been trying to track down an issue where the stream processors are not correctly configured when the application starts up before Kafka is available. Specifically, all of the processor nodes are created correctly except for the KTABLE-SINK-000000000# node. As a result, the services consume messages but never forward their changes to their output topics: data processing stops while the consumer offsets continue to advance, so we lose messages and have to roll back the offsets and reprocess a large amount of data.

This happens in both our Kubernetes environment and our Chef-managed environment. In the Chef environment, simply restarting the server is enough to trigger the issue, because Kafka takes longer to start than the application. In Kubernetes, I can reliably reproduce it by removing the application's dependency on Kafka, stopping Kafka, restarting the application, and then starting Kafka. I have tested with Kafka 3.0.0 and 3.3.1, and the behavior does not change. We are on the latest Spring dependencies (Spring Cloud 2021.0.5, Spring Cloud Stream 3.2.6).

This may be a "heisenbug": a bug that only occurs when it is not being observed. I spent most of yesterday stepping through the Kafka setup code during application startup, both with and without the Kafka broker running, and could not reproduce the bug while debugging, but could reproduce it consistently as soon as I stopped debugging the startup process. I suspect this means there is a race condition in the code paths that only run after connecting to the Kafka broker, but I am not familiar enough with the Kafka source to take that much further.

I understand that this is something of an edge case (the Kafka broker should generally not go down), but the result here is missing/invalid data, and the only ways to confirm that the application is in this state are either to observe that "this consumer consumed a message but didn't forward it to its output topic" or to attach a debugger and inspect the ProcessorContext. That means we can't reasonably implement a health check to detect the bad state and restart the application (a rough sketch of what we looked at is at the end of this post), and even if we could, there is no guarantee that messages weren't already lost while the application was in that state.

I have done a fair amount of searching for this sort of issue, but have not been able to find anyone else I can confirm to have the same problem, and I am not certain whether this is an issue in Kafka itself or in Spring Cloud Stream. Any guidance or suggestions would be appreciated.
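For completeness, this is roughly the kind of startup check we looked at and rejected. The wiring here is an assumption (how you obtain the StreamsBuilderFactoryBean depends on how the binder names it, e.g. the `&stream-builder-aggregateOrders` bean from the sketch above), and as far as I can tell it cannot detect the bad state anyway: the client state and the declared topology both look healthy, and only the runtime ProcessorContext shows the missing sink node.

```java
import org.apache.kafka.streams.KafkaStreams;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.boot.context.event.ApplicationReadyEvent;
import org.springframework.context.event.EventListener;
import org.springframework.kafka.config.StreamsBuilderFactoryBean;
import org.springframework.stereotype.Component;

@Component
public class StreamsTopologyLogger {

    private static final Logger log = LoggerFactory.getLogger(StreamsTopologyLogger.class);

    private final StreamsBuilderFactoryBean factoryBean;

    public StreamsTopologyLogger(StreamsBuilderFactoryBean factoryBean) {
        this.factoryBean = factoryBean;
    }

    @EventListener(ApplicationReadyEvent.class)
    public void logTopology() {
        KafkaStreams streams = factoryBean.getKafkaStreams();
        // The application keeps consuming in the bad state, so the client
        // state alone does not flag the problem.
        log.info("Kafka Streams state: {}", streams == null ? "not created" : streams.state());
        // describe() reflects the declared topology, which still lists the
        // sink node; the missing runtime wiring is only visible from the
        // ProcessorContext under a debugger.
        if (factoryBean.getTopology() != null) {
            log.info("Topology: {}", factoryBean.getTopology().describe());
        }
    }
}
```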