Hello Kafka Team,
We are observing some unexpected behavior in the Java Kafka client.
Problem description:
When a KafkaShareConsumer fails to connect to a cluster (e.g. because a
port is misconfigured), it enters a busy loop. The symptoms are an
excessive amount of log output, high CPU usage, and a slowly increasing
memory footprint.
Software Version:
We are using org.apache.kafka:kafka-clients:4.1.1.
Sample:
I created a repository with a minimal sample to reproduce the behavior:
https://github.com/HenrikLueschenTNG/share-consumer-busy-loop/blob/main/src/main/java/com/example/shareconsumerbusyloop/ShareConsumerBusyLoopApplication.java
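In short, the sample points a ShareConsumer at a port on which no broker is listening; constructing a KafkaShareConsumer with a configuration like the one below and then calling subscribe/poll in a loop reproduces the behavior. The exact property values beyond those visible in the logs are assumptions here; please see the linked sample for the real configuration:

```java
import java.util.Properties;

public class ShareConsumerBusyLoopSketch {
    // Configuration in the spirit of the linked sample; values other than
    // the bootstrap port and group id are assumptions:
    static Properties reproProps() {
        Properties props = new Properties();
        // No broker listens on 9094, so every connection attempt fails:
        props.put("bootstrap.servers", "localhost:9094");
        props.put("group.id", "test-group");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        return props;
    }

    public static void main(String[] args) {
        // new KafkaShareConsumer<>(reproProps()), followed by subscribe(...)
        // and poll(...) in a loop, then triggers the busy loop described below.
        System.out.println(reproProps().getProperty("bootstrap.servers"));
    }
}
```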
Details:
When the consumer fails to establish a connection, we first see a large
amount of identical logs, often many published within the same millisecond:
2026-01-16 07:49:59.311 INFO [consumer_background_thread]
org.apache.kafka.clients.Metadata - [ShareConsumer
clientId=consumer-test-group-1, groupId=test-group] Rebootstrapping with
[localhost/127.0.0.1:9094]
2026-01-16 07:49:59.311 INFO [consumer_background_thread]
org.apache.kafka.clients.Metadata - [ShareConsumer
clientId=consumer-test-group-1, groupId=test-group] Rebootstrapping with
[localhost/127.0.0.1:9094]
2026-01-16 07:49:59.311 INFO [consumer_background_thread]
org.apache.kafka.clients.Metadata - [ShareConsumer
clientId=consumer-test-group-1, groupId=test-group] Rebootstrapping with
[localhost/127.0.0.1:9094]
After a few seconds, the production of these logs ends, but the CPU
usage remains very high.
I have done a little bit of digging and found the following:
- Within the loop of the ConsumerNetworkThread, several RequestManagers
are used to determine the timeout for the next poll of the
NetworkClientDelegate. The CoordinatorRequestManager frequently sets
this timeout to zero: its timeout is calculated as Math.max(0, backoffMs
- timeSinceLastReceiveMs). Since the backoff is, by default, between
100 ms and 1000 ms while the request timeout is 30000 ms, the difference
between the backoff and timeSinceLastReceiveMs is almost always negative
while no connection can be made, so the computed timeout is zero. I
think this is causing the initial symptom of the many logs.
- After a few seconds, the client stops producing logs, but the CPU
usage remains high, and a slow increase in memory usage can be observed.
I believe this is due to an accumulation of applicationEvents in the
ConsumerNetworkThread: within a few seconds, several million such events
need to be (and cannot be) processed in the call to
processApplicationEvents. This appears to slow down the loop in the
ConsumerNetworkThread, resulting in fewer logs, while simultaneously
keeping the CPU busy and consuming increasing amounts of memory.
- In the case of a classic consumer, no such behavior can be observed.
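To illustrate the first point: the timeout computation collapses to zero as soon as timeSinceLastReceiveMs exceeds the backoff, which is the steady state while connections keep failing. A standalone sketch of just that arithmetic (the method name is ours, not the Kafka source's):

```java
public class PollTimeoutSketch {
    // Same arithmetic as the CoordinatorRequestManager timeout described above:
    static long nextPollTimeout(long backoffMs, long timeSinceLastReceiveMs) {
        return Math.max(0, backoffMs - timeSinceLastReceiveMs);
    }

    public static void main(String[] args) {
        // Healthy case: a response arrived recently, so the thread sleeps
        // for the remainder of the backoff:
        System.out.println(nextPollTimeout(1_000, 250));    // 750

        // Failure case: no response for longer than the maximum backoff,
        // so the next poll happens immediately -> busy loop:
        System.out.println(nextPollTimeout(1_000, 30_000)); // 0
    }
}
```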
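And to illustrate the second point schematically: if events are enqueued into an unbounded queue faster than one loop iteration can drain them, the backlog (and with it the heap footprint) grows without bound. This is a toy model of the mechanism we suspect, not the actual consumer internals:

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class EventBacklogSketch {
    // Each "tick" stands for one iteration of the background thread's loop:
    // when more events arrive per iteration than the (slowed-down) iteration
    // can process, the queue only ever grows.
    static long simulate(int ticks, int enqueuedPerTick, int drainedPerTick) {
        Queue<String> applicationEvents = new ArrayDeque<>();
        for (int t = 0; t < ticks; t++) {
            for (int i = 0; i < enqueuedPerTick; i++) {
                applicationEvents.add("event");
            }
            for (int i = 0; i < drainedPerTick && !applicationEvents.isEmpty(); i++) {
                applicationEvents.poll();
            }
        }
        return applicationEvents.size();
    }

    public static void main(String[] args) {
        // Backlog grows by (10 - 1) events per tick:
        System.out.println(simulate(1_000, 10, 1)); // 9000
    }
}
```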
Thanks in advance for any advice on this issue!
Greetings
Henrik