Jaebin Yoon created KAFKA-10827:
-----------------------------------

             Summary: Consumer group coordinator node never gets updated for 
manual partition assignment
                 Key: KAFKA-10827
                 URL: https://issues.apache.org/jira/browse/KAFKA-10827
             Project: Kafka
          Issue Type: Bug
          Components: consumer
    Affects Versions: 2.2.1
            Reporter: Jaebin Yoon


We've run into a situation where the coordinator node in the consumer never 
gets updated with the new coordinator when the coordinator broker gets replaced 
with a new instance. Once the consumer gets into this mode, the consumer keeps 
trying to connect to the old coordinator and never recovers unless restarted.

This happens when the consumer uses manual partition assignment and commits 
offsets very infrequently (every 5 minutes) and the coordinator broker is not 
reachable (ip address, hostname are gone in a cloud environment).

The exception the consumer keeps getting isĀ 
{code:java}
Offset commit failed with a retriable exception. You should retry committing 
the latest consumed offsets. Caused by: 
org.apache.kafka.common.errors.TimeoutException: Failed to send request after 
120000 ms.
{code}
We could see a bunch of *SYN_SENT* tcp state from the consumer app to the old 
hostname in this error condition.

In the current manual partition assignment scenario, the only way for the 
coordinator to gets updated is through checkAndGetCoordinator in 
AbstractCoordinator but this gets called only in committing offsets every 5 
minutes in our case.  

The current logic of checkAndGetCoordinator is using 
ConsumerNetworkClient.isUnavailable but it returns false unless the Network 
client is in reconnect backoff time, which is currently configured with default 
values (reconnect.backoff.ms (50), reconnect.backoff.max.ms (1000) while 
request.timeout.ms is 120000.  In this scenario, 
ConsumerNetworkClient.isUnavailable for the old coordinator node always returns 
false, resulting in checkAndGetCoordinator keeps the old coordinator node 
forever.

It seems checkAndGetCoordinator should not rely on 
ConsumerNetworkClient.isUnavailable but just check the client.connectionFailed.

We had to restart the consumer to recover from this condition.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to