Luke,

We did not upgrade to resolve the issue.  We simply restarted the failing 
clients.

Regards, James.

On 23/11/2021, at 16:10, Luke Chen <show...@gmail.com> wrote:

Hi James,
> Bouncing the clients resolved the issue
Could you please describe which version you upgraded to, to resolve this issue? 
That should also help other users encountering the same issue.

And the code snippet you listed has existed since 2018, so I don't think there 
is any problem there.
Maybe there were bugs elsewhere that got fixed indirectly.

Thank you.
Luke

On Tue, Nov 23, 2021 at 10:27 AM James Olsen <ja...@inaseq.com> wrote:
We had a 2.5.1 Broker/Client system running for some time with regular rolling 
OS upgrades to the Brokers without any problems.  A while ago we upgraded both 
Brokers and Clients to 2.7.1, and now on the first rolling OS upgrade to the 
2.7.1 Brokers we encountered some Consumer issues.  We have a 3 Broker setup 
with min-ISRs configured to avoid any outage.

So maybe we just got lucky 6 times in a row with the 2.5.1 or maybe there is an 
issue with the 2.7.1.

The observable symptom is a continuous stream of "The coordinator is not 
available" messages when trying to commit offsets.  It starts with the usual 
messages you might expect during a rolling upgrade...

2021-11-22 04:41:25,269 WARN  
[org.apache.kafka.clients.consumer.internals.ConsumerCoordinator] 
'pool-7-thread-132' [Consumer clientId=consumer-MyService-group-58, 
groupId=MyService-group] Offset commit failed on partition MyTopic-0 at offset 
866799313: The coordinator is loading and hence can't process requests.

... then 5 minutes of all OK, then ...

2021-11-22 04:46:33,258 WARN  
[org.apache.kafka.clients.consumer.internals.ConsumerCoordinator] 
'pool-7-thread-132' [Consumer clientId=consumer-MyService-group-58, 
groupId=MyService-group] Offset commit failed on partition MyTopic-0 at offset 
866803953: This is not the correct coordinator.

2021-11-22 04:46:33,258 INFO  
[org.apache.kafka.clients.consumer.internals.AbstractCoordinator] 
'pool-7-thread-132' [Consumer clientId=consumer-MyService-group-58, 
groupId=MyService-group] Group coordinator 
b-2.xxx.com:9094
 (id: 2147483645 rack: null) is unavailable or invalid due to cause: error 
response NOT_COORDINATOR.isDisconnected: false. Rediscovery will be attempted.

2021-11-22 04:46:33,258 WARN  [xxx.KafkaConsumerRunner] 'pool-7-thread-132' 
Offset commit with offsets {MyTopic-0=OffsetAndMetadata{offset=866803953, 
leaderEpoch=null, metadata=''}} failed: 
org.apache.kafka.clients.consumer.RetriableCommitFailedException: Offset commit 
failed with a retriable exception. You should retry committing the latest 
consumed offsets.
Caused by: org.apache.kafka.common.errors.NotCoordinatorException: This is not 
the correct coordinator.

... then the following message for every subsequent attempt to commit offsets 
...

2021-11-22 04:46:33,284 WARN  [xxx.KafkaConsumerRunner] 'pool-7-thread-132' 
Offset commit with offsets {MyTopic-0=OffsetAndMetadata{offset=866803954, 
leaderEpoch=82, metadata=''}, MyOtherTopic-0=OffsetAndMetadata{offset=12654756, 
leaderEpoch=79, metadata=''}} failed: 
org.apache.kafka.clients.consumer.RetriableCommitFailedException: Offset commit 
failed with a retriable exception. You should retry committing the latest 
consumed offsets.
Caused by: org.apache.kafka.common.errors.CoordinatorNotAvailableException: The 
coordinator is not available.

In the above example we are doing manual async-commits, but we also had offset 
commit failures for a different consumer group (observed through lag monitoring) 
that uses auto-commit; it just didn't log the ongoing failures.  In both cases 
messages were still being processed; it was just the commits that were failing.  
These are our two busiest consumer groups and both have static Topic 
assignments.  Other consumer groups continued OK.
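
For context, the manual commit path is just the standard commitAsync pattern.  
The sketch below is illustrative (the class and logging are mine, not our 
actual code), but it shows how the retriable failures above surface in the 
commit callback:

import java.util.Map;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.consumer.RetriableCommitFailedException;
import org.apache.kafka.common.TopicPartition;

public class CommitSketch {
    // Fire an async commit; failures are reported to the callback rather than
    // thrown from the calling thread.
    static void commit(KafkaConsumer<String, String> consumer,
                       Map<TopicPartition, OffsetAndMetadata> offsets) {
        consumer.commitAsync(offsets, (committed, exception) -> {
            if (exception instanceof RetriableCommitFailedException) {
                // In our logs the cause was CoordinatorNotAvailableException;
                // we just log a warning and retry on the next cycle.
                System.err.println("Offset commit with offsets " + committed
                        + " failed: " + exception.getCause());
            } else if (exception != null) {
                System.err.println("Non-retriable commit failure: " + exception);
            }
        });
    }
}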

I've spent some time examining the (Java) client code and started to wonder 
whether there is a bug or race condition that means the coordinator never gets 
reassigned after being invalidated and we simply keep hitting the following 
short-circuit:

org.apache.kafka.clients.consumer.internals.ConsumerCoordinator

    RequestFuture<Void> sendOffsetCommitRequest(final Map<TopicPartition, 
OffsetAndMetadata> offsets) {
        if (offsets.isEmpty())
            return RequestFuture.voidSuccess();

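        // checkAndGetCoordinator() returns null while the coordinator is
        // unknown (e.g. after it was invalidated by NOT_COORDINATOR), in which
        // case the commit fails immediately with COORDINATOR_NOT_AVAILABLE.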
        Node coordinator = checkAndGetCoordinator();
        if (coordinator == null)
            return RequestFuture.coordinatorNotAvailable();

I'm not sure what the exact pathway is to getting the coordinator set, but I 
note that 
org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureCoordinatorReady(Timer)
 and other methods that look like they may be related tend to log only at debug 
when they encounter a RetriableException, which could explain why I don't have 
more detail to provide.
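
If it would help to capture more detail next time, I could raise those 
coordinator classes to DEBUG.  A minimal logging sketch, assuming a log4j 1.x 
backend (adjust accordingly for logback or log4j2):

# Surface the coordinator lookup/rediscovery path at DEBUG while leaving the
# rest of the Kafka client at its normal level.
log4j.logger.org.apache.kafka.clients.consumer.internals.AbstractCoordinator=DEBUG
log4j.logger.org.apache.kafka.clients.consumer.internals.ConsumerCoordinator=DEBUG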

I'm not familiar enough with the code to be able to trace this through any 
further, but if you've had the patience to keep reading this far then maybe you 
do!

Bouncing the clients resolved the issue, but I'd be interested to hear whether 
any experts out there can identify a weakness in the 2.7.1 version.

Regards, James.

