A. Sophie Blee-Goldman created KAFKA-10793:
----------------------------------------------

             Summary: Race condition in FindCoordinatorFuture permanently 
severs connection to group coordinator
                 Key: KAFKA-10793
                 URL: https://issues.apache.org/jira/browse/KAFKA-10793
             Project: Kafka
          Issue Type: Bug
          Components: consumer, streams
    Affects Versions: 2.5.0
            Reporter: A. Sophie Blee-Goldman


Pretty much as soon as we started actively monitoring the 
_last-rebalance-seconds-ago_ metric in our Kafka Streams test environment, we 
started seeing something weird. Every so often one of the StreamThreads (i.e. a 
single Consumer instance) would appear to permanently fall out of the group, as 
evidenced by a monotonically increasing _last-rebalance-seconds-ago_. We inject 
artificial network failures every few hours at most, so the group rebalances 
quite often. But the one consumer never rejoins, with no other symptoms 
(besides a slight drop in throughput since the remaining threads had to take 
over this member's work). We're confident that the problem exists in the client 
layer, since the logs confirmed that the unhealthy consumer was still calling 
poll. It was also calling Consumer#committed in its main poll loop, which was 
consistently failing with a TimeoutException.

When I attached a remote debugger to an instance experiencing this issue, the 
network client's connection to the group coordinator (the one that uses 
MAX_VALUE - node.id as the coordinator id) was in the DISCONNECTED state. But 
for some reason it never tried to re-establish this connection, although it did 
successfully connect to that same broker through the "normal" connection (i.e. 
the one that just uses node.id).
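
For anyone unfamiliar with the two connection ids, here is a minimal sketch of the mapping described above. It assumes MAX_VALUE means Integer.MAX_VALUE; the class and method names are hypothetical and are not taken from the Kafka client.

```java
// Hypothetical helper for illustration only; not a class from the Kafka client.
final class CoordinatorConnectionIds {
    private CoordinatorConnectionIds() {}

    /** Id of the dedicated coordinator connection for a broker (assumes Integer.MAX_VALUE). */
    static int coordinatorConnectionId(int brokerId) {
        return Integer.MAX_VALUE - brokerId;
    }

    /** Inverse mapping: recovers the broker id from a coordinator connection id. */
    static int brokerId(int coordinatorConnectionId) {
        return Integer.MAX_VALUE - coordinatorConnectionId;
    }
}
```

Under that assumption, a broker with node.id 3 has a "normal" connection id of 3 and a coordinator connection id of 2147483644, so the client tracks the two connections to the same broker independently.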

The tl;dr is that the AbstractCoordinator's FindCoordinatorRequest has failed 
(presumably due to a disconnect), but the _findCoordinatorFuture_ is non-null 
so a new request is never sent. This shouldn't be possible since the 
FindCoordinatorResponseHandler is supposed to clear the _findCoordinatorFuture_ 
when the future is completed. But somehow that didn't happen, so the consumer 
continues to assume there's still a FindCoordinator request in flight and never 
even notices that it's dropped out of the group.
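
To make the stuck state concrete, here is a simplified model of the guard described above. This is not the AbstractCoordinator code: the class, its methods, and the clearField flag are all hypothetical, and the flag only stands in for the still-unknown reason the field was not cleared.

```java
import java.util.concurrent.CompletableFuture;

// Simplified, hypothetical model of a lookup guarded by a nullable future field.
final class CoordinatorLookupSketch {
    // null means "no FindCoordinator request in flight"
    private CompletableFuture<String> findCoordinatorFuture;
    private int requestsSent = 0;

    /** Sends a new lookup only when no future is currently recorded. */
    CompletableFuture<String> lookupCoordinator() {
        if (findCoordinatorFuture == null) {
            findCoordinatorFuture = new CompletableFuture<>();
            requestsSent++; // stands in for actually sending the request
        }
        return findCoordinatorFuture;
    }

    /**
     * Fails the in-flight lookup. clearField=false models the observed bug:
     * the future is completed but the field is never reset to null.
     */
    void onLookupFailed(RuntimeException cause, boolean clearField) {
        CompletableFuture<String> inFlight = findCoordinatorFuture;
        if (inFlight == null) {
            return; // nothing in flight
        }
        if (clearField) {
            findCoordinatorFuture = null;
        }
        inFlight.completeExceptionally(cause);
    }

    int requestsSent() {
        return requestsSent;
    }
}
```

In this model, calling lookupCoordinator() again after onLookupFailed(cause, true) sends a second request, while after onLookupFailed(cause, false) it keeps returning the same failed future and requestsSent() stays at 1, which matches the symptom of a consumer that never retries.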

These are the only confirmed findings so far; however, we have some guesses 
which I'll leave in the comments. Note that we only noticed this due to the 
newly added _last-rebalance-seconds-ago_ metric, and there's no reason to 
believe this bug hasn't been flying under the radar since the Consumer's 
inception.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
