Hello everyone,

We’re experiencing an issue where Kafka clients are significantly delayed in rediscovering the GroupCoordinator after the broker originally assigned as the GroupCoordinator becomes unreachable.
In this scenario, most clients quickly locate a new GroupCoordinator using the FindCoordinator protocol, but a few clients take as long as max.poll.interval.ms to do so. This delay in rediscovery postpones the group rebalance, leading to a prolonged interruption in message consumption.

Our Kafka server version is 2.3.1, but the clients are using version 1.1.1.

We observed that after the client logs the message:

    Group coordinator ... is unavailable or invalid, will attempt rediscovery

it takes about 5 minutes before we see:

    Discovered group coordinator ...

Unfortunately, due to the older client version (1.1.1), we lack more detailed logs for further insight.

Has anyone experienced a similar delay in coordinator rediscovery on some Kafka clients? Would reducing max.poll.interval.ms help by causing these delayed clients to be removed from the group more quickly, potentially speeding up the rebalance process?

I’ve checked KAFKA-9752 [1], but since there is no log like “Pending member $memberId in group {groupId} has been removed after session timeout expiration”, I’m not sure if this issue is related.

Any insight or suggestions would be appreciated.

Best regards,
Minwoo Kang

[1]: https://issues.apache.org/jira/browse/KAFKA-9752