Hey, We've experienced few incidents where suddenly Kafka producers weren't able to send messages and were simply timing out. I've a repro now after some investigation. We're running the cluster of 3 brokers on K8s and it's enough to kill the underlying AWS VM using AWS Console which causes also IP change (in our case 10.128.132.196 → 10.128.135.90). It needs to be VM hosting broker which is a leader of partition (node 10000).
Slightly before the kill there were 2 events: About to close the idle connection from 10001 due to being idle for 38545 > millis > About to close the idle connection from 10002 due to being idle for 34946 > millis And then client picks the node to connect to: Removing node 10.128.140.202:9092 (id: 10002 rack: us-east-1c) from least > loaded node selection since it is neither ready for sending or connecting > Found least loaded connecting node 10.128.132.196:9092 (id: 10000 rack: > us-east-1a) > About to close the idle connection from 10000 due to being idle for 30529 > millis > Node 10000 disconnected. > Initiating connection to node 10.128.132.196:9092 (id: 10000 rack: > us-east-1a) using address /10.128.132.196 and then it repeats: Found least loaded connecting node 10.128.132.196:9092 (id: 10000 rack: > us-east-1a) > About to close the idle connection from 10000 due to being idle for 30027 > millis > Node 10000 disconnected. > Initiating connection to node 10.128.132.196:9092 (id: 10000 rack: > us-east-1a) using address /10.128.132.196 Found least loaded connecting node 10.128.132.196:9092 (id: 10000 rack: > us-east-1a) > About to close the idle connection from 10000 due to being idle for 30027 > millis > Node 10000 disconnected. > Initiating connection to node 10.128.132.196:9092 (id: 10000 rack: > us-east-1a) using address /10.128.132.196 Found least loaded connecting node 10.128.132.196:9092 (id: 10000 rack: > us-east-1a) > About to close the idle connection from 10000 due to being idle for 30027 > millis > Node 10000 disconnected. > Initiating connection to node 10.128.132.196:9092 (id: 10000 rack: > us-east-1a) using address /10.128.132.196 Affected clients won't ever request METADATA from other brokers (10001 or 10002) to discover new IP. It doesn't happen for all clients as some handle the process gracefully and those log with: log.info("Hostname for node {} changed from {} to {}.", id, > connectionState.host(), host); -- BR, Michał Łowicki