mlowicki created KAFKA-12836:
--------------------------------

             Summary: Possible race in handling IP address change
                 Key: KAFKA-12836
                 URL: https://issues.apache.org/jira/browse/KAFKA-12836
             Project: Kafka
          Issue Type: Bug
          Components: clients
    Affects Versions: 2.4.1
            Reporter: mlowicki


We've experienced few incidents where suddenly Kafka producers weren't able to 
send messages and were simply timing out. I've a repro now after some 
investigation. We're running the cluster of 3 brokers on K8s and it's enough to 
kill the underlying AWS VM using AWS Console which causes also IP change (in 
our case 10.128.132.196 → 10.128.135.90). It needs to be VM hosting broker 
which is a leader of partition (node 10000).
 
Slightly before the kill there were 2 events:
 
{quote}About to close the idle connection from 10001 due to being idle for 
38545 millis
About to close the idle connection from 10002 due to being idle for 34946 
millis{quote}
 
And then client picks the node to connect to:
 
{quote}Removing node [10.128.140.202:9092|http://10.128.140.202:9092/] (id: 
10002 rack: us-east-1c) from least loaded node selection since it is neither 
ready for sending or connecting
Found least loaded connecting node 
[10.128.132.196:9092|http://10.128.132.196:9092/] (id: 10000 rack: us-east-1a)
About to close the idle connection from 10000 due to being idle for 30529 millis
Node 10000 disconnected.
Initiating connection to node [10.128.132.196:9092|http://10.128.132.196:9092/] 
(id: 10000 rack: us-east-1a) using address 
/[10.128.132.196|http://10.128.132.196/]{quote}
 
and then it repeats:
 
{quote}Found least loaded connecting node 
[10.128.132.196:9092|http://10.128.132.196:9092/] (id: 10000 rack: us-east-1a)
About to close the idle connection from 10000 due to being idle for 30027 millis
Node 10000 disconnected.
Initiating connection to node [10.128.132.196:9092|http://10.128.132.196:9092/] 
(id: 10000 rack: us-east-1a) using address 
/[10.128.132.196|http://10.128.132.196/]{quote}
 
{quote}Found least loaded connecting node 
[10.128.132.196:9092|http://10.128.132.196:9092/] (id: 10000 rack: us-east-1a)
About to close the idle connection from 10000 due to being idle for 30027 millis
Node 10000 disconnected.
Initiating connection to node [10.128.132.196:9092|http://10.128.132.196:9092/] 
(id: 10000 rack: us-east-1a) using address 
/[10.128.132.196|http://10.128.132.196/]{quote}
 
{quote}Found least loaded connecting node 
[10.128.132.196:9092|http://10.128.132.196:9092/] (id: 10000 rack: us-east-1a)
About to close the idle connection from 10000 due to being idle for 30027 millis
Node 10000 disconnected.
Initiating connection to node [10.128.132.196:9092|http://10.128.132.196:9092/] 
(id: 10000 rack: us-east-1a) using address 
/[10.128.132.196|http://10.128.132.196/]{quote}
 
Affected clients won't ever request METADATA from other brokers (10001 or 
10002) to discover new IP. It doesn't happen for all clients as some handle the 
process gracefully and those log with:
 
{quote}[log.info|http://log.info/]("Hostname for node {} changed from {} to 
{}.", id, connectionState.host(), host);{quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to