Possible race in handling IP address change

Michał Łowicki Wed, 19 May 2021 06:55:58 -0700

Hey,

We've experienced a few incidents where Kafka producers suddenly weren't able
to send messages and were simply timing out. After some investigation I now
have a repro. We're running a cluster of 3 brokers on K8s, and it's enough to
kill the underlying AWS VM using the AWS Console, which also causes an IP
change (in our case 10.128.132.196 → 10.128.135.90). It needs to be the VM
hosting the broker which is the leader of a partition (node 10000).


Slightly before the kill there were 2 events:

> About to close the idle connection from 10001 due to being idle for 38545 millis
> About to close the idle connection from 10002 due to being idle for 34946 millis


And then the client picks the node to connect to:

> Removing node 10.128.140.202:9092 (id: 10002 rack: us-east-1c) from least loaded node selection since it is neither ready for sending or connecting
> Found least loaded connecting node 10.128.132.196:9092 (id: 10000 rack: us-east-1a)
> About to close the idle connection from 10000 due to being idle for 30529 millis
> Node 10000 disconnected.
> Initiating connection to node 10.128.132.196:9092 (id: 10000 rack: us-east-1a) using address /10.128.132.196


and then it repeats:

> Found least loaded connecting node 10.128.132.196:9092 (id: 10000 rack: us-east-1a)
> About to close the idle connection from 10000 due to being idle for 30027 millis
> Node 10000 disconnected.
> Initiating connection to node 10.128.132.196:9092 (id: 10000 rack: us-east-1a) using address /10.128.132.196


> Found least loaded connecting node 10.128.132.196:9092 (id: 10000 rack: us-east-1a)
> About to close the idle connection from 10000 due to being idle for 30027 millis
> Node 10000 disconnected.
> Initiating connection to node 10.128.132.196:9092 (id: 10000 rack: us-east-1a) using address /10.128.132.196


> Found least loaded connecting node 10.128.132.196:9092 (id: 10000 rack: us-east-1a)
> About to close the idle connection from 10000 due to being idle for 30027 millis
> Node 10000 disconnected.
> Initiating connection to node 10.128.132.196:9092 (id: 10000 rack: us-east-1a) using address /10.128.132.196


Affected clients never request METADATA from the other brokers (10001 or
10002), so they never discover the new IP. It doesn't happen for all clients:
some handle the change gracefully, and those log with:

> log.info("Hostname for node {} changed from {} to {}.", id,
>     connectionState.host(), host);
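
To illustrate why that log line separates the two groups of clients, here is a
minimal sketch of the kind of check it implies. It is a simplified stand-in and
not Kafka's actual code: the class HostChangeTracker and its methods connecting
and hostOf are hypothetical names made up for this example. The assumption is
that a host change can only be noticed when the client is handed a host that
differs from the one it recorded earlier, which presumably requires refreshed
metadata; a client that keeps retrying with the stale address never gets there.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified illustration only; this is NOT Kafka's connection-state code.
// All names here are hypothetical.
class HostChangeTracker {
    // Host recorded for each node id at the last connection attempt.
    private final Map<String, String> hostByNode = new HashMap<>();

    /**
     * Records a connection attempt to node `id` at `host`.
     * Returns true when `host` differs from the previously recorded host.
     */
    boolean connecting(String id, String host) {
        String previous = hostByNode.put(id, host);
        boolean changed = previous != null && !previous.equals(host);
        if (changed) {
            System.out.printf("Hostname for node %s changed from %s to %s.%n",
                    id, previous, host);
        }
        return changed;
    }

    // Returns the host currently recorded for `id`, or null if none.
    String hostOf(String id) {
        return hostByNode.get(id);
    }

    public static void main(String[] args) {
        HostChangeTracker tracker = new HostChangeTracker();
        // First attempt: nothing recorded yet, so nothing to compare against.
        tracker.connecting("10000", "10.128.132.196");
        // Stale metadata: same host again, so no change is detected.
        tracker.connecting("10000", "10.128.132.196");
        // Fresh metadata: the new host differs, so the change is logged.
        tracker.connecting("10000", "10.128.135.90");
    }
}
```

In this sketch the second call models the affected clients (same old address on
every retry), while the third call models the clients that recover.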


-- 
BR,
Michał Łowicki
