[ 
https://issues.apache.org/jira/browse/KAFKA-14392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17634471#comment-17634471
 ] 

Ron Dagostino commented on KAFKA-14392:
---------------------------------------

One possibility is to keep using `controller.socket.timeout.ms`, as is 
currently done, and update the documentation to make this clear.  
Unfortunately, the default value for `controller.socket.timeout.ms` is 30 
seconds, whereas the default value for `broker.session.timeout.ms` is 9 seconds.

Another possibility is to use the value passed into the Broker-to-Controller 
channel manager.  For the broker's heartbeat thread, this is 
`broker.heartbeat.interval.ms`, which defaults to 2 seconds.

The latter seems better: it requires no change to any configs and better 
reflects what the broker actually wants, which is to cancel the request if it 
doesn't succeed within the heartbeat interval we are using and simply try 
again.
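
To make the comparison concrete, here is a minimal Java sketch of the check 
implied above.  It is not Kafka code: the class and method names are 
hypothetical, and the constants are just the default values quoted in this 
comment.

```java
// Illustrative only; not part of Kafka. Names here are hypothetical.
class HeartbeatTimeoutSketch {
    // Default values quoted in the comment above, in milliseconds.
    static final int CONTROLLER_SOCKET_TIMEOUT_MS = 30_000;
    static final int BROKER_SESSION_TIMEOUT_MS = 9_000;
    static final int BROKER_HEARTBEAT_INTERVAL_MS = 2_000;

    /**
     * Returns true if a heartbeat request that is stuck (for example, sent to
     * a controller that has just failed) could stay pending for longer than
     * the session timeout, so the broker risks being fenced before it retries.
     */
    static boolean canOutliveSession(int requestTimeoutMs, int sessionTimeoutMs) {
        return requestTimeoutMs > sessionTimeoutMs;
    }

    public static void main(String[] args) {
        // Current behaviour: request timeout taken from controller.socket.timeout.ms.
        System.out.println(canOutliveSession(
            CONTROLLER_SOCKET_TIMEOUT_MS, BROKER_SESSION_TIMEOUT_MS)); // prints "true"

        // Proposed behaviour: request timeout taken from broker.heartbeat.interval.ms.
        System.out.println(canOutliveSession(
            BROKER_HEARTBEAT_INTERVAL_MS, BROKER_SESSION_TIMEOUT_MS)); // prints "false"
    }
}
```

With the defaults, the 30-second timeout exceeds the 9-second session 
timeout, while the 2-second heartbeat interval leaves room for several 
retries within one session.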

> KRaft broker heartbeat timeout should not exceed broker.session.timeout.ms
> --------------------------------------------------------------------------
>
>                 Key: KAFKA-14392
>                 URL: https://issues.apache.org/jira/browse/KAFKA-14392
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: Ron Dagostino
>            Assignee: Ron Dagostino
>            Priority: Minor
>
> KRaft brokers maintain their liveness in the cluster by sending 
> BROKER_HEARTBEAT requests to the active controller; the active controller 
> fences a broker if it doesn't receive a heartbeat request from that broker 
> within the period defined by `broker.session.timeout.ms`.  The broker should 
> use a request timeout for its BROKER_HEARTBEAT requests that is not larger 
> than the session timeout being used by the controller.  A larger request 
> timeout creates the possibility that, upon controller failover, the broker 
> does not cancel an existing heartbeat request in time to heartbeat to the new 
> controller and maintain an uninterrupted session in the cluster.  In other 
> words, a failure of the active controller could result in under-replicated 
> (or under-min ISR) partitions simply due to a delay in brokers heartbeating 
> to the new controller.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
