[ https://issues.apache.org/jira/browse/KAFKA-16247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Luke Chen resolved KAFKA-16247.
-------------------------------
    Resolution: Fixed

Fixed in 3.7.0 RC4

> replica keep out-of-sync after migrating broker to KRaft
> --------------------------------------------------------
>
>                 Key: KAFKA-16247
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16247
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 3.7.0
>            Reporter: Luke Chen
>            Priority: Major
>         Attachments: KAFKA-16247.zip
>
>
> We are deploying 3 controllers and 3 brokers, and following the steps in
> [doc|https://kafka.apache.org/documentation/#kraft_zk_migration]. When we
> move from the "Enabling the migration on the brokers" state to the
> "Migrating brokers to KRaft" state, the first rolled broker becomes
> out-of-sync and never becomes in-sync again.
> From the log, we can see some "reject alterPartition" errors, but they
> happen only twice. In theory, the leader should add the follower to the ISR
> as long as the follower is fetching, since no client is writing data.
> But we can't figure out why it didn't fetch.
> Logs: https://gist.github.com/showuon/64c4dcecb238a317bdbdec8db17fd494
> ===
> update Feb. 14
> After further investigating the logs, I think the replica is not added to
> the ISR because the alterPartition request got a non-retriable error from
> the controller:
> {code:java}
> Failed to alter partition to PendingExpandIsr(newInSyncReplicaId=0,
> sentLeaderAndIsr=LeaderAndIsr(leader=1, leaderEpoch=4,
> isrWithBrokerEpoch=List(BrokerState(brokerId=1, brokerEpoch=-1),
> BrokerState(brokerId=2, brokerEpoch=-1), BrokerState(brokerId=0,
> brokerEpoch=-1)), leaderRecoveryState=RECOVERED, partitionEpoch=7),
> leaderRecoveryState=RECOVERED,
> lastCommittedState=CommittedPartitionState(isr=Set(1, 2),
> leaderRecoveryState=RECOVERED)) because the partition epoch is invalid.
> Partition state may be out of sync, awaiting new the latest metadata.
> (kafka.cluster.Partition)
> [zk-broker-1-to-controller-alter-partition-channel-manager]
> {code}
> Since it's a non-retriable error, we keep the state as pending and wait for
> a later leaderAndISR update, as described
> [here|https://github.com/apache/kafka/blob/d24abe0edebad37e554adea47408c3063037f744/core/src/main/scala/kafka/cluster/Partition.scala#L1876C1-L1876C41].
> Log analysis: https://gist.github.com/showuon/5514cbb995fc2ae6acd5858f69c137bb
> So the questions become:
> 1. Why does the controller increase the partition epoch?
> 2. When the leader received the leaderAndISR request from the controller, it
> ignored the request because the leader epoch was identical, even though the
> partition epoch had been updated. Is this behavior expected? Will it affect
> later alterPartition requests?
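
The interaction behind the two questions can be sketched as a toy model. This is not Kafka's code: the class, the method names, and the result labels are hypothetical, and the controller-side partition epoch of 8 is an assumed value (the report only says that the controller increased it). The leader epoch 4 and the leader's partition epoch 7 are taken from the log line quoted in the issue. The sketch shows why the leader could stay stuck: it skips the leaderAndISR update because the leader epoch is unchanged, so it keeps sending alterPartition with the stale partition epoch.

```java
public class EpochCheckSketch {
    // Hypothetical result labels, not Kafka error codes.
    enum Result { APPLIED, IGNORED_SAME_LEADER_EPOCH, ACCEPTED, REJECTED_INVALID_PARTITION_EPOCH }

    // Controller side (simplified assumption): an alterPartition request is
    // accepted only if its partition epoch matches the controller's view.
    static Result handleAlterPartition(int controllerPartitionEpoch, int requestPartitionEpoch) {
        return requestPartitionEpoch == controllerPartitionEpoch
                ? Result.ACCEPTED
                : Result.REJECTED_INVALID_PARTITION_EPOCH;
    }

    // Leader side (behavior described in question 2): a leaderAndISR request
    // is applied only if it carries a newer leader epoch.
    static Result handleLeaderAndIsr(int localLeaderEpoch, int requestLeaderEpoch) {
        return requestLeaderEpoch > localLeaderEpoch
                ? Result.APPLIED
                : Result.IGNORED_SAME_LEADER_EPOCH;
    }

    public static void main(String[] args) {
        int leaderEpoch = 4;              // from the quoted log line
        int leaderPartitionEpoch = 7;     // from the quoted log line
        int controllerPartitionEpoch = 8; // assumption: controller bumped the epoch

        // The controller sends leaderAndISR with the same leader epoch.
        Result update = handleLeaderAndIsr(leaderEpoch, leaderEpoch);
        if (update == Result.APPLIED) {
            leaderPartitionEpoch = controllerPartitionEpoch;
        }
        System.out.println("leaderAndISR: " + update);

        // The leader's partition epoch is still stale, so every attempt to
        // expand the ISR is rejected in this model.
        for (int attempt = 1; attempt <= 3; attempt++) {
            Result r = handleAlterPartition(controllerPartitionEpoch, leaderPartitionEpoch);
            System.out.println("alterPartition attempt " + attempt + ": " + r);
        }
    }
}
```

In this model the loop only ends if the leader applies a metadata update that carries the new partition epoch, which is what question 2 is asking about.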