ReplicaFetcherThread marked as failed and NotLeaderOrFollowerException after upgrade to 2.7.1

Aurel Paulovič Mon, 23 Aug 2021 08:09:34 -0700

Hello, we recently upgraded our production Kafka cluster from 2.6.1 to 
2.7.1 and started seeing two types of errors:


1. When a broker is restarted, during startup it logs many messages about 
an older partition leader epoch, each followed by a warning that the 
partition was marked as failed:


...

[2021-08-23 15:25:55,629] INFO [ReplicaFetcher replicaId=10, leaderId=11, 
fetcherId=2] Partition redacted-topic1-name-19 has an older epoch (44) than the 
current leader. Will await the new LeaderAndIsr state before resuming fetching. 
(kafka.server.ReplicaFetcherThread)
[2021-08-23 15:25:55,630] WARN [ReplicaFetcher replicaId=10, leaderId=11, 
fetcherId=2] Partition redacted-topic1-name-19 marked as failed 
(kafka.server.ReplicaFetcherThread)

...


At the end of broker startup I get:

[2021-08-23 15:25:55,645] INFO [ReplicaFetcherManager on broker 10] Removed 
fetcher for partitions Set(...[a lot of partitions], redacted-topic1-name-19, 
[a lot of partitions]...) (kafka.server.ReplicaFetcherManager)

2. While running, the broker writes messages like the ones below to STDOUT 
roughly every 30 seconds (we have set 
leader.imbalance.check.interval.seconds=30). They have no stack trace and do 
not appear in the standard server.log, only in /var/log/messages, so they 
look like STDOUT captured by journald:


...

Aug 23 16:50:34 prod-kafka10 java: 
org.apache.kafka.common.errors.NotLeaderOrFollowerException: Error while 
fetching partition state for redacted-topic1-name-0
Aug 23 16:50:34 prod-kafka10 java: 
org.apache.kafka.common.errors.NotLeaderOrFollowerException: Error while 
fetching partition state for redacted-topic1-name-1
Aug 23 16:50:34 prod-kafka10 java: 
org.apache.kafka.common.errors.NotLeaderOrFollowerException: Error while 
fetching partition state for redacted-topic1-name-2
Aug 23 16:50:34 prod-kafka10 java: 
org.apache.kafka.common.errors.NotLeaderOrFollowerException: Error while 
fetching partition state for redacted-topic1-name-3
Aug 23 16:50:34 prod-kafka10 java: 
org.apache.kafka.common.errors.NotLeaderOrFollowerException: Error while 
fetching partition state for redacted-topic2-name-0
Aug 23 16:50:34 prod-kafka10 java: 
org.apache.kafka.common.errors.NotLeaderOrFollowerException: Error while 
fetching partition state for redacted-topic2-name-1

...


These messages appear even for topics/partitions that are not hosted by the 
broker. For the topics/partitions that the broker replicates, we get 
different exceptions:


...

Aug 23 16:50:34 prod-kafka10 java: 
org.apache.kafka.common.errors.NotLeaderOrFollowerException: Failed to find 
leader log for partition redacted-topic3-name-1 with leader epoch 
Optional.empty. The current leader is Some(11) and the current epoch 188
Aug 23 16:50:34 prod-kafka10 java: 
org.apache.kafka.common.errors.NotLeaderOrFollowerException: Failed to find 
leader log for partition redacted-topic3-name-2 with leader epoch 
Optional.empty. The current leader is Some(9) and the current epoch 164

...


There are no exceptions for the partitions for which the broker is the leader.
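
To get an overview of which topics are affected by each of the two message 
shapes above, something like the following could be used to tally the lines 
in /var/log/messages. This is only a minimal sketch: the function names 
(tally, split_topic_partition) and the kind labels are made up for 
illustration, and it assumes each entry sits on a single line in the file 
(the excerpts above are wrapped only by the mail client).

```python
import re
from collections import Counter

# Patterns derived from the two message shapes quoted in this mail.
# Assumption: each journald entry is one unwrapped line in the file.
STATE_RE = re.compile(
    r"NotLeaderOrFollowerException: Error while fetching partition state for (\S+)"
)
LEADER_LOG_RE = re.compile(
    r"NotLeaderOrFollowerException: Failed to find leader log for partition (\S+) "
    r"with leader epoch"
)


def split_topic_partition(name):
    """Split 'some-topic-3' into ('some-topic', 3); topic names may contain dashes."""
    topic, _, partition = name.rpartition("-")
    return topic, int(partition)


def tally(lines):
    """Count exception lines per (kind, topic) over an iterable of log lines."""
    counts = Counter()
    patterns = (("partition_state", STATE_RE), ("leader_log", LEADER_LOG_RE))
    for line in lines:
        for kind, pattern in patterns:
            match = pattern.search(line)
            if match:
                topic, _ = split_topic_partition(match.group(1))
                counts[(kind, topic)] += 1
                break  # a line matches at most one of the two shapes
    return counts
```

Passing an open file handle for /var/log/messages to tally() gives one 
counter entry per message kind and topic, which makes it easier to compare 
the affected topics against the partitions the broker actually hosts.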


Does anyone know what is wrong with the cluster and how to fix it? So far the 
cluster appears to be running: producers are successfully writing messages to 
it, consumers are reading them, and there appears to be no message loss. 
Also, the ISR is full on all partitions, and no partitions are 
under-replicated or offline. We upgraded a number of other clusters in our 
company before the production cluster, and none of them shows these errors.
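
For anyone who wants to cross-check the startup warnings from item 1 against 
the current ISR state, the partitions that were marked as failed, together 
with the leader id and the stale local epoch, could be pulled out of 
server.log roughly like this. Again only a sketch: stale_partitions is a 
hypothetical helper name, and the patterns assume the exact wording and 
single-line layout of the log entries quoted above.

```python
import re

# Patterns taken from the two ReplicaFetcherThread messages quoted in item 1.
# Assumption: each server.log entry is a single line (not wrapped as in this mail).
OLDER_EPOCH_RE = re.compile(
    r"\[ReplicaFetcher replicaId=(\d+), leaderId=(\d+), fetcherId=(\d+)\] "
    r"Partition (\S+) has an older epoch \((\d+)\) than the current leader"
)
FAILED_RE = re.compile(r"Partition (\S+) marked as failed")


def stale_partitions(lines):
    """Return {partition: (leader_id, local_epoch)} for partitions that logged
    an older-epoch message and were also marked as failed."""
    older = {}
    failed = set()
    for line in lines:
        match = OLDER_EPOCH_RE.search(line)
        if match:
            older[match.group(4)] = (int(match.group(2)), int(match.group(5)))
            continue
        match = FAILED_RE.search(line)
        if match:
            failed.add(match.group(1))
    return {p: older[p] for p in sorted(failed) if p in older}
```

The resulting epochs can then be compared with the leader epochs reported in 
the "Failed to find leader log" messages for the same partitions.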
