lost committed offsets during reassignment

Girish Aher Wed, 16 May 2018 08:19:35 -0700

Hello All,

We have a 3 node kafka cluster (v0.10.1.1) which hosts about 8500
partitions with incoming byte rate of 25 MBps. We are currently in the
process of expanding it to a 6 node cluster and doing the required data
reassignment.


In order to distribute the load, we reassigned the 50 partitions of the
__consumer_offsets topic across all nodes in the cluster (just like a few
other topics we identified that we had to move).
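
For readers who want to reproduce the setup, here is a minimal sketch of how a reassignment file for kafka-reassign-partitions.sh could be generated. The JSON layout (a "version" field plus a "partitions" list of topic/partition/replicas entries) is the tool's input format. The round-robin placement, the replication factor of 3, and the helper name build_reassignment are assumptions made for illustration, not the plan that was actually applied.

```python
import json


def build_reassignment(topic, num_partitions, brokers, replication_factor):
    """Build a reassignment plan that spreads replicas round-robin over brokers.

    Hypothetical helper: the placement strategy is an assumption for illustration.
    """
    partitions = []
    for p in range(num_partitions):
        # Pick replication_factor consecutive brokers, starting at a
        # partition-dependent offset so the first replica rotates.
        replicas = [brokers[(p + i) % len(brokers)]
                    for i in range(replication_factor)]
        partitions.append({"topic": topic, "partition": p, "replicas": replicas})
    return {"version": 1, "partitions": partitions}


# Assumed inputs: 50 offsets partitions, broker ids 1-6, replication factor 3.
plan = build_reassignment("__consumer_offsets", 50, [1, 2, 3, 4, 5, 6], 3)

if __name__ == "__main__":
    print(json.dumps(plan, indent=2))
```

The printed JSON can be saved to a file and passed to the tool with --reassignment-json-file. Moving the partitions in smaller batches limits how many offsets partitions are in flight at the same time.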

The problem we saw was that one of the already running consumer groups (say
CG1) started processing from the beginning of the topic (say TOPIC1). We
know from experience and metrics that CG1 is usually caught up to within the
last few hundred messages, so that was unexpected. It started *re-processing
all the 4.2 billion messages* in TOPIC1 all over again.

CG1 is set up with auto.offset.reset=earliest, so this suggests that all
offsets committed by CG1 were lost during the reassignment of the
__consumer_offsets topic.
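
One way to narrow this down is to work out which __consumer_offsets partition holds a group's committed offsets, and then check whether that partition appears in the broker errors. As far as I understand the broker code, the partition is chosen as abs(groupId.hashCode) % offsets.topic.num.partitions, with Integer.MIN_VALUE mapped to 0. The sketch below reimplements Java's String.hashCode in Python under that assumption. The function names are hypothetical, and "CG1" is only the placeholder group name used above.

```python
def java_string_hash(s: str) -> int:
    """Reproduce java.lang.String.hashCode() as a signed 32-bit integer."""
    h = 0
    data = s.encode("utf-16-be")  # Java hashes UTF-16 code units
    for i in range(0, len(data), 2):
        unit = (data[i] << 8) | data[i + 1]
        h = (31 * h + unit) & 0xFFFFFFFF  # emulate 32-bit overflow
    # Convert the unsigned 32-bit value to a signed one.
    return h - (1 << 32) if h >= (1 << 31) else h


def kafka_abs(n: int) -> int:
    """Absolute value with Integer.MIN_VALUE mapped to 0 (assumed broker behaviour)."""
    return 0 if n == -(1 << 31) else abs(n)


def offsets_partition_for(group_id: str, num_partitions: int = 50) -> int:
    """Return the __consumer_offsets partition assumed to store this group's offsets."""
    return kafka_abs(java_string_hash(group_id)) % num_partitions


if __name__ == "__main__":
    print(offsets_partition_for("CG1"))
```

Run it with the real group id and the cluster's offsets.topic.num.partitions value (50 here). If the result matches a partition named in the broker errors, that partition's reassignment is the first place to look.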

Upon further investigation, we saw that a few of the threads in the
consumer group were rebalancing during the time we triggered the partition
reassignment.

There are a few errors in the Kafka broker logs, but nothing that clearly
points to the problem.

Has anybody faced the problem before? Any pointers on what can help debug
this issue?



Below are some notable Kafka broker logs.
A few of these:

ERROR [ReplicaFetcherThread-2-4:Logging$class@99] - [ReplicaFetcherThread-2-4], Error for partition [__consumer_offsets,49] to broker 4:org.apache.kafka.common.errors.UnknownServerException: The server experienced an unexpected error when processing the request


And a few of these for a different set of partitions:
FYI (old cluster broker ids: 4,5,6) -- (scaled cluster broker ids:
1,2,3,4,5,6)

ERROR [kafka-request-handler-1:Logging$class@105] - [KafkaApi-4] Error when handling request Name: FetchRequest; Version: 3; CorrelationId: 0; ClientId: ReplicaFetcherThread-2-4; ReplicaId: 1; MaxWait: 500 ms; MinBytes: 1 bytes; MaxBytes:10485760 bytes; RequestInfo: ([__consumer_offsets,49], PartitionFetchInfo(0,11534336)) kafka.common.NotAssignedReplicaException: Leader 4 failed to record follower 1's position 0 since the replica is not recognized to be one of the assigned replicas 4,5,6 for partition [__consumer_offsets,49].



Thanks,
--Girish
