Hello All,

We have a 3-node Kafka cluster (v0.10.1.1) which hosts about 8500 partitions with an incoming byte rate of 25 MBps. We are currently in the process of expanding it to a 6-node cluster and doing the required data reassignment.
In order to distribute the load, we reassigned the 50 partitions of the __consumer_offsets topic across all nodes in the cluster (just like a few other topics that we identified as needing to move).

The problem we saw was that one of the already-running consumer groups (say CG1) started processing from the beginning of its topic (say TOPIC1). We know from experience and metrics that CG1 is usually caught up to within the last few hundred messages, so this was unexpected. It started *re-processing all 4.2 billion messages* in TOPIC1 all over again. CG1 is set up with auto.offset.reset=earliest, so it means all offsets committed by CG1 were lost during the reassignment of the __consumer_offsets topic.

Upon further investigation, we saw that a few of the threads in the consumer group were rebalancing at the time we triggered the partition reassignment. There are a few entries in the Kafka logs that indicate errors, but nothing special that points to the problem.

Has anybody faced this problem before? Any pointers on what can help debug this issue? Below are some notable Kafka logs.
A few of these:

ERROR [ReplicaFetcherThread-2-4:Logging$class@99] - [ReplicaFetcherThread-2-4], Error for partition [__consumer_offsets,49] to broker 4:org.apache.kafka.common.errors.UnknownServerException: The server experienced an unexpected error when processing the request

And a few of these for a different set of partitions. FYI (old cluster broker ids: 4,5,6) -- (scaled cluster broker ids: 1,2,3,4,5,6):

ERROR [kafka-request-handler-1:Logging$class@105] - [KafkaApi-4] Error when handling request Name: FetchRequest; Version: 3; CorrelationId: 0; ClientId: ReplicaFetcherThread-2-4; ReplicaId: 1; MaxWait: 500 ms; MinBytes: 1 bytes; MaxBytes:10485760 bytes; RequestInfo: ([__consumer_offsets,49], PartitionFetchInfo(0,11534336))
kafka.common.NotAssignedReplicaException: Leader 4 failed to record follower 1's position 0 since the replica is not recognized to be one of the assigned replicas 4,5,6 for partition [__consumer_offsets,49].

Thanks,
--Girish
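One check that might help narrow this down is working out which __consumer_offsets partition holds a given group's committed offsets, and comparing it against the partitions named in the errors above. The sketch below is only an illustration. It assumes the broker picks the partition as the absolute value of the Java String hashCode of the group id, modulo the number of offsets-topic partitions (50 here), which is my understanding of the 0.10.x group coordinator but should be verified against the broker source. "CG1" is the placeholder name from this mail, not the real group id, and the function names are made up for this example.

```python
def java_string_hash(s: str) -> int:
    """Emulate Java's String.hashCode(), returning a signed 32-bit int."""
    h = 0
    data = s.encode("utf-16-be")  # Java hashes UTF-16 code units
    for i in range(0, len(data), 2):
        unit = (data[i] << 8) | data[i + 1]
        h = (31 * h + unit) & 0xFFFFFFFF  # wrap like 32-bit int arithmetic
    # Convert the unsigned 32-bit value to a signed int, as in Java.
    return h - (1 << 32) if h >= (1 << 31) else h


def offsets_partition_for(group_id: str, num_partitions: int = 50) -> int:
    """Assumed mapping from group id to __consumer_offsets partition."""
    h = java_string_hash(group_id)
    # Assumption: absolute value, with Integer.MIN_VALUE mapped to 0.
    h = 0 if h == -(1 << 31) else abs(h)
    return h % num_partitions


if __name__ == "__main__":
    # "CG1" is a placeholder; substitute the real consumer group id.
    print(offsets_partition_for("CG1"))
```

If the partition this returns for the real group id is one of those that was being moved when the errors were logged, that would at least be consistent with the offsets being unreadable at the moment the group rebalanced.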