Andre Araujo created KAFKA-10048:
------------------------------------

             Summary: Possible data gap for a consumer after a failover when 
using MM2
                 Key: KAFKA-10048
                 URL: https://issues.apache.org/jira/browse/KAFKA-10048
             Project: Kafka
          Issue Type: Bug
          Components: mirrormaker
    Affects Versions: 2.5.0
            Reporter: Andre Araujo


I've been looking at some MM2 scenarios and identified a situation where 
consumers can miss consuming some data in the even of a failover.
 
When a consumer subscribes to a topic for the first time and commits offsets, 
the offsets for every existing partition of that topic will be saved to the 
cluster's {{__consumer_offset}} topic. Even if a partition is completely empty, 
the offset {{0}} will still be saved for the consumer's consumer group.

 
When MM2 is replicating the checkpoints to the remote cluster, though, it 
[ignores anything that has an offset equals to 
zero|https://github.com/apache/kafka/blob/856e36651203b03bf9a6df2f2d85a356644cbce3/connect/mirror/src/main/java/org/apache/kafka/connect/mirror/MirrorCheckpointTask.java#L135],
 replicating offsets only for partitions that contain data.
 
This can lead to a gap in the data consumed by consumers in the following 
scenario:
 # Topic is created on the source cluster.
 # MM2 is configured to replicate the topic and consumer groups
 # Producer starts to produce data to the source topic but for some reason some 
partitions do not get data initially, while others do (skewed keyed messages or 
bad luck)
 # Consumers start to consume data from that topic and their consumer groups' 
offsets are replicated to the target cluster, *but only for partitions that 
contain data*. The consumers are using the default setting auto.offset.reset = 
latest.
 # A consumer failover to the second cluster is performed (for whatever 
reason), and the offset translation steps are completed. The consumer are not 
restarted yet.
 # The producers continue to produce data to the source cluster topic and now 
produce data to the partitions that were empty before.
 # *After* the producers start producing data, consumers are started on the 
target cluster and start consuming.
For the partitions that already had data before the failover, everything works 
fine. The consumer offsets will have been translated correctly and the 
consumers will start consuming from the correct position.
For the partitions that were empty before the failover, though, any data 
written by the producers to those partitions *after the failover but before the 
consumers start* will be completely missed, since the consumers will jump 
straight to the latest offset when they start due to the lack of a zero offset 
stored locally on the target cluster.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to