[jira] [Created] (KAFKA-10048) Possible data gap for a consumer after a failover when using MM2

Andre Araujo (Jira) Tue, 26 May 2020 23:45:10 -0700

Andre Araujo created KAFKA-10048:
------------------------------------

             Summary: Possible data gap for a consumer after a failover when using MM2
                 Key: KAFKA-10048
                 URL: https://issues.apache.org/jira/browse/KAFKA-10048
             Project: Kafka
          Issue Type: Bug
          Components: mirrormaker
    Affects Versions: 2.5.0
            Reporter: Andre Araujo

I've been looking at some MM2 scenarios and identified a situation where
consumers can miss some data in the event of a failover.
When a consumer subscribes to a topic for the first time and commits offsets,
the offsets for every existing partition of that topic will be saved to the
cluster's {{__consumer_offsets}} topic. Even if a partition is completely empty,
the offset {{0}} will still be saved for the consumer's consumer group.

When MM2 is replicating the checkpoints to the remote cluster, though, it
[ignores anything that has an offset equal to
zero|https://github.com/apache/kafka/blob/856e36651203b03bf9a6df2f2d85a356644cbce3/connect/mirror/src/main/java/org/apache/kafka/connect/mirror/MirrorCheckpointTask.java#L135],
replicating offsets only for partitions that contain data.
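
To make the effect of that filter concrete, here is a minimal, self-contained sketch. It is a simplified model and not the actual {{MirrorCheckpointTask}} code: the class name, the {{checkpointable}} helper and the partition labels are hypothetical, and the only behavior taken from the description above is that offsets equal to zero are dropped.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class CheckpointFilterSketch {

    // Hypothetical helper: given the committed offset per partition for one
    // consumer group, return only the entries that would be checkpointed
    // under a "skip anything at offset zero" rule.
    static Map<String, Long> checkpointable(Map<String, Long> committed) {
        return committed.entrySet().stream()
                .filter(e -> e.getValue() > 0) // offset 0 is dropped here
                .collect(Collectors.toMap(
                        Map.Entry::getKey,
                        Map.Entry::getValue,
                        (a, b) -> a,
                        LinkedHashMap::new));
    }

    public static void main(String[] args) {
        Map<String, Long> committed = new LinkedHashMap<>();
        committed.put("topic-0", 42L); // partition that already has data
        committed.put("topic-1", 0L);  // empty partition, committed at offset 0

        // Only topic-0 survives; topic-1 has no checkpoint on the target.
        System.out.println(checkpointable(committed)); // prints {topic-0=42}
    }
}
```

The empty partition has a committed offset on the source cluster but no corresponding entry after the filter, which is the precondition for the gap described in the rest of this report.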
This can lead to a gap in the data consumed by consumers in the following
scenario:
# Topic is created on the source cluster.
# MM2 is configured to replicate the topic and consumer groups.
# Producer starts to produce data to the source topic but for some reason some
partitions do not get data initially, while others do (skewed keyed messages or
bad luck)
# Consumers start to consume data from that topic and their consumer groups'
offsets are replicated to the target cluster, *but only for partitions that
contain data*. The consumers are using the default setting auto.offset.reset =
latest.
# A consumer failover to the second cluster is performed (for whatever
reason), and the offset translation steps are completed. The consumers are not
restarted yet.
# The producers continue to produce data to the source cluster topic and now
produce data to the partitions that were empty before.
# *After* the producers start producing data, consumers are started on the
target cluster and start consuming.
For the partitions that already had data before the failover, everything works
fine. The consumer offsets will have been translated correctly and the
consumers will start consuming from the correct position.
For the partitions that were empty before the failover, though, any data
written by the producers to those partitions *after the failover but before the
consumers start* will be completely missed, since the consumers will jump
straight to the latest offset when they start due to the lack of a zero offset
stored locally on the target cluster.
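
The outcome for the two kinds of partitions can be sketched as follows. This is an illustrative model only, with hypothetical names ({{FailoverGapSketch}}, {{startPosition}}, {{skippedRecords}}); it assumes {{auto.offset.reset = latest}} as in the scenario above, and it simplifies by treating a missing translated offset as "the group was at offset 0 on the source".

```java
import java.util.OptionalLong;

public class FailoverGapSketch {

    // Where a consumer starts on the target cluster: at the translated offset
    // if one was stored, otherwise at the log end offset (the behavior of
    // auto.offset.reset = latest when no committed offset exists).
    static long startPosition(OptionalLong translatedOffset, long logEndOffset) {
        return translatedOffset.orElse(logEndOffset);
    }

    // Number of records skipped: the distance between where the consumer
    // should have resumed and where it actually starts. Assumption: with no
    // translated offset, the group was at offset 0 on the source.
    static long skippedRecords(OptionalLong translatedOffset, long logEndOffset) {
        long expectedResume = translatedOffset.orElse(0L);
        return startPosition(translatedOffset, logEndOffset) - expectedResume;
    }

    public static void main(String[] args) {
        // Partition that had data before the failover: offset 42 was
        // translated, and the log has since grown to 50.
        System.out.println(skippedRecords(OptionalLong.of(42L), 50L)); // prints 0

        // Partition that was empty before the failover: no offset was
        // stored, and 5 records arrived before the consumer started.
        System.out.println(skippedRecords(OptionalLong.empty(), 5L)); // prints 5
    }
}
```

In this model, storing the zero offset on the target cluster (that is, passing {{OptionalLong.of(0L)}} for the second partition) would make the consumer start at offset 0 and skip nothing.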

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
