Hi, I have read the official documentation on Kafka geo-replication (cross-DC mirroring), but I don't think it can guarantee that data is always replicated to multiple DCs: when a DC-level failure happens (the whole DC becomes unavailable), the cluster in the other DC may not yet have all the data replicated. So we came up with the idea of running one Kafka cluster that spans multiple DCs.
Imagine we have 3 data centers in 2 different cities: 2 DCs in city A and 1 DC in city B. Our goal is for our application to tolerate any single-DC failure without data loss. The network latency between the two cities is about 30 ms. A Kafka cluster is set up with 5 brokers: 2 in DC-A1, 2 in DC-A2, and 1 in DC-B. Topics are created with 5 replicas, one on each broker. Since we want to maximize data availability, the producer is configured with acks=all, so a send does not succeed until the message has been written to all replicas. But the cost is also high, because the round trip to the replica in DC-B adds latency.

We know that acks=all means producer.send() has to wait for responses from all the replicas in the ISR set. If we could find a way to keep the replica in DC-B permanently out of the ISR set, while the replicas in DC-A1 and DC-A2 always stay in the ISR and replication keeps working, that would be great for reducing producer latency. So we found the broker-side config replica.lag.time.max.ms and tried setting it to a value below 100 ms, but the whole cluster became unstable: we could see the ISRs of topic partitions located in DC-A1 and DC-A2 shrinking and expanding frequently in server.log, even under very low throughput. If we raise the value, the DC-B replica is always in the ISR. This suggests that the replica fetching process and ISR updates are more complex than I thought. Is there a proper replica.lag.time.max.ms value for this setup? Does anyone have any suggestions? Thanks.
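For reference, here is a minimal sketch of the configuration described above (the broker hostnames are placeholders; the replica.lag.time.max.ms value shown is Kafka's default of 30000 ms, which is the setting we have been lowering):

```properties
# producer.properties
# acks=all: the leader waits for acknowledgement from every replica
# currently in the ISR before confirming the send.
bootstrap.servers=broker-a1-1:9092,broker-a2-1:9092,broker-b-1:9092
acks=all

# server.properties (on each broker)
# A follower that has not caught up to the leader's log end within this
# window is removed from the ISR. Default is 30000; we experimented with
# values under 100 to push the DC-B replica out of the ISR.
replica.lag.time.max.ms=30000
```

With 30 ms of cross-city latency, a threshold under 100 ms leaves very little headroom for the fetch round trip, which may explain why even the local DC-A replicas flap in and out of the ISR.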