> Am I correct in assuming that if the preferred leader is not available, the next replica in the ISR list is chosen to be the leader?
Yes, that's correct :)

On Wed, May 11, 2022 at 1:15 PM Andrew Otto <o...@wikimedia.org> wrote:

> Thanks so much Guozhang!
>
> > 1) For the producer -> leader hop, could you save the cross-DC
> > network?
>
> > even if your message's partition has to be determined
> > deterministically by the key, in operations you can still see if
> > most of your active producers are from one DC, then configure your
> > topic partitions to be hosted by brokers within the same DC.
> > Generally speaking, there are various ways you can consider saving
> > this hop across DCs.
>
> Hm, perhaps something like this? If we run the producer in
> active/standby mode, so that the producer application only ever runs
> in one DC at a time, could we manage the preferred leaders via the
> replica list order during a failover? Example: if DC-A is the
> 'active' DC, then the producer would run only in DC-A. We'd ensure
> that each partition's replica list starts with brokers only in DC-A.
>
> Let brokers A1 and A2 be in DC-A, and brokers B1 and B2 in DC-B.
> Partitions 0 and 1 have a replication factor of 4:
>
> p0: [A1, A2, B1, B2]
> p1: [A2, A1, B2, B1]
>
> To fail over to DC-B, we'd reassign the partition replica lists to
> put the DC-B brokers first, like:
>
> p0: [B1, B2, A1, A2]
> p1: [B2, B1, A2, A1]
>
> Then issue a preferred leader election, stop the producer in DC-A,
> and start it in DC-B. We'd incur a producer latency hit during the
> failover process until both partition leaders and the producer are in
> DC-B, but hopefully that would be short lived (minutes)?
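>
> Sketched with the Java Admin client, that failover might look
> something like this (a rough, untested sketch; the topic name,
> bootstrap address, and broker ids are made up, with 101/102 standing
> in for A1/A2 and 201/202 for B1/B2):
>
> import java.util.List;
> import java.util.Map;
> import java.util.Optional;
> import java.util.Properties;
> import java.util.Set;
> import org.apache.kafka.clients.admin.Admin;
> import org.apache.kafka.clients.admin.AdminClientConfig;
> import org.apache.kafka.clients.admin.NewPartitionReassignment;
> import org.apache.kafka.common.ElectionType;
> import org.apache.kafka.common.TopicPartition;
>
> public class FailoverToDcB {
>   public static void main(String[] args) throws Exception {
>     Properties props = new Properties();
>     // Placeholder bootstrap address.
>     props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka.dc-b:9092");
>     try (Admin admin = Admin.create(props)) {
>       // Placeholder topic name.
>       TopicPartition p0 = new TopicPartition("events", 0);
>       TopicPartition p1 = new TopicPartition("events", 1);
>
>       // Reorder each replica list so the DC-B brokers come first and
>       // become the preferred leaders. No data moves, since the
>       // replica set itself is unchanged.
>       admin.alterPartitionReassignments(Map.of(
>           p0, Optional.of(new NewPartitionReassignment(List.of(201, 202, 101, 102))),
>           p1, Optional.of(new NewPartitionReassignment(List.of(202, 201, 102, 101)))
>       )).all().get();
>
>       // The new preferred replicas are already in the ISR (they were
>       // replicas all along), so elect them as leaders right away.
>       admin.electLeaders(ElectionType.PREFERRED, Set.of(p0, p1)).partitions().get();
>     }
>   }
> }
>
> Consumers that set client.rack to their local DC should keep reading
> from nearby replicas via follower fetching throughout the switch.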
>
> With follower fetching, this would still allow consumers in either DC
> to read from the closest replica, so it would allow for active/active
> reads. With at least 2 replicas in each DC, rolling broker restarts
> would hopefully still allow consumers to consume from replicas in
> their local DC.
>
> ---
> Also, a quick question about leader election. Am I correct in
> assuming that if the preferred leader is not available, the next
> replica in the ISR list is chosen to be the leader? Or, is it a
> random selection from any of the ISRs? If it is a random selection,
> then manually optimizing the replica list to reduce producer hops
> probably isn't worth trying, as we'd get the producer hops during
> normal broker maintenance.
>
> Thank you!
>
> On Mon, May 9, 2022 at 6:00 PM Guozhang Wang <wangg...@gmail.com> wrote:
>
> > Hello Andrew.
> >
> > Just to answer your questions first: yes, that's correct. In the
> > settings you described, three round-trips between DCs would be
> > incurred, but since the replica fetches can be done in parallel, the
> > latency is not the sum of all the round-trips. If you stay with 2
> > DCs only, the number of round-trips depends only on the number of
> > follower replicas that are in a different DC from the leader
> > replica.
> >
> > Jumping out of the question and your described settings, there are
> > a couple of things you can consider for your design:
> >
> > 1) For the producer -> leader hop, could you save the cross-DC
> > network? For example, if your message can potentially go to any
> > partition (such as when it is not keyed), then you can customize
> > your partitioner as a "rack-aware" one that always tries to pick a
> > partition whose leader lives in the same DC as the producer client;
> > even if your message's partition has to be determined
> > deterministically by the key, in operations you can still check
> > whether most of your active producers are in one DC, then configure
> > your topic partitions to be hosted by brokers within the same DC.
> > Generally speaking, there are various ways you can consider saving
> > this hop across DCs.
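> >
> > A very rough sketch of what such a rack-aware partitioner could
> > look like (untested; the "producer.rack" config key is made up, and
> > this only makes sense for messages without keys):
> >
> > import java.util.List;
> > import java.util.Map;
> > import java.util.concurrent.ThreadLocalRandom;
> > import java.util.stream.Collectors;
> > import org.apache.kafka.clients.producer.Partitioner;
> > import org.apache.kafka.common.Cluster;
> > import org.apache.kafka.common.PartitionInfo;
> >
> > public class RackAwarePartitioner implements Partitioner {
> >   private String rack; // this producer's DC, e.g. "DC-A"
> >
> >   @Override
> >   public void configure(Map<String, ?> configs) {
> >     rack = (String) configs.get("producer.rack"); // made-up config key
> >   }
> >
> >   @Override
> >   public int partition(String topic, Object key, byte[] keyBytes,
> >                        Object value, byte[] valueBytes, Cluster cluster) {
> >     List<PartitionInfo> all = cluster.availablePartitionsForTopic(topic);
> >     if (all.isEmpty()) {
> >       all = cluster.partitionsForTopic(topic); // nothing available yet
> >     }
> >     // Prefer partitions whose current leader is in our own DC.
> >     List<PartitionInfo> local = all.stream()
> >         .filter(p -> p.leader() != null && rack != null
> >             && rack.equals(p.leader().rack()))
> >         .collect(Collectors.toList());
> >     List<PartitionInfo> candidates = local.isEmpty() ? all : local;
> >     // Spread load randomly across the candidates.
> >     return candidates.get(
> >         ThreadLocalRandom.current().nextInt(candidates.size())).partition();
> >   }
> >
> >   @Override
> >   public void close() {}
> > }
> >
> > You'd register it via the producer's partitioner.class config, and
> > the brokers would need broker.rack set to their DC so that clients
> > can see each leader's rack.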
> >
> > 2) For the leader -> follower hop, you can start by first deciding
> > how many cross-DC failures you'd like to tolerate. For example, say
> > you have 2N+1 replicas per partition, with N+1 replicas including
> > the leader in one DC and the other N replicas in the other DC. If
> > we require acks from N+2 replicas (in Kafka this is expressed as
> > acks=all plus a topic min.insync.replicas of N+2, since the
> > producer's acks setting itself only accepts 0, 1, or all), then the
> > data will be replicated to at least one remote replica before the
> > request returns, and hence would not be lost even if a whole DC
> > fails, which could be sufficient for many stretch and multi-colo
> > cases. In practice, since cross-colo fetches take more latency,
> > you'd usually see far fewer than N cross-DC round-trips before the
> > acks are satisfied, and your average/p99 latencies would not
> > increase much compared with having just one cross-DC replica.
> >
> > Guozhang
> >
> > On Mon, May 9, 2022 at 11:58 AM Andrew Otto <o...@wikimedia.org> wrote:
> >
> > > Hi all,
> > >
> > > I'm evaluating <https://phabricator.wikimedia.org/T307944> the
> > > feasibility of setting up a cross-datacenter Kafka 'stretch'
> > > cluster at The Wikimedia Foundation.
> > >
> > > I've found docs here and there, but they are pretty slim. My
> > > biggest concern is that while Follower Fetching
> > > <https://cwiki.apache.org/confluence/display/KAFKA/KIP-392%3A+Allow+consumers+to+fetch+from+closest+replica>
> > > helps with potential consumer latency in a stretch cluster, there
> > > is nothing that addresses producer latency. I'd have expected the
> > > docs I've read to mention this if it were a concern, but I haven't
> > > seen it.
> > >
> > > Specifically, let's say I'm a producer in DC-A, and I want to
> > > produce to partition X with acks=all. Partition X has 3 replicas,
> > > on brokers B1 in DC-A, B2 in DC-A, and B3 in DC-B. Currently, the
> > > replica on B3 (DC-B) is the partition leader. IIUC, when I produce
> > > my message to partition X, that message will cross the DC boundary
> > > for my produce request to B3 (DC-B), then again when replica B1
> > > (DC-A) fetches, and again when replica B2 (DC-A) fetches, for a
> > > total of 3 crossings between DCs.
> > >
> > > Questions:
> > > - Am I correct in understanding that each one of these fetches
> > > contributes to the ack latency?
> > >
> > > - And, as the number of brokers and replicas increases, the number
> > > of times a message crosses the DC boundary (likely) increases too?
> > >
> > > - When replicas are promoted to partition leader, producer clients
> > > will shuffle their connections around, often resulting in them
> > > connecting to the leader in a remote datacenter. Should I be
> > > worried about this unpredictability in cross-DC network
> > > connections and traffic?
> > >
> > > I'm really hoping that a stretch cluster will help solve some
> > > multi-DC streaming app architecture woes, but I'm not so sure the
> > > potential issues with partition leaders are worth it!
> > >
> > > Thanks for any insight y'all have,
> > > -Andrew Otto
> > > Wikimedia Foundation
> >
> > --
> > -- Guozhang
>
--
-- Guozhang