Hi everyone,

I recently observed some behavior in a KRaft-based Kafka cluster (v3.7.2) that I was hoping to get some technical clarification on.

My Setup:
- *Architecture:* 3 controllers / 3 brokers (KRaft mode)
- *Version:* 3.7.2

The Scenario:
1. I brought down *2 out of 3 controllers*, breaking the KRaft quorum. As expected, the describe-topics and metadata-quorum commands failed.
2. While the quorum was down, I brought down *1 broker*.
3. *Producer behavior:* Remained fully functional. It automatically redirected messages to the remaining nodes for the affected partition leaders without issue.
4. *Consumer behavior:* The consumer initially stopped when the broker went down, but after I restarted it, it resumed processing perfectly fine. I also waited 5 minutes for any caches to clear, and it was still able to consume after the restart, though this wasn't the case across all tests.

My Questions:
I am curious how the data plane remains so resilient while the control plane is offline:
- *Partition management:* How is partition leadership/assignment maintained and communicated to clients when the controller quorum is unavailable to handle new metadata updates?
- *Consumer rebalancing:* If a broker fails while the quorum is down, how does the Group Coordinator handle partition assignments? Is it simply relying on the last "frozen" metadata state cached on the remaining brokers?
- *Internal offsets:* Since __consumer_offsets is just another topic, am I correct in assuming that, as long as those partitions have surviving replicas, the consumer-group logic functions independently of the KRaft quorum status and there is no impact on data integrity?

I'd appreciate any insights into the internal mechanics of how the brokers utilize the cached metadata log during a total controller outage.

Best regards,
Vruttant Mankad
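For reference, the checks I ran during the outage were along these lines (the bootstrap address and topic name are placeholders for my actual setup):

```shell
# Query the KRaft quorum state; this failed once 2 of 3 controllers were down.
bin/kafka-metadata-quorum.sh --bootstrap-server localhost:9092 describe --status

# Describe topic metadata; this also failed while the quorum was unavailable.
bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic test-topic
```

Both commands go through a broker via --bootstrap-server rather than contacting the controllers directly, which is part of why the behavior surprised me.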
