Hi everyone,

I recently observed some behavior in a KRaft-based Kafka cluster (v3.7.2)
that I was hoping to get some technical clarification on.
My Setup:

   - *Architecture:* 3 Controllers / 3 Brokers (KRaft mode).
   - *Version:* 3.7.2.

The Scenario:

   1. I brought down *2 out of 3 controllers*, breaking the KRaft quorum.
   As expected, describe-topics and metadata-quorum commands failed.
   2. While the quorum was down, I brought down *1 broker*.
   3. *Producer Behavior:* The producer remained fully functional. It
   automatically redirected messages to the remaining leaders of the
   affected partitions without issue.
   4. *Consumer Behavior:* The consumer initially stopped when the broker
   went down, but after a restart it resumed processing without issue. I
   also waited 5 minutes for any caches to clear, and it was still able to
   consume after the restart, though this wasn't consistent across all
   tests.
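For anyone wanting to reproduce this, the checks above roughly correspond to the following commands (the bootstrap address and topic name are illustrative, not my actual environment):

```shell
# Check the controller quorum status; this fails once 2 of 3 controllers
# are down, since the broker can no longer reach an active quorum
bin/kafka-metadata-quorum.sh --bootstrap-server broker1:9092 \
  describe --status

# Describe a topic to see whether metadata queries still succeed against
# the surviving brokers' cached metadata
bin/kafka-topics.sh --bootstrap-server broker1:9092 \
  --describe --topic my-topic
```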

My Questions:

I am curious how the data plane remains so resilient when the control plane
is offline:

   - *Partition Management:* How is partition leadership/assignment
   maintained and communicated to clients when the controller quorum is
   unavailable to handle new metadata updates?
   - *Consumer Rebalancing:* In the event of a broker failure while the
   quorum is down, how does the Group Coordinator handle partition
   assignments? Is it simply relying on the last "frozen" metadata state
   cached on the remaining brokers?
   - *Internal Offsets:* Since __consumer_offsets is just another topic, am
   I correct in assuming that, as long as those partitions have surviving
   replicas, the consumer group logic functions independently of the KRaft
   quorum status and there is no impact on data integrity?
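On the last point, one way to check which replicas of the offsets topic survived a broker failure (bootstrap address illustrative):

```shell
# Show partition leaders and ISR for the internal offsets topic; a group's
# coordinator runs on the leader of the partition its group ID hashes to
bin/kafka-topics.sh --bootstrap-server broker1:9092 \
  --describe --topic __consumer_offsets
```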

I'd appreciate any insights into the internal mechanics of how the brokers
utilize the cached metadata log during a total controller outage.

Best regards,
Vruttant Mankad
