Hi Kafka users,
we're running a cluster of two Kafka 0.8.1.1 brokers, with a replication
factor of 2 for every topic.
When both brokers are up, after a short while the FetchRequestPurgatory
starts to grow indefinitely on the leader (detectable via a heap dump
and also via the "FetchRequestPurgatory"."PurgatorySize" JMX metric),
eventually leading to an OOM error. When one of the brokers is shut
down, the purgatory stops growing in size, and the remaining broker runs
fine. According to https://issues.apache.org/jira/browse/KAFKA-1016, this
can happen when a fetcher specifies too large a max wait time, but we
don't override replica.fetch.wait.max.ms, so it stays at the default of 500 ms.
Do you have any suggestions as to what the cause might be and how to fix it?
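
For anyone who wants to watch the same metric, here is a minimal sketch of
polling the purgatory size over JMX. It is not our exact tooling. The JMX
port (9999), the quoted MBean name, the "Value" attribute, and the class and
method names are assumptions; please check the actual MBean name in jconsole
on your broker before relying on it.

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class PurgatoryWatch {

    // Returns true if every sample is strictly larger than the previous one.
    static boolean isSteadilyGrowing(long[] samples) {
        if (samples.length < 2) {
            return false;
        }
        for (int i = 1; i < samples.length; i++) {
            if (samples[i] <= samples[i - 1]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        // Assumption: the broker was started with JMX_PORT=9999 on localhost.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
        // Assumption: MBean name as it appears on a 0.8.1.x broker; verify in jconsole.
        ObjectName gauge = new ObjectName(
                "\"kafka.server\":type=\"FetchRequestPurgatory\",name=\"PurgatorySize\"");

        long[] samples = new long[6];
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            for (int i = 0; i < samples.length; i++) {
                // Assumption: the gauge exposes its reading as the "Value" attribute.
                samples[i] = ((Number) conn.getAttribute(gauge, "Value")).longValue();
                System.out.println("PurgatorySize = " + samples[i]);
                Thread.sleep(10_000L); // one sample every 10 seconds
            }
        }
        System.out.println("steadily growing: " + isSteadilyGrowing(samples));
    }
}
```

Running this against the leader while both brokers are up, and again with one
broker shut down, should make the difference in behaviour easy to compare.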
Thanks a lot,
András