Hi Kafka users,

We're running a cluster of two Kafka 0.8.1.1 brokers, with a replication factor of 2 for every topic.

When both brokers are up, after a short while the FetchRequestPurgatory on the leader starts to grow without bound (visible in a heap dump and also via the "FetchRequestPurgatory"."PurgatorySize" JMX metric), eventually leading to an OOM error. When one of the brokers is shut down, the purgatory stops growing and the remaining broker runs fine. According to https://issues.apache.org/jira/browse/KAFKA-1016, this can occur when a fetcher specifies too large a max wait time, but we don't override replica.fetch.wait.max.ms, so it stays at the default of 500 ms.
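
For anyone who wants to watch the same metric, here is a minimal sketch of how the purgatory size could be polled over JMX. The MBean name is an assumption based on the quoted naming style I believe 0.8.1.x uses, so please check it in jconsole first. The class name, the host:port argument, and the sampling interval are hypothetical and only there for illustration.

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

class PurgatoryWatch {
    // Assumed MBean name (quoted 0.8.1.x style); verify it against your broker in jconsole.
    static final String MBEAN =
        "\"kafka.server\":type=\"FetchRequestPurgatory\",name=\"PurgatorySize\"";

    // True if there are at least two samples and each one is larger than the previous.
    public static boolean strictlyIncreasing(long[] samples) {
        for (int i = 1; i < samples.length; i++) {
            if (samples[i] <= samples[i - 1]) {
                return false;
            }
        }
        return samples.length >= 2;
    }

    // Reads the gauge once; gauges are assumed to be exposed through a "Value" attribute.
    public static long readSize(MBeanServerConnection conn) throws Exception {
        Object value = conn.getAttribute(new ObjectName(MBEAN), "Value");
        return ((Number) value).longValue();
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the broker's JMX endpoint as host:port (hypothetical, e.g. broker1:9999).
        String url = "service:jmx:rmi:///jndi/rmi://" + args[0] + "/jmxrmi";
        try (JMXConnector connector = JMXConnectorFactory.connect(new JMXServiceURL(url))) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            long[] samples = new long[5];
            for (int i = 0; i < samples.length; i++) {
                samples[i] = readSize(conn);
                System.out.println("PurgatorySize = " + samples[i]);
                if (i < samples.length - 1) {
                    Thread.sleep(10_000L); // arbitrary 10 s between samples
                }
            }
            System.out.println("growing on every sample: " + strictlyIncreasing(samples));
        }
    }
}
```

This only shows whether the value rose on every sample within a short window; it says nothing about why the purgatory is growing.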

Do you have any suggestions as to what the cause might be and how to fix it?

Thanks a lot,
András
