We're doing some testing on Kafka 1.1 brokers in AWS EC2. Specifically,
we're cleanly shutting brokers down for 5 minutes or more and then
restarting them, while continuously producing to and consuming from the
cluster. In theory, this should be relatively seamless for both producers
and consumers. However, we're getting these errors in the producer
application whenever a broker has been down for over 5 minutes:
ERROR - 2019-02-27 04:34:24.946 - message size [2494]
-Expiring 7 record(s) for topic2-0: 30033 ms has passed since last append
That is followed by the first of many errors similar to this:
ERROR - 2019-02-27 04:35:13.098; Topic [topic2], message size [2494]
-Failed to allocate memory within the configured max blocking
time 60000 ms.
At that point, the long-running producers become unresponsive, and every
subsequent produce request fails with that same "Failed to allocate
memory" error. I searched online for similar issues, but all I could find
was an old Kafka JIRA ticket that was resolved in 0.10.1.1, so it
shouldn't apply to the newer 1.1 version we're using:
https://issues.apache.org/jira/browse/KAFKA-3651
We have tried many different scenarios, including reverting the producer
configuration to the Kafka defaults, to see if anything would help, but we
always run into this problem whenever a broker rejoins the cluster after
being down for 5 minutes or more.
I also posted this question on Stack Overflow:
https://stackoverflow.com/questions/54911991/failed-to-allocate-memory-and-expiring-record-error-after-kafka-broker-is-do
Any idea what we might be doing wrong?
Thanks,
Marcos