To remediate, you could start another broker, reassign the partitions away
from the slow broker, and then shut the slow broker down. But you really
should put some monitoring on your system (to help diagnose the actual
problem). Datadog has a pretty good set of articles on using JMX to do this:
There are lots of JMX metrics-gathering tools too, such as jmxtrans:
Confluent also offers tooling (such as Control Center) to help with monitoring.
As far as MirrorMaker goes, you can tune the consumer and producer timeout
settings (for example session.timeout.ms and request.timeout.ms) so that the
process waits long enough for a slow machine.
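
As a rough sketch of what that tuning might look like, assuming MirrorMaker is
started with --consumer.config and --producer.config pointing at two properties
files: the file names and all values below are illustrative assumptions, not
recommendations, and should be sized against the latencies you actually observe.

```properties
# consumer.properties (hypothetical file name, passed via --consumer.config)
# How long the group coordinator waits for heartbeats before it considers
# the consumer dead. The broker caps this with group.max.session.timeout.ms.
session.timeout.ms=30000
# Heartbeat frequency; kept well below session.timeout.ms.
heartbeat.interval.ms=10000
# How long the client waits for a broker response before failing the request.
request.timeout.ms=60000

# producer.properties (hypothetical file name, passed via --producer.config)
# How long the producer waits for a broker acknowledgement.
request.timeout.ms=60000
# How long send() may block when metadata is unavailable or the buffer is full.
max.block.ms=120000
```

Longer timeouts only buy the slow broker more time. They also delay detection
of a broker that really is dead, so they complement monitoring rather than
replace it.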
On 9/16/16, 7:11 AM, "Gerard Klijs" <gerard.kl...@dizzit.com> wrote:
We just had an interesting issue; luckily this was only on our test cluster.
For some reason one of the machines in a cluster became really slow.
Because it was still alive, it was still the leader for some
topic-partitions. Our mirror maker reads and writes to multiple
topic-partitions on each thread. When committing the offsets, the commit
fails for the topic-partitions located on the slow machine, because the
consumers have timed out. The data for these topic-partitions is then sent
over and over, causing a flood of duplicate messages.
What would be the best way to prevent this in the future? Is there some way
the broker could notice it's performing poorly and shut itself off, for example?