Re: Slow machine disrupting the cluster

David Garcia Fri, 16 Sep 2016 07:41:43 -0700

To remediate, you could start another broker, reassign the partitions onto it, 
and then shut down the busted broker.  But you really should put some 
monitoring on your system to help diagnose the actual problem.  Datadog has a 
pretty good set of articles on using JMX to do this: 
https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/
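
For the replace-the-broker route, here is a rough sketch of building the 
reassignment JSON that swaps the slow broker for the new one. The function 
name, the broker ids and the topic are made up for illustration; it assumes 
you already know the current replica assignment and that the new broker holds 
no replicas yet.

```python
import json


def reassignment_plan(assignments, slow_broker, new_broker):
    """Build a plan that replaces slow_broker with new_broker.

    assignments maps (topic, partition) -> list of replica broker ids.
    Only partitions that have a replica on slow_broker are included.
    Assumes new_broker is freshly started and holds no replicas yet.
    """
    partitions = []
    for (topic, partition), replicas in sorted(assignments.items()):
        if slow_broker not in replicas:
            continue  # partition is not affected by the slow broker
        swapped = [new_broker if b == slow_broker else b for b in replicas]
        partitions.append(
            {"topic": topic, "partition": partition, "replicas": swapped}
        )
    return {"version": 1, "partitions": partitions}


if __name__ == "__main__":
    # Hypothetical cluster: broker 3 is the slow one, broker 4 is new.
    current = {
        ("events", 0): [1, 3],
        ("events", 1): [3, 2],
        ("events", 2): [1, 2],
    }
    print(json.dumps(reassignment_plan(current, 3, 4), indent=2))
```

The printed JSON is in the shape that kafka-reassign-partitions.sh accepts via 
--reassignment-json-file. Check the plan by hand before executing it.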


There are lots of JMX metrics-gathering tools too, such as jmxtrans: 
https://github.com/jmxtrans/jmxtrans

<confluent-plug>
Confluent also offers tooling (such as Control Center) to help with monitoring.
</confluent-plug>

As far as MirrorMaker goes, you can tune the consumer and producer timeout 
settings so that the process waits long enough for a slow machine.
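
As a rough sketch of which settings I mean, the script below renders a 
consumer and a producer properties file with longer timeouts. The keys are 
standard Kafka client settings; the values are only illustrative guesses, not 
recommendations, and the helper names are made up for the example.

```python
def render_properties(settings):
    """Render a dict as Java .properties text, one key=value per line."""
    return "".join(
        "{}={}\n".format(key, value) for key, value in sorted(settings.items())
    )


def check_consumer(settings):
    """Sanity-check the ordering heartbeat < session < request timeout."""
    heartbeat = settings["heartbeat.interval.ms"]
    session = settings["session.timeout.ms"]
    request = settings["request.timeout.ms"]
    if not heartbeat < session < request:
        raise ValueError("expected heartbeat < session < request timeout")


# Illustrative values only; tune them against your own cluster.
CONSUMER_TIMEOUTS = {
    "heartbeat.interval.ms": 10000,
    "session.timeout.ms": 30000,
    "request.timeout.ms": 60000,
}

PRODUCER_TIMEOUTS = {
    "request.timeout.ms": 60000,
    "max.block.ms": 120000,
    "retries": 10,
}

if __name__ == "__main__":
    check_consumer(CONSUMER_TIMEOUTS)
    print("# consumer.properties")
    print(render_properties(CONSUMER_TIMEOUTS))
    print("# producer.properties")
    print(render_properties(PRODUCER_TIMEOUTS))
```

MirrorMaker picks these files up through --consumer.config and 
--producer.config. Note that session.timeout.ms has to fall within the range 
the brokers allow (group.min.session.timeout.ms to 
group.max.session.timeout.ms), so a larger value may need a broker-side change 
too.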

-David

On 9/16/16, 7:11 AM, "Gerard Klijs" <gerard.kl...@dizzit.com> wrote:

    We just had an interesting issue; luckily this was only on our test cluster.
    For some reason one of the machines in the cluster became really slow.
    Because it was still alive, it was still the leader for some
    topic-partitions. Our mirror maker reads and writes to multiple
    topic-partitions on each thread. Committing the offsets fails
    for the topic-partitions located on the slow machine, because the consumers
    have timed out. The data for these topic-partitions is then sent over and
    over, causing a flood of duplicate messages.
    What would be the best way to prevent this in the future? Is there some way
    the broker could notice it's performing poorly and shut itself off, for example?
