This means either that the brokers are unhealthy (bad hardware) or that the
replication fetchers can't keep up with the rate of incoming messages.
If it's the latter, you need to figure out where the latency bottleneck is
and what your latency SLAs are.
Common sources of latency bottlenecks:
- Slow network round trips: increase network speed, increase the bytes
fetched per trip, increase the number of simultaneous fetchers, or increase
the timeout so that the broker has time to fill all the bytes in the fetch.
- Slow broker disk I/O: increase disk speed, or increase the Linux page
cache.
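
To make the network tradeoff concrete, here is a rough back-of-envelope sketch in Python. It assumes a deliberately simplified model in which each fetcher completes one fetch per round trip, so it ignores disk I/O, batching across partitions, and anything else the broker does. The function names and numbers are made up for illustration and are not part of Kafka. The broker settings I have in mind for the three knobs are replica.fetch.max.bytes, num.replica.fetchers, and replica.fetch.wait.max.ms, but check them against the docs for your version.

```python
MIB = 1024 * 1024


def max_replication_throughput(bytes_per_fetch, round_trip_ms, num_fetchers=1):
    """Upper bound on bytes/sec a follower can pull in this toy model.

    Assumes each fetcher completes exactly one fetch per round trip.
    """
    if round_trip_ms <= 0:
        raise ValueError("round_trip_ms must be positive")
    return num_fetchers * bytes_per_fetch * 1000 / round_trip_ms


def keeps_up(incoming_bytes_per_s, bytes_per_fetch, round_trip_ms, num_fetchers=1):
    """True if the modeled replication ceiling covers the incoming rate."""
    ceiling = max_replication_throughput(bytes_per_fetch, round_trip_ms, num_fetchers)
    return ceiling >= incoming_bytes_per_s


# Hypothetical numbers: 1 MiB per fetch, 50 ms round trip, 30 MiB/s of produce traffic.
incoming = 30 * MIB
print(keeps_up(incoming, 1 * MIB, 50))                  # one fetcher tops out at 20 MiB/s
print(keeps_up(incoming, 1 * MIB, 50, num_fetchers=2))  # two fetchers: 40 MiB/s ceiling
print(keeps_up(incoming, 2 * MIB, 50))                  # bigger fetches: 40 MiB/s ceiling
```

In this model, doubling the bytes per trip and doubling the fetcher count raise the ceiling by the same amount, which is why either can help when round trips are the bottleneck. It says nothing about the disk case.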
There are JMX metrics that help disambiguate whether the problem is disk or
network. Unfortunately, the Datadog check is missing many of these. Patching
it has been on my todo list for a while, as we also use Datadog at my day
job.
One other possible problem is a combination of a lot of low-volume
partitions being replicated in each call along with a couple of high-volume
partitions. The broker can then take a long time assembling the response
because it has to look at each partition, and each one might add only 1 KB,
so it takes a long time to reach the 1 MB byte limit and it hits the timeout
first. Then it sends a small response, even though you've got a handful of
partitions that are really hot and will soon be marked as out of sync.
I know this doesn't provide full details, but hopefully it's enough to get
you pointed in the right direction...
On Fri, Feb 2, 2018 at 11:27 AM, Richard Rodseth <rrods...@gmail.com> wrote:
> We have a DataDog integration showing some metrics, and for one of our
> clusters the above two
> values are > 0 and highlighted in red.
> What's the usual remedy (Confluent Platform, OSS version) ?
jeffwidman.com <http://www.jeffwidman.com/> | 740-WIDMAN-J (943-6265)