This means either that the brokers are not healthy (bad hardware) or that
the replication fetchers can't keep up with the rate of incoming messages.

If it's the latter, you need to figure out where the latency bottleneck is
and what your latency SLAs are.

Common sources of latency bottlenecks:
 - slow network round-trips: increase network speed, increase the bytes
fetched per trip, increase the number of simultaneous fetchers, or increase
the timeout so the broker has time to fill all the bytes in the fetch
 - slow broker disk I/O: increase disk speed, or give the Linux page cache
more memory
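For reference, these are the broker configs that map to the knobs above (the
defaults in parentheses are from memory, so double-check them against the
docs for your Kafka version before touching anything):

```properties
# server.properties
num.replica.fetchers=4                      # simultaneous fetcher threads per source broker (default 1)
replica.fetch.max.bytes=1048576             # bytes per partition per fetch (default 1 MB)
replica.fetch.wait.max.ms=500               # how long the leader may wait to fill a fetch (default 500)
replica.socket.receive.buffer.bytes=65536   # raise this on high-latency links (default 64 KB)
```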

There are JMX metrics that help disambiguate whether the problem is disk vs
network... unfortunately the Datadog check is missing many of these,
something I've had on my todo list to patch since we also use Datadog at my
day job.
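To make that concrete: the `kafka.network:type=RequestMetrics` beans break a
FetchFollower request's total time into components, and comparing them points
at the bottleneck. The metric names below are real Kafka JMX metrics, but the
sample values, the 50% threshold, and the `classify()` helper are my own
invention for illustration:

```python
# Hypothetical sample of the per-component mean times (ms) for
# kafka.network:type=RequestMetrics,request=FetchFollower
FETCH_FOLLOWER_TIMES = {
    "RequestQueueTimeMs": 2.0,
    "LocalTimeMs": 180.0,         # reading the local log -> disk-bound
    "RemoteTimeMs": 5.0,          # waiting in fetch purgatory
    "ResponseSendTimeMs": 140.0,  # writing to the socket -> network-bound
}

def classify(times, disk_key="LocalTimeMs", net_key="ResponseSendTimeMs"):
    """Crude heuristic: whichever component dominates the total request
    time is the likely bottleneck."""
    total = sum(times.values())
    if times[disk_key] / total > 0.5:
        return "disk"
    if times[net_key] / total > 0.5:
        return "network"
    return "unclear"

print(classify(FETCH_FOLLOWER_TIMES))  # "disk" for this sample
```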

One other possible problem is when a single fetch request covers a lot of
low-volume partitions along with a couple of high-volume ones... the broker
can take a long time assembling the response because it has to look at each
partition in turn, and each one might add only 1 KB, so it takes a long time
to reach the 1 MB per-partition fetch size and it hits the timeout first. It
then sends a small response, even though you've got a handful of partitions
that are really hot and will soon be marked as out of sync.
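A toy model of that interaction (this is not broker code, just a sketch of
the arithmetic; the partition counts, per-partition scan time, and limits are
all made up):

```python
REPLICA_FETCH_MAX_BYTES = 1_048_576  # per-partition cap (1 MB)
REPLICA_FETCH_WAIT_MAX_MS = 500      # fetch timeout

def fill_fetch_response(partitions, per_partition_scan_ms, response_max_bytes):
    """Walk partitions in order, adding whatever bytes each has available.
    Returns (bytes_sent, elapsed_ms, timed_out)."""
    total, elapsed = 0, 0.0
    for available in partitions:
        if elapsed >= REPLICA_FETCH_WAIT_MAX_MS:
            return total, elapsed, True   # timeout wins: small response
        take = min(available, REPLICA_FETCH_MAX_BYTES, response_max_bytes - total)
        total += take
        elapsed += per_partition_scan_ms
        if total >= response_max_bytes:
            return total, elapsed, False  # response full
    return total, elapsed, False

# 1000 quiet partitions with ~1 KB buffered each, then 2 hot ones:
parts = [1_024] * 1000 + [5_000_000, 5_000_000]
sent, ms, timed_out = fill_fetch_response(parts, per_partition_scan_ms=1.0,
                                          response_max_bytes=10_485_760)
# Only ~500 KB of quiet-partition data goes out; the hot partitions at the
# end of the list never get reached before the timeout fires.
```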

I know this doesn't provide full details, but hopefully it's enough to get
you pointed in the right direction...


On Fri, Feb 2, 2018 at 11:27 AM, Richard Rodseth <> wrote:

> We have a DataDog integration showing some metrics, and for one of our
> clusters the above two
> values are > 0 and highlighted in red.
> What's the usual remedy (Confluent Platform, OSS version) ?
> Thanks


*Jeff Widman* <> | 740-WIDMAN-J (943-6265)
