On Fri, Feb 2, 2018 at 11:58 AM, Jeff Widman <j...@jeffwidman.com> wrote:
> This means either the brokers are not healthy (bad hardware) or the
> replication fetchers can't keep up with the rate of incoming messages.
> If the latter, you need to figure out where the latency bottleneck is and
> what your latency SLAs are.
> Common sources of latency bottlenecks:
> - slow network round trips: increase network speed, increase the bytes
> fetched per trip, increase the number of simultaneous fetcher threads, or
> increase the timeout so the broker has time to fill all the bytes in the
> fetch (see the example settings below)
> - slow broker disk I/O: use faster disks, or leave more memory free for
> the Linux page cache
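> For example, the relevant knobs in the broker config (the values below
> are purely illustrative starting points, not recommendations; tune them
> against your own numbers):
>
>     # server.properties
>     # parallel fetcher threads per source broker (default 1)
>     num.replica.fetchers=4
>     # max bytes fetched per partition per request (default 1 MB)
>     replica.fetch.max.bytes=1048576
>     # how long a replication fetch may wait to fill before the leader
>     # responds (default 500 ms; keep it below replica.lag.time.max.ms)
>     replica.fetch.wait.max.ms=500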
> There are JMX metrics that help disambiguate whether the problem is disk vs
> network... unfortunately the Datadog check is lacking many of these,
> something that I've had on my todo list to patch as we also use Datadog at
> my day job.
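> For reference, the per-request broker MBeans I'd start with (these are
> standard Kafka metrics; whether your monitoring collects them is a
> separate question):
>
>     kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchFollower
>     kafka.network:type=RequestMetrics,name=LocalTimeMs,request=FetchFollower
>     kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request=FetchFollower
>
> Roughly speaking, high LocalTimeMs means the leader is slow reading its
> own log (disk-bound), while high ResponseSendTimeMs points at the
> network.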
> One other possible problem is when each replication fetch covers a lot
> of low-volume partitions along with a couple of high-volume ones. The
> broker can take a long time assembling the response because it has to
> look at each partition in turn, and each one might add only 1 KB, so it
> rarely reaches the 1 MB fetch size limit and hits the timeout first.
> Then it sends a small response, even though you've got a handful of
> partitions that are really hot and will soon be marked as out of sync.
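> To put made-up numbers on it: 500 quiet partitions adding ~1 KB each
> only comes to ~500 KB, well short of a 1 MB fetch limit, so the broker
> waits out the full fetch timeout on every round trip, and the couple of
> hot partitions fall further behind each time.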
> I know this doesn't provide full details, but hopefully it's enough to get
> you pointed in the right direction...
> On Fri, Feb 2, 2018 at 11:27 AM, Richard Rodseth <rrods...@gmail.com>
> wrote:
> > We have a Datadog integration showing some metrics, and for one of our
> > clusters the above two
> > values are > 0 and highlighted in red.
> > What's the usual remedy (Confluent Platform, OSS version)?
> > Thanks
> *Jeff Widman*
> jeffwidman.com <http://www.jeffwidman.com/> | 740-WIDMAN-J (943-6265)