Which version are you running?

I'm not sure what you're observing, but slower responses are usually due to
backlogging from expensive requests (like /state). However, we have made
several changes that make this much less of a potential problem (see the
blog posts).

How much CPU is the master consuming? What kind of latency are you seeing
when you make a request to /health? What does "connections slowing down"
mean?
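
If it helps to put numbers on the latency question, here is a minimal
sketch that times repeated requests to an endpoint and reports the min,
median and max. The host name, port 5050, and the helper names
(time_request, summarize, probe) are assumptions for illustration only;
adjust them for your cluster.

```python
import statistics
import time
import urllib.request

# Hypothetical address (assumption): substitute your leading master's
# host and port.
MASTER_URL = "http://mesos-master.example.com:5050"


def time_request(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return the seconds taken to fetch url and read the whole body."""
    start = time.perf_counter()
    with opener(url, timeout=timeout) as response:
        response.read()
    return time.perf_counter() - start


def summarize(latencies):
    """Reduce a non-empty list of latencies to min, median and max."""
    ordered = sorted(latencies)
    return {
        "min": ordered[0],
        "median": statistics.median(ordered),
        "max": ordered[-1],
    }


def probe(base_url, path="/health", samples=20, pause=0.5):
    """Time `samples` sequential requests to base_url + path."""
    latencies = []
    for _ in range(samples):
        latencies.append(time_request(base_url + path))
        time.sleep(pause)  # avoid adding load of our own
    return summarize(latencies)


# Example usage (needs network access to the master):
#   print(probe(MASTER_URL))
#   print(probe(MASTER_URL, path="/state", samples=5))
```

Running it against /health while the collectors are active, and again with
them paused, should show whether the cheap endpoint is also being delayed.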

Assuming it's a CPU load problem, you can grab and share a flame graph per
the performance docs on the website, so we can see where the master is
spending time.

On Sat, Nov 7, 2020 at 10:17 PM Renan DelValle <re...@apache.org> wrote:

> Hi all,
>
> We've been noticing connections slowing down between our elected master
> and other components in the cluster, like the agents, frameworks,
> executor, etc.
>
> From a high-level view, it looks like the master is too busy doing
> other tasks to reply to messages and we've seen ACKs from our executor
> get delayed to the point where a new request has been sent by the retry
> mechanism.
>
> My initial suspicion is that we have some metric collectors that are
> hitting expensive endpoints (/metrics/snapshot, /master/state) too
> frequently and causing the master process to get bogged down.
>
> I was wondering if anyone had any experience with this and could confirm
> whether I'm on the right track with this.
>
> If this hunch is right, it would also be great if anyone could chime in
> with a rough estimate of tasks and agents at which we should avoid
> hitting the Web UI directly since that generates a call to
> /metrics/snapshot at an interval.
>
> Thanks!
>
> -Renan
>
>
