Which version are you running? I'm not sure what you're observing, but slow responses are usually due to requests backlogging behind expensive ones (like /state). That said, we've made several changes that make this much less of a potential problem (see the blog posts).
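If it helps to put numbers on the backlog, one rough way is to time a cheap endpoint such as /health over several requests and watch for spikes. The sketch below is only an illustration: the master address is an assumption (localhost on the default port 5050), and the helper names are made up for this example.

```python
import statistics
import time
import urllib.request

# Assumption: the master is reachable here; 5050 is the default master port.
MASTER_URL = "http://localhost:5050"


def fetch_health(base_url=MASTER_URL, timeout=5.0):
    """GET /health on the master and return the HTTP status code."""
    with urllib.request.urlopen(base_url + "/health", timeout=timeout) as resp:
        return resp.status


def measure_latency(fetch, samples=10, pause=0.0):
    """Call fetch() `samples` times and return each call's duration in seconds."""
    latencies = []
    for _ in range(samples):
        start = time.perf_counter()
        fetch()
        latencies.append(time.perf_counter() - start)
        if pause:
            time.sleep(pause)
    return latencies


def summarize(latencies):
    """Reduce a list of durations to min / median / max."""
    return {
        "min": min(latencies),
        "median": statistics.median(latencies),
        "max": max(latencies),
    }


if __name__ == "__main__":
    # One request per second for ten seconds against the assumed master URL.
    stats = summarize(measure_latency(fetch_health, samples=10, pause=1.0))
    for name, seconds in stats.items():
        print(f"{name}: {seconds * 1000:.1f} ms")
```

If the median stays low but the max occasionally jumps, that would be consistent with /health requests queuing behind something expensive, though it does not prove it.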
How much CPU is the master consuming? What kind of latency are you seeing when you make a request to /health? What does "connections slowing down" mean?

Assuming it's a CPU load problem, you can grab and share a flame graph per the performance docs on the website, so we can see where the master is spending time.

On Sat, Nov 7, 2020 at 10:17 PM Renan DelValle <re...@apache.org> wrote:

> Hi all,
>
> We've been noticing connections slowing down between our elected master
> and other components in the cluster, like the agents, frameworks,
> executor, etc.
>
> From a high level view, it looks like the master is too busy doing
> other tasks to reply to messages and we've seen ACKs from our executor
> get delayed to the point where a new request has been sent by the retry
> mechanism.
>
> My initial suspicion is that we have some metric collectors that are
> hitting expensive endpoints (/metrics/snapshot, /master/state) too
> frequently and causing the master process to get bogged down.
>
> I was wondering if anyone had any experience with this and could confirm
> whether I'm on the right track with this.
>
> If this hunch is right, it would also be great if anyone could chime in
> with a rough estimate of tasks and agents at which we should avoid
> hitting the Web UI directly since that generates a call to
> /metrics/snapshot at an interval.
>
> Thanks!
>
> -Renan