Hi all,

We've been noticing connections slowing down between our elected master and other components in the cluster, such as the agents, frameworks, executors, etc.

From a high-level view, it looks like the master is too busy doing other tasks to reply to messages, and we've seen ACKs from our executor get delayed to the point where the retry mechanism has already sent a new request.

My initial suspicion is that we have some metric collectors that are hitting expensive endpoints (/metrics/snapshot, /master/state) too frequently and causing the master process to get bogged down.
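
One way to test this hunch is to time the two endpoints directly and compare the numbers with the collectors running and paused. Below is a minimal sketch in Python, assuming the master is reachable at localhost:5050 (adjust for your setup); the helper names (time_call, http_get, probe) are made up for illustration and are not part of Mesos.

```python
import time
import urllib.request

# Assumed master address; replace with the elected master's host and port.
MASTER = "http://localhost:5050"
# The two endpoints suspected of being expensive.
ENDPOINTS = ["/metrics/snapshot", "/master/state"]


def time_call(fetch, url):
    """Return (seconds elapsed, bytes received) for one call to fetch(url)."""
    start = time.perf_counter()
    body = fetch(url)
    return time.perf_counter() - start, len(body)


def http_get(url, timeout=30):
    """Fetch url over HTTP and return the raw response body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def probe(master=MASTER, endpoints=ENDPOINTS, fetch=http_get):
    """Time each endpoint once and return {path: (seconds, bytes)}."""
    return {path: time_call(fetch, master + path) for path in endpoints}
```

Calling probe() a few times and printing the results gives a rough latency and response size per endpoint. If the timings drop noticeably while the collectors are paused, that would support the theory; it would not prove it, since the probe itself adds load to the master.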

I was wondering if anyone has experience with this and could confirm whether I'm on the right track.

If this hunch is right, it would also be great if anyone could chime in with a rough estimate of the number of tasks and agents at which we should avoid hitting the Web UI directly, since it generates a call to /metrics/snapshot at a regular interval.

Thanks!

-Renan
