Hi all,
We've been noticing connections slowing down between our elected master
and other components in the cluster, such as the agents, frameworks,
executors, etc.
From a high-level view, it looks like the master is too busy doing
other tasks to reply to messages, and we've seen ACKs from our executor
get delayed to the point where the retry mechanism has already sent a
new request.
My initial suspicion is that we have some metric collectors that are
hitting expensive endpoints (/metrics/snapshot, /master/state) too
frequently and causing the master process to get bogged down.
I was wondering if anyone has experience with this and could confirm
whether I'm on the right track.
If this hunch is right, it would also be great if anyone could chime
in with a rough estimate of the number of tasks and agents beyond which
we should avoid hitting the Web UI directly, since it generates a call
to /metrics/snapshot at a regular interval.
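
For anyone who wants to compare numbers, one rough way to check the hunch
is to time the two endpoints directly against the leading master. This is
only a minimal sketch: the master address, sample count, and pause are
placeholder assumptions, not values from our cluster, and it measures
response time only, not what the master is doing internally.

```python
import time
import urllib.request

# Hypothetical master address (assumption); replace with the real leading master.
MASTER = "http://master.example.com:5050"
ENDPOINTS = ["/metrics/snapshot", "/master/state"]


def fetch(url, timeout=30):
    """Download the full response body and return its size in bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return len(resp.read())


def time_endpoint(url, samples=5, pause=1.0, fetcher=fetch,
                  clock=time.monotonic, sleep=time.sleep):
    """Request url `samples` times and return the elapsed seconds per request."""
    timings = []
    for i in range(samples):
        start = clock()
        fetcher(url)
        timings.append(clock() - start)
        # Pause between requests so the probe itself does not add much load.
        if i < samples - 1:
            sleep(pause)
    return timings


if __name__ == "__main__":
    for path in ENDPOINTS:
        t = time_endpoint(MASTER + path)
        print(f"{path}: min={min(t):.3f}s max={max(t):.3f}s")
```

If the timings grow noticeably while the collectors are running and drop
when they are paused, that would point toward the endpoints being the
bottleneck.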
Thanks!
-Renan
- Slow communications between components Renan DelValle