Thanks for the reply. Answers inline.
On 11/8/20 9:59 PM, Benjamin Mahler wrote:
I'm not sure what you're observing, but slower responses are usually due
to backlogging from expensive requests (like /state); however, we made
several changes that make it much less of a potential problem
(see the blog posts).
Gotcha. It sounds like I should push for the cluster to be upgraded to
at least 1.7.x.
How much CPU is the master consuming? What kind of latency are you
seeing when you make a request to /health? What does "connections
slowing down" mean?
In the graph I saw, it didn't look like CPU was spiking, but the box the
master runs on may be big enough that spikes were drowned out at that
scale. Watching top, the process sat in the low 100s of CPU percent and
spiked up to 200% a few times.
By "slowing down" I meant that all components started to experience lag
in round-trip requests made to Mesos:
Aurora would end up hanging while electing a leader when Mesos took too
long to reply (realistically, Aurora should time out here; this may be
an Aurora bug).
Our executor would time out waiting for ACKs from Mesos.
The UI became unbearably slow, taking on the order of minutes to load.
I also noticed that ZK was taking a long time to answer Aurora queries
but this may be related to a separate issue.
A particularly weird issue we noticed was that offers were coming back
to Aurora without being combined: it seemed like only the resources
freed when an executor exited were being offered, which slowed
scheduling on dedicated boxes to a crawl.
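To put a number on the round-trip lag (and on your /health question), a small probe like the following can help. It's only a sketch using the standard library; the local stub server here stands in for the master's /health endpoint so the snippet runs anywhere, and on a real cluster `url` would point at http://<master>:5050/health instead.

```python
import http.server
import threading
import time
import urllib.request

def measure_latency(url, samples=5):
    """Time simple GETs against url; returns a list of per-request seconds."""
    timings = []
    for _ in range(samples):
        start = time.monotonic()
        with urllib.request.urlopen(url, timeout=10) as resp:
            resp.read()
        timings.append(time.monotonic() - start)
    return timings

# Local stub that answers like a healthy master's /health (200, empty body).
class _Health(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()

    def log_message(self, fmt, *args):  # silence per-request logging
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), _Health)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/health" % server.server_address[1]
timings = measure_latency(url)
print("max /health latency: %.4fs" % max(timings))
server.shutdown()
```

A backlogged master typically shows up as /health latencies climbing from milliseconds into seconds, since even cheap endpoints queue behind expensive ones.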
Assuming it's a cpu load problem, you can grab and share a flame graph
per the performance docs on the website, so we can see where the
master is spending time.
I tried my best to get this, but it looks like our cloud provider doesn't
support it: our VM doesn't have access to the hardware counters perf
needs. Any recommendation for an alternative off the top of your head?
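One possible workaround (a sketch, untested on this setup): perf's default "cycles" event needs the hardware PMU, but the kernel's software cpu-clock event does not, so sampling can still work inside VMs without hardware counter access. This assumes the master binary is named "mesos-master".

```shell
PID=$(pgrep -x mesos-master || echo unknown)
# Sample at 99 Hz with call graphs using the software cpu-clock event.
PROFILE_CMD="perf record -F 99 -e cpu-clock -g -p $PID -- sleep 60"
if [ "$PID" != unknown ] && command -v perf >/dev/null 2>&1; then
  $PROFILE_CMD
  perf script > master.stacks  # then collapse and render with flamegraph.pl
else
  echo "run this on the master host: $PROFILE_CMD"
fi
```

The resulting stacks feed into the same flame graph tooling the performance docs describe; only the sampled event changes.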
After using the firewall arg to block off /state and /metrics/snapshot,
we haven't run into the same issues for the time being, so I guess that's
indicative of something. My suspicion is too many automated calls to both
of those endpoints (including users loading the UI and leaving it open),
coupled with the fact that we haven't picked up the serialization
improvements made in 1.7.x.
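For reference, the firewall setup looks roughly like this (a sketch; the exact endpoint paths depend on your Mesos version and how the endpoints are routed, e.g. /state vs. /master/state). The file is passed to the master via --firewall_rules=file:///path/to/rules.json:

```json
{
  "disabled_endpoints": {
    "paths": ["/master/state", "/metrics/snapshot"]
  }
}
```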
Thanks again for your time Ben!
On Sat, Nov 7, 2020 at 10:17 PM Renan DelValle <re...@apache.org
We've been noticing connections slowing down between our elected master
and other components in the cluster, like the agents and frameworks.
From a high-level view, it looks like the master is too busy doing
other tasks to reply to messages, and we've seen ACKs from our master
get delayed to the point where a new request has been sent by the time
they arrive.
My initial suspicion is that we have some metric collectors that are
hitting expensive endpoints (/metrics/snapshot, /master/state) too
frequently and causing the master process to get bogged down.
I was wondering if anyone had any experience with this and could comment
on whether I'm on the right track.
If this hunch is right, it would also be great if anyone could chime in
with a rough estimate of the number of tasks and agents at which we
should avoid hitting the Web UI directly, since it generates a call to
/metrics/snapshot at an interval.