Ben,

Thanks for the reply. Answers inline.

On 11/8/20 9:59 PM, Benjamin Mahler wrote:
> Which version?
1.5.3
> I'm not sure what you're observing but slower responses is usually due to backlogging from expensive requests (like /state), however we made several changes that have made it much less of a potential problem (see the blog posts).

Gotcha. It sounds like I should push for the cluster to be upgraded to at least 1.7.x.

> How much CPU is the master consuming? What kind of latency are you seeing when you make a request to /health? What does "connections slowing down" mean?

In the graph I saw, it didn't look like the master was spiking, but it could be that the box it runs on is so big that the spikes were drowned out in the graph. When I was watching top, the process was sitting in the low 100s in CPU percentage and spiking up to around 200% a few times.

By slowing down I meant that all components started to experience lag in round-trip requests made to Mesos.

Aurora would end up hanging while electing a leader when Mesos took too long to reply (realistically, Aurora should time out here; this may be an Aurora bug).

Our executor would time out waiting for ACKs from Mesos.

The UI became unbearably slow, taking on the order of minutes to load.

I also noticed that ZK was taking a long time to answer Aurora queries, but this may be related to a separate issue.

A particularly weird issue we noticed was that offers were coming back to Aurora without being combined. It seemed like only the resources freed when an executor exited were being offered, which slowed scheduling on dedicated boxes to a crawl.
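
On the /health latency question: I don't have hard numbers yet, but a quick way for us to capture them would be something like the following (assuming the default master port of 5050; the host below is a placeholder):

    # Time a round trip to the leading master's /health endpoint
    curl -o /dev/null -s -w 'HTTP %{http_code} in %{time_total}s\n' http://<leading-master>:5050/health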

> Assuming it's a cpu load problem, you can grab and share a flame graph per the performance docs on the website, so we can see where the master is spending time.

I tried my best to get this, but it looks like our cloud provider doesn't support it: our VM doesn't have access to the hardware counters perf needs. Any recommendation for an alternative off the top of your head?
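
For reference, what I was attempting follows the perf + FlameGraph recipe from the performance docs, roughly like this (the 60-second sampling window is just what I picked, and the script paths assume Brendan Gregg's FlameGraph repo is checked out locally):

    # Sample the leading master's on-CPU stacks at 99 Hz for 60 seconds
    perf record -F 99 -g -p $(pgrep -f mesos-master) -- sleep 60
    # Fold the stacks and render the flame graph
    perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > mesos-master.svg

The perf record step is where it falls over for us, since the default cycles event needs the hardware counters.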

After using the firewall flag to block off /state and /metrics/snapshot, we haven't run into the same issues for the time being, so I guess that's indicative of something. Maybe it's a combination of too many automated calls to both of those endpoints (including users loading the UI and leaving it open) and the fact that we haven't picked up the serialization improvements made in 1.7.x.
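
In case anyone searches for this later, the flag in question is --firewall_rules, and a rough sketch of what we added to our existing mesos-master invocation looks like this (the paths shown are just the two endpoints from my original mail; everything else is unchanged):

    # Added to our existing mesos-master flags:
    --firewall_rules='{"disabled_endpoints": {"paths": ["/master/state", "/metrics/snapshot"]}}'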

Thanks again for your time, Ben!

-Renan


On Sat, Nov 7, 2020 at 10:17 PM Renan DelValle <re...@apache.org> wrote:

    Hi all,

    We've been noticing connections slowing down between our elected
    master and other components in the cluster, like the agents,
    frameworks, executors, etc.

    From a high-level view, it looks like the master is too busy doing
    other tasks to reply to messages, and we've seen ACKs from our
    executor get delayed to the point where a new request has been sent
    by the retry mechanism.

    My initial suspicion is that we have some metric collectors that are
    hitting expensive endpoints (/metrics/snapshot, /master/state) too
    frequently and causing the master process to get bogged down.

    I was wondering if anyone had any experience with this and could
    confirm whether I'm on the right track.

    If this hunch is right, it would also be great if anyone could chime
    in with a rough estimate of the task and agent counts at which we
    should avoid hitting the Web UI directly, since it generates a call
    to /metrics/snapshot at an interval.

    Thanks!

    -Renan
