Thanks for the reply. Answers inline.
On 11/8/20 9:59 PM, Benjamin Mahler wrote:
I'm not sure what you're observing, but slower responses are usually due
to backlogging from expensive requests (like /state); however, we made
several changes that make it much less of a potential problem
(see the blog posts).
Gotcha. It sounds like I should push for the cluster to be upgraded to
at least 1.7.x.
How much CPU is the master consuming? What kind of latency are you
seeing when you make a request to /health? What does "connections
slowing down" mean?
In the graph I saw, it didn't look like CPU was spiking, but the box the
master runs on may be big enough that spikes were drowned out at that
scale. Watching top, the process sat in the low 100s of CPU percent and
spiked up to 200% a few times.
By "slowing down" I meant that all components started to experience lag
in round-trip requests made to Mesos:
Aurora would end up hanging while electing a leader when Mesos took too
long to reply (realistically, Aurora should time out here; this may be
an Aurora bug).
Our executor would time out waiting for ACKs from Mesos.
The UI became unbearably slow, taking on the order of minutes to load.
I also noticed that ZK was taking a long time to answer Aurora queries
but this may be related to a separate issue.
A particularly weird issue we noticed was that offers were coming back
to Aurora without being combined: it seemed like only the resources
freed when an executor exited were being offered, which slowed
scheduling on dedicated boxes to a crawl.
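To put a number on the round-trip lag (and on your /health question), a small probe like the following can help. It's only a sketch using the standard library; the local stub server here stands in for the master's /health endpoint so the snippet runs anywhere, and on a real cluster `url` would point at http://<master>:5050/health instead.

```python
import http.server
import threading
import time
import urllib.request

def measure_latency(url, samples=5):
    """Time simple GETs against url; returns a list of per-request seconds."""
    timings = []
    for _ in range(samples):
        start = time.monotonic()
        with urllib.request.urlopen(url, timeout=10) as resp:
            resp.read()
        timings.append(time.monotonic() - start)
    return timings

# Local stub that answers like a healthy master's /health (200, empty body).
class _Health(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()

    def log_message(self, fmt, *args):  # silence per-request logging
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), _Health)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/health" % server.server_address[1]
timings = measure_latency(url)
print("max /health latency: %.4fs" % max(timings))
server.shutdown()
```

A backlogged master typically shows up as /health latencies climbing from milliseconds into seconds, since even cheap endpoints queue behind expensive ones.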
Assuming it's a cpu load problem, you can grab and share a flame graph
per the performance docs on the website, so we can see where the
master is spending time.
I tried my best to get this, but it looks like our cloud provider doesn't
support it: our VM doesn't have access to the hardware counters perf
needs. Any recommendation for an alternative off the top of your head?
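One possible workaround (a sketch, untested on this setup): perf's default "cycles" event needs the hardware PMU, but the kernel's software cpu-clock event does not, so sampling can still work inside VMs without hardware counter access. This assumes the master binary is named "mesos-master".

```shell
PID=$(pgrep -x mesos-master || echo unknown)
# Sample at 99 Hz with call graphs using the software cpu-clock event.
PROFILE_CMD="perf record -F 99 -e cpu-clock -g -p $PID -- sleep 60"
if [ "$PID" != unknown ] && command -v perf >/dev/null 2>&1; then
  $PROFILE_CMD
  perf script > master.stacks  # then collapse and render with flamegraph.pl
else
  echo "run this on the master host: $PROFILE_CMD"
fi
```

The resulting stacks feed into the same flame graph tooling the performance docs describe; only the sampled event changes.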
After using the firewall arg to block off /state and /metrics/snapshot,
we haven't run into the same issues for the time being, so I guess that's
indicative of something. My suspicion is too many automated calls to both
of those endpoints (including users loading the UI and leaving it open),
coupled with the fact that we haven't picked up the serialization
improvements made in 1.7.x.
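For reference, the firewall setup looks roughly like this (a sketch; the exact endpoint paths depend on your Mesos version and how the endpoints are routed, e.g. /state vs. /master/state). The file is passed to the master via --firewall_rules=file:///path/to/rules.json:

```json
{
  "disabled_endpoints": {
    "paths": ["/master/state", "/metrics/snapshot"]
  }
}
```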
Thanks again for your time Ben!
On Sat, Nov 7, 2020 at 10:17 PM Renan DelValle <re...@apache.org
We've been noticing connections slowing down between our elected master
and other components in the cluster, like the agents and frameworks.
From a high-level view, it looks like the master is too busy doing
other tasks to reply to messages, and we've seen ACKs from our master
get delayed to the point where a new request has been sent by the time
they arrive.
My initial suspicion is that we have some metric collectors that are
hitting expensive endpoints (/metrics/snapshot, /master/state) too
frequently and causing the master process to get bogged down.
I was wondering if anyone had any experience with this and could comment
on whether I'm on the right track.
If this hunch is right, it would also be great if anyone could chime in
with a rough estimate of the number of tasks and agents at which we
should avoid hitting the Web UI directly, since it generates a call to
/metrics/snapshot at an interval.