Re: Slow communications between components

2020-11-12 Thread Renan DelValle

Ben,

Thanks for the reply. Answers inline.

On 11/8/20 9:59 PM, Benjamin Mahler wrote:

> Which version?

1.5.3

> I'm not sure what you're observing, but slower responses are usually due
> to backlogging from expensive requests (like /state). However, we made
> several changes that have made it much less of a potential problem
> (see the blog posts).


Gotcha. It sounds like I should push for the cluster to be upgraded to 
at least 1.7.x.


> How much CPU is the master consuming? What kind of latency are you
> seeing when you make a request to /health? What does "connections
> slowing down" mean?


In the graph I saw, CPU usage didn't look like it was spiking, but it could
be that the box the master runs on is so big that the spikes were drowned
out in the graph. When I watched top, the master was running in the low
100s of percent CPU and spiked up to 200% a few times.


By slowing down, I meant that all components started to see lag on
round-trip requests made to Mesos.


Aurora would end up hanging while electing a leader when Mesos took too
long to reply (realistically, Aurora should time out here; this may be
an Aurora bug).


Our executor would time out waiting for ACKs from Mesos.

The UI became unbearably slow, taking on the order of minutes to load.


I also noticed that ZK was taking a long time to answer Aurora queries 
but this may be related to a separate issue.


A particularly weird issue we noticed was that offers were coming back 
to Aurora without being combined. It seemed like whatever was freed 
after an executor had exited was what was being offered, which slowed 
down scheduling on dedicated boxes to a crawl.


> Assuming it's a cpu load problem, you can grab and share a flame graph
> per the performance docs on the website, so we can see where the
> master is spending time.


Tried my best to get this but it looks like our cloud provider doesn't 
support this since our VM doesn't have access to the hardware counters 
perf needs. Any recommendation for an alternative off the top of your head?


After using the firewall arg to block off /state and /metrics/snapshot,
we haven't run into the same issues so far, so I guess that's indicative
of something. Maybe it was too many automated calls to both of those
endpoints (including users loading the UI and leaving it open), coupled
with the fact that we haven't picked up the serialization improvements
made in 1.7.x.
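
For anyone wanting to try the same workaround, here is a rough sketch of building the rules document. It assumes the "firewall arg" is the master's --firewall_rules flag and that it accepts a JSON object with a "disabled_endpoints" list of paths; both should be checked against the configuration docs for your Mesos version. The helper name build_firewall_rules is made up for this example, and the exact path strings the master matches (for example /state versus /master/state) are also an assumption.

```python
import json


def build_firewall_rules(paths):
    """Return a rules document that disables the given endpoint paths.

    The "disabled_endpoints" / "paths" shape is assumed from the Mesos
    configuration docs; verify it for the version you run.
    """
    unique = []
    for path in paths:
        if not path.startswith("/"):
            raise ValueError(f"endpoint path must start with '/': {path!r}")
        if path not in unique:  # drop duplicates, keep first-seen order
            unique.append(path)
    return {"disabled_endpoints": {"paths": unique}}


# The two endpoints blocked in this thread.
rules = build_firewall_rules(["/state", "/metrics/snapshot"])
print(json.dumps(rules, indent=2))
```

The printed JSON could be saved to a file and handed to the master through the flag. Note that blocking /state also affects anything that legitimately depends on it, including the Web UI.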


Thanks again for your time Ben!

-Renan






Re: Slow communications between components

2020-11-08 Thread Benjamin Mahler

Which version?

I'm not sure what you're observing, but slower responses are usually due to
backlogging from expensive requests (like /state). However, we made several
changes that have made it much less of a potential problem (see the blog
posts).

How much CPU is the master consuming? What kind of latency are you seeing
when you make a request to /health? What does "connections slowing down"
mean?
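
One possible way to put numbers on the /health latency question is sketched below. It assumes the master is reachable at http://localhost:5050/health (the host and port are assumptions, adjust for your cluster), and the helper names are illustrative rather than anything Mesos ships.

```python
import time
import urllib.request
from statistics import median


def time_call(fn, clock=time.perf_counter):
    """Return how many seconds a single call to fn() takes."""
    start = clock()
    fn()
    return clock() - start


def sample_latencies(fn, samples=10, pause=0.0,
                     clock=time.perf_counter, sleep=time.sleep):
    """Call fn() `samples` times and collect the per-call latency."""
    results = []
    for _ in range(samples):
        results.append(time_call(fn, clock))
        if pause:
            sleep(pause)  # space the probes out so they add no real load
    return results


def summarize(latencies):
    """Reduce a non-empty list of latencies to min, median, and max."""
    ordered = sorted(latencies)
    return {"min": ordered[0], "median": median(ordered), "max": ordered[-1]}


def fetch_health(url="http://localhost:5050/health", timeout=5.0):
    """GET the health endpoint; the default URL is an assumption."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()
        return resp.status


# Example, run against a live master:
# print(summarize(sample_latencies(fetch_health, samples=10, pause=1.0)))
```

Since /health should be cheap to serve, a median that climbs into seconds would point at a backlogged master rather than at the request itself.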

Assuming it's a cpu load problem, you can grab and share a flame graph per
the performance docs on the website, so we can see where the master is
spending time.



Slow communications between components

2020-11-07 Thread Renan DelValle

Hi all,

We've been noticing connections slowing down between our elected master
and other components in the cluster, like the agents, frameworks,
executor, etc.


From a high-level view, it looks like the master is too busy doing
other tasks to reply to messages, and we've seen ACKs from our executor
get delayed to the point where a new request has been sent by the retry
mechanism.


My initial suspicion is that we have some metric collectors that are 
hitting expensive endpoints (/metrics/snapshot, /master/state) too 
frequently and causing the master process to get bogged down.
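
To illustrate the kind of mitigation I have in mind on the collector side, here is a minimal sketch of a time-based cache, so that several collectors share one fetch per interval instead of each hitting the master. The class name CachedFetcher and the 30 second interval are made up for this example, and the sketch is not thread-safe.

```python
import time


class CachedFetcher:
    """Wrap an expensive fetch() and reuse its result for ttl_seconds."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._expires_at = None  # None means nothing has been fetched yet

    def get(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            # Cache is empty or stale: pay for one real request.
            self._value = self._fetch()
            self._expires_at = now + self._ttl
        return self._value


# Example wiring (fetch_snapshot is a hypothetical function that would
# GET /metrics/snapshot from the master):
# snapshot = CachedFetcher(fetch_snapshot, ttl_seconds=30)
# data = snapshot.get()
```

With something like this in front of the collectors, the master would see at most one snapshot request per interval no matter how many consumers ask.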


I was wondering if anyone had any experience with this and could confirm 
whether I'm on the right track with this.


If this hunch is right, it would also be great if anyone could chime in
with a rough estimate of the number of tasks and agents at which we should
avoid hitting the Web UI directly, since that generates a call to
/metrics/snapshot at an interval.


Thanks!

-Renan