RE: Mesos sometimes not allocating the entire cluster

Klaus Ma Sat, 20 Feb 2016 17:34:48 -0800

Hi Tom,

What's the allocation interval? Can you try reducing the framework's filter 
timeout?
According to the log, there are ~12 frameworks on a cluster with ~42 agents; the 
filter duration is 5 seconds, and there are ~60 filtered entries each second 
(e.g. 65 at 18:08:34). For example, framework 
(20160219-164457-67375276-5050-28802-0015) only got resources from 6 agents and 
filtered the other 36 agents at 18:08:35 
(egrep "Alloca|Filtered" mesos-master.log | grep 
"20160219-164457-67375276-5050-28802-0015" | grep "18:08:35").
Thanks,
Klaus
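
The per-second counts quoted above can also be reproduced with a short script 
rather than chained greps. This is only a sketch: it assumes the master log uses 
the standard glog line prefix (e.g. "I0220 18:08:34.123456 ...") and that the 
relevant lines contain the word "Filtered", as the egrep pattern above suggests. 
The function name count_per_second is made up for illustration.

```python
import re
import sys
from collections import Counter

# Assumption: standard glog prefix, e.g. "I0220 18:08:34.123456 ...".
GLOG_PREFIX = re.compile(r"^[IWEF]\d{4} (\d{2}:\d{2}:\d{2})\.\d+")


def count_per_second(lines, needle="Filtered", framework_id=None):
    """Count log lines containing `needle`, grouped by hh:mm:ss.

    If `framework_id` is given, only lines mentioning it are counted.
    Lines without a glog timestamp prefix are ignored.
    """
    counts = Counter()
    for line in lines:
        if needle not in line:
            continue
        if framework_id is not None and framework_id not in line:
            continue
        match = GLOG_PREFIX.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python count_filtered.py mesos-master.log
    with open(sys.argv[1]) as log:
        for second, n in sorted(count_per_second(log).items()):
            print(second, n)
```

Passing framework_id="20160219-164457-67375276-5050-28802-0015" narrows the 
count to the single framework discussed above.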
From: [email protected]
Subject: Re: Mesos sometimes not allocating the entire cluster
Date: Sat, 20 Feb 2016 16:36:54 +0000
To: [email protected]


Hi Guangya,
Indeed we have about ~45 agents. I’ve attached the log from the master…


Hope there’s something here that highlights the issue, we can’t find anything 
that we can’t explain.
Cheers,
Tom.


On 19 Feb 2016, at 03:02, Guangya Liu <[email protected]> wrote:
Hi Tom,
After the patch is applied, there is no need to restart the frameworks, only the 
Mesos master.
One question: from your log, it seems your cluster has at least 36 agents, 
right? I ask because if there are more frameworks than agents, frameworks with a 
low weight may sometimes not be able to get resources.
Can you please enable GLOG_v=2 for the Mesos master for a while and put the log 
somewhere for us to check? (Do not enable this for long, as the log will be 
flooded with messages.) These log messages may help with your problem.
Also, there is another issue that aims to fix a separate performance problem in 
the allocator. It may not help you much, but you can still take a look: 
https://issues.apache.org/jira/browse/MESOS-4694
Thanks,
Guangya
On Fri, Feb 19, 2016 at 2:19 AM, Tom Arnfeld <[email protected]> wrote:
Hi Ben,
We've rolled that patch out (applied over 0.23.1) on our production cluster and 
have seen little change, the master is still not sending any offers to those 
frameworks. We did this upgrade online, so would there be any reason the fix 
wouldn't have helped (other than it not being the cause)? Would we need to 
restart the frameworks (so they get new IDs) to see the effect?
It's not that the master is never sending them offers, it's that it does it up 
to a certain point... for different types of frameworks (all using libmesos) 
but then no more, regardless of how much free resource is available... the free 
resources are offered to some frameworks, but not all. Is there any way for us 
to do more introspection into the state of the master / allocator to try and 
debug? Right now we're at a bit of a loss of where to start diving in...
Much appreciated as always,
Tom.
On 18 February 2016 at 10:21, Tom Arnfeld <[email protected]> wrote:
Hi Ben,
I've only just seen your email! Really appreciate the reply, that's certainly 
an interesting bug and we'll try that patch and see how we get on.
Cheers,
Tom.
On 29 January 2016 at 19:54, Benjamin Mahler <[email protected]> wrote:
Hi Tom,
I suspect you may be tripping the following 
issue: https://issues.apache.org/jira/browse/MESOS-4302

Please have a read through this and see if it applies here. You may also be 
able to apply the fix to your cluster to see if that helps things.
Ben
On Wed, Jan 20, 2016 at 10:19 AM, Tom Arnfeld <[email protected]> wrote:
Hey,
I've noticed some interesting behaviour recently when we have lots of different 
frameworks connected to our Mesos cluster at once, all using a variety of 
different shares. Some of the frameworks don't get offered more resources (for 
long periods of time, hours even), leaving the cluster underutilised.
Here's an example state where we see this happen:
Framework 1 - 13% (user A)
Framework 2 - 22% (user B)
Framework 3 - 4% (user C)
Framework 4 - 0.5% (user C)
Framework 5 - 1% (user C)
Framework 6 - 1% (user C)
Framework 7 - 1% (user C)
Framework 8 - 0.8% (user C)
Framework 9 - 11% (user D)
Framework 10 - 7% (user C)
Framework 11 - 1% (user C)
Framework 12 - 1% (user C)
Framework 13 - 6% (user E)
In this example, there's another ~30% of the cluster that is unallocated, and 
it stays like this for a significant amount of time until something changes, 
perhaps another user joins and allocates the rest... chunks of this spare 
resource are offered to some of the frameworks, but not all of them.
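
As a quick check of the figures above, the listed shares can be summed overall 
and per user. This sketch assumes each percentage is a fraction of the whole 
cluster, which is how the "~30% unallocated" remark reads; the variable names 
are illustrative only.

```python
from collections import defaultdict

# (framework number, share of the cluster in percent, user), copied from the list above.
shares = [
    (1, 13, "A"), (2, 22, "B"), (3, 4, "C"), (4, 0.5, "C"), (5, 1, "C"),
    (6, 1, "C"), (7, 1, "C"), (8, 0.8, "C"), (9, 11, "D"), (10, 7, "C"),
    (11, 1, "C"), (12, 1, "C"), (13, 6, "E"),
]

# Total allocated share and what is left of the cluster.
allocated = sum(share for _, share, _ in shares)
unallocated = 100 - allocated

# Aggregate the shares by user.
per_user = defaultdict(float)
for _, share, user in shares:
    per_user[user] += share

print(f"allocated: {allocated:.1f}%  unallocated: {unallocated:.1f}%")
for user in sorted(per_user):
    print(f"user {user}: {per_user[user]:.1f}%")
```

The listed frameworks add up to 69.3%, leaving 30.7% unallocated, with user C 
holding 17.3% across nine frameworks.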
I had always assumed that when lots of frameworks were involved, eventually the 
frameworks that would keep accepting resources indefinitely would consume the 
remaining resource, as every other framework had rejected the offers.
Could someone elaborate a little on how the DRF allocator / sorter handles this 
situation? Is this likely to be related to the different users being used? Is 
there a way to mitigate this?
We're running version 0.23.1.
Cheers,
Tom.








-- 
Guangya Liu (刘光亚)
Senior Software Engineer
DCOS and OpenStack Development
IBM Platform Computing
Systems and Technology Group
