No roles, no reservations.

We're using the default filter options with all frameworks, and the default 
allocation interval.
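
For concreteness, a minimal sketch of what those defaults mean, written
against the old mesos.interface Python bindings (a sketch of the
0.23-era API, not our production code):

    from mesos.interface import mesos_pb2

    # The default offer filter: after we decline an offer, the master
    # will not re-offer those resources to this framework for
    # refuse_seconds.
    filters = mesos_pb2.Filters()
    print(filters.refuse_seconds)  # 5.0 seconds by default

    # Declining without an explicit Filters message applies the same
    # 5-second default:
    #   driver.declineOffer(offer.id)
    # The master's own cadence comes from --allocation_interval, which
    # defaults to 1 second.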

> On 21 Feb 2016, at 08:10, Guangya Liu <[email protected]> wrote:
> 
> Hi Tom,
> 
> I traced the agent "20160112-165226-67375276-5050-22401-S199" and found 
> that its offers keep being declined by many frameworks: once a framework 
> gets an offer from it, the framework declines it immediately. Do some of 
> your frameworks have special offer filter logic?
> 
> I'd also like to know a bit more about your cluster:
> 1) What is the role for each framework and what is the weight for each role?
> 2) Do you start all agents without any reservation?
> 
> Thanks,
> 
> Guangya 
> 
>> On Sun, Feb 21, 2016 at 9:23 AM, Klaus Ma <[email protected]> wrote:
>> Hi Tom,
>> 
>> What's the allocation interval? Can you try reducing the filter timeout 
>> in your frameworks?
>> 
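>> To reduce the filter timeout, a framework can pass an explicit Filters 
>> message when declining (a minimal sketch against the Python bindings; 
>> the 1-second value is just an example):
>> 
>>     from mesos.interface import mesos_pb2
>> 
>>     def decline_quickly(driver, offer):
>>         # Re-offer these resources after ~1s instead of the 5s default.
>>         filters = mesos_pb2.Filters()
>>         filters.refuse_seconds = 1.0
>>         driver.declineOffer(offer.id, filters)
>> 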
>> According to the log, there are ~12 frameworks on a cluster with ~42 
>> agents; the filter duration is 5 seconds, and there are ~60 filter events 
>> per second (e.g. 65 at 18:08:34). For example, framework 
>> 20160219-164457-67375276-5050-28802-0015 got resources from only 6 agents 
>> and filtered the other 36 agents at 18:08:35 (egrep "Alloca|Filtered" 
>> mesos-master.log | grep "20160219-164457-67375276-5050-28802-0015" | grep 
>> "18:08:35").
>> 
>> Thanks
>> Klaus
>> 
>> From: [email protected]
>> Subject: Re: Mesos sometimes not allocating the entire cluster
>> Date: Sat, 20 Feb 2016 16:36:54 +0000
>> To: [email protected]
>> 
>> Hi Guangya,
>> 
>> Indeed, we have ~45 agents. I’ve attached the log from the master…
>> 
>> Hope there’s something in here that highlights the issue; we can’t find 
>> anything ourselves that we can’t explain.
>> 
>> Cheers,
>> 
>> Tom.
>> 
>> On 19 Feb 2016, at 03:02, Guangya Liu <[email protected]> wrote:
>> 
>> Hi Tom,
>> 
>> After the patch is applied, there is no need to restart the frameworks, 
>> only the Mesos master.
>> 
>> One question: from your log it seems your cluster has at least 36 
>> agents, right? I'm asking because if there are more frameworks than 
>> agents, frameworks with low weight may sometimes not be able to get 
>> resources.
>> 
>> Can you please enable GLOG_v=2 on the Mesos master for a while and put 
>> the log somewhere for us to check? (Do not leave this enabled for long, 
>> as you will be flooded with log messages.) These messages may help 
>> diagnose your problem.
>> 
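>> One way to bound the flood is to raise verbosity through the master's 
>> /logging/toggle endpoint, which reverts by itself after the given 
>> duration (a sketch; the master address is a placeholder):
>> 
>>     import urllib2
>> 
>>     MASTER = "http://mesos-master.example.com:5050"  # placeholder
>>     # POST to /logging/toggle: verbosity level 2 for 10 minutes.
>>     urllib2.urlopen(MASTER + "/logging/toggle?level=2&duration=10mins",
>>                     "")
>> 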
>> Also, there is a patch in progress for another allocator performance 
>> issue; it may not help you much, but you can still take a look: 
>> https://issues.apache.org/jira/browse/MESOS-4694
>> 
>> Thanks,
>> 
>> Guangya
>> 
>> On Fri, Feb 19, 2016 at 2:19 AM, Tom Arnfeld <[email protected]> wrote:
>> Hi Ben,
>> 
>> We've rolled that patch out (applied over 0.23.1) on our production cluster 
>> and have seen little change, the master is still not sending any offers to 
>> those frameworks. We did this upgrade online, so would there be any reason 
>> the fix wouldn't have helped (other than it not being the cause)? Would we 
>> need to restart the frameworks (so they get new IDs) to see the effect?
>> 
>> It's not that the master never sends them offers; it sends offers up to 
>> a certain point, for different types of frameworks (all using libmesos), 
>> but then no more, regardless of how much free resource is available. The 
>> free resources are offered to some frameworks, but not all. Is there any 
>> way for us to do more introspection into the state of the master / 
>> allocator to try and debug? Right now we're at a bit of a loss as to 
>> where to start diving in...
>> 
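>> So far the most we can see is the per-framework view from the standard 
>> endpoints, something like (a sketch; the master address is a 
>> placeholder):
>> 
>>     import json
>>     import urllib2
>> 
>>     MASTER = "http://mesos-master.example.com:5050"  # placeholder
>> 
>>     # Per-framework allocated resources, as the master reports them.
>>     state = json.load(urllib2.urlopen(MASTER + "/master/state.json"))
>>     for fw in state["frameworks"]:
>>         print("%s: %s" % (fw["name"], fw["resources"]))
>> 
>>     # Offers currently outstanding, from the metrics snapshot.
>>     metrics = json.load(urllib2.urlopen(MASTER + "/metrics/snapshot"))
>>     print(metrics.get("master/outstanding_offers"))
>> 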
>> Much appreciated as always,
>> 
>> Tom.
>> 
>> On 18 February 2016 at 10:21, Tom Arnfeld <[email protected]> wrote:
>> Hi Ben,
>> 
>> I've only just seen your email! Really appreciate the reply, that's 
>> certainly an interesting bug and we'll try that patch and see how we get on.
>> 
>> Cheers,
>> 
>> Tom.
>> 
>> On 29 January 2016 at 19:54, Benjamin Mahler <[email protected]> wrote:
>> Hi Tom,
>> 
>> I suspect you may be tripping the following issue:
>> https://issues.apache.org/jira/browse/MESOS-4302
>> 
>> Please have a read through this and see if it applies here. You may also be 
>> able to apply the fix to your cluster to see if that helps things.
>> 
>> Ben
>> 
>> On Wed, Jan 20, 2016 at 10:19 AM, Tom Arnfeld <[email protected]> wrote:
>> Hey,
>> 
>> I've noticed some interesting behaviour recently when we have lots of 
>> different frameworks connected to our Mesos cluster at once, all using a 
>> variety of different shares. Some of the frameworks don't get offered more 
>> resources (for long periods of time, hours even) leaving the cluster under 
>> utilised.
>> 
>> Here's an example state where we see this happen..
>> 
>> Framework 1 - 13% (user A)
>> Framework 2 - 22% (user B)
>> Framework 3 - 4% (user C)
>> Framework 4 - 0.5% (user C)
>> Framework 5 - 1% (user C)
>> Framework 6 - 1% (user C)
>> Framework 7 - 1% (user C)
>> Framework 8 - 0.8% (user C)
>> Framework 9 - 11% (user D)
>> Framework 10 - 7% (user C)
>> Framework 11 - 1% (user C)
>> Framework 12 - 1% (user C)
>> Framework 13 - 6% (user E)
>> 
>> In this example, there's another ~30% of the cluster that is 
>> unallocated, and it stays like this for a significant amount of time 
>> until something changes, perhaps another user joining and allocating the 
>> rest. Chunks of this spare resource are offered to some of the 
>> frameworks, but not all of them.
>> 
>> I had always assumed that when lots of frameworks were involved, the 
>> frameworks that keep accepting resources indefinitely would eventually 
>> consume the remaining resources, since every other framework had 
>> rejected the offers.
>> 
>> Could someone elaborate a little on how the DRF allocator / sorter 
>> handles this situation? Is it likely to be related to the different 
>> users involved? Is there a way to mitigate this?
>> 
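>> My (possibly wrong) mental model of the sorter is roughly this sketch: 
>> each user's dominant share is their largest fraction of any single 
>> resource, and whoever has the lowest share is offered resources first 
>> (illustrative only, not the actual Mesos code; numbers made up):
>> 
>>     # Hypothetical cluster totals.
>>     TOTAL = {"cpus": 100.0, "mem": 400.0}
>> 
>>     def dominant_share(allocated, weight=1.0):
>>         # Largest fraction of any one resource, scaled by role weight.
>>         return max(allocated[r] / TOTAL[r] for r in TOTAL) / weight
>> 
>>     users = {
>>         "A": {"cpus": 13.0, "mem": 40.0},
>>         "B": {"cpus": 22.0, "mem": 90.0},
>>         "C": {"cpus": 16.0, "mem": 60.0},
>>     }
>> 
>>     # The lowest dominant share is offered first; even if that user's
>>     # frameworks decline (with the 5s filter), we'd expect the spare
>>     # resources to eventually reach everyone else -- which is why the
>>     # behaviour we're seeing is surprising.
>>     print(sorted(users, key=lambda u: dominant_share(users[u])))
>> 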
>> We're running version 0.23.1.
>> 
>> Cheers,
>> 
>> Tom.
>> 
>> -- 
>> Guangya Liu (刘光亚)
>> Senior Software Engineer
>> DCOS and OpenStack Development
>> IBM Platform Computing
>> Systems and Technology Group
> 
> -- 
> Guangya Liu (刘光亚)
> Senior Software Engineer
> DCOS and OpenStack Development
> IBM Platform Computing
> Systems and Technology Group
