No roles, no reservations. We're using the default filter options with all frameworks and the default allocation interval.
> On 21 Feb 2016, at 08:10, Guangya Liu <[email protected]> wrote:
>
> Hi Tom,
>
> I traced the agent "20160112-165226-67375276-5050-22401-S199" and found
> that it keeps being declined by many frameworks: once a framework gets it,
> the framework declines it immediately. Do some of your frameworks have
> special offer filter logic?
>
> Also, I'd like to get more details on your cluster:
> 1) What is the role for each framework and what is the weight for each role?
> 2) Do you start all agents without any reservation?
>
> Thanks,
>
> Guangya
>
>> On Sun, Feb 21, 2016 at 9:23 AM, Klaus Ma <[email protected]> wrote:
>> Hi Tom,
>>
>> What's the allocation interval? Can you try reducing the filter timeout of
>> the frameworks?
>>
>> According to the log, there are ~12 frameworks on a cluster with ~42 agents;
>> the filter duration is 5 sec, and agents are filtered ~60 times each second
>> (e.g. 65 at 18:08:34). For example, framework
>> (20160219-164457-67375276-5050-28802-0015) just got resources from 6 agents
>> and filtered the other 36 agents at 18:08:35 (egrep "Alloca|Filtered"
>> mesos-master.log | grep "20160219-164457-67375276-5050-28802-0015" | grep
>> "18:08:35")
>>
>> Thanks
>> Klaus
>>
>> From: [email protected]
>> Subject: Re: Mesos sometimes not allocating the entire cluster
>> Date: Sat, 20 Feb 2016 16:36:54 +0000
>> To: [email protected]
>>
>> Hi Guangya,
>>
>> Indeed we have ~45 agents. I’ve attached the log from the master…
>>
>> Hope there’s something here that highlights the issue, we can’t find
>> anything that we can’t explain.
>>
>> Cheers,
>>
>> Tom.
>>
>> On 19 Feb 2016, at 03:02, Guangya Liu <[email protected]> wrote:
>>
>> Hi Tom,
>>
>> After the patch is applied, there is no need to restart the frameworks,
>> only the Mesos master.
>>
>> One question: from your log, it seems your cluster has at least 36 agents,
>> right? I am asking because if there are more frameworks than agents,
>> frameworks with low weight may sometimes not be able to get resources.
>>
>> Can you please enable GLOG_v=2 for the Mesos master for a while and put the
>> log somewhere for us to check? (Do not leave this enabled for long, as you
>> will be flooded with log messages.) This kind of log message may help with
>> your problem.
>>
>> Also, there is another ticket that tries to fix a separate performance
>> issue in the allocator. It may not help you much, but you can still take a
>> look: https://issues.apache.org/jira/browse/MESOS-4694
>>
>> Thanks,
>>
>> Guangya
>>
>> On Fri, Feb 19, 2016 at 2:19 AM, Tom Arnfeld <[email protected]> wrote:
>> Hi Ben,
>>
>> We've rolled that patch out (applied over 0.23.1) on our production cluster
>> and have seen little change, the master is still not sending any offers to
>> those frameworks. We did this upgrade online, so would there be any reason
>> the fix wouldn't have helped (other than it not being the cause)? Would we
>> need to restart the frameworks (so they get new IDs) to see the effect?
>>
>> It's not that the master is never sending them offers, it's that it does it
>> up to a certain point... for different types of frameworks (all using
>> libmesos) but then no more, regardless of how much free resource is
>> available... the free resources are offered to some frameworks, but not all.
>> Is there any way for us to do more introspection into the state of the
>> master / allocator to try and debug? Right now we're at a bit of a loss of
>> where to start diving in...
>>
>> Much appreciated as always,
>>
>> Tom.
>>
>> On 18 February 2016 at 10:21, Tom Arnfeld <[email protected]> wrote:
>> Hi Ben,
>>
>> I've only just seen your email! Really appreciate the reply, that's
>> certainly an interesting bug and we'll try that patch and see how we get on.
>>
>> Cheers,
>>
>> Tom.
>>
>> On 29 January 2016 at 19:54, Benjamin Mahler <[email protected]> wrote:
>> Hi Tom,
>>
>> I suspect you may be tripping the following issue:
>> https://issues.apache.org/jira/browse/MESOS-4302
>>
>> Please have a read through this and see if it applies here. You may also be
>> able to apply the fix to your cluster to see if that helps things.
>>
>> Ben
>>
>> On Wed, Jan 20, 2016 at 10:19 AM, Tom Arnfeld <[email protected]> wrote:
>> Hey,
>>
>> I've noticed some interesting behaviour recently when we have lots of
>> different frameworks connected to our Mesos cluster at once, all using a
>> variety of different shares. Some of the frameworks don't get offered more
>> resources (for long periods of time, hours even) leaving the cluster under
>> utilised.
>>
>> Here's an example state where we see this happen..
>>
>> Framework 1 - 13% (user A)
>> Framework 2 - 22% (user B)
>> Framework 3 - 4% (user C)
>> Framework 4 - 0.5% (user C)
>> Framework 5 - 1% (user C)
>> Framework 6 - 1% (user C)
>> Framework 7 - 1% (user C)
>> Framework 8 - 0.8% (user C)
>> Framework 9 - 11% (user D)
>> Framework 10 - 7% (user C)
>> Framework 11 - 1% (user C)
>> Framework 12 - 1% (user C)
>> Framework 13 - 6% (user E)
>>
>> In this example, there's another ~30% of the cluster that is unallocated,
>> and it stays like this for a significant amount of time until something
>> changes, perhaps another user joins and allocates the rest.... chunks of
>> this spare resource are offered to some of the frameworks, but not all of
>> them.
>>
>> I had always assumed that when lots of frameworks were involved, eventually
>> the frameworks that would keep accepting resources indefinitely would
>> consume the remaining resource, as every other framework had rejected the
>> offers.
>>
>> Could someone elaborate a little on how the DRF allocator / sorter handles
>> this situation, is this likely to be related to the different users being
>> used? Is there a way to mitigate this?
>>
>> We're running version 0.23.1.
>>
>> Cheers,
>>
>> Tom.
>>
>> --
>> Guangya Liu (刘光亚)
>> Senior Software Engineer
>> DCOS and OpenStack Development
>> IBM Platform Computing
>> Systems and Technology Group
>
> --
> Guangya Liu (刘光亚)
> Senior Software Engineer
> DCOS and OpenStack Development
> IBM Platform Computing
> Systems and Technology Group
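
For anyone wanting to repeat the per-second count from Klaus's egrep/grep pipeline across the whole log rather than one second at a time, here is a minimal Python sketch. It assumes the master log uses the usual glog line prefix (severity letter, mmdd, then hh:mm:ss.microseconds). The function name `count_per_second` is made up for illustration, and the "Filtered" keyword is simply the string used in the egrep above, not a guaranteed log format.

```python
import re
from collections import Counter

# Assumption: each log line starts with a glog-style prefix such as
# "I0220 18:08:35.123456 ...". Lines without it are skipped.
GLOG_PREFIX = re.compile(r"^[IWEF]\d{4} (\d{2}:\d{2}:\d{2})\.\d+")


def count_per_second(lines, framework_id, keyword="Filtered"):
    """Count lines per second that mention both keyword and framework_id.

    Roughly what `egrep KEYWORD log | grep FRAMEWORK_ID | grep HH:MM:SS`
    does, but tallied for every second in one pass.
    """
    counts = Counter()
    for line in lines:
        if keyword not in line or framework_id not in line:
            continue
        match = GLOG_PREFIX.match(line)
        if match:
            counts[match.group(1)] += 1  # key is "hh:mm:ss"
    return counts
```

To use it, open mesos-master.log, pass the file object and a framework ID to `count_per_second`, and print the sorted items. A second where one framework filters most of the agents should stand out in the tally.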

