Nothing in particular; we need the full master log to see what's happening in your cluster. You can attach it to a JIRA issue instead of sending it by email.
----
Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
Platform OpenSource Technology, STG, IBM GCG
+86-10-8245 4084 | [email protected] | http://k82.me

On Fri, Jan 22, 2016 at 9:07 PM, Tom Arnfeld <[email protected]> wrote:

> I can't send the entire log as there's a lot of activity on the cluster
> all the time. Is there anything in particular you're looking for?

On 22 Jan 2016, at 12:46, Klaus Ma <[email protected]> wrote:

> Can you share the whole log of the master? It'll be helpful :).

On Thu, Jan 21, 2016 at 11:57 PM, Tom Arnfeld <[email protected]> wrote:

> Guangya - Nope, there are no outstanding offers for any frameworks; the
> ones that are getting offers are responding properly.
>
> Klaus - This was just a sample of logs for a single agent; the cluster
> has at least ~40 agents at any one time.

On 21 January 2016 at 15:20, Guangya Liu <[email protected]> wrote:

> Can you please check whether there are outstanding offers in the cluster
> that have not been accepted by any framework? You can check this via the
> /master/state.json endpoint.
>
> If there are outstanding offers, you can start the master with the
> --offer_timeout flag to make the master rescind offers that are not
> accepted by a framework in time.
>
> Cited from
> https://github.com/apache/mesos/blob/master/docs/configuration.md
>
> --offer_timeout=VALUE  Duration of time before an offer is rescinded
> from a framework. This helps fairness when running frameworks that hold
> on to offers, or frameworks that accidentally drop offers.
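[Archive note] The /master/state.json check Guangya describes can be scripted. Below is a minimal sketch; the hostname is a placeholder, and the assumption that each framework entry in the state document carries an `offers` list reflects the 0.23-era schema and is not confirmed in this thread.

```python
import json
from urllib.request import urlopen

def outstanding_offers(state):
    """Given a parsed /master/state.json document, return a map of
    framework name -> number of outstanding (unanswered) offers.
    Assumes each framework entry carries an 'offers' list."""
    counts = {}
    for fw in state.get("frameworks", []):
        offers = fw.get("offers", [])
        if offers:
            counts[fw.get("name", fw.get("id"))] = len(offers)
    return counts

# Against a live master (hostname is hypothetical):
# state = json.load(urlopen("http://mesos-master.example.com:5050/master/state.json"))
# print(outstanding_offers(state))
```

If this prints a non-empty map that stays non-empty across runs, some framework is sitting on offers and --offer_timeout is worth considering.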
> Thanks,
> Guangya

On Thu, Jan 21, 2016 at 9:44 PM, Tom Arnfeld <[email protected]> wrote:

> Hi Klaus,
>
> Sorry, I think I explained this badly: these are the logs for one slave
> (which is empty), and we can see that it is making offers to some
> frameworks. In this instance, the Hadoop framework (and others) are not
> among those getting any offers; they get offered nothing. The allocator
> is deciding to send offers in a loop to a certain set of frameworks,
> starving the others.

On 21 January 2016 at 13:17, Klaus Ma <[email protected]> wrote:

> Yes, it seems the Hadoop framework did not consume all of the offered
> resources: if a framework launches a task (1 CPU) on an offer (10 CPUs),
> the other 9 CPUs are returned to the master (recoverResources).

On Thu, Jan 21, 2016 at 6:46 PM, Tom Arnfeld <[email protected]> wrote:

> Thanks everyone!
>
> Stephan - There are a couple of useful points there; I'll definitely
> give it a read.
>
> Klaus - Thanks. We're running a bunch of different frameworks; in that
> list there's Hadoop MRv1, Apache Spark, Marathon and a couple of
> home-grown frameworks of our own. In this particular case the Hadoop
> framework is the major concern, as it's designed to continually accept
> offers until it has all the slots it needs. With the example I gave
> above, we observe that the master is never sending any sizeable offers
> to some of these frameworks (the ones with the larger shares), which is
> where my confusion stems from.
>
> I've attached a snippet of our active master logs showing the activity
> for a single slave (which has no active executors).
> We can see that it's cycling through sending and recovering declined
> offers from a selection of different frameworks (in order), but I can
> say that not all of the frameworks are receiving these offers; in this
> case that's the Hadoop framework.

On 21 January 2016 at 00:26, Klaus Ma <[email protected]> wrote:

> Hi Tom,
>
> Which framework are you using, e.g. Swarm, Marathon or something else?
> And which language package are you using?
>
> DRF sorts roles/frameworks by allocation ratio and offers all
> "available" resources slave by slave; but if the resources are too small
> (< 0.1 CPU), or the resources were rejected/declined by a framework,
> they will not be offered to that framework again until the filter times
> out. For example, in Swarm 1.0 the default filter timeout is 5s (because
> of the Go scheduler API), so here is a case that may hurt utilisation:
> Swarm gets one slave with 16 CPUs but launches only one container with
> 1 CPU; the other 15 CPUs are returned to the master and are not
> re-offered until the filter timeout (5s).
> I opened a pull request to make Swarm's parameters configurable; refer
> to https://github.com/docker/swarm/pull/1585. I think you can check this
> case in the master log.
>
> If any comments, please let me know.

On Thu, Jan 21, 2016 at 2:19 AM, Tom Arnfeld <[email protected]> wrote:

> Hey,
>
> I've noticed some interesting behaviour recently when we have lots of
> different frameworks connected to our Mesos cluster at once, all using a
> variety of different shares.
> Some of the frameworks don't get offered more resources (for long
> periods of time, hours even), leaving the cluster under-utilised.
>
> Here's an example state where we see this happen:
>
> Framework 1  - 13%   (user A)
> Framework 2  - 22%   (user B)
> Framework 3  - 4%    (user C)
> Framework 4  - 0.5%  (user C)
> Framework 5  - 1%    (user C)
> Framework 6  - 1%    (user C)
> Framework 7  - 1%    (user C)
> Framework 8  - 0.8%  (user C)
> Framework 9  - 11%   (user D)
> Framework 10 - 7%    (user C)
> Framework 11 - 1%    (user C)
> Framework 12 - 1%    (user C)
> Framework 13 - 6%    (user E)
>
> In this example, there's another ~30% of the cluster that is
> unallocated, and it stays like this for a significant amount of time
> until something changes, perhaps another user joins and allocates the
> rest. Chunks of this spare resource are offered to some of the
> frameworks, but not all of them.
>
> I had always assumed that when lots of frameworks were involved, the
> frameworks that would keep accepting resources indefinitely would
> eventually consume the remaining resource, as every other framework had
> rejected the offers.
>
> Could someone elaborate a little on how the DRF allocator / sorter
> handles this situation? Is it likely to be related to the different
> users being used? Is there a way to mitigate this?
>
> We're running version 0.23.1.
>
> Cheers,
>
> Tom.
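[Archive note] As background to the DRF question above: a DRF-style sorter computes each framework's dominant share (its largest fractional use of any single resource type) and offers available resources to the lowest-share entry first; in Mesos the sort is hierarchical, by user/role first and then by framework within the role, which is why the user grouping in the list above matters. The sketch below is illustrative only, with made-up cluster totals and allocations; it is not the Mesos allocator's code, and it omits the decline filters (refuse_seconds) discussed earlier in the thread.

```python
# Illustrative DRF (Dominant Resource Fairness) sort, not Mesos source code.
# dominant share = max over resource types of (allocated / cluster total);
# the allocator offers to whoever currently has the lowest dominant share.

TOTAL = {"cpus": 64.0, "mem": 256.0}  # hypothetical cluster totals

def dominant_share(allocated, total=TOTAL):
    """Largest fraction of any one resource this framework holds."""
    return max(allocated.get(r, 0.0) / total[r] for r in total)

def next_to_offer(allocations):
    """allocations: {framework: {resource: amount}} -> the framework with
    the lowest dominant share, i.e. the one DRF would offer to next."""
    return min(allocations, key=lambda fw: dominant_share(allocations[fw]))

allocs = {
    "hadoop":   {"cpus": 14.0, "mem": 32.0},   # dominant share 14/64
    "spark":    {"cpus": 4.0,  "mem": 128.0},  # dominant share 128/256
    "marathon": {"cpus": 2.0,  "mem": 8.0},    # dominant share 2/64
}
```

Here `next_to_offer(allocs)` picks "marathon", since 2/64 is the smallest dominant share. A framework with a small share can still appear starved if its earlier declines left active filters, which is the interaction the replies above describe.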

