Having knowledge of the tasks pending in the frameworks, at least via offer filters that specify minimum resource sizes, could prove useful, and roles+weights would be complementary. This might remove the need to use dynamic reservations for every framework that uses more than the smallest-size resources. Starvation often ends up being addressed via multiple "tricks", including reservations, priority/weight-based preemptions, and oversubscription of resources, to name a few.
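For concreteness, the closest thing a framework can do today is decline small offers and ask the master to hold them back for a while via refuse_seconds. Below is a minimal sketch, assuming the mesos.interface Python bindings; the thresholds and the launchBigTask helper are made up for illustration, not part of any existing framework:

# Minimal sketch: decline offers smaller than the sizes we care about, and
# ask the master not to re-offer them to us for a while.
from mesos.interface import Scheduler, mesos_pb2

MIN_MEM_MB = 20 * 1024   # hypothetical: we only want >= 20 GB offers
REFUSE_SECONDS = 300.0   # hypothetical back-off before re-offering

class BigTaskScheduler(Scheduler):
    def resourceOffers(self, driver, offers):
        for offer in offers:
            mem = sum(r.scalar.value for r in offer.resources if r.name == "mem")
            if mem < MIN_MEM_MB:
                # Filters.refuse_seconds is the only "minimum offer size"
                # knob available today: it only delays re-offers to us,
                # it does not tell the allocator what size we actually need.
                filters = mesos_pb2.Filters()
                filters.refuse_seconds = REFUSE_SECONDS
                driver.declineOffer(offer.id, filters)
            else:
                self.launchBigTask(driver, offer)  # hypothetical helper

    def launchBigTask(self, driver, offer):
        pass  # build TaskInfo(s) and call driver.launchTasks(...) here

The point of the sketch is what it cannot do: the allocator never learns the 20 GB requirement, it only sees declines.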
This may then tend to make frameworks relatively more homogeneous in their task sizes, unless they further implement prioritization within their tasks and ask mostly for offer sizes that fit their bigger tasks. Effectively, they become homogeneous in terms of the offer sizes they filter on. In general, the more diverse the resource requests, the more difficult the scheduling problem.

On Tue, Jun 30, 2015 at 7:25 AM, Dharmesh Kakadia <[email protected]> wrote:

> Yes, an alternative allocator module would be great in terms of implementation, but adding more capabilities to "filters" might be required to convey some more info to the Mesos scheduler/allocator. Am I correct here, or are there already ways to convey such info?
>
> Thanks,
> Dharmesh
>
> On Tue, Jun 30, 2015 at 7:15 PM, Alex Rukletsov <[email protected]> wrote:
>
>> One option is to implement alternative behaviour in an allocator module.
>>
>> On Tue, Jun 30, 2015 at 3:34 PM, Dharmesh Kakadia <[email protected]> wrote:
>>
>>> Interesting.
>>>
>>> I agree that dynamic reservation and optimistic offers will help mitigate the issue, but resource fragmentation (and the starvation due to it) is a more general problem. Predictive models can certainly aid the Mesos scheduler here. I think the filters in Mesos can be extended to add more general preferences like the offer size, execution/predictive model, etc. For the Mesos scheduler, the user should be able to configure which filters it recognizes while making offers, which would also keep the effect on scalability limited, as far as I understand. Thoughts?
>>>
>>> Thanks,
>>> Dharmesh
>>>
>>> On Sun, Jun 28, 2015 at 7:29 PM, Alex Rukletsov <[email protected]> wrote:
>>>
>>>> Sharma,
>>>>
>>>> that's exactly what we plan to add to Mesos. Dynamic reservations will land in 0.23; the next step is to optimistically offer reserved but as-yet-unused resources (we call them optimistic offers) to other frameworks as revocable. The alternative with one framework will of course work, but it implies having a general-purpose framework that does some work that is better done by Mesos (which has more information and can therefore make better decisions).
>>>>
>>>> On Wed, Jun 24, 2015 at 11:54 PM, Sharma Podila <[email protected]> wrote:
>>>>
>>>>> In a previous (more HPC-like) system I worked on, the scheduler did "advance reservation" of resources, claiming bits and pieces as it got them and holding on until all were available. Say the last bit is expected to come in about 1 hour from now (this needs job runtime estimation/knowledge); any short jobs are "backfilled" onto the advance-reserved resources that would otherwise sit idle for that hour, to improve utilization. This was combined with weights and priority-based job preemptions; sometimes the 1GB jobs were higher priority than the 20GB jobs. Unfortunately, that technique doesn't lend itself natively to Mesos-based scheduling.
>>>>>
>>>>> One idea that may work in Mesos is (thinking aloud):
>>>>>
>>>>> - The large (20GB) framework reserves 20 GB on some number of slaves (I am referring to dynamic reservations here, which aren't available yet).
>>>>> - The small framework continues to use up 1GB offers.
>>>>> - When the large framework needs to run a job, it will have the 20 GB offers since it has the reservation.
>>>>> - When the large framework does not have any jobs running, the small framework may be given those resources, but those jobs will have to be preempted in order to offer 20 GB back to the large framework.
>>>>>
>>>>> I understand this idea has some forward-looking expectations of how dynamic reservations would/could work. Caveat: I haven't involved myself closely with that feature definition, so I could be wrong in my expectations.
>>>>>
>>>>> Until something like that lands, the existing static reservations, of course, should work. But that reduces utilization drastically if the large framework runs jobs only sporadically.
>>>>>
>>>>> Another idea is to have one framework schedule both the 20GB jobs and the 1GB jobs. Within the framework, it can bin-pack the 1GB jobs onto as small a number of slaves as possible. This increases the likelihood of finding 20GB free on a slave. Combining that with preemption from within the framework (a simple kill of a certain number of 1GB jobs) should satisfy the 20GB jobs.
>>>>>
>>>>> On Wed, Jun 24, 2015 at 9:26 AM, Tim St Clair <[email protected]> wrote:
>>>>>
>>>>>> ----- Original Message -----
>>>>>> > From: "Brian Candler" <[email protected]>
>>>>>> > To: [email protected]
>>>>>> > Sent: Wednesday, June 24, 2015 10:50:43 AM
>>>>>> > Subject: Re: Setting minimum offer size
>>>>>> >
>>>>>> > On 24/06/2015 16:31, Alex Gaudio wrote:
>>>>>> > > Does anyone have other ideas?
>>>>>> > HTCondor deals with this by having a "defrag" daemon, which periodically stops hosts accepting small jobs, so that it can coalesce small slots into larger ones.
>>>>>> >
>>>>>> > http://research.cs.wisc.edu/htcondor/manual/latest/3_5Policy_Configuration.html#sec:SMP-defrag
>>>>>>
>>>>>> Yuppers, and guess who helped work on it ;-)
>>>>>>
>>>>>> > You can configure policies based on how many drained machines are already available, and how many can be draining at once.
>>>>>>
>>>>>> It had to be done this way, as there is only so much sophistication you can put into scheduling before you start to add latency.
>>>>>>
>>>>>> > Maybe there would be a benefit if Mesos could work out what is the largest job any framework has waiting to run, so it knows whether draining is required and how far to drain down. This might take the form of a message to the framework: "suppose I offered you all the resources on the cluster, what is the largest single job you would want to run, and which machine(s) could it run on?" Or something like that.
>>>>>>
>>>>>> > Regards,
>>>>>> >
>>>>>> > Brian.
>>>>>>
>>>>>> --
>>>>>> Cheers,
>>>>>> Timothy St. Clair
>>>>>> Red Hat Inc.
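Coming back to the bin-packing idea in Sharma's message quoted above: a toy best-fit-decreasing sketch of placing 1GB tasks onto as few agents as possible, so that whole 20GB gaps survive elsewhere. This is purely illustrative; the agent capacities and task sizes below are made up, and a real framework would of course work from live offers rather than a static dict.

# Toy sketch of "pack small tasks tightly": place each 1GB task on the
# agent with the LEAST remaining free memory that still fits it (best fit),
# which keeps large contiguous gaps free on the other agents.

def pack_small_tasks(agents_free_mb, tasks_mb):
    """agents_free_mb: {agent_id: free MB} (mutated in place);
    tasks_mb: list of task sizes in MB.
    Returns {agent_id: [task sizes placed]}; tasks that fit nowhere are
    skipped here, though a real framework would keep them pending."""
    placement = {a: [] for a in agents_free_mb}
    for task in sorted(tasks_mb, reverse=True):
        candidates = [a for a, free in agents_free_mb.items() if free >= task]
        if not candidates:
            continue
        target = min(candidates, key=lambda a: agents_free_mb[a])
        placement[target].append(task)
        agents_free_mb[target] -= task
    return placement

if __name__ == "__main__":
    # Hypothetical cluster: three 24GB agents and twenty 1GB tasks.
    free = {"agent1": 24 * 1024, "agent2": 24 * 1024, "agent3": 24 * 1024}
    print(pack_small_tasks(free, [1024] * 20))
    # All twenty 1GB tasks land on agent1, leaving 20GB+ free on the others.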

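And on Brian's point about Mesos knowing the largest pending job: a back-of-the-envelope sketch of the defrag-style decision, i.e. whether draining is needed and which agent would need to evict preemptible small tasks to fit that job. Again purely illustrative; this is not an existing Mesos or HTCondor interface, and the numbers are invented.

# Rough sketch of a defrag-style decision: given the largest pending job
# and per-agent free vs. reclaimable (preemptible) memory, decide which
# agent, if any, should drain its small tasks. (Illustrative only; a very
# crude stand-in for HTCondor's defrag policy knobs.)

def agents_to_drain(largest_job_mb, agents):
    """agents: list of dicts with 'free_mb' and 'reclaimable_mb'.
    Returns [] if the job already fits, a list with one agent index to
    drain, or None if even full draining cannot fit the job."""
    if any(a["free_mb"] >= largest_job_mb for a in agents):
        return []
    viable = [i for i, a in enumerate(agents)
              if a["free_mb"] + a["reclaimable_mb"] >= largest_job_mb]
    if not viable:
        return None
    # Drain the agent that needs to evict the fewest MB to make room.
    best = min(viable, key=lambda i: largest_job_mb - agents[i]["free_mb"])
    return [best]

if __name__ == "__main__":
    cluster = [{"free_mb": 4096, "reclaimable_mb": 18432},
               {"free_mb": 10240, "reclaimable_mb": 8192},
               {"free_mb": 2048, "reclaimable_mb": 20480}]
    print(agents_to_drain(20 * 1024, cluster))
    # -> [0]: drain agent 0, evicting roughly 16 GB of small tasks.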
