Interesting to see that HTCondor has a "defragmentation" feature; this kind of thing has come up before for Mesos as well.
Specifically, adding Inverse Offers as a generic mechanism for obtaining resources back from a framework unlocks a lot of functionality. The first use case was cluster maintenance. Defragmentation, enforcing a quota change, etc. could also be done using inverse offers.

On Tue, Jun 30, 2015 at 12:00 PM, Sharma Podila <spod...@netflix.com> wrote:

> Having the knowledge of tasks pending in the frameworks, at least via the offer filters specifying minimum resource sizes, could prove useful. And roles+weights would be complementary. This might remove the need to use dynamic reservations for every framework that uses more than the smallest size resources. Starvation often ends up being addressed via multiple "tricks", including reservations, priority/weights based preemptions, and oversubscription of resources, to name a few.
>
> This may then tend to make frameworks relatively more homogeneous in their task sizes, unless they further implement prioritization within their tasks and ask mostly for offer sizes that fit their bigger tasks. Effectively, they become homogeneous in terms of the offer sizes they filter on.
>
> In general, the more diverse the resource requests, the more difficult the scheduling problem.
>
> On Tue, Jun 30, 2015 at 7:25 AM, Dharmesh Kakadia <dhkaka...@gmail.com> wrote:
>
>> Yes, an alternative allocator module will be great in terms of implementation, but adding more capabilities to "filters" might be required to convey some more info to the Mesos scheduler/allocator. Am I correct here, or are there already ways to convey such info?
>>
>> Thanks,
>> Dharmesh
>>
>> On Tue, Jun 30, 2015 at 7:15 PM, Alex Rukletsov <a...@mesosphere.com> wrote:
>>
>>> One option is to implement alternative behaviour in an allocator module.
>>>
>>> On Tue, Jun 30, 2015 at 3:34 PM, Dharmesh Kakadia <dhkaka...@gmail.com> wrote:
>>>
>>>> Interesting.
>>>>
>>>> I agree that dynamic reservation and optimistic offers will help mitigate the issue, but resource fragmentation (and starvation due to it) is a more general problem. Predictive models can certainly aid the Mesos scheduler here. I think the filters in Mesos can be extended to add more general preferences like the offer size, execution/predictive model, etc. For the Mesos scheduler, the user should be able to configure which filters it recognizes while making offers, which will also keep the effect on scalability limited, as far as I understand. Thoughts?
>>>>
>>>> Thanks,
>>>> Dharmesh
>>>>
>>>> On Sun, Jun 28, 2015 at 7:29 PM, Alex Rukletsov <a...@mesosphere.com> wrote:
>>>>
>>>>> Sharma,
>>>>>
>>>>> that's exactly what we plan to add to Mesos. Dynamic reservations will land in 0.23; the next step is to optimistically offer reserved but as yet unused resources (we call them optimistic offers) to other frameworks as revocable. The alternative with one framework will of course work, but this implies having a general-purpose framework that does some work that is better done by Mesos (which has more information and can therefore make better decisions).
>>>>>
>>>>> On Wed, Jun 24, 2015 at 11:54 PM, Sharma Podila <spod...@netflix.com> wrote:
>>>>>
>>>>>> In a previous (more HPC-like) system I worked on, the scheduler did "advance reservation" of resources, claiming bits and pieces as it got them and holding on until all were available. Say the last bit is expected to come in about 1 hour from now (and this needs job runtime estimation/knowledge); any short jobs are "backfilled" onto the advance-reserved resources that are sitting idle for an hour, to improve utilization. This was combined with weights and priority based job preemptions; sometimes 1GB jobs are higher priority than the 20GB jobs.
>>>>>> Unfortunately, that technique doesn't lend itself natively to Mesos-based scheduling.
>>>>>>
>>>>>> One idea that may work in Mesos is (thinking aloud):
>>>>>>
>>>>>> - The large (20GB) framework reserves 20 GB on some number of slaves (I am referring to dynamic reservations here, which aren't available yet).
>>>>>> - The small framework continues to use up 1GB offers.
>>>>>> - When the large framework needs to run a job, it will have the 20 GB offers since it has the reservation.
>>>>>> - When the large framework does not have any jobs running, the small framework may be given those resources, but those jobs will have to be preempted in order to offer 20 GB to the large framework.
>>>>>>
>>>>>> I understand this idea has some forward-looking expectations on how dynamic reservations would/could work. Caveat: I haven't involved myself closely with that feature definition, so I could be wrong in my expectations.
>>>>>>
>>>>>> Until something like that lands, the existing static reservations, of course, should work. But that reduces utilization drastically if the large framework runs jobs sporadically.
>>>>>>
>>>>>> Another idea is to have one framework schedule both the 20GB jobs and the 1GB jobs. Within the framework, it can bin-pack the 1GB jobs onto as small a number of slaves as possible. This increases the likelihood of finding 20GB free on a slave. Combining that with preemptions from within the framework (a simple kill of a certain number of 1GB jobs) should satisfy the 20 GB jobs.
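The single-framework idea Sharma describes above (bin-pack the 1GB tasks tightly, then preempt the fewest possible of them to free 20GB) can be sketched roughly as follows. This is a toy illustration, not Mesos API code; the agent names, the 32GB per-agent capacity, and the helper functions are all hypothetical.

```python
# Sketch of the single-framework idea: concentrate 1GB tasks on as few
# agents as possible, then preempt just enough to free 20GB for a big job.
# All names and sizes here are assumptions for illustration only.

AGENT_MEM = 32  # assumed per-agent memory, in GB


def place_small(agents, task_gb=1):
    """Best-fit placement: prefer the most-loaded agent that still fits
    the task, so small tasks pile up on few agents."""
    candidates = [a for a in agents if AGENT_MEM - sum(agents[a]) >= task_gb]
    if not candidates:
        return None
    agent = max(candidates, key=lambda a: sum(agents[a]))
    agents[agent].append(task_gb)
    return agent


def preempt_for(agents, need_gb=20):
    """Pick the agent that needs the fewest 1GB kills to free `need_gb`,
    kill that many tasks there, and return (agent, killed_tasks)."""
    def kills_needed(a):
        free = AGENT_MEM - sum(agents[a])
        return max(0, need_gb - free)  # each resident task is 1GB

    agent = min(agents, key=kills_needed)
    n = kills_needed(agent)
    victims = agents[agent][:n]
    agents[agent] = agents[agent][n:]
    return agent, victims


agents = {"a1": [], "a2": [], "a3": []}
for _ in range(80):
    place_small(agents)              # 80 x 1GB tasks, packed tightly
agent, victims = preempt_for(agents)  # make room for one 20GB job
```

With 80 tasks, packing fills a1 and a2 completely and leaves a3 only partly used, so freeing 20GB costs just a handful of kills on a3 rather than twenty on a fully loaded agent, which is exactly the benefit of packing first.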
>>>>>>
>>>>>> On Wed, Jun 24, 2015 at 9:26 AM, Tim St Clair <tstcl...@redhat.com> wrote:
>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>> > From: "Brian Candler" <b.cand...@pobox.com>
>>>>>>> > To: user@mesos.apache.org
>>>>>>> > Sent: Wednesday, June 24, 2015 10:50:43 AM
>>>>>>> > Subject: Re: Setting minimum offer size
>>>>>>> >
>>>>>>> > On 24/06/2015 16:31, Alex Gaudio wrote:
>>>>>>> > > Does anyone have other ideas?
>>>>>>> > HTCondor deals with this by having a "defrag" daemon, which periodically stops hosts accepting small jobs, so that it can coalesce small slots into larger ones.
>>>>>>> >
>>>>>>> > http://research.cs.wisc.edu/htcondor/manual/latest/3_5Policy_Configuration.html#sec:SMP-defrag
>>>>>>>
>>>>>>> Yuppers, and guess who helped work on it ;-)
>>>>>>>
>>>>>>> > You can configure policies based on how many drained machines are already available, and how many can be draining at once.
>>>>>>>
>>>>>>> It had to be done this way, as there is only so much sophistication you can put into scheduling before you start to add latency.
>>>>>>>
>>>>>>> > Maybe there would be a benefit if Mesos could work out what is the largest job any framework has waiting to run, so it knows whether draining is required and how far to drain down. This might take the form of a message to the framework: "suppose I offered you all the resources on the cluster, what is the largest single job you would want to run, and which machine(s) could it run on?" Or something like that.
>>>>>>> >
>>>>>>> > Regards,
>>>>>>> >
>>>>>>> > Brian.
>>>>>>>
>>>>>>> --
>>>>>>> Cheers,
>>>>>>> Timothy St. Clair
>>>>>>> Red Hat Inc.
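Brian's drain-down suggestion (learn the largest pending job, then decide whether and how far to drain) can be sketched like this. Everything here is hypothetical illustration; real HTCondor defrag policy (and any Mesos equivalent) involves considerably more state than this.

```python
# Sketch of the drain-sizing idea: given the largest job any framework has
# waiting, decide which agents to drain so that job can eventually fit.
# Function and agent names are assumptions, not a real scheduler API.

def drain_plan(free_by_agent, largest_pending_gb, max_draining=2):
    """Return the agents to drain so the largest pending job can fit.
    `free_by_agent` maps agent -> currently free GB."""
    if any(free >= largest_pending_gb for free in free_by_agent.values()):
        return []  # a big-enough slot already exists; no draining needed
    # Drain the agents closest to having enough free memory first (they
    # evict the least running work), capped at a draining-concurrency limit
    # analogous to HTCondor's "how many can be draining at once" policy.
    order = sorted(free_by_agent, key=lambda a: free_by_agent[a], reverse=True)
    return order[:max_draining]


free = {"a1": 2, "a2": 14, "a3": 7}
plan = drain_plan(free, largest_pending_gb=20)
```

The key point mirrored from the thread: without knowing `largest_pending_gb` (which today would have to come from the framework, e.g. via richer offer filters), the allocator cannot tell whether draining is needed at all, nor when it has drained far enough and can stop.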