Yes, an alternative allocator module would be great in terms of implementation, but "filters" might need more capabilities in order to convey additional information to the Mesos scheduler/allocator. Am I correct here, or are there already ways to convey such info?
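To make the question concrete, here is a rough, purely hypothetical sketch (Python, for illustration only) of the kind of extra preferences a filter could carry. Only refuse_seconds exists in the Mesos Filters message today; min_offer_mem_mb and min_offer_cpus are made-up names for the sort of thing I mean:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExtendedFilters:
    """Hypothetical extension of the Filters a framework returns when
    declining an offer. Only refuse_seconds exists in Mesos today."""
    refuse_seconds: float = 5.0              # existing: suppress re-offers this long
    min_offer_mem_mb: Optional[int] = None   # made up: skip fragments smaller than this
    min_offer_cpus: Optional[float] = None   # made up: minimum CPUs per offer

    def accepts(self, mem_mb: int, cpus: float) -> bool:
        """Would an offer of this size pass the filter?"""
        if self.min_offer_mem_mb is not None and mem_mb < self.min_offer_mem_mb:
            return False
        if self.min_offer_cpus is not None and cpus < self.min_offer_cpus:
            return False
        return True

f = ExtendedFilters(min_offer_mem_mb=20 * 1024)  # framework wants at least 20 GB
print(f.accepts(mem_mb=1024, cpus=4))        # 1 GB fragment: filtered out -> False
print(f.accepts(mem_mb=24 * 1024, cpus=4))   # big enough -> True
```

With something like this, the allocator could avoid sending a framework offers it is guaranteed to decline, which is where I suspect the scalability win would come from.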
Thanks,
Dharmesh

On Tue, Jun 30, 2015 at 7:15 PM, Alex Rukletsov <[email protected]> wrote:

> One option is to implement alternative behaviour in an allocator module.
>
> On Tue, Jun 30, 2015 at 3:34 PM, Dharmesh Kakadia <[email protected]> wrote:
>
>> Interesting.
>>
>> I agree that dynamic reservation and optimistic offers will help mitigate
>> the issue, but resource fragmentation (and starvation due to it) is a
>> more general problem. Predictive models can certainly aid the Mesos
>> scheduler here. I think the filters in Mesos can be extended to add more
>> general preferences like offer size, execution/predictive model, etc.
>> For the Mesos scheduler, the user should be able to configure which
>> filters it recognizes while making offers, which will also limit the
>> effect on scalability, as far as I understand. Thoughts?
>>
>> Thanks,
>> Dharmesh
>>
>> On Sun, Jun 28, 2015 at 7:29 PM, Alex Rukletsov <[email protected]> wrote:
>>
>>> Sharma,
>>>
>>> that's exactly what we plan to add to Mesos. Dynamic reservations will
>>> land in 0.23; the next step is to optimistically offer reserved but not
>>> yet used resources (we call them optimistic offers) to other frameworks
>>> as revocable. The alternative with one framework will of course work,
>>> but this implies having a general-purpose framework that does some work
>>> better done by Mesos (which has more information and can therefore make
>>> better decisions).
>>>
>>> On Wed, Jun 24, 2015 at 11:54 PM, Sharma Podila <[email protected]> wrote:
>>>
>>>> In a previous (more HPC-like) system I worked on, the scheduler did
>>>> "advance reservation" of resources, claiming bits and pieces it got
>>>> and holding on until all were available.
>>>> Say the last bit is expected to come in about 1 hour from now (and
>>>> this needs job runtime estimation/knowledge); any short jobs are
>>>> "backfilled" onto the advance-reserved resources that would otherwise
>>>> sit idle for an hour, to improve utilization. This was combined with
>>>> weights and priority-based job preemptions; sometimes 1 GB jobs are
>>>> higher priority than the 20 GB jobs. Unfortunately, that technique
>>>> doesn't lend itself natively to Mesos-based scheduling.
>>>>
>>>> One idea that may work in Mesos is (thinking aloud):
>>>>
>>>> - The large (20 GB) framework reserves 20 GB on some number of slaves
>>>>   (I am referring to dynamic reservations here, which aren't available
>>>>   yet).
>>>> - The small framework continues to use up 1 GB offers.
>>>> - When the large framework needs to run a job, it will have the 20 GB
>>>>   offers, since it has the reservation.
>>>> - When the large framework does not have any jobs running, the small
>>>>   framework may be given those resources, but those jobs will have to
>>>>   be preempted in order to offer 20 GB to the large framework.
>>>>
>>>> I understand this idea has some forward-looking expectations on how
>>>> dynamic reservations would/could work. Caveat: I haven't involved
>>>> myself closely with that feature definition, so I could be wrong about
>>>> my expectations.
>>>>
>>>> Until something like that lands, the existing static reservations
>>>> should, of course, work. But that reduces utilization drastically if
>>>> the large framework runs jobs sporadically.
>>>>
>>>> Another idea is to have one framework schedule both the 20 GB jobs and
>>>> the 1 GB jobs. Within the framework, it can bin-pack the 1 GB jobs
>>>> onto as small a number of slaves as possible. This increases the
>>>> likelihood of finding 20 GB free on a slave. Combining that with
>>>> preemptions from within the framework (a simple kill of a certain
>>>> number of 1 GB jobs) should satisfy the 20 GB jobs.
>>>>
>>>> On Wed, Jun 24, 2015 at 9:26 AM, Tim St Clair <[email protected]> wrote:
>>>>
>>>>> ----- Original Message -----
>>>>> > From: "Brian Candler" <[email protected]>
>>>>> > To: [email protected]
>>>>> > Sent: Wednesday, June 24, 2015 10:50:43 AM
>>>>> > Subject: Re: Setting minimum offer size
>>>>> >
>>>>> > On 24/06/2015 16:31, Alex Gaudio wrote:
>>>>> > > Does anyone have other ideas?
>>>>> > HTCondor deals with this by having a "defrag" daemon, which
>>>>> > periodically stops hosts accepting small jobs, so that it can
>>>>> > coalesce small slots into larger ones.
>>>>> >
>>>>> > http://research.cs.wisc.edu/htcondor/manual/latest/3_5Policy_Configuration.html#sec:SMP-defrag
>>>>>
>>>>> Yuppers, and guess who helped work on it ;-)
>>>>>
>>>>> > You can configure policies based on how many drained machines are
>>>>> > already available, and how many can be draining at once.
>>>>>
>>>>> It had to be done this way, as there was only so much sophistication
>>>>> you can put into scheduling before you start to add latency.
>>>>>
>>>>> > Maybe there would be a benefit if Mesos could work out the largest
>>>>> > job any framework has waiting to run, so it knows whether draining
>>>>> > is required and how far to drain down. This might take the form of
>>>>> > a message to the framework: "suppose I offered you all the
>>>>> > resources on the cluster, what is the largest single job you would
>>>>> > want to run, and which machine(s) could it run on?" Or something
>>>>> > like that.
>>>>> >
>>>>> > Regards,
>>>>> >
>>>>> > Brian.
>>>>>
>>>>> --
>>>>> Cheers,
>>>>> Timothy St. Clair
>>>>> Red Hat Inc.
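As a quick aside on Sharma's single-framework bin-packing idea above: packing each 1 GB job onto the already-most-loaded slave that still fits keeps whole slaves free for 20 GB jobs. A toy sketch (Python; slave names and the 24 GB capacity are invented for illustration):

```python
SLAVE_MEM_GB = 24

def place(jobs_gb, slaves_free):
    """Best-fit packing: for each job, pick the slave with the LEAST free
    memory that can still hold it, so other slaves stay as empty as
    possible (and thus available for 20 GB jobs)."""
    placement = {}
    for job in sorted(jobs_gb, reverse=True):  # biggest jobs first
        fits = sorted((s for s in slaves_free if slaves_free[s] >= job),
                      key=lambda s: slaves_free[s])  # tightest fit first
        if not fits:
            raise RuntimeError("no slave can hold a %d GB job" % job)
        s = fits[0]
        slaves_free[s] -= job
        placement.setdefault(s, []).append(job)
    return placement

free = {"s1": SLAVE_MEM_GB, "s2": SLAVE_MEM_GB, "s3": SLAVE_MEM_GB}
placement = place([1] * 20, free)               # twenty 1 GB jobs
print(placement)                                # all twenty land on s1
print([s for s, m in free.items() if m >= 20])  # s2 and s3 can still take 20 GB
```

Without packing (e.g., spreading jobs round-robin), those same twenty jobs would leave no slave with 20 GB free, which is exactly the fragmentation this thread started with.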

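And on the HTCondor "defrag" approach Brian and Tim describe: the core of such a policy is just choosing which hosts to stop offering to small jobs. A minimal sketch, assuming we drain the slaves that need the least reclaimed memory to reach a whole slot (all numbers and knobs here are invented, not HTCondor's actual policy expressions):

```python
def pick_slaves_to_drain(free_mem_gb, needed_slots, slot_gb=20, max_draining=2):
    """Return slave ids to start draining: slaves already closest to a
    whole slot_gb slot, capped at max_draining concurrent drains
    (mirroring HTCondor's 'how many can be draining at once' knob)."""
    # slaves that already have a whole slot free need no draining
    whole = {s for s, free in free_mem_gb.items() if free >= slot_gb}
    deficit = needed_slots - len(whole)
    if deficit <= 0:
        return []
    # among the rest, drain those with the MOST free memory first,
    # since they have the least running work left to wait out
    candidates = sorted((s for s in free_mem_gb if s not in whole),
                        key=lambda s: -free_mem_gb[s])
    return candidates[:min(deficit, max_draining)]

free = {"s1": 22, "s2": 15, "s3": 3, "s4": 18}
print(pick_slaves_to_drain(free, needed_slots=2))  # s1 is whole; drain s4
```

Brian's suggested "largest single job you would want to run" query is what would supply needed_slots and slot_gb here; today a framework has no standard way to tell Mesos that.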
