Interesting to see that HTCondor has a "defragmentation" feature; this kind
of thing has come up before for Mesos as well.

Specifically, adding Inverse Offers as a generic mechanism for obtaining
resources back from a framework unlocks a lot of functionality. The first
use case was cluster maintenance; defragmentation, enforcing a quota
change, etc. could also be done using inverse offers.
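
As a sketch of how a framework might react once inverse offers land (the
mechanism was still being designed at the time of this thread, so the
handler and the acceptInverseOffer call below are assumptions about a
possible API, not actual Mesos code; killTask does exist in today's driver):

    def handle_inverse_offer(inverse_offer, running_tasks, driver):
        """React to the master asking for one slave's resources back."""
        # Find our tasks on the slave whose resources are being reclaimed.
        victims = [t for t in running_tasks
                   if t.slave_id == inverse_offer.slave_id]
        for task in victims:
            # Framework-specific state save would go here.
            driver.killTask(task.task_id)  # killTask exists today
        # Hypothetical call: acknowledge the resources can be taken back.
        driver.acceptInverseOffer(inverse_offer.id)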

On Tue, Jun 30, 2015 at 12:00 PM, Sharma Podila <spod...@netflix.com> wrote:

> Having knowledge of the tasks pending in the frameworks, at least via
> offer filters specifying minimum resource sizes, could prove useful, and
> roles+weights would be complementary. This might remove the need to use
> dynamic reservations for every framework that uses more than the
> smallest-size resources. Starvation often ends up being addressed via a
> combination of "tricks", including reservations, priority/weight-based
> preemptions, and oversubscription of resources.
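>
> For illustration, a minimal sketch of how a framework can express a
> minimum size today with the Python bindings (the 300-second refusal and
> the 20GB threshold are arbitrary example values):
>
>     from mesos.interface import Scheduler, mesos_pb2
>
>     MIN_MEM_MB = 20 * 1024  # smallest offer this framework can use
>
>     class MinSizeScheduler(Scheduler):
>         def resourceOffers(self, driver, offers):
>             for offer in offers:
>                 mem = sum(r.scalar.value for r in offer.resources
>                           if r.name == "mem")
>                 if mem < MIN_MEM_MB:
>                     # Decline, and ask not to be re-offered this
>                     # slave's resources for a while.
>                     filters = mesos_pb2.Filters()
>                     filters.refuse_seconds = 300
>                     driver.declineOffer(offer.id, filters)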
>
> This may then tend to make frameworks relatively homogeneous in their
> task sizes, unless they also implement prioritization among their own
> tasks and mostly ask for offer sizes that fit their bigger tasks.
> Effectively, they become homogeneous in terms of the offer sizes they
> filter on.
>
> In general, the more diverse the resource requests, the more difficult
> the scheduling problem becomes.
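>
> A toy illustration of why (plain Python; the numbers are made up):
>
>     # Three 32GB slaves, each with 12GB free: 36GB free cluster-wide,
>     # yet no single offer can fit one 20GB task.
>     free_mem_gb = [12, 12, 12]
>     print(sum(free_mem_gb))                         # 36
>     print(any(free >= 20 for free in free_mem_gb))  # False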
>
>
> On Tue, Jun 30, 2015 at 7:25 AM, Dharmesh Kakadia <dhkaka...@gmail.com>
> wrote:
>
>> Yes, an alternative allocator module would be great in terms of
>> implementation, but adding more capabilities to "filters" might be
>> required to convey some more info to the Mesos scheduler/allocator. Am I
>> correct here, or are there already ways to convey such info?
>>
>> Thanks,
>> Dharmesh
>>
>> On Tue, Jun 30, 2015 at 7:15 PM, Alex Rukletsov <a...@mesosphere.com>
>> wrote:
>>
>>> One option is to implement alternative behaviour in an allocator module.
>>>
>>> On Tue, Jun 30, 2015 at 3:34 PM, Dharmesh Kakadia <dhkaka...@gmail.com>
>>> wrote:
>>>
>>>> Interesting.
>>>>
>>>> I agree that dynamic reservations and optimistic offers will help
>>>> mitigate the issue, but resource fragmentation (and the starvation due
>>>> to it) is a more general problem. Predictive models can certainly aid
>>>> the Mesos scheduler here. I think the filters in Mesos can be extended
>>>> to add more general preferences like offer size, execution/predictive
>>>> model, etc. For the Mesos scheduler, the user should be able to
>>>> configure which filters it recognizes while making offers, which would
>>>> also limit the effect on scalability, as far as I understand. Thoughts?
>>>>
>>>> Thanks,
>>>> Dharmesh
>>>>
>>>>
>>>>
>>>> On Sun, Jun 28, 2015 at 7:29 PM, Alex Rukletsov <a...@mesosphere.com>
>>>> wrote:
>>>>
>>>>> Sharma,
>>>>>
>>>>> that's exactly what we plan to add to Mesos. Dynamic reservations will
>>>>> land in 0.23; the next step is to optimistically offer reserved but
>>>>> not yet used resources (we call them optimistic offers) to other
>>>>> frameworks as revocable. The alternative with one framework will of
>>>>> course work, but this implies having a general-purpose framework that
>>>>> does some work that is better done by Mesos (which has more
>>>>> information and can therefore make better decisions).
>>>>>
>>>>> On Wed, Jun 24, 2015 at 11:54 PM, Sharma Podila <spod...@netflix.com>
>>>>> wrote:
>>>>>
>>>>>> In a previous (more HPC-like) system I worked on, the scheduler did
>>>>>> "advance reservation" of resources, claiming bits and pieces as it
>>>>>> got them and holding on until all were available. Say the last bit is
>>>>>> expected to arrive about 1 hour from now (this needs job runtime
>>>>>> estimation/knowledge); any short jobs are "backfilled" onto the
>>>>>> advance-reserved resources that would otherwise sit idle for that
>>>>>> hour, to improve utilization. This was combined with weight- and
>>>>>> priority-based job preemptions; sometimes 1GB jobs were higher
>>>>>> priority than the 20GB jobs. Unfortunately, that technique doesn't
>>>>>> map natively onto Mesos-based scheduling.
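>>>>>>
>>>>>> A toy sketch of that backfill decision (illustrative only; real
>>>>>> systems use much richer runtime models than a single estimate):
>>>>>>
>>>>>>     # Toy backfill check: a short job may borrow resources reserved
>>>>>>     # in advance for a big job, as long as it is expected to finish
>>>>>>     # before the big job's expected start. Names are illustrative.
>>>>>>     def can_backfill(short_job_runtime_est, reservation_starts_in):
>>>>>>         """Both arguments in seconds, from job runtime estimates."""
>>>>>>         return short_job_runtime_est <= reservation_starts_in
>>>>>>
>>>>>>     # A 30-minute job fits in front of a reservation due in 1 hour:
>>>>>>     print(can_backfill(30 * 60, 60 * 60))  # True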
>>>>>>
>>>>>> One idea that may work in Mesos (thinking aloud):
>>>>>>
>>>>>> - The large (20GB) framework reserves 20 GB on some number of slaves
>>>>>> (I am referring to dynamic reservations here, which aren't available
>>>>>> yet); a rough sketch follows this list.
>>>>>> - The small framework continues to use up 1GB offers.
>>>>>> - When the large framework needs to run a job, it will have the 20 GB
>>>>>> offers, since it has the reservation.
>>>>>> - When the large framework has no jobs running, the small framework
>>>>>> may be given those resources, but those jobs will have to be
>>>>>> preempted in order to offer 20 GB to the large framework.
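>>>>>>
>>>>>> A rough sketch of the reservation step, assuming the expected 0.23
>>>>>> API (dynamic reservations hadn't landed when this was written, so the
>>>>>> calls and the "large" role below are assumptions):
>>>>>>
>>>>>>     from mesos.interface import mesos_pb2
>>>>>>
>>>>>>     def reserve_20gb(driver, offer):
>>>>>>         # Build a 20 GB mem resource reserved for our role.
>>>>>>         resource = mesos_pb2.Resource()
>>>>>>         resource.name = "mem"
>>>>>>         resource.type = mesos_pb2.Value.SCALAR
>>>>>>         resource.scalar.value = 20 * 1024
>>>>>>         resource.role = "large"                        # assumed role
>>>>>>         resource.reservation.principal = "large-fw"    # assumed
>>>>>>
>>>>>>         operation = mesos_pb2.Offer.Operation()
>>>>>>         operation.type = mesos_pb2.Offer.Operation.RESERVE
>>>>>>         operation.reserve.resources.extend([resource])
>>>>>>
>>>>>>         # Accept the offer with a RESERVE operation, not a launch.
>>>>>>         driver.acceptOffers([offer.id], [operation])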
>>>>>>
>>>>>> I understand this idea has some forward-looking expectations of how
>>>>>> dynamic reservations would/could work. Caveat: I haven't involved
>>>>>> myself closely with that feature's definition, so I could be wrong in
>>>>>> my expectations.
>>>>>>
>>>>>> Until something like that lands, the existing static reservations
>>>>>> should, of course, work. But that reduces utilization drastically if
>>>>>> the large framework runs jobs only sporadically.
>>>>>>
>>>>>> Another idea is to have one framework schedule both the 20GB jobs and
>>>>>> the 1GB jobs. Within the framework, it can bin-pack the 1GB jobs onto
>>>>>> as few slaves as possible, which increases the likelihood of finding
>>>>>> 20GB free on a single slave. Combining that with preemption from
>>>>>> within the framework (a simple kill of a certain number of 1GB jobs)
>>>>>> should satisfy the 20 GB jobs.
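>>>>>>
>>>>>> A minimal sketch of that packing-plus-preemption logic (plain Python,
>>>>>> independent of any Mesos API; the dict shape is made up):
>>>>>>
>>>>>>     def place_small_job(agents):
>>>>>>         """Best-fit: pack a 1GB job onto the fullest slave that
>>>>>>         still fits it, keeping other slaves free for 20GB jobs."""
>>>>>>         candidates = [a for a in agents if a["free_gb"] >= 1]
>>>>>>         return min(candidates, key=lambda a: a["free_gb"],
>>>>>>                    default=None)
>>>>>>
>>>>>>     def cheapest_agent_to_free(agents, need_gb=20):
>>>>>>         """Preemption: pick the slave where freeing 20GB kills the
>>>>>>         fewest 1GB jobs."""
>>>>>>         return min(agents,
>>>>>>                    key=lambda a: max(0, need_gb - a["free_gb"]))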
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Jun 24, 2015 at 9:26 AM, Tim St Clair <tstcl...@redhat.com>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>> > From: "Brian Candler" <b.cand...@pobox.com>
>>>>>>> > To: user@mesos.apache.org
>>>>>>> > Sent: Wednesday, June 24, 2015 10:50:43 AM
>>>>>>> > Subject: Re: Setting minimum offer size
>>>>>>> >
>>>>>>> > On 24/06/2015 16:31, Alex Gaudio wrote:
>>>>>>> > > Does anyone have other ideas?
>>>>>>> > HTCondor deals with this by having a "defrag" daemon, which
>>>>>>> > periodically stops hosts from accepting small jobs, so that it can
>>>>>>> > coalesce small slots into larger ones.
>>>>>>> >
>>>>>>> >
>>>>>>> > http://research.cs.wisc.edu/htcondor/manual/latest/3_5Policy_Configuration.html#sec:SMP-defrag
>>>>>>> >
>>>>>>>
>>>>>>> Yuppers, and guess who helped work on it ;-)
>>>>>>>
>>>>>>> > You can configure policies based on how many drained machines are
>>>>>>> > already available, and how many can be draining at once.
>>>>>>> >
>>>>>>>
>>>>>>> It had to be done this way, as there is only so much sophistication
>>>>>>> you can put into scheduling before you start to add latency.
>>>>>>>
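>>>>>>>
>>>>>>> For anyone curious, a minimal defrag policy looks roughly like this
>>>>>>> (knob names are from the manual linked above; the values are just
>>>>>>> examples):
>>>>>>>
>>>>>>>     DAEMON_LIST = $(DAEMON_LIST) DEFRAG
>>>>>>>     # At most this many machines draining at any one time.
>>>>>>>     DEFRAG_MAX_CONCURRENT_DRAINING = 2
>>>>>>>     # Stop draining once this many whole machines are available.
>>>>>>>     DEFRAG_MAX_WHOLE_MACHINES = 4
>>>>>>>     # Rate-limit how often new machines start draining.
>>>>>>>     DEFRAG_DRAINING_MACHINES_PER_HOUR = 1.0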
>>>>>>> > Maybe there would be a benefit if Mesos could work out what is the
>>>>>>> > largest job any framework has waiting to run, so it knows whether
>>>>>>> > draining is required and how far to drain down. This might take
>>>>>>> > the form of a message to the framework: "suppose I offered you all
>>>>>>> > the resources on the cluster, what is the largest single job you
>>>>>>> > would want to run, and which machine(s) could it run on?" Or
>>>>>>> > something like that.
>>>>>>> >
>>>>>>> > Regards,
>>>>>>> >
>>>>>>> > Brian.
>>>>>>> >
>>>>>>> >
>>>>>>>
>>>>>>> --
>>>>>>> Cheers,
>>>>>>> Timothy St. Clair
>>>>>>> Red Hat Inc.
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
