One option is to implement alternative behaviour in an allocator module.

On Tue, Jun 30, 2015 at 3:34 PM, Dharmesh Kakadia <> wrote:

> Interesting.
> I agree that dynamic reservation and optimistic offers will help mitigate
> the issue, but the resource fragmentation (and starvation due to that) is a
> more general problem. Predictive models can certainly aid the Mesos
> scheduler here. I think the filters in Mesos can be extended to add more
> general preferences like the offer size, execution/predictive model etc.
> For the Mesos scheduler, the user should be able to configure which
> filters it recognizes while making offers, which will also limit the effect
> on scalability, as far as I understand. Thoughts?
> Thanks,
> Dharmesh
> On Sun, Jun 28, 2015 at 7:29 PM, Alex Rukletsov <>
> wrote:
>> Sharma,
>> that's exactly what we plan to add to Mesos. Dynamic reservations will
>> land in 0.23; the next step is to optimistically offer reserved but as yet
>> unused resources (we call them optimistic offers) to other frameworks as
>> revocable. The alternative with one framework will of course work, but this
>> implies having a general-purpose framework that does some work that is
>> better done by Mesos (which has more information and therefore can take
>> better decisions).
>> On Wed, Jun 24, 2015 at 11:54 PM, Sharma Podila <>
>> wrote:
>>> In a previous (more HPC like) system I worked on, the scheduler did
>>> "advance reservation" of resources, claiming bits and pieces it got and
>>> holding on until all were available. Say the last bit is expected to come
>>> in about 1 hour from now (and this needs job runtime estimation/knowledge),
>>> any short jobs are "back filled" on to the advance reserved resources that
>>> are sitting idle for an hour, to improve utilization. This was combined
>>> with weights and priority-based job preemptions; sometimes 1GB jobs are
>>> higher priority than the 20GB job. Unfortunately, that technique doesn't
>>> lend itself natively to Mesos-based scheduling.
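
The backfill step described above can be sketched roughly as follows. This is
only an illustration of the general technique, not code from that HPC system
or from Mesos: the function name, the tuple layout of the queue, and the use of
minutes and GB are all assumptions made for the example. It also relies on
runtime estimates being available, as the paragraph above notes.

```python
def backfill(idle_gb, window_minutes, queue):
    """Pick short jobs to run on resources held by an advance reservation.

    idle_gb:        memory already claimed for the large job but sitting idle
    window_minutes: time until the last piece of the reservation arrives
    queue:          (job_id, mem_gb, est_runtime_minutes) in priority order
                    (hypothetical layout, chosen for this sketch)
    """
    started = []
    for job_id, mem_gb, est_runtime in queue:
        # A job may be backfilled only if it is expected to finish before
        # the reservation is needed and it fits in what is still idle.
        if est_runtime <= window_minutes and mem_gb <= idle_gb:
            started.append(job_id)
            idle_gb -= mem_gb
    return started
```

For example, with 4 GB idle for the next 60 minutes, a 90-minute job is
skipped even if it fits, because it would still be running when the large job
is due to start.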
>>> One idea that may work in Mesos is (thinking aloud):
>>> - The large (20GB) framework reserves 20 GB on some number of slaves (I
>>> am referring to dynamic reservations here, which aren't available yet)
>>> - The small framework continues to use up 1GB offers.
>>> - When the large framework needs to run a job, it will have the 20 GB
>>> offers since it has the reservation.
>>> - When the large framework does not have any jobs running on it, the
>>> small framework may be given those resources, but, those jobs will have to
>>> be preempted in order to offer 20 GB to the large framework.
>>> I understand this idea has some forward looking expectations on how
>>> dynamic reservations would/could work. Caveat: I haven't involved myself
>>> closely with that feature definition, so could be wrong with my
>>> expectations.
>>> Until something like that lands, the existing static reservations, of
>>> course, should work. But, that reduces utilization drastically if the large
>>> framework runs jobs sporadically.
>>> Another idea is to have one framework schedule both the 20GB jobs and
>>> 1GB jobs. Within the framework, it can bin pack the 1GB jobs on to as small
>>> a number of slaves as possible. This increases the likelihood of finding
>>> 20GB on a slave. Combining that with preemptions from within the framework
>>> (a simple kill of a certain number of 1GB jobs) should satisfy the 20 GB jobs.
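
A minimal sketch of the single-framework idea above: bin pack the 1GB jobs
onto as few slaves as possible, then preempt as few of them as needed to fit a
20GB job. This is framework-internal bookkeeping only and does not use any
Mesos API; the Slave class, the function names, and the best-fit rule are
assumptions made for illustration.

```python
from dataclasses import dataclass

SMALL_GB = 1    # size of a small job
LARGE_GB = 20   # size of a large job


@dataclass
class Slave:
    name: str
    total_gb: int
    small_jobs: int = 0   # number of 1GB jobs running here
    large_jobs: int = 0   # number of 20GB jobs running here

    @property
    def free_gb(self) -> int:
        used = self.small_jobs * SMALL_GB + self.large_jobs * LARGE_GB
        return self.total_gb - used


def place_small(slaves):
    """Best fit: use the fullest slave that still has room, so that free
    memory stays concentrated on as few slaves as possible."""
    candidates = [s for s in slaves if s.free_gb >= SMALL_GB]
    if not candidates:
        return None
    target = min(candidates, key=lambda s: s.free_gb)
    target.small_jobs += 1
    return target


def place_large(slaves):
    """Place a 20GB job on the slave where the fewest 1GB jobs must be
    killed. Returns (slave, kills), or None if no slave can be freed up."""
    best = None
    for s in slaves:
        shortfall = max(0, LARGE_GB - s.free_gb)
        kills = -(-shortfall // SMALL_GB)   # ceiling division
        if kills > s.small_jobs:
            continue   # killing every small job here would still not fit
        if best is None or kills < best[1]:
            best = (s, kills)
    if best is None:
        return None
    slave, kills = best
    slave.small_jobs -= kills
    slave.large_jobs += 1
    return slave, kills
```

With two 24GB slaves and thirty 1GB jobs, the packing fills the first slave
completely and leaves 18GB free on the second, so a 20GB job needs only two
preemptions instead of twenty.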
>>> On Wed, Jun 24, 2015 at 9:26 AM, Tim St Clair <>
>>> wrote:
>>>> ----- Original Message -----
>>>> > From: "Brian Candler" <>
>>>> > To:
>>>> > Sent: Wednesday, June 24, 2015 10:50:43 AM
>>>> > Subject: Re: Setting minimum offer size
>>>> >
>>>> > On 24/06/2015 16:31, Alex Gaudio wrote:
>>>> > > Does anyone have other ideas?
>>>> > HTCondor deals with this by having a "defrag" daemon, which
>>>> > periodically stops hosts accepting small jobs, so that it can coalesce
>>>> > small slots into larger ones.
>>>> >
>>>> >
>>>> >
>>>> Yuppers, and guess who helped work on it ;-)
>>>> > You can configure policies based on how many drained machines are
>>>> > already available, and how many can be draining at once.
>>>> >
>>>> It had to be done this way, as there was only so much sophistication
>>>> you can put into scheduling before you start to add latency.
>>>> > Maybe there would be a benefit if Mesos could work out what is the
>>>> > largest job any framework has waiting to run, so it knows whether
>>>> > draining is required and how far to drain down.  This might take the
>>>> > form of a message to the framework: "suppose I offered you all the
>>>> > resources on the cluster, what is the largest single job you would
>>>> > want to run, and which machine(s) could it run on?"  Or something like
>>>> > that.
>>>> >
>>>> > Regards,
>>>> >
>>>> > Brian.
>>>> >
>>>> >
>>>> --
>>>> Cheers,
>>>> Timothy St. Clair
>>>> Red Hat Inc.
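
To make the draining policy discussed above concrete, here is a rough sketch
of one way such a decision could be made, given the size of the largest
pending job. It is not HTCondor's defrag logic and not a Mesos API: the
function name, the max_draining and target_ready parameters, and the rule of
draining the hosts with the most free memory first are all assumptions. It
also assumes every host is large enough to hold the job once drained.

```python
def pick_hosts_to_drain(free_gb_by_host, draining, largest_pending_gb,
                        max_draining=1, target_ready=1):
    """Return the hosts that should start draining now.

    free_gb_by_host:    mapping of host name to currently free memory in GB
    draining:           set of hosts that are already draining
    largest_pending_gb: size of the largest job any framework has waiting
    max_draining:       how many hosts may be draining at once
    target_ready:       how many hosts that fit the largest job we want
    """
    ready = [h for h, free in free_gb_by_host.items()
             if free >= largest_pending_gb]
    if len(ready) >= target_ready:
        return []   # enough hosts already fit the largest job

    budget = max_draining - len(draining)
    needed = target_ready - len(ready) - len(draining)
    count = max(0, min(budget, needed))

    # Hosts with the most free memory have the least left to drain.
    candidates = sorted(
        (h for h in free_gb_by_host if h not in draining and h not in ready),
        key=lambda h: free_gb_by_host[h],
        reverse=True,
    )
    return candidates[:count]
```

If no job larger than the current free space is waiting, nothing is drained,
which is the benefit of knowing the largest pending job in advance.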
