Hi Ben,

We've rolled that patch out (applied over 0.23.1) on our production cluster
and have seen little change; the master is still not sending any offers to
those frameworks. We did this upgrade online, so would there be any reason
the fix wouldn't have helped (other than it not being the cause)? Would we
need to restart the frameworks (so they get new IDs) to see the effect?

It's not that the master never sends them offers; it sends offers up to a
certain point (this happens for different types of frameworks, all using
libmesos) and then stops, regardless of how much free resource is
available. The free resources are offered to some frameworks, but not all.
Is there any way for us to do more introspection into the state of the
master / allocator to try and debug this? Right now we're at a bit of a
loss as to where to start digging in.
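
So far the only introspection we know of is the master's HTTP endpoints,
e.g. something along these lines (the host/port here are placeholders, and
we're assuming /master/state.json and /metrics/snapshot are still the right
endpoints on 0.23):

    # Per-framework allocations as reported by the master's state endpoint
    # (placeholder host/port).
    curl -s http://mesos-master:5050/master/state.json | python -m json.tool

    # Master counters/gauges, to watch how things change over time.
    curl -s http://mesos-master:5050/metrics/snapshot | python -m json.tool

Neither of those seems to expose the allocator / sorter's internal
ordering, though, which is what we'd really like to see.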

Much appreciated as always,

Tom.

On 18 February 2016 at 10:21, Tom Arnfeld <[email protected]> wrote:

> Hi Ben,
>
> I've only just seen your email! Really appreciate the reply, that's
> certainly an interesting bug and we'll try that patch and see how we get on.
>
> Cheers,
>
> Tom.
>
> On 29 January 2016 at 19:54, Benjamin Mahler <[email protected]> wrote:
>
>> Hi Tom,
>>
>> I suspect you may be tripping the following issue:
>> https://issues.apache.org/jira/browse/MESOS-4302
>>
>> Please have a read through this and see if it applies here. You may also
>> be able to apply the fix to your cluster to see if that helps things.
>>
>> Ben
>>
>> On Wed, Jan 20, 2016 at 10:19 AM, Tom Arnfeld <[email protected]> wrote:
>>
>>> Hey,
>>>
>>> I've noticed some interesting behaviour recently when we have lots of
>>> different frameworks connected to our Mesos cluster at once, all with a
>>> variety of different shares. Some of the frameworks don't get offered
>>> more resources (for long periods of time, hours even), leaving the
>>> cluster under-utilised.
>>>
>>> Here's an example state where we see this happen:
>>>
>>> Framework 1 - 13% (user A)
>>> Framework 2 - 22% (user B)
>>> Framework 3 - 4% (user C)
>>> Framework 4 - 0.5% (user C)
>>> Framework 5 - 1% (user C)
>>> Framework 6 - 1% (user C)
>>> Framework 7 - 1% (user C)
>>> Framework 8 - 0.8% (user C)
>>> Framework 9 - 11% (user D)
>>> Framework 10 - 7% (user C)
>>> Framework 11 - 1% (user C)
>>> Framework 12 - 1% (user C)
>>> Framework 13 - 6% (user E)
>>>
>>> In this example, there's another ~30% of the cluster that is
>>> unallocated, and it stays like this for a significant amount of time
>>> until something changes, perhaps another user joining and allocating
>>> the rest. Chunks of this spare resource are offered to some of the
>>> frameworks, but not all of them.
>>>
>>> I had always assumed that when lots of frameworks were involved, the
>>> frameworks willing to keep accepting resources indefinitely would
>>> eventually consume the remaining resources, once every other framework
>>> had rejected the offers.
>>>
>>> Could someone elaborate a little on how the DRF allocator / sorter
>>> handles this situation? Is it likely to be related to the frameworks
>>> running as different users? Is there a way to mitigate this?
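>>>
>>> For what it's worth, my (possibly wrong) mental model is the sketch
>>> below: the sorter orders frameworks by their current share, lowest
>>> first, and free resources are offered to the head of that list, so I'd
>>> expect the small frameworks under user C to be offered the spare ~30%
>>> well before frameworks 1, 2 or 9. The numbers are just the shares
>>> listed above, and this is obviously not the actual allocator code:
>>>
>>>     # Hypothetical sketch of flat DRF ordering (not Mesos code): the
>>>     # framework with the lowest dominant share is offered resources
>>>     # first.
>>>     shares = {1: 13, 2: 22, 3: 4, 4: 0.5, 5: 1, 6: 1, 7: 1,
>>>               8: 0.8, 9: 11, 10: 7, 11: 1, 12: 1, 13: 6}
>>>     for fw, share in sorted(shares.items(), key=lambda kv: kv[1]):
>>>         print("framework %d: %.1f%%" % (fw, share))
>>>
>>> If grouping by user adds another level on top of that, I can see how
>>> user C's aggregate (~17%) might change things, but I'm not sure of the
>>> details.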
>>>
>>> We're running version 0.23.1.
>>>
>>> Cheers,
>>>
>>> Tom.
>>>
>>
>>
>
