Re: Understanding Mesos Maintenance

Zameer Manji Fri, 03 Mar 2017 17:30:04 -0800

Ben,

Thanks for responding to my questions. I have a follow-up on #3.


I have a framework which accepts inverse offers but does not do anything to
the associated tasks. I noticed that the framework **does not** receive
another inverse offer within the allocation period. At what interval will
an inverse offer be resent to the framework if it was accepted? I took a
glance at `src/tests/master_maintenance_tests.cpp` and did not notice any
tests covering this.

Are you sure that inverse offers are resent after they have been accepted
but before the tasks are removed from the host?
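
To make the failover concern behind question 3 concrete, here is a minimal
sketch (Python, not the Mesos API) of a scheduler persisting what it learned
from an accepted inverse offer, so that draining could resume after a failover
even if no further inverse offer arrives. `DrainRecord`, `DrainStore`,
`on_inverse_offer` and `on_agent_drained` are hypothetical names, the JSON file
stands in for whatever durable store a framework already has, and the two
nanosecond fields are only assumed to mirror the `Unavailability` message.

```python
import json
import os
from dataclasses import asdict, dataclass
from typing import Dict, Optional


@dataclass
class DrainRecord:
    """What a scheduler might remember about one accepted inverse offer."""
    agent_id: str
    start_ns: int                # assumed: start of the unavailability window
    duration_ns: Optional[int]   # assumed: length of the window, may be unset


class DrainStore:
    """Hypothetical JSON-file store, a stand-in for a real durable store."""

    def __init__(self, path: str) -> None:
        self.path = path

    def load(self) -> Dict[str, DrainRecord]:
        # A missing file simply means nothing is being drained.
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as f:
            raw = json.load(f)
        return {agent: DrainRecord(**rec) for agent, rec in raw.items()}

    def save(self, records: Dict[str, DrainRecord]) -> None:
        # Write to a temporary file, then rename, so a crash mid-write
        # cannot leave a half-written state file behind.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({a: asdict(r) for a, r in records.items()}, f)
        os.replace(tmp, self.path)


def on_inverse_offer(store: DrainStore, agent_id: str, start_ns: int,
                     duration_ns: Optional[int] = None) -> DrainRecord:
    """Record the drain before replying, so a failover cannot lose it."""
    records = store.load()
    records[agent_id] = DrainRecord(agent_id, start_ns, duration_ns)
    store.save(records)
    return records[agent_id]


def on_agent_drained(store: DrainStore, agent_id: str) -> None:
    """Forget the agent once none of our tasks remain on it."""
    records = store.load()
    records.pop(agent_id, None)
    store.save(records)
```

If inverse offers do turn out to be resent while resources are still
allocated, a store like this becomes an optimisation rather than a
requirement: a restarted scheduler could simply wait for the next one.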


On Thu, Mar 2, 2017 at 4:14 PM, Benjamin Mahler <[email protected]> wrote:

> Hey Zameer, great questions. Let us know if there's anything you think
> could be improved or documented better.
>
> Re 1:
>
> The 'Viewing maintenance status' section of the documentation should
> clarify this:
> http://mesos.apache.org/documentation/latest/maintenance/
>
> Re 2:
>
> Both of these sound reasonable, but the scheduler should not accept the
> maintenance if it's not yet safe for the machine to be downed. Otherwise a
> task failure may be mistakenly interpreted as a go ahead to down the
> machine, despite the scheduler needing to get the task back running. If
> expensive or long running work needs to finish (e.g. migrate data, replace
> instances in a manner that doesn't violate SLA, etc.) then I would suggest
> waiting until the work completes safely before accepting.
>
> We likely need a third state like, TENTATIVELY_ACCEPT to signal to
> operators / mesos that the framework intends to comply, but hasn't finished
> whatever it needs to do yet for it to be safe to down the machine.
>
> Also, one of the challenges here is when to take the action. Should the
> scheduler prepare itself for maintenance as soon as it safely can? Or as
> late (but not too late!) as it safely can? If the scheduler runs
> long-running services, as soon as safely possible makes sense. If the
> scheduler runs short-running batch jobs, as late as safely possible
> provides work-conservation.
>
> Re 3:
>
> The framework will receive another inverse offer if the framework still
> has resources allocated on that agent. If receiving a regular offer for
> available resources on the agent, an 'Unavailability' [1] will be included
> if the machine is scheduled for maintenance, so that the scheduler can be
> aware of the maintenance when placing new work.
>
> Re 4:
>
> It's not possible currently, and it's the operator's responsibility (the
> intention was for "operator" to be maintenance tooling). Ideally we can add
> automation of this decision into mesos, if decision criteria that are widely
> applicable can be established (e.g. if nothing is running and all relevant
> frameworks have accepted). Feel free to file a ticket for this or any other
> improvements!
>
> Ben
>
> [1] https://github.com/apache/mesos/blob/8f487beb9f8aaed8f27b0404279b1a2f97672ba1/include/mesos/v1/mesos.proto#L1416-L1426
>
> On Wed, Mar 1, 2017 at 5:41 PM, Zameer Manji <[email protected]> wrote:
>
>> Hey,
>>
>> I'm trying to understand some nuances of the maintenance API. Here are my
>> questions:
>>
>> 1. The documentation mentions that accepting or declining an inverse
>> offer is a "hint" to the operator. How do operators view if a framework has
>> declined, accepted or ignored an inverse offer?
>>
>> 2. Should a framework accept an inverse offer and then start removing
>> tasks from an agent or should the framework only accept the inverse offer
>> after the removal of tasks is complete? I think the former makes sense, but
>> it implies that operators need to poll the state of the agent to ensure
>> there are no active tasks whereas the latter implies operators only need to
>> check if all inverse offers were accepted.
>>
>> 3. After accepting the inverse offer, will a framework get another
>> inverse offer for the same agent? Currently I'm trying to determine if
>> inverse offer information needs to be persisted so a framework can continue
>> its draining work between failovers or if it can just wait for an inverse
>> offer after starting up.
>>
>> 4. Is it possible for the agent to automatically transition from DRAIN to
>> DOWN if at the start of the unavailability period the agent is free of
>> tasks or is that still the operator's responsibility?
>>
>> --
>> Zameer Manji
>>
>
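
The advice quoted above under Re 2 and Re 3 can be sketched as two small
decision helpers. This is only an illustration in Python, not the Mesos
scheduler API: `Response`, `respond_to_inverse_offer` and
`fits_before_maintenance` are hypothetical names, `HOLD` simply means "do not
accept yet", and all times are assumed to be nanoseconds on one shared clock.

```python
from enum import Enum
from typing import Optional, Sequence


class Response(Enum):
    ACCEPT = "accept"   # safe for the operator to down the machine
    HOLD = "hold"       # keep draining, do not accept yet


def respond_to_inverse_offer(tasks_on_agent: Sequence[str]) -> Response:
    """Accept only once none of our tasks remain on the agent (Re 2)."""
    return Response.HOLD if tasks_on_agent else Response.ACCEPT


def fits_before_maintenance(now_ns: int, expected_runtime_ns: int,
                            unavailability_start_ns: Optional[int]) -> bool:
    """Decide whether new work can finish before maintenance starts (Re 3).

    `unavailability_start_ns` is None when the offer carries no
    unavailability, i.e. no maintenance is scheduled for the machine.
    """
    if unavailability_start_ns is None:
        return True
    return now_ns + expected_runtime_ns <= unavailability_start_ns
```

A batch-style scheduler could keep launching short tasks for as long as
`fits_before_maintenance` holds, which is the "as late as safely possible"
behaviour, while a service-style scheduler would more likely start draining as
soon as the first inverse offer arrives.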
