Re: Understanding Mesos Maintenance

Joseph Wu Fri, 03 Mar 2017 18:08:51 -0800

Inverse offers have the same offer cycle as normal offers.  They can
be Accepted/Declined with a timeout (default 5 seconds).


On Fri, Mar 3, 2017 at 5:29 PM, Zameer Manji <[email protected]> wrote:
> Ben,
>
> Thanks for responding to my questions. I have a follow up on #3.
>
> I have a framework which accepts inverse offers but does not do anything to
> the associated tasks. I noticed that the framework **does not** receive
> another inverse offer within the allocation period. At what interval will
> an inverse offer be resent to the framework if it was accepted? I took a
> glance at `src/tests/master_maintenance_tests.cpp` and did not notice any
> tests testing for this.
>
> Are you sure that inverse offers are resent after they have been accepted
> but before the tasks are removed from the host?
>
>
> On Thu, Mar 2, 2017 at 4:14 PM, Benjamin Mahler <[email protected]> wrote:
>>
>> Hey Zameer, great questions. Let us know if there's anything you think
>> could be improved or documented better.
>>
>> Re 1:
>>
>> The 'Viewing maintenance status' section of the documentation should
>> clarify this:
>> http://mesos.apache.org/documentation/latest/maintenance/
>>
>> Re 2:
>>
>> Both of these sound reasonable but the scheduler should not accept the
>> maintenance if it's not yet safe for the machine to be downed. Otherwise a
>> task failure may be mistakenly interpreted as a go ahead to down the
>> machine, despite the scheduler needing to get the task back running. If
>> expensive or long-running work needs to finish (e.g. migrate data, replace
>> instances in a manner that doesn't violate SLA, etc.) then I would suggest
>> waiting until the work completes safely before accepting.
>>
>> We likely need a third state, like TENTATIVELY_ACCEPT, to signal to
>> operators / mesos that the framework intends to comply, but hasn't finished
>> whatever it needs to do yet for it to be safe to down the machine.
>>
>> Also, one of the challenges here is when to take the action. Should the
>> scheduler prepare itself for maintenance as soon as it safely can? Or as
>> late (but not too late!) as it safely can? If the scheduler runs
>> long-running services, as soon as safely possible makes sense. If the
>> scheduler runs short-running batch jobs, as late as safely possible provides
>> work-conservation.
>>
>> Re 3:
>>
>> The framework will receive another inverse offer if the framework still
>> has resources allocated on that agent. If receiving a regular offer for
>> available resources on the agent, an 'Unavailability' [1] will be included
>> if the machine is scheduled for maintenance, so that the scheduler can be
>> aware of the maintenance when placing new work.
>>
>> Re 4:
>>
>> It's not possible currently, and it's the operator's responsibility (the
>> intention was for "operator" to be maintenance tooling). Ideally we can add
>> automation of this decision into mesos, if decision criteria that are widely
>> applicable can be established (e.g. if nothing is running and all relevant
>> frameworks have accepted). Feel free to file a ticket for this or any other
>> improvements!
>>
>> Ben
>>
>> [1]
>> https://github.com/apache/mesos/blob/8f487beb9f8aaed8f27b0404279b1a2f97672ba1/include/mesos/v1/mesos.proto#L1416-L1426
>>
>> On Wed, Mar 1, 2017 at 5:41 PM, Zameer Manji <[email protected]> wrote:
>>>
>>> Hey,
>>>
>>> I'm trying to understand some nuances of the maintenance API. Here are my
>>> questions:
>>>
>>> 1. The documentation mentions that accepting or declining an inverse
>>> offer is a "hint" to the operator. How do operators view if a framework has
>>> declined, accepted or ignored an inverse offer?
>>>
>>> 2. Should a framework accept an inverse offer and then start removing
>>> tasks from an agent or should the framework only accept the inverse offer
>>> after the removal of tasks is complete? I think the former makes sense, but
>>> it implies that operators need to poll the state of the agent to ensure
>>> there are no active tasks whereas the latter implies operators only need to
>>> check if all inverse offers were accepted.
>>>
>>> 3. After accepting the inverse offer, will a framework get another
>>> inverse offer for the same agent? Currently I'm trying to determine if
>>> inverse offer information needs to be persisted so a framework can continue
>>> its draining work between failovers or if it can just wait for an inverse
>>> offer after starting up.
>>>
>>> 4. Is it possible for the agent to automatically transition from DRAIN to
>>> DOWN if at the start of the unavailability period the agent is free of tasks
>>> or is that still the operator's responsibility?
>>>
>>> --
>>> Zameer Manji
>>>
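
For readers of this thread, the advice in Re 2 (accept only once it is
safe to down the machine) and the criterion floated in Re 4 (nothing
running and all relevant frameworks have accepted) can be sketched in a
few lines of Python. This is only an illustration under stated
assumptions: `InverseOffer`, `should_accept`, and `safe_to_down` are
hypothetical names, not part of the Mesos API, and the sketch assumes
the framework already tracks which of its tasks run on which agent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InverseOffer:
    """Hypothetical stand-in for an inverse offer; not the Mesos protobuf."""
    agent_id: str


def should_accept(offer, tasks_by_agent):
    """Re 2: accept only when none of our tasks remain on the agent.

    tasks_by_agent maps an agent id to the set of this framework's task
    ids on it (an assumed bookkeeping structure, not a Mesos type).
    """
    remaining = tasks_by_agent.get(offer.agent_id, set())
    return not remaining


def safe_to_down(running_tasks, accepted_by_framework):
    """Re 4: nothing is running and every relevant framework has accepted.

    accepted_by_framework maps a framework id to True/False. An empty
    mapping counts as "all accepted", which may or may not be desirable.
    """
    return not running_tasks and all(accepted_by_framework.values())
```

With this shape, a framework that is still draining simply does not
accept yet, so an acceptance can be read as "safe to down" rather than
"intends to comply". How the framework learns about the inverse offer
again after a failover is exactly the open question in Re 3 and is not
modelled here.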
