Inverse offers have the same offer cycle as normal offers. They can be Accepted/Declined with a timeout (default 5 seconds).
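
For illustration, here is a minimal sketch of what the accept/decline call bodies could look like when talking to the v1 scheduler HTTP API. The call types and field names reflect my reading of the v1 `scheduler.proto` and should be checked against your Mesos version. The helper name and the IDs are made up for the example, and I am assuming the timeout mentioned above corresponds to `filters.refuse_seconds`.

```python
import json


def inverse_offer_call(framework_id, inverse_offer_ids, accept, refuse_seconds=5.0):
    """Build the JSON body for an accept/decline inverse offers call.

    Hypothetical helper, not part of Mesos. Field names are assumed from
    the v1 scheduler.proto; verify them before relying on this.
    """
    call_type = "ACCEPT_INVERSE_OFFERS" if accept else "DECLINE_INVERSE_OFFERS"
    payload = {
        "inverse_offer_ids": [{"value": offer_id} for offer_id in inverse_offer_ids],
        # Assumption: this filter is the "timeout" (default 5 seconds).
        "filters": {"refuse_seconds": refuse_seconds},
    }
    return {
        "framework_id": {"value": framework_id},
        "type": call_type,
        # The payload key is the lower-cased call type, e.g. "accept_inverse_offers".
        call_type.lower(): payload,
    }


# Decline while it is not yet safe to down the machine.
body = inverse_offer_call("framework-1", ["inverse-offer-1"], accept=False)
print(json.dumps(body, indent=2))
```

The resulting dictionary would be POSTed as JSON to the master's scheduler endpoint on the existing subscription; that part is left out here.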

On Fri, Mar 3, 2017 at 5:29 PM, Zameer Manji <[email protected]> wrote:
> Ben,
>
> Thanks for responding to my questions. I have a follow up on #3.
>
> I have a framework which accepts inverse offers but does not do anything to
> the associated tasks. I noticed that the framework **does not** receive
> another inverse offer within the allocation period. At what interval will
> an inverse offer be resent to the framework if it was accepted? I took a
> glance at `src/tests/master_maintenance_tests.cpp` and did not notice any
> tests testing for this.
>
> Are you sure that inverse offers are resent after they have been accepted
> but before the tasks are removed from the host?
>
>
> On Thu, Mar 2, 2017 at 4:14 PM, Benjamin Mahler <[email protected]> wrote:
>>
>> Hey Zameer, great questions. Let us know if there's anything you think
>> could be improved or documented better.
>>
>> Re 1:
>>
>> The 'Viewing maintenance status' section of the documentation should
>> clarify this:
>> http://mesos.apache.org/documentation/latest/maintenance/
>>
>> Re 2:
>>
>> Both of these sound reasonable but the scheduler should not accept the
>> maintenance if it's not yet safe for the machine to be downed. Otherwise a
>> task failure may be mistakenly interpreted as a go ahead to down the
>> machine, despite the scheduler needing to get the task back running. If
>> expensive or long running work needs to finish (e.g. migrate data, replace
>> instances in a manner that doesn't violate SLA, etc.) then I would suggest
>> waiting until the work completes safely before accepting.
>>
>> We likely need a third state like, TENTATIVELY_ACCEPT to signal to
>> operators / mesos that the framework intends to comply, but hasn't finished
>> whatever it needs to do yet for it to be safe to down the machine.
>>
>> Also, one of the challenges here is when to take the action. Should the
>> scheduler prepare itself for maintenance as soon as it safely can? Or as
>> late (but not too late!) as it safely can? If the scheduler runs
>> long-running services, as soon as safely possible makes sense. If the
>> scheduler runs short running batch jobs, as late as safely possible provides
>> work-conservation.
>>
>> Re 3:
>>
>> The framework will receive another inverse offer if the framework still
>> has resources allocated on that agent. If receiving a regular offer for
>> available resources on the agent, an 'Unavailability' [1] will be included
>> if the machine is scheduled for maintenance, so that the scheduler can be
>> aware of the maintenance when placing new work.
>>
>> Re 4:
>>
>> It's not possible currently, and it's the operator's responsibility (the
>> intention was for "operator" to be maintenance tooling). Ideally we can add
>> automation of this decision into mesos, if decision criteria that is widely
>> applicable can be established (e.g. if nothing is running and all relevant
>> frameworks have accepted). Feel free to file a ticket for this or any other
>> improvements!
>>
>> Ben
>>
>> [1]
>> https://github.com/apache/mesos/blob/8f487beb9f8aaed8f27b0404279b1a2f97672ba1/include/mesos/v1/mesos.proto#L1416-L1426
>>
>> On Wed, Mar 1, 2017 at 5:41 PM, Zameer Manji <[email protected]> wrote:
>>>
>>> Hey,
>>>
>>> I'm trying to understand some nuances of the maintenance API. Here are my
>>> questions:
>>>
>>> 1. The documentation mentions that accepting or declining and inverse
>>> offer is a "hint" to the operator. How do operators view if a framework has
>>> declined, accepted or ignored an inverse offer?
>>>
>>> 2. Should a framework accept an inverse offer and then start removing
>>> tasks from an agent or should the framework only accept the inverse offer
>>> after the removal of tasks is complete? I think the former makes sense, but
>>> it implies that operators need to poll the state of the agent to ensure
>>> there are no active tasks whereas the latter implies operators only need to
>>> check if all inverse offers were accepted.
>>>
>>> 3. After accepting the inverse offer, will a framework get another
>>> inverse offer for the same agent? Currently I'm trying to determine if
>>> inverse offer information needs to be persisted so a framework can continue
>>> it's draining work between failovers or if it can just wait for an inverse
>>> offer after starting up.
>>>
>>> 4. Is it possible for the agent to automatically transition from DRAIN to
>>> DOWN if at the start of the unavailability period the agent is free of tasks
>>> or is that still the operator's responsibility?
>>>
>>> --
>>> Zameer Manji
>>>
>>> --
>>> Zameer Manji

