Re: Question on resource offers and framework failover

Sharma Podila Tue, 13 May 2014 10:18:21 -0700

Thanks for confirming that, Adam.


> , but it would be a good Mesos FAQ topic.

I was thinking it might also be good to add this to the documentation in
the code, either in mesos.proto or in MesosSchedulerDriver (mesos.proto
already refers to the latter for failover in the FrameworkID definition).

> If you were to try to persist the 'ephemeral' offers to another framework
> instance, and call launchTasks with one of the old offers, the master will
> respond with TASK_LOST ("Task launched with invalid offers"), since the
> master no longer knows about that offer

Strictly speaking, shouldn't this produce some kind of 'invalid offer'
response instead of a lost task? My scheduler, for example, handles a
TASK_LOST response differently from what I'd do for an invalid offer. An
invalid offer would just mean discarding the offer and retrying the launch
with a more recent one, whereas a TASK_LOST makes me (unnecessarily, in this
case) try to ensure that the task is actually lost and not still running on
a slave that got disconnected from the Mesos master. Not all environments
may need the distinction, but at least some do.
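The distinction asked for above can be sketched on the scheduler side. This is a minimal Python sketch, not Mesos code: `TaskStatus`, `Action`, and `classify_status` are hypothetical names, and it assumes the only available signal is the message text quoted above ("Task launched with invalid offers"), which is brittle if the master's wording ever changes.

```python
from dataclasses import dataclass
from enum import Enum

# Message text the master attaches, as quoted in this thread. Matching on it
# is an assumption and may break if the master's wording changes.
INVALID_OFFER_MESSAGE = "Task launched with invalid offers"


class Action(Enum):
    RETRY_WITH_NEW_OFFER = "retry_with_new_offer"  # cheap: drop stale offer, relaunch
    RECONCILE = "reconcile"                        # expensive: verify the task is gone
    IGNORE = "ignore"


@dataclass
class TaskStatus:
    # Hypothetical stand-in for the protobuf TaskStatus (id, state, message).
    task_id: str
    state: str
    message: str = ""


def classify_status(status: TaskStatus) -> Action:
    """Decide how to react to a status update (hypothetical policy)."""
    if status.state != "TASK_LOST":
        return Action.IGNORE
    if INVALID_OFFER_MESSAGE in status.message:
        # The task never reached a slave; the offer was simply stale.
        return Action.RETRY_WITH_NEW_OFFER
    # A genuine loss: the task may still be running on a disconnected slave.
    return Action.RECONCILE
```

A dedicated 'invalid offer' response from the master would let the first branch test a state or reason field instead of parsing free-form text.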

On Mon, May 12, 2014 at 11:12 PM, Adam Bordelon <a...@mesosphere.io> wrote:

> Correct, Sharma. I don't think this is documented anywhere yet, but it
> would be a good Mesos FAQ topic.
> When the master notices that the framework has exited or is deactivated,
> it disables the framework in the allocator so no new offers will be made to
> that framework, and removes any outstanding offers (but does not send a
> RescindResourceOfferMessage to the framework, since the framework is
> presumably failing over). When a framework reregisters, it is reactivated
> in the allocator and will start receiving new offers again.
> If you were to try to persist the 'ephemeral' offers to another framework
> instance, and call launchTasks with one of the old offers, the master will
> respond with TASK_LOST ("Task launched with invalid offers"), since the
> master no longer knows about that offer. So don't bother trying. :)
> Already running tasks (used offers) continue running, unless the framework
> failover timeout is exceeded.
>
>
> On Mon, May 12, 2014 at 5:38 PM, Sharma Podila <spod...@netflix.com> wrote:
>
>> My understanding is that when a framework fails over (either new instance
>> starts after previous one fails, or the same instance restarts), Mesos
>> master would automatically cancel any unused offers it had given to the
>> previous framework instance. This is a good thing. Can someone confirm this
>> to be the case? Is such an expectation documented somewhere? I did look at
>> master.cpp and I hope I interpreted it right.
>>
>> Effectively then, the offers are 'ephemeral' and don't need to be
>> persisted by the framework scheduler to pass along to another of its
>> instances that may take over as the leader after a failover.
>>
>> Thank you.
>>
>> Sharma
>>
>>
>
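The master-side behavior described in the reply above can be illustrated with a toy model. This is a hypothetical Python sketch, not the logic in master.cpp: `ToyMaster` and its methods are invented for illustration, `launch_tasks` returns a status tuple directly rather than sending an asynchronous status update, and the framework failover timeout for already running tasks is not modeled.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple


@dataclass
class ToyMaster:
    """Toy model (hypothetical) of master bookkeeping during framework failover."""
    active: Set[str] = field(default_factory=set)        # frameworks eligible for offers
    offers: Dict[str, str] = field(default_factory=dict)  # offer_id -> framework_id
    next_offer: int = 0

    def register(self, framework_id: str) -> None:
        # (Re)registration reactivates the framework in the allocator.
        self.active.add(framework_id)

    def make_offer(self, framework_id: str) -> Optional[str]:
        # Deactivated frameworks receive no new offers.
        if framework_id not in self.active:
            return None
        self.next_offer += 1
        offer_id = f"offer-{self.next_offer}"
        self.offers[offer_id] = framework_id
        return offer_id

    def deactivate(self, framework_id: str) -> None:
        # Stop offering and silently drop outstanding offers; no rescind is
        # sent because the framework is presumed to be failing over.
        self.active.discard(framework_id)
        self.offers = {o: f for o, f in self.offers.items() if f != framework_id}

    def launch_tasks(self, framework_id: str, offer_id: str,
                     task_id: str) -> Tuple[str, str, str]:
        # An offer the master no longer knows about yields TASK_LOST.
        if self.offers.get(offer_id) != framework_id:
            return (task_id, "TASK_LOST", "Task launched with invalid offers")
        del self.offers[offer_id]  # an offer can be used only once
        return (task_id, "TASK_STAGING", "")
```

For example, an offer obtained before `deactivate` and replayed by the re-registered framework comes back as TASK_LOST, while a fresh offer made after `register` launches normally. This is why persisting offers across a failover gains nothing.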
