Inline...

On Tue, Oct 21, 2014 at 12:52 PM, Benjamin Mahler <benjamin.mah...@gmail.com
> wrote:

> Inline.
>
> On Thu, Oct 16, 2014 at 7:43 PM, Sharma Podila <spod...@netflix.com>
> wrote:
>
>> Response inline, below.
>>
>> On Thu, Oct 16, 2014 at 5:41 PM, Benjamin Mahler <
>> benjamin.mah...@gmail.com> wrote:
>>
>>> Thanks for the thoughtful questions, I will take these into account in
>>> the document.
>>>
>>> Addressing each question in order:
>>>
>>> *(1) Why the retry?*
>>>
>>> It could be once per (re-)registration in the future.
>>>
>>> Some requests are temporarily unanswerable. For example, if reconciling
>>> task T on slave S, and slave S has not yet re-registered, we cannot reply
>>> until the slave is re-registered or removed. Also, if a slave is
>>> transitioning (being removed), we want to make sure that operation finishes
>>> before we can answer.
>>>
>>> It's possible to keep the request around and trigger an event once we
>>> can answer. However, we chose to drop and remain silent for these tasks.
>>> This is both for implementation simplicity and as a defense against OOMing
>>> from too many pending reconciliation requests.
>>>
>>
>> I was thinking that the state machine that maintains the state of tasks
>> always has answers for the current state. Therefore, I don't expect any
>> blocking. For example, if S hasn't yet re-registered, the state machine
>> must think that the state of T is still 'running' until either the slave
>> re-registers and reports the task as lost, or a timeout occurs after
>> which the master decides the slave is gone, at which point a new status
>> update can be sent. I don't see a reason why reconcile needs to wait
>> until the slave re-registers here. Maybe I am missing something else?
>> The same applies to transitioning: the state information is always
>> available, say, as running, until the transition happens. This results
>> in two status updates, but is always correct.
>>
>
> Task state in Mesos is persisted in the leaves of the system (the slaves)
> for scalability reasons. So when a new master starts up, it doesn't know
> anything about tasks; this state is bootstrapped from the slaves as they
> re-register. This interim period of state recovery is when frameworks may
> not receive answers to reconciliation requests, depending on whether the
> particular slave has re-registered.
>
> In your second case, once a slave is removed, we will send the LOST update
> for all non-terminal tasks on the slave. There's little benefit in
> replying to a reconciliation request while it's being removed, because LOST updates
> are coming shortly thereafter. You can think of these LOST updates as the
> reply to the reconciliation request, as far as the scheduler is concerned.
>
> I think the two takeaways here are:
>
> (1) Ultimately while it is possible to avoid the need for retries on the
> framework side, it introduces too much complexity in the master and gives
> us no flexibility in ignoring or dropping messages. Even in such a world,
> the retries would be a valid resiliency measure for frameworks to insulate
> themselves against anything being dropped.
>
> (2) For now, we want to encourage framework developers to think about
> these kinds of issues, we want them to implement their frameworks in a
> resilient manner. And so in general we haven't chosen to provide a crutch
> when it requires a lot of complexity in Mesos. Today we can't add these
> ergonomic improvements in the scheduler driver because it has no
> persistence. Hopefully as the project moves forward, we can have these
> kinds of framework-side ergonomic improvements contained in pure language
> bindings to Mesos. A nice stateful language binding can hide this from you.
> :)
>

OK. The one thought I have is that it could be somewhat useful for the
master to send back a (new) state such as 'PendingSlaveUpdate' instead of
going silent. That way the reconcile request finishes immediately, and the
framework would then retry later for tasks that got this state. Although,
figuring out the timeout after which to retry remains the same issue.
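Whichever way the master behaves, the framework side ends up with the same
shape: re-request the still-unanswered tasks on a growing interval. A
minimal sketch of that retry loop, assuming a hypothetical
`driver.reconcile(task_ids)` call standing in for the scheduler driver's
reconciliation request (the backoff constants are illustrative, not
prescribed by Mesos):

```python
import random


class ReconcileLoop:
    """Framework-side reconciliation retry with exponential backoff.

    The master may stay silent for tasks it cannot yet answer for (e.g.
    the slave has not re-registered), so the framework periodically
    re-requests only the tasks that have received no status update yet.
    """

    def __init__(self, driver, task_ids, base=2.0, cap=60.0):
        self.driver = driver          # assumed to expose reconcile(task_ids)
        self.pending = set(task_ids)  # tasks with no status update yet
        self.base = base              # initial delay in seconds
        self.cap = cap                # upper bound on the delay
        self.attempt = 0

    def next_delay(self):
        # Exponential backoff with jitter, capped so retries never stop
        # entirely but also never hammer the master.
        delay = min(self.cap, self.base * (2 ** self.attempt))
        return delay * (0.5 + random.random() / 2)

    def tick(self):
        # Re-request only the still-unanswered tasks; returns the delay
        # before the next tick.
        if self.pending:
            self.driver.reconcile(sorted(self.pending))
            self.attempt += 1
        return self.next_delay()

    def on_status_update(self, task_id):
        # A status update is the master's answer for that task; stop
        # retrying it.
        self.pending.discard(task_id)
```

The status update itself is what terminates the retry for a task, which
matches the thread's point that LOST updates (rather than a direct reply)
can serve as the answer to a reconciliation request.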

This brings up another question. Say a slave is 'missing' and hasn't
re-registered with the master yet. What is the expected behavior when the
framework asks the master to kill a task on that slave? Since the slave is
disconnected, the kill request can't be delivered to the executor on that
slave. Is the framework notified of this failure to deliver the kill
request?

This has implications for a framework's task reconciliation logic. After a
certain number of reconciliation attempts, a framework would want to treat
the task as terminally lost and resubmit a replacement. For safety, I'd
kill the existing task before resubmitting the replacement. I am guessing
frameworks should not assume guaranteed delivery of the kill request, so
it is possible that the task continues running after the slave reconnects,
which implies that the framework is now consuming double the resources for
the "same" task. I understand this is out of scope for the master, and
that tasks/frameworks should use external logic to guarantee only one
instance of a task runs; I just want to understand the expectations around
the kill request.
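The give-up-and-replace policy described above can be sketched as follows.
This is only an illustration of one way a framework might cope with
unguaranteed kill delivery: `driver.kill` and `driver.launch` are
hypothetical stand-ins for the scheduler driver's kill/launch calls, and
`MAX_RECONCILES` is an arbitrary threshold. The key idea is that each
incarnation of a logical task gets a unique task ID, so a stale copy that
resurfaces after its slave reconnects can be recognized and killed again:

```python
class TaskSupervisor:
    """Per-task policy: after too many silent reconciliations, kill the
    old task (best effort) and relaunch under a fresh unique ID.

    Kill delivery is NOT guaranteed while the slave is disconnected, so
    an old incarnation may still be running; unique per-incarnation IDs
    let us detect and re-kill it when it reports in.
    """

    MAX_RECONCILES = 5  # give up after this many unanswered attempts

    def __init__(self, driver, logical_id):
        self.driver = driver
        self.logical_id = logical_id
        self.incarnation = 0
        self.current_id = f"{logical_id}-{self.incarnation}"
        self.silent_attempts = 0

    def on_reconcile_silence(self):
        # Called when a reconciliation round produced no answer for
        # the current task.
        self.silent_attempts += 1
        if self.silent_attempts >= self.MAX_RECONCILES:
            # Best-effort kill of the (possibly still running) old
            # incarnation, then launch a replacement under a new ID.
            self.driver.kill(self.current_id)
            self.incarnation += 1
            self.current_id = f"{self.logical_id}-{self.incarnation}"
            self.silent_attempts = 0
            self.driver.launch(self.current_id)

    def on_status_update(self, task_id, state):
        if task_id != self.current_id and state == "TASK_RUNNING":
            # A stale incarnation resurfaced after its slave
            # reconnected: kill it so only the current one runs.
            self.driver.kill(task_id)
        elif task_id == self.current_id:
            self.silent_attempts = 0
```

This doesn't eliminate the double-resource window (nothing framework-side
can, without external coordination), but it bounds it: the stale copy is
killed as soon as it produces a status update.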

Thanks.

Sharma
