Re: Trying to get task reconciliation to work

David Greenberg Fri, 18 Apr 2014 05:05:16 -0700

Piggybacking onto this thread with a follow-up question: what happens if
you ask the master to reconcile some tasks that weren't launched by your
framework? Will you get messages saying those tasks are unknown or lost,
or will you get no response at all?


On Thursday, April 17, 2014, Sharma Podila <spod...@netflix.com> wrote:

> No problem, I have a better understanding now.
> And it was useful to see the three items you listed explicitly.
>
>
> On Thu, Apr 17, 2014 at 2:39 PM, Benjamin Mahler <
> benjamin.mah...@gmail.com> wrote:
>
> Good to see you were playing around with reconciliation. We should have
> made the current semantics clearer, especially in light of the fact that
> reconciliation is not fully implemented until one uses a strict registrar
> (likely 0.20.0).
>
> Think of reconciliation as the fallback mechanism to ensure that state is
> consistent. It's not designed to inform you of things you were already
> told (in this case, that the tasks were running), although we could
> consider sending updates even when task state remains the same.
>
>
> For the purpose of this conversation, let's say we're in the 0.20.0 world,
> operating with the registrar. And let's assume your goal is to build a
> highly available framework (I will be documenting how to do this for
> 0.20.0):
>
> (1) *When you receive a status update, you must persist this information
> before returning from the statusUpdate() callback*. Once you return from
> the callback, the driver will acknowledge the slave directly. Slaves will
> retry status update delivery *until* the acknowledgement is received from
> the scheduler driver in order to ensure that the framework processed the
> update.
>
> (2) *When you receive a "slave lost" signal, it means that your tasks
> that were running on that slave are in state TASK_LOST*, and any
> reconciliation you perform for these tasks will result in a reply of
> TASK_LOST. Most of the time we'll deliver these TASK_LOST automatically,
> but with a confluence of Master *and* Slave failovers, we are unaware of
> which tasks were running on the slave as we do not persist this information
> in the Master.
>
> (3) To guarantee that you have a consistent view of task states, *you
> must also periodically reconcile task state against the Master*. This is
> only because the delivery of the "slave lost" signal in (2) is not reliable
> (the Master could failover after removing a slave but before telling
> frameworks that the slave was lost).
>
> You'll notice that this model forces one to serially persist all status
> update changes. We are planning to expose mechanisms to allow "batch"
> acknowledgement of status updates in the lower-level API that benh has
> given talks about. With a lower-level API, it is possible to build more
> powerful libraries that hide many of these details!
>
> You'll also perhaps notice that only (1) and (3) are strictly required for
> consistency, but (2) is highly recommended as the vast majority of the time
> the "slave lost" signal will be delivered and you can take action quickly,
> without having to rely on periodic reconciliation.
>
> Please let me know if anything here was not clear!
>
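
The three rules above can be sketched as follows. This is only a minimal illustration, assuming a single-process scheduler that keeps task state in a local JSON file; it does not use the Mesos bindings, and `DurableTaskStore`, `SchedulerState`, and their method names are hypothetical names chosen for this example. In a real framework, `status_update` would be the body of the statusUpdate() callback, and the result of `tasks_to_reconcile` would be sent to the Master periodically.

```python
import json
import os
import tempfile

# States after which a task no longer needs reconciling (assumed set).
TERMINAL_STATES = {"TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_LOST"}


class DurableTaskStore:
    """Hypothetical store: task_id -> {"slave_id", "state"}, kept in a JSON file."""

    def __init__(self, path):
        self.path = path
        self.tasks = {}
        if os.path.exists(path):
            with open(path) as f:
                self.tasks = json.load(f)

    def put(self, task_id, slave_id, state):
        self.tasks[task_id] = {"slave_id": slave_id, "state": state}
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.tasks, f)
            f.flush()
            os.fsync(f.fileno())   # force the bytes to disk before we return
        os.replace(tmp, self.path)  # atomically swap in the new file


class SchedulerState:
    """Hypothetical holder for the framework-side logic of rules (1) to (3)."""

    def __init__(self, store):
        self.store = store

    def status_update(self, task_id, slave_id, state):
        # Rule (1): persist before returning, because returning from the
        # callback is what lets the driver acknowledge the update.
        self.store.put(task_id, slave_id, state)

    def slave_lost(self, slave_id):
        # Rule (2): every non-terminal task on that slave is now TASK_LOST.
        for task_id, task in list(self.store.tasks.items()):
            if task["slave_id"] == slave_id and task["state"] not in TERMINAL_STATES:
                self.store.put(task_id, slave_id, "TASK_LOST")

    def tasks_to_reconcile(self):
        # Rule (3): the non-terminal tasks to reconcile against the Master.
        return {
            task_id: task["state"]
            for task_id, task in self.store.tasks.items()
            if task["state"] not in TERMINAL_STATES
        }
```

A framework would call `tasks_to_reconcile()` on a timer and hand the result to the driver's reconcile call. Because every change is written through `put` before the handler returns, a restarted scheduler can reload the file and resume with the last acknowledged state.
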
>
> On Thu, Apr 17, 2014 at 1:47 PM, Sharma Podila <spod...@netflix.com> wrote:
>
> Should've looked at the code before sending the previous email...
> master/main.cpp confirmed what I needed to know. It doesn't look like I
> will be able to use reconcileTasks the way I thought I could. Effectively,
> the lack of a callback could mean either that the master agrees with the
> task state in the reconcile request, or that the task and/or slave is
> currently unknown, which makes it an unreliable source of data. I
> understand this is expected to improve later by leveraging the registrar,
> but I suspect there's more to it.
>
> I take it then that individual frameworks need to have their own
> mechanisms to ascertain the state of their tasks.
>
>
> On Thu, Apr 17, 2014 at 12:53 PM, Sharma Podila <spod...@netflix.com> wrote:
>
> Hello
>
>
