Re: Reconciliation Document

Whitney Sorenson Thu, 30 Oct 2014 15:28:20 -0700

Ben,

What's a reasonable initial timeout and cap for reconciliation when the #
of slaves and tasks involved is in the tens/hundreds?


I ask because in Singularity we are using a fixed 30 seconds and one user
saw it take 30 minutes to eventually reconcile 25 task statuses (after
seeing all slaves crash and a master failover -- although that's another
issue.)





On Tue, Oct 21, 2014 at 3:52 PM, Benjamin Mahler <[email protected]>
wrote:

> Inline.
>
> On Thu, Oct 16, 2014 at 7:43 PM, Sharma Podila <[email protected]>
> wrote:
>
>> Response inline, below.
>>
>> On Thu, Oct 16, 2014 at 5:41 PM, Benjamin Mahler <
>> [email protected]> wrote:
>>
>>> Thanks for the thoughtful questions, I will take these into account in
>>> the document.
>>>
>>> Addressing each question in order:
>>>
>>> *(1) Why the retry?*
>>>
>>> It could be once per (re-)registration in the future.
>>>
>>> Some requests are temporarily unanswerable. For example, if reconciling
>>> task T on slave S, and slave S has not yet re-registered, we cannot reply
>>> until the slave is re-registered or removed. Also, if a slave is
>>> transitioning (being removed), we want to make sure that operation finishes
>>> before we can answer.
>>>
>>> It's possible to keep the request around and trigger an event once we
>>> can answer. However, we chose to drop and remain silent for these tasks.
>>> This is both for implementation simplicity and as a defense against OOMing
>>> from too many pending reconciliation requests.
>>>
>>
>> I was thinking that the state machine that maintains the state of tasks
>> always has answers for the current state. Therefore, I don't expect any
>> blocking. For example, if S hasn't yet re-registered, the state machine
>> must think that the state of T is still 'running' until either the slave
>> re-registers and informs of the task being lost, or a timeout occurs after
>> which master decides the slave is gone. At which point a new status update
>> can be sent. I don't see a reason why reconcile needs to wait until slave
>> re-registers here. Maybe I am missing something else? Same with
>> transitioning... the state information is always available, say, as
>> running, until transition happens. This results in two status updates, but
>> always correct.
>>
>
> Task state in Mesos is persisted in the leaves of the system (the slaves)
> for scalability reasons. So when a new master starts up, it doesn't know
> anything about tasks; this state is bootstrapped from the slaves as they
> re-register. This interim period of state recovery is when frameworks may
> not receive answers to reconciliation requests, depending on whether the
> particular slave has re-registered.
>
> In your second case, once a slave is removed, we will send the LOST update
> for all non-terminal tasks on the slave. There's little benefit in replying
> to a reconciliation request while it's being removed, because LOST updates
> are coming shortly thereafter. You can think of these LOST updates as the
> reply to the reconciliation request, as far as the scheduler is concerned.
>
> I think the two takeaways here are:
>
> (1) Ultimately while it is possible to avoid the need for retries on the
> framework side, it introduces too much complexity in the master and gives
> us no flexibility in ignoring or dropping messages. Even in such a world,
> the retries would be a valid resiliency measure for frameworks to insulate
> themselves against anything being dropped.
>
> (2) For now, we want to encourage framework developers to think about
> these kinds of issues and to implement their frameworks in a
> resilient manner. And so in general we haven't chosen to provide a crutch
> when it requires a lot of complexity in Mesos. Today we can't add these
> ergonomic improvements in the scheduler driver because it has no
> persistence. Hopefully as the project moves forward, we can have these kinds
> of framework-side ergonomic improvements be contained in pure language
> bindings to Mesos. A nice stateful language binding can hide this from you.
> :)
>
>
>>
>>
>>
>>>
>>>
>>> *(2) Any time bound guarantees?*
>>>
>>> No guarantees on exact timing, but you are guaranteed to eventually
>>> receive an answer.
>>>
>>> This is why exponential backoff is important, to tolerate variability in
>>> timing and avoid snowballing if a backlog ever occurs.
>>>
>>> For suggesting an initial timeout, I need to digress a bit. Currently
>>> the driver does not explicitly expose the event queue to the scheduler, and
>>> so when you call reconcile, you may have an event queue in the driver full
>>> of status updates. Because of this lack of visibility, picking an initial
>>> timeout will depend on your scheduler's update processing speed and scale
>>> (# expected status updates). Again, backoff is recommended to handle this.
>>>
>>> We were considering exposing Java bindings for the newer Event/Call API.
>>> It makes the queue explicit, which lets you avoid reconciling while you
>>> have a queue full of updates.
>>>
>>> Here is what the C++ interface looks like:
>>>
>>> https://github.com/apache/mesos/blob/0.20.1/include/mesos/scheduler.hpp#L478
>>>
>>> Does this interest you?
>>>
>>
>> I am interpreting this (correct me as needed) to mean that the Java
>> callback statusUpdate() receives a queue instead of the current version
>> with just one TaskStatus argument? I suppose this could be useful, yes. In
>> that case, the acknowledgement of receiving the task statuses is sent to
>> master once for the entire queue of task statuses. Which may be OK.
>>
>
> You would always receive a queue of events, which you can store and
> process asynchronously (the key to enabling this was making
> acknowledgements explicit). Sorry for the tangent, keep an eye out for
> discussions related to the new API / HTTP API changes.
>
>
>>
>>
>>
>>>
>>>
>>> *(3) After timeout with no answer, I would be tempted to kill the task.*
>>>
>>> You will eventually receive an answer, so if you decide to kill the task
>>> because you have not received an answer soon enough, you may make the wrong
>>> decision. This is up to you.
>>>
>>> In particular, I would caution against making decisions without feedback
>>> because it can lead to a snowball effect if tasks are treated
>>> independently. In the event of a backlog, what's to stop you from killing
>>> all tasks because you haven't received any answers?
>>>
>>> I would recommend that you only use this kind of timeout as a last
>>> resort, when not receiving a response after a large amount of time and a
>>> large number of reconciliation requests.
>>>
>>
>> Yes, that is the timeout value I was after. However, based on my
>> response to #1, this could be short, couldn't it?
>>
>
> Yes it could be on the order of seconds to start with.
>
>
>>
>>
>>
>>>
>>>
>>> *(4) Does rate limiting affect this?*
>>>
>>> When enabled, rate limiting currently only operates on the rate of
>>> incoming messages from a particular framework, so the number of updates
>>> sent back has no effect on the limiting.
>>>
>>
>> That sounds good. Although, just to be paranoid, what if there's a
>> problematic framework that restarts frequently (due to a bug, for
>> example)? This would keep Mesos master busy sending reconcile task updates
>> to it constantly.
>>
>
> You're right, it's an orthogonal problem to address since it applies
> broadly to other messages (e.g. framework sending 100MB tasks).
>
>
>>
>> Thanks.
>>
>> Sharma
>>
>>
>>
>>>
>>>
>>> On Wed, Oct 15, 2014 at 3:22 PM, Sharma Podila <[email protected]>
>>> wrote:
>>>
>>>> Looks like a good step forward.
>>>>
>>>> What is the reason for the algorithm having to call reconcile tasks
>>>> multiple times after waiting some time in step 6? Shouldn't it be just once
>>>> per (re)registration?
>>>>
>>>
>>>> Are there time bound guarantees within which a task update will be sent
>>>> out after a reconcile request is sent? In the algorithm for task
>>>> reconciliation, what would be a good timeout after which we conclude that
>>>> we got no task update from the master? Upon such a timeout, I would be
>>>> tempted to conclude that the task has disappeared. In which case, I would
>>>> call driver.killTask() (to be sure it's marked as gone), mark my task as
>>>> terminated, then submit a replacement task.
>>>>
>>>> Does the "rate limiting" feature (in the works?) affect task
>>>> reconciliation due to the volume of task updates sent back?
>>>>
>>>> Thanks.
>>>>
>>>>
>>>> On Wed, Oct 15, 2014 at 2:05 PM, Benjamin Mahler <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I've sent a review out for a document describing reconciliation, you
>>>>> can see the draft here:
>>>>> https://gist.github.com/bmahler/18409fc4f052df43f403
>>>>>
>>>>> Would love to gather high level feedback on it from framework
>>>>> developers. Feel free to reply here, or on the review:
>>>>> https://reviews.apache.org/r/26669/
>>>>>
>>>>> Thanks!
>>>>> Ben
>>>>>
>>>>
>>>>
>>>
>>
>
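
The retry behaviour discussed in this thread (re-request reconciliation for
unanswered tasks, with an exponentially growing timeout up to a cap) can be
sketched roughly as below. This is only an illustration, not Mesos or
Singularity code: `backoff_delays`, `reconcile`, and the `request_and_wait`
hook are hypothetical names, and the 5 second initial timeout, 300 second cap,
and attempt limit are assumed values, not recommendations from the thread
(which only suggests starting "on the order of seconds").

```python
from typing import Callable, FrozenSet, Iterable, Iterator, Set


def backoff_delays(initial: float, cap: float, factor: float = 2.0) -> Iterator[float]:
    """Yield timeouts in seconds: initial, initial*factor, ... never above cap."""
    if initial <= 0 or cap < initial or factor <= 1:
        raise ValueError("need 0 < initial <= cap and factor > 1")
    delay = initial
    while True:
        yield delay
        delay = min(delay * factor, cap)


def reconcile(
    pending: Iterable[str],
    request_and_wait: Callable[[FrozenSet[str], float], Set[str]],
    initial: float = 5.0,      # assumed starting timeout
    cap: float = 300.0,        # assumed upper bound on the timeout
    max_attempts: int = 20,    # assumed limit before giving up on this pass
) -> Set[str]:
    """Retry reconciliation for unanswered tasks with exponential backoff.

    request_and_wait(task_ids, timeout) is a hypothetical hook supplied by the
    framework: it sends a reconciliation request for task_ids, waits up to
    timeout seconds, and returns the ids that received a status update.
    Returns the task ids that are still unanswered after max_attempts.
    """
    remaining = set(pending)
    delays = backoff_delays(initial, cap)
    attempts = 0
    while remaining and attempts < max_attempts:
        timeout = next(delays)
        # Only tasks that are still unanswered are requested again.
        remaining -= request_and_wait(frozenset(remaining), timeout)
        attempts += 1
    return remaining
```

Tasks still left in the returned set have simply not been answered yet. As
cautioned in the thread, treating them as lost and killing them should be a
last resort rather than an automatic next step.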
