Re: Reconciliation Document

Sharma Podila Thu, 16 Oct 2014 19:44:47 -0700

Response inline, below.

On Thu, Oct 16, 2014 at 5:41 PM, Benjamin Mahler <[email protected]>
wrote:


> Thanks for the thoughtful questions, I will take these into account in the
> document.
>
> Addressing each question in order:
>
> *(1) Why the retry?*
>
> It could be once per (re-)registration in the future.
>
> Some requests are temporarily unanswerable. For example, if reconciling
> task T on slave S, and slave S has not yet re-registered, we cannot reply
> until the slave is re-registered or removed. Also, if a slave is
> transitioning (being removed), we want to make sure that operation finishes
> before we can answer.
>
> It's possible to keep the request around and trigger an event once we can
> answer. However, we chose to drop and remain silent for these tasks. This
> is both for implementation simplicity and as a defense against OOMing from
> too many pending reconciliation requests.
>

I was thinking that the state machine that maintains the state of tasks
always has an answer for the current state, so I don't expect any
blocking. For example, if S hasn't yet re-registered, the state machine
must think that the state of T is still 'running' until either the slave
re-registers and reports the task as lost, or a timeout occurs after
which the master decides the slave is gone, at which point a new status
update can be sent. I don't see a reason why reconcile needs to wait until
the slave re-registers here. Maybe I am missing something else? The same
goes for transitioning: the state information is always available, say, as
running, until the transition happens. This results in two status updates,
but they are always correct.
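
To make the two behaviors under discussion concrete, here is a minimal
sketch. It is not Mesos code: the SlaveState enum, both function names, and
the idea that the master keeps a last-known state per task are assumptions
made only for illustration. The first function drops requests it cannot yet
answer (the behavior described in the quoted text); the second always
answers from the last-known state (the alternative I am asking about).

```python
from enum import Enum
from typing import Optional


class SlaveState(Enum):
    """Hypothetical view of a slave as seen by the master."""
    REGISTERED = "registered"
    RECOVERING = "recovering"        # not yet re-registered
    TRANSITIONING = "transitioning"  # removal in progress
    REMOVED = "removed"


def answer_drop_when_uncertain(slave: SlaveState,
                               last_known: str) -> Optional[str]:
    """Stay silent (None) while the slave's fate is undecided."""
    if slave in (SlaveState.RECOVERING, SlaveState.TRANSITIONING):
        return None  # request is dropped; the framework must retry
    if slave is SlaveState.REMOVED:
        return "TASK_LOST"
    return last_known


def answer_from_last_known_state(slave: SlaveState,
                                 last_known: str) -> str:
    """Always answer; a later update corrects the state if it changes."""
    if slave is SlaveState.REMOVED:
        return "TASK_LOST"
    return last_known
```

With the second policy the framework may see TASK_RUNNING followed later by
TASK_LOST, which is the "two status updates" case mentioned above.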



>
>
> *(2) Any time bound guarantees?*
>
> No guarantees on exact timing, but you are guaranteed to eventually
> receive an answer.
>
> This is why exponential backoff is important, to tolerate variability in
> timing and avoid snowballing if a backlog ever occurs.
>
> For suggesting an initial timeout, I need to digress a bit. Currently the
> driver does not explicitly expose the event queue to the scheduler, and so
> when you call reconcile, you may have an event queue in the driver full of
> status updates. Because of this lack of visibility, picking an initial
> timeout will depend on your scheduler's update processing speed and scale
> (# expected status updates). Again, backoff is recommended to handle this.
>
> We were considering exposing Java bindings for the newer Event/Call API.
> It makes the queue explicit, which lets you avoid reconciling while you
> have a queue full of updates.
>
> Here is what the C++ interface looks like:
>
> https://github.com/apache/mesos/blob/0.20.1/include/mesos/scheduler.hpp#L478
>
> Does this interest you?
>

I am interpreting this (correct me as needed) to mean that the Java
callback statusUpdate() would receive a queue instead of the current
single TaskStatus argument? I suppose this could be useful, yes. In that
case, the acknowledgement for the task statuses would be sent to the
master once for the entire queue, which may be OK.



>
>
> *(3) After timeout with no answer, I would be tempted to kill the task.*
>
> You will eventually receive an answer, so if you decide to kill the task
> because you have not received an answer soon enough, you may make the wrong
> decision. This is up to you.
>
> In particular, I would caution against making decisions without feedback
> because it can lead to a snowball effect if tasks are treated
> independently. In the event of a backlog, what's to stop you from killing
> all tasks because you haven't received any answers?
>
> I would recommend that you only use this kind of timeout as a last resort,
> when not receiving a response after a large amount of time and a large
> number of reconciliation requests.
>

Yes, that is the timeout value I was after. However, based on my response
to #1, this could be short, couldn't it?
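
For reference, the retry-with-backoff loop being discussed could look
roughly like the sketch below. Everything here is hypothetical: the function
names, the default delays, and the `send_reconcile` callable (which stands in
for whatever the driver exposes) are assumptions, not the Mesos API. The
`pending` set is expected to be shrunk elsewhere, by the status update
callback, as answers arrive.

```python
import time
from typing import Callable, Iterator, Set


def backoff_delays(initial: float, factor: float,
                   maximum: float) -> Iterator[float]:
    """Yield exponentially growing delays, capped at `maximum`."""
    delay = initial
    while True:
        yield min(delay, maximum)
        delay *= factor


def reconcile_until_answered(
    pending: Set[str],
    send_reconcile: Callable[[Set[str]], None],
    max_attempts: int,
    initial_delay: float = 1.0,
    factor: float = 2.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Set[str]:
    """Re-request unanswered tasks, backing off between attempts.

    Returns the task IDs still unanswered after `max_attempts`; only
    those are candidates for a last-resort decision.
    """
    delays = backoff_delays(initial_delay, factor, max_delay)
    for _ in range(max_attempts):
        if not pending:
            break
        send_reconcile(set(pending))  # ask again only for what is missing
        sleep(next(delays))
    return set(pending)
```

The point of returning the leftover set instead of killing inside the loop
is that the "give up" decision stays separate from the retry logic and is
only taken after many requests, not after a single short timeout.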



>
>
> *(4) Does rate limiting affect this?*
>
> When enabled, rate limiting currently only operates on the rate of
> incoming messages from a particular framework, so the number of updates
> sent back has no effect on the limiting.
>

That sounds good. Although, just to be paranoid, what if there's a
problematic framework that restarts frequently (due to a bug, for
example)? This would keep the Mesos master busy, constantly sending
reconcile task updates to it.
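
To illustrate what "limiting incoming messages per framework" means, here is
a minimal token-bucket sketch. It is an assumption-laden illustration, not
how Mesos implements rate limiting: the class name, its parameters, and the
choice to reject (rather than delay) excess messages are all hypothetical.
Note that nothing in it counts the updates sent back to the framework.

```python
from collections import defaultdict
from typing import Dict


class IncomingRateLimiter:
    """Hypothetical per-framework token bucket for incoming messages."""

    def __init__(self, rate_per_sec: float, burst: float) -> None:
        self.rate = rate_per_sec
        self.burst = burst
        # Each framework starts with a full bucket.
        self._tokens: Dict[str, float] = defaultdict(lambda: burst)
        self._last: Dict[str, float] = {}

    def allow(self, framework_id: str, now: float) -> bool:
        """Return True if a message from `framework_id` may be processed."""
        last = self._last.get(framework_id, now)
        self._last[framework_id] = now
        # Refill in proportion to elapsed time, never above the burst size.
        tokens = min(self.burst,
                     self._tokens[framework_id] + (now - last) * self.rate)
        if tokens >= 1.0:
            self._tokens[framework_id] = tokens - 1.0
            return True
        self._tokens[framework_id] = tokens
        return False
```

Under a scheme like this, a framework that keeps restarting and
re-requesting reconciliation would be throttled on its requests, while the
volume of replies is bounded only indirectly.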

Thanks.

Sharma



>
>
> On Wed, Oct 15, 2014 at 3:22 PM, Sharma Podila <[email protected]>
> wrote:
>
>> Looks like a good step forward.
>>
>> What is the reason for the algorithm having to call reconcile tasks
>> multiple times after waiting some time in step 6? Shouldn't it be just once
>> per (re)registration?
>>
>
>> Are there time bound guarantees within which a task update will be sent
>> out after a reconcile request is sent? In the algorithm for task
>> reconciliation, what would be a good timeout after which we conclude that
>> we got no task update from the master? Upon such a timeout, I would be
>> tempted to conclude that the task has disappeared. In which case, I would
>> call driver.killTask() (to be sure it's marked as gone), mark my task as
>> terminated, then submit a replacement task.
>>
>> Does the "rate limiting" feature (in the works?) affect task
>> reconciliation due to the volume of task updates sent back?
>>
>> Thanks.
>>
>>
>> On Wed, Oct 15, 2014 at 2:05 PM, Benjamin Mahler <
>> [email protected]> wrote:
>>
>>> Hi all,
>>>
>>> I've sent a review out for a document describing reconciliation, you can
>>> see the draft here:
>>> https://gist.github.com/bmahler/18409fc4f052df43f403
>>>
>>> Would love to gather high level feedback on it from framework
>>> developers. Feel free to reply here, or on the review:
>>> https://reviews.apache.org/r/26669/
>>>
>>> Thanks!
>>> Ben
>>>
>>
>>
>
