Ben, What's a reasonable initial timeout and cap for reconciliation when the # of slaves and tasks involved is in the tens/hundreds?
I ask because in Singularity we are using a fixed 30 seconds and one user saw it take 30 minutes to eventually reconcile 25 task statuses (after seeing all slaves crash and a master failover -- although that's another issue.) On Tue, Oct 21, 2014 at 3:52 PM, Benjamin Mahler <[email protected]> wrote: > Inline. > > On Thu, Oct 16, 2014 at 7:43 PM, Sharma Podila <[email protected]> > wrote: > >> Response inline, below. >> >> On Thu, Oct 16, 2014 at 5:41 PM, Benjamin Mahler < >> [email protected]> wrote: >> >>> Thanks for the thoughtful questions, I will take these into account in >>> the document. >>> >>> Addressing each question in order: >>> >>> *(1) Why the retry?* >>> >>> It could be once per (re-)registration in the future. >>> >>> Some requests are temporarily unanswerable. For example, if reconciling >>> task T on slave S, and slave S has not yet re-registered, we cannot reply >>> until the slave is re-registered or removed. Also, if a slave is >>> transitioning (being removed), we want to make sure that operation finishes >>> before we can answer. >>> >>> It's possible to keep the request around and trigger an event once we >>> can answer. However, we chose to drop and remain silent for these tasks. >>> This is both for implementation simplicity and as a defense against OOMing >>> from too many pending reconciliation requests. >>> >> >> I was thinking that the state machine that maintains the state of tasks >> always has answers for the current state. Therefore, I don't expect any >> blocking. For example, if S hasn't yet re-registered. the state machine >> must think that the state of T is still 'running' until either the slave >> re-registers and informs of the task being lost, or a timeout occurs after >> which master decides the slave is gone. At which point a new status update >> can be sent. I don't see a reason why reconcile needs to wait until slave >> re-registers here. Maybe I am missing something else? Same with >> transitioning... the state information is always available, say, as >> running, until transition happens. This results in two status updates, but >> always correct. >> > > Task state in Mesos is persisted in the leaves of the system (the slaves) > for scalability reasons. So when a new master starts up, it doesn't know > anything about tasks; this state is bootstrapped from the slaves as they > re-register. This interim period of state recovery is when frameworks may > not receive answers to reconciliation requests, depending on whether the > particular slave has re-registered. > > In your second case, once a slave is removed, we will send the LOST update > for all non-terminal tasks on the slave. There's little benefit of replying > to a reconciliation request while it's being removed, because LOST updates > are coming shortly thereafter. You can think of these LOST updates as the > reply to the reconciliation request, as far as the scheduler is concerned. > > I think the two takeaways here are: > > (1) Ultimately while it is possible to avoid the need for retries on the > framework side, it introduces too much complexity in the master and gives > us no flexibility in ignoring or dropping messages. Even in such a world, > the retries would be a valid resiliency measure for frameworks to insulate > themselves against anything being dropped. > > (2) For now, we want to encourage framework developers to think about > these kinds of issues, we want them to implement their frameworks in a > resilient manner. And so in general we haven't chosen to provide a crutch > when it requires a lot of complexity in Mesos. Today we can't add these > ergonomic improvements in the scheduler driver because it has no > persistence. Hopefully as the project moves forward, we can have these kind > of framework side ergonomic improvements be contained in pure language > bindings to Mesos. A nice stateful language binding can hide this from you. > :) > > >> >> >> >>> >>> >>> *(2) Any time bound guarantees?* >>> >>> No guarantees on exact timing, but you are guaranteed to eventually >>> receive an answer. >>> >>> This is why exponential backoff is important, to tolerate variability in >>> timing and avoid snowballing if a backlog ever occurs. >>> >>> For suggesting an initial timeout, I need to digress a bit. Currently >>> the driver does not explicitly expose the event queue to the scheduler, and >>> so when you call reconcile, you may have an event queue in the driver full >>> of status updates. Because of this lack of visibility, picking an initial >>> timeout will depend on your scheduler's update processing speed and scale >>> (# expected status updates). Again, backoff is recommended to handle this. >>> >>> We were considering exposing Java bindings for the newer Event/Call API. >>> It makes the queue explicit, which lets you avoid reconciling while you >>> have a queue full of updates. >>> >>> Here is what the C++ interface looks like: >>> >>> https://github.com/apache/mesos/blob/0.20.1/include/mesos/scheduler.hpp#L478 >>> >>> Does this interest you? >>> >> >> I am interpreting this (correct me as needed) to mean that the Java >> callback statusUpdate() receives a queue instead of the current version >> with just one TaskStatus argument? I suppose this could be useful, yes. In >> that case, the acknowledgements of receiving the task status is sent to >> master once per the entire queue of task status. Which may be OK. >> > > You would always receive a queue of events, which you can store and > process asynchronously (the key to enabling this was making > acknowledgements explicit). Sorry for the tangent, keep an eye out for > discussions related to the new API / HTTP API changes. > > >> >> >> >>> >>> >>> *(3) After timeout with no answer, I would be tempted to kill the task.* >>> >>> You will eventually receive an answer, so if you decide to kill the task >>> because you have not received an answer soon enough, you may make the wrong >>> decision. This is up to you. >>> >>> In particular, I would caution against making decisions without feedback >>> because it can lead to a snowball effect if tasks are treated >>> independently. In the event of a backlog, what's to stop you from killing >>> all tasks because you haven't received any answers? >>> >>> I would recommend that you only use this kind of timeout as a last >>> resort, when not receiving a response after a large amount of time and a >>> large number of reconciliation requests. >>> >> >> Yes, that is the timeout value I was after. However, based on my >> response to #1, this could be short, isn't it? >> > > Yes it could be on the order of seconds to start with. > > >> >> >> >>> >>> >>> *(4) Does rate limiting affect this?* >>> >>> When enabled, rate limiting currently only operates on the rate of >>> incoming messages from a particular framework, so the number of updates >>> sent back has no effect on the limiting. >>> >> >> That sounds good. Although, just to be paranoid, what if there's a >> problematic framework that restarts frequently (due to a bug, for >> example)? This would keep Mesos master busy sending reconcile task updates >> to it constantly. >> > > You're right, it's an orthogonal problem to address since it applies > broadly to other messages (e.g. framework sending 100MB tasks). > > >> >> Thanks. >> >> Sharma >> >> >> >>> >>> >>> On Wed, Oct 15, 2014 at 3:22 PM, Sharma Podila <[email protected]> >>> wrote: >>> >>>> Looks like a good step forward. >>>> >>>> What is the reason for the algorithm having to call reconcile tasks >>>> multiple times after waiting some time in step 6? Shouldn't it be just once >>>> per (re)registration? >>>> >>> >>>> Are there time bound guarantees within which a task update will be sent >>>> out after a reconcile request is sent? In the algorithm for task >>>> reconciliation, what would be a good timeout after which we conclude that >>>> we got no task update from the master? Upon such a timeout, I would be >>>> tempted to conclude that the task has disappeared. In which case, I would >>>> call driver.killTask() (to be sure its marked as gone), mark my task as >>>> terminated, then submit a replacement task. >>>> >>>> Does the "rate limiting" feature (in the works?) affect task >>>> reconciliation due to the volume of task updates sent back? >>>> >>>> Thanks. >>>> >>>> >>>> On Wed, Oct 15, 2014 at 2:05 PM, Benjamin Mahler < >>>> [email protected]> wrote: >>>> >>>>> Hi all, >>>>> >>>>> I've sent a review out for a document describing reconciliation, you >>>>> can see the draft here: >>>>> https://gist.github.com/bmahler/18409fc4f052df43f403 >>>>> >>>>> Would love to gather high level feedback on it from framework >>>>> developers. Feel free to reply here, or on the review: >>>>> https://reviews.apache.org/r/26669/ >>>>> >>>>> Thanks! >>>>> Ben >>>>> >>>> >>>> >>> >> >

