Hi, I'm the poor end user in question :) I have the Singularity logs from task reconciliation saved here: https://gist.githubusercontent.com/stevenschlansker/50dbe2e068c8156a12de/raw/bd4bee96aab770f0899885d826c5b7bca76225e4/gistfile1.txt
The last line in the log file sums it up pretty well:

INFO [2014-10-30 19:24:21,948] com.hubspot.singularity.scheduler.SingularityTaskReconciliation: Task reconciliation ended after 50 checks and 25:00.188

On Nov 3, 2014, at 10:02 AM, Benjamin Mahler <[email protected]> wrote:

> I don't think this is related to your retry timeout, but it's very difficult to diagnose this without logs or a more thorough description of what occurred. Do you have the logs?
>
> "user saw it take 30 minutes to eventually reconcile 25 task statuses"
>
> What exactly did the user see to infer that this was related to reconciling the statuses?
>
> On Thu, Oct 30, 2014 at 3:26 PM, Whitney Sorenson <[email protected]> wrote:
> Ben,
>
> What's a reasonable initial timeout and cap for reconciliation when the # of slaves and tasks involved is in the tens/hundreds?
>
> I ask because in Singularity we are using a fixed 30 seconds, and one user saw it take 30 minutes to eventually reconcile 25 task statuses (after seeing all slaves crash and a master failover -- although that's another issue).
>
> On Tue, Oct 21, 2014 at 3:52 PM, Benjamin Mahler <[email protected]> wrote:
> Inline.
>
> On Thu, Oct 16, 2014 at 7:43 PM, Sharma Podila <[email protected]> wrote:
> Response inline, below.
>
> On Thu, Oct 16, 2014 at 5:41 PM, Benjamin Mahler <[email protected]> wrote:
> Thanks for the thoughtful questions, I will take these into account in the document.
>
> Addressing each question in order:
>
> (1) Why the retry?
>
> It could be once per (re-)registration in the future.
>
> Some requests are temporarily unanswerable. For example, if reconciling task T on slave S, and slave S has not yet re-registered, we cannot reply until the slave is re-registered or removed. Also, if a slave is transitioning (being removed), we want to make sure that operation finishes before we can answer.
>
> It's possible to keep the request around and trigger an event once we can answer. However, we chose to drop and remain silent for these tasks. This is both for implementation simplicity and as a defense against OOMing from too many pending reconciliation requests.
>
> I was thinking that the state machine that maintains the state of tasks always has answers for the current state. Therefore, I don't expect any blocking. For example, if S hasn't yet re-registered, the state machine must think that the state of T is still 'running' until either the slave re-registers and informs it of the task being lost, or a timeout occurs after which the master decides the slave is gone, at which point a new status update can be sent. I don't see a reason why reconcile needs to wait until the slave re-registers here. Maybe I am missing something else? Same with transitioning... the state information is always available, say, as running, until the transition happens. This results in two status updates, but always correct.
>
> Task state in Mesos is persisted in the leaves of the system (the slaves) for scalability reasons. So when a new master starts up, it doesn't know anything about tasks; this state is bootstrapped from the slaves as they re-register. This interim period of state recovery is when frameworks may not receive answers to reconciliation requests, depending on whether the particular slave has re-registered.
>
> In your second case, once a slave is removed, we will send the LOST update for all non-terminal tasks on the slave. There's little benefit in replying to a reconciliation request while the slave is being removed, because LOST updates are coming shortly thereafter. You can think of these LOST updates as the reply to the reconciliation request, as far as the scheduler is concerned.
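For concreteness, here is a minimal sketch of what such an explicit reconciliation request looks like from the framework side with the 0.20-era Java bindings; the class name and the source of task IDs are illustrative, not anything from Singularity or Mesos itself:

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

import org.apache.mesos.Protos.TaskID;
import org.apache.mesos.Protos.TaskState;
import org.apache.mesos.Protos.TaskStatus;
import org.apache.mesos.SchedulerDriver;

public final class ReconciliationRequest {
  private ReconciliationRequest() {}

  // Ask the master for the authoritative state of the tasks we believe are
  // running. The master answers with one status update per task it can
  // account for; tasks on slaves that have not yet re-registered simply get
  // no reply yet, which is why the request has to be retried.
  public static void send(SchedulerDriver driver, Collection<String> taskIds) {
    List<TaskStatus> statuses = new ArrayList<TaskStatus>();
    for (String id : taskIds) {
      statuses.add(TaskStatus.newBuilder()
          .setTaskId(TaskID.newBuilder().setValue(id))
          .setState(TaskState.TASK_RUNNING) // the scheduler's last known state
          .build());
    }
    driver.reconcileTasks(statuses);
  }
}

The statuses only carry the scheduler's last known state; the status updates the master eventually sends back are what is authoritative.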
> I think the two takeaways here are:
>
> (1) Ultimately, while it is possible to avoid the need for retries on the framework side, it introduces too much complexity in the master and gives us no flexibility in ignoring or dropping messages. Even in such a world, the retries would be a valid resiliency measure for frameworks to insulate themselves against anything being dropped.
>
> (2) For now, we want to encourage framework developers to think about these kinds of issues, and we want them to implement their frameworks in a resilient manner. So in general we haven't chosen to provide a crutch when it requires a lot of complexity in Mesos. Today we can't add these ergonomic improvements in the scheduler driver because it has no persistence. Hopefully, as the project moves forward, we can have these kinds of framework-side ergonomic improvements be contained in pure language bindings to Mesos. A nice stateful language binding can hide this from you. :)
>
> (2) Any time bound guarantees?
>
> No guarantees on exact timing, but you are guaranteed to eventually receive an answer.
>
> This is why exponential backoff is important: to tolerate variability in timing and avoid snowballing if a backlog ever occurs.
>
> For suggesting an initial timeout, I need to digress a bit. Currently the driver does not explicitly expose the event queue to the scheduler, so when you call reconcile, you may have an event queue in the driver full of status updates. Because of this lack of visibility, picking an initial timeout will depend on your scheduler's update processing speed and scale (# of expected status updates). Again, backoff is recommended to handle this.
>
> We were considering exposing Java bindings for the newer Event/Call API. It makes the queue explicit, which lets you avoid reconciling while you have a queue full of updates.
>
> Here is what the C++ interface looks like:
> https://github.com/apache/mesos/blob/0.20.1/include/mesos/scheduler.hpp#L478
>
> Does this interest you?
>
> I am interpreting this (correct me as needed) to mean that the Java callback statusUpdate() receives a queue instead of the current version with just one TaskStatus argument? I suppose this could be useful, yes. In that case, the acknowledgement of receiving the task statuses is sent to the master once per entire queue of task statuses, which may be OK.
>
> You would always receive a queue of events, which you can store and process asynchronously (the key to enabling this was making acknowledgements explicit). Sorry for the tangent; keep an eye out for discussions related to the new API / HTTP API changes.
>
> (3) After timeout with no answer, I would be tempted to kill the task.
>
> You will eventually receive an answer, so if you decide to kill the task because you have not received an answer soon enough, you may make the wrong decision. This is up to you.
>
> In particular, I would caution against making decisions without feedback, because it can lead to a snowball effect if tasks are treated independently. In the event of a backlog, what's to stop you from killing all tasks because you haven't received any answers?
>
> I would recommend that you only use this kind of timeout as a last resort, when not receiving a response after a large amount of time and a large number of reconciliation requests.
>
> Yes, that is the timeout value I was after. However, based on my response to #1, this could be short, couldn't it?
>
> Yes, it could be on the order of seconds to start with.
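Putting the backoff advice together, here is a rough sketch of the retry loop a scheduler might run; the constants, class name, and bookkeeping are illustrative rather than anything prescribed by Mesos or used by Singularity:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

import org.apache.mesos.Protos.TaskID;
import org.apache.mesos.Protos.TaskState;
import org.apache.mesos.Protos.TaskStatus;
import org.apache.mesos.SchedulerDriver;

public final class ReconciliationLoop {
  private static final long INITIAL_TIMEOUT_SECONDS = 10;  // illustrative starting point
  private static final long MAX_TIMEOUT_SECONDS = 10 * 60; // cap on the backoff
  private static final int MAX_ATTEMPTS = 20;              // last-resort threshold

  // Task IDs we have asked about but not yet heard back on; statusUpdate()
  // handling removes entries as answers (or ordinary updates) arrive.
  private final Set<String> unanswered =
      Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());

  // Re-request only the tasks that are still unanswered, doubling the wait
  // between attempts up to a cap. In a real scheduler this would run on its
  // own thread or timer rather than blocking.
  public void reconcile(SchedulerDriver driver, List<String> knownTaskIds)
      throws InterruptedException {
    unanswered.addAll(knownTaskIds);
    long timeoutSeconds = INITIAL_TIMEOUT_SECONDS;

    for (int attempt = 1; attempt <= MAX_ATTEMPTS && !unanswered.isEmpty(); attempt++) {
      List<TaskStatus> statuses = new ArrayList<TaskStatus>();
      for (String id : unanswered) {
        statuses.add(TaskStatus.newBuilder()
            .setTaskId(TaskID.newBuilder().setValue(id))
            .setState(TaskState.TASK_RUNNING) // our last known state
            .build());
      }
      driver.reconcileTasks(statuses);

      TimeUnit.SECONDS.sleep(timeoutSeconds);
      timeoutSeconds = Math.min(timeoutSeconds * 2, MAX_TIMEOUT_SECONDS); // exponential backoff
    }
    // Anything still in 'unanswered' here has gone many attempts without a
    // reply; only at this point would treating it as lost be (cautiously)
    // considered.
  }

  // Call this from Scheduler.statusUpdate() so answered tasks drop out.
  public void onStatusUpdate(TaskStatus status) {
    unanswered.remove(status.getTaskId().getValue());
  }
}

The important properties are that only still-unanswered tasks are re-requested, the wait doubles up to a cap, and killing is not even considered until many attempts over a long period have gone unanswered.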
> (4) Does rate limiting affect this?
>
> When enabled, rate limiting currently only operates on the rate of incoming messages from a particular framework, so the number of updates sent back has no effect on the limiting.
>
> That sounds good. Although, just to be paranoid, what if there's a problematic framework that restarts frequently (due to a bug, for example)? This would keep the Mesos master busy sending reconcile task updates to it constantly.
>
> You're right, it's an orthogonal problem to address, since it applies broadly to other messages (e.g. a framework sending 100MB tasks).
>
> Thanks.
>
> Sharma
>
> On Wed, Oct 15, 2014 at 3:22 PM, Sharma Podila <[email protected]> wrote:
> Looks like a good step forward.
>
> What is the reason for the algorithm having to call reconcile tasks multiple times after waiting some time in step 6? Shouldn't it be just once per (re)registration?
>
> Are there time bound guarantees within which a task update will be sent out after a reconcile request is sent? In the algorithm for task reconciliation, what would be a good timeout after which we conclude that we got no task update from the master? Upon such a timeout, I would be tempted to conclude that the task has disappeared, in which case I would call driver.killTask() (to be sure it's marked as gone), mark my task as terminated, then submit a replacement task.
>
> Does the "rate limiting" feature (in the works?) affect task reconciliation due to the volume of task updates sent back?
>
> Thanks.
>
> On Wed, Oct 15, 2014 at 2:05 PM, Benjamin Mahler <[email protected]> wrote:
> Hi all,
>
> I've sent a review out for a document describing reconciliation; you can see the draft here:
> https://gist.github.com/bmahler/18409fc4f052df43f403
>
> Would love to gather high-level feedback on it from framework developers. Feel free to reply here, or on the review:
> https://reviews.apache.org/r/26669/
>
> Thanks!
> Ben
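Finally, the last-resort fallback Sharma describes above (kill the task to be sure it's gone, mark it terminated, submit a replacement) might look roughly like this; FrameworkState, markTerminated, and submitReplacement are hypothetical stand-ins for a framework's own bookkeeping, and per Ben's caution it should only run after the reconciliation retries above have been exhausted:

import org.apache.mesos.Protos.TaskID;
import org.apache.mesos.SchedulerDriver;

public final class LastResortCleanup {

  // Hypothetical stand-in for the framework's own task store.
  public interface FrameworkState {
    void markTerminated(String taskId);
    void submitReplacement(String taskId);
  }

  // Fallback for a task that never got a reconciliation answer after many
  // attempts over a long period: kill it so it cannot linger unnoticed,
  // record it as terminated locally, and schedule a replacement.
  public static void giveUpOn(SchedulerDriver driver, String taskId, FrameworkState state) {
    driver.killTask(TaskID.newBuilder().setValue(taskId).build());
    state.markTerminated(taskId);     // framework-specific bookkeeping (illustrative)
    state.submitReplacement(taskId);  // relaunch however the framework normally does
  }
}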

