Here's the Singularity log again, just to have both in the same email:
https://gist.githubusercontent.com/stevenschlansker/50dbe2e068c8156a12de/raw/bd4bee96aab770f0899885d826c5b7bca76225e4/gistfile1.txt

and the master log from the same time period:
https://gist.githubusercontent.com/stevenschlansker/1577a1fc269525459571/raw/5cd53f53acc8e3b27490b0ea9af04812d624bc50/gistfile1.txt
On Nov 3, 2014, at 10:46 AM, Benjamin Mahler <[email protected]> wrote:

> Thanks! Do you have the master logs?
>
> On Mon, Nov 3, 2014 at 10:13 AM, Steven Schlansker <[email protected]> wrote:
> Hi,
> I'm the poor end user in question :)
>
> I have the Singularity logs from task reconciliation saved here:
> https://gist.githubusercontent.com/stevenschlansker/50dbe2e068c8156a12de/raw/bd4bee96aab770f0899885d826c5b7bca76225e4/gistfile1.txt
>
> The last line in the log file sums it up pretty well -
> INFO [2014-10-30 19:24:21,948] com.hubspot.singularity.scheduler.SingularityTaskReconciliation: Task reconciliation ended after 50 checks and 25:00.188
>
> On Nov 3, 2014, at 10:02 AM, Benjamin Mahler <[email protected]> wrote:
> >
> > I don't think this is related to your retry timeout, but it's very difficult to diagnose this without logs or a more thorough description of what occurred. Do you have the logs?
> >
> > user saw it take 30 minutes to eventually reconcile 25 task statuses
> >
> > What exactly did the user see to infer that this was related to reconciling the statuses?
> >
> > On Thu, Oct 30, 2014 at 3:26 PM, Whitney Sorenson <[email protected]> wrote:
> > Ben,
> >
> > What's a reasonable initial timeout and cap for reconciliation when the # of slaves and tasks involved is in the tens/hundreds?
> >
> > I ask because in Singularity we are using a fixed 30 seconds, and one user saw it take 30 minutes to eventually reconcile 25 task statuses (after seeing all slaves crash and a master failover -- although that's another issue).
> >
> > On Tue, Oct 21, 2014 at 3:52 PM, Benjamin Mahler <[email protected]> wrote:
> > Inline.
> >
> > On Thu, Oct 16, 2014 at 7:43 PM, Sharma Podila <[email protected]> wrote:
> > Response inline, below.
> >
> > On Thu, Oct 16, 2014 at 5:41 PM, Benjamin Mahler <[email protected]> wrote:
> > Thanks for the thoughtful questions, I will take these into account in the document.
> >
> > Addressing each question in order:
> >
> > (1) Why the retry?
> >
> > It could be once per (re-)registration in the future.
> >
> > Some requests are temporarily unanswerable. For example, if reconciling task T on slave S, and slave S has not yet re-registered, we cannot reply until the slave is re-registered or removed. Also, if a slave is transitioning (being removed), we want to make sure that operation finishes before we can answer.
> >
> > It's possible to keep the request around and trigger an event once we can answer. However, we chose to drop and remain silent for these tasks. This is both for implementation simplicity and as a defense against OOMing from too many pending reconciliation requests.
> >
> > I was thinking that the state machine that maintains the state of tasks always has answers for the current state. Therefore, I don't expect any blocking. For example, if S hasn't yet re-registered, the state machine must think that the state of T is still 'running' until either the slave re-registers and reports the task as lost, or a timeout occurs after which the master decides the slave is gone, at which point a new status update can be sent. I don't see a reason why reconcile needs to wait until the slave re-registers here. Maybe I am missing something else? Same with transitioning... the state information is always available, say, as running, until the transition happens. This results in two status updates, but always correct.
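
For concreteness, here is a minimal sketch of what the explicit reconciliation request discussed above looks like from the framework side, using the Mesos Java bindings. The ReconciliationHelper class name and the choice of TASK_RUNNING as the last known state are illustrative assumptions, not something prescribed in this thread; driver.reconcileTasks() is the actual driver call.

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.mesos.Protos.TaskID;
    import org.apache.mesos.Protos.TaskState;
    import org.apache.mesos.Protos.TaskStatus;
    import org.apache.mesos.SchedulerDriver;

    final class ReconciliationHelper {
      // Ask the master for the current state of the tasks we believe exist.
      // The master answers with status updates where it can; requests it
      // cannot answer yet (e.g. the slave has not re-registered) are
      // silently dropped, which is why the caller must be prepared to retry.
      static void requestReconciliation(SchedulerDriver driver, List<String> taskIds) {
        List<TaskStatus> statuses = new ArrayList<TaskStatus>();
        for (String id : taskIds) {
          statuses.add(TaskStatus.newBuilder()
              .setTaskId(TaskID.newBuilder().setValue(id))
              .setState(TaskState.TASK_RUNNING) // last state we observed
              .build());
        }
        driver.reconcileTasks(statuses);
      }
    }

In newer Mesos releases, passing an empty collection instead requests "implicit" reconciliation, i.e. updates for all tasks known to the master.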
> >
> > Task state in Mesos is persisted in the leaves of the system (the slaves) for scalability reasons. So when a new master starts up, it doesn't know anything about tasks; this state is bootstrapped from the slaves as they re-register. This interim period of state recovery is when frameworks may not receive answers to reconciliation requests, depending on whether the particular slave has re-registered.
> >
> > In your second case, once a slave is removed, we will send the LOST update for all non-terminal tasks on the slave. There's little benefit in replying to a reconciliation request while the slave is being removed, because LOST updates are coming shortly thereafter. You can think of these LOST updates as the reply to the reconciliation request, as far as the scheduler is concerned.
> >
> > I think the two takeaways here are:
> >
> > (1) Ultimately, while it is possible to avoid the need for retries on the framework side, it introduces too much complexity in the master and gives us no flexibility in ignoring or dropping messages. Even in such a world, the retries would be a valid resiliency measure for frameworks to insulate themselves against anything being dropped.
> >
> > (2) For now, we want to encourage framework developers to think about these kinds of issues; we want them to implement their frameworks in a resilient manner. And so in general we haven't chosen to provide a crutch when it requires a lot of complexity in Mesos. Today we can't add these ergonomic improvements in the scheduler driver because it has no persistence. Hopefully as the project moves forward, we can have these kinds of framework-side ergonomic improvements be contained in pure language bindings to Mesos. A nice stateful language binding can hide this from you. :)
> >
> > (2) Any time-bound guarantees?
> >
> > No guarantees on exact timing, but you are guaranteed to eventually receive an answer.
> >
> > This is why exponential backoff is important, to tolerate variability in timing and avoid snowballing if a backlog ever occurs.
> >
> > For suggesting an initial timeout, I need to digress a bit. Currently the driver does not explicitly expose the event queue to the scheduler, and so when you call reconcile, you may have an event queue in the driver full of status updates. Because of this lack of visibility, picking an initial timeout will depend on your scheduler's update processing speed and scale (# of expected status updates). Again, backoff is recommended to handle this.
> >
> > We were considering exposing Java bindings for the newer Event/Call API. It makes the queue explicit, which lets you avoid reconciling while you have a queue full of updates.
> >
> > Here is what the C++ interface looks like:
> > https://github.com/apache/mesos/blob/0.20.1/include/mesos/scheduler.hpp#L478
> >
> > Does this interest you?
> >
> > I am interpreting this (correct me as needed) to mean that the Java callback statusUpdate() receives a queue instead of the current version with just one TaskStatus argument? I suppose this could be useful, yes. In that case, the acknowledgements of receiving the task statuses are sent to the master once for the entire queue, which may be OK.
> >
> > You would always receive a queue of events, which you can store and process asynchronously (the key to enabling this was making acknowledgements explicit). Sorry for the tangent, keep an eye out for discussions related to the new API / HTTP API changes.
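
To make the backoff recommendation concrete, here is a sketch of a retry loop with exponential backoff and a cap, reusing the requestReconciliation helper sketched earlier. The initial delay, the 300-second cap, and the unacknowledgedTaskIds() bookkeeping are all assumptions for illustration; the thread only prescribes "backoff" in general.

    import java.util.List;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import org.apache.mesos.SchedulerDriver;

    abstract class BackoffReconciler {
      private final ScheduledExecutorService timer =
          Executors.newSingleThreadScheduledExecutor();

      // Send a reconciliation request, then re-check after delaySeconds.
      // If some tasks still have no answer, retry with twice the delay,
      // capped, so a backlogged master is not flooded with requests.
      void reconcileWithBackoff(final SchedulerDriver driver, final long delaySeconds) {
        ReconciliationHelper.requestReconciliation(driver, unacknowledgedTaskIds());
        timer.schedule(new Runnable() {
          @Override public void run() {
            if (!unacknowledgedTaskIds().isEmpty()) {
              reconcileWithBackoff(driver, Math.min(delaySeconds * 2, 300));
            }
          }
        }, delaySeconds, TimeUnit.SECONDS);
      }

      // Hypothetical bookkeeping: the task IDs for which no status update
      // has arrived since reconciliation started (cleared in statusUpdate()).
      protected abstract List<String> unacknowledgedTaskIds();
    }

Per the suggestion above, the initial delay could be on the order of seconds, e.g. reconcileWithBackoff(driver, 15), with the doubling and cap absorbing any backlog.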
> >
> > (3) After timeout with no answer, I would be tempted to kill the task.
> >
> > You will eventually receive an answer, so if you decide to kill the task because you have not received an answer soon enough, you may make the wrong decision. This is up to you.
> >
> > In particular, I would caution against making decisions without feedback, because it can lead to a snowball effect if tasks are treated independently. In the event of a backlog, what's to stop you from killing all tasks because you haven't received any answers?
> >
> > I would recommend that you only use this kind of timeout as a last resort, when not receiving a response after a large amount of time and a large number of reconciliation requests.
> >
> > Yes, that is the timeout value I was after. However, based on my response to #1, this could be short, couldn't it?
> >
> > Yes, it could be on the order of seconds to start with.
> >
> > (4) Does rate limiting affect this?
> >
> > When enabled, rate limiting currently only operates on the rate of incoming messages from a particular framework, so the number of updates sent back has no effect on the limiting.
> >
> > That sounds good. Although, just to be paranoid, what if there's a problematic framework that restarts frequently (due to a bug, for example)? This would keep the Mesos master busy sending reconcile task updates to it constantly.
> >
> > You're right, it's an orthogonal problem to address, since it applies broadly to other messages (e.g. a framework sending 100MB tasks).
> >
> > Thanks.
> >
> > Sharma
> >
> > On Wed, Oct 15, 2014 at 3:22 PM, Sharma Podila <[email protected]> wrote:
> > Looks like a good step forward.
> >
> > What is the reason for the algorithm having to call reconcile tasks multiple times after waiting some time in step 6? Shouldn't it be just once per (re)registration?
> >
> > Are there time-bound guarantees within which a task update will be sent out after a reconcile request is sent? In the algorithm for task reconciliation, what would be a good timeout after which we conclude that we got no task update from the master? Upon such a timeout, I would be tempted to conclude that the task has disappeared. In which case, I would call driver.killTask() (to be sure it's marked as gone), mark my task as terminated, then submit a replacement task.
> >
> > Does the "rate limiting" feature (in the works?) affect task reconciliation due to the volume of task updates sent back?
> >
> > Thanks.
> >
> > On Wed, Oct 15, 2014 at 2:05 PM, Benjamin Mahler <[email protected]> wrote:
> > Hi all,
> >
> > I've sent a review out for a document describing reconciliation, you can see the draft here:
> > https://gist.github.com/bmahler/18409fc4f052df43f403
> >
> > Would love to gather high-level feedback on it from framework developers. Feel free to reply here, or on the review:
> > https://reviews.apache.org/r/26669/
> >
> > Thanks!
> > Ben
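
Pulling together the last-resort advice in the thread, here is one hedged way a framework might gate driver.killTask() behind both an attempt count and a wall-clock window, so that a temporary backlog never triggers a mass kill. The thresholds and class name are illustrative assumptions only; the thread prescribes no specific values.

    import org.apache.mesos.Protos.TaskID;
    import org.apache.mesos.SchedulerDriver;

    final class LastResortPolicy {
      // Illustrative thresholds only; tune per deployment.
      static final int MAX_ATTEMPTS = 10;
      static final long MAX_WAIT_MILLIS = 30 * 60 * 1000L; // 30 minutes

      // Give up only after many reconciliation attempts over a long window,
      // never on a single missed timeout, to avoid the snowball effect of
      // killing every task during a temporary backlog.
      static boolean shouldGiveUp(int attempts, long firstAttemptMillis) {
        return attempts >= MAX_ATTEMPTS
            && System.currentTimeMillis() - firstAttemptMillis >= MAX_WAIT_MILLIS;
      }

      // As suggested above: kill to be sure the task is gone, then mark it
      // terminated locally and submit a replacement.
      static void giveUp(SchedulerDriver driver, TaskID taskId) {
        driver.killTask(taskId);
      }
    }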

