Here's the Singularity log again, just to have both in the same email:
https://gist.githubusercontent.com/stevenschlansker/50dbe2e068c8156a12de/raw/bd4bee96aab770f0899885d826c5b7bca76225e4/gistfile1.txt

and the master log from the same time period:
https://gist.githubusercontent.com/stevenschlansker/1577a1fc269525459571/raw/5cd53f53acc8e3b27490b0ea9af04812d624bc50/gistfile1.txt
On Nov 3, 2014, at 10:46 AM, Benjamin Mahler <[email protected]> wrote:

> Thanks! Do you have the master logs?
>
> On Mon, Nov 3, 2014 at 10:13 AM, Steven Schlansker <[email protected]> wrote:
> Hi,
> I'm the poor end user in question :)
>
> I have the Singularity logs from task reconciliation saved here:
> https://gist.githubusercontent.com/stevenschlansker/50dbe2e068c8156a12de/raw/bd4bee96aab770f0899885d826c5b7bca76225e4/gistfile1.txt
>
> The last line in the log file sums it up pretty well -
> INFO [2014-10-30 19:24:21,948] com.hubspot.singularity.scheduler.SingularityTaskReconciliation: Task reconciliation ended after 50 checks and 25:00.188
>
> On Nov 3, 2014, at 10:02 AM, Benjamin Mahler <[email protected]> wrote:
> >
> > I don't think this is related to your retry timeout, but it's very difficult to diagnose this without logs or a more thorough description of what occurred. Do you have the logs?
> >
> > user saw it take 30 minutes to eventually reconcile 25 task statuses
> >
> > What exactly did the user see to infer that this was related to reconciling the statuses?
> >
> > On Thu, Oct 30, 2014 at 3:26 PM, Whitney Sorenson <[email protected]> wrote:
> > Ben,
> >
> > What's a reasonable initial timeout and cap for reconciliation when the # of slaves and tasks involved is in the tens/hundreds?
> >
> > I ask because in Singularity we are using a fixed 30 seconds, and one user saw it take 30 minutes to eventually reconcile 25 task statuses (after seeing all slaves crash and a master failover -- although that's another issue).
> >
> > On Tue, Oct 21, 2014 at 3:52 PM, Benjamin Mahler <[email protected]> wrote:
> > Inline.
> >
> > On Thu, Oct 16, 2014 at 7:43 PM, Sharma Podila <[email protected]> wrote:
> > Response inline, below.
> >
> > On Thu, Oct 16, 2014 at 5:41 PM, Benjamin Mahler <[email protected]> wrote:
> > Thanks for the thoughtful questions, I will take these into account in the document.
> >
> > Addressing each question in order:
> >
> > (1) Why the retry?
> >
> > It could be once per (re-)registration in the future.
> >
> > Some requests are temporarily unanswerable. For example, if reconciling task T on slave S, and slave S has not yet re-registered, we cannot reply until the slave is re-registered or removed. Also, if a slave is transitioning (being removed), we want to make sure that operation finishes before we can answer.
> >
> > It's possible to keep the request around and trigger an event once we can answer. However, we chose to drop and remain silent for these tasks. This is both for implementation simplicity and as a defense against OOMing from too many pending reconciliation requests.
> >
> > I was thinking that the state machine that maintains the state of tasks always has answers for the current state. Therefore, I don't expect any blocking. For example, if S hasn't yet re-registered, the state machine must think that the state of T is still 'running' until either the slave re-registers and reports the task as lost, or a timeout occurs after which the master decides the slave is gone, at which point a new status update can be sent. I don't see a reason why reconcile needs to wait until the slave re-registers here. Maybe I am missing something else? Same with transitioning... the state information is always available, say, as running, until the transition happens. This results in two status updates, but always correct.
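
For concreteness, here is a minimal sketch of what the explicit reconciliation request discussed above looks like from the framework side, using the Mesos Java bindings. The ReconciliationHelper class name and the choice of TASK_RUNNING as the last known state are illustrative assumptions, not something prescribed in this thread; driver.reconcileTasks() is the actual driver call.

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.mesos.Protos.TaskID;
    import org.apache.mesos.Protos.TaskState;
    import org.apache.mesos.Protos.TaskStatus;
    import org.apache.mesos.SchedulerDriver;

    final class ReconciliationHelper {
      // Ask the master for the current state of the tasks we believe exist.
      // The master answers with status updates where it can; requests it
      // cannot answer yet (e.g. the slave has not re-registered) are
      // silently dropped, which is why the caller must be prepared to retry.
      static void requestReconciliation(SchedulerDriver driver, List<String> taskIds) {
        List<TaskStatus> statuses = new ArrayList<TaskStatus>();
        for (String id : taskIds) {
          statuses.add(TaskStatus.newBuilder()
              .setTaskId(TaskID.newBuilder().setValue(id))
              .setState(TaskState.TASK_RUNNING) // last state we observed
              .build());
        }
        driver.reconcileTasks(statuses);
      }
    }

In newer Mesos releases, passing an empty collection instead requests "implicit" reconciliation, i.e. updates for all tasks known to the master.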
> >
> > Task state in Mesos is persisted in the leaves of the system (the slaves) for scalability reasons. So when a new master starts up, it doesn't know anything about tasks; this state is bootstrapped from the slaves as they re-register. This interim period of state recovery is when frameworks may not receive answers to reconciliation requests, depending on whether the particular slave has re-registered.
> >
> > In your second case, once a slave is removed, we will send the LOST update for all non-terminal tasks on the slave. There's little benefit in replying to a reconciliation request while the slave is being removed, because LOST updates are coming shortly thereafter. You can think of these LOST updates as the reply to the reconciliation request, as far as the scheduler is concerned.
> >
> > I think the two takeaways here are:
> >
> > (1) Ultimately, while it is possible to avoid the need for retries on the framework side, it introduces too much complexity in the master and gives us no flexibility in ignoring or dropping messages. Even in such a world, the retries would be a valid resiliency measure for frameworks to insulate themselves against anything being dropped.
> >
> > (2) For now, we want to encourage framework developers to think about these kinds of issues; we want them to implement their frameworks in a resilient manner. And so in general we haven't chosen to provide a crutch when it requires a lot of complexity in Mesos. Today we can't add these ergonomic improvements in the scheduler driver because it has no persistence. Hopefully as the project moves forward, we can have these kinds of framework-side ergonomic improvements be contained in pure language bindings to Mesos. A nice stateful language binding can hide this from you. :)
> >
> > (2) Any time-bound guarantees?
> >
> > No guarantees on exact timing, but you are guaranteed to eventually receive an answer.
> >
> > This is why exponential backoff is important, to tolerate variability in timing and avoid snowballing if a backlog ever occurs.
> >
> > For suggesting an initial timeout, I need to digress a bit. Currently the driver does not explicitly expose the event queue to the scheduler, and so when you call reconcile, you may have an event queue in the driver full of status updates. Because of this lack of visibility, picking an initial timeout will depend on your scheduler's update processing speed and scale (# of expected status updates). Again, backoff is recommended to handle this.
> >
> > We were considering exposing Java bindings for the newer Event/Call API. It makes the queue explicit, which lets you avoid reconciling while you have a queue full of updates.
> >
> > Here is what the C++ interface looks like:
> > https://github.com/apache/mesos/blob/0.20.1/include/mesos/scheduler.hpp#L478
> >
> > Does this interest you?
> >
> > I am interpreting this (correct me as needed) to mean that the Java callback statusUpdate() receives a queue instead of the current version with just one TaskStatus argument? I suppose this could be useful, yes. In that case, the acknowledgements of receiving the task statuses are sent to the master once for the entire queue, which may be OK.
> >
> > You would always receive a queue of events, which you can store and process asynchronously (the key to enabling this was making acknowledgements explicit). Sorry for the tangent, keep an eye out for discussions related to the new API / HTTP API changes.
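
To make the backoff recommendation concrete, here is a sketch of a retry loop with exponential backoff and a cap, reusing the requestReconciliation helper sketched earlier. The initial delay, the 300-second cap, and the unacknowledgedTaskIds() bookkeeping are all assumptions for illustration; the thread only prescribes "backoff" in general.

    import java.util.List;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import org.apache.mesos.SchedulerDriver;

    abstract class BackoffReconciler {
      private final ScheduledExecutorService timer =
          Executors.newSingleThreadScheduledExecutor();

      // Send a reconciliation request, then re-check after delaySeconds.
      // If some tasks still have no answer, retry with twice the delay,
      // capped, so a backlogged master is not flooded with requests.
      void reconcileWithBackoff(final SchedulerDriver driver, final long delaySeconds) {
        ReconciliationHelper.requestReconciliation(driver, unacknowledgedTaskIds());
        timer.schedule(new Runnable() {
          @Override public void run() {
            if (!unacknowledgedTaskIds().isEmpty()) {
              reconcileWithBackoff(driver, Math.min(delaySeconds * 2, 300));
            }
          }
        }, delaySeconds, TimeUnit.SECONDS);
      }

      // Hypothetical bookkeeping: the task IDs for which no status update
      // has arrived since reconciliation started (cleared in statusUpdate()).
      protected abstract List<String> unacknowledgedTaskIds();
    }

Per the suggestion above, the initial delay could be on the order of seconds, e.g. reconcileWithBackoff(driver, 15), with the doubling and cap absorbing any backlog.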
> >
> > (3) After timeout with no answer, I would be tempted to kill the task.
> >
> > You will eventually receive an answer, so if you decide to kill the task because you have not received an answer soon enough, you may make the wrong decision. This is up to you.
> >
> > In particular, I would caution against making decisions without feedback, because it can lead to a snowball effect if tasks are treated independently. In the event of a backlog, what's to stop you from killing all tasks because you haven't received any answers?
> >
> > I would recommend that you only use this kind of timeout as a last resort, when not receiving a response after a large amount of time and a large number of reconciliation requests.
> >
> > Yes, that is the timeout value I was after. However, based on my response to #1, this could be short, couldn't it?
> >
> > Yes, it could be on the order of seconds to start with.
> >
> > (4) Does rate limiting affect this?
> >
> > When enabled, rate limiting currently only operates on the rate of incoming messages from a particular framework, so the number of updates sent back has no effect on the limiting.
> >
> > That sounds good. Although, just to be paranoid, what if there's a problematic framework that restarts frequently (due to a bug, for example)? This would keep the Mesos master busy sending reconcile task updates to it constantly.
> >
> > You're right, it's an orthogonal problem to address, since it applies broadly to other messages (e.g. a framework sending 100MB tasks).
> >
> > Thanks.
> >
> > Sharma
> >
> > On Wed, Oct 15, 2014 at 3:22 PM, Sharma Podila <[email protected]> wrote:
> > Looks like a good step forward.
> >
> > What is the reason for the algorithm having to call reconcile tasks multiple times after waiting some time in step 6? Shouldn't it be just once per (re)registration?
> >
> > Are there time-bound guarantees within which a task update will be sent out after a reconcile request is sent? In the algorithm for task reconciliation, what would be a good timeout after which we conclude that we got no task update from the master? Upon such a timeout, I would be tempted to conclude that the task has disappeared. In which case, I would call driver.killTask() (to be sure it's marked as gone), mark my task as terminated, then submit a replacement task.
> >
> > Does the "rate limiting" feature (in the works?) affect task reconciliation due to the volume of task updates sent back?
> >
> > Thanks.
> >
> > On Wed, Oct 15, 2014 at 2:05 PM, Benjamin Mahler <[email protected]> wrote:
> > Hi all,
> >
> > I've sent a review out for a document describing reconciliation, you can see the draft here:
> > https://gist.github.com/bmahler/18409fc4f052df43f403
> >
> > Would love to gather high-level feedback on it from framework developers. Feel free to reply here, or on the review:
> > https://reviews.apache.org/r/26669/
> >
> > Thanks!
> > Ben
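
Pulling together the last-resort advice in the thread, here is one hedged way a framework might gate driver.killTask() behind both an attempt count and a wall-clock window, so that a temporary backlog never triggers a mass kill. The thresholds and class name are illustrative assumptions only; the thread prescribes no specific values.

    import org.apache.mesos.Protos.TaskID;
    import org.apache.mesos.SchedulerDriver;

    final class LastResortPolicy {
      // Illustrative thresholds only; tune per deployment.
      static final int MAX_ATTEMPTS = 10;
      static final long MAX_WAIT_MILLIS = 30 * 60 * 1000L; // 30 minutes

      // Give up only after many reconciliation attempts over a long window,
      // never on a single missed timeout, to avoid the snowball effect of
      // killing every task during a temporary backlog.
      static boolean shouldGiveUp(int attempts, long firstAttemptMillis) {
        return attempts >= MAX_ATTEMPTS
            && System.currentTimeMillis() - firstAttemptMillis >= MAX_WAIT_MILLIS;
      }

      // As suggested above: kill to be sure the task is gone, then mark it
      // terminated locally and submit a replacement.
      static void giveUp(SchedulerDriver driver, TaskID taskId) {
        driver.killTask(taskId);
      }
    }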

