Hi, I'm the poor end user in question :) I have the Singularity logs from task reconciliation saved here: https://gist.githubusercontent.com/stevenschlansker/50dbe2e068c8156a12de/raw/bd4bee96aab770f0899885d826c5b7bca76225e4/gistfile1.txt
The last line in the log file sums it up pretty well:

INFO [2014-10-30 19:24:21,948] com.hubspot.singularity.scheduler.SingularityTaskReconciliation: Task reconciliation ended after 50 checks and 25:00.188

On Nov 3, 2014, at 10:02 AM, Benjamin Mahler <[email protected]> wrote:

> I don't think this is related to your retry timeout, but it's very difficult to diagnose this without logs or a more thorough description of what occurred. Do you have the logs?
>
> "user saw it take 30 minutes to eventually reconcile 25 task statuses"
>
> What exactly did the user see to infer that this was related to reconciling the statuses?
>
> On Thu, Oct 30, 2014 at 3:26 PM, Whitney Sorenson <[email protected]> wrote:
> Ben,
>
> What's a reasonable initial timeout and cap for reconciliation when the # of slaves and tasks involved is in the tens/hundreds?
>
> I ask because in Singularity we are using a fixed 30 seconds, and one user saw it take 30 minutes to eventually reconcile 25 task statuses (after seeing all slaves crash and a master failover -- although that's another issue).
>
> On Tue, Oct 21, 2014 at 3:52 PM, Benjamin Mahler <[email protected]> wrote:
> Inline.
>
> On Thu, Oct 16, 2014 at 7:43 PM, Sharma Podila <[email protected]> wrote:
> Response inline, below.
>
> On Thu, Oct 16, 2014 at 5:41 PM, Benjamin Mahler <[email protected]> wrote:
> Thanks for the thoughtful questions, I will take these into account in the document.
>
> Addressing each question in order:
>
> (1) Why the retry?
>
> It could be once per (re-)registration in the future.
>
> Some requests are temporarily unanswerable. For example, if reconciling task T on slave S, and slave S has not yet re-registered, we cannot reply until the slave is re-registered or removed. Also, if a slave is transitioning (being removed), we want to make sure that operation finishes before we can answer.
>
> It's possible to keep the request around and trigger an event once we can answer. However, we chose to drop and remain silent for these tasks. This is both for implementation simplicity and as a defense against OOMing from too many pending reconciliation requests.
>
> I was thinking that the state machine that maintains the state of tasks always has answers for the current state. Therefore, I don't expect any blocking. For example, if S hasn't yet re-registered, the state machine must think that the state of T is still 'running' until either the slave re-registers and informs it of the task being lost, or a timeout occurs after which the master decides the slave is gone, at which point a new status update can be sent. I don't see a reason why reconcile needs to wait until the slave re-registers here. Maybe I am missing something else? Same with transitioning... the state information is always available, say, as running, until the transition happens. This results in two status updates, but always correct.
>
> Task state in Mesos is persisted in the leaves of the system (the slaves) for scalability reasons. So when a new master starts up, it doesn't know anything about tasks; this state is bootstrapped from the slaves as they re-register. This interim period of state recovery is when frameworks may not receive answers to reconciliation requests, depending on whether the particular slave has re-registered.
>
> In your second case, once a slave is removed, we will send the LOST update for all non-terminal tasks on the slave. There's little benefit in replying to a reconciliation request while the slave is being removed, because LOST updates are coming shortly thereafter. You can think of these LOST updates as the reply to the reconciliation request, as far as the scheduler is concerned.
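For concreteness, here is a minimal sketch of what such an explicit reconciliation request looks like from the framework side with the 0.20-era Java bindings; the class name and the source of task IDs are illustrative, not anything from Singularity or Mesos itself:

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

import org.apache.mesos.Protos.TaskID;
import org.apache.mesos.Protos.TaskState;
import org.apache.mesos.Protos.TaskStatus;
import org.apache.mesos.SchedulerDriver;

public final class ReconciliationRequest {
  private ReconciliationRequest() {}

  // Ask the master for the authoritative state of the tasks we believe are
  // running. The master answers with one status update per task it can
  // account for; tasks on slaves that have not yet re-registered simply get
  // no reply yet, which is why the request has to be retried.
  public static void send(SchedulerDriver driver, Collection<String> taskIds) {
    List<TaskStatus> statuses = new ArrayList<TaskStatus>();
    for (String id : taskIds) {
      statuses.add(TaskStatus.newBuilder()
          .setTaskId(TaskID.newBuilder().setValue(id))
          .setState(TaskState.TASK_RUNNING) // the scheduler's last known state
          .build());
    }
    driver.reconcileTasks(statuses);
  }
}

The statuses only carry the scheduler's last known state; the status updates the master eventually sends back are what is authoritative.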
> I think the two takeaways here are:
>
> (1) Ultimately, while it is possible to avoid the need for retries on the framework side, it introduces too much complexity in the master and gives us no flexibility in ignoring or dropping messages. Even in such a world, the retries would be a valid resiliency measure for frameworks to insulate themselves against anything being dropped.
>
> (2) For now, we want to encourage framework developers to think about these kinds of issues, and we want them to implement their frameworks in a resilient manner. So in general we haven't chosen to provide a crutch when it requires a lot of complexity in Mesos. Today we can't add these ergonomic improvements in the scheduler driver because it has no persistence. Hopefully, as the project moves forward, we can have these kinds of framework-side ergonomic improvements be contained in pure language bindings to Mesos. A nice stateful language binding can hide this from you. :)
>
> (2) Any time bound guarantees?
>
> No guarantees on exact timing, but you are guaranteed to eventually receive an answer.
>
> This is why exponential backoff is important: to tolerate variability in timing and avoid snowballing if a backlog ever occurs.
>
> For suggesting an initial timeout, I need to digress a bit. Currently the driver does not explicitly expose the event queue to the scheduler, so when you call reconcile, you may have an event queue in the driver full of status updates. Because of this lack of visibility, picking an initial timeout will depend on your scheduler's update processing speed and scale (# of expected status updates). Again, backoff is recommended to handle this.
>
> We were considering exposing Java bindings for the newer Event/Call API. It makes the queue explicit, which lets you avoid reconciling while you have a queue full of updates.
>
> Here is what the C++ interface looks like:
> https://github.com/apache/mesos/blob/0.20.1/include/mesos/scheduler.hpp#L478
>
> Does this interest you?
>
> I am interpreting this (correct me as needed) to mean that the Java callback statusUpdate() receives a queue instead of the current version with just one TaskStatus argument? I suppose this could be useful, yes. In that case, the acknowledgement of receiving the task statuses is sent to the master once per entire queue of task statuses, which may be OK.
>
> You would always receive a queue of events, which you can store and process asynchronously (the key to enabling this was making acknowledgements explicit). Sorry for the tangent; keep an eye out for discussions related to the new API / HTTP API changes.
>
> (3) After timeout with no answer, I would be tempted to kill the task.
>
> You will eventually receive an answer, so if you decide to kill the task because you have not received an answer soon enough, you may make the wrong decision. This is up to you.
>
> In particular, I would caution against making decisions without feedback, because it can lead to a snowball effect if tasks are treated independently. In the event of a backlog, what's to stop you from killing all tasks because you haven't received any answers?
>
> I would recommend that you only use this kind of timeout as a last resort, when not receiving a response after a large amount of time and a large number of reconciliation requests.
>
> Yes, that is the timeout value I was after. However, based on my response to #1, this could be short, couldn't it?
>
> Yes, it could be on the order of seconds to start with.
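Putting the backoff advice together, here is a rough sketch of the retry loop a scheduler might run; the constants, class name, and bookkeeping are illustrative rather than anything prescribed by Mesos or used by Singularity:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

import org.apache.mesos.Protos.TaskID;
import org.apache.mesos.Protos.TaskState;
import org.apache.mesos.Protos.TaskStatus;
import org.apache.mesos.SchedulerDriver;

public final class ReconciliationLoop {
  private static final long INITIAL_TIMEOUT_SECONDS = 10;  // illustrative starting point
  private static final long MAX_TIMEOUT_SECONDS = 10 * 60; // cap on the backoff
  private static final int MAX_ATTEMPTS = 20;              // last-resort threshold

  // Task IDs we have asked about but not yet heard back on; statusUpdate()
  // handling removes entries as answers (or ordinary updates) arrive.
  private final Set<String> unanswered =
      Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());

  // Re-request only the tasks that are still unanswered, doubling the wait
  // between attempts up to a cap. In a real scheduler this would run on its
  // own thread or timer rather than blocking.
  public void reconcile(SchedulerDriver driver, List<String> knownTaskIds)
      throws InterruptedException {
    unanswered.addAll(knownTaskIds);
    long timeoutSeconds = INITIAL_TIMEOUT_SECONDS;

    for (int attempt = 1; attempt <= MAX_ATTEMPTS && !unanswered.isEmpty(); attempt++) {
      List<TaskStatus> statuses = new ArrayList<TaskStatus>();
      for (String id : unanswered) {
        statuses.add(TaskStatus.newBuilder()
            .setTaskId(TaskID.newBuilder().setValue(id))
            .setState(TaskState.TASK_RUNNING) // our last known state
            .build());
      }
      driver.reconcileTasks(statuses);

      TimeUnit.SECONDS.sleep(timeoutSeconds);
      timeoutSeconds = Math.min(timeoutSeconds * 2, MAX_TIMEOUT_SECONDS); // exponential backoff
    }
    // Anything still in 'unanswered' here has gone many attempts without a
    // reply; only at this point would treating it as lost be (cautiously)
    // considered.
  }

  // Call this from Scheduler.statusUpdate() so answered tasks drop out.
  public void onStatusUpdate(TaskStatus status) {
    unanswered.remove(status.getTaskId().getValue());
  }
}

The important properties are that only still-unanswered tasks are re-requested, the wait doubles up to a cap, and killing is not even considered until many attempts over a long period have gone unanswered.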
> (4) Does rate limiting affect this?
>
> When enabled, rate limiting currently only operates on the rate of incoming messages from a particular framework, so the number of updates sent back has no effect on the limiting.
>
> That sounds good. Although, just to be paranoid, what if there's a problematic framework that restarts frequently (due to a bug, for example)? This would keep the Mesos master busy sending reconcile task updates to it constantly.
>
> You're right, it's an orthogonal problem to address, since it applies broadly to other messages (e.g. a framework sending 100MB tasks).
>
> Thanks.
>
> Sharma
>
> On Wed, Oct 15, 2014 at 3:22 PM, Sharma Podila <[email protected]> wrote:
> Looks like a good step forward.
>
> What is the reason for the algorithm having to call reconcile tasks multiple times after waiting some time in step 6? Shouldn't it be just once per (re)registration?
>
> Are there time bound guarantees within which a task update will be sent out after a reconcile request is sent? In the algorithm for task reconciliation, what would be a good timeout after which we conclude that we got no task update from the master? Upon such a timeout, I would be tempted to conclude that the task has disappeared, in which case I would call driver.killTask() (to be sure it's marked as gone), mark my task as terminated, then submit a replacement task.
>
> Does the "rate limiting" feature (in the works?) affect task reconciliation due to the volume of task updates sent back?
>
> Thanks.
>
> On Wed, Oct 15, 2014 at 2:05 PM, Benjamin Mahler <[email protected]> wrote:
> Hi all,
>
> I've sent a review out for a document describing reconciliation; you can see the draft here:
> https://gist.github.com/bmahler/18409fc4f052df43f403
>
> Would love to gather high-level feedback on it from framework developers. Feel free to reply here, or on the review:
> https://reviews.apache.org/r/26669/
>
> Thanks!
> Ben
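Finally, the last-resort fallback Sharma describes above (kill the task to be sure it's gone, mark it terminated, submit a replacement) might look roughly like this; FrameworkState, markTerminated, and submitReplacement are hypothetical stand-ins for a framework's own bookkeeping, and per Ben's caution it should only run after the reconciliation retries above have been exhausted:

import org.apache.mesos.Protos.TaskID;
import org.apache.mesos.SchedulerDriver;

public final class LastResortCleanup {

  // Hypothetical stand-in for the framework's own task store.
  public interface FrameworkState {
    void markTerminated(String taskId);
    void submitReplacement(String taskId);
  }

  // Fallback for a task that never got a reconciliation answer after many
  // attempts over a long period: kill it so it cannot linger unnoticed,
  // record it as terminated locally, and schedule a replacement.
  public static void giveUpOn(SchedulerDriver driver, String taskId, FrameworkState state) {
    driver.killTask(TaskID.newBuilder().setValue(taskId).build());
    state.markTerminated(taskId);     // framework-specific bookkeeping (illustrative)
    state.submitReplacement(taskId);  // relaunch however the framework normally does
  }
}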

