I'm pretty confused by what's occurring in your scheduler; let's start by looking at a particular task: https://gist.github.com/bmahler/6f6bdb0385ec245b2346
You received an update from reconciliation at 19:05:52, yet the scheduler continued to think it was reconciling this task. Later, at 19:23:53, a flood of 16 reconciliation requests for this task was sent to the master, which led to 16 LOST updates for this task within the same second. Do you know what went wrong in the scheduler here?

On the master side I see the flooding of reconciliation requests: https://gist.github.com/bmahler/d82433764ee6ce33faac

The only time the master does not reply is when a slave is transitioning, and I don't see that in the logs. So, for example, here where the scheduler's request went from 26 tasks down to 12, all 26 of them should have been reconciled on the scheduler side:

I1030 19:23:37.980578 1593 http.cpp:334] HTTP request for '/master/redirect'
I1030 19:23:38.056701 1590 http.cpp:466] HTTP request for '/master/state.json'
I1030 19:23:38.441155 1590 http.cpp:466] HTTP request for '/master/state.json'
I1030 19:23:38.826906 1590 http.cpp:466] HTTP request for '/master/state.json'
*I1030 19:23:39.238561 1590 master.cpp:3349] Performing explicit task state reconciliation for 26 tasks of framework 20140724-231003-419644938-5050-1707-0000*
I1030 19:23:42.177489 1590 http.cpp:466] HTTP request for '/master/state.json'
I1030 19:23:44.478252 1592 http.cpp:466] HTTP request for '/master/state.json'
I1030 19:23:44.854184 1592 http.cpp:466] HTTP request for '/master/state.json'
I1030 19:23:48.172827 1590 http.cpp:466] HTTP request for '/master/state.json'
I1030 19:23:48.556999 1590 http.cpp:466] HTTP request for '/master/state.json'
I1030 19:23:53.175287 1587 http.cpp:466] HTTP request for '/master/state.json'
*I1030 19:23:53.686470 1592 master.cpp:3349] Performing explicit task state reconciliation for 12 tasks of framework Singularity*

So far I'm not seeing an issue in the master. You can turn up verbose logging on the master to see exactly which status updates are being sent back (GLOG_v=1, or use the /help/logging endpoint).

----

Also note that, from what I can tell, the timestamps are only processed at second granularity in your scheduler, since you're casting to a long here (a short sketch of the effect follows below):
https://github.com/HubSpot/Singularity/blob/master/SingularityService/src/main/java/com/hubspot/singularity/mesos/SingularityMesosScheduler.java#L254

On Mon, Nov 3, 2014 at 12:24 PM, Steven Schlansker <[email protected]> wrote:

> Here's the Singularity log again, just to have them in the same email:
> https://gist.githubusercontent.com/stevenschlansker/50dbe2e068c8156a12de/raw/bd4bee96aab770f0899885d826c5b7bca76225e4/gistfile1.txt
>
> and the master log from the same time period:
> https://gist.githubusercontent.com/stevenschlansker/1577a1fc269525459571/raw/5cd53f53acc8e3b27490b0ea9af04812d624bc50/gistfile1.txt
>
> On Nov 3, 2014, at 10:46 AM, Benjamin Mahler <[email protected]> wrote:
>
> > Thanks! Do you have the master logs?
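To spell out the second-granularity note: Mesos reports TaskStatus timestamps as fractional seconds since the epoch (a double), so casting straight to a long drops the sub-second part. A minimal sketch of the pitfall and one fix (my names, not Singularity's actual code):

    import org.apache.mesos.Protos.TaskStatus;

    final class Timestamps {
      // Mesos reports TaskStatus.timestamp as seconds since the epoch, in a double.
      static long toMillis(TaskStatus status) {
        double seconds = status.getTimestamp();  // e.g. 1414696993.686470
        // (long) seconds would truncate to 1414696993, making every update
        // within the same second compare equal. Scale first to keep millis.
        return (long) (seconds * 1000.0);
      }
    }

This matters here because the 16 LOST updates above all arrived within the same second.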
> >
> > On Mon, Nov 3, 2014 at 10:13 AM, Steven Schlansker <[email protected]> wrote:
> > Hi,
> > I'm the poor end user in question :)
> >
> > I have the Singularity logs from task reconciliation saved here:
> > https://gist.githubusercontent.com/stevenschlansker/50dbe2e068c8156a12de/raw/bd4bee96aab770f0899885d826c5b7bca76225e4/gistfile1.txt
> >
> > The last line in the log file sums it up pretty well -
> > INFO [2014-10-30 19:24:21,948] com.hubspot.singularity.scheduler.SingularityTaskReconciliation: Task reconciliation ended after 50 checks and 25:00.188
> >
> > On Nov 3, 2014, at 10:02 AM, Benjamin Mahler <[email protected]> wrote:
> >
> > > I don't think this is related to your retry timeout, but it's very difficult to diagnose this without logs or a more thorough description of what occurred. Do you have the logs?
> > >
> > > user saw it take 30 minutes to eventually reconcile 25 task statuses
> > >
> > > What exactly did the user see to infer that this was related to reconciling the statuses?
> > >
> > > On Thu, Oct 30, 2014 at 3:26 PM, Whitney Sorenson <[email protected]> wrote:
> > > Ben,
> > >
> > > What's a reasonable initial timeout and cap for reconciliation when the # of slaves and tasks involved is in the tens/hundreds?
> > >
> > > I ask because in Singularity we are using a fixed 30 seconds, and one user saw it take 30 minutes to eventually reconcile 25 task statuses (after seeing all slaves crash and a master failover -- although that's another issue.)
> > >
> > > On Tue, Oct 21, 2014 at 3:52 PM, Benjamin Mahler <[email protected]> wrote:
> > > Inline.
> > >
> > > On Thu, Oct 16, 2014 at 7:43 PM, Sharma Podila <[email protected]> wrote:
> > > Response inline, below.
> > >
> > > On Thu, Oct 16, 2014 at 5:41 PM, Benjamin Mahler <[email protected]> wrote:
> > > Thanks for the thoughtful questions, I will take these into account in the document.
> > >
> > > Addressing each question in order:
> > >
> > > (1) Why the retry?
> > >
> > > It could be once per (re-)registration in the future.
> > >
> > > Some requests are temporarily unanswerable. For example, if reconciling task T on slave S, and slave S has not yet re-registered, we cannot reply until the slave is re-registered or removed. Also, if a slave is transitioning (being removed), we want to make sure that operation finishes before we can answer.
> > >
> > > It's possible to keep the request around and trigger an event once we can answer. However, we chose to drop and remain silent for these tasks. This is both for implementation simplicity and as a defense against OOMing from too many pending reconciliation requests.
> > >
> > > I was thinking that the state machine that maintains the state of tasks always has answers for the current state. Therefore, I don't expect any blocking. For example, if S hasn't yet re-registered, the state machine must think that the state of T is still 'running' until either the slave re-registers and informs of the task being lost, or a timeout occurs after which the master decides the slave is gone. At that point a new status update can be sent. I don't see a reason why reconcile needs to wait until the slave re-registers here. Maybe I am missing something else? Same with transitioning... the state information is always available, say, as running, until the transition happens. This results in two status updates, but always correct.
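An aside on Whitney's fixed 30-second retry above: the exponential backoff recommended below could look roughly like this with the Java binding. A sketch only; the class, constants, and bounds are illustrative, not Singularity's implementation:

    import java.util.Set;
    import java.util.concurrent.TimeUnit;

    import org.apache.mesos.Protos.TaskStatus;
    import org.apache.mesos.SchedulerDriver;

    // Retry explicit reconciliation with exponential backoff, re-requesting
    // only the tasks that still have no answer.
    final class BackoffReconciler {
      void run(SchedulerDriver driver, Set<TaskStatus> unanswered)
          throws InterruptedException {
        long delayMs = TimeUnit.SECONDS.toMillis(5);      // illustrative initial timeout
        final long capMs = TimeUnit.MINUTES.toMillis(5);  // illustrative cap
        while (!unanswered.isEmpty()) {
          driver.reconcileTasks(unanswered);  // master sends an update per known task
          Thread.sleep(delayMs);
          // statusUpdate() callbacks are expected to remove answered tasks
          // from 'unanswered' concurrently, so use a thread-safe set.
          delayMs = Math.min(delayMs * 2, capMs);
        }
      }
    }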
> > >
> > > Task state in Mesos is persisted in the leaves of the system (the slaves) for scalability reasons. So when a new master starts up, it doesn't know anything about tasks; this state is bootstrapped from the slaves as they re-register. This interim period of state recovery is when frameworks may not receive answers to reconciliation requests, depending on whether the particular slave has re-registered.
> > >
> > > In your second case, once a slave is removed, we will send the LOST update for all non-terminal tasks on the slave. There's little benefit in replying to a reconciliation request while the slave is being removed, because LOST updates are coming shortly thereafter. You can think of these LOST updates as the reply to the reconciliation request, as far as the scheduler is concerned.
> > >
> > > I think the two takeaways here are:
> > >
> > > (1) Ultimately, while it is possible to avoid the need for retries on the framework side, it introduces too much complexity in the master and gives us no flexibility in ignoring or dropping messages. Even in such a world, retries would be a valid resiliency measure for frameworks to insulate themselves against anything being dropped.
> > >
> > > (2) For now, we want to encourage framework developers to think about these kinds of issues; we want them to implement their frameworks in a resilient manner. And so in general we haven't chosen to provide a crutch when it requires a lot of complexity in Mesos. Today we can't add these ergonomic improvements in the scheduler driver because it has no persistence. Hopefully, as the project moves forward, we can have these kinds of framework-side ergonomic improvements contained in pure language bindings to Mesos. A nice stateful language binding can hide this from you. :)
> > >
> > > (2) Any time bound guarantees?
> > >
> > > No guarantees on exact timing, but you are guaranteed to eventually receive an answer.
> > >
> > > This is why exponential backoff is important: to tolerate variability in timing and to avoid snowballing if a backlog ever occurs.
> > >
> > > For suggesting an initial timeout, I need to digress a bit. Currently the driver does not explicitly expose the event queue to the scheduler, and so when you call reconcile, you may have an event queue in the driver full of status updates. Because of this lack of visibility, picking an initial timeout will depend on your scheduler's update processing speed and scale (# of expected status updates). Again, backoff is recommended to handle this.
> > >
> > > We were considering exposing Java bindings for the newer Event/Call API. It makes the queue explicit, which lets you avoid reconciling while you have a queue full of updates.
> > >
> > > Here is what the C++ interface looks like:
> > > https://github.com/apache/mesos/blob/0.20.1/include/mesos/scheduler.hpp#L478
> > >
> > > Does this interest you?
> > >
> > > I am interpreting this (correct me as needed) to mean that the Java callback statusUpdate() receives a queue instead of the current version with just one TaskStatus argument? I suppose this could be useful, yes. In that case, the acknowledgements of receiving the task statuses are sent to the master once per the entire queue of task statuses. Which may be OK.
> > >
> > > You would always receive a queue of events, which you can store and process asynchronously (the key to enabling this was making acknowledgements explicit).
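The explicit-queue idea above can be approximated with today's binding by buffering updates locally and only reconciling once the buffer has drained, so the reconciliation timeout isn't racing a backlog of already-delivered updates. A rough sketch (a hypothetical helper, not part of the driver API):

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    import org.apache.mesos.Protos.TaskStatus;

    final class BufferedUpdates {
      private final BlockingQueue<TaskStatus> buffer = new LinkedBlockingQueue<>();

      // Called from the driver's statusUpdate() callback.
      void offer(TaskStatus status) {
        buffer.add(status);
      }

      // Called from a worker thread: apply everything buffered, then reconcile.
      void drainThenReconcile(Runnable reconcile) {
        TaskStatus status;
        while ((status = buffer.poll()) != null) {
          apply(status);   // update the scheduler's view of this task
        }
        reconcile.run();   // buffer is empty; reconcile without a local backlog
      }

      private void apply(TaskStatus status) { /* scheduler-specific bookkeeping */ }
    }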
> > > Sorry for the tangent; keep an eye out for discussions related to the new API / HTTP API changes.
> > >
> > > (3) After a timeout with no answer, I would be tempted to kill the task.
> > >
> > > You will eventually receive an answer, so if you decide to kill the task because you have not received an answer soon enough, you may make the wrong decision. This is up to you.
> > >
> > > In particular, I would caution against making decisions without feedback, because it can lead to a snowball effect if tasks are treated independently. In the event of a backlog, what's to stop you from killing all tasks because you haven't received any answers?
> > >
> > > I would recommend that you only use this kind of timeout as a last resort, when not receiving a response after a large amount of time and a large number of reconciliation requests.
> > >
> > > Yes, that is the timeout value I was after. However, based on my response to #1, this could be short, couldn't it?
> > >
> > > Yes, it could be on the order of seconds to start with.
> > >
> > > (4) Does rate limiting affect this?
> > >
> > > When enabled, rate limiting currently only operates on the rate of incoming messages from a particular framework, so the number of updates sent back has no effect on the limiting.
> > >
> > > That sounds good. Although, just to be paranoid, what if there's a problematic framework that restarts frequently (due to a bug, for example)? This would keep the Mesos master busy sending reconcile task updates to it constantly.
> > >
> > > You're right, it's an orthogonal problem to address, since it applies broadly to other messages (e.g. a framework sending 100MB tasks).
> > >
> > > Thanks.
> > >
> > > Sharma
> > >
> > > On Wed, Oct 15, 2014 at 3:22 PM, Sharma Podila <[email protected]> wrote:
> > > Looks like a good step forward.
> > >
> > > What is the reason for the algorithm having to call reconcile tasks multiple times after waiting some time in step 6? Shouldn't it be just once per (re)registration?
> > >
> > > Are there time bound guarantees within which a task update will be sent out after a reconcile request is sent? In the algorithm for task reconciliation, what would be a good timeout after which we conclude that we got no task update from the master? Upon such a timeout, I would be tempted to conclude that the task has disappeared. In which case, I would call driver.killTask() (to be sure it's marked as gone), mark my task as terminated, then submit a replacement task.
> > >
> > > Does the "rate limiting" feature (in the works?) affect task reconciliation due to the volume of task updates sent back?
> > >
> > > Thanks.
> > >
> > > On Wed, Oct 15, 2014 at 2:05 PM, Benjamin Mahler <[email protected]> wrote:
> > > Hi all,
> > >
> > > I've sent a review out for a document describing reconciliation; you can see the draft here:
> > > https://gist.github.com/bmahler/18409fc4f052df43f403
> > >
> > > Would love to gather high-level feedback on it from framework developers. Feel free to reply here, or on the review:
> > > https://reviews.apache.org/r/26669/
> > >
> > > Thanks!
> > > Ben
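Coming back to (3) above: one way to encode the "last resort" guidance is to require both many reconciliation attempts and a long elapsed time before declaring a silent task gone, so that a temporary backlog can't snowball into killing every task at once. A hypothetical policy; the thresholds are made up, not a Mesos recommendation:

    import java.util.concurrent.TimeUnit;

    final class LastResortPolicy {
      private static final int MIN_ATTEMPTS = 20;                            // made up
      private static final long MIN_ELAPSED_MS = TimeUnit.HOURS.toMillis(1); // made up

      // Give up on a task only after BOTH conditions hold.
      boolean mayDeclareLost(int attempts, long firstAttemptMs, long nowMs) {
        return attempts >= MIN_ATTEMPTS
            && (nowMs - firstAttemptMs) >= MIN_ELAPSED_MS;
      }
    }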

