Yup, that looks like the way to go. Going to go ahead and file a ticket on
JIRA for this so that we don't forget. Thanks for digging into this, David.
-Renan
On Mon, Jul 17, 2017 at 3:00 PM, David McLaughlin
wrote:
Based on the thread in the Mesos dev list, it looks like, because they don't
persist task information, they don't have the task IDs to send when they
detect the agent is lost during failover. So unless this is changed on the
Mesos side, we need to act on the slaveLost message and mark all those
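
For illustration only, here is a minimal Python sketch of what "acting on slaveLost" could look like on the scheduler side. Everything in it (TaskStore, mark_agent_lost, on_slave_lost, the status strings) is a hypothetical stand-in, not Aurora's or Mesos's actual API; it only assumes the scheduler keeps its own record of which tasks run on which agent, so it does not depend on Mesos supplying task IDs.

```python
from collections import defaultdict

# Assumed set of terminal states for this sketch; tasks in these states are left alone.
TERMINAL = {"FINISHED", "FAILED", "KILLED", "LOST"}


class TaskStore:
    """Hypothetical, simplified task store (not Aurora's real storage layer)."""

    def __init__(self):
        self._status = {}                   # task_id -> status string
        self._by_agent = defaultdict(set)   # agent_id -> set of task_ids

    def add_task(self, task_id, agent_id, status="RUNNING"):
        self._status[task_id] = status
        self._by_agent[agent_id].add(task_id)

    def status(self, task_id):
        return self._status[task_id]

    def mark_agent_lost(self, agent_id):
        """Mark every non-terminal task on the agent as LOST and return their IDs."""
        lost = sorted(
            t for t in self._by_agent.get(agent_id, ())
            if self._status[t] not in TERMINAL
        )
        for task_id in lost:
            self._status[task_id] = "LOST"
        return lost


def on_slave_lost(store, agent_id):
    # Stand-in for the scheduler's slaveLost callback: rather than waiting for
    # per-task TASK_LOST updates, use the scheduler's own agent -> task index.
    return store.mark_agent_lost(agent_id)
```

Usage would be along the lines of on_slave_lost(store, "agent-1"), which returns the IDs of the tasks that were just transitioned to LOST so they can be rescheduled.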
Got it. Thx!
> On Jul 16, 2017, at 9:49 AM, Stephan Erb wrote:
Reconciliation in Aurora is not a specific mode. It just runs
concurrently with other background work such as snapshots or backups [1].
Just be aware that we don't have metrics to track the runtime of
explicit and implicit reconciliations. If you use settings that are
overly aggressive, you might
Thx David for the follow-up and confirmation.
We have started the thread on the Mesos dev DL.
So, to get clarification on the recon: what is the general effect during
recon? Are scheduling and activities like snapshots paused while recon takes
place? Trying to see whether to run aggressive
I've left a comment on the initial RB detailing how the change broke
backwards-compatibility. Given that the tasks are marked as lost as soon as
the agent reregisters after slaveLost is sent anyway, there doesn't seem to
be any reason not to send TASK_LOST too. I think this should be an easy
fix.
Yes, we've confirmed this internally too (Santhosh did the work here):
> When an agent becomes unreachable while the master is running, it sends
> TASK_LOST events for each task on the agent.
> https://github.com/apache/mesos/blob/33093c893773f8c9d293afe38e9909
>
It would be interesting to see the logs. I think that will tell you if the
Mesos master is:
a) Sending slaveLost
b) Trying to send TASK_LOST
And then the Scheduler logs (and/or the metrics it exports) should tell you
whether those events were received. If this is reproducible, I'd consider
it a
"1. When mesos sends slave lost after 10 mins in this situation, why does
aurora not act on it?"
Because Mesos also sends TASK_LOST for every task running on the agent
whenever it calls slaveLost:
When it is time to remove an agent, the master removes the agent from the
list of registered