Re: Aurora reconciliation and Master fail over

2017-07-18 Thread Renan DelValle
Yup, that looks like the way to go. Going to go ahead and file a ticket on JIRA for this so that we don't forget. Thanks for digging into this, David.
-Renan
On Mon, Jul 17, 2017 at 3:00 PM, David McLaughlin wrote:
> Based on the thread in the Mesos dev list, it looks

Re: Aurora reconciliation and Master fail over

2017-07-17 Thread David McLaughlin
Based on the thread in the Mesos dev list, it looks like, because they don't persist task information, they don't have the task IDs to send when they detect the agent is lost during failover. So unless this is changed on the Mesos side, we need to act on the slaveLost message and mark all those
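A minimal sketch of the idea in this message: the scheduler acts on slaveLost itself rather than waiting for per-task TASK_LOST updates that may never arrive after a master failover. It assumes the truncated sentence refers to the tasks on the lost agent. The `TaskStore` class, its method names, and the status strings are hypothetical and for illustration only; they are not Aurora's or Mesos's actual API.

```python
from collections import defaultdict

# Hypothetical set of terminal states; a real scheduler defines its own.
TERMINAL = {"FINISHED", "FAILED", "KILLED", "LOST"}


class TaskStore:
    """Hypothetical in-memory task store (not Aurora's real storage layer)."""

    def __init__(self):
        self._status = {}                   # task_id -> status
        self._by_agent = defaultdict(set)   # agent_id -> task_ids

    def add(self, task_id, agent_id, status="RUNNING"):
        self._status[task_id] = status
        self._by_agent[agent_id].add(task_id)

    def status(self, task_id):
        return self._status[task_id]

    def on_slave_lost(self, agent_id):
        """Mark every non-terminal task on the agent as LOST.

        Returns the IDs that were transitioned, so the caller could
        reschedule them. Calling it again for the same agent is a no-op.
        """
        lost = []
        for task_id in sorted(self._by_agent.pop(agent_id, set())):
            if self._status[task_id] not in TERMINAL:
                self._status[task_id] = "LOST"
                lost.append(task_id)
        return lost


store = TaskStore()
store.add("t1", "agent-1")
store.add("t2", "agent-1", status="FINISHED")
store.add("t3", "agent-2")
lost_ids = store.on_slave_lost("agent-1")
```

Keeping the handler idempotent means a later TASK_LOST update or a reconciliation pass for the same tasks would not cause a second transition.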

Re: Aurora reconciliation and Master fail over

2017-07-16 Thread Meghdoot bhattacharya
Got it. Thx!
> On Jul 16, 2017, at 9:49 AM, Stephan Erb wrote:
>
> Reconciliation in Aurora is not a specific mode. It just runs concurrently to other background work such as snapshots or backups [1].
>
> Just be aware that we don't have metrics to track the runtime of

Re: Aurora reconciliation and Master fail over

2017-07-16 Thread Stephan Erb
Reconciliation in Aurora is not a specific mode. It just runs concurrently to other background work such as snapshots or backups [1]. Just be aware that we don't have metrics to track the runtime of explicit and implicit reconciliations. If you use settings that are overly aggressive, you might

Re: Aurora reconciliation and Master fail over

2017-07-15 Thread Meghdoot bhattacharya
Thx David for the follow-up and confirmation. We have started the thread on the Mesos dev DL. To get clarification on the recon: what is generally in effect during the recon? Are scheduling and activities like snapshots paused while recon takes place? Trying to see whether to run aggressive

Re: Aurora reconciliation and Master fail over

2017-07-15 Thread David McLaughlin
I've left a comment on the initial RB detailing how the change broke backwards compatibility. Given that the tasks are marked as lost as soon as the agent reregisters after slaveLost is sent anyway, there doesn't seem to be any reason not to send TASK_LOST too. I think this should be an easy fix.

Re: Aurora reconciliation and Master fail over

2017-07-15 Thread David McLaughlin
Yes, we've confirmed this internally too (Santhosh did the work here):
> When an agent becomes unreachable while the master is running, it sends TASK_LOST events for each task on the agent.
> https://github.com/apache/mesos/blob/33093c893773f8c9d293afe38e9909

Re: Aurora reconciliation and Master fail over

2017-07-14 Thread David McLaughlin
It would be interesting to see the logs. I think that will tell you if the Mesos master is:
a) Sending slaveLost
b) Trying to send TASK_LOST
And then the Scheduler logs (and/or the metrics it exports) should tell you whether those events were received. If this is reproducible, I'd consider it a

Re: Aurora reconciliation and Master fail over

2017-07-14 Thread David McLaughlin
"1. When mesos sends slave lost after 10 mins in this situation, why does aurora not act on it?"
Because Mesos also sends TASK_LOST for every task running on the agent whenever it calls slaveLost:
When it is time to remove an agent, the master removes the agent from the list of registered