Re: Duplicate task ID for same framework on different agents

2017-12-21 Thread Benjamin Mahler
It's a known issue:
https://issues.apache.org/jira/browse/MESOS-3070

Putting in place a protection mechanism sounds good, but is rather
complicated. See the comment in this ticket:
https://issues.apache.org/jira/browse/MESOS-6785

On Wed, Dec 20, 2017 at 8:26 PM, Zhitao Li wrote:

> Hi all,
>
> We have seen a Mesos master crash loop after a leader failover. After more
> investigation, it seems that the same task ID was created on multiple
> Mesos agents in the cluster.
>
> One possible sequence of events that could lead to this problem:
>
> 1. Task T1 was launched through master M1 on agent A1 for framework F;
> 2. Master M1 failed over to M2;
> 3. Before A1 reregistered with M2, the same T1 was launched on agent A2:
> M2 did not know about the previous T1 yet, so it accepted the launch and
> sent it to A2;
> 4. A1 reregistered: this probably crashed M2 (because the same task cannot
> be added twice);
> 5. When M3 tried to come up after M2, it crashed as well because both A1
> and A2 tried to add a T1 to the framework.
>
> (I only have logs to prove the last step right now)
>
> This happened on 1.4.0 masters.
>
> Although this was probably triggered by incorrect retry logic on the
> framework side, I wonder whether the Mesos master should add extra
> protection to prevent such an issue from causing a master crash loop. Some
> possible ideas are to instruct one of the agents carrying tasks with a
> duplicate ID to terminate the corresponding tasks, or to simply refuse to
> reregister such agents and instruct them to shut down.
>
> I also filed MESOS-8353 to track this potential bug. Thanks!
>
>
> --
>
> Cheers,
>
> Zhitao Li
>


Duplicate task ID for same framework on different agents

2017-12-20 Thread Zhitao Li
Hi all,

We have seen a Mesos master crash loop after a leader failover. After more
investigation, it seems that the same task ID was created on multiple
Mesos agents in the cluster.

One possible sequence of events that could lead to this problem:

1. Task T1 was launched through master M1 on agent A1 for framework F;
2. Master M1 failed over to M2;
3. Before A1 reregistered with M2, the same T1 was launched on agent A2:
M2 did not know about the previous T1 yet, so it accepted the launch and
sent it to A2;
4. A1 reregistered: this probably crashed M2 (because the same task cannot
be added twice);
5. When M3 tried to come up after M2, it crashed as well because both A1
and A2 tried to add a T1 to the framework.
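
To make the sequence above concrete, here is a toy Python model of it. This
is only a sketch and not Mesos code: ToyMaster, add_task, launch,
reregister_agent and simulate are hypothetical names, and the RuntimeError
stands in for whatever fatal check makes the real master abort when the same
task is added twice (an assumption, in line with the "probably" in step 4).

```python
class ToyMaster:
    """Toy stand-in for a master's in-memory task table (not Mesos code)."""

    def __init__(self, name):
        self.name = name
        # (framework_id, task_id) -> agent_id
        self.tasks = {}

    def add_task(self, framework_id, task_id, agent_id):
        key = (framework_id, task_id)
        # Assumption: adding the same task twice is treated as fatal.
        if key in self.tasks:
            raise RuntimeError(
                f"{self.name}: task {task_id} of framework {framework_id} "
                f"already known on {self.tasks[key]}, re-added by {agent_id}"
            )
        self.tasks[key] = agent_id

    def launch(self, framework_id, task_id, agent_id):
        # A fresh launch only checks the master's current view.
        self.add_task(framework_id, task_id, agent_id)

    def reregister_agent(self, agent_id, running_tasks):
        # A reregistering agent reports every task it is still running.
        for framework_id, task_id in running_tasks:
            self.add_task(framework_id, task_id, agent_id)


def simulate():
    """Replay steps 1-5 and return the names of the masters that crashed."""
    agents = {"A1": [], "A2": []}  # tasks each agent believes it runs
    crashed = []

    # Step 1: T1 launched through M1 on A1 for framework F.
    m1 = ToyMaster("M1")
    m1.launch("F", "T1", "A1")
    agents["A1"].append(("F", "T1"))

    # Step 2: M1 fails over; M2 starts with an empty task table.
    m2 = ToyMaster("M2")

    # Step 3: the same T1 is launched on A2 before A1 has reregistered.
    m2.launch("F", "T1", "A2")
    agents["A2"].append(("F", "T1"))

    # Step 4: A1 reregisters and reports T1, which M2 already tracks on A2.
    try:
        m2.reregister_agent("A1", agents["A1"])
    except RuntimeError:
        crashed.append("M2")

    # Step 5: M3 comes up; both agents reregister and both report T1.
    m3 = ToyMaster("M3")
    try:
        for agent_id, running in agents.items():
            m3.reregister_agent(agent_id, running)
    except RuntimeError:
        crashed.append("M3")

    return crashed
```

In this model every later master would hit the same error for as long as both
A1 and A2 keep reporting T1, which is one way to read the crash loop.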

(I only have logs to prove the last step right now)

This happened on 1.4.0 masters.

Although this was probably triggered by incorrect retry logic on the
framework side, I wonder whether the Mesos master should add extra
protection to prevent such an issue from causing a master crash loop. Some
possible ideas are to instruct one of the agents carrying tasks with a
duplicate ID to terminate the corresponding tasks, or to simply refuse to
reregister such agents and instruct them to shut down.
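
As a rough illustration of the second idea (refusing to reregister an agent
that carries a duplicate task ID), here is a minimal Python sketch. It is not
Mesos code: reregister_agent, the "ACCEPT"/"SHUTDOWN" results and the shape
of known_tasks are all hypothetical, and it ignores the cases that would make
a real implementation complicated.

```python
def reregister_agent(known_tasks, agent_id, running_tasks):
    """Decide whether a reregistering agent may rejoin (toy model).

    known_tasks maps (framework_id, task_id) -> agent_id and is only
    updated when the agent is accepted. Returns (decision, duplicates).
    """
    # A duplicate is a task the master already tracks on a different agent.
    duplicates = [
        key
        for key in running_tasks
        if key in known_tasks and known_tasks[key] != agent_id
    ]
    if duplicates:
        # Instead of aborting, refuse the agent and ask it to shut down.
        return "SHUTDOWN", duplicates

    for key in running_tasks:
        known_tasks[key] = agent_id
    return "ACCEPT", []
```

With this policy, whichever agent reregisters second loses: if A2 is already
known to be running T1, A1 is told to shut down when it reports T1, and the
master's table is left unchanged. The first idea (terminating only the
duplicate tasks) would return the list of duplicates as tasks to kill instead
of rejecting the whole agent.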

I also filed MESOS-8353 to track this potential bug. Thanks!


-- 

Cheers,

Zhitao Li