[jira] [Commented] (MESOS-4659) Avoid leaving orphan task after framework failure + master failover

2020-02-17 Thread Vinod Kone (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-4659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17038595#comment-17038595
 ] 

Vinod Kone commented on MESOS-4659:
---

I don't have the bandwidth right now, but I'm happy to review the code if you 
work on a patch. Please see the instructions here: 
https://mesos.readthedocs.io/en/latest/submitting-a-patch/

> Avoid leaving orphan task after framework failure + master failover
> ---
>
> Key: MESOS-4659
> URL: https://issues.apache.org/jira/browse/MESOS-4659
> Project: Mesos
> Issue Type: Bug
> Components: master
> Reporter: Neil Conway
> Priority: Major
> Labels: failover, mesosphere
>
> If a framework becomes disconnected from the master, its tasks are killed 
> after waiting for {{failover_timeout}}.
> However, if a master failover occurs but a framework never reconnects to the 
> new master, we never kill any of the tasks associated with that framework. 
> These tasks remain orphaned and presumably would need to be manually removed 
> by the operator. Similarly, if a framework gets torn down or disconnects 
> while it has running tasks on a partitioned agent, those tasks are not 
> shut down when the agent reregisters.
> We should consider whether to kill such orphaned tasks automatically, likely 
> after waiting for some (framework-configurable?) timeout.
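
To make the second scenario in the description concrete, here is a minimal sketch of how orphaned tasks could be identified when an agent reregisters. It is an illustrative model only, not Mesos code: the function name `find_orphaned_tasks` and the plain dict/set inputs are assumptions made for this example.

```python
def find_orphaned_tasks(agent_tasks, registered_frameworks):
    """Return the task IDs whose framework the master does not currently know.

    agent_tasks: mapping of task ID -> framework ID, as reported by an agent
        (hypothetical shape, chosen for this sketch).
    registered_frameworks: set of framework IDs currently registered.
    """
    return sorted(
        task_id
        for task_id, framework_id in agent_tasks.items()
        if framework_id not in registered_frameworks
    )
```

Whether such tasks should then be killed immediately or only after a timeout is the open question the issue raises; the sketch covers detection only.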



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-4659) Avoid leaving orphan task after framework failure + master failover

2020-02-17 Thread Charles (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-4659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17038509#comment-17038509
 ] 

Charles commented on MESOS-4659:


[~vinodkone]

Is there any chance of implementing the workaround you suggested above: setting 
up a timer upon framework recovery to automatically tear down the framework 
after the failover timeout if it hasn't re-registered by then?
It would be really useful for us, as we sometimes run into a problem where 
agents keep spamming the master about finished tasks for an unregistered 
framework.

If you don't have the bandwidth to work on it, I could maybe work on a patch?

Cheers,



[jira] [Commented] (MESOS-4659) Avoid leaving orphan task after framework failure + master failover

2018-07-12 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-4659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16542307#comment-16542307
 ] 

Vinod Kone commented on MESOS-4659:
---

I was looking at this code today, and I think we can fix this now without 
having to persist frameworks in the registry. This is because after MESOS-6419, 
the master recovers FrameworkInfo from a re-registering agent after a failover. 
The master could conceivably schedule a timer for the failover timeout 
(available from the recovered FrameworkInfo) after such a recovery and tear 
down the framework if it has not re-registered by the time the timer fires. It 
is not a perfect solution, because it depends on the timing of the first agent 
re-registration that recovers a framework, but it is a good stopgap until we 
get to persisting frameworks in the registry.
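
The timer idea above can be sketched as a small simulation. This is a toy model under stated assumptions, not the master's actual implementation: `MasterModel`, `RecoveredFramework`, and every method name are hypothetical, time is passed in explicitly as seconds instead of using a real clock, and the timer is modelled as a periodic `expire` check.

```python
from dataclasses import dataclass, field


@dataclass
class RecoveredFramework:
    """A framework the new master learned about from an agent, not from the framework itself."""
    framework_id: str
    failover_timeout: float  # seconds, taken from the recovered FrameworkInfo
    recovered_at: float      # time of the first agent re-registration that reported it
    reregistered: bool = False


@dataclass
class MasterModel:
    """Toy stand-in for the master after a failover (hypothetical, not Mesos code)."""
    frameworks: dict = field(default_factory=dict)
    torn_down: list = field(default_factory=list)

    def agent_reregistered(self, now, framework_infos):
        # framework_infos maps framework ID -> failover_timeout in seconds.
        for framework_id, timeout in framework_infos.items():
            # Only the first agent to report a framework starts its timer.
            if framework_id not in self.frameworks and framework_id not in self.torn_down:
                self.frameworks[framework_id] = RecoveredFramework(framework_id, timeout, now)

    def framework_reregistered(self, framework_id):
        # A framework that comes back in time is no longer subject to teardown.
        if framework_id in self.frameworks:
            self.frameworks[framework_id].reregistered = True

    def expire(self, now):
        """Tear down recovered frameworks whose timeout elapsed without re-registration."""
        expired = [
            f.framework_id
            for f in self.frameworks.values()
            if not f.reregistered and now - f.recovered_at >= f.failover_timeout
        ]
        for framework_id in expired:
            del self.frameworks[framework_id]
            self.torn_down.append(framework_id)
        return expired
```

The model also shows the caveat mentioned above: the countdown starts at `recovered_at`, the moment the first agent reports the framework, so the effective deadline depends on when that agent happens to re-register and not on when the framework actually disconnected.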



