[jira] [Commented] (MESOS-4659) Avoid leaving orphan task after framework failure + master failover
[ https://issues.apache.org/jira/browse/MESOS-4659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17038595#comment-17038595 ]

Vinod Kone commented on MESOS-4659:
---

I don't have the bandwidth right now, but I'm happy to review the code if you work on a patch. Please see the instructions here: https://mesos.readthedocs.io/en/latest/submitting-a-patch/

> Avoid leaving orphan task after framework failure + master failover
> ---
>
> Key: MESOS-4659
> URL: https://issues.apache.org/jira/browse/MESOS-4659
> Project: Mesos
> Issue Type: Bug
> Components: master
> Reporter: Neil Conway
> Priority: Major
> Labels: failover, mesosphere
>
> If a framework becomes disconnected from the master, its tasks are killed
> after waiting for {{failover_timeout}}.
> However, if a master failover occurs but a framework never reconnects to the
> new master, we never kill any of the tasks associated with that framework.
> These tasks remain orphaned and presumably would need to be manually removed
> by the operator. Similarly, if a framework gets torn down or disconnects
> while it has running tasks on a partitioned agent, those tasks are not
> shut down when the agent reregisters.
> We should consider whether to kill such orphaned tasks automatically, likely
> after waiting for some (framework-configurable?) timeout.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-4659) Avoid leaving orphan task after framework failure + master failover
[ https://issues.apache.org/jira/browse/MESOS-4659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17038509#comment-17038509 ]

Charles commented on MESOS-4659:
---

[~vinodkone] Is there any chance of implementing the workaround you suggested above: setting up a timer upon framework recovery to automatically tear down the framework after the failover timeout if it hasn't re-registered by then?

It would be really useful for us, as we sometimes run into a problem where the agents keep spamming the master about finished tasks for an unregistered framework.

If you don't have the bandwidth to work on it, I could maybe work on a patch?

Cheers,
[jira] [Commented] (MESOS-4659) Avoid leaving orphan task after framework failure + master failover
[ https://issues.apache.org/jira/browse/MESOS-4659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16542307#comment-16542307 ]

Vinod Kone commented on MESOS-4659:
---

I was looking at this code today, and I think we can fix this now without having to persist frameworks in the registry. This is because, after MESOS-6419, the master recovers FrameworkInfo from a re-registering agent after a failover. The master could conceivably schedule a timer for the failover timeout (available from the recovered FrameworkInfo) after such a recovery and tear down the framework if it hasn't re-registered by the time the timer fires.

It is not a perfect solution, because it depends on the timing of the first agent re-registration that recovers a framework, but it is a good stopgap until we get to persisting frameworks in the registry.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
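
As a rough illustration of the stopgap described in the comment above (arm a timer when a re-registering agent recovers a framework's FrameworkInfo, and tear the framework down if it has not re-registered when the timer fires), here is a minimal Python sketch. It is a toy model and not Mesos code: the `Master` class, its method names, the simulated clock, and the framework IDs are all hypothetical and exist only to show the timing logic.

```python
from dataclasses import dataclass, field


@dataclass
class FrameworkInfo:
    framework_id: str
    failover_timeout: float  # seconds, taken from the recovered FrameworkInfo


@dataclass
class Master:
    """Toy model of a newly elected master (hypothetical, not the Mesos API)."""
    now: float = 0.0                                 # simulated clock, in seconds
    registered: set = field(default_factory=set)     # frameworks that came back
    deadlines: dict = field(default_factory=dict)    # framework_id -> teardown time
    torn_down: list = field(default_factory=list)    # frameworks we gave up on

    def agent_reregistered(self, frameworks):
        """An agent re-registers and reports the frameworks it runs tasks for."""
        for info in frameworks:
            fid = info.framework_id
            known = (fid in self.registered or fid in self.deadlines
                     or fid in self.torn_down)
            if known:
                continue
            # The timer starts at the FIRST agent re-registration that recovers
            # this framework, which is the timing caveat noted in the comment.
            self.deadlines[fid] = self.now + info.failover_timeout

    def framework_reregistered(self, framework_id):
        """The framework reconnected in time, so cancel its pending teardown."""
        self.registered.add(framework_id)
        self.deadlines.pop(framework_id, None)

    def advance(self, seconds):
        """Move the simulated clock forward and fire any expired timers."""
        self.now += seconds
        expired = sorted(f for f, d in self.deadlines.items() if d <= self.now)
        for fid in expired:
            del self.deadlines[fid]
            self.torn_down.append(fid)  # stands in for tearing the framework down


master = Master()
master.agent_reregistered([FrameworkInfo("fw-a", 60.0), FrameworkInfo("fw-b", 30.0)])
master.advance(20.0)
master.framework_reregistered("fw-a")  # fw-a returns before its 60 s deadline
master.advance(50.0)                   # fw-b never returns; its 30 s timer fires
print(master.torn_down)
```

In this model only `fw-b` ends up torn down. A real implementation would also have to decide what to do when a later agent reports a framework that has already been torn down; the sketch simply ignores such reports.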