Anand Mazumdar created MESOS-7426:
-------------------------------------

             Summary: Support for agent lifecycle management.
                 Key: MESOS-7426
                 URL: https://issues.apache.org/jira/browse/MESOS-7426
             Project: Mesos
          Issue Type: Epic
          Components: agent
            Reporter: Anand Mazumdar


This epic coordinates the work to introduce agent lifecycle management in 
Mesos, allowing a framework to be notified of agent node failures. The 
existing {{Event::Failure}} is not enough for a framework to determine that 
the given agent node is never coming back.

The primary motivations for introducing such a feature would be:

- Currently, when an agent running a task fails, an operator must manually 
remove the node via a configuration API exposed by the framework, e.g., 
`dcos cassandra node replace` for the Cassandra framework. This must be done 
once for every stateful framework running on the cluster.

- When an agent is marked as unhealthy, the removal rate is bounded if the 
`--agent_removal_rate_limit` option is set. This is particularly problematic 
for operators relying on EC2 autoscaling groups or on workload bursting to 
another cloud.

- When the fault domain associated with an agent changes (e.g., it is moved 
from an unallocated rack to an allocated rack), there is no feedback mechanism 
for the framework.





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
