[ 
https://issues.apache.org/jira/browse/MESOS-6522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15627454#comment-15627454
 ] 

Benjamin Mahler commented on MESOS-6522:
----------------------------------------

The current thinking w.r.t. maintenance is that we ask for resources back 
from the schedulers (via inverse offers). Schedulers know the deadline and 
should co-operate with these requests. Once the maintenance window begins, the 
operator could force the draining of the agent but this will potentially cause 
SLA violations or data loss if the maintenance is destructive. Because of this, 
we'd like the operator to make this call. It is also to be expected that some 
maintenance attempts will not succeed, since they would have led to SLA 
violations for the frameworks, or data loss in the case of destructive 
maintenance. In these cases the operator can follow up on the "stragglers" 
with a more suitable maintenance plan.

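As a rough illustration of the flow described above, here is a minimal sketch 
of how an operator-side tool might separate agents that were vacated in time 
from the "stragglers" once the maintenance window begins. The Agent class and 
split_at_window_start() are hypothetical names for this example only; they are 
not part of the Mesos API.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Agent:
    # Hypothetical stand-in for an agent scheduled for maintenance.
    hostname: str
    running_tasks: list = field(default_factory=list)


def split_at_window_start(agents, window_start, now):
    """Partition agents into (ready, stragglers) once the window has begun.

    Before the window starts nothing is decided, since schedulers still
    have time to hand resources back in response to inverse offers.
    """
    if now < window_start:
        return [], []
    # Agents with no remaining tasks can be taken down safely.
    ready = [a for a in agents if not a.running_tasks]
    # Agents still running tasks are left for the operator to decide on.
    stragglers = [a for a in agents if a.running_tasks]
    return ready, stragglers
```

The sketch deliberately does not kill anything: whether to force a drain on 
the stragglers remains the operator's call.
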
A maximum executor lifetime is interesting in that it forces churn in the 
cluster, but it would make it very difficult to implement certain classes of 
workloads (e.g. data storage) and I suspect it would frustrate framework 
developers since they have no control over it. In general we try to give 
control to the frameworks, since only they understand the workload.

> Ability to set global maximum executor runtime for an agent
> -----------------------------------------------------------
>
>                 Key: MESOS-6522
>                 URL: https://issues.apache.org/jira/browse/MESOS-6522
>             Project: Mesos
>          Issue Type: Improvement
>          Components: slave
>            Reporter: Will Rouesnel
>            Priority: Minor
>
> With the developing concept of agent maintenance mode, it would be nice to 
> have some blunt-force ability to reason about the behavior of agents on 
> uncooperative frameworks.
> Ideally there would be a new parameter --executor_maximum_lifetime which 
> would specify a maximum duration for which *any* executor on an agent can run 
> before being terminated.
> Even when using persistent schedulers such as Marathon, the ability to 
> enforce reasonable guarantees about when an agent's tasks definitely must end 
> can help contribute to keeping the cluster turning over and prevent nodes 
> becoming "special" or jammed up with jobs which will not end.
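
For reference, the check implied by the proposed --executor_maximum_lifetime 
parameter could look roughly like the sketch below. This is only an 
illustration of the requested behaviour under assumed names: no such flag 
exists in Mesos, and executors_past_lifetime() and the executor IDs are 
hypothetical.

```python
from datetime import datetime, timedelta


def executors_past_lifetime(started_at, now, max_lifetime):
    """Return the IDs of executors that have run longer than max_lifetime.

    started_at maps a (hypothetical) executor ID to its start time.
    """
    return sorted(
        executor_id
        for executor_id, start in started_at.items()
        if now - start > max_lifetime
    )
```

Note that the check is blind to what each executor is doing: a long-lived 
storage executor is selected exactly like a stuck job.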



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
