[jira] [Commented] (MESOS-1474) Provide cluster maintenance primitives for operators.

2016-08-24 Thread Joris Van Remoortere (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15435880#comment-15435880
 ] 

Joris Van Remoortere commented on MESOS-1474:
-

To help clarify: The new offers have an explicit unavailability in them that 
indicates how long the agent will still be up. New tasks scheduled there should 
be able to complete prior to that time point.
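
As a rough illustration of that check, here is a minimal sketch in Python. The Offer class and its unavailability_start_ns field are simplified stand-ins invented for this example, not the real Mesos protobuf types, and the expected runtime is assumed to be something the scheduler already knows or estimates.

```python
from dataclasses import dataclass
from typing import Optional

NANOS_PER_SECOND = 1_000_000_000


@dataclass
class Offer:
    """Hypothetical, simplified view of an offer (not the Mesos protobuf)."""
    agent_id: str
    # Start of the planned unavailability in nanoseconds since the epoch,
    # or None when no maintenance is scheduled for the agent.
    unavailability_start_ns: Optional[int] = None


def task_fits_before_maintenance(offer: Offer, now_ns: int,
                                 expected_runtime_s: float) -> bool:
    """Return True if the task is expected to finish before the agent goes down."""
    if offer.unavailability_start_ns is None:
        return True  # no maintenance planned, nothing to check
    expected_end_ns = now_ns + int(expected_runtime_s * NANOS_PER_SECOND)
    return expected_end_ns <= offer.unavailability_start_ns
```

A scheduler could run this check per offer and decline the offer for tasks that would not finish in time.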

> Provide cluster maintenance primitives for operators.
> -
>
> Key: MESOS-1474
> URL: https://issues.apache.org/jira/browse/MESOS-1474
> Project: Mesos
>  Issue Type: Epic
>  Components: framework, master, slave
>Reporter: Benjamin Mahler
>Assignee: Artem Harutyunyan
>  Labels: mesosphere, twitter
>
> Sometimes operators need to perform maintenance on a mesos cluster; we define 
> maintenance here as anything that requires the tasks to be drained on the 
> slave(s). Most mesos upgrades can be done without affecting running tasks, 
> but there are situations where maintenance is task-affecting:
> * Host maintenance (e.g. hardware repair, kernel upgrades).
> * Non-recoverable slave upgrades (e.g. adjusting slave attributes).
> * etc
> In order to ensure operators don’t violate frameworks’ SLAs, schedulers need 
> to be aware of planned unavailability events.
> Maintenance awareness allows schedulers to avoid churn for long running tasks 
> by placing them on machines not undergoing maintenance. If all resources are 
> planned for maintenance, then the scheduler will prefer machines scheduled 
> for maintenance least imminently.
> Maintenance awareness is also crucial when a scheduler uses [persistent 
> disk|https://issues.apache.org/jira/browse/MESOS-1554] resources, to ensure 
> that the scheduler is aware of the expected duration of unavailability for a 
> persistent disk resource (e.g. with three 1TB replicas, there is no need to 
> replicate 1TB over the network when only one of the three replicas will be 
> unavailable for a reboot lasting < 1 hour).
> There are a few primitives of interest here:
> * Provide a way for operators to [fully shut down a 
> slave|https://issues.apache.org/jira/browse/MESOS-1475] (killing all tasks 
> underneath it). Colloquially known as a "hard drain".
> * Provide a way for operators to mark specific slaves as scheduled for 
> maintenance. This will inform the scheduler about the scheduled 
> unavailability of the resources.
> * Provide a way for frameworks to be notified when resources are requested to 
> be relinquished. This gives the framework a chance to proactively move a task before 
> it may be forcibly killed by an operator. It also allows the automation of 
> operations like: "please drain these slaves within 1 hour."
> See the [design 
> doc|https://docs.google.com/a/twitter.com/document/d/16k0lVwpSGVOyxPSyXKmGC-gbNmRlisNEe4p-fAUSojk/edit#]
>  for the latest details.
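
The placement preference described above (avoid machines with planned maintenance, otherwise prefer the least imminent maintenance) could be sketched roughly as follows. AgentOffer and rank_for_long_running are hypothetical names made up for this example; a real scheduler would read the unavailability from the offer itself.

```python
from typing import List, NamedTuple, Optional


class AgentOffer(NamedTuple):
    """Hypothetical, simplified offer record used only for this sketch."""
    agent_id: str
    # Planned maintenance start (ns since epoch), or None if none is scheduled.
    unavailability_start_ns: Optional[int]


def rank_for_long_running(offers: List[AgentOffer]) -> List[AgentOffer]:
    """Order offers from most to least suitable for a long-running task."""
    def sort_key(offer: AgentOffer):
        if offer.unavailability_start_ns is None:
            return (0, 0)  # no maintenance planned: best choice
        # Maintenance planned: a later start sorts earlier.
        return (1, -offer.unavailability_start_ns)
    return sorted(offers, key=sort_key)
```

The first element of the returned list is the offer the scheduler would try first; the rest serve as fallbacks in order.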



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1474) Provide cluster maintenance primitives for operators.

2016-08-24 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15435864#comment-15435864
 ] 

Joseph Wu commented on MESOS-1474:
--

Mesos won't stop sending offers because the machine could be {{DRAINING}} for 
any arbitrary period (say, a second or a year).

We opted for this design (maintenance primitives have essentially no effect) 
because we cannot give inverse offers to existing schedulers without breaking 
all existing schedulers.  Better to have people upgrade their schedulers first, 
and then take advantage of this feature.



[jira] [Commented] (MESOS-1474) Provide cluster maintenance primitives for operators.

2016-08-24 Thread Matt DeBoer (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15435850#comment-15435850
 ] 

Matt DeBoer commented on MESOS-1474:


I apologize if this has already been covered, but I'm confused as to why an 
agent in DRAINING mode is still presenting new offers to frameworks (in 
addition to the inverse offers).
The result is that frameworks that don't know about the maintenance 
primitives (Marathon, for example) go about business as usual, continuing to 
schedule new tasks on the draining node. Was this the intended design?

Perhaps it's not in the scope of this feature, but what I was hoping for when I 
saw this (maintenance primitives) was a way to put an agent into a state such 
that no new offers would be made for it, but existing tasks would be allowed to 
run to completion (i.e., so an operator could manually migrate tasks from older 
frameworks away)--is there such a mechanism or procedure that I've missed?




[jira] [Commented] (MESOS-1474) Provide cluster maintenance primitives for operators.

2015-09-02 Thread Artem Harutyunyan (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14728249#comment-14728249
 ] 

Artem Harutyunyan commented on MESOS-1474:
--

[~karlkfi] not really. You can schedule maintenance for a given machine (or a 
set of machines), but you currently cannot make Mesos selectively send 
inverse offers for only certain tasks.
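
For illustration only, a maintenance window covering a set of machines might be built like this. The field names (windows, machine_ids, unavailability, start, duration, nanoseconds) follow my reading of the Mesos maintenance documentation and should be checked against your Mesos version; the hostnames, IPs, and the maintenance_window helper are made up for this sketch. I believe the resulting JSON is what gets POSTed to the master's maintenance schedule endpoint, but treat that as an assumption too.

```python
import json
from datetime import datetime, timedelta, timezone

NANOS_PER_SECOND = 1_000_000_000


def maintenance_window(machines, start, duration):
    """Build one maintenance window for a list of (hostname, ip) pairs."""
    return {
        "machine_ids": [{"hostname": h, "ip": ip} for h, ip in machines],
        "unavailability": {
            "start": {"nanoseconds": int(start.timestamp()) * NANOS_PER_SECOND},
            "duration": {
                "nanoseconds": int(duration.total_seconds()) * NANOS_PER_SECOND
            },
        },
    }


# Hypothetical machines and times, chosen only for the example.
schedule = {
    "windows": [
        maintenance_window(
            [("agent1.example.com", "10.0.0.1"),
             ("agent2.example.com", "10.0.0.2")],
            datetime(2015, 9, 3, 0, 0, tzinfo=timezone.utc),
            timedelta(hours=1),
        )
    ]
}

print(json.dumps(schedule, indent=2))
```

Note that the window is expressed per machine, which matches the point above: there is no field here for targeting individual tasks or frameworks.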



[jira] [Commented] (MESOS-1474) Provide cluster maintenance primitives for operators.

2015-09-02 Thread Karl Isenberg (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14728244#comment-14728244
 ] 

Karl Isenberg commented on MESOS-1474:
--

Is draining active tasks for a framework upgrade (across the cluster, not just 
a single node) covered by this epic? 



[jira] [Commented] (MESOS-1474) Provide cluster maintenance primitives for operators.

2015-09-02 Thread Karl Isenberg (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14728499#comment-14728499
 ] 

Karl Isenberg commented on MESOS-1474:
--

Say you need to upgrade a Mesos framework and the new version can't handle the 
result of tasks spawned by the previous version. So you'd want to kill all the 
tasks as gracefully as possible, store them on disk (or somewhere that will 
survive the update), then shut down the framework, start the new version of the 
framework, recover the stored tasks and reschedule them.



[jira] [Commented] (MESOS-1474) Provide cluster maintenance primitives for operators.

2014-09-09 Thread Joe Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127701#comment-14127701
 ] 

Joe Smith commented on MESOS-1474:
--

The doc looks great! Thanks for all the hard work and rigor!

> Provide cluster maintenance primitives for operators.
> -
>
> Key: MESOS-1474
> URL: https://issues.apache.org/jira/browse/MESOS-1474
> Project: Mesos
>  Issue Type: Epic
>  Components: framework, master, slave
>Reporter: Benjamin Mahler
>
> Normally cluster upgrades can be done seamlessly using the built-in slave 
> recovery feature. However, there are situations where operators want to be 
> able to perform destructive maintenance operations on machines:
> * Non-recoverable slave upgrades.
> * Machine reboots.
> * Kernel upgrades.
> * Machine decommissioning.
> * etc.
> In these situations, best practice is to perform rolling maintenance in large 
> batches of machines. This can be problematic for frameworks when many related 
> tasks are located within a batch of machines going for maintenance.
> There are a few primitives of interest here:
> * Provide a way for operators to fully shut down a slave (killing all tasks 
> underneath it).
> * Provide a way for operators to mark specific slaves as undergoing 
> maintenance. This means that no more offers are being sent for these slaves, 
> and no new tasks will launch on them.
> * Provide a way for frameworks to be notified when resources are requested to 
> be relinquished. This gives the framework a chance to proactively move a task 
> before it is forcibly killed. It also allows the automation of operations 
> like: "please drain these slaves within 1 hour."





[jira] [Commented] (MESOS-1474) Provide cluster maintenance primitives for operators.

2014-08-26 Thread Kevin Lyda (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110617#comment-14110617
 ] 

Kevin Lyda commented on MESOS-1474:
---

Just thought I'd link the design doc by Alexandra that Benjamin mentioned on 
the list:

https://docs.google.com/a/ie.suberic.net/document/d/1NjK7MQeJzTRdfZTQ9q1Q5p4dY985bZ7cFqDpX4_fgjM/edit




--
This message was sent by Atlassian JIRA
(v6.2#6252)