[jira] [Commented] (MESOS-2246) Improve slave health-checking

2015-06-16 Thread Joe Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14588361#comment-14588361
 ] 

Joe Smith commented on MESOS-2246:
--

[~vinodkone] [~jieyu] given the tickets in this epic are completed, can this be 
resolved?

 Improve slave health-checking
 -

 Key: MESOS-2246
 URL: https://issues.apache.org/jira/browse/MESOS-2246
 Project: Mesos
  Issue Type: Epic
  Components: master, slave
Reporter: Dominic Hamon

 In the event of a network partition, or other systemic issues, we may see  
 widespread slave removal. There are several approaches we can take to 
 mitigate this issue including, but not limited to:
 . rate limit the slave removal
 . change how we do health checking to not rely on a single point of view
 . work with frameworks to determine SLA of running services before removing 
 the slave
 . manual control to allow operator intervention 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2246) Improve slave health-checking

2015-06-16 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14588449#comment-14588449
 ] 

Vinod Kone commented on MESOS-2246:
---

I think we solved the first part of the problem: rate limiting slave removals. 
We still haven't solved improving the scalability of health checks or being 
SLA aware. Since the latter two can be epics in themselves, we can resolve this 
one and open new ones.
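
For readers following along, rate limiting removals can be pictured as a 
token bucket in front of the removal path. This is only a rough sketch, not 
the code in the Mesos master: the class name, the method name, and the 
"1 removal per 60 seconds" figure are all made up for illustration.

```python
class RemovalRateLimiter:
    """Token bucket that caps how fast slaves may be removed.

    Hypothetical helper for illustration; not the Mesos implementation.
    """

    def __init__(self, permits: int, per_seconds: float) -> None:
        self.capacity = float(permits)      # maximum burst of removals
        self.rate = permits / per_seconds   # tokens refilled per second
        self.tokens = float(permits)        # start with a full bucket
        self.last = 0.0                     # time of the previous call

    def try_remove(self, now: float) -> bool:
        """Return True if a removal may proceed at time `now` (seconds)."""
        elapsed = now - self.last
        self.last = now
        # Refill for the elapsed time, never beyond the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Assumed limit for the example: at most 1 removal per 60 seconds.
limiter = RemovalRateLimiter(permits=1, per_seconds=60.0)
first = limiter.try_remove(now=0.0)    # allowed, bucket starts full
second = limiter.try_remove(now=1.0)   # denied, bucket is nearly empty
third = limiter.try_remove(now=61.0)   # allowed again after the refill
```

With a limiter like this, a partition that makes many slaves look dead at 
once only drains the bucket; the remaining removals wait, which gives the 
partition time to heal or an operator time to step in.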

 Improve slave health-checking
 -

 Key: MESOS-2246
 URL: https://issues.apache.org/jira/browse/MESOS-2246
 Project: Mesos
  Issue Type: Epic
  Components: master, slave
Reporter: Dominic Hamon

 In the event of a network partition, or other systemic issues, we may see  
 widespread slave removal. There are several approaches we can take to 
 mitigate this issue including, but not limited to:
 . rate limit the slave removal
 . change how we do health checking to not rely on a single point of view
 . work with frameworks to determine SLA of running services before removing 
 the slave
 . manual control to allow operator intervention 





[jira] [Commented] (MESOS-2246) Improve slave health-checking

2015-01-26 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292534#comment-14292534
 ] 

Jie Yu commented on MESOS-2246:
---

We probably want to separate this work into two major pieces:

1) Trying to reduce false positives (we think the slave is dead when it 
actually is not) as much as possible. The current health check is based on a 
series of ping-pongs between the leading master and the slaves, plus a magic 
timeout value. While simple, this might not be the best way to detect dead 
slaves (in terms of false positives). There is some research that tries to 
solve this problem (e.g., gossip protocols).
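
To make the ping-pong scheme concrete, here is a minimal sketch of a 
single-observer detector that declares a slave dead after a fixed number of 
consecutive missed pongs. The class name, the method names, and the values 
(15 second interval, 5 misses) are illustrative assumptions, not taken from 
this thread or from the Mesos source.

```python
from dataclasses import dataclass


@dataclass
class PingHealthChecker:
    """Declares a slave dead after `max_missed` consecutive missed pongs.

    Hypothetical sketch of a ping/timeout detector; values are assumptions.
    """

    ping_interval: float = 15.0  # assumed seconds between pings
    max_missed: int = 5          # assumed number of tolerated misses
    missed: int = 0
    dead: bool = False

    def on_pong(self) -> None:
        # Any reply resets the consecutive-miss counter.
        self.missed = 0

    def on_ping_timeout(self) -> None:
        # A ping went unanswered; give up once the threshold is reached.
        self.missed += 1
        if self.missed >= self.max_missed:
            self.dead = True

    @property
    def detection_time(self) -> float:
        """Seconds of silence after which the slave is declared dead."""
        return self.ping_interval * self.max_missed


checker = PingHealthChecker()
for _ in range(4):
    checker.on_ping_timeout()
still_alive = not checker.dead   # four misses are below the threshold
checker.on_ping_timeout()
declared_dead = checker.dead     # the fifth consecutive miss trips it
```

The false positive is visible in the sketch: the checker cannot tell a dead 
slave from a healthy one that is unreachable for longer than 
`detection_time`, because it only has the master's point of view.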

2) We need to admit that false positives are inevitable, so the question is 
how we are going to handle them. Currently, Mesos handles a false positive by 
killing all tasks and removing the slave. We could improve this part by 
allowing smarter decisions. Some possible ways are: let the framework make 
those decisions (e.g., be SLA aware), or introduce a few policies for the 
framework to choose from.

 Improve slave health-checking
 -

 Key: MESOS-2246
 URL: https://issues.apache.org/jira/browse/MESOS-2246
 Project: Mesos
  Issue Type: Epic
  Components: master, slave
Reporter: Dominic Hamon
Assignee: Jie Yu

 In the event of a network partition, or other systemic issues, we may see  
 widespread slave removal. There are several approaches we can take to 
 mitigate this issue including, but not limited to:
 . rate limit the slave removal
 . change how we do health checking to not rely on a single point of view
 . work with frameworks to determine SLA of running services before removing 
 the slave
 . manual control to allow operator intervention 


