Ian Downes created MESOS-4092:
---------------------------------

             Summary: Try to re-establish connection on ping timeouts with agent before removing it
                 Key: MESOS-4092
                 URL: https://issues.apache.org/jira/browse/MESOS-4092
             Project: Mesos
          Issue Type: Improvement
          Components: master
    Affects Versions: 0.25.0
            Reporter: Ian Downes


The SlaveObserver triggers removal of an agent after 
{{flags.max_slave_ping_timeouts}} consecutive ping timeouts, each lasting 
{{flags.slave_ping_timeout}}. 
This can occur because of transient network failures, e.g., gray failures of a 
switch uplink exhibiting heavy or total packet loss. Some network architectures 
are designed to tolerate such gray failures and support multiple paths between 
hosts. This can be implemented with equal-cost multi-path routing (ECMP) where 
flows are hashed by their 5-tuple onto one of several possible uplinks. In such 
networks, re-establishing a TCP connection will almost certainly use a new 
source port, so the new flow will likely be hashed to a different uplink, 
avoiding the failed uplink and restoring connectivity with the agent.
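
To illustrate the hashing argument above, here is a minimal sketch of how a 
new source port can move a flow to a different uplink. The hash (CRC32 modulo 
the uplink count), the function names, the addresses, and port 5051 are all 
assumptions made for illustration; real switches use vendor-specific hash 
functions.

```python
import zlib


def pick_uplink(src_ip, src_port, dst_ip, dst_port, proto, num_uplinks):
    """Hypothetical ECMP selection: hash the 5-tuple modulo the uplink count."""
    key = f"{src_ip}|{src_port}|{dst_ip}|{dst_port}|{proto}".encode()
    return zlib.crc32(key) % num_uplinks


def ports_avoiding(failed_uplink, src_ports, num_uplinks=4):
    """Return the source ports whose flows would NOT use the failed uplink."""
    return [
        port
        for port in src_ports
        # Illustrative master -> agent flow; only the source port varies.
        if pick_uplink("10.0.0.1", port, "10.0.1.7", 5051, "tcp", num_uplinks)
        != failed_uplink
    ]


if __name__ == "__main__":
    candidates = range(40000, 40016)
    usable = ports_avoiding(failed_uplink=0, src_ports=candidates)
    print(f"{len(usable)} of {len(candidates)} source ports avoid uplink 0")
```

With N healthy-looking uplinks and a roughly uniform hash, a fresh connection 
has about a (N-1)/N chance of avoiding a single failed uplink, which is why a 
reconnect is likely, but not guaranteed, to succeed on the first attempt.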

After failing to receive pongs, the SlaveObserver should next try to 
re-establish a TCP connection (with exponential back-off) before declaring the 
agent lost. This can avoid significant disruption in cases where large numbers 
of agents reached through a single failed link would otherwise be removed 
unnecessarily, while still ensuring that agents that are truly lost are 
recognized as such.
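
The proposed behaviour could be sketched roughly as follows. This is not Mesos 
code: the function names, the delay parameters, and the injectable connect and 
sleep callables are assumptions chosen to keep the example self-contained. 
Each attempt opens a fresh TCP connection, so the operating system will 
normally pick a new ephemeral source port.

```python
import socket
import time


def backoff_delays(initial, factor, attempts):
    """Exponentially growing delays: initial, initial*factor, ..."""
    return [initial * factor ** i for i in range(attempts)]


def try_reconnect(host, port, delays,
                  connect=socket.create_connection,
                  sleep=time.sleep,
                  timeout=2.0):
    """Attempt a fresh TCP connection once per delay in `delays`.

    Returns True as soon as one attempt succeeds. Returns False if every
    attempt fails, at which point the caller would declare the agent lost.
    """
    for delay in delays:
        try:
            conn = connect((host, port), timeout=timeout)
        except OSError:
            # Connection failed: wait before trying again.
            sleep(delay)
            continue
        conn.close()
        return True
    return False


if __name__ == "__main__":
    # Hypothetical agent address; 5051 is used purely for illustration.
    delays = backoff_delays(initial=1.0, factor=2, attempts=4)
    if not try_reconnect("agent.example.com", 5051, delays):
        print("agent unreachable after reconnect attempts; treat as lost")
```

Bounding the number of attempts keeps the worst-case detection time for a 
truly lost agent finite, which preserves the second goal stated above.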



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
