[ 
https://issues.apache.org/jira/browse/MESOS-7087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15859933#comment-15859933
 ] 

Anindya Sinha commented on MESOS-7087:
--------------------------------------

Here is a write-up of a proposal to address this situation:

https://docs.google.com/document/d/1nUxvh6BbB8jv5G-MvckGj9XzFYLBrUM0O5Go_Zmdftk/edit?usp=sharing

Comments/feedback welcome.

> Consider improving exponential backoff algorithm.
> -------------------------------------------------
>
>                 Key: MESOS-7087
>                 URL: https://issues.apache.org/jira/browse/MESOS-7087
>             Project: Mesos
>          Issue Type: Improvement
>          Components: general
>            Reporter: Anindya Sinha
>            Assignee: Anindya Sinha
>
> There are 3 types of backoff algorithms in use:
> 1) Exponential backoff with randomness, as in framework/agent registration.
> 2) Exponential backoff with no randomness, as in status updates.
> 3) Linear backoff with randomness, as in executor registration.
> Consider framework registration. The nth retry attempt is made after a random 
> interval in the range [0 .. backoff * 2^(n-1)], as long as each interval is 
> less than 1 min. The default value for backoff is 2 secs.
> Although the current approach combines exponential backoff with randomness, 
> we have observed that in clusters with thousands of agents and/or 
> frameworks, the actual (randomized) retry interval can end up being very 
> short for a substantial number of agents and/or frameworks, because the 
> allowed range [0 .. <n>] starts at 0. The resulting frequent retries bombard 
> the master with messages and overload it.
> So, to address the main issues seen (especially with a large number of 
> frameworks and/or agents), we need the following:
> 1) Every subsequent retry should be spaced from the previous attempt by a 
> minimum deterministic amount.
> 2) Every subsequent retry interval should be greater than or equal to the 
> previous one.
> 3) The maximum retry interval should be configurable, since it can be a 
> function of the initial backoff factor.
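
To make the description above concrete, here is a minimal Python sketch. It is not Mesos code and not the algorithm from the linked proposal; the function names (current_interval, proposed_intervals) and the choice of upper/2 as the lower bound are assumptions made only for illustration. The first function models the existing framework registration backoff as described in the issue, and the second shows one possible way to satisfy the three requirements listed.

```python
import random


def current_interval(n, backoff=2.0, max_interval=60.0, rng=random):
    """Delay (secs) before the nth retry (n >= 1) as described in the issue:
    uniform in [0, backoff * 2^(n-1)], with the upper bound limited to
    max_interval (1 min). The lower bound of 0 permits near-immediate retries.
    """
    upper = min(backoff * 2 ** (n - 1), max_interval)
    return rng.uniform(0.0, upper)


def proposed_intervals(attempts, backoff=2.0, max_interval=60.0, rng=random):
    """Hypothetical alternative, for illustration only.

    1) The lower bound is at least half the upper bound, so every retry is
       spaced by a deterministic minimum (assumed factor: 1/2).
    2) The lower bound is never below the previous interval, so intervals
       never shrink.
    3) max_interval is a parameter rather than a fixed constant.
    """
    previous = 0.0
    for n in range(1, attempts + 1):
        upper = min(backoff * 2 ** (n - 1), max_interval)
        # previous <= upper always holds because upper never decreases.
        lower = max(upper / 2.0, previous)
        previous = rng.uniform(lower, upper)
        yield previous


if __name__ == "__main__":
    rng = random.Random()
    print("current: ", [round(current_interval(n, rng=rng), 2) for n in range(1, 8)])
    print("proposed:", [round(x, 2) for x in proposed_intervals(7, rng=rng)])
```

With the default backoff of 2 secs, the first retry under current_interval is uniform over [0, 2], so on average a quarter of all clients retry within 0.5 secs. With thousands of agents that is a large burst, which is the overload described above. Under proposed_intervals no client retries sooner than backoff/2 after its first attempt.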



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
