[ https://issues.apache.org/jira/browse/MESOS-7087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15859933#comment-15859933 ]
Anindya Sinha commented on MESOS-7087:
--------------------------------------

Here is a write-up of a proposal to address this situation:
https://docs.google.com/document/d/1nUxvh6BbB8jv5G-MvckGj9XzFYLBrUM0O5Go_Zmdftk/edit?usp=sharing

Comments/feedback welcome.

> Consider improving exponential backoff algorithm.
> -------------------------------------------------
>
>                 Key: MESOS-7087
>                 URL: https://issues.apache.org/jira/browse/MESOS-7087
>             Project: Mesos
>          Issue Type: Improvement
>          Components: general
>            Reporter: Anindya Sinha
>            Assignee: Anindya Sinha
>
> There are 3 types of backoff algorithms in use:
> 1) Exponential backoff with randomness, as in framework/agent registration.
> 2) Exponential backoff with no randomness, as in status updates.
> 3) Linear backoff with randomness, as in executor registration.
>
> Consider framework registration. The nth retry attempt is made after a random
> interval drawn from the range [0 .. backoff * 2^(n-1)], as long as each
> interval is less than 1 min. The default value for backoff is 2 secs.
>
> Although the current approach provides exponential backoff with randomness,
> we have observed that for clusters with thousands of agents and/or
> frameworks, the actual (randomized) retry interval can end up being very
> short for a substantial number of agents and/or frameworks, because the
> allowed range is [0 .. <n>] and so starts at 0. Those agents and frameworks
> retry very frequently, bombarding the master with messages and overloading it.
>
> So, the main requirements (especially for a large number of frameworks and/or
> agents) are:
> 1) Every subsequent retry should be spaced by a minimum deterministic
> amount from the previous attempt.
> 2) Every subsequent retry interval should be greater than or equal to the
> previous interval.
> 3) Maximum retry interval should be configurable since it can be a function
> of the initial backoff factor.
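
For illustration only, here is a minimal Python sketch of the registration backoff described above, next to one possible variant that would meet the three requirements. It is not taken from the Mesos code base or from the linked proposal. The function names, the `cap` and `max_interval` parameters, and the choice of range [backoff * 2^(n-1) .. backoff * 2^n] are all assumptions made for this sketch, as is reading "less than 1 min" as clamping the upper bound at 60 seconds.

```python
import random


def current_delay(n, backoff=2.0, cap=60.0, rng=random):
    """Delay in seconds before the nth retry (n >= 1) as the description states:
    uniform in [0, backoff * 2^(n-1)]. Clamping the upper bound at `cap`
    is an assumed reading of "less than 1 min"."""
    upper = min(backoff * 2 ** (n - 1), cap)
    return rng.uniform(0.0, upper)


def proposed_delay(n, backoff=2.0, max_interval=60.0, rng=random):
    """Hypothetical alternative, not the scheme from the linked proposal:
    uniform in [backoff * 2^(n-1), backoff * 2^n], with both bounds
    clamped to the configurable `max_interval`."""
    # Non-zero lower bound: the deterministic minimum spacing (requirement 1).
    lower = min(backoff * 2 ** (n - 1), max_interval)
    # Upper bound of retry n equals the lower bound of retry n+1, so
    # intervals never shrink (requirement 2).
    upper = min(backoff * 2 ** n, max_interval)
    return rng.uniform(lower, upper)


if __name__ == "__main__":
    rng = random.Random(7087)  # fixed seed so runs are repeatable
    for n in range(1, 8):
        cur = current_delay(n, rng=rng)
        new = proposed_delay(n, rng=rng)
        print(f"retry {n}: current {cur:6.2f}s  proposed {new:6.2f}s")
```

In this sketch the current scheme can return a delay close to 0 on any retry, while the variant never returns less than `backoff` seconds. One caveat of the variant: once both bounds reach `max_interval` the randomness disappears, so a real implementation would likely need to keep some jitter at the cap.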