FYI, the hard-coded 75s timeout is being made configurable in MESOS-2110 <https://issues.apache.org/jira/browse/MESOS-2110>, hopefully landing in 0.23.

On Fri, Jun 19, 2015 at 4:18 PM, Roger Ignazio <[email protected]> wrote:

> On Fri, Jun 19, 2015 at 3:46 PM, Vinod Kone <[email protected]> wrote:
>
>>
>>> *If* the 75 seconds is exceeded but we're within the recovery_timeout,
>>> the slave *should* register with a new slave ID. The slave daemon (with
>>> the new slave ID) reconnects to the old executors and updates them to use
>>> the new slave ID.
>>>
>>
>> This is not true. 'recovery_timeout' was added to make sure that if a
>> slave is down for a long time (>10 mins), the executors commit suicide. It
>> is better for the executor/task to die than keep running because the
>> framework might have already launched another replica of that instance.
>> This was not tied to the 75s timeout (hard coded) because it is possible
>> for a slave to successfully re-register with a master after 75s (e.g., both
>> master and slave are down for 5 min).
>>
>> Also, a slave cannot connect to old executors with a new slave id.
>>
>
> Perfect, thanks for the quick response Vinod!
>
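
To make the distinction in Vinod's explanation concrete, here is a rough Python sketch of the two timers as independent checks. This is not Mesos code: the names (Outage, executors_self_terminate, master_observed_timeout) are made up for illustration, and the model assumes that the master only counts slave downtime while the master itself is up, and that any master outage falls entirely within the slave outage.

```python
from dataclasses import dataclass

# Hard-coded value discussed in this thread (being made configurable in MESOS-2110).
MASTER_SLAVE_TIMEOUT_S = 75.0


@dataclass
class Outage:
    """Hypothetical description of one outage, in seconds."""
    slave_down_s: float          # how long the slave process was down
    master_down_s: float = 0.0   # how long the master was down during that window


def executors_self_terminate(outage: Outage, recovery_timeout_s: float) -> bool:
    # Executors only look at how long the slave has been gone;
    # the master's 75s timeout plays no part in this decision.
    return outage.slave_down_s > recovery_timeout_s


def master_observed_timeout(outage: Outage) -> bool:
    # Assumption: the master cannot time a slave out while the master is down,
    # so only the downtime it actually observed counts against the 75s.
    observed_s = max(0.0, outage.slave_down_s - outage.master_down_s)
    return observed_s > MASTER_SLAVE_TIMEOUT_S


# Both master and slave down for 5 min, recovery_timeout of 10 min (illustrative):
both_down = Outage(slave_down_s=300.0, master_down_s=300.0)
print(executors_self_terminate(both_down, recovery_timeout_s=600.0))
print(master_observed_timeout(both_down))
```

In the "both down for 5 min" case from the thread, neither check fires in this model, which matches the point that a slave can still re-register after more than 75s of wall-clock downtime.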

