Hi all,

I'm hoping to get some clarification around slave recovery timeouts. Looking through the slave recovery documentation[0], I see two (potentially) conflicting pieces of information that have led to some confusion on my part:

- recovery_timeout: Amount of time allotted for the slave to recover [Default: 15 mins]. If the slave takes longer than recovery_timeout to recover, any executors that are waiting to reconnect to the slave will self-terminate.
- A restarted slave should re-register with master within a timeout (currently, 75s). If the slave takes longer than this timeout to re-register, the master shuts down the slave, which in turn shuts down any live executors/tasks.

Here's my understanding of how slave recovery should work (please correct me if I'm wrong):

If the mesos-slave daemon fails, is upgraded, or is restarted, and it comes back online within 75 seconds (is this a hard-coded value?), it sounds like there's no problem: the slave reconnects to the master using the same slave ID.

*But* if it takes longer than 75 seconds to come back online and tries to reuse the same slave ID, the master will shut it down and kill off its executors (despite the executors being given the full recovery_timeout to reconnect).

*If* the 75 seconds is exceeded but we're still within the recovery_timeout, the slave *should* register with a new slave ID. The slave daemon (with the new slave ID) then reconnects to the old executors and updates them to use the new slave ID.

Does that sound accurate, or have I missed something? If it is accurate, then I have two follow-up questions:

1. At what point does a slave get a new slave ID?
2. What would cause a slave to come back online with the same slave ID *after* the 75-second threshold?

Thanks,
-- Roger

[0] http://mesos.apache.org/documentation/latest/slave-recovery/
[1] http://mesos.apache.org/documentation/latest/configuration/
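
P.S. To make the question concrete, here is a rough Python sketch of the decision logic as I currently read it. This is only a model of my understanding, not confirmed Mesos behavior and not Mesos code. The two timeout values come from the docs quoted above; the function name, the enum, and the idea that the outcome depends only on how long the slave was down are my own assumptions.

```python
from enum import Enum

# Values taken from the documentation quoted in this email.
REREGISTER_TIMEOUT_SECS = 75      # master-side re-registration timeout
RECOVERY_TIMEOUT_SECS = 15 * 60   # recovery_timeout default (15 mins)


class Outcome(Enum):
    """Hypothetical labels for the three cases described in this email."""
    REREGISTER_SAME_ID = "slave re-registers with its old slave ID"
    REGISTER_NEW_ID = "slave registers with a new slave ID, keeps executors"
    EXECUTORS_SELF_TERMINATE = "executors waiting to reconnect give up"


def expected_outcome(downtime_secs,
                     reregister_timeout=REREGISTER_TIMEOUT_SECS,
                     recovery_timeout=RECOVERY_TIMEOUT_SECS):
    """Return what I *expect* to happen after the slave was down this long.

    Assumption: the outcome depends only on the downtime. Whether the
    real system behaves this way is exactly what I am asking about.
    """
    if downtime_secs <= reregister_timeout:
        # Back within 75s: no problem, same slave ID.
        return Outcome.REREGISTER_SAME_ID
    if downtime_secs <= recovery_timeout:
        # Past 75s but within recovery_timeout: my guess is a new slave ID.
        return Outcome.REGISTER_NEW_ID
    # Past recovery_timeout: the docs say waiting executors self-terminate.
    return Outcome.EXECUTORS_SELF_TERMINATE


if __name__ == "__main__":
    for secs in (30, 120, 1200):
        print(secs, "->", expected_outcome(secs).name)
```

If the middle branch is wrong (for example, if the slave can still come back with its old ID after 75 seconds), that is the case my second question is about.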

