Hi all,

I'm hoping to get some clarification around slave recovery timeouts.
Looking through the slave recovery documentation[0], I see two
(potentially) conflicting pieces of information that have led to some
confusion on my part:

   - recovery_timeout : Amount of time allotted for the slave to recover
   [Default: 15 mins]. If the slave takes longer than recovery_timeout to
   recover, any executors that are waiting to reconnect to the slave will
   self-terminate.
   - A restarted slave should re-register with master within a timeout
   (currently, 75s). If the slave takes longer than this timeout to
   re-register, the master shuts down the slave, which in turn shuts down any
   live executors/tasks.


Here's my understanding of how slave recovery should work (please correct
me if I'm wrong):

If the mesos-slave daemon fails / is upgraded / is restarted and comes back
online within 75 seconds (is this a hard-coded value?), it sounds like
there's no problem: the slave re-connects to the master using the same
slave ID.

*But* if it takes longer than 75 seconds to come back online and tries to
re-use the same slave ID, the master will shut it down and kill off its
executors (despite the executors being given 10 mins to reconnect).

*If* the 75 seconds is exceeded but we're within the recovery_timeout, the
slave *should* register with a new slave ID. The slave daemon (with the new
slave ID) reconnects to the old executors and updates them to use the new
slave ID.
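
To make that mental model concrete, here is a rough Python sketch of the
three cases above. This is only my reading of the docs, not actual Mesos
code: the function name, the constants, and the outcome strings are all
made up for illustration, and I've used the documented 15 min default for
recovery_timeout.

```python
# Hypothetical model of my understanding; none of this is Mesos code.
REREGISTER_TIMEOUT_SECS = 75      # "currently, 75s" per the recovery docs
RECOVERY_TIMEOUT_SECS = 15 * 60   # documented recovery_timeout default


def expected_outcome(downtime_secs, reuses_slave_id):
    """Return what I *think* happens after the slave was down this long."""
    if downtime_secs <= REREGISTER_TIMEOUT_SECS:
        # Back within the re-registration window: no problem.
        return "re-registers with same slave ID; executors keep running"
    if downtime_secs > RECOVERY_TIMEOUT_SECS:
        # Executors waiting to reconnect have given up by now.
        return "executors have self-terminated"
    if reuses_slave_id:
        # Past 75s but still presenting the old ID to the master.
        return "master shuts down slave; executors are killed"
    # Past 75s, within recovery_timeout, registering as a new slave.
    return "registers with new slave ID; executors reconnect"
```

For example, expected_outcome(120, True) is the case I'm worried about: the
slave is back well inside recovery_timeout, but its executors get killed
anyway.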

Does that sound accurate, or have I missed something? If it is accurate,
then I have two follow-up questions:

   1. At what point does a slave get a new slave ID?
   2. What would cause a slave to come back online with the same slave ID
   *after* the 75-second threshold?

Thanks,

-- Roger

[0] http://mesos.apache.org/documentation/latest/slave-recovery/
[1] http://mesos.apache.org/documentation/latest/configuration/
