Re: Review Request 29507: Added Configurable Slave Ping Timeouts

2015-03-31 Thread Adam B


> On Feb. 18, 2015, 11:38 a.m., Niklas Nielsen wrote:
> > Let's get tests wired up before committing this :)
> 
> Adam B wrote:
> Sure thing. Adding tests in my subsequent patch where we will pass the 
> master's timeout values on to the slave. Will post that very soon.
> 
> Ben Mahler wrote:
> Can you do it in one patch? This patch in isolation looks a bit dangerous 
> per our conversation above.
> 
> Also, please carefully consider whether your approach will be safe to do 
> in a single version. i.e. What happens when there are old slaves running 
> against a new master? And vice versa.
> 
> Adam B wrote:
> Easy enough. Added the second patch to this one. Most of the new changes 
> are in messages.proto, slave.hpp/cpp, and partition_tests.cpp, with a few 
> lines in master.cpp to set the new message fields.
> 
> Certainly safe to upgrade if the new flags stay at their default value. 
> Also, with new slaves talking to an old master, the master will still use the 
> defaults, hence so will the slaves. 
> 
> But old slaves running against a new master with a longer timeout will 
> try to reregister unnecessarily early, so you may want to guarantee that 
> all/most of the slaves have been upgraded before setting these flags, 
> otherwise a large cluster suddenly waking up from a network split would see a 
> lot of unnecessary reregistration attempts. The old behavior in this scenario 
> was that the slaves would reregister after the same default timeout after the 
> network split, and the master would consider them shutdown after the same 
> timeout.
> If that last scenario is a problem, maybe the better approach is for each 
> slave to specify its own ping timeout, and then the master can use these 
> values with each slave's SlaveObserver. Then we can just require the 
> master(s) to be upgraded first, as is typical.
> 
> Adam B wrote:
> I'm inclined to think that we shouldn't worry too much about the unlikely 
> scenario of a network split in the middle of a rolling upgrade to 0.23 where 
> longer ping timeouts are simultaneously being applied. Procedure should be: 
> upgrade (most of) the cluster, then restart the master(s) with the new longer 
> ping timeout value. I can add a Note to upgrades.md if you like. How does 
> that sound to you @bmahler ?
> If there are no objections, I'll rebase the patch and update the 
> configuration and upgrades docs.
> 
> Ben Mahler wrote:
> Yeah that sounds fine to me (might be prudent to call it out in the 
> upgrade doc or the changelog).
> 
> Mind reaching out to me directly when this is ready for more reviewing? :)

Absolutely. Thanks. :)


- Adam


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review72992
---


On Feb. 19, 2015, 12:10 a.m., Adam B wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/29507/
> ---
> 
> (Updated Feb. 19, 2015, 12:10 a.m.)
> 
> 
> Review request for mesos, Ben Mahler and Niklas Nielsen.
> 
> 
> Bugs: MESOS-2110
> https://issues.apache.org/jira/browse/MESOS-2110
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
> to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs)
> and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5).
>
> These can be extended if slaves are expected/allowed to be down for
> longer than a minute or two.
> 
> Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage.
>   
> Beware that this affects recovery from network timeouts as well as
> actual slave node/process failover.
> 
> 
> Diffs
> -
> 
>   src/master/constants.hpp ad3fe81 
>   src/master/constants.cpp d3d0f71 
>   src/master/flags.hpp 51a6059 
>   src/master/master.cpp f10a3cf 
>   src/messages/messages.proto 58484ae 
>   src/slave/constants.hpp 12d6e92 
>   src/slave/constants.cpp 7868bef 
>   src/slave/slave.hpp 91dae10 
>   src/slave/slave.cpp aec9525 
>   src/tests/fault_tolerance_tests.cpp efa5c57 
>   src/tests/partition_tests.cpp eb16a58 
>   src/tests/slave_recovery_tests.cpp 8210c52 
>   src/tests/slave_tests.cpp 153d9d6 
> 
> Diff: https://reviews.apache.org/r/29507/diff/
> 
> 
> Testing
> ---
> 
> Manually tested slave failover/shutdown with master using different 
> --slave_ping_timeout and --max_slave_ping_timeouts.
> Ran unit tests with shorter non-default values for ping timeouts.
> `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and 
> ShortPingTimeoutUnreachableSlave
> 
> 
> Thanks,
> 
> Adam B
> 
>



Re: Review Request 29507: Added Configurable Slave Ping Timeouts

2015-03-31 Thread Ben Mahler


> On Feb. 18, 2015, 7:38 p.m., Niklas Nielsen wrote:
> > Let's get tests wired up before committing this :)
> 
> Adam B wrote:
> Sure thing. Adding tests in my subsequent patch where we will pass the 
> master's timeout values on to the slave. Will post that very soon.
> 
> Ben Mahler wrote:
> Can you do it in one patch? This patch in isolation looks a bit dangerous 
> per our conversation above.
> 
> Also, please carefully consider whether your approach will be safe to do 
> in a single version. i.e. What happens when there are old slaves running 
> against a new master? And vice versa.
> 
> Adam B wrote:
> Easy enough. Added the second patch to this one. Most of the new changes 
> are in messages.proto, slave.hpp/cpp, and partition_tests.cpp, with a few 
> lines in master.cpp to set the new message fields.
> 
> Certainly safe to upgrade if the new flags stay at their default value. 
> Also, with new slaves talking to an old master, the master will still use the 
> defaults, hence so will the slaves. 
> 
> But old slaves running against a new master with a longer timeout will 
> try to reregister unnecessarily early, so you may want to guarantee that 
> all/most of the slaves have been upgraded before setting these flags, 
> otherwise a large cluster suddenly waking up from a network split would see a 
> lot of unnecessary reregistration attempts. The old behavior in this scenario 
> was that the slaves would reregister after the same default timeout after the 
> network split, and the master would consider them shutdown after the same 
> timeout.
> If that last scenario is a problem, maybe the better approach is for each 
> slave to specify its own ping timeout, and then the master can use these 
> values with each slave's SlaveObserver. Then we can just require the 
> master(s) to be upgraded first, as is typical.
> 
> Adam B wrote:
> I'm inclined to think that we shouldn't worry too much about the unlikely 
> scenario of a network split in the middle of a rolling upgrade to 0.23 where 
> longer ping timeouts are simultaneously being applied. Procedure should be: 
> upgrade (most of) the cluster, then restart the master(s) with the new longer 
> ping timeout value. I can add a Note to upgrades.md if you like. How does 
> that sound to you @bmahler ?
> If there are no objections, I'll rebase the patch and update the 
> configuration and upgrades docs.

Yeah that sounds fine to me (might be prudent to call it out in the upgrade doc 
or the changelog).

Mind reaching out to me directly when this is ready for more reviewing? :)


- Ben


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review72992
---


On Feb. 19, 2015, 8:10 a.m., Adam B wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/29507/
> ---
> 
> (Updated Feb. 19, 2015, 8:10 a.m.)
> 
> 
> Review request for mesos, Ben Mahler and Niklas Nielsen.
> 
> 
> Bugs: MESOS-2110
> https://issues.apache.org/jira/browse/MESOS-2110
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
> to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs)
> and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5).
>
> These can be extended if slaves are expected/allowed to be down for
> longer than a minute or two.
> 
> Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage.
>   
> Beware that this affects recovery from network timeouts as well as
> actual slave node/process failover.
> 
> 
> Diffs
> -
> 
>   src/master/constants.hpp ad3fe81 
>   src/master/constants.cpp d3d0f71 
>   src/master/flags.hpp 51a6059 
>   src/master/master.cpp f10a3cf 
>   src/messages/messages.proto 58484ae 
>   src/slave/constants.hpp 12d6e92 
>   src/slave/constants.cpp 7868bef 
>   src/slave/slave.hpp 91dae10 
>   src/slave/slave.cpp aec9525 
>   src/tests/fault_tolerance_tests.cpp efa5c57 
>   src/tests/partition_tests.cpp eb16a58 
>   src/tests/slave_recovery_tests.cpp 8210c52 
>   src/tests/slave_tests.cpp 153d9d6 
> 
> Diff: https://reviews.apache.org/r/29507/diff/
> 
> 
> Testing
> ---
> 
> Manually tested slave failover/shutdown with master using different 
> --slave_ping_timeout and --max_slave_ping_timeouts.
> Ran unit tests with shorter non-default values for ping timeouts.
> `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and 
> ShortPingTimeoutUnreachableSlave
> 
> 
> Thanks,
> 
> Adam B
> 
>



Re: Review Request 29507: Added Configurable Slave Ping Timeouts

2015-03-27 Thread Adam B


> On Feb. 18, 2015, 11:38 a.m., Niklas Nielsen wrote:
> > Let's get tests wired up before committing this :)
> 
> Adam B wrote:
> Sure thing. Adding tests in my subsequent patch where we will pass the 
> master's timeout values on to the slave. Will post that very soon.
> 
> Ben Mahler wrote:
> Can you do it in one patch? This patch in isolation looks a bit dangerous 
> per our conversation above.
> 
> Also, please carefully consider whether your approach will be safe to do 
> in a single version. i.e. What happens when there are old slaves running 
> against a new master? And vice versa.
> 
> Adam B wrote:
> Easy enough. Added the second patch to this one. Most of the new changes 
> are in messages.proto, slave.hpp/cpp, and partition_tests.cpp, with a few 
> lines in master.cpp to set the new message fields.
> 
> Certainly safe to upgrade if the new flags stay at their default value. 
> Also, with new slaves talking to an old master, the master will still use the 
> defaults, hence so will the slaves. 
> 
> But old slaves running against a new master with a longer timeout will 
> try to reregister unnecessarily early, so you may want to guarantee that 
> all/most of the slaves have been upgraded before setting these flags, 
> otherwise a large cluster suddenly waking up from a network split would see a 
> lot of unnecessary reregistration attempts. The old behavior in this scenario 
> was that the slaves would reregister after the same default timeout after the 
> network split, and the master would consider them shutdown after the same 
> timeout.
> If that last scenario is a problem, maybe the better approach is for each 
> slave to specify its own ping timeout, and then the master can use these 
> values with each slave's SlaveObserver. Then we can just require the 
> master(s) to be upgraded first, as is typical.

I'm inclined to think that we shouldn't worry too much about the unlikely 
scenario of a network split in the middle of a rolling upgrade to 0.23 where 
longer ping timeouts are simultaneously being applied. Procedure should be: 
upgrade (most of) the cluster, then restart the master(s) with the new longer 
ping timeout value. I can add a Note to upgrades.md if you like. How does that 
sound to you @bmahler ?
If there are no objections, I'll rebase the patch and update the configuration 
and upgrades docs.


- Adam


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review72992
---


On Feb. 19, 2015, 12:10 a.m., Adam B wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/29507/
> ---
> 
> (Updated Feb. 19, 2015, 12:10 a.m.)
> 
> 
> Review request for mesos, Ben Mahler and Niklas Nielsen.
> 
> 
> Bugs: MESOS-2110
> https://issues.apache.org/jira/browse/MESOS-2110
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
> to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs)
> and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5).
>
> These can be extended if slaves are expected/allowed to be down for
> longer than a minute or two.
> 
> Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage.
>   
> Beware that this affects recovery from network timeouts as well as
> actual slave node/process failover.
> 
> 
> Diffs
> -
> 
>   src/master/constants.hpp ad3fe81 
>   src/master/constants.cpp d3d0f71 
>   src/master/flags.hpp 51a6059 
>   src/master/master.cpp f10a3cf 
>   src/messages/messages.proto 58484ae 
>   src/slave/constants.hpp 12d6e92 
>   src/slave/constants.cpp 7868bef 
>   src/slave/slave.hpp 91dae10 
>   src/slave/slave.cpp aec9525 
>   src/tests/fault_tolerance_tests.cpp efa5c57 
>   src/tests/partition_tests.cpp eb16a58 
>   src/tests/slave_recovery_tests.cpp 8210c52 
>   src/tests/slave_tests.cpp 153d9d6 
> 
> Diff: https://reviews.apache.org/r/29507/diff/
> 
> 
> Testing
> ---
> 
> Manually tested slave failover/shutdown with master using different 
> --slave_ping_timeout and --max_slave_ping_timeouts.
> Ran unit tests with shorter non-default values for ping timeouts.
> `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and 
> ShortPingTimeoutUnreachableSlave
> 
> 
> Thanks,
> 
> Adam B
> 
>



Re: Review Request 29507: Added Configurable Slave Ping Timeouts

2015-03-27 Thread Adam B


> On March 6, 2015, 5:19 a.m., Joerg Schad wrote:
> > src/master/flags.hpp, line 383
> > 
> >
> > Shouldn't this also be added to the documentation (i.e. 
> > http://mesos.apache.org/documentation/latest/configuration/)?

Excellent point. Will update that in the next patch revision.


- Adam


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review75484
---


On Feb. 19, 2015, 12:10 a.m., Adam B wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/29507/
> ---
> 
> (Updated Feb. 19, 2015, 12:10 a.m.)
> 
> 
> Review request for mesos, Ben Mahler and Niklas Nielsen.
> 
> 
> Bugs: MESOS-2110
> https://issues.apache.org/jira/browse/MESOS-2110
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
> to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs)
> and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5).
>
> These can be extended if slaves are expected/allowed to be down for
> longer than a minute or two.
> 
> Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage.
>   
> Beware that this affects recovery from network timeouts as well as
> actual slave node/process failover.
> 
> 
> Diffs
> -
> 
>   src/master/constants.hpp ad3fe81 
>   src/master/constants.cpp d3d0f71 
>   src/master/flags.hpp 51a6059 
>   src/master/master.cpp f10a3cf 
>   src/messages/messages.proto 58484ae 
>   src/slave/constants.hpp 12d6e92 
>   src/slave/constants.cpp 7868bef 
>   src/slave/slave.hpp 91dae10 
>   src/slave/slave.cpp aec9525 
>   src/tests/fault_tolerance_tests.cpp efa5c57 
>   src/tests/partition_tests.cpp eb16a58 
>   src/tests/slave_recovery_tests.cpp 8210c52 
>   src/tests/slave_tests.cpp 153d9d6 
> 
> Diff: https://reviews.apache.org/r/29507/diff/
> 
> 
> Testing
> ---
> 
> Manually tested slave failover/shutdown with master using different 
> --slave_ping_timeout and --max_slave_ping_timeouts.
> Ran unit tests with shorter non-default values for ping timeouts.
> `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and 
> ShortPingTimeoutUnreachableSlave
> 
> 
> Thanks,
> 
> Adam B
> 
>



Re: Review Request 29507: Added Configurable Slave Ping Timeouts

2015-03-06 Thread Joerg Schad

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review75484
---



src/master/flags.hpp


Shouldn't this also be added to the documentation (i.e. 
http://mesos.apache.org/documentation/latest/configuration/)?


- Joerg Schad


On Feb. 19, 2015, 8:10 a.m., Adam B wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/29507/
> ---
> 
> (Updated Feb. 19, 2015, 8:10 a.m.)
> 
> 
> Review request for mesos, Ben Mahler and Niklas Nielsen.
> 
> 
> Bugs: MESOS-2110
> https://issues.apache.org/jira/browse/MESOS-2110
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
> to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs)
> and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5).
>
> These can be extended if slaves are expected/allowed to be down for
> longer than a minute or two.
> 
> Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage.
>   
> Beware that this affects recovery from network timeouts as well as
> actual slave node/process failover.
> 
> 
> Diffs
> -
> 
>   src/master/constants.hpp ad3fe81 
>   src/master/constants.cpp d3d0f71 
>   src/master/flags.hpp 51a6059 
>   src/master/master.cpp f10a3cf 
>   src/messages/messages.proto 58484ae 
>   src/slave/constants.hpp 12d6e92 
>   src/slave/constants.cpp 7868bef 
>   src/slave/slave.hpp 91dae10 
>   src/slave/slave.cpp aec9525 
>   src/tests/fault_tolerance_tests.cpp efa5c57 
>   src/tests/partition_tests.cpp eb16a58 
>   src/tests/slave_recovery_tests.cpp 8210c52 
>   src/tests/slave_tests.cpp 153d9d6 
> 
> Diff: https://reviews.apache.org/r/29507/diff/
> 
> 
> Testing
> ---
> 
> Manually tested slave failover/shutdown with master using different 
> --slave_ping_timeout and --max_slave_ping_timeouts.
> Ran unit tests with shorter non-default values for ping timeouts.
> `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and 
> ShortPingTimeoutUnreachableSlave
> 
> 
> Thanks,
> 
> Adam B
> 
>



Re: Review Request 29507: Added Configurable Slave Ping Timeouts

2015-02-19 Thread Mesos ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review73111
---


Patch looks great!

Reviews applied: [29507]

All tests passed.

- Mesos ReviewBot


On Feb. 19, 2015, 8:10 a.m., Adam B wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/29507/
> ---
> 
> (Updated Feb. 19, 2015, 8:10 a.m.)
> 
> 
> Review request for mesos, Ben Mahler and Niklas Nielsen.
> 
> 
> Bugs: MESOS-2110
> https://issues.apache.org/jira/browse/MESOS-2110
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
> to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs)
> and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5).
>
> These can be extended if slaves are expected/allowed to be down for
> longer than a minute or two.
> 
> Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage.
>   
> Beware that this affects recovery from network timeouts as well as
> actual slave node/process failover.
> 
> 
> Diffs
> -
> 
>   src/master/constants.hpp ad3fe81 
>   src/master/constants.cpp d3d0f71 
>   src/master/flags.hpp 51a6059 
>   src/master/master.cpp f10a3cf 
>   src/messages/messages.proto 58484ae 
>   src/slave/constants.hpp 12d6e92 
>   src/slave/constants.cpp 7868bef 
>   src/slave/slave.hpp 91dae10 
>   src/slave/slave.cpp aec9525 
>   src/tests/fault_tolerance_tests.cpp efa5c57 
>   src/tests/partition_tests.cpp eb16a58 
>   src/tests/slave_recovery_tests.cpp 8210c52 
>   src/tests/slave_tests.cpp 153d9d6 
> 
> Diff: https://reviews.apache.org/r/29507/diff/
> 
> 
> Testing
> ---
> 
> Manually tested slave failover/shutdown with master using different 
> --slave_ping_timeout and --max_slave_ping_timeouts.
> Ran unit tests with shorter non-default values for ping timeouts.
> `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and 
> ShortPingTimeoutUnreachableSlave
> 
> 
> Thanks,
> 
> Adam B
> 
>



Re: Review Request 29507: Added Configurable Slave Ping Timeouts

2015-02-19 Thread Adam B


> On Feb. 18, 2015, 11:38 a.m., Niklas Nielsen wrote:
> > Let's get tests wired up before committing this :)
> 
> Adam B wrote:
> Sure thing. Adding tests in my subsequent patch where we will pass the 
> master's timeout values on to the slave. Will post that very soon.
> 
> Ben Mahler wrote:
> Can you do it in one patch? This patch in isolation looks a bit dangerous 
> per our conversation above.
> 
> Also, please carefully consider whether your approach will be safe to do 
> in a single version. i.e. What happens when there are old slaves running 
> against a new master? And vice versa.

Easy enough. Added the second patch to this one. Most of the new changes are in 
messages.proto, slave.hpp/cpp, and partition_tests.cpp, with a few lines in 
master.cpp to set the new message fields.

Certainly safe to upgrade if the new flags stay at their default value. Also, 
with new slaves talking to an old master, the master will still use the 
defaults, hence so will the slaves. 

But old slaves running against a new master with a longer timeout will try to 
reregister unnecessarily early, so you may want to guarantee that all/most of 
the slaves have been upgraded before setting these flags, otherwise a large 
cluster suddenly waking up from a network split would see a lot of unnecessary 
reregistration attempts. The old behavior in this scenario was that the slaves 
would reregister after the same default timeout after the network split, and 
the master would consider them shutdown after the same timeout.
If that last scenario is a problem, maybe the better approach is for each slave 
to specify its own ping timeout, and then the master can use these values with 
each slave's SlaveObserver. Then we can just require the master(s) to be 
upgraded first, as is typical.


- Adam


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review72992
---


On Feb. 19, 2015, 12:10 a.m., Adam B wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/29507/
> ---
> 
> (Updated Feb. 19, 2015, 12:10 a.m.)
> 
> 
> Review request for mesos, Ben Mahler and Niklas Nielsen.
> 
> 
> Bugs: MESOS-2110
> https://issues.apache.org/jira/browse/MESOS-2110
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
> to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs)
> and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5).
>
> These can be extended if slaves are expected/allowed to be down for
> longer than a minute or two.
> 
> Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage.
>   
> Beware that this affects recovery from network timeouts as well as
> actual slave node/process failover.
> 
> 
> Diffs
> -
> 
>   src/master/constants.hpp ad3fe81 
>   src/master/constants.cpp d3d0f71 
>   src/master/flags.hpp 51a6059 
>   src/master/master.cpp f10a3cf 
>   src/messages/messages.proto 58484ae 
>   src/slave/constants.hpp 12d6e92 
>   src/slave/constants.cpp 7868bef 
>   src/slave/slave.hpp 91dae10 
>   src/slave/slave.cpp aec9525 
>   src/tests/fault_tolerance_tests.cpp efa5c57 
>   src/tests/partition_tests.cpp eb16a58 
>   src/tests/slave_recovery_tests.cpp 8210c52 
>   src/tests/slave_tests.cpp 153d9d6 
> 
> Diff: https://reviews.apache.org/r/29507/diff/
> 
> 
> Testing
> ---
> 
> Manually tested slave failover/shutdown with master using different 
> --slave_ping_timeout and --max_slave_ping_timeouts.
> Ran unit tests with shorter non-default values for ping timeouts.
> `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and 
> ShortPingTimeoutUnreachableSlave
> 
> 
> Thanks,
> 
> Adam B
> 
>



Re: Review Request 29507: Added Configurable Slave Ping Timeouts

2015-02-19 Thread Adam B

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/
---

(Updated Feb. 19, 2015, 12:10 a.m.)


Review request for mesos, Ben Mahler and Niklas Nielsen.


Changes
---

Merged in subsequent patch to make slave use master's ping timeout from 
SlaveRe[re]gisteredMessage, plus unit tests.


Bugs: MESOS-2110
https://issues.apache.org/jira/browse/MESOS-2110


Repository: mesos


Description (updated)
---

Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs)
and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5).
   
These can be extended if slaves are expected/allowed to be down for
longer than a minute or two.

Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage.
  
Beware that this affects recovery from network timeouts as well as
actual slave node/process failover.


Diffs (updated)
-

  src/master/constants.hpp ad3fe81 
  src/master/constants.cpp d3d0f71 
  src/master/flags.hpp 51a6059 
  src/master/master.cpp f10a3cf 
  src/messages/messages.proto 58484ae 
  src/slave/constants.hpp 12d6e92 
  src/slave/constants.cpp 7868bef 
  src/slave/slave.hpp 91dae10 
  src/slave/slave.cpp aec9525 
  src/tests/fault_tolerance_tests.cpp efa5c57 
  src/tests/partition_tests.cpp eb16a58 
  src/tests/slave_recovery_tests.cpp 8210c52 
  src/tests/slave_tests.cpp 153d9d6 

Diff: https://reviews.apache.org/r/29507/diff/


Testing (updated)
---

Manually tested slave failover/shutdown with master using different 
--slave_ping_timeout and --max_slave_ping_timeouts.
Ran unit tests with shorter non-default values for ping timeouts.
`make check` with new unit tests: ShortPingTimeoutUnreachableMaster and 
ShortPingTimeoutUnreachableSlave


Thanks,

Adam B



Re: Review Request 29507: Added Configurable Slave Ping Timeouts

2015-02-18 Thread Mesos ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review73085
---


Bad patch!

Reviews applied: [29507]

Failed command: ./support/apply-review.sh -n -r 29507

Error:
 2015-02-19 04:34:42 URL:https://reviews.apache.org/r/29507/diff/raw/ 
[15458/15458] -> "29507.patch" [1]
error: patch failed: src/master/master.cpp:116
error: src/master/master.cpp: patch does not apply
error: patch failed: src/slave/constants.hpp:116
error: src/slave/constants.hpp: patch does not apply
error: patch failed: src/tests/partition_tests.cpp:67
error: src/tests/partition_tests.cpp: patch does not apply
Failed to apply patch

- Mesos ReviewBot


On Feb. 19, 2015, 12:33 a.m., Adam B wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/29507/
> ---
> 
> (Updated Feb. 19, 2015, 12:33 a.m.)
> 
> 
> Review request for mesos and Niklas Nielsen.
> 
> 
> Bugs: MESOS-2110
> https://issues.apache.org/jira/browse/MESOS-2110
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
> to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs)
> and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5).
>
> These can be extended if slaves are expected/allowed to be down for
> longer than a minute or two.
>   
> Beware that this affects recovery from network timeouts as well as
> actual slave node/process failover.
> 
> 
> Diffs
> -
> 
>   src/master/constants.hpp c386eab 
>   src/master/constants.cpp 9ee17e9 
>   src/master/flags.hpp 6c18a1a 
>   src/master/master.cpp f4b6463 
>   src/slave/constants.hpp 761cfaf 
>   src/slave/constants.cpp 83d9fc1 
>   src/slave/slave.cpp a8b2621 
>   src/tests/fault_tolerance_tests.cpp f927d4a 
>   src/tests/partition_tests.cpp fea7801 
>   src/tests/slave_recovery_tests.cpp 7e2e63d 
>   src/tests/slave_tests.cpp e7e2af6 
> 
> Diff: https://reviews.apache.org/r/29507/diff/
> 
> 
> Testing
> ---
> 
> Manually tested slave failover/shutdown with master using different 
> --slave_ping_timeout and --max_slave_ping_timeouts.
> Ran unit tests with shorter non-default values for ping timeouts.
> 
> 
> Thanks,
> 
> Adam B
> 
>



Re: Review Request 29507: Added Configurable Slave Ping Timeouts

2015-02-18 Thread Ben Mahler


> On Feb. 18, 2015, 7:38 p.m., Niklas Nielsen wrote:
> > Let's get tests wired up before committing this :)
> 
> Adam B wrote:
> Sure thing. Adding tests in my subsequent patch where we will pass the 
> master's timeout values on to the slave. Will post that very soon.

Can you do it in one patch? This patch in isolation looks a bit dangerous per 
our conversation above.

Also, please carefully consider whether your approach will be safe to do in a 
single version. i.e. What happens when there are old slaves running against a 
new master? And vice versa.


- Ben


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review72992
---


On Feb. 19, 2015, 12:33 a.m., Adam B wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/29507/
> ---
> 
> (Updated Feb. 19, 2015, 12:33 a.m.)
> 
> 
> Review request for mesos and Niklas Nielsen.
> 
> 
> Bugs: MESOS-2110
> https://issues.apache.org/jira/browse/MESOS-2110
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
> to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs)
> and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5).
>
> These can be extended if slaves are expected/allowed to be down for
> longer than a minute or two.
>   
> Beware that this affects recovery from network timeouts as well as
> actual slave node/process failover.
> 
> 
> Diffs
> -
> 
>   src/master/constants.hpp c386eab 
>   src/master/constants.cpp 9ee17e9 
>   src/master/flags.hpp 6c18a1a 
>   src/master/master.cpp f4b6463 
>   src/slave/constants.hpp 761cfaf 
>   src/slave/constants.cpp 83d9fc1 
>   src/slave/slave.cpp a8b2621 
>   src/tests/fault_tolerance_tests.cpp f927d4a 
>   src/tests/partition_tests.cpp fea7801 
>   src/tests/slave_recovery_tests.cpp 7e2e63d 
>   src/tests/slave_tests.cpp e7e2af6 
> 
> Diff: https://reviews.apache.org/r/29507/diff/
> 
> 
> Testing
> ---
> 
> Manually tested slave failover/shutdown with master using different 
> --slave_ping_timeout and --max_slave_ping_timeouts.
> Ran unit tests with shorter non-default values for ping timeouts.
> 
> 
> Thanks,
> 
> Adam B
> 
>



Re: Review Request 29507: Added Configurable Slave Ping Timeouts

2015-02-18 Thread Adam B

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/
---

(Updated Feb. 18, 2015, 4:33 p.m.)


Review request for mesos and Niklas Nielsen.


Changes
---

Rebased, in preparation for part 2 (slave-side timeouts).


Bugs: MESOS-2110
https://issues.apache.org/jira/browse/MESOS-2110


Repository: mesos


Description (updated)
---

Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs)
and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5).
   
These can be extended if slaves are expected/allowed to be down for
longer than a minute or two.
  
Beware that this affects recovery from network timeouts as well as
actual slave node/process failover.


Diffs (updated)
-

  src/master/constants.hpp c386eab 
  src/master/constants.cpp 9ee17e9 
  src/master/flags.hpp 6c18a1a 
  src/master/master.cpp f4b6463 
  src/slave/constants.hpp 761cfaf 
  src/slave/constants.cpp 83d9fc1 
  src/slave/slave.cpp a8b2621 
  src/tests/fault_tolerance_tests.cpp f927d4a 
  src/tests/partition_tests.cpp fea7801 
  src/tests/slave_recovery_tests.cpp 7e2e63d 
  src/tests/slave_tests.cpp e7e2af6 

Diff: https://reviews.apache.org/r/29507/diff/


Testing (updated)
---

Manually tested slave failover/shutdown with master using different 
--slave_ping_timeout and --max_slave_ping_timeouts.
Ran unit tests with shorter non-default values for ping timeouts.


Thanks,

Adam B



Re: Review Request 29507: Added Configurable Slave Ping Timeouts

2015-02-18 Thread Adam B


> On Feb. 18, 2015, 11:38 a.m., Niklas Nielsen wrote:
> > Let's get tests wired up before committing this :)

Sure thing. Adding tests in my subsequent patch where we will pass the master's 
timeout values on to the slave. Will post that very soon.


- Adam


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review72992
---


On Jan. 8, 2015, 12:44 a.m., Adam B wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/29507/
> ---
> 
> (Updated Jan. 8, 2015, 12:44 a.m.)
> 
> 
> Review request for mesos and Niklas Nielsen.
> 
> 
> Bugs: MESOS-2110
> https://issues.apache.org/jira/browse/MESOS-2110
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
> to mesos-master to replace the existing (still default) 
> SLAVE_PING_TIMEOUT (15secs) and MAX_SLAVE_PING_TIMEOUTS (5).
>
> These can be extended if slaves are expected/allowed to be down for
> longer than a minute or two.
>   
> Beware that this affects recovery from network timeouts as well as
> actual slave node/process failover.
> 
> 
> Diffs
> -
> 
>   src/master/constants.hpp c386eab 
>   src/master/constants.cpp 9ee17e9 
>   src/master/flags.hpp f5c8d2a 
>   src/master/master.cpp d6651e2 
>   src/slave/constants.hpp fd1c1ab 
>   src/slave/constants.cpp 2a99b11 
>   src/slave/slave.cpp 50b5781 
>   src/tests/fault_tolerance_tests.cpp 5763486 
>   src/tests/partition_tests.cpp fea7801 
>   src/tests/slave_recovery_tests.cpp cd4a398 
>   src/tests/slave_tests.cpp c50cbc7 
> 
> Diff: https://reviews.apache.org/r/29507/diff/
> 
> 
> Testing
> ---
> 
> Manually tested slave failover/shutdown with master using different 
> --slave_ping_timeout and --max_slave_ping_timeouts.
> Updated unit tests to use shorter non-default values for ping timeouts.
> 
> 
> Thanks,
> 
> Adam B
> 
>



Re: Review Request 29507: Added Configurable Slave Ping Timeouts

2015-02-18 Thread Niklas Nielsen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review72992
---


Let's get tests wired up before committing this :)

- Niklas Nielsen


On Jan. 8, 2015, 12:44 a.m., Adam B wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/29507/
> ---
> 
> (Updated Jan. 8, 2015, 12:44 a.m.)
> 
> 
> Review request for mesos and Niklas Nielsen.
> 
> 
> Bugs: MESOS-2110
> https://issues.apache.org/jira/browse/MESOS-2110
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
> to mesos-master to replace the existing (still default) 
> SLAVE_PING_TIMEOUT (15secs) and MAX_SLAVE_PING_TIMEOUTS (5).
>
> These can be extended if slaves are expected/allowed to be down for
> longer than a minute or two.
>   
> Beware that this affects recovery from network timeouts as well as
> actual slave node/process failover.
> 
> 
> Diffs
> -
> 
>   src/master/constants.hpp c386eab 
>   src/master/constants.cpp 9ee17e9 
>   src/master/flags.hpp f5c8d2a 
>   src/master/master.cpp d6651e2 
>   src/slave/constants.hpp fd1c1ab 
>   src/slave/constants.cpp 2a99b11 
>   src/slave/slave.cpp 50b5781 
>   src/tests/fault_tolerance_tests.cpp 5763486 
>   src/tests/partition_tests.cpp fea7801 
>   src/tests/slave_recovery_tests.cpp cd4a398 
>   src/tests/slave_tests.cpp c50cbc7 
> 
> Diff: https://reviews.apache.org/r/29507/diff/
> 
> 
> Testing
> ---
> 
> Manually tested slave failover/shutdown with master using different 
> --slave_ping_timeout and --max_slave_ping_timeouts.
> Updated unit tests to use shorter non-default values for ping timeouts.
> 
> 
> Thanks,
> 
> Adam B
> 
>



Re: Review Request 29507: Added Configurable Slave Ping Timeouts

2015-02-17 Thread Niklas Nielsen


> On Jan. 7, 2015, 2:50 p.m., Ben Mahler wrote:
> > src/master/flags.hpp, lines 361-373
> > 
> >
> > Do you mind changing the names now that they're merely defaults?
> > 
> > s/SLAVE_PING_TIMEOUT/DEFAULT_SLAVE_PING_TIMEOUT/
> > s/MAX_SLAVE_PING_TIMEOUTS/DEFAULT_MAX_SLAVE_PING_TIMEOUTS/
> > 
> > Once you do this, it will become apparent there is more needed in this 
> > change. In particular, all the slave logic that depends on 
> > MASTER_PING_TIMEOUT() is broken now! Making it configurable will require 
> > some more thought.
> 
> Adam B wrote:
> Renamed to make the DEFAULT nature explicit.
> 
> I would argue that with this change as is, the slave side of the ping 
> timeout isn't much more broken than it already was. As I alluded to in the 
> MESOS-2110 JIRA description, the slave uses the timeout to force a 
> leader-detection & re-registration, which is only loosely related to what the 
> master is doing with the timeout (declaring the slave shutdown). Even without 
> my changes, the Timers on master/slave could be out of sync and the two 
> timeout events aren't guaranteed to happen in a particular order. If the 
> master were to timeout first, and the slave did eventually detect a new 
> leader, the slave would still be told to shutdown and its re-registration 
> would fail. If the slave timed out first, it would try to re-register with 
> the currently verified leader (could be the same), and the master's 
> SlaveObserver would keep trying to ping the slave and potentially eventually 
> shut it down. That's why I would argue that this patch is fine on its own, 
> and a subsequent patch (same JIRA?) sh
 ould clean up the slave's ping timeout behavior.
> 
> `src/master/constants.hpp` has the following note:
> ```
> // NOTE: The slave uses these PING constants to determine when
> // the master has stopped sending pings. If these are made
> // configurable, then we'll need to rely on upper/lower bounds
> // to ensure that the slave is not unnecessarily triggering
> // re-registrations.
> ```
> If the master flags are configurable, there's currently no way for the 
> slave to know what value the current leading master is using, especially if 
> the slave is disconnected at the time. After reading through the comments in 
> MESOS-1529 (which added the slave-side ping timeout), I'm thinking that maybe 
> the master->slave registered (and PING?) messages could embed the current 
> master's actual ping timeout. Then the pingTimers started from 
> Slave::registered() and Slave::ping() could use that value (same as master's) 
> to prevent unnecessary reregistration due to a timeout mismatch.
> 
> Thoughts, suggestions?
> 
> Ben Mahler wrote:
> > I would argue that with this change as is, the slave side of the ping 
> timeout isn't much more broken than it already was.
> 
> Hm.. I think you misunderstood the intent of the slave using this timeout 
> (it's not trying to prevent the things you discussed, please clarify if you 
> think it's causing some issues, I didn't quite understand what you're 
> pointing out).
> 
> These timers are in place to guarantee that if the master removes a slave 
> (stops health checking it), then the slave will eventually terminate. Without 
> this timeout, it's possible for the master -> slave removal messages to drop 
> while the slave -> master socket remains intact. At this point, (without the 
> timeout) the slave would remain running potentially forever (unless a message 
> (e.g. status update) is sent to the master for the master to detect the 
> zombied slave).
> 
> The slave also uses the pings to prevent a slave from getting "stuck" in 
> a situation where the master considers the slave disconnected, but the slave 
> considers itself connected. This can happen via a one-way socket closure from 
> master -> slave. Note that the health checking in the master continues while 
> the slave is disconnected.
> 
> The comments here should help:
> 
> ```
> void Slave::ping(const UPID& from, bool connected)
> {
>   VLOG(1) << "Received ping from " << from;
> 
>   if (!connected && state == RUNNING) {
> // This could happen if there is a one way partition between
> // the master and slave, causing the master to get an exited
> // event and marking the slave disconnected but the slave
> // thinking it is still connected. Force a re-registration with
> // the master to reconcile.
> LOG(INFO) << "Master marked the slave as disconnected but the slave"
>   << " considers itself registered! Forcing re-registration.";
> detection.discard();
>   }
> 
>   // If we don't get a ping from the master, trigger a
>   // re-registration. This can occur when the master no
>   // longer con

Re: Review Request 29507: Added Configurable Slave Ping Timeouts

2015-01-25 Thread Adam B


> On Jan. 7, 2015, 2:50 p.m., Ben Mahler wrote:
> > src/master/flags.hpp, lines 361-373
> > 
> >
> > Do you mind changing the names now that they're merely defaults?
> > 
> > s/SLAVE_PING_TIMEOUT/DEFAULT_SLAVE_PING_TIMEOUT/
> > s/MAX_SLAVE_PING_TIMEOUTS/DEFAULT_MAX_SLAVE_PING_TIMEOUTS/
> > 
> > Once you do this, it will become apparent there is more needed in this 
> > change. In particular, all the slave logic that depends on 
> > MASTER_PING_TIMEOUT() is broken now! Making it configurable will require 
> > some more thought.
> 
> Adam B wrote:
> Renamed to make the DEFAULT nature explicit.
> 
> I would argue that with this change as is, the slave side of the ping 
> timeout isn't much more broken than it already was. As I alluded to in the 
> MESOS-2110 JIRA description, the slave uses the timeout to force a 
> leader-detection & re-registration, which is only loosely related to what the 
> master is doing with the timeout (declaring the slave shutdown). Even without 
> my changes, the Timers on master/slave could be out of sync and the two 
> timeout events aren't guaranteed to happen in a particular order. If the 
> master were to timeout first, and the slave did eventually detect a new 
> leader, the slave would still be told to shutdown and its re-registration 
> would fail. If the slave timed out first, it would try to re-register with 
> the currently verified leader (could be the same), and the master's 
> SlaveObserver would keep trying to ping the slave and potentially eventually 
> shut it down. That's why I would argue that this patch is fine on its own, 
> and a subsequent patch (same JIRA?) sh
 ould clean up the slave's ping timeout behavior.
> 
> `src/master/constants.hpp` has the following note:
> ```
> // NOTE: The slave uses these PING constants to determine when
> // the master has stopped sending pings. If these are made
> // configurable, then we'll need to rely on upper/lower bounds
> // to ensure that the slave is not unnecessarily triggering
> // re-registrations.
> ```
> If the master flags are configurable, there's currently no way for the 
> slave to know what value the current leading master is using, especially if 
> the slave is disconnected at the time. After reading through the comments in 
> MESOS-1529 (which added the slave-side ping timeout), I'm thinking that maybe 
> the master->slave registered (and PING?) messages could embed the current 
> master's actual ping timeout. Then the pingTimers started from 
> Slave::registered() and Slave::ping() could use that value (same as master's) 
> to prevent unnecessary reregistration due to a timeout mismatch.
> 
> Thoughts, suggestions?
> 
> Ben Mahler wrote:
> > I would argue that with this change as is, the slave side of the ping 
> timeout isn't much more broken than it already was.
> 
> Hm.. I think you misunderstood the intent of the slave using this timeout 
> (it's not trying to prevent the things you discussed, please clarify if you 
> think it's causing some issues, I didn't quite understand what you're 
> pointing out).
> 
> These timers are in place to guarantee that if the master removes a slave 
> (stops health checking it), then the slave will eventually terminate. Without 
> this timeout, it's possible for the master -> slave removal messages to drop 
> while the slave -> master socket remains intact. At this point, (without the 
> timeout) the slave would remain running potentially forever (unless a message 
> (e.g. status update) is sent to the master for the master to detect the 
> zombied slave).
> 
> The slave also uses the pings to prevent a slave from getting "stuck" in 
> a situation where the master considers the slave disconnected, but the slave 
> considers itself connected. This can happen via a one-way socket closure from 
> master -> slave. Note that the health checking in the master continues while 
> the slave is disconnected.
> 
> The comments here should help:
> 
> ```
> void Slave::ping(const UPID& from, bool connected)
> {
>   VLOG(1) << "Received ping from " << from;
> 
>   if (!connected && state == RUNNING) {
> // This could happen if there is a one way partition between
> // the master and slave, causing the master to get an exited
> // event and marking the slave disconnected but the slave
> // thinking it is still connected. Force a re-registration with
> // the master to reconcile.
> LOG(INFO) << "Master marked the slave as disconnected but the slave"
>   << " considers itself registered! Forcing re-registration.";
> detection.discard();
>   }
> 
>   // If we don't get a ping from the master, trigger a
>   // re-registration. This can occur when the master no
>   // longer con

Re: Review Request 29507: Added Configurable Slave Ping Timeouts

2015-01-09 Thread Ben Mahler


> On Jan. 7, 2015, 10:50 p.m., Ben Mahler wrote:
> > src/master/flags.hpp, lines 361-373
> > 
> >
> > Do you mind changing the names now that they're merely defaults?
> > 
> > s/SLAVE_PING_TIMEOUT/DEFAULT_SLAVE_PING_TIMEOUT/
> > s/MAX_SLAVE_PING_TIMEOUTS/DEFAULT_MAX_SLAVE_PING_TIMEOUTS/
> > 
> > Once you do this, it will become apparent there is more needed in this 
> > change. In particular, all the slave logic that depends on 
> > MASTER_PING_TIMEOUT() is broken now! Making it configurable will require 
> > some more thought.
> 
> Adam B wrote:
> Renamed to make the DEFAULT nature explicit.
> 
> I would argue that with this change as is, the slave side of the ping 
> timeout isn't much more broken than it already was. As I alluded to in the 
> MESOS-2110 JIRA description, the slave uses the timeout to force a 
> leader-detection & re-registration, which is only loosely related to what the 
> master is doing with the timeout (declaring the slave shutdown). Even without 
> my changes, the Timers on master/slave could be out of sync and the two 
> timeout events aren't guaranteed to happen in a particular order. If the 
> master were to timeout first, and the slave did eventually detect a new 
> leader, the slave would still be told to shutdown and its re-registration 
> would fail. If the slave timed out first, it would try to re-register with 
> the currently verified leader (could be the same), and the master's 
> SlaveObserver would keep trying to ping the slave and potentially eventually 
> shut it down. That's why I would argue that this patch is fine on its own, 
> and a subsequent patch (same JIRA?) sh
 ould clean up the slave's ping timeout behavior.
> 
> `src/master/constants.hpp` has the following note:
> ```
> // NOTE: The slave uses these PING constants to determine when
> // the master has stopped sending pings. If these are made
> // configurable, then we'll need to rely on upper/lower bounds
> // to ensure that the slave is not unnecessarily triggering
> // re-registrations.
> ```
> If the master flags are configurable, there's currently no way for the 
> slave to know what value the current leading master is using, especially if 
> the slave is disconnected at the time. After reading through the comments in 
> MESOS-1529 (which added the slave-side ping timeout), I'm thinking that maybe 
> the master->slave registered (and PING?) messages could embed the current 
> master's actual ping timeout. Then the pingTimers started from 
> Slave::registered() and Slave::ping() could use that value (same as master's) 
> to prevent unnecessary reregistration due to a timeout mismatch.
> 
> Thoughts, suggestions?

> I would argue that with this change as is, the slave side of the ping timeout 
> isn't much more broken than it already was.

Hm.. I think you misunderstood the intent of the slave using this timeout (it's 
not trying to prevent the things you discussed, please clarify if you think 
it's causing some issues, I didn't quite understand what you're pointing out).

These timers are in place to guarantee that if the master removes a slave 
(stops health checking it), then the slave will eventually terminate. Without 
this timeout, it's possible for the master -> slave removal messages to drop 
while the slave -> master socket remains intact. At this point, (without the 
timeout) the slave would remain running potentially forever (unless a message 
(e.g. status update) is sent to the master for the master to detect the zombied 
slave).

The slave also uses the pings to prevent a slave from getting "stuck" in a 
situation where the master considers the slave disconnected, but the slave 
considers itself connected. This can happen via a one-way socket closure from 
master -> slave. Note that the health checking in the master continues while 
the slave is disconnected.

The comments here should help:

```
void Slave::ping(const UPID& from, bool connected)
{
  VLOG(1) << "Received ping from " << from;

  if (!connected && state == RUNNING) {
// This could happen if there is a one way partition between
// the master and slave, causing the master to get an exited
// event and marking the slave disconnected but the slave
// thinking it is still connected. Force a re-registration with
// the master to reconcile.
LOG(INFO) << "Master marked the slave as disconnected but the slave"
  << " considers itself registered! Forcing re-registration.";
detection.discard();
  }

  // If we don't get a ping from the master, trigger a
  // re-registration. This can occur when the master no
  // longer considers the slave to be registered, so it is
  // essential for the slave to attempt a re-registration
  // when this occurs.
  Clock::cancel(pingTimer);

  pingTimer = delay(
  MASTER_PING_TIMEOUT(),
  self(),

Re: Review Request 29507: Added Configurable Slave Ping Timeouts

2015-01-08 Thread Mesos ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review67183
---


Patch looks great!

Reviews applied: [29507]

All tests passed.

- Mesos ReviewBot


On Jan. 8, 2015, 8:44 a.m., Adam B wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/29507/
> ---
> 
> (Updated Jan. 8, 2015, 8:44 a.m.)
> 
> 
> Review request for mesos and Niklas Nielsen.
> 
> 
> Bugs: MESOS-2110
> https://issues.apache.org/jira/browse/MESOS-2110
> 
> 
> Repository: mesos-git
> 
> 
> Description
> ---
> 
> Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
> to mesos-master to replace the existing (still default) 
> SLAVE_PING_TIMEOUT (15secs) and MAX_SLAVE_PING_TIMEOUTS (5).
>
> These can be extended if slaves are expected/allowed to be down for
> longer than a minute or two.
>   
> Beware that this affects recovery from network timeouts as well as
> actual slave node/process failover.
> 
> 
> Diffs
> -
> 
>   src/master/constants.hpp c386eab 
>   src/master/constants.cpp 9ee17e9 
>   src/master/flags.hpp f5c8d2a 
>   src/master/master.cpp d6651e2 
>   src/slave/constants.hpp fd1c1ab 
>   src/slave/constants.cpp 2a99b11 
>   src/slave/slave.cpp 50b5781 
>   src/tests/fault_tolerance_tests.cpp 5763486 
>   src/tests/partition_tests.cpp fea7801 
>   src/tests/slave_recovery_tests.cpp cd4a398 
>   src/tests/slave_tests.cpp c50cbc7 
> 
> Diff: https://reviews.apache.org/r/29507/diff/
> 
> 
> Testing
> ---
> 
> Manually tested slave failover/shutdown with master using different 
> --slave_ping_timeout and --max_slave_ping_timeouts.
> Updated unit tests to use shorter non-default values for ping timeouts.
> 
> 
> Thanks,
> 
> Adam B
> 
>



Re: Review Request 29507: Added Configurable Slave Ping Timeouts

2015-01-08 Thread Adam B

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review67182
---



src/tests/partition_tests.cpp


I believe this only ever needed to be `slave_ping_timeout`, not 
`slave_ping_timeout * max_slave_ping_timeouts`, since we just want to advance 
the clock long enough for the SlaveObserver to send the next ping, not so long 
that the master considers the slave shutdown or so long that the slave tries to 
reregister after not receiving a ping.


- Adam B


On Jan. 8, 2015, 12:44 a.m., Adam B wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/29507/
> ---
> 
> (Updated Jan. 8, 2015, 12:44 a.m.)
> 
> 
> Review request for mesos and Niklas Nielsen.
> 
> 
> Bugs: MESOS-2110
> https://issues.apache.org/jira/browse/MESOS-2110
> 
> 
> Repository: mesos-git
> 
> 
> Description
> ---
> 
> Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
> to mesos-master to replace the existing (still default) 
> SLAVE_PING_TIMEOUT (15secs) and MAX_SLAVE_PING_TIMEOUTS (5).
>
> These can be extended if slaves are expected/allowed to be down for
> longer than a minute or two.
>   
> Beware that this affects recovery from network timeouts as well as
> actual slave node/process failover.
> 
> 
> Diffs
> -
> 
>   src/master/constants.hpp c386eab 
>   src/master/constants.cpp 9ee17e9 
>   src/master/flags.hpp f5c8d2a 
>   src/master/master.cpp d6651e2 
>   src/slave/constants.hpp fd1c1ab 
>   src/slave/constants.cpp 2a99b11 
>   src/slave/slave.cpp 50b5781 
>   src/tests/fault_tolerance_tests.cpp 5763486 
>   src/tests/partition_tests.cpp fea7801 
>   src/tests/slave_recovery_tests.cpp cd4a398 
>   src/tests/slave_tests.cpp c50cbc7 
> 
> Diff: https://reviews.apache.org/r/29507/diff/
> 
> 
> Testing
> ---
> 
> Manually tested slave failover/shutdown with master using different 
> --slave_ping_timeout and --max_slave_ping_timeouts.
> Updated unit tests to use shorter non-default values for ping timeouts.
> 
> 
> Thanks,
> 
> Adam B
> 
>



Re: Review Request 29507: Added Configurable Slave Ping Timeouts

2015-01-08 Thread Adam B

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/
---

(Updated Jan. 8, 2015, 12:44 a.m.)


Review request for mesos and Niklas Nielsen.


Changes
---

Responding to review feedback.


Bugs: MESOS-2110
https://issues.apache.org/jira/browse/MESOS-2110


Repository: mesos-git


Description
---

Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
to mesos-master to replace the existing (still default) 
SLAVE_PING_TIMEOUT (15secs) and MAX_SLAVE_PING_TIMEOUTS (5).
   
These can be extended if slaves are expected/allowed to be down for
longer than a minute or two.
  
Beware that this affects recovery from network timeouts as well as
actual slave node/process failover.


Diffs (updated)
-

  src/master/constants.hpp c386eab 
  src/master/constants.cpp 9ee17e9 
  src/master/flags.hpp f5c8d2a 
  src/master/master.cpp d6651e2 
  src/slave/constants.hpp fd1c1ab 
  src/slave/constants.cpp 2a99b11 
  src/slave/slave.cpp 50b5781 
  src/tests/fault_tolerance_tests.cpp 5763486 
  src/tests/partition_tests.cpp fea7801 
  src/tests/slave_recovery_tests.cpp cd4a398 
  src/tests/slave_tests.cpp c50cbc7 

Diff: https://reviews.apache.org/r/29507/diff/


Testing
---

Manually tested slave failover/shutdown with master using different 
--slave_ping_timeout and --max_slave_ping_timeouts.
Updated unit tests to use shorter non-default values for ping timeouts.


Thanks,

Adam B



Re: Review Request 29507: Added Configurable Slave Ping Timeouts

2015-01-08 Thread Adam B


> On Jan. 7, 2015, 2:50 p.m., Ben Mahler wrote:
> > src/master/flags.hpp, lines 361-373
> > 
> >
> > Do you mind changing the names now that they're merely defaults?
> > 
> > s/SLAVE_PING_TIMEOUT/DEFAULT_SLAVE_PING_TIMEOUT/
> > s/MAX_SLAVE_PING_TIMEOUTS/DEFAULT_MAX_SLAVE_PING_TIMEOUTS/
> > 
> > Once you do this, it will become apparent there is more needed in this 
> > change. In particular, all the slave logic that depends on 
> > MASTER_PING_TIMEOUT() is broken now! Making it configurable will require 
> > some more thought.

Renamed to make the DEFAULT nature explicit.

I would argue that with this change as is, the slave side of the ping timeout 
isn't much more broken than it already was. As I alluded to in the MESOS-2110 
JIRA description, the slave uses the timeout to force a leader-detection & 
re-registration, which is only loosely related to what the master is doing with 
the timeout (declaring the slave shutdown). Even without my changes, the Timers 
on master/slave could be out of sync and the two timeout events aren't 
guaranteed to happen in a particular order. If the master were to timeout 
first, and the slave did eventually detect a new leader, the slave would still 
be told to shutdown and its re-registration would fail. If the slave timed out 
first, it would try to re-register with the currently verified leader (could be 
the same), and the master's SlaveObserver would keep trying to ping the slave 
and potentially eventually shut it down. That's why I would argue that this 
patch is fine on its own, and a subsequent patch (same JIRA?) should c
 lean up the slave's ping timeout behavior.

`src/master/constants.hpp` has the following note:
```
// NOTE: The slave uses these PING constants to determine when
// the master has stopped sending pings. If these are made
// configurable, then we'll need to rely on upper/lower bounds
// to ensure that the slave is not unnecessarily triggering
// re-registrations.
```
If the master flags are configurable, there's currently no way for the slave to 
know what value the current leading master is using, especially if the slave is 
disconnected at the time. After reading through the comments in MESOS-1529 
(which added the slave-side ping timeout), I'm thinking that maybe the 
master->slave registered (and PING?) messages could embed the current master's 
actual ping timeout. Then the pingTimers started from Slave::registered() and 
Slave::ping() could use that value (same as master's) to prevent unnecessary 
reregistration due to a timeout mismatch.

Thoughts, suggestions?


- Adam


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review67113
---


On Dec. 31, 2014, 1:38 a.m., Adam B wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/29507/
> ---
> 
> (Updated Dec. 31, 2014, 1:38 a.m.)
> 
> 
> Review request for mesos and Niklas Nielsen.
> 
> 
> Bugs: MESOS-2150
> https://issues.apache.org/jira/browse/MESOS-2150
> 
> 
> Repository: mesos-git
> 
> 
> Description
> ---
> 
> Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
> to mesos-master to replace the existing (still default) 
> SLAVE_PING_TIMEOUT (15secs) and MAX_SLAVE_PING_TIMEOUTS (5).
>
> These can be extended if slaves are expected/allowed to be down for
> longer than a minute or two.
>   
> Beware that this affects recovery from network timeouts as well as
> actual slave node/process failover.
> 
> 
> Diffs
> -
> 
>   src/master/flags.hpp f5c8d2a8cc7530bc8746935af9ea90af747cc111 
>   src/master/master.cpp d6651e299ddb73bfdc1b126c474075db6cda8acd 
>   src/tests/fault_tolerance_tests.cpp 
> 5763486acb6d687b50c02c01ea00e1cfbea48421 
>   src/tests/mesos.cpp 3b98c69a604132be71a60fbbee4a47b51fe6956a 
>   src/tests/partition_tests.cpp fea78016268b007590516798eb30ff423fd0ae58 
>   src/tests/slave_recovery_tests.cpp cd4a398ef680b5694cb6069b8e2ca4e2c05911d1 
>   src/tests/slave_tests.cpp c50cbc799d4793243f184f9fe628b69a020adc66 
> 
> Diff: https://reviews.apache.org/r/29507/diff/
> 
> 
> Testing
> ---
> 
> Manually tested slave failover/shutdown with master using different 
> --slave_ping_timeout and --max_slave_ping_timeouts.
> Updated unit tests to use shorter non-default values for ping timeouts.
> 
> 
> Thanks,
> 
> Adam B
> 
>



Re: Review Request 29507: Added Configurable Slave Ping Timeouts

2015-01-07 Thread Adam B


> On Jan. 6, 2015, 9:17 a.m., Niklas Nielsen wrote:
> > src/tests/mesos.cpp, line 129
> > 
> >
> > Can we use a constant here instead (to make sure it follows future 
> > changes to the default value?)
> 
> Adam B wrote:
> The point of setting these values here was for them to be _different_ 
> from the default value (3x5s vs. 5x15s). I could a) create constants that 
> aren't used anywhere else yet, but might be wanted by some test in the 
> future; b) divide the default by some constant factor, like what 
> registry_store_timeout does; or c) just use let our unit tests use the 
> default values. I'm leaning towards (c), since I don't think any of our tests 
> care about the exact value, and any future test that does can override the 
> default itself.
> 
> Michael Park wrote:
> I'm +1 for (c) as well, but I don't understand why you would've changed 
> this in the first place then?

I think I just changed it for personal debugging/sanity-testing. Removing these 
lines now.


- Adam


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review66859
---


On Dec. 31, 2014, 1:38 a.m., Adam B wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/29507/
> ---
> 
> (Updated Dec. 31, 2014, 1:38 a.m.)
> 
> 
> Review request for mesos and Niklas Nielsen.
> 
> 
> Bugs: MESOS-2150
> https://issues.apache.org/jira/browse/MESOS-2150
> 
> 
> Repository: mesos-git
> 
> 
> Description
> ---
> 
> Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
> to mesos-master to replace the existing (still default) 
> SLAVE_PING_TIMEOUT (15secs) and MAX_SLAVE_PING_TIMEOUTS (5).
>
> These can be extended if slaves are expected/allowed to be down for
> longer than a minute or two.
>   
> Beware that this affects recovery from network timeouts as well as
> actual slave node/process failover.
> 
> 
> Diffs
> -
> 
>   src/master/flags.hpp f5c8d2a8cc7530bc8746935af9ea90af747cc111 
>   src/master/master.cpp d6651e299ddb73bfdc1b126c474075db6cda8acd 
>   src/tests/fault_tolerance_tests.cpp 
> 5763486acb6d687b50c02c01ea00e1cfbea48421 
>   src/tests/mesos.cpp 3b98c69a604132be71a60fbbee4a47b51fe6956a 
>   src/tests/partition_tests.cpp fea78016268b007590516798eb30ff423fd0ae58 
>   src/tests/slave_recovery_tests.cpp cd4a398ef680b5694cb6069b8e2ca4e2c05911d1 
>   src/tests/slave_tests.cpp c50cbc799d4793243f184f9fe628b69a020adc66 
> 
> Diff: https://reviews.apache.org/r/29507/diff/
> 
> 
> Testing
> ---
> 
> Manually tested slave failover/shutdown with master using different 
> --slave_ping_timeout and --max_slave_ping_timeouts.
> Updated unit tests to use shorter non-default values for ping timeouts.
> 
> 
> Thanks,
> 
> Adam B
> 
>



Re: Review Request 29507: Added Configurable Slave Ping Timeouts

2015-01-07 Thread Ben Mahler

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review67113
---



src/master/flags.hpp


Do you mind changing the names now that they're merely defaults?

s/SLAVE_PING_TIMEOUT/DEFAULT_SLAVE_PING_TIMEOUT/
s/MAX_SLAVE_PING_TIMEOUTS/DEFAULT_MAX_SLAVE_PING_TIMEOUTS/

Once you do this, it will become apparent there is more needed in this 
change. In particular, all the slave logic that depends on 
MASTER_PING_TIMEOUT() is broken now! Making it configurable will require some 
more thought.


- Ben Mahler


On Dec. 31, 2014, 9:38 a.m., Adam B wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/29507/
> ---
> 
> (Updated Dec. 31, 2014, 9:38 a.m.)
> 
> 
> Review request for mesos and Niklas Nielsen.
> 
> 
> Bugs: MESOS-2150
> https://issues.apache.org/jira/browse/MESOS-2150
> 
> 
> Repository: mesos-git
> 
> 
> Description
> ---
> 
> Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
> to mesos-master to replace the existing (still default) 
> SLAVE_PING_TIMEOUT (15secs) and MAX_SLAVE_PING_TIMEOUTS (5).
>
> These can be extended if slaves are expected/allowed to be down for
> longer than a minute or two.
>   
> Beware that this affects recovery from network timeouts as well as
> actual slave node/process failover.
> 
> 
> Diffs
> -
> 
>   src/master/flags.hpp f5c8d2a8cc7530bc8746935af9ea90af747cc111 
>   src/master/master.cpp d6651e299ddb73bfdc1b126c474075db6cda8acd 
>   src/tests/fault_tolerance_tests.cpp 
> 5763486acb6d687b50c02c01ea00e1cfbea48421 
>   src/tests/mesos.cpp 3b98c69a604132be71a60fbbee4a47b51fe6956a 
>   src/tests/partition_tests.cpp fea78016268b007590516798eb30ff423fd0ae58 
>   src/tests/slave_recovery_tests.cpp cd4a398ef680b5694cb6069b8e2ca4e2c05911d1 
>   src/tests/slave_tests.cpp c50cbc799d4793243f184f9fe628b69a020adc66 
> 
> Diff: https://reviews.apache.org/r/29507/diff/
> 
> 
> Testing
> ---
> 
> Manually tested slave failover/shutdown with master using different 
> --slave_ping_timeout and --max_slave_ping_timeouts.
> Updated unit tests to use shorter non-default values for ping timeouts.
> 
> 
> Thanks,
> 
> Adam B
> 
>



Re: Review Request 29507: Added Configurable Slave Ping Timeouts

2015-01-07 Thread Michael Park


> On Jan. 7, 2015, 8:22 p.m., Michael Park wrote:
> >

Looks good to me. Just a minor comment on consistent variable naming.


- Michael


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review67077
---


On Dec. 31, 2014, 9:38 a.m., Adam B wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/29507/
> ---
> 
> (Updated Dec. 31, 2014, 9:38 a.m.)
> 
> 
> Review request for mesos and Niklas Nielsen.
> 
> 
> Bugs: MESOS-2150
> https://issues.apache.org/jira/browse/MESOS-2150
> 
> 
> Repository: mesos-git
> 
> 
> Description
> ---
> 
> Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
> to mesos-master to replace the existing (still default) 
> SLAVE_PING_TIMEOUT (15secs) and MAX_SLAVE_PING_TIMEOUTS (5).
>
> These can be extended if slaves are expected/allowed to be down for
> longer than a minute or two.
>   
> Beware that this affects recovery from network timeouts as well as
> actual slave node/process failover.
> 
> 
> Diffs
> -
> 
>   src/master/flags.hpp f5c8d2a8cc7530bc8746935af9ea90af747cc111 
>   src/master/master.cpp d6651e299ddb73bfdc1b126c474075db6cda8acd 
>   src/tests/fault_tolerance_tests.cpp 
> 5763486acb6d687b50c02c01ea00e1cfbea48421 
>   src/tests/mesos.cpp 3b98c69a604132be71a60fbbee4a47b51fe6956a 
>   src/tests/partition_tests.cpp fea78016268b007590516798eb30ff423fd0ae58 
>   src/tests/slave_recovery_tests.cpp cd4a398ef680b5694cb6069b8e2ca4e2c05911d1 
>   src/tests/slave_tests.cpp c50cbc799d4793243f184f9fe628b69a020adc66 
> 
> Diff: https://reviews.apache.org/r/29507/diff/
> 
> 
> Testing
> ---
> 
> Manually tested slave failover/shutdown with master using different 
> --slave_ping_timeout and --max_slave_ping_timeouts.
> Updated unit tests to use shorter non-default values for ping timeouts.
> 
> 
> Thanks,
> 
> Adam B
> 
>



Re: Review Request 29507: Added Configurable Slave Ping Timeouts

2015-01-07 Thread Michael Park


> On Jan. 6, 2015, 5:17 p.m., Niklas Nielsen wrote:
> > src/tests/mesos.cpp, line 129
> > 
> >
> > Can we use a constant here instead (to make sure it follows future 
> > changes to the default value?)
> 
> Adam B wrote:
> The point of setting these values here was for them to be _different_ 
> from the default value (3x5s vs. 5x15s). I could a) create constants that 
> aren't used anywhere else yet, but might be wanted by some test in the 
> future; b) divide the default by some constant factor, like what 
> registry_store_timeout does; or c) just use let our unit tests use the 
> default values. I'm leaning towards (c), since I don't think any of our tests 
> care about the exact value, and any future test that does can override the 
> default itself.

I'm +1 for (c) as well, but I don't understand why you would've changed this in 
the first place then?


- Michael


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review66859
---


On Dec. 31, 2014, 9:38 a.m., Adam B wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/29507/
> ---
> 
> (Updated Dec. 31, 2014, 9:38 a.m.)
> 
> 
> Review request for mesos and Niklas Nielsen.
> 
> 
> Bugs: MESOS-2150
> https://issues.apache.org/jira/browse/MESOS-2150
> 
> 
> Repository: mesos-git
> 
> 
> Description
> ---
> 
> Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
> to mesos-master to replace the existing (still default) 
> SLAVE_PING_TIMEOUT (15secs) and MAX_SLAVE_PING_TIMEOUTS (5).
>
> These can be extended if slaves are expected/allowed to be down for
> longer than a minute or two.
>   
> Beware that this affects recovery from network timeouts as well as
> actual slave node/process failover.
> 
> 
> Diffs
> -
> 
>   src/master/flags.hpp f5c8d2a8cc7530bc8746935af9ea90af747cc111 
>   src/master/master.cpp d6651e299ddb73bfdc1b126c474075db6cda8acd 
>   src/tests/fault_tolerance_tests.cpp 
> 5763486acb6d687b50c02c01ea00e1cfbea48421 
>   src/tests/mesos.cpp 3b98c69a604132be71a60fbbee4a47b51fe6956a 
>   src/tests/partition_tests.cpp fea78016268b007590516798eb30ff423fd0ae58 
>   src/tests/slave_recovery_tests.cpp cd4a398ef680b5694cb6069b8e2ca4e2c05911d1 
>   src/tests/slave_tests.cpp c50cbc799d4793243f184f9fe628b69a020adc66 
> 
> Diff: https://reviews.apache.org/r/29507/diff/
> 
> 
> Testing
> ---
> 
> Manually tested slave failover/shutdown with master using different 
> --slave_ping_timeout and --max_slave_ping_timeouts.
> Updated unit tests to use shorter non-default values for ping timeouts.
> 
> 
> Thanks,
> 
> Adam B
> 
>



Re: Review Request 29507: Added Configurable Slave Ping Timeouts

2015-01-07 Thread Michael Park

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review67077
---



src/master/master.cpp


Maybe `s/ping_timeout/slave_ping_timeout/` and 
`s/max_timeouts/max_slave_ping_timeouts/`?

My guess is that you chose this naming scheme because we're already inside 
`SlaveObserver`. But I see `slaveInfo`, `slaveId`, as opposed to `info` and 
`id`, would prefer if these can be consistent one way or another :)


- Michael Park


On Dec. 31, 2014, 9:38 a.m., Adam B wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/29507/
> ---
> 
> (Updated Dec. 31, 2014, 9:38 a.m.)
> 
> 
> Review request for mesos and Niklas Nielsen.
> 
> 
> Bugs: MESOS-2150
> https://issues.apache.org/jira/browse/MESOS-2150
> 
> 
> Repository: mesos-git
> 
> 
> Description
> ---
> 
> Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
> to mesos-master to replace the existing (still default) 
> SLAVE_PING_TIMEOUT (15secs) and MAX_SLAVE_PING_TIMEOUTS (5).
>
> These can be extended if slaves are expected/allowed to be down for
> longer than a minute or two.
>   
> Beware that this affects recovery from network timeouts as well as
> actual slave node/process failover.
> 
> 
> Diffs
> -
> 
>   src/master/flags.hpp f5c8d2a8cc7530bc8746935af9ea90af747cc111 
>   src/master/master.cpp d6651e299ddb73bfdc1b126c474075db6cda8acd 
>   src/tests/fault_tolerance_tests.cpp 
> 5763486acb6d687b50c02c01ea00e1cfbea48421 
>   src/tests/mesos.cpp 3b98c69a604132be71a60fbbee4a47b51fe6956a 
>   src/tests/partition_tests.cpp fea78016268b007590516798eb30ff423fd0ae58 
>   src/tests/slave_recovery_tests.cpp cd4a398ef680b5694cb6069b8e2ca4e2c05911d1 
>   src/tests/slave_tests.cpp c50cbc799d4793243f184f9fe628b69a020adc66 
> 
> Diff: https://reviews.apache.org/r/29507/diff/
> 
> 
> Testing
> ---
> 
> Manually tested slave failover/shutdown with master using different 
> --slave_ping_timeout and --max_slave_ping_timeouts.
> Updated unit tests to use shorter non-default values for ping timeouts.
> 
> 
> Thanks,
> 
> Adam B
> 
>



Re: Review Request 29507: Added Configurable Slave Ping Timeouts

2015-01-07 Thread Adam B


> On Jan. 6, 2015, 9:17 a.m., Niklas Nielsen wrote:
> > src/tests/mesos.cpp, line 129
> > 
> >
> > Can we use a constant here instead (to make sure it follows future 
> > changes to the default value?)

The point of setting these values here was for them to be _different_ from the 
default value (3x5s vs. 5x15s). I could a) create constants that aren't used 
anywhere else yet, but might be wanted by some test in the future; b) divide 
the default by some constant factor, like what registry_store_timeout does; or 
c) just use let our unit tests use the default values. I'm leaning towards (c), 
since I don't think any of our tests care about the exact value, and any future 
test that does can override the default itself.


> On Jan. 6, 2015, 9:17 a.m., Niklas Nielsen wrote:
> > src/master/master.cpp, line 
> > 
> >
> > Can you expand a bit on this change?

Why certainly. I believe we have been logging the wrong value here, ever since 
--recovery_slave_removal_limit was introduced in 
https://reviews.apache.org/r/20097/diff/#. @bmahler correct me if I'm wrong.
Once the registrar has recovered, `Master::_recover` starts a `delay` timer for 
`flags.slave_reregister_timeout`, after which point it calls 
`Master::recoveredSlavesTimeout`, which is the method here that could exit with 
the message "After _X_ there were [an unacceptable number of] slaves recovered 
from the registry that did not reregister". This after `_recover` has already 
logged "Recovered ... slaves from the Registry; allowing 
[slave_reregister_timeout] for slaves to re-register".
The slave ping timeouts determine when the master considers a slave shutdown, 
sends SLAVE_LOST, and no longer accepts re-registrations; whereas the slave 
reregister timeout determines how soon after leader-election (and/or recovery) 
the master should expect slaves to reregister. This log message describes the 
latter scenario, so I corrected the value we were logging.
Make sense?


> On Jan. 6, 2015, 9:17 a.m., Niklas Nielsen wrote:
> > src/master/flags.hpp, line 406
> > 
> >
> > Why uint32 and not int?

Mostly because the original default value constant is 
`src/master/constants.cpp:35:const uint32_t MAX_SLAVE_PING_TIMEOUTS = 5;`
My guess why _that_ was uint32? Well, any count variable (like the existing 
`uint32 SlaveObserver::timeout`) should never be negative and valid, so 
unsigned makes sense. I would think that one shouldn't need more than 255 
timeouts (failed pings) before declaring a slave shutdown, so a uchar/byte 
would be sufficient, but maybe those bytes aren't worth saving, compared to 
having more bits to express a larger pseudo-infinity.

Now does the flag itself need to be uint32? Well, doing so puts the burden of 
size/sign validation on the flag-parsing code instead of wherever I end up 
doing the conversion.


- Adam


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review66859
---


On Dec. 31, 2014, 1:38 a.m., Adam B wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/29507/
> ---
> 
> (Updated Dec. 31, 2014, 1:38 a.m.)
> 
> 
> Review request for mesos and Niklas Nielsen.
> 
> 
> Bugs: MESOS-2150
> https://issues.apache.org/jira/browse/MESOS-2150
> 
> 
> Repository: mesos-git
> 
> 
> Description
> ---
> 
> Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
> to mesos-master to replace the existing (still default) 
> SLAVE_PING_TIMEOUT (15secs) and MAX_SLAVE_PING_TIMEOUTS (5).
>
> These can be extended if slaves are expected/allowed to be down for
> longer than a minute or two.
>   
> Beware that this affects recovery from network timeouts as well as
> actual slave node/process failover.
> 
> 
> Diffs
> -
> 
>   src/master/flags.hpp f5c8d2a8cc7530bc8746935af9ea90af747cc111 
>   src/master/master.cpp d6651e299ddb73bfdc1b126c474075db6cda8acd 
>   src/tests/fault_tolerance_tests.cpp 
> 5763486acb6d687b50c02c01ea00e1cfbea48421 
>   src/tests/mesos.cpp 3b98c69a604132be71a60fbbee4a47b51fe6956a 
>   src/tests/partition_tests.cpp fea78016268b007590516798eb30ff423fd0ae58 
>   src/tests/slave_recovery_tests.cpp cd4a398ef680b5694cb6069b8e2ca4e2c05911d1 
>   src/tests/slave_tests.cpp c50cbc799d4793243f184f9fe628b69a020adc66 
> 
> Diff: https://reviews.apache.org/r/29507/diff/
> 
> 
> Testing
> ---
> 
> Manually tested slave failover/shutdown with master using different 
> --slave_ping_timeout and --max_slave_ping_timeouts.
> Updated unit tests to use short

Re: Review Request 29507: Added Configurable Slave Ping Timeouts

2015-01-06 Thread Niklas Nielsen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review66859
---


Looks good! Have a couple of small questions


src/master/flags.hpp


Why uint32 and not int?



src/master/master.cpp


Can you expand a bit on this change?



src/master/master.cpp


According to the wrapping rules, we should newline each argument here no?



src/tests/mesos.cpp


Can we use a constant here instead (to make sure it follows future changes 
to the default value?)



src/tests/partition_tests.cpp


s/> >/>>/



src/tests/slave_recovery_tests.cpp


s/> >/>>/


- Niklas Nielsen


On Dec. 31, 2014, 1:38 a.m., Adam B wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/29507/
> ---
> 
> (Updated Dec. 31, 2014, 1:38 a.m.)
> 
> 
> Review request for mesos and Niklas Nielsen.
> 
> 
> Bugs: MESOS-2150
> https://issues.apache.org/jira/browse/MESOS-2150
> 
> 
> Repository: mesos-git
> 
> 
> Description
> ---
> 
> Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
> to mesos-master to replace the existing (still default) 
> SLAVE_PING_TIMEOUT (15secs) and MAX_SLAVE_PING_TIMEOUTS (5).
>
> These can be extended if slaves are expected/allowed to be down for
> longer than a minute or two.
>   
> Beware that this affects recovery from network timeouts as well as
> actual slave node/process failover.
> 
> 
> Diffs
> -
> 
>   src/master/flags.hpp f5c8d2a8cc7530bc8746935af9ea90af747cc111 
>   src/master/master.cpp d6651e299ddb73bfdc1b126c474075db6cda8acd 
>   src/tests/fault_tolerance_tests.cpp 
> 5763486acb6d687b50c02c01ea00e1cfbea48421 
>   src/tests/mesos.cpp 3b98c69a604132be71a60fbbee4a47b51fe6956a 
>   src/tests/partition_tests.cpp fea78016268b007590516798eb30ff423fd0ae58 
>   src/tests/slave_recovery_tests.cpp cd4a398ef680b5694cb6069b8e2ca4e2c05911d1 
>   src/tests/slave_tests.cpp c50cbc799d4793243f184f9fe628b69a020adc66 
> 
> Diff: https://reviews.apache.org/r/29507/diff/
> 
> 
> Testing
> ---
> 
> Manually tested slave failover/shutdown with master using different 
> --slave_ping_timeout and --max_slave_ping_timeouts.
> Updated unit tests to use shorter non-default values for ping timeouts.
> 
> 
> Thanks,
> 
> Adam B
> 
>



Re: Review Request 29507: Added Configurable Slave Ping Timeouts

2014-12-31 Thread Mesos ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review66461
---


Patch looks great!

Reviews applied: [29507]

All tests passed.

- Mesos ReviewBot


On Dec. 31, 2014, 9:38 a.m., Adam B wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/29507/
> ---
> 
> (Updated Dec. 31, 2014, 9:38 a.m.)
> 
> 
> Review request for mesos and Niklas Nielsen.
> 
> 
> Bugs: MESOS-2150
> https://issues.apache.org/jira/browse/MESOS-2150
> 
> 
> Repository: mesos-git
> 
> 
> Description
> ---
> 
> Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
> to mesos-master to replace the existing (still default) 
> SLAVE_PING_TIMEOUT (15secs) and MAX_SLAVE_PING_TIMEOUTS (5).
>
> These can be extended if slaves are expected/allowed to be down for
> longer than a minute or two.
>   
> Beware that this affects recovery from network timeouts as well as
> actual slave node/process failover.
> 
> 
> Diffs
> -
> 
>   src/master/flags.hpp f5c8d2a8cc7530bc8746935af9ea90af747cc111 
>   src/master/master.cpp d6651e299ddb73bfdc1b126c474075db6cda8acd 
>   src/tests/fault_tolerance_tests.cpp 
> 5763486acb6d687b50c02c01ea00e1cfbea48421 
>   src/tests/mesos.cpp 3b98c69a604132be71a60fbbee4a47b51fe6956a 
>   src/tests/partition_tests.cpp fea78016268b007590516798eb30ff423fd0ae58 
>   src/tests/slave_recovery_tests.cpp cd4a398ef680b5694cb6069b8e2ca4e2c05911d1 
>   src/tests/slave_tests.cpp c50cbc799d4793243f184f9fe628b69a020adc66 
> 
> Diff: https://reviews.apache.org/r/29507/diff/
> 
> 
> Testing
> ---
> 
> Manually tested slave failover/shutdown with master using different 
> --slave_ping_timeout and --max_slave_ping_timeouts.
> Updated unit tests to use shorter non-default values for ping timeouts.
> 
> 
> Thanks,
> 
> Adam B
> 
>