Re: Review Request 29507: Added Configurable Slave Ping Timeouts

2015-06-29 Thread Adam B


 On June 24, 2015, 6:41 p.m., Ben Mahler wrote:
  Actually, we should think about one more thing, how does this interact with 
  the zookeeper session timeout?
 
 Adam B wrote:
 The hardcoded individual ping timeout (15secs) was previously longer than 
 the default zk session timeout (10secs), but the zk session timeout is 
 already configurable with no bounds/validation, so users could already change 
 it to be longer than an individual ping timeout, or even the total ping 
 timeout. Is this a bad idea? What would this mean?
 This is mostly relevant to the slave-side (as a leader detector), in that 
 a slave may timeout and decide to check for a new leading master to 
 reregister with, but the zk session may have recently gone bad but not yet 
 timed out. Since a zk session could go bad/timeout at any point along the 
 way, I don't think making the ping timeout configurable will introduce any 
 new potential errors. Rather, it would just be advisable to keep the zk 
 session timeouts reasonably small so that the zk session is likely to be 
 healthy whenever a slave needs to detect a new leader. We could introduce 
 validation that the zk_session_timeout is shorter than the (total) ping 
 timeout, but I'm not sure that's even necessary.
 Does this make sense? I'm no ZK expert, so I defer to those with more 
 experience.
 
 Ben Mahler wrote:
 Right, the ZK timeout is negotiated with ZooKeeper, so it gets bounded 
 based on the ZK server configuration:
 http://zookeeper.apache.org/doc/r3.3.1/zookeeperAdmin.html
 
 Validating that the session timeout is less than the total ping timeout 
 might be nice, otherwise every session expiration slave-side will trigger 
 useless re-registration attempts. Mind just adding a NOTE that reflects this 
 in the flag help? The validation might be nice, but the ZK session timeout is 
 only really determined after negotation anyway..

Noted in the flag help.


- Adam


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review89305
---


On June 26, 2015, 3:12 a.m., Adam B wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/29507/
 ---
 
 (Updated June 26, 2015, 3:12 a.m.)
 
 
 Review request for mesos, Ben Mahler and Niklas Nielsen.
 
 
 Bugs: MESOS-2110
 https://issues.apache.org/jira/browse/MESOS-2110
 
 
 Repository: mesos
 
 
 Description
 ---
 
 Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
 to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs)
 and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5).

 These can be extended if slaves are expected/allowed to be down for
 longer than a minute or two.
 
 Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage.
   
 Beware that this affects recovery from network timeouts as well as
 actual slave node/process failover.
 
 Also fixed the log message in recoveredSlavesTimeout() to correctly
 reference flags.slave_reregister_timeout instead of the unrelated
 ping timeouts.
 
 
 Diffs
 -
 
   docs/configuration.md aaf65bf 
   docs/upgrades.md 73e503c 
   src/master/constants.hpp 072d59c 
   src/master/constants.cpp 997b792 
   src/master/flags.hpp 55ed3a9 
   src/master/flags.cpp 4377715 
   src/master/master.cpp 0782b54 
   src/messages/messages.proto 1c8d79e 
   src/slave/constants.hpp 84927e5 
   src/slave/constants.cpp d8d2f98 
   src/slave/slave.hpp f1cf3b8 
   src/slave/slave.cpp b3e1ccc 
   src/tests/partition_tests.cpp f7ee3ab 
   src/tests/slave_recovery_tests.cpp c036e9c 
   src/tests/slave_tests.cpp e9002e8 
 
 Diff: https://reviews.apache.org/r/29507/diff/
 
 
 Testing
 ---
 
 Manually tested slave failover/shutdown with master using different 
 --slave_ping_timeout and --max_slave_ping_timeouts.
 Ran unit tests with shorter non-default values for ping timeouts.
 `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and 
 ShortPingTimeoutUnreachableSlave
 
 
 Thanks,
 
 Adam B
 




Re: Review Request 29507: Added Configurable Slave Ping Timeouts

2015-06-26 Thread Adam B

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/
---

(Updated June 26, 2015, 3:12 a.m.)


Review request for mesos, Ben Mahler and Niklas Nielsen.


Changes
---

Updated in response to bmahler's comments.


Bugs: MESOS-2110
https://issues.apache.org/jira/browse/MESOS-2110


Repository: mesos


Description
---

Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs)
and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5).
   
These can be extended if slaves are expected/allowed to be down for
longer than a minute or two.

Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage.
  
Beware that this affects recovery from network timeouts as well as
actual slave node/process failover.

Also fixed the log message in recoveredSlavesTimeout() to correctly
reference flags.slave_reregister_timeout instead of the unrelated
ping timeouts.


Diffs (updated)
-

  docs/configuration.md aaf65bf 
  docs/upgrades.md 73e503c 
  src/master/constants.hpp 072d59c 
  src/master/constants.cpp 997b792 
  src/master/flags.hpp 55ed3a9 
  src/master/flags.cpp 4377715 
  src/master/master.cpp 0782b54 
  src/messages/messages.proto 1c8d79e 
  src/slave/constants.hpp 84927e5 
  src/slave/constants.cpp d8d2f98 
  src/slave/slave.hpp f1cf3b8 
  src/slave/slave.cpp b3e1ccc 
  src/tests/partition_tests.cpp f7ee3ab 
  src/tests/slave_recovery_tests.cpp c036e9c 
  src/tests/slave_tests.cpp e9002e8 

Diff: https://reviews.apache.org/r/29507/diff/


Testing
---

Manually tested slave failover/shutdown with master using different 
--slave_ping_timeout and --max_slave_ping_timeouts.
Ran unit tests with shorter non-default values for ping timeouts.
`make check` with new unit tests: ShortPingTimeoutUnreachableMaster and 
ShortPingTimeoutUnreachableSlave


Thanks,

Adam B



Re: Review Request 29507: Added Configurable Slave Ping Timeouts

2015-06-26 Thread Adam B


 On June 3, 2015, 2:24 p.m., Ben Mahler wrote:
  src/tests/partition_tests.cpp, lines 647-650
  https://reviews.apache.org/r/29507/diff/7/?file=973888#file973888line647
 
  This sounds like testing two particular behaviors in a single test:
  
  (1) A registered slave that never receives any Ping messages will 
  attempt to reregister after the master's ping timeout.
  
  (2) Master will consider it shutdown after the same timeout.
  
  I believe we already have tests for both of these. For (1) we could 
  just add the shutdown message expectation to 
  `PartitionTest.PartitionedSlave`. For (2), we have 
  SlaveTest.PingTimeoutNoPings, SlaveTest.PingTimeoutNoPings.
 
 Adam B wrote:
 Yes, I see now that SlaveTest.PingTimeoutNoPings already satisfies (1), 
 and PartitionTest.PartitionedSlave could be modified to check for the 
 ShutdownMessage() to satisfy (2).
 I originally started trying to write unit tests for 
 registered/unregistered slaves that encountered each type of a one-way 
 partition, with both shorter and longer-than-default ping timeouts, largely 
 for my own understanding of the process. Testing longer timeouts proved 
 unnecessary, and testing unregistered slaves didn't actually test my feature. 
 I posted these two since they worked and explicitly tested the one-way 
 partitions I was trying to test. But this one can almost certainly be covered 
 by the tests you mentioned. I will update in a follow-up patch.

It turns out that the shutdown message expectation is already covered in 
SlaveRecoveryTest.PartitionedSlave, which is similar to, but in many ways a 
superset of PartitionTest.PartitionedSlave. I have removed my two new tests 
completely and relied on SlaveRecoveryTest.PartitionedSlave and 
SlaveTest.PingTimeoutNoPings, with the latter updated to use custom ping 
timeout values.
For the updated test, see new RR: https://reviews.apache.org/r/35958


 On June 3, 2015, 2:24 p.m., Ben Mahler wrote:
  src/tests/partition_tests.cpp, lines 600-602
  https://reviews.apache.org/r/29507/diff/7/?file=973888#file973888line600
 
  I'm a bit confused by this test, why are you expecting the first ping? 
  Why are you dropping the pongs?
  
  I would expect the pings to be dropped entirely here so that advancing 
  the clock after a slave is registered leads to a re-registration.
  
  Also, did you know about SlaveTest.PingTimeoutNoPings and 
  SlaveTest.PingTimeoutSomePings tests? Is there something being tested here 
  that is not captured in these?
 
 Adam B wrote:
 This test was intended to simulate a one-way partition where the slave 
 can no longer reach the master, after the slave initially successfully 
 registered. Hence the PONGs are dropped, but not the PINGs.

Even though we're dropping PONGs instead of PINGs here, it's still only testing 
that the Slave reregisters after the timeout. If anything, it should be testing 
that a master that never receives PONGs will shutdown the slave. This behavior 
is already tested by SlaveRecoveryTest.PartitionedSlave. No new test needed.


- Adam


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review86461
---


On June 26, 2015, 3:12 a.m., Adam B wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/29507/
 ---
 
 (Updated June 26, 2015, 3:12 a.m.)
 
 
 Review request for mesos, Ben Mahler and Niklas Nielsen.
 
 
 Bugs: MESOS-2110
 https://issues.apache.org/jira/browse/MESOS-2110
 
 
 Repository: mesos
 
 
 Description
 ---
 
 Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
 to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs)
 and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5).

 These can be extended if slaves are expected/allowed to be down for
 longer than a minute or two.
 
 Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage.
   
 Beware that this affects recovery from network timeouts as well as
 actual slave node/process failover.
 
 Also fixed the log message in recoveredSlavesTimeout() to correctly
 reference flags.slave_reregister_timeout instead of the unrelated
 ping timeouts.
 
 
 Diffs
 -
 
   docs/configuration.md aaf65bf 
   docs/upgrades.md 73e503c 
   src/master/constants.hpp 072d59c 
   src/master/constants.cpp 997b792 
   src/master/flags.hpp 55ed3a9 
   src/master/flags.cpp 4377715 
   src/master/master.cpp 0782b54 
   src/messages/messages.proto 1c8d79e 
   src/slave/constants.hpp 84927e5 
   src/slave/constants.cpp d8d2f98 
   src/slave/slave.hpp f1cf3b8 
   src/slave/slave.cpp b3e1ccc 
   src/tests/partition_tests.cpp f7ee3ab 
   src/tests/slave_recovery_tests.cpp c036e9c 
   

Re: Review Request 29507: Added Configurable Slave Ping Timeouts

2015-06-26 Thread Ben Mahler


 On June 25, 2015, 1:41 a.m., Ben Mahler wrote:
  Actually, we should think about one more thing, how does this interact with 
  the zookeeper session timeout?
 
 Adam B wrote:
 The hardcoded individual ping timeout (15secs) was previously longer than 
 the default zk session timeout (10secs), but the zk session timeout is 
 already configurable with no bounds/validation, so users could already change 
 it to be longer than an individual ping timeout, or even the total ping 
 timeout. Is this a bad idea? What would this mean?
 This is mostly relevant to the slave-side (as a leader detector), in that 
 a slave may timeout and decide to check for a new leading master to 
 reregister with, but the zk session may have recently gone bad but not yet 
 timed out. Since a zk session could go bad/timeout at any point along the 
 way, I don't think making the ping timeout configurable will introduce any 
 new potential errors. Rather, it would just be advisable to keep the zk 
 session timeouts reasonably small so that the zk session is likely to be 
 healthy whenever a slave needs to detect a new leader. We could introduce 
 validation that the zk_session_timeout is shorter than the (total) ping 
 timeout, but I'm not sure that's even necessary.
 Does this make sense? I'm no ZK expert, so I defer to those with more 
 experience.

Right, the ZK timeout is negotiated with ZooKeeper, so it gets bounded based on 
the ZK server configuration:
http://zookeeper.apache.org/doc/r3.3.1/zookeeperAdmin.html

Validating that the session timeout is less than the total ping timeout might 
be nice, otherwise every session expiration slave-side will trigger useless 
re-registration attempts. Mind just adding a NOTE that reflects this in the 
flag help? The validation might be nice, but the ZK session timeout is only 
really determined after negotation anyway..


- Ben


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review89305
---


On June 26, 2015, 10:12 a.m., Adam B wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/29507/
 ---
 
 (Updated June 26, 2015, 10:12 a.m.)
 
 
 Review request for mesos, Ben Mahler and Niklas Nielsen.
 
 
 Bugs: MESOS-2110
 https://issues.apache.org/jira/browse/MESOS-2110
 
 
 Repository: mesos
 
 
 Description
 ---
 
 Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
 to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs)
 and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5).

 These can be extended if slaves are expected/allowed to be down for
 longer than a minute or two.
 
 Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage.
   
 Beware that this affects recovery from network timeouts as well as
 actual slave node/process failover.
 
 Also fixed the log message in recoveredSlavesTimeout() to correctly
 reference flags.slave_reregister_timeout instead of the unrelated
 ping timeouts.
 
 
 Diffs
 -
 
   docs/configuration.md aaf65bf 
   docs/upgrades.md 73e503c 
   src/master/constants.hpp 072d59c 
   src/master/constants.cpp 997b792 
   src/master/flags.hpp 55ed3a9 
   src/master/flags.cpp 4377715 
   src/master/master.cpp 0782b54 
   src/messages/messages.proto 1c8d79e 
   src/slave/constants.hpp 84927e5 
   src/slave/constants.cpp d8d2f98 
   src/slave/slave.hpp f1cf3b8 
   src/slave/slave.cpp b3e1ccc 
   src/tests/partition_tests.cpp f7ee3ab 
   src/tests/slave_recovery_tests.cpp c036e9c 
   src/tests/slave_tests.cpp e9002e8 
 
 Diff: https://reviews.apache.org/r/29507/diff/
 
 
 Testing
 ---
 
 Manually tested slave failover/shutdown with master using different 
 --slave_ping_timeout and --max_slave_ping_timeouts.
 Ran unit tests with shorter non-default values for ping timeouts.
 `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and 
 ShortPingTimeoutUnreachableSlave
 
 
 Thanks,
 
 Adam B
 




Re: Review Request 29507: Added Configurable Slave Ping Timeouts

2015-06-26 Thread Adam B


 On June 24, 2015, 6:41 p.m., Ben Mahler wrote:
  Actually, we should think about one more thing, how does this interact with 
  the zookeeper session timeout?

The hardcoded individual ping timeout (15secs) was previously longer than the 
default zk session timeout (10secs), but the zk session timeout is already 
configurable with no bounds/validation, so users could already change it to be 
longer than an individual ping timeout, or even the total ping timeout. Is this 
a bad idea? What would this mean?
This is mostly relevant to the slave-side (as a leader detector), in that a 
slave may timeout and decide to check for a new leading master to reregister 
with, but the zk session may have recently gone bad but not yet timed out. 
Since a zk session could go bad/timeout at any point along the way, I don't 
think making the ping timeout configurable will introduce any new potential 
errors. Rather, it would just be advisable to keep the zk session timeouts 
reasonably small so that the zk session is likely to be healthy whenever a 
slave needs to detect a new leader. We could introduce validation that the 
zk_session_timeout is shorter than the (total) ping timeout, but I'm not sure 
that's even necessary.
Does this make sense? I'm no ZK expert, so I defer to those with more 
experience.


- Adam


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review89305
---


On June 26, 2015, 3:12 a.m., Adam B wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/29507/
 ---
 
 (Updated June 26, 2015, 3:12 a.m.)
 
 
 Review request for mesos, Ben Mahler and Niklas Nielsen.
 
 
 Bugs: MESOS-2110
 https://issues.apache.org/jira/browse/MESOS-2110
 
 
 Repository: mesos
 
 
 Description
 ---
 
 Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
 to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs)
 and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5).

 These can be extended if slaves are expected/allowed to be down for
 longer than a minute or two.
 
 Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage.
   
 Beware that this affects recovery from network timeouts as well as
 actual slave node/process failover.
 
 Also fixed the log message in recoveredSlavesTimeout() to correctly
 reference flags.slave_reregister_timeout instead of the unrelated
 ping timeouts.
 
 
 Diffs
 -
 
   docs/configuration.md aaf65bf 
   docs/upgrades.md 73e503c 
   src/master/constants.hpp 072d59c 
   src/master/constants.cpp 997b792 
   src/master/flags.hpp 55ed3a9 
   src/master/flags.cpp 4377715 
   src/master/master.cpp 0782b54 
   src/messages/messages.proto 1c8d79e 
   src/slave/constants.hpp 84927e5 
   src/slave/constants.cpp d8d2f98 
   src/slave/slave.hpp f1cf3b8 
   src/slave/slave.cpp b3e1ccc 
   src/tests/partition_tests.cpp f7ee3ab 
   src/tests/slave_recovery_tests.cpp c036e9c 
   src/tests/slave_tests.cpp e9002e8 
 
 Diff: https://reviews.apache.org/r/29507/diff/
 
 
 Testing
 ---
 
 Manually tested slave failover/shutdown with master using different 
 --slave_ping_timeout and --max_slave_ping_timeouts.
 Ran unit tests with shorter non-default values for ping timeouts.
 `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and 
 ShortPingTimeoutUnreachableSlave
 
 
 Thanks,
 
 Adam B
 




Re: Review Request 29507: Added Configurable Slave Ping Timeouts

2015-06-26 Thread Mesos ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review89482
---


Patch looks great!

Reviews applied: [29507]

All tests passed.

- Mesos ReviewBot


On June 26, 2015, 10:12 a.m., Adam B wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/29507/
 ---
 
 (Updated June 26, 2015, 10:12 a.m.)
 
 
 Review request for mesos, Ben Mahler and Niklas Nielsen.
 
 
 Bugs: MESOS-2110
 https://issues.apache.org/jira/browse/MESOS-2110
 
 
 Repository: mesos
 
 
 Description
 ---
 
 Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
 to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs)
 and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5).

 These can be extended if slaves are expected/allowed to be down for
 longer than a minute or two.
 
 Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage.
   
 Beware that this affects recovery from network timeouts as well as
 actual slave node/process failover.
 
 Also fixed the log message in recoveredSlavesTimeout() to correctly
 reference flags.slave_reregister_timeout instead of the unrelated
 ping timeouts.
 
 
 Diffs
 -
 
   docs/configuration.md aaf65bf 
   docs/upgrades.md 73e503c 
   src/master/constants.hpp 072d59c 
   src/master/constants.cpp 997b792 
   src/master/flags.hpp 55ed3a9 
   src/master/flags.cpp 4377715 
   src/master/master.cpp 0782b54 
   src/messages/messages.proto 1c8d79e 
   src/slave/constants.hpp 84927e5 
   src/slave/constants.cpp d8d2f98 
   src/slave/slave.hpp f1cf3b8 
   src/slave/slave.cpp b3e1ccc 
   src/tests/partition_tests.cpp f7ee3ab 
   src/tests/slave_recovery_tests.cpp c036e9c 
   src/tests/slave_tests.cpp e9002e8 
 
 Diff: https://reviews.apache.org/r/29507/diff/
 
 
 Testing
 ---
 
 Manually tested slave failover/shutdown with master using different 
 --slave_ping_timeout and --max_slave_ping_timeouts.
 Ran unit tests with shorter non-default values for ping timeouts.
 `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and 
 ShortPingTimeoutUnreachableSlave
 
 
 Thanks,
 
 Adam B
 




Re: Review Request 29507: Added Configurable Slave Ping Timeouts

2015-06-24 Thread Ben Mahler

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review89305
---


Actually, we should think about one more thing, how does this interact with the 
zookeeper session timeout?

- Ben Mahler


On June 22, 2015, 10:03 a.m., Adam B wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/29507/
 ---
 
 (Updated June 22, 2015, 10:03 a.m.)
 
 
 Review request for mesos, Ben Mahler and Niklas Nielsen.
 
 
 Bugs: MESOS-2110
 https://issues.apache.org/jira/browse/MESOS-2110
 
 
 Repository: mesos
 
 
 Description
 ---
 
 Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
 to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs)
 and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5).

 These can be extended if slaves are expected/allowed to be down for
 longer than a minute or two.
 
 Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage.
   
 Beware that this affects recovery from network timeouts as well as
 actual slave node/process failover.
 
 Also fixed the log message in recoveredSlavesTimeout() to correctly
 reference flags.slave_reregister_timeout instead of the unrelated
 ping timeouts.
 
 
 Diffs
 -
 
   docs/configuration.md aaf65bf 
   docs/upgrades.md 355307a 
   src/master/constants.hpp 072d59c 
   src/master/constants.cpp 997b792 
   src/master/flags.hpp 55ed3a9 
   src/master/flags.cpp 4377715 
   src/master/master.cpp 0135c15 
   src/messages/messages.proto 1c8d79e 
   src/slave/constants.hpp 84927e5 
   src/slave/constants.cpp d8d2f98 
   src/slave/slave.hpp f1cf3b8 
   src/slave/slave.cpp 40c0c33 
   src/tests/partition_tests.cpp f7ee3ab 
   src/tests/slave_recovery_tests.cpp c036e9c 
   src/tests/slave_tests.cpp 5030198 
 
 Diff: https://reviews.apache.org/r/29507/diff/
 
 
 Testing
 ---
 
 Manually tested slave failover/shutdown with master using different 
 --slave_ping_timeout and --max_slave_ping_timeouts.
 Ran unit tests with shorter non-default values for ping timeouts.
 `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and 
 ShortPingTimeoutUnreachableSlave
 
 
 Thanks,
 
 Adam B
 




Re: Review Request 29507: Added Configurable Slave Ping Timeouts

2015-06-24 Thread Niklas Nielsen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review89243
---

Ship it!


Ship It!

- Niklas Nielsen


On June 22, 2015, 3:03 a.m., Adam B wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/29507/
 ---
 
 (Updated June 22, 2015, 3:03 a.m.)
 
 
 Review request for mesos, Ben Mahler and Niklas Nielsen.
 
 
 Bugs: MESOS-2110
 https://issues.apache.org/jira/browse/MESOS-2110
 
 
 Repository: mesos
 
 
 Description
 ---
 
 Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
 to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs)
 and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5).

 These can be extended if slaves are expected/allowed to be down for
 longer than a minute or two.
 
 Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage.
   
 Beware that this affects recovery from network timeouts as well as
 actual slave node/process failover.
 
 Also fixed the log message in recoveredSlavesTimeout() to correctly
 reference flags.slave_reregister_timeout instead of the unrelated
 ping timeouts.
 
 
 Diffs
 -
 
   docs/configuration.md aaf65bf 
   docs/upgrades.md 355307a 
   src/master/constants.hpp 072d59c 
   src/master/constants.cpp 997b792 
   src/master/flags.hpp 55ed3a9 
   src/master/flags.cpp 4377715 
   src/master/master.cpp 0135c15 
   src/messages/messages.proto 1c8d79e 
   src/slave/constants.hpp 84927e5 
   src/slave/constants.cpp d8d2f98 
   src/slave/slave.hpp f1cf3b8 
   src/slave/slave.cpp 40c0c33 
   src/tests/partition_tests.cpp f7ee3ab 
   src/tests/slave_recovery_tests.cpp c036e9c 
   src/tests/slave_tests.cpp 5030198 
 
 Diff: https://reviews.apache.org/r/29507/diff/
 
 
 Testing
 ---
 
 Manually tested slave failover/shutdown with master using different 
 --slave_ping_timeout and --max_slave_ping_timeouts.
 Ran unit tests with shorter non-default values for ping timeouts.
 `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and 
 ShortPingTimeoutUnreachableSlave
 
 
 Thanks,
 
 Adam B
 




Re: Review Request 29507: Added Configurable Slave Ping Timeouts

2015-06-22 Thread Mesos ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review88761
---


Patch looks great!

Reviews applied: [29507]

All tests passed.

- Mesos ReviewBot


On June 22, 2015, 10:03 a.m., Adam B wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/29507/
 ---
 
 (Updated June 22, 2015, 10:03 a.m.)
 
 
 Review request for mesos, Ben Mahler and Niklas Nielsen.
 
 
 Bugs: MESOS-2110
 https://issues.apache.org/jira/browse/MESOS-2110
 
 
 Repository: mesos
 
 
 Description
 ---
 
 Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
 to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs)
 and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5).

 These can be extended if slaves are expected/allowed to be down for
 longer than a minute or two.
 
 Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage.
   
 Beware that this affects recovery from network timeouts as well as
 actual slave node/process failover.
 
 Also fixed the log message in recoveredSlavesTimeout() to correctly
 reference flags.slave_reregister_timeout instead of the unrelated
 ping timeouts.
 
 
 Diffs
 -
 
   docs/configuration.md aaf65bf 
   docs/upgrades.md 355307a 
   src/master/constants.hpp 072d59c 
   src/master/constants.cpp 997b792 
   src/master/flags.hpp 55ed3a9 
   src/master/flags.cpp 4377715 
   src/master/master.cpp 0135c15 
   src/messages/messages.proto 1c8d79e 
   src/slave/constants.hpp 84927e5 
   src/slave/constants.cpp d8d2f98 
   src/slave/slave.hpp f1cf3b8 
   src/slave/slave.cpp 40c0c33 
   src/tests/partition_tests.cpp f7ee3ab 
   src/tests/slave_recovery_tests.cpp c036e9c 
   src/tests/slave_tests.cpp 5030198 
 
 Diff: https://reviews.apache.org/r/29507/diff/
 
 
 Testing
 ---
 
 Manually tested slave failover/shutdown with master using different 
 --slave_ping_timeout and --max_slave_ping_timeouts.
 Ran unit tests with shorter non-default values for ping timeouts.
 `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and 
 ShortPingTimeoutUnreachableSlave
 
 
 Thanks,
 
 Adam B
 




Re: Review Request 29507: Added Configurable Slave Ping Timeouts

2015-06-22 Thread Adam B


 On June 1, 2015, 1:15 p.m., Niklas Nielsen wrote:
  LGTM - Would it make sense to have sane min/max values for the 
  timeouts/counts?
  
  I wonder it would make sense to have a test to exercise an upgrade path 
  (the timeout being different in the slaves, than in the master). Maybe I 
  missed that in your first test.

Added flag validation using the cool new lambda style from 
https://reviews.apache.org/r/34943

It's not so easy to test the upgrade path in a unit test, since as soon as a 
slave from this source base gets its SlaveRegistered ack from the master, it 
then has the same timeout as the master (as designed). And if the slave never 
receives its registered ack, it will retry registration after a /random/ 
backoff initially capped at 1min (--registration_backoff_factor). If the random 
backoff_factor ends up being longer than the slave's ping timeout, when the 
ping timeout expires, the slave will anyway re-detect the master and retry 
registration. As I revisit the tests (in a follow-up patch), I will brainstorm 
ways to test the upgrade path.


 On June 1, 2015, 1:15 p.m., Niklas Nielsen wrote:
  src/messages/messages.proto, line 279
  https://reviews.apache.org/r/29507/diff/7/?file=973883#file973883line279
 
  Why a zero default?

Originally added as a way for the recipient of this message to know that the 
sender didn't set it, presumably because the sender is pre-0.23. Since we don't 
have a way for install handlers to translate optional fields into Options 
instead of just pulling in the default value, I couldn't think of a cleaner 
option. BenM, however, suggested pushing the ping timeout into a nested 
submessage, so I could just check has_ping_timeout_seconds() instead. Went with 
that instead.


 On June 1, 2015, 1:15 p.m., Niklas Nielsen wrote:
  src/slave/slave.cpp, line 864
  https://reviews.apache.org/r/29507/diff/7/?file=973887#file973887line864
 
  If we removed the 0 default, we'd do:
  
  if (ping_timeout_seconds.isSome()) {
// ...
  }
  
  Which seems a bit more logical to me :)

Unfortunately, we don't currently translate optional protobuf fields into 
Options in install handlers, so I went for an explicit default. BenM suggested 
using a nested message so I can check `if 
(connection.has_ping_timeout_seconds())` instead. Much more readable.


 On June 1, 2015, 1:15 p.m., Niklas Nielsen wrote:
  docs/configuration.md, line 339
  https://reviews.apache.org/r/29507/diff/7/?file=973876#file973876line339
 
  As this will be markdown, how about formatting max_slave_ping_timeouts 
  in `` or as --max_slave_ping_timeouts ?

Fixed.


- Adam


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review86039
---


On May 28, 2015, 4:13 p.m., Adam B wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/29507/
 ---
 
 (Updated May 28, 2015, 4:13 p.m.)
 
 
 Review request for mesos, Ben Mahler and Niklas Nielsen.
 
 
 Bugs: MESOS-2110
 https://issues.apache.org/jira/browse/MESOS-2110
 
 
 Repository: mesos
 
 
 Description
 ---
 
 Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
 to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs)
 and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5).

 These can be extended if slaves are expected/allowed to be down for
 longer than a minute or two.
 
 Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage.
   
 Beware that this affects recovery from network timeouts as well as
 actual slave node/process failover.
 
 Also fixed the log message in recoveredSlavesTimeout() to correctly
 reference flags.slave_reregister_timeout instead of the unrelated
 ping timeouts.
 
 
 Diffs
 -
 
   docs/configuration.md 5a41477 
   docs/upgrades.md 355307a 
   src/master/constants.hpp 57cf8fb 
   src/master/constants.cpp 8c7174a 
   src/master/flags.hpp 84fa238 
   src/master/flags.cpp 49d953a 
   src/master/master.cpp 1526f59 
   src/messages/messages.proto 39dac72 
   src/slave/constants.hpp 206d439 
   src/slave/constants.cpp d8d2f98 
   src/slave/slave.hpp 0207eaf 
   src/slave/slave.cpp b4d2029 
   src/tests/partition_tests.cpp f7ee3ab 
   src/tests/slave_recovery_tests.cpp c036e9c 
   src/tests/slave_tests.cpp acae497 
 
 Diff: https://reviews.apache.org/r/29507/diff/
 
 
 Testing
 ---
 
 Manually tested slave failover/shutdown with master using different 
 --slave_ping_timeout and --max_slave_ping_timeouts.
 Ran unit tests with shorter non-default values for ping timeouts.
 `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and 
 ShortPingTimeoutUnreachableSlave
 
 
 Thanks,
 
 Adam B
 




Re: Review Request 29507: Added Configurable Slave Ping Timeouts

2015-06-22 Thread Adam B

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/
---

(Updated June 22, 2015, 3:03 a.m.)


Review request for mesos, Ben Mahler and Niklas Nielsen.


Changes
---

Updated with Niklas and BenM's suggestions. Excluding new unit tests; will post 
subsequent review with them.


Bugs: MESOS-2110
https://issues.apache.org/jira/browse/MESOS-2110


Repository: mesos


Description
---

Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs)
and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5).
   
These can be extended if slaves are expected/allowed to be down for
longer than a minute or two.

Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage.
  
Beware that this affects recovery from network timeouts as well as
actual slave node/process failover.

Also fixed the log message in recoveredSlavesTimeout() to correctly
reference flags.slave_reregister_timeout instead of the unrelated
ping timeouts.


Diffs (updated)
-

  docs/configuration.md aaf65bf 
  docs/upgrades.md 355307a 
  src/master/constants.hpp 072d59c 
  src/master/constants.cpp 997b792 
  src/master/flags.hpp 55ed3a9 
  src/master/flags.cpp 4377715 
  src/master/master.cpp 0135c15 
  src/messages/messages.proto 1c8d79e 
  src/slave/constants.hpp 84927e5 
  src/slave/constants.cpp d8d2f98 
  src/slave/slave.hpp f1cf3b8 
  src/slave/slave.cpp 40c0c33 
  src/tests/partition_tests.cpp f7ee3ab 
  src/tests/slave_recovery_tests.cpp c036e9c 
  src/tests/slave_tests.cpp 5030198 

Diff: https://reviews.apache.org/r/29507/diff/


Testing
---

Manually tested slave failover/shutdown with master using different 
--slave_ping_timeout and --max_slave_ping_timeouts.
Ran unit tests with shorter non-default values for ping timeouts.
`make check` with new unit tests: ShortPingTimeoutUnreachableMaster and 
ShortPingTimeoutUnreachableSlave


Thanks,

Adam B



Re: Review Request 29507: Added Configurable Slave Ping Timeouts

2015-06-01 Thread Niklas Nielsen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review86039
---


LGTM - Would it make sense to have sane min/max values for the timeouts/counts?

I wonder it would make sense to have a test to exercise an upgrade path (the 
timeout being different in the slaves, than in the master). Maybe I missed that 
in your first test.


docs/configuration.md
https://reviews.apache.org/r/29507/#comment137889

As this will be markdown, how about formatting max_slave_ping_timeouts in 
`` or as --max_slave_ping_timeouts ?



docs/upgrades.md
https://reviews.apache.org/r/29507/#comment137891

You shouldn't need this line - look at line 9 :)



src/master/master.cpp
https://reviews.apache.org/r/29507/#comment137899

You should be able to const this, right?

Here and below



src/messages/messages.proto
https://reviews.apache.org/r/29507/#comment137902

Why a zero default?



src/slave/slave.hpp
https://reviews.apache.org/r/29507/#comment137903

We should Camel case here, right?



src/slave/slave.hpp
https://reviews.apache.org/r/29507/#comment137904

Same here



src/slave/slave.cpp
https://reviews.apache.org/r/29507/#comment137905

Camel case?



src/slave/slave.cpp
https://reviews.apache.org/r/29507/#comment137906

If we removed the 0 default, we'd do:

if (ping_timeout_seconds.isSome()) {
  // ...
}

Which seems a bit more logical to me :)



src/tests/partition_tests.cpp
https://reviews.apache.org/r/29507/#comment137936

What do you think about commenting on why you chose these dividing factors 
(Why 3, Why 2? What will these do compared to the original timeouts?)



src/tests/partition_tests.cpp
https://reviews.apache.org/r/29507/#comment137931

const?



src/tests/partition_tests.cpp
https://reviews.apache.org/r/29507/#comment137938

newline?



src/tests/partition_tests.cpp
https://reviews.apache.org/r/29507/#comment137937

Newline?


- Niklas Nielsen


On May 28, 2015, 4:13 p.m., Adam B wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/29507/
 ---
 
 (Updated May 28, 2015, 4:13 p.m.)
 
 
 Review request for mesos, Ben Mahler and Niklas Nielsen.
 
 
 Bugs: MESOS-2110
 https://issues.apache.org/jira/browse/MESOS-2110
 
 
 Repository: mesos
 
 
 Description
 ---
 
 Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
 to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs)
 and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5).

 These can be extended if slaves are expected/allowed to be down for
 longer than a minute or two.
 
 Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage.
   
 Beware that this affects recovery from network timeouts as well as
 actual slave node/process failover.
 
 Also fixed the log message in recoveredSlavesTimeout() to correctly
 reference flags.slave_reregister_timeout instead of the unrelated
 ping timeouts.
 
 
 Diffs
 -
 
   docs/configuration.md 5a41477 
   docs/upgrades.md 355307a 
   src/master/constants.hpp 57cf8fb 
   src/master/constants.cpp 8c7174a 
   src/master/flags.hpp 84fa238 
   src/master/flags.cpp 49d953a 
   src/master/master.cpp 1526f59 
   src/messages/messages.proto 39dac72 
   src/slave/constants.hpp 206d439 
   src/slave/constants.cpp d8d2f98 
   src/slave/slave.hpp 0207eaf 
   src/slave/slave.cpp b4d2029 
   src/tests/partition_tests.cpp f7ee3ab 
   src/tests/slave_recovery_tests.cpp c036e9c 
   src/tests/slave_tests.cpp acae497 
 
 Diff: https://reviews.apache.org/r/29507/diff/
 
 
 Testing
 ---
 
 Manually tested slave failover/shutdown with master using different 
 --slave_ping_timeout and --max_slave_ping_timeouts.
 Ran unit tests with shorter non-default values for ping timeouts.
 `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and 
 ShortPingTimeoutUnreachableSlave
 
 
 Thanks,
 
 Adam B
 




Re: Review Request 29507: Added Configurable Slave Ping Timeouts

2015-05-28 Thread Adam B

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/
---

(Updated May 28, 2015, 4:13 p.m.)


Review request for mesos, Ben Mahler and Niklas Nielsen.


Changes
---

Updated diff with minor feedback, updated description to mention 
slave_reregister_timeout fix.


Bugs: MESOS-2110
https://issues.apache.org/jira/browse/MESOS-2110


Repository: mesos


Description (updated)
---

Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs)
and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5).
   
These can be extended if slaves are expected/allowed to be down for
longer than a minute or two.

Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage.
  
Beware that this affects recovery from network timeouts as well as
actual slave node/process failover.

Also fixed the log message in recoveredSlavesTimeout() to correctly
reference flags.slave_reregister_timeout instead of the unrelated
ping timeouts.


Diffs (updated)
-

  docs/configuration.md 5a41477 
  docs/upgrades.md 355307a 
  src/master/constants.hpp 57cf8fb 
  src/master/constants.cpp 8c7174a 
  src/master/flags.hpp 84fa238 
  src/master/flags.cpp 49d953a 
  src/master/master.cpp 1526f59 
  src/messages/messages.proto 39dac72 
  src/slave/constants.hpp 206d439 
  src/slave/constants.cpp d8d2f98 
  src/slave/slave.hpp 0207eaf 
  src/slave/slave.cpp b4d2029 
  src/tests/partition_tests.cpp f7ee3ab 
  src/tests/slave_recovery_tests.cpp c036e9c 
  src/tests/slave_tests.cpp acae497 

Diff: https://reviews.apache.org/r/29507/diff/


Testing
---

Manually tested slave failover/shutdown with master using different 
--slave_ping_timeout and --max_slave_ping_timeouts.
Ran unit tests with shorter non-default values for ping timeouts.
`make check` with new unit tests: ShortPingTimeoutUnreachableMaster and 
ShortPingTimeoutUnreachableSlave


Thanks,

Adam B



Re: Review Request 29507: Added Configurable Slave Ping Timeouts

2015-05-28 Thread Adam B


 On May 14, 2015, 6:07 a.m., Alexander Rukletsov wrote:
  src/master/master.cpp, line 1315
  https://reviews.apache.org/r/29507/diff/6/?file=959052#file959052line1315
 
  This looks like a drive-by bug fix, should it be included in this diff? 
  At least, let's mention it in the description.

Had to modify it when I was changing the constants, and upon closer inspection 
it was logging the wrong timeout anyway.
Mentioned it in the description now.


- Adam


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review83758
---


On May 14, 2015, 3:01 a.m., Adam B wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/29507/
 ---
 
 (Updated May 14, 2015, 3:01 a.m.)
 
 
 Review request for mesos, Ben Mahler and Niklas Nielsen.
 
 
 Bugs: MESOS-2110
 https://issues.apache.org/jira/browse/MESOS-2110
 
 
 Repository: mesos
 
 
 Description
 ---
 
 Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
 to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs)
 and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5).

 These can be extended if slaves are expected/allowed to be down for
 longer than a minute or two.
 
 Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage.
   
 Beware that this affects recovery from network timeouts as well as
 actual slave node/process failover.
 
 
 Diffs
 -
 
   docs/configuration.md 54c4e31 
   docs/upgrades.md 2a15694 
   src/master/constants.hpp c386eab 
   src/master/constants.cpp 9ee17e9 
   src/master/flags.hpp 996cf38 
   src/master/flags.cpp 5798989 
   src/master/master.cpp eaea79d 
   src/messages/messages.proto 19e2444 
   src/slave/constants.hpp df02043 
   src/slave/constants.cpp 07f699a 
   src/slave/slave.hpp b62ed7b 
   src/slave/slave.cpp 132f83e 
   src/tests/partition_tests.cpp f7ee3ab 
   src/tests/slave_recovery_tests.cpp c036e9c 
   src/tests/slave_tests.cpp acae497 
 
 Diff: https://reviews.apache.org/r/29507/diff/
 
 
 Testing
 ---
 
 Manually tested slave failover/shutdown with master using different 
 --slave_ping_timeout and --max_slave_ping_timeouts.
 Ran unit tests with shorter non-default values for ping timeouts.
 `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and 
 ShortPingTimeoutUnreachableSlave
 
 
 Thanks,
 
 Adam B
 




Re: Review Request 29507: Added Configurable Slave Ping Timeouts

2015-05-28 Thread Mesos ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review85652
---


Patch looks great!

Reviews applied: [29507]

All tests passed.

- Mesos ReviewBot


On May 28, 2015, 11:13 p.m., Adam B wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/29507/
 ---
 
 (Updated May 28, 2015, 11:13 p.m.)
 
 
 Review request for mesos, Ben Mahler and Niklas Nielsen.
 
 
 Bugs: MESOS-2110
 https://issues.apache.org/jira/browse/MESOS-2110
 
 
 Repository: mesos
 
 
 Description
 ---
 
 Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
 to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs)
 and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5).

 These can be extended if slaves are expected/allowed to be down for
 longer than a minute or two.
 
 Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage.
   
 Beware that this affects recovery from network timeouts as well as
 actual slave node/process failover.
 
 Also fixed the log message in recoveredSlavesTimeout() to correctly
 reference flags.slave_reregister_timeout instead of the unrelated
 ping timeouts.
 
 
 Diffs
 -
 
   docs/configuration.md 5a41477 
   docs/upgrades.md 355307a 
   src/master/constants.hpp 57cf8fb 
   src/master/constants.cpp 8c7174a 
   src/master/flags.hpp 84fa238 
   src/master/flags.cpp 49d953a 
   src/master/master.cpp 1526f59 
   src/messages/messages.proto 39dac72 
   src/slave/constants.hpp 206d439 
   src/slave/constants.cpp d8d2f98 
   src/slave/slave.hpp 0207eaf 
   src/slave/slave.cpp b4d2029 
   src/tests/partition_tests.cpp f7ee3ab 
   src/tests/slave_recovery_tests.cpp c036e9c 
   src/tests/slave_tests.cpp acae497 
 
 Diff: https://reviews.apache.org/r/29507/diff/
 
 
 Testing
 ---
 
 Manually tested slave failover/shutdown with master using different 
 --slave_ping_timeout and --max_slave_ping_timeouts.
 Ran unit tests with shorter non-default values for ping timeouts.
 `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and 
 ShortPingTimeoutUnreachableSlave
 
 
 Thanks,
 
 Adam B
 




Re: Review Request 29507: Added Configurable Slave Ping Timeouts

2015-05-21 Thread Niklas Nielsen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review84726
---


Hey Adam, is this ready for review?

- Niklas Nielsen


On May 14, 2015, 3:01 a.m., Adam B wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/29507/
 ---
 
 (Updated May 14, 2015, 3:01 a.m.)
 
 
 Review request for mesos, Ben Mahler and Niklas Nielsen.
 
 
 Bugs: MESOS-2110
 https://issues.apache.org/jira/browse/MESOS-2110
 
 
 Repository: mesos
 
 
 Description
 ---
 
 Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
 to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs)
 and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5).

 These can be extended if slaves are expected/allowed to be down for
 longer than a minute or two.
 
 Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage.
   
 Beware that this affects recovery from network timeouts as well as
 actual slave node/process failover.
 
 
 Diffs
 -
 
   docs/configuration.md 54c4e31 
   docs/upgrades.md 2a15694 
   src/master/constants.hpp c386eab 
   src/master/constants.cpp 9ee17e9 
   src/master/flags.hpp 996cf38 
   src/master/flags.cpp 5798989 
   src/master/master.cpp eaea79d 
   src/messages/messages.proto 19e2444 
   src/slave/constants.hpp df02043 
   src/slave/constants.cpp 07f699a 
   src/slave/slave.hpp b62ed7b 
   src/slave/slave.cpp 132f83e 
   src/tests/partition_tests.cpp f7ee3ab 
   src/tests/slave_recovery_tests.cpp c036e9c 
   src/tests/slave_tests.cpp acae497 
 
 Diff: https://reviews.apache.org/r/29507/diff/
 
 
 Testing
 ---
 
 Manually tested slave failover/shutdown with master using different 
 --slave_ping_timeout and --max_slave_ping_timeouts.
 Ran unit tests with shorter non-default values for ping timeouts.
 `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and 
 ShortPingTimeoutUnreachableSlave
 
 
 Thanks,
 
 Adam B
 




Re: Review Request 29507: Added Configurable Slave Ping Timeouts

2015-05-14 Thread Adam B

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/
---

(Updated May 14, 2015, 1:54 a.m.)


Review request for mesos, Ben Mahler and Niklas Nielsen.


Changes
---

Rebased and added configuration/upgrade documentation.


Bugs: MESOS-2110
https://issues.apache.org/jira/browse/MESOS-2110


Repository: mesos


Description
---

Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs)
and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5).
   
These can be extended if slaves are expected/allowed to be down for
longer than a minute or two.

Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage.
  
Beware that this affects recovery from network timeouts as well as
actual slave node/process failover.


Diffs (updated)
-

  docs/configuration.md 54c4e31 
  docs/upgrades.md 2a15694 
  src/master/constants.hpp c386eab 
  src/master/constants.cpp 9ee17e9 
  src/master/flags.hpp 996cf38 
  src/master/flags.cpp 5798989 
  src/master/master.cpp ec32cd6 
  src/messages/messages.proto 98d859f 
  src/slave/constants.hpp fd1c1ab 
  src/slave/constants.cpp 2a99b11 
  src/slave/slave.hpp adb52b5 
  src/slave/slave.cpp 1b17441 
  src/tests/fault_tolerance_tests.cpp 3a27d82 
  src/tests/partition_tests.cpp 1018e47 
  src/tests/slave_recovery_tests.cpp c036e9c 
  src/tests/slave_tests.cpp 04e79ec 

Diff: https://reviews.apache.org/r/29507/diff/


Testing
---

Manually tested slave failover/shutdown with master using different 
--slave_ping_timeout and --max_slave_ping_timeouts.
Ran unit tests with shorter non-default values for ping timeouts.
`make check` with new unit tests: ShortPingTimeoutUnreachableMaster and 
ShortPingTimeoutUnreachableSlave


Thanks,

Adam B



Re: Review Request 29507: Added Configurable Slave Ping Timeouts

2015-05-14 Thread Adam B

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/
---

(Updated May 14, 2015, 3:01 a.m.)


Review request for mesos, Ben Mahler and Niklas Nielsen.


Changes
---

Rebased


Bugs: MESOS-2110
https://issues.apache.org/jira/browse/MESOS-2110


Repository: mesos


Description
---

Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs)
and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5).
   
These can be extended if slaves are expected/allowed to be down for
longer than a minute or two.

Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage.
  
Beware that this affects recovery from network timeouts as well as
actual slave node/process failover.


Diffs (updated)
-

  docs/configuration.md 54c4e31 
  docs/upgrades.md 2a15694 
  src/master/constants.hpp c386eab 
  src/master/constants.cpp 9ee17e9 
  src/master/flags.hpp 996cf38 
  src/master/flags.cpp 5798989 
  src/master/master.cpp eaea79d 
  src/messages/messages.proto 19e2444 
  src/slave/constants.hpp df02043 
  src/slave/constants.cpp 07f699a 
  src/slave/slave.hpp b62ed7b 
  src/slave/slave.cpp 132f83e 
  src/tests/partition_tests.cpp f7ee3ab 
  src/tests/slave_recovery_tests.cpp c036e9c 
  src/tests/slave_tests.cpp acae497 

Diff: https://reviews.apache.org/r/29507/diff/


Testing
---

Manually tested slave failover/shutdown with master using different 
--slave_ping_timeout and --max_slave_ping_timeouts.
Ran unit tests with shorter non-default values for ping timeouts.
`make check` with new unit tests: ShortPingTimeoutUnreachableMaster and 
ShortPingTimeoutUnreachableSlave


Thanks,

Adam B