Re: Review Request 29507: Added Configurable Slave Ping Timeouts
On June 24, 2015, 6:41 p.m., Ben Mahler wrote: Actually, we should think about one more thing, how does this interact with the zookeeper session timeout? Adam B wrote: The hardcoded individual ping timeout (15secs) was previously longer than the default zk session timeout (10secs), but the zk session timeout is already configurable with no bounds/validation, so users could already change it to be longer than an individual ping timeout, or even the total ping timeout. Is this a bad idea? What would this mean? This is mostly relevant to the slave-side (as a leader detector), in that a slave may timeout and decide to check for a new leading master to reregister with, but the zk session may have recently gone bad but not yet timed out. Since a zk session could go bad/timeout at any point along the way, I don't think making the ping timeout configurable will introduce any new potential errors. Rather, it would just be advisable to keep the zk session timeouts reasonably small so that the zk session is likely to be healthy whenever a slave needs to detect a new leader. We could introduce validation that the zk_session_timeout is shorter than the (total) ping timeout, but I'm not sure that's even necessary. Does this make sense? I'm no ZK expert, so I defer to those with more experience. Ben Mahler wrote: Right, the ZK timeout is negotiated with ZooKeeper, so it gets bounded based on the ZK server configuration: http://zookeeper.apache.org/doc/r3.3.1/zookeeperAdmin.html Validating that the session timeout is less than the total ping timeout might be nice, otherwise every session expiration slave-side will trigger useless re-registration attempts. Mind just adding a NOTE that reflects this in the flag help? The validation might be nice, but the ZK session timeout is only really determined after negotation anyway.. Noted in the flag help. - Adam --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29507/#review89305 --- On June 26, 2015, 3:12 a.m., Adam B wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29507/ --- (Updated June 26, 2015, 3:12 a.m.) Review request for mesos, Ben Mahler and Niklas Nielsen. Bugs: MESOS-2110 https://issues.apache.org/jira/browse/MESOS-2110 Repository: mesos Description --- Added new --slave_ping_timeout and --max_slave_ping_timeouts flags to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs) and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5). These can be extended if slaves are expected/allowed to be down for longer than a minute or two. Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage. Beware that this affects recovery from network timeouts as well as actual slave node/process failover. Also fixed the log message in recoveredSlavesTimeout() to correctly reference flags.slave_reregister_timeout instead of the unrelated ping timeouts. Diffs - docs/configuration.md aaf65bf docs/upgrades.md 73e503c src/master/constants.hpp 072d59c src/master/constants.cpp 997b792 src/master/flags.hpp 55ed3a9 src/master/flags.cpp 4377715 src/master/master.cpp 0782b54 src/messages/messages.proto 1c8d79e src/slave/constants.hpp 84927e5 src/slave/constants.cpp d8d2f98 src/slave/slave.hpp f1cf3b8 src/slave/slave.cpp b3e1ccc src/tests/partition_tests.cpp f7ee3ab src/tests/slave_recovery_tests.cpp c036e9c src/tests/slave_tests.cpp e9002e8 Diff: https://reviews.apache.org/r/29507/diff/ Testing --- Manually tested slave failover/shutdown with master using different --slave_ping_timeout and --max_slave_ping_timeouts. Ran unit tests with shorter non-default values for ping timeouts. `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and ShortPingTimeoutUnreachableSlave Thanks, Adam B
Re: Review Request 29507: Added Configurable Slave Ping Timeouts
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29507/ --- (Updated June 26, 2015, 3:12 a.m.) Review request for mesos, Ben Mahler and Niklas Nielsen. Changes --- Updated in response to bmahler's comments. Bugs: MESOS-2110 https://issues.apache.org/jira/browse/MESOS-2110 Repository: mesos Description --- Added new --slave_ping_timeout and --max_slave_ping_timeouts flags to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs) and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5). These can be extended if slaves are expected/allowed to be down for longer than a minute or two. Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage. Beware that this affects recovery from network timeouts as well as actual slave node/process failover. Also fixed the log message in recoveredSlavesTimeout() to correctly reference flags.slave_reregister_timeout instead of the unrelated ping timeouts. Diffs (updated) - docs/configuration.md aaf65bf docs/upgrades.md 73e503c src/master/constants.hpp 072d59c src/master/constants.cpp 997b792 src/master/flags.hpp 55ed3a9 src/master/flags.cpp 4377715 src/master/master.cpp 0782b54 src/messages/messages.proto 1c8d79e src/slave/constants.hpp 84927e5 src/slave/constants.cpp d8d2f98 src/slave/slave.hpp f1cf3b8 src/slave/slave.cpp b3e1ccc src/tests/partition_tests.cpp f7ee3ab src/tests/slave_recovery_tests.cpp c036e9c src/tests/slave_tests.cpp e9002e8 Diff: https://reviews.apache.org/r/29507/diff/ Testing --- Manually tested slave failover/shutdown with master using different --slave_ping_timeout and --max_slave_ping_timeouts. Ran unit tests with shorter non-default values for ping timeouts. `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and ShortPingTimeoutUnreachableSlave Thanks, Adam B
Re: Review Request 29507: Added Configurable Slave Ping Timeouts
On June 3, 2015, 2:24 p.m., Ben Mahler wrote: src/tests/partition_tests.cpp, lines 647-650 https://reviews.apache.org/r/29507/diff/7/?file=973888#file973888line647 This sounds like testing two particular behaviors in a single test: (1) A registered slave that never receives any Ping messages will attempt to reregister after the master's ping timeout. (2) Master will consider it shutdown after the same timeout. I believe we already have tests for both of these. For (1) we could just add the shutdown message expectation to `PartitionTest.PartitionedSlave`. For (2), we have SlaveTest.PingTimeoutNoPings, SlaveTest.PingTimeoutNoPings. Adam B wrote: Yes, I see now that SlaveTest.PingTimeoutNoPings already satisfies (1), and PartitionTest.PartitionedSlave could be modified to check for the ShutdownMessage() to satisfy (2). I originally started trying to write unit tests for registered/unregistered slaves that encountered each type of a one-way partition, with both shorter and longer-than-default ping timeouts, largely for my own understanding of the process. Testing longer timeouts proved unnecessary, and testing unregistered slaves didn't actually test my feature. I posted these two since they worked and explicitly tested the one-way partitions I was trying to test. But this one can almost certainly be covered by the tests you mentioned. I will update in a follow-up patch. It turns out that the shutdown message expectation is already covered in SlaveRecoveryTest.PartitionedSlave, which is similar to, but in many ways a superset of PartitionTest.PartitionedSlave. I have removed my two new tests completely and relied on SlaveRecoveryTest.PartitionedSlave and SlaveTest.PingTimeoutNoPings, with the latter updated to use custom ping timeout values. For the updated test, see new RR: https://reviews.apache.org/r/35958 On June 3, 2015, 2:24 p.m., Ben Mahler wrote: src/tests/partition_tests.cpp, lines 600-602 https://reviews.apache.org/r/29507/diff/7/?file=973888#file973888line600 I'm a bit confused by this test, why are you expecting the first ping? Why are you dropping the pongs? I would expect the pings to be dropped entirely here so that advancing the clock after a slave is registered leads to a re-registration. Also, did you know about SlaveTest.PingTimeoutNoPings and SlaveTest.PingTimeoutSomePings tests? Is there something being tested here that is not captured in these? Adam B wrote: This test was intended to simulate a one-way partition where the slave can no longer reach the master, after the slave initially successfully registered. Hence the PONGs are dropped, but not the PINGs. Even though we're dropping PONGs instead of PINGs here, it's still only testing that the Slave reregisters after the timeout. If anything, it should be testing that a master that never receives PONGs will shutdown the slave. This behavior is already tested by SlaveRecoveryTest.PartitionedSlave. No new test needed. - Adam --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29507/#review86461 --- On June 26, 2015, 3:12 a.m., Adam B wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29507/ --- (Updated June 26, 2015, 3:12 a.m.) Review request for mesos, Ben Mahler and Niklas Nielsen. Bugs: MESOS-2110 https://issues.apache.org/jira/browse/MESOS-2110 Repository: mesos Description --- Added new --slave_ping_timeout and --max_slave_ping_timeouts flags to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs) and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5). These can be extended if slaves are expected/allowed to be down for longer than a minute or two. Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage. Beware that this affects recovery from network timeouts as well as actual slave node/process failover. Also fixed the log message in recoveredSlavesTimeout() to correctly reference flags.slave_reregister_timeout instead of the unrelated ping timeouts. Diffs - docs/configuration.md aaf65bf docs/upgrades.md 73e503c src/master/constants.hpp 072d59c src/master/constants.cpp 997b792 src/master/flags.hpp 55ed3a9 src/master/flags.cpp 4377715 src/master/master.cpp 0782b54 src/messages/messages.proto 1c8d79e src/slave/constants.hpp 84927e5 src/slave/constants.cpp d8d2f98 src/slave/slave.hpp f1cf3b8 src/slave/slave.cpp b3e1ccc src/tests/partition_tests.cpp f7ee3ab src/tests/slave_recovery_tests.cpp c036e9c
Re: Review Request 29507: Added Configurable Slave Ping Timeouts
On June 25, 2015, 1:41 a.m., Ben Mahler wrote: Actually, we should think about one more thing, how does this interact with the zookeeper session timeout? Adam B wrote: The hardcoded individual ping timeout (15secs) was previously longer than the default zk session timeout (10secs), but the zk session timeout is already configurable with no bounds/validation, so users could already change it to be longer than an individual ping timeout, or even the total ping timeout. Is this a bad idea? What would this mean? This is mostly relevant to the slave-side (as a leader detector), in that a slave may timeout and decide to check for a new leading master to reregister with, but the zk session may have recently gone bad but not yet timed out. Since a zk session could go bad/timeout at any point along the way, I don't think making the ping timeout configurable will introduce any new potential errors. Rather, it would just be advisable to keep the zk session timeouts reasonably small so that the zk session is likely to be healthy whenever a slave needs to detect a new leader. We could introduce validation that the zk_session_timeout is shorter than the (total) ping timeout, but I'm not sure that's even necessary. Does this make sense? I'm no ZK expert, so I defer to those with more experience. Right, the ZK timeout is negotiated with ZooKeeper, so it gets bounded based on the ZK server configuration: http://zookeeper.apache.org/doc/r3.3.1/zookeeperAdmin.html Validating that the session timeout is less than the total ping timeout might be nice, otherwise every session expiration slave-side will trigger useless re-registration attempts. Mind just adding a NOTE that reflects this in the flag help? The validation might be nice, but the ZK session timeout is only really determined after negotation anyway.. - Ben --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29507/#review89305 --- On June 26, 2015, 10:12 a.m., Adam B wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29507/ --- (Updated June 26, 2015, 10:12 a.m.) Review request for mesos, Ben Mahler and Niklas Nielsen. Bugs: MESOS-2110 https://issues.apache.org/jira/browse/MESOS-2110 Repository: mesos Description --- Added new --slave_ping_timeout and --max_slave_ping_timeouts flags to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs) and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5). These can be extended if slaves are expected/allowed to be down for longer than a minute or two. Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage. Beware that this affects recovery from network timeouts as well as actual slave node/process failover. Also fixed the log message in recoveredSlavesTimeout() to correctly reference flags.slave_reregister_timeout instead of the unrelated ping timeouts. Diffs - docs/configuration.md aaf65bf docs/upgrades.md 73e503c src/master/constants.hpp 072d59c src/master/constants.cpp 997b792 src/master/flags.hpp 55ed3a9 src/master/flags.cpp 4377715 src/master/master.cpp 0782b54 src/messages/messages.proto 1c8d79e src/slave/constants.hpp 84927e5 src/slave/constants.cpp d8d2f98 src/slave/slave.hpp f1cf3b8 src/slave/slave.cpp b3e1ccc src/tests/partition_tests.cpp f7ee3ab src/tests/slave_recovery_tests.cpp c036e9c src/tests/slave_tests.cpp e9002e8 Diff: https://reviews.apache.org/r/29507/diff/ Testing --- Manually tested slave failover/shutdown with master using different --slave_ping_timeout and --max_slave_ping_timeouts. Ran unit tests with shorter non-default values for ping timeouts. `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and ShortPingTimeoutUnreachableSlave Thanks, Adam B
Re: Review Request 29507: Added Configurable Slave Ping Timeouts
On June 24, 2015, 6:41 p.m., Ben Mahler wrote: Actually, we should think about one more thing, how does this interact with the zookeeper session timeout? The hardcoded individual ping timeout (15secs) was previously longer than the default zk session timeout (10secs), but the zk session timeout is already configurable with no bounds/validation, so users could already change it to be longer than an individual ping timeout, or even the total ping timeout. Is this a bad idea? What would this mean? This is mostly relevant to the slave-side (as a leader detector), in that a slave may timeout and decide to check for a new leading master to reregister with, but the zk session may have recently gone bad but not yet timed out. Since a zk session could go bad/timeout at any point along the way, I don't think making the ping timeout configurable will introduce any new potential errors. Rather, it would just be advisable to keep the zk session timeouts reasonably small so that the zk session is likely to be healthy whenever a slave needs to detect a new leader. We could introduce validation that the zk_session_timeout is shorter than the (total) ping timeout, but I'm not sure that's even necessary. Does this make sense? I'm no ZK expert, so I defer to those with more experience. - Adam --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29507/#review89305 --- On June 26, 2015, 3:12 a.m., Adam B wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29507/ --- (Updated June 26, 2015, 3:12 a.m.) Review request for mesos, Ben Mahler and Niklas Nielsen. Bugs: MESOS-2110 https://issues.apache.org/jira/browse/MESOS-2110 Repository: mesos Description --- Added new --slave_ping_timeout and --max_slave_ping_timeouts flags to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs) and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5). These can be extended if slaves are expected/allowed to be down for longer than a minute or two. Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage. Beware that this affects recovery from network timeouts as well as actual slave node/process failover. Also fixed the log message in recoveredSlavesTimeout() to correctly reference flags.slave_reregister_timeout instead of the unrelated ping timeouts. Diffs - docs/configuration.md aaf65bf docs/upgrades.md 73e503c src/master/constants.hpp 072d59c src/master/constants.cpp 997b792 src/master/flags.hpp 55ed3a9 src/master/flags.cpp 4377715 src/master/master.cpp 0782b54 src/messages/messages.proto 1c8d79e src/slave/constants.hpp 84927e5 src/slave/constants.cpp d8d2f98 src/slave/slave.hpp f1cf3b8 src/slave/slave.cpp b3e1ccc src/tests/partition_tests.cpp f7ee3ab src/tests/slave_recovery_tests.cpp c036e9c src/tests/slave_tests.cpp e9002e8 Diff: https://reviews.apache.org/r/29507/diff/ Testing --- Manually tested slave failover/shutdown with master using different --slave_ping_timeout and --max_slave_ping_timeouts. Ran unit tests with shorter non-default values for ping timeouts. `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and ShortPingTimeoutUnreachableSlave Thanks, Adam B
Re: Review Request 29507: Added Configurable Slave Ping Timeouts
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29507/#review89482 --- Patch looks great! Reviews applied: [29507] All tests passed. - Mesos ReviewBot On June 26, 2015, 10:12 a.m., Adam B wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29507/ --- (Updated June 26, 2015, 10:12 a.m.) Review request for mesos, Ben Mahler and Niklas Nielsen. Bugs: MESOS-2110 https://issues.apache.org/jira/browse/MESOS-2110 Repository: mesos Description --- Added new --slave_ping_timeout and --max_slave_ping_timeouts flags to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs) and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5). These can be extended if slaves are expected/allowed to be down for longer than a minute or two. Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage. Beware that this affects recovery from network timeouts as well as actual slave node/process failover. Also fixed the log message in recoveredSlavesTimeout() to correctly reference flags.slave_reregister_timeout instead of the unrelated ping timeouts. Diffs - docs/configuration.md aaf65bf docs/upgrades.md 73e503c src/master/constants.hpp 072d59c src/master/constants.cpp 997b792 src/master/flags.hpp 55ed3a9 src/master/flags.cpp 4377715 src/master/master.cpp 0782b54 src/messages/messages.proto 1c8d79e src/slave/constants.hpp 84927e5 src/slave/constants.cpp d8d2f98 src/slave/slave.hpp f1cf3b8 src/slave/slave.cpp b3e1ccc src/tests/partition_tests.cpp f7ee3ab src/tests/slave_recovery_tests.cpp c036e9c src/tests/slave_tests.cpp e9002e8 Diff: https://reviews.apache.org/r/29507/diff/ Testing --- Manually tested slave failover/shutdown with master using different --slave_ping_timeout and --max_slave_ping_timeouts. Ran unit tests with shorter non-default values for ping timeouts. `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and ShortPingTimeoutUnreachableSlave Thanks, Adam B
Re: Review Request 29507: Added Configurable Slave Ping Timeouts
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29507/#review89305 --- Actually, we should think about one more thing, how does this interact with the zookeeper session timeout? - Ben Mahler On June 22, 2015, 10:03 a.m., Adam B wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29507/ --- (Updated June 22, 2015, 10:03 a.m.) Review request for mesos, Ben Mahler and Niklas Nielsen. Bugs: MESOS-2110 https://issues.apache.org/jira/browse/MESOS-2110 Repository: mesos Description --- Added new --slave_ping_timeout and --max_slave_ping_timeouts flags to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs) and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5). These can be extended if slaves are expected/allowed to be down for longer than a minute or two. Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage. Beware that this affects recovery from network timeouts as well as actual slave node/process failover. Also fixed the log message in recoveredSlavesTimeout() to correctly reference flags.slave_reregister_timeout instead of the unrelated ping timeouts. Diffs - docs/configuration.md aaf65bf docs/upgrades.md 355307a src/master/constants.hpp 072d59c src/master/constants.cpp 997b792 src/master/flags.hpp 55ed3a9 src/master/flags.cpp 4377715 src/master/master.cpp 0135c15 src/messages/messages.proto 1c8d79e src/slave/constants.hpp 84927e5 src/slave/constants.cpp d8d2f98 src/slave/slave.hpp f1cf3b8 src/slave/slave.cpp 40c0c33 src/tests/partition_tests.cpp f7ee3ab src/tests/slave_recovery_tests.cpp c036e9c src/tests/slave_tests.cpp 5030198 Diff: https://reviews.apache.org/r/29507/diff/ Testing --- Manually tested slave failover/shutdown with master using different --slave_ping_timeout and --max_slave_ping_timeouts. Ran unit tests with shorter non-default values for ping timeouts. `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and ShortPingTimeoutUnreachableSlave Thanks, Adam B
Re: Review Request 29507: Added Configurable Slave Ping Timeouts
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29507/#review89243 --- Ship it! Ship It! - Niklas Nielsen On June 22, 2015, 3:03 a.m., Adam B wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29507/ --- (Updated June 22, 2015, 3:03 a.m.) Review request for mesos, Ben Mahler and Niklas Nielsen. Bugs: MESOS-2110 https://issues.apache.org/jira/browse/MESOS-2110 Repository: mesos Description --- Added new --slave_ping_timeout and --max_slave_ping_timeouts flags to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs) and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5). These can be extended if slaves are expected/allowed to be down for longer than a minute or two. Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage. Beware that this affects recovery from network timeouts as well as actual slave node/process failover. Also fixed the log message in recoveredSlavesTimeout() to correctly reference flags.slave_reregister_timeout instead of the unrelated ping timeouts. Diffs - docs/configuration.md aaf65bf docs/upgrades.md 355307a src/master/constants.hpp 072d59c src/master/constants.cpp 997b792 src/master/flags.hpp 55ed3a9 src/master/flags.cpp 4377715 src/master/master.cpp 0135c15 src/messages/messages.proto 1c8d79e src/slave/constants.hpp 84927e5 src/slave/constants.cpp d8d2f98 src/slave/slave.hpp f1cf3b8 src/slave/slave.cpp 40c0c33 src/tests/partition_tests.cpp f7ee3ab src/tests/slave_recovery_tests.cpp c036e9c src/tests/slave_tests.cpp 5030198 Diff: https://reviews.apache.org/r/29507/diff/ Testing --- Manually tested slave failover/shutdown with master using different --slave_ping_timeout and --max_slave_ping_timeouts. Ran unit tests with shorter non-default values for ping timeouts. `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and ShortPingTimeoutUnreachableSlave Thanks, Adam B
Re: Review Request 29507: Added Configurable Slave Ping Timeouts
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29507/#review88761 --- Patch looks great! Reviews applied: [29507] All tests passed. - Mesos ReviewBot On June 22, 2015, 10:03 a.m., Adam B wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29507/ --- (Updated June 22, 2015, 10:03 a.m.) Review request for mesos, Ben Mahler and Niklas Nielsen. Bugs: MESOS-2110 https://issues.apache.org/jira/browse/MESOS-2110 Repository: mesos Description --- Added new --slave_ping_timeout and --max_slave_ping_timeouts flags to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs) and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5). These can be extended if slaves are expected/allowed to be down for longer than a minute or two. Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage. Beware that this affects recovery from network timeouts as well as actual slave node/process failover. Also fixed the log message in recoveredSlavesTimeout() to correctly reference flags.slave_reregister_timeout instead of the unrelated ping timeouts. Diffs - docs/configuration.md aaf65bf docs/upgrades.md 355307a src/master/constants.hpp 072d59c src/master/constants.cpp 997b792 src/master/flags.hpp 55ed3a9 src/master/flags.cpp 4377715 src/master/master.cpp 0135c15 src/messages/messages.proto 1c8d79e src/slave/constants.hpp 84927e5 src/slave/constants.cpp d8d2f98 src/slave/slave.hpp f1cf3b8 src/slave/slave.cpp 40c0c33 src/tests/partition_tests.cpp f7ee3ab src/tests/slave_recovery_tests.cpp c036e9c src/tests/slave_tests.cpp 5030198 Diff: https://reviews.apache.org/r/29507/diff/ Testing --- Manually tested slave failover/shutdown with master using different --slave_ping_timeout and --max_slave_ping_timeouts. Ran unit tests with shorter non-default values for ping timeouts. `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and ShortPingTimeoutUnreachableSlave Thanks, Adam B
Re: Review Request 29507: Added Configurable Slave Ping Timeouts
On June 1, 2015, 1:15 p.m., Niklas Nielsen wrote: LGTM - Would it make sense to have sane min/max values for the timeouts/counts? I wonder it would make sense to have a test to exercise an upgrade path (the timeout being different in the slaves, than in the master). Maybe I missed that in your first test. Added flag validation using the cool new lambda style from https://reviews.apache.org/r/34943 It's not so easy to test the upgrade path in a unit test, since as soon as a slave from this source base gets its SlaveRegistered ack from the master, it then has the same timeout as the master (as designed). And if the slave never receives its registered ack, it will retry registration after a /random/ backoff initially capped at 1min (--registration_backoff_factor). If the random backoff_factor ends up being longer than the slave's ping timeout, when the ping timeout expires, the slave will anyway re-detect the master and retry registration. As I revisit the tests (in a follow-up patch), I will brainstorm ways to test the upgrade path. On June 1, 2015, 1:15 p.m., Niklas Nielsen wrote: src/messages/messages.proto, line 279 https://reviews.apache.org/r/29507/diff/7/?file=973883#file973883line279 Why a zero default? Originally added as a way for the recipient of this message to know that the sender didn't set it, presumably because the sender is pre-0.23. Since we don't have a way for install handlers to translate optional fields into Options instead of just pulling in the default value, I couldn't think of a cleaner option. BenM, however, suggested pushing the ping timeout into a nested submessage, so I could just check has_ping_timeout_seconds() instead. Went with that instead. On June 1, 2015, 1:15 p.m., Niklas Nielsen wrote: src/slave/slave.cpp, line 864 https://reviews.apache.org/r/29507/diff/7/?file=973887#file973887line864 If we removed the 0 default, we'd do: if (ping_timeout_seconds.isSome()) { // ... } Which seems a bit more logical to me :) Unfortunately, we don't currently translate optional protobuf fields into Options in install handlers, so I went for an explicit default. BenM suggested using a nested message so I can check `if (connection.has_ping_timeout_seconds())` instead. Much more readable. On June 1, 2015, 1:15 p.m., Niklas Nielsen wrote: docs/configuration.md, line 339 https://reviews.apache.org/r/29507/diff/7/?file=973876#file973876line339 As this will be markdown, how about formatting max_slave_ping_timeouts in `` or as --max_slave_ping_timeouts ? Fixed. - Adam --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29507/#review86039 --- On May 28, 2015, 4:13 p.m., Adam B wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29507/ --- (Updated May 28, 2015, 4:13 p.m.) Review request for mesos, Ben Mahler and Niklas Nielsen. Bugs: MESOS-2110 https://issues.apache.org/jira/browse/MESOS-2110 Repository: mesos Description --- Added new --slave_ping_timeout and --max_slave_ping_timeouts flags to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs) and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5). These can be extended if slaves are expected/allowed to be down for longer than a minute or two. Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage. Beware that this affects recovery from network timeouts as well as actual slave node/process failover. Also fixed the log message in recoveredSlavesTimeout() to correctly reference flags.slave_reregister_timeout instead of the unrelated ping timeouts. Diffs - docs/configuration.md 5a41477 docs/upgrades.md 355307a src/master/constants.hpp 57cf8fb src/master/constants.cpp 8c7174a src/master/flags.hpp 84fa238 src/master/flags.cpp 49d953a src/master/master.cpp 1526f59 src/messages/messages.proto 39dac72 src/slave/constants.hpp 206d439 src/slave/constants.cpp d8d2f98 src/slave/slave.hpp 0207eaf src/slave/slave.cpp b4d2029 src/tests/partition_tests.cpp f7ee3ab src/tests/slave_recovery_tests.cpp c036e9c src/tests/slave_tests.cpp acae497 Diff: https://reviews.apache.org/r/29507/diff/ Testing --- Manually tested slave failover/shutdown with master using different --slave_ping_timeout and --max_slave_ping_timeouts. Ran unit tests with shorter non-default values for ping timeouts. `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and ShortPingTimeoutUnreachableSlave Thanks, Adam B
Re: Review Request 29507: Added Configurable Slave Ping Timeouts
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29507/ --- (Updated June 22, 2015, 3:03 a.m.) Review request for mesos, Ben Mahler and Niklas Nielsen. Changes --- Updated with Niklas and BenM's suggestions. Excluding new unit tests; will post subsequent review with them. Bugs: MESOS-2110 https://issues.apache.org/jira/browse/MESOS-2110 Repository: mesos Description --- Added new --slave_ping_timeout and --max_slave_ping_timeouts flags to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs) and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5). These can be extended if slaves are expected/allowed to be down for longer than a minute or two. Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage. Beware that this affects recovery from network timeouts as well as actual slave node/process failover. Also fixed the log message in recoveredSlavesTimeout() to correctly reference flags.slave_reregister_timeout instead of the unrelated ping timeouts. Diffs (updated) - docs/configuration.md aaf65bf docs/upgrades.md 355307a src/master/constants.hpp 072d59c src/master/constants.cpp 997b792 src/master/flags.hpp 55ed3a9 src/master/flags.cpp 4377715 src/master/master.cpp 0135c15 src/messages/messages.proto 1c8d79e src/slave/constants.hpp 84927e5 src/slave/constants.cpp d8d2f98 src/slave/slave.hpp f1cf3b8 src/slave/slave.cpp 40c0c33 src/tests/partition_tests.cpp f7ee3ab src/tests/slave_recovery_tests.cpp c036e9c src/tests/slave_tests.cpp 5030198 Diff: https://reviews.apache.org/r/29507/diff/ Testing --- Manually tested slave failover/shutdown with master using different --slave_ping_timeout and --max_slave_ping_timeouts. Ran unit tests with shorter non-default values for ping timeouts. `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and ShortPingTimeoutUnreachableSlave Thanks, Adam B
Re: Review Request 29507: Added Configurable Slave Ping Timeouts
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29507/#review86039 --- LGTM - Would it make sense to have sane min/max values for the timeouts/counts? I wonder it would make sense to have a test to exercise an upgrade path (the timeout being different in the slaves, than in the master). Maybe I missed that in your first test. docs/configuration.md https://reviews.apache.org/r/29507/#comment137889 As this will be markdown, how about formatting max_slave_ping_timeouts in `` or as --max_slave_ping_timeouts ? docs/upgrades.md https://reviews.apache.org/r/29507/#comment137891 You shouldn't need this line - look at line 9 :) src/master/master.cpp https://reviews.apache.org/r/29507/#comment137899 You should be able to const this, right? Here and below src/messages/messages.proto https://reviews.apache.org/r/29507/#comment137902 Why a zero default? src/slave/slave.hpp https://reviews.apache.org/r/29507/#comment137903 We should Camel case here, right? src/slave/slave.hpp https://reviews.apache.org/r/29507/#comment137904 Same here src/slave/slave.cpp https://reviews.apache.org/r/29507/#comment137905 Camel case? src/slave/slave.cpp https://reviews.apache.org/r/29507/#comment137906 If we removed the 0 default, we'd do: if (ping_timeout_seconds.isSome()) { // ... } Which seems a bit more logical to me :) src/tests/partition_tests.cpp https://reviews.apache.org/r/29507/#comment137936 What do you think about commenting on why you chose these dividing factors (Why 3, Why 2? What will these do compared to the original timeouts?) src/tests/partition_tests.cpp https://reviews.apache.org/r/29507/#comment137931 const? src/tests/partition_tests.cpp https://reviews.apache.org/r/29507/#comment137938 newline? src/tests/partition_tests.cpp https://reviews.apache.org/r/29507/#comment137937 Newline? - Niklas Nielsen On May 28, 2015, 4:13 p.m., Adam B wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29507/ --- (Updated May 28, 2015, 4:13 p.m.) Review request for mesos, Ben Mahler and Niklas Nielsen. Bugs: MESOS-2110 https://issues.apache.org/jira/browse/MESOS-2110 Repository: mesos Description --- Added new --slave_ping_timeout and --max_slave_ping_timeouts flags to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs) and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5). These can be extended if slaves are expected/allowed to be down for longer than a minute or two. Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage. Beware that this affects recovery from network timeouts as well as actual slave node/process failover. Also fixed the log message in recoveredSlavesTimeout() to correctly reference flags.slave_reregister_timeout instead of the unrelated ping timeouts. Diffs - docs/configuration.md 5a41477 docs/upgrades.md 355307a src/master/constants.hpp 57cf8fb src/master/constants.cpp 8c7174a src/master/flags.hpp 84fa238 src/master/flags.cpp 49d953a src/master/master.cpp 1526f59 src/messages/messages.proto 39dac72 src/slave/constants.hpp 206d439 src/slave/constants.cpp d8d2f98 src/slave/slave.hpp 0207eaf src/slave/slave.cpp b4d2029 src/tests/partition_tests.cpp f7ee3ab src/tests/slave_recovery_tests.cpp c036e9c src/tests/slave_tests.cpp acae497 Diff: https://reviews.apache.org/r/29507/diff/ Testing --- Manually tested slave failover/shutdown with master using different --slave_ping_timeout and --max_slave_ping_timeouts. Ran unit tests with shorter non-default values for ping timeouts. `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and ShortPingTimeoutUnreachableSlave Thanks, Adam B
Re: Review Request 29507: Added Configurable Slave Ping Timeouts
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29507/ --- (Updated May 28, 2015, 4:13 p.m.) Review request for mesos, Ben Mahler and Niklas Nielsen. Changes --- Updated diff with minor feedback, updated description to mention slave_reregister_timeout fix. Bugs: MESOS-2110 https://issues.apache.org/jira/browse/MESOS-2110 Repository: mesos Description (updated) --- Added new --slave_ping_timeout and --max_slave_ping_timeouts flags to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs) and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5). These can be extended if slaves are expected/allowed to be down for longer than a minute or two. Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage. Beware that this affects recovery from network timeouts as well as actual slave node/process failover. Also fixed the log message in recoveredSlavesTimeout() to correctly reference flags.slave_reregister_timeout instead of the unrelated ping timeouts. Diffs (updated) - docs/configuration.md 5a41477 docs/upgrades.md 355307a src/master/constants.hpp 57cf8fb src/master/constants.cpp 8c7174a src/master/flags.hpp 84fa238 src/master/flags.cpp 49d953a src/master/master.cpp 1526f59 src/messages/messages.proto 39dac72 src/slave/constants.hpp 206d439 src/slave/constants.cpp d8d2f98 src/slave/slave.hpp 0207eaf src/slave/slave.cpp b4d2029 src/tests/partition_tests.cpp f7ee3ab src/tests/slave_recovery_tests.cpp c036e9c src/tests/slave_tests.cpp acae497 Diff: https://reviews.apache.org/r/29507/diff/ Testing --- Manually tested slave failover/shutdown with master using different --slave_ping_timeout and --max_slave_ping_timeouts. Ran unit tests with shorter non-default values for ping timeouts. `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and ShortPingTimeoutUnreachableSlave Thanks, Adam B
Re: Review Request 29507: Added Configurable Slave Ping Timeouts
On May 14, 2015, 6:07 a.m., Alexander Rukletsov wrote: src/master/master.cpp, line 1315 https://reviews.apache.org/r/29507/diff/6/?file=959052#file959052line1315 This looks like a drive-by bug fix, should it be included in this diff? At least, let's mention it in the description. Had to modify it when I was changing the constants, and upon closer inspection it was logging the wrong timeout anyway. Mentioned it in the description now. - Adam --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29507/#review83758 --- On May 14, 2015, 3:01 a.m., Adam B wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29507/ --- (Updated May 14, 2015, 3:01 a.m.) Review request for mesos, Ben Mahler and Niklas Nielsen. Bugs: MESOS-2110 https://issues.apache.org/jira/browse/MESOS-2110 Repository: mesos Description --- Added new --slave_ping_timeout and --max_slave_ping_timeouts flags to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs) and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5). These can be extended if slaves are expected/allowed to be down for longer than a minute or two. Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage. Beware that this affects recovery from network timeouts as well as actual slave node/process failover. Diffs - docs/configuration.md 54c4e31 docs/upgrades.md 2a15694 src/master/constants.hpp c386eab src/master/constants.cpp 9ee17e9 src/master/flags.hpp 996cf38 src/master/flags.cpp 5798989 src/master/master.cpp eaea79d src/messages/messages.proto 19e2444 src/slave/constants.hpp df02043 src/slave/constants.cpp 07f699a src/slave/slave.hpp b62ed7b src/slave/slave.cpp 132f83e src/tests/partition_tests.cpp f7ee3ab src/tests/slave_recovery_tests.cpp c036e9c src/tests/slave_tests.cpp acae497 Diff: https://reviews.apache.org/r/29507/diff/ Testing --- Manually tested slave failover/shutdown with master using different --slave_ping_timeout and --max_slave_ping_timeouts. Ran unit tests with shorter non-default values for ping timeouts. `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and ShortPingTimeoutUnreachableSlave Thanks, Adam B
Re: Review Request 29507: Added Configurable Slave Ping Timeouts
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29507/#review85652 --- Patch looks great! Reviews applied: [29507] All tests passed. - Mesos ReviewBot On May 28, 2015, 11:13 p.m., Adam B wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29507/ --- (Updated May 28, 2015, 11:13 p.m.) Review request for mesos, Ben Mahler and Niklas Nielsen. Bugs: MESOS-2110 https://issues.apache.org/jira/browse/MESOS-2110 Repository: mesos Description --- Added new --slave_ping_timeout and --max_slave_ping_timeouts flags to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs) and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5). These can be extended if slaves are expected/allowed to be down for longer than a minute or two. Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage. Beware that this affects recovery from network timeouts as well as actual slave node/process failover. Also fixed the log message in recoveredSlavesTimeout() to correctly reference flags.slave_reregister_timeout instead of the unrelated ping timeouts. Diffs - docs/configuration.md 5a41477 docs/upgrades.md 355307a src/master/constants.hpp 57cf8fb src/master/constants.cpp 8c7174a src/master/flags.hpp 84fa238 src/master/flags.cpp 49d953a src/master/master.cpp 1526f59 src/messages/messages.proto 39dac72 src/slave/constants.hpp 206d439 src/slave/constants.cpp d8d2f98 src/slave/slave.hpp 0207eaf src/slave/slave.cpp b4d2029 src/tests/partition_tests.cpp f7ee3ab src/tests/slave_recovery_tests.cpp c036e9c src/tests/slave_tests.cpp acae497 Diff: https://reviews.apache.org/r/29507/diff/ Testing --- Manually tested slave failover/shutdown with master using different --slave_ping_timeout and --max_slave_ping_timeouts. Ran unit tests with shorter non-default values for ping timeouts. `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and ShortPingTimeoutUnreachableSlave Thanks, Adam B
Re: Review Request 29507: Added Configurable Slave Ping Timeouts
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29507/#review84726 --- Hey Adam, is this ready for review? - Niklas Nielsen On May 14, 2015, 3:01 a.m., Adam B wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29507/ --- (Updated May 14, 2015, 3:01 a.m.) Review request for mesos, Ben Mahler and Niklas Nielsen. Bugs: MESOS-2110 https://issues.apache.org/jira/browse/MESOS-2110 Repository: mesos Description --- Added new --slave_ping_timeout and --max_slave_ping_timeouts flags to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs) and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5). These can be extended if slaves are expected/allowed to be down for longer than a minute or two. Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage. Beware that this affects recovery from network timeouts as well as actual slave node/process failover. Diffs - docs/configuration.md 54c4e31 docs/upgrades.md 2a15694 src/master/constants.hpp c386eab src/master/constants.cpp 9ee17e9 src/master/flags.hpp 996cf38 src/master/flags.cpp 5798989 src/master/master.cpp eaea79d src/messages/messages.proto 19e2444 src/slave/constants.hpp df02043 src/slave/constants.cpp 07f699a src/slave/slave.hpp b62ed7b src/slave/slave.cpp 132f83e src/tests/partition_tests.cpp f7ee3ab src/tests/slave_recovery_tests.cpp c036e9c src/tests/slave_tests.cpp acae497 Diff: https://reviews.apache.org/r/29507/diff/ Testing --- Manually tested slave failover/shutdown with master using different --slave_ping_timeout and --max_slave_ping_timeouts. Ran unit tests with shorter non-default values for ping timeouts. `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and ShortPingTimeoutUnreachableSlave Thanks, Adam B
Re: Review Request 29507: Added Configurable Slave Ping Timeouts
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29507/ --- (Updated May 14, 2015, 1:54 a.m.) Review request for mesos, Ben Mahler and Niklas Nielsen. Changes --- Rebased and added configuration/upgrade documentation. Bugs: MESOS-2110 https://issues.apache.org/jira/browse/MESOS-2110 Repository: mesos Description --- Added new --slave_ping_timeout and --max_slave_ping_timeouts flags to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs) and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5). These can be extended if slaves are expected/allowed to be down for longer than a minute or two. Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage. Beware that this affects recovery from network timeouts as well as actual slave node/process failover. Diffs (updated) - docs/configuration.md 54c4e31 docs/upgrades.md 2a15694 src/master/constants.hpp c386eab src/master/constants.cpp 9ee17e9 src/master/flags.hpp 996cf38 src/master/flags.cpp 5798989 src/master/master.cpp ec32cd6 src/messages/messages.proto 98d859f src/slave/constants.hpp fd1c1ab src/slave/constants.cpp 2a99b11 src/slave/slave.hpp adb52b5 src/slave/slave.cpp 1b17441 src/tests/fault_tolerance_tests.cpp 3a27d82 src/tests/partition_tests.cpp 1018e47 src/tests/slave_recovery_tests.cpp c036e9c src/tests/slave_tests.cpp 04e79ec Diff: https://reviews.apache.org/r/29507/diff/ Testing --- Manually tested slave failover/shutdown with master using different --slave_ping_timeout and --max_slave_ping_timeouts. Ran unit tests with shorter non-default values for ping timeouts. `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and ShortPingTimeoutUnreachableSlave Thanks, Adam B
Re: Review Request 29507: Added Configurable Slave Ping Timeouts
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29507/ --- (Updated May 14, 2015, 3:01 a.m.) Review request for mesos, Ben Mahler and Niklas Nielsen. Changes --- Rebased Bugs: MESOS-2110 https://issues.apache.org/jira/browse/MESOS-2110 Repository: mesos Description --- Added new --slave_ping_timeout and --max_slave_ping_timeouts flags to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs) and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5). These can be extended if slaves are expected/allowed to be down for longer than a minute or two. Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage. Beware that this affects recovery from network timeouts as well as actual slave node/process failover. Diffs (updated) - docs/configuration.md 54c4e31 docs/upgrades.md 2a15694 src/master/constants.hpp c386eab src/master/constants.cpp 9ee17e9 src/master/flags.hpp 996cf38 src/master/flags.cpp 5798989 src/master/master.cpp eaea79d src/messages/messages.proto 19e2444 src/slave/constants.hpp df02043 src/slave/constants.cpp 07f699a src/slave/slave.hpp b62ed7b src/slave/slave.cpp 132f83e src/tests/partition_tests.cpp f7ee3ab src/tests/slave_recovery_tests.cpp c036e9c src/tests/slave_tests.cpp acae497 Diff: https://reviews.apache.org/r/29507/diff/ Testing --- Manually tested slave failover/shutdown with master using different --slave_ping_timeout and --max_slave_ping_timeouts. Ran unit tests with shorter non-default values for ping timeouts. `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and ShortPingTimeoutUnreachableSlave Thanks, Adam B