Hi,

I am trying to bring the current failover time below 10 seconds, but I can't seem to find the right set of parameters. Currently, it takes around a minute and a half for the master to detect that a slave has gone offline, which seems to correspond to slave_ping_timeout (15 secs) x max_slave_ping_timeouts (5) = 75 secs. However, I can't find these parameters in mesos-master:

# mesos-master --version
mesos 0.22.1

# mesos-master --help
Usage: mesos-master [...]

Supported options:
  --acls=VALUE
      The value could be a JSON formatted string of ACLs or a file path containing the JSON formatted ACLs used for authorization. Path could be of the form 'file:///path/to/file' or '/path/to/file'. See the ACLs protobuf in mesos.proto for the expected format.
      Example:
      {
        "register_frameworks": [
          {
            "principals": { "type": "ANY" },
            "roles": { "values": ["a"] }
          }
        ],
        "run_tasks": [
          {
            "principals": { "values": ["a", "b"] },
            "users": { "values": ["c"] }
          }
        ],
        "shutdown_frameworks": [
          {
            "principals": { "values": ["a", "b"] },
            "framework_principals": { "values": ["c"] }
          }
        ]
      }
  --allocation_interval=VALUE
      Amount of time to wait between performing (batch) allocations (e.g., 500ms, 1sec, etc). (default: 1secs)
  --[no-]authenticate
      If authenticate is 'true' only authenticated frameworks are allowed to register. If 'false' unauthenticated frameworks are also allowed to register. (default: false)
  --[no-]authenticate_slaves
      If 'true' only authenticated slaves are allowed to register. If 'false' unauthenticated slaves are also allowed to register. (default: false)
  --authenticators=VALUE
      Authenticator implementation to use when authenticating frameworks and/or slaves. Use the default 'crammd5', or load an alternate authenticator module using --modules. (default: crammd5)
  --cluster=VALUE
      Human readable name for the cluster, displayed in the webui.
  --credentials=VALUE
      Either a path to a text file with a list of credentials, each line containing 'principal' and 'secret' separated by whitespace, or, a path to a JSON-formatted file containing credentials. Path could be of the form 'file:///path/to/file' or '/path/to/file'.
      JSON file Example:
      {
        "credentials": [
          {
            "principal": "sherman",
            "secret": "kitesurf",
          }
        ]
      }
      Text file Example:
      username secret
  --external_log_file=VALUE
      Specified the externally managed log file. This file will be exposed in the webui and HTTP api.
      This is useful when using stderr logging as the log file is otherwise unknown to Mesos.
  --framework_sorter=VALUE
      Policy to use for allocating resources between a given user's frameworks. Options are the same as for user_allocator. (default: drf)
  --[no-]help
      Prints this help message (default: false)
  --hooks=VALUE
      A comma separated list of hook modules to be installed inside master.
  --hostname=VALUE
      The hostname the master should advertise in ZooKeeper. If left unset, the hostname is resolved from the IP address that the master binds to.
  --[no-]initialize_driver_logging
      Whether to automatically initialize google logging of scheduler and/or executor drivers. (default: true)
  --ip=VALUE
      IP address to listen on
  --[no-]log_auto_initialize
      Whether to automatically initialize the replicated log used for the registry. If this is set to false, the log has to be manually initialized when used for the very first time. (default: true)
  --log_dir=VALUE
      Directory path to put log files (no default, nothing is written to disk unless specified; does not affect logging to stderr). NOTE: 3rd party log messages (e.g. ZooKeeper) are only written to stderr!
  --logbufsecs=VALUE
      How many seconds to buffer log messages for (default: 0)
  --logging_level=VALUE
      Log message at or above this level; possible values: 'INFO', 'WARNING', 'ERROR'; if quiet flag is used, this will affect just the logs from log_dir (if specified) (default: INFO)
  --modules=VALUE
      List of modules to be loaded and be available to the internal subsystems. Use --modules=filepath to specify the list of modules via a file containing a JSON formatted string. 'filepath' can be of the form 'file:///path/to/file' or '/path/to/file'. Use --modules="{...}" to specify the list of modules inline.
      Example:
      {
        "libraries": [
          {
            "file": "/path/to/libfoo.so",
            "modules": [
              {
                "name": "org_apache_mesos_bar",
                "parameters": [
                  { "key": "X", "value": "Y" }
                ]
              },
              { "name": "org_apache_mesos_baz" }
            ]
          },
          {
            "name": "qux",
            "modules": [
              { "name": "org_apache_mesos_norf" }
            ]
          }
        ]
      }
  --offer_timeout=VALUE
      Duration of time before an offer is rescinded from a framework. This helps fairness when running frameworks that hold on to offers, or frameworks that accidentally drop offers.
  --port=VALUE
      Port to listen on (default: 5050)
  --[no-]quiet
      Disable logging to stderr (default: false)
  --quorum=VALUE
      The size of the quorum of replicas when using 'replicated_log' based registry. It is imperative to set this value to be a majority of masters i.e., quorum > (number of masters)/2.
  --rate_limits=VALUE
      The value could be a JSON formatted string of rate limits or a file path containing the JSON formatted rate limits used for framework rate limiting. Path could be of the form 'file:///path/to/file' or '/path/to/file'. See the RateLimits protobuf in mesos.proto for the expected format.
      Example:
      {
        "limits": [
          { "principal": "foo", "qps": 55.5 },
          { "principal": "bar" }
        ],
        "aggregate_default_qps": 33.3
      }
  --recovery_slave_removal_limit=VALUE
      For failovers, limit on the percentage of slaves that can be removed from the registry *and* shutdown after the re-registration timeout elapses. If the limit is exceeded, the master will fail over rather than remove the slaves. This can be used to provide safety guarantees for production environments. Production environments may expect that across Master failovers, at most a certain percentage of slaves will fail permanently (e.g. due to rack-level failures). Setting this limit would ensure that a human needs to get involved should an unexpected widespread failure of slaves occur in the cluster. Values: [0%-100%] (default: 100%)
  --registry=VALUE
      Persistence strategy for the registry; available options are 'replicated_log', 'in_memory' (for testing).
      (default: replicated_log)
  --registry_fetch_timeout=VALUE
      Duration of time to wait in order to fetch data from the registry after which the operation is considered a failure. (default: 1mins)
  --registry_store_timeout=VALUE
      Duration of time to wait in order to store data in the registry after which the operation is considered a failure. (default: 5secs)
  --[no-]registry_strict
      Whether the Master will take actions based on the persistent information stored in the Registry. Setting this to false means that the Registrar will never reject the admission, readmission, or removal of a slave. Consequently, 'false' can be used to bootstrap the persistent state on a running cluster. NOTE: This flag is *experimental* and should not be used in production yet. (default: false)
  --roles=VALUE
      A comma separated list of the allocation roles that frameworks in this cluster may belong to.
  --[no-]root_submissions
      Can root submit frameworks? (default: true)
  --slave_removal_rate_limit=VALUE
      The maximum rate (e.g., 1/10mins, 2/3hrs, etc) at which slaves will be removed from the master when they fail health checks. By default slaves will be removed as soon as they fail the health checks. The value is of the form <Number of slaves>/<Duration>.
  --slave_reregister_timeout=VALUE
      The timeout within which all slaves are expected to re-register when a new master is elected as the leader. Slaves that do not re-register within the timeout will be removed from the registry and will be shutdown if they attempt to communicate with master. NOTE: This value has to be atleast 10mins. (default: 10mins)
  --user_sorter=VALUE
      Policy to use for allocating resources between users. May be one of: dominant_resource_fairness (drf) (default: drf)
  --[no-]version
      Show version and exit. (default: false)
  --webui_dir=VALUE
      Directory path of the webui files/assets (default: /usr/share/mesos/webui)
  --weights=VALUE
      A comma separated list of role/weight pairs of the form 'role=weight,role=weight'.
      Weights are used to indicate forms of priority.
  --whitelist=VALUE
      Path to a file with a list of slaves (one per line) to advertise offers for. Path could be of the form 'file:///path/to/file' or '/path/to/file'.
  --work_dir=VALUE
      Directory path to store the persistent information stored in the Registry. (example: /var/lib/mesos/master)
  --zk=VALUE
      ZooKeeper URL (used for leader election amongst masters)
      May be one of:
        zk://host1:port1,host2:port2,.../path
        zk://username:password@host1:port1,host2:port2,.../path
        file:///path/to/file (where file contains one of the above)
  --zk_session_timeout=VALUE
      ZooKeeper session timeout. (default: 10secs)

Furthermore, setting these parameters, either in /etc/mesos-master/ or inline, generates the following error:

# /usr/sbin/mesos-master --zk=zk://10.40.50.228:2181/mesos --port=5050 --log_dir=/var/log/mesos --hostname=10.40.50.228 --ip=10.40.50.228 --quorum=1 --work_dir=/var/lib/mesos --max_slave_ping_timeouts=2
Failed to load unknown flag 'max_slave_ping_timeouts'
Usage: mesos-master [...]

Supported options:
  --acls=VALUE The valu ...

Any thoughts?

Cheers,
Nastooh Avessta
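
For reference, the arithmetic behind the observed delay can be sketched as follows. This assumes the master declares a slave lost after max_slave_ping_timeouts consecutive missed pings, each waited on for slave_ping_timeout, which matches the roughly minute-and-a-half delay seen here. The function name and the candidate values are hypothetical and only illustrate the calculation; they are not taken from Mesos, and this does not show that mesos-master 0.22.1 accepts either setting.

```python
def detection_time_secs(ping_timeout_secs, max_ping_timeouts):
    """Time before a silent slave is declared lost, under the assumed model:
    every missed ping costs one full ping timeout."""
    return ping_timeout_secs * max_ping_timeouts


# Values that appear to be in effect today (15 secs, 5 missed pings).
current = detection_time_secs(15, 5)
print(f"current detection time: {current} secs")

# Hypothetical combinations that would stay under the 10 second target.
target_secs = 10
candidates = [
    (timeout, count)
    for timeout in (1, 2, 3, 5)
    for count in (2, 3, 5)
    if detection_time_secs(timeout, count) < target_secs
]
for timeout, count in candidates:
    print(f"timeout={timeout} secs, max timeouts={count} -> "
          f"{detection_time_secs(timeout, count)} secs")
```

Under this model, both values would have to drop well below the ones that seem to be in effect for detection to land under 10 seconds.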