Would the following process enable a zero-downtime upgrade of Mesos (0.19 to 0.25) in an existing Mesos cluster?

0. From [1], there do not seem to be any incompatible changes introduced between 0.19 and 0.25.
1. Deploy the Mesos (0.25) binaries to the non-leading master nodes.
2. Deploy the Mesos (0.25) binaries to the leading master. This should result in a master re-election that elects a master which already has Mesos (0.25) installed from step 1.
3. Deploy the Mesos (0.25) binaries to the Mesos slave nodes. Existing tasks should continue to execute and report to the master once the Mesos process is relaunched (with the 0.25 binaries) on the slave node.

Known gotchas:
1. Any monitoring built around state.json and stats.json should be updated accordingly, as the endpoints have changed [1].
2. Checkpointing should be enabled (it is not enabled automatically in 0.19) [2].
3. recovery_timeout for the slave nodes should be set to an appropriate value, depending on how long it takes to install Mesos (0.25) on the slave nodes.

Is any step missing from the upgrade process? Are there other gotchas that one needs to be aware of?

[1] http://mesos.apache.org/documentation/latest/upgrades/
[2] http://mesos.apache.org/documentation/latest/slave-recovery/

Thanks,
Abishek
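
P.S. To make the intended ordering concrete, here is a minimal Python sketch of steps 1-3 and gotcha 3. The Node type and the upgrade_order and recovery_timeout_ok functions are hypothetical names used for illustration only; they are not part of Mesos. The sketch only computes an order and compares two durations; it does not deploy anything or talk to a cluster.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Node:
    """Hypothetical description of one cluster node (not a Mesos type)."""
    host: str
    role: str              # "master" or "slave"
    leading: bool = False  # True only for the currently elected master


def upgrade_order(nodes: List[Node]) -> List[Node]:
    """Return nodes in the order proposed above:
    non-leading masters, then the leading master, then slaves."""
    standby = [n for n in nodes if n.role == "master" and not n.leading]
    leader = [n for n in nodes if n.role == "master" and n.leading]
    slaves = [n for n in nodes if n.role == "slave"]
    return standby + leader + slaves


def recovery_timeout_ok(install_seconds: float,
                        recovery_timeout_seconds: float) -> bool:
    """Gotcha 3: the slave install must finish before recovery_timeout
    expires. Both values are supplied by the caller; nothing is read
    from Mesos here."""
    return install_seconds < recovery_timeout_seconds
```

For example, given one slave s1, a leading master m1, and standby masters m2 and m3, upgrade_order returns them in the order m2, m3, m1, s1.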

