Would the following process enable a zero-downtime upgrade of Mesos (0.19 to
0.25) in an existing Mesos cluster?

0. From [1] it doesn't seem like there are any incompatible changes
introduced between 0.19 and 0.25.
1. Deploy Mesos (0.25) binaries to the non-leading master nodes.
2. Deploy Mesos (0.25) binaries to the leading master. This should
result in a master re-election that elects a master which already has
Mesos (0.25) installed from step (1).
3. Deploy Mesos (0.25) binaries to the Mesos slave nodes. Existing tasks should
continue to execute and report to the master once the slave process is
relaunched (with 0.25 binaries) on the slave node.
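
To make the intended ordering concrete, here is a minimal sketch of steps 1 to 3 as a function that returns the hosts in upgrade order. The hostnames and the leader flag are hypothetical inputs; how the current leader is discovered is left out, and nothing here calls Mesos itself.

```python
from typing import Dict, List


def upgrade_order(masters: Dict[str, bool], slaves: List[str]) -> List[str]:
    """Return hosts in the order the steps above would upgrade them.

    `masters` maps a hostname to True if that master is currently the
    leader. All hostnames are hypothetical.
    """
    standby = sorted(h for h, is_leader in masters.items() if not is_leader)
    leading = sorted(h for h, is_leader in masters.items() if is_leader)
    if len(leading) != 1:
        raise ValueError("expected exactly one leading master")
    # Step 1: non-leading masters; step 2: the leader; step 3: slaves.
    return standby + leading + sorted(slaves)


order = upgrade_order(
    {"master1": False, "master2": True, "master3": False},
    ["slave2", "slave1"],
)
print(order)
```

Upgrading the leader last means that whichever master is elected afterwards has already been upgraded in step 1.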

Known gotchas:
1. Any monitoring built around state.json and stats.json should be updated
accordingly as endpoints have changed [1].
2. Checkpointing should be enabled (it is not automatically enabled in
0.19) [2].
3. recovery_timeout for slave nodes should be set to an appropriate value
depending on how long it takes to install Mesos(0.25) on the slave nodes.
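
For gotcha 3, a rough way to pick the value is to take the measured install time and pad it with a safety margin. The sketch below assumes the slave accepts the timeout as a duration in minutes (for example 15mins); the function name and the default margin of 2x are my own choices, not anything from the Mesos docs.

```python
import math


def recovery_timeout_flag(install_secs: float, margin: float = 2.0) -> str:
    """Build a slave flag string from an estimated install time.

    `install_secs` is how long installing the new binaries takes on one
    slave; `margin` is a hypothetical safety factor. The "mins" unit is
    an assumption about the accepted duration format.
    """
    if install_secs <= 0 or margin < 1:
        raise ValueError("install_secs must be > 0 and margin >= 1")
    # Round up to whole minutes so the timeout is never shorter than planned.
    minutes = math.ceil(install_secs * margin / 60)
    return f"--recovery_timeout={minutes}mins"


print(recovery_timeout_flag(420))
```

If the slave stays down longer than this value, its executors may give up waiting for it to reconnect, so erring on the high side seems safer.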

Is any step missing in the upgrade process? Are there other gotchas that
one needs to be aware of?

[1] http://mesos.apache.org/documentation/latest/upgrades/
[2] http://mesos.apache.org/documentation/latest/slave-recovery/

Thanks,
Abishek
