Changing FrameworkInfo (while keeping the FrameworkID) is not handled
correctly by Mesos at the moment. This is what you currently need to do to
propagate FrameworkInfo.checkpoint throughout the cluster.

--> Update FrameworkInfo inside your framework and re-register with the master.
(The old FrameworkInfo is still cached at the master and slaves.)
--> Fail over the leading master. (The new FrameworkInfo will be cached by the
new leading master.)
--> Hard restart (kill the slave and wipe its metadata) your slaves in batches.
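
The batching in the last step, and a check that the new setting has actually
propagated, can be sketched roughly as below. This is only a sketch: the
function names are hypothetical, and the state layout (a top-level
"frameworks" list whose entries carry "id" and "checkpoint") is an assumption
about what the master and slave state endpoints return, so verify it against
your Mesos version before relying on it.

```python
def batches(hosts, size):
    """Split a list of slave hostnames into consecutive groups of at most `size`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [hosts[i:i + size] for i in range(0, len(hosts), size)]


def frameworks_without_checkpoint(state):
    """Return the IDs of frameworks whose cached info does not have checkpoint = true.

    `state` is a dict parsed from a state endpoint's JSON. The assumed shape is
    {"frameworks": [{"id": "...", "checkpoint": true}, ...]}; a missing
    "checkpoint" key is treated as false.
    """
    return [
        fw.get("id")
        for fw in state.get("frameworks", [])
        if not fw.get("checkpoint", False)
    ]
```

One way to use this: restart one batch, fetch the state from each slave in
that batch, and move on to the next batch only once
frameworks_without_checkpoint() returns an empty list for all of them.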

The proper fix for this is tracked at:

On Tue, Feb 24, 2015 at 4:23 PM, Zameer Manji <> wrote:

> For anyone who is going to read this information in the future, this works
> because the information in the replicated log can be recovered by the
> master. In future releases of Mesos the master might store information
> which cannot be recovered so please take extra care if you are going to do
> this.
> On Tue, Feb 24, 2015 at 4:11 PM, Steve Niemitz <>
> wrote:
>> Definitely don't change the frameworkID, we did that once and it was a
>> disaster, for reasons described already.
>> Here's what we did to force it on (as far as I can recall):
>> - Change the startup flags for all masters to use the in memory DB
>> instead of the replicated log (--registry=in_memory)
>> - Restart all masters (not all at once, let them fail over)
>> - Delete the replicated log on all masters
>> - Ensure the framework is now registered with checkpoint = true (the
>> slaves won't be yet, however)
>> - Remove the --registry flag from the masters and do a rolling restart
>> again
>> - Do another rolling restart of the masters
>> *- At this point the framework will be persisted as checkpoint = true*
>> - Now, restart your slaves. Restarting them should cause them to pick up
>> the new framework. I'm not 100% sure whether I deleted their state when
>> I did this part; if it doesn't seem to take, try deleting the slave info
>> on each one.
>> On Tue, Feb 24, 2015 at 4:02 PM, Zameer Manji <>
>> wrote:
>>> I would like to point out that using a new FrameworkID is not a solution
>>> to this problem. This means that a cluster operator has to drain the entire
>>> cluster to enable checkpointing, or lose all previous tasks. Neither
>>> scenario is desirable.
>>> Fortunately it is possible to do this without changing the FrameworkID.
>>> I have cced Steve from TellApart who has enabled checkpointing without
>>> changing the FrameworkID on a production cluster. I hope he can share his
>>> process here.
>>> On Tue, Feb 24, 2015 at 3:51 PM, Tim Chen <> wrote:
>>>> Mesos checkpoints the FrameworkInfo to disk, and recovers it on
>>>> relaunch.
>>>> I don't think we expose any API to remove the framework manually though
>>>> if you really want to keep the FrameworkID. If you hit the failover timeout
>>>> the framework will get removed from the master and slave.
>>>> I think for now the best way is just to use a new FrameworkID when you
>>>> want to change the FrameworkInfo.
>>>> Tim
>>>> On Tue, Feb 24, 2015 at 3:32 PM, Thomas Petr <> wrote:
>>>>> Hey folks,
>>>>> Is there a best practice for rolling out FrameworkInfo changes? We
>>>>> need to set checkpoint to true, so I redeployed our framework with
>>>>> the new settings (with tasks still running), but when I hit a slave's
>>>>> stats.json endpoint, it appears that the old FrameworkInfo data is
>>>>> still there (which makes sense since there are active executors running). I
>>>>> then tried draining the tasks and completely restarting a Mesos slave, but
>>>>> still no luck.
>>>>> Is there anything additional / special I need to do here? Is some part
>>>>> of Mesos caching FrameworkInfo based on the framework ID?
>>>>> Another wrinkle with our setup is we have a rather large
>>>>> failover_timeout set for the framework -- maybe that's affecting
>>>>> things too?
>>>>> Thanks,
>>>>> Tom
>>> --
>>> Zameer Manji
> --
> Zameer Manji
