For anyone reading this in the future: this works because the master can
recover the information in the replicated log. In future releases of Mesos
the master might store information that cannot be recovered, so please take
extra care if you are going to do this.

On Tue, Feb 24, 2015 at 4:11 PM, Steve Niemitz <[email protected]> wrote:

> Definitely don't change the frameworkID, we did that once and it was a
> disaster, for reasons described already.
>
> Here's what we did to force it on (as best I can recall):
> - Change the startup flags for all masters to use the in-memory DB instead
> of the replicated log (--registry=in_memory)
> - Restart all masters (not all at once, let them fail over)
> - Delete the replicated log on all masters
> - Ensure the framework is now registered with checkpoint = true (the
> slaves won't be yet, however)
> - Remove the --registry flag from the masters and do a rolling restart
> again
> - Do another rolling restart of the masters
> *- At this point the framework will be persisted as checkpoint = true*
> - Now, restart your slaves. Restarting them should cause them to pick up
> the new framework. I'm not 100% sure whether I deleted their state when
> I did this part; if it doesn't seem to take, try deleting the slave info
> on each one.
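
One way to check the "registered with checkpoint = true" step is to read the
flag out of the master's state JSON. The following is only a minimal sketch,
not something from the procedure above: the /master/state.json path, the
"frameworks", "id" and "checkpoint" field names, and both helper function
names are assumptions, so verify them against your Mesos version first.

```python
import json
from urllib.request import urlopen


def framework_checkpoint(state, framework_id):
    """Return the checkpoint flag for framework_id, or None if not listed.

    Assumes `state` is the parsed master state with a "frameworks" list whose
    entries carry "id" and "checkpoint" keys (an assumption, not verified here).
    """
    for framework in state.get("frameworks", []):
        if framework.get("id") == framework_id:
            return bool(framework.get("checkpoint", False))
    return None


def fetch_master_state(master_url):
    """Fetch and parse the master state JSON (hypothetical helper).

    Assumes the leading master serves its state at /master/state.json.
    """
    url = master_url.rstrip("/") + "/master/state.json"
    with urlopen(url, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, framework_checkpoint(fetch_master_state("http://master:5050"),
"your-framework-id") would be expected to return True once the framework has
re-registered; the host, port and ID here are placeholders.
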
>
> On Tue, Feb 24, 2015 at 4:02 PM, Zameer Manji <[email protected]>
> wrote:
>
>> I would like to point out that using a new FrameworkID is not a solution
>> to this problem. This means that a cluster operator has to drain the entire
>> cluster to enable checkpointing, or lose all previous tasks. Neither
>> scenario is desirable.
>>
>> Fortunately it is possible to do this without changing the FrameworkID. I
>> have cced Steve from TellApart who has enabled checkpointing without
>> changing the FrameworkID on a production cluster. I hope he can share his
>> process here.
>>
>> On Tue, Feb 24, 2015 at 3:51 PM, Tim Chen <[email protected]> wrote:
>>
>>> Mesos checkpoints the FrameworkInfo to disk, and recovers it on
>>> relaunch.
>>>
>>> I don't think we expose any API to remove the framework manually, though,
>>> if you really want to keep the FrameworkID. If you hit the failover timeout,
>>> the framework will get removed from the master and slave.
>>>
>>> I think for now the best way is just to use a new FrameworkID when you
>>> want to change the FrameworkInfo.
>>>
>>> Tim
>>>
>>>
>>>
>>> On Tue, Feb 24, 2015 at 3:32 PM, Thomas Petr <[email protected]> wrote:
>>>
>>>> Hey folks,
>>>>
>>>> Is there a best practice for rolling out FrameworkInfo changes? We need
>>>> to set checkpoint to true, so I redeployed our framework with the new
>>>> settings (with tasks still running), but when I hit a slave's
>>>> stats.json endpoint, it appears that the old FrameworkInfo data is
>>>> still there (which makes sense since there are active executors running). I
>>>> then tried draining the tasks and completely restarting a Mesos slave, but
>>>> still no luck.
>>>>
>>>> Is there anything additional / special I need to do here? Is some part
>>>> of Mesos caching FrameworkInfo based on the framework ID?
>>>>
>>>> Another wrinkle with our setup is we have a rather large
>>>> failover_timeout set for the framework -- maybe that's affecting
>>>> things too?
>>>>
>>>> Thanks,
>>>> Tom
>>>>
>>>
>>>
>>
>>
>> --
>> Zameer Manji
>>
>
>


-- 
Zameer Manji
