Aha, thanks! Will give this a shot tomorrow morning.

On Tuesday, February 24, 2015, Vinod Kone <[email protected]> wrote:

> Changing FrameworkInfo (while keeping the FrameworkID) is not handled
> correctly by Mesos at the moment. This is what you currently need to do to
> propagate FrameworkInfo.checkpoint throughout the cluster.
>
> --> Update FrameworkInfo inside your framework and re-register with
> master. (Old FrameworkInfo is still cached at master and slaves).
> --> Fail over the leading master. (The new FrameworkInfo will be cached by the
> new leading master).
> --> Hard restart your slaves in batches (kill the slave and wipe its meta data).
>
> The proper fix for this is tracked at:
> https://issues.apache.org/jira/browse/MESOS-703
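>
> (To make the first step concrete, here's a minimal sketch of a scheduler
> re-registering with the same FrameworkID but with checkpoint now set to true.
> It assumes the old Python bindings (mesos.interface / mesos.native); the
> MyScheduler class, framework name, and ZooKeeper URL are placeholders for
> whatever your framework already uses.)
>
>     import mesos.native
>     from mesos.interface import mesos_pb2
>
>     framework = mesos_pb2.FrameworkInfo()
>     framework.id.value = "existing-framework-id"   # keep the old FrameworkID
>     framework.name = "my-framework"
>     framework.user = ""                            # let Mesos pick the current user
>     framework.checkpoint = True                    # the new setting to propagate
>     framework.failover_timeout = 7 * 24 * 3600.0   # so tasks survive failovers
>
>     # MyScheduler is your existing Scheduler implementation (placeholder name).
>     # Re-registering alone is not enough: the master and slaves keep the old
>     # copy cached until they are failed over / restarted as described above.
>     driver = mesos.native.MesosSchedulerDriver(
>         MyScheduler(), framework, "zk://zk-host:2181/mesos")
>     driver.run()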
>
> On Tue, Feb 24, 2015 at 4:23 PM, Zameer Manji <[email protected]> wrote:
>
>> For anyone who is going to read this information in the future, this
>> works because the information in the replicated log can be recovered by the
>> master. In future releases of Mesos the master might store information
>> which cannot be recovered so please take extra care if you are going to do
>> this.
>>
>> On Tue, Feb 24, 2015 at 4:11 PM, Steve Niemitz <[email protected]> wrote:
>>
>>> Definitely don't change the frameworkID; we did that once and it was a
>>> disaster, for the reasons described already.
>>>
>>> Here's what we did to force it on (as best I can recall):
>>> - Change the startup flags for all masters to use the in memory DB
>>> instead of the replicated log (--registry=in_memory)
>>> - Restart all masters (not all at once, let them fail over)
>>> - Delete the replicated log on all masters
>>> - Ensure the framework is now registered with checkpoint = true (the
>>> slaves won't be yet, however)
>>> - Remove the --registry flag from the masters and do a rolling restart
>>> again
>>> - Do another rolling restart of the masters
>>> *- At this point the framework will be persisted as checkpoint = true*
>>> - Now, restart your slaves.  Restarting them should cause them to pick
>>> up the new framework.  I'm not 100% sure if I deleted their state or not
>>> when I did this part; if it doesn't seem to take, try deleting their slave
>>> info on each one.
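>>>
>>> (One way to verify the "registered with checkpoint = true" step above is to
>>> query the leading master's state.json and look at the per-framework
>>> checkpoint field. A minimal sketch, assuming the 0.2x-era /state.json
>>> endpoint and Python 2, as used with the old bindings:)
>>>
>>>     import json
>>>     import urllib2
>>>
>>>     state = json.load(urllib2.urlopen("http://leading-master:5050/state.json"))
>>>     for fw in state.get("frameworks", []):
>>>         # Each framework entry carries the FrameworkInfo the master has cached.
>>>         print("%s %s checkpoint=%s" % (fw["id"], fw["name"], fw.get("checkpoint")))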
>>>
>>> On Tue, Feb 24, 2015 at 4:02 PM, Zameer Manji <[email protected]> wrote:
>>>
>>>> I would like to point out that using a new FrameworkID is not a
>>>> solution to this problem. It means a cluster operator has to either drain
>>>> the entire cluster to enable checkpointing or lose all previous tasks.
>>>> Neither scenario is desirable.
>>>>
>>>> Fortunately it is possible to do this without changing the FrameworkID.
>>>> I have cced Steve from TellApart who has enabled checkpointing without
>>>> changing the FrameworkID on a production cluster. I hope he can share his
>>>> process here.
>>>>
>>>> On Tue, Feb 24, 2015 at 3:51 PM, Tim Chen <[email protected]> wrote:
>>>>
>>>>> Mesos checkpoints the FrameworkInfo into disk, and recovers it on
>>>>> relaunch.
>>>>>
>>>>> I don't think we expose any API to remove the framework manually,
>>>>> though, if you really want to keep the FrameworkID. If you hit the failover
>>>>> timeout the framework will get removed from the master and slave.
>>>>>
>>>>> I think for now the best way is just to use a new FrameworkID when you
>>>>> want to change the FrameworkInfo.
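>>>>>
>>>>> (Concretely, "use a new FrameworkID" just means registering without setting
>>>>> FrameworkInfo.id, so the master assigns a fresh one. A rough sketch with the
>>>>> old Python bindings; the names are illustrative:)
>>>>>
>>>>>     from mesos.interface import mesos_pb2
>>>>>
>>>>>     framework = mesos_pb2.FrameworkInfo()
>>>>>     framework.name = "my-framework"
>>>>>     framework.user = ""
>>>>>     framework.checkpoint = True
>>>>>     # framework.id is deliberately left unset: the master hands out a new
>>>>>     # FrameworkID, and tasks started under the old ID are removed once the
>>>>>     # old framework's failover timeout expires.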
>>>>>
>>>>> Tim
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Feb 24, 2015 at 3:32 PM, Thomas Petr <[email protected]> wrote:
>>>>>
>>>>>> Hey folks,
>>>>>>
>>>>>> Is there a best practice for rolling out FrameworkInfo changes? We
>>>>>> need to set checkpoint to true, so I redeployed our framework with
>>>>>> the new settings (with tasks still running), but when I hit a slave's
>>>>>> stats.json endpoint, it appears that the old FrameworkInfo data is
>>>>>> still there (which makes sense since there are active executors running). I
>>>>>> then tried draining the tasks and completely restarting a Mesos slave, but
>>>>>> still no luck.
>>>>>>
>>>>>> Is there anything additional / special I need to do here? Is some
>>>>>> part of Mesos caching FrameworkInfo based on the framework ID?
>>>>>>
>>>>>> Another wrinkle with our setup is that we have a rather large
>>>>>> failover_timeout set for the framework -- maybe that's affecting
>>>>>> things too?
>>>>>>
>>>>>> Thanks,
>>>>>> Tom
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Zameer Manji
>>>>
>>>
>>>
>>
>>
>> --
>> Zameer Manji
>>
>
>
