Aha, thanks! Will give this a shot tomorrow morning.

On Tuesday, February 24, 2015, Vinod Kone <[email protected]> wrote:
> Changing FrameworkInfo (while keeping the FrameworkID) is not handled
> correctly by Mesos at the moment. This is what you currently need to do to
> propagate FrameworkInfo.checkpoint throughout the cluster.
>
> --> Update FrameworkInfo inside your framework and re-register with the
> master. (The old FrameworkInfo is still cached at the master and slaves.)
> --> Fail over the leading master. (The new FrameworkInfo will be cached by
> the new leading master.)
> --> Hard restart (kill the slave and wipe its meta data) your slaves in
> batches.
>
> The proper fix for this is tracked at:
> https://issues.apache.org/jira/browse/MESOS-703
>
> On Tue, Feb 24, 2015 at 4:23 PM, Zameer Manji <[email protected]> wrote:
>
>> For anyone who is going to read this information in the future: this
>> works because the information in the replicated log can be recovered by
>> the master. Future releases of Mesos might store information that cannot
>> be recovered, so please take extra care if you are going to do this.
>>
>> On Tue, Feb 24, 2015 at 4:11 PM, Steve Niemitz <[email protected]> wrote:
>>
>>> Definitely don't change the FrameworkID; we did that once and it was a
>>> disaster, for the reasons described already.
>>>
>>> Here's what we did to force it on (as best I can recall):
>>> - Change the startup flags for all masters to use the in-memory registry
>>> instead of the replicated log (--registry=in_memory)
>>> - Restart all masters (not all at once; let them fail over)
>>> - Delete the replicated log on all masters
>>> - Ensure the framework is now registered with checkpoint = true (the
>>> slaves won't be yet, however)
>>> - Remove the --registry flag from the masters and do a rolling restart
>>> again
>>> - Do another rolling restart of the masters
>>> *- At this point the framework will be persisted with checkpoint = true*
>>> - Now restart your slaves. Restarting them should cause them to pick up
>>> the new framework. I'm not 100% sure whether I deleted their state when
>>> I did this part; if it doesn't seem to take, try deleting the slave info
>>> on each one.
>>>
>>> On Tue, Feb 24, 2015 at 4:02 PM, Zameer Manji <[email protected]> wrote:
>>>
>>>> I would like to point out that using a new FrameworkID is not a
>>>> solution to this problem. It means that a cluster operator has to drain
>>>> the entire cluster to enable checkpointing, or lose all previous tasks.
>>>> Neither scenario is desirable.
>>>>
>>>> Fortunately, it is possible to do this without changing the
>>>> FrameworkID. I have cc'ed Steve from TellApart, who has enabled
>>>> checkpointing without changing the FrameworkID on a production cluster.
>>>> I hope he can share his process here.
>>>>
>>>> On Tue, Feb 24, 2015 at 3:51 PM, Tim Chen <[email protected]> wrote:
>>>>
>>>>> Mesos checkpoints the FrameworkInfo to disk and recovers it on
>>>>> relaunch.
>>>>>
>>>>> I don't think we expose any API to remove the framework manually,
>>>>> though, if you really want to keep the FrameworkID. If you hit the
>>>>> failover timeout, the framework will be removed from the master and
>>>>> slaves.
>>>>>
>>>>> I think for now the best approach is to use a new FrameworkID when you
>>>>> want to change the FrameworkInfo.
>>>>>
>>>>> Tim
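A minimal sketch of the re-registration step Vinod describes (update FrameworkInfo inside the framework and re-register with the existing FrameworkID, now with checkpoint enabled). This assumes the old Python scheduler bindings (mesos.interface / mesos.native); the framework name, FrameworkID value, failover timeout, and master address below are placeholders, not values from this thread:

    # Sketch: re-register an existing framework with checkpointing enabled.
    # Assumes the pre-1.0 Python bindings (mesos.interface / mesos.native).
    # Framework name, FrameworkID, timeout, and master address are placeholders.
    from mesos.interface import Scheduler, mesos_pb2
    import mesos.native

    class MyScheduler(Scheduler):
        def registered(self, driver, framework_id, master_info):
            # Called once the master accepts the (re-)registration.
            print("Registered as %s" % framework_id.value)

    framework = mesos_pb2.FrameworkInfo()
    framework.user = ""                             # let Mesos pick the current user
    framework.name = "my-framework"                 # placeholder name
    framework.id.value = "<existing-framework-id>"  # reuse the FrameworkID you already hold
    framework.checkpoint = True                     # the setting being rolled out
    framework.failover_timeout = 7 * 24 * 3600.0    # keep your existing large timeout

    driver = mesos.native.MesosSchedulerDriver(
        MyScheduler(), framework, "zk://localhost:2181/mesos")  # placeholder master
    driver.run()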
>>>>>
>>>>> On Tue, Feb 24, 2015 at 3:32 PM, Thomas Petr <[email protected]> wrote:
>>>>>
>>>>>> Hey folks,
>>>>>>
>>>>>> Is there a best practice for rolling out FrameworkInfo changes? We
>>>>>> need to set checkpoint to true, so I redeployed our framework with the
>>>>>> new settings (with tasks still running), but when I hit a slave's
>>>>>> stats.json endpoint, it appears that the old FrameworkInfo data is
>>>>>> still there (which makes sense, since there are active executors
>>>>>> running). I then tried draining the tasks and completely restarting a
>>>>>> Mesos slave, but still no luck.
>>>>>>
>>>>>> Is there anything additional / special I need to do here? Is some part
>>>>>> of Mesos caching FrameworkInfo based on the framework ID?
>>>>>>
>>>>>> Another wrinkle with our setup is that we have a rather large
>>>>>> failover_timeout set for the framework -- maybe that's affecting
>>>>>> things too?
>>>>>>
>>>>>> Thanks,
>>>>>> Tom
>>>>
>>>> --
>>>> Zameer Manji
>>
>> --
>> Zameer Manji
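For completeness, a small sketch of how one might check whether a slave has actually picked up the new setting after the restarts. It assumes the slave's state.json endpoint (rather than stats.json) lists each framework along with its checkpoint flag, which may vary by Mesos version; the hostname, port, and framework name are placeholders:

    # Sketch: ask a slave's state.json whether it now sees checkpoint = true
    # for a given framework. Hostname, port, and framework name are
    # placeholders; the exact JSON layout can differ between Mesos versions.
    import json
    import urllib.request

    SLAVE = "http://slave1.example.com:5051"   # placeholder slave address
    FRAMEWORK_NAME = "my-framework"            # placeholder framework name

    with urllib.request.urlopen(SLAVE + "/state.json") as resp:
        state = json.load(resp)

    for fw in state.get("frameworks", []):
        if fw.get("name") == FRAMEWORK_NAME:
            print("framework %s: checkpoint=%s" % (fw.get("id"), fw.get("checkpoint")))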

