Re: Updating FrameworkInfo settings
I would like to point out that using a new FrameworkID is not a solution to this problem. This means that a cluster operator has to drain the entire cluster to enable checkpointing, or lose all previous tasks. Both scenarios are not desirable.

Fortunately it is possible to do this without changing the FrameworkID. I have cced Steve from TellApart who has enabled checkpointing without changing the FrameworkID on a production cluster. I hope he can share his process here.

On Tue, Feb 24, 2015 at 3:51 PM, Tim Chen t...@mesosphere.io wrote:

Mesos checkpoints the FrameworkInfo into disk, and recovers it on relaunch. I don't think we expose any API to remove the framework manually though if you really want to keep the FrameworkID. If you hit the failover timeout the framework will get removed from the master and slave.

I think for now the best way is just use a new FrameworkID when you want to change the FrameworkInfo.

Tim

On Tue, Feb 24, 2015 at 3:32 PM, Thomas Petr tp...@hubspot.com wrote:

Hey folks,

Is there a best practice for rolling out FrameworkInfo changes? We need to set checkpoint to true, so I redeployed our framework with the new settings (with tasks still running), but when I hit a slave's stats.json endpoint, it appears that the old FrameworkInfo data is still there (which makes sense since there's active executors running). I then tried draining the tasks and completely restarting a Mesos slave, but still no luck.

Is there anything additional / special I need to do here? Is some part of Mesos caching FrameworkInfo based on the framework ID?

Another wrinkle with our setup is we have a rather large failover_timeout set for the framework -- maybe that's affecting things too?

Thanks,
Tom

--
Zameer Manji
Re: Updating FrameworkInfo settings
Changing FrameworkInfo (while keeping the FrameworkID) is not handled correctly by Mesos at the moment. This is what you currently need to do to propagate FrameworkInfo.checkpoint throughout the cluster:

- Update FrameworkInfo inside your framework and re-register with the master. (The old FrameworkInfo is still cached at the master and slaves.)
- Fail over the leading master. (The new FrameworkInfo will be cached by the new leading master.)
- Hard restart (kill the slave and wipe its meta data) your slaves in batches.

The proper fix for this is tracked at: https://issues.apache.org/jira/browse/MESOS-703

On Tue, Feb 24, 2015 at 4:23 PM, Zameer Manji zma...@twopensource.com wrote:

For anyone who is going to read this information in the future, this works because the information in the replicated log can be recovered by the master. In future releases of Mesos the master might store information which cannot be recovered, so please take extra care if you are going to do this.

On Tue, Feb 24, 2015 at 4:11 PM, Steve Niemitz st...@tellapart.com wrote:

Definitely don't change the frameworkID; we did that once and it was a disaster, for reasons described already.

Here's what we did to force it on (as I can recall):

- Change the startup flags for all masters to use the in memory DB instead of the replicated log (--registry=in_memory)
- Restart all masters (not all at once, let them fail over)
- Delete the replicated log on all masters
- Ensure the framework is now registered with checkpoint = true (the slaves won't be yet, however)
- Remove the --registry flag from the masters and do a rolling restart again
- Do another rolling restart of the masters
- At this point the framework will be persisted as checkpoint = true
- Now, restart your slaves. Restarting them should cause them to pick up the new framework. I'm not 100% sure if I deleted their state or not when I did this part; if it doesn't seem to take, try deleting their slave info on each one.
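Both procedures above end with restarting the slaves a few at a time rather than all at once. A minimal Python sketch of that batching follows. Everything in it is illustrative: `make_batches`, `rolling_restart`, and the `restart` and `is_healthy` callbacks are hypothetical names, not part of Mesos, and the callbacks stand in for whatever your own tooling does to hard restart a slave and to confirm it has re-registered.

```python
from typing import Callable, List, Sequence


def make_batches(hosts: Sequence[str], size: int) -> List[List[str]]:
    """Split hosts into consecutive batches of at most `size` hosts."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    return [list(hosts[i:i + size]) for i in range(0, len(hosts), size)]


def rolling_restart(
    hosts: Sequence[str],
    size: int,
    restart: Callable[[str], object],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Restart hosts one batch at a time and return the hosts restarted.

    `restart` and `is_healthy` are hypothetical callbacks supplied by the
    caller. The rollout stops at the first batch that contains a host
    which does not report healthy after its restart.
    """
    restarted: List[str] = []
    for batch in make_batches(hosts, size):
        for host in batch:
            restart(host)
        unhealthy = [host for host in batch if not is_healthy(host)]
        if unhealthy:
            raise RuntimeError(
                "stopping rollout, unhealthy after restart: " + ", ".join(unhealthy)
            )
        restarted.extend(batch)
    return restarted


if __name__ == "__main__":
    # Dry run with stand-in callbacks: print instead of restarting anything.
    slaves = ["slave-1", "slave-2", "slave-3", "slave-4", "slave-5"]
    rolling_restart(slaves, 2, lambda h: print("restart", h), lambda h: True)
```

Stopping at the first unhealthy batch keeps the damage bounded if wiping the slave meta data turns out to cause trouble on your cluster.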
On Tue, Feb 24, 2015 at 4:02 PM, Zameer Manji zma...@twopensource.com wrote:

I would like to point out that using a new FrameworkID is not a solution to this problem. This means that a cluster operator has to drain the entire cluster to enable checkpointing, or lose all previous tasks. Both scenarios are not desirable. Fortunately it is possible to do this without changing the FrameworkID. I have cced Steve from TellApart who has enabled checkpointing without changing the FrameworkID on a production cluster. I hope he can share his process here.
Updating FrameworkInfo settings
Hey folks,

Is there a best practice for rolling out FrameworkInfo changes? We need to set checkpoint to true, so I redeployed our framework with the new settings (with tasks still running), but when I hit a slave's stats.json endpoint, it appears that the old FrameworkInfo data is still there (which makes sense since there's active executors running). I then tried draining the tasks and completely restarting a Mesos slave, but still no luck.

Is there anything additional / special I need to do here? Is some part of Mesos caching FrameworkInfo based on the framework ID?

Another wrinkle with our setup is we have a rather large failover_timeout set for the framework -- maybe that's affecting things too?

Thanks,
Tom
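For anyone repeating the check described above, here is a rough Python sketch of reading the checkpoint flag out of the JSON a slave returns. It assumes the document has a top-level `frameworks` list whose entries carry `id` and `checkpoint` fields; that shape is an assumption and should be verified against the output of your Mesos version. The function name `framework_checkpoint` and the sample framework IDs are made up for illustration.

```python
import json
from typing import Optional


def framework_checkpoint(state: dict, framework_id: str) -> Optional[bool]:
    """Return the checkpoint flag reported for framework_id.

    Returns None when the framework is not listed at all. The
    "frameworks" / "id" / "checkpoint" field names are assumed, not
    taken from the thread.
    """
    for framework in state.get("frameworks", []):
        if framework.get("id") == framework_id:
            return bool(framework.get("checkpoint", False))
    return None


if __name__ == "__main__":
    # Stand-in for the body fetched from a slave; IDs are hypothetical.
    sample = json.loads(
        '{"frameworks": ['
        '{"id": "example-framework-0001", "checkpoint": false},'
        '{"id": "example-framework-0002", "checkpoint": true}'
        ']}'
    )
    print(framework_checkpoint(sample, "example-framework-0001"))
```

Returning None for a missing framework keeps "not registered on this slave" distinct from "registered with checkpointing off", which matters when comparing slaves before and after a restart.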
Re: Updating FrameworkInfo settings
Mesos checkpoints the FrameworkInfo into disk, and recovers it on relaunch. I don't think we expose any API to remove the framework manually though if you really want to keep the FrameworkID. If you hit the failover timeout the framework will get removed from the master and slave.

I think for now the best way is just use a new FrameworkID when you want to change the FrameworkInfo.

Tim

On Tue, Feb 24, 2015 at 3:32 PM, Thomas Petr tp...@hubspot.com wrote:

Hey folks,

Is there a best practice for rolling out FrameworkInfo changes? We need to set checkpoint to true, so I redeployed our framework with the new settings (with tasks still running), but when I hit a slave's stats.json endpoint, it appears that the old FrameworkInfo data is still there (which makes sense since there's active executors running). I then tried draining the tasks and completely restarting a Mesos slave, but still no luck.

Is there anything additional / special I need to do here? Is some part of Mesos caching FrameworkInfo based on the framework ID?

Another wrinkle with our setup is we have a rather large failover_timeout set for the framework -- maybe that's affecting things too?

Thanks,
Tom