Re: Updating FrameworkInfo settings

2015-02-24 Thread Zameer Manji
I would like to point out that using a new FrameworkID is not a solution to
this problem. It means a cluster operator either has to drain the entire
cluster to enable checkpointing or lose all previous tasks. Neither scenario
is desirable.

Fortunately it is possible to do this without changing the FrameworkID. I
have CCed Steve from TellApart, who has enabled checkpointing without
changing the FrameworkID on a production cluster. I hope he can share his
process here.

On Tue, Feb 24, 2015 at 3:51 PM, Tim Chen t...@mesosphere.io wrote:

 Mesos checkpoints the FrameworkInfo to disk and recovers it on relaunch.

 I don't think we expose any API to remove the framework manually, though, if
 you really want to keep the FrameworkID. If you hit the failover timeout,
 the framework will be removed from the master and slaves.

 I think for now the best way is just to use a new FrameworkID when you want
 to change the FrameworkInfo.

 Tim



 On Tue, Feb 24, 2015 at 3:32 PM, Thomas Petr tp...@hubspot.com wrote:

 Hey folks,

 Is there a best practice for rolling out FrameworkInfo changes? We need
 to set checkpoint to true, so I redeployed our framework with the new
 settings (with tasks still running), but when I hit a slave's stats.json
 endpoint, it appears that the old FrameworkInfo data is still there (which
 makes sense since there are active executors running). I then tried draining
 the tasks and completely restarting a Mesos slave, but still no luck.

 Is there anything additional / special I need to do here? Is some part of
 Mesos caching FrameworkInfo based on the framework ID?

 Another wrinkle with our setup is we have a rather large failover_timeout
 set for the framework -- maybe that's affecting things too?

 Thanks,
 Tom





-- 
Zameer Manji


Re: Updating FrameworkInfo settings

2015-02-24 Thread Vinod Kone
Changing FrameworkInfo (while keeping the FrameworkID) is not handled
correctly by Mesos at the moment. This is what you currently need to do to
propagate FrameworkInfo.checkpoint throughout the cluster.

-- Update FrameworkInfo inside your framework and re-register with the
master, as sketched below. (The old FrameworkInfo is still cached at the
master and slaves.)
-- Failover the leading master. (The new FrameworkInfo will be cached by the
new leading master.)
-- Hard restart (kill the slave and wipe its metadata) your slaves in batches.

The proper fix for this is tracked at:
https://issues.apache.org/jira/browse/MESOS-703
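
For the first step, re-registering with the same FrameworkID but the updated
FrameworkInfo looks roughly like the following with the stock Python bindings
(a rough sketch only; the FrameworkID value and ZooKeeper URL are
illustrative, and other language bindings work the same way):

from mesos.interface import Scheduler, mesos_pb2
from mesos.native import MesosSchedulerDriver

class MyScheduler(Scheduler):
    """Placeholder; your real scheduler callbacks go here."""
    def reregistered(self, driver, master_info):
        print("Re-registered with master %s" % master_info.id)

framework = mesos_pb2.FrameworkInfo()
framework.user = ""          # empty lets Mesos fill in the current user
framework.name = "my-framework"
# Keep the existing FrameworkID so running tasks are not orphaned (value is illustrative).
framework.id.value = "20150224-000000-16842879-5050-1234-0000"
framework.checkpoint = True  # the new setting we want to propagate
framework.failover_timeout = 7 * 24 * 3600.0  # keep whatever large timeout you already use

driver = MesosSchedulerDriver(MyScheduler(), framework,
                              "zk://localhost:2181/mesos")
driver.run()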

On Tue, Feb 24, 2015 at 4:23 PM, Zameer Manji zma...@twopensource.com
wrote:

 For anyone who is going to read this information in the future, this works
 because the information in the replicated log can be recovered by the
 master. In future releases of Mesos the master might store information
 which cannot be recovered, so please take extra care if you are going to do
 this.

 On Tue, Feb 24, 2015 at 4:11 PM, Steve Niemitz st...@tellapart.com
 wrote:

 Definitely don't change the FrameworkID; we did that once and it was a
 disaster, for the reasons described already.

 Here's what we did to force it on (as best I can recall):
 - Change the startup flags for all masters to use the in-memory registry
 instead of the replicated log (--registry=in_memory)
 - Restart all masters (not all at once, let them fail over)
 - Delete the replicated log on all masters
 - Ensure the framework is now registered with checkpoint = true (the
 slaves won't be yet, however; a quick check against the master is sketched below)
 - Remove the --registry flag from the masters and do a rolling restart
 again
 - Do another rolling restart of the masters
 *- At this point the framework will be persisted as checkpoint = true*
 - Now, restart your slaves.  Restarting them should cause them to pick up
 the new framework.  I'm not 100% sure whether I deleted their state when I
 did this part; if it doesn't seem to take, try deleting the slave info on
 each one.
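
 For the "persisted as checkpoint = true" step, we verified against the
 leading master's state.json rather than eyeballing the UI. Rough sketch below
 (placeholder hostname and framework name, and the JSON field names are from
 memory, so double check them against your Mesos version):

import json
import sys
import urllib2  # use urllib.request on Python 3

MASTER = "http://mesos-master-1.example.com:5050"  # placeholder: your leading master
FRAMEWORK_NAME = "my-framework"                    # placeholder framework name

state = json.load(urllib2.urlopen(MASTER + "/state.json"))
frameworks = [fw for fw in state.get("frameworks", [])
              if fw.get("name") == FRAMEWORK_NAME]

for fw in frameworks:
    print("%s (%s): checkpoint=%s" % (fw.get("name"), fw.get("id"), fw.get("checkpoint")))

# Exit non-zero if the master still has checkpoint disabled for the framework.
sys.exit(0 if frameworks and all(fw.get("checkpoint") for fw in frameworks) else 1)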

 On Tue, Feb 24, 2015 at 4:02 PM, Zameer Manji zma...@twopensource.com
 wrote:

 I would like to point out that using a new FrameworkID is not a solution
 to this problem. It means a cluster operator either has to drain the entire
 cluster to enable checkpointing or lose all previous tasks. Neither scenario
 is desirable.

 Fortunately it is possible to do this without changing the FrameworkID.
 I have CCed Steve from TellApart, who has enabled checkpointing without
 changing the FrameworkID on a production cluster. I hope he can share his
 process here.

 On Tue, Feb 24, 2015 at 3:51 PM, Tim Chen t...@mesosphere.io wrote:

 Mesos checkpoints the FrameworkInfo to disk and recovers it on
 relaunch.

 I don't think we expose any API to remove the framework manually, though,
 if you really want to keep the FrameworkID. If you hit the failover timeout,
 the framework will be removed from the master and slaves.

 I think for now the best way is just to use a new FrameworkID when you
 want to change the FrameworkInfo.

 Tim



 On Tue, Feb 24, 2015 at 3:32 PM, Thomas Petr tp...@hubspot.com wrote:

 Hey folks,

 Is there a best practice for rolling out FrameworkInfo changes? We
 need to set checkpoint to true, so I redeployed our framework with
 the new settings (with tasks still running), but when I hit a slave's
 stats.json endpoint, it appears that the old FrameworkInfo data is
 still there (which makes sense since there are active executors running). I
 then tried draining the tasks and completely restarting a Mesos slave, but
 still no luck.

 Is there anything additional / special I need to do here? Is some part
 of Mesos caching FrameworkInfo based on the framework ID?

 Another wrinkle with our setup is we have a rather large
 failover_timeout set for the framework -- maybe that's affecting
 things too?

 Thanks,
 Tom





 --
 Zameer Manji





 --
 Zameer Manji



Updating FrameworkInfo settings

2015-02-24 Thread Thomas Petr
Hey folks,

Is there a best practice for rolling out FrameworkInfo changes? We need to
set checkpoint to true, so I redeployed our framework with the new settings
(with tasks still running), but when I hit a slave's stats.json endpoint,
it appears that the old FrameworkInfo data is still there (which makes
sense since there are active executors running). I then tried draining the
tasks and completely restarting a Mesos slave, but still no luck.

Is there anything additional / special I need to do here? Is some part of
Mesos caching FrameworkInfo based on the framework ID?

Another wrinkle with our setup is we have a rather large failover_timeout
set for the framework -- maybe that's affecting things too?
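
In case it helps anyone reproduce this, here's roughly the kind of check I'd
like to see pass on each slave (a rough sketch; the hostnames are
placeholders, I'm reading state.json here, and the field layout may differ
between Mesos versions):

import json
import urllib2  # use urllib.request on Python 3

SLAVES = [
    "http://mesos-slave-1.example.com:5051",  # placeholder hostnames
    "http://mesos-slave-2.example.com:5051",
]

for slave in SLAVES:
    state = json.load(urllib2.urlopen(slave + "/state.json"))
    for fw in state.get("frameworks", []):
        print("%s: %s (%s) checkpoint=%s" % (
            slave, fw.get("name"), fw.get("id"), fw.get("checkpoint")))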

Thanks,
Tom


Re: Updating FrameworkInfo settings

2015-02-24 Thread Tim Chen
Mesos checkpoints the FrameworkInfo to disk and recovers it on relaunch.

I don't think we expose any API to remove the framework manually, though, if
you really want to keep the FrameworkID. If you hit the failover timeout,
the framework will be removed from the master and slaves.

I think for now the best way is just to use a new FrameworkID when you want
to change the FrameworkInfo.

Tim



On Tue, Feb 24, 2015 at 3:32 PM, Thomas Petr tp...@hubspot.com wrote:

 Hey folks,

 Is there a best practice for rolling out FrameworkInfo changes? We need to
 set checkpoint to true, so I redeployed our framework with the new
 settings (with tasks still running), but when I hit a slave's stats.json
 endpoint, it appears that the old FrameworkInfo data is still there (which
 makes sense since there are active executors running). I then tried draining
 the tasks and completely restarting a Mesos slave, but still no luck.

 Is there anything additional / special I need to do here? Is some part of
 Mesos caching FrameworkInfo based on the framework ID?

 Another wrinkle with our setup is we have a rather large failover_timeout
 set for the framework -- maybe that's affecting things too?

 Thanks,
 Tom