On a related note, what if the framework scheduler is up while the Mesos master goes down? If the Mesos master then restarts after an interval greater than the framework failover timeout, what is the expected behavior? Would the framework successfully get a reregistered() callback? Or an error() callback? Something else?

On Fri, Apr 18, 2014 at 10:54 AM, Vinod Kone <[email protected]> wrote:

> I think you are on the right track here.
>
> I would recommend setting a high failover timeout that is an upper bound
> for all of your schedulers being down (e.g., 1 week). This way, even if all
> your scheduler instances are down due to outage/maintenance, your
> tasks/services keep running in the Mesos cluster.
>
>
> On Fri, Apr 18, 2014 at 5:02 AM, David Greenberg
> <[email protected]> wrote:
>
>> Hey Vinod,
>> The problem I'm trying to solve is writing a framework that can run on
>> our HA application cluster, and whenever the framework's current scheduler
>> dies, another node will be elected and take over. I'm trying to work
>> through the various failure cases to understand how to implement this so
>> that it works through all the failure cases I can think of.
>>
>> It sounds like the solution that'd work best for me would be to try to
>> read the framework ID from a known location and register with that. If it's
>> not there, or if registration fails, then the framework should register
>> anew.
>>
>> This framework's state is very large, and resides in a couple databases,
>> so that even if the entire set of candidates for becoming the framework is
>> down for the whole failover grace period, the framework still wants to
>> register, since its state never gets invalidated.
>>
>> Thanks,
>> David
>>
>>
>> On Thursday, April 17, 2014, Vinod Kone <[email protected]> wrote:
>>
>>>
>>> On Thu, Apr 17, 2014 at 2:56 PM, David Greenberg <[email protected]> wrote:
>>>
>>>> My follow-up question is this--is there a way to tell whether I'm
>>>> outside of the timeout window? I'd like to have my framework check ZK and
>>>> determine whether it's w/in the framework timeout or not, so that it can
>>>> make the correct call.
>>>>
>>>
>>> Hey David,
>>>
>>> Currently, the only signal you can get is by hitting the "/state.json"
>>> endpoint on the master. The framework should've been moved to
>>> 'completed_frameworks' after the failover timeout. Of course, if a master
>>> fails over, this information is lost, so you can't reliably depend on it.
>>>
>>> When the master starts storing persistent state about frameworks (likely
>>> a couple of releases away), a re-registration attempt in such a case would
>>> be denied by the master. So that could be your signal. Alternatively, with
>>> persistence, you could also more reliably depend on "/state.json" to get
>>> this info.
>>>
>>> To take a step back, what is the problem you are trying to solve?
>>>
>>> Thanks,
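
For anyone following along, here is a rough sketch of the approach discussed in the quoted thread: look up a stored framework ID in the master's "/state.json" output and decide whether to reuse it or register anew. This is only an illustration, not Mesos code. The function names are hypothetical, and apart from 'completed_frameworks' (mentioned by Vinod), the "frameworks" list and the "id" field are assumptions about the JSON layout. It works on already-parsed JSON, so fetching the endpoint and the actual registration call are left out.

```python
from typing import Optional

# The "1 week" failover timeout suggested in the thread, in seconds.
ONE_WEEK_SECS = 7 * 24 * 60 * 60


def framework_status(state: dict, framework_id: str) -> str:
    """Classify a framework ID against a parsed "/state.json" document.

    Assumes (not confirmed by the thread) that the document has a
    "frameworks" list and a "completed_frameworks" list whose entries
    carry an "id" field.
    """
    if any(f.get("id") == framework_id for f in state.get("frameworks", [])):
        return "active"
    if any(f.get("id") == framework_id
           for f in state.get("completed_frameworks", [])):
        return "completed"
    # The master may have failed over and lost this information,
    # so "unknown" does not prove the timeout has expired.
    return "unknown"


def id_to_register_with(state: dict, stored_id: Optional[str]) -> Optional[str]:
    """Return the framework ID to reuse, or None to register anew."""
    if stored_id is None:
        return None  # nothing stored at the known location
    if framework_status(state, stored_id) == "completed":
        return None  # past the failover timeout; start fresh
    # "active" or "unknown": try the stored ID first. The caller should
    # still fall back to registering anew if that registration fails.
    return stored_id
```

Because the master currently loses this information on failover, the "unknown" case deliberately keeps the stored ID and relies on the fallback David describes (register anew if registration fails) rather than treating a missing entry as proof of expiry.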

