I think you are on the right track here. I would recommend setting a high failover timeout that is an upper bound on how long all of your scheduler instances might be down (e.g., 1 week). This way, even if all your scheduler instances are down due to an outage or maintenance, your tasks/services keep running in the Mesos cluster.
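To illustrate, here is a minimal sketch of that registration logic. The field names mirror Mesos's FrameworkInfo protobuf (`failover_timeout` is in seconds), but this uses plain dicts rather than the real bindings, and the framework name and stored-ID source are hypothetical:

```python
# Sketch: build registration info for a framework that should survive
# scheduler outages. Field names mirror Mesos's FrameworkInfo protobuf
# (failover_timeout is expressed in seconds), but this is a plain-dict
# sketch, not the real scheduler driver API.

ONE_WEEK_SECONDS = 7 * 24 * 60 * 60  # upper bound on scheduler downtime

def build_framework_info(stored_id):
    """stored_id: framework ID previously persisted in a known location
    (e.g. ZooKeeper), or None if none has been stored yet."""
    info = {
        "name": "my-ha-framework",  # hypothetical name
        "failover_timeout": ONE_WEEK_SECONDS,
    }
    if stored_id is not None:
        # Re-register under the existing ID so running tasks are kept.
        info["id"] = stored_id
    # With no "id" set, the master treats this as a brand-new framework.
    return info
```

If registration under the stored ID is refused, the caller would fall back to registering anew, as described below.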
On Fri, Apr 18, 2014 at 5:02 AM, David Greenberg <[email protected]> wrote:

> Hey Vinod,
> The problem I'm trying to solve is writing a framework that can run on our
> HA application cluster, so that whenever the framework's current scheduler
> dies, another node will be elected and take over. I'm trying to work
> through the various failure cases to understand how to implement this so
> that it works through all the failure cases I can think of.
>
> It sounds like the solution that'd work best for me would be to try to
> read the framework ID from a known location and register with that. If
> it's not there, or if registration fails, then the framework should
> register anew.
>
> This framework's state is very large and resides in a couple of databases,
> so even if the entire set of candidates for becoming the framework is down
> for the whole failover grace period, the framework still wants to
> register, since its state never gets invalidated.
>
> Thanks,
> David
>
> On Thursday, April 17, 2014, Vinod Kone <[email protected]> wrote:
>
>> On Thu, Apr 17, 2014 at 2:56 PM, David Greenberg <[email protected]> wrote:
>>
>>> My follow-up question is this: is there a way to tell whether I'm
>>> outside of the timeout window? I'd like to have my framework check ZK
>>> and determine whether it's within the framework timeout or not, so that
>>> it can make the correct call.
>>
>> Hey David,
>>
>> Currently, the only signal you can get is by hitting the "/state.json"
>> endpoint on the master. The framework should have been moved to
>> 'completed_frameworks' after the failover timeout. Of course, if a master
>> fails over, this information is lost, so you can't reliably depend on it.
>>
>> When the master starts storing persistent state about frameworks (likely
>> a couple of releases away), a re-registration attempt in such a case
>> would be denied by the master. So that could be your signal.
>> Alternatively, with persistence, you could also more reliably depend on
>> "/state.json" to get this info.
>>
>> To take a step back, what is the problem you are trying to solve?
>>
>> Thanks,
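As a rough sketch of the "/state.json" check discussed above: fetch the master's state and look for the framework ID under "completed_frameworks". The exact response layout here is an assumption based on the discussion (and, as noted, the signal is unreliable across a master failover); the payload below is a stubbed example rather than a live HTTP call:

```python
import json

def failover_timeout_expired(state_json_text, framework_id):
    """Return True if the master has moved framework_id into
    'completed_frameworks', i.e. the failover timeout has elapsed.
    Caveat from the thread: if the master itself failed over, this
    information is lost, so the check cannot be fully relied upon."""
    state = json.loads(state_json_text)
    completed = state.get("completed_frameworks", [])
    return any(fw.get("id") == framework_id for fw in completed)

# Stubbed /state.json payload for illustration (field layout assumed):
sample = json.dumps({
    "frameworks": [{"id": "fw-active"}],
    "completed_frameworks": [{"id": "fw-expired"}],
})
```

In practice the scheduler would GET this JSON from the leading master before deciding whether to re-register with its stored ID or register anew.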

