I think you are on the right track here.

I would recommend setting a high failover timeout that is an upper bound
on how long all of your schedulers might be down (e.g., 1 week). This way,
even if all your scheduler instances are down due to an outage or
maintenance, your tasks/services keep running in the Mesos cluster.
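To make that concrete, here is a minimal sketch. The `failover_timeout` field and its unit (seconds) come from Mesos' FrameworkInfo message; the dict-based configuration and the framework name are purely illustrative stand-ins for whatever scheduler bindings you use:

```python
# failover_timeout in Mesos' FrameworkInfo is expressed in seconds,
# so a one-week upper bound works out to:
ONE_WEEK_SECONDS = 7 * 24 * 60 * 60  # 604800 seconds

# Illustrative stand-in for the FrameworkInfo you would pass to the
# scheduler driver when registering (the name is hypothetical):
framework_info = {
    "name": "my-ha-framework",            # hypothetical framework name
    "failover_timeout": ONE_WEEK_SECONDS, # upper bound on scheduler downtime
}

print(framework_info["failover_timeout"])
```

As long as a replacement scheduler re-registers with the same framework ID within this window, the master keeps the framework's tasks running.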


On Fri, Apr 18, 2014 at 5:02 AM, David Greenberg <[email protected]> wrote:

> Hey Vinod,
> The problem I'm trying to solve is writing a framework that can run on our
> HA application cluster, so that whenever the framework's current scheduler
> dies, another node will be elected and take over. I'm trying to work
> through the various failure modes to understand how to implement this so
> that it handles all the failure cases I can think of.
>
> It sounds like the solution that'd work best for me would be to try to
> read the framework ID from a known location and register with that. If it's
> not there, or if registration fails, then the framework should register
> anew.
>
> This framework's state is very large, and resides in a couple of
> databases, so even if the entire set of candidates for becoming the
> framework is down for the whole failover grace period, the framework still
> wants to register, since its state never gets invalidated.
>
> Thanks,
> David
>
>
> On Thursday, April 17, 2014, Vinod Kone <[email protected]> wrote:
>
>>
>> On Thu, Apr 17, 2014 at 2:56 PM, David Greenberg
>> <[email protected]> wrote:
>>
>>> My follow-up question is this: is there a way to tell whether I'm
>>> outside of the timeout window? I'd like to have my framework check ZK
>>> and determine whether it's within the framework timeout or not, so that
>>> it can make the correct call.
>>>
>>
>> Hey David,
>>
>> Currently, the only signal you can get is by hitting the "/state.json"
>> endpoint on the master. The framework should have been moved to
>> 'completed_frameworks' after the failover timeout. Of course, if the
>> master fails over, this information is lost, so you can't reliably depend
>> on it.
>>
>> When the master starts storing persistent state about frameworks (likely
>> a couple of releases away), a re-registration attempt in such a case
>> would be denied by the master. So that could be your signal.
>> Alternatively, with persistence, you could also more reliably depend on
>> "/state.json" to get this info.
>>
>> To take a step back, what is the problem you are trying to solve?
>>
>> Thanks,
>>
>
