We've been running our Java framework for more than 6 months now, and today, for what I believe is the first time, Mesos shut down our framework:
I0627 09:07:05.740335 4753 master.cpp:1034] Asked to unregister framework sy3x2
I0627 09:07:05.740466 4753 master.cpp:2688] Removing framework sy3x2

All executors running our framework promptly shut down all tasks.

This happened during a deployment of our framework, in which the existing framework shuts down, generally with a driver.abort() call followed by the process exiting. That normally (and today) results in the log entries:

I0627 09:07:04.926462 4755 master.cpp:1079] Deactivating framework sy3x2
I0627 09:07:04.926609 4755 hierarchical_allocator_process.hpp:408] Deactivated framework sy3x2

To complete the deployment, a new framework process starts and shortly calls driver.start(). We pass a very large framework timeout parameter to ensure the framework is never removed while it is down:

I0627 09:51:49.545934 4751 master.cpp:617] Giving framework sy3x2 1.65343915343915weeks to failover

I have 2 questions:

- How/why did the framework unregister? There are 0 calls to driver.stop() in our codebase (https://github.com/HubSpot/Singularity). After looking at SchedulerDriver again, I'm assuming driver.stop() is what would trigger the unregistration above.

- As a user, I don't think I'm even interested in this functionality being in Mesos. I've always figured that setting a high framework timeout meant I was paying a cost: if I ever really wanted to shut down my framework, I'd have to wait 1.6 weeks, do some manual ZooKeeper manipulation, or simply start a new Mesos cluster. All of those are acceptable tradeoffs to me to avoid the possibility that Mesos shuts down the world. Assuming some frameworks still need this unregister functionality and, at the same time, high framework timeouts, can we add a switch so that the framework can say whether or not it can be unregistered before the framework timeout occurs?

We are running 0.18.0.

Thanks!
-Whitney

