It would be good to call out explicitly in the SchedulerDriver docs that stop() is a cluster-wide framework shutdown, possibly renaming the method to something like killAllTasksAndUnregisterFramework to make this clear. The current JavaDoc at http://mesos.apache.org/api/latest/java/org/apache/mesos/SchedulerDriver.html#stop() doesn't point out the danger, and as a framework developer I'd assume that if I'm required to call start() (even though I failed over), I should call stop() as well (as opposed to stop(true)).
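The distinction at issue can be sketched as a toy model (plain Python, not Mesos code; `ToyMaster` and its fields are hypothetical names invented for illustration): stop() unregisters the framework, which makes the master remove it and kill all of its tasks cluster-wide, while stop(true) only disconnects the scheduler so tasks survive until the failover timeout.

```python
# Toy model (NOT Mesos code) of the stop() vs stop(true) semantics
# described in this thread. Names here are invented for illustration.
class ToyMaster:
    def __init__(self):
        # one registered framework with two running tasks
        self.frameworks = {"sy3x2": {"tasks": ["task-1", "task-2"], "active": True}}

    def unregister(self, fw_id):
        # master.cpp: "Asked to unregister framework" -> "Removing framework";
        # executors then shut down all of the framework's tasks
        fw = self.frameworks.pop(fw_id)
        fw["tasks"].clear()

    def deactivate(self, fw_id):
        # master.cpp: "Deactivating framework" -- tasks keep running
        self.frameworks[fw_id]["active"] = False

def stop(master, fw_id, failover=False):
    if failover:
        master.deactivate(fw_id)   # like stop(true): survive until failover timeout
    else:
        master.unregister(fw_id)   # like stop(): cluster-wide framework shutdown

m = ToyMaster()
stop(m, "sy3x2", failover=True)
assert "sy3x2" in m.frameworks         # framework still known to the master
assert m.frameworks["sy3x2"]["tasks"]  # tasks still running
```

With failover=False the same call would remove the framework and clear its tasks, which is exactly the "shut down the world" behavior the docs should warn about.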
On Fri, Jun 27, 2014 at 12:21 PM, Vinod Kone <[email protected]> wrote:

> Perhaps we should call this out explicitly when we back port and do bug
> fix releases (0.18.0 and 0.19.0) and urge people to upgrade, lest this
> gets drowned out in the noise.
>
> On Fri, Jun 27, 2014 at 11:40 AM, Benjamin Hindman <[email protected]> wrote:
>
>> Thanks for the bug report, Whitney; this looks like a long-standing bug
>> that apparently is rarely exercised. Here is the JIRA ticket to follow
>> for the fix: https://issues.apache.org/jira/browse/MESOS-1550
>>
>> For posterity: when the MesosSchedulerDriver instance gets cleaned up by
>> the JVM garbage collector, we also delete any underlying C++ objects that
>> we created, but not before we call 'MesosSchedulerDriver.stop'. <-- Bug!
>> We should never call stop, as that's what sends the 'unregister' request
>> to the master.
>>
>> Short-term fix: don't bother nulling out your instance of the
>> MesosSchedulerDriver, so that the garbage collector doesn't clean it up.
>> (This is likely the common pattern, and thus why this bug has lasted as
>> long as it has.)
>>
>> On Fri, Jun 27, 2014 at 6:40 AM, Whitney Sorenson <[email protected]> wrote:
>>
>>> We've been running our Java framework for > 6 months now, and today,
>>> for what I can tell is the first time, Mesos shut down our framework:
>>>
>>> I0627 09:07:05.740335 4753 master.cpp:1034] Asked to unregister framework sy3x2
>>> I0627 09:07:05.740466 4753 master.cpp:2688] Removing framework sy3x2
>>>
>>> All executors running our framework promptly shut down all tasks.
>>> This happened during a deployment of our framework, in which the
>>> existing framework shuts down, generally with a driver.abort() call
>>> followed by the process exiting, which normally (and today) results in
>>> the log entries:
>>>
>>> I0627 09:07:04.926462 4755 master.cpp:1079] Deactivating framework sy3x2
>>> I0627 09:07:04.926609 4755 hierarchical_allocator_process.hpp:408] Deactivated framework sy3x2
>>>
>>> To complete the deployment, a new framework process starts and shortly
>>> calls driver.start(). We pass a very large framework timeout parameter
>>> in order to ensure this never happens:
>>>
>>> I0627 09:51:49.545934 4751 master.cpp:617] Giving framework sy3x2 1.65343915343915weeks to failover
>>>
>>> I have two questions:
>>>
>>> - How/why did the framework unregister? There are zero calls to
>>> driver.stop() (after looking at SchedulerDriver again, I'm assuming this
>>> is what would accomplish the above) in our codebase
>>> (https://github.com/HubSpot/Singularity).
>>>
>>> - As a user, I don't think I'm even interested in this functionality
>>> being in Mesos. I've always figured that setting a high framework
>>> timeout meant I was paying a cost: if I ever wanted to really shut down
>>> my framework, I'd either have to wait 1.6 weeks, do some manual
>>> ZooKeeper manipulation, or simply start a new Mesos cluster, all of
>>> which are acceptable tradeoffs to me to avoid the possibility that
>>> Mesos shuts down the world. Assuming some frameworks still need this
>>> unregister functionality and, at the same time, high framework
>>> timeouts, can we add a switch so that the framework can say whether or
>>> not it can be unregistered before the framework timeout occurs?
>>>
>>> We are running 0.18.0.
>>>
>>> Thanks!
>>>
>>> -Whitney
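On the odd-looking "1.65343915343915weeks" in the master log: FrameworkInfo.failover_timeout is specified in seconds, and the master just pretty-prints it in larger units. The figure is consistent with a configured timeout of 1,000,000 seconds (that exact value is an inference from the log line, not stated anywhere in the thread):

```python
# FrameworkInfo.failover_timeout is given in seconds; the master log
# pretty-prints it in weeks. A timeout of 1,000,000 seconds (inferred
# from the log line above; an assumption, not stated in the thread):
SECONDS_PER_WEEK = 7 * 24 * 60 * 60   # 604800
failover_timeout = 1_000_000.0        # seconds
weeks = failover_timeout / SECONDS_PER_WEEK
# matches the "1.65343915343915weeks" figure printed by master.cpp
assert abs(weeks - 1.65343915343915) < 1e-12
```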
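The finalizer hazard Ben describes can be modeled in a few lines (a toy CPython sketch, not the real JNI bindings; `ToyDriver` is a hypothetical stand-in): when the last reference to the driver object drops, its finalizer runs the buggy cleanup path, which in pre-fix Mesos effectively called stop() and sent the unregister. Keeping a strong reference, as the short-term fix suggests, means the finalizer never runs during the process lifetime.

```python
# Toy model (NOT the real Mesos JNI bindings) of the MESOS-1550 hazard:
# the native cleanup invoked during finalization effectively called stop().
sent = []

class ToyDriver:
    def stop(self):
        sent.append("unregister")  # what stop() asks the master to do

    def __del__(self):
        # models the buggy cleanup path: finalization calls stop()
        self.stop()

driver = ToyDriver()
driver = None  # nulling out the last reference lets the collector finalize it
# CPython's reference counting runs __del__ immediately here
assert sent == ["unregister"]  # the framework was unregistered by accident

# The short-term fix from the thread: don't null out the driver reference,
# so the finalizer (and the accidental unregister) never fires.
```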

