Thanks for the bug report, Whitney. This looks like a long-standing bug that
is apparently rarely exercised. Here is the JIRA ticket to follow for the
fix: https://issues.apache.org/jira/browse/MESOS-1550

For posterity: when the MesosSchedulerDriver instance gets cleaned up by
the JVM garbage collector, we also delete any underlying C++ objects that we
created, but not before we call 'MesosSchedulerDriver.stop'. <-- Bug! We
should never call stop there, as that's what sends the 'unregister' request
to the master.

Short-term fix: don't null out your instance of the MesosSchedulerDriver;
keep a reference to it so that the garbage collector doesn't clean it up.
(This is likely the common pattern, and thus why this bug has lasted as long
as it has.)
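
One way to apply that workaround is to pin the driver in a static field, which
stays reachable for the life of the process. This is only a rough sketch:
DriverHolder and its methods are hypothetical names, not part of the Mesos API,
and the holder takes a plain Object so the snippet compiles without the Mesos
jar on the classpath.

```java
import java.util.concurrent.atomic.AtomicReference;

/**
 * Hypothetical helper (not part of Mesos): keeps a strong reference to the
 * scheduler driver so the JVM garbage collector never collects it.
 */
public final class DriverHolder {
    // A static field is a GC root: the object it points at stays reachable
    // for as long as this class is loaded.
    private static final AtomicReference<Object> DRIVER = new AtomicReference<>();

    private DriverHolder() {}

    /** Pins the driver. Returns false if a driver was already pinned. */
    public static boolean pin(Object driver) {
        if (driver == null) {
            throw new IllegalArgumentException("driver must not be null");
        }
        return DRIVER.compareAndSet(null, driver);
    }

    /** Returns the pinned driver, or null if none has been pinned yet. */
    public static Object get() {
        return DRIVER.get();
    }
}
```

In a framework you would call DriverHolder.pin(driver) right after constructing
the MesosSchedulerDriver, and keep calling driver.abort() on shutdown as before.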


On Fri, Jun 27, 2014 at 6:40 AM, Whitney Sorenson <[email protected]>
wrote:

> We've been running our Java framework for > 6 mos. now and today, for what
> I can tell is the first time, mesos shut down our framework:
>
> I0627 09:07:05.740335  4753 master.cpp:1034] Asked to unregister
> framework sy3x2
> I0627 09:07:05.740466  4753 master.cpp:2688] Removing framework sy3x2
>
> All executors running our framework promptly shut down all tasks.
>
> This happened during a deployment of our framework, in which the existing
> framework shuts down, generally with a driver.abort() call followed by the
> process exiting, which normally (and today) results in the log entries:
>
> I0627 09:07:04.926462  4755 master.cpp:1079] Deactivating framework sy3x2
> I0627 09:07:04.926609  4755 hierarchical_allocator_process.hpp:408]
> Deactivated framework sy3x2
>
> To complete the deployment, a new framework process starts and shortly
> calls driver.start(). We pass a very large framework timeout parameter in
> order to ensure this never happens:
>
> I0627 09:51:49.545934  4751 master.cpp:617] Giving framework sy3x2
> 1.65343915343915weeks to failover
>
> I have 2 questions:
>
> - How/why did the framework unregister? There are 0 calls to driver.stop()
> (after looking at SchedulerDriver again, I'm assuming this would accomplish
> the above) in our codebase (https://github.com/HubSpot/Singularity)
>
> - As a user, I don't think I'm even interested in this functionality being
> in Mesos. I've always figured setting a high framework timeout meant I was
> paying a cost that if I ever wanted to really shutdown my framework, I'd
> either have to wait 1.6 weeks, do some manual zookeeper manipulation, or
> simply start a new Mesos cluster - all of which are acceptable tradeoffs to
> me to avoid the possibility that Mesos shuts down the world. Assuming some
> frameworks still need this unregister functionality and at the same time -
> high framework timeouts - can we add a switch such that the framework can
> say whether or not it can be unregistered before framework timeout occurs?
>
> We are running 0.18.0.
>
> Thanks!
>
> -Whitney
>
>
>
>
