It would be good to call out explicitly in the SchedulerDriver docs that
stop() is a cluster-wide framework shutdown, perhaps even renaming the method
to something like killAllTasksAndUnregisterFramework to make that clear. The
current JavaDoc at
http://mesos.apache.org/api/latest/java/org/apache/mesos/SchedulerDriver.html#stop()
doesn't point out the danger, and as a framework developer I'd assume that if
I'm required to call start() (even after failing over) I should call stop()
as well (as opposed to stop(true)).
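
The distinction can be sketched with a tiny stand-in for the relevant slice of the API. DriverShutdown and the string bookkeeping below are hypothetical illustrations, not Mesos code; the real org.apache.mesos.SchedulerDriver does expose both stop() and stop(boolean failover) with these semantics:

```java
// Hypothetical stand-in for the two shutdown paths on SchedulerDriver.
interface DriverShutdown {
    void stop();                 // unregister: master removes the framework and kills ALL its tasks
    void stop(boolean failover); // stop(true): disconnect but stay registered until the failover timeout
}

public class ShutdownDemo {
    // Stands in for the request the driver would send to the master.
    static String lastRequest = "none";

    static final DriverShutdown driver = new DriverShutdown() {
        public void stop() { lastRequest = "unregister"; }
        public void stop(boolean failover) {
            lastRequest = failover ? "deactivate" : "unregister";
        }
    };

    public static void main(String[] args) {
        // During a rolling deploy that relies on the failover timeout, the old
        // scheduler should step down with stop(true), never plain stop().
        driver.stop(true);
        System.out.println(lastRequest); // deactivate
    }
}
```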
On Fri, Jun 27, 2014 at 12:21 PM, Vinod Kone <[email protected]> wrote:

> Perhaps we should call this out explicitly when we back-port and do bug
> fix releases (0.18.0 and 0.19.0) and urge people to upgrade, lest it get
> drowned out in the noise.
>
>
> On Fri, Jun 27, 2014 at 11:40 AM, Benjamin Hindman <
> [email protected]> wrote:
>
>> Thanks for the bug report, Whitney; this looks like a long-standing bug
>> that apparently is rarely exercised. Here is the JIRA ticket to follow for
>> the fix: https://issues.apache.org/jira/browse/MESOS-1550
>>
>> For posterity: when the MesosSchedulerDriver instance gets cleaned up by
>> the JVM garbage collector, we also delete any underlying C++ objects that
>> we created, but not before calling 'MesosSchedulerDriver.stop'. <-- Bug!
>> We should never call stop there, as that's what sends the 'unregister'
>> request to the master.
>>
>> Short-term fix: don't null out your instance of the MesosSchedulerDriver,
>> so that the garbage collector doesn't clean it up. (This is likely the
>> common pattern anyway, which is why this bug has lasted as long as it has.)
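
The short-term fix above can be sketched like this. FakeDriver is a hypothetical stand-in for MesosSchedulerDriver, with finalize() playing the role of the buggy cleanup path; the point is simply that a live strong reference keeps the GC from ever running it:

```java
// Sketch of the workaround, using a hypothetical stand-in class.
public class DriverHolder {
    static class FakeDriver {
        static volatile boolean finalized = false;
        // In the buggy version, finalization is what ends up calling stop().
        @Override protected void finalize() { finalized = true; }
    }

    // Strong, process-lifetime reference: as long as this is never nulled out,
    // the GC can't finalize the driver and the buggy stop() never fires.
    static final FakeDriver DRIVER = new FakeDriver();

    public static void main(String[] args) throws InterruptedException {
        System.gc();
        Thread.sleep(100); // give any pending finalization a chance to run
        System.out.println(FakeDriver.finalized); // false: DRIVER is still reachable
    }
}
```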
>>
>>
>> On Fri, Jun 27, 2014 at 6:40 AM, Whitney Sorenson <[email protected]>
>> wrote:
>>
>>> We've been running our Java framework for > 6 months now, and today, from
>>> what I can tell for the first time, Mesos shut down our framework:
>>>
>>> I0627 09:07:05.740335  4753 master.cpp:1034] Asked to unregister
>>> framework sy3x2
>>> I0627 09:07:05.740466  4753 master.cpp:2688] Removing framework sy3x2
>>>
>>> All executors running our framework promptly shut down all tasks.
>>>
>>> This happened during a deployment of our framework, in which the
>>> existing framework shuts down, generally with a driver.abort() call
>>> followed by the process exiting, which normally (and today) results in the
>>> log entries:
>>>
>>> I0627 09:07:04.926462  4755 master.cpp:1079] Deactivating framework sy3x2
>>> I0627 09:07:04.926609  4755 hierarchical_allocator_process.hpp:408]
>>> Deactivated framework sy3x2
>>>
>>> To complete the deployment, a new framework process starts and shortly
>>> calls driver.start(). We pass a very large framework timeout parameter in
>>> order to ensure this never happens:
>>>
>>> I0627 09:51:49.545934  4751 master.cpp:617] Giving framework sy3x2
>>> 1.65343915343915weeks to failover
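
As an aside, the odd-looking figure in that log line is just the failover timeout rendered in weeks: it works out to a round 1,000,000 seconds, a plausible value for the failover timeout in FrameworkInfo, which is specified in seconds (the exact value passed is my assumption, inferred from the arithmetic):

```java
public class FailoverTimeout {
    public static void main(String[] args) {
        double seconds = 1_000_000.0;             // assumed failover timeout, in seconds
        double weeks = seconds / (7 * 24 * 3600); // 604800 seconds per week
        System.out.println(weeks);                // ~1.65343915343915 weeks, matching the log
    }
}
```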
>>>
>>> I have 2 questions:
>>>
>>> - How/why did the framework unregister? There are zero calls to
>>> driver.stop() in our codebase (https://github.com/HubSpot/Singularity);
>>> after looking at SchedulerDriver again, I'm assuming that's what would
>>> accomplish the above.
>>>
>>> - As a user, I don't think I'm even interested in this functionality
>>> being in Mesos. I've always figured that setting a high framework timeout
>>> came at a cost: if I ever really wanted to shut down my framework, I'd
>>> either have to wait 1.6 weeks, do some manual ZooKeeper manipulation, or
>>> simply start a new Mesos cluster - all acceptable tradeoffs to me to
>>> avoid the possibility of Mesos shutting down the world. Assuming some
>>> frameworks still need this unregister functionality together with high
>>> framework timeouts, can we add a switch so that a framework can declare
>>> whether or not it may be unregistered before the framework timeout
>>> occurs?
>>>
>>> We are running 0.18.0.
>>>
>>> Thanks!
>>>
>>> -Whitney
>>>
>>>
>>>
>>>
>>
>
