We've been running our Java framework for more than 6 months now, and
today, for what I believe is the first time, Mesos shut down our framework:

I0627 09:07:05.740335  4753 master.cpp:1034] Asked to unregister framework sy3x2
I0627 09:07:05.740466  4753 master.cpp:2688] Removing framework sy3x2

All executors running our framework promptly shut down all tasks.

This happened during a deployment of our framework, in which the existing
framework shuts down, generally with a driver.abort() call followed by the
process exiting, which normally (and today) results in the log entries:

I0627 09:07:04.926462  4755 master.cpp:1079] Deactivating framework sy3x2
I0627 09:07:04.926609  4755 hierarchical_allocator_process.hpp:408] Deactivated framework sy3x2

To complete the deployment, a new framework process starts and shortly
afterwards calls driver.start(). We pass a very large framework timeout
parameter specifically so that the framework is never removed in between:

I0627 09:51:49.545934  4751 master.cpp:617] Giving framework sy3x2 1.65343915343915weeks to failover
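
For reference, the weeks figure in that log line works out to roughly
1,000,000 seconds. The snippet below is only a minimal sketch of that
conversion: the class and method names are hypothetical, and it assumes
the master is simply printing the failover timeout (which FrameworkInfo
takes in seconds) in units of weeks.

```java
// Sketch only: converts between the weeks value printed in the master log
// and seconds. FailoverTimeout and its methods are hypothetical names,
// not part of the Mesos API.
final class FailoverTimeout {
    // 7 days * 24 hours * 60 minutes * 60 seconds
    static final long SECONDS_PER_WEEK = 7L * 24 * 60 * 60;

    // Weeks (as shown in the log) to seconds.
    static double weeksToSeconds(double weeks) {
        return weeks * SECONDS_PER_WEEK;
    }

    // Seconds (as configured on the framework) to weeks.
    static double secondsToWeeks(double seconds) {
        return seconds / SECONDS_PER_WEEK;
    }

    public static void main(String[] args) {
        double loggedWeeks = 1.65343915343915; // value from the log line above
        System.out.println(Math.round(weeksToSeconds(loggedWeeks)) + " seconds");
    }
}
```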

I have 2 questions:

- How/why did the framework unregister? Our codebase
(https://github.com/HubSpot/Singularity) contains no calls to
driver.stop(), which, after looking at SchedulerDriver again, I assume is
what would trigger an unregister.

- As a user, I don't think I'm even interested in this functionality being
in Mesos. I've always figured that setting a high framework timeout meant
accepting a cost: if I ever really wanted to shut down my framework, I'd
either have to wait 1.6 weeks, do some manual ZooKeeper manipulation, or
simply start a new Mesos cluster. All of these are acceptable tradeoffs to
me to avoid the possibility that Mesos shuts down the world. Assuming some
frameworks still need both this unregister functionality and high framework
timeouts, can we add a switch that lets a framework say whether or not it
can be unregistered before its framework timeout expires?

We are running 0.18.0.

Thanks!

-Whitney
