Hi,
We have a production pipeline running a series of jobs, with each job
creating a custom mesos framework to execute all tasks related to that job.
Both scheduler and executor are written using the Python mesos API.
Here's a snippet (modified for brevity) of the scheduler code:
class MyMesosScheduler(mesos.Scheduler):
<...>
def run_jobs(self, jobs):
for job in jobs:
framework = mesos_pb2.FrameworkInfo()
<...>
driver = MesosSchedulerDriver(self, framework, Flags.mesos_master)
driver.start()
for task in job.generate_tasks():
<...>
# wait for all tasks to complete
driver.stop()
<...>
This usually works just fine, but sometimes the pipeline gets "stuck" on
the second framework, and I can see on the mesos dashboard that the first
framework is still "active":
[image: Inline image 1]
I know driver.stop() for the first framework was called and has returned
(from my logs, and from the fact that the following job started). I also
see this in the console where the scheduler is running:
I0514 07:05:16.819201 29486 sched.cpp:1286] Asked to stop the driver
*If I add time.sleep(0.1) after driver.stop() the problem disappears!*
I tried adding driver.join() after driver.stop(), but the behavior was the
same (the join() returned immediately).
I tried adding "del driver" after driver.stop(), but the behavior was the
same.
*So my questions are:*
- What is promised to me by the mesos API on return from "driver.stop()" ?
(I thought the promise is that the framework successfully stopped,
including stopping all executors)
- Is it safe to "recycle" the same instance of MyMesosScheduler for
multiple (consecutive, not overlapping) frameworks? (note that the driver
object is brand new for every framework)
- Any thoughts on the problem I'm describing, and potential solutions that
are not based on a voodoo sleep?
Thanks!
- Itamar.