Re: Review Request 42519: Fixed race between SchedDriver.{stop(), abort()} and SchedDriver.join().

2016-01-22 Thread Kevin Klues


> On Jan. 22, 2016, 6:58 a.m., Ben Mahler wrote:
> > Logically looks good, just a couple of trivial comments and we can get this 
> > landed!

I was going to start fixing things up, but it looks like you took care of it 
for me.  Thanks!


- Kevin


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42519/#review115796
---


On Jan. 19, 2016, 10:58 p.m., Kevin Klues wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/42519/
> ---
> 
> (Updated Jan. 19, 2016, 10:58 p.m.)
> 
> 
> Review request for mesos, Ben Mahler and Greg Mann.
> 
> 
> Bugs: MESOS-4409
> https://issues.apache.org/jira/browse/MESOS-4409
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Previously, it was possible for join() to return before a schedDriver
> was actually fully stopped or aborted (breaking the semantics of the
> join() call). The race came from a short circuit in join(), which
> simply checked for status != DRIVER_RUNNING before returning. It appears
> this short circuit was introduced to handle cases where initialize() or
> start() ended up aborting before ever starting the driver to begin with.
> However, it unintentionally covers cases where stop() or abort() were
> called *after* the driver started running as well.
> 
> The problem is that stop() and abort() will change the status
> to DRIVER_STOPPED or DRIVER_ABORTED before actually processing
> dispatched stop or abort events (which happen asynchronously in a
> libprocess thread). Under normal operation, join() would wait for these
> events to trigger a latch that allowed the join() call to return.
> However, with the short circuit, join() exits immediately even if the
> libprocess thread hasn't yet processed the stop() or abort() events.
> 
> This commit fixes the semantics of the join() call to avoid this race.
> We considered removing the latch completely and replacing it with
> process.wait(), but, unlike the latch, this wouldn't ensure that stop()
> or abort() was ever called in the first place.
> 
> 
> Diffs
> -
> 
>   src/sched/sched.cpp 38940b7e2563a2156be2f8c228afdc27c45b6e83 
> 
> Diff: https://reviews.apache.org/r/42519/diff/
> 
> 
> Testing
> ---
> 
> Ran the entire 'make check' suite with no failures on both Mac OS X and 
> ubuntu 14.04.
> 
> 
> Thanks,
> 
> Kevin Klues
> 
>



Review Request 42519: Fixed race between SchedDriver.{stop(), abort()} and SchedDriver.join().

2016-01-19 Thread Kevin Klues

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42519/
---

Review request for mesos, Ben Mahler and Greg Mann.


Bugs: MESOS-4409
https://issues.apache.org/jira/browse/MESOS-4409


Repository: mesos


Description
---

Previously, it was possible for join() to return before a schedDriver
was actually fully stopped or aborted (breaking the semantics of the
join() call). The race came from a short circuit in join(), which
simply checked for status != DRIVER_RUNNING before returning. It appears
this short circuit was introduced to handle cases where initialize() or
start() ended up aborting before ever starting the driver to begin with.
However, it unintentionally covers cases where stop() or abort() were
called *after* the driver started running as well.

The problem is that stop() and abort() will change the status
to DRIVER_STOPPED or DRIVER_ABORTED before actually processing
dispatched stop or abort events (which happen asynchronously in a
libprocess thread). Under normal operation, join() would wait for these
events to trigger a latch that allowed the join() call to return.
However, with the short circuit, join() exits immediately even if the
libprocess thread hasn't yet processed the stop() or abort() events.

This commit fixes the semantics of the join() call to avoid this race.
We considered removing the latch completely and replacing it with
process.wait(), but, unlike the latch, this wouldn't ensure that stop()
or abort() was ever called in the first place.


Diffs
-

  src/sched/sched.cpp 38940b7e2563a2156be2f8c228afdc27c45b6e83 

Diff: https://reviews.apache.org/r/42519/diff/


Testing
---

Ran the entire 'make check' suite with no failures on both Mac OS X and ubuntu 
14.04.


Thanks,

Kevin Klues