Re: Review Request 65482: Improved handling of non-terminal operations after master failover.

2018-03-02 Thread Benjamin Bannier

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65482/
---

(Updated March 2, 2018, 10:46 a.m.)


Review request for mesos, Greg Mann, Jie Yu, and Jan Schlicht.


Changes
---

Fixed comment language.


Bugs: MESOS-8536
https://issues.apache.org/jira/browse/MESOS-8536


Repository: mesos


Description
---

This patch fixes the handling of non-terminal operations learned by a
newly elected master after a master failover, so that only these
operations are counted as using resources. Previously we did not count
any operations as using resources which by accident produced expected
behavior if the operation was already terminal when the master learned
about them.

We do not address the issue of being unable to properly account for
operations triggered by frameworks unknown to the master, see
MESOS-8582. Instead we emit a warning for now since the master might
continue to abort due to assertion failures due to incomplete resource
accounting.


Diffs (updated)
-

  src/master/master.cpp 9cea7bb6a6ee8bf5f4226d07111bcfa6f5d3a88c 


Diff: https://reviews.apache.org/r/65482/diff/5/

Changes: https://reviews.apache.org/r/65482/diff/4-5/


Testing
---

`make check`, also tested with a version of the test added in r/65045 which 
triggered this issue.


Thanks,

Benjamin Bannier



Re: Review Request 65482: Improved handling of non-terminal operations after master failover.

2018-03-01 Thread Greg Mann

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65482/#review198470
---


Fix it, then Ship it!





src/master/master.cpp
Lines 7642 (patched)


s/ways/ways of/


- Greg Mann


On Feb. 16, 2018, 2:12 p.m., Benjamin Bannier wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65482/
> ---
> 
> (Updated Feb. 16, 2018, 2:12 p.m.)
> 
> 
> Review request for mesos, Greg Mann, Jie Yu, and Jan Schlicht.
> 
> 
> Bugs: MESOS-8536
> https://issues.apache.org/jira/browse/MESOS-8536
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> This patch fixes the handling of non-terminal operations learned by a
> newly elected master after a master failover, so that only these
> operations are counted as using resources. Previously we did not count
> any operations as using resources which by accident produced expected
> behavior if the operation was already terminal when the master learned
> about them.
> 
> We do not address the issue of being unable to properly account for
> operations triggered by frameworks unknown to the master, see
> MESOS-8582. Instead we emit a warning for now since the master might
> continue to abort due to assertion failures due to incomplete resource
> accounting.
> 
> 
> Diffs
> -
> 
>   src/master/master.cpp b06d7a6e2fbbb81b97eaf537d5b6745c73dc867d 
> 
> 
> Diff: https://reviews.apache.org/r/65482/diff/4/
> 
> 
> Testing
> ---
> 
> `make check`, also tested with a version of the test added in r/65045 which 
> triggered this issue.
> 
> 
> Thanks,
> 
> Benjamin Bannier
> 
>



Re: Review Request 65482: Improved handling of non-terminal operations after master failover.

2018-03-01 Thread Greg Mann


> On Feb. 15, 2018, 11:55 p.m., Greg Mann wrote:
> > src/master/master.cpp
> > Lines 7647-7654 (patched)
> > 
> >
> > I'm sitting here trying to think of ways we might avoid crashing if the 
> > framework subscribes before the operation becomes terminal...
> > 
> > Would it be reasonable to add an `if (framework == nullptr)` check to 
> > `updateOperation()` so that we only recover resources if the framework is 
> > known to the master?
> 
> Greg Mann wrote:
> Er... wait that doesn't make sense :) I guess when we receive the 
> operation update, we have no way of knowing whether or not the framework had 
> subscribed when the master learned about the pending operation. As a 
> workaround for now, we could store in a set the operation UUIDs of operations 
> for which we do not track allocated resources (i.e., operations which hit 
> this block of code). Then, in `updateOperation` we could avoid recovering 
> resources if the operation's UUID is in the set?
> 
> Benjamin Bannier wrote:
> No matter what we do here, we will already have entered dangerous 
> territory with `CHECK` failures looming as soon as we added such an operation 
> to master state, since we cannot update the allocator on these used resources 
> (update the allocation to reflect operation results, or recover). We might 
> also end up unknowingly oversubscribing the resources used by the operation.
> 
> I am unsure whether working around this by e.g., no updating the 
> allocator when the operation becomes terminal is a good way forward since it 
> seems to delay the needed master failover for even longer. With the approach 
> I proposed we would failover when this operation gets terminal (i.e., when we 
> can safely reconcile this particular operation).
> 
> I added the framework ID to the log output since it might be useful for 
> operators debugging such scenarios.

I think I agree - let's merge this improvement for now and try to resolve 
MESOS-8582 for Mesos 1.6.


- Greg


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65482/#review197641
---


On Feb. 16, 2018, 2:12 p.m., Benjamin Bannier wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65482/
> ---
> 
> (Updated Feb. 16, 2018, 2:12 p.m.)
> 
> 
> Review request for mesos, Greg Mann, Jie Yu, and Jan Schlicht.
> 
> 
> Bugs: MESOS-8536
> https://issues.apache.org/jira/browse/MESOS-8536
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> This patch fixes the handling of non-terminal operations learned by a
> newly elected master after a master failover, so that only these
> operations are counted as using resources. Previously we did not count
> any operations as using resources which by accident produced expected
> behavior if the operation was already terminal when the master learned
> about them.
> 
> We do not address the issue of being unable to properly account for
> operations triggered by frameworks unknown to the master, see
> MESOS-8582. Instead we emit a warning for now since the master might
> continue to abort due to assertion failures due to incomplete resource
> accounting.
> 
> 
> Diffs
> -
> 
>   src/master/master.cpp b06d7a6e2fbbb81b97eaf537d5b6745c73dc867d 
> 
> 
> Diff: https://reviews.apache.org/r/65482/diff/4/
> 
> 
> Testing
> ---
> 
> `make check`, also tested with a version of the test added in r/65045 which 
> triggered this issue.
> 
> 
> Thanks,
> 
> Benjamin Bannier
> 
>



Re: Review Request 65482: Improved handling of non-terminal operations after master failover.

2018-02-16 Thread Benjamin Bannier


> On Feb. 3, 2018, 12:06 a.m., Greg Mann wrote:
> > src/master/master.cpp
> > Lines 7594-7597 (original), 7609-7612 (patched)
> > 
> >
> > Is this function now only called with resources from already-existing 
> > frameworks? Does that mean that we can update the code here? 
> > https://github.com/apache/mesos/blob/5a1433576eca20f15f1ea309fc202f4bbaf3b6c7/src/master/allocator/mesos/hierarchical.cpp#L690-L703
> 
> Benjamin Bannier wrote:
> I am not sure we can make sure this will always happen in general. As a 
> precaution I think it is still a good idea to leave this check in the 
> allocator, if only to document allocator-local invariants.

Dropping this.


- Benjamin


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65482/#review196752
---


On Feb. 16, 2018, 3:12 p.m., Benjamin Bannier wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65482/
> ---
> 
> (Updated Feb. 16, 2018, 3:12 p.m.)
> 
> 
> Review request for mesos, Greg Mann, Jie Yu, and Jan Schlicht.
> 
> 
> Bugs: MESOS-8536
> https://issues.apache.org/jira/browse/MESOS-8536
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> This patch fixes the handling of non-terminal operations learned by a
> newly elected master after a master failover, so that only these
> operations are counted as using resources. Previously we did not count
> any operations as using resources which by accident produced expected
> behavior if the operation was already terminal when the master learned
> about them.
> 
> We do not address the issue of being unable to properly account for
> operations triggered by frameworks unknown to the master, see
> MESOS-8582. Instead we emit a warning for now since the master might
> continue to abort due to assertion failures due to incomplete resource
> accounting.
> 
> 
> Diffs
> -
> 
>   src/master/master.cpp b06d7a6e2fbbb81b97eaf537d5b6745c73dc867d 
> 
> 
> Diff: https://reviews.apache.org/r/65482/diff/4/
> 
> 
> Testing
> ---
> 
> `make check`, also tested with a version of the test added in r/65045 which 
> triggered this issue.
> 
> 
> Thanks,
> 
> Benjamin Bannier
> 
>



Re: Review Request 65482: Improved handling of non-terminal operations after master failover.

2018-02-16 Thread Benjamin Bannier


> On Feb. 16, 2018, 1:13 a.m., Gaston Kleiman wrote:
> > src/master/master.cpp
> > Lines 7652 (patched)
> > 
> >
> > s/   "operation/<< "operation/

I don't think we need or want an extra function call here. I ended up moving 
the space from the previous line to this one though.


- Benjamin


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65482/#review197651
---


On Feb. 16, 2018, 3:12 p.m., Benjamin Bannier wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65482/
> ---
> 
> (Updated Feb. 16, 2018, 3:12 p.m.)
> 
> 
> Review request for mesos, Greg Mann, Jie Yu, and Jan Schlicht.
> 
> 
> Bugs: MESOS-8536
> https://issues.apache.org/jira/browse/MESOS-8536
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> This patch fixes the handling of non-terminal operations learned by a
> newly elected master after a master failover, so that only these
> operations are counted as using resources. Previously we did not count
> any operations as using resources which by accident produced expected
> behavior if the operation was already terminal when the master learned
> about them.
> 
> We do not address the issue of being unable to properly account for
> operations triggered by frameworks unknown to the master, see
> MESOS-8582. Instead we emit a warning for now since the master might
> continue to abort due to assertion failures due to incomplete resource
> accounting.
> 
> 
> Diffs
> -
> 
>   src/master/master.cpp b06d7a6e2fbbb81b97eaf537d5b6745c73dc867d 
> 
> 
> Diff: https://reviews.apache.org/r/65482/diff/4/
> 
> 
> Testing
> ---
> 
> `make check`, also tested with a version of the test added in r/65045 which 
> triggered this issue.
> 
> 
> Thanks,
> 
> Benjamin Bannier
> 
>



Re: Review Request 65482: Improved handling of non-terminal operations after master failover.

2018-02-16 Thread Benjamin Bannier


> On Feb. 16, 2018, 12:55 a.m., Greg Mann wrote:
> > src/master/master.cpp
> > Lines 7645 (patched)
> > 
> >
> > s/`RunTaskMessage`, see/`RunTaskMessage`. See/

Hmm ... I am not a native speaker, but wouldn't starting an extra sentence here 
needlessly inflate this comment (e.g., we would likely want some verb as well 
then)? We seem to use this pattern elsewhere already,

% git grep ', see' src | wc -l
77


> On Feb. 16, 2018, 12:55 a.m., Greg Mann wrote:
> > src/master/master.cpp
> > Lines 7647-7654 (patched)
> > 
> >
> > I'm sitting here trying to think of ways we might avoid crashing if the 
> > framework subscribes before the operation becomes terminal...
> > 
> > Would it be reasonable to add an `if (framework == nullptr)` check to 
> > `updateOperation()` so that we only recover resources if the framework is 
> > known to the master?
> 
> Greg Mann wrote:
> Er... wait that doesn't make sense :) I guess when we receive the 
> operation update, we have no way of knowing whether or not the framework had 
> subscribed when the master learned about the pending operation. As a 
> workaround for now, we could store in a set the operation UUIDs of operations 
> for which we do not track allocated resources (i.e., operations which hit 
> this block of code). Then, in `updateOperation` we could avoid recovering 
> resources if the operation's UUID is in the set?

No matter what we do here, we will already have entered dangerous territory 
with `CHECK` failures looming as soon as we added such an operation to master 
state, since we cannot update the allocator on these used resources (update the 
allocation to reflect operation results, or recover). We might also end up 
unknowingly oversubscribing the resources used by the operation.

I am unsure whether working around this by e.g., no updating the allocator when 
the operation becomes terminal is a good way forward since it seems to delay 
the needed master failover for even longer. With the approach I proposed we 
would failover when this operation gets terminal (i.e., when we can safely 
reconcile this particular operation).

I added the framework ID to the log output since it might be useful for 
operators debugging such scenarios.


- Benjamin


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65482/#review197641
---


On Feb. 16, 2018, 3:12 p.m., Benjamin Bannier wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65482/
> ---
> 
> (Updated Feb. 16, 2018, 3:12 p.m.)
> 
> 
> Review request for mesos, Greg Mann, Jie Yu, and Jan Schlicht.
> 
> 
> Bugs: MESOS-8536
> https://issues.apache.org/jira/browse/MESOS-8536
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> This patch fixes the handling of non-terminal operations learned by a
> newly elected master after a master failover, so that only these
> operations are counted as using resources. Previously we did not count
> any operations as using resources which by accident produced expected
> behavior if the operation was already terminal when the master learned
> about them.
> 
> We do not address the issue of being unable to properly account for
> operations triggered by frameworks unknown to the master, see
> MESOS-8582. Instead we emit a warning for now since the master might
> continue to abort due to assertion failures due to incomplete resource
> accounting.
> 
> 
> Diffs
> -
> 
>   src/master/master.cpp b06d7a6e2fbbb81b97eaf537d5b6745c73dc867d 
> 
> 
> Diff: https://reviews.apache.org/r/65482/diff/4/
> 
> 
> Testing
> ---
> 
> `make check`, also tested with a version of the test added in r/65045 which 
> triggered this issue.
> 
> 
> Thanks,
> 
> Benjamin Bannier
> 
>



Re: Review Request 65482: Improved handling of non-terminal operations after master failover.

2018-02-16 Thread Benjamin Bannier

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65482/
---

(Updated Feb. 16, 2018, 3:12 p.m.)


Review request for mesos, Greg Mann, Jie Yu, and Jan Schlicht.


Changes
---

Fixed comment issues pointed out by Greg; improved logging.


Bugs: MESOS-8536
https://issues.apache.org/jira/browse/MESOS-8536


Repository: mesos


Description
---

This patch fixes the handling of non-terminal operations learned by a
newly elected master after a master failover, so that only these
operations are counted as using resources. Previously we did not count
any operations as using resources which by accident produced expected
behavior if the operation was already terminal when the master learned
about them.

We do not address the issue of being unable to properly account for
operations triggered by frameworks unknown to the master, see
MESOS-8582. Instead we emit a warning for now since the master might
continue to abort due to assertion failures due to incomplete resource
accounting.


Diffs (updated)
-

  src/master/master.cpp b06d7a6e2fbbb81b97eaf537d5b6745c73dc867d 


Diff: https://reviews.apache.org/r/65482/diff/4/

Changes: https://reviews.apache.org/r/65482/diff/3-4/


Testing
---

`make check`, also tested with a version of the test added in r/65045 which 
triggered this issue.


Thanks,

Benjamin Bannier



Re: Review Request 65482: Improved handling of non-terminal operations after master failover.

2018-02-15 Thread Gaston Kleiman

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65482/#review197651
---




src/master/master.cpp
Lines 7652 (patched)


s/   "operation/<< "operation/


- Gaston Kleiman


On Feb. 14, 2018, 6:21 a.m., Benjamin Bannier wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65482/
> ---
> 
> (Updated Feb. 14, 2018, 6:21 a.m.)
> 
> 
> Review request for mesos, Greg Mann, Jie Yu, and Jan Schlicht.
> 
> 
> Bugs: MESOS-8536
> https://issues.apache.org/jira/browse/MESOS-8536
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> This patch fixes the handling of non-terminal operations learned by a
> newly elected master after a master failover, so that only these
> operations are counted as using resources. Previously we did not count
> any operations as using resources which by accident produced expected
> behavior if the operation was already terminal when the master learned
> about them.
> 
> We do not address the issue of being unable to properly account for
> operations triggered by frameworks unknown to the master, see
> MESOS-8582. Instead we emit a warning for now since the master might
> continue to abort due to assertion failures due to incomplete resource
> accounting.
> 
> 
> Diffs
> -
> 
>   src/master/master.cpp b06d7a6e2fbbb81b97eaf537d5b6745c73dc867d 
> 
> 
> Diff: https://reviews.apache.org/r/65482/diff/3/
> 
> 
> Testing
> ---
> 
> `make check`, also tested with a version of the test added in r/65045 which 
> triggered this issue.
> 
> 
> Thanks,
> 
> Benjamin Bannier
> 
>



Re: Review Request 65482: Improved handling of non-terminal operations after master failover.

2018-02-15 Thread Greg Mann


> On Feb. 15, 2018, 11:55 p.m., Greg Mann wrote:
> > src/master/master.cpp
> > Lines 7647-7654 (patched)
> > 
> >
> > I'm sitting here trying to think of ways we might avoid crashing if the 
> > framework subscribes before the operation becomes terminal...
> > 
> > Would it be reasonable to add an `if (framework == nullptr)` check to 
> > `updateOperation()` so that we only recover resources if the framework is 
> > known to the master?

Er... wait that doesn't make sense :) I guess when we receive the operation 
update, we have no way of knowing whether or not the framework had subscribed 
when the master learned about the pending operation. As a workaround for now, 
we could store in a set the operation UUIDs of operations for which we do not 
track allocated resources (i.e., operations which hit this block of code). 
Then, in `updateOperation` we could avoid recovering resources if the 
operation's UUID is in the set?


- Greg


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65482/#review197641
---


On Feb. 14, 2018, 2:21 p.m., Benjamin Bannier wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65482/
> ---
> 
> (Updated Feb. 14, 2018, 2:21 p.m.)
> 
> 
> Review request for mesos, Greg Mann, Jie Yu, and Jan Schlicht.
> 
> 
> Bugs: MESOS-8536
> https://issues.apache.org/jira/browse/MESOS-8536
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> This patch fixes the handling of non-terminal operations learned by a
> newly elected master after a master failover, so that only these
> operations are counted as using resources. Previously we did not count
> any operations as using resources which by accident produced expected
> behavior if the operation was already terminal when the master learned
> about them.
> 
> We do not address the issue of being unable to properly account for
> operations triggered by frameworks unknown to the master, see
> MESOS-8582. Instead we emit a warning for now since the master might
> continue to abort due to assertion failures due to incomplete resource
> accounting.
> 
> 
> Diffs
> -
> 
>   src/master/master.cpp b06d7a6e2fbbb81b97eaf537d5b6745c73dc867d 
> 
> 
> Diff: https://reviews.apache.org/r/65482/diff/3/
> 
> 
> Testing
> ---
> 
> `make check`, also tested with a version of the test added in r/65045 which 
> triggered this issue.
> 
> 
> Thanks,
> 
> Benjamin Bannier
> 
>



Re: Review Request 65482: Improved handling of non-terminal operations after master failover.

2018-02-15 Thread Greg Mann

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65482/#review197641
---




src/master/master.cpp
Lines 7643 (patched)


s/know/knows/



src/master/master.cpp
Lines 7644 (patched)


s/added/adding/



src/master/master.cpp
Lines 7645 (patched)


s/`RunTaskMessage`, see/`RunTaskMessage`. See/



src/master/master.cpp
Lines 7647-7654 (patched)


I'm sitting here trying to think of ways we might avoid crashing if the 
framework subscribes before the operation becomes terminal...

Would it be reasonable to add an `if (framework == nullptr)` check to 
`updateOperation()` so that we only recover resources if the framework is known 
to the master?



src/master/master.cpp
Lines 7652 (patched)


s/MESOS-8356/MESOS-8536/


- Greg Mann


On Feb. 14, 2018, 2:21 p.m., Benjamin Bannier wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65482/
> ---
> 
> (Updated Feb. 14, 2018, 2:21 p.m.)
> 
> 
> Review request for mesos, Greg Mann, Jie Yu, and Jan Schlicht.
> 
> 
> Bugs: MESOS-8536
> https://issues.apache.org/jira/browse/MESOS-8536
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> This patch fixes the handling of non-terminal operations learned by a
> newly elected master after a master failover, so that only these
> operations are counted as using resources. Previously we did not count
> any operations as using resources which by accident produced expected
> behavior if the operation was already terminal when the master learned
> about them.
> 
> We do not address the issue of being unable to properly account for
> operations triggered by frameworks unknown to the master, see
> MESOS-8582. Instead we emit a warning for now since the master might
> continue to abort due to assertion failures due to incomplete resource
> accounting.
> 
> 
> Diffs
> -
> 
>   src/master/master.cpp b06d7a6e2fbbb81b97eaf537d5b6745c73dc867d 
> 
> 
> Diff: https://reviews.apache.org/r/65482/diff/3/
> 
> 
> Testing
> ---
> 
> `make check`, also tested with a version of the test added in r/65045 which 
> triggered this issue.
> 
> 
> Thanks,
> 
> Benjamin Bannier
> 
>



Re: Review Request 65482: Improved handling of non-terminal operations after master failover.

2018-02-14 Thread Benjamin Bannier

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65482/
---

(Updated Feb. 14, 2018, 3:21 p.m.)


Review request for mesos, Greg Mann, Jie Yu, and Jan Schlicht.


Summary (updated)
-

Improved handling of non-terminal operations after master failover.


Bugs: MESOS-8536
https://issues.apache.org/jira/browse/MESOS-8536


Repository: mesos


Description (updated)
---

This patch fixes the handling of non-terminal operations learned by a
newly elected master after a master failover, so that only these
operations are counted as using resources. Previously we did not count
any operations as using resources which by accident produced expected
behavior if the operation was already terminal when the master learned
about them.

We do not address the issue of being unable to properly account for
operations triggered by frameworks unknown to the master, see
MESOS-8582. Instead we emit a warning for now since the master might
continue to abort due to assertion failures due to incomplete resource
accounting.


Diffs (updated)
-

  src/master/master.cpp d7d22866f7a4eb87bd8949efafc97e828e7d4b94 


Diff: https://reviews.apache.org/r/65482/diff/2/

Changes: https://reviews.apache.org/r/65482/diff/1-2/


Testing
---

`make check`, also tested with a version of the test added in r/65045 which 
triggered this issue.


Thanks,

Benjamin Bannier