Re: Review Request 64506: Correctly reset slave status when aborting a registration.

2017-12-18 Thread Vinod Kone

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64506/#review194086
---


Ship it!




Ship It!

- Vinod Kone


On Dec. 14, 2017, 3:32 p.m., Benno Evers wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64506/
> ---
> 
> (Updated Dec. 14, 2017, 3:32 p.m.)
> 
> 
> Review request for mesos and Benjamin Mahler.
> 
> 
> Bugs: MESOS-8341
> https://issues.apache.org/jira/browse/MESOS-8341
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Previously, the slave was not erased from the `registering`
> and `reregistering` sets in the master for some code paths
> that would result in a failed (re-)registration attempt.
> 
> This could lead to a state where the reason of the unsuccessful
> (re-)registration attempt is fixed on the agent, but the master
> ignores subsequent attempts because it assumes the previous
> operation is still in progress.
> 
> 
> Diffs
> -
> 
>   src/master/master.cpp b3e074cfe86600793310deb87932fa145e95055d 
>   src/tests/master_tests.cpp 7b087675200fe0ca69319b715997700a91d9f6b2 
> 
> 
> Diff: https://reviews.apache.org/r/64506/diff/2/
> 
> 
> Testing
> ---
> 
> `make check`
> 
> 
> Thanks,
> 
> Benno Evers
> 
>



Re: Review Request 64506: Correctly reset slave status when aborting a registration.

2017-12-18 Thread Benno Evers


> On Dec. 14, 2017, 9:37 p.m., Benjamin Mahler wrote:
> > Could you please file a ticket that describes the bug from what a user 
> > would experience, and link that in to the review? I would like to target it 
> > for backporting, seems pretty bad.

Here you go: https://issues.apache.org/jira/browse/MESOS-8341

I think it looks worse than it is in practice, most of these code paths will 
almost never be hit, and it can be worked-around with a master restart.


- Benno


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64506/#review193846
---


On Dec. 14, 2017, 3:32 p.m., Benno Evers wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64506/
> ---
> 
> (Updated Dec. 14, 2017, 3:32 p.m.)
> 
> 
> Review request for mesos and Benjamin Mahler.
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Previously, the slave was not erased from the `registering`
> and `reregistering` sets in the master for some code paths
> that would result in a failed (re-)registration attempt.
> 
> This could lead to a state where the reason of the unsuccessful
> (re-)registration attempt is fixed on the agent, but the master
> ignores subsequent attempts because it assumes the previous
> operation is still in progress.
> 
> 
> Diffs
> -
> 
>   src/master/master.cpp b3e074cfe86600793310deb87932fa145e95055d 
>   src/tests/master_tests.cpp 7b087675200fe0ca69319b715997700a91d9f6b2 
> 
> 
> Diff: https://reviews.apache.org/r/64506/diff/2/
> 
> 
> Testing
> ---
> 
> `make check`
> 
> 
> Thanks,
> 
> Benno Evers
> 
>



Re: Review Request 64506: Correctly reset slave status when aborting a registration.

2017-12-14 Thread Benjamin Mahler

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64506/#review193846
---


Ship it!




Could you please file a ticket that describes the bug from what a user would 
experience, and link that in to the review? I would like to target it for 
backporting, seems pretty bad.


src/master/master.cpp
Lines 6433-6434 (patched)


Yeah, this is long overdue, maybe I should resurrect my patch for this. :)


- Benjamin Mahler


On Dec. 14, 2017, 3:32 p.m., Benno Evers wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64506/
> ---
> 
> (Updated Dec. 14, 2017, 3:32 p.m.)
> 
> 
> Review request for mesos and Benjamin Mahler.
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Previously, the slave was not erased from the `registering`
> and `reregistering` sets in the master for some code paths
> that would result in a failed (re-)registration attempt.
> 
> This could lead to a state where the reason of the unsuccessful
> (re-)registration attempt is fixed on the agent, but the master
> ignores subsequent attempts because it assumes the previous
> operation is still in progress.
> 
> 
> Diffs
> -
> 
>   src/master/master.cpp b3e074cfe86600793310deb87932fa145e95055d 
>   src/tests/master_tests.cpp 7b087675200fe0ca69319b715997700a91d9f6b2 
> 
> 
> Diff: https://reviews.apache.org/r/64506/diff/2/
> 
> 
> Testing
> ---
> 
> `make check`
> 
> 
> Thanks,
> 
> Benno Evers
> 
>



Re: Review Request 64506: Correctly reset slave status when aborting a registration.

2017-12-14 Thread Benno Evers

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64506/
---

(Updated Dec. 14, 2017, 3:32 p.m.)


Review request for mesos and Benjamin Mahler.


Repository: mesos


Description
---

Previously, the slave was not erased from the `registering`
and `reregistering` sets in the master for some code paths
that would result in a failed (re-)registration attempt.

This could lead to a state where the reason of the unsuccessful
(re-)registration attempt is fixed on the agent, but the master
ignores subsequent attempts because it assumes the previous
operation is still in progress.


Diffs (updated)
-

  src/master/master.cpp b3e074cfe86600793310deb87932fa145e95055d 
  src/tests/master_tests.cpp 7b087675200fe0ca69319b715997700a91d9f6b2 


Diff: https://reviews.apache.org/r/64506/diff/2/

Changes: https://reviews.apache.org/r/64506/diff/1-2/


Testing
---

`make check`


Thanks,

Benno Evers



Re: Review Request 64506: Correctly reset slave status when aborting a registration.

2017-12-12 Thread Benno Evers


> On Dec. 11, 2017, 9:37 p.m., Vinod Kone wrote:
> > src/master/master.cpp
> > Lines 6433-6434 (patched)
> > 
> >
> > Instead of this, I would propose that we have a proper state machine 
> > and transition the agent through that. That would avoid issues like this.

I think its a matter of time allocation: This would probably take 1-2 days, a 
complete refactoring to use a state machine more like 1-2 weeks, and both would 
improve on the current situation. So I wouldn't want to rule out doing this, 
even if I agree that it would be even better to use a state machine.


- Benno


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64506/#review193449
---


On Dec. 12, 2017, 9:57 a.m., Benno Evers wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64506/
> ---
> 
> (Updated Dec. 12, 2017, 9:57 a.m.)
> 
> 
> Review request for mesos and Benjamin Mahler.
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Previously, the slave was not erased from the `registering`
> and `reregistering` sets in the master for some code paths
> that would result in a failed (re-)registration attempt.
> 
> This could lead to a state where the reason of the unsuccessful
> (re-)registration attempt is fixed on the agent, but the master
> ignores subsequent attempts because it assumes the previous
> operation is still in progress.
> 
> 
> Diffs
> -
> 
>   src/master/master.cpp b3e074cfe86600793310deb87932fa145e95055d 
>   src/tests/master_tests.cpp 7b087675200fe0ca69319b715997700a91d9f6b2 
> 
> 
> Diff: https://reviews.apache.org/r/64506/diff/1/
> 
> 
> Testing
> ---
> 
> `make check`
> 
> 
> Thanks,
> 
> Benno Evers
> 
>



Re: Review Request 64506: Correctly reset slave status when aborting a registration.

2017-12-11 Thread Vinod Kone

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64506/#review193449
---


Fix it, then Ship it!





src/master/master.cpp
Lines 6433-6434 (patched)


Instead of this, I would propose that we have a proper state machine and 
transition the agent through that. That would avoid issues like this.



src/tests/master_tests.cpp
Lines 8331 (patched)


move this to right after `advance`.


- Vinod Kone


On Dec. 11, 2017, 8:16 p.m., Benno Evers wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64506/
> ---
> 
> (Updated Dec. 11, 2017, 8:16 p.m.)
> 
> 
> Review request for mesos.
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Previously, the slave was not erased from the `registering`
> and `reregistering` sets in the master for some code paths
> that would result in a failed (re-)registration attempt.
> 
> This could lead to a state where the reason of the unsuccessful
> (re-)registration attempt is fixed on the agent, but the master
> ignores subsequent attempts because it assumes the previous
> operation is still in progress.
> 
> 
> Diffs
> -
> 
>   src/master/master.cpp b3e074cfe86600793310deb87932fa145e95055d 
>   src/tests/master_tests.cpp 7b087675200fe0ca69319b715997700a91d9f6b2 
> 
> 
> Diff: https://reviews.apache.org/r/64506/diff/1/
> 
> 
> Testing
> ---
> 
> `make check`
> 
> 
> Thanks,
> 
> Benno Evers
> 
>