Re: Review Request 56895: Allow agents to recover slave state post a reboot.

2017-06-13 Thread Deshi Xiao


> On 六月 13, 2017, 12:12 a.m., Vinod Kone wrote:
> > Can you split this review into multiple logical reviews? It is really hard 
> > to review otherwise.
> > 
> > Below are a few splits I could think of
> > 
> > 1) Renames of variables (e.g., in tests).
> > 2) Changes of resources value in tests (not sure of the reason for this 
> > change).
> > 3) Addition of a new helper `recoverSlaveState`
> > 4) Changes to containerizers to short circuit recovery if state is none 
> > (not sure of the reason for this change)
> > 5) Adding `rebooted` field to recovery info and state (you can combine this 
> > with 6th if you want but might be easier to review separately) 
> > 6) Keeping the agent id on reboot
> > 
> > In addition to making it easy to review it will help us to ship parts of 
> > this chain sooner than later.

+1


- Deshi


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56895/#review177687
---


On 六月 9, 2017, 4:27 a.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56895/
> ---
> 
> (Updated 六月 9, 2017, 4:27 a.m.)
> 
> 
> Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-6223
> https://issues.apache.org/jira/browse/MESOS-6223
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> With partition awareness, the agents are now allowed to re-register
> after they have been marked Unreachable. The executors are anyway
> terminated on the agent when it reboots so there is no harm in
> letting the agent keep its SlaveID, re-register with the master
> and reconcile the lost executors. This is a pre-requisite for
> supporting persistent/restartable tasks in mesos.
> 
> 
> Diffs
> -
> 
>   src/slave/containerizer/composing.cpp 
> a003e1b80dc9b4dec5b3fbbadb2daecf855c90c7 
>   src/slave/containerizer/docker.cpp 9f84109d7de22a39ace6e44e0c7d8d501bcb24de 
>   src/slave/containerizer/mesos/containerizer.cpp 
> f3e6210eccd4a6b445ffd4447e69526d424ea36d 
>   src/slave/slave.hpp 7ffaed14035a05259ec72c70532ee4f0affa1f5d 
>   src/slave/slave.cpp 7d147ac6609933ac884bfc29032dba572a0952c6 
>   src/slave/state.hpp a497ce1f58fb8dc7718ee5bb10bc62dd7479efa5 
>   src/slave/state.cpp 18b790d2cc4f537cc9b0c3cca59b9cbaac0eda10 
>   src/tests/reservation_tests.cpp 6e9c215382ef41700921a673669ac1a7975e9b7f 
>   src/tests/slave_recovery_tests.cpp 38502584186793686f78ff5f4e03f36a3bf7ad1c 
> 
> 
> Diff: https://reviews.apache.org/r/56895/diff/7/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>



Re: Review Request 56895: Allow agents to recover slave state post a reboot.

2017-06-12 Thread Vinod Kone


> On May 23, 2017, 9:32 p.m., Vinod Kone wrote:
> > src/slave/slave.cpp
> > Line 5954 (original), 5965 (patched)
> > 
> >
> > Previously, looks like any errors during the partial recovery after a 
> > reboot resulted in agent flapping (e.g., checkpointed resources recovery 
> > failure, fetcher recovery failure), but now looks like the agent will 
> > remove the symlink and move along? That seems like a change in behavior
> 
> Megha Sharma wrote:
> Now we are only cleaning the symlink and continuing if the failure was 
> because of slaveInfo mismatch. There's no recovery failure caused by fetcher 
> recovery anymore (removed in ToT) so the only class of failure left is 
> checkpointed resources failure which will not be affected by agent symlink 
> cleanup so no cleanup has been performed and it will cause the agent to flap 
> so no change in behavior here.

What is `ToT` ?

If this issue is fixed, then please mark it so. If you are unsure then feel 
free to ask further questions here that will help us in resolving it.


> On May 23, 2017, 9:32 p.m., Vinod Kone wrote:
> > src/slave/slave.cpp
> > Line 5962 (original)
> > 
> >
> > So, we are continuing if there is a recovery error and the agent has 
> > not rebooted? That doesn't sound right?
> 
> Megha Sharma wrote:
> Fixed to continue after symlink cleanup if the recovery error is because 
> of slaveInfo mismatch, any other errors will cause the agent to behave same 
> as before.

please mark it fixed. see above.


> On May 23, 2017, 9:32 p.m., Vinod Kone wrote:
> > src/tests/slave_recovery_tests.cpp
> > Line 237 (original), 237 (patched)
> > 
> >
> > why this change in this review? looks independent.
> 
> Megha Sharma wrote:
> Actually, this was done to address Neil's comment about the variable name 
> being too generic which seemed quite reasonable. See the comment below.
> 
> `Can we rename _ack to something that identifies we're waiting for the 
> agent to see the status update acknowledgment?`

please do such changes in a different review. see my comment in the next review.


- Vinod


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56895/#review175852
---


On June 9, 2017, 4:27 a.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56895/
> ---
> 
> (Updated June 9, 2017, 4:27 a.m.)
> 
> 
> Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-6223
> https://issues.apache.org/jira/browse/MESOS-6223
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> With partition awareness, the agents are now allowed to re-register
> after they have been marked Unreachable. The executors are anyway
> terminated on the agent when it reboots so there is no harm in
> letting the agent keep its SlaveID, re-register with the master
> and reconcile the lost executors. This is a pre-requisite for
> supporting persistent/restartable tasks in mesos.
> 
> 
> Diffs
> -
> 
>   src/slave/containerizer/composing.cpp 
> a003e1b80dc9b4dec5b3fbbadb2daecf855c90c7 
>   src/slave/containerizer/docker.cpp 9f84109d7de22a39ace6e44e0c7d8d501bcb24de 
>   src/slave/containerizer/mesos/containerizer.cpp 
> f3e6210eccd4a6b445ffd4447e69526d424ea36d 
>   src/slave/slave.hpp 7ffaed14035a05259ec72c70532ee4f0affa1f5d 
>   src/slave/slave.cpp 7d147ac6609933ac884bfc29032dba572a0952c6 
>   src/slave/state.hpp a497ce1f58fb8dc7718ee5bb10bc62dd7479efa5 
>   src/slave/state.cpp 18b790d2cc4f537cc9b0c3cca59b9cbaac0eda10 
>   src/tests/reservation_tests.cpp 6e9c215382ef41700921a673669ac1a7975e9b7f 
>   src/tests/slave_recovery_tests.cpp 38502584186793686f78ff5f4e03f36a3bf7ad1c 
> 
> 
> Diff: https://reviews.apache.org/r/56895/diff/7/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>



Re: Review Request 56895: Allow agents to recover slave state post a reboot.

2017-06-12 Thread Vinod Kone

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56895/#review177687
---



Can you split this review into multiple logical reviews? It is really hard to 
review otherwise.

Below are a few splits I could think of

1) Renames of variables (e.g., in tests).
2) Changes of resources value in tests (not sure of the reason for this change).
3) Addition of a new helper `recoverSlaveState`
4) Changes to containerizers to short circuit recovery if state is none (not 
sure of the reason for this change)
5) Adding `rebooted` field to recovery info and state (you can combine this 
with 6th if you want but might be easier to review separately) 
6) Keeping the agent id on reboot

In addition to making it easy to review it will help us to ship parts of this 
chain sooner than later.

- Vinod Kone


On June 9, 2017, 4:27 a.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56895/
> ---
> 
> (Updated June 9, 2017, 4:27 a.m.)
> 
> 
> Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-6223
> https://issues.apache.org/jira/browse/MESOS-6223
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> With partition awareness, the agents are now allowed to re-register
> after they have been marked Unreachable. The executors are anyway
> terminated on the agent when it reboots so there is no harm in
> letting the agent keep its SlaveID, re-register with the master
> and reconcile the lost executors. This is a pre-requisite for
> supporting persistent/restartable tasks in mesos.
> 
> 
> Diffs
> -
> 
>   src/slave/containerizer/composing.cpp 
> a003e1b80dc9b4dec5b3fbbadb2daecf855c90c7 
>   src/slave/containerizer/docker.cpp 9f84109d7de22a39ace6e44e0c7d8d501bcb24de 
>   src/slave/containerizer/mesos/containerizer.cpp 
> f3e6210eccd4a6b445ffd4447e69526d424ea36d 
>   src/slave/slave.hpp 7ffaed14035a05259ec72c70532ee4f0affa1f5d 
>   src/slave/slave.cpp 7d147ac6609933ac884bfc29032dba572a0952c6 
>   src/slave/state.hpp a497ce1f58fb8dc7718ee5bb10bc62dd7479efa5 
>   src/slave/state.cpp 18b790d2cc4f537cc9b0c3cca59b9cbaac0eda10 
>   src/tests/reservation_tests.cpp 6e9c215382ef41700921a673669ac1a7975e9b7f 
>   src/tests/slave_recovery_tests.cpp 38502584186793686f78ff5f4e03f36a3bf7ad1c 
> 
> 
> Diff: https://reviews.apache.org/r/56895/diff/7/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>



Re: Review Request 56895: Allow agents to recover slave state post a reboot.

2017-06-09 Thread Megha Sharma


> On May 23, 2017, 9:32 p.m., Vinod Kone wrote:
> > src/slave/slave.cpp
> > Line 5954 (original), 5965 (patched)
> > 
> >
> > Previously, looks like any errors during the partial recovery after a 
> > reboot resulted in agent flapping (e.g., checkpointed resources recovery 
> > failure, fetcher recovery failure), but now looks like the agent will 
> > remove the symlink and move along? That seems like a change in behavior

Now we are only cleaning the symlink and continuing if the failure was because 
of slaveInfo mismatch. There's no recovery failure caused by fetcher recovery 
anymore (removed in ToT) so the only class of failure left is checkpointed 
resources failure which will not be affected by agent symlink cleanup so no 
cleanup has been performed and it will cause the agent to flap so no change in 
behavior here.


> On May 23, 2017, 9:32 p.m., Vinod Kone wrote:
> > src/slave/slave.cpp
> > Line 5962 (original)
> > 
> >
> > So, we are continuing if there is a recovery error and the agent has 
> > not rebooted? That doesn't sound right?

Fixed to continue after symlink cleanup if the recovery error is because of 
slaveInfo mismatch, any other errors will cause the agent to behave same as 
before.


- Megha


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56895/#review175852
---


On June 9, 2017, 4:27 a.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56895/
> ---
> 
> (Updated June 9, 2017, 4:27 a.m.)
> 
> 
> Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-6223
> https://issues.apache.org/jira/browse/MESOS-6223
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> With partition awareness, the agents are now allowed to re-register
> after they have been marked Unreachable. The executors are anyway
> terminated on the agent when it reboots so there is no harm in
> letting the agent keep its SlaveID, re-register with the master
> and reconcile the lost executors. This is a pre-requisite for
> supporting persistent/restartable tasks in mesos.
> 
> 
> Diffs
> -
> 
>   src/slave/containerizer/composing.cpp 
> a003e1b80dc9b4dec5b3fbbadb2daecf855c90c7 
>   src/slave/containerizer/docker.cpp 9f84109d7de22a39ace6e44e0c7d8d501bcb24de 
>   src/slave/containerizer/mesos/containerizer.cpp 
> f3e6210eccd4a6b445ffd4447e69526d424ea36d 
>   src/slave/slave.hpp 7ffaed14035a05259ec72c70532ee4f0affa1f5d 
>   src/slave/slave.cpp 7d147ac6609933ac884bfc29032dba572a0952c6 
>   src/slave/state.hpp a497ce1f58fb8dc7718ee5bb10bc62dd7479efa5 
>   src/slave/state.cpp 18b790d2cc4f537cc9b0c3cca59b9cbaac0eda10 
>   src/tests/reservation_tests.cpp 6e9c215382ef41700921a673669ac1a7975e9b7f 
>   src/tests/slave_recovery_tests.cpp 38502584186793686f78ff5f4e03f36a3bf7ad1c 
> 
> 
> Diff: https://reviews.apache.org/r/56895/diff/7/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>



Re: Review Request 56895: Allow agents to recover slave state post a reboot.

2017-06-08 Thread Mesos Reviewbot Windows

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56895/#review177438
---



Patch looks great!

Reviews applied: [56895]

Passed command: support\windows-build.bat

- Mesos Reviewbot Windows


On June 9, 2017, 4:27 a.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56895/
> ---
> 
> (Updated June 9, 2017, 4:27 a.m.)
> 
> 
> Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-6223
> https://issues.apache.org/jira/browse/MESOS-6223
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> With partition awareness, the agents are now allowed to re-register
> after they have been marked Unreachable. The executors are anyway
> terminated on the agent when it reboots so there is no harm in
> letting the agent keep its SlaveID, re-register with the master
> and reconcile the lost executors. This is a pre-requisite for
> supporting persistent/restartable tasks in mesos.
> 
> 
> Diffs
> -
> 
>   src/slave/containerizer/composing.cpp 
> a003e1b80dc9b4dec5b3fbbadb2daecf855c90c7 
>   src/slave/containerizer/docker.cpp 9f84109d7de22a39ace6e44e0c7d8d501bcb24de 
>   src/slave/containerizer/mesos/containerizer.cpp 
> f3e6210eccd4a6b445ffd4447e69526d424ea36d 
>   src/slave/slave.hpp 7ffaed14035a05259ec72c70532ee4f0affa1f5d 
>   src/slave/slave.cpp 7d147ac6609933ac884bfc29032dba572a0952c6 
>   src/slave/state.hpp a497ce1f58fb8dc7718ee5bb10bc62dd7479efa5 
>   src/slave/state.cpp 18b790d2cc4f537cc9b0c3cca59b9cbaac0eda10 
>   src/tests/reservation_tests.cpp 6e9c215382ef41700921a673669ac1a7975e9b7f 
>   src/tests/slave_recovery_tests.cpp 38502584186793686f78ff5f4e03f36a3bf7ad1c 
> 
> 
> Diff: https://reviews.apache.org/r/56895/diff/7/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>



Re: Review Request 56895: Allow agents to recover slave state post a reboot.

2017-06-08 Thread Megha Sharma


> On May 23, 2017, 9:32 p.m., Vinod Kone wrote:
> > No tests!?

I have added a new test, so in total this change has two tests: one verifying 
that the state is recovered correctly and agentId is retained post the agent 
host reboot given the recovery finishes without errors and a second one to 
verify that no state is recovered and only the agentId is retained if the 
recovery fails after a reboot.


> On May 23, 2017, 9:32 p.m., Vinod Kone wrote:
> > src/slave/slave.cpp
> > Line 5956 (original), 5967 (patched)
> > 
> >
> > Add a comment here saying that we do this for backwards compatibiity, 
> > i.e., in Mesos <= 1.3 a rebooted agent did not recover checkpointed disk 
> > and registered as a new agent.

Fixed


> On May 23, 2017, 9:32 p.m., Vinod Kone wrote:
> > src/tests/slave_recovery_tests.cpp
> > Line 237 (original), 237 (patched)
> > 
> >
> > why this change in this review? looks independent.

Actually, this was done to address Neil's comment about the variable name being 
too generic which seemed quite reasonable. See the comment below.

`Can we rename _ack to something that identifies we're waiting for the agent to 
see the status update acknowledgment?`


- Megha


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56895/#review175852
---


On June 9, 2017, 4:27 a.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56895/
> ---
> 
> (Updated June 9, 2017, 4:27 a.m.)
> 
> 
> Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-6223
> https://issues.apache.org/jira/browse/MESOS-6223
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> With partition awareness, the agents are now allowed to re-register
> after they have been marked Unreachable. The executors are anyway
> terminated on the agent when it reboots so there is no harm in
> letting the agent keep its SlaveID, re-register with the master
> and reconcile the lost executors. This is a pre-requisite for
> supporting persistent/restartable tasks in mesos.
> 
> 
> Diffs
> -
> 
>   src/slave/containerizer/composing.cpp 
> a003e1b80dc9b4dec5b3fbbadb2daecf855c90c7 
>   src/slave/containerizer/docker.cpp 9f84109d7de22a39ace6e44e0c7d8d501bcb24de 
>   src/slave/containerizer/mesos/containerizer.cpp 
> f3e6210eccd4a6b445ffd4447e69526d424ea36d 
>   src/slave/slave.hpp 7ffaed14035a05259ec72c70532ee4f0affa1f5d 
>   src/slave/slave.cpp 7d147ac6609933ac884bfc29032dba572a0952c6 
>   src/slave/state.hpp a497ce1f58fb8dc7718ee5bb10bc62dd7479efa5 
>   src/slave/state.cpp 18b790d2cc4f537cc9b0c3cca59b9cbaac0eda10 
>   src/tests/reservation_tests.cpp 6e9c215382ef41700921a673669ac1a7975e9b7f 
>   src/tests/slave_recovery_tests.cpp 38502584186793686f78ff5f4e03f36a3bf7ad1c 
> 
> 
> Diff: https://reviews.apache.org/r/56895/diff/7/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>



Re: Review Request 56895: Allow agents to recover slave state post a reboot.

2017-06-08 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56895/
---

(Updated June 9, 2017, 4:27 a.m.)


Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.


Bugs: MESOS-6223
https://issues.apache.org/jira/browse/MESOS-6223


Repository: mesos


Description
---

With partition awareness, the agents are now allowed to re-register
after they have been marked Unreachable. The executors are anyway
terminated on the agent when it reboots so there is no harm in
letting the agent keep its SlaveID, re-register with the master
and reconcile the lost executors. This is a pre-requisite for
supporting persistent/restartable tasks in mesos.


Diffs (updated)
-

  src/slave/containerizer/composing.cpp 
a003e1b80dc9b4dec5b3fbbadb2daecf855c90c7 
  src/slave/containerizer/docker.cpp 9f84109d7de22a39ace6e44e0c7d8d501bcb24de 
  src/slave/containerizer/mesos/containerizer.cpp 
f3e6210eccd4a6b445ffd4447e69526d424ea36d 
  src/slave/slave.hpp 7ffaed14035a05259ec72c70532ee4f0affa1f5d 
  src/slave/slave.cpp 7d147ac6609933ac884bfc29032dba572a0952c6 
  src/slave/state.hpp a497ce1f58fb8dc7718ee5bb10bc62dd7479efa5 
  src/slave/state.cpp 18b790d2cc4f537cc9b0c3cca59b9cbaac0eda10 
  src/tests/reservation_tests.cpp 6e9c215382ef41700921a673669ac1a7975e9b7f 
  src/tests/slave_recovery_tests.cpp 38502584186793686f78ff5f4e03f36a3bf7ad1c 


Diff: https://reviews.apache.org/r/56895/diff/7/

Changes: https://reviews.apache.org/r/56895/diff/6-7/


Testing
---

make check


Thanks,

Megha Sharma



Re: Review Request 56895: Allow agents to recover slave state post a reboot.

2017-05-23 Thread Mesos Reviewbot Windows

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56895/#review175877
---



Patch looks great!

Reviews applied: [56895]

Passed command: support\windows-build.bat

- Mesos Reviewbot Windows


On May 16, 2017, 11:17 p.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56895/
> ---
> 
> (Updated May 16, 2017, 11:17 p.m.)
> 
> 
> Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-6223
> https://issues.apache.org/jira/browse/MESOS-6223
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> With partition awareness, the agents are now allowed to re-register
> after they have been marked Unreachable. The executors are anyway
> terminated on the agent when it reboots so there is no harm in
> letting the agent keep its SlaveID, re-register with the master
> and reconcile the lost executors. This is a pre-requisite for
> supporting persistent/restartable tasks in mesos.
> 
> 
> Diffs
> -
> 
>   src/slave/slave.hpp 77fb93abc701cd34b69c75b6219c219fdb784a67 
>   src/slave/slave.cpp 14de72fa4a8746504a251d735bf56041b934df40 
>   src/slave/state.hpp a497ce1f58fb8dc7718ee5bb10bc62dd7479efa5 
>   src/slave/state.cpp 33dcc7a148f9a6b1a3216cce45710da8fd819ba6 
>   src/tests/reservation_tests.cpp 4504831d77c1bfcf5f2ddf6d28cd45dea2c421ad 
>   src/tests/slave_recovery_tests.cpp 52e78b6b6280a159233b402ce2849448204d4f11 
> 
> 
> Diff: https://reviews.apache.org/r/56895/diff/6/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>



Re: Review Request 56895: Allow agents to recover slave state post a reboot.

2017-05-23 Thread Vinod Kone

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56895/#review175852
---



No tests!?


src/slave/slave.cpp
Line 5954 (original), 5965 (patched)


Previously, looks like any errors during the partial recovery after a 
reboot resulted in agent flapping (e.g., checkpointed resources recovery 
failure, fetcher recovery failure), but now looks like the agent will remove 
the symlink and move along? That seems like a change in behavior



src/slave/slave.cpp
Line 5956 (original), 5967 (patched)


Add a comment here saying that we do this for backwards compatibiity, i.e., 
in Mesos <= 1.3 a rebooted agent did not recover checkpointed disk and 
registered as a new agent.



src/slave/slave.cpp
Line 5962 (original)


So, we are continuing if there is a recovery error and the agent has not 
rebooted? That doesn't sound right?



src/tests/slave_recovery_tests.cpp
Line 237 (original), 237 (patched)


why this change in this review? looks independent.


- Vinod Kone


On May 16, 2017, 11:17 p.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56895/
> ---
> 
> (Updated May 16, 2017, 11:17 p.m.)
> 
> 
> Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-6223
> https://issues.apache.org/jira/browse/MESOS-6223
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> With partition awareness, the agents are now allowed to re-register
> after they have been marked Unreachable. The executors are anyway
> terminated on the agent when it reboots so there is no harm in
> letting the agent keep its SlaveID, re-register with the master
> and reconcile the lost executors. This is a pre-requisite for
> supporting persistent/restartable tasks in mesos.
> 
> 
> Diffs
> -
> 
>   src/slave/slave.hpp 77fb93abc701cd34b69c75b6219c219fdb784a67 
>   src/slave/slave.cpp 14de72fa4a8746504a251d735bf56041b934df40 
>   src/slave/state.hpp a497ce1f58fb8dc7718ee5bb10bc62dd7479efa5 
>   src/slave/state.cpp 33dcc7a148f9a6b1a3216cce45710da8fd819ba6 
>   src/tests/reservation_tests.cpp 4504831d77c1bfcf5f2ddf6d28cd45dea2c421ad 
>   src/tests/slave_recovery_tests.cpp 52e78b6b6280a159233b402ce2849448204d4f11 
> 
> 
> Diff: https://reviews.apache.org/r/56895/diff/6/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>



Re: Review Request 56895: Allow agents to recover slave state post a reboot.

2017-05-17 Thread Mesos Reviewbot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56895/#review175222
---



Patch looks great!

Reviews applied: [56895]

Passed command: export OS='ubuntu:14.04' BUILDTOOL='autotools' COMPILER='gcc' 
CONFIGURATION='--verbose' ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1'; 
./support/docker-build.sh

- Mesos Reviewbot


On May 16, 2017, 11:17 p.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56895/
> ---
> 
> (Updated May 16, 2017, 11:17 p.m.)
> 
> 
> Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-6223
> https://issues.apache.org/jira/browse/MESOS-6223
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> With partition awareness, the agents are now allowed to re-register
> after they have been marked Unreachable. The executors are anyway
> terminated on the agent when it reboots so there is no harm in
> letting the agent keep its SlaveID, re-register with the master
> and reconcile the lost executors. This is a pre-requisite for
> supporting persistent/restartable tasks in mesos.
> 
> 
> Diffs
> -
> 
>   src/slave/slave.hpp 77fb93abc701cd34b69c75b6219c219fdb784a67 
>   src/slave/slave.cpp 14de72fa4a8746504a251d735bf56041b934df40 
>   src/slave/state.hpp a497ce1f58fb8dc7718ee5bb10bc62dd7479efa5 
>   src/slave/state.cpp 33dcc7a148f9a6b1a3216cce45710da8fd819ba6 
>   src/tests/reservation_tests.cpp 4504831d77c1bfcf5f2ddf6d28cd45dea2c421ad 
>   src/tests/slave_recovery_tests.cpp 52e78b6b6280a159233b402ce2849448204d4f11 
> 
> 
> Diff: https://reviews.apache.org/r/56895/diff/6/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>



Re: Review Request 56895: Allow agents to recover slave state post a reboot.

2017-05-16 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56895/
---

(Updated May 16, 2017, 11:17 p.m.)


Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.


Bugs: MESOS-6223
https://issues.apache.org/jira/browse/MESOS-6223


Repository: mesos


Description
---

With partition awareness, the agents are now allowed to re-register
after they have been marked Unreachable. The executors are anyway
terminated on the agent when it reboots so there is no harm in
letting the agent keep its SlaveID, re-register with the master
and reconcile the lost executors. This is a pre-requisite for
supporting persistent/restartable tasks in mesos.


Diffs (updated)
-

  src/slave/slave.hpp 77fb93abc701cd34b69c75b6219c219fdb784a67 
  src/slave/slave.cpp 14de72fa4a8746504a251d735bf56041b934df40 
  src/slave/state.hpp a497ce1f58fb8dc7718ee5bb10bc62dd7479efa5 
  src/slave/state.cpp 33dcc7a148f9a6b1a3216cce45710da8fd819ba6 
  src/tests/reservation_tests.cpp 4504831d77c1bfcf5f2ddf6d28cd45dea2c421ad 
  src/tests/slave_recovery_tests.cpp 52e78b6b6280a159233b402ce2849448204d4f11 


Diff: https://reviews.apache.org/r/56895/diff/6/

Changes: https://reviews.apache.org/r/56895/diff/5-6/


Testing
---

make check


Thanks,

Megha Sharma



Re: Review Request 56895: Allow agents to recover slave state post a reboot.

2017-05-16 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56895/
---

(Updated May 16, 2017, 10:12 p.m.)


Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.


Bugs: MESOS-6223
https://issues.apache.org/jira/browse/MESOS-6223


Repository: mesos


Description
---

With partition awareness, the agents are now allowed to re-register
after they have been marked Unreachable. The executors are anyway
terminated on the agent when it reboots so there is no harm in
letting the agent keep its SlaveID, re-register with the master
and reconcile the lost executors. This is a pre-requisite for
supporting persistent/restartable tasks in mesos.


Diffs (updated)
-

  src/slave/slave.hpp 77fb93abc701cd34b69c75b6219c219fdb784a67 
  src/slave/slave.cpp 14de72fa4a8746504a251d735bf56041b934df40 
  src/slave/state.hpp a497ce1f58fb8dc7718ee5bb10bc62dd7479efa5 
  src/slave/state.cpp 33dcc7a148f9a6b1a3216cce45710da8fd819ba6 
  src/tests/slave_recovery_tests.cpp 52e78b6b6280a159233b402ce2849448204d4f11 


Diff: https://reviews.apache.org/r/56895/diff/5/

Changes: https://reviews.apache.org/r/56895/diff/4-5/


Testing
---

make check


Thanks,

Megha Sharma



Re: Review Request 56895: Allow agents to recover slave state post a reboot.

2017-05-16 Thread Megha Sharma


> On May 3, 2017, 12:04 a.m., Vinod Kone wrote:
> > src/slave/slave.cpp
> > Lines 5957 (patched)
> > 
> >
> > s/info/info after a reboot/
> > 
> > also please log the `future.failure()` here? are we sure that a failed 
> > future here can only happen due to incompatibile agent info?

Indeed there are more reasons for the future to fail in addition to the 
incompatibile agent info so I agree the message needs to be more generic.


- Megha


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56895/#review173678
---


On April 26, 2017, 6:16 p.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56895/
> ---
> 
> (Updated April 26, 2017, 6:16 p.m.)
> 
> 
> Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-6223
> https://issues.apache.org/jira/browse/MESOS-6223
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> With partition awareness, the agents are now allowed to re-register
> after they have been marked Unreachable. The executors are anyway
> terminated on the agent when it reboots so there is no harm in
> letting the agent keep its SlaveID, re-register with the master
> and reconcile the lost executors. This is a pre-requisite for
> supporting persistent/restartable tasks in mesos.
> 
> 
> Diffs
> -
> 
>   src/slave/slave.hpp 77fb93abc701cd34b69c75b6219c219fdb784a67 
>   src/slave/slave.cpp 4ff522e75bc8de34fe2e7720bdd8ce3d32cbf803 
>   src/slave/state.hpp a497ce1f58fb8dc7718ee5bb10bc62dd7479efa5 
>   src/slave/state.cpp 33dcc7a148f9a6b1a3216cce45710da8fd819ba6 
>   src/tests/reservation_tests.cpp 4504831d77c1bfcf5f2ddf6d28cd45dea2c421ad 
>   src/tests/slave_recovery_tests.cpp 53f33a2b0411c8158326074ce043c7b1dbeef5b4 
> 
> 
> Diff: https://reviews.apache.org/r/56895/diff/4/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>



Re: Review Request 56895: Allow agents to recover slave state post a reboot.

2017-05-02 Thread Vinod Kone

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56895/#review173678
---




src/slave/slave.hpp
Lines 333 (patched)


Can you add a comment on what this represents?



src/slave/slave.cpp
Lines 5957 (patched)


s/info/info after a reboot/

also please log the `future.failure()` here? are we sure that a failed 
future here can only happen due to incompatibile agent info?



src/slave/slave.cpp
Line 5954 (original), 5963 (patched)


As discussed offline, we should continue registering as new agent instead 
of exiting here.



src/slave/state.hpp
Lines 310 (patched)


s/hasRebooted/rebooted/


- Vinod Kone


On April 26, 2017, 6:16 p.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56895/
> ---
> 
> (Updated April 26, 2017, 6:16 p.m.)
> 
> 
> Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-6223
> https://issues.apache.org/jira/browse/MESOS-6223
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> With partition awareness, the agents are now allowed to re-register
> after they have been marked Unreachable. The executors are anyway
> terminated on the agent when it reboots so there is no harm in
> letting the agent keep its SlaveID, re-register with the master
> and reconcile the lost executors. This is a pre-requisite for
> supporting persistent/restartable tasks in mesos.
> 
> 
> Diffs
> -
> 
>   src/slave/slave.hpp 77fb93abc701cd34b69c75b6219c219fdb784a67 
>   src/slave/slave.cpp 4ff522e75bc8de34fe2e7720bdd8ce3d32cbf803 
>   src/slave/state.hpp a497ce1f58fb8dc7718ee5bb10bc62dd7479efa5 
>   src/slave/state.cpp 33dcc7a148f9a6b1a3216cce45710da8fd819ba6 
>   src/tests/reservation_tests.cpp 4504831d77c1bfcf5f2ddf6d28cd45dea2c421ad 
>   src/tests/slave_recovery_tests.cpp 53f33a2b0411c8158326074ce043c7b1dbeef5b4 
> 
> 
> Diff: https://reviews.apache.org/r/56895/diff/4/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>



Re: Review Request 56895: Allow agents to recover slave state post a reboot.

2017-04-19 Thread Mesos Reviewbot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56895/#review172320
---



Patch looks great!

Reviews applied: [56895]

Passed command: export OS='ubuntu:14.04' BUILDTOOL='autotools' COMPILER='gcc' 
CONFIGURATION='--verbose' ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1'; 
./support/docker-build.sh

- Mesos Reviewbot


On April 19, 2017, 1:10 a.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56895/
> ---
> 
> (Updated April 19, 2017, 1:10 a.m.)
> 
> 
> Review request for mesos, Neil Conway and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-6223
> https://issues.apache.org/jira/browse/MESOS-6223
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> With partition awareness, the agents are now allowed to re-register
> after they have been marked Unreachable. The executors are anyway
> terminated on the agent when it reboots so there is no harm in
> letting the agent keep its SlaveID, re-register with the master
> and reconcile the lost executors. This is a pre-requisite for
> supporting persistent/restartable tasks in mesos.
> 
> 
> Diffs
> -
> 
>   src/slave/slave.hpp 77fb93abc701cd34b69c75b6219c219fdb784a67 
>   src/slave/slave.cpp 4ff522e75bc8de34fe2e7720bdd8ce3d32cbf803 
>   src/slave/state.hpp a497ce1f58fb8dc7718ee5bb10bc62dd7479efa5 
>   src/slave/state.cpp 33dcc7a148f9a6b1a3216cce45710da8fd819ba6 
>   src/tests/reservation_tests.cpp 4504831d77c1bfcf5f2ddf6d28cd45dea2c421ad 
>   src/tests/slave_recovery_tests.cpp 53f33a2b0411c8158326074ce043c7b1dbeef5b4 
> 
> 
> Diff: https://reviews.apache.org/r/56895/diff/4/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>



Re: Review Request 56895: Allow agents to recover slave state post a reboot.

2017-04-18 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56895/
---

(Updated April 19, 2017, 1:10 a.m.)


Review request for mesos, Neil Conway and Jiang Yan Xu.


Bugs: MESOS-6223
https://issues.apache.org/jira/browse/MESOS-6223


Repository: mesos


Description
---

With partition awareness, the agents are now allowed to re-register
after they have been marked Unreachable. The executors are anyway
terminated on the agent when it reboots so there is no harm in
letting the agent keep its SlaveID, re-register with the master
and reconcile the lost executors. This is a pre-requisite for
supporting persistent/restartable tasks in mesos.


Diffs (updated)
-

  src/slave/slave.hpp 77fb93abc701cd34b69c75b6219c219fdb784a67 
  src/slave/slave.cpp 4ff522e75bc8de34fe2e7720bdd8ce3d32cbf803 
  src/slave/state.hpp a497ce1f58fb8dc7718ee5bb10bc62dd7479efa5 
  src/slave/state.cpp 33dcc7a148f9a6b1a3216cce45710da8fd819ba6 
  src/tests/reservation_tests.cpp 4504831d77c1bfcf5f2ddf6d28cd45dea2c421ad 
  src/tests/slave_recovery_tests.cpp 53f33a2b0411c8158326074ce043c7b1dbeef5b4 


Diff: https://reviews.apache.org/r/56895/diff/4/

Changes: https://reviews.apache.org/r/56895/diff/3-4/


Testing
---

make check


Thanks,

Megha Sharma



Re: Review Request 56895: Allow agents to recover slave state post a reboot.

2017-03-17 Thread Jiang Yan Xu


> On March 16, 2017, 2:25 p.m., Neil Conway wrote:
> > Seems like a legitimate problem in `SlaveRecoveryTest/0.CleanupExecutor`, 
> > per the review bot. Can you take a look?
> 
> Megha Sharma wrote:
> So, the thing is I see this error only in the review bot. I tested this 
> patch on OS X and we ran it on a linux ditribution for nearly 100 iterations 
> to see if the test is flaky but it didn't fail even once. Yan pointed me to 
> the jira https://issues.apache.org/jira/browse/MESOS-2879 and it seems to be 
> very much related to the error we are seeing here.

I meant this could be another case of "the mutex is being deleted, and then 
accessed again later" but MESOS-2879 is already fixed of course so it has to be 
something else. Neil did you get the error you posted by applying the patch to 
the master branch? I tried it and wasn't able to repro on either macOS or Linux 
(1000 iterations).


- Jiang Yan


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56895/#review169215
---


On March 16, 2017, 11:25 a.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56895/
> ---
> 
> (Updated March 16, 2017, 11:25 a.m.)
> 
> 
> Review request for mesos, Neil Conway and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-6223
> https://issues.apache.org/jira/browse/MESOS-6223
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> With partition awareness, the agents are now allowed to re-register
> after they have been marked Unreachable. The executors are anyway
> terminated on the agent when it reboots so there is no harm in
> letting the agent keep its SlaveID, re-register with the master
> and reconcile the lost executors. This is a pre-requisite for
> supporting persistent/restartable tasks in mesos.
> 
> 
> Diffs
> -
> 
>   src/slave/slave.hpp e2de66cc5b899b8b9a9ea27cc30f19a9e8fc11fb 
>   src/slave/slave.cpp a4f4a9ca80b726de8e07571fd6d93120947c278b 
>   src/slave/state.hpp a497ce1f58fb8dc7718ee5bb10bc62dd7479efa5 
>   src/slave/state.cpp f8e7cdd4df0a3c5d62d89edd11844527084f2baa 
>   src/tests/slave_recovery_tests.cpp e6b2bdd4e385208eea7dc513421024242b9efc1c 
> 
> 
> Diff: https://reviews.apache.org/r/56895/diff/3/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>



Re: Review Request 56895: Allow agents to recover slave state post a reboot.

2017-03-17 Thread Megha Sharma


> On March 16, 2017, 9:25 p.m., Neil Conway wrote:
> > Seems like a legitimate problem in `SlaveRecoveryTest/0.CleanupExecutor`, 
> > per the review bot. Can you take a look?

So, the thing is I see this error only in the review bot. I tested this patch 
on OS X and we ran it on a linux ditribution for nearly 100 iterations to see 
if the test is flaky but it didn't fail even once. Yan pointed me to the jira 
https://issues.apache.org/jira/browse/MESOS-2879 and it seems to be very much 
related to the error we are seeing here.


- Megha


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56895/#review169215
---


On March 16, 2017, 6:25 p.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56895/
> ---
> 
> (Updated March 16, 2017, 6:25 p.m.)
> 
> 
> Review request for mesos, Neil Conway and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-6223
> https://issues.apache.org/jira/browse/MESOS-6223
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> With partition awareness, the agents are now allowed to re-register
> after they have been marked Unreachable. The executors are anyway
> terminated on the agent when it reboots so there is no harm in
> letting the agent keep its SlaveID, re-register with the master
> and reconcile the lost executors. This is a pre-requisite for
> supporting persistent/restartable tasks in mesos.
> 
> 
> Diffs
> -
> 
>   src/slave/slave.hpp e2de66cc5b899b8b9a9ea27cc30f19a9e8fc11fb 
>   src/slave/slave.cpp a4f4a9ca80b726de8e07571fd6d93120947c278b 
>   src/slave/state.hpp a497ce1f58fb8dc7718ee5bb10bc62dd7479efa5 
>   src/slave/state.cpp f8e7cdd4df0a3c5d62d89edd11844527084f2baa 
>   src/tests/slave_recovery_tests.cpp e6b2bdd4e385208eea7dc513421024242b9efc1c 
> 
> 
> Diff: https://reviews.apache.org/r/56895/diff/3/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>



Re: Review Request 56895: Allow agents to recover slave state post a reboot.

2017-03-16 Thread Neil Conway

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56895/#review169215
---



Seems like a legitimate problem in `SlaveRecoveryTest/0.CleanupExecutor`, per 
the review bot. Can you take a look?

- Neil Conway


On March 16, 2017, 6:25 p.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56895/
> ---
> 
> (Updated March 16, 2017, 6:25 p.m.)
> 
> 
> Review request for mesos, Neil Conway and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-6223
> https://issues.apache.org/jira/browse/MESOS-6223
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> With partition awareness, the agents are now allowed to re-register
> after they have been marked Unreachable. The executors are anyway
> terminated on the agent when it reboots so there is no harm in
> letting the agent keep its SlaveID, re-register with the master
> and reconcile the lost executors. This is a pre-requisite for
> supporting persistent/restartable tasks in mesos.
> 
> 
> Diffs
> -
> 
>   src/slave/slave.hpp e2de66cc5b899b8b9a9ea27cc30f19a9e8fc11fb 
>   src/slave/slave.cpp a4f4a9ca80b726de8e07571fd6d93120947c278b 
>   src/slave/state.hpp a497ce1f58fb8dc7718ee5bb10bc62dd7479efa5 
>   src/slave/state.cpp f8e7cdd4df0a3c5d62d89edd11844527084f2baa 
>   src/tests/slave_recovery_tests.cpp e6b2bdd4e385208eea7dc513421024242b9efc1c 
> 
> 
> Diff: https://reviews.apache.org/r/56895/diff/3/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>



Re: Review Request 56895: Allow agents to recover slave state post a reboot.

2017-03-16 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56895/
---

(Updated March 16, 2017, 6:25 p.m.)


Review request for mesos, Neil Conway and Jiang Yan Xu.


Bugs: MESOS-6223
https://issues.apache.org/jira/browse/MESOS-6223


Repository: mesos


Description
---

With partition awareness, the agents are now allowed to re-register
after they have been marked Unreachable. The executors are anyway
terminated on the agent when it reboots so there is no harm in
letting the agent keep its SlaveID, re-register with the master
and reconcile the lost executors. This is a pre-requisite for
supporting persistent/restartable tasks in mesos.


Diffs (updated)
-

  src/slave/slave.hpp e2de66cc5b899b8b9a9ea27cc30f19a9e8fc11fb 
  src/slave/slave.cpp a4f4a9ca80b726de8e07571fd6d93120947c278b 
  src/slave/state.hpp a497ce1f58fb8dc7718ee5bb10bc62dd7479efa5 
  src/slave/state.cpp f8e7cdd4df0a3c5d62d89edd11844527084f2baa 
  src/tests/slave_recovery_tests.cpp e6b2bdd4e385208eea7dc513421024242b9efc1c 


Diff: https://reviews.apache.org/r/56895/diff/3/

Changes: https://reviews.apache.org/r/56895/diff/2-3/


Testing
---

make check


Thanks,

Megha Sharma



Re: Review Request 56895: Allow agents to recover slave state post a reboot.

2017-02-21 Thread Neil Conway

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56895/#review166284
---



A few superficial suggestions attached -- will take a closer look shortly.

I see this intermittently:

```
libc++abi.dylib: terminating with uncaught exception of type 
std::__1::system_error: recursive_mutex lock failed: Invalid argument
*** Aborted at 1487726754 (unix time) try "date -d @1487726754" if you are 
using GNU date ***
PC: @ 0x7fff9271cdd6 __pthread_kill
*** SIGABRT (@0x7fff9271cdd6) received by PID 90593 (TID 0x7362d000) stack 
trace: ***
@ 0x7fff927fbbba _sigtramp
@0x41221cd4c (unknown)
@ 0x7fff92682420 abort
@ 0x7fff911dd85a abort_message
@ 0x7fff91202c37 default_terminate_handler()
@ 0x7fff91d0cf33 _objc_terminate()
@ 0x7fff911ffd69 std::__terminate()
@ 0x7fff911ff7de __cxa_throw
@ 0x7fff911cd441 std::__1::__throw_system_error()
@0x112631a49 
_ZZ11synchronizeINSt3__115recursive_mutexEE12SynchronizedIT_EPS3_ENKUlPS1_E_clES6_
@0x112631a28 
_ZZ11synchronizeINSt3__115recursive_mutexEE12SynchronizedIT_EPS3_ENUlPS1_E_8__invokeES6_
@0x112631af9 Synchronized<>::Synchronized()
@0x1126319fd Synchronized<>::Synchronized()
@0x1125fe49a synchronize<>()
@0x1155690f2 process::ProcessBase::enqueue()
@0x115583aad process::ProcessManager::deliver()
@0x115583696 process::ProcessManager::deliver()
@0x115591aa2 process::internal::dispatch()
@0x11386f147 process::dispatch<>()
@0x11386f04f 
_ZZN7process5delayIN5mesos8internal5slave5SlaveEEENS_5TimerERK8DurationRKNS_3PIDIT_EEMSA_FvvEENKUlvE_clEv
@0x11386f00d 
_ZNSt3__128__invoke_void_return_wrapperIvE6__callIJRZN7process5delayIN5mesos8internal5slave5SlaveEEENS3_5TimerERK8DurationRKNS3_3PIDIT_EEMSE_FvvEEUlvE_EEEvDpOT_
@0x11386edc9 
_ZNSt3__110__function6__funcIZN7process5delayIN5mesos8internal5slave5SlaveEEENS2_5TimerERK8DurationRKNS2_3PIDIT_EEMSD_FvvEEUlvE_NS_9allocatorISJ_EEFvvEEclEv
@0x11223765e std::__1::function<>::operator()()
@0x11554e489 process::Timer::operator()()
@0x11554e1c9 process::timedout()
@0x11560df89 
_ZNSt3__128__invoke_void_return_wrapperIvE6__callIJRNS_6__bindIPFvRKNS_4listIN7process5TimerENS_9allocatorIS6_EJRNS_12placeholders4__phILi1EESB_EEEvDpOT_
@0x11560dc79 
_ZNSt3__110__function6__funcINS_6__bindIPFvRKNS_4listIN7process5TimerENS_9allocatorIS5_EJRNS_12placeholders4__phILi1EENS6_ISH_EESB_EclESA_
@0x112233591 std::__1::function<>::operator()()
@0x115210067 process::clock::tick()
@0x11521774f 
_ZNSt3__128__invoke_void_return_wrapperIvE6__callIJRNS_6__bindIRFvRKN7process4TimeEEJS7_EEvDpOT_
@0x1152175c9 
_ZNSt3__110__function6__funcINS_6__bindIRFvRKN7process4TimeEEJS6_EEENS_9allocatorIS9_EEFvvEEclEv
@0x11223765e std::__1::function<>::operator()()
```

Seems like the review bot ran into a similar problem.


src/tests/slave_recovery_tests.cpp (line 2310)


Can we rename `_ack` to something that identifies we're waiting for the 
_agent_ to see the status update acknowledgment?



src/tests/slave_recovery_tests.cpp (line 2401)


Words in variable names should not be separated with underscores.



src/tests/slave_recovery_tests.cpp (line 2402)


Seems like we should lookup 
`state.frameworks[frameworkId].executors[executorId]` once and then reuse it.



src/tests/slave_recovery_tests.cpp (line 2405)


Should probably be `EXPECT`, here and below.



src/tests/slave_recovery_tests.cpp (line 2420)


Indentation.


- Neil Conway


On Feb. 21, 2017, 9:44 p.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56895/
> ---
> 
> (Updated Feb. 21, 2017, 9:44 p.m.)
> 
> 
> Review request for mesos, Neil Conway and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-6223
> https://issues.apache.org/jira/browse/MESOS-6223
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> With partition awareness, the agents are now allowed to re-register
> after they have been marked Unreachable. The executors are anyway
> terminated on the agent when it reboots so there is no harm in
> letting the agent keep its SlaveID, re-register with the master
> and reconcile the lost executors. This is a pre-requisite for
> supporting 

Re: Review Request 56895: Allow agents to recover slave state post a reboot.

2017-02-21 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56895/
---

(Updated Feb. 21, 2017, 9:42 p.m.)


Review request for mesos.


Bugs: MESOS-6223
https://issues.apache.org/jira/browse/MESOS-6223


Repository: mesos


Description
---

With partition awareness, the agents are now allowed to re-register
after they have been marked Unreachable. The executors are anyway
terminated on the agent when it reboots so there is no harm in
letting the agent keep its SlaveID, re-register with the master
and reconcile the lost executors. This is a pre-requisite for
supporting persistent/restartable tasks in mesos.


Diffs (updated)
-

  src/slave/slave.hpp 3b0aea4e3e9a17501077beccbccaab4abbe11af2 
  src/slave/slave.cpp 7564e8d39530794131dbbc928fcbc59fb65ef471 
  src/slave/state.hpp a497ce1f58fb8dc7718ee5bb10bc62dd7479efa5 
  src/slave/state.cpp f8e7cdd4df0a3c5d62d89edd11844527084f2baa 
  src/tests/slave_recovery_tests.cpp 0e295915fea0a7314e173857249bd8726eeccd76 

Diff: https://reviews.apache.org/r/56895/diff/


Testing (updated)
---

make check


Thanks,

Megha Sharma



Re: Review Request 56895: Allow agents to recover slave state post a reboot.

2017-02-21 Thread Mesos Reviewbot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56895/#review166206
---



Bad review!

Reviews applied: []

Error:
No reviewers specified. Please find a reviewer by asking on JIRA or the mailing 
list.

- Mesos Reviewbot


On Feb. 21, 2017, 6:58 p.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56895/
> ---
> 
> (Updated Feb. 21, 2017, 6:58 p.m.)
> 
> 
> Review request for mesos.
> 
> 
> Bugs: MESOS-6223
> https://issues.apache.org/jira/browse/MESOS-6223
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> With partition awareness, the agents are now allowed to re-register
> after they have been marked Unreachable. The executors are anyway
> terminated on the agent when it reboots so there is no harm in
> letting the agent keep its SlaveID, re-register with the master
> and reconcile the lost executors. This is a pre-requisite for
> supporting persistent/restartable tasks in mesos.
> 
> 
> Diffs
> -
> 
>   src/slave/slave.hpp 5049eb783b8ad7b9599f20c3701f7d3d654b4491 
>   src/slave/slave.cpp 92564ff8fff06d1cb17192d374d355b4bb7d39d8 
>   src/slave/state.hpp a497ce1f58fb8dc7718ee5bb10bc62dd7479efa5 
>   src/slave/state.cpp f8e7cdd4df0a3c5d62d89edd11844527084f2baa 
>   src/tests/slave_recovery_tests.cpp 946a7bc4b78f56244633a0b2acb59381e3dbe7e7 
> 
> Diff: https://reviews.apache.org/r/56895/diff/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>



Re: Review Request 56895: Allow agents to recover slave state post a reboot.

2017-02-21 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56895/
---

(Updated Feb. 21, 2017, 6:58 p.m.)


Review request for mesos.


Bugs: MESOS-6223
https://issues.apache.org/jira/browse/MESOS-6223


Repository: mesos


Description
---

With partition awareness, the agents are now allowed to re-register
after they have been marked Unreachable. The executors are anyway
terminated on the agent when it reboots so there is no harm in
letting the agent keep its SlaveID, re-register with the master
and reconcile the lost executors. This is a pre-requisite for
supporting persistent/restartable tasks in mesos.


Diffs
-

  src/slave/slave.hpp 5049eb783b8ad7b9599f20c3701f7d3d654b4491 
  src/slave/slave.cpp 92564ff8fff06d1cb17192d374d355b4bb7d39d8 
  src/slave/state.hpp a497ce1f58fb8dc7718ee5bb10bc62dd7479efa5 
  src/slave/state.cpp f8e7cdd4df0a3c5d62d89edd11844527084f2baa 
  src/tests/slave_recovery_tests.cpp 946a7bc4b78f56244633a0b2acb59381e3dbe7e7 

Diff: https://reviews.apache.org/r/56895/diff/


Testing
---


Thanks,

Megha Sharma



Review Request 56895: Allow agents to recover slave state post a reboot.

2017-02-21 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56895/
---

Review request for mesos.


Repository: mesos


Description
---

With partition awareness, the agents are now allowed to re-register
after they have been marked Unreachable. The executors are anyway
terminated on the agent when it reboots so there is no harm in
letting the agent keep its SlaveID, re-register with the master
and reconcile the lost executors. This is a pre-requisite for
supporting persistent/restartable tasks in mesos.


Diffs
-

  src/slave/slave.hpp 5049eb783b8ad7b9599f20c3701f7d3d654b4491 
  src/slave/slave.cpp 92564ff8fff06d1cb17192d374d355b4bb7d39d8 
  src/slave/state.hpp a497ce1f58fb8dc7718ee5bb10bc62dd7479efa5 
  src/slave/state.cpp f8e7cdd4df0a3c5d62d89edd11844527084f2baa 
  src/tests/slave_recovery_tests.cpp 946a7bc4b78f56244633a0b2acb59381e3dbe7e7 

Diff: https://reviews.apache.org/r/56895/diff/


Testing
---


Thanks,

Megha Sharma