Re: Review Request 64940: Prevented a crash when an agent with terminal tasks is lost.

2018-01-12 Thread Jiang Yan Xu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64940/#review195350
---


Ship it!




Will commit with a few touches like the following.


src/master/master.cpp
Lines 10937-10941 (original), 10935-10939 (patched)


Given our latest conclusion let's revert this back to 

```
count += framework->unreachableTasks.size();
```


- Jiang Yan Xu


On Jan. 12, 2018, 11:24 a.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64940/
> ---
> 
> (Updated Jan. 12, 2018, 11:24 a.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, Gaston Kleiman, Jie Yu, Vinod 
> Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-8337
> https://issues.apache.org/jira/browse/MESOS-8337
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> If an agent is lost, we try to remove all the tasks that might have
> been lost. If a task is already terminal but has unacknowleged status
> updates, it is expected that we track it in the unreachable tasks list
> so we should remove the CHECK that prevents this. This also backs out
> changes to how unreachable tasks are presented in the HTTP endpoints to
> restore compatibility with previous Mesos releases.
> 
> 
> Diffs
> -
> 
>   src/master/http.cpp bc29fafb1f4b515aec3b77850f960c88a65c8362 
>   src/master/master.hpp f915c6f611eaffd5c2cb6cba6c43def679e8f92d 
>   src/master/master.cpp c96cd7090875fbf1b11b1708390e88182c77655b 
>   src/tests/mesos.hpp 93913f2e01898c73e09de58a975aa467e714d882 
>   src/tests/partition_tests.cpp 3813139f576ea01db0197f0fe8a73597db1bb69a 
> 
> 
> Diff: https://reviews.apache.org/r/64940/diff/7/
> 
> 
> Testing
> ---
> 
> make check (Fedora 27)
> 
> 
> Thanks,
> 
> James Peach
> 
>



Re: Review Request 64940: Prevented a crash when an agent with terminal tasks is lost.

2018-01-12 Thread Vinod Kone

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64940/#review195339
---


Ship it!




LGTM.

I still see open issues though, can you please resolve them?

- Vinod Kone


On Jan. 12, 2018, 7:24 p.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64940/
> ---
> 
> (Updated Jan. 12, 2018, 7:24 p.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, Gaston Kleiman, Jie Yu, Vinod 
> Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-8337
> https://issues.apache.org/jira/browse/MESOS-8337
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> If an agent is lost, we try to remove all the tasks that might have
> been lost. If a task is already terminal but has unacknowleged status
> updates, it is expected that we track it in the unreachable tasks list
> so we should remove the CHECK that prevents this. This also backs out
> changes to how unreachable tasks are presented in the HTTP endpoints to
> restore compatibility with previous Mesos releases.
> 
> 
> Diffs
> -
> 
>   src/master/http.cpp bc29fafb1f4b515aec3b77850f960c88a65c8362 
>   src/master/master.hpp f915c6f611eaffd5c2cb6cba6c43def679e8f92d 
>   src/master/master.cpp c96cd7090875fbf1b11b1708390e88182c77655b 
>   src/tests/mesos.hpp 93913f2e01898c73e09de58a975aa467e714d882 
>   src/tests/partition_tests.cpp 3813139f576ea01db0197f0fe8a73597db1bb69a 
> 
> 
> Diff: https://reviews.apache.org/r/64940/diff/7/
> 
> 
> Testing
> ---
> 
> make check (Fedora 27)
> 
> 
> Thanks,
> 
> James Peach
> 
>



Re: Review Request 64940: Prevented a crash when an agent with terminal tasks is lost.

2018-01-12 Thread James Peach

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64940/
---

(Updated Jan. 12, 2018, 7:24 p.m.)


Review request for mesos, Benjamin Mahler, Gaston Kleiman, Jie Yu, Vinod Kone, 
and Jiang Yan Xu.


Changes
---

Addressed nits.


Bugs: MESOS-8337
https://issues.apache.org/jira/browse/MESOS-8337


Repository: mesos


Description
---

If an agent is lost, we try to remove all the tasks that might have
been lost. If a task is already terminal but has unacknowleged status
updates, it is expected that we track it in the unreachable tasks list
so we should remove the CHECK that prevents this. This also backs out
changes to how unreachable tasks are presented in the HTTP endpoints to
restore compatibility with previous Mesos releases.


Diffs (updated)
-

  src/master/http.cpp bc29fafb1f4b515aec3b77850f960c88a65c8362 
  src/master/master.hpp f915c6f611eaffd5c2cb6cba6c43def679e8f92d 
  src/master/master.cpp c96cd7090875fbf1b11b1708390e88182c77655b 
  src/tests/mesos.hpp 93913f2e01898c73e09de58a975aa467e714d882 
  src/tests/partition_tests.cpp 3813139f576ea01db0197f0fe8a73597db1bb69a 


Diff: https://reviews.apache.org/r/64940/diff/7/

Changes: https://reviews.apache.org/r/64940/diff/6-7/


Testing
---

make check (Fedora 27)


Thanks,

James Peach



Re: Review Request 64940: Prevented a crash when an agent with terminal tasks is lost.

2018-01-12 Thread Vinod Kone


> On Jan. 12, 2018, 2:24 a.m., Vinod Kone wrote:
> > AFAICT, in 1.4.x we show unreachable terminal tasks on a removed agent as 
> > completed tasks. But now, we show them as unreachable tasks. If that's true 
> > it's an API change that needs to be called out. Is that really backwards 
> > compatible?
> 
> Jiang Yan Xu wrote:
> Yeah it's true. Despite it being a bug that if the unreachable terminal 
> tasks are considered completed and added to the completed list, it cannot be 
> later removed when the agent reregisters and duplicates are shown in the 
> webUI and APIs, it is indeed what 1.4.x gives you.
> 
> 1.5 fixes the duplication problem but we did the extra work (the 
> additional `if (task->state() != TASK_UNREACHABLE)` checks we added and this 
> revision removes) to make it look like the 1.4.x version, I guess it's fine 
> to keep it that way until we have a plan for an overhaul...
> 
> So, probably let's not revert the code that involves the http endpoints 
> (sorry for suggesting it earlier, there are small changes needed which I'll 
> comment on).

"unreachable terminal tasks are considered completed and added to the completed 
list, it cannot be later removed when the agent reregisters and duplicates are 
shown in the webUI and APIs" .

Duplicated tasks in 2 lists sounds like a bug and not intentional behavior in 
1.4.x. So, if possible we should fix it in 1.5.x to move the task from 
completed to active upon re-registration. Is that easy to do or do we have to 
live with the duplicates until the overhaul?


- Vinod


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64940/#review195282
---


On Jan. 5, 2018, 7:37 p.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64940/
> ---
> 
> (Updated Jan. 5, 2018, 7:37 p.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, Gaston Kleiman, Jie Yu, Vinod 
> Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-8337
> https://issues.apache.org/jira/browse/MESOS-8337
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> If an agent is lost, we try to remove all the tasks that might have
> been lost. If a task is already terminal but has unacknowleged status
> updates, it is expected that we track it in the unreachable tasks list
> so we should remove the CHECK that prevents this. This also backs out
> changes to how unreachable tasks are presented in the HTTP endpoints to
> restore compatibility with previous Mesos releases.
> 
> 
> Diffs
> -
> 
>   src/master/http.cpp bc29fafb1f4b515aec3b77850f960c88a65c8362 
>   src/master/master.hpp 5e6ba53c075174a1e514a395ceb17c26201ec470 
>   src/master/master.cpp 6fc5de89e54ba0b9ae2c4fb475be9878910820d3 
>   src/tests/mesos.hpp 93913f2e01898c73e09de58a975aa467e714d882 
>   src/tests/partition_tests.cpp 3813139f576ea01db0197f0fe8a73597db1bb69a 
> 
> 
> Diff: https://reviews.apache.org/r/64940/diff/6/
> 
> 
> Testing
> ---
> 
> make check (Fedora 27)
> 
> 
> Thanks,
> 
> James Peach
> 
>



Re: Review Request 64940: Prevented a crash when an agent with terminal tasks is lost.

2018-01-12 Thread Jiang Yan Xu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64940/#review195241
---



LGTM modulo Vinod's comments. Sorry we have gone back and forth on this but 
this time I think we are close to reach what's shippable to unblock 1.5.


src/master/http.cpp
Lines 360-362 (original)


So, as reminded by Vinod, for backward compatibility let's keep the checks. 
However here we should change the condition to

```
if (task->state() == TASK_UNREACHABLE) {
  continue;
}
```

because other terminal tasks that are not TASK_LOST could also be in this 
map...



src/master/master.hpp
Lines 2326-2327 (original)


The following CHECK looks correct?

```
CHECK(Master::isRemovable(task->state()) << task->state();
```

Although failing CHECKs are annoying, it helped us uncover the 
inconsistency in our task state transitions so it's better to correct it?



src/master/master.cpp
Lines 11023-11036 (original), 11021-11034 (patched)


I was going to suggest that we remove the `if (task->state() == 
TASK_UNREACHABLE)` check but per Vinod's comment we shouldn't.



src/tests/mesos.hpp
Lines 3511 (patched)


s/a/an/?



src/tests/mesos.hpp
Line 3511 (original), 3518 (patched)


s/a/an/?



src/tests/partition_tests.cpp
Lines 2401 (patched)


We have to s/unreachable/completed/ and s/but/and/ ?



src/tests/partition_tests.cpp
Lines 2402 (patched)


With all the discussions about teminology it seems  s/Completed/Terminal/ 
is more accurate?



src/tests/partition_tests.cpp
Lines 2440-2442 (patched)


Nit: declare them in the order we are expecting on them? i.e., starting -> 
running -> finished.



src/tests/partition_tests.cpp
Lines 2474-2475 (patched)


Without a paused clock, there is a chance of race here right? If the same 
status update is resent after a timeout before we acknowledge, it's possible 
that the resent update is going to be caught by the 2nd expectation?

Would it be possible to hoist the Clock::pause() statement?



src/tests/partition_tests.cpp
Lines 2527 (patched)


Remove this debugging log.



src/tests/partition_tests.cpp
Lines 2532-2534 (patched)


Sorry now you have to expect it to be in the completed_task section...


- Jiang Yan Xu


On Jan. 5, 2018, 11:37 a.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64940/
> ---
> 
> (Updated Jan. 5, 2018, 11:37 a.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, Gaston Kleiman, Jie Yu, Vinod 
> Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-8337
> https://issues.apache.org/jira/browse/MESOS-8337
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> If an agent is lost, we try to remove all the tasks that might have
> been lost. If a task is already terminal but has unacknowleged status
> updates, it is expected that we track it in the unreachable tasks list
> so we should remove the CHECK that prevents this. This also backs out
> changes to how unreachable tasks are presented in the HTTP endpoints to
> restore compatibility with previous Mesos releases.
> 
> 
> Diffs
> -
> 
>   src/master/http.cpp bc29fafb1f4b515aec3b77850f960c88a65c8362 
>   src/master/master.hpp 5e6ba53c075174a1e514a395ceb17c26201ec470 
>   src/master/master.cpp 6fc5de89e54ba0b9ae2c4fb475be9878910820d3 
>   src/tests/mesos.hpp 93913f2e01898c73e09de58a975aa467e714d882 
>   src/tests/partition_tests.cpp 3813139f576ea01db0197f0fe8a73597db1bb69a 
> 
> 
> Diff: https://reviews.apache.org/r/64940/diff/6/
> 
> 
> Testing
> ---
> 
> make check (Fedora 27)
> 
> 
> Thanks,
> 
> James Peach
> 
>



Re: Review Request 64940: Prevented a crash when an agent with terminal tasks is lost.

2018-01-11 Thread Jiang Yan Xu


> On Jan. 11, 2018, 6:24 p.m., Vinod Kone wrote:
> > AFAICT, in 1.4.x we show unreachable terminal tasks on a removed agent as 
> > completed tasks. But now, we show them as unreachable tasks. If that's true 
> > it's an API change that needs to be called out. Is that really backwards 
> > compatible?

Yeah it's true. Despite it being a bug that if the unreachable terminal tasks 
are considered completed and added to the completed list, it cannot be later 
removed when the agent reregisters and duplicates are shown in the webUI and 
APIs, it is indeed what 1.4.x gives you.

1.5 fixes the duplication problem but we did the extra work (the additional `if 
(task->state() != TASK_UNREACHABLE)` checks we added and this revision removes) 
to make it look like the 1.4.x version, I guess it's fine to keep it that way 
until we have a plan for an overhaul...

So, probably let's not revert the code that involves the http endpoints (sorry 
for suggesting it earlier, there are small changes needed which I'll comment 
on).


- Jiang Yan


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64940/#review195282
---


On Jan. 5, 2018, 11:37 a.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64940/
> ---
> 
> (Updated Jan. 5, 2018, 11:37 a.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, Gaston Kleiman, Jie Yu, Vinod 
> Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-8337
> https://issues.apache.org/jira/browse/MESOS-8337
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> If an agent is lost, we try to remove all the tasks that might have
> been lost. If a task is already terminal but has unacknowleged status
> updates, it is expected that we track it in the unreachable tasks list
> so we should remove the CHECK that prevents this. This also backs out
> changes to how unreachable tasks are presented in the HTTP endpoints to
> restore compatibility with previous Mesos releases.
> 
> 
> Diffs
> -
> 
>   src/master/http.cpp bc29fafb1f4b515aec3b77850f960c88a65c8362 
>   src/master/master.hpp 5e6ba53c075174a1e514a395ceb17c26201ec470 
>   src/master/master.cpp 6fc5de89e54ba0b9ae2c4fb475be9878910820d3 
>   src/tests/mesos.hpp 93913f2e01898c73e09de58a975aa467e714d882 
>   src/tests/partition_tests.cpp 3813139f576ea01db0197f0fe8a73597db1bb69a 
> 
> 
> Diff: https://reviews.apache.org/r/64940/diff/6/
> 
> 
> Testing
> ---
> 
> make check (Fedora 27)
> 
> 
> Thanks,
> 
> James Peach
> 
>



Re: Review Request 64940: Prevented a crash when an agent with terminal tasks is lost.

2018-01-11 Thread Vinod Kone

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64940/#review195282
---



AFAICT, in 1.4.x we show unreachable terminal tasks on a removed agent as 
completed tasks. But now, we show them as unreachable tasks. If that's true 
it's an API change that needs to be called out. Is that really backwards 
compatible?


src/tests/partition_tests.cpp
Lines 2481 (patched)


s/trscking/tracking/



src/tests/partition_tests.cpp
Lines 2513-2514 (patched)


s/we have one finished task and lost task/the task state is still finished/


- Vinod Kone


On Jan. 5, 2018, 7:37 p.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64940/
> ---
> 
> (Updated Jan. 5, 2018, 7:37 p.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, Gaston Kleiman, Jie Yu, Vinod 
> Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-8337
> https://issues.apache.org/jira/browse/MESOS-8337
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> If an agent is lost, we try to remove all the tasks that might have
> been lost. If a task is already terminal but has unacknowleged status
> updates, it is expected that we track it in the unreachable tasks list
> so we should remove the CHECK that prevents this. This also backs out
> changes to how unreachable tasks are presented in the HTTP endpoints to
> restore compatibility with previous Mesos releases.
> 
> 
> Diffs
> -
> 
>   src/master/http.cpp bc29fafb1f4b515aec3b77850f960c88a65c8362 
>   src/master/master.hpp 5e6ba53c075174a1e514a395ceb17c26201ec470 
>   src/master/master.cpp 6fc5de89e54ba0b9ae2c4fb475be9878910820d3 
>   src/tests/mesos.hpp 93913f2e01898c73e09de58a975aa467e714d882 
>   src/tests/partition_tests.cpp 3813139f576ea01db0197f0fe8a73597db1bb69a 
> 
> 
> Diff: https://reviews.apache.org/r/64940/diff/6/
> 
> 
> Testing
> ---
> 
> make check (Fedora 27)
> 
> 
> Thanks,
> 
> James Peach
> 
>



Re: Review Request 64940: Prevented a crash when an agent with terminal tasks is lost.

2018-01-11 Thread Mesos Reviewbot Windows

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64940/#review195245
---



FAIL: Mesos tests failed to build.

Reviews applied: `['64940']`

Failed command: `cmake.exe --build . --target mesos-tests --config Debug`

All the build artifacts available at: 
http://dcos-win.westus.cloudapp.azure.com/mesos-build/review/64940

Relevant logs:

- 
[mesos-tests-build-cmake-stdout.log](http://dcos-win.westus.cloudapp.azure.com/mesos-build/review/64940/logs/mesos-tests-build-cmake-stdout.log):

```
  C:\Program Files (x86)\Microsoft Visual 
Studio\2017\Community\VC\Tools\MSVC\14.12.25827\include\vector(590): fatal 
error C1088: Cannot flush compiler intermediate file: 
'C:\Users\mesos\AppData\Local\Temp_CL_52516f0cex': No space left on device 
(compiling source file D:\DCOS\mesos\mesos\src\tests\fault_tolerance_tests.cpp) 
[D:\DCOS\mesos\src\tests\mesos-tests.vcxproj]
  C:\Program Files (x86)\Microsoft Visual 
Studio\2017\Community\VC\Tools\MSVC\14.12.25827\include\xutility(3118): fatal 
error C1088: Cannot flush compiler intermediate file: 
'C:\Users\mesos\AppData\Local\Temp_CL_f3419c4eex': No space left on device 
(compiling source file 
D:\DCOS\mesos\mesos\src\tests\default_executor_tests.cpp) 
[D:\DCOS\mesos\src\tests\mesos-tests.vcxproj]
  C:\Program Files (x86)\Microsoft Visual 
Studio\2017\Community\VC\Tools\MSVC\14.12.25827\include\xtree(1792): fatal 
error C1088: Cannot flush compiler intermediate file: 
'C:\Users\mesos\AppData\Local\Temp_CL_e4799263ex': No space left on device 
(compiling source file 
D:\DCOS\mesos\mesos\src\tests\master_maintenance_tests.cpp) 
[D:\DCOS\mesos\src\tests\mesos-tests.vcxproj]
  
D:\DCOS\mesos\3rdparty\googletest-1.8.0\src\googletest-1.8.0\googletest\include\gtest/gtest-printers.h(857):
 fatal error C1088: Cannot flush compiler intermediate file: 
'C:\Users\mesos\AppData\Local\Temp_CL_23766635ex': No space left on device 
(compiling source file D:\DCOS\mesos\mesos\src\tests\health_check_tests.cpp) 
[D:\DCOS\mesos\src\tests\mesos-tests.vcxproj]
  C:\Program Files (x86)\Microsoft Visual 
Studio\2017\Community\VC\Tools\MSVC\14.12.25827\include\xutility(492): fatal 
error C1088: Cannot flush compiler intermediate file: 
'C:\Users\mesos\AppData\Local\Temp_CL_f0a7b581ex': No space left on device 
(compiling source file 
D:\DCOS\mesos\mesos\src\tests\hierarchical_allocator_tests.cpp) 
[D:\DCOS\mesos\src\tests\mesos-tests.vcxproj]
  C:\Program Files (x86)\Microsoft Visual 
Studio\2017\Community\VC\Tools\MSVC\14.12.25827\include\tuple(274): fatal error 
C1088: Cannot flush compiler intermediate file: 
'C:\Users\mesos\AppData\Local\Temp_CL_b053ea7fsy': No space left on device 
(compiling source file D:\DCOS\mesos\mesos\src\tests\hook_tests.cpp) 
[D:\DCOS\mesos\src\tests\mesos-tests.vcxproj]
  C:\Program Files (x86)\Microsoft Visual 
Studio\2017\Community\VC\Tools\MSVC\14.12.25827\include\memory(1587): fatal 
error C1088: Cannot flush compiler intermediate file: 
'C:\Users\mesos\AppData\Local\Temp_CL_fc0c8d2eex': No space left on device 
(compiling source file 
D:\DCOS\mesos\mesos\src\tests\http_fault_tolerance_tests.cpp) 
[D:\DCOS\mesos\src\tests\mesos-tests.vcxproj]
  C:\Program Files (x86)\Microsoft Visual 
Studio\2017\Community\VC\Tools\MSVC\14.12.25827\include\xtree(1792): fatal 
error C1088: Cannot flush compiler intermediate file: 
'C:\Users\mesos\AppData\Local\Temp_CL_8f7a8b15ex': No space left on device 
(compiling source file 
D:\DCOS\mesos\mesos\src\tests\http_authentication_tests.cpp) 
[D:\DCOS\mesos\src\tests\mesos-tests.vcxproj]
  D:\DCOS\mesos\mesos\3rdparty\libprocess\include\process/gtest.hpp(137): fatal 
error C1088: Cannot flush compiler intermediate file: 
'C:\Users\mesos\AppData\Local\Temp_CL_0673ed75ex': No space left on device 
(compiling source file 
D:\DCOS\mesos\mesos\src\tests\master_slave_reconciliation_tests.cpp) 
[D:\DCOS\mesos\src\tests\mesos-tests.vcxproj]
  C:\Program Files (x86)\Microsoft Visual 
Studio\2017\Community\VC\Tools\MSVC\14.12.25827\include\xutility(460): fatal 
error C1088: Cannot flush compiler intermediate file: 
'C:\Users\mesos\AppData\Local\Temp_CL_0de14d03ex': No space left on device 
(compiling source file D:\DCOS\mesos\mesos\src\tests\gc_tests.cpp) 
[D:\DCOS\mesos\src\tests\mesos-tests.vcxproj]
  
D:\DCOS\mesos\3rdparty\protobuf-3.5.0\src\protobuf-3.5.0\src\google/protobuf/generated_message_util.h(288):
 fatal error C1088: Cannot flush compiler intermediate file: 
'C:\Users\mesos\AppData\Local\Temp_CL_97d6766aex': No space left on device 
(compiling source file D:\DCOS\mesos\mesos\src\tests\role_tests.cpp) 
[D:\DCOS\mesos\src\tests\mesos-tests.vcxproj]
  C:\Program Files (x86)\Microsoft Visual 
Studio\2017\Community\VC\Tools\MSVC\14.12.25827\include\xmemory0(943): fatal 
error C1088: Cannot flush compiler intermediate file: 
'C:\Users\mesos\AppData\Local\Temp_CL_f0b3e955ex': No 

Re: Review Request 64940: Prevented a crash when an agent with terminal tasks is lost.

2018-01-09 Thread Jiang Yan Xu


> On Jan. 4, 2018, 5:25 p.m., Vinod Kone wrote:
> > src/master/master.cpp
> > Lines 10037-10056 (original), 10039-10062 (patched)
> > 
> >
> > I think we shouldn't create a TASK_UNREACHABLE status update and call 
> > `updateTask` or `forward` for a terminal task at all. . Also, `forward` 
> > sends TASK_UNREACHABLE update for terminal task to the framework which 
> > looks incorrect.
> > 
> > 
> > Ideally, we want terminal but unacknowledged tasks to still be marked 
> > unreachable in some way, either via task state being TASK_UNREACHABLE or 
> > task being present in `unreachableTasks`. This allows, for example, the 
> > WebUI to not show sandbox links for unreachable tasks irrespective of 
> > whether they were terminal or not before going unreachable. 
> > 
> > But doing this is tricky for various reasons:
> > 
> > --> `updateTask()` doesn't allow a terminal state to be transitioned to 
> > TASK_UNREACHABLE. Right now when we call `updateTask` for a terminal task, 
> > it adds TASK_UNREACHABLE status to `Task.statuses` and also sends it to 
> > operator API stream subscribers which looks incorrect. The fact that 
> > `updateTask` internally deals with already terminal tasks is a bad design 
> > decision in retrospect. I think the callers shouldn't call it for terminal 
> > tasks instead.
> > 
> > --> It's not clear to our users what a `completed` task means. The 
> > intention was for this to hold a cache of terminal and acknowledged tasks 
> > for storing recent history. The users of the WebUI probably equate 
> > "Completed Tasks" to terminal tasks irrespective of their acknowledgement 
> > status, which is why it is confusing for them to see terminal but 
> > unacknowledged tasks in the "Active tasks" section in the WebUI.
> > 
> > --> When a framework reconciles the state of a task on an unreachable 
> > agent, master replies with TASK_UNREACHABLE irrespective of whether the 
> > task was in a non-terminal state or terminal but un-acknowledged state or 
> > terminal and acknowledged state when the agent went unreachable.  
> > 
> > I think the direction we want to go towards is
> > 
> > --> Completed tasks should consist of terminal unacknowledged and 
> > terminal acknowled tasks, likely in two different data structures.
> > --> Unreachable tasks should consist of all non-complete tasks on an 
> > unreachable agent.  All the tasks in this map should be in TASK_UNREACHABLE 
> > state.
> > 
> > 
> > Given all the above is a very involved change, I would recommend 
> > keeping what you have here but with a giant TODO (your current comment in 
> > #10058 doesn't go into enough detail about the complexity here) that talks 
> > about the above stuff. Your change at least keeps the parity with the 
> > (broken) semantics that we have in 1.4 and earlier so that's a bit better.
> 
> Vinod Kone wrote:
> Ignore the first line. Forgot to delete it.
> 
> Jiang Yan Xu wrote:
> Future direction
> 
> 1. If completed == terminal unacknowledged + terminal acknowledged, then 
> completed == terminal right? Should we then unify the terminology and pick 
> one?
> 2. Unreachable tasks == non-terminal tasks on an unreachable agent: this 
> is what this RR is going to do but IIUC you want a different behavior.
> 
> Current semantics
> 
> 1. In 1.4 the the master (in `updateTask` sends `TASK_UNREACHABLE` to the 
> operator API subsribers for terminal tasks), as it stands right now we are 
> going to send `TASK_UNREACHABLE` to the schedulers as well. Should we change 
> that?
> 2. You also said above that "Ideally, we want terminal but unacknowledged 
> tasks to still be marked unreachable in some way" which seems to contradict 
> your later point that "Unreachable tasks should consist of all non-complete 
> (terminal) tasks", could you clarify?
> 
> Overall it sounds to me that the most correct semantic is still to set 
> `TASK_UNREACHABLE` only for the tasks that are non-terminal (because 
> otherwise we know that the state is not going to change to something else 
> that we don't know yet) but perhaps we can use another field in the status to 
> signal the fact that the agent is partitioned?
> 
> James Peach wrote:
> https://issues.apache.org/jira/browse/MESOS-8405
> 
> Vinod Kone wrote:
> Sorry for the confusion. In the future direction section it should be 
> "Unreachable tasks should consist of all tasks on an unreachable agent.". 
> Note that this is what is returned when a framework does explicit 
> reconciliation for example. I guess this means a terminal task can go to 
> UNREACHABLE and back depending on the state of the agent, which I agree is 
> also a bit weird. Basically the TASK_UNREACHABLE state is being used as a 
> proxy for the agent state, it's not really task state. I think that's source 

Re: Review Request 64940: Prevented a crash when an agent with terminal tasks is lost.

2018-01-08 Thread Vinod Kone

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64940/#review194998
---




src/master/master.cpp
Line 10043 (original), 10038-10044 (patched)


You probably want to mention that even though we are not marking terminal 
tasks as UNREACHABLE, we are still sending TASK_UNREACHABLE updates to the 
clients (frameworks, subscribers) which is unfortunate.



src/tests/partition_tests.cpp
Lines 2426-2448 (original), 2436-2452 (patched)


I still think you are complicating this test by launching 2 tasks instead 
of one task that goes to TASK_FINISHED. All the comments in this test point to 
this task, but for some reason you also wanted to test the other case 
(non-terminal task on a removed agent) which didn't trigger the bug.



src/tests/partition_tests.cpp
Lines 2496-2502 (original), 2498-2504 (patched)


This doesn't really say which task went to FINISHED and which one went to 
LOST, which is unfortunate.


- Vinod Kone


On Jan. 5, 2018, 7:37 p.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64940/
> ---
> 
> (Updated Jan. 5, 2018, 7:37 p.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, Gaston Kleiman, Jie Yu, Vinod 
> Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-8337
> https://issues.apache.org/jira/browse/MESOS-8337
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> If an agent is lost, we try to remove all the tasks that might
> have been lost. However, if a task is already terminal, it hasn't
> really been lost so we should not be tracking it in the framework's
> unreachable tasks list.
> 
> 
> Diffs
> -
> 
>   src/master/master.hpp 130f6e28cc62a8912aac66ecfbf014fe1ee444e3 
>   src/master/master.cpp 28d8be3a4769b418b61cff0b95845e4232135bc7 
>   src/tests/partition_tests.cpp 3813139f576ea01db0197f0fe8a73597db1bb69a 
> 
> 
> Diff: https://reviews.apache.org/r/64940/diff/5/
> 
> 
> Testing
> ---
> 
> make check (Fedora 27)
> 
> 
> Thanks,
> 
> James Peach
> 
>



Re: Review Request 64940: Prevented a crash when an agent with terminal tasks is lost.

2018-01-08 Thread Mesos Reviewbot Windows

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64940/#review194999
---



FAIL: Some Mesos tests failed.

Reviews applied: `['64940']`

Failed command: `D:\DCOS\mesos\src\mesos-tests.exe --verbose`

All the build artifacts available at: 
http://dcos-win.westus.cloudapp.azure.com/mesos-build/review/64940

Relevant logs:

- 
[mesos-tests-stdout.log](http://dcos-win.westus.cloudapp.azure.com/mesos-build/review/64940/logs/mesos-tests-stdout.log):

```
[   OK ] Endpoint/SlaveEndpointTest.NoAuthorizer/2 (102 ms)
[--] 9 tests from Endpoint/SlaveEndpointTest (971 ms total)

[--] 2 tests from ContainerizerType/DefaultContainerDNSFlagTest
[ RUN  ] ContainerizerType/DefaultContainerDNSFlagTest.ValidateFlag/0
[   OK ] ContainerizerType/DefaultContainerDNSFlagTest.ValidateFlag/0 (33 
ms)
[ RUN  ] ContainerizerType/DefaultContainerDNSFlagTest.ValidateFlag/1
[   OK ] ContainerizerType/DefaultContainerDNSFlagTest.ValidateFlag/1 (38 
ms)
[--] 2 tests from ContainerizerType/DefaultContainerDNSFlagTest (73 ms 
total)

[--] 1 test from IsolationFlag/CpuIsolatorTest
[ RUN  ] IsolationFlag/CpuIsolatorTest.ROOT_UserCpuUsage/0
[   OK ] IsolationFlag/CpuIsolatorTest.ROOT_UserCpuUsage/0 (2429 ms)
[--] 1 test from IsolationFlag/CpuIsolatorTest (2454 ms total)

[--] 1 test from IsolationFlag/MemoryIsolatorTest
[ RUN  ] IsolationFlag/MemoryIsolatorTest.ROOT_MemUsage/0
[   OK ] IsolationFlag/MemoryIsolatorTest.ROOT_MemUsage/0 (2391 ms)
[--] 1 test from IsolationFlag/MemoryIsolatorTest (2416 ms total)

[--] Global test environment tear-down
[==] 847 tests from 85 test cases ran. (321089 ms total)
[  PASSED  ] 846 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] PartitionTest.AgentWithCompletedTaskPartitioned

 1 FAILED TEST
  YOU HAVE 213 DISABLED TESTS

```

- 
[mesos-tests-stderr.log](http://dcos-win.westus.cloudapp.azure.com/mesos-build/review/64940/logs/mesos-tests-stderr.log):

```
I0108 23:23:35.775553  9436 executor.cpp:171] Received SUBSCRIBED event
I0108 23:23:35.779811  9436 executor.cpp:175] Subscribed executor on 
build-srv-03.zq4gs31qjdiunm1ryi1452nvnh.dx.internal.cloudapp.net
I0108 23:23:35.780552  9436 executor.cpp:171] Received LAUNCH event
I0108 23:23:35.783553  9436 executor.cpp:638] Starting task 
3f37b88a-3621-4109-8069-194fca55ecec
I0108 23:23:35.858552  9436 executor.cpp:478] Running 
'D:\DCOS\mesos\src\mesos-containerizer.exe launch '
I0108 23:23:36.412520  9436 executor.cpp:651] Forked command at 2800
I0108 23:23:36.441519  9252 exec.cpp:445] Executor asked to shutdown
I0108 23:23:36.441519  9436 executor.cpp:171] Received SHUTDOWN event
I0108 23:23:36.441519  9436 executor.cpp:748] Shutting down
I0108 23:23:36.441519  9436 executor.cpp:863] Sending SIGTERM to process tree 
at pid 27f76041c-8c90-4335-b40b-e60c5cfb6b86- (default) at 
scheduler-219dcd29-9600-4631--39cb16c974e3@10.3.1.11:55127
I0108 23:23:36.439517  9172 hierarchical.cpp:405] Deactivated framework 
7f76041c-8c90-4335-b40b-e60c5cfb6b86-
I0108 23:23:36.439517  6764 master.cpp:10196] Updating the state of task 
3f37b88a-3621-4109-8069-194fca55ecec of framework 
7f76041c-8c90-4335-b40b-e60c5cfb6b86- (latest state: TASK_KILLED, status 
update state: TASK_KILLED)
I0108 23:23:36.439517  9432 slave.cpp:3396] Shutting down framework 
7f76041c-8c90-4335-b40b-e60c5cfb6b86-
I0108 23:23:36.439517  9432 slave.cpp:6074] Shutting down executor 
'3f37b88a-3621-4109-8069-194fca55ecec' of framework 
7f76041c-8c90-4335-b40b-e60c5cfb6b86- at executor(1)@10.3.1.11:55149
I0108 23:23:36.440518  9432 slave.cpp:931] Agent terminating
W0108 23:23:36.440518  9432 slave.cpp:3392] Ignoring shutdown framework 
7f76041c-8c90-4335-b40b-e60c5cfb6b86- because it is terminating
I0108 23:23:36.441519  6764 master.cpp:10300] Removing task 
3f37b88a-3621-4109-8069-194fca55ecec with resources cpus(allocated: *):4; 
mem(allocated: *):2048; disk(allocated: *):1024; ports(allocated: 
*):[31000-32000] of framework 7f76041c-8c90-4335-b40b-e60c5cfb6b86- on 
agent 7f76041c-8c90-4335-b40b-e60c5cfb6b86-S0 at slave(329)@10.3.1.11:55127 
(build-srv-03.zq4gs31qjdiunm1ryi1452nvnh.dx.internal.cloudapp.net)
I0108 23:23:36.443517  9432 containerizer.cpp:2352] Destroying container 
da7f8cdc-2135-4ca1-b3cf-e18e8fdef330 in RUNNING state
I0108 23:23:36.443517  9432 containerizer.cpp:2966] Transitioning the state of 
container da7f8cdc-2135-4ca1-b3cf-e18e8fdef330 from RUNNING to DESTROYING
I0108 23:23:36.444519  6764 master.cpp:1305] Agent 
7f76041c-8c90-4335-b40b-e60c5cfb6b86-S0 at slave(329)@10.3.1.11:55127 
(build-srv-03.zq4gs31qjdiunm1ryi1452nvnh.dx.internal.cloudapp.net) disconnected
I0108 23:23:36.444519  6764 master.cpp:3365] Disconnecting agent 
7f76041c-8c90-4335-b40b-e60c5cfb6b86-S0 at 

Re: Review Request 64940: Prevented a crash when an agent with terminal tasks is lost.

2018-01-08 Thread Vinod Kone


> On Jan. 5, 2018, 1:25 a.m., Vinod Kone wrote:
> > src/master/master.cpp
> > Lines 10037-10056 (original), 10039-10062 (patched)
> > 
> >
> > I think we shouldn't create a TASK_UNREACHABLE status update and call 
> > `updateTask` or `forward` for a terminal task at all. . Also, `forward` 
> > sends TASK_UNREACHABLE update for terminal task to the framework which 
> > looks incorrect.
> > 
> > 
> > Ideally, we want terminal but unacknowledged tasks to still be marked 
> > unreachable in some way, either via task state being TASK_UNREACHABLE or 
> > task being present in `unreachableTasks`. This allows, for example, the 
> > WebUI to not show sandbox links for unreachable tasks irrespective of 
> > whether they were terminal or not before going unreachable. 
> > 
> > But doing this is tricky for various reasons:
> > 
> > --> `updateTask()` doesn't allow a terminal state to be transitioned to 
> > TASK_UNREACHABLE. Right now when we call `updateTask` for a terminal task, 
> > it adds TASK_UNREACHABLE status to `Task.statuses` and also sends it to 
> > operator API stream subscribers which looks incorrect. The fact that 
> > `updateTask` internally deals with already terminal tasks is a bad design 
> > decision in retrospect. I think the callers shouldn't call it for terminal 
> > tasks instead.
> > 
> > --> It's not clear to our users what a `completed` task means. The 
> > intention was for this to hold a cache of terminal and acknowledged tasks 
> > for storing recent history. The users of the WebUI probably equate 
> > "Completed Tasks" to terminal tasks irrespective of their acknowledgement 
> > status, which is why it is confusing for them to see terminal but 
> > unacknowledged tasks in the "Active tasks" section in the WebUI.
> > 
> > --> When a framework reconciles the state of a task on an unreachable 
> > agent, master replies with TASK_UNREACHABLE irrespective of whether the 
> > task was in a non-terminal state or terminal but un-acknowledged state or 
> > terminal and acknowledged state when the agent went unreachable.  
> > 
> > I think the direction we want to go towards is
> > 
> > --> Completed tasks should consist of terminal unacknowledged and 
> > terminal acknowled tasks, likely in two different data structures.
> > --> Unreachable tasks should consist of all non-complete tasks on an 
> > unreachable agent.  All the tasks in this map should be in TASK_UNREACHABLE 
> > state.
> > 
> > 
> > Given all the above is a very involved change, I would recommend 
> > keeping what you have here but with a giant TODO (your current comment in 
> > #10058 doesn't go into enough detail about the complexity here) that talks 
> > about the above stuff. Your change at least keeps the parity with the 
> > (broken) semantics that we have in 1.4 and earlier so that's a bit better.
> 
> Vinod Kone wrote:
> Ignore the first line. Forgot to delete it.
> 
> Jiang Yan Xu wrote:
> Future direction
> 
> 1. If completed == terminal unacknowledged + terminal acknowledged, then 
> completed == terminal right? Should we then unify the terminology and pick 
> one?
> 2. Unreachable tasks == non-terminal tasks on an unreachable agent: this 
> is what this RR is going to do but IIUC you want a different behavior.
> 
> Current semantics
> 
> 1. In 1.4 the the master (in `updateTask` sends `TASK_UNREACHABLE` to the 
> operator API subsribers for terminal tasks), as it stands right now we are 
> going to send `TASK_UNREACHABLE` to the schedulers as well. Should we change 
> that?
> 2. You also said above that "Ideally, we want terminal but unacknowledged 
> tasks to still be marked unreachable in some way" which seems to contradict 
> your later point that "Unreachable tasks should consist of all non-complete 
> (terminal) tasks", could you clarify?
> 
> Overall it sounds to me that the most correct semantic is still to set 
> `TASK_UNREACHABLE` only for the tasks that are non-terminal (because 
> otherwise we know that the state is not going to change to something else 
> that we don't know yet) but perhaps we can use another field in the status to 
> signal the fact that the agent is partitioned?
> 
> James Peach wrote:
> https://issues.apache.org/jira/browse/MESOS-8405

Sorry for the confusion. In the future direction section it should be 
"Unreachable tasks should consist of all tasks on an unreachable agent.". Note 
that this is what is returned when a framework does explicit reconciliation for 
example. I guess this means a terminal task can go to UNREACHABLE and back 
depending on the state of the agent, which I agree is also a bit weird. 
Basically the TASK_UNREACHABLE state is being used as a proxy for the agent 
state, it's not really task state. I think that's source of the confusion. In 
retrospect, maybe 

Re: Review Request 64940: Prevented a crash when an agent with terminal tasks is lost.

2018-01-05 Thread Jiang Yan Xu


> On Jan. 4, 2018, 2:09 p.m., Jiang Yan Xu wrote:
> > src/master/master.cpp
> > Lines 10061 (patched)
> > 
> >
> > Perhaps move all the logic around determining the unreachable task down 
> > here.
> > 
> > ```
> > // We need to inform `removeTask` whether the task became unreachable 
> > due to the agent being removed. Note that terminal tasks are not considered 
> > unreachable. We can not rely on the task's state because it may have been 
> > set to `TASK_LOST` for backwards-compatibility.
> > bool unreachable =
> >   unreachableTime.isSome() && !protobuf::isTerminalState(task->state());
> > ```
> > 
> > Inlining `unreachable` is probably fine too but separating it helps 
> > clarify it with comments?
> 
> James Peach wrote:
> You have to sample the task state before the update because it might 
> transition to a terminal state at that time (and LOST is terminal).

Right, so put the line above `updateTask`? The main thing I am commenting on is 
that we can consolidate the logic without first defining `bool unreachable = 
true;` much far above?


> On Jan. 4, 2018, 2:09 p.m., Jiang Yan Xu wrote:
> > src/tests/master_tests.cpp
> > Lines 7637 (patched)
> > 
> >
> > You could just wait for the the status update (being droppped) instead 
> > of capaturing `Slave::executorTerminated` and risk the race condition of 
> > the master not receiving the status update before the agent is deemed 
> > partitioned?
> 
> James Peach wrote:
> Dealing with all the status updates was less reliable.

Do you see the race condition I was trying to point out?


> On Jan. 4, 2018, 2:09 p.m., Jiang Yan Xu wrote:
> > src/tests/master_tests.cpp
> > Lines 7686 (patched)
> > 
> >
> > No need to settle?
> 
> James Peach wrote:
> This is settling to ensure that events following from marking the agent 
> down are fully processed.

Oh ok that's right.


- Jiang Yan


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64940/#review194777
---


On Jan. 5, 2018, 11:06 a.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64940/
> ---
> 
> (Updated Jan. 5, 2018, 11:06 a.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, Gaston Kleiman, Jie Yu, Vinod 
> Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-8337
> https://issues.apache.org/jira/browse/MESOS-8337
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> If an agent is lost, we try to remove all the tasks that might
> have been lost. However, if a task is already terminal, it hasn't
> really been lost so we should not be tracking it in the framework's
> unreachable tasks list.
> 
> 
> Diffs
> -
> 
>   src/master/master.hpp 130f6e28cc62a8912aac66ecfbf014fe1ee444e3 
>   src/master/master.cpp 28d8be3a4769b418b61cff0b95845e4232135bc7 
>   src/tests/partition_tests.cpp 3813139f576ea01db0197f0fe8a73597db1bb69a 
> 
> 
> Diff: https://reviews.apache.org/r/64940/diff/4/
> 
> 
> Testing
> ---
> 
> make check (Fedora 27)
> 
> 
> Thanks,
> 
> James Peach
> 
>



Re: Review Request 64940: Prevented a crash when an agent with terminal tasks is lost.

2018-01-05 Thread James Peach


> On Jan. 5, 2018, 1:25 a.m., Vinod Kone wrote:
> > src/master/master.cpp
> > Lines 10037-10056 (original), 10039-10062 (patched)
> > 
> >
> > I think we shouldn't create a TASK_UNREACHABLE status update and call 
> > `updateTask` or `forward` for a terminal task at all. . Also, `forward` 
> > sends TASK_UNREACHABLE update for terminal task to the framework which 
> > looks incorrect.
> > 
> > 
> > Ideally, we want terminal but unacknowledged tasks to still be marked 
> > unreachable in some way, either via task state being TASK_UNREACHABLE or 
> > task being present in `unreachableTasks`. This allows, for example, the 
> > WebUI to not show sandbox links for unreachable tasks irrespective of 
> > whether they were terminal or not before going unreachable. 
> > 
> > But doing this is tricky for various reasons:
> > 
> > --> `updateTask()` doesn't allow a terminal state to be transitioned to 
> > TASK_UNREACHABLE. Right now when we call `updateTask` for a terminal task, 
> > it adds TASK_UNREACHABLE status to `Task.statuses` and also sends it to 
> > operator API stream subscribers which looks incorrect. The fact that 
> > `updateTask` internally deals with already terminal tasks is a bad design 
> > decision in retrospect. I think the callers shouldn't call it for terminal 
> > tasks instead.
> > 
> > --> It's not clear to our users what a `completed` task means. The 
> > intention was for this to hold a cache of terminal and acknowledged tasks 
> > for storing recent history. The users of the WebUI probably equate 
> > "Completed Tasks" to terminal tasks irrespective of their acknowledgement 
> > status, which is why it is confusing for them to see terminal but 
> > unacknowledged tasks in the "Active tasks" section in the WebUI.
> > 
> > --> When a framework reconciles the state of a task on an unreachable 
> > agent, master replies with TASK_UNREACHABLE irrespective of whether the 
> > task was in a non-terminal state or terminal but un-acknowledged state or 
> > terminal and acknowledged state when the agent went unreachable.  
> > 
> > I think the direction we want to go towards is
> > 
> > --> Completed tasks should consist of terminal unacknowledged and 
> > terminal acknowled tasks, likely in two different data structures.
> > --> Unreachable tasks should consist of all non-complete tasks on an 
> > unreachable agent.  All the tasks in this map should be in TASK_UNREACHABLE 
> > state.
> > 
> > 
> > Given all the above is a very involved change, I would recommend 
> > keeping what you have here but with a giant TODO (your current comment in 
> > #10058 doesn't go into enough detail about the complexity here) that talks 
> > about the above stuff. Your change at least keeps the parity with the 
> > (broken) semantics that we have in 1.4 and earlier so that's a bit better.
> 
> Vinod Kone wrote:
> Ignore the first line. Forgot to delete it.
> 
> Jiang Yan Xu wrote:
> Future direction
> 
> 1. If completed == terminal unacknowledged + terminal acknowledged, then 
> completed == terminal right? Should we then unify the terminology and pick 
> one?
> 2. Unreachable tasks == non-terminal tasks on an unreachable agent: this 
> is what this RR is going to do but IIUC you want a different behavior.
> 
> Current semantics
> 
> 1. In 1.4 the the master (in `updateTask` sends `TASK_UNREACHABLE` to the 
> operator API subsribers for terminal tasks), as it stands right now we are 
> going to send `TASK_UNREACHABLE` to the schedulers as well. Should we change 
> that?
> 2. You also said above that "Ideally, we want terminal but unacknowledged 
> tasks to still be marked unreachable in some way" which seems to contradict 
> your later point that "Unreachable tasks should consist of all non-complete 
> (terminal) tasks", could you clarify?
> 
> Overall it sounds to me that the most correct semantic is still to set 
> `TASK_UNREACHABLE` only for the tasks that are non-terminal (because 
> otherwise we know that the state is not going to change to something else 
> that we don't know yet) but perhaps we can use another field in the status to 
> signal the fact that the agent is partitioned?

https://issues.apache.org/jira/browse/MESOS-8405


- James


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64940/#review194794
---


On Jan. 5, 2018, 7:06 p.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64940/
> ---
> 
> (Updated Jan. 5, 2018, 7:06 p.m.)
> 
> 
> Review request for 

Re: Review Request 64940: Prevented a crash when an agent with terminal tasks is lost.

2018-01-05 Thread James Peach


> On Jan. 5, 2018, 1:25 a.m., Vinod Kone wrote:
> > src/tests/master_tests.cpp
> > Lines 7630 (patched)
> > 
> >
> > `the task`, which task? are both the tasks using the same executor?

We need one task to finish and one to not finish. They have separate executors.


> On Jan. 5, 2018, 1:25 a.m., Vinod Kone wrote:
> > src/tests/master_tests.cpp
> > Lines 7645-7653 (patched)
> > 
> >
> > Why are you launching 2 tasks instead of just one? You just want to 
> > test the unacknowledged terminal task case right? Doesn't really matter if 
> > it is TASK_FINISHED or TASK_LOST right?

`TASK_LOST` is also a terminal state. So I'm distinguishing that the completed 
task doesn't get tracked as unreachable but that the `LOST` task does. If I 
didn't have any non-lost tasks, the `CHECK` would not be triggered.


> On Jan. 5, 2018, 1:25 a.m., Vinod Kone wrote:
> > src/tests/master_tests.cpp
> > Lines 7692-7711 (patched)
> > 
> >
> > Are you verifying that the latest state of the task has been preserved? 
> > I would recommend hitting "/tasks" endpoint instead of "state-summary" to 
> > be more direct.

I'm checking that one task remained at `FINISHED` and one was moved to `LOST`. 
I can make the same checks using the `/tasks` endpoint, but it's going to be 
more code and more complicated.


- James


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64940/#review194794
---


On Jan. 5, 2018, 7:06 p.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64940/
> ---
> 
> (Updated Jan. 5, 2018, 7:06 p.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, Gaston Kleiman, Jie Yu, Vinod 
> Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-8337
> https://issues.apache.org/jira/browse/MESOS-8337
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> If an agent is lost, we try to remove all the tasks that might
> have been lost. However, if a task is already terminal, it hasn't
> really been lost so we should not be tracking it in the framework's
> unreachable tasks list.
> 
> 
> Diffs
> -
> 
>   src/master/master.hpp 130f6e28cc62a8912aac66ecfbf014fe1ee444e3 
>   src/master/master.cpp 28d8be3a4769b418b61cff0b95845e4232135bc7 
>   src/tests/partition_tests.cpp 3813139f576ea01db0197f0fe8a73597db1bb69a 
> 
> 
> Diff: https://reviews.apache.org/r/64940/diff/4/
> 
> 
> Testing
> ---
> 
> make check (Fedora 27)
> 
> 
> Thanks,
> 
> James Peach
> 
>



Re: Review Request 64940: Prevented a crash when an agent with terminal tasks is lost.

2018-01-05 Thread James Peach

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64940/
---

(Updated Jan. 5, 2018, 7:06 p.m.)


Review request for mesos, Benjamin Mahler, Gaston Kleiman, Jie Yu, Vinod Kone, 
and Jiang Yan Xu.


Bugs: MESOS-8337
https://issues.apache.org/jira/browse/MESOS-8337


Repository: mesos


Description
---

If an agent is lost, we try to remove all the tasks that might
have been lost. However, if a task is already terminal, it hasn't
really been lost so we should not be tracking it in the framework's
unreachable tasks list.


Diffs (updated)
-

  src/master/master.hpp 130f6e28cc62a8912aac66ecfbf014fe1ee444e3 
  src/master/master.cpp 28d8be3a4769b418b61cff0b95845e4232135bc7 
  src/tests/partition_tests.cpp 3813139f576ea01db0197f0fe8a73597db1bb69a 


Diff: https://reviews.apache.org/r/64940/diff/4/

Changes: https://reviews.apache.org/r/64940/diff/3-4/


Testing
---

make check (Fedora 27)


Thanks,

James Peach



Re: Review Request 64940: Prevented a crash when an agent with terminal tasks is lost.

2018-01-05 Thread Jiang Yan Xu


> On Jan. 4, 2018, 5:25 p.m., Vinod Kone wrote:
> > src/master/master.cpp
> > Lines 10037-10056 (original), 10039-10062 (patched)
> > 
> >
> > I think we shouldn't create a TASK_UNREACHABLE status update and call 
> > `updateTask` or `forward` for a terminal task at all. . Also, `forward` 
> > sends TASK_UNREACHABLE update for terminal task to the framework which 
> > looks incorrect.
> > 
> > 
> > Ideally, we want terminal but unacknowledged tasks to still be marked 
> > unreachable in some way, either via task state being TASK_UNREACHABLE or 
> > task being present in `unreachableTasks`. This allows, for example, the 
> > WebUI to not show sandbox links for unreachable tasks irrespective of 
> > whether they were terminal or not before going unreachable. 
> > 
> > But doing this is tricky for various reasons:
> > 
> > --> `updateTask()` doesn't allow a terminal state to be transitioned to 
> > TASK_UNREACHABLE. Right now when we call `updateTask` for a terminal task, 
> > it adds TASK_UNREACHABLE status to `Task.statuses` and also sends it to 
> > operator API stream subscribers which looks incorrect. The fact that 
> > `updateTask` internally deals with already terminal tasks is a bad design 
> > decision in retrospect. I think the callers shouldn't call it for terminal 
> > tasks instead.
> > 
> > --> It's not clear to our users what a `completed` task means. The 
> > intention was for this to hold a cache of terminal and acknowledged tasks 
> > for storing recent history. The users of the WebUI probably equate 
> > "Completed Tasks" to terminal tasks irrespective of their acknowledgement 
> > status, which is why it is confusing for them to see terminal but 
> > unacknowledged tasks in the "Active tasks" section in the WebUI.
> > 
> > --> When a framework reconciles the state of a task on an unreachable 
> > agent, master replies with TASK_UNREACHABLE irrespective of whether the 
> > task was in a non-terminal state or terminal but un-acknowledged state or 
> > terminal and acknowledged state when the agent went unreachable.  
> > 
> > I think the direction we want to go towards is
> > 
> > --> Completed tasks should consist of terminal unacknowledged and 
> > terminal acknowled tasks, likely in two different data structures.
> > --> Unreachable tasks should consist of all non-complete tasks on an 
> > unreachable agent.  All the tasks in this map should be in TASK_UNREACHABLE 
> > state.
> > 
> > 
> > Given all the above is a very involved change, I would recommend 
> > keeping what you have here but with a giant TODO (your current comment in 
> > #10058 doesn't go into enough detail about the complexity here) that talks 
> > about the above stuff. Your change at least keeps the parity with the 
> > (broken) semantics that we have in 1.4 and earlier so that's a bit better.
> 
> Vinod Kone wrote:
> Ignore the first line. Forgot to delete it.

Future direction

1. If completed == terminal unacknowledged + terminal acknowledged, then 
completed == terminal right? Should we then unify the terminology and pick one?
2. Unreachable tasks == non-terminal tasks on an unreachable agent: this is 
what this RR is going to do but IIUC you want a different behavior.

Current semantics

1. In 1.4 the the master (in `updateTask` sends `TASK_UNREACHABLE` to the 
operator API subsribers for terminal tasks), as it stands right now we are 
going to send `TASK_UNREACHABLE` to the schedulers as well. Should we change 
that?
2. You also said above that "Ideally, we want terminal but unacknowledged tasks 
to still be marked unreachable in some way" which seems to contradict your 
later point that "Unreachable tasks should consist of all non-complete 
(terminal) tasks", could you clarify?

Overall it sounds to me that the most correct semantic is still to set 
`TASK_UNREACHABLE` only for the tasks that are non-terminal (because otherwise 
we know that the state is not going to change to something else that we don't 
know yet) but perhaps we can use another field in the status to signal the fact 
that the agent is partitioned?


- Jiang Yan


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64940/#review194794
---


On Jan. 3, 2018, 5:35 p.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64940/
> ---
> 
> (Updated Jan. 3, 2018, 5:35 p.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, Gaston Kleiman, Jie Yu, Vinod 
> Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-8337
> https://issues.apache.org/jira/browse/MESOS-8337
> 

Re: Review Request 64940: Prevented a crash when an agent with terminal tasks is lost.

2018-01-04 Thread Vinod Kone


> On Jan. 5, 2018, 1:25 a.m., Vinod Kone wrote:
> > src/master/master.cpp
> > Lines 10037-10056 (original), 10039-10062 (patched)
> > 
> >
> > I think we shouldn't create a TASK_UNREACHABLE status update and call 
> > `updateTask` or `forward` for a terminal task at all. . Also, `forward` 
> > sends TASK_UNREACHABLE update for terminal task to the framework which 
> > looks incorrect.
> > 
> > 
> > Ideally, we want terminal but unacknowledged tasks to still be marked 
> > unreachable in some way, either via task state being TASK_UNREACHABLE or 
> > task being present in `unreachableTasks`. This allows, for example, the 
> > WebUI to not show sandbox links for unreachable tasks irrespective of 
> > whether they were terminal or not before going unreachable. 
> > 
> > But doing this is tricky for various reasons:
> > 
> > --> `updateTask()` doesn't allow a terminal state to be transitioned to 
> > TASK_UNREACHABLE. Right now when we call `updateTask` for a terminal task, 
> > it adds TASK_UNREACHABLE status to `Task.statuses` and also sends it to 
> > operator API stream subscribers which looks incorrect. The fact that 
> > `updateTask` internally deals with already terminal tasks is a bad design 
> > decision in retrospect. I think the callers shouldn't call it for terminal 
> > tasks instead.
> > 
> > --> It's not clear to our users what a `completed` task means. The 
> > intention was for this to hold a cache of terminal and acknowledged tasks 
> > for storing recent history. The users of the WebUI probably equate 
> > "Completed Tasks" to terminal tasks irrespective of their acknowledgement 
> > status, which is why it is confusing for them to see terminal but 
> > unacknowledged tasks in the "Active tasks" section in the WebUI.
> > 
> > --> When a framework reconciles the state of a task on an unreachable 
> > agent, master replies with TASK_UNREACHABLE irrespective of whether the 
> > task was in a non-terminal state or terminal but un-acknowledged state or 
> > terminal and acknowledged state when the agent went unreachable.  
> > 
> > I think the direction we want to go towards is
> > 
> > --> Completed tasks should consist of terminal unacknowledged and 
> > terminal acknowled tasks, likely in two different data structures.
> > --> Unreachable tasks should consist of all non-complete tasks on an 
> > unreachable agent.  All the tasks in this map should be in TASK_UNREACHABLE 
> > state.
> > 
> > 
> > Given all the above is a very involved change, I would recommend 
> > keeping what you have here but with a giant TODO (your current comment in 
> > #10058 doesn't go into enough detail about the complexity here) that talks 
> > about the above stuff. Your change at least keeps the parity with the 
> > (broken) semantics that we have in 1.4 and earlier so that's a bit better.

Ignore the first line. Forgot to delete it.


- Vinod


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64940/#review194794
---


On Jan. 4, 2018, 1:35 a.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64940/
> ---
> 
> (Updated Jan. 4, 2018, 1:35 a.m.)
> 
> 
> Review request for mesos, Gaston Kleiman, Jie Yu, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-8337
> https://issues.apache.org/jira/browse/MESOS-8337
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> If an agent is lost, we try to remove all the tasks that might
> have been lost. However, if a task is already terminal, it hasn't
> really been lost so we should not be tracking it in the framework's
> unreachable tasks list.
> 
> 
> Diffs
> -
> 
>   src/master/master.hpp 130f6e28cc62a8912aac66ecfbf014fe1ee444e3 
>   src/master/master.cpp 28d8be3a4769b418b61cff0b95845e4232135bc7 
>   src/tests/partition_tests.cpp 3813139f576ea01db0197f0fe8a73597db1bb69a 
> 
> 
> Diff: https://reviews.apache.org/r/64940/diff/3/
> 
> 
> Testing
> ---
> 
> make check (Fedora 27)
> 
> 
> Thanks,
> 
> James Peach
> 
>



Re: Review Request 64940: Prevented a crash when an agent with terminal tasks is lost.

2018-01-04 Thread Vinod Kone

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64940/#review194794
---




src/master/master.cpp
Lines 10037-10056 (original), 10039-10062 (patched)


I think we shouldn't create a TASK_UNREACHABLE status update and call 
`updateTask` or `forward` for a terminal task at all. . Also, `forward` sends 
TASK_UNREACHABLE update for terminal task to the framework which looks 
incorrect.

Ideally, we want terminal but unacknowledged tasks to still be marked 
unreachable in some way, either via task state being TASK_UNREACHABLE or task 
being present in `unreachableTasks`. This allows, for example, the WebUI to not 
show sandbox links for unreachable tasks irrespective of whether they were 
terminal or not before going unreachable. 

But doing this is tricky for various reasons:

--> `updateTask()` doesn't allow a terminal state to be transitioned to 
TASK_UNREACHABLE. Right now when we call `updateTask` for a terminal task, it 
adds TASK_UNREACHABLE status to `Task.statuses` and also sends it to operator 
API stream subscribers which looks incorrect. The fact that `updateTask` 
internally deals with already terminal tasks is a bad design decision in 
retrospect. I think the callers shouldn't call it for terminal tasks instead.

--> It's not clear to our users what a `completed` task means. The 
intention was for this to hold a cache of terminal and acknowledged tasks for 
storing recent history. The users of the WebUI probably equate "Completed 
Tasks" to terminal tasks irrespective of their acknowledgement status, which is 
why it is confusing for them to see terminal but unacknowledged tasks in the 
"Active tasks" section in the WebUI.

--> When a framework reconciles the state of a task on an unreachable 
agent, master replies with TASK_UNREACHABLE irrespective of whether the task 
was in a non-terminal state or terminal but un-acknowledged state or terminal 
and acknowledged state when the agent went unreachable.  

I think the direction we want to go towards is

--> Completed tasks should consist of terminal unacknowledged and terminal 
acknowled tasks, likely in two different data structures.
--> Unreachable tasks should consist of all non-complete tasks on an 
unreachable agent.  All the tasks in this map should be in TASK_UNREACHABLE 
state.

Given all the above is a very involved change, I would recommend keeping 
what you have here but with a giant TODO (your current comment in #10058 
doesn't go into enough detail about the complexity here) that talks about the 
above stuff. Your change at least keeps the parity with the (broken) semantics 
that we have in 1.4 and earlier so that's a bit better.



src/master/master.cpp
Line 10273 (original), 10279-10283 (patched)


I would kill this CHECK in favor of the CHECK that already exists in the 
`Framework::removeTask`. AFAICT `Master::removeTask` shouldn't care about it.



src/tests/master_tests.cpp
Lines 7601 (patched)


How about:

// This test verifies that when the master removes an agent any 
unacknowledged but terminal tasks
// on the agent are moved to `completed` but keep their original terminal 
state.



src/tests/master_tests.cpp
Lines 7630 (patched)


`the task`, which task? are both the tasks using the same executor?



src/tests/master_tests.cpp
Lines 7630-7638 (patched)


This seems a bit complicated. You could alternatively just start the 
scheduler driver with explicit acknowledgements and just not acknowledge 
terminal updates when you receive them.



src/tests/master_tests.cpp
Lines 7645-7653 (patched)


Why are you launching 2 tasks instead of just one? You just want to test 
the unacknowledged terminal task case right? Doesn't really matter if it is 
TASK_FINISHED or TASK_LOST right?



src/tests/master_tests.cpp
Lines 7692-7711 (patched)


Are you verifying that the latest state of the task has been preserved? I 
would recommend hitting "/tasks" endpoint instead of "state-summary" to be more 
direct.


- Vinod Kone


On Jan. 4, 2018, 1:35 a.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64940/
> ---
> 
> (Updated Jan. 4, 2018, 1:35 a.m.)
> 
> 
> Review request for mesos, Gaston Kleiman, Jie Yu, and Jiang Yan Xu.
> 
> 
> Bugs: 

Re: Review Request 64940: Prevented a crash when an agent with terminal tasks is lost.

2018-01-04 Thread James Peach


> On Jan. 4, 2018, 10:09 p.m., Jiang Yan Xu wrote:
> > src/master/master.cpp
> > Lines 10061 (patched)
> > 
> >
> > Perhaps move all the logic around determining the unreachable task down 
> > here.
> > 
> > ```
> > // We need to inform `removeTask` whether the task became unreachable 
> > due to the agent being removed. Note that terminal tasks are not considered 
> > unreachable. We can not rely on the task's state because it may have been 
> > set to `TASK_LOST` for backwards-compatibility.
> > bool unreachable =
> >   unreachableTime.isSome() && !protobuf::isTerminalState(task->state());
> > ```
> > 
> > Inlining `unreachable` is probably fine too but separating it helps 
> > clarify it with comments?

You have to sample the task state before the update because it might transition 
to a terminal state at that time (and LOST is terminal).


- James


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64940/#review194777
---


On Jan. 4, 2018, 1:35 a.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64940/
> ---
> 
> (Updated Jan. 4, 2018, 1:35 a.m.)
> 
> 
> Review request for mesos, Gaston Kleiman, Jie Yu, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-8337
> https://issues.apache.org/jira/browse/MESOS-8337
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> If an agent is lost, we try to remove all the tasks that might
> have been lost. However, if a task is already terminal, it hasn't
> really been lost so we should not be tracking it in the framework's
> unreachable tasks list.
> 
> 
> Diffs
> -
> 
>   src/master/master.hpp 130f6e28cc62a8912aac66ecfbf014fe1ee444e3 
>   src/master/master.cpp 28d8be3a4769b418b61cff0b95845e4232135bc7 
>   src/tests/master_tests.cpp 5546fd937d078c1f757964d5163449ffd993388e 
> 
> 
> Diff: https://reviews.apache.org/r/64940/diff/2/
> 
> 
> Testing
> ---
> 
> make check (Fedora 27)
> 
> 
> Thanks,
> 
> James Peach
> 
>



Re: Review Request 64940: Prevented a crash when an agent with terminal tasks is lost.

2018-01-04 Thread James Peach


> On Jan. 4, 2018, 10:09 p.m., Jiang Yan Xu wrote:
> > src/tests/master_tests.cpp
> > Lines 7637 (patched)
> > 
> >
> > You could just wait for the the status update (being droppped) instead 
> > of capaturing `Slave::executorTerminated` and risk the race condition of 
> > the master not receiving the status update before the agent is deemed 
> > partitioned?

Dealing with all the status updates was less reliable.


- James


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64940/#review194777
---


On Jan. 4, 2018, 1:35 a.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64940/
> ---
> 
> (Updated Jan. 4, 2018, 1:35 a.m.)
> 
> 
> Review request for mesos, Gaston Kleiman, Jie Yu, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-8337
> https://issues.apache.org/jira/browse/MESOS-8337
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> If an agent is lost, we try to remove all the tasks that might
> have been lost. However, if a task is already terminal, it hasn't
> really been lost so we should not be tracking it in the framework's
> unreachable tasks list.
> 
> 
> Diffs
> -
> 
>   src/master/master.hpp 130f6e28cc62a8912aac66ecfbf014fe1ee444e3 
>   src/master/master.cpp 28d8be3a4769b418b61cff0b95845e4232135bc7 
>   src/tests/master_tests.cpp 5546fd937d078c1f757964d5163449ffd993388e 
> 
> 
> Diff: https://reviews.apache.org/r/64940/diff/2/
> 
> 
> Testing
> ---
> 
> make check (Fedora 27)
> 
> 
> Thanks,
> 
> James Peach
> 
>



Re: Review Request 64940: Prevented a crash when an agent with terminal tasks is lost.

2018-01-04 Thread James Peach

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64940/#review194796
---




src/master/master.cpp
Lines 10061 (patched)


You have to sample the task state before the update because it might 
transition to a terminal state at that time (and LOST is terminal).


- James Peach


On Jan. 4, 2018, 1:35 a.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64940/
> ---
> 
> (Updated Jan. 4, 2018, 1:35 a.m.)
> 
> 
> Review request for mesos, Gaston Kleiman, Jie Yu, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-8337
> https://issues.apache.org/jira/browse/MESOS-8337
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> If an agent is lost, we try to remove all the tasks that might
> have been lost. However, if a task is already terminal, it hasn't
> really been lost so we should not be tracking it in the framework's
> unreachable tasks list.
> 
> 
> Diffs
> -
> 
>   src/master/master.hpp 130f6e28cc62a8912aac66ecfbf014fe1ee444e3 
>   src/master/master.cpp 28d8be3a4769b418b61cff0b95845e4232135bc7 
>   src/tests/master_tests.cpp 5546fd937d078c1f757964d5163449ffd993388e 
> 
> 
> Diff: https://reviews.apache.org/r/64940/diff/2/
> 
> 
> Testing
> ---
> 
> make check (Fedora 27)
> 
> 
> Thanks,
> 
> James Peach
> 
>



Re: Review Request 64940: Prevented a crash when an agent with terminal tasks is lost.

2018-01-04 Thread James Peach


> On Jan. 4, 2018, 10:09 p.m., Jiang Yan Xu wrote:
> > src/master/master.cpp
> > Lines 10058-10060 (patched)
> > 
> >
> > Perhaps we just need to comment on the logic we use to determine if the 
> > task is `unreachable` and not what `removeTask` is going to do about it 
> > (because it's hidden here)? i.e., remove the comment here and use something 
> > like what I suggested below?

Sincce this is particularly obscure I think it's worth the comment.


> On Jan. 4, 2018, 10:09 p.m., Jiang Yan Xu wrote:
> > src/tests/master_tests.cpp
> > Lines 7686 (patched)
> > 
> >
> > No need to settle?

This is settling to ensure that events following from marking the agent down 
are fully processed.


- James


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64940/#review194777
---


On Jan. 4, 2018, 1:35 a.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64940/
> ---
> 
> (Updated Jan. 4, 2018, 1:35 a.m.)
> 
> 
> Review request for mesos, Gaston Kleiman, Jie Yu, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-8337
> https://issues.apache.org/jira/browse/MESOS-8337
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> If an agent is lost, we try to remove all the tasks that might
> have been lost. However, if a task is already terminal, it hasn't
> really been lost so we should not be tracking it in the framework's
> unreachable tasks list.
> 
> 
> Diffs
> -
> 
>   src/master/master.hpp 130f6e28cc62a8912aac66ecfbf014fe1ee444e3 
>   src/master/master.cpp 28d8be3a4769b418b61cff0b95845e4232135bc7 
>   src/tests/master_tests.cpp 5546fd937d078c1f757964d5163449ffd993388e 
> 
> 
> Diff: https://reviews.apache.org/r/64940/diff/2/
> 
> 
> Testing
> ---
> 
> make check (Fedora 27)
> 
> 
> Thanks,
> 
> James Peach
> 
>



Re: Review Request 64940: Prevented a crash when an agent with terminal tasks is lost.

2018-01-04 Thread James Peach


> On Jan. 4, 2018, 7:06 p.m., Ilya Pronin wrote:
> > src/master/master.cpp
> > Lines 10061 (patched)
> > 
> >
> > Nit: The expression can be simplified to `!terminal && unreachable`.

I think it's a matter of taste which is more readable ;)


> On Jan. 4, 2018, 7:06 p.m., Ilya Pronin wrote:
> > src/tests/master_tests.cpp
> > Lines 7626 (patched)
> > 
> >
> > Just curious: why do you pause the clock here? To speed up the test 
> > (i.e. not wait for `allocation_interval`)?

Yeh that was weird, I removed it.


- James


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64940/#review194772
---


On Jan. 4, 2018, 1:35 a.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64940/
> ---
> 
> (Updated Jan. 4, 2018, 1:35 a.m.)
> 
> 
> Review request for mesos, Gaston Kleiman, Jie Yu, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-8337
> https://issues.apache.org/jira/browse/MESOS-8337
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> If an agent is lost, we try to remove all the tasks that might
> have been lost. However, if a task is already terminal, it hasn't
> really been lost so we should not be tracking it in the framework's
> unreachable tasks list.
> 
> 
> Diffs
> -
> 
>   src/master/master.hpp 8fe9420dbe03ea2cefc6a40b0f64284aa9fe7915 
>   src/master/master.cpp 282fdf8ac38e3613c621c1c8e5c50f27bde9dafd 
>   src/tests/master_tests.cpp 5546fd937d078c1f757964d5163449ffd993388e 
> 
> 
> Diff: https://reviews.apache.org/r/64940/diff/1/
> 
> 
> Testing
> ---
> 
> make check (Fedora 27)
> 
> 
> Thanks,
> 
> James Peach
> 
>



Re: Review Request 64940: Prevented a crash when an agent with terminal tasks is lost.

2018-01-04 Thread Jiang Yan Xu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64940/#review194777
---




src/master/master.cpp
Lines 10058-10060 (patched)


Perhaps we just need to comment on the logic we use to determine if the 
task is `unreachable` and not what `removeTask` is going to do about it 
(because it's hidden here)? i.e., remove the comment here and use something 
like what I suggested below?



src/master/master.cpp
Lines 10061 (patched)


Perhaps move all the logic around determining the unreachable task down 
here.

```
// We need to inform `removeTask` whether the task became unreachable due 
to the agent being removed. Note that terminal tasks are not considered 
unreachable. We can not rely on the task's state because it may have been set 
to `TASK_LOST` for backwards-compatibility.
bool unreachable =
  unreachableTime.isSome() && !protobuf::isTerminalState(task->state());
```

Inlining `unreachable` is probably fine too but separating it helps clarify 
it with comments?



src/master/master.cpp
Line 10273 (original), 10279-10280 (patched)


So we had to use a boolean because we cannot rely on the task *state*, so 
"any of the unreachable states" is a bit misleading?

I think the CHECK message is already clear enough for now? (Like you did in 
`removeTask`)

I suggest we don't comment further on the removable-ness of the task 
because we might want to change it: 
https://issues.apache.org/jira/browse/MESOS-8389



src/tests/master_tests.cpp
Lines 7601-7602 (patched)


This reads like "terminal tasks" are stored as "either finished or lost" 
when it's only the non-terminal task that is transitioned to lost.

Can we just say that we are verifying "the terminal task is stored in its 
original state when parition happens"?



src/tests/master_tests.cpp
Lines 7603 (patched)


Most agent partition related tests are put in partition_tests.cpp, can we 
put this one there as well for better grouping? 

In particular there is a test named `TaskCompletedOnPartitionedAgent` there 
which is somewhat related. We can name this one 
`AgentWithCompletedTaskPartitioned` or `TaskCompletedBeforeAgentPartitioned`

The main word I want to change is `Lost`. While I can understand you are 
using it interchangably with *partitioned*, in this context when it's call 
partitioned elsewhere and there's `TaskLost` to confuse with it, perhaps it's 
better to use the more precise word.



src/tests/master_tests.cpp
Lines 7621 (patched)


We typically name this `offers` since it's a vector. You'll be using the 
first and only one so later you should

```
  AWAIT_READY(offers);
  ASSERT_FALSE(offers->empty());
  const Offer& offer = offers->at(0);
```



src/tests/master_tests.cpp
Lines 7637 (patched)


You could just wait for the the status update (being droppped) instead of 
capaturing `Slave::executorTerminated` and risk the race condition of the 
master not receiving the status update before the agent is deemed partitioned?



src/tests/master_tests.cpp
Lines 7641 (patched)


Instead of settling it's more direct to just wait for the offer before 
using it (like I mentioned above).



src/tests/master_tests.cpp
Lines 7675 (patched)


"fast fast" to emphasize? :)



src/tests/master_tests.cpp
Lines 7681 (patched)


If you wait for the ping, you don't need to settle?



src/tests/master_tests.cpp
Lines 7686 (patched)


No need to settle?


- Jiang Yan Xu


On Jan. 3, 2018, 5:35 p.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64940/
> ---
> 
> (Updated Jan. 3, 2018, 5:35 p.m.)
> 
> 
> Review request for mesos, Gaston Kleiman, Jie Yu, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-8337
> https://issues.apache.org/jira/browse/MESOS-8337
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> If an agent is lost, we try to remove all the tasks that might
> have been lost. However, if a task is already terminal, it hasn't
> really been lost so we should not be tracking it in the framework's
> unreachable tasks list.
> 
> 

Re: Review Request 64940: Prevented a crash when an agent with terminal tasks is lost.

2018-01-04 Thread Ilya Pronin

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64940/#review194772
---


Fix it, then Ship it!




Looks good to me!


src/master/master.cpp
Lines 10061 (patched)


Nit: The expression can be simplified to `!terminal && unreachable`.



src/tests/master_tests.cpp
Lines 7626 (patched)


Just curious: why do you pause the clock here? To speed up the test (i.e. 
not wait for `allocation_interval`)?


- Ilya Pronin


On Jan. 3, 2018, 5:35 p.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64940/
> ---
> 
> (Updated Jan. 3, 2018, 5:35 p.m.)
> 
> 
> Review request for mesos, Gaston Kleiman, Jie Yu, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-8337
> https://issues.apache.org/jira/browse/MESOS-8337
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> If an agent is lost, we try to remove all the tasks that might
> have been lost. However, if a task is already terminal, it hasn't
> really been lost so we should not be tracking it in the framework's
> unreachable tasks list.
> 
> 
> Diffs
> -
> 
>   src/master/master.hpp 8fe9420dbe03ea2cefc6a40b0f64284aa9fe7915 
>   src/master/master.cpp 282fdf8ac38e3613c621c1c8e5c50f27bde9dafd 
>   src/tests/master_tests.cpp 5546fd937d078c1f757964d5163449ffd993388e 
> 
> 
> Diff: https://reviews.apache.org/r/64940/diff/1/
> 
> 
> Testing
> ---
> 
> make check (Fedora 27)
> 
> 
> Thanks,
> 
> James Peach
> 
>