Re: Review Request 69684: Fixed the FD leak if containerizer::_launch() failed or discarded.

2019-01-09 Thread Gilbert Song

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69684/#review211790
---


Fix it, then Ship it!





src/slave/containerizer/mesos/containerizer.cpp
Lines 1385-1386 (patched)


4 space ahead if you make it to the newline after (

I will fix it for you


- Gilbert Song


On Jan. 7, 2019, 7:27 a.m., Andrei Budnik wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69684/
> ---
> 
> (Updated Jan. 7, 2019, 7:27 a.m.)
> 
> 
> Review request for mesos, Gilbert Song, Jie Yu, Qian Zhang, and Vinod Kone.
> 
> 
> Bugs: MESOS-6632
> https://issues.apache.org/jira/browse/MESOS-6632
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> For the period between IOSB server is up and container process exec,
> if the containerizer launch returns failure or discarded, the FD used
> for signaling from the container process to the IOSB finish redirect
> will be leaked, which would cause the IOSB stuck at `io::redirect`
> forever. It would block the containerizer from cleaning up this
> container at IOSB.
> 
> This issue could be exposed if there are frequent check containers
> launch on an agent with heavy loads.
> 
> This patch fixes the issue by handling discard of a `launch` future,
> so the container IO is cleaned up and therefore all FDs are closed.
> 
> 
> Diffs
> -
> 
>   src/slave/containerizer/mesos/containerizer.cpp 
> a5cf2da55c046c5c45e0c2ca3400f64de12de62b 
> 
> 
> Diff: https://reviews.apache.org/r/69684/diff/6/
> 
> 
> Testing
> ---
> 
> Manual testing:
> 
> Add `return Failure("Test!");` to `IOSwitchboard::_prepare()`: 
> https://github.com/apache/mesos/blob/7a378597d6a4359de0b1eabb563d4c12cdb4290e/src/slave/containerizer/mesos/io/switchboard.cpp#L657
> 
> 1) Compile without this patch applied
> 2) Run `src/mesos-tests --verbose 
> --gtest_filter=ContentType/AgentAPITest.AttachContainerInputRepeat/0`
> 3) This test fails due to a hanging IOSB (15 seconds timeout) after trying to 
> destroy IOSB on the HTTP error "Test!"
> 
> 1) Compile with this patch applied
> 2) Run `src/mesos-tests --verbose 
> --gtest_filter=ContentType/AgentAPITest.AttachContainerInputRepeat/0`
> 3) This test fails due to the HTTP error "Test!" and terminates immediately 
> (as expected)
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>



Re: Review Request 69684: Fixed the FD leak if containerizer::_launch() failed or discarded.

2019-01-09 Thread Andrei Budnik


> On Jan. 9, 2019, 5:39 p.m., Gilbert Song wrote:
> > src/slave/containerizer/mesos/containerizer.cpp
> > Lines 1391-1397 (patched)
> > 
> >
> > style nits, how about:
> > 
> > ```
> >   return result
> > .onAny()
> > ```

I've removed `Future result`.


- Andrei


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69684/#review211783
---


On Jan. 7, 2019, 3:27 p.m., Andrei Budnik wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69684/
> ---
> 
> (Updated Jan. 7, 2019, 3:27 p.m.)
> 
> 
> Review request for mesos, Gilbert Song, Jie Yu, Qian Zhang, and Vinod Kone.
> 
> 
> Bugs: MESOS-6632
> https://issues.apache.org/jira/browse/MESOS-6632
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> For the period between IOSB server is up and container process exec,
> if the containerizer launch returns failure or discarded, the FD used
> for signaling from the container process to the IOSB finish redirect
> will be leaked, which would cause the IOSB stuck at `io::redirect`
> forever. It would block the containerizer from cleaning up this
> container at IOSB.
> 
> This issue could be exposed if there are frequent check containers
> launch on an agent with heavy loads.
> 
> This patch fixes the issue by handling discard of a `launch` future,
> so the container IO is cleaned up and therefore all FDs are closed.
> 
> 
> Diffs
> -
> 
>   src/slave/containerizer/mesos/containerizer.cpp 
> a5cf2da55c046c5c45e0c2ca3400f64de12de62b 
> 
> 
> Diff: https://reviews.apache.org/r/69684/diff/6/
> 
> 
> Testing
> ---
> 
> Manual testing:
> 
> Add `return Failure("Test!");` to `IOSwitchboard::_prepare()`: 
> https://github.com/apache/mesos/blob/7a378597d6a4359de0b1eabb563d4c12cdb4290e/src/slave/containerizer/mesos/io/switchboard.cpp#L657
> 
> 1) Compile without this patch applied
> 2) Run `src/mesos-tests --verbose 
> --gtest_filter=ContentType/AgentAPITest.AttachContainerInputRepeat/0`
> 3) This test fails due to a hanging IOSB (15 seconds timeout) after trying to 
> destroy IOSB on the HTTP error "Test!"
> 
> 1) Compile with this patch applied
> 2) Run `src/mesos-tests --verbose 
> --gtest_filter=ContentType/AgentAPITest.AttachContainerInputRepeat/0`
> 3) This test fails due to the HTTP error "Test!" and terminates immediately 
> (as expected)
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>



Re: Review Request 69684: Fixed the FD leak if containerizer::_launch() failed or discarded.

2019-01-09 Thread Gilbert Song

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69684/#review211784
---


Ship it!




Ship It!

- Gilbert Song


On Jan. 7, 2019, 7:27 a.m., Andrei Budnik wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69684/
> ---
> 
> (Updated Jan. 7, 2019, 7:27 a.m.)
> 
> 
> Review request for mesos, Gilbert Song, Jie Yu, Qian Zhang, and Vinod Kone.
> 
> 
> Bugs: MESOS-6632
> https://issues.apache.org/jira/browse/MESOS-6632
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> For the period between IOSB server is up and container process exec,
> if the containerizer launch returns failure or discarded, the FD used
> for signaling from the container process to the IOSB finish redirect
> will be leaked, which would cause the IOSB stuck at `io::redirect`
> forever. It would block the containerizer from cleaning up this
> container at IOSB.
> 
> This issue could be exposed if there are frequent check containers
> launch on an agent with heavy loads.
> 
> This patch fixes the issue by handling discard of a `launch` future,
> so the container IO is cleaned up and therefore all FDs are closed.
> 
> 
> Diffs
> -
> 
>   src/slave/containerizer/mesos/containerizer.cpp 
> a5cf2da55c046c5c45e0c2ca3400f64de12de62b 
> 
> 
> Diff: https://reviews.apache.org/r/69684/diff/4/
> 
> 
> Testing
> ---
> 
> Manual testing:
> 
> Add `return Failure("Test!");` to `IOSwitchboard::_prepare()`: 
> https://github.com/apache/mesos/blob/7a378597d6a4359de0b1eabb563d4c12cdb4290e/src/slave/containerizer/mesos/io/switchboard.cpp#L657
> 
> 1) Compile without this patch applied
> 2) Run `src/mesos-tests --verbose 
> --gtest_filter=ContentType/AgentAPITest.AttachContainerInputRepeat/0`
> 3) This test fails due to a hanging IOSB (15 seconds timeout) after trying to 
> destroy IOSB on the HTTP error "Test!"
> 
> 1) Compile with this patch applied
> 2) Run `src/mesos-tests --verbose 
> --gtest_filter=ContentType/AgentAPITest.AttachContainerInputRepeat/0`
> 3) This test fails due to the HTTP error "Test!" and terminates immediately 
> (as expected)
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>



Re: Review Request 69684: Fixed the FD leak if containerizer::_launch() failed or discarded.

2019-01-09 Thread Gilbert Song

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69684/#review211783
---




src/slave/containerizer/mesos/containerizer.cpp
Lines 1391-1397 (patched)


style nits, how about:

```
  return result
.onAny()
```


- Gilbert Song


On Jan. 7, 2019, 7:27 a.m., Andrei Budnik wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69684/
> ---
> 
> (Updated Jan. 7, 2019, 7:27 a.m.)
> 
> 
> Review request for mesos, Gilbert Song, Jie Yu, Qian Zhang, and Vinod Kone.
> 
> 
> Bugs: MESOS-6632
> https://issues.apache.org/jira/browse/MESOS-6632
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> For the period between IOSB server is up and container process exec,
> if the containerizer launch returns failure or discarded, the FD used
> for signaling from the container process to the IOSB finish redirect
> will be leaked, which would cause the IOSB stuck at `io::redirect`
> forever. It would block the containerizer from cleaning up this
> container at IOSB.
> 
> This issue could be exposed if there are frequent check containers
> launch on an agent with heavy loads.
> 
> This patch fixes the issue by handling discard of a `launch` future,
> so the container IO is cleaned up and therefore all FDs are closed.
> 
> 
> Diffs
> -
> 
>   src/slave/containerizer/mesos/containerizer.cpp 
> a5cf2da55c046c5c45e0c2ca3400f64de12de62b 
> 
> 
> Diff: https://reviews.apache.org/r/69684/diff/4/
> 
> 
> Testing
> ---
> 
> Manual testing:
> 
> Add `return Failure("Test!");` to `IOSwitchboard::_prepare()`: 
> https://github.com/apache/mesos/blob/7a378597d6a4359de0b1eabb563d4c12cdb4290e/src/slave/containerizer/mesos/io/switchboard.cpp#L657
> 
> 1) Compile without this patch applied
> 2) Run `src/mesos-tests --verbose 
> --gtest_filter=ContentType/AgentAPITest.AttachContainerInputRepeat/0`
> 3) This test fails due to a hanging IOSB (15 seconds timeout) after trying to 
> destroy IOSB on the HTTP error "Test!"
> 
> 1) Compile with this patch applied
> 2) Run `src/mesos-tests --verbose 
> --gtest_filter=ContentType/AgentAPITest.AttachContainerInputRepeat/0`
> 3) This test fails due to the HTTP error "Test!" and terminates immediately 
> (as expected)
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>



Re: Review Request 69684: Fixed the FD leak if containerizer::_launch() failed or discarded.

2019-01-09 Thread Mesos Reviewbot Windows

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69684/#review211782
---



PASS: Mesos patch 69684 was successfully built and tested.

Reviews applied: `['69684']`

All the build artifacts available at: 
http://dcos-win.westus2.cloudapp.azure.com/artifacts/mesos-reviewbot-testing/2723/mesos-review-69684

- Mesos Reviewbot Windows


On Jan. 7, 2019, 3:27 p.m., Andrei Budnik wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69684/
> ---
> 
> (Updated Jan. 7, 2019, 3:27 p.m.)
> 
> 
> Review request for mesos, Gilbert Song, Jie Yu, Qian Zhang, and Vinod Kone.
> 
> 
> Bugs: MESOS-6632
> https://issues.apache.org/jira/browse/MESOS-6632
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> For the period between IOSB server is up and container process exec,
> if the containerizer launch returns failure or discarded, the FD used
> for signaling from the container process to the IOSB finish redirect
> will be leaked, which would cause the IOSB stuck at `io::redirect`
> forever. It would block the containerizer from cleaning up this
> container at IOSB.
> 
> This issue could be exposed if there are frequent check containers
> launch on an agent with heavy loads.
> 
> This patch fixes the issue by handling discard of a `launch` future,
> so the container IO is cleaned up and therefore all FDs are closed.
> 
> 
> Diffs
> -
> 
>   src/slave/containerizer/mesos/containerizer.cpp 
> a5cf2da55c046c5c45e0c2ca3400f64de12de62b 
> 
> 
> Diff: https://reviews.apache.org/r/69684/diff/4/
> 
> 
> Testing
> ---
> 
> Manual testing:
> 
> Add `return Failure("Test!");` to `IOSwitchboard::_prepare()`: 
> https://github.com/apache/mesos/blob/7a378597d6a4359de0b1eabb563d4c12cdb4290e/src/slave/containerizer/mesos/io/switchboard.cpp#L657
> 
> 1) Compile without this patch applied
> 2) Run `src/mesos-tests --verbose 
> --gtest_filter=ContentType/AgentAPITest.AttachContainerInputRepeat/0`
> 3) This test fails due to a hanging IOSB (15 seconds timeout) after trying to 
> destroy IOSB on the HTTP error "Test!"
> 
> 1) Compile with this patch applied
> 2) Run `src/mesos-tests --verbose 
> --gtest_filter=ContentType/AgentAPITest.AttachContainerInputRepeat/0`
> 3) This test fails due to the HTTP error "Test!" and terminates immediately 
> (as expected)
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>



Re: Review Request 69684: Fixed the FD leak if containerizer::_launch() failed or discarded.

2019-01-09 Thread Andrei Budnik


> On Jan. 8, 2019, 8:04 p.m., Gilbert Song wrote:
> > src/slave/containerizer/mesos/containerizer.cpp
> > Lines 1366 (patched)
> > 
> >
> > Just realize .repair may not be sufficient because it could only handle 
> > failure case but discarded case. It is possible that `prepare(containerId, 
> > None())` just returns a discarded future and propagate that discarded to 
> > all chained .then
> > 
> > Let's use .onAny instead and make the code more clear and explicit:
> > 
> > ```
> >   Future result = prepare()
> > .then(extractContainerIO)
> > .then(_launch);
> > 
> >   TODO: remove this .onAny once we have FD wrapper.  
> >   return result
> > // extractContainerIO may be called twice, but ok sometimes it is 
> > no-op
> > .onAny(extractContainerIO)

I've refactored a bit this code to remove duplicated code and then added 
`.any(extractContainerIO)` handler.
> TODO: remove this .onAny once we have FD wrapper.

If we had FD wrapper, we would need to call `extractContainerIO` to remove a 
data structure which owns FDs (current `FDWrapper`). So, nothing changes here 
even after introducing FD wrapper in `libprocess`.


- Andrei


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69684/#review211769
---


On Jan. 7, 2019, 3:27 p.m., Andrei Budnik wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69684/
> ---
> 
> (Updated Jan. 7, 2019, 3:27 p.m.)
> 
> 
> Review request for mesos, Gilbert Song, Jie Yu, Qian Zhang, and Vinod Kone.
> 
> 
> Bugs: MESOS-6632
> https://issues.apache.org/jira/browse/MESOS-6632
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> For the period between IOSB server is up and container process exec,
> if the containerizer launch returns failure or discarded, the FD used
> for signaling from the container process to the IOSB finish redirect
> will be leaked, which would cause the IOSB stuck at `io::redirect`
> forever. It would block the containerizer from cleaning up this
> container at IOSB.
> 
> This issue could be exposed if there are frequent check containers
> launch on an agent with heavy loads.
> 
> This patch fixes the issue by handling discard of a `launch` future,
> so the container IO is cleaned up and therefore all FDs are closed.
> 
> 
> Diffs
> -
> 
>   src/slave/containerizer/mesos/containerizer.cpp 
> a5cf2da55c046c5c45e0c2ca3400f64de12de62b 
> 
> 
> Diff: https://reviews.apache.org/r/69684/diff/4/
> 
> 
> Testing
> ---
> 
> Manual testing:
> 
> Add `return Failure("Test!");` to `IOSwitchboard::_prepare()`: 
> https://github.com/apache/mesos/blob/7a378597d6a4359de0b1eabb563d4c12cdb4290e/src/slave/containerizer/mesos/io/switchboard.cpp#L657
> 
> 1) Compile without this patch applied
> 2) Run `src/mesos-tests --verbose 
> --gtest_filter=ContentType/AgentAPITest.AttachContainerInputRepeat/0`
> 3) This test fails due to a hanging IOSB (15 seconds timeout) after trying to 
> destroy IOSB on the HTTP error "Test!"
> 
> 1) Compile with this patch applied
> 2) Run `src/mesos-tests --verbose 
> --gtest_filter=ContentType/AgentAPITest.AttachContainerInputRepeat/0`
> 3) This test fails due to the HTTP error "Test!" and terminates immediately 
> (as expected)
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>



Re: Review Request 69684: Fixed the FD leak if containerizer::_launch() failed or discarded.

2019-01-08 Thread Gilbert Song


> On Jan. 8, 2019, 5:01 p.m., Meng Zhu wrote:
> > > the FD used for signaling from the container process to the IOSB finish 
> > > redirect will be leaked
> > 
> > Is this description right? I thought what "leaked" are those std* in the 
> > container IO, not `unblockFds`. IIUC, `unblockFds` intercepts SIGTERM so 
> > that the server could shutdown gracefully -- this worked as expected.

They are the same thing. IOSB creats some pipes. One end is passed to launcher 
when forking the container process. The other end is held by IOSB to read. 
Nothing related to unblockFds.


- Gilbert


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69684/#review211770
---


On Jan. 7, 2019, 7:27 a.m., Andrei Budnik wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69684/
> ---
> 
> (Updated Jan. 7, 2019, 7:27 a.m.)
> 
> 
> Review request for mesos, Gilbert Song, Jie Yu, Qian Zhang, and Vinod Kone.
> 
> 
> Bugs: MESOS-6632
> https://issues.apache.org/jira/browse/MESOS-6632
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> For the period between IOSB server is up and container process exec,
> if the containerizer launch returns failure or discarded, the FD used
> for signaling from the container process to the IOSB finish redirect
> will be leaked, which would cause the IOSB stuck at `io::redirect`
> forever. It would block the containerizer from cleaning up this
> container at IOSB.
> 
> This issue could be exposed if there are frequent check containers
> launch on an agent with heavy loads.
> 
> This patch fixes the issue by handling discard of a `launch` future,
> so the container IO is cleaned up and therefore all FDs are closed.
> 
> 
> Diffs
> -
> 
>   src/slave/containerizer/mesos/containerizer.cpp 
> a5cf2da55c046c5c45e0c2ca3400f64de12de62b 
> 
> 
> Diff: https://reviews.apache.org/r/69684/diff/3/
> 
> 
> Testing
> ---
> 
> Manual testing:
> 
> Add `return Failure("Test!");` to `IOSwitchboard::_prepare()`: 
> https://github.com/apache/mesos/blob/7a378597d6a4359de0b1eabb563d4c12cdb4290e/src/slave/containerizer/mesos/io/switchboard.cpp#L657
> 
> 1) Compile without this patch applied
> 2) Run `src/mesos-tests --verbose 
> --gtest_filter=ContentType/AgentAPITest.AttachContainerInputRepeat/0`
> 3) This test fails due to a hanging IOSB (15 seconds timeout) after trying to 
> destroy IOSB on the HTTP error "Test!"
> 
> 1) Compile with this patch applied
> 2) Run `src/mesos-tests --verbose 
> --gtest_filter=ContentType/AgentAPITest.AttachContainerInputRepeat/0`
> 3) This test fails due to the HTTP error "Test!" and terminates immediately 
> (as expected)
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>



Re: Review Request 69684: Fixed the FD leak if containerizer::_launch() failed or discarded.

2019-01-08 Thread Meng Zhu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69684/#review211770
---



> the FD used for signaling from the container process to the IOSB finish 
> redirect will be leaked

Is this description right? I thought what "leaked" are those std* in the 
container IO, not `unblockFds`. IIUC, `unblockFds` intercepts SIGTERM so that 
the server could shutdown gracefully -- this worked as expected.

- Meng Zhu


On Jan. 7, 2019, 7:27 a.m., Andrei Budnik wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69684/
> ---
> 
> (Updated Jan. 7, 2019, 7:27 a.m.)
> 
> 
> Review request for mesos, Gilbert Song, Jie Yu, Qian Zhang, and Vinod Kone.
> 
> 
> Bugs: MESOS-6632
> https://issues.apache.org/jira/browse/MESOS-6632
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> For the period between IOSB server is up and container process exec,
> if the containerizer launch returns failure or discarded, the FD used
> for signaling from the container process to the IOSB finish redirect
> will be leaked, which would cause the IOSB stuck at `io::redirect`
> forever. It would block the containerizer from cleaning up this
> container at IOSB.
> 
> This issue could be exposed if there are frequent check containers
> launch on an agent with heavy loads.
> 
> This patch fixes the issue by handling discard of a `launch` future,
> so the container IO is cleaned up and therefore all FDs are closed.
> 
> 
> Diffs
> -
> 
>   src/slave/containerizer/mesos/containerizer.cpp 
> a5cf2da55c046c5c45e0c2ca3400f64de12de62b 
> 
> 
> Diff: https://reviews.apache.org/r/69684/diff/3/
> 
> 
> Testing
> ---
> 
> Manual testing:
> 
> Add `return Failure("Test!");` to `IOSwitchboard::_prepare()`: 
> https://github.com/apache/mesos/blob/7a378597d6a4359de0b1eabb563d4c12cdb4290e/src/slave/containerizer/mesos/io/switchboard.cpp#L657
> 
> 1) Compile without this patch applied
> 2) Run `src/mesos-tests --verbose 
> --gtest_filter=ContentType/AgentAPITest.AttachContainerInputRepeat/0`
> 3) This test fails due to a hanging IOSB (15 seconds timeout) after trying to 
> destroy IOSB on the HTTP error "Test!"
> 
> 1) Compile with this patch applied
> 2) Run `src/mesos-tests --verbose 
> --gtest_filter=ContentType/AgentAPITest.AttachContainerInputRepeat/0`
> 3) This test fails due to the HTTP error "Test!" and terminates immediately 
> (as expected)
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>



Re: Review Request 69684: Fixed the FD leak if containerizer::_launch() failed or discarded.

2019-01-08 Thread Gilbert Song

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69684/#review211769
---


Fix it, then Ship it!





src/slave/containerizer/mesos/containerizer.cpp
Lines 1366 (patched)


Just realize .repair may not be sufficient because it could only handle 
failure case but discarded case. It is possible that `prepare(containerId, 
None())` just returns a discarded future and propagate that discarded to all 
chained .then

Let's use .onAny instead and make the code more clear and explicit:

```
  Future result = prepare()
.then(extractContainerIO)
.then(_launch);

  TODO: remove this .onAny once we have FD wrapper.  
  return result
// extractContainerIO may be called twice, but ok sometimes it is no-op
.onAny(extractContainerIO)


- Gilbert Song


On Jan. 7, 2019, 7:27 a.m., Andrei Budnik wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69684/
> ---
> 
> (Updated Jan. 7, 2019, 7:27 a.m.)
> 
> 
> Review request for mesos, Gilbert Song, Jie Yu, Qian Zhang, and Vinod Kone.
> 
> 
> Bugs: MESOS-6632
> https://issues.apache.org/jira/browse/MESOS-6632
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> For the period between IOSB server is up and container process exec,
> if the containerizer launch returns failure or discarded, the FD used
> for signaling from the container process to the IOSB finish redirect
> will be leaked, which would cause the IOSB stuck at `io::redirect`
> forever. It would block the containerizer from cleaning up this
> container at IOSB.
> 
> This issue could be exposed if there are frequent check containers
> launch on an agent with heavy loads.
> 
> This patch fixes the issue by handling discard of a `launch` future,
> so the container IO is cleaned up and therefore all FDs are closed.
> 
> 
> Diffs
> -
> 
>   src/slave/containerizer/mesos/containerizer.cpp 
> a5cf2da55c046c5c45e0c2ca3400f64de12de62b 
> 
> 
> Diff: https://reviews.apache.org/r/69684/diff/3/
> 
> 
> Testing
> ---
> 
> Manual testing:
> 
> Add `return Failure("Test!");` to `IOSwitchboard::_prepare()`: 
> https://github.com/apache/mesos/blob/7a378597d6a4359de0b1eabb563d4c12cdb4290e/src/slave/containerizer/mesos/io/switchboard.cpp#L657
> 
> 1) Compile without this patch applied
> 2) Run `src/mesos-tests --verbose 
> --gtest_filter=ContentType/AgentAPITest.AttachContainerInputRepeat/0`
> 3) This test fails due to a hanging IOSB (15 seconds timeout) after trying to 
> destroy IOSB on the HTTP error "Test!"
> 
> 1) Compile with this patch applied
> 2) Run `src/mesos-tests --verbose 
> --gtest_filter=ContentType/AgentAPITest.AttachContainerInputRepeat/0`
> 3) This test fails due to the HTTP error "Test!" and terminates immediately 
> (as expected)
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>



Re: Review Request 69684: Fixed the FD leak if containerizer::_launch() failed or discarded.

2019-01-08 Thread Andrei Budnik


> On Jan. 8, 2019, 8:22 a.m., Gilbert Song wrote:
> > src/slave/containerizer/mesos/containerizer.cpp
> > Lines 1372 (patched)
> > 
> >
> > Two cases:
> > 
> > 1) if `prepare(containerId, None())` fails, this `.repair` works, since 
> > it would trigger `FDWrapper` to destructs itself.
> > 2) but if `_launch()` fails, `.repair` is no-op. it does not hurt but 
> > hard for contributors to understand since those FD close are implicit in 
> > destructors.
> > 
> > If we do not care `_launch()` is failed or not, we just want to 
> > safeguard the `READY/FAILURE` cases for `prepare(containerId, None())`, 
> > should we move the .repair up (e.g., right after `prepare(containerId, 
> > None())`)?

I've moved `.repair` handler right after the callback that returns 
`ioSwitchboard->extractContainerIO(containerId)`.
> right after prepare(containerId, None())

In this case, when the future is discarded after calling `defer(self(), 
)`, but before `prepare callback` is executed, then neither 
`prepare callback` nor `.repair callback` is called.


- Andrei


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69684/#review211756
---


On Jan. 7, 2019, 3:27 p.m., Andrei Budnik wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69684/
> ---
> 
> (Updated Jan. 7, 2019, 3:27 p.m.)
> 
> 
> Review request for mesos, Gilbert Song, Jie Yu, Qian Zhang, and Vinod Kone.
> 
> 
> Bugs: MESOS-6632
> https://issues.apache.org/jira/browse/MESOS-6632
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> For the period between IOSB server is up and container process exec,
> if the containerizer launch returns failure or discarded, the FD used
> for signaling from the container process to the IOSB finish redirect
> will be leaked, which would cause the IOSB stuck at `io::redirect`
> forever. It would block the containerizer from cleaning up this
> container at IOSB.
> 
> This issue could be exposed if there are frequent check containers
> launch on an agent with heavy loads.
> 
> This patch fixes the issue by handling discard of a `launch` future,
> so the container IO is cleaned up and therefore all FDs are closed.
> 
> 
> Diffs
> -
> 
>   src/slave/containerizer/mesos/containerizer.cpp 
> a5cf2da55c046c5c45e0c2ca3400f64de12de62b 
> 
> 
> Diff: https://reviews.apache.org/r/69684/diff/3/
> 
> 
> Testing
> ---
> 
> Manual testing:
> 
> Add `return Failure("Test!");` to `IOSwitchboard::_prepare()`: 
> https://github.com/apache/mesos/blob/7a378597d6a4359de0b1eabb563d4c12cdb4290e/src/slave/containerizer/mesos/io/switchboard.cpp#L657
> 
> 1) Compile without this patch applied
> 2) Run `src/mesos-tests --verbose 
> --gtest_filter=ContentType/AgentAPITest.AttachContainerInputRepeat/0`
> 3) This test fails due to a hanging IOSB (15 seconds timeout) after trying to 
> destroy IOSB on the HTTP error "Test!"
> 
> 1) Compile with this patch applied
> 2) Run `src/mesos-tests --verbose 
> --gtest_filter=ContentType/AgentAPITest.AttachContainerInputRepeat/0`
> 3) This test fails due to the HTTP error "Test!" and terminates immediately 
> (as expected)
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>



Re: Review Request 69684: Fixed the FD leak if containerizer::_launch() failed or discarded.

2019-01-08 Thread Gilbert Song

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69684/#review211756
---



This is a cleaner solution, but a little more implicit. Need clear comments to 
let people understand why we call `extractContainerIO` twice.


src/slave/containerizer/mesos/containerizer.cpp
Lines 1372 (patched)


Two cases:

1) if `prepare(containerId, None())` fails, this `.repair` works, since it 
would trigger `FDWrapper` to destructs itself.
2) but if `_launch()` fails, `.repair` is no-op. it does not hurt but hard 
for contributors to understand since those FD close are implicit in destructors.

If we do not care `_launch()` is failed or not, we just want to safeguard 
the `READY/FAILURE` cases for `prepare(containerId, None())`, should we move 
the .repair up (e.g., right after `prepare(containerId, None())`)?



src/slave/containerizer/mesos/containerizer.cpp
Lines 1374-1377 (patched)


Could we be more explicit to point out that FDs are closed in `FDWrapper` 
destruction when the ContainerIO object is freezed.


- Gilbert Song


On Jan. 7, 2019, 7:27 a.m., Andrei Budnik wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69684/
> ---
> 
> (Updated Jan. 7, 2019, 7:27 a.m.)
> 
> 
> Review request for mesos, Gilbert Song, Jie Yu, Qian Zhang, and Vinod Kone.
> 
> 
> Bugs: MESOS-6632
> https://issues.apache.org/jira/browse/MESOS-6632
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> For the period between IOSB server is up and container process exec,
> if the containerizer launch returns failure or discarded, the FD used
> for signaling from the container process to the IOSB finish redirect
> will be leaked, which would cause the IOSB stuck at `io::redirect`
> forever. It would block the containerizer from cleaning up this
> container at IOSB.
> 
> This issue could be exposed if there are frequent check containers
> launch on an agent with heavy loads.
> 
> This patch fixes the issue by handling discard of a `launch` future,
> so the container IO is cleaned up and therefore all FDs are closed.
> 
> 
> Diffs
> -
> 
>   src/slave/containerizer/mesos/containerizer.cpp 
> a5cf2da55c046c5c45e0c2ca3400f64de12de62b 
> 
> 
> Diff: https://reviews.apache.org/r/69684/diff/2/
> 
> 
> Testing
> ---
> 
> Manual testing:
> 
> Add `return Failure("Test!");` to `IOSwitchboard::_prepare()`: 
> https://github.com/apache/mesos/blob/7a378597d6a4359de0b1eabb563d4c12cdb4290e/src/slave/containerizer/mesos/io/switchboard.cpp#L657
> 
> 1) Compile without this patch applied
> 2) Run `src/mesos-tests --verbose 
> --gtest_filter=ContentType/AgentAPITest.AttachContainerInputRepeat/0`
> 3) This test fails due to a hanging IOSB (15 seconds timeout) after trying to 
> destroy IOSB on the HTTP error "Test!"
> 
> 1) Compile with this patch applied
> 2) Run `src/mesos-tests --verbose 
> --gtest_filter=ContentType/AgentAPITest.AttachContainerInputRepeat/0`
> 3) This test fails due to the HTTP error "Test!" and terminates immediately 
> (as expected)
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>



Review Request 69684: Fixed the FD leak if containerizer::_launch() failed or discarded.

2019-01-07 Thread Andrei Budnik

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69684/
---

Review request for mesos, Gilbert Song, Jie Yu, Qian Zhang, and Vinod Kone.


Bugs: MESOS-6632
https://issues.apache.org/jira/browse/MESOS-6632


Repository: mesos


Description
---

For the period between IOSB server is up and container process exec,
if the containerizer launch returns failure or discarded, the FD used
for signaling from the container process to the IOSB finish redirect
will be leaked, which would cause the IOSB stuck at `io::redirect`
forever. It would block the containerizer from cleaning up this
container at IOSB.

This issue could be exposed if there are frequent check containers
launch on an agent with heavy loads.


Diffs
-

  src/slave/containerizer/mesos/containerizer.cpp 
a5cf2da55c046c5c45e0c2ca3400f64de12de62b 


Diff: https://reviews.apache.org/r/69684/diff/1/


Testing
---

Manual testing:

Add `return Failure("Test!");` to `IOSwitchboard::_prepare()`: 
https://github.com/apache/mesos/blob/7a378597d6a4359de0b1eabb563d4c12cdb4290e/src/slave/containerizer/mesos/io/switchboard.cpp#L657

1) Compile without this patch applied
2) Run `src/mesos-tests --verbose 
--gtest_filter=ContentType/AgentAPITest.AttachContainerInputRepeat/0`
3) This test fails due to a hanging IOSB (15 seconds timeout) after trying to 
destroy IOSB on the HTTP error "Test!"

1) Compile with this patch applied
2) Run `src/mesos-tests --verbose 
--gtest_filter=ContentType/AgentAPITest.AttachContainerInputRepeat/0`
3) This test fails due to the HTTP error "Test!" and terminates immediately (as 
expected)


Thanks,

Andrei Budnik