[jira] [Commented] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`

2018-09-05 Thread Qian Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16604950#comment-16604950
 ] 

Qian Zhang commented on MESOS-8568:
---

commit ba370822c94c8e9881eff3f63a02b38e18335ae4
Author: Qian Zhang 
Date: Thu Aug 23 17:44:53 2018 +0800

Made command check always waits before removing the nested container.
 
 Review: [https://reviews.apache.org/r/68495]

 

commit b5c43f40b41b44ccae05d61e4aba8d004678cde1
Author: Qian Zhang 
Date: Wed Aug 29 11:22:41 2018 +0800

Made checker library retry to remove the previous check container.
 
 Previously when checker library fails to remove the previous check
 container, it will discard the promise and launch a new check container
 which will cause two problems:
 1. The discarded promise is used to launch the new check container,
 that means even the new check container is launched successfully,
 we still have no chance to process its check result since the
 promise has already been discarded.
 2. The previous check container will never get a chance to be removed
 which is leak, i.e., its runtime directory and sandbox directory
 will not be removed.
 
 Now in this patch, when checker library fails to remove the previous
 check container, we make it remove the previous check container again.
 
 Review: https://reviews.apache.org/r/68555
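
The retry behavior described in the second commit message can be sketched roughly as follows. This is a self-contained toy model, not the checker library's code: `RemoveResult`, `removeWithRetry` and the `maxAttempts` bound are hypothetical names introduced for illustration (the commit message does not say whether retries are bounded).

```cpp
#include <cassert>
#include <functional>

// Hypothetical outcome of a single REMOVE_NESTED_CONTAINER attempt.
enum class RemoveResult { OK, FAILED };

// Retries the removal of the previous check container and reports success
// only once a removal attempt went through. The caller is expected to
// launch the next check container only when this returns true.
// `maxAttempts` is an assumption of this sketch.
inline bool removeWithRetry(
    const std::function<RemoveResult()>& removeOnce,
    int maxAttempts)
{
  for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
    if (removeOnce() == RemoveResult::OK) {
      return true; // Previous container is gone; safe to launch a new one.
    }
    // Removal failed: no new container is launched and nothing is
    // discarded; the removal is simply attempted again.
  }
  return false;
}
```

With a removal that fails twice and then succeeds, `removeWithRetry(..., 5)` returns true after exactly three attempts, so the previous container is never left behind while a new one is started.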

> Command checks should always call `WAIT_NESTED_CONTAINER` before 
> `REMOVE_NESTED_CONTAINER`
> --
>
> Key: MESOS-8568
> URL: https://issues.apache.org/jira/browse/MESOS-8568
> Project: Mesos
> Issue Type: Bug
> Affects Versions: 1.4.0, 1.4.1, 1.4.2, 1.5.0, 1.5.1, 1.6.0, 1.6.1
> Reporter: Andrei Budnik
> Assignee: Qian Zhang
> Priority: Blocker
> Labels: default-executor, health-check, mesosphere
>
> After successfully launching a nested container via
> `LAUNCH_NESTED_CONTAINER_SESSION`, the checker library calls
> [waitNestedContainer|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L657]
> for the container. Before launching a nested container for a subsequent
> check, the checker library
> [calls|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L466-L487]
> `REMOVE_NESTED_CONTAINER` to remove the previous nested container. Hence,
> the `REMOVE_NESTED_CONTAINER` call follows `WAIT_NESTED_CONTAINER`, which
> ensures that the nested container has terminated and can be removed/cleaned up.
> If the launch fails, the library [doesn't
> call|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L627-L636]
> `WAIT_NESTED_CONTAINER`. Despite the failure, the container might have been
> launched, and the subsequent attempt to remove it without calling
> `WAIT_NESTED_CONTAINER` leads to errors like:
> {code:java}
> W0202 20:03:08.895830 7 checker_process.cpp:503] Received '500 Internal 
> Server Error' (Nested container has not terminated yet) while removing the 
> nested container 
> '2b0c542c-1f5f-42f7-b914-2c1cadb4aeca.da0a7cca-516c-4ec9-b215-b34412b670fa.check-49adc5f1-37a3-4f26-8708-e27d2d6cd125'
>  used for the COMMAND check for task 
> 'node-0-server__e26a82b0-fbab-46a0-a1ea-e7ac6cfa4c91
> {code}
> The checker library should always call `WAIT_NESTED_CONTAINER` before 
> `REMOVE_NESTED_CONTAINER`.
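
The ordering requirement in the description can be illustrated with a small state model. This is only a sketch under stated assumptions: `State`, `NestedContainer` and `waitThenRemove` are hypothetical names, the status codes merely mirror the 500 response quoted in the log above, and the real `WAIT_NESTED_CONTAINER` call blocks until termination instead of completing immediately.

```cpp
#include <cassert>

// Simplified container states; DESTROYING is the state named in the thread.
enum class State { RUNNING, DESTROYING, TERMINATED, REMOVED };

// Toy stand-in for the agent's view of one nested container.
struct NestedContainer
{
  State state = State::RUNNING;

  // Models WAIT_NESTED_CONTAINER. In this toy model the wait completes
  // immediately by moving a live container to TERMINATED.
  void wait()
  {
    if (state == State::RUNNING || state == State::DESTROYING) {
      state = State::TERMINATED;
    }
  }

  // Models REMOVE_NESTED_CONTAINER and returns an HTTP-like status code.
  int remove()
  {
    if (state == State::RUNNING || state == State::DESTROYING) {
      return 500; // "Nested container has not terminated yet"
    }
    state = State::REMOVED;
    return 200;
  }
};

// The ordering the issue asks for: always wait first, then remove.
inline int waitThenRemove(NestedContainer& container)
{
  container.wait();
  return container.remove();
}
```

Calling `remove()` directly on a container that is still in `DESTROYING` yields 500 and leaves it in place, whereas `waitThenRemove()` on the same container yields 200.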



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`

2018-08-30 Thread Qian Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16598177#comment-16598177
 ] 

Qian Zhang commented on MESOS-8568:
---

[~vinodkone] Done.

> Command checks should always call `WAIT_NESTED_CONTAINER` before 
> `REMOVE_NESTED_CONTAINER`
> --
>
> Key: MESOS-8568
> URL: https://issues.apache.org/jira/browse/MESOS-8568
> Project: Mesos
> Issue Type: Improvement
> Affects Versions: 1.4.0, 1.4.1, 1.4.2, 1.5.0, 1.5.1, 1.6.0, 1.6.1
> Reporter: Andrei Budnik
> Assignee: Qian Zhang
> Priority: Blocker
> Labels: default-executor, health-check, mesosphere


[jira] [Commented] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`

2018-08-30 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16598016#comment-16598016
 ] 

Vinod Kone commented on MESOS-8568:
---

[~qianzhang] Can you please set the affects and target versions?

> Command checks should always call `WAIT_NESTED_CONTAINER` before 
> `REMOVE_NESTED_CONTAINER`
> --
>
> Key: MESOS-8568
> URL: https://issues.apache.org/jira/browse/MESOS-8568
> Project: Mesos
> Issue Type: Improvement
> Reporter: Andrei Budnik
> Assignee: Qian Zhang
> Priority: Blocker
> Labels: default-executor, health-check, mesosphere


[jira] [Commented] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`

2018-08-24 Thread Qian Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16591235#comment-16591235
 ] 

Qian Zhang commented on MESOS-8568:
---

I ran exactly the same reproduction steps with the above patch applied and
found that the issue was gone: there is only one check container's sandbox
directory at any time.
{code:java}
$ ls -la 
/home/qzhang/opt/mesos/slaves/1eada535-3848-4c76-b8c5-0e9e0d6fa102-S0/frameworks/8a842ab3-8aba-4d64-a744-ae98bdcf6d59-/executors/default-executor/runs/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/containers/06e7c625-596c-454c-b092-f17a81073349/containers
 | grep check | wc -l
{code}
Here is the agent log; it shows that `WAIT_NESTED_CONTAINER` was called before
`REMOVE_NESTED_CONTAINER`.

 
{code:java}
I0823 19:46:39.269901 32604 http.cpp:3366] Processing 
LAUNCH_NESTED_CONTAINER_SESSION call for container 
'9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18'
I0823 19:46:39.277669 32603 switchboard.cpp:316] Container logger module 
finished preparing container 
9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18;
 IOSwitchboard server is required
I0823 19:46:39.284180 32603 systemd.cpp:98] Assigned child process '34701' to 
'mesos_executors.slice'
I0823 19:46:39.284451 32603 switchboard.cpp:604] Created I/O switchboard server 
(pid: 34701) listening on socket file 
'/tmp/mesos-io-switchboard-12e8e4c7-268e-4184-881c-a16b61fa260c' for container 
9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18
I0823 19:46:39.288053 32641 linux_launcher.cpp:492] Launching nested container 
9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18
 and cloning with namespaces 
W0823 19:46:39.302271 32636 http.cpp:2635] Failed to launch container 
9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18:
 Collect failed: ==Fake error==
I0823 19:46:39.304822 32639 linux_launcher.cpp:580] Asked to destroy container 
9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18
I0823 19:46:39.305047 32639 linux_launcher.cpp:622] Destroying cgroup 
'/sys/fs/cgroup/freezer/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18'
I0823 19:46:39.306437 32646 cgroups.cpp:2838] Freezing cgroup 
/sys/fs/cgroup/freezer/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18
I0823 19:46:39.307015 32614 cgroups.cpp:1229] Successfully froze cgroup 
/sys/fs/cgroup/freezer/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18
 after 419840ns
I0823 19:46:39.307715 32641 http.cpp:1117] HTTP POST for /slave(1)/api/v1 from 
10.0.49.2:42086
I0823 19:46:39.308198 32646 cgroups.cpp:2856] Thawing cgroup 
/sys/fs/cgroup/freezer/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18
I0823 19:46:39.308298 32641 http.cpp:2685] Processing WAIT_NESTED_CONTAINER 
call for container 
'9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18'
I0823 19:46:39.308583 32605 cgroups.cpp:1258] Successfully thawed cgroup 
/sys/fs/cgroup/freezer/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18
 after 265728ns
I0823 19:46:39.373747 32616 linux_launcher.cpp:654] Destroying cgroup 
'/sys/fs/cgroup/systemd/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18'
I0823 19:46:44.375650 32647 switchboard.cpp:807] Sending SIGTERM to I/O 
switchboard server (pid: 34701) since container 
9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18
 is being destroyed
I0823 19:46:44.403535 32637 switchboard.cpp:913] I/O switchboard server process 
for container 
9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18
 has terminated (status=0)
I0823 19:46:47.420578 32622 http.cpp:1117] HTTP POST for /slave(1)/api/v1 from 
10.0.49.2:42088
I0823 19:46:47.421331 32622 http.cpp:2971] Processing REMOVE_NESTED_CONTAINER 
call for container 
'9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18'
I0823 19:46:47.427382 32636 http.cpp:1117] HTTP POST for /slave(1)/api/v1 from 

[jira] [Commented] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`

2018-08-23 Thread Qian Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16591106#comment-16591106
 ] 

Qian Zhang commented on MESOS-8568:
---

RR: https://reviews.apache.org/r/68495/

> Command checks should always call `WAIT_NESTED_CONTAINER` before 
> `REMOVE_NESTED_CONTAINER`
> --
>
> Key: MESOS-8568
> URL: https://issues.apache.org/jira/browse/MESOS-8568
> Project: Mesos
> Issue Type: Improvement
> Reporter: Andrei Budnik
> Assignee: Qian Zhang
> Priority: Blocker
> Labels: default-executor, health-check, mesosphere


[jira] [Commented] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`

2018-08-23 Thread Qian Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16591103#comment-16591103
 ] 

Qian Zhang commented on MESOS-8568:
---

[~vinodkone] Yeah, I noticed that as well. When the I/O switchboard server
process is launched, it just [waits on a
promise|https://github.com/apache/mesos/blob/1.6.1/src/slave/containerizer/mesos/io/switchboard.cpp#L1181],
and that promise is only set when an `ATTACH_CONTAINER_OUTPUT` call is made or
a `SIGTERM` is received. In this case, `ATTACH_CONTAINER_OUTPUT` is never
called because the check container failed to launch, so we have to wait 5s for
the `SIGTERM`.

I am not sure which cases that [5s
timeout|https://github.com/apache/mesos/blob/1.6.1/src/slave/containerizer/mesos/io/switchboard.cpp#L810:L818]
is meant for; maybe [~jieyu] and [~klueska] have more info?
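
As a rough model of the timing described above, the switchboard server's promise is set at whichever comes first: the `ATTACH_CONTAINER_OUTPUT` call or the `SIGTERM` sent 5s into the destroy. The sketch below is not the switchboard code; `NEVER` and `switchboardExitTime` are hypothetical names, and times are plain seconds.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>

// Sentinel meaning "this event never happens".
const double NEVER = std::numeric_limits<double>::infinity();

// Returns the time (in seconds) at which the modeled promise is set:
// either when ATTACH_CONTAINER_OUTPUT is called, or when SIGTERM arrives
// `sigtermDelay` seconds after the destroy started, whichever is earlier.
inline double switchboardExitTime(
    double attachTime,
    double destroyStart,
    double sigtermDelay = 5.0)
{
  return std::min(attachTime, destroyStart + sigtermDelay);
}
```

For a check container that failed to launch, the attach never happens, so `switchboardExitTime(NEVER, 39.5)` evaluates to 44.5: the full 5s is always paid, which is the delay being discussed.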

> Command checks should always call `WAIT_NESTED_CONTAINER` before 
> `REMOVE_NESTED_CONTAINER`
> --
>
> Key: MESOS-8568
> URL: https://issues.apache.org/jira/browse/MESOS-8568
> Project: Mesos
> Issue Type: Improvement
> Reporter: Andrei Budnik
> Assignee: Qian Zhang
> Priority: Blocker
> Labels: default-executor, health-check, mesosphere


[jira] [Commented] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`

2018-08-23 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16590513#comment-16590513
 ] 

Vinod Kone commented on MESOS-8568:
---

Great repro!

One orthogonal question, though: it seems unfortunate that IOSwitchboard takes
5s to complete its cleanup for a container that has failed to launch. IIRC,
there was a 5s timeout in IOSwitchboard for some unexpected corner cases, which
is what we seem to be hitting here, but this is an *expected* case in some
sense. Is there any way we can speed that up?

> Command checks should always call `WAIT_NESTED_CONTAINER` before 
> `REMOVE_NESTED_CONTAINER`
> --
>
> Key: MESOS-8568
> URL: https://issues.apache.org/jira/browse/MESOS-8568
> Project: Mesos
> Issue Type: Improvement
> Reporter: Andrei Budnik
> Assignee: Qian Zhang
> Priority: Blocker
> Labels: default-executor, health-check, mesosphere


[jira] [Commented] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`

2018-08-22 Thread Qian Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588976#comment-16588976
 ] 

Qian Zhang commented on MESOS-8568:
---

Steps to reproduce:

1. To simulate a failure to launch the nested container for a check, change
`CgroupsIsolatorProcess::isolate` a bit:
{code:java}
 Future<Nothing> CgroupsIsolatorProcess::isolate(
     const ContainerID& containerId,
     pid_t pid)
 {
+  if (strings::startsWith(containerId.value(), "check")) {
+    return Failure("==Fake error==");
+  }
+
{code}
2. Start Mesos master and agent.
{code:java}
$ sudo ./bin/mesos-master.sh --work_dir=/home/qzhang/opt/mesos

$ sudo ./bin/mesos-slave.sh --master=10.0.49.2:5050 --port=36251 \
    --work_dir=/home/qzhang/opt/mesos \
    --isolation=filesystem/linux,docker/runtime,network/cni,cgroups/cpu,cgroups/mem
{code}
3. Launch a nested container with check enabled.
{code:java}
$ cat task_group_health_check.json
{
  "tasks": [
    {
      "name": "test",
      "task_id": {"value": "test"},
      "agent_id": {"value": ""},
      "resources": [
        {"name": "cpus", "type": "SCALAR", "scalar": {"value": 0.1}},
        {"name": "mem", "type": "SCALAR", "scalar": {"value": 32}}
      ],
      "command": {
        "value": "touch aaa && sleep 1000"
      },
      "check": {
        "type": "COMMAND",
        "command": {
          "command": {
            "value": "ls aaa  > /dev/null"
          }
        },
        "delay_seconds": 5,
        "interval_seconds": 3
      }
    }
  ]
}

$ src/mesos-execute --master=10.0.49.2:5050 \
    --task_group=file:///home/qzhang/workspace/config/task_group_health_check.json
{code}
4. After a few minutes, many check containers' sandbox directories are left
behind instead of being removed.

 
{code:java}
$ ls -la 
/home/qzhang/opt/mesos/slaves/c355abce-0088-4196-8376-d54c9963abdd-S0/frameworks/c355abce-0088-4196-8376-d54c9963abdd-/executors/default-executor/runs/ab8d9ad1-e85c-472a-8608-a059a3e5cdf4/containers/d66f9d77-9a69-41dd-9a70-dffdec8b2fba/containers
 | grep check | wc -l
119
{code}
And in the default executor's stderr, we see a lot of warning messages:
{code:java}
...
W0822 07:37:45.084581 19377 checker_process.cpp:794] Received '400 Bad Request' 
(Collect failed: ==Fake error==) while launching COMMAND check 
for task 'test'
I0822 07:37:45.085053 19377 checker_process.cpp:457] COMMAND check for task 
'test' is not available
W0822 07:37:48.092411 19362 checker_process.cpp:655] Received '500 Internal 
Server Error' (Nested container has not terminated yet) while removing the 
nested container 
'ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586'
 used for the COMMAND check for task 'test'
I0822 07:37:48.093101 19362 checker_process.cpp:457] COMMAND check for task 
'test' is not available
W0822 07:37:48.130527 19373 checker_process.cpp:794] Received '400 Bad Request' 
(Collect failed: ==Fake error==) while launching COMMAND check 
for task 'test'
W0822 07:37:51.099179 19360 checker_process.cpp:655] Received '500 Internal 
Server Error' (Nested container has not terminated yet) while removing the 
nested container 
'ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-5c7af0fc-ad73-4870-aba8-65a3fb4eae12'
 used for the COMMAND check for task 'test'
I0822 07:37:51.099799 19360 checker_process.cpp:457] COMMAND check for task 
'test' is not available
...{code}
 

So every time the default executor calls `REMOVE_NESTED_CONTAINER` to remove
the previous check container, the call fails with a 500 error. The call fails
because the check container has not terminated yet (it is still in the
`DESTROYING` state), which the agent log below confirms.
{code:java}
I0822 07:37:45.051453 19063 http.cpp:3366] Processing 
LAUNCH_NESTED_CONTAINER_SESSION call for container 
'ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586'
I0822 07:37:45.058904 19088 switchboard.cpp:316] Container logger module 
finished preparing container 
ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586;
 IOSwitchboard server is required
I0822 07:37:45.065444 19088 systemd.cpp:98] Assigned child process '19410' to 
'mesos_executors.slice'
I0822 07:37:45.065724 19088 switchboard.cpp:604] Created I/O switchboard server 
(pid: 19410) listening on socket file 
'/tmp/mesos-io-switchboard-048e2be0-4a2b-4c00-a846-0e8137507a85' for container 
ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586
I0822 07:37:45.069316 19062 linux_launcher.cpp:492] Launching nested container 
ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586
 and cloning with namespaces 

[jira] [Commented] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`

2018-08-15 Thread Jan Schlicht (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580929#comment-16580929
 ] 

Jan Schlicht commented on MESOS-8568:
-

Scratch my older comment. {{REMOVE_NESTED_CONTAINER}} has to be called on a
destroyed container because, as part of this call, the container's runtime
directory is removed. I.e., if this call isn't successful, the container's
runtime directory is leaked. This is the case in the scenario above. Hence,
the checker has to call {{WAIT_NESTED_CONTAINER}} first to make sure that it's
not calling {{REMOVE_NESTED_CONTAINER}} on a container that is still being
destroyed.
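
To see why the leak accumulates, here is a toy model of the behavior before the fix. It rests on assumptions rather than measurements: a check container is launched every `intervalSecs`, every launch fails and the container then stays in the destroying state for `destroySecs`, and the previous container gets exactly one {{REMOVE_NESTED_CONTAINER}} attempt right before the next launch. `leakedContainers` is a hypothetical helper, not part of Mesos.

```cpp
#include <cassert>

// Counts check containers whose directories are never removed in the toy
// model described above (one removal attempt per container, no retry).
inline int leakedContainers(int checks, int intervalSecs, int destroySecs)
{
  int leaked = 0;
  for (int i = 1; i < checks; ++i) {
    // Container i-1 was launched (and started being destroyed) one
    // interval before the current check.
    const int previousLaunch = (i - 1) * intervalSecs;
    const int now = i * intervalSecs;

    if (now < previousLaunch + destroySecs) {
      // Still being destroyed: the removal attempt is rejected and, with
      // no retry, the container's directories stay on disk.
      ++leaked;
    }
  }
  return leaked;
}
```

With a 3s check interval and a 5s destroy delay, every removal attempt comes too early, so `leakedContainers(120, 3, 5)` returns 119; with an interval longer than the destroy delay, e.g. `leakedContainers(120, 6, 5)`, it returns 0.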

> Command checks should always call `WAIT_NESTED_CONTAINER` before 
> `REMOVE_NESTED_CONTAINER`
> --
>
> Key: MESOS-8568
> URL: https://issues.apache.org/jira/browse/MESOS-8568
> Project: Mesos
> Issue Type: Task
> Reporter: Andrei Budnik
> Priority: Blocker
> Labels: default-executor, health-check, mesosphere


[jira] [Commented] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`

2018-08-14 Thread Jan Schlicht (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579747#comment-16579747
 ] 

Jan Schlicht commented on MESOS-8568:
-

No, the {{REMOVE_NESTED_CONTAINER}} shouldn't be a problem here. This 
particular 500 return code is actually a no-op in the containerizer. We don't 
need to call {{WAIT_NESTED_CONTAINER}} here.

> Command checks should always call `WAIT_NESTED_CONTAINER` before 
> `REMOVE_NESTED_CONTAINER`
> --
>
> Key: MESOS-8568
> URL: https://issues.apache.org/jira/browse/MESOS-8568
> Project: Mesos
> Issue Type: Task
> Reporter: Andrei Budnik
> Priority: Blocker
> Labels: default-executor, health-check, mesosphere


[jira] [Commented] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`

2018-08-14 Thread Jan Schlicht (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579656#comment-16579656
 ] 

Jan Schlicht commented on MESOS-8568:
-

I've linked MESOS-9131, as it's very similar: Calling 
{{REMOVE_NESTED_CONTAINER}} while that container is being destroyed seems to 
result in a race condition, though it isn't yet clear why.






[jira] [Commented] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`

2018-06-27 Thread Till Toenshoff (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16525717#comment-16525717
 ] 

Till Toenshoff commented on MESOS-8568:
---

Raised priority to blocker: multiple major Mesos users have reached out to us 
for help getting this fixed, because it fills their disks with containers that 
are never properly cleaned up.

> Command checks should always call `WAIT_NESTED_CONTAINER` before 
> `REMOVE_NESTED_CONTAINER`
> --
>
> Key: MESOS-8568
> URL: https://issues.apache.org/jira/browse/MESOS-8568
> Project: Mesos
>  Issue Type: Task
>Reporter: Andrei Budnik
>Assignee: Benno Evers
>Priority: Blocker
>  Labels: default-executor, health-check, mesosphere


