[jira] [Commented] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`
[ https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16604950#comment-16604950 ] Qian Zhang commented on MESOS-8568: --- commit ba370822c94c8e9881eff3f63a02b38e18335ae4 Author: Qian Zhang Date: Thu Aug 23 17:44:53 2018 +0800 Made command check always waits before removing the nested container. Review: [https://reviews.apache.org/r/68495] commit b5c43f40b41b44ccae05d61e4aba8d004678cde1 Author: Qian Zhang Date: Wed Aug 29 11:22:41 2018 +0800 Made checker library retry to remove the previous check container. Previously when checker library fails to remove the previous check container, it will discard the promise and launch a new check container which will cause two problems: 1. The discarded promise is used to launch the new check container, that means even the new check container is launched successfully, we still have no chance to process its check result since the promise has already been discarded. 2. The previous check container will never get a chance to be removed which is leak, i.e., its runtime directory and sandbox directory will not be removed. Now in this patch, when checker library fails to remove the previous check container, we make it remove the previous check container again. Review: https://reviews.apache.org/r/68555 > Command checks should always call `WAIT_NESTED_CONTAINER` before > `REMOVE_NESTED_CONTAINER` > -- > > Key: MESOS-8568 > URL: https://issues.apache.org/jira/browse/MESOS-8568 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.4.0, 1.4.1, 1.4.2, 1.5.0, 1.5.1, 1.6.0, 1.6.1 >Reporter: Andrei Budnik >Assignee: Qian Zhang >Priority: Blocker > Labels: default-executor, health-check, mesosphere > > After successful launch of a nested container via > `LAUNCH_NESTED_CONTAINER_SESSION` in a checker library, it calls > [waitNestedContainer > |https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L657] > for the container. Checker library > [calls|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L466-L487] > `REMOVE_NESTED_CONTAINER` to remove a previous nested container before > launching a nested container for a subsequent check. Hence, > `REMOVE_NESTED_CONTAINER` call follows `WAIT_NESTED_CONTAINER` to ensure that > the nested container has been terminated and can be removed/cleaned up. > In case of failure, the library [doesn't > call|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L627-L636] > `WAIT_NESTED_CONTAINER`. Despite the failure, the container might be > launched and the following attempt to remove the container without call > `WAIT_NESTED_CONTAINER` leads to errors like: > {code:java} > W0202 20:03:08.895830 7 checker_process.cpp:503] Received '500 Internal > Server Error' (Nested container has not terminated yet) while removing the > nested container > '2b0c542c-1f5f-42f7-b914-2c1cadb4aeca.da0a7cca-516c-4ec9-b215-b34412b670fa.check-49adc5f1-37a3-4f26-8708-e27d2d6cd125' > used for the COMMAND check for task > 'node-0-server__e26a82b0-fbab-46a0-a1ea-e7ac6cfa4c91 > {code} > The checker library should always call `WAIT_NESTED_CONTAINER` before > `REMOVE_NESTED_CONTAINER`. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`
[ https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16598177#comment-16598177 ] Qian Zhang commented on MESOS-8568: --- [~vinodkone] Done. > Command checks should always call `WAIT_NESTED_CONTAINER` before > `REMOVE_NESTED_CONTAINER` > -- > > Key: MESOS-8568 > URL: https://issues.apache.org/jira/browse/MESOS-8568 > Project: Mesos > Issue Type: Improvement >Affects Versions: 1.4.0, 1.4.1, 1.4.2, 1.5.0, 1.5.1, 1.6.0, 1.6.1 >Reporter: Andrei Budnik >Assignee: Qian Zhang >Priority: Blocker > Labels: default-executor, health-check, mesosphere > > After successful launch of a nested container via > `LAUNCH_NESTED_CONTAINER_SESSION` in a checker library, it calls > [waitNestedContainer > |https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L657] > for the container. Checker library > [calls|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L466-L487] > `REMOVE_NESTED_CONTAINER` to remove a previous nested container before > launching a nested container for a subsequent check. Hence, > `REMOVE_NESTED_CONTAINER` call follows `WAIT_NESTED_CONTAINER` to ensure that > the nested container has been terminated and can be removed/cleaned up. > In case of failure, the library [doesn't > call|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L627-L636] > `WAIT_NESTED_CONTAINER`. Despite the failure, the container might be > launched and the following attempt to remove the container without call > `WAIT_NESTED_CONTAINER` leads to errors like: > {code:java} > W0202 20:03:08.895830 7 checker_process.cpp:503] Received '500 Internal > Server Error' (Nested container has not terminated yet) while removing the > nested container > '2b0c542c-1f5f-42f7-b914-2c1cadb4aeca.da0a7cca-516c-4ec9-b215-b34412b670fa.check-49adc5f1-37a3-4f26-8708-e27d2d6cd125' > used for the COMMAND check for task > 'node-0-server__e26a82b0-fbab-46a0-a1ea-e7ac6cfa4c91 > {code} > The checker library should always call `WAIT_NESTED_CONTAINER` before > `REMOVE_NESTED_CONTAINER`. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`
[ https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16598016#comment-16598016 ] Vinod Kone commented on MESOS-8568: --- [~qianzhang] Can you please set the affects and target versions? > Command checks should always call `WAIT_NESTED_CONTAINER` before > `REMOVE_NESTED_CONTAINER` > -- > > Key: MESOS-8568 > URL: https://issues.apache.org/jira/browse/MESOS-8568 > Project: Mesos > Issue Type: Improvement >Reporter: Andrei Budnik >Assignee: Qian Zhang >Priority: Blocker > Labels: default-executor, health-check, mesosphere > > After successful launch of a nested container via > `LAUNCH_NESTED_CONTAINER_SESSION` in a checker library, it calls > [waitNestedContainer > |https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L657] > for the container. Checker library > [calls|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L466-L487] > `REMOVE_NESTED_CONTAINER` to remove a previous nested container before > launching a nested container for a subsequent check. Hence, > `REMOVE_NESTED_CONTAINER` call follows `WAIT_NESTED_CONTAINER` to ensure that > the nested container has been terminated and can be removed/cleaned up. > In case of failure, the library [doesn't > call|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L627-L636] > `WAIT_NESTED_CONTAINER`. Despite the failure, the container might be > launched and the following attempt to remove the container without call > `WAIT_NESTED_CONTAINER` leads to errors like: > {code:java} > W0202 20:03:08.895830 7 checker_process.cpp:503] Received '500 Internal > Server Error' (Nested container has not terminated yet) while removing the > nested container > '2b0c542c-1f5f-42f7-b914-2c1cadb4aeca.da0a7cca-516c-4ec9-b215-b34412b670fa.check-49adc5f1-37a3-4f26-8708-e27d2d6cd125' > used for the COMMAND check for task > 'node-0-server__e26a82b0-fbab-46a0-a1ea-e7ac6cfa4c91 > {code} > The checker library should always call `WAIT_NESTED_CONTAINER` before > `REMOVE_NESTED_CONTAINER`. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`
[ https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16591235#comment-16591235 ] Qian Zhang commented on MESOS-8568: --- I ran the exactly same reproduce steps with the above patch applied, and found this issue was gone, there is only one check container's sandbox directory at any time. {code:java} $ ls -la /home/qzhang/opt/mesos/slaves/1eada535-3848-4c76-b8c5-0e9e0d6fa102-S0/frameworks/8a842ab3-8aba-4d64-a744-ae98bdcf6d59-/executors/default-executor/runs/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/containers/06e7c625-596c-454c-b092-f17a81073349/containers | grep check | wc -l {code} Here is the agent log, we can see `WAIT_NESTED_CONTAINER` was called before `REMOVE_NESTED_CONTAINER` was called. {code:java} I0823 19:46:39.269901 32604 http.cpp:3366] Processing LAUNCH_NESTED_CONTAINER_SESSION call for container '9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18' I0823 19:46:39.277669 32603 switchboard.cpp:316] Container logger module finished preparing container 9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18; IOSwitchboard server is required I0823 19:46:39.284180 32603 systemd.cpp:98] Assigned child process '34701' to 'mesos_executors.slice' I0823 19:46:39.284451 32603 switchboard.cpp:604] Created I/O switchboard server (pid: 34701) listening on socket file '/tmp/mesos-io-switchboard-12e8e4c7-268e-4184-881c-a16b61fa260c' for container 9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18 I0823 19:46:39.288053 32641 linux_launcher.cpp:492] Launching nested container 9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18 and cloning with namespaces W0823 19:46:39.302271 32636 http.cpp:2635] Failed to launch container 9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18: Collect failed: ==Fake error== I0823 19:46:39.304822 32639 linux_launcher.cpp:580] Asked to destroy container 9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18 I0823 19:46:39.305047 32639 linux_launcher.cpp:622] Destroying cgroup '/sys/fs/cgroup/freezer/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18' I0823 19:46:39.306437 32646 cgroups.cpp:2838] Freezing cgroup /sys/fs/cgroup/freezer/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18 I0823 19:46:39.307015 32614 cgroups.cpp:1229] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18 after 419840ns I0823 19:46:39.307715 32641 http.cpp:1117] HTTP POST for /slave(1)/api/v1 from 10.0.49.2:42086 I0823 19:46:39.308198 32646 cgroups.cpp:2856] Thawing cgroup /sys/fs/cgroup/freezer/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18 I0823 19:46:39.308298 32641 http.cpp:2685] Processing WAIT_NESTED_CONTAINER call for container '9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18' I0823 19:46:39.308583 32605 cgroups.cpp:1258] Successfully thawed cgroup /sys/fs/cgroup/freezer/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18 after 265728ns I0823 19:46:39.373747 32616 linux_launcher.cpp:654] Destroying cgroup '/sys/fs/cgroup/systemd/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18' I0823 19:46:44.375650 32647 switchboard.cpp:807] Sending SIGTERM to I/O switchboard server (pid: 34701) since container 9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18 is being destroyed I0823 19:46:44.403535 32637 switchboard.cpp:913] I/O switchboard server process for container 9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18 has terminated (status=0) I0823 19:46:47.420578 32622 http.cpp:1117] HTTP POST for /slave(1)/api/v1 from 10.0.49.2:42088 I0823 19:46:47.421331 32622 http.cpp:2971] Processing REMOVE_NESTED_CONTAINER call for container '9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18' I0823 19:46:47.427382 32636 http.cpp:1117] HTTP POST for /slave(1)/api/v1 from
[jira] [Commented] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`
[ https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16591106#comment-16591106 ] Qian Zhang commented on MESOS-8568: --- RR: https://reviews.apache.org/r/68495/ > Command checks should always call `WAIT_NESTED_CONTAINER` before > `REMOVE_NESTED_CONTAINER` > -- > > Key: MESOS-8568 > URL: https://issues.apache.org/jira/browse/MESOS-8568 > Project: Mesos > Issue Type: Improvement >Reporter: Andrei Budnik >Assignee: Qian Zhang >Priority: Blocker > Labels: default-executor, health-check, mesosphere > > After successful launch of a nested container via > `LAUNCH_NESTED_CONTAINER_SESSION` in a checker library, it calls > [waitNestedContainer > |https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L657] > for the container. Checker library > [calls|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L466-L487] > `REMOVE_NESTED_CONTAINER` to remove a previous nested container before > launching a nested container for a subsequent check. Hence, > `REMOVE_NESTED_CONTAINER` call follows `WAIT_NESTED_CONTAINER` to ensure that > the nested container has been terminated and can be removed/cleaned up. > In case of failure, the library [doesn't > call|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L627-L636] > `WAIT_NESTED_CONTAINER`. Despite the failure, the container might be > launched and the following attempt to remove the container without call > `WAIT_NESTED_CONTAINER` leads to errors like: > {code:java} > W0202 20:03:08.895830 7 checker_process.cpp:503] Received '500 Internal > Server Error' (Nested container has not terminated yet) while removing the > nested container > '2b0c542c-1f5f-42f7-b914-2c1cadb4aeca.da0a7cca-516c-4ec9-b215-b34412b670fa.check-49adc5f1-37a3-4f26-8708-e27d2d6cd125' > used for the COMMAND check for task > 'node-0-server__e26a82b0-fbab-46a0-a1ea-e7ac6cfa4c91 > {code} > The checker library should always call `WAIT_NESTED_CONTAINER` before > `REMOVE_NESTED_CONTAINER`. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`
[ https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16591103#comment-16591103 ] Qian Zhang commented on MESOS-8568: --- [~vinodkone] Yeah, I noticed that as well. When the I/O switchboard server process is launched, it just [waits on a promise|https://github.com/apache/mesos/blob/1.6.1/src/slave/containerizer/mesos/io/switchboard.cpp#L1181], and that promise will only be set when a `ATTACH_CONTAINER_OUTPUT` call is made or a `SIGTERM` is sent. In this case, `ATTACH_CONTAINER_OUTPUT` will never be made since the check container was failed to launch, so we have to wait 5s for the `SIGTERM`. I am not quite sure which cases that [5s timeout|https://github.com/apache/mesos/blob/1.6.1/src/slave/containerizer/mesos/io/switchboard.cpp#L810:L818] is for, maybe [~jieyu] and [~klueska] have more info? > Command checks should always call `WAIT_NESTED_CONTAINER` before > `REMOVE_NESTED_CONTAINER` > -- > > Key: MESOS-8568 > URL: https://issues.apache.org/jira/browse/MESOS-8568 > Project: Mesos > Issue Type: Improvement >Reporter: Andrei Budnik >Assignee: Qian Zhang >Priority: Blocker > Labels: default-executor, health-check, mesosphere > > After successful launch of a nested container via > `LAUNCH_NESTED_CONTAINER_SESSION` in a checker library, it calls > [waitNestedContainer > |https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L657] > for the container. Checker library > [calls|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L466-L487] > `REMOVE_NESTED_CONTAINER` to remove a previous nested container before > launching a nested container for a subsequent check. Hence, > `REMOVE_NESTED_CONTAINER` call follows `WAIT_NESTED_CONTAINER` to ensure that > the nested container has been terminated and can be removed/cleaned up. > In case of failure, the library [doesn't > call|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L627-L636] > `WAIT_NESTED_CONTAINER`. Despite the failure, the container might be > launched and the following attempt to remove the container without call > `WAIT_NESTED_CONTAINER` leads to errors like: > {code:java} > W0202 20:03:08.895830 7 checker_process.cpp:503] Received '500 Internal > Server Error' (Nested container has not terminated yet) while removing the > nested container > '2b0c542c-1f5f-42f7-b914-2c1cadb4aeca.da0a7cca-516c-4ec9-b215-b34412b670fa.check-49adc5f1-37a3-4f26-8708-e27d2d6cd125' > used for the COMMAND check for task > 'node-0-server__e26a82b0-fbab-46a0-a1ea-e7ac6cfa4c91 > {code} > The checker library should always call `WAIT_NESTED_CONTAINER` before > `REMOVE_NESTED_CONTAINER`. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`
[ https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16590513#comment-16590513 ] Vinod Kone commented on MESOS-8568: --- Great repro! One orthogonal question though, it seems unfortunate that IOSwitchboard takes 5s to complete its cleanup for a container that has failed to launch. IIRC there was a 5s timeout in IOSwitchboard for some unexpected corner cases which is what we seem to be hitting here, but this is an *expected* case in some sense. Is there anyway we can speed that up? > Command checks should always call `WAIT_NESTED_CONTAINER` before > `REMOVE_NESTED_CONTAINER` > -- > > Key: MESOS-8568 > URL: https://issues.apache.org/jira/browse/MESOS-8568 > Project: Mesos > Issue Type: Improvement >Reporter: Andrei Budnik >Assignee: Qian Zhang >Priority: Blocker > Labels: default-executor, health-check, mesosphere > > After successful launch of a nested container via > `LAUNCH_NESTED_CONTAINER_SESSION` in a checker library, it calls > [waitNestedContainer > |https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L657] > for the container. Checker library > [calls|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L466-L487] > `REMOVE_NESTED_CONTAINER` to remove a previous nested container before > launching a nested container for a subsequent check. Hence, > `REMOVE_NESTED_CONTAINER` call follows `WAIT_NESTED_CONTAINER` to ensure that > the nested container has been terminated and can be removed/cleaned up. > In case of failure, the library [doesn't > call|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L627-L636] > `WAIT_NESTED_CONTAINER`. Despite the failure, the container might be > launched and the following attempt to remove the container without call > `WAIT_NESTED_CONTAINER` leads to errors like: > {code:java} > W0202 20:03:08.895830 7 checker_process.cpp:503] Received '500 Internal > Server Error' (Nested container has not terminated yet) while removing the > nested container > '2b0c542c-1f5f-42f7-b914-2c1cadb4aeca.da0a7cca-516c-4ec9-b215-b34412b670fa.check-49adc5f1-37a3-4f26-8708-e27d2d6cd125' > used for the COMMAND check for task > 'node-0-server__e26a82b0-fbab-46a0-a1ea-e7ac6cfa4c91 > {code} > The checker library should always call `WAIT_NESTED_CONTAINER` before > `REMOVE_NESTED_CONTAINER`. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`
[ https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588976#comment-16588976 ] Qian Zhang commented on MESOS-8568: --- Reproduce steps: 1. To simulate the failure of launching nested container via health check, change `CgroupsIsolatorProcess::isolate` a bit: {code:java} Future CgroupsIsolatorProcess::isolate( const ContainerID& containerId, pid_t pid) { + if (strings::startsWith(containerId.value(), "check")) { +return Failure("==Fake error=="); + } + {code} 2. Start Mesos master and agent. {code:java} $ sudo ./bin/mesos-master.sh --work_dir=/home/qzhang/opt/mesos $ sudo ./bin/mesos-slave.sh --master=10.0.49.2:5050 --port=36251 --work_dir=/home/qzhang/opt/mesos --isolation=filesystem/linux,docker/runtime,network/cni,cgroups/cpu,cgroups/mem {code} 3. Launch a nested container with check enabled. {code:java} $ cat task_group_health_check.json { "tasks":[ { "name" : "test", "task_id" : {"value" : "test"}, "agent_id": {"value" : ""}, "resources": [ {"name": "cpus", "type": "SCALAR", "scalar": {"value": 0.1}}, {"name": "mem", "type": "SCALAR", "scalar": {"value": 32}} ], "command": { "value": "touch aaa && sleep 1000" }, "check": { "type": "COMMAND", "command": { "command": { "value": "ls aaa > /dev/null" } }, "delay_seconds": 5, "interval_seconds": 3 } } ] } $ src/mesos-execute --master=10.0.49.2:5050 --task_group=file:///home/qzhang/workspace/config/task_group_health_check.json{code} 5. After a few minutes, there will be a lot of check container's sandbox directories not removed. {code:java} $ ls -la /home/qzhang/opt/mesos/slaves/c355abce-0088-4196-8376-d54c9963abdd-S0/frameworks/c355abce-0088-4196-8376-d54c9963abdd-/executors/default-executor/runs/ab8d9ad1-e85c-472a-8608-a059a3e5cdf4/containers/d66f9d77-9a69-41dd-9a70-dffdec8b2fba/containers | grep check | wc -l 119 {code} And in the default executor's stderr, we see a lot of warning messages {code:java} ... W0822 07:37:45.084581 19377 checker_process.cpp:794] Received '400 Bad Request' (Collect failed: ==Fake error==) while launching COMMAND check for task 'test' I0822 07:37:45.085053 19377 checker_process.cpp:457] COMMAND check for task 'test' is not available W0822 07:37:48.092411 19362 checker_process.cpp:655] Received '500 Internal Server Error' (Nested container has not terminated yet) while removing the nested container 'ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586' used for the COMMAND check for task 'test' I0822 07:37:48.093101 19362 checker_process.cpp:457] COMMAND check for task 'test' is not available W0822 07:37:48.130527 19373 checker_process.cpp:794] Received '400 Bad Request' (Collect failed: ==Fake error==) while launching COMMAND check for task 'test' W0822 07:37:51.099179 19360 checker_process.cpp:655] Received '500 Internal Server Error' (Nested container has not terminated yet) while removing the nested container 'ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-5c7af0fc-ad73-4870-aba8-65a3fb4eae12' used for the COMMAND check for task 'test' I0822 07:37:51.099799 19360 checker_process.cpp:457] COMMAND check for task 'test' is not available ...{code} So every time when the default executor called `REMOVE_NESTED_CONTAINER` to remove the previous check container, the call will fail with a 500 error. The reason that this call failed is the check container has not terminated yet (still in `DESTROYING` state), the agent log below also proved this. {code:java} I0822 07:37:45.051453 19063 http.cpp:3366] Processing LAUNCH_NESTED_CONTAINER_SESSION call for container 'ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586' I0822 07:37:45.058904 19088 switchboard.cpp:316] Container logger module finished preparing container ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586; IOSwitchboard server is required I0822 07:37:45.065444 19088 systemd.cpp:98] Assigned child process '19410' to 'mesos_executors.slice' I0822 07:37:45.065724 19088 switchboard.cpp:604] Created I/O switchboard server (pid: 19410) listening on socket file '/tmp/mesos-io-switchboard-048e2be0-4a2b-4c00-a846-0e8137507a85' for container ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586 I0822 07:37:45.069316 19062 linux_launcher.cpp:492] Launching nested container ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586 and cloning with namespaces
[jira] [Commented] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`
[ https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580929#comment-16580929 ] Jan Schlicht commented on MESOS-8568: - Scratch my older comment. {{REMOVE_NESTED_CONTAINER}} has to called on a destroyed container, because as part of this call, the containers runtime directory will be removed. I.e., if this call isn't successful, it will leak the containers runtime directory. This is the case in the scenario above. Hence, the checker has to call {{WAIT_NESTED_CONTAINER}} to make sure that it's not calling {{REMOVE_NESTED_CONTAINER}} on a container that is currently being destroyed. > Command checks should always call `WAIT_NESTED_CONTAINER` before > `REMOVE_NESTED_CONTAINER` > -- > > Key: MESOS-8568 > URL: https://issues.apache.org/jira/browse/MESOS-8568 > Project: Mesos > Issue Type: Task >Reporter: Andrei Budnik >Priority: Blocker > Labels: default-executor, health-check, mesosphere > > After successful launch of a nested container via > `LAUNCH_NESTED_CONTAINER_SESSION` in a checker library, it calls > [waitNestedContainer > |https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L657] > for the container. Checker library > [calls|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L466-L487] > `REMOVE_NESTED_CONTAINER` to remove a previous nested container before > launching a nested container for a subsequent check. Hence, > `REMOVE_NESTED_CONTAINER` call follows `WAIT_NESTED_CONTAINER` to ensure that > the nested container has been terminated and can be removed/cleaned up. > In case of failure, the library [doesn't > call|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L627-L636] > `WAIT_NESTED_CONTAINER`. Despite the failure, the container might be > launched and the following attempt to remove the container without call > `WAIT_NESTED_CONTAINER` leads to errors like: > {code:java} > W0202 20:03:08.895830 7 checker_process.cpp:503] Received '500 Internal > Server Error' (Nested container has not terminated yet) while removing the > nested container > '2b0c542c-1f5f-42f7-b914-2c1cadb4aeca.da0a7cca-516c-4ec9-b215-b34412b670fa.check-49adc5f1-37a3-4f26-8708-e27d2d6cd125' > used for the COMMAND check for task > 'node-0-server__e26a82b0-fbab-46a0-a1ea-e7ac6cfa4c91 > {code} > The checker library should always call `WAIT_NESTED_CONTAINER` before > `REMOVE_NESTED_CONTAINER`. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`
[ https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579747#comment-16579747 ] Jan Schlicht commented on MESOS-8568: - No, the {{REMOVE_NESTED_CONTAINER}} shouldn't be a problem here. This particular 500 return code is actually a no-op in the containerizer. We don't need to call {{WAIT_NESTED_CONTAINER}} here. > Command checks should always call `WAIT_NESTED_CONTAINER` before > `REMOVE_NESTED_CONTAINER` > -- > > Key: MESOS-8568 > URL: https://issues.apache.org/jira/browse/MESOS-8568 > Project: Mesos > Issue Type: Task >Reporter: Andrei Budnik >Priority: Blocker > Labels: default-executor, health-check, mesosphere > > After successful launch of a nested container via > `LAUNCH_NESTED_CONTAINER_SESSION` in a checker library, it calls > [waitNestedContainer > |https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L657] > for the container. Checker library > [calls|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L466-L487] > `REMOVE_NESTED_CONTAINER` to remove a previous nested container before > launching a nested container for a subsequent check. Hence, > `REMOVE_NESTED_CONTAINER` call follows `WAIT_NESTED_CONTAINER` to ensure that > the nested container has been terminated and can be removed/cleaned up. > In case of failure, the library [doesn't > call|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L627-L636] > `WAIT_NESTED_CONTAINER`. Despite the failure, the container might be > launched and the following attempt to remove the container without call > `WAIT_NESTED_CONTAINER` leads to errors like: > {code:java} > W0202 20:03:08.895830 7 checker_process.cpp:503] Received '500 Internal > Server Error' (Nested container has not terminated yet) while removing the > nested container > '2b0c542c-1f5f-42f7-b914-2c1cadb4aeca.da0a7cca-516c-4ec9-b215-b34412b670fa.check-49adc5f1-37a3-4f26-8708-e27d2d6cd125' > used for the COMMAND check for task > 'node-0-server__e26a82b0-fbab-46a0-a1ea-e7ac6cfa4c91 > {code} > The checker library should always call `WAIT_NESTED_CONTAINER` before > `REMOVE_NESTED_CONTAINER`. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`
[ https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579656#comment-16579656 ] Jan Schlicht commented on MESOS-8568: - I've linked MESOS-9131, as it's very similar: Calling {{REMOVE_NESTED_CONTAINER}} while that container is being destroyed seems to result in a race condition, though it isn't yet clear why. > Command checks should always call `WAIT_NESTED_CONTAINER` before > `REMOVE_NESTED_CONTAINER` > -- > > Key: MESOS-8568 > URL: https://issues.apache.org/jira/browse/MESOS-8568 > Project: Mesos > Issue Type: Task >Reporter: Andrei Budnik >Priority: Blocker > Labels: default-executor, health-check, mesosphere > > After successful launch of a nested container via > `LAUNCH_NESTED_CONTAINER_SESSION` in a checker library, it calls > [waitNestedContainer > |https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L657] > for the container. Checker library > [calls|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L466-L487] > `REMOVE_NESTED_CONTAINER` to remove a previous nested container before > launching a nested container for a subsequent check. Hence, > `REMOVE_NESTED_CONTAINER` call follows `WAIT_NESTED_CONTAINER` to ensure that > the nested container has been terminated and can be removed/cleaned up. > In case of failure, the library [doesn't > call|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L627-L636] > `WAIT_NESTED_CONTAINER`. Despite the failure, the container might be > launched and the following attempt to remove the container without call > `WAIT_NESTED_CONTAINER` leads to errors like: > {code:java} > W0202 20:03:08.895830 7 checker_process.cpp:503] Received '500 Internal > Server Error' (Nested container has not terminated yet) while removing the > nested container > '2b0c542c-1f5f-42f7-b914-2c1cadb4aeca.da0a7cca-516c-4ec9-b215-b34412b670fa.check-49adc5f1-37a3-4f26-8708-e27d2d6cd125' > used for the COMMAND check for task > 'node-0-server__e26a82b0-fbab-46a0-a1ea-e7ac6cfa4c91 > {code} > The checker library should always call `WAIT_NESTED_CONTAINER` before > `REMOVE_NESTED_CONTAINER`. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`
[ https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16525717#comment-16525717 ] Till Toenshoff commented on MESOS-8568: --- Raised priority to blocker - we had multiple major Mesos users reaching out to us for help getting this fixed as it gets their disks stuffed with containers never getting properly cleaned up. > Command checks should always call `WAIT_NESTED_CONTAINER` before > `REMOVE_NESTED_CONTAINER` > -- > > Key: MESOS-8568 > URL: https://issues.apache.org/jira/browse/MESOS-8568 > Project: Mesos > Issue Type: Task >Reporter: Andrei Budnik >Assignee: Benno Evers >Priority: Blocker > Labels: default-executor, health-check, mesosphere > > After successful launch of a nested container via > `LAUNCH_NESTED_CONTAINER_SESSION` in a checker library, it calls > [waitNestedContainer > |https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L657] > for the container. Checker library > [calls|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L466-L487] > `REMOVE_NESTED_CONTAINER` to remove a previous nested container before > launching a nested container for a subsequent check. Hence, > `REMOVE_NESTED_CONTAINER` call follows `WAIT_NESTED_CONTAINER` to ensure that > the nested container has been terminated and can be removed/cleaned up. > In case of failure, the library [doesn't > call|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L627-L636] > `WAIT_NESTED_CONTAINER`. Despite the failure, the container might be > launched and the following attempt to remove the container without call > `WAIT_NESTED_CONTAINER` leads to errors like: > {code:java} > W0202 20:03:08.895830 7 checker_process.cpp:503] Received '500 Internal > Server Error' (Nested container has not terminated yet) while removing the > nested container > '2b0c542c-1f5f-42f7-b914-2c1cadb4aeca.da0a7cca-516c-4ec9-b215-b34412b670fa.check-49adc5f1-37a3-4f26-8708-e27d2d6cd125' > used for the COMMAND check for task > 'node-0-server__e26a82b0-fbab-46a0-a1ea-e7ac6cfa4c91 > {code} > The checker library should always call `WAIT_NESTED_CONTAINER` before > `REMOVE_NESTED_CONTAINER`. -- This message was sent by Atlassian JIRA (v7.6.3#76005)