[ https://issues.apache.org/jira/browse/MESOS-9306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gilbert Song reassigned MESOS-9306:
-----------------------------------

    Assignee: Andrei Budnik

> Mesos containerizer can get stuck during cgroup cleanup
> -------------------------------------------------------
>
>                 Key: MESOS-9306
>                 URL: https://issues.apache.org/jira/browse/MESOS-9306
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, containerization
>    Affects Versions: 1.7.0
>            Reporter: Greg Mann
>            Assignee: Andrei Budnik
>            Priority: Critical
>              Labels: containerizer, mesosphere
>
> I observed a task group's executor container which failed to be completely
> destroyed after its associated tasks were killed. The following is an excerpt
> from the agent log which is filtered to include only lines with the container
> ID, {{d463b9fe-970d-4077-bab9-558464889a9e}}:
> {code}
> 2018-10-10 14:20:50: I1010 14:20:50.204756 6799 containerizer.cpp:2963] Container d463b9fe-970d-4077-bab9-558464889a9e has exited
> 2018-10-10 14:20:50: I1010 14:20:50.204839 6799 containerizer.cpp:2457] Destroying container d463b9fe-970d-4077-bab9-558464889a9e in RUNNING state
> 2018-10-10 14:20:50: I1010 14:20:50.204859 6799 containerizer.cpp:3124] Transitioning the state of container d463b9fe-970d-4077-bab9-558464889a9e from RUNNING to DESTROYING
> 2018-10-10 14:20:50: I1010 14:20:50.204960 6799 linux_launcher.cpp:580] Asked to destroy container d463b9fe-970d-4077-bab9-558464889a9e
> 2018-10-10 14:20:50: I1010 14:20:50.204993 6799 linux_launcher.cpp:622] Destroying cgroup '/sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e'
> 2018-10-10 14:20:50: I1010 14:20:50.205417 6806 cgroups.cpp:2838] Freezing cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos
> 2018-10-10 14:20:50: I1010 14:20:50.205477 6810 cgroups.cpp:2838] Freezing cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e
> 2018-10-10 14:20:50: I1010 14:20:50.205708 6808 cgroups.cpp:1229] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos after 203008ns
> 2018-10-10 14:20:50: I1010 14:20:50.205878 6800 cgroups.cpp:1229] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e after 339200ns
> 2018-10-10 14:20:50: I1010 14:20:50.206185 6799 cgroups.cpp:2856] Thawing cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos
> 2018-10-10 14:20:50: I1010 14:20:50.206226 6808 cgroups.cpp:2856] Thawing cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e
> 2018-10-10 14:20:50: I1010 14:20:50.206455 6808 cgroups.cpp:1258] Successfully thawed cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e after 83968ns
> 2018-10-10 14:20:50: I1010 14:20:50.306803 6810 cgroups.cpp:1258] Successfully thawed cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos after 100.50816ms
> 2018-10-10 14:20:50: I1010 14:20:50.307531 6805 linux_launcher.cpp:654] Destroying cgroup '/sys/fs/cgroup/systemd/mesos/d463b9fe-970d-4077-bab9-558464889a9e'
> 2018-10-10 14:21:40: W1010 14:21:40.032855 6809 containerizer.cpp:2401] Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: Container does not exist
> 2018-10-10 14:22:40: W1010 14:22:40.031224 6800 containerizer.cpp:2401] Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: Container does not exist
> 2018-10-10 14:23:40: W1010 14:23:40.031946 6799 containerizer.cpp:2401] Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: Container does not exist
> 2018-10-10 14:24:40: W1010 14:24:40.032979 6804 containerizer.cpp:2401] Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: Container does not exist
> 2018-10-10 14:25:40: W1010 14:25:40.030784 6808 containerizer.cpp:2401] Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: Container does not exist
> 2018-10-10 14:26:40: W1010 14:26:40.032526 6810 containerizer.cpp:2401] Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: Container does not exist
> 2018-10-10 14:27:40: W1010 14:27:40.029932 6801 containerizer.cpp:2401] Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: Container does not exist
> {code}
> The last log line from the containerizer's destroy path is:
> {code}
> 14:20:50.307531 6805 linux_launcher.cpp:654] Destroying cgroup '/sys/fs/cgroup/systemd/mesos/d463b9fe-970d-4077-bab9-558464889a9e'
> {code}
> (that is the second such log line, from {{LinuxLauncherProcess::_destroy}})
> Then we just see
> {code}
> containerizer.cpp:2401] Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: Container does not exist
> {code}
> repeatedly, which occurs because the agent's {{GET_CONTAINERS}} call is being
> polled once per minute. This seems to indicate that the container in question
> is still in the agent's {{containers_}} map.
> So, it seems that the containerizer is stuck either in the Linux launcher's
> {{destroy()}} code path, or the containerizer's {{destroy()}} code path.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
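
For anyone wanting to reproduce the filtering described in the report, the following is a minimal sketch, not part of the original issue. It assumes the agent log uses the line layout shown in the excerpt (date, time, then a glog-style severity, timestamp, thread ID, and source location). The names `parse` and `last_destroy_step` are hypothetical helpers invented for illustration. The sketch keeps only lines mentioning a given container ID, ignores the periodic "Skipping status" polling warnings, and returns the last remaining entry, which is where the destroy path appears to stall.

```python
import re
from datetime import datetime

# Assumed layout, taken from the excerpt in the report:
# "<date> <time>: <sev><MMDD> <time.usec> <thread> <file>:<line>] <message>"
LINE_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) \d{2}:\d{2}:\d{2}: "
    r"(?P<sev>[IWEF])\d{4} (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+\d+ "
    r"(?P<src>[\w.]+:\d+)\] (?P<msg>.*)$"
)


def parse(line):
    """Return (timestamp, severity, source, message), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    ts = datetime.strptime(m["date"] + " " + m["time"], "%Y-%m-%d %H:%M:%S.%f")
    return ts, m["sev"], m["src"], m["msg"]


def last_destroy_step(lines, container_id,
                      poll_marker="Skipping status for container"):
    """Return the last parsed entry for container_id that is not a polling warning."""
    last = None
    for line in lines:
        if container_id not in line:
            continue
        entry = parse(line)
        # Skip unparseable lines and the once-per-minute status polling noise.
        if entry is None or entry[3].startswith(poll_marker):
            continue
        last = entry
    return last
```

Calling `last_destroy_step(open(path), container_id)` on a log file (where `path` is whatever location the agent log is written to on the host) yields the source location of the last destroy-path message, which can then be compared with the launcher and containerizer code to narrow down where the chain of futures stops.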