[ 
https://issues.apache.org/jira/browse/MESOS-9306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song reassigned MESOS-9306:
-----------------------------------

    Assignee: Andrei Budnik

> Mesos containerizer can get stuck during cgroup cleanup
> -------------------------------------------------------
>
>                 Key: MESOS-9306
>                 URL: https://issues.apache.org/jira/browse/MESOS-9306
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, containerization
>    Affects Versions: 1.7.0
>            Reporter: Greg Mann
>            Assignee: Andrei Budnik
>            Priority: Critical
>              Labels: containerizer, mesosphere
>
> I observed a task group's executor container that was never completely 
> destroyed after its associated tasks were killed. The following excerpt 
> from the agent log is filtered to include only lines containing the container 
> ID, {{d463b9fe-970d-4077-bab9-558464889a9e}}:
> {code}
> 2018-10-10 14:20:50: I1010 14:20:50.204756  6799 containerizer.cpp:2963] 
> Container d463b9fe-970d-4077-bab9-558464889a9e has exited
> 2018-10-10 14:20:50: I1010 14:20:50.204839  6799 containerizer.cpp:2457] 
> Destroying container d463b9fe-970d-4077-bab9-558464889a9e in RUNNING state
> 2018-10-10 14:20:50: I1010 14:20:50.204859  6799 containerizer.cpp:3124] 
> Transitioning the state of container d463b9fe-970d-4077-bab9-558464889a9e 
> from RUNNING to DESTROYING
> 2018-10-10 14:20:50: I1010 14:20:50.204960  6799 linux_launcher.cpp:580] 
> Asked to destroy container d463b9fe-970d-4077-bab9-558464889a9e
> 2018-10-10 14:20:50: I1010 14:20:50.204993  6799 linux_launcher.cpp:622] 
> Destroying cgroup 
> '/sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e'
> 2018-10-10 14:20:50: I1010 14:20:50.205417  6806 cgroups.cpp:2838] Freezing 
> cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos
> 2018-10-10 14:20:50: I1010 14:20:50.205477  6810 cgroups.cpp:2838] Freezing 
> cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e
> 2018-10-10 14:20:50: I1010 14:20:50.205708  6808 cgroups.cpp:1229] 
> Successfully froze cgroup 
> /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos after 
> 203008ns
> 2018-10-10 14:20:50: I1010 14:20:50.205878  6800 cgroups.cpp:1229] 
> Successfully froze cgroup 
> /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e after 
> 339200ns
> 2018-10-10 14:20:50: I1010 14:20:50.206185  6799 cgroups.cpp:2856] Thawing 
> cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos
> 2018-10-10 14:20:50: I1010 14:20:50.206226  6808 cgroups.cpp:2856] Thawing 
> cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e
> 2018-10-10 14:20:50: I1010 14:20:50.206455  6808 cgroups.cpp:1258] 
> Successfully thawed cgroup 
> /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e after 
> 83968ns
> 2018-10-10 14:20:50: I1010 14:20:50.306803  6810 cgroups.cpp:1258] 
> Successfully thawed cgroup 
> /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos after 
> 100.50816ms
> 2018-10-10 14:20:50: I1010 14:20:50.307531  6805 linux_launcher.cpp:654] 
> Destroying cgroup 
> '/sys/fs/cgroup/systemd/mesos/d463b9fe-970d-4077-bab9-558464889a9e'
> 2018-10-10 14:21:40: W1010 14:21:40.032855  6809 containerizer.cpp:2401] 
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: 
> Container does not exist
> 2018-10-10 14:22:40: W1010 14:22:40.031224  6800 containerizer.cpp:2401] 
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: 
> Container does not exist
> 2018-10-10 14:23:40: W1010 14:23:40.031946  6799 containerizer.cpp:2401] 
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: 
> Container does not exist
> 2018-10-10 14:24:40: W1010 14:24:40.032979  6804 containerizer.cpp:2401] 
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: 
> Container does not exist
> 2018-10-10 14:25:40: W1010 14:25:40.030784  6808 containerizer.cpp:2401] 
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: 
> Container does not exist
> 2018-10-10 14:26:40: W1010 14:26:40.032526  6810 containerizer.cpp:2401] 
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: 
> Container does not exist
> 2018-10-10 14:27:40: W1010 14:27:40.029932  6801 containerizer.cpp:2401] 
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: 
> Container does not exist
> {code}
> The last log line from the containerizer's destroy path is:
> {code}
> 14:20:50.307531  6805 linux_launcher.cpp:654] Destroying cgroup 
> '/sys/fs/cgroup/systemd/mesos/d463b9fe-970d-4077-bab9-558464889a9e'
> {code}
> (That is the second such log line, emitted from {{LinuxLauncherProcess::_destroy}}.)
> After that, we see only
> {code}
> containerizer.cpp:2401] Skipping status for container 
> d463b9fe-970d-4077-bab9-558464889a9e because: Container does not exist
> {code}
> repeatedly, which occurs because the agent's {{GET_CONTAINERS}} call is being 
> polled once per minute. This seems to indicate that the container in question 
> is still in the agent's {{containers_}} map.
> So, it seems that the containerizer is stuck either in the Linux launcher's 
> {{destroy()}} code path or in the containerizer's own {{destroy()}} code path.
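
The symptom described above (a container that transitions to DESTROYING and is afterwards mentioned only by the "Skipping status" warning) can be searched for in an agent log with a rough heuristic like the sketch below. The two message patterns are copied from the excerpt; the function name {{find_stuck_containers}} and the {{min_skips}} threshold are hypothetical, and a match only suggests a stuck destroy, it does not prove one.

```python
import re

# Message patterns taken from the log excerpt in this issue. Wrapped lines are
# joined before matching, so a message split across lines still matches.
DESTROYING = re.compile(
    r"Transitioning the state of container (\S+)\s+from \w+ to DESTROYING")
SKIPPING = re.compile(r"Skipping status for container (\S+) because")


def find_stuck_containers(log_text, min_skips=3):
    """Return IDs of containers that entered DESTROYING and were later
    reported by at least `min_skips` 'Skipping status' warnings.

    Hypothetical helper: the threshold is an arbitrary assumption.
    """
    text = " ".join(line.strip() for line in log_text.splitlines())

    # Position in the text where each container first entered DESTROYING.
    started = {}
    for match in DESTROYING.finditer(text):
        started.setdefault(match.group(1), match.end())

    # Count only the warnings that appear after that transition.
    counts = {}
    for match in SKIPPING.finditer(text):
        cid = match.group(1)
        if cid in started and match.start() > started[cid]:
            counts[cid] = counts.get(cid, 0) + 1

    return sorted(cid for cid, n in counts.items() if n >= min_skips)
```

Running this over the unquoted excerpt above would be expected to report {{d463b9fe-970d-4077-bab9-558464889a9e}}, since the warning appears seven times after the transition to DESTROYING.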



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
