[jira] [Comment Edited] (MESOS-9836) Docker containerizer overwrites `/mesos/slave` cgroups.

2019-08-16 Thread Qian Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1690#comment-1690
 ] 

Qian Zhang edited comment on MESOS-9836 at 8/16/19 9:26 AM:


{quote}As Mesos provides an option to run a Docker image as an (custom?) 
executor, it might make sense to update the Docker container's resources 
(executor+its tasks running in the Docker container) in cgroups
{quote}
[~abudnik] You are saying the `--docker_mesos_image` flag, right? I think when 
this flag is specified, the Docker executor will run in a Docker container with 
the Docker image specified via this flag, but the task will be launched by 
Docker executor in *another* Docker container because Docker executor always 
calls `docker run` to launch task as a Docker container in host and in Docker 
containerizer we mount the host docker socket for Docker executor so it can 
communicate with the Docker daemon in host (see 
[here|https://github.com/apache/mesos/blob/1.8.1/src/slave/containerizer/docker.cpp#L333:L339]
 for details). And another important thing is, in any cases, what 
`DockerContainerizerProcess::update` tries to update is the Docker container of 
task instead of Docker container of executor (because 
[container->containerName|https://github.com/apache/mesos/blob/1.8.1/src/slave/containerizer/docker.hpp#L486]
 is always the task container's name), that is really confusing me as I 
mentioned in my previous comment.


was (Author: qianzhang):
{quote}As Mesos provides an option to run a Docker image as an (custom?) 
executor, it might make sense to update the Docker container's resources 
(executor+its tasks running in the Docker container) in cgroups
{quote}
[~abudnik] You are saying the `--docker_mesos_image` flag, right? I think when 
this flag is specified, the Docker executor will run in a Docker container with 
the Docker image specified via this flag, but the task will be launched by 
Docker executor in *another* Docker container because Docker executor always 
calls `docker run` to launch task as a Docker container and in Docker 
containerizer we mount the docker socket for Docker executor so it can 
communicate with the Docker daemon in agent host (see 
[here|https://github.com/apache/mesos/blob/1.8.1/src/slave/containerizer/docker.cpp#L333:L339]
 for details). And another important thing is, in any cases, what 
`DockerContainerizerProcess::update` tries to update is the Docker container of 
task instead of Docker container of executor (because 
[container->containerName|https://github.com/apache/mesos/blob/1.8.1/src/slave/containerizer/docker.hpp#L486]
 is always the task container's name), that is really confusing me as I 
mentioned in my previous comment.

> Docker containerizer overwrites `/mesos/slave` cgroups.
> ---
>
> Key: MESOS-9836
> URL: https://issues.apache.org/jira/browse/MESOS-9836
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Chun-Hung Hsiao
>Priority: Critical
>  Labels: docker, mesosphere
>
> The following bug was observed on our internal testing cluster.
> The docker containerizer launched a container on an agent:
> {noformat}
> I0523 06:00:53.888579 21815 docker.cpp:1195] Starting container 
> 'f69c8a8c-eba4-4494-a305-0956a44a6ad2' for task 
> 'apps_docker-sleep-app.1fda5b8e-7d20-11e9-9717-7aa030269ee1' (and executor 
> 'apps_docker-sleep-app.1fda5b8e-7d20-11e9-9717-7aa030269ee1') of framework 
> 415284b7-2967-407d-b66f-f445e93f064e-0011
> I0523 06:00:54.524171 21815 docker.cpp:783] Checkpointing pid 13716 to 
> '/var/lib/mesos/slave/meta/slaves/60c42ab7-eb1a-4cec-b03d-ea06bff00c3f-S2/frameworks/415284b7-2967-407d-b66f-f445e93f064e-0011/executors/apps_docker-sleep-app.1fda5b8e-7d20-11e9-9717-7aa030269ee1/runs/f69c8a8c-eba4-4494-a305-0956a44a6ad2/pids/forked.pid'
> {noformat}
> After the container was launched, the docker containerizer did a {{docker 
> inspect}} on the container and cached the pid:
>  
> [https://github.com/apache/mesos/blob/0c431dd60ae39138cc7e8b099d41ad794c02c9a9/src/slave/containerizer/docker.cpp#L1764]
>  The pid should be slightly greater than 13716.
> The docker executor sent a {{TASK_FINISHED}} status update around 16 minutes 
> later:
> {noformat}
> I0523 06:16:17.287595 21809 slave.cpp:5566] Handling status update 
> TASK_FINISHED (Status UUID: 4e00b786-b773-46cd-8327-c7deb08f1de9) for task 
> apps_docker-sleep-app.1fda5b8e-7d20-11e9-9717-7aa030269ee1 of framework 
> 415284b7-2967-407d-b66f-f445e93f064e-0011 from executor(1)@172.31.1.7:36244
> {noformat}
> After receiving the terminal status update, the agent asked the docker 
> containerizer to update {{cpu.cfs_period_us}}, {{cpu.cfs_quota_us}} and 
> {{memory.soft_limit_in_bytes}} of the container through the cac

[jira] [Comment Edited] (MESOS-9836) Docker containerizer overwrites `/mesos/slave` cgroups.

2019-08-15 Thread Qian Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907981#comment-16907981
 ] 

Qian Zhang edited comment on MESOS-9836 at 8/15/19 10:28 AM:
-

{quote}A fix would be after obtaining the pid through docker inspect, we look 
into the proc filesystem and get the cgroup right away and cache it, as well as 
other properties we originally retrieve through pid if any.
{quote}
[~chhsia0] Yeah, we could get and catch the cgroup, but we also need to recover 
it after agent restarts which might be complicated.
{quote}Probably we should leave out all cgroups not containing "docker" 
substring instead of (or in addition to) filtering [the system root 
cgroup|https://github.com/apache/mesos/blob/0c431dd60ae39138cc7e8b099d41ad794c02c9a9/src/slave/containerizer/docker.cpp#L1783-L1788].
 It's ugly, hacky, and introduces a dependency on Docker's runtime.
{quote}
[~abudnik] I think a typical cgroup for Docker container is like:
{code:java}
/sys/fs/cgroup//docker/
{code}
Are you saying in `DockerContainerizerProcess::__update` we do not update 
anything if the cgroup does not contain "docker" substring (i.e., it is not a 
Docker container)? Yeah, it should work, but as you mentioned it will introduce 
a dependency on Docker which is unfortunate.

 

I am actually wondering why we need to update Docker container's CPU & memory 
in Docker containerizer. For the task launched via Docker containerizer, the 
method `DockerContainerizerProcess::update` will be called twice:
 # When Docker executor registers agent (see 
[here|https://github.com/apache/mesos/blob/1.8.1/src/slave/slave.cpp#L5210] for 
details): The resources passed into `DockerContainerizerProcess::update` is 
same as the container's resources, so we will not update anything (see 
[here|https://github.com/apache/mesos/blob/1.8.1/src/slave/containerizer/docker.cpp#L1675:L1680]
 for details).
 # When agent receives a terminal status update for the task (see 
[here|https://github.com/apache/mesos/blob/1.8.1/src/slave/slave.cpp#L5841:L5846]
 for details): Since the container has terminated, what `docker inspect` 
returns will not contain a pid, so we will not update anything as well (see 
[here|https://github.com/apache/mesos/blob/1.8.1/src/slave/containerizer/docker.cpp#L1754:L1756]
 for details).

The reason that we have the issue mentioned in this ticket is, between #1 and 
#2 (i.e., the Docker container is stilling running) 
`DockerContainerizerProcess::usage` is called (e.g., agent's 
/monitor/statistics endpoint is hit), internally it calls `docker inspect` 
which will return a valid pid, and we will cache the pid (see 
[here|https://github.com/apache/mesos/blob/1.8.1/src/slave/containerizer/docker.cpp#L2025]
 for details). And after that but before #2, that pid may be reused by a 
process which is agent process's child, so we will overwrite agent's cgroup in 
#2.

It is Docker executor to launch task as a Docker container, so it is a bit 
weird for Docker containerizer to directly update the Docker container's 
resources in cgroups which should be the job of Docker executor. And I do not 
see why we need to support updating Docker container's resource at runtime. So 
what is the purpose of Docker containerizer's update method? To me, it should 
just do nothing.

 


was (Author: qianzhang):
{quote}A fix would be after obtaining the pid through docker inspect, we look 
into the proc filesystem and get the cgroup right away and cache it, as well as 
other properties we originally retrieve through pid if any.
{quote}
[~chhsia0] Yeah, we could get and catch the cgroup, but we also need to recover 
it after agent restarts which might be complicated.
{quote}Probably we should leave out all cgroups not containing "docker" 
substring instead of (or in addition to) filtering [the system root 
cgroup|https://github.com/apache/mesos/blob/0c431dd60ae39138cc7e8b099d41ad794c02c9a9/src/slave/containerizer/docker.cpp#L1783-L1788].
 It's ugly, hacky, and introduces a dependency on Docker's runtime.
{quote}
[~abudnik] I think a typical cgroup for Docker container is like:

 
{code:java}
/sys/fs/cgroup//docker/
{code}
Are you saying in `DockerContainerizerProcess::__update` we do not update 
anything if the cgroup does not contain "docker" substring (i.e., it is not a 
Docker container)? Yeah, it should work, but as you mentioned it will introduce 
a dependency on Docker which is unfortunate.

 

I am actually wondering why we need to update Docker container's CPU & memory 
in Docker containerizer. For the task launched via Docker containerizer, the 
method `DockerContainerizerProcess::update` will be called twice:
 # When Docker executor registers agent (see 
[here|https://github.com/apache/mesos/blob/1.8.1/src/slave/slave.cpp#L5210] for 
details): The resources passed into `DockerContainerizerProcess::update` is 
same as the con