[jira] [Commented] (MESOS-9697) Release RPMs are not uploaded to bintray

2019-04-10 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814766#comment-16814766
 ] 

Jie Yu commented on MESOS-9697:
---

I would actually suggest we port the bintray part to the Mesos/Packaging/CentOS 
job so that we also publish nightly RPMs.

We can add a new job based on the same 
support/jenkins/Jenkinsfile-packaging-centos, but triggered only on tags for 
releases. I am a big fan of automation.

> Release RPMs are not uploaded to bintray
> 
>
> Key: MESOS-9697
> URL: https://issues.apache.org/jira/browse/MESOS-9697
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.6.2, 1.7.2, 1.8.0
>Reporter: Benjamin Bannier
>Assignee: Benno Evers
>Priority: Critical
>  Labels: foundations, integration, jenkins, packaging, rpm
>
> While we currently build release RPMs, e.g., 
> [https://builds.apache.org/view/M-R/view/Mesos/job/Packaging/job/CentOS/job/1.7.x/],
> these artifacts are not uploaded to bintray. Because of that, the RPM links on 
> the downloads page [http://mesos.apache.org/downloads/] are broken.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9529) `/proc` should be remounted even if a nested container set `share_pid_namespace` to true

2019-04-03 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809399#comment-16809399
 ] 

Jie Yu commented on MESOS-9529:
---

commit f0b6a5568cc5573be8f0ac5f9dcdd48914b5 (HEAD -> master, origin/master, 
origin/HEAD, proc)
Author: Jie Yu 
Date:   Mon Apr 1 18:17:21 2019 -0700

Mounted /proc properly when a container shares pid namespace with its parent.

If a container shares the same pid namespace as its parent and is not a
top-level container, it might or might not share the same pid namespace
as the agent. In this case, we need to re-mount `/proc`.

One caveat here is that, in the case where this container does share the
pid namespace of the agent (because its parent shares the same pid
namespace as the agent), mounting `/proc` at the same place will result
in EBUSY.

As a result, we need to "move" (MS_MOVE) the mounts under `/proc` to a
new location and mount `/proc` again at the old location.

See MESOS-9529 for details.

Review: https://reviews.apache.org/r/70356
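
For illustration, a minimal standalone sketch of that MS_MOVE sequence 
(hypothetical staging path, error handling trimmed; not the actual isolator 
code):

{code}
// A minimal sketch (not Mesos code): move the old `/proc` aside with MS_MOVE,
// mount a fresh proc for the current pid namespace, then detach the old one.
#include <errno.h>
#include <stdio.h>
#include <sys/mount.h>
#include <sys/stat.h>

int remountProc()
{
  // Hypothetical staging location for the old `/proc` mounts.
  const char* staging = "/tmp/.old-proc";

  if (mkdir(staging, 0755) == -1 && errno != EEXIST) {
    perror("mkdir"); return -1;
  }

  // MS_MOVE avoids the EBUSY we would get from unmounting or over-mounting
  // `/proc` while submounts (e.g. binfmt_misc) are still attached.
  if (mount("/proc", staging, NULL, MS_MOVE, NULL) == -1) {
    perror("mount MS_MOVE"); return -1;
  }

  // Mount a new proc instance reflecting the current pid namespace.
  if (mount("proc", "/proc", "proc", 0, NULL) == -1) {
    perror("mount proc"); return -1;
  }

  // Lazily detach the old `/proc` tree (the "unmountAll" step).
  if (umount2(staging, MNT_DETACH) == -1) {
    perror("umount2"); return -1;
  }

  return 0;
}
{code}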

commit 76e583ab6ba71e7aef020fc662c0c36d6f3d9923
Author: Jie Yu 
Date:   Mon Apr 1 18:11:59 2019 -0700

Switched to using `/proc/1/ns/pid` to test pid namespaces.

Previously, we were using `/proc/self/ns/pid` to test pid namespaces. This
proved to be problematic because the kernel resolves it correctly
even if `/proc` has not been re-mounted in the new pid namespace.

Review: https://reviews.apache.org/r/70355
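
A small standalone sketch of why `/proc/1/ns/pid` is the right thing to inspect 
(illustration only, not the actual test code):

{code}
// Print the pid namespace seen through `/proc/1/ns/pid` vs. `/proc/self/ns/pid`.
#include <unistd.h>
#include <iostream>
#include <string>

static std::string nsLink(const char* path)
{
  char buf[128];
  ssize_t n = readlink(path, buf, sizeof(buf) - 1);
  if (n < 0) {
    return "<error>";
  }
  buf[n] = '\0';
  return std::string(buf);  // e.g. "pid:[4026531836]"
}

int main()
{
  // `/proc/self/ns/pid` is resolved by the kernel for the calling process and
  // therefore looks "correct" even when `/proc` is a stale mount from another
  // pid namespace; `/proc/1/ns/pid` reflects whichever namespace the current
  // `/proc` mount actually belongs to, which is what the test needs to check.
  std::cout << "/proc/1/ns/pid    -> " << nsLink("/proc/1/ns/pid") << std::endl;
  std::cout << "/proc/self/ns/pid -> " << nsLink("/proc/self/ns/pid") << std::endl;
  return 0;
}
{code}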

> `/proc` should be remounted even if a nested container set 
> `share_pid_namespace` to true
> 
>
> Key: MESOS-9529
> URL: https://issues.apache.org/jira/browse/MESOS-9529
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.4.2, 1.5.2, 1.6.2, 1.7.1
>Reporter: Jie Yu
>Assignee: Jie Yu
>Priority: Critical
>
> Currently, if a nested container wants to share the pid namespace of its 
> parent container, we allow the framework to set 
> `LinuxInfo.share_pid_namespace`.
> If the nested container does not have its own rootfs (i.e., it uses the host 
> rootfs), `/proc` is not re-mounted:
> https://github.com/apache/mesos/blob/1.7.x/src/slave/containerizer/mesos/isolators/namespaces/pid.cpp#L120-L126
> This is problematic because the nested container's mount namespace is forked 
> from the host's, so it inherits the `/proc` mounted there. As a result, the 
> `/proc/<pid>` entries still reflect the host pid namespace, even though the 
> pid namespace of the parent container might be different from that of the 
> host.
> Consequently, `ps aux` in the nested container will show all processes in the 
> host pid namespace, although the pid namespace of the nested container is 
> different from that of the host.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9529) `/proc` should be remounted even if a nested container set `share_pid_namespace` to true

2019-04-01 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16807338#comment-16807338
 ] 

Jie Yu commented on MESOS-9529:
---

https://reviews.apache.org/r/70355/
https://reviews.apache.org/r/70356/

> `/proc` should be remounted even if a nested container set 
> `share_pid_namespace` to true
> 
>
> Key: MESOS-9529
> URL: https://issues.apache.org/jira/browse/MESOS-9529
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.4.2, 1.5.2, 1.6.2, 1.7.1
>Reporter: Jie Yu
>Assignee: Jie Yu
>Priority: Critical
>
> Currently, if a nested container wants to share the pid namespace of its 
> parent container, we allow the framework to set 
> `LinuxInfo.share_pid_namespace`.
> If the nested container does not have its own rootfs (i.e., it uses the host 
> rootfs), `/proc` is not re-mounted:
> https://github.com/apache/mesos/blob/1.7.x/src/slave/containerizer/mesos/isolators/namespaces/pid.cpp#L120-L126
> This is problematic because the nested container's mount namespace is forked 
> from the host's, so it inherits the `/proc` mounted there. As a result, the 
> `/proc/<pid>` entries still reflect the host pid namespace, even though the 
> pid namespace of the parent container might be different from that of the 
> host.
> Consequently, `ps aux` in the nested container will show all processes in the 
> host pid namespace, although the pid namespace of the nested container is 
> different from that of the host.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9529) `/proc` should be remounted even if a nested container set `share_pid_namespace` to true

2019-04-01 Thread Jie Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu reassigned MESOS-9529:
-

Assignee: Jie Yu

> `/proc` should be remounted even if a nested container set 
> `share_pid_namespace` to true
> 
>
> Key: MESOS-9529
> URL: https://issues.apache.org/jira/browse/MESOS-9529
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.4.2, 1.5.2, 1.6.2, 1.7.1
>Reporter: Jie Yu
>Assignee: Jie Yu
>Priority: Critical
>
> Currently, if a nested container wants to share the pid namespace of its 
> parent container, we allow the framework to set 
> `LinuxInfo.share_pid_namespace`.
> If the nested container does not have its own rootfs (i.e., it uses the host 
> rootfs), `/proc` is not re-mounted:
> https://github.com/apache/mesos/blob/1.7.x/src/slave/containerizer/mesos/isolators/namespaces/pid.cpp#L120-L126
> This is problematic because the nested container's mount namespace is forked 
> from the host's, so it inherits the `/proc` mounted there. As a result, the 
> `/proc/<pid>` entries still reflect the host pid namespace, even though the 
> pid namespace of the parent container might be different from that of the 
> host.
> Consequently, `ps aux` in the nested container will show all processes in the 
> host pid namespace, although the pid namespace of the nested container is 
> different from that of the host.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-9529) `/proc` should be remounted even if a nested container set `share_pid_namespace` to true

2019-03-29 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16805615#comment-16805615
 ] 

Jie Yu edited comment on MESOS-9529 at 3/30/19 5:02 AM:


I tried to send a patch for this ticket, but realized that the problem is a 
little tricky to resolve. We cannot just blindly `mount -t proc proc /proc` 
irrespective of whether the container shares the same pid namespace as its 
parent, because if the parent container indeed shares the same pid namespace as 
the agent, this mount will result in EBUSY. We also cannot blindly umount /proc 
and mount it again, because there are typically bind mounts under 
`/proc/sys/fs/binfmt_misc`, and the unmount will typically result in EBUSY too.

The current idea is to use MS_MOVE to move the existing /proc to a tmp 
location, mount proc at /proc, and `unmountAll` the tmp location. We only do 
this sequence if share_pid_namespace is true.


was (Author: jieyu):
I tried to send a patch for this ticket, but realized that the problem is a 
little tricky to resolve. We cannot just blindly `mount -t proc proc /proc` 
irrespective of whether the container shares the same pid namespace as its 
parent, because if the parent container indeed shares the same pid namespace as 
the agent, this mount will result in EBUSY. We also cannot blindly umount /proc 
and mount it again, because there are typically bind mounts under 
`/proc/sys/fs/binfmt_misc`, and the unmount will typically result in EBUSY too.

The current idea is to use MS_MOVE to move the existing /proc to a tmp 
location, mount proc at /proc, and `unmountAll` the tmp location.

> `/proc` should be remounted even if a nested container set 
> `share_pid_namespace` to true
> 
>
> Key: MESOS-9529
> URL: https://issues.apache.org/jira/browse/MESOS-9529
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.4.2, 1.5.2, 1.6.2, 1.7.1
>Reporter: Jie Yu
>Priority: Critical
>
> Currently, if a nested container wants to share the pid namespace of its 
> parent container, we allow the framework to set 
> `LinuxInfo.share_pid_namespace`.
> If the nested container does not have its own rootfs (i.e., it uses the host 
> rootfs), `/proc` is not re-mounted:
> https://github.com/apache/mesos/blob/1.7.x/src/slave/containerizer/mesos/isolators/namespaces/pid.cpp#L120-L126
> This is problematic because the nested container's mount namespace is forked 
> from the host's, so it inherits the `/proc` mounted there. As a result, the 
> `/proc/<pid>` entries still reflect the host pid namespace, even though the 
> pid namespace of the parent container might be different from that of the 
> host.
> Consequently, `ps aux` in the nested container will show all processes in the 
> host pid namespace, although the pid namespace of the nested container is 
> different from that of the host.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-9529) `/proc` should be remounted even if a nested container set `share_pid_namespace` to true

2019-03-29 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16805615#comment-16805615
 ] 

Jie Yu edited comment on MESOS-9529 at 3/30/19 4:57 AM:


I tried to send a patch for this ticket, but realized that the problem is a 
little tricky to resolve. We cannot just blindly `mount -t proc proc /proc` 
irrespective of whether the container shares the same pid namespace as its 
parent, because if the parent container indeed shares the same pid namespace as 
the agent, this mount will result in EBUSY. We also cannot blindly umount /proc 
and mount it again, because there are typically bind mounts under 
`/proc/sys/fs/binfmt_misc`, and the unmount will typically result in EBUSY too.

The current idea is to use MS_MOVE to move the existing /proc to a tmp 
location, mount proc at /proc, and `unmountAll` the tmp location.


was (Author: jieyu):
I tried to send a patch for this ticket, but realized that the problem is a 
little tricky to resolve. We cannot just blindly `mount -t proc proc /proc` 
irrespective of whether the container shares the same pid namespace as its 
parent, because if the parent container indeed shares the same pid namespace as 
the agent, this mount will result in EBUSY. We also cannot blindly umount /proc 
and mount it again, because there are typically bind mounts under 
`/proc/sys/fs/binfmt_misc`, and the unmount will typically result in EBUSY too.

> `/proc` should be remounted even if a nested container set 
> `share_pid_namespace` to true
> 
>
> Key: MESOS-9529
> URL: https://issues.apache.org/jira/browse/MESOS-9529
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.4.2, 1.5.2, 1.6.2, 1.7.1
>Reporter: Jie Yu
>Priority: Critical
>
> Currently, if a nested container wants to share the pid namespace of its 
> parent container, we allow the framework to set 
> `LinuxInfo.share_pid_namespace`.
> If the nested container does not have its own rootfs (i.e., it uses the host 
> rootfs), `/proc` is not re-mounted:
> https://github.com/apache/mesos/blob/1.7.x/src/slave/containerizer/mesos/isolators/namespaces/pid.cpp#L120-L126
> This is problematic because the nested container's mount namespace is forked 
> from the host's, so it inherits the `/proc` mounted there. As a result, the 
> `/proc/<pid>` entries still reflect the host pid namespace, even though the 
> pid namespace of the parent container might be different from that of the 
> host.
> Consequently, `ps aux` in the nested container will show all processes in the 
> host pid namespace, although the pid namespace of the nested container is 
> different from that of the host.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9529) `/proc` should be remounted even if a nested container set `share_pid_namespace` to true

2019-03-29 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16805615#comment-16805615
 ] 

Jie Yu commented on MESOS-9529:
---

I tried to send a patch for this ticket, but realized that the problem is a 
little tricky to resolve. We cannot just blindly `mount -t proc proc /proc` 
irrespective of whether the container shares the same pid namespace as its 
parent, because if the parent container indeed shares the same pid namespace as 
the agent, this mount will result in EBUSY. We also cannot blindly umount /proc 
and mount it again, because there are typically bind mounts under 
`/proc/sys/fs/binfmt_misc`, and the unmount will typically result in EBUSY too.

> `/proc` should be remounted even if a nested container set 
> `share_pid_namespace` to true
> 
>
> Key: MESOS-9529
> URL: https://issues.apache.org/jira/browse/MESOS-9529
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.4.2, 1.5.2, 1.6.2, 1.7.1
>Reporter: Jie Yu
>Priority: Critical
>
> Currently, if a nested container wants to share the pid namespace of its 
> parent container, we allow the framework to set 
> `LinuxInfo.share_pid_namespace`.
> If the nested container does not have its own rootfs (i.e., it uses the host 
> rootfs), `/proc` is not re-mounted:
> https://github.com/apache/mesos/blob/1.7.x/src/slave/containerizer/mesos/isolators/namespaces/pid.cpp#L120-L126
> This is problematic because the nested container's mount namespace is forked 
> from the host's, so it inherits the `/proc` mounted there. As a result, the 
> `/proc/<pid>` entries still reflect the host pid namespace, even though the 
> pid namespace of the parent container might be different from that of the 
> host.
> Consequently, `ps aux` in the nested container will show all processes in the 
> host pid namespace, although the pid namespace of the nested container is 
> different from that of the host.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9621) Mesos failed to build due to error LNK2019 on Windows using MSVC.

2019-03-01 Thread Jie Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu reassigned MESOS-9621:
-

Assignee: Qian Zhang

> Mesos failed to build due to error LNK2019 on Windows using MSVC.
> -
>
> Key: MESOS-9621
> URL: https://issues.apache.org/jira/browse/MESOS-9621
> Project: Mesos
>  Issue Type: Bug
>  Components: build
> Environment: VS 2017 + Windows Server 2016
>Reporter: LinGao
>Assignee: Qian Zhang
>Priority: Major
>  Labels: windows
> Attachments: log_x64_build.log
>
>
> Issue description:
> Mesos failed to build due to error:
> {noformat}
> LNK2019: unresolved external symbol "public: __cdecl 
> mesos::internal::slave::VolumeGidManager::~VolumeGidManager(void)" 
> (??1VolumeGidManager@slave@internal@mesos@@QEAA@XZ) referenced in function 
> "public: void * __cdecl mesos::internal::slave::VolumeGidManager::`scalar 
> deleting destructor'(unsigned int)"
> {noformat}
> on Windows using MSVC. It can be first reproduced on mesos master branch 
> [c03e51f|https://github.com/apache/mesos/commit/c03e51f1fe9cc7137635a7fe586fd890f7c7bdae].
> Could you please take a look?
> Reproduce steps:
> {noformat}
>  # git clone -c core.autocrlf=true https://github.com/apache/mesos 
> D:\mesos\src
>  # Open a VS 2017 x64 command prompt as admin and browse to D:\mesos
>  # cd src
>  # .\bootstrap.bat
>  # cd ..
>  # mkdir build_x64 && pushd build_x64
>  # cmake ..\src -G "Visual Studio 15 2017 Win64" 
> -DCMAKE_SYSTEM_VERSION=10.0.17134.0 -DENABLE_LIBEVENT=1 
> -DHAS_AUTHENTICATION=0 -DPATCHEXE_PATH="C:\gnuwin32\bin" -T host=x64
>  # msbuild Mesos.sln /p:Configuration=Debug /p:Platform=x64 /maxcpucount:4 
> /t:Rebuild
> {noformat}
> ErrorMessage:
> {noformat}
> main.obj : error LNK2019: unresolved external symbol "public: __cdecl 
> mesos::internal::slave::VolumeGidManager::~VolumeGidManager(void)" 
> (??1VolumeGidManager@slave@internal@mesos@@QEAA@XZ) referenced in function 
> "public: void * __cdecl mesos::internal::slave::VolumeGidManager::`scalar 
> deleting destructor'(unsigned int)" 
> (??_GVolumeGidManager@slave@internal@mesos@@QEAAPEAXI@Z) 
> [D:\Mesos\build_x64\src\slave\mesos-agent.vcxproj]
>    107>D:\Mesos\build_x64\src\mesos-agent.exe : fatal error LNK1120: 1 
> unresolved externals [D:\Mesos\build_x64\src\slave\mesos-agent.vcxproj]
>    107>Done Building Project 
> "D:\Mesos\build_x64\src\slave\mesos-agent.vcxproj" (Rebuild target(s)) -- 
> FAILED.
>     27>Done Building Project 
> "D:\Mesos\build_x64\src\slave\mesos-agent.vcxproj.metaproj" (Rebuild 
> target(s)) -- FAILED.
>  2>Done Building Project "D:\Mesos\build_x64\ALL_BUILD.vcxproj.metaproj" 
> (Rebuild target(s)) -- FAILED.
>  1>Done Building Project "D:\Mesos\build_x64\Mesos.sln" (Rebuild 
> target(s)) -- FAILED.
> Build FAILED.
> {noformat}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9590) Mesos CI sometimes, incorrectly, overwrites already-pushed mesos master nightly images with new images built from non-master branches.

2019-02-20 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773517#comment-16773517
 ] 

Jie Yu commented on MESOS-9590:
---

commit 8143d006f1032bb1c43364bd9f6741ee3dfbfc0b (HEAD -> master, origin/master, 
origin/HEAD)
Author: Jie Yu 
Date:   Wed Feb 20 11:16:57 2019 -0800

Blacklisted the "ubuntu-4" Jenkins box.

The git version installed on the box is too low.

Review: https://reviews.apache.org/r/70025

commit e9acc79ed535dd95b71227412a0e19868cf453d9
Author: Jie Yu 
Date:   Wed Feb 20 11:14:26 2019 -0800

Failed the scripts if `--points-at` is not supported.

On some Jenkins boxes, the git installed on the box does not support
`--points-at`. Instead of silently assuming the 'master' branch in the
scripts (which could be wrong), we fail hard here.

Review: https://reviews.apache.org/r/70024



> Mesos CI sometimes, incorrectly, overwrites already-pushed mesos master 
> nightly images with new images built from non-master branches.
> --
>
> Key: MESOS-9590
> URL: https://issues.apache.org/jira/browse/MESOS-9590
> Project: Mesos
>  Issue Type: Bug
>Reporter: James DeFelice
>Assignee: Jie Yu
>Priority: Major
>  Labels: mesosphere
>
> I pulled image mesos/mesos-centos:master-2019-02-15 some time on the 15th and 
> worked with it locally, on my laptop, for about a week. Part of that work 
> included downloading the related mesos-xxx-devel.rpm from the same CI build 
> that produced the image so that I could build 3rd party mesos modules from 
> the master base image. The rpm was labeled as pre-1.8.0.
> This worked great until I tried to repeat the work on another machine. The 
> other machine pulled the "same" dockerhub image 
> (mesos/mesos-centos:master-2019-02-15) which was somehow built with a 
> mesos-xxx.rpm labeled as pre-1.7.2. I couldn't build my docker image using 
> this strangely new base because the mesos-xxx-devel.rpm I had hardcoded into 
> the dockerfile no longer aligned with the version of the mesos RPM that was 
> shipping in the base image.
> The base image had changed, such that the mesos RPM version went from 1.8.0 
> to 1.7.2. This should never happen.
> [~jieyu] investigated and found that the problem appears to happen at random. 
> Current thinking is that one of the mesos CI boxes uses a version of git 
> that's too old, and that the CI scripts are incorrectly ignoring a git 
> command failure: the git command fails because the git version is too old, 
> and the script subsequently ignores any failures from the command pipeline in 
> which this command is executed. The result is that the "version" of the branch 
> being built cannot be detected and therefore defaults to master, overwriting 
> *actual* master image builds.
> [~jieyu] also wrote some patches, which I'll link here:
> * https://reviews.apache.org/r/70024/
> * https://reviews.apache.org/r/70025/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9549) nvidia/cuda 10 does not work on GPU isolator

2019-02-01 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758811#comment-16758811
 ] 

Jie Yu commented on MESOS-9549:
---

Spent some time on this today. We need to do the following to make cuda:10 work:

1. Inject "/usr/local/nvidia/bin" into PATH
2. Inject "/usr/local/nvidia/lib64:/usr/local/nvidia/lib" into LD_LIBRARY_PATH
3. Add one more condition to inject the volume
{code}
+  if (manifest.config().labels().count("maintainer") &&
+  strings::contains(
+  manifest.config().labels().at("maintainer"),
+  "NVIDIA CORPORATION")) {
+return true;
+  }
{code}

1 and 2 are needed because the cuda:10 image removed those env vars (in favor of 
the nvidia docker runtime).
3 is needed because the cuda:10 image removed the original label 
"com.nvidia.volumes.needed".
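
For 1 and 2, a rough sketch of the kind of environment injection involved 
(hypothetical helper and container env map, not the actual isolator code):

{code}
// Hypothetical helper (not the isolator implementation): prepend the NVIDIA
// volume paths to the container's PATH and LD_LIBRARY_PATH, since the cuda:10
// image no longer sets them itself.
#include <map>
#include <string>

void injectNvidiaPaths(std::map<std::string, std::string>& env)
{
  const std::string bin = "/usr/local/nvidia/bin";
  const std::string lib = "/usr/local/nvidia/lib64:/usr/local/nvidia/lib";

  const std::string oldPath =
    env.count("PATH") ? env.at("PATH") : "";
  const std::string oldLib =
    env.count("LD_LIBRARY_PATH") ? env.at("LD_LIBRARY_PATH") : "";

  env["PATH"] = oldPath.empty() ? bin : bin + ":" + oldPath;
  env["LD_LIBRARY_PATH"] = oldLib.empty() ? lib : lib + ":" + oldLib;
}
{code}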

> nvidia/cuda 10 does not work on GPU isolator
> 
>
> Key: MESOS-9549
> URL: https://issues.apache.org/jira/browse/MESOS-9549
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jie Yu
>Priority: Major
>
> I verified that nvidia/cuda 9 (i.e., 9.2-devel-ubuntu18.04) works with GPU 
> isolator.
> The unit test 
> NvidiaGpuTest.ROOT_INTERNET_CURL_CGROUPS_NVIDIA_GPU_NvidiaDockerImage 
> captures this, and is currently failing on GPU hosts since it uses the latest 
> nvidia/cuda image.
> It fails with
> {noformat}
> sh: 1: nvidia-smi: not found
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9549) nvidia/cuda 10 does not work on GPU isolator

2019-02-01 Thread Jie Yu (JIRA)
Jie Yu created MESOS-9549:
-

 Summary: nvidia/cuda 10 does not work on GPU isolator
 Key: MESOS-9549
 URL: https://issues.apache.org/jira/browse/MESOS-9549
 Project: Mesos
  Issue Type: Bug
Reporter: Jie Yu


I verified that nvidia/cuda 9 (i.e., 9.2-devel-ubuntu18.04) works with GPU 
isolator.

The unit test 
NvidiaGpuTest.ROOT_INTERNET_CURL_CGROUPS_NVIDIA_GPU_NvidiaDockerImage captures 
this, and is currently failing on GPU hosts since it uses the latest nvidia/cuda 
image.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9529) `/proc` should be remounted even if a nested container set `share_pid_namespace` to true

2019-01-18 Thread Jie Yu (JIRA)
Jie Yu created MESOS-9529:
-

 Summary: `/proc` should be remounted even if a nested container 
set `share_pid_namespace` to true
 Key: MESOS-9529
 URL: https://issues.apache.org/jira/browse/MESOS-9529
 Project: Mesos
  Issue Type: Bug
  Components: containerization
Affects Versions: 1.4.2, 1.5.2, 1.6.2, 1.7.1
Reporter: Jie Yu


Currently, if a nested container wants to share the pid namespace of its parent 
container, we allow the framework to set `LinuxInfo.share_pid_namespace`.

If the nested container does not have its own rootfs (i.e., it uses the host 
rootfs), `/proc` is not re-mounted:
https://github.com/apache/mesos/blob/1.7.x/src/slave/containerizer/mesos/isolators/namespaces/pid.cpp#L120-L126

This is problematic because the nested container's mount namespace is forked from 
the host's, so it inherits the `/proc` mounted there. As a result, the 
`/proc/<pid>` entries still reflect the host pid namespace, even though the pid 
namespace of the parent container might be different from that of the host.

Consequently, `ps aux` in the nested container will show all processes in the 
host pid namespace, although the pid namespace of the nested container is 
different from that of the host.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9518) CNI_NETNS should not be set for orphan containers that do not have network namespace

2019-01-11 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16740881#comment-16740881
 ] 

Jie Yu commented on MESOS-9518:
---

Also need this:
https://reviews.apache.org/r/69727

> CNI_NETNS should not be set for orphan containers that do not have network 
> namespace
> 
>
> Key: MESOS-9518
> URL: https://issues.apache.org/jira/browse/MESOS-9518
> Project: Mesos
>  Issue Type: Bug
>  Components: cni
>Affects Versions: 1.4.2, 1.5.1, 1.6.1, 1.7.0
>Reporter: Jie Yu
>Assignee: Jie Yu
>Priority: Major
>
> We introduced a new agent flag in MESOS-9492 so that CNI configs can be 
> persisted across reboots. This allows some CNI plugins to clean up the IPs 
> allocated to containers after a sudden reboot of the host (not all CNI 
> plugins need this).
> It's important to unset the `CNI_NETNS` environment variable after reboot when 
> invoking the CNI plugin "DEL" command so that it conforms to the spec:
> {noformat}
> When CNI_NETNS and/or prevResult are not provided, the plugin should clean up 
> as many resources as possible (e.g. releasing IPAM allocations) and return a 
> successful response.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-9518) CNI_NETNS should not be set for orphan containers that do not have network namespace

2019-01-11 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16740881#comment-16740881
 ] 

Jie Yu edited comment on MESOS-9518 at 1/11/19 11:40 PM:
-

Also need this for newer kernels:
https://reviews.apache.org/r/69727


was (Author: jieyu):
Also need this:
https://reviews.apache.org/r/69727

> CNI_NETNS should not be set for orphan containers that do not have network 
> namespace
> 
>
> Key: MESOS-9518
> URL: https://issues.apache.org/jira/browse/MESOS-9518
> Project: Mesos
>  Issue Type: Bug
>  Components: cni
>Affects Versions: 1.4.2, 1.5.1, 1.6.1, 1.7.0
>Reporter: Jie Yu
>Assignee: Jie Yu
>Priority: Major
>
> We introduced a new agent flag in MESOS-9492 so that CNI configs can be 
> persisted across reboots. This allows some CNI plugins to clean up the IPs 
> allocated to containers after a sudden reboot of the host (not all CNI 
> plugins need this).
> It's important to unset the `CNI_NETNS` environment variable after reboot when 
> invoking the CNI plugin "DEL" command so that it conforms to the spec:
> {noformat}
> When CNI_NETNS and/or prevResult are not provided, the plugin should clean up 
> as many resources as possible (e.g. releasing IPAM allocations) and return a 
> successful response.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-9518) CNI_NETNS should not be set for orphan containers that do not have network namespace

2019-01-10 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16739761#comment-16739761
 ] 

Jie Yu edited comment on MESOS-9518 at 1/11/19 6:19 AM:


https://reviews.apache.org/r/69706/
https://reviews.apache.org/r/69710/
https://reviews.apache.org/r/69711/
https://reviews.apache.org/r/69712/
https://reviews.apache.org/r/69713/
https://reviews.apache.org/r/69714/
https://reviews.apache.org/r/69715/


was (Author: jieyu):
Fix: https://reviews.apache.org/r/69706/

Adding tests now.

> CNI_NETNS should not be set for orphan containers that do not have network 
> namespace
> 
>
> Key: MESOS-9518
> URL: https://issues.apache.org/jira/browse/MESOS-9518
> Project: Mesos
>  Issue Type: Bug
>  Components: cni
>Affects Versions: 1.4.2, 1.5.1, 1.6.1, 1.7.0
>Reporter: Jie Yu
>Assignee: Jie Yu
>Priority: Major
>
> We introduced a new agent flag in MESOS-9492 so that CNI configs can be 
> persisted across reboots. This allows some CNI plugins to clean up the IPs 
> allocated to containers after a sudden reboot of the host (not all CNI 
> plugins need this).
> It's important to unset the `CNI_NETNS` environment variable after reboot when 
> invoking the CNI plugin "DEL" command so that it conforms to the spec:
> {noformat}
> When CNI_NETNS and/or prevResult are not provided, the plugin should clean up 
> as many resources as possible (e.g. releasing IPAM allocations) and return a 
> successful response.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9518) CNI_NETNS should not be set for orphan containers that do not have network namespace

2019-01-10 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16739761#comment-16739761
 ] 

Jie Yu commented on MESOS-9518:
---

Fix: https://reviews.apache.org/r/69706/

Adding tests now.

> CNI_NETNS should not be set for orphan containers that do not have network 
> namespace
> 
>
> Key: MESOS-9518
> URL: https://issues.apache.org/jira/browse/MESOS-9518
> Project: Mesos
>  Issue Type: Bug
>  Components: cni
>Affects Versions: 1.4.2, 1.5.1, 1.6.1, 1.7.0
>Reporter: Jie Yu
>Assignee: Jie Yu
>Priority: Major
>
> We introduced a new agent flag in MESOS-9492 so that CNI configs can be 
> persisted across reboots. This allows some CNI plugins to clean up the IPs 
> allocated to containers after a sudden reboot of the host (not all CNI 
> plugins need this).
> It's important to unset the `CNI_NETNS` environment variable after reboot when 
> invoking the CNI plugin "DEL" command so that it conforms to the spec:
> {noformat}
> When CNI_NETNS and/or prevResult are not provided, the plugin should clean up 
> as many resources as possible (e.g. releasing IPAM allocations) and return a 
> successful response.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9518) CNI_NETNS should not be set for orphan containers that do not have network namespace

2019-01-10 Thread Jie Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu reassigned MESOS-9518:
-

Assignee: Jie Yu

> CNI_NETNS should not be set for orphan containers that do not have network 
> namespace
> 
>
> Key: MESOS-9518
> URL: https://issues.apache.org/jira/browse/MESOS-9518
> Project: Mesos
>  Issue Type: Bug
>  Components: cni
>Affects Versions: 1.4.2, 1.5.1, 1.6.1, 1.7.0
>Reporter: Jie Yu
>Assignee: Jie Yu
>Priority: Major
>
> We introduced a new agent flag in MESOS-9492 so that CNI configs can be 
> persisted across reboots. This allows some CNI plugins to clean up the IPs 
> allocated to containers after a sudden reboot of the host (not all CNI 
> plugins need this).
> It's important to unset the `CNI_NETNS` environment variable after reboot when 
> invoking the CNI plugin "DEL" command so that it conforms to the spec:
> {noformat}
> When CNI_NETNS and/or prevResult are not provided, the plugin should clean up 
> as many resources as possible (e.g. releasing IPAM allocations) and return a 
> successful response.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9518) CNI_NETNS should not be set for orphan containers that do not have network namespace

2019-01-09 Thread Jie Yu (JIRA)
Jie Yu created MESOS-9518:
-

 Summary: CNI_NETNS should not be set for orphan containers that do 
not have network namespace
 Key: MESOS-9518
 URL: https://issues.apache.org/jira/browse/MESOS-9518
 Project: Mesos
  Issue Type: Bug
  Components: cni
Affects Versions: 1.7.0, 1.6.1, 1.5.1, 1.4.2
Reporter: Jie Yu


We introduced a new agent flag in MESOS-9492 so that CNI configs can be 
persisted across reboots. This allows some CNI plugins to clean up the IPs 
allocated to containers after a sudden reboot of the host (not all CNI 
plugins need this).

It's important to unset the `CNI_NETNS` environment variable after reboot when 
invoking the CNI plugin "DEL" command so that it conforms to the spec:

{noformat}
When CNI_NETNS and/or prevResult are not provided, the plugin should clean up 
as many resources as possible (e.g. releasing IPAM allocations) and return a 
successful response.
{noformat}
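
For illustration, a minimal sketch of the idea (hypothetical helper names, 
std::optional used for the possibly-missing netns handle; not the actual 
network isolator code):

{code}
// Hypothetical sketch: only include CNI_NETNS in the plugin environment when
// the container still has a network namespace handle (after a host reboot it
// no longer does).
#include <map>
#include <optional>
#include <string>

std::map<std::string, std::string> cniDelEnvironment(
    const std::string& containerId,
    const std::string& ifName,
    const std::optional<std::string>& netns)  // absent after a sudden reboot
{
  std::map<std::string, std::string> env = {
    {"CNI_COMMAND", "DEL"},
    {"CNI_CONTAINERID", containerId},
    {"CNI_IFNAME", ifName},
  };

  // Per the CNI spec quoted above, when CNI_NETNS is not provided the plugin
  // should still clean up as many resources as possible (e.g. IPAM
  // allocations) and return a successful response.
  if (netns.has_value()) {
    env["CNI_NETNS"] = *netns;
  }

  return env;
}
{code}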



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9505) 'make check' on macOS Mojave failed with linking errors.

2019-01-01 Thread Jie Yu (JIRA)
Jie Yu created MESOS-9505:
-

 Summary: 'make check' on macOS Mojave failed with linking errors.
 Key: MESOS-9505
 URL: https://issues.apache.org/jira/browse/MESOS-9505
 Project: Mesos
  Issue Type: Bug
Reporter: Jie Yu


macOS Mojave
autotools

{noformat}
/Users/jie/workspace/mesos/configure --prefix=/Users/jie/workspace/dist/mesos 
--disable-python --disable-java --enable-ssl --enable-libevent

$ g++ --version
Configured with: --prefix=/Library/Developer/CommandLineTools/usr 
--with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 10.0.0 (clang-1000.10.44.4)
Target: x86_64-apple-darwin18.2.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
{noformat}





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9266) Whenever our packaging tasks trigger errors we run into permission problems.

2018-12-10 Thread Jie Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu reassigned MESOS-9266:
-

Assignee: Jie Yu

> Whenever our packaging tasks trigger errors we run into permission problems.
> 
>
> Key: MESOS-9266
> URL: https://issues.apache.org/jira/browse/MESOS-9266
> Project: Mesos
>  Issue Type: Task
>  Components: build
>Reporter: Till Toenshoff
>Assignee: Jie Yu
>Priority: Minor
>  Labels: mesosphere, packaging, rainbow-worriers-task
>
> As shown in MESOS-9238, failures within our packaging cause permission 
> failures on cleanup.
> {noformat}
> cleanup
> rm: cannot remove 
> '/home/jenkins/jenkins-slave/workspace/Mesos-Docker-CentOS/centos7/.cache': 
> Permission denied
> rm: cannot remove 
> '/home/jenkins/jenkins-slave/workspace/Mesos-Docker-CentOS/centos7/rpmbuild/SRPMS':
>  Permission denied
> rm: cannot remove 
> '/home/jenkins/jenkins-slave/workspace/Mesos-Docker-CentOS/centos7/rpmbuild/BUILDROOT/mesos-1.8.0-0.1.pre.20180915git4805a47.el7.x86_64/var/lib/mesos':
>  Permission denied
> rm: cannot remove 
> '/home/jenkins/jenkins-slave/workspace/Mesos-Docker-CentOS/centos7/rpmbuild/BUILDROOT/mesos-1.8.0-0.1.pre.20180915git4805a47.el7.x86_64/var/log/mesos':
>  Permission denied
> {noformat}
> We should clean that up.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9462) Devices in a container are inaccessible due to `nodev` on `/var/run`.

2018-12-09 Thread Jie Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu reassigned MESOS-9462:
-

Assignee: Andrei Budnik  (was: Jie Yu)

> Devices in a container are inaccessible due to `nodev` on `/var/run`.
> -
>
> Key: MESOS-9462
> URL: https://issues.apache.org/jira/browse/MESOS-9462
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.8.0
>Reporter: Jie Yu
>Assignee: Andrei Budnik
>Priority: Blocker
>  Labels: regression
>
> A recent [patch|https://reviews.apache.org/r/69086/] (commit 
> ede8155d1d043137e15007c48da36ac5fa0b5124) changes the behavior of how 
> standard device nodes (e.g., /dev/null, etc.) are set up. It now uses bind 
> mounts (from the host) instead of mknod.
> The device nodes are created under 
> `/var/run/mesos/containers/<container_id>/devices`, and then bind mounted into 
> the container root filesystem. This is problematic for those Linux distros 
> that mount `/var/run` (or `/run`) as `nodev`. For instance, CentOS 7.4:
> {noformat}
> [jie@core-dev ~]$ cat /proc/self/mountinfo | grep "/run\ "
> 24 62 0:19 / /run rw,nosuid,nodev shared:23 - tmpfs tmpfs rw,seclabel,mode=755
> [jie@core-dev ~]$ cat /etc/redhat-release 
> CentOS Linux release 7.4.1708 (Core) 
> {noformat}
> As a result, the `/dev/null` devices in the container will inherit the 
> `nodev` from `/run` on the host
> {noformat}
> 629 625 0:121 
> /mesos/containers/49f1da14-d741-4030-994c-0d8ed5093b13/devices/null /dev/null 
> rw,nosuid,nodev - tmpfs tmpfs rw,mode=755
> {noformat}
> This will cause "Permission Denied" error when a process in the container 
> tries to open the device node.
> You can try to reproduce this issue using Mesos Mini
> {noformat}
> docker run --rm --privileged -p 5050:5050 -p 5051:5051 -p 8080:8080 
> mesos/mesos-mini:master-2018-12-06
> {noformat}
> And then, go to the Marathon UI (http://localhost:8080) and launch an app 
> using the following config
> {code}
> {
>   "id": "/test",
>   "cmd": "dd if=/dev/zero of=file bs=1024 count=1 oflag=dsync",
>   "cpus": 1,
>   "mem": 128,
>   "disk": 128,
>   "instances": 1,
>   "container": {
> "type": "MESOS",
> "docker": {
>   "image": "ubuntu:18.04"
> }
>   }
> }
> {code}
> You'll see the task fail with "Permission Denied".
> The task will run normally if you use `mesos/mesos-mini:master-2018-12-01`



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9462) Devices in a container are inaccessible due to `nodev` on `/var/run`.

2018-12-09 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16714274#comment-16714274
 ] 

Jie Yu commented on MESOS-9462:
---

commit f050bf01af8f9f92bbada2c0a2025a459290ed98 (HEAD -> master, origin/master, 
origin/HEAD, remount_runtime_dir)
Author: Jie Yu 
Date:   Fri Dec 7 16:51:19 2018 -0800

Made sure the container's runtime dir has device file access.

Make sure that the container's runtime dir has device file access. Some
Linux distributions will mount `/run` with `nodev`, restricting
access to device files under `/run`. However, Mesos prepares device
files for containers under the container's runtime dir (which is typically
under `/run`) and bind mounts them into container root filesystems. Therefore,
we need to make sure those device files can be accessed by the
container. We need to do a self bind mount and remount with proper
options if necessary. See MESOS-9462 for more details.

Review: https://reviews.apache.org/r/69532

commit 35cdfa2c0bcb75f5801ec60671a3f978b4aa645f
Author: Jie Yu 
Date:   Sun Dec 9 19:14:09 2018 -0800

Used strings::format in os::shell.

Previously, `strings::internal::format` was used. It caused issues when
a std::string was passed in as a parameter. Switched to using
`strings::format` instead in the `os::shell` implementation.

Review: https://reviews.apache.org/r/69537

> Devices in a container are inaccessible due to `nodev` on `/var/run`.
> -
>
> Key: MESOS-9462
> URL: https://issues.apache.org/jira/browse/MESOS-9462
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.8.0
>Reporter: Jie Yu
>Assignee: Jie Yu
>Priority: Blocker
>  Labels: regression
>
> A recent [patch|https://reviews.apache.org/r/69086/] (commit 
> ede8155d1d043137e15007c48da36ac5fa0b5124) changes the behavior of how 
> standard device nodes (e.g., /dev/null, etc.) are set up. It now uses bind 
> mounts (from the host) instead of mknod.
> The device nodes are created under 
> `/var/run/mesos/containers/<container_id>/devices`, and then bind mounted into 
> the container root filesystem. This is problematic for those Linux distros 
> that mount `/var/run` (or `/run`) as `nodev`. For instance, CentOS 7.4:
> {noformat}
> [jie@core-dev ~]$ cat /proc/self/mountinfo | grep "/run\ "
> 24 62 0:19 / /run rw,nosuid,nodev shared:23 - tmpfs tmpfs rw,seclabel,mode=755
> [jie@core-dev ~]$ cat /etc/redhat-release 
> CentOS Linux release 7.4.1708 (Core) 
> {noformat}
> As a result, the `/dev/null` devices in the container will inherit the 
> `nodev` from `/run` on the host
> {noformat}
> 629 625 0:121 
> /mesos/containers/49f1da14-d741-4030-994c-0d8ed5093b13/devices/null /dev/null 
> rw,nosuid,nodev - tmpfs tmpfs rw,mode=755
> {noformat}
> This will cause "Permission Denied" error when a process in the container 
> tries to open the device node.
> You can try to reproduce this issue using Mesos Mini
> {noformat}
> docker run --rm --privileged -p 5050:5050 -p 5051:5051 -p 8080:8080 
> mesos/mesos-mini:master-2018-12-06
> {noformat}
> And then, go to the Marathon UI (http://localhost:8080) and launch an app 
> using the following config
> {code}
> {
>   "id": "/test",
>   "cmd": "dd if=/dev/zero of=file bs=1024 count=1 oflag=dsync",
>   "cpus": 1,
>   "mem": 128,
>   "disk": 128,
>   "instances": 1,
>   "container": {
> "type": "MESOS",
> "docker": {
>   "image": "ubuntu:18.04"
> }
>   }
> }
> {code}
> You'll see the task fail with "Permission Denied".
> The task will run normally if you use `mesos/mesos-mini:master-2018-12-01`



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9462) Devices in a container are inaccessible due to `nodev` on `/var/run`.

2018-12-07 Thread Jie Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu reassigned MESOS-9462:
-

Assignee: Jie Yu

> Devices in a container are inaccessible due to `nodev` on `/var/run`.
> -
>
> Key: MESOS-9462
> URL: https://issues.apache.org/jira/browse/MESOS-9462
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.8.0
>Reporter: Jie Yu
>Assignee: Jie Yu
>Priority: Blocker
>  Labels: regression
>
> A recent [patch|https://reviews.apache.org/r/69086/] (commit 
> ede8155d1d043137e15007c48da36ac5fa0b5124) changes the behavior of how 
> standard device nodes (e.g., /dev/null, etc.) are set up. It now uses bind 
> mounts (from the host) instead of mknod.
> The device nodes are created under 
> `/var/run/mesos/containers/<container_id>/devices`, and then bind mounted into 
> the container root filesystem. This is problematic for those Linux distros 
> that mount `/var/run` (or `/run`) as `nodev`. For instance, CentOS 7.4:
> {noformat}
> [jie@core-dev ~]$ cat /proc/self/mountinfo | grep "/run\ "
> 24 62 0:19 / /run rw,nosuid,nodev shared:23 - tmpfs tmpfs rw,seclabel,mode=755
> [jie@core-dev ~]$ cat /etc/redhat-release 
> CentOS Linux release 7.4.1708 (Core) 
> {noformat}
> As a result, the `/dev/null` devices in the container will inherit the 
> `nodev` from `/run` on the host
> {noformat}
> 629 625 0:121 
> /mesos/containers/49f1da14-d741-4030-994c-0d8ed5093b13/devices/null /dev/null 
> rw,nosuid,nodev - tmpfs tmpfs rw,mode=755
> {noformat}
> This will cause "Permission Denied" error when a process in the container 
> tries to open the device node.
> You can try to reproduce this issue using Mesos Mini
> {noformat}
> docker run --rm --privileged -p 5050:5050 -p 5051:5051 -p 8080:8080 
> mesos/mesos-mini:master-2018-12-06
> {noformat}
> And then, go to the Marathon UI (http://localhost:8080) and launch an app 
> using the following config
> {code}
> {
>   "id": "/test",
>   "cmd": "dd if=/dev/zero of=file bs=1024 count=1 oflag=dsync",
>   "cpus": 1,
>   "mem": 128,
>   "disk": 128,
>   "instances": 1,
>   "container": {
> "type": "MESOS",
> "docker": {
>   "image": "ubuntu:18.04"
> }
>   }
> }
> {code}
> You'll see the task fail with "Permission Denied".
> The task will run normally if you use `mesos/mesos-mini:master-2018-12-01`



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9462) Devices in a container are inaccessible due to `nodev` on `/var/run`.

2018-12-07 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713286#comment-16713286
 ] 

Jie Yu commented on MESOS-9462:
---

The fix is to make sure the runtime dir is mounted w/o `nodev`. If the distro 
mounts `/var/run` using `nodev`, during agent startup we can perform a one-time 
self bind mount, and a remount with `-o remount,dev`. This can ensure that 
device nodes created in the runtime dir can be accessed.
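
A minimal sketch of that sequence using mount(2) directly (hypothetical 
directory, not the agent code):

{code}
// Sketch only: self bind mount the runtime dir, then remount it without
// `nodev` so device nodes created underneath become usable.
#include <stdio.h>
#include <sys/mount.h>

int main()
{
  // Hypothetical runtime dir; the agent would use its configured runtime dir.
  const char* dir = "/var/run/mesos";

  // Turn the directory into its own mount point...
  if (mount(dir, dir, NULL, MS_BIND, NULL) == -1) {
    perror("self bind mount");
    return 1;
  }

  // ...then remount it; leaving MS_NODEV out of the flags is the equivalent
  // of `mount -o remount,dev`.
  if (mount(NULL, dir, NULL, MS_BIND | MS_REMOUNT, NULL) == -1) {
    perror("remount");
    return 1;
  }

  return 0;
}
{code}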

> Devices in a container are inaccessible due to `nodev` on `/var/run`.
> -
>
> Key: MESOS-9462
> URL: https://issues.apache.org/jira/browse/MESOS-9462
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.8.0
>Reporter: Jie Yu
>Priority: Blocker
>  Labels: regression
>
> A recent [patch|https://reviews.apache.org/r/69086/] (commit 
> ede8155d1d043137e15007c48da36ac5fa0b5124) changes the behavior of how 
> standard device nodes (e.g., /dev/null, etc.) are set up. It now uses bind 
> mounts (from the host) instead of mknod.
> The device nodes are created under 
> `/var/run/mesos/containers/<container_id>/devices`, and then bind mounted into 
> the container root filesystem. This is problematic for those Linux distros 
> that mount `/var/run` (or `/run`) as `nodev`. For instance, CentOS 7.4:
> {noformat}
> [jie@core-dev ~]$ cat /proc/self/mountinfo | grep "/run\ "
> 24 62 0:19 / /run rw,nosuid,nodev shared:23 - tmpfs tmpfs rw,seclabel,mode=755
> [jie@core-dev ~]$ cat /etc/redhat-release 
> CentOS Linux release 7.4.1708 (Core) 
> {noformat}
> As a result, the `/dev/null` devices in the container will inherit the 
> `nodev` from `/run` on the host
> {noformat}
> 629 625 0:121 
> /mesos/containers/49f1da14-d741-4030-994c-0d8ed5093b13/devices/null /dev/null 
> rw,nosuid,nodev - tmpfs tmpfs rw,mode=755
> {noformat}
> This will cause "Permission Denied" error when a process in the container 
> tries to open the device node.
> You can try to reproduce this issue using Mesos Mini
> {noformat}
> docker run --rm --privileged -p 5050:5050 -p 5051:5051 -p 8080:8080 
> mesos/mesos-mini:master-2018-12-06
> {noformat}
> And then, go to the Marathon UI (http://localhost:8080) and launch an app 
> using the following config
> {code}
> {
>   "id": "/test",
>   "cmd": "dd if=/dev/zero of=file bs=1024 count=1 oflag=dsync",
>   "cpus": 1,
>   "mem": 128,
>   "disk": 128,
>   "instances": 1,
>   "container": {
> "type": "MESOS",
> "docker": {
>   "image": "ubuntu:18.04"
> }
>   }
> }
> {code}
> You'll see the task fail with "Permission Denied".
> The task will run normally if you use `mesos/mesos-mini:master-2018-12-01`



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-9462) Devices in a container are inaccessible due to `nodev` on `/var/run`.

2018-12-07 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713286#comment-16713286
 ] 

Jie Yu edited comment on MESOS-9462 at 12/7/18 9:01 PM:


The fix is to make sure the runtime dir is mounted w/o `nodev`. If the distro 
mounts `/var/run` using `nodev`, during agent startup we can perform a one-time 
self bind mount, and a remount with `-o remount,dev`. This can ensure that 
device nodes created in the runtime dir can be accessed.


was (Author: jieyu):
The fix is to make sure  is a mounted w/o `nodev`. If the distro 
mount `/var/run` using `nodev`, during agent startup, we can perform a one time 
self bind mount, and a remount with `-o remount,dev`. This can ensure that 
device nodes created in `` can be accessed.

> Devices in a container are inaccessible due to `nodev` on `/var/run`.
> -
>
> Key: MESOS-9462
> URL: https://issues.apache.org/jira/browse/MESOS-9462
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.8.0
>Reporter: Jie Yu
>Priority: Blocker
>  Labels: regression
>
> A recent [patch|https://reviews.apache.org/r/69086/] (commit 
> ede8155d1d043137e15007c48da36ac5fa0b5124) changes the behavior of how 
> standard device nodes (e.g., /dev/null, etc.) are set up. It now uses bind 
> mounts (from the host) instead of mknod.
> The device nodes are created under 
> `/var/run/mesos/containers/<container_id>/devices`, and then bind mounted into 
> the container root filesystem. This is problematic for those Linux distros 
> that mount `/var/run` (or `/run`) as `nodev`. For instance, CentOS 7.4:
> {noformat}
> [jie@core-dev ~]$ cat /proc/self/mountinfo | grep "/run\ "
> 24 62 0:19 / /run rw,nosuid,nodev shared:23 - tmpfs tmpfs rw,seclabel,mode=755
> [jie@core-dev ~]$ cat /etc/redhat-release 
> CentOS Linux release 7.4.1708 (Core) 
> {noformat}
> As a result, the `/dev/null` devices in the container will inherit the 
> `nodev` from `/run` on the host
> {noformat}
> 629 625 0:121 
> /mesos/containers/49f1da14-d741-4030-994c-0d8ed5093b13/devices/null /dev/null 
> rw,nosuid,nodev - tmpfs tmpfs rw,mode=755
> {noformat}
> This will cause "Permission Denied" error when a process in the container 
> tries to open the device node.
> You can try to reproduce this issue using Mesos Mini
> {noformat}
> docker run --rm --privileged -p 5050:5050 -p 5051:5051 -p 8080:8080 
> mesos/mesos-mini:master-2018-12-06
> {noformat}
> And then, go to the Marathon UI (http://localhost:8080) and launch an app 
> using the following config
> {code}
> {
>   "id": "/test",
>   "cmd": "dd if=/dev/zero of=file bs=1024 count=1 oflag=dsync",
>   "cpus": 1,
>   "mem": 128,
>   "disk": 128,
>   "instances": 1,
>   "container": {
> "type": "MESOS",
> "docker": {
>   "image": "ubuntu:18.04"
> }
>   }
> }
> {code}
> You'll see the task fail with "Permission Denied".
> The task will run normally if you use `mesos/mesos-mini:master-2018-12-01`



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9462) Devices in a container are inaccessible due to `nodev` on `/var/run`.

2018-12-07 Thread Jie Yu (JIRA)
Jie Yu created MESOS-9462:
-

 Summary: Devices in a container are inaccessible due to `nodev` on 
`/var/run`.
 Key: MESOS-9462
 URL: https://issues.apache.org/jira/browse/MESOS-9462
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.8.0
Reporter: Jie Yu


A recent [patch|https://reviews.apache.org/r/69086/] (commit 
ede8155d1d043137e15007c48da36ac5fa0b5124) changes the behavior of how standard 
device nodes (e.g., /dev/null, etc.) are set up. It now uses bind mounts (from 
the host) instead of mknod.

The device nodes are created under 
`/var/run/mesos/containers/<container_id>/devices`, and then bind mounted into 
the container root filesystem. This is problematic for those Linux distros that 
mount `/var/run` (or `/run`) as `nodev`. For instance, CentOS 7.4:
{noformat}
[jie@core-dev ~]$ cat /proc/self/mountinfo | grep "/run\ "
24 62 0:19 / /run rw,nosuid,nodev shared:23 - tmpfs tmpfs rw,seclabel,mode=755
[jie@core-dev ~]$ cat /etc/redhat-release 
CentOS Linux release 7.4.1708 (Core) 
{noformat}

As a result, the `/dev/null` devices in the container will inherit the `nodev` 
from `/run` on the host
{noformat}
629 625 0:121 
/mesos/containers/49f1da14-d741-4030-994c-0d8ed5093b13/devices/null /dev/null 
rw,nosuid,nodev - tmpfs tmpfs rw,mode=755
{noformat}

This will cause "Permission Denied" error when a process in the container tries 
to open the device node.

You can try to reproduce this issue using Mesos Mini
{noformat}
docker run --rm --privileged -p 5050:5050 -p 5051:5051 -p 8080:8080 
mesos/mesos-mini:master-2018-12-06
{noformat}

And then, go to the Marathon UI (http://localhost:8080) and launch an app using 
the following config
{code}
{
  "id": "/test",
  "cmd": "dd if=/dev/zero of=file bs=1024 count=1 oflag=dsync",
  "cpus": 1,
  "mem": 128,
  "disk": 128,
  "instances": 1,
  "container": {
"type": "MESOS",
"docker": {
  "image": "ubuntu:18.04"
}
  }
}
{code}

You'll see the task fail with "Permission Denied".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9009) Support for creation non-existing host paths in a whitelist as source paths

2018-11-18 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16691279#comment-16691279
 ] 

Jie Yu commented on MESOS-9009:
---

commit b866fc3278dc4fd48d1a50493bcde1efdfa91cc7 (HEAD -> master, origin/master, 
origin/HEAD)
Author: Jason Lai 
Date:   Sun Nov 18 21:12:28 2018 -0800

Added unit tests for Stout `path::normalize` function in POSIX.

Review: https://reviews.apache.org/r/68832/

commit 516c0bd70c50ae5aa6682b3b8675ef75d99dfc3f
Author: Jason Lai 
Date:   Sun Nov 18 21:12:06 2018 -0800

Added Stout `path::normalize` function for POSIX paths.

Added `path::normalize` to normalize a given pathname and remove
redundant separators and up-level references.

This function follows the rules described in `path_resolution(7)`
for Linux. However, it only performs pure lexical processing without
touching the actual filesystem.

Review: https://reviews.apache.org/r/65811/
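
For illustration, a self-contained sketch of this kind of purely lexical 
normalization (not stout's actual implementation):

{code}
// Standalone sketch: collapse redundant separators, "." components, and ".."
// references purely lexically, without touching the filesystem.
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

std::string normalize(const std::string& path)
{
  std::vector<std::string> parts;
  std::stringstream stream(path);
  std::string part;

  while (std::getline(stream, part, '/')) {
    if (part.empty() || part == ".") {
      continue;  // redundant separator or current-dir reference
    }
    if (part == "..") {
      if (!parts.empty()) {
        parts.pop_back();  // up-level reference (absolute paths only here)
      }
      continue;
    }
    parts.push_back(part);
  }

  std::string result;
  for (const std::string& p : parts) {
    result += "/" + p;
  }
  return result.empty() ? "/" : result;
}

int main()
{
  // Prints "/var/run/mesos".
  std::cout << normalize("/var//run/../run/./mesos") << std::endl;
  return 0;
}
{code}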

> Support for creation non-existing host paths in a whitelist as source paths
> ---
>
> Key: MESOS-9009
> URL: https://issues.apache.org/jira/browse/MESOS-9009
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Affects Versions: 1.8.0
>Reporter: Jason Lai
>Assignee: Jason Lai
>Priority: Major
>  Labels: containerizer, mount, path
>
> Docker creates a directory specified in {{docker run}}'s {{--volume}}/{{-v}} 
> option as the source path that will get bind-mounted into the container, if 
> the source location didn't originally exist on the host.
> Unlike Docker, UCR bails on launching containers if any of their host mount 
> paths doesn't originally exist. While this is more secure and eliminates 
> unnecessary side effects, it breaks transparent compatibility when trying to 
> migrate from Docker.
> As a trade-off, we should allow host path creation in a restricted manner, by 
> introducing a new Mesos agent flag ({{--host_path_volume_force_creation}}) as 
> a colon-separated whitelist (similar to the format of POSIX's {{$PATH}} 
> environment variable), under whose items' subdirectories the host paths are 
> allowed to be created.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8257) Unified Containerizer "leaks" a target container mount path to the host FS when the target resolves to an absolute path

2018-11-18 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16691278#comment-16691278
 ] 

Jie Yu commented on MESOS-8257:
---

commit b866fc3278dc4fd48d1a50493bcde1efdfa91cc7 (HEAD -> master, origin/master, 
origin/HEAD)
Author: Jason Lai 
Date:   Sun Nov 18 21:12:28 2018 -0800

Added unit tests for Stout `path::normalize` function in POSIX.

Review: https://reviews.apache.org/r/68832/

commit 516c0bd70c50ae5aa6682b3b8675ef75d99dfc3f
Author: Jason Lai 
Date:   Sun Nov 18 21:12:06 2018 -0800

Added Stout `path::normalize` function for POSIX paths.

Added `path::normalize` to normalize a given pathname and remove
redundant separators and up-level references.

This function follows the rules described in `path_resolution(7)`
for Linux. However, it only performs pure lexical processing without
touching the actual filesystem.

Review: https://reviews.apache.org/r/65811/

> Unified Containerizer "leaks" a target container mount path to the host FS 
> when the target resolves to an absolute path
> ---
>
> Key: MESOS-8257
> URL: https://issues.apache.org/jira/browse/MESOS-8257
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
>Reporter: Jason Lai
>Assignee: Jason Lai
>Priority: Critical
>  Labels: bug, containerizer, mountpath
>
> If a target path under the root FS provisioned from an image resolves to an 
> absolute path, it will not appear in the container root FS after 
> {{pivot_root(2)}} is called.
> A typical example is that when the target path is under {{/var/run}} (e.g. 
> {{/var/run/some-dir}}), which is usually a symlink to an absolute path of 
> {{/run}} in Debian images, the target path will get resolved as and created 
> at {{/run/some-dir}} in the host root FS, after the container root FS gets 
> provisioned. The target path will get unmounted after {{pivot_root(2)}} as it 
> is part of the old root (host FS).
> A workaround is to use {{/run}} instead of {{/var/run}}, but absolute 
> symlinks need to be resolved within the scope of the container root FS path.
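
To illustrate that last point, here is a small hedged sketch (a hypothetical helper, not Mesos code) of re-rooting an absolute symlink target under the container rootfs instead of the host `/`:

{code:java}
#include <filesystem>
#include <iostream>

namespace fs = std::filesystem;

// If `path` (already under `rootfs`) is a symlink to an absolute target such
// as "/run", re-attach the target under the container rootfs so it resolves
// to "<rootfs>/run" rather than the host "/run".
fs::path resolveWithinRootfs(const fs::path& rootfs, const fs::path& path)
{
  if (!fs::is_symlink(path)) {
    return path;
  }

  const fs::path target = fs::read_symlink(path);

  if (target.is_absolute()) {
    return rootfs / target.relative_path();
  }

  // Relative symlinks resolve against the link's own directory.
  return path.parent_path() / target;
}

int main()
{
  // Illustrative rootfs path only.
  const fs::path rootfs = "/var/lib/mesos/provisioner/rootfses/abc";
  std::cout << resolveWithinRootfs(rootfs, rootfs / "var/run") << std::endl;
  return 0;
}
{code}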



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9308) URI disk profile adaptor could deadlock.

2018-10-10 Thread Jie Yu (JIRA)
Jie Yu created MESOS-9308:
-

 Summary: URI disk profile adaptor could deadlock.
 Key: MESOS-9308
 URL: https://issues.apache.org/jira/browse/MESOS-9308
 Project: Mesos
  Issue Type: Bug
  Components: resource provider
Affects Versions: 1.7.0, 1.6.1, 1.5.1
Reporter: Jie Yu


The loop here can be infinite:
https://github.com/apache/mesos/blob/1.7.0/src/resource_provider/storage/uri_disk_profile_adaptor.cpp#L61-L80





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9307) Libprocess should have a way to detect stuck actor.

2018-10-10 Thread Jie Yu (JIRA)
Jie Yu created MESOS-9307:
-

 Summary: Libprocess should have a way to detect stuck actor.
 Key: MESOS-9307
 URL: https://issues.apache.org/jira/browse/MESOS-9307
 Project: Mesos
  Issue Type: Improvement
  Components: libprocess
Reporter: Jie Yu


We spent two days on a bug that turned out to be an infinite loop in an actor, 
blocking other events from being processed by that actor.

Currently, the only way to find out about a stuck actor is to use gdb. We should 
think about a way to print error logs when an actor has been stuck for more than 
a threshold.

For instance, the Linux kernel will print a warning in the kernel log if a task is 
stuck for more than 120 seconds. Something like this would be extremely helpful.

Another way is to expose some metrics around this.
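
Just to make the idea concrete, here is a rough, hedged sketch of such a watchdog for a single actor (illustrative only, not libprocess code, and the names are made up): the event loop stamps the start of every handler, and a background thread logs a warning once a handler has been running past a threshold, much like the kernel's hung-task check.

{code:java}
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

using Clock = std::chrono::steady_clock;

// Watchdog for one hypothetical actor: the event loop stamps the start of
// each handler; a background thread complains if a handler runs too long.
class ActorWatchdog
{
public:
  explicit ActorWatchdog(std::chrono::seconds threshold)
    : threshold_(threshold),
      busySince_(Clock::time_point::min()),
      checker_([this] { check(); }) {}

  ~ActorWatchdog() { stop_ = true; checker_.join(); }

  void handlerStarted()  { busySince_ = Clock::now(); }
  void handlerFinished() { busySince_ = Clock::time_point::min(); }

private:
  void check()
  {
    while (!stop_) {
      const Clock::time_point since = busySince_.load();
      if (since != Clock::time_point::min() &&
          Clock::now() - since > threshold_) {
        std::cerr << "WARNING: actor has been processing a single event for"
                  << " more than " << threshold_.count() << "s" << std::endl;
      }
      std::this_thread::sleep_for(std::chrono::seconds(1));
    }
  }

  const std::chrono::seconds threshold_;
  std::atomic<Clock::time_point> busySince_;
  std::atomic<bool> stop_{false};
  std::thread checker_;
};
{code}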



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-9305) Create cgroup recursively when calling prepare on containers

2018-10-09 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16644455#comment-16644455
 ] 

Jie Yu edited comment on MESOS-9305 at 10/10/18 4:54 AM:
-

[~carlone] yeah, I think that sounds good to me. I don't have better 
alternatives. Can you send a PR? I am happy to shepherd it.


was (Author: jieyu):
[~carlone] yeah, I think that sounds good to me. I don't have better 
alternatives. Can you send a PR?

> Create cgroup recursively when calling prepare on containers
> ---
>
> Key: MESOS-9305
> URL: https://issues.apache.org/jira/browse/MESOS-9305
> Project: Mesos
>  Issue Type: Improvement
>Affects Versions: 1.5.1, 1.6.1, 1.7.0
>Reporter: longfei
>Assignee: longfei
>Priority: Critical
>
> This is my case:
> My cgroups_root of mesos-slave is some_user/mesos under /sys/fs/cgroup.
> It happens that this some_user dir may be gone for some unknown reason, in 
> which case I can no longer create any cgroup and any task will fail.
> So I would like to change 
>  
> {code:java}
> Try<Nothing> create = cgroups::create(
>     hierarchy,
>     infos[containerId]->cgroup);
> {code}
> to
> {code:java}
> Try<Nothing> create = cgroups::create(
>     hierarchy,
>     infos[containerId]->cgroup,
>     true);
> {code}
> in CgroupsIsolatorProcess::prepare in 
> src/slave/containerizer/mesos/isolators/cgroups/cgroups.cpp.
> However, I'm not sure if there's any potential problem doing so. Any advice?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9305) Create cgroup recursively when calling prepare on containers

2018-10-09 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16644455#comment-16644455
 ] 

Jie Yu commented on MESOS-9305:
---

[~carlone] yeah, I think that sounds good to me. I don't have better 
alternatives. Can you send a PR?

> Create cgroup recursively when calling prepare on containers
> ---
>
> Key: MESOS-9305
> URL: https://issues.apache.org/jira/browse/MESOS-9305
> Project: Mesos
>  Issue Type: Improvement
>Affects Versions: 1.5.1, 1.6.1, 1.7.0
>Reporter: longfei
>Assignee: longfei
>Priority: Major
>
> This is my case:
> My cgroups_root of mesos-slave is some_user/mesos under /sys/fs/cgroup.
> It happens that this some_user dir may be gone for some unknown reason, in 
> which case I can no longer create any cgroup and any task will fail.
> So I would like to change 
>  
> {code:java}
> Try<Nothing> create = cgroups::create(
>     hierarchy,
>     infos[containerId]->cgroup);
> {code}
> to
> {code:java}
> Try<Nothing> create = cgroups::create(
>     hierarchy,
>     infos[containerId]->cgroup,
>     true);
> {code}
> in CgroupsIsolatorProcess::prepare in 
> src/slave/containerizer/mesos/isolators/cgroups/cgroups.cpp.
> However, I'm not sure if there's any potential problem doing so. Any advice?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9283) Docker containerizer actor can get backlogged with large number of containers.

2018-10-01 Thread Jie Yu (JIRA)
Jie Yu created MESOS-9283:
-

 Summary: Docker containerizer actor can get backlogged with large 
number of containers.
 Key: MESOS-9283
 URL: https://issues.apache.org/jira/browse/MESOS-9283
 Project: Mesos
  Issue Type: Bug
  Components: containerization
Affects Versions: 1.7.0, 1.6.1, 1.5.1
Reporter: Jie Yu


We observed this during some scale testing that we do internally.

When launching 300+ Docker containers on a single agent box, it's possible that 
the Docker containerizer actor gets backlogged. As a result, API processing 
like `GET_CONTAINERS` will become unresponsive. It'll also block Mesos 
containerizer from launching containers if one specified 
`--containerizers=docker,mesos` because the Docker containerizer launch will be invoked 
first by the composing containerizer (and queued).

Profiling results show that the bottleneck is `os::killtree`, which will be 
invoked when the Docker commands are discarded (e.g., client disconnect, etc.).

For this particular case, killtree is not really necessary because the docker 
command does not fork additional subprocesses. If we use the argv version of 
`subprocess` to launch docker commands, we can simply use os::kill instead. We 
confirmed that, by switching to os::kill, the performance issue goes away, and 
the agent can easily scale up to 300+ containers.
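
To make the distinction concrete, here is a hedged sketch at the plain POSIX level (not the actual Mesos `subprocess`/`os::kill` code): when the docker command is launched from an argv vector with no intermediate shell, the launched pid is the only process that needs to be signaled, so a single kill(2) suffices and no process-tree walk is required.

{code:java}
#include <sys/types.h>
#include <sys/wait.h>
#include <signal.h>
#include <unistd.h>

#include <cstdio>

int main()
{
  pid_t pid = fork();

  if (pid == 0) {
    // Launch `docker inspect <id>` directly via execvp (argv version, no
    // shell). The container id here is hypothetical, for illustration only.
    char* const argv[] = {
      const_cast<char*>("docker"),
      const_cast<char*>("inspect"),
      const_cast<char*>("abc123"),
      nullptr
    };
    execvp(argv[0], argv);
    _exit(127); // execvp only returns on error.
  }

  // Later, if the caller discards the future, signalling the single pid is
  // enough because no shell or subprocess tree was created.
  kill(pid, SIGKILL);

  int status = 0;
  waitpid(pid, &status, 0);
  std::printf("docker client (pid %d) reaped\n", static_cast<int>(pid));
  return 0;
}
{code}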



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9282) StatusUpdateManager does not sync checkpointed data to disk.

2018-10-01 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634765#comment-16634765
 ] 

Jie Yu commented on MESOS-9282:
---

From [~bmahler]:

we should probably try to figure out a way to avoid O_SYNC? seems like it’s 
going to tank performance based on why joris made that change?

I guess the bigger question is around what semantics we want across agent 
reboot. Sounds like now we want the same agent id. But should we care about the 
terminal status updates and so on? Seems like agent reboot takes awhile and it 
feels wrong to not do something more graceful on the agent prior to reboot. 
Let’s say there’s a hard reboot where there is no chance to do something 
graceful, then we’d probably rather know that any non-terminal tasks are now 
unreachable and then GONE after the reboot (rather than trying to figure out if 
they terminated right before the reboot?)

> StatusUpdateManager does not sync checkpointed data to disk.
> 
>
> Key: MESOS-9282
> URL: https://issues.apache.org/jira/browse/MESOS-9282
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.5.1, 1.6.1, 1.7.0
>Reporter: Jie Yu
>Priority: Major
>
> This is related to MESOS-9281, which we observed in a testing environment.
> The status update manager used to open the checkpoint file using O_SYNC, 
> which will guarantee that each write will be persisted to the disk (similar 
> to calling fsync() after each write()).
> This was removed due to some performance issue
> https://reviews.apache.org/r/50635/
> However, the assumption in the patch is no longer true now that we allow 
> re-using the same agent ID after a machine reboot. This will likely cause issues.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9282) StatusUpdateManager does not sync checkpointed data to disk.

2018-10-01 Thread Jie Yu (JIRA)
Jie Yu created MESOS-9282:
-

 Summary: StatusUpdateManager does not sync checkpointed data to 
disk.
 Key: MESOS-9282
 URL: https://issues.apache.org/jira/browse/MESOS-9282
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.7.0, 1.6.1, 1.5.1
Reporter: Jie Yu


This is related to MESOS-9281, which we observed in a testing environment.

The status update manager used to open the checkpoint file using O_SYNC, which 
will guarantee that each write will be persisted to the disk (similar to 
calling fsync() after each write()).

This was removed due to some performance issue
https://reviews.apache.org/r/50635/

However, the assumption in the patch is no longer true now that we allow 
re-using the same agent ID after a machine reboot. This will likely cause issues.
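
For illustration, the two durability strategies mentioned above look roughly like this at the POSIX level (a sketch with error handling omitted and a made-up file name; this is not the status update manager code):

{code:java}
#include <fcntl.h>
#include <unistd.h>

#include <cstring>

int main()
{
  const char* record = "status update record\n";

  // Option 1: open the checkpoint file with O_SYNC so every write(2) is
  // persisted to stable storage before it returns.
  int fd = open("updates.checkpoint",
                O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0644);
  write(fd, record, strlen(record));
  close(fd);

  // Option 2: open without O_SYNC (faster writes) and call fsync(2)
  // explicitly only at the points where durability actually matters.
  fd = open("updates.checkpoint", O_WRONLY | O_CREAT | O_APPEND, 0644);
  write(fd, record, strlen(record));
  fsync(fd); // e.g. before acknowledging the update
  close(fd);

  return 0;
}
{code}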



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9279) Docker Containerizer 'usage' call might be expensive if mount table is big.

2018-09-28 Thread Jie Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu reassigned MESOS-9279:
-

Assignee: Jie Yu

> Docker Containerizer 'usage' call might be expensive if mount table is big.
> ---
>
> Key: MESOS-9279
> URL: https://issues.apache.org/jira/browse/MESOS-9279
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 1.5.1, 1.6.1, 1.7.0
>Reporter: Jie Yu
>Assignee: Jie Yu
>Priority: Major
> Attachments: busy_agent_scale_test_0927.stacks, screenshot-1.png
>
>
> We observed in some testing environment that Docker Containerizer 'usage' 
> call can become very expensive if the host mount table is big.
> Perf analysis shows that most of the time was spent on reading the mount 
> table. This is similar to the problem we saw in MESOS-8418.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9279) Docker Containerizer 'usage' call might be expensive if mount table is big.

2018-09-28 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632657#comment-16632657
 ] 

Jie Yu commented on MESOS-9279:
---

commit 59b92a948fab386542f5ddfee275066694cf1f96 (HEAD -> master, origin/master, 
origin/HEAD, fix_docker_stats)
Author: Jie Yu 
Date:   Fri Sep 28 15:55:59 2018 -0700

Cached the cgroup results in Docker containerizer.

Since the cgroup hierarchy results won't change, it does not make sense
to compute them every time `usage` is called. It gets quite expensive
when the host mount table is big (e.g., MESOS-8418).

This patch uses a static local variable to cache the result.

Review: https://reviews.apache.org/r/68880
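
The caching pattern described in the commit message is essentially a function-local static, along these lines (a sketch with made-up types; the real code derives the cgroup hierarchies from the mount table):

{code:java}
#include <map>
#include <string>

// Hypothetical stand-in for the expensive mount-table scan.
std::map<std::string, std::string> computeHierarchies()
{
  // Placeholder: the real implementation parses the host mount table.
  return {{"cpu", "/sys/fs/cgroup/cpu"}, {"memory", "/sys/fs/cgroup/memory"}};
}

// The hierarchy layout cannot change while the agent is running, so the
// expensive computation runs exactly once; later calls reuse the cached value.
const std::map<std::string, std::string>& cgroupHierarchies()
{
  static const std::map<std::string, std::string> hierarchies =
    computeHierarchies();

  return hierarchies;
}
{code}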

> Docker Containerizer 'usage' call might be expensive if mount table is big.
> ---
>
> Key: MESOS-9279
> URL: https://issues.apache.org/jira/browse/MESOS-9279
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 1.5.1, 1.6.1, 1.7.0
>Reporter: Jie Yu
>Priority: Major
> Attachments: busy_agent_scale_test_0927.stacks, screenshot-1.png
>
>
> We observed in some testing environment that Docker Containerizer 'usage' 
> call can become very expensive if the host mount table is big.
> Perf analysis shows that most of the time was spent on reading the mount 
> table. This is similar to the problem we saw in MESOS-8418.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9279) Docker Containerizer 'usage' call might be expensive if mount table is big.

2018-09-28 Thread Jie Yu (JIRA)
Jie Yu created MESOS-9279:
-

 Summary: Docker Containerizer 'usage' call might be expensive if 
mount table is big.
 Key: MESOS-9279
 URL: https://issues.apache.org/jira/browse/MESOS-9279
 Project: Mesos
  Issue Type: Bug
  Components: containerization, docker
Affects Versions: 1.7.0, 1.6.1, 1.5.1
Reporter: Jie Yu


We observed in some testing environment that Docker Containerizer 'usage' call 
can become very expensive if the host mount table is big.

Perf analysis shows that most of the time was spent on reading the mount table. 
This is similar to the problem we saw in MESOS-8418.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9267) Mesos agent crashes when CNI network is not configured but used.

2018-09-27 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631141#comment-16631141
 ] 

Jie Yu commented on MESOS-9267:
---

commit 84b0a51e7174885f45b155ef912772c2593fc398 (fix_cni)
Author: Jie Yu 
Date:   Wed Sep 26 21:57:14 2018 -0700

Removed unneeded pluginDir field from CNI isolator.

The field member is redundant as it's already included in the flags.

Review: https://reviews.apache.org/r/68862

commit 832ebc2beddbe8d38427c2ce0e5578bcaee69b35
Author: Jie Yu 
Date:   Wed Sep 26 21:38:05 2018 -0700

Skipped CNI config load if named network is not enabled.

If the operator didn't turn on named CNI network support (i.e., both
agent flags `network_cni_config_dir` and `network_cni_plugins_dir` are
not specified), the CNI isolator should not attempt to load the network configs.
This patch fixed a potential CHECK failure.

Review: https://reviews.apache.org/r/68861
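
The shape of the fix is simply guarding the optional flags before dereferencing them. A hedged sketch using std::optional in place of stout's Option (illustrative, not the actual isolator code):

{code:java}
#include <optional>
#include <string>

struct Flags
{
  std::optional<std::string> network_cni_config_dir;
  std::optional<std::string> network_cni_plugins_dir;
};

// Returns true when named CNI network support is enabled and configs were
// loaded. If either flag is unset, bail out early instead of calling
// .value() (the equivalent of Option<T>::get()), which tripped the assertion.
bool loadNetworkConfigs(const Flags& flags)
{
  if (!flags.network_cni_config_dir.has_value() ||
      !flags.network_cni_plugins_dir.has_value()) {
    return false; // Named networks not configured; nothing to load.
  }

  const std::string& configDir = flags.network_cni_config_dir.value();
  // ... scan `configDir` for CNI network configs ...
  (void)configDir;
  return true;
}
{code}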

> Mesos agent crashes when CNI network is not configured but used.
> 
>
> Key: MESOS-9267
> URL: https://issues.apache.org/jira/browse/MESOS-9267
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.4.2, 1.5.1, 1.6.1, 1.7.0
> Environment: Mesos 1.7.0
> Marathon 1.7.111
> Ubuntu 16.04
>Reporter: z s
>Assignee: Jie Yu
>Priority: Major
> Fix For: 1.5.2, 1.6.2, 1.7.1
>
>
> I'm running Mesos 1.7.0 with Marathon 1.7.111 using the Mesos Universal 
> Container Runtime to manage a Docker container. When I set the network mode 
> to "container/bridge" the executor crashing instantly whenever the task is 
> scheduled.
> Here are the syslogs
>  
> {code:java}
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]: I0926 16:49:58.604806 
> 46045 containerizer.cpp:3116] Transitioning the state of container 
> a29fd74a-a361-4b3e-8763-5f2cef77380d from PROVISIONING to PREPARING
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]: mesos-agent: 
> ../../3rdparty/stout/include/stout/option.hpp:118: const T& Option::get() 
> const & [with T = std::__cxx11::basic_string]: Assertion `isSome()' 
> failed.
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]: *** Aborted at 1537980598 
> (unix time) try "date -d @1537980598" if you are using GNU date ***
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]: PC: @     0x7f29695df428 
> (unknown)
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]: *** SIGABRT (@0xb3d1) 
> received by PID 46033 (TID 0x7f295b0ee700) from PID 46033; stack trace: ***
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f2969985390 
> (unknown)
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f29695df428 
> (unknown)
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f29695e102a 
> (unknown)
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f29695d7bd7 
> (unknown)
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f29695d7c82 
> (unknown)
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f296c63c964 
> mesos::internal::slave::NetworkCniIsolatorProcess::getNetworkConfigJSON()
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f296c63d3e5 
> mesos::internal::slave::NetworkCniIsolatorProcess::prepare()
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f296c3700c6 
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchI6OptionIN5mesos5slave19ContainerLaunchInfoEENSB_8internal5slave20MesosIsolatorProcessERKNSB_11ContainerIDERKNSC_15ContainerConfigESK_SN_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSS_FSQ_T1_T2_EOT3_OT4_EUlSt10unique_ptrINS1_7PromiseISE_EESt14default_deleteIS16_EEOSI_OSL_S3_E_IS19_SI_SL_St12_PlaceholderILi1EEclEOS3_
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f296ce67da1 
> process::ProcessBase::consume()
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f296ce89efa 
> process::ProcessManager::resume()
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f296ce8dc46 
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f2969e5fc80 
> (unknown)
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f296997b6ba 
> start_thread
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f29696b141d 
> (unknown)
> Sep 26 16:49:58 ip-172-27-1-88 systemd[1]: mesos-agent.service: Main process 
> exited, code=killed, status=6/ABRT
> Sep 26 16:49:58 ip-172-27-1-88 systemd[1]: mesos-agent.service: Unit entered 
> failed state.
> Sep 26 16:49:58 ip-172-27-1-88 systemd[1]: mesos-agent.service: Failed with 
> result 'signal'.
> Sep 26 16:49:58 ip-172-27-1-88 systemd[1]: mesos-agent.service: Service 
> 

[jira] [Assigned] (MESOS-9275) Allow optional `profile` to be specified in `CREATE_DISK` offer operation.

2018-09-27 Thread Jie Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu reassigned MESOS-9275:
-

Assignee: Chun-Hung Hsiao

> Allow optional `profile` to be specified in `CREATE_DISK` offer operation.
> --
>
> Key: MESOS-9275
> URL: https://issues.apache.org/jira/browse/MESOS-9275
> Project: Mesos
>  Issue Type: Task
>  Components: resource provider
>Reporter: Jie Yu
>Assignee: Chun-Hung Hsiao
>Priority: Major
>
> This will allow the framework to "import" pre-existing volumes reported by 
> the corresponding CSI plugin.
> For instance, the LVM CSI plugin might detect some pre-existing volumes that 
> Dan has created out of band. Currently, those volumes will be represented as 
> RAW "disk" resource with a volume ID, but no volume profile by the SLRP. When 
> a framework tries to use the RAW volume as either MOUNT or BLOCK volume, 
> it'll issue a CREATE_DISK operation. The corresponding SLRP will handle the 
> operation and validate against a default profile for MOUNT volumes. However, 
> this prevents the volume from having a different profile that the framework 
> might want.
> Ideally, we should allow the framework to optionally specify a profile that 
> it wants the volume to have during CREATE_DISK because it might have some 
> expectations on the volume. The SLRP will validate with the corresponding CSI 
> plugin using the ValidateVolumeCapabilities RPC call to see if the profile is 
> applicable to the volume.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9275) Allow optional `profile` to be specified in `CREATE_DISK` offer operation.

2018-09-27 Thread Jie Yu (JIRA)
Jie Yu created MESOS-9275:
-

 Summary: Allow optional `profile` to be specified in `CREATE_DISK` 
offer operation.
 Key: MESOS-9275
 URL: https://issues.apache.org/jira/browse/MESOS-9275
 Project: Mesos
  Issue Type: Task
  Components: resource provider
Reporter: Jie Yu


This will allow the framework to "import" pre-existing volumes reported by the 
corresponding CSI plugin.

For instance, the LVM CSI plugin might detect some pre-existing volumes that 
Dan has created out of band. Currently, those volumes will be represented as 
RAW "disk" resource with a volume ID, but no volume profile by the SLRP. When a 
framework tries to use the RAW volume as either MOUNT or BLOCK volume, it'll 
issue a CREATE_DISK operation. The corresponding SLRP will handle the 
operation and validate against a default profile for MOUNT volumes. However, 
this prevents the volume from having a different profile that the framework might 
want.

Ideally, we should allow the framework to optionally specify a profile that it 
wants the volume to have during CREATE_DISK because it might have some 
expectations on the volume. The SLRP will validate with the corresponding CSI 
plugin using the ValidateVolumeCapabilities RPC call to see if the profile is 
applicable to the volume.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9267) Mesos Agent/Executor Crashes w/ UCR Docker bridge-network

2018-09-26 Thread Jie Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu reassigned MESOS-9267:
-

Assignee: Jie Yu

> Mesos Agent/Executor Crashes w/ UCR Docker bridge-network
> -
>
> Key: MESOS-9267
> URL: https://issues.apache.org/jira/browse/MESOS-9267
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.4.2, 1.5.1, 1.6.1, 1.7.0
> Environment: Mesos 1.7.0
> Marathon 1.7.111
> Ubuntu 16.04
>Reporter: z s
>Assignee: Jie Yu
>Priority: Major
>
> I'm running Mesos 1.7.0 with Marathon 1.7.111 using the Mesos Universal 
> Container Runtime to manage a Docker container. When I set the network mode 
> to "container/bridge" the executor crashing instantly whenever the task is 
> scheduled.
> Here are the syslogs
>  
> {code:java}
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]: I0926 16:49:58.604806 
> 46045 containerizer.cpp:3116] Transitioning the state of container 
> a29fd74a-a361-4b3e-8763-5f2cef77380d from PROVISIONING to PREPARING
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]: mesos-agent: 
> ../../3rdparty/stout/include/stout/option.hpp:118: const T& Option::get() 
> const & [with T = std::__cxx11::basic_string]: Assertion `isSome()' 
> failed.
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]: *** Aborted at 1537980598 
> (unix time) try "date -d @1537980598" if you are using GNU date ***
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]: PC: @     0x7f29695df428 
> (unknown)
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]: *** SIGABRT (@0xb3d1) 
> received by PID 46033 (TID 0x7f295b0ee700) from PID 46033; stack trace: ***
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f2969985390 
> (unknown)
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f29695df428 
> (unknown)
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f29695e102a 
> (unknown)
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f29695d7bd7 
> (unknown)
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f29695d7c82 
> (unknown)
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f296c63c964 
> mesos::internal::slave::NetworkCniIsolatorProcess::getNetworkConfigJSON()
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f296c63d3e5 
> mesos::internal::slave::NetworkCniIsolatorProcess::prepare()
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f296c3700c6 
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchI6OptionIN5mesos5slave19ContainerLaunchInfoEENSB_8internal5slave20MesosIsolatorProcessERKNSB_11ContainerIDERKNSC_15ContainerConfigESK_SN_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSS_FSQ_T1_T2_EOT3_OT4_EUlSt10unique_ptrINS1_7PromiseISE_EESt14default_deleteIS16_EEOSI_OSL_S3_E_IS19_SI_SL_St12_PlaceholderILi1EEclEOS3_
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f296ce67da1 
> process::ProcessBase::consume()
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f296ce89efa 
> process::ProcessManager::resume()
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f296ce8dc46 
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f2969e5fc80 
> (unknown)
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f296997b6ba 
> start_thread
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f29696b141d 
> (unknown)
> Sep 26 16:49:58 ip-172-27-1-88 systemd[1]: mesos-agent.service: Main process 
> exited, code=killed, status=6/ABRT
> Sep 26 16:49:58 ip-172-27-1-88 systemd[1]: mesos-agent.service: Unit entered 
> failed state.
> Sep 26 16:49:58 ip-172-27-1-88 systemd[1]: mesos-agent.service: Failed with 
> result 'signal'.
> Sep 26 16:49:58 ip-172-27-1-88 systemd[1]: mesos-agent.service: Service 
> hold-off time over, scheduling restart.
> Sep 26 16:49:58 ip-172-27-1-88 systemd[1]: Stopped Mesos Agent Service.
> Sep 26 16:49:58 ip-172-27-1-88 systemd[1]: Started Mesos Agent Service.
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46255]: I0926 16:49:58.947785 
> 46255 main.cpp:349] Build: 2018-09-21 14:54:37 by ubuntu
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46255]: I0926 16:49:58.947849 
> 46255 main.cpp:350] Version: 1.7.0
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46255]: I0926 16:49:58.947855 
> 46255 main.cpp:353] Git tag: 1.7.0
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46255]: I0926 16:49:58.947860 
> 46255 main.cpp:357] Git SHA: 8419b870c571ac11825c883fa20ea3b7d4348d34
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46255]: I0926 16:49:58.952015 
> 46255 systemd.cpp:240] systemd version `229` detected
> Sep 26 16:49:58 

[jira] [Commented] (MESOS-9267) Mesos Agent/Executor Crashes w/ UCR Docker bridge-network

2018-09-26 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629757#comment-16629757
 ] 

Jie Yu commented on MESOS-9267:
---

I posted a review to fix the CHECK failure here:
https://reviews.apache.org/r/68861

> Mesos Agent/Executor Crashes w/ UCR Docker bridge-network
> -
>
> Key: MESOS-9267
> URL: https://issues.apache.org/jira/browse/MESOS-9267
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.7.0
> Environment: Mesos 1.7.0
> Marathon 1.7.111
> Ubuntu 16.04
>Reporter: z s
>Priority: Major
>
> I'm running Mesos 1.7.0 with Marathon 1.7.111 using the Mesos Universal 
> Container Runtime to manage a Docker container. When I set the network mode 
> to "container/bridge" the executor crashing instantly whenever the task is 
> scheduled.
> Here are the syslogs
>  
> {code:java}
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]: I0926 16:49:58.604806 
> 46045 containerizer.cpp:3116] Transitioning the state of container 
> a29fd74a-a361-4b3e-8763-5f2cef77380d from PROVISIONING to PREPARING
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]: mesos-agent: 
> ../../3rdparty/stout/include/stout/option.hpp:118: const T& Option::get() 
> const & [with T = std::__cxx11::basic_string]: Assertion `isSome()' 
> failed.
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]: *** Aborted at 1537980598 
> (unix time) try "date -d @1537980598" if you are using GNU date ***
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]: PC: @     0x7f29695df428 
> (unknown)
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]: *** SIGABRT (@0xb3d1) 
> received by PID 46033 (TID 0x7f295b0ee700) from PID 46033; stack trace: ***
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f2969985390 
> (unknown)
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f29695df428 
> (unknown)
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f29695e102a 
> (unknown)
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f29695d7bd7 
> (unknown)
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f29695d7c82 
> (unknown)
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f296c63c964 
> mesos::internal::slave::NetworkCniIsolatorProcess::getNetworkConfigJSON()
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f296c63d3e5 
> mesos::internal::slave::NetworkCniIsolatorProcess::prepare()
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f296c3700c6 
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchI6OptionIN5mesos5slave19ContainerLaunchInfoEENSB_8internal5slave20MesosIsolatorProcessERKNSB_11ContainerIDERKNSC_15ContainerConfigESK_SN_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSS_FSQ_T1_T2_EOT3_OT4_EUlSt10unique_ptrINS1_7PromiseISE_EESt14default_deleteIS16_EEOSI_OSL_S3_E_IS19_SI_SL_St12_PlaceholderILi1EEclEOS3_
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f296ce67da1 
> process::ProcessBase::consume()
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f296ce89efa 
> process::ProcessManager::resume()
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f296ce8dc46 
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f2969e5fc80 
> (unknown)
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f296997b6ba 
> start_thread
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f29696b141d 
> (unknown)
> Sep 26 16:49:58 ip-172-27-1-88 systemd[1]: mesos-agent.service: Main process 
> exited, code=killed, status=6/ABRT
> Sep 26 16:49:58 ip-172-27-1-88 systemd[1]: mesos-agent.service: Unit entered 
> failed state.
> Sep 26 16:49:58 ip-172-27-1-88 systemd[1]: mesos-agent.service: Failed with 
> result 'signal'.
> Sep 26 16:49:58 ip-172-27-1-88 systemd[1]: mesos-agent.service: Service 
> hold-off time over, scheduling restart.
> Sep 26 16:49:58 ip-172-27-1-88 systemd[1]: Stopped Mesos Agent Service.
> Sep 26 16:49:58 ip-172-27-1-88 systemd[1]: Started Mesos Agent Service.
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46255]: I0926 16:49:58.947785 
> 46255 main.cpp:349] Build: 2018-09-21 14:54:37 by ubuntu
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46255]: I0926 16:49:58.947849 
> 46255 main.cpp:350] Version: 1.7.0
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46255]: I0926 16:49:58.947855 
> 46255 main.cpp:353] Git tag: 1.7.0
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46255]: I0926 16:49:58.947860 
> 46255 main.cpp:357] Git SHA: 8419b870c571ac11825c883fa20ea3b7d4348d34
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46255]: I0926 16:49:58.952015 
> 46255 systemd.cpp:240] systemd 

[jira] [Commented] (MESOS-9269) Mesos UCR with Docker only Works on Localhost

2018-09-26 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629745#comment-16629745
 ] 

Jie Yu commented on MESOS-9269:
---

[~dgoel] [~qianzhang] can you help here?

> Mesos UCR with Docker only Works on Localhost
> -
>
> Key: MESOS-9269
> URL: https://issues.apache.org/jira/browse/MESOS-9269
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, docker
>Affects Versions: 1.7.0
> Environment: Ubuntu 16.04
> Mesos 1.7.0
> Marathon 1.7.111
>Reporter: z s
>Priority: Major
>
> I'm having an issue setting up the `mesos-cni-port-mapper` to allow remote 
> connectivity.
> When I `curl <host>:<port>` from the machine I get a response, but from a 
> remote machine the `curl` connection times out. I'm not sure what's wrong with 
> my route settings.
>  
> */var/lib/mesos/cni/config/mesos-bridge.json*
>  
> {code:java}
> {
>   "name": "mesos-bridge",
>   "type": "mesos-cni-port-mapper",
>   "excludeDevices": ["mesos-cni0"],
>   "chain": "MESOS-BRIDGE-PORT-MAPPER",
>   "delegate": {
>     "type": "bridge",
>     "bridge": "mesos-cni0",
>     "isGateway": true,
>     "ipMasq": true,
>     "ipam": {
>       "type": "host-local",
>       "subnet": "10.1.0.0/16",
>       "routes": [
>         { "dst": "0.0.0.0/0" }
>       ]
>     }
>   }
> }
> {code}
>  
> {code:java}
> $ route -n
> Kernel IP routing table
> Destination Gateway Genmask Flags Metric Ref Use Iface
> 0.0.0.0 172.27.1.1 0.0.0.0 UG 0 0 0 ens3
> 10.1.0.0 0.0.0.0 255.255.0.0 U 0 0 0 mesos-cni0
> 172.17.0.0 0.0.0.0 255.255.0.0 U 0 0 0 docker0
> 172.27.1.0 0.0.0.0 255.255.255.0 U 0 0 0 ens3
> {code}
> Any suggestions?
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9269) Mesos UCR with Docker only Works on Localhost

2018-09-26 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629736#comment-16629736
 ] 

Jie Yu commented on MESOS-9269:
---

can you show the nat table?

> Mesos UCR with Docker only Works on Localhost
> -
>
> Key: MESOS-9269
> URL: https://issues.apache.org/jira/browse/MESOS-9269
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, docker
>Affects Versions: 1.7.0
> Environment: Ubuntu 16.04
> Mesos 1.7.0
> Marathon 1.7.111
>Reporter: z s
>Priority: Major
>
> I'm having an issue setting up the `mesos-cni-port-mapper` to allow remote 
> connectivity.
> When I `curl <host>:<port>` from the machine I get a response, but from a 
> remote machine the `curl` connection times out. I'm not sure what's wrong with 
> my route settings.
>  
> */var/lib/mesos/cni/config/mesos-bridge.json*
>  
> {code:java}
> {
>   "name": "mesos-bridge",
>   "type": "mesos-cni-port-mapper",
>   "excludeDevices": ["mesos-cni0"],
>   "chain": "MESOS-BRIDGE-PORT-MAPPER",
>   "delegate": {
>     "type": "bridge",
>     "bridge": "mesos-cni0",
>     "isGateway": true,
>     "ipMasq": true,
>     "ipam": {
>       "type": "host-local",
>       "subnet": "10.1.0.0/16",
>       "routes": [
>         { "dst": "0.0.0.0/0" }
>       ]
>     }
>   }
> }
> {code}
>  
> {code:java}
> $ route -n
> Kernel IP routing table
> Destination Gateway Genmask Flags Metric Ref Use Iface
> 0.0.0.0 172.27.1.1 0.0.0.0 UG 0 0 0 ens3
> 10.1.0.0 0.0.0.0 255.255.0.0 U 0 0 0 mesos-cni0
> 172.17.0.0 0.0.0.0 255.255.0.0 U 0 0 0 docker0
> 172.27.1.0 0.0.0.0 255.255.255.0 U 0 0 0 ens3
> {code}
> Any suggestions?
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9269) Mesos UCR with Docker only Works on Localhost

2018-09-26 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629723#comment-16629723
 ] 

Jie Yu commented on MESOS-9269:
---

[~dkjs] can you show your iptables?

> Mesos UCR with Docker only Works on Localhost
> -
>
> Key: MESOS-9269
> URL: https://issues.apache.org/jira/browse/MESOS-9269
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, docker
>Affects Versions: 1.7.0
> Environment: Ubuntu 16.04
> Mesos 1.7.0
> Marathon 1.7.111
>Reporter: z s
>Priority: Major
>
> I'm having an issue setting up the `mesos-cni-port-mapper` to allow remote 
> connectivity.
> When I `curl <host>:<port>` from the machine I get a response, but from a 
> remote machine the `curl` connection times out. I'm not sure what's wrong with 
> my route settings.
>  
> */var/lib/mesos/cni/config/mesos-bridge.json*
>  
> {code:java}
> {
>   "name": "mesos-bridge",
>   "type": "mesos-cni-port-mapper",
>   "excludeDevices": ["mesos-cni0"],
>   "chain": "MESOS-BRIDGE-PORT-MAPPER",
>   "delegate": {
>     "type": "bridge",
>     "bridge": "mesos-cni0",
>     "isGateway": true,
>     "ipMasq": true,
>     "ipam": {
>       "type": "host-local",
>       "subnet": "10.1.0.0/16",
>       "routes": [
>         { "dst": "0.0.0.0/0" }
>       ]
>     }
>   }
> }
> {code}
>  
> {code:java}
> $ route -n
> Kernel IP routing table
> Destination Gateway Genmask Flags Metric Ref Use Iface
> 0.0.0.0 172.27.1.1 0.0.0.0 UG 0 0 0 ens3
> 10.1.0.0 0.0.0.0 255.255.0.0 U 0 0 0 mesos-cni0
> 172.17.0.0 0.0.0.0 255.255.0.0 U 0 0 0 docker0
> 172.27.1.0 0.0.0.0 255.255.255.0 U 0 0 0 ens3
> {code}
> Any suggestions?
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9269) Mesos UCR with Docker only Works on Localhost

2018-09-26 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629546#comment-16629546
 ] 

Jie Yu commented on MESOS-9269:
---

Can you post your marathon app config?

> Mesos UCR with Docker only Works on Localhost
> -
>
> Key: MESOS-9269
> URL: https://issues.apache.org/jira/browse/MESOS-9269
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, docker
>Affects Versions: 1.7.0
> Environment: Ubuntu 16.04
> Mesos 1.7.0
> Marathon 1.7.111
>Reporter: z silver
>Priority: Major
>
> I'm having an issue setting up the `mesos-cni-port-mapper` to allow remote 
> connectivity.
> When I `curl <host>:<port>` from the machine I get a response, but from a 
> remote machine the `curl` connection times out. I'm not sure what's wrong with 
> my route settings.
>  
> */var/lib/mesos/cni/config/mesos-bridge.json*
>  
> {code:java}
> {
>   "name": "mesos-bridge",
>   "type": "mesos-cni-port-mapper",
>   "excludeDevices": ["mesos-cni0"],
>   "chain": "MESOS-BRIDGE-PORT-MAPPER",
>   "delegate": {
>     "type": "bridge",
>     "bridge": "mesos-cni0",
>     "isGateway": true,
>     "ipMasq": true,
>     "ipam": {
>       "type": "host-local",
>       "subnet": "10.1.0.0/16",
>       "routes": [
>         { "dst": "0.0.0.0/0" }
>       ]
>     }
>   }
> }
> {code}
>  
> {code:java}
> $ route -n
> Kernel IP routing table
> Destination Gateway Genmask Flags Metric Ref Use Iface
> 0.0.0.0 172.27.1.1 0.0.0.0 UG 0 0 0 ens3
> 10.1.0.0 0.0.0.0 255.255.0.0 U 0 0 0 mesos-cni0
> 172.17.0.0 0.0.0.0 255.255.0.0 U 0 0 0 docker0
> 172.27.1.0 0.0.0.0 255.255.255.0 U 0 0 0 ens3
> {code}
> Any suggestions?
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9267) Mesos Agent/Executor Crashes w/ UCR Docker bridge-network

2018-09-26 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629521#comment-16629521
 ] 

Jie Yu commented on MESOS-9267:
---

[~dkjs] thanks for reporting. I think at the very least, the Mesos agent 
shouldn't crash in those cases.

cc [~qianzhang] [~gilbert]

> Mesos Agent/Executor Crashes w/ UCR Docker bridge-network
> -
>
> Key: MESOS-9267
> URL: https://issues.apache.org/jira/browse/MESOS-9267
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.7.0
> Environment: Mesos 1.7.0
> Marathon 1.7.111
> Ubuntu 16.04
>Reporter: z silver
>Priority: Major
>
> I'm running Mesos 1.7.0 with Marathon 1.7.111 using the Mesos Universal 
> Container Runtime to manage a Docker container. When I set the network mode 
> to "container/bridge" the executor crashing instantly whenever the task is 
> scheduled.
> Here are the syslogs
>  
> {code:java}
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]: I0926 16:49:58.604806 
> 46045 containerizer.cpp:3116] Transitioning the state of container 
> a29fd74a-a361-4b3e-8763-5f2cef77380d from PROVISIONING to PREPARING
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]: mesos-agent: 
> ../../3rdparty/stout/include/stout/option.hpp:118: const T& Option::get() 
> const & [with T = std::__cxx11::basic_string]: Assertion `isSome()' 
> failed.
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]: *** Aborted at 1537980598 
> (unix time) try "date -d @1537980598" if you are using GNU date ***
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]: PC: @     0x7f29695df428 
> (unknown)
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]: *** SIGABRT (@0xb3d1) 
> received by PID 46033 (TID 0x7f295b0ee700) from PID 46033; stack trace: ***
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f2969985390 
> (unknown)
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f29695df428 
> (unknown)
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f29695e102a 
> (unknown)
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f29695d7bd7 
> (unknown)
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f29695d7c82 
> (unknown)
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f296c63c964 
> mesos::internal::slave::NetworkCniIsolatorProcess::getNetworkConfigJSON()
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f296c63d3e5 
> mesos::internal::slave::NetworkCniIsolatorProcess::prepare()
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f296c3700c6 
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchI6OptionIN5mesos5slave19ContainerLaunchInfoEENSB_8internal5slave20MesosIsolatorProcessERKNSB_11ContainerIDERKNSC_15ContainerConfigESK_SN_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSS_FSQ_T1_T2_EOT3_OT4_EUlSt10unique_ptrINS1_7PromiseISE_EESt14default_deleteIS16_EEOSI_OSL_S3_E_IS19_SI_SL_St12_PlaceholderILi1EEclEOS3_
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f296ce67da1 
> process::ProcessBase::consume()
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f296ce89efa 
> process::ProcessManager::resume()
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f296ce8dc46 
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f2969e5fc80 
> (unknown)
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f296997b6ba 
> start_thread
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46033]:     @     0x7f29696b141d 
> (unknown)
> Sep 26 16:49:58 ip-172-27-1-88 systemd[1]: mesos-agent.service: Main process 
> exited, code=killed, status=6/ABRT
> Sep 26 16:49:58 ip-172-27-1-88 systemd[1]: mesos-agent.service: Unit entered 
> failed state.
> Sep 26 16:49:58 ip-172-27-1-88 systemd[1]: mesos-agent.service: Failed with 
> result 'signal'.
> Sep 26 16:49:58 ip-172-27-1-88 systemd[1]: mesos-agent.service: Service 
> hold-off time over, scheduling restart.
> Sep 26 16:49:58 ip-172-27-1-88 systemd[1]: Stopped Mesos Agent Service.
> Sep 26 16:49:58 ip-172-27-1-88 systemd[1]: Started Mesos Agent Service.
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46255]: I0926 16:49:58.947785 
> 46255 main.cpp:349] Build: 2018-09-21 14:54:37 by ubuntu
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46255]: I0926 16:49:58.947849 
> 46255 main.cpp:350] Version: 1.7.0
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46255]: I0926 16:49:58.947855 
> 46255 main.cpp:353] Git tag: 1.7.0
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46255]: I0926 16:49:58.947860 
> 46255 main.cpp:357] Git SHA: 8419b870c571ac11825c883fa20ea3b7d4348d34
> Sep 26 16:49:58 ip-172-27-1-88 mesos-agent[46255]: 

[jira] [Commented] (MESOS-9238) rpmbuild checkfiles fails

2018-09-16 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16616970#comment-16616970
 ] 

Jie Yu commented on MESOS-9238:
---

https://reviews.apache.org/r/68728/

> rpmbuild checkfiles fails
> -
>
> Key: MESOS-9238
> URL: https://issues.apache.org/jira/browse/MESOS-9238
> Project: Mesos
>  Issue Type: Bug
>Reporter: James DeFelice
>Assignee: Till Toenshoff
>Priority: Major
>
> I noticed that Mesos nightly builds haven't been pushed to dockerhub in a 
> while. After some help from Jie and digging a bit more it looks like 
> rpm-build is reporting an error:
> {code:java}
> RPM build errors:
> error: Installed (but unpackaged) file(s) found:
>/usr/include/rapidjson/allocators.h
>/usr/include/rapidjson/document.h
>/usr/include/rapidjson/encodedstream.h
>/usr/include/rapidjson/encodings.h
>/usr/include/rapidjson/error/en.h
>/usr/include/rapidjson/error/error.h
>/usr/include/rapidjson/filereadstream.h
>/usr/include/rapidjson/filewritestream.h
>/usr/include/rapidjson/fwd.h
>/usr/include/rapidjson/internal/biginteger.h
>/usr/include/rapidjson/internal/diyfp.h
>/usr/include/rapidjson/internal/dtoa.h
>/usr/include/rapidjson/internal/ieee754.h
>/usr/include/rapidjson/internal/itoa.h
>/usr/include/rapidjson/internal/meta.h
>/usr/include/rapidjson/internal/pow10.h
>/usr/include/rapidjson/internal/regex.h
>/usr/include/rapidjson/internal/stack.h
>/usr/include/rapidjson/internal/strfunc.h
>/usr/include/rapidjson/internal/strtod.h
>/usr/include/rapidjson/internal/swap.h
>/usr/include/rapidjson/istreamwrapper.h
>/usr/include/rapidjson/memorybuffer.h
>/usr/include/rapidjson/memorystream.h
>/usr/include/rapidjson/msinttypes/inttypes.h
>/usr/include/rapidjson/msinttypes/stdint.h
>/usr/include/rapidjson/ostreamwrapper.h
>/usr/include/rapidjson/pointer.h
>/usr/include/rapidjson/prettywriter.h
>/usr/include/rapidjson/rapidjson.h
>/usr/include/rapidjson/reader.h
>/usr/include/rapidjson/schema.h
>/usr/include/rapidjson/stream.h
>/usr/include/rapidjson/stringbuffer.h
>/usr/include/rapidjson/writer.h
> Macro %MESOS_VERSION has empty body
> Macro %MESOS_RELEASE has empty body
> {code}
> Furthermore, the cleanup func that's invoked by the trap is failing with a 
bunch of permission errors:
> {code:java}
> cleanup
> rm: cannot remove 
> '/home/jenkins/jenkins-slave/workspace/Mesos-Docker-CentOS/centos7/.cache': 
> Permission denied
> rm: cannot remove 
> '/home/jenkins/jenkins-slave/workspace/Mesos-Docker-CentOS/centos7/rpmbuild/SRPMS':
>  Permission denied
> rm: cannot remove 
> '/home/jenkins/jenkins-slave/workspace/Mesos-Docker-CentOS/centos7/rpmbuild/BUILDROOT/mesos-1.8.0-0.1.pre.20180915git4805a47.el7.x86_64/var/lib/mesos':
>  Permission denied
> rm: cannot remove 
> '/home/jenkins/jenkins-slave/workspace/Mesos-Docker-CentOS/centos7/rpmbuild/BUILDROOT/mesos-1.8.0-0.1.pre.20180915git4805a47.el7.x86_64/var/log/mesos':
>  Permission denied
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9220) Non-intuitive master streaming API regarding terminated but unack'ed tasks.

2018-09-07 Thread Jie Yu (JIRA)
Jie Yu created MESOS-9220:
-

 Summary: Non-intuitive master streaming API regarding terminated 
but unack'ed tasks.
 Key: MESOS-9220
 URL: https://issues.apache.org/jira/browse/MESOS-9220
 Project: Mesos
  Issue Type: Improvement
  Components: master
Affects Versions: 1.6.1, 1.5.1, 1.7.0
Reporter: Jie Yu


The Subscribed event will include all active tasks, as well as completed tasks.

The list of active tasks will actually include those tasks that are terminal, 
but their terminal status updates haven't been ack'ed. However, there is no 
task update event sent to the framework when the terminal status updates are 
ack'ed by the framework. So there's no signal for the framework to remove those 
tasks if its intention is to maintain a list of active tasks in the system.

This is not very intuitive to framework writers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9196) Removing rootfs mounts may fail with EBUSY.

2018-08-31 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599528#comment-16599528
 ] 

Jie Yu commented on MESOS-9196:
---

Posted a chain starting here:
https://reviews.apache.org/r/68594/

> Removing rootfs mounts may fail with EBUSY.
> ---
>
> Key: MESOS-9196
> URL: https://issues.apache.org/jira/browse/MESOS-9196
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Gilbert Song
>Assignee: Gilbert Song
>Priority: Blocker
>  Labels: containerizer
>
> We observed this in a production environment:
> {code}
> Failed to destroy the provisioned rootfs when destroying container: Collect 
> failed: Failed to destroy overlay-mounted rootfs 
> '/var/lib/mesos/slave/provisioner/containers/6332cf3d-9897-475b-88b3-40e983a2a531/containers/e8f36ad7-c9ae-40da-9d14-431e98174735/backends/overlay/rootfses/d601ef1b-11b9-445a-b607-7c6366cd21ec':
>  Failed to unmount 
> '/var/lib/mesos/slave/provisioner/containers/6332cf3d-9897-475b-88b3-40e983a2a531/containers/e8f36ad7-c9ae-40da-9d14-431e98174735/backends/overlay/rootfses/d601ef1b-11b9-445a-b607-7c6366cd21ec':
>  Device or resource busy
> {code}
> Consider fixing the issue by using a detached (lazy) unmount when unmounting the 
> container rootfs. See MESOS-3349 for details.
> The root cause of why "Device or resource busy" is received when doing the rootfs 
> unmount is still unknown.
> _UPDATE_: The production environment has a cronjob that scans filesystems to 
> build an index (updatedb for mlocate). This can explain the EBUSY we receive 
> when doing `unmount`.
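
At the syscall level, a detached (lazy) unmount is just `umount2(2)` with `MNT_DETACH`. A hedged sketch of the fallback idea (not the actual Mesos patch):

{code:java}
#include <sys/mount.h>

#include <cerrno>
#include <cstdio>

// Try a regular unmount first; if something (e.g. an updatedb scan) still
// holds the mount busy, detach it lazily so it disappears from the mount
// namespace and is cleaned up once the last user goes away.
int unmountRootfs(const char* target)
{
  if (umount(target) == 0) {
    return 0;
  }

  if (errno == EBUSY) {
    std::perror("umount (falling back to MNT_DETACH)");
    return umount2(target, MNT_DETACH);
  }

  return -1;
}

int main(int argc, char** argv)
{
  return argc == 2 ? unmountRootfs(argv[1]) : 1;
}
{code}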



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9159) Support Foreign URLs in docker registry puller

2018-08-29 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596757#comment-16596757
 ] 

Jie Yu commented on MESOS-9159:
---

commit 3295fc98cf33bf22bb3d7b1d1ade424c477d3b83
Author: Liangyu Zhao 
Date:   Wed Aug 29 11:54:47 2018 -0700

Windows: Enabled `DockerFetcherPluginTest` suite.

Enabled `Internet` test environment on Windows. Disabled `Internet`
`HealthCheckTests` on Windows, since they require complete
development. Modified `DockerFetcherPluginTest` to fetch
`microsoft/nanoserver` for a more extensive test of the fetcher on Windows.

Review: https://reviews.apache.org/r/67930/

commit cdf8eab619239600f5105965b676b13887931f91
Author: Liangyu Zhao 
Date:   Wed Aug 29 11:54:25 2018 -0700

Windows: Enable DockerFetcher in Windows agent.

Review: https://reviews.apache.org/r/68455/

> Support Foreign URLs in docker registry puller
> --
>
> Key: MESOS-9159
> URL: https://issues.apache.org/jira/browse/MESOS-9159
> Project: Mesos
>  Issue Type: Task
>Reporter: Akash Gupta
>Assignee: Liangyu Zhao
>Priority: Major
>
> Currently, trying to pull the layers of a Windows image with the current 
> registry pull code will return a 404 error. This is because the Windows 
> docker images need to pull the base OS layers from the foreign URLs field in 
> the version 2 schema 2 docker manifest. As a result, the registry puller 
> needs to be aware of version 2 schema 2 and the foreign urls field.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9159) Support Foreign URLs in docker registry puller

2018-08-28 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595789#comment-16595789
 ] 

Jie Yu commented on MESOS-9159:
---

commit 7747a7b3588c40a1e730411a5630084263e9cfff (HEAD -> master, origin/master, 
origin/HEAD)
Author: Liangyu Zhao 
Date:   Tue Aug 28 17:08:59 2018 -0700

Windows: Fetch blobs with V2S2 Docker image manifest.

DockerFetcher now fetches both V2S1 and V2S2 manifests to save on
disk when the agent is running on Windows. The Linux part of the code in
the agent is unchanged. In addition to fetching from DockerHub,
DockerFetcher now supports fetching from foreign URLs provided in
V2S2 Docker image manifest.

Review: https://reviews.apache.org/r/68454/

> Support Foreign URLs in docker registry puller
> --
>
> Key: MESOS-9159
> URL: https://issues.apache.org/jira/browse/MESOS-9159
> Project: Mesos
>  Issue Type: Task
>Reporter: Akash Gupta
>Assignee: Liangyu Zhao
>Priority: Major
>
> Currently, trying to pull the layers of a Windows image with the current 
> registry pull code will return a 404 error. This is because the Windows 
> docker images need to pull the base OS layers from the foreign URLs field in 
> the version 2 schema 2 docker manifest. As a result, the registry puller 
> needs to be aware of version 2 schema 2 and the foreign urls field.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9159) Support Foreign URLs in docker registry puller

2018-08-28 Thread Jie Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu reassigned MESOS-9159:
-

Assignee: Liangyu Zhao

> Support Foreign URLs in docker registry puller
> --
>
> Key: MESOS-9159
> URL: https://issues.apache.org/jira/browse/MESOS-9159
> Project: Mesos
>  Issue Type: Task
>Reporter: Akash Gupta
>Assignee: Liangyu Zhao
>Priority: Major
>
> Currently, trying to pull the layers of a Windows image with the current 
> registry pull code will return a 404 error. This is because the Windows 
> docker images need to pull the base OS layers from the foreign URLs field in 
> the version 2 schema 2 docker manifest. As a result, the registry puller 
> needs to be aware of version 2 schema 2 and the foreign urls field.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9159) Support Foreign URLs in docker registry puller

2018-08-28 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595592#comment-16595592
 ] 

Jie Yu commented on MESOS-9159:
---

commit bf17c98f440df0998281f8619642d1d9ebcf49eb (HEAD -> master, origin/master, 
origin/HEAD)
Author: Liangyu Zhao 
Date:   Tue Aug 28 13:32:17 2018 -0700

Windows: Parse version 2 schema 2 Docker image manifest.

Added support to parse V2S2 Docker image manifest
(https://docs.docker.com/registry/spec/manifest-v2-2/). Adopted the
validation code from patch 53850.

Review: https://reviews.apache.org/r/68451/

commit dc1436289129e5339a7e9d6d9350d64a352ece6b
Author: Liangyu Zhao 
Date:   Tue Aug 28 13:30:42 2018 -0700

Windows: Update curl version to 7.61.0.

A bug was encountered in version 7.57.0, which is fixed in 7.61.0.

Review: https://reviews.apache.org/r/68450/

> Support Foreign URLs in docker registry puller
> --
>
> Key: MESOS-9159
> URL: https://issues.apache.org/jira/browse/MESOS-9159
> Project: Mesos
>  Issue Type: Task
>Reporter: Akash Gupta
>Priority: Major
>
> Currently, trying to pull the layers of a Windows image with the current 
> registry pull code will return a 404 error. This is because the Windows 
> docker images need to pull the base OS layers from the foreign URLs field in 
> the version 2 schema 2 docker manifest. As a result, the registry puller 
> needs to be aware of version 2 schema 2 and the foreign urls field.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-5647) Expose network statistics for containers on CNI network in the `network/cni` isolator.

2018-08-28 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-5647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595554#comment-16595554
 ] 

Jie Yu commented on MESOS-5647:
---

commit 0a58ecd86dfd025526c6a2f719df096ec8195c99 (HEAD -> master, origin/master, 
origin/HEAD)
Author: Sergey Urbanovich 
Date:   Tue Aug 28 12:15:00 2018 -0700

Added a CNI test for networking statistics.

This is a veth CNI plugin that is written in bash. It creates a veth
virtual network pair; one end of the pair is moved into the container's
network namespace.

The veth CNI plugin uses 203.0.113.0/24 subnet, it is reserved for
documentation and examples [rfc5737]. The plugin can allocate up to
128 veth pairs.

Review: https://reviews.apache.org/r/68355/

> Expose network statistics for containers on CNI network in the `network/cni` 
> isolator.
> --
>
> Key: MESOS-5647
> URL: https://issues.apache.org/jira/browse/MESOS-5647
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Affects Versions: 1.0.0
> Environment: linux
>Reporter: Avinash Sridharan
>Assignee: Sergey Urbanovich
>Priority: Major
>  Labels: mesosphere
> Fix For: 1.7.0
>
>
> We need to implement the `usage` method in the `network/cni` isolator to 
> expose metrics relating to a container's network traffic. 
> On receiving a request for getting `usage` for a given container, the 
> `network/cni` isolator could use NETLINK system calls to query the kernel for 
> interface and routing statistics for a given container's network namespace.
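
As a rough illustration of where such per-namespace counters live (a hedged sketch, not the implemented patch, which may well use netlink as the ticket suggests): one can enter the container's network namespace and read the standard per-interface statistics files. In real code this would be done in a forked helper so the caller's namespace is untouched, and it requires CAP_SYS_ADMIN.

{code:java}
#ifndef _GNU_SOURCE
#define _GNU_SOURCE // for setns(2)
#endif

#include <fcntl.h>
#include <sched.h>
#include <unistd.h>

#include <fstream>
#include <iostream>
#include <string>

// Read the RX byte counter of `iface` as seen inside the network namespace
// of `pid` (e.g. the container's init process).
long long rxBytes(pid_t pid, const std::string& iface)
{
  const std::string nsPath = "/proc/" + std::to_string(pid) + "/ns/net";

  int fd = open(nsPath.c_str(), O_RDONLY);
  if (fd < 0) {
    return -1;
  }

  // Switch this thread into the container's network namespace.
  if (setns(fd, CLONE_NEWNET) != 0) {
    close(fd);
    return -1;
  }
  close(fd);

  std::ifstream stats("/sys/class/net/" + iface + "/statistics/rx_bytes");
  long long bytes = -1;
  stats >> bytes;
  return bytes;
}

int main()
{
  // Hypothetical container pid and interface name, for illustration only.
  std::cout << rxBytes(12345, "eth0") << std::endl;
  return 0;
}
{code}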



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9183) IntervalSet up bound is one off

2018-08-24 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16592304#comment-16592304
 ] 

Jie Yu commented on MESOS-9183:
---

This is probably because we convert everything to 
`boost::icl::interval_bounds::static_right_open`, causing an overflow in this 
case.
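
A minimal standalone sketch of the suspected overflow (not the stout `IntervalSet` code; the names below are illustrative): converting the closed upper bound to the right-open form adds one, which wraps around on the unsigned 16-bit domain.

{code}
#include <cstdint>
#include <iostream>

// Illustrative sketch of the suspected issue, not the stout implementation:
// a closed interval [lower, upper] on an unsigned 16-bit domain is converted
// to the right-open form [lower, upper + 1). When upper == 65535 the new
// upper bound wraps to 0, so the interval becomes empty.
int main()
{
  uint16_t lower = 0;
  uint16_t upper = 65535;

  uint16_t openUpper = static_cast<uint16_t>(upper + 1); // wraps to 0

  bool empty = !(lower < openUpper); // [0, 0) is empty
  std::cout << "right-open upper bound: " << openUpper
            << ", empty: " << std::boolalpha << empty << std::endl;

  return 0;
}
{code}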

> IntervalSet up bound is one off
> -
>
> Key: MESOS-9183
> URL: https://issues.apache.org/jira/browse/MESOS-9183
> Project: Mesos
>  Issue Type: Bug
>  Components: stout
>Reporter: Xudong Ni
>Priority: Minor
>
> The unsigned int 16 range is [0, 65535]; if we try to set this range, the 
> set will be "{}".
> Example code:
> {quote}IntervalSet<uint16_t> set;
> set += (Bound<uint16_t>::closed(0), Bound<uint16_t>::closed(65535));
> Results: "{}"; Expected: "[0, 65535]"
> {quote}
> If we decrease the upper bound by 1 to 65534, it works normally.
> {quote}IntervalSet<uint16_t> set;
> set += (Bound<uint16_t>::closed(0), Bound<uint16_t>::closed(65534));
> Results: "[0, 65535)"; Expected: "[0, 65535)"
> {quote}
> It appears the upper bound is one off; since IntervalSet is a template, 
> other types may have the same issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9174) Unexpected containers transition from RUNNING to DESTROYING during recovery

2018-08-22 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589384#comment-16589384
 ] 

Jie Yu commented on MESOS-9174:
---

Yeah, I think that can explain why the container gets killed when the agent is 
restarted, because all container processes are now part of the agent's cgroup 
(under the systemd named hierarchy).



> Unexpected containers transition from RUNNING to DESTROYING during recovery
> ---
>
> Key: MESOS-9174
> URL: https://issues.apache.org/jira/browse/MESOS-9174
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.5.0, 1.6.1
>Reporter: Stephan Erb
>Priority: Major
> Attachments: mesos-agent.log, mesos-executor-stderr.log
>
>
> I am trying to hunt down a weird issue where sometimes restarting a Mesos 
> agent takes down all Mesos containers. The containers die without an apparent 
> cause:
> {code}
> I0821 13:35:01.486346 61392 linux_launcher.cpp:360] Recovered container 
> 02da7be0-271e-449f-9554-dc776adb29a9
> I0821 13:35:03.627367 61362 provisioner.cpp:451] Recovered container 
> 02da7be0-271e-449f-9554-dc776adb29a9
> I0821 13:35:03.701448 61375 containerizer.cpp:2835] Container 
> 02da7be0-271e-449f-9554-dc776adb29a9 has exited
> I0821 13:35:03.701453 61375 containerizer.cpp:2382] Destroying container 
> 02da7be0-271e-449f-9554-dc776adb29a9 in RUNNING state
> I0821 13:35:03.701457 61375 containerizer.cpp:2996] Transitioning the state 
> of container 02da7be0-271e-449f-9554-dc776adb29a9 from RUNNING to DESTROYING
> {code}
> From the perspective of the executor, there is nothing relevant in the logs. 
> Everything just stops directly as if the container gets terminated externally 
> without notifying the executor first. For further details, please see the 
> attached agent log and one (example) executor log file.
> I am aware that this is a long shot, but does anyone have an idea what I 
> should be looking at to narrow down the issue?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9171) Mesos agent crashes when usage is queried

2018-08-21 Thread Jie Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu reassigned MESOS-9171:
-

Assignee: Sergey Urbanovich

> Mesos agent crashes when usage is queried
> -
>
> Key: MESOS-9171
> URL: https://issues.apache.org/jira/browse/MESOS-9171
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.7.0
>Reporter: Yan Xu
>Assignee: Sergey Urbanovich
>Priority: Major
>
> The error:
> {noformat:title=}
> ../../3rdparty/stout/include/stout/option.hpp:118: const T& Option::get() 
> const & [with T = std::basic_string]: Assertion `isSome()' failed.
> {noformat}
> The backtrace:
> {noformat:title=}
> Program terminated with signal SIGABRT, Aborted.
> #0  0x7fd0ab922495 in raise () from /lib64/libc.so.6
> #0  0x7fd0ab922495 in raise () from /lib64/libc.so.6
> #1  0x7fd0ab923c75 in abort () from /lib64/libc.so.6
> #2  0x7fd0ab91b60e in __assert_fail_base () from /lib64/libc.so.6
> #3  0x7fd0ab91b6d0 in __assert_fail () from /lib64/libc.so.6
> #4  0x7fd0ae473c33 in Option::get() const & 
> (this=0x7fd0a4deb5a8) at ../../3rdparty/stout/include/stout/option.hpp:118
> #5  0x7fd0ae48ae94 in get (this=0x7fd0a4deb5a8) at 
> /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/unordered_set.h:93
> #6  mesos::internal::slave::NetworkCniIsolatorProcess::usage 
> (this=0x7fd0a4dea800, containerId=...) at 
> ../../src/slave/containerizer/mesos/isolators/network/cni/cni.cpp:1516
> #7  0x7fd0ae1770da in operator() (process=, a0=..., 
> promise=..., __closure=) at 
> ../../3rdparty/libprocess/include/process/dispatch.hpp:354
> #8  invoke&, process::Future 
> (T::*)(P0), A0&&) [with R = mesos::ResourceStatistics; T = 
> mesos::internal::slave::MesosIsolatorProcess; P0 = const mesos::ContainerID&; 
> A0 = const 
> mesos::ContainerID&]::,
>  std::default_delete > >, 
> std::decay::type&&, process::ProcessBase*)>, 
> std::unique_ptr, 
> std::default_delete > >, 
> mesos::ContainerID, process::ProcessBase*> (f=...) at 
> ../../3rdparty/stout/include/stout/cpp17.hpp:42
> #9  invoke_expand&, 
> process::Future (T::*)(P0), A0&&) [with R = mesos::ResourceStatistics; T = 
> mesos::internal::slave::MesosIsolatorProcess; P0 = const mesos::ContainerID&; 
> A0 = const 
> mesos::ContainerID&]::,
>  std::default_delete > >, 
> std::decay::type&&, process::ProcessBase*)>, 
> std::tuple, 
> std::default_delete > >, 
> mesos::ContainerID, std::_Placeholder<1> >, 
> std::tuple, 0ul, 1ul, 2ul> (args=..., 
> bound_args=..., f=...) at ../../3rdparty/stout/include/stout/lambda.hpp:292
> #10 operator() (this=) at 
> ../../3rdparty/stout/include/stout/lambda.hpp:331
> #11 invoke process::PID&, process::Future (T::*)(P0), A0&&) [with R = 
> mesos::ResourceStatistics; T = mesos::internal::slave::MesosIsolatorProcess; 
> P0 = const mesos::ContainerID&; A0 = const 
> mesos::ContainerID&]::,
>  std::default_delete > >, 
> std::decay::type&&, process::ProcessBase*)>, 
> std::unique_ptr, 
> std::default_delete > >, 
> mesos::ContainerID, std::_Placeholder<1> >, process::ProcessBase*> (f=...) at 
> ../../3rdparty/stout/include/stout/cpp17.hpp:42
> #12 operator() process::PID&, process::Future (T::*)(P0), A0&&) [with R = 
> mesos::ResourceStatistics; T = mesos::internal::slave::MesosIsolatorProcess; 
> P0 = const mesos::ContainerID&; A0 = const 
> mesos::ContainerID&]::,
>  std::default_delete > >, 
> std::decay::type&&, process::ProcessBase*)>, 
> std::unique_ptr, 
> std::default_delete > >, 
> mesos::ContainerID, std::_Placeholder<1> >, process::ProcessBase*> (f=..., 
> this=) at ../../3rdparty/stout/include/stout/lambda.hpp:398
> #13 lambda::CallableOnce (process::ProcessBase*)>::CallableFn
>  process::dispatch mesos::internal::slave::MesosIsolatorProcess, mesos::ContainerID const&, 
> mesos::ContainerID 
> const&>(process::PID const&, 
> process::Future 
> (mesos::internal::slave::MesosIsolatorProcess::*)(mesos::ContainerID const&), 
> mesos::ContainerID 
> const&)::{lambda(std::unique_ptr, 
> std::default_delete > >, 
> mesos::ContainerID&&, process::ProcessBase*)#1}, 
> std::unique_ptr, 
> std::default_delete > >, 
> mesos::ContainerID, std::_Placeholder<1> > 
> >::operator()(process::ProcessBase*&&) && (this=0x7fd099a2a630, 
> args#0=) at ../../3rdparty/stout/include/stout/lambda.hpp:463
> #14 0x7fd0aed493a2 in operator() (args#0=0x7fd0a4deb6b8, this= out>) at ../../../3rdparty/stout/include/stout/lambda.hpp:443
> #15 process::ProcessBase::consume(process::DispatchEvent&&) (this= out>, event=...) at ../../../3rdparty/libprocess/src/process.cpp:3563
> #16 0x7fd0aed88609 in serve (event=..., this=0x7fd0a4deb6b8) at 
> ../../../3rdparty/libprocess/include/process/process.hpp:87
> #17 process::ProcessManager::resume (this=, 
> process=0x7fd0a4deb6b8) at 

[jira] [Assigned] (MESOS-9081) cgroups::verify is expensive and is done implicitly during cgroups operations

2018-08-19 Thread Jie Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu reassigned MESOS-9081:
-

Assignee: Jie Yu

> cgroups::verify is expensive and is done implicitly during cgroups operations
> -
>
> Key: MESOS-9081
> URL: https://issues.apache.org/jira/browse/MESOS-9081
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Benjamin Mahler
>Assignee: Jie Yu
>Priority: Major
>  Labels: performance
>
> See MESOS-8418 for how this initially came up.
> Currently, many of the cgroup helper functions perform an internal verify:
> https://github.com/apache/mesos/blob/1.6.0/src/linux/cgroups.cpp#L922
> This reads /proc/mounts to see which cgroups subsystems are mounted, and 
> /proc/mounts can get rather large and expensive to read.
> The steady state case (polling /containers and /monitor/snapshot to retrieve 
> container resource usage statistics) was addressed with a short term patch in 
> MESOS-8418. However, we should consider some longer-term fixes that address 
> performance of other events that incur cgroup operations (e.g. updating 
> resources of a container, launching a container, etc):
> 1. Consider moving the verify function to public and have the isolators use 
> it where appropriate.
> 2. Complementary to 1, optimize verify (e.g. read /proc/self/mountstats as 
> suggested in MESOS-8418).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8418) mesos-agent high cpu usage because of numerous /proc/mounts reads

2018-08-19 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16585412#comment-16585412
 ] 

Jie Yu commented on MESOS-8418:
---

I posted another patch to eliminate the mount table read for cgroups creation 
and write:
https://reviews.apache.org/r/68426/


> mesos-agent high cpu usage because of numerous /proc/mounts reads
> -
>
> Key: MESOS-8418
> URL: https://issues.apache.org/jira/browse/MESOS-8418
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent, containerization
>Affects Versions: 1.4.1, 1.5.1, 1.6.1
>Reporter: Stéphane Cottin
>Assignee: Benjamin Mahler
>Priority: Critical
>  Labels: containerizer, performance
> Fix For: 1.4.2, 1.5.2, 1.6.2, 1.7.0
>
> Attachments: image-2018-08-06-13-49-03-241.png, 
> image-2018-08-06-13-49-03-317.png, mesos-agent-flamegraph.png, 
> mesos-agent.stacks.gz
>
>
> /proc/mounts is read many, many times from 
> src/(linux/fs|linux/cgroups|slave/slave).cpp.
> When using overlayfs, the /proc/mounts contents can become quite large. 
> As an example, one of our Q/A single nodes running ~150 tasks has a 
> 361-line / 201299-character /proc/mounts file.
> This 200kB file is read on this node about 25 to 150 times per second. This 
> is a (huge) waste of CPU and I/O time.
> Most of these calls are related to cgroups.
> Please consider these proposals :
> 1/ Is /proc/mounts mandatory for cgroups?
> We already have the cgroup subsystems list from /proc/cgroups.
> The only compelling information from /proc/mounts seems to be the root mount 
> point, 
> /sys/fs/cgroup/, which could be obtained by a single read at agent start.
> 2/ use /proc/self/mountstats
> {noformat}
> wc /proc/self/mounts /proc/self/mountstats
> 361 2166 201299 /proc/self/mounts
> 361 2888 50200 /proc/self/mountstats
> {noformat}
> {noformat}
> grep cgroup /proc/self/mounts
> cgroup /sys/fs/cgroup tmpfs rw,relatime,mode=755 0 0
> cgroup /sys/fs/cgroup/cpuset cgroup rw,relatime,cpuset 0 0
> cgroup /sys/fs/cgroup/cpu cgroup rw,relatime,cpu 0 0
> cgroup /sys/fs/cgroup/cpuacct cgroup rw,relatime,cpuacct 0 0
> cgroup /sys/fs/cgroup/blkio cgroup rw,relatime,blkio 0 0
> cgroup /sys/fs/cgroup/memory cgroup rw,relatime,memory 0 0
> cgroup /sys/fs/cgroup/devices cgroup rw,relatime,devices 0 0
> cgroup /sys/fs/cgroup/freezer cgroup rw,relatime,freezer 0 0
> cgroup /sys/fs/cgroup/net_cls cgroup rw,relatime,net_cls 0 0
> cgroup /sys/fs/cgroup/perf_event cgroup rw,relatime,perf_event 0 0
> cgroup /sys/fs/cgroup/net_prio cgroup rw,relatime,net_prio 0 0
> cgroup /sys/fs/cgroup/pids cgroup rw,relatime,pids 0 0
> {noformat}
> {noformat}
> grep cgroup /proc/self/mountstats
> device cgroup mounted on /sys/fs/cgroup with fstype tmpfs
> device cgroup mounted on /sys/fs/cgroup/cpuset with fstype cgroup
> device cgroup mounted on /sys/fs/cgroup/cpu with fstype cgroup
> device cgroup mounted on /sys/fs/cgroup/cpuacct with fstype cgroup
> device cgroup mounted on /sys/fs/cgroup/blkio with fstype cgroup
> device cgroup mounted on /sys/fs/cgroup/memory with fstype cgroup
> device cgroup mounted on /sys/fs/cgroup/devices with fstype cgroup
> device cgroup mounted on /sys/fs/cgroup/freezer with fstype cgroup
> device cgroup mounted on /sys/fs/cgroup/net_cls with fstype cgroup
> device cgroup mounted on /sys/fs/cgroup/perf_event with fstype cgroup
> device cgroup mounted on /sys/fs/cgroup/net_prio with fstype cgroup
> device cgroup mounted on /sys/fs/cgroup/pids with fstype cgroup
> {noformat}
> This file contains all the required information, and is 4x smaller
> 3/ microcaching
> Caching cgroups data for just 1 second would be a huge performance 
> improvement, but I'm not aware of the possible side effects.
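
A minimal sketch of the microcaching idea from proposal 3, assuming a single caller (e.g., one libprocess actor) so no locking is shown; the class, names, and the 1-second TTL are illustrative, not the actual Mesos code:

{code}
#include <chrono>
#include <fstream>
#include <sstream>
#include <string>

// Illustrative micro-cache for the mount table: re-read /proc/self/mounts at
// most once per second and serve the cached contents in between. Assumes a
// single-threaded caller; this is not the actual Mesos/stout implementation.
class MountTableCache
{
public:
  const std::string& contents()
  {
    auto now = std::chrono::steady_clock::now();
    if (cached.empty() || now - lastRead > std::chrono::seconds(1)) {
      std::ifstream file("/proc/self/mounts");
      std::ostringstream buffer;
      buffer << file.rdbuf();
      cached = buffer.str();
      lastRead = now;
    }
    return cached;
  }

private:
  std::string cached;
  std::chrono::steady_clock::time_point lastRead{};
};
{code}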



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9142) CNI detach might fail due to missing network config file.

2018-08-13 Thread Jie Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu reassigned MESOS-9142:
-

Assignee: Jie Yu

> CNI detach might fail due to missing network config file.
> -
>
> Key: MESOS-9142
> URL: https://issues.apache.org/jira/browse/MESOS-9142
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.4.1, 1.5.1, 1.6.1, 1.7.0
>Reporter: Jie Yu
>Assignee: Jie Yu
>Priority: Major
>
> Observed this in one of the scale tests:
> If the container is destroyed while in the PREPARING state, the network config 
> file won't be there. The CNI detach in that case should return success 
> because attach has not been called yet.
> {code}
> Aug 08 23:52:03 ip-10-0-3-181.us-west-2.compute.internal mesos-agent[4022]: 
> I0808 23:52:03.762208  4045 slave.cpp:3562] Launching container 
> 278d88a8-2bd1-4359-8c70-25dc90879cdc for executor 
> 'testcni2.09d9faa5-9b66-11e8-b04f-e2837c6f5f2a' of framework 
> 4397555f-b8fd-478c-907a-7532e2f5e32e-0001
> Aug 08 23:53:03 ip-10-0-3-181.us-west-2.compute.internal mesos-agent[4022]: 
> I0808 23:53:03.232110  4047 containerizer.cpp:1211] Starting container 
> 278d88a8-2bd1-4359-8c70-25dc90879cdc
> Aug 08 23:53:03 ip-10-0-3-181.us-west-2.compute.internal mesos-agent[4022]: 
> I0808 23:53:03.249105  4049 provisioner.cpp:546] Provisioning image rootfs 
> '/var/lib/mesos/slave/provisioner/containers/278d88a8-2bd1-4359-8c70-25dc90879cdc/backends/overlay/rootfses/25dc9ff3-a4ad-4b59-bdf7-02f7e741c729'
>  for container 278d88a8-2bd1-4359-8c70-25dc90879cdc using overlay backend
> Aug 08 23:53:31 ip-10-0-3-181.us-west-2.compute.internal mesos-agent[4022]: 
> I0808 23:53:31.367316  4047 containerizer.cpp:2981] Transitioning the state 
> of container 278d88a8-2bd1-4359-8c70-25dc90879cdc from PROVISIONING to 
> PREPARING
> Aug 08 23:55:22 ip-10-0-3-181.us-west-2.compute.internal mesos-agent[4022]: 
> I0808 23:55:22.988091  4047 memory.cpp:479] Started listening for OOM events 
> for container 278d88a8-2bd1-4359-8c70-25dc90879cdc
> Aug 08 23:55:23 ip-10-0-3-181.us-west-2.compute.internal mesos-agent[4022]: 
> I0808 23:55:23.196844  4047 memory.cpp:591] Started listening on 'low' memory 
> pressure events for container 278d88a8-2bd1-4359-8c70-25dc90879cdc
> Aug 08 23:55:23 ip-10-0-3-181.us-west-2.compute.internal mesos-agent[4022]: 
> I0808 23:55:23.277029  4047 memory.cpp:591] Started listening on 'medium' 
> memory pressure events for container 278d88a8-2bd1-4359-8c70-25dc90879cdc
> Aug 08 23:55:23 ip-10-0-3-181.us-west-2.compute.internal mesos-agent[4022]: 
> I0808 23:55:23.407688  4047 memory.cpp:591] Started listening on 'critical' 
> memory pressure events for container 278d88a8-2bd1-4359-8c70-25dc90879cdc
> Aug 08 23:57:31 ip-10-0-3-181.us-west-2.compute.internal mesos-agent[4022]: 
> I0808 23:57:31.760859  4043 memory.cpp:199] Updated 
> 'memory.soft_limit_in_bytes' to 64MB for container 
> 278d88a8-2bd1-4359-8c70-25dc90879cdc
> Aug 08 23:57:31 ip-10-0-3-181.us-west-2.compute.internal mesos-agent[4022]: 
> I0808 23:57:31.878435  4043 memory.cpp:228] Updated 'memory.limit_in_bytes' 
> to 64MB for container 278d88a8-2bd1-4359-8c70-25dc90879cdc
> Aug 08 23:57:35 ip-10-0-3-181.us-west-2.compute.internal mesos-agent[4022]: 
> I0808 23:57:35.892071  4044 cpu.cpp:101] Updated 'cpu.shares' to 112 (cpus 
> 0.11) for container 278d88a8-2bd1-4359-8c70-25dc90879cdc
> Aug 08 23:57:36 ip-10-0-3-181.us-west-2.compute.internal mesos-agent[4022]: 
> I0808 23:57:36.132098  4044 cpu.cpp:121] Updated 'cpu.cfs_period_us' to 100ms 
> and 'cpu.cfs_quota_us' to 11ms (cpus 0.11) for container 
> 278d88a8-2bd1-4359-8c70-25dc90879cdc
> Aug 09 00:02:01 ip-10-0-3-181.us-west-2.compute.internal mesos-agent[4022]: 
> I0809 00:02:01.791821  4047 isolator_module.cpp:68] Container prepare: 
> container_id[value: "278d88a8-2bd1-4359-8c70-25dc90879cdc"] 
> container_config[directory: 
> "/var/lib/mesos/slave/slaves/4397555f-b8fd-478c-907a-7532e2f5e32e-S0/frameworks/4397555f-b8fd-478c-907a-7532e2f5e32e-0001/executors/testcni2.09d9faa5-9b66-11e8-b04f-e2837c6f5f2a/runs/278d88a8-2bd1-4359-8c70-25dc90879cdc"
>  user: "root" rootfs: 
> "/var/lib/mesos/slave/provisioner/containers/278d88a8-2bd1-4359-8c70-25dc90879cdc/backends/overlay/rootfses/25dc9ff3-a4ad-4b59-bdf7-02f7e741c729"
>  docker { manifest { id: 
> "d7066268059eb2499922833bd058e9e0b9da8365bcae4b85ee22296b02fa0489" parent: 
> "f18ee96f0b1656cab52554b270f19e8df5046d307296d2146539c04565d67747" created: 
> "2018-07-06T14:14:06.393355914Z" container: 
> "b449414436d7bc63cc85dcd24abde14bcec1d92f54f2316212059ee2ed7f3a65" 
> container_config { Hostname: "b449414436d7" Env: 
> "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" User: "" 
> Cmd: "/bin/sh" Cmd: "-c" Cmd: "#(nop) " Cmd: "CMD [\"/bin/sh\"]" WorkingDir: 
> "" 

[jira] [Created] (MESOS-9150) libprocess deadlock

2018-08-10 Thread Jie Yu (JIRA)
Jie Yu created MESOS-9150:
-

 Summary: libprocess deadlock
 Key: MESOS-9150
 URL: https://issues.apache.org/jira/browse/MESOS-9150
 Project: Mesos
  Issue Type: Bug
  Components: libprocess
Affects Versions: 1.7.0
Reporter: Jie Yu


Observed this in CI

{code}
[ RUN  ] SlaveTest.KillTaskBetweenRunTaskParts
I0810 17:38:56.969700 18618 cluster.cpp:173] Creating default 'local' authorizer
I0810 17:38:56.970886 18642 master.cpp:413] Master 
b91ffdf5-b472-4fcf-8004-4c5a0cf3e2f8 (ip-172-16-10-30.ec2.internal) started on 
172.16.10.30:42270
I0810 17:38:56.970909 18642 master.cpp:416] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="hierarchical" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authenticators="crammd5" 
--authorizers="local" --credentials="/tmp/BSktU6/credentials" 
--filter_gpu_resources="true" --framework_sorter="drf" --help="false" 
--hostname_lookup="true" --http_authenticators="basic" 
--http_framework_authenticators="basic" --initialize_driver_logging="true" 
--log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
--max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
--max_completed_tasks_per_framework="1000" 
--max_unreachable_tasks_per_framework="1000" --memory_profiling="false" 
--min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--require_agent_domain="false" --role_sorter="drf" --root_submissions="true" 
--version="false" --webui_dir="/usr/local/share/mesos/webui" 
--work_dir="/tmp/BSktU6/master" --zk_session_timeout="10secs"
I0810 17:38:56.971032 18642 master.cpp:465] Master only allowing authenticated 
frameworks to register
I0810 17:38:56.971042 18642 master.cpp:471] Master only allowing authenticated 
agents to register
I0810 17:38:56.971050 18642 master.cpp:477] Master only allowing authenticated 
HTTP frameworks to register
I0810 17:38:56.971055 18642 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/BSktU6/credentials'
I0810 17:38:56.971112 18642 master.cpp:521] Using default 'crammd5' 
authenticator
I0810 17:38:56.971185 18642 http.cpp:977] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I0810 17:38:56.971247 18642 http.cpp:977] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I0810 17:38:56.971313 18642 http.cpp:977] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I0810 17:38:56.971355 18642 master.cpp:602] Authorization enabled
I0810 17:38:56.971472 18641 whitelist_watcher.cpp:77] No whitelist given
I0810 17:38:56.971596 18638 hierarchical.cpp:182] Initialized hierarchical 
allocator process
I0810 17:38:56.972218 18640 master.cpp:2083] Elected as the leading master!
I0810 17:38:56.972239 18640 master.cpp:1638] Recovering from registrar
I0810 17:38:56.972280 18640 registrar.cpp:339] Recovering registrar
I0810 17:38:56.972399 18642 registrar.cpp:383] Successfully fetched the 
registry (0B) in 97792ns
I0810 17:38:56.972452 18642 registrar.cpp:487] Applied 1 operations in 7042ns; 
attempting to update the registry
I0810 17:38:56.972585 18638 registrar.cpp:544] Successfully updated the 
registry in 113920ns
I0810 17:38:56.972620 18638 registrar.cpp:416] Successfully recovered registrar
I0810 17:38:56.972754 18638 master.cpp:1752] Recovered 0 agents from the 
registry (172B); allowing 10mins for agents to reregister
I0810 17:38:56.972770 18639 hierarchical.cpp:220] Skipping recovery of 
hierarchical allocator: nothing to recover
W0810 17:38:56.974472 18618 process.cpp:2810] Attempted to spawn already 
running process files@172.16.10.30:42270
I0810 17:38:56.974625 18618 cluster.cpp:479] Creating default 'local' authorizer
I0810 17:38:56.975348 18638 slave.cpp:268] Mesos agent started on 
(618)@172.16.10.30:42270
W0810 17:38:56.975487 18618 process.cpp:2810] Attempted to spawn already 
running process version@172.16.10.30:42270
I0810 17:38:56.975544 18638 slave.cpp:269] Flags at startup: --acls="" 
--appc_simple_discovery_uri_prefix="http://; 
--appc_store_dir="/tmp/SlaveTest_KillTaskBetweenRunTaskParts_xkhHPH/store/appc" 
--authenticate_http_readonly="true" --authenticate_http_readwrite="false" 
--authenticatee="crammd5" --authentication_backoff_factor="1secs" 
--authorizer="local" --cgroups_cpu_enable_pids_and_tids_count="false" 
--cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" 
--cgroups_limit_swap="false" 

[jira] [Created] (MESOS-9142) CNI detach might fail due to missing network config file.

2018-08-08 Thread Jie Yu (JIRA)
Jie Yu created MESOS-9142:
-

 Summary: CNI detach might fail due to missing network config file.
 Key: MESOS-9142
 URL: https://issues.apache.org/jira/browse/MESOS-9142
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.6.1, 1.5.1, 1.4.1, 1.7.0
Reporter: Jie Yu


Observed this in one of the scale tests:

If the container is destroyed while in the PREPARING state, the network config file 
won't be there. The CNI detach in that case should return success because 
attach has not been called yet.
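
A minimal sketch of that early-return, assuming a hypothetical per-container network config path passed in by the caller; this is not the actual `network/cni` isolator code:

{code}
#include <string>
#include <sys/stat.h>

// Illustrative sketch (not the actual network/cni isolator code) of the
// early-return described above: if the per-container network config file was
// never written (container destroyed while still PREPARING), detach is a
// no-op success. `networkConfigPath` is a hypothetical argument naming where
// the config would have been checkpointed; the bool return stands in for the
// isolator's Future<Nothing>.
bool detach(const std::string& networkConfigPath)
{
  struct stat info;
  if (::stat(networkConfigPath.c_str(), &info) != 0) {
    // Attach was never called for this container; nothing to clean up.
    return true;
  }

  // ... otherwise invoke the CNI plugin with the DEL command as usual.
  // (Plugin invocation elided.)
  return true;
}
{code}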

{code}
Aug 08 23:52:03 ip-10-0-3-181.us-west-2.compute.internal mesos-agent[4022]: 
I0808 23:52:03.762208  4045 slave.cpp:3562] Launching container 
278d88a8-2bd1-4359-8c70-25dc90879cdc for executor 
'testcni2.09d9faa5-9b66-11e8-b04f-e2837c6f5f2a' of framework 
4397555f-b8fd-478c-907a-7532e2f5e32e-0001
Aug 08 23:53:03 ip-10-0-3-181.us-west-2.compute.internal mesos-agent[4022]: 
I0808 23:53:03.232110  4047 containerizer.cpp:1211] Starting container 
278d88a8-2bd1-4359-8c70-25dc90879cdc
Aug 08 23:53:03 ip-10-0-3-181.us-west-2.compute.internal mesos-agent[4022]: 
I0808 23:53:03.249105  4049 provisioner.cpp:546] Provisioning image rootfs 
'/var/lib/mesos/slave/provisioner/containers/278d88a8-2bd1-4359-8c70-25dc90879cdc/backends/overlay/rootfses/25dc9ff3-a4ad-4b59-bdf7-02f7e741c729'
 for container 278d88a8-2bd1-4359-8c70-25dc90879cdc using overlay backend
Aug 08 23:53:31 ip-10-0-3-181.us-west-2.compute.internal mesos-agent[4022]: 
I0808 23:53:31.367316  4047 containerizer.cpp:2981] Transitioning the state of 
container 278d88a8-2bd1-4359-8c70-25dc90879cdc from PROVISIONING to PREPARING
Aug 08 23:55:22 ip-10-0-3-181.us-west-2.compute.internal mesos-agent[4022]: 
I0808 23:55:22.988091  4047 memory.cpp:479] Started listening for OOM events 
for container 278d88a8-2bd1-4359-8c70-25dc90879cdc
Aug 08 23:55:23 ip-10-0-3-181.us-west-2.compute.internal mesos-agent[4022]: 
I0808 23:55:23.196844  4047 memory.cpp:591] Started listening on 'low' memory 
pressure events for container 278d88a8-2bd1-4359-8c70-25dc90879cdc
Aug 08 23:55:23 ip-10-0-3-181.us-west-2.compute.internal mesos-agent[4022]: 
I0808 23:55:23.277029  4047 memory.cpp:591] Started listening on 'medium' 
memory pressure events for container 278d88a8-2bd1-4359-8c70-25dc90879cdc
Aug 08 23:55:23 ip-10-0-3-181.us-west-2.compute.internal mesos-agent[4022]: 
I0808 23:55:23.407688  4047 memory.cpp:591] Started listening on 'critical' 
memory pressure events for container 278d88a8-2bd1-4359-8c70-25dc90879cdc
Aug 08 23:57:31 ip-10-0-3-181.us-west-2.compute.internal mesos-agent[4022]: 
I0808 23:57:31.760859  4043 memory.cpp:199] Updated 
'memory.soft_limit_in_bytes' to 64MB for container 
278d88a8-2bd1-4359-8c70-25dc90879cdc
Aug 08 23:57:31 ip-10-0-3-181.us-west-2.compute.internal mesos-agent[4022]: 
I0808 23:57:31.878435  4043 memory.cpp:228] Updated 'memory.limit_in_bytes' to 
64MB for container 278d88a8-2bd1-4359-8c70-25dc90879cdc
Aug 08 23:57:35 ip-10-0-3-181.us-west-2.compute.internal mesos-agent[4022]: 
I0808 23:57:35.892071  4044 cpu.cpp:101] Updated 'cpu.shares' to 112 (cpus 
0.11) for container 278d88a8-2bd1-4359-8c70-25dc90879cdc
Aug 08 23:57:36 ip-10-0-3-181.us-west-2.compute.internal mesos-agent[4022]: 
I0808 23:57:36.132098  4044 cpu.cpp:121] Updated 'cpu.cfs_period_us' to 100ms 
and 'cpu.cfs_quota_us' to 11ms (cpus 0.11) for container 
278d88a8-2bd1-4359-8c70-25dc90879cdc
Aug 09 00:02:01 ip-10-0-3-181.us-west-2.compute.internal mesos-agent[4022]: 
I0809 00:02:01.791821  4047 isolator_module.cpp:68] Container prepare: 
container_id[value: "278d88a8-2bd1-4359-8c70-25dc90879cdc"] 
container_config[directory: 
"/var/lib/mesos/slave/slaves/4397555f-b8fd-478c-907a-7532e2f5e32e-S0/frameworks/4397555f-b8fd-478c-907a-7532e2f5e32e-0001/executors/testcni2.09d9faa5-9b66-11e8-b04f-e2837c6f5f2a/runs/278d88a8-2bd1-4359-8c70-25dc90879cdc"
 user: "root" rootfs: 
"/var/lib/mesos/slave/provisioner/containers/278d88a8-2bd1-4359-8c70-25dc90879cdc/backends/overlay/rootfses/25dc9ff3-a4ad-4b59-bdf7-02f7e741c729"
 docker { manifest { id: 
"d7066268059eb2499922833bd058e9e0b9da8365bcae4b85ee22296b02fa0489" parent: 
"f18ee96f0b1656cab52554b270f19e8df5046d307296d2146539c04565d67747" created: 
"2018-07-06T14:14:06.393355914Z" container: 
"b449414436d7bc63cc85dcd24abde14bcec1d92f54f2316212059ee2ed7f3a65" 
container_config { Hostname: "b449414436d7" Env: 
"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" User: "" 
Cmd: "/bin/sh" Cmd: "-c" Cmd: "#(nop) " Cmd: "CMD [\"/bin/sh\"]" WorkingDir: "" 
Image: 
"sha256:9025f4d6338c3425933e4734e8a27fb615d56ca8862e78bb8c4e2426f2db78bd" } 
docker_version: "17.06.2-ce" config { Hostname: "" Env: 
"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" User: "" 
Cmd: "/bin/sh" WorkingDir: "" Image: 
"sha256:9025f4d6338c3425933e4734e8a27fb615d56ca8862e78bb8c4e2426f2db78bd" } 
architecture: "amd64" os: "linux" } } executor_info { 

[jira] [Commented] (MESOS-9134) fs::MountTable::read might not be thread safe.

2018-08-07 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16572399#comment-16572399
 ] 

Jie Yu commented on MESOS-9134:
---

[~bmahler] pointed out that it's possible that getmntent calls getmntent_r.

I confirmed that this is the case:
https://github.com/bminor/glibc/blob/09533208febe923479261a27b7691abef297d604/misc/mntent.c

So closing this ticket now.
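
For reference, a minimal sketch of reading the mount table through the re-entrant glibc interface (per-call buffers rather than glibc's shared static state); this mirrors what the code path relies on but is not the stout implementation itself:

{code}
#include <mntent.h>
#include <cstdio>
#include <string>
#include <vector>

// Illustrative sketch, not stout's MountTable::read: read the mount table
// with getmntent_r, which writes into caller-provided buffers instead of
// glibc's shared static mntent, so concurrent readers on different FILE*
// streams don't interfere.
std::vector<std::string> readMountPoints(const std::string& path)
{
  std::vector<std::string> mountPoints;

  FILE* file = ::setmntent(path.c_str(), "r");
  if (file == nullptr) {
    return mountPoints;
  }

  struct mntent entry;
  char buffer[4096];

  while (::getmntent_r(file, &entry, buffer, sizeof(buffer)) != nullptr) {
    mountPoints.push_back(entry.mnt_dir);
  }

  ::endmntent(file);
  return mountPoints;
}
{code}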

> fs::MountTable::read might not be thread safe.
> --
>
> Key: MESOS-9134
> URL: https://issues.apache.org/jira/browse/MESOS-9134
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.5.1
>Reporter: Jie Yu
>Priority: Major
>
> I observed the following stack trace for the mesos agent 1.5.1 on CoreOS.
> What I don't understand is how this is possible: both the re-entrant and 
> non-reentrant versions of the code are used in different threads.
> {noformat}Thread 6 (LWP 3022):
> #0  0x7fd950a0034c in ?? () from target:/lib64/libpthread.so.0
> #1  0x7fd9509f9cf5 in pthread_mutex_lock () from 
> target:/lib64/libpthread.so.0
> #2  0x7fd9535028af in mesos::internal::fs::MountTable::read(std::string 
> const&) () from target:/opt/mesosphere/lib/libmesos-1.5.1.so
> #3  0x7fd9534dc416 in cgroups::subsystems(std::string const&) () from 
> target:/opt/mesosphere/lib/libmesos-1.5.1.so
> #4  0x7fd9534dd13f in cgroups::mounted(std::string const&, std::string 
> const&) () from target:/opt/mesosphere/lib/libmesos-1.5.1.so
> #5  0x7fd9534ddd98 in cgroups::verify(std::string const&, std::string 
> const&, std::string const&) () from 
> target:/opt/mesosphere/lib/libmesos-1.5.1.so
> #6  0x7fd9534df815 in cgroups::read(std::string const&, std::string 
> const&, std::string const&) () from 
> target:/opt/mesosphere/lib/libmesos-1.5.1.so
> #7  0x7fd9534e22b9 in cgroups::memory::usage_in_bytes(std::string const&, 
> std::string const&) () from target:/opt/mesosphere/lib/libmesos-1.5.1.so
> #8  0x7fd953557ca4 in 
> mesos::internal::slave::MemorySubsystemProcess::usage(mesos::ContainerID 
> const&, std::string const&) () from 
> target:/opt/mesosphere/lib/libmesos-1.5.1.so
> #9  0x7fd95354b91b in lambda::CallableOnce (process::ProcessBase*)>::CallableFn
>  process::dispatch mesos::internal::slave::SubsystemProcess, mesos::ContainerID const&, 
> std::string const&, mesos::ContainerID const&, std::string 
> const&>(process::PID const&, 
> process::Future 
> (mesos::internal::slave::SubsystemProcess::*)(mesos::ContainerID const&, 
> std::string const&), mesos::ContainerID const&, std::string 
> const&)::{lambda(std::unique_ptr, 
> std::default_delete > >, 
> mesos::ContainerID&&, std::string&&, process::ProcessBase*)#1}, 
> std::unique_ptr, 
> std::default_delete > >, 
> mesos::ContainerID, std::string, std::_Placeholder<1> > 
> >::operator()(process::ProcessBase*&&) && () from 
> target:/opt/mesosphere/lib/libmesos-1.5.1.so
> #10 0x7fd953d48331 in 
> process::ProcessBase::consume(process::DispatchEvent&&) () from 
> target:/opt/mesosphere/lib/libmesos-1.5.1.so
> #11 0x7fd953d5ed9c in 
> process::ProcessManager::resume(process::ProcessBase*) () from 
> target:/opt/mesosphere/lib/libmesos-1.5.1.so
> #12 0x7fd953d645a6 in 
> std::thread::_Impl  ()> >::_M_run() () from target:/opt/mesosphere/lib/libmesos-1.5.1.so
> #13 0x7fd950f13191 in ?? () from target:/lib64/libstdc++.so.6
> #14 0x7fd9509f74f0 in ?? () from target:/lib64/libpthread.so.0
> #15 0x7fd950737aed in clone () from target:/lib64/libc.so.6
> Thread 5 (LWP 3021):
> #0  0x7fd9506bcff0 in _IO_file_read () from target:/lib64/libc.so.6
> #1  0x7fd9506bdcd0 in _IO_file_underflow () from target:/lib64/libc.so.6
> #2  0x7fd9506becc1 in _IO_default_uflow () from target:/lib64/libc.so.6
> #3  0x7fd9506b21d2 in _IO_getline_info () from target:/lib64/libc.so.6
> #4  0x7fd9506bbce6 in fgets_unlocked () from target:/lib64/libc.so.6
> #5  0x7fd950730e9e in getmntent_r () from target:/lib64/libc.so.6
> #6  0x7fd9535028c1 in mesos::internal::fs::MountTable::read(std::string 
> const&) () from target:/opt/mesosphere/lib/libmesos-1.5.1.so
> #7  0x7fd9534dc416 in cgroups::subsystems(std::string const&) () from 
> target:/opt/mesosphere/lib/libmesos-1.5.1.so
> #8  0x7fd9534dd13f in cgroups::mounted(std::string const&, std::string 
> const&) () from target:/opt/mesosphere/lib/libmesos-1.5.1.so
> #9  0x7fd9534ddd98 in cgroups::verify(std::string const&, std::string 
> const&, std::string const&) () from 
> target:/opt/mesosphere/lib/libmesos-1.5.1.so
> #10 0x7fd9534df815 in cgroups::read(std::string const&, std::string 
> const&, std::string const&) () from 
> target:/opt/mesosphere/lib/libmesos-1.5.1.so
> #11 0x7fd9534e2be7 in cgroups::blkio::readEntries(std::string const&, 
> std::string const&, std::string 

[jira] [Commented] (MESOS-9134) fs::MountTable::read might not be thread safe.

2018-08-07 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16572291#comment-16572291
 ] 

Jie Yu commented on MESOS-9134:
---

One way is to run objdump on the method 
`mesos::internal::fs::MountTable::read` and see whether both code paths are 
compiled in or not.

> fs::MountTable::read might not be thread safe.
> --
>
> Key: MESOS-9134
> URL: https://issues.apache.org/jira/browse/MESOS-9134
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.5.1
>Reporter: Jie Yu
>Priority: Major
>
> I observed the following stack trace for the mesos agent 1.5.1 on CoreOS.
> What I don't understand is how this is possible: both the re-entrant and 
> non-reentrant versions of the code are used in different threads.
> {noformat}Thread 6 (LWP 3022):
> #0  0x7fd950a0034c in ?? () from target:/lib64/libpthread.so.0
> #1  0x7fd9509f9cf5 in pthread_mutex_lock () from 
> target:/lib64/libpthread.so.0
> #2  0x7fd9535028af in mesos::internal::fs::MountTable::read(std::string 
> const&) () from target:/opt/mesosphere/lib/libmesos-1.5.1.so
> #3  0x7fd9534dc416 in cgroups::subsystems(std::string const&) () from 
> target:/opt/mesosphere/lib/libmesos-1.5.1.so
> #4  0x7fd9534dd13f in cgroups::mounted(std::string const&, std::string 
> const&) () from target:/opt/mesosphere/lib/libmesos-1.5.1.so
> #5  0x7fd9534ddd98 in cgroups::verify(std::string const&, std::string 
> const&, std::string const&) () from 
> target:/opt/mesosphere/lib/libmesos-1.5.1.so
> #6  0x7fd9534df815 in cgroups::read(std::string const&, std::string 
> const&, std::string const&) () from 
> target:/opt/mesosphere/lib/libmesos-1.5.1.so
> #7  0x7fd9534e22b9 in cgroups::memory::usage_in_bytes(std::string const&, 
> std::string const&) () from target:/opt/mesosphere/lib/libmesos-1.5.1.so
> #8  0x7fd953557ca4 in 
> mesos::internal::slave::MemorySubsystemProcess::usage(mesos::ContainerID 
> const&, std::string const&) () from 
> target:/opt/mesosphere/lib/libmesos-1.5.1.so
> #9  0x7fd95354b91b in lambda::CallableOnce (process::ProcessBase*)>::CallableFn
>  process::dispatch mesos::internal::slave::SubsystemProcess, mesos::ContainerID const&, 
> std::string const&, mesos::ContainerID const&, std::string 
> const&>(process::PID const&, 
> process::Future 
> (mesos::internal::slave::SubsystemProcess::*)(mesos::ContainerID const&, 
> std::string const&), mesos::ContainerID const&, std::string 
> const&)::{lambda(std::unique_ptr, 
> std::default_delete > >, 
> mesos::ContainerID&&, std::string&&, process::ProcessBase*)#1}, 
> std::unique_ptr, 
> std::default_delete > >, 
> mesos::ContainerID, std::string, std::_Placeholder<1> > 
> >::operator()(process::ProcessBase*&&) && () from 
> target:/opt/mesosphere/lib/libmesos-1.5.1.so
> #10 0x7fd953d48331 in 
> process::ProcessBase::consume(process::DispatchEvent&&) () from 
> target:/opt/mesosphere/lib/libmesos-1.5.1.so
> #11 0x7fd953d5ed9c in 
> process::ProcessManager::resume(process::ProcessBase*) () from 
> target:/opt/mesosphere/lib/libmesos-1.5.1.so
> #12 0x7fd953d645a6 in 
> std::thread::_Impl  ()> >::_M_run() () from target:/opt/mesosphere/lib/libmesos-1.5.1.so
> #13 0x7fd950f13191 in ?? () from target:/lib64/libstdc++.so.6
> #14 0x7fd9509f74f0 in ?? () from target:/lib64/libpthread.so.0
> #15 0x7fd950737aed in clone () from target:/lib64/libc.so.6
> Thread 5 (LWP 3021):
> #0  0x7fd9506bcff0 in _IO_file_read () from target:/lib64/libc.so.6
> #1  0x7fd9506bdcd0 in _IO_file_underflow () from target:/lib64/libc.so.6
> #2  0x7fd9506becc1 in _IO_default_uflow () from target:/lib64/libc.so.6
> #3  0x7fd9506b21d2 in _IO_getline_info () from target:/lib64/libc.so.6
> #4  0x7fd9506bbce6 in fgets_unlocked () from target:/lib64/libc.so.6
> #5  0x7fd950730e9e in getmntent_r () from target:/lib64/libc.so.6
> #6  0x7fd9535028c1 in mesos::internal::fs::MountTable::read(std::string 
> const&) () from target:/opt/mesosphere/lib/libmesos-1.5.1.so
> #7  0x7fd9534dc416 in cgroups::subsystems(std::string const&) () from 
> target:/opt/mesosphere/lib/libmesos-1.5.1.so
> #8  0x7fd9534dd13f in cgroups::mounted(std::string const&, std::string 
> const&) () from target:/opt/mesosphere/lib/libmesos-1.5.1.so
> #9  0x7fd9534ddd98 in cgroups::verify(std::string const&, std::string 
> const&, std::string const&) () from 
> target:/opt/mesosphere/lib/libmesos-1.5.1.so
> #10 0x7fd9534df815 in cgroups::read(std::string const&, std::string 
> const&, std::string const&) () from 
> target:/opt/mesosphere/lib/libmesos-1.5.1.so
> #11 0x7fd9534e2be7 in cgroups::blkio::readEntries(std::string const&, 
> std::string const&, std::string const&) () from 
> target:/opt/mesosphere/lib/libmesos-1.5.1.so
> #12 0x7fd9534e3a3d in 

[jira] [Assigned] (MESOS-7404) Ensure hierarchical roles work with old Mesos agents

2018-08-07 Thread Jie Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-7404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu reassigned MESOS-7404:
-

Assignee: (was: Jie Yu)

> Ensure hierarchical roles work with old Mesos agents
> 
>
> Key: MESOS-7404
> URL: https://issues.apache.org/jira/browse/MESOS-7404
> Project: Mesos
>  Issue Type: Bug
>Reporter: Neil Conway
>Priority: Major
>  Labels: mesosphere
>
> If the Mesos master supports hierarchical roles but the agent does not, we 
> need to ensure that we avoid putting the agent into a bad state, e.g., if the 
> user creates a persistent volume.
> One approach is to use an agent capability for hierarchical roles, and 
> disallow creating persistent-volumes using a hierarchical role if the agent 
> doesn't have the capability. We could also use an agent version check, 
> although until MESOS-6975 is implemented, that will be a bit awkward.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9134) fs::MountTable::read might not be thread safe.

2018-08-03 Thread Jie Yu (JIRA)
Jie Yu created MESOS-9134:
-

 Summary: fs::MountTable::read might not be thread safe.
 Key: MESOS-9134
 URL: https://issues.apache.org/jira/browse/MESOS-9134
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.5.1
Reporter: Jie Yu


I observed the following stack trace for the mesos agent 1.5.1 on CoreOS.

What I don't understand is how this is possible: both the re-entrant and 
non-reentrant versions of the code are used in different threads.

{noformat}Thread 6 (LWP 3022):
#0  0x7fd950a0034c in ?? () from target:/lib64/libpthread.so.0
#1  0x7fd9509f9cf5 in pthread_mutex_lock () from 
target:/lib64/libpthread.so.0
#2  0x7fd9535028af in mesos::internal::fs::MountTable::read(std::string 
const&) () from target:/opt/mesosphere/lib/libmesos-1.5.1.so
#3  0x7fd9534dc416 in cgroups::subsystems(std::string const&) () from 
target:/opt/mesosphere/lib/libmesos-1.5.1.so
#4  0x7fd9534dd13f in cgroups::mounted(std::string const&, std::string 
const&) () from target:/opt/mesosphere/lib/libmesos-1.5.1.so
#5  0x7fd9534ddd98 in cgroups::verify(std::string const&, std::string 
const&, std::string const&) () from target:/opt/mesosphere/lib/libmesos-1.5.1.so
#6  0x7fd9534df815 in cgroups::read(std::string const&, std::string const&, 
std::string const&) () from target:/opt/mesosphere/lib/libmesos-1.5.1.so
#7  0x7fd9534e22b9 in cgroups::memory::usage_in_bytes(std::string const&, 
std::string const&) () from target:/opt/mesosphere/lib/libmesos-1.5.1.so
#8  0x7fd953557ca4 in 
mesos::internal::slave::MemorySubsystemProcess::usage(mesos::ContainerID 
const&, std::string const&) () from target:/opt/mesosphere/lib/libmesos-1.5.1.so
#9  0x7fd95354b91b in lambda::CallableOnce::CallableFn
 process::dispatch(process::PID const&, 
process::Future 
(mesos::internal::slave::SubsystemProcess::*)(mesos::ContainerID const&, 
std::string const&), mesos::ContainerID const&, std::string 
const&)::{lambda(std::unique_ptr, 
std::default_delete > >, 
mesos::ContainerID&&, std::string&&, process::ProcessBase*)#1}, 
std::unique_ptr, 
std::default_delete > >, 
mesos::ContainerID, std::string, std::_Placeholder<1> > 
>::operator()(process::ProcessBase*&&) && () from 
target:/opt/mesosphere/lib/libmesos-1.5.1.so
#10 0x7fd953d48331 in 
process::ProcessBase::consume(process::DispatchEvent&&) () from 
target:/opt/mesosphere/lib/libmesos-1.5.1.so
#11 0x7fd953d5ed9c in 
process::ProcessManager::resume(process::ProcessBase*) () from 
target:/opt/mesosphere/lib/libmesos-1.5.1.so
#12 0x7fd953d645a6 in 
std::thread::_Impl >::_M_run() () from target:/opt/mesosphere/lib/libmesos-1.5.1.so
#13 0x7fd950f13191 in ?? () from target:/lib64/libstdc++.so.6
#14 0x7fd9509f74f0 in ?? () from target:/lib64/libpthread.so.0
#15 0x7fd950737aed in clone () from target:/lib64/libc.so.6
Thread 5 (LWP 3021):
#0  0x7fd9506bcff0 in _IO_file_read () from target:/lib64/libc.so.6
#1  0x7fd9506bdcd0 in _IO_file_underflow () from target:/lib64/libc.so.6
#2  0x7fd9506becc1 in _IO_default_uflow () from target:/lib64/libc.so.6
#3  0x7fd9506b21d2 in _IO_getline_info () from target:/lib64/libc.so.6
#4  0x7fd9506bbce6 in fgets_unlocked () from target:/lib64/libc.so.6
#5  0x7fd950730e9e in getmntent_r () from target:/lib64/libc.so.6
#6  0x7fd9535028c1 in mesos::internal::fs::MountTable::read(std::string 
const&) () from target:/opt/mesosphere/lib/libmesos-1.5.1.so
#7  0x7fd9534dc416 in cgroups::subsystems(std::string const&) () from 
target:/opt/mesosphere/lib/libmesos-1.5.1.so
#8  0x7fd9534dd13f in cgroups::mounted(std::string const&, std::string 
const&) () from target:/opt/mesosphere/lib/libmesos-1.5.1.so
#9  0x7fd9534ddd98 in cgroups::verify(std::string const&, std::string 
const&, std::string const&) () from target:/opt/mesosphere/lib/libmesos-1.5.1.so
#10 0x7fd9534df815 in cgroups::read(std::string const&, std::string const&, 
std::string const&) () from target:/opt/mesosphere/lib/libmesos-1.5.1.so
#11 0x7fd9534e2be7 in cgroups::blkio::readEntries(std::string const&, 
std::string const&, std::string const&) () from 
target:/opt/mesosphere/lib/libmesos-1.5.1.so
#12 0x7fd9534e3a3d in cgroups::blkio::cfq::io_serviced(std::string const&, 
std::string const&) () from target:/opt/mesosphere/lib/libmesos-1.5.1.so
#13 0x7fd95354ddff in 
mesos::internal::slave::BlkioSubsystemProcess::usage(mesos::ContainerID const&, 
std::string const&) () from target:/opt/mesosphere/lib/libmesos-1.5.1.so
#14 0x7fd95354b91b in lambda::CallableOnce::CallableFn
 process::dispatch(process::PID const&, 
process::Future 
(mesos::internal::slave::SubsystemProcess::*)(mesos::ContainerID const&, 
std::string const&), mesos::ContainerID const&, std::string 
const&)::{lambda(std::unique_ptr, 
std::default_delete > >, 
mesos::ContainerID&&, std::string&&, process::ProcessBase*)#1}, 

[jira] [Assigned] (MESOS-9127) Port mapper CNI plugin might deadlock iptables on the agent.

2018-08-01 Thread Jie Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu reassigned MESOS-9127:
-

Assignee: Jie Yu

> Port mapper CNI plugin might deadlock iptables on the agent.
> 
>
> Key: MESOS-9127
> URL: https://issues.apache.org/jira/browse/MESOS-9127
> Project: Mesos
>  Issue Type: Bug
>  Components: network
>Affects Versions: 1.4.1, 1.5.1, 1.6.1
>Reporter: Jie Yu
>Assignee: Jie Yu
>Priority: Major
>
> Recently, we noticed that if one launches a lot of containers that use the 
> port mapper CNI plugin, iptables will deadlock. The symptom is as 
> follows.
> If you do any iptables command on the box, it'll get stuck on acquiring the 
> xtables lock:
> {noformat}
> core@ip-10-0-2-99 ~ $ time iptables -w -t nat -S UCR-DEFAULT-BRIDGE 
> Another app is currently holding the xtables lock; waiting for it to exit...
> Another app is currently holding the xtables lock; waiting for it to exit...
> Another app is currently holding the xtables lock; waiting for it to exit...
> Another app is currently holding the xtables lock; waiting for it to exit...
> Another app is currently holding the xtables lock; waiting for it to exit...
> Another app is currently holding the xtables lock; waiting for it to exit...
> Another app is currently holding the xtables lock; waiting for it to exit...
> Another app is currently holding the xtables lock; waiting for it to exit...
> Another app is currently holding the xtables lock; waiting for it to exit...
> Another app is currently holding the xtables lock; waiting for it to exit...
> Another app is currently holding the xtables lock; waiting for it to exit...
> Another app is currently holding the xtables lock; waiting for it to exit...
> Another app is currently holding the xtables lock; waiting for it to exit...
> Another app is currently holding the xtables lock; waiting for it to exit...
> Another app is currently holding the xtables lock; waiting for it to exit...
> Another app is currently holding the xtables lock; waiting for it to exit...
> Another app is currently holding the xtables lock; waiting for it to exit...
> Another app is currently holding the xtables lock; waiting for it to exit...
> Another app is currently holding the xtables lock; waiting for it to exit...
> Another app is currently holding the xtables lock; waiting for it to exit...
> Another app is currently holding the xtables lock; waiting for it to exit...
> ^C
> real0m41.349s
> user0m0.001s
> sys0m0.000s
> {noformat}
> And you'll notice a lot of iptables and mesos port mapper CNI plugin 
> processes on the box:
> {noformat}
> $ ps -fp $(pidof mesos-cni-port-mapper) | wc -l
> 191
> $ ps -fp $(pidof iptables) | wc -l
> 192
> {noformat}
> Then, we look into the process that is holding the xtables lock:
> {noformat}
> $ sudo netstat -p -n | grep xtables
> unix  2  [ ] STREAM   225083   25048/iptables 
>   @xtables
> $ ps aux | grep 25048
> root 25048  0.0  0.0  26184  2512 ?S22:41   0:00 iptables -w 
> -t nat -S UCR-DEFAULT-BRIDGE
> core 31857  0.0  0.0   6760   976 pts/0S+   23:18   0:00 grep 
> --colour=auto 25048
> $ sudo strace -s 1 -p 25048
> Process 25048 attached
> write(1, "-dport 13839 -m comment --comment \"container_id: 
> a074efef-c119-4764-a584-87c57e6b7f68\" -j DNAT --to-destination 
> 172.31.254.183:80\n-A UCR-DEFAULT-BRIDGE ! -i ucr-br0 -p tcp -m tcp --dport 
> 17282 -m comment --comment \"container_id: 
> 47880db5-90ad-4034-9c2b-2fd246d42342\" -j DNAT --to-destination 
> 172.31.254.126:80\n-A UCR-DEFAULT-BRIDGE ! -i ucr-br0 -p tcp -m tcp --dport 
> 28712 -m comment --comment \"container_id: 
> 3293d5d0-772c-48d2-bafd-1c4b4d56247e\" -j DNAT --to-destination 
> 172.31.254.130:80\n-A UCR-DEFAULT-BRIDGE ! -i ucr-br0 -p tcp -m tcp --dport 
> 23893 -m comment --comment \"container_id: 
> f57de8eb-a3b9-44cb-8dac-7ab261bc8aac\" -j DNAT --to-destination 
> 172.31.254.149:80\n-A UCR-DEFAULT-BRIDGE ! -i ucr-br0 -p tcp -m tcp --dport 
> 28449 -m comment --comment \"container_id: 
> 9238dbf0-7b28-4fda-880a-bf0c8f40562a\" -j DNAT --to-destination 
> 172.31.254.190:80\n-A UCR-DEFAULT-BRIDGE ! -i ucr-br0 -p tcp -m tcp --dport 
> 26438 -m comment --comment \"container_id: 
> d307cf58-8972-4de4-ad45-26c29786add0\" -j DNAT --to-destination 
> 172.31.254.187:80\n-A UCR-DEFAULT-BRIDGE ! -i ucr-br0 -p tcp -m tcp --dport 
> 7682 -m comment --comment \"container_id: 
> 60f5a61b-f4c0-4846-b2cd-63cd7eb5a4e8\" -j DNAT --to-destination 
> 172.31.254.177:80\n-A UCR-DEFAULT-BRIDGE ! -i ucr-br0 -p tcp -m tcp --dport 
> 23904 -m comment --comment \"container_id: 
> f203ff9e-7b81-4e54-ab44-d45e2a937f38\" -j DNAT --to-destination 
> 172.31.254.157:80\n-A UCR-DEFAULT-BRIDGE ! 

[jira] [Commented] (MESOS-9127) Port mapper CNI plugin might deadlock iptables on the agent.

2018-08-01 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16566357#comment-16566357
 ] 

Jie Yu commented on MESOS-9127:
---

https://reviews.apache.org/r/68158/

> Port mapper CNI plugin might deadlock iptables on the agent.
> 
>
> Key: MESOS-9127
> URL: https://issues.apache.org/jira/browse/MESOS-9127
> Project: Mesos
>  Issue Type: Bug
>  Components: network
>Affects Versions: 1.4.1, 1.5.1, 1.6.1
>Reporter: Jie Yu
>Priority: Major
>
> Recently, we noticed that if one launches a lot of containers that use the 
> port mapper CNI plugin, iptables will deadlock. The symptom is as 
> follows.
> If you do any iptables command on the box, it'll get stuck on acquiring the 
> xtables lock:
> {noformat}
> core@ip-10-0-2-99 ~ $ time iptables -w -t nat -S UCR-DEFAULT-BRIDGE 
> Another app is currently holding the xtables lock; waiting for it to exit...
> Another app is currently holding the xtables lock; waiting for it to exit...
> Another app is currently holding the xtables lock; waiting for it to exit...
> Another app is currently holding the xtables lock; waiting for it to exit...
> Another app is currently holding the xtables lock; waiting for it to exit...
> Another app is currently holding the xtables lock; waiting for it to exit...
> Another app is currently holding the xtables lock; waiting for it to exit...
> Another app is currently holding the xtables lock; waiting for it to exit...
> Another app is currently holding the xtables lock; waiting for it to exit...
> Another app is currently holding the xtables lock; waiting for it to exit...
> Another app is currently holding the xtables lock; waiting for it to exit...
> Another app is currently holding the xtables lock; waiting for it to exit...
> Another app is currently holding the xtables lock; waiting for it to exit...
> Another app is currently holding the xtables lock; waiting for it to exit...
> Another app is currently holding the xtables lock; waiting for it to exit...
> Another app is currently holding the xtables lock; waiting for it to exit...
> Another app is currently holding the xtables lock; waiting for it to exit...
> Another app is currently holding the xtables lock; waiting for it to exit...
> Another app is currently holding the xtables lock; waiting for it to exit...
> Another app is currently holding the xtables lock; waiting for it to exit...
> Another app is currently holding the xtables lock; waiting for it to exit...
> ^C
> real0m41.349s
> user0m0.001s
> sys0m0.000s
> {noformat}
> And you'll notice a lot of iptables and mesos port mapper CNI plugin 
> processes on the box:
> {noformat}
> $ ps -fp $(pidof mesos-cni-port-mapper) | wc -l
> 191
> $ ps -fp $(pidof iptables) | wc -l
> 192
> {noformat}
> Then, we look into the process that is holding the xtables lock:
> {noformat}
> $ sudo netstat -p -n | grep xtables
> unix  2  [ ] STREAM   225083   25048/iptables 
>   @xtables
> $ ps aux | grep 25048
> root 25048  0.0  0.0  26184  2512 ?S22:41   0:00 iptables -w 
> -t nat -S UCR-DEFAULT-BRIDGE
> core 31857  0.0  0.0   6760   976 pts/0S+   23:18   0:00 grep 
> --colour=auto 25048
> $ sudo strace -s 1 -p 25048
> Process 25048 attached
> write(1, "-dport 13839 -m comment --comment \"container_id: 
> a074efef-c119-4764-a584-87c57e6b7f68\" -j DNAT --to-destination 
> 172.31.254.183:80\n-A UCR-DEFAULT-BRIDGE ! -i ucr-br0 -p tcp -m tcp --dport 
> 17282 -m comment --comment \"container_id: 
> 47880db5-90ad-4034-9c2b-2fd246d42342\" -j DNAT --to-destination 
> 172.31.254.126:80\n-A UCR-DEFAULT-BRIDGE ! -i ucr-br0 -p tcp -m tcp --dport 
> 28712 -m comment --comment \"container_id: 
> 3293d5d0-772c-48d2-bafd-1c4b4d56247e\" -j DNAT --to-destination 
> 172.31.254.130:80\n-A UCR-DEFAULT-BRIDGE ! -i ucr-br0 -p tcp -m tcp --dport 
> 23893 -m comment --comment \"container_id: 
> f57de8eb-a3b9-44cb-8dac-7ab261bc8aac\" -j DNAT --to-destination 
> 172.31.254.149:80\n-A UCR-DEFAULT-BRIDGE ! -i ucr-br0 -p tcp -m tcp --dport 
> 28449 -m comment --comment \"container_id: 
> 9238dbf0-7b28-4fda-880a-bf0c8f40562a\" -j DNAT --to-destination 
> 172.31.254.190:80\n-A UCR-DEFAULT-BRIDGE ! -i ucr-br0 -p tcp -m tcp --dport 
> 26438 -m comment --comment \"container_id: 
> d307cf58-8972-4de4-ad45-26c29786add0\" -j DNAT --to-destination 
> 172.31.254.187:80\n-A UCR-DEFAULT-BRIDGE ! -i ucr-br0 -p tcp -m tcp --dport 
> 7682 -m comment --comment \"container_id: 
> 60f5a61b-f4c0-4846-b2cd-63cd7eb5a4e8\" -j DNAT --to-destination 
> 172.31.254.177:80\n-A UCR-DEFAULT-BRIDGE ! -i ucr-br0 -p tcp -m tcp --dport 
> 23904 -m comment --comment \"container_id: 
> f203ff9e-7b81-4e54-ab44-d45e2a937f38\" -j DNAT --to-destination 
> 172.31.254.157:80\n-A 

[jira] [Created] (MESOS-9127) Port mapper CNI plugin might deadlock iptables on the agent.

2018-08-01 Thread Jie Yu (JIRA)
Jie Yu created MESOS-9127:
-

 Summary: Port mapper CNI plugin might deadlock iptables on the 
agent.
 Key: MESOS-9127
 URL: https://issues.apache.org/jira/browse/MESOS-9127
 Project: Mesos
  Issue Type: Bug
  Components: network
Affects Versions: 1.6.1, 1.5.1, 1.4.1
Reporter: Jie Yu


Recently, we noticed that if one launches a lot of containers that use the port 
mapper CNI plugin, iptables will deadlock. The symptom is as 
follows.

If you do any iptables command on the box, it'll get stuck on acquiring the 
xtables lock:
{noformat}
core@ip-10-0-2-99 ~ $ time iptables -w -t nat -S UCR-DEFAULT-BRIDGE 
Another app is currently holding the xtables lock; waiting for it to exit...
Another app is currently holding the xtables lock; waiting for it to exit...
Another app is currently holding the xtables lock; waiting for it to exit...
Another app is currently holding the xtables lock; waiting for it to exit...
Another app is currently holding the xtables lock; waiting for it to exit...
Another app is currently holding the xtables lock; waiting for it to exit...
Another app is currently holding the xtables lock; waiting for it to exit...
Another app is currently holding the xtables lock; waiting for it to exit...
Another app is currently holding the xtables lock; waiting for it to exit...
Another app is currently holding the xtables lock; waiting for it to exit...
Another app is currently holding the xtables lock; waiting for it to exit...
Another app is currently holding the xtables lock; waiting for it to exit...
Another app is currently holding the xtables lock; waiting for it to exit...
Another app is currently holding the xtables lock; waiting for it to exit...
Another app is currently holding the xtables lock; waiting for it to exit...
Another app is currently holding the xtables lock; waiting for it to exit...
Another app is currently holding the xtables lock; waiting for it to exit...
Another app is currently holding the xtables lock; waiting for it to exit...
Another app is currently holding the xtables lock; waiting for it to exit...
Another app is currently holding the xtables lock; waiting for it to exit...
Another app is currently holding the xtables lock; waiting for it to exit...
^C

real0m41.349s
user0m0.001s
sys0m0.000s
{noformat}

And you'll notice a lot of iptables and mesos port mapper CNI plugin 
processes on the box:

{noformat}
$ ps -fp $(pidof mesos-cni-port-mapper) | wc -l
191
$ ps -fp $(pidof iptables) | wc -l
192
{noformat}

Then, we look into the process that is holding the xtables lock:

{noformat}
$ sudo netstat -p -n | grep xtables
unix  2  [ ] STREAM   225083   25048/iptables   
@xtables
$ ps aux | grep 25048
root 25048  0.0  0.0  26184  2512 ?S22:41   0:00 iptables -w -t 
nat -S UCR-DEFAULT-BRIDGE
core 31857  0.0  0.0   6760   976 pts/0S+   23:18   0:00 grep 
--colour=auto 25048
$ sudo strace -s 1 -p 25048
Process 25048 attached
write(1, "-dport 13839 -m comment --comment \"container_id: 
a074efef-c119-4764-a584-87c57e6b7f68\" -j DNAT --to-destination 
172.31.254.183:80\n-A UCR-DEFAULT-BRIDGE ! -i ucr-br0 -p tcp -m tcp --dport 
17282 -m comment --comment \"container_id: 
47880db5-90ad-4034-9c2b-2fd246d42342\" -j DNAT --to-destination 
172.31.254.126:80\n-A UCR-DEFAULT-BRIDGE ! -i ucr-br0 -p tcp -m tcp --dport 
28712 -m comment --comment \"container_id: 
3293d5d0-772c-48d2-bafd-1c4b4d56247e\" -j DNAT --to-destination 
172.31.254.130:80\n-A UCR-DEFAULT-BRIDGE ! -i ucr-br0 -p tcp -m tcp --dport 
23893 -m comment --comment \"container_id: 
f57de8eb-a3b9-44cb-8dac-7ab261bc8aac\" -j DNAT --to-destination 
172.31.254.149:80\n-A UCR-DEFAULT-BRIDGE ! -i ucr-br0 -p tcp -m tcp --dport 
28449 -m comment --comment \"container_id: 
9238dbf0-7b28-4fda-880a-bf0c8f40562a\" -j DNAT --to-destination 
172.31.254.190:80\n-A UCR-DEFAULT-BRIDGE ! -i ucr-br0 -p tcp -m tcp --dport 
26438 -m comment --comment \"container_id: 
d307cf58-8972-4de4-ad45-26c29786add0\" -j DNAT --to-destination 
172.31.254.187:80\n-A UCR-DEFAULT-BRIDGE ! -i ucr-br0 -p tcp -m tcp --dport 
7682 -m comment --comment \"container_id: 
60f5a61b-f4c0-4846-b2cd-63cd7eb5a4e8\" -j DNAT --to-destination 
172.31.254.177:80\n-A UCR-DEFAULT-BRIDGE ! -i ucr-br0 -p tcp -m tcp --dport 
23904 -m comment --comment \"container_id: 
f203ff9e-7b81-4e54-ab44-d45e2a937f38\" -j DNAT --to-destination 
172.31.254.157:80\n-A UCR-DEFAULT-BRIDGE ! -i ucr-br0 -p tcp -m tcp --dport 
8359 -m comment --comment \"container_id: 
578cc89c-83bf-46ba-9ae7-c7b89e40e739\" -j DNAT --to-destination 
172.31.254.158:80\n-A UCR-DEFAULT-BRIDGE ! -i ucr-br0 -p tcp -m tcp --dport 
28482 -m comment --comment \"container_id: 
70721adb-cd6c-4158-8b11-4ef694999203\" -j DNAT --to-destination 
172.31.254.163:80\n-A UCR-DEFAULT-BRIDGE ! -i ucr-br0 -p tcp -m tcp --dport 
2564 -m 

[jira] [Created] (MESOS-9125) Port mapper CNI plugin might fail with "Resource temporarily unavailable"

2018-08-01 Thread Jie Yu (JIRA)
Jie Yu created MESOS-9125:
-

 Summary: Port mapper CNI plugin might fail with "Resource 
temporarily unavailable"
 Key: MESOS-9125
 URL: https://issues.apache.org/jira/browse/MESOS-9125
 Project: Mesos
  Issue Type: Bug
  Components: network
Affects Versions: 1.6.1, 1.5.1, 1.4.1
Reporter: Jie Yu


https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/isolators/network/cni/plugins/port_mapper/port_mapper.cpp#L345

Looks like we're missing a `-w` for the iptables command. This will lead to 
issues like:
{noformat}
The CNI plugin 
'/opt/mesosphere/active/mesos/libexec/mesos/mesos-cni-port-mapper' failed to 
attach container a710dc89-7b22-493b-b8bb-fb80a99d5321 to CNI network 
'mesos-bridge': stdout='{"cniVersion":"0.3.0","code":103,"msg":"Failed to add 
DNAT rule with tag: Resource temporarily unavailable"}
{noformat}

This becomes more likely if there are many concurrent launches of Mesos 
containers that use the port mapper on the box.
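For illustration, here is a minimal sketch (not the actual port_mapper.cpp code; the function and parameter names are hypothetical) of how the DNAT rule command could be assembled with `-w`, so concurrent invocations wait for the xtables lock instead of failing with "Resource temporarily unavailable":

{code}
// Hypothetical sketch: build the iptables argv with `-w` so concurrent
// invocations block on the xtables lock instead of returning EAGAIN.
#include <string>
#include <vector>

std::vector<std::string> dnatRuleCommand(
    const std::string& chain,          // e.g. the plugin's DNAT chain
    int hostPort,
    const std::string& containerIp,
    int containerPort,
    const std::string& containerId)    // used to tag the rule via a comment
{
  return {
    "iptables",
    "-w",                              // wait for the xtables lock
    "-t", "nat",
    "-A", chain,
    "-p", "tcp", "-m", "tcp", "--dport", std::to_string(hostPort),
    "-m", "comment", "--comment", "container_id: " + containerId,
    "-j", "DNAT",
    "--to-destination", containerIp + ":" + std::to_string(containerPort)
  };
}
{code}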



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8418) mesos-agent high cpu usage because of numerous /proc/mounts reads

2018-07-14 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544399#comment-16544399
 ] 

Jie Yu commented on MESOS-8418:
---

Yeah, I feel that we should let the caller decide whether to call `verify` or 
not, instead of doing that for all cgroup-related operations (e.g., 
create/read/write/mount/unmount/etc.).

cc [~gilbert]

> mesos-agent high cpu usage because of numerous /proc/mounts reads
> -
>
> Key: MESOS-8418
> URL: https://issues.apache.org/jira/browse/MESOS-8418
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent, containerization
>Reporter: Stéphane Cottin
>Priority: Major
>  Labels: containerizer, performance
> Attachments: mesos-agent-flamegraph.png, mesos-agent.stacks.gz
>
>
> /proc/mounts is read many, many times from 
> src/(linux/fs|linux/cgroups|slave/slave).cpp.
> When using overlayfs, the /proc/mounts contents can become quite large. 
> As an example, one of our Q/A single nodes running ~150 tasks has a 
> 361-line / 201299-character /proc/mounts file.
> This 200kB file is read on this node about 25 to 150 times per second. This 
> is a (huge) waste of CPU and I/O time.
> Most of these calls are related to cgroups.
> Please consider these proposals :
> 1/ Is /proc/mounts mandatory for cgroups?
> We already have the cgroup subsystems list from /proc/cgroups.
> The only compelling information from /proc/mounts seems to be the root mount 
> point, /sys/fs/cgroup/, which could be obtained with a single read on agent 
> start.
> 2/ use /proc/self/mountstats
> {noformat}
> wc /proc/self/mounts /proc/self/mountstats
> 361 2166 201299 /proc/self/mounts
> 361 2888 50200 /proc/self/mountstats
> {noformat}
> {noformat}
> grep cgroup /proc/self/mounts
> cgroup /sys/fs/cgroup tmpfs rw,relatime,mode=755 0 0
> cgroup /sys/fs/cgroup/cpuset cgroup rw,relatime,cpuset 0 0
> cgroup /sys/fs/cgroup/cpu cgroup rw,relatime,cpu 0 0
> cgroup /sys/fs/cgroup/cpuacct cgroup rw,relatime,cpuacct 0 0
> cgroup /sys/fs/cgroup/blkio cgroup rw,relatime,blkio 0 0
> cgroup /sys/fs/cgroup/memory cgroup rw,relatime,memory 0 0
> cgroup /sys/fs/cgroup/devices cgroup rw,relatime,devices 0 0
> cgroup /sys/fs/cgroup/freezer cgroup rw,relatime,freezer 0 0
> cgroup /sys/fs/cgroup/net_cls cgroup rw,relatime,net_cls 0 0
> cgroup /sys/fs/cgroup/perf_event cgroup rw,relatime,perf_event 0 0
> cgroup /sys/fs/cgroup/net_prio cgroup rw,relatime,net_prio 0 0
> cgroup /sys/fs/cgroup/pids cgroup rw,relatime,pids 0 0
> {noformat}
> {noformat}
> grep cgroup /proc/self/mountstats
> device cgroup mounted on /sys/fs/cgroup with fstype tmpfs
> device cgroup mounted on /sys/fs/cgroup/cpuset with fstype cgroup
> device cgroup mounted on /sys/fs/cgroup/cpu with fstype cgroup
> device cgroup mounted on /sys/fs/cgroup/cpuacct with fstype cgroup
> device cgroup mounted on /sys/fs/cgroup/blkio with fstype cgroup
> device cgroup mounted on /sys/fs/cgroup/memory with fstype cgroup
> device cgroup mounted on /sys/fs/cgroup/devices with fstype cgroup
> device cgroup mounted on /sys/fs/cgroup/freezer with fstype cgroup
> device cgroup mounted on /sys/fs/cgroup/net_cls with fstype cgroup
> device cgroup mounted on /sys/fs/cgroup/perf_event with fstype cgroup
> device cgroup mounted on /sys/fs/cgroup/net_prio with fstype cgroup
> device cgroup mounted on /sys/fs/cgroup/pids with fstype cgroup
> {noformat}
> This file contains all the required information, and is 4x smaller.
> 3/ microcaching
> Caching cgroups data for just 1 second would be a huge performance 
> improvement, but I'm not aware of the possible side effects.
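As a rough illustration of proposal 3 above, a minimal sketch (assumed helper names, not existing Mesos/stout code) of caching the mount table for about one second between reads:

{code}
// Sketch of "microcaching" the mount table: re-read /proc/self/mounts at
// most once per second and reuse the cached contents in between.
#include <chrono>
#include <fstream>
#include <mutex>
#include <sstream>
#include <string>

static std::string readMountTable()
{
  std::ifstream file("/proc/self/mounts");
  std::ostringstream contents;
  contents << file.rdbuf();
  return contents.str();
}

const std::string& cachedMountTable()
{
  using Clock = std::chrono::steady_clock;

  static std::mutex mutex;
  static std::string cached;
  static Clock::time_point lastRead;
  static bool initialized = false;

  std::lock_guard<std::mutex> lock(mutex);

  const Clock::time_point now = Clock::now();
  if (!initialized || now - lastRead > std::chrono::seconds(1)) {
    cached = readMountTable();  // single read, reused for up to one second
    lastRead = now;
    initialized = true;
  }

  return cached;
}
{code}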



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8951) Flaky `AgentContainerAPITest.RecoverNestedContainer`

2018-06-27 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16525319#comment-16525319
 ] 

Jie Yu commented on MESOS-8951:
---

I think we should add some method to cluster::Slave to wait for it to be ready:
```
Future ready();
```
so that in test, we can
```
AWAIT_READY(slave->ready());
```

> Flaky `AgentContainerAPITest.RecoverNestedContainer`
> 
>
> Key: MESOS-8951
> URL: https://issues.apache.org/jira/browse/MESOS-8951
> Project: Mesos
>  Issue Type: Bug
> Environment: internal CI
>  master-668030da
>Reporter: Andrei Budnik
>Priority: Major
>  Labels: flaky, flaky-test
> Attachments: 
> AgentContainerAPITest.RecoverNestedContainer-badrun1.txt, 
> AgentContainerAPITest.RecoverNestedContainer-badrun2.txt
>
>
> {code:java}
> [  FAILED  ] 
> ParentChildContainerTypeAndContentType/AgentContainerAPITest.RecoverNestedContainer/9,
>  where GetParam() = (1, 0, application/json, 
> ("cgroups/cpu,cgroups/mem,filesystem/linux,namespaces/pid", "linux", 
> "ROOT_CGROUPS_")) (15297 ms)
> [  FAILED  ] 
> ParentChildContainerTypeAndContentType/AgentContainerAPITest.RecoverNestedContainer/13,
>  where GetParam() = (1, 1, application/json, 
> ("cgroups/cpu,cgroups/mem,filesystem/linux,namespaces/pid", "linux", 
> "ROOT_CGROUPS_")) (15275 ms){code}
> {code:java}
> ../../src/tests/agent_container_api_tests.cpp:596
> Failed to wait 15secs for wait
> {code}
> There is no call of `WAIT_CONTAINER` in agent logs. It looks like the request 
> wasn't delivered to the agent.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8951) Flaky `AgentContainerAPITest.RecoverNestedContainer`

2018-06-27 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16525298#comment-16525298
 ] 

Jie Yu commented on MESOS-8951:
---

It looks to me like this is a race: waiting on the receipt of 
`SlaveReregisteredMessage` does not guarantee that the agent is ready to 
process API calls. We need to make sure `reregistered` is called and finished 
before we can issue the wait nested container API call.

> Flaky `AgentContainerAPITest.RecoverNestedContainer`
> 
>
> Key: MESOS-8951
> URL: https://issues.apache.org/jira/browse/MESOS-8951
> Project: Mesos
>  Issue Type: Bug
> Environment: internal CI
>  master-668030da
>Reporter: Andrei Budnik
>Priority: Major
>  Labels: flaky, flaky-test
> Attachments: 
> AgentContainerAPITest.RecoverNestedContainer-badrun1.txt, 
> AgentContainerAPITest.RecoverNestedContainer-badrun2.txt
>
>
> {code:java}
> [  FAILED  ] 
> ParentChildContainerTypeAndContentType/AgentContainerAPITest.RecoverNestedContainer/9,
>  where GetParam() = (1, 0, application/json, 
> ("cgroups/cpu,cgroups/mem,filesystem/linux,namespaces/pid", "linux", 
> "ROOT_CGROUPS_")) (15297 ms)
> [  FAILED  ] 
> ParentChildContainerTypeAndContentType/AgentContainerAPITest.RecoverNestedContainer/13,
>  where GetParam() = (1, 1, application/json, 
> ("cgroups/cpu,cgroups/mem,filesystem/linux,namespaces/pid", "linux", 
> "ROOT_CGROUPS_")) (15275 ms){code}
> {code:java}
> ../../src/tests/agent_container_api_tests.cpp:596
> Failed to wait 15secs for wait
> {code}
> There is no call of `WAIT_CONTAINER` in agent logs. It looks like the request 
> wasn't delivered to the agent.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8987) Master asks agent to shutdown upon auth errors

2018-06-12 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16510339#comment-16510339
 ] 

Jie Yu commented on MESOS-8987:
---

Raising this to BLOCKER since it might kill all tasks in the cluster.

> Master asks agent to shutdown upon auth errors
> --
>
> Key: MESOS-8987
> URL: https://issues.apache.org/jira/browse/MESOS-8987
> Project: Mesos
>  Issue Type: Bug
>  Components: master, security
>Affects Versions: 1.4.1, 1.5.1, 1.6.0, 1.7.0
>Reporter: Gastón Kleiman
>Priority: Blocker
>  Labels: mesosphere
>
> The Mesos master sends a {{ShutdownMessage}} to an agent if there is an 
> [authentication|https://github.com/apache/mesos/blob/d733b1031350e03bce443aa287044eb4eee1053a/src/master/master.cpp#L6532-L6543]
>  or an 
> [authorization|https://github.com/apache/mesos/blob/d733b1031350e03bce443aa287044eb4eee1053a/src/master/master.cpp#L6622-L6633]
>  error during agent registration.
>  
> Upon receipt of this message, the agent kills all its tasks and commits 
> suicide. This means that transient auth errors can lead to whole agents being 
> killed along with their tasks.
> I think the master should stop sending the {{ShutdownMessage}}s in these 
> cases, or at least let the agent retry the registration a few times before 
> asking it to shut down.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8984) Multi-host Shared Storage Support using CSI.

2018-06-06 Thread Jie Yu (JIRA)
Jie Yu created MESOS-8984:
-

 Summary: Multi-host Shared Storage Support using CSI.
 Key: MESOS-8984
 URL: https://issues.apache.org/jira/browse/MESOS-8984
 Project: Mesos
  Issue Type: Epic
Reporter: Jie Yu


Volumes under this category can be accessed from multiple hosts simultaneously.

Examples of this kind of shared storage (either NFS-based or FUSE-based) are:
* NFS
* S3FS
* Glusterfs
* Portworx shared volume
* ...

The volumes in this category do not require multi-node coordination. For 
instance, a volume that can only be exclusively accessed from one node at any 
given time does not qualify for this category (e.g., an EBS volume).

The support for those volumes will be much simpler (compared to, e.g., EBS 
volumes, which should be modeled as global resources) because Mesos can simply 
connect to those volumes as long as the framework has access to the handle of 
the volume, and all the lifecycle management of the volume can be done locally 
on an agent.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8980) mesos-slave can deadlock with docker pull

2018-06-06 Thread Jie Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu reassigned MESOS-8980:
-

Assignee: Kjetil Joergensen

> mesos-slave can deadlock with docker pull
> -
>
> Key: MESOS-8980
> URL: https://issues.apache.org/jira/browse/MESOS-8980
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.4.1, 1.5.1, 1.6.0
>Reporter: Kjetil Joergensen
>Assignee: Kjetil Joergensen
>Priority: Major
>
> Similar to MESOS-1885.
> mesos-slave creates pipes for stdout/stderr for docker pull, then forks & 
> execs docker pull. When the output of docker pull exceeds the buffer allocated 
> to the pipe, mesos-slave and docker pull will deadlock: docker pull blocks 
> on writing to stdout, and mesos-slave is waiting for docker pull to exit.
> Under "normal" circumstances this seems somewhat rare, although if you have 
> enough jobs running, you'll get to a point where the sum total of buffer 
> allocated for pipes reaches fs.pipe-max-size, at which point Linux will give 
> you a single page of memory for each pipe buffer, and a moderate number of 
> layers will push you over that 4k pipe buffer.
> Is stdout for docker pull being used for anything? (Cursory testing & 
> reading said no immediate observable harm.) If not, the following should do 
> the trick:
> {code}
> diff --git a/src/docker/docker.cpp b/src/docker/docker.cpp
> index d423d56ad..daebb897b 100755
> --- a/src/docker/docker.cpp
> +++ b/src/docker/docker.cpp
> @@ -1413,7 +1413,7 @@ Future Docker::__pull(
>path,
>argv,
>Subprocess::PATH("/dev/null"),
> -  Subprocess::PIPE(),
> +  Subprocess::PATH("/dev/null"),
>Subprocess::PIPE(),
>nullptr,
>environment);
> {code}
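For reference, the deadlock mechanism described in the quoted report can be reproduced with a minimal, self-contained sketch (unrelated to the Mesos code base): the parent waits for the child to exit before ever reading the pipe, while the child blocks as soon as the pipe buffer fills up.

{code}
// Minimal repro of the pipe deadlock: the parent wait()s before reading,
// the child blocks in write() once the pipe buffer is full. Neither makes
// progress.
#include <sys/wait.h>
#include <unistd.h>

int main()
{
  int fds[2];
  if (pipe(fds) != 0) {
    return 1;
  }

  pid_t pid = fork();
  if (pid == 0) {
    close(fds[0]);
    char buffer[4096] = {0};
    for (int i = 0; i < 1024; i++) {
      if (write(fds[1], buffer, sizeof(buffer)) < 0) {  // blocks when full
        _exit(1);
      }
    }
    _exit(0);
  }

  close(fds[1]);
  waitpid(pid, nullptr, 0);  // never returns while nobody drains the pipe
  return 0;
}
{code}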



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-8980) mesos-slave can deadlock with docker pull

2018-06-05 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16502687#comment-16502687
 ] 

Jie Yu edited comment on MESOS-8980 at 6/6/18 1:57 AM:
---

Yeah, it does seem unnecessary to me given we never read the output. Can you 
send a patch on reviewboard or github and I'll get that committed.


was (Author: jieyu):
Yeah, it does seem unnecessary to me given we never read the output. Can you 
send a patch and I'll get that committed.

> mesos-slave can deadlock with docker pull
> -
>
> Key: MESOS-8980
> URL: https://issues.apache.org/jira/browse/MESOS-8980
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.4.1, 1.5.1, 1.6.0
>Reporter: Kjetil Joergensen
>Priority: Major
>
> Similar to MESOS-1885.
> mesos-slave creates pipes for stdout/stderr for docker pull, then forks & 
> execs docker pull. When the output of docker pull exceeds the buffer allocated 
> to the pipe, mesos-slave and docker pull will deadlock: docker pull blocks 
> on writing to stdout, and mesos-slave is waiting for docker pull to exit.
> Under "normal" circumstances this seems somewhat rare, although if you have 
> enough jobs running, you'll get to a point where the sum total of buffer 
> allocated for pipes reaches fs.pipe-max-size, at which point Linux will give 
> you a single page of memory for each pipe buffer, and a moderate number of 
> layers will push you over that 4k pipe buffer.
> Is stdout for docker pull being used for anything? (Cursory testing & 
> reading said no immediate observable harm.) If not, the following should do 
> the trick:
> {code}
> diff --git a/src/docker/docker.cpp b/src/docker/docker.cpp
> index d423d56ad..daebb897b 100755
> --- a/src/docker/docker.cpp
> +++ b/src/docker/docker.cpp
> @@ -1413,7 +1413,7 @@ Future Docker::__pull(
>path,
>argv,
>Subprocess::PATH("/dev/null"),
> -  Subprocess::PIPE(),
> +  Subprocess::PATH("/dev/null"),
>Subprocess::PIPE(),
>nullptr,
>environment);
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8980) mesos-slave can deadlock with docker pull

2018-06-05 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16502687#comment-16502687
 ] 

Jie Yu commented on MESOS-8980:
---

Yeah, it does seem unnecessary to me given we never read the output. Can you 
send a patch and I'll get that committed.

> mesos-slave can deadlock with docker pull
> -
>
> Key: MESOS-8980
> URL: https://issues.apache.org/jira/browse/MESOS-8980
> Project: Mesos
>  Issue Type: Bug
>Reporter: Kjetil Joergensen
>Priority: Major
>
> Similar to MESOS-1885.
> mesos-slave creates pipes for stdout/stderr for docker pull, then forks & 
> execs docker pull. When the output of docker pull exceeds the buffer allocated 
> to the pipe, mesos-slave and docker pull will deadlock: docker pull blocks 
> on writing to stdout, and mesos-slave is waiting for docker pull to exit.
> Under "normal" circumstances this seems somewhat rare, although if you have 
> enough jobs running, you'll get to a point where the sum total of buffer 
> allocated for pipes reaches fs.pipe-max-size, at which point Linux will give 
> you a single page of memory for each pipe buffer, and a moderate number of 
> layers will push you over that 4k pipe buffer.
> Is stdout for docker pull being used for anything? (Cursory testing & 
> reading said no immediate observable harm.) If not, the following should do 
> the trick:
> {code}
> diff --git a/src/docker/docker.cpp b/src/docker/docker.cpp
> index d423d56ad..daebb897b 100755
> --- a/src/docker/docker.cpp
> +++ b/src/docker/docker.cpp
> @@ -1413,7 +1413,7 @@ Future Docker::__pull(
>path,
>argv,
>Subprocess::PATH("/dev/null"),
> -  Subprocess::PIPE(),
> +  Subprocess::PATH("/dev/null"),
>Subprocess::PIPE(),
>nullptr,
>environment);
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8978) Command executor calling setsid breaks the tty support.

2018-06-05 Thread Jie Yu (JIRA)
Jie Yu created MESOS-8978:
-

 Summary: Command executor calling setsid breaks the tty support.
 Key: MESOS-8978
 URL: https://issues.apache.org/jira/browse/MESOS-8978
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.6.0, 1.5.1, 1.4.1
Reporter: Jie Yu


I was playing with 
[msh|https://github.com/mesos/mesos-go/blob/master/api/v1/cmd/msh/msh.go] (one 
example from [mesos-go|https://github.com/mesos/mesos-go]), which allows you to 
launch an interactive shell in the Mesos cluster. It works by launching a 
container with tty enabled, and then [attaching to the container 
input|https://github.com/apache/mesos/blob/master/include/mesos/v1/agent/agent.proto#L191-L201]
 using the agent operator API.

However, I got the following error:
{noformat}
Jies-MacBook-Pro:mesos-go jie$  ./msh -master 127.0.0.1:5050 -tty -interactive 
-- /bin/sh -i
...
2018/06/05 11:51:35 original window size is 156 x 45
sh: cannot set terminal process group (-1): Inappropriate ioctl for device
sh: no job control in this shell
{noformat}

If I use `-pod`, the problem goes away. This only happens if the command 
executor is used.

A bit of research suggested that this issue is related to `setsid` (see this 
[thread|https://github.com/Yelp/dumb-init/issues/51#issuecomment-227792216]). 
Looks like we do an extra 
`[setsid|https://github.com/apache/mesos/blob/1.6.x/src/launcher/executor.cpp#L512]`
 in the command executor.

The setsid() system call, which creates a new session and process group, 
detaches the spawned process from its controlling tty. Therefore programs like 
bash complain that they can't use job control. Re-attaching the controlling tty 
won't work, because the tty is still in use as the controlling tty of the 
command executor process.
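A small, self-contained sketch (not Mesos code) showing the effect: once a child calls setsid(), it has no controlling terminal anymore, so job-control calls against the inherited tty fail, which is exactly what the shell is complaining about.

{code}
// After setsid() the child is in a new session with no controlling tty, so
// tcsetpgrp() on the inherited terminal fails (typically with ENOTTY), which
// is what produces messages like "no job control in this shell".
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <sys/wait.h>
#include <unistd.h>

int main()
{
  pid_t pid = fork();
  if (pid == 0) {
    setsid();  // detach from the controlling terminal

    if (tcsetpgrp(STDIN_FILENO, getpgrp()) == -1) {
      fprintf(stderr, "tcsetpgrp: %s\n", strerror(errno));
    }
    _exit(0);
  }

  waitpid(pid, nullptr, 0);
  return 0;
}
{code}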



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8942) Master streaming API does not send (health) check updates for tasks.

2018-06-04 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16500534#comment-16500534
 ] 

Jie Yu commented on MESOS-8942:
---

The backport to 1.5.x breaks the build:
{noformat}
/usr/bin/mkdir -p examples/java
16:05:32 /usr/lib/jvm/java-openjdk/bin/javac -source 1.6 -target 1.6
\
16:05:32   -cp 
../3rdparty/zookeeper-3.4.8/zookeeper-3.4.8.jar:/home/centos/workspace/mesos/Mesos_CI-build/FLAG/Plain/label/mesos-ec2-centos-7/mesos/build/src/java/target/protobuf-java-3.5.0.jar:java/target/mesos-1.5.1.jar:../../src/examples/java
  \
16:05:32   -sourcepath ../../src/examples/java -d examples/java 
\
16:05:32   ../../src/examples/java/*.java
16:05:32 warning: [options] bootstrap class path not set in conjunction with 
-source 1.6
16:05:34 Note: Some input files use or override a deprecated API.
16:05:34 Note: Recompile with -Xlint:deprecation for details.
16:05:34 1 warning
16:05:34 /usr/lib/jvm/java-openjdk/bin/jar cf examples.jar -C examples/java .
16:05:34   CXXLDtest-helper
16:06:04 ../../src/tests/api_tests.cpp: In member function ‘virtual void 
mesos::internal::tests::MasterAPITest_SubscribersReceiveHealthUpdates_Test::TestBody()’:
16:06:04 ../../src/tests/api_tests.cpp:2310:14: error: ‘createCallSubscribe’ is 
not a member of ‘mesos::internal::tests::v1’
16:06:04mesos.send(v1::createCallSubscribe(v1::DEFAULT_FRAMEWORK_INFO));
16:06:04   ^
{noformat}

> Master streaming API does not send (health) check updates for tasks.
> 
>
> Key: MESOS-8942
> URL: https://issues.apache.org/jira/browse/MESOS-8942
> Project: Mesos
>  Issue Type: Task
>  Components: HTTP API
>Affects Versions: 1.4.1, 1.5.0, 1.6.0
>Reporter: Alexander Rukletsov
>Assignee: Alexander Rukletsov
>Priority: Major
>  Labels: api, mesosphere, streaming-api
> Fix For: 1.4.2, 1.5.2, 1.7.0, 1.6.1
>
>
> Currently, Master API subscribers get task status updates when task state 
> changes (the actual logic is [slightly more 
> complex|https://github.com/apache/mesos/blob/d7d7cfbc3e5609fc9a4e8de8203a6ecb11afeac7/src/master/master.cpp#L10794-L10841]).
>  We use task status updates to deliver health and check information to 
> schedulers, in which case task state does not change. Hence these updates are 
> filtered out and the subscribers do not get any task health updates.
> Here is a test that confirms the described behaviour: 
> https://gist.github.com/rukletsov/c079d95479fb134d137ea3ae8b7ae874



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8945) Master check failure due to CHECK_SOME(providerId).

2018-05-30 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495251#comment-16495251
 ] 

Jie Yu commented on MESOS-8945:
---

[~bbannier] Sounds good to me. The CHECK was added in 1.5 (it's not the root 
cause, but at least the master won't crash).

> Master check failure due to CHECK_SOME(providerId).
> ---
>
> Key: MESOS-8945
> URL: https://issues.apache.org/jira/browse/MESOS-8945
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Jie Yu
>Assignee: Benjamin Bannier
>Priority: Critical
>
> {noformat}
> 2018-05-15 23:19:23: I0515 23:19:23.764744  7080 master.cpp:4637] Applying 
> RESERVE operation for resources 
> [{"allocation_info":{"role":"test__integration__hello-world-role"},"name":"cpus","reservations":[{"labels":{"labels":[{"key":"resource_id","value":"bc19583a-6795-46e4-bac0-15804bf46c44"}]},"principal":"\/test\/integration\/hello-world-principal","role":"test__integration__hello-world-role","type":"DYNAMIC"}],"scalar":{"value":5.55111512312578e-17},"type":"SCALAR"}]
>  from framework 101d3dc6-e05d-4d28-acb9-9ee24886b608-0034 
> (/test/integration/hello-world) to agent 
> 101d3dc6-e05d-4d28-acb9-9ee24886b608-S6 at slave(1)@10.0.3.242:5051 
> (10.0.3.242)
> 2018-05-15 23:19:23: F0515 23:19:23.766151  7080 master.cpp:12065] 
> CHECK_SOME(providerId): Could not determine resource provider
> 2018-05-15 23:19:23: *** Check failure stack trace: ***
> 2018-05-15 23:19:23: @ 0x7f45db7a305d  google::LogMessage::Fail()
> 2018-05-15 23:19:23: @ 0x7f45db7a4e8d  google::LogMessage::SendToLog()
> 2018-05-15 23:19:23: @ 0x7f45db7a2c4c  google::LogMessage::Flush()
> 2018-05-15 23:19:23: @ 0x7f45db7a5789  
> google::LogMessageFatal::~LogMessageFatal()
> 2018-05-15 23:19:23: @ 0x7f45da603fe9  _CheckFatal::~_CheckFatal()
> 2018-05-15 23:19:23: @ 0x7f45da83755d  
> mesos::internal::master::Slave::apply()
> 2018-05-15 23:19:23: @ 0x7f45da84d4db  
> mesos::internal::master::Master::_apply()
> 2018-05-15 23:19:23: @ 0x7f45da855433  
> mesos::internal::master::Master::_accept()
> 2018-05-15 23:19:23: @ 0x7f45db6fe831  process::ProcessBase::consume()
> 2018-05-15 23:19:23: @ 0x7f45db70d73c  
> process::ProcessManager::resume()
> 2018-05-15 23:19:23: @ 0x7f45db712c36  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> 2018-05-15 23:19:23: @ 0x7f45d8987d73  (unknown)
> 2018-05-15 23:19:23: @ 0x7f45d818452c  (unknown)
> 2018-05-15 23:19:23: @ 0x7f45d7ec21dd  (unknown)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8958) LinuxDevicesIsolatorTest.ROOT_PopulateWhitelistedDevices fails on some boxes.

2018-05-25 Thread Jie Yu (JIRA)
Jie Yu created MESOS-8958:
-

 Summary: LinuxDevicesIsolatorTest.ROOT_PopulateWhitelistedDevices 
fails on some boxes.
 Key: MESOS-8958
 URL: https://issues.apache.org/jira/browse/MESOS-8958
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.7.0
Reporter: Jie Yu


{noformat}
[ RUN  ] 
DevicesTestParam/LinuxDevicesIsolatorTest.ROOT_PopulateWhitelistedDevices/1
I0525 22:06:27.214989  1438 containerizer.cpp:301] Using isolation { 
environment_secret, filesystem/linux, volume/image, docker/runtime, 
volume/sandbox_path, linux/devices, network/cni, volume/host_path }
I0525 22:06:27.217038  1438 linux_launcher.cpp:148] Using 
/sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
sh: 1: hadoop: not found
I0525 22:06:27.314198  1438 fetcher.cpp:69] Skipping URI fetcher plugin 
'hadoop' as it could not be created: Failed to create HDFS client: Hadoop 
client is not available, exit status: 32512 
I0525 22:06:27.314352  1438 local_puller.cpp:98] Creating local puller with 
docker registry '/tmp/76ui1K/registry'
I0525 22:06:27.315137  1438 provisioner.cpp:299] Using default backend 'copy' 
../../src/tests/containerizer/linux_devices_isolator_tests.cpp:111: Failure
create: Failed to create isolator 'linux/devices': Failed to obtain device ID 
for '/dev/cpu/0/cpuid': Failed to stat '/dev/cpu/0/cpuid': No such file or 
directory
[  FAILED  ] 
DevicesTestParam/LinuxDevicesIsolatorTest.ROOT_PopulateWhitelistedDevices/1, 
where GetParam() = test -r /dev/cpu/0/cpuid (300 ms)
{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8958) LinuxDevicesIsolatorTest.ROOT_PopulateWhitelistedDevices fails on some boxes.

2018-05-25 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu reassigned MESOS-8958:
-

Assignee: James Peach

> LinuxDevicesIsolatorTest.ROOT_PopulateWhitelistedDevices fails on some boxes.
> -
>
> Key: MESOS-8958
> URL: https://issues.apache.org/jira/browse/MESOS-8958
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.7.0
>Reporter: Jie Yu
>Assignee: James Peach
>Priority: Major
>
> {noformat}
> [ RUN  ] 
> DevicesTestParam/LinuxDevicesIsolatorTest.ROOT_PopulateWhitelistedDevices/1
> I0525 22:06:27.214989  1438 containerizer.cpp:301] Using isolation { 
> environment_secret, filesystem/linux, volume/image, docker/runtime, 
> volume/sandbox_path, linux/devices, network/cni, volume/host_path }
> I0525 22:06:27.217038  1438 linux_launcher.cpp:148] Using 
> /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
> sh: 1: hadoop: not found
> I0525 22:06:27.314198  1438 fetcher.cpp:69] Skipping URI fetcher plugin 
> 'hadoop' as it could not be created: Failed to create HDFS client: Hadoop 
> client is not available, exit status: 32512 
> I0525 22:06:27.314352  1438 local_puller.cpp:98] Creating local puller with 
> docker registry '/tmp/76ui1K/registry'
> I0525 22:06:27.315137  1438 provisioner.cpp:299] Using default backend 'copy' 
> ../../src/tests/containerizer/linux_devices_isolator_tests.cpp:111: Failure
> create: Failed to create isolator 'linux/devices': Failed to obtain device ID 
> for '/dev/cpu/0/cpuid': Failed to stat '/dev/cpu/0/cpuid': No such file or 
> directory
> [  FAILED  ] 
> DevicesTestParam/LinuxDevicesIsolatorTest.ROOT_PopulateWhitelistedDevices/1, 
> where GetParam() = test -r /dev/cpu/0/cpuid (300 ms)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-2199) Failing test: SlaveTest.ROOT_RunTaskWithCommandInfoWithUser

2018-05-24 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489437#comment-16489437
 ] 

Jie Yu commented on MESOS-2199:
---

https://reviews.apache.org/r/67291/

> Failing test: SlaveTest.ROOT_RunTaskWithCommandInfoWithUser
> ---
>
> Key: MESOS-2199
> URL: https://issues.apache.org/jira/browse/MESOS-2199
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Ian Downes
>Assignee: Jie Yu
>Priority: Major
>  Labels: disabled-test, mesosphere
>
> Appears that running the executor as {{nobody}} is not supported.
> [~nnielsen] can you take a look?
> Executor log:
> {noformat}
> [root@hostname build]# cat 
> /tmp/SlaveTest_ROOT_RunTaskWithCommandInfoWithUser_cxF1dY/slaves/20141219-005206-2081170186-60487-11862-S0/frameworks/20141219-005206-2081170186-60
> 487-11862-/executors/1/runs/latest/std*
> sh: /home/idownes/workspace/mesos/build/src/mesos-executor: Permission denied
> {noformat}
> Test output:
> {noformat}
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from SlaveTest
> [ RUN  ] SlaveTest.ROOT_RunTaskWithCommandInfoWithUser
> ../../src/tests/slave_tests.cpp:680: Failure
> Value of: statusRunning.get().state()
>   Actual: TASK_FAILED
> Expected: TASK_RUNNING
> ../../src/tests/slave_tests.cpp:682: Failure
> Failed to wait 10secs for statusFinished
> ../../src/tests/slave_tests.cpp:673: Failure
> Actual function call count doesn't match EXPECT_CALL(sched, 
> statusUpdate(, _))...
>  Expected: to be called twice
>Actual: called once - unsatisfied and active
> [  FAILED  ] SlaveTest.ROOT_RunTaskWithCommandInfoWithUser (10641 ms)
> [--] 1 test from SlaveTest (10641 ms total)
> [--] Global test environment tear-down
> [==] 1 test from 1 test case ran. (10658 ms total)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-2199) Failing test: SlaveTest.ROOT_RunTaskWithCommandInfoWithUser

2018-05-24 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489438#comment-16489438
 ] 

Jie Yu commented on MESOS-2199:
---

The solution (tip from [~jpe...@apache.org]) is to use $SUDO_USER instead of 
`nobody`.
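A minimal sketch (hypothetical helper, not the actual test change) of what that looks like: pick up the invoking user from $SUDO_USER when the tests run under sudo, and only fall back to `nobody` otherwise.

{code}
// Hypothetical helper: prefer $SUDO_USER (the user who invoked sudo) over
// the hard-coded `nobody`, which typically cannot execute binaries in the
// developer's build tree.
#include <cstdlib>
#include <string>

std::string unprivilegedTestUser()
{
  const char* sudoUser = std::getenv("SUDO_USER");
  return (sudoUser != nullptr && *sudoUser != '\0') ? sudoUser : "nobody";
}
{code}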

> Failing test: SlaveTest.ROOT_RunTaskWithCommandInfoWithUser
> ---
>
> Key: MESOS-2199
> URL: https://issues.apache.org/jira/browse/MESOS-2199
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Ian Downes
>Assignee: Jie Yu
>Priority: Major
>  Labels: disabled-test, mesosphere
>
> Appears that running the executor as {{nobody}} is not supported.
> [~nnielsen] can you take a look?
> Executor log:
> {noformat}
> [root@hostname build]# cat 
> /tmp/SlaveTest_ROOT_RunTaskWithCommandInfoWithUser_cxF1dY/slaves/20141219-005206-2081170186-60487-11862-S0/frameworks/20141219-005206-2081170186-60
> 487-11862-/executors/1/runs/latest/std*
> sh: /home/idownes/workspace/mesos/build/src/mesos-executor: Permission denied
> {noformat}
> Test output:
> {noformat}
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from SlaveTest
> [ RUN  ] SlaveTest.ROOT_RunTaskWithCommandInfoWithUser
> ../../src/tests/slave_tests.cpp:680: Failure
> Value of: statusRunning.get().state()
>   Actual: TASK_FAILED
> Expected: TASK_RUNNING
> ../../src/tests/slave_tests.cpp:682: Failure
> Failed to wait 10secs for statusFinished
> ../../src/tests/slave_tests.cpp:673: Failure
> Actual function call count doesn't match EXPECT_CALL(sched, 
> statusUpdate(, _))...
>  Expected: to be called twice
>Actual: called once - unsatisfied and active
> [  FAILED  ] SlaveTest.ROOT_RunTaskWithCommandInfoWithUser (10641 ms)
> [--] 1 test from SlaveTest (10641 ms total)
> [--] Global test environment tear-down
> [==] 1 test from 1 test case ran. (10658 ms total)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-2199) Failing test: SlaveTest.ROOT_RunTaskWithCommandInfoWithUser

2018-05-24 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu reassigned MESOS-2199:
-

Assignee: Jie Yu

> Failing test: SlaveTest.ROOT_RunTaskWithCommandInfoWithUser
> ---
>
> Key: MESOS-2199
> URL: https://issues.apache.org/jira/browse/MESOS-2199
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Ian Downes
>Assignee: Jie Yu
>Priority: Major
>  Labels: disabled-test, mesosphere
>
> Appears that running the executor as {{nobody}} is not supported.
> [~nnielsen] can you take a look?
> Executor log:
> {noformat}
> [root@hostname build]# cat 
> /tmp/SlaveTest_ROOT_RunTaskWithCommandInfoWithUser_cxF1dY/slaves/20141219-005206-2081170186-60487-11862-S0/frameworks/20141219-005206-2081170186-60
> 487-11862-/executors/1/runs/latest/std*
> sh: /home/idownes/workspace/mesos/build/src/mesos-executor: Permission denied
> {noformat}
> Test output:
> {noformat}
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from SlaveTest
> [ RUN  ] SlaveTest.ROOT_RunTaskWithCommandInfoWithUser
> ../../src/tests/slave_tests.cpp:680: Failure
> Value of: statusRunning.get().state()
>   Actual: TASK_FAILED
> Expected: TASK_RUNNING
> ../../src/tests/slave_tests.cpp:682: Failure
> Failed to wait 10secs for statusFinished
> ../../src/tests/slave_tests.cpp:673: Failure
> Actual function call count doesn't match EXPECT_CALL(sched, 
> statusUpdate(, _))...
>  Expected: to be called twice
>Actual: called once - unsatisfied and active
> [  FAILED  ] SlaveTest.ROOT_RunTaskWithCommandInfoWithUser (10641 ms)
> [--] 1 test from SlaveTest (10641 ms total)
> [--] Global test environment tear-down
> [==] 1 test from 1 test case ran. (10658 ms total)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8945) Master check failure due to CHECK_SOME(providerId).

2018-05-23 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu reassigned MESOS-8945:
-

Assignee: Benjamin Bannier

> Master check failure due to CHECK_SOME(providerId).
> ---
>
> Key: MESOS-8945
> URL: https://issues.apache.org/jira/browse/MESOS-8945
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Jie Yu
>Assignee: Benjamin Bannier
>Priority: Critical
>
> {noformat}
> 2018-05-15 23:19:23: I0515 23:19:23.764744  7080 master.cpp:4637] Applying 
> RESERVE operation for resources 
> [{"allocation_info":{"role":"test__integration__hello-world-role"},"name":"cpus","reservations":[{"labels":{"labels":[{"key":"resource_id","value":"bc19583a-6795-46e4-bac0-15804bf46c44"}]},"principal":"\/test\/integration\/hello-world-principal","role":"test__integration__hello-world-role","type":"DYNAMIC"}],"scalar":{"value":5.55111512312578e-17},"type":"SCALAR"}]
>  from framework 101d3dc6-e05d-4d28-acb9-9ee24886b608-0034 
> (/test/integration/hello-world) to agent 
> 101d3dc6-e05d-4d28-acb9-9ee24886b608-S6 at slave(1)@10.0.3.242:5051 
> (10.0.3.242)
> 2018-05-15 23:19:23: F0515 23:19:23.766151  7080 master.cpp:12065] 
> CHECK_SOME(providerId): Could not determine resource provider
> 2018-05-15 23:19:23: *** Check failure stack trace: ***
> 2018-05-15 23:19:23: @ 0x7f45db7a305d  google::LogMessage::Fail()
> 2018-05-15 23:19:23: @ 0x7f45db7a4e8d  google::LogMessage::SendToLog()
> 2018-05-15 23:19:23: @ 0x7f45db7a2c4c  google::LogMessage::Flush()
> 2018-05-15 23:19:23: @ 0x7f45db7a5789  
> google::LogMessageFatal::~LogMessageFatal()
> 2018-05-15 23:19:23: @ 0x7f45da603fe9  _CheckFatal::~_CheckFatal()
> 2018-05-15 23:19:23: @ 0x7f45da83755d  
> mesos::internal::master::Slave::apply()
> 2018-05-15 23:19:23: @ 0x7f45da84d4db  
> mesos::internal::master::Master::_apply()
> 2018-05-15 23:19:23: @ 0x7f45da855433  
> mesos::internal::master::Master::_accept()
> 2018-05-15 23:19:23: @ 0x7f45db6fe831  process::ProcessBase::consume()
> 2018-05-15 23:19:23: @ 0x7f45db70d73c  
> process::ProcessManager::resume()
> 2018-05-15 23:19:23: @ 0x7f45db712c36  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> 2018-05-15 23:19:23: @ 0x7f45d8987d73  (unknown)
> 2018-05-15 23:19:23: @ 0x7f45d818452c  (unknown)
> 2018-05-15 23:19:23: @ 0x7f45d7ec21dd  (unknown)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8945) Master check failure due to CHECK_SOME(providerId).

2018-05-23 Thread Jie Yu (JIRA)
Jie Yu created MESOS-8945:
-

 Summary: Master check failure due to CHECK_SOME(providerId).
 Key: MESOS-8945
 URL: https://issues.apache.org/jira/browse/MESOS-8945
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.6.0, 1.5.0
Reporter: Jie Yu


{noformat}
2018-05-15 23:19:23: I0515 23:19:23.764744  7080 master.cpp:4637] Applying 
RESERVE operation for resources 
[{"allocation_info":{"role":"test__integration__hello-world-role"},"name":"cpus","reservations":[{"labels":{"labels":[{"key":"resource_id","value":"bc19583a-6795-46e4-bac0-15804bf46c44"}]},"principal":"\/test\/integration\/hello-world-principal","role":"test__integration__hello-world-role","type":"DYNAMIC"}],"scalar":{"value":5.55111512312578e-17},"type":"SCALAR"}]
 from framework 101d3dc6-e05d-4d28-acb9-9ee24886b608-0034 
(/test/integration/hello-world) to agent 
101d3dc6-e05d-4d28-acb9-9ee24886b608-S6 at slave(1)@10.0.3.242:5051 (10.0.3.242)
2018-05-15 23:19:23: F0515 23:19:23.766151  7080 master.cpp:12065] 
CHECK_SOME(providerId): Could not determine resource provider
2018-05-15 23:19:23: *** Check failure stack trace: ***
2018-05-15 23:19:23: @ 0x7f45db7a305d  google::LogMessage::Fail()
2018-05-15 23:19:23: @ 0x7f45db7a4e8d  google::LogMessage::SendToLog()
2018-05-15 23:19:23: @ 0x7f45db7a2c4c  google::LogMessage::Flush()
2018-05-15 23:19:23: @ 0x7f45db7a5789  
google::LogMessageFatal::~LogMessageFatal()
2018-05-15 23:19:23: @ 0x7f45da603fe9  _CheckFatal::~_CheckFatal()
2018-05-15 23:19:23: @ 0x7f45da83755d  
mesos::internal::master::Slave::apply()
2018-05-15 23:19:23: @ 0x7f45da84d4db  
mesos::internal::master::Master::_apply()
2018-05-15 23:19:23: @ 0x7f45da855433  
mesos::internal::master::Master::_accept()
2018-05-15 23:19:23: @ 0x7f45db6fe831  process::ProcessBase::consume()
2018-05-15 23:19:23: @ 0x7f45db70d73c  process::ProcessManager::resume()
2018-05-15 23:19:23: @ 0x7f45db712c36  
_ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
2018-05-15 23:19:23: @ 0x7f45d8987d73  (unknown)
2018-05-15 23:19:23: @ 0x7f45d818452c  (unknown)
2018-05-15 23:19:23: @ 0x7f45d7ec21dd  (unknown)
{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8909) Scrubbing value secret from HTTP responses

2018-05-18 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481052#comment-16481052
 ] 

Jie Yu commented on MESOS-8909:
---

Skimmed the Google doc. Looks like the secret type in your setting is `VALUE`. 
The `VALUE` type should only be used in testing (or insecure clusters). In prod, 
please use the `REFERENCE` type and build a secret resolver.
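For illustration, a sketch of the reference shape (assuming the `Secret` message from mesos.proto and its generated C++ API; the include path and field names below are my assumption, so double-check against your Mesos version): a `REFERENCE` secret only carries a name for a secret resolver module to look up, so there is no plaintext to scrub from HTTP responses in the first place.

{code}
// Sketch: a REFERENCE secret only names the secret; a module-provided
// secret resolver turns it into the actual value at use time, so operator
// API responses never contain the plaintext.
#include <mesos/mesos.pb.h>  // assumed include path for the generated protos

mesos::Secret makeReferenceSecret()
{
  mesos::Secret secret;
  secret.set_type(mesos::Secret::REFERENCE);
  secret.mutable_reference()->set_name("prod/my-service/db-password");
  return secret;
}
{code}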

> Scrubbing value secret from HTTP responses
> --
>
> Key: MESOS-8909
> URL: https://issues.apache.org/jira/browse/MESOS-8909
> Project: Mesos
>  Issue Type: Task
>  Components: security
>Reporter: Zhitao Li
>Priority: Major
>  Labels: security
>
> Mesos supports a value based secret. However, I believe some HTTP endpoints 
> and v1 operator responses could leak this information.
> The goal here is to make sure these endpoints do not leak the information.
> We did some quick research and gather the following list in this [Google 
> doc|https://docs.google.com/document/d/1W26RUpYEB92eTQYbACIOem5B9hzXX59jeEIT9RB2X1o/edit#heading=h.gzvg4ec6wllm].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8920) Support per-container container logger configuration.

2018-05-17 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479896#comment-16479896
 ] 

Jie Yu commented on MESOS-8920:
---

commit a981067d2dd4a1ce6be68d3fd5a7115ce47b3a24 (HEAD -> master, origin/master, 
origin/HEAD)
Author: Jie Yu 
Date:   Thu May 17 15:27:26 2018 -0700

Added ContainerID to container logger prepare interface.

This is to allow the logger to tag the output with that information.

Review: https://reviews.apache.org/r/67202

> Support per-container container logger configuration.
> -
>
> Key: MESOS-8920
> URL: https://issues.apache.org/jira/browse/MESOS-8920
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Jie Yu
>Assignee: Jie Yu
>Priority: Major
> Fix For: 1.7.0
>
>
> Currently, the container logger only takes `ExecutorInfo`, meaning that it 
> only allows configuration at per-executor level.
> This ticket captures the work to make it configurable at per-container level 
> (nested container, standalone container, etc.)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8920) Support per-container container logger configuration.

2018-05-15 Thread Jie Yu (JIRA)
Jie Yu created MESOS-8920:
-

 Summary: Support per-container container logger configuration.
 Key: MESOS-8920
 URL: https://issues.apache.org/jira/browse/MESOS-8920
 Project: Mesos
  Issue Type: Improvement
  Components: containerization
Reporter: Jie Yu
Assignee: Jie Yu


Currently, the container logger only takes `ExecutorInfo`, meaning that it only 
allows configuration at the per-executor level.

This ticket captures the work to make it configurable at the per-container level 
(nested container, standalone container, etc.).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8830) Agent gc on old slave sandboxes could empty persistent volume data

2018-05-10 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470780#comment-16470780
 ] 

Jie Yu commented on MESOS-8830:
---

[~zhitao] can you grep for container id `904d8155-e4c3-43e3-bf01-85de6a702149` 
in the agent log?

> Agent gc on old slave sandboxes could empty persistent volume data
> --
>
> Key: MESOS-8830
> URL: https://issues.apache.org/jira/browse/MESOS-8830
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
>Reporter: Zhitao Li
>Priority: Blocker
>
> We had an issue in which custom Cassandra executors (which does not use any 
> container image thus running on host filesystem) saw its persistent volume 
> data got wiped out.
> Upon revisiting logs, we found following suspicious lines:
> {panel:title=log}
> I0424 02:06:11.716380 10980 slave.cpp:5723] Current disk usage 21.93%. Max 
> allowed age: 4.764742265646493days
> I0424 02:06:11.716883 10994 gc.cpp:170] Pruning directories with remaining 
> removal time 2.23508429704593days
> I0424 02:06:11.716943 10994 gc.cpp:170] Pruning directories with remaining 
> removal time 2.23508429587852days
> I0424 02:06:11.717183 10994 gc.cpp:133] Deleting 
> /var/lib/mesos/meta/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44
> I0424 02:06:11.727033 10994 gc.cpp:146] Deleted 
> '/var/lib/mesos/meta/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44'
> I0424 02:06:11.727094 10994 gc.cpp:133] Deleting 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44
> I0424 02:06:14.933104 10972 http.cpp:1115] HTTP GET for /slave(1)/state from 
> 127.0.0.1:53602 with User-Agent='Go-http-client/1.1'
> E0424 02:06:15.245652 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149/volume:
>  Device or resource busy
> E0424 02:06:15.394328 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149:
>  Directory not empty
> E0424 02:06:15.394419 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs:
>  Directory not empty
> E0424 02:06:15.394459 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a:
>  Directory not empty
> E0424 02:06:15.394477 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors:
>  Directory not empty
> E0424 02:06:15.394511 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004:
>  Directory not empty
> E0424 02:06:15.394536 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks:
>  Directory not empty
> E0424 02:06:15.394556 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44:
>  Directory not empty
> {panel}
> (I can try to provide more logs, depending on how much local archive after 
> rotation has)
> This happened on a 1.3.1 agent although I suspect it's not local to that 
> version.
> The path 
> */var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149/volume*
>  is a bind mount to a persistent volume. The fact that agent gc touched that 
> process makes me believe this is what triggered the data loss.
> We had some misconfigurations on out 

[jira] [Comment Edited] (MESOS-8830) Agent gc on old slave sandboxes could empty persistent volume data

2018-05-10 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470754#comment-16470754
 ] 

Jie Yu edited comment on MESOS-8830 at 5/10/18 5:08 PM:


I think the problem might be:

1) The destroy of the orphan containers can run in parallel with the old agent 
workdir gc, because the agent recovery is NOT blocked on the destroy of orphan 
containers.
2) When the old agent's workdir is scheduled for gc, since the orphan container 
cleanup is not done yet, it's possible that the bind-mounted pv is deleted.

Can you provide more agent logs around containerizer recovery? I am curious why 
the destroy of the orphan container takes so long.

But nevertheless, this is an issue. One possible way is to not follow bind 
mounts (just like we don't follow symlinks) when doing workdir gc.

Another way is to block the agent recovery until all orphans are cleaned up. 
This used to be the case, but was changed due to MESOS-2367. Now that I am 
thinking about it, orphans can hold resources (in this case, the pv); allocating 
those resources out before the cleanup is done doesn't sound like the right 
approach.


was (Author: jieyu):
I think the problem might be:

1) The destroy of the orphan containers can run in parallel with the old agent 
workdir gc, because the agent recovery is blocked on the destroy of orphan 
containers.
2) When the old agent's workdir is scheduled for gc, since the orphan container 
cleanup is not done yet, it's possible that the bind-mounted pv is deleted.

Can you provide more agent logs around containerizer recovery? I am curious why 
the destroy of the orphan container takes so long.

But nevertheless, this is an issue. One possible way is to not follow bind 
mounts (just like we don't follow symlinks) when doing workdir gc.

Another way is to block the agent recovery until all orphans are cleaned up. 
This used to be the case, but was changed due to MESOS-2367. Now that I am 
thinking about it, orphans can hold resources (in this case, the pv); allocating 
those resources out before the cleanup is done doesn't sound like the right 
approach.

> Agent gc on old slave sandboxes could empty persistent volume data
> --
>
> Key: MESOS-8830
> URL: https://issues.apache.org/jira/browse/MESOS-8830
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.3.1
>Reporter: Zhitao Li
>Priority: Blocker
>
> We had an issue in which custom Cassandra executors (which does not use any 
> container image thus running on host filesystem) saw its persistent volume 
> data got wiped out.
> Upon revisiting logs, we found following suspicious lines:
> {panel:title=log}
> I0424 02:06:11.716380 10980 slave.cpp:5723] Current disk usage 21.93%. Max 
> allowed age: 4.764742265646493days
> I0424 02:06:11.716883 10994 gc.cpp:170] Pruning directories with remaining 
> removal time 2.23508429704593days
> I0424 02:06:11.716943 10994 gc.cpp:170] Pruning directories with remaining 
> removal time 2.23508429587852days
> I0424 02:06:11.717183 10994 gc.cpp:133] Deleting 
> /var/lib/mesos/meta/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44
> I0424 02:06:11.727033 10994 gc.cpp:146] Deleted 
> '/var/lib/mesos/meta/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44'
> I0424 02:06:11.727094 10994 gc.cpp:133] Deleting 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44
> I0424 02:06:14.933104 10972 http.cpp:1115] HTTP GET for /slave(1)/state from 
> 127.0.0.1:53602 with User-Agent='Go-http-client/1.1'
> E0424 02:06:15.245652 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149/volume:
>  Device or resource busy
> E0424 02:06:15.394328 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149:
>  Directory not empty
> E0424 02:06:15.394419 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs:
>  Directory not empty
> E0424 02:06:15.394459 10994 rmdir.hpp:81] Failed to delete directory 
> 

[jira] [Commented] (MESOS-8830) Agent gc on old slave sandboxes could empty persistent volume data

2018-05-10 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470754#comment-16470754
 ] 

Jie Yu commented on MESOS-8830:
---

I think the problem might be:

1) The destroy of the orphan containers can run in parallel with the old agent 
workdir gc, because the agent recovery is blocked on the destroy of orphan 
containers.
2) When the old agent's workdir is scheduled for gc, since the orphan container 
cleanup is not done yet, it's possible that the bind-mounted pv is deleted.

Can you provide more agent logs around containerizer recovery? I am curious why 
the destroy of the orphan container takes so long.

But nevertheless, this is an issue. One possible way is to not follow bind 
mounts (just like we don't follow symlinks) when doing workdir gc.

Another way is to block the agent recovery until all orphans are cleaned up. 
This used to be the case, but was changed due to MESOS-2367. Now that I am 
thinking about it, orphans can hold resources (in this case, the pv); allocating 
those resources out before the cleanup is done doesn't sound like the right 
approach.
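To make the first option concrete, here is a minimal sketch (assumed helper, not existing Mesos/stout code) of treating mount points as a boundary during sandbox gc, analogous to not following symlinks, so a bind-mounted persistent volume under an old sandbox gets skipped instead of recursed into:

{code}
// Sketch: check whether a path is a mount point by consulting
// /proc/self/mountinfo (field 5 of each line is the mount point). A
// recursive delete would skip, or unmount first, instead of descending.
// Note: this naive parse ignores octal-escaped characters in paths.
#include <fstream>
#include <sstream>
#include <string>

bool isMountPoint(const std::string& path)
{
  std::ifstream mountinfo("/proc/self/mountinfo");
  std::string line;

  while (std::getline(mountinfo, line)) {
    std::istringstream fields(line);
    std::string field;
    for (int i = 0; i < 5 && fields >> field; ++i) {}
    if (field == path) {
      return true;
    }
  }

  return false;
}
{code}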

> Agent gc on old slave sandboxes could empty persistent volume data
> --
>
> Key: MESOS-8830
> URL: https://issues.apache.org/jira/browse/MESOS-8830
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.3.1
>Reporter: Zhitao Li
>Priority: Blocker
>
> We had an issue in which custom Cassandra executors (which does not use any 
> container image thus running on host filesystem) saw its persistent volume 
> data got wiped out.
> Upon revisiting logs, we found following suspicious lines:
> {panel:title=log}
> I0424 02:06:11.716380 10980 slave.cpp:5723] Current disk usage 21.93%. Max 
> allowed age: 4.764742265646493days
> I0424 02:06:11.716883 10994 gc.cpp:170] Pruning directories with remaining 
> removal time 2.23508429704593days
> I0424 02:06:11.716943 10994 gc.cpp:170] Pruning directories with remaining 
> removal time 2.23508429587852days
> I0424 02:06:11.717183 10994 gc.cpp:133] Deleting 
> /var/lib/mesos/meta/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44
> I0424 02:06:11.727033 10994 gc.cpp:146] Deleted 
> '/var/lib/mesos/meta/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44'
> I0424 02:06:11.727094 10994 gc.cpp:133] Deleting 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44
> I0424 02:06:14.933104 10972 http.cpp:1115] HTTP GET for /slave(1)/state from 
> 127.0.0.1:53602 with User-Agent='Go-http-client/1.1'
> E0424 02:06:15.245652 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149/volume:
>  Device or resource busy
> E0424 02:06:15.394328 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149:
>  Directory not empty
> E0424 02:06:15.394419 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs:
>  Directory not empty
> E0424 02:06:15.394459 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a:
>  Directory not empty
> E0424 02:06:15.394477 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors:
>  Directory not empty
> E0424 02:06:15.394511 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004:
>  Directory not empty
> E0424 02:06:15.394536 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks:
>  Directory not empty
> E0424 02:06:15.394556 10994 rmdir.hpp:81] Failed to delete directory 
> 

[jira] [Created] (MESOS-8901) os::children in stout is unnecessarily expensive

2018-05-09 Thread Jie Yu (JIRA)
Jie Yu created MESOS-8901:
-

 Summary: os::children in stout is unnecessarily expensive
 Key: MESOS-8901
 URL: https://issues.apache.org/jira/browse/MESOS-8901
 Project: Mesos
  Issue Type: Improvement
  Components: containerization
Reporter: Jie Yu


It uses os::processes, which gathers full process information from the proc 
filesystem, while essentially we only need the pids. This call is used in the 
container launch path, so we should consider optimizing it.
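
As a rough illustration of the cheaper approach (a standalone, hypothetical 
sketch, not the stout API, and covering direct children only), one can scan 
/proc and read just the ppid field from /proc/<pid>/stat instead of 
materializing full process information:

{code}
// Hypothetical sketch: list the direct children of `parent` by reading only
// the ppid field from /proc/<pid>/stat.
#include <dirent.h>
#include <sys/types.h>

#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <set>
#include <string>

std::set<pid_t> directChildren(pid_t parent)
{
  std::set<pid_t> result;

  DIR* proc = opendir("/proc");
  if (proc == nullptr) {
    return result;
  }

  struct dirent* entry;
  while ((entry = readdir(proc)) != nullptr) {
    char* end = nullptr;
    long pid = strtol(entry->d_name, &end, 10);
    if (*end != '\0' || pid <= 0) {
      continue; // Not a numeric /proc/<pid> entry.
    }

    // /proc/<pid>/stat looks like: "<pid> (<comm>) <state> <ppid> ...".
    const std::string path = "/proc/" + std::string(entry->d_name) + "/stat";
    FILE* file = fopen(path.c_str(), "r");
    if (file == nullptr) {
      continue; // The process may have exited already.
    }

    char buffer[4096];
    int ppid = -1;
    if (fgets(buffer, sizeof(buffer), file) != nullptr) {
      // The comm field may contain spaces, so skip past the last ')'.
      const char* close = strrchr(buffer, ')');
      if (close != nullptr) {
        sscanf(close + 1, " %*c %d", &ppid);
      }
    }
    fclose(file);

    if (ppid == static_cast<int>(parent)) {
      result.insert(static_cast<pid_t>(pid));
    }
  }

  closedir(proc);
  return result;
}
{code}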



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8900) Container statistics should be exposed as Metrics.

2018-05-09 Thread Jie Yu (JIRA)
Jie Yu created MESOS-8900:
-

 Summary: Container statistics should be exposed as Metrics.
 Key: MESOS-8900
 URL: https://issues.apache.org/jira/browse/MESOS-8900
 Project: Mesos
  Issue Type: Improvement
Reporter: Jie Yu


Currently, container statistics are not exposed as metrics, but in a customized 
format:
https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto#L1642

It would be nice to expose them as metrics with proper labeling, for instance 
using the Prometheus exposition format:
https://github.com/prometheus/docs/blob/master/content/docs/instrumenting/exposition_formats.md

That way, it would be much easier to plug into a metrics system (e.g., 
Prometheus). Also, stats like xxx_p50 and xxx_p99 could be abstracted into a 
more standard Metrics concept like Quantile.
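
For illustration only (the metric names, labels, and values below are 
hypothetical; this is not anything Mesos exposes today), a couple of 
ResourceStatistics fields rendered in the Prometheus text exposition format 
might look like:

{code}
# TYPE container_cpus_user_time_secs counter
container_cpus_user_time_secs{framework_id="f1",executor_id="e1",container_id="c1"} 1234.56
# TYPE container_mem_rss_bytes gauge
container_mem_rss_bytes{framework_id="f1",executor_id="e1",container_id="c1"} 536870912
{code}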



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8534) Allow nested containers in TaskGroups to have separate network namespaces

2018-03-27 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu reassigned MESOS-8534:
-

Assignee: Sagar Sadashiv Patwardhan

> Allow nested containers in TaskGroups to have separate network namespaces
> -
>
> Key: MESOS-8534
> URL: https://issues.apache.org/jira/browse/MESOS-8534
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Sagar Sadashiv Patwardhan
>Assignee: Sagar Sadashiv Patwardhan
>Priority: Minor
>  Labels: cni
> Fix For: 1.6.0
>
>
> As per the discussion with [~jieyu] and [~avinash.mesos] , I am going to 
> allow nested containers in TaskGroups to have separate namespaces. I am also 
> going to retain the existing functionality, where nested containers can share 
> namespaces with the parent/root container.
> *Use case:* At Yelp, we have this application called seagull that runs 
> multiple tasks in parallel. It is mainly used for running tests that depend 
> on other containerized internal microservices. It was developed before mesos 
> had support for docker-executor. So, it uses a custom executor, which 
> directly talks to the docker daemon on the host and runs a bunch of service 
> containers along with the process where tests are executed. Resources for all 
> these containers are not accounted for in mesos. Clean-up of these containers 
> is also a headache. We have a tool called docker-reaper that automatically 
> reaps the orphaned containers once the executor goes away. In addition to 
> that, we also run a few cron jobs that clean up any leftover containers.
> We are in the process of containerizing the process that runs the tests. We 
> also want to delegate the responsibility of lifecycle management of docker 
> containers to mesos and get rid of the custom executor. We looked at a few 
> alternatives to do this and decided to go with pods because they provide 
> all-or-nothing(atomicity) semantics that we need for our application. But, we 
> cannot use pods directly because all the containers in a pod have the same 
> network namespace. The service discovery mechanism requires all the 
> containers to have separate IPs. All of our microservices bind to  
> container port, so we will have port collision unless we are giving separate 
> namespaces to all the containers in a pod.
> *Proposal:* I am planning to allow nested containers to have separate 
> namespaces. If NetworkInfo protobuf for nested containers is not empty, then 
> we will assign separate mnt and network namespaces to the nested containers. 
> Otherwise, they will share the network and mount namespaces with the 
> parent/root container.
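
For illustration (the field names follow ContainerInfo/NetworkInfo in 
mesos.proto, but the values are made up), a nested container that should get 
its own network namespace would simply carry a non-empty NetworkInfo, e.g. in 
protobuf text format:

{code}
container {
  type: MESOS
  network_infos {
    # A non-empty NetworkInfo opts the nested container into its own
    # network (and mnt) namespace; the name refers to a CNI network.
    name: "my-cni-network"
  }
}
{code}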



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8534) Allow nested containers in TaskGroups to have separate network namespaces

2018-03-27 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415104#comment-16415104
 ] 

Jie Yu commented on MESOS-8534:
---

commit a741b15e889de3242e3aa7878105ab9d946f6ea2 (HEAD -> master, origin/master, 
origin/HEAD)
Author: Sagar Patwardhan 
Date:   Mon Mar 26 21:13:17 2018 -0700

Allowed a nested container to have a separate network namespace.

Previously, nested containers always share the same network namespace as
their parent. This patch allows a nested container to have a separate
network namespace than its parent.

Continued from https://github.com/apache/mesos/pull/263

JIRA: MESOS-8534

Review: https://reviews.apache.org/r/65987/

commit 020b8cbafaf70ef4b95915bf9b81200509b23a50
Author: Jie Yu 
Date:   Mon Mar 26 23:28:20 2018 -0700

Fixed createVolumeHostPath helper.

commit 77c56351e9bfabea221c6be84472e64b434b5169
Author: Jie Yu 
Date:   Mon Mar 26 21:14:52 2018 -0700

Added a helper to parse ContainerID.

Review: https://reviews.apache.org/r/66101/

> Allow nested containers in TaskGroups to have separate network namespaces
> -
>
> Key: MESOS-8534
> URL: https://issues.apache.org/jira/browse/MESOS-8534
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Sagar Sadashiv Patwardhan
>Priority: Minor
>  Labels: cni
>
> As per the discussion with [~jieyu] and [~avinash.mesos] , I am going to 
> allow nested containers in TaskGroups to have separate namespaces. I am also 
> going to retain the existing functionality, where nested containers can share 
> namespaces with the parent/root container.
> *Use case:* At Yelp, we have this application called seagull that runs 
> multiple tasks in parallel. It is mainly used for running tests that depend 
> on other containerized internal microservices. It was developed before mesos 
> had support for docker-executor. So, it uses a custom executor, which 
> directly talks to the docker daemon on the host and runs a bunch of service 
> containers along with the process where tests are executed. Resources for all 
> these containers are not accounted for in mesos. Clean-up of these containers 
> is also a headache. We have a tool called docker-reaper that automatically 
> reaps the orphaned containers once the executor goes away. In addition to 
> that, we also run a few cron jobs that clean up any leftover containers.
> We are in the process of containerizing the process that runs the tests. We 
> also want to delegate the responsibility of lifecycle management of docker 
> containers to mesos and get rid of the custom executor. We looked at a few 
> alternatives to do this and decided to go with pods because they provide 
> all-or-nothing(atomicity) semantics that we need for our application. But, we 
> cannot use pods directly because all the containers in a pod have the same 
> network namespace. The service discovery mechanism requires all the 
> containers to have separate IPs. All of our microservices bind to  
> container port, so we will have port collision unless we are giving separate 
> namespaces to all the containers in a pod.
> *Proposal:* I am planning to allow nested containers to have separate 
> namespaces. If NetworkInfo protobuf for nested containers is not empty, then 
> we will assign separate mnt and network namespaces to the nested containers. 
> Otherwise, they will share the network and mount namespaces with the 
> parent/root container.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

