[jira] [Commented] (MESOS-8651) Potential memory leaks in the `volume/sandbox_path` isolator

2018-03-16 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16403136#comment-16403136
 ] 

Jie Yu commented on MESOS-8651:
---

commit 28ecf0c865347f75b90992a919ec7c56edb93eae (HEAD -> master, origin/master, 
origin/HEAD)
Author: Jason Lai 
Date:   Fri Mar 16 16:45:00 2018 -0700

Fixed potential memory leak in the `volume/sandbox_path` isolator.

The `volume/sandbox_path` isolator inserts a string of the sandbox path
to its `sandboxes` hashmap instance variable upon the launch of each
container. However, it never cleans it up properly and can cause
unbounded growth of the hashmap object, as isolators are global
singleton objects.

The patch ensures the sandbox path associated with a given container ID
gets removed from the `sandboxes` hashmap upon container cleanup.

Review: https://reviews.apache.org/r/66104/
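
For reference, a minimal sketch of the cleanup-side change (illustrative, not the actual diff; see the review above for the real change):

{code}
// Sketch: erase the container's entry from the `sandboxes` hashmap during
// cleanup so the map does not keep growing for the lifetime of the agent.
Future<Nothing> VolumeSandboxPathIsolatorProcess::cleanup(
    const ContainerID& containerId)
{
  sandboxes.erase(containerId);

  return Nothing();
}
{code}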

> Potential memory leaks in the `volume/sandbox_path` isolator
> 
>
> Key: MESOS-8651
> URL: https://issues.apache.org/jira/browse/MESOS-8651
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Jason Lai
>Assignee: Jason Lai
>Priority: Major
>  Labels: easyfix, patch
> Fix For: 1.6.0
>
>
> The {{sandboxes}} hashmap object of 
> {{mesos::internal::slave::VolumeSandboxPathIsolatorProcess}} bears the risk 
> of potential memory leak.
> It [adds the sandbox path upon each container 
> launch|https://github.com/apache/mesos/blob/1.5.x/src/slave/containerizer/mesos/isolators/volume/sandbox_path.cpp#L119-L122]
>  and does not remove the sandbox path after cleaning up the container. As the 
> life cycle of an isolator is tied to that of {{MesosContainerizer}}, this 
> means that more and more sandbox paths will get added to the {{sandboxes}} 
> hashmap object as Mesos containers keep being launched, which will likely blow 
> up the Mesos agent eventually.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-4965) Support resizing of an existing persistent volume

2018-03-16 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16402206#comment-16402206
 ] 

Jie Yu edited comment on MESOS-4965 at 3/16/18 5:11 PM:


Relevant CSI spec issue regarding volume resize
https://github.com/container-storage-interface/spec/issues/212


was (Author: jieyu):
Relevant CSI spec regarding volume resize
https://github.com/container-storage-interface/spec/issues/212

> Support resizing of an existing persistent volume
> -
>
> Key: MESOS-4965
> URL: https://issues.apache.org/jira/browse/MESOS-4965
> Project: Mesos
>  Issue Type: Improvement
>  Components: storage
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>Priority: Major
>  Labels: mesosphere, persistent-volumes, storage
>
> We need a mechanism to update the size of a persistent volume.
> The increase case is generally more interesting to us (as long as there is 
> still available disk resource on the same disk).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-4965) Support resizing of an existing persistent volume

2018-03-16 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16402206#comment-16402206
 ] 

Jie Yu commented on MESOS-4965:
---

Relevant CSI spec regarding volume resize
https://github.com/container-storage-interface/spec/issues/212

> Support resizing of an existing persistent volume
> -
>
> Key: MESOS-4965
> URL: https://issues.apache.org/jira/browse/MESOS-4965
> Project: Mesos
>  Issue Type: Improvement
>  Components: storage
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>Priority: Major
>  Labels: mesosphere, persistent-volumes, storage
>
> We need a mechanism to update the size of a persistent volume.
> The increase case is generally more interesting to us (as long as there is 
> still available disk resource on the same disk).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8626) The 'allocatable' check in the allocator is problematic with multi-role frameworks

2018-03-01 Thread Jie Yu (JIRA)
Jie Yu created MESOS-8626:
-

 Summary: The 'allocatable' check in the allocator is problematic 
with multi-role frameworks
 Key: MESOS-8626
 URL: https://issues.apache.org/jira/browse/MESOS-8626
 Project: Mesos
  Issue Type: Bug
  Components: allocation
Affects Versions: 1.5.0, 1.4.1, 1.3.2
Reporter: Jie Yu


The 
[allocatable|https://github.com/apache/mesos/blob/1.5.x/src/master/allocator/mesos/hierarchical.cpp#L2471-L2479]
 check in the allocator (shown below) was originally introduced to help 
alleviate the situation where a framework receives only CPU but no 
memory/disk, and thus cannot launch a task.

{code}
bool HierarchicalAllocatorProcess::allocatable(
const Resources& resources)
{
  Option<double> cpus = resources.cpus();
  Option<Bytes> mem = resources.mem();

  return (cpus.isSome() && cpus.get() >= MIN_CPUS) ||
 (mem.isSome() && mem.get() >= MIN_MEM);
}
{code}

Now that we have introduced the multi-role capability for frameworks, this 
check makes less sense. For instance, consider the following case:
1) There is a single agent and a single framework in the cluster
2) The agent has cpu/memory reserved to role A, and disk reserved to B
3) The framework subscribes to both role A and role B
4) The framework expects that it'll receive an offer containing the resources 
on the agent
5) However, the framework receives no disk resources due to the following 
[code|https://github.com/apache/mesos/blob/1.5.x/src/master/allocator/mesos/hierarchical.cpp#L2078-L2100].
 This is counter-intuitive.

{code}
void HierarchicalAllocatorProcess::__allocate()
{
  ...
  Resources resources = available.allocatableTo(role);
  if (!allocatable(resources)) {
break;
  }
  ...
}

bool Resources::isAllocatableTo(
const Resource& resource,
const std::string& role)
{
  CHECK(!resource.has_role()) << resource;
  CHECK(!resource.has_reservation()) << resource;

  return isUnreserved(resource) ||
 role == reservationRole(resource) ||
 roles::isStrictSubroleOf(role, reservationRole(resource));
}
{code}

Two comments:
1) Is the `allocatable` check still necessary at all (see MESOS-7398)?
2) If we want to keep the `allocatable` check for its original purpose, we should 
probably do it per framework rather than per role, given that a framework can 
now subscribe to multiple roles; see the sketch below.
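
Purely illustrative sketch of a framework-level check (not a proposed patch; names like `framework.roles` are assumptions):

{code}
// Illustrative only: apply `allocatable` to the union of resources that can
// be allocated to any of the framework's subscribed roles, instead of
// applying it per role.
Resources total;
foreach (const string& role, framework.roles) {
  total += available.allocatableTo(role);
}

if (!allocatable(total)) {
  continue;  // Skip this framework, rather than one of its roles.
}
{code}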

Some related JIRAs:
MESOS-1688
MESOS-7398




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8593) Support credential updates in Docker config without restarting the agent

2018-03-01 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382367#comment-16382367
 ] 

Jie Yu commented on MESOS-8593:
---

Sounds like the better approach is to use the image pull secret feature.

That means you'll have to implement the SecretResolver module

https://github.com/apache/mesos/blob/master/docs/secrets.md#secretresolver-module
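
Roughly, the module implements the following interface (sketch from memory; check include/mesos/secret/resolver.hpp for the exact signature):

{code}
class MySecretResolver : public mesos::SecretResolver
{
public:
  // Called by the agent to turn a Secret (e.g. a reference into your secret
  // store) into its value, which is then used as the image pull secret.
  process::Future<mesos::Secret::Value> resolve(
      const mesos::Secret& secret) const override
  {
    // Fetch the current Docker credentials here so that rotated credentials
    // are picked up without restarting the agent.
    return fetchFromStore(secret); // hypothetical helper
  }
};
{code}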

> Support credential updates in Docker config without restarting the agent
> 
>
> Key: MESOS-8593
> URL: https://issues.apache.org/jira/browse/MESOS-8593
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization, docker
>Reporter: Jan Schlicht
>Priority: Major
>
> When using the Mesos containerizer with a private Docker repository with 
> {{--docker_config}} option, the repository might expire credentials after 
> some time, forcing the user to login again. In that case the Docker config in 
> use will change and the agent needs to be restarted to reflect the change. 
> Instead of restarting, the agent could reload the Docker config file every 
> time before fetching.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8594) Mesos master crash (under load)

2018-02-19 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369305#comment-16369305
 ] 

Jie Yu commented on MESOS-8594:
---

cc [~bmahler], [~benjaminhindman]

This will likely be resolved by using `loop` in libprocess, which prevents 
unbounded recursion from overflowing the stack.
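
For context, the `loop` pattern looks roughly like this (sketch of the idea, not the actual send path in process.cpp; `dequeueMessage` and `transport` are hypothetical):

{code}
// Instead of a function recursing once per queued message (growing the
// stack), `loop` chains the iterations through futures:
loop(None(),
     []() {
       return dequeueMessage();  // hypothetical: returns Future<Option<Message>>
     },
     [](const Option<Message>& message) -> ControlFlow<Nothing> {
       if (message.isNone()) {
         return Break();          // Queue drained; stop looping.
       }
       transport(message.get());  // hypothetical per-message send
       return Continue();         // Next iteration without growing the stack.
     });
{code}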

> Mesos master crash (under load)
> ---
>
> Key: MESOS-8594
> URL: https://issues.apache.org/jira/browse/MESOS-8594
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.5.0, 1.6.0
>Reporter: A. Dukhovniy
>Priority: Major
>  Labels: reliability
> Attachments: lldb-bt.txt, lldb-di-f.txt, lldb-image-section.txt, 
> lldb-regiser-read.txt
>
>
> Mesos master crashes under load. Attached is some info from `lldb`:
> {code:java}
> Process 41933 resuming
> Process 41933 stopped
> * thread #10, stop reason = EXC_BAD_ACCESS (code=2, address=0x789ecff8)
> frame #0: 0x00010c30ddb6 libmesos-1.6.0.dylib`::_Some() at some.hpp:35
> 32 template <typename T>
> 33 struct _Some
> 34 {
> -> 35 _Some(T _t) : t(std::move(_t)) {}
> 36
> 37 T t;
> 38 };
> Target 0: (mesos-master) stopped.
> (lldb)
> {code}
> To quote [~abudnik]
> {quote}it’s the stack overflow bug in libprocess due to the way 
> `internal::send()` and `internal::_send()` are implemented in `process.cpp`
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-7990) Support systemd named hierarchy (name=systemd) for Mesos Containerizer.

2018-02-09 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359169#comment-16359169
 ] 

Jie Yu edited comment on MESOS-7990 at 2/10/18 1:05 AM:


{noformat}

commit a86ff8c36532f97b6eb6b44c6f871de24afbcc4d (HEAD -> master, origin/master, 
origin/HEAD)
 Author: Jie Yu 
 Date: Thu Oct 5 21:11:41 2017 -0700

Created cgroups under systemd hierarchy in LinuxLauncher.

This patch added the support for systemd hierarchy in LinuxLauncher.
 It created the same cgroup layout under the systemd hierarchy (if
 systemd is enabled) as that in the freezer hierarchy.

This can give us a bunch of benefits:
 1) systemd-cgls can list mesos container processes.
 2) systemd-cgtop can show stats for mesos containers.
 3) Avoid the pid migration issue described in MESOS-3352.

For example:

```
 [jie@core-dev ~]$ systemd-cgls
 |-1 /usr/lib/systemd/systemd --system --deserialize 20
 |-mesos
 | |-8282b91a-5724-4964-a623-7c6bd68ff4ad
 | |-31737 /usr/libexec/mesos/mesos-containerizer launch
 | |-31739 mesos-default-executor --launcher_dir=/usr/libexec/mesos
 | |-mesos
 | |-8555f4af-fa4f-4c9c-aeb3-0c9f72e6a2de
 | |-31791 /usr/libexec/mesos/mesos-containerizer launch
 | |-31793 sleep 1000
 ```

Review: [https://reviews.apache.org/r/62800]

commit 05a2909508df56253372b4fe36330339b5de00b1
 Author: Jie Yu 
 Date: Thu Oct 5 20:52:38 2017 -0700

Fixed an issue for the I/O switchboard process lifetime.

We expect the I/O switchboard process to last across agent restarts
 (similar to log rotate process or executor processes). Therefore, we
 should put it into 'mesos_executor.slice' like others.

Review: [https://reviews.apache.org/r/62799]

commit 2ece53c39a3e791f6892c4a734b6d3187f184190
 Author: Jie Yu 
 Date: Thu Oct 5 20:49:20 2017 -0700

Added named cgroup hierarchy support.

This patch add a helper to get the cgroup associated with the given
 pid for a named cgroup hierarchy.

Review: [https://reviews.apache.org/r/62798]

{noformat}
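
For reference, the named-hierarchy lookup boils down to parsing /proc/<pid>/cgroup (illustrative sketch, not the helper from r/62798):

{code}
// Illustrative: find the cgroup of a pid in the named "systemd" hierarchy by
// scanning /proc/<pid>/cgroup, whose lines have the form
// "<id>:<subsystems>:<cgroup path>", e.g.
// "1:name=systemd:/mesos/8555f4af-fa4f-4c9c-aeb3-0c9f72e6a2de".
Try<string> cgroup(pid_t pid)
{
  Try<string> read = os::read("/proc/" + stringify(pid) + "/cgroup");
  if (read.isError()) {
    return Error(read.error());
  }

  foreach (const string& line, strings::tokenize(read.get(), "\n")) {
    vector<string> tokens = strings::tokenize(line, ":");
    if (tokens.size() >= 3 && tokens[1] == "name=systemd") {
      return tokens[2];
    }
  }

  return Error("Failed to find the 'name=systemd' hierarchy");
}
{code}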


was (Author: jieyu):
commit a86ff8c36532f97b6eb6b44c6f871de24afbcc4d (HEAD -> master, origin/master, 
origin/HEAD)
Author: Jie Yu 
Date: Thu Oct 5 21:11:41 2017 -0700

Created cgroups under systemd hierarchy in LinuxLauncher.

This patch added the support for systemd hierarchy in LinuxLauncher.
 It created the same cgroup layout under the systemd hierarchy (if
 systemd is enabled) as that in the freezer hierarchy.

This can give us a bunch of benefits:
 1) systemd-cgls can list mesos container processes.
 2) systemd-cgtop can show stats for mesos containers.
 3) Avoid the pid migration issue described in MESOS-3352.

For example:

```
 [jie@core-dev ~]$ systemd-cgls
 |-1 /usr/lib/systemd/systemd --system --deserialize 20
 |-mesos
 | |-8282b91a-5724-4964-a623-7c6bd68ff4ad
 | |-31737 /usr/libexec/mesos/mesos-containerizer launch
 | |-31739 mesos-default-executor --launcher_dir=/usr/libexec/mesos
 | |-mesos
 | |-8555f4af-fa4f-4c9c-aeb3-0c9f72e6a2de
 | |-31791 /usr/libexec/mesos/mesos-containerizer launch
 | |-31793 sleep 1000
 ```

Review: https://reviews.apache.org/r/62800

commit 05a2909508df56253372b4fe36330339b5de00b1
Author: Jie Yu 
Date: Thu Oct 5 20:52:38 2017 -0700

Fixed an issue for the I/O switchboard process lifetime.

We expect the I/O switchboard process to last across agent restarts
 (similar to log rotate process or executor processes). Therefore, we
 should put it into 'mesos_executor.slice' like others.

Review: https://reviews.apache.org/r/62799

commit 2ece53c39a3e791f6892c4a734b6d3187f184190
Author: Jie Yu 
Date: Thu Oct 5 20:49:20 2017 -0700

Added named cgroup hierarchy support.

This patch add a helper to get the cgroup associated with the given
 pid for a named cgroup hierarchy.

Review: https://reviews.apache.org/r/62798

> Support systemd named hierarchy (name=systemd) for Mesos Containerizer.
> ---
>
> Key: MESOS-7990
> URL: https://issues.apache.org/jira/browse/MESOS-7990
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Jie Yu
>Assignee: Jie Yu
>Priority: Major
> Fix For: 1.6.0
>
>
> Similar to docker's cgroupfs cgroup driver, we should create cgroups under 
> /sys/fs/cgroup/systemd (if it exists), and move container pid into the 
> corresponding cgroup ( /sys/fs/cgroup/systemd/mesos/).
> This can give us a bunch of benefits:
> 1) systemd-cgls can list mesos containers
> 2) systemd-cgtop can show stats for mesos containers
> ...



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7990) Support systemd named hierarchy (name=systemd) for Mesos Containerizer.

2018-02-09 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359169#comment-16359169
 ] 

Jie Yu commented on MESOS-7990:
---

commit a86ff8c36532f97b6eb6b44c6f871de24afbcc4d (HEAD -> master, origin/master, 
origin/HEAD)
Author: Jie Yu 
Date: Thu Oct 5 21:11:41 2017 -0700

Created cgroups under systemd hierarchy in LinuxLauncher.

This patch added the support for systemd hierarchy in LinuxLauncher.
 It created the same cgroup layout under the systemd hierarchy (if
 systemd is enabled) as that in the freezer hierarchy.

This can give us a bunch of benefits:
 1) systemd-cgls can list mesos container processes.
 2) systemd-cgtop can show stats for mesos containers.
 3) Avoid the pid migration issue described in MESOS-3352.

For example:

```
 [jie@core-dev ~]$ systemd-cgls
 |-1 /usr/lib/systemd/systemd --system --deserialize 20
 |-mesos
 | |-8282b91a-5724-4964-a623-7c6bd68ff4ad
 | |-31737 /usr/libexec/mesos/mesos-containerizer launch
 | |-31739 mesos-default-executor --launcher_dir=/usr/libexec/mesos
 | |-mesos
 | |-8555f4af-fa4f-4c9c-aeb3-0c9f72e6a2de
 | |-31791 /usr/libexec/mesos/mesos-containerizer launch
 | |-31793 sleep 1000
 ```

Review: https://reviews.apache.org/r/62800

commit 05a2909508df56253372b4fe36330339b5de00b1
Author: Jie Yu 
Date: Thu Oct 5 20:52:38 2017 -0700

Fixed an issue for the I/O switchboard process lifetime.

We expect the I/O switchboard process to last across agent restarts
 (similar to log rotate process or executor processes). Therefore, we
 should put it into 'mesos_executor.slice' like others.

Review: https://reviews.apache.org/r/62799

commit 2ece53c39a3e791f6892c4a734b6d3187f184190
Author: Jie Yu 
Date: Thu Oct 5 20:49:20 2017 -0700

Added named cgroup hierarchy support.

This patch add a helper to get the cgroup associated with the given
 pid for a named cgroup hierarchy.

Review: https://reviews.apache.org/r/62798

> Support systemd named hierarchy (name=systemd) for Mesos Containerizer.
> ---
>
> Key: MESOS-7990
> URL: https://issues.apache.org/jira/browse/MESOS-7990
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Jie Yu
>Assignee: Jie Yu
>Priority: Major
>
> Similar to docker's cgroupfs cgroup driver, we should create cgroups under 
> /sys/fs/cgroup/systemd (if it exists), and move container pid into the 
> corresponding cgroup ( /sys/fs/cgroup/systemd/mesos/).
> This can give us a bunch of benefits:
> 1) systemd-cgls can list mesos containers
> 2) systemd-cgtop can show stats for mesos containers
> ...



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-4781) Executor env variables should not be leaked to the command task.

2018-02-08 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357563#comment-16357563
 ] 

Jie Yu commented on MESOS-4781:
---

[~gilbert], are you still working on this?

> Executor env variables should not be leaked to the command task.
> 
>
> Key: MESOS-4781
> URL: https://issues.apache.org/jira/browse/MESOS-4781
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Gilbert Song
>Assignee: Gilbert Song
>Priority: Major
>  Labels: mesosphere
>
> Currently, the command task inherits the env variables of the command executor. 
> This is not ideal because the command executor environment variables include 
> some Mesos internal env variables like MESOS_XXX and LIBPROCESS_XXX. Also, 
> this behavior does not match what Docker containerizer does. We should 
> construct the env variables from scratch for the command task, rather than 
> relying on inheriting the env variables from the command executor.
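
A minimal sketch of the intended direction (illustrative, not the actual fix): build the task's environment from its own CommandInfo plus a few sandbox-related variables, instead of copying the executor's environment.

{code}
// Illustrative only: construct the command task's environment from scratch.
map<string, string> environment;

environment["MESOS_SANDBOX"] = sandboxDirectory;  // hypothetical variable

// Only include variables the framework explicitly asked for.
foreach (const Environment::Variable& variable,
         task.command().environment().variables()) {
  environment[variable.name()] = variable.value();
}
{code}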



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-5268) Cgroups CpushareIsolator don't take effect on SLES 11 SP2 SP3

2018-02-08 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357318#comment-16357318
 ] 

Jie Yu commented on MESOS-5268:
---

[~AndyPang] is this still an issue for you? Do you plan to work on that?

> Cgroups CpushareIsolator don't take effect on SLES 11 SP2 SP3
> -
>
> Key: MESOS-5268
> URL: https://issues.apache.org/jira/browse/MESOS-5268
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 0.27.0
> Environment: suse 3.0.101-0.47.71-default #1 SMP Thu Nov 12 12:22:22 
> UTC 2015 (b5b212e) x86_64 x86_64 x86_64 GNU/Linux
>Reporter: AndyPang
>Assignee: AndyPang
>Priority: Major
>  Labels: cgroups
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Mesos runs on SLES 11 SP2/SP3 (kernel version 3.0.13/3.076) and the cpu share 
> isolator does not take effect. Two frameworks have a cpu.shares proportion of 
> 1:3; the cpu.shares values in the Mesos container cgroups are correct, but when 
> we observe the result with "top", the CPU usage is not 1:3. Our application is 
> multithreaded and can fully use its CPU quota when run alone.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-5754) CommandInfo.user not honored in docker containerizer

2018-02-08 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu reassigned MESOS-5754:
-

Assignee: (was: Gilbert Song)

> CommandInfo.user not honored in docker containerizer
> 
>
> Key: MESOS-5754
> URL: https://issues.apache.org/jira/browse/MESOS-5754
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization, docker
>Affects Versions: 1.0.0, 1.2.3, 1.3.1, 1.4.1, 1.5.0
>Reporter: Michael Gummelt
>Priority: Major
>  Labels: mesosphere
>
> Repro by creating a framework that starts a task with CommandInfo.user set, 
> and observe that the dockerized executor is still running as the default 
> (e.g. root).
> cc [~kaysoky]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-5866) MESOS_DIRECTORY set to a host path when using a docker image w/ unified containerizer

2018-02-08 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357302#comment-16357302
 ] 

Jie Yu commented on MESOS-5866:
---

MESOS_SANDBOX should be the one you use; MESOS_DIRECTORY is deprecated. Closing 
this one. It's documented already.
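
In other words, code running inside the container should locate its working area like this (sketch):

{code}
// Sketch: MESOS_SANDBOX points to the sandbox as seen from inside the
// container's mount namespace; MESOS_DIRECTORY is the deprecated host-side
// path and may not exist inside the container at all.
Option<string> sandbox = os::getenv("MESOS_SANDBOX");
if (sandbox.isNone()) {
  // Fail or pick an application-specific default; do not fall back to
  // MESOS_DIRECTORY when running inside a container.
}
{code}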

> MESOS_DIRECTORY set to a host path when using a docker image w/ unified 
> containerizer
> -
>
> Key: MESOS-5866
> URL: https://issues.apache.org/jira/browse/MESOS-5866
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 0.28.2
>Reporter: Michael Gummelt
>Priority: Major
>
> Running Spark with the unified containerizer, it fails with:
> {code}
> 16/07/19 21:03:09 INFO DAGScheduler: ResultStage 0 (reduce at 
> SparkPi.scala:36) failed in Unknown s due to Job aborted due to stage 
> failure: Task serialization failed: java.io.IOException: Failed to create 
> local dir in 
> /var/lib/mesos/slave/slaves/003ebcc2-64e2-488f-87b9-f6fa7630c01b-S0/frameworks/003ebcc2-64e2-488f-87b9-f6fa7630c01b-0001/executors/driver-20160719210109-0002/runs/8f21b32e-b929-4369-bce9-9f49a3a8844f/blockmgr-e3a611d4-e0de-48cb-b17a-1e41d97e84c2/11.
> {code}
> This is because MESOS_DIRECTORY is set to /var/lib/mesos/, which is a 
> host path.  The container can't see the host path.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-5953) Default work dir is not root for unified containerizer and docker

2018-02-08 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu reassigned MESOS-5953:
-

Assignee: Gilbert Song

> Default work dir is not root for unified containerizer and docker
> -
>
> Key: MESOS-5953
> URL: https://issues.apache.org/jira/browse/MESOS-5953
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.0.0
>Reporter: Philip Winder
>Assignee: Gilbert Song
>Priority: Major
>
> According to the docker spec, the default working directory (WORKDIR) is root 
> /. https://docs.docker.com/engine/reference/run/#/workdir
> The unified containerizer with the docker runtime isolator sets the default 
> working directory to /tmp/mesos/sandbox.
> Hence, dockerfiles that rely on the default workdir will not work 
> because the pwd is changed by mesos.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-6340) Set HOME for Mesos tasks

2018-02-08 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu reassigned MESOS-6340:
-

Assignee: Jie Yu

> Set HOME for Mesos tasks
> 
>
> Key: MESOS-6340
> URL: https://issues.apache.org/jira/browse/MESOS-6340
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization
>Reporter: Cody Maloney
>Assignee: Jie Yu
>Priority: Major
>
> Quite a few programs assume {{$HOME}} points to a user-editable data file 
> directory.
> One example is Python, which tries to look up $HOME to find user-installed 
> packages, and if that fails it tries to look up the user in the passwd 
> database, which often goes badly (the container is running under the `nobody` 
> user):
> {code}
> if i == 1:
>     if 'HOME' not in os.environ:
>         import pwd
>         userhome = pwd.getpwuid(os.getuid()).pw_dir
>     else:
>         userhome = os.environ['HOME']
> {code}
> Just setting HOME to WORK_DIR by default would enable more software to work 
> correctly out of the box. Software which needs to specialize or change it (or 
> schedulers with specific preferences) should still be able to set it 
> arbitrarily, and anything a scheduler explicitly sets should override the 
> default value of $WORK_DIR
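
A minimal sketch of the proposed default (illustrative only; a scheduler-provided value must still win):

{code}
// Illustrative only: default HOME to the task's sandbox (the work directory)
// unless the scheduler already set it explicitly.
if (environment.count("HOME") == 0) {
  environment["HOME"] = sandboxDirectory;  // hypothetical variable
}
{code}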



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-6555) Namespace 'mnt' is not supported

2018-02-08 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu reassigned MESOS-6555:
-

Assignee: James Peach

> Namespace 'mnt' is not supported
> 
>
> Key: MESOS-6555
> URL: https://issues.apache.org/jira/browse/MESOS-6555
> Project: Mesos
>  Issue Type: Bug
>  Components: cgroups, containerization
>Affects Versions: 1.0.0, 1.2.3, 1.3.1, 1.4.1, 1.5.0
> Environment: suse11 sp3,  kernal: 3.0.101-0.47.71-default #1 SMP Thu 
> Nov 12 12:22:22 UTC 2015 (b5b212e) x86_64 x86_64 x86_64 GNU/Linux 
>Reporter: AndyPang
>Assignee: James Peach
>Priority: Minor
>
> The same code runs fine on Debian (kernel version '4.1.0-0'), while on 
> SUSE 11 SP3 it fails with the following error.
> {code:title=mesos-execute|borderStyle=solid}
> ./mesos-execute --command="sleep 100" --master=:xxx  --name=sleep 
> --docker_image=ubuntu
> I1105 11:26:21.090703 194814 scheduler.cpp:172] Version: 1.0.0
> I1105 11:26:21.092821 194837 scheduler.cpp:461] New master detected at 
> master@:xxx
> Subscribed with ID 'fdb8546d-ca11-4a51-a297-8401e53b7692-'
> Submitted task 'sleep' to agent 'fdb8546d-ca11-4a51-a297-8401e53b7692-S0'
> Received status update TASK_FAILED for task 'sleep'
>   message: 'Failed to launch container: Collect failed: Failed to setup 
> hostname and network files: Failed to enter the mount namespace of pid 
> 194976: Namespace 'mnt' is not supported
> ; Executor terminated'
>   source: SOURCE_AGENT
>   reason: REASON_CONTAINER_LAUNCH_FAILED
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-6555) Namespace 'mnt' is not supported

2018-02-08 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357283#comment-16357283
 ] 

Jie Yu commented on MESOS-6555:
---

We should add some check during agent startup.
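
For example, something along these lines at startup (sketch, assuming a simple /proc-based probe is sufficient):

{code}
// Sketch: fail fast if the kernel does not expose mount namespaces, instead
// of failing later at container launch time.
if (!os::exists("/proc/self/ns/mnt")) {
  return Error("The kernel does not support the 'mnt' namespace");
}
{code}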

> Namespace 'mnt' is not supported
> 
>
> Key: MESOS-6555
> URL: https://issues.apache.org/jira/browse/MESOS-6555
> Project: Mesos
>  Issue Type: Bug
>  Components: cgroups, containerization
>Affects Versions: 1.0.0, 1.2.3, 1.3.1, 1.4.1, 1.5.0
> Environment: suse11 sp3,  kernal: 3.0.101-0.47.71-default #1 SMP Thu 
> Nov 12 12:22:22 UTC 2015 (b5b212e) x86_64 x86_64 x86_64 GNU/Linux 
>Reporter: AndyPang
>Priority: Major
>
> The same code runs fine on Debian (kernel version '4.1.0-0'), while on 
> SUSE 11 SP3 it fails with the following error.
> {code:title=mesos-execute|borderStyle=solid}
> ./mesos-execute --command="sleep 100" --master=:xxx  --name=sleep 
> --docker_image=ubuntu
> I1105 11:26:21.090703 194814 scheduler.cpp:172] Version: 1.0.0
> I1105 11:26:21.092821 194837 scheduler.cpp:461] New master detected at 
> master@:xxx
> Subscribed with ID 'fdb8546d-ca11-4a51-a297-8401e53b7692-'
> Submitted task 'sleep' to agent 'fdb8546d-ca11-4a51-a297-8401e53b7692-S0'
> Received status update TASK_FAILED for task 'sleep'
>   message: 'Failed to launch container: Collect failed: Failed to setup 
> hostname and network files: Failed to enter the mount namespace of pid 
> 194976: Namespace 'mnt' is not supported
> ; Executor terminated'
>   source: SOURCE_AGENT
>   reason: REASON_CONTAINER_LAUNCH_FAILED
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-6422) cgroups_tests not correctly tearing down testing hierarchies

2018-02-08 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357286#comment-16357286
 ] 

Jie Yu commented on MESOS-6422:
---

[~xujyan], do you plan to work on this one?

> cgroups_tests not correctly tearing down testing hierarchies
> 
>
> Key: MESOS-6422
> URL: https://issues.apache.org/jira/browse/MESOS-6422
> Project: Mesos
>  Issue Type: Bug
>  Components: cgroups, containerization
>Reporter: Yan Xu
>Assignee: Yan Xu
>Priority: Minor
>  Labels: cgroups
>
> We currently do the following in 
> [CgroupsTest::TearDownTestCase()|https://github.com/apache/mesos/blob/5e850a362edbf494921fedff4037cf4b53088c10/src/tests/containerizer/cgroups_tests.cpp#L83]
> {code:title=}
> static void TearDownTestCase()
> {
>   AWAIT_READY(cgroups::cleanup(TEST_CGROUPS_HIERARCHY));
> }
> {code}
> One of its derived tests, {{CgroupsNoHierarchyTest}}, treats 
> {{TEST_CGROUPS_HIERARCHY}} as a hierarchy, so it is able to clean it up as a 
> hierarchy.
> However, another derived test, {{CgroupsAnyHierarchyTest}}, creates new 
> hierarchies (if none is available) using {{TEST_CGROUPS_HIERARCHY}} as a 
> parent directory (i.e., a base hierarchy) and not as a hierarchy, so when it's 
> time to clean up, it fails:
> {noformat:title=}
> [   OK ] CgroupsAnyHierarchyTest.ROOT_CGROUPS_Subsystems (1 ms)
> ../../src/tests/containerizer/cgroups_tests.cpp:88: Failure
> (cgroups::cleanup(TEST_CGROUPS_HIERARCHY)).failure(): Operation not permitted
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-6656) Nested containers can become unkillable

2018-02-08 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu reassigned MESOS-6656:
-

Assignee: Jie Yu

> Nested containers can become unkillable
> ---
>
> Key: MESOS-6656
> URL: https://issues.apache.org/jira/browse/MESOS-6656
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization
>Reporter: Greg Mann
>Assignee: Jie Yu
>Priority: Major
>  Labels: nested
>
> An incident occurred recently in a cluster running a build of Mesos based on 
> commit {{757319357471227c0a1e906076eae8f9aa2fdbd6}} from master. A task group 
> of five tasks was launched via Marathon. After the tasks were launched, one 
> of the containers quickly exited and was successfully destroyed. A couple 
> minutes later, the task group was killed manually via Marathon, and the agent 
> can then be seen repeatedly attempting to kill the tasks for hours. No calls 
> to {{WAIT_NESTED_CONTAINER}} are visible in the agent logs, and the executor 
> logs do not indicate at any point that the nested containers were launched 
> successfully.
> Agent logs:
> {code}
> Nov 29 04:04:16 ip-10-190-112-199 mesos-agent[6397]: I1129 04:04:16.890911  
> 6406 slave.cpp:1539] Got assigned task group containing tasks [ 
> dat_scout.instance-e57be1fe-b5e8-11e6-995b-70b3d581.scout-server1, 
> dat_scout.instance-e57be1fe-b5e8-11e6-995b-70b3d581.scout-server2, 
> dat_scout.instance-e57be1fe-b5e8-11e6-995b-70b3d581.scout-server3, 
> dat_scout.instance-e57be1fe-b5e8-11e6-995b-70b3d581.scout-server4, 
> dat_scout.instance-e57be1fe-b5e8-11e6-995b-70b3d581.scout-server5 ] for 
> framework ce4bd8be-1198-4819-81d4-9a8439439741-
> Nov 29 04:04:16 ip-10-190-112-199 mesos-agent[6397]: I1129 04:04:16.892299  
> 6406 gc.cpp:83] Unscheduling 
> '/var/lib/mesos/slave/slaves/ce4bd8be-1198-4819-81d4-9a8439439741-S1/frameworks/ce4bd8be-1198-4819-81d4-9a8439439741-'
>  from gc
> Nov 29 04:04:16 ip-10-190-112-199 mesos-agent[6397]: I1129 04:04:16.892379  
> 6406 gc.cpp:83] Unscheduling 
> '/var/lib/mesos/slave/meta/slaves/ce4bd8be-1198-4819-81d4-9a8439439741-S1/frameworks/ce4bd8be-1198-4819-81d4-9a8439439741-'
>  from gc
> Nov 29 04:04:16 ip-10-190-112-199 mesos-agent[6397]: I1129 04:04:16.893131  
> 6405 slave.cpp:1701] Launching task group containing tasks [ 
> dat_scout.instance-e57be1fe-b5e8-11e6-995b-70b3d581.scout-server1, 
> dat_scout.instance-e57be1fe-b5e8-11e6-995b-70b3d581.scout-server2, 
> dat_scout.instance-e57be1fe-b5e8-11e6-995b-70b3d581.scout-server3, 
> dat_scout.instance-e57be1fe-b5e8-11e6-995b-70b3d581.scout-server4, 
> dat_scout.instance-e57be1fe-b5e8-11e6-995b-70b3d581.scout-server5 ] for 
> framework ce4bd8be-1198-4819-81d4-9a8439439741-
> Nov 29 04:04:16 ip-10-190-112-199 mesos-agent[6397]: I1129 04:04:16.893435  
> 6405 paths.cpp:536] Trying to chown 
> '/var/lib/mesos/slave/slaves/ce4bd8be-1198-4819-81d4-9a8439439741-S1/frameworks/ce4bd8be-1198-4819-81d4-9a8439439741-/executors/instance-dat_scout.e57be1fe-b5e8-11e6-995b-70b3d581/runs/8750c2a7-8bef-4a69-8ef2-b873f884bf91'
>  to user 'root'
> Nov 29 04:04:16 ip-10-190-112-199 mesos-agent[6397]: I1129 04:04:16.898026  
> 6405 slave.cpp:6179] Launching executor 
> 'instance-dat_scout.e57be1fe-b5e8-11e6-995b-70b3d581' of framework 
> ce4bd8be-1198-4819-81d4-9a8439439741- with resources cpus(*):0.1; 
> mem(*):32; disk(*):10; ports(*):[21421-21425] in work directory 
> '/var/lib/mesos/slave/slaves/ce4bd8be-1198-4819-81d4-9a8439439741-S1/frameworks/ce4bd8be-1198-4819-81d4-9a8439439741-/executors/instance-dat_scout.e57be1fe-b5e8-11e6-995b-70b3d581/runs/8750c2a7-8bef-4a69-8ef2-b873f884bf91'
> Nov 29 04:04:16 ip-10-190-112-199 mesos-agent[6397]: I1129 04:04:16.898731  
> 6407 docker.cpp:1000] Skipping non-docker container
> Nov 29 04:04:16 ip-10-190-112-199 mesos-agent[6397]: I1129 04:04:16.899050  
> 6407 containerizer.cpp:938] Starting container 
> 8750c2a7-8bef-4a69-8ef2-b873f884bf91 for executor 
> 'instance-dat_scout.e57be1fe-b5e8-11e6-995b-70b3d581' of framework 
> ce4bd8be-1198-4819-81d4-9a8439439741-
> Nov 29 04:04:16 ip-10-190-112-199 mesos-agent[6397]: I1129 04:04:16.899909  
> 6405 slave.cpp:1987] Queued task group containing tasks [ 
> dat_scout.instance-e57be1fe-b5e8-11e6-995b-70b3d581.scout-server1, 
> dat_scout.instance-e57be1fe-b5e8-11e6-995b-70b3d581.scout-server2, 
> dat_scout.instance-e57be1fe-b5e8-11e6-995b-70b3d581.scout-server3, 
> dat_scout.instance-e57be1fe-b5e8-11e6-995b-70b3d581.scout-server4, 
> dat_scout.instance-e57be1fe-b5e8-11e6-995b-70b3d581.scout-server5 ] for 
> executor 'instance-dat_scout.e57be1fe-b5e8-11e6-995b-70b3d581' of 
> framework ce4bd8be-1198-4819-81d4-9a8439439741-
> Nov 29 04:04:16 

[jira] [Assigned] (MESOS-6656) Nested containers can become unkillable

2018-02-08 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu reassigned MESOS-6656:
-

Assignee: Greg Mann  (was: Jie Yu)

> Nested containers can become unkillable
> ---
>
> Key: MESOS-6656
> URL: https://issues.apache.org/jira/browse/MESOS-6656
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization
>Reporter: Greg Mann
>Assignee: Greg Mann
>Priority: Major
>  Labels: nested
>
> An incident occurred recently in a cluster running a build of Mesos based on 
> commit {{757319357471227c0a1e906076eae8f9aa2fdbd6}} from master. A task group 
> of five tasks was launched via Marathon. After the tasks were launched, one 
> of the containers quickly exited and was successfully destroyed. A couple 
> minutes later, the task group was killed manually via Marathon, and the agent 
> can then be seen repeatedly attempting to kill the tasks for hours. No calls 
> to {{WAIT_NESTED_CONTAINER}} are visible in the agent logs, and the executor 
> logs do not indicate at any point that the nested containers were launched 
> successfully.
> Agent logs:
> {code}
> Nov 29 04:04:16 ip-10-190-112-199 mesos-agent[6397]: I1129 04:04:16.890911  
> 6406 slave.cpp:1539] Got assigned task group containing tasks [ 
> dat_scout.instance-e57be1fe-b5e8-11e6-995b-70b3d581.scout-server1, 
> dat_scout.instance-e57be1fe-b5e8-11e6-995b-70b3d581.scout-server2, 
> dat_scout.instance-e57be1fe-b5e8-11e6-995b-70b3d581.scout-server3, 
> dat_scout.instance-e57be1fe-b5e8-11e6-995b-70b3d581.scout-server4, 
> dat_scout.instance-e57be1fe-b5e8-11e6-995b-70b3d581.scout-server5 ] for 
> framework ce4bd8be-1198-4819-81d4-9a8439439741-
> Nov 29 04:04:16 ip-10-190-112-199 mesos-agent[6397]: I1129 04:04:16.892299  
> 6406 gc.cpp:83] Unscheduling 
> '/var/lib/mesos/slave/slaves/ce4bd8be-1198-4819-81d4-9a8439439741-S1/frameworks/ce4bd8be-1198-4819-81d4-9a8439439741-'
>  from gc
> Nov 29 04:04:16 ip-10-190-112-199 mesos-agent[6397]: I1129 04:04:16.892379  
> 6406 gc.cpp:83] Unscheduling 
> '/var/lib/mesos/slave/meta/slaves/ce4bd8be-1198-4819-81d4-9a8439439741-S1/frameworks/ce4bd8be-1198-4819-81d4-9a8439439741-'
>  from gc
> Nov 29 04:04:16 ip-10-190-112-199 mesos-agent[6397]: I1129 04:04:16.893131  
> 6405 slave.cpp:1701] Launching task group containing tasks [ 
> dat_scout.instance-e57be1fe-b5e8-11e6-995b-70b3d581.scout-server1, 
> dat_scout.instance-e57be1fe-b5e8-11e6-995b-70b3d581.scout-server2, 
> dat_scout.instance-e57be1fe-b5e8-11e6-995b-70b3d581.scout-server3, 
> dat_scout.instance-e57be1fe-b5e8-11e6-995b-70b3d581.scout-server4, 
> dat_scout.instance-e57be1fe-b5e8-11e6-995b-70b3d581.scout-server5 ] for 
> framework ce4bd8be-1198-4819-81d4-9a8439439741-
> Nov 29 04:04:16 ip-10-190-112-199 mesos-agent[6397]: I1129 04:04:16.893435  
> 6405 paths.cpp:536] Trying to chown 
> '/var/lib/mesos/slave/slaves/ce4bd8be-1198-4819-81d4-9a8439439741-S1/frameworks/ce4bd8be-1198-4819-81d4-9a8439439741-/executors/instance-dat_scout.e57be1fe-b5e8-11e6-995b-70b3d581/runs/8750c2a7-8bef-4a69-8ef2-b873f884bf91'
>  to user 'root'
> Nov 29 04:04:16 ip-10-190-112-199 mesos-agent[6397]: I1129 04:04:16.898026  
> 6405 slave.cpp:6179] Launching executor 
> 'instance-dat_scout.e57be1fe-b5e8-11e6-995b-70b3d581' of framework 
> ce4bd8be-1198-4819-81d4-9a8439439741- with resources cpus(*):0.1; 
> mem(*):32; disk(*):10; ports(*):[21421-21425] in work directory 
> '/var/lib/mesos/slave/slaves/ce4bd8be-1198-4819-81d4-9a8439439741-S1/frameworks/ce4bd8be-1198-4819-81d4-9a8439439741-/executors/instance-dat_scout.e57be1fe-b5e8-11e6-995b-70b3d581/runs/8750c2a7-8bef-4a69-8ef2-b873f884bf91'
> Nov 29 04:04:16 ip-10-190-112-199 mesos-agent[6397]: I1129 04:04:16.898731  
> 6407 docker.cpp:1000] Skipping non-docker container
> Nov 29 04:04:16 ip-10-190-112-199 mesos-agent[6397]: I1129 04:04:16.899050  
> 6407 containerizer.cpp:938] Starting container 
> 8750c2a7-8bef-4a69-8ef2-b873f884bf91 for executor 
> 'instance-dat_scout.e57be1fe-b5e8-11e6-995b-70b3d581' of framework 
> ce4bd8be-1198-4819-81d4-9a8439439741-
> Nov 29 04:04:16 ip-10-190-112-199 mesos-agent[6397]: I1129 04:04:16.899909  
> 6405 slave.cpp:1987] Queued task group containing tasks [ 
> dat_scout.instance-e57be1fe-b5e8-11e6-995b-70b3d581.scout-server1, 
> dat_scout.instance-e57be1fe-b5e8-11e6-995b-70b3d581.scout-server2, 
> dat_scout.instance-e57be1fe-b5e8-11e6-995b-70b3d581.scout-server3, 
> dat_scout.instance-e57be1fe-b5e8-11e6-995b-70b3d581.scout-server4, 
> dat_scout.instance-e57be1fe-b5e8-11e6-995b-70b3d581.scout-server5 ] for 
> executor 'instance-dat_scout.e57be1fe-b5e8-11e6-995b-70b3d581' of 
> framework ce4bd8be-1198-4819-81d4-9a8439439741-
> Nov 29 

[jira] [Assigned] (MESOS-6798) Volumes in `/dev/shm` overridden by mesos containerizer

2018-02-08 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu reassigned MESOS-6798:
-

Assignee: Jason Lai

> Volumes in `/dev/shm` overridden by mesos containerizer
> ---
>
> Key: MESOS-6798
> URL: https://issues.apache.org/jira/browse/MESOS-6798
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.1.1, 1.2.0, 1.3.1, 1.4.1, 1.5.0
>Reporter: Zhongbo Tian
>Assignee: Jason Lai
>Priority: Major
>
> When mounting a volume into `/dev/shm`, the volume is overridden by the 
> default mount point.
> For example:
> {code}
> mesos-execute --master=mesos-master --name=test --docker_image=busybox 
> --volumes='[{"container_path":"/tmp/hosts", "host_path":"/etc/hosts", 
> "mode":"RO"}]' --command="cat /tmp/hosts"
> {code}
> This will get an error for "No such file or directory"



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-6874) Agent silently ignores FS isolation when protobuf is malformed

2018-02-08 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu reassigned MESOS-6874:
-

Assignee: (was: Gilbert Song)

> Agent silently ignores FS isolation when protobuf is malformed
> --
>
> Key: MESOS-6874
> URL: https://issues.apache.org/jira/browse/MESOS-6874
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.1.0
>Reporter: Michael Gummelt
>Priority: Minor
>  Labels: newbie
>
> cc [~vinodkone]
> I accidentally set my Mesos ContainerInfo to include a DockerInfo instead of 
> a MesosInfo:
> {code}
> executorInfoBuilder.setContainer(
>     Protos.ContainerInfo.newBuilder()
>         .setType(Protos.ContainerInfo.Type.MESOS)
>         .setDocker(Protos.ContainerInfo.DockerInfo.newBuilder()
>             .setImage(podSpec.getContainer().get().getImageName()))
> {code}
> I would have expected a validation error before or during containerization, 
> but instead, the agent silently decided to ignore filesystem isolation 
> altogether, and launch my executor on the host filesystem. 
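
A minimal sketch of the kind of validation that would catch this (illustrative, not the actual Mesos validation code):

{code}
// Illustrative only: reject a ContainerInfo whose type and payload disagree
// instead of silently ignoring filesystem isolation.
Option<Error> validate(const ContainerInfo& containerInfo)
{
  if (containerInfo.type() == ContainerInfo::MESOS &&
      containerInfo.has_docker()) {
    return Error("'ContainerInfo.docker' is set for a MESOS container");
  }

  return None();
}
{code}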



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7069) The linux filesystem isolator should set mode and ownership for host volumes.

2018-02-08 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357272#comment-16357272
 ] 

Jie Yu commented on MESOS-7069:
---

[~ipronin] is this still an issue? Or we can close this one?

> The linux filesystem isolator should set mode and ownership for host volumes.
> -
>
> Key: MESOS-7069
> URL: https://issues.apache.org/jira/browse/MESOS-7069
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Gilbert Song
>Assignee: Ilya Pronin
>Priority: Major
>  Labels: filesystem, linux, volumes
>
> If the host path is a relative path, the linux filesystem isolator should set 
> the mode and ownership for this host volume, since that allows a non-root user 
> to write to the volume. Note that this is the case of sharing the host 
> filesystem (without a rootfs).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7450) Docker containerizer will leak dangling symlinks if restarted with a colon in the sandbox path

2018-02-08 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357265#comment-16357265
 ] 

Jie Yu commented on MESOS-7450:
---

[~kaysoky] is this still an issue?

> Docker containerizer will leak dangling symlinks if restarted with a colon in 
> the sandbox path
> --
>
> Key: MESOS-7450
> URL: https://issues.apache.org/jira/browse/MESOS-7450
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.21.0, 1.2.0
>Reporter: Joseph Wu
>Priority: Major
>  Labels: mesosphere
>
> The Docker CLI has a limitation, which was worked around in MESOS-1833.
> TL;DR: If you launch a container with a colon ({{:}}) in the sandbox path, we 
> will create a symlink to that path and mount that symlink into the Docker 
> container.
> However, when you restart the Mesos agent after launching a container like 
> the above, the Docker containerizer will "forget" about the symlink and 
> thereby not clean it up when the container exits.  We will still GC the 
> actual sandbox, but not the symlink.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7543) Allow isolators to specify secret environment

2018-02-08 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357262#comment-16357262
 ] 

Jie Yu commented on MESOS-7543:
---

[~karya], what's this ticket about? Re-open if this is still valid.

> Allow isolators to specify secret environment
> -
>
> Key: MESOS-7543
> URL: https://issues.apache.org/jira/browse/MESOS-7543
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, security
>Reporter: Kapil Arya
>Priority: Major
>  Labels: mesosphere
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-7599) Mesos Containerizer Cannot Pull from Certain Registries

2018-02-08 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu reassigned MESOS-7599:
-

Assignee: Gilbert Song

> Mesos Containerizer Cannot Pull from Certain Registries 
> 
>
> Key: MESOS-7599
> URL: https://issues.apache.org/jira/browse/MESOS-7599
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.2.0
>Reporter: Max Ehrlich
>Assignee: Gilbert Song
>Priority: Major
>
> I have a docker image that is on a registry hosted by gitlab. When I try to 
> start this container using the Mesos containerizer, it is never scheduled. I 
> have a feeling this is from the unusual name for the image that gitlab uses, 
> but I haven't had time to look into the code yet. I have also tried this with 
> a private gitlab instance that is password protected and I have a similar 
> issue (there also seems to be an unrelated issue that the Mesos containerizer 
> doesn't support password protected registries).
> Example image names are as follows
> * registry.gitlab.com/queuecumber/page/excon (public image)
> * gitlab..com:5005/sri/registry/baseline_combo01 (private, password 
> protected)
> The images seem to work using the Docker containerizer, and again I suspect 
> this is related to those long names with lots of / in them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-7617) UCR cannot read docker images containing long file paths

2018-02-08 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu reassigned MESOS-7617:
-

Assignee: Chun-Hung Hsiao

> UCR cannot read docker images containing long file paths
> 
>
> Key: MESOS-7617
> URL: https://issues.apache.org/jira/browse/MESOS-7617
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.1.2, 1.2.0, 1.3.0, 1.3.1
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>Priority: Major
>  Labels: containerizer, triaged
>
> The latest Docker uses go 1.7.5 
> (https://github.com/moby/moby/blob/master/CHANGELOG.md#contrib-1), in which 
> the {{archive/tar}} package has a bug that cannot handle file paths longer 
> than 100 characters (https://github.com/golang/go/issues/17630). As a result, 
> Docker will generate images containing ill-formed tar files (details below) 
> when there are long paths. Docker itself understands the ill-formed image 
> fine, but a standard tar program will interpret the image as if all files 
> with long paths are placed under the root directory 
> (https://github.com/moby/moby/issues/29360).
> This bug has been fixed in go 1.8, but since Docker is still using the buggy 
> version, we might need to handle these ill-formed images created by Docker 
> utilities.
> NOTE: It is confirmed that the {{archive/tar}} package in go 1.8 cannot 
> correctly extract the ill-formed tar files, but the one in go 1.7.5 could.
> Details: the {{archive/tar}} package uses {{USTAR}} format to handle files 
> with 100+-character-long paths (by only putting file name in the {{name}} 
> field and the path in the {{prefix}} field in the tar header), but uses 
> {{OLDGNU}}'s magic string, which does not understand the {{prefix}} field, so 
> a standard tar program will extract such files under the current directory.
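
For reference, a sketch of how a conforming reader reconstructs the path from a ustar header (the field names below are illustrative):

{code}
// Sketch: in the ustar format the full member path is prefix + "/" + name
// whenever the 'prefix' field is non-empty; a reader that ignores 'prefix'
// ends up extracting the file under 'name' alone.
std::string path = header.prefix.empty()
  ? header.name
  : header.prefix + "/" + header.name;
{code}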



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-7645) Support RO mode for bind mount volumes with filesystem/linux isolator

2018-02-08 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu reassigned MESOS-7645:
-

Assignee: Jie Yu  (was: Charles Raimbert)

> Support RO mode for bind mount volumes with filesystem/linux isolator
> -
>
> Key: MESOS-7645
> URL: https://issues.apache.org/jira/browse/MESOS-7645
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.3.1, 1.4.1
>Reporter: Charles Raimbert
>Assignee: Jie Yu
>Priority: Major
>  Labels: storage
>
> The filesystem/linux isolator currently creates all bind mount volumes as RW, 
> even if a volume mode is set as RO.
> The TODO in the isolator code helps to spot the missing capability:
> https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/isolators/filesystem/linux.cpp#L587
> {code}
> // TODO(jieyu): Consider the mode in the volume.
> {code}
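
For context, making a bind mount read-only on Linux takes two steps (sketch of the standard technique, not the isolator code):

{code}
// Sketch: a bind mount ignores MS_RDONLY at creation time; it has to be
// remounted read-only in a second step.
mount(source.c_str(), target.c_str(), nullptr, MS_BIND, nullptr);
mount(nullptr, target.c_str(), nullptr,
      MS_BIND | MS_REMOUNT | MS_RDONLY, nullptr);
{code}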



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7685) Issue using S3FS from docker container with the mesos containerizer

2018-02-08 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357255#comment-16357255
 ] 

Jie Yu commented on MESOS-7685:
---

If you are using the cgroups devices isolator, /dev/fuse won't be accessible. 
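
For reference, the device would have to be whitelisted in the container's devices cgroup (sketch; /dev/fuse is character device 10:229, and the cgroup path below is a placeholder):

{code}
// Sketch: allow read/write/mknod access to /dev/fuse for the container.
Try<Nothing> write = os::write(
    "/sys/fs/cgroup/devices/mesos/<container-id>/devices.allow",
    "c 10:229 rwm");
{code}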

> Issue using S3FS from docker container with the mesos containerizer
> ---
>
> Key: MESOS-7685
> URL: https://issues.apache.org/jira/browse/MESOS-7685
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 1.1.0
>Reporter: Andrei Filip
>Priority: Major
>
> I have a docker image which uses S3FS to mount an Amazon S3 bucket for use as 
> a local filesystem. Playing around with this container manually, using 
> docker, I am able to use S3FS as expected.
> When trying to use this image with the mesos containerizer, I get the 
> following error:
> fuse: device not found, try 'modprobe fuse' first
> The way I'm launching a job that runs this s3fs command is via the aurora 
> scheduler. Somehow it seems that docker is able to use the fuse kernel 
> plugin, but the mesos containerizer does not.
> I've also created a stackoverflow topic about this issue here: 
> https://stackoverflow.com/questions/44569238/using-s3fs-in-a-docker-container-ran-by-the-mesos-containerizer/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-7685) Issue using S3FS from docker container with the mesos containerizer

2018-02-08 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu reassigned MESOS-7685:
-

Assignee: Jie Yu

> Issue using S3FS from docker container with the mesos containerizer
> ---
>
> Key: MESOS-7685
> URL: https://issues.apache.org/jira/browse/MESOS-7685
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 1.1.0
>Reporter: Andrei Filip
>Assignee: Jie Yu
>Priority: Major
>
> I have a docker image which uses S3FS to mount an Amazon S3 bucket for use as 
> a local filesystem. Playing around with this container manually, using 
> docker, I am able to use S3FS as expected.
> When trying to use this image with the mesos containerizer, I get the 
> following error:
> fuse: device not found, try 'modprobe fuse' first
> The way I'm launching a job that runs this s3fs command is via the aurora 
> scheduler. Somehow it seems that docker is able to use the fuse kernel 
> plugin, but the mesos containerizer does not.
> I've also created a stackoverflow topic about this issue here: 
> https://stackoverflow.com/questions/44569238/using-s3fs-in-a-docker-container-ran-by-the-mesos-containerizer/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8105) Docker containerizer fails with "Unable to get executor pid after launch"

2018-02-08 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu reassigned MESOS-8105:
-

Assignee: (was: Jie Yu)

> Docker containerizer fails with "Unable to get executor pid after launch"
> -
>
> Key: MESOS-8105
> URL: https://issues.apache.org/jira/browse/MESOS-8105
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: maybob
>Priority: Major
>  Labels: docker
>
> When running lots of commands at the same time, each command using the same 
> executor with a different executorId via Docker, some executors fail with the 
> error "Unable to get executor pid after launch". 
> One reason for this error may be that "docker inspect" hangs or exits 0 with 
> pid 0. Another reason may be that many Docker containers consume too many 
> resources, e.g. file descriptors.
> {color:red}Log:{color}
> {code:java}
> I1012 16:15:01.003931 124081 slave.cpp:1619] Got assigned task '920860' for 
> framework framework-id-daily
> I1012 16:15:01.006091 124081 slave.cpp:1900] Authorizing task '920860' for 
> framework framework-id-daily
> I1012 16:15:01.008281 124081 slave.cpp:2087] Launching task '920860' for 
> framework framework-id-daily
> I1012 16:15:01.008779 124081 paths.cpp:573] Trying to chown 
> '/volumes/sdb1/mesos/slaves/89192f68-d28f-498c-808f-442a1ef576b3-S2/frameworks/framework-id-daily/executors/Executor_920860/runs/29c82b61-1242-4de9-80cf-16f46c30e7e3'
>  to user 'maybob'
> I1012 16:15:01.009027 124081 slave.cpp:7401] Checkpointing ExecutorInfo to 
> '/volumes/sdb1/mesos/meta/slaves/89192f68-d28f-498c-808f-442a1ef576b3-S2/frameworks/framework-id-daily/executors/Executor_920860/executor.info'
> I1012 16:15:01.009546 124081 slave.cpp:7038] Launching executor 
> 'Executor_920860' of framework framework-id-daily with resources {} in work 
> directory 
> '/volumes/sdb1/mesos/slaves/89192f68-d28f-498c-808f-442a1ef576b3-S2/frameworks/framework-id-daily/executors/Executor_920860/runs/29c82b61-1242-4de9-80cf-16f46c30e7e3'
> I1012 16:15:01.010339 124081 slave.cpp:7429] Checkpointing TaskInfo to 
> '/volumes/sdb1/mesos/meta/slaves/89192f68-d28f-498c-808f-442a1ef576b3-S2/frameworks/framework-id-daily/executors/Executor_920860/runs/29c82b61-1242-4de9-80cf-16f46c30e7e3/tasks/920860/task.info'
> I1012 16:15:01.010726 124081 slave.cpp:2316] Queued task '920860' for 
> executor 'Executor_920860' of framework framework-id-daily
> I1012 16:15:01.011740 124088 docker.cpp:1175] Starting container 
> '29c82b61-1242-4de9-80cf-16f46c30e7e3' for executor 'Executor_920860' and 
> framework framework-id-daily
> I1012 16:15:01.013123 124081 slave.cpp:877] Successfully attached file 
> '/volumes/sdb1/mesos/slaves/89192f68-d28f-498c-808f-442a1ef576b3-S2/frameworks/framework-id-daily/executors/Executor_920860/runs/29c82b61-1242-4de9-80cf-16f46c30e7e3'
> I1012 16:15:01.013290 124080 fetcher.cpp:353] Starting to fetch URIs for 
> container: 29c82b61-1242-4de9-80cf-16f46c30e7e3, directory: 
> /volumes/sdb1/mesos/slaves/89192f68-d28f-498c-808f-442a1ef576b3-S2/frameworks/framework-id-daily/executors/Executor_920860/runs/29c82b61-1242-4de9-80cf-16f46c30e7e3
> I1012 16:15:01.706429 124071 docker.cpp:909] Running docker -H 
> unix:///var/run/docker.sock run --cpu-shares 378 --memory 427819008 -e 
> LIBPROCESS_PORT=0 -e MESOS_AGENT_ENDPOINT=xxx.xxx.xxx.xxx:5051 -e 
> MESOS_CHECKPOINT=1 -e 
> MESOS_CONTAINER_NAME=mesos-89192f68-d28f-498c-808f-442a1ef576b3-S2.29c82b61-1242-4de9-80cf-16f46c30e7e3
>  -e 
> MESOS_DIRECTORY=/volumes/sdb1/mesos/slaves/89192f68-d28f-498c-808f-442a1ef576b3-S2/frameworks/framework-id-daily/executors/Executor_920860/runs/29c82b61-1242-4de9-80cf-16f46c30e7e3
>  -e MESOS_EXECUTOR_ID=Executor_920860 -e 
> MESOS_EXECUTOR_SHUTDOWN_GRACE_PERIOD=5secs -e 
> MESOS_FRAMEWORK_ID=framework-id-daily -e MESOS_HTTP_COMMAND_EXECUTOR=0 -e 
> MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos-1.3.1.so -e 
> MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos-1.3.1.so -e 
> MESOS_RECOVERY_TIMEOUT=15mins -e MESOS_SANDBOX=/mnt/mesos/sandbox -e 
> MESOS_SLAVE_ID=89192f68-d28f-498c-808f-442a1ef576b3-S2 -e 
> MESOS_SLAVE_PID=slave(1)@xxx.xxx.xxx.xxx:5051 -e 
> MESOS_SUBSCRIPTION_BACKOFF_MAX=2secs -v 
> /volumes/sdb1/mesos/slaves/89192f68-d28f-498c-808f-442a1ef576b3-S2/frameworks/framework-id-daily/executors/Executor_920860/runs/29c82b61-1242-4de9-80cf-16f46c30e7e3:/mnt/mesos/sandbox
>  --net host --entrypoint /bin/sh --name 
> mesos-89192f68-d28f-498c-808f-442a1ef576b3-S2.29c82b61-1242-4de9-80cf-16f46c30e7e3
>  reg.docker.xxx/xx/executor:v25 -c env && cd $MESOS_SANDBOX && 
> ./executor.sh
> I1012 16:15:01.717859 124071 docker.cpp:1071] Running docker -H 
> unix:///var/run/docker.sock inspect 
> mesos-89192f68-d28f-498c-808f-442a1ef576b3-S2.29c82b61-1242-4de9-80cf-16f46c30e7e3
> I1012 16:15:02.033951 

[jira] [Assigned] (MESOS-8158) Mesos Agent in docker neglects to retry discovering Task docker containers

2018-02-08 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu reassigned MESOS-8158:
-

Assignee: (was: Gilbert Song)

> Mesos Agent in docker neglects to retry discovering Task docker containers
> --
>
> Key: MESOS-8158
> URL: https://issues.apache.org/jira/browse/MESOS-8158
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization, docker, executor
>Affects Versions: 1.4.0
> Environment: Windows 10 with Docker version 17.09.0-ce, build afdb6d4
>Reporter: Charles Allen
>Priority: Major
>
> I have attempted to launch Mesos agents inside of a Docker container in such 
> a way that the agent container can be replaced and recovered. Unfortunately I 
> hit a major snag in the way the Mesos Docker launch path works.
> To test basic functionality, a marathon app is set up with the following 
> command: {{date && python -m SimpleHTTPServer $PORT0}} 
> That way the HTTP port can be accessed to verify that ports are being assigned 
> correctly, and the date is printed out in the log.
> When I attempt to start this marathon app, the mesos agent (inside a docker 
> container) properly launches an executor which properly creates a second task 
> that launches the python code. Here's the output from the executor logs (this 
> looks correct):
> {code}
> I1101 20:34:03.420210 68270 exec.cpp:162] Version: 1.4.0
> I1101 20:34:03.427455 68281 exec.cpp:237] Executor registered on agent 
> d9bb6e96-ee26-43c2-977e-0c404fdd4e81-S0
> I1101 20:34:03.428414 68283 executor.cpp:120] Registered docker executor on 
> 10.0.75.2
> I1101 20:34:03.428680 68281 executor.cpp:160] Starting task 
> testapp.fe35282f-bf43-11e7-a24b-0242ac110002
> I1101 20:34:03.428941 68281 docker.cpp:1080] Running docker -H 
> unix:///var/run/docker.sock run --cpu-shares 1024 --memory 134217728 -e 
> HOST=10.0.75.2 -e MARATHON_APP_DOCKER_IMAGE=python:2 -e 
> MARATHON_APP_ID=/testapp -e MARATHON_APP_LABELS= -e MARATHON_APP_RESOURCE_CPUS
> =1.0 -e MARATHON_APP_RESOURCE_DISK=0.0 -e MARATHON_APP_RESOURCE_GPUS=0 -e 
> MARATHON_APP_RESOURCE_MEM=128.0 -e 
> MARATHON_APP_VERSION=2017-11-01T20:33:44.869Z -e 
> MESOS_CONTAINER_NAME=mesos-84f9ae30-9d4c-484a-860c-ca7845b7ec75 -e 
> MESOS_SANDBOX=/mnt/mesos/sandbox -e MESOS_TA
> SK_ID=testapp.fe35282f-bf43-11e7-a24b-0242ac110002 -e PORT=31464 -e 
> PORT0=31464 -e PORTS=31464 -e PORT_1=31464 -e PORT_HTTP=31464 -v 
> /var/run/mesos/slaves/d9bb6e96-ee26-43c2-977e-0c404fdd4e81-S0/frameworks/a5eb6da1-f8ac-4642-8d66-cdd2e5b14d45-0001/executors/testapp
> .fe35282f-bf43-11e7-a24b-0242ac110002/runs/84f9ae30-9d4c-484a-860c-ca7845b7ec75:/mnt/mesos/sandbox
>  --net host --entrypoint /bin/sh --name 
> mesos-84f9ae30-9d4c-484a-860c-ca7845b7ec75 
> --label=MESOS_TASK_ID=testapp.fe35282f-bf43-11e7-a24b-0242ac110002 python:2 
> -c date && p
> ython -m SimpleHTTPServer $PORT0
> I1101 20:34:03.430402 68281 docker.cpp:1243] Running docker -H 
> unix:///var/run/docker.sock inspect mesos-84f9ae30-9d4c-484a-860c-ca7845b7ec75
> I1101 20:34:03.520303 68286 docker.cpp:1290] Retrying inspect with non-zero 
> status code. cmd: 'docker -H unix:///var/run/docker.sock inspect 
> mesos-84f9ae30-9d4c-484a-860c-ca7845b7ec75', interval: 500ms
> I1101 20:34:04.021216 68288 docker.cpp:1243] Running docker -H 
> unix:///var/run/docker.sock inspect mesos-84f9ae30-9d4c-484a-860c-ca7845b7ec75
> I1101 20:34:04.124490 68281 docker.cpp:1290] Retrying inspect with non-zero 
> status code. cmd: 'docker -H unix:///var/run/docker.sock inspect 
> mesos-84f9ae30-9d4c-484a-860c-ca7845b7ec75', interval: 500ms
> I1101 20:34:04.624964 68288 docker.cpp:1243] Running docker -H 
> unix:///var/run/docker.sock inspect mesos-84f9ae30-9d4c-484a-860c-ca7845b7ec75
> I1101 20:34:04.934087 68286 docker.cpp:1345] Retrying inspect since container 
> not yet started. cmd: 'docker -H unix:///var/run/docker.sock inspect 
> mesos-84f9ae30-9d4c-484a-860c-ca7845b7ec75', interval: 500ms
> I1101 20:34:05.435145 68288 docker.cpp:1243] Running docker -H 
> unix:///var/run/docker.sock inspect mesos-84f9ae30-9d4c-484a-860c-ca7845b7ec75
> Wed Nov  1 20:34:06 UTC 2017
> {code}
> But, somehow there is a TASK_FAILED message sent to marathon.
> Upon further investigation, the following snippet can be found in the agent 
> logs (running in a docker container)
> {code}
> I1101 20:34:00.949129 9 slave.cpp:1736] Got assigned task 
> 'testapp.fe35282f-bf43-11e7-a24b-0242ac110002' for framework 
> a5eb6da1-f8ac-4642-8d66-cdd2e5b14d45-0001
> I1101 20:34:00.950150 9 gc.cpp:93] Unscheduling 
> '/var/run/mesos/slaves/d9bb6e96-ee26-43c2-977e-0c404fdd4e81-S0/frameworks/a5eb6da1-f8ac-4642-8d66-cdd2e5b14d45-0001'
>  from gc
> I1101 20:34:00.950225 9 gc.cpp:93] Unscheduling 
> 

[jira] [Assigned] (MESOS-8398) External volumes (through docker/volume isolator) might not be accessible by non-root users.

2018-02-08 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu reassigned MESOS-8398:
-

Assignee: Jie Yu

> External volumes (through docker/volume isolator) might not be accessible by 
> non-root users.
> 
>
> Key: MESOS-8398
> URL: https://issues.apache.org/jira/browse/MESOS-8398
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.3.1, 1.4.1
>Reporter: Jie Yu
>Assignee: Jie Yu
>Priority: Major
>
> That's because we don't perform chown/chmod for external volumes at the 
> moment (because a volume might be shared across multiple containers). If the 
> container is launched as a non-root user, it might not be able to access 
> the external volume.
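
For illustration only, a minimal sketch (not the eventual fix) of the kind of ownership adjustment being discussed: chown the external volume's host mount point to the container user before launch. The path and user below are placeholders, and whether such a chown is safe for volumes shared across containers is exactly the open question here.

{code:java}
#include <sys/types.h>
#include <unistd.h>
#include <pwd.h>
#include <cerrno>
#include <cstdio>
#include <cstring>

int main()
{
  // Hypothetical host-side mount point of the external volume.
  const char* volumePath = "/var/lib/mesos/volumes/external/my-volume";

  // Resolve the container user (placeholder) to uid/gid.
  struct passwd* pw = getpwnam("nobody");
  if (pw == nullptr) {
    std::perror("getpwnam");
    return 1;
  }

  // Grant ownership so a non-root task can read/write the volume.
  // NOTE: if the volume is shared across containers running as different
  // users, this simple chown is not sufficient -- that is the issue above.
  if (chown(volumePath, pw->pw_uid, pw->pw_gid) != 0) {
    std::fprintf(stderr, "chown '%s': %s\n", volumePath, std::strerror(errno));
    return 1;
  }

  return 0;
}
{code}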



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8497) Docker parameter `name` does not work with Docker Containerizer.

2018-02-08 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu reassigned MESOS-8497:
-

Assignee: Gilbert Song

> Docker parameter `name` does not work with Docker Containerizer.
> 
>
> Key: MESOS-8497
> URL: https://issues.apache.org/jira/browse/MESOS-8497
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Jörg Schad
>Assignee: Gilbert Song
>Priority: Major
>  Labels: containerizer
> Attachments: agent.log, master.log
>
>
> When deploying a marathon app with the Docker Containerizer (need to check the 
> Mesos Containerizer) and the {{name}} parameter set, Mesos is not able to 
> recognize/control/kill the started container.
> Steps to reproduce: 
>  # Deploy the marathon app definition below
>  # Watch the task get stuck in staging and Mesos fail to kill it or 
> communicate with it
>  ## 
> {quote}e.g., Agent Logs: W0126 18:38:50.00  4988 slave.cpp:6750] Failed 
> to get resource statistics for executor 
> ‘instana-agent.1a1f8d22-02c8-11e8-b607-923c3c523109’ of framework 
> 41f1b534-5f9d-4b5e-bb74-a0e387d5739f-0001: Failed to run ‘docker -H 
> unix:///var/run/docker.sock inspect 
> mesos-1c6f894d-9a3e-408c-8146-47ebab2f28be’: exited with status 1; 
> stderr=’Error: No such image, container or task: 
> mesos-1c6f894d-9a3e-408c-8146-47ebab2f28be{quote}
>  # Check on the node and see the container running, but not recognized by Mesos
> {noformat}
> {
> "id": "/docker-test",
> "instances": 1,
> "portDefinitions": [],
> "container": {
> "type": "DOCKER",
> "volumes": [],
> "docker": {
> "image": "ubuntu:16.04",
> "parameters": [
> {
> "key": "name",
> "value": "myname"
> }
> ]
> }
> },
> "cpus": 0.1,
> "mem": 128,
> "requirePorts": false,
> "networks": [],
> "healthChecks": [],
> "fetch": [],
> "constraints": [],
> "cmd": "sleep 1000"
> }
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8522) `prepareMounts` in Mesos containerizer is flaky.

2018-02-08 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu reassigned MESOS-8522:
-

Assignee: Jie Yu

> `prepareMounts` in Mesos containerizer is flaky.
> 
>
> Key: MESOS-8522
> URL: https://issues.apache.org/jira/browse/MESOS-8522
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.5.0
>Reporter: Chun-Hung Hsiao
>Assignee: Jie Yu
>Priority: Critical
>  Labels: mesosphere, storage
>
> The 
> [{{prepareMount()}}|https://github.com/apache/mesos/blob/1.5.x/src/slave/containerizer/mesos/launch.cpp#L244]
>  function in {{src/slave/containerizer/mesos/launch.cpp}} sometimes fails 
> with the following error:
> {noformat}
> Failed to prepare mounts: Failed to mark 
> '/home/docker/containers/af78db6ebc1aff572e576b773d1378121a66bb755ed63b3278e759907e5fe7b6/shm'
>  as slave: Invalid argument
> {noformat}
> The error message comes from 
> https://github.com/apache/mesos/blob/1.5.x/src/slave/containerizer/mesos/launch.cpp#L#L326.
> Although it does not happen frequently, it can be reproduced by running tests 
> that need to clone mount namespaces in repetition. For example, I just 
> reproduced the bug with the following command after 17 minutes:
> {noformat}
> sudo bin/mesos-tests.sh --gtest_filter='*ROOT_PublishResourcesRecovery' 
> --gtest_break_on_failure --gtest_repeat=-1 --verbose
> {noformat}
> Note that in this example, the test itself does not involve any docker image or 
> the docker containerizer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8539) Add metrics about CSI plugin terminations.

2018-02-02 Thread Jie Yu (JIRA)
Jie Yu created MESOS-8539:
-

 Summary: Add metrics about CSI plugin terminations.
 Key: MESOS-8539
 URL: https://issues.apache.org/jira/browse/MESOS-8539
 Project: Mesos
  Issue Type: Task
Reporter: Jie Yu
Assignee: Jie Yu


So that operators can be alerted on flapping CSI plugin container.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8527) Add metrics about number of subscribed LRPs on the agent.

2018-02-01 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu reassigned MESOS-8527:
-

Assignee: Jie Yu

> Add metrics about number of subscribed LRPs on the agent.
> -
>
> Key: MESOS-8527
> URL: https://issues.apache.org/jira/browse/MESOS-8527
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>Assignee: Jie Yu
>Priority: Major
>
> Gauge: `resource_provider/subscribed`



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8528) Design doc for External Resource Provider (ERP) support.

2018-02-01 Thread Jie Yu (JIRA)
Jie Yu created MESOS-8528:
-

 Summary: Design doc for External Resource Provider (ERP) support.
 Key: MESOS-8528
 URL: https://issues.apache.org/jira/browse/MESOS-8528
 Project: Mesos
  Issue Type: Task
Reporter: Jie Yu






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8527) Add metrics about number of subscribed LRPs on the agent.

2018-02-01 Thread Jie Yu (JIRA)
Jie Yu created MESOS-8527:
-

 Summary: Add metrics about number of subscribed LRPs on the agent.
 Key: MESOS-8527
 URL: https://issues.apache.org/jira/browse/MESOS-8527
 Project: Mesos
  Issue Type: Task
 Environment: Gauge: `resource_provider/subscribed`
Reporter: Jie Yu






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-8527) Add metrics about number of subscribed LRPs on the agent.

2018-02-01 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-8527:
--
Environment: (was: Gauge: `resource_provider/subscribed`)

> Add metrics about number of subscribed LRPs on the agent.
> -
>
> Key: MESOS-8527
> URL: https://issues.apache.org/jira/browse/MESOS-8527
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8526) Support BLOCK type persistent volume.

2018-02-01 Thread Jie Yu (JIRA)
Jie Yu created MESOS-8526:
-

 Summary: Support BLOCK type persistent volume.
 Key: MESOS-8526
 URL: https://issues.apache.org/jira/browse/MESOS-8526
 Project: Mesos
  Issue Type: Improvement
Reporter: Jie Yu


Currently, only MOUNT/PATH is supported. Supporting BLOCK means that we need to 
adjust the cgroups devices isolator (and the corresponding Docker containerizer 
code) to enable access to the device. The bind mount logic in the 
filesystem/linux isolator might be slightly different too.
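
To make the isolator adjustment above concrete, here is a hedged, standalone sketch (not Mesos code) of granting a container's cgroup access to a block device via the cgroups v1 devices controller. The cgroup path and the 8:16 major:minor numbers are placeholders; the real isolator would derive them from the container ID and the volume's backing device.

{code:java}
#include <fstream>
#include <iostream>
#include <string>

int main()
{
  // Hypothetical container cgroup under the devices subsystem.
  const std::string cgroup =
    "/sys/fs/cgroup/devices/mesos/9f2c1c2e-example-container";

  // Hypothetical block device (major:minor) backing the BLOCK volume.
  const std::string rule = "b 8:16 rwm";  // read, write, mknod

  std::ofstream allow(cgroup + "/devices.allow");
  if (!allow) {
    std::cerr << "Failed to open devices.allow under " << cgroup << std::endl;
    return 1;
  }

  // Whitelist the device for the container.
  allow << rule << std::endl;

  return allow.good() ? 0 : 1;
}
{code}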



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8525) Support mount propagation when container image is used.

2018-02-01 Thread Jie Yu (JIRA)
Jie Yu created MESOS-8525:
-

 Summary: Support mount propagation when container image is used.
 Key: MESOS-8525
 URL: https://issues.apache.org/jira/browse/MESOS-8525
 Project: Mesos
  Issue Type: Improvement
Reporter: Jie Yu






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-8525) Support mount propagation when container image is used.

2018-02-01 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-8525:
--
Docs Text:   (was: There's a limitation currently that bi-directional mount 
propagation does not work when a container has a rootfs defined:
https://github.com/apache/mesos/blob/1.5.x/src/slave/containerizer/mesos/launch.cpp#L263-L281
)

> Support mount propagation when container image is used.
> ---
>
> Key: MESOS-8525
> URL: https://issues.apache.org/jira/browse/MESOS-8525
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Jie Yu
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-8525) Support mount propagation when container image is used.

2018-02-01 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-8525:
--
Description: 
There's a limitation currently that bi-directional mount propagation does not 
work when a container has a rootfs defined:
https://github.com/apache/mesos/blob/1.5.x/src/slave/containerizer/mesos/launch.cpp#L263-L281
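
For illustration only (not Mesos code): bi-directional propagation would require the volume mount point to be marked MS_SHARED instead of the rslave default applied in the launch path linked above. A minimal sketch with a placeholder path:

{code:java}
#include <sys/mount.h>
#include <cerrno>
#include <cstdio>
#include <cstring>

int main()
{
  // Hypothetical host directory that should see mounts made by the container.
  const char* target = "/mnt/shared-with-container";

  // Turn the mount point into a shared peer group so mount events
  // propagate in both directions (host <-> container).
  if (mount(nullptr, target, nullptr, MS_SHARED, nullptr) != 0) {
    std::fprintf(stderr, "mount(MS_SHARED, '%s'): %s\n",
                 target, std::strerror(errno));
    return 1;
  }

  return 0;
}
{code}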

> Support mount propagation when container image is used.
> ---
>
> Key: MESOS-8525
> URL: https://issues.apache.org/jira/browse/MESOS-8525
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Jie Yu
>Priority: Major
>
> There's a limitation currently that bi-directional mount propagation does not 
> work when a container has a rootfs defined:
> https://github.com/apache/mesos/blob/1.5.x/src/slave/containerizer/mesos/launch.cpp#L263-L281



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8522) `prepareMounts` in Mesos containerizer is flaky.

2018-02-01 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16348986#comment-16348986
 ] 

Jie Yu commented on MESOS-8522:
---

Looking at the box, there seemed to be a flapping Docker container, which 
explains this: the mount entry is gone after we scan the mount table but before 
we mark it as a slave mount.
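
In other words, the failing call is roughly the one sketched below (placeholder path): marking a mount point that has already disappeared as a slave mount fails with EINVAL. One possible mitigation is to tolerate EINVAL/ENOENT for entries that vanished between the scan and the mark.

{code:java}
#include <sys/mount.h>
#include <cerrno>
#include <cstdio>
#include <cstring>

int main()
{
  // A mount entry observed in an earlier scan of /proc/self/mountinfo,
  // e.g. a Docker container's shm mount that has since been torn down.
  const char* target = "/home/docker/containers/example-id/shm";  // placeholder

  // If the entry was unmounted after the scan, this fails with EINVAL,
  // which is the flaky "Failed to mark ... as slave" error above.
  if (mount(nullptr, target, nullptr, MS_SLAVE, nullptr) != 0) {
    std::fprintf(stderr, "mount(MS_SLAVE, '%s'): %s\n",
                 target, std::strerror(errno));
    return 1;
  }

  return 0;
}
{code}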

> `prepareMounts` in Mesos containerizer is flaky.
> 
>
> Key: MESOS-8522
> URL: https://issues.apache.org/jira/browse/MESOS-8522
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.5.0
>Reporter: Chun-Hung Hsiao
>Priority: Critical
>  Labels: mesosphere, storage
>
> The 
> [{{prepareMount()}}|https://github.com/apache/mesos/blob/1.5.x/src/slave/containerizer/mesos/launch.cpp#L244]
>  function in {{src/slave/containerizer/mesos/launch.cpp}} sometimes fails 
> with the following error:
> {noformat}
> Failed to prepare mounts: Failed to mark 
> '/home/docker/containers/af78db6ebc1aff572e576b773d1378121a66bb755ed63b3278e759907e5fe7b6/shm'
>  as slave: Invalid argument
> {noformat}
> The error message comes from 
> https://github.com/apache/mesos/blob/1.5.x/src/slave/containerizer/mesos/launch.cpp#L#L326.
> Although it does not happen frequently, it can be reproduced by running tests 
> that need to clone mount namespaces in repetition. For example, I just 
> reproduced the bug with the following command after 17 minutes:
> {noformat}
> sudo bin/mesos-tests.sh --gtest_filter='*ROOT_PublishResourcesRecovery' 
> --gtest_break_on_failure --gtest_repeat=-1 --verbose
> {noformat}
> Note that in this example, the test itself does not involve any docker image or 
> the docker containerizer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-8510) URI disk profile adaptor does not consider plugin type for a profile.

2018-01-30 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-8510:
--
Sprint: Mesosphere Sprint 73

> URI disk profile adaptor does not consider plugin type for a profile.
> -
>
> Key: MESOS-8510
> URL: https://issues.apache.org/jira/browse/MESOS-8510
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.5.0
>Reporter: Jie Yu
>Assignee: Joseph Wu
>Priority: Major
>
> Currently, the URI disk profile adaptor will fetch a URI, the content of 
> which contains a profile matrix. However, there's no field in the profile 
> matrix for the adaptor to tell which plugin type a profile is for.
> We should consider adding a `plugin_type` field in `CSIManifest`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8510) URI disk profile adaptor does not consider plugin type for a profile.

2018-01-30 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu reassigned MESOS-8510:
-

Assignee: Joseph Wu

> URI disk profile adaptor does not consider plugin type for a profile.
> -
>
> Key: MESOS-8510
> URL: https://issues.apache.org/jira/browse/MESOS-8510
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.5.0
>Reporter: Jie Yu
>Assignee: Joseph Wu
>Priority: Major
>
> Currently, the URI disk profile adaptor will fetch a URI, the content of 
> which contains a profile matrix. However, there's no field in the profile 
> matrix for the adaptor to tell which plugin type a profile is for.
> We should consider adding a `plugin_type` field in `CSIManifest`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8510) URI disk profile adaptor does not consider plugin type for a profile.

2018-01-30 Thread Jie Yu (JIRA)
Jie Yu created MESOS-8510:
-

 Summary: URI disk profile adaptor does not consider plugin type 
for a profile.
 Key: MESOS-8510
 URL: https://issues.apache.org/jira/browse/MESOS-8510
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.5.0
Reporter: Jie Yu


Currently, the URI disk profile adaptor will fetch a URI, the content of which 
contains a profile matrix. However, there's no field in the profile matrix for 
the adaptor to tell which plugin type a profile is for.

We should consider adding a `plugin_type` field in `CSIManifest`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-7504) Parent's mount namespace cannot be determined when launching a nested container.

2018-01-30 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-7504:
--
Fix Version/s: (was: 1.5.2)
   1.4.2

> Parent's mount namespace cannot be determined when launching a nested 
> container.
> 
>
> Key: MESOS-7504
> URL: https://issues.apache.org/jira/browse/MESOS-7504
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.3.0
> Environment: Ubuntu 16.04
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: containerizer, flaky-test, mesosphere
> Fix For: 1.4.2, 1.5.0
>
>
> I've observed this failure twice in different Linux environments. Here is an 
> example of such failure:
> {noformat}
> [ RUN  ] 
> NestedMesosContainerizerTest.ROOT_CGROUPS_DestroyDebugContainerOnRecover
> I0509 21:53:25.471657 17167 containerizer.cpp:221] Using isolation: 
> cgroups/cpu,filesystem/linux,namespaces/pid,network/cni,volume/image
> I0509 21:53:25.475124 17167 linux_launcher.cpp:150] Using 
> /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
> I0509 21:53:25.475407 17167 provisioner.cpp:249] Using default backend 
> 'overlay'
> I0509 21:53:25.481232 17186 containerizer.cpp:608] Recovering containerizer
> I0509 21:53:25.482295 17186 provisioner.cpp:410] Provisioner recovery complete
> I0509 21:53:25.482587 17187 containerizer.cpp:1001] Starting container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d for executor 'executor' of framework 
> I0509 21:53:25.482918 17189 cgroups.cpp:410] Creating cgroup at 
> '/sys/fs/cgroup/cpu,cpuacct/mesos_test_d989f526-efe0-4553-bf79-936ad66c3753/21bc372c-0f2c-49f5-b8ab-8d32c232b95d'
>  for container 21bc372c-0f2c-49f5-b8ab-8d32c232b95d
> I0509 21:53:25.484103 17190 cpu.cpp:101] Updated 'cpu.shares' to 1024 (cpus 
> 1) for container 21bc372c-0f2c-49f5-b8ab-8d32c232b95d
> I0509 21:53:25.484808 17186 containerizer.cpp:1524] Launching 
> 'mesos-containerizer' with flags '--help="false" 
> --launch_info="{"clone_namespaces":[131072,536870912],"command":{"shell":true,"value":"sleep
>  
> 1000"},"environment":{"variables":[{"name":"MESOS_SANDBOX","type":"VALUE","value":"\/tmp\/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_zlywyr"}]},"pre_exec_commands":[{"arguments":["mesos-containerizer","mount","--help=false","--operation=make-rslave","--path=\/"],"shell":false,"value":"\/home\/ubuntu\/workspace\/mesos\/Mesos_CI-build\/FLAG\/SSL\/label\/mesos-ec2-ubuntu-16.04\/mesos\/build\/src\/mesos-containerizer"},{"shell":true,"value":"mount
>  -n -t proc proc \/proc -o 
> nosuid,noexec,nodev"}],"working_directory":"\/tmp\/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_zlywyr"}"
>  --pipe_read="29" --pipe_write="32" 
> --runtime_directory="/tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_sKhtj7/containers/21bc372c-0f2c-49f5-b8ab-8d32c232b95d"
>  --unshare_namespace_mnt="false"'
> I0509 21:53:25.484978 17189 linux_launcher.cpp:429] Launching container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d and cloning with namespaces CLONE_NEWNS 
> | CLONE_NEWPID
> I0509 21:53:25.513890 17186 containerizer.cpp:1623] Checkpointing container's 
> forked pid 1873 to 
> '/tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_Rdjw6M/meta/slaves/frameworks/executors/executor/runs/21bc372c-0f2c-49f5-b8ab-8d32c232b95d/pids/forked.pid'
> I0509 21:53:25.515878 17190 fetcher.cpp:353] Starting to fetch URIs for 
> container: 21bc372c-0f2c-49f5-b8ab-8d32c232b95d, directory: 
> /tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_zlywyr
> I0509 21:53:25.517715 17193 containerizer.cpp:1791] Starting nested container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d.ea991d38-e1a5-44fe-a522-622b15142e35
> I0509 21:53:25.518569 17193 switchboard.cpp:545] Launching 
> 'mesos-io-switchboard' with flags '--heartbeat_interval="30secs" 
> --help="false" 
> --socket_address="/tmp/mesos-io-switchboard-ca463cf2-70ba-4121-a5c6-1a170ae40c1b"
>  --stderr_from_fd="36" --stderr_to_fd="2" --stdin_to_fd="32" 
> --stdout_from_fd="33" --stdout_to_fd="1" --tty="false" 
> --wait_for_connection="true"' for container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d.ea991d38-e1a5-44fe-a522-622b15142e35
> I0509 21:53:25.521229 17193 switchboard.cpp:575] Created I/O switchboard 
> server (pid: 1881) listening on socket file 
> '/tmp/mesos-io-switchboard-ca463cf2-70ba-4121-a5c6-1a170ae40c1b' for 
> container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d.ea991d38-e1a5-44fe-a522-622b15142e35
> I0509 21:53:25.522195 17191 containerizer.cpp:1524] Launching 
> 'mesos-containerizer' with flags '--help="false" 
> 

[jira] [Updated] (MESOS-7504) Parent's mount namespace cannot be determined when launching a nested container.

2018-01-30 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-7504:
--
Fix Version/s: 1.5.2

> Parent's mount namespace cannot be determined when launching a nested 
> container.
> 
>
> Key: MESOS-7504
> URL: https://issues.apache.org/jira/browse/MESOS-7504
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.3.0
> Environment: Ubuntu 16.04
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: containerizer, flaky-test, mesosphere
> Fix For: 1.4.2, 1.5.0
>
>
> I've observed this failure twice in different Linux environments. Here is an 
> example of such failure:
> {noformat}
> [ RUN  ] 
> NestedMesosContainerizerTest.ROOT_CGROUPS_DestroyDebugContainerOnRecover
> I0509 21:53:25.471657 17167 containerizer.cpp:221] Using isolation: 
> cgroups/cpu,filesystem/linux,namespaces/pid,network/cni,volume/image
> I0509 21:53:25.475124 17167 linux_launcher.cpp:150] Using 
> /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
> I0509 21:53:25.475407 17167 provisioner.cpp:249] Using default backend 
> 'overlay'
> I0509 21:53:25.481232 17186 containerizer.cpp:608] Recovering containerizer
> I0509 21:53:25.482295 17186 provisioner.cpp:410] Provisioner recovery complete
> I0509 21:53:25.482587 17187 containerizer.cpp:1001] Starting container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d for executor 'executor' of framework 
> I0509 21:53:25.482918 17189 cgroups.cpp:410] Creating cgroup at 
> '/sys/fs/cgroup/cpu,cpuacct/mesos_test_d989f526-efe0-4553-bf79-936ad66c3753/21bc372c-0f2c-49f5-b8ab-8d32c232b95d'
>  for container 21bc372c-0f2c-49f5-b8ab-8d32c232b95d
> I0509 21:53:25.484103 17190 cpu.cpp:101] Updated 'cpu.shares' to 1024 (cpus 
> 1) for container 21bc372c-0f2c-49f5-b8ab-8d32c232b95d
> I0509 21:53:25.484808 17186 containerizer.cpp:1524] Launching 
> 'mesos-containerizer' with flags '--help="false" 
> --launch_info="{"clone_namespaces":[131072,536870912],"command":{"shell":true,"value":"sleep
>  
> 1000"},"environment":{"variables":[{"name":"MESOS_SANDBOX","type":"VALUE","value":"\/tmp\/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_zlywyr"}]},"pre_exec_commands":[{"arguments":["mesos-containerizer","mount","--help=false","--operation=make-rslave","--path=\/"],"shell":false,"value":"\/home\/ubuntu\/workspace\/mesos\/Mesos_CI-build\/FLAG\/SSL\/label\/mesos-ec2-ubuntu-16.04\/mesos\/build\/src\/mesos-containerizer"},{"shell":true,"value":"mount
>  -n -t proc proc \/proc -o 
> nosuid,noexec,nodev"}],"working_directory":"\/tmp\/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_zlywyr"}"
>  --pipe_read="29" --pipe_write="32" 
> --runtime_directory="/tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_sKhtj7/containers/21bc372c-0f2c-49f5-b8ab-8d32c232b95d"
>  --unshare_namespace_mnt="false"'
> I0509 21:53:25.484978 17189 linux_launcher.cpp:429] Launching container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d and cloning with namespaces CLONE_NEWNS 
> | CLONE_NEWPID
> I0509 21:53:25.513890 17186 containerizer.cpp:1623] Checkpointing container's 
> forked pid 1873 to 
> '/tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_Rdjw6M/meta/slaves/frameworks/executors/executor/runs/21bc372c-0f2c-49f5-b8ab-8d32c232b95d/pids/forked.pid'
> I0509 21:53:25.515878 17190 fetcher.cpp:353] Starting to fetch URIs for 
> container: 21bc372c-0f2c-49f5-b8ab-8d32c232b95d, directory: 
> /tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_zlywyr
> I0509 21:53:25.517715 17193 containerizer.cpp:1791] Starting nested container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d.ea991d38-e1a5-44fe-a522-622b15142e35
> I0509 21:53:25.518569 17193 switchboard.cpp:545] Launching 
> 'mesos-io-switchboard' with flags '--heartbeat_interval="30secs" 
> --help="false" 
> --socket_address="/tmp/mesos-io-switchboard-ca463cf2-70ba-4121-a5c6-1a170ae40c1b"
>  --stderr_from_fd="36" --stderr_to_fd="2" --stdin_to_fd="32" 
> --stdout_from_fd="33" --stdout_to_fd="1" --tty="false" 
> --wait_for_connection="true"' for container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d.ea991d38-e1a5-44fe-a522-622b15142e35
> I0509 21:53:25.521229 17193 switchboard.cpp:575] Created I/O switchboard 
> server (pid: 1881) listening on socket file 
> '/tmp/mesos-io-switchboard-ca463cf2-70ba-4121-a5c6-1a170ae40c1b' for 
> container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d.ea991d38-e1a5-44fe-a522-622b15142e35
> I0509 21:53:25.522195 17191 containerizer.cpp:1524] Launching 
> 'mesos-containerizer' with flags '--help="false" 
> --launch_info="{"command":{"shell":true,"value":"sleep 
> 

[jira] [Updated] (MESOS-8480) Mesos returns high resource usage when killing a Docker task.

2018-01-23 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-8480:
--
Description: 
The way we get resource statistics for Docker tasks is through getting the 
cgroup subsystem path through {{/proc//cgroup}} first (taking the 
{{cpuacct}} subsystem as an example):
{noformat}
9:cpuacct,cpu:/docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b
{noformat}
Then read 
{{/sys/fs/cgroup/cpuacct//docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b/cpuacct.stat}}
 to get the statistics:
{noformat}
user 4
system 0
{noformat}
However, when a Docker container is being torn down, it seems that Docker or 
the operating system will first move the process to the root cgroup before 
actually killing it, making {{/proc//cgroup}} look like the following:
{noformat}
9:cpuacct,cpu:/
{noformat}
This makes a racy call to 
[{{cgroup::internal::cgroup()}}|https://github.com/apache/mesos/blob/master/src/linux/cgroups.cpp#L1935]
 return a single '/', which in turn makes 
[{{DockerContainerizerProcess::cgroupsStatistics()}}|https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L1991]
 read {{/sys/fs/cgroup/cpuacct///cpuacct.stat}}, which contains the statistics 
for the root cgroup:
{noformat}
user 228058750
system 24506461
{noformat}
This can be reproduced by [^test.cpp] with the following command:
{noformat}
$ docker run --name sleep -d --rm alpine sleep 1000; ./test $(docker inspect 
sleep | jq .[].State.Pid) & sleep 1 && docker rm -f sleep
...

Reading file '/proc/44224/cgroup'
Reading file 
'/sys/fs/cgroup/cpuacct//docker/1d79a6c877e2af3081630aa57d23d853e6bd7d210dad28f897556bfea20bc9c1/cpuacct.stat'
user 4
system 0

Reading file '/proc/44224/cgroup'
Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat'
user 228058750
system 24506461

Reading file '/proc/44224/cgroup'
Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat'
user 228058750
system 24506461

Failed to open file '/proc/44224/cgroup'
sleep
[2]-  Exit 1  ./test $(docker inspect sleep | jq .[].State.Pid)
{noformat}

  was:
The way we get resource statistics for Docker tasks is through getting the 
cgroup subsystem path through {{/proc//docker}} first (taking the 
{{cpuacct}} subsystem as an example):
{noformat}
9:cpuacct,cpu:/docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b
{noformat}
Then read 
{{/sys/fs/cgroup/cpuacct//docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b/cpuacct.stat}}
 to get the statistics:
{noformat}
user 4
system 0
{noformat}

However, when a Docker container is being torn down, it seems that Docker or 
the operating system will first move the process to the root cgroup before 
actually killing it, making {{/proc//docker}} look like the following:
{noformat}
9:cpuacct,cpu:/
{noformat}
This makes a racy call to 
[{{cgroup::internal::cgroup()}}|https://github.com/apache/mesos/blob/master/src/linux/cgroups.cpp#L1935]
 return a single '/', which in turn makes 
[{{DockerContainerizerProcess::cgroupsStatistics()}}|https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L1991]
 read {{/sys/fs/cgroup/cpuacct///cpuacct.stat}}, which contains the statistics 
for the root cgroup:
{noformat}
user 228058750
system 24506461
{noformat}

This can be reproduced by [^test.cpp] with the following command:
{noformat}
$ docker run --name sleep -d --rm alpine sleep 1000; ./test $(docker inspect 
sleep | jq .[].State.Pid) & sleep 1 && docker rm -f sleep
...

Reading file '/proc/44224/cgroup'
Reading file 
'/sys/fs/cgroup/cpuacct//docker/1d79a6c877e2af3081630aa57d23d853e6bd7d210dad28f897556bfea20bc9c1/cpuacct.stat'
user 4
system 0

Reading file '/proc/44224/cgroup'
Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat'
user 228058750
system 24506461

Reading file '/proc/44224/cgroup'
Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat'
user 228058750
system 24506461

Failed to open file '/proc/44224/cgroup'
sleep
[2]-  Exit 1  ./test $(docker inspect sleep | jq .[].State.Pid)
{noformat}
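
The attached [^test.cpp] is not reproduced in this archive; judging from the output above, it is roughly the probe sketched below (helper names and parsing are illustrative, not the actual attachment): once per second, resolve the Docker pid's cpuacct cgroup from /proc/<pid>/cgroup and dump the matching cpuacct.stat.

{code:java}
#include <sys/types.h>
#include <chrono>
#include <fstream>
#include <iostream>
#include <string>
#include <thread>

// Illustrative: find the cpuacct cgroup of a pid from /proc/<pid>/cgroup.
static std::string cpuacctCgroup(pid_t pid)
{
  std::ifstream in("/proc/" + std::to_string(pid) + "/cgroup");
  std::string line;
  while (std::getline(in, line)) {
    // Lines look like "9:cpuacct,cpu:/docker/<id>".
    if (line.find("cpuacct") != std::string::npos) {
      return line.substr(line.rfind(':') + 1);
    }
  }
  return "";  // file gone or subsystem not found
}

int main(int argc, char** argv)
{
  if (argc != 2) {
    std::cerr << "Usage: " << argv[0] << " <pid>" << std::endl;
    return 1;
  }

  const pid_t pid = std::stoi(argv[1]);

  while (true) {
    const std::string cgroup = cpuacctCgroup(pid);
    if (cgroup.empty()) {
      std::cerr << "Failed to open file '/proc/" << pid << "/cgroup'" << std::endl;
      return 1;
    }

    // When the container is being torn down, `cgroup` becomes "/" and this
    // reads the root cgroup's statistics -- the bug described above.
    std::ifstream stat("/sys/fs/cgroup/cpuacct/" + cgroup + "/cpuacct.stat");
    std::string line;
    while (std::getline(stat, line)) {
      std::cout << line << std::endl;
    }

    std::this_thread::sleep_for(std::chrono::seconds(1));
  }
}
{code}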


> Mesos returns high resource usage when killing a Docker task.
> -
>
> Key: MESOS-8480
> URL: https://issues.apache.org/jira/browse/MESOS-8480
> Project: Mesos
>  Issue Type: Bug
>  Components: cgroups
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>Priority: Major
> Fix For: 1.3.2, 1.4.2, 1.6.0, 1.5.1
>
> Attachments: test.cpp
>
>
> The way we get resource statistics for Docker tasks is through getting the 
> cgroup subsystem path through {{/proc//cgroup}} first (taking the 
> {{cpuacct}} subsystem as an example):
> {noformat}
> 9:cpuacct,cpu:/docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b
> {noformat}
> Then 

[jira] [Updated] (MESOS-8480) Mesos returns high resource usage when killing a Docker task.

2018-01-23 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-8480:
--
Fix Version/s: 1.5.1
   1.4.2
   1.3.2

> Mesos returns high resource usage when killing a Docker task.
> -
>
> Key: MESOS-8480
> URL: https://issues.apache.org/jira/browse/MESOS-8480
> Project: Mesos
>  Issue Type: Bug
>  Components: cgroups
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>Priority: Major
> Fix For: 1.3.2, 1.4.2, 1.6.0, 1.5.1
>
> Attachments: test.cpp
>
>
> The way we get resource statistics for Docker tasks is through getting the 
> cgroup subsystem path through {{/proc//docker}} first (taking the 
> {{cpuacct}} subsystem as an example):
> {noformat}
> 9:cpuacct,cpu:/docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b
> {noformat}
> Then read 
> {{/sys/fs/cgroup/cpuacct//docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b/cpuacct.stat}}
>  to get the statistics:
> {noformat}
> user 4
> system 0
> {noformat}
> However, when a Docker container is being torn down, it seems that Docker 
> or the operating system will first move the process to the root cgroup before 
> actually killing it, making {{/proc//docker}} look like the following:
> {noformat}
> 9:cpuacct,cpu:/
> {noformat}
> This makes a racy call to 
> [{{cgroup::internal::cgroup()}}|https://github.com/apache/mesos/blob/master/src/linux/cgroups.cpp#L1935]
>  return a single '/', which in turn makes 
> [{{DockerContainerizerProcess::cgroupsStatistics()}}|https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L1991]
>  read {{/sys/fs/cgroup/cpuacct///cpuacct.stat}}, which contains the 
> statistics for the root cgroup:
> {noformat}
> user 228058750
> system 24506461
> {noformat}
> This can be reproduced by [^test.cpp] with the following command:
> {noformat}
> $ docker run --name sleep -d --rm alpine sleep 1000; ./test $(docker inspect 
> sleep | jq .[].State.Pid) & sleep 1 && docker rm -f sleep
> ...
> Reading file '/proc/44224/cgroup'
> Reading file 
> '/sys/fs/cgroup/cpuacct//docker/1d79a6c877e2af3081630aa57d23d853e6bd7d210dad28f897556bfea20bc9c1/cpuacct.stat'
> user 4
> system 0
> Reading file '/proc/44224/cgroup'
> Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat'
> user 228058750
> system 24506461
> Reading file '/proc/44224/cgroup'
> Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat'
> user 228058750
> system 24506461
> Failed to open file '/proc/44224/cgroup'
> sleep
> [2]-  Exit 1  ./test $(docker inspect sleep | jq 
> .[].State.Pid)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8480) Mesos returns high resource usage when killing a Docker task.

2018-01-23 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336710#comment-16336710
 ] 

Jie Yu commented on MESOS-8480:
---

commit 1382e595fa5e82f9917df97fbed76f77140ecc1e (HEAD -> master, origin/master, 
origin/HEAD)
Author: Chun-Hung Hsiao 
Date: Tue Jan 23 17:13:05 2018 -0800

Fixed resource statistics for Docker containers being destroyed.

If a process has exited, but has not been reaped yet (zombie process),
 `/proc//cgroup` will still exist, but the process's cgroup will be
 reset to the root cgroup. In DockerContainerizer, we rely on
 `/proc//cgroup` to get the cpu/memory statistics of the container.
 If the `usage` call happens when the process is a zombie, the cpu/memory
 statistics will actually be that of the root cgroup, which is obviously
 not correct. See more details in MESOS-8480.

This patch fixed the issue by checking if the cgroup of a given pid is
 root cgroup or not.

Review: https://reviews.apache.org/r/65301/
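
A minimal sketch of the check the commit message describes (the real change is in the linked review; names here are illustrative): skip statistics collection when the pid's cpuacct cgroup has already been reset to the root cgroup.

{code:java}
#include <sys/types.h>
#include <fstream>
#include <iostream>
#include <string>

// Illustrative: return true if the pid's cpu/cpuacct cgroup is the root
// cgroup ("/"), which is where the kernel parks a process that has exited
// but has not yet been reaped. In that case the statistics should be skipped.
static bool inRootCgroup(pid_t pid)
{
  std::ifstream in("/proc/" + std::to_string(pid) + "/cgroup");
  std::string line;
  while (std::getline(in, line)) {
    if (line.find("cpuacct") != std::string::npos) {
      return line.substr(line.rfind(':') + 1) == "/";
    }
  }
  return true;  // file missing or subsystem not found: don't trust the stats
}

int main(int argc, char** argv)
{
  if (argc != 2) {
    std::cerr << "Usage: " << argv[0] << " <pid>" << std::endl;
    return 1;
  }

  std::cout << (inRootCgroup(std::stoi(argv[1])) ? "root cgroup (skip stats)"
                                                 : "container cgroup (ok)")
            << std::endl;
  return 0;
}
{code}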

> Mesos returns high resource usage when killing a Docker task.
> -
>
> Key: MESOS-8480
> URL: https://issues.apache.org/jira/browse/MESOS-8480
> Project: Mesos
>  Issue Type: Bug
>  Components: cgroups
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>Priority: Major
> Fix For: 1.6.0
>
> Attachments: test.cpp
>
>
> The way we get resource statistics for Docker tasks is through getting the 
> cgroup subsystem path through {{/proc//docker}} first (taking the 
> {{cpuacct}} subsystem as an example):
> {noformat}
> 9:cpuacct,cpu:/docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b
> {noformat}
> Then read 
> {{/sys/fs/cgroup/cpuacct//docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b/cpuacct.stat}}
>  to get the statistics:
> {noformat}
> user 4
> system 0
> {noformat}
> However, when a Docker container is being torn down, it seems that Docker 
> or the operating system will first move the process to the root cgroup before 
> actually killing it, making {{/proc//docker}} look like the following:
> {noformat}
> 9:cpuacct,cpu:/
> {noformat}
> This makes a racy call to 
> [{{cgroup::internal::cgroup()}}|https://github.com/apache/mesos/blob/master/src/linux/cgroups.cpp#L1935]
>  return a single '/', which in turn makes 
> [{{DockerContainerizerProcess::cgroupsStatistics()}}|https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L1991]
>  read {{/sys/fs/cgroup/cpuacct///cpuacct.stat}}, which contains the 
> statistics for the root cgroup:
> {noformat}
> user 228058750
> system 24506461
> {noformat}
> This can be reproduced by [^test.cpp] with the following command:
> {noformat}
> $ docker run --name sleep -d --rm alpine sleep 1000; ./test $(docker inspect 
> sleep | jq .[].State.Pid) & sleep 1 && docker rm -f sleep
> ...
> Reading file '/proc/44224/cgroup'
> Reading file 
> '/sys/fs/cgroup/cpuacct//docker/1d79a6c877e2af3081630aa57d23d853e6bd7d210dad28f897556bfea20bc9c1/cpuacct.stat'
> user 4
> system 0
> Reading file '/proc/44224/cgroup'
> Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat'
> user 228058750
> system 24506461
> Reading file '/proc/44224/cgroup'
> Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat'
> user 228058750
> system 24506461
> Failed to open file '/proc/44224/cgroup'
> sleep
> [2]-  Exit 1  ./test $(docker inspect sleep | jq 
> .[].State.Pid)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-8480) Mesos returns high resource usage when killing a Docker task.

2018-01-23 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-8480:
--
Fix Version/s: 1.6.0

> Mesos returns high resource usage when killing a Docker task.
> -
>
> Key: MESOS-8480
> URL: https://issues.apache.org/jira/browse/MESOS-8480
> Project: Mesos
>  Issue Type: Bug
>  Components: cgroups
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>Priority: Major
> Fix For: 1.6.0
>
> Attachments: test.cpp
>
>
> The way we get resource statistics for Docker tasks is through getting the 
> cgroup subsystem path through {{/proc//docker}} first (taking the 
> {{cpuacct}} subsystem as an example):
> {noformat}
> 9:cpuacct,cpu:/docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b
> {noformat}
> Then read 
> {{/sys/fs/cgroup/cpuacct//docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b/cpuacct.stat}}
>  to get the statistics:
> {noformat}
> user 4
> system 0
> {noformat}
> However, when a Docker container is being torn down, it seems that Docker 
> or the operating system will first move the process to the root cgroup before 
> actually killing it, making {{/proc//docker}} look like the following:
> {noformat}
> 9:cpuacct,cpu:/
> {noformat}
> This makes a racy call to 
> [{{cgroup::internal::cgroup()}}|https://github.com/apache/mesos/blob/master/src/linux/cgroups.cpp#L1935]
>  return a single '/', which in turn makes 
> [{{DockerContainerizerProcess::cgroupsStatistics()}}|https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L1991]
>  read {{/sys/fs/cgroup/cpuacct///cpuacct.stat}}, which contains the 
> statistics for the root cgroup:
> {noformat}
> user 228058750
> system 24506461
> {noformat}
> This can be reproduced by [^test.cpp] with the following command:
> {noformat}
> $ docker run --name sleep -d --rm alpine sleep 1000; ./test $(docker inspect 
> sleep | jq .[].State.Pid) & sleep 1 && docker rm -f sleep
> ...
> Reading file '/proc/44224/cgroup'
> Reading file 
> '/sys/fs/cgroup/cpuacct//docker/1d79a6c877e2af3081630aa57d23d853e6bd7d210dad28f897556bfea20bc9c1/cpuacct.stat'
> user 4
> system 0
> Reading file '/proc/44224/cgroup'
> Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat'
> user 228058750
> system 24506461
> Reading file '/proc/44224/cgroup'
> Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat'
> user 228058750
> system 24506461
> Failed to open file '/proc/44224/cgroup'
> sleep
> [2]-  Exit 1  ./test $(docker inspect sleep | jq 
> .[].State.Pid)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8480) Mesos returns high resource usage when killing a Docker task.

2018-01-23 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336700#comment-16336700
 ] 

Jie Yu commented on MESOS-8480:
---

I checked the kernel code; it looks like when a process exits (or is killed) but 
hasn't been reaped yet (zombie), the proc file `/proc//cgroup` will still 
exist, but the cgroup of the task will be set to the root cgroup:

[https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/tree/kernel/cgroup.c?h=v4.1.49#n5194]

[https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/tree/kernel/cgroup.c?h=v4.1.49#n1003]

[https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/tree/kernel/exit.c?h=v4.1.49#n757]

[https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/tree/kernel/cgroup.c?h=v4.1.49#n5357]
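
A small standalone demonstration of this behavior (assuming a cgroups v1 kernel like the one linked above): kill a child without reaping it and read its /proc/<pid>/cgroup; every controller line now ends in ":/", i.e. the zombie sits in the root cgroup.

{code:java}
#include <sys/types.h>
#include <sys/wait.h>
#include <signal.h>
#include <unistd.h>
#include <cstdio>
#include <fstream>
#include <iostream>
#include <string>

int main()
{
  pid_t child = fork();
  if (child < 0) {
    std::perror("fork");
    return 1;
  }

  if (child == 0) {
    pause();   // child just waits to be killed
    _exit(0);
  }

  kill(child, SIGKILL);
  sleep(1);    // child has exited but is not reaped: it is a zombie now

  // /proc/<pid>/cgroup still exists, but every controller shows the
  // root cgroup ("/") for the zombie.
  std::ifstream in("/proc/" + std::to_string(child) + "/cgroup");
  std::string line;
  while (std::getline(in, line)) {
    std::cout << line << std::endl;
  }

  waitpid(child, nullptr, 0);  // reap the zombie
  return 0;
}
{code}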

 

> Mesos returns high resource usage when killing a Docker task.
> -
>
> Key: MESOS-8480
> URL: https://issues.apache.org/jira/browse/MESOS-8480
> Project: Mesos
>  Issue Type: Bug
>  Components: cgroups
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>Priority: Major
> Attachments: test.cpp
>
>
> The way we get resource statistics for Docker tasks is through getting the 
> cgroup subsystem path through {{/proc//docker}} first (taking the 
> {{cpuacct}} subsystem as an example):
> {noformat}
> 9:cpuacct,cpu:/docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b
> {noformat}
> Then read 
> {{/sys/fs/cgroup/cpuacct//docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b/cpuacct.stat}}
>  to get the statistics:
> {noformat}
> user 4
> system 0
> {noformat}
> However, when a Docker container is being torn down, it seems that Docker 
> or the operating system will first move the process to the root cgroup before 
> actually killing it, making {{/proc//docker}} look like the following:
> {noformat}
> 9:cpuacct,cpu:/
> {noformat}
> This makes a racy call to 
> [{{cgroup::internal::cgroup()}}|https://github.com/apache/mesos/blob/master/src/linux/cgroups.cpp#L1935]
>  return a single '/', which in turn makes 
> [{{DockerContainerizerProcess::cgroupsStatistics()}}|https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L1991]
>  read {{/sys/fs/cgroup/cpuacct///cpuacct.stat}}, which contains the 
> statistics for the root cgroup:
> {noformat}
> user 228058750
> system 24506461
> {noformat}
> This can be reproduced by [^test.cpp] with the following command:
> {noformat}
> $ docker run --name sleep -d --rm alpine sleep 1000; ./test $(docker inspect 
> sleep | jq .[].State.Pid) & sleep 1 && docker rm -f sleep
> ...
> Reading file '/proc/44224/cgroup'
> Reading file 
> '/sys/fs/cgroup/cpuacct//docker/1d79a6c877e2af3081630aa57d23d853e6bd7d210dad28f897556bfea20bc9c1/cpuacct.stat'
> user 4
> system 0
> Reading file '/proc/44224/cgroup'
> Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat'
> user 228058750
> system 24506461
> Reading file '/proc/44224/cgroup'
> Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat'
> user 228058750
> system 24506461
> Failed to open file '/proc/44224/cgroup'
> sleep
> [2]-  Exit 1  ./test $(docker inspect sleep | jq 
> .[].State.Pid)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-6812) Invalid entries in /proc/self/mountinfo when using persistent storage

2018-01-22 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335241#comment-16335241
 ] 

Jie Yu commented on MESOS-6812:
---

Hmm, that sounds like a kernel or systemd issue. Do you know why systemd 
complains about the mount table? It looks fine to me.

> Invalid entries in /proc/self/mountinfo when using persistent storage
> -
>
> Key: MESOS-6812
> URL: https://issues.apache.org/jira/browse/MESOS-6812
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.1, 1.4.0
>Reporter: Mateusz Moneta
>Priority: Minor
>
> Hello,
> we use Mesos 1.0.1 on Debian Jessie with Kernel {{4.6.1-1~bpo8+1 
> (2016-06-14)}} and Docker 1.12.5.
> The problem is that on slaves which run tasks with persistent storage, 
> Mesos adds invalid entries to {{/proc/self/mountinfo}}. Example:
> {noformat}
> 79 46 253:5 
> /lib/mesos/volumes/roles/slave/services_proxy_production_mongo#data#4d7ae497-a0f5-11e6-8a4f-e0db55fde00f
>  
> /var/lib/mesos/slaves/56e2e372-da8e-47d0-ac25-0f55945c625c-S2/frameworks/fa8eb417-29e3-4640-9405-ab84d2ef9794-0001/executors/services_proxy_production_mongo.4d7ae498-a0f5-11e6-8a4f-e0db55fde00f/runs/f84f2541-7e44-4226-80c6-93f438e50fd5/data
>  rw,relatime shared:28 - ext4 /dev/mapper/main-var rw,data=ordered
> {noformat}
> This causes many {noformat}
> Dec 19 13:56:49 s10.mesos.services.ams.osa systemd[1]: Failed to reread 
> /proc/self/mountinfo: Invalid argument
> {noformat} errors in {{/var/log/daemon.log}}.
> Mesos slave configuration:
> {noformat}
> ULIMIT="-n 8192"
> CLUSTER=services
> MASTER=`cat /etc/mesos/zk`
> MESOS_CONTAINERIZERS=docker,mesos
> MESOS_EXECUTOR_REGISTRATION_TIMEOUT=5mins
> MESOS_CREDENTIAL=/etc/mesos.credentials
> MESOS_WORK_DIR=/var/lib/mesos
> MESOS_PORT=8080
> MESOS_EXECUTOR_ENVIRONMENT_VARIABLES='{"SSL_ENABLED": "true","SSL_KEY_FILE": 
> "/etc/ssl/certs/star.mesos.services.ams.osa.key", "SSL_CERT_FILE": 
> "/etc/ssl/certs/star.mesos.services.ams.osa.pem"}'
> MESOS_MODULES=file:///usr/etc/mesos/mesos-slave-modules.json
> MESOS_CONTAINER_LOGGER=org_apache_mesos_LogrotateContainerLogger
> MESOS_LOGGING_LEVEL=INFO
> LIBPROCESS_SSL_ENABLED=true
> LIBPROCESS_SSL_KEY_FILE=/etc/ssl/certs/star.mesos.services.ams.osa.key
> LIBPROCESS_SSL_CERT_FILE=/etc/ssl/certs/star.mesos.services.ams.osa.pem
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-8461) SLRP should not assume a CSI plugin always has GetNodeID implemented.

2018-01-18 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-8461:
--
Description: According to the 0.1.0 spec, GetNodeID is optional, and will only be 
implemented if the PUBLISH_UNPUBLISH_VOLUME capability is set.
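
A self-contained sketch of the intended SLRP behavior, with hypothetical types standing in for the CSI 0.1 protobufs and gRPC stubs: only call GetNodeID when the plugin advertises the PUBLISH_UNPUBLISH_VOLUME capability, and carry on without a node ID otherwise.

{code:java}
#include <optional>
#include <set>
#include <string>

// Hypothetical stand-ins for the CSI 0.1 capability enum and RPC stub.
enum class NodeCapability { PUBLISH_UNPUBLISH_VOLUME, UNKNOWN };

struct Plugin
{
  std::set<NodeCapability> capabilities;
  std::string getNodeID() const { return "node-1234"; }  // placeholder RPC
};

// SLRP-side logic: GetNodeID is optional, so only invoke it when the
// plugin reports the PUBLISH_UNPUBLISH_VOLUME capability.
std::optional<std::string> resolveNodeID(const Plugin& plugin)
{
  if (plugin.capabilities.count(NodeCapability::PUBLISH_UNPUBLISH_VOLUME) > 0) {
    return plugin.getNodeID();
  }
  return std::nullopt;  // plugin does not implement GetNodeID
}
{code}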

> SLRP should not assume a CSI plugin always has GetNodeID implemented.
> 
>
> Key: MESOS-8461
> URL: https://issues.apache.org/jira/browse/MESOS-8461
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jie Yu
>Assignee: Chun-Hung Hsiao
>Priority: Major
>
> According to the 0.1.0 spec, GetNodeID is optional, and will only be implemented 
> if the PUBLISH_UNPUBLISH_VOLUME capability is set.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-8461) SLRP should not assume a CSI plugin always has GetNodeID implemented.

2018-01-18 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-8461:
--
Environment: (was: According to 0.1.0 spec, GetNodeID is optional, and 
will be implemented if PUBLISH_UNPUBLISH_VOLUME capability is set.)

> SLRP should not assume a CSI plugin always has GetNodeID implemented.
> 
>
> Key: MESOS-8461
> URL: https://issues.apache.org/jira/browse/MESOS-8461
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jie Yu
>Assignee: Chun-Hung Hsiao
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8461) SLRP should not assume a CSI plugin always has GetNodeID implemented.

2018-01-18 Thread Jie Yu (JIRA)
Jie Yu created MESOS-8461:
-

 Summary: SLRP should not assume a CSI plugin always has GetNodeID 
implemented.
 Key: MESOS-8461
 URL: https://issues.apache.org/jira/browse/MESOS-8461
 Project: Mesos
  Issue Type: Bug
 Environment: According to 0.1.0 spec, GetNodeID is optional, and will 
be implemented if PUBLISH_UNPUBLISH_VOLUME capability is set.
Reporter: Jie Yu
Assignee: Chun-Hung Hsiao






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8253) Mesos CI docker rmi conflict

2018-01-16 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16327579#comment-16327579
 ] 

Jie Yu commented on MESOS-8253:
---

One potential workaround I can think of is to write a wrapper program that pipes 
the stdout/stderr of docker run/build. The wrapper program will set its read fd to be 
blocking, but assume that its own stdout/stderr may be non-blocking (and will retry 
if writes fail with EAGAIN).
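
A rough sketch of such a wrapper, under the assumptions above (command and buffer size are placeholders): read the child's output through a blocking pipe and retry writes to stdout that fail with EAGAIN, since the CI harness may have made stdout non-blocking.

{code:java}
#include <unistd.h>
#include <cerrno>
#include <cstdio>

// Retry a write on stdout until all bytes are flushed, tolerating EAGAIN
// (stdout may have been put into non-blocking mode by the caller).
static bool writeAll(const char* data, ssize_t size)
{
  ssize_t offset = 0;
  while (offset < size) {
    ssize_t n = write(STDOUT_FILENO, data + offset, size - offset);
    if (n < 0) {
      if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR) {
        usleep(1000);  // back off briefly and retry
        continue;
      }
      return false;
    }
    offset += n;
  }
  return true;
}

int main()
{
  // Placeholder: run the real `docker build ...` command and pipe its output.
  FILE* child = popen("docker build . 2>&1", "r");
  if (child == nullptr) {
    perror("popen");
    return 1;
  }

  char buffer[4096];
  size_t n;
  while ((n = fread(buffer, 1, sizeof(buffer), child)) > 0) {
    if (!writeAll(buffer, static_cast<ssize_t>(n))) {
      perror("write");
      pclose(child);
      return 1;
    }
  }

  return pclose(child) == 0 ? 0 : 1;
}
{code}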

> Mesos CI docker rmi conflict
> 
>
> Key: MESOS-8253
> URL: https://issues.apache.org/jira/browse/MESOS-8253
> Project: Mesos
>  Issue Type: Bug
>  Components: build, docker
>Reporter: James Peach
>Priority: Major
>
> We are seeing a lot of docker build jobs failing when they try to clean up 
> their docker images:
> {noformat}
> + docker rmi mesos-1511286604-15916
> Error response from daemon: conflict: unable to remove repository reference 
> "mesos-1511286604-15916" (must force) - container 1aabf0225a43 is using its 
> referenced image 23292073f88f
> Build step 'Execute shell' marked build as failure
> {noformat}
> The full Jenkins log is 
> [here|https://builds.apache.org/job/Mesos-Buildbot/BUILDTOOL=autotools,COMPILER=clang,CONFIGURATION=--verbose%20--disable-libtool-wrappers%20--enable-libevent%20--enable-ssl,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu%3A14.04,label_exp=(ubuntu)&&(!ubuntu-us1)&&(!ubuntu-eu2)&&(!qnode3)&&(!H23)/4486/console]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8438) URI disk profile adaptor should not attempt to parse the response if Content-Type is not expected.

2018-01-11 Thread Jie Yu (JIRA)
Jie Yu created MESOS-8438:
-

 Summary: URI disk profile adaptor should not attempt to parse the 
response if Content-Type is not expected.
 Key: MESOS-8438
 URL: https://issues.apache.org/jira/browse/MESOS-8438
 Project: Mesos
  Issue Type: Bug
Reporter: Jie Yu






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8437) URI disk profile adaptor should handle 30x HTTP redirect

2018-01-11 Thread Jie Yu (JIRA)
Jie Yu created MESOS-8437:
-

 Summary: URI disk profile adaptor should handle 30x HTTP redirect
 Key: MESOS-8437
 URL: https://issues.apache.org/jira/browse/MESOS-8437
 Project: Mesos
  Issue Type: Bug
Reporter: Jie Yu






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8356) Persistent volume ownership is set to root despite of sandbox owner (frameworkInfo.user) when docker executor is used

2018-01-09 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-8356:
--
Fix Version/s: 1.3.2

> Persistent volume ownership is set to root despite of sandbox owner 
> (frameworkInfo.user) when docker executor is used
> -
>
> Key: MESOS-8356
> URL: https://issues.apache.org/jira/browse/MESOS-8356
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.1.3, 1.2.3, 1.3.1, 1.4.1
> Environment: Centos 7, Mesos 1.4.1, Docker Engine 1.13
>Reporter: Konstantin Kalin
>Assignee: Jie Yu
>Priority: Critical
>  Labels: persistent-volumes
> Fix For: 1.3.2, 1.4.2, 1.5.0
>
>
> PersistentVolume ownership is not set to match the sandbox user when the 
> docker executor is used. Looks like the issue was introduced by 
> https://reviews.apache.org/r/45963/
> I didn't check the universal containerizer yet. 
> As far as I understand the following code is supposed to check that a volume 
> is not already being used by other tasks/containers.
> src/slave/containerizer/docker.cpp
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource)) {
> isVolumeInUse = true;
> break;
>   }
> }
> {code}
> But it doesn't exclude the container about to be launched (in my case I have only 
> one container - no group of tasks). Thus the ownership of the PersistentVolume 
> stays "root" (I run mesos-agent under root) and it's impossible to use the volume 
> inside the container. We always run processes inside Docker containers as an 
> unprivileged user. 
> Making a small patch to exclude the container being launched fixes the issue.
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource) &&
>   containerId != container->id) {
> isVolumeInUse = true;
> break;
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8356) Persistent volume ownership is set to root despite of sandbox owner (frameworkInfo.user) when docker executor is used

2018-01-09 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-8356:
--
Target Version/s: 1.3.2, 1.4.2, 1.5.1  (was: 1.4.2, 1.5.1)

> Persistent volume ownership is set to root despite of sandbox owner 
> (frameworkInfo.user) when docker executor is used
> -
>
> Key: MESOS-8356
> URL: https://issues.apache.org/jira/browse/MESOS-8356
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.1.3, 1.2.3, 1.3.1, 1.4.1
> Environment: Centos 7, Mesos 1.4.1, Docker Engine 1.13
>Reporter: Konstantin Kalin
>Assignee: Jie Yu
>Priority: Critical
>  Labels: persistent-volumes
> Fix For: 1.4.2, 1.5.0
>
>
> PersistentVolume ownership is not set to match the sandbox user when the 
> docker executor is used. Looks like the issue was introduced by 
> https://reviews.apache.org/r/45963/
> I didn't check the universal containerizer yet. 
> As far as I understand the following code is supposed to check that a volume 
> is not being already used by other tasks/containers.
> src/slave/containerizer/docker.cpp
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource)) {
> isVolumeInUse = true;
> break;
>   }
> }
> {code}
> But it doesn't exclude the container being launched (in my case I have only one
> container - no group of tasks). Thus the ownership of the PersistentVolume stays
> "root" (I run mesos-agent as root) and it's impossible to use the volume
> inside the container. We always run processes inside Docker containers as an
> unprivileged user.
> Making a small patch to exclude the container being launched fixes the issue.
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource) &&
>   containerId != container->id) {
> isVolumeInUse = true;
> break;
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8356) Persistent volume ownership is set to root despite of sandbox owner (frameworkInfo.user) when docker executor is used

2018-01-09 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-8356:
--
Fix Version/s: 1.4.2

> Persistent volume ownership is set to root despite of sandbox owner 
> (frameworkInfo.user) when docker executor is used
> -
>
> Key: MESOS-8356
> URL: https://issues.apache.org/jira/browse/MESOS-8356
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.1.3, 1.2.3, 1.3.1, 1.4.1
> Environment: Centos 7, Mesos 1.4.1, Docker Engine 1.13
>Reporter: Konstantin Kalin
>Assignee: Jie Yu
>Priority: Critical
>  Labels: persistent-volumes
> Fix For: 1.4.2, 1.5.0
>
>
> PersistentVolume ownership is not set to match the sandbox user when the 
> docker executor is used. Looks like the issue was introduced by 
> https://reviews.apache.org/r/45963/
> I didn't check the universal containerizer yet. 
> As far as I understand the following code is supposed to check that a volume 
> is not being already used by other tasks/containers.
> src/slave/containerizer/docker.cpp
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource)) {
> isVolumeInUse = true;
> break;
>   }
> }
> {code}
> But it doesn't exclude the container being launched (in my case I have only one
> container - no group of tasks). Thus the ownership of the PersistentVolume stays
> "root" (I run mesos-agent as root) and it's impossible to use the volume
> inside the container. We always run processes inside Docker containers as an
> unprivileged user.
> Making a small patch to exclude the container being launched fixes the issue.
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource) &&
>   containerId != container->id) {
> isVolumeInUse = true;
> break;
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8356) Persistent volume ownership is set to root despite of sandbox owner (frameworkInfo.user) when docker executor is used

2018-01-09 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16319113#comment-16319113
 ] 

Jie Yu commented on MESOS-8356:
---

commit c8e6487d251d938c3c221f606f7e924514877655 (origin/master, origin/HEAD, 
master)
Author: Jie Yu 
Date:   Tue Jan 9 11:23:20 2018 -0800

Fixed the persistent volume permission issue in DockerContainerizer.

This patch fixes MESOS-8356 by skipping the current container to be
launched when doing the shared volume check (`isVolumeInUse`). Prior to
this patch, the code is buggy because `isVolumeInUse` will always be set
to `true`.

Review: https://reviews.apache.org/r/65049
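
For illustration, here is a minimal, self-contained sketch of the corrected check described above. The types are simplified stand-ins (a plain map and a set of strings), not the actual Mesos classes; only the extra `containerId` guard mirrors the real fix.

{code:java}
// Hypothetical simplified sketch -- not the actual Mesos code.
#include <map>
#include <set>
#include <string>

using Resources = std::set<std::string>;   // stand-in for mesos::Resources

bool isVolumeInUse(
    const std::map<std::string, Resources>& containers_,  // all known containers
    const std::string& containerId,                       // container being launched
    const std::string& volume)
{
  for (const auto& [id, resources] : containers_) {
    // The crucial guard: skip the container that is being launched so it
    // does not mark its own persistent volume as "in use".
    if (id != containerId && resources.count(volume) > 0) {
      return true;
    }
  }
  return false;
}
{code}

Only when such a check returns {{false}} does the ownership adjustment described in the report take place, which is why the always-true result left the volume owned by root.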

> Persistent volume ownership is set to root despite of sandbox owner 
> (frameworkInfo.user) when docker executor is used
> -
>
> Key: MESOS-8356
> URL: https://issues.apache.org/jira/browse/MESOS-8356
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.1.3, 1.2.3, 1.3.1, 1.4.1
> Environment: Centos 7, Mesos 1.4.1, Docker Engine 1.13
>Reporter: Konstantin Kalin
>Assignee: Jie Yu
>Priority: Critical
>  Labels: persistent-volumes
> Fix For: 1.5.0
>
>
> PersistentVolume ownership is not set to match the sandbox user when the 
> docker executor is used. Looks like the issue was introduced by 
> https://reviews.apache.org/r/45963/
> I didn't check the universal containerizer yet. 
> As far as I understand the following code is supposed to check that a volume 
> is not being already used by other tasks/containers.
> src/slave/containerizer/docker.cpp
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource)) {
> isVolumeInUse = true;
> break;
>   }
> }
> {code}
> But it doesn't exclude the container being launched (in my case I have only one
> container - no group of tasks). Thus the ownership of the PersistentVolume stays
> "root" (I run mesos-agent as root) and it's impossible to use the volume
> inside the container. We always run processes inside Docker containers as an
> unprivileged user.
> Making a small patch to exclude the container being launched fixes the issue.
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource) &&
>   containerId != container->id) {
> isVolumeInUse = true;
> break;
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8356) Persistent volume ownership is set to root despite of sandbox owner (frameworkInfo.user) when docker executor is used

2018-01-09 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-8356:
--
Fix Version/s: 1.5.0

> Persistent volume ownership is set to root despite of sandbox owner 
> (frameworkInfo.user) when docker executor is used
> -
>
> Key: MESOS-8356
> URL: https://issues.apache.org/jira/browse/MESOS-8356
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.1.3, 1.2.3, 1.3.1, 1.4.1
> Environment: Centos 7, Mesos 1.4.1, Docker Engine 1.13
>Reporter: Konstantin Kalin
>Assignee: Jie Yu
>Priority: Critical
>  Labels: persistent-volumes
> Fix For: 1.5.0
>
>
> PersistentVolume ownership is not set to match the sandbox user when the 
> docker executor is used. Looks like the issue was introduced by 
> https://reviews.apache.org/r/45963/
> I didn't check the universal containerizer yet. 
> As far as I understand the following code is supposed to check that a volume 
> is not being already used by other tasks/containers.
> src/slave/containerizer/docker.cpp
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource)) {
> isVolumeInUse = true;
> break;
>   }
> }
> {code}
> But it doesn't exclude the container being launched (in my case I have only one
> container - no group of tasks). Thus the ownership of the PersistentVolume stays
> "root" (I run mesos-agent as root) and it's impossible to use the volume
> inside the container. We always run processes inside Docker containers as an
> unprivileged user.
> Making a small patch to exclude the container being launched fixes the issue.
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource) &&
>   containerId != container->id) {
> isVolumeInUse = true;
> break;
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8356) Persistent volume ownership is set to root despite of sandbox owner (frameworkInfo.user) when docker executor is used

2018-01-09 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16318943#comment-16318943
 ] 

Jie Yu commented on MESOS-8356:
---

I verified that it's not an issue with the Mesos containerizer (aka the universal 
containerizer), but it is a problem for the Docker containerizer.

> Persistent volume ownership is set to root despite of sandbox owner 
> (frameworkInfo.user) when docker executor is used
> -
>
> Key: MESOS-8356
> URL: https://issues.apache.org/jira/browse/MESOS-8356
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.1.3, 1.2.3, 1.3.1, 1.4.1
> Environment: Centos 7, Mesos 1.4.1, Docker Engine 1.13
>Reporter: Konstantin Kalin
>Assignee: Jie Yu
>Priority: Critical
>  Labels: persistent-volumes
>
> PersistentVolume ownership is not set to match the sandbox user when the 
> docker executor is used. Looks like the issue was introduced by 
> https://reviews.apache.org/r/45963/
> I didn't check the universal containerizer yet. 
> As far as I understand the following code is supposed to check that a volume 
> is not being already used by other tasks/containers.
> src/slave/containerizer/docker.cpp
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource)) {
> isVolumeInUse = true;
> break;
>   }
> }
> {code}
> But it doesn't exclude the container being launched (in my case I have only one
> container - no group of tasks). Thus the ownership of the PersistentVolume stays
> "root" (I run mesos-agent as root) and it's impossible to use the volume
> inside the container. We always run processes inside Docker containers as an
> unprivileged user.
> Making a small patch to exclude the container being launched fixes the issue.
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource) &&
>   containerId != container->id) {
> isVolumeInUse = true;
> break;
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8356) Persistent volume ownership is set to root despite of sandbox owner (frameworkInfo.user) when docker executor is used

2018-01-09 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-8356:
--
Affects Version/s: 1.1.3
   1.2.3
   1.3.1
 Target Version/s: 1.4.2, 1.5.1

> Persistent volume ownership is set to root despite of sandbox owner 
> (frameworkInfo.user) when docker executor is used
> -
>
> Key: MESOS-8356
> URL: https://issues.apache.org/jira/browse/MESOS-8356
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.1.3, 1.2.3, 1.3.1, 1.4.1
> Environment: Centos 7, Mesos 1.4.1, Docker Engine 1.13
>Reporter: Konstantin Kalin
>Assignee: Jie Yu
>  Labels: persistent-volumes
>
> PersistentVolume ownership is not set to match the sandbox user when the 
> docker executor is used. Looks like the issue was introduced by 
> https://reviews.apache.org/r/45963/
> I didn't check the universal containerizer yet. 
> As far as I understand the following code is supposed to check that a volume 
> is not being already used by other tasks/containers.
> src/slave/containerizer/docker.cpp
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource)) {
> isVolumeInUse = true;
> break;
>   }
> }
> {code}
> But it doesn't exclude the container being launched (in my case I have only one
> container - no group of tasks). Thus the ownership of the PersistentVolume stays
> "root" (I run mesos-agent as root) and it's impossible to use the volume
> inside the container. We always run processes inside Docker containers as an
> unprivileged user.
> Making a small patch to exclude the container being launched fixes the issue.
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource) &&
>   containerId != container->id) {
> isVolumeInUse = true;
> break;
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8356) Persistent volume ownership is set to root despite of sandbox owner (frameworkInfo.user) when docker executor is used

2018-01-09 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-8356:
--
Priority: Critical  (was: Major)

> Persistent volume ownership is set to root despite of sandbox owner 
> (frameworkInfo.user) when docker executor is used
> -
>
> Key: MESOS-8356
> URL: https://issues.apache.org/jira/browse/MESOS-8356
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.1.3, 1.2.3, 1.3.1, 1.4.1
> Environment: Centos 7, Mesos 1.4.1, Docker Engine 1.13
>Reporter: Konstantin Kalin
>Assignee: Jie Yu
>Priority: Critical
>  Labels: persistent-volumes
>
> PersistentVolume ownership is not set to match the sandbox user when the 
> docker executor is used. Looks like the issue was introduced by 
> https://reviews.apache.org/r/45963/
> I didn't check the universal containerizer yet. 
> As far as I understand the following code is supposed to check that a volume 
> is not being already used by other tasks/containers.
> src/slave/containerizer/docker.cpp
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource)) {
> isVolumeInUse = true;
> break;
>   }
> }
> {code}
> But it doesn't exclude the container being launched (in my case I have only one
> container - no group of tasks). Thus the ownership of the PersistentVolume stays
> "root" (I run mesos-agent as root) and it's impossible to use the volume
> inside the container. We always run processes inside Docker containers as an
> unprivileged user.
> Making a small patch to exclude the container being launched fixes the issue.
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource) &&
>   containerId != container->id) {
> isVolumeInUse = true;
> break;
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MESOS-8356) Persistent volume ownership is set to root despite of sandbox owner (frameworkInfo.user) when docker executor is used

2018-01-09 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16318907#comment-16318907
 ] 

Jie Yu edited comment on MESOS-8356 at 1/9/18 6:45 PM:
---

[~kkalin] Thanks for reporting!

[~xujyan] This looks like a bug to me because `current` is always set to empty 
in Docker containerizer:
https://github.com/apache/mesos/blob/1.4.x/src/slave/containerizer/docker.cpp#L625

The logic in the Mesos containerizer (i.e., filesystem/linux isolator) is 
slightly different as `current` there is set to be `info->resources`, thus not 
buggy
https://github.com/apache/mesos/blob/1.4.x/src/slave/containerizer/mesos/isolators/filesystem/linux.cpp#L644




was (Author: jieyu):
[~kkalin] Thanks for reporting!

[~xujyan] This looks like a bug to me because `current` is always set to empty 
in Docker containerizer:
https://github.com/apache/mesos/blob/1.4.x/src/slave/containerizer/docker.cpp#L625

The logic in the Mesos containerizer (i.e., filesystem/linux isolator) is 
slightly different as `current` there is set to be `info->resources`:
https://github.com/apache/mesos/blob/1.4.x/src/slave/containerizer/mesos/isolators/filesystem/linux.cpp#L644



> Persistent volume ownership is set to root despite of sandbox owner 
> (frameworkInfo.user) when docker executor is used
> -
>
> Key: MESOS-8356
> URL: https://issues.apache.org/jira/browse/MESOS-8356
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.1.3, 1.2.3, 1.3.1, 1.4.1
> Environment: Centos 7, Mesos 1.4.1, Docker Engine 1.13
>Reporter: Konstantin Kalin
>Assignee: Jie Yu
>Priority: Critical
>  Labels: persistent-volumes
>
> PersistentVolume ownership is not set to match the sandbox user when the 
> docker executor is used. Looks like the issue was introduced by 
> https://reviews.apache.org/r/45963/
> I didn't check the universal containerizer yet. 
> As far as I understand the following code is supposed to check that a volume 
> is not being already used by other tasks/containers.
> src/slave/containerizer/docker.cpp
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource)) {
> isVolumeInUse = true;
> break;
>   }
> }
> {code}
> But it doesn't exclude the container being launched (in my case I have only one
> container - no group of tasks). Thus the ownership of the PersistentVolume stays
> "root" (I run mesos-agent as root) and it's impossible to use the volume
> inside the container. We always run processes inside Docker containers as an
> unprivileged user.
> Making a small patch to exclude the container being launched fixes the issue.
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource) &&
>   containerId != container->id) {
> isVolumeInUse = true;
> break;
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8356) Persistent volume ownership is set to root despite of sandbox owner (frameworkInfo.user) when docker executor is used

2018-01-09 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16318907#comment-16318907
 ] 

Jie Yu commented on MESOS-8356:
---

[~kkalin] Thanks for reporting!

[~xujyan] This looks like a bug to me because `current` is always set to empty 
in Docker containerizer:
https://github.com/apache/mesos/blob/1.4.x/src/slave/containerizer/docker.cpp#L625

The logic in the Mesos containerizer (i.e., filesystem/linux isolator) is 
slightly different as `current` there is set to be `info->resources`:
https://github.com/apache/mesos/blob/1.4.x/src/slave/containerizer/mesos/isolators/filesystem/linux.cpp#L644



> Persistent volume ownership is set to root despite of sandbox owner 
> (frameworkInfo.user) when docker executor is used
> -
>
> Key: MESOS-8356
> URL: https://issues.apache.org/jira/browse/MESOS-8356
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.4.1
> Environment: Centos 7, Mesos 1.4.1, Docker Engine 1.13
>Reporter: Konstantin Kalin
>Assignee: Jie Yu
>  Labels: persistent-volumes
>
> PersistentVolume ownership is not set to match the sandbox user when the 
> docker executor is used. Looks like the issue was introduced by 
> https://reviews.apache.org/r/45963/
> I didn't check the universal containerizer yet. 
> As far as I understand the following code is supposed to check that a volume 
> is not being already used by other tasks/containers.
> src/slave/containerizer/docker.cpp
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource)) {
> isVolumeInUse = true;
> break;
>   }
> }
> {code}
> But it doesn't exclude the container being launched (in my case I have only one
> container - no group of tasks). Thus the ownership of the PersistentVolume stays
> "root" (I run mesos-agent as root) and it's impossible to use the volume
> inside the container. We always run processes inside Docker containers as an
> unprivileged user.
> Making a small patch to exclude the container being launched fixes the issue.
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource) &&
>   containerId != container->id) {
> isVolumeInUse = true;
> break;
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8356) Persistent volume ownership is set to root despite of sandbox owner (frameworkInfo.user) when docker executor is used

2018-01-09 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-8356:
--
Target Version/s:   (was: 1.4.2, 1.5.1)

> Persistent volume ownership is set to root despite of sandbox owner 
> (frameworkInfo.user) when docker executor is used
> -
>
> Key: MESOS-8356
> URL: https://issues.apache.org/jira/browse/MESOS-8356
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.4.1
> Environment: Centos 7, Mesos 1.4.1, Docker Engine 1.13
>Reporter: Konstantin Kalin
>Assignee: Jie Yu
>  Labels: persistent-volumes
>
> PersistentVolume ownership is not set to match the sandbox user when the 
> docker executor is used. Looks like the issue was introduced by 
> https://reviews.apache.org/r/45963/
> I didn't check the universal containerizer yet. 
> As far as I understand the following code is supposed to check that a volume 
> is not being already used by other tasks/containers.
> src/slave/containerizer/docker.cpp
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource)) {
> isVolumeInUse = true;
> break;
>   }
> }
> {code}
> But it doesn't exclude the container being launched (in my case I have only one
> container - no group of tasks). Thus the ownership of the PersistentVolume stays
> "root" (I run mesos-agent as root) and it's impossible to use the volume
> inside the container. We always run processes inside Docker containers as an
> unprivileged user.
> Making a small patch to exclude the container being launched fixes the issue.
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource) &&
>   containerId != container->id) {
> isVolumeInUse = true;
> break;
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8356) Persistent volume ownership is set to root despite of sandbox owner (frameworkInfo.user) when docker executor is used

2018-01-09 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-8356:
--
Target Version/s: 1.4.2, 1.5.1

> Persistent volume ownership is set to root despite of sandbox owner 
> (frameworkInfo.user) when docker executor is used
> -
>
> Key: MESOS-8356
> URL: https://issues.apache.org/jira/browse/MESOS-8356
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.4.1
> Environment: Centos 7, Mesos 1.4.1, Docker Engine 1.13
>Reporter: Konstantin Kalin
>Assignee: Jie Yu
>  Labels: persistent-volumes
>
> PersistentVolume ownership is not set to match the sandbox user when the 
> docker executor is used. Looks like the issue was introduced by 
> https://reviews.apache.org/r/45963/
> I didn't check the universal containerizer yet. 
> As far as I understand the following code is supposed to check that a volume 
> is not being already used by other tasks/containers.
> src/slave/containerizer/docker.cpp
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource)) {
> isVolumeInUse = true;
> break;
>   }
> }
> {code}
> But it doesn't exclude the container being launched (in my case I have only one
> container - no group of tasks). Thus the ownership of the PersistentVolume stays
> "root" (I run mesos-agent as root) and it's impossible to use the volume
> inside the container. We always run processes inside Docker containers as an
> unprivileged user.
> Making a small patch to exclude the container being launched fixes the issue.
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource) &&
>   containerId != container->id) {
> isVolumeInUse = true;
> break;
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8356) Persistent volume ownership is set to root despite of sandbox owner (frameworkInfo.user) when docker executor is used

2018-01-09 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-8356:
--
Affects Version/s: 1.4.1

> Persistent volume ownership is set to root despite of sandbox owner 
> (frameworkInfo.user) when docker executor is used
> -
>
> Key: MESOS-8356
> URL: https://issues.apache.org/jira/browse/MESOS-8356
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.4.1
> Environment: Centos 7, Mesos 1.4.1, Docker Engine 1.13
>Reporter: Konstantin Kalin
>Assignee: Jie Yu
>  Labels: persistent-volumes
>
> PersistentVolume ownership is not set to match the sandbox user when the 
> docker executor is used. Looks like the issue was introduced by 
> https://reviews.apache.org/r/45963/
> I didn't check the universal containerizer yet. 
> As far as I understand the following code is supposed to check that a volume 
> is not being already used by other tasks/containers.
> src/slave/containerizer/docker.cpp
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource)) {
> isVolumeInUse = true;
> break;
>   }
> }
> {code}
> But it doesn't exclude the container being launched (in my case I have only one
> container - no group of tasks). Thus the ownership of the PersistentVolume stays
> "root" (I run mesos-agent as root) and it's impossible to use the volume
> inside the container. We always run processes inside Docker containers as an
> unprivileged user.
> Making a small patch to exclude the container being launched fixes the issue.
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource) &&
>   containerId != container->id) {
> isVolumeInUse = true;
> break;
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-8410) Reconfiguration policy fails to handle mount disk resources.

2018-01-05 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu reassigned MESOS-8410:
-

Assignee: Benno Evers

> Reconfiguration policy fails to handle mount disk resources.
> 
>
> Key: MESOS-8410
> URL: https://issues.apache.org/jira/browse/MESOS-8410
> Project: Mesos
>  Issue Type: Bug
>Reporter: James Peach
>Assignee: Benno Evers
>
> We deployed {{--reconfiguration_policy="additive"}} on a number of Mesos 
> agents that had mount disk resources configured, and it looks like the agent 
> confused the size of the mount disk with the size of the work directory 
> resource:
> {noformat}
> E0106 01:54:15.000123 1310889 slave.cpp:6733] EXIT with status 1: Failed to 
> perform recovery: Configuration change not permitted under 'additive' policy: 
> Value of scalar resource 'disk' decreased from 183 to 868000
> {noformat}
> The {{--resources}} flag is
> {noformat}
> --resources="[
>   {
> "name": "disk",
> "type": "SCALAR",
> "scalar": {
>   "value": 868000
> }
>   }
>   ,
>   {
> "name": "disk",
> "type": "SCALAR",
> "scalar": {
>   "value": 183
> },
> "disk": {
>   "source": {
> "type": "MOUNT",
> "mount": {
>   "root" : "/srv/mesos/volumes/a"
> }
>   }
> }
>   }
>   ,
>   {
> "name": "disk",
> "type": "SCALAR",
> "scalar": {
>   "value": 183
> },
> "disk": {
>   "source": {
> "type": "MOUNT",
> "mount": {
>   "root" : "/srv/mesos/volumes/b"
> }
>   }
> }
>   }
>   ,
>   {
> "name": "disk",
> "type": "SCALAR",
> "scalar": {
>   "value": 183
> },
> "disk": {
>   "source": {
> "type": "MOUNT",
> "mount": {
>   "root" : "/srv/mesos/volumes/c"
> }
>   }
> }
>   }
>   ,
>   {
> "name": "disk",
> "type": "SCALAR",
> "scalar": {
>   "value": 183
> },
> "disk": {
>   "source": {
> "type": "MOUNT",
> "mount": {
>   "root" : "/srv/mesos/volumes/d"
> }
>   }
> }
>   }
>   ,
>   {
> "name": "disk",
> "type": "SCALAR",
> "scalar": {
>   "value": 183
> },
> "disk": {
>   "source": {
> "type": "MOUNT",
> "mount": {
>   "root" : "/srv/mesos/volumes/e"
> }
>   }
> }
>   }
>   ,
>   {
> "name": "disk",
> "type": "SCALAR",
> "scalar": {
>   "value": 183
> },
> "disk": {
>   "source": {
> "type": "MOUNT",
> "mount": {
>   "root" : "/srv/mesos/volumes/f"
> }
>   }
> }
>   }
>   ,
>   {
> "name": "disk",
> "type": "SCALAR",
> "scalar": {
>   "value": 183
> },
> "disk": {
>   "source": {
> "type": "MOUNT",
> "mount": {
>   "root" : "/srv/mesos/volumes/g"
> }
>   }
> }
>   }
>   ,
>   {
> "name": "disk",
> "type": "SCALAR",
> "scalar": {
>   "value": 183
> },
> "disk": {
>   "source": {
> "type": "MOUNT",
> "mount": {
>   "root" : "/srv/mesos/volumes/h"
> }
>   }
> }
>   }
> ]
> {noformat}
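
For illustration only, here is one way the intent of the "additive" policy could be expressed so that disk resources are matched by their source (the work directory versus each MOUNT root) instead of being compared positionally. This is a hypothetical sketch, not the agent's actual reconfiguration code:

{code:java}
// Hypothetical sketch, not the actual agent code.
#include <map>
#include <string>

// Key: disk source ("" for the work directory, otherwise the MOUNT root).
// Value: configured size of that disk.
using DiskBySource = std::map<std::string, double>;

// Additive policy: every disk in the old configuration must still be present
// in the new configuration with at least the same size.
bool isAdditive(const DiskBySource& oldDisks, const DiskBySource& newDisks)
{
  for (const auto& [source, size] : oldDisks) {
    const auto it = newDisks.find(source);
    if (it == newDisks.end() || it->second < size) {
      return false;  // a disk disappeared or shrank
    }
  }
  return true;
}
{code}

With disks grouped like this, a mount disk is only ever compared against its own previous value, never against the work-directory disk as in the error above.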



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8398) External volumes (through docker/volume isolator) might not be accessible by non-root users.

2018-01-04 Thread Jie Yu (JIRA)
Jie Yu created MESOS-8398:
-

 Summary: External volumes (through docker/volume isolator) might 
not be accessible by non-root users.
 Key: MESOS-8398
 URL: https://issues.apache.org/jira/browse/MESOS-8398
 Project: Mesos
  Issue Type: Bug
  Components: containerization
Affects Versions: 1.4.1, 1.3.1
Reporter: Jie Yu


That's because we don't perform chown/chmod for external volumes at the moment 
(because a volume might be shared across multiple containers). If the container is 
launched as a non-root user, it might not be able to access the external 
volume.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8397) Document standalone container and its API.

2018-01-04 Thread Jie Yu (JIRA)
Jie Yu created MESOS-8397:
-

 Summary: Document standalone container and its API.
 Key: MESOS-8397
 URL: https://issues.apache.org/jira/browse/MESOS-8397
 Project: Mesos
  Issue Type: Documentation
Reporter: Jie Yu
Assignee: Joseph Wu


Standalone container support has been added in Mesos 1.5. We need to document 
this feature.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8396) Document resource provider and its API

2018-01-04 Thread Jie Yu (JIRA)
Jie Yu created MESOS-8396:
-

 Summary: Document resource provider and its API
 Key: MESOS-8396
 URL: https://issues.apache.org/jira/browse/MESOS-8396
 Project: Mesos
  Issue Type: Documentation
Reporter: Jie Yu


The resource provider API was introduced in 1.5. It is an HTTP-based API. We should 
document it, and provide a general introduction to what a resource provider is and 
what can be done with it.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8377) RecoverTest.CatchupTruncated is flaky.

2018-01-04 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16311747#comment-16311747
 ] 

Jie Yu commented on MESOS-8377:
---

commit 39525482a756e951e798db4c6831b79b65bb75b5 (HEAD -> master, origin/master, 
origin/HEAD)
Author: Ilya Pronin 
Date:   Thu Jan 4 09:51:05 2018 -0800

Fixed RecoverTest.CatchupTruncated test flakiness.

Most likely the "lock already held by process" LevelDB error was caused
by a Shared still retained by one of the processes when the
test tries to recreate it. This change makes sure that the test code is
the only owner of the replica.

Review: https://reviews.apache.org/r/64938/
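
As a loose analogy for the ownership problem described in the commit message (using std::shared_ptr rather than the actual libprocess types), recreating the replica is only safe once the test holds the last reference:

{code:java}
// Analogy only: std::shared_ptr instead of the actual libprocess types.
#include <cassert>
#include <memory>

struct Replica {};  // hypothetical stand-in for a replicated-log replica

int main()
{
  auto replica = std::make_shared<Replica>();

  // Another owner, e.g. a process that has not released it yet.
  std::shared_ptr<Replica> heldElsewhere = replica;

  // Recreating the replica (and re-opening its LevelDB directory) while it is
  // still co-owned can race -- "lock already held by process". Release the
  // other reference and confirm sole ownership first.
  heldElsewhere.reset();
  assert(replica.use_count() == 1);

  replica.reset();                        // drop the old replica...
  replica = std::make_shared<Replica>();  // ...then recreate it safely.
  return 0;
}
{code}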

> RecoverTest.CatchupTruncated is flaky.
> --
>
> Key: MESOS-8377
> URL: https://issues.apache.org/jira/browse/MESOS-8377
> Project: Mesos
>  Issue Type: Bug
>  Components: replicated log
>Reporter: Alexander Rukletsov
>Assignee: Ilya Pronin
>  Labels: flaky-test
> Fix For: 1.5.0
>
> Attachments: CatchupTruncated-badrun.txt, 
> RecoverTest.CatchupTruncated-badrun2.txt
>
>
> Observing regularly in our CI. Logs attached.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8391) Mesos agent doesn't notice that a pod task exits or crashes after the agent restart

2018-01-04 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16311718#comment-16311718
 ] 

Jie Yu commented on MESOS-8391:
---

[~bennoe] can you attach the executor's log as well?

> Mesos agent doesn't notice that a pod task exits or crashes after the agent 
> restart
> ---
>
> Key: MESOS-8391
> URL: https://issues.apache.org/jira/browse/MESOS-8391
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization, executor
>Affects Versions: 1.5.0
>Reporter: Ivan Chernetsky
>Priority: Critical
> Attachments: agent.log.gz
>
>
> h4. (1) Agent doesn't detect that a pod task exits/crashes
> # Create a Marathon pod with two containers which just do {{sleep 1}}.
> # Restart the Mesos agent on the node where the pod got launched.
> # Kill one of the pod tasks
> *Expected result*: The Mesos agent detects that one of the tasks got killed, 
> and forwards {{TASK_FAILED}} status to Marathon.
> *Actual result*: The Mesos agent does nothing, and the Mesos master thinks 
> that both tasks are running just fine. Marathon doesn't take any action 
> because it doesn't receive any update from Mesos.
> h4. (2) After the agent restart, it detects that the task crashed, forwards 
> the correct status update, but the other task stays in {{TASK_KILLING}} state 
> forever
> # Perform steps in (1).
> # Restart the Mesos agent
> *Expected result*: The Mesos agent detects that one of the tasks crashed, 
> forwards the corresponding status update, and kills the other task too.
> *Actual result*: The Mesos agent detects that one of the tasks crashed, 
> forwards the corresponding status update, but the other task stays in 
> `TASK_KILLING` state forever.
> Please note that after another agent restart, the other task finally gets 
> killed and the correct status updates get propagated all the way to Marathon.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7235) Improve Storage Support using Resource Provider and CSI

2018-01-04 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16311702#comment-16311702
 ] 

Jie Yu commented on MESOS-7235:
---

Post MVP work is tracked here: MESOS-8374 Resource Provider and CSI Tech Debt


> Improve Storage Support using Resource Provider and CSI
> ---
>
> Key: MESOS-7235
> URL: https://issues.apache.org/jira/browse/MESOS-7235
> Project: Mesos
>  Issue Type: Epic
>Reporter: Jie Yu
>Assignee: Jie Yu
>  Labels: mesosphere, storage
> Fix For: 1.5.0
>
>
> Currently, Mesos supports both [local persistent 
> volumes|https://github.com/apache/mesos/blob/master/docs/persistent-volume.md]
>  as well as [external persistent 
> volumes|https://github.com/apache/mesos/blob/master/docs/docker-volume.md]. 
> However, both of them are not ideal.
> Local persistent volumes do not support offering physical or logical block 
> devices directly. Also, frameworks do not have choices to select filesystems 
> for their local persistent volumes. There are also some [usability 
> problems|https://issues.apache.org/jira/browse/MESOS-4209] with the local 
> persistent volumes. Mesos does support [multiple local 
> disks|https://github.com/apache/mesos/blob/master/docs/multiple-disk.md]. 
> However, it’s a big burden for operators to configure each agent properly to 
> be able to leverage this feature.
> External persistent volumes support in Mesos currently bypasses the resource 
> management part. In other words, using an external persistent volume does not 
> go through the usual offer cycle. Mesos doesn’t track resources associated 
> with the external volumes. This makes quota control, reservation, fair 
> sharing almost impossible to implement. Also, the current interface Mesos 
> uses to interact with volume providers is the [Docker Volume Driver interface 
> (DVDI)|https://docs.docker.com/engine/extend/plugins_volume/], which is very 
> specific to operations on a particular agent.
> The main problem I see currently is that we don’t have a coherent story for 
> storage. Yes, we have some primitives in Mesos that can support some stateful 
> services, but this is far from ideal. Some of them are just the stop gap 
> solution (e.g., the external volume support). This epic tries to tell a 
> coherent story for supporting storage in Mesos.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8391) Mesos agent doesn't notice that a pod task exits or crashes after the agent restart

2018-01-03 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-8391:
--
Fix Version/s: (was: 1.5.0)

> Mesos agent doesn't notice that a pod task exits or crashes after the agent 
> restart
> ---
>
> Key: MESOS-8391
> URL: https://issues.apache.org/jira/browse/MESOS-8391
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization, executor
>Affects Versions: 1.5.0
>Reporter: Ivan Chernetsky
>Priority: Critical
>
> h4. (1) Agent doesn't detect that a pod task exits/crashes
> # Create a Marathon pod with two containers which just do {{sleep 1}}.
> # Restart the Mesos agent on the node where the pod got launched.
> # Kill one of the pod tasks
> *Expected result*: The Mesos agent detects that one of the tasks got killed, 
> and forwards {{TASK_FAILED}} status to Marathon.
> *Actual result*: The Mesos agent does nothing, and the Mesos master thinks 
> that both tasks are running just fine. Marathon doesn't take any action 
> because it doesn't receive any update from Mesos.
> h4. (2) After the agent restart, it detects that the task crashed, forwards 
> the correct status update, but the other task stays in {{TASK_KILLING}} state 
> forever
> # Perform steps in (1).
> # Restart the Mesos agent
> *Expected result*: The Mesos agent detects that one of the tasks crashed, 
> forwards the corresponding status update, and kills the other task too.
> *Actual result*: The Mesos agent detects that one of the tasks crashed, 
> forwards the corresponding status update, but the other task stays in 
> `TASK_KILLING` state forever.
> Please note that after another agent restart, the other task finally gets 
> killed and the correct status updates get propagated all the way to Marathon.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8388) Show LRP resources in master endpoints.

2018-01-03 Thread Jie Yu (JIRA)
Jie Yu created MESOS-8388:
-

 Summary: Show LRP resources in master endpoints.
 Key: MESOS-8388
 URL: https://issues.apache.org/jira/browse/MESOS-8388
 Project: Mesos
  Issue Type: Task
Reporter: Jie Yu


Currently, only the resource provider info is shown. We should also show the 
resources provided by the RP.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8387) Support marking an RP as gone.

2018-01-03 Thread Jie Yu (JIRA)
Jie Yu created MESOS-8387:
-

 Summary: Support marking an RP as gone.
 Key: MESOS-8387
 URL: https://issues.apache.org/jira/browse/MESOS-8387
 Project: Mesos
  Issue Type: Task
Reporter: Jie Yu


Similar to marking an agent as gone, this will allow the operator to tell Mesos when 
he/she is sure that the RP is gone.

However, it's possible that the resources from that RP might still be 
referenced by some tasks, so we need to design ways to inform the frameworks about 
that and let them decide what to do.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8385) Integrate with Resource Provider Registrar

2018-01-03 Thread Jie Yu (JIRA)
Jie Yu created MESOS-8385:
-

 Summary: Integrate with Resource Provider Registrar
 Key: MESOS-8385
 URL: https://issues.apache.org/jira/browse/MESOS-8385
 Project: Mesos
  Issue Type: Task
Reporter: Jie Yu


The Resource Provider Registrar will be used to store known RPs across agent/master 
failovers. This will allow us to decide which RPs are disconnected after an 
agent/master failover.

Also, this is a necessary step for us to implement "marking an RP gone".



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8384) Add health check for local resource providers.

2018-01-03 Thread Jie Yu (JIRA)
Jie Yu created MESOS-8384:
-

 Summary: Add health check for local resource providers.
 Key: MESOS-8384
 URL: https://issues.apache.org/jira/browse/MESOS-8384
 Project: Mesos
  Issue Type: Task
Reporter: Jie Yu


Similar to what we do for agents, the resource provider manager needs to health 
check resource providers and mark them as unreachable if the health check times out.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8383) Add metrics for operations in Storage Local Resource Provider (SLRP).

2018-01-03 Thread Jie Yu (JIRA)
Jie Yu created MESOS-8383:
-

 Summary: Add metrics for operations in Storage Local Resource 
Provider (SLRP).
 Key: MESOS-8383
 URL: https://issues.apache.org/jira/browse/MESOS-8383
 Project: Mesos
  Issue Type: Task
Reporter: Jie Yu






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8382) Master should bookkeep local resource providers.

2018-01-03 Thread Jie Yu (JIRA)
Jie Yu created MESOS-8382:
-

 Summary: Master should bookkeep local resource providers.
 Key: MESOS-8382
 URL: https://issues.apache.org/jira/browse/MESOS-8382
 Project: Mesos
  Issue Type: Task
Reporter: Jie Yu


This will simplify the handling of `UpdateSlaveMessage`. Also, it'll simplify 
serving the endpoints.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8381) Update WebUI to show local resource providers.

2018-01-03 Thread Jie Yu (JIRA)
Jie Yu created MESOS-8381:
-

 Summary: Update WebUI to show local resource providers.
 Key: MESOS-8381
 URL: https://issues.apache.org/jira/browse/MESOS-8381
 Project: Mesos
  Issue Type: Task
Reporter: Jie Yu






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7537) Add functionality to disconnect resource providers in the master

2018-01-03 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16310373#comment-16310373
 ] 

Jie Yu commented on MESOS-7537:
---

[~nfnt] Can you clarify what this ticket means? Is this still relevant?

> Add functionality to disconnect resource providers in the master
> 
>
> Key: MESOS-7537
> URL: https://issues.apache.org/jira/browse/MESOS-7537
> Project: Mesos
>  Issue Type: Task
>Reporter: Jan Schlicht
>  Labels: csi-post-mvp, external-resources, mesosphere, storage
>
> Similar to the existing {{disconnect}} methods for frameworks and agents, a 
> similar function has to be added to the master.
> It needs to be called in {{Master::exited}}, i.e. when it detects that a 
> resource provider is no longer reachable.
> For local resource providers this also has to be called when the agent they 
> are running on disconnects.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8321) Validate that offer operations contain only master-known resource provider resources

2018-01-03 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-8321:
--
Target Version/s:   (was: 1.5.1)

> Validate that offer operations contain only master-known resource provider 
> resources
> 
>
> Key: MESOS-8321
> URL: https://issues.apache.org/jira/browse/MESOS-8321
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benjamin Bannier
>
> We should update the master's offer operation validation to also check that 
> any offer operation only works with resources from known resource providers.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7554) Add re-registration timeout for local resource providers

2018-01-03 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-7554:
--
Summary: Add re-registration timeout for local resource providers  (was: 
Add health check for resource providers)

> Add re-registration timeout for local resource providers
> 
>
> Key: MESOS-7554
> URL: https://issues.apache.org/jira/browse/MESOS-7554
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Jan Schlicht
>  Labels: csi-post-mvp, mesosphere, storage
>
> Similar to master health checking agents, local resource providers need to be 
> health checked as well. This re-registration timeout will be started when a 
> resource provider seems to have disconnected, similar to how it's done for 
> agents. While waiting for the resource provider to reconnect, it will be 
> deactivated. On re-registration the timeout will be canceled and the resource 
> provider activated again. In case of a timeout, the internal state will be 
> changed to {{unreachable}} (as it is for agents in that situation) and 
> considered gone.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7554) Add re-registration timeout for local resource providers

2018-01-03 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-7554:
--
Description: This re-registration timeout will be started when a resource 
provider seems to have disconnected, similar to how it's done for agents. While 
waiting for the resource provider to reconnect, it will be deactivated. On 
re-registration the timeout will be canceled and the resource provider 
activated again. In case of a timeout, the internal state will be changed to 
{{unreachable}} (as it is for agents in that situation).  (was: Similar to 
master health checking agents, local resource providers need to be health 
checked as well. This re-registration timeout will be started when a resource 
provider seems to have disconnected, similar to how it's done for agents. While 
waiting for the resource provider to reconnect, it will be deactivated. On 
re-registration the timeout will be canceled and the resource provider 
activated again. In case of a timeout, the internal state will be changed to 
{{unreachable}} (as it is for agents in that situation) and considered gone.)
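
A conceptual sketch of the life cycle described above, with hypothetical names and plain standard-library timing rather than the actual master code:

{code:java}
// Hypothetical sketch of the re-registration timeout, not the master's code.
#include <chrono>
#include <optional>

enum class State { ACTIVE, DEACTIVATED, UNREACHABLE };

struct ResourceProvider
{
  State state = State::ACTIVE;
  std::optional<std::chrono::steady_clock::time_point> deadline;
};

// The provider seems to have disconnected: deactivate it and arm the timeout.
void disconnected(ResourceProvider& rp, std::chrono::seconds timeout)
{
  rp.state = State::DEACTIVATED;
  rp.deadline = std::chrono::steady_clock::now() + timeout;
}

// The provider re-registered in time: cancel the timeout and activate it again.
void reregistered(ResourceProvider& rp)
{
  rp.deadline.reset();
  rp.state = State::ACTIVE;
}

// Periodic check: a provider whose deadline has passed becomes unreachable.
void checkTimeout(ResourceProvider& rp)
{
  if (rp.deadline && std::chrono::steady_clock::now() > *rp.deadline) {
    rp.state = State::UNREACHABLE;
  }
}
{code}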

> Add re-registration timeout for local resource providers
> 
>
> Key: MESOS-7554
> URL: https://issues.apache.org/jira/browse/MESOS-7554
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Jan Schlicht
>  Labels: csi-post-mvp, mesosphere, storage
>
> This re-registration timeout will be started when a resource provider seems 
> to have disconnected, similar to how it's done for agents. While waiting for 
> the resource provider to reconnect, it will be deactivated. On 
> re-registration the timeout will be canceled and the resource provider 
> activated again. In case of a timeout, the internal state will be changed to 
> {{unreachable}} (as it is for agents in that situation).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7554) Add health check for resource providers

2018-01-03 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-7554:
--
Description: Similar to master health checking agents, local resource 
providers need to be health checked as well. This re-registration timeout will 
be started when a resource provider seems to have disconnected, similar to how 
it's done for agents. While waiting for the resource provider to reconnect, it 
will be deactivated. On re-registration the timeout will be canceled and the 
resource provider activated again. In case of a timeout, the internal state 
will be changed to {{unreachable}} (as it is for agents in that situation) and 
considered gone.  (was: Similar to master health checking agents, resource 
providers need to be health checked as well. This re-registration timeout will 
be started when a resource provider seems to have disconnected, similar to how 
it's done for agents. While waiting for the resource provider to reconnect, it 
will be deactivated. On re-registration the timeout will be canceled and the 
resource provider activated again. In case of a timeout, the internal state 
will be changed to {{unreachable}} (as it is for agents in that situation) and 
considered gone.)

> Add health check for resource providers
> ---
>
> Key: MESOS-7554
> URL: https://issues.apache.org/jira/browse/MESOS-7554
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Jan Schlicht
>  Labels: csi-post-mvp, mesosphere, storage
>
> Similar to master health checking agents, local resource providers need to be 
> health checked as well. This re-registration timeout will be started when a 
> resource provider seems to have disconnected, similar to how it's done for 
> agents. While waiting for the resource provider to reconnect, it will be 
> deactivated. On re-registration the timeout will be canceled and the resource 
> provider activated again. In case of a timeout, the internal state will be 
> changed to {{unreachable}} (as it is for agents in that situation) and 
> considered gone.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7554) Add health check for resource providers

2018-01-03 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-7554:
--
Description: Similar to master health checking agents, resource providers 
need to be health checked as well. This re-registration timeout will be started 
when a resource provider seems to have disconnected, similar to how it's done 
for agents. While waiting for the resource provider to reconnect, it will be 
deactivated. On re-registration the timeout will be canceled and the resource 
provider activated again. In case of a timeout, the internal state will be 
changed to {{unreachable}} (as it is for agents in that situation) and 
considered gone.  (was: This re-registration timeout will be started when a 
resource provider seems to have disconnected, similar to how it's done for 
agents. While waiting for the resource provider to reconnect, it will be 
deactivated. On re-registration the timeout will be canceled and the resource 
provider activated again. In case of a timeout, the internal state will be 
changed to {{unreachable}} (as it is for agents in that situation) and 
considered gone.)

> Add health check for resource providers
> ---
>
> Key: MESOS-7554
> URL: https://issues.apache.org/jira/browse/MESOS-7554
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Jan Schlicht
>  Labels: csi-post-mvp, mesosphere, storage
>
> Similar to master health checking agents, resource providers need to be 
> health checked as well. This re-registration timeout will be started when a 
> resource provider seems to have disconnected, similar to how it's done for 
> agents. While waiting for the resource provider to reconnect, it will be 
> deactivated. On re-registration the timeout will be canceled and the resource 
> provider activated again. In case of a timeout, the internal state will be 
> changed to {{unreachable}} (as it is for agents in that situation) and 
> considered gone.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7554) Add health check for resource providers

2018-01-03 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-7554:
--
Summary: Add health check for resource providers  (was: Add a 
re-registration timeout for resource providers)

> Add health check for resource providers
> ---
>
> Key: MESOS-7554
> URL: https://issues.apache.org/jira/browse/MESOS-7554
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Jan Schlicht
>  Labels: csi-post-mvp, mesosphere, storage
>
> This re-registration timeout will be started when a resource provider seems 
> to have disconnected, similar to how it's done for agents. While waiting for 
> the resource provider to reconnect, it will be deactivated. On 
> re-registration the timeout will be canceled and the resource provider 
> activated again. In case of a timeout, the internal state will be changed to 
> {{unreachable}} (as it is for agents in that situation) and considered gone.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8380) Update WebUI to show local resource providers.

2018-01-03 Thread Jie Yu (JIRA)
Jie Yu created MESOS-8380:
-

 Summary: Update WebUI to show local resource providers.
 Key: MESOS-8380
 URL: https://issues.apache.org/jira/browse/MESOS-8380
 Project: Mesos
  Issue Type: Task
Reporter: Jie Yu






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8379) Add metrics for resource provider related messages.

2018-01-03 Thread Jie Yu (JIRA)
Jie Yu created MESOS-8379:
-

 Summary: Add metrics for resource provider related messages.
 Key: MESOS-8379
 URL: https://issues.apache.org/jira/browse/MESOS-8379
 Project: Mesos
  Issue Type: Task
Reporter: Jie Yu






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8376) Bundled GRPC does not build on Debian 9

2018-01-03 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-8376:
--
Fix Version/s: (was: 1.5.0)

> Bundled GRPC does not build on Debian 9
> ---
>
> Key: MESOS-8376
> URL: https://issues.apache.org/jira/browse/MESOS-8376
> Project: Mesos
>  Issue Type: Bug
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>Priority: Blocker
>
> Debian 9 has OpenSSL 1.1.x by default, which is incompatible with gRPC.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MESOS-8253) Mesos CI docker rmi conflict

2018-01-03 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16310193#comment-16310193
 ] 

Jie Yu edited comment on MESOS-8253 at 1/3/18 7:52 PM:
---

I did some triage on this issue. It looks like stdout and stderr are 
marked as non-blocking. I added a few print statements in our docker_build.sh 
script to verify:
{noformat}
+ python -c 'import os,sys,fcntl; flags = fcntl.fcntl(sys.stdout, fcntl.F_GETFL); print(flags & os.O_NONBLOCK);'
2048
+ python -c 'import os,sys,fcntl; flags = fcntl.fcntl(sys.stderr, fcntl.F_GETFL); print(flags & os.O_NONBLOCK);'
2048
{noformat}

This explains the EAGAIN received by `docker run`. I couldn't find a way in 
Jenkins to set the stdout/stderr to be blocking.

A workaround I am testing is to dump stdout/stderr to a file and tail the file 
at the end to reduce the likelihood of this. But the true fix is to set the fds to 
be blocking.

Related issues:
https://github.com/travis-ci/travis-ci/issues/4704
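
The "true fix" mentioned above, putting the inherited descriptors back into blocking mode, is plain POSIX. A minimal example (not tied to Jenkins or any Mesos helper):

{code:java}
#include <fcntl.h>
#include <unistd.h>

// Clear O_NONBLOCK on a descriptor so that large writes (e.g. the output of
// `docker run`) block instead of failing with EAGAIN.
static void makeBlocking(int fd)
{
  const int flags = fcntl(fd, F_GETFL);
  if (flags != -1) {
    fcntl(fd, F_SETFL, flags & ~O_NONBLOCK);
  }
}

int main()
{
  makeBlocking(STDOUT_FILENO);
  makeBlocking(STDERR_FILENO);
  return 0;
}
{code}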


was (Author: jieyu):
I did some triage on this issue. It looks like stdout and stderr are 
marked as non-blocking. I added a few print statements in our docker_build.sh 
script to verify:
{noformat}
+ python -c 'import os,sys,fcntl; flags = fcntl.fcntl(sys.stdout, fcntl.F_GETFL); print(flags & os.O_NONBLOCK);'
2048
+ python -c 'import os,sys,fcntl; flags = fcntl.fcntl(sys.stderr, fcntl.F_GETFL); print(flags & os.O_NONBLOCK);'
2048
{noformat}

This explains the EAGAIN received by `docker run`. I couldn't find a way in 
Jenkins to set the stdout/stderr to be blocking.

A workaround I am testing is to dump stdout/stderr to a file and tail the file 
at the end to reduce the likelihood of this. But the true fix is to set the fds to 
be blocking.

> Mesos CI docker rmi conflict
> 
>
> Key: MESOS-8253
> URL: https://issues.apache.org/jira/browse/MESOS-8253
> Project: Mesos
>  Issue Type: Bug
>  Components: build, docker
>Reporter: James Peach
>
> We are seeing a lot of docker build jobs failing when they try to clean up 
> their docker images:
> {noformat}
> + docker rmi mesos-1511286604-15916
> Error response from daemon: conflict: unable to remove repository reference 
> "mesos-1511286604-15916" (must force) - container 1aabf0225a43 is using its 
> referenced image 23292073f88f
> Build step 'Execute shell' marked build as failure
> {noformat}
> The full Jenkins log is 
> [here|https://builds.apache.org/job/Mesos-Buildbot/BUILDTOOL=autotools,COMPILER=clang,CONFIGURATION=--verbose%20--disable-libtool-wrappers%20--enable-libevent%20--enable-ssl,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu%3A14.04,label_exp=(ubuntu)&&(!ubuntu-us1)&&(!ubuntu-eu2)&&(!qnode3)&&(!H23)/4486/console]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

