[jira] [Commented] (MESOS-7884) Support containerd on Mesos.

2020-05-18 Thread Gilbert Song (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-7884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17109937#comment-17109937
 ] 

Gilbert Song commented on MESOS-7884:
-

cc [~qianzhang]

> Support containerd on Mesos.
> 
>
> Key: MESOS-7884
> URL: https://issues.apache.org/jira/browse/MESOS-7884
> Project: Mesos
>  Issue Type: Epic
>  Components: containerization
>Reporter: Gilbert Song
>Priority: Major
>  Labels: containerd, containerizer
>
> containerd v1.0 is very close to its formal release (currently at v1.0.0-alpha.4). We 
> should consider supporting containerd on Mesos, either by refactoring the Docker 
> containerizer or by introducing a new containerd containerizer. Design proposals and 
> suggestions are definitely welcome.
> https://github.com/containerd/containerd



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-9966) Agent crashes when trying to destroy orphaned nested container if root container is orphaned as well

2019-09-16 Thread Gilbert Song (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-9966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song reassigned MESOS-9966:
---

Assignee: Qian Zhang  (was: Gilbert Song)

> Agent crashes when trying to destroy orphaned nested container if root 
> container is orphaned as well
> 
>
> Key: MESOS-9966
> URL: https://issues.apache.org/jira/browse/MESOS-9966
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.7.3
>Reporter: Jan Schlicht
>Assignee: Qian Zhang
>Priority: Major
>
> Noticed an agent crash-looping when trying to recover. It recognized a 
> container and its nested container as orphaned. When trying to destroy the 
> nested container, the agent crashes, probably when trying to [get the sandbox 
> path of the root 
> container|https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/containerizer.cpp#L2966].
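
If the crash is indeed in that sandbox-path lookup, here is a minimal sketch (not the 
actual Mesos code) of the kind of guard that would avoid it, assuming the 
containerizer's `containers_` map and the `protobuf::getRootContainerId` helper:

{code}
// Hypothetical guard, not the committed fix: verify the root container is
// still known before computing the nested container's sandbox path.
const ContainerID rootContainerId = protobuf::getRootContainerId(containerId);

if (!containers_.contains(rootContainerId)) {
  // The root container is itself an orphan that was already cleaned up, so
  // skip the sandbox lookup instead of dereferencing a missing entry.
  return Failure(
      "Root container " + stringify(rootContainerId) + " is unknown");
}
{code}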
> {noformat}
> 2019-09-09 05:04:26: I0909 05:04:26.382326 89950 linux_launcher.cpp:286] 
> Recovering Linux launcher
> 2019-09-09 05:04:26: I0909 05:04:26.383162 89950 linux_launcher.cpp:331] Not 
> recovering cgroup mesos/a127917b-96fe-4100-b73d-5f876ce9ffc1/mesos
> 2019-09-09 05:04:26: I0909 05:04:26.383199 89950 linux_launcher.cpp:343] 
> Recovered container 
> a127917b-96fe-4100-b73d-5f876ce9ffc1.9783e2bb-7c2e-4930-9d39-4225bb6f1b97
> 2019-09-09 05:04:26: I0909 05:04:26.383216 89950 linux_launcher.cpp:331] Not 
> recovering cgroup 
> mesos/a127917b-96fe-4100-b73d-5f876ce9ffc1/mesos/9783e2bb-7c2e-4930-9d39-4225bb6f1b97/mesos
> 2019-09-09 05:04:26: I0909 05:04:26.383229 89950 linux_launcher.cpp:343] 
> Recovered container 2ee154e2-3cc4-420a-99fb-065e740f3091
> 2019-09-09 05:04:26: I0909 05:04:26.383237 89950 linux_launcher.cpp:343] 
> Recovered container a127917b-96fe-4100-b73d-5f876ce9ffc1
> 2019-09-09 05:04:26: I0909 05:04:26.383249 89950 linux_launcher.cpp:343] 
> Recovered container 
> 2ee154e2-3cc4-420a-99fb-065e740f3091.49fe2bf9-17af-415f-92b6-92a4db619436
> 2019-09-09 05:04:26: I0909 05:04:26.383260 89950 linux_launcher.cpp:331] Not 
> recovering cgroup mesos/2ee154e2-3cc4-420a-99fb-065e740f3091/mesos
> 2019-09-09 05:04:26: I0909 05:04:26.383271 89950 linux_launcher.cpp:331] Not 
> recovering cgroup 
> mesos/2ee154e2-3cc4-420a-99fb-065e740f3091/mesos/49fe2bf9-17af-415f-92b6-92a4db619436/mesos
> 2019-09-09 05:04:26: I0909 05:04:26.383280 89950 linux_launcher.cpp:437] 
> 2ee154e2-3cc4-420a-99fb-065e740f3091.49fe2bf9-17af-415f-92b6-92a4db619436 is 
> a known orphaned container
> 2019-09-09 05:04:26: I0909 05:04:26.383289 89950 linux_launcher.cpp:437] 
> a127917b-96fe-4100-b73d-5f876ce9ffc1 is a known orphaned container
> 2019-09-09 05:04:26: I0909 05:04:26.383296 89950 linux_launcher.cpp:437] 
> 2ee154e2-3cc4-420a-99fb-065e740f3091 is a known orphaned container
> 2019-09-09 05:04:26: I0909 05:04:26.383304 89950 linux_launcher.cpp:437] 
> a127917b-96fe-4100-b73d-5f876ce9ffc1.9783e2bb-7c2e-4930-9d39-4225bb6f1b97 is 
> a known orphaned container
> 2019-09-09 05:04:26: I0909 05:04:26.383414 89950 containerizer.cpp:1092] 
> Recovering isolators
> 2019-09-09 05:04:26: I0909 05:04:26.385931 89977 memory.cpp:478] Started 
> listening for OOM events for container a127917b-96fe-4100-b73d-5f876ce9ffc1
> 2019-09-09 05:04:26: I0909 05:04:26.386118 89977 memory.cpp:590] Started 
> listening on 'low' memory pressure events for container 
> a127917b-96fe-4100-b73d-5f876ce9ffc1
> 2019-09-09 05:04:26: I0909 05:04:26.386152 89977 memory.cpp:590] Started 
> listening on 'medium' memory pressure events for container 
> a127917b-96fe-4100-b73d-5f876ce9ffc1
> 2019-09-09 05:04:26: I0909 05:04:26.386175 89977 memory.cpp:590] Started 
> listening on 'critical' memory pressure events for container 
> a127917b-96fe-4100-b73d-5f876ce9ffc1
> 2019-09-09 05:04:26: I0909 05:04:26.386227 89977 memory.cpp:478] Started 
> listening for OOM events for container 2ee154e2-3cc4-420a-99fb-065e740f3091
> 2019-09-09 05:04:26: I0909 05:04:26.386248 89977 memory.cpp:590] Started 
> listening on 'low' memory pressure events for container 
> 2ee154e2-3cc4-420a-99fb-065e740f3091
> 2019-09-09 05:04:26: I0909 05:04:26.386270 89977 memory.cpp:590] Started 
> listening on 'medium' memory pressure events for container 
> 2ee154e2-3cc4-420a-99fb-065e740f3091
> 2019-09-09 05:04:26: I0909 05:04:26.386376 89977 memory.cpp:590] Started 
> listening on 'critical' memory pressure events for container 
> 2ee154e2-3cc4-420a-99fb-065e740f3091
> 2019-09-09 05:04:26: I0909 05:04:26.386694 89921 containerizer.cpp:1131] 
> Recovering provisioner
> 2019-09-09 05:04:26: I0909 05:04:26.388226 90010 metadata_manager.cpp:286] 
> Successfully loaded 64 Docker images
> 2019-09-09 05:04:26: I0909 05:04:26.388420 89932 provisioner.cpp:494] 

[jira] [Created] (MESOS-9965) agent should not send `TASK_GONE_BY_OPERATOR` if the framework is not partition aware.

2019-09-12 Thread Gilbert Song (Jira)
Gilbert Song created MESOS-9965:
---

 Summary: agent should not send `TASK_GONE_BY_OPERATOR` if the 
framework is not partition aware.
 Key: MESOS-9965
 URL: https://issues.apache.org/jira/browse/MESOS-9965
 Project: Mesos
  Issue Type: Bug
  Components: agent
Reporter: Gilbert Song


The Mesos agent should not send `TASK_GONE_BY_OPERATOR` if the framework is not 
partition-aware. We should check the framework's capabilities and send a 
different update to legacy frameworks.

The issue is exposed here:
https://github.com/apache/mesos/blob/f0be23765531b05661ed7f1b124faf96744aa80b/src/slave/slave.cpp#L5803

An example to follow:
https://github.com/apache/mesos/blob/f0be23765531b05661ed7f1b124faf96744aa80b/src/master/master.cpp#L9921
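
A minimal sketch of the distinction being asked for, modeled on the master-side 
example above (the agent-side hook and variable names are assumptions, not the 
final change):

{code}
// Pick the terminal state from the framework's capability instead of
// unconditionally sending TASK_GONE_BY_OPERATOR to every framework.
const TaskState taskState =
  protobuf::frameworkHasCapability(
      frameworkInfo, FrameworkInfo::Capability::PARTITION_AWARE)
    ? TASK_GONE_BY_OPERATOR
    : TASK_LOST;
{code}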



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (MESOS-9960) Agent with cgroup support may destroy containers belonging to unrelated agents on startup

2019-09-11 Thread Gilbert Song (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928060#comment-16928060
 ] 

Gilbert Song commented on MESOS-9960:
-

[~bennoe], do we want to close this issue?

> Agent with cgroup support may destroy containers belonging to unrelated 
> agents on startup
> -
>
> Key: MESOS-9960
> URL: https://issues.apache.org/jira/browse/MESOS-9960
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.8.1, 1.9.0, master
>Reporter: Benno Evers
>Priority: Major
>
> Let's say I have a Mesos cluster with one master and one agent:
> {noformat}
> $ mesos-master --work_dir=/tmp/mesos-master
> $ sudo mesos-agent --work_dir=/tmp/mesos-agent --master=127.0.1.1:5050 
> --port=5052 --isolation=docker/runtime
> {noformat}
> where I'm running a simple sleep task:
> {noformat}
> $ mesos-execute --command="sleep 1" --master=127.0.1.1:5050 --name="sleep"
> I0904 18:40:25.020413 18321 scheduler.cpp:189] Version: 1.8.0
> I0904 18:40:25.020892 18319 scheduler.cpp:342] Using default 'basic' HTTP 
> authenticatee
> I0904 18:40:25.021039 18323 scheduler.cpp:525] New master detected at 
> master@127.0.1.1:5050
> Subscribed with ID 7d9f5030-cadd-49df-bf1e-daa97a4baab6-
> Submitted task 'sleep' to agent 'd59e934c-9e26-490d-9f4a-1e8b4ce06b4e-S1'
> Received status update TASK_STARTING for task 'sleep'
>   source: SOURCE_EXECUTOR
> Received status update TASK_RUNNING for task 'sleep'
>   source: SOURCE_EXECUTOR
> {noformat}
> Next, I start a second agent on the same host as the first one:
> {noformat}
> $ sudo ./src/mesos-agent --work_dir=/tmp/ --master=example.org:5050 
> --isolation="linux/seccomp" 
> --seccomp_config_dir=`pwd`/3rdparty/libseccomp-2.3.3
> {noformat}
> During startup, this agent detects the container belonging to the other, 
> unrelated agent and will attempt to clean it up:
> {noformat}
> 0904 18:30:44.906430 18067 task_status_update_manager.cpp:207] Recovering 
> task status update manager
> I0904 18:30:44.906913 18071 containerizer.cpp:797] Recovering Mesos containers
> I0904 18:30:44.910077 18070 linux_launcher.cpp:286] Recovering Linux launcher
> I0904 18:30:44.910347 18070 linux_launcher.cpp:343] Recovered container 
> 7f455ed7-6593-41e8-9b29-52ee84d7675b
> I0904 18:30:44.910409 18070 linux_launcher.cpp:437] 
> 7f455ed7-6593-41e8-9b29-52ee84d7675b is a known orphaned container
> I0904 18:30:44.910877 18065 containerizer.cpp:1123] Recovering isolators
> I0904 18:30:44.911888 18064 containerizer.cpp:1162] Recovering provisioner
> I0904 18:30:44.913368 18068 provisioner.cpp:498] Provisioner recovery complete
> I0904 18:30:44.913630 18065 containerizer.cpp:1234] Cleaning up orphan 
> container 7f455ed7-6593-41e8-9b29-52ee84d7675b
> I0904 18:30:44.913656 18065 containerizer.cpp:2576] Destroying container 
> 7f455ed7-6593-41e8-9b29-52ee84d7675b in RUNNING state
> I0904 18:30:44.913666 18065 containerizer.cpp:3278] Transitioning the state 
> of container 7f455ed7-6593-41e8-9b29-52ee84d7675b from RUNNING to DESTROYING
> I0904 18:30:44.914687 18064 linux_launcher.cpp:576] Asked to destroy 
> container 7f455ed7-6593-41e8-9b29-52ee84d7675b
> I0904 18:30:44.914788 18064 linux_launcher.cpp:618] Destroying cgroup 
> '/sys/fs/cgroup/freezer/mesos/7f455ed7-6593-41e8-9b29-52ee84d7675b'
> {noformat}
> killing the sleep task in the process:
> {noformat}
> Received status update TASK_FAILED for task 'sleep'
>   message: 'Executor terminated'
>   source: SOURCE_AGENT
>   reason: REASON_EXECUTOR_TERMINATED
> {noformat}
> After some additional testing, it seems that the value of the `--isolation` 
> flag is actually irrelevant: the same behaviour can be observed as long as 
> cgroup support is enabled with `--systemd_enable_support`.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Assigned] (MESOS-8840) `cpu.cfs_quota_us` may be accidentally set for command task using docker during agent recovery.

2019-08-28 Thread Gilbert Song (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-8840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song reassigned MESOS-8840:
---

Assignee: Qian Zhang

> `cpu.cfs_quota_us` may be accidentally set for command task using docker 
> during agent recovery.
> ---
>
> Key: MESOS-8840
> URL: https://issues.apache.org/jira/browse/MESOS-8840
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.2.0, 1.2.1, 1.2.2, 1.2.3
>Reporter: Meng Zhu
>Assignee: Qian Zhang
>Priority: Critical
>  Labels: containerization
>
> Prior to Mesos 1.3, the Docker containerizer did not honor the flag 
> `--cgroups_enable_cfs` for command tasks when creating the container; a patch 
> (MESOS-6134) ported this flag to the Docker command executor only as of 1.3. 
> However, the Docker containerizer honors the flag when updating containers:
> https://github.com/apache/mesos/blob/7559c9352c78912526820f6222ed2b17ad3b19cf/src/slave/containerizer/docker.cpp#L1726
> For non-command tasks, the Docker containerizer always calls `update` on the 
> resources during launch:
> https://github.com/apache/mesos/blob/7559c9352c78912526820f6222ed2b17ad3b19cf/src/slave/containerizer/docker.cpp#L1325-L1330
> For command tasks, this is not the case:
> https://github.com/apache/mesos/blob/7559c9352c78912526820f6222ed2b17ad3b19cf/src/slave/containerizer/docker.cpp#L1271-L1277
> However, when recovering the executor, `update` is called for both command 
> and non-command tasks.
> This means that, for command tasks, the CPU cgroup CFS settings change when a 
> command executor is recovered. Specifically, recovered command executors will 
> have CFS set while all other command executors will not. This may lead to a 
> drastic change in container resource usage depending on the system load.
> To maintain backward compatibility, we probably want to avoid setting the 
> `cpu.cfs_quota_us` field in `update` if the field is not already set.
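
A hedged sketch of that backward-compatible behavior (the helper names follow the 
cgroups utilities in the code base, but the exact call site is an assumption):

{code}
// During `update`, only touch cpu.cfs_quota_us if a quota is already set for
// this cgroup (i.e. CFS was enabled at launch); otherwise leave it alone so
// recovered command executors keep behaving like freshly launched ones.
Try<Duration> quota = cgroups::cpu::cfs_quota_us(hierarchy, cgroup);

if (quota.isSome() && quota.get() > Duration::zero()) {
  cgroups::cpu::cfs_quota_us(hierarchy, cgroup, newQuota);
}
{code}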



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (MESOS-9935) The agent crashes after the disk du isolator supporting rootfs checks.

2019-08-12 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-9935:
---

 Summary: The agent crashes after the disk du isolator supporting 
rootfs checks.
 Key: MESOS-9935
 URL: https://issues.apache.org/jira/browse/MESOS-9935
 Project: Mesos
  Issue Type: Bug
  Components: containerization
Reporter: Gilbert Song


This was broken by the following patch:
https://github.com/apache/mesos/commit/8ba0682521c6051b42f33b3dd96a37f4d46a290d#diff-33089e53bdf9f646cdb9317c212eda02

A task can be launched without a disk resource. However, after this patch, if the 
disk resource does not exist, the agent crashes, because `info->paths` only adds 
an entry for 'path' when there is a quota, and the quota comes from the disk 
resource.

{noformat}
Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal mesos-agent[15492]: 
F0809 14:54:00.017730 15498 process.cpp:3057] Aborting libprocess: 
'posix-disk-isolator(1)@172.12.2.196:5051' threw exception: _Map_base::at
Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal mesos-agent[15492]: 
*** Check failure stack trace: ***
Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal mesos-agent[15492]: 
@ 0x7f65f7d585cd  google::LogMessage::Fail()
Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal mesos-agent[15492]: 
@ 0x7f65f7d5a828  google::LogMessage::SendToLog()
Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal mesos-agent[15492]: 
@ 0x7f65f7d58163  google::LogMessage::Flush()
Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal mesos-agent[15492]: 
@ 0x7f65f7d5b169  google::LogMessageFatal::~LogMessageFatal()
Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal mesos-agent[15492]: 
@ 0x7f65f7cb8dbd  process::ProcessManager::resume()
Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal mesos-agent[15492]: 
@ 0x7f65f7cbe926  
_ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal mesos-agent[15492]: 
@ 0x7f65f3976070  (unknown)
Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal mesos-agent[15492]: 
@ 0x7f65f3194e25  start_thread
Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal mesos-agent[15492]: 
@ 0x7f65f2ebebad  __clone
{noformat}
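
The `_Map_base::at` abort above is `std::unordered_map::at()` failing on a missing 
key. A hedged sketch (not the actual fix) of the defensive lookup the description 
implies; `enforce` is a hypothetical stand-in for whatever the isolator does with 
the entry:

{code}
// Only enforce the quota when prepare() actually recorded an entry; tasks
// launched without a disk resource have no quota and thus no entry.
if (info->paths.contains(path)) {
  enforce(info->paths.at(path));  // hypothetical enforcement step
}
{code}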



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Assigned] (MESOS-9908) Introduce a new agent flag and support docker volume chown to task user.

2019-08-08 Thread Gilbert Song (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song reassigned MESOS-9908:
---

Assignee: Gilbert Song

> Introduce a new agent flag and support docker volume chown to task user.
> 
>
> Key: MESOS-9908
> URL: https://issues.apache.org/jira/browse/MESOS-9908
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Gilbert Song
>Assignee: Gilbert Song
>Priority: Major
>  Labels: containerization
>
> Currently, a Docker volume is always mounted as root, which makes it 
> inaccessible to non-root task users. For security reasons, there are use cases 
> where operators only allow non-root users as the container user, and Docker 
> volumes need to be usable by those non-root users.
> A new agent flag is needed to make this support configurable, because 
> chown-ing a Docker volume may not be appropriate for some use cases, e.g., 
> multiple non-root users on different hosts sharing the same Docker volume 
> simultaneously. Operators are expected to turn on this flag only if their 
> cluster's Docker volumes are not shared by multiple non-root users.
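
A hedged sketch of what the agent could do when such a flag is enabled (the flag 
name `docker_volume_chown` and the call site are assumptions, not the final 
implementation):

{code}
// If the operator opted in, chown the volume mount point to the task user
// before the container starts, so a non-root task user can access it.
if (flags.docker_volume_chown && user.isSome()) {
  Try<Nothing> chown = os::chown(user.get(), volumePath, true);

  if (chown.isError()) {
    return Failure(
        "Failed to chown docker volume '" + volumePath + "': " + chown.error());
  }
}
{code}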



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (MESOS-9874) Add environment variable `MESOS_ALLOCATION_ROLE` to the task/container.

2019-07-30 Thread Gilbert Song (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16895825#comment-16895825
 ] 

Gilbert Song commented on MESOS-9874:
-

commit fe89151e053e5c932f920b839ad00792f99ae42c
Author: Qian Zhang 
Date:   Mon Jul 29 23:33:59 2019 -0700

Added a test `ROOT_DOCKER_AllocationRoleEnvironmentVariable`.

Review: https://reviews.apache.org/r/71004/

commit 11cfb1cad77551f643ad29167766962ac2a71de5
Author: Qian Zhang 
Date:   Mon Jul 29 23:33:57 2019 -0700

Added a test `DefaultExecutorTest.AllocationRoleEnvironmentVariable`.

Review: https://reviews.apache.org/r/71003/

commit 6f8fcfde600869eda12d17015be0c68ac3bba0d2
Author: Qian Zhang 
Date:   Mon Jul 29 23:33:56 2019 -0700

Added a test `CommandExecutorTest.AllocationRoleEnvironmentVariable`.

Review: https://reviews.apache.org/r/71002/

> Add environment variable `MESOS_ALLOCATION_ROLE` to the task/container.
> ---
>
> Key: MESOS-9874
> URL: https://issues.apache.org/jira/browse/MESOS-9874
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Gilbert Song
>Assignee: Qian Zhang
>Priority: Major
>  Labels: containerization
> Fix For: 1.9.0
>
>
> Set this env var to the allocation role from the task's resources. Here is an example:
> https://github.com/apache/mesos/blob/master/src/master/readonly_handler.cpp#L197
> We probably want to set this env var from the executors, by adding it to the 
> CommandInfo environment.
> Both the Mesos and Docker containerizers should be supported.
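
A minimal sketch of the CommandInfo approach mentioned above (the surrounding 
executor hook and the source of `allocationRole` are assumptions):

{code}
// Append the variable to the task's CommandInfo environment; both the Mesos
// and Docker containerizers will then expose it to the task.
Environment::Variable* variable =
  commandInfo.mutable_environment()->add_variables();

variable->set_name("MESOS_ALLOCATION_ROLE");
variable->set_value(allocationRole);
{code}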



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (MESOS-9908) Introduce a new agent flag and support docker volume chown to task user.

2019-07-25 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-9908:
---

 Summary: Introduce a new agent flag and support docker volume 
chown to task user.
 Key: MESOS-9908
 URL: https://issues.apache.org/jira/browse/MESOS-9908
 Project: Mesos
  Issue Type: Improvement
  Components: containerization
Reporter: Gilbert Song


Currently, a Docker volume is always mounted as root, which makes it inaccessible 
to non-root task users. For security reasons, there are use cases where operators 
only allow non-root users as the container user, and Docker volumes need to be 
usable by those non-root users.

A new agent flag is needed to make this support configurable, because chown-ing 
a Docker volume may not be appropriate for some use cases, e.g., multiple non-root 
users on different hosts sharing the same Docker volume simultaneously. Operators 
are expected to turn on this flag only if their cluster's Docker volumes are not 
shared by multiple non-root users.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (MESOS-9788) Configurable IPC namespace and shared memory in `namespaces/ipc` isolator

2019-07-19 Thread Gilbert Song (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16889201#comment-16889201
 ] 

Gilbert Song commented on MESOS-9788:
-

commit 02b6467f9e8035166f400e0112015ac56c9281b6
Author: Qian Zhang 
Date:   Fri Jul 19 15:06:08 2019 -0700

Added a test `ROOT_DebugContainerWithPrivateIPCMode`.

Review: https://reviews.apache.org/r/71122/

commit 5032ea381dd2d532d781ba1d2c9fd3a600e7883a
Author: Qian Zhang 
Date:   Fri Jul 19 15:06:06 2019 -0700

Added a test `ROOT_NonePrivateIPCModeWithShmSize`.

Review: https://reviews.apache.org/r/71121/

commit e58f4b97b5d13ccc18ad9b1632d7e6409bdd0c55
Author: Qian Zhang 
Date:   Fri Jul 19 15:06:03 2019 -0700

Added two validations in `namespaces/ipc` isolator.

1. Do not support specifying the size of /dev/shm when the IPC mode
   is not `PRIVATE`.
2. Do not support private IPC mode for debug containers.

Review: https://reviews.apache.org/r/71120/
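
A hedged sketch of those two validations (the variable and enum names here are 
illustrative assumptions, not the exact protobuf accessors):

{code}
// 1. A custom /dev/shm size only makes sense with a private /dev/shm.
if (shmSize.isSome() && ipcMode != IPC_MODE_PRIVATE) {
  return Failure("Specifying a /dev/shm size requires the PRIVATE IPC mode");
}

// 2. Debug containers must share their parent's IPC namespace and /dev/shm.
if (ipcMode == IPC_MODE_PRIVATE && isDebugContainer) {
  return Failure("Private IPC mode is not supported for debug containers");
}
{code}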

> Configurable IPC namespace and shared memory in `namespaces/ipc` isolator
> -
>
> Key: MESOS-9788
> URL: https://issues.apache.org/jira/browse/MESOS-9788
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
>  Labels: containerization
> Fix For: 1.9.0
>
>
> See [design 
> doc|https://docs.google.com/document/d/10t1jf97vrejUWEVSvxGtqw4vhzfPef41JMzb5jw7l1s/edit?usp=sharing]
>  for the background of this improvement and how we are going to implement it.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (MESOS-9841) Integrate `IsolatorTracker` and `LinuxLauncher` with Mesos containerizer.

2019-07-18 Thread Gilbert Song (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16888125#comment-16888125
 ] 

Gilbert Song commented on MESOS-9841:
-

commit 59c2c75d0982385271c3ba86e3cbbf6c21fa7bae
Author: Andrei Budnik 
Date:   Thu Jul 18 09:10:45 2019 -0700

Wrapped launcher in `LauncherTracker`.

This patch wraps a container launcher in an instance of the `LauncherTracker`
class. If the launcher gets stuck in some operation, `pendingFutures`
will return the method name along with its arguments such as
`containerId`, `pid`, etc.

Review: https://reviews.apache.org/r/70891/

> Integrate `IsolatorTracker` and `LinuxLauncher` with Mesos containerizer.
> -
>
> Key: MESOS-9841
> URL: https://issues.apache.org/jira/browse/MESOS-9841
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Andrei Budnik
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: containerization
> Fix For: 1.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (MESOS-9833) Introduce an agent flag for the default `/dev/shm` size

2019-07-16 Thread Gilbert Song (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16885877#comment-16885877
 ] 

Gilbert Song commented on MESOS-9833:
-

commit 633b1b5d4ef3f08b0abf057a3651f34116e1e45f
Author: Qian Zhang 
Date:   Mon Jul 15 23:46:20 2019 -0700

Renamed agent flag `--default_shm_size`.

Review: https://reviews.apache.org/r/71072/

> Introduce an agent flag for the default `/dev/shm` size
> ---
>
> Key: MESOS-9833
> URL: https://issues.apache.org/jira/browse/MESOS-9833
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
>  Labels: containerization
> Fix For: 1.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (MESOS-9788) Configurable IPC namespace and shared memory in `namespaces/ipc` isolator

2019-07-13 Thread Gilbert Song (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884419#comment-16884419
 ] 

Gilbert Song commented on MESOS-9788:
-

commit e04b445f414e256866ba2e1f441bd7e86acc858e
Author: Qian Zhang 
Date:   Sat Jul 13 10:07:59 2019 -0700

Added the test `ROOT_IPCNamespaceWithIPCIsolatorDisabled`.

Review: https://reviews.apache.org/r/70860/

commit e3b2edb1f3226e9f84e22383153314dc4a212edc
Author: Qian Zhang 
Date:   Sat Jul 13 10:07:58 2019 -0700

Updated the test `NamespacesIsolatorTest.ROOT_IPCNamespace`.

This test is updated to verify the backward compatibility is kept
after we implement the configurable IPC namespaces and /dev/shm.

Review: https://reviews.apache.org/r/70859/

commit c4ce90884e2a93e331a6d1bbbe9ed960c5872d24
Author: Qian Zhang 
Date:   Sat Jul 13 10:07:57 2019 -0700

Added the test `ROOT_DisallowShareAgentIPCNamespace`.

Review: https://reviews.apache.org/r/70857/

commit 82305ea19b4fd1b04bdeb0d6f8bb1792077a6206
Author: Qian Zhang 
Date:   Sat Jul 13 10:07:55 2019 -0700

Added the test `NamespacesIsolatorTest.ROOT_ShareAgentIPCNamespace`.

Review: https://reviews.apache.org/r/70852/

commit f26b2aa93553a1a7e845b194cb9ff49fb72bae22
Author: Qian Zhang 
Date:   Sat Jul 13 10:07:54 2019 -0700

Added the test `NamespacesIsolatorTest.ROOT_PrivateIPCNamespace`.

Review: https://reviews.apache.org/r/70849/

commit 6f402d5f193262e854826cac4bc14f29395ca02f
Author: Qian Zhang 
Date:   Sat Jul 13 10:07:52 2019 -0700

Added the test `NamespacesIsolatorTest.ROOT_ShareIPCNamespace`.

Review: https://reviews.apache.org/r/70845/

> Configurable IPC namespace and shared memory in `namespaces/ipc` isolator
> -
>
> Key: MESOS-9788
> URL: https://issues.apache.org/jira/browse/MESOS-9788
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
>  Labels: containerization
> Fix For: 1.9.0
>
>
> See [design 
> doc|https://docs.google.com/document/d/10t1jf97vrejUWEVSvxGtqw4vhzfPef41JMzb5jw7l1s/edit?usp=sharing]
>  for the background of this improvement and how we are going to implement it.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (MESOS-9883) Check container timeout counting should be started when the check command is executed.

2019-07-08 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-9883:
---

 Summary: Check container timeout counting should be started when 
the check command is executed.
 Key: MESOS-9883
 URL: https://issues.apache.org/jira/browse/MESOS-9883
 Project: Mesos
  Issue Type: Improvement
  Components: containerization
Reporter: Gilbert Song


Right now, the check container timeout counting starts when the checker 
process sends the request to the agent API. This may not be what users expect, 
and sometimes it may take longer for the agent process to launch a check 
container if the agent is under a heavy workload. This could lead to the 
health check failing if the timeout is small and the agent is slow.

Intuitively, users expect the timeout to cover just the timeframe in which the 
command should finish. Starting the count when the check command is actually 
executed would let users define health check timeouts more precisely.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9874) Add environment variable `MESOS_ALLOCATION_ROLE` to the task/container.

2019-07-01 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-9874:
---

 Summary: Add environment variable `MESOS_ALLOCATION_ROLE` to the 
task/container.
 Key: MESOS-9874
 URL: https://issues.apache.org/jira/browse/MESOS-9874
 Project: Mesos
  Issue Type: Task
  Components: containerization
Reporter: Gilbert Song






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9873) Document the container environment variables semantics on mesos.

2019-07-01 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-9873:
---

 Summary: Document the container environment variables semantics on 
mesos.
 Key: MESOS-9873
 URL: https://issues.apache.org/jira/browse/MESOS-9873
 Project: Mesos
  Issue Type: Documentation
  Components: containerization
Reporter: Gilbert Song


Environment variables might be overwritten on Mesos (task env, image env, agent 
env, etc.). We should document how environment variables can be overwritten.

Also, we need to document the Mesos-injected environment variables (like 
`MESOS_SANDBOX`) and how they differ across executors:
- What the Mesos-injected environment variables are for executor containers
- What the Mesos-injected environment variables are for container tasks launched 
via the default executor
- What the Mesos-injected environment variables are for old-style command tasks



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-4781) Executor env variables should not be leaked to the command task.

2019-07-01 Thread Gilbert Song (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-4781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16876379#comment-16876379
 ] 

Gilbert Song commented on MESOS-4781:
-

[~jieyu], sorry I missed your comment. No, not actively; let me unassign for now.

> Executor env variables should not be leaked to the command task.
> 
>
> Key: MESOS-4781
> URL: https://issues.apache.org/jira/browse/MESOS-4781
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Gilbert Song
>Priority: Major
>  Labels: mesosphere
>
> Currently, a command task inherits the environment variables of the command 
> executor. This is not ideal because the command executor's environment includes 
> some Mesos-internal variables like MESOS_XXX and LIBPROCESS_XXX. Also, this 
> behavior does not match what the Docker containerizer does. We should construct 
> the environment variables for the command task from scratch, rather than 
> inheriting them from the command executor.
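
A hedged sketch of that "from scratch" approach (not the actual patch): start 
from an empty environment, copy only what the task declares, and add the 
variables Mesos intentionally injects, such as `MESOS_SANDBOX` (the 
`sandboxDirectory` source is assumed to come from the launch path):

{code}
// Build the command task's environment explicitly instead of inheriting the
// executor's (which carries MESOS_* and LIBPROCESS_* internals).
Environment environment;

if (task.command().has_environment()) {
  environment.CopyFrom(task.command().environment());
}

Environment::Variable* sandbox = environment.add_variables();
sandbox->set_name("MESOS_SANDBOX");
sandbox->set_value(sandboxDirectory);
{code}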



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-4781) Executor env variables should not be leaked to the command task.

2019-07-01 Thread Gilbert Song (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-4781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song reassigned MESOS-4781:
---

Assignee: (was: Gilbert Song)

> Executor env variables should not be leaked to the command task.
> 
>
> Key: MESOS-4781
> URL: https://issues.apache.org/jira/browse/MESOS-4781
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Gilbert Song
>Priority: Major
>  Labels: mesosphere
>
> Currently, a command task inherits the environment variables of the command 
> executor. This is not ideal because the command executor's environment includes 
> some Mesos-internal variables like MESOS_XXX and LIBPROCESS_XXX. Also, this 
> behavior does not match what the Docker containerizer does. We should construct 
> the environment variables for the command task from scratch, rather than 
> inheriting them from the command executor.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9868) NetworkInfo from the agent /state endpoint is not correct.

2019-06-26 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-9868:
---

 Summary: NetworkInfo from the agent /state endpoint is not correct.
 Key: MESOS-9868
 URL: https://issues.apache.org/jira/browse/MESOS-9868
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.8.0
Reporter: Gilbert Song


NetworkInfo from the agent /state endpoint is not correct, and it is also 
different from the NetworkInfo of the /containers endpoint. Some frameworks rely 
on the /state endpoint to get the IP address needed for other containers to run.

agent's state endpoint
{noformat}
{
  "state": "TASK_RUNNING",
  "timestamp": 1561574343.1521769,
  "container_status": {
    "container_id": {
      "value": "9a2633be-d2e5-4636-9ad4-7b2fc669da99",
      "parent": {
        "value": "45ebab16-9b4b-416e-a7f2-4833fd4ed8ff"
      }
    },
    "network_infos": [
      {
        "ip_addresses": [
          {
            "protocol": "IPv4",
            "ip_address": "172.31.10.35"
          }
        ]
      }
    ]
  },
  "healthy": true
}
{noformat}

agent's /containers endpoint
{noformat}
"status": {
"container_id": {
"value": "5ffc9df2-3be6-4879-8b2d-2fde3f0477e0"
},
"executor_pid": 16063,
"network_infos": [
{
"ip_addresses": [
{
"ip_address": "9.0.35.71",
"protocol": "IPv4"
}
],
"name": "dcos"
}
]
}
{noformat}

The IP addresses are different.

The container is in the RUNNING state and is running correctly; only the /state 
endpoint is incorrect. One thing to notice is that the /state endpoint used to 
show the correct IP. After an agent restart and a master leader re-election, the 
IP address in the /state endpoint changed.

Here is the checkpointed CNI network information:
{noformat}
OK-23:37:48-root@int-mountvolumeagent2-soak113s:/var/lib/mesos/slave/meta/slaves/60c42ab7-eb1a-4cec-b03d-ea06bff00c3f-S4/frameworks/26ffb84c-81ba-4b3b-989b-9c6560e51fa1-0171/executors/k8s-clusters.kc02__etcd__b50dc403-30d1-4b54-a367-332fb3621030/runs/latest/tasks/k8s-clusters.kc02__etcd-2-peer__5b6aa5fc-e113-4021-9db8-b63e0c8d1f6c
 # cat 
/var/run/mesos/isolators/network/cni/45ebab16-9b4b-416e-a7f2-4833fd4ed8ff/dcos/network.conf
 
{"args":{"org.apache.mesos":{"network_info":{"name":"dcos"}}},"chain":"M-DCOS","delegate":{"bridge":"m-dcos","hairpinMode":true,"ipMasq":false,"ipam":{"dataDir":"/var/run/dcos/cni/networks","routes":[{"dst":"0.0.0.0/0"}],"subnet":"9.0.73.0/25","type":"host-local"},"isGateway":true,"mtu":1420,"type":"bridge"},"excludeDevices":["m-dcos"],"name":"dcos","type":"mesos-cni-port-mapper"}
{noformat}

{noformat}
OK-01:30:05-root@int-mountvolumeagent2-soak113s:/var/lib/mesos/slave/meta/slaves/60c42ab7-eb1a-4cec-b03d-ea06bff00c3f-S4/frameworks/26ffb84c-81ba-4b3b-989b-9c6560e51fa1-0171/executors/k8s-clusters.kc02__etcd__b50dc403-30d1-4b54-a367-332fb3621030/runs/latest/tasks/k8s-clusters.kc02__etcd-2-peer__5b6aa5fc-e113-4021-9db8-b63e0c8d1f6c
 # cat 
/var/run/mesos/isolators/network/cni/45eb16-9b4b-416e-a7f2-4833fd4ed8ff/dcos/eth0/network.info
{"dns":{},"ip4":{"gateway":"9.0.73.1","ip":"9.0.73.65/25","routes":[{"dst":"0.0.0.0/0","gw":"9.0.73.1"}]}}
{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9769) Add direct containerized support for filesystem operations.

2019-06-11 Thread Gilbert Song (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861582#comment-16861582
 ] 

Gilbert Song commented on MESOS-9769:
-

commit 1961e41a61def2b7baca7563c0b7e1855880b55c
Author: Qian Zhang 
Date:   Tue Jun 11 15:50:47 2019 -0700

Improved container-specific cgroups test by checking `cpu.shares`.

This is to ensure the symbolic links (see below as an example) we
create for the container exist.
  ln -s /sys/fs/cgroup/cpu,cpuacct /sys/fs/cgroup/cpu

Review: https://reviews.apache.org/r/70827/

commit f24c54e85e08bc9c8b118cce29ad487661a0ffc6
Author: Qian Zhang 
Date:   Tue Jun 11 15:50:43 2019 -0700

Supported file operations for command tasks.

Review: https://reviews.apache.org/r/70826/

> Add direct containerized support for filesystem operations.
> ---
>
> Key: MESOS-9769
> URL: https://issues.apache.org/jira/browse/MESOS-9769
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: James Peach
>Assignee: James Peach
>Priority: Major
> Fix For: 1.9.0
>
>
> When setting up the container filesystems, we use `pre_exec_commands` to make 
> ABI symlinks and other things. The problem with this is that, depending on 
> the order of operations, we may not have the full security policy in place 
> yet, but since we are running in the context of the container's mount 
> namespaces, the programs we execute are under the control of whoever built 
> the container image.
> [~jieyu] and I previously discussed adding filesystem operations to the 
> `ContainerLaunchInfo`. Just `ln` would be sufficient for the `cgroups` and 
> `linux/filesystem` isolators. Secrets and port mapping isolators need more, 
> so we should discuss and file new tickets if necessary.
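
For reference, a hedged sketch of the current `pre_exec_commands` pattern that 
this ticket wants to replace with direct filesystem operations (the isolator 
context is assumed); the `ln` here mirrors the symlink mentioned in the commit 
above:

{code}
// Today an isolator appends an `ln` command that is later executed inside the
// container's mount namespaces, i.e. a binary from the container image runs.
ContainerLaunchInfo launchInfo;

CommandInfo* command = launchInfo.add_pre_exec_commands();
command->set_shell(false);
command->set_value("ln");
command->add_arguments("ln");
command->add_arguments("-s");
command->add_arguments("/sys/fs/cgroup/cpu,cpuacct");
command->add_arguments("/sys/fs/cgroup/cpu");
{code}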



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9830) Implement the container debug endpoint on slave/http.cpp

2019-06-05 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-9830:
---

 Summary: Implement the container debug endpoint on slave/http.cpp
 Key: MESOS-9830
 URL: https://issues.apache.org/jira/browse/MESOS-9830
 Project: Mesos
  Issue Type: Task
  Components: containerization
Reporter: Gilbert Song
Assignee: Andrei Budnik






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9829) Implement the container debug endpoint on slave/http.cpp

2019-06-05 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-9829:
---

 Summary: Implement the container debug endpoint on slave/http.cpp
 Key: MESOS-9829
 URL: https://issues.apache.org/jira/browse/MESOS-9829
 Project: Mesos
  Issue Type: Task
  Components: containerization
Reporter: Gilbert Song
Assignee: Andrei Budnik






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9828) Document the IPC namespace and shm on UCR.

2019-06-05 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-9828:
---

 Summary: Document the IPC namespace and shm on UCR.
 Key: MESOS-9828
 URL: https://issues.apache.org/jira/browse/MESOS-9828
 Project: Mesos
  Issue Type: Task
  Components: containerization
Reporter: Gilbert Song
Assignee: Qian Zhang






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9827) Introduce the configurable shm protobuf API.

2019-06-05 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-9827:
---

 Summary: Introduce the configurable shm protobuf API.
 Key: MESOS-9827
 URL: https://issues.apache.org/jira/browse/MESOS-9827
 Project: Mesos
  Issue Type: Task
  Components: containerization
Reporter: Gilbert Song
Assignee: Qian Zhang






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9826) Decouple the /dev/shm from container rootfs.

2019-06-05 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-9826:
---

 Summary: Decouple the /dev/shm from container rootfs.
 Key: MESOS-9826
 URL: https://issues.apache.org/jira/browse/MESOS-9826
 Project: Mesos
  Issue Type: Task
  Components: containerization
Reporter: Gilbert Song
Assignee: Qian Zhang






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9825) Introduce an agent flag to disallow sharing the IPC namespace from the host.

2019-06-05 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-9825:
---

 Summary: Introduce an agent flag to disallow sharing the IPC 
namespace from the host.
 Key: MESOS-9825
 URL: https://issues.apache.org/jira/browse/MESOS-9825
 Project: Mesos
  Issue Type: Task
  Components: containerization
Reporter: Gilbert Song
Assignee: Qian Zhang






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9824) Support rdma cgroup subsystem.

2019-06-05 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-9824:
---

 Summary: Support rdma cgroup subsystem.
 Key: MESOS-9824
 URL: https://issues.apache.org/jira/browse/MESOS-9824
 Project: Mesos
  Issue Type: Task
  Components: containerization
Reporter: Gilbert Song


Support rdma cgroup subsystem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9757) Design doc for container debug endpoint.

2019-05-22 Thread Gilbert Song (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song reassigned MESOS-9757:
---

Assignee: Andrei Budnik

> Design doc for container debug endpoint.
> 
>
> Key: MESOS-9757
> URL: https://issues.apache.org/jira/browse/MESOS-9757
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Gilbert Song
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: containerization
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9795) Support /dev/shm and configurable IPC namespace.

2019-05-22 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-9795:
---

 Summary: Support /dev/shm and configurable IPC namespace.
 Key: MESOS-9795
 URL: https://issues.apache.org/jira/browse/MESOS-9795
 Project: Mesos
  Issue Type: Epic
  Components: containerization
Reporter: Gilbert Song
Assignee: Qian Zhang






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9775) Design doc for UCR shared memory.

2019-05-22 Thread Gilbert Song (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16846119#comment-16846119
 ] 

Gilbert Song commented on MESOS-9775:
-

https://docs.google.com/document/d/10t1jf97vrejUWEVSvxGtqw4vhzfPef41JMzb5jw7l1s/edit

> Design doc for UCR shared memory.
> -
>
> Key: MESOS-9775
> URL: https://issues.apache.org/jira/browse/MESOS-9775
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Gilbert Song
>Assignee: Qian Zhang
>Priority: Major
>  Labels: containerization
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9775) Design doc for UCR shared memory.

2019-05-22 Thread Gilbert Song (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song reassigned MESOS-9775:
---

Assignee: Qian Zhang

> Design doc for UCR shared memory.
> -
>
> Key: MESOS-9775
> URL: https://issues.apache.org/jira/browse/MESOS-9775
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Gilbert Song
>Assignee: Qian Zhang
>Priority: Major
>  Labels: containerization
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9794) Design doc for container debug endpoint.

2019-05-22 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-9794:
---

 Summary: Design doc for container debug endpoint.
 Key: MESOS-9794
 URL: https://issues.apache.org/jira/browse/MESOS-9794
 Project: Mesos
  Issue Type: Task
  Components: containerization
Reporter: Gilbert Song


Design doc for container debug endpoint.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9775) Design doc for UCR shared memory.

2019-05-08 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-9775:
---

 Summary: Design doc for UCR shared memory.
 Key: MESOS-9775
 URL: https://issues.apache.org/jira/browse/MESOS-9775
 Project: Mesos
  Issue Type: Task
  Components: containerization
Reporter: Gilbert Song






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9757) Design doc for container debug endpoint.

2019-05-01 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-9757:
---

 Summary: Design doc for container debug endpoint.
 Key: MESOS-9757
 URL: https://issues.apache.org/jira/browse/MESOS-9757
 Project: Mesos
  Issue Type: Task
  Components: containerization
Reporter: Gilbert Song






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9756) Introduce a container debug endpoint.

2019-05-01 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-9756:
---

 Summary: Introduce a container debug endpoint.
 Key: MESOS-9756
 URL: https://issues.apache.org/jira/browse/MESOS-9756
 Project: Mesos
  Issue Type: Epic
  Components: containerization
Reporter: Gilbert Song






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9672) Docker containerizer should ignore pids of executors that do not pass the connection check.

2019-04-24 Thread Gilbert Song (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song reassigned MESOS-9672:
---

Assignee: Qian Zhang

> Docker containerizer should ignore pids of executors that do not pass the 
> connection check.
> ---
>
> Key: MESOS-9672
> URL: https://issues.apache.org/jira/browse/MESOS-9672
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Meng Zhu
>Assignee: Qian Zhang
>Priority: Critical
>  Labels: containerization
>
> When recovering an executor with a tracked pid, we first try to establish a 
> connection to its libprocess address to avoid reaping an irrelevant process:
> https://github.com/apache/mesos/blob/4580834471fb3bc0b95e2b96e04a63d34faef724/src/slave/containerizer/docker.cpp#L1019-L1054
> If the connection cannot be established, we should not track its pid: 
> https://github.com/apache/mesos/blob/4580834471fb3bc0b95e2b96e04a63d34faef724/src/slave/containerizer/docker.cpp#L1071
> One trouble this might cause is that if the pid is being reused by another 
> executor, this could lead to a duplicate-pid error and send the agent into a 
> crash loop:
> https://github.com/apache/mesos/blob/4580834471fb3bc0b95e2b96e04a63d34faef724/src/slave/containerizer/docker.cpp#L1066-L1068



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9695) Remove the duplicate pid check in Docker containerizer

2019-04-24 Thread Gilbert Song (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song reassigned MESOS-9695:
---

Shepherd: Gilbert Song
Assignee: Qian Zhang
  Sprint: Containerization: RI13 Sp 45

> Remove the duplicate pid check in Docker containerizer
> --
>
> Key: MESOS-9695
> URL: https://issues.apache.org/jira/browse/MESOS-9695
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
>  Labels: containerization
>
> In `DockerContainerizerProcess::_recover`, we check whether two executors use 
> the same pid, and error out if we find a duplicate pid (see 
> [here|https://github.com/apache/mesos/blob/1.7.2/src/slave/containerizer/docker.cpp#L1068:L1078]
>  for details). However, I do not see the value this check gives us, while it 
> can cause a serious issue (an agent crash loop when restarting) in a rare case 
> (a new executor reusing the pid of an old executor), so I think we had better 
> remove it from the Docker containerizer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8769) Agent crashes when CNI config not defined

2019-04-24 Thread Gilbert Song (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-8769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song reassigned MESOS-8769:
---

Assignee: Gilbert Song

> Agent crashes when CNI config not defined
> -
>
> Key: MESOS-8769
> URL: https://issues.apache.org/jira/browse/MESOS-8769
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.4.1
>Reporter: Alena Varkockova
>Assignee: Gilbert Song
>Priority: Critical
>  Labels: cni, containerizer
>
> I was deploying an application through Marathon in an integration test that 
> looked like this:
>  * Mesos container (UCR)
>  * container network
>  * some network name specified
> The given network name did not exist; I had not even passed a CNI config to the 
> agent.
> After Mesos tried to deploy my task, the agent crashed because of the missing 
> CNI config.
> {code}
> WARN [10:51:53 AppDeployIntegrationTest-MesosAgent-32780] *** 
> SIGABRT (@0x1980) received by PID 6528 (TID 0x7f3124b58700) from PID 6528; 
> stack trace: ***
> WARN [10:51:53 AppDeployIntegrationTest-MesosAgent-32780] @ 
> 0x7f312e5c2890 (unknown)
> WARN [10:51:53 AppDeployIntegrationTest-MesosAgent-32780] @ 
> 0x7f312e23d067 (unknown)
> WARN [10:51:53 AppDeployIntegrationTest-MesosAgent-32780] @ 
> 0x7f312e23e448 (unknown)
> WARN [10:51:53 AppDeployIntegrationTest-MesosAgent-32780] @ 
> 0x7f312e236266 (unknown)
> WARN [10:51:53 AppDeployIntegrationTest-MesosAgent-32780] @ 
> 0x7f312e236312 (unknown)
> WARN [10:51:53 AppDeployIntegrationTest-MesosAgent-32780] @ 
> 0x7f31304fd233 _ZNKR6OptionISsE3getEv.part.103
> WARN [10:51:53 AppDeployIntegrationTest-MesosAgent-32780] @ 
> 0x7f313050b60c 
> mesos::internal::slave::NetworkCniIsolatorProcess::getNetworkConfigJSON()
> WARN [10:51:53 AppDeployIntegrationTest-MesosAgent-32780] @ 
> 0x7f313050bd54 mesos::internal::slave::NetworkCniIsolatorProcess::prepare()
> WARN [10:51:53 AppDeployIntegrationTest-MesosAgent-32780] @ 
> 0x7f313027b903 
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEESt5_BindIFZNS0_8dispatchI6OptionIN5mesos5slave19ContainerLaunchInfoEENS7_8internal5slave20MesosIsolatorProcessERKNS7_11ContainerIDERKNS8_15ContainerConfigESG_SJ_EENS0_6FutureIT_EERKNS0_3PIDIT0_EEMSO_FSM_T1_T2_EOT3_OT4_EUlRSE_RSH_S2_E_SE_SH_St12_PlaceholderILi1E9_M_invokeERKSt9_Any_dataS2_
> WARN [10:51:53 AppDeployIntegrationTest-MesosAgent-32780] @ 
> 0x7f3130a7ee29 process::ProcessManager::resume()
> {code}
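
The `_ZNKR6OptionISsE3getEv` frame is `Option<std::string>::get()` called on a 
NONE value. A hedged sketch of the defensive check the trace implies (the helper 
name is hypothetical):

{code}
// Validate that the requested CNI network is actually configured before
// dereferencing its config path, instead of calling Option::get() blindly.
Option<string> networkConfig = getNetworkConfigPath(networkName);  // hypothetical

if (networkConfig.isNone()) {
  return Failure(
      "Unknown CNI network '" + networkName + "': no config was loaded");
}
{code}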



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9355) Persistence volume does not unmount correctly with wrong artifact URI

2019-04-24 Thread Gilbert Song (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song reassigned MESOS-9355:
---

Assignee: Joseph Wu

> Persistence volume does not unmount correctly with wrong artifact URI
> -
>
> Key: MESOS-9355
> URL: https://issues.apache.org/jira/browse/MESOS-9355
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization
>Affects Versions: 1.5.1, 1.5.2
> Environment: DCOS 1.11.6
> Mesos 1.5.2
>Reporter: Ken Liu
>Assignee: Joseph Wu
>Priority: Critical
>  Labels: persistent-volumes
>
> The DC/OS service JSON file is like the following. If you type a wrong URI, for 
> example "file://root/test/http.tar.bz2" when the correct one is 
> "file:///root/test/http.tar.bz2", then all of the persistent volume mounts are 
> left on the agent, and even after the gc_delay timeout the mount paths are still 
> there.
> This means that if the task failed 10 times, there are 10 persistent volume 
> mounts on the agent.
> *Expected result: When a task fails, dangling mount points should be 
> unmounted correctly.*
> {code:java}
> {
>   "id": "/http-server",
>   "backoffFactor": 1.15,
>   "backoffSeconds": 1,
>   "cmd": "python http.py",
>   "constraints": [],
>   "container": {
>     "type": "MESOS",
>     "volumes": [
>       {
>         "persistent": {
>           "type": "root",
>           "size": 2048,
>           "constraints": []
>         },
>         "mode": "RW",
>         "containerPath": "ken-http"
>       }
>     ]
>   },
>   "cpus": 0.1,
>   "disk": 0,
>   "fetch": [
>     {
>       "uri": "file://root/test/http.tar.bz2",
>       "extract": true,
>       "executable": false,
>       "cache": false
>     }
>   ],
>   "instances": 0,
>   "maxLaunchDelaySeconds": 3600,
>   "mem": 128,
>   "gpus": 0,
>   "networks": [
>     {
>       "mode": "host"
>     }
>   ],
>   "portDefinitions": [],
>   "residency": {
>     "relaunchEscalationTimeoutSeconds": 3600,
>     "taskLostBehavior": "WAIT_FOREVER"
>   },
>   "requirePorts": false,
>   "upgradeStrategy": {
>     "maximumOverCapacity": 0,
>     "minimumHealthCapacity": 0
>   },
>   "killSelection": "YOUNGEST_FIRST",
>   "unreachableStrategy": "disabled",
>   "healthChecks": []
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9306) Mesos containerizer can get stuck during cgroup cleanup

2019-04-24 Thread Gilbert Song (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song reassigned MESOS-9306:
---

Assignee: Andrei Budnik

> Mesos containerizer can get stuck during cgroup cleanup
> ---
>
> Key: MESOS-9306
> URL: https://issues.apache.org/jira/browse/MESOS-9306
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization
>Affects Versions: 1.7.0
>Reporter: Greg Mann
>Assignee: Andrei Budnik
>Priority: Critical
>  Labels: containerizer, mesosphere
>
> I observed a task group's executor container which failed to be completely 
> destroyed after its associated tasks were killed. The following is an excerpt 
> from the agent log which is filtered to include only lines with the container 
> ID, {{d463b9fe-970d-4077-bab9-558464889a9e}}:
> {code}
> 2018-10-10 14:20:50: I1010 14:20:50.204756  6799 containerizer.cpp:2963] 
> Container d463b9fe-970d-4077-bab9-558464889a9e has exited
> 2018-10-10 14:20:50: I1010 14:20:50.204839  6799 containerizer.cpp:2457] 
> Destroying container d463b9fe-970d-4077-bab9-558464889a9e in RUNNING state
> 2018-10-10 14:20:50: I1010 14:20:50.204859  6799 containerizer.cpp:3124] 
> Transitioning the state of container d463b9fe-970d-4077-bab9-558464889a9e 
> from RUNNING to DESTROYING
> 2018-10-10 14:20:50: I1010 14:20:50.204960  6799 linux_launcher.cpp:580] 
> Asked to destroy container d463b9fe-970d-4077-bab9-558464889a9e
> 2018-10-10 14:20:50: I1010 14:20:50.204993  6799 linux_launcher.cpp:622] 
> Destroying cgroup 
> '/sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e'
> 2018-10-10 14:20:50: I1010 14:20:50.205417  6806 cgroups.cpp:2838] Freezing 
> cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos
> 2018-10-10 14:20:50: I1010 14:20:50.205477  6810 cgroups.cpp:2838] Freezing 
> cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e
> 2018-10-10 14:20:50: I1010 14:20:50.205708  6808 cgroups.cpp:1229] 
> Successfully froze cgroup 
> /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos after 
> 203008ns
> 2018-10-10 14:20:50: I1010 14:20:50.205878  6800 cgroups.cpp:1229] 
> Successfully froze cgroup 
> /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e after 
> 339200ns
> 2018-10-10 14:20:50: I1010 14:20:50.206185  6799 cgroups.cpp:2856] Thawing 
> cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos
> 2018-10-10 14:20:50: I1010 14:20:50.206226  6808 cgroups.cpp:2856] Thawing 
> cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e
> 2018-10-10 14:20:50: I1010 14:20:50.206455  6808 cgroups.cpp:1258] 
> Successfully thawed cgroup 
> /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e after 
> 83968ns
> 2018-10-10 14:20:50: I1010 14:20:50.306803  6810 cgroups.cpp:1258] 
> Successfully thawed cgroup 
> /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos after 
> 100.50816ms
> 2018-10-10 14:20:50: I1010 14:20:50.307531  6805 linux_launcher.cpp:654] 
> Destroying cgroup 
> '/sys/fs/cgroup/systemd/mesos/d463b9fe-970d-4077-bab9-558464889a9e'
> 2018-10-10 14:21:40: W1010 14:21:40.032855  6809 containerizer.cpp:2401] 
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: 
> Container does not exist
> 2018-10-10 14:22:40: W1010 14:22:40.031224  6800 containerizer.cpp:2401] 
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: 
> Container does not exist
> 2018-10-10 14:23:40: W1010 14:23:40.031946  6799 containerizer.cpp:2401] 
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: 
> Container does not exist
> 2018-10-10 14:24:40: W1010 14:24:40.032979  6804 containerizer.cpp:2401] 
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: 
> Container does not exist
> 2018-10-10 14:25:40: W1010 14:25:40.030784  6808 containerizer.cpp:2401] 
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: 
> Container does not exist
> 2018-10-10 14:26:40: W1010 14:26:40.032526  6810 containerizer.cpp:2401] 
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: 
> Container does not exist
> 2018-10-10 14:27:40: W1010 14:27:40.029932  6801 containerizer.cpp:2401] 
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: 
> Container does not exist
> {code}
> The last log line from the containerizer's destroy path is:
> {code}
> 14:20:50.307531  6805 linux_launcher.cpp:654] Destroying cgroup 
> '/sys/fs/cgroup/systemd/mesos/d463b9fe-970d-4077-bab9-558464889a9e'
> {code}
> (that is the second such log line, from {{LinuxLauncherProcess::_destroy}})
> Then we just see
> {code}
> containerizer.cpp:2401] 

[jira] [Comment Edited] (MESOS-8522) `prepareMounts` in Mesos containerizer is flaky.

2019-04-24 Thread Gilbert Song (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825419#comment-16825419
 ] 

Gilbert Song edited comment on MESOS-8522 at 4/24/19 6:37 PM:
--

probably we could just simply check os::exists(mount.target) for this case, 
assuming the mount point is cleaned up when the target is unmounted?


was (Author: gilbert):
probably we could just simply check os::exists(mount.target) for this case?
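
A hedged sketch of that suggestion (the surrounding mount loop in `prepareMounts` 
is assumed):

{code}
// If the mount target no longer exists (e.g. the Docker container's shm
// directory was already torn down), skip re-marking it as a slave mount
// instead of failing the whole launch with EINVAL.
if (!os::exists(mount.target)) {
  continue;
}
{code}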

> `prepareMounts` in Mesos containerizer is flaky.
> 
>
> Key: MESOS-8522
> URL: https://issues.apache.org/jira/browse/MESOS-8522
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.5.0
>Reporter: Chun-Hung Hsiao
>Assignee: Jie Yu
>Priority: Major
>  Labels: mesosphere, storage
>
> The 
> [{{prepareMount()}}|https://github.com/apache/mesos/blob/1.5.x/src/slave/containerizer/mesos/launch.cpp#L244]
>  function in {{src/slave/containerizer/mesos/launch.cpp}} sometimes fails 
> with the following error:
> {noformat}
> Failed to prepare mounts: Failed to mark 
> '/home/docker/containers/af78db6ebc1aff572e576b773d1378121a66bb755ed63b3278e759907e5fe7b6/shm'
>  as slave: Invalid argument
> {noformat}
> The error message comes from 
> https://github.com/apache/mesos/blob/1.5.x/src/slave/containerizer/mesos/launch.cpp#L#L326.
> Although it does not happen frequently, it can be reproduced by repeatedly 
> running tests that need to clone mount namespaces. For example, I just 
> reproduced the bug with the following command after 17 minutes:
> {noformat}
> sudo bin/mesos-tests.sh --gtest_filter='*ROOT_PublishResourcesRecovery' 
> --gtest_break_on_failure --gtest_repeat=-1 --verbose
> {noformat}
> Note that in this example, the test itself does not involve any docker image 
> or the docker containerizer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8522) `prepareMounts` in Mesos containerizer is flaky.

2019-04-24 Thread Gilbert Song (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825419#comment-16825419
 ] 

Gilbert Song commented on MESOS-8522:
-

Probably we could just check os::exists(mount.target) for this case?

> `prepareMounts` in Mesos containerizer is flaky.
> 
>
> Key: MESOS-8522
> URL: https://issues.apache.org/jira/browse/MESOS-8522
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.5.0
>Reporter: Chun-Hung Hsiao
>Assignee: Jie Yu
>Priority: Major
>  Labels: mesosphere, storage
>
> The 
> [{{prepareMount()}}|https://github.com/apache/mesos/blob/1.5.x/src/slave/containerizer/mesos/launch.cpp#L244]
>  function in {{src/slave/containerizer/mesos/launch.cpp}} sometimes fails 
> with the following error:
> {noformat}
> Failed to prepare mounts: Failed to mark 
> '/home/docker/containers/af78db6ebc1aff572e576b773d1378121a66bb755ed63b3278e759907e5fe7b6/shm'
>  as slave: Invalid argument
> {noformat}
> The error message comes from 
> https://github.com/apache/mesos/blob/1.5.x/src/slave/containerizer/mesos/launch.cpp#L#L326.
> Although it does not happen frequently, it can be reproduced by repeatedly 
> running tests that need to clone mount namespaces. For example, I just 
> reproduced the bug with the following command after 17 minutes:
> {noformat}
> sudo bin/mesos-tests.sh --gtest_filter='*ROOT_PublishResourcesRecovery' 
> --gtest_break_on_failure --gtest_repeat=-1 --verbose
> {noformat}
> Note that in this example, the test itself does not involve any docker image 
> or the docker containerizer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8522) `prepareMounts` in Mesos containerizer is flaky.

2019-04-24 Thread Gilbert Song (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825418#comment-16825418
 ] 

Gilbert Song commented on MESOS-8522:
-

[~chhsia0][~bbannier] What is the priority of this issue? Does it only happen 
when there is a race with flapping docker containers?

> `prepareMounts` in Mesos containerizer is flaky.
> 
>
> Key: MESOS-8522
> URL: https://issues.apache.org/jira/browse/MESOS-8522
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.5.0
>Reporter: Chun-Hung Hsiao
>Assignee: Jie Yu
>Priority: Major
>  Labels: mesosphere, storage
>
> The 
> [{{prepareMount()}}|https://github.com/apache/mesos/blob/1.5.x/src/slave/containerizer/mesos/launch.cpp#L244]
>  function in {{src/slave/containerizer/mesos/launch.cpp}} sometimes fails 
> with the following error:
> {noformat}
> Failed to prepare mounts: Failed to mark 
> '/home/docker/containers/af78db6ebc1aff572e576b773d1378121a66bb755ed63b3278e759907e5fe7b6/shm'
>  as slave: Invalid argument
> {noformat}
> The error message comes from 
> https://github.com/apache/mesos/blob/1.5.x/src/slave/containerizer/mesos/launch.cpp#L#L326.
> Although it does not happen frequently, it can be reproduced by repeatedly 
> running tests that need to clone mount namespaces. For example, I just 
> reproduced the bug with the following command after 17 minutes:
> {noformat}
> sudo bin/mesos-tests.sh --gtest_filter='*ROOT_PublishResourcesRecovery' 
> --gtest_break_on_failure --gtest_repeat=-1 --verbose
> {noformat}
> Note that in this example, the test itself does not involve any docker image 
> or the docker containerizer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9737) Avoid allocating memory during fork-exec in subprocess.hpp.

2019-04-23 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-9737:
---

 Summary: Avoid allocating memory during fork-exec in 
subprocess.hpp.
 Key: MESOS-9737
 URL: https://issues.apache.org/jira/browse/MESOS-9737
 Project: Mesos
  Issue Type: Improvement
  Components: containerization
Reporter: Gilbert Song


https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/posix/subprocess.hpp#L137

os::strerror calls during fork-exec should be avoided; otherwise potential 
issues are not debuggable.
Consider using fmtlib for the error code conversion.
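As an illustration of the ticket, a minimal sketch of reporting an errno value 
from the forked child without allocating memory (this is not the libprocess 
code; the pipe fd is an assumption, and fmt::format_int would be another 
allocation-free option):

{code}
#include <unistd.h>

// Async-signal-safe: converts 'err' to decimal digits on the stack and
// writes them to 'fd', avoiding strerror()/std::string, which may allocate.
inline void reportErrno(int fd, int err)
{
  char buf[16];
  size_t i = sizeof(buf);

  buf[--i] = '\n';

  if (err == 0) {
    buf[--i] = '0';
  } else {
    while (err > 0 && i > 0) {
      buf[--i] = static_cast<char>('0' + err % 10);
      err /= 10;
    }
  }

  // write(2) is async-signal-safe; short writes are ignored in this sketch.
  (void) ::write(fd, buf + i, sizeof(buf) - i);
}
{code}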



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9159) Support Foreign URLs in docker registry puller on windows.

2019-04-16 Thread Gilbert Song (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819583#comment-16819583
 ] 

Gilbert Song commented on MESOS-9159:
-

Closing this since Windows support for foreign URLs is already done.

> Support Foreign URLs in docker registry puller on windows.
> --
>
> Key: MESOS-9159
> URL: https://issues.apache.org/jira/browse/MESOS-9159
> Project: Mesos
>  Issue Type: Task
>Reporter: Akash Gupta
>Assignee: Liangyu Zhao
>Priority: Major
> Fix For: 1.4.4, 1.5.4, 1.6.3, 1.7.3, 1.8.0
>
>
> Currently, trying to pull the layers of a Windows image with the current 
> registry pull code will return a 404 error. This is because the Windows 
> docker images need to pull the base OS layers from the foreign URLs field in 
> the version 2 schema 2 docker manifest. As a result, the registry puller 
> needs to be aware of version 2 schema 2 and the foreign URLs field.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9704) Support docker manifest v2s2 config GC.

2019-04-16 Thread Gilbert Song (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song reassigned MESOS-9704:
---

Shepherd: Qian Zhang
Assignee: Gilbert Song
  Sprint: Containerization: RI-13 Sp 44
Story Points: 3

> Support docker manifest v2s2 config GC.
> ---
>
> Key: MESOS-9704
> URL: https://issues.apache.org/jira/browse/MESOS-9704
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Gilbert Song
>Assignee: Gilbert Song
>Priority: Major
>  Labels: containerization
>
> After adding docker manifest v2s2 support, layer GC still works properly.
> However, the manifest config is not garbage collected. We need to add the config 
> dir to the checkpointed LAYERS_FILE to support config GC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9720) Support specifying file name in URI fetcher fetch() interface.

2019-04-10 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-9720:
---

 Summary: Support specifying file name in URI fetcher fetch() 
interface.
 Key: MESOS-9720
 URL: https://issues.apache.org/jira/browse/MESOS-9720
 Project: Mesos
  Issue Type: Improvement
  Components: containerization
Reporter: Gilbert Song






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-6934) Support pulling Docker images with V2 Schema 2 image manifest

2019-04-05 Thread Gilbert Song (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-6934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811150#comment-16811150
 ] 

Gilbert Song commented on MESOS-6934:
-

commit db917f639e2d05fcf493e87649f42ddd2abfeae0
Author: Andrei Budnik abud...@mesosphere.com
Date:   Fri Apr 5 13:06:49 2019 +0200


Fixed use-after-free bug in Docker provisioner store.

Deferred lambda callback of the `moveLayers()` to the `StoreProcess`
to prevent use-after-free of the process object since the callback
refers to the `StoreProcess` class variable `flags`.

Review: https://reviews.apache.org/r/70405

> Support pulling Docker images with V2 Schema 2 image manifest
> -
>
> Key: MESOS-6934
> URL: https://issues.apache.org/jira/browse/MESOS-6934
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
> Environment: https://reviews.apache.org/r/70288/
> https://reviews.apache.org/r/70289/
> https://reviews.apache.org/r/70290/
> https://reviews.apache.org/r/70291/
>Reporter: Ilya Pronin
>Assignee: Gilbert Song
>Priority: Major
>  Labels: containerization
> Fix For: 1.8.0
>
>
> MESOS-3505 added support for pulling Docker images by their digest to the 
> Mesos Containerizer provisioner. However currently it only works with images 
> that were pushed with Docker 1.9 and older or with Registry 2.2.1 and older. 
> Newer versions use Schema 2 manifests by default. Because of CAS constraints 
> the registry does not convert those manifests on-the-fly to Schema 1 when 
> they are being pulled by digest.
> Compatibility details are documented here: 
> https://docs.docker.com/registry/compatibility/
> Image Manifest V2, Schema 2 is documented here: 
> https://docs.docker.com/registry/spec/manifest-v2-2/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9704) Support docker manifest v2s2 config GC.

2019-04-05 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-9704:
---

 Summary: Support docker manifest v2s2 config GC.
 Key: MESOS-9704
 URL: https://issues.apache.org/jira/browse/MESOS-9704
 Project: Mesos
  Issue Type: Improvement
  Components: containerization
Reporter: Gilbert Song


After adding docker manifest v2s2 support, layer GC still works properly.

However, the manifest config is not garbage collected. We need to add the config 
dir to the checkpointed LAYERS_FILE to support config GC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9703) Support docker manifest v2s2 external urls.

2019-04-05 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-9703:
---

 Summary: Support docker manifest v2s2 external urls.
 Key: MESOS-9703
 URL: https://issues.apache.org/jira/browse/MESOS-9703
 Project: Mesos
  Issue Type: Improvement
  Components: containerization
Reporter: Gilbert Song


The docker manifest v2s2 spec defines external (foreign) URLs. Some Windows 
images rely on those URLs to download private base layers from Microsoft's 
servers.

Some refactoring may be needed to get rid of the current external URLs support, 
because the URI fetcher has to parse the manifest when pulling every layer.
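For reference, an illustrative excerpt of a v2s2 manifest layer descriptor 
carrying foreign URLs (the size, digest, and URL below are placeholders, not 
taken from a real image):

{noformat}
"layers": [
  {
    "mediaType": "application/vnd.docker.image.rootfs.foreign.diff.tar.gzip",
    "size": 1234567890,
    "digest": "sha256:<base-layer-digest>",
    "urls": [
      "https://<microsoft-server>/<base-layer-digest>"
    ]
  }
]
{noformat}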



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9676) Add prettyjws support for docker v2 s1 manifest.

2019-04-04 Thread Gilbert Song (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song reassigned MESOS-9676:
---

Assignee: Gilbert Song

> Add prettyjws support for docker v2 s1 manifest.
> 
>
> Key: MESOS-9676
> URL: https://issues.apache.org/jira/browse/MESOS-9676
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Gilbert Song
>Assignee: Gilbert Song
>Priority: Major
>  Labels: containerization
> Fix For: 1.8.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8972) when choose docker image use user network all mesos agent crash

2019-04-04 Thread Gilbert Song (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810140#comment-16810140
 ] 

Gilbert Song commented on MESOS-8972:
-

[~omegavveapon] [~saturnman], can you verify?

> when choose docker image use user network all mesos agent crash
> ---
>
> Key: MESOS-8972
> URL: https://issues.apache.org/jira/browse/MESOS-8972
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.7.0
> Environment: Ubuntu 14.04 & Ubuntu 16.04, both type crashes mesos
>Reporter: saturnman
>Priority: Blocker
>  Labels: docker, network
>
> When submitting a docker task from Marathon with the user network selected, the 
> mesos process crashes with the following backtrace message:
> mesos-agent: .././../3rdparty/stout/include/stout/option.hpp:118: const T& 
> Option::get() const & [with T = std::__cxx11::basic_string]: 
> Assertion `isSome()' failed.
> *** Aborted at 1527797505 (unix time) try "date -d @1527797505" if you are 
> using GNU date ***
> PC: @ 0x7fc03d43f428 (unknown)
> *** SIGABRT (@0x4514) received by PID 17684 (TID 0x7fc033143700) from PID 
> 17684; stack trace: ***
>  @ 0x7fc03dd7d390 (unknown)
>  @ 0x7fc03d43f428 (unknown)
>  @ 0x7fc03d44102a (unknown)
>  @ 0x7fc03d437bd7 (unknown)
>  @ 0x7fc03d437c82 (unknown)
>  @ 0x564f1ad8871d 
> _ZNKR6OptionINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIc3getEv
>  @ 0x7fc048c43256 
> mesos::internal::slave::NetworkCniIsolatorProcess::getNetworkConfigJSON()
>  @ 0x7fc048c368cb mesos::internal::slave::NetworkCniIsolatorProcess::prepare()
>  @ 0x7fc0486e5c18 
> _ZZN7process8dispatchI6OptionIN5mesos5slave19ContainerLaunchInfoEENS2_8internal5slave20MesosIsolatorProcessERKNS2_11ContainerIDERKNS3_15ContainerConfigESB_SE_EENS_6FutureIT_EERKNS_3PIDIT0_EEMSJ_FSH_T1_T2_EOT3_OT4_ENKUlSt10unique_ptrINS_7PromiseIS5_EESt14default_deleteISX_EEOS9_OSC_PNS_11ProcessBaseEE_clES10_S11_S12_S14_



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8972) when choose docker image use user network all mesos agent crash

2019-04-04 Thread Gilbert Song (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810139#comment-16810139
 ] 

Gilbert Song commented on MESOS-8972:
-

seems like this issue was already fixed by 
https://issues.apache.org/jira/browse/MESOS-9267

> when choose docker image use user network all mesos agent crash
> ---
>
> Key: MESOS-8972
> URL: https://issues.apache.org/jira/browse/MESOS-8972
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.7.0
> Environment: Ubuntu 14.04 & Ubuntu 16.04, both type crashes mesos
>Reporter: saturnman
>Priority: Blocker
>  Labels: docker, network
>
> When submitting a docker task from Marathon with the user network selected, the 
> mesos process crashes with the following backtrace message:
> mesos-agent: .././../3rdparty/stout/include/stout/option.hpp:118: const T& 
> Option::get() const & [with T = std::__cxx11::basic_string]: 
> Assertion `isSome()' failed.
> *** Aborted at 1527797505 (unix time) try "date -d @1527797505" if you are 
> using GNU date ***
> PC: @ 0x7fc03d43f428 (unknown)
> *** SIGABRT (@0x4514) received by PID 17684 (TID 0x7fc033143700) from PID 
> 17684; stack trace: ***
>  @ 0x7fc03dd7d390 (unknown)
>  @ 0x7fc03d43f428 (unknown)
>  @ 0x7fc03d44102a (unknown)
>  @ 0x7fc03d437bd7 (unknown)
>  @ 0x7fc03d437c82 (unknown)
>  @ 0x564f1ad8871d 
> _ZNKR6OptionINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIc3getEv
>  @ 0x7fc048c43256 
> mesos::internal::slave::NetworkCniIsolatorProcess::getNetworkConfigJSON()
>  @ 0x7fc048c368cb mesos::internal::slave::NetworkCniIsolatorProcess::prepare()
>  @ 0x7fc0486e5c18 
> _ZZN7process8dispatchI6OptionIN5mesos5slave19ContainerLaunchInfoEENS2_8internal5slave20MesosIsolatorProcessERKNS2_11ContainerIDERKNS3_15ContainerConfigESB_SE_EENS_6FutureIT_EERKNS_3PIDIT0_EEMSJ_FSH_T1_T2_EOT3_OT4_ENKUlSt10unique_ptrINS_7PromiseIS5_EESt14default_deleteISX_EEOS9_OSC_PNS_11ProcessBaseEE_clES10_S11_S12_S14_
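The abort above comes from calling Option::get() on a value that is None. A 
minimal sketch of the defensive pattern (the function name and argument are 
illustrative, not the actual CNI isolator code):

{code}
#include <string>

#include <stout/error.hpp>
#include <stout/option.hpp>
#include <stout/try.hpp>

// Return an Error instead of asserting when the network name is missing.
Try<std::string> getNetworkConfig(const Option<std::string>& networkName)
{
  if (networkName.isNone()) {
    return Error("Container is missing a CNI network name");
  }

  return networkName.get();
}
{code}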



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9694) Refactor UCR docker store to construct 'Image' protobuf at Puller.

2019-04-01 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-9694:
---

 Summary: Refactor UCR docker store to construct 'Image' protobuf 
at Puller.
 Key: MESOS-9694
 URL: https://issues.apache.org/jira/browse/MESOS-9694
 Project: Mesos
  Issue Type: Task
  Components: containerization
Reporter: Gilbert Song
Assignee: Gilbert Song






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9693) Add master validation for SeccompInfo.

2019-03-29 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-9693:
---

 Summary: Add master validation for SeccompInfo.
 Key: MESOS-9693
 URL: https://issues.apache.org/jira/browse/MESOS-9693
 Project: Mesos
  Issue Type: Task
Reporter: Gilbert Song
Assignee: Andrei Budnik


1. If seccomp is not enabled, we should return a failure when any framework 
specifies SeccompInfo, and send an appropriate status update.
2. At most one of the profile_name and unconfined fields should be set; it is 
better to validate this in the master (a sketch follows below).
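A minimal, self-contained sketch of that validation (the struct below only 
mimics the SeccompInfo message; real code would use the generated protobuf and 
stout's Option<Error>):

{code}
#include <string>

// Mimics the SeccompInfo message for illustration only.
struct SeccompInfoLike
{
  bool hasProfileName = false;
  bool hasUnconfined = false;
};

// Returns an error message, or an empty string if the info is valid.
std::string validateSeccompInfo(
    const SeccompInfoLike& seccomp,
    bool seccompEnabledOnAgent)
{
  if (!seccompEnabledOnAgent) {
    // 1. Reject tasks requesting Seccomp on an agent without Seccomp enabled,
    //    so the framework gets a clear status update.
    return "SeccompInfo is set but Seccomp is not enabled on the agent";
  }

  if (seccomp.hasProfileName && seccomp.hasUnconfined) {
    // 2. 'profile_name' and 'unconfined' are mutually exclusive.
    return "At most one of 'profile_name' and 'unconfined' may be set";
  }

  return "";
}
{code}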



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9685) Backport docker manifest v2s2 support to 1.4.x

2019-03-27 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-9685:
---

 Summary: Backport docker manifest v2s2 support to 1.4.x
 Key: MESOS-9685
 URL: https://issues.apache.org/jira/browse/MESOS-9685
 Project: Mesos
  Issue Type: Task
Reporter: Gilbert Song






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9683) Backport docker manifest v2s2 support to 1.6.x

2019-03-27 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-9683:
---

 Summary: Backport docker manifest v2s2 support to 1.6.x
 Key: MESOS-9683
 URL: https://issues.apache.org/jira/browse/MESOS-9683
 Project: Mesos
  Issue Type: Task
Reporter: Gilbert Song






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9684) Backport docker manifest v2s2 support to 1.5.x

2019-03-27 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-9684:
---

 Summary: Backport docker manifest v2s2 support to 1.5.x
 Key: MESOS-9684
 URL: https://issues.apache.org/jira/browse/MESOS-9684
 Project: Mesos
  Issue Type: Task
Reporter: Gilbert Song






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9682) Backport docker manifest v2s2 support to 1.7.x.

2019-03-27 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-9682:
---

 Summary: Backport docker manifest v2s2 support to 1.7.x.
 Key: MESOS-9682
 URL: https://issues.apache.org/jira/browse/MESOS-9682
 Project: Mesos
  Issue Type: Task
Reporter: Gilbert Song






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9676) Add prettyjws support for docker v2 s1 manifest.

2019-03-23 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-9676:
---

 Summary: Add prettyjws support for docker v2 s1 manifest.
 Key: MESOS-9676
 URL: https://issues.apache.org/jira/browse/MESOS-9676
 Project: Mesos
  Issue Type: Task
  Components: containerization
Reporter: Gilbert Song






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9675) Docker Manifest V2 Schema2 Support.

2019-03-23 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-9675:
---

 Summary: Docker Manifest V2 Schema2 Support.
 Key: MESOS-9675
 URL: https://issues.apache.org/jira/browse/MESOS-9675
 Project: Mesos
  Issue Type: Epic
  Components: containerization
Reporter: Gilbert Song






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9651) Design for docker registry v2 schema2 support.

2019-03-13 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-9651:
---

 Summary: Design for docker registry v2 schema2 support.
 Key: MESOS-9651
 URL: https://issues.apache.org/jira/browse/MESOS-9651
 Project: Mesos
  Issue Type: Task
  Components: containerization
Reporter: Gilbert Song






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Issue Comment Deleted] (MESOS-8813) Support multiple tasks with different users can access a persistent volume.

2019-03-07 Thread Gilbert Song (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-8813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song updated MESOS-8813:

Comment: was deleted

(was: commit a5b9f6e1cdb2ec26baf6e49706e1a0d59f3ce4d1
Author: Qian Zhang 
Date:   Thu Mar 7 16:42:16 2019 -0800

Updated UNPRIVILEGED_USER_PersistentVolumes to cover non-shared PV.

Review: https://reviews.apache.org/r/70140/

commit d0e13dd928b57ea6d8447b9b428487d2fc28380a
Author: Qian Zhang 
Date:   Thu Mar 7 16:42:14 2019 -0800

Updated ROOT_UNPRIVILEGED_USER_PersistentVolumes to cover non-shared PV.

Review: https://reviews.apache.org/r/70139/

commit 7d78aab8ee3d6047979617a4c18b1c7f05e1317a
Author: Qian Zhang 
Date:   Thu Mar 7 16:42:13 2019 -0800

Replaced reading mounttable with getting path gid in volume gid manager.

Review: https://reviews.apache.org/r/70138/

commit 0293cc2d4ce25b113b4f1b8a34b34a0655132f9b
Author: Qian Zhang 
Date:   Thu Mar 7 16:42:09 2019 -0800

Made volume gid manager allocate & deallocate gid to non-shared PV.

Review: https://reviews.apache.org/r/70137/)

> Support multiple tasks with different users can access a persistent volume.
> ---
>
> Key: MESOS-8813
> URL: https://issues.apache.org/jira/browse/MESOS-8813
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
> Fix For: 1.8.0
>
>
> See [design 
> doc|https://docs.google.com/document/d/1QyeDDX4Zr9E-0jKMoPTzsGE-v4KWwjmnCR0l8V4Tq2U/edit#heading=h.f4x59l41lxwx]
>  for why we need to do this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9642) Avoid reading host mount table when allocating a gid in GIDManager.

2019-03-07 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-9642:
---

 Summary: Avoid reading host mount table when allocating a gid in 
GIDManager.
 Key: MESOS-9642
 URL: https://issues.apache.org/jira/browse/MESOS-9642
 Project: Mesos
  Issue Type: Improvement
  Components: containerization
Reporter: Gilbert Song
Assignee: Qian Zhang






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9641) Support GID manager with non-sharable persistent volume.

2019-03-07 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-9641:
---

 Summary: Support GID manager with non-sharable persistent volume.
 Key: MESOS-9641
 URL: https://issues.apache.org/jira/browse/MESOS-9641
 Project: Mesos
  Issue Type: Improvement
  Components: containerization
Reporter: Gilbert Song
Assignee: Qian Zhang






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9610) Fetcher vulnerability - escaping from sandbox

2019-03-07 Thread Gilbert Song (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787285#comment-16787285
 ] 

Gilbert Song commented on MESOS-9610:
-

Probably we could create a separate JIRA to follow up on 
*ARCHIVE_EXTRACT_SECURE_NOABSOLUTEPATHS*?

cc [~mderela] [~kaysoky]
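A minimal sketch of what extraction with those libarchive security flags could 
look like (this is not the Mesos fetcher code, and it assumes a libarchive 
version that provides ARCHIVE_EXTRACT_SECURE_NOABSOLUTEPATHS):

{code}
#include <archive.h>
#include <archive_entry.h>

#include <cstdio>

int extractSecurely(const char* path)
{
  const int flags =
    ARCHIVE_EXTRACT_SECURE_NODOTDOT |         // refuse '..' path components
    ARCHIVE_EXTRACT_SECURE_NOABSOLUTEPATHS |  // refuse absolute targets
    ARCHIVE_EXTRACT_SECURE_SYMLINKS;          // refuse writes through symlinks

  struct archive* a = archive_read_new();
  archive_read_support_format_all(a);
  archive_read_support_filter_all(a);

  if (archive_read_open_filename(a, path, 10240) != ARCHIVE_OK) {
    std::fprintf(stderr, "%s\n", archive_error_string(a));
    archive_read_free(a);
    return 1;
  }

  struct archive_entry* entry;
  while (archive_read_next_header(a, &entry) == ARCHIVE_OK) {
    // Entries such as "../../../../etc/mariusz_was_here.txt" are rejected
    // here instead of escaping the sandbox.
    if (archive_read_extract(a, entry, flags) != ARCHIVE_OK) {
      std::fprintf(stderr, "%s\n", archive_error_string(a));
    }
  }

  archive_read_free(a);
  return 0;
}
{code}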

> Fetcher vulnerability - escaping from sandbox
> -
>
> Key: MESOS-9610
> URL: https://issues.apache.org/jira/browse/MESOS-9610
> Project: Mesos
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.7.2
>Reporter: Mariusz Derela
>Assignee: Joseph Wu
>Priority: Blocker
>  Labels: bug, foundations, security-issue, vulnerabilities
> Fix For: 1.8.0, 1.7.3
>
>
> I have noticed that it is possible to exploit the fetcher and overwrite 
> any file on the agent host.
> Scenario to reproduce:
> 1) Prepare a file with any content, name it something like "../../../etc/test", 
> and archive it. We can use Python's zipfile module to achieve that:
> {code:java}
> >>> import zipfile
> >>> zip = zipfile.ZipFile("exploit.zip", "w")
> >>> zip.writestr("../../../../../../../../../../../../etc/mariusz_was_here.txt",
> >>>  "some content")
> >>> zip.close()
> {code}
> 2) Prepare a service that will use our artifact (exploit.zip).
> 3) Run the service.
> At the end we will find our file in /etc. As you can imagine, there are many 
> ways this could be abused.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8810) Grant non-root task user the permissions to access the SANDBOX_PATH volume of PARENT type

2019-02-27 Thread Gilbert Song (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16780181#comment-16780181
 ] 

Gilbert Song commented on MESOS-8810:
-

commit 1003935e1c4021cecf3af637faa1588509a64065
Author: Qian Zhang 
Date:   Wed Feb 27 22:22:47 2019 -0800

Added a test `ROOT_UNPRIVILEGED_USER_TaskSandboxLocalPersistentVolume`.

Review: https://reviews.apache.org/r/69579/

> Grant non-root task user the permissions to access the SANDBOX_PATH volume of 
> PARENT type
> -
>
> Key: MESOS-8810
> URL: https://issues.apache.org/jira/browse/MESOS-8810
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
> Fix For: 1.8.0
>
>
> See [design 
> doc|https://docs.google.com/document/d/1QyeDDX4Zr9E-0jKMoPTzsGE-v4KWwjmnCR0l8V4Tq2U/edit#heading=h.s6f8rmu65g2p]
>  for why we need to do this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-8813) Make multiple tasks with different users can access a shared persistent volume

2019-02-27 Thread Gilbert Song (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16780175#comment-16780175
 ] 

Gilbert Song edited comment on MESOS-8813 at 2/28/19 6:46 AM:
--

commit cb706719975dc1c8ec34a8411d083c6d348779cf
Author: Qian Zhang 
Date:   Wed Feb 27 22:22:44 2019 -0800

Added a test `ROOT_UNPRIVILEGED_USER_TaskSandboxSharedPersistentVolume`.

Review: https://reviews.apache.org/r/69547/

commit 114569fed4f92f7394e4a8aad7077b7084bb94e9
Author: Qian Zhang 
Date:   Wed Feb 27 22:22:41 2019 -0800

Added a test `UNPRIVILEGED_USER_SharedPersistentVolume`.

Review: https://reviews.apache.org/r/68163/

commit d0405160e60f60b3e3416e4a4bb7afb2b7e2907b
Author: Qian Zhang 
Date:   Wed Feb 27 22:22:36 2019 -0800

Added a test `ROOT_UNPRIVILEGED_USER_SharedPersistentVolume`.

Review: https://reviews.apache.org/r/68162/


was (Author: gilbert):
commit d0405160e60f60b3e3416e4a4bb7afb2b7e2907b
Author: Qian Zhang 
Date:   Wed Feb 27 22:22:36 2019 -0800

Added a test `ROOT_UNPRIVILEGED_USER_SharedPersistentVolume`.

Review: https://reviews.apache.org/r/68162/

> Make multiple tasks with different users can access a shared persistent volume
> --
>
> Key: MESOS-8813
> URL: https://issues.apache.org/jira/browse/MESOS-8813
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
> Fix For: 1.8.0
>
>
> See [design 
> doc|https://docs.google.com/document/d/1QyeDDX4Zr9E-0jKMoPTzsGE-v4KWwjmnCR0l8V4Tq2U/edit#heading=h.f4x59l41lxwx]
>  for why we need to do this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8813) Make multiple tasks with different users can access a shared persistent volume

2019-02-27 Thread Gilbert Song (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16780175#comment-16780175
 ] 

Gilbert Song commented on MESOS-8813:
-

commit d0405160e60f60b3e3416e4a4bb7afb2b7e2907b
Author: Qian Zhang 
Date:   Wed Feb 27 22:22:36 2019 -0800

Added a test `ROOT_UNPRIVILEGED_USER_SharedPersistentVolume`.

Review: https://reviews.apache.org/r/68162/

> Make multiple tasks with different users can access a shared persistent volume
> --
>
> Key: MESOS-8813
> URL: https://issues.apache.org/jira/browse/MESOS-8813
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
> Fix For: 1.8.0
>
>
> See [design 
> doc|https://docs.google.com/document/d/1QyeDDX4Zr9E-0jKMoPTzsGE-v4KWwjmnCR0l8V4Tq2U/edit#heading=h.f4x59l41lxwx]
>  for why we need to do this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8810) Grant non-root task user the permissions to access the SANDBOX_PATH volume of PARENT type

2019-02-27 Thread Gilbert Song (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16780174#comment-16780174
 ] 

Gilbert Song commented on MESOS-8810:
-

commit 9c44b31e73783a220e43098a1b117e83e6074fdc
Author: Qian Zhang 
Date:   Wed Feb 27 22:22:34 2019 -0800

Added a test `ROOT_UNPRIVILEGED_USER_ParentTypeDifferentUser`.

Review: https://reviews.apache.org/r/67997/

commit 16fd7e74b2dc6176d418ebcc1608b94a1159cb15
Author: Qian Zhang 
Date:   Wed Feb 27 22:22:29 2019 -0800

Implemented recovery for volume gid manager.

Review: https://reviews.apache.org/r/69676/

> Grant non-root task user the permissions to access the SANDBOX_PATH volume of 
> PARENT type
> -
>
> Key: MESOS-8810
> URL: https://issues.apache.org/jira/browse/MESOS-8810
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
> Fix For: 1.8.0
>
>
> See [design 
> doc|https://docs.google.com/document/d/1QyeDDX4Zr9E-0jKMoPTzsGE-v4KWwjmnCR0l8V4Tq2U/edit#heading=h.s6f8rmu65g2p]
>  for why we need to do this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9613) Support seccomp `unconfined` option for whitelisting.

2019-02-26 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-9613:
---

 Summary: Support seccomp `unconfined` option for whitelisting.
 Key: MESOS-9613
 URL: https://issues.apache.org/jira/browse/MESOS-9613
 Project: Mesos
  Issue Type: Improvement
  Components: containerization
Reporter: Gilbert Song
Assignee: Andrei Budnik


Support seccomp `unconfined` option for whitelisting.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9009) Support for creation non-existing host paths in a whitelist as source paths

2019-02-24 Thread Gilbert Song (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776488#comment-16776488
 ] 

Gilbert Song commented on MESOS-9009:
-

Done. Thanks for the patch!

> Support for creation non-existing host paths in a whitelist as source paths
> ---
>
> Key: MESOS-9009
> URL: https://issues.apache.org/jira/browse/MESOS-9009
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Jason Lai
>Assignee: Jason Lai
>Priority: Major
>  Labels: containerizer, mount, path
> Fix For: 1.8.0
>
>
> Docker creates a directory specified in {{docker run}}'s {{--volume}}/{{-v}} 
> option as the source path that will get bind-mounted into the container, if 
> the source location didn't originally exist on the host.
> Unlike Docker, UCR bails on launching containers if any of their host mount 
> paths doesn't originally exist. While this is more secure and eliminates 
> unnecessary side effects, it breaks transparent compatibility when trying to 
> migrate from Docker.
> As a trade-off, we should allow host path creation in a restricted manner, by 
> introducing a new Mesos agent flag ({{--host_path_volume_force_creation}}) as 
> a colon-separated whitelist (similar to the format of POSIX's {{$PATH}} 
> environment variable), under whose items' subdirectories the host paths are 
> allowed to be created.
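A minimal, self-contained sketch of checking a colon-separated whitelist like 
the one described above (the flag name comes from the ticket; the helper and 
example paths are invented for illustration, not the actual isolator code):

{code}
#include <iostream>
#include <sstream>
#include <string>

// Returns true if 'hostPath' falls under one of the whitelisted prefixes.
bool allowedToCreate(const std::string& whitelist, const std::string& hostPath)
{
  std::stringstream stream(whitelist);
  std::string prefix;

  while (std::getline(stream, prefix, ':')) {
    if (!prefix.empty() &&
        hostPath.compare(0, prefix.size(), prefix) == 0 &&
        (hostPath.size() == prefix.size() || hostPath[prefix.size()] == '/')) {
      return true;
    }
  }

  return false;
}

int main()
{
  // Hypothetical value for --host_path_volume_force_creation.
  const std::string whitelist = "/var/data:/mnt/volumes";

  std::cout << std::boolalpha
            << allowedToCreate(whitelist, "/mnt/volumes/foo") << std::endl  // true
            << allowedToCreate(whitelist, "/etc/passwd") << std::endl;      // false

  return 0;
}
{code}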



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9180) tasks get stuck in TASK_KILLING on the default executor

2019-02-06 Thread Gilbert Song (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16762217#comment-16762217
 ] 

Gilbert Song commented on MESOS-9180:
-

[~Kirill P], could you add the agent logs for triaging? Also, this may be related 
to the recent stuck-task fixes for an FD leak (MESOS-9151 and MESOS-9501); could 
you please upgrade and verify whether you still have this issue?

> tasks get stuck in TASK_KILLING on the default executor
> ---
>
> Key: MESOS-9180
> URL: https://issues.apache.org/jira/browse/MESOS-9180
> Project: Mesos
>  Issue Type: Bug
>  Components: executor
>Affects Versions: 1.6.1
> Environment: Ubuntu 18.04, Ubuntu 16.04
>Reporter: Kirill Plyashkevich
>Priority: Critical
>  Labels: containerization
>
> during our load tests tasks get stuck in TASK_KILLING state
> {quote}{noformat}
> I0823 16:30:20.367563 21608 executor.cpp:192] Version: 1.6.1
> I0823 16:30:20.439478 21684 default_executor.cpp:202] Received SUBSCRIBED 
> event
> I0823 16:30:20.441012 21684 default_executor.cpp:206] Subscribed executor on 
> XX.XXX.XX.XXX
> I0823 16:30:20.916216 21665 default_executor.cpp:202] Received LAUNCH_GROUP 
> event
> I0823 16:30:20.917373 21645 default_executor.cpp:426] Setting 
> 'MESOS_CONTAINER_IP' to: 172.26.10.222
> I0823 16:30:22.573794 21658 default_executor.cpp:202] Received ACKNOWLEDGED 
> event
> I0823 16:30:22.575518 21637 default_executor.cpp:202] Received ACKNOWLEDGED 
> event
> I0823 16:30:22.577137 21665 default_executor.cpp:202] Received ACKNOWLEDGED 
> event
> I0823 16:30:33.091509 21642 default_executor.cpp:661] Finished launching 
> tasks [ 
> test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.akka,
>  
> test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.redis,
>  
> test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.delivery
>  ] in child containers [ 
> 3680beff-96d2-4ebd-832c-9cbbddf8c507.8e04f74f-cb8b-46b9-8758-340455a844c8, 
> 3680beff-96d2-4ebd-832c-9cbbddf8c507.fc60bf0f-5814-4ea9-a37f-89ebe3e2f5f7, 
> 3680beff-96d2-4ebd-832c-9cbbddf8c507.ab481072-c8ab-4a76-be8b-7f4431220e7b ]
> I0823 16:30:33.091567 21642 default_executor.cpp:685] Waiting on child 
> containers of tasks [ 
> test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.akka,
>  
> test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.redis,
>  
> test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.delivery
>  ]
> I0823 16:30:33.096014 21647 default_executor.cpp:746] Waiting for child 
> container 
> 3680beff-96d2-4ebd-832c-9cbbddf8c507.8e04f74f-cb8b-46b9-8758-340455a844c8 of 
> task 
> 'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.akka'
> I0823 16:30:33.096310 21647 default_executor.cpp:746] Waiting for child 
> container 
> 3680beff-96d2-4ebd-832c-9cbbddf8c507.fc60bf0f-5814-4ea9-a37f-89ebe3e2f5f7 of 
> task 
> 'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.redis'
> I0823 16:30:33.096470 21647 default_executor.cpp:746] Waiting for child 
> container 
> 3680beff-96d2-4ebd-832c-9cbbddf8c507.ab481072-c8ab-4a76-be8b-7f4431220e7b of 
> task 
> 'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.delivery'
> I0823 16:30:33.521510 21648 default_executor.cpp:202] Received ACKNOWLEDGED 
> event
> I0823 16:30:33.522073 21652 default_executor.cpp:202] Received ACKNOWLEDGED 
> event
> I0823 16:30:33.523569 21679 default_executor.cpp:202] Received ACKNOWLEDGED 
> event
> I0823 16:30:38.593736 21668 checker_process.cpp:814] Output of the COMMAND 
> health check for task 
> 'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.redis'
>  (stdout):
> 0
> PONG
> I0823 16:30:38.593777 21668 checker_process.cpp:817] Output of the COMMAND 
> health check for task 
> 'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.redis'
>  (stderr):
> I0823 16:30:38.610167 21650 checker_process.cpp:814] Output of the COMMAND 
> health check for task 
> 'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.akka'
>  (stdout):
> I0823 16:30:38.610194 21650 checker_process.cpp:817] Output of the COMMAND 
> health check for task 
> 'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.akka'
>  (stderr):
> I0823 16:30:38.700561 21681 checker_process.cpp:814] Output of the COMMAND 
> health check for task 
> 'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.delivery'
>  (stdout):
> I0823 16:30:38.700598 21681 

[jira] [Commented] (MESOS-9159) Support Foreign URLs in docker registry puller

2019-01-31 Thread Gilbert Song (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757583#comment-16757583
 ] 

Gilbert Song commented on MESOS-9159:
-

[~jpepy], please follow up with https://issues.apache.org/jira/browse/MESOS-5011

> Support Foreign URLs in docker registry puller
> --
>
> Key: MESOS-9159
> URL: https://issues.apache.org/jira/browse/MESOS-9159
> Project: Mesos
>  Issue Type: Task
>Reporter: Akash Gupta
>Assignee: Liangyu Zhao
>Priority: Major
>
> Currently, trying to pull the layers of a Windows image with the current 
> registry pull code will return a 404 error. This is because the Windows 
> docker images need to pull the base OS layers from the foreign URLs field in 
> the version 2 schema 2 docker manifest. As a result, the registry puller 
> needs to be aware of version 2 schema 2 and the foreign URLs field.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9507) Agent could not recover due to empty docker volume checkpointed files.

2019-01-30 Thread Gilbert Song (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song reassigned MESOS-9507:
---

Assignee: Gilbert Song
  Sprint: Containerization RI10 Spr 39
Story Points: 5

> Agent could not recover due to empty docker volume checkpointed files.
> --
>
> Key: MESOS-9507
> URL: https://issues.apache.org/jira/browse/MESOS-9507
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Gilbert Song
>Assignee: Gilbert Song
>Priority: Critical
>  Labels: containerizer
>
> Agent could not recover due to empty docker volume checkpointed files. Please 
> see logs:
> {noformat}
> Nov 12 17:12:00 guppy mesos-agent[38960]: E1112 17:12:00.978682 38969 
> slave.cpp:6279] EXIT with status 1: Failed to perform recovery: Collect 
> failed: Collect failed: Failed to recover docker volumes for orphan container 
> e1b04051-1e4a-47a9-b866-1d625cda1d22: JSON parse failed: syntax error at line 
> 1 near:
> Nov 12 17:12:00 guppy mesos-agent[38960]: To remedy this do as follows: 
> Nov 12 17:12:00 guppy mesos-agent[38960]: Step 1: rm -f 
> /var/lib/mesos/slave/meta/slaves/latest
> Nov 12 17:12:00 guppy mesos-agent[38960]: This ensures agent doesn't recover 
> old live executors.
> Nov 12 17:12:00 guppy mesos-agent[38960]: Step 2: Restart the agent. 
> Nov 12 17:12:00 guppy systemd[1]: dcos-mesos-slave.service: main process 
> exited, code=exited, status=1/FAILURE
> Nov 12 17:12:00 guppy systemd[1]: Unit dcos-mesos-slave.service entered 
> failed state.
> Nov 12 17:12:00 guppy systemd[1]: dcos-mesos-slave.service failed.
> {noformat}
> This is caused by agent recovery after the volume state file is created but 
> before checkpointing finishes. Basically the docker volume is not mounted 
> yet, so the docker volume isolator should skip recovering this volume.
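A minimal sketch of the "skip the volume if its checkpoint is empty" idea 
(not the actual docker volume isolator code; the checkpoint path below is 
hypothetical):

{code}
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

// Reads the checkpointed volume state; an empty result means the agent died
// after creating the file but before the checkpoint finished.
std::string readVolumeState(const std::string& path)
{
  std::ifstream file(path);
  std::stringstream buffer;
  buffer << file.rdbuf();
  return buffer.str();
}

int main()
{
  // Hypothetical checkpoint location for one orphan container.
  const std::string path =
    "/var/run/mesos/isolators/docker/volume/<containerId>/volumes";

  if (readVolumeState(path).empty()) {
    std::cout << "Empty or missing checkpoint; skipping volume recovery"
              << std::endl;
    return 0;
  }

  // Otherwise parse the state as JSON and recover the mounts (omitted).
  return 0;
}
{code}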



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9533) CniIsolatorTest.ROOT_CleanupAfterReboot is flaky.

2019-01-30 Thread Gilbert Song (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756440#comment-16756440
 ] 

Gilbert Song commented on MESOS-9533:
-

Yes, I did not realize the test was backported. I will backport the fix now.

> CniIsolatorTest.ROOT_CleanupAfterReboot is flaky.
> -
>
> Key: MESOS-9533
> URL: https://issues.apache.org/jira/browse/MESOS-9533
> Project: Mesos
>  Issue Type: Bug
>  Components: cni, containerization
>Affects Versions: 1.8.0
> Environment: centos-6 with SSL enabled
>Reporter: Gilbert Song
>Assignee: Gilbert Song
>Priority: Major
>  Labels: cni, flaky-test
> Fix For: 1.4.3, 1.5.3, 1.6.2, 1.7.2, 1.8.0
>
>
> {noformat}
> Error Message
> ../../src/tests/containerizer/cni_isolator_tests.cpp:2685
> Mock function called more times than expected - returning directly.
> Function call: statusUpdate(0x7fffc7c05aa0, @0x7fe637918430 136-byte 
> object <80-24 29-45 E6-7F 00-00 00-00 00-00 00-00 00-00 3E-E8 00-00 00-00 
> 00-00 00-B8 0E-20 F0-55 00-00 C0-03 07-18 E6-7F 00-00 20-17 05-18 E6-7F 00-00 
> 10-50 05-18 E6-7F 00-00 50-D1 04-18 E6-7F 00-00 ... 00-00 00-00 00-00 00-00 
> 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 
> 00-00 00-00 00-00 F0-89 16-E9 58-2B D7-41 00-00 00-00 01-00 00-00 18-00 00-00 
> 0B-00 00-00>)
>  Expected: to be called 3 times
>Actual: called 4 times - over-saturated and active
> Stacktrace
> ../../src/tests/containerizer/cni_isolator_tests.cpp:2685
> Mock function called more times than expected - returning directly.
> Function call: statusUpdate(0x7fffc7c05aa0, @0x7fe637918430 136-byte 
> object <80-24 29-45 E6-7F 00-00 00-00 00-00 00-00 00-00 3E-E8 00-00 00-00 
> 00-00 00-B8 0E-20 F0-55 00-00 C0-03 07-18 E6-7F 00-00 20-17 05-18 E6-7F 00-00 
> 10-50 05-18 E6-7F 00-00 50-D1 04-18 E6-7F 00-00 ... 00-00 00-00 00-00 00-00 
> 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 
> 00-00 00-00 00-00 F0-89 16-E9 58-2B D7-41 00-00 00-00 01-00 00-00 18-00 00-00 
> 0B-00 00-00>)
>  Expected: to be called 3 times
>Actual: called 4 times - over-saturated and active
> {noformat}
> It was from this commit 
> https://github.com/apache/mesos/commit/c338f5ada0123c0558658c6452ac3402d9fbec29



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9539) Mesos Containerizer Seccomp Improvement.

2019-01-28 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-9539:
---

 Summary: Mesos Containerizer Seccomp Improvement.
 Key: MESOS-9539
 URL: https://issues.apache.org/jira/browse/MESOS-9539
 Project: Mesos
  Issue Type: Epic
  Components: containerization
Reporter: Gilbert Song


Mesos Containerizer Seccomp Improvement.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9456) Set `SCMP_FLTATR_CTL_LOG` attribute during initialization of Seccomp context

2019-01-28 Thread Gilbert Song (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16753778#comment-16753778
 ] 

Gilbert Song commented on MESOS-9456:
-

(y)

> Set `SCMP_FLTATR_CTL_LOG` attribute during initialization of Seccomp context
> 
>
> Key: MESOS-9456
> URL: https://issues.apache.org/jira/browse/MESOS-9456
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Andrei Budnik
>Priority: Major
>  Labels: Mesosphere, newbie
>
> Since version 4.14 the Linux kernel supports SECCOMP_FILTER_FLAG_LOG flag 
> which can be used for enabling logging for all Seccomp filter operations 
> except SECCOMP_RET_ALLOW. If a Seccomp filter does not allow the system call, 
> then the kernel will print a message into dmesg during invocation of this 
> system call.
> At the moment libseccomp ver. 2.3.3 does not provide this flag, but the 
> latest master branch of libseccomp supports SECCOMP_FILTER_FLAG_LOG. So, we 
> need to add
> {code:java}
> seccomp_attr_set(ctx, SCMP_FLTATR_CTL_LOG, 1);{code}
> into `SeccompFilter::create()` when the newest version of libseccomp will be 
> released (v2.3.4+).
>  
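A minimal sketch of guarding that call by libseccomp version (the 2.4 cutoff 
below is an assumption about where SCMP_FLTATR_CTL_LOG landed; adjust as 
needed):

{code}
#include <seccomp.h>

int enableSeccompLogging(scmp_filter_ctx ctx)
{
#if SCMP_VER_MAJOR > 2 || (SCMP_VER_MAJOR == 2 && SCMP_VER_MINOR >= 4)
  // Log every filtered action except SECCOMP_RET_ALLOW to dmesg
  // (requires Linux >= 4.14).
  return seccomp_attr_set(ctx, SCMP_FLTATR_CTL_LOG, 1);
#else
  (void) ctx;
  return 0;  // Attribute not available in this libseccomp; nothing to do.
#endif
}
{code}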



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9386) Implement Seccomp profile inheritance for POD containers

2019-01-28 Thread Gilbert Song (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16753777#comment-16753777
 ] 

Gilbert Song commented on MESOS-9386:
-

Probably we should close this as "won't do"?

> Implement Seccomp profile inheritance for POD containers
> 
>
> Key: MESOS-9386
> URL: https://issues.apache.org/jira/browse/MESOS-9386
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Andrei Budnik
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: mesosphere
>
> Child containers inherit its parent container's Seccomp profile by default. 
> Also, Seccomp profile can be overridden by a Framework for a particular child 
> container by specifying a path to the Seccomp profile.
> Mesos containerizer persists information about containers on disk via 
> `ContainerLaunchInfo` proto, which includes `ContainerSeccompProfile` proto. 
> Mesos containerizer should use this proto to load the parent's profile for a 
> child container. When a child inherits the parent's Seccomp profile, Mesos 
> agent doesn't have to re-read a Seccomp profile from the disk, which was used 
> for the parent container. Otherwise, we would have to check that a file 
> content hasn't changed since the last time the parent was launched.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9080) Port mapping isolator leaks ephemeral ports when a container is destroyed during preparation

2019-01-24 Thread Gilbert Song (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751405#comment-16751405
 ] 

Gilbert Song commented on MESOS-9080:
-

Seems like it was fixed in 1.7.0. Do we need to backport it?

> Port mapping isolator leaks ephemeral ports when a container is destroyed 
> during preparation
> 
>
> Key: MESOS-9080
> URL: https://issues.apache.org/jira/browse/MESOS-9080
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.6.0
>Reporter: Ilya Pronin
>Assignee: Ilya Pronin
>Priority: Major
>
> {{network/port_mapping}} isolator leaks ephemeral ports during container 
> cleanup if {{Isolator::isolate()}} was not called, i.e. the container is 
> being destroyed during preparation. If the isolator doesn't know the main 
> container's PID it skips filters cleanup (they should not exist in this case) 
> and ephemeral ports deallocation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9080) Port mapping isolator leaks ephemeral ports when a container is destroyed during preparation

2019-01-24 Thread Gilbert Song (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751349#comment-16751349
 ] 

Gilbert Song commented on MESOS-9080:
-

[~ipronin][~jieyu], could we close this ticket given that 
https://reviews.apache.org/r/67936/ has been submitted, or is more work needed?

> Port mapping isolator leaks ephemeral ports when a container is destroyed 
> during preparation
> 
>
> Key: MESOS-9080
> URL: https://issues.apache.org/jira/browse/MESOS-9080
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.6.0
>Reporter: Ilya Pronin
>Assignee: Ilya Pronin
>Priority: Major
>
> {{network/port_mapping}} isolator leaks ephemeral ports during container 
> cleanup if {{Isolator::isolate()}} was not called, i.e. the container is 
> being destroyed during preparation. If the isolator doesn't know the main 
> container's PID it skips filters cleanup (they should not exist in this case) 
> and ephemeral ports deallocation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9533) CniIsolatorTest.ROOT_CleanupAfterReboot is flaky.

2019-01-23 Thread Gilbert Song (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song reassigned MESOS-9533:
---

Assignee: Gilbert Song

> CniIsolatorTest.ROOT_CleanupAfterReboot is flaky.
> -
>
> Key: MESOS-9533
> URL: https://issues.apache.org/jira/browse/MESOS-9533
> Project: Mesos
>  Issue Type: Bug
>  Components: cni, containerization
>Affects Versions: 1.8.0
> Environment: centos-6 with SSL enabled
>Reporter: Gilbert Song
>Assignee: Gilbert Song
>Priority: Major
>  Labels: flaky-test
>
> {noformat}
> Error Message
> ../../src/tests/containerizer/cni_isolator_tests.cpp:2685
> Mock function called more times than expected - returning directly.
> Function call: statusUpdate(0x7fffc7c05aa0, @0x7fe637918430 136-byte 
> object <80-24 29-45 E6-7F 00-00 00-00 00-00 00-00 00-00 3E-E8 00-00 00-00 
> 00-00 00-B8 0E-20 F0-55 00-00 C0-03 07-18 E6-7F 00-00 20-17 05-18 E6-7F 00-00 
> 10-50 05-18 E6-7F 00-00 50-D1 04-18 E6-7F 00-00 ... 00-00 00-00 00-00 00-00 
> 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 
> 00-00 00-00 00-00 F0-89 16-E9 58-2B D7-41 00-00 00-00 01-00 00-00 18-00 00-00 
> 0B-00 00-00>)
>  Expected: to be called 3 times
>Actual: called 4 times - over-saturated and active
> Stacktrace
> ../../src/tests/containerizer/cni_isolator_tests.cpp:2685
> Mock function called more times than expected - returning directly.
> Function call: statusUpdate(0x7fffc7c05aa0, @0x7fe637918430 136-byte 
> object <80-24 29-45 E6-7F 00-00 00-00 00-00 00-00 00-00 3E-E8 00-00 00-00 
> 00-00 00-B8 0E-20 F0-55 00-00 C0-03 07-18 E6-7F 00-00 20-17 05-18 E6-7F 00-00 
> 10-50 05-18 E6-7F 00-00 50-D1 04-18 E6-7F 00-00 ... 00-00 00-00 00-00 00-00 
> 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 
> 00-00 00-00 00-00 F0-89 16-E9 58-2B D7-41 00-00 00-00 01-00 00-00 18-00 00-00 
> 0B-00 00-00>)
>  Expected: to be called 3 times
>Actual: called 4 times - over-saturated and active
> {noformat}
> It was from this commit 
> https://github.com/apache/mesos/commit/c338f5ada0123c0558658c6452ac3402d9fbec29



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9533) CniIsolatorTest.ROOT_CleanupAfterReboot is flaky.

2019-01-22 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-9533:
---

 Summary: CniIsolatorTest.ROOT_CleanupAfterReboot is flaky.
 Key: MESOS-9533
 URL: https://issues.apache.org/jira/browse/MESOS-9533
 Project: Mesos
  Issue Type: Bug
  Components: cni, containerization
Affects Versions: 1.8.0
 Environment: centos-6 with SSL enabled
Reporter: Gilbert Song


{noformat}
Error Message
../../src/tests/containerizer/cni_isolator_tests.cpp:2685
Mock function called more times than expected - returning directly.
Function call: statusUpdate(0x7fffc7c05aa0, @0x7fe637918430 136-byte object 
<80-24 29-45 E6-7F 00-00 00-00 00-00 00-00 00-00 3E-E8 00-00 00-00 00-00 00-B8 
0E-20 F0-55 00-00 C0-03 07-18 E6-7F 00-00 20-17 05-18 E6-7F 00-00 10-50 05-18 
E6-7F 00-00 50-D1 04-18 E6-7F 00-00 ... 00-00 00-00 00-00 00-00 00-00 00-00 
00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 
00-00 F0-89 16-E9 58-2B D7-41 00-00 00-00 01-00 00-00 18-00 00-00 0B-00 00-00>)
 Expected: to be called 3 times
   Actual: called 4 times - over-saturated and active
Stacktrace
../../src/tests/containerizer/cni_isolator_tests.cpp:2685
Mock function called more times than expected - returning directly.
Function call: statusUpdate(0x7fffc7c05aa0, @0x7fe637918430 136-byte object 
<80-24 29-45 E6-7F 00-00 00-00 00-00 00-00 00-00 3E-E8 00-00 00-00 00-00 00-B8 
0E-20 F0-55 00-00 C0-03 07-18 E6-7F 00-00 20-17 05-18 E6-7F 00-00 10-50 05-18 
E6-7F 00-00 50-D1 04-18 E6-7F 00-00 ... 00-00 00-00 00-00 00-00 00-00 00-00 
00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 
00-00 F0-89 16-E9 58-2B D7-41 00-00 00-00 01-00 00-00 18-00 00-00 0B-00 00-00>)
 Expected: to be called 3 times
   Actual: called 4 times - over-saturated and active
{noformat}

It was from this commit 
https://github.com/apache/mesos/commit/c338f5ada0123c0558658c6452ac3402d9fbec29



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9530) Add timeout and logging to docker ps in docker containerizer recovery.

2019-01-22 Thread Gilbert Song (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749317#comment-16749317
 ] 

Gilbert Song commented on MESOS-9530:
-

Given that docker ps is only called once in the docker containerizer's 
recover(), it is worth adding retry logic in case it gets stuck.
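A minimal sketch of adding a timeout via libprocess' Future::after() (the 
five-minute value, the include paths, and the exact Docker::ps() signature are 
assumptions, not the actual change):

{code}
#include <list>

#include <process/future.hpp>

#include <stout/duration.hpp>

#include "docker/docker.hpp"

using process::Failure;
using process::Future;

Future<std::list<Docker::Container>> psWithTimeout(const Docker& docker)
{
  return docker.ps()
    .after(Minutes(5), [](Future<std::list<Docker::Container>> future) {
      // Give up on a hung 'docker ps' instead of blocking recovery forever.
      future.discard();
      return Future<std::list<Docker::Container>>(
          Failure("Timed out waiting for 'docker ps'"));
    });
}
{code}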

> Add timeout and logging to docker ps in docker containerizer recovery.
> --
>
> Key: MESOS-9530
> URL: https://issues.apache.org/jira/browse/MESOS-9530
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Gilbert Song
>Priority: Major
>  Labels: docker
>
> Docker daemon may hang on docker ps. We should add timeout to docker->ps() 
> call and more logging to monitor the docker ps (more logging is fine because 
> docker ps is only called once).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9532) ResourceOffersTest.ResourceOfferWithMultipleSlaves is flaky.

2019-01-22 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-9532:
---

 Summary: ResourceOffersTest.ResourceOfferWithMultipleSlaves is 
flaky.
 Key: MESOS-9532
 URL: https://issues.apache.org/jira/browse/MESOS-9532
 Project: Mesos
  Issue Type: Bug
  Components: containerization
Reporter: Gilbert Song
Assignee: Gilbert Song


{noformat}
09:48:57 I0114 09:48:57.153340  6468 credentials.hpp:86] Loading credential for 
authentication from '/tmp/4X6jRy/credential'
09:48:57 E0114 09:48:57.153373  6468 slave.cpp:296] EXIT with status 1: Empty 
credential file '/tmp/4X6jRy/credential' (see --credential flag)
{noformat}

caused by this commit 
https://github.com/apache/mesos/commit/07bccc6377a180267d4251897a765acba9fa0c4d



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9531) chown error handling is incorrect in createSandboxDirectory.

2019-01-22 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-9531:
---

 Summary: chown error handling is incorrect in 
createSandboxDirectory.
 Key: MESOS-9531
 URL: https://issues.apache.org/jira/browse/MESOS-9531
 Project: Mesos
  Issue Type: Bug
  Components: containerization
Reporter: Gilbert Song
Assignee: Qian Zhang






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9530) Add timeout and logging to docker ps in docker containerizer recovery.

2019-01-22 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-9530:
---

 Summary: Add timeout and logging to docker ps in docker 
containerizer recovery.
 Key: MESOS-9530
 URL: https://issues.apache.org/jira/browse/MESOS-9530
 Project: Mesos
  Issue Type: Improvement
  Components: containerization
Reporter: Gilbert Song


Docker daemon may hang on docker ps. We should add timeout to docker->ps() call 
and more logging to monitor the docker ps (more logging is fine because docker 
ps is only called once).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9502) IOswitchboard cleanup could get stuck.

2019-01-09 Thread Gilbert Song (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song reassigned MESOS-9502:
---

Shepherd: Gilbert Song
Assignee: Andrei Budnik
  Sprint: Containerization R9 Sprint 37
Story Points: 8
  Labels: containerizer  (was: )

> IOswitchboard cleanup could get stuck.
> --
>
> Key: MESOS-9502
> URL: https://issues.apache.org/jira/browse/MESOS-9502
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.7.0
>Reporter: Meng Zhu
>Assignee: Andrei Budnik
>Priority: Critical
>  Labels: containerizer
>
> Our check container got stuck during destroy, which in turn blocks the 
> parent container. It is blocked by the I/O switchboard cleanup:
> 1223 18:04:41.00 16269 switchboard.cpp:814] Sending SIGTERM to I/O 
> switchboard server (pid: 62854) since container 
> 4d4074fa-bc87-471b-8659-08e519b68e13.16d02532-675a-4acb-964d-57459ecf6b67.check-e91521a3-bf72-4ac4-8ead-3950e31cf09e
>  is being destroyed
> 
> 1227 04:45:38.00  5189 switchboard.cpp:916] I/O switchboard server 
> process for container 
> 4d4074fa-bc87-471b-8659-08e519b68e13.16d02532-675a-4acb-964d-57459ecf6b67.check-e91521a3-bf72-4ac4-8ead-3950e31cf09e
>  has terminated (status=N/A)
> Note the timestamp.
> *Root Cause:*
> Fundamentally, this is caused by a race between the *.discard()* triggered by 
> the check container's TIMEOUT and the IOSB extracting the ContainerIO object. 
> This race can be exposed by an overloaded/slow agent process. Here is how the 
> race is triggered:
> # Right after the IOSB server process is running, the check container times 
> out and the checker process returns a failure, which closes the HTTP 
> connection with the agent.
> # On the agent side, when the connection breaks, the handler triggers a 
> discard on the returned future, which transitions containerizer->launch()'s 
> future to the DISCARDED state.
> # In the containerizer, the DISCARDED state is propagated back to the IOSB's 
> prepare(), which stops its continuation of *extracting the ContainerIO* (that 
> step implies the object is cleaned up and the FDs, one end of the pipes 
> created in the IOSB, are closed in its destructor).
> # The agent starts to destroy the container due to its discarded launch 
> result, and asks the IOSB to clean up the container.
> # The IOSB server is still running, so the agent sends a SIGTERM.
> # The SIGTERM handler unblocks the IOSB to redirect stdout/stderr from the 
> container to the logger before exiting.
> # io::redirect() calls io::splice() and reads from the other end of those 
> pipes forever.
> This issue is *not easy to reproduce except* on a busy agent, because the 
> timeout has to happen exactly *AFTER* the IOSB server is running and *BEFORE* 
> the IOSB extracts the ContainerIO.
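
For reference, a self-contained libprocess sketch of the discard race described 
above (illustrative, not the actual IOSB code): once a discard request has 
propagated, the .then() continuation, and any FD cleanup it would have 
performed, never runs even though the parent future still completes.

{code:cpp}
// Self-contained sketch (not the actual IOSB code) of the race above: a
// discard requested before the .then() continuation runs means the
// continuation, and any FD cleanup it would have performed, never executes,
// even though the parent future still completes.
#include <iostream>

#include <process/future.hpp>

using process::Future;
using process::Promise;

int main()
{
  Promise<int> serverRunning;  // Stands in for "IOSB server is running".

  Future<int> launch = serverRunning.future()
    .then([](int) -> int {
      // Stands in for extracting the ContainerIO, whose destructor would
      // close the parent's copies of the switchboard pipe FDs.
      std::cout << "extracting ContainerIO" << std::endl;
      return 0;
    });

  // The check timeout breaks the HTTP connection, which discards the launch
  // future before the continuation had a chance to run.
  launch.discard();

  // The server does come up, but the discard request already propagated, so
  // the .then() continuation is skipped and the pipe FDs are never adopted.
  serverRunning.set(42);

  std::cout << "continuation skipped, launch discarded: "
            << (launch.isDiscarded() ? "yes" : "no") << std::endl;

  return 0;
}
{code}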



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-6632) ContainerLogger might leak FD if container launch fails.

2019-01-07 Thread Gilbert Song (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-6632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16735745#comment-16735745
 ] 

Gilbert Song commented on MESOS-6632:
-

https://reviews.apache.org/r/69681/

> ContainerLogger might leak FD if container launch fails.
> 
>
> Key: MESOS-6632
> URL: https://issues.apache.org/jira/browse/MESOS-6632
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 0.28.2, 1.0.1, 1.1.0
>Reporter: Jie Yu
>Priority: Critical
>
> In MesosContainerizer, if logger->prepare() succeeds but its continuation 
> fails, the pipe fd allocated in the logger will get leaked. We cannot add a 
> destructor in ContainerLogger::SubprocessInfo to close the fd because 
> subprocess might close the OWNED fd.
> An FD abstraction might help here. In other words, subprocess would no longer 
> be responsible for closing external FDs; instead, the FD destructor would do 
> so.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9507) Agent could not recover due to empty docker volume checkpointed files.

2019-01-02 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-9507:
---

 Summary: Agent could not recover due to empty docker volume 
checkpointed files.
 Key: MESOS-9507
 URL: https://issues.apache.org/jira/browse/MESOS-9507
 Project: Mesos
  Issue Type: Bug
  Components: containerization
Reporter: Gilbert Song


The agent could not recover due to empty docker volume checkpointed files. 
Please see the logs:

{noformat}
Nov 12 17:12:00 guppy mesos-agent[38960]: E1112 17:12:00.978682 38969 
slave.cpp:6279] EXIT with status 1: Failed to perform recovery: Collect failed: 
Collect failed: Failed to recover docker volumes for orphan container 
e1b04051-1e4a-47a9-b866-1d625cda1d22: JSON parse failed: syntax error at line 1 
near:
Nov 12 17:12:00 guppy mesos-agent[38960]: To remedy this do as follows: 
Nov 12 17:12:00 guppy mesos-agent[38960]: Step 1: rm -f 
/var/lib/mesos/slave/meta/slaves/latest
Nov 12 17:12:00 guppy mesos-agent[38960]: This ensures agent doesn't recover 
old live executors.
Nov 12 17:12:00 guppy mesos-agent[38960]: Step 2: Restart the agent. 
Nov 12 17:12:00 guppy systemd[1]: dcos-mesos-slave.service: main process 
exited, code=exited, status=1/FAILURE
Nov 12 17:12:00 guppy systemd[1]: Unit dcos-mesos-slave.service entered failed 
state.
Nov 12 17:12:00 guppy systemd[1]: dcos-mesos-slave.service failed.
{noformat}
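
One possible direction, sketched under the assumption that an empty checkpoint 
file simply means the volumes were never fully checkpointed (this is not the 
actual fix): treat an empty file like a missing one during recovery instead of 
feeding it to the JSON parser.

{code:cpp}
// Illustrative sketch, not the actual fix: treat an empty docker volume
// checkpoint file as "nothing checkpointed" during recovery instead of
// handing it to the JSON parser and aborting agent recovery.
#include <fstream>
#include <optional>
#include <sstream>
#include <string>

std::optional<std::string> readCheckpoint(const std::string& path)
{
  std::ifstream file(path);
  if (!file) {
    return std::nullopt;  // Missing file: no volumes were checkpointed.
  }

  std::stringstream buffer;
  buffer << file.rdbuf();

  const std::string contents = buffer.str();
  if (contents.empty()) {
    // An empty file most likely means the agent died mid-checkpoint; recover
    // as if there were no volumes rather than failing with a parse error.
    return std::nullopt;
  }

  return contents;  // Only non-empty contents reach the JSON parser.
}
{code}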



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9371) `FetcherCacheTest.RemoveLRUCacheEntries` is flaky.

2018-11-07 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-9371:
---

 Summary: `FetcherCacheTest.RemoveLRUCacheEntries` is flaky.
 Key: MESOS-9371
 URL: https://issues.apache.org/jira/browse/MESOS-9371
 Project: Mesos
  Issue Type: Bug
Reporter: Gilbert Song


{noformat}
[ RUN  ] FetcherCacheTest.RemoveLRUCacheEntries
I1107 13:20:28.161957 39728 cluster.cpp:172] Creating default 'local' authorizer
I1107 13:20:28.165024 39777 master.cpp:457] Master 
f7977f54-bb98-445f-9610-ede08bf34093 (core-dev) started on 10.0.49.2:41973
I1107 13:20:28.165122 39777 master.cpp:459] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="hierarchical" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" 
--authenticators="crammd5" --authorizers="local" 
--credentials="/tmp/ds9Byj/credentials" --filter_gpu_resources="true" 
--framework_sorter="drf" --help="false" --hostname_lookup="true" 
--http_authenticators="basic" --http_framework_authenticators="basic" 
--initialize_driver_logging="true" --log_auto_initialize="true" 
--logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
--max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
--max_unreachable_tasks_per_framework="1000" 
--min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--require_agent_domain="false" --role_sorter="drf" --root_submissions="true" 
--version="false" --webui_dir="/usr/local/share/mesos/webui" 
--work_dir="/tmp/ds9Byj/master" --zk_session_timeout="10secs"
I1107 13:20:28.169046 39777 master.cpp:508] Master only allowing authenticated 
frameworks to register
I1107 13:20:28.169070 39777 master.cpp:514] Master only allowing authenticated 
agents to register
I1107 13:20:28.169093 39777 master.cpp:520] Master only allowing authenticated 
HTTP frameworks to register
I1107 13:20:28.169121 39777 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/ds9Byj/credentials'
I1107 13:20:28.169910 39777 master.cpp:564] Using default 'crammd5' 
authenticator
I1107 13:20:28.170156 39777 authenticator.cpp:520] Initializing server SASL
I1107 13:20:28.171118 39777 http.cpp:1045] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I1107 13:20:28.171545 39777 http.cpp:1045] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I1107 13:20:28.171684 39777 http.cpp:1045] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I1107 13:20:28.171854 39777 master.cpp:643] Authorization enabled
I1107 13:20:28.185091 39744 master.cpp:2247] Elected as the leading master!
I1107 13:20:28.185139 39744 master.cpp:1727] Recovering from registrar
I1107 13:20:28.188170 39752 registrar.cpp:391] Successfully fetched the 
registry (0B) in 2.571008ms
I1107 13:20:28.188540 39752 registrar.cpp:495] Applied 1 operations in 
122344ns; attempting to update the registry
I1107 13:20:28.191902 39752 registrar.cpp:552] Successfully updated the 
registry in 3.163904ms
I1107 13:20:28.192203 39752 registrar.cpp:424] Successfully recovered registrar
I1107 13:20:28.193151 39756 master.cpp:1840] Recovered 0 agents from the 
registry (123B); allowing 10mins for agents to re-register
I1107 13:20:28.194456 39728 containerizer.cpp:304] Using isolation { 
environment_secret, posix/cpu, posix/mem, filesystem/posix, network/cni }
W1107 13:20:28.195065 39728 backend.cpp:76] Failed to create 'overlay' backend: 
OverlayBackend requires root privileges
W1107 13:20:28.195108 39728 backend.cpp:76] Failed to create 'bind' backend: 
BindBackend requires root privileges
I1107 13:20:28.195156 39728 provisioner.cpp:299] Using default backend 'copy'
W1107 13:20:28.198040 39728 process.cpp:2745] Attempted to spawn already 
running process files@10.0.49.2:41973
I1107 13:20:28.198199 39728 cluster.cpp:460] Creating default 'local' authorizer
I1107 13:20:28.199762 39789 slave.cpp:261] Mesos agent started on 
(1)@10.0.49.2:41973
I1107 13:20:28.199816 39789 slave.cpp:262] Flags at startup: --acls="" 
--appc_simple_discovery_uri_prefix="http://; 
--appc_store_dir="/tmp/FetcherCacheTest_RemoveLRUCacheEntries_bzpwJi/store/appc"
 --authenticate_http_readonly="true" --authenticate_http_readwrite="false" 
--authenticatee="crammd5" --authentication_backoff_factor="1secs" 
--authentication_timeout_max="1mins" --authentication_timeout_min="5secs" 
--authorizer="local" --cgroups_cpu_enable_pids_and_tids_count="false" 

[jira] [Commented] (MESOS-8545) AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky.

2018-10-31 Thread Gilbert Song (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16670547#comment-16670547
 ] 

Gilbert Song commented on MESOS-8545:
-

commit a296b820d4d4a25f47caaa3870bc56c6437dd63e
Author: Andrei Budnik 
Date:   Wed Oct 31 11:37:07 2018 -0700

Fixed compile errors on clang 3.5.

Review: https://reviews.apache.org/r/69217/

> AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky.
> ---
>
> Key: MESOS-8545
> URL: https://issues.apache.org/jira/browse/MESOS-8545
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.5.0, 1.6.1, 1.7.0
>Reporter: Andrei Budnik
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: Mesosphere, flaky-test
> Fix For: 1.5.2, 1.6.2, 1.7.1, 1.8.0
>
> Attachments: 
> AgentAPIStreamingTest.AttachInputToNestedContainerSession-badrun.txt, 
> AgentAPIStreamingTest.AttachInputToNestedContainerSession-badrun2.txt
>
>
> {code:java}
> I0205 17:11:01.091872 4898 http_proxy.cpp:132] Returning '500 Internal Server 
> Error' for '/slave(974)/api/v1' (Disconnected)
> /home/centos/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-centos-7/mesos/src/tests/api_tests.cpp:6596:
>  Failure
> Value of: (response).get().status
> Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: "Disconnected"
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9320) UCR container launch stuck at PROVISIONING during image fetching.

2018-10-26 Thread Gilbert Song (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16665793#comment-16665793
 ] 

Gilbert Song commented on MESOS-9320:
-

The root cause has been found. This issue shares the same root cause as 
MESOS-9334, but with a different symptom.

*Reproduce*
We could reproduce this issue easily under certain conditions:
On a single machine, scale up to 100 tasks running a `sleep 10` command and 
have the scheduler relaunch a task whenever one finishes. This creates a 
situation in which there are always some containers being destroyed. Then, 
launch a container using an image with a lot of layers 
(`kvish/jenkins-dev:595c74f713f609fd1d3b05a40d35113fc03227c9` in this case, 
with ~90 layers). As expected, this container launch gets stuck.

*Root cause*
This is a race between a container launch and another container being 
destroyed. It is triggered by a bug in the cgroups event listener's finalize() 
method: the event FD is closed before the io::read() future's .onDiscard() 
callback, which cleans up the event in the libevent poll loop, has finished. 
At the bottom of this issue, libevent supports multiple events pointing to the 
same FD, but that becomes a race here because the new container launch's 
io::read() picks up an FD released by another container's destroy, and 
libevent disables the FD when it cleans up one of the events that has not been 
cleaned up yet. Please see MESOS-9334 for details.

*Fix*
https://reviews.apache.org/r/69123/
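
The sketch below is illustrative only (see the review above for the real 
patch); it shows the ordering the fix needs in the listener's finalize(): 
discard the in-flight io::read() and let its cleanup run before the event FD 
is closed and becomes reusable by another container.

{code:cpp}
// Illustrative sketch only (see the review above for the actual patch): the
// in-flight io::read() on the event FD must be discarded, and its cleanup
// must have run (removing the event from the libevent loop), before the FD
// is closed and can be picked up by another container's listener.
#include <unistd.h>

#include <string>

#include <process/future.hpp>

using process::Future;

class Listener
{
public:
  void finalize()
  {
    // Buggy order: close(eventfd) first, letting a concurrent container
    // launch reuse the FD while a stale libevent event still points at it.
    //
    // Sketched fix: request the discard, then close the FD only once the
    // read future has actually transitioned.
    const int fd = eventfd;
    reading.discard();
    reading.onAny([fd](const Future<std::string>&) {
      ::close(fd);
    });
  }

private:
  int eventfd = -1;
  Future<std::string> reading;  // Assumed to hold the result of io::read(eventfd).
};
{code}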

> UCR container launch stuck at PROVISIONING during image fetching.
> -
>
> Key: MESOS-9320
> URL: https://issues.apache.org/jira/browse/MESOS-9320
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Gilbert Song
>Assignee: Gilbert Song
>Priority: Major
>  Labels: containerizer
>
> We observed the Mesos containerizer stuck at PROVISIONING when launching a 
> Mesos container using the docker image 
> `kvish/jenkins-dev:595c74f713f609fd1d3b05a40d35113fc03227c9`:
> The image pull never finishes. Incomplete image contents remain in the image 
> store staging directory /var/lib/mesos/slave/store/docker/staging/egLYqO 
> indefinitely.
> {noformat}
> OK-22:50:06-root@int-agent89-mwst9:/var/lib/mesos/slave/store/docker/staging/egLYqO
>  # ls -alh
> total 1.1G
> drwx--. 2 root root 4.0K Oct 15 13:02 .
> drwxr-xr-x. 3 root root   20 Oct 15 22:40 ..
> -rw-r--r--. 1 root root  59K Oct 15 13:02 manifest
> -rw-r--r--. 1 root root 2.6K Oct 15 13:02 
> sha256:08239cb71d7a3e0d8ed680397590b338a2133117250e1a3e2ee5c5c45292db63
> -rw-r--r--. 1 root root  440 Oct 15 13:02 
> sha256:0984904c0e1558248eb25e93d9fc14c47c0052d58569e64c185afca93a060b66
> -rw-r--r--. 1 root root  248 Oct 15 13:02 
> sha256:0bbc7b377a9155696eb0b684bd1999bc43937918552d73fd9697ea50ef46528a
> -rw-r--r--. 1 root root  240 Oct 15 13:02 
> sha256:0c5c0c095e351b976943453c80271f3b75b1208dbad3ca7845332e873361f3bb
> -rw-r--r--. 1 root root  562 Oct 15 13:02 
> sha256:1558b7c35c9e25577ee719529d6fcdddebea68f5bdf8cbdf13d8d75a02f8a5b1
> -rw-r--r--. 1 root root  11M Oct 15 13:02 
> sha256:1ab373b3deaed929a15574ac1912afc6e173f80d400aba0e96c89f6a58961f2d
> -rw-r--r--. 1 root root  130 Oct 15 13:02 
> sha256:1b6c70b3786f72e5255ccd51e27840d1c853a17561b5e94a4359b17d27494d50
> -rw-r--r--. 1 root root  176 Oct 15 13:02 
> sha256:1bf4aab5c3b363b4fdfc46026df9ae854db8858a5cbcccdd4409434817d59312
> -rw-r--r--. 1 root root  380 Oct 15 13:02 
> sha256:213b0c5bb5300df1d2d06df6213ae94448419cf18ecf61358e978a5d25651d5a
> -rw-r--r--. 1 root root  71M Oct 15 13:02 
> sha256:31aaab384e3fa66b73eced4870fc96be590a2376e93fd4f8db5d00f94fb11604
> -rw-r--r--. 1 root root 1.4K Oct 15 13:02 
> sha256:32442b7d159ed2b7f00b00a989ca1d3ee1a3f566df5d5acbd25f0c3dfdad69d1
> -rw-r--r--. 1 root root 653K Oct 15 13:02 
> sha256:340cd692075b636b5e1803fcde9b1a56a2f6e2728e4fb10f7295d39c7d0e0d01
> -rw-r--r--. 1 root root  184 Oct 15 13:02 
> sha256:398819b00c6cbf9cce6c1ed25005c9e1242cace7a6436730e17da052000c7f90
> -rw-r--r--. 1 root root 366K Oct 15 13:02 
> sha256:41d78c0cb1b2a47189068e55f61d6266be14c4fa75935cb021f17668dd8e7f94
> -rw-r--r--. 1 root root  23K Oct 15 13:02 
> sha256:4f5852c22c7ce0155494b6e86a0a4c536c3c95cb87cad84806aa2d56184b95d2
> -rw-r--r--. 1 root root 384M Oct 15 13:02 
> sha256:4fe621515c4d23e33d9850a6cdfc3aa686d790704b9c5569f1726b4469aa30c0
> -rw-r--r--. 1 root root 1.5K Oct 15 13:02 
> sha256:50dcd1d0618b1d42bf6633dc8176e164571081494fa6483ec4489a59637518bc
> -rw-r--r--. 1 root root  48M Oct 15 13:02 
> sha256:57c8de432dbe337bb6cb1ad328e6c564303a3d3fd05b5e872fd9c47c16fdd02c
> -rw-r--r--. 1 root root  30M Oct 15 13:02 
> sha256:63a0f0b6b5d7014b647ac4a164808208229d2e3219f45a39914f0561a4f831bf
> -rw-r--r--. 1 root root 306M Oct 15 13:02 
> 

[jira] [Assigned] (MESOS-9320) UCR container launch stuck at PROVISIONING during image fetching.

2018-10-25 Thread Gilbert Song (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song reassigned MESOS-9320:
---

Assignee: Gilbert Song

> UCR container launch stuck at PROVISIONING during image fetching.
> -
>
> Key: MESOS-9320
> URL: https://issues.apache.org/jira/browse/MESOS-9320
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Gilbert Song
>Assignee: Gilbert Song
>Priority: Major
>  Labels: containerizer
>
> We observed the Mesos containerizer stuck at PROVISIONING when launching a 
> Mesos container using the docker image 
> `kvish/jenkins-dev:595c74f713f609fd1d3b05a40d35113fc03227c9`:
> The image pull never finishes. Incomplete image contents remain in the image 
> store staging directory /var/lib/mesos/slave/store/docker/staging/egLYqO 
> indefinitely.
> {noformat}
> OK-22:50:06-root@int-agent89-mwst9:/var/lib/mesos/slave/store/docker/staging/egLYqO
>  # ls -alh
> total 1.1G
> drwx--. 2 root root 4.0K Oct 15 13:02 .
> drwxr-xr-x. 3 root root   20 Oct 15 22:40 ..
> -rw-r--r--. 1 root root  59K Oct 15 13:02 manifest
> -rw-r--r--. 1 root root 2.6K Oct 15 13:02 
> sha256:08239cb71d7a3e0d8ed680397590b338a2133117250e1a3e2ee5c5c45292db63
> -rw-r--r--. 1 root root  440 Oct 15 13:02 
> sha256:0984904c0e1558248eb25e93d9fc14c47c0052d58569e64c185afca93a060b66
> -rw-r--r--. 1 root root  248 Oct 15 13:02 
> sha256:0bbc7b377a9155696eb0b684bd1999bc43937918552d73fd9697ea50ef46528a
> -rw-r--r--. 1 root root  240 Oct 15 13:02 
> sha256:0c5c0c095e351b976943453c80271f3b75b1208dbad3ca7845332e873361f3bb
> -rw-r--r--. 1 root root  562 Oct 15 13:02 
> sha256:1558b7c35c9e25577ee719529d6fcdddebea68f5bdf8cbdf13d8d75a02f8a5b1
> -rw-r--r--. 1 root root  11M Oct 15 13:02 
> sha256:1ab373b3deaed929a15574ac1912afc6e173f80d400aba0e96c89f6a58961f2d
> -rw-r--r--. 1 root root  130 Oct 15 13:02 
> sha256:1b6c70b3786f72e5255ccd51e27840d1c853a17561b5e94a4359b17d27494d50
> -rw-r--r--. 1 root root  176 Oct 15 13:02 
> sha256:1bf4aab5c3b363b4fdfc46026df9ae854db8858a5cbcccdd4409434817d59312
> -rw-r--r--. 1 root root  380 Oct 15 13:02 
> sha256:213b0c5bb5300df1d2d06df6213ae94448419cf18ecf61358e978a5d25651d5a
> -rw-r--r--. 1 root root  71M Oct 15 13:02 
> sha256:31aaab384e3fa66b73eced4870fc96be590a2376e93fd4f8db5d00f94fb11604
> -rw-r--r--. 1 root root 1.4K Oct 15 13:02 
> sha256:32442b7d159ed2b7f00b00a989ca1d3ee1a3f566df5d5acbd25f0c3dfdad69d1
> -rw-r--r--. 1 root root 653K Oct 15 13:02 
> sha256:340cd692075b636b5e1803fcde9b1a56a2f6e2728e4fb10f7295d39c7d0e0d01
> -rw-r--r--. 1 root root  184 Oct 15 13:02 
> sha256:398819b00c6cbf9cce6c1ed25005c9e1242cace7a6436730e17da052000c7f90
> -rw-r--r--. 1 root root 366K Oct 15 13:02 
> sha256:41d78c0cb1b2a47189068e55f61d6266be14c4fa75935cb021f17668dd8e7f94
> -rw-r--r--. 1 root root  23K Oct 15 13:02 
> sha256:4f5852c22c7ce0155494b6e86a0a4c536c3c95cb87cad84806aa2d56184b95d2
> -rw-r--r--. 1 root root 384M Oct 15 13:02 
> sha256:4fe621515c4d23e33d9850a6cdfc3aa686d790704b9c5569f1726b4469aa30c0
> -rw-r--r--. 1 root root 1.5K Oct 15 13:02 
> sha256:50dcd1d0618b1d42bf6633dc8176e164571081494fa6483ec4489a59637518bc
> -rw-r--r--. 1 root root  48M Oct 15 13:02 
> sha256:57c8de432dbe337bb6cb1ad328e6c564303a3d3fd05b5e872fd9c47c16fdd02c
> -rw-r--r--. 1 root root  30M Oct 15 13:02 
> sha256:63a0f0b6b5d7014b647ac4a164808208229d2e3219f45a39914f0561a4f831bf
> -rw-r--r--. 1 root root 306M Oct 15 13:02 
> sha256:67f41ed73c082c6ffee553a90b0abd56bc74b260d90b9d594d652b66cbcd5e7f
> -rw-r--r--. 1 root root  435 Oct 15 13:02 
> sha256:6cb303e084ed78386ae87cdaf95e8817d48e94b3ce7c0442a28335600f0efa3d
> -rw-r--r--. 1 root root 5.5K Oct 15 13:02 
> sha256:7d4d905c2060a5ec994ec201e6877714ee73030ef4261f9562abdb0f844174d5
> -rw-r--r--. 1 root root  39M Oct 15 13:02 
> sha256:80d923f4b955c2db89e2e8a9f2dcb0c36a29c1520a5b359578ce2f3d0b849d10
> -rw-r--r--. 1 root root  615 Oct 15 13:02 
> sha256:842cc8bd099d94f6f9c082785bbaa35439af965d1cf6a13300830561427c266b
> -rw-r--r--. 1 root root  712 Oct 15 13:02 
> sha256:977c8e6687e0ca5f0682915102c025dc12d7ff71bf70de17aab3502adda25af2
> -rw-r--r--. 1 root root  12K Oct 15 13:02 
> sha256:989ac24c53a1f7951438aa92ac39bc9053c178336bea4ebe6ab733d4975c9728
> -rw-r--r--. 1 root root  861 Oct 15 13:02 
> sha256:a18e3c45bf91ac3bd11a46b489fb647a721417f60eae66c5f605360ccd8d6352
> -rw-r--r--. 1 root root   32 Oct 15 13:02 
> sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4
> -rw-r--r--. 1 root root 266K Oct 15 13:02 
> sha256:b1d3e8de8ec6d87b8485a8a3b66d63125a033cfb0711f8af24b4f600f524e276
> -rw-r--r--. 1 root root 1.6K Oct 15 13:02 
> sha256:b3a122ff7868d2ed9c063df73b0bf67fd77348d3baa2a92368b3479b41f8aa74
> -rw-r--r--. 1 root root 4.2M Oct 15 13:02 
> 

[jira] [Commented] (MESOS-8128) Make os::pipe file descriptors O_CLOEXEC.

2018-10-17 Thread Gilbert Song (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16654222#comment-16654222
 ] 

Gilbert Song commented on MESOS-8128:
-

commit f9627b90521292add41432d15b4c12e036f94ca7
Author: Gilbert Song 
Date:   Wed Oct 17 00:58:55 2018 -0700

Fixed the FreeBSD MACRO as '__FreeBSD__' in posix/pipe.hpp.

Review: https://reviews.apache.org/r/69059

> Make os::pipe file descriptors O_CLOEXEC.
> -
>
> Key: MESOS-8128
> URL: https://issues.apache.org/jira/browse/MESOS-8128
> Project: Mesos
>  Issue Type: Bug
>  Components: stout
>Reporter: James Peach
>Assignee: James Peach
>Priority: Critical
>  Labels: mesosphere
> Fix For: 1.4.3, 1.5.2, 1.6.2, 1.7.0
>
>
> File descriptors from {{os::pipe}} will be inherited across exec. On Linux we 
> can use [pipe2|http://man7.org/linux/man-pages/man2/pipe.2.html] to 
> atomically make the pipe {{O_CLOEXEC}}.
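
For reference, a minimal Linux-only sketch of the approach described above 
(illustrative; the actual change is in stout's os::pipe): pipe2() creates both 
ends with O_CLOEXEC atomically, so there is no window in which a concurrently 
forked child can inherit them across exec.

{code:cpp}
// Minimal Linux-only sketch: create a pipe whose descriptors carry O_CLOEXEC
// from the start, so a concurrently forked child cannot inherit them across
// exec (g++ defines _GNU_SOURCE by default, which exposes pipe2()).
#include <fcntl.h>
#include <unistd.h>

#include <array>
#include <cerrno>
#include <cstring>
#include <iostream>

int main()
{
  std::array<int, 2> fds;

  if (::pipe2(fds.data(), O_CLOEXEC) != 0) {
    std::cerr << "pipe2 failed: " << std::strerror(errno) << std::endl;
    return 1;
  }

  // fds[0] is the read end, fds[1] the write end; both already have
  // FD_CLOEXEC set, with no racy post-hoc fcntl(F_SETFD) needed.
  ::close(fds[0]);
  ::close(fds[1]);

  return 0;
}
{code}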



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9320) UCR container launch stuck at PROVISIONING during image fetching.

2018-10-16 Thread Gilbert Song (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652541#comment-16652541
 ] 

Gilbert Song commented on MESOS-9320:
-

Related agent logs:
{noformat}
6-b105-164a40b22d84 because: Container does not exist
Oct 15 13:04:20 int-agent89-mwst9.scaletesting.mesosphe.re mesos-agent[24626]: 
W1015 13:04:20.038676 24706 containerizer.cpp:2401] Skipping status for 
container 528b3a48-4c77-4ac6-b105-164a40b22d84 because: Unknown container
Oct 15 13:03:20 int-agent89-mwst9.scaletesting.mesosphe.re mesos-agent[24626]: 
W1015 13:03:20.084156 24687 containerizer.cpp:2308] Skipping resource statistic 
for container 528b3a48-4c77-4ac6-b105-164a40b22d84 because: Unknown container
Oct 15 13:03:20 int-agent89-mwst9.scaletesting.mesosphe.re mesos-agent[24626]: 
W1015 13:03:20.084116 24687 containerizer.cpp:2308] Skipping resource statistic 
for container 528b3a48-4c77-4ac6-b105-164a40b22d84 because: Unknown container
Oct 15 13:03:20 int-agent89-mwst9.scaletesting.mesosphe.re mesos-agent[24626]: 
W1015 13:03:20.083827 24676 containerizer.cpp:2401] Skipping status for 
container 528b3a48-4c77-4ac6-b105-164a40b22d84 because: Container does not exist
Oct 15 13:03:20 int-agent89-mwst9.scaletesting.mesosphe.re mesos-agent[24626]: 
W1015 13:03:20.083788 24676 containerizer.cpp:2401] Skipping status for 
container 528b3a48-4c77-4ac6-b105-164a40b22d84 because: Unknown container
Oct 15 13:02:26 int-agent89-mwst9.scaletesting.mesosphe.re mesos-agent[24626]: 
I1015 13:02:26.119920 24652 containerizer.cpp:1282] Starting container 
528b3a48-4c77-4ac6-b105-164a40b22d84
Oct 15 13:02:26 int-agent89-mwst9.scaletesting.mesosphe.re mesos-agent[24626]: 
I1015 13:02:26.117457 24688 slave.cpp:3530] Launching container 
528b3a48-4c77-4ac6-b105-164a40b22d84 for executor 
'jenkins156.8e56a300-d07a-11e8-a55b-82d63c4c3187.1' of framework 
8fe06737-a867-4265-a059-091e45611af9-0004
Oct 15 13:02:26 int-agent89-mwst9.scaletesting.mesosphe.re mesos-agent[24626]: 
I1015 13:02:26.116492 24688 slave.cpp:8997] Launching executor 
'jenkins156.8e56a300-d07a-11e8-a55b-82d63c4c3187.1' of framework 
8fe06737-a867-4265-a059-091e45611af9-0004 with resources 
[{"allocation_info":{"role":"mom-4"},"name":"cpus","scalar":{"value":0.1},"type":"SCALAR"},{"allocation_info":{"role":"mom-4"},"name":"mem","scalar":{"value":32.0},"type":"SCALAR"}]
 in work directory 
'/var/lib/mesos/slave/slaves/113720af-edaf-4ff1-b0da-b3b381a0a32f-S91/frameworks/8fe06737-a867-4265-a059-091e45611af9-0004/executors/jenkins156.8e56a300-d07a-11e8-a55b-82d63c4c3187.1/runs/528b3a48-4c77-4ac6-b105-164a40b22d84'
Oct 15 13:02:26 int-agent89-mwst9.scaletesting.mesosphe.re mesos-agent[24626]: 
I1015 13:02:26.116279 24688 paths.cpp:748] Creating sandbox 
'/var/lib/mesos/slave/meta/slaves/113720af-edaf-4ff1-b0da-b3b381a0a32f-S91/frameworks/8fe06737-a867-4265-a059-091e45611af9-0004/executors/jenkins156.8e56a300-d07a-11e8-a55b-82d63c4c3187.1/runs/528b3a48-4c77-4ac6-b105-164a40b22d84'
Oct 15 13:02:26 int-agent89-mwst9.scaletesting.mesosphe.re mesos-agent[24626]: 
I1015 13:02:26.115389 24688 paths.cpp:745] Creating sandbox 
'/var/lib/mesos/slave/slaves/113720af-edaf-4ff1-b0da-b3b381a0a32f-S91/frameworks/8fe06737-a867-4265-a059-091e45611af9-0004/executors/jenkins156.8e56a300-d07a-11e8-a55b-82d63c4c3187.1/runs/528b3a48-4c77-4ac6-b105-164a40b22d84'
 for user 'root'
{noformat}

> UCR container launch stuck at PROVISIONING during image fetching.
> -
>
> Key: MESOS-9320
> URL: https://issues.apache.org/jira/browse/MESOS-9320
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Gilbert Song
>Priority: Major
>  Labels: containerizer
>
> We observed the Mesos containerizer stuck at PROVISIONING when launching a 
> Mesos container using the docker image 
> `kvish/jenkins-dev:595c74f713f609fd1d3b05a40d35113fc03227c9`:
> The image pull never finishes. Incomplete image contents remain in the image 
> store staging directory /var/lib/mesos/slave/store/docker/staging/egLYqO 
> indefinitely.
> {noformat}
> OK-22:50:06-root@int-agent89-mwst9:/var/lib/mesos/slave/store/docker/staging/egLYqO
>  # ls -alh
> total 1.1G
> drwx--. 2 root root 4.0K Oct 15 13:02 .
> drwxr-xr-x. 3 root root   20 Oct 15 22:40 ..
> -rw-r--r--. 1 root root  59K Oct 15 13:02 manifest
> -rw-r--r--. 1 root root 2.6K Oct 15 13:02 
> sha256:08239cb71d7a3e0d8ed680397590b338a2133117250e1a3e2ee5c5c45292db63
> -rw-r--r--. 1 root root  440 Oct 15 13:02 
> sha256:0984904c0e1558248eb25e93d9fc14c47c0052d58569e64c185afca93a060b66
> -rw-r--r--. 1 root root  248 Oct 15 13:02 
> sha256:0bbc7b377a9155696eb0b684bd1999bc43937918552d73fd9697ea50ef46528a
> -rw-r--r--. 1 root root  240 Oct 15 13:02 
> sha256:0c5c0c095e351b976943453c80271f3b75b1208dbad3ca7845332e873361f3bb
> -rw-r--r--. 

[jira] [Commented] (MESOS-9322) Executor exited accidentally, but mesos-agent did not report TASK_FAILED event.

2018-10-16 Thread Gilbert Song (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651239#comment-16651239
 ] 

Gilbert Song commented on MESOS-9322:
-

It seems like your executor could not reregister with the agent after agent 
recovery. The default executor reregistration timeout is 2 seconds. If you 
always observe this issue after your operator restarts the agent process, try 
bumping it to 10 seconds.
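
Assuming the agent flag involved here is --executor_reregistration_timeout 
(which defaults to 2secs), the bump would look something like:
{noformat}
mesos-agent ... --executor_reregistration_timeout=10secs
{noformat}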

> Executor exited accidentally, but mesos-agent did not report TASK_FAILED 
> event.
> ---
>
> Key: MESOS-9322
> URL: https://issues.apache.org/jira/browse/MESOS-9322
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.4.1
> Environment: Linux n14-068-081 4.4.0-33.bm.1-amd64 #1 SMP Fri, 01 Sep 
> 2017 18:36:21 +0800 x86_64 GNU/Linux
> OS: debion 8.10
> mesos version: 1.4.1
>Reporter: Shiwei Guo
>Priority: Major
>
> The log about this executor:
> executorid: 
> 'gn:aweme.recommend.cypher_recent.default;ps:aweme.recommend.cypher_recent_default;sg:263;tp:Companion;nm:aweme_cypher_recent;executor:systemd-mesos-executor-0.2.10.tar.gz'
>  
> {quote}{{I0914 10:40:36.448287 2505 slave.cpp:7336] Recovering executor 
> 'gn:aweme.recommend.cypher_recent.default;ps:aweme.recommend.cypher_recent_default;sg:263;tp:Companion;nm:aweme_cypher_recent;executor:systemd-mesos-executor-0.2.10.tar.gz'
>  of framework ae7c9e78-e0b7-4110-8092-52baf64e4f67-}}
> {{I0914 10:40:36.479209 2511 gc.cpp:58] Scheduling 
> '/opt/tiger/mesos_deploy/mesos_titan/slave/slaves/03def54c-f3f0-4ea5-a886-93fae5e570fa-S3473/frameworks/ae7c9e78-e0b7-4110-8092-52baf64e4f67-/executors/gn:aweme.recommend.cypher_recent.default;ps:aweme.recommend.cypher_recent_default;sg:263;tp:Companion;nm:aweme_cypher_recent;executor:systemd-mesos-executor-0.2.10.tar.gz/runs/189e4b23-c892-4c87-9069-dfc98ca5edc8'
>  for gc 3.1546935280563days in the future}}
> {{I0914 10:40:36.479287 2511 gc.cpp:58] Scheduling 
> '/opt/tiger/mesos_deploy/mesos_titan/slave/meta/slaves/03def54c-f3f0-4ea5-a886-93fae5e570fa-S3473/frameworks/ae7c9e78-e0b7-4110-8092-52baf64e4f67-/executors/gn:aweme.recommend.cypher_recent.default;ps:aweme.recommend.cypher_recent_default;sg:263;tp:Companion;nm:aweme_cypher_recent;executor:systemd-mesos-executor-0.2.10.tar.gz/runs/189e4b23-c892-4c87-9069-dfc98ca5edc8'
>  for gc 3.15469352761481days in the future}}
> {{I0914 10:40:36.479310 2511 gc.cpp:58] Scheduling 
> '/opt/tiger/mesos_deploy/mesos_titan/slave/slaves/03def54c-f3f0-4ea5-a886-93fae5e570fa-S3473/frameworks/ae7c9e78-e0b7-4110-8092-52baf64e4f67-/executors/gn:aweme.recommend.cypher_recent.default;ps:aweme.recommend.cypher_recent_default;sg:263;tp:Companion;nm:aweme_cypher_recent;executor:systemd-mesos-executor-0.2.10.tar.gz/runs/4b27d1d4-fe67-4475-88bc-14e994acfb85'
>  for gc -1.02171850967407days in the future}}
> {{I0914 10:40:36.479337 2511 gc.cpp:58] Scheduling 
> '/opt/tiger/mesos_deploy/mesos_titan/slave/meta/slaves/03def54c-f3f0-4ea5-a886-93fae5e570fa-S3473/frameworks/ae7c9e78-e0b7-4110-8092-52baf64e4f67-/executors/gn:aweme.recommend.cypher_recent.default;ps:aweme.recommend.cypher_recent_default;sg:263;tp:Companion;nm:aweme_cypher_recent;executor:systemd-mesos-executor-0.2.10.tar.gz/runs/4b27d1d4-fe67-4475-88bc-14e994acfb85'
>  for gc -1.02171850987259days in the future}}
> {{I0914 10:40:36.480459 2514 gc.cpp:169] Deleting 
> /opt/tiger/mesos_deploy/mesos_titan/slave/slaves/03def54c-f3f0-4ea5-a886-93fae5e570fa-S3473/frameworks/ae7c9e78-e0b7-4110-8092-52baf64e4f67-/executors/gn:aweme.recommend.cypher_recent.default;ps:aweme.recommend.cypher_recent_default;sg:263;tp:Companion;nm:aweme_cypher_recent;executor:systemd-mesos-executor-0.2.10.tar.gz/runs/4b27d1d4-fe67-4475-88bc-14e994acfb85}}
> {{I0914 10:40:36.552492 2516 status_update_manager.cpp:211] Recovering 
> executor 
> 'gn:aweme.recommend.cypher_recent.default;ps:aweme.recommend.cypher_recent_default;sg:263;tp:Companion;nm:aweme_cypher_recent;executor:systemd-mesos-executor-0.2.10.tar.gz'
>  of framework ae7c9e78-e0b7-4110-8092-52baf64e4f67-}}
> {{I0914 10:40:36.553234 2519 containerizer.cpp:665] Recovering container 
> 106c7257-fabb-4d58-8fcb-89b15bb9d404 for executor 
> 'gn:aweme.recommend.cypher_recent.default;ps:aweme.recommend.cypher_recent_default;sg:263;tp:Companion;nm:aweme_cypher_recent;executor:systemd-mesos-executor-0.2.10.tar.gz'
>  of framework ae7c9e78-e0b7-4110-8092-52baf64e4f67-}}
> {{I0914 10:40:36.591421 2514 gc.cpp:177] Deleted 
> 

[jira] [Created] (MESOS-9320) UCR container launch stuck at PROVISIONING during image fetching.

2018-10-15 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-9320:
---

 Summary: UCR container launch stuck at PROVISIONING during image 
fetching.
 Key: MESOS-9320
 URL: https://issues.apache.org/jira/browse/MESOS-9320
 Project: Mesos
  Issue Type: Bug
  Components: containerization
Reporter: Gilbert Song


We observed the Mesos containerizer stuck at PROVISIONING when launching a 
Mesos container using the docker image 
`kvish/jenkins-dev:595c74f713f609fd1d3b05a40d35113fc03227c9`:

The image pull never finishes. Incomplete image contents remain in the image 
store staging directory /var/lib/mesos/slave/store/docker/staging/egLYqO 
indefinitely.
{noformat}
OK-22:50:06-root@int-agent89-mwst9:/var/lib/mesos/slave/store/docker/staging/egLYqO
 # ls -alh
total 1.1G
drwx--. 2 root root 4.0K Oct 15 13:02 .
drwxr-xr-x. 3 root root   20 Oct 15 22:40 ..
-rw-r--r--. 1 root root  59K Oct 15 13:02 manifest
-rw-r--r--. 1 root root 2.6K Oct 15 13:02 
sha256:08239cb71d7a3e0d8ed680397590b338a2133117250e1a3e2ee5c5c45292db63
-rw-r--r--. 1 root root  440 Oct 15 13:02 
sha256:0984904c0e1558248eb25e93d9fc14c47c0052d58569e64c185afca93a060b66
-rw-r--r--. 1 root root  248 Oct 15 13:02 
sha256:0bbc7b377a9155696eb0b684bd1999bc43937918552d73fd9697ea50ef46528a
-rw-r--r--. 1 root root  240 Oct 15 13:02 
sha256:0c5c0c095e351b976943453c80271f3b75b1208dbad3ca7845332e873361f3bb
-rw-r--r--. 1 root root  562 Oct 15 13:02 
sha256:1558b7c35c9e25577ee719529d6fcdddebea68f5bdf8cbdf13d8d75a02f8a5b1
-rw-r--r--. 1 root root  11M Oct 15 13:02 
sha256:1ab373b3deaed929a15574ac1912afc6e173f80d400aba0e96c89f6a58961f2d
-rw-r--r--. 1 root root  130 Oct 15 13:02 
sha256:1b6c70b3786f72e5255ccd51e27840d1c853a17561b5e94a4359b17d27494d50
-rw-r--r--. 1 root root  176 Oct 15 13:02 
sha256:1bf4aab5c3b363b4fdfc46026df9ae854db8858a5cbcccdd4409434817d59312
-rw-r--r--. 1 root root  380 Oct 15 13:02 
sha256:213b0c5bb5300df1d2d06df6213ae94448419cf18ecf61358e978a5d25651d5a
-rw-r--r--. 1 root root  71M Oct 15 13:02 
sha256:31aaab384e3fa66b73eced4870fc96be590a2376e93fd4f8db5d00f94fb11604
-rw-r--r--. 1 root root 1.4K Oct 15 13:02 
sha256:32442b7d159ed2b7f00b00a989ca1d3ee1a3f566df5d5acbd25f0c3dfdad69d1
-rw-r--r--. 1 root root 653K Oct 15 13:02 
sha256:340cd692075b636b5e1803fcde9b1a56a2f6e2728e4fb10f7295d39c7d0e0d01
-rw-r--r--. 1 root root  184 Oct 15 13:02 
sha256:398819b00c6cbf9cce6c1ed25005c9e1242cace7a6436730e17da052000c7f90
-rw-r--r--. 1 root root 366K Oct 15 13:02 
sha256:41d78c0cb1b2a47189068e55f61d6266be14c4fa75935cb021f17668dd8e7f94
-rw-r--r--. 1 root root  23K Oct 15 13:02 
sha256:4f5852c22c7ce0155494b6e86a0a4c536c3c95cb87cad84806aa2d56184b95d2
-rw-r--r--. 1 root root 384M Oct 15 13:02 
sha256:4fe621515c4d23e33d9850a6cdfc3aa686d790704b9c5569f1726b4469aa30c0
-rw-r--r--. 1 root root 1.5K Oct 15 13:02 
sha256:50dcd1d0618b1d42bf6633dc8176e164571081494fa6483ec4489a59637518bc
-rw-r--r--. 1 root root  48M Oct 15 13:02 
sha256:57c8de432dbe337bb6cb1ad328e6c564303a3d3fd05b5e872fd9c47c16fdd02c
-rw-r--r--. 1 root root  30M Oct 15 13:02 
sha256:63a0f0b6b5d7014b647ac4a164808208229d2e3219f45a39914f0561a4f831bf
-rw-r--r--. 1 root root 306M Oct 15 13:02 
sha256:67f41ed73c082c6ffee553a90b0abd56bc74b260d90b9d594d652b66cbcd5e7f
-rw-r--r--. 1 root root  435 Oct 15 13:02 
sha256:6cb303e084ed78386ae87cdaf95e8817d48e94b3ce7c0442a28335600f0efa3d
-rw-r--r--. 1 root root 5.5K Oct 15 13:02 
sha256:7d4d905c2060a5ec994ec201e6877714ee73030ef4261f9562abdb0f844174d5
-rw-r--r--. 1 root root  39M Oct 15 13:02 
sha256:80d923f4b955c2db89e2e8a9f2dcb0c36a29c1520a5b359578ce2f3d0b849d10
-rw-r--r--. 1 root root  615 Oct 15 13:02 
sha256:842cc8bd099d94f6f9c082785bbaa35439af965d1cf6a13300830561427c266b
-rw-r--r--. 1 root root  712 Oct 15 13:02 
sha256:977c8e6687e0ca5f0682915102c025dc12d7ff71bf70de17aab3502adda25af2
-rw-r--r--. 1 root root  12K Oct 15 13:02 
sha256:989ac24c53a1f7951438aa92ac39bc9053c178336bea4ebe6ab733d4975c9728
-rw-r--r--. 1 root root  861 Oct 15 13:02 
sha256:a18e3c45bf91ac3bd11a46b489fb647a721417f60eae66c5f605360ccd8d6352
-rw-r--r--. 1 root root   32 Oct 15 13:02 
sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4
-rw-r--r--. 1 root root 266K Oct 15 13:02 
sha256:b1d3e8de8ec6d87b8485a8a3b66d63125a033cfb0711f8af24b4f600f524e276
-rw-r--r--. 1 root root 1.6K Oct 15 13:02 
sha256:b3a122ff7868d2ed9c063df73b0bf67fd77348d3baa2a92368b3479b41f8aa74
-rw-r--r--. 1 root root 4.2M Oct 15 13:02 
sha256:b542772b417703c0311c0b90136091369bcd9c2176c0e3ceed5a0114d743ee3c
-rw-r--r--. 1 root root 1.1K Oct 15 13:02 
sha256:b6e3599b777bb2dd681fd84f174a7e0ce3cb01f5a84dcd3c771d0e999a39bc58
-rw-r--r--. 1 root root 2.8K Oct 15 13:02 
sha256:b970c9afc934d5e6bb524a6057342a1d1cc835972f047a805f436c540ee20747
-rw-r--r--. 1 root root 6.3M Oct 15 13:02 
sha256:b984f623b82721cc642c25cd4797f6c3d2c01b6b063c49905a97bb0a7f0725a5
-rw-r--r--. 1 root root 1.8K Oct 15 13:02 

[jira] [Comment Edited] (MESOS-9295) Nested container launch could fail if the agent upgrade with new cgroup subsystems.

2018-10-05 Thread Gilbert Song (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16639936#comment-16639936
 ] 

Gilbert Song edited comment on MESOS-9295 at 10/5/18 11:33 PM:
---

https://reviews.apache.org/r/68929/
https://reviews.apache.org/r/68941/


was (Author: vinodkone):
https://reviews.apache.org/r/68929/

> Nested container launch could fail if the agent upgrade with new cgroup 
> subsystems.
> ---
>
> Key: MESOS-9295
> URL: https://issues.apache.org/jira/browse/MESOS-9295
> Project: Mesos
>  Issue Type: Bug
>Reporter: Gilbert Song
>Assignee: Gilbert Song
>Priority: Major
>
> Nested container launch could fail if the agent is upgraded with new cgroup 
> subsystems, because the new cgroup subsystems do not exist in the parent 
> container's cgroup hierarchy.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9295) Nested container launch could fail if the agent upgrade with new cgroup subsystems.

2018-10-04 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-9295:
---

 Summary: Nested container launch could fail if the agent upgrade 
with new cgroup subsystems.
 Key: MESOS-9295
 URL: https://issues.apache.org/jira/browse/MESOS-9295
 Project: Mesos
  Issue Type: Bug
Reporter: Gilbert Song


Nested container launch could fail if the agent is upgraded with new cgroup 
subsystems, because the new cgroup subsystems do not exist in the parent 
container's cgroup hierarchy.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

