[jira] [Assigned] (MESOS-5662) Call parent class `SetUpTestCase` function in our test fixtures.

2016-10-25 Thread Manuwela Kanade (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manuwela Kanade reassigned MESOS-5662:
--

Assignee: Manuwela Kanade

> Call parent class `SetUpTestCase` function in our test fixtures.
> 
>
> Key: MESOS-5662
> URL: https://issues.apache.org/jira/browse/MESOS-5662
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Anand Mazumdar
>Assignee: Manuwela Kanade
>  Labels: mesosphere, newbie
>
> There are some occurrences in our code where we don't invoke the parent's 
> {{SetUpTestCase}} method from a child test fixture. This can become a problem 
> if someone later adds setup logic to that method in the parent class. It 
> would be good to do a sweep across the code and explicitly invoke the parent 
> class's method.
> Some examples (there are more):
> https://github.com/apache/mesos/blob/master/src/tests/mesos.cpp#L80
> https://github.com/apache/mesos/blob/master/src/tests/module_tests.cpp#L59
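
For illustration, a minimal gtest sketch of the pattern being asked for (this is
not the actual Mesos fixture code; the class names below are placeholders):

{code}
// Minimal sketch: the derived fixture explicitly forwards to its parent so
// any setup later added to the base class still runs.
#include <gtest/gtest.h>

class BaseFixture : public ::testing::Test
{
protected:
  static void SetUpTestCase()
  {
    // Per-test-case setup shared by all derived fixtures would go here.
  }
};

class DerivedFixture : public BaseFixture
{
protected:
  static void SetUpTestCase()
  {
    // Explicitly invoke the parent's method first.
    BaseFixture::SetUpTestCase();

    // Derived-fixture setup follows.
  }
};

TEST_F(DerivedFixture, Example) {}
{code}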



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-6480) Support for docker live-restore option in Mesos

2016-10-25 Thread Milind Chawre (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15607412#comment-15607412
 ] 

Milind Chawre edited comment on MESOS-6480 at 10/26/16 4:50 AM:


[~gilbert] Yes, it seems to be a duplicate. But my issue is slightly different: 
in my case I run all the infra services (Mesos master, Mesos worker, Marathon, 
etc.) as Docker containers rather than as standalone services.
This ticket can be closed, but when adding support for the live-restore option, 
the case where all infra services run as Docker containers should also be 
considered.



was (Author: milindchawre):
Yes, it seems to be a duplicate. But my issue is slightly different: in my case 
I run all the infra services (Mesos master, Mesos worker, Marathon, etc.) as 
Docker containers rather than as standalone services.
This ticket can be closed, but when adding support for the live-restore option, 
the case where all infra services run as Docker containers should also be 
considered.


> Support for docker live-restore option in Mesos
> ---
>
> Key: MESOS-6480
> URL: https://issues.apache.org/jira/browse/MESOS-6480
> Project: Mesos
>  Issue Type: Task
>Reporter: Milind Chawre
>
> Docker 1.12 supports the live-restore option, which keeps containers alive 
> during docker daemon downtime: https://docs.docker.com/engine/admin/live-restore/
> I tried to use this option in my Mesos setup and observed the following:
> 1. Stop the docker daemon on a Mesos worker node.
> 2. After some time, start the docker daemon again. All the containers running 
> on that node are still visible using "docker ps"; this is the expected 
> behaviour of the live-restore option.
> 3. The Mesos and Marathon UIs, however, show no active tasks on that node, 
> and the containers that are still running there have been rescheduled onto 
> other Mesos nodes. This is not right, since those containers are still 
> visible in the "docker ps" output thanks to the live-restore option.
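
For reference, live-restore is typically enabled through the daemon
configuration; a minimal sketch of {{/etc/docker/daemon.json}} per the linked
Docker documentation (the daemon then needs to be restarted or reloaded):

{code}
{
  "live-restore": true
}
{code}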



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6480) Support for docker live-restore option in Mesos

2016-10-25 Thread Milind Chawre (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15607412#comment-15607412
 ] 

Milind Chawre commented on MESOS-6480:
--

Yes, it seems to be a duplicate. But my issue is slightly different: in my case 
I run all the infra services (Mesos master, Mesos worker, Marathon, etc.) as 
Docker containers rather than as standalone services.
This ticket can be closed, but when adding support for the live-restore option, 
the case where all infra services run as Docker containers should also be 
considered.


> Support for docker live-restore option in Mesos
> ---
>
> Key: MESOS-6480
> URL: https://issues.apache.org/jira/browse/MESOS-6480
> Project: Mesos
>  Issue Type: Task
>Reporter: Milind Chawre
>
> Docker 1.12 supports the live-restore option, which keeps containers alive 
> during docker daemon downtime: https://docs.docker.com/engine/admin/live-restore/
> I tried to use this option in my Mesos setup and observed the following:
> 1. Stop the docker daemon on a Mesos worker node.
> 2. After some time, start the docker daemon again. All the containers running 
> on that node are still visible using "docker ps"; this is the expected 
> behaviour of the live-restore option.
> 3. The Mesos and Marathon UIs, however, show no active tasks on that node, 
> and the containers that are still running there have been rescheduled onto 
> other Mesos nodes. This is not right, since those containers are still 
> visible in the "docker ps" output thanks to the live-restore option.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6400) Not able to remove Orphan Tasks

2016-10-25 Thread Gilbert Song (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song updated MESOS-6400:

Priority: Critical  (was: Major)

> Not able to remove Orphan Tasks
> ---
>
> Key: MESOS-6400
> URL: https://issues.apache.org/jira/browse/MESOS-6400
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.1
> Environment: centos 7 x64
>Reporter: kasim
>Priority: Critical
>
> The problem may be caused by Mesos and Marathon being out of sync:
> https://github.com/mesosphere/marathon/issues/616
> When I noticed the orphan tasks, I did the following:
> 1. Restarted Marathon.
> 2. Marathon did not pick up the orphan tasks, but started new ones instead.
> 3. The orphan tasks still held their resources, so I had to delete them.
> 4. I found that all orphan tasks belong to framework 
> `ef169d8a-24fc-41d1-8b0d-c67718937a48-`;
> curl -XGET `http://c196:5050/master/frameworks` shows that framework under 
> `unregistered_frameworks`:
> {code}
> {
> "frameworks": [
> .
> ],
> "completed_frameworks": [ ],
> "unregistered_frameworks": [
> "ef169d8a-24fc-41d1-8b0d-c67718937a48-",
> "ef169d8a-24fc-41d1-8b0d-c67718937a48-",
> "ef169d8a-24fc-41d1-8b0d-c67718937a48-"
> ]
> }
> {code}
> 5. I tried {code}curl -XPOST http://c196:5050/master/teardown -d 
> 'frameworkId=ef169d8a-24fc-41d1-8b0d-c67718937a48-' {code}
> but got `No framework found with specified ID`.
> So I have no way to delete the orphan tasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6483) Potential issue with upgrading from mesos 0.28 to mesos > 1.0

2016-10-25 Thread Megha (JIRA)
Megha created MESOS-6483:


 Summary: Potential issue with upgrading from mesos 0.28 to mesos > 
1.0
 Key: MESOS-6483
 URL: https://issues.apache.org/jira/browse/MESOS-6483
 Project: Mesos
  Issue Type: Bug
Reporter: Megha


When upgrading directly from Mesos 0.28 to a version > 1.0, there is a scenario 
that can make the CHECK(frameworks.recovered.contains(frameworkId)) in 
Master::_markUnreachable(..) fail. The following sequence of events can happen:

1) The master is upgraded first to the new version, while an agent, say X, is 
still at Mesos 0.28.
2) Agent X (at Mesos 0.28) re-registers with the master (at, say, 1.1) and, as 
a result, doesn't send the frameworks (frameworkInfos) in the ReRegisterSlave 
message, since that field wasn't available in the older Mesos version.
3) Among the frameworks on agent X is a framework Y which didn't re-register 
after the master's failover. Since the master builds frameworks.recovered from 
the frameworkInfos that agents provide, framework Y is in neither the recovered 
nor the registered frameworks.
4) After re-registering, agent X fails the master's health check and is marked 
unreachable by the master. The check 
CHECK(frameworks.recovered.contains(frameworkId)) fires for framework Y, since 
it is neither recovered nor registered but has tasks running on agent X.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-6383) NvidiaGpuAllocator::resources cannot load symbol nvmlGetDeviceMinorNumber - can the device minor number be ascertained reliably using an older set of API calls?

2016-10-25 Thread Kevin Klues (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Klues reassigned MESOS-6383:
--

Assignee: Kevin Klues

> NvidiaGpuAllocator::resources cannot load symbol nvmlGetDeviceMinorNumber - 
> can the device minor number be ascertained reliably using an older set of API 
> calls?
> 
>
> Key: MESOS-6383
> URL: https://issues.apache.org/jira/browse/MESOS-6383
> Project: Mesos
>  Issue Type: Improvement
>Affects Versions: 1.0.1
>Reporter: Dylan Bethune-Waddell
>Assignee: Kevin Klues
>Priority: Minor
>  Labels: gpu
>
> We're attempting to deploy Mesos on a cluster with 2 Nvidia GPUs per host. We 
> are not in a position to upgrade the Nvidia drivers in the near future, and 
> are currently at driver version 319.72
> When attempting to launch an agent with the following command and take 
> advantage of Nvidia GPU support (master address elided):
> bq. {{./bin/mesos-agent.sh --master=: 
> --work_dir=/tmp/mesos --isolation="cgroups/devices,gpu/nvidia"}}
> I receive the following error message:
> bq. {{Failed to create a containerizer: Failed call to 
> NvidiaGpuAllocator::resources: Failed to nvml::initialize: Failed to load 
> symbol 'nvmlDeviceGetMinorNumber': Error looking up symbol 
> 'nvmlDeviceGetMinorNumber' in 'libnvidia-ml.so.1' : 
> /usr/lib64/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetMinorNumber}}
> Based on the change log for the NVML module, it seems that 
> {{nvmlDeviceGetMinorNumber}} is only available for driver versions 331 and 
> later as per info under the [Changes between NVML v5.319 Update and 
> v331|http://docs.nvidia.com/deploy/nvml-api/change-log.html#change-log] 
> heading in the NVML API reference.
> Is there an alternate method of obtaining this information at runtime to 
> enable support for older versions of the Nvidia driver? Based on discussion 
> in the design document, obtaining this information from the {{nvidia-smi}} 
> command output is a feasible alternative.
> I am willing to submit a PR that amends the behaviour of 
> {{NvidiaGpuAllocator}} so that it first attempts to call 
> {{nvml::nvmlGetDeviceMinorNumber}} via libnvidia-ml; if the symbol cannot 
> be found, it falls back on the {{--nvidia-smi="/path/to/nvidia-smi"}} option 
> from mesos-agent if provided, or attempts to run {{nvidia-smi}} if found on 
> the path, and parses the output to obtain this information. Otherwise, it 
> raises an exception indicating that all of this was attempted.
> Would a function or class for parsing {{nvidia-smi}} output be a useful 
> contribution?
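
By way of illustration, a rough standalone sketch of the proposed fallback
(this is not Mesos code; the "Minor Number" field parsed from {{nvidia-smi -q}}
output is an assumption and may not be printed by older drivers):

{code}
// Hypothetical sketch of the proposed fallback, not Mesos code.
#include <dlfcn.h>   // dlopen, dlsym, dlclose
#include <cstdio>    // popen, fgets, pclose
#include <string>

// Returns true if libnvidia-ml.so.1 exports nvmlDeviceGetMinorNumber.
static bool hasMinorNumberSymbol()
{
  void* handle = dlopen("libnvidia-ml.so.1", RTLD_LAZY);
  if (handle == nullptr) {
    return false;
  }

  bool found = dlsym(handle, "nvmlDeviceGetMinorNumber") != nullptr;
  dlclose(handle);
  return found;
}

// Fallback: run `nvidia-smi -q` and collect any "Minor Number" lines.
// NOTE: the field name/format is an assumption; if the driver's output
// doesn't contain it, this simply returns an empty string.
static std::string minorNumbersFromSmi(const std::string& smi = "nvidia-smi")
{
  std::string result;

  FILE* pipe = popen((smi + " -q").c_str(), "r");
  if (pipe == nullptr) {
    return result;
  }

  char buffer[256];
  while (fgets(buffer, sizeof(buffer), pipe) != nullptr) {
    std::string line(buffer);
    if (line.find("Minor Number") != std::string::npos) {
      result += line;
    }
  }

  pclose(pipe);
  return result;
}
{code}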



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6383) NvidiaGpuAllocator::resources cannot load symbol nvmlGetDeviceMinorNumber - can the device minor number be ascertained reliably using an older set of API calls?

2016-10-25 Thread Kevin Klues (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15606913#comment-15606913
 ] 

Kevin Klues commented on MESOS-6383:


I think we need to loop some of the nvidia guys in on this. [~vditya], 
[~rph...@nvidia.com], [~rtodd], [~exxo] can one of you comment on this?

> NvidiaGpuAllocator::resources cannot load symbol nvmlGetDeviceMinorNumber - 
> can the device minor number be ascertained reliably using an older set of API 
> calls?
> 
>
> Key: MESOS-6383
> URL: https://issues.apache.org/jira/browse/MESOS-6383
> Project: Mesos
>  Issue Type: Improvement
>Affects Versions: 1.0.1
>Reporter: Dylan Bethune-Waddell
>Priority: Minor
>  Labels: gpu
>
> We're attempting to deploy Mesos on a cluster with 2 Nvidia GPUs per host. We 
> are not in a position to upgrade the Nvidia drivers in the near future, and 
> are currently at driver version 319.72
> When attempting to launch an agent with the following command and take 
> advantage of Nvidia GPU support (master address elided):
> bq. {{./bin/mesos-agent.sh --master=: 
> --work_dir=/tmp/mesos --isolation="cgroups/devices,gpu/nvidia"}}
> I receive the following error message:
> bq. {{Failed to create a containerizer: Failed call to 
> NvidiaGpuAllocator::resources: Failed to nvml::initialize: Failed to load 
> symbol 'nvmlDeviceGetMinorNumber': Error looking up symbol 
> 'nvmlDeviceGetMinorNumber' in 'libnvidia-ml.so.1' : 
> /usr/lib64/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetMinorNumber}}
> Based on the change log for the NVML module, it seems that 
> {{nvmlDeviceGetMinorNumber}} is only available for driver versions 331 and 
> later as per info under the [Changes between NVML v5.319 Update and 
> v331|http://docs.nvidia.com/deploy/nvml-api/change-log.html#change-log] 
> heading in the NVML API reference.
> Is there an alternate method of obtaining this information at runtime to 
> enable support for older versions of the Nvidia driver? Based on discussion 
> in the design document, obtaining this information from the {{nvidia-smi}} 
> command output is a feasible alternative.
> I am willing to submit a PR that amends the behaviour of 
> {{NvidiaGpuAllocator}} so that it first attempts to call 
> {{nvml::nvmlGetDeviceMinorNumber}} via libnvidia-ml; if the symbol cannot 
> be found, it falls back on the {{--nvidia-smi="/path/to/nvidia-smi"}} option 
> from mesos-agent if provided, or attempts to run {{nvidia-smi}} if found on 
> the path, and parses the output to obtain this information. Otherwise, it 
> raises an exception indicating that all of this was attempted.
> Would a function or class for parsing {{nvidia-smi}} output be a useful 
> contribution?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6482) Master check failure when marking an agent unreachable

2016-10-25 Thread Yan Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Xu updated MESOS-6482:
--
Description: 
{noformat:title=}
I1025 16:34:55.423038 44118 master.cpp:6006] Marked agent 
8e219f7a-06c1-4009-9440-1a33b3e39be5-S473 (x.y.z.com) unreachable: health check 
timed out
F1025 16:34:55.423632 44118 master.cpp:6036] Check failed: 
frameworks.recovered.contains(frameworkId) 
{noformat}

Both the master and the agent are on 1.1.

{code:title=the context}
  foreachkey (const FrameworkID& frameworkId, utils::copy(slave->tasks)) {
Framework* framework = getFramework(frameworkId);

// If the framework has not yet re-registered after master failover,
// its FrameworkInfo will be in the `recovered` collection. Note that
// if the master knows about a task, its FrameworkInfo must appear in
// either the `registered` or `recovered` collections.
FrameworkInfo frameworkInfo;

if (framework == nullptr) {
  CHECK(frameworks.recovered.contains(frameworkId));
  frameworkInfo = frameworks.recovered[frameworkId];
} else {
  frameworkInfo = framework->info;
}

...
{code}

  was:
{noformat:title=}
I1025 16:34:55.423038 44118 master.cpp:6006] Marked agent 
8e219f7a-06c1-4009-9440-1a33b3e39be5-S473 (x.y.z.com) unreachable: health check 
timed out
F1025 16:34:55.423632 44118 master.cpp:6036] Check failed: 
frameworks.recovered.contains(frameworkId) 
{noformat}

Both the master and the agent are on 1.1.


> Master check failure when marking an agent unreachable
> --
>
> Key: MESOS-6482
> URL: https://issues.apache.org/jira/browse/MESOS-6482
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.1.0
>Reporter: Yan Xu
>Priority: Blocker
>
> {noformat:title=}
> I1025 16:34:55.423038 44118 master.cpp:6006] Marked agent 
> 8e219f7a-06c1-4009-9440-1a33b3e39be5-S473 (x.y.z.com) unreachable: health 
> check timed out
> F1025 16:34:55.423632 44118 master.cpp:6036] Check failed: 
> frameworks.recovered.contains(frameworkId) 
> {noformat}
> Both the master and the agent are on 1.1.
> {code:title=the context}
>   foreachkey (const FrameworkID& frameworkId, utils::copy(slave->tasks)) {
> Framework* framework = getFramework(frameworkId);
> // If the framework has not yet re-registered after master failover,
> // its FrameworkInfo will be in the `recovered` collection. Note that
> // if the master knows about a task, its FrameworkInfo must appear in
> // either the `registered` or `recovered` collections.
> FrameworkInfo frameworkInfo;
> if (framework == nullptr) {
>   CHECK(frameworks.recovered.contains(frameworkId));
>   frameworkInfo = frameworks.recovered[frameworkId];
> } else {
>   frameworkInfo = framework->info;
> }
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6482) Master check failure when marking an agent unreachable

2016-10-25 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15606701#comment-15606701
 ] 

Yan Xu commented on MESOS-6482:
---

The root cause is likely MESOS-4975: when frameworks are torn down, they are 
removed from {{frameworks.registered}} and {{frameworks.recovered}} but a bunch 
of spurious entries are left in {{slave->tasks}}.

> Master check failure when marking an agent unreachable
> --
>
> Key: MESOS-6482
> URL: https://issues.apache.org/jira/browse/MESOS-6482
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.1.0
>Reporter: Yan Xu
>Priority: Blocker
>
> {noformat:title=}
> I1025 16:34:55.423038 44118 master.cpp:6006] Marked agent 
> 8e219f7a-06c1-4009-9440-1a33b3e39be5-S473 (x.y.z.com) unreachable: health 
> check timed out
> F1025 16:34:55.423632 44118 master.cpp:6036] Check failed: 
> frameworks.recovered.contains(frameworkId) 
> {noformat}
> Both the master and the agent are on 1.1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6482) Master check failure when marking an agent unreachable

2016-10-25 Thread Yan Xu (JIRA)
Yan Xu created MESOS-6482:
-

 Summary: Master check failure when marking an agent unreachable
 Key: MESOS-6482
 URL: https://issues.apache.org/jira/browse/MESOS-6482
 Project: Mesos
  Issue Type: Bug
  Components: master
Affects Versions: 1.1.0
Reporter: Yan Xu
Priority: Blocker


{noformat:title=}
I1025 16:34:55.423038 44118 master.cpp:6006] Marked agent 
8e219f7a-06c1-4009-9440-1a33b3e39be5-S473 (x.y.z.com) unreachable: health check 
timed out
F1025 16:34:55.423632 44118 master.cpp:6036] Check failed: 
frameworks.recovered.contains(frameworkId) 
{noformat}

Both the master and the agent are on 1.1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-6422) cgroups_tests not correctly tearing down testing hierarchies

2016-10-25 Thread Yan Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Xu reassigned MESOS-6422:
-

Assignee: Yan Xu

> cgroups_tests not correctly tearing down testing hierarchies
> 
>
> Key: MESOS-6422
> URL: https://issues.apache.org/jira/browse/MESOS-6422
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Yan Xu
>Assignee: Yan Xu
>
> We currently do the following in 
> [CgroupsTest::TearDownTestCase()|https://github.com/apache/mesos/blob/5e850a362edbf494921fedff4037cf4b53088c10/src/tests/containerizer/cgroups_tests.cpp#L83]
> {code:title=}
> static void TearDownTestCase()
> {
>   AWAIT_READY(cgroups::cleanup(TEST_CGROUPS_HIERARCHY));
> }
> {code}
> One of its derived tests, {{CgroupsNoHierarchyTest}}, treats 
> {{TEST_CGROUPS_HIERARCHY}} as a hierarchy, so it is able to clean it up as 
> one.
> However, another derived test, {{CgroupsAnyHierarchyTest}}, creates new 
> hierarchies (if none are available) using {{TEST_CGROUPS_HIERARCHY}} as a 
> parent directory (i.e., a base hierarchy) rather than as a hierarchy, so 
> when it's time to clean up, the cleanup fails:
> {noformat:title=}
> [   OK ] CgroupsAnyHierarchyTest.ROOT_CGROUPS_Subsystems (1 ms)
> ../../src/tests/containerizer/cgroups_tests.cpp:88: Failure
> (cgroups::cleanup(TEST_CGROUPS_HIERARCHY)).failure(): Operation not permitted
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-4975) mesos::internal::master::Slave::tasks can grow unboundedly

2016-10-25 Thread Yan Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Xu reassigned MESOS-4975:
-

Assignee: Yan Xu

> mesos::internal::master::Slave::tasks can grow unboundedly
> --
>
> Key: MESOS-4975
> URL: https://issues.apache.org/jira/browse/MESOS-4975
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Reporter: Yan Xu
>Assignee: Yan Xu
>
> So in a Mesos cluster we observed the following
> {noformat:title=}
> $ jq '.orphan_tasks | length' state.json
> 1369
> $ jq '.unregistered_frameworks | length' state.json
> 20162
> {noformat}
> Aside from {{unregistered_frameworks}} here being "the list of frameworkIDs 
> for each orphan task" (described in MESOS-4973), the discrepancy between the 
> two values above is surprising.
> I think the problem is that we do this in the master:
> From 
> [source|https://github.com/apache/mesos/blob/e376d3aa0074710278224ccd17afd51971820dfb/src/master/master.cpp#L2212]:
> {code}
> foreachvalue (Slave* slave, slaves.registered) {
>   foreachvalue (Task* task, slave->tasks[framework->id()]) {
> framework->addTask(task);
>   }
>   foreachvalue (const ExecutorInfo& executor,
> slave->executors[framework->id()]) {
> framework->addExecutor(slave->id, executor);
>   }
> }
> {code}
> Here {{operator[]}} is used whenever a framework subscribes, regardless of 
> whether this agent has any tasks for the framework.
> If the agent has no such task for this framework, then this \{frameworkID: 
> empty hashmap\} entry will stay in the map indefinitely! If frameworks are 
> ephemeral and new ones keep coming in, the map grows unboundedly.
> We should check {{tasks.contains(frameworkId)}} before using the 
> {{[] operator}}.
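
A standalone illustration of the growth and the proposed guard, using
{{std::unordered_map}} in place of Mesos' {{hashmap}} (the types here are
simplified stand-ins, not the master's data structures):

{code}
// Standalone illustration, not Mesos code: operator[] inserts an empty
// entry for every key it is asked about, whereas a lookup guarded by a
// containment check does not.
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

int main()
{
  std::unordered_map<std::string, std::vector<int>> tasks;

  // Unguarded: iterating over tasks["f1"] for a framework with no tasks
  // on this agent still leaves a {frameworkId: empty vector} entry behind.
  for (int task : tasks["f1"]) {
    (void)task;  // Never reached; the body is irrelevant here.
  }
  assert(tasks.size() == 1);

  // Guarded: only touch the map if the framework actually has tasks.
  if (tasks.count("f2") > 0) {
    for (int task : tasks.at("f2")) {
      (void)task;
    }
  }
  assert(tasks.size() == 1);  // No spurious "f2" entry was created.

  return 0;
}
{code}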



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6078) Add a agent teardown endpoint

2016-10-25 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15606268#comment-15606268
 ] 

Cody Maloney commented on MESOS-6078:
-

{{/machine/down}} is very complicated to use for this use case: it requires 
posting multiple JSON blobs, which have to follow a format that includes 
timestamps in milliseconds and multiple fields that must match exactly how a 
particular Mesos agent was launched.

It takes a _lot_ of code and debugging to use and manage for what is a simple, 
common task. Things also get more complicated once there are existing 
schedules (and if you want the agent to re-register later).
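
For context, a rough sketch of what taking one machine down via the maintenance
primitives involves, recalled from the Mesos maintenance documentation (the
exact endpoints, field names, and time units should be verified against the
running version; the docs I recall express times as nanosecond counts):

{code}
# 1. Post a maintenance schedule covering the machine. The hostname/ip pair
#    must match how the agent registered with the master.
curl -X POST http://<master>:5050/maintenance/schedule \
  -H "Content-Type: application/json" \
  -d '{
        "windows": [{
          "machine_ids": [{"hostname": "agent1.example.com", "ip": "10.0.0.1"}],
          "unavailability": {"start": {"nanoseconds": 1477440000000000000}}
        }]
      }'

# 2. Mark the machine as down.
curl -X POST http://<master>:5050/machine/down \
  -H "Content-Type: application/json" \
  -d '[{"hostname": "agent1.example.com", "ip": "10.0.0.1"}]'
{code}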

> Add a agent teardown endpoint
> -
>
> Key: MESOS-6078
> URL: https://issues.apache.org/jira/browse/MESOS-6078
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Affects Versions: 1.0.0, 1.0.1
>Reporter: Cody Maloney
>Assignee: Michael Park
>  Labels: mesosphere
>
> Currently, when a whole agent machine is unexpectedly terminated for good 
> (e.g., AWS terminated the instance without warning), it goes through the 
> mesos slave removal rate limit before it's gone.
> If a couple of agents or a whole rack goes down in a cluster of thousands of 
> agents, this can become a problem.
> If the agent can be shut down "cleanly", everything gets rescheduled, but 
> once the agent is gone, there is currently no good way for an administrator 
> to indicate that the node is gone, that its tasks are lost, and that they 
> should be rescheduled as soon as possible where appropriate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6078) Add a agent teardown endpoint

2016-10-25 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15606261#comment-15606261
 ] 

Neil Conway commented on MESOS-6078:


FYI, we will likely address this as part of the in-progress work on supporting 
{{TASK_GONE}} and {{TASK_GONE_BY_OPERATOR}}. Workflow:

* framework opts-in to the {{PARTITION_AWARE}} capability.
* if Mesos can _prove_ that the agent ID is gone (e.g., because the agent 
reboots, changes its boot ID, and then an agent using the same {{work_dir}} 
registers and receives a new agent ID), the framework will get {{TASK_GONE}} 
status updates for all tasks on the agent.
* if the operator has some out-of-band knowledge that the agent will never 
attempt to re-register and all of its tasks are no longer running, we'll 
provide an operator HTTP endpoint (e.g., /agent/gone) that the operator can 
hit. When this happens, the framework will receive {{TASK_GONE_BY_OPERATOR}} 
status updates for all tasks on the agent.

In the meantime, the {{/machine/down}} endpoint might help here -- it shouldn't 
be subject to the agent removal rate limit.

> Add a agent teardown endpoint
> -
>
> Key: MESOS-6078
> URL: https://issues.apache.org/jira/browse/MESOS-6078
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Affects Versions: 1.0.0, 1.0.1
>Reporter: Cody Maloney
>Assignee: Michael Park
>  Labels: mesosphere
>
> Currently, when a whole agent machine is unexpectedly terminated for good 
> (e.g., AWS terminated the instance without warning), it goes through the 
> mesos slave removal rate limit before it's gone.
> If a couple of agents or a whole rack goes down in a cluster of thousands of 
> agents, this can become a problem.
> If the agent can be shut down "cleanly", everything gets rescheduled, but 
> once the agent is gone, there is currently no good way for an administrator 
> to indicate that the node is gone, that its tasks are lost, and that they 
> should be rescheduled as soon as possible where appropriate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6445) Reconciliation for unreachable agent after master failover is incorrect

2016-10-25 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-6445:
--
Fix Version/s: 1.1.0  (was: 1.2.0)

Cherry-picked for 1.1.0.

commit e1b6b2145a82a0f21aed82fc335ea09965b2f2dc
Author: Vinod Kone 
Date:   Tue Oct 25 12:20:39 2016 -0700

Added MESOS-6445 to CHANGELOG for 1.1.0.

commit c6516be5df87e3fc8bea67f4dc74bc6a4743147d
Author: Neil Conway 
Date:   Fri Oct 21 14:18:59 2016 -0700

Tweaked test expectation.

`WillOnce` is more accurate than `WillRepeatedly`.

Review: https://reviews.apache.org/r/53098/

commit 0abd9510a0dba87d1a791d8751e5ccdbb02784db
Author: Neil Conway 
Date:   Fri Oct 21 14:18:52 2016 -0700

Fixed bug when marking agents unreachable after master failover.

If the master fails over and an agent does not re-register within the
`agent_reregister_timeout`, the master marks the agent as unreachable in
the registry and sends `slaveLost` for it. However, we neglected to
update the master's in-memory state for the newly unreachable agent;
this meant that task reconciliation would return incorrect results
(until/unless the next master failover).

Review: https://reviews.apache.org/r/53097/

commit 8cf2ca8703d3b776fdbdaac2979cbd3ea40873ad
Author: Neil Conway 
Date:   Fri Oct 21 14:18:46 2016 -0700

Avoided passing `TimeInfo` by value.

Although this is likely to remain small in practice, passing by const
reference should be preferred until there is a reason not to.

Review: https://reviews.apache.org/r/53099/



> Reconciliation for unreachable agent after master failover is incorrect
> ---
>
> Key: MESOS-6445
> URL: https://issues.apache.org/jira/browse/MESOS-6445
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Reporter: Neil Conway
>Assignee: Neil Conway
>Priority: Blocker
>  Labels: mesosphere
> Fix For: 1.1.0
>
>
> {noformat}
> If the master fails over and an agent does not re-register within the
> `agent_reregister_timeout`, the master marks the agent as unreachable in
> the registry and sends `slaveLost` for it. However, we neglected to
> update the master's in-memory state for the newly unreachable agent;
> this meant that task reconciliation would return incorrect results
> (until/unless the next master failover).
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6445) Reconciliation for unreachable agent after master failover is incorrect

2016-10-25 Thread Neil Conway (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Conway updated MESOS-6445:
---
Target Version/s: 1.1.0

> Reconciliation for unreachable agent after master failover is incorrect
> ---
>
> Key: MESOS-6445
> URL: https://issues.apache.org/jira/browse/MESOS-6445
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Reporter: Neil Conway
>Assignee: Neil Conway
>Priority: Blocker
>  Labels: mesosphere
> Fix For: 1.2.0
>
>
> {noformat}
> If the master fails over and an agent does not re-register within the
> `agent_reregister_timeout`, the master marks the agent as unreachable in
> the registry and sends `slaveLost` for it. However, we neglected to
> update the master's in-memory state for the newly unreachable agent;
> this meant that task reconciliation would return incorrect results
> (until/unless the next master failover).
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6445) Reconciliation for unreachable agent after master failover is incorrect

2016-10-25 Thread Neil Conway (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Conway updated MESOS-6445:
---
Priority: Blocker  (was: Major)

> Reconciliation for unreachable agent after master failover is incorrect
> ---
>
> Key: MESOS-6445
> URL: https://issues.apache.org/jira/browse/MESOS-6445
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Reporter: Neil Conway
>Assignee: Neil Conway
>Priority: Blocker
>  Labels: mesosphere
> Fix For: 1.2.0
>
>
> {noformat}
> If the master fails over and an agent does not re-register within the
> `agent_reregister_timeout`, the master marks the agent as unreachable in
> the registry and sends `slaveLost` for it. However, we neglected to
> update the master's in-memory state for the newly unreachable agent;
> this meant that task reconciliation would return incorrect results
> (until/unless the next master failover).
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6336) SlaveTest.KillTaskGroupBetweenRunTaskParts is flaky

2016-10-25 Thread Abhishek Dasgupta (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15606172#comment-15606172
 ] 

Abhishek Dasgupta commented on MESOS-6336:
--

RR: https://reviews.apache.org/r/53173/

> SlaveTest.KillTaskGroupBetweenRunTaskParts is flaky
> ---
>
> Key: MESOS-6336
> URL: https://issues.apache.org/jira/browse/MESOS-6336
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Reporter: Greg Mann
>Assignee: Abhishek Dasgupta
>  Labels: mesosphere
>
> The test {{SlaveTest.KillTaskGroupBetweenRunTaskParts}} sometimes segfaults 
> during the agent's {{finalize()}} method. This was observed on our internal 
> CI, on Fedora with libev, without SSL:
> {code}
> [ RUN  ] SlaveTest.KillTaskGroupBetweenRunTaskParts
> I1007 14:12:57.973811 28630 cluster.cpp:158] Creating default 'local' 
> authorizer
> I1007 14:12:57.982128 28630 leveldb.cpp:174] Opened db in 8.195028ms
> I1007 14:12:57.982599 28630 leveldb.cpp:181] Compacted db in 446238ns
> I1007 14:12:57.982616 28630 leveldb.cpp:196] Created db iterator in 3650ns
> I1007 14:12:57.982622 28630 leveldb.cpp:202] Seeked to beginning of db in 
> 451ns
> I1007 14:12:57.982627 28630 leveldb.cpp:271] Iterated through 0 keys in the 
> db in 352ns
> I1007 14:12:57.982638 28630 replica.cpp:776] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I1007 14:12:57.983024 28645 recover.cpp:451] Starting replica recovery
> I1007 14:12:57.983127 28651 recover.cpp:477] Replica is in EMPTY status
> I1007 14:12:57.983459 28644 replica.cpp:673] Replica in EMPTY status received 
> a broadcasted recover request from __req_res__(6234)@172.30.2.161:38776
> I1007 14:12:57.983543 28651 recover.cpp:197] Received a recover response from 
> a replica in EMPTY status
> I1007 14:12:57.983680 28650 recover.cpp:568] Updating replica status to 
> STARTING
> I1007 14:12:57.983990 28648 master.cpp:380] Master 
> 76d4d55f-dcc6-4033-85d9-7ec97ef353cb 
> (ip-172-30-2-161.ec2.internal.mesosphere.io) started on 172.30.2.161:38776
> I1007 14:12:57.984007 28648 master.cpp:382] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authenticators="crammd5" 
> --authorizers="local" --credentials="/tmp/rVbcaO/credentials" 
> --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --http_authenticators="basic" --http_framework_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --quiet="false" --recovery_agent_removal_limit="100%" 
> --registry="replicated_log" --registry_fetch_timeout="1mins" 
> --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" 
> --registry_max_agent_count="102400" --registry_store_timeout="100secs" 
> --registry_strict="false" --root_submissions="true" --user_sorter="drf" 
> --version="false" --webui_dir="/usr/local/share/mesos/webui" 
> --work_dir="/tmp/rVbcaO/master" --zk_session_timeout="10secs"
> I1007 14:12:57.984127 28648 master.cpp:432] Master only allowing 
> authenticated frameworks to register
> I1007 14:12:57.984134 28648 master.cpp:446] Master only allowing 
> authenticated agents to register
> I1007 14:12:57.984139 28648 master.cpp:459] Master only allowing 
> authenticated HTTP frameworks to register
> I1007 14:12:57.984143 28648 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/rVbcaO/credentials'
> I1007 14:12:57.988487 28648 master.cpp:504] Using default 'crammd5' 
> authenticator
> I1007 14:12:57.988530 28648 http.cpp:883] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> I1007 14:12:57.988585 28648 http.cpp:883] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> I1007 14:12:57.988648 28648 http.cpp:883] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> I1007 14:12:57.988680 28648 master.cpp:584] Authorization enabled
> I1007 14:12:57.988757 28650 whitelist_watcher.cpp:77] No whitelist given
> I1007 14:12:57.988772 28646 hierarchical.cpp:149] Initialized hierarchical 
> allocator process
> I1007 14:12:57.988917 28651 leveldb.cpp:304] Persisting metadata (8 bytes) to 
> leveldb took 5.186917ms
> I1007 14:12:57.988934 28651 replica.cpp:320] Persisted replica status to 
> STARTING
> I1007 14:12:57.989045 28651 recover.cpp:477] Replica is in STARTING status
> I1007 14:12:57.989316 28648 

[jira] [Updated] (MESOS-6446) WebUI redirect doesn't work with stats from /metric/snapshot

2016-10-25 Thread haosdent (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haosdent updated MESOS-6446:

Attachment: webui_metrics.gif

> WebUI redirect doesn't work with stats from /metric/snapshot
> 
>
> Key: MESOS-6446
> URL: https://issues.apache.org/jira/browse/MESOS-6446
> Project: Mesos
>  Issue Type: Bug
>  Components: webui
>Affects Versions: 1.0.0
>Reporter: Yan Xu
>Assignee: haosdent
>Priority: Blocker
> Attachments: Screen Shot 2016-10-21 at 12.04.23 PM.png, 
> webui_metrics.gif
>
>
> After Mesos 1.0, the webUI redirect is hidden from users, so you can go to 
> any of the masters and the webUI is populated with state.json from the 
> leading master.
> This doesn't include stats from /metric/snapshot though, as that endpoint is 
> not redirected. The user ends up seeing some fields with empty values.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6446) WebUI redirect doesn't work with stats from /metric/snapshot

2016-10-25 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15606103#comment-15606103
 ] 

haosdent commented on MESOS-6446:
-

I think we have 3 approaches to fix this problem.

# Add {{/master/metrics-snapshot}} in {{master/http.cpp}} and delegate requests 
to {{MetricsProcess}}.
  #* Add {{VIEW_METRICS}} in {{authorizer.proto}} to authorize requests to 
{{/master/metrics-snapshot}}.
  #* As a follow-up step, it would be better to also add 
{{/agent/metrics-snapshot}} and deprecate {{/metrics/snapshot}}, since it would 
no longer be necessary.
# Add {{leading_master}} to {{/master/state}}, because we need to use the 
{{hostname}} to send requests (parsing the {{pid}} to send requests would fail 
if the Mesos cluster is deployed in the AWS cloud), and then use {{JSONP}} to 
send the requests.
# Add {{leading_master}} to {{/master/state}} for the same reason, and redirect 
if the {{leading_master}} is not equal to the current master.

I have implemented both 1 and 2, but 2 looks simpler, so I posted it at 
https://reviews.apache.org/r/53172/.
Looking forward to your thoughts on whether there are better ways to fix this 
problem.

[~xujyan] [~vinodkone] [~kaysoky]

> WebUI redirect doesn't work with stats from /metric/snapshot
> 
>
> Key: MESOS-6446
> URL: https://issues.apache.org/jira/browse/MESOS-6446
> Project: Mesos
>  Issue Type: Bug
>  Components: webui
>Affects Versions: 1.0.0
>Reporter: Yan Xu
>Assignee: haosdent
>Priority: Blocker
> Attachments: Screen Shot 2016-10-21 at 12.04.23 PM.png
>
>
> After Mesos 1.0, the webUI redirect is hidden from users, so you can go to 
> any of the masters and the webUI is populated with state.json from the 
> leading master.
> This doesn't include stats from /metric/snapshot though, as that endpoint is 
> not redirected. The user ends up seeing some fields with empty values.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6420) Mesos Agent leaking sockets when port mapping network isolator is ON

2016-10-25 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15605995#comment-15605995
 ] 

Alexander Rukletsov commented on MESOS-6420:


[~jieyu] could you cherry-pick it to 1.1.x?

> Mesos Agent leaking sockets when port mapping network isolator is ON
> 
>
> Key: MESOS-6420
> URL: https://issues.apache.org/jira/browse/MESOS-6420
> Project: Mesos
>  Issue Type: Bug
>  Components: isolation, network, slave
>Affects Versions: 1.0.2
>Reporter: Santhosh Shanmugham
>Priority: Blocker
> Fix For: 1.0.2, 1.2.0
>
>
> The Mesos agent leaks one socket per task launched and eventually runs out 
> of sockets. We were able to track it down to the network isolator 
> (port_mapping.cpp): when we turned off the port mapping isolator, no file 
> descriptors were leaked. The leaked fd is a SOCK_STREAM socket.
> Leaked Sockets:
> $ sudo lsof -p $(pgrep -u root -o -f /usr/local/sbin/mesos-slave) -nP | grep 
> "can't"
> [sudo] password for sshanmugham:
> mesos-sla 57688 root   19u  sock0,6  0t0 2993216948 can't 
> identify protocol
> mesos-sla 57688 root   27u  sock0,6  0t0 2993216468 can't 
> identify protocol
> Extract from strace:
> ...
> [pid 57701] 19:14:02.493718 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.494395 close(19)   = 0
> [pid 57701] 19:14:02.494448 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.494844 close(19)   = 0
> [pid 57701] 19:14:02.494913 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.495565 close(19)   = 0
> [pid 57701] 19:14:02.495617 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.496072 close(19)   = 0
> [pid 57701] 19:14:02.496128 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.496758 close(19)   = 0
> [pid 57701] 19:14:02.496812 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.497270 close(19)   = 0
> [pid 57701] 19:14:02.497319 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.497698 close(19)   = 0
> [pid 57701] 19:14:02.497750 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.498407 close(19)   = 0
> [pid 57701] 19:14:02.498456 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.498899 close(19)   = 0
> [pid 57701] 19:14:02.498963 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 63682] 19:14:02.499091 close(18 
> [pid 57701] 19:14:02.499634 close(19)   = 0
> [pid 57701] 19:14:02.499689 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.500044 close(19)   = 0
> [pid 57701] 19:14:02.500093 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.500734 close(19)   = 0
> [pid 57701] 19:14:02.500782 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.501271 close(19)   = 0
> [pid 57701] 19:14:02.501339 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.502030 close(19)   = 0
> [pid 57701] 19:14:02.502101 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 19
> ...
> ...
> [pid 57691] 19:18:03.461022 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid  6138] 19:18:03.461345 open("/etc/selinux/config", O_RDONLY  ...>
> [pid 57691] 19:18:03.461460 close(27)   = 0
> [pid 57691] 19:18:03.461520 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid  6138] 19:18:03.461632 close(3 
> [pid  6138] 19:18:03.461781 open("/proc/mounts", O_RDONLY 
> [pid  6138] 19:18:03.462190 close(3 
> [pid 57691] 19:18:03.462374 close(27)   = 0
> [pid 57691] 19:18:03.462430 socket(PF_NETLINK, SOCK_RAW, 0 
> [pid  6138] 19:18:03.462456 open("/proc/net/psched", O_RDONLY 
> [pid  6138] 19:18:03.462678 close(3 
> [pid  6138] 19:18:03.462915 open("/etc/libnl/classid", O_RDONLY  ...>
> [pid 57691] 19:18:03.463046 close(27)   = 0
> [pid 57691] 19:18:03.463111 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid  6138] 19:18:03.463225 close(3 
> [pid 57691] 19:18:03.463845 close(27)   = 0
> [pid 57691] 19:18:03.463911 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.464604 close(27)   = 0
> [pid 57691] 19:18:03.464664 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.465074 close(27)   = 0
> [pid 57691] 19:18:03.465132 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.465862 close(27)   = 0
> [pid 57691] 19:18:03.465928 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.466713 close(27)   = 0
> [pid 57691] 19:18:03.466780 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.467472 close(27)   = 0
> [pid 57691] 19:18:03.467524 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.468012 close(27)   = 0
> [pid 57691] 19:18:03.468075 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.468799 close(27)   = 0
> [pid 57691] 19:18:03.468950 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.469505 close(27)   = 0
> [pid 57691] 19:18:03.469578 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 

[jira] [Updated] (MESOS-6446) WebUI redirect doesn't work with stats from /metric/snapshot

2016-10-25 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6446:
---
Target Version/s: 1.0.2, 1.1.0  (was: 1.0.2)

> WebUI redirect doesn't work with stats from /metric/snapshot
> 
>
> Key: MESOS-6446
> URL: https://issues.apache.org/jira/browse/MESOS-6446
> Project: Mesos
>  Issue Type: Bug
>  Components: webui
>Affects Versions: 1.0.0
>Reporter: Yan Xu
>Assignee: haosdent
>Priority: Blocker
> Attachments: Screen Shot 2016-10-21 at 12.04.23 PM.png
>
>
> After Mesos 1.0, the webUI redirect is hidden from users, so you can go to 
> any of the masters and the webUI is populated with state.json from the 
> leading master.
> This doesn't include stats from /metric/snapshot though, as that endpoint is 
> not redirected. The user ends up seeing some fields with empty values.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6455) DefaultExecutorTests fail when running on hosts without docker

2016-10-25 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6455:
---
Target Version/s: 1.1.0

> DefaultExecutorTests fail when running on hosts without docker 
> ---
>
> Key: MESOS-6455
> URL: https://issues.apache.org/jira/browse/MESOS-6455
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.1.0
>Reporter: Yan Xu
>
> {noformat:title=}
> [  FAILED  ] Containterizers/DefaultExecutorTest.ROOT_TaskRunning/1, where 
> GetParam() = "docker,mesos"
> [  FAILED  ] Containterizers/DefaultExecutorTest.ROOT_KillTask/1, where 
> GetParam() = "docker,mesos"
> [  FAILED  ] Containterizers/DefaultExecutorTest.ROOT_TaskUsesExecutor/1, 
> where GetParam() = "docker,mesos"
> {noformat}
> {noformat:title=}
> ../../src/tests/default_executor_tests.cpp:98: Failure
> slave: Failed to create containerizer: Could not create DockerContainerizer: 
> Failed to create docker: Failed to get docker version: Failed to execute 
> 'docker -H unix:///var/run/docker.sock --version': exited with status 127
> {noformat}
> Maybe we can put {{DOCKER_}} in the instantiation name and use another 
> instantiation for tests that don't require docker?
> /cc [~vinodkone] [~anandmazumdar]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6420) Mesos Agent leaking sockets when port mapping network isolator is ON

2016-10-25 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6420:
---
Priority: Blocker  (was: Major)

> Mesos Agent leaking sockets when port mapping network isolator is ON
> 
>
> Key: MESOS-6420
> URL: https://issues.apache.org/jira/browse/MESOS-6420
> Project: Mesos
>  Issue Type: Bug
>  Components: isolation, network, slave
>Affects Versions: 1.0.2
>Reporter: Santhosh Shanmugham
>Priority: Blocker
> Fix For: 1.0.2, 1.2.0
>
>
> The Mesos agent leaks one socket per task launched and eventually runs out 
> of sockets. We were able to track it down to the network isolator 
> (port_mapping.cpp): when we turned off the port mapping isolator, no file 
> descriptors were leaked. The leaked fd is a SOCK_STREAM socket.
> Leaked Sockets:
> $ sudo lsof -p $(pgrep -u root -o -f /usr/local/sbin/mesos-slave) -nP | grep 
> "can't"
> [sudo] password for sshanmugham:
> mesos-sla 57688 root   19u  sock0,6  0t0 2993216948 can't 
> identify protocol
> mesos-sla 57688 root   27u  sock0,6  0t0 2993216468 can't 
> identify protocol
> Extract from strace:
> ...
> [pid 57701] 19:14:02.493718 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.494395 close(19)   = 0
> [pid 57701] 19:14:02.494448 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.494844 close(19)   = 0
> [pid 57701] 19:14:02.494913 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.495565 close(19)   = 0
> [pid 57701] 19:14:02.495617 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.496072 close(19)   = 0
> [pid 57701] 19:14:02.496128 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.496758 close(19)   = 0
> [pid 57701] 19:14:02.496812 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.497270 close(19)   = 0
> [pid 57701] 19:14:02.497319 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.497698 close(19)   = 0
> [pid 57701] 19:14:02.497750 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.498407 close(19)   = 0
> [pid 57701] 19:14:02.498456 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.498899 close(19)   = 0
> [pid 57701] 19:14:02.498963 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 63682] 19:14:02.499091 close(18 
> [pid 57701] 19:14:02.499634 close(19)   = 0
> [pid 57701] 19:14:02.499689 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.500044 close(19)   = 0
> [pid 57701] 19:14:02.500093 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.500734 close(19)   = 0
> [pid 57701] 19:14:02.500782 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.501271 close(19)   = 0
> [pid 57701] 19:14:02.501339 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.502030 close(19)   = 0
> [pid 57701] 19:14:02.502101 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 19
> ...
> ...
> [pid 57691] 19:18:03.461022 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid  6138] 19:18:03.461345 open("/etc/selinux/config", O_RDONLY  ...>
> [pid 57691] 19:18:03.461460 close(27)   = 0
> [pid 57691] 19:18:03.461520 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid  6138] 19:18:03.461632 close(3 
> [pid  6138] 19:18:03.461781 open("/proc/mounts", O_RDONLY 
> [pid  6138] 19:18:03.462190 close(3 
> [pid 57691] 19:18:03.462374 close(27)   = 0
> [pid 57691] 19:18:03.462430 socket(PF_NETLINK, SOCK_RAW, 0 
> [pid  6138] 19:18:03.462456 open("/proc/net/psched", O_RDONLY 
> [pid  6138] 19:18:03.462678 close(3 
> [pid  6138] 19:18:03.462915 open("/etc/libnl/classid", O_RDONLY  ...>
> [pid 57691] 19:18:03.463046 close(27)   = 0
> [pid 57691] 19:18:03.463111 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid  6138] 19:18:03.463225 close(3 
> [pid 57691] 19:18:03.463845 close(27)   = 0
> [pid 57691] 19:18:03.463911 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.464604 close(27)   = 0
> [pid 57691] 19:18:03.464664 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.465074 close(27)   = 0
> [pid 57691] 19:18:03.465132 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.465862 close(27)   = 0
> [pid 57691] 19:18:03.465928 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.466713 close(27)   = 0
> [pid 57691] 19:18:03.466780 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.467472 close(27)   = 0
> [pid 57691] 19:18:03.467524 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.468012 close(27)   = 0
> [pid 57691] 19:18:03.468075 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.468799 close(27)   = 0
> [pid 57691] 19:18:03.468950 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.469505 close(27)   = 0
> [pid 57691] 19:18:03.469578 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.470301 close(27)   = 0
> [pid 

[jira] [Updated] (MESOS-6420) Mesos Agent leaking sockets when port mapping network isolator is ON

2016-10-25 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6420:
---
Shepherd: Jie Yu
Target Version/s: 1.0.2, 1.1.0, 1.2.0  (was: 1.1.0)

> Mesos Agent leaking sockets when port mapping network isolator is ON
> 
>
> Key: MESOS-6420
> URL: https://issues.apache.org/jira/browse/MESOS-6420
> Project: Mesos
>  Issue Type: Bug
>  Components: isolation, network, slave
>Affects Versions: 1.0.2
>Reporter: Santhosh Shanmugham
> Fix For: 1.0.2, 1.2.0
>
>
> The Mesos agent leaks one socket per task launched and eventually runs out 
> of sockets. We were able to track it down to the network isolator 
> (port_mapping.cpp): when we turned off the port mapping isolator, no file 
> descriptors were leaked. The leaked fd is a SOCK_STREAM socket.
> Leaked Sockets:
> $ sudo lsof -p $(pgrep -u root -o -f /usr/local/sbin/mesos-slave) -nP | grep 
> "can't"
> [sudo] password for sshanmugham:
> mesos-sla 57688 root   19u  sock0,6  0t0 2993216948 can't 
> identify protocol
> mesos-sla 57688 root   27u  sock0,6  0t0 2993216468 can't 
> identify protocol
> Extract from strace:
> ...
> [pid 57701] 19:14:02.493718 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.494395 close(19)   = 0
> [pid 57701] 19:14:02.494448 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.494844 close(19)   = 0
> [pid 57701] 19:14:02.494913 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.495565 close(19)   = 0
> [pid 57701] 19:14:02.495617 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.496072 close(19)   = 0
> [pid 57701] 19:14:02.496128 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.496758 close(19)   = 0
> [pid 57701] 19:14:02.496812 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.497270 close(19)   = 0
> [pid 57701] 19:14:02.497319 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.497698 close(19)   = 0
> [pid 57701] 19:14:02.497750 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.498407 close(19)   = 0
> [pid 57701] 19:14:02.498456 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.498899 close(19)   = 0
> [pid 57701] 19:14:02.498963 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 63682] 19:14:02.499091 close(18 
> [pid 57701] 19:14:02.499634 close(19)   = 0
> [pid 57701] 19:14:02.499689 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.500044 close(19)   = 0
> [pid 57701] 19:14:02.500093 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.500734 close(19)   = 0
> [pid 57701] 19:14:02.500782 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.501271 close(19)   = 0
> [pid 57701] 19:14:02.501339 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.502030 close(19)   = 0
> [pid 57701] 19:14:02.502101 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 19
> ...
> ...
> [pid 57691] 19:18:03.461022 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid  6138] 19:18:03.461345 open("/etc/selinux/config", O_RDONLY  ...>
> [pid 57691] 19:18:03.461460 close(27)   = 0
> [pid 57691] 19:18:03.461520 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid  6138] 19:18:03.461632 close(3 
> [pid  6138] 19:18:03.461781 open("/proc/mounts", O_RDONLY 
> [pid  6138] 19:18:03.462190 close(3 
> [pid 57691] 19:18:03.462374 close(27)   = 0
> [pid 57691] 19:18:03.462430 socket(PF_NETLINK, SOCK_RAW, 0 
> [pid  6138] 19:18:03.462456 open("/proc/net/psched", O_RDONLY 
> [pid  6138] 19:18:03.462678 close(3 
> [pid  6138] 19:18:03.462915 open("/etc/libnl/classid", O_RDONLY  ...>
> [pid 57691] 19:18:03.463046 close(27)   = 0
> [pid 57691] 19:18:03.463111 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid  6138] 19:18:03.463225 close(3 
> [pid 57691] 19:18:03.463845 close(27)   = 0
> [pid 57691] 19:18:03.463911 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.464604 close(27)   = 0
> [pid 57691] 19:18:03.464664 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.465074 close(27)   = 0
> [pid 57691] 19:18:03.465132 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.465862 close(27)   = 0
> [pid 57691] 19:18:03.465928 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.466713 close(27)   = 0
> [pid 57691] 19:18:03.466780 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.467472 close(27)   = 0
> [pid 57691] 19:18:03.467524 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.468012 close(27)   = 0
> [pid 57691] 19:18:03.468075 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.468799 close(27)   = 0
> [pid 57691] 19:18:03.468950 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.469505 close(27)   = 0
> [pid 57691] 19:18:03.469578 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.470301 

[jira] [Created] (MESOS-6481) MesosContainerizerSlaveRecoveryTest.ResourceStatistics could segfault

2016-10-25 Thread Jie Yu (JIRA)
Jie Yu created MESOS-6481:
-

 Summary: MesosContainerizerSlaveRecoveryTest.ResourceStatistics 
could segfault
 Key: MESOS-6481
 URL: https://issues.apache.org/jira/browse/MESOS-6481
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.2.0
Reporter: Jie Yu


{noformat}
[--] 1 test from MesosContainerizerSlaveRecoveryTest
[ RUN  ] MesosContainerizerSlaveRecoveryTest.ResourceStatistics
I1025 09:24:00.334599  7253 exec.cpp:162] Version: 1.2.0
I1025 09:24:00.340852  7269 exec.cpp:237] Executor registered on agent 
5d7fe7df-aeca-451e-84f9-422cf78e7fee-S0
Received SUBSCRIBED event
Subscribed executor on core-dev
/home/jie/workspace/mesos/src/tests/slave_recovery_tests.cpp:4061: Failure
Value of: containers.get().size()
  Actual: 0
Expected: 1u
Which is: 1
*** Aborted at 1477412640 (unix time) try "date -d @1477412640" if you are 
using GNU date ***
I1025 09:24:00.369978  7281 exec.cpp:283] Received reconnect request from agent 
5d7fe7df-aeca-451e-84f9-422cf78e7fee-S0
I1025 09:24:00.371438  7250 exec.cpp:260] Executor re-registered on agent 
5d7fe7df-aeca-451e-84f9-422cf78e7fee-S0
PC: @ 0x2b1952628a06 mesos::ContainerID::MergeFrom()
Received SUBSCRIBED event
Subscribed executor on core-dev
*** SIGSEGV (@0x18) received by PID 40269 (TID 0x2b194d6ce440) from PID 24; 
stack trace: ***
@ 0x2b1962cd62f5 (unknown)
@ 0x2b1962cdaec1 (unknown)
@ 0x2b1962ccf1b8 (unknown)
@ 0x2b1953f72100 (unknown)
@ 0x2b1952628a06 mesos::ContainerID::MergeFrom()
@ 0x2b1952627e1c mesos::ContainerID::ContainerID()
@  0x162b774 
mesos::internal::tests::MesosContainerizerSlaveRecoveryTest_ResourceStatistics_Test::TestBody()
@  0x1ad8066 
testing::internal::HandleSehExceptionsInMethodIfSupported<>()
@  0x1ad3184 
testing::internal::HandleExceptionsInMethodIfSupported<>()
@  0x1ab4815 testing::Test::Run()
@  0x1ab4f98 testing::TestInfo::Run()
@  0x1ab55de testing::TestCase::Run()
@  0x1abbeb8 testing::internal::UnitTestImpl::RunAllTests()
@  0x1ad8c8b 
testing::internal::HandleSehExceptionsInMethodIfSupported<>()
@  0x1ad3ccc 
testing::internal::HandleExceptionsInMethodIfSupported<>()
@  0x1ababfe testing::UnitTest::Run()
@  0x1099562 RUN_ALL_TESTS()
@  0x1099131 main
@ 0x2b1954dccb15 __libc_start_main
@   0xa16669 (unknown)
I1025 09:24:03.721460  7281 exec.cpp:487] Agent exited, but framework has 
checkpointing enabled. Waiting 15mins to reconnect with agent 
5d7fe7df-aeca-451e-84f
9-422cf78e7fee-S0
I1025 09:24:03.721690  7281 exec.cpp:496] Agent exited ... shutting down
{noformat}
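
From the log above, the size check at slave_recovery_tests.cpp:4061 fails with an 
EXPECT-style assertion, so the test body keeps running, and the later segfault in 
{{ContainerID::MergeFrom()}} is consistent with dereferencing an element of an 
empty collection. A minimal sketch of the guard pattern (hypothetical test body, 
not the actual fixture) that aborts the body at the check instead:

{code}
// Minimal sketch (hypothetical test): with EXPECT_EQ the body keeps running
// after a failed size check and then dereferences an empty collection;
// ASSERT_EQ aborts the test body at the check, avoiding the crash.
#include <vector>

#include <gtest/gtest.h>

TEST(SketchTest, GuardBeforeDereference)
{
  std::vector<int> containers;  // stand-in for the recovered container ids

  // Aborts the test body here if the collection is empty.
  ASSERT_EQ(1u, containers.size());

  // Only reached when the check above passed.
  EXPECT_GE(containers.front(), 0);
}
{code}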



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6479) add ability to execute batch jobs from TaskGroupInfo proto in execute.cpp and add string flag for framework-name

2016-10-25 Thread Gilbert Song (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song updated MESOS-6479:

Labels: newbie newbie++ testing  (was: documentation features)

> add ability to execute batch jobs from TaskGroupInfo proto in execute.cpp and 
> add string flag for framework-name
> 
>
> Key: MESOS-6479
> URL: https://issues.apache.org/jira/browse/MESOS-6479
> Project: Mesos
>  Issue Type: Improvement
>  Components: cli
>Affects Versions: 1.1.0
> Environment: all
>Reporter: Hubert Asamer
>Priority: Trivial
>  Labels: newbie, newbie++, testing
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Extend execute.cpp to use TaskGroupInfo as a container for batch jobs and 
> distribute the tasks based on available offers. A simple bool CLI flag shall 
> enable/disable such behavior. If enabled, the contents of TaskGroupInfo are 
> not executed as tasks within a "pod" (on a single host) but as distributed 
> jobs (on multiple hosts). 
> In addition, an optional CLI flag for setting the temporary framework name 
> (e.g. to better distinguish between running and finished frameworks) could be 
> useful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6480) Support for docker live-restore option in Mesos

2016-10-25 Thread Gilbert Song (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15605830#comment-15605830
 ] 

Gilbert Song commented on MESOS-6480:
-

[~milindchawre], it seems like this JIRA duplicates MESOS-6381. Could you please 
confirm? I would close it as a duplicate, so we can track this on a single JIRA.

> Support for docker live-restore option in Mesos
> ---
>
> Key: MESOS-6480
> URL: https://issues.apache.org/jira/browse/MESOS-6480
> Project: Mesos
>  Issue Type: Task
>Reporter: Milind Chawre
>
> Docker-1.12 supports live-restore option which keeps containers alive during 
> docker daemon downtime https://docs.docker.com/engine/admin/live-restore/
> I tried to use this option in my Mesos setup and observed this:
> 1. On a mesos worker node, stop the docker daemon.
> 2. After some time, start the docker daemon. All the containers running on 
> that node are still visible using "docker ps". This is the expected behaviour 
> of the live-restore option.
> 3. When I check the mesos and marathon UIs, they show no active tasks running 
> on that node. The containers which are still running on that node are now 
> scheduled on different mesos nodes, which is not right since I can see the 
> containers in "docker ps" output because of the live-restore option.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6478) "filesystem/linux" isolator leaks (phantom) mounts in `mount` output

2016-10-25 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15605819#comment-15605819
 ] 

Jie Yu commented on MESOS-6478:
---

Maybe we should just stop updating mtab for the work_dir mount and use 
fs::mount directly.
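
For reference, a bind mount made through the mount(2) syscall never touches 
/etc/mtab, so no stale entry can be left behind; the mount is only visible in 
/proc/self/mounts while it exists. A minimal sketch of the idea (hypothetical 
path, not the Mesos fs::mount helper itself):

{code}
// Minimal sketch: a self bind mount via mount(2). Nothing here updates
// /etc/mtab, so no stale entry survives after the mount is gone; the mount
// shows up only in /proc/self/mounts while it is active. The path is
// hypothetical and the program needs CAP_SYS_ADMIN to run.
#include <sys/mount.h>

#include <cerrno>
#include <cstdio>
#include <cstring>

int main()
{
  const char* workDir = "/tmp/example_work_dir";  // hypothetical directory

  if (::mount(workDir, workDir, nullptr, MS_BIND, nullptr) != 0) {
    std::fprintf(stderr, "mount failed: %s\n", std::strerror(errno));
    return 1;
  }

  return 0;
}
{code}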

> "filesystem/linux" isolator leaks (phantom) mounts in `mount` output
> 
>
> Key: MESOS-6478
> URL: https://issues.apache.org/jira/browse/MESOS-6478
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Yan Xu
>Priority: Minor
>
> After running the tests I can find these with {{mount}}:
> {noformat:title=sample errors, there are many more}
> /tmp/LinuxFilesystemIsolatorTest_ROOT_Metrics_jHFTIO on 
> /tmp/LinuxFilesystemIsolatorTest_ROOT_Metrics_jHFTIO type none (rw,bind)
> /tmp/LinuxFilesystemIsolatorTest_ROOT_VolumeFromHost_W0T33m on 
> /tmp/LinuxFilesystemIsolatorTest_ROOT_VolumeFromHost_W0T33m type none 
> (rw,bind)
> /tmp/LinuxFilesystemIsolatorTest_ROOT_VolumeFromHostSandboxMountPoint_mGd8ff 
> on 
> /tmp/LinuxFilesystemIsolatorTest_ROOT_VolumeFromHostSandboxMountPoint_mGd8ff 
> type none (rw,bind)
> /tmp/LinuxFilesystemIsolatorTest_ROOT_ChangeRootFilesystem_uMlTqr on 
> /tmp/LinuxFilesystemIsolatorTest_ROOT_ChangeRootFilesystem_uMlTqr type none 
> (rw,bind)
> /tmp/LinuxFilesystemIsolatorTest_ROOT_FileVolumeFromHostSandboxMountPoint_tU9FqX
>  on 
> /tmp/LinuxFilesystemIsolatorTest_ROOT_FileVolumeFromHostSandboxMountPoint_tU9FqX
>  type none (rw,bind)
> /tmp/LinuxFilesystemIsolatorTest_ROOT_PersistentVolumeWithoutRootFilesystem_94nMdN
>  on 
> /tmp/LinuxFilesystemIsolatorTest_ROOT_PersistentVolumeWithoutRootFilesystem_94nMdN
>  type none (rw,bind)
> /tmp/LinuxFilesystemIsolatorTest_ROOT_PersistentVolumeWithRootFilesystem_cy8INw
>  on 
> /tmp/LinuxFilesystemIsolatorTest_ROOT_PersistentVolumeWithRootFilesystem_cy8INw
>  type none (rw,bind)
> /tmp/LinuxFilesystemIsolatorTest_ROOT_MultipleContainers_ot2j9R on 
> /tmp/LinuxFilesystemIsolatorTest_ROOT_MultipleContainers_ot2j9R type none 
> (rw,bind)
> /tmp/LinuxFilesystemIsolatorTest_ROOT_WorkDirMountNotNeeded_S0frNw on 
> /tmp/LinuxFilesystemIsolatorTest_ROOT_WorkDirMountNotNeeded_S0frNw type none 
> (rw,bind)
> /tmp/LinuxFilesystemIsolatorTest_ROOT_WorkDirMountNeeded_OH3VdQ on 
> /tmp/LinuxFilesystemIsolatorTest_ROOT_WorkDirMountNeeded_OH3VdQ type none 
> (rw,bind)
> /tmp/LinuxFilesystemIsolatorTest_ROOT_WorkDirMountNeeded_OH3VdQ/slave on 
> /tmp/LinuxFilesystemIsolatorTest_ROOT_WorkDirMountNeeded_OH3VdQ/slave type 
> none (rw,bind)
> /tmp/LinuxFilesystemIsolatorTest_ROOT_VolumeFromSandbox_1ILVSa on 
> /tmp/LinuxFilesystemIsolatorTest_ROOT_VolumeFromSandbox_1ILVSa type none 
> (rw,bind)
> /tmp/LinuxFilesystemIsolatorTest_ROOT_FileVolumeFromHost_PKSGpm on 
> /tmp/LinuxFilesystemIsolatorTest_ROOT_FileVolumeFromHost_PKSGpm type none 
> (rw,bind)
> {noformat}
> Of course these don't exist and they are gone from {{/proc/mounts}}, but I 
> imagine they could look scary after running the agent for a while. 
> Perhaps we can improve the Mesos mount utility to take care of these so they 
> don't leave such traces?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6446) WebUI redirect doesn't work with stats from /metric/snapshot

2016-10-25 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15605684#comment-15605684
 ] 

haosdent commented on MESOS-6446:
-

Sorry for the delay, I will post the patch in a few hours once testing is done.

> WebUI redirect doesn't work with stats from /metric/snapshot
> 
>
> Key: MESOS-6446
> URL: https://issues.apache.org/jira/browse/MESOS-6446
> Project: Mesos
>  Issue Type: Bug
>  Components: webui
>Affects Versions: 1.0.0
>Reporter: Yan Xu
>Assignee: haosdent
>Priority: Blocker
> Attachments: Screen Shot 2016-10-21 at 12.04.23 PM.png
>
>
> After Mesos 1.0, the webUI redirect is hidden from the users, so you can go to 
> any of the masters and the webUI is populated with state.json from the leading 
> master. 
> This doesn't include stats from /metric/snapshot though as it is not 
> redirected. The user ends up seeing some fields with empty values.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6446) WebUI redirect doesn't work with stats from /metric/snapshot

2016-10-25 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15605665#comment-15605665
 ] 

haosdent commented on MESOS-6446:
-

We could not use the new operator API because it is a POST request and does not 
support JSONP.

> WebUI redirect doesn't work with stats from /metric/snapshot
> 
>
> Key: MESOS-6446
> URL: https://issues.apache.org/jira/browse/MESOS-6446
> Project: Mesos
>  Issue Type: Bug
>  Components: webui
>Affects Versions: 1.0.0
>Reporter: Yan Xu
>Assignee: haosdent
>Priority: Blocker
> Attachments: Screen Shot 2016-10-21 at 12.04.23 PM.png
>
>
> After Mesos 1.0, the webUI redirect is hidden from the users, so you can go to 
> any of the masters and the webUI is populated with state.json from the leading 
> master. 
> This doesn't include stats from /metric/snapshot though as it is not 
> redirected. The user ends up seeing some fields with empty values.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4992) sandbox uri does not work outisde mesos http server

2016-10-25 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15605562#comment-15605562
 ] 

haosdent commented on MESOS-4992:
-

[~skonto][~bmahler] It is because we depend on {{/master/state}} to find the 
agent first. When this link is opened in a new tab, the following code is 
executed before the data from {{/master/state}} has been fetched, which produces 
the error. Currently the workaround is to paste the URL in the navigation bar 
and refresh again. 

> sandbox uri does not work outisde mesos http server
> ---
>
> Key: MESOS-4992
> URL: https://issues.apache.org/jira/browse/MESOS-4992
> Project: Mesos
>  Issue Type: Bug
>  Components: webui
>Affects Versions: 0.27.1
>Reporter: Stavros Kontopoulos
>Assignee: haosdent
>  Labels: mesosphere
>
> The SandBox URI of a framework does not work if I just copy-paste it into the 
> browser.
> For example the following sandbox uri:
> http://172.17.0.1:5050/#/slaves/50f87c73-79ef-4f2a-95f0-b2b4062b2de6-S0/frameworks/50f87c73-79ef-4f2a-95f0-b2b4062b2de6-0009/executors/driver-20160321155016-0001/browse
> should redirect to:
> http://172.17.0.1:5050/#/slaves/50f87c73-79ef-4f2a-95f0-b2b4062b2de6-S0/browse?path=%2Ftmp%2Fmesos%2Fslaves%2F50f87c73-79ef-4f2a-95f0-b2b4062b2de6-S0%2Fframeworks%2F50f87c73-79ef-4f2a-95f0-b2b4062b2de6-0009%2Fexecutors%2Fdriver-20160321155016-0001%2Fruns%2F60533483-31fb-4353-987d-f3393911cc80
> yet it fails with the message:
> "Failed to find slaves.
> Navigate to the slave's sandbox via the Mesos UI."
> and redirects to:
> http://172.17.0.1:5050/#/
> It is an issue for me because I'm working on expanding the Mesos Spark UI with 
> the sandbox URI. The other option is to get the slave info, parse the JSON file 
> there, and extract the executor paths, which is not so straightforward or 
> elegant though.
> Moreover, I don't see the runs/container_id in the Mesos proto API. I guess 
> this is hidden info, and it is the piece of info needed to re-write the URI 
> without redirection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-4992) sandbox uri does not work outisde mesos http server

2016-10-25 Thread haosdent (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haosdent reassigned MESOS-4992:
---

Assignee: haosdent

> sandbox uri does not work outisde mesos http server
> ---
>
> Key: MESOS-4992
> URL: https://issues.apache.org/jira/browse/MESOS-4992
> Project: Mesos
>  Issue Type: Bug
>  Components: webui
>Affects Versions: 0.27.1
>Reporter: Stavros Kontopoulos
>Assignee: haosdent
>  Labels: mesosphere
>
> The SandBox URI of a framework does not work if I just copy-paste it into the 
> browser.
> For example the following sandbox uri:
> http://172.17.0.1:5050/#/slaves/50f87c73-79ef-4f2a-95f0-b2b4062b2de6-S0/frameworks/50f87c73-79ef-4f2a-95f0-b2b4062b2de6-0009/executors/driver-20160321155016-0001/browse
> should redirect to:
> http://172.17.0.1:5050/#/slaves/50f87c73-79ef-4f2a-95f0-b2b4062b2de6-S0/browse?path=%2Ftmp%2Fmesos%2Fslaves%2F50f87c73-79ef-4f2a-95f0-b2b4062b2de6-S0%2Fframeworks%2F50f87c73-79ef-4f2a-95f0-b2b4062b2de6-0009%2Fexecutors%2Fdriver-20160321155016-0001%2Fruns%2F60533483-31fb-4353-987d-f3393911cc80
> yet it fails with the message:
> "Failed to find slaves.
> Navigate to the slave's sandbox via the Mesos UI."
> and redirects to:
> http://172.17.0.1:5050/#/
> It is an issue for me because I'm working on expanding the Mesos Spark UI with 
> the sandbox URI. The other option is to get the slave info, parse the JSON file 
> there, and extract the executor paths, which is not so straightforward or 
> elegant though.
> Moreover, I don't see the runs/container_id in the Mesos proto API. I guess 
> this is hidden info, and it is the piece of info needed to re-write the URI 
> without redirection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6215) Add support for opaque whiteout (.wh..wh..opq) in provisioner

2016-10-25 Thread Qian Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15604763#comment-15604763
 ] 

Qian Zhang commented on MESOS-6215:
---

Actually, the solution in the patch https://reviews.apache.org/r/52118/ for 
handling the opaque whiteout (.wh..wh..opq) is not correct. I have reworked the 
whole implementation of whiteout file handling in [MESOS-6360 | 
https://issues.apache.org/jira/browse/MESOS-6360]; please check that ticket for 
details.
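
For background, an opaque whiteout in a layer means that everything lower layers 
placed in that directory must be hidden, while the current layer's own entries 
remain. A minimal sketch of how a copy-style backend could honor the marker 
(hypothetical helper, not the actual provisioner code):

{code}
// Minimal sketch (hypothetical helper): if a layer directory contains the
// opaque whiteout marker ".wh..wh..opq", everything earlier layers placed
// in the matching rootfs directory is discarded before this layer's own
// entries are copied in.
#include <filesystem>

namespace fs = std::filesystem;

void applyOpaqueWhiteout(const fs::path& layerDir, const fs::path& rootfsDir)
{
  if (!fs::exists(layerDir / ".wh..wh..opq")) {
    return;  // No opaque whiteout in this directory.
  }

  // Drop everything inherited from the lower layers.
  for (const fs::directory_entry& entry : fs::directory_iterator(rootfsDir)) {
    fs::remove_all(entry.path());
  }

  // The caller then copies this layer's entries, skipping the marker file.
}
{code}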

> Add support for opaque whiteout (.wh..wh..opq) in provisioner
> -
>
> Key: MESOS-6215
> URL: https://issues.apache.org/jira/browse/MESOS-6215
> Project: Mesos
>  Issue Type: Bug
>Reporter: Qian Zhang
>Assignee: Qian Zhang
> Attachments: whiteout.diff
>
>
> In a Docker image, there can be an opaque whiteout entry (a file with the name 
> {{.wh..wh..opq}}) under a directory, which indicates all siblings under that 
> directory should be removed. But currently the Mesos provisioner does not 
> support handling such opaque whiteout entries, which causes launching 
> containers from some Docker images to fail, e.g.:
> {code}
> $ sudo src/mesos-execute --master=192.168.122.171:5050 --name=test 
> --docker_image=rabbitmq --command="sleep 100"
> I0921 09:22:05.167716 15522 scheduler.cpp:176] Version: 1.1.0
> I0921 09:22:05.172436 15541 scheduler.cpp:465] New master detected at 
> master@192.168.122.171:5050
> Subscribed with ID 7ab88509-c068-46b3-b8be-4817e5170a7e-
> Submitted task 'test' to agent '7ab88509-c068-46b3-b8be-4817e5170a7e-S0'
> Received status update TASK_FAILED for task 'test'
>   message: 'Failed to launch container: Failed to remove whiteout file 
> '/opt/mesos/provisioner/containers/2c4ed860-6256-4fa7-899b-9989d856dab7/backends/copy/rootfses/62e38280-1fd5-4fa7-8707-b19bdc24ae96/var/lib/apt/lists/partial/.wh..opq':
>  No such file or directory'
>   source: SOURCE_AGENT
>   reason: REASON_CONTAINER_LAUNCH_FAILED
> {code}
> Check OCI image spec for more details about opaque whiteout:
> https://github.com/opencontainers/image-spec/blob/master/layer.md#opaque-whiteout



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6215) Add support for opaque whiteout (.wh..wh..opq) in provisioner

2016-10-25 Thread Qian Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15604750#comment-15604750
 ] 

Qian Zhang commented on MESOS-6215:
---

Here is the patch for the test:
https://reviews.apache.org/r/53127/

> Add support for opaque whiteout (.wh..wh..opq) in provisioner
> -
>
> Key: MESOS-6215
> URL: https://issues.apache.org/jira/browse/MESOS-6215
> Project: Mesos
>  Issue Type: Bug
>Reporter: Qian Zhang
>Assignee: Qian Zhang
> Attachments: whiteout.diff
>
>
> In a Docker image, there can be an opaque whiteout entry (a file with the name 
> {{.wh..wh..opq}}) under a directory, which indicates all siblings under that 
> directory should be removed. But currently the Mesos provisioner does not 
> support handling such opaque whiteout entries, which causes launching 
> containers from some Docker images to fail, e.g.:
> {code}
> $ sudo src/mesos-execute --master=192.168.122.171:5050 --name=test 
> --docker_image=rabbitmq --command="sleep 100"
> I0921 09:22:05.167716 15522 scheduler.cpp:176] Version: 1.1.0
> I0921 09:22:05.172436 15541 scheduler.cpp:465] New master detected at 
> master@192.168.122.171:5050
> Subscribed with ID 7ab88509-c068-46b3-b8be-4817e5170a7e-
> Submitted task 'test' to agent '7ab88509-c068-46b3-b8be-4817e5170a7e-S0'
> Received status update TASK_FAILED for task 'test'
>   message: 'Failed to launch container: Failed to remove whiteout file 
> '/opt/mesos/provisioner/containers/2c4ed860-6256-4fa7-899b-9989d856dab7/backends/copy/rootfses/62e38280-1fd5-4fa7-8707-b19bdc24ae96/var/lib/apt/lists/partial/.wh..opq':
>  No such file or directory'
>   source: SOURCE_AGENT
>   reason: REASON_CONTAINER_LAUNCH_FAILED
> {code}
> Check OCI image spec for more details about opaque whiteout:
> https://github.com/opencontainers/image-spec/blob/master/layer.md#opaque-whiteout



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-6360) The handling of whiteout files in provisioner is not correct

2016-10-25 Thread Qian Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15600711#comment-15600711
 ] 

Qian Zhang edited comment on MESOS-6360 at 10/25/16 8:40 AM:
-

RR:
https://reviews.apache.org/r/53041/
https://reviews.apache.org/r/53042/
https://reviews.apache.org/r/53053/
https://reviews.apache.org/r/53161/
https://reviews.apache.org/r/53115/
https://reviews.apache.org/r/53116/
https://reviews.apache.org/r/53127/


was (Author: qianzhang):
RR:
https://reviews.apache.org/r/53041/
https://reviews.apache.org/r/53042/
https://reviews.apache.org/r/53053/
https://reviews.apache.org/r/53115/
https://reviews.apache.org/r/53116/
https://reviews.apache.org/r/53127/

> The handling of whiteout files in provisioner is not correct
> 
>
> Key: MESOS-6360
> URL: https://issues.apache.org/jira/browse/MESOS-6360
> Project: Mesos
>  Issue Type: Bug
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Blocker
>
> Currently, when a user launches a container from a Docker image via the 
> universal containerizer, we always handle the whiteout files in 
> {{ProvisionerProcess::__provision()}} regardless of which backend is used.
> However, this is actually not correct, because the way to handle whiteout 
> files is backend dependent; that means for different backends, we need to 
> handle whiteout files in different ways, e.g.:
> * AUFS backend: It seems the AUFS whiteout ({{.wh.}} and 
> {{.wh..wh..opq}}) is the whiteout standard in Docker (see [this comment | 
> https://github.com/docker/docker/blob/v1.12.1/pkg/archive/archive.go#L259:L262]
>  for details), so that means after the Docker image is pulled, its whiteout 
> files in the store are already in aufs format, then we do not need to do 
> anything about whiteout file handling because the aufs mount done in 
> {{AufsBackendProcess::provision()}} will handle it automatically.
> * Overlay backend: Overlayfs has its own whiteout files (see [this doc | 
> https://www.kernel.org/doc/Documentation/filesystems/overlayfs.txt] for 
> details), so we need to convert the aufs whiteout files to overlayfs whiteout 
> files before we do the overlay mount in {{OverlayBackendProcess::provision}} 
> which will automatically handle the overlayfs whiteout files.
> * Copy backend: We need to manually handle the aufs whiteout files when we 
> copy each layer in {{CopyBackendProcess::_provision()}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4679) slave dies unexpectedly: Mismatched checkpoint value for status update TASK_LOST

2016-10-25 Thread Jian Qiu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15604648#comment-15604648
 ] 

Jian Qiu commented on MESOS-4679:
-

It still seems to be an issue in 1.0.0 when using k8s on Mesos.

 

> slave dies unexpectedly: Mismatched checkpoint value for status update 
> TASK_LOST
> 
>
> Key: MESOS-4679
> URL: https://issues.apache.org/jira/browse/MESOS-4679
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Affects Versions: 0.26.0
>Reporter: James DeFelice
>  Labels: mesosphere
>
> It looks like the custom executor is sending out multiple terminal status 
> updates for a specific task and that's crashing the slave (as well as 
> possibly mishandling status-update UUIDs?). In any event, I think that the 
> slave should handle this case with a bit more aplomb.
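
Independent of how the agent hardens its checkpoint handling, the report above 
boils down to more than one terminal update being sent for the same task. A 
minimal sketch of a guard an executor could keep (illustrative only; the k8sm 
executor in the logs below is Go, this sketch follows the C++ used elsewhere in 
this digest):

{code}
// Minimal sketch (illustrative, not the k8sm executor): remember which
// tasks already got a terminal status update and refuse to send another.
#include <set>
#include <string>

class TerminalUpdateGuard
{
public:
  // Returns true the first time a terminal update is recorded for the
  // task, false for every later attempt.
  bool allowTerminal(const std::string& taskId)
  {
    return terminal_.insert(taskId).second;
  }

private:
  std::set<std::string> terminal_;
};
{code}
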
> Custom executor logs:
> {code}
> I0215 20:43:59.551657   11068 executor.go:426] Executor driver killTask
> I0215 20:43:59.551719   11068 executor.go:436] Executor driver is asked to 
> kill task 
> '{Value:*pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],}'
> I0215 20:43:59.552189   11068 executor.go:687] Executor sending status update 
> {FrameworkId:{Value:*df95a79b-d6d4-4b96-853e-55686628e898-0006,XXX_unrecognized:[],},ExecutorId:{Value:*31df9d040f057abd_k8sm-executor,XXX_unrecognized:[],},SlaveId:{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Status:{TaskId:{Value:*pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],},State:*TASK_LOST,Data:nil,Message:*kill-pod-task,SlaveId:{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,ExecutorId:nil,Healthy:nil,Source:nil,Reason:nil,Uuid:nil,Labels:nil,ContainerStatus:nil,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,Uuid:*[214
>  253 145 223 212 36 17 229 158 224 82 84 0 231 66 
> 70],LatestState:nil,XXX_unrecognized:[],}
> I0215 20:43:59.552599   11068 executor.go:687] Executor sending status update 
> {FrameworkId:{Value:*df95a79b-d6d4-4b96-853e-55686628e898-0006,XXX_unrecognized:[],},ExecutorId:{Value:*31df9d040f057abd_k8sm-executor,XXX_unrecognized:[],},SlaveId:{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Status:{TaskId:{Value:*pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],},State:*TASK_KILLED,Data:nil,Message:*pod-deleted,SlaveId:{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,ExecutorId:nil,Healthy:nil,Source:nil,Reason:nil,Uuid:nil,Labels:nil,ContainerStatus:nil,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,Uuid:*[214
>  253 162 110 212 36 17 229 158 224 82 84 0 231 66 
> 70],LatestState:nil,XXX_unrecognized:[],}
> I0215 20:43:59.557376   11068 suicide.go:51] stopping suicide watch
> I0215 20:43:59.559077   11068 executor.go:445] Executor 
> statusUpdateAcknowledgement
> I0215 20:43:59.559129   11068 executor.go:448] Receiving status update 
> acknowledgement 
> {SlaveId:{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},FrameworkId:{Value:*df95a79b-d6d4-4b96-853e-55686628e898-0006,XXX_unrecognized:[],},TaskId:{Value:*pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],},Uuid:*[214
>  253 145 223 212 36 17 229 158 224 82 84 0 231 66 70],XXX_unrecognized:[],}
> I0215 20:43:59.562016   11068 executor.go:470] Executor driver received 
> frameworkMessage
> I0215 20:43:59.562073   11068 executor.go:480] Executor driver receives 
> framework message
> I0215 20:43:59.562100   11068 executor.go:445] Executor 
> statusUpdateAcknowledgement
> I0215 20:43:59.562112   11068 executor.go:448] Receiving status update 
> acknowledgement 
> {SlaveId:{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},FrameworkId:{Value:*df95a79b-d6d4-4b96-853e-55686628e898-0006,XXX_unrecognized:[],},TaskId:{Value:*pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],},Uuid:*[214
>  253 162 110 212 36 17 229 158 224 82 84 0 231 66 70],XXX_unrecognized:[],}
> I0215 20:43:59.562173   11068 executor.go:579] Receives message from 
> framework task-lost:pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f
> I0215 20:43:59.562292   11068 executor.go:687] Executor sending status update 
> 

[jira] [Created] (MESOS-6480) Support for docker live-restore option in Mesos

2016-10-25 Thread Milind Chawre (JIRA)
Milind Chawre created MESOS-6480:


 Summary: Support for docker live-restore option in Mesos
 Key: MESOS-6480
 URL: https://issues.apache.org/jira/browse/MESOS-6480
 Project: Mesos
  Issue Type: Task
Reporter: Milind Chawre


Docker-1.12 supports live-restore option which keeps containers alive during 
docker daemon downtime https://docs.docker.com/engine/admin/live-restore/
I tried to use this option in my Mesos setup and observed this:
1. On a mesos worker node, stop the docker daemon.
2. After some time, start the docker daemon. All the containers running on that 
node are still visible using "docker ps". This is the expected behaviour of the 
live-restore option.
3. When I check the mesos and marathon UIs, they show no active tasks running on 
that node. The containers which are still running on that node are now scheduled 
on different mesos nodes, which is not right since I can see the containers in 
"docker ps" output because of the live-restore option.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6420) Mesos Agent leaking sockets when port mapping network isolator is ON

2016-10-25 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6420:
---
Target Version/s: 1.1.0  (was: 1.1.1, 1.2.0)

> Mesos Agent leaking sockets when port mapping network isolator is ON
> 
>
> Key: MESOS-6420
> URL: https://issues.apache.org/jira/browse/MESOS-6420
> Project: Mesos
>  Issue Type: Bug
>  Components: isolation, network, slave
>Affects Versions: 1.0.2
>Reporter: Santhosh Shanmugham
> Fix For: 1.0.2, 1.2.0
>
>
> Mesos Agent leaks one socket per task launched and eventually runs out of 
> sockets. We were able to track it down to the network isolator 
> (port_mapping.cpp). When we turned off the port mapping isolator, no file 
> descriptors were leaked. The leaked fd is a SOCK_STREAM socket.
> Leaked Sockets:
> $ sudo lsof -p $(pgrep -u root -o -f /usr/local/sbin/mesos-slave) -nP | grep 
> "can't"
> [sudo] password for sshanmugham:
> mesos-sla 57688 root   19u  sock0,6  0t0 2993216948 can't 
> identify protocol
> mesos-sla 57688 root   27u  sock0,6  0t0 2993216468 can't 
> identify protocol
> Extract from strace:
> ...
> [pid 57701] 19:14:02.493718 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.494395 close(19)   = 0
> [pid 57701] 19:14:02.494448 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.494844 close(19)   = 0
> [pid 57701] 19:14:02.494913 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.495565 close(19)   = 0
> [pid 57701] 19:14:02.495617 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.496072 close(19)   = 0
> [pid 57701] 19:14:02.496128 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.496758 close(19)   = 0
> [pid 57701] 19:14:02.496812 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.497270 close(19)   = 0
> [pid 57701] 19:14:02.497319 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.497698 close(19)   = 0
> [pid 57701] 19:14:02.497750 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.498407 close(19)   = 0
> [pid 57701] 19:14:02.498456 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.498899 close(19)   = 0
> [pid 57701] 19:14:02.498963 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 63682] 19:14:02.499091 close(18 
> [pid 57701] 19:14:02.499634 close(19)   = 0
> [pid 57701] 19:14:02.499689 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.500044 close(19)   = 0
> [pid 57701] 19:14:02.500093 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.500734 close(19)   = 0
> [pid 57701] 19:14:02.500782 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.501271 close(19)   = 0
> [pid 57701] 19:14:02.501339 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.502030 close(19)   = 0
> [pid 57701] 19:14:02.502101 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 19
> ...
> ...
> [pid 57691] 19:18:03.461022 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid  6138] 19:18:03.461345 open("/etc/selinux/config", O_RDONLY  ...>
> [pid 57691] 19:18:03.461460 close(27)   = 0
> [pid 57691] 19:18:03.461520 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid  6138] 19:18:03.461632 close(3 
> [pid  6138] 19:18:03.461781 open("/proc/mounts", O_RDONLY 
> [pid  6138] 19:18:03.462190 close(3 
> [pid 57691] 19:18:03.462374 close(27)   = 0
> [pid 57691] 19:18:03.462430 socket(PF_NETLINK, SOCK_RAW, 0 
> [pid  6138] 19:18:03.462456 open("/proc/net/psched", O_RDONLY 
> [pid  6138] 19:18:03.462678 close(3 
> [pid  6138] 19:18:03.462915 open("/etc/libnl/classid", O_RDONLY  ...>
> [pid 57691] 19:18:03.463046 close(27)   = 0
> [pid 57691] 19:18:03.463111 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid  6138] 19:18:03.463225 close(3 
> [pid 57691] 19:18:03.463845 close(27)   = 0
> [pid 57691] 19:18:03.463911 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.464604 close(27)   = 0
> [pid 57691] 19:18:03.464664 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.465074 close(27)   = 0
> [pid 57691] 19:18:03.465132 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.465862 close(27)   = 0
> [pid 57691] 19:18:03.465928 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.466713 close(27)   = 0
> [pid 57691] 19:18:03.466780 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.467472 close(27)   = 0
> [pid 57691] 19:18:03.467524 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.468012 close(27)   = 0
> [pid 57691] 19:18:03.468075 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.468799 close(27)   = 0
> [pid 57691] 19:18:03.468950 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.469505 close(27)   = 0
> [pid 57691] 19:18:03.469578 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.470301 close(27)   = 0
> [pid 57691] 
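
One general way to rule out this class of leak, regardless of where the 
descriptor comes from, is to hold every raw fd in an RAII guard so it is closed 
on every exit path. A minimal sketch of the pattern (not the port mapping 
isolator code):

{code}
// Minimal sketch of an RAII guard for raw file descriptors; not the port
// mapping isolator code, just the general pattern that guarantees a
// descriptor is closed on every exit path.
#include <sys/socket.h>
#include <unistd.h>

class FdGuard
{
public:
  explicit FdGuard(int fd) : fd_(fd) {}
  ~FdGuard() { if (fd_ >= 0) { ::close(fd_); } }

  FdGuard(const FdGuard&) = delete;
  FdGuard& operator=(const FdGuard&) = delete;

  int get() const { return fd_; }

private:
  int fd_;
};

int main()
{
  FdGuard sock(::socket(AF_INET, SOCK_STREAM, 0));

  // ... use sock.get() here; early returns and exceptions are safe ...

  return 0;  // The socket is closed by the destructor.
}
{code}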

[jira] [Created] (MESOS-6479) add ability to execute batch jobs from TaskGroupInfo proto in execute.cpp and add string flag for framework-name

2016-10-25 Thread Hubert Asamer (JIRA)
Hubert Asamer created MESOS-6479:


 Summary: add ability to execute batch jobs from TaskGroupInfo 
proto in execute.cpp and add string flag for framework-name
 Key: MESOS-6479
 URL: https://issues.apache.org/jira/browse/MESOS-6479
 Project: Mesos
  Issue Type: Improvement
  Components: cli
Affects Versions: 1.1.0
 Environment: all
Reporter: Hubert Asamer
Priority: Trivial


Extend execute.cpp to use TaskGroupInfo as a container for batch jobs and 
distribute the tasks based on available offers. A simple bool CLI flag shall 
enable/disable such behavior. If enabled, the contents of TaskGroupInfo are not 
executed as tasks within a "pod" (on a single host) but as distributed jobs (on 
multiple hosts). 
In addition, an optional CLI flag for setting the temporary framework name 
(e.g. to better distinguish between running and finished frameworks) could be 
useful.
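
A rough sketch of what the two proposed flags might look like in the stout flags 
style already used by execute.cpp; the flag names, help strings, and defaults 
below are hypothetical, not an actual implementation:

{code}
// Rough sketch only: hypothetical flag definitions in the stout flags
// style. Flag names and defaults are illustrative.
#include <string>

#include <stout/flags.hpp>

class Flags : public virtual flags::FlagsBase
{
public:
  Flags()
  {
    add(&Flags::distribute_task_group,
        "distribute_task_group",
        "If true, launch the tasks of the given TaskGroupInfo as\n"
        "independent batch jobs across offers instead of as a pod\n"
        "on a single agent.",
        false);

    add(&Flags::framework_name,
        "framework_name",
        "Name to register the temporary framework under.",
        "mesos-execute instance");
  }

  bool distribute_task_group;
  std::string framework_name;
};
{code}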



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)