[jira] [Updated] (MESOS-5498) Implement SUBSCRIBE Call in v1 master API.

2016-07-05 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-5498:
--
Shepherd: Vinod Kone

> Implement SUBSCRIBE Call in v1 master API.
> --
>
> Key: MESOS-5498
> URL: https://issues.apache.org/jira/browse/MESOS-5498
> Project: Mesos
>  Issue Type: Task
>Reporter: Vinod Kone
>Assignee: Zhitao Li
> Fix For: 1.0.0
>
>
> This call lets a client subscribe to an event stream.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-5489) Implement GET_STATE Call in v1 master API.

2016-07-05 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363830#comment-15363830
 ] 

Anand Mazumdar edited comment on MESOS-5489 at 7/6/16 5:57 AM:
---

{noformat}
commit 3038809e38a3b8598e3c33c0e66d9db4552f0d29
Author: Zhitao Li 
Date:   Tue Jul 5 21:25:56 2016 -0700

Implemented 'GetState' call in v1 master API.

Also created a helper function `_getState()` that will be used
for snapshot of event stream when a client subscribes.

Review: https://reviews.apache.org/r/49517/

commit 532f66a3a9c98bdb2259091851707cb9e42da0ea
Author: Zhitao Li 
Date:   Tue Jul 5 21:25:53 2016 -0700

Refactored 'Master::Http::getExecutors()' into helper function.

This helper function will be reused by `GetExecutors` and `GetState`.

Review: https://reviews.apache.org/r/49516/

commit 3a988e2eac02f8f5590e1adced40d10b5360cabe
Author: Zhitao Li 
Date:   Tue Jul 5 21:25:50 2016 -0700

Revised protobuf definition of 'GetState' response.

Review: https://reviews.apache.org/r/49509/

commit dc73420f920d948f63e8abddf37d3136969c2c88
Author: Zhitao Li 
Date:   Tue Jul 5 21:25:46 2016 -0700

Refactored 'master::Http::getFrameworks()' to helper function.

This helper function will be reused by `GET_FRAMEWORKS` and
`GET_STATE` calls.

Review: https://reviews.apache.org/r/49489/

commit 2517ce8f4ebf419589b9298b183149d221d9aecc
Author: Zhitao Li 
Date:   Tue Jul 5 21:25:43 2016 -0700

Refactored 'Master::Http::getAgents()' into helper function.

This helper function will be reused by both `GET_AGENTS`
and `GET_STATE` calls.

Review: https://reviews.apache.org/r/49488/

commit a3182645e01cf9240008f3b84c1142a552600954
Author: Zhitao Li 
Date:   Tue Jul 5 21:25:39 2016 -0700

Refactored 'Master::Http::getTasks()' into helper function.

This helper function will also be reused for `GetState`.

Review: https://reviews.apache.org/r/49487/
{noformat}
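The commit series above composes the `GetState` call out of the per-entity helper functions (`getTasks()`, `getExecutors()`, and so on). A rough Python sketch of that aggregation pattern follows; all names are hypothetical stand-ins, not the actual Mesos C++ APIs:

```python
# Sketch of composing a GetState-style response from per-entity helpers,
# mirroring the refactoring described in the commits above. All names
# (get_tasks, get_executors, ...) are illustrative, not Mesos APIs.

def get_tasks():
    return [{"id": "task-1", "state": "TASK_RUNNING"}]

def get_executors():
    return [{"id": "executor-1"}]

def get_frameworks():
    return [{"id": "framework-1"}]

def get_agents():
    return [{"id": "agent-1"}]

def get_state():
    # GetState simply aggregates the outputs of the shared helpers, so the
    # same helpers can also serve GET_TASKS, GET_EXECUTORS, GET_FRAMEWORKS,
    # and GET_AGENTS without duplicating any collection logic.
    return {
        "get_tasks": {"tasks": get_tasks()},
        "get_executors": {"executors": get_executors()},
        "get_frameworks": {"frameworks": get_frameworks()},
        "get_agents": {"agents": get_agents()},
    }

state = get_state()
```

The same aggregate can then serve as the snapshot sent to a client when it first subscribes to the event stream.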


> Implement GET_STATE Call in v1 master API.
> --
>
> Key: MESOS-5489
> URL: https://issues.apache.org/jira/browse/MESOS-5489
> Project: Mesos
>  Issue Type: Task
>Reporter: Vinod Kone
>Assignee: Zhitao Li
> Fix For: 1.0.0
>
>






[jira] [Updated] (MESOS-5221) Add Documentation for Nvidia GPU support

2016-07-05 Thread Kevin Klues (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Klues updated MESOS-5221:
---
Fix Version/s: (was: 1.0.0)

Removing the 1.0 fix version for now. We will commit the documentation for 
Nvidia GPU support after RC2 has been cut.

> Add Documentation for Nvidia GPU support
> 
>
> Key: MESOS-5221
> URL: https://issues.apache.org/jira/browse/MESOS-5221
> Project: Mesos
>  Issue Type: Documentation
>Reporter: Kevin Klues
>Assignee: Kevin Klues
>Priority: Minor
>
> https://reviews.apache.org/r/46220/





[jira] [Created] (MESOS-5795) Add support for Nvidia GPUs in the docker containerizer

2016-07-05 Thread Kevin Klues (JIRA)
Kevin Klues created MESOS-5795:
--

 Summary: Add support for Nvidia GPUs in the docker containerizer
 Key: MESOS-5795
 URL: https://issues.apache.org/jira/browse/MESOS-5795
 Project: Mesos
  Issue Type: Task
  Components: docker, isolation
Reporter: Kevin Klues


In order to support Nvidia GPUs with docker containers in Mesos, we need to be 
able to consolidate all Nvidia libraries into a common volume and inject that 
volume into the container. This tracks the support in the docker containerizer. 
The mesos containerizer support has already been completed in MESOS-5401.

More info on why this is necessary here: 
https://github.com/NVIDIA/nvidia-docker/
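The idea can be sketched as building a `docker run` command line that mounts the consolidated Nvidia volume into the container, in the spirit of nvidia-docker. The volume name, mount point, and device paths below are illustrative assumptions, not Mesos code:

```python
# Sketch of injecting a consolidated Nvidia volume into a docker run
# command line, as nvidia-docker does. The volume name, mount point,
# and device paths are illustrative assumptions, not Mesos internals.

def docker_run_args(image, command, gpu_devices=(), nvidia_volume=None):
    args = ["docker", "run"]
    for dev in gpu_devices:
        args.append("--device=" + dev)            # e.g. /dev/nvidia0
    if nvidia_volume:
        # Mount the pre-consolidated libraries/binaries read-only.
        args.append("--volume=%s:/usr/local/nvidia:ro" % nvidia_volume)
    args += [image] + list(command)
    return args

args = docker_run_args(
    "ubuntu:16.04", ["nvidia-smi"],
    gpu_devices=["/dev/nvidiactl", "/dev/nvidia0"],
    nvidia_volume="nvidia_driver_367.35")
```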





[jira] [Updated] (MESOS-5779) Allow Docker v1 ImageManifests to be parsed from the output of `docker inspect`

2016-07-05 Thread Kevin Klues (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Klues updated MESOS-5779:
---
Fix Version/s: (was: 1.0.0)

> Allow Docker v1 ImageManifests to be parsed from the output of `docker 
> inspect`
> ---
>
> Key: MESOS-5779
> URL: https://issues.apache.org/jira/browse/MESOS-5779
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Kevin Klues
>Assignee: Kevin Klues
>
> The `docker::spec::v1::ImageManifest` protobuf implements the
> official v1 image manifest specification found at:
> 
> https://github.com/docker/docker/blob/master/image/spec/v1.md
> 
> The field names in this spec are all written in snake_case as are the
> field names of the JSON representing the image manifest when reading
> it from disk (for example after performing a `docker save`). As such,
> the protobuf for ImageManifest also provides these fields in
> snake_case. Unfortunately, the `docker inspect` command also provides
> a method of retrieving the JSON for an image manifest, with one major
> caveat -- it represents all of its top level keys in CamelCase.
> 
> To allow both representations to be parsed in the same way, we
> should intercept the incoming JSON from either source (disk or `docker
> inspect`) and convert it to a canonical snake_case representation.
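The canonicalization described above can be sketched briefly; Mesos implements this in C++, but the key-normalization step looks roughly like this in Python (key names are examples from `docker inspect` output):

```python
import re

# Sketch of the canonicalization described above: normalize the top-level
# CamelCase keys emitted by `docker inspect` to the snake_case names used
# by the v1 image spec (and by `docker save` output). Illustrative only;
# the actual Mesos implementation lives in the C++ docker spec code.

def camel_to_snake(name):
    # Insert an underscore before each interior uppercase letter,
    # then lowercase: "DockerVersion" -> "docker_version", "Id" -> "id".
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

def canonicalize(manifest):
    # Only the top-level keys differ between the two sources, so a
    # shallow rewrite is enough.
    return {camel_to_snake(k): v for k, v in manifest.items()}

inspected = {"Id": "abc123", "Created": "2016-07-05", "DockerVersion": "1.11"}
canonical = canonicalize(inspected)
```

After this step, JSON from disk and JSON from `docker inspect` can be parsed by the same protobuf-backed path.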





[jira] [Commented] (MESOS-5757) Authorize orphaned tasks

2016-07-05 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363769#comment-15363769
 ] 

Zhitao Li commented on MESOS-5757:
--

Should this be mentioned in the changelog and/or upgrade.md?

> Authorize orphaned tasks
> 
>
> Key: MESOS-5757
> URL: https://issues.apache.org/jira/browse/MESOS-5757
> Project: Mesos
>  Issue Type: Bug
>  Components: security
>Affects Versions: 1.0.0
>Reporter: Vinod Kone
>Assignee: Joerg Schad
>Priority: Critical
>  Labels: mesosphere, security
> Fix For: 1.0.0
>
>
> Currently, orphaned tasks are not filtered (i.e., via authorization) when a 
> request is made to the /state endpoint. This is inconsistent (and unexpected) 
> compared with how we filter non-orphaned tasks. 
> This is tricky because master and hence the authorizer do not have 
> FrameworkInfos for these orphaned tasks, until after the corresponding 
> frameworks re-register.
> One option is for the agent to include FrameworkInfos of all its tasks and 
> executors in its re-registration message.





[jira] [Updated] (MESOS-5401) Add ability to inject a Volume of Nvidia libraries/binaries into a docker-image container in mesos containerizer.

2016-07-05 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-5401:
---
Description: 
In order to support Nvidia GPUs with docker containers in Mesos, we need to be 
able to consolidate all Nvidia libraries into a common volume and inject that 
volume into the container.

This tracks the support in the mesos containerizer. The docker containerizer 
support will be tracked separately.

More info on why this is necessary here: 
https://github.com/NVIDIA/nvidia-docker/

  was:
In order to support Nvidia GPUs with docker containers in Mesos, we need to be 
able to consolidate all Nvidia libraries into a common volume and inject that 
volume into the container.

More info on why this is necessary here: 
https://github.com/NVIDIA/nvidia-docker/

Summary: Add ability to inject a Volume of Nvidia libraries/binaries 
into a docker-image container in mesos containerizer.  (was: Add ability to 
inject a Volume of Nvidia GPU-related libraries into a docker container.)

> Add ability to inject a Volume of Nvidia libraries/binaries into a 
> docker-image container in mesos containerizer.
> -
>
> Key: MESOS-5401
> URL: https://issues.apache.org/jira/browse/MESOS-5401
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Kevin Klues
>Assignee: Kevin Klues
>  Labels: gpu, mesosphere
> Fix For: 1.0.0
>
>
> In order to support Nvidia GPUs with docker containers in Mesos, we need to 
> be able to consolidate all Nvidia libraries into a common volume and inject 
> that volume into the container.
> This tracks the support in the mesos containerizer. The docker containerizer 
> support will be tracked separately.
> More info on why this is necessary here: 
> https://github.com/NVIDIA/nvidia-docker/





[jira] [Updated] (MESOS-4626) Support Nvidia GPUs with filesystem isolation enabled in mesos containerizer.

2016-07-05 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-4626:
---
Summary: Support Nvidia GPUs with filesystem isolation enabled in mesos 
containerizer.  (was: Support Nvidia GPUs with filesystem isolation enabled.)

> Support Nvidia GPUs with filesystem isolation enabled in mesos containerizer.
> -
>
> Key: MESOS-4626
> URL: https://issues.apache.org/jira/browse/MESOS-4626
> Project: Mesos
>  Issue Type: Task
>  Components: isolation
>Reporter: Benjamin Mahler
>Assignee: Kevin Klues
> Fix For: 1.0.0
>
>
> When filesystem isolation is enabled in the mesos containerizer, containers 
> that use Nvidia GPU resources need access to GPU libraries residing on the 
> host.
> We'll need to provide a means for operators to inject the necessary volumes 
> into *all* containers that use "gpus" resources.
> See the nvidia-docker project for more details:
> [nvidia-docker/tools/src/nvidia/volumes.go|https://github.com/NVIDIA/nvidia-docker/blob/fda10b2d27bf5578cc5337c23877f827e4d1ed77/tools/src/nvidia/volumes.go#L50-L103]





[jira] [Updated] (MESOS-4626) Support Nvidia GPUs with filesystem isolation enabled.

2016-07-05 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-4626:
---
Description: 
When filesystem isolation is enabled in the mesos containerizer, containers 
that use Nvidia GPU resources need access to GPU libraries residing on the host.

We'll need to provide a means for operators to inject the necessary volumes 
into *all* containers that use "gpus" resources.

See the nvidia-docker project for more details:
[nvidia-docker/tools/src/nvidia/volumes.go|https://github.com/NVIDIA/nvidia-docker/blob/fda10b2d27bf5578cc5337c23877f827e4d1ed77/tools/src/nvidia/volumes.go#L50-L103]

  was:
When filesystem isolation is enabled, containers that use Nvidia GPU resources 
need access to GPU libraries residing on the host.

We'll need to provide a means for operators to inject the necessary volumes 
into *all* containers that use "gpus" resources.

See the nvidia-docker project for more details:
[nvidia-docker/tools/src/nvidia/volumes.go|https://github.com/NVIDIA/nvidia-docker/blob/fda10b2d27bf5578cc5337c23877f827e4d1ed77/tools/src/nvidia/volumes.go#L50-L103]


> Support Nvidia GPUs with filesystem isolation enabled.
> --
>
> Key: MESOS-4626
> URL: https://issues.apache.org/jira/browse/MESOS-4626
> Project: Mesos
>  Issue Type: Task
>  Components: isolation
>Reporter: Benjamin Mahler
>Assignee: Kevin Klues
>
> When filesystem isolation is enabled in the mesos containerizer, containers 
> that use Nvidia GPU resources need access to GPU libraries residing on the 
> host.
> We'll need to provide a means for operators to inject the necessary volumes 
> into *all* containers that use "gpus" resources.
> See the nvidia-docker project for more details:
> [nvidia-docker/tools/src/nvidia/volumes.go|https://github.com/NVIDIA/nvidia-docker/blob/fda10b2d27bf5578cc5337c23877f827e4d1ed77/tools/src/nvidia/volumes.go#L50-L103]





[jira] [Commented] (MESOS-5401) Add ability to inject a Volume of Nvidia GPU-related libraries into a docker container.

2016-07-05 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363734#comment-15363734
 ] 

Benjamin Mahler commented on MESOS-5401:


{noformat}
commit 516d8ef0759937eb327893a01e2bdef8c7438b1c
Author: Kevin Klues 
Date:   Tue Jul 5 20:55:36 2016 -0700

Added test for using an Nvidia Docker image.

This test ensures that when using one of Nvidia's Docker images
(these contain a special label), we mount a volume that contains
the libraries and binaries.

Review: https://reviews.apache.org/r/49678/
{noformat}

> Add ability to inject a Volume of Nvidia GPU-related libraries into a docker 
> container.
> ---
>
> Key: MESOS-5401
> URL: https://issues.apache.org/jira/browse/MESOS-5401
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Kevin Klues
>Assignee: Kevin Klues
>  Labels: gpu, mesosphere
> Fix For: 1.0.0
>
>
> In order to support Nvidia GPUs with docker containers in Mesos, we need to 
> be able to consolidate all Nvidia libraries into a common volume and inject 
> that volume into the container.
> More info on why this is necessary here: 
> https://github.com/NVIDIA/nvidia-docker/





[jira] [Commented] (MESOS-5681) c++ based resource and resources object

2016-07-05 Thread Yanyan Hu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363679#comment-15363679
 ] 

Yanyan Hu commented on MESOS-5681:
--

Yes, Guangya, it is. I think we can close this one as a duplicate. Thanks for 
pointing this out.

> c++ based resource and resources object
> ---
>
> Key: MESOS-5681
> URL: https://issues.apache.org/jira/browse/MESOS-5681
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Yanyan Hu
>  Labels: performance
>
> Follow-up JIRA for MESOS-5425. Currently, the resource object exposes the 
> protobuf used to store data internally, but this implementation is inefficient 
> for math operations, especially Ranges subtraction. An interim solution 
> proposed in https://reviews.apache.org/r/48593/ converts Ranges to 
> IntervalSet inline to optimize performance. In the long term, we should 
> consider a C++ library based resource object as a permanent solution.
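The gain from interval sets comes from operating on sorted, disjoint intervals rather than comparing every pair of ranges. A minimal Python sketch of interval subtraction, assuming half-open `(start, end)` tuples (this is illustrative, not the Stout `IntervalSet` used by Mesos):

```python
# Sketch of interval-set style range subtraction, the optimization the
# interim patch gets by converting Ranges to IntervalSet. Intervals are
# half-open (start, end) tuples over a sorted, disjoint list.

def subtract(intervals, hole):
    """Remove the half-open interval `hole` from `intervals`."""
    hs, he = hole
    result = []
    for s, e in intervals:
        if e <= hs or s >= he:      # no overlap: keep the interval as-is
            result.append((s, e))
            continue
        if s < hs:                  # keep the left remainder, if any
            result.append((s, hs))
        if e > he:                  # keep the right remainder, if any
            result.append((he, e))
    return result

# e.g. allocating ports [31500, 31600) out of a port range resource:
ports = [(31000, 32000)]
remaining = subtract(ports, (31500, 31600))
```

Each subtraction is a single linear pass over the interval list, instead of quadratic pairwise range comparisons.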





[jira] [Updated] (MESOS-5681) c++ based resource and resources object

2016-07-05 Thread Guangya Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guangya Liu updated MESOS-5681:
---

[~yanyanhu] This seems to be a duplicate of MESOS-4770, can you confirm?

> c++ based resource and resources object
> ---
>
> Key: MESOS-5681
> URL: https://issues.apache.org/jira/browse/MESOS-5681
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Yanyan Hu
>  Labels: performance
>
> Follow-up JIRA for MESOS-5425. Currently, the resource object exposes the 
> protobuf used to store data internally, but this implementation is inefficient 
> for math operations, especially Ranges subtraction. An interim solution 
> proposed in https://reviews.apache.org/r/48593/ converts Ranges to 
> IntervalSet inline to optimize performance. In the long term, we should 
> consider a C++ library based resource object as a permanent solution.





[jira] [Commented] (MESOS-5401) Add ability to inject a Volume of Nvidia GPU-related libraries into a docker container.

2016-07-05 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363605#comment-15363605
 ] 

Benjamin Mahler commented on MESOS-5401:


{noformat}
commit 57b52a0ee4bc9016f6b3de29606015c1216fb5b1
Author: Kevin Klues 
Date:   Tue Jul 5 18:42:52 2016 -0700

Inject Nvidia libraries for Docker images in mesos containerizer.

When Docker images have the matching Nvidia label, we will inject
the volume which contains the Nvidia libraries / binaries.

Similar support will be added for the Docker containerizer.

Review: https://reviews.apache.org/r/49669/
{noformat}

> Add ability to inject a Volume of Nvidia GPU-related libraries into a docker 
> container.
> ---
>
> Key: MESOS-5401
> URL: https://issues.apache.org/jira/browse/MESOS-5401
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Kevin Klues
>Assignee: Kevin Klues
>  Labels: gpu, mesosphere
> Fix For: 1.0.0
>
>
> In order to support Nvidia GPUs with docker containers in Mesos, we need to 
> be able to consolidate all Nvidia libraries into a common volume and inject 
> that volume into the container.
> More info on why this is necessary here: 
> https://github.com/NVIDIA/nvidia-docker/





[jira] [Assigned] (MESOS-5017) Don't consider agents without allocatable resources in the allocator

2016-07-05 Thread Guangya Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guangya Liu reassigned MESOS-5017:
--

Assignee: Guangya Liu  (was: Klaus Ma)

> Don't consider agents without allocatable resources in the allocator
> 
>
> Key: MESOS-5017
> URL: https://issues.apache.org/jira/browse/MESOS-5017
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Dario Rexin
>Assignee: Guangya Liu
>Priority: Minor
>
> During review r/43668/, an enhancement was suggested: if an agent has no 
> allocatable resources, the allocator should filter it out at the 
> beginning.
> {quote}
> Joris Van Remoortere Posted 1 month ago (March 16, 2016, 5:04 a.m.)
> Should we filter out slaves that have no allocatable resources?
> If we do, let's make sure we note that we want to pass the original slaveids 
> to the deallocate function
>  The issue has been resolved. Show all issues
> Dario Rexin 4 weeks ago (March 23, 2016, 4:25 a.m.)
> I'm not sure if it would be a big improvement. Calculating the available 
> resources is somewhat expensive, we have to do it again in the loop, and 
> most slaves will probably have resources available anyway. The reason it's an 
> improvement in the loop is that after we offer the resources to a framework, 
> we can be sure that they are all unavailable to the following frameworks 
> under the same role.
> Klaus Ma 4 weeks ago (March 23, 2016, 11:13 a.m.)
> @joris/dario, I think the improvement depends on the workload pattern: 1.) 
> for short-running tasks, several tasks may finish during the allocation 
> interval, so maybe no improvement; 2.) but for long-running tasks, the 
> slave/agent should be fully used most of the time, so it'll be a big 
> improvement. I used to log MESOS-4986 to add a filter after stage 1 (Quota), 
> but maybe useless after revocable by default.
> Joris Van Remoortere 3 weeks, 6 days ago (March 23, 2016, 8:59 p.m.)
> Can you open a JIRA to consider doing this. Along Klaus' example, I'm not 
> convinced this wouldn't have a large impact in certain scenarios.
> {quote}
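The proposed filter can be sketched briefly. "Allocatable" below is a simple stand-in check (minimum free cpus or memory, using the thresholds Mesos conventionally applies); the real allocator's check is more involved:

```python
# Sketch of the enhancement discussed above: skip agents with no
# allocatable resources before entering the per-framework allocation
# loop. The allocatable() check is a simplified stand-in.

MIN_CPUS = 0.01   # assumed minimum allocatable cpus
MIN_MEM = 32      # assumed minimum allocatable memory, in MB

def allocatable(free):
    return free.get("cpus", 0) >= MIN_CPUS or free.get("mem", 0) >= MIN_MEM

def candidate_agents(agents):
    # Keep the original agent list (and IDs) around for deallocation
    # paths; only the allocation loop iterates over the filtered list.
    return [a for a in agents if allocatable(a["free"])]

agents = [
    {"id": "agent-1", "free": {"cpus": 4, "mem": 1024}},
    {"id": "agent-2", "free": {"cpus": 0, "mem": 0}},
]
usable = candidate_agents(agents)
```

As the review discussion notes, the payoff depends on the workload: it is largest when many agents sit fully used for long stretches.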





[jira] [Commented] (MESOS-4694) DRFAllocator takes very long to allocate resources with a large number of frameworks

2016-07-05 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363600#comment-15363600
 ] 

Guangya Liu commented on MESOS-4694:


[~drexin] are you still actively working on this? If not, can I take this over? 
Thanks.

> DRFAllocator takes very long to allocate resources with a large number of 
> frameworks
> 
>
> Key: MESOS-4694
> URL: https://issues.apache.org/jira/browse/MESOS-4694
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Affects Versions: 0.26.0, 0.27.0, 0.27.1, 0.27.2, 0.28.0, 0.28.1
>Reporter: Dario Rexin
>Assignee: Dario Rexin
>
> With a growing number of connected frameworks, the allocation time grows to 
> very high numbers. The addition of quota in 0.27 had an additional impact on 
> these numbers. Running `mesos-tests.sh --benchmark 
> --gtest_filter=HierarchicalAllocator_BENCHMARK_Test.DeclineOffers` gives us 
> the following numbers:
> {noformat}
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from HierarchicalAllocator_BENCHMARK_Test
> [ RUN  ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
> Using 2000 slaves and 200 frameworks
> round 0 allocate took 2.921202secs to make 200 offers
> round 1 allocate took 2.85045secs to make 200 offers
> round 2 allocate took 2.823768secs to make 200 offers
> {noformat}
> Increasing the number of frameworks to 2000:
> {noformat}
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from HierarchicalAllocator_BENCHMARK_Test
> [ RUN  ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
> Using 2000 slaves and 2000 frameworks
> round 0 allocate took 28.209454secs to make 2000 offers
> round 1 allocate took 28.469419secs to make 2000 offers
> round 2 allocate took 28.138086secs to make 2000 offers
> {noformat}
> I was able to reduce this time by a substantial amount. After applying the 
> patches:
> {noformat}
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from HierarchicalAllocator_BENCHMARK_Test
> [ RUN  ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
> Using 2000 slaves and 200 frameworks
> round 0 allocate took 1.016226secs to make 2000 offers
> round 1 allocate took 1.102729secs to make 2000 offers
> round 2 allocate took 1.102624secs to make 2000 offers
> {noformat}
> And with 2000 frameworks:
> {noformat}
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from HierarchicalAllocator_BENCHMARK_Test
> [ RUN  ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
> Using 2000 slaves and 2000 frameworks
> round 0 allocate took 12.563203secs to make 2000 offers
> round 1 allocate took 12.437517secs to make 2000 offers
> round 2 allocate took 12.470708secs to make 2000 offers
> {noformat}
> The patches do 3 things to improve the performance of the allocator:
> 1) The total values in the DRFSorter will be pre-calculated per resource type.
> 2) In the allocate method, when no resources are available to allocate, we 
> break out of the innermost loop to prevent looping over a large number of 
> frameworks when we have nothing to allocate.
> 3) When a framework suppresses offers, we remove it from the sorter instead 
> of just calling continue in the allocation loop; this greatly improves 
> performance in the sorter and prevents looping over frameworks that don't 
> need resources.
> Assuming that most of the frameworks behave nicely and suppress offers when 
> they have nothing to schedule, it is fair to assume, that point 3) has the 
> biggest impact on the performance. If we suppress offers for 90% of the 
> frameworks in the benchmark test, we see following numbers:
> {noformat}
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from HierarchicalAllocator_BENCHMARK_Test
> [ RUN  ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
> Using 200 slaves and 2000 frameworks
> round 0 allocate took 11626us to make 200 offers
> round 1 allocate took 22890us to make 200 offers
> round 2 allocate took 21346us to make 200 offers
> {noformat}
> And for 200 frameworks:
> {noformat}
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from HierarchicalAllocator_BENCHMARK_Test
> [ RUN  ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
> Using 2000 slaves and 2000 frameworks
> round 0 allocate took 1.11178secs to make 2000 offers
> round 1 allocate took 1.062649secs to make 2000 offers
> round 2 allocate took 1.080181secs to make 2000 offers
> {noformat}
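Optimizations 2) and 3) above can be sketched as follows. The data structures are hypothetical simplifications, not the Mesos HierarchicalDRF allocator:

```python
# Sketch of optimizations 2) and 3): break out of the framework loop once
# an agent has nothing left to offer, and keep suppressed frameworks out
# of the loop entirely (rather than `continue`-ing past them each round).
# Structures are hypothetical simplifications of the real allocator.

def allocate(agents, frameworks):
    offers = []
    # 3) Suppressed frameworks never enter the loop at all.
    active = [f for f in frameworks if not f.get("suppressed")]
    for agent in agents:
        for framework in active:
            if agent["free"] <= 0:
                # 2) Nothing left on this agent; skip remaining frameworks.
                break
            offered = min(agent["free"], framework["demand"])
            if offered > 0:
                offers.append((agent["id"], framework["id"], offered))
                agent["free"] -= offered
    return offers

offers = allocate(
    [{"id": "a1", "free": 8}],
    [{"id": "f1", "demand": 6},
     {"id": "f2", "suppressed": True, "demand": 4},
     {"id": "f3", "demand": 4}])
```

With most frameworks suppressed, the inner loop shrinks to only the frameworks that actually want offers, which matches the large speedups reported in the benchmark numbers above.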

[jira] [Created] (MESOS-5794) Agent's /containers endpoint should skip terminated executors.

2016-07-05 Thread Jie Yu (JIRA)
Jie Yu created MESOS-5794:
-

 Summary: Agent's /containers endpoint should skip terminated 
executors.
 Key: MESOS-5794
 URL: https://issues.apache.org/jira/browse/MESOS-5794
 Project: Mesos
  Issue Type: Bug
Reporter: Jie Yu
Assignee: Jie Yu


If an executor has already terminated (but is pending a status update ack), 
the /containers endpoint should skip it. Currently, we iterate over all 
executors, which might generate a bunch of warnings in the agent's log.
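The fix amounts to filtering the executor list before building the response. A minimal sketch, with illustrative field names (not the agent's actual data model):

```python
# Sketch of the proposed fix: have the /containers handler consider only
# executors that still have a live container, instead of iterating over
# (and warning about) terminated ones. Field names are illustrative.

def containers(executors):
    return [
        {"executor_id": e["id"], "container_id": e["container_id"]}
        for e in executors
        if not e.get("terminated")  # skip executors pending status-update ack
    ]

result = containers([
    {"id": "e1", "container_id": "c1"},
    {"id": "e2", "container_id": "c2", "terminated": True},
])
```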





[jira] [Commented] (MESOS-3968) DiskQuotaTest.SlaveRecovery is flaky

2016-07-05 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363519#comment-15363519
 ] 

Joseph Wu commented on MESOS-3968:
--

This test has been failing more frequently (on ASF and local builds):
{code}
[ RUN  ] DiskQuotaTest.SlaveRecovery
I0706 00:02:21.991916 19907 cluster.cpp:155] Creating default 'local' authorizer
I0706 00:02:21.998934 19907 leveldb.cpp:174] Opened db in 6.606049ms
I0706 00:02:22.72 19907 leveldb.cpp:181] Compacted db in 1.093827ms
I0706 00:02:22.000119 19907 leveldb.cpp:196] Created db iterator in 19963ns
I0706 00:02:22.000128 19907 leveldb.cpp:202] Seeked to beginning of db in 1271ns
I0706 00:02:22.000131 19907 leveldb.cpp:271] Iterated through 0 keys in the db 
in 120ns
I0706 00:02:22.000169 19907 replica.cpp:779] Replica recovered with log 
positions 0 -> 0 with 1 holes and 0 unlearned
I0706 00:02:22.000886 19922 recover.cpp:451] Starting replica recovery
I0706 00:02:22.001183 19922 recover.cpp:477] Replica is in EMPTY status
I0706 00:02:22.002557 19927 replica.cpp:673] Replica in EMPTY status received a 
broadcasted recover request from (1328)@127.0.0.1:54648
I0706 00:02:22.003260 19928 master.cpp:382] Master 
8a9140ac-c7b3-45dd-961d-aeff38eae88e (centos71) started on 127.0.0.1:54648
I0706 00:02:22.003288 19928 master.cpp:384] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="HierarchicalDRF" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http="true" --authenticate_http_frameworks="true" 
--authenticators="crammd5" --authorizers="local" 
--credentials="/tmp/XoD9Xk/credentials" --framework_sorter="drf" --help="false" 
--hostname_lookup="true" --http_authenticators="basic" 
--http_framework_authenticators="basic" --initialize_driver_logging="true" 
--log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
--max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
--max_completed_tasks_per_framework="1000" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="replicated_log" 
--registry_fetch_timeout="1mins" --registry_store_timeout="100secs" 
--registry_strict="true" --root_submissions="true" --user_sorter="drf" 
--version="false" --webui_dir="/usr/local/share/mesos/webui" 
--work_dir="/tmp/XoD9Xk/master" --zk_session_timeout="10secs"
W0706 00:02:22.003545 19928 master.cpp:387] 
**
Master bound to loopback interface! Cannot communicate with remote schedulers 
or agents. You might want to set '--ip' flag to a routable IP address.
**
I0706 00:02:22.003564 19928 master.cpp:434] Master only allowing authenticated 
frameworks to register
I0706 00:02:22.003569 19928 master.cpp:448] Master only allowing authenticated 
agents to register
I0706 00:02:22.003573 19928 master.cpp:461] Master only allowing authenticated 
HTTP frameworks to register
I0706 00:02:22.003577 19928 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/XoD9Xk/credentials'
I0706 00:02:22.003829 19928 master.cpp:506] Using default 'crammd5' 
authenticator
I0706 00:02:22.003933 19928 master.cpp:578] Using default 'basic' HTTP 
authenticator
I0706 00:02:22.004132 19923 recover.cpp:197] Received a recover response from a 
replica in EMPTY status
I0706 00:02:22.004261 19928 master.cpp:658] Using default 'basic' HTTP 
framework authenticator
I0706 00:02:22.004370 19928 master.cpp:705] Authorization enabled
I0706 00:02:22.004560 19923 recover.cpp:568] Updating replica status to STARTING
I0706 00:02:22.006342 19922 leveldb.cpp:304] Persisting metadata (8 bytes) to 
leveldb took 1.264999ms
I0706 00:02:22.006395 19922 replica.cpp:320] Persisted replica status to 
STARTING
I0706 00:02:22.006669 19924 recover.cpp:477] Replica is in STARTING status
I0706 00:02:22.008113 19922 master.cpp:1972] The newly elected leader is 
master@127.0.0.1:54648 with id 8a9140ac-c7b3-45dd-961d-aeff38eae88e
I0706 00:02:22.008174 19922 master.cpp:1985] Elected as the leading master!
I0706 00:02:22.008215 19922 master.cpp:1672] Recovering from registrar
I0706 00:02:22.008404 19923 replica.cpp:673] Replica in STARTING status 
received a broadcasted recover request from (1331)@127.0.0.1:54648
I0706 00:02:22.008600 19925 registrar.cpp:332] Recovering registrar
I0706 00:02:22.009078 19928 recover.cpp:197] Received a recover response from a 
replica in STARTING status
I0706 00:02:22.009968 19928 recover.cpp:568] Updating replica status to VOTING
I0706 00:02:22.011096 19928 leveldb.cpp:304] Persisting metadata (8 bytes) to 
leveldb took 923116ns
I0706 00:02:22.011121 19928 replica.cpp:320] Persisted replica status to VOTING
I0706 00:02:22.011214 19928 recover.cpp:582] Successfully joined the Paxos group
I0706 00:02:22.011320 19928 recover.cpp:466] Recover process terminated
I0706 

[jira] [Created] (MESOS-5793) Add ability to inject Nvidia devices into a container

2016-07-05 Thread Kevin Klues (JIRA)
Kevin Klues created MESOS-5793:
--

 Summary: Add ability to inject Nvidia devices into a container
 Key: MESOS-5793
 URL: https://issues.apache.org/jira/browse/MESOS-5793
 Project: Mesos
  Issue Type: Improvement
Reporter: Kevin Klues
Assignee: Kevin Klues
 Fix For: 1.0.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5401) Add ability to inject a Volume of Nvidia GPU-related libraries into a docker container.

2016-07-05 Thread Kevin Klues (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363514#comment-15363514
 ] 

Kevin Klues commented on MESOS-5401:


Last one of the patches in this patch set to enable this feature:
https://reviews.apache.org/r/49669/

> Add ability to inject a Volume of Nvidia GPU-related libraries into a docker 
> container.
> ---
>
> Key: MESOS-5401
> URL: https://issues.apache.org/jira/browse/MESOS-5401
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Kevin Klues
>Assignee: Kevin Klues
>  Labels: gpu, mesosphere
> Fix For: 1.0.0
>
>
> In order to support Nvidia GPUs with docker containers in Mesos, we need to 
> be able to consolidate all Nvidia libraries into a common volume and inject 
> that volume into the container.
> More info on why this is necessary here: 
> https://github.com/NVIDIA/nvidia-docker/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5693) slave delay to forword status update

2016-07-05 Thread Avinash Sridharan (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363452#comment-15363452
 ] 

Avinash Sridharan commented on MESOS-5693:
--

So we should see logs from the retries (for TASK_RUNNING, in case it was not 
acknowledged)? I'm not sure whether the retries show up in LOG(INFO).

> slave delay to forword status update
> 
>
> Key: MESOS-5693
> URL: https://issues.apache.org/jira/browse/MESOS-5693
> Project: Mesos
>  Issue Type: Improvement
>  Components: slave
>Affects Versions: 0.22.1
> Environment: debian7 
>Reporter: zhangfuxing
>
> we observe that the mesos slave delays forwarding task status updates to the master:
> I0615 14:59:10.997902  3890 slave.cpp:2531] Handling status update 
> TASK_KILLED (UUID: 17e9c12f-5241-4aca-81fa-67d6830990b0) for task 
> xxx.64554b80 of framework 20150629-151659-3355508746-5060-6173-0001 from 
> executor(1)@10.0.40.189:54304
> I0615 14:59:11.001126  3895 status_update_manager.cpp:317] Received status 
> update TASK_KILLED (UUID: 17e9c12f-5241-4aca-81fa-67d6830990b0) for task 
> xxx.64554b80 of framework 20150629-151659-3355508746-5060-6173-0001
> I0615 14:59:11.001174  3895 status_update_manager.hpp:346] Checkpointing 
> UPDATE for status update TASK_KILLED (UUID: 
> 17e9c12f-5241-4aca-81fa-67d6830990b0) for task xxx.64554b80 of framework 
> 20150629-151659-3355508746-5060-6173-0001
> I0615 14:59:11.037376  3894 slave.cpp:2709] Sending acknowledgement for 
> status update TASK_KILLED (UUID: 17e9c12f-5241-4aca-81fa-67d6830990b0) for 
> task xxx.64554b80 of framework 20150629-151659-3355508746-5060-6173-0001 to 
> executor(1)@10.0.40.189:54304
> I0615 15:54:21.352087  3888 slave.cpp:2776] Forwarding the update TASK_KILLED 
> (UUID: 17e9c12f-5241-4aca-81fa-67d6830990b0) for task xxx.64554b80 of 
> framework 20150629-151659-3355508746-5060-6173-0001 to master@10.0.1.200:5060
> for this example, the task xxx.64554b80 was killed at 14:59, but the status 
> was not forwarded to the master until 15:54



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-5693) slave delay to forword status update

2016-07-05 Thread Adam B (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363424#comment-15363424
 ] 

Adam B edited comment on MESOS-5693 at 7/5/16 10:59 PM:


Only the earliest unacknowledged update (i.e. the TASK_RUNNING, not the 
TASK_KILLED) will be sent (and resent with periodic retries) for each task from 
the agent's StatusUpdateManager to the master. However, with these updates, the 
agent will add the latest task state (not a full StatusUpdate), so the master 
can know to release the resources and update the state in the webui. The data 
and messages from the final terminal status update must wait for all of its 
preceding updates to be acknowledged before it can be sent.
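The forwarding behavior described here can be sketched as a small model (illustrative Python, not actual Mesos code; the class and method names are made up):

```python
from collections import deque

class TaskUpdateStream:
    """Simplified model of the agent's per-task status update queue:
    only the earliest unacknowledged update is forwarded, annotated
    with the task's latest known state."""

    def __init__(self):
        self.pending = deque()   # unacknowledged updates, oldest first
        self.latest_state = None

    def update(self, state, uuid):
        self.pending.append((state, uuid))
        self.latest_state = state

    def next_to_forward(self):
        # The master sees the oldest unacked update plus the latest state,
        # so it can release resources before the terminal update itself
        # reaches the front of the queue.
        if not self.pending:
            return None
        state, uuid = self.pending[0]
        return {"update": state, "uuid": uuid,
                "latest_state": self.latest_state}

    def acknowledge(self, uuid):
        # An ACK for the front of the queue unblocks the next update.
        if self.pending and self.pending[0][1] == uuid:
            self.pending.popleft()

stream = TaskUpdateStream()
stream.update("TASK_RUNNING", "uuid-1")
stream.update("TASK_KILLED", "uuid-2")

# Until TASK_RUNNING is acknowledged, only it is forwarded...
assert stream.next_to_forward()["update"] == "TASK_RUNNING"
# ...but the latest task state rides along with it.
assert stream.next_to_forward()["latest_state"] == "TASK_KILLED"

stream.acknowledge("uuid-1")
assert stream.next_to_forward()["update"] == "TASK_KILLED"
```

An ACK for the front of the queue is what unblocks the next update, which is why a disconnected scheduler stalls the stream.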


was (Author: adam-mesos):
Only the earliest unacknowledged update (i.e. the TASK_RUNNING, not the 
TASK_KILLED) will be sent for each task from the agent's StatusUpdateManager to 
the master. However, with these updates, the agent will add the latest task 
state (not a full StatusUpdate), so the master can know to release the 
resources and update the state in the webui. The data and messages from the 
final terminal status update must wait for all of its preceding updates to be 
acknowledged before it can be sent.

> slave delay to forword status update
> 
>
> Key: MESOS-5693
> URL: https://issues.apache.org/jira/browse/MESOS-5693
> Project: Mesos
>  Issue Type: Improvement
>  Components: slave
>Affects Versions: 0.22.1
> Environment: debian7 
>Reporter: zhangfuxing
>
> we observe that the mesos slave delays forwarding task status updates to the master:
> I0615 14:59:10.997902  3890 slave.cpp:2531] Handling status update 
> TASK_KILLED (UUID: 17e9c12f-5241-4aca-81fa-67d6830990b0) for task 
> xxx.64554b80 of framework 20150629-151659-3355508746-5060-6173-0001 from 
> executor(1)@10.0.40.189:54304
> I0615 14:59:11.001126  3895 status_update_manager.cpp:317] Received status 
> update TASK_KILLED (UUID: 17e9c12f-5241-4aca-81fa-67d6830990b0) for task 
> xxx.64554b80 of framework 20150629-151659-3355508746-5060-6173-0001
> I0615 14:59:11.001174  3895 status_update_manager.hpp:346] Checkpointing 
> UPDATE for status update TASK_KILLED (UUID: 
> 17e9c12f-5241-4aca-81fa-67d6830990b0) for task xxx.64554b80 of framework 
> 20150629-151659-3355508746-5060-6173-0001
> I0615 14:59:11.037376  3894 slave.cpp:2709] Sending acknowledgement for 
> status update TASK_KILLED (UUID: 17e9c12f-5241-4aca-81fa-67d6830990b0) for 
> task xxx.64554b80 of framework 20150629-151659-3355508746-5060-6173-0001 to 
> executor(1)@10.0.40.189:54304
> I0615 15:54:21.352087  3888 slave.cpp:2776] Forwarding the update TASK_KILLED 
> (UUID: 17e9c12f-5241-4aca-81fa-67d6830990b0) for task xxx.64554b80 of 
> framework 20150629-151659-3355508746-5060-6173-0001 to master@10.0.1.200:5060
> for this example, the task xxx.64554b80 was killed at 14:59, but the status 
> was not forwarded to the master until 15:54



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5401) Add ability to inject a Volume of Nvidia GPU-related libraries into a docker container.

2016-07-05 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363425#comment-15363425
 ] 

Benjamin Mahler commented on MESOS-5401:


{noformat}
commit a1423a5fe64c888003846b20b200f3142e6fca48
Author: Kevin Klues 
Date:   Tue Jul 5 15:22:35 2016 -0700

Implemented 'shouldInject()' in the 'NvidiaVolume' component.

We use the the 'com.nvidia.volumes.needed' label from
nvidia-docker to decide if we should inject the volume or not:

https://github.com/NVIDIA/nvidia-docker/wiki/Image-inspection

Review: https://reviews.apache.org/r/49615/
{noformat}
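The gist of the label check above can be sketched as follows (a hypothetical helper in Python, not the actual C++ 'NvidiaVolume' API; only the label name comes from nvidia-docker):

```python
NVIDIA_LABEL = "com.nvidia.volumes.needed"

def should_inject_nvidia_volume(image_labels):
    # nvidia-docker sets this label on GPU-enabled images; its presence
    # signals that the consolidated Nvidia volume should be injected.
    # (Hypothetical helper; the real check lives in NvidiaVolume.)
    return NVIDIA_LABEL in image_labels

assert should_inject_nvidia_volume({NVIDIA_LABEL: "nvidia_driver"})
assert not should_inject_nvidia_volume({"maintainer": "someone"})
```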

> Add ability to inject a Volume of Nvidia GPU-related libraries into a docker 
> container.
> ---
>
> Key: MESOS-5401
> URL: https://issues.apache.org/jira/browse/MESOS-5401
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Kevin Klues
>Assignee: Kevin Klues
>  Labels: gpu, mesosphere
> Fix For: 1.0.0
>
>
> In order to support Nvidia GPUs with docker containers in Mesos, we need to 
> be able to consolidate all Nvidia libraries into a common volume and inject 
> that volume into the container.
> More info on why this is necessary here: 
> https://github.com/NVIDIA/nvidia-docker/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5693) slave delay to forword status update

2016-07-05 Thread Adam B (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363424#comment-15363424
 ] 

Adam B commented on MESOS-5693:
---

Only the earliest unacknowledged update (i.e. the TASK_RUNNING, not the 
TASK_KILLED) will be sent for each task from the agent's StatusUpdateManager to 
the master. However, with these updates, the agent will add the latest task 
state (not a full StatusUpdate), so the master can know to release the 
resources and update the state in the webui. The data and messages from the 
final terminal status update must wait for all of its preceding updates to be 
acknowledged before it can be sent.

> slave delay to forword status update
> 
>
> Key: MESOS-5693
> URL: https://issues.apache.org/jira/browse/MESOS-5693
> Project: Mesos
>  Issue Type: Improvement
>  Components: slave
>Affects Versions: 0.22.1
> Environment: debian7 
>Reporter: zhangfuxing
>
> we observe that the mesos slave delays forwarding task status updates to the master:
> I0615 14:59:10.997902  3890 slave.cpp:2531] Handling status update 
> TASK_KILLED (UUID: 17e9c12f-5241-4aca-81fa-67d6830990b0) for task 
> xxx.64554b80 of framework 20150629-151659-3355508746-5060-6173-0001 from 
> executor(1)@10.0.40.189:54304
> I0615 14:59:11.001126  3895 status_update_manager.cpp:317] Received status 
> update TASK_KILLED (UUID: 17e9c12f-5241-4aca-81fa-67d6830990b0) for task 
> xxx.64554b80 of framework 20150629-151659-3355508746-5060-6173-0001
> I0615 14:59:11.001174  3895 status_update_manager.hpp:346] Checkpointing 
> UPDATE for status update TASK_KILLED (UUID: 
> 17e9c12f-5241-4aca-81fa-67d6830990b0) for task xxx.64554b80 of framework 
> 20150629-151659-3355508746-5060-6173-0001
> I0615 14:59:11.037376  3894 slave.cpp:2709] Sending acknowledgement for 
> status update TASK_KILLED (UUID: 17e9c12f-5241-4aca-81fa-67d6830990b0) for 
> task xxx.64554b80 of framework 20150629-151659-3355508746-5060-6173-0001 to 
> executor(1)@10.0.40.189:54304
> I0615 15:54:21.352087  3888 slave.cpp:2776] Forwarding the update TASK_KILLED 
> (UUID: 17e9c12f-5241-4aca-81fa-67d6830990b0) for task xxx.64554b80 of 
> framework 20150629-151659-3355508746-5060-6173-0001 to master@10.0.1.200:5060
> for this example, the task xxx.64554b80 was killed at 14:59, but the status 
> was not forwarded to the master until 15:54



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5792) Add mesos tests to CMake (make check)

2016-07-05 Thread Srinivas (JIRA)
Srinivas created MESOS-5792:
---

 Summary: Add mesos tests to CMake (make check)
 Key: MESOS-5792
 URL: https://issues.apache.org/jira/browse/MESOS-5792
 Project: Mesos
  Issue Type: Improvement
  Components: build
Reporter: Srinivas
Assignee: Srinivas


Provide CMakeLists.txt and configuration files to build mesos tests using CMake.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-4514) Document how to implement Mesos HTTP operator endpoints.

2016-07-05 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-4514:
-

Assignee: Vinod Kone

> Document how to implement Mesos HTTP operator endpoints.
> 
>
> Key: MESOS-4514
> URL: https://issues.apache.org/jira/browse/MESOS-4514
> Project: Mesos
>  Issue Type: Documentation
>  Components: documentation, master
>Reporter: Alexander Rukletsov
>Assignee: Vinod Kone
>  Labels: documentation, http, mesosphere
>
> We agreed to accept single JSON objects, provide a corresponding *Request 
> protobuf to document the schema, leverage HTTP verbs where appropriate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-5693) slave delay to forword status update

2016-07-05 Thread Avinash Sridharan (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363399#comment-15363399
 ] 

Avinash Sridharan edited comment on MESOS-5693 at 7/5/16 10:35 PM:
---

Even if the scheduler disconnected from the Master (assuming its timeout is set 
to a very high value), wouldn't the Agent still keep forwarding updates to the 
Master?


was (Author: avin...@mesosphere.io):
Even if the scheduler disconnected from the Master (assuming its timeout is set 
to a very high value), would the Agent still keep forwarding updates to the 
Master?

> slave delay to forword status update
> 
>
> Key: MESOS-5693
> URL: https://issues.apache.org/jira/browse/MESOS-5693
> Project: Mesos
>  Issue Type: Improvement
>  Components: slave
>Affects Versions: 0.22.1
> Environment: debian7 
>Reporter: zhangfuxing
>
> we observe that the mesos slave delays forwarding task status updates to the master:
> I0615 14:59:10.997902  3890 slave.cpp:2531] Handling status update 
> TASK_KILLED (UUID: 17e9c12f-5241-4aca-81fa-67d6830990b0) for task 
> xxx.64554b80 of framework 20150629-151659-3355508746-5060-6173-0001 from 
> executor(1)@10.0.40.189:54304
> I0615 14:59:11.001126  3895 status_update_manager.cpp:317] Received status 
> update TASK_KILLED (UUID: 17e9c12f-5241-4aca-81fa-67d6830990b0) for task 
> xxx.64554b80 of framework 20150629-151659-3355508746-5060-6173-0001
> I0615 14:59:11.001174  3895 status_update_manager.hpp:346] Checkpointing 
> UPDATE for status update TASK_KILLED (UUID: 
> 17e9c12f-5241-4aca-81fa-67d6830990b0) for task xxx.64554b80 of framework 
> 20150629-151659-3355508746-5060-6173-0001
> I0615 14:59:11.037376  3894 slave.cpp:2709] Sending acknowledgement for 
> status update TASK_KILLED (UUID: 17e9c12f-5241-4aca-81fa-67d6830990b0) for 
> task xxx.64554b80 of framework 20150629-151659-3355508746-5060-6173-0001 to 
> executor(1)@10.0.40.189:54304
> I0615 15:54:21.352087  3888 slave.cpp:2776] Forwarding the update TASK_KILLED 
> (UUID: 17e9c12f-5241-4aca-81fa-67d6830990b0) for task xxx.64554b80 of 
> framework 20150629-151659-3355508746-5060-6173-0001 to master@10.0.1.200:5060
> for this example, the task xxx.64554b80 was killed at 14:59, but the status 
> was not forwarded to the master until 15:54



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5693) slave delay to forword status update

2016-07-05 Thread Avinash Sridharan (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363399#comment-15363399
 ] 

Avinash Sridharan commented on MESOS-5693:
--

Even if the scheduler disconnected from the Master (assuming its timeout is set 
to a very high value), would the Agent still keep forwarding updates to the 
Master?

> slave delay to forword status update
> 
>
> Key: MESOS-5693
> URL: https://issues.apache.org/jira/browse/MESOS-5693
> Project: Mesos
>  Issue Type: Improvement
>  Components: slave
>Affects Versions: 0.22.1
> Environment: debian7 
>Reporter: zhangfuxing
>
> we observe that the mesos slave delays forwarding task status updates to the master:
> I0615 14:59:10.997902  3890 slave.cpp:2531] Handling status update 
> TASK_KILLED (UUID: 17e9c12f-5241-4aca-81fa-67d6830990b0) for task 
> xxx.64554b80 of framework 20150629-151659-3355508746-5060-6173-0001 from 
> executor(1)@10.0.40.189:54304
> I0615 14:59:11.001126  3895 status_update_manager.cpp:317] Received status 
> update TASK_KILLED (UUID: 17e9c12f-5241-4aca-81fa-67d6830990b0) for task 
> xxx.64554b80 of framework 20150629-151659-3355508746-5060-6173-0001
> I0615 14:59:11.001174  3895 status_update_manager.hpp:346] Checkpointing 
> UPDATE for status update TASK_KILLED (UUID: 
> 17e9c12f-5241-4aca-81fa-67d6830990b0) for task xxx.64554b80 of framework 
> 20150629-151659-3355508746-5060-6173-0001
> I0615 14:59:11.037376  3894 slave.cpp:2709] Sending acknowledgement for 
> status update TASK_KILLED (UUID: 17e9c12f-5241-4aca-81fa-67d6830990b0) for 
> task xxx.64554b80 of framework 20150629-151659-3355508746-5060-6173-0001 to 
> executor(1)@10.0.40.189:54304
> I0615 15:54:21.352087  3888 slave.cpp:2776] Forwarding the update TASK_KILLED 
> (UUID: 17e9c12f-5241-4aca-81fa-67d6830990b0) for task xxx.64554b80 of 
> framework 20150629-151659-3355508746-5060-6173-0001 to master@10.0.1.200:5060
> for this example, the task xxx.64554b80 was killed at 14:59, but the status 
> was not forwarded to the master until 15:54



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5693) slave delay to forword status update

2016-07-05 Thread Avinash Sridharan (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363397#comment-15363397
 ] 

Avinash Sridharan commented on MESOS-5693:
--

If possible, could you also try upgrading to a more recent version of Mesos 
(v0.28.2) and see if you still hit the same problem?

Thanks!

> slave delay to forword status update
> 
>
> Key: MESOS-5693
> URL: https://issues.apache.org/jira/browse/MESOS-5693
> Project: Mesos
>  Issue Type: Improvement
>  Components: slave
>Affects Versions: 0.22.1
> Environment: debian7 
>Reporter: zhangfuxing
>
> we observe that the mesos slave delays forwarding task status updates to the master:
> I0615 14:59:10.997902  3890 slave.cpp:2531] Handling status update 
> TASK_KILLED (UUID: 17e9c12f-5241-4aca-81fa-67d6830990b0) for task 
> xxx.64554b80 of framework 20150629-151659-3355508746-5060-6173-0001 from 
> executor(1)@10.0.40.189:54304
> I0615 14:59:11.001126  3895 status_update_manager.cpp:317] Received status 
> update TASK_KILLED (UUID: 17e9c12f-5241-4aca-81fa-67d6830990b0) for task 
> xxx.64554b80 of framework 20150629-151659-3355508746-5060-6173-0001
> I0615 14:59:11.001174  3895 status_update_manager.hpp:346] Checkpointing 
> UPDATE for status update TASK_KILLED (UUID: 
> 17e9c12f-5241-4aca-81fa-67d6830990b0) for task xxx.64554b80 of framework 
> 20150629-151659-3355508746-5060-6173-0001
> I0615 14:59:11.037376  3894 slave.cpp:2709] Sending acknowledgement for 
> status update TASK_KILLED (UUID: 17e9c12f-5241-4aca-81fa-67d6830990b0) for 
> task xxx.64554b80 of framework 20150629-151659-3355508746-5060-6173-0001 to 
> executor(1)@10.0.40.189:54304
> I0615 15:54:21.352087  3888 slave.cpp:2776] Forwarding the update TASK_KILLED 
> (UUID: 17e9c12f-5241-4aca-81fa-67d6830990b0) for task xxx.64554b80 of 
> framework 20150629-151659-3355508746-5060-6173-0001 to master@10.0.1.200:5060
> for this example, the task xxx.64554b80 was killed at 14:59, but the status 
> was not forwarded to the master until 15:54



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5693) slave delay to forword status update

2016-07-05 Thread Adam B (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363396#comment-15363396
 ] 

Adam B commented on MESOS-5693:
---

It could be that the agent was not in contact with the master and so could not 
forward the update, but that is unlikely for a whole hour. More likely the 
scheduler was disconnected for a long time, and since the scheduler never 
acknowledged the previous status update (TASK_RUNNING?), the agent never sent 
the next update in the queue. For Mesos to provide guaranteed at-least-once 
delivery of status updates to schedulers, the scheduler must be connected to 
ACK each update.

> slave delay to forword status update
> 
>
> Key: MESOS-5693
> URL: https://issues.apache.org/jira/browse/MESOS-5693
> Project: Mesos
>  Issue Type: Improvement
>  Components: slave
>Affects Versions: 0.22.1
> Environment: debian7 
>Reporter: zhangfuxing
>
> we observe that the mesos slave delays forwarding task status updates to the master:
> I0615 14:59:10.997902  3890 slave.cpp:2531] Handling status update 
> TASK_KILLED (UUID: 17e9c12f-5241-4aca-81fa-67d6830990b0) for task 
> xxx.64554b80 of framework 20150629-151659-3355508746-5060-6173-0001 from 
> executor(1)@10.0.40.189:54304
> I0615 14:59:11.001126  3895 status_update_manager.cpp:317] Received status 
> update TASK_KILLED (UUID: 17e9c12f-5241-4aca-81fa-67d6830990b0) for task 
> xxx.64554b80 of framework 20150629-151659-3355508746-5060-6173-0001
> I0615 14:59:11.001174  3895 status_update_manager.hpp:346] Checkpointing 
> UPDATE for status update TASK_KILLED (UUID: 
> 17e9c12f-5241-4aca-81fa-67d6830990b0) for task xxx.64554b80 of framework 
> 20150629-151659-3355508746-5060-6173-0001
> I0615 14:59:11.037376  3894 slave.cpp:2709] Sending acknowledgement for 
> status update TASK_KILLED (UUID: 17e9c12f-5241-4aca-81fa-67d6830990b0) for 
> task xxx.64554b80 of framework 20150629-151659-3355508746-5060-6173-0001 to 
> executor(1)@10.0.40.189:54304
> I0615 15:54:21.352087  3888 slave.cpp:2776] Forwarding the update TASK_KILLED 
> (UUID: 17e9c12f-5241-4aca-81fa-67d6830990b0) for task xxx.64554b80 of 
> framework 20150629-151659-3355508746-5060-6173-0001 to master@10.0.1.200:5060
> for this example, the task xxx.64554b80 was killed at 14:59, but the status 
> was not forwarded to the master until 15:54



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5791) Consider adding support for a API documentation tool.

2016-07-05 Thread Anand Mazumdar (JIRA)
Anand Mazumdar created MESOS-5791:
-

 Summary: Consider adding support for a API documentation tool.
 Key: MESOS-5791
 URL: https://issues.apache.org/jira/browse/MESOS-5791
 Project: Mesos
  Issue Type: Improvement
Reporter: Anand Mazumdar


Currently, the Scheduler/Executor API docs are hand-written in markdown. This 
is hard to maintain, and adding complete/custom JSON snippets is very 
cumbersome.

We should consider generating the API documentation via an automated tool such 
as Swagger. The only catch is that all our APIs are RPC based, so it would be 
good to explore tools that fit that criterion.

This ticket is for exploring possible tooling choices.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5693) slave delay to forword status update

2016-07-05 Thread Avinash Sridharan (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363392#comment-15363392
 ] 

Avinash Sridharan commented on MESOS-5693:
--

[~zfx] could you reproduce the problem with GLOG_v=1, or by running `strace` on 
the Agent when you see it stuck? I think we will need more information to see 
what exactly the Agent was doing between I0615 14:59:11.037376 and I0615 
15:54:21.352087 such that it wasn't able to send an update.

> slave delay to forword status update
> 
>
> Key: MESOS-5693
> URL: https://issues.apache.org/jira/browse/MESOS-5693
> Project: Mesos
>  Issue Type: Improvement
>  Components: slave
>Affects Versions: 0.22.1
> Environment: debian7 
>Reporter: zhangfuxing
>
> we observe that the mesos slave delays forwarding task status updates to the master:
> I0615 14:59:10.997902  3890 slave.cpp:2531] Handling status update 
> TASK_KILLED (UUID: 17e9c12f-5241-4aca-81fa-67d6830990b0) for task 
> xxx.64554b80 of framework 20150629-151659-3355508746-5060-6173-0001 from 
> executor(1)@10.0.40.189:54304
> I0615 14:59:11.001126  3895 status_update_manager.cpp:317] Received status 
> update TASK_KILLED (UUID: 17e9c12f-5241-4aca-81fa-67d6830990b0) for task 
> xxx.64554b80 of framework 20150629-151659-3355508746-5060-6173-0001
> I0615 14:59:11.001174  3895 status_update_manager.hpp:346] Checkpointing 
> UPDATE for status update TASK_KILLED (UUID: 
> 17e9c12f-5241-4aca-81fa-67d6830990b0) for task xxx.64554b80 of framework 
> 20150629-151659-3355508746-5060-6173-0001
> I0615 14:59:11.037376  3894 slave.cpp:2709] Sending acknowledgement for 
> status update TASK_KILLED (UUID: 17e9c12f-5241-4aca-81fa-67d6830990b0) for 
> task xxx.64554b80 of framework 20150629-151659-3355508746-5060-6173-0001 to 
> executor(1)@10.0.40.189:54304
> I0615 15:54:21.352087  3888 slave.cpp:2776] Forwarding the update TASK_KILLED 
> (UUID: 17e9c12f-5241-4aca-81fa-67d6830990b0) for task xxx.64554b80 of 
> framework 20150629-151659-3355508746-5060-6173-0001 to master@10.0.1.200:5060
> for this example, the task xxx.64554b80 was killed at 14:59, but the status 
> was not forwarded to the master until 15:54



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5790) Ensure all examples in Scheduler HTTP API docs are valid JSON

2016-07-05 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-5790:
--
Description: 
Currently, there are a lot of JSON snippets in the [API Docs | 
http://mesos.apache.org/documentation/latest/scheduler-http-api/ ] that are not 
valid JSON, i.e. they use {{...}} to keep a snippet succinct and easy to read, 
e.g.:
{code}
{{"filters"   : {...}
{code} 

However, this is a problem for framework developers who are trying to use the 
new API. Looking at the corresponding protobuf definitions is a good place to 
start, but hardly ideal.

It would be good to address these shortcomings and make the JSON snippets 
complete.

  was:
Currently, there are a lot of JSON snippets in the [API Docs | 
http://mesos.apache.org/documentation/latest/scheduler-http-api/ ] that are not 
valid JSON i.e. have {{...}} e.g., {{"filters"   : {...}}} to make the 
snippet succinct/easy to read. However, this is a problem for framework 
developers who are trying to use the new API. Looking at the corresponding 
protobuf definitions can be a good place to start but hardly ideal.

It would be good to address the shortcomings and make the JSON snippets 
complete.


> Ensure all examples in Scheduler HTTP API docs are valid JSON
> -
>
> Key: MESOS-5790
> URL: https://issues.apache.org/jira/browse/MESOS-5790
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Anand Mazumdar
>  Labels: mesosphere, newbie
>
> Currently, there are a lot of JSON snippets in the [API Docs | 
> http://mesos.apache.org/documentation/latest/scheduler-http-api/ ] that are 
> not valid JSON, i.e. they use {{...}} to keep a snippet succinct and easy to 
> read, e.g.:
> {code}
> {{"filters"   : {...}
> {code} 
> However, this is a problem for framework developers who are trying to use the 
> new API. Looking at the corresponding protobuf definitions is a good place to 
> start, but hardly ideal.
> It would be good to address these shortcomings and make the JSON snippets 
> complete.
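For illustration, a completed version of the elided {{"filters"}} snippet would parse cleanly; here the field name follows the Filters protobuf message, and the value itself is made up:

```python
import json

# A complete, parseable version of the elided "filters" snippet.
# "refuse_seconds" is the field on the Filters protobuf message;
# 5.0 is an illustrative value.
snippet = '{"filters": {"refuse_seconds": 5.0}}'

parsed = json.loads(snippet)  # the elided {...} form would raise ValueError
assert parsed["filters"]["refuse_seconds"] == 5.0
```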



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5790) Ensure all examples in Scheduler HTTP API docs are valid JSON

2016-07-05 Thread Anand Mazumdar (JIRA)
Anand Mazumdar created MESOS-5790:
-

 Summary: Ensure all examples in Scheduler HTTP API docs are valid 
JSON
 Key: MESOS-5790
 URL: https://issues.apache.org/jira/browse/MESOS-5790
 Project: Mesos
  Issue Type: Improvement
Reporter: Anand Mazumdar


Currently, there are a lot of JSON snippets in the [API Docs | 
http://mesos.apache.org/documentation/latest/scheduler-http-api/ ] that are not 
valid JSON, i.e. they use {{...}} (e.g., {{"filters"   : {...}}}) to keep the 
snippet succinct and easy to read. However, this is a problem for framework 
developers who are trying to use the new API. Looking at the corresponding 
protobuf definitions is a good place to start, but hardly ideal.

It would be good to address these shortcomings and make the JSON snippets 
complete.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5789) Expose max_executors_per_agent for non port mapping isolator build.

2016-07-05 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-5789:
--
Description: 
Currently, master's flag `max_executors_per_agent` is only visible 
(conditionally compiled) if port mapping network isolator is enabled. However, 
this flag should be general and quite useful in other scenarios. The ticket 
tracks the process of exposing this flag.

{code}
#ifdef WITH_NETWORK_ISOLATOR
  add(::max_executors_per_agent,
  "max_executors_per_agent",
  flags::DeprecatedName("max_executors_per_slave"),
  "Maximum number of executors allowed per agent. The network\n"
  "monitoring/isolation technique imposes an implicit resource\n"
  "acquisition on each executor (# ephemeral ports), as a result\n"
  "one can only run a certain number of executors on each agent.");
#endif // WITH_NETWORK_ISOLATOR
{code}

  was:Currently, master's flag `max_executors_per_agent` is only visible 
(conditionally compiled) if port mapping network isolator is enabled. However, 
this flag should be general and quite useful in other scenarios. The ticket 
tracks the process of exposing this flag.


> Expose max_executors_per_agent for non port mapping isolator build.
> ---
>
> Key: MESOS-5789
> URL: https://issues.apache.org/jira/browse/MESOS-5789
> Project: Mesos
>  Issue Type: Wish
>Reporter: Jie Yu
>
> Currently, master's flag `max_executors_per_agent` is only visible 
> (conditionally compiled) if port mapping network isolator is enabled. 
> However, this flag should be general and quite useful in other scenarios. The 
> ticket tracks the process of exposing this flag.
> {code}
> #ifdef WITH_NETWORK_ISOLATOR
>   add(::max_executors_per_agent,
>   "max_executors_per_agent",
>   flags::DeprecatedName("max_executors_per_slave"),
>   "Maximum number of executors allowed per agent. The network\n"
>   "monitoring/isolation technique imposes an implicit resource\n"
>   "acquisition on each executor (# ephemeral ports), as a result\n"
>   "one can only run a certain number of executors on each agent.");
> #endif // WITH_NETWORK_ISOLATOR
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5789) Expose max_executors_per_agent for non port mapping isolator build.

2016-07-05 Thread Jie Yu (JIRA)
Jie Yu created MESOS-5789:
-

 Summary: Expose max_executors_per_agent for non port mapping 
isolator build.
 Key: MESOS-5789
 URL: https://issues.apache.org/jira/browse/MESOS-5789
 Project: Mesos
  Issue Type: Wish
Reporter: Jie Yu


Currently, master's flag `max_executors_per_agent` is only visible 
(conditionally compiled) if port mapping network isolator is enabled. However, 
this flag should be general and quite useful in other scenarios. The ticket 
tracks the process of exposing this flag.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5788) Consider adding a Java Scheduler Shim/Adapter for the new/old API.

2016-07-05 Thread Anand Mazumdar (JIRA)
Anand Mazumdar created MESOS-5788:
-

 Summary: Consider adding a Java Scheduler Shim/Adapter for the 
new/old API.
 Key: MESOS-5788
 URL: https://issues.apache.org/jira/browse/MESOS-5788
 Project: Mesos
  Issue Type: Task
Reporter: Anand Mazumdar
Assignee: Anand Mazumdar


Currently, for existing Java-based frameworks, trying out the new API can be 
cumbersome. This change introduces a shim/adapter interface that makes this 
easier by allowing a toggle between the old and new API implementations (the 
driver vs. the new scheduler library) via an environment variable. This would 
let framework developers transition their older frameworks to the new API 
rather seamlessly.

This would look similar to the work done for the executor shim for C++ 
(command/docker executor). 
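As a rough sketch of the env-var toggle idea (illustrative Python for brevity; the 
variable name `MESOS_USE_V1_API` and the class names below are made up for the 
example, not the real shim interface):

```python
# Illustrative sketch only: the actual shim, class names, and the
# environment variable (MESOS_USE_V1_API) are assumptions, not Mesos code.
import os


class DriverScheduler:
    """Stand-in for the old driver-based API."""
    def run(self):
        return "old driver API"


class V1Scheduler:
    """Stand-in for the new v1 scheduler library."""
    def run(self):
        return "new v1 API"


def make_scheduler(env=os.environ):
    # The adapter picks an implementation based on an environment variable,
    # so framework code can switch APIs without recompiling.
    if env.get("MESOS_USE_V1_API", "false").lower() == "true":
        return V1Scheduler()
    return DriverScheduler()
```

Framework code would only ever talk to the adapter, so flipping the variable is the 
entire migration toggle.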





[jira] [Updated] (MESOS-5781) Benchmark allocation with framework suppression.

2016-07-05 Thread Yan Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Xu updated MESOS-5781:
--
Shepherd: Yan Xu

> Benchmark allocation with framework suppression.
> 
>
> Key: MESOS-5781
> URL: https://issues.apache.org/jira/browse/MESOS-5781
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Jacob Janco
>Assignee: Jacob Janco
>  Labels: allocator, benchmark
>
> Benchmarks effects of framework suppression on allocation time. Frameworks 
> are suppressed and resources recovered each iteration and allocation time is 
> measured as we move to suppress all frameworks in the test case. Referencing 
> MESOS-4694. 
> Sample run at top of tree: 
> Using 2000 agents and 200 frameworks
> round 0 allocate took 2.630963secs to make 199 offers
> round 1 allocate took 2.640694secs to make 198 offers
> round 2 allocate took 2.642664secs to make 197 offers
> ...
> round 197 allocate took 2.433047secs to make 2 offers
> round 198 allocate took 2.409804secs to make 1 offers
> round 199 allocate took 252270us to make 0 offers
> Sample run with MESOS-4694 (https://reviews.apache.org/r/43666/):
> Using 2000 agents and 200 frameworks
> round 0 allocate took 2.626182secs to make 199 offers
> round 1 allocate took 2.62286secs to make 198 offers
> round 2 allocate took 2.591389secs to make 197 offers
> ...
> round 101 allocate took 1.494164secs to make 98 offers
> round 102 allocate took 1.491371secs to make 97 offers
> round 103 allocate took 1.491969secs to make 96 offers
> ...
> round 197 allocate took 534780us to make 2 offers
> round 198 allocate took 501947us to make 1 offers
> round 199 allocate took 24929us to make 0 offers





[jira] [Updated] (MESOS-5780) Benchmark framework failover.

2016-07-05 Thread Yan Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Xu updated MESOS-5780:
--
Shepherd: Yan Xu

> Benchmark framework failover.
> -
>
> Key: MESOS-5780
> URL: https://issues.apache.org/jira/browse/MESOS-5780
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Jacob Janco
>Assignee: Jacob Janco
>  Labels: allocator, benchmark, framework
>
> Benchmarking disconnection and reconnection of all frameworks in cluster 
> proves useful in gauging the efficiency of the allocator's handling of a 
> flooded event queue. I'd also like to reference MESOS-3157.
> Sample run at top of tree: 
> [ RUN  ] 
> SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.FrameworkFailover/10
> Allocator settled after 11.8424527mins for 5000 agents and 500 frameworks
> Sample run with MESOS-3157: 
> [ RUN  ] 
> SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.FrameworkFailover/10
> Allocator settled after 5.98023secs for 5000 agents and 500 frameworks





[jira] [Updated] (MESOS-5221) Add Documentation for Nvidia GPU support

2016-07-05 Thread Kevin Klues (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Klues updated MESOS-5221:
---
Fix Version/s: 1.0.0

> Add Documentation for Nvidia GPU support
> 
>
> Key: MESOS-5221
> URL: https://issues.apache.org/jira/browse/MESOS-5221
> Project: Mesos
>  Issue Type: Documentation
>Reporter: Kevin Klues
>Assignee: Kevin Klues
>Priority: Minor
> Fix For: 1.0.0
>
>
> https://reviews.apache.org/r/46220/





[jira] [Created] (MESOS-5787) Add ability to set framework capabilities in 'mesos-execute'

2016-07-05 Thread Kevin Klues (JIRA)
Kevin Klues created MESOS-5787:
--

 Summary: Add ability to set framework capabilities in 
'mesos-execute'
 Key: MESOS-5787
 URL: https://issues.apache.org/jira/browse/MESOS-5787
 Project: Mesos
  Issue Type: Improvement
Reporter: Kevin Klues
Assignee: Kevin Klues
 Fix For: 1.0.0


For now, we want to add this so that we can run {{mesos-execute}} against 
agents that offer GPU resources. In the future, as we add more framework 
capabilities, this functionality will become more generally useful.





[jira] [Commented] (MESOS-5416) make check of stout fails.

2016-07-05 Thread Till Toenshoff (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363208#comment-15363208
 ] 

Till Toenshoff commented on MESOS-5416:
---

Unfortunately, we missed a bug here; fixes are in review and being validated 
at the moment.

https://reviews.apache.org/r/49648/
https://reviews.apache.org/r/49649/

> make check of stout fails.
> --
>
> Key: MESOS-5416
> URL: https://issues.apache.org/jira/browse/MESOS-5416
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Till Toenshoff
>Assignee: Kapil Arya
> Fix For: 1.0.0
>
>
> When trying to build stout's tests on its own, I am hitting the following:
> {noformat}
> $ pwd
> /home/till/scratchpad/mesos/3rdparty/stout
> $ bootstrap
> $ mkdir build
> $ cd build/
> $ ../configure
> $ make check
> [...]
> ../tests/bytes_tests.cpp:13:25: fatal error: gtest/gtest.h: No such file or 
> directory
> compilation terminated.
> Makefile:766: recipe for target 'stout_tests-bytes_tests.o' failed
> make[2]: *** [stout_tests-bytes_tests.o] Error 1
> make[2]: *** Waiting for unfinished jobs
> ../tests/base64_tests.cpp:13:25: fatal error: gtest/gtest.h: No such file or 
> directory
> compilation terminated.
> ../tests/bits_tests.cpp:13:25: fatal error: gtest/gtest.h: No such file or 
> directory
> compilation terminated.
> Makefile:738: recipe for target 'stout_tests-base64_tests.o' failed
> make[2]: *** [stout_tests-base64_tests.o] Error 1
> ../tests/duration_tests.cpp:13:25: fatal error: gtest/gtest.h: No such file 
> or directory
> compilation terminated.
> Makefile:752: recipe for target 'stout_tests-bits_tests.o' failed
> make[2]: *** [stout_tests-bits_tests.o] Error 1
> Makefile:794: recipe for target 'stout_tests-duration_tests.o' failed
> make[2]: *** [stout_tests-duration_tests.o] Error 1
> ../tests/adaptor_tests.cpp:15:25: fatal error: gtest/gtest.h: No such file or 
> directory
> compilation terminated.
> Makefile:724: recipe for target 'stout_tests-adaptor_tests.o' failed
> make[2]: *** [stout_tests-adaptor_tests.o] Error 1
> ../tests/cache_tests.cpp:15:25: fatal error: gtest/gtest.h: No such file or 
> directory
> compilation terminated.
> Makefile:780: recipe for target 'stout_tests-cache_tests.o' failed
> make[2]: *** [stout_tests-cache_tests.o] Error 1
> make[2]: Leaving directory '/home/till/scratchpad/mesos/3rdparty/stout/build'
> Makefile:1706: recipe for target 'check-am' failed
> make[1]: *** [check-am] Error 2
> make[1]: Leaving directory '/home/till/scratchpad/mesos/3rdparty/stout/build'
> Makefile:1418: recipe for target 'check-recursive' failed
> make: *** [check-recursive] Error 1
> {noformat}





[jira] [Created] (MESOS-5786) Health check command in command executor should be running under task's user.

2016-07-05 Thread Jie Yu (JIRA)
Jie Yu created MESOS-5786:
-

 Summary: Health check command in command executor should be 
running under task's user.
 Key: MESOS-5786
 URL: https://issues.apache.org/jira/browse/MESOS-5786
 Project: Mesos
  Issue Type: Bug
Reporter: Jie Yu


Currently, it's running under command executor's user, which is root.





[jira] [Created] (MESOS-5785) Port documentation mistakes - ephemeral ports

2016-07-05 Thread Michael Gummelt (JIRA)
Michael Gummelt created MESOS-5785:
--

 Summary: Port documentation mistakes - ephemeral ports
 Key: MESOS-5785
 URL: https://issues.apache.org/jira/browse/MESOS-5785
 Project: Mesos
  Issue Type: Bug
Reporter: Michael Gummelt


The docs here: 
http://mesos.apache.org/documentation/latest/attributes-resources/

Should probably recommend that users not configure their agents to offer ports 
in the ephemeral port range (32768+; 
https://en.wikipedia.org/wiki/Ephemeral_port).  We avoid this in DC/OS, for 
example.  The example in the docs includes ports offered in this range, so we 
should fix that.

Further, the docs state that ports have "pre-defined behavior", but they don't 
say what that behavior is, and I'm not even clear myself what it is.
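A minimal sketch of the check being recommended (illustrative Python; the helper 
name and the Linux-default ephemeral range 32768-60999 are assumptions for the 
example, not Mesos code):

```python
# Illustrative only: reject agent port resources that overlap the common
# Linux ephemeral range (32768-60999 by default; configurable via
# /proc/sys/net/ipv4/ip_local_port_range).
EPHEMERAL = range(32768, 61000)


def overlaps_ephemeral(begin, end):
    """True if the offered port range [begin, end] touches the ephemeral range."""
    return begin <= EPHEMERAL[-1] and end >= EPHEMERAL[0]
```

For example, an agent offering `ports:[31000-32000]` would pass this check, while 
`ports:[31000-40000]` would not.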





[jira] [Comment Edited] (MESOS-5415) bootstrap of libprocess fails.

2016-07-05 Thread Kapil Arya (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303164#comment-15303164
 ] 

Kapil Arya edited comment on MESOS-5415 at 7/5/16 7:09 PM:
---

RRs:

https://reviews.apache.org/r/47924/
https://reviews.apache.org/r/47925/
https://reviews.apache.org/r/47928/
https://reviews.apache.org/r/47929/


was (Author: karya):
RRs:

https://reviews.apache.org/r/47924/
https://reviews.apache.org/r/47925/
https://reviews.apache.org/r/47927/
https://reviews.apache.org/r/47928/
https://reviews.apache.org/r/47929/

> bootstrap of libprocess fails.
> --
>
> Key: MESOS-5415
> URL: https://issues.apache.org/jira/browse/MESOS-5415
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Till Toenshoff
>Assignee: Kapil Arya
>
> When trying to build libprocess on its own, I am hitting the following:
> {noformat}
> $ pwd
> /home/till/scratchpad/mesos/3rdparty/libprocess
> $ ./bootstrap
> […]
> configure.ac:64: error: required file '3rdparty/gmock_sources.cc.in' not found
> {noformat}
> So the standalone {{configure.ac}} still tries to locate 
> {{gmock_source.cc.in}} in a subfolder called {{3rdparty}} while it should 
> actually try to locate it in its parent folder.
> {noformat}
> $ ll /home/till/scratchpad/mesos/3rdparty/gmock_sources.cc.in
> -rw-rw-r--. 1 till till 730 May 19 13:22 
> /home/till/scratchpad/mesos/3rdparty/gmock_sources.cc.in
> {noformat}





[jira] [Comment Edited] (MESOS-5416) make check of stout fails.

2016-07-05 Thread Kapil Arya (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303165#comment-15303165
 ] 

Kapil Arya edited comment on MESOS-5416 at 7/5/16 7:09 PM:
---

RRs:

https://reviews.apache.org/r/47931/
https://reviews.apache.org/r/47932/


was (Author: karya):
RRs:

https://reviews.apache.org/r/47930/
https://reviews.apache.org/r/47931/
https://reviews.apache.org/r/47932/

> make check of stout fails.
> --
>
> Key: MESOS-5416
> URL: https://issues.apache.org/jira/browse/MESOS-5416
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Till Toenshoff
>Assignee: Kapil Arya
> Fix For: 1.0.0
>
>
> When trying to build stout's tests on its own, I am hitting the following:
> {noformat}
> $ pwd
> /home/till/scratchpad/mesos/3rdparty/stout
> $ bootstrap
> $ mkdir build
> $ cd build/
> $ ../configure
> $ make check
> [...]
> ../tests/bytes_tests.cpp:13:25: fatal error: gtest/gtest.h: No such file or 
> directory
> compilation terminated.
> Makefile:766: recipe for target 'stout_tests-bytes_tests.o' failed
> make[2]: *** [stout_tests-bytes_tests.o] Error 1
> make[2]: *** Waiting for unfinished jobs
> ../tests/base64_tests.cpp:13:25: fatal error: gtest/gtest.h: No such file or 
> directory
> compilation terminated.
> ../tests/bits_tests.cpp:13:25: fatal error: gtest/gtest.h: No such file or 
> directory
> compilation terminated.
> Makefile:738: recipe for target 'stout_tests-base64_tests.o' failed
> make[2]: *** [stout_tests-base64_tests.o] Error 1
> ../tests/duration_tests.cpp:13:25: fatal error: gtest/gtest.h: No such file 
> or directory
> compilation terminated.
> Makefile:752: recipe for target 'stout_tests-bits_tests.o' failed
> make[2]: *** [stout_tests-bits_tests.o] Error 1
> Makefile:794: recipe for target 'stout_tests-duration_tests.o' failed
> make[2]: *** [stout_tests-duration_tests.o] Error 1
> ../tests/adaptor_tests.cpp:15:25: fatal error: gtest/gtest.h: No such file or 
> directory
> compilation terminated.
> Makefile:724: recipe for target 'stout_tests-adaptor_tests.o' failed
> make[2]: *** [stout_tests-adaptor_tests.o] Error 1
> ../tests/cache_tests.cpp:15:25: fatal error: gtest/gtest.h: No such file or 
> directory
> compilation terminated.
> Makefile:780: recipe for target 'stout_tests-cache_tests.o' failed
> make[2]: *** [stout_tests-cache_tests.o] Error 1
> make[2]: Leaving directory '/home/till/scratchpad/mesos/3rdparty/stout/build'
> Makefile:1706: recipe for target 'check-am' failed
> make[1]: *** [check-am] Error 2
> make[1]: Leaving directory '/home/till/scratchpad/mesos/3rdparty/stout/build'
> Makefile:1418: recipe for target 'check-recursive' failed
> make: *** [check-recursive] Error 1
> {noformat}





[jira] [Created] (MESOS-5784) Add support for vmodule in /toggle/logging

2016-07-05 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-5784:


 Summary: Add support for vmodule in /toggle/logging
 Key: MESOS-5784
 URL: https://issues.apache.org/jira/browse/MESOS-5784
 Project: Mesos
  Issue Type: Improvement
  Components: libprocess
Reporter: Zhitao Li
Assignee: Zhitao Li
Priority: Minor


The `/logging/toggle` endpoint is great, but it would be even better if we could 
selectively toggle the verbose level for particular modules.

glog supports this via its vmodule flag (see 
https://github.com/google/glog/blob/4d391fe692ae6b9e0105f473945c415a3ce5a401/src/vlog_is_on.cc#L55).

I think we can also support this in libprocess.
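A simplified sketch of glog-style vmodule matching (illustrative Python; real glog 
also supports `*` globbing in module patterns, which is omitted here, and the 
function names are made up for the example):

```python
# Illustrative sketch of vmodule semantics: "--vmodule=foo=2,bar=1" maps
# module names to verbose levels, and a VLOG(n) statement in module m is
# enabled when n <= level(m).
def parse_vmodule(spec):
    """Parse a comma-separated 'module=level' spec into a dict."""
    levels = {}
    for item in filter(None, spec.split(",")):
        module, _, level = item.partition("=")
        levels[module] = int(level)
    return levels


def vlog_is_on(levels, module, n, default_level=0):
    """Would VLOG(n) fire in `module`, given per-module levels?"""
    return n <= levels.get(module, default_level)
```

A `/toggle/logging`-style endpoint could accept such a spec and update the 
per-module levels at runtime instead of one global verbosity.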





[jira] [Commented] (MESOS-5775) Add a new CVMFS image type.

2016-07-05 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15362976#comment-15362976
 ] 

Jie Yu commented on MESOS-5775:
---

[~lins05] we don't have a design yet. For instance, it's not clear to me how to 
define the protobuf so that we can support catalog-based CVMFS images in the 
future.

> Add a new CVMFS image type.
> ---
>
> Key: MESOS-5775
> URL: https://issues.apache.org/jira/browse/MESOS-5775
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>Assignee: Shuai Lin
>
> One way to specify a CVMFS image is the following: use a combination 
> of the repository name and the path to the image within the repository.
> If we were to support CVMFS images via a catalog, we would need to think about 
> the best way to express that. Maybe we should make the `Cvmfs` message 
> more extensible with that in mind.
> {code}
> message Image {
>   enum Type {
> APPC = 1;
> DOCKER = 2;
> CVMFS = 3;
>   }
>   
>   message Cvmfs {
> required string repository = 1;
> required string path = 2;
>   }
>   optional Cvmfs cvmfs = 5;
> }
> {code}





[jira] [Commented] (MESOS-5771) Add benchmark test for shared resources.

2016-07-05 Thread Anindya Sinha (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15362902#comment-15362902
 ] 

Anindya Sinha commented on MESOS-5771:
--

RR up for review:
https://reviews.apache.org/r/49571/

> Add benchmark test for shared resources.
> 
>
> Key: MESOS-5771
> URL: https://issues.apache.org/jira/browse/MESOS-5771
> Project: Mesos
>  Issue Type: Improvement
>  Components: general
>Reporter: Anindya Sinha
>Assignee: Anindya Sinha
>  Labels: benchmark, external-volumes, persistent-volumes
>
> With adding support for shared resources (ie. shared persistent volumes), add 
> benchmark tests since this feature introduces changes in allocator and the 
> DRF sorter.





[jira] [Updated] (MESOS-5757) Authorize orphaned tasks

2016-07-05 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-5757:
--
Priority: Critical  (was: Major)

> Authorize orphaned tasks
> 
>
> Key: MESOS-5757
> URL: https://issues.apache.org/jira/browse/MESOS-5757
> Project: Mesos
>  Issue Type: Bug
>  Components: security
>Affects Versions: 1.0.0
>Reporter: Vinod Kone
>Assignee: Joerg Schad
>Priority: Critical
>  Labels: mesosphere, security
> Fix For: 1.0.0
>
>
> Currently, orphaned tasks are not filtered (i.e., using authorization) when a 
> request is made to /state endpoint. This is inconsistent (and unexpected) 
> with how we filter un-orphaned tasks. 
> This is tricky because master and hence the authorizer do not have 
> FrameworkInfos for these orphaned tasks, until after the corresponding 
> frameworks re-register.
> One option is for the agent to include FrameworkInfos of all its tasks and 
> executors in its re-registration message.





[jira] [Updated] (MESOS-5757) Authorize orphaned tasks

2016-07-05 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-5757:
--
Shepherd: Vinod Kone

> Authorize orphaned tasks
> 
>
> Key: MESOS-5757
> URL: https://issues.apache.org/jira/browse/MESOS-5757
> Project: Mesos
>  Issue Type: Bug
>  Components: security
>Affects Versions: 1.0.0
>Reporter: Vinod Kone
>Assignee: Joerg Schad
>  Labels: mesosphere, security
> Fix For: 1.0.0
>
>
> Currently, orphaned tasks are not filtered (i.e., using authorization) when a 
> request is made to /state endpoint. This is inconsistent (and unexpected) 
> with how we filter un-orphaned tasks. 
> This is tricky because master and hence the authorizer do not have 
> FrameworkInfos for these orphaned tasks, until after the corresponding 
> frameworks re-register.
> One option is for the agent to include FrameworkInfos of all its tasks and 
> executors in its re-registration message.





[jira] [Commented] (MESOS-5770) Mesos state api reporting Host IP instead of container IP with health check

2016-07-05 Thread Lax (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15362862#comment-15362862
 ] 

Lax commented on MESOS-5770:


Ok, thanks haosdent.

> Mesos state api reporting Host IP instead of container IP with health check 
> 
>
> Key: MESOS-5770
> URL: https://issues.apache.org/jira/browse/MESOS-5770
> Project: Mesos
>  Issue Type: Bug
>Reporter: Lax
>Priority: Critical
> Fix For: 0.28.0
>
>
> Am using Mesos IP per container using docker containerizer (via Calico). 
> Mesos state API (/master/state.json) seems to report container IP as long as 
> I have no health check on my task. As soon as I add health check to the task, 
> mesos start reporting Host IP instead of Container IP.
> Had initially opened this bug on Marathon 
> (https://github.com/mesosphere/marathon/issues/3907). But then told the issue 
> is with Mesos reporting the wrong IP. 
> Here are versions of Mesos and Marathon I was using.
> Mesos: 0.28.0.2
> Marathon: 0.15.3-1.0





[jira] [Commented] (MESOS-5770) Mesos state api reporting Host IP instead of container IP with health check

2016-07-05 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15362836#comment-15362836
 ] 

haosdent commented on MESOS-5770:
-

Hi, for packaging Mesos, you can refer to 
https://github.com/mesosphere/mesos-deb-packaging and 
https://github.com/apache/mesos/blob/master/docs/getting-started.md#system-requirements
I will start investigating the root cause tomorrow and will let you know if I 
make any progress. Thank you for your help again.

> Mesos state api reporting Host IP instead of container IP with health check 
> 
>
> Key: MESOS-5770
> URL: https://issues.apache.org/jira/browse/MESOS-5770
> Project: Mesos
>  Issue Type: Bug
>Reporter: Lax
>Priority: Critical
> Fix For: 0.28.0
>
>
> Am using Mesos IP per container using docker containerizer (via Calico). 
> Mesos state API (/master/state.json) seems to report container IP as long as 
> I have no health check on my task. As soon as I add health check to the task, 
> mesos start reporting Host IP instead of Container IP.
> Had initially opened this bug on Marathon 
> (https://github.com/mesosphere/marathon/issues/3907). But then told the issue 
> is with Mesos reporting the wrong IP. 
> Here are versions of Mesos and Marathon I was using.
> Mesos: 0.28.0.2
> Marathon: 0.15.3-1.0





[jira] [Updated] (MESOS-5763) Task stuck in fetching is not cleaned up after --executor_registration_timeout.

2016-07-05 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-5763:
--
Priority: Blocker  (was: Critical)

> Task stuck in fetching is not cleaned up after 
> --executor_registration_timeout.
> ---
>
> Key: MESOS-5763
> URL: https://issues.apache.org/jira/browse/MESOS-5763
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 0.28.0, 1.0.0, 0.29.0
>Reporter: Yan Xu
>Assignee: Yan Xu
>Priority: Blocker
> Fix For: 0.28.3, 1.0.0, 0.27.4
>
>
> When the fetching process hangs forever due to reasons such as HDFS issues, 
> Mesos containerizer would attempt to destroy the container and kill the 
> executor after {{--executor_registration_timeout}}. However this reliably 
> fails for us: the executor would be killed by the launcher destroy and the 
> container would be destroyed but the agent would never find out that the 
> executor is terminated thus leaving the task in the STAGING state forever.





[jira] [Updated] (MESOS-5073) Mesos allocator leaks role sorter and quota role sorters.

2016-07-05 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-5073:
---
Fix Version/s: 0.28.3

> Mesos allocator leaks role sorter and quota role sorters.
> -
>
> Key: MESOS-5073
> URL: https://issues.apache.org/jira/browse/MESOS-5073
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Benjamin Bannier
>Assignee: Benjamin Bannier
>  Labels: tech-debt
> Fix For: 0.28.3, 1.0.0
>
>
> The Mesos allocator {{internal::HierarchicalAllocatorProcess}} owns two raw 
> pointer members {{roleSorter}} and {{quotaRoleSorter}}, but fails to properly 
> manage their lifetime; they are e.g., not cleaned up in the allocator process 
> destructor.
> Since currently we do not recreate an existing allocator in production code 
> they seem to be unaffected by these leaks; they do affect tests though where 
> we create allocators multiple times.





[jira] [Updated] (MESOS-5698) Quota sorter not updated for resource changes at agent.

2016-07-05 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-5698:
---
Fix Version/s: 0.28.3

> Quota sorter not updated for resource changes at agent.
> ---
>
> Key: MESOS-5698
> URL: https://issues.apache.org/jira/browse/MESOS-5698
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
>Affects Versions: 0.27.3, 0.28.2
>Reporter: Neil Conway
>Assignee: Neil Conway
>Priority: Blocker
>  Labels: mesosphere, quota
> Fix For: 0.28.3, 1.0.0
>
>
> Consider this sequence of events:
> 1. Slave connects, with 128MB of disk.
> 2. Master offers resources at slave to framework
> 3. Framework creates a dynamic reservation for 1MB and a persistent volume of 
> the same size on the slave's resources.
>   => This invokes {{Master::apply}}, which invokes 
> {{allocator->updateAllocation}}, which invokes {{Sorter::update()}} on the 
> framework sorter and role sorter. If the framework's role has a configured 
> quota, it also invokes {{update}} on the quota role sorter -- in this case, 
> the framework's role has no quota, so the quota role sorter is *not* updated.
>   => {{DRFSorter::update}} updates the *total* resources at a given slave, 
> among updating other state. New total resources will be 127MB of unreserved 
> disk and 1MB of reserved disk with a volume. Note that the quota role sorter 
> still thinks the slave has 128MB of unreserved disk.
> 4. The slave is removed from the cluster. 
> {{HierarchicalAllocatorProcess::removeSlave}} invokes:
> {code}
>   roleSorter->remove(slaveId, slaves[slaveId].total);
>   quotaRoleSorter->remove(slaveId, slaves[slaveId].total.nonRevocable());
> {code}
> {{slaves\[slaveId\].total.nonRevocable()}} is 127MB of unreserved disk and 
> 1MB of reserved disk with a volume. When we remove this from the quota role 
> sorter, we're left with total resources on the reserved slave of 1MB of 
> unreserved disk, since that is the result of subtracting <127MB unreserved, 
> 1MB reserved+volume> from <128MB unreserved>.
> The implications of this can't be good: at minimum, we're leaking resources 
> for removed slaves in the quota role sorter. We're also introducing an 
> inconsistency between {{total_.resources\[slaveId\]}} and 
> {{total_.scalarQuantities}}, since the latter has already stripped-out 
> volume/reservation information.
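The subtraction described in the sequence above can be modeled in a few lines 
(illustrative Python, not Mesos code; the keys flatten reservation/volume metadata 
the way the sorter's scalar quantities do):

```python
# Tiny model of the bookkeeping bug: the quota role sorter's stale view of
# the agent is subtracted against the *updated* totals, leaking resources.
from collections import Counter

# Quota role sorter's stale view of the agent: 128MB unreserved disk.
quota_total = Counter({"disk(unreserved)": 128})

# What removeSlave subtracts: the updated total, i.e. 127MB unreserved
# plus 1MB reserved disk with a volume.
removed = Counter({"disk(unreserved)": 127, "disk(reserved+volume)": 1})

leak = quota_total.copy()
leak.subtract(removed)
# Left over: 1MB "unreserved" and -1MB "reserved+volume" for an agent
# that no longer exists -- the inconsistency the ticket describes.
```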





[jira] [Updated] (MESOS-5703) Authorize operator endpoints for Mesos 1.0

2016-07-05 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-5703:
--
Fix Version/s: (was: 1.0.0)

> Authorize operator endpoints for Mesos 1.0
> --
>
> Key: MESOS-5703
> URL: https://issues.apache.org/jira/browse/MESOS-5703
> Project: Mesos
>  Issue Type: Epic
>  Components: security
>Reporter: Adam B
>Assignee: Adam B
>  Labels: authorization, mesosphere, security
>
> We've authorized many endpoints in our work on MESOS-4843 and MESOS-5150, but 
> we need to tie it all together into a cohesive story and document the 
> authorization model/strategy. This epic will collect issues to round out the 
> Mesos 1.0 authorization story.





[jira] [Updated] (MESOS-5073) Mesos allocator leaks role sorter and quota role sorters.

2016-07-05 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-5073:
---
Summary: Mesos allocator leaks role sorter and quota role sorters.  (was: 
Mesos allocator leaks role sorter and quota role sorters)

> Mesos allocator leaks role sorter and quota role sorters.
> -
>
> Key: MESOS-5073
> URL: https://issues.apache.org/jira/browse/MESOS-5073
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Benjamin Bannier
>Assignee: Benjamin Bannier
>  Labels: tech-debt
> Fix For: 1.0.0
>
>
> The Mesos allocator {{internal::HierarchicalAllocatorProcess}} owns two raw 
> pointer members {{roleSorter}} and {{quotaRoleSorter}}, but fails to properly 
> manage their lifetime; they are e.g., not cleaned up in the allocator process 
> destructor.
> Since currently we do not recreate an existing allocator in production code 
> they seem to be unaffected by these leaks; they do affect tests though where 
> we create allocators multiple times.





[jira] [Comment Edited] (MESOS-5770) Mesos state api reporting Host IP instead of container IP with health check

2016-07-05 Thread Lax (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15362755#comment-15362755
 ] 

Lax edited comment on MESOS-5770 at 7/5/16 4:58 PM:


Is the workaround suggested in the other ticket good to be picked up? If so, is 
there a way to build a packaged version of Mesos? I would appreciate it if you 
could share the packaging instructions for Mesos; I would like to package Mesos 
with the patch and verify. 


was (Author: lax77):
Is the work around suggested in other ticket good to be picked up? If so, is 
there a way to build packaged version of mesos? would like to package mesos and 
try out the patch. 

> Mesos state api reporting Host IP instead of container IP with health check 
> 
>
> Key: MESOS-5770
> URL: https://issues.apache.org/jira/browse/MESOS-5770
> Project: Mesos
>  Issue Type: Bug
>Reporter: Lax
>Priority: Critical
> Fix For: 0.28.0
>
>
> Am using Mesos IP per container using docker containerizer (via Calico). 
> Mesos state API (/master/state.json) seems to report container IP as long as 
> I have no health check on my task. As soon as I add health check to the task, 
> mesos start reporting Host IP instead of Container IP.
> Had initially opened this bug on Marathon 
> (https://github.com/mesosphere/marathon/issues/3907). But then told the issue 
> is with Mesos reporting the wrong IP. 
> Here are versions of Mesos and Marathon I was using.
> Mesos: 0.28.0.2
> Marathon: 0.15.3-1.0





[jira] [Commented] (MESOS-5770) Mesos state api reporting Host IP instead of container IP with health check

2016-07-05 Thread Lax (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15362755#comment-15362755
 ] 

Lax commented on MESOS-5770:


Is the work around suggested in other ticket good to be picked up? If so, is 
there a way to build packaged version of mesos? would like to package mesos and 
try out the patch. 

> Mesos state api reporting Host IP instead of container IP with health check 
> 
>
> Key: MESOS-5770
> URL: https://issues.apache.org/jira/browse/MESOS-5770
> Project: Mesos
>  Issue Type: Bug
>Reporter: Lax
>Priority: Critical
> Fix For: 0.28.0
>
>
> Am using Mesos IP per container using docker containerizer (via Calico). 
> Mesos state API (/master/state.json) seems to report container IP as long as 
> I have no health check on my task. As soon as I add health check to the task, 
> mesos start reporting Host IP instead of Container IP.
> Had initially opened this bug on Marathon 
> (https://github.com/mesosphere/marathon/issues/3907). But then told the issue 
> is with Mesos reporting the wrong IP. 
> Here are versions of Mesos and Marathon I was using.
> Mesos: 0.28.0.2
> Marathon: 0.15.3-1.0





[jira] [Created] (MESOS-5783) Explore using protobuf arena allocation to improve performance

2016-07-05 Thread Neil Conway (JIRA)
Neil Conway created MESOS-5783:
--

 Summary: Explore using protobuf arena allocation to improve 
performance
 Key: MESOS-5783
 URL: https://issues.apache.org/jira/browse/MESOS-5783
 Project: Mesos
  Issue Type: Improvement
  Components: general
Reporter: Neil Conway


This has the potential to reduce memory management overhead when manipulating 
protobuf messages:

https://developers.google.com/protocol-buffers/docs/reference/arenas





[jira] [Comment Edited] (MESOS-5757) Authorize orphaned tasks

2016-07-05 Thread Joerg Schad (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15361699#comment-15361699
 ] 

Joerg Schad edited comment on MESOS-5757 at 7/5/16 3:06 PM:


Extended install() to support 7 arguments.
https://reviews.apache.org/r/49606/

Added support for recovered frameworks.
https://reviews.apache.org/r/49607/

Added filtering for orphaned tasks in /state endpoint.
https://reviews.apache.org/r/49609/

Test
https://reviews.apache.org/r/49639/


was (Author: js84):
Extended install() to support 7 arguments.
https://reviews.apache.org/r/49606/

Added support for recovered frameworks.
https://reviews.apache.org/r/49607/

Added filtering for orphaned tasks in /state endpoint.
https://reviews.apache.org/r/49609/

(tests are almost ready and will follow tomorrow)

> Authorize orphaned tasks
> 
>
> Key: MESOS-5757
> URL: https://issues.apache.org/jira/browse/MESOS-5757
> Project: Mesos
>  Issue Type: Bug
>  Components: security
>Affects Versions: 1.0.0
>Reporter: Vinod Kone
>Assignee: Joerg Schad
>  Labels: mesosphere, security
> Fix For: 1.0.0
>
>
> Currently, orphaned tasks are not filtered (i.e., using authorization) when a 
> request is made to /state endpoint. This is inconsistent (and unexpected) 
> with how we filter un-orphaned tasks. 
> This is tricky because master and hence the authorizer do not have 
> FrameworkInfos for these orphaned tasks, until after the corresponding 
> frameworks re-register.
> One option is for the agent to include FrameworkInfos of all its tasks and 
> executors in its re-registration message.





[jira] [Updated] (MESOS-5703) Authorize operator endpoints for Mesos 1.0

2016-07-05 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-5703:
--
Priority: Major  (was: Critical)

> Authorize operator endpoints for Mesos 1.0
> --
>
> Key: MESOS-5703
> URL: https://issues.apache.org/jira/browse/MESOS-5703
> Project: Mesos
>  Issue Type: Epic
>  Components: security
>Reporter: Adam B
>Assignee: Adam B
>  Labels: authorization, mesosphere, security
> Fix For: 1.0.0
>
>
> We've authorized many endpoints in our work on MESOS-4843 and MESOS-5150, but 
> we need to tie it all together into a cohesive story and document the 
> authorization model/strategy. This epic will collect issues to round out the 
> Mesos 1.0 authorization story.





[jira] [Updated] (MESOS-5379) Authentication documentation for libprocess endpoints can be misleading.

2016-07-05 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-5379:
--
Priority: Major  (was: Critical)

> Authentication documentation for libprocess endpoints can be misleading.
> 
>
> Key: MESOS-5379
> URL: https://issues.apache.org/jira/browse/MESOS-5379
> Project: Mesos
>  Issue Type: Bug
>  Components: documentation, libprocess, security
>Affects Versions: 1.0.0
>Reporter: Benjamin Bannier
>  Labels: mesosphere, tech-debt
> Fix For: 1.0.0
>
>
> Libprocess exposes a number of endpoints (at least: {{/logging}}, 
> {{/metrics}}, and {{/profiler}}). If libprocess was initialized with some 
> realm these endpoints require authentication, and otherwise they don't.
> To generate endpoint help we currently use the {{AUTHENTICATION}} helper, 
> which injects the following into the help string:
> {code}
> This endpoints requires authentication iff HTTP authentication is enabled.
> {code}
> Here {{iff}} documents a stronger coupling between required authentication 
> and enabled authentication than might actually hold for the above libprocess 
> endpoints -- it is true, e.g., when these endpoints are exposed through Mesos 
> masters/agents, but possibly not when they are exposed through other 
> executables.
> For libprocess endpoints, a weaker formulation such as
> {code}
> This endpoint supports authentication. If HTTP authentication is enabled, 
> this endpoint may require authentication.
> {code}
> might make the generated help strings more reusable.





[jira] [Updated] (MESOS-5730) Sandbox access authorization should fail for non existing sandboxes.

2016-07-05 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-5730:
--
Fix Version/s: (was: 1.0.0)

> Sandbox access authorization should fail for non existing sandboxes.
> 
>
> Key: MESOS-5730
> URL: https://issues.apache.org/jira/browse/MESOS-5730
> Project: Mesos
>  Issue Type: Bug
>  Components: security
>Affects Versions: 1.0.0
>Reporter: Till Toenshoff
>  Labels: authorization, mesosphere, security
>
> The local authorizer currently tries to authorize {{ACCESS_SANDBOX}} even if 
> no further object specification (e.g., {{framework_info}} or 
> {{executor_info}}) was specified / available at that time.
> Given that there is likely no sandbox available if no {{executor_info}} was 
> provided, I think we should actually fail instead of allowing or denying 
> (403).
> A failure would result in an, IMHO, more appropriate ServiceUnavailable 
> (503).  
> See 
> https://github.com/apache/mesos/commit/c8d67590064e35566274116cede9c6a733187b48#diff-dd692b1640b2628014feca01a94ba1e1R241





[jira] [Updated] (MESOS-3335) FlagsBase copy-ctor leads to dangling pointer.

2016-07-05 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-3335:

Shepherd: Michael Park

> FlagsBase copy-ctor leads to dangling pointer.
> --
>
> Key: MESOS-3335
> URL: https://issues.apache.org/jira/browse/MESOS-3335
> Project: Mesos
>  Issue Type: Bug
>Reporter: Neil Conway
>Assignee: Benjamin Bannier
>  Labels: mesosphere
> Attachments: lambda_capture_bug.cpp
>
>
> Per [#3328], ubsan detects the following problem:
> [ RUN ] FaultToleranceTest.ReregisterCompletedFrameworks
> /mesos/3rdparty/libprocess/3rdparty/stout/include/stout/flags/flags.hpp:303:25:
>  runtime error: load of value 33, which is not a valid value for type 'bool'
> I believe what is going on here is the following:
> * The test calls StartMaster(), which does MesosTest::CreateMasterFlags()
> * MesosTest::CreateMasterFlags() allocates a new master::Flags on the stack, 
> which is subsequently copy-constructed back to StartMaster()
> * The FlagsBase constructor is:
> bq. {{FlagsBase() { add(&help, "help", "...", false); }}}
> where "help" is a member variable -- i.e., it is allocated on the stack in 
> this case.
> * {{FlagsBase()::add}} captures {{&help}}, e.g.:
> {noformat}
> flag.stringify = [t1](const FlagsBase&) -> Option<std::string> {
>   return stringify(*t1);
> };
> {noformat}
> * The implicit copy constructor for FlagsBase is just going to copy the 
> lambda above, i.e., the result of the copy constructor will have a lambda 
> that points into MesosTest::CreateMasterFlags()'s stack frame, which is bad 
> news.
> Not sure of the right fix -- comments welcome. You could define a copy-ctor 
> for FlagsBase that does something gross (basically remove the old help flag 
> and define a new one that points into the target of the copy), but that 
> seems, well, gross.
> Probably not a pressing problem to fix -- AFAICS the worst symptom is that 
> we end up reading one byte from some random stack location when serving 
> {{state.json}}, for example.
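The failure mode described above can be reproduced outside stout with a few lines of standard C++. Note this is a hypothetical minimal sketch ({{Flags}} and {{copy_reads_original}} below are illustrative names, not the real FlagsBase): the constructor stores a lambda capturing a pointer to a member of {{*this*}}, and the compiler-generated copy constructor copies that lambda verbatim, so the copy's lambda keeps dereferencing a pointer into the original object.

```cpp
#include <functional>

// Sketch of the FlagsBase pattern: the ctor registers a lambda that
// captures a raw pointer to the `help` member of *this* instance.
struct Flags {
  bool help = false;
  std::function<bool()> stringify;

  Flags() {
    bool* t1 = &help;                    // pointer into this object
    stringify = [t1]() { return *t1; };  // lambda captures the raw pointer
  }
};

// Returns true iff a copy's lambda observes writes to the ORIGINAL object,
// i.e. the implicit copy-ctor left the copy aliasing the source's storage.
// If the source has meanwhile been destroyed (as in CreateMasterFlags()
// returning by value), that same read becomes a dangling-pointer access.
bool copy_reads_original() {
  Flags a;
  Flags b = a;   // implicit copy-ctor: the lambda (and its pointer) copied as-is
  a.help = true; // mutate only the original
  return b.stringify() && !b.help;  // b's lambda reads a.help, not b.help
}
```

In the test above both objects are still alive, so the aliasing is merely observable; in the MesosTest scenario the source's stack frame is gone, which is why ubsan reports a load of an invalid {{bool}} value.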





[jira] [Created] (MESOS-5782) Renamed 'commands' to 'pre_exec_commands' in ContainerLaunchInfo.

2016-07-05 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-5782:
---

 Summary: Renamed 'commands' to 'pre_exec_commands' in 
ContainerLaunchInfo.
 Key: MESOS-5782
 URL: https://issues.apache.org/jira/browse/MESOS-5782
 Project: Mesos
  Issue Type: Improvement
  Components: containerization
Reporter: Gilbert Song
Assignee: Gilbert Song


Currently the 'commands' field in isolator.proto's ContainerLaunchInfo is 
somewhat confusing. It holds commands (any script or shell command) that are 
executed before launch. We should rename 'commands' to 'pre_exec_commands' in 
ContainerLaunchInfo and add comments.





[jira] [Updated] (MESOS-5727) Command executor health check does not work when the task specifies container image.

2016-07-05 Thread Gilbert Song (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song updated MESOS-5727:

 Labels: containerizer health-check mesosphere  (was: )
Component/s: containerization

> Command executor health check does not work when the task specifies container 
> image.
> 
>
> Key: MESOS-5727
> URL: https://issues.apache.org/jira/browse/MESOS-5727
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 0.28.2, 1.0.0
>Reporter: Jie Yu
>Assignee: Gilbert Song
>  Labels: containerizer, health-check, mesosphere
> Fix For: 1.0.0
>
>
> Since we launch the task after pivot_root, we no longer have access to the 
> mesos-health-check binary. The solution is to refactor the health check into 
> a library (libprocess) so that it does not depend on the underlying 
> filesystem.
> One note here is that we should strive to keep both the command executor and 
> the task in the same mount namespace so that Mesos CLI tooling does not need 
> to find the mount namespace for the task. It just needs to find the 
> corresponding pid for the executor.





[jira] [Updated] (MESOS-5727) Command executor health check does not work when the task specifies container image.

2016-07-05 Thread Gilbert Song (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song updated MESOS-5727:

  Sprint: Mesosphere Sprint 38
Story Points: 5

> Command executor health check does not work when the task specifies container 
> image.
> 
>
> Key: MESOS-5727
> URL: https://issues.apache.org/jira/browse/MESOS-5727
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.28.2, 1.0.0
>Reporter: Jie Yu
>Assignee: Gilbert Song
> Fix For: 1.0.0
>
>
> Since we launch the task after pivot_root, we no longer have access to the 
> mesos-health-check binary. The solution is to refactor the health check into 
> a library (libprocess) so that it does not depend on the underlying 
> filesystem.
> One note here is that we should strive to keep both the command executor and 
> the task in the same mount namespace so that Mesos CLI tooling does not need 
> to find the mount namespace for the task. It just needs to find the 
> corresponding pid for the executor.


