[jira] [Updated] (MESOS-5967) Add support for 'docker image inspect' in our docker abstraction.

2017-02-10 Thread Guangya Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guangya Liu updated MESOS-5967:
---
Target Version/s: 1.3.0

> Add support for 'docker image inspect' in our docker abstraction.
> -
>
> Key: MESOS-5967
> URL: https://issues.apache.org/jira/browse/MESOS-5967
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization, docker
>Reporter: Kevin Klues
>Assignee: Guangya Liu
>  Labels: gpu
>
> Docker's command line tool for {{docker inspect}} can take either a 
> {{container}}, an {{image}}, or a {{task}} as its argument, and return a JSON 
> array containing low-level information about that container, image or task. 
> However, the current {{docker inspect}} support in our docker abstraction 
> only supports inspecting containers (not images or tasks).  We should expand 
> this to (at least) support images.
> In particular, this additional functionality is motivated by the upcoming GPU 
> support, which needs to inspect the labels in a docker image to decide if it 
> should inject the required Nvidia volumes into a container.  
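
As a rough illustration of the request above, the sketch below builds the command line for inspecting an image rather than a container. It relies on the Docker CLI's `--type` option to `docker inspect`; the `InspectType` enum and the `inspectCommand` helper are hypothetical names used only for this example and are not part of the Mesos docker abstraction.

```cpp
#include <string>
#include <vector>

// What kind of object to inspect (hypothetical enum for this sketch).
enum class InspectType { CONTAINER, IMAGE };

// Builds the argv for `docker inspect --type=<type> <name>`.
// `dockerPath` is the path to the docker binary; `name` is a container or
// image name. Both are supplied by the caller.
inline std::vector<std::string> inspectCommand(
    const std::string& dockerPath,
    InspectType type,
    const std::string& name)
{
  std::vector<std::string> argv;
  argv.push_back(dockerPath);
  argv.push_back("inspect");
  if (type == InspectType::IMAGE) {
    argv.push_back("--type=image");
  } else {
    argv.push_back("--type=container");
  }
  argv.push_back(name);
  return argv;
}
```

The JSON array printed by the resulting command would still need to be parsed to read the image labels; that step is left out here.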



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-6638) Update Suppress and Revive to be per-role.

2017-02-10 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15862069#comment-15862069
 ] 

Guangya Liu commented on MESOS-6638:


{noformat}
commit f40e3d5fb167a691f6a3071f504b77e0def29604
Author: Guangya Liu gy...@apache.org
Date:   Sat Feb 11 08:24:26 2017 +0800

Added roles field to framework.

Added roles field to framework.

Review: https://reviews.apache.org/r/56499/
{noformat}

> Update Suppress and Revive to be per-role.
> --
>
> Key: MESOS-6638
> URL: https://issues.apache.org/jira/browse/MESOS-6638
> Project: Mesos
>  Issue Type: Task
>  Components: framework api
>Reporter: Benjamin Mahler
>Assignee: Guangya Liu
>
> The {{SUPPRESS}} and {{REVIVE}} calls need to be updated to be per-role. I.e. 
> Include {{Revive.role}} and {{Suppress.role}} fields, indicating which role 
> the operation is being applied to.
> {{Revive}} and {{Suppress}} messages do not currently exist, so these need to 
> be added. To support the old-style schedulers, we will make the role fields 
> optional.
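
A minimal sketch of how an optional role field could be applied, assuming that a call without a role (as an old-style scheduler would send) affects all of the framework's roles. `RoleCall`, `FrameworkState`, `suppress` and `revive` are hypothetical names for this example; they are not the Mesos protobuf messages or master code.

```cpp
#include <set>
#include <string>

// Hypothetical stand-in for a SUPPRESS or REVIVE call with an optional role.
// `hasRole` mirrors the has_role() accessor of an optional protobuf field.
struct RoleCall
{
  bool hasRole;
  std::string role;
};

// Hypothetical per-framework state kept by the allocator.
struct FrameworkState
{
  std::set<std::string> roles;       // Roles the framework subscribed with.
  std::set<std::string> suppressed;  // Roles whose offers are suppressed.
};

// Suppresses offers for one role, or for all roles if no role is set
// (assumption: an unset role means "every role of the framework").
inline void suppress(FrameworkState& framework, const RoleCall& call)
{
  if (!call.hasRole) {
    framework.suppressed = framework.roles;
  } else if (framework.roles.count(call.role) > 0) {
    framework.suppressed.insert(call.role);
  }
}

// Revives offers for one role, or for all roles if no role is set.
inline void revive(FrameworkState& framework, const RoleCall& call)
{
  if (!call.hasRole) {
    framework.suppressed.clear();
  } else {
    framework.suppressed.erase(call.role);
  }
}
```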





[jira] [Commented] (MESOS-6638) Update Suppress and Revive to be per-role.

2017-02-08 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859128#comment-15859128
 ] 

Guangya Liu commented on MESOS-6638:


{noformat}
commit 4fb2a5d2edeca0966c0f3ea3445f9723d0140d09
Author: Guangya Liu 
Date:   Thu Feb 9 14:40:04 2017 +0800

Enabled suppress offer per role.

Enabled suppress offer per role.

Review: https://reviews.apache.org/r/56330/

commit 20dfd055a20e1238e6a7d52181fc33da9b4460cb
Author: Guangya Liu 
Date:   Thu Feb 9 14:44:45 2017 +0800

Enabled `ReviveOffersMessage` support revive per role.

Review: https://reviews.apache.org/r/56371/

commit 54e65143c5b19915f8ec2bbce35d239b4c5d85d7
Author: Guangya Liu 
Date:   Thu Feb 9 14:48:19 2017 +0800

Augmented master `Revive` API to accept `Call::Revive`.

Augmented master `Revive` API to accept `Call::Revive`.

Review: https://reviews.apache.org/r/56373/

commit c2388a511c775dd6f392961b06fd7738bf051dbc
Author: Guangya Liu 
Date:   Thu Feb 9 14:51:27 2017 +0800

Enabled revive offer per role.

Enabled revive offer per role.

Review: https://reviews.apache.org/r/56374/
{noformat}

> Update Suppress and Revive to be per-role.
> --
>
> Key: MESOS-6638
> URL: https://issues.apache.org/jira/browse/MESOS-6638
> Project: Mesos
>  Issue Type: Task
>  Components: framework api
>Reporter: Benjamin Mahler
>Assignee: Guangya Liu
>
> The {{SUPPRESS}} and {{REVIVE}} calls need to be updated to be per-role. I.e. 
> Include {{Revive.role}} and {{Suppress.role}} fields, indicating which role 
> the operation is being applied to.
> {{Revive}} and {{Suppress}} messages do not currently exist, so these need to 
> be added. To support the old-style schedulers, we will make the role fields 
> optional.





[jira] [Commented] (MESOS-6638) Update Suppress and Revive to be per-role.

2017-02-08 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15858772#comment-15858772
 ] 

Guangya Liu commented on MESOS-6638:


{code}
commit 748675352964ccfbf4e45d6cd7b4b4cacb1c58bf
Author: Guangya Liu gy...@apache.org
Date:   Thu Feb 9 08:28:43 2017 +0800

Updated Suppress and Revive proto to support per role.

Updated Suppress and Revive proto to support per role.

Review: https://reviews.apache.org/r/56327/

commit 348c06bb0f06c3229ba897fc7fd568473c5bd11b
Author: Guangya Liu gy...@apache.org
Date:   Thu Feb 9 08:30:21 2017 +0800

Augmented master `Suppress` API to accept `Call::Suppress`.

Augmented master `Suppress` API to accept `Call::Suppress`.

Review: https://reviews.apache.org/r/56328/
{code}

> Update Suppress and Revive to be per-role.
> --
>
> Key: MESOS-6638
> URL: https://issues.apache.org/jira/browse/MESOS-6638
> Project: Mesos
>  Issue Type: Task
>  Components: framework api
>Reporter: Benjamin Mahler
>Assignee: Guangya Liu
>
> The {{SUPPRESS}} and {{REVIVE}} calls need to be updated to be per-role. I.e. 
> Include {{Revive.role}} and {{Suppress.role}} fields, indicating which role 
> the operation is being applied to.
> {{Revive}} and {{Suppress}} messages do not currently exist, so these need to 
> be added. To support the old-style schedulers, we will make the role fields 
> optional.





[jira] [Comment Edited] (MESOS-6638) Update Suppress and Revive to be per-role.

2017-02-07 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15854206#comment-15854206
 ] 

Guangya Liu edited comment on MESOS-6638 at 2/7/17 12:44 PM:
-

https://reviews.apache.org/r/56327/ Updated Suppress and Revive proto to 
support per role.
https://reviews.apache.org/r/56328/ Augmented master `Suppress` API to accept 
`Call::Suppress`.
https://reviews.apache.org/r/56330/ Enabled suppress offer per role.
https://reviews.apache.org/r/56371/ Enabled `ReviveOffersMessage` support 
revive per role.
https://reviews.apache.org/r/56373/ Augmented master `Revive` API to accept 
`Call::Revive`.
https://reviews.apache.org/r/56374/ Enabled revive offer per role.
https://reviews.apache.org/r/56376/ Updated allocator test to support create 
multi role framework.
https://reviews.apache.org/r/56378/ Added test case for suppress and revive 
with multi role framework.


was (Author: gyliu):
https://reviews.apache.org/r/56327/ Updated Suppress and Revive proto to 
support per role.
https://reviews.apache.org/r/56328/ Augmented master `Suppress` API to accept 
`Call::Suppress`.
https://reviews.apache.org/r/56330/ Enabled suppress offer per role.

> Update Suppress and Revive to be per-role.
> --
>
> Key: MESOS-6638
> URL: https://issues.apache.org/jira/browse/MESOS-6638
> Project: Mesos
>  Issue Type: Task
>  Components: framework api
>Reporter: Benjamin Mahler
>Assignee: Guangya Liu
>
> The {{SUPPRESS}} and {{REVIVE}} calls need to be updated to be per-role. I.e. 
> Include {{Revive.role}} and {{Suppress.role}} fields, indicating which role 
> the operation is being applied to.
> {{Revive}} and {{Suppress}} messages do not currently exist, so these need to 
> be added. To support the old-style schedulers, we will make the role fields 
> optional.





[jira] [Updated] (MESOS-7070) Improve allocator performance phase 2

2017-02-06 Thread Guangya Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guangya Liu updated MESOS-7070:
---
Epic Name: allocator performance phase 2

> Improve allocator performance phase 2
> -
>
> Key: MESOS-7070
> URL: https://issues.apache.org/jira/browse/MESOS-7070
> Project: Mesos
>  Issue Type: Epic
>Reporter: Guangya Liu
>
> Phase 1 of the `allocator performance improvement` work is finished. It
> delivered the following improvements:
> 1) Enabled batch allocation in the allocator.
> 2) Improved performance of the sorter.
> 3) Improved performance of the `Resource` class.
> 4) Added a number of benchmark tests for both the sorter and resources.
> Some items still need follow-up in phase 2, such as periodic resource
> allocation, allocating resources as soon as possible after resources are
> recovered, and more benchmark tests.





[jira] [Updated] (MESOS-7070) Improve allocator performance phase 2

2017-02-06 Thread Guangya Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guangya Liu updated MESOS-7070:
---
Issue Type: Epic  (was: Bug)

> Improve allocator performance phase 2
> -
>
> Key: MESOS-7070
> URL: https://issues.apache.org/jira/browse/MESOS-7070
> Project: Mesos
>  Issue Type: Epic
>Reporter: Guangya Liu
>
> Phase 1 of the `allocator performance improvement` work is finished. It
> delivered the following improvements:
> 1) Enabled batch allocation in the allocator.
> 2) Improved performance of the sorter.
> 3) Improved performance of the `Resource` class.
> 4) Added a number of benchmark tests for both the sorter and resources.
> Some items still need follow-up in phase 2, such as periodic resource
> allocation, allocating resources as soon as possible after resources are
> recovered, and more benchmark tests.





[jira] [Created] (MESOS-7070) Improve allocator performance phase 2

2017-02-06 Thread Guangya Liu (JIRA)
Guangya Liu created MESOS-7070:
--

 Summary: Improve allocator performance phase 2
 Key: MESOS-7070
 URL: https://issues.apache.org/jira/browse/MESOS-7070
 Project: Mesos
  Issue Type: Bug
Reporter: Guangya Liu


Phase 1 of the `allocator performance improvement` work is finished. It
delivered the following improvements:

1) Enabled batch allocation in the allocator.
2) Improved performance of the sorter.
3) Improved performance of the `Resource` class.
4) Added a number of benchmark tests for both the sorter and resources.

Some items still need follow-up in phase 2, such as periodic resource
allocation, allocating resources as soon as possible after resources are
recovered, and more benchmark tests.





[jira] [Assigned] (MESOS-6638) Update Suppress and Revive to be per-role.

2017-02-06 Thread Guangya Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guangya Liu reassigned MESOS-6638:
--

Assignee: Guangya Liu

> Update Suppress and Revive to be per-role.
> --
>
> Key: MESOS-6638
> URL: https://issues.apache.org/jira/browse/MESOS-6638
> Project: Mesos
>  Issue Type: Task
>  Components: framework api
>Reporter: Benjamin Mahler
>Assignee: Guangya Liu
>
> The {{SUPPRESS}} and {{REVIVE}} calls need to be updated to be per-role. I.e. 
> Include {{Revive.role}} and {{Suppress.role}} fields, indicating which role 
> the operation is being applied to.
> {{Revive}} and {{Suppress}} messages do not currently exist, so these need to 
> be added. To support the old-style schedulers, we will make the role fields 
> optional.





[jira] [Updated] (MESOS-7044) Update comments for Queue.get() & Queue.put()

2017-01-31 Thread Guangya Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guangya Liu updated MESOS-7044:
---
Priority: Minor  (was: Major)

> Update comments for Queue.get() & Queue.put()
> -
>
> Key: MESOS-7044
> URL: https://issues.apache.org/jira/browse/MESOS-7044
> Project: Mesos
>  Issue Type: Bug
>Reporter: Guangya Liu
>Priority: Minor
>
> This is a follow-up from https://reviews.apache.org/r/55852/
> We currently use `Queue.get()` and `Queue.put()` to pop and push elements.
> Without reading the code, it is hard to tell that `Queue.get()` also removes
> (pops) the element it returns. It would be better to use more meaningful
> names, such as `pop`/`push` or similar.
> https://github.com/apache/mesos/blob/1.1.x/3rdparty/libprocess/include/process/queue.hpp#L34-L70





[jira] [Created] (MESOS-7044) Update comments for Queue.get() & Queue.put()

2017-01-31 Thread Guangya Liu (JIRA)
Guangya Liu created MESOS-7044:
--

 Summary: Update comments for Queue.get() & Queue.put()
 Key: MESOS-7044
 URL: https://issues.apache.org/jira/browse/MESOS-7044
 Project: Mesos
  Issue Type: Bug
Reporter: Guangya Liu


This is a follow-up from https://reviews.apache.org/r/55852/

We currently use `Queue.get()` and `Queue.put()` to pop and push elements.
Without reading the code, it is hard to tell that `Queue.get()` also removes
(pops) the element it returns. It would be better to use more meaningful
names, such as `pop`/`push` or similar.

https://github.com/apache/mesos/blob/1.1.x/3rdparty/libprocess/include/process/queue.hpp#L34-L70





[jira] [Updated] (MESOS-2824) Support pre-fetching images

2017-01-27 Thread Guangya Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guangya Liu updated MESOS-2824:
---
Assignee: (was: Guangya Liu)

> Support pre-fetching images
> ---
>
> Key: MESOS-2824
> URL: https://issues.apache.org/jira/browse/MESOS-2824
> Project: Mesos
>  Issue Type: Improvement
>  Components: isolation
>Affects Versions: 0.23.0
>Reporter: Ian Downes
>Priority: Minor
>  Labels: mesosphere, twitter
>
> Default container images can be specified with the --default_container_info 
> flag to the slave. This may be a large image that will take a long time to 
> initially fetch/hash/extract when the first container is provisioned. Add 
> optional support to start fetching the image when the slave starts and 
> consider not registering until the fetch is complete.
> To extend that, we should support an operator endpoint so that operators can 
> specify images to pre-fetch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6854) Prevent launching MULTI_ROLE framework's tasks on agents without MULTI_ROLE support.

2017-01-15 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823576#comment-15823576
 ] 

Guangya Liu commented on MESOS-6854:







> Prevent launching MULTI_ROLE framework's tasks on agents without MULTI_ROLE 
> support.
> 
>
> Key: MESOS-6854
> URL: https://issues.apache.org/jira/browse/MESOS-6854
> Project: Mesos
>  Issue Type: Task
>  Components: agent, master
>Reporter: Benjamin Mahler
>Assignee: Jay Guo
>
> The proposal for upgrades / backwards compatibility in phase 1 of multi-role 
> framework support is that we require that masters and agents are all upgraded 
> before a multi-role framework registers.
> We need to explicitly protect against this situation occurring given it's 
> common for old agents to show up in a cluster. The master can prevent the 
> launching of MULTI_ROLE frameworks' tasks on agent without MULTI_ROLE 
> framework support.
> If we were to naively let this happen the old agent would think the resources 
> are allocated to the "*" and there would need to be master logic to deal with 
> the old agent not populating Resource.AllocationInfo.
> The guard will either need to be version based or agent capability based, the 
> latter seeming like the stronger approach given some users upgrade off of 
> master rather than using release versions.
> We can initially start with the master side guard, and have the agent send 
> the capability once the agent-side implementation is complete.





[jira] [Comment Edited] (MESOS-5967) Add support for 'docker image inspect' in our docker abstraction.

2017-01-07 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15807232#comment-15807232
 ] 

Guangya Liu edited comment on MESOS-5967 at 1/7/17 10:15 AM:
-

[~klueska] Just rebased, all of the patches are valid now, can you please help 
review? Thanks.


was (Author: gyliu):
[~klueska] Just rebased, all of the patches are now valid now, can you please 
help review? Thanks.

> Add support for 'docker image inspect' in our docker abstraction.
> -
>
> Key: MESOS-5967
> URL: https://issues.apache.org/jira/browse/MESOS-5967
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization, docker
>Reporter: Kevin Klues
>Assignee: Guangya Liu
>  Labels: gpu
>
> Docker's command line tool for {{docker inspect}} can take either a 
> {{container}}, an {{image}}, or a {{task}} as its argument, and return a JSON 
> array containing low-level information about that container, image or task. 
> However, the current {{docker inspect}} support in our docker abstraction 
> only supports inspecting containers (not images or tasks).  We should expand 
> this to (at least) support images.
> In particular, this additional functionality is motivated by the upcoming GPU 
> support, which needs to inspect the labels in a docker image to decide if it 
> should inject the required Nvidia volumes into a container.  





[jira] [Commented] (MESOS-5967) Add support for 'docker image inspect' in our docker abstraction.

2017-01-07 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15807232#comment-15807232
 ] 

Guangya Liu commented on MESOS-5967:


[~klueska] Just rebased, all of the patches are now valid now, can you please 
help review? Thanks.

> Add support for 'docker image inspect' in our docker abstraction.
> -
>
> Key: MESOS-5967
> URL: https://issues.apache.org/jira/browse/MESOS-5967
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization, docker
>Reporter: Kevin Klues
>Assignee: Guangya Liu
>  Labels: gpu
>
> Docker's command line tool for {{docker inspect}} can take either a 
> {{container}}, an {{image}}, or a {{task}} as its argument, and return a JSON 
> array containing low-level information about that container, image or task. 
> However, the current {{docker inspect}} support in our docker abstraction 
> only supports inspecting containers (not images or tasks).  We should expand 
> this to (at least) support images.
> In particular, this additional functionality is motivated by the upcoming GPU 
> support, which needs to inspect the labels in a docker image to decide if it 
> should inject the required Nvidia volumes into a container.  





[jira] [Commented] (MESOS-6854) Prevent launching MULTI_ROLE framework's tasks on agents without MULTI_ROLE support.

2017-01-05 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15801371#comment-15801371
 ] 

Guangya Liu commented on MESOS-6854:


[~bmahler], one question about the master-side guard: if the master cannot
get the agent capability, how can it do the validation? It seems we need to
finish the agent part first?

> Prevent launching MULTI_ROLE framework's tasks on agents without MULTI_ROLE 
> support.
> 
>
> Key: MESOS-6854
> URL: https://issues.apache.org/jira/browse/MESOS-6854
> Project: Mesos
>  Issue Type: Task
>  Components: agent, master
>Reporter: Benjamin Mahler
>Assignee: Guangya Liu
>
> The proposal for upgrades / backwards compatibility in phase 1 of multi-role 
> framework support is that we require that masters and agents are all upgraded 
> before a multi-role framework registers.
> We need to explicitly protect against this situation occurring given it's 
> common for old agents to show up in a cluster. The master can prevent the 
> launching of MULTI_ROLE frameworks' tasks on agent without MULTI_ROLE 
> framework support.
> If we were to naively let this happen the old agent would think the resources 
> are allocated to the "*" and there would need to be master logic to deal with 
> the old agent not populating Resource.AllocationInfo.
> The guard will either need to be version based or agent capability based, the 
> latter seeming like the stronger approach given some users upgrade off of 
> master rather than using release versions.
> We can initially start with the master side guard, and have the agent send 
> the capability once the agent-side implementation is complete.
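
The capability-based guard described above could look roughly like the sketch below. `AgentView`, `FrameworkView` and `mayLaunch` are hypothetical, simplified names for this example, and modelling capabilities as a set of strings is an assumption made for brevity; an old agent is one that advertises no `MULTI_ROLE` capability.

```cpp
#include <set>
#include <string>

// Hypothetical, simplified stand-ins for the master's view of an agent and
// a framework; the real Mesos types carry far more state.
struct AgentView
{
  std::set<std::string> capabilities;
};

struct FrameworkView
{
  std::set<std::string> capabilities;
};

// Returns true if the master may launch the framework's tasks on the agent.
// A MULTI_ROLE framework is only allowed on agents that advertise the
// MULTI_ROLE capability; single-role frameworks are allowed everywhere.
inline bool mayLaunch(const FrameworkView& framework, const AgentView& agent)
{
  const bool multiRoleFramework =
    framework.capabilities.count("MULTI_ROLE") > 0;
  const bool multiRoleAgent = agent.capabilities.count("MULTI_ROLE") > 0;

  return !multiRoleFramework || multiRoleAgent;
}
```

Checking a capability rather than a version string keeps the guard correct for users who build from master instead of a release.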





[jira] [Assigned] (MESOS-6854) Prevent launching MULTI_ROLE framework's tasks on agents without MULTI_ROLE support.

2017-01-05 Thread Guangya Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guangya Liu reassigned MESOS-6854:
--

Assignee: Guangya Liu

> Prevent launching MULTI_ROLE framework's tasks on agents without MULTI_ROLE 
> support.
> 
>
> Key: MESOS-6854
> URL: https://issues.apache.org/jira/browse/MESOS-6854
> Project: Mesos
>  Issue Type: Task
>  Components: agent, master
>Reporter: Benjamin Mahler
>Assignee: Guangya Liu
>
> The proposal for upgrades / backwards compatibility in phase 1 of multi-role 
> framework support is that we require that masters and agents are all upgraded 
> before a multi-role framework registers.
> We need to explicitly protect against this situation occurring given it's 
> common for old agents to show up in a cluster. The master can prevent the 
> launching of MULTI_ROLE frameworks' tasks on agent without MULTI_ROLE 
> framework support.
> If we were to naively let this happen the old agent would think the resources 
> are allocated to the "*" and there would need to be master logic to deal with 
> the old agent not populating Resource.AllocationInfo.
> The guard will either need to be version based or agent capability based, the 
> latter seeming like the stronger approach given some users upgrade off of 
> master rather than using release versions.
> We can initially start with the master side guard, and have the agent send 
> the capability once the agent-side implementation is complete.





[jira] [Updated] (MESOS-6730) Reserve operation should validate reserved resource role against resource allocationInfo role

2016-12-06 Thread Guangya Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guangya Liu updated MESOS-6730:
---
Description: 
During dynamic reservation validation, the current logic makes sure that the
reserved resource's role is the same as the framework's role:
https://github.com/apache/mesos/blob/1.1.x/src/master/validation.cpp#L1458

{code}
  if (frameworkRole.isSome() && resource.role() != frameworkRole.get()) {
    return Error(
        "A reserve operation was attempted for a resource with role"
        " '" + resource.role() + "', but the framework can only reserve"
        " resources with role '" + frameworkRole.get() + "'");
  }
{code}

With multi-role frameworks, we should instead validate that the reserved
resource's role is the same as the resource's allocation role.

Please make sure to distinguish between dynamic reservations made by a
framework and those made through the HTTP endpoint. If the reservation was
triggered by a framework, we need to do this validation. If it was made
through the HTTP endpoint, there is no need to validate the roles.

  was:
During dynamic reservation validation, the current logic makes sure that the
reserved resource's role is the same as the framework's role:
https://github.com/apache/mesos/blob/1.1.x/src/master/validation.cpp#L1458

{code}
  if (frameworkRole.isSome() && resource.role() != frameworkRole.get()) {
    return Error(
        "A reserve operation was attempted for a resource with role"
        " '" + resource.role() + "', but the framework can only reserve"
        " resources with role '" + frameworkRole.get() + "'");
  }
{code}

With multi-role frameworks, we should instead validate that the reserved
resource's role is the same as the resource's allocation role.


> Reserve operation should validate reserved resource role against resource 
> allocationInfo role
> -
>
> Key: MESOS-6730
> URL: https://issues.apache.org/jira/browse/MESOS-6730
> Project: Mesos
>  Issue Type: Bug
>Reporter: Guangya Liu
>
> During dynamic reservation validation, the current logic makes sure that the
> reserved resource's role is the same as the framework's role:
> https://github.com/apache/mesos/blob/1.1.x/src/master/validation.cpp#L1458
> {code}
>   if (frameworkRole.isSome() && resource.role() != frameworkRole.get()) {
>     return Error(
>         "A reserve operation was attempted for a resource with role"
>         " '" + resource.role() + "', but the framework can only reserve"
>         " resources with role '" + frameworkRole.get() + "'");
>   }
> {code}
> With multi-role frameworks, we should instead validate that the reserved
> resource's role is the same as the resource's allocation role.
> Please make sure to distinguish between dynamic reservations made by a
> framework and those made through the HTTP endpoint. If the reservation was
> triggered by a framework, we need to do this validation. If it was made
> through the HTTP endpoint, there is no need to validate the roles.
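
The proposed check can be sketched as follows. This is only an illustration under simplifying assumptions: `ReserveSource`, `ReservedResource` and `validateReserve` are hypothetical names, a resource is reduced to the two roles that matter here, and an empty string stands for "valid" instead of the `Option<Error>` style used in validation.cpp.

```cpp
#include <string>

// Who issued the RESERVE operation (hypothetical enum for this sketch).
enum class ReserveSource { FRAMEWORK, HTTP_ENDPOINT };

// Simplified resource: only the two roles relevant to the check.
struct ReservedResource
{
  std::string role;            // Role the resource is being reserved for.
  std::string allocationRole;  // Role the resource was allocated to.
};

// Returns an empty string when the operation is valid, otherwise an error
// message.
inline std::string validateReserve(
    const ReservedResource& resource,
    ReserveSource source)
{
  // Reservations made through the HTTP endpoint skip the role check.
  if (source == ReserveSource::HTTP_ENDPOINT) {
    return "";
  }

  // Framework reservations must match the resource's allocation role.
  if (resource.role != resource.allocationRole) {
    return "A reserve operation was attempted for a resource with role '" +
           resource.role + "', but the resource is allocated to role '" +
           resource.allocationRole + "'";
  }

  return "";
}
```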





[jira] [Created] (MESOS-6730) Reserve operation should validate reserved resource role against resource allocationInfo role

2016-12-06 Thread Guangya Liu (JIRA)
Guangya Liu created MESOS-6730:
--

 Summary: Reserve operation should validate reserved resource role 
against resource allocationInfo role
 Key: MESOS-6730
 URL: https://issues.apache.org/jira/browse/MESOS-6730
 Project: Mesos
  Issue Type: Bug
Reporter: Guangya Liu


During dynamic reservation validation, the current logic makes sure that the
reserved resource's role is the same as the framework's role:
https://github.com/apache/mesos/blob/1.1.x/src/master/validation.cpp#L1458

{code}
  if (frameworkRole.isSome() && resource.role() != frameworkRole.get()) {
    return Error(
        "A reserve operation was attempted for a resource with role"
        " '" + resource.role() + "', but the framework can only reserve"
        " resources with role '" + frameworkRole.get() + "'");
  }
{code}

With multi-role frameworks, we should instead validate that the reserved
resource's role is the same as the resource's allocation role.





[jira] [Updated] (MESOS-6685) Update Role::Resources to correctly account for multi-role frameworks

2016-12-03 Thread Guangya Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guangya Liu updated MESOS-6685:
---
Summary: Update Role::Resources to correctly account for multi-role 
frameworks  (was: Update Role::Resources to correctly acount for multi-role 
frameworks)

> Update Role::Resources to correctly account for multi-role frameworks
> -
>
> Key: MESOS-6685
> URL: https://issues.apache.org/jira/browse/MESOS-6685
> Project: Mesos
>  Issue Type: Bug
>Reporter: Guangya Liu
>
> With a single-role framework, when the get role endpoint is called, the
> master reports as the role's resources all of the resources of every
> framework using this role. With a multi-role framework, however, the get
> role endpoint should only return the resources that the framework uses
> under that particular role.
> {code}
>   Resources resources() const
>   {
>     Resources resources;
>     foreachvalue (Framework* framework, frameworks) {
>       resources += framework->totalUsedResources;
>       resources += framework->totalOfferedResources;
>     }
>     return resources;
>   }
> {code}





[jira] [Created] (MESOS-6685) Update Role::Resources to correctly acount for multi-role frameworks

2016-12-03 Thread Guangya Liu (JIRA)
Guangya Liu created MESOS-6685:
--

 Summary: Update Role::Resources to correctly acount for multi-role 
frameworks
 Key: MESOS-6685
 URL: https://issues.apache.org/jira/browse/MESOS-6685
 Project: Mesos
  Issue Type: Bug
Reporter: Guangya Liu


With a single-role framework, when the get role endpoint is called, the
master reports as the role's resources all of the resources of every
framework using this role. With a multi-role framework, however, the get
role endpoint should only return the resources that the framework uses
under that particular role.

{code}
  Resources resources() const
  {
    Resources resources;
    foreachvalue (Framework* framework, frameworks) {
      resources += framework->totalUsedResources;
      resources += framework->totalOfferedResources;
    }

    return resources;
  }
{code}
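
One possible shape for the per-role accounting is sketched below. It assumes each framework tracks its used and offered resources per role, and it collapses `Resources` to a single scalar (say, CPUs) to stay short. `FrameworkUsage` and `roleResources` are hypothetical names for this example, not existing master code.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified framework: resources are collapsed to one scalar
// and tracked per role.
struct FrameworkUsage
{
  std::map<std::string, double> usedByRole;
  std::map<std::string, double> offeredByRole;
};

// Sums only what each framework holds under `role`, instead of the
// framework's totals across all of its roles.
inline double roleResources(
    const std::string& role,
    const std::vector<FrameworkUsage>& frameworks)
{
  double total = 0.0;
  for (const FrameworkUsage& framework : frameworks) {
    std::map<std::string, double>::const_iterator used =
      framework.usedByRole.find(role);
    if (used != framework.usedByRole.end()) {
      total += used->second;
    }

    std::map<std::string, double>::const_iterator offered =
      framework.offeredByRole.find(role);
    if (offered != framework.offeredByRole.end()) {
      total += offered->second;
    }
  }

  return total;
}
```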





[jira] [Created] (MESOS-6684) Update addFramework/removeFramework to handle multi-role frameworks

2016-12-03 Thread Guangya Liu (JIRA)
Guangya Liu created MESOS-6684:
--

 Summary: Update addFramework/removeFramework to handle multi-role 
frameworks
 Key: MESOS-6684
 URL: https://issues.apache.org/jira/browse/MESOS-6684
 Project: Mesos
  Issue Type: Bug
Reporter: Guangya Liu


The master's current add/remove framework logic only handles single-role
frameworks; it should be updated to support multi-role frameworks.

{code}
  if (!activeRoles.contains(role)) {
    activeRoles[role] = new Role();
  }
  activeRoles[role]->addFramework(framework);
{code}

We should update both {{addFramework}} and {{removeFramework}} in master.cpp
so that one framework can be mapped to multiple roles.
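
A minimal sketch of the one-framework-to-many-roles bookkeeping, assuming the framework's roles are available as a list of role names. For brevity `ActiveRoles` maps a role name to a set of framework IDs rather than to `Role` objects; that type and the two free functions are hypothetical and only illustrate the loop over roles.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical bookkeeping: role name -> IDs of frameworks subscribed to it.
typedef std::map<std::string, std::set<std::string> > ActiveRoles;

// Registers the framework under every one of its roles.
inline void addFramework(
    ActiveRoles& activeRoles,
    const std::string& frameworkId,
    const std::vector<std::string>& roles)
{
  for (const std::string& role : roles) {
    // operator[] creates the role entry on first use.
    activeRoles[role].insert(frameworkId);
  }
}

// Unregisters the framework from every one of its roles and drops roles
// that no longer have any framework.
inline void removeFramework(
    ActiveRoles& activeRoles,
    const std::string& frameworkId,
    const std::vector<std::string>& roles)
{
  for (const std::string& role : roles) {
    ActiveRoles::iterator it = activeRoles.find(role);
    if (it == activeRoles.end()) {
      continue;
    }

    it->second.erase(frameworkId);
    if (it->second.empty()) {
      activeRoles.erase(it);
    }
  }
}
```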





[jira] [Created] (MESOS-6630) Add some benchmark test for quota allocation

2016-11-22 Thread Guangya Liu (JIRA)
Guangya Liu created MESOS-6630:
--

 Summary: Add some benchmark test for quota allocation
 Key: MESOS-6630
 URL: https://issues.apache.org/jira/browse/MESOS-6630
 Project: Mesos
  Issue Type: Bug
Reporter: Guangya Liu


After making a minor allocator performance update in
https://reviews.apache.org/r/53929/ , I found that we have no benchmark tests
for quota allocation. We should add some benchmark tests for such cases.





[jira] [Updated] (MESOS-6600) Add priority tiers to support multi-tenancy

2016-11-17 Thread Guangya Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guangya Liu updated MESOS-6600:
---
Description: 
A tier is a kind of priority level; it includes a type and a priority level.
The type can be either quota or fair share. For now, the main reason to have
`Tier` is to define resource allocations with priority. One example is
`Quota`: if the quotas add up to more than the total resources in the
cluster, the `Tier` logic can make sure that quota in a high-priority tier
gets allocations first. Quota in a high-priority tier can also preempt
resources from preemptable quota (a new concept for quota that is still under
discussion). With `Tier`, we can also enable tasks with priority (task
priority is based on the priority of the task's resources), so that
high-priority tasks can preempt resources from low-priority tasks.

Current design document: 
https://docs.google.com/document/d/1bPHREn1AfUQIAGwZUS7yLFZLW6ycO2WbmEzW6o92NV0/edit

  was:
TBD

Current design document: 
https://docs.google.com/document/d/1bPHREn1AfUQIAGwZUS7yLFZLW6ycO2WbmEzW6o92NV0/edit


> Add priority tiers to support multi-tenancy
> ---
>
> Key: MESOS-6600
> URL: https://issues.apache.org/jira/browse/MESOS-6600
> Project: Mesos
>  Issue Type: Epic
>Reporter: Benjamin Hindman
>  Labels: multi-tenancy
>
> A tier is a kind of priority level; it includes a type and a priority level. 
> The type can be either quota or fair share. For now, the main reason to have 
> `Tier` is to define resource allocations with priority. One example is 
> `Quota`: if we have more quota than total resources in the cluster, the 
> `Tier` logic ensures that high-priority tier quota gets allocations first. 
> High-priority tier quota can also preempt resources from preemptable quota 
> (a new concept for quota that is still under discussion). With `Tier`, we can 
> also enable tasks with priority (task priority is based on the resource 
> priority), so that high-priority tasks can preempt resources from 
> low-priority tasks.
> Current design document: 
> https://docs.google.com/document/d/1bPHREn1AfUQIAGwZUS7yLFZLW6ycO2WbmEzW6o92NV0/edit





[jira] [Updated] (MESOS-4766) Improve allocator performance.

2016-10-12 Thread Guangya Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guangya Liu updated MESOS-4766:
---
Target Version/s: 1.2.0

> Improve allocator performance.
> --
>
> Key: MESOS-4766
> URL: https://issues.apache.org/jira/browse/MESOS-4766
> Project: Mesos
>  Issue Type: Epic
>  Components: allocation
>Reporter: Benjamin Mahler
>Assignee: Michael Park
>Priority: Critical
>
> This is an epic to track the various tickets around improving the performance 
> of the allocator, including the following:
> * Preventing unnecessary backup of the allocator.
> * Reducing the cost of allocations and allocator state updates.
> * Improving performance of the DRF sorter.
> * More benchmarking to simulate scenarios with performance issues.





[jira] [Comment Edited] (MESOS-5898) Make resources benchmark test for ports -=/- more accurate

2016-10-11 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15391903#comment-15391903
 ] 

Guangya Liu edited comment on MESOS-5898 at 10/12/16 3:32 AM:
--

https://reviews.apache.org/r/50380 Added new benchmark test for port resources.
https://reviews.apache.org/r/52769/ Removed ports ranges benchmark test from 
scalar benchmark test.


was (Author: gyliu):
https://reviews.apache.org/r/50380/

> Make resources benchmark test for ports -=/- more accurate
> --
>
> Key: MESOS-5898
> URL: https://issues.apache.org/jira/browse/MESOS-5898
> Project: Mesos
>  Issue Type: Bug
>Reporter: Guangya Liu
>Assignee: Guangya Liu
>
> When I run the benchmark test for port resources, I get the following result: 
> `-=` and `-` consumed only about 10ms, which does not reflect the real cost 
> of performing 1000 port operations with `-=` and `-`.
> The root cause is that the current calculation always uses the same port 
> range. With ports, the formula for `+` is {{a+a+a+a+...+a==a}}; for `-`, it 
> is {{a-a=0}} followed by {{0-a=0}}. 
> With {{0-a=0}}, the code here 
> https://github.com/apache/mesos/blob/master/src/common/values.cpp#L544 
> performs no validation because {{left}} is empty.
> {code}
> ./bin/mesos-tests.sh --benchmark 
> --gtest_filter="*Resources_BENCHMARK_Test.Arithmetic/2"
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from ResourcesOperators/Resources_BENCHMARK_Test
> [ RUN  ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2
> Took 3.219217secs to perform 1000 'total += r' operations on ports(*):[1-2, 
> 4-5, 7-8, 10-11, 13-14, 16-17, 1...
> Took 10207us to perform 1000 'total -= r' operations on ports(*):[1-2, 4-5, 
> 7-8, 10-11, 13-14, 16-17, 1...
> Took 3.515383secs to perform 1000 'total = total + r' operations on 
> ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1...
> Took 10208us to perform 1000 'total = total - r' operations on ports(*):[1-2, 
> 4-5, 7-8, 10-11, 13-14, 16-17, 1...
> [   OK ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2 (6759 
> ms)
> [--] 1 test from ResourcesOperators/Resources_BENCHMARK_Test (6759 ms 
> total)
> [--] Global test environment tear-down
> [==] 1 test from 1 test case ran. (6801 ms total)
> [  PASSED  ] 1 test.
> {code}
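
The arithmetic in the description above can be illustrated with a small sketch. It is not the Mesos {{Value::Ranges}} implementation; it assumes, purely for illustration, that a ports resource behaves like a set of port numbers, and the names {{Ports}}, {{add}} and {{subtract}} are hypothetical:

```cpp
#include <cassert>
#include <set>

// Hypothetical model: a ports resource as a plain set of port numbers.
using Ports = std::set<int>;

// Set union: adding the same ports again does not change the total.
Ports add(Ports left, const Ports& right)
{
  left.insert(right.begin(), right.end());
  return left;
}

// Set difference: once `left` is empty, every erase is a no-op.
Ports subtract(Ports left, const Ports& right)
{
  for (int port : right) {
    left.erase(port);
  }
  return left;
}
```

Under this model, after the first {{total -= r}} the total is already empty, so the remaining 999 iterations operate on an empty value. That matches the reported timings, where subtraction looks far cheaper than addition.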





[jira] [Updated] (MESOS-5700) Add benchmark test for Resource class

2016-10-11 Thread Guangya Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guangya Liu updated MESOS-5700:
---
Summary: Add benchmark test for Resource class  (was: Add Bbenchmark test 
for Resource class)

> Add benchmark test for Resource class
> -
>
> Key: MESOS-5700
> URL: https://issues.apache.org/jira/browse/MESOS-5700
> Project: Mesos
>  Issue Type: Bug
>Reporter: Klaus Ma
>Assignee: Klaus Ma
> Attachments: hashmap.diff, name_roleId.diff, port.perf.log, 
> reservation.perf.log
>
>
> Add benchmark of Resource class for Allocation Performance.





[jira] [Updated] (MESOS-5700) Add Bbenchmark test for Resource class

2016-10-11 Thread Guangya Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guangya Liu updated MESOS-5700:
---
Summary: Add Bbenchmark test for Resource class  (was: Benchmark for 
Resource class)

> Add Bbenchmark test for Resource class
> --
>
> Key: MESOS-5700
> URL: https://issues.apache.org/jira/browse/MESOS-5700
> Project: Mesos
>  Issue Type: Bug
>Reporter: Klaus Ma
>Assignee: Klaus Ma
> Attachments: hashmap.diff, name_roleId.diff, port.perf.log, 
> reservation.perf.log
>
>
> Add benchmark of Resource class for Allocation Performance.





[jira] [Updated] (MESOS-6308) CHECK failure in DRF sorter.

2016-10-11 Thread Guangya Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guangya Liu updated MESOS-6308:
---
Target Version/s: 1.1.0

> CHECK failure in DRF sorter.
> 
>
> Key: MESOS-6308
> URL: https://issues.apache.org/jira/browse/MESOS-6308
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jie Yu
>Assignee: Guangya Liu
>
> Saw this CHECK fail in our internal CI:
> https://github.com/apache/mesos/blob/master/src/master/allocator/sorter/drf/sorter.cpp#L450
> {noformat}
> [03:08:28] :   [Step 10/10] [ RUN  ] PartitionTest.DisconnectedFramework
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.200443   577 cluster.cpp:158] 
> Creating default 'local' authorizer
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.206408   577 leveldb.cpp:174] 
> Opened db in 5.827159ms
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208127   577 leveldb.cpp:181] 
> Compacted db in 1.697508ms
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208150   577 leveldb.cpp:196] 
> Created db iterator in 5756ns
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208160   577 leveldb.cpp:202] 
> Seeked to beginning of db in 1483ns
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208168   577 leveldb.cpp:271] 
> Iterated through 0 keys in the db in 1101ns
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208184   577 replica.cpp:776] 
> Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208452   591 recover.cpp:451] 
> Starting replica recovery
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208664   596 recover.cpp:477] 
> Replica is in EMPTY status
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209079   591 replica.cpp:673] 
> Replica in EMPTY status received a broadcasted recover request from 
> __req_res__(3666)@172.30.2.234:37300
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209203   593 recover.cpp:197] 
> Received a recover response from a replica in EMPTY status
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209394   598 recover.cpp:568] 
> Updating replica status to STARTING
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209473   598 master.cpp:380] 
> Master dd11d4ad-2087-4324-99ef-873e83ff09a1 (ip-172-30-2-234.mesosphere.io) 
> started on 172.30.2.234:37300
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209489   598 master.cpp:382] Flags 
> at startup: --acls="" --agent_ping_timeout="15secs" 
> --agent_reregister_timeout="10mins" --allocation_interval="1secs" 
> --allocator="HierarchicalDRF" --authenticate_agents="true" 
> --authenticate_frameworks="true" --authenticate_http_frameworks="true" 
> --authenticate_http_readonly="true" --authenticate_http_readwrite="true" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/7rr0oB/credentials" --framework_sorter="drf" 
> --help="false" --hostname_lookup="true" --http_authenticators="basic" 
> --http_framework_authenticators="basic" --initialize_driver_logging="true" 
> --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
> --max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
> --max_completed_tasks_per_framework="1000" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="replicated_log" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/7rr0oB/master" 
> --zk_session_timeout="10secs"
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209692   598 master.cpp:432] 
> Master only allowing authenticated frameworks to register
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209699   598 master.cpp:446] 
> Master only allowing authenticated agents to register
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209704   598 master.cpp:459] 
> Master only allowing authenticated HTTP frameworks to register
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209709   598 credentials.hpp:37] 
> Loading credentials for authentication from '/tmp/7rr0oB/credentials'
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209810   598 master.cpp:504] Using 
> default 'crammd5' authenticator
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209853   598 http.cpp:883] Using 
> default 'basic' HTTP authenticator for realm 'mesos-master-readonly'
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209897   598 http.cpp:883] Using 
> default 'basic' HTTP authenticator for realm 'mesos-master-readwrite'
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209940   598 http.cpp:883] Using 
> default 'basic' HTTP authenticator for realm 'mesos-master-scheduler'
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209962   598 master.cpp:584] 
> Authorization enabled
> [03:08:28]W:   [Step 10/10] I1004 

[jira] [Commented] (MESOS-4694) DRFAllocator takes very long to allocate resources with a large number of frameworks

2016-10-11 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566916#comment-15566916
 ] 

Guangya Liu commented on MESOS-4694:


[~bmahler] can we mark this as RESOLVED?

> DRFAllocator takes very long to allocate resources with a large number of 
> frameworks
> 
>
> Key: MESOS-4694
> URL: https://issues.apache.org/jira/browse/MESOS-4694
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Affects Versions: 0.26.0, 0.27.0, 0.27.1, 0.27.2, 0.28.0, 0.28.1
>Reporter: Dario Rexin
>Assignee: Dario Rexin
>
> With a growing number of connected frameworks, the allocation time grows to 
> very high numbers. The addition of quota in 0.27 had an additional impact on 
> these numbers. Running `mesos-tests.sh --benchmark 
> --gtest_filter=HierarchicalAllocator_BENCHMARK_Test.DeclineOffers` gives us 
> the following numbers:
> {noformat}
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from HierarchicalAllocator_BENCHMARK_Test
> [ RUN  ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
> Using 2000 slaves and 200 frameworks
> round 0 allocate took 2.921202secs to make 200 offers
> round 1 allocate took 2.85045secs to make 200 offers
> round 2 allocate took 2.823768secs to make 200 offers
> {noformat}
> Increasing the number of frameworks to 2000:
> {noformat}
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from HierarchicalAllocator_BENCHMARK_Test
> [ RUN  ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
> Using 2000 slaves and 2000 frameworks
> round 0 allocate took 28.209454secs to make 2000 offers
> round 1 allocate took 28.469419secs to make 2000 offers
> round 2 allocate took 28.138086secs to make 2000 offers
> {noformat}
> I was able to reduce this time by a substantial amount. After applying the 
> patches:
> {noformat}
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from HierarchicalAllocator_BENCHMARK_Test
> [ RUN  ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
> Using 2000 slaves and 200 frameworks
> round 0 allocate took 1.016226secs to make 2000 offers
> round 1 allocate took 1.102729secs to make 2000 offers
> round 2 allocate took 1.102624secs to make 2000 offers
> {noformat}
> And with 2000 frameworks:
> {noformat}
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from HierarchicalAllocator_BENCHMARK_Test
> [ RUN  ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
> Using 2000 slaves and 2000 frameworks
> round 0 allocate took 12.563203secs to make 2000 offers
> round 1 allocate took 12.437517secs to make 2000 offers
> round 2 allocate took 12.470708secs to make 2000 offers
> {noformat}
> The patches do 3 things to improve the performance of the allocator.
> 1) The total values in the DRFSorter are precalculated per resource type.
> 2) In the allocate method, when no resources are available to allocate, we 
> break out of the innermost loop to avoid looping over a large number of 
> frameworks when we have nothing to allocate.
> 3) When a framework suppresses offers, we remove it from the sorter instead 
> of just calling continue in the allocation loop. This greatly improves 
> performance in the sorter and avoids looping over frameworks that don't 
> need resources.
> Assuming that most frameworks behave nicely and suppress offers when 
> they have nothing to schedule, it is fair to assume that point 3) has the 
> biggest impact on performance. If we suppress offers for 90% of the 
> frameworks in the benchmark test, we see the following numbers:
> {noformat}
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from HierarchicalAllocator_BENCHMARK_Test
> [ RUN  ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
> Using 200 slaves and 2000 frameworks
> round 0 allocate took 11626us to make 200 offers
> round 1 allocate took 22890us to make 200 offers
> round 2 allocate took 21346us to make 200 offers
> {noformat}
> And for 200 frameworks:
> {noformat}
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from HierarchicalAllocator_BENCHMARK_Test
> [ RUN  ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
> Using 2000 slaves and 2000 frameworks
> round 0 allocate took 1.11178secs to make 2000 offers
> round 1 allocate took 1.062649secs to make 2000 offers
> round 2 allocate took 1.080181secs to make 2000 offers
> {noformat}
> Review requests:
> 

[jira] [Comment Edited] (MESOS-5967) Add support for 'docker image inspect' in our docker abstraction.

2016-10-11 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564767#comment-15564767
 ] 

Guangya Liu edited comment on MESOS-5967 at 10/11/16 7:34 AM:
--

https://reviews.apache.org/r/52727/ Added `Labels` to docker image.
https://reviews.apache.org/r/52666/ Added support for `docker inspect image` in 
docker containerizer.
https://reviews.apache.org/r/52728/ Renamed `inspect` to `inspectContainer`.


was (Author: gyliu):
{code}
https://reviews.apache.org/r/52727/ Added `Labels` to docker image.
https://reviews.apache.org/r/52666/ Added support for `docker inspect image` in 
docker containerizer.
https://reviews.apache.org/r/52728/ Renamed `inspect` to `inspectContainer`.
{code}

> Add support for 'docker image inspect' in our docker abstraction.
> -
>
> Key: MESOS-5967
> URL: https://issues.apache.org/jira/browse/MESOS-5967
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Kevin Klues
>Assignee: Guangya Liu
>  Labels: gpu
>
> Docker's command line tool for {{docker inspect}} can take either a 
> {{container}}, an {{image}}, or a {{task}} as its argument, and return a JSON 
> array containing low-level information about that container, image or task. 
> However, the current {{docker inspect}} support in our docker abstraction 
> only supports inspecting containers (not images or tasks).  We should expand 
> this to (at least) support images.
> In particular, this additional functionality is motivated by the upcoming GPU 
> support, which needs to inspect the labels in a docker image to decide if it 
> should inject the required Nvidia volumes into a container.  





[jira] [Commented] (MESOS-6308) CHECK failure in DRF sorter.

2016-10-10 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15561571#comment-15561571
 ] 

Guangya Liu commented on MESOS-6308:


This is a race between framework teardown and the calculation of the dominant 
share metrics. It can happen with the following sequence in 
{{PartitionTest.DisconnectedFramework}}:

1) {{driver.stop()}} sends a {{TEARDOWN}} request to the master, which 
triggers {{removeFramework}} in the allocator. 

2) {{PartitionTest.DisconnectedFramework}} constructs the {{Metrics}} object 
with the following code. At this point the default role has not been removed 
yet, and {{calculateShare}} is queued to calculate the dominant share for the 
default role.

{code}
JSON::Object stats = Metrics();
{code}

3) Framework removal in the master continues and calls the allocator to remove 
the framework. Because there is only one framework under the star role, 
{{removeFramework}} calls {{roleSorter->remove(role);}} to remove the role 
and its related allocations. This is where the race occurs; take a look at 
{{remove}} in sorter.cpp.

{code}
void DRFSorter::remove(const string& name)
{
  set::iterator it = find(name);

  if (it != clients.end()) {
    clients.erase(it);
  }

  allocations.erase(name);
  weights.erase(name);

  // `calculateShare` was triggered here to calculate the dominant share for
  // the default role, but by this time the default role's allocation had
  // already been removed, so `calculateShare` reports a CHECK failure. This
  // is very difficult to reproduce; I had to run the test for more than
  // 1 hour to reproduce it.

  if (metrics.isSome()) {
    metrics->remove(name);
  }
}
{code}
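
The ordering problem described above can be modelled in a single thread. The following is only a sketch: {{Sorter}} and {{Dispatcher}} are hypothetical stand-ins (not the libprocess or DRFSorter classes), and the queued callback merely records whether the client still exists instead of hitting a CHECK:

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <queue>
#include <string>
#include <utility>

// Hypothetical stand-in for the sorter: only tracks per-client allocations.
struct Sorter
{
  std::map<std::string, double> allocations;

  bool contains(const std::string& name) const
  {
    return allocations.count(name) > 0;
  }

  void remove(const std::string& name) { allocations.erase(name); }
};

// Hypothetical stand-in for a dispatch queue: callbacks run later, in order.
struct Dispatcher
{
  std::queue<std::function<void()>> pending;

  void dispatch(std::function<void()> f) { pending.push(std::move(f)); }

  void runAll()
  {
    while (!pending.empty()) {
      pending.front()();
      pending.pop();
    }
  }
};

// Queues a share calculation for the default role, optionally removes the
// role before the queue is drained, and reports whether the role still
// existed when the queued calculation finally ran.
bool clientPresentWhenShareCalculated(bool removeBeforeRun)
{
  Sorter sorter;
  sorter.allocations["*"] = 1.0;

  Dispatcher dispatcher;
  bool present = false;

  dispatcher.dispatch([&]() { present = sorter.contains("*"); });

  if (removeBeforeRun) {
    sorter.remove("*");
  }

  dispatcher.runAll();
  return present;
}
```

With {{removeBeforeRun}} set, the queued calculation runs against a client that no longer exists, which is the state in which the real {{calculateShare}} hits {{CHECK(contains(name))}}.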

I modified {{DRFSorter::remove(const string& name)}} by adding 
an {{os::sleep(Seconds(1));}} between {{allocations.erase(name);}} and 
{{metrics->remove(name);}}, as follows:

{code}
--- a/src/master/allocator/sorter/drf/sorter.cpp
+++ b/src/master/allocator/sorter/drf/sorter.cpp
@@ -29,6 +29,8 @@
 #include 
 #include 

+#include 
+
 #include "logging/logging.hpp"

 #include "master/allocator/sorter/drf/sorter.hpp"
@@ -110,6 +112,8 @@ void DRFSorter::remove(const string& name)
   allocations.erase(name);
   weights.erase(name);

+  os::sleep(Seconds(1));
+
   if (metrics.isSome()) {
 metrics->remove(name);
   }
{code}

After that, re-running the test {{PartitionTest.DisconnectedFramework}} fails 
every time with a CHECK failure at the same place:
{code}
[ RUN  ] PartitionTest.DisconnectedFramework
I1010 15:43:46.022001 257765376 exec.cpp:162] Version: 1.1.0
I1010 15:43:46.025545 259375104 exec.cpp:237] Executor registered on agent 
5bc85014-3fab-459f-9d85-8b47a06e27d0-S0
Received SUBSCRIBED event
Subscribed executor on 192.168.56.1
Received LAUNCH event
Starting task 51f3e50a-e561-407b-8ee4-65f163d65bd7
/Users/gyliu/git/mesos/build/src/mesos-containerizer launch 
--command="{"shell":true,"value":"sleep 60"}" --help="false"
Forked command at 93007
F1010 15:43:50.094323 407199744 sorter.cpp:454] Check failed: contains(name)
*** Check failure stack trace: ***
@0x1119b91ca  google::LogMessage::Fail()
@0x1119b8157  google::LogMessage::SendToLog()
@0x1119b8e7a  google::LogMessage::Flush()
@0x1119bfce8  google::LogMessageFatal::~LogMessageFatal()
@0x1119b9605  google::LogMessageFatal::~LogMessageFatal()
@0x10fa0bd18  
mesos::internal::master::allocator::DRFSorter::calculateShare()
@0x10fa05c5e  
mesos::internal::master::allocator::Metrics::add()::$_0::operator()()
@0x10fa09232  
_ZZN7process8internal8DispatchIdEclIRKZN5mesos8internal6master9allocator7Metrics3addERKNSt3__112basic_stringIcNS9_11char_traitsIcEENS9_9allocatorIcE3$_0EENS_6FutureIdEERKNS_4UPIDEOT_ENKUlPNS_11ProcessBaseEE_clEST_
@0x10fa091f0  
_ZNSt3__128__invoke_void_return_wrapperIvE6__callIJRZN7process8internal8DispatchIdEclIRKZN5mesos8internal6master9allocator7Metrics3addERKNS_12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcE3$_0EENS3_6FutureIdEERKNS3_4UPIDEOT_EUlPNS3_11ProcessBaseEE_SW_EEEvDpOT_
@0x10fa08e9c  
_ZNSt3__110__function6__funcIZN7process8internal8DispatchIdEclIRKZN5mesos8internal6master9allocator7Metrics3addERKNS_12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcE3$_0EENS2_6FutureIdEERKNS2_4UPIDEOT_EUlPNS2_11ProcessBaseEE_NSF_ISW_EEFvSV_EEclEOSV_
@0x111897acf  std::__1::function<>::operator()()
@0x1118684ff  process::ProcessBase::visit()
@0x1118cc18e  process::DispatchEvent::visit()
@0x109a75431  process::ProcessBase::serve()
@0x1118651d1  process::ProcessManager::resume()
@0x111870cc6  
process::ProcessManager::init_threads()::$_1::operator()()
@0x111870969  
_ZNSt3__114__thread_proxyINS_5tupleIJZN7process14ProcessManager12init_threadsEvE3$_1EPvS6_
@ 

[jira] [Commented] (MESOS-5967) Add support for 'docker image inspect' in our docker abstraction

2016-10-09 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15561293#comment-15561293
 ] 

Guangya Liu commented on MESOS-5967:


There are two solutions for this:

Solution 1): Add a new function named {{inspectImage}} and rename the 
current function {{inspect}} to {{inspectContainer}}. The only issue is that 
this does not match the docker API, which uses {{inspect}} for 
both containers and images.

Solution 2): Use a template to handle this:

{code}
template <typename T>
process::Future<T> inspect(
  const std::string& containerName,
  const Option<Duration>& retryInterval = None()) const;
{code}

Please note that I am not using {{virtual}} above because member function 
templates cannot be {{virtual}}, so {{virtual}} has to be removed here.

Then I can define the `Container` and `Image` inspect specializations as follows: 

{code}
template <>
Future<Docker::Container> Docker::inspect(
    const string& containerName,
    const Option<Duration>& retryInterval) const
{
  ...
}

template <>
Future<Docker::Image> Docker::inspect(
    const string& imageName,
    const Option<Duration>& retryInterval) const
{
  ...
}
{code}

For the caller, inspecting a container will be:

{code}
docker->inspect<Docker::Container>(...);
{code}

and inspecting an image will be:

{code}
docker->inspect<Docker::Image>(...);
{code}

I think solution 1) is simpler, since solution 2) needs to remove {{virtual}} 
from {{inspect}}. That has no functional impact, but it makes the code 
inconsistent. [~bmahler] [~klueska] any comments? Thanks.
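
To make the trade-off between the two solutions concrete, here is a self-contained sketch. It is not the real {{Docker}} class: {{DockerNamed}}, {{DockerTemplated}}, {{Container}} and {{Image}} are hypothetical, and the functions return plain values instead of {{process::Future}} and omit {{retryInterval}}:

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified result types; the real ones carry parsed JSON.
struct Container { std::string name; };
struct Image { std::string name; };

// Solution 1: two distinctly named functions, both of which can stay virtual
// (useful when tests override them with mocks).
class DockerNamed
{
public:
  virtual ~DockerNamed() = default;

  virtual Container inspectContainer(const std::string& containerName) const
  {
    return Container{containerName};
  }

  virtual Image inspectImage(const std::string& imageName) const
  {
    return Image{imageName};
  }
};

// Solution 2: one member function template with explicit specializations.
// A member function template cannot be declared virtual.
class DockerTemplated
{
public:
  template <typename T>
  T inspect(const std::string& name) const;
};

template <>
Container DockerTemplated::inspect<Container>(const std::string& name) const
{
  return Container{name};
}

template <>
Image DockerTemplated::inspect<Image>(const std::string& name) const
{
  return Image{name};
}
```

With the templated variant the caller must spell out the result type, for example {{inspect<Image>("ubuntu")}}, whereas the named variant keeps both functions overridable.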

> Add support for 'docker image inspect' in our docker abstraction
> 
>
> Key: MESOS-5967
> URL: https://issues.apache.org/jira/browse/MESOS-5967
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Kevin Klues
>Assignee: Guangya Liu
>  Labels: gpu
> Fix For: 1.1.0
>
>
> Docker's command line tool for {{docker inspect}} can take either a 
> {{container}}, an {{image}}, or a {{task}} as its argument, and return a JSON 
> array containing low-level information about that container, image or task. 
> However, the current {{docker inspect}} support in our docker abstraction 
> only supports inspecting containers (not images or tasks).  We should expand 
> this to (at least) support images.
> In particular, this additional functionality is motivated by the upcoming GPU 
> support, which needs to inspect the labels in a docker image to decide if it 
> should inject the required Nvidia volumes into a container.  





[jira] [Commented] (MESOS-6308) CHECK failure in DRF sorter.

2016-10-07 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15556594#comment-15556594
 ] 

Guangya Liu commented on MESOS-6308:


Thanks [~bbannier], I reproduced this issue again after running the test for 
almost 1 hour and found that it fails as follows when adding metrics:

{code}
F1007 18:22:39.125012 255385600 sorter.cpp:458] Check failed: contains(name)
*** Check failure stack trace: ***
@0x108b7afda  google::LogMessage::Fail()
@0x108b79f67  google::LogMessage::SendToLog()
@0x108b7ac8a  google::LogMessage::Flush()
@0x108b81af8  google::LogMessageFatal::~LogMessageFatal()
@0x108b7b415  google::LogMessageFatal::~LogMessageFatal()
@0x106bcd4d5  
mesos::internal::master::allocator::DRFSorter::calculateShare()
@0x106bc710e  
mesos::internal::master::allocator::Metrics::add()::$_0::operator()()
@0x106bca6e2  
_ZZN7process8internal8DispatchIdEclIRKZN5mesos8internal6master9allocator7Metrics3addERKNSt3__112basic_stringIcNS9_11char_traitsIcEENS9_9allocatorIcE3$_0EENS_6FutureIdEERKNS_4UPIDEOT_ENKUlPNS_11ProcessBaseEE_clEST_
@0x106bca6a0  
_ZNSt3__128__invoke_void_return_wrapperIvE6__callIJRZN7process8internal8DispatchIdEclIRKZN5mesos8internal6master9allocator7Metrics3addERKNS_12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcE3$_0EENS3_6FutureIdEERKNS3_4UPIDEOT_EUlPNS3_11ProcessBaseEE_SW_EEEvDpOT_
@0x106bca34c  
_ZNSt3__110__function6__funcIZN7process8internal8DispatchIdEclIRKZN5mesos8internal6master9allocator7Metrics3addERKNS_12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcE3$_0EENS2_6FutureIdEERKNS2_4UPIDEOT_EUlPNS2_11ProcessBaseEE_NSF_ISW_EEFvSV_EEclEOSV_
@0x108a598df  std::__1::function<>::operator()()
@0x108a2a30f  process::ProcessBase::visit()
@0x108a8df9e  process::DispatchEvent::visit()
@0x100c65c51  process::ProcessBase::serve()
@0x108a26fe1  process::ProcessManager::resume()
@0x108a32ad6  
process::ProcessManager::init_threads()::$_1::operator()()
@0x108a32779  
_ZNSt3__114__thread_proxyINS_5tupleIJZN7process14ProcessManager12init_threadsEvE3$_1EPvS6_
@ 0x7fff957a405a  _pthread_body
@ 0x7fff957a3fd7  _pthread_start
@ 0x7fff957a13ed  thread_start
E1007 18:23:06.083991 317579264 process.cpp:2154] Failed to shutdown socket 
with fd 15: Socket is not connected
Abort trap: 6
{code}

I will check further whether there is any case in which we add metrics for a 
nonexistent client. [~bbannier], please share your comments if you have any. Thanks.


> CHECK failure in DRF sorter.
> 
>
> Key: MESOS-6308
> URL: https://issues.apache.org/jira/browse/MESOS-6308
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jie Yu
>Assignee: Guangya Liu
>
> Saw this CHECK fail in our internal CI:
> https://github.com/apache/mesos/blob/master/src/master/allocator/sorter/drf/sorter.cpp#L450
> {noformat}
> [03:08:28] :   [Step 10/10] [ RUN  ] PartitionTest.DisconnectedFramework
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.200443   577 cluster.cpp:158] 
> Creating default 'local' authorizer
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.206408   577 leveldb.cpp:174] 
> Opened db in 5.827159ms
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208127   577 leveldb.cpp:181] 
> Compacted db in 1.697508ms
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208150   577 leveldb.cpp:196] 
> Created db iterator in 5756ns
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208160   577 leveldb.cpp:202] 
> Seeked to beginning of db in 1483ns
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208168   577 leveldb.cpp:271] 
> Iterated through 0 keys in the db in 1101ns
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208184   577 replica.cpp:776] 
> Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208452   591 recover.cpp:451] 
> Starting replica recovery
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208664   596 recover.cpp:477] 
> Replica is in EMPTY status
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209079   591 replica.cpp:673] 
> Replica in EMPTY status received a broadcasted recover request from 
> __req_res__(3666)@172.30.2.234:37300
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209203   593 recover.cpp:197] 
> Received a recover response from a replica in EMPTY status
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209394   598 recover.cpp:568] 
> Updating replica status to STARTING
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209473   598 master.cpp:380] 
> Master dd11d4ad-2087-4324-99ef-873e83ff09a1 (ip-172-30-2-234.mesosphere.io) 
> started on 172.30.2.234:37300
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209489   598 master.cpp:382] Flags 
> at startup: --acls="" 

[jira] [Assigned] (MESOS-6308) CHECK failure in DRF sorter.

2016-10-05 Thread Guangya Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guangya Liu reassigned MESOS-6308:
--

Assignee: Guangya Liu

> CHECK failure in DRF sorter.
> 
>
> Key: MESOS-6308
> URL: https://issues.apache.org/jira/browse/MESOS-6308
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jie Yu
>Assignee: Guangya Liu
>
> Saw this CHECK fail in our internal CI:
> https://github.com/apache/mesos/blob/master/src/master/allocator/sorter/drf/sorter.cpp#L450
> {noformat}
> [03:08:28] :   [Step 10/10] [ RUN  ] PartitionTest.DisconnectedFramework
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.200443   577 cluster.cpp:158] 
> Creating default 'local' authorizer
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.206408   577 leveldb.cpp:174] 
> Opened db in 5.827159ms
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208127   577 leveldb.cpp:181] 
> Compacted db in 1.697508ms
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208150   577 leveldb.cpp:196] 
> Created db iterator in 5756ns
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208160   577 leveldb.cpp:202] 
> Seeked to beginning of db in 1483ns
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208168   577 leveldb.cpp:271] 
> Iterated through 0 keys in the db in 1101ns
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208184   577 replica.cpp:776] 
> Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208452   591 recover.cpp:451] 
> Starting replica recovery
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208664   596 recover.cpp:477] 
> Replica is in EMPTY status
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209079   591 replica.cpp:673] 
> Replica in EMPTY status received a broadcasted recover request from 
> __req_res__(3666)@172.30.2.234:37300
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209203   593 recover.cpp:197] 
> Received a recover response from a replica in EMPTY status
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209394   598 recover.cpp:568] 
> Updating replica status to STARTING
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209473   598 master.cpp:380] 
> Master dd11d4ad-2087-4324-99ef-873e83ff09a1 (ip-172-30-2-234.mesosphere.io) 
> started on 172.30.2.234:37300
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209489   598 master.cpp:382] Flags 
> at startup: --acls="" --agent_ping_timeout="15secs" 
> --agent_reregister_timeout="10mins" --allocation_interval="1secs" 
> --allocator="HierarchicalDRF" --authenticate_agents="true" 
> --authenticate_frameworks="true" --authenticate_http_frameworks="true" 
> --authenticate_http_readonly="true" --authenticate_http_readwrite="true" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/7rr0oB/credentials" --framework_sorter="drf" 
> --help="false" --hostname_lookup="true" --http_authenticators="basic" 
> --http_framework_authenticators="basic" --initialize_driver_logging="true" 
> --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
> --max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
> --max_completed_tasks_per_framework="1000" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="replicated_log" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/7rr0oB/master" 
> --zk_session_timeout="10secs"
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209692   598 master.cpp:432] 
> Master only allowing authenticated frameworks to register
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209699   598 master.cpp:446] 
> Master only allowing authenticated agents to register
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209704   598 master.cpp:459] 
> Master only allowing authenticated HTTP frameworks to register
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209709   598 credentials.hpp:37] 
> Loading credentials for authentication from '/tmp/7rr0oB/credentials'
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209810   598 master.cpp:504] Using 
> default 'crammd5' authenticator
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209853   598 http.cpp:883] Using 
> default 'basic' HTTP authenticator for realm 'mesos-master-readonly'
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209897   598 http.cpp:883] Using 
> default 'basic' HTTP authenticator for realm 'mesos-master-readwrite'
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209940   598 http.cpp:883] Using 
> default 'basic' HTTP authenticator for realm 'mesos-master-scheduler'
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209962   598 master.cpp:584] 
> Authorization enabled
> [03:08:28]W:   [Step 10/10] I1004 

[jira] [Commented] (MESOS-6308) CHECK failure in DRF sorter.

2016-10-05 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15550554#comment-15550554
 ] 

Guangya Liu commented on MESOS-6308:


I tried to reproduce this issue but had no luck, even with 
{{--gtest_repeat=100}}. I will try increasing the workload as you suggested to 
see if I can reproduce it.

> CHECK failure in DRF sorter.
> 
>
> Key: MESOS-6308
> URL: https://issues.apache.org/jira/browse/MESOS-6308
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jie Yu
>
> Saw this CHECK fail in our internal CI:
> https://github.com/apache/mesos/blob/master/src/master/allocator/sorter/drf/sorter.cpp#L450

[jira] [Created] (MESOS-6317) Race in master update slave.

2016-10-05 Thread Guangya Liu (JIRA)
Guangya Liu created MESOS-6317:
--

 Summary: Race in master update slave.
 Key: MESOS-6317
 URL: https://issues.apache.org/jira/browse/MESOS-6317
 Project: Mesos
  Issue Type: Bug
Reporter: Guangya Liu
Assignee: Guangya Liu


Currently, when the master handles {{updateSlave}}, it first rescinds offers 
and then calls updateSlave in the allocator. There is a race here: a batch 
allocation might be inserted between the two. In that case the order is 
rescind offer -> batch allocation -> update slave, which causes problems when 
the oversubscribed resources are decreased.

Suppose the oversubscribed resources are decreased from 2 to 1. After the 
rescind offer finishes, the batch allocation allocates the old 2 
oversubscribed resources again, and then update slave sets the total 
oversubscribed resources to 1. The agent host is then overcommitted for some 
time, because the tasks can still use 2 oversubscribed resources rather than 
1. Once the tasks using the 2 oversubscribed resources finish, everything goes 
back to normal.

So we should adjust the order of rescind offer and updateSlave in the master 
to avoid resource overcommit.

If we update the slave first and then rescind offers, the order is update 
slave -> batch allocation -> rescind offer. This order causes no problem when 
decreasing resources. Suppose the oversubscribed resources are decreased from 
2 to 1. Update slave sets the total oversubscribed resources to 1 directly, 
the batch allocation then does not allocate any oversubscribed resources 
because more are allocated than the total, and rescind offer then rescinds all 
offers using oversubscribed resources. This does not lead to the agent host 
being overcommitted.
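
The two orderings can be compared with a small toy model. This is only a 
sketch in Python, not Mesos code: {{ToyAllocator}} and every method on it are 
hypothetical names, and the single integer stands in for the oversubscribed 
resources of one agent. It assumes one outstanding offer holds both units when 
the decrease from 2 to 1 arrives.

```python
# Toy model of the ordering problem; not Mesos code. All names are hypothetical.


class ToyAllocator:
    def __init__(self, total_oversubscribed):
        # The allocator's view of the agent's total oversubscribed resources.
        self.total = total_oversubscribed
        # Amount currently offered to or used by frameworks.
        self.allocated = 0

    def update_slave(self, new_total):
        self.total = new_total

    def recover(self, amount):
        # Resources from a rescinded offer return to the allocator.
        self.allocated -= amount

    def batch_allocate(self):
        # Offer whatever the allocator believes is still free.
        free = max(self.total - self.allocated, 0)
        self.allocated += free
        return free


def rescind_then_update(new_total):
    alloc = ToyAllocator(total_oversubscribed=2)
    alloc.allocated = 2                # one outstanding offer holds both units
    alloc.recover(2)                   # 1. rescind offer
    offered = alloc.batch_allocate()   # 2. batch allocation sees the stale total
    alloc.update_slave(new_total)      # 3. update slave
    return offered, alloc.allocated, alloc.total


def update_then_rescind(new_total):
    alloc = ToyAllocator(total_oversubscribed=2)
    alloc.allocated = 2                # one outstanding offer holds both units
    alloc.update_slave(new_total)      # 1. update slave
    offered = alloc.batch_allocate()   # 2. batch allocation finds nothing free
    alloc.recover(2)                   # 3. rescind offer
    return offered, alloc.allocated, alloc.total
```

In this model the agent is overcommitted whenever the returned {{allocated}} 
value exceeds the returned {{total}}, which happens only in the first 
ordering.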



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6181) The logic for BadACLNoPrincipal and BadACLDropCreateAndDestroy is not correct

2016-10-03 Thread Guangya Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guangya Liu updated MESOS-6181:
---
Description: One issue for the test: If destroying the volume failed, we should 
get the last offer to make sure that the last offer also contains the volume 
resource.  (was: Two issues for those two test cases:

1) There is no need to add `{}` in the test case; adding the `{}` will cause 
the driver to decline a nonexistent offer.
2) If destroying the volume failed, we should get the last offer to make sure 
that the last offer also contains the volume resource.)

> The logic for BadACLNoPrincipal and BadACLDropCreateAndDestroy is not correct
> -
>
> Key: MESOS-6181
> URL: https://issues.apache.org/jira/browse/MESOS-6181
> Project: Mesos
>  Issue Type: Bug
>Reporter: Guangya Liu
>Assignee: Guangya Liu
>
> One issue for the test: If destroying the volume failed, we should get the 
> last offer to make sure that the last offer also contains the volume resource.





[jira] [Comment Edited] (MESOS-5524) Expose resource allocation constraints (quota, shares) to schedulers.

2016-09-28 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15522430#comment-15522430
 ] 

Guangya Liu edited comment on MESOS-5524 at 9/28/16 1:39 PM:
-

[~bmahler] one question I want to discuss with you: when exposing the resource 
allocation constraints, do we need to expose the resources at the {{role}} 
level or the {{framework}} level? 

If we expose them at the {{role}} level, there may be problems when one role 
has multiple frameworks, as each framework with the same role will have the 
same resource constraints, and we cannot guarantee that one framework can 
always get the exposed resources.

The {{framework}} level is also not good. The problem is how we define the 
{{framework}} level: do we just expose the resources evenly to all 
{{frameworks}} under the same {{role}}, or use some other way? Exposing the 
resources evenly to all {{frameworks}} under the same {{role}} is also not 
accurate, as there may be a {{framework}} that has quite a lot of tasks while 
others have none, and the framework with a lot of tasks will use up all of the 
resources.


was (Author: gyliu):
[~bmahler] one question I want to discuss with you: when exposing the resource 
allocation constraints, do we need to expose the resources at the {{role}} 
level or the {{framework}} level? 

If we expose them at the {{role}} level, there may be problems when one role 
has multiple frameworks, as each framework with the same role will have the 
same resource constraints, and we cannot guarantee that one framework can 
always get the exposed resources.

It seems the {{framework}} level is more accurate, but even at the 
{{framework}} level it may still not be accurate because of the allocator's 
coarse-grained mode for resource allocation when there are more frameworks 
than agents in the cluster. Any comments?

> Expose resource allocation constraints (quota, shares) to schedulers.
> -
>
> Key: MESOS-5524
> URL: https://issues.apache.org/jira/browse/MESOS-5524
> Project: Mesos
>  Issue Type: Epic
>  Components: allocation, scheduler api
>Reporter: Benjamin Mahler
>
> Currently, schedulers do not have visibility into their quota or shares of 
> the cluster. By providing this information, we give the scheduler the ability 
> to make better decisions. As we start to allow schedulers to decide how 
> they'd like to use a particular resource (e.g. as non-revocable or 
> revocable), schedulers need visibility into their quota and shares to make an 
> effective decision (otherwise they may accidentally exceed their quota and 
> will not find out until mesos replies with TASK_LOST REASON_QUOTA_EXCEEDED).
> We would start by exposing the following information:
> * quota: e.g. cpus:10, mem:20, disk:40
> * shares: e.g. cpus:20, mem:40, disk:80
> Currently, quota is used for non-revocable resources and the idea is to use 
> shares only for consuming revocable resources since the number of shares 
> available to a role changes dynamically as resources come and go, frameworks 
> come and go, or the operator manipulates the amount of resources sectioned 
> off for quota.
> By exposing quota and shares, the framework knows when it can consume 
> additional non-revocable resources (i.e. when it has fewer non-revocable 
> resources allocated to it than its quota) or when it can consume revocable 
> resources (always! but in the future, it cannot revoke another user's 
> revocable resources if the framework is above its fair share).
> This also allows schedulers to determine whether they have sufficient quota 
> assigned to them, and to alert the operator if they need more to run safely. 
> Also, by viewing their fair share, the framework can expose monitoring 
> information that shows the discrepancy between how much it would like and its 
> fair share (note that the framework can actually exceed its fair share but in 
> the future this will mean increased potential for revocation).
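
The quota check described in the quoted description (a framework may take more 
non-revocable resources only while its allocation is below its quota) can be 
sketched as follows. This is an illustration in Python under assumptions, not 
a Mesos API: the function names and the plain-dict resource representation are 
hypothetical, and the quota numbers are the example values from the 
description.

```python
# Hypothetical helper functions; Mesos does not expose quota in this form.


def non_revocable_headroom(quota, allocated):
    """Return, per resource name, how much quota is still unused."""
    return {
        name: max(limit - allocated.get(name, 0), 0)
        for name, limit in quota.items()
    }


def can_use_non_revocable(quota, allocated, request):
    """True if the request fits entirely within the remaining quota."""
    headroom = non_revocable_headroom(quota, allocated)
    return all(
        headroom.get(name, 0) >= amount for name, amount in request.items()
    )


# Example values taken from the description: cpus:10, mem:20, disk:40.
quota = {"cpus": 10, "mem": 20, "disk": 40}
allocated = {"cpus": 8, "mem": 20}
```

A scheduler using a check like this could fall back to revocable resources, or 
alert the operator, when the request does not fit within the headroom.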





[jira] [Commented] (MESOS-6181) The logic for BadACLNoPrincipal and BadACLDropCreateAndDestroy is not correct

2016-09-27 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15525325#comment-15525325
 ] 

Guangya Liu commented on MESOS-6181:


Thanks [~greggomann]. Agreed on #1.

For #2, take {{PersistentVolumeTest, BadACLNoPrincipal}} as an example. At 
https://github.com/apache/mesos/blob/master/src/tests/persistent_volume_tests.cpp#L1626
 , it expects 
{{EXPECT_TRUE(Resources(offer.resources()).contains(volume));}} , but it is not 
using the latest offer; it is still using the offer 
https://github.com/apache/mesos/blob/master/src/tests/persistent_volume_tests.cpp#L1599
 revived. This is not accurate: we should use the offer received after 
{{acceptOffers}}, and we need to make sure that the volume is still in the new 
offer after the allocation interval. Comments?

> The logic for BadACLNoPrincipal and BadACLDropCreateAndDestroy is not correct
> -
>
> Key: MESOS-6181
> URL: https://issues.apache.org/jira/browse/MESOS-6181
> Project: Mesos
>  Issue Type: Bug
>Reporter: Guangya Liu
>Assignee: Guangya Liu
>
> Two issues for those two test cases:
> 1) There is no need to add `{}` in the test case; adding the `{}` will cause 
> the driver to decline a nonexistent offer.
> 2) If destroying the volume failed, we should get the last offer to make sure 
> that the last offer also contains the volume resource.





[jira] [Commented] (MESOS-5524) Expose resource allocation constraints (quota, shares) to schedulers.

2016-09-26 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15522430#comment-15522430
 ] 

Guangya Liu commented on MESOS-5524:


[~bmahler] one question I want to discuss with you: when exposing the resource 
allocation constraints, do we need to expose the resources at the {{role}} 
level or the {{framework}} level? 

If we expose them at the {{role}} level, there may be problems when one role 
has multiple frameworks, as each framework with the same role will have the 
same resource constraints, and we cannot guarantee that one framework can 
always get the exposed resources.

It seems the {{framework}} level is more accurate, but even at the 
{{framework}} level it may still not be accurate because of the allocator's 
coarse-grained mode for resource allocation when there are more frameworks 
than agents in the cluster. Any comments?

> Expose resource allocation constraints (quota, shares) to schedulers.
> -
>
> Key: MESOS-5524
> URL: https://issues.apache.org/jira/browse/MESOS-5524
> Project: Mesos
>  Issue Type: Epic
>  Components: allocation, scheduler api
>Reporter: Benjamin Mahler
>
> Currently, schedulers do not have visibility into their quota or shares of 
> the cluster. By providing this information, we give the scheduler the ability 
> to make better decisions. As we start to allow schedulers to decide how 
> they'd like to use a particular resource (e.g. as non-revocable or 
> revocable), schedulers need visibility into their quota and shares to make an 
> effective decision (otherwise they may accidentally exceed their quota and 
> will not find out until mesos replies with TASK_LOST REASON_QUOTA_EXCEEDED).
> We would start by exposing the following information:
> * quota: e.g. cpus:10, mem:20, disk:40
> * shares: e.g. cpus:20, mem:40, disk:80
> Currently, quota is used for non-revocable resources and the idea is to use 
> shares only for consuming revocable resources since the number of shares 
> available to a role changes dynamically as resources come and go, frameworks 
> come and go, or the operator manipulates the amount of resources sectioned 
> off for quota.
> By exposing quota and shares, the framework knows when it can consume 
> additional non-revocable resources (i.e. when it has fewer non-revocable 
> resources allocated to it than its quota) or when it can consume revocable 
> resources (always! but in the future, it cannot revoke another user's 
> revocable resources if the framework is above its fair share).
> This also allows schedulers to determine whether they have sufficient quota 
> assigned to them, and to alert the operator if they need more to run safely. 
> Also, by viewing their fair share, the framework can expose monitoring 
> information that shows the discrepancy between how much it would like and its 
> fair share (note that the framework can actually exceed its fair share but in 
> the future this will mean increased potential for revocation).





[jira] [Commented] (MESOS-6181) The logic for BadACLNoPrincipal and BadACLDropCreateAndDestroy is not correct

2016-09-16 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15497632#comment-15497632
 ] 

Guangya Liu commented on MESOS-6181:


 cc [~greggomann] 

> The logic for BadACLNoPrincipal and BadACLDropCreateAndDestroy is not correct
> -
>
> Key: MESOS-6181
> URL: https://issues.apache.org/jira/browse/MESOS-6181
> Project: Mesos
>  Issue Type: Bug
>Reporter: Guangya Liu
>Assignee: Guangya Liu
>
> Two issues for those two test cases:
> 1) There is no need to add `{}` in the test case; adding the `{}` will cause 
> the driver to decline a nonexistent offer.
> 2) If destroying the volume failed, we should get the last offer to make sure 
> that the last offer also contains the volume resource.





[jira] [Created] (MESOS-6181) The logic for BadACLNoPrincipal and BadACLDropCreateAndDestroy is not correct

2016-09-16 Thread Guangya Liu (JIRA)
Guangya Liu created MESOS-6181:
--

 Summary: The logic for BadACLNoPrincipal and 
BadACLDropCreateAndDestroy is not correct
 Key: MESOS-6181
 URL: https://issues.apache.org/jira/browse/MESOS-6181
 Project: Mesos
  Issue Type: Bug
Reporter: Guangya Liu


Two issues for those two test cases:

1) There is no need to add `{}` in the test case; adding the `{}` will cause 
the driver to decline a nonexistent offer.
2) If destroying the volume failed, we should get the last offer to make sure 
that the last offer also contains the volume resource.





[jira] [Commented] (MESOS-4811) Reusable/Cacheable Offer

2016-09-06 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15469367#comment-15469367
 ] 

Guangya Liu commented on MESOS-4811:


Based on the requirement description, this is a duplicate of MESOS-3078. 
[~klaus1982] [~abi...@gmail.com] please help confirm. Thanks.

> Reusable/Cacheable Offer
> 
>
> Key: MESOS-4811
> URL: https://issues.apache.org/jira/browse/MESOS-4811
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
>Reporter: Klaus Ma
>Assignee: Abhishek Dasgupta
>  Labels: tech-debt
>
> Currently, the resources are returned to the allocator when a task finishes, 
> and those resources are not allocated to a framework until the next 
> allocation cycle. The performance is low for short-running tasks 
> (MESOS-3078). The proposed solution is to let the framework keep using the 
> offer until the allocator decides to rescind it.





[jira] [Commented] (MESOS-4988) Excluded reserved resources when got nonRevocable resources in stage 1.

2016-09-06 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15469211#comment-15469211
 ] 

Guangya Liu commented on MESOS-4988:


This improvement seems to have no impact on performance. Shall we close this 
one? [~klaus1982]

> Excluded reserved resources when got nonRevocable resources in stage 1.
> ---
>
> Key: MESOS-4988
> URL: https://issues.apache.org/jira/browse/MESOS-4988
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
>Reporter: Klaus Ma
>
> The allocator will only allocate non-revocable resources to satisfy quota. As 
> the reserved resources cannot be revocable, it's not necessary to call 
> `nonRevocable()` for reserved resources.





[jira] [Created] (MESOS-6131) Improved performance for resource flatten

2016-09-06 Thread Guangya Liu (JIRA)
Guangya Liu created MESOS-6131:
--

 Summary: Improved performance for resource flatten
 Key: MESOS-6131
 URL: https://issues.apache.org/jira/browse/MESOS-6131
 Project: Mesos
  Issue Type: Bug
Reporter: Guangya Liu
Assignee: Guangya Liu


{{Resources::flatten}} uses {{+=}} to add each single resource object, but 
this hurts performance considerably because {{+=}} invokes resource 
validation. Here we should validate the role first and then call {{add}} 
directly to avoid the per-resource validation. 
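
The idea of validating once up front and then appending without per-item 
validation can be sketched with a toy container. This is Python for 
illustration only and does not reproduce the C++ {{Resources}} class: 
{{ToyResources}}, its validation rule, and both flatten functions are 
hypothetical, and a resource is modelled as a simple (name, value, role) 
tuple.

```python
# Toy sketch of the proposed change; not the Mesos Resources class.


class ToyResources:
    def __init__(self):
        self.items = []
        self.validations = 0  # counts how often validation runs

    def _validate(self, resource):
        self.validations += 1
        name, value, role = resource
        if not name or value < 0 or not role:
            raise ValueError("invalid resource: %r" % (resource,))

    def __iadd__(self, resource):
        # Mirrors the described behaviour of +=: validate, then add.
        self._validate(resource)
        self.add(resource)
        return self

    def add(self, resource):
        # Appends without validation; the caller must guarantee validity.
        self.items.append(resource)


def flatten_with_plus_equals(resources, role):
    out = ToyResources()
    for name, value, _ in resources:
        out += (name, value, role)  # validates every single resource
    return out


def flatten_with_add(resources, role):
    if not role:
        raise ValueError("invalid role")  # validate the role once, up front
    out = ToyResources()
    for name, value, _ in resources:
        out.add((name, value, role))  # inputs are assumed already valid
    return out
```

Both functions produce the same flattened items; the second relies on the 
assumption that the input resources were validated when they were created, so 
only the new role needs checking.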





[jira] [Comment Edited] (MESOS-6113) Offer Quota resources as revocable

2016-09-02 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15460347#comment-15460347
 ] 

Guangya Liu edited comment on MESOS-6113 at 9/3/16 4:35 AM:


Does the section in MESOS-4392 quoted below help? It talks about lending out 
the unused quota to other frameworks and reclaiming it when needed.

{code}
A greedy analytics batch system wants to use as much of the cluster as possible 
to maximize computational throughput. When a competing web service with fixed 
task size starts up, there must be sufficient resources to run it immediately. 
The operator can reserve these resources by setting quota. However, if these 
resources are kept idle until the service is in use, this is wasteful from the 
analytics job's point of view. On the other hand, the analytics job should hand 
back reserved resources to the service when needed to avoid starvation of the 
latter.
{code}


was (Author: gyliu):
Does the section in MESOS-4392 quoted below help? It talks about lending out 
the unused quota to other frameworks and reclaiming it when needed.

{quota}
A greedy analytics batch system wants to use as much of the cluster as possible 
to maximize computational throughput. When a competing web service with fixed 
task size starts up, there must be sufficient resources to run it immediately. 
The operator can reserve these resources by setting quota. However, if these 
resources are kept idle until the service is in use, this is wasteful from the 
analytics job's point of view. On the other hand, the analytics job should hand 
back reserved resources to the service when needed to avoid starvation of the 
latter.
{quota}

> Offer Quota resources as revocable
> --
>
> Key: MESOS-6113
> URL: https://issues.apache.org/jira/browse/MESOS-6113
> Project: Mesos
>  Issue Type: Task
>  Components: allocation
>Affects Versions: 1.0.1
>Reporter: Michael Gummelt
>
> *Goal:*
> I have high-priority Spark jobs, and best-effort jobs.  I need my 
> high-priority jobs to pre-empt my best-effort jobs, so I'd like to launch the 
> best-effort jobs on revocable resources. 
> *Problem:*
> Revocable resources are currently only created via oversubscription, where 
> resources allocated to but not used by a framework will be offered to other 
> frameworks.  This doesn't support the ability for a high-pri framework to 
> start up and pre-empt a low-pri framework.
> *Solution:*
> Let's allow quota (and ideally any reserved resources) to be configurable to 
> be offered as revocable resources to other frameworks that don't register 
> with the role.





[jira] [Commented] (MESOS-6113) Offer Quota resources as revocable

2016-09-01 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15456846#comment-15456846
 ] 

Guangya Liu commented on MESOS-6113:


Then this should be a duplicate of MESOS-4392 rather than MESOS-4967, right?

> Offer Quota resources as revocable
> --
>
> Key: MESOS-6113
> URL: https://issues.apache.org/jira/browse/MESOS-6113
> Project: Mesos
>  Issue Type: Task
>  Components: allocation
>Affects Versions: 1.0.1
>Reporter: Michael Gummelt
>
> *Goal:*
> I have high-priority Spark jobs, and best-effort jobs.  I need my 
> high-priority jobs to pre-empt my best-effort jobs, so I'd like to launch the 
> best-effort jobs on revocable resources. 
> *Problem:*
> Revocable resources are currently only created via oversubscription, where 
> resources allocated to but not used by a framework will be offered to other 
> frameworks.  This doesn't support the ability for a high-pri framework to 
> start up and pre-empt a low-pri framework.
> *Solution:*
> Let's allow quota (and ideally any reserved resources) to be configurable to 
> be offered as revocable resources to other frameworks that don't register 
> with the role.





[jira] [Updated] (MESOS-6113) Offer Quota resources as revocable

2016-09-01 Thread Guangya Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guangya Liu updated MESOS-6113:
---
Summary: Offer Quota resources as revocable  (was: Offer reserved resources 
as revocable)

> Offer Quota resources as revocable
> --
>
> Key: MESOS-6113
> URL: https://issues.apache.org/jira/browse/MESOS-6113
> Project: Mesos
>  Issue Type: Task
>  Components: allocation
>Affects Versions: 1.0.1
>Reporter: Michael Gummelt
>
> *Goal:*
> I have high-priority Spark jobs, and best-effort jobs.  I need my 
> high-priority jobs to pre-empt my best-effort jobs, so I'd like to launch the 
> best-effort jobs on revocable resources. 
> *Problem:*
> Revocable resources are currently only created via oversubscription, where 
> resources allocated to but not used by a framework will be offered to other 
> frameworks.  This doesn't support the ability for a high-pri framework to 
> start up and pre-empt a low-pri framework.
> *Solution:*
> Let's allow quota (and ideally any reserved resources) to be configurable to 
> be offered as revocable resources to other frameworks that don't register 
> with the role.





[jira] [Comment Edited] (MESOS-6113) Offer reserved resources as revocable

2016-08-31 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15454373#comment-15454373
 ] 

Guangya Liu edited comment on MESOS-6113 at 9/1/16 5:32 AM:


MESOS-4967 is a kind of oversubscription for reserved resources and MESOS-4392 
is a kind of oversubscription for quota resources. I am a bit confused here: 
the content of this JIRA is about {{Quota}} resources while the title is about 
{{reserved}} resources. Can you elaborate? [~mgummelt]


was (Author: gyliu):
MESOS-4976 is kind of oversubscription for reserved resources and MESOS-4392 is 
kind of oversubscription for quota resources. I was a bit confused here: The 
content in this JIRA is for {{Quota}} resources while the title is for 
{{reserved}} resources, can you elaborate? [~mgummelt]

> Offer reserved resources as revocable
> -
>
> Key: MESOS-6113
> URL: https://issues.apache.org/jira/browse/MESOS-6113
> Project: Mesos
>  Issue Type: Task
>  Components: allocation
>Affects Versions: 1.0.1
>Reporter: Michael Gummelt
>
> *Goal:*
> I have high-priority Spark jobs, and best-effort jobs.  I need my 
> high-priority jobs to pre-empt my best-effort jobs, so I'd like to launch the 
> best-effort jobs on revocable resources. 
> *Problem:*
> Revocable resources are currently only created via oversubscription, where 
> resources allocated to but not used by a framework will be offered to other 
> frameworks.  This doesn't support the ability for a high-pri framework to 
> start up and pre-empt a low-pri framework.
> *Solution:*
> Let's allow quota (and ideally any reserved resources) to be configurable to 
> be offered as revocable resources to other frameworks that don't register 
> with the role.





[jira] [Commented] (MESOS-6113) Offer reserved resources as revocable

2016-08-31 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15454373#comment-15454373
 ] 

Guangya Liu commented on MESOS-6113:


MESOS-4976 covers oversubscription for reserved resources and MESOS-4392 covers 
oversubscription for quota resources. I am a bit confused here: the 
description of this JIRA is about {{Quota}} resources while the title is about 
{{reserved}} resources. Can you elaborate? [~mgummelt]

> Offer reserved resources as revocable
> -
>
> Key: MESOS-6113
> URL: https://issues.apache.org/jira/browse/MESOS-6113
> Project: Mesos
>  Issue Type: Task
>  Components: allocation
>Affects Versions: 1.0.1
>Reporter: Michael Gummelt
>
> *Goal:*
> I have high-priority Spark jobs, and best-effort jobs.  I need my 
> high-priority jobs to pre-empt my best-effort jobs, so I'd like to launch the 
> best-effort jobs on revocable resources. 
> *Problem:*
> Revocable resources are currently only created via oversubscription, where 
> resources allocated to but not used by a framework will be offered to other 
> frameworks.  This doesn't support the ability for a high-pri framework to 
> start up and pre-empt a low-pri framework.
> *Solution:*
> Let's allow quota (and ideally any reserved resources) to be configurable to 
> be offered as revocable resources to other frameworks that don't register 
> with the role.





[jira] [Commented] (MESOS-6112) Frameworks are starved when > 5 are run concurrently

2016-08-31 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15454358#comment-15454358
 ] 

Guangya Liu commented on MESOS-6112:


Perhaps you can use {{suppressOffers()}} and {{reviveOffers()}} as a pair: 
after {{suppressOffers()}}, call {{reviveOffers()}} to see whether you get the 
offer for the persistent volume. If not, call {{suppressOffers()}} again and 
loop until the host of your persistent volume comes back.
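
The suppress/revive loop suggested here could look roughly like the sketch below. It is only an illustration: {{FakeDriver}}, {{waitForVolumeOffer}}, the {{volumeOffered}} predicate and the {{maxAttempts}} bound are hypothetical names, not part of the Mesos scheduler API; only the two call names come from the comment.

```cpp
#include <cassert>
#include <functional>

// Hypothetical stand-in for a scheduler driver. Only the two method names
// mirror the calls mentioned in the comment; everything else is assumed.
struct FakeDriver {
  int suppressCalls = 0;
  int reviveCalls = 0;
  void suppressOffers() { ++suppressCalls; }
  void reviveOffers() { ++reviveCalls; }
};

// Suppress offers, then repeatedly revive them and check (through a
// caller-supplied predicate) whether the wanted offer arrived. If it did
// not, suppress again and retry. Gives up after `maxAttempts` revivals.
// Returns true if the wanted offer was seen.
inline bool waitForVolumeOffer(
    FakeDriver& driver,
    const std::function<bool()>& volumeOffered,
    int maxAttempts)
{
  driver.suppressOffers();

  for (int attempt = 0; attempt < maxAttempts; ++attempt) {
    driver.reviveOffers();

    if (volumeOffered()) {
      return true;
    }

    driver.suppressOffers();
  }

  return false;
}
```

A real framework would presumably wait between attempts rather than spin, and would inspect actual offers instead of calling a predicate.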

> Frameworks are starved when > 5 are run concurrently
> 
>
> Key: MESOS-6112
> URL: https://issues.apache.org/jira/browse/MESOS-6112
> Project: Mesos
>  Issue Type: Task
>  Components: allocation, master
>Affects Versions: 1.0.1
>Reporter: Michael Gummelt
>
> As I understand it, the master will send an offer to a list of frameworks 
> ordered by DRF, until the offer is accepted.  There is a 1s wait time between 
> each offering.  Once the decline timeout for the first framework has been 
> reached, rather than continuing to submit the offer to the rest of the 
> frameworks in the list, the master starts over at the beginning, starving the 
> rest of the frameworks.
> This means that in order for Mesos to support > 5 concurrent frameworks, all 
> frameworks must be good citizens and set their decline timeout to something 
> large or suppress offers.  I think this is a fairly undesirable state of 
> things.
> I propose that the master instead continues to submit the offer to every 
> registered framework, even if the declineOffer timeout has been reached.
> The potential increase in task startup latency that could be introduced by 
> this change can be obviated in part if we also make the master smarter about 
> how long to wait between successive offers, rather than a static 1s.
>   





[jira] [Commented] (MESOS-6112) Frameworks are starved when > 5 are run concurrently

2016-08-30 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15450888#comment-15450888
 ] 

Guangya Liu commented on MESOS-6112:


Is this a duplicate of MESOS-3202? I think this will only happen when you 
have more frameworks than agents. Can quota help if there is one role per 
framework?

> Frameworks are starved when > 5 are run concurrently
> 
>
> Key: MESOS-6112
> URL: https://issues.apache.org/jira/browse/MESOS-6112
> Project: Mesos
>  Issue Type: Task
>  Components: allocation, master
>Affects Versions: 1.0.1
>Reporter: Michael Gummelt
>
> As I understand it, the master will send an offer to a list of frameworks 
> ordered by DRF, until the offer is accepted.  There is a 1s wait time between 
> each offering.  Once the decline timeout for the first framework has been 
> reached, rather than continuing to submit the offer to the rest of the 
> frameworks in the list, the master starts over at the beginning, starving the 
> rest of the frameworks.
> This means that in order for Mesos to support > 5 concurrent frameworks, all 
> frameworks must be good citizens and set their decline timeout to something 
> large or suppress offers.  I think this is a fairly undesirable state of 
> things.
> I propose that the master instead continues to submit the offer to every 
> registered framework, even if the declineOffer timeout has been reached.
> The potential increase in task startup latency that could be introduced by 
> this change can be obviated in part if we also make the master smarter about 
> how long to wait between successive offers, rather than a static 1s.
>   





[jira] [Commented] (MESOS-6087) Add master tests for TaskGroup

2016-08-26 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15438557#comment-15438557
 ] 

Guangya Liu commented on MESOS-6087:


https://reviews.apache.org/r/51451/ Added test case 
MasterAuthorizationTest.KillPendingTaskInTaskGroup.

cc [~vinodkone]

> Add master tests for TaskGroup
> --
>
> Key: MESOS-6087
> URL: https://issues.apache.org/jira/browse/MESOS-6087
> Project: Mesos
>  Issue Type: Bug
>Reporter: Vinod Kone
>Assignee: Guangya Liu
>
> Some of the tests we want to write:
> -- If a pending task in a task group is killed, the entire group is killed.
> -- If a task in a task group is invalid, the whole group is considered 
> invalid.
> -- If a task in a task group is unauthorized, the whole group is considered 
> unauthorized.





[jira] [Updated] (MESOS-6087) Add master tests for TaskGroup

2016-08-24 Thread Guangya Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guangya Liu updated MESOS-6087:
---
Assignee: (was: Guangya Liu)

> Add master tests for TaskGroup
> --
>
> Key: MESOS-6087
> URL: https://issues.apache.org/jira/browse/MESOS-6087
> Project: Mesos
>  Issue Type: Bug
>Reporter: Vinod Kone
>
> Some of the tests we want to write:
> -- If a pending task in a task group is killed, the entire group is killed.
> -- If a task in a task group is invalid, the whole group is considered 
> invalid.
> -- If a task in a task group is unauthorized, the whole group is considered 
> unauthorized.





[jira] [Assigned] (MESOS-6087) Add master tests for TaskGroup

2016-08-24 Thread Guangya Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guangya Liu reassigned MESOS-6087:
--

Assignee: Guangya Liu

> Add master tests for TaskGroup
> --
>
> Key: MESOS-6087
> URL: https://issues.apache.org/jira/browse/MESOS-6087
> Project: Mesos
>  Issue Type: Bug
>Reporter: Vinod Kone
>Assignee: Guangya Liu
>
> Some of the tests we want to write:
> -- If a pending task in a task group is killed, the entire group is killed.
> -- If a task in a task group is invalid, the whole group is considered 
> invalid.
> -- If a task in a task group is unauthorized, the whole group is considered 
> unauthorized.





[jira] [Commented] (MESOS-4808) Allocation in batch instead of execute it every-time when addSlave/addFramework.

2016-08-24 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15434588#comment-15434588
 ] 

Guangya Liu commented on MESOS-4808:


[~klaus1982] Shall we mark this as a duplicate of MESOS-3157? I think the 
patch for MESOS-3157, https://reviews.apache.org/r/51027/, actually also 
fixes this ticket.

> Allocation in batch instead of execute it every-time when 
> addSlave/addFramework.
> 
>
> Key: MESOS-4808
> URL: https://issues.apache.org/jira/browse/MESOS-4808
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
>Reporter: Klaus Ma
>  Labels: master, tech-debt
>
> Currently, {{allocate()}} is executed every time a new slave/framework is 
> registered; if lots of agents start at almost the same time, the 
> allocation will keep running for a while. It's acceptable behaviour to 
> allocate resources in the next allocation cycle. But when a task finishes, 
> it's better to allocate ASAP even though there are performance issues; refer to 
> MESOS-3078 for more detail on short-running tasks.





[jira] [Commented] (MESOS-4767) Apply batching to allocation events to reduce allocator backlogging.

2016-08-24 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15434574#comment-15434574
 ] 

Guangya Liu commented on MESOS-4767:


[~bmahler] Shall we mark this as a duplicate of MESOS-3157? I think the 
patch for MESOS-3157, https://reviews.apache.org/r/51027/, actually also fixes 
this ticket.

> Apply batching to allocation events to reduce allocator backlogging.
> 
>
> Key: MESOS-4767
> URL: https://issues.apache.org/jira/browse/MESOS-4767
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Benjamin Mahler
>Assignee: Guangya Liu
>
> Per the 
> [discussion|https://issues.apache.org/jira/browse/MESOS-3157?focusedCommentId=14728377=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14728377]
>  that came out of MESOS-3157, we'd like to batch together outstanding 
> allocation dispatches in order to avoid backing up the allocator.





[jira] [Assigned] (MESOS-3078) Recovered resources are not re-allocated until the next allocation delay.

2016-08-22 Thread Guangya Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guangya Liu reassigned MESOS-3078:
--

Assignee: Guangya Liu

> Recovered resources are not re-allocated until the next allocation delay.
> -
>
> Key: MESOS-3078
> URL: https://issues.apache.org/jira/browse/MESOS-3078
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Benjamin Mahler
>Assignee: Guangya Liu
>
> Currently, when resources are recovered, we do not perform an allocation for 
> that slave. Rather, we wait until the next allocation interval.
> For small task, high throughput frameworks, this can have a significant 
> impact on overall throughput, see the following thread:
> http://markmail.org/thread/y6mzfwzlurv6nik3
> We should consider immediately performing a re-allocation for the slave upon 
> resource recovery.





[jira] [Commented] (MESOS-3078) Recovered resources are not re-allocated until the next allocation delay.

2016-08-22 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432190#comment-15432190
 ] 

Guangya Liu commented on MESOS-3078:


The review posted by [~jjanco] at https://reviews.apache.org/r/51027/ can 
help with this: we can use similar logic in {{addSlave}} to handle it.

{code}
allocationCandidates.insert(slaveId);
if (!allocationPending) {
  allocationPending = true;
  dispatch(self(), ::allocate);
}
{code}
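
To illustrate the batching idea in the snippet above, here is a minimal single-threaded sketch. It is not the Mesos allocator: {{BatchingAllocator}}, its event queue and its counters are hypothetical, and a plain {{std::queue}} stands in for the actor dispatch.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <set>
#include <string>
#include <utility>

// Simplified model of "record the candidate, enqueue at most one
// allocation run". All names here are assumptions for illustration.
class BatchingAllocator {
public:
  // Remember the agent and enqueue an allocation only if none is pending.
  void addSlave(const std::string& slaveId) {
    allocationCandidates.insert(slaveId);
    if (!allocationPending) {
      allocationPending = true;
      events.push([this]() { allocate(); });
    }
  }

  // Drain the queued events, as an event loop would.
  void run() {
    while (!events.empty()) {
      std::function<void()> event = std::move(events.front());
      events.pop();
      event();
    }
  }

  int allocationRuns = 0;         // How many times allocate() ran.
  std::size_t lastBatchSize = 0;  // Candidates handled by the last run.

private:
  // Handle every candidate collected so far in a single pass.
  void allocate() {
    ++allocationRuns;
    lastBatchSize = allocationCandidates.size();
    allocationCandidates.clear();
    allocationPending = false;
  }

  std::set<std::string> allocationCandidates;
  bool allocationPending = false;
  std::queue<std::function<void()>> events;
};
```

In this sketch, three {{addSlave}} calls made before the queue is drained result in one allocation run covering all three agents.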

> Recovered resources are not re-allocated until the next allocation delay.
> -
>
> Key: MESOS-3078
> URL: https://issues.apache.org/jira/browse/MESOS-3078
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Benjamin Mahler
>
> Currently, when resources are recovered, we do not perform an allocation for 
> that slave. Rather, we wait until the next allocation interval.
> For small task, high throughput frameworks, this can have a significant 
> impact on overall throughput, see the following thread:
> http://markmail.org/thread/y6mzfwzlurv6nik3
> We should consider immediately performing a re-allocation for the slave upon 
> resource recovery.





[jira] [Commented] (MESOS-970) Upgrade bundled leveldb to 1.18

2016-08-11 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417455#comment-15417455
 ] 

Guangya Liu commented on MESOS-970:
---

Actually, [~bmahler] already mentioned this in a JIRA here 
https://issues.apache.org/jira/browse/MESOS-4558 and we do plan to fix it.
 
I think this was introduced by the review here 
https://reviews.apache.org/r/49784/ as we added more test cases there for the 
allocator benchmark test.
 
{code}
INSTANTIATE_TEST_CASE_P(
SlaveAndFrameworkCount,
HierarchicalAllocator_BENCHMARK_Test,
::testing::Combine(
  ::testing::Values(1000U, 5000U, 1U, 2U, 3U, 5U),
  ::testing::Values(1U, 50U, 100U, 200U, 500U, 1000U, 3000U, 6000U))
);
{code}
 
There will be 48 (6 * 8) cases here. The longest benchmark test has 
5 agents and 6000 frameworks as its test parameters, and some tests loop 
(framework * 2) times, which is 12000 loops for the last case. That is why 
you see the benchmark test time increasing.
 
We are now trying to find a solution for this so that we can also enable the 
benchmark tests in ASF CI. For now, perhaps you can use a filter to exclude 
some test cases.
 
{code}
MESOS_BENCHMARK=1 GTEST_FILTER="*BENCHMARK*.*/1" make check
{code}
 
The above command will only run the first test case; you can adjust the 
parameter based on your test requirements. Hope this helps.
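
The arithmetic in this comment can be spelled out as a small sketch. The helper names {{combinations}} and {{loopsFor}} are hypothetical; the 6 and 8 are the sizes of the two parameter lists, and the factor of two comes from the "(framework * 2)" remark.

```cpp
#include <cassert>
#include <cstddef>

// Number of test instances produced by combining two parameter lists of the
// given sizes (every value of the first paired with every value of the
// second).
inline std::size_t combinations(std::size_t firstSize, std::size_t secondSize)
{
  return firstSize * secondSize;
}

// Loop count for a test that iterates (framework * 2) times.
inline unsigned loopsFor(unsigned frameworks)
{
  return frameworks * 2;
}
```

With six values in the first list and eight in the second this gives 48 instances, and 6000 frameworks give 12000 loops.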

> Upgrade bundled leveldb to 1.18
> ---
>
> Key: MESOS-970
> URL: https://issues.apache.org/jira/browse/MESOS-970
> Project: Mesos
>  Issue Type: Improvement
>  Components: replicated log
>Reporter: Benjamin Mahler
>Assignee: Tomasz Janiszewski
>
> We currently bundle leveldb 1.4, and the latest version is leveldb 1.18.
> Upgrading to 1.18 could solve problems when building Mesos on some non-x86 
> CPU architectures.





[jira] [Commented] (MESOS-5830) Make a sweep to trim excess space around angle brackets

2016-08-06 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410815#comment-15410815
 ] 

Guangya Liu commented on MESOS-5830:


Yes, [~zerobleed], this is a good ticket for getting started with Mesos. As 
suggested by [~haosd...@gmail.com], you can follow 
https://github.com/apache/mesos/blob/master/docs/submitting-a-patch.md to 
contribute. There is also a set of meetup slides here for reference: 
http://files.meetup.com/18744996/Mesos_Community_Guidance.pdf

> Make a sweep to trim excess space around angle brackets
> ---
>
> Key: MESOS-5830
> URL: https://issues.apache.org/jira/browse/MESOS-5830
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benjamin Bannier
>Priority: Trivial
>
> The codebase still has pre-C++11 code where we needed to say e.g., 
> {{vector

[jira] [Updated] (MESOS-5921) `validate` is a bit heavy to check negative scalar resource

2016-07-28 Thread Guangya Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guangya Liu updated MESOS-5921:
---
Attachment: WithoutValidation.png

> `validate` is a bit heavy to check negative scalar resource
> ---
>
> Key: MESOS-5921
> URL: https://issues.apache.org/jira/browse/MESOS-5921
> Project: Mesos
>  Issue Type: Bug
>Reporter: Guangya Liu
>Assignee: Guangya Liu
> Attachments: WithValidation.png, WithoutValidation.png
>
>
> When subtracting resources finishes, we need to call {{Resources::validate}} to 
> check whether the scalar resource is negative, so that we can remove the 
> resource if it is. This is a bit heavy because {{Resources::validate}} does a 
> lot of validation, such as checking the type, validating the role, and 
> checking the resource name, none of which is necessary here.
> We should introduce a new helper function {{isNegative}} to check if the 
> resource is a negative scalar resource.





[jira] [Updated] (MESOS-5921) `validate` is a bit heavy to check negative scalar resource

2016-07-28 Thread Guangya Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guangya Liu updated MESOS-5921:
---
Attachment: WithValidation.png

> `validate` is a bit heavy to check negative scalar resource
> ---
>
> Key: MESOS-5921
> URL: https://issues.apache.org/jira/browse/MESOS-5921
> Project: Mesos
>  Issue Type: Bug
>Reporter: Guangya Liu
>Assignee: Guangya Liu
> Attachments: WithValidation.png
>
>
> When subtracting resources finishes, we need to call {{Resources::validate}} to 
> check whether the scalar resource is negative, so that we can remove the 
> resource if it is. This is a bit heavy because {{Resources::validate}} does a 
> lot of validation, such as checking the type, validating the role, and 
> checking the resource name, none of which is necessary here.
> We should introduce a new helper function {{isNegative}} to check if the 
> resource is a negative scalar resource.





[jira] [Commented] (MESOS-5921) `validate` is a bit heavy to check negative scalar resource

2016-07-28 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398446#comment-15398446
 ] 

Guangya Liu commented on MESOS-5921:


Sure Ben, I will use {{callgrind}} to check why the performance did not 
improve much before posting the patch. Since I was using `Ports` resources 
here, the performance should ideally show some improvement after using 
{{isNegative}}.

> `validate` is a bit heavy to check negative scalar resource
> ---
>
> Key: MESOS-5921
> URL: https://issues.apache.org/jira/browse/MESOS-5921
> Project: Mesos
>  Issue Type: Bug
>Reporter: Guangya Liu
>Assignee: Guangya Liu
>
> When subtracting resources finishes, we need to call {{Resources::validate}} to 
> check whether the scalar resource is negative, so that we can remove the 
> resource if it is. This is a bit heavy because {{Resources::validate}} does a 
> lot of validation, such as checking the type, validating the role, and 
> checking the resource name, none of which is necessary here.
> We should introduce a new helper function {{isNegative}} to check if the 
> resource is a negative scalar resource.





[jira] [Commented] (MESOS-5921) `validate` is a bit heavy to check negative scalar resource

2016-07-28 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15397700#comment-15397700
 ] 

Guangya Liu commented on MESOS-5921:


[~bmahler], I did some checking on this and it seems we can keep the current 
logic of {{Resources::subtract}} using {{Resources::validate}}, as this 
function returns very quickly when it encounters a negative scalar resource. 
What do you think? Thanks.

{code}
Option<Error> Resources::validate(const Resource& resource)
{
  if (resource.name().empty()) {
    return Error("Empty resource name");
  }

  if (!Value::Type_IsValid(resource.type())) {
    return Error("Invalid resource type");
  }

  if (resource.type() == Value::SCALAR) {
    if (!resource.has_scalar() ||
        resource.has_ranges() ||
        resource.has_set()) {
      return Error("Invalid scalar resource");
    }

    if (resource.scalar().value() < 0) {
      // Return here if the scalar resource is negative and thus will not
      // do other checking.
      return Error("Invalid scalar resource: value < 0");
    }
  } else if (resource.type() == Value::RANGES) {
    ..
  } else if (resource.type() == Value::SET) {
    ..
  } else {
    // Resource doesn't support TEXT or other value types.
    return Error("Unsupported resource type");
  }

  ..
}
{code}

I also did some testing with the following code diff and found that the 
performance was almost unchanged when operating on 1000 port resources.

Code diff.
{code}
--- a/include/mesos/resources.hpp
+++ b/include/mesos/resources.hpp
@@ -396,6 +396,9 @@ private:
   // ensure this is warranted.
   bool _contains(const Resource& that) const;

+  // Check if the resource is a negative scalar resource.
+  bool isNegative(const Resource& r) const;
+
   // Similar to the public 'find', but only for a single Resource
   // object. The target resource may span multiple roles, so this
   // returns Resources.
diff --git a/src/common/resources.cpp b/src/common/resources.cpp
index 2878ace..b1259b9 100644
--- a/src/common/resources.cpp
+++ b/src/common/resources.cpp
@@ -1296,6 +1296,17 @@ bool Resources::_contains(const Resource& that) const
 }


+bool Resources::isNegative(const Resource& r) const
+{
+  if (r.type() == Value::SCALAR &&
+      r.scalar().value() < 0) {
+    return true;
+  }
+
+  return false;
+}
+
+
 Option<Resources> Resources::find(const Resource& target) const
 {
   Resources found;
@@ -1442,10 +1453,8 @@ void Resources::subtract(const Resource& that)
 if (internal::subtractable(*resource, that)) {
   *resource -= that;

-  // Remove the resource if it becomes invalid or zero. We need
-  // to do the validation because we want to strip negative
-  // scalar Resource object.
-  if (validate(*resource).isSome() || isEmpty(*resource)) {
+  // Remove the resource if it becomes negative or empty.
+  if (isNegative(*resource) || isEmpty(*resource)) {
 // As `resources` is not ordered, and erasing an element
 // from the middle using `DeleteSubrange` is expensive, we
 // swap with the last element and then shrink the
{code}

Before fix:
{code}
[==] Running 1 test from 1 test case.
[--] Global test environment set-up.
[--] 1 test from ResourcesOperators/Resources_BENCHMARK_Test
[ RUN  ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2
Took 2.730778secs to perform 1000 'total += r' operations on ports(*):[1-2, 
4-5, 7-8, 10-11, 13-14, 16-17, 1...
Took 20.703045secs to perform 1000 'total.contains(r)' operations on 
ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1...
Took 3.530712secs to perform 1000 'total -= r' operations on ports(*):[1-2, 
4-5, 7-8, 10-11, 13-14, 16-17, 1...
Took 2.92716secs to perform 1000 'total = total + r' operations on 
ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1...
Took 3.489936secs to perform 1000 'total = total - r' operations on 
ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1...
Took 122368us to perform 1000 'r.nonRevocable()' operations on ports(*):[1-2, 
4-5, 7-8, 10-11, 13-14, 16-17, 1...
[   OK ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2 (33508 ms)
[--] 1 test from ResourcesOperators/Resources_BENCHMARK_Test (33508 ms 
total)

[--] Global test environment tear-down
[==] 1 test from 1 test case ran. (33525 ms total)
[  PASSED  ] 1 test.
{code}

After fix:
{code}
[==] Running 1 test from 1 test case.
[--] Global test environment set-up.
[--] 1 test from ResourcesOperators/Resources_BENCHMARK_Test
[ RUN  ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2
Took 2.657057secs to perform 1000 'total += r' operations on ports(*):[1-2, 
4-5, 7-8, 10-11, 13-14, 16-17, 1...
Took 20.493614secs to perform 1000 'total.contains(r)' operations on 
ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1...
Took 3.420194secs to perform 1000 'total -= r' operations on ports(*):[1-2, 
4-5, 7-8, 10-11, 

[jira] [Created] (MESOS-5921) `validate` is a bit heavy to check negative scalar resource

2016-07-28 Thread Guangya Liu (JIRA)
Guangya Liu created MESOS-5921:
--

 Summary: `validate` is a bit heavy to check negative scalar 
resource
 Key: MESOS-5921
 URL: https://issues.apache.org/jira/browse/MESOS-5921
 Project: Mesos
  Issue Type: Bug
Reporter: Guangya Liu
Assignee: Guangya Liu


When subtracting resources finishes, we need to call {{Resources::validate}} to 
check whether the scalar resource is negative, so that we can remove the 
resource if it is. This is a bit heavy because {{Resources::validate}} does a 
lot of validation, such as checking the type, validating the role, and 
checking the resource name, none of which is necessary here.

We should introduce a new helper function {{isNegative}} to check if the 
resource is a negative scalar resource.
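
A minimal sketch of what such a helper might look like is below. It uses a simplified stand-in type rather than the Mesos protobuf {{Resource}}: {{SimpleResource}} and {{ValueType}} are hypothetical and exist only so the example is self-contained.

```cpp
#include <cassert>

// Simplified stand-ins for the protobuf types; these are assumptions made
// for illustration, not the Mesos definitions.
enum class ValueType { SCALAR, RANGES, SET };

struct SimpleResource {
  ValueType type;
  double scalar;  // Only meaningful when type == ValueType::SCALAR.
};

// Returns true only for a scalar resource whose value is below zero.
// Unlike a full validation pass, it checks nothing else.
inline bool isNegative(const SimpleResource& r)
{
  return r.type == ValueType::SCALAR && r.scalar < 0;
}
```

The point of the helper is that it performs a single type check and a single comparison, skipping the name, role and type validation.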





[jira] [Created] (MESOS-5919) Improve performance for `Resources.contains` and `Resources.filter`

2016-07-28 Thread Guangya Liu (JIRA)
Guangya Liu created MESOS-5919:
--

 Summary: Improve performance for `Resources.contains` and 
`Resources.filter`
 Key: MESOS-5919
 URL: https://issues.apache.org/jira/browse/MESOS-5919
 Project: Mesos
  Issue Type: Bug
Reporter: Guangya Liu
Assignee: Guangya Liu


The current logic for `Resources.contains` and `Resources.filter` is as 
follows:

{code}
Resources Resources::filter(
    const lambda::function<bool(const Resource&)>& predicate) const
{
  Resources result;
  foreach (const Resource& resource, resources) {
    if (predicate(resource)) {
      result += resource;
    }
  }
  return result;
}


bool Resources::contains(const Resources& that) const
{
  Resources remaining = *this;

  foreach (const Resource& resource, that.resources) {
    // NOTE: We use _contains because Resources only contain valid
    // Resource objects, and we don't want the performance hit of the
    // validity check.
    if (!remaining._contains(resource)) {
      return false;
    }

    remaining -= resource;
  }

  return true;
}
{code}

The problem is that all of the {{resource}} objects in those two APIs are 
already valid, so there is no need to validate them again. However, both 
{{remaining -= resource;}} in {{Resources::contains}} and {{result += 
resource;}} in {{Resources::filter}} include the {{validate}} logic. We should 
avoid the {{validate}} call by using {{subtract}} and {{add}} in those two 
APIs.
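
The difference between the validating operator and the non-validating internal call can be sketched with a toy container. {{Bag}}, {{Item}} and the {{validations}} counter are hypothetical and only illustrate the idea; they are not the Mesos {{Resources}} class.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Toy element type; an assumption for illustration only.
struct Item {
  std::string name;
  double value;
};

class Bag {
public:
  static int validations;  // Counts how often validate() ran.

  static bool validate(const Item& item) {
    ++validations;
    return !item.name.empty() && item.value >= 0;
  }

  // Public operator: input is untrusted, so it is validated first.
  Bag& operator+=(const Item& item) {
    if (validate(item)) {
      add(item);
    }
    return *this;
  }

  // Everything already stored passed validation once, so filtering can use
  // the non-validating add() directly.
  Bag filter(const std::function<bool(const Item&)>& predicate) const {
    Bag result;
    for (const Item& item : items) {
      if (predicate(item)) {
        result.add(item);
      }
    }
    return result;
  }

  std::size_t size() const { return items.size(); }

private:
  void add(const Item& item) { items.push_back(item); }

  std::vector<Item> items;
};

int Bag::validations = 0;
```

In this sketch, calling {{filter}} leaves the validation counter unchanged, whereas building the result with {{+=}} would validate every element a second time.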





[jira] [Updated] (MESOS-5700) Benchmark for Resource class

2016-07-26 Thread Guangya Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guangya Liu updated MESOS-5700:
---
Summary: Benchmark for Resource class  (was: Benchmark for Resource class 
(protobuf vs. C++))

> Benchmark for Resource class
> 
>
> Key: MESOS-5700
> URL: https://issues.apache.org/jira/browse/MESOS-5700
> Project: Mesos
>  Issue Type: Bug
>Reporter: Klaus Ma
>Assignee: Klaus Ma
> Attachments: hashmap.diff, name_roleId.diff, port.perf.log, 
> reservation.perf.log
>
>
> Add benchmark of Resource class for Allocation Performance.





[jira] [Comment Edited] (MESOS-5700) Benchmark for Resource class (protobuf vs. C++)

2016-07-26 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15393410#comment-15393410
 ] 

Guangya Liu edited comment on MESOS-5700 at 7/26/16 8:20 AM:
-

Did some testing to see how much {{addable}} and {{subtractable}} contribute to 
the resources benchmark test. The result is that {{those two validations do 
not cost much time and we can ignore them}}. cc [~bmahler] [~klaus1982]

The test steps are as follows:
1) Check out two copies of the source code, mesos-1 and mesos-2, and apply the 
patch https://reviews.apache.org/r/50380/ to both copies.
2) Update the code in mesos-1 by removing both {{addable}} and 
{{subtractable}} from resources {{+=}} and {{-=}}. The code diff is as follows:
{code}
diff --git a/src/common/resources.cpp b/src/common/resources.cpp
index 3dbff24..d770e98 100644
--- a/src/common/resources.cpp
+++ b/src/common/resources.cpp
@@ -227,6 +227,7 @@ bool operator!=(const Resource& left, const Resource& right)

 namespace internal {

+#if 0
 // Tests if we can add two Resource objects together resulting in one
 // valid Resource object. For example, two Resource objects with
 // different name, type or role are not addable.
@@ -277,6 +278,7 @@ static bool addable(const Resource& left, const Resource& right)

   return true;
 }
+#endif


 // Tests if we can subtract "right" from "left" resulting in one valid
@@ -1381,11 +1383,9 @@ void Resources::add(const Resource& that)

   bool found = false;
   foreach (Resource& resource, resources) {
-if (internal::addable(resource, that)) {
   resource += that;
   found = true;
   break;
-}
   }

   // Cannot be combined with any existing Resource object.
@@ -1439,7 +1439,6 @@ void Resources::subtract(const Resource& that)
   for (int i = 0; i < resources.size(); i++) {
 Resource* resource = resources.Mutable(i);

-if (internal::subtractable(*resource, that)) {
   *resource -= that;

   // Remove the resource if it becomes invalid or zero. We need
@@ -1455,7 +1454,6 @@ void Resources::subtract(const Resource& that)
   }

   break;
-}
   }
 }
{code}
3) Build those two copies and run benchmark test 
{{ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2}}.

Test result without validation for both {{addable}} and {{subtractable}} 
{code}
[==] Running 1 test from 1 test case.
[--] Global test environment set-up.
[--] 1 test from ResourcesOperators/Resources_BENCHMARK_Test
[ RUN  ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2
Took 2.833678secs to perform 1000 'total += r' operations on ports(*):[1-2, 
4-5, 7-8, 10-11, 13-14, 16-17, 1...
Took 3.656634secs to perform 1000 'total -= r' operations on ports(*):[1-2, 
4-5, 7-8, 10-11, 13-14, 16-17, 1...
Took 3.012337secs to perform 1000 'total = total + r' operations on 
ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1...
Took 3.650337secs to perform 1000 'total = total - r' operations on 
ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1...
[   OK ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2 (13155 ms)
[--] 1 test from ResourcesOperators/Resources_BENCHMARK_Test (13155 ms 
total)

[--] Global test environment tear-down
[==] 1 test from 1 test case ran. (13174 ms total)
[  PASSED  ] 1 test.
{code}

Test result with validation for both {{addable}} and {{subtractable}} 
{code}
[==] Running 1 test from 1 test case.
[--] Global test environment set-up.
[--] 1 test from ResourcesOperators/Resources_BENCHMARK_Test
[ RUN  ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2
Took 2.707476secs to perform 1000 'total += r' operations on ports(*):[1-2, 
4-5, 7-8, 10-11, 13-14, 16-17, 1...
Took 3.49798secs to perform 1000 'total -= r' operations on ports(*):[1-2, 4-5, 
7-8, 10-11, 13-14, 16-17, 1...
Took 2.911038secs to perform 1000 'total = total + r' operations on 
ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1...
Took 3.692435secs to perform 1000 'total = total - r' operations on 
ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1...
[   OK ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2 (12811 ms)
[--] 1 test from ResourcesOperators/Resources_BENCHMARK_Test (12811 ms 
total)

[--] Global test environment tear-down
[==] 1 test from 1 test case ran. (12830 ms total)
[  PASSED  ] 1 test.
{code}

Please refer to 
https://docs.google.com/document/d/1D5qqkEh28vnS-2j3F1K8liYS8ThtSjeLJ4AvogIoxjk/edit?ts=57971af2#
 for more details and the {{valgrind --tool=callgrind}} diagram.


was (Author: gyliu):
Did some test for how does {{addable}} and {{subtractable}} contribute to 
resources benchmark test, the result is that {{those two validations does not 
cost much time and we can ignore it}}. cc [~bmahler] [~klaus1982]

Test steps are as following:
1) Checkout two source code copies: mesos-1 and mesos-2, apply patch 

[jira] [Commented] (MESOS-5700) Benchmark for Resource class (protobuf vs. C++)

2016-07-26 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15393410#comment-15393410
 ] 

Guangya Liu commented on MESOS-5700:


I tested how much {{addable}} and {{subtractable}} contribute to the 
resources benchmark test. The result is that {{these two validations do not 
cost much time and can be ignored}}. cc [~bmahler] [~klaus1982]

The test steps are as follows:
1) Check out two copies of the source code, mesos-1 and mesos-2, and apply the 
patch https://reviews.apache.org/r/50380/ to both copies.
2) Update the code in mesos-1 by removing both {{addable}} and {{subtractable}} 
from resources {{+=}} and {{-=}}. The code diff is as follows:
{code}
diff --git a/src/common/resources.cpp b/src/common/resources.cpp
index 3dbff24..d770e98 100644
--- a/src/common/resources.cpp
+++ b/src/common/resources.cpp
@@ -227,6 +227,7 @@ bool operator!=(const Resource& left, const Resource& right)

 namespace internal {

+#if 0
 // Tests if we can add two Resource objects together resulting in one
 // valid Resource object. For example, two Resource objects with
 // different name, type or role are not addable.
@@ -277,6 +278,7 @@ static bool addable(const Resource& left, const Resource& right)

   return true;
 }
+#endif


 // Tests if we can subtract "right" from "left" resulting in one valid
@@ -1381,11 +1383,9 @@ void Resources::add(const Resource& that)

   bool found = false;
   foreach (Resource& resource, resources) {
-if (internal::addable(resource, that)) {
   resource += that;
   found = true;
   break;
-}
   }

   // Cannot be combined with any existing Resource object.
@@ -1439,7 +1439,6 @@ void Resources::subtract(const Resource& that)
   for (int i = 0; i < resources.size(); i++) {
 Resource* resource = resources.Mutable(i);

-if (internal::subtractable(*resource, that)) {
   *resource -= that;

   // Remove the resource if it becomes invalid or zero. We need
@@ -1455,7 +1454,6 @@ void Resources::subtract(const Resource& that)
   }
{code}
3) Build both copies and run the benchmark test 
{{ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2}}.

Test result without validation for both {{addable}} and {{subtractable}}:
{code}
[==] Running 1 test from 1 test case.
[--] Global test environment set-up.
[--] 1 test from ResourcesOperators/Resources_BENCHMARK_Test
[ RUN  ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2
Took 2.833678secs to perform 1000 'total += r' operations on ports(*):[1-2, 
4-5, 7-8, 10-11, 13-14, 16-17, 1...
Took 3.656634secs to perform 1000 'total -= r' operations on ports(*):[1-2, 
4-5, 7-8, 10-11, 13-14, 16-17, 1...
Took 3.012337secs to perform 1000 'total = total + r' operations on 
ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1...
Took 3.650337secs to perform 1000 'total = total - r' operations on 
ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1...
[   OK ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2 (13155 ms)
[--] 1 test from ResourcesOperators/Resources_BENCHMARK_Test (13155 ms 
total)

[--] Global test environment tear-down
[==] 1 test from 1 test case ran. (13174 ms total)
[  PASSED  ] 1 test.
{code}

Test result with validation for both {{addable}} and {{subtractable}}:
{code}
[==] Running 1 test from 1 test case.
[--] Global test environment set-up.
[--] 1 test from ResourcesOperators/Resources_BENCHMARK_Test
[ RUN  ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2
Took 2.707476secs to perform 1000 'total += r' operations on ports(*):[1-2, 
4-5, 7-8, 10-11, 13-14, 16-17, 1...
Took 3.49798secs to perform 1000 'total -= r' operations on ports(*):[1-2, 4-5, 
7-8, 10-11, 13-14, 16-17, 1...
Took 2.911038secs to perform 1000 'total = total + r' operations on 
ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1...
Took 3.692435secs to perform 1000 'total = total - r' operations on 
ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1...
[   OK ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2 (12811 ms)
[--] 1 test from ResourcesOperators/Resources_BENCHMARK_Test (12811 ms 
total)

[--] Global test environment tear-down
[==] 1 test from 1 test case ran. (12830 ms total)
[  PASSED  ] 1 test.
{code}

Please refer to 
https://docs.google.com/document/d/1D5qqkEh28vnS-2j3F1K8liYS8ThtSjeLJ4AvogIoxjk/edit?ts=57971af2#
 for more details and the {{valgrind --tool=callgrind}} diagram.
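
For readers unfamiliar with the check toggled in the diff above, here is a minimal sketch of the pattern. It is not the Mesos code: {{SimpleResource}}, {{addable}} and {{add}} below are hypothetical stand-ins that only model scalar values. The rule that resources with a different name, type or role are not addable is taken from the comment in the diff.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for the Resource protobuf.
struct SimpleResource {
  std::string name;
  std::string type;
  std::string role;
  double value;
};

// Mirrors the rule quoted in the diff: resources with a different
// name, type or role cannot be added together.
bool addable(const SimpleResource& left, const SimpleResource& right) {
  return left.name == right.name &&
         left.type == right.type &&
         left.role == right.role;
}

// Adds `that` into `resources`: merges it into the first addable entry,
// or appends a new entry when no existing entry can absorb it.
void add(std::vector<SimpleResource>& resources, const SimpleResource& that) {
  for (SimpleResource& resource : resources) {
    if (addable(resource, that)) {
      resource.value += that.value;
      return;
    }
  }
  resources.push_back(that);
}
```

In this model, dropping the {{addable}} call would merge every operand into the first entry, so the comparison above presumably only makes sense for a benchmark whose operands all share one name, type and role.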

> Benchmark for Resource class (protobuf vs. C++)
> ---
>
> Key: MESOS-5700
> URL: https://issues.apache.org/jira/browse/MESOS-5700
> Project: Mesos
>  Issue Type: Bug
>Reporter: Klaus Ma
>Assignee: Klaus Ma
> Attachments: hashmap.diff, name_roleId.diff, port.perf.log, 
> reservation.perf.log
>
>
> Add 

[jira] [Commented] (MESOS-3157) only perform batch resource allocations

2016-07-25 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15391660#comment-15391660
 ] 

Guangya Liu commented on MESOS-3157:


[~jjanco] Any update on this? Are you still working on it?

> only perform batch resource allocations
> ---
>
> Key: MESOS-3157
> URL: https://issues.apache.org/jira/browse/MESOS-3157
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
>Reporter: James Peach
>Assignee: Jacob Janco
>
> Our deployment environments have a lot of churn, with many short-lived 
> frameworks that often revive offers. Running the allocator takes a long time 
> (from seconds up to minutes).
> In this situation, event-triggered allocation causes the event queue in the 
> allocator process to get very long, and the allocator effectively becomes 
> unresponsive (e.g. a revive offers message takes too long to come to the head 
> of the queue).
> We have been running a patch to remove all the event-triggered allocations 
> and only allocate from the batch task 
> {{HierarchicalAllocatorProcess::batch}}. This works great and really improves 
> responsiveness.
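
The difference between event-triggered and batch-only allocation described above can be sketched with a small model. Everything here is hypothetical: {{BatchOnlyAllocator}} is not {{HierarchicalAllocatorProcess}}, and the {{pending_}} flag is an assumption added for illustration; the patch mentioned above may simply allocate on every batch tick.

```cpp
#include <cstddef>

// Hypothetical model that counts how often the expensive allocation runs.
class BatchOnlyAllocator {
public:
  // Called for every event (revive, framework added, ...). Instead of
  // allocating immediately, only remember that work is pending.
  void onEvent() { pending_ = true; }

  // Called once per batch interval by a timer. Runs at most one
  // allocation, no matter how many events arrived since the last tick.
  void onBatchTick() {
    if (pending_) {
      ++allocationRuns_;  // Stand-in for the expensive allocation.
      pending_ = false;
    }
  }

  std::size_t allocationRuns() const { return allocationRuns_; }

private:
  bool pending_ = false;
  std::size_t allocationRuns_ = 0;
};
```

With this shape, a burst of 1000 events between two ticks costs one allocation run rather than 1000, which is the kind of queue relief the description reports.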



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5898) Make resources benchmark test for ports -=/- more accurate

2016-07-25 Thread Guangya Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guangya Liu updated MESOS-5898:
---
Description: 
When I run the benchmark test for port resources, I get the following result: 
`-=` and `-` take only about 10ms, which does not reflect the real cost of 
performing 1000 port operations with `-=` and `-`.

The root cause is that the current calculation always uses the same port 
range. For ports, the formula for `+` is {{a+a+a+a+...+a==a}}; for `-`, it is 
{{a-a=0}} followed by {{0-a=0}}. 

With {{0-a=0}}, the code at 
https://github.com/apache/mesos/blob/master/src/common/values.cpp#L544 
performs no validation because {{left}} is empty.

{code}
./bin/mesos-tests.sh --benchmark 
--gtest_filter="*Resources_BENCHMARK_Test.Arithmetic/2"
[==] Running 1 test from 1 test case.
[--] Global test environment set-up.
[--] 1 test from ResourcesOperators/Resources_BENCHMARK_Test
[ RUN  ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2
Took 3.219217secs to perform 1000 'total += r' operations on ports(*):[1-2, 
4-5, 7-8, 10-11, 13-14, 16-17, 1...
Took 10207us to perform 1000 'total -= r' operations on ports(*):[1-2, 4-5, 
7-8, 10-11, 13-14, 16-17, 1...
Took 3.515383secs to perform 1000 'total = total + r' operations on 
ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1...
Took 10208us to perform 1000 'total = total - r' operations on ports(*):[1-2, 
4-5, 7-8, 10-11, 13-14, 16-17, 1...
[   OK ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2 (6759 ms)
[--] 1 test from ResourcesOperators/Resources_BENCHMARK_Test (6759 ms 
total)

[--] Global test environment tear-down
[==] 1 test from 1 test case ran. (6801 ms total)
[  PASSED  ] 1 test.
{code}
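
The effect described above can be reproduced with a small model. This is a sketch under stated assumptions, not the code in {{values.cpp}}: {{Ports}}, {{subtract}} and {{countRealSubtractions}} are hypothetical names, and a plain set of port numbers stands in for the ranges type.

```cpp
#include <cstddef>
#include <set>

// Hypothetical stand-in for a ports resource: a plain set of port numbers.
using Ports = std::set<int>;

// Removes every port in `right` from `left`. Returns false without doing
// any work when `left` is already empty, which models the early exit
// described in the issue.
bool subtract(Ports& left, const Ports& right) {
  if (left.empty()) {
    return false;
  }
  for (int port : right) {
    left.erase(port);
  }
  return true;
}

// Counts how many of `iterations` subtractions of the same operand
// actually had to do work.
std::size_t countRealSubtractions(Ports total, const Ports& r,
                                  std::size_t iterations) {
  std::size_t real = 0;
  for (std::size_t i = 0; i < iterations; ++i) {
    if (subtract(total, r)) {
      ++real;
    }
  }
  return real;
}
```

Calling {{countRealSubtractions(a, a, 1000)}} performs real work only on the first iteration, since {{a-a}} leaves {{total}} empty and every later {{0-a}} returns early. A benchmark that wants to time 1000 real subtractions would need operands that keep {{total}} non-empty.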

  was:
When I run benchmark test for port resources, I can get the following result, 
the `-=` and `-` only consumed 10ms, this cannot reflect the real time of 
operating 1000 ports with `-=` and `-`.

The root cause is that  the current calculation is always using same port 
range, with port, the formula for `+` is {a+a+a+a+...+a==a}; for `-`, it will 
be {a-a=0} and {0-a=0}. 

With {0-a=0}, the code here 
https://github.com/apache/mesos/blob/master/src/common/values.cpp#L544 will 
cause there is no validation as the {{left}} is empty.

{code}
./bin/mesos-tests.sh --benchmark 
--gtest_filter="*Resources_BENCHMARK_Test.Arithmetic/2"
[==] Running 1 test from 1 test case.
[--] Global test environment set-up.
[--] 1 test from ResourcesOperators/Resources_BENCHMARK_Test
[ RUN  ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2
Took 3.219217secs to perform 1000 'total += r' operations on ports(*):[1-2, 
4-5, 7-8, 10-11, 13-14, 16-17, 1...
Took 10207us to perform 1000 'total -= r' operations on ports(*):[1-2, 4-5, 
7-8, 10-11, 13-14, 16-17, 1...
Took 3.515383secs to perform 1000 'total = total + r' operations on 
ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1...
Took 10208us to perform 1000 'total = total - r' operations on ports(*):[1-2, 
4-5, 7-8, 10-11, 13-14, 16-17, 1...
[   OK ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2 (6759 ms)
[--] 1 test from ResourcesOperators/Resources_BENCHMARK_Test (6759 ms 
total)

[--] Global test environment tear-down
[==] 1 test from 1 test case ran. (6801 ms total)
[  PASSED  ] 1 test.
{code}


> Make resources benchmark test for ports -=/- more accurate
> --
>
> Key: MESOS-5898
> URL: https://issues.apache.org/jira/browse/MESOS-5898
> Project: Mesos
>  Issue Type: Bug
>Reporter: Guangya Liu
>Assignee: Guangya Liu
>
> When I run the benchmark test for port resources, I get the following result: 
> `-=` and `-` take only about 10ms, which does not reflect the real cost of 
> performing 1000 port operations with `-=` and `-`.
> The root cause is that the current calculation always uses the same port 
> range. For ports, the formula for `+` is {{a+a+a+a+...+a==a}}; for `-`, it 
> is {{a-a=0}} followed by {{0-a=0}}. 
> With {{0-a=0}}, the code at 
> https://github.com/apache/mesos/blob/master/src/common/values.cpp#L544 
> performs no validation because {{left}} is empty.
> {code}
> ./bin/mesos-tests.sh --benchmark 
> --gtest_filter="*Resources_BENCHMARK_Test.Arithmetic/2"
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from ResourcesOperators/Resources_BENCHMARK_Test
> [ RUN  ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2
> Took 3.219217secs to perform 1000 'total += r' operations on ports(*):[1-2, 
> 4-5, 7-8, 10-11, 13-14, 16-17, 1...
> Took 10207us to perform 1000 'total -= r' operations on ports(*):[1-2, 4-5, 
> 7-8, 10-11, 13-14, 16-17, 1...
> Took 3.515383secs to perform 1000 'total 

[jira] [Created] (MESOS-5898) Make resources benchmark test for ports -=/- more accurate

2016-07-25 Thread Guangya Liu (JIRA)
Guangya Liu created MESOS-5898:
--

 Summary: Make resources benchmark test for ports -=/- more accurate
 Key: MESOS-5898
 URL: https://issues.apache.org/jira/browse/MESOS-5898
 Project: Mesos
  Issue Type: Bug
Reporter: Guangya Liu
Assignee: Guangya Liu


When I run the benchmark test for port resources, I get the following result: 
`-=` and `-` take only about 10ms, which does not reflect the real cost of 
performing 1000 port operations with `-=` and `-`.

The root cause is that the current calculation always uses the same port 
range. For ports, the formula for `+` is {a+a+a+a+...+a==a}; for `-`, it is 
{a-a=0} followed by {0-a=0}. 

With {0-a=0}, the code at 
https://github.com/apache/mesos/blob/master/src/common/values.cpp#L544 
performs no validation because {{left}} is empty.

{code}
./bin/mesos-tests.sh --benchmark 
--gtest_filter="*Resources_BENCHMARK_Test.Arithmetic/2"
[==] Running 1 test from 1 test case.
[--] Global test environment set-up.
[--] 1 test from ResourcesOperators/Resources_BENCHMARK_Test
[ RUN  ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2
Took 3.219217secs to perform 1000 'total += r' operations on ports(*):[1-2, 
4-5, 7-8, 10-11, 13-14, 16-17, 1...
Took 10207us to perform 1000 'total -= r' operations on ports(*):[1-2, 4-5, 
7-8, 10-11, 13-14, 16-17, 1...
Took 3.515383secs to perform 1000 'total = total + r' operations on 
ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1...
Took 10208us to perform 1000 'total = total - r' operations on ports(*):[1-2, 
4-5, 7-8, 10-11, 13-14, 16-17, 1...
[   OK ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2 (6759 ms)
[--] 1 test from ResourcesOperators/Resources_BENCHMARK_Test (6759 ms 
total)

[--] Global test environment tear-down
[==] 1 test from 1 test case ran. (6801 ms total)
[  PASSED  ] 1 test.
{code}





[jira] [Commented] (MESOS-4770) Investigate performance improvements for 'Resources' class.

2016-07-23 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15390638#comment-15390638
 ] 

Guangya Liu commented on MESOS-4770:


[~jvanremoortere] did some investigation on this, and the prototype code is 
here (a bit old, but good enough for investigation):

1) 
https://github.com/jmlvanre/mesos/commit/f39f49ca0876f61fc94e752fc3c4f14377b1d329
2) 
https://github.com/jmlvanre/mesos/commit/7b4ac74449044d892e25ee31a297d50254afd1e0
3) 
https://github.com/jmlvanre/mesos/commit/4fc05821b4fa3c30dd1fed66ba7fc4498ee29efb

The performance improved by 2x according to [~jvanremoortere]'s test.

> Investigate performance improvements for 'Resources' class.
> ---
>
> Key: MESOS-4770
> URL: https://issues.apache.org/jira/browse/MESOS-4770
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benjamin Mahler
>Priority: Critical
>
> Currently we have some performance issues when we have heavy usage of the 
> {{Resources}} class. Currently, we tend to work around these issues (e.g. 
> reduce the amount of Resources arithmetic operations in the caller code).
> The implementation of {{Resources}} currently consists of wrapping underlying 
> {{Resource}} protobuf objects and manipulating them. This is fairly expensive 
> compared to doing things more directly with C++ objects.
> This ticket is to explore the performance improvements of using C++ objects 
> more directly instead of working off of {{Resource}} objects.





[jira] [Created] (MESOS-5869) Disable resources validation for `+=` and `-=`

2016-07-19 Thread Guangya Liu (JIRA)
Guangya Liu created MESOS-5869:
--

 Summary: Disable resources validation for `+=` and `-=`
 Key: MESOS-5869
 URL: https://issues.apache.org/jira/browse/MESOS-5869
 Project: Mesos
  Issue Type: Bug
Reporter: Guangya Liu
Assignee: Guangya Liu


The `validation` consumes quite a lot of time when doing resources `+=` and 
`-=`, but it is not needed for those operations, so we should remove this check.

Based on the test results after removing the `validation`, the performance of 
resources `+=` and `-=` improves by roughly 9x in the sorter test. For port 
ranges, `+=` improves by roughly 4x and `-=` by roughly 1000x.

Sorter Benchmark test before fix:
{code}
[==] Running 1 test from 1 test case.
[--] Global test environment set-up.
[--] 1 test from AgentAndClientCount/Sorter_BENCHMARK_Test
[ RUN  ] AgentAndClientCount/Sorter_BENCHMARK_Test.FullSort/35
Using 5 agents and 1000 clients
Added 1000 clients in 23305us
Added 5 agents in 1.174069secs
Added allocations for 5 agents in 40.562802secs
Full sort of 1000 clients took 38193us
No-op sort of 1000 clients took 382us
[   OK ] AgentAndClientCount/Sorter_BENCHMARK_Test.FullSort/35 (43032 ms)
[--] 1 test from AgentAndClientCount/Sorter_BENCHMARK_Test (43032 ms 
total)

[--] Global test environment tear-down
[==] 1 test from 1 test case ran. (43054 ms total)
[  PASSED  ] 1 test.
{code}

Sorter Benchmark test after fix:
{code}
[==] Running 1 test from 1 test case.
[--] Global test environment set-up.
[--] 1 test from AgentAndClientCount/Sorter_BENCHMARK_Test
[ RUN  ] AgentAndClientCount/Sorter_BENCHMARK_Test.FullSort/35
Using 5 agents and 1000 clients
Added 1000 clients in 25846us
Added 5 agents in 1.092462secs
Added allocations for 5 agents in 4.397859secs
Full sort of 1000 clients took 35051us
No-op sort of 1000 clients took 551us
[   OK ] AgentAndClientCount/Sorter_BENCHMARK_Test.FullSort/35 (6897 ms)
[--] 1 test from AgentAndClientCount/Sorter_BENCHMARK_Test (6897 ms 
total)

[--] Global test environment tear-down
[==] 1 test from 1 test case ran. (6920 ms total)
[  PASSED  ] 1 test.
{code}

Ports resources benchmark test before fix:
{code}
[==] Running 1 test from 1 test case.
[--] Global test environment set-up.
[--] 1 test from ResourcesOperators/Resources_BENCHMARK_Test
[ RUN  ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2
Took 12.478841secs to perform 1000 'total += r' operations on ports(*):[1-2, 
4-5, 7-8, 10-11, 13-14, 16-17, 1...
Took 8.512399secs to perform 1000 'total -= r' operations on ports(*):[1-2, 
4-5, 7-8, 10-11, 13-14, 16-17, 1...
Took 11.296542secs to perform 1000 'total = total + r' operations on 
ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1...
Took 8.517692secs to perform 1000 'total = total - r' operations on 
ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1...
[   OK ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2 (40808 ms)
[--] 1 test from ResourcesOperators/Resources_BENCHMARK_Test (40808 ms 
total)

[--] Global test environment tear-down
[==] 1 test from 1 test case ran. (40832 ms total)
[  PASSED  ] 1 test.
{code}

Ports resources benchmark test after fix:
{code}
[==] Running 1 test from 1 test case.
[--] Global test environment set-up.
[--] 1 test from ResourcesOperators/Resources_BENCHMARK_Test
[ RUN  ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2
Took 2.827012secs to perform 1000 'total += r' operations on ports(*):[1-2, 
4-5, 7-8, 10-11, 13-14, 16-17, 1...
Took 8841us to perform 1000 'total -= r' operations on ports(*):[1-2, 4-5, 7-8, 
10-11, 13-14, 16-17, 1...
Took 3.313112secs to perform 1000 'total = total + r' operations on 
ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1...
Took 12415us to perform 1000 'total = total - r' operations on ports(*):[1-2, 
4-5, 7-8, 10-11, 13-14, 16-17, 1...
[   OK ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2 (6164 ms)
[--] 1 test from ResourcesOperators/Resources_BENCHMARK_Test (6164 ms 
total)

[--] Global test environment tear-down
[==] 1 test from 1 test case ran. (6187 ms total)
[  PASSED  ] 1 test.
{code}
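
The speedup factors can be recomputed directly from the timings in the logs above. The snippet below is only a worked-arithmetic sketch; {{Measurement}}, {{speedup}} and {{reportedMeasurements}} are names made up for this illustration, and the numbers are copied from the four benchmark runs.

```cpp
#include <string>
#include <vector>

// One before/after pair copied from the benchmark logs above.
struct Measurement {
  std::string name;
  double beforeSecs;
  double afterSecs;
};

// Speedup factor: how many times faster the "after" run is.
double speedup(const Measurement& m) {
  return m.beforeSecs / m.afterSecs;
}

// The three timings quoted in the logs; 8841us is written as 0.008841secs.
std::vector<Measurement> reportedMeasurements() {
  return {
      {"sorter: added allocations", 40.562802, 4.397859},
      {"ports: total += r", 12.478841, 2.827012},
      {"ports: total -= r", 8.512399, 0.008841},
  };
}
```

With these inputs the factors come out at roughly 9.2x, 4.4x and 963x. The last figure should be read together with MESOS-5898, which points out that the `-=` timing after the fix mostly measures subtraction from an empty resource.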





[jira] [Commented] (MESOS-4558) Reduce the running time of benchmark tests.

2016-07-13 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15374768#comment-15374768
 ] 

Guangya Liu commented on MESOS-4558:


Selectively running benchmark tests is also an option, but I am not sure 
whether there is any logic for selecting representative benchmark tests.

Take the patch https://reviews.apache.org/r/49784/ as an example: once it is 
merged, it will introduce some cases where the agent count is less than the 
framework count, which may leave some frameworks without resources while the 
allocator tries to allocate resources on fully used agents. This is a good 
test case for checking whether fully used agents impact the performance of the 
allocator. But the problem is how to select, with some filters, the cases that 
cover fully used agents.

Also, in https://reviews.apache.org/r/49784/ we enabled {{batchsize}} to loop 
over fewer frameworks. This seems a simple way to decrease the running time of 
the benchmark test without updating the filter logic of ASF CI.

> Reduce the running time of benchmark tests.
> ---
>
> Key: MESOS-4558
> URL: https://issues.apache.org/jira/browse/MESOS-4558
> Project: Mesos
>  Issue Type: Task
>Reporter: Vinod Kone
>  Labels: newbie++
>
> Currently benchmark tests take a long time (>5 hours). It would be nice to 
> reduce the total time taken by the benchmark tests to enable us to run them 
> on ASF CI.
> Command to run only benchmark tests
> {code}
> MESOS_BENCHMARK=1 GTEST_FILTER="*BENCHMARK*" make check
> {code}





[jira] [Commented] (MESOS-5834) Mesos may pass to the Docker daemon --volume-driver multiple times.

2016-07-12 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15373179#comment-15373179
 ] 

Guangya Liu commented on MESOS-5834:


The {{driver}} field is optional, and Docker also suggests creating the 
volume explicitly via {{docker volume create}} before using it. If you create 
the docker volumes explicitly and do not set {{driver}}, there will be no such 
issue; otherwise, {{stderr}} will show an error message such as {{Error 
response from daemon: create aa: conflict: volume name must be unique.}} Is 
this behaviour OK for you?

{code}
message DockerVolume {
  // Driver of the volume, it can be flocker, convoy, rexray etc.
  optional string driver = 1;

  // Name of the volume.
  required string name = 2;

  // Volume driver specific options.
  optional Parameters driver_options = 3;
}
{code}

> Mesos may pass to the Docker daemon --volume-driver multiple times.
> ---
>
> Key: MESOS-5834
> URL: https://issues.apache.org/jira/browse/MESOS-5834
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 1.0.0
>Reporter: Gastón Kleiman
>  Labels: mesosphere
>
> https://github.com/apache/mesos/blob/master/src/docker/docker.cpp#L590 will 
> append the "--volume-driver" flag to argv once per Volume.
> According to https://github.com/docker/docker/issues/16069 this flag can only 
> be specified once.
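
The problem quoted above, one {{--volume-driver}} flag appended per volume, can be avoided by collecting the drivers first and emitting the flag at most once. The sketch below is a hypothetical illustration and not the code in {{docker.cpp}}: {{SimpleVolume}} and {{appendVolumeDriver}} are invented names, and rejecting volumes with different drivers is an assumption about the desired behaviour.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified volume description.
struct SimpleVolume {
  std::string name;
  std::string driver;  // Empty when no driver is set.
};

// Appends --volume-driver to `argv` at most once. Returns false, leaving
// `argv` untouched, when the volumes name more than one distinct driver,
// since a single flag cannot express that.
bool appendVolumeDriver(const std::vector<SimpleVolume>& volumes,
                        std::vector<std::string>& argv) {
  std::set<std::string> drivers;
  for (const SimpleVolume& volume : volumes) {
    if (!volume.driver.empty()) {
      drivers.insert(volume.driver);
    }
  }

  if (drivers.size() > 1) {
    return false;
  }

  if (!drivers.empty()) {
    argv.push_back("--volume-driver=" + *drivers.begin());
  }
  return true;
}
```

Three volumes that all use {{convoy}} (or leave the driver unset) then yield a single {{--volume-driver=convoy}} argument instead of one per volume.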





[jira] [Commented] (MESOS-4558) Reduce the running time of benchmark tests.

2016-07-10 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15369990#comment-15369990
 ] 

Guangya Liu commented on MESOS-4558:


[~jjanco] is trying to make the number of loop iterations in the benchmark 
configurable via a batch size, which can reduce the running time of the 
benchmark test. Please refer to https://reviews.apache.org/r/49616/ for details.

> Reduce the running time of benchmark tests.
> ---
>
> Key: MESOS-4558
> URL: https://issues.apache.org/jira/browse/MESOS-4558
> Project: Mesos
>  Issue Type: Task
>Reporter: Vinod Kone
>  Labels: newbie++
>
> Currently benchmark tests take a long time (>5 hours). It would be nice to 
> reduce the total time taken by the benchmark tests to enable us to run them 
> on ASF CI.
> Command to run only benchmark tests
> {code}
> MESOS_BENCHMARK=1 GTEST_FILTER="*BENCHMARK*" make check
> {code}





[jira] [Assigned] (MESOS-5701) Add benchmark for sorter performance

2016-07-10 Thread Guangya Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guangya Liu reassigned MESOS-5701:
--

Assignee: Guangya Liu

> Add benchmark for sorter performance
> 
>
> Key: MESOS-5701
> URL: https://issues.apache.org/jira/browse/MESOS-5701
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
>Reporter: Klaus Ma
>Assignee: Guangya Liu
>
> Add benchmark of sorter in allocation for Allocation Performance.





[jira] [Commented] (MESOS-5825) Support mounting image volume in mesos containerizer.

2016-07-08 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368671#comment-15368671
 ] 

Guangya Liu commented on MESOS-5825:


[~gilbert] is this a duplicate of 
https://issues.apache.org/jira/browse/MESOS-5465 ? If so, can you please post 
some comments at MESOS-5465? ;-)

> Support mounting image volume in mesos containerizer.
> -
>
> Key: MESOS-5825
> URL: https://issues.apache.org/jira/browse/MESOS-5825
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Gilbert Song
>Assignee: Gilbert Song
>  Labels: containerizer, filesystem, isolator, mesosphere
>
> Mesos containerizer should be able to support mounting image volume type. 
> Specifically, both image rootfs and default manifest should be reachable 
> inside container's mount namespace.





[jira] [Commented] (MESOS-5700) Benchmark for Resource class (protobuf vs. C++)

2016-07-08 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368666#comment-15368666
 ] 

Guangya Liu commented on MESOS-5700:


Based on the investigation by [~jvanremoortere] and [~mcypark], the findings 
are, for example, that (1) copying the protobufs was expensive and (2) looping 
over resources and checking .name() equality was expensive. We may need to 
think of more use cases related to {{Resource}} and translate them into 
benchmark tests. 
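
To make finding (2) concrete, here is a hedged sketch contrasting a linear scan by name with a lookup keyed by name. It does not use the real {{Resource}} protobuf: {{NamedScalar}}, {{findLinear}} and {{ScalarIndex}} are hypothetical, and whether a hash map is actually faster for realistic resource counts is exactly what a benchmark would have to show.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified scalar resource.
struct NamedScalar {
  std::string name;
  double value;
};

// Linear scan: compares names one entry at a time, as in the pattern
// described above. Returns nullptr when no entry matches.
NamedScalar* findLinear(std::vector<NamedScalar>& resources,
                        const std::string& name) {
  for (NamedScalar& resource : resources) {
    if (resource.name == name) {
      return &resource;
    }
  }
  return nullptr;
}

// Alternative: keep the scalars keyed by name so a lookup does not need
// to walk the whole collection.
class ScalarIndex {
public:
  void add(const std::string& name, double value) { values_[name] += value; }

  // Returns 0.0 for names that were never added.
  double get(const std::string& name) const {
    auto it = values_.find(name);
    return it == values_.end() ? 0.0 : it->second;
  }

private:
  std::unordered_map<std::string, double> values_;
};
```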

> Benchmark for Resource class (protobuf vs. C++)
> ---
>
> Key: MESOS-5700
> URL: https://issues.apache.org/jira/browse/MESOS-5700
> Project: Mesos
>  Issue Type: Bug
>Reporter: Klaus Ma
>Assignee: Klaus Ma
>
> Add benchmark of Resource class for Allocation Performance.





[jira] [Commented] (MESOS-5425) Consider using IntervalSet for Port range resource math

2016-07-08 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15367301#comment-15367301
 ] 

Guangya Liu commented on MESOS-5425:


I'm linking MESOS-5700 here because there is a patch 
https://reviews.apache.org/r/49381 that can help you run some benchmark tests.

> Consider using IntervalSet for Port range resource math
> ---
>
> Key: MESOS-5425
> URL: https://issues.apache.org/jira/browse/MESOS-5425
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Joseph Wu
>Assignee: Yanyan Hu
>  Labels: mesosphere
> Attachments: graycol.gif
>
>
> Follow-up JIRA for comments raised in MESOS-3051 (see comments there).
> We should consider utilizing 
> [{{IntervalSet}}|https://github.com/apache/mesos/blob/a0b798d2fac39445ce0545cfaf05a682cd393abe/3rdparty/stout/include/stout/interval.hpp]
>  in [Port range resource 
> math|https://github.com/apache/mesos/blob/a0b798d2fac39445ce0545cfaf05a682cd393abe/src/common/values.cpp#L143].
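
As a rough illustration of the interval arithmetic this proposal is about, the sketch below coalesces port ranges with the standard library only. It is not stout's {{IntervalSet}} and not the code in {{values.cpp}}: {{Range}} and {{coalesce}} are hypothetical names, ranges are assumed to be closed, and range ends are assumed to be well below the maximum of {{uint64_t}}.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// A closed port range [first, second].
using Range = std::pair<uint64_t, uint64_t>;

// Returns the union of `ranges` as sorted, non-overlapping ranges,
// coalescing ranges that overlap or are directly adjacent.
// Assumes every range end is smaller than the maximum uint64_t value.
std::vector<Range> coalesce(std::vector<Range> ranges) {
  std::sort(ranges.begin(), ranges.end());

  std::vector<Range> result;
  for (const Range& range : ranges) {
    if (!result.empty() && range.first <= result.back().second + 1) {
      // Overlapping or adjacent: extend the last range if needed.
      result.back().second = std::max(result.back().second, range.second);
    } else {
      result.push_back(range);
    }
  }
  return result;
}
```

For example, the input {{[4-5, 1-2, 3-3, 10-11]}} coalesces to {{[1-5, 10-11]}}. An interval-set container maintains this invariant on every insertion instead of recomputing it.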





[jira] [Created] (MESOS-5800) code clean up for allocator benchmark test

2016-07-07 Thread Guangya Liu (JIRA)
Guangya Liu created MESOS-5800:
--

 Summary: code clean up for allocator benchmark test
 Key: MESOS-5800
 URL: https://issues.apache.org/jira/browse/MESOS-5800
 Project: Mesos
  Issue Type: Bug
Reporter: Guangya Liu
Assignee: Guangya Liu


We are now trying to introduce some benchmark tests for the allocator, and 
people may refer to the current benchmark tests when writing new ones.

There are two major issues with the current benchmark tests:
1) The output of the benchmark test is {{round 0 allocate took 3.077414secs to 
make 200 offers}}; the {{200}} here is the number of frameworks, not the number 
of offers.
2) Two test cases, {{DeclineOffers}} and {{ResourceLabels}}, do not use a 
templatized test fixture.





[jira] [Updated] (MESOS-5739) Fix Value parsing code to only accept the canonical formats

2016-07-06 Thread Guangya Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guangya Liu updated MESOS-5739:
---
Description: 
We should fix the value parsing code to only accept the canonical formats as 
defined in http://mesos.apache.org/documentation/latest/attributes-resources/ . 
The behaviour after the fix is as follows:

{code}
1. [1-2, [3-4]] is not supported as Ranges; it should be [1-2, 3-4].
2. {a{b, c}d} is not supported as Set; it should be {ab, cd}.
3. Text is checked against [a-zA-Z0-9_/.-].
{code}
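
The three rules can be sketched as two small checks. This is an illustration only, not the Mesos parser: {{isCanonicalText}} and {{isFlat}} are hypothetical helpers, and the treatment of empty strings is an assumption, since it is not stated here.

```cpp
#include <string>

// Returns true if every character of `text` is in [a-zA-Z0-9_/.-].
// An empty string passes vacuously; whether Mesos accepts an empty Text
// value is not stated in this issue.
bool isCanonicalText(const std::string& text) {
  for (char c : text) {
    const bool ok = (c >= 'a' && c <= 'z') ||
                    (c >= 'A' && c <= 'Z') ||
                    (c >= '0' && c <= '9') ||
                    c == '_' || c == '/' || c == '.' || c == '-';
    if (!ok) {
      return false;
    }
  }
  return true;
}

// Returns true if `value` starts with `open`, ends with `close`, and has
// no further `open` or `close` characters in between, i.e. no nesting.
bool isFlat(const std::string& value, char open, char close) {
  if (value.size() < 2 || value.front() != open || value.back() != close) {
    return false;
  }
  const std::string inner = value.substr(1, value.size() - 2);
  return inner.find(open) == std::string::npos &&
         inner.find(close) == std::string::npos;
}
```

Under these checks {{isFlat("[1-2, [3-4]]", '[', ']')}} and {{isFlat("{a{b, c}d}", '{', '}')}} are both false, while the canonical forms {{[1-2, 3-4]}} and {{{ab, cd}}} pass.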


  was:
Enhanced Value parsing:

{code}
1. Did not support [1-2, [3-4]] as Ranges; it should be [1-2, 3-4].
2. Did not support {a{b, c}d} as Set; it should be {ab, cd}
3. Add check for Text against [a-zA-Z0-9_/.-]
{code}



> Fix Value parsing code to only accept the canonical formats
> ---
>
> Key: MESOS-5739
> URL: https://issues.apache.org/jira/browse/MESOS-5739
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Reporter: Klaus Ma
>Assignee: Klaus Ma
>
> We should fix the value parsing code to only accept the canonical formats as 
> defined in http://mesos.apache.org/documentation/latest/attributes-resources/ 
> . The behaviour after the fix is as follows:
> {code}
> 1. [1-2, [3-4]] is not supported as Ranges; it should be [1-2, 3-4].
> 2. {a{b, c}d} is not supported as Set; it should be {ab, cd}.
> 3. Text is checked against [a-zA-Z0-9_/.-].
> {code}





[jira] [Updated] (MESOS-5739) Fix Value parsing code to only accept the canonical formats

2016-07-06 Thread Guangya Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guangya Liu updated MESOS-5739:
---
Summary: Fix Value parsing code to only accept the canonical formats  (was: 
Enhance Value parsing)

> Fix Value parsing code to only accept the canonical formats
> ---
>
> Key: MESOS-5739
> URL: https://issues.apache.org/jira/browse/MESOS-5739
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Reporter: Klaus Ma
>Assignee: Klaus Ma
>
> Enhanced Value parsing:
> {code}
> 1. Did not support [1-2, [3-4]] as Ranges; it should be [1-2, 3-4].
> 2. Did not support {a{b, c}d} as Set; it should be {ab, cd}
> 3. Add check for Text against [a-zA-Z0-9_/.-]
> {code}





[jira] [Commented] (MESOS-5017) Don't consider agents without allocatable resources in the allocator

2016-07-06 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15364414#comment-15364414
 ] 

Guangya Liu commented on MESOS-5017:


I posted a patch at https://reviews.apache.org/r/49694/ , but found that 
performance does not improve much in a benchmark test. [~bmahler] and 
[~jvanremoortere], can you please take a look and share any comments?

> Don't consider agents without allocatable resources in the allocator
> 
>
> Key: MESOS-5017
> URL: https://issues.apache.org/jira/browse/MESOS-5017
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Dario Rexin
>Assignee: Guangya Liu
>Priority: Minor
>
> During the review r/43668/ , an enhancement came up: if an agent has no 
> allocatable resources, the allocator should filter it out at the 
> beginning.
> {quote}
> Joris Van Remoortere Posted 1 month ago (March 16, 2016, 5:04 a.m.)
> Should we filter out slaves that have no allocatable resources?
> If we do, let's make sure we note that we want to pass the original slaveids 
> to the deallocate function
> Dario Rexin 4 weeks ago (March 23, 2016, 4:25 a.m.)
> I'm not sure if it would be a big improvement. Calculating the available 
> resources is somewhat expensive and we have to do it again in the loop, and 
> most slaves will probably have resources available anyway. The reason it's an 
> improvement in the loop is that after we offer the resources to a framework, 
> we can be sure that they are all unavailable to the following frameworks 
> under the same role.
> Klaus Ma 4 weeks ago (March 23, 2016, 11:13 a.m.)
> @joris/dario, I think the improvement depends on the workload pattern: 1.) 
> for short-running tasks, several tasks may finish during the allocation 
> interval, so there may be no improvement; 2.) but for long-running tasks, the 
> slave/agent should be fully used most of the time, so it'll be a big 
> improvement. I previously logged MESOS-4986 to add a filter after stage 1 
> (Quota), but it may be useless after revocable by default.
> Joris Van Remoortere 3 weeks, 6 days ago (March 23, 2016, 8:59 p.m.)
> Can you open a JIRA to consider doing this. Along Klaus' example, I'm not 
> convinced this wouldn't have a large impact in certain scenarios.
> {quote}
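
The filtering idea discussed in the review can be sketched as follows. This is an illustration only, not the allocator's code: `Agent`, `allocatable` and `filterAllocatable` are simplified stand-ins, and the thresholds are assumed values chosen for the example. Following Joris' note, the original list is left untouched so that the full set of ids remains available to the caller.

```cpp
#include <string>
#include <vector>

// Simplified stand-in for an agent. The real allocator tracks full Resources
// objects, not two scalars.
struct Agent {
  std::string id;
  double availableCpus;
  double availableMemMb;
};

// Assumed thresholds below which an agent is treated as having nothing
// worth offering. These values are illustrative.
const double kMinCpus = 0.01;
const double kMinMemMb = 32.0;

bool allocatable(const Agent& agent) {
  return agent.availableCpus >= kMinCpus || agent.availableMemMb >= kMinMemMb;
}

// Returns the ids of agents that still have allocatable resources. The input
// is not modified, so the caller keeps the original list of agents.
std::vector<std::string> filterAllocatable(const std::vector<Agent>& agents) {
  std::vector<std::string> ids;
  for (const Agent& agent : agents) {
    if (allocatable(agent)) {
      ids.push_back(agent.id);
    }
  }
  return ids;
}
```

Whether this pays off depends on the workload, as the review comments point out: the filter costs one pass over all agents and only helps when many of them are fully used.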





[jira] [Updated] (MESOS-5681) c++ based resource and resources object

2016-07-05 Thread Guangya Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guangya Liu updated MESOS-5681:
---

[~yanyanhu] This seems to be a duplicate of MESOS-4770; can you confirm?

> c++ based resource and resources object
> ---
>
> Key: MESOS-5681
> URL: https://issues.apache.org/jira/browse/MESOS-5681
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Yanyan Hu
>  Labels: performance
>
> Follow-up JIRA for MESOS-5425. Currently, the resource object relies on the 
> protobuf to store data internally, but this implementation is inefficient for 
> arithmetic, especially for Ranges subtraction. An interim solution proposed 
> in https://reviews.apache.org/r/48593/ converts Ranges to IntervalSet inline 
> to improve performance. In the long term, we should consider a C++ library 
> based resource object as a permanent solution.
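
To show why an interval representation helps with Ranges subtraction, here is a minimal sketch. It does not use the IntervalSet type mentioned above; it assumes plain sorted, non-overlapping, inclusive ranges in a `std::vector`, and `Range` and `subtract` are names introduced for this example. One pass over both inputs is enough.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Inclusive range [first, second].
using Range = std::pair<std::uint64_t, std::uint64_t>;

// Subtracts `rhs` from `lhs`. Both inputs must be sorted by start value and
// must not contain overlapping ranges.
std::vector<Range> subtract(const std::vector<Range>& lhs,
                            const std::vector<Range>& rhs) {
  std::vector<Range> result;
  std::size_t j = 0;
  for (const Range& l : lhs) {
    std::uint64_t begin = l.first;
    bool consumed = false;
    // Skip ranges in rhs that end before the current lhs range starts.
    while (j < rhs.size() && rhs[j].second < begin) ++j;
    // Walk the rhs ranges that overlap the current lhs range.
    for (std::size_t k = j; k < rhs.size() && rhs[k].first <= l.second; ++k) {
      if (rhs[k].first > begin) {
        result.push_back({begin, rhs[k].first - 1});  // gap before rhs[k]
      }
      if (rhs[k].second >= l.second) {
        consumed = true;  // rhs[k] covers the rest of this lhs range
        break;
      }
      begin = rhs[k].second + 1;
    }
    if (!consumed) {
      result.push_back({begin, l.second});
    }
  }
  return result;
}
```

For example, subtracting [3-4, 8-25] from [1-10, 20-30] leaves [1-2, 5-7, 26-30].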





[jira] [Assigned] (MESOS-5017) Don't consider agents without allocatable resources in the allocator

2016-07-05 Thread Guangya Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guangya Liu reassigned MESOS-5017:
--

Assignee: Guangya Liu  (was: Klaus Ma)

> Don't consider agents without allocatable resources in the allocator
> 
>
> Key: MESOS-5017
> URL: https://issues.apache.org/jira/browse/MESOS-5017
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Dario Rexin
>Assignee: Guangya Liu
>Priority: Minor
>
> During the review r/43668/ , an enhancement came up: if an agent has no 
> allocatable resources, the allocator should filter it out at the 
> beginning.
> {quote}
> Joris Van Remoortere Posted 1 month ago (March 16, 2016, 5:04 a.m.)
> Should we filter out slaves that have no allocatable resources?
> If we do, let's make sure we note that we want to pass the original slaveids 
> to the deallocate function
> Dario Rexin 4 weeks ago (March 23, 2016, 4:25 a.m.)
> I'm not sure if it would be a big improvement. Calculating the available 
> resources is somewhat expensive and we have to do it again in the loop, and 
> most slaves will probably have resources available anyway. The reason it's an 
> improvement in the loop is that after we offer the resources to a framework, 
> we can be sure that they are all unavailable to the following frameworks 
> under the same role.
> Klaus Ma 4 weeks ago (March 23, 2016, 11:13 a.m.)
> @joris/dario, I think the improvement depends on the workload pattern: 1.) 
> for short-running tasks, several tasks may finish during the allocation 
> interval, so there may be no improvement; 2.) but for long-running tasks, the 
> slave/agent should be fully used most of the time, so it'll be a big 
> improvement. I previously logged MESOS-4986 to add a filter after stage 1 
> (Quota), but it may be useless after revocable by default.
> Joris Van Remoortere 3 weeks, 6 days ago (March 23, 2016, 8:59 p.m.)
> Can you open a JIRA to consider doing this. Along Klaus' example, I'm not 
> convinced this wouldn't have a large impact in certain scenarios.
> {quote}





[jira] [Commented] (MESOS-4694) DRFAllocator takes very long to allocate resources with a large number of frameworks

2016-07-05 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363600#comment-15363600
 ] 

Guangya Liu commented on MESOS-4694:


[~drexin] are you still actively working on this? If not, can I take this over? 
Thanks.

> DRFAllocator takes very long to allocate resources with a large number of 
> frameworks
> 
>
> Key: MESOS-4694
> URL: https://issues.apache.org/jira/browse/MESOS-4694
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Affects Versions: 0.26.0, 0.27.0, 0.27.1, 0.27.2, 0.28.0, 0.28.1
>Reporter: Dario Rexin
>Assignee: Dario Rexin
>
> With a growing number of connected frameworks, the allocation time grows to 
> very high numbers. The addition of quota in 0.27 had an additional impact on 
> these numbers. Running `mesos-tests.sh --benchmark 
> --gtest_filter=HierarchicalAllocator_BENCHMARK_Test.DeclineOffers` gives us 
> the following numbers:
> {noformat}
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from HierarchicalAllocator_BENCHMARK_Test
> [ RUN  ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
> Using 2000 slaves and 200 frameworks
> round 0 allocate took 2.921202secs to make 200 offers
> round 1 allocate took 2.85045secs to make 200 offers
> round 2 allocate took 2.823768secs to make 200 offers
> {noformat}
> Increasing the number of frameworks to 2000:
> {noformat}
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from HierarchicalAllocator_BENCHMARK_Test
> [ RUN  ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
> Using 2000 slaves and 2000 frameworks
> round 0 allocate took 28.209454secs to make 2000 offers
> round 1 allocate took 28.469419secs to make 2000 offers
> round 2 allocate took 28.138086secs to make 2000 offers
> {noformat}
> I was able to reduce this time by a substantial amount. After applying the 
> patches:
> {noformat}
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from HierarchicalAllocator_BENCHMARK_Test
> [ RUN  ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
> Using 2000 slaves and 200 frameworks
> round 0 allocate took 1.016226secs to make 2000 offers
> round 1 allocate took 1.102729secs to make 2000 offers
> round 2 allocate took 1.102624secs to make 2000 offers
> {noformat}
> And with 2000 frameworks:
> {noformat}
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from HierarchicalAllocator_BENCHMARK_Test
> [ RUN  ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
> Using 2000 slaves and 2000 frameworks
> round 0 allocate took 12.563203secs to make 2000 offers
> round 1 allocate took 12.437517secs to make 2000 offers
> round 2 allocate took 12.470708secs to make 2000 offers
> {noformat}
> The patches do 3 things to improve the performance of the allocator.
> 1) The total values in the DRFSorter are precalculated per resource type
> 2) In the allocate method, when no resources are available to allocate, we 
> break out of the innermost loop to prevent looping over a large number of 
> frameworks when we have nothing to allocate
> 3) when a framework suppresses offers, we remove it from the sorter instead 
> of just calling continue in the allocation loop - this greatly improves 
> performance in the sorter and prevents looping over frameworks that don't 
> need resources
> Assuming that most of the frameworks behave nicely and suppress offers when 
> they have nothing to schedule, it is fair to assume that point 3) has the 
> biggest impact on the performance. If we suppress offers for 90% of the 
> frameworks in the benchmark test, we see the following numbers:
> {noformat}
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from HierarchicalAllocator_BENCHMARK_Test
> [ RUN  ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
> Using 200 slaves and 2000 frameworks
> round 0 allocate took 11626us to make 200 offers
> round 1 allocate took 22890us to make 200 offers
> round 2 allocate took 21346us to make 200 offers
> {noformat}
> And for 200 frameworks:
> {noformat}
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from HierarchicalAllocator_BENCHMARK_Test
> [ RUN  ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
> Using 2000 slaves and 2000 frameworks
> round 0 allocate took 1.11178secs to make 2000 offers
> round 1 allocate took 1.062649secs to make 2000 offers
> round 2 allocate took 1.080181secs to make 2000 offers
> 

[jira] [Created] (MESOS-5760) MAC OS Build failed

2016-06-30 Thread Guangya Liu (JIRA)
Guangya Liu created MESOS-5760:
--

 Summary: MAC OS Build failed
 Key: MESOS-5760
 URL: https://issues.apache.org/jira/browse/MESOS-5760
 Project: Mesos
  Issue Type: Bug
Reporter: Guangya Liu
Assignee: Guangya Liu


{code}
arwin -DZOOKEEPER_VERSION=\"3.4.8\" 
-I/usr/local/opt/subversion/include/subversion-1 
-I/usr/local/opt/openssl/include -I/usr/include/apr-1 -I/usr/include/apr-1.0  
-D_THREAD_SAFE -pthread -g -O0 -Wno-unused-local-typedef -std=c++11 
-stdlib=libc++ -DGTEST_USE_OWN_TR1_TUPLE=1 -DGTEST_LANG_CXX11 -MT 
tests/mesos_tests-hdfs_tests.o -MD -MP -MF 
tests/.deps/mesos_tests-hdfs_tests.Tpo -c -o tests/mesos_tests-hdfs_tests.o 
`test -f 'tests/hdfs_tests.cpp' || echo '../../src/'`tests/hdfs_tests.cpp
In file included from ../../src/tests/gc_tests.cpp:42:
// distributed with this work for additional information
../../src/linux/fs.hpp:20:10: fatal error: 'mntent.h' file not found
#include <mntent.h>
 ^
mv -f tests/.deps/mesos_tests-executor_http_api_tests.Tpo 
tests/.deps/mesos_tests-executor_http_api_tests.Po
g++ -DPACKAGE_NAME=\"mesos\" -DPACKAGE_TARNAME=\"mesos\" 
-DPACKAGE_VERSION=\"1.0.0\" -DPACKAGE_STRING=\"mesos\ 1.0.0\" 
-DPACKAGE_BUGREPORT=\"\" -DPACKAGE_URL=\"\" -DPACKAGE=\"mesos\" 
-DVERSION=\"1.0.0\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 
-DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 
-DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_DLFCN_H=1 
-DLT_OBJDIR=\".libs/\" -DHAVE_CXX11=1 -DHAVE_PTHREAD_PRIO_INHERIT=1 
-DHAVE_PTHREAD=1 -DHAVE_LIBZ=1 -DHAVE_FTS_H=1 -DHAVE_APR_POOLS_H=1 
-DHAVE_LIBAPR_1=1 -DHAVE_LIBCURL=1 -DMESOS_HAS_JAVA=1 -DHAVE_PYTHON=\"2.7\" 
-DMESOS_HAS_PYTHON=1 -DHAVE_LIBSASL2=1 -DHAVE_SVN_VERSION_H=1 
-DHAVE_LIBSVN_SUBR_1=1 -DHAVE_SVN_DELTA_H=1 -DHAVE_LIBSVN_DELTA_1=1 
-DHAVE_LIBZ=1 -I. -I../../src   -Wall -Werror -DLIBDIR=\"/usr/local/lib\" 
-DPKGLIBEXECDIR=\"/usr/local/libexec/mesos\" 
-DPKGDATADIR=\"/usr/local/share/mesos\" 
-DPKGMODULEDIR=\"/usr/local/lib/mesos/modules\" -I../../include -I../include 
-I../include/mesos -DPICOJSON_USE_INT64 -D__STDC_FORMAT_MACROS -isystem 
../3rdparty/boost-1.53.0 -I../3rdparty/glog-0.3.3/src 
-I../3rdparty/leveldb-1.4/include -I../../3rdparty/libprocess/include 
-I../3rdparty/nvml-352.79 -I../3rdparty/picojson-1.3.0 
-I../3rdparty/protobuf-2.6.1/src -I../../3rdpa
{code}





[jira] [Created] (MESOS-5743) Added a flag parser for hashset.

2016-06-29 Thread Guangya Liu (JIRA)
Guangya Liu created MESOS-5743:
--

 Summary: Added a flag parser for hashset.
 Key: MESOS-5743
 URL: https://issues.apache.org/jira/browse/MESOS-5743
 Project: Mesos
  Issue Type: Bug
Reporter: Guangya Liu
Assignee: Guangya Liu


We are introducing a new master flag that sets multiple resource names to 
exclude from the sorter. It would be better to add a flag parser for hashset 
to parse this flag's list of excluded resource names.
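
A minimal sketch of what such a parser could do is shown below. It is not the stout flags API: it uses `std::unordered_set` as a stand-in for hashset, `parseNameSet` is a hypothetical name, and the comma-separated input format is an assumption made for the example.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_set>

// Hypothetical parser: splits a comma-separated flag value into a set,
// trimming surrounding spaces and tabs and dropping empty tokens.
// Duplicates collapse naturally because the result is a set.
std::unordered_set<std::string> parseNameSet(const std::string& value) {
  std::unordered_set<std::string> names;
  std::istringstream in(value);
  std::string token;
  while (std::getline(in, token, ',')) {
    const std::size_t first = token.find_first_not_of(" \t");
    if (first == std::string::npos) continue;  // empty or blank token
    const std::size_t last = token.find_last_not_of(" \t");
    names.insert(token.substr(first, last - first + 1));
  }
  return names;
}
```

With this sketch, a value such as `gpus, disk,,gpus` yields a set containing only `gpus` and `disk`.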





[jira] [Comment Edited] (MESOS-5621) Enabled calculateShare() to ignore the fairnessExcludeResourceNames

2016-06-24 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15337847#comment-15337847
 ] 

Guangya Liu edited comment on MESOS-5621 at 6/24/16 2:08 PM:
-

https://reviews.apache.org/r/49190/


was (Author: gyliu):
https://reviews.apache.org/r/48906/

> Enabled calculateShare() to ignore the fairnessExcludeResourceNames
> ---
>
> Key: MESOS-5621
> URL: https://issues.apache.org/jira/browse/MESOS-5621
> Project: Mesos
>  Issue Type: Bug
>Reporter: Guangya Liu
>Assignee: Guangya Liu
>
> Enable calculateShare() to ignore the fairnessExcludeResourceNames; 
> fairnessExcludeResourceNames will be a member field of the sorter.
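
As an illustration of the intended behaviour, the sketch below computes a DRF-style dominant share while skipping excluded resource names. It is a simplification under stated assumptions, not the sorter's implementation: resources are modelled as name-to-scalar maps, weights are left out, and the exclusion set is passed as a parameter instead of being a member field.

```cpp
#include <algorithm>
#include <map>
#include <set>
#include <string>

// Returns the dominant share: the largest allocated/total fraction over all
// resource names that are not listed in `fairnessExcludeResourceNames`.
double calculateShare(
    const std::map<std::string, double>& allocated,
    const std::map<std::string, double>& total,
    const std::set<std::string>& fairnessExcludeResourceNames) {
  double share = 0.0;
  for (const auto& entry : total) {
    const std::string& name = entry.first;
    const double capacity = entry.second;
    // Skip excluded resources and resources with no capacity.
    if (fairnessExcludeResourceNames.count(name) > 0 || capacity <= 0.0) {
      continue;
    }
    const auto it = allocated.find(name);
    const double used = (it == allocated.end()) ? 0.0 : it->second;
    share = std::max(share, used / capacity);
  }
  return share;
}
```

With 2 of 8 cpus and 1 of 2 gpus allocated, the share is 0.5 when nothing is excluded and 0.25 when `gpus` is excluded, so a scarce resource no longer dominates the ordering.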





[jira] [Updated] (MESOS-5621) Enabled calculateShare() to ignore the fairnessExcludeResourceNames

2016-06-24 Thread Guangya Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guangya Liu updated MESOS-5621:
---
Description: Enable calculateShare() to ignore the 
fairnessExcludeResourceNames; fairnessExcludeResourceNames will be a member 
field of the sorter.  (was: We need a helper function to get all non scarce 
resources so as to help allocator get the non scarce resources information.)

> Enabled calculateShare() to ignore the fairnessExcludeResourceNames
> ---
>
> Key: MESOS-5621
> URL: https://issues.apache.org/jira/browse/MESOS-5621
> Project: Mesos
>  Issue Type: Bug
>Reporter: Guangya Liu
>Assignee: Guangya Liu
>
> Enable calculateShare() to ignore the fairnessExcludeResourceNames; 
> fairnessExcludeResourceNames will be a member field of the sorter.





[jira] [Updated] (MESOS-5621) Enabled calculateShare() to ignore the fairnessExcludeResourceNames

2016-06-24 Thread Guangya Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guangya Liu updated MESOS-5621:
---
Summary: Enabled calculateShare() to ignore the 
fairnessExcludeResourceNames  (was: Add helper function to get non scarce 
resources)

> Enabled calculateShare() to ignore the fairnessExcludeResourceNames
> ---
>
> Key: MESOS-5621
> URL: https://issues.apache.org/jira/browse/MESOS-5621
> Project: Mesos
>  Issue Type: Bug
>Reporter: Guangya Liu
>Assignee: Guangya Liu
>
> We need a helper function to get all non scarce resources so as to help 
> allocator get the non scarce resources information.





[jira] [Created] (MESOS-5641) Update docker-volume.md to add some content for how to test

2016-06-17 Thread Guangya Liu (JIRA)
Guangya Liu created MESOS-5641:
--

 Summary: Update docker-volume.md to add some content for how to 
test
 Key: MESOS-5641
 URL: https://issues.apache.org/jira/browse/MESOS-5641
 Project: Mesos
  Issue Type: Bug
Reporter: Guangya Liu
Assignee: Guangya Liu


mesos-execute was fixed in MESOS-5265, so the document should be updated to 
show how to use mesos-execute to test the docker volume isolator.





[jira] [Created] (MESOS-5640) Unify the help info for master/agent flags

2016-06-17 Thread Guangya Liu (JIRA)
Guangya Liu created MESOS-5640:
--

 Summary: Unify the help info for master/agent flags
 Key: MESOS-5640
 URL: https://issues.apache.org/jira/browse/MESOS-5640
 Project: Mesos
  Issue Type: Bug
Reporter: Guangya Liu
Priority: Minor


Currently, in master/flags.cpp, some flag help strings end with a "\n" while 
others do not, which makes the output inconsistent.

{code}
--[no-]hostname_lookup 
Whether we should execute a lookup to find out the server's hostname,

 if not explicitly set (via, e.g., `--hostname`).

 True by default; if set to `false` it will cause Mesos

 to use the IP address, unless the hostname is explicitly set. (default: true)
  --http_authenticators=VALUE   
 HTTP authenticator implementation to use when handling requests to

 authenticated endpoints. Use the default

 `basic`, or load an alternate

 HTTP authenticator module using `--modules`.


 Currently there is no support for multiple HTTP authenticators. (default: 
basic)
  --http_framework_authenticators=VALUE 
 HTTP authenticator implementation to use when authenticating HTTP

 frameworks. Use the

 `basic` authenticator or load an

 alternate authenticator module using `--modules`.

 Must be used in conjunction with `--http_authenticate_frameworks`.
{code}

I think we should follow the Linux man page format by adding "\n" to all 
flags.

The following is a sample output for "man ls".

{code}
 -@  Display extended attribute keys and sizes in long (-l) output.

 -1  (The numeric digit ``one''.)  Force output to be one entry per 
line.  This is the default when output is not to a terminal.

 -A  List all entries except for . and ...  Always set for the 
super-user.

 -a  Include directory entries whose names begin with a dot (.).

 -B  Force printing of non-printable characters (as defined by ctype(3) 
and current locale settings) in file names as \xxx, where xxx is the numeric 
value of the character
 in octal.

 -b  As -B, but use C escape codes whenever possible.
{code}





[jira] [Created] (MESOS-5625) Document the overall treatment of scarce resources.

2016-06-16 Thread Guangya Liu (JIRA)
Guangya Liu created MESOS-5625:
--

 Summary: Document the overall treatment of scarce resources.
 Key: MESOS-5625
 URL: https://issues.apache.org/jira/browse/MESOS-5625
 Project: Mesos
  Issue Type: Bug
Reporter: Guangya Liu
Assignee: Guangya Liu


This document should clarify the overall treatment of scarce resources.

Please refer to http://markmail.org/thread/ojoz5zyko2l5srld for some initial 
discussion.






[jira] [Created] (MESOS-5623) Add test cases for scarce resources

2016-06-16 Thread Guangya Liu (JIRA)
Guangya Liu created MESOS-5623:
--

 Summary: Add test cases for scarce resources
 Key: MESOS-5623
 URL: https://issues.apache.org/jira/browse/MESOS-5623
 Project: Mesos
  Issue Type: Bug
Reporter: Guangya Liu
Assignee: Guangya Liu


Add some test cases for scarce resources change.





[jira] [Created] (MESOS-5622) Update allocator to handle scarce resources

2016-06-16 Thread Guangya Liu (JIRA)
Guangya Liu created MESOS-5622:
--

 Summary: Update allocator to handle scarce resources
 Key: MESOS-5622
 URL: https://issues.apache.org/jira/browse/MESOS-5622
 Project: Mesos
  Issue Type: Bug
Reporter: Guangya Liu
Assignee: Guangya Liu


The allocator should be updated to handle scarce resources. The idea is to 
exclude scarce resources from all sorters in the allocator.




