[jira] [Updated] (MESOS-5967) Add support for 'docker image inspect' in our docker abstraction.
[ https://issues.apache.org/jira/browse/MESOS-5967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guangya Liu updated MESOS-5967: --- Target Version/s: 1.3.0 > Add support for 'docker image inspect' in our docker abstraction. > - > > Key: MESOS-5967 > URL: https://issues.apache.org/jira/browse/MESOS-5967 > Project: Mesos > Issue Type: Improvement > Components: containerization, docker >Reporter: Kevin Klues >Assignee: Guangya Liu > Labels: gpu > > Docker's command line tool for {{docker inspect}} can take either a > {{container}}, an {{image}}, or a {{task}} as its argument, and return a JSON > array containing low-level information about that container, image or task. > However, the current {{docker inspect}} support in our docker abstraction > only supports inspecting containers (not images or tasks). We should expand > this to (at least) support images. > In particular, this additional functionality is motivated by the upcoming GPU > support, which needs to inspect the labels in a docker image to decide if it > should inject the required Nvidia volumes into a container. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
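The request above can be illustrated with a minimal sketch. It builds a `docker inspect` command line restricted to images (the Docker CLI accepts `--type=image`) and checks an already-parsed label map for a GPU marker. The function names and the label key `example.gpu.volumes` are hypothetical placeholders for this example; they are not taken from Mesos or from Nvidia's tooling.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: argv for inspecting an image rather than a container.
std::vector<std::string> imageInspectArgv(
    const std::string& dockerPath,
    const std::string& image)
{
  assert(!image.empty());
  return {dockerPath, "inspect", "--type=image", image};
}

// Hypothetical check on labels that were already parsed from the JSON
// output. The label key is a placeholder chosen for this example.
bool needsGpuVolumes(const std::map<std::string, std::string>& labels)
{
  return labels.count("example.gpu.volumes") > 0;
}
```

Keeping the command construction separate from the label check means the same check could be reused however the labels are obtained.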
[jira] [Commented] (MESOS-6638) Update Suppress and Revive to be per-role.
[ https://issues.apache.org/jira/browse/MESOS-6638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15862069#comment-15862069 ] Guangya Liu commented on MESOS-6638: {noformat} commit f40e3d5fb167a691f6a3071f504b77e0def29604 Author: Guangya Liu gy...@apache.org Date: Sat Feb 11 08:24:26 2017 +0800 Added roles field to framework. Added roles field to framework. Review: https://reviews.apache.org/r/56499/ {noformat} > Update Suppress and Revive to be per-role. > -- > > Key: MESOS-6638 > URL: https://issues.apache.org/jira/browse/MESOS-6638 > Project: Mesos > Issue Type: Task > Components: framework api >Reporter: Benjamin Mahler >Assignee: Guangya Liu > > The {{SUPPRESS}} and {{REVIVE}} calls need to be updated to be per-role. I.e. > Include {{Revive.role}} and {{Suppress.role}} fields, indicating which role > the operation is being applied to. > {{Revive}} and {{Suppress}} messages do not currently exist, so these need to > be added. To support the old-style schedulers, we will make the role fields > optional. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MESOS-6638) Update Suppress and Revive to be per-role.
[ https://issues.apache.org/jira/browse/MESOS-6638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859128#comment-15859128 ]

Guangya Liu commented on MESOS-6638:

{noformat}
commit 4fb2a5d2edeca0966c0f3ea3445f9723d0140d09
Author: Guangya Liu
Date: Thu Feb 9 14:40:04 2017 +0800

    Enabled suppress offer per role.

    Enabled suppress offer per role.

    Review: https://reviews.apache.org/r/56330/

commit 20dfd055a20e1238e6a7d52181fc33da9b4460cb
Author: Guangya Liu
Date: Thu Feb 9 14:44:45 2017 +0800

    Enabled `ReviveOffersMessage` support revive per role.

    Review: https://reviews.apache.org/r/56371/

commit 54e65143c5b19915f8ec2bbce35d239b4c5d85d7
Author: Guangya Liu
Date: Thu Feb 9 14:48:19 2017 +0800

    Augmented master `Revive` API to accept `Call::Revive`.

    Augmented master `Revive` API to accept `Call::Revive`.

    Review: https://reviews.apache.org/r/56373/

commit c2388a511c775dd6f392961b06fd7738bf051dbc
Author: Guangya Liu
Date: Thu Feb 9 14:51:27 2017 +0800

    Enabled revive offer per role.

    Enabled revive offer per role.

    Review: https://reviews.apache.org/r/56374/
{noformat}

> Update Suppress and Revive to be per-role.
> --
>
> Key: MESOS-6638
> URL: https://issues.apache.org/jira/browse/MESOS-6638
> Project: Mesos
> Issue Type: Task
> Components: framework api
> Reporter: Benjamin Mahler
> Assignee: Guangya Liu
>
> The {{SUPPRESS}} and {{REVIVE}} calls need to be updated to be per-role. I.e.
> Include {{Revive.role}} and {{Suppress.role}} fields, indicating which role
> the operation is being applied to.
> {{Revive}} and {{Suppress}} messages do not currently exist, so these need to
> be added. To support the old-style schedulers, we will make the role fields
> optional.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
[jira] [Commented] (MESOS-6638) Update Suppress and Revive to be per-role.
[ https://issues.apache.org/jira/browse/MESOS-6638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15858772#comment-15858772 ] Guangya Liu commented on MESOS-6638: {code} commit 748675352964ccfbf4e45d6cd7b4b4cacb1c58bf Author: Guangya Liu gy...@apache.org Date: Thu Feb 9 08:28:43 2017 +0800 Updated Suppress and Revive proto to support per role. Updated Suppress and Revive proto to support per role. Review: https://reviews.apache.org/r/56327/ commit 348c06bb0f06c3229ba897fc7fd568473c5bd11b Author: Guangya Liu gy...@apache.org Date: Thu Feb 9 08:30:21 2017 +0800 Augmented master `Suppress` API to accept `Call::Suppress`. Augmented master `Suppress` API to accept `Call::Suppress`. Review: https://reviews.apache.org/r/56328/ {code} > Update Suppress and Revive to be per-role. > -- > > Key: MESOS-6638 > URL: https://issues.apache.org/jira/browse/MESOS-6638 > Project: Mesos > Issue Type: Task > Components: framework api >Reporter: Benjamin Mahler >Assignee: Guangya Liu > > The {{SUPPRESS}} and {{REVIVE}} calls need to be updated to be per-role. I.e. > Include {{Revive.role}} and {{Suppress.role}} fields, indicating which role > the operation is being applied to. > {{Revive}} and {{Suppress}} messages do not currently exist, so these need to > be added. To support the old-style schedulers, we will make the role fields > optional. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Comment Edited] (MESOS-6638) Update Suppress and Revive to be per-role.
[ https://issues.apache.org/jira/browse/MESOS-6638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15854206#comment-15854206 ] Guangya Liu edited comment on MESOS-6638 at 2/7/17 12:44 PM: - https://reviews.apache.org/r/56327/ Updated Suppress and Revive proto to support per role. https://reviews.apache.org/r/56328/ Augmented master `Suppress` API to accept `Call::Suppress`. https://reviews.apache.org/r/56330/ Enabled suppress offer per role. https://reviews.apache.org/r/56371/ Enabled `ReviveOffersMessage` support revive per role. https://reviews.apache.org/r/56373/ Augmented master `Revive` API to accept `Call::Revive`. https://reviews.apache.org/r/56374/ Enabled revive offer per role. https://reviews.apache.org/r/56376/ Updated allocator test to support create multi role framework. https://reviews.apache.org/r/56378/ Added test case for suppress and revive with multi role framework. was (Author: gyliu): https://reviews.apache.org/r/56327/ Updated Suppress and Revive proto to support per role. https://reviews.apache.org/r/56328/ Augmented master `Suppress` API to accept `Call::Suppress`. https://reviews.apache.org/r/56330/ Enabled suppress offer per role. > Update Suppress and Revive to be per-role. > -- > > Key: MESOS-6638 > URL: https://issues.apache.org/jira/browse/MESOS-6638 > Project: Mesos > Issue Type: Task > Components: framework api >Reporter: Benjamin Mahler >Assignee: Guangya Liu > > The {{SUPPRESS}} and {{REVIVE}} calls need to be updated to be per-role. I.e. > Include {{Revive.role}} and {{Suppress.role}} fields, indicating which role > the operation is being applied to. > {{Revive}} and {{Suppress}} messages do not currently exist, so these need to > be added. To support the old-style schedulers, we will make the role fields > optional. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (MESOS-7070) Improve allocator performance phase 2
[ https://issues.apache.org/jira/browse/MESOS-7070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guangya Liu updated MESOS-7070: --- Epic Name: allocator performance phase 2 > Improve allocator performance phase 2 > - > > Key: MESOS-7070 > URL: https://issues.apache.org/jira/browse/MESOS-7070 > Project: Mesos > Issue Type: Epic >Reporter: Guangya Liu > > The phase 1 for `allocator performance improvement` has been finished, > basically, the phase 1 have finished such following improvements: > 1) Enabled batch allocation in allocator. > 2) Improved performance for sorter. > 3) Improved performance for `Resource` class. > 4) Added quite a lot of benchmark test for both sorter and resources. > But there are some things need follow up in phase 2, such as periodic > resource allocations, allocate resources asap after recover resources, more > benchmark test etc. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (MESOS-7070) Improve allocator performance phase 2
[ https://issues.apache.org/jira/browse/MESOS-7070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guangya Liu updated MESOS-7070: --- Issue Type: Epic (was: Bug) > Improve allocator performance phase 2 > - > > Key: MESOS-7070 > URL: https://issues.apache.org/jira/browse/MESOS-7070 > Project: Mesos > Issue Type: Epic >Reporter: Guangya Liu > > The phase 1 for `allocator performance improvement` has been finished, > basically, the phase 1 have finished such following improvements: > 1) Enabled batch allocation in allocator. > 2) Improved performance for sorter. > 3) Improved performance for `Resource` class. > 4) Added quite a lot of benchmark test for both sorter and resources. > But there are some things need follow up in phase 2, such as periodic > resource allocations, allocate resources asap after recover resources, more > benchmark test etc. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (MESOS-7070) Improve allocator performance phase 2
Guangya Liu created MESOS-7070:
--

Summary: Improve allocator performance phase 2
Key: MESOS-7070
URL: https://issues.apache.org/jira/browse/MESOS-7070
Project: Mesos
Issue Type: Bug
Reporter: Guangya Liu

Phase 1 of the `allocator performance improvement` work is finished. It delivered the following improvements:
1) Enabled batch allocation in allocator.
2) Improved performance for sorter.
3) Improved performance for `Resource` class.
4) Added a number of benchmark tests for both sorter and resources.

Some items still need follow-up in phase 2, such as periodic resource allocations, allocating resources as soon as possible after resources are recovered, and more benchmark tests.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
[jira] [Assigned] (MESOS-6638) Update Suppress and Revive to be per-role.
[ https://issues.apache.org/jira/browse/MESOS-6638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guangya Liu reassigned MESOS-6638: -- Assignee: Guangya Liu > Update Suppress and Revive to be per-role. > -- > > Key: MESOS-6638 > URL: https://issues.apache.org/jira/browse/MESOS-6638 > Project: Mesos > Issue Type: Task > Components: framework api >Reporter: Benjamin Mahler >Assignee: Guangya Liu > > The {{SUPPRESS}} and {{REVIVE}} calls need to be updated to be per-role. I.e. > Include {{Revive.role}} and {{Suppress.role}} fields, indicating which role > the operation is being applied to. > {{Revive}} and {{Suppress}} messages do not currently exist, so these need to > be added. To support the old-style schedulers, we will make the role fields > optional. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
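The optional role field described in this issue could behave roughly as in the sketch below. `RoleScopedCall` is a hypothetical stand-in for the proposed `Suppress` and `Revive` messages, not the Mesos protobuf. The fallback for an unset role (apply the call to every role of the framework) is an assumption made for this example; the issue only states that the field is optional for old-style schedulers.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical stand-in for the proposed Suppress / Revive messages.
struct RoleScopedCall
{
  bool hasRole;      // False for old-style schedulers that send no role.
  std::string role;  // Only meaningful when hasRole is true.
};

// Returns the roles a call applies to. Assumption: an unset role means
// "all roles the framework is subscribed to".
std::set<std::string> affectedRoles(
    const RoleScopedCall& call,
    const std::set<std::string>& frameworkRoles)
{
  if (!call.hasRole) {
    return frameworkRoles;
  }

  std::set<std::string> result;
  // A role the framework does not hold selects nothing.
  if (frameworkRoles.count(call.role) > 0) {
    result.insert(call.role);
  }
  return result;
}
```

Resolving the call to a set of roles up front lets the suppress and revive paths share one loop over roles, whichever kind of scheduler sent the call.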
[jira] [Updated] (MESOS-7044) Update comments for Queue.get() & Queue.put()
[ https://issues.apache.org/jira/browse/MESOS-7044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guangya Liu updated MESOS-7044: --- Priority: Minor (was: Major) > Update comments for Queue.get() & Queue.put() > - > > Key: MESOS-7044 > URL: https://issues.apache.org/jira/browse/MESOS-7044 > Project: Mesos > Issue Type: Bug >Reporter: Guangya Liu >Priority: Minor > > This is a follow up action from https://reviews.apache.org/r/55852/ > We are now using Queue.get() & Queue.put() to `pop` and `push` elements, and > it is difficult to understand `Queue.get()` can also `pop` an element without > reading the code, it is better use some meaningful names such as `pop/push` > or some others. > https://github.com/apache/mesos/blob/1.1.x/3rdparty/libprocess/include/process/queue.hpp#L34-L70 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (MESOS-7044) Update comments for Queue.get() & Queue.put()
Guangya Liu created MESOS-7044:
--

Summary: Update comments for Queue.get() & Queue.put()
Key: MESOS-7044
URL: https://issues.apache.org/jira/browse/MESOS-7044
Project: Mesos
Issue Type: Bug
Reporter: Guangya Liu

This is a follow-up action from https://reviews.apache.org/r/55852/

We currently use Queue.get() and Queue.put() to `pop` and `push` elements. Without reading the code, it is hard to tell that `Queue.get()` also pops the element it returns, so it would be better to use more meaningful names such as `pop/push`, or some others.

https://github.com/apache/mesos/blob/1.1.x/3rdparty/libprocess/include/process/queue.hpp#L34-L70

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
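The naming concern can be seen in a simplified, single-threaded stand-in. `SimpleQueue` is a hypothetical class written for this example; the libprocess class linked above is asynchronous and is not reproduced here. Only the put/get semantics under discussion are modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Simplified, single-threaded stand-in used only to show the semantics.
template <typename T>
class SimpleQueue
{
public:
  // Appends an element at the back ("push").
  void put(const T& t) { elements.push_back(t); }

  // Removes and returns the front element ("pop"). Despite the name,
  // this is not a read-only accessor: the queue shrinks by one.
  T get()
  {
    assert(!elements.empty());
    T t = elements.front();
    elements.pop_front();
    return t;
  }

  std::size_t size() const { return elements.size(); }

private:
  std::deque<T> elements;
};
```

A reader who sees `q.get()` at a call site may not expect `q.size()` to change afterwards, which is the point the ticket raises.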
[jira] [Updated] (MESOS-2824) Support pre-fetching images
[ https://issues.apache.org/jira/browse/MESOS-2824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guangya Liu updated MESOS-2824: --- Assignee: (was: Guangya Liu) > Support pre-fetching images > --- > > Key: MESOS-2824 > URL: https://issues.apache.org/jira/browse/MESOS-2824 > Project: Mesos > Issue Type: Improvement > Components: isolation >Affects Versions: 0.23.0 >Reporter: Ian Downes >Priority: Minor > Labels: mesosphere, twitter > > Default container images can be specified with the --default_container_info > flag to the slave. This may be a large image that will take a long time to > initially fetch/hash/extract when the first container is provisioned. Add > optional support to start fetching the image when the slave starts and > consider not registering until the fetch is complete. > To extend that, we should support an operator endpoint so that operators can > specify images to pre-fetch. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6854) Prevent launching MULTI_ROLE framework's tasks on agents without MULTI_ROLE support.
[ https://issues.apache.org/jira/browse/MESOS-6854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823576#comment-15823576 ] Guangya Liu commented on MESOS-6854: I am out of the office until 01/24/2017. I will be in vacation from 1.16 to 1.24 and may not check email on time, plesae call 15029181175 (Temp use) or wechat for any emergency. Thanks. Note: This is an automated response to your message "[jira] [Assigned] (MESOS-6854) Prevent launching MULTI_ROLE framework's tasks on agents without MULTI_ROLE support." sent on 01/16/2017 07:42 AM GMT This is the only notification you will receive while this person is away. > Prevent launching MULTI_ROLE framework's tasks on agents without MULTI_ROLE > support. > > > Key: MESOS-6854 > URL: https://issues.apache.org/jira/browse/MESOS-6854 > Project: Mesos > Issue Type: Task > Components: agent, master >Reporter: Benjamin Mahler >Assignee: Jay Guo > > The proposal for upgrades / backwards compatibility in phase 1 of multi-role > framework support is that we require that masters and agents are all upgraded > before a multi-role framework registers. > We need to explicitly protect against this situation occurring given it's > common for old agents to show up in a cluster. The master can prevent the > launching of MULTI_ROLE frameworks' tasks on agent without MULTI_ROLE > framework support. > If we were to naively let this happen the old agent would think the resources > are allocated to the "*" and there would need to be master logic to deal with > the old agent not populating Resource.AllocationInfo. > The guard will either need to be version based or agent capability based, the > latter seeming like the stronger approach given some users upgrade off of > master rather than using release versions. > We can initially start with the master side guard, and have the agent send > the capability once the agent-side implementation is complete. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-5967) Add support for 'docker image inspect' in our docker abstraction.
[ https://issues.apache.org/jira/browse/MESOS-5967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15807232#comment-15807232 ] Guangya Liu edited comment on MESOS-5967 at 1/7/17 10:15 AM: - [~klueska] Just rebased, all of the patches are valid now, can you please help review? Thanks. was (Author: gyliu): [~klueska] Just rebased, all of the patches are now valid now, can you please help review? Thanks. > Add support for 'docker image inspect' in our docker abstraction. > - > > Key: MESOS-5967 > URL: https://issues.apache.org/jira/browse/MESOS-5967 > Project: Mesos > Issue Type: Improvement > Components: containerization, docker >Reporter: Kevin Klues >Assignee: Guangya Liu > Labels: gpu > > Docker's command line tool for {{docker inspect}} can take either a > {{container}}, an {{image}}, or a {{task}} as its argument, and return a JSON > array containing low-level information about that container, image or task. > However, the current {{docker inspect}} support in our docker abstraction > only supports inspecting containers (not images or tasks). We should expand > this to (at least) support images. > In particular, this additional functionality is motivated by the upcoming GPU > support, which needs to inspect the labels in a docker image to decide if it > should inject the required Nvidia volumes into a container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5967) Add support for 'docker image inspect' in our docker abstraction.
[ https://issues.apache.org/jira/browse/MESOS-5967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15807232#comment-15807232 ] Guangya Liu commented on MESOS-5967: [~klueska] Just rebased, all of the patches are now valid now, can you please help review? Thanks. > Add support for 'docker image inspect' in our docker abstraction. > - > > Key: MESOS-5967 > URL: https://issues.apache.org/jira/browse/MESOS-5967 > Project: Mesos > Issue Type: Improvement > Components: containerization, docker >Reporter: Kevin Klues >Assignee: Guangya Liu > Labels: gpu > > Docker's command line tool for {{docker inspect}} can take either a > {{container}}, an {{image}}, or a {{task}} as its argument, and return a JSON > array containing low-level information about that container, image or task. > However, the current {{docker inspect}} support in our docker abstraction > only supports inspecting containers (not images or tasks). We should expand > this to (at least) support images. > In particular, this additional functionality is motivated by the upcoming GPU > support, which needs to inspect the labels in a docker image to decide if it > should inject the required Nvidia volumes into a container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6854) Prevent launching MULTI_ROLE framework's tasks on agents without MULTI_ROLE support.
[ https://issues.apache.org/jira/browse/MESOS-6854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15801371#comment-15801371 ]

Guangya Liu commented on MESOS-6854:

[~bmahler], one question about the master-side guard: if the master cannot get the agent capability, how can it do the validation? It seems we need to finish the agent part first?

> Prevent launching MULTI_ROLE framework's tasks on agents without MULTI_ROLE
> support.
>
>
> Key: MESOS-6854
> URL: https://issues.apache.org/jira/browse/MESOS-6854
> Project: Mesos
> Issue Type: Task
> Components: agent, master
> Reporter: Benjamin Mahler
> Assignee: Guangya Liu
>
> The proposal for upgrades / backwards compatibility in phase 1 of multi-role
> framework support is that we require that masters and agents are all upgraded
> before a multi-role framework registers.
> We need to explicitly protect against this situation occurring given it's
> common for old agents to show up in a cluster. The master can prevent the
> launching of MULTI_ROLE frameworks' tasks on agent without MULTI_ROLE
> framework support.
> If we were to naively let this happen the old agent would think the resources
> are allocated to the "*" and there would need to be master logic to deal with
> the old agent not populating Resource.AllocationInfo.
> The guard will either need to be version based or agent capability based, the
> latter seeming like the stronger approach given some users upgrade off of
> master rather than using release versions.
> We can initially start with the master side guard, and have the agent send
> the capability once the agent-side implementation is complete.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Assigned] (MESOS-6854) Prevent launching MULTI_ROLE framework's tasks on agents without MULTI_ROLE support.
[ https://issues.apache.org/jira/browse/MESOS-6854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guangya Liu reassigned MESOS-6854: -- Assignee: Guangya Liu > Prevent launching MULTI_ROLE framework's tasks on agents without MULTI_ROLE > support. > > > Key: MESOS-6854 > URL: https://issues.apache.org/jira/browse/MESOS-6854 > Project: Mesos > Issue Type: Task > Components: agent, master >Reporter: Benjamin Mahler >Assignee: Guangya Liu > > The proposal for upgrades / backwards compatibility in phase 1 of multi-role > framework support is that we require that masters and agents are all upgraded > before a multi-role framework registers. > We need to explicitly protect against this situation occurring given it's > common for old agents to show up in a cluster. The master can prevent the > launching of MULTI_ROLE frameworks' tasks on agent without MULTI_ROLE > framework support. > If we were to naively let this happen the old agent would think the resources > are allocated to the "*" and there would need to be master logic to deal with > the old agent not populating Resource.AllocationInfo. > The guard will either need to be version based or agent capability based, the > latter seeming like the stronger approach given some users upgrade off of > master rather than using release versions. > We can initially start with the master side guard, and have the agent send > the capability once the agent-side implementation is complete. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
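A capability-based guard of the kind described in this issue might look like the sketch below. `GuardedFramework`, `GuardedAgent` and `mayLaunch` are hypothetical names, and capabilities are modelled as plain strings for brevity; this is not the master's actual data model.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical, simplified views of a framework and an agent.
struct GuardedFramework
{
  std::set<std::string> capabilities;
};

struct GuardedAgent
{
  std::set<std::string> capabilities;  // Empty for agents that send none.
};

// Master-side guard: a MULTI_ROLE framework may only launch tasks on an
// agent that advertises MULTI_ROLE as well.
bool mayLaunch(const GuardedFramework& framework, const GuardedAgent& agent)
{
  const bool multiRole = framework.capabilities.count("MULTI_ROLE") > 0;
  if (!multiRole) {
    return true;  // Single-role frameworks are unaffected by the guard.
  }
  return agent.capabilities.count("MULTI_ROLE") > 0;
}
```

With this shape, an agent that does not yet send any capability is treated as lacking MULTI_ROLE support, so the master-side guard can land before the agent-side change.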
[jira] [Updated] (MESOS-6730) Reserve operation should validate reserved resource role against resource allocationInfo role
[ https://issues.apache.org/jira/browse/MESOS-6730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guangya Liu updated MESOS-6730: --- Description: When doing dynamic reservation validation, the current logic is make sure the reserved resources role is same as the framework role: https://github.com/apache/mesos/blob/1.1.x/src/master/validation.cpp#L1458 {code} if (frameworkRole.isSome() && resource.role() != frameworkRole.get()) { return Error( "A reserve operation was attempted for a resource with role" " '" + resource.role() + "', but the framework can only reserve" " resources with role '" + frameworkRole.get() + "'"); } {code} With multi-role framework, we should validate reserved resource role same as resource allocation role. Please make sure distinguish dynamic reservation with framework and http endpoint. If dynamic reservation was triggered by a framework, then we need to do such validation. If done by the http endpoint, then no need to validate the roles. was: When doing dynamic reservation validation, the current logic is make sure the reserved resources role is same as the framework role: https://github.com/apache/mesos/blob/1.1.x/src/master/validation.cpp#L1458 {code} if (frameworkRole.isSome() && resource.role() != frameworkRole.get()) { return Error( "A reserve operation was attempted for a resource with role" " '" + resource.role() + "', but the framework can only reserve" " resources with role '" + frameworkRole.get() + "'"); } {code} With multi-role framework, we should validate reserved resource role same as resource allocation role. 
> Reserve operation should validate reserved resource role against resource > allocationInfo role > - > > Key: MESOS-6730 > URL: https://issues.apache.org/jira/browse/MESOS-6730 > Project: Mesos > Issue Type: Bug >Reporter: Guangya Liu > > When doing dynamic reservation validation, the current logic is make sure the > reserved resources role is same as the framework role: > https://github.com/apache/mesos/blob/1.1.x/src/master/validation.cpp#L1458 > {code} > if (frameworkRole.isSome() && resource.role() != frameworkRole.get()) { > return Error( > "A reserve operation was attempted for a resource with role" > " '" + resource.role() + "', but the framework can only reserve" > " resources with role '" + frameworkRole.get() + "'"); > } > {code} > With multi-role framework, we should validate reserved resource role same as > resource allocation role. > Please make sure distinguish dynamic reservation with framework and http > endpoint. If dynamic reservation was triggered by a framework, then we need > to do such validation. If done by the http endpoint, then no need to validate > the roles. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
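The proposed check can be sketched as follows, under simplifying assumptions. `ReserveView`, `Origin` and `validateReserve` are hypothetical names for this example, and roles are plain strings. The real validation in `validation.cpp` works on Mesos `Resource` objects and returns an `Option<Error>`, which is not modelled here.

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified resource: just the two roles under discussion.
struct ReserveView
{
  std::string reservedRole;    // Role the resource is being reserved for.
  std::string allocationRole;  // Role the resource was allocated to.
};

// Who triggered the reserve operation.
enum class Origin { FRAMEWORK, HTTP_ENDPOINT };

// Returns an empty string when valid, otherwise an error message.
std::string validateReserve(const ReserveView& resource, Origin origin)
{
  // Reservations made through the HTTP endpoint skip the role comparison.
  if (origin == Origin::HTTP_ENDPOINT) {
    return "";
  }

  if (resource.reservedRole != resource.allocationRole) {
    return "A reserve operation was attempted for a resource with role '" +
           resource.reservedRole + "', but the resource is allocated to"
           " role '" + resource.allocationRole + "'";
  }

  return "";
}
```

Passing the origin explicitly keeps the framework path and the endpoint path in one function while making the difference between them visible at the call site.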
[jira] [Created] (MESOS-6730) Reserve operation should validate reserved resource role against resource allocationInfo role
Guangya Liu created MESOS-6730:
--

Summary: Reserve operation should validate reserved resource role against resource allocationInfo role
Key: MESOS-6730
URL: https://issues.apache.org/jira/browse/MESOS-6730
Project: Mesos
Issue Type: Bug
Reporter: Guangya Liu

When validating a dynamic reservation, the current logic makes sure that the reserved resource's role is the same as the framework role:

https://github.com/apache/mesos/blob/1.1.x/src/master/validation.cpp#L1458

{code}
if (frameworkRole.isSome() && resource.role() != frameworkRole.get()) {
  return Error(
      "A reserve operation was attempted for a resource with role"
      " '" + resource.role() + "', but the framework can only reserve"
      " resources with role '" + frameworkRole.get() + "'");
}
{code}

With multi-role frameworks, we should instead validate that the reserved resource's role is the same as the resource's allocation role.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (MESOS-6685) Update Role::Resources to correctly account for multi-role frameworks
[ https://issues.apache.org/jira/browse/MESOS-6685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guangya Liu updated MESOS-6685: --- Summary: Update Role::Resources to correctly account for multi-role frameworks (was: Update Role::Resources to correctly acount for multi-role frameworks) > Update Role::Resources to correctly account for multi-role frameworks > - > > Key: MESOS-6685 > URL: https://issues.apache.org/jira/browse/MESOS-6685 > Project: Mesos > Issue Type: Bug >Reporter: Guangya Liu > > With single role framework, when call the get role endpoint, the master will > return resources for this role with all of the resources for a framework who > is using this role. But with multi-role framework, the get role endpoint > should only return resources used by one of the roles in a multi-role > framework. > {code} > Resources resources() const > { > Resources resources; > foreachvalue (Framework* framework, frameworks) { > resources += framework->totalUsedResources; > resources += framework->totalOfferedResources; > } > return resources; > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6685) Update Role::Resources to correctly acount for multi-role frameworks
Guangya Liu created MESOS-6685:
--

Summary: Update Role::Resources to correctly acount for multi-role frameworks
Key: MESOS-6685
URL: https://issues.apache.org/jira/browse/MESOS-6685
Project: Mesos
Issue Type: Bug
Reporter: Guangya Liu

With a single-role framework, when the get role endpoint is called, the master returns, as the resources of a role, all of the resources of every framework that uses this role. With a multi-role framework, the get role endpoint should return only the resources used by that one role, not the resources used by the framework's other roles.

{code}
Resources resources() const
{
  Resources resources;
  foreachvalue (Framework* framework, frameworks) {
    resources += framework->totalUsedResources;
    resources += framework->totalOfferedResources;
  }
  return resources;
}
{code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
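One way to picture the requested accounting is the sketch below. It reduces resources to a CPU count and assumes each framework tracks its used and offered amounts per role. `RoleAwareFramework`, `cpusFor` and `roleCpus` are hypothetical names; the per-role bookkeeping is an assumption of this example and does not describe how `Framework` stores resources in the master.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical framework view: resources reduced to a CPU count per role.
struct RoleAwareFramework
{
  std::map<std::string, double> usedCpus;     // Keyed by role.
  std::map<std::string, double> offeredCpus;  // Keyed by role.
};

// Returns the amount recorded for `role`, or 0 when the role is absent.
static double cpusFor(
    const std::map<std::string, double>& byRole,
    const std::string& role)
{
  std::map<std::string, double>::const_iterator it = byRole.find(role);
  return it == byRole.end() ? 0.0 : it->second;
}

// Sums only the share attributed to `role`, not each framework's total.
double roleCpus(
    const std::string& role,
    const std::vector<RoleAwareFramework>& frameworks)
{
  double total = 0.0;
  for (const RoleAwareFramework& framework : frameworks) {
    total += cpusFor(framework.usedCpus, role);
    total += cpusFor(framework.offeredCpus, role);
  }
  return total;
}
```

Compared with the quoted `resources()` loop, the only change is that each framework contributes the slice belonging to the queried role rather than its full totals.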
[jira] [Created] (MESOS-6684) Update addFramework/removeFramework to handle multi-role frameworks
Guangya Liu created MESOS-6684:
--

Summary: Update addFramework/removeFramework to handle multi-role frameworks
Key: MESOS-6684
URL: https://issues.apache.org/jira/browse/MESOS-6684
Project: Mesos
Issue Type: Bug
Reporter: Guangya Liu

The master's current logic for adding and removing frameworks handles only single-role frameworks; it should be updated to support multi-role frameworks.

{code}
if (!activeRoles.contains(role)) {
  activeRoles[role] = new Role();
}
activeRoles[role]->addFramework(framework);
{code}

We should update both {{addFramework}} and {{removeFramework}} in master.cpp to be able to map one framework to multiple roles.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
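Mapping one framework to several roles could be sketched as below. `RoleRegistry` is a hypothetical class for this example that tracks framework ids per role; it does not reproduce the `Role` objects in master.cpp. Dropping a role once its last framework is removed is an assumption of this sketch.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical registry: role name -> ids of frameworks subscribed to it.
class RoleRegistry
{
public:
  // Registers the framework under every one of its roles.
  void addFramework(const std::string& id, const std::set<std::string>& roles)
  {
    for (const std::string& role : roles) {
      activeRoles[role].insert(id);  // Creates the role entry on first use.
    }
  }

  // Unregisters the framework from every one of its roles.
  void removeFramework(
      const std::string& id,
      const std::set<std::string>& roles)
  {
    for (const std::string& role : roles) {
      std::map<std::string, std::set<std::string> >::iterator it =
        activeRoles.find(role);
      if (it == activeRoles.end()) {
        continue;
      }
      it->second.erase(id);
      if (it->second.empty()) {
        activeRoles.erase(it);  // Assumption: empty roles become inactive.
      }
    }
  }

  bool isActive(const std::string& role) const
  {
    return activeRoles.count(role) > 0;
  }

private:
  std::map<std::string, std::set<std::string> > activeRoles;
};
```

The single-role code path quoted in the ticket becomes the loop body here, executed once per role of the framework.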
[jira] [Created] (MESOS-6630) Add some benchmark test for quota allocation
Guangya Liu created MESOS-6630:
--

Summary: Add some benchmark test for quota allocation
Key: MESOS-6630
URL: https://issues.apache.org/jira/browse/MESOS-6630
Project: Mesos
Issue Type: Bug
Reporter: Guangya Liu

After making a minor allocator performance update in https://reviews.apache.org/r/53929/, I found that we have no benchmark tests for quota allocation. We should add some benchmark tests for such cases.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (MESOS-6600) Add priority tiers to support multi-tenancy
[ https://issues.apache.org/jira/browse/MESOS-6600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guangya Liu updated MESOS-6600: --- Description: Tier is kind of priority level, it will include a type and priority level. The type can be either quota or fair share. The reason that we want to have `Tier` is mainly for defining resources allocations with priority for now. One example is for `Quota`, if we have more quotas than total resources in the cluster, then with the `Tier` logic, we can make sure the high priority tier quota can get allocations first. Also the high priority tier quota can preempt resources from preemptable quota (A new concept for quota and still under discussion). With `Tier`, we can also enable tasks with priority (task priority is based on the resources priority), and high priority tasks can preempt resources from low priority tasks. Current design document: https://docs.google.com/document/d/1bPHREn1AfUQIAGwZUS7yLFZLW6ycO2WbmEzW6o92NV0/edit was: TBD Current design document: https://docs.google.com/document/d/1bPHREn1AfUQIAGwZUS7yLFZLW6ycO2WbmEzW6o92NV0/edit > Add priority tiers to support multi-tenancy > --- > > Key: MESOS-6600 > URL: https://issues.apache.org/jira/browse/MESOS-6600 > Project: Mesos > Issue Type: Epic >Reporter: Benjamin Hindman > Labels: multi-tenancy > > Tier is kind of priority level, it will include a type and priority level. > The type can be either quota or fair share. The reason that we want to have > `Tier` is mainly for defining resources allocations with priority for now. > One example is for `Quota`, if we have more quotas than total resources in > the cluster, then with the `Tier` logic, we can make sure the high priority > tier quota can get allocations first. Also the high priority tier quota can > preempt resources from preemptable quota (A new concept for quota and still > under discussion). 
With `Tier`, we can also enable tasks with priority (task > priority is based on the resources priority), and high priority tasks can > preempt resources from low priority tasks. > Current design document: > https://docs.google.com/document/d/1bPHREn1AfUQIAGwZUS7yLFZLW6ycO2WbmEzW6o92NV0/edit -- This message was sent by Atlassian JIRA (v6.3.4#6332)
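The quota example in the description can be sketched roughly as follows. This is only an illustration under stated assumptions: `TierType`, `QuotaRequest`, and `allocateByTier` are hypothetical names that do not exist in Mesos, resources are reduced to a single scalar, and a larger `priority` value is assumed to mean a higher-priority tier. The actual design lives in the linked document and may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical types for illustration only; none of these exist in Mesos.
enum class TierType { QUOTA, FAIR_SHARE };

struct QuotaRequest {
  std::string role;
  TierType type;
  int priority;  // Assumption: larger value = higher-priority tier.
  double cpus;   // Requested quota, reduced to a single scalar.
};

// Satisfies quota requests in descending tier priority until the
// cluster runs out of resources. Returns the cpus granted per role.
std::map<std::string, double> allocateByTier(
    std::vector<QuotaRequest> requests, double totalCpus)
{
  std::stable_sort(
      requests.begin(), requests.end(),
      [](const QuotaRequest& a, const QuotaRequest& b) {
        return a.priority > b.priority;
      });

  std::map<std::string, double> granted;
  for (const QuotaRequest& request : requests) {
    if (request.type != TierType::QUOTA) {
      continue;  // Fair-share tiers are out of scope for this sketch.
    }
    const double grant = std::min(request.cpus, totalCpus);
    granted[request.role] = grant;
    totalCpus -= grant;
  }
  return granted;
}
```

With 10 cpus in the cluster and two roles that each ask for 8 cpus of quota, the role in the higher-priority tier is served in full and the other role receives the remaining 2. Preemption is not modeled here.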
[jira] [Updated] (MESOS-4766) Improve allocator performance.
[ https://issues.apache.org/jira/browse/MESOS-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guangya Liu updated MESOS-4766: --- Target Version/s: 1.2.0 > Improve allocator performance. > -- > > Key: MESOS-4766 > URL: https://issues.apache.org/jira/browse/MESOS-4766 > Project: Mesos > Issue Type: Epic > Components: allocation >Reporter: Benjamin Mahler >Assignee: Michael Park >Priority: Critical > > This is an epic to track the various tickets around improving the performance > of the allocator, including the following: > * Preventing un-necessary backup of the allocator. > * Reducing the cost of allocations and allocator state updates. > * Improving performance of the DRF sorter. > * More benchmarking to simulate scenarios with performance issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-5898) Make resources benchmark test for ports -=/- more accurate
[ https://issues.apache.org/jira/browse/MESOS-5898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15391903#comment-15391903 ] Guangya Liu edited comment on MESOS-5898 at 10/12/16 3:32 AM: -- https://reviews.apache.org/r/50380 Added new benchmark test for port resources. https://reviews.apache.org/r/52769/ Removed ports ranges benchmark test from scalar benchmark test. was (Author: gyliu): https://reviews.apache.org/r/50380/ > Make resources benchmark test for ports -=/- more accurate > -- > > Key: MESOS-5898 > URL: https://issues.apache.org/jira/browse/MESOS-5898 > Project: Mesos > Issue Type: Bug >Reporter: Guangya Liu >Assignee: Guangya Liu > > When I run the benchmark test for port resources, I get the following result: > the `-=` and `-` operations took only about 10ms, which does not reflect the real time of > operating on 1000 ports with `-=` and `-`. > The root cause is that the current calculation always uses the same port > range. For ports, the formula for `+` is {{a+a+a+a+...+a==a}}; for `-`, it > is {{a-a=0}} followed by {{0-a=0}}. > With {{0-a=0}}, the code here > https://github.com/apache/mesos/blob/master/src/common/values.cpp#L544 > performs no validation because {{left}} is empty. > {code} > ./bin/mesos-tests.sh --benchmark > --gtest_filter="*Resources_BENCHMARK_Test.Arithmetic/2" > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from ResourcesOperators/Resources_BENCHMARK_Test > [ RUN ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2 > Took 3.219217secs to perform 1000 'total += r' operations on ports(*):[1-2, > 4-5, 7-8, 10-11, 13-14, 16-17, 1... > Took 10207us to perform 1000 'total -= r' operations on ports(*):[1-2, 4-5, > 7-8, 10-11, 13-14, 16-17, 1... > Took 3.515383secs to perform 1000 'total = total + r' operations on > ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1... 
> Took 10208us to perform 1000 'total = total - r' operations on ports(*):[1-2, > 4-5, 7-8, 10-11, 13-14, 16-17, 1... > [ OK ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2 (6759 > ms) > [--] 1 test from ResourcesOperators/Resources_BENCHMARK_Test (6759 ms > total) > [--] Global test environment tear-down > [==] 1 test from 1 test case ran. (6801 ms total) > [ PASSED ] 1 test. > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
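The arithmetic behind the root cause can be reproduced with a small sketch. It assumes a simplified model in which a ports resource is a plain `std::set<int>` rather than the ranges Mesos stores; `Ports`, `add`, `subtract`, and `nonTrivialSubtractions` are hypothetical helpers written for this illustration, not Mesos APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Simplified stand-in for a ports resource: a plain set of port numbers.
// Mesos stores ranges; a set keeps this sketch short.
using Ports = std::set<int>;

// Set union, modelling `total += r` for ports.
Ports add(const Ports& left, const Ports& right)
{
  Ports result = left;
  result.insert(right.begin(), right.end());
  return result;
}

// Set difference, modelling `total -= r` for ports.
Ports subtract(const Ports& left, const Ports& right)
{
  Ports result;
  for (int port : left) {
    if (right.count(port) == 0) {
      result.insert(port);
    }
  }
  return result;
}

// Repeats `total -= r` and counts how many iterations started with a
// non-empty left operand, i.e. had any real work to do.
std::size_t nonTrivialSubtractions(Ports total, const Ports& r, std::size_t n)
{
  std::size_t nonTrivial = 0;
  for (std::size_t i = 0; i < n; ++i) {
    if (!total.empty()) {
      ++nonTrivial;
    }
    total = subtract(total, r);
  }
  return nonTrivial;
}
```

When `total` and `r` are the same set, `nonTrivialSubtractions(a, a, 1000)` returns 1: only the first subtraction sees a non-empty left operand and the other 999 operate on an empty set, which is consistent with the roughly 10ms reported for `-=` and `-`.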
[jira] [Updated] (MESOS-5700) Add benchmark test for Resource class
[ https://issues.apache.org/jira/browse/MESOS-5700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guangya Liu updated MESOS-5700: --- Summary: Add benchmark test for Resource class (was: Add Bbenchmark test for Resource class) > Add benchmark test for Resource class > - > > Key: MESOS-5700 > URL: https://issues.apache.org/jira/browse/MESOS-5700 > Project: Mesos > Issue Type: Bug >Reporter: Klaus Ma >Assignee: Klaus Ma > Attachments: hashmap.diff, name_roleId.diff, port.perf.log, > reservation.perf.log > > > Add benchmark of Resource class for Allocation Performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5700) Add Bbenchmark test for Resource class
[ https://issues.apache.org/jira/browse/MESOS-5700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guangya Liu updated MESOS-5700: --- Summary: Add Bbenchmark test for Resource class (was: Benchmark for Resource class) > Add Bbenchmark test for Resource class > -- > > Key: MESOS-5700 > URL: https://issues.apache.org/jira/browse/MESOS-5700 > Project: Mesos > Issue Type: Bug >Reporter: Klaus Ma >Assignee: Klaus Ma > Attachments: hashmap.diff, name_roleId.diff, port.perf.log, > reservation.perf.log > > > Add benchmark of Resource class for Allocation Performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6308) CHECK failure in DRF sorter.
[ https://issues.apache.org/jira/browse/MESOS-6308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guangya Liu updated MESOS-6308: --- Target Version/s: 1.1.0 > CHECK failure in DRF sorter. > > > Key: MESOS-6308 > URL: https://issues.apache.org/jira/browse/MESOS-6308 > Project: Mesos > Issue Type: Bug >Reporter: Jie Yu >Assignee: Guangya Liu > > Saw this CHECK failed in our internal CI: > https://github.com/apache/mesos/blob/master/src/master/allocator/sorter/drf/sorter.cpp#L450 > {noformat} > [03:08:28] : [Step 10/10] [ RUN ] PartitionTest.DisconnectedFramework > [03:08:28]W: [Step 10/10] I1004 03:08:28.200443 577 cluster.cpp:158] > Creating default 'local' authorizer > [03:08:28]W: [Step 10/10] I1004 03:08:28.206408 577 leveldb.cpp:174] > Opened db in 5.827159ms > [03:08:28]W: [Step 10/10] I1004 03:08:28.208127 577 leveldb.cpp:181] > Compacted db in 1.697508ms > [03:08:28]W: [Step 10/10] I1004 03:08:28.208150 577 leveldb.cpp:196] > Created db iterator in 5756ns > [03:08:28]W: [Step 10/10] I1004 03:08:28.208160 577 leveldb.cpp:202] > Seeked to beginning of db in 1483ns > [03:08:28]W: [Step 10/10] I1004 03:08:28.208168 577 leveldb.cpp:271] > Iterated through 0 keys in the db in 1101ns > [03:08:28]W: [Step 10/10] I1004 03:08:28.208184 577 replica.cpp:776] > Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned > [03:08:28]W: [Step 10/10] I1004 03:08:28.208452 591 recover.cpp:451] > Starting replica recovery > [03:08:28]W: [Step 10/10] I1004 03:08:28.208664 596 recover.cpp:477] > Replica is in EMPTY status > [03:08:28]W: [Step 10/10] I1004 03:08:28.209079 591 replica.cpp:673] > Replica in EMPTY status received a broadcasted recover request from > __req_res__(3666)@172.30.2.234:37300 > [03:08:28]W: [Step 10/10] I1004 03:08:28.209203 593 recover.cpp:197] > Received a recover response from a replica in EMPTY status > [03:08:28]W: [Step 10/10] I1004 03:08:28.209394 598 recover.cpp:568] > Updating replica status to STARTING > 
[03:08:28]W: [Step 10/10] I1004 03:08:28.209473 598 master.cpp:380] > Master dd11d4ad-2087-4324-99ef-873e83ff09a1 (ip-172-30-2-234.mesosphere.io) > started on 172.30.2.234:37300 > [03:08:28]W: [Step 10/10] I1004 03:08:28.209489 598 master.cpp:382] Flags > at startup: --acls="" --agent_ping_timeout="15secs" > --agent_reregister_timeout="10mins" --allocation_interval="1secs" > --allocator="HierarchicalDRF" --authenticate_agents="true" > --authenticate_frameworks="true" --authenticate_http_frameworks="true" > --authenticate_http_readonly="true" --authenticate_http_readwrite="true" > --authenticators="crammd5" --authorizers="local" > --credentials="/tmp/7rr0oB/credentials" --framework_sorter="drf" > --help="false" --hostname_lookup="true" --http_authenticators="basic" > --http_framework_authenticators="basic" --initialize_driver_logging="true" > --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" > --max_agent_ping_timeouts="5" --max_completed_frameworks="50" > --max_completed_tasks_per_framework="1000" --quiet="false" > --recovery_agent_removal_limit="100%" --registry="replicated_log" > --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" > --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" > --registry_store_timeout="100secs" --registry_strict="false" > --root_submissions="true" --user_sorter="drf" --version="false" > --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/7rr0oB/master" > --zk_session_timeout="10secs" > [03:08:28]W: [Step 10/10] I1004 03:08:28.209692 598 master.cpp:432] > Master only allowing authenticated frameworks to register > [03:08:28]W: [Step 10/10] I1004 03:08:28.209699 598 master.cpp:446] > Master only allowing authenticated agents to register > [03:08:28]W: [Step 10/10] I1004 03:08:28.209704 598 master.cpp:459] > Master only allowing authenticated HTTP frameworks to register > [03:08:28]W: [Step 10/10] I1004 03:08:28.209709 598 credentials.hpp:37] > Loading credentials for authentication 
from '/tmp/7rr0oB/credentials' > [03:08:28]W: [Step 10/10] I1004 03:08:28.209810 598 master.cpp:504] Using > default 'crammd5' authenticator > [03:08:28]W: [Step 10/10] I1004 03:08:28.209853 598 http.cpp:883] Using > default 'basic' HTTP authenticator for realm 'mesos-master-readonly' > [03:08:28]W: [Step 10/10] I1004 03:08:28.209897 598 http.cpp:883] Using > default 'basic' HTTP authenticator for realm 'mesos-master-readwrite' > [03:08:28]W: [Step 10/10] I1004 03:08:28.209940 598 http.cpp:883] Using > default 'basic' HTTP authenticator for realm 'mesos-master-scheduler' > [03:08:28]W: [Step 10/10] I1004 03:08:28.209962 598 master.cpp:584] > Authorization enabled > [03:08:28]W: [Step 10/10] I1004
[jira] [Commented] (MESOS-4694) DRFAllocator takes very long to allocate resources with a large number of frameworks
[ https://issues.apache.org/jira/browse/MESOS-4694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566916#comment-15566916 ] Guangya Liu commented on MESOS-4694: [~bmahler] can we mark this as RESOLVED? > DRFAllocator takes very long to allocate resources with a large number of > frameworks > > > Key: MESOS-4694 > URL: https://issues.apache.org/jira/browse/MESOS-4694 > Project: Mesos > Issue Type: Improvement > Components: allocation >Affects Versions: 0.26.0, 0.27.0, 0.27.1, 0.27.2, 0.28.0, 0.28.1 >Reporter: Dario Rexin >Assignee: Dario Rexin > > With a growing number of connected frameworks, the allocation time grows to > very high numbers. The addition of quota in 0.27 had an additional impact on > these numbers. Running `mesos-tests.sh --benchmark > --gtest_filter=HierarchicalAllocator_BENCHMARK_Test.DeclineOffers` gives us > the following numbers: > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from HierarchicalAllocator_BENCHMARK_Test > [ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers > Using 2000 slaves and 200 frameworks > round 0 allocate took 2.921202secs to make 200 offers > round 1 allocate took 2.85045secs to make 200 offers > round 2 allocate took 2.823768secs to make 200 offers > {noformat} > Increasing the number of frameworks to 2000: > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from HierarchicalAllocator_BENCHMARK_Test > [ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers > Using 2000 slaves and 2000 frameworks > round 0 allocate took 28.209454secs to make 2000 offers > round 1 allocate took 28.469419secs to make 2000 offers > round 2 allocate took 28.138086secs to make 2000 offers > {noformat} > I was able to reduce this time by a substantial amount. After applying the > patches: > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. 
> [--] 1 test from HierarchicalAllocator_BENCHMARK_Test > [ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers > Using 2000 slaves and 200 frameworks > round 0 allocate took 1.016226secs to make 2000 offers > round 1 allocate took 1.102729secs to make 2000 offers > round 2 allocate took 1.102624secs to make 2000 offers > {noformat} > And with 2000 frameworks: > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from HierarchicalAllocator_BENCHMARK_Test > [ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers > Using 2000 slaves and 2000 frameworks > round 0 allocate took 12.563203secs to make 2000 offers > round 1 allocate took 12.437517secs to make 2000 offers > round 2 allocate took 12.470708secs to make 2000 offers > {noformat} > The patches do 3 things to improve the performance of the allocator. > 1) The total values in the DRFSorter will be pre calculated per resource type > 2) In the allocate method, when no resources are available to allocate, we > break out of the innermost loop to prevent looping over a large number of > frameworks when we have nothing to allocate > 3) when a framework suppresses offers, we remove it from the sorter instead > of just calling continue in the allocation loop - this greatly improves > performance in the sorter and prevents looping over frameworks that don't > need resources > Assuming that most of the frameworks behave nicely and suppress offers when > they have nothing to schedule, it is fair to assume, that point 3) has the > biggest impact on the performance. If we suppress offers for 90% of the > frameworks in the benchmark test, we see following numbers: > {noformat} > ==] Running 1 test from 1 test case. > [--] Global test environment set-up. 
> [--] 1 test from HierarchicalAllocator_BENCHMARK_Test > [ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers > Using 200 slaves and 2000 frameworks > round 0 allocate took 11626us to make 200 offers > round 1 allocate took 22890us to make 200 offers > round 2 allocate took 21346us to make 200 offers > {noformat} > And for 200 frameworks: > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from HierarchicalAllocator_BENCHMARK_Test > [ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers > Using 2000 slaves and 2000 frameworks > round 0 allocate took 1.11178secs to make 2000 offers > round 1 allocate took 1.062649secs to make 2000 offers > round 2 allocate took 1.080181secs to make 2000 offers > {noformat} > Review requests: >
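Points 2) and 3) of the patch description can be sketched as follows. This is a heavily simplified model, not the allocator's code: `Sorter`, `AllocationResult`, and `allocate` are hypothetical names, each agent's resources are a single integer, and every framework visit consumes one unit. Point 1), the pre-calculated totals, is not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical, heavily simplified allocator state.
struct Sorter {
  std::set<std::string> active;  // Frameworks that still want offers.

  void add(const std::string& framework) { active.insert(framework); }

  // Point 3: a suppressed framework leaves the sorter entirely, so the
  // allocation loop never visits it.
  void suppress(const std::string& framework) { active.erase(framework); }
  void revive(const std::string& framework) { active.insert(framework); }
};

struct AllocationResult {
  std::size_t visits = 0;              // Inner-loop iterations performed.
  std::map<std::string, int> offered;  // Units offered per framework.
};

// Offers one unit of an agent's resources per framework visit.
AllocationResult allocate(const Sorter& sorter, std::vector<int> agents)
{
  AllocationResult result;
  for (int& available : agents) {
    for (const std::string& framework : sorter.active) {
      // Point 2: nothing left on this agent, so stop visiting frameworks.
      if (available == 0) {
        break;
      }
      ++result.visits;
      --available;
      ++result.offered[framework];
    }
  }
  return result;
}
```

With 10 active frameworks and two agents holding 3 units each, the loop performs 6 visits instead of 20 because it breaks once an agent is exhausted. After 9 of the 10 frameworks suppress offers, only 2 visits remain, which mirrors the observation that point 3) has the largest effect when most frameworks suppress.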
[jira] [Comment Edited] (MESOS-5967) Add support for 'docker image inspect' in our docker abstraction.
[ https://issues.apache.org/jira/browse/MESOS-5967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564767#comment-15564767 ] Guangya Liu edited comment on MESOS-5967 at 10/11/16 7:34 AM: -- https://reviews.apache.org/r/52727/ Added `Labels` to docker image. https://reviews.apache.org/r/52666/ Added support for `docker inspect image` in docker containerizer. https://reviews.apache.org/r/52728/ Renamed `inspect` to `inspectContainer`. was (Author: gyliu): {code} https://reviews.apache.org/r/52727/ Added `Labels` to docker image. https://reviews.apache.org/r/52666/ Added support for `docker inspect image` in docker containerizer. https://reviews.apache.org/r/52728/ Renamed `inspect` to `inspectContainer`. {code} > Add support for 'docker image inspect' in our docker abstraction. > - > > Key: MESOS-5967 > URL: https://issues.apache.org/jira/browse/MESOS-5967 > Project: Mesos > Issue Type: Improvement >Reporter: Kevin Klues >Assignee: Guangya Liu > Labels: gpu > > Docker's command line tool for {{docker inspect}} can take either a > {{container}}, an {{image}}, or a {{task}} as its argument, and return a JSON > array containing low-level information about that container, image or task. > However, the current {{docker inspect}} support in our docker abstraction > only supports inspecting containers (not images or tasks). We should expand > this to (at least) support images. > In particular, this additional functionality is motivated by the upcoming GPU > support, which needs to inspect the labels in a docker image to decide if it > should inject the required Nvidia volumes into a container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6308) CHECK failure in DRF sorter.
[ https://issues.apache.org/jira/browse/MESOS-6308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15561571#comment-15561571 ] Guangya Liu commented on MESOS-6308: This is a race between framework teardown and the calculation of the dominant share metrics. It can happen with the following sequence in {{PartitionTest.DisconnectedFramework}}: 1) {{driver.stop()}} sends a {{TEARDOWN}} request to the master, which triggers {{removeFramework}} in the allocator. 2) {{PartitionTest.DisconnectedFramework}} constructs the {{Metrics}} object with the following code. At this time the default role has not been removed yet, so {{calculateShare}} is queued to calculate the dominant share for the default role. {code} JSON::Object stats = Metrics(); {code} 3) The framework removal in the master continues and calls the allocator to remove the framework. As there is only one framework under the star role, {{removeFramework}} calls {{roleSorter->remove(role);}} to remove the role and its related allocations. The race is here; take a look at {{remove}} in sorter.cpp. {code} void DRFSorter::remove(const string& name) { set<Client, DRFComparator>::iterator it = find(name); if (it != clients.end()) { clients.erase(it); } allocations.erase(name); weights.erase(name); // <--- `calculateShare` was triggered here to calculate the dominant share for the default role, but by this time the default role's allocation has already been removed, so `calculateShare` reports a CHECK failure. This is very difficult to reproduce; I had to run the test for more than 1 hour to reproduce it. 
if (metrics.isSome()) { metrics->remove(name); } } {code} I updated the code of {{DRFSorter::remove(const string& name)}} a bit by adding {{os::sleep(Seconds(1));}} between {{allocations.erase(name);}} and {{metrics->remove(name);}} as follows: {code} --- a/src/master/allocator/sorter/drf/sorter.cpp +++ b/src/master/allocator/sorter/drf/sorter.cpp @@ -29,6 +29,8 @@ #include #include +#include + #include "logging/logging.hpp" #include "master/allocator/sorter/drf/sorter.hpp" @@ -110,6 +112,8 @@ void DRFSorter::remove(const string& name) allocations.erase(name); weights.erase(name); + os::sleep(Seconds(1)); + if (metrics.isSome()) { metrics->remove(name); } {code} With this change, re-running the test {{PartitionTest.DisconnectedFramework}} fails every time, with the CHECK failure at the same place: {code} [ RUN ] PartitionTest.DisconnectedFramework I1010 15:43:46.022001 257765376 exec.cpp:162] Version: 1.1.0 I1010 15:43:46.025545 259375104 exec.cpp:237] Executor registered on agent 5bc85014-3fab-459f-9d85-8b47a06e27d0-S0 Received SUBSCRIBED event Subscribed executor on 192.168.56.1 Received LAUNCH event Starting task 51f3e50a-e561-407b-8ee4-65f163d65bd7 /Users/gyliu/git/mesos/build/src/mesos-containerizer launch --command="{"shell":true,"value":"sleep 60"}" --help="false" Forked command at 93007 F1010 15:43:50.094323 407199744 sorter.cpp:454] Check failed: contains(name) *** Check failure stack trace: *** @0x1119b91ca google::LogMessage::Fail() @0x1119b8157 google::LogMessage::SendToLog() @0x1119b8e7a google::LogMessage::Flush() @0x1119bfce8 google::LogMessageFatal::~LogMessageFatal() @0x1119b9605 google::LogMessageFatal::~LogMessageFatal() @0x10fa0bd18 mesos::internal::master::allocator::DRFSorter::calculateShare() @0x10fa05c5e mesos::internal::master::allocator::Metrics::add()::$_0::operator()() @0x10fa09232 
_ZZN7process8internal8DispatchIdEclIRKZN5mesos8internal6master9allocator7Metrics3addERKNSt3__112basic_stringIcNS9_11char_traitsIcEENS9_9allocatorIcE3$_0EENS_6FutureIdEERKNS_4UPIDEOT_ENKUlPNS_11ProcessBaseEE_clEST_ @0x10fa091f0 _ZNSt3__128__invoke_void_return_wrapperIvE6__callIJRZN7process8internal8DispatchIdEclIRKZN5mesos8internal6master9allocator7Metrics3addERKNS_12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcE3$_0EENS3_6FutureIdEERKNS3_4UPIDEOT_EUlPNS3_11ProcessBaseEE_SW_EEEvDpOT_ @0x10fa08e9c _ZNSt3__110__function6__funcIZN7process8internal8DispatchIdEclIRKZN5mesos8internal6master9allocator7Metrics3addERKNS_12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcE3$_0EENS2_6FutureIdEERKNS2_4UPIDEOT_EUlPNS2_11ProcessBaseEE_NSF_ISW_EEFvSV_EEclEOSV_ @0x111897acf std::__1::function<>::operator()() @0x1118684ff process::ProcessBase::visit() @0x1118cc18e process::DispatchEvent::visit() @0x109a75431 process::ProcessBase::serve() @0x1118651d1 process::ProcessManager::resume() @0x111870cc6 process::ProcessManager::init_threads()::$_1::operator()() @0x111870969 _ZNSt3__114__thread_proxyINS_5tupleIJZN7process14ProcessManager12init_threadsEvE3$_1EPvS6_ @
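The interleaving described in this comment can be replayed deterministically in a small sketch. Everything here is a hypothetical miniature: `MiniSorter` and `replayTeardownRace` are illustration-only names, the libprocess dispatch queue is modelled as a `std::deque` of callbacks, and `calculateShare` returns `false` for a missing client where the real function uses a CHECK. The guard is one possible defensive shape, not necessarily the fix adopted upstream.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <map>
#include <string>

// Hypothetical miniature of the sorter: only the allocation map.
struct MiniSorter {
  std::map<std::string, double> allocations;

  bool contains(const std::string& name) const
  {
    return allocations.count(name) > 0;
  }

  // Unlike the real `calculateShare`, which CHECKs `contains(name)`,
  // this variant reports a missing client to the caller.
  bool calculateShare(const std::string& name, double* share) const
  {
    if (!contains(name)) {
      return false;
    }
    *share = allocations.at(name);
    return true;
  }

  void remove(const std::string& name) { allocations.erase(name); }
};

// Replays the sequence from the comment on a single-threaded queue:
// the metrics read is queued first, the role is removed, and only then
// does the queued read run. Returns whether the read found the client.
bool replayTeardownRace()
{
  MiniSorter sorter;
  sorter.allocations["*"] = 0.5;

  std::deque<std::function<void()>> queue;
  bool found = true;

  // Step 2: the test asks for metrics; the gauge callback is queued.
  queue.push_back([&]() {
    double share = 0.0;
    found = sorter.calculateShare("*", &share);
  });

  // Step 3: the framework teardown removes the role's allocation.
  sorter.remove("*");

  // The queued callback finally runs against the removed client.
  while (!queue.empty()) {
    queue.front()();
    queue.pop_front();
  }
  return found;
}
```

`replayTeardownRace()` returns `false`: the queued read runs after the allocation is gone. In the real code the same ordering hits `CHECK(contains(name))` and aborts the process.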
[jira] [Commented] (MESOS-5967) Add support for 'docker image inspect' in our docker abstraction
[ https://issues.apache.org/jira/browse/MESOS-5967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15561293#comment-15561293 ] Guangya Liu commented on MESOS-5967: There are two solutions for this: Solution 1): Add a new function named {{inspectImage}} and rename the current function {{inspect}} to {{inspectContainer}}. The only issue is that this does not match the docker API, which uses {{inspect}} for both containers and images. Solution 2): Use a template to handle this: {code} template <typename T> process::Future<T> inspect( const std::string& containerName, const Option<Duration>& retryInterval = None()) const; {code} Please note that I am not using {{virtual}} above, because a member function template cannot be {{virtual}}, so here I need to remove {{virtual}}. Then I can define the `Container` and `Image` inspect specializations as follows: {code} template<> Future<Docker::Container> Docker::inspect<Docker::Container>( const string& containerName, const Option<Duration>& retryInterval) const { ... } template<> Future<Docker::Image> Docker::inspect<Docker::Image>( const string& imageName, const Option<Duration>& retryInterval) const { ... } {code} For the caller part, container will be: {code} docker->inspect<Docker::Container>(...); {code} and image will be: {code} docker->inspect<Docker::Image>(...); {code} I think solution 1) is simpler, as solution 2) needs to remove {{virtual}} from {{inspect}}. Although that has no functional impact, it would make the code inconsistent. [~bmahler] [~klueska] any comments? Thanks. > Add support for 'docker image inspect' in our docker abstraction > > > Key: MESOS-5967 > URL: https://issues.apache.org/jira/browse/MESOS-5967 > Project: Mesos > Issue Type: Improvement >Reporter: Kevin Klues >Assignee: Guangya Liu > Labels: gpu > Fix For: 1.1.0 > > > Docker's command line tool for {{docker inspect}} can take either a > {{container}}, an {{image}}, or a {{task}} as its argument, and return a JSON > array containing low-level information about that container, image or task. 
> However, the current {{docker inspect}} support in our docker abstraction > only supports inspecting containers (not images or tasks). We should expand > this to (at least) support images. > In particular, this additional functionality is motivated by the upcoming GPU > support, which needs to inspect the labels in a docker image to decide if it > should inject the required Nvidia volumes into a container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
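The trade-off between the two solutions discussed in the comment can be seen side by side in a self-contained sketch. The `Docker`, `Container`, and `Image` types below are hypothetical stand-ins defined only for this example: the real Mesos class returns `process::Future<...>` values and shells out to the docker CLI, whereas these functions simply echo the name back.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins; the real Mesos `Docker` class returns
// `process::Future<...>` and shells out to the docker CLI.
struct Container { std::string name; };
struct Image { std::string name; };

class Docker {
public:
  virtual ~Docker() = default;

  // Solution 1: two plainly named functions. These can stay `virtual`,
  // so a mock can override them in tests.
  virtual Container inspectContainer(const std::string& name) const
  {
    return Container{name};
  }

  virtual Image inspectImage(const std::string& name) const
  {
    return Image{name};
  }

  // Solution 2: one member template. A member function template cannot
  // be declared `virtual`, which is the drawback noted in the comment.
  template <typename T>
  T inspect(const std::string& name) const;
};

// Explicit specializations select the behaviour per result type.
template <>
inline Container Docker::inspect<Container>(const std::string& name) const
{
  return inspectContainer(name);
}

template <>
inline Image Docker::inspect<Image>(const std::string& name) const
{
  return inspectImage(name);
}
```

A caller writes `docker.inspect<Image>("busybox")` under solution 2 and `docker.inspectImage("busybox")` under solution 1. In this sketch the template forwards to the virtual functions, which is one way to keep overridability while still offering a single `inspect` name; whether that is worth the extra surface is a design call for the reviewers.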
[jira] [Commented] (MESOS-6308) CHECK failure in DRF sorter.
[ https://issues.apache.org/jira/browse/MESOS-6308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15556594#comment-15556594 ] Guangya Liu commented on MESOS-6308: Thanks [~bbannier] , I reproduced this issue again after running almost 1 hour and found it failed as following when adding metrics: {code} F1007 18:22:39.125012 255385600 sorter.cpp:458] Check failed: contains(name) *** Check failure stack trace: *** @0x108b7afda google::LogMessage::Fail() @0x108b79f67 google::LogMessage::SendToLog() @0x108b7ac8a google::LogMessage::Flush() @0x108b81af8 google::LogMessageFatal::~LogMessageFatal() @0x108b7b415 google::LogMessageFatal::~LogMessageFatal() @0x106bcd4d5 mesos::internal::master::allocator::DRFSorter::calculateShare() @0x106bc710e mesos::internal::master::allocator::Metrics::add()::$_0::operator()() @0x106bca6e2 _ZZN7process8internal8DispatchIdEclIRKZN5mesos8internal6master9allocator7Metrics3addERKNSt3__112basic_stringIcNS9_11char_traitsIcEENS9_9allocatorIcE3$_0EENS_6FutureIdEERKNS_4UPIDEOT_ENKUlPNS_11ProcessBaseEE_clEST_ @0x106bca6a0 _ZNSt3__128__invoke_void_return_wrapperIvE6__callIJRZN7process8internal8DispatchIdEclIRKZN5mesos8internal6master9allocator7Metrics3addERKNS_12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcE3$_0EENS3_6FutureIdEERKNS3_4UPIDEOT_EUlPNS3_11ProcessBaseEE_SW_EEEvDpOT_ @0x106bca34c _ZNSt3__110__function6__funcIZN7process8internal8DispatchIdEclIRKZN5mesos8internal6master9allocator7Metrics3addERKNS_12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcE3$_0EENS2_6FutureIdEERKNS2_4UPIDEOT_EUlPNS2_11ProcessBaseEE_NSF_ISW_EEFvSV_EEclEOSV_ @0x108a598df std::__1::function<>::operator()() @0x108a2a30f process::ProcessBase::visit() @0x108a8df9e process::DispatchEvent::visit() @0x100c65c51 process::ProcessBase::serve() @0x108a26fe1 process::ProcessManager::resume() @0x108a32ad6 process::ProcessManager::init_threads()::$_1::operator()() @0x108a32779 
_ZNSt3__114__thread_proxyINS_5tupleIJZN7process14ProcessManager12init_threadsEvE3$_1EPvS6_ @ 0x7fff957a405a _pthread_body @ 0x7fff957a3fd7 _pthread_start @ 0x7fff957a13ed thread_start E1007 18:23:06.083991 317579264 process.cpp:2154] Failed to shutdown socket with fd 15: Socket is not connected Abort trap: 6 {code} I will check further whether there are cases in which we add metrics for a nonexistent client. [~bbannier] , please share your comments if you have any. Thanks. > CHECK failure in DRF sorter. > > > Key: MESOS-6308 > URL: https://issues.apache.org/jira/browse/MESOS-6308 > Project: Mesos > Issue Type: Bug >Reporter: Jie Yu >Assignee: Guangya Liu > > Saw this CHECK failed in our internal CI: > https://github.com/apache/mesos/blob/master/src/master/allocator/sorter/drf/sorter.cpp#L450 > {noformat} > [03:08:28] : [Step 10/10] [ RUN ] PartitionTest.DisconnectedFramework > [03:08:28]W: [Step 10/10] I1004 03:08:28.200443 577 cluster.cpp:158] > Creating default 'local' authorizer > [03:08:28]W: [Step 10/10] I1004 03:08:28.206408 577 leveldb.cpp:174] > Opened db in 5.827159ms > [03:08:28]W: [Step 10/10] I1004 03:08:28.208127 577 leveldb.cpp:181] > Compacted db in 1.697508ms > [03:08:28]W: [Step 10/10] I1004 03:08:28.208150 577 leveldb.cpp:196] > Created db iterator in 5756ns > [03:08:28]W: [Step 10/10] I1004 03:08:28.208160 577 leveldb.cpp:202] > Seeked to beginning of db in 1483ns > [03:08:28]W: [Step 10/10] I1004 03:08:28.208168 577 leveldb.cpp:271] > Iterated through 0 keys in the db in 1101ns > [03:08:28]W: [Step 10/10] I1004 03:08:28.208184 577 replica.cpp:776] > Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned > [03:08:28]W: [Step 10/10] I1004 03:08:28.208452 591 recover.cpp:451] > Starting replica recovery > [03:08:28]W: [Step 10/10] I1004 03:08:28.208664 596 recover.cpp:477] > Replica is in EMPTY status > [03:08:28]W: [Step 10/10] I1004 03:08:28.209079 591 replica.cpp:673] > Replica in EMPTY status received a broadcasted recover request from > 
__req_res__(3666)@172.30.2.234:37300 > [03:08:28]W: [Step 10/10] I1004 03:08:28.209203 593 recover.cpp:197] > Received a recover response from a replica in EMPTY status > [03:08:28]W: [Step 10/10] I1004 03:08:28.209394 598 recover.cpp:568] > Updating replica status to STARTING > [03:08:28]W: [Step 10/10] I1004 03:08:28.209473 598 master.cpp:380] > Master dd11d4ad-2087-4324-99ef-873e83ff09a1 (ip-172-30-2-234.mesosphere.io) > started on 172.30.2.234:37300 > [03:08:28]W: [Step 10/10] I1004 03:08:28.209489 598 master.cpp:382] Flags > at startup: --acls=""
[jira] [Assigned] (MESOS-6308) CHECK failure in DRF sorter.
[ https://issues.apache.org/jira/browse/MESOS-6308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guangya Liu reassigned MESOS-6308: -- Assignee: Guangya Liu > CHECK failure in DRF sorter. > > > Key: MESOS-6308 > URL: https://issues.apache.org/jira/browse/MESOS-6308 > Project: Mesos > Issue Type: Bug >Reporter: Jie Yu >Assignee: Guangya Liu > > Saw this CHECK failed in our internal CI: > https://github.com/apache/mesos/blob/master/src/master/allocator/sorter/drf/sorter.cpp#L450 > {noformat} > [03:08:28] : [Step 10/10] [ RUN ] PartitionTest.DisconnectedFramework > [03:08:28]W: [Step 10/10] I1004 03:08:28.200443 577 cluster.cpp:158] > Creating default 'local' authorizer > [03:08:28]W: [Step 10/10] I1004 03:08:28.206408 577 leveldb.cpp:174] > Opened db in 5.827159ms > [03:08:28]W: [Step 10/10] I1004 03:08:28.208127 577 leveldb.cpp:181] > Compacted db in 1.697508ms > [03:08:28]W: [Step 10/10] I1004 03:08:28.208150 577 leveldb.cpp:196] > Created db iterator in 5756ns > [03:08:28]W: [Step 10/10] I1004 03:08:28.208160 577 leveldb.cpp:202] > Seeked to beginning of db in 1483ns > [03:08:28]W: [Step 10/10] I1004 03:08:28.208168 577 leveldb.cpp:271] > Iterated through 0 keys in the db in 1101ns > [03:08:28]W: [Step 10/10] I1004 03:08:28.208184 577 replica.cpp:776] > Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned > [03:08:28]W: [Step 10/10] I1004 03:08:28.208452 591 recover.cpp:451] > Starting replica recovery > [03:08:28]W: [Step 10/10] I1004 03:08:28.208664 596 recover.cpp:477] > Replica is in EMPTY status > [03:08:28]W: [Step 10/10] I1004 03:08:28.209079 591 replica.cpp:673] > Replica in EMPTY status received a broadcasted recover request from > __req_res__(3666)@172.30.2.234:37300 > [03:08:28]W: [Step 10/10] I1004 03:08:28.209203 593 recover.cpp:197] > Received a recover response from a replica in EMPTY status > [03:08:28]W: [Step 10/10] I1004 03:08:28.209394 598 recover.cpp:568] > Updating replica status to STARTING > 
[03:08:28]W: [Step 10/10] I1004 03:08:28.209473 598 master.cpp:380] > Master dd11d4ad-2087-4324-99ef-873e83ff09a1 (ip-172-30-2-234.mesosphere.io) > started on 172.30.2.234:37300 > [03:08:28]W: [Step 10/10] I1004 03:08:28.209489 598 master.cpp:382] Flags > at startup: --acls="" --agent_ping_timeout="15secs" > --agent_reregister_timeout="10mins" --allocation_interval="1secs" > --allocator="HierarchicalDRF" --authenticate_agents="true" > --authenticate_frameworks="true" --authenticate_http_frameworks="true" > --authenticate_http_readonly="true" --authenticate_http_readwrite="true" > --authenticators="crammd5" --authorizers="local" > --credentials="/tmp/7rr0oB/credentials" --framework_sorter="drf" > --help="false" --hostname_lookup="true" --http_authenticators="basic" > --http_framework_authenticators="basic" --initialize_driver_logging="true" > --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" > --max_agent_ping_timeouts="5" --max_completed_frameworks="50" > --max_completed_tasks_per_framework="1000" --quiet="false" > --recovery_agent_removal_limit="100%" --registry="replicated_log" > --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" > --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" > --registry_store_timeout="100secs" --registry_strict="false" > --root_submissions="true" --user_sorter="drf" --version="false" > --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/7rr0oB/master" > --zk_session_timeout="10secs" > [03:08:28]W: [Step 10/10] I1004 03:08:28.209692 598 master.cpp:432] > Master only allowing authenticated frameworks to register > [03:08:28]W: [Step 10/10] I1004 03:08:28.209699 598 master.cpp:446] > Master only allowing authenticated agents to register > [03:08:28]W: [Step 10/10] I1004 03:08:28.209704 598 master.cpp:459] > Master only allowing authenticated HTTP frameworks to register > [03:08:28]W: [Step 10/10] I1004 03:08:28.209709 598 credentials.hpp:37] > Loading credentials for authentication 
from '/tmp/7rr0oB/credentials' > [03:08:28]W: [Step 10/10] I1004 03:08:28.209810 598 master.cpp:504] Using > default 'crammd5' authenticator > [03:08:28]W: [Step 10/10] I1004 03:08:28.209853 598 http.cpp:883] Using > default 'basic' HTTP authenticator for realm 'mesos-master-readonly' > [03:08:28]W: [Step 10/10] I1004 03:08:28.209897 598 http.cpp:883] Using > default 'basic' HTTP authenticator for realm 'mesos-master-readwrite' > [03:08:28]W: [Step 10/10] I1004 03:08:28.209940 598 http.cpp:883] Using > default 'basic' HTTP authenticator for realm 'mesos-master-scheduler' > [03:08:28]W: [Step 10/10] I1004 03:08:28.209962 598 master.cpp:584] > Authorization enabled > [03:08:28]W: [Step 10/10] I1004
[jira] [Commented] (MESOS-6308) CHECK failure in DRF sorter.
[ https://issues.apache.org/jira/browse/MESOS-6308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15550554#comment-15550554 ] Guangya Liu commented on MESOS-6308: I tried to reproduce this issue but had no luck, even with {{--gtest_repeat=100}}. I will try increasing the workload as you suggested to see whether I can reproduce it first. > CHECK failure in DRF sorter. > > > Key: MESOS-6308 > URL: https://issues.apache.org/jira/browse/MESOS-6308 > Project: Mesos > Issue Type: Bug >Reporter: Jie Yu > > Saw this CHECK failed in our internal CI: > https://github.com/apache/mesos/blob/master/src/master/allocator/sorter/drf/sorter.cpp#L450 > {noformat} > [03:08:28] : [Step 10/10] [ RUN ] PartitionTest.DisconnectedFramework > [03:08:28]W: [Step 10/10] I1004 03:08:28.200443 577 cluster.cpp:158] > Creating default 'local' authorizer > [03:08:28]W: [Step 10/10] I1004 03:08:28.206408 577 leveldb.cpp:174] > Opened db in 5.827159ms > [03:08:28]W: [Step 10/10] I1004 03:08:28.208127 577 leveldb.cpp:181] > Compacted db in 1.697508ms > [03:08:28]W: [Step 10/10] I1004 03:08:28.208150 577 leveldb.cpp:196] > Created db iterator in 5756ns > [03:08:28]W: [Step 10/10] I1004 03:08:28.208160 577 leveldb.cpp:202] > Seeked to beginning of db in 1483ns > [03:08:28]W: [Step 10/10] I1004 03:08:28.208168 577 leveldb.cpp:271] > Iterated through 0 keys in the db in 1101ns > [03:08:28]W: [Step 10/10] I1004 03:08:28.208184 577 replica.cpp:776] > Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned > [03:08:28]W: [Step 10/10] I1004 03:08:28.208452 591 recover.cpp:451] > Starting replica recovery > [03:08:28]W: [Step 10/10] I1004 03:08:28.208664 596 recover.cpp:477] > Replica is in EMPTY status > [03:08:28]W: [Step 10/10] I1004 03:08:28.209079 591 replica.cpp:673] > Replica in EMPTY status received a broadcasted recover request from > __req_res__(3666)@172.30.2.234:37300 > [03:08:28]W: [Step 10/10] I1004 03:08:28.209203 593 recover.cpp:197] > Received a 
recover response from a replica in EMPTY status > [03:08:28]W: [Step 10/10] I1004 03:08:28.209394 598 recover.cpp:568] > Updating replica status to STARTING > [03:08:28]W: [Step 10/10] I1004 03:08:28.209473 598 master.cpp:380] > Master dd11d4ad-2087-4324-99ef-873e83ff09a1 (ip-172-30-2-234.mesosphere.io) > started on 172.30.2.234:37300 > [03:08:28]W: [Step 10/10] I1004 03:08:28.209489 598 master.cpp:382] Flags > at startup: --acls="" --agent_ping_timeout="15secs" > --agent_reregister_timeout="10mins" --allocation_interval="1secs" > --allocator="HierarchicalDRF" --authenticate_agents="true" > --authenticate_frameworks="true" --authenticate_http_frameworks="true" > --authenticate_http_readonly="true" --authenticate_http_readwrite="true" > --authenticators="crammd5" --authorizers="local" > --credentials="/tmp/7rr0oB/credentials" --framework_sorter="drf" > --help="false" --hostname_lookup="true" --http_authenticators="basic" > --http_framework_authenticators="basic" --initialize_driver_logging="true" > --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" > --max_agent_ping_timeouts="5" --max_completed_frameworks="50" > --max_completed_tasks_per_framework="1000" --quiet="false" > --recovery_agent_removal_limit="100%" --registry="replicated_log" > --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" > --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" > --registry_store_timeout="100secs" --registry_strict="false" > --root_submissions="true" --user_sorter="drf" --version="false" > --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/7rr0oB/master" > --zk_session_timeout="10secs" > [03:08:28]W: [Step 10/10] I1004 03:08:28.209692 598 master.cpp:432] > Master only allowing authenticated frameworks to register > [03:08:28]W: [Step 10/10] I1004 03:08:28.209699 598 master.cpp:446] > Master only allowing authenticated agents to register > [03:08:28]W: [Step 10/10] I1004 03:08:28.209704 598 master.cpp:459] > Master only 
allowing authenticated HTTP frameworks to register > [03:08:28]W: [Step 10/10] I1004 03:08:28.209709 598 credentials.hpp:37] > Loading credentials for authentication from '/tmp/7rr0oB/credentials' > [03:08:28]W: [Step 10/10] I1004 03:08:28.209810 598 master.cpp:504] Using > default 'crammd5' authenticator > [03:08:28]W: [Step 10/10] I1004 03:08:28.209853 598 http.cpp:883] Using > default 'basic' HTTP authenticator for realm 'mesos-master-readonly' > [03:08:28]W: [Step 10/10] I1004 03:08:28.209897 598 http.cpp:883] Using > default 'basic' HTTP authenticator for realm 'mesos-master-readwrite' > [03:08:28]W: [Step 10/10] I1004 03:08:28.209940 598 http.cpp:883] Using > default 'basic' HTTP authenticator for realm
[jira] [Created] (MESOS-6317) Race in master update slave.
Guangya Liu created MESOS-6317: -- Summary: Race in master update slave. Key: MESOS-6317 URL: https://issues.apache.org/jira/browse/MESOS-6317 Project: Mesos Issue Type: Bug Reporter: Guangya Liu Assignee: Guangya Liu Currently, when the master handles {{updateSlave}}, it first rescinds offers and then calls {{updateSlave}} in the allocator. This is racy: a batch allocation may be inserted between the two steps, so the order becomes rescind offers -> batch allocation -> update slave. This order causes problems when the oversubscribed resources decrease. Suppose the oversubscribed resources decrease from 2 to 1. After the offers are rescinded, the batch allocation allocates the old 2 oversubscribed resources again, and only then does update slave set the total oversubscribed resources to 1. The agent host is then overcommitted for some time, because tasks can still use 2 oversubscribed resources rather than 1; once the tasks using the 2 oversubscribed resources finish, the state returns to normal. We should therefore swap the order of rescinding offers and {{updateSlave}} in the master to avoid resource overcommit. If we update the slave first and then rescind offers, the order becomes update slave -> batch allocation -> rescind offers, which causes no problem when resources decrease. Suppose again that the oversubscribed resources decrease from 2 to 1. Update slave sets the total oversubscribed resources to 1 directly, the batch allocation does not allocate any oversubscribed resources because more are already allocated than the new total, and rescinding offers then rescinds all offers that use oversubscribed resources. The agent host does not become overcommitted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
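The two orderings described in MESOS-6317 can be replayed with a small toy model. This is only a sketch under stated assumptions, not Mesos code: `ToyAllocator`, its fields, and the step names are hypothetical, and the model tracks a single scalar of oversubscribed resources for one agent.

```python
from dataclasses import dataclass


@dataclass
class ToyAllocator:
    """Toy model of one agent's oversubscribed resources (hypothetical, not Mesos code)."""
    total: int    # oversubscribed resources the agent currently reports
    offered: int  # oversubscribed resources held in outstanding offers

    def rescind_offers(self) -> None:
        # Rescinding returns every outstanding offer to the pool.
        self.offered = 0

    def batch_allocate(self) -> None:
        # Offer whatever is still free; offer nothing if already at or over the total.
        free = self.total - self.offered
        if free > 0:
            self.offered += free

    def update_slave(self, new_total: int) -> None:
        self.total = new_total

    def overcommitted(self) -> bool:
        return self.offered > self.total


def run(order):
    """Apply the named steps in order, starting from 2 offered out of 2."""
    alloc = ToyAllocator(total=2, offered=2)
    steps = {
        "rescind": alloc.rescind_offers,
        "allocate": alloc.batch_allocate,
        "update": lambda: alloc.update_slave(1),  # oversubscribed drops from 2 to 1
    }
    for name in order:
        steps[name]()
    return alloc


current = run(["rescind", "allocate", "update"])   # order described as racy
proposed = run(["update", "allocate", "rescind"])  # order proposed in the ticket
print("current order overcommitted:", current.overcommitted())
print("proposed order overcommitted:", proposed.overcommitted())
```

In this model the proposed order ends with nothing offered and a total of 1, so the next allocation cycle can hand out at most the reduced amount.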
[jira] [Updated] (MESOS-6181) The logic for BadACLNoPrincipal and BadACLDropCreateAndDestroy is not correct
[ https://issues.apache.org/jira/browse/MESOS-6181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guangya Liu updated MESOS-6181: --- Description: One issue for the test: if destroying the volume fails, we should check the latest offer to make sure that it still contains the volume resource. (was: Two issues for those two test cases: 1) There is no need to add `{}` in the test case; adding `{}` causes the driver to decline a nonexistent offer. 2) If destroying the volume fails, we should check the latest offer to make sure that it still contains the volume resource.) > The logic for BadACLNoPrincipal and BadACLDropCreateAndDestroy is not correct > - > > Key: MESOS-6181 > URL: https://issues.apache.org/jira/browse/MESOS-6181 > Project: Mesos > Issue Type: Bug >Reporter: Guangya Liu >Assignee: Guangya Liu > > One issue for the test: if destroying the volume fails, we should check the latest > offer to make sure that it still contains the volume resource. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-5524) Expose resource allocation constraints (quota, shares) to schedulers.
[ https://issues.apache.org/jira/browse/MESOS-5524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15522430#comment-15522430 ] Guangya Liu edited comment on MESOS-5524 at 9/28/16 1:39 PM: - [~bmahler] one question I would like to discuss: when exposing the resource allocation constraints, should we expose the resources at the {{role}} level or at the {{framework}} level? If we expose them at the {{role}} level, there may be problems when one role has multiple frameworks: every framework in the role sees the same resource constraints, and we cannot guarantee that any one framework can always get the exposed resources. The {{framework}} level is not ideal either, because it is unclear how to define it: do we split the resources evenly among all {{frameworks}} under the same {{role}}, or use some other scheme? An even split is also inaccurate, since one {{framework}} may have a large number of tasks while others have none, and the framework with many tasks will use up all of the resources. was (Author: gyliu): [~bmahler] one question I would like to discuss: when exposing the resource allocation constraints, should we expose the resources at the {{role}} level or at the {{framework}} level? If we expose them at the {{role}} level, there may be problems when one role has multiple frameworks: every framework in the role sees the same resource constraints, and we cannot guarantee that any one framework can always get the exposed resources. The {{framework}} level seems more accurate, but even then it may still be inaccurate because of the allocator's coarse-grained resource allocation when there are more frameworks than agents in the cluster. Any comments? > Expose resource allocation constraints (quota, shares) to schedulers. 
> - > > Key: MESOS-5524 > URL: https://issues.apache.org/jira/browse/MESOS-5524 > Project: Mesos > Issue Type: Epic > Components: allocation, scheduler api >Reporter: Benjamin Mahler > > Currently, schedulers do not have visibility into their quota or shares of > the cluster. By providing this information, we give the scheduler the ability > to make better decisions. As we start to allow schedulers to decide how > they'd like to use a particular resource (e.g. as non-revocable or > revocable), schedulers need visibility into their quota and shares to make an > effective decision (otherwise they may accidentally exceed their quota and > will not find out until mesos replies with TASK_LOST REASON_QUOTA_EXCEEDED). > We would start by exposing the following information: > * quota: e.g. cpus:10, mem:20, disk:40 > * shares: e.g. cpus:20, mem:40, disk:80 > Currently, quota is used for non-revocable resources and the idea is to use > shares only for consuming revocable resources since the number of shares > available to a role changes dynamically as resources come and go, frameworks > come and go, or the operator manipulates the amount of resources sectioned > off for quota. > By exposing quota and shares, the framework knows when it can consume > additional non-revocable resources (i.e. when it has fewer non-revocable > resources allocated to it than its quota) or when it can consume revocable > resources (always! but in the future, it cannot revoke another user's > revocable resources if the framework is above its fair share). > This also allows schedulers to determine whether they have sufficient quota > assigned to them, and to alert the operator if they need more to run safely. 
> Also, by viewing their fair share, the framework can expose monitoring > information that shows the discrepancy between how much it would like and its > fair share (note that the framework can actually exceed its fair share but in > the future this will mean increased potential for revocation). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
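The rule in the MESOS-5524 description (a framework may consume more non-revocable resources only while its non-revocable allocation is below its quota) can be illustrated with a short sketch. This is a sketch only: the dictionary shape and both function names are hypothetical, and the ticket proposes exposing quota to schedulers rather than describing an existing API.

```python
def nonrevocable_headroom(quota, allocated):
    """Per-resource amount still available under quota (never negative)."""
    return {
        name: max(0.0, limit - allocated.get(name, 0.0))
        for name, limit in quota.items()
    }


def can_launch_nonrevocable(quota, allocated, request):
    """True if every requested amount fits within the remaining quota headroom."""
    headroom = nonrevocable_headroom(quota, allocated)
    return all(
        amount <= headroom.get(name, 0.0) for name, amount in request.items()
    )


# Quota values taken from the example in the ticket; the allocation is made up.
quota = {"cpus": 10.0, "mem": 20.0, "disk": 40.0}
allocated = {"cpus": 8.0, "mem": 20.0, "disk": 0.0}

headroom = nonrevocable_headroom(quota, allocated)
fits = can_launch_nonrevocable(quota, allocated, {"cpus": 2.0, "disk": 10.0})
blocked = can_launch_nonrevocable(quota, allocated, {"cpus": 1.0, "mem": 1.0})
print(headroom, fits, blocked)
```

A scheduler with this information could fall back to revocable resources for the second request instead of discovering the quota violation only after launch.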
[jira] [Commented] (MESOS-6181) The logic for BadACLNoPrincipal and BadACLDropCreateAndDestroy is not correct
[ https://issues.apache.org/jira/browse/MESOS-6181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15525325#comment-15525325 ] Guangya Liu commented on MESOS-6181: Thanks [~greggomann]. Agreed on #1. For #2, take {{PersistentVolumeTest, BadACLNoPrincipal}} as an example: at https://github.com/apache/mesos/blob/master/src/tests/persistent_volume_tests.cpp#L1626 the test expects {{EXPECT_TRUE(Resources(offer.resources()).contains(volume));}}, but it does not use the latest offer; it still uses the offer obtained at https://github.com/apache/mesos/blob/master/src/tests/persistent_volume_tests.cpp#L1599. This is not accurate. We should use the offer received after {{acceptOffers}}, to make sure that the volume is still in the new offer after the allocation interval. Comments? > The logic for BadACLNoPrincipal and BadACLDropCreateAndDestroy is not correct > - > > Key: MESOS-6181 > URL: https://issues.apache.org/jira/browse/MESOS-6181 > Project: Mesos > Issue Type: Bug >Reporter: Guangya Liu >Assignee: Guangya Liu > > Two issues for those two test cases: > 1) There is no need to add `{}` in the test case; adding `{}` causes the > driver to decline a nonexistent offer. > 2) If destroying the volume fails, we should check the latest offer to make sure that > it still contains the volume resource. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5524) Expose resource allocation constraints (quota, shares) to schedulers.
[ https://issues.apache.org/jira/browse/MESOS-5524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15522430#comment-15522430 ] Guangya Liu commented on MESOS-5524: [~bmahler] one question I would like to discuss: when exposing the resource allocation constraints, should we expose the resources at the {{role}} level or at the {{framework}} level? If we expose them at the {{role}} level, there may be problems when one role has multiple frameworks: every framework in the role sees the same resource constraints, and we cannot guarantee that any one framework can always get the exposed resources. The {{framework}} level seems more accurate, but even then it may still be inaccurate because of the allocator's coarse-grained resource allocation when there are more frameworks than agents in the cluster. Any comments? > Expose resource allocation constraints (quota, shares) to schedulers. > - > > Key: MESOS-5524 > URL: https://issues.apache.org/jira/browse/MESOS-5524 > Project: Mesos > Issue Type: Epic > Components: allocation, scheduler api >Reporter: Benjamin Mahler > > Currently, schedulers do not have visibility into their quota or shares of > the cluster. By providing this information, we give the scheduler the ability > to make better decisions. As we start to allow schedulers to decide how > they'd like to use a particular resource (e.g. as non-revocable or > revocable), schedulers need visibility into their quota and shares to make an > effective decision (otherwise they may accidentally exceed their quota and > will not find out until mesos replies with TASK_LOST REASON_QUOTA_EXCEEDED). > We would start by exposing the following information: > * quota: e.g. cpus:10, mem:20, disk:40 > * shares: e.g. 
cpus:20, mem:40, disk:80 > Currently, quota is used for non-revocable resources and the idea is to use > shares only for consuming revocable resources since the number of shares > available to a role changes dynamically as resources come and go, frameworks > come and go, or the operator manipulates the amount of resources sectioned > off for quota. > By exposing quota and shares, the framework knows when it can consume > additional non-revocable resources (i.e. when it has fewer non-revocable > resources allocated to it than its quota) or when it can consume revocable > resources (always! but in the future, it cannot revoke another user's > revocable resources if the framework is above its fair share). > This also allows schedulers to determine whether they have sufficient quota > assigned to them, and to alert the operator if they need more to run safely. > Also, by viewing their fair share, the framework can expose monitoring > information that shows the discrepancy between how much it would like and its > fair share (note that the framework can actually exceed its fair share but in > the future this will mean increased potential for revocation). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6181) The logic for BadACLNoPrincipal and BadACLDropCreateAndDestroy is not correct
[ https://issues.apache.org/jira/browse/MESOS-6181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15497632#comment-15497632 ] Guangya Liu commented on MESOS-6181: cc [~greggomann] > The logic for BadACLNoPrincipal and BadACLDropCreateAndDestroy is not correct > - > > Key: MESOS-6181 > URL: https://issues.apache.org/jira/browse/MESOS-6181 > Project: Mesos > Issue Type: Bug >Reporter: Guangya Liu >Assignee: Guangya Liu > > Two issues for those two test cases: > 1) There is no need to add `{}` in the test case; adding `{}` causes the > driver to decline a nonexistent offer. > 2) If destroying the volume fails, we should check the latest offer to make sure that > it still contains the volume resource. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6181) The logic for BadACLNoPrincipal and BadACLDropCreateAndDestroy is not correct
Guangya Liu created MESOS-6181: -- Summary: The logic for BadACLNoPrincipal and BadACLDropCreateAndDestroy is not correct Key: MESOS-6181 URL: https://issues.apache.org/jira/browse/MESOS-6181 Project: Mesos Issue Type: Bug Reporter: Guangya Liu Two issues for those two test cases: 1) There is no need to add `{}` in the test case; adding `{}` causes the driver to decline a nonexistent offer. 2) If destroying the volume fails, we should check the latest offer to make sure that it still contains the volume resource. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4811) Reusable/Cacheable Offer
[ https://issues.apache.org/jira/browse/MESOS-4811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15469367#comment-15469367 ] Guangya Liu commented on MESOS-4811: Based on the requirement description, this is a duplicate of MESOS-3078. [~klaus1982] [~abi...@gmail.com] please help confirm. Thanks. > Reusable/Cacheable Offer > > > Key: MESOS-4811 > URL: https://issues.apache.org/jira/browse/MESOS-4811 > Project: Mesos > Issue Type: Bug > Components: allocation >Reporter: Klaus Ma >Assignee: Abhishek Dasgupta > Labels: tech-debt > > Currently, resources are returned to the allocator when a task finishes, and > those resources are not allocated to a framework until the next allocation cycle. > Performance is low for short-running tasks (MESOS-3078). The proposed > solution is to let the framework keep using the offer until the allocator decides to > rescind it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4988) Excluded reserved resources when got nonRevocable resources in stage 1.
[ https://issues.apache.org/jira/browse/MESOS-4988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15469211#comment-15469211 ] Guangya Liu commented on MESOS-4988: This improvement seems to have no impact on performance; shall we close this one? [~klaus1982] > Excluded reserved resources when got nonRevocable resources in stage 1. > --- > > Key: MESOS-4988 > URL: https://issues.apache.org/jira/browse/MESOS-4988 > Project: Mesos > Issue Type: Bug > Components: allocation >Reporter: Klaus Ma > > The allocator only allocates non-revocable resources to satisfy quota. As > reserved resources cannot be revocable, it is not necessary to call > `nonRevocable()` for reserved resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6131) Improved performance for resource flatten
Guangya Liu created MESOS-6131: -- Summary: Improved performance for resource flatten Key: MESOS-6131 URL: https://issues.apache.org/jira/browse/MESOS-6131 Project: Mesos Issue Type: Bug Reporter: Guangya Liu Assignee: Guangya Liu {{Resources::flatten}} uses {{+=}} to add each single resource object, but this hurts performance considerably because {{+=}} invokes resource validation. Instead, we should validate the role first and then call {{add}} directly to avoid the per-resource validation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
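The optimization proposed in MESOS-6131 amounts to validating once up front instead of once per added resource. The toy below illustrates only that idea; `ToyResources`, its validation rule, and both flatten functions are hypothetical stand-ins, and unlike the real {{Resources}} class the toy does not merge resources that share a name.

```python
class ToyResources:
    """Toy container: '+=' validates every item, add() appends without checking."""

    validations = 0  # class-wide counter of validation calls, for illustration

    def __init__(self):
        self.items = []

    @staticmethod
    def validate(resource):
        ToyResources.validations += 1
        _name, value, role = resource
        if value < 0 or not role:
            raise ValueError(f"invalid resource: {resource!r}")

    def add(self, resource):
        # Unchecked append; callers must guarantee validity themselves.
        self.items.append(resource)

    def __iadd__(self, resource):
        self.validate(resource)
        self.add(resource)
        return self


def flatten_validating(resources, role):
    """Rewrite every resource to `role`, validating each one via '+='."""
    out = ToyResources()
    for name, value, _old_role in resources.items:
        out += (name, value, role)
    return out


def flatten_direct(resources, role):
    """Check the role once, then append directly without per-item validation."""
    if not role:
        raise ValueError("invalid role")
    out = ToyResources()
    for name, value, _old_role in resources.items:
        out.add((name, value, role))
    return out


source = ToyResources()
for item in [("cpus", 4, "web"), ("mem", 1024, "web"), ("disk", 10, "batch")]:
    source.add(item)

ToyResources.validations = 0
slow = flatten_validating(source, "*")
slow_validations = ToyResources.validations

ToyResources.validations = 0
fast = flatten_direct(source, "*")
fast_validations = ToyResources.validations
print(slow_validations, fast_validations, slow.items == fast.items)
```

The direct variant is only safe under the assumption that the input resources were already validated and that the role is the only thing flatten changes.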
[jira] [Comment Edited] (MESOS-6113) Offer Quota resources as revocable
[ https://issues.apache.org/jira/browse/MESOS-6113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15460347#comment-15460347 ] Guangya Liu edited comment on MESOS-6113 at 9/3/16 4:35 AM: Does the section in MESOS-4392 help? It describes lending out the unused quota to other frameworks and reclaiming it when needed. {code} A greedy analytics batch system wants to use as much of the cluster as possible to maximize computational throughput. When a competing web service with fixed task size starts up, there must be sufficient resources to run it immediately. The operator can reserve these resources by setting quota. However, if these resources are kept idle until the service is in use, this is wasteful from the analytics job's point of view. On the other hand, the analytics job should hand back reserved resources to the service when needed to avoid starvation of the latter. {code} was (Author: gyliu): Does the section in MESOS-4392 help? It describes lending out the unused quota to other frameworks and reclaiming it when needed. {quota} A greedy analytics batch system wants to use as much of the cluster as possible to maximize computational throughput. When a competing web service with fixed task size starts up, there must be sufficient resources to run it immediately. The operator can reserve these resources by setting quota. However, if these resources are kept idle until the service is in use, this is wasteful from the analytics job's point of view. On the other hand, the analytics job should hand back reserved resources to the service when needed to avoid starvation of the latter. {quota} > Offer Quota resources as revocable > -- > > Key: MESOS-6113 > URL: https://issues.apache.org/jira/browse/MESOS-6113 > Project: Mesos > Issue Type: Task > Components: allocation >Affects Versions: 1.0.1 >Reporter: Michael Gummelt > > *Goal:* > I have high-priority Spark jobs, and best-effort jobs. 
I need my > high-priority jobs to pre-empt my best-effort jobs, so I'd like to launch the > best-effort jobs on revocable resources. > *Problem:* > Revocable resources are currently only created via oversubscription, where > resources allocated to but not used by a framework will be offered to other > frameworks. This doesn't support the ability for a high-pri framework to > start up and pre-empt a low-pri framework. > *Solution:* > Let's allow quota (and ideally any reserved resources) to be configurable to > be offered as revocable resources to other frameworks that don't register > with the role. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6113) Offer Quota resources as revocable
[ https://issues.apache.org/jira/browse/MESOS-6113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15456846#comment-15456846 ] Guangya Liu commented on MESOS-6113: Then this should be a duplicate of MESOS-4392 rather than MESOS-4967, right? > Offer Quota resources as revocable > -- > > Key: MESOS-6113 > URL: https://issues.apache.org/jira/browse/MESOS-6113 > Project: Mesos > Issue Type: Task > Components: allocation >Affects Versions: 1.0.1 >Reporter: Michael Gummelt > > *Goal:* > I have high-priority Spark jobs, and best-effort jobs. I need my > high-priority jobs to pre-empt my best-effort jobs, so I'd like to launch the > best-effort jobs on revocable resources. > *Problem:* > Revocable resources are currently only created via oversubscription, where > resources allocated to but not used by a framework will be offered to other > frameworks. This doesn't support the ability for a high-pri framework to > start up and pre-empt a low-pri framework. > *Solution:* > Let's allow quota (and ideally any reserved resources) to be configurable to > be offered as revocable resources to other frameworks that don't register > with the role. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6113) Offer Quota resources as revocable
[ https://issues.apache.org/jira/browse/MESOS-6113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guangya Liu updated MESOS-6113: --- Summary: Offer Quota resources as revocable (was: Offer reserved resources as revocable) > Offer Quota resources as revocable > -- > > Key: MESOS-6113 > URL: https://issues.apache.org/jira/browse/MESOS-6113 > Project: Mesos > Issue Type: Task > Components: allocation >Affects Versions: 1.0.1 >Reporter: Michael Gummelt > > *Goal:* > I have high-priority Spark jobs, and best-effort jobs. I need my > high-priority jobs to pre-empt my best-effort jobs, so I'd like to launch the > best-effort jobs on revocable resources. > *Problem:* > Revocable resources are currently only created via oversubscription, where > resources allocated to but not used by a framework will be offered to other > frameworks. This doesn't support the ability for a high-pri framework to > start up and pre-empt a low-pri framework. > *Solution:* > Let's allow quota (and ideally any reserved resources) to be configurable to > be offered as revocable resources to other frameworks that don't register > with the role. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-6113) Offer reserved resources as revocable
[ https://issues.apache.org/jira/browse/MESOS-6113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15454373#comment-15454373 ] Guangya Liu edited comment on MESOS-6113 at 9/1/16 5:32 AM: MESOS-4967 is a kind of oversubscription for reserved resources, and MESOS-4392 is a kind of oversubscription for quota resources. I am a bit confused here: the content of this JIRA is about {{Quota}} resources while the title is about {{reserved}} resources. Can you elaborate? [~mgummelt] was (Author: gyliu): MESOS-4976 is a kind of oversubscription for reserved resources, and MESOS-4392 is a kind of oversubscription for quota resources. I am a bit confused here: the content of this JIRA is about {{Quota}} resources while the title is about {{reserved}} resources. Can you elaborate? [~mgummelt] > Offer reserved resources as revocable > - > > Key: MESOS-6113 > URL: https://issues.apache.org/jira/browse/MESOS-6113 > Project: Mesos > Issue Type: Task > Components: allocation >Affects Versions: 1.0.1 >Reporter: Michael Gummelt > > *Goal:* > I have high-priority Spark jobs, and best-effort jobs. I need my > high-priority jobs to pre-empt my best-effort jobs, so I'd like to launch the > best-effort jobs on revocable resources. > *Problem:* > Revocable resources are currently only created via oversubscription, where > resources allocated to but not used by a framework will be offered to other > frameworks. This doesn't support the ability for a high-pri framework to > start up and pre-empt a low-pri framework. > *Solution:* > Let's allow quota (and ideally any reserved resources) to be configurable to > be offered as revocable resources to other frameworks that don't register > with the role. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6113) Offer reserved resources as revocable
[ https://issues.apache.org/jira/browse/MESOS-6113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15454373#comment-15454373 ] Guangya Liu commented on MESOS-6113: MESOS-4976 is a kind of oversubscription for reserved resources, and MESOS-4392 is a kind of oversubscription for quota resources. I am a bit confused here: the content of this JIRA is about {{Quota}} resources while the title is about {{reserved}} resources. Can you elaborate? [~mgummelt] > Offer reserved resources as revocable > - > > Key: MESOS-6113 > URL: https://issues.apache.org/jira/browse/MESOS-6113 > Project: Mesos > Issue Type: Task > Components: allocation >Affects Versions: 1.0.1 >Reporter: Michael Gummelt > > *Goal:* > I have high-priority Spark jobs, and best-effort jobs. I need my > high-priority jobs to pre-empt my best-effort jobs, so I'd like to launch the > best-effort jobs on revocable resources. > *Problem:* > Revocable resources are currently only created via oversubscription, where > resources allocated to but not used by a framework will be offered to other > frameworks. This doesn't support the ability for a high-pri framework to > start up and pre-empt a low-pri framework. > *Solution:* > Let's allow quota (and ideally any reserved resources) to be configurable to > be offered as revocable resources to other frameworks that don't register > with the role. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6112) Frameworks are starved when > 5 are run concurrently
[ https://issues.apache.org/jira/browse/MESOS-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15454358#comment-15454358 ] Guangya Liu commented on MESOS-6112: Perhaps you can use {{suppressOffers()}} and {{reviveOffers()}} as a pair: after {{suppressOffers()}}, call {{reviveOffers()}} to see whether you get an offer for the persistent volume. If not, call {{suppressOffers()}} again and loop until the host with your persistent volume comes back. > Frameworks are starved when > 5 are run concurrently > > > Key: MESOS-6112 > URL: https://issues.apache.org/jira/browse/MESOS-6112 > Project: Mesos > Issue Type: Task > Components: allocation, master >Affects Versions: 1.0.1 >Reporter: Michael Gummelt > > As I understand it, the master will send an offer to a list of frameworks > ordered by DRF, until the offer is accepted. There is a 1s wait time between > each offering. Once the decline timeout for the first framework has been > reached, rather than continuing to submit the offer to the rest of the > frameworks in the list, the master starts over at the beginning, starving the > rest of the frameworks. > This means that in order for Mesos to support > 5 concurrent frameworks, all > frameworks must be good citizens and set their decline timeout to something > large or suppress offers. I think this is a fairly undesirable state of > things. > I propose that the master instead continues to submit the offer to every > registered framework, even if the declineOffer timeout has been reached. > The potential increase in task startup latency that could be introduced by > this change can be obviated in part if we also make the master smarter about > how long to wait between successive offers, rather than a static 1s. > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
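The suppress/revive loop suggested in this comment could look roughly like the sketch below. It makes several assumptions: `FakeDriver` is a stand-in that only records calls named after {{suppressOffers()}} and {{reviveOffers()}}, `poll_offers` is a hypothetical callable that returns the offers seen since the last revive (a real scheduler receives offers through a callback), and the offer dictionaries are made up.

```python
class FakeDriver:
    """Stand-in for a scheduler driver; it only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def suppressOffers(self):
        self.calls.append("suppress")

    def reviveOffers(self):
        self.calls.append("revive")


def wait_for_volume(driver, poll_offers, volume_id, max_rounds):
    """Return the first offer that contains volume_id, or None after max_rounds."""
    for _ in range(max_rounds):
        driver.reviveOffers()
        for offer in poll_offers():
            if volume_id in offer["volumes"]:
                return offer
        # Nothing useful this round: stop receiving offers until the next attempt.
        driver.suppressOffers()
    return None


# Three polling rounds; the volume's host only comes back in the third one.
rounds = iter([
    [{"id": "o1", "volumes": []}],
    [],
    [{"id": "o2", "volumes": ["vol-1"]}],
])
driver = FakeDriver()
found = wait_for_volume(driver, lambda: next(rounds), "vol-1", max_rounds=3)
print(found, driver.calls)
```

A real scheduler would wait or back off between rounds rather than reviving immediately, so that other frameworks get a chance at the offers in between.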
[jira] [Commented] (MESOS-6112) Frameworks are starved when > 5 are run concurrently
[ https://issues.apache.org/jira/browse/MESOS-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15450888#comment-15450888 ] Guangya Liu commented on MESOS-6112: Is this a duplicate of MESOS-3202? I think this only happens when you have more frameworks than agents, right? Can quota help if there is one role per framework? > Frameworks are starved when > 5 are run concurrently > > > Key: MESOS-6112 > URL: https://issues.apache.org/jira/browse/MESOS-6112 > Project: Mesos > Issue Type: Task > Components: allocation, master >Affects Versions: 1.0.1 >Reporter: Michael Gummelt > > As I understand it, the master will send an offer to a list of frameworks > ordered by DRF, until the offer is accepted. There is a 1s wait time between > each offering. Once the decline timeout for the first framework has been > reached, rather than continuing to submit the offer to the rest of the > frameworks in the list, the master starts over at the beginning, starving the > rest of the frameworks. > This means that in order for Mesos to support > 5 concurrent frameworks, all > frameworks must be good citizens and set their decline timeout to something > large or suppress offers. I think this is a fairly undesirable state of > things. > I propose that the master instead continues to submit the offer to every > registered framework, even if the declineOffer timeout has been reached. > The potential increase in task startup latency that could be introduced by > this change can be obviated in part if we also make the master smarter about > how long to wait between successive offers, rather than a static 1s. > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6087) Add master tests for TaskGroup
[ https://issues.apache.org/jira/browse/MESOS-6087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15438557#comment-15438557 ] Guangya Liu commented on MESOS-6087: https://reviews.apache.org/r/51451/ Added test case MasterAuthorizationTest.KillPendingTaskInTaskGroup. cc [~vinodkone] > Add master tests for TaskGroup > -- > > Key: MESOS-6087 > URL: https://issues.apache.org/jira/browse/MESOS-6087 > Project: Mesos > Issue Type: Bug >Reporter: Vinod Kone >Assignee: Guangya Liu > > Some of the tests we want to write: > -- If a pending task in a task group is killed, the entire group is killed. > -- If a task in a task group is invalid, the whole group is considered > invalid. > -- If a task in a task group is unauthorized, the whole group is considered > unauthorized. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6087) Add master tests for TaskGroup
[ https://issues.apache.org/jira/browse/MESOS-6087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guangya Liu updated MESOS-6087: --- Assignee: (was: Guangya Liu) > Add master tests for TaskGroup > -- > > Key: MESOS-6087 > URL: https://issues.apache.org/jira/browse/MESOS-6087 > Project: Mesos > Issue Type: Bug >Reporter: Vinod Kone > > Some of the tests we want to write: > -- If a pending task in a task group is killed, the entire group is killed. > -- If a task in a task group is invalid, the whole group is considered > invalid. > -- If a task in a task group is unauthorized, the whole group is considered > unauthorized. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-6087) Add master tests for TaskGroup
[ https://issues.apache.org/jira/browse/MESOS-6087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guangya Liu reassigned MESOS-6087: -- Assignee: Guangya Liu > Add master tests for TaskGroup > -- > > Key: MESOS-6087 > URL: https://issues.apache.org/jira/browse/MESOS-6087 > Project: Mesos > Issue Type: Bug >Reporter: Vinod Kone >Assignee: Guangya Liu > > Some of the tests we want to write: > -- If a pending task in a task group is killed, the entire group is killed. > -- If a task in a task group is invalid, the whole group is considered > invalid. > -- If a task in a task group is unauthorized, the whole group is considered > unauthorized. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4808) Allocation in batch instead of execute it every-time when addSlave/addFramework.
[ https://issues.apache.org/jira/browse/MESOS-4808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15434588#comment-15434588 ] Guangya Liu commented on MESOS-4808: [~klaus1982] Shall we mark this as a duplicate of MESOS-3157? I think the patch for MESOS-3157, https://reviews.apache.org/r/51027/, also fixes this ticket. > Allocation in batch instead of execute it every-time when > addSlave/addFramework. > > > Key: MESOS-4808 > URL: https://issues.apache.org/jira/browse/MESOS-4808 > Project: Mesos > Issue Type: Bug > Components: allocation >Reporter: Klaus Ma > Labels: master, tech-debt > > Currently, {{allocate()}} is executed every time a new slave/framework > is registered; if lots of agents start at almost the same time, the > allocation will keep running for a while. It's acceptable behaviour to > allocate resources in the next allocation cycle. But when a task finishes, > it's better to allocate ASAP, although there are performance issues; refer to > MESOS-3078 for more detail on short-running tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4767) Apply batching to allocation events to reduce allocator backlogging.
[ https://issues.apache.org/jira/browse/MESOS-4767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15434574#comment-15434574 ] Guangya Liu commented on MESOS-4767: [~bmahler] Shall we mark this as a duplicate of MESOS-3157? I think the patch for MESOS-3157, https://reviews.apache.org/r/51027/, also fixes this ticket. > Apply batching to allocation events to reduce allocator backlogging. > > > Key: MESOS-4767 > URL: https://issues.apache.org/jira/browse/MESOS-4767 > Project: Mesos > Issue Type: Improvement > Components: allocation >Reporter: Benjamin Mahler >Assignee: Guangya Liu > > Per the > [discussion|https://issues.apache.org/jira/browse/MESOS-3157?focusedCommentId=14728377=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14728377] > that came out of MESOS-3157, we'd like to batch together outstanding > allocation dispatches in order to avoid backing up the allocator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-3078) Recovered resources are not re-allocated until the next allocation delay.
[ https://issues.apache.org/jira/browse/MESOS-3078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guangya Liu reassigned MESOS-3078: -- Assignee: Guangya Liu > Recovered resources are not re-allocated until the next allocation delay. > - > > Key: MESOS-3078 > URL: https://issues.apache.org/jira/browse/MESOS-3078 > Project: Mesos > Issue Type: Improvement > Components: allocation >Reporter: Benjamin Mahler >Assignee: Guangya Liu > > Currently, when resources are recovered, we do not perform an allocation for > that slave. Rather, we wait until the next allocation interval. > For small task, high throughput frameworks, this can have a significant > impact on overall throughput, see the following thread: > http://markmail.org/thread/y6mzfwzlurv6nik3 > We should consider immediately performing a re-allocation for the slave upon > resource recovery. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3078) Recovered resources are not re-allocated until the next allocation delay.
[ https://issues.apache.org/jira/browse/MESOS-3078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432190#comment-15432190 ] Guangya Liu commented on MESOS-3078: The review posted by [~jjanco] here https://reviews.apache.org/r/51027/ can help with this; we can use logic similar to that in {{addSlave}} to handle it.
{code}
allocationCandidates.insert(slaveId);

if (!allocationPending) {
  allocationPending = true;
  dispatch(self(), ::allocate);
}
{code}
> Recovered resources are not re-allocated until the next allocation delay. > - > > Key: MESOS-3078 > URL: https://issues.apache.org/jira/browse/MESOS-3078 > Project: Mesos > Issue Type: Improvement > Components: allocation >Reporter: Benjamin Mahler > > Currently, when resources are recovered, we do not perform an allocation for > that slave. Rather, we wait until the next allocation interval. > For small task, high throughput frameworks, this can have a significant > impact on overall throughput, see the following thread: > http://markmail.org/thread/y6mzfwzlurv6nik3 > We should consider immediately performing a re-allocation for the slave upon > resource recovery. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
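The batching idea in the MESOS-3078 comment above (record the slave as a candidate, and enqueue an allocation only if none is already pending) can be illustrated with a small self-contained sketch. It does not use libprocess: `BatchingAllocator`, its plain `std::queue` of events and `runQueuedEvents()` are hypothetical stand-ins for the allocator process and its `dispatch` queue.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <set>
#include <string>
#include <utility>

// Simplified model of coalescing allocation requests into one pending run.
class BatchingAllocator {
public:
  int allocationRuns = 0;         // How many times allocate() actually ran.
  std::size_t lastBatchSize = 0;  // Candidates handled by the last run.

  // Called whenever a slave needs (re)allocation, e.g. on resource recovery.
  void requestAllocation(const std::string& slaveId) {
    allocationCandidates.insert(slaveId);

    // Enqueue at most one allocation event; later requests piggyback on it.
    if (!allocationPending) {
      allocationPending = true;
      events.push([this]() { allocate(); });
    }
  }

  // Stand-in for the process event loop draining its queue.
  void runQueuedEvents() {
    while (!events.empty()) {
      std::function<void()> event = std::move(events.front());
      events.pop();
      event();
    }
  }

  std::size_t queuedEvents() const { return events.size(); }

private:
  void allocate() {
    assert(allocationPending);
    ++allocationRuns;
    lastBatchSize = allocationCandidates.size();
    allocationCandidates.clear();
    allocationPending = false;
  }

  bool allocationPending = false;
  std::set<std::string> allocationCandidates;
  std::queue<std::function<void()>> events;
};
```

With this shape, three requests arriving before the queue is drained result in a single `allocate()` run over the distinct candidates, which is the backlog reduction the comment is after.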
[jira] [Commented] (MESOS-970) Upgrade bundled leveldb to 1.18
[ https://issues.apache.org/jira/browse/MESOS-970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417455#comment-15417455 ] Guangya Liu commented on MESOS-970: --- Actually, [~bmahler] already mentioned this in a JIRA here https://issues.apache.org/jira/browse/MESOS-4558 and we do plan to fix it. I think this was introduced by the review here https://reviews.apache.org/r/49784/, which added more test cases to the allocator benchmark test.
{code}
INSTANTIATE_TEST_CASE_P(
    SlaveAndFrameworkCount,
    HierarchicalAllocator_BENCHMARK_Test,
    ::testing::Combine(
      ::testing::Values(1000U, 5000U, 1U, 2U, 3U, 5U),
      ::testing::Values(1U, 50U, 100U, 200U, 500U, 1000U, 3000U, 6000U))
    );
{code}
There are 48 (6 * 8) cases here, and the longest benchmark test has 5 agents and 6000 frameworks as its test parameters. Some tests also loop (framework * 2) times, so the last case runs 12000 loops; that is why you see the benchmark test time increasing. We are now trying to find a solution for this so that we can also enable the benchmark tests in ASF CI. For now, perhaps you can use a filter to exclude some test cases.
{code}
MESOS_BENCHMARK=1 GTEST_FILTER="*BENCHMARK*.*/1" make check
{code}
The above command runs only the parameter set with index 1 of each matching benchmark test (gtest numbers parameter sets from 0); you can adjust the filter based on your test requirements. Hope this helps.
> Upgrade bundled leveldb to 1.18 > --- > > Key: MESOS-970 > URL: https://issues.apache.org/jira/browse/MESOS-970 > Project: Mesos > Issue Type: Improvement > Components: replicated log >Reporter: Benjamin Mahler >Assignee: Tomasz Janiszewski > > We currently bundle leveldb 1.4, and the latest version is leveldb 1.18. > Upgrade to 1.18 could solve the problems when build Mesos in some non-x86 > architecture CPU. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
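The arithmetic in the comment above (6 * 8 = 48 parameter combinations, and 2 * 6000 = 12000 loops for the largest framework count) can be checked with a small standalone sketch. It does not use GoogleTest; `combine` and `loopsFor` are hypothetical helpers, the value lists are copied as they appear in the comment, and the iteration order here is not claimed to match the order in which gtest numbers its instances.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Parameter values copied as they appear in the comment.
const std::vector<unsigned> kSlaveCounts = {1000U, 5000U, 1U, 2U, 3U, 5U};
const std::vector<unsigned> kFrameworkCounts = {
    1U, 50U, 100U, 200U, 500U, 1000U, 3000U, 6000U};

// Cartesian product of the two lists: one (slaves, frameworks) pair per
// benchmark instance.
std::vector<std::pair<unsigned, unsigned>> combine(
    const std::vector<unsigned>& slaves,
    const std::vector<unsigned>& frameworks) {
  std::vector<std::pair<unsigned, unsigned>> result;
  result.reserve(slaves.size() * frameworks.size());
  for (unsigned s : slaves) {
    for (unsigned f : frameworks) {
      result.emplace_back(s, f);
    }
  }
  assert(result.size() == slaves.size() * frameworks.size());
  return result;
}

// Some benchmark tests are described as looping (framework * 2) times.
unsigned loopsFor(unsigned frameworks) { return frameworks * 2; }
```

Calling `combine(kSlaveCounts, kFrameworkCounts).size()` gives the number of instances a filter such as the one in the comment has to choose from.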
[jira] [Commented] (MESOS-5830) Make a sweep to trim excess space around angle brackets
[ https://issues.apache.org/jira/browse/MESOS-5830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410815#comment-15410815 ] Guangya Liu commented on MESOS-5830: Yes, [~zerobleed], this is a good ticket for getting started with Mesos. As suggested by [~haosd...@gmail.com], you can follow https://github.com/apache/mesos/blob/master/docs/submitting-a-patch.md to contribute. There are also some meetup slides here that you can use as a reference: http://files.meetup.com/18744996/Mesos_Community_Guidance.pdf > Make a sweep to trim excess space around angle brackets > --- > > Key: MESOS-5830 > URL: https://issues.apache.org/jira/browse/MESOS-5830 > Project: Mesos > Issue Type: Improvement >Reporter: Benjamin Bannier >Priority: Trivial > > The codebase still has pre-C++11 code where we needed to say e.g., > {{vector
[jira] [Updated] (MESOS-5921) `validate` is a bit heavy to check negative scalar resource
[ https://issues.apache.org/jira/browse/MESOS-5921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guangya Liu updated MESOS-5921: --- Attachment: WithoutValidation.png > `validate` is a bit heavy to check negative scalar resource > --- > > Key: MESOS-5921 > URL: https://issues.apache.org/jira/browse/MESOS-5921 > Project: Mesos > Issue Type: Bug >Reporter: Guangya Liu >Assignee: Guangya Liu > Attachments: WithValidation.png, WithoutValidation.png > > > After subtracting resources, we currently call {{Resources::validate}} to > check whether a scalar resource has become negative, so that it can be > removed. This is a bit heavy, as {{Resources::validate}} performs many > checks (resource type, role, resource name, etc.) that are not necessary > here. > We should introduce a new helper function {{isNegative}} to check if the > resource is a negative scalar resource. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5921) `validate` is a bit heavy to check negative scalar resource
[ https://issues.apache.org/jira/browse/MESOS-5921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guangya Liu updated MESOS-5921: --- Attachment: WithValidation.png > `validate` is a bit heavy to check negative scalar resource > --- > > Key: MESOS-5921 > URL: https://issues.apache.org/jira/browse/MESOS-5921 > Project: Mesos > Issue Type: Bug >Reporter: Guangya Liu >Assignee: Guangya Liu > Attachments: WithValidation.png > > > After subtracting resources, we currently call {{Resources::validate}} to > check whether a scalar resource has become negative, so that it can be > removed. This is a bit heavy, as {{Resources::validate}} performs many > checks (resource type, role, resource name, etc.) that are not necessary > here. > We should introduce a new helper function {{isNegative}} to check if the > resource is a negative scalar resource. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5921) `validate` is a bit heavy to check negative scalar resource
[ https://issues.apache.org/jira/browse/MESOS-5921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398446#comment-15398446 ] Guangya Liu commented on MESOS-5921: Sure Ben, I will use {{callgrind}} to check why the performance did not improve much before posting the patch. Since I was using `Ports` resources here, the performance should ideally show some improvement after switching to {{isNegative}}. > `validate` is a bit heavy to check negative scalar resource > --- > > Key: MESOS-5921 > URL: https://issues.apache.org/jira/browse/MESOS-5921 > Project: Mesos > Issue Type: Bug >Reporter: Guangya Liu >Assignee: Guangya Liu > > After subtracting resources, we currently call {{Resources::validate}} to > check whether a scalar resource has become negative, so that it can be > removed. This is a bit heavy, as {{Resources::validate}} performs many > checks (resource type, role, resource name, etc.) that are not necessary > here. > We should introduce a new helper function {{isNegative}} to check if the > resource is a negative scalar resource. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5921) `validate` is a bit heavy to check negative scalar resource
[ https://issues.apache.org/jira/browse/MESOS-5921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15397700#comment-15397700 ] Guangya Liu commented on MESOS-5921: [~bmahler], I did some checking on this, and it seems we can keep the current logic of {{Resources::subtract}} using {{Resources::validate}}, as this function returns very quickly when it encounters a negative scalar resource. What do you think? Thanks. {code} Option<Error> Resources::validate(const Resource& resource) { if (resource.name().empty()) { return Error("Empty resource name"); } if (!Value::Type_IsValid(resource.type())) { return Error("Invalid resource type"); } if (resource.type() == Value::SCALAR) { if (!resource.has_scalar() || resource.has_ranges() || resource.has_set()) { return Error("Invalid scalar resource"); } if (resource.scalar().value() < 0) { return Error("Invalid scalar resource: value < 0"); << Return here if the scalar resource is negative and thus will not do other checking. } } else if (resource.type() == Value::RANGES) { .. } else if (resource.type() == Value::SET) { .. } else { // Resource doesn't support TEXT or other value types. return Error("Unsupported resource type"); } .. } {code} I also ran a test with the following code diff and found that the performance was almost unchanged when operating on 1000 port resources. Code diff: {code} --- a/include/mesos/resources.hpp +++ b/include/mesos/resources.hpp @@ -396,6 +396,9 @@ private: // ensure this is warranted. bool _contains(const Resource& that) const; + // Check if the resource is a negative scalar resource. + bool isNegative(const Resource& r) const; + // Similar to the public 'find', but only for a single Resource // object. The target resource may span multiple roles, so this // returns Resources. 
diff --git a/src/common/resources.cpp b/src/common/resources.cpp index 2878ace..b1259b9 100644 --- a/src/common/resources.cpp +++ b/src/common/resources.cpp @@ -1296,6 +1296,17 @@ bool Resources::_contains(const Resource& that) const } +bool Resources::isNegative(const Resource& r) const +{ + if (r.type() == Value::SCALAR && + r.scalar().value() < 0) { +return true; + } + + return false; +} + + Option Resources::find(const Resource& target) const { Resources found; @@ -1442,10 +1453,8 @@ void Resources::subtract(const Resource& that) if (internal::subtractable(*resource, that)) { *resource -= that; - // Remove the resource if it becomes invalid or zero. We need - // to do the validation because we want to strip negative - // scalar Resource object. - if (validate(*resource).isSome() || isEmpty(*resource)) { + // Remove the resource if it becomes negative or empty. + if (isNegative(*resource) || isEmpty(*resource)) { // As `resources` is not ordered, and erasing an element // from the middle using `DeleteSubrange` is expensive, we // swap with the last element and then shrink the {code} Before fix: {code} [==] Running 1 test from 1 test case. [--] Global test environment set-up. [--] 1 test from ResourcesOperators/Resources_BENCHMARK_Test [ RUN ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2 Took 2.730778secs to perform 1000 'total += r' operations on ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1... Took 20.703045secs to perform 1000 'total.contains(r)' operations on ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1... Took 3.530712secs to perform 1000 'total -= r' operations on ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1... Took 2.92716secs to perform 1000 'total = total + r' operations on ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1... Took 3.489936secs to perform 1000 'total = total - r' operations on ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1... 
Took 122368us to perform 1000 'r.nonRevocable()' operations on ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1... [ OK ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2 (33508 ms) [--] 1 test from ResourcesOperators/Resources_BENCHMARK_Test (33508 ms total) [--] Global test environment tear-down [==] 1 test from 1 test case ran. (33525 ms total) [ PASSED ] 1 test. {code} After fix: {code} [==] Running 1 test from 1 test case. [--] Global test environment set-up. [--] 1 test from ResourcesOperators/Resources_BENCHMARK_Test [ RUN ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2 Took 2.657057secs to perform 1000 'total += r' operations on ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1... Took 20.493614secs to perform 1000 'total.contains(r)' operations on ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1... Took 3.420194secs to perform 1000 'total -= r' operations on ports(*):[1-2, 4-5, 7-8, 10-11,
[jira] [Created] (MESOS-5921) `validate` is a bit heavy to check negative scalar resource
Guangya Liu created MESOS-5921: -- Summary: `validate` is a bit heavy to check negative scalar resource Key: MESOS-5921 URL: https://issues.apache.org/jira/browse/MESOS-5921 Project: Mesos Issue Type: Bug Reporter: Guangya Liu Assignee: Guangya Liu After subtracting resources, we currently call {{Resources::validate}} to check whether a scalar resource has become negative, so that it can be removed. This is a bit heavy, as {{Resources::validate}} performs many checks (resource type, role, resource name, etc.) that are not necessary here. We should introduce a new helper function {{isNegative}} to check if the resource is a negative scalar resource. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
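A minimal sketch of the proposed {{isNegative}} helper and how a subtract routine might use it in place of full validation. This is an illustration under simplifying assumptions: `SimpleResource`, `ValueType` and the free function `subtract` are hypothetical stand-ins, not the protobuf `Resource` or the real `Resources::subtract`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

enum class ValueType { SCALAR, RANGES, SET };

// Hypothetical, simplified resource; the real type is a protobuf message.
struct SimpleResource {
  std::string name;
  ValueType type;
  double scalar;  // Only meaningful when type == ValueType::SCALAR.
};

// The cheap check proposed in the ticket: only scalars can be negative.
bool isNegative(const SimpleResource& r) {
  return r.type == ValueType::SCALAR && r.scalar < 0;
}

bool isEmpty(const SimpleResource& r) {
  return r.type == ValueType::SCALAR && r.scalar == 0;
}

// Subtract `that` from the first matching scalar resource and drop the
// result if it became negative or empty. No other validation is performed.
void subtract(std::vector<SimpleResource>& resources,
              const SimpleResource& that) {
  assert(that.type == ValueType::SCALAR);
  for (std::size_t i = 0; i < resources.size(); ++i) {
    SimpleResource& resource = resources[i];
    if (resource.type != ValueType::SCALAR || resource.name != that.name) {
      continue;
    }

    resource.scalar -= that.scalar;

    if (isNegative(resource) || isEmpty(resource)) {
      // Order does not matter, so swap with the last element and shrink.
      if (i != resources.size() - 1) {
        std::swap(resources[i], resources.back());
      }
      resources.pop_back();
    }
    break;
  }
}
```

The sketch only shows the shape of the check; whether it is measurably faster than the existing validation is exactly what the benchmark discussion in this thread is about.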
[jira] [Created] (MESOS-5919) Improve performance for `Resources.contains` and `Resources.filter`
Guangya Liu created MESOS-5919: -- Summary: Improve performance for `Resources.contains` and `Resources.filter` Key: MESOS-5919 URL: https://issues.apache.org/jira/browse/MESOS-5919 Project: Mesos Issue Type: Bug Reporter: Guangya Liu Assignee: Guangya Liu
The current logic for `Resources.contains` and `Resources.filter` is as follows:
{code}
Resources Resources::filter(
    const lambda::function<bool(const Resource&)>& predicate) const
{
  Resources result;
  foreach (const Resource& resource, resources) {
    if (predicate(resource)) {
      result += resource;
    }
  }
  return result;
}

bool Resources::contains(const Resources& that) const
{
  Resources remaining = *this;

  foreach (const Resource& resource, that.resources) {
    // NOTE: We use _contains because Resources only contain valid
    // Resource objects, and we don't want the performance hit of the
    // validity check.
    if (!remaining._contains(resource)) {
      return false;
    }

    remaining -= resource;
  }

  return true;
}
{code}
The problem is that all of the {{resource}} objects in those two APIs are already valid, so there is no need to validate them here. However, both {{remaining -= resource;}} in {{Resources::contains}} and {{result += resource;}} in {{Resources::filter}} include the {{validate}} logic. We should remove the {{validate}} step by using {{subtract}} and {{add}} in those two APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
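The difference between the validating operator and the non-validating helper described in MESOS-5919 can be illustrated with a toy container. This is a sketch only: `Bag`, `Item`, `filterChecked` and `filterUnchecked` are hypothetical names, and the `validations` counter exists purely to make the skipped work visible.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

struct Item {
  std::string name;
  double value;
};

// Toy container: operator+= validates its input, add() trusts it.
class Bag {
public:
  int validations = 0;  // Number of validate() calls made on this bag.

  bool validate(const Item& item) {
    ++validations;
    return !item.name.empty() && item.value >= 0;
  }

  Bag& operator+=(const Item& item) {
    if (validate(item)) {
      add(item);
    }
    return *this;
  }

  // Precondition: `item` is already known to be valid.
  void add(const Item& item) {
    assert(!item.name.empty());
    for (Item& existing : items) {
      if (existing.name == item.name) {
        existing.value += item.value;
        return;
      }
    }
    items.push_back(item);
  }

  // Filters through the validating operator, as the current code does.
  Bag filterChecked(const std::function<bool(const Item&)>& predicate) const {
    Bag result;
    for (const Item& item : items) {
      if (predicate(item)) {
        result += item;
      }
    }
    return result;
  }

  // Filters through add(): the items came from a valid bag, so skip checks.
  Bag filterUnchecked(
      const std::function<bool(const Item&)>& predicate) const {
    Bag result;
    for (const Item& item : items) {
      if (predicate(item)) {
        result.add(item);
      }
    }
    return result;
  }

  std::size_t size() const { return items.size(); }

private:
  std::vector<Item> items;
};
```

Both filters return the same contents; only the number of validation calls differs, which is the saving the ticket proposes.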
[jira] [Updated] (MESOS-5700) Benchmark for Resource class
[ https://issues.apache.org/jira/browse/MESOS-5700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guangya Liu updated MESOS-5700: --- Summary: Benchmark for Resource class (was: Benchmark for Resource class (protobuf vs. C++)) > Benchmark for Resource class > > > Key: MESOS-5700 > URL: https://issues.apache.org/jira/browse/MESOS-5700 > Project: Mesos > Issue Type: Bug >Reporter: Klaus Ma >Assignee: Klaus Ma > Attachments: hashmap.diff, name_roleId.diff, port.perf.log, > reservation.perf.log > > > Add benchmark of Resource class for Allocation Performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-5700) Benchmark for Resource class (protobuf vs. C++)
[ https://issues.apache.org/jira/browse/MESOS-5700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15393410#comment-15393410 ] Guangya Liu edited comment on MESOS-5700 at 7/26/16 8:20 AM: - I ran some tests to see how much {{addable}} and {{subtractable}} contribute to the resources benchmark test; the result is that {{those two validations do not cost much time and we can ignore them}}. cc [~bmahler] [~klaus1982] The test steps are as follows: 1) Check out two copies of the source code, mesos-1 and mesos-2, and apply patch https://reviews.apache.org/r/50380/ to both copies. 2) Update the code in mesos-1 by removing both {{addable}} and {{subtractable}} from resources {{+=}} and {{-=}}. The code diff is as follows: {code} diff --git a/src/common/resources.cpp b/src/common/resources.cpp index 3dbff24..d770e98 100644 --- a/src/common/resources.cpp +++ b/src/common/resources.cpp @@ -227,6 +227,7 @@ bool operator!=(const Resource& left, const Resource& right) namespace internal { +#if 0 // Tests if we can add two Resource objects together resulting in one // valid Resource object. For example, two Resource objects with // different name, type or role are not addable. @@ -277,6 +278,7 @@ static bool addable(const Resource& left, const Resource& right) return true; } +#endif // Tests if we can subtract "right" from "left" resulting in one valid @@ -1381,11 +1383,9 @@ void Resources::add(const Resource& that) bool found = false; foreach (Resource& resource, resources) { -if (internal::addable(resource, that)) { resource += that; found = true; break; -} } // Cannot be combined with any existing Resource object. @@ -1439,7 +1439,6 @@ void Resources::subtract(const Resource& that) for (int i = 0; i < resources.size(); i++) { Resource* resource = resources.Mutable(i); -if (internal::subtractable(*resource, that)) { *resource -= that; // Remove the resource if it becomes invalid or zero. 
We need @@ -1455,7 +1454,6 @@ void Resources::subtract(const Resource& that) } break; -} } } {code} 3) Build those two copies and run benchmark test {{ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2}}. Test result without validation for both {{addable}} and {{subtractable}} {code} [==] Running 1 test from 1 test case. [--] Global test environment set-up. [--] 1 test from ResourcesOperators/Resources_BENCHMARK_Test [ RUN ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2 Took 2.833678secs to perform 1000 'total += r' operations on ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1... Took 3.656634secs to perform 1000 'total -= r' operations on ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1... Took 3.012337secs to perform 1000 'total = total + r' operations on ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1... Took 3.650337secs to perform 1000 'total = total - r' operations on ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1... [ OK ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2 (13155 ms) [--] 1 test from ResourcesOperators/Resources_BENCHMARK_Test (13155 ms total) [--] Global test environment tear-down [==] 1 test from 1 test case ran. (13174 ms total) [ PASSED ] 1 test. {code} Test result with validation for both {{addable}} and {{subtractable}} {code} [==] Running 1 test from 1 test case. [--] Global test environment set-up. [--] 1 test from ResourcesOperators/Resources_BENCHMARK_Test [ RUN ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2 Took 2.707476secs to perform 1000 'total += r' operations on ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1... Took 3.49798secs to perform 1000 'total -= r' operations on ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1... Took 2.911038secs to perform 1000 'total = total + r' operations on ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1... Took 3.692435secs to perform 1000 'total = total - r' operations on ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1... 
[ OK ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2 (12811 ms) [--] 1 test from ResourcesOperators/Resources_BENCHMARK_Test (12811 ms total) [--] Global test environment tear-down [==] 1 test from 1 test case ran. (12830 ms total) [ PASSED ] 1 test. {code} Please refer to https://docs.google.com/document/d/1D5qqkEh28vnS-2j3F1K8liYS8ThtSjeLJ4AvogIoxjk/edit?ts=57971af2# for more detail of the diagram of {{valgrind --tool=callgrind}}. was (Author: gyliu): Did some test for how does {{addable}} and {{subtractable}} contribute to resources benchmark test, the result is that {{those two validations does not cost much time and we can ignore it}}. cc [~bmahler] [~klaus1982] Test steps are as following: 1) Checkout two source code copies: mesos-1 and mesos-2, apply patch
[jira] [Commented] (MESOS-5700) Benchmark for Resource class (protobuf vs. C++)
[ https://issues.apache.org/jira/browse/MESOS-5700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15393410#comment-15393410 ] Guangya Liu commented on MESOS-5700: I ran some tests to see how much {{addable}} and {{subtractable}} contribute to the resources benchmark test; the result is that {{those two validations do not cost much time and we can ignore them}}. cc [~bmahler] [~klaus1982] The test steps are as follows: 1) Check out two copies of the source code, mesos-1 and mesos-2, and apply patch https://reviews.apache.org/r/50380/ to both copies. 2) Update the code in mesos-1 by removing both {{addable}} and {{subtractable}} from resources {{+=}} and {{-=}}. The code diff is as follows: {code} diff --git a/src/common/resources.cpp b/src/common/resources.cpp index 3dbff24..d770e98 100644 --- a/src/common/resources.cpp +++ b/src/common/resources.cpp @@ -227,6 +227,7 @@ bool operator!=(const Resource& left, const Resource& right) namespace internal { +#if 0 // Tests if we can add two Resource objects together resulting in one // valid Resource object. For example, two Resource objects with // different name, type or role are not addable. @@ -277,6 +278,7 @@ static bool addable(const Resource& left, const Resource& right) return true; } +#endif // Tests if we can subtract "right" from "left" resulting in one valid @@ -1381,11 +1383,9 @@ void Resources::add(const Resource& that) bool found = false; foreach (Resource& resource, resources) { -if (internal::addable(resource, that)) { resource += that; found = true; break; -} } // Cannot be combined with any existing Resource object. @@ -1439,7 +1439,6 @@ void Resources::subtract(const Resource& that) for (int i = 0; i < resources.size(); i++) { Resource* resource = resources.Mutable(i); -if (internal::subtractable(*resource, that)) { *resource -= that; // Remove the resource if it becomes invalid or zero. 
We need @@ -1455,7 +1454,6 @@ void Resources::subtract(const Resource& that) } {code} 3) Build those two copies and run benchmark test {{ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2}}. Test result without validation for both {{addable}} and {{subtractable}} {code} [==] Running 1 test from 1 test case. [--] Global test environment set-up. [--] 1 test from ResourcesOperators/Resources_BENCHMARK_Test [ RUN ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2 Took 2.833678secs to perform 1000 'total += r' operations on ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1... Took 3.656634secs to perform 1000 'total -= r' operations on ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1... Took 3.012337secs to perform 1000 'total = total + r' operations on ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1... Took 3.650337secs to perform 1000 'total = total - r' operations on ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1... [ OK ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2 (13155 ms) [--] 1 test from ResourcesOperators/Resources_BENCHMARK_Test (13155 ms total) [--] Global test environment tear-down [==] 1 test from 1 test case ran. (13174 ms total) [ PASSED ] 1 test. {code} Test result with validation for both {{addable}} and {{subtractable}} {code} [==] Running 1 test from 1 test case. [--] Global test environment set-up. [--] 1 test from ResourcesOperators/Resources_BENCHMARK_Test [ RUN ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2 Took 2.707476secs to perform 1000 'total += r' operations on ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1... Took 3.49798secs to perform 1000 'total -= r' operations on ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1... Took 2.911038secs to perform 1000 'total = total + r' operations on ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1... Took 3.692435secs to perform 1000 'total = total - r' operations on ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1... 
[ OK ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2 (12811 ms) [--] 1 test from ResourcesOperators/Resources_BENCHMARK_Test (12811 ms total) [--] Global test environment tear-down [==] 1 test from 1 test case ran. (12830 ms total) [ PASSED ] 1 test. {code} Please refer to https://docs.google.com/document/d/1D5qqkEh28vnS-2j3F1K8liYS8ThtSjeLJ4AvogIoxjk/edit?ts=57971af2# for more detail of the diagram of {{valgrind --tool=callgrind}}. > Benchmark for Resource class (protobuf vs. C++) > --- > > Key: MESOS-5700 > URL: https://issues.apache.org/jira/browse/MESOS-5700 > Project: Mesos > Issue Type: Bug >Reporter: Klaus Ma >Assignee: Klaus Ma > Attachments: hashmap.diff, name_roleId.diff, port.perf.log, > reservation.perf.log > > > Add
[jira] [Commented] (MESOS-3157) only perform batch resource allocations
[ https://issues.apache.org/jira/browse/MESOS-3157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15391660#comment-15391660 ] Guangya Liu commented on MESOS-3157: [~jjanco] Any update on this? Are you still working on it? > only perform batch resource allocations > --- > > Key: MESOS-3157 > URL: https://issues.apache.org/jira/browse/MESOS-3157 > Project: Mesos > Issue Type: Bug > Components: allocation >Reporter: James Peach >Assignee: Jacob Janco > > Our deployment environments have a lot of churn, with many short-lived > frameworks that often revive offers. Running the allocator takes a long time > (from seconds up to minutes). > In this situation, event-triggered allocation causes the event queue in the > allocator process to get very long, and the allocator effectively becomes > unresponsive (e.g. a revive offers message takes too long to come to the head > of the queue). > We have been running a patch to remove all the event-triggered allocations > and only allocate from the batch task > {{HierarchicalAllocatorProcess::batch}}. This works great and really improves > responsiveness. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5898) Make resources benchmark test for ports -=/- more accurate
[ https://issues.apache.org/jira/browse/MESOS-5898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guangya Liu updated MESOS-5898: --- Description: When I run the benchmark test for port resources, I get the following result: `-=` and `-` consume only 10ms, which does not reflect the real time of operating on 1000 ports with `-=` and `-`. The root cause is that the current calculation always uses the same port range. With ports, the formula for `+` is {{a+a+a+a+...+a==a}}; for `-`, it is {{a-a=0}} followed by {{0-a=0}}. With {{0-a=0}}, the code here https://github.com/apache/mesos/blob/master/src/common/values.cpp#L544 skips validation because {{left}} is empty. {code} ./bin/mesos-tests.sh --benchmark --gtest_filter="*Resources_BENCHMARK_Test.Arithmetic/2" [==] Running 1 test from 1 test case. [--] Global test environment set-up. [--] 1 test from ResourcesOperators/Resources_BENCHMARK_Test [ RUN ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2 Took 3.219217secs to perform 1000 'total += r' operations on ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1... Took 10207us to perform 1000 'total -= r' operations on ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1... Took 3.515383secs to perform 1000 'total = total + r' operations on ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1... Took 10208us to perform 1000 'total = total - r' operations on ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1... [ OK ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2 (6759 ms) [--] 1 test from ResourcesOperators/Resources_BENCHMARK_Test (6759 ms total) [--] Global test environment tear-down [==] 1 test from 1 test case ran. (6801 ms total) [ PASSED ] 1 test. {code} was: When I run benchmark test for port resources, I can get the following result, the `-=` and `-` only consumed 10ms, this cannot reflect the real time of operating 1000 ports with `-=` and `-`. 
The root cause is that the current calculation is always using same port range, with port, the formula for `+` is {a+a+a+a+...+a==a}; for `-`, it will be {a-a=0} and {0-a=0}. With {0-a=0}, the code here https://github.com/apache/mesos/blob/master/src/common/values.cpp#L544 will cause there is no validation as the {{left}} is empty. {code} ./bin/mesos-tests.sh --benchmark --gtest_filter="*Resources_BENCHMARK_Test.Arithmetic/2" [==] Running 1 test from 1 test case. [--] Global test environment set-up. [--] 1 test from ResourcesOperators/Resources_BENCHMARK_Test [ RUN ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2 Took 3.219217secs to perform 1000 'total += r' operations on ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1... Took 10207us to perform 1000 'total -= r' operations on ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1... Took 3.515383secs to perform 1000 'total = total + r' operations on ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1... Took 10208us to perform 1000 'total = total - r' operations on ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1... [ OK ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2 (6759 ms) [--] 1 test from ResourcesOperators/Resources_BENCHMARK_Test (6759 ms total) [--] Global test environment tear-down [==] 1 test from 1 test case ran. (6801 ms total) [ PASSED ] 1 test. {code} > Make resources benchmark test for ports -=/- more accurate > -- > > Key: MESOS-5898 > URL: https://issues.apache.org/jira/browse/MESOS-5898 > Project: Mesos > Issue Type: Bug >Reporter: Guangya Liu >Assignee: Guangya Liu > > When I run benchmark test for port resources, I can get the following result, > the `-=` and `-` only consumed 10ms, this cannot reflect the real time of > operating 1000 ports with `-=` and `-`. > The root cause is that the current calculation is always using same port > range, with port, the formula for `+` is {{a+a+a+a+...+a==a}}; for `-`, it > will be {{a-a=0}} and {{0-a=0}}. 
> With {{0-a=0}}, the code here > https://github.com/apache/mesos/blob/master/src/common/values.cpp#L544 will > cause there is no validation as the {{left}} is empty. > {code} > ./bin/mesos-tests.sh --benchmark > --gtest_filter="*Resources_BENCHMARK_Test.Arithmetic/2" > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from ResourcesOperators/Resources_BENCHMARK_Test > [ RUN ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2 > Took 3.219217secs to perform 1000 'total += r' operations on ports(*):[1-2, > 4-5, 7-8, 10-11, 13-14, 16-17, 1... > Took 10207us to perform 1000 'total -= r' operations on ports(*):[1-2, 4-5, > 7-8, 10-11, 13-14, 16-17, 1... > Took 3.515383secs to perform 1000 'total
[jira] [Created] (MESOS-5898) Make resources benchmark test for ports -=/- more accurate
Guangya Liu created MESOS-5898: -- Summary: Make resources benchmark test for ports -=/- more accurate Key: MESOS-5898 URL: https://issues.apache.org/jira/browse/MESOS-5898 Project: Mesos Issue Type: Bug Reporter: Guangya Liu Assignee: Guangya Liu When I run the benchmark test for port resources, I get the following result: `-=` and `-` consume only about 10ms, which cannot reflect the real cost of 1000 `-=` and `-` operations on ports. The root cause is that the calculation always uses the same port range. For ports, the formula for `+` is {a+a+a+a+...+a==a}; for `-`, it is {a-a=0} followed by {0-a=0}. With {0-a=0}, the code at https://github.com/apache/mesos/blob/master/src/common/values.cpp#L544 results in no validation being done, because {{left}} is empty. {code} ./bin/mesos-tests.sh --benchmark --gtest_filter="*Resources_BENCHMARK_Test.Arithmetic/2" [==] Running 1 test from 1 test case. [--] Global test environment set-up. [--] 1 test from ResourcesOperators/Resources_BENCHMARK_Test [ RUN ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2 Took 3.219217secs to perform 1000 'total += r' operations on ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1... Took 10207us to perform 1000 'total -= r' operations on ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1... Took 3.515383secs to perform 1000 'total = total + r' operations on ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1... Took 10208us to perform 1000 'total = total - r' operations on ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1... [ OK ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2 (6759 ms) [--] 1 test from ResourcesOperators/Resources_BENCHMARK_Test (6759 ms total) [--] Global test environment tear-down [==] 1 test from 1 test case ran. (6801 ms total) [ PASSED ] 1 test. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
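The root cause described in MESOS-5898 can be illustrated with a small sketch. It models a ports resource as a plain `std::set<int>` rather than the Mesos `Value::Ranges` type, and the names `Ports`, `add`, `subtract` and `makePorts` are hypothetical, introduced only for illustration. Under that assumption it shows that adding the same port set repeatedly leaves the total unchanged and that every subtraction after the first operates on an empty total.

```cpp
#include <set>

// Simplified stand-in for a ports resource: a set of individual port numbers.
// (Mesos stores ranges; a plain set is enough to show the arithmetic.)
using Ports = std::set<int>;

// Set union, the analogue of `total += r` for ports.
Ports add(Ports total, const Ports& r) {
  total.insert(r.begin(), r.end());
  return total;
}

// Set difference, the analogue of `total -= r` for ports.
Ports subtract(Ports total, const Ports& r) {
  for (int port : r) {
    total.erase(port);
  }
  return total;
}

// Hypothetical helper: builds `count` disjoint two-port ranges starting at
// `first`, i.e. [first, first+1], [first+3, first+4], and so on.
Ports makePorts(int first, int count) {
  Ports ports;
  for (int i = 0; i < count; ++i) {
    ports.insert(first + 3 * i);
    ports.insert(first + 3 * i + 1);
  }
  return ports;
}
```

With `a = makePorts(1, 4)`, `add(a, a)` equals `a` and `subtract(subtract(a, a), a)` is empty, which matches the {a+a==a} and {0-a=0} pattern in the ticket. One possible way to keep each iteration meaningful would be to use a different, non-overlapping port set per iteration, for example `makePorts(100, 4)`; this is a suggestion, not necessarily what the eventual patch does.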
[jira] [Commented] (MESOS-4770) Investigate performance improvements for 'Resources' class.
[ https://issues.apache.org/jira/browse/MESOS-4770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15390638#comment-15390638 ] Guangya Liu commented on MESOS-4770: [~jvanremoortere] did some investigation of this, and the prototype code is here (a bit old but good enough for investigation): 1) https://github.com/jmlvanre/mesos/commit/f39f49ca0876f61fc94e752fc3c4f14377b1d329 2) https://github.com/jmlvanre/mesos/commit/7b4ac74449044d892e25ee31a297d50254afd1e0 3) https://github.com/jmlvanre/mesos/commit/4fc05821b4fa3c30dd1fed66ba7fc4498ee29efb Performance improved 2x in [~jvanremoortere]'s test. > Investigate performance improvements for 'Resources' class. > --- > > Key: MESOS-4770 > URL: https://issues.apache.org/jira/browse/MESOS-4770 > Project: Mesos > Issue Type: Improvement >Reporter: Benjamin Mahler >Priority: Critical > > Currently we have some performance issues when we have heavy usage of the > {{Resources}} class. Currently, we tend to work around these issues (e.g. > reduce the amount of Resources arithmetic operations in the caller code). > The implementation of {{Resources}} currently consists of wrapping underlying > {{Resource}} protobuf objects and manipulating them. This is fairly expensive > compared to doing things more directly with C++ objects. > This ticket is to explore the performance improvements of using C++ objects > more directly instead of working off of {{Resource}} objects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-5869) Disable resources validation for `+=` and `-=`
Guangya Liu created MESOS-5869: -- Summary: Disable resources validation for `+=` and `-=` Key: MESOS-5869 URL: https://issues.apache.org/jira/browse/MESOS-5869 Project: Mesos Issue Type: Bug Reporter: Guangya Liu Assignee: Guangya Liu The `validation` consumes quite a lot of time during resources `+=` and `-=`, but it is not needed for those operations, so we should remove this check. Based on test results with the `validation` removed, resources `+=` and `-=` improve by about 10x in the sorter test, port range `+=` improves by about 5x, and port range `-=` improves by about 1000x. Sorter Benchmark test before fix: {code} [==] Running 1 test from 1 test case. [--] Global test environment set-up. [--] 1 test from AgentAndClientCount/Sorter_BENCHMARK_Test [ RUN ] AgentAndClientCount/Sorter_BENCHMARK_Test.FullSort/35 Using 5 agents and 1000 clients Added 1000 clients in 23305us Added 5 agents in 1.174069secs Added allocations for 5 agents in 40.562802secs Full sort of 1000 clients took 38193us No-op sort of 1000 clients took 382us [ OK ] AgentAndClientCount/Sorter_BENCHMARK_Test.FullSort/35 (43032 ms) [--] 1 test from AgentAndClientCount/Sorter_BENCHMARK_Test (43032 ms total) [--] Global test environment tear-down [==] 1 test from 1 test case ran. (43054 ms total) [ PASSED ] 1 test. {code} Sorter Benchmark test after fix: {code} [==] Running 1 test from 1 test case. [--] Global test environment set-up. 
[--] 1 test from AgentAndClientCount/Sorter_BENCHMARK_Test [ RUN ] AgentAndClientCount/Sorter_BENCHMARK_Test.FullSort/35 Using 5 agents and 1000 clients Added 1000 clients in 25846us Added 5 agents in 1.092462secs Added allocations for 5 agents in 4.397859secs Full sort of 1000 clients took 35051us No-op sort of 1000 clients took 551us [ OK ] AgentAndClientCount/Sorter_BENCHMARK_Test.FullSort/35 (6897 ms) [--] 1 test from AgentAndClientCount/Sorter_BENCHMARK_Test (6897 ms total) [--] Global test environment tear-down [==] 1 test from 1 test case ran. (6920 ms total) [ PASSED ] 1 test. {code} Ports resources benchmark test before fix: {code} [==] Running 1 test from 1 test case. [--] Global test environment set-up. [--] 1 test from ResourcesOperators/Resources_BENCHMARK_Test [ RUN ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2 Took 12.478841secs to perform 1000 'total += r' operations on ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1... Took 8.512399secs to perform 1000 'total -= r' operations on ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1... Took 11.296542secs to perform 1000 'total = total + r' operations on ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1... Took 8.517692secs to perform 1000 'total = total - r' operations on ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1... [ OK ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2 (40808 ms) [--] 1 test from ResourcesOperators/Resources_BENCHMARK_Test (40808 ms total) [--] Global test environment tear-down [==] 1 test from 1 test case ran. (40832 ms total) [ PASSED ] 1 test. {code} Ports resources benchmark test after fix: {code} [==] Running 1 test from 1 test case. [--] Global test environment set-up. [--] 1 test from ResourcesOperators/Resources_BENCHMARK_Test [ RUN ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2 Took 2.827012secs to perform 1000 'total += r' operations on ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1... 
Took 8841us to perform 1000 'total -= r' operations on ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1... Took 3.313112secs to perform 1000 'total = total + r' operations on ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1... Took 12415us to perform 1000 'total = total - r' operations on ports(*):[1-2, 4-5, 7-8, 10-11, 13-14, 16-17, 1... [ OK ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2 (6164 ms) [--] 1 test from ResourcesOperators/Resources_BENCHMARK_Test (6164 ms total) [--] Global test environment tear-down [==] 1 test from 1 test case ran. (6187 ms total) [ PASSED ] 1 test. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
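The improvement factors quoted in MESOS-5869 can be checked against the raw timings in the benchmark output above. The following is a minimal sketch of that arithmetic; the helper names `speedup` and `usToSecs` are hypothetical and not part of Mesos.

```cpp
// Speedup = time before / time after, with both values in seconds.
constexpr double speedup(double beforeSecs, double afterSecs) {
  return beforeSecs / afterSecs;
}

// Converts the microsecond figures printed by the benchmark to seconds.
constexpr double usToSecs(double us) {
  return us / 1e6;
}
```

Using the figures from the logs, `speedup(40.562802, 4.397859)` is roughly 9.2 for the sorter allocations, `speedup(12.478841, 2.827012)` is roughly 4.4 for port range `+=`, and `speedup(8.512399, usToSecs(8841))` is roughly 960 for port range `-=`. These are consistent with the rounded 10x, 5x and 1000x stated in the ticket.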
[jira] [Commented] (MESOS-4558) Reduce the running time of benchmark tests.
[ https://issues.apache.org/jira/browse/MESOS-4558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15374768#comment-15374768 ] Guangya Liu commented on MESOS-4558: Selectively running benchmark tests is also an option, but I am not sure there is any logic for selecting representative benchmark tests. Take the patch https://reviews.apache.org/r/49784/ as an example: once merged, it will introduce cases where the agent count is less than the framework count, so some frameworks may not get resources and the allocator will try to allocate resources on fully used agents. This is a good test case for checking whether fully used agents impact the performance of the allocator. The problem is how to select, with filters, the cases that cover fully used agents. Also, in https://reviews.apache.org/r/49784/ we enabled {{batchsize}} to loop over fewer frameworks; this seems a simple way to decrease the time of the benchmark test without updating the filter logic of ASF CI. > Reduce the running time of benchmark tests. > --- > > Key: MESOS-4558 > URL: https://issues.apache.org/jira/browse/MESOS-4558 > Project: Mesos > Issue Type: Task >Reporter: Vinod Kone > Labels: newbie++ > > Currently benchmark tests take a long time (>5 hours). It would be nice to > reduce the total time taken by the benchmark tests to enable us to run them > on ASF CI. > Command to run only benchmark tests > {code} > MESOS_BENCHMARK=1 GTEST_FILTER="*BENCHMARK*" make check > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5834) Mesos may pass to the Docker daemon --volume-driver multiple times.
[ https://issues.apache.org/jira/browse/MESOS-5834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15373179#comment-15373179 ] Guangya Liu commented on MESOS-5834: The {{driver}} field is optional, and Docker also suggests creating the volume explicitly via {{docker volume create}} before using it. If you create the Docker volumes explicitly and do not set {{driver}}, there will be no such issue; otherwise, {{stderr}} will show an error message such as {{Error response from daemon: create aa: conflict: volume name must be unique.}}. Is this behaviour OK for you?
{code}
message DockerVolume {
  // Driver of the volume, it can be flocker, convoy, raxrey etc.
  optional string driver = 1;

  // Name of the volume.
  required string name = 2;

  // Volume driver specific options.
  optional Parameters driver_options = 3;
}
{code}
> Mesos may pass to the Docker daemon --volume-driver multiple times. > --- > > Key: MESOS-5834 > URL: https://issues.apache.org/jira/browse/MESOS-5834 > Project: Mesos > Issue Type: Bug > Components: docker >Affects Versions: 1.0.0 >Reporter: Gastón Kleiman > Labels: mesosphere > > https://github.com/apache/mesos/blob/master/src/docker/docker.cpp#L590 will > append the "--volume-driver" flag to argv once per Volume. > According to https://github.com/docker/docker/issues/16069 this flag can only > be specified once. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
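The constraint reported in MESOS-5834, that `--volume-driver` may be passed only once, can be sketched as follows. This is not the code in `docker.cpp` and not necessarily the fix that was adopted; the `Volume` struct and the `volumeArgs` function are hypothetical, and an empty `driver` string stands in for the unset optional field.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical, simplified mirror of the DockerVolume message.
struct Volume {
  std::string driver;         // Empty when the optional `driver` is unset.
  std::string name;           // Name of the volume.
  std::string containerPath;  // Mount point inside the container.
};

// Builds the volume-related part of a `docker run` argv, emitting
// `--volume-driver` at most once. Throws if two volumes name different
// drivers, because a single flag cannot express that.
std::vector<std::string> volumeArgs(const std::vector<Volume>& volumes) {
  std::vector<std::string> argv;
  std::string driver;

  for (const Volume& volume : volumes) {
    if (!volume.driver.empty()) {
      if (!driver.empty() && driver != volume.driver) {
        throw std::invalid_argument(
            "conflicting volume drivers: " + driver + " and " + volume.driver);
      }
      driver = volume.driver;
    }
    argv.push_back("-v");
    argv.push_back(volume.name + ":" + volume.containerPath);
  }

  // Append the flag once, after all volumes have been inspected.
  if (!driver.empty()) {
    argv.push_back("--volume-driver=" + driver);
  }
  return argv;
}
```

Rejecting conflicting drivers is only one possible policy; another would be to require that volumes be created up front with {{docker volume create}}, as the comment above suggests, so that no driver flag is needed at all.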
[jira] [Commented] (MESOS-4558) Reduce the running time of benchmark tests.
[ https://issues.apache.org/jira/browse/MESOS-4558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15369990#comment-15369990 ] Guangya Liu commented on MESOS-4558: [~jjanco] is trying to make the number of loop iterations in the benchmark configurable via a batch size, which can reduce the time of the benchmark test. Please refer to https://reviews.apache.org/r/49616/ for details. > Reduce the running time of benchmark tests. > --- > > Key: MESOS-4558 > URL: https://issues.apache.org/jira/browse/MESOS-4558 > Project: Mesos > Issue Type: Task >Reporter: Vinod Kone > Labels: newbie++ > > Currently benchmark tests take a long time (>5 hours). It would be nice to > reduce the total time taken by the benchmark tests to enable us to run them > on ASF CI. > Command to run only benchmark tests > {code} > MESOS_BENCHMARK=1 GTEST_FILTER="*BENCHMARK*" make check > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-5701) Add benchmark for sorter performance
[ https://issues.apache.org/jira/browse/MESOS-5701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guangya Liu reassigned MESOS-5701: -- Assignee: Guangya Liu > Add benchmark for sorter performance > > > Key: MESOS-5701 > URL: https://issues.apache.org/jira/browse/MESOS-5701 > Project: Mesos > Issue Type: Bug > Components: allocation >Reporter: Klaus Ma >Assignee: Guangya Liu > > Add benchmark of sorter in allocation for Allocation Performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5825) Support mounting image volume in mesos containerizer.
[ https://issues.apache.org/jira/browse/MESOS-5825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368671#comment-15368671 ] Guangya Liu commented on MESOS-5825: [~gilbert] is this a duplicate of https://issues.apache.org/jira/browse/MESOS-5465 ? If so, can you please post some comments at MESOS-5465? ;-) > Support mounting image volume in mesos containerizer. > - > > Key: MESOS-5825 > URL: https://issues.apache.org/jira/browse/MESOS-5825 > Project: Mesos > Issue Type: Improvement > Components: containerization >Reporter: Gilbert Song >Assignee: Gilbert Song > Labels: containerizer, filesystem, isolator, mesosphere > > Mesos containerizer should be able to support mounting image volume type. > Specifically, both image rootfs and default manifest should be reachable > inside container's mount namespace. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5700) Benchmark for Resource class (protobuf vs. C++)
[ https://issues.apache.org/jira/browse/MESOS-5700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368666#comment-15368666 ] Guangya Liu commented on MESOS-5700: Based on the investigation by [~jvanremoortere] and [~mcypark], the findings include, for example, that (1) copying the protobufs was expensive and (2) looping over resources and checking .name() equality was expensive. We may need to think of more use cases related to {{Resource}} and translate them into benchmark tests. > Benchmark for Resource class (protobuf vs. C++) > --- > > Key: MESOS-5700 > URL: https://issues.apache.org/jira/browse/MESOS-5700 > Project: Mesos > Issue Type: Bug >Reporter: Klaus Ma >Assignee: Klaus Ma > > Add benchmark of Resource class for Allocation Performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5425) Consider using IntervalSet for Port range resource math
[ https://issues.apache.org/jira/browse/MESOS-5425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15367301#comment-15367301 ] Guangya Liu commented on MESOS-5425: I'm linking MESOS-5700 here because there is a patch, https://reviews.apache.org/r/49381 , which can help you run some benchmark tests. > Consider using IntervalSet for Port range resource math > --- > > Key: MESOS-5425 > URL: https://issues.apache.org/jira/browse/MESOS-5425 > Project: Mesos > Issue Type: Improvement > Components: allocation >Reporter: Joseph Wu >Assignee: Yanyan Hu > Labels: mesosphere > Attachments: graycol.gif > > > Follow-up JIRA for comments raised in MESOS-3051 (see comments there). > We should consider utilizing > [{{IntervalSet}}|https://github.com/apache/mesos/blob/a0b798d2fac39445ce0545cfaf05a682cd393abe/3rdparty/stout/include/stout/interval.hpp] > in [Port range resource > math|https://github.com/apache/mesos/blob/a0b798d2fac39445ce0545cfaf05a682cd393abe/src/common/values.cpp#L143]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-5800) code clean up for allocator benchmark test
Guangya Liu created MESOS-5800: -- Summary: code clean up for allocator benchmark test Key: MESOS-5800 URL: https://issues.apache.org/jira/browse/MESOS-5800 Project: Mesos Issue Type: Bug Reporter: Guangya Liu Assignee: Guangya Liu We are now introducing some benchmark tests for the allocator, and people may refer to the current benchmark tests when writing new ones. There are two major issues with the current benchmark tests: 1) The output of the benchmark test is {{round 0 allocate took 3.077414secs to make 200 offers}}; the {{200}} here is the number of frameworks, not the number of offers. 2) Two test cases, {{DeclineOffers}} and {{ResourceLabels}}, do not use the templatized test fixture. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5739) Fix Value parsing code to only accept the canonical formats
[ https://issues.apache.org/jira/browse/MESOS-5739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guangya Liu updated MESOS-5739: --- Description: We should fix the value parsing code to accept only the canonical formats defined in http://mesos.apache.org/documentation/latest/attributes-resources/ . The behaviour after the fix is as follows: {code} 1. Did not support [1-2, [3-4]] as Ranges; it should be [1-2, 3-4]. 2. Did not support {a{b, c}d} as Set; it should be {ab, cd} 3. Add check for Text against [a-zA-Z0-9_/.-] {code} was: Enhanced Value parsing: {code} 1. Did not support [1-2, [3-4]] as Ranges; it should be [1-2, 3-4]. 2. Did not support {a{b, c}d} as Set; it should be {ab, cd} 3. Add check for Text against [a-zA-Z0-9_/.-] {code} > Fix Value parsing code to only accept the canonical formats > --- > > Key: MESOS-5739 > URL: https://issues.apache.org/jira/browse/MESOS-5739 > Project: Mesos > Issue Type: Bug > Components: master >Reporter: Klaus Ma >Assignee: Klaus Ma > > We should fix the value parsing code to accept only the canonical formats > defined in http://mesos.apache.org/documentation/latest/attributes-resources/ > . The behaviour after the fix is as follows: > {code} > 1. Did not support [1-2, [3-4]] as Ranges; it should be [1-2, 3-4]. > 2. Did not support {a{b, c}d} as Set; it should be {ab, cd} > 3. Add check for Text against [a-zA-Z0-9_/.-] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
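The three rules listed in MESOS-5739 can be sketched as simple checks. This is a minimal illustration, not the Mesos parser: the function names `isCanonicalText` and `isFlat` are hypothetical, and rejecting empty text is an assumption the ticket does not state.

```cpp
#include <cstddef>
#include <string>

// True if `text` is non-empty (an assumption) and every character is in
// the class [a-zA-Z0-9_/.-] named in rule 3.
bool isCanonicalText(const std::string& text) {
  if (text.empty()) {
    return false;
  }
  for (char c : text) {
    const bool alnum = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
                       (c >= '0' && c <= '9');
    if (!alnum && c != '_' && c != '/' && c != '.' && c != '-') {
      return false;
    }
  }
  return true;
}

// True if `value` starts with `open`, ends with `close`, and contains no
// further `open` or `close` characters in between, i.e. no nesting.
// Covers rule 1 with '[' and ']' and rule 2 with '{' and '}'.
bool isFlat(const std::string& value, char open, char close) {
  if (value.size() < 2 || value.front() != open || value.back() != close) {
    return false;
  }
  for (std::size_t i = 1; i + 1 < value.size(); ++i) {
    if (value[i] == open || value[i] == close) {
      return false;
    }
  }
  return true;
}
```

For example, `isFlat("[1-2, 3-4]", '[', ']')` accepts the canonical Ranges form while `isFlat("[1-2, [3-4]]", '[', ']')` rejects the nested one, and likewise `{ab, cd}` passes but `{a{b, c}d}` does not. A real parser would also validate the elements between the delimiters, which this sketch leaves out.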
[jira] [Updated] (MESOS-5739) Fix Value parsing code to only accept the canonical formats
[ https://issues.apache.org/jira/browse/MESOS-5739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guangya Liu updated MESOS-5739: --- Summary: Fix Value parsing code to only accept the canonical formats (was: Enhance Value parsing) > Fix Value parsing code to only accept the canonical formats > --- > > Key: MESOS-5739 > URL: https://issues.apache.org/jira/browse/MESOS-5739 > Project: Mesos > Issue Type: Bug > Components: master >Reporter: Klaus Ma >Assignee: Klaus Ma > > Enhanced Value parsing: > {code} > 1. Did not support [1-2, [3-4]] as Ranges; it should be [1-2, 3-4]. > 2. Did not support {a{b, c}d} as Set; it should be {ab, cd} > 3. Add check for Text against [a-zA-Z0-9_/.-] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5017) Don't consider agents without allocatable resources in the allocator
[ https://issues.apache.org/jira/browse/MESOS-5017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15364414#comment-15364414 ] Guangya Liu commented on MESOS-5017: I posted a patch at https://reviews.apache.org/r/49694/ , but a benchmark test shows that it does not improve performance much. [~bmahler] and [~jvanremoortere], can you please take a look and share any comments? > Don't consider agents without allocatable resources in the allocator > > > Key: MESOS-5017 > URL: https://issues.apache.org/jira/browse/MESOS-5017 > Project: Mesos > Issue Type: Improvement > Components: allocation >Reporter: Dario Rexin >Assignee: Guangya Liu >Priority: Minor > > During the review r/43668/ , an enhancement came up: if an agent has > no allocatable resources, the allocator should filter it out at the > beginning. > {quote} > Joris Van Remoortere Posted 1 month ago (March 16, 2016, 5:04 a.m.) > Should we filter out slaves that have no allocatable resources? > If we do, let's make sure we note that we want to pass the original slaveids > to the deallocate function > The issue has been resolved. Show all issues > Dario Rexin 4 weeks ago (March 23, 2016, 4:25 a.m.) > I'm not sure if it would be a big improvement. Calculating the available > resources if somewhat expensive and we have to do it again in the loop and > most slaves will probably have resources available anyway. The reason it's an > improvement in the loop is, that after we offer the resources to a framework, > we can be sure that they are all unavailable to the following frameworks > under the same role. > Klaus Ma 4 weeks ago (March 23, 2016, 11:13 a.m.) > @joris/dario, I think the improvement dependent on the workload patten: 1.) > for short running tasks, it maybe serveral tasks finished during the > allocation interval, so maybe no improvement; 2.) but for long running tasks, > slave/agent should be fully used in most of time, it'll be a big improvement. 
> I used to log MESOS-4986 to add a filter after stage 1 (Quota), but maybe > useless after revocable by default. > Joris Van Remoortere 3 weeks, 6 days ago (March 23, 2016, 8:59 p.m.) > Can you open a JIRA to consider doing this. Along Klaus' example, I'm not > convinced this wouldn't have a large impact in certain scenarios. > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
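The enhancement discussed in MESOS-5017, dropping agents with nothing left to allocate before entering the allocation loop, can be sketched as follows. The `Agent` struct, the function names, and the two threshold parameters are hypothetical simplifications; the real allocator works on `Resources` objects, and whether the pre-filter pays off depends on the workload, as the quoted review discussion notes.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified view of an agent's unallocated resources.
struct Agent {
  std::string id;
  double cpus;   // Unallocated CPUs.
  double memMB;  // Unallocated memory in megabytes.
};

// An agent counts as allocatable if it has at least `minCpus` CPUs or at
// least `minMemMB` of memory left. Both thresholds are assumptions.
bool allocatable(const Agent& agent, double minCpus, double minMemMB) {
  return agent.cpus >= minCpus || agent.memMB >= minMemMB;
}

// Returns the agents worth visiting in the allocation loop, preserving
// the original order.
std::vector<Agent> filterAllocatable(
    const std::vector<Agent>& agents, double minCpus, double minMemMB) {
  std::vector<Agent> result;
  for (const Agent& agent : agents) {
    if (allocatable(agent, minCpus, minMemMB)) {
      result.push_back(agent);
    }
  }
  return result;
}
```

The filter returns a copy and leaves the input untouched, so the original list of agent IDs remains available afterwards, which is what the quoted review asks for with respect to the deallocate step.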
[jira] [Updated] (MESOS-5681) c++ based resource and resources object
[ https://issues.apache.org/jira/browse/MESOS-5681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guangya Liu updated MESOS-5681: --- [~yanyanhu] This seems to be a duplicate of MESOS-4770; can you confirm? > c++ based resource and resources object > --- > > Key: MESOS-5681 > URL: https://issues.apache.org/jira/browse/MESOS-5681 > Project: Mesos > Issue Type: Improvement > Components: allocation >Reporter: Yanyan Hu > Labels: performance > > Followup JIRA for MESOS-5425. Currently, the resource object uses the protobuf > to store data internally, but this implementation is inefficient for math > calculations, especially Ranges subtraction. An interim > solution proposed in https://reviews.apache.org/r/48593/ converts Ranges to > IntervalSet inline to optimize performance. In the long term, we should > consider a C++ library based resource object as a permanent solution. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-5017) Don't consider agents without allocatable resources in the allocator
[ https://issues.apache.org/jira/browse/MESOS-5017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guangya Liu reassigned MESOS-5017: -- Assignee: Guangya Liu (was: Klaus Ma) > Don't consider agents without allocatable resources in the allocator > > > Key: MESOS-5017 > URL: https://issues.apache.org/jira/browse/MESOS-5017 > Project: Mesos > Issue Type: Improvement > Components: allocation >Reporter: Dario Rexin >Assignee: Guangya Liu >Priority: Minor > > During the review r/43668/ , it come out an enhancement that if an agent has > not allocatable resources, the allocator should filter them out at the > beginning. > {quote} > Joris Van Remoortere Posted 1 month ago (March 16, 2016, 5:04 a.m.) > Should we filter out slaves that have no allocatable resources? > If we do, let's make sure we note that we want to pass the original slaveids > to the deallocate function > The issue has been resolved. Show all issues > Dario Rexin 4 weeks ago (March 23, 2016, 4:25 a.m.) > I'm not sure if it would be a big improvement. Calculating the available > resources if somewhat expensive and we have to do it again in the loop and > most slaves will probably have resources available anyway. The reason it's an > improvement in the loop is, that after we offer the resources to a framework, > we can be sure that they are all unavailable to the following frameworks > under the same role. > Klaus Ma 4 weeks ago (March 23, 2016, 11:13 a.m.) > @joris/dario, I think the improvement dependent on the workload patten: 1.) > for short running tasks, it maybe serveral tasks finished during the > allocation interval, so maybe no improvement; 2.) but for long running tasks, > slave/agent should be fully used in most of time, it'll be a big improvement. > I used to log MESOS-4986 to add a filter after stage 1 (Quota), but maybe > useless after revocable by default. > Joris Van Remoortere 3 weeks, 6 days ago (March 23, 2016, 8:59 p.m.) > Can you open a JIRA to consider doing this. 
Along Klaus' example, I'm not > convinced this wouldn't have a large impact in certain scenarios. > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4694) DRFAllocator takes very long to allocate resources with a large number of frameworks
[ https://issues.apache.org/jira/browse/MESOS-4694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363600#comment-15363600 ] Guangya Liu commented on MESOS-4694: [~drexin] are you still actively working on this? If not, can I take this over? Thanks. > DRFAllocator takes very long to allocate resources with a large number of > frameworks > > > Key: MESOS-4694 > URL: https://issues.apache.org/jira/browse/MESOS-4694 > Project: Mesos > Issue Type: Improvement > Components: allocation >Affects Versions: 0.26.0, 0.27.0, 0.27.1, 0.27.2, 0.28.0, 0.28.1 >Reporter: Dario Rexin >Assignee: Dario Rexin > > With a growing number of connected frameworks, the allocation time grows to > very high numbers. The addition of quota in 0.27 had an additional impact on > these numbers. Running `mesos-tests.sh --benchmark > --gtest_filter=HierarchicalAllocator_BENCHMARK_Test.DeclineOffers` gives us > the following numbers: > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from HierarchicalAllocator_BENCHMARK_Test > [ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers > Using 2000 slaves and 200 frameworks > round 0 allocate took 2.921202secs to make 200 offers > round 1 allocate took 2.85045secs to make 200 offers > round 2 allocate took 2.823768secs to make 200 offers > {noformat} > Increasing the number of frameworks to 2000: > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from HierarchicalAllocator_BENCHMARK_Test > [ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers > Using 2000 slaves and 2000 frameworks > round 0 allocate took 28.209454secs to make 2000 offers > round 1 allocate took 28.469419secs to make 2000 offers > round 2 allocate took 28.138086secs to make 2000 offers > {noformat} > I was able to reduce this time by a substantial amount. After applying the > patches: > {noformat} > [==] Running 1 test from 1 test case. 
> [--] Global test environment set-up. > [--] 1 test from HierarchicalAllocator_BENCHMARK_Test > [ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers > Using 2000 slaves and 200 frameworks > round 0 allocate took 1.016226secs to make 2000 offers > round 1 allocate took 1.102729secs to make 2000 offers > round 2 allocate took 1.102624secs to make 2000 offers > {noformat} > And with 2000 frameworks: > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from HierarchicalAllocator_BENCHMARK_Test > [ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers > Using 2000 slaves and 2000 frameworks > round 0 allocate took 12.563203secs to make 2000 offers > round 1 allocate took 12.437517secs to make 2000 offers > round 2 allocate took 12.470708secs to make 2000 offers > {noformat} > The patches do 3 things to improve the performance of the allocator. > 1) The total values in the DRFSorter will be pre calculated per resource type > 2) In the allocate method, when no resources are available to allocate, we > break out of the innermost loop to prevent looping over a large number of > frameworks when we have nothing to allocate > 3) when a framework suppresses offers, we remove it from the sorter instead > of just calling continue in the allocation loop - this greatly improves > performance in the sorter and prevents looping over frameworks that don't > need resources > Assuming that most of the frameworks behave nicely and suppress offers when > they have nothing to schedule, it is fair to assume, that point 3) has the > biggest impact on the performance. If we suppress offers for 90% of the > frameworks in the benchmark test, we see following numbers: > {noformat} > ==] Running 1 test from 1 test case. > [--] Global test environment set-up. 
> [--] 1 test from HierarchicalAllocator_BENCHMARK_Test > [ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers > Using 200 slaves and 2000 frameworks > round 0 allocate took 11626us to make 200 offers > round 1 allocate took 22890us to make 200 offers > round 2 allocate took 21346us to make 200 offers > {noformat} > And for 200 frameworks: > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from HierarchicalAllocator_BENCHMARK_Test > [ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers > Using 2000 slaves and 2000 frameworks > round 0 allocate took 1.11178secs to make 2000 offers > round 1 allocate took 1.062649secs to make 2000 offers > round 2 allocate took 1.080181secs to make 2000 offers >
[jira] [Created] (MESOS-5760) MAC OS Build failed
Guangya Liu created MESOS-5760: -- Summary: MAC OS Build failed Key: MESOS-5760 URL: https://issues.apache.org/jira/browse/MESOS-5760 Project: Mesos Issue Type: Bug Reporter: Guangya Liu Assignee: Guangya Liu {code} arwin -DZOOKEEPER_VERSION=\"3.4.8\" -I/usr/local/opt/subversion/include/subversion-1 -I/usr/local/opt/openssl/include -I/usr/include/apr-1 -I/usr/include/apr-1.0 -D_THREAD_SAFE -pthread -g -O0 -Wno-unused-local-typedef -std=c++11 -stdlib=libc++ -DGTEST_USE_OWN_TR1_TUPLE=1 -DGTEST_LANG_CXX11 -MT tests/mesos_tests-hdfs_tests.o -MD -MP -MF tests/.deps/mesos_tests-hdfs_tests.Tpo -c -o tests/mesos_tests-hdfs_tests.o `test -f 'tests/hdfs_tests.cpp' || echo '../../src/'`tests/hdfs_tests.cpp In file included from ../../src/tests/gc_tests.cpp:42: // distributed with this work for additional information ../../src/linux/fs.hpp:20:10: fatal error: 'mntent.h' file not found #include <mntent.h> ^ mv -f tests/.deps/mesos_tests-executor_http_api_tests.Tpo tests/.deps/mesos_tests-executor_http_api_tests.Po g++ -DPACKAGE_NAME=\"mesos\" -DPACKAGE_TARNAME=\"mesos\" -DPACKAGE_VERSION=\"1.0.0\" -DPACKAGE_STRING=\"mesos\ 1.0.0\" -DPACKAGE_BUGREPORT=\"\" -DPACKAGE_URL=\"\" -DPACKAGE=\"mesos\" -DVERSION=\"1.0.0\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\" -DHAVE_CXX11=1 -DHAVE_PTHREAD_PRIO_INHERIT=1 -DHAVE_PTHREAD=1 -DHAVE_LIBZ=1 -DHAVE_FTS_H=1 -DHAVE_APR_POOLS_H=1 -DHAVE_LIBAPR_1=1 -DHAVE_LIBCURL=1 -DMESOS_HAS_JAVA=1 -DHAVE_PYTHON=\"2.7\" -DMESOS_HAS_PYTHON=1 -DHAVE_LIBSASL2=1 -DHAVE_SVN_VERSION_H=1 -DHAVE_LIBSVN_SUBR_1=1 -DHAVE_SVN_DELTA_H=1 -DHAVE_LIBSVN_DELTA_1=1 -DHAVE_LIBZ=1 -I.
-I../../src -Wall -Werror -DLIBDIR=\"/usr/local/lib\" -DPKGLIBEXECDIR=\"/usr/local/libexec/mesos\" -DPKGDATADIR=\"/usr/local/share/mesos\" -DPKGMODULEDIR=\"/usr/local/lib/mesos/modules\" -I../../include -I../include -I../include/mesos -DPICOJSON_USE_INT64 -D__STDC_FORMAT_MACROS -isystem ../3rdparty/boost-1.53.0 -I../3rdparty/glog-0.3.3/src -I../3rdparty/leveldb-1.4/include -I../../3rdparty/libprocess/include -I../3rdparty/nvml-352.79 -I../3rdparty/picojson-1.3.0 -I../3rdparty/protobuf-2.6.1/src -I../../3rdpa {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-5743) Added a flag parser for hashset.
Guangya Liu created MESOS-5743: -- Summary: Added a flag parser for hashset. Key: MESOS-5743 URL: https://issues.apache.org/jira/browse/MESOS-5743 Project: Mesos Issue Type: Bug Reporter: Guangya Liu Assignee: Guangya Liu We are introducing a new master flag that lists multiple resource names to exclude from the sorter, so it would be better to add a flag parser for hashset that can parse such a multi-valued flag. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
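For illustration, the kind of parsing such a flag needs might look like the sketch below. It assumes a comma-separated flag value and uses `std::unordered_set` in place of the Mesos hashset type; `parseNameSet` is a hypothetical helper name, not an existing Mesos or stout function.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <unordered_set>

// Hypothetical helper: splits a comma-separated flag value such as
// "gpus,disk" into a set of resource names. Empty tokens are dropped
// and duplicates collapse because the result is a set.
std::unordered_set<std::string> parseNameSet(const std::string& value)
{
  std::unordered_set<std::string> names;
  std::istringstream stream(value);
  std::string token;

  while (std::getline(stream, token, ',')) {
    if (!token.empty()) {
      names.insert(token);
    }
  }

  return names;
}
```

Returning a set rather than a list means the caller can test membership directly when deciding whether a resource name is excluded.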
[jira] [Comment Edited] (MESOS-5621) Enabled calculateShare() to ignore the fairnessExcludeResourceNames
[ https://issues.apache.org/jira/browse/MESOS-5621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15337847#comment-15337847 ] Guangya Liu edited comment on MESOS-5621 at 6/24/16 2:08 PM: - https://reviews.apache.org/r/49190/ was (Author: gyliu): https://reviews.apache.org/r/48906/ > Enabled calculateShare() to ignore the fairnessExcludeResourceNames > --- > > Key: MESOS-5621 > URL: https://issues.apache.org/jira/browse/MESOS-5621 > Project: Mesos > Issue Type: Bug >Reporter: Guangya Liu >Assignee: Guangya Liu > > Enabled calculateShare() to ignore the fairnessExcludeResourceNames, the > fairnessExcludeResourceNames will be a member field for sorter. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
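As a rough sketch of the idea, a dominant-share calculation that skips excluded resource names could look like the following. The types and the name `calculateShareSketch` are assumptions made for this example; the real sorter works on Mesos resource objects, and here the excluded names are passed as an argument rather than held as a member field.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <set>
#include <string>

using ResourceMap = std::map<std::string, double>;

// Returns the largest allocated/total ratio over all resource names,
// ignoring any name listed in `excluded` (hypothetical stand-in for
// fairnessExcludeResourceNames) and any resource with a zero total.
double calculateShareSketch(
    const ResourceMap& allocated,
    const ResourceMap& totals,
    const std::set<std::string>& excluded)
{
  double share = 0.0;

  for (const auto& entry : totals) {
    const std::string& name = entry.first;
    const double total = entry.second;

    if (excluded.count(name) > 0 || total <= 0.0) {
      continue; // Excluded or absent resources do not affect fairness.
    }

    const auto it = allocated.find(name);
    const double used = (it == allocated.end()) ? 0.0 : it->second;
    share = std::max(share, used / total);
  }

  return share;
}
```

In this sketch a client holding every GPU but only a quarter of the CPUs has a share of 1.0 normally and 0.25 once `gpus` is excluded.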
[jira] [Updated] (MESOS-5621) Enabled calculateShare() to ignore the fairnessExcludeResourceNames
[ https://issues.apache.org/jira/browse/MESOS-5621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guangya Liu updated MESOS-5621: --- Description: Enabled calculateShare() to ignore the fairnessExcludeResourceNames, the fairnessExcludeResourceNames will be a member field for sorter. (was: We need a helper function to get all non scarce resources so as to help allocator get the non scarce resources information.) > Enabled calculateShare() to ignore the fairnessExcludeResourceNames > --- > > Key: MESOS-5621 > URL: https://issues.apache.org/jira/browse/MESOS-5621 > Project: Mesos > Issue Type: Bug >Reporter: Guangya Liu >Assignee: Guangya Liu > > Enabled calculateShare() to ignore the fairnessExcludeResourceNames, the > fairnessExcludeResourceNames will be a member field for sorter. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5621) Enabled calculateShare() to ignore the fairnessExcludeResourceNames
[ https://issues.apache.org/jira/browse/MESOS-5621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guangya Liu updated MESOS-5621: --- Summary: Enabled calculateShare() to ignore the fairnessExcludeResourceNames (was: Add helper function to get non scarce resoures) > Enabled calculateShare() to ignore the fairnessExcludeResourceNames > --- > > Key: MESOS-5621 > URL: https://issues.apache.org/jira/browse/MESOS-5621 > Project: Mesos > Issue Type: Bug >Reporter: Guangya Liu >Assignee: Guangya Liu > > We need a helper function to get all non scarce resources so as to help > allocator get the non scarce resources information. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-5641) Update docker-volume.md to add some content for how to test
Guangya Liu created MESOS-5641: -- Summary: Update docker-volume.md to add some content for how to test Key: MESOS-5641 URL: https://issues.apache.org/jira/browse/MESOS-5641 Project: Mesos Issue Type: Bug Reporter: Guangya Liu Assignee: Guangya Liu mesos-execute was fixed in MESOS-5265, so the document should be updated to show how to use mesos-execute to test the docker volume isolator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-5640) Unify the help info for master/agent flags
Guangya Liu created MESOS-5640: -- Summary: Unify the help info for master/agent flags Key: MESOS-5640 URL: https://issues.apache.org/jira/browse/MESOS-5640 Project: Mesos Issue Type: Bug Reporter: Guangya Liu Priority: Minor Currently, in master/flags.cpp, the help text of some flags ends with a "\n" while that of others does not, which makes the output inconsistent. {code} --[no-]hostname_lookup Whether we should execute a lookup to find out the server's hostname, if not explicitly set (via, e.g., `--hostname`). True by default; if set to `false` it will cause Mesos to use the IP address, unless the hostname is explicitly set. (default: true) --http_authenticators=VALUE HTTP authenticator implementation to use when handling requests to authenticated endpoints. Use the default `basic`, or load an alternate HTTP authenticator module using `--modules`. Currently there is no support for multiple HTTP authenticators. (default: basic) --http_framework_authenticators=VALUE HTTP authenticator implementation to use when authenticating HTTP frameworks. Use the `basic` authenticator or load an alternate authenticator module using `--modules`. Must be used in conjunction with `--http_authenticate_frameworks`. {code} I think we should follow the format of Linux man pages by adding a "\n" to all flags. The following is a sample output for "man ls". {code} -@ Display extended attribute keys and sizes in long (-l) output. -1 (The numeric digit ``one''.) Force output to be one entry per line. This is the default when output is not to a terminal. -A List all entries except for . and ... Always set for the super-user. -a Include directory entries whose names begin with a dot (.). -B Force printing of non-printable characters (as defined by ctype(3) and current locale settings) in file names as \xxx, where xxx is the numeric value of the character in octal. -b As -B, but use C escape codes whenever possible. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-5625) Document the overall treatment of scarce resources.
Guangya Liu created MESOS-5625: -- Summary: Document the overall treatment of scarce resources. Key: MESOS-5625 URL: https://issues.apache.org/jira/browse/MESOS-5625 Project: Mesos Issue Type: Bug Reporter: Guangya Liu Assignee: Guangya Liu This document should clarify the overall treatment of scarce resources. Please refer to http://markmail.org/thread/ojoz5zyko2l5srld for some initial discussion. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-5623) Add test cases for scarce resources
Guangya Liu created MESOS-5623: -- Summary: Add test cases for scarce resources Key: MESOS-5623 URL: https://issues.apache.org/jira/browse/MESOS-5623 Project: Mesos Issue Type: Bug Reporter: Guangya Liu Assignee: Guangya Liu Add test cases for the scarce resources change. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-5622) Update allocator to handle scarce resources
Guangya Liu created MESOS-5622: -- Summary: Update allocator to handle scarce resources Key: MESOS-5622 URL: https://issues.apache.org/jira/browse/MESOS-5622 Project: Mesos Issue Type: Bug Reporter: Guangya Liu Assignee: Guangya Liu The allocator should be updated to handle scarce resources; the idea is to exclude scarce resources from all sorters in the allocator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
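One way to picture "excluding scarce resources from all sorters" is to strip them from a resource bundle before it is handed to any sorter. The sketch below assumes resources can be modeled as a name-to-quantity map; `withoutScarce` and `ResourceQuantities` are hypothetical names, not part of the Mesos allocator.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

using ResourceQuantities = std::map<std::string, double>;

// Returns a copy of `resources` with every scarce resource name removed,
// so that whatever sorter receives the result never sees those names.
ResourceQuantities withoutScarce(
    const ResourceQuantities& resources,
    const std::set<std::string>& scarce)
{
  ResourceQuantities filtered;

  for (const auto& entry : resources) {
    if (scarce.count(entry.first) == 0) {
      filtered.insert(entry);
    }
  }

  return filtered;
}
```

Filtering at the boundary keeps the sorters themselves unchanged, whereas teaching each sorter about the excluded names (as MESOS-5621 describes for calculateShare()) keeps the full resource information available to them.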