[jira] [Assigned] (MESOS-5795) Add Nvidia GPU support for in the docker containerizer

2020-05-26 Thread Meng Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-5795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Meng Zhu reassigned MESOS-5795:
---

Assignee: Meng Zhu

> Add Nvidia GPU support for in the docker containerizer
> --
>
> Key: MESOS-5795
> URL: https://issues.apache.org/jira/browse/MESOS-5795
> Project: Mesos
>  Issue Type: Epic
>  Components: containerization, docker
>Reporter: Kevin Klues
>Assignee: Meng Zhu
>Priority: Major
>  Labels: gpu, mesosphere
>
> In order to support Nvidia GPUs with docker containers in Mesos, we need to 
> be able to consolidate all Nvidia libraries into a common volume and inject 
> that volume into the container. This tracks the support in the docker 
> containerizer. The mesos containerizer support has already been completed in 
> MESOS-5401.
> More info on why this is necessary here: 
> https://github.com/NVIDIA/nvidia-docker/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10029) Quota limits may be breached when serving operations.

2019-11-04 Thread Meng Zhu (Jira)
Meng Zhu created MESOS-10029:


 Summary: Quota limits may be breached when serving operations.
 Key: MESOS-10029
 URL: https://issues.apache.org/jira/browse/MESOS-10029
 Project: Mesos
  Issue Type: Bug
Reporter: Meng Zhu


Currently, quota limits are enforced only during the offer stage in the 
allocator. For other resource consumption events, such as operator-initiated 
operations (e.g. reserving resources for a role), the limits are not checked. 
This may lead to a breach of quota limits.
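
To illustrate the missing check, the sketch below applies a limit test to an operator-initiated reservation before charging it to the role. It is a minimal model under stated assumptions, not Mesos code: `RoleUsage`, `withinLimit` and `tryReserve` are hypothetical names, resources are reduced to one scalar, and a role without a configured limit is simply rejected.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical, simplified model: one scalar quantity per role.
struct RoleUsage {
  double consumed = 0.0;  // Resources already charged against the role.
  double limit = 0.0;     // Quota limit configured for the role.
};

// Returns true if charging `quantity` more would stay within the limit.
bool withinLimit(const RoleUsage& usage, double quantity) {
  return usage.consumed + quantity <= usage.limit;
}

// Applies a reservation only if the role's limit permits it.
// Assumption of this sketch: roles with no entry are rejected.
bool tryReserve(std::map<std::string, RoleUsage>& roles,
                const std::string& role,
                double quantity) {
  assert(quantity >= 0.0);

  auto it = roles.find(role);
  if (it == roles.end() || !withinLimit(it->second, quantity)) {
    return false;  // Unknown role, or the limit would be breached.
  }

  it->second.consumed += quantity;
  return true;
}
```

The point of the sketch is only that the same predicate can guard every path that consumes resources, not just the offer path.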





[jira] [Assigned] (MESOS-10028) Mesos failed to build due to error C3493 on windows with MSVC

2019-11-04 Thread Meng Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Meng Zhu reassigned MESOS-10028:


Assignee: Meng Zhu

> Mesos failed to build due to error C3493 on windows with MSVC
> -
>
> Key: MESOS-10028
> URL: https://issues.apache.org/jira/browse/MESOS-10028
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: master
> Environment: VS 2017 + Windows Server 2016
>Reporter: LinGao
>Assignee: Meng Zhu
>Priority: Major
> Attachments: log_x64_build.log
>
>
> Mesos fails to build on Windows with MSVC due to error C3493: 
> 'childRoleLength' cannot be implicitly captured because no default capture 
> mode has been specified. It first reproduces at revision 69e92ae on the 
> master branch. Could you please take a look at this issue? Thanks a lot!
>  
> Reproduce steps:
> 1. git clone -c core.autocrlf=true https://github.com/apache/mesos 
> D:\mesos\src
> 2. Open a VS 2017 x64 command prompt as admin and browse to D:\mesos
> 3. cd src
> 4. .\bootstrap.bat
> 5. cd ..
> 6. mkdir build_x64 && pushd build_x64
> 7. cmake ..\src -G "Visual Studio 15 2017 Win64" 
> -DCMAKE_SYSTEM_VERSION=10.0.17134.0 -DENABLE_LIBEVENT=1 
> -DHAS_AUTHENTICATION=0 -DPATCHEXE_PATH="C:\gnuwin32\bin" -T host=x64
> 8. msbuild Mesos.sln /p:Configuration=Debug /p:Platform=x64 /maxcpucount:4 
> /t:Rebuild
>  
> ErrorMessage:
> D:\mesos\src\src\tests\hierarchical_allocator_tests.cpp(8455): error C3493: 
> 'childRoleLength' cannot be implicitly captured because no default capture 
> mode has been specified
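
For readers unfamiliar with C3493, the sketch below shows the general shape of the problem and one portable way around it. It is not the code from `hierarchical_allocator_tests.cpp`: only the name `childRoleLength` comes from the error message, and `makeChildRole` is a hypothetical function. Other compilers may accept a lambda that uses a constant local without capturing it; naming the variable in the capture list avoids relying on that.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the test code; only the name `childRoleLength`
// is taken from the error message.
std::string makeChildRole(char fill) {
  const std::size_t childRoleLength = 4;

  // A lambda written with an empty capture list `[]` that still uses
  // `childRoleLength` is the pattern the error message complains about.
  // Capturing the variable explicitly is accepted everywhere.
  auto build = [childRoleLength](char c) {
    return std::string(childRoleLength, c);
  };

  std::string role = build(fill);
  assert(role.size() == childRoleLength);
  return role;
}
```

Calling `makeChildRole('a')` yields a four-character role name built inside the lambda.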





[jira] [Commented] (MESOS-10014) `tryUntrackFrameworkUnderRole` check failed in `HierarchicalAllocatorProcess::removeFramework`.

2019-10-22 Thread Meng Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957497#comment-16957497
 ] 

Meng Zhu commented on MESOS-10014:
--

Hmm, the following log message looks problematic:

{noformat}
I1018 09:05:14.228754 21394 hierarchical.cpp:955] Added agent 
e6284079-cb6a-4a47-8f9a-ea9b84ff622a-S0 (ip-172-16-10-17.ec2.internal) with 
cpus:2; mem:1024; disk:1024; ports:[31000-32000] (offered or allocated: {})
I1018 09:05:14.229159 21394 hierarchical.cpp:1100] Grew agent 
e6284079-cb6a-4a47-8f9a-ea9b84ff622a-S0 by disk[RAW(,,profile)]:200 (total), {  
} (used)
I1018 09:05:14.229632 21394 hierarchical.cpp:1057] Agent 
e6284079-cb6a-4a47-8f9a-ea9b84ff622a-S0 (ip-172-16-10-17.ec2.internal) updated 
with total resources cpus:2; mem:1024; disk:1024; ports:[31000-32000]
I1018 09:05:14.230063 21394 hierarchical.cpp:1843] Performed allocation for 1 
agents in 128843ns
I1018 09:05:14.230569 21391 master.cpp:10926] Recovered orphan operation 
71647a26-b5fe-4b97-9162-0abb2785b909 (ID: operation) on agent 
e6284079-cb6a-4a47-8f9a-ea9b84ff622a-S0 belonging to framework 
e6284079-cb6a-4a47-8f9a-ea9b84ff622a- in state OPERATION_PENDING
I1018 09:05:14.230813 21391 master.cpp:10824] Adding framework 
e6284079-cb6a-4a47-8f9a-ea9b84ff622a- (default) with roles {  } suppressed
I1018 09:05:14.230991 21391 master.cpp:8295] Updating framework 
e6284079-cb6a-4a47-8f9a-ea9b84ff622a- (default) with roles {  } suppressed
I1018 09:05:14.231298 21390 hierarchical.cpp:1100] Grew agent 
e6284079-cb6a-4a47-8f9a-ea9b84ff622a-S0 by disk[RAW(,,profile)]:200 (total), { 
e6284079-cb6a-4a47-8f9a-ea9b84ff622a-: disk(allocated: 
default-role)[RAW(,,profile)]:200 } (used)
{noformat}

This happens after the master failover. In particular, there are two `Grew 
agent ...` messages, indicating that two resource providers (each with 200 
disk) were added. The latter one contains 200 *used* disk. This is probably the 
same 200 disk resource printed out above by [~bmahler].

I suspect this relates to orphan operations. cc [~greggomann]

> `tryUntrackFrameworkUnderRole` check failed in 
> `HierarchicalAllocatorProcess::removeFramework`.
> ---
>
> Key: MESOS-10014
> URL: https://issues.apache.org/jira/browse/MESOS-10014
> Project: Mesos
>  Issue Type: Bug
>  Components: master, test
>Affects Versions: 1.10
>Reporter: Andrei Budnik
>Priority: Major
>  Labels: flaky-test, resource-management
> Attachments: AgentPendingOperationAfterMasterFailover-badrun.txt
>
>
> `ContentType/OperationReconciliationTest.AgentPendingOperationAfterMasterFailover/0`
>  test failed:
> {code:java}
> F1018 09:05:14.310616 21391 hierarchical.cpp:745] Check failed: 
> tryUntrackFrameworkUnderRole(framework, role)  Framework: 
> e6284079-cb6a-4a47-8f9a-ea9b84ff622a- role: default-role
> *** Check failure stack trace: ***
> @ 0x7f40fff0a1f6  google::LogMessage::Fail()
> @ 0x7f40fff0a14f  google::LogMessage::SendToLog()
> @ 0x7f40fff09a91  google::LogMessage::Flush()
> @ 0x7f40fff0d12f  google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f410fd828ac  
> mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::removeFramework()
> @  0x186b29f  
> _ZZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS1_11FrameworkIDES8_EEvRKNS_3PIDIT_EEMSA_FvT0_EOT1_ENKUlOS6_PNS_11ProcessBaseEE_clESJ_SL_
> @  0x189c273  
> _ZN5cpp176invokeIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS3_11FrameworkIDESA_EEvRKNS1_3PIDIT_EEMSC_FvT0_EOT1_EUlOS8_PNS1_11ProcessBaseEE_JS8_SN_EEEDTclcl7forwardISC_Efp_Espcl7forwardIT0_Efp0_EEEOSC_DpOSP_
> @  0x18990b7  
> _ZN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS4_11FrameworkIDESB_EEvRKNS2_3PIDIT_EEMSD_FvT0_EOT1_EUlOS9_PNS2_11ProcessBaseEE_JS9_St12_PlaceholderILi113invoke_expandISP_St5tupleIJS9_SR_EESU_IJOSO_EEJLm0ELm1DTcl6invokecl7forwardISD_Efp_Espcl6expandcl3getIXT2_EEcl7forwardISH_Efp0_EEcl7forwardISK_Efp2_OSD_OSH_N5cpp1416integer_sequenceImJXspT2_SL_
> @  0x1896100  
> _ZNO6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS4_11FrameworkIDESB_EEvRKNS2_3PIDIT_EEMSD_FvT0_EOT1_EUlOS9_PNS2_11ProcessBaseEE_IS9_St12_PlaceholderILi1clIISO_EEEDTcl13invoke_expandcl4movedtdefpT1fEcl4movedtdefpT10bound_argsEcvN5cpp1416integer_sequenceImILm0ELm1_Ecl16forward_as_tuplespcl7forwardIT_Efp_DpOSX_
> @  0x1895174  
> 

[jira] [Commented] (MESOS-10006) Crash in Sorter: "Check failed: resources.contains(slaveId)"

2019-10-04 Thread Meng Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944691#comment-16944691
 ] 

Meng Zhu commented on MESOS-10006:
--

Debug patch landed in master and 1.9.x, 1.8.x (will be included in 1.9.1 and 
1.8.2)
{noformat}
commit 3457771b42993c85e3da3c4550b233f61b14bc99 (origin/master, apache/master, 
master, check_slaveID)
Author: Meng Zhu 
Date:   Fri Oct 4 10:48:40 2019 -0400

Made `CHECK` in sorter print out more info upon failure.

Review: https://reviews.apache.org/r/71581
{noformat}


> Crash in Sorter: "Check failed: resources.contains(slaveId)"
> 
>
> Key: MESOS-10006
> URL: https://issues.apache.org/jira/browse/MESOS-10006
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.1.0, 1.4.1, 1.9.0
> Environment: Ubuntu Bionic 18.04, Mesos 1.1.0, 1.4.1, 1.9.0 (logs are 
> from 1.9.0).
>Reporter: Terra Field
>Priority: Major
> Attachments: mesos-master.log.gz
>
>
> We've hit a similar exception on 3 different versions of the Mesos master 
> (the line #/file name changes but the Check failed is the same), usually when 
> under very high load:
> {noformat}
> F1003 22:06:54.463502  8579 sorter.hpp:339] Check failed: 
> resources.contains(slaveId)
> {noformat}
> This particular occurrence happened after the election of a new master that 
> was then stuck doing framework update broadcasts, as documented in 
> MESOS-10005.
>  





[jira] [Commented] (MESOS-10006) Crash in Sorter: "Check failed: resources.contains(slaveId)"

2019-10-04 Thread Meng Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944592#comment-16944592
 ] 

Meng Zhu commented on MESOS-10006:
--

 Cross-posting from slack:

Thanks for the ticket! Unfortunately, the log does not contain much useful 
information. Alas, we did not print out the slaveID upon check failure. I sent 
out a patch to print more info upon check failure:
https://reviews.apache.org/r/71581
Consider backporting it.

Also, a hunch at a diagnosis: such CHECK failures on sorter function input args 
are almost always bugs on the caller side. In this case, it is most likely some 
race or inconsistency between master and allocator during recovery.



> Crash in Sorter: "Check failed: resources.contains(slaveId)"
> 
>
> Key: MESOS-10006
> URL: https://issues.apache.org/jira/browse/MESOS-10006
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.1.0, 1.4.1, 1.9.0
> Environment: Ubuntu Bionic 18.04, Mesos 1.1.0, 1.4.1, 1.9.0 (logs are 
> from 1.9.0).
>Reporter: Terra Field
>Priority: Major
> Attachments: mesos-master.log.gz
>
>
> We've hit a similar exception on 3 different versions of the Mesos master 
> (the line #/file name changes but the Check failed is the same), usually when 
> under very high load:
> {noformat}
> F1003 22:06:54.463502  8579 sorter.hpp:339] Check failed: 
> resources.contains(slaveId)
> {noformat}
> This particular occurrence happened after the election of a new master that 
> was then stuck doing framework update broadcasts, as documented in 
> MESOS-10005.
>  





[jira] [Commented] (MESOS-3938) Consider allowing setting quotas for the default '*' role.

2019-10-02 Thread Meng Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-3938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16942888#comment-16942888
 ] 

Meng Zhu commented on MESOS-3938:
-

{noformat}
commit 270a3dce490d5b334f9a0011ea416ffc42e187e4
Author: Meng Zhu 
Date:   Wed Sep 25 15:41:07 2019 -0700

Documented setting quota on the default role in the release note.

Review: https://reviews.apache.org/r/71548

commit 4dd00c6ad3d8af1d38d496a51f5407ee0e4b1970
Author: Meng Zhu 
Date:   Tue Sep 10 11:51:09 2019 -0700

Allowed setting quota the default "*" role.

There is no clear argument against setting quota on the default
"*" role. This patch allows doing so. Tests are updated to check
against regressions.

Review: https://reviews.apache.org/r/71464
{noformat}


> Consider allowing setting quotas for the default '*' role.
> --
>
> Key: MESOS-3938
> URL: https://issues.apache.org/jira/browse/MESOS-3938
> Project: Mesos
>  Issue Type: Task
>Reporter: Alex R
>Assignee: Meng Zhu
>Priority: Major
>  Labels: resource-management
>
> Investigate use cases and implications of the possibility to set quota for 
> the '*' role. For example, having quota for '*' set can effectively reduce 
> the scope of the quota capacity heuristic.





[jira] [Commented] (MESOS-8503) Improve UI when displaying frameworks with many roles.

2019-09-25 Thread Meng Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-8503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938030#comment-16938030
 ] 

Meng Zhu commented on MESOS-8503:
-

{noformat}
commit aed0b871479ecb1ee36df334c46203b75d682a7e
Author: Andrei Sekretenko 
Date:   Wed Sep 25 13:11:08 2019 -0700

Fixed Javascript linting and IE compatibility of the UI roles tree.

Review: https://reviews.apache.org/r/71541/
{noformat}


> Improve UI when displaying frameworks with many roles.
> --
>
> Key: MESOS-8503
> URL: https://issues.apache.org/jira/browse/MESOS-8503
> Project: Mesos
>  Issue Type: Task
>Reporter: Armand Grillet
>Assignee: Andrei Sekretenko
>Priority: Major
>  Labels: resource-management
> Fix For: 1.10
>
> Attachments: Screen Shot 2018-01-29 à 10.38.05.png
>
>
> The /frameworks UI endpoint displays all the roles of each framework in a 
> table:
> !Screen Shot 2018-01-29 à 10.38.05.png!
> This is not readable if a framework has many roles. We thus need to provide a 
> solution to only display a few roles per framework and show more when a user 
> wants to see all of them.





[jira] [Created] (MESOS-9975) Sorter may leak clients.

2019-09-18 Thread Meng Zhu (Jira)
Meng Zhu created MESOS-9975:
---

 Summary: Sorter may leak clients.
 Key: MESOS-9975
 URL: https://issues.apache.org/jira/browse/MESOS-9975
 Project: Mesos
  Issue Type: Bug
  Components: allocation
Reporter: Meng Zhu


In MESOS-9015, we allowed resource quantities to change when updating an 
existing allocation. However, when `newAllocation` is `empty()`, 
`sorter::update()` forgets to remove the client from the map.

https://github.com/apache/mesos/blob/master/src/master/allocator/mesos/sorter/drf/sorter.hpp#L382-L384

This could happen, for example, when a CSI volume with a stale profile is 
destroyed: it would be better to convert it into an empty resource, since the 
disk space is no longer available.
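
The bookkeeping at issue can be sketched as follows. This is a simplified illustration, not the code at the link above: the `Allocation` alias, the two-level map and this `update` signature are assumptions, and resources are reduced to one scalar per agent. An update to an empty allocation erases the entries instead of leaving them behind.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical, simplified stand-in for a sorter's per-client bookkeeping:
// agent ID -> allocated scalar quantity. Not the real Mesos types.
using Allocation = std::map<std::string, double>;

// Replaces the allocation of `client` on `agentId` with `newAllocation`.
// A quantity of zero is treated as an empty allocation, and the entries
// that would otherwise be leaked are erased.
void update(std::map<std::string, Allocation>& clients,
            const std::string& client,
            const std::string& agentId,
            double newAllocation) {
  assert(newAllocation >= 0.0);

  if (newAllocation > 0.0) {
    clients[client][agentId] = newAllocation;
    return;
  }

  auto it = clients.find(client);
  if (it == clients.end()) {
    return;  // Nothing is tracked for this client.
  }

  it->second.erase(agentId);
  if (it->second.empty()) {
    clients.erase(it);  // Drop the client once nothing is allocated to it.
  }
}
```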





[jira] [Assigned] (MESOS-9975) Sorter may leak clients.

2019-09-18 Thread Meng Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-9975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Meng Zhu reassigned MESOS-9975:
---

Assignee: Meng Zhu

> Sorter may leak clients.
> 
>
> Key: MESOS-9975
> URL: https://issues.apache.org/jira/browse/MESOS-9975
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
>Reporter: Meng Zhu
>Assignee: Meng Zhu
>Priority: Major
>  Labels: resource-management
>
> In MESOS-9015, we allowed resource quantities to change when updating an 
> existing allocation. However, when `newAllocation` is `empty()`, 
> `sorter::update()` forgets to remove the client from the map.
> https://github.com/apache/mesos/blob/master/src/master/allocator/mesos/sorter/drf/sorter.hpp#L382-L384
> This could happen, for example, when a CSI volume with a stale profile is 
> destroyed: it would be better to convert it into an empty resource, since the 
> disk space is no longer available.





[jira] [Commented] (MESOS-621) `HierarchicalAllocatorProcess::removeSlave` doesn't properly handle framework allocations/resources

2019-09-12 Thread Meng Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928909#comment-16928909
 ] 

Meng Zhu commented on MESOS-621:


Added tracking of allocated or offered resources in the allocator:

{noformat}
commit 783fd45c548fdff0c5c4812bc8e92c3aed202e06
Author: Meng Zhu m...@mesosphere.io
Date:   Sat Sep 7 16:01:51 2019 -0700


Tracked offered and allocated resources in the role tree.

This helpers simplify the quota tracking logic and also paves
the way to reduce duplicated states in the sorter.

Also documented that shared resources must be uniquely
identifiable.

Small performance degradation when making allocations due to
duplicated map construction in `(un)trackAllocatedResources`.
This will be removed once embeded the sorter in the role tree.

Benchmark `LargeAndSmallQuota/2`:

Master:

Added 3000 agents in 80.648188ms
Added 3000 frameworks in 19.7006984secs
Benchmark setup: 3000 agents, 3000 roles, 3000 frameworks,
with drf sorter
Made 3500 allocations in 16.044274434secs
Made 0 allocation in 14.476429451secs

Master + this patch:
Added 3000 agents in 80.110817ms
Added 3000 frameworks in 17.25974094secs
Benchmark setup: 3000 agents, 3000 roles, 3000 frameworks,
with drf sorter
Made 3500 allocations in 16.91971379secs
Made 0 allocation in 13.784476154secs

Review: https://reviews.apache.org/r/71460
commit 2ec34ca5951a5a8da3d1ab93839cce68e815c1d5
Author: Meng Zhu 
Date:   Tue Sep 3 13:31:36 2019 -0700

Added tracking of framework allocations in the allocator Slave class.

This would simplify the tracking logic regarding
resource allocations in the allocator. See MESOS-9182.

Negligible performance impact:

Master:

BENCHMARK_HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2
Added 3000 agents in 77.999483ms
Added 3000 frameworks in 16.736076171secs
Benchmark setup: 3000 agents, 3000 roles, 3000 frameworks,
with drf sorter
Made 3500 allocations in 15.342376944secs
Made 0 allocation in 13.96720191secs

Master + this patch:

BENCHMARK_HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2
Added 3000 agents in 83.597048ms
Added 3000 frameworks in 16.757011745secs
Benchmark setup: 3000 agents, 3000 roles, 3000 frameworks,
with drf sorter
Made 3500 allocations in 15.566366241secs
Made 0 allocation in 14.022591871secs

Review: https://reviews.apache.org/r/68508
{noformat}


> `HierarchicalAllocatorProcess::removeSlave` doesn't properly handle framework 
> allocations/resources
> ---
>
> Key: MESOS-621
> URL: https://issues.apache.org/jira/browse/MESOS-621
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
>Reporter: Vinod Kone
>Assignee: Meng Zhu
>Priority: Major
>  Labels: mesosphere, resource-management, tech-debt
>
> Currently a slaveRemoved() simply removes the slave from 'slaves' map and 
> slave's resources from 'roleSorter'. Looking at resourcesRecovered(), more 
> things need to be done when a slave is removed (e.g., framework 
> unallocations).
> It would be nice to fix this and have a test for this.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Assigned] (MESOS-3938) Consider allowing setting quotas for the default '*' role.

2019-09-10 Thread Meng Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-3938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Meng Zhu reassigned MESOS-3938:
---

Assignee: Meng Zhu

> Consider allowing setting quotas for the default '*' role.
> --
>
> Key: MESOS-3938
> URL: https://issues.apache.org/jira/browse/MESOS-3938
> Project: Mesos
>  Issue Type: Task
>Reporter: Alexander Rukletsov
>Assignee: Meng Zhu
>Priority: Major
>
> Investigate use cases and implications of the possibility to set quota for 
> the '*' role. For example, having quota for '*' set can effectively reduce 
> the scope of the quota capacity heuristic.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Comment Edited] (MESOS-9242) Resources wrapper loses shared resource count information.

2019-09-07 Thread Meng Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16925052#comment-16925052
 ] 

Meng Zhu edited comment on MESOS-9242 at 9/8/19 5:25 AM:
-

Another seemingly possible fix is to store shared items as individual objects 
in the Resources list, e.g. a 60 disk resource that got shared twice could be 
stored as two resources with shared info set. However, this creates an 
ambiguity when doing arithmetic: if we add another addable 60 shared disk 
resource, should it be kept as a distinct object, or should its scalar value be 
combined with the existing object?

Looks like we have to live with the count. However, having the iterator return 
a resource object `sharedCount` times also seems less than ideal. It would go 
against the caller's assumption that resource objects are unique. For example, 
when calculating total scalar quantities, one would expect to simply add 
scalars with the same resource name together.

A better solution seems to be to expose the shared count, i.e. get rid of the 
`Resource_` wrapper and put `shared_count` as a field in the Resource 
SharedInfo proto message.
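
A rough sketch of that last option, under stated assumptions: the `Resource` struct below is a hypothetical simplification (not the Mesos protobuf or the `Resources` class), with the shared count as a plain field. Iteration then yields each object exactly once, so summing scalars by name behaves as a caller would expect, while the count remains available to code that needs it.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified resource: the shared count is an explicit field
// rather than being hidden in a wrapper type.
struct Resource {
  std::string name;
  double scalar = 0.0;
  int sharedCount = 0;  // 0 means the resource is not shared.
};

// Sums scalars by resource name. Each object is counted once, regardless
// of how many times it is shared.
std::map<std::string, double> totalQuantities(
    const std::vector<Resource>& resources) {
  std::map<std::string, double> totals;
  for (const Resource& resource : resources) {
    assert(resource.sharedCount >= 0);
    totals[resource.name] += resource.scalar;
  }
  return totals;
}
```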


was (Author: mzhu):
Another seemingly possible fix is to store shared items as individual objects 
in the Resources list, e.g. a 60 disk resource that got shared twice could be 
stored as two resources with shared info set. However, this creates an 
ambiguity when doing arithmetic: if we add another addable 60 shared disk 
resource, should it be kept as a distinct object, or should its scalar value be 
combined with the existing object?

Looks like we have to live with the count.

> Resources wrapper loses shared resource count information.
> --
>
> Key: MESOS-9242
> URL: https://issues.apache.org/jira/browse/MESOS-9242
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
>Reporter: Benjamin Mahler
>Assignee: Meng Zhu
>Priority: Major
>  Labels: resource-management
>
> The Resources wrapper stores a {{Resource_}} wrapper type that stores 
> multiple copies of a shared resource in a single {{Resource_}} with a 
> shared count.
> On the output paths Resources, we lose the shared counts since we convert 
> {{Resource_}} directly back into a single {{Resource}}, even if the shared 
> count was > 1.
> We need to fix this in the following:
> * Implicit cast operator back to repeated ptr field of resource, this is easy 
> to adjust.
> * Resource iteration, since we only expose const iteration, it should be 
> possible to use an iterator adaptor to return the shared resource {{count}} 
> times rather than just once when there are multiple copies.





[jira] [Commented] (MESOS-9242) Resources wrapper loses shared resource count information.

2019-09-07 Thread Meng Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16925052#comment-16925052
 ] 

Meng Zhu commented on MESOS-9242:
-

Another seemingly possible fix is to store shared items as individual objects 
in the Resources list, e.g. a 60 disk resource that got shared twice could be 
stored as two resources with shared info set. However, this creates an 
ambiguity when doing arithmetic: if we add another addable 60 shared disk 
resource, should it be kept as a distinct object, or should its scalar value be 
combined with the existing object?

Looks like we have to live with the count.

> Resources wrapper loses shared resource count information.
> --
>
> Key: MESOS-9242
> URL: https://issues.apache.org/jira/browse/MESOS-9242
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
>Reporter: Benjamin Mahler
>Assignee: Meng Zhu
>Priority: Major
>  Labels: resource-management
>
> The Resources wrapper stores a {{Resource_}} wrapper type that stores 
> multiple copies of a shared resource in a single {{Resource_}} with a 
> shared count.
> On the output paths Resources, we lose the shared counts since we convert 
> {{Resource_}} directly back into a single {{Resource}}, even if the shared 
> count was > 1.
> We need to fix this in the following:
> * Implicit cast operator back to repeated ptr field of resource, this is easy 
> to adjust.
> * Resource iteration, since we only expose const iteration, it should be 
> possible to use an iterator adaptor to return the shared resource {{count}} 
> times rather than just once when there are multiple copies.





[jira] [Assigned] (MESOS-9242) Resources wrapper loses shared resource count information.

2019-09-07 Thread Meng Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-9242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Meng Zhu reassigned MESOS-9242:
---

Assignee: Meng Zhu

> Resources wrapper loses shared resource count information.
> --
>
> Key: MESOS-9242
> URL: https://issues.apache.org/jira/browse/MESOS-9242
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
>Reporter: Benjamin Mahler
>Assignee: Meng Zhu
>Priority: Major
>
> The Resources wrapper stores a {{Resource_}} wrapper type that stores 
> multiple copies of a shared resource in a single {{Resource_}} with a 
> shared count.
> On the output paths Resources, we lose the shared counts since we convert 
> {{Resource_}} directly back into a single {{Resource}}, even if the shared 
> count was > 1.
> We need to fix this in the following:
> * Implicit cast operator back to repeated ptr field of resource, this is easy 
> to adjust.
> * Resource iteration, since we only expose const iteration, it should be 
> possible to use an iterator adaptor to return the shared resource {{count}} 
> times rather than just once when there are multiple copies.





[jira] [Commented] (MESOS-1452) Improve Master::removeOffer to avoid further resource accounting bugs.

2019-09-06 Thread Meng Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924640#comment-16924640
 ] 

Meng Zhu commented on MESOS-1452:
-

{noformat}
commit a8050cafaa5465bd74a2ced1c37bb6b64c735445
Author: Andrei Sekretenko 
Date:   Fri Sep 6 14:15:28 2019 -0700

Separated handling offer validation failure from handling success.

This patch refactors the loop through offer IDs in `Master::accept()`
into two simpler loops: one loop for the offer validation failure case,
another for the case of validation success, thus bringing removal of
offers and recovering their resources close together.

This is a prerequisite for implementing `rescindOffer()`/
`declineOffer()` in the dependent patch.

Review: https://reviews.apache.org/r/71433/

commit 7eb21c41ed255184988298e29644bf7f310c3374
Author: Andrei Sekretenko 
Date:   Fri Sep 6 14:15:38 2019 -0700

Moved `removeOffers()` from `Master::accept()` into `Master::_accept()`.

This patch moves offer removal on accept into the deferred continuation
that follows authorization (if offers pass validation in `accept()`).

Incrementing the `offers_accepted` metric is also moved to `_accept()`.

This is a prerequisite for implementing `rescindOffer()` /
`declineOffer()` / in the dependent patch.

Review: https://reviews.apache.org/r/71434/

Author: Andrei Sekretenko 
Date:   Fri Sep 6 14:15:54 2019 -0700

Replaced removeOffer + recoverResources pairs with specialized helpers.

This patch adds helper methods `Master::rescindOffer()` /
`Master::discardOffer()` that recover offer's resources in the allocator
and remove the offer, and replaces paired calls of `removeOffer()` +
`allocator->recoverResources()` with these helpers.

Review: https://reviews.apache.org/r/71436/
{noformat}


> Improve Master::removeOffer to avoid further resource accounting bugs.
> --
>
> Key: MESOS-1452
> URL: https://issues.apache.org/jira/browse/MESOS-1452
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benjamin Mahler
>Priority: Major
>
> Per comments on this review: https://reviews.apache.org/r/21750/
> We've had numerous bugs around resource accounting in the master due to the 
> trickiness of removing offers in the Master code.
> There are a few ways to improve this:
> 1. Add multiple offer methods to differentiate semantics:
> {code}
> useOffer(offerId);
> rescindOffer(offerId);
> discardOffer(offerId);
> {code}
> 2. Add an enum to removeOffer to differentiate removal semantics:
> {code}
> removeOffer(offerId, USE);
> removeOffer(offerId, RESCIND);
> removeOffer(offerId, DISCARD);
> {code}
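
The second option quoted above could look roughly like the sketch below. It is a simplified model, not the Mesos `Master` class: `OfferBook`, its fields and the exact accounting attached to each enum value are assumptions made for illustration, and an offer is reduced to a scalar amount of resources.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical, simplified model of the enum-based option.
enum class Removal { USE, RESCIND, DISCARD };

struct OfferBook {
  std::map<std::string, double> offers;  // Outstanding offers by ID.
  double recovered = 0.0;  // Resources handed back to the allocator.
  int rescinded = 0;       // Number of rescind notifications sent.

  // Removes the offer and performs the accounting its semantics require,
  // so that callers cannot forget to recover resources.
  void removeOffer(const std::string& offerId, Removal removal) {
    auto it = offers.find(offerId);
    assert(it != offers.end());

    switch (removal) {
      case Removal::USE:
        break;  // Assumed: resources stay allocated to the framework.
      case Removal::RESCIND:
        ++rescinded;  // Assumed: the framework is told the offer is gone.
        recovered += it->second;
        break;
      case Removal::DISCARD:
        recovered += it->second;
        break;
    }

    offers.erase(it);
  }
};
```

Tying removal and recovery together in one call is the design point here: the caller states the intent, and the accounting follows from it.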





[jira] [Assigned] (MESOS-1452) Improve Master::removeOffer to avoid further resource accounting bugs.

2019-09-06 Thread Meng Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Meng Zhu reassigned MESOS-1452:
---

Assignee: Andrei Sekretenko

> Improve Master::removeOffer to avoid further resource accounting bugs.
> --
>
> Key: MESOS-1452
> URL: https://issues.apache.org/jira/browse/MESOS-1452
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benjamin Mahler
>Assignee: Andrei Sekretenko
>Priority: Major
>
> Per comments on this review: https://reviews.apache.org/r/21750/
> We've had numerous bugs around resource accounting in the master due to the 
> trickiness of removing offers in the Master code.
> There are a few ways to improve this:
> 1. Add multiple offer methods to differentiate semantics:
> {code}
> useOffer(offerId);
> rescindOffer(offerId);
> discardOffer(offerId);
> {code}
> 2. Add an enum to removeOffer to differentiate removal semantics:
> {code}
> removeOffer(offerId, USE);
> removeOffer(offerId, RESCIND);
> removeOffer(offerId, DISCARD);
> {code}





[jira] [Created] (MESOS-9962) Mesos may report completed task as running in the state.

2019-09-04 Thread Meng Zhu (Jira)
Meng Zhu created MESOS-9962:
---

 Summary: Mesos may report completed task as running in the state.
 Key: MESOS-9962
 URL: https://issues.apache.org/jira/browse/MESOS-9962
 Project: Mesos
  Issue Type: Bug
  Components: agent
Reporter: Meng Zhu


When the following steps occur:
1) A graceful shutdown is initiated on the agent (i.e. SIGUSR1 or 
/master/machine/down).
2) The executor is sent a kill, and the agent counts down on 
executor_shutdown_grace_period.
3) The executor exits, before all terminal status updates reach the agent. This 
is more likely if executor_shutdown_grace_period passes.

This results in a completed executor, with non-terminal tasks (according to 
status updates).

This would produce a confusing report where completed tasks are still 
TASK_RUNNING.
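
The inconsistent state described above can be expressed as a small check. The sketch is a reduced model under stated assumptions: `Task`, `isTerminal` and `staleTasks` are hypothetical, only a subset of task states is modeled, and the input is taken to be the tasks of an executor already known to be completed.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified types. Only the state names follow Mesos, and
// only a subset of the task states is modeled.
enum class TaskState { TASK_RUNNING, TASK_FINISHED, TASK_FAILED, TASK_KILLED };

struct Task {
  std::string id;
  TaskState state;  // State from the last status update received.
};

// In this reduced model, every state other than TASK_RUNNING is terminal.
bool isTerminal(TaskState state) {
  return state != TaskState::TASK_RUNNING;
}

// Returns the IDs of tasks of a completed executor whose last known state
// is still non-terminal.
std::vector<std::string> staleTasks(const std::vector<Task>& tasks) {
  std::vector<std::string> stale;
  for (const Task& task : tasks) {
    assert(!task.id.empty());
    if (!isTerminal(task.state)) {
      stale.push_back(task.id);
    }
  }
  return stale;
}
```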





[jira] [Commented] (MESOS-9750) Agent V1 GET_STATE response may report a complete executor's tasks as non-terminal after a graceful agent shutdown

2019-09-04 Thread Meng Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922871#comment-16922871
 ] 

Meng Zhu commented on MESOS-9750:
-

Note that this ticket makes the completed task with the nonterminal status 
appear in the right place (i.e. under completed tasks). However, it results in 
the odd behavior where a completed task has a nonterminal status, e.g. 
TASK_RUNNING.

> Agent V1 GET_STATE response may report a complete executor's tasks as 
> non-terminal after a graceful agent shutdown
> --
>
> Key: MESOS-9750
> URL: https://issues.apache.org/jira/browse/MESOS-9750
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, executor
>Affects Versions: 1.6.0, 1.7.0, 1.8.0
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>Priority: Major
>  Labels: foundations
> Fix For: 1.7.3, 1.8.1, 1.9.0
>
>
> When the following steps occur:
> 1) A graceful shutdown is initiated on the agent (i.e. SIGUSR1 or 
> /master/machine/down).
> 2) The executor is sent a kill, and the agent counts down on 
> {{executor_shutdown_grace_period}}.
> 3) The executor exits, before all terminal status updates reach the agent. 
> This is more likely if {{executor_shutdown_grace_period}} passes.
> This results in a completed executor, with non-terminal tasks (according to 
> status updates).
> When the agent starts back up, the completed executor will be recovered and 
> shows up correctly  as a completed executor in {{/state}}.  However, if you 
> fetch the V1 {{GET_STATE}} result, there will be an entry in 
> {{launched_tasks}} even though nothing is running.
> {code}
> get_tasks {
>   launched_tasks {
> name: "test-task"
> task_id {
>   value: "dff5a155-47f1-4a71-9b92-30ca059ab456"
> }
> framework_id {
>   value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-"
> }
> executor_id {
>   value: "default"
> }
> agent_id {
>   value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-S0"
> }
> state: TASK_RUNNING
> resources { ... }
> resources { ... }
> resources { ... }
> resources { ... }
> statuses {
>   task_id {
> value: "dff5a155-47f1-4a71-9b92-30ca059ab456"
>   }
>   state: TASK_RUNNING
>   agent_id {
> value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-S0"
>   }
>   timestamp: 1556674758.2175469
>   executor_id {
> value: "default"
>   }
>   source: SOURCE_EXECUTOR
>   uuid: "xPmn\234\236F&\235\\d\364\326\323\222\224"
>   container_status { ... }
> }
>   }
> }
> get_executors {
>   completed_executors {
> executor_info {
>   executor_id {
> value: "default"
>   }
>   command {
> value: ""
>   }
>   framework_id {
> value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-"
>   }
> }
>   }
> }
> get_frameworks {
>   completed_frameworks {
> framework_info {
>   user: "user"
>   name: "default"
>   id {
> value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-"
>   }
>   checkpoint: true
>   hostname: "localhost"
>   principal: "test-principal"
>   capabilities {
> type: MULTI_ROLE
>   }
>   capabilities {
> type: RESERVATION_REFINEMENT
>   }
>   roles: "*"
> }
>   }
> }
> {code}
> This happens because we combine executors and completed executors when 
> constructing the response.  The terminal task(s) with non-terminal updates 
> appear under completed executors.
> https://github.com/apache/mesos/blob/89c3dd95a421e14044bc91ceb1998ff4ae3883b4/src/slave/http.cpp#L1734-L1756





[jira] [Created] (MESOS-9961) Agent could fail to report completed tasks.

2019-09-04 Thread Meng Zhu (Jira)
Meng Zhu created MESOS-9961:
---

 Summary: Agent could fail to report completed tasks.
 Key: MESOS-9961
 URL: https://issues.apache.org/jira/browse/MESOS-9961
 Project: Mesos
  Issue Type: Bug
  Components: agent
Reporter: Meng Zhu


When an agent reregisters with a master, it does not report completed executors for 
active frameworks. It only reports completed executors if the framework is also 
completed on the agent:

https://github.com/apache/mesos/blob/1.7.x/src/slave/slave.cpp#L1785-L1832





[jira] [Commented] (MESOS-9806) Address allocator performance regression due to the addition of quota limits.

2019-08-23 Thread Meng Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914674#comment-16914674
 ] 

Meng Zhu commented on MESOS-9806:
-

As of now, the performance is close to 1.8.1 even with the addition of limits 
enforcement. There will be more improvement as we deprecate the framework 
sorter and optimize the role sorter (MESOS-9942 and MESOS-9943).

> Address allocator performance regression due to the addition of quota limits.
> -
>
> Key: MESOS-9806
> URL: https://issues.apache.org/jira/browse/MESOS-9806
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Meng Zhu
>Assignee: Meng Zhu
>Priority: Critical
>  Labels: resource-management
>
> In MESOS-9802, we removed the quota role sorter, which was tech debt.
> However, this slows down the allocator. The problem is that in the first 
> stage, even though a cluster might have no active roles with non-default 
> quota, the allocator now has to sort and go through every role 
> in the cluster. Benchmark results show that for 1k roles with 2k frameworks, 
> the allocator could experience ~50% performance degradation.
> There are a couple of ways to address this issue. For example, we could make 
> the sorter aware of quota and add a method, say `sortQuotaRoles`, to return 
> all the roles with non-default quota. Alternatively, an even better approach 
> would be to deprecate the sorter concept and just have two standalone 
> functions, e.g. sortRoles() and sortQuotaRoles(), that take in the role tree 
> structure (which does not yet exist in the allocator) and return the sorted roles.
> In addition, when implementing MESOS-8068, we need to do more during the 
> allocation cycle. In particular, we need to call shrink many more times than 
> before. These all contribute to the performance slowdown. Specifically, for 
> the quota oriented benchmark 
> `HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2` we can observe 
> 2-3x slowdown compared to the previous release (1.8.1):
> Current master:
> QuotaParam/BENCHMARK_HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2
> Benchmark setup: 3000 agents, 3000 roles, 3000 frameworks, with drf sorter
> Made 3500 allocations in 32.051382735secs
> Made 0 allocation in 27.976022773secs
> 1.8.1:
> HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2
> Made 3500 allocations in 13.810811063secs
> Made 0 allocation in 9.885972984secs
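
The standalone-function idea from the description could be sketched as below. This assumes a flattened `Role` record with a precomputed share and a `hasQuota` flag. Both are hypothetical simplifications, since the role tree does not exist in the allocator yet.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical flattened view of one role; the real structure would be a tree.
struct Role {
  std::string name;
  double share;   // Sort key (e.g. a dominant share); lower sorts first.
  bool hasQuota;  // True if the role has non-default quota.
};

// Returns all roles ordered by share, with ties broken by name.
std::vector<Role> sortRoles(std::vector<Role> roles) {
  std::sort(roles.begin(), roles.end(), [](const Role& a, const Role& b) {
    return a.share != b.share ? a.share < b.share : a.name < b.name;
  });
  return roles;
}

// Returns only the roles with non-default quota, in sorted order.
// Filtering first means the sort cost scales with the number of quota roles.
std::vector<Role> sortQuotaRoles(std::vector<Role> roles) {
  roles.erase(
      std::remove_if(roles.begin(), roles.end(),
                     [](const Role& role) { return !role.hasQuota; }),
      roles.end());
  return sortRoles(std::move(roles));
}
```

With no quota roles in the cluster, `sortQuotaRoles` only pays for one linear scan, which is the case the first allocation stage would benefit from.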





[jira] [Commented] (MESOS-9806) Address allocator performance regression due to the addition of quota limits.

2019-08-23 Thread Meng Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914673#comment-16914673
 ] 

Meng Zhu commented on MESOS-9806:
-

Taken together, the optimizations cut the benchmark time by about 50% (first run: 32.05 secs before, 15.39 secs after):

1.8.1
HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2
Made 3500 allocations in 13.810811063secs
Made 0 allocation in 9.885972984secs

Before the optimization:
QuotaParam/BENCHMARK_HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2
Benchmark setup: 3000 agents, 3000 roles, 3000 frameworks, with drf sorter
Made 3500 allocations in 32.051382735secs
Made 0 allocation in 27.976022773secs

After the optimization:
HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2
Made 3500 allocations in 15.385276405secs
Made 0 allocation in 13.718502414secs
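
The arithmetic behind these comparisons can be checked with two small helpers. The function names are made up for this example; the inputs are the timings quoted above.

```cpp
#include <cassert>

// Fraction of wall-clock time saved when going from `before` to `after`.
double timeReduction(double before, double after) {
  assert(before > 0.0);
  return (before - after) / before;
}

// How many times slower `current` is than `baseline` (1.0 means equal).
double slowdownFactor(double baseline, double current) {
  assert(baseline > 0.0);
  return current / baseline;
}
```

For the first run, `timeReduction(32.051382735, 15.385276405)` comes out at roughly 0.52, and `slowdownFactor(13.810811063, 15.385276405)` at roughly 1.11, so the optimized code is still about 11% slower than 1.8.1 on that run. For the second run the remaining gap is larger, at about 1.39x.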






[jira] [Commented] (MESOS-9806) Address allocator performance regression due to the addition of quota limits.

2019-08-23 Thread Meng Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914672#comment-16914672
 ] 

Meng Zhu commented on MESOS-9806:
-

Small vector optimization for ResourceQuantities, ResourceLimits and Resources:

{noformat}
commit 73033130de7872c6f240b9b05dced039d7666138
Author: Meng Zhu 
Date:   Thu Aug 22 17:19:30 2019 -0700

Used boost `small_vector` in `Resources`.

Master + previous patch:
*HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2
Made 3500 allocations in 16.307044003secs
Made 0 allocation in 14.948262599secs

Master + previous patch + this patch:
*HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2
Made 3500 allocations in 15.385276405secs
Made 0 allocation in 13.718502414secs

Review: https://reviews.apache.org/r/71357

commit 95201cbe4dc87eae2fde5754d16f5effbb6c1974
Author: Meng Zhu 
Date:   Thu Aug 22 16:55:34 2019 -0700

Used boost `small_vector` in Resource Quantities and Limits.

Master + previous patch
*HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2
Made 3500 allocations in 16.831380548secs
Made 0 allocation in 15.102885644secs

Master + previous patch + this patch:
*HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2
Made 3500 allocations in 16.307044003secs
Made 0 allocation in 14.948262599secs

Review: https://reviews.apache.org/r/71355

commit 25070f232a9bb97d1b78f8a7e5b774bbd50654f9
Author: Meng Zhu 
Date:   Thu Aug 22 16:54:42 2019 -0700

Updated the boost library.

This update includes adding `container/small_vector.hpp`.

Review: https://reviews.apache.org/r/71356
{noformat}
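
For readers unfamiliar with the idea, here is a deliberately simplified illustration of inline-first storage. It is not how boost `small_vector` is implemented: this toy does not keep elements contiguous once the inline capacity is exceeded, and it requires `T` to be default-constructible. The class name is made up.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Stores the first N elements inline (no heap allocation) and only falls
// back to a heap-backed vector for anything beyond that.
template <typename T, std::size_t N>
class InlineFirstVector {
public:
  void push_back(const T& value) {
    if (count < N) {
      inlineItems[count] = value;  // No heap allocation on this path.
    } else {
      overflow.push_back(value);   // Falls back to the heap.
    }
    ++count;
  }

  const T& operator[](std::size_t i) const {
    assert(i < count);
    return i < N ? inlineItems[i] : overflow[i - N];
  }

  std::size_t size() const { return count; }
  bool usesHeap() const { return !overflow.empty(); }

private:
  std::array<T, N> inlineItems{};
  std::vector<T> overflow;
  std::size_t count = 0;
};
```

The benefit depends on most instances staying within the inline capacity, which is presumably the common case for the resource containers touched by these patches.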







[jira] [Commented] (MESOS-9806) Address allocator performance regression due to the addition of quota limits.

2019-08-23 Thread Meng Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914670#comment-16914670
 ] 

Meng Zhu commented on MESOS-9806:
-

Optimized the allocation loop

{noformat}
commit ec6b7b34215e821a63cb79e7d52d94ff08c1e110
Author: Meng Zhu 
Date:   Thu Aug 22 17:54:25 2019 -0700

Optimized the allocation loop.

Master:

HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2
Made 3500 allocations in 23.37 secs
Made 0 allocation in 19.72 secs

Master + this patch:

HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2
Made 3500 allocations in 16.831380548secs
Made 0 allocation in 15.102885644secs

Review: https://reviews.apache.org/r/71359
{noformat}







[jira] [Created] (MESOS-9943) Dedicate sorter for roles.

2019-08-19 Thread Meng Zhu (Jira)
Meng Zhu created MESOS-9943:
---

 Summary: Dedicate sorter for roles.
 Key: MESOS-9943
 URL: https://issues.apache.org/jira/browse/MESOS-9943
 Project: Mesos
  Issue Type: Bug
  Components: allocation
Reporter: Meng Zhu
Assignee: Meng Zhu


Once MESOS-9942 has landed, we can clean up and optimize the sorter for roles. 
Specifically, each node in the tree (except the root and virtual leaf nodes) 
will carry a back pointer to the role tree structure in the allocator. This 
will eliminate the duplicated state and unnecessary tracking that is 
currently done inside the sorter.





[jira] [Created] (MESOS-9942) Deprecate framework sorter.

2019-08-19 Thread Meng Zhu (Jira)
Meng Zhu created MESOS-9942:
---

 Summary: Deprecate framework sorter.
 Key: MESOS-9942
 URL: https://issues.apache.org/jira/browse/MESOS-9942
 Project: Mesos
  Issue Type: Bug
  Components: allocation
Reporter: Meng Zhu
Assignee: Meng Zhu


Given the flat structure of frameworks, there is no need to store and sort 
frameworks in the sorter tree structure. We should deprecate the framework sorter. 
This would dedicate the sorter to roles, opening up room for optimization and 
cleanup. 





[jira] [Created] (MESOS-9940) Framework removal may lead to inconsistent task states between master and agent.

2019-08-14 Thread Meng Zhu (JIRA)
Meng Zhu created MESOS-9940:
---

 Summary: Framework removal may lead to inconsistent task states 
between master and agent.
 Key: MESOS-9940
 URL: https://issues.apache.org/jira/browse/MESOS-9940
 Project: Mesos
  Issue Type: Bug
  Components: master
Reporter: Meng Zhu


When a framework is removed from the master (say due to disconnection), the master 
sends a `ShutdownFrameworkMessage` to the agent. At the same time, the master 
transitions the task status to e.g. KILLED. 
(https://github.com/apache/mesos/blob/master/src/master/master.cpp#L11247-L11291)

When the agent gets the shutdown message, it tries to shut down all the executors 
and destroy all the containers. The tasks' status is updated only after all of 
this is done. 
(https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L7914-L7922)

However, if the executor shutdown gets stuck (e.g. due to a hanging docker 
daemon), the task status transition will never happen, and the master and agent 
will have diverged views of these tasks.

One consequence is that the master may try to schedule more workloads onto the 
problematic agent (because it thinks those task resources are freed up). Since 
there is no overcommit check on the agent, the agent will comply and launch those 
tasks. This will lead to over-allocation.

One possible solution is to hold off on the master's status update until the agent is 
done with the framework shutdown.
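
To make the divergence concrete, here is a toy model of the two views. `AgentView` and `canLaunch` are hypothetical; in particular, `canLaunch` stands for the kind of agent-side overcommit check that does not exist today.

```cpp
#include <cassert>

// CPU accounting for one agent as seen by one party (master or agent).
struct AgentView {
  double total;      // Total CPUs on the agent.
  double allocated;  // CPUs this party believes are in use.

  double available() const { return total - allocated; }
};

// Hypothetical agent-side guard: refuse a launch that exceeds what the
// agent itself knows to be free.
bool canLaunch(const AgentView& agentActual, double requested) {
  return requested <= agentActual.available();
}
```

If the master has already marked the tasks KILLED, its view might be `{4.0, 0.0}` while the stuck agent's view is still `{4.0, 4.0}`. A 2-CPU task looks schedulable to the master, yet `canLaunch` on the agent's own view would reject it.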



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (MESOS-9930) DRF sorter may omit clients in sorting after removing an inactive leaf node.

2019-08-08 Thread Meng Zhu (JIRA)
Meng Zhu created MESOS-9930:
---

 Summary: DRF sorter may omit clients in sorting after removing an 
inactive leaf node.
 Key: MESOS-9930
 URL: https://issues.apache.org/jira/browse/MESOS-9930
 Project: Mesos
  Issue Type: Bug
  Components: allocation
Reporter: Meng Zhu
Assignee: Meng Zhu


The sorter assumes inactive leaf nodes are placed at the tail of the children 
list of a node.
However, when collapsing a parent node with a single "." virtual child node, 
its position may fail to be updated due to a bug in `Sorter::remove()`:

{noformat}
CHECK(child->isLeaf());

current->kind = child->kind;
...
if (current->kind == Node::INTERNAL) {
}
{noformat}

This bug would manifest if:
(1) we have a/b and a/.
(2) deactivate(a), i.e. a/. becomes inactive_leaf
(3) remove(a/b)
When this happens, a/. collapses into `a` as an inactive_leaf. Due to the 
bug above, however, it is not moved to the end, so all the 
clients after `a` are left out of sort().

Luckily, this should never happen in practice, because only frameworks get 
deactivated, and frameworks don't have sub-clients.
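
The invariant and the failure mode can be reproduced with a toy children list. This is a sketch under simplifying assumptions, not the sorter's code: `visibleClients` and `setKind` are made-up helpers, and the traversal is reduced to "stop at the first inactive leaf".

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum class Kind { ACTIVE_LEAF, INACTIVE_LEAF, INTERNAL };

struct Node {
  std::string name;
  Kind kind;
};

// Mimics a traversal that relies on the invariant "inactive leaves are at
// the tail": it stops at the first inactive leaf it encounters.
std::vector<std::string> visibleClients(const std::vector<Node>& children) {
  std::vector<std::string> result;
  for (const Node& child : children) {
    if (child.kind == Kind::INACTIVE_LEAF) {
      break;
    }
    result.push_back(child.name);
  }
  return result;
}

// Changes the kind of children[index] and restores the invariant by moving
// the node to the tail if it became an inactive leaf.
void setKind(std::vector<Node>& children, std::size_t index, Kind kind) {
  assert(index < children.size());
  children[index].kind = kind;
  if (kind == Kind::INACTIVE_LEAF) {
    std::rotate(children.begin() + index,
                children.begin() + index + 1,
                children.end());
  }
}
```

Assigning `kind = Kind::INACTIVE_LEAF` directly on the first child hides every sibling after it from `visibleClients`, which is the symptom described above. Going through `setKind` keeps the siblings visible.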






[jira] [Commented] (MESOS-9599) Update `GET_QUOTA` to return both guarantees and limits.

2019-07-30 Thread Meng Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896740#comment-16896740
 ] 

Meng Zhu commented on MESOS-9599:
-

{noformat}
commit 817545318da364efdff7c9c3f888d0d7aa94da23
Author: Meng Zhu m...@mesosphere.io
Date:   Tue Jul 30 18:48:32 2019 -0700


Updated quota related endpoints to return quota configurations.

Added quota configuration information (that includes both
guarantees and limits) in V1 GET_QUOTA call and V0 GET "/quota".

To keep backwards compatibility, the infos field which only
includes the guarantees are continue to be filled. An additional
field configs was added.

Also extended an existing test to cover the changes in
the endpoints.

Review: https://reviews.apache.org/r/71159
{noformat}
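
As a rough picture of the backwards-compatible response, the sketch below fills both fields from the new configuration. The structs are plain stand-ins for the protobuf messages with quantities simplified to name/scalar pairs, and `makeQuotaStatus` is a hypothetical helper. Emitting an `infos` entry only when guarantees are set is an assumption of this example.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Simplified resource quantities: resource name -> scalar value.
using Quantities = std::map<std::string, double>;

struct QuotaInfo {  // Deprecated shape: guarantees only.
  std::string role;
  Quantities guarantee;
};

struct QuotaConfig {  // New shape: guarantees and limits.
  std::string role;
  Quantities guarantees;
  Quantities limits;
};

struct QuotaStatus {
  std::vector<QuotaInfo> infos;      // Still filled for old clients.
  std::vector<QuotaConfig> configs;  // New field.
};

QuotaStatus makeQuotaStatus(const std::vector<QuotaConfig>& configs) {
  QuotaStatus status;
  status.configs = configs;
  for (const QuotaConfig& config : configs) {
    // Assumption: only roles with guarantees get a legacy `infos` entry.
    if (!config.guarantees.empty()) {
      status.infos.push_back(QuotaInfo{config.role, config.guarantees});
    }
  }
  return status;
}
```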


> Update `GET_QUOTA` to return both guarantees and limits. 
> -
>
> Key: MESOS-9599
> URL: https://issues.apache.org/jira/browse/MESOS-9599
> Project: Mesos
>  Issue Type: Task
>Reporter: Meng Zhu
>Assignee: Meng Zhu
>Priority: Major
>  Labels: resource-management
>
> We should mark the existing `QuotaInfo` message as deprecated in favor of the 
> new `QuotaConfig`:
> {noformat}
> message GetQuota {
>   required quota.QuotaStatus status = 1;
> }
> message QuotaStatus {
>repeated QuotaInfo infos [deprecated = true];
>repeated QuotaConfig configs; 
> }
> message QuotaConfig {
> required  string role;
> map guarantees;
> map limits;
> }
> {noformat}
> We will continue to fill in the QuotaInfo though for backward compatibility. 
> See the design doc: [New 
> API|https://docs.google.com/document/d/13vG5uH4YVwM79ErBPYAZfnqYFOBbUy2Lym0_9iAQ5Uk/edit#heading=h.z2vfcyzabymz]





[jira] [Commented] (MESOS-9598) Update GET `/quota` to return both guarantees and limits.

2019-07-30 Thread Meng Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896738#comment-16896738
 ] 

Meng Zhu commented on MESOS-9598:
-

{noformat}
commit 817545318da364efdff7c9c3f888d0d7aa94da23
Author: Meng Zhu m...@mesosphere.io
Date:   Tue Jul 30 18:48:32 2019 -0700


Updated quota related endpoints to return quota configurations.

Added quota configuration information (that includes both
guarantees and limits) in V1 GET_QUOTA call and V0 GET "/quota".

To keep backwards compatibility, the infos field which only
includes the guarantees are continue to be filled. An additional
field configs was added.

Also extended an existing test to cover the changes in
the endpoints.

Review: https://reviews.apache.org/r/71159
{noformat}


> Update GET `/quota` to return both guarantees and limits.
> -
>
> Key: MESOS-9598
> URL: https://issues.apache.org/jira/browse/MESOS-9598
> Project: Mesos
>  Issue Type: Task
>Reporter: Meng Zhu
>Assignee: Meng Zhu
>Priority: Major
>  Labels: resource-management
>
> We should mark the existing `QuotaInfo` message as deprecated in favor of the 
> new `QuotaConfig`:
> {noformat}
> message QuotaStatus {
>repeated QuotaInfo infos [deprecated = true];
>repeated QuotaConfig configs; 
> }
> message QuotaConfig {
> required  string role;
> map guarantees;
> map limits;
> }
> {noformat}
> We will continue to fill in the QuotaInfo though for backward compatibility. 
> See the design doc: [New 
> API|https://docs.google.com/document/d/13vG5uH4YVwM79ErBPYAZfnqYFOBbUy2Lym0_9iAQ5Uk/edit#]
> Note, we only update this v0 endpoint for the GET method. There is no plan to 
> support configuring quota limits from this endpoint. V1 calls should be used.





[jira] [Assigned] (MESOS-9917) Store a role/framework tree in the allocator and deprecate the sorter interface.

2019-07-30 Thread Meng Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Meng Zhu reassigned MESOS-9917:
---

Assignee: Meng Zhu

> Store a role/framework tree in the allocator and deprecate the sorter 
> interface.
> 
>
> Key: MESOS-9917
> URL: https://issues.apache.org/jira/browse/MESOS-9917
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Meng Zhu
>Assignee: Meng Zhu
>Priority: Major
>  Labels: mesosphere, resource-management
>
> Currently, the client (role and framework) tree for the allocator is stored 
> in the sorter abstraction. This is not ideal. The role/framework tree is 
> generic information that is needed regardless of the sorter used. The current 
> sorter interface and its associated state are tech debt that contributes to 
> performance slowdown and convoluted code. 
> We should store a role/framework tree in the allocator. Each client node will 
> have a variant field that encapsulates information needed for each sorter 
> (e.g. for random sorter, it could be empty).





[jira] [Created] (MESOS-9917) Store a role/framework tree in the allocator and deprecate the sorter interface.

2019-07-30 Thread Meng Zhu (JIRA)
Meng Zhu created MESOS-9917:
---

 Summary: Store a role/framework tree in the allocator and 
deprecate the sorter interface.
 Key: MESOS-9917
 URL: https://issues.apache.org/jira/browse/MESOS-9917
 Project: Mesos
  Issue Type: Improvement
  Components: allocation
Reporter: Meng Zhu


Currently, the client (role and framework) tree for the allocator is stored in 
the sorter abstraction. This is not ideal. The role/framework tree is generic 
information that is needed regardless of the sorter used. The current sorter 
interface and its associated state are tech debt that contributes to 
performance slowdown and convoluted code. 

We should store a role/framework tree in the allocator. Each client node will 
have a variant field that encapsulates information needed for each sorter (e.g. 
for random sorter, it could be empty).
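
A minimal sketch of what such a node could look like. It uses C++17 `std::variant` purely for illustration, and `ClientNode`, `RandomSorterState`, `DRFSorterState` and `shareOf` are hypothetical names, not existing allocator types.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <variant>
#include <vector>

// The random sorter needs no per-node state.
struct RandomSorterState {};

// A DRF-style sorter might keep a share per node.
struct DRFSorterState {
  double share = 0.0;
};

// One role or framework in the tree owned by the allocator.
struct ClientNode {
  std::string name;
  std::variant<RandomSorterState, DRFSorterState> sorterState;
  std::vector<std::unique_ptr<ClientNode>> children;
};

// Returns the DRF share if the node carries DRF state, else `fallback`.
double shareOf(const ClientNode& node, double fallback) {
  if (const auto* drf = std::get_if<DRFSorterState>(&node.sorterState)) {
    return drf->share;
  }
  return fallback;
}
```

The tree shape and names live in one place, and each sorter only reads the alternative it understands.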





[jira] [Assigned] (MESOS-9600) Deprecate `SET_QUOTA` and `REMOVE_QUOTA` calls in favor of `UPDATE_QUOTA`.

2019-07-30 Thread Meng Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Meng Zhu reassigned MESOS-9600:
---

Assignee: Meng Zhu

> Deprecate `SET_QUOTA` and `REMOVE_QUOTA` calls in favor of `UPDATE_QUOTA`.
> --
>
> Key: MESOS-9600
> URL: https://issues.apache.org/jira/browse/MESOS-9600
> Project: Mesos
>  Issue Type: Task
>Reporter: Meng Zhu
>Assignee: Meng Zhu
>Priority: Major
>  Labels: resource-management
>
> Once the `UPDATE_QUOTA` call (MESOS-9596) is implemented and wired, we should 
> deprecate the existing calls `REMOVE_QUOTA` and `SET_QUOTA`. In the 
> user-facing documentation, we should hide the old API and showcase the new 
> one.





[jira] [Created] (MESOS-9913) Use built-in protobuf JSON mapping utilities in favor of reflection for (de)serialization.

2019-07-29 Thread Meng Zhu (JIRA)
Meng Zhu created MESOS-9913:
---

 Summary: Use built-in protobuf JSON mapping utilities in favor of 
reflection for (de)serialization. 
 Key: MESOS-9913
 URL: https://issues.apache.org/jira/browse/MESOS-9913
 Project: Mesos
  Issue Type: Improvement
  Components: json api
Reporter: Meng Zhu


Currently, we use protobuf reflection APIs to (de)serialize to/from JSON. This 
means a lot of custom code. There are places where we forgot to customize (e.g. 
for Map, MESOS-9901). Also, there is a performance regression in protobuf 
reflection if we upgrade our protobuf library to 3.7.x (see MESOS-9896 and 
related tickets).

Thus it would be beneficial to make use of the [built-in JSON utilities 
|https://github.com/protocolbuffers/protobuf/blob/master/src/google/protobuf/util/json_util.h]
 to do the mapping.





[jira] [Assigned] (MESOS-9598) Update GET `/quota` to return both guarantees and limits.

2019-07-24 Thread Meng Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Meng Zhu reassigned MESOS-9598:
---

Assignee: Meng Zhu






[jira] [Assigned] (MESOS-9599) Update `GET_QUOTA` to return both guarantees and limits.

2019-07-24 Thread Meng Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Meng Zhu reassigned MESOS-9599:
---

Assignee: Meng Zhu






[jira] [Commented] (MESOS-9599) Update `GET_QUOTA` to return both guarantees and limits.

2019-07-24 Thread Meng Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16892347#comment-16892347
 ] 

Meng Zhu commented on MESOS-9599:
-

{noformat}
commit ed06bc6b539eea115375640703eb0934328daca6
Author: Meng Zhu m...@mesosphere.io
Date:   Tue May 21 16:07:41 2019 +0200


Added `repeated QuotaConfig` to `QuotaStatus`.

Also marked the `infos` field as deprecated.

`QuotaStatus` is returned by `GET_QUOTA` and `GET /quota`.
As we introduce quota limits, a new mesage `QuotaConfig`
is introduced to describe the quota configuration. For
backwards compatibility, we will fill in both fields
until `QuotaInfo` is removed (in Mesos 2.0).

Review: https://reviews.apache.org/r/70690
{noformat}


> Update `GET_QUOTA` to return both guarantees and limits. 
> -
>
> Key: MESOS-9599
> URL: https://issues.apache.org/jira/browse/MESOS-9599
> Project: Mesos
>  Issue Type: Task
>Reporter: Meng Zhu
>Priority: Major
>  Labels: resource-management
>
> We should mark the existing `QuotaInfo` message as deprecated in favor of the 
> new `QuotaConfig`:
> {noformat}
> message GetQuota {
>   required quota.QuotaStatus status = 1;
> }
> message QuotaStatus {
>   repeated QuotaInfo infos [deprecated = true];
>   repeated QuotaConfig configs;
> }
> message QuotaConfig {
>   required string role;
>   map<string, Value.Scalar> guarantees;
>   map<string, Value.Scalar> limits;
> }
> {noformat}
> We will continue to fill in the QuotaInfo though for backward compatibility. 
> See the design doc: [New 
> API|https://docs.google.com/document/d/13vG5uH4YVwM79ErBPYAZfnqYFOBbUy2Lym0_9iAQ5Uk/edit#heading=h.z2vfcyzabymz]





[jira] [Created] (MESOS-9903) ContentType/AgentAPITest.MarkResourceProviderGone

2019-07-23 Thread Meng Zhu (JIRA)
Meng Zhu created MESOS-9903:
---

 Summary: ContentType/AgentAPITest.MarkResourceProviderGone
 Key: MESOS-9903
 URL: https://issues.apache.org/jira/browse/MESOS-9903
 Project: Mesos
  Issue Type: Bug
  Components: test
Reporter: Meng Zhu
 Attachments: badrun_log.txt

Observed flaky in our CI, centos-6-SSL. Log attached.
Crash trace:

{noformat}
I0724 00:38:07.728926  3249 http_connection.hpp:283] Connected with the remote 
endpoint at http://172.16.10.60:38795/slave()/api/v1/resource_provider
*** Aborted at 1563928687 (unix time) try "date -d @1563928687" if you are 
using GNU date ***
I0724 00:38:07.730021 27831 slave.cpp:924] Agent terminating
I0724 00:38:07.731081  3250 master.cpp:1295] Agent 
8324a471-1cb7-4778-959a-560b074686b8-S0 at slave()@172.16.10.60:38795 
(ip-172-16-10-60.ec2.internal) disconnected
I0724 00:38:07.731101  3250 master.cpp:3397] Disconnecting agent 
8324a471-1cb7-4778-959a-560b074686b8-S0 at slave()@172.16.10.60:38795 
(ip-172-16-10-60.ec2.internal)
I0724 00:38:07.731140  3250 master.cpp:3416] Deactivating agent 
8324a471-1cb7-4778-959a-560b074686b8-S0 at slave()@172.16.10.60:38795 
(ip-172-16-10-60.ec2.internal)
I0724 00:38:07.731204  3247 hierarchical.cpp:799] Agent 
8324a471-1cb7-4778-959a-560b074686b8-S0 deactivated
PC: @ 0x7f7a21bf59fc process::UPID::UPID()
*** SIGSEGV (@0x557acd6ed7a1) received by PID 27831 (TID 0x7f7a14040700) from 
PID 18446744072861177761; stack trace: ***
@ 0x7f79eb0dcde7 (unknown)
@ 0x7f79eb0e4385 JVM_handle_linux_signal
@ 0x7f79eb0d9583 (unknown)
@ 0x7f7a1e2257e0 (unknown)
@ 0x7f7a21bf59fc process::UPID::UPID()
@ 0x7f7a209e6cbb mesos::v1::resource_provider::Driver::send()
@ 0x5579c9704027 
mesos::internal::tests::resource_provider::MockResourceProvider<>::connectedDefault()
@ 0x5579c9604b2a 
testing::internal::FunctionMockerBase<>::UntypedPerformDefaultAction()
@ 0x5579cad9fe83 
testing::internal::UntypedFunctionMockerBase::UntypedInvokeWith()
@ 0x5579c9635714 
mesos::internal::tests::resource_provider::MockResourceProvider<>::connected()
@ 0x7f7a206a9273 process::AsyncExecutorProcess::execute<>()
@ 0x7f7a206b6b3b 
_ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchI7NothingNS1_20AsyncExecutorProcessERKSt8functionIFvvEESG_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSL_FSI_T1_EOT2_EUlSt10unique_ptrINS1_7PromiseISA_EESt14default_deleteISW_EEOSE_S3_E_JSZ_SE_St12_PlaceholderILi1EEclEOS3_
@ 0x7f7a21c10ea1 process::ProcessBase::consume()
@ 0x7f7a21c25677 process::ProcessManager::resume()
@ 0x7f7a21c2aae6 
_ZNSt6thread11_State_implISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
@ 0x7f7a21ee0c7f execute_native_thread_routine
@ 0x7f7a1e21daa1 start_thread
@ 0x7f7a1d1ddc4d clone
{noformat}






[jira] [Assigned] (MESOS-9901) Specialize jsonify for protobuf Maps.

2019-07-23 Thread Meng Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Meng Zhu reassigned MESOS-9901:
---

Assignee: Meng Zhu

> Specialize jsonify for protobuf Maps.
> -
>
> Key: MESOS-9901
> URL: https://issues.apache.org/jira/browse/MESOS-9901
> Project: Mesos
>  Issue Type: Improvement
>  Components: json api
>Reporter: Meng Zhu
>Assignee: Meng Zhu
>Priority: Major
>
> Jsonify currently treats protobuf maps as regular repeated fields. For
> example, for the schema
> {noformat}
> message QuotaConfig {
>   required string role = 1;
>   map<string, Value.Scalar> guarantees = 2;
>   map<string, Value.Scalar> limits = 3;
> }
> {noformat}
> it will produce:
> {noformat}
>   "configs": [
>     {
>       "role": "role1",
>       "guarantees": [
>         {
>           "key": "cpus",
>           "value": {
>             "value": 1
>           }
>         },
>         {
>           "key": "mem",
>           "value": {
>             "value": 512
>           }
>         }
>       ]
> {noformat}
> This output cannot be parsed back into proto messages. We need to specialize
> jsonify for the Map type.





[jira] [Commented] (MESOS-9668) Add authorization support for the new `GET_QUOTA` call.

2019-07-23 Thread Meng Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16891427#comment-16891427
 ] 

Meng Zhu commented on MESOS-9668:
-

{noformat}
commit 756e212ee91f9b65fb5f90d627b41c9b8c22a319 (HEAD -> master, origin/master, 
apache/master)
Author: Meng Zhu 
Date:   Mon Jul 22 14:36:47 2019 -0700

Removed `quota_info` in the `GET_QUOTA` authorization object.

Currently, the `GET_QUOTA` authorizable action sets both the `value`
and `quota_info` fields. The `value` field is set for
backward compatibility with the `GET_QUOTA_WITH_ROLE` action.

This patch makes the `GET_QUOTA` action only set the `value`
field with the role name. Since the `quota.QuotaInfo` field
is being deprecated, it is no longer set (the local authorizer
only looks at the `value` field, and this is probably also the
case for any external authorizer modules).

Also refactored `QuotaHandler::status`.

Review: https://reviews.apache.org/r/71139
{noformat}


> Add authorization support for the new `GET_QUOTA` call.
> ---
>
> Key: MESOS-9668
> URL: https://issues.apache.org/jira/browse/MESOS-9668
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Meng Zhu
>Assignee: Meng Zhu
>Priority: Major
>  Labels: mesosphere, resource-management
>
> The new `GET_QUOTA` call will return QUOTA_CONFIGS:
> // Used in GET_QUOTA and returned by GET /quota
> //
> // Overall cluster quota status, including all roles, their quota
> // configurations and current state (e.g. consumed and effective limits)
> message QuotaStatus {
>   repeated QuotaInfo infos [deprecated = true];
>   repeated QuotaConfig configs;
> }
> Currently, the GET_QUOTA authorizable action sets both the value
> and quota_info fields. The value field is set for
> backward compatibility with the GET_QUOTA_WITH_ROLE action.
> We should make the GET_QUOTA action only set the value
> field with the role name. Since the quota.QuotaInfo field
> is being deprecated, it should not be set (the local authorizer
> only looks at the value field, and this is probably also the
> case for any external authorizer modules).





[jira] [Commented] (MESOS-8968) Wire `UPDATE_QUOTA` call.

2019-07-23 Thread Meng Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16891417#comment-16891417
 ] 

Meng Zhu commented on MESOS-8968:
-

{noformat}
commit 7aa2a96fea8a44f673a95b425bae71c946c09f2c (HEAD -> update_quota_working, 
apache/master)
Author: Meng Zhu 
Date:   Thu Jul 18 11:32:49 2019 -0700

Added a test to ensure `UPDATE_QUOTA` is applied all-or-nothing.

Review: https://reviews.apache.org/r/71119
{noformat}


> Wire `UPDATE_QUOTA` call.
> -
>
> Key: MESOS-8968
> URL: https://issues.apache.org/jira/browse/MESOS-8968
> Project: Mesos
>  Issue Type: Bug
>Reporter: Meng Zhu
>Assignee: Meng Zhu
>Priority: Major
>  Labels: Quota, allocator, multitenancy
>
> Wire the existing master, auth, registrar, and allocator pieces together to
> complete the `UPDATE_QUOTA` call.
> This would enable the master capability `QUOTA_V2`.
> This also fixes the "ignoring zero resource quota" bug in the old quota
> implementation, namely:
> Currently, Mesos discards resource objects with a zero scalar value when
> parsing resources. This means a quota set to zero is ignored and not enforced.
> For example, a role with its quota set to "cpu:10;mem:10;gpu:0" is intended to
> get no GPUs. Due to the above issue, the allocator sees the quota only as
> "cpu:10;mem:10", and no quota for GPU means no guarantee and NO limit. Thus
> GPUs may still be allocated to this role.
> With the completion of `UPDATE_QUOTA`, which takes a map from resource name to
> scalar value, zero values will no longer be dropped.
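
The zero-quota problem described in the ticket can be illustrated with a
small sketch. Both parser functions below are hypothetical stand-ins and not
Mesos code: one mimics the old behavior of dropping zero-valued resources, the
other keeps every entry in a name-to-scalar map, as `UPDATE_QUOTA` is described
to do. The convention that "no entry means no limit" is taken from the ticket
text.

```python
# Hypothetical sketch; these helpers are illustrative and not part of Mesos.

def parse_dropping_zeros(text):
    """Mimic the old behavior: zero-valued resources are discarded."""
    parsed = {}
    for item in text.split(";"):
        name, value = item.split(":")
        if float(value) != 0:
            parsed[name] = float(value)
    return parsed


def parse_keeping_zeros(text):
    """Mimic a name -> scalar map: zero values stay as explicit entries."""
    pairs = (item.split(":") for item in text.split(";"))
    return {name: float(value) for name, value in pairs}


def limit_for(quota, resource):
    """Return the limit for a resource, or None when no limit is set."""
    return quota.get(resource)


old = parse_dropping_zeros("cpu:10;mem:10;gpu:0")
new = parse_keeping_zeros("cpu:10;mem:10;gpu:0")

print(limit_for(old, "gpu"))  # no entry survives, so no limit is enforced
print(limit_for(new, "gpu"))  # an explicit zero limit is preserved
```

With the map representation, "no GPUs" and "no opinion about GPUs" become
distinguishable states, which is the distinction the old parsing lost.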





[jira] [Commented] (MESOS-9901) Specialize jsonify for protobuf Maps.

2019-07-23 Thread Meng Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16891299#comment-16891299
 ] 

Meng Zhu commented on MESOS-9901:
-

[~bbannier] Thanks for pointing to the test. But, despite the name, the test 
does not really use jsonify:
https://github.com/apache/mesos/blob/ff8c9a96be6ae1ee47faf9d5b80a518dfb4a3db0/3rdparty/stout/tests/protobuf_tests.cpp#L838-L839

> Specialize jsonify for protobuf Maps.
> -
>
> Key: MESOS-9901
> URL: https://issues.apache.org/jira/browse/MESOS-9901
> Project: Mesos
>  Issue Type: Improvement
>  Components: json api
>Reporter: Meng Zhu
>Priority: Major
>
> Jsonify currently treats protobuf maps as regular repeated fields. For
> example, for the schema
> {noformat}
> message QuotaConfig {
>   required string role = 1;
>   map<string, Value.Scalar> guarantees = 2;
>   map<string, Value.Scalar> limits = 3;
> }
> {noformat}
> it will produce:
> {noformat}
>   "configs": [
>     {
>       "role": "role1",
>       "guarantees": [
>         {
>           "key": "cpus",
>           "value": {
>             "value": 1
>           }
>         },
>         {
>           "key": "mem",
>           "value": {
>             "value": 512
>           }
>         }
>       ]
> {noformat}
> This output cannot be parsed back into proto messages. We need to specialize
> jsonify for the Map type.





[jira] [Created] (MESOS-9901) Specialize jsonify for protobuf Maps.

2019-07-22 Thread Meng Zhu (JIRA)
Meng Zhu created MESOS-9901:
---

 Summary: Specialize jsonify for protobuf Maps.
 Key: MESOS-9901
 URL: https://issues.apache.org/jira/browse/MESOS-9901
 Project: Mesos
  Issue Type: Improvement
  Components: json api
Reporter: Meng Zhu


Jsonify currently treats protobuf maps as regular repeated fields. For
example, for the schema

{noformat}
message QuotaConfig {
  required string role = 1;

  map<string, Value.Scalar> guarantees = 2;
  map<string, Value.Scalar> limits = 3;
}
{noformat}

it will produce:

{noformat}
  "configs": [
    {
      "role": "role1",
      "guarantees": [
        {
          "key": "cpus",
          "value": {
            "value": 1
          }
        },
        {
          "key": "mem",
          "value": {
            "value": 512
          }
        }
      ]
{noformat}

This output cannot be parsed back into proto messages. We need to specialize
jsonify for the Map type.
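
For illustration, the intended shape can be sketched as follows. The protobuf
JSON mapping represents a map field as a JSON object keyed by the map keys, so
the repeated key/value entries need to be collapsed into one object. The helper
`entries_to_object` is a hypothetical name for this sketch and is not the
jsonify implementation.

```python
import json

# Hypothetical helper, not the jsonify implementation: collapse the
# repeated key/value entries into a single JSON object.
def entries_to_object(entries):
    return {entry["key"]: entry["value"] for entry in entries}


# The "guarantees" array as jsonify currently emits it (from the ticket).
current_output = [
    {"key": "cpus", "value": {"value": 1}},
    {"key": "mem", "value": {"value": 512}},
]

guarantees = entries_to_object(current_output)

# Print the config in the object form that a proto JSON parser expects.
print(json.dumps({"role": "role1", "guarantees": guarantees}, indent=2))
```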





[jira] [Assigned] (MESOS-8968) Wire `UPDATE_QUOTA` call.

2019-07-10 Thread Meng Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-8968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Meng Zhu reassigned MESOS-8968:
---

Assignee: Meng Zhu

> Wire `UPDATE_QUOTA` call.
> -
>
> Key: MESOS-8968
> URL: https://issues.apache.org/jira/browse/MESOS-8968
> Project: Mesos
>  Issue Type: Bug
>Reporter: Meng Zhu
>Assignee: Meng Zhu
>Priority: Major
>  Labels: Quota, allocator, multitenancy
>
> Wire the existing master, auth, registrar, and allocator pieces together to
> complete the `UPDATE_QUOTA` call.
> This would enable the master capability `QUOTA_V2`.
> This also fixes the "ignoring zero resource quota" bug in the old quota
> implementation, namely:
> Currently, Mesos discards resource objects with a zero scalar value when
> parsing resources. This means a quota set to zero is ignored and not enforced.
> For example, a role with its quota set to "cpu:10;mem:10;gpu:0" is intended to
> get no GPUs. Due to the above issue, the allocator sees the quota only as
> "cpu:10;mem:10", and no quota for GPU means no guarantee and NO limit. Thus
> GPUs may still be allocated to this role.
> With the completion of `UPDATE_QUOTA`, which takes a map from resource name to
> scalar value, zero values will no longer be dropped.





[jira] [Comment Edited] (MESOS-8968) Wire `UPDATE_QUOTA` call.

2019-07-10 Thread Meng Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882508#comment-16882508
 ] 

Meng Zhu edited comment on MESOS-8968 at 7/10/19 11:54 PM:
---

{noformat}
commit 0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8 (apache/master)
Author: Meng Zhu 
Date:   Fri Jul 5 18:05:59 2019 -0700

Implemented `UPDATE_QUOTA` operator call.

This patch wires up the master, auth, registrar and allocator
pieces for the `UPDATE_QUOTA` call.

This enables the master capability `QUOTA_V2`. The capability
implies the quota v2 API is capable of writes (`UPDATE_QUOTA`)
and the master is capable of recovering from V2 quota
(`QuotaConfig`) in registry.

This patch lacks the rescind offer logic. When quota limits
and guarantees are configured, it might be necessary to
rescind offers on the fly to satisfy new guarantees or be
constrained by the new limits. A todo is left and will be
tackled in subsequent patches.

Also enabled test `MasterQuotaTest.RecoverQuotaEmptyCluster`.

Review: https://reviews.apache.org/r/71021
{noformat}

{noformat}
commit dcd73437549413790751d1ff127989dbb29bd753 (HEAD -> update_quota, 
apache/master)
Author: Meng Zhu 
Date:   Sun Jul 7 14:27:14 2019 -0700

Added tests for `UPDATE_QUOTA`.

These tests reuse the existing tests for `SET_QUOTA` and
`REMOVE_QUOTA` calls. In general, an `UPDATE_QUOTA` request
should fail where `SET_QUOTA` fails. When the existing
test expects the `SET_QUOTA` call to succeed, we test the
`UPDATE_QUOTA` call by first removing the set quota and then
sending the `UPDATE_QUOTA` request.

Review: https://reviews.apache.org/r/71022
{noformat}


was (Author: mzhu):
{noformat}
commit 0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8 (apache/master)
Author: Meng Zhu 
Date:   Fri Jul 5 18:05:59 2019 -0700

Implemented `UPDATE_QUOTA` operator call.

This patch wires up the master, auth, registrar and allocator
pieces for the `UPDATE_QUOTA` call.

This enables the master capability `QUOTA_V2`. The capability
implies the quota v2 API is capable of writes (`UPDATE_QUOTA`)
and the master is capable of recovering from V2 quota
(`QuotaConfig`) in registry.

This patch lacks the rescind offer logic. When quota limits
and guarantees are configured, it might be necessary to
rescind offers on the fly to satisfy new guarantees or be
constrained by the new limits. A todo is left and will be
tackled in subsequent patches.

Also enabled test `MasterQuotaTest.RecoverQuotaEmptyCluster`.

Review: https://reviews.apache.org/r/71021
{noformat}


> Wire `UPDATE_QUOTA` call.
> -
>
> Key: MESOS-8968
> URL: https://issues.apache.org/jira/browse/MESOS-8968
> Project: Mesos
>  Issue Type: Bug
>Reporter: Meng Zhu
>Priority: Major
>  Labels: Quota, allocator, multitenancy
>
> Wire the existing master, auth, registrar, and allocator pieces together to
> complete the `UPDATE_QUOTA` call.
> This would enable the master capability `QUOTA_V2`.
> This also fixes the "ignoring zero resource quota" bug in the old quota
> implementation, namely:
> Currently, Mesos discards resource objects with a zero scalar value when
> parsing resources. This means a quota set to zero is ignored and not enforced.
> For example, a role with its quota set to "cpu:10;mem:10;gpu:0" is intended to
> get no GPUs. Due to the above issue, the allocator sees the quota only as
> "cpu:10;mem:10", and no quota for GPU means no guarantee and NO limit. Thus
> GPUs may still be allocated to this role.
> With the completion of `UPDATE_QUOTA`, which takes a map from resource name to
> scalar value, zero values will no longer be dropped.





[jira] [Commented] (MESOS-9812) Add achievability validation for update quota call.

2019-07-10 Thread Meng Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882512#comment-16882512
 ] 

Meng Zhu commented on MESOS-9812:
-

This covers the guarantee overcommitment check and the hierarchical guarantees check.

{noformat}
commit 16f0b0c295960e397e56f6d504b8075cb62e6e4f
Author: Meng Zhu 
Date:   Fri Jul 5 15:41:01 2019 -0700

Added overcommit and hierarchical inclusion check for `UPDATE_QUOTA`.

The overcommit check validates that the total quota guarantees in
the cluster are contained by the cluster capacity.

The hierarchical inclusion check validates that the sum of
children's  guarantees is contained by the parent guarantee.

Further validation is needed for:

- Check a role's limit is less than its current consumption.
- Check a role's limit is less than its parent's limit.

Review: https://reviews.apache.org/r/71020
{noformat}

Leaving the ticket open for now for:
limits < consumption, hierarchical limits invariant.
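
A minimal sketch of the two checks from the commit message follows. It is not
the Mesos implementation: resource quantities are modeled as plain
name-to-amount dictionaries, hierarchical roles are assumed to be named with
`/` separators (e.g. `eng/web`), the cluster-wide total is assumed to count
only top-level roles, and every function name is hypothetical.

```python
# Hypothetical sketch of the two checks; names and data layout are assumptions.

def contains(outer, inner):
    """True if `outer` has at least as much of every resource as `inner`."""
    return all(outer.get(name, 0.0) >= amount for name, amount in inner.items())


def total(quantities):
    """Sum an iterable of name -> amount maps, resource by resource."""
    result = {}
    for quantity in quantities:
        for name, amount in quantity.items():
            result[name] = result.get(name, 0.0) + amount
    return result


def overcommit_ok(guarantees_by_role, capacity):
    """Overcommit check: top-level guarantees must fit in cluster capacity."""
    top_level = [g for role, g in guarantees_by_role.items() if "/" not in role]
    return contains(capacity, total(top_level))


def hierarchy_ok(guarantees_by_role):
    """Inclusion check: children's summed guarantees must fit in the parent's."""
    for parent, parent_guarantees in guarantees_by_role.items():
        children = [g for role, g in guarantees_by_role.items()
                    if "/" in role and role.rsplit("/", 1)[0] == parent]
        if not contains(parent_guarantees, total(children)):
            return False
    return True


guarantees = {
    "eng": {"cpus": 8, "mem": 1024},
    "eng/web": {"cpus": 4},
    "eng/db": {"cpus": 4, "mem": 512},
    "ops": {"cpus": 2},
}
capacity = {"cpus": 10, "mem": 2048}

print(overcommit_ok(guarantees, capacity))
print(hierarchy_ok(guarantees))
```

The sketch only checks parents that have their own entry; a child whose parent
has no configured guarantee is not flagged here.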

> Add achievability validation for update quota call.
> ---
>
> Key: MESOS-9812
> URL: https://issues.apache.org/jira/browse/MESOS-9812
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Meng Zhu
>Assignee: Meng Zhu
>Priority: Major
>  Labels: resource-management
>
> Add an overcommit check, hierarchical quota validation, and a force flag
> override for the update quota call.
> Right now, we only validate each quota config individually. We need to add
> further validation for the update quota call:
> 1. Check whether the role's resource limits are already breached. To achieve
> this, we first need to rescind offers until the role's allocated resources are
> below the limits. If, after all rescinds, the allocated resources are still
> above the requested limits, we will return an error unless the `force` flag is
> used.
> 2. Check whether the aggregated quota guarantees of all roles exceed the
> cluster capacity. If so, we will return an error unless the `force` flag is used.
> 3. Hierarchical quota validation:
>   a. Check that a role's limit does not exceed its parent's limit.
>   b. Check that the sum of the children's guarantees does not exceed the
> parent's guarantees.





[jira] [Commented] (MESOS-8968) Wire `UPDATE_QUOTA` call.

2019-07-10 Thread Meng Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882508#comment-16882508
 ] 

Meng Zhu commented on MESOS-8968:
-

{noformat}
commit 0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8 (apache/master)
Author: Meng Zhu 
Date:   Fri Jul 5 18:05:59 2019 -0700

Implemented `UPDATE_QUOTA` operator call.

This patch wires up the master, auth, registrar and allocator
pieces for the `UPDATE_QUOTA` call.

This enables the master capability `QUOTA_V2`. The capability
implies the quota v2 API is capable of writes (`UPDATE_QUOTA`)
and the master is capable of recovering from V2 quota
(`QuotaConfig`) in registry.

This patch lacks the rescind offer logic. When quota limits
and guarantees are configured, it might be necessary to
rescind offers on the fly to satisfy new guarantees or be
constrained by the new limits. A todo is left and will be
tackled in subsequent patches.

Also enabled test `MasterQuotaTest.RecoverQuotaEmptyCluster`.

Review: https://reviews.apache.org/r/71021
{noformat}


> Wire `UPDATE_QUOTA` call.
> -
>
> Key: MESOS-8968
> URL: https://issues.apache.org/jira/browse/MESOS-8968
> Project: Mesos
>  Issue Type: Bug
>Reporter: Meng Zhu
>Priority: Major
>  Labels: Quota, allocator, multitenancy
>
> Wire the existing master, auth, registrar, and allocator pieces together to
> complete the `UPDATE_QUOTA` call.
> This would enable the master capability `QUOTA_V2`.
> This also fixes the "ignoring zero resource quota" bug in the old quota
> implementation, namely:
> Currently, Mesos discards resource objects with a zero scalar value when
> parsing resources. This means a quota set to zero is ignored and not enforced.
> For example, a role with its quota set to "cpu:10;mem:10;gpu:0" is intended to
> get no GPUs. Due to the above issue, the allocator sees the quota only as
> "cpu:10;mem:10", and no quota for GPU means no guarantee and NO limit. Thus
> GPUs may still be allocated to this role.
> With the completion of `UPDATE_QUOTA`, which takes a map from resource name to
> scalar value, zero values will no longer be dropped.





[jira] [Commented] (MESOS-8968) Wire `UPDATE_QUOTA` call.

2019-07-10 Thread Meng Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882509#comment-16882509
 ] 

Meng Zhu commented on MESOS-8968:
-

Leave it open for now, until more tests are landed.

> Wire `UPDATE_QUOTA` call.
> -
>
> Key: MESOS-8968
> URL: https://issues.apache.org/jira/browse/MESOS-8968
> Project: Mesos
>  Issue Type: Bug
>Reporter: Meng Zhu
>Priority: Major
>  Labels: Quota, allocator, multitenancy
>
> Wire the existing master, auth, registrar, and allocator pieces together to
> complete the `UPDATE_QUOTA` call.
> This would enable the master capability `QUOTA_V2`.
> This also fixes the "ignoring zero resource quota" bug in the old quota
> implementation, namely:
> Currently, Mesos discards resource objects with a zero scalar value when
> parsing resources. This means a quota set to zero is ignored and not enforced.
> For example, a role with its quota set to "cpu:10;mem:10;gpu:0" is intended to
> get no GPUs. Due to the above issue, the allocator sees the quota only as
> "cpu:10;mem:10", and no quota for GPU means no guarantee and NO limit. Thus
> GPUs may still be allocated to this role.
> With the completion of `UPDATE_QUOTA`, which takes a map from resource name to
> scalar value, zero values will no longer be dropped.





[jira] [Created] (MESOS-9882) Mesos.UpdateFrameworkV0Test.SuppressedRoles is flaky.

2019-07-03 Thread Meng Zhu (JIRA)
Meng Zhu created MESOS-9882:
---

 Summary: Mesos.UpdateFrameworkV0Test.SuppressedRoles is flaky.
 Key: MESOS-9882
 URL: https://issues.apache.org/jira/browse/MESOS-9882
 Project: Mesos
  Issue Type: Bug
  Components: flaky
Reporter: Meng Zhu
 Attachments: UpdateFrameworkV0Test.SuppressedRoles_badrun.txt

Observed in CI, log attached.

{noformat}
mesos-ec2-ubuntu-14.04-SSL.Mesos.UpdateFrameworkV0Test.SuppressedRoles (from 
UpdateFrameworkV0Test)


Error Message
../../src/tests/master/update_framework_tests.cpp:1117
Mock function called more times than expected - returning directly.
Function call: agentAdded(@0x7fb254001c40 32-byte object <90-7A 6C-85 B2-7F 
00-00 00-00 00-00 00-00 00-00 01-00 00-00 00-00 00-00 F0-85 00-54 B2-7F 00-00>)
 Expected: to be called once
   Actual: called twice - over-saturated and active
Stacktrace
../../src/tests/master/update_framework_tests.cpp:1117
Mock function called more times than expected - returning directly.
Function call: agentAdded(@0x7fb254001c40 32-byte object <90-7A 6C-85 B2-7F 
00-00 00-00 00-00 00-00 00-00 01-00 00-00 00-00 00-00 F0-85 00-54 B2-7F 00-00>)
 Expected: to be called once
   Actual: called twice - over-saturated and active
{noformat}






[jira] [Assigned] (MESOS-9812) Add achievability validation for update quota call.

2019-07-02 Thread Meng Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Meng Zhu reassigned MESOS-9812:
---

Assignee: Meng Zhu

> Add achievability validation for update quota call.
> ---
>
> Key: MESOS-9812
> URL: https://issues.apache.org/jira/browse/MESOS-9812
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Meng Zhu
>Assignee: Meng Zhu
>Priority: Major
>  Labels: resource-management
>
> Add an overcommit check and a force flag override for the update quota call.
> Right now, we only validate each quota config individually. We need to add
> further validation for the update quota call:
> 1. Check whether the role's resource limits are already breached. To achieve
> this, we first need to rescind offers until the role's allocated resources are
> below the limits. If, after all rescinds, the allocated resources are still
> above the requested limits, we will return an error unless the `force` flag is
> used.
> 2. Check whether the aggregated quota guarantees of all roles exceed the
> cluster capacity. If so, we will return an error unless the `force` flag is used.
> 3. Hierarchical quota validity (we could probably punt on this given that we
> only support flat role quota at the moment).





[jira] [Commented] (MESOS-9601) Persist `QuotaConfig`s in the registry.

2019-07-01 Thread Meng Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16876515#comment-16876515
 ] 

Meng Zhu commented on MESOS-9601:
-

{noformat}
commit 3720e4cf5f7cb0d8e98afacea39528bd41c767b4
Author: Meng Zhu 
Date:   Fri Jun 28 14:16:00 2019 -0700

Updated registry operation `UpdateQuota` to persist `QuotaConfig`.

The new operations will mutate the `quota_configs` field in
the registry to persist `QuotaConfigs` configured by the new
`UPDATE_QUOTA` call as well as the legacy `SET_QUOTA` and
`REMOVE_QUOTA` calls.

The operation removes any entries in the legacy `quotas` field
with the same role name. In addition, it also adds/removes the
minimum capability `QUOTA_V2` accordingly: if `quota_configs`
is empty the capability will be removed otherwise it will
be added.

This operation replaces the `REMOVE_QUOTA` operation.

Also fixed/disabled affected tests.

Review: https://reviews.apache.org/r/70951

commit c82847ad1b8d3760d34ee1e8869c2b7286ccfaa1
Author: Meng Zhu 
Date:   Fri Jun 28 14:15:02 2019 -0700

Added helpers to add and remove master minimum capabilities.

Also added a TODO about refactoring the helpers.

Review: https://reviews.apache.org/r/70972

commit f37250f53e75e0442aed2f61bbedbc9b068821d5
Author: Meng Zhu 
Date:   Tue Jun 25 18:07:29 2019 -0700

Added a registry field for `QuotaConfig`.

A new field called `quota_configs` is added to persist the
quota configurations of the cluster. This replaces the old
`quotas` field which is deprecated and will be removed
in Mesos 2.0.

When users upgrade to Mesos 1.9, `quotas` will be preserved
and recovered and `quota_configs` will be empty. As users
configure new quotas, whether through the new `UPDATE_QUOTA`
call or the deprecated `SET_QUOTA` call, the configured quotas
will be persisted into the `quota_configs` field along with the
`QUOTA_V2` minimum capability. The capability is removed only
if `quota_configs` becomes empty again. If a role already has an
entry in the old `quotas` field, it will be removed from `quotas`.
In other words, once upgraded, `quotas` will still be preserved
and honored, but it will never grow. Instead it will gradually
shrink as the roles' quotas get updated or removed.

Review: https://reviews.apache.org/r/70950

commit 0bc857d672189605f83acb7ef57bce89b141ba72
Author: Meng Zhu 
Date:   Tue Jun 25 15:19:44 2019 -0700

Added master minimum capability `QUOTA_V2`.

This adds a new enum for the revamped quota feature
in the master. When quota is configured in Mesos 1.9
or higher, the quota configurations will be persisted
into the `quota_configs` field in the registry. And
the `QUOTA_V2` minimum capability will be added to the
registry as well. This will prevent any master downgrades
until `quota_configs` becomes empty. This can be done by
setting the quota of the roles listed in `quota_configs`
back to the default (no guarantees and no limits).

Note that at the time of this patch, the master is not yet
capable of handling the new quota API, so the capability
is not added to `MASTER_CAPABILITIES`. That should be done
later, together with the patches that enable the master to
handle the new quota calls.

Review: https://reviews.apache.org/r/70949
{noformat}
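
The registry bookkeeping described in these commits can be sketched roughly as
follows. This is an illustration only: the registry is modeled as a plain
dictionary, and `apply_update_quota` is a hypothetical name, not the actual
registry operation. The sketch assumes that updating a role back to the default
(no guarantees and no limits) removes its `quota_configs` entry, as the last
commit message suggests.

```python
# Hypothetical model of the registry; field names follow the commit messages.
registry = {
    "quotas": {"dev": {"cpus": 4}},   # legacy field, never grows after upgrade
    "quota_configs": {},              # new field
    "minimum_capabilities": set(),
}


def apply_update_quota(registry, role, guarantees, limits):
    """Persist a quota config and keep `QUOTA_V2` in sync (illustrative)."""
    # Any legacy entry for the same role is dropped.
    registry["quotas"].pop(role, None)

    if guarantees or limits:
        registry["quota_configs"][role] = {
            "guarantees": dict(guarantees),
            "limits": dict(limits),
        }
    else:
        # Default quota (no guarantees, no limits): nothing to persist.
        registry["quota_configs"].pop(role, None)

    # The minimum capability is present exactly while `quota_configs` is non-empty.
    if registry["quota_configs"]:
        registry["minimum_capabilities"].add("QUOTA_V2")
    else:
        registry["minimum_capabilities"].discard("QUOTA_V2")


apply_update_quota(registry, "dev", {"cpus": 4}, {"cpus": 8})
after_update = {
    "legacy_roles": sorted(registry["quotas"]),
    "config_roles": sorted(registry["quota_configs"]),
    "capabilities": sorted(registry["minimum_capabilities"]),
}

apply_update_quota(registry, "dev", {}, {})
after_reset = {
    "config_roles": sorted(registry["quota_configs"]),
    "capabilities": sorted(registry["minimum_capabilities"]),
}

print(after_update)
print(after_reset)
```

Tying the capability to the emptiness of `quota_configs` is what allows a
downgrade again once every role has been reset to the default quota.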

> Persist `QuotaConfig`s in the registry.
> ---
>
> Key: MESOS-9601
> URL: https://issues.apache.org/jira/browse/MESOS-9601
> Project: Mesos
>  Issue Type: Task
>Reporter: Meng Zhu
>Assignee: Meng Zhu
>Priority: Major
>  Labels: resource-management
>
> We need to persist the new `QuotaConfig` in the registry.
> One thing to note is that old masters only support quota guarantees, which also
> serve as limits implicitly. Once new masters start to support both 
> guarantees and limits, there is no safe downgrade path without altering the 
> cluster behavior (if the new quota semantics are used). Thus, we need to 
> ensure that alerts are given if such downgrades are attempted.
> To this end, if the quota is configured after this change, a new minimum 
> capability `QUOTA_V2` will be persisted to the registry along with the new 
> `QuotaConfig` message. Thanks to the minimum capability check, old masters 
> (that do not possess the `QUOTA_V2` capability) will refuse to start in this 
> case and we will print out suggestions to the operator.





[jira] [Created] (MESOS-9866) Removes the `quotas` field in the registry.

2019-06-26 Thread Meng Zhu (JIRA)
Meng Zhu created MESOS-9866:
---

 Summary: Removes the `quotas` field in the registry.
 Key: MESOS-9866
 URL: https://issues.apache.org/jira/browse/MESOS-9866
 Project: Mesos
  Issue Type: Bug
Reporter: Meng Zhu


Prior to Mesos 1.9, quota information was persisted in the `quotas` field.
That field is deprecated as of Mesos 1.9; newly configured quotas are now
persisted in the `quota_configs` field.





[jira] [Commented] (MESOS-9807) Introduce a `struct Quota` wrapper.

2019-06-25 Thread Meng Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872747#comment-16872747
 ] 

Meng Zhu commented on MESOS-9807:
-

{noformat}
commit 8eba78cbddc8b70f78c07a501ee0dc1d6204f280
Author: Meng Zhu 
Date:   Thu Jun 20 17:29:28 2019 -0700

Replaced `Quota` with `Quota2` in the master state.

This paves the way to remove `struct Quota`.

Review: https://reviews.apache.org/r/70916

commit 5907a357180ccd8fe398f2b6638c85912fafe8b2
Author: Meng Zhu 
Date:   Thu Jun 20 18:50:38 2019 -0700

Replaced the old `struct Quota`.

The new `struct Quota` is consistent with the proto `QuotaConfig`
where guarantees and limits are decoupled and uses more proper
abstractions: `ResourceQuantities` and `ResourceLimits`.

Review: https://reviews.apache.org/r/70919
{noformat}


> Introduce a `struct Quota` wrapper.
> ---
>
> Key: MESOS-9807
> URL: https://issues.apache.org/jira/browse/MESOS-9807
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Meng Zhu
>Assignee: Meng Zhu
>Priority: Major
>  Labels: resource-management
>
> We should introduce:
> struct Quota {
>   ResourceQuantities guarantees;
>   ResourceLimits limits;
> }
> There are a couple of small hurdles. First, there is already a struct Quota
> wrapper in "include/mesos/quota/quota.hpp"; we need to deprecate that first.
> Second, `ResourceQuantities` and `ResourceLimits` are right now only used in
> internal headers. We probably want to move them into a public header, since
> this struct will also be used in the allocator interface, which is also in a
> public header. (Looking at this line, the boundary is already breached:
> https://github.com/apache/mesos/blob/master/include/mesos/allocator/allocator.hpp#L41)





[jira] [Commented] (MESOS-9820) Add `updateQuota()` method to the allocator.

2019-06-25 Thread Meng Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872745#comment-16872745
 ] 

Meng Zhu commented on MESOS-9820:
-

{noformat}
commit 373393bbaaeadf992c2e8d5399462ffe128eaec4
Author: Meng Zhu 
Date:   Thu Jun 20 18:48:28 2019 -0700

Removed `setQuota` and `removeQuota` methods in the allocator.

These are replaced by the `updateQuota` method.

Review: https://reviews.apache.org/r/70918
{noformat}

> Add `updateQuota()` method to the allocator.
> 
>
> Key: MESOS-9820
> URL: https://issues.apache.org/jira/browse/MESOS-9820
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Meng Zhu
>Assignee: Meng Zhu
>Priority: Major
>  Labels: resource-management
>
> This is the method that underlies the `UPDATE_QUOTA` operator call. This will 
> allow the allocator to set different values for guarantees and limits.
> The existing `setQuota` and `removeQuota` methods in the allocator will be 
> deprecated. This will likely break many existing allocator tests. We should 
> fix and refactor tests to verify the bursting up to limits feature.





[jira] [Commented] (MESOS-9820) Add `updateQuota()` method to the allocator.

2019-06-25 Thread Meng Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872744#comment-16872744
 ] 

Meng Zhu commented on MESOS-9820:
-

{noformat}
commit 86affdd0b5c2208627eb194e5d02794fa264c383
Author: Meng Zhu 
Date:   Thu Jun 20 18:09:36 2019 -0700

Refactored the allocator test to use the `updateQuota` method.

This paves the way to remove `setQuota` and `removeQuota` methods.

Review: https://reviews.apache.org/r/70917
{noformat}


> Add `updateQuota()` method to the allocator.
> 
>
> Key: MESOS-9820
> URL: https://issues.apache.org/jira/browse/MESOS-9820
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Meng Zhu
>Assignee: Meng Zhu
>Priority: Major
>  Labels: resource-management
>
> This is the method that underlies the `UPDATE_QUOTA` operator call. This will 
> allow the allocator to set different values for guarantees and limits.
> The existing `setQuota` and `removeQuota` methods in the allocator will be 
> deprecated. This will likely break many existing allocator tests. We should 
> fix and refactor tests to verify the bursting up to limits feature.





[jira] [Commented] (MESOS-9854) /roles endpoint should return both guarantees and limits.

2019-06-25 Thread Meng Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872742#comment-16872742
 ] 

Meng Zhu commented on MESOS-9854:
-

{noformat}
commit b23b4e52a24637231a85faf2416b75180cfd9063
Author: Meng Zhu m...@mesosphere.io
Date:   Thu Jun 20 17:17:41 2019 -0700


Made `/roles` endpoint also return quota limits.

Now that guarantees are decoupled from limits, we should
return limits and guarantees separately in the `/roles` endpoint.

Three incompatible changes are introduced:

- The `principal` field is removed. This legacy field was used to
record the principal of the operator who configured the quota,
so that later, if a different operator with a different principal
wanted to modify the quota, the action could be properly authorized.
This use case has since been deprecated and the principal field
will no longer be filled going forward.

- Resources with zero quantity will no longer be included in
the `guarantee` field.

- The `guarantee` field will continue to be filled.
However, since we are decoupling the quota guarantee from the limit,
one can no longer assume that the limit will be the same as the guarantee.
A separate `limit` field is introduced.

Before, the response might contain:
```
{
  "quota": {
    "guarantee": {
      "cpus": 1,
      "disk": 0,
      "gpus": 0,
      "mem": 512
    },
    "principal": "test-principal",
    "role": "foo"
  }
}
```

After:
```
{
  "quota": {
    "guarantee": {
      "cpus": 1,
      "mem": 512
    },
    "limit": {
      "cpus": 1,
      "mem": 512
    },
    "role": "foo"
  }
}
```

Also fixed an affected test.

Review: https://reviews.apache.org/r/70915
{noformat}


> /roles endpoint should return both guarantees and limits. 
> --
>
> Key: MESOS-9854
> URL: https://issues.apache.org/jira/browse/MESOS-9854
> Project: Mesos
>  Issue Type: Bug
>Reporter: Meng Zhu
>Assignee: Meng Zhu
>Priority: Major
>  Labels: resource-management
>






[jira] [Assigned] (MESOS-8486) Webui should display role limits.

2019-06-24 Thread Meng Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-8486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Meng Zhu reassigned MESOS-8486:
---

Assignee: Meng Zhu  (was: Armand Grillet)

> Webui should display role limits.
> -
>
> Key: MESOS-8486
> URL: https://issues.apache.org/jira/browse/MESOS-8486
> Project: Mesos
>  Issue Type: Task
>  Components: webui
>Reporter: Benjamin Mahler
>Assignee: Meng Zhu
>Priority: Major
>  Labels: multitenancy
>
> With the addition of quota limits (see MESOS-8068), the UI should be updated 
> to display the per role limit information. Specifically, the 'Roles' tab 
> needs to be updated.





[jira] [Created] (MESOS-9861) Make PushGauges support float point stats.

2019-06-24 Thread Meng Zhu (JIRA)
Meng Zhu created MESOS-9861:
---

 Summary: Make PushGauges support float point stats.
 Key: MESOS-9861
 URL: https://issues.apache.org/jira/browse/MESOS-9861
 Project: Mesos
  Issue Type: Bug
  Components: metrics
Reporter: Meng Zhu


Currently, PushGauges are modeled after counters and thus do not support 
floating point stats. This prevents many existing PullGauges from being 
migrated to PushGauges. We need to add support for floating point stats.
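
A minimal sketch of what a floating point push gauge could look like, assuming a standalone class built on `std::atomic<double>`. The class name `DoublePushGauge` and its methods are hypothetical and are not the libprocess `PushGauge` interface.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical gauge that stores a double instead of an integer counter.
class DoublePushGauge
{
public:
  // Overwrites the current value.
  void set(double v) { value.store(v); }

  // Reads the current value without taking a lock.
  double get() const { return value.load(); }

  // Adds `delta` atomically. A compare-and-swap loop is used because
  // `fetch_add` on floating point atomics is only available since C++20.
  void add(double delta)
  {
    double current = value.load();
    while (!value.compare_exchange_weak(current, current + delta)) {
      // `current` was refreshed with the latest value; retry.
    }
  }

private:
  std::atomic<double> value{0.0};
};
```

The gauge is pushed to on every change, so reads stay cheap no matter how expensive the underlying computation is.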





[jira] [Assigned] (MESOS-9668) Add authorization support for the new `GET_QUOTA` call.

2019-06-24 Thread Meng Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Meng Zhu reassigned MESOS-9668:
---

Assignee: Meng Zhu

> Add authorization support for the new `GET_QUOTA` call.
> ---
>
> Key: MESOS-9668
> URL: https://issues.apache.org/jira/browse/MESOS-9668
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Meng Zhu
>Assignee: Meng Zhu
>Priority: Major
>  Labels: mesosphere, resource-management
>
> The new `GET_QUOTA` call will return QUOTA_CONFIGS:
> // Used in GET_QUOTA and returned by GET /quota
> //
> // Overall cluster quota status, including all roles, their quota 
> // configurations and current state (e.g. consumed and effective limits)
> message QuotaStatus {
>   repeated QuotaInfo infos [deprecated = true];
>   repeated QuotaConfig configs; 
> }
> The current authorizer takes in `QuotaInfo` as the object. We should 
> deprecate that and let it take in `QuotaConfig`s.





[jira] [Assigned] (MESOS-9601) Guard against downgrade hazards after new quota configurations are used.

2019-06-21 Thread Meng Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Meng Zhu reassigned MESOS-9601:
---

Assignee: Meng Zhu

> Guard against downgrade hazards after new quota configurations are used.
> 
>
> Key: MESOS-9601
> URL: https://issues.apache.org/jira/browse/MESOS-9601
> Project: Mesos
>  Issue Type: Task
>Reporter: Meng Zhu
>Assignee: Meng Zhu
>Priority: Major
>  Labels: resource-management
>
> Current (old) masters only support quota guarantees, which also serve as 
> limits implicitly. Once new masters start to support both guarantees and 
> limits, there is no safe downgrade path without altering the cluster behavior 
> (if the new quota semantics are used). Thus, we need to ensure that alerts 
> are given if such downgrades are attempted.
> To this end, if the new `UPDATE_QUOTA` call is used, a new minimum capability 
> `QUOTA_LIMITS` will be persisted to the registry along with the new 
> `QuotaConfig` message. Thanks to the minimum capability check, old masters 
> (that do not possess the `QUOTA_LIMITS` capability) will refuse to start in 
> this case and we will print out suggestions to the operator.
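
The minimum capability check described above can be sketched roughly as follows. This is only an illustration: capabilities are modeled as plain strings and `missingCapabilities` is a hypothetical helper, not the actual master code.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Returns the capabilities that the registry requires but this master
// does not possess. An empty result means it is safe to start.
inline std::vector<std::string> missingCapabilities(
    const std::set<std::string>& registryMinimum,
    const std::set<std::string>& masterCapabilities)
{
  std::vector<std::string> missing;
  for (const std::string& capability : registryMinimum) {
    if (masterCapabilities.count(capability) == 0) {
      missing.push_back(capability);
    }
  }
  return missing;
}
```

An old master that recovers a registry containing `QUOTA_LIMITS` would get a non-empty result here and could refuse to start, listing the missing capabilities for the operator.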





[jira] [Assigned] (MESOS-9602) Provide backward compatibility for old quota configurations.

2019-06-21 Thread Meng Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Meng Zhu reassigned MESOS-9602:
---

Assignee: Meng Zhu

> Provide backward compatibility for old quota configurations.
> 
>
> Key: MESOS-9602
> URL: https://issues.apache.org/jira/browse/MESOS-9602
> Project: Mesos
>  Issue Type: Task
>Reporter: Meng Zhu
>Assignee: Meng Zhu
>Priority: Major
>  Labels: resource-management
>
> Current (old) masters only support quota guarantees, which also serve as 
> limits implicitly. When upgrading to new masters where guarantees and limits 
> are decoupled, we need to ensure backward compatibility such that the 
> existing (old) quota configurations are honored and there is no change 
> to the cluster behavior.
> To this end, new masters should also be able to consume the old quota 
> registry. The old guarantee field will be used to set both guarantee and 
> limits.
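
A rough sketch of that conversion, assuming simplified stand-in types. `LegacyQuotaInfo`, `Quota` and `fromLegacy` are hypothetical names used only for illustration; resource quantities are modeled as a name-to-scalar map.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for an old registry entry: guarantee only.
struct LegacyQuotaInfo
{
  std::string role;
  std::map<std::string, double> guarantee;
};

// Hypothetical stand-in for the new decoupled quota.
struct Quota
{
  std::map<std::string, double> guarantees;
  std::map<std::string, double> limits;
};

// The old guarantee implicitly acted as the limit too, so both fields
// are filled from it to preserve the pre-upgrade cluster behavior.
inline Quota fromLegacy(const LegacyQuotaInfo& info)
{
  Quota quota;
  quota.guarantees = info.guarantee;
  quota.limits = info.guarantee;
  return quota;
}
```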





[jira] [Assigned] (MESOS-8068) Non-revocable bursting over quota guarantees via limits.

2019-06-21 Thread Meng Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-8068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Meng Zhu reassigned MESOS-8068:
---

Assignee: Meng Zhu

> Non-revocable bursting over quota guarantees via limits.
> 
>
> Key: MESOS-8068
> URL: https://issues.apache.org/jira/browse/MESOS-8068
> Project: Mesos
>  Issue Type: Epic
>  Components: allocation
>Reporter: Benjamin Mahler
>Assignee: Meng Zhu
>Priority: Major
>  Labels: multitenancy, resource-management
>
> Prior to introducing a revocable tier of allocation (see MESOS-4441), there 
> is a notion of whether a role can burst over its quota guarantee.
> We currently apply implicit limits in the following way:
> No quota guarantee set: (guarantee 0, no limit)
> Quota guarantee set: (guarantee G, limit G)
> That is, we only support burst-only without a guarantee and 
> guarantee-only without burst. We do not support bursting over some non-zero 
> guarantee: (guarantee G, limit L >= G).
> The idea here is that we should make these implicit limits explicit to 
> clarify for users the distinction between guarantees and limits, and to 
> support bursting over the guarantee.





[jira] [Created] (MESOS-9854) /roles endpoint should return both guarantees and limits.

2019-06-20 Thread Meng Zhu (JIRA)
Meng Zhu created MESOS-9854:
---

 Summary: /roles endpoint should return both guarantees and limits. 
 Key: MESOS-9854
 URL: https://issues.apache.org/jira/browse/MESOS-9854
 Project: Mesos
  Issue Type: Bug
Reporter: Meng Zhu
Assignee: Meng Zhu








[jira] [Created] (MESOS-9851) Migrate allocator metrics to PushGauge.

2019-06-18 Thread Meng Zhu (JIRA)
Meng Zhu created MESOS-9851:
---

 Summary: Migrate allocator metrics to PushGauge.
 Key: MESOS-9851
 URL: https://issues.apache.org/jira/browse/MESOS-9851
 Project: Mesos
  Issue Type: Bug
  Components: allocation
Reporter: Meng Zhu


We should migrate all metrics in the master actor to use PushGauges instead of 
PullGauges for better performance.





[jira] [Commented] (MESOS-9807) Introduce a `struct Quota` wrapper.

2019-06-14 Thread Meng Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16864551#comment-16864551
 ] 

Meng Zhu commented on MESOS-9807:
-

{noformat}

commit ceb1120e8c53771219363e0bf579a770b914a592
Author: Meng Zhu 
Date:   Thu Jun 6 16:18:45 2019 -0700

Used the new quota struct for the allocator recover call.

Review: https://reviews.apache.org/r/70804

commit 4bdbd8e7da5063d55726b628b5e0d31c79650d3f
Author: Meng Zhu 
Date:   Thu Jun 6 15:58:05 2019 -0700

Added `Metrics::updateQuota` for quota metrics.

This is intended to replace the existing `Metrics::setQuota` and
`Metrics::remove` calls.

Currently, it only tracks guarantees. Need to add limits metrics.

Review: https://reviews.apache.org/r/70802

commit 495162eefa12900b3a74bfbb269851473df4cce9
Author: Meng Zhu 
Date:   Wed Jun 5 14:04:53 2019 -0700

Refactored allocator with the new quota wrapper struct.

This patch also introduces a constant `DEFAULT_QUOTA`.
By default, a role has no guarantees and no limits.

Review: https://reviews.apache.org/r/70801

commit 75798445f932f1f163a502e2325e76cf33450836
Author: Meng Zhu 
Date:   Tue Jun 4 10:48:51 2019 -0700

Refactored quota overcommit check.

This refactor makes the `QuotaTree` use the new
quota wrapper struct.

Also refactored the check to reflect that it is currently
only checking guarantees.

Review: https://reviews.apache.org/r/70800

commit f05f0616841bd539a8b6abfc591f3c287ad998d9
Author: Meng Zhu 
Date:   Tue Jun 4 17:34:52 2019 -0700

Added a wrapper struct for quota guarantees and limits.

This struct is temporarily named `Quota2` to differentiate it
from the existing `Quota` struct. It will replace all uses of `Quota`
and then be renamed to `Quota`.

Review: https://reviews.apache.org/r/70799
{noformat}


> Introduce a `struct Quota` wrapper.
> ---
>
> Key: MESOS-9807
> URL: https://issues.apache.org/jira/browse/MESOS-9807
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Meng Zhu
>Assignee: Meng Zhu
>Priority: Major
>  Labels: resource-management
>
> We should introduce:
> struct Quota {
>   ResourceQuantities guarantees;
>   ResourceLimits limits;
> };
> There are a couple of small hurdles. First, there is already a struct Quota 
> wrapper in "include/mesos/quota/quota.hpp"; we need to deprecate that first. 
> Second, `ResourceQuantities` and `ResourceLimits` are right now only used in 
> internal headers. We probably want to move them into a public header, since 
> this struct will also be used in the allocator interface, which is also in a 
> public header. (Looking at this line, the boundary is already breached: 
> https://github.com/apache/mesos/blob/master/include/mesos/allocator/allocator.hpp#L41)





[jira] [Commented] (MESOS-9820) Add `updateQuota()` method to the allocator.

2019-06-14 Thread Meng Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16864550#comment-16864550
 ] 

Meng Zhu commented on MESOS-9820:
-

{noformat}
commit 4703b23143ee806ed5e68d9ff6eabe9600ffc9c9
Author: Meng Zhu 
Date:   Wed Jun 5 16:44:00 2019 -0700

Added `updateQuota` method to the allocator.

This call updates a role's quota guarantees and limits.
All roles have a default quota defined as `DEFAULT_QUOTA`.
Currently, that means no guarantees and no limits. Thus to "remove"
a quota, one should simply update the quota to be
`DEFAULT_QUOTA`.

Master `setQuota` and `removeQuota` calls into the allocator
are replaced with `updateQuota`.

`setQuota` and `removeQuota` calls are now only used in the tests.
They will be removed once those tests are refactored.

Also fixed affected tests.

Review: https://reviews.apache.org/r/70803
{noformat}
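
For illustration, a minimal sketch of the "remove by updating to `DEFAULT_QUOTA`" idea from the commit message above. The `Allocator` class here is a toy and its members are assumptions, not the real allocator interface. `Quota` is simplified to two name-to-scalar maps.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Simplified stand-in: the default quota has no guarantees and no limits.
struct Quota
{
  std::map<std::string, double> guarantees;
  std::map<std::string, double> limits;

  bool operator==(const Quota& that) const
  {
    return guarantees == that.guarantees && limits == that.limits;
  }
};

const Quota DEFAULT_QUOTA{};

// Toy allocator that only tracks roles with a non-default quota.
class Allocator
{
public:
  // A single entry point replaces separate set/remove calls: updating a
  // role to `DEFAULT_QUOTA` is how its quota gets "removed".
  void updateQuota(const std::string& role, const Quota& quota)
  {
    if (quota == DEFAULT_QUOTA) {
      quotas.erase(role);
    } else {
      quotas[role] = quota;
    }
  }

  // Roles without an explicit entry report the default quota.
  Quota getQuota(const std::string& role) const
  {
    std::map<std::string, Quota>::const_iterator it = quotas.find(role);
    return it == quotas.end() ? DEFAULT_QUOTA : it->second;
  }

  std::size_t trackedRoles() const { return quotas.size(); }

private:
  std::map<std::string, Quota> quotas;
};
```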


> Add `updateQuota()` method to the allocator.
> 
>
> Key: MESOS-9820
> URL: https://issues.apache.org/jira/browse/MESOS-9820
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Meng Zhu
>Assignee: Meng Zhu
>Priority: Major
>  Labels: resource-management
>
> This is the method that underlies the `UPDATE_QUOTA` operator call. This will 
> allow the allocator to set different values for guarantees and limits.
> The existing `setQuota` and `removeQuota` methods in the allocator will be 
> deprecated. This will likely break many existing allocator tests. We should 
> fix and refactor tests to verify the bursting up to limits feature.





[jira] [Assigned] (MESOS-9820) Add `updateQuota()` method to the allocator.

2019-06-14 Thread Meng Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Meng Zhu reassigned MESOS-9820:
---

Assignee: Meng Zhu

> Add `updateQuota()` method to the allocator.
> 
>
> Key: MESOS-9820
> URL: https://issues.apache.org/jira/browse/MESOS-9820
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Meng Zhu
>Assignee: Meng Zhu
>Priority: Major
>  Labels: resource-management
>
> This is the method that underlies the `UPDATE_QUOTA` operator call. This will 
> allow the allocator to set different values for guarantees and limits.
> The existing `setQuota` and `removeQuota` methods in the allocator will be 
> deprecated. This will likely break many existing allocator tests. We should 
> fix and refactor tests to verify the bursting up to limits feature.





[jira] [Created] (MESOS-9847) Docker executor doesn't wait for status updates to be ack'd before shutting down.

2019-06-13 Thread Meng Zhu (JIRA)
Meng Zhu created MESOS-9847:
---

 Summary: Docker executor doesn't wait for status updates to be 
ack'd before shutting down.
 Key: MESOS-9847
 URL: https://issues.apache.org/jira/browse/MESOS-9847
 Project: Mesos
  Issue Type: Bug
  Components: executor
Reporter: Meng Zhu


The docker executor doesn't wait for pending status updates to be acknowledged 
before shutting down. Instead, it sleeps for one second and then terminates:

{noformat}
  void _stop()
  {
    // A hack for now ... but we need to wait until the status update
    // is sent to the slave before we shut ourselves down.
    // TODO(tnachen): Remove this hack and also the same hack in the
    // command executor when we have the new HTTP APIs to wait until
    // an ack.
    os::sleep(Seconds(1));
    driver.get()->stop();
  }
{noformat}

This results in a race between the task status update (e.g. `TASK_FINISHED`) 
and the executor exit. The executor exit leads the agent to generate a 
`TASK_FAILED` status update by itself, leading to the confusing case where the 
agent handles two different terminal status updates.
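
One possible direction, sketched with standard C++ primitives only: block until all pending updates are acknowledged, with a timeout as a safety net, instead of a fixed sleep. `AckTracker` and its methods are hypothetical and do not correspond to the executor driver API.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical helper that counts status updates awaiting acknowledgement.
class AckTracker
{
public:
  // Call when a status update has been sent.
  void sent()
  {
    std::lock_guard<std::mutex> lock(mutex_);
    ++pending_;
  }

  // Call when an acknowledgement for a sent update arrives.
  void acknowledged()
  {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      assert(pending_ > 0);
      --pending_;
    }
    drained_.notify_all();
  }

  // Blocks until every sent update is acknowledged or `timeout` elapses.
  // Returns true only if nothing is pending anymore.
  bool waitForAcks(std::chrono::milliseconds timeout)
  {
    std::unique_lock<std::mutex> lock(mutex_);
    return drained_.wait_for(
        lock, timeout, [this] { return pending_ == 0; });
  }

private:
  std::mutex mutex_;
  std::condition_variable drained_;
  int pending_ = 0;
};
```

With something like this, `_stop()` would call `waitForAcks(...)` before stopping the driver, so the terminal update is known to have been acknowledged (or the timeout is reported) rather than hoping one second was enough.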





[jira] [Created] (MESOS-9835) `QuotaRoleAllocateNonQuotaResource` is failing.

2019-06-10 Thread Meng Zhu (JIRA)
Meng Zhu created MESOS-9835:
---

 Summary: `QuotaRoleAllocateNonQuotaResource` is failing.
 Key: MESOS-9835
 URL: https://issues.apache.org/jira/browse/MESOS-9835
 Project: Mesos
  Issue Type: Bug
  Components: test
Reporter: Meng Zhu
Assignee: Meng Zhu


{noformat}
[ RUN  ] HierarchicalAllocatorTest.QuotaRoleAllocateNonQuotaResource
../../src/tests/hierarchical_allocator_tests.cpp:4094: Failure
Value of: allocations.get().isPending()
  Actual: false
Expected: true
[  FAILED  ] HierarchicalAllocatorTest.QuotaRoleAllocateNonQuotaResource (12 ms)
{noformat}

The test is failing for two reasons.

First, a settle call is missing after agent3 is added, so the allocation of 
agent3 is racy.
Second, after 
https://github.com/apache/mesos/commit/7df8cc6b79e294c075de09f1de4b31a2b88423c8
we now offer nonquota resources on an agent (even if that means "chopping") on 
top of a role's satisfied guarantees. The test needs to be updated in 
accordance with the behavior change.





[jira] [Assigned] (MESOS-9807) Introduce a `struct Quota` wrapper.

2019-06-06 Thread Meng Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Meng Zhu reassigned MESOS-9807:
---

Assignee: Meng Zhu

> Introduce a `struct Quota` wrapper.
> ---
>
> Key: MESOS-9807
> URL: https://issues.apache.org/jira/browse/MESOS-9807
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Meng Zhu
>Assignee: Meng Zhu
>Priority: Major
>  Labels: resource-management
>
> We should introduce:
> struct Quota {
>   ResourceQuantities guarantees;
>   ResourceLimits limits;
> };
> There are a couple of small hurdles. First, there is already a struct Quota 
> wrapper in "include/mesos/quota/quota.hpp"; we need to deprecate that first. 
> Second, `ResourceQuantities` and `ResourceLimits` are right now only used in 
> internal headers. We probably want to move them into a public header, since 
> this struct will also be used in the allocator interface, which is also in a 
> public header. (Looking at this line, the boundary is already breached: 
> https://github.com/apache/mesos/blob/master/include/mesos/allocator/allocator.hpp#L41)





[jira] [Created] (MESOS-9834) Remove `GET_QUOTA` and `REMOVE_QUOTA` calls.

2019-06-06 Thread Meng Zhu (JIRA)
Meng Zhu created MESOS-9834:
---

 Summary: Remove `GET_QUOTA` and `REMOVE_QUOTA` calls.
 Key: MESOS-9834
 URL: https://issues.apache.org/jira/browse/MESOS-9834
 Project: Mesos
  Issue Type: Task
  Components: HTTP API
Reporter: Meng Zhu


These calls are already deprecated in favor of `UPDATE_QUOTA`.





[jira] [Commented] (MESOS-9807) Introduce a `struct Quota` wrapper.

2019-06-06 Thread Meng Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16858216#comment-16858216
 ] 

Meng Zhu commented on MESOS-9807:
-

{noformat}
commit 8fd52f1ad41c7aa131ceaac1b83a5bd1d06eca21
Author: Meng Zhu m...@mesosphere.io
Date:   Tue Jun 4 09:51:00 2019 -0700


Moved `class ResourceQuantities` to public header.

Some public-facing classes such as `Resources` already depend
on `ResourceQuantities`, and more are coming.

Review: https://reviews.apache.org/r/70786
{noformat}


> Introduce a `struct Quota` wrapper.
> ---
>
> Key: MESOS-9807
> URL: https://issues.apache.org/jira/browse/MESOS-9807
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Meng Zhu
>Priority: Major
>  Labels: resource-management
>
> We should introduce:
> struct Quota {
>   ResourceQuantities guarantees;
>   ResourceLimits limits;
> };
> There are a couple of small hurdles. First, there is already a struct Quota 
> wrapper in "include/mesos/quota/quota.hpp"; we need to deprecate that first. 
> Second, `ResourceQuantities` and `ResourceLimits` are right now only used in 
> internal headers. We probably want to move them into a public header, since 
> this struct will also be used in the allocator interface, which is also in a 
> public header. (Looking at this line, the boundary is already breached: 
> https://github.com/apache/mesos/blob/master/include/mesos/allocator/allocator.hpp#L41)





[jira] [Created] (MESOS-9820) Add `updateQuota()` method to the allocator.

2019-06-04 Thread Meng Zhu (JIRA)
Meng Zhu created MESOS-9820:
---

 Summary: Add `updateQuota()` method to the allocator.
 Key: MESOS-9820
 URL: https://issues.apache.org/jira/browse/MESOS-9820
 Project: Mesos
  Issue Type: Improvement
  Components: allocation
Reporter: Meng Zhu


This is the method that underlies the `UPDATE_QUOTA` operator call. This will 
allow the allocator to set different values for guarantees and limits.

The existing `setQuota` and `removeQuota` methods in the allocator will be 
deprecated. This will likely break many existing allocator tests. We should fix 
and refactor tests to verify the bursting up to limits feature.





[jira] [Created] (MESOS-9813) Track role consumed quota for all roles in the allocator.

2019-06-04 Thread Meng Zhu (JIRA)
Meng Zhu created MESOS-9813:
---

 Summary: Track role consumed quota for all roles in the allocator.
 Key: MESOS-9813
 URL: https://issues.apache.org/jira/browse/MESOS-9813
 Project: Mesos
  Issue Type: Improvement
  Components: allocation
Reporter: Meng Zhu


We are already tracking role consumed quota for roles with non-default quota in 
the allocator. We should expand that to track consumption for all roles, which 
will then be exposed through metrics later.





[jira] [Created] (MESOS-9812) Add overcommit validation for update quota call.

2019-06-04 Thread Meng Zhu (JIRA)
Meng Zhu created MESOS-9812:
---

 Summary: Add overcommit validation for update quota call.
 Key: MESOS-9812
 URL: https://issues.apache.org/jira/browse/MESOS-9812
 Project: Mesos
  Issue Type: Improvement
Reporter: Meng Zhu


Add an overcommit check and a force flag override for the update quota call.

Right now, we only validate each quota config individually. We need to add 
further validation for the update quota call regarding cluster resource 
overcommitment (and the force flag override) as well as hierarchical quota 
validity.
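
A minimal sketch of such an overcommit check, assuming guarantees and cluster capacity are plain name-to-scalar maps. `validateGuarantees` is a hypothetical name, and hierarchical validation is not covered here.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

using Quantities = std::map<std::string, double>;

// Returns true if the update may proceed: either the operator passed
// `force`, or the sum of all guarantees fits within the cluster total
// for every resource.
inline bool validateGuarantees(
    const std::vector<Quantities>& guarantees,
    const Quantities& clusterTotal,
    bool force)
{
  if (force) {
    return true;
  }

  // Sum the guarantees of all roles per resource name.
  Quantities sum;
  for (const Quantities& guarantee : guarantees) {
    for (const auto& entry : guarantee) {
      sum[entry.first] += entry.second;
    }
  }

  // A resource that the cluster does not have counts as zero capacity.
  for (const auto& entry : sum) {
    Quantities::const_iterator total = clusterTotal.find(entry.first);
    double capacity = total == clusterTotal.end() ? 0.0 : total->second;
    if (entry.second > capacity) {
      return false;
    }
  }

  return true;
}
```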





[jira] [Comment Edited] (MESOS-8456) Allocator should allow roles to burst above guarantees but below limits.

2019-06-03 Thread Meng Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16855309#comment-16855309
 ] 

Meng Zhu edited comment on MESOS-8456 at 6/4/19 4:54 AM:
-

main allocator patch:

https://reviews.apache.org/r/70738/


was (Author: mzhu):
https://reviews.apache.org/r/70738/

> Allocator should allow roles to burst above guarantees but below limits.
> 
>
> Key: MESOS-8456
> URL: https://issues.apache.org/jira/browse/MESOS-8456
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Meng Zhu
>Assignee: Meng Zhu
>Priority: Major
>  Labels: Mesosphere, multitenancy
>
> Currently, the allocator only allocates resources for quota roles up to their 
> guarantee in the first allocation stage. The allocator should continue 
> allocating resources to these roles in the second stage below their quota 
> limit. In other words, the allocator should allow roles to burst above their 
> guarantee but below the limit.
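
The two stages can be sketched for a single scalar resource as below. This is a simplification under stated assumptions: `RoleQuota` and both functions are hypothetical names, and "no limit" is modeled as infinity.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>

// Hypothetical single-resource view of a role's quota.
struct RoleQuota
{
  double guarantee = 0.0;
  double limit = std::numeric_limits<double>::infinity(); // No limit.
};

// First stage: hand out resources only up to the role's guarantee.
inline double allocateUpToGuarantee(
    const RoleQuota& quota, double allocated, double available)
{
  double headroom = std::max(0.0, quota.guarantee - allocated);
  return std::min(headroom, available);
}

// Second stage: allow bursting above the guarantee, never past the limit.
inline double allocateUpToLimit(
    const RoleQuota& quota, double allocated, double available)
{
  double headroom = std::max(0.0, quota.limit - allocated);
  return std::min(headroom, available);
}
```

For a role with guarantee 2 and limit 5 on a cluster with 10 free cpus, the first stage yields 2 and the second stage yields 3 more, leaving 5 cpus for other roles.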





[jira] [Commented] (MESOS-8456) Allocator should allow roles to burst above guarantees but below limits.

2019-06-03 Thread Meng Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16855307#comment-16855307
 ] 

Meng Zhu commented on MESOS-8456:
-

Some preparation patches:

{noformat}
commit 31ac45be0a55fc33982641516bcc5eb3226ef406
Author: Meng Zhu 
Date:   Tue May 28 16:28:28 2019 +0200

Added a function to shrink `Resources` to target `ResourceLimits`.

Also added unit tests.

Review: https://reviews.apache.org/r/70737

commit 8d372e14b0240aa5735a7c0cf36e03e7b3344bd1
Author: Meng Zhu 
Date:   Tue May 28 16:27:16 2019 +0200

Added methods to subtract `ResourceQuantities` from `ResourceLimits`.

This patch also makes `ResourceLimits` a friend class of
`ResourceQuantities` to achieve one-pass operation complexities.

Also added unit test.

Review: https://reviews.apache.org/r/70735
{noformat}
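
To illustrate the shrinking idea from the first patch above, here is a simplified sketch that caps scalar quantities at their limits. It operates on plain maps rather than the real `Resources` and `ResourceLimits` classes, and `shrink` is a hypothetical free function.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>

using Quantities = std::map<std::string, double>;
using Limits = std::map<std::string, double>;

// Returns a copy of `resources` where every quantity is capped at its
// limit. Resources without a limit entry are left untouched, and
// resources shrunk to zero are dropped from the result.
inline Quantities shrink(const Quantities& resources, const Limits& limits)
{
  Quantities result;
  for (const auto& entry : resources) {
    Limits::const_iterator limit = limits.find(entry.first);
    double quantity = limit == limits.end()
      ? entry.second
      : std::min(entry.second, limit->second);

    if (quantity > 0.0) {
      result[entry.first] = quantity;
    }
  }
  return result;
}
```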


> Allocator should allow roles to burst above guarantees but below limits.
> 
>
> Key: MESOS-8456
> URL: https://issues.apache.org/jira/browse/MESOS-8456
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Meng Zhu
>Assignee: Meng Zhu
>Priority: Major
>  Labels: Mesosphere, multitenancy
>
> Currently, the allocator only allocates resources for quota roles up to their 
> guarantee in the first allocation stage. The allocator should continue 
> allocating resources to these roles in the second stage below their quota 
> limit. In other words, the allocator should allow roles to burst above their 
> guarantee but below the limit.





[jira] [Issue Comment Deleted] (MESOS-8456) Allocator should allow roles to burst above guarantees but below limits.

2019-06-03 Thread Meng Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-8456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Meng Zhu updated MESOS-8456:

Comment: was deleted

(was: https://reviews.apache.org/r/65661
https://reviews.apache.org/r/65819
https://reviews.apache.org/r/65820
https://reviews.apache.org/r/65821
https://reviews.apache.org/r/65844
https://reviews.apache.org/r/65845
https://reviews.apache.org/r/65847
)

> Allocator should allow roles to burst above guarantees but below limits.
> 
>
> Key: MESOS-8456
> URL: https://issues.apache.org/jira/browse/MESOS-8456
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Meng Zhu
>Assignee: Meng Zhu
>Priority: Major
>  Labels: Mesosphere, multitenancy
>
> Currently, the allocator only allocates resources for quota roles up to their 
> guarantee in the first allocation stage. The allocator should continue 
> allocating resources to these roles in the second stage below their quota 
> limit. In other words, the allocator should allow roles to burst above their 
> guarantee but below the limit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9807) Introduce a `struct Quota` wrapper.

2019-06-03 Thread Meng Zhu (JIRA)
Meng Zhu created MESOS-9807:
---

 Summary: Introduce a `struct Quota` wrapper.
 Key: MESOS-9807
 URL: https://issues.apache.org/jira/browse/MESOS-9807
 Project: Mesos
  Issue Type: Improvement
  Components: allocation
Reporter: Meng Zhu


We should introduce:

struct Quota {
  ResourceQuantities guarantees;
  ResourceLimits limits;
};

There are a couple of small hurdles. First, there is already a struct Quota 
wrapper in "include/mesos/quota/quota.hpp"; we need to deprecate that first. 
Second, `ResourceQuantities` and `ResourceLimits` are right now only used in 
internal headers. We probably want to move them into a public header, since this 
struct will also be used in the allocator interface, which is also in a public 
header. (Looking at this line, the boundary is already breached: 
https://github.com/apache/mesos/blob/master/include/mesos/allocator/allocator.hpp#L41)
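
A self-contained sketch of the proposed wrapper, for illustration only. The two type aliases are hypothetical stand-ins for the Mesos `ResourceQuantities` and `ResourceLimits` classes, and `isDefault` is an assumed helper.

```cpp
#include <cassert>
#include <map>
#include <string>

// Stand-ins: resource name -> scalar value. The real classes are richer.
using ResourceQuantities = std::map<std::string, double>;
using ResourceLimits = std::map<std::string, double>;

// Guarantees and limits are separate members so that each can be set
// independently of the other.
struct Quota
{
  ResourceQuantities guarantees; // Absent resource: no guarantee.
  ResourceLimits limits;         // Absent resource: no limit.
};

// True if the quota carries neither guarantees nor limits.
inline bool isDefault(const Quota& quota)
{
  return quota.guarantees.empty() && quota.limits.empty();
}
```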





[jira] [Created] (MESOS-9806) Address allocator performance regression due to the removal of quota role sorter.

2019-06-02 Thread Meng Zhu (JIRA)
Meng Zhu created MESOS-9806:
---

 Summary: Address allocator performance regression due to the 
removal of quota role sorter.
 Key: MESOS-9806
 URL: https://issues.apache.org/jira/browse/MESOS-9806
 Project: Mesos
  Issue Type: Improvement
  Components: allocation
Reporter: Meng Zhu
Assignee: Meng Zhu


In MESOS-9802, we removed the quota role sorter which is tech debt.

However, this slows down the allocator. In the first stage, even if a cluster 
has no active roles with non-default quota, the allocator now has to sort and 
go through every role in the cluster. Benchmark results show that for 1k roles 
with 2k frameworks, the allocator could experience ~50% performance 
degradation.

There are a couple of ways to address this issue. For example, we could make 
the sorter aware of quota and add a method, say `sortQuotaRoles`, that returns 
all the roles with non-default quota. Alternatively, an even better approach 
would be to deprecate the sorter concept and just have two standalone 
functions, e.g. sortRoles() and sortQuotaRoles(), that take in the role tree 
structure (which does not yet exist in the allocator) and return the sorted 
roles.
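
The second option could look roughly like the sketch below. The `Role` struct is a hypothetical flattened stand-in for the role tree (which does not exist yet), and sorting by ascending share is an assumption made for illustration only.

```cpp
#include <algorithm>
#include <iterator>
#include <string>
#include <utility>
#include <vector>

// Hypothetical flattened view of a role; not a Mesos type.
struct Role
{
  std::string name;
  double share;
  bool hasNonDefaultQuota;
};

// Returns role names ordered by ascending share (ties broken by name).
std::vector<std::string> sortRoles(std::vector<Role> roles)
{
  std::sort(roles.begin(), roles.end(), [](const Role& a, const Role& b) {
    return a.share != b.share ? a.share < b.share : a.name < b.name;
  });

  std::vector<std::string> names;
  for (const Role& role : roles) {
    names.push_back(role.name);
  }
  return names;
}

// Sorts only the roles with non-default quota, so the first allocation
// stage does not have to visit every role in the cluster.
std::vector<std::string> sortQuotaRoles(const std::vector<Role>& roles)
{
  std::vector<Role> quotaRoles;
  std::copy_if(
      roles.begin(), roles.end(), std::back_inserter(quotaRoles),
      [](const Role& role) { return role.hasNonDefaultQuota; });
  return sortRoles(std::move(quotaRoles));
}
```

With no roles holding non-default quota, `sortQuotaRoles` returns an empty vector after one linear scan, which is the case the benchmark above is concerned with.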






[jira] [Created] (MESOS-9802) Remove quota role sorter in the allocator.

2019-05-29 Thread Meng Zhu (JIRA)
Meng Zhu created MESOS-9802:
---

 Summary: Remove quota role sorter in the allocator.
 Key: MESOS-9802
 URL: https://issues.apache.org/jira/browse/MESOS-9802
 Project: Mesos
  Issue Type: Improvement
  Components: allocation
Reporter: Meng Zhu
Assignee: Meng Zhu


Remove the dedicated quota role sorter in favor of using the same sorting 
between satisfying guarantees and bursting above guarantees up to limits. This 
is tech debt from when a "quota role" was considered different from a 
"non-quota" role. However, they are the same; one just has a default quota.

The only practical difference between the quota role sorter and the role sorter 
now is that the quota role sorter ignores revocable resources both in its total 
resource pool and in role allocations. Thus, when using DRF, it does not 
count revocable resources, which is arguably the right behavior.

By removing the quota sorter, we will have all roles sorted together. When 
using DRF, in the 1st quota guarantee allocation stage, its share calculation 
will also include revocable resources.





[jira] [Created] (MESOS-9796) Add `min_allocatable_resources` to mesos-execute.

2019-05-23 Thread Meng Zhu (JIRA)
Meng Zhu created MESOS-9796:
---

 Summary: Add `min_allocatable_resources` to mesos-execute.
 Key: MESOS-9796
 URL: https://issues.apache.org/jira/browse/MESOS-9796
 Project: Mesos
  Issue Type: Task
  Components: cli
Reporter: Meng Zhu








[jira] [Commented] (MESOS-9786) Race between two REMOVE_QUOTA calls crashes the master.

2019-05-16 Thread Meng Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841469#comment-16841469
 ] 

Meng Zhu commented on MESOS-9786:
-

{noformat}
commit d9ab461ad4dadf13ec45d52e83a0e9a2f452de74 (HEAD -> quota_race, 
apache/master)
Author: Meng Zhu 
Date:   Thu May 16 12:12:15 2019 +0200

Fix a bug where racing quota removal request could crash the master.

Also added a test.

Review: https://reviews.apache.org/r/70656
{noformat}


> Race between two REMOVE_QUOTA calls crashes the master.
> ---
>
> Key: MESOS-9786
> URL: https://issues.apache.org/jira/browse/MESOS-9786
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.6.2, 1.7.2, 1.8.0, 1.9.0
>Reporter: Andrei Sekretenko
>Assignee: Meng Zhu
>Priority: Major
>  Labels: resource-management
>
> The existence of the quota in the master is validated here:
> [https://github.com/apache/mesos/blob/a9a2acabd03181865055b77cf81e7bb310b236d6/src/master/quota_handler.cpp#L700]
> Then the quota is removed from master in a deferred method call:
> [https://github.com/apache/mesos/blob/a9a2acabd03181865055b77cf81e7bb310b236d6/src/master/quota_handler.cpp#L744]
> And then removed from allocator in another deferred call:
> [https://github.com/apache/mesos/blob/a9a2acabd03181865055b77cf81e7bb310b236d6/src/master/quota_handler.cpp#L753]
> So, there is a race between two simultaneous REMOVE_QUOTA calls.
> We observe this race on a heavily loaded cluster. Currently we suspect that 
> the client retries the call (because the call is not processed for a long 
> time), and this triggers the race.
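
The check-then-act pattern described above can be modeled with a toy sketch. `Master`, `removeQuota`, and the queue of deferred calls are hypothetical stand-ins, not Mesos code, and re-validating inside the deferred call is only one possible guard; it is not necessarily what the fix in the review does.

```cpp
#include <functional>
#include <queue>
#include <set>
#include <string>

// Toy model of an actor that validates a request immediately but applies
// it later in a deferred call. All names here are hypothetical.
struct Master
{
  std::set<std::string> quotas;
  std::queue<std::function<void()>> deferred;
  int staleRemovals = 0;

  // Returns false if up-front validation rejects the request.
  bool removeQuota(const std::string& role)
  {
    if (quotas.count(role) == 0) {
      return false;
    }

    deferred.push([this, role]() {
      // Re-validate here: another removal may have run since the check
      // above. Blindly assuming the quota still exists is what crashes.
      if (quotas.erase(role) == 0) {
        ++staleRemovals;
      }
    });
    return true;
  }

  void runDeferred()
  {
    while (!deferred.empty()) {
      deferred.front()();
      deferred.pop();
    }
  }
};
```

Two back-to-back calls for the same role both pass validation because neither deferred removal has run yet; the second deferred call then finds the quota already gone.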





[jira] [Assigned] (MESOS-9786) Race between two REMOVE_QUOTA calls crashes the master.

2019-05-16 Thread Meng Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Meng Zhu reassigned MESOS-9786:
---

Assignee: Meng Zhu

> Race between two REMOVE_QUOTA calls crashes the master.
> ---
>
> Key: MESOS-9786
> URL: https://issues.apache.org/jira/browse/MESOS-9786
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.5.1, 1.8.0, 1.8.1
>Reporter: Andrei Sekretenko
>Assignee: Meng Zhu
>Priority: Major
>  Labels: resource-management
>
> The existence of the quota in the master is validated here:
> [https://github.com/apache/mesos/blob/a9a2acabd03181865055b77cf81e7bb310b236d6/src/master/quota_handler.cpp#L700]
> Then the quota is removed from master in a deferred method call:
> [https://github.com/apache/mesos/blob/a9a2acabd03181865055b77cf81e7bb310b236d6/src/master/quota_handler.cpp#L744]
> And then removed from allocator in another deferred call:
> [https://github.com/apache/mesos/blob/a9a2acabd03181865055b77cf81e7bb310b236d6/src/master/quota_handler.cpp#L753]
> So, there is a race between two simultaneous REMOVE_QUOTA calls.
> We observe this race on a heavily loaded cluster. Currently we suspect that 
> the client retries the call (because the call is not processed for a long 
> time), and this triggers the race.





[jira] [Assigned] (MESOS-9782) Random sorter fails to clear removed clients.

2019-05-13 Thread Meng Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Meng Zhu reassigned MESOS-9782:
---

Assignee: Meng Zhu

> Random sorter fails to clear removed clients.
> -
>
> Key: MESOS-9782
> URL: https://issues.apache.org/jira/browse/MESOS-9782
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
>Affects Versions: 1.8.0
>Reporter: Meng Zhu
>Assignee: Meng Zhu
>Priority: Blocker
>  Labels: resource-management
>
> In `RandomSorter::SortInfo::updateRelativeWeights()`, we do not clear the 
> stale `clients` and `weights` vectors if the state is dirty. This would result 
> in an allocator crash due to including removed frameworks and roles in a 
> sorted result, e.g. a check failure would occur here 
> (https://github.com/apache/mesos/blob/62f0b6973b2268a3305fd631a914433a933c6757/src/master/allocator/mesos/hierarchical.cpp#L1849).





[jira] [Created] (MESOS-9782) Random sorter fails to clear removed clients.

2019-05-13 Thread Meng Zhu (JIRA)
Meng Zhu created MESOS-9782:
---

 Summary: Random sorter fails to clear removed clients.
 Key: MESOS-9782
 URL: https://issues.apache.org/jira/browse/MESOS-9782
 Project: Mesos
  Issue Type: Bug
  Components: allocation
Affects Versions: 1.8.0
Reporter: Meng Zhu


In `RandomSorter::SortInfo::updateRelativeWeights()`, we do not clear the stale 
`clients` and `weights` vectors if the state is dirty. This would result in an 
allocator crash due to including removed frameworks and roles in a sorted 
result, e.g. a check failure would occur here 
(https://github.com/apache/mesos/blob/62f0b6973b2268a3305fd631a914433a933c6757/src/master/allocator/mesos/hierarchical.cpp#L1849).
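
A simplified sketch of the intended behavior follows. This `SortInfo` is a hypothetical reduction of the real class: it takes the active clients as a plain list of (client, weight) pairs instead of walking the sorter's tree.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for `RandomSorter::SortInfo`.
struct SortInfo
{
  bool dirty = true;
  std::vector<std::string> clients;
  std::vector<double> weights;

  // `active` holds (client, weight) pairs for the currently active clients.
  void updateRelativeWeights(
      const std::vector<std::pair<std::string, double>>& active)
  {
    if (!dirty) {
      return;
    }

    // Drop the cached entries first so removed clients do not linger.
    clients.clear();
    weights.clear();

    for (const auto& entry : active) {
      clients.push_back(entry.first);
      weights.push_back(entry.second);
    }

    dirty = false;
  }
};
```

Without the two `clear()` calls, a recomputation after a client is removed would append to the old contents and keep the removed client in the sorted result.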





[jira] [Created] (MESOS-9781) Templatize the allocator tests for different sorters.

2019-05-13 Thread Meng Zhu (JIRA)
Meng Zhu created MESOS-9781:
---

 Summary: Templatize the allocator tests for different sorters.
 Key: MESOS-9781
 URL: https://issues.apache.org/jira/browse/MESOS-9781
 Project: Mesos
  Issue Type: Improvement
  Components: allocation
Reporter: Meng Zhu


Currently, most (all?) allocator tests use the DRF sorter:
https://github.com/apache/mesos/blob/62f0b6973b2268a3305fd631a914433a933c6757/src/tests/hierarchical_allocator_tests.cpp#L137

This means we have little coverage for allocators that use the random sorter. 
Tests should be examined and templatized for both sorters if possible.





[jira] [Created] (MESOS-9780) Improve "picky" framework resource allocation under random sorter.

2019-05-12 Thread Meng Zhu (JIRA)
Meng Zhu created MESOS-9780:
---

 Summary: Improve "picky" framework resource allocation under 
random sorter.
 Key: MESOS-9780
 URL: https://issues.apache.org/jira/browse/MESOS-9780
 Project: Mesos
  Issue Type: Improvement
  Components: allocation
Reporter: Meng Zhu


Picky frameworks are frameworks that are interested in only a particular set 
of resources.
With the current offer model, such a framework usually keeps declining and 
filtering offers it is not interested in until it receives an offer that meets 
its needs.

While picky frameworks are always prone to performance issues, they are more 
likely to experience offer starvation under the random sorter than under the 
DRF sorter.

Under the DRF sorter, declining offers and Mesos-side resource filtering do not 
affect the framework's dominant resource share. Since other frameworks might 
get resources allocated at the same time, which raises their shares 
comparatively, a declined/filtered framework usually has a higher chance 
of getting other offers as time goes by (if it keeps declining). This reduces 
the time it takes such a framework to eventually get what it wants.

The random sorter, however, is stateless. A decline or filter action has no 
effect on the chance of a framework getting offers. A framework declining or 
filtering an offer essentially wastes a shot for nothing. It becomes a truly 
altruistic act with no perceived gain on the framework side. This makes the 
random sorter likely to perform poorly compared to DRF in terms of handling 
picky frameworks.





[jira] [Commented] (MESOS-9778) Randomized the agents in the second allocation stage.

2019-05-09 Thread Meng Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16836739#comment-16836739
 ] 

Meng Zhu commented on MESOS-9778:
-

{noformat}
commit d13be8432180d3b64947a320fa0c11340dba029a
Author: Meng Zhu m...@mesosphere.io
Date:   Wed May 8 16:58:02 2019 -0700


Randomized the agents in the second allocation stage.

Before this patch, agents are randomized before the 1st
allocation stage (the quota allocation stage) but not in
the 2nd stage. One perceived issue is that resources on
the agents in the front of the queue are likely to be mostly
allocated in the 1st stage, leaving only slices of resources
available for the second stage. Thus we may see consistently
low quality offers for role/frameworks that get allocated first
in the 2nd stage.

This patch randomizes the agents again before the 2nd stage to
"spread out" the effect of the 1st stage allocation.

Review: https://reviews.apache.org/r/70613
{noformat}


> Randomized the agents in the second allocation stage.
> -
>
> Key: MESOS-9778
> URL: https://issues.apache.org/jira/browse/MESOS-9778
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Meng Zhu
>Assignee: Meng Zhu
>Priority: Major
>  Labels: resource-management
>
> Agents are currently randomized before the 1st
> allocation stage (the quota allocation stage) but not in
> the 2nd stage. One perceived issue is that resources on
> the agents in the front of the queue are likely to be mostly
> allocated in the 1st stage, leaving only slices of resources
> available for the second stage. Thus we may see consistently
> low quality offers for role/frameworks that get allocated first
> in the 2nd stage.
> Consider randomizing the agents in the second allocation stage.





[jira] [Created] (MESOS-9778) Randomized the agents in the second allocation stage.

2019-05-09 Thread Meng Zhu (JIRA)
Meng Zhu created MESOS-9778:
---

 Summary: Randomized the agents in the second allocation stage.
 Key: MESOS-9778
 URL: https://issues.apache.org/jira/browse/MESOS-9778
 Project: Mesos
  Issue Type: Improvement
  Components: allocation
Reporter: Meng Zhu
Assignee: Meng Zhu


Agents are currently randomized before the 1st
allocation stage (the quota allocation stage) but not in
the 2nd stage. One perceived issue is that resources on
the agents in the front of the queue are likely to be mostly
allocated in the 1st stage, leaving only slices of resources
available for the second stage. Thus we may see consistently
low quality offers for role/frameworks that get allocated first
in the 2nd stage.

Consider randomizing the agents in the second allocation stage.
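
The proposal amounts to a second shuffle between the two stages. Below is a minimal sketch, assuming a hypothetical `allocate` function that takes each stage as a callback; the real allocator loop is considerably more involved.

```cpp
#include <algorithm>
#include <functional>
#include <random>
#include <string>
#include <vector>

using Stage = std::function<void(const std::string&)>;

// Hypothetical skeleton of one allocation cycle. `agents` is taken by
// value so the caller's ordering is left untouched.
void allocate(
    std::vector<std::string> agents,
    std::mt19937& rng,
    const Stage& quotaStage,
    const Stage& secondStage)
{
  // 1st stage (quota allocation): agents are visited in random order.
  std::shuffle(agents.begin(), agents.end(), rng);
  for (const std::string& agent : agents) {
    quotaStage(agent);
  }

  // Shuffle again so the 2nd stage does not start with the same agents
  // that the 1st stage has most likely already drained.
  std::shuffle(agents.begin(), agents.end(), rng);
  for (const std::string& agent : agents) {
    secondStage(agent);
  }
}
```

Both stages still visit every agent exactly once; only the visiting order of the second stage is decoupled from the first.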





[jira] [Created] (MESOS-9777) Consider doing an internal retry if reservation and etc. operations fail due to 409 conflict.

2019-05-08 Thread Meng Zhu (JIRA)
Meng Zhu created MESOS-9777:
---

 Summary: Consider doing an internal retry if reservation and etc. 
operations fail due to 409 conflict.
 Key: MESOS-9777
 URL: https://issues.apache.org/jira/browse/MESOS-9777
 Project: Mesos
  Issue Type: Improvement
  Components: master
Reporter: Meng Zhu


A reservation request may return 409 Conflict:

https://github.com/apache/mesos/blob/261d6ef497383795557aaca5dce426b4482eabea/src/master/http.cpp#L4026

This is due to the inherent race between the master and the allocator actors, 
as illustrated here:

https://github.com/apache/mesos/blob/261d6ef497383795557aaca5dce426b4482eabea/src/master/allocator/mesos/hierarchical.cpp#L992-L1008

This is not ideal, though it should be rare. However, the error is hard for 
users to understand. It would be beneficial for Mesos to retry the reservation 
operation internally on the user's behalf.
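
An internal retry could be as simple as the bounded loop sketched here. `Status`, `retryOnConflict`, and the attempt limit are hypothetical names for illustration, not existing Mesos APIs.

```cpp
#include <functional>

// Hypothetical result codes mirroring the HTTP statuses mentioned above.
enum class Status { Ok = 200, Conflict = 409 };

// Runs `operation` until it returns something other than a conflict or
// until `maxAttempts` is exhausted. Returns the last status seen, or
// `Status::Conflict` if `maxAttempts` is not positive.
Status retryOnConflict(
    const std::function<Status()>& operation,
    int maxAttempts)
{
  Status status = Status::Conflict;
  for (int attempt = 0;
       attempt < maxAttempts && status == Status::Conflict;
       ++attempt) {
    status = operation();
  }
  return status;
}
```

Bounding the number of attempts keeps a persistent conflict visible to the user instead of retrying forever.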





[jira] [Commented] (MESOS-9725) Perform incremental sorting in the random sorter.

2019-05-08 Thread Meng Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835725#comment-16835725
 ] 

Meng Zhu commented on MESOS-9725:
-

Based on a recent internal test, sort() does not take much time, and this 
ticket would introduce some extra complexity.

The review above (https://reviews.apache.org/r/70497/) is nearly ready except 
for one issue that still needs to be resolved. In the review, we used a hashmap 
with doubles as keys. This worries us because of floating-point precision 
issues. One solution is to use rational numbers. 

Given the benefit and complexity of the patch, we decided to shelve it for now. 
Moving this ticket back to `accepted`.

> Perform incremental sorting in the random sorter.
> -
>
> Key: MESOS-9725
> URL: https://issues.apache.org/jira/browse/MESOS-9725
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Meng Zhu
>Assignee: Meng Zhu
>Priority: Major
>  Labels: performance, resource-management
>
> By doing random sampling every time the caller asks for the next client 
> (see MESOS-9722), we could avoid the cost of full shuffling and only pay as we 
> go.
> While the hope is to do each random sampling with O(1) cost, the presence of 
> weights complicates the matter. We will need to pay O(log(n)) for every 
> sample even with fancy data structures like segment trees or binary indexed 
> trees (naive ones will result in O(n) since we need to look at every node's 
> weight). And the current full node shuffling is already optimal (O(n log(n))) 
> if all nodes are picked.
> However, since the number of *distinct* weights is usually much smaller 
> than the number of clients, we can minimize the sample cost by picking 
> a client in two steps:
> Step 1: randomly pick a group of clients that have the same weight by 
> generating a weighted random number.
> Step 2: once a group of clients is chosen, randomly sample a specific client 
> within the group. Since all the clients in the chosen group have the same 
> weight, we do not need to consider any weights.
>  
> Since the number of distinct weights is usually much smaller than the 
> number of clients, this minimizes the cost of generating weighted random 
> numbers, which is linear in the number of distinct weights.
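
The two-step pick from the description can be sketched as follows. `sample` is a hypothetical helper that draws a single client; a real sorter would also have to remove the sampled client before the next draw. Two assumptions are made: weights are positive integers (which sidesteps the double-as-key concern raised above), and a group is picked with probability proportional to its weight times its size so that each client is drawn in proportion to its own weight.

```cpp
#include <cstddef>
#include <iterator>
#include <map>
#include <random>
#include <string>
#include <vector>

// `groups` maps a weight to the clients that share it. Precondition: the
// map is non-empty, every group is non-empty, and all weights are positive.
std::string sample(
    const std::map<int, std::vector<std::string>>& groups,
    std::mt19937& rng)
{
  // Step 1: pick a group. The cost is linear in the number of distinct
  // weights, not in the number of clients.
  std::vector<double> groupWeights;
  for (const auto& group : groups) {
    groupWeights.push_back(
        static_cast<double>(group.first) * group.second.size());
  }

  std::discrete_distribution<std::size_t> pickGroup(
      groupWeights.begin(), groupWeights.end());

  auto chosen = groups.begin();
  std::advance(chosen, pickGroup(rng));

  // Step 2: all clients in the group share one weight, so a uniform pick
  // is enough.
  std::uniform_int_distribution<std::size_t> pickClient(
      0, chosen->second.size() - 1);

  return chosen->second[pickClient(rng)];
}
```

For example, with one client of weight 1 and two clients of weight 3, each weight-3 client should be drawn about three times as often as the weight-1 client.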





[jira] [Commented] (MESOS-9722) Refactor the sorter interface to enable lazy sorting.

2019-05-08 Thread Meng Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835715#comment-16835715
 ] 

Meng Zhu commented on MESOS-9722:
-

Based on a recent internal test, sort() does not take much time, and this 
ticket would introduce some extra complexity. 

The review above (https://reviews.apache.org/r/70419) is nearly ready, but we 
decided to shelve it for now. Moving this ticket back to `accepted`.

> Refactor the sorter interface to enable lazy sorting.
> -
>
> Key: MESOS-9722
> URL: https://issues.apache.org/jira/browse/MESOS-9722
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Meng Zhu
>Assignee: Meng Zhu
>Priority: Major
>  Labels: performance, resource-management
>
> Currently, the only way to get sorted clients from the sorter is through:
> {noformat}
> vector<string> Sorter::sort()
> {noformat}
> This sorts all the active clients in the tree and returns all of them in a 
> single vector. This is inefficient if the caller ends up needing only a few 
> clients (e.g. when allocating one agent, only one or a few roles are 
> allocated).
> We could refactor the interface to return an iterator-like handle so that 
> callers can query the next client in the sorting order. This would pave 
> the way for lazy sorting (i.e. only get the nth client) and improve 
> performance.
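
One way to picture such an iterator-like handle is a heap-backed wrapper, sketched below. `SortedClients` and its methods are hypothetical names, and ordering by ascending share is an assumption; the shelved review may take a different approach.

```cpp
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical handle that yields clients in sorted order on demand.
class SortedClients
{
public:
  using Entry = std::pair<double, std::string>; // (share, client)

  // Building the heap is O(n); no full sort happens up front.
  explicit SortedClients(std::vector<Entry> shares)
    : heap(std::greater<Entry>(), std::move(shares)) {}

  bool hasNext() const { return !heap.empty(); }

  // Returns the client with the lowest remaining share in O(log n).
  // Precondition: hasNext() is true.
  std::string next()
  {
    std::string client = heap.top().second;
    heap.pop();
    return client;
  }

private:
  std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
};
```

A caller that needs only the first k clients pays O(n + k log(n)) instead of the O(n log(n)) of a full sort.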





[jira] [Created] (MESOS-9759) Log required quota headroom and available quota headroom in the allocator.

2019-05-01 Thread Meng Zhu (JIRA)
Meng Zhu created MESOS-9759:
---

 Summary: Log required quota headroom and available quota headroom 
in the allocator.
 Key: MESOS-9759
 URL: https://issues.apache.org/jira/browse/MESOS-9759
 Project: Mesos
  Issue Type: Improvement
  Components: allocation
Reporter: Meng Zhu


This would ease the debugging of allocation issues.





[jira] [Assigned] (MESOS-9759) Log required quota headroom and available quota headroom in the allocator.

2019-05-01 Thread Meng Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Meng Zhu reassigned MESOS-9759:
---

Assignee: Meng Zhu

> Log required quota headroom and available quota headroom in the allocator.
> --
>
> Key: MESOS-9759
> URL: https://issues.apache.org/jira/browse/MESOS-9759
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Meng Zhu
>Assignee: Meng Zhu
>Priority: Major
>  Labels: resource-management
>
> This would ease the debugging of allocation issues.





[jira] [Created] (MESOS-9758) Take ports out of the roles endpoints.

2019-05-01 Thread Meng Zhu (JIRA)
Meng Zhu created MESOS-9758:
---

 Summary: Take ports out of the roles endpoints.
 Key: MESOS-9758
 URL: https://issues.apache.org/jira/browse/MESOS-9758
 Project: Mesos
  Issue Type: Bug
Reporter: Meng Zhu


It does not make sense to combine ports across agents.





[jira] [Commented] (MESOS-9710) Add tests to ensure random sorter performs correct weighted sorting.

2019-04-30 Thread Meng Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16830774#comment-16830774
 ] 

Meng Zhu commented on MESOS-9710:
-

{noformat}
commit 89c3dd95a421e14044bc91ceb1998ff4ae3883b4
Author: Meng Zhu m...@mesosphere.io
Date:   Sun Apr 7 15:55:42 2019 -0700


Added a test to verify the sort correctness of the random sorter.

Review: https://reviews.apache.org/r/70418
{noformat}


> Add tests to ensure random sorter performs correct weighted sorting.
> 
>
> Key: MESOS-9710
> URL: https://issues.apache.org/jira/browse/MESOS-9710
> Project: Mesos
>  Issue Type: Task
>  Components: allocation
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>Priority: Major
>
> We added tests for the weighted shuffle algorithm, but didn't test that the 
> RandomSorter's sort() function behaves correctly.
> We should also test that hierarchical weights in the random sorter behave 
> correctly.





[jira] [Commented] (MESOS-9724) Flatten the weighted shuffling in the random sorter.

2019-04-25 Thread Meng Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826492#comment-16826492
 ] 

Meng Zhu commented on MESOS-9724:
-

{noformat}
commit 5108f076e6a5c275cae6b124bbcb110bc6785f94
Author: Meng Zhu 
Date:   Wed Apr 24 11:32:38 2019 -0700

Avoided some recalculation in the random sorter.

This patch keeps the sorting related information in the memory
and accompanies a dirty bit with it. This helps to avoid
unnecessary recalculation of this info in `sort()`.

Review: https://reviews.apache.org/r/70430

commit 5a756402ad15cedbc6ccb8fa5de096745967f36f
Author: Meng Zhu 
Date:   Wed Apr 24 10:51:06 2019 -0700

Fixed a bug in the random sorter.

Currently, in the presence of hierarchical roles, the
random sorter shuffles roles level by level and then picks
the active leaf nodes using DFS. This could generate
non-uniform random results since active leaves in a subtree
are always picked together.

This patch fixes the issue by first calculating the relative
weights of each active leaf node and shuffling all of them
only once.

Review: https://reviews.apache.org/r/70429

commit 5e52c686c29819113f42c6bde7d90324673b42dc
Author: Meng Zhu 
Date:   Tue Apr 23 18:44:33 2019 -0700

Added a random sorter helper to find active internal nodes.

Active internal nodes are defined as internal nodes that have
at least one active leaf node.

Review: https://reviews.apache.org/r/70542
{noformat}

> Flatten the weighted shuffling in the random sorter.
> 
>
> Key: MESOS-9724
> URL: https://issues.apache.org/jira/browse/MESOS-9724
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Meng Zhu
>Assignee: Meng Zhu
>Priority: Major
>  Labels: performance, resource-management
>
> Due to the presence of hierarchical weights, the random sorter currently 
> shuffles level-by-level. We should be able to shuffle all the active leaves 
> only once by calculating (and caching) active leaves' relative weights. This 
> should improve the performance in the presence of hierarchical roles. 
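
A sketch of how relative weights for active leaves might be computed is shown below. The `Node` struct is hypothetical, and the normalization rule (a node's weight divided by the total weight of its active siblings, multiplied down the path) is an assumption made for illustration rather than a description of the committed patch.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical role tree node; not the sorter's actual node class.
struct Node
{
  std::string name;
  double weight;
  bool activeLeaf; // Only meaningful for nodes without children.
  std::vector<Node> children;
};

// An internal node is active if it has at least one active leaf below it.
bool isActive(const Node& node)
{
  if (node.children.empty()) {
    return node.activeLeaf;
  }
  for (const Node& child : node.children) {
    if (isActive(child)) {
      return true;
    }
  }
  return false;
}

// Writes the relative weight of every active leaf under `node` into `out`,
// keyed by its path. Repeated `isActive` calls keep the sketch short; a
// real implementation would cache them.
void flatten(
    const Node& node,
    double relative,
    const std::string& path,
    std::map<std::string, double>& out)
{
  if (node.children.empty()) {
    if (node.activeLeaf) {
      out[path] = relative;
    }
    return;
  }

  double total = 0.0;
  for (const Node& child : node.children) {
    if (isActive(child)) {
      total += child.weight;
    }
  }

  for (const Node& child : node.children) {
    if (isActive(child)) {
      flatten(
          child,
          relative * child.weight / total,
          path.empty() ? child.name : path + "/" + child.name,
          out);
    }
  }
}
```

For a root with leaf `a` (weight 1) and internal node `c` (weight 3) holding leaves `d` and `e` (weight 1 each), this yields 0.25 for `a` and 0.375 each for `c/d` and `c/e`, after which a single weighted shuffle over the three leaves suffices.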





[jira] [Created] (MESOS-9738) Add per-framework metrics for offer round trip time.

2019-04-23 Thread Meng Zhu (JIRA)
Meng Zhu created MESOS-9738:
---

 Summary: Add per-framework metrics for offer round trip time.
 Key: MESOS-9738
 URL: https://issues.apache.org/jira/browse/MESOS-9738
 Project: Mesos
  Issue Type: Bug
  Components: allocation
Reporter: Meng Zhu


This would provide more insight into framework responsiveness and help detect 
worrisome behaviors such as offer timeouts, offer hoarding, etc.

One tricky thing is that we need to take Mesos's own queuing delay into 
consideration.





[jira] [Created] (MESOS-9733) Random sorter generates non-uniform result for hierarchical roles.

2019-04-17 Thread Meng Zhu (JIRA)
Meng Zhu created MESOS-9733:
---

 Summary: Random sorter generates non-uniform result for 
hierarchical roles.
 Key: MESOS-9733
 URL: https://issues.apache.org/jira/browse/MESOS-9733
 Project: Mesos
  Issue Type: Bug
  Components: allocation
Reporter: Meng Zhu
Assignee: Meng Zhu


In the presence of hierarchical roles, the random sorter shuffles roles level 
by level and then picks the active leaf nodes using DFS:

https://github.com/apache/mesos/blob/7e7cd8de1121589225049ea33df0624b2a1bd754/src/master/allocator/sorter/random/sorter.cpp#L513-L529

This makes the result less random because subtrees are always picked together. 
For example, a random sorting result such as `[a/., c/d, a/b, …]` is impossible.
 




