[jira] [Assigned] (MESOS-5795) Add Nvidia GPU support in the docker containerizer
[ https://issues.apache.org/jira/browse/MESOS-5795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Meng Zhu reassigned MESOS-5795: --- Assignee: Meng Zhu > Add Nvidia GPU support in the docker containerizer > -- > > Key: MESOS-5795 > URL: https://issues.apache.org/jira/browse/MESOS-5795 > Project: Mesos > Issue Type: Epic > Components: containerization, docker >Reporter: Kevin Klues >Assignee: Meng Zhu >Priority: Major > Labels: gpu, mesosphere > > In order to support Nvidia GPUs with docker containers in Mesos, we need to > be able to consolidate all Nvidia libraries into a common volume and inject > that volume into the container. This tracks the support in the docker > containerizer. The Mesos containerizer support has already been completed in > MESOS-5401. > More info on why this is necessary here: > https://github.com/NVIDIA/nvidia-docker/ -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (MESOS-10029) Quota limits may be breached when serving operations.
Meng Zhu created MESOS-10029: Summary: Quota limits may be breached when serving operations. Key: MESOS-10029 URL: https://issues.apache.org/jira/browse/MESOS-10029 Project: Mesos Issue Type: Bug Reporter: Meng Zhu Currently, quota limits are only enforced during the offer stage in the allocator. For other resource consumption events, e.g. operator-initiated operations (such as reserving resources for a role), the limit logic is not checked. This may lead to a breach of quota limits. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (MESOS-10028) Mesos failed to build due to error C3493 on windows with MSVC
[ https://issues.apache.org/jira/browse/MESOS-10028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Meng Zhu reassigned MESOS-10028: Assignee: Meng Zhu > Mesos failed to build due to error C3493 on windows with MSVC > - > > Key: MESOS-10028 > URL: https://issues.apache.org/jira/browse/MESOS-10028 > Project: Mesos > Issue Type: Bug > Components: build >Affects Versions: master > Environment: VS 2017 + Windows Server 2016 >Reporter: LinGao >Assignee: Meng Zhu >Priority: Major > Attachments: log_x64_build.log > > > Mesos failed to build due to error C3493: 'childRoleLength' cannot be > implicitly captured because no default capture mode has been specified on > Windows using MSVC. It can first be reproduced at revision 69e92ae on the master > branch. Could you please take a look at this issue? Thanks a lot! > > Reproduce steps: > 1. git clone -c core.autocrlf=true https://github.com/apache/mesos > D:\mesos\src > 2. Open a VS 2017 x64 command prompt as admin and browse to D:\mesos > 3. cd src > 4. .\bootstrap.bat > 5. cd .. > 6. mkdir build_x64 && pushd build_x64 > 7. cmake ..\src -G "Visual Studio 15 2017 Win64" > -DCMAKE_SYSTEM_VERSION=10.0.17134.0 -DENABLE_LIBEVENT=1 > -DHAS_AUTHENTICATION=0 -DPATCHEXE_PATH="C:\gnuwin32\bin" -T host=x64 > 8. msbuild Mesos.sln /p:Configuration=Debug /p:Platform=x64 /maxcpucount:4 > /t:Rebuild > > ErrorMessage: > D:\mesos\src\src\tests\hierarchical_allocator_tests.cpp(8455): error C3493: > 'childRoleLength' cannot be > implicitly captured because no default capture > mode has been specified -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10014) `tryUntrackFrameworkUnderRole` check failed in `HierarchicalAllocatorProcess::removeFramework`.
[ https://issues.apache.org/jira/browse/MESOS-10014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957497#comment-16957497 ] Meng Zhu commented on MESOS-10014: -- Hmm, the following log message looks problematic: {noformat} I1018 09:05:14.228754 21394 hierarchical.cpp:955] Added agent e6284079-cb6a-4a47-8f9a-ea9b84ff622a-S0 (ip-172-16-10-17.ec2.internal) with cpus:2; mem:1024; disk:1024; ports:[31000-32000] (offered or allocated: {}) I1018 09:05:14.229159 21394 hierarchical.cpp:1100] Grew agent e6284079-cb6a-4a47-8f9a-ea9b84ff622a-S0 by disk[RAW(,,profile)]:200 (total), { } (used) I1018 09:05:14.229632 21394 hierarchical.cpp:1057] Agent e6284079-cb6a-4a47-8f9a-ea9b84ff622a-S0 (ip-172-16-10-17.ec2.internal) updated with total resources cpus:2; mem:1024; disk:1024; ports:[31000-32000] I1018 09:05:14.230063 21394 hierarchical.cpp:1843] Performed allocation for 1 agents in 128843ns I1018 09:05:14.230569 21391 master.cpp:10926] Recovered orphan operation 71647a26-b5fe-4b97-9162-0abb2785b909 (ID: operation) on agent e6284079-cb6a-4a47-8f9a-ea9b84ff622a-S0 belonging to framework e6284079-cb6a-4a47-8f9a-ea9b84ff622a- in state OPERATION_PENDING I1018 09:05:14.230813 21391 master.cpp:10824] Adding framework e6284079-cb6a-4a47-8f9a-ea9b84ff622a- (default) with roles { } suppressed I1018 09:05:14.230991 21391 master.cpp:8295] Updating framework e6284079-cb6a-4a47-8f9a-ea9b84ff622a- (default) with roles { } suppressed I1018 09:05:14.231298 21390 hierarchical.cpp:1100] Grew agent e6284079-cb6a-4a47-8f9a-ea9b84ff622a-S0 by disk[RAW(,,profile)]:200 (total), { e6284079-cb6a-4a47-8f9a-ea9b84ff622a-: disk(allocated: default-role)[RAW(,,profile)]:200 } (used) {noformat} This happens after the master failover. In particular, there are two `Grew agent ...` indicating two resource providers (each with 200 disk) are added. And the latter one contains *used* 200 disk. 
This is probably the same 200 disk resource printed out above by [~bmahler] I suspect this relates to orphan operations cc/[~greggomann] > `tryUntrackFrameworkUnderRole` check failed in > `HierarchicalAllocatorProcess::removeFramework`. > --- > > Key: MESOS-10014 > URL: https://issues.apache.org/jira/browse/MESOS-10014 > Project: Mesos > Issue Type: Bug > Components: master, test >Affects Versions: 1.10 >Reporter: Andrei Budnik >Priority: Major > Labels: flaky-test, resource-management > Attachments: AgentPendingOperationAfterMasterFailover-badrun.txt > > > `ContentType/OperationReconciliationTest.AgentPendingOperationAfterMasterFailover/0` > test failed: > {code:java} > F1018 09:05:14.310616 21391 hierarchical.cpp:745] Check failed: > tryUntrackFrameworkUnderRole(framework, role) Framework: > e6284079-cb6a-4a47-8f9a-ea9b84ff622a- role: default-role > *** Check failure stack trace: *** > @ 0x7f40fff0a1f6 google::LogMessage::Fail() > @ 0x7f40fff0a14f google::LogMessage::SendToLog() > @ 0x7f40fff09a91 google::LogMessage::Flush() > @ 0x7f40fff0d12f google::LogMessageFatal::~LogMessageFatal() > @ 0x7f410fd828ac > mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::removeFramework() > @ 0x186b29f > _ZZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS1_11FrameworkIDES8_EEvRKNS_3PIDIT_EEMSA_FvT0_EOT1_ENKUlOS6_PNS_11ProcessBaseEE_clESJ_SL_ > @ 0x189c273 > _ZN5cpp176invokeIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS3_11FrameworkIDESA_EEvRKNS1_3PIDIT_EEMSC_FvT0_EOT1_EUlOS8_PNS1_11ProcessBaseEE_JS8_SN_EEEDTclcl7forwardISC_Efp_Espcl7forwardIT0_Efp0_EEEOSC_DpOSP_ > @ 0x18990b7 > 
_ZN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS4_11FrameworkIDESB_EEvRKNS2_3PIDIT_EEMSD_FvT0_EOT1_EUlOS9_PNS2_11ProcessBaseEE_JS9_St12_PlaceholderILi113invoke_expandISP_St5tupleIJS9_SR_EESU_IJOSO_EEJLm0ELm1DTcl6invokecl7forwardISD_Efp_Espcl6expandcl3getIXT2_EEcl7forwardISH_Efp0_EEcl7forwardISK_Efp2_OSD_OSH_N5cpp1416integer_sequenceImJXspT2_SL_ > @ 0x1896100 > _ZNO6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS4_11FrameworkIDESB_EEvRKNS2_3PIDIT_EEMSD_FvT0_EOT1_EUlOS9_PNS2_11ProcessBaseEE_IS9_St12_PlaceholderILi1clIISO_EEEDTcl13invoke_expandcl4movedtdefpT1fEcl4movedtdefpT10bound_argsEcvN5cpp1416integer_sequenceImILm0ELm1_Ecl16forward_as_tuplespcl7forwardIT_Efp_DpOSX_ > @ 0x1895174 >
[jira] [Commented] (MESOS-10006) Crash in Sorter: "Check failed: resources.contains(slaveId)"
[ https://issues.apache.org/jira/browse/MESOS-10006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944691#comment-16944691 ] Meng Zhu commented on MESOS-10006: -- Debug patch landed in master and 1.9.x, 1.8.x (will be included in 1.9.1 and 1.8.2) {noformat} commit 3457771b42993c85e3da3c4550b233f61b14bc99 (origin/master, apache/master, master, check_slaveID) Author: Meng Zhu Date: Fri Oct 4 10:48:40 2019 -0400 Made `CHECK` in sorter print out more info upon failure. Review: https://reviews.apache.org/r/71581 {noformat} > Crash in Sorter: "Check failed: resources.contains(slaveId)" > > > Key: MESOS-10006 > URL: https://issues.apache.org/jira/browse/MESOS-10006 > Project: Mesos > Issue Type: Bug > Components: master >Affects Versions: 1.1.0, 1.4.1, 1.9.0 > Environment: Ubuntu Bionic 18.04, Mesos 1.1.0, 1.4.1, 1.9.0 (logs are > from 1.9.0). >Reporter: Terra Field >Priority: Major > Attachments: mesos-master.log.gz > > > We've hit a similar exception on 3 different versions of the Mesos master > (the line #/file name changes but the Check failed is the same), usually when > under very high load: > {noformat} > F1003 22:06:54.463502 8579 sorter.hpp:339] Check failed: > resources.contains(slaveId) > {noformat} > This particular occurrence happened after the election of a new master that > was then stuck doing framework update broadcasts, as documented in > MESOS-10005. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10006) Crash in Sorter: "Check failed: resources.contains(slaveId)"
[ https://issues.apache.org/jira/browse/MESOS-10006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944592#comment-16944592 ] Meng Zhu commented on MESOS-10006: -- Cross-posting from slack: thanks for the ticket! Unfortunately, the log does not contain much useful information. Alas, we did not print out the slaveID upon check failure. Sent out a patch to print more info upon check failure: https://reviews.apache.org/r/71581. Consider backporting. Also, a hunch diagnosis: such CHECK failures on sorter function input args are almost always bugs on the caller side; in this case, most likely some race/inconsistency between the master and the allocator during recovery. > Crash in Sorter: "Check failed: resources.contains(slaveId)" > > > Key: MESOS-10006 > URL: https://issues.apache.org/jira/browse/MESOS-10006 > Project: Mesos > Issue Type: Bug > Components: master >Affects Versions: 1.1.0, 1.4.1, 1.9.0 > Environment: Ubuntu Bionic 18.04, Mesos 1.1.0, 1.4.1, 1.9.0 (logs are > from 1.9.0). >Reporter: Terra Field >Priority: Major > Attachments: mesos-master.log.gz > > > We've hit a similar exception on 3 different versions of the Mesos master > (the line #/file name changes but the Check failed is the same), usually when > under very high load: > {noformat} > F1003 22:06:54.463502 8579 sorter.hpp:339] Check failed: > resources.contains(slaveId) > {noformat} > This particular occurrence happened after the election of a new master that > was then stuck doing framework update broadcasts, as documented in > MESOS-10005. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-3938) Consider allowing setting quotas for the default '*' role.
[ https://issues.apache.org/jira/browse/MESOS-3938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16942888#comment-16942888 ] Meng Zhu commented on MESOS-3938: - {noformat} commit 270a3dce490d5b334f9a0011ea416ffc42e187e4 Author: Meng Zhu Date: Wed Sep 25 15:41:07 2019 -0700 Documented setting quota on the default role in the release note. Review: https://reviews.apache.org/r/71548 commit 4dd00c6ad3d8af1d38d496a51f5407ee0e4b1970 Author: Meng Zhu Date: Tue Sep 10 11:51:09 2019 -0700 Allowed setting quota the default "*" role. There is no clear argument against setting quota on the default "*" role. This patch allows doing so. Tests are updated to check against regressions. Review: https://reviews.apache.org/r/71464 {noformat} > Consider allowing setting quotas for the default '*' role. > -- > > Key: MESOS-3938 > URL: https://issues.apache.org/jira/browse/MESOS-3938 > Project: Mesos > Issue Type: Task >Reporter: Alex R >Assignee: Meng Zhu >Priority: Major > Labels: resource-management > > Investigate use cases and implications of the possibility to set quota for > the '*' role. For example, having quota for '*' set can effectively reduce > the scope of the quota capacity heuristic. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-8503) Improve UI when displaying frameworks with many roles.
[ https://issues.apache.org/jira/browse/MESOS-8503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938030#comment-16938030 ] Meng Zhu commented on MESOS-8503: - {noformat} commit aed0b871479ecb1ee36df334c46203b75d682a7e Author: Andrei Sekretenko Date: Wed Sep 25 13:11:08 2019 -0700 Fixed Javascript linting and IE compatibility of the UI roles tree. Review: https://reviews.apache.org/r/71541/ {noformat} > Improve UI when displaying frameworks with many roles. > -- > > Key: MESOS-8503 > URL: https://issues.apache.org/jira/browse/MESOS-8503 > Project: Mesos > Issue Type: Task >Reporter: Armand Grillet >Assignee: Andrei Sekretenko >Priority: Major > Labels: resource-management > Fix For: 1.10 > > Attachments: Screen Shot 2018-01-29 à 10.38.05.png > > > The /frameworks UI endpoint displays all the roles of each framework in a > table: > !Screen Shot 2018-01-29 à 10.38.05.png! > This is not readable if a framework has many roles. We thus need to provide a > solution to only display a few roles per framework and show more when a user > wants to see all of them. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (MESOS-9975) Sorter may leak clients.
Meng Zhu created MESOS-9975: --- Summary: Sorter may leak clients. Key: MESOS-9975 URL: https://issues.apache.org/jira/browse/MESOS-9975 Project: Mesos Issue Type: Bug Components: allocation Reporter: Meng Zhu In MESOS-9015, we allowed resource quantities to change when updating an existing allocation. When the allocation is updated to empty, however, we forget to remove the client from the map in `sorter::update()` if the `newAllocation` is `empty()`. https://github.com/apache/mesos/blob/master/src/master/allocator/mesos/sorter/drf/sorter.hpp#L382-L384 The above case could happen, for example, when a CSI volume with a stale profile is destroyed: it would then be better to convert it into an empty resource, since the disk space is no longer available. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (MESOS-9975) Sorter may leak clients.
[ https://issues.apache.org/jira/browse/MESOS-9975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Meng Zhu reassigned MESOS-9975: --- Assignee: Meng Zhu > Sorter may leak clients. > > > Key: MESOS-9975 > URL: https://issues.apache.org/jira/browse/MESOS-9975 > Project: Mesos > Issue Type: Bug > Components: allocation >Reporter: Meng Zhu >Assignee: Meng Zhu >Priority: Major > Labels: resource-management > > In MESOS-9015, we allowed resource quantities to change when updating an > existing allocation. When the allocation is updated to empty, however, we > forget to remove the client in the map in the `sorter::update()` if the > `newAllocation` is `empty()`. > https://github.com/apache/mesos/blob/master/src/master/allocator/mesos/sorter/drf/sorter.hpp#L382-L384 > The above case could happen, for example, when a CSI volume with a stale > profile is destroyed, it would be better to convert it into an empty resource > since the disk space is no longer available. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-621) `HierarchicalAllocatorProcess::removeSlave` doesn't properly handle framework allocations/resources
[ https://issues.apache.org/jira/browse/MESOS-621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928909#comment-16928909 ] Meng Zhu commented on MESOS-621: Added tracking of allocated or offered resources in the allocator: {noformat} commit 783fd45c548fdff0c5c4812bc8e92c3aed202e06 Author: Meng Zhu m...@mesosphere.io Date: Sat Sep 7 16:01:51 2019 -0700 Tracked offered and allocated resources in the role tree. This helps simplify the quota tracking logic and also paves the way to reducing duplicated state in the sorter. Also documented that shared resources must be uniquely identifiable. Small performance degradation when making allocations due to duplicated map construction in `(un)trackAllocatedResources`. This will be removed once the sorter is embedded in the role tree. Benchmark `LargeAndSmallQuota/2`: Master: Added 3000 agents in 80.648188ms Added 3000 frameworks in 19.7006984secs Benchmark setup: 3000 agents, 3000 roles, 3000 frameworks, with drf sorter Made 3500 allocations in 16.044274434secs Made 0 allocation in 14.476429451secs Master + this patch: Added 3000 agents in 80.110817ms Added 3000 frameworks in 17.25974094secs Benchmark setup: 3000 agents, 3000 roles, 3000 frameworks, with drf sorter Made 3500 allocations in 16.91971379secs Made 0 allocation in 13.784476154secs Review: https://reviews.apache.org/r/71460 commit 2ec34ca5951a5a8da3d1ab93839cce68e815c1d5 Author: Meng Zhu Date: Tue Sep 3 13:31:36 2019 -0700 Added tracking of framework allocations in the allocator Slave class. This would simplify the tracking logic regarding resource allocations in the allocator. See MESOS-9182.
Negligible performance impact: Master: BENCHMARK_HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2 Added 3000 agents in 77.999483ms Added 3000 frameworks in 16.736076171secs Benchmark setup: 3000 agents, 3000 roles, 3000 frameworks, with drf sorter Made 3500 allocations in 15.342376944secs Made 0 allocation in 13.96720191secs Master + this patch: BENCHMARK_HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2 Added 3000 agents in 83.597048ms Added 3000 frameworks in 16.757011745secs Benchmark setup: 3000 agents, 3000 roles, 3000 frameworks, with drf sorter Made 3500 allocations in 15.566366241secs Made 0 allocation in 14.022591871secs Review: https://reviews.apache.org/r/68508 {noformat} > `HierarchicalAllocatorProcess::removeSlave` doesn't properly handle framework > allocations/resources > --- > > Key: MESOS-621 > URL: https://issues.apache.org/jira/browse/MESOS-621 > Project: Mesos > Issue Type: Bug > Components: allocation >Reporter: Vinod Kone >Assignee: Meng Zhu >Priority: Major > Labels: mesosphere, resource-management, tech-debt > > Currently a slaveRemoved() simply removes the slave from 'slaves' map and > slave's resources from 'roleSorter'. Looking at resourcesRecovered(), more > things need to be done when a slave is removed (e.g., framework > unallocations). > It would be nice to fix this and have a test for this. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Assigned] (MESOS-3938) Consider allowing setting quotas for the default '*' role.
[ https://issues.apache.org/jira/browse/MESOS-3938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Meng Zhu reassigned MESOS-3938: --- Assignee: Meng Zhu > Consider allowing setting quotas for the default '*' role. > -- > > Key: MESOS-3938 > URL: https://issues.apache.org/jira/browse/MESOS-3938 > Project: Mesos > Issue Type: Task >Reporter: Alexander Rukletsov >Assignee: Meng Zhu >Priority: Major > > Investigate use cases and implications of the possibility to set quota for > the '*' role. For example, having quota for '*' set can effectively reduce > the scope of the quota capacity heuristic. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Comment Edited] (MESOS-9242) Resources wrapper loses shared resource count information.
[ https://issues.apache.org/jira/browse/MESOS-9242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16925052#comment-16925052 ] Meng Zhu edited comment on MESOS-9242 at 9/8/19 5:25 AM: - Another seemingly possible fix is to store shared items as individual objects in the Resources list e.g. 60 disk resources that got shared twice could have two resource with shared info set. However, this has a confusing problem when doing arithmetics: if we add another addable 60 shared disk, should it be kept as a distinct object or combine scalar value with the same object? Looks like we have to live with the count. However, returning `sharedCount` number of a resource object in the iterator also seems less than ideal. It would go against caller's assumption that resource objects are unique. For example, when calculating total scalar quantities, one would expect to simply add scalars with the same resource name together. A better solution seems to expose the shared count i.e. get rid of the `Resource_` wrapper and put `shared_count` as a field in the Resource SharedInfo proto message. was (Author: mzhu): Another seemingly possible fix is to store shared items as individual objects in the Resources list e.g. 60 disk resources that got shared twice could have two resource with shared info set. However, this has a confusing problem when doing arithmetics: if we add another addable 60 shared disk, should it be kept as a distinct object or combine scalar value with the same object? Looks like we have to live with the count. > Resources wrapper loses shared resource count information. 
> -- > > Key: MESOS-9242 > URL: https://issues.apache.org/jira/browse/MESOS-9242 > Project: Mesos > Issue Type: Bug > Components: allocation >Reporter: Benjamin Mahler >Assignee: Meng Zhu >Priority: Major > Labels: resource-management > > The Resources wrapper stores a {{Resource_}} wrapper type that stores > multiple copies of the a shared resource in a single {{Resource_}} with a > shared count. > On the output paths Resources, we lose the shared counts since we convert > {{Resource_}} directly back into a single {{Resource}}, even if the shared > count was > 1. > We need to fix this in the following: > * Implicit cast operator back to repeated ptr field of resource, this is easy > to adjust. > * Resource iteration, since we only expose const iteration, it should be > possible to use an iterator adaptor to return the shared resource {{count}} > times rather than just once when there are multiple copies. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (MESOS-9242) Resources wrapper loses shared resource count information.
[ https://issues.apache.org/jira/browse/MESOS-9242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16925052#comment-16925052 ] Meng Zhu commented on MESOS-9242: - Another seemingly possible fix is to store shared items as individual objects in the Resources list e.g. 60 disk resources that got shared twice could have two resource with shared info set. However, this has a confusing problem when doing arithmetics: if we add another addable 60 shared disk, should it be kept as a distinct object or combine scalar value with the same object? Looks like we have to live with the count. > Resources wrapper loses shared resource count information. > -- > > Key: MESOS-9242 > URL: https://issues.apache.org/jira/browse/MESOS-9242 > Project: Mesos > Issue Type: Bug > Components: allocation >Reporter: Benjamin Mahler >Assignee: Meng Zhu >Priority: Major > Labels: resource-management > > The Resources wrapper stores a {{Resource_}} wrapper type that stores > multiple copies of the a shared resource in a single {{Resource_}} with a > shared count. > On the output paths Resources, we lose the shared counts since we convert > {{Resource_}} directly back into a single {{Resource}}, even if the shared > count was > 1. > We need to fix this in the following: > * Implicit cast operator back to repeated ptr field of resource, this is easy > to adjust. > * Resource iteration, since we only expose const iteration, it should be > possible to use an iterator adaptor to return the shared resource {{count}} > times rather than just once when there are multiple copies. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Assigned] (MESOS-9242) Resources wrapper loses shared resource count information.
[ https://issues.apache.org/jira/browse/MESOS-9242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Meng Zhu reassigned MESOS-9242: --- Assignee: Meng Zhu > Resources wrapper loses shared resource count information. > -- > > Key: MESOS-9242 > URL: https://issues.apache.org/jira/browse/MESOS-9242 > Project: Mesos > Issue Type: Bug > Components: allocation >Reporter: Benjamin Mahler >Assignee: Meng Zhu >Priority: Major > > The Resources wrapper stores a {{Resource_}} wrapper type that stores > multiple copies of the a shared resource in a single {{Resource_}} with a > shared count. > On the output paths Resources, we lose the shared counts since we convert > {{Resource_}} directly back into a single {{Resource}}, even if the shared > count was > 1. > We need to fix this in the following: > * Implicit cast operator back to repeated ptr field of resource, this is easy > to adjust. > * Resource iteration, since we only expose const iteration, it should be > possible to use an iterator adaptor to return the shared resource {{count}} > times rather than just once when there are multiple copies. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (MESOS-1452) Improve Master::removeOffer to avoid further resource accounting bugs.
[ https://issues.apache.org/jira/browse/MESOS-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924640#comment-16924640 ] Meng Zhu commented on MESOS-1452: - {noformat} commit a8050cafaa5465bd74a2ced1c37bb6b64c735445 Author: Andrei Sekretenko Date: Fri Sep 6 14:15:28 2019 -0700 Separated handling offer validation failure from handling success. This patch refactors the loop through offer IDs in `Master::accept()` into two simpler loops: one loop for the offer validation failure case, another for the case of validation success, thus bringing removal of offers and recovering their resources close together. This is a prerequisite for implementing `rescindOffer()`/ `declineOffer()` in the dependent patch. Review: https://reviews.apache.org/r/71433/ commit 7eb21c41ed255184988298e29644bf7f310c3374 Author: Andrei Sekretenko Date: Fri Sep 6 14:15:38 2019 -0700 Moved `removeOffers()` from `Master::accept()` into `Master::_accept()`. This patch moves offer removal on accept into the deferred continuation that follows authorization (if offers pass validation in `accept()`). Incrementing the `offers_accepted` metric is also moved to `_accept()`. This is a prerequisite for implementing `rescindOffer()` / `declineOffer()` / in the dependent patch. Review: https://reviews.apache.org/r/71434/ Author: Andrei Sekretenko Date: Fri Sep 6 14:15:54 2019 -0700 Replaced removeOffer + recoverResources pairs with specialized helpers. This patch adds helper methods `Master::rescindOffer()` / `Master::discardOffer()` that recover offer's resources in the allocator and remove the offer, and replaces paired calls of `removeOffer()` + `allocator->recoverResources()` with these helpers. Review: https://reviews.apache.org/r/71436/ {noformat} > Improve Master::removeOffer to avoid further resource accounting bugs. 
> -- > > Key: MESOS-1452 > URL: https://issues.apache.org/jira/browse/MESOS-1452 > Project: Mesos > Issue Type: Improvement >Reporter: Benjamin Mahler >Priority: Major > > Per comments on this review: https://reviews.apache.org/r/21750/ > We've had numerous bugs around resource accounting in the master due to the > trickiness of removing offers in the Master code. > There are a few ways to improve this: > 1. Add multiple offer methods to differentiate semantics: > {code} > useOffer(offerId); > rescindOffer(offerId); > discardOffer(offerId); > {code} > 2. Add an enum to removeOffer to differentiate removal semantics: > {code} > removeOffer(offerId, USE); > removeOffer(offerId, RESCIND); > removeOffer(offerId, DISCARD); > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Assigned] (MESOS-1452) Improve Master::removeOffer to avoid further resource accounting bugs.
[ https://issues.apache.org/jira/browse/MESOS-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Meng Zhu reassigned MESOS-1452: --- Assignee: Andrei Sekretenko > Improve Master::removeOffer to avoid further resource accounting bugs. > -- > > Key: MESOS-1452 > URL: https://issues.apache.org/jira/browse/MESOS-1452 > Project: Mesos > Issue Type: Improvement >Reporter: Benjamin Mahler >Assignee: Andrei Sekretenko >Priority: Major > > Per comments on this review: https://reviews.apache.org/r/21750/ > We've had numerous bugs around resource accounting in the master due to the > trickiness of removing offers in the Master code. > There are a few ways to improve this: > 1. Add multiple offer methods to differentiate semantics: > {code} > useOffer(offerId); > rescindOffer(offerId); > discardOffer(offerId); > {code} > 2. Add an enum to removeOffer to differentiate removal semantics: > {code} > removeOffer(offerId, USE); > removeOffer(offerId, RESCIND); > removeOffer(offerId, DISCARD); > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (MESOS-9962) Mesos may report completed task as running in the state.
Meng Zhu created MESOS-9962: --- Summary: Mesos may report completed task as running in the state. Key: MESOS-9962 URL: https://issues.apache.org/jira/browse/MESOS-9962 Project: Mesos Issue Type: Bug Components: agent Reporter: Meng Zhu When the following steps occur: 1) A graceful shutdown is initiated on the agent (i.e. SIGUSR1 or /master/machine/down). 2) The executor is sent a kill, and the agent counts down on executor_shutdown_grace_period. 3) The executor exits, before all terminal status updates reach the agent. This is more likely if executor_shutdown_grace_period passes. This results in a completed executor, with non-terminal tasks (according to status updates). This would produce a confusing report where completed tasks are still TASK_RUNNING. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (MESOS-9750) Agent V1 GET_STATE response may report a complete executor's tasks as non-terminal after a graceful agent shutdown
[ https://issues.apache.org/jira/browse/MESOS-9750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922871#comment-16922871 ] Meng Zhu commented on MESOS-9750: - Note: while this ticket makes the completed task with the non-terminal status get listed in the right place (i.e. under completed tasks), it would still result in weird behavior where a completed task has a non-terminal status, e.g. TASK_RUNNING. > Agent V1 GET_STATE response may report a complete executor's tasks as > non-terminal after a graceful agent shutdown > -- > > Key: MESOS-9750 > URL: https://issues.apache.org/jira/browse/MESOS-9750 > Project: Mesos > Issue Type: Bug > Components: agent, executor >Affects Versions: 1.6.0, 1.7.0, 1.8.0 >Reporter: Joseph Wu >Assignee: Joseph Wu >Priority: Major > Labels: foundations > Fix For: 1.7.3, 1.8.1, 1.9.0 > > > When the following steps occur: > 1) A graceful shutdown is initiated on the agent (i.e. SIGUSR1 or > /master/machine/down). > 2) The executor is sent a kill, and the agent counts down on > {{executor_shutdown_grace_period}}. > 3) The executor exits, before all terminal status updates reach the agent. > This is more likely if {{executor_shutdown_grace_period}} passes. > This results in a completed executor, with non-terminal tasks (according to > status updates). > When the agent starts back up, the completed executor will be recovered and > shows up correctly as a completed executor in {{/state}}. However, if you > fetch the V1 {{GET_STATE}} result, there will be an entry in > {{launched_tasks}} even though nothing is running. > {code} > get_tasks { > launched_tasks { > name: "test-task" > task_id { > value: "dff5a155-47f1-4a71-9b92-30ca059ab456" > } > framework_id { > value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-" > } > executor_id { > value: "default" > } > agent_id { > value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-S0" > } > state: TASK_RUNNING > resources { ... } > resources { ... } > resources { ... } > resources { ... 
} > statuses { > task_id { > value: "dff5a155-47f1-4a71-9b92-30ca059ab456" > } > state: TASK_RUNNING > agent_id { > value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-S0" > } > timestamp: 1556674758.2175469 > executor_id { > value: "default" > } > source: SOURCE_EXECUTOR > uuid: "xPmn\234\236F&\235\\d\364\326\323\222\224" > container_status { ... } > } > } > } > get_executors { > completed_executors { > executor_info { > executor_id { > value: "default" > } > command { > value: "" > } > framework_id { > value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-" > } > } > } > } > get_frameworks { > completed_frameworks { > framework_info { > user: "user" > name: "default" > id { > value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-" > } > checkpoint: true > hostname: "localhost" > principal: "test-principal" > capabilities { > type: MULTI_ROLE > } > capabilities { > type: RESERVATION_REFINEMENT > } > roles: "*" > } > } > } > {code} > This happens because we combine executors and completed executors when > constructing the response. The terminal task(s) with non-terminal updates > appear under completed executors. > https://github.com/apache/mesos/blob/89c3dd95a421e14044bc91ceb1998ff4ae3883b4/src/slave/http.cpp#L1734-L1756 -- This message was sent by Atlassian Jira (v8.3.2#803003)
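The described cause — merging executors and completed executors when assembling the response — can be sketched as follows. This is a minimal, hypothetical model (the `Task`/`GetTasksResponse` types here are illustrative stand-ins, not Mesos' actual protobuf types): partition tasks by whether their executor has completed, rather than by the task's last status update, so a task whose executor is gone never appears under `launched_tasks`.

```cpp
#include <string>
#include <vector>

// Illustrative, simplified model of the agent's GET_STATE assembly.
struct Task {
  std::string id;
  std::string state;       // e.g. "TASK_RUNNING", "TASK_KILLED"
  bool executorCompleted;  // whether the owning executor has terminated
};

struct GetTasksResponse {
  std::vector<Task> launchedTasks;
  std::vector<Task> completedTasks;
};

// Partition on executor completion rather than on the task's last
// status update: a task whose executor is gone is not running, even
// if the terminal update never reached the agent.
GetTasksResponse buildGetTasks(const std::vector<Task>& tasks) {
  GetTasksResponse response;
  for (const Task& task : tasks) {
    if (task.executorCompleted) {
      response.completedTasks.push_back(task);
    } else {
      response.launchedTasks.push_back(task);
    }
  }
  return response;
}
```

As the follow-up comment notes, a task classified this way can still carry a non-terminal status such as TASK_RUNNING; this sketch only fixes where the task is listed, not its recorded state.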
[jira] [Created] (MESOS-9961) Agent could fail to report completed tasks.
Meng Zhu created MESOS-9961: --- Summary: Agent could fail to report completed tasks. Key: MESOS-9961 URL: https://issues.apache.org/jira/browse/MESOS-9961 Project: Mesos Issue Type: Bug Components: agent Reporter: Meng Zhu When an agent reregisters with a master, we don't report completed executors for active frameworks. We only report completed executors if the framework is also completed on the agent: https://github.com/apache/mesos/blob/1.7.x/src/slave/slave.cpp#L1785-L1832 -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (MESOS-9806) Address allocator performance regression due to the addition of quota limits.
[ https://issues.apache.org/jira/browse/MESOS-9806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914674#comment-16914674 ] Meng Zhu commented on MESOS-9806: - As of now, the performance is close to 1.8.1 even with the addition of limits enforcement. There will be further improvements as we deprecate the framework sorter and optimize the role sorter (MESOS-9942 and MESOS-9943). > Address allocator performance regression due to the addition of quota limits. > - > > Key: MESOS-9806 > URL: https://issues.apache.org/jira/browse/MESOS-9806 > Project: Mesos > Issue Type: Improvement > Components: allocation >Reporter: Meng Zhu >Assignee: Meng Zhu >Priority: Critical > Labels: resource-management > > In MESOS-9802, we removed the quota role sorter, which was tech debt. > However, this slows down the allocator. The problem is that in the first > stage, even though a cluster might have no active roles with non-default > quota, the allocator now has to sort and go through each and every role > in the cluster. Benchmark results show that for 1k roles with 2k frameworks, > the allocator could experience ~50% performance degradation. > There are a couple of ways to address this issue. For example, we could make > the sorter aware of quota and add a method, say `sortQuotaRoles`, to return > all the roles with non-default quota. Alternatively, an even better approach > would be to deprecate the sorter concept and just have two standalone > functions, e.g. sortRoles() and sortQuotaRoles(), that take in the role tree > structure (which does not yet exist in the allocator) and return the sorted roles. > In addition, when implementing MESOS-8068, we need to do more during the > allocation cycle. In particular, we need to call shrink many more times than > before. These all contribute to the performance slowdown. Specifically, for > the quota-oriented benchmark > `HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2` we can observe > a 2-3x slowdown compared to the previous release (1.8.1): > Current master: > QuotaParam/BENCHMARK_HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2 > Benchmark setup: 3000 agents, 3000 roles, 3000 frameworks, with drf sorter > Made 3500 allocations in 32.051382735secs > Made 0 allocation in 27.976022773secs > 1.8.1: > HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2 > Made 3500 allocations in 13.810811063secs > Made 0 allocation in 9.885972984secs -- This message was sent by Atlassian Jira (v8.3.2#803003)
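The proposed standalone `sortRoles()`/`sortQuotaRoles()` split can be sketched roughly as below. This is a hedged illustration, not the allocator's real code: the `Role` struct, the dominant-share field, and the map-based "role tree" are stand-ins for structures the ticket says do not yet exist.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Illustrative stand-in for a node in the proposed role tree.
struct Role {
  std::string name;
  double share;   // e.g. DRF dominant share
  bool hasQuota;  // non-default quota configured for this role
};

// Sort every role by dominant share (ascending), the order in which
// the allocation stage would consume them.
std::vector<std::string> sortRoles(const std::map<std::string, Role>& roles) {
  std::vector<const Role*> sorted;
  for (const auto& [name, role] : roles) sorted.push_back(&role);
  std::stable_sort(sorted.begin(), sorted.end(),
      [](const Role* a, const Role* b) { return a->share < b->share; });

  std::vector<std::string> result;
  for (const Role* r : sorted) result.push_back(r->name);
  return result;
}

// Only roles with non-default quota: this lets the quota stage skip
// the (potentially large) set of quota-less roles entirely, which is
// the performance win the ticket describes.
std::vector<std::string> sortQuotaRoles(
    const std::map<std::string, Role>& roles) {
  std::map<std::string, Role> filtered;
  for (const auto& [name, role] : roles) {
    if (role.hasQuota) filtered.emplace(name, role);
  }
  return sortRoles(filtered);
}
```

With, say, 1k roles of which only a handful have quota, the first allocation stage iterates only that handful instead of every role in the cluster.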
[jira] [Commented] (MESOS-9806) Address allocator performance regression due to the addition of quota limits.
[ https://issues.apache.org/jira/browse/MESOS-9806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914673#comment-16914673 ] Meng Zhu commented on MESOS-9806: - All the optimizations improved the performance by 50%: 1.8.1 HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2 Made 3500 allocations in 13.810811063secs Made 0 allocation in 9.885972984secs Before the optimization: QuotaParam/BENCHMARK_HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2 Benchmark setup: 3000 agents, 3000 roles, 3000 frameworks, with drf sorter Made 3500 allocations in 32.051382735secs Made 0 allocation in 27.976022773secs After the optimization: HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2 Made 3500 allocations in 15.385276405secs Made 0 allocation in 13.718502414secs -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (MESOS-9806) Address allocator performance regression due to the addition of quota limits.
[ https://issues.apache.org/jira/browse/MESOS-9806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914672#comment-16914672 ] Meng Zhu commented on MESOS-9806: - Small vector optimization for ResourceQuantities, ResourceLimits and Resources: {noformat} commit 73033130de7872c6f240b9b05dced039d7666138 Author: Meng Zhu Date: Thu Aug 22 17:19:30 2019 -0700 Used boost `small_vector` in `Resources`. Master + previous patch: *HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2 Made 3500 allocations in 16.307044003secs Made 0 allocation in 14.948262599secs Master + previous patch + this patch: *HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2 Made 3500 allocations in 15.385276405secs Made 0 allocation in 13.718502414secs Review: https://reviews.apache.org/r/71357 commit 95201cbe4dc87eae2fde5754d16f5effbb6c1974 Author: Meng Zhu Date: Thu Aug 22 16:55:34 2019 -0700 Used boost `small_vector` in Resource Quantities and Limits. Master + previous patch *HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2 Made 3500 allocations in 16.831380548secs Made 0 allocation in 15.102885644secs Master + previous patch + this patch: *HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2 Made 3500 allocations in 16.307044003secs Made 0 allocation in 14.948262599secs Review: https://reviews.apache.org/r/71355 commit 25070f232a9bb97d1b78f8a7e5b774bbd50654f9 Author: Meng Zhu Date: Thu Aug 22 16:54:42 2019 -0700 Updated the boost library. This update includes adding `container/small_vector.hpp`. Review: https://reviews.apache.org/r/71356 {noformat} -- This message was sent by Atlassian Jira (v8.3.2#803003)
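The patches above use boost's `boost::container::small_vector`. As a minimal stdlib-only illustration of why it helps here (this toy class is not the boost implementation, just the idea): keep up to N elements in inline storage and fall back to the heap only beyond that, so the common case — a Resources or ResourceQuantities object holding only a handful of entries like cpus, mem, disk — never allocates.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Toy small-vector: inline storage for the first N elements,
// heap-backed overflow only past N. (Boost's real small_vector
// additionally moves the inline elements on overflow; omitted here
// for brevity.)
template <typename T, size_t N>
class SmallVector {
public:
  void push_back(const T& value) {
    if (size_ < N) {
      inline_[size_] = value;   // no heap allocation in the common case
    } else {
      overflow_.push_back(value);
    }
    ++size_;
  }

  const T& operator[](size_t i) const {
    return i < N ? inline_[i] : overflow_[i - N];
  }

  size_t size() const { return size_; }
  bool onHeap() const { return !overflow_.empty(); }

private:
  std::array<T, N> inline_{};
  std::vector<T> overflow_;  // only used past N elements
  size_t size_ = 0;
};
```

Avoiding a heap allocation per temporary resource object is what shaves the allocation-cycle times quoted in the commit messages above.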
[jira] [Commented] (MESOS-9806) Address allocator performance regression due to the addition of quota limits.
[ https://issues.apache.org/jira/browse/MESOS-9806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914670#comment-16914670 ] Meng Zhu commented on MESOS-9806: - Optimized the allocation loop {noformat} commit ec6b7b34215e821a63cb79e7d52d94ff08c1e110 Author: Meng Zhu Date: Thu Aug 22 17:54:25 2019 -0700 Optimized the allocation loop. Master: HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2 Made 3500 allocations in 23.37 secs Made 0 allocation in 19.72 secs Master + this patch: HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2 Made 3500 allocations in 16.831380548secs Made 0 allocation in 15.102885644secs Review: https://reviews.apache.org/r/71359 {noformat} -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (MESOS-9943) Dedicate sorter for roles.
Meng Zhu created MESOS-9943: --- Summary: Dedicate sorter for roles. Key: MESOS-9943 URL: https://issues.apache.org/jira/browse/MESOS-9943 Project: Mesos Issue Type: Bug Components: allocation Reporter: Meng Zhu Assignee: Meng Zhu Once MESOS-9942 has landed, we can clean up and optimize the sorter for roles. Specifically, each node in the tree (except the root and virtual leaf nodes) will carry a back pointer to the role tree structure in the allocator. This will eliminate the state duplication and unnecessary tracking that is currently done inside the sorter. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (MESOS-9942) Deprecate framework sorter.
Meng Zhu created MESOS-9942: --- Summary: Deprecate framework sorter. Key: MESOS-9942 URL: https://issues.apache.org/jira/browse/MESOS-9942 Project: Mesos Issue Type: Bug Components: allocation Reporter: Meng Zhu Assignee: Meng Zhu Given the flat structure of frameworks, there is no need to store and sort them in the sorter's tree structure. We should deprecate the framework sorter. This would dedicate the sorter to roles, opening up room for optimization and cleanup. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (MESOS-9940) Framework removal may lead to inconsistent task states between master and agent.
Meng Zhu created MESOS-9940: --- Summary: Framework removal may lead to inconsistent task states between master and agent. Key: MESOS-9940 URL: https://issues.apache.org/jira/browse/MESOS-9940 Project: Mesos Issue Type: Bug Components: master Reporter: Meng Zhu When a framework is removed from the master (say due to disconnection), the master sends a `ShutdownFrameworkMessage` to the agent. At the same time, the master transitions the task statuses to e.g. KILLED. (https://github.com/apache/mesos/blob/master/src/master/master.cpp#L11247-L11291) When the agent gets the shutdown message, it tries to shut down all the executors and destroy all the containers. The tasks' statuses are only updated after all of this is done. (https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L7914-L7922) However, if the executor shutdown gets stuck (e.g. due to a hanging docker daemon), the task status transition will never happen, and the master and agent will have divergent views of these tasks. One consequence is that the master may try to schedule more workloads onto the problematic agent (because it thinks those task resources are freed up). Since there is no overcommit check on the agent, the agent will comply and launch those tasks, leading to over-allocation. One possible solution is to hold off on the master-side status update until the agent is done with the framework shutdown. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
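The proposed fix — deferring the master-side transition until the agent confirms the teardown — can be sketched as follows. This is purely illustrative (the `Master` struct, method names, and string-typed states here are hypothetical stand-ins, not Mesos' actual classes):

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical sketch: the master records a pending shutdown instead
// of transitioning tasks to KILLED the moment ShutdownFrameworkMessage
// is sent, so the task's resources stay accounted as in-use until the
// agent reports the executors are actually gone.
struct Master {
  std::map<std::string, std::string> taskStates;  // taskId -> state
  std::set<std::string> pendingShutdowns;         // frameworkIds awaiting ack

  void shutdownFramework(const std::string& frameworkId) {
    // Send ShutdownFrameworkMessage to the agent (elided)...
    pendingShutdowns.insert(frameworkId);
    // NOTE: no task transition here; the agent may still be killing
    // executors, so the resources are not yet free.
  }

  // Invoked when the agent confirms the framework's executors are gone.
  void onFrameworkShutdownAcked(const std::string& frameworkId,
                                const std::string& taskId) {
    pendingShutdowns.erase(frameworkId);
    taskStates[taskId] = "TASK_KILLED";  // now safe to free resources
  }
};
```

The key property is that a hung docker daemon merely leaves the shutdown pending — the master never believes resources are free that the agent still holds, so it cannot over-allocate onto that agent.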
[jira] [Created] (MESOS-9930) DRF sorter may omit clients in sorting after removing an inactive leaf node.
Meng Zhu created MESOS-9930: --- Summary: DRF sorter may omit clients in sorting after removing an inactive leaf node. Key: MESOS-9930 URL: https://issues.apache.org/jira/browse/MESOS-9930 Project: Mesos Issue Type: Bug Components: allocation Reporter: Meng Zhu Assignee: Meng Zhu The sorter assumes inactive leaf nodes are placed at the tail of a node's children list. However, when collapsing a parent node with a single "." virtual child node, its position may fail to be updated due to a bug in `Sorter::remove()`: {noformat} CHECK(child->isLeaf()); current->kind = child->kind; ... if (current->kind == Node::INTERNAL) { } {noformat} This bug manifests if: (1) we have a/b and a/.; (2) deactivate(a) is called, i.e. a/. becomes an inactive leaf; (3) remove(a/b) is called. When this happens, a/. collapses into `a` as an inactive leaf but, due to the bug above, is not moved to the end, so none of the clients after `a` are included in sort(). Luckily, this should never happen in practice, because only frameworks get deactivated, and frameworks don't have sub-clients. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
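The invariant the sorter relies on — inactive leaves at the tail of the children list — could be restored after a collapse along these lines. A minimal sketch with illustrative types (not the sorter's real `Node`), assuming the collapsed node has already taken on its child's kind:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Illustrative stand-in for a sorter tree node.
struct Node {
  std::string name;
  bool inactiveLeaf = false;
};

// After a "." virtual child collapses into its parent, re-establish
// the invariant: active nodes first, inactive leaves at the tail,
// preserving relative order within each group.
void restoreTailInvariant(std::vector<Node>& children) {
  std::stable_partition(children.begin(), children.end(),
      [](const Node& n) { return !n.inactiveLeaf; });
}
```

With this in place, a sort() that stops at the first inactive leaf no longer skips active clients that merely happened to sit after the collapsed node.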
[jira] [Commented] (MESOS-9599) Update `GET_QUOTA` to return both guarantees and limits.
[ https://issues.apache.org/jira/browse/MESOS-9599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896740#comment-16896740 ] Meng Zhu commented on MESOS-9599: - {noformat} commit 817545318da364efdff7c9c3f888d0d7aa94da23 Author: Meng Zhu m...@mesosphere.io Date: Tue Jul 30 18:48:32 2019 -0700 Updated quota related endpoints to return quota configurations. Added quota configuration information (that includes both guarantees and limits) in V1 GET_QUOTA call and V0 GET "/quota". To keep backwards compatibility, the infos field which only includes the guarantees are continue to be filled. An additional field configs was added. Also extended an existing test to cover the changes in the endpoints. Review: https://reviews.apache.org/r/71159 {noformat} > Update `GET_QUOTA` to return both guarantees and limits. > - > > Key: MESOS-9599 > URL: https://issues.apache.org/jira/browse/MESOS-9599 > Project: Mesos > Issue Type: Task >Reporter: Meng Zhu >Assignee: Meng Zhu >Priority: Major > Labels: resource-management > > We should mark the existing `QuotaInfo` message as deprecated in favor of the > new `QuotaConfig`: > {noformat} > message GetQuota { > required quota.QuotaStatus status = 1; > } > message QuotaStatus { >repeated QuotaInfo infos [deprecated = true]; >repeated QuotaConfig configs; > } > message QuotaConfig { > required string role; > map guarantees; > map limits; > } > {noformat} > We will continue to fill in the QuotaInfo though for backward compatibility. > See the design doc: [New > API|https://docs.google.com/document/d/13vG5uH4YVwM79ErBPYAZfnqYFOBbUy2Lym0_9iAQ5Uk/edit#heading=h.z2vfcyzabymz] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (MESOS-9598) Update GET `/quota` to return both guarantees and limits.
[ https://issues.apache.org/jira/browse/MESOS-9598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896738#comment-16896738 ] Meng Zhu commented on MESOS-9598: - {noformat} commit 817545318da364efdff7c9c3f888d0d7aa94da23 Author: Meng Zhu m...@mesosphere.io Date: Tue Jul 30 18:48:32 2019 -0700 Updated quota related endpoints to return quota configurations. Added quota configuration information (that includes both guarantees and limits) in V1 GET_QUOTA call and V0 GET "/quota". To keep backwards compatibility, the infos field which only includes the guarantees are continue to be filled. An additional field configs was added. Also extended an existing test to cover the changes in the endpoints. Review: https://reviews.apache.org/r/71159 {noformat} > Update GET `/quota` to return both guarantees and limits. > - > > Key: MESOS-9598 > URL: https://issues.apache.org/jira/browse/MESOS-9598 > Project: Mesos > Issue Type: Task >Reporter: Meng Zhu >Assignee: Meng Zhu >Priority: Major > Labels: resource-management > > We should mark the existing `QuotaInfo` message as deprecated in favor of the > new `QuotaConfig`: > {noformat} > message QuotaStatus { >repeated QuotaInfo infos [deprecated = true]; >repeated QuotaConfig configs; > } > message QuotaConfig { > required string role; > map guarantees; > map limits; > } > {noformat} > We will continue to fill in the QuotaInfo though for backward compatibility. > See the design doc: [New > API|https://docs.google.com/document/d/13vG5uH4YVwM79ErBPYAZfnqYFOBbUy2Lym0_9iAQ5Uk/edit#] > Note, we only update this v0 endpoint for the GET method. There is no plan to > support configuring quota limits from this endpoint. V1 calls should be used. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Assigned] (MESOS-9917) Store a role/framework tree in the allocator and deprecate the sorter interface.
[ https://issues.apache.org/jira/browse/MESOS-9917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Meng Zhu reassigned MESOS-9917: --- Assignee: Meng Zhu > Store a role/framework tree in the allocator and deprecate the sorter > interface. > > > Key: MESOS-9917 > URL: https://issues.apache.org/jira/browse/MESOS-9917 > Project: Mesos > Issue Type: Improvement > Components: allocation >Reporter: Meng Zhu >Assignee: Meng Zhu >Priority: Major > Labels: mesosphere, resource-management > > Currently, the client (role and framework) tree for the allocator is stored > in the sorter abstraction. This is not ideal. The role/framework tree is > generic information that is needed regardless of the sorter used. The current > sorter interface and its associated states are tech debts that contribute to > performance slowdown and code convolution. > We should store a role/framework tree in the allocator. Each client node will > have a variant field that encapsulates information needed for each sorter > (e.g. for random sorter, it could be empty). -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (MESOS-9917) Store a role/framework tree in the allocator and deprecate the sorter interface.
Meng Zhu created MESOS-9917: --- Summary: Store a role/framework tree in the allocator and deprecate the sorter interface. Key: MESOS-9917 URL: https://issues.apache.org/jira/browse/MESOS-9917 Project: Mesos Issue Type: Improvement Components: allocation Reporter: Meng Zhu Currently, the client (role and framework) tree for the allocator is stored in the sorter abstraction. This is not ideal. The role/framework tree is generic information that is needed regardless of the sorter used. The current sorter interface and its associated state are tech debt that contributes to performance slowdown and code convolution. We should store a role/framework tree in the allocator. Each client node will have a variant field that encapsulates the information needed by each sorter (e.g. for the random sorter, it could be empty). -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Assigned] (MESOS-9600) Deprecate `SET_QUOTA` and `REMOVE_QUOTA` calls in favor of `UPDATE_QUOTA`.
[ https://issues.apache.org/jira/browse/MESOS-9600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Meng Zhu reassigned MESOS-9600: --- Assignee: Meng Zhu > Deprecate `SET_QUOTA` and `REMOVE_QUOTA` calls in favor of `UPDATE_QUOTA`. > -- > > Key: MESOS-9600 > URL: https://issues.apache.org/jira/browse/MESOS-9600 > Project: Mesos > Issue Type: Task >Reporter: Meng Zhu >Assignee: Meng Zhu >Priority: Major > Labels: resource-management > > Once the `UPDATE_QUOTA` call (MESOS-9596) is implemented and wired, we should > deprecate the existing calls `REMOVE_QUOTA` and `SET_QUOTA`. In the > user-facing documentation, we should hide the old API and showcase the new > one. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (MESOS-9913) Use built-in protobuf JSON mapping utilities in favor of reflection for (de)serialization.
Meng Zhu created MESOS-9913: --- Summary: Use built-in protobuf JSON mapping utilities in favor of reflection for (de)serialization. Key: MESOS-9913 URL: https://issues.apache.org/jira/browse/MESOS-9913 Project: Mesos Issue Type: Improvement Components: json api Reporter: Meng Zhu Currently, we use protobuf reflection APIs to (de)serialize to/from JSON. This means a lot of custom code. There are places where we forgot to customize (e.g. for Map, MESOS-9901). Also, there is a performance regression in protobuf reflection if we upgrade our protobuf library to 3.7.x (see MESOS-9896 and related tickets). Thus it would be beneficial to make use of the [built-in JSON utilities|https://github.com/protocolbuffers/protobuf/blob/master/src/google/protobuf/util/json_util.h] to do the mapping. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Assigned] (MESOS-9598) Update GET `/quota` to return both guarantees and limits.
[ https://issues.apache.org/jira/browse/MESOS-9598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Meng Zhu reassigned MESOS-9598: --- Assignee: Meng Zhu > Update GET `/quota` to return both guarantees and limits. > - > > Key: MESOS-9598 > URL: https://issues.apache.org/jira/browse/MESOS-9598 > Project: Mesos > Issue Type: Task >Reporter: Meng Zhu >Assignee: Meng Zhu >Priority: Major > Labels: resource-management > > We should mark the existing `QuotaInfo` message as deprecated in favor of the > new `QuotaConfig`: > {noformat} > message QuotaStatus { >repeated QuotaInfo infos [deprecated = true]; >repeated QuotaConfig configs; > } > message QuotaConfig { > required string role; > map guarantees; > map limits; > } > {noformat} > We will continue to fill in the QuotaInfo though for backward compatibility. > See the design doc: [New > API|https://docs.google.com/document/d/13vG5uH4YVwM79ErBPYAZfnqYFOBbUy2Lym0_9iAQ5Uk/edit#] > Note, we only update this v0 endpoint for the GET method. There is no plan to > support configuring quota limits from this endpoint. V1 calls should be used. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Assigned] (MESOS-9599) Update `GET_QUOTA` to return both guarantees and limits.
[ https://issues.apache.org/jira/browse/MESOS-9599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Meng Zhu reassigned MESOS-9599: --- Assignee: Meng Zhu > Update `GET_QUOTA` to return both guarantees and limits. > - > > Key: MESOS-9599 > URL: https://issues.apache.org/jira/browse/MESOS-9599 > Project: Mesos > Issue Type: Task >Reporter: Meng Zhu >Assignee: Meng Zhu >Priority: Major > Labels: resource-management > > We should mark the existing `QuotaInfo` message as deprecated in favor of the > new `QuotaConfig`: > {noformat} > message GetQuota { > required quota.QuotaStatus status = 1; > } > message QuotaStatus { >repeated QuotaInfo infos [deprecated = true]; >repeated QuotaConfig configs; > } > message QuotaConfig { > required string role; > map guarantees; > map limits; > } > {noformat} > We will continue to fill in the QuotaInfo though for backward compatibility. > See the design doc: [New > API|https://docs.google.com/document/d/13vG5uH4YVwM79ErBPYAZfnqYFOBbUy2Lym0_9iAQ5Uk/edit#heading=h.z2vfcyzabymz] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (MESOS-9599) Update `GET_QUOTA` to return both guarantees and limits.
[ https://issues.apache.org/jira/browse/MESOS-9599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16892347#comment-16892347 ] Meng Zhu commented on MESOS-9599: - {noformat} commit ed06bc6b539eea115375640703eb0934328daca6 Author: Meng Zhu m...@mesosphere.io Date: Tue May 21 16:07:41 2019 +0200 Added `repeated QuotaConfig` to `QuotaStatus`. Also marked the `infos` field as deprecated. `QuotaStatus` is returned by `GET_QUOTA` and `GET /quota`. As we introduce quota limits, a new mesage `QuotaConfig` is introduced to describe the quota configuration. For backwards compatibility, we will fill in both fields until `QuotaInfo` is removed (in Mesos 2.0). Review: https://reviews.apache.org/r/70690 {noformat} > Update `GET_QUOTA` to return both guarantees and limits. > - > > Key: MESOS-9599 > URL: https://issues.apache.org/jira/browse/MESOS-9599 > Project: Mesos > Issue Type: Task >Reporter: Meng Zhu >Priority: Major > Labels: resource-management > > We should mark the existing `QuotaInfo` message as deprecated in favor of the > new `QuotaConfig`: > {noformat} > message GetQuota { > required quota.QuotaStatus status = 1; > } > message QuotaStatus { >repeated QuotaInfo infos [deprecated = true]; >repeated QuotaConfig configs; > } > message QuotaConfig { > required string role; > map guarantees; > map limits; > } > {noformat} > We will continue to fill in the QuotaInfo though for backward compatibility. > See the design doc: [New > API|https://docs.google.com/document/d/13vG5uH4YVwM79ErBPYAZfnqYFOBbUy2Lym0_9iAQ5Uk/edit#heading=h.z2vfcyzabymz] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (MESOS-9903) ContentType/AgentAPITest.MarkResourceProviderGone
Meng Zhu created MESOS-9903: --- Summary: ContentType/AgentAPITest.MarkResourceProviderGone Key: MESOS-9903 URL: https://issues.apache.org/jira/browse/MESOS-9903 Project: Mesos Issue Type: Bug Components: test Reporter: Meng Zhu Attachments: badrun_log.txt Observed flaky in our CI, centos-6-SSL. Log attached. Crash trace: {noformat} I0724 00:38:07.728926 3249 http_connection.hpp:283] Connected with the remote endpoint at http://172.16.10.60:38795/slave()/api/v1/resource_provider *** Aborted at 1563928687 (unix time) try "date -d @1563928687" if you are using GNU date *** I0724 00:38:07.730021 27831 slave.cpp:924] Agent terminating I0724 00:38:07.731081 3250 master.cpp:1295] Agent 8324a471-1cb7-4778-959a-560b074686b8-S0 at slave()@172.16.10.60:38795 (ip-172-16-10-60.ec2.internal) disconnected I0724 00:38:07.731101 3250 master.cpp:3397] Disconnecting agent 8324a471-1cb7-4778-959a-560b074686b8-S0 at slave()@172.16.10.60:38795 (ip-172-16-10-60.ec2.internal) I0724 00:38:07.731140 3250 master.cpp:3416] Deactivating agent 8324a471-1cb7-4778-959a-560b074686b8-S0 at slave()@172.16.10.60:38795 (ip-172-16-10-60.ec2.internal) I0724 00:38:07.731204 3247 hierarchical.cpp:799] Agent 8324a471-1cb7-4778-959a-560b074686b8-S0 deactivated PC: @ 0x7f7a21bf59fc process::UPID::UPID() *** SIGSEGV (@0x557acd6ed7a1) received by PID 27831 (TID 0x7f7a14040700) from PID 18446744072861177761; stack trace: *** @ 0x7f79eb0dcde7 (unknown) @ 0x7f79eb0e4385 JVM_handle_linux_signal @ 0x7f79eb0d9583 (unknown) @ 0x7f7a1e2257e0 (unknown) @ 0x7f7a21bf59fc process::UPID::UPID() @ 0x7f7a209e6cbb mesos::v1::resource_provider::Driver::send() @ 0x5579c9704027 mesos::internal::tests::resource_provider::MockResourceProvider<>::connectedDefault() @ 0x5579c9604b2a testing::internal::FunctionMockerBase<>::UntypedPerformDefaultAction() @ 0x5579cad9fe83 testing::internal::UntypedFunctionMockerBase::UntypedInvokeWith() @ 0x5579c9635714 mesos::internal::tests::resource_provider::MockResourceProvider<>::connected() @ 
0x7f7a206a9273 process::AsyncExecutorProcess::execute<>() @ 0x7f7a206b6b3b _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchI7NothingNS1_20AsyncExecutorProcessERKSt8functionIFvvEESG_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSL_FSI_T1_EOT2_EUlSt10unique_ptrINS1_7PromiseISA_EESt14default_deleteISW_EEOSE_S3_E_JSZ_SE_St12_PlaceholderILi1EEclEOS3_ @ 0x7f7a21c10ea1 process::ProcessBase::consume() @ 0x7f7a21c25677 process::ProcessManager::resume() @ 0x7f7a21c2aae6 _ZNSt6thread11_State_implISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv @ 0x7f7a21ee0c7f execute_native_thread_routine @ 0x7f7a1e21daa1 start_thread @ 0x7f7a1d1ddc4d clone {noformat} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Assigned] (MESOS-9901) Specialize jsonify for protobuf Maps.
[ https://issues.apache.org/jira/browse/MESOS-9901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Meng Zhu reassigned MESOS-9901: --- Assignee: Meng Zhu > Specialize jsonify for protobuf Maps. > - > > Key: MESOS-9901 > URL: https://issues.apache.org/jira/browse/MESOS-9901 > Project: Mesos > Issue Type: Improvement > Components: json api >Reporter: Meng Zhu >Assignee: Meng Zhu >Priority: Major > > Jsonify currently treats a protobuf map as a regular repeated field. For example, for > the schema > {noformat} > message QuotaConfig { > required string role = 1; > map<string, Value.Scalar> guarantees = 2; > map<string, Value.Scalar> limits = 3; > } > {noformat} > it will produce: > {noformat} > "configs": [ > { > "role": "role1", > "guarantees": [ > { > "key": "cpus", > "value": { > "value": 1 > } > }, > { > "key": "mem", > "value": { > "value": 512 > } > } > ] > {noformat} > This output cannot be parsed back into proto messages. We need to specialize > jsonify for map types. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
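For illustration, a hedged Python sketch (not the actual stout/jsonify C++ code; the function name is invented) of the representation change the ticket asks for: collapsing the repeated {"key": ..., "value": ...} entries that jsonify currently emits for a map field into a real JSON object, which is the form a protobuf map parser can consume.

```python
# Hypothetical sketch, not Mesos/stout code: collapse a protobuf map field
# that was serialized as repeated {"key": ..., "value": ...} entries into a
# JSON object keyed by the map keys.
def collapse_map_field(entries):
    """Turn [{'key': k, 'value': v}, ...] into {k: v, ...}."""
    return {entry["key"]: entry["value"] for entry in entries}

# The shape jsonify produces today for the `guarantees` map:
repeated_form = [
    {"key": "cpus", "value": {"value": 1}},
    {"key": "mem", "value": {"value": 512}},
]

# The shape a map-aware specialization should produce instead:
map_form = collapse_map_field(repeated_form)
# map_form == {"cpus": {"value": 1}, "mem": {"value": 512}}
```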
[jira] [Commented] (MESOS-9668) Add authorization support for the new `GET_QUOTA` call.
[ https://issues.apache.org/jira/browse/MESOS-9668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16891427#comment-16891427 ] Meng Zhu commented on MESOS-9668: - {noformat} commit 756e212ee91f9b65fb5f90d627b41c9b8c22a319 (HEAD -> master, origin/master, apache/master) Author: Meng Zhu Date: Mon Jul 22 14:36:47 2019 -0700 Removed `quota_info` in the `GET_QUOTA` authorization object. Currently, the `GET_QUOTA` authorizable action set both `value` and `quota_info` fields. The `value` field is set due to backward compatibility for the `GET_QUOTA_WITH_ROLE` action. This patch makes the `GET_QUOTA` action only set the `value` field with the role name. Since the `quota.QuotaInfo` field is being deprecated, it is no longer set (the local authorizer only looks at the `value` field, it is also probably the case for any external authorizer modules). Also refactored `QuotaHandler::status`. Review: https://reviews.apache.org/r/71139 {noformat} > Add authorization support for the new `GET_QUOTA` call. > --- > > Key: MESOS-9668 > URL: https://issues.apache.org/jira/browse/MESOS-9668 > Project: Mesos > Issue Type: Improvement >Reporter: Meng Zhu >Assignee: Meng Zhu >Priority: Major > Labels: mesosphere, resource-management > > The new `GET_QUOTA` call will return QUOTA_CONFIGS: > // Used in GET_QUOTA and returned by GET /quota > // > // Overall cluster quota status, including all roles, their quota > configurations and current state (e.g. consumed and effective limits) > message QuotaStatus { >repeated QuotaInfo infos [deprecated = true]; >repeated QuotaConfig configs; > } > Currently, the GET_QUOTA authorizable action set both value > and quota_info fields. The value field is set due to > backward compatibility for the GET_QUOTA_WITH_ROLE action. > We should make the GET_QUOTA action only set the value > field with the role name. 
Since the quota.QuotaInfo field > is being deprecated, it should not be set (the local authorizer > only looks at the value field; this is probably also the case > for any external authorizer modules). -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (MESOS-8968) Wire `UPDATE_QUOTA` call.
[ https://issues.apache.org/jira/browse/MESOS-8968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16891417#comment-16891417 ] Meng Zhu commented on MESOS-8968: - {noformat} commit 7aa2a96fea8a44f673a95b425bae71c946c09f2c (HEAD -> update_quota_working, apache/master) Author: Meng Zhu Date: Thu Jul 18 11:32:49 2019 -0700 Added a test to ensure `UPDATE_QUOTA` is applied all-or-nothing. Review: https://reviews.apache.org/r/71119 {noformat} > Wire `UPDATE_QUOTA` call. > - > > Key: MESOS-8968 > URL: https://issues.apache.org/jira/browse/MESOS-8968 > Project: Mesos > Issue Type: Bug >Reporter: Meng Zhu >Assignee: Meng Zhu >Priority: Major > Labels: Quota, allocator, multitenancy > > Wire the existing master, auth, registrar, and allocator pieces together to > complete the `UPDATE_QUOTA` call. > This would enable the master capability `QUOTA_V2`. > This also fixes the "ignoring zero resource quota" bug in the old quota > implementation, namely: > Currently, Mesos discards resource objects with zero scalar values when parsing > resources. This means a quota set to zero would be ignored and not enforced. > For example, a role with quota set to "cpu:10;mem:10;gpu:0" intends to get no > GPU. Due to the above issue, the allocator can only see the quota as > "cpu:10;mem:10", and no GPU quota means no guarantee and NO limit. Thus GPUs > may still be allocated to this role. > With the completion of `UPDATE_QUOTA`, which takes a map of names to scalar > values, zero values will no longer be dropped. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
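The "ignoring zero resource quota" bug described above can be seen with a toy parser — a hedged Python sketch, not the real Mesos resource parser; the function name is invented:

```python
# Hypothetical sketch, not the Mesos parser: dropping zero-valued scalars
# while parsing turns "gpu:0" (an explicit cap of zero) into "no gpu quota"
# (no guarantee and, crucially, no limit).
def parse_quota(spec, drop_zero):
    quota = {}
    for item in spec.split(";"):
        name, value = item.split(":")
        value = float(value)
        if drop_zero and value == 0:
            continue  # old behavior: zero entries silently vanish
        quota[name] = value
    return quota

old = parse_quota("cpu:10;mem:10;gpu:0", drop_zero=True)
new = parse_quota("cpu:10;mem:10;gpu:0", drop_zero=False)
# old == {"cpu": 10.0, "mem": 10.0}              -> gpu is unconstrained
# new == {"cpu": 10.0, "mem": 10.0, "gpu": 0.0}  -> gpu is capped at zero
```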
[jira] [Commented] (MESOS-9901) Specialize jsonify for protobuf Maps.
[ https://issues.apache.org/jira/browse/MESOS-9901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16891299#comment-16891299 ] Meng Zhu commented on MESOS-9901: - [~bbannier] Thanks for pointing to the test. But, despite the name, the test does not actually use jsonify: https://github.com/apache/mesos/blob/ff8c9a96be6ae1ee47faf9d5b80a518dfb4a3db0/3rdparty/stout/tests/protobuf_tests.cpp#L838-L839 > Specialize jsonify for protobuf Maps. > - > > Key: MESOS-9901 > URL: https://issues.apache.org/jira/browse/MESOS-9901 > Project: Mesos > Issue Type: Improvement > Components: json api >Reporter: Meng Zhu >Priority: Major > > Jsonify currently treats a protobuf map as a regular repeated field. For example, for > the schema > {noformat} > message QuotaConfig { > required string role = 1; > map<string, Value.Scalar> guarantees = 2; > map<string, Value.Scalar> limits = 3; > } > {noformat} > it will produce: > {noformat} > "configs": [ > { > "role": "role1", > "guarantees": [ > { > "key": "cpus", > "value": { > "value": 1 > } > }, > { > "key": "mem", > "value": { > "value": 512 > } > } > ] > {noformat} > This output cannot be parsed back into proto messages. We need to specialize > jsonify for map types. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (MESOS-9901) Specialize jsonify for protobuf Maps.
Meng Zhu created MESOS-9901: --- Summary: Specialize jsonify for protobuf Maps. Key: MESOS-9901 URL: https://issues.apache.org/jira/browse/MESOS-9901 Project: Mesos Issue Type: Improvement Components: json api Reporter: Meng Zhu Jsonify currently treats a protobuf map as a regular repeated field. For example, for the schema {noformat} message QuotaConfig { required string role = 1; map<string, Value.Scalar> guarantees = 2; map<string, Value.Scalar> limits = 3; } {noformat} it will produce: {noformat} "configs": [ { "role": "role1", "guarantees": [ { "key": "cpus", "value": { "value": 1 } }, { "key": "mem", "value": { "value": 512 } } ] {noformat} This output cannot be parsed back into proto messages. We need to specialize jsonify for map types. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Assigned] (MESOS-8968) Wire `UPDATE_QUOTA` call.
[ https://issues.apache.org/jira/browse/MESOS-8968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Meng Zhu reassigned MESOS-8968: --- Assignee: Meng Zhu > Wire `UPDATE_QUOTA` call. > - > > Key: MESOS-8968 > URL: https://issues.apache.org/jira/browse/MESOS-8968 > Project: Mesos > Issue Type: Bug >Reporter: Meng Zhu >Assignee: Meng Zhu >Priority: Major > Labels: Quota, allocator, multitenancy > > Wire the existing master, auth, registrar, and allocator pieces together to > complete the `UPDATE_QUOTA` call. > This would enable the master capability `QUOTA_V2`. > This also fixes the "ignoring zero resource quota" bug in the old quota > implementation, namely: > Currently, Mesos discards resource objects with zero scalar values when parsing > resources. This means a quota set to zero would be ignored and not enforced. > For example, a role with quota set to "cpu:10;mem:10;gpu:0" intends to get no > GPU. Due to the above issue, the allocator can only see the quota as > "cpu:10;mem:10", and no GPU quota means no guarantee and NO limit. Thus GPUs > may still be allocated to this role. > With the completion of `UPDATE_QUOTA`, which takes a map of names to scalar > values, zero values will no longer be dropped. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (MESOS-8968) Wire `UPDATE_QUOTA` call.
[ https://issues.apache.org/jira/browse/MESOS-8968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882508#comment-16882508 ] Meng Zhu edited comment on MESOS-8968 at 7/10/19 11:54 PM: --- {noformat} commit 0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8 (apache/master) Author: Meng Zhu Date: Fri Jul 5 18:05:59 2019 -0700 Implemented `UPDATE_QUOTA` operator call. This patch wires up the master, auth, registar and allocator pieces for `UPDATE_QUOTA` call. This enables the master capability `QUOTA_V2`. The capability implies the quota v2 API is capable of writes (`UPDATE_QUOTA`) and the master is capable of recovering from V2 quota (`QuotaConfig`) in registry. This patch lacks the rescind offer logic. When quota limits and guarantees are configured, it might be necessary to rescind offers on the fly to satisfy new guarantees or be constrained by the new limits. A todo is left and will be tackled in subsequent patches. Also enabled test `MasterQuotaTest.RecoverQuotaEmptyCluster`. Review: https://reviews.apache.org/r/71021 {noformat} {noformat} commit dcd73437549413790751d1ff127989dbb29bd753 (HEAD -> update_quota, apache/master) Author: Meng Zhu Date: Sun Jul 7 14:27:14 2019 -0700 Added tests for `UPDATE_QUOTA`. These tests reuse the existing tests for `SET_QUOTA` and `REMOVE_QUOTA` calls. In general, `UPDATE_QUOTA` request should fail where `SET_QUOTA` fails. When the existing test expects `SET_QUOTA` call succeeds, we test the `UPDATE_QUOTA` call by first remove the set quota and then send the `UPDATE_QUOTA` request. Review: https://reviews.apache.org/r/71022 {noformat} was (Author: mzhu): {noformat} commit 0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8 (apache/master) Author: Meng Zhu Date: Fri Jul 5 18:05:59 2019 -0700 Implemented `UPDATE_QUOTA` operator call. This patch wires up the master, auth, registar and allocator pieces for `UPDATE_QUOTA` call. This enables the master capability `QUOTA_V2`. 
The capability implies the quota v2 API is capable of writes (`UPDATE_QUOTA`) and the master is capable of recovering from V2 quota (`QuotaConfig`) in registry. This patch lacks the rescind offer logic. When quota limits and guarantees are configured, it might be necessary to rescind offers on the fly to satisfy new guarantees or be constrained by the new limits. A todo is left and will be tackled in subsequent patches. Also enabled test `MasterQuotaTest.RecoverQuotaEmptyCluster`. Review: https://reviews.apache.org/r/71021 {noformat} > Wire `UPDATE_QUOTA` call. > - > > Key: MESOS-8968 > URL: https://issues.apache.org/jira/browse/MESOS-8968 > Project: Mesos > Issue Type: Bug >Reporter: Meng Zhu >Priority: Major > Labels: Quota, allocator, multitenancy > > Wire the existing master, auth, registar, and allocator pieces together to > complete the `UPDATE_QUOTA` call. > This would enable the master capability `QUOTA_V2`. > This also fixes the "ignoring zero resource quota" bug in the old quota > implementation, namely: > Currently, Mesos discards resource object with zero scalar value when parsing > resources. This means quota set to zero would be ignored and not enforced. > For example, role with quota set to "cpu:10;mem:10;gpu:0" intends to get no > GPU. Due to the above issue, the allocator can only see the quota as > "cpu:10;mem:10", and no quota GPU means no guarantee and NO limit. Thus GPUs > may still be allocated to this role. > With the completion of `UPDATE_QUOTA` which takes a map of name, scalar > values, zero value will no longer be dropped. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9812) Add achievability validation for update quota call.
[ https://issues.apache.org/jira/browse/MESOS-9812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882512#comment-16882512 ] Meng Zhu commented on MESOS-9812: - This covers the guarantee overcommitment check and the hierarchical guarantees check: {noformat} commit 16f0b0c295960e397e56f6d504b8075cb62e6e4f Author: Meng Zhu Date: Fri Jul 5 15:41:01 2019 -0700 Added overcommit and hierarchical inclusion check for `UPDATE_QUOTA`. The overcommit check validates that the total quota guarantees in the cluster is contained by the cluster capacity. The hierarchical inclusion check validates that the sum of children's guarantees is contained by the parent guarantee. Further validation is needed for: - Check a role's limit is less than its current consumption. - Check a role's limit is less than its parent's limit. Review: https://reviews.apache.org/r/71020 {noformat} Leaving the ticket open for now for: limits < consumption, hierarchical limits invariant. > Add achievability validation for update quota call. > --- > > Key: MESOS-9812 > URL: https://issues.apache.org/jira/browse/MESOS-9812 > Project: Mesos > Issue Type: Improvement >Reporter: Meng Zhu >Assignee: Meng Zhu >Priority: Major > Labels: resource-management > > Add overcommit check, hierarchical quota validation and force flag override > for update quota call. > Right now, we only have per-config validation. We need to add > further validation for the update quota call regarding: > 1. Check if the role's resource limits are already breached. To achieve this, > we need to first rescind offers until its allocated resources are below > limits. If after all rescinds, allocated resources are still above the > requested limits, we will return an error unless the `force` flag is used. > 2. Check if the aggregated quota guarantees of all roles exceed the cluster > capacity. If so, we will return an error unless the `force` flag is used. > 3. hierarchical limits validation > a. Check that a role's limit does not exceed its parent's limit. > b. Check that the sum of children's guarantees does not exceed the parent's > guarantees. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
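The overcommit and hierarchical inclusion checks above can be sketched as follows — a hedged Python illustration under invented names; the real checks operate on Mesos `ResourceQuantities`, not plain dicts:

```python
# Hypothetical sketch of the two validation checks, not Mesos code.
def total(quantities_list):
    """Element-wise sum of resource-quantity dicts such as {'cpus': 2}."""
    result = {}
    for quantities in quantities_list:
        for name, qty in quantities.items():
            result[name] = result.get(name, 0) + qty
    return result

def contained(inner, outer):
    """True if every quantity in `inner` fits within `outer`."""
    return all(qty <= outer.get(name, 0) for name, qty in inner.items())

# Overcommit check: summed guarantees must fit in the cluster capacity.
def overcommitted(guarantees_by_role, cluster_capacity):
    return not contained(total(guarantees_by_role.values()), cluster_capacity)

# Hierarchical inclusion check: children's summed guarantees must fit in
# the parent role's guarantee.
def hierarchy_ok(parent_guarantee, child_guarantees):
    return contained(total(child_guarantees), parent_guarantee)
```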
[jira] [Commented] (MESOS-8968) Wire `UPDATE_QUOTA` call.
[ https://issues.apache.org/jira/browse/MESOS-8968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882508#comment-16882508 ] Meng Zhu commented on MESOS-8968: - {noformat} commit 0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8 (apache/master) Author: Meng Zhu Date: Fri Jul 5 18:05:59 2019 -0700 Implemented `UPDATE_QUOTA` operator call. This patch wires up the master, auth, registar and allocator pieces for `UPDATE_QUOTA` call. This enables the master capability `QUOTA_V2`. The capability implies the quota v2 API is capable of writes (`UPDATE_QUOTA`) and the master is capable of recovering from V2 quota (`QuotaConfig`) in registry. This patch lacks the rescind offer logic. When quota limits and guarantees are configured, it might be necessary to rescind offers on the fly to satisfy new guarantees or be constrained by the new limits. A todo is left and will be tackled in subsequent patches. Also enabled test `MasterQuotaTest.RecoverQuotaEmptyCluster`. Review: https://reviews.apache.org/r/71021 {noformat} > Wire `UPDATE_QUOTA` call. > - > > Key: MESOS-8968 > URL: https://issues.apache.org/jira/browse/MESOS-8968 > Project: Mesos > Issue Type: Bug >Reporter: Meng Zhu >Priority: Major > Labels: Quota, allocator, multitenancy > > Wire the existing master, auth, registrar, and allocator pieces together to > complete the `UPDATE_QUOTA` call. > This would enable the master capability `QUOTA_V2`. > This also fixes the "ignoring zero resource quota" bug in the old quota > implementation, namely: > Currently, Mesos discards resource objects with zero scalar values when parsing > resources. This means a quota set to zero would be ignored and not enforced. > For example, a role with quota set to "cpu:10;mem:10;gpu:0" intends to get no > GPU. Due to the above issue, the allocator can only see the quota as > "cpu:10;mem:10", and no GPU quota means no guarantee and NO limit. Thus GPUs > may still be allocated to this role. > With the completion of `UPDATE_QUOTA`, which takes a map of names to scalar > values, zero values will no longer be dropped. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8968) Wire `UPDATE_QUOTA` call.
[ https://issues.apache.org/jira/browse/MESOS-8968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882509#comment-16882509 ] Meng Zhu commented on MESOS-8968: - Leaving it open for now, until more tests are landed. > Wire `UPDATE_QUOTA` call. > - > > Key: MESOS-8968 > URL: https://issues.apache.org/jira/browse/MESOS-8968 > Project: Mesos > Issue Type: Bug >Reporter: Meng Zhu >Priority: Major > Labels: Quota, allocator, multitenancy > > Wire the existing master, auth, registrar, and allocator pieces together to > complete the `UPDATE_QUOTA` call. > This would enable the master capability `QUOTA_V2`. > This also fixes the "ignoring zero resource quota" bug in the old quota > implementation, namely: > Currently, Mesos discards resource objects with zero scalar values when parsing > resources. This means a quota set to zero would be ignored and not enforced. > For example, a role with quota set to "cpu:10;mem:10;gpu:0" intends to get no > GPU. Due to the above issue, the allocator can only see the quota as > "cpu:10;mem:10", and no GPU quota means no guarantee and NO limit. Thus GPUs > may still be allocated to this role. > With the completion of `UPDATE_QUOTA`, which takes a map of names to scalar > values, zero values will no longer be dropped. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9882) Mesos.UpdateFrameworkV0Test.SuppressedRoles is flaky.
Meng Zhu created MESOS-9882: --- Summary: Mesos.UpdateFrameworkV0Test.SuppressedRoles is flaky. Key: MESOS-9882 URL: https://issues.apache.org/jira/browse/MESOS-9882 Project: Mesos Issue Type: Bug Components: flaky Reporter: Meng Zhu Attachments: UpdateFrameworkV0Test.SuppressedRoles_badrun.txt Observed in CI, log attached. {noformat} mesos-ec2-ubuntu-14.04-SSL.Mesos.UpdateFrameworkV0Test.SuppressedRoles (from UpdateFrameworkV0Test) Error Message ../../src/tests/master/update_framework_tests.cpp:1117 Mock function called more times than expected - returning directly. Function call: agentAdded(@0x7fb254001c40 32-byte object <90-7A 6C-85 B2-7F 00-00 00-00 00-00 00-00 00-00 01-00 00-00 00-00 00-00 F0-85 00-54 B2-7F 00-00>) Expected: to be called once Actual: called twice - over-saturated and active Stacktrace ../../src/tests/master/update_framework_tests.cpp:1117 Mock function called more times than expected - returning directly. Function call: agentAdded(@0x7fb254001c40 32-byte object <90-7A 6C-85 B2-7F 00-00 00-00 00-00 00-00 00-00 01-00 00-00 00-00 00-00 F0-85 00-54 B2-7F 00-00>) Expected: to be called once Actual: called twice - over-saturated and active {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9812) Add achievability validation for update quota call.
[ https://issues.apache.org/jira/browse/MESOS-9812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Meng Zhu reassigned MESOS-9812: --- Assignee: Meng Zhu > Add achievability validation for update quota call. > --- > > Key: MESOS-9812 > URL: https://issues.apache.org/jira/browse/MESOS-9812 > Project: Mesos > Issue Type: Improvement >Reporter: Meng Zhu >Assignee: Meng Zhu >Priority: Major > Labels: resource-management > > Add overcommit check and force flag override for update quota call. > Right now, we only have validation for each quota config. We need to add > further validation for the update quota call regarding: > 1. Check if the role's resource limits are already breached. To achieve this, we > need to first rescind offers until its allocated resources are below limits. > If after all rescinds, allocated resources are still above the requested > limits, we will return an error unless the `force` flag is used. > 2. Check if the aggregated quota guarantees of all roles exceed the cluster > capacity. If so, we will return an error unless the `force` flag is used. > 3. hierarchical quota validity (we could probably punt on this given that we > only support flat role quota at the moment). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9601) Persist `QuotaConfig`s in the registry.
[ https://issues.apache.org/jira/browse/MESOS-9601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16876515#comment-16876515 ] Meng Zhu commented on MESOS-9601: - {noformat} commit 3720e4cf5f7cb0d8e98afacea39528bd41c767b4 Author: Meng Zhu Date: Fri Jun 28 14:16:00 2019 -0700 Updated registry operation `UpdateQuota` to persist `QuotaConfig`. The new operations will mutate the `quota_configs` field in the registry to persist `QuotaConfigs` configured by the new `UPDATE_QUOTA` call as well as the legacy `SET_QUOTA` and `REMOVE_QUOTA` calls. The operation removes any entries in the legacy `quotas` field with the same role name. In addition, it also adds/removes the minimum capability `QUOTA_V2` accordingly: if `quota_configs` is empty the capability will be removed otherwise it will be added. This operation replaces the `REMOVE_QUOTA` operation. Also fixed/disabled affected tests. Review: https://reviews.apache.org/r/70951 commit c82847ad1b8d3760d34ee1e8869c2b7286ccfaa1 Author: Meng Zhu Date: Fri Jun 28 14:15:02 2019 -0700 Added helpers to add and remove master minimum capabilities. Also added a TODO about refactoring the helpers. Review: https://reviews.apache.org/r/70972 commit f37250f53e75e0442aed2f61bbedbc9b068821d5 Author: Meng Zhu Date: Tue Jun 25 18:07:29 2019 -0700 Added a registry field for `QuotaConfig`. A new field called `quota_configs` is added to persist the quota configurations of the cluster. This replaces the old `quotas` field which is deprecated and will be removed in Mesos 2.0. When users upgrade to Mesos 1.9, `quotas` will be preserved and recovered and `quota_configs` will be empty. As users configures new quotas, whether through the new `UPDATE_QUOTA` call or the deprecated `SET_QUTOA` call, the configured quotas will be persisted into the `quota_configs` field along with the `QUOTA_V2` minimum capability. The capability is removed only if `quota_configs` becomes empty again. 
If a role already has an entry in the old `quotas` field, it will be removed from `quotas`. In other words, once upgraded, `quotas` will still be preserved and honored, but it will never grow. Instead it will gradually shrink as the roles' quotas get updated or removed. Review: https://reviews.apache.org/r/70950 commit 0bc857d672189605f83acb7ef57bce89b141ba72 Author: Meng Zhu Date: Tue Jun 25 15:19:44 2019 -0700 Added master minimum capability `QUOTA_V2`. This adds a new enum for the revamped quota feature in the master. When quota is configured in Mesos 1.9 or higher, the quota configurations will be persisted into the `quota_configs` field in the registry. And the `QUOTA_V2` minimum capability will be added to the registry as well. This will prevent any master downgrades until `quota_configs` becomes empty. This can be done by setting the quota of the roles listed in `quota_configs` back to the default (no guarantees and no limits). Note, since at the moment of adding this patch, the master is not yet capable of handling the new quota API. The `capability` is not added to the `MASTER_CAPABILITIES`. That should be done later together with the patches that enables master for handling the new quota calls. Review: https://reviews.apache.org/r/70949 {noformat} > Persist `QuotaConfig`s in the registry. > --- > > Key: MESOS-9601 > URL: https://issues.apache.org/jira/browse/MESOS-9601 > Project: Mesos > Issue Type: Task >Reporter: Meng Zhu >Assignee: Meng Zhu >Priority: Major > Labels: resource-management > > We need to persist the new `QuotaConfig` in the registry. > One thing to note is that the old masters only support quota guarantees, which also > serve as limits implicitly. Once new masters start to support both > guarantees and limits, there is no safe downgrade path without altering the > cluster behavior (if the new quota semantics are used). Thus, we need to > ensure that alerts are given if such downgrades are attempted. 
> To this end, if the quota is configured after this change, a new minimum > capability `QUOTA_V2` will be persisted to the registry along with the new > `QuotaConfig` message. Thanks to the minimum capability check, old masters > (that do not possess the `QUOTA_V2` capability) will refuse to start in this > case and we will print out suggestions to the operator. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
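The registry transition described above — new writes land in `quota_configs`, the legacy `quotas` entry for the same role is removed, and the `QUOTA_V2` minimum capability tracks whether `quota_configs` is non-empty — can be modeled like this (a hedged Python sketch with invented names, not the actual registrar code):

```python
# Hypothetical model of the registry mutation, not Mesos registrar code.
def update_quota(registry, role, config):
    if config is None:
        registry["quota_configs"].pop(role, None)  # quota removed for role
    else:
        registry["quota_configs"][role] = config   # new-style persistence
    # The legacy field only ever shrinks: it loses the role on any update.
    registry["quotas"].pop(role, None)
    # QUOTA_V2 is required exactly while `quota_configs` is non-empty,
    # which blocks downgrades to masters lacking the capability.
    if registry["quota_configs"]:
        registry["minimum_capabilities"].add("QUOTA_V2")
    else:
        registry["minimum_capabilities"].discard("QUOTA_V2")

registry = {
    "quotas": {"dev": "legacy-quota-info"},  # persisted before the upgrade
    "quota_configs": {},
    "minimum_capabilities": set(),
}

update_quota(registry, "dev", {"guarantees": {"cpus": 1}})
# "dev" now lives only in quota_configs and QUOTA_V2 is required.

update_quota(registry, "dev", None)
# quota_configs is empty again, so QUOTA_V2 is dropped and downgrade is safe.
```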
[jira] [Created] (MESOS-9866) Removes the `quotas` field in the registry.
Meng Zhu created MESOS-9866: --- Summary: Removes the `quotas` field in the registry. Key: MESOS-9866 URL: https://issues.apache.org/jira/browse/MESOS-9866 Project: Mesos Issue Type: Bug Reporter: Meng Zhu Prior to Mesos 1.9, quota information was persisted in the `quotas` field. That field was deprecated in Mesos 1.9. Newly configured quotas are now persisted in the `quota_configs` field. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9807) Introduce a `struct Quota` wrapper.
[ https://issues.apache.org/jira/browse/MESOS-9807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872747#comment-16872747 ] Meng Zhu commented on MESOS-9807: - {noformat} commit 8eba78cbddc8b70f78c07a501ee0dc1d6204f280 Author: Meng Zhu Date: Thu Jun 20 17:29:28 2019 -0700 Replaced `Quota` with `Quota2` in the master state. This paves way to remove `struct Quota`. Review: https://reviews.apache.org/r/70916 commit 5907a357180ccd8fe398f2b6638c85912fafe8b2 Author: Meng Zhu Date: Thu Jun 20 18:50:38 2019 -0700 Replaced the old `struct Quota`. The new `struct Quota` is consistent with the proto `QuotaConfig` where guarantees and limits are decoupled and uses more proper abstractions: `ResourceQuantities` and `ResourceLimits`. Review: https://reviews.apache.org/r/70919 {noformat} > Introduce a `struct Quota` wrapper. > --- > > Key: MESOS-9807 > URL: https://issues.apache.org/jira/browse/MESOS-9807 > Project: Mesos > Issue Type: Improvement > Components: allocation >Reporter: Meng Zhu >Assignee: Meng Zhu >Priority: Major > Labels: resource-management > > We should introduce: > struct Quota { > ResourceQuantities guarantees; > ResourceLimits limits; > } > There are a couple of small hurdles. First, there is already a struct Quota > wrapper in "include/mesos/quota/quota.hpp", we need to deprecate that first. > Second, `ResourceQuantities` and `ResourceLimits` are right now only used in > internal headers. We probably want to move them into a public header, since > this struct will also be used in the allocator interface, which is also in the > public header. (Looking at this line, the boundary is already breached: > https://github.com/apache/mesos/blob/master/include/mesos/allocator/allocator.hpp#L41) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9820) Add `updateQuota()` method to the allocator.
[ https://issues.apache.org/jira/browse/MESOS-9820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872745#comment-16872745 ] Meng Zhu commented on MESOS-9820: - {noformat} commit 373393bbaaeadf992c2e8d5399462ffe128eaec4 Author: Meng Zhu Date: Thu Jun 20 18:48:28 2019 -0700 Removed `setQuota` and `removeQuota` methods in the allocator. These are replaced by the `updateQuota` method. Review: https://reviews.apache.org/r/70918 {noformat} > Add `updateQuota()` method to the allocator. > > > Key: MESOS-9820 > URL: https://issues.apache.org/jira/browse/MESOS-9820 > Project: Mesos > Issue Type: Improvement > Components: allocation >Reporter: Meng Zhu >Assignee: Meng Zhu >Priority: Major > Labels: resource-management > > This is the method that underlies the `UPDATE_QUOTA` operator call. This will > allow the allocator to set different values for guarantees and limits. > The existing `setQuota` and `removeQuota` methods in the allocator will be > deprecated. This will likely break many existing allocator tests. We should > fix and refactor tests to verify the bursting up to limits feature. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9820) Add `updateQuota()` method to the allocator.
[ https://issues.apache.org/jira/browse/MESOS-9820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872744#comment-16872744 ] Meng Zhu commented on MESOS-9820: - {noformat} commit 86affdd0b5c2208627eb194e5d02794fa264c383 Author: Meng Zhu Date: Thu Jun 20 18:09:36 2019 -0700 Refactored the allocator test to use the `updateQuota` method. This paves the way to remove `setQuota` and `removeQuota` methods. Review: https://reviews.apache.org/r/70917 {noformat} > Add `updateQuota()` method to the allocator. > > > Key: MESOS-9820 > URL: https://issues.apache.org/jira/browse/MESOS-9820 > Project: Mesos > Issue Type: Improvement > Components: allocation >Reporter: Meng Zhu >Assignee: Meng Zhu >Priority: Major > Labels: resource-management > > This is the method that underlies the `UPDATE_QUOTA` operator call. This will > allow the allocator to set different values for guarantees and limits. > The existing `setQuota` and `removeQuota` methods in the allocator will be > deprecated. This will likely break many existing allocator tests. We should > fix and refactor tests to verify the bursting up to limits feature. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9854) /roles endpoint should return both guarantees and limits.
[ https://issues.apache.org/jira/browse/MESOS-9854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872742#comment-16872742 ] Meng Zhu commented on MESOS-9854: - {noformat} commit b23b4e52a24637231a85faf2416b75180cfd9063 Author: Meng Zhu m...@mesosphere.io Date: Thu Jun 20 17:17:41 2019 -0700 Made `/roles` endpoint also return quota limits. Now that guarantees are decoupled from limits, we should return limits and guarantees separately in the `/roles` endpoint. Three incompatible changes are introduced: - The `principal` field is removed. This legacy field was used to record the principal of the operator who configured the quota. So that later, if a different operator with a different principal wants to modify the quota, the action can be properly authorized. This use case has since been deprecated and the principal field will no longer be filled going forward. - Resources with zero quantity will no longer be included in the `guarantee` field. - The `guarantee` field will continue to be filled. However, since we are decoupling the quota guarantee from the limit. One can no longer assume that the limit will be the same as guarantee. A separate `limit` field is introduced. Before, the response might contain: ``` { "quota": { "guarantee": { "cpus": 1, "disk": 0, "gpus": 0, "mem": 512 }, "principal": "test-principal", "role": "foo" } } ``` After: ``` { "quota": { "guarantee": { "cpus": 1, "mem": 512 }, "limit": { "cpus": 1, "mem": 512 }, "role": "foo" } } ``` Also fixed an affected test. Review: https://reviews.apache.org/r/70915 {noformat} > /roles endpoint should return both guarantees and limits. > -- > > Key: MESOS-9854 > URL: https://issues.apache.org/jira/browse/MESOS-9854 > Project: Mesos > Issue Type: Bug >Reporter: Meng Zhu >Assignee: Meng Zhu >Priority: Major > Labels: resource-management > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-8486) Webui should display role limits.
[ https://issues.apache.org/jira/browse/MESOS-8486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Meng Zhu reassigned MESOS-8486: --- Assignee: Meng Zhu (was: Armand Grillet) > Webui should display role limits. > - > > Key: MESOS-8486 > URL: https://issues.apache.org/jira/browse/MESOS-8486 > Project: Mesos > Issue Type: Task > Components: webui >Reporter: Benjamin Mahler >Assignee: Meng Zhu >Priority: Major > Labels: multitenancy > > With the addition of quota limits (see MESOS-8068), the UI should be updated > to display the per role limit information. Specifically, the 'Roles' tab > needs to be updated. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9861) Make PushGauges support floating point stats.
Meng Zhu created MESOS-9861: --- Summary: Make PushGauges support floating point stats. Key: MESOS-9861 URL: https://issues.apache.org/jira/browse/MESOS-9861 Project: Mesos Issue Type: Bug Components: metrics Reporter: Meng Zhu Currently, PushGauges are modeled after counters, and thus do not support floating point stats. This prevents many existing PullGauges from being migrated. We need to add support for floating point stats. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
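The counter-style modeling can be contrasted with a gauge that stores a floating point value directly. Below is a minimal, hypothetical sketch of such a push gauge — the class name and methods are illustrative, not the actual libprocess metrics API:

```cpp
#include <atomic>

// Hypothetical sketch of a push gauge that stores a floating point
// value directly, instead of the integer counter model. Names are
// illustrative, not Mesos's actual metrics classes.
class DoublePushGauge {
public:
  explicit DoublePushGauge(double initial = 0.0) : value(initial) {}

  // Push-style update: the owner writes the new value eagerly,
  // so reads never have to dispatch to another actor (unlike pull gauges).
  void set(double v) { value.store(v); }

  void increment(double delta) {
    // std::atomic<double> has no fetch_add until C++20; emulate it
    // with a CAS loop so the sketch stays C++11-compatible.
    double old = value.load();
    while (!value.compare_exchange_weak(old, old + delta)) {}
  }

  double get() const { return value.load(); }

private:
  std::atomic<double> value;
};
```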
[jira] [Assigned] (MESOS-9668) Add authorization support for the new `GET_QUOTA` call.
[ https://issues.apache.org/jira/browse/MESOS-9668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Meng Zhu reassigned MESOS-9668: --- Assignee: Meng Zhu > Add authorization support for the new `GET_QUOTA` call. > --- > > Key: MESOS-9668 > URL: https://issues.apache.org/jira/browse/MESOS-9668 > Project: Mesos > Issue Type: Improvement >Reporter: Meng Zhu >Assignee: Meng Zhu >Priority: Major > Labels: mesosphere, resource-management > > The new `GET_QUOTA` call will return QUOTA_CONFIGS: > // Used in GET_QUOTA and returned by GET /quota > // > // Overall cluster quota status, including all roles, their quota > configurations and current state (e.g. consumed and effective limits) > message QuotaStatus { >repeated QuotaInfo infos [deprecated = true]; >repeated QuotaConfig configs; > } > Current authorizer takes in QuotaInfo as the object. We should deprecate that > and let it take in QuotaConfigs. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9601) Guard against downgrade hazards after new quota configurations are used.
[ https://issues.apache.org/jira/browse/MESOS-9601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Meng Zhu reassigned MESOS-9601: --- Assignee: Meng Zhu > Guard against downgrade hazards after new quota configurations are used. > > > Key: MESOS-9601 > URL: https://issues.apache.org/jira/browse/MESOS-9601 > Project: Mesos > Issue Type: Task >Reporter: Meng Zhu >Assignee: Meng Zhu >Priority: Major > Labels: resource-management > > Current (old) masters only support quota guarantees, which also serve as > limits implicitly. Once new masters start to support both guarantees and > limits, there is no safe downgrade path without altering the cluster behavior > (if the new quota semantics are used). Thus, we need to ensure that alerts > are given if such downgrades are attempted. > To this end, if the new `UPDATE_QUOTA` call is used, a new minimum capability > `QUOTA_LIMITS` will be persisted to the registry along with the new > `QuotaConfig` message. Thanks to the minimum capability check, old masters > (that do not possess the `QUOTA_LIMITS` capability) will refuse to start in > this case, and we will print out suggestions to the operator. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
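The minimum capability check described above can be sketched as follows; the function and capability names are hypothetical, while the real logic lives in the Mesos master's registry recovery path:

```cpp
#include <set>
#include <string>
#include <vector>

// Illustrative sketch of a minimum-capability guard: the registry
// persists capabilities that any master recovering it must possess,
// and a master lacking one refuses to start (printing suggestions).
std::vector<std::string> missingCapabilities(
    const std::set<std::string>& registryMinimum,
    const std::set<std::string>& masterCapabilities)
{
  std::vector<std::string> missing;
  for (const std::string& capability : registryMinimum) {
    if (masterCapabilities.count(capability) == 0) {
      // e.g. "QUOTA_LIMITS" on an old (downgraded) master.
      missing.push_back(capability);
    }
  }
  return missing;
}
```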
[jira] [Assigned] (MESOS-9602) Provide backward compatibility for old quota configurations.
[ https://issues.apache.org/jira/browse/MESOS-9602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Meng Zhu reassigned MESOS-9602: --- Assignee: Meng Zhu > Provide backward compatibility for old quota configurations. > > > Key: MESOS-9602 > URL: https://issues.apache.org/jira/browse/MESOS-9602 > Project: Mesos > Issue Type: Task >Reporter: Meng Zhu >Assignee: Meng Zhu >Priority: Major > Labels: resource-management > > Current (old) masters only support quota guarantees, which also serve as > limits implicitly. When upgrading to new masters where guarantees and limits > are decoupled, we need to ensure backward compatibility such that the > existing (old) quota configurations are honored and there is no change to > the cluster behavior. > To this end, new masters should also be able to consume the old quota > registry. The old guarantee field will be used to set both guarantees and > limits. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
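The upgrade rule described above — an old registry entry carries only guarantees, and a new master derives the limits by copying them — can be sketched with simplified stand-in types (plain maps, not the actual `QuotaInfo`/`QuotaConfig` protobufs):

```cpp
#include <map>
#include <string>

// Hypothetical simplified quota wrapper; Mesos's actual types differ.
struct Quota {
  std::map<std::string, double> guarantees;
  std::map<std::string, double> limits;
};

// Convert a legacy guarantee-only configuration into the new form,
// preserving the old "guarantee also acts as limit" behavior.
Quota fromLegacyGuarantee(const std::map<std::string, double>& guarantee)
{
  Quota quota;
  quota.guarantees = guarantee;
  quota.limits = guarantee;  // old semantics: limit == guarantee
  return quota;
}
```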
[jira] [Assigned] (MESOS-8068) Non-revocable bursting over quota guarantees via limits.
[ https://issues.apache.org/jira/browse/MESOS-8068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Meng Zhu reassigned MESOS-8068: --- Assignee: Meng Zhu > Non-revocable bursting over quota guarantees via limits. > > > Key: MESOS-8068 > URL: https://issues.apache.org/jira/browse/MESOS-8068 > Project: Mesos > Issue Type: Epic > Components: allocation >Reporter: Benjamin Mahler >Assignee: Meng Zhu >Priority: Major > Labels: multitenancy, resource-management > > Prior to introducing a revocable tier of allocation (see MESOS-4441), there > is a notion of whether a role can burst over its quota guarantee. > We currently apply implicit limits in the following way: > No quota guarantee set: (guarantee 0, no limit) > Quota guarantee set: (guarantee G, limit G) > That is, we only support burst-only without a guarantee and > guarantee-only without bursting. We do not support bursting over some non-zero > guarantee: (guarantee G, limit L >= G). > The idea here is that we should make these implicit limits explicit to > clarify for users the distinction between guarantees and limits, and to > support bursting over the guarantee. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
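The implicit-limit table above can be expressed as a tiny helper. This is only a sketch using a plain double for a single resource, not Mesos's `Resource` objects:

```cpp
#include <limits>

// Sketch of the current implicit-limit rule:
//   no guarantee set  -> (guarantee 0, no limit)
//   guarantee G set   -> (guarantee G, limit G)
// Infinity stands in for "no limit" in this illustration.
double implicitLimit(double guarantee, bool guaranteeSet)
{
  if (!guaranteeSet) {
    return std::numeric_limits<double>::infinity();  // unlimited bursting
  }
  return guarantee;  // guarantee doubles as the limit
}
```

Making the limit an explicit, independently settable value (guarantee G, limit L >= G) is exactly what this epic removes this coupling for.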
[jira] [Created] (MESOS-9854) /roles endpoint should return both guarantees and limits.
Meng Zhu created MESOS-9854: --- Summary: /roles endpoint should return both guarantees and limits. Key: MESOS-9854 URL: https://issues.apache.org/jira/browse/MESOS-9854 Project: Mesos Issue Type: Bug Reporter: Meng Zhu Assignee: Meng Zhu -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9851) Migrate allocator metrics to PushGauge.
Meng Zhu created MESOS-9851: --- Summary: Migrate allocator metrics to PushGauge. Key: MESOS-9851 URL: https://issues.apache.org/jira/browse/MESOS-9851 Project: Mesos Issue Type: Bug Components: allocation Reporter: Meng Zhu We should migrate all metrics in the master actor to use PushGauges instead of PullGauges for better performance. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9807) Introduce a `struct Quota` wrapper.
[ https://issues.apache.org/jira/browse/MESOS-9807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16864551#comment-16864551 ] Meng Zhu commented on MESOS-9807: - {noformat} commit ceb1120e8c53771219363e0bf579a770b914a592 Author: Meng Zhu Date: Thu Jun 6 16:18:45 2019 -0700 Used the new quota struct for the allocator recover call. Review: https://reviews.apache.org/r/70804 commit 4bdbd8e7da5063d55726b628b5e0d31c79650d3f Author: Meng Zhu Date: Thu Jun 6 15:58:05 2019 -0700 Added `Metrics::updateQuota` for quota metrics. This intends to replace the existing `Metrics::setQuota` and `Metrics::remove` calls. Currently, it only tracks guarantees. Need to add limits metrics. Review: https://reviews.apache.org/r/70802 commit 495162eefa12900b3a74bfbb269851473df4cce9 Author: Meng Zhu Date: Wed Jun 5 14:04:53 2019 -0700 Refactored allocator with the new quota wrapper struct. This patch also introduces a constant `DEFAULT_QUOTA`. By default, a role has no guarantees and no limits. Review: https://reviews.apache.org/r/70801 commit 75798445f932f1f163a502e2325e76cf33450836 Author: Meng Zhu Date: Tue Jun 4 10:48:51 2019 -0700 Refactored quota overcommit check. This refactor makes `QuotaTree` use the new quota wrapper struct. It also refactors the check to reflect that it is currently only checking guarantees. Review: https://reviews.apache.org/r/70800 commit f05f0616841bd539a8b6abfc591f3c287ad998d9 Author: Meng Zhu Date: Tue Jun 4 17:34:52 2019 -0700 Added a wrapper struct for quota guarantees and limits. This struct is temporarily named `Quota2` to differentiate it from the existing `Quota` struct. It will replace all uses of `Quota` and then be renamed to `Quota`. Review: https://reviews.apache.org/r/70799 {noformat} > Introduce a `struct Quota` wrapper. 
> --- > > Key: MESOS-9807 > URL: https://issues.apache.org/jira/browse/MESOS-9807 > Project: Mesos > Issue Type: Improvement > Components: allocation >Reporter: Meng Zhu >Assignee: Meng Zhu >Priority: Major > Labels: resource-management > > We should introduce: > struct Quota { > ResourceQuantities guarantees; > ResourceLimits limits; > } > There are a couple of small hurdles. First, there is already a struct Quota > wrapper in "include/mesos/quota/quota.hpp"; we need to deprecate that first. > Second, `ResourceQuantities` and `ResourceLimits` are right now only used in > internal headers. We probably want to move them into a public header, since > this struct will also be used in the allocator interface, which is also in a > public header. (Looking at this line, the boundary is already breached: > https://github.com/apache/mesos/blob/master/include/mesos/allocator/allocator.hpp#L41) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9820) Add `updateQuota()` method to the allocator.
[ https://issues.apache.org/jira/browse/MESOS-9820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16864550#comment-16864550 ] Meng Zhu commented on MESOS-9820: - {noformat} commit 4703b23143ee806ed5e68d9ff6eabe9600ffc9c9 Author: Meng Zhu Date: Wed Jun 5 16:44:00 2019 -0700 Added `updateQuota` method to the allocator. This call updates a role's quota guarantees and limits. All roles have a default quota defined as `DEFAULT_QUOTA`. Currently, this means no guarantees and no limits. Thus, to "remove" a quota, one should simply update the quota to be `DEFAULT_QUOTA`. Master `setQuota` and `removeQuota` calls into the allocator are replaced with `updateQuota`. `setQuota` and `removeQuota` calls are now only used in the tests. They will be removed once those tests are refactored. Also fixed affected tests. Review: https://reviews.apache.org/r/70803 {noformat} > Add `updateQuota()` method to the allocator. > > > Key: MESOS-9820 > URL: https://issues.apache.org/jira/browse/MESOS-9820 > Project: Mesos > Issue Type: Improvement > Components: allocation >Reporter: Meng Zhu >Assignee: Meng Zhu >Priority: Major > Labels: resource-management > > This is the method that underlies the `UPDATE_QUOTA` operator call. This will > allow the allocator to set different values for guarantees and limits. > The existing `setQuota` and `removeQuota` methods in the allocator will be > deprecated. This will likely break many existing allocator tests. We should > fix and refactor tests to verify the bursting up to limits feature. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
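A minimal sketch of the `updateQuota`/`DEFAULT_QUOTA` semantics described in the commit message above — every role has the default quota, and "removing" a quota is just updating back to the default. Types and method names are simplified stand-ins, not the actual allocator interface:

```cpp
#include <map>
#include <string>

// Hypothetical simplified quota wrapper.
struct Quota {
  std::map<std::string, double> guarantees;
  std::map<std::string, double> limits;

  bool operator==(const Quota& other) const {
    return guarantees == other.guarantees && limits == other.limits;
  }
};

const Quota DEFAULT_QUOTA{};  // no guarantees and no limits

class Allocator {
public:
  // One entry point replaces setQuota/removeQuota: updating to
  // DEFAULT_QUOTA is equivalent to removal.
  void updateQuota(const std::string& role, const Quota& quota) {
    if (quota == DEFAULT_QUOTA) {
      quotas.erase(role);  // back to default: drop the explicit entry
    } else {
      quotas[role] = quota;
    }
  }

  Quota getQuota(const std::string& role) const {
    auto it = quotas.find(role);
    return it == quotas.end() ? DEFAULT_QUOTA : it->second;
  }

private:
  std::map<std::string, Quota> quotas;
};
```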
[jira] [Assigned] (MESOS-9820) Add `updateQuota()` method to the allocator.
[ https://issues.apache.org/jira/browse/MESOS-9820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Meng Zhu reassigned MESOS-9820: --- Assignee: Meng Zhu > Add `updateQuota()` method to the allocator. > > > Key: MESOS-9820 > URL: https://issues.apache.org/jira/browse/MESOS-9820 > Project: Mesos > Issue Type: Improvement > Components: allocation >Reporter: Meng Zhu >Assignee: Meng Zhu >Priority: Major > Labels: resource-management > > This is the method that underlies the `UPDATE_QUOTA` operator call. This will > allow the allocator to set different values for guarantees and limits. > The existing `setQuota` and `removeQuota` methods in the allocator will be > deprecated. This will likely break many existing allocator tests. We should > fix and refactor tests to verify the bursting up to limits feature. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9847) Docker executor doesn't wait for status updates to be ack'd before shutting down.
Meng Zhu created MESOS-9847: --- Summary: Docker executor doesn't wait for status updates to be ack'd before shutting down. Key: MESOS-9847 URL: https://issues.apache.org/jira/browse/MESOS-9847 Project: Mesos Issue Type: Bug Components: executor Reporter: Meng Zhu The docker executor doesn't wait for pending status updates to be acknowledged before shutting down; instead, it sleeps for one second and then terminates: {noformat} void _stop() { // A hack for now ... but we need to wait until the status update // is sent to the slave before we shut ourselves down. // TODO(tnachen): Remove this hack and also the same hack in the // command executor when we have the new HTTP APIs to wait until // an ack. os::sleep(Seconds(1)); driver.get()->stop(); } {noformat} This results in a race between the task status update (e.g. TASK_FINISHED) and executor exit. If the executor exits first, the agent generates a `TASK_FAILED` status update by itself, leading to the confusing case where the agent handles two different terminal status updates. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
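One generic way to remove the fixed sleep, sketched below with a plain condition variable rather than the actual executor driver or HTTP API: track in-flight updates and block shutdown until they drain (or a safety-net timeout fires):

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Illustrative alternative to the one-second sleep: wait until all
// in-flight status updates are acknowledged before stopping. This is
// a generic sketch, not the Mesos executor API.
class AckTracker {
public:
  void sent()  { std::lock_guard<std::mutex> l(m); ++pending; }

  void acked() {
    std::lock_guard<std::mutex> l(m);
    if (pending > 0 && --pending == 0) cv.notify_all();
  }

  // Returns true if all updates were acknowledged before the timeout.
  bool awaitDrained(std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> l(m);
    return cv.wait_for(l, timeout, [this] { return pending == 0; });
  }

private:
  std::mutex m;
  std::condition_variable cv;
  int pending = 0;
};
```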
[jira] [Created] (MESOS-9835) `QuotaRoleAllocateNonQuotaResource` is failing.
Meng Zhu created MESOS-9835: --- Summary: `QuotaRoleAllocateNonQuotaResource` is failing. Key: MESOS-9835 URL: https://issues.apache.org/jira/browse/MESOS-9835 Project: Mesos Issue Type: Bug Components: test Reporter: Meng Zhu Assignee: Meng Zhu {noformat} [ RUN ] HierarchicalAllocatorTest.QuotaRoleAllocateNonQuotaResource ../../src/tests/hierarchical_allocator_tests.cpp:4094: Failure Value of: allocations.get().isPending() Actual: false Expected: true [ FAILED ] HierarchicalAllocatorTest.QuotaRoleAllocateNonQuotaResource (12 ms) {noformat} The test is failing because: after agent3 is added, the test is missing a settle call, so the allocation of agent3 is racy. In addition, since https://github.com/apache/mesos/commit/7df8cc6b79e294c075de09f1de4b31a2b88423c8 we now offer non-quota resources on an agent (even if that means "chopping") on top of a role's satisfied guarantees, so the test needs to be updated in accordance with this behavior change. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9807) Introduce a `struct Quota` wrapper.
[ https://issues.apache.org/jira/browse/MESOS-9807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Meng Zhu reassigned MESOS-9807: --- Assignee: Meng Zhu > Introduce a `struct Quota` wrapper. > --- > > Key: MESOS-9807 > URL: https://issues.apache.org/jira/browse/MESOS-9807 > Project: Mesos > Issue Type: Improvement > Components: allocation >Reporter: Meng Zhu >Assignee: Meng Zhu >Priority: Major > Labels: resource-management > > We should introduce: > struct Quota { > ResourceQuantities guarantees; > ResourceLimits limits; > } > There are a couple of small hurdles. First, there is already a struct Quota > wrapper in "include/mesos/quota/quota.hpp"; we need to deprecate that first. > Second, `ResourceQuantities` and `ResourceLimits` are right now only used in > internal headers. We probably want to move them into a public header, since > this struct will also be used in the allocator interface, which is also in a > public header. (Looking at this line, the boundary is already breached: > https://github.com/apache/mesos/blob/master/include/mesos/allocator/allocator.hpp#L41) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9834) Remove `GET_QUOTA` and `REMOVE_QUOTA` calls.
Meng Zhu created MESOS-9834: --- Summary: Remove `GET_QUOTA` and `REMOVE_QUOTA` calls. Key: MESOS-9834 URL: https://issues.apache.org/jira/browse/MESOS-9834 Project: Mesos Issue Type: Task Components: HTTP API Reporter: Meng Zhu These calls are already deprecated in favor of `UPDATE_QUOTA`. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9807) Introduce a `struct Quota` wrapper.
[ https://issues.apache.org/jira/browse/MESOS-9807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16858216#comment-16858216 ] Meng Zhu commented on MESOS-9807: - {noformat} commit 8fd52f1ad41c7aa131ceaac1b83a5bd1d06eca21 Author: Meng Zhu m...@mesosphere.io Date: Tue Jun 4 09:51:00 2019 -0700 Moved `class ResourceQuantities` to public header. Some public facing classes such as `Resources` already depend on `ResourceQuantities` and more are coming. Review: https://reviews.apache.org/r/70786 {noformat} > Introduce a `struct Quota` wrapper. > --- > > Key: MESOS-9807 > URL: https://issues.apache.org/jira/browse/MESOS-9807 > Project: Mesos > Issue Type: Improvement > Components: allocation >Reporter: Meng Zhu >Priority: Major > Labels: resource-management > > We should introduce: > struct Quota { > ResourceQuantities guarantees; > ResourceLimits limits; > } > There are a couple of small hurdles. First, there is already a struct Quota > wrapper in "include/mesos/quota/quota.hpp"; we need to deprecate that first. > Second, `ResourceQuantities` and `ResourceLimits` are right now only used in > internal headers. We probably want to move them into a public header, since > this struct will also be used in the allocator interface, which is also in a > public header. (Looking at this line, the boundary is already breached: > https://github.com/apache/mesos/blob/master/include/mesos/allocator/allocator.hpp#L41) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9820) Add `updateQuota()` method to the allocator.
Meng Zhu created MESOS-9820: --- Summary: Add `updateQuota()` method to the allocator. Key: MESOS-9820 URL: https://issues.apache.org/jira/browse/MESOS-9820 Project: Mesos Issue Type: Improvement Components: allocation Reporter: Meng Zhu This is the method that underlies the `UPDATE_QUOTA` operator call. This will allow the allocator to set different values for guarantees and limits. The existing `setQuota` and `removeQuota` methods in the allocator will be deprecated. This will likely break many existing allocator tests. We should fix and refactor tests to verify the bursting up to limits feature. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9813) Track role consumed quota for all roles in the allocator.
Meng Zhu created MESOS-9813: --- Summary: Track role consumed quota for all roles in the allocator. Key: MESOS-9813 URL: https://issues.apache.org/jira/browse/MESOS-9813 Project: Mesos Issue Type: Improvement Components: allocation Reporter: Meng Zhu We are already tracking consumed quota for roles with non-default quota in the allocator. We should expand that to track all roles' consumption, which will then be exposed through metrics. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9812) Add overcommit validation for update quota call.
Meng Zhu created MESOS-9812: --- Summary: Add overcommit validation for update quota call. Key: MESOS-9812 URL: https://issues.apache.org/jira/browse/MESOS-9812 Project: Mesos Issue Type: Improvement Reporter: Meng Zhu Add an overcommit check and force-flag override for the update quota call. Right now, we only validate each quota config individually. We need to add further validation for the update quota call regarding cluster resource overcommitment (with a force-flag override) as well as hierarchical quota validity. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
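The cluster-level check described here might look roughly like the following sketch, with simplified stand-in types (plain maps rather than `ResourceQuantities` and the registry's quota configs):

```cpp
#include <map>
#include <string>

// Sketch of an overcommit check: the sum of all roles' guarantees must
// fit in the cluster total, unless the operator passes a force flag.
bool validateNoOvercommit(
    const std::map<std::string, std::map<std::string, double>>& guaranteesByRole,
    const std::map<std::string, double>& clusterTotal,
    bool force)
{
  if (force) return true;  // operator explicitly overrides the check

  // Sum guarantees across all roles, per resource name.
  std::map<std::string, double> summed;
  for (const auto& role : guaranteesByRole) {
    for (const auto& resource : role.second) {
      summed[resource.first] += resource.second;
    }
  }

  // Reject if any summed guarantee exceeds the cluster total.
  for (const auto& resource : summed) {
    auto it = clusterTotal.find(resource.first);
    if (it == clusterTotal.end() || resource.second > it->second) {
      return false;
    }
  }
  return true;
}
```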
[jira] [Comment Edited] (MESOS-8456) Allocator should allow roles to burst above guarantees but below limits.
[ https://issues.apache.org/jira/browse/MESOS-8456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16855309#comment-16855309 ] Meng Zhu edited comment on MESOS-8456 at 6/4/19 4:54 AM: - main allocator patch: https://reviews.apache.org/r/70738/ was (Author: mzhu): https://reviews.apache.org/r/70738/ > Allocator should allow roles to burst above guarantees but below limits. > > > Key: MESOS-8456 > URL: https://issues.apache.org/jira/browse/MESOS-8456 > Project: Mesos > Issue Type: Improvement > Components: allocation >Reporter: Meng Zhu >Assignee: Meng Zhu >Priority: Major > Labels: Mesosphere, multitenancy > > Currently, allocator only allocates resources for quota roles up to their > guarantee in the first allocation stage. The allocator should continue > allocating resources to these roles in the second stage below their quota > limit. In other words, allocator should allow roles to burst above their > guarantee but below the limit. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8456) Allocator should allow roles to burst above guarantees but below limits.
[ https://issues.apache.org/jira/browse/MESOS-8456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16855307#comment-16855307 ] Meng Zhu commented on MESOS-8456: - Some preparation patches: {noformat} commit 31ac45be0a55fc33982641516bcc5eb3226ef406 Author: Meng Zhu Date: Tue May 28 16:28:28 2019 +0200 Added a function to shrink `Resources` to target `ResourceLimits`. Also added unit tests. Review: https://reviews.apache.org/r/70737 commit 8d372e14b0240aa5735a7c0cf36e03e7b3344bd1 Author: Meng Zhu Date: Tue May 28 16:27:16 2019 +0200 Added methods to subtract `ResourceQuantities` from `ResourceLimits`. This patch also makes `ResourceLimits` a friend class of `ResourceQuantities` to achieve one-pass operation complexity. Also added a unit test. Review: https://reviews.apache.org/r/70735 {noformat} > Allocator should allow roles to burst above guarantees but below limits. > > > Key: MESOS-8456 > URL: https://issues.apache.org/jira/browse/MESOS-8456 > Project: Mesos > Issue Type: Improvement > Components: allocation >Reporter: Meng Zhu >Assignee: Meng Zhu >Priority: Major > Labels: Mesosphere, multitenancy > > Currently, the allocator only allocates resources for quota roles up to their > guarantee in the first allocation stage. The allocator should continue > allocating resources to these roles in the second stage below their quota > limit. In other words, the allocator should allow roles to burst above their > guarantee but below the limit. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Issue Comment Deleted] (MESOS-8456) Allocator should allow roles to burst above guarantees but below limits.
[ https://issues.apache.org/jira/browse/MESOS-8456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Meng Zhu updated MESOS-8456: Comment: was deleted (was: https://reviews.apache.org/r/65661 https://reviews.apache.org/r/65819 https://reviews.apache.org/r/65820 https://reviews.apache.org/r/65821 https://reviews.apache.org/r/65844 https://reviews.apache.org/r/65845 https://reviews.apache.org/r/65847 ) > Allocator should allow roles to burst above guarantees but below limits. > > > Key: MESOS-8456 > URL: https://issues.apache.org/jira/browse/MESOS-8456 > Project: Mesos > Issue Type: Improvement > Components: allocation >Reporter: Meng Zhu >Assignee: Meng Zhu >Priority: Major > Labels: Mesosphere, multitenancy > > Currently, allocator only allocates resources for quota roles up to their > guarantee in the first allocation stage. The allocator should continue > allocating resources to these roles in the second stage below their quota > limit. In other words, allocator should allow roles to burst above their > guarantee but below the limit. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9807) Introduce a `struct Quota` wrapper.
Meng Zhu created MESOS-9807: --- Summary: Introduce a `struct Quota` wrapper. Key: MESOS-9807 URL: https://issues.apache.org/jira/browse/MESOS-9807 Project: Mesos Issue Type: Improvement Components: allocation Reporter: Meng Zhu We should introduce: struct Quota { ResourceQuantities guarantees; ResourceLimits limits; } There are a couple of small hurdles. First, there is already a struct Quota wrapper in "include/mesos/quota/quota.hpp"; we need to deprecate that first. Second, `ResourceQuantities` and `ResourceLimits` are right now only used in internal headers. We probably want to move them into a public header, since this struct will also be used in the allocator interface, which is also in a public header. (Looking at this line, the boundary is already breached: https://github.com/apache/mesos/blob/master/include/mesos/allocator/allocator.hpp#L41) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9806) Address allocator performance regression due to the removal of quota role sorter.
Meng Zhu created MESOS-9806: --- Summary: Address allocator performance regression due to the removal of quota role sorter. Key: MESOS-9806 URL: https://issues.apache.org/jira/browse/MESOS-9806 Project: Mesos Issue Type: Improvement Components: allocation Reporter: Meng Zhu Assignee: Meng Zhu In MESOS-9802, we removed the quota role sorter, which was tech debt. However, this slows down the allocator. The problem is that in the first stage, even if a cluster has no active roles with non-default quota, the allocator now has to sort and go through each and every role in the cluster. Benchmark results show that for 1k roles with 2k frameworks, the allocator could experience ~50% performance degradation. There are a couple of ways to address this issue. For example, we could make the sorter aware of quota and add a method, say `sortQuotaRoles`, that returns all the roles with non-default quota. Alternatively, an even better approach would be to deprecate the sorter concept and just have two standalone functions, e.g. sortRoles() and sortQuotaRoles(), that take in the role tree structure (which does not yet exist in the allocator) and return the sorted roles. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
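The proposed `sortQuotaRoles` could be sketched as a filter-then-sort over the role list, so the first stage only touches the usually small set of roles with non-default quota. The `RoleEntry` type and ordering key below are hypothetical; the real sorter keys on DRF shares or random order:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical flattened role record for the sketch.
struct RoleEntry {
  std::string name;
  bool hasNonDefaultQuota;
  double share;  // e.g. dominant share used for DRF ordering
};

// Pre-filter to roles with non-default quota, then sort only those,
// avoiding a full-cluster sort in the guarantee stage.
std::vector<std::string> sortQuotaRoles(std::vector<RoleEntry> roles)
{
  roles.erase(
      std::remove_if(roles.begin(), roles.end(),
                     [](const RoleEntry& r) { return !r.hasNonDefaultQuota; }),
      roles.end());

  std::sort(roles.begin(), roles.end(),
            [](const RoleEntry& a, const RoleEntry& b) {
              return a.share < b.share;  // fairest (lowest share) first
            });

  std::vector<std::string> names;
  for (const RoleEntry& r : roles) names.push_back(r.name);
  return names;
}
```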
[jira] [Created] (MESOS-9802) Remove quota role sorter in the allocator.
Meng Zhu created MESOS-9802: --- Summary: Remove quota role sorter in the allocator. Key: MESOS-9802 URL: https://issues.apache.org/jira/browse/MESOS-9802 Project: Mesos Issue Type: Improvement Components: allocation Reporter: Meng Zhu Assignee: Meng Zhu Remove the dedicated quota role sorter in favor of using the same sorting for both satisfying guarantees and bursting above guarantees up to limits. This is tech debt from when a "quota role" was considered different from a "non-quota" role. However, they are the same; one just has the default quota. The only practical difference between the quota role sorter and the role sorter now is that the quota role sorter ignores revocable resources, both in its total resource pool and in role allocations. Thus when using DRF, it does not count revocable resources, which is arguably the right behavior. By removing the quota sorter, we will have all roles sorted together. When using DRF, the share calculation in the first (quota guarantee) allocation stage will then also include revocable resources. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
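The share-calculation difference can be seen in a toy DRF dominant-share computation: enlarging the pool with revocable resources lowers every role's computed share. Plain doubles stand in for Mesos resource quantities in this sketch:

```cpp
#include <algorithm>
#include <map>
#include <string>

// Toy DRF dominant share: the maximum, over resource kinds, of the
// role's allocation divided by the pool total for that kind. Whether
// the pool includes revocable resources changes the result.
double dominantShare(
    const std::map<std::string, double>& allocation,
    const std::map<std::string, double>& pool)
{
  double share = 0.0;
  for (const auto& resource : allocation) {
    auto it = pool.find(resource.first);
    if (it != pool.end() && it->second > 0.0) {
      share = std::max(share, resource.second / it->second);
    }
  }
  return share;
}
```

With 2 cpus allocated out of an 8-cpu non-revocable pool the share is 0.25; counting 8 additional revocable cpus in the pool halves it to 0.125, which is the behavior change the ticket calls out.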
[jira] [Created] (MESOS-9796) Add `min_allocatable_resources` to mesos-execute.
Meng Zhu created MESOS-9796: --- Summary: Add `min_allocatable_resources` to mesos-execute. Key: MESOS-9796 URL: https://issues.apache.org/jira/browse/MESOS-9796 Project: Mesos Issue Type: Task Components: cli Reporter: Meng Zhu -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9786) Race between two REMOVE_QUOTA calls crashes the master.
[ https://issues.apache.org/jira/browse/MESOS-9786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841469#comment-16841469 ] Meng Zhu commented on MESOS-9786: - {noformat} commit d9ab461ad4dadf13ec45d52e83a0e9a2f452de74 (HEAD -> quota_race, apache/master) Author: Meng Zhu Date: Thu May 16 12:12:15 2019 +0200 Fix a bug where racing quota removal request could crash the master. Also added a test. Review: https://reviews.apache.org/r/70656 {noformat} > Race between two REMOVE_QUOTA calls crashes the master. > --- > > Key: MESOS-9786 > URL: https://issues.apache.org/jira/browse/MESOS-9786 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.6.2, 1.7.2, 1.8.0, 1.9.0 >Reporter: Andrei Sekretenko >Assignee: Meng Zhu >Priority: Major > Labels: resource-management > > The existence of the quota in the master is validated here: > [https://github.com/apache/mesos/blob/a9a2acabd03181865055b77cf81e7bb310b236d6/src/master/quota_handler.cpp#L700] > Then the quota is removed from master in a deferred method call: > [https://github.com/apache/mesos/blob/a9a2acabd03181865055b77cf81e7bb310b236d6/src/master/quota_handler.cpp#L744] > And then removed from allocator in another deferred call: > [https://github.com/apache/mesos/blob/a9a2acabd03181865055b77cf81e7bb310b236d6/src/master/quota_handler.cpp#L753] > So, there is a race between two simultaneous REMOVE_QUOTA calls. > We observe this race on a heavily loaded cluster. Currently we suspect that > the client retries the call (due to the call being not processed for a long > time), and this triggers the race. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9786) Race between two REMOVE_QUOTA calls crashes the master.
[ https://issues.apache.org/jira/browse/MESOS-9786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Meng Zhu reassigned MESOS-9786: --- Assignee: Meng Zhu > Race between two REMOVE_QUOTA calls crashes the master. > --- > > Key: MESOS-9786 > URL: https://issues.apache.org/jira/browse/MESOS-9786 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.5.1, 1.8.0, 1.8.1 >Reporter: Andrei Sekretenko >Assignee: Meng Zhu >Priority: Major > Labels: resource-management > > The existence of the quota in the master is validated here: > [https://github.com/apache/mesos/blob/a9a2acabd03181865055b77cf81e7bb310b236d6/src/master/quota_handler.cpp#L700] > Then the quota is removed from master in a deferred method call: > [https://github.com/apache/mesos/blob/a9a2acabd03181865055b77cf81e7bb310b236d6/src/master/quota_handler.cpp#L744] > And then removed from allocator in another deferred call: > [https://github.com/apache/mesos/blob/a9a2acabd03181865055b77cf81e7bb310b236d6/src/master/quota_handler.cpp#L753] > So, there is a race between two simultaneous REMOVE_QUOTA calls. > We observe this race on a heavily loaded cluster. Currently we suspect that > the client retries the call (due to the call being not processed for a long > time), and this triggers the race. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
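Because validation and removal run in separate deferred continuations, the removal step needs to tolerate losing the race rather than crashing. A simplified sketch of that defensive re-check (hypothetical class, not the actual quota handler):

```cpp
#include <set>
#include <string>

// Sketch: the removal continuation re-checks that the quota still
// exists instead of assuming the earlier validation still holds.
class QuotaHandler {
public:
  void setQuota(const std::string& role) { quotas.insert(role); }

  // Returns false (instead of crashing) if a concurrent call already
  // removed the quota between validation and this continuation.
  bool removeQuota(const std::string& role) {
    if (quotas.count(role) == 0) {
      return false;  // lost the race; treat as already removed
    }
    quotas.erase(role);
    return true;
  }

private:
  std::set<std::string> quotas;
};
```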
[jira] [Assigned] (MESOS-9782) Random sorter fails to clear removed clients.
[ https://issues.apache.org/jira/browse/MESOS-9782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Meng Zhu reassigned MESOS-9782: --- Assignee: Meng Zhu > Random sorter fails to clear removed clients. > - > > Key: MESOS-9782 > URL: https://issues.apache.org/jira/browse/MESOS-9782 > Project: Mesos > Issue Type: Bug > Components: allocation >Affects Versions: 1.8.0 >Reporter: Meng Zhu >Assignee: Meng Zhu >Priority: Blocker > Labels: resource-management > > In `RandomSorter::SortInfo::updateRelativeWeights()`, we do not clear the > stale `clients` and `weights` vectors if the state is dirty. This would result > in an allocator crash due to including removed frameworks and roles in the > sorted result, e.g. a check failure would occur here > (https://github.com/apache/mesos/blob/62f0b6973b2268a3305fd631a914433a933c6757/src/master/allocator/mesos/hierarchical.cpp#L1849). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9782) Random sorter fails to clear removed clients.
Meng Zhu created MESOS-9782: --- Summary: Random sorter fails to clear removed clients. Key: MESOS-9782 URL: https://issues.apache.org/jira/browse/MESOS-9782 Project: Mesos Issue Type: Bug Components: allocation Affects Versions: 1.8.0 Reporter: Meng Zhu In `RandomSorter::SortInfo::updateRelativeWeights()`, we do not clear the stale `clients` and `weights` vectors if the state is dirty. This would result in an allocator crash due to including removed frameworks and roles in a sorted result, e.g. a check failure would occur here (https://github.com/apache/mesos/blob/62f0b6973b2268a3305fd631a914433a933c6757/src/master/allocator/mesos/hierarchical.cpp#L1849). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9781) Templatize the allocator tests for different sorters.
Meng Zhu created MESOS-9781: --- Summary: Templatize the allocator tests for different sorters. Key: MESOS-9781 URL: https://issues.apache.org/jira/browse/MESOS-9781 Project: Mesos Issue Type: Improvement Components: allocation Reporter: Meng Zhu Currently, most (all?) allocator tests use the DRF sorter: https://github.com/apache/mesos/blob/62f0b6973b2268a3305fd631a914433a933c6757/src/tests/hierarchical_allocator_tests.cpp#L137 This means we have little coverage for allocators that use the random sorter. Tests should be examined and templatized for both sorters if possible. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9780) Improve "picky" framework resource allocation under random sorter.
Meng Zhu created MESOS-9780: --- Summary: Improve "picky" framework resource allocation under random sorter. Key: MESOS-9780 URL: https://issues.apache.org/jira/browse/MESOS-9780 Project: Mesos Issue Type: Improvement Components: allocation Reporter: Meng Zhu Picky frameworks are frameworks that are interested in some particular set of resources. With the current offer model, such a framework usually keeps declining and filtering uninteresting offers until it accepts an offer that meets its needs. While picky frameworks are always prone to performance issues, they are more likely to experience offer starvation under the random sorter than under the DRF sorter. Under the DRF sorter, declining offers or Mesos-side resource filtering does not affect the framework's dominant resource share. Since other frameworks might get resources allocated at the same time, which raises their shares comparatively, a declined/filtered framework would usually have a higher chance of getting other offers as time goes by (if it keeps declining). This reduces the time it takes such a framework to eventually get what it wants. The random sorter, however, is stateless: a decline or filter action has no effect on the chance of a framework getting offers. A framework declining or filtering an offer essentially wastes a shot for nothing; it becomes a truly altruistic act with no perceived gain on the framework side. This makes the random sorter likely to perform poorly compared to DRF in terms of handling picky frameworks. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9778) Randomized the agents in the second allocation stage.
[ https://issues.apache.org/jira/browse/MESOS-9778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16836739#comment-16836739 ] Meng Zhu commented on MESOS-9778: - {noformat} commit d13be8432180d3b64947a320fa0c11340dba029a Author: Meng Zhu m...@mesosphere.io Date: Wed May 8 16:58:02 2019 -0700 Randomized the agents in the second allocation stage. Before this patch, agents are randomized before the 1st allocation stage (the quota allocation stage) but not in the 2nd stage. One perceived issue is that resources on the agents in the front of the queue are likely to be mostly allocated in the 1st stage, leaving only slices of resources available for the second stage. Thus we may see consistently low quality offers for roles/frameworks that get allocated first in the 2nd stage. This patch randomizes the agents again before the 2nd stage to "spread out" the effect of the 1st stage allocation. Review: https://reviews.apache.org/r/70613 {noformat} > Randomized the agents in the second allocation stage. > - > > Key: MESOS-9778 > URL: https://issues.apache.org/jira/browse/MESOS-9778 > Project: Mesos > Issue Type: Improvement > Components: allocation >Reporter: Meng Zhu >Assignee: Meng Zhu >Priority: Major > Labels: resource-management > > Agents are currently randomized before the 1st > allocation stage (the quota allocation stage) but not in > the 2nd stage. One perceived issue is that resources on > the agents in the front of the queue are likely to be mostly > allocated in the 1st stage, leaving only slices of resources > available for the second stage. Thus we may see consistently > low quality offers for roles/frameworks that get allocated first > in the 2nd stage. > Consider randomizing the agents in the second allocation stage. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9778) Randomized the agents in the second allocation stage.
Meng Zhu created MESOS-9778: --- Summary: Randomized the agents in the second allocation stage. Key: MESOS-9778 URL: https://issues.apache.org/jira/browse/MESOS-9778 Project: Mesos Issue Type: Improvement Components: allocation Reporter: Meng Zhu Assignee: Meng Zhu Agents are currently randomized before the 1st allocation stage (the quota allocation stage) but not in the 2nd stage. One perceived issue is that resources on the agents in the front of the queue are likely to be mostly allocated in the 1st stage, leaving only slices of resources available for the second stage. Thus we may see consistently low quality offers for roles/frameworks that get allocated first in the 2nd stage. Consider randomizing the agents in the second allocation stage. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
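The change proposed above amounts to shuffling the agent list independently before each allocation stage, so the (mostly consumed) front of the stage-1 queue does not stay at the front in stage 2. A minimal Python sketch of the idea, not Mesos code, with a hypothetical function name:

```python
import random

def two_stage_allocation_order(agents, seed=None):
    """Return the agent orderings used by the two allocation stages,
    shuffled independently so stage-2 order is decorrelated from
    stage-1 order (sketch only; the real allocator is C++)."""
    rng = random.Random(seed)
    stage1 = list(agents)
    rng.shuffle(stage1)  # randomized before the 1st (quota) stage
    stage2 = list(agents)
    rng.shuffle(stage2)  # randomized again before the 2nd stage
    return stage1, stage2
```

Both orderings are permutations of the same agent list; only their relative order differs between stages.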
[jira] [Created] (MESOS-9777) Consider doing an internal retry if reservation and similar operations fail due to 409 conflict.
Meng Zhu created MESOS-9777: --- Summary: Consider doing an internal retry if reservation and similar operations fail due to 409 conflict. Key: MESOS-9777 URL: https://issues.apache.org/jira/browse/MESOS-9777 Project: Mesos Issue Type: Improvement Components: master Reporter: Meng Zhu A reservation request may return 409 Conflict: https://github.com/apache/mesos/blob/261d6ef497383795557aaca5dce426b4482eabea/src/master/http.cpp#L4026 This is due to the inherent race between the master and allocator actors, as illustrated here: https://github.com/apache/mesos/blob/261d6ef497383795557aaca5dce426b4482eabea/src/master/allocator/mesos/hierarchical.cpp#L992-L1008 This is not ideal, though it should be rare. However, this error is hard for users to understand and handle. It seems beneficial for Mesos to retry the reservation operation internally for the user. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
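The suggested internal retry could look roughly like the following. This is a hypothetical sketch, not a Mesos API: the operation is modeled as a callable returning an HTTP-like status code, and the retry uses exponential backoff since the conflict is transient.

```python
import time

def call_with_conflict_retry(operation, max_attempts=5, backoff=0.1):
    """Retry an operation that can transiently fail with 409 Conflict
    (e.g. due to the master/allocator actor race). Returns the first
    non-409 status, or 409 if every attempt conflicted."""
    for attempt in range(max_attempts):
        status = operation()
        if status != 409:
            return status
        time.sleep(backoff * (2 ** attempt))  # exponential backoff
    return 409
```

A framework-side client could use the same pattern today, which is part of why doing it once inside Mesos seems preferable to every user reimplementing it.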
[jira] [Commented] (MESOS-9725) Perform incremental sorting in the random sorter.
[ https://issues.apache.org/jira/browse/MESOS-9725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835725#comment-16835725 ] Meng Zhu commented on MESOS-9725: - Based on a recent internal test, the sort() does not take much time, and this ticket would introduce some extra complexity. The review above (https://reviews.apache.org/r/70497/) is mostly ready except for one issue that still needs to be figured out. In the review, we used a hashmap with double as the key. This worries us because of double precision issues. A solution is to use rational numbers. Given the benefit and complexity of the patch, we decided to shelve it for now. Moving this ticket back to `accepted`. > Perform incremental sorting in the random sorter. > - > > Key: MESOS-9725 > URL: https://issues.apache.org/jira/browse/MESOS-9725 > Project: Mesos > Issue Type: Improvement >Reporter: Meng Zhu >Assignee: Meng Zhu >Priority: Major > Labels: performance, resource-management > > By doing random sampling every time the caller asks for the next client > (See MESOS-9722) we could avoid the cost of full shuffling and only pay as we > go. > While the hope is to do each random sampling with O(1) cost, the presence of > weights complicates the matter. We will need to pay O(log n) for every > sample even with fancy data structures like segment trees or binary indexed > trees (naive ones will result in O(n) since we need to look at every node's > weights). And the current full node shuffling is already optimal (O(n log n)) > if all nodes are picked. > However, since the number of *distinct* weights is usually much smaller > compared to the number of clients, we can minimize the sample cost by picking > a client in two steps: > Step 1: randomly pick a group of clients that has the same weight by > generating a weighted random number. > Step 2: Once a vector of clients is chosen, randomly sample a specific client > within the group. 
Since all the clients in the chosen vector have the same > weight, we do not need to consider any weights. > > Since the number of distinct weights is usually much smaller compared to the > number of clients, this minimizes the cost of generating weighted random > numbers, which is linear in the number of distinct weights. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
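The two-step pick described in the ticket can be sketched as follows (a hypothetical Python illustration, not the Mesos implementation): Step 1 costs time linear in the number of distinct weights per sample, and Step 2 is a plain uniform choice within the chosen group.

```python
import random

def sample_client(clients_by_weight, rng=random):
    """Pick one client from {weight: [clients with that weight]}.
    Each client's overall probability is proportional to its weight."""
    # Step 1: pick a weight group with probability proportional to its
    # total weight (weight * number of clients in the group).
    groups = list(clients_by_weight.items())
    totals = [w * len(cs) for w, cs in groups]
    r = rng.uniform(0, sum(totals))
    for (w, cs), t in zip(groups, totals):
        if r < t:
            # Step 2: every client in the group shares one weight, so a
            # uniform pick suffices; no per-client weighting is needed.
            return rng.choice(cs)
        r -= t
    return groups[-1][1][-1]  # guard against float rounding at the edge
```

For instance, with groups `{1.0: ["a", "b"], 3.0: ["c"]}`, client "c" should be picked about 3/5 of the time, while Step 1 only ever inspects two groups regardless of how many clients share each weight.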
[jira] [Commented] (MESOS-9722) Refactor the sorter interface to enable lazy sorting.
[ https://issues.apache.org/jira/browse/MESOS-9722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835715#comment-16835715 ] Meng Zhu commented on MESOS-9722: - Based on a recent internal test, the sort() does not take much time, and this ticket would introduce some extra complexity. The review above (https://reviews.apache.org/r/70419) is mostly ready, but we decided to shelve it for now. Moving this ticket back to `accepted`. > Refactor the sorter interface to enable lazy sorting. > - > > Key: MESOS-9722 > URL: https://issues.apache.org/jira/browse/MESOS-9722 > Project: Mesos > Issue Type: Improvement > Components: allocation >Reporter: Meng Zhu >Assignee: Meng Zhu >Priority: Major > Labels: performance, resource-management > > Currently, the only way to get sorted clients from the sorter is through: > {noformat} > vector Sorter::sort() > {noformat} > This sorts all the active clients in the tree and returns all of them in a > single vector. This is inefficient if the callers end up only needing a few > clients (e.g. when allocating one agent, only one or a few roles are > allocated). > We could refactor the interface to return an iterator-like handle and then > callers can query the next client in the sorting order. This would pave > the way for lazy sorting (i.e. only get the nth client) and improve > performance. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
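An iterator-like sorter handle could look roughly like this. This is a hypothetical Python sketch using a heap, not the proposed C++ interface: heapify is O(n), each pulled client costs O(log n), so a caller that visits only a few clients avoids paying for the full O(n log n) sort.

```python
import heapq

class LazySorter:
    """Iterator-style sorter handle (sketch): callers pull the next
    client on demand instead of receiving all clients sorted at once."""

    def __init__(self, shares):
        # shares: {client: dominant share}; smaller share sorts first.
        self._heap = [(share, client) for client, share in shares.items()]
        heapq.heapify(self._heap)  # O(n), cheaper than a full sort

    def __iter__(self):
        return self

    def __next__(self):
        if not self._heap:
            raise StopIteration
        return heapq.heappop(self._heap)[1]  # O(log n) per client
```

A caller allocating a single agent could then do `next(sorter)` once or twice and stop, which is exactly the "only get the nth client" case the ticket describes.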
[jira] [Created] (MESOS-9759) Log required quota headroom and available quota headroom in the allocator.
Meng Zhu created MESOS-9759: --- Summary: Log required quota headroom and available quota headroom in the allocator. Key: MESOS-9759 URL: https://issues.apache.org/jira/browse/MESOS-9759 Project: Mesos Issue Type: Improvement Components: allocation Reporter: Meng Zhu This would ease the debugging of allocation issues. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9759) Log required quota headroom and available quota headroom in the allocator.
[ https://issues.apache.org/jira/browse/MESOS-9759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Meng Zhu reassigned MESOS-9759: --- Assignee: Meng Zhu > Log required quota headroom and available quota headroom in the allocator. > -- > > Key: MESOS-9759 > URL: https://issues.apache.org/jira/browse/MESOS-9759 > Project: Mesos > Issue Type: Improvement > Components: allocation >Reporter: Meng Zhu >Assignee: Meng Zhu >Priority: Major > Labels: resource-management > > This would ease the debugging of allocation issues. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9758) Take ports out of the roles endpoints.
Meng Zhu created MESOS-9758: --- Summary: Take ports out of the roles endpoints. Key: MESOS-9758 URL: https://issues.apache.org/jira/browse/MESOS-9758 Project: Mesos Issue Type: Bug Reporter: Meng Zhu It does not make sense to combine ports across agents. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9710) Add tests to ensure random sorter performs correct weighted sorting.
[ https://issues.apache.org/jira/browse/MESOS-9710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16830774#comment-16830774 ] Meng Zhu commented on MESOS-9710: - {noformat} commit 89c3dd95a421e14044bc91ceb1998ff4ae3883b4 Author: Meng Zhu m...@mesosphere.io Date: Sun Apr 7 15:55:42 2019 -0700 Added a test to verify the sort correctness of the random sorter. Review: https://reviews.apache.org/r/70418 {noformat} > Add tests to ensure random sorter performs correct weighted sorting. > > > Key: MESOS-9710 > URL: https://issues.apache.org/jira/browse/MESOS-9710 > Project: Mesos > Issue Type: Task > Components: allocation >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Major > > We added tests for the weighted shuffle algorithm, but didn't test that the > RandomSorter's sort() function behaves correctly. > We should also test that hierarchical weights in the random sorter behave > correctly. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9724) Flatten the weighted shuffling in the random sorter.
[ https://issues.apache.org/jira/browse/MESOS-9724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826492#comment-16826492 ] Meng Zhu commented on MESOS-9724: - {noformat} commit 5108f076e6a5c275cae6b124bbcb110bc6785f94 Author: Meng Zhu Date: Wed Apr 24 11:32:38 2019 -0700 Avoided some recalculation in the random sorter. This patch keeps the sorting-related information in memory along with a dirty bit. This helps avoid unnecessary recalculation of this info in `sort()`. Review: https://reviews.apache.org/r/70430 commit 5a756402ad15cedbc6ccb8fa5de096745967f36f Author: Meng Zhu Date: Wed Apr 24 10:51:06 2019 -0700 Fixed a bug in the random sorter. Currently, in the presence of hierarchical roles, the random sorter shuffles roles level by level and then picks the active leaf nodes using DFS. This could generate non-uniform random results since active leaves in a subtree are always picked together. This patch fixes the issue by first calculating the relative weights of each active leaf node and shuffling all of them only once. Review: https://reviews.apache.org/r/70429 commit 5e52c686c29819113f42c6bde7d90324673b42dc Author: Meng Zhu Date: Tue Apr 23 18:44:33 2019 -0700 Added a random sorter helper to find active internal nodes. Active internal nodes are defined as internal nodes that have at least one active leaf node. Review: https://reviews.apache.org/r/70542 {noformat} > Flatten the weighted shuffling in the random sorter. > > > Key: MESOS-9724 > URL: https://issues.apache.org/jira/browse/MESOS-9724 > Project: Mesos > Issue Type: Improvement >Reporter: Meng Zhu >Assignee: Meng Zhu >Priority: Major > Labels: performance, resource-management > > Due to the presence of hierarchical weights, the random sorter currently > shuffles level-by-level. We should be able to shuffle all the active leaves > only once by calculating (and caching) active leaves' relative weights. 
This > should improve the performance in the presence of hierarchical roles. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
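The flattening idea above can be sketched as follows (hypothetical Python, not the Mesos sorter): first compute each active leaf's weight relative to the whole tree by multiplying normalized weights down every level, then perform a single weighted shuffle over all leaves (here using Efraimidis-Spirakis keys, an assumed choice of weighted-shuffle algorithm).

```python
import random

def relative_leaf_weights(children, weights, node="", acc=1.0, out=None):
    """children: {node path: [child paths]}; leaves have no entry.
    Each leaf's relative weight is the product, over its ancestors, of
    its weight normalized against its siblings' weights."""
    if out is None:
        out = {}
    kids = children.get(node, [])
    if not kids:
        out[node] = acc
        return out
    total = float(sum(weights[c] for c in kids))
    for c in kids:
        relative_leaf_weights(children, weights, c,
                              acc * weights[c] / total, out)
    return out

def weighted_shuffle(leaf_weights, rng=random):
    # One flat weighted shuffle over all leaves at once, replacing the
    # level-by-level shuffling (Efraimidis-Spirakis random keys).
    return sorted(leaf_weights,
                  key=lambda l: rng.random() ** (1.0 / leaf_weights[l]),
                  reverse=True)
```

For a tree with children `{"": ["a", "b"], "a": ["a/x", "a/y"]}` and weights `{"a": 2, "b": 2, "a/x": 1, "a/y": 3}`, the relative leaf weights come out as `b: 0.5`, `a/x: 0.125`, `a/y: 0.375`, and they sum to 1.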
[jira] [Created] (MESOS-9738) Add per-framework metrics for offer round trip time.
Meng Zhu created MESOS-9738: --- Summary: Add per-framework metrics for offer round trip time. Key: MESOS-9738 URL: https://issues.apache.org/jira/browse/MESOS-9738 Project: Mesos Issue Type: Bug Components: allocation Reporter: Meng Zhu This would provide more insight into framework responsiveness and help detect worrisome behaviors such as offer timeouts and offer hoarding. One tricky thing is that we need to take Mesos's own queuing delay into consideration. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9733) Random sorter generates non-uniform result for hierarchical roles.
Meng Zhu created MESOS-9733: --- Summary: Random sorter generates non-uniform result for hierarchical roles. Key: MESOS-9733 URL: https://issues.apache.org/jira/browse/MESOS-9733 Project: Mesos Issue Type: Bug Components: allocation Reporter: Meng Zhu Assignee: Meng Zhu In the presence of hierarchical roles, the random sorter shuffles roles level by level and then picks the active leaf nodes using DFS: https://github.com/apache/mesos/blob/7e7cd8de1121589225049ea33df0624b2a1bd754/src/master/allocator/sorter/random/sorter.cpp#L513-L529 This makes the result less random because subtrees are always picked together. For example, a sorting result such as `[a/., c/d, a/b, …]` is impossible. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
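To see why level-by-level shuffling is non-uniform, here is a hypothetical Python sketch (not the Mesos sorter) of the described behavior: each node's children are shuffled independently and leaves are then collected by DFS, so all leaves of a subtree always come out contiguously.

```python
import random

def level_by_level_order(children, node="", rng=random):
    """Shuffle each node's children, then collect leaves with a DFS.
    children: {node path: [child paths]}; leaves have no entry."""
    kids = list(children.get(node, []))
    if not kids:
        return [node]
    rng.shuffle(kids)  # per-level shuffle only
    order = []
    for c in kids:
        order.extend(level_by_level_order(children, c, rng))
    return order

# With roles {"a/.", "a/b"} under "a" and "c/d" under "c", the leaves
# "a/." and "a/b" are adjacent in every produced order, so an
# interleaving like ["a/.", "c/d", "a/b"] can never occur.
children = {"": ["a", "c"], "a": ["a/.", "a/b"], "c": ["c/d"]}
```

A uniform weighted shuffle over all leaves at once (as done by the fix in MESOS-9724) does not have this restriction.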