[jira] [Created] (MESOS-9460) Speculative operations may make master and agent resource views out of sync.
Meng Zhu created MESOS-9460:
---

Summary: Speculative operations may make master and agent resource views out of sync.
Key: MESOS-9460
URL: https://issues.apache.org/jira/browse/MESOS-9460
Project: Mesos
Issue Type: Bug
Affects Versions: 1.7.0, 1.6.1, 1.5.1
Reporter: Meng Zhu

This bug can happen with the following sequence of events:
 - The agent (re)registers with the master.
 - Speculative operation calls are made to the master.
 - The allocator is speculatively updated in https://github.com/apache/mesos/blob/master/src/master/master.cpp#L11315
 - Before the agent's resources are updated, the agent sends an `UpdateSlaveMessage` upon receiving the (re)registered message, if it has the `RESOURCE_PROVIDER` capability or oversubscription is used (https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L1551 and https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L1633).
 - The `UpdateSlaveMessage` triggers the allocator to update the total resources with STALE info sent from the agent (https://github.com/apache/mesos/blob/master/src/master/master.cpp#L8205), so the update from the previous operation is overwritten and LOST.
 - The agent finishes the operation and informs the master through `UpdateOperationStatusMessage`, but for a speculative operation we do not update the allocator (https://github.com/apache/mesos/blob/master/src/master/master.cpp#L11177).
 - The resource views of the master and the agent are now out of sync.

This caused MESOS-7971 and likely MESOS-9458 as well.

[~chhsia0] proposes to use `resource_version_uuid` to fix this (https://issues.apache.org/jira/browse/MESOS-7971?focusedCommentId=16712278&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16712278).

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
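The sequence of events in MESOS-9460 can be replayed with a small toy model. This is only an illustrative sketch in Python, not Mesos code: the `Master` and `Agent` classes, their method names, and the use of plain sets for resources are all assumptions made for the example. Only the message names and the ordering of events come from the report.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """Toy agent holding its own view of its total resources (hypothetical)."""
    resources: set = field(default_factory=lambda: {"disk"})

    def update_slave_message(self) -> set:
        # Snapshot of the agent's current total, sent after (re)registration.
        return set(self.resources)

    def apply(self, created: str) -> None:
        # The agent finishes applying the operation locally.
        self.resources.add(created)


@dataclass
class Master:
    """Toy master; `allocator_total` stands in for the allocator's view."""
    allocator_total: set = field(default_factory=lambda: {"disk"})

    def speculative_apply(self, created: str) -> None:
        # The allocator is updated before the agent confirms the operation.
        self.allocator_total.add(created)

    def on_update_slave_message(self, total: set) -> None:
        # The allocator total is overwritten with whatever the agent reported.
        self.allocator_total = set(total)

    def on_operation_status(self, created: str, speculative: bool) -> None:
        # Speculative operations are assumed to be applied already, so the
        # allocator is only updated for non-speculative ones.
        if not speculative:
            self.allocator_total.add(created)


def run_race():
    master, agent = Master(), Agent()
    master.speculative_apply("volume")      # operation call reaches the master
    stale = agent.update_slave_message()    # agent has not applied it yet
    master.on_update_slave_message(stale)   # speculative update is overwritten
    agent.apply("volume")                   # agent finishes the operation
    master.on_operation_status("volume", speculative=True)
    return master.allocator_total, agent.resources
```

Calling `run_race()` returns two different sets: the allocator's total no longer contains the volume, while the agent's does, which is the divergence described in the report.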
[jira] [Commented] (MESOS-9321) Add an optional `vendor` field in `Resource.DiskInfo.Source`.
[ https://issues.apache.org/jira/browse/MESOS-9321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16712332#comment-16712332 ]

Chun-Hung Hsiao commented on MESOS-9321:

Reviews so far:
[https://reviews.apache.org/r/69037/] (targeted 1.7.x)
[https://reviews.apache.org/r/69520/] (targeted 1.7.x)
[https://reviews.apache.org/r/69521/] (targeted 1.7.x)
[https://reviews.apache.org/r/69522/]

Will post another follow-up patch to remove the hacky use of the relative "root" path.

> Add an optional `vendor` field in `Resource.DiskInfo.Source`.
>
> Key: MESOS-9321
> URL: https://issues.apache.org/jira/browse/MESOS-9321
> Project: Mesos
> Issue Type: Task
> Components: resource provider, storage
> Reporter: Chun-Hung Hsiao
> Assignee: Chun-Hung Hsiao
> Priority: Critical
> Labels: mesosphere, storage
>
> This will allow the framework to recover volumes reported by the
> corresponding CSI plugin across agent ID changes.
> When an agent changes its ID, all reservation information related to
> resources coming from a given resource provider will be lost, so frameworks
> need a unique identifier to tell whether a new volume associated with the
> new agent ID is the same volume. Since CSI volume IDs are not unique across
> different plugins, we will need to add a new {{vendor}} field, which together
> with the existing {{id}} field can provide the means to globally uniquely
> identify this source.
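As a rough illustration of why the pair of `vendor` and `id` is needed, the following Python sketch compares two disk sources by a composite key. It is an assumption-laden example, not the Mesos API: plain dictionaries stand in for `Resource.DiskInfo.Source`, and `SourceKey`, `source_key`, and `same_volume` are hypothetical helper names.

```python
from typing import NamedTuple, Optional


class SourceKey(NamedTuple):
    """Hypothetical composite key: a volume ID is only unique per plugin."""
    vendor: str
    id: str


def source_key(source: dict) -> Optional[SourceKey]:
    # `source` is a plain dict standing in for Resource.DiskInfo.Source.
    vendor = source.get("vendor")
    volume_id = source.get("id")
    if vendor is None or volume_id is None:
        # Without both fields the source cannot be identified globally.
        return None
    return SourceKey(vendor, volume_id)


def same_volume(old: dict, new: dict) -> bool:
    # Two sources refer to the same volume only if both keys exist and match.
    old_key, new_key = source_key(old), source_key(new)
    return old_key is not None and old_key == new_key
```

With this sketch, two sources that share an `id` but come from different vendors are treated as different volumes, and a source without a `vendor` is never matched.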
[jira] [Comment Edited] (MESOS-7971) PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove test is flaky
[ https://issues.apache.org/jira/browse/MESOS-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16712278#comment-16712278 ]

Chun-Hung Hsiao edited comment on MESOS-7971 at 12/7/18 2:52 AM:

For resource provider operations, we use {{resource_version_uuid}} to resolve this. It seems to me that we should do the same in {{Slave::applyOperation}} as well: check whether {{ApplyOperationMessage.resource_version_uuid}} equals {{resourceVersion}}, and only apply the speculative operation if the version matches.

However, we have only had {{resource_version_uuid}} since 1.5 (with the {{RESOURCE_PROVIDER}} agent capability), so we could not use the same strategy to fix this in 1.4 even if we wanted to (1.4 is no longer supported, though).

was (Author: chhsia0):
For resource provider operations, we use {{resource_version_uuid}} to resolve this. It seems to me that we should do the same in {{Slave::applyOperation}} as well: check whether {{ApplyOperationMessage.resource_version_uuid}} equals {{resourceVersion}}, and only apply the speculative operation if the version matches.

However, we have only had {{resource_version_uuid}} since 1.5 (with the {{RESOURCE_PROVIDER}} agent capability), so we could not use the same strategy to fix this in 1.4 even if we wanted to.
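The version check proposed in the comment above might look roughly like the following Python sketch. It is not the actual `Slave::applyOperation` code: `ToyAgent`, `bump_version`, and the set-based resources are hypothetical, and when exactly the agent changes its resource version is an assumption made for illustration.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ApplyOperationMessage:
    """Toy stand-in for the message of the same name."""
    operation: str
    resource_version_uuid: uuid.UUID


@dataclass
class ToyAgent:
    resource_version: uuid.UUID = field(default_factory=uuid.uuid4)
    resources: set = field(default_factory=lambda: {"disk"})

    def bump_version(self) -> None:
        # Assumption: the agent picks a new version whenever it reports a
        # new total to the master.
        self.resource_version = uuid.uuid4()

    def apply_operation(self, message: ApplyOperationMessage) -> bool:
        # Only apply the speculative operation if the master computed it
        # against the resource version the agent currently holds.
        if message.resource_version_uuid != self.resource_version:
            return False
        self.resources.add(message.operation)
        return True
```

In this sketch an operation built against an outdated version is rejected instead of being applied on only one side, which is the property the comment is after.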
> PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove test is flaky
>
> Key: MESOS-7971
> URL: https://issues.apache.org/jira/browse/MESOS-7971
> Project: Mesos
> Issue Type: Bug
> Components: allocation
> Affects Versions: 1.4.0, 1.6.0, 1.7.0, 1.8.0
> Reporter: Vinod Kone
> Assignee: Meng Zhu
> Priority: Critical
> Labels: flaky-test, mesosphere
> Attachments: ApacheJenkinsConsoleText_autotools_gcc_ubuntu16.txt
>
>
> Saw this when testing 1.4.0-rc5
> {code}
> [ RUN ] PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove
> I0912 05:40:27.335222 30860 cluster.cpp:162] Creating default 'local'
> authorizer
> I0912 05:40:27.338429 30867 master.cpp:442] Master
> 2bd1e8eb-e314-4181-9ed3-d397ec1dbede (6aa774430302) started on
> 172.17.0.3:54639
> I0912 05:40:27.338472 30867 master.cpp:444] Flags at startup: --acls=""
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins"
> --allocation_interval="50ms" --allocator="HierarchicalDRF"
> --authenticate_agents="true" --authenticate_frameworks="true"
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true"
> --authenticate_http_readwrite="true" --authenticators="crammd5"
> --authorizers="local" --credentials="/tmp/hH0YXe/credentials"
> --filter_gpu_resources="true" --framework_sorter="drf" --help="false"
> --hostname_lookup="true" --http_authenticators="basic"
> --http_framework_authenticators="basic" --initialize_driver_logging="true"
> --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO"
> --max_agent_ping_timeouts="5" --max_completed_frameworks="50"
> --max_completed_tasks_per_framework="1000"
> --max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false"
> --recovery_agent_removal_limit="100%" --registry="in_memory"
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins"
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400"
> --registry_store_timeout="100secs" --registry_strict="false" --roles="role1"
> --root_submissions="true" --user_sorter="drf" --version="false"
> --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/hH0YXe/master"
> --zk_session_timeout="10secs"
> I0912 05:40:27.338778 30867 master.cpp:494] Master only allowing
> authenticated frameworks to register
> I0912 05:40:27.338788 30867 master.cpp:508] Master only allowing
> authenticated agents to register
> I0912 05:40:27.338793 30867 master.cpp:521] Master only allowing
> authenticated HTTP frameworks to register
> I0912 05:40:27.338799 30867 credentials.hpp:37] Loading credentials for
> authentication from '/tmp/hH0YXe/credentials'
> I0912 05:40:27.353009 30867 master.cpp:566] Using default 'crammd5'
> authenticator
> I0912 05:40:27.353183 30867 http.cpp:1026] Creating default 'basic' HTTP
> authenticator for realm 'mesos-master-readonly'
> I0912 05:40:27.353364 30867 http.cpp:1026] Creating default 'basic' HTTP
> authenticator for realm 'mesos-master-readwrite'
> I0912 05:40:27.353482 30867 http.cpp:1026] Creating default 'basic' HTTP
> authenticator for realm 'mesos-master-scheduler'
> I0912 05:40:27.353588 30867 master.cpp:646] Authorization enabled
> W0912 05:40:27.353605 30867 master.cpp:709] The '--roles' flag is deprecated.
> This flag will be removed in the future. See the Mesos 0.27 upgrade notes for
> more information
> I0912 05:40:27.353742 30868 hierarchical.cpp:171] Initialized hierarchical
> a
[jira] [Comment Edited] (MESOS-7971) PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove test is flaky
[ https://issues.apache.org/jira/browse/MESOS-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16712278#comment-16712278 ]

Chun-Hung Hsiao edited comment on MESOS-7971 at 12/7/18 2:51 AM:

For resource provider operations, we use {{resource_version_uuid}} to resolve this. It seems to me that we should do the same in {{Slave::applyOperation}} as well: check whether {{ApplyOperationMessage.resource_version_uuid}} equals {{resourceVersion}}, and only apply the speculative operation if the version matches.

However, we have only had {{resource_version_uuid}} since 1.5 (with the {{RESOURCE_PROVIDER}} agent capability), so we could not use the same strategy to fix this in 1.4 even if we wanted to.

was (Author: chhsia0):
For resource provider operations, we use {{resource_version_uuid}} to resolve this. It seems to me that we should do the same in {{Slave::applyOperation}} as well: check whether {{ApplyOperationMessage.resource_version_uuid}} equals {{resourceVersion}}, and only apply the speculative operation if the version matches.

However, we have only had `resource_version_uuid` since 1.5 (with the {{RESOURCE_PROVIDER}} agent capability), so we could not use the same strategy to fix this in 1.4 even if we wanted to.
[jira] [Commented] (MESOS-7971) PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove test is flaky
[ https://issues.apache.org/jira/browse/MESOS-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16712278#comment-16712278 ]

Chun-Hung Hsiao commented on MESOS-7971:

For resource provider operations, we use {{resource_version_uuid}} to resolve this. It seems to me that we should do the same in {{Slave::applyOperation}} as well: check whether {{ApplyOperationMessage.resource_version_uuid}} equals {{resourceVersion}}, and only apply the speculative operation if the version matches.

However, we have only had `resource_version_uuid` since 1.5 (with the {{RESOURCE_PROVIDER}} agent capability), so we could not use the same strategy to fix this in 1.4 even if we wanted to.
[jira] [Commented] (MESOS-7971) PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove test is flaky
[ https://issues.apache.org/jira/browse/MESOS-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16712200#comment-16712200 ]

Meng Zhu commented on MESOS-7971:

This looks like a legitimate bug. Here is the sequence of events that can trigger it:
 - The agent (re)registers with the master.
 - Operation calls are made to the master (let’s say create volume).
 - The allocator is speculatively updated in https://github.com/apache/mesos/blob/master/src/master/master.cpp#L11315
 - Before the agent's resources are updated, the agent sends an `UpdateSlaveMessage` upon receiving the (re)registered message, in https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L1551 and https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L1633
 - The `UpdateSlaveMessage` triggers the allocator to update the total resources again (https://github.com/apache/mesos/blob/master/src/master/master.cpp#L8205); the resource update from the previous operation is overwritten and LOST.
 - The agent finishes the operation and informs the master through `UpdateOperationStatusMessage`.
 - However, for a speculative operation, we do not update the allocator (https://github.com/apache/mesos/blob/master/src/master/master.cpp#L11177).

Thus, the speculative operation fails to be applied to the allocator but is successfully applied on the agent.
[jira] [Comment Edited] (MESOS-7971) PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove test is flaky
[ https://issues.apache.org/jira/browse/MESOS-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16712200#comment-16712200 ]

Meng Zhu edited comment on MESOS-7971 at 12/7/18 1:12 AM:

This looks like a legitimate bug. Here is the sequence of events that can trigger it:
 - The agent (re)registers with the master.
 - Operation calls are made to the master (let’s say create volume).
 - The allocator is speculatively updated in https://github.com/apache/mesos/blob/master/src/master/master.cpp#L11315
 - Before the agent's resources are updated, the agent sends an `UpdateSlaveMessage` upon receiving the (re)registered message, in https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L1551 and https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L1633
 - The `UpdateSlaveMessage` triggers the allocator to update the total resources with STALE info sent from the agent (https://github.com/apache/mesos/blob/master/src/master/master.cpp#L8205), so the updates from the previous operation are overwritten and LOST.
 - The agent finishes the operation and informs the master through `UpdateOperationStatusMessage`.
 - However, for a speculative operation, we do not update the allocator (https://github.com/apache/mesos/blob/master/src/master/master.cpp#L11177).

Thus, the speculative operation fails to be applied to the allocator but is successfully applied on the agent.

was (Author: mzhu):
This looks like a legitimate bug. Here is the sequence of events that can trigger it:
 - The agent (re)registers with the master.
 - Operation calls are made to the master (let’s say create volume).
 - The allocator is speculatively updated in https://github.com/apache/mesos/blob/master/src/master/master.cpp#L11315
 - Before the agent's resources are updated, the agent sends an `UpdateSlaveMessage` upon receiving the (re)registered message, in https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L1551 and https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L1633
 - The `UpdateSlaveMessage` triggers the allocator to update the total resources again (https://github.com/apache/mesos/blob/master/src/master/master.cpp#L8205); the resource update from the previous operation is overwritten and LOST.
 - The agent finishes the operation and informs the master through `UpdateOperationStatusMessage`.
 - However, for a speculative operation, we do not update the allocator (https://github.com/apache/mesos/blob/master/src/master/master.cpp#L11177).

Thus, the speculative operation fails to be applied to the allocator but is successfully applied on the agent.
[jira] [Issue Comment Deleted] (MESOS-7971) PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove test is flaky
[ https://issues.apache.org/jira/browse/MESOS-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Meng Zhu updated MESOS-7971:
Comment: was deleted

(was: This test is flaky due to a race. The test expects the offer to be sent out after the reserve and create operations have finished, but it only waits for the 202 returned by both calls. When the offer is sent while the agent is still processing the create operation, the offer does not contain the expected volume resource, which fails the test.

Adding manual clock control and settling properly should fix the test.)

> PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove test is flaky
>
> Key: MESOS-7971
> URL: https://issues.apache.org/jira/browse/MESOS-7971
> Project: Mesos
> Issue Type: Bug
> Components: allocation
> Affects Versions: 1.4.0, 1.6.0, 1.7.0, 1.8.0
> Reporter: Vinod Kone
> Assignee: Meng Zhu
> Priority: Critical
> Labels: flaky-test, mesosphere
> Attachments: ApacheJenkinsConsoleText_autotools_gcc_ubuntu16.txt
>
>
> Saw this when testing 1.4.0-rc5
> {code}
> [ RUN ] PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove
> I0912 05:40:27.335222 30860 cluster.cpp:162] Creating default 'local'
> authorizer
> I0912 05:40:27.338429 30867 master.cpp:442] Master
> 2bd1e8eb-e314-4181-9ed3-d397ec1dbede (6aa774430302) started on
> 172.17.0.3:54639
> I0912 05:40:27.338472 30867 master.cpp:444] Flags at startup: --acls=""
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins"
> --allocation_interval="50ms" --allocator="HierarchicalDRF"
> --authenticate_agents="true" --authenticate_frameworks="true"
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true"
> --authenticate_http_readwrite="true" --authenticators="crammd5"
> --authorizers="local" --credentials="/tmp/hH0YXe/credentials"
> --filter_gpu_resources="true" --framework_sorter="drf" --help="false"
> --hostname_lookup="true" --http_authenticators="basic"
> --http_framework_authenticators="basic" --initialize_driver_logging="true"
> --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO"
> --max_agent_ping_timeouts="5" --max_completed_frameworks="50"
> --max_completed_tasks_per_framework="1000"
> --max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false"
> --recovery_agent_removal_limit="100%" --registry="in_memory"
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins"
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400"
> --registry_store_timeout="100secs" --registry_strict="false" --roles="role1"
> --root_submissions="true" --user_sorter="drf" --version="false"
> --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/hH0YXe/master"
> --zk_session_timeout="10secs"
> I0912 05:40:27.338778 30867 master.cpp:494] Master only allowing
> authenticated frameworks to register
> I0912 05:40:27.338788 30867 master.cpp:508] Master only allowing
> authenticated agents to register
> I0912 05:40:27.338793 30867 master.cpp:521] Master only allowing
> authenticated HTTP frameworks to register
> I0912 05:40:27.338799 30867 credentials.hpp:37] Loading credentials for
> authentication from '/tmp/hH0YXe/credentials'
> I0912 05:40:27.353009 30867 master.cpp:566] Using default 'crammd5'
> authenticator
> I0912 05:40:27.353183 30867 http.cpp:1026] Creating default 'basic' HTTP
> authenticator for realm 'mesos-master-readonly'
> I0912 05:40:27.353364 30867 http.cpp:1026] Creating default 'basic' HTTP
> authenticator for realm 'mesos-master-readwrite'
> I0912 05:40:27.353482 30867 http.cpp:1026] Creating default 'basic' HTTP
> authenticator for realm 'mesos-master-scheduler'
> I0912 05:40:27.353588 30867 master.cpp:646] Authorization enabled
> W0912 05:40:27.353605 30867 master.cpp:709] The '--roles' flag is deprecated.
> This flag will be removed in the future. See the Mesos 0.27 upgrade notes for
> more information
> I0912 05:40:27.353742 30868 hierarchical.cpp:171] Initialized hierarchical
> allocator process
> I0912 05:40:27.353775 30872 whitelist_watcher.cpp:77] No whitelist given
> I0912 05:40:27.356655 30873 master.cpp:2163] Elected as the leading master!
> I0912 05:40:27.356675 30873 master.cpp:1702] Recovering from registrar
> I0912 05:40:27.356868 30874 registrar.cpp:347] Recovering registrar
> I0912 05:40:27.357390 30874 registrar.cpp:391] Successfully fetched the
> registry (0B) in 494080ns
> I0912 05:40:27.357483 30874 registrar.cpp:495] Applied 1 operations in
> 31911ns; attempting to update the registry
> I0912 05:40:27.357919 30874 registrar.cpp:552] Successfully updated the
> registry in 391936ns
> I0912 05:40:27.358018 30874 registrar.cpp:424] Successfully recovered
> registrar
> I0912 05:40:27.35841
[jira] [Issue Comment Deleted] (MESOS-7971) PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove test is flaky
[ https://issues.apache.org/jira/browse/MESOS-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Meng Zhu updated MESOS-7971: Comment: was deleted (was: https://reviews.apache.org/r/69516/) > PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove test is flaky > - > > Key: MESOS-7971 > URL: https://issues.apache.org/jira/browse/MESOS-7971 > Project: Mesos > Issue Type: Bug > Components: allocation >Affects Versions: 1.4.0, 1.6.0, 1.7.0, 1.8.0 >Reporter: Vinod Kone >Assignee: Meng Zhu >Priority: Critical > Labels: flaky-test, mesosphere > Attachments: ApacheJenkinsConsoleText_autotools_gcc_ubuntu16.txt > > > Saw this when testing 1.4.0-rc5 > {code} > [ RUN ] PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove > I0912 05:40:27.335222 30860 cluster.cpp:162] Creating default 'local' > authorizer > I0912 05:40:27.338429 30867 master.cpp:442] Master > 2bd1e8eb-e314-4181-9ed3-d397ec1dbede (6aa774430302) started on > 172.17.0.3:54639 > I0912 05:40:27.338472 30867 master.cpp:444] Flags at startup: --acls="" > --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" > --allocation_interval="50ms" --allocator="HierarchicalDRF" > --authenticate_agents="true" --authenticate_frameworks="true" > --authenticate_http_frameworks="true" --authenticate_http_readonly="true" > --authenticate_http_readwrite="true" --authenticators="crammd5" > --authorizers="local" --credentials="/tmp/hH0YXe/credentials" > --filter_gpu_resources="true" --framework_sorter="drf" --help="false" > --hostname_lookup="true" --http_authenticators="basic" > --http_framework_authenticators="basic" --initialize_driver_logging="true" > --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" > --max_agent_ping_timeouts="5" --max_completed_frameworks="50" > --max_completed_tasks_per_framework="1000" > --max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" > --recovery_agent_removal_limit="100%" --registry="in_memory" > 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" > --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" > --registry_store_timeout="100secs" --registry_strict="false" --roles="role1" > --root_submissions="true" --user_sorter="drf" --version="false" > --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/hH0YXe/master" > --zk_session_timeout="10secs" > I0912 05:40:27.338778 30867 master.cpp:494] Master only allowing > authenticated frameworks to register > I0912 05:40:27.338788 30867 master.cpp:508] Master only allowing > authenticated agents to register > I0912 05:40:27.338793 30867 master.cpp:521] Master only allowing > authenticated HTTP frameworks to register > I0912 05:40:27.338799 30867 credentials.hpp:37] Loading credentials for > authentication from '/tmp/hH0YXe/credentials' > I0912 05:40:27.353009 30867 master.cpp:566] Using default 'crammd5' > authenticator > I0912 05:40:27.353183 30867 http.cpp:1026] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-readonly' > I0912 05:40:27.353364 30867 http.cpp:1026] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-readwrite' > I0912 05:40:27.353482 30867 http.cpp:1026] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-scheduler' > I0912 05:40:27.353588 30867 master.cpp:646] Authorization enabled > W0912 05:40:27.353605 30867 master.cpp:709] The '--roles' flag is deprecated. > This flag will be removed in the future. See the Mesos 0.27 upgrade notes for > more information > I0912 05:40:27.353742 30868 hierarchical.cpp:171] Initialized hierarchical > allocator process > I0912 05:40:27.353775 30872 whitelist_watcher.cpp:77] No whitelist given > I0912 05:40:27.356655 30873 master.cpp:2163] Elected as the leading master! 
> I0912 05:40:27.356675 30873 master.cpp:1702] Recovering from registrar > I0912 05:40:27.356868 30874 registrar.cpp:347] Recovering registrar > I0912 05:40:27.357390 30874 registrar.cpp:391] Successfully fetched the > registry (0B) in 494080ns > I0912 05:40:27.357483 30874 registrar.cpp:495] Applied 1 operations in > 31911ns; attempting to update the registry > I0912 05:40:27.357919 30874 registrar.cpp:552] Successfully updated the > registry in 391936ns > I0912 05:40:27.358018 30874 registrar.cpp:424] Successfully recovered > registrar > I0912 05:40:27.358413 30868 master.cpp:1801] Recovered 0 agents from the > registry (129B); allowing 10mins for agents to re-register > I0912 05:40:27.358482 30867 hierarchical.cpp:209] Skipping recovery of > hierarchical allocator: nothing to recover > W0912 05:40:27.364050 30860 process.cpp:3196] Attempted to spawn already > running process files@172.17.0.3:54639 > I0912 05:40:27.365372 30860 c
[jira] [Assigned] (MESOS-9458) PersistentVolumeEndpointsTest.StaticReservation is flaky
[ https://issues.apache.org/jira/browse/MESOS-9458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Meng Zhu reassigned MESOS-9458: --- Assignee: Meng Zhu > PersistentVolumeEndpointsTest.StaticReservation is flaky > > > Key: MESOS-9458 > URL: https://issues.apache.org/jira/browse/MESOS-9458 > Project: Mesos > Issue Type: Bug > Components: allocation >Reporter: Vinod Kone >Assignee: Meng Zhu >Priority: Major > Labels: flaky-test, mesosphere > > Observed this in ASF CI > https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Buildbot-Test/310/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose%20--disable-libtool-wrappers%20--disable-parallel-test-execution,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1%20MESOS_TEST_AWAIT_TIMEOUT=60secs,OS=ubuntu:16.04,label_exp=(ubuntu)&&(!ubuntu-us1)&&(!ubuntu-eu2)&&(!ubuntu-4)&&(!H21)&&(!H23)&&(!H26)&&(!H27)/consoleText > {noformat} > [ RUN ] PersistentVolumeEndpointsTest.StaticReservation > I1205 11:34:05.896515 22538 cluster.cpp:173] Creating default 'local' > authorizer > I1205 11:34:05.898870 22542 master.cpp:413] Master > 3f2d828b-bff8-461a-98cf-de9163b36657 (488de0351206) started on > 172.17.0.2:40803 > I1205 11:34:05.898895 22542 master.cpp:416] Flags at startup: --acls="" > --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" > --allocation_interval="1000secs" --allocator="hierarchical" > --authenticate_agents="true" --authenticate_frameworks="true" > --authenticate_http_frameworks="true" --authenticate_http_readonly="true" > --authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" > --authenticators="crammd5" --authorizers="local" > --credentials="/tmp/qOMyLF/credentials" --filter_gpu_resources="true" > --framework_sorter="drf" --help="false" --hostname_lookup="true" > --http_authenticators="basic" --http_framework_authenticators="basic" > --initialize_driver_logging="true" --log_auto_initialize="true" > --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" > 
--max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" > --max_unreachable_tasks_per_framework="1000" --memory_profiling="false" > --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" > --publish_per_framework_metrics="true" --quiet="false" > --recovery_agent_removal_limit="100%" --registry="in_memory" > --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" > --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" > --registry_store_timeout="100secs" --registry_strict="false" > --require_agent_domain="false" --role_sorter="drf" --roles="role1" > --root_submissions="true" --version="false" > --webui_dir="/tmp/SRC/build/mesos-1.8.0/_inst/share/mesos/webui" > --work_dir="/tmp/qOMyLF/master" --zk_session_timeout="10secs" > I1205 11:34:05.899194 22542 master.cpp:465] Master only allowing > authenticated frameworks to register > I1205 11:34:05.899205 22542 master.cpp:471] Master only allowing > authenticated agents to register > I1205 11:34:05.899212 22542 master.cpp:477] Master only allowing > authenticated HTTP frameworks to register > I1205 11:34:05.899219 22542 credentials.hpp:37] Loading credentials for > authentication from '/tmp/qOMyLF/credentials' > I1205 11:34:05.899503 22542 master.cpp:521] Using default 'crammd5' > authenticator > I1205 11:34:05.899674 22542 http.cpp:1042] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-readonly' > I1205 11:34:05.899879 22542 http.cpp:1042] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-readwrite' > I1205 11:34:05.900029 22542 http.cpp:1042] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-scheduler' > I1205 11:34:05.900211 22542 master.cpp:602] Authorization enabled > W1205 11:34:05.900238 22542 master.cpp:665] The '--roles' flag is deprecated. > This flag will be removed in the future. 
See the Mesos 0.27 upgrade notes for > more information > I1205 11:34:05.900684 22539 hierarchical.cpp:175] Initialized hierarchical > allocator process > I1205 11:34:05.900707 22545 whitelist_watcher.cpp:77] No whitelist given > I1205 11:34:05.903553 22540 master.cpp:2105] Elected as the leading master! > I1205 11:34:05.903587 22540 master.cpp:1660] Recovering from registrar > I1205 11:34:05.903753 22551 registrar.cpp:339] Recovering registrar > I1205 11:34:05.904373 22551 registrar.cpp:383] Successfully fetched the > registry (0B) in 574976ns > I1205 11:34:05.904498 22551 registrar.cpp:487] Applied 1 operations in > 34823ns; attempting to update the registry > I1205 11:34:05.905134 22551 registrar.cpp:544] Successfully updated the > registry in 566016ns > I1205 11:34:05.905258 22551 registrar.cpp:416] Successfully recovered > registrar > I1205 11:34:05
[jira] [Commented] (MESOS-7971) PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove test is flaky
[ https://issues.apache.org/jira/browse/MESOS-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16712148#comment-16712148 ]

Meng Zhu commented on MESOS-7971:

This test is flaky due to a race. The test expects the offer to be sent out after the reserve and create operations have finished, but it only waits for the 202 responses returned by both calls. If the offer is sent while the agent is still processing the create operation, the offer does not contain the expected volume resource, and the test fails. Adding manual clock control and settling the clock properly should fix the test.

> PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove test is flaky > - > > Key: MESOS-7971 > URL: https://issues.apache.org/jira/browse/MESOS-7971 > Project: Mesos > Issue Type: Bug > Components: allocation >Affects Versions: 1.4.0, 1.6.0, 1.7.0, 1.8.0 >Reporter: Vinod Kone >Assignee: Meng Zhu >Priority: Critical > Labels: flaky-test, mesosphere > Attachments: ApacheJenkinsConsoleText_autotools_gcc_ubuntu16.txt > > > Saw this when testing 1.4.0-rc5 > {code} > [ RUN ] PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove > I0912 05:40:27.335222 30860 cluster.cpp:162] Creating default 'local' > authorizer > I0912 05:40:27.338429 30867 master.cpp:442] Master > 2bd1e8eb-e314-4181-9ed3-d397ec1dbede (6aa774430302) started on > 172.17.0.3:54639 > I0912 05:40:27.338472 30867 master.cpp:444] Flags at startup: --acls="" > --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" > --allocation_interval="50ms" --allocator="HierarchicalDRF" > --authenticate_agents="true" --authenticate_frameworks="true" > --authenticate_http_frameworks="true" --authenticate_http_readonly="true" > --authenticate_http_readwrite="true" --authenticators="crammd5" > --authorizers="local" --credentials="/tmp/hH0YXe/credentials" > --filter_gpu_resources="true" --framework_sorter="drf" --help="false" > --hostname_lookup="true" --http_authenticators="basic" >
--http_framework_authenticators="basic" --initialize_driver_logging="true" > --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" > --max_agent_ping_timeouts="5" --max_completed_frameworks="50" > --max_completed_tasks_per_framework="1000" > --max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" > --recovery_agent_removal_limit="100%" --registry="in_memory" > --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" > --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" > --registry_store_timeout="100secs" --registry_strict="false" --roles="role1" > --root_submissions="true" --user_sorter="drf" --version="false" > --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/hH0YXe/master" > --zk_session_timeout="10secs" > I0912 05:40:27.338778 30867 master.cpp:494] Master only allowing > authenticated frameworks to register > I0912 05:40:27.338788 30867 master.cpp:508] Master only allowing > authenticated agents to register > I0912 05:40:27.338793 30867 master.cpp:521] Master only allowing > authenticated HTTP frameworks to register > I0912 05:40:27.338799 30867 credentials.hpp:37] Loading credentials for > authentication from '/tmp/hH0YXe/credentials' > I0912 05:40:27.353009 30867 master.cpp:566] Using default 'crammd5' > authenticator > I0912 05:40:27.353183 30867 http.cpp:1026] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-readonly' > I0912 05:40:27.353364 30867 http.cpp:1026] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-readwrite' > I0912 05:40:27.353482 30867 http.cpp:1026] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-scheduler' > I0912 05:40:27.353588 30867 master.cpp:646] Authorization enabled > W0912 05:40:27.353605 30867 master.cpp:709] The '--roles' flag is deprecated. > This flag will be removed in the future. 
See the Mesos 0.27 upgrade notes for > more information > I0912 05:40:27.353742 30868 hierarchical.cpp:171] Initialized hierarchical > allocator process > I0912 05:40:27.353775 30872 whitelist_watcher.cpp:77] No whitelist given > I0912 05:40:27.356655 30873 master.cpp:2163] Elected as the leading master! > I0912 05:40:27.356675 30873 master.cpp:1702] Recovering from registrar > I0912 05:40:27.356868 30874 registrar.cpp:347] Recovering registrar > I0912 05:40:27.357390 30874 registrar.cpp:391] Successfully fetched the > registry (0B) in 494080ns > I0912 05:40:27.357483 30874 registrar.cpp:495] Applied 1 operations in > 31911ns; attempting to update the registry > I0912 05:40:27.357919 30874 registrar.cpp:552] Successfully updated the > registry in 391936ns > I0912 05:40:27.358018 30874 registrar.cpp:424] Successfully recovered > registr
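The race described in Meng Zhu's comment on MESOS-7971 above can be illustrated with a toy model. The following is a minimal Python sketch, not Mesos code: `ToyAgent`, `submit`, `settle`, and `offer` are hypothetical names, and `settle` only loosely stands in for the role that settling the clock plays in the real test.

```python
from collections import deque


class ToyAgent:
    """Hypothetical stand-in for an agent that applies operations asynchronously."""

    def __init__(self, resources):
        self.resources = set(resources)
        self.pending = deque()  # operations accepted but not yet applied

    def submit(self, volume):
        """Accept an operation and return 202 before it has been applied."""
        self.pending.append(volume)
        return 202

    def settle(self):
        """Apply every pending operation, i.e. wait until nothing is in flight."""
        while self.pending:
            self.resources.add(self.pending.popleft())

    def offer(self):
        """Snapshot of the resources an offer sent right now would contain."""
        return frozenset(self.resources)


agent = ToyAgent({"disk"})
status = agent.submit("volume")

# Racy check: the 202 only means "accepted", so the offer may lack the volume.
racy_offer = agent.offer()

# Deterministic check: settle first, then inspect the offer.
agent.settle()
settled_offer = agent.offer()

print(status, "volume" in racy_offer, "volume" in settled_offer)
# prints: 202 False True
```

In this model the 202 response says nothing about whether the operation has been applied, which is why waiting for it alone is not enough to make the assertion on the offer reliable.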
[jira] [Assigned] (MESOS-7971) PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove test is flaky
[ https://issues.apache.org/jira/browse/MESOS-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Meng Zhu reassigned MESOS-7971: --- Assignee: Meng Zhu > PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove test is flaky > - > > Key: MESOS-7971 > URL: https://issues.apache.org/jira/browse/MESOS-7971 > Project: Mesos > Issue Type: Bug > Components: allocation >Affects Versions: 1.4.0, 1.6.0, 1.7.0, 1.8.0 >Reporter: Vinod Kone >Assignee: Meng Zhu >Priority: Critical > Labels: flaky-test, mesosphere > Attachments: ApacheJenkinsConsoleText_autotools_gcc_ubuntu16.txt > > > Saw this when testing 1.4.0-rc5 > {code} > [ RUN ] PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove > I0912 05:40:27.335222 30860 cluster.cpp:162] Creating default 'local' > authorizer > I0912 05:40:27.338429 30867 master.cpp:442] Master > 2bd1e8eb-e314-4181-9ed3-d397ec1dbede (6aa774430302) started on > 172.17.0.3:54639 > I0912 05:40:27.338472 30867 master.cpp:444] Flags at startup: --acls="" > --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" > --allocation_interval="50ms" --allocator="HierarchicalDRF" > --authenticate_agents="true" --authenticate_frameworks="true" > --authenticate_http_frameworks="true" --authenticate_http_readonly="true" > --authenticate_http_readwrite="true" --authenticators="crammd5" > --authorizers="local" --credentials="/tmp/hH0YXe/credentials" > --filter_gpu_resources="true" --framework_sorter="drf" --help="false" > --hostname_lookup="true" --http_authenticators="basic" > --http_framework_authenticators="basic" --initialize_driver_logging="true" > --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" > --max_agent_ping_timeouts="5" --max_completed_frameworks="50" > --max_completed_tasks_per_framework="1000" > --max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" > --recovery_agent_removal_limit="100%" --registry="in_memory" > --registry_fetch_timeout="1mins" 
--registry_gc_interval="15mins" > --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" > --registry_store_timeout="100secs" --registry_strict="false" --roles="role1" > --root_submissions="true" --user_sorter="drf" --version="false" > --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/hH0YXe/master" > --zk_session_timeout="10secs" > I0912 05:40:27.338778 30867 master.cpp:494] Master only allowing > authenticated frameworks to register > I0912 05:40:27.338788 30867 master.cpp:508] Master only allowing > authenticated agents to register > I0912 05:40:27.338793 30867 master.cpp:521] Master only allowing > authenticated HTTP frameworks to register > I0912 05:40:27.338799 30867 credentials.hpp:37] Loading credentials for > authentication from '/tmp/hH0YXe/credentials' > I0912 05:40:27.353009 30867 master.cpp:566] Using default 'crammd5' > authenticator > I0912 05:40:27.353183 30867 http.cpp:1026] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-readonly' > I0912 05:40:27.353364 30867 http.cpp:1026] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-readwrite' > I0912 05:40:27.353482 30867 http.cpp:1026] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-scheduler' > I0912 05:40:27.353588 30867 master.cpp:646] Authorization enabled > W0912 05:40:27.353605 30867 master.cpp:709] The '--roles' flag is deprecated. > This flag will be removed in the future. See the Mesos 0.27 upgrade notes for > more information > I0912 05:40:27.353742 30868 hierarchical.cpp:171] Initialized hierarchical > allocator process > I0912 05:40:27.353775 30872 whitelist_watcher.cpp:77] No whitelist given > I0912 05:40:27.356655 30873 master.cpp:2163] Elected as the leading master! 
> I0912 05:40:27.356675 30873 master.cpp:1702] Recovering from registrar > I0912 05:40:27.356868 30874 registrar.cpp:347] Recovering registrar > I0912 05:40:27.357390 30874 registrar.cpp:391] Successfully fetched the > registry (0B) in 494080ns > I0912 05:40:27.357483 30874 registrar.cpp:495] Applied 1 operations in > 31911ns; attempting to update the registry > I0912 05:40:27.357919 30874 registrar.cpp:552] Successfully updated the > registry in 391936ns > I0912 05:40:27.358018 30874 registrar.cpp:424] Successfully recovered > registrar > I0912 05:40:27.358413 30868 master.cpp:1801] Recovered 0 agents from the > registry (129B); allowing 10mins for agents to re-register > I0912 05:40:27.358482 30867 hierarchical.cpp:209] Skipping recovery of > hierarchical allocator: nothing to recover > W0912 05:40:27.364050 30860 process.cpp:3196] Attempted to spawn already > running process files@172.17.0.3:54639 > I0912 05:40:27.365372 30860 containerizer.cpp:246] Using isolation:
[jira] [Assigned] (MESOS-9314) Consider introducing a ScalarResourceQuantity protobuf message.
[ https://issues.apache.org/jira/browse/MESOS-9314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Meng Zhu reassigned MESOS-9314:
---

Assignee: Meng Zhu

> Consider introducing a ScalarResourceQuantity protobuf message.
> ---
>
> Key: MESOS-9314
> URL: https://issues.apache.org/jira/browse/MESOS-9314
> Project: Mesos
> Issue Type: Improvement
> Components: HTTP API
> Reporter: Benjamin Mahler
> Assignee: Meng Zhu
> Priority: Major
> Labels: multitenancy
>
> As part of introducing quota limits, we're adding a new master::Call for
> updating quota. This call can take a simplified message that expresses scalar
> resource quantities:
> {code}
> message ScalarResourceQuantity {
>   required string name;
>   required Value::Scalar quantity;
> }
> {code}
> This greatly simplified the validation code, as well as the UX of the API
> when it comes to knowing what kind of data to provide.
> Ideally, the new quota paths can use this message in lieu of Resource
> objects, but we'll have to explore backwards compatibility (e.g. registry
> data).

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
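To illustrate why a name-plus-scalar message keeps validation small, here is a minimal Python sketch. It is an assumption-laden illustration, not Mesos code: the `ScalarResourceQuantity` dataclass only mirrors the two fields of the proposed message, and the two rules in `validate` are hypothetical examples of checks such a message might need.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalarResourceQuantity:
    """Hypothetical Python mirror of the proposed message: a name and a scalar."""

    name: str
    quantity: float


def validate(q):
    """Return an error string, or None if the quantity is acceptable.

    These two rules are illustrative assumptions, not Mesos's validation logic.
    """
    if not q.name:
        return "resource name must not be empty"
    if not math.isfinite(q.quantity) or q.quantity < 0:
        return "quantity must be a finite, non-negative number"
    return None


quota = [ScalarResourceQuantity("cpus", 4.0), ScalarResourceQuantity("mem", 2048.0)]
errors = [validate(q) for q in quota]
bad = validate(ScalarResourceQuantity("cpus", -1.0))
print(errors, bad)
```

With only two fields there is nothing else for the caller to fill in or for the validator to reject, which is the UX point the issue description makes.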
[jira] [Commented] (MESOS-6630) Add some benchmark test for quota allocation
[ https://issues.apache.org/jira/browse/MESOS-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16712122#comment-16712122 ]

Meng Zhu commented on MESOS-6630:

Uploaded two perf traces.

mesos-master_nonquota_1206.stacks
{noformat}
[--] 1 test from NonQuotaVsQuotaParam/HierarchicalAllocator_BENCHMARK_WithNonQuotaVsQuotaParam
[ RUN ] NonQuotaVsQuotaParam/HierarchicalAllocator_BENCHMARK_WithNonQuotaVsQuotaParam.NonQuotaVsQuota/4
Added 2000 agents in 82.263735ms
Added 2000 frameworks in 12.301791731secs
Nonquota run setup: 2000 agents, 1000 roles, 2000 frameworksMade 2000 allocations in 4.322305035secs
Made 0 allocation in 4.036441876secs
[ OK ] NonQuotaVsQuotaParam/HierarchicalAllocator_BENCHMARK_WithNonQuotaVsQuotaParam.NonQuotaVsQuota/4 (21315 ms)
{noformat}

mesos-master_quota_1206.stacks
{noformat}
[ RUN ] NonQuotaVsQuotaParam/HierarchicalAllocator_BENCHMARK_WithNonQuotaVsQuotaParam.NonQuotaVsQuota/5
Added 2000 agents in 82.183633ms
Added 2000 frameworks in 12.508906279secs
Quota run setup: 2000 agents, 1000 roles, 2000 frameworksMade 2000 allocations in 36.546906639secs
Made 0 allocation in 27.331330684secs
[ OK ] NonQuotaVsQuotaParam/HierarchicalAllocator_BENCHMARK_WithNonQuotaVsQuotaParam.NonQuotaVsQuota/5 (77055 ms)
{noformat}

> Add some benchmark test for quota allocation
>
> Key: MESOS-6630
> URL: https://issues.apache.org/jira/browse/MESOS-6630
> Project: Mesos
> Issue Type: Task
> Components: allocation
> Reporter: Guangya Liu
> Assignee: Meng Zhu
> Priority: Major
> Labels: mesosphere, performance
> Fix For: 1.8.0
>
> Attachments: mesos-master_nonquota_1206.stacks,
> mesos-master_quota_1206.stacks
>
>
> Comparing to non-quota allocation, current quota allocation involves a
> separate allocation stage and additional tracking such as headroom and role
> consumed quota. Thus quota allocation performance could be drastically
> different (probably slower) than non-quota allocation. A dedicated benchmark
> for quota allocation is necessary.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
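The gap between the two benchmark outputs above is easier to read as a ratio. The following Python sketch extracts the allocation timings with a regular expression and divides them; the strings are copied from the outputs in the comment, while the helper name `timings` and the pattern are illustrative choices, not part of the benchmark.

```python
import re

# Allocation timings copied from the two benchmark outputs (seconds).
NONQUOTA = "Made 2000 allocations in 4.322305035secs Made 0 allocation in 4.036441876secs"
QUOTA = "Made 2000 allocations in 36.546906639secs Made 0 allocation in 27.331330684secs"

PATTERN = re.compile(r"Made (\d+) allocations? in ([\d.]+)secs")


def timings(output):
    """Map the number of allocations made to the elapsed seconds."""
    return {int(count): float(secs) for count, secs in PATTERN.findall(output)}


nonquota, quota = timings(NONQUOTA), timings(QUOTA)
slowdown = {count: quota[count] / nonquota[count] for count in nonquota}

for count, ratio in sorted(slowdown.items(), reverse=True):
    print(f"{count} allocations: quota run is {ratio:.2f}x slower")
```

On these numbers the quota run is roughly 8.5 times slower for the cycle that makes 2000 allocations and roughly 6.8 times slower for the cycle that makes none.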
[jira] [Commented] (MESOS-6630) Add some benchmark test for quota allocation
[ https://issues.apache.org/jira/browse/MESOS-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16712121#comment-16712121 ]

Meng Zhu commented on MESOS-6630:

{noformat}
commit 4e512618d508f32170193c49d72011189f6e2fa1
Author: Meng Zhu
Date: Fri Oct 12 17:28:39 2018 -0700

    Added an allocator benchmark for quota performance.

    This benchmark evaluates the allocator performance in the presence
    of roles with both small quota (which can be satisfied by half an
    agent) as well as large quota (which need resources from two
    agents). We setup the cluster, trigger one allocation cycle and
    measure the elapsed time.

    Review: https://reviews.apache.org/r/69097
{noformat}

{noformat}
commit 740fa11a33df8528742b3e784206d00111edc4a3
Author: Meng Zhu m...@mesosphere.io
Date: Fri Oct 19 22:44:04 2018 -0700

    Added a benchmark to compare quota and nonquota allocation performance.

    This benchmark evaluates the performance difference between nonquota
    and quota settings. In both settings, the same allocations are made
    for fair comparison. In particular, since the agent will always be
    allocated as a whole in nonquota settings, we should also avoid
    agent chopping in quota setting as well. Thus in this benchmark,
    quotas are only set to be multiples of whole agent resources.
    This is also why we have this dedicated benchmark for comparison
    rather than extending the existing quota benchmarks (which involves
    agent chopping).

    Review: https://reviews.apache.org/r/69098
{noformat}

> Add some benchmark test for quota allocation
>
> Key: MESOS-6630
> URL: https://issues.apache.org/jira/browse/MESOS-6630
> Project: Mesos
> Issue Type: Task
> Components: allocation
> Reporter: Guangya Liu
> Assignee: Meng Zhu
> Priority: Major
> Labels: mesosphere, performance
> Attachments: mesos-master_nonquota_1206.stacks,
> mesos-master_quota_1206.stacks
>
>
> Comparing to non-quota allocation, current quota allocation involves a
> separate allocation stage and additional tracking such as headroom and role
> consumed quota. Thus quota allocation performance could be drastically
> different (probably slower) than non-quota allocation. A dedicated benchmark
> for quota allocation is necessary.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
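The "multiples of whole agent resources" constraint from the second commit message can be sketched as a small check. This is a hypothetical Python illustration, not code from the benchmark: the per-agent amounts in `AGENT` and the function `whole_agents` are made up, since the commit messages do not give the actual resource sizes.

```python
from math import isclose

AGENT = {"cpus": 2.0, "mem": 1024.0}  # hypothetical per-agent resources


def whole_agents(quota, agent=AGENT):
    """Return n if the quota equals n whole agents, else None.

    None means that satisfying the quota would require chopping an agent.
    """
    if set(quota) != set(agent):
        return None
    ratios = [quota[name] / amount for name, amount in agent.items()]
    n = round(ratios[0])
    if n >= 1 and all(isclose(ratio, n) for ratio in ratios):
        return n
    return None


large = whole_agents({"cpus": 4.0, "mem": 2048.0})  # two whole agents
small = whole_agents({"cpus": 1.0, "mem": 512.0})   # half an agent
uneven = whole_agents({"cpus": 3.0, "mem": 1024.0})  # 1.5 agents of cpus, 1 of mem
print(large, small, uneven)
# prints: 2 None None
```

Under these assumptions, the "small quota" case from the first commit (half an agent) is exactly the kind of quota the comparison benchmark excludes, while a quota worth two agents qualifies.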
[jira] [Created] (MESOS-9459) Reviewbot is not verifying reviews that need verification
Vinod Kone created MESOS-9459:

Summary: Reviewbot is not verifying reviews that need verification
Key: MESOS-9459
URL: https://issues.apache.org/jira/browse/MESOS-9459
Project: Mesos
Issue Type: Bug
Reporter: Vinod Kone
Assignee: Armand Grillet

For example, this run of ReviewBot https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Reviewbot/23594/console says that there are no reviews to be verified, which is false: if we look at ReviewBoard, there are a bunch of reviews that have not been commented on by ReviewBot since a new diff was posted.

{noformat}
12-05-18_23:41:54 - Running /home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py
0 review requests need verification
{noformat}

I see that the logic of the verify-reviews.py script was changed as part of the python3 transition here: https://reviews.apache.org/r/68619/diff/1#27, which likely caused the bug.

As an aside, it's unfortunate that the python3 update was bundled with logic changes in this review.

cc [~andschwa]

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
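The expectation stated in the report, that a review request needs verification when a new diff has been posted since ReviewBot last commented, can be written as a small predicate. This is a hypothetical Python sketch: it is not the logic of verify-reviews.py and does not use the ReviewBoard API, and the request names and timestamps are invented for illustration.

```python
from datetime import datetime


def needs_verification(latest_diff_at, last_bot_review_at):
    """True if the newest diff has not been reviewed by the bot yet."""
    if last_bot_review_at is None:  # the bot never commented on this request
        return True
    return latest_diff_at > last_bot_review_at


# Hypothetical requests: (time of latest diff, time of last bot review or None).
requests = {
    "request-a": (datetime(2018, 12, 5, 10, 0), datetime(2018, 12, 4, 9, 0)),
    "request-b": (datetime(2018, 12, 3, 10, 0), datetime(2018, 12, 3, 12, 0)),
    "request-c": (datetime(2018, 12, 5, 8, 0), None),
}

pending = sorted(
    name for name, (diff_at, bot_at) in requests.items()
    if needs_verification(diff_at, bot_at)
)
print(f"{len(pending)} review requests need verification: {pending}")
# prints: 2 review requests need verification: ['request-a', 'request-c']
```

A script that reports zero pending requests while entries like `request-a` exist would match the symptom described in the report.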
[jira] [Commented] (MESOS-3968) DiskQuotaTest.SlaveRecovery is flaky
[ https://issues.apache.org/jira/browse/MESOS-3968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16711282#comment-16711282 ] Benjamin Bannier commented on MESOS-3968: - [~vinodkone] , we'll spend some time in the near future to remove flakes from {{StorageLocalResourceProviderTest}} which were introduced pretty recently and which we see fail with some frequency. Regarding this test here, I checked how often it failed in our internal CI, and it doesn't seem to be among the worst offenders -- it "only" seems to have failed a handful of times in the last couple hundred CI builds of {{master}}. > DiskQuotaTest.SlaveRecovery is flaky > > > Key: MESOS-3968 > URL: https://issues.apache.org/jira/browse/MESOS-3968 > Project: Mesos > Issue Type: Bug > Components: test >Reporter: Benjamin Mahler >Priority: Major > Labels: flaky-test, mesosphere, mesosphere-oncall, storage > > {noformat: title=Failed Run} > [ RUN ] DiskQuotaTest.SlaveRecovery > I1120 12:02:54.015383 29806 leveldb.cpp:176] Opened db in 2.965411ms > I1120 12:02:54.018033 29806 leveldb.cpp:183] Compacted db in 2.585354ms > I1120 12:02:54.018175 29806 leveldb.cpp:198] Created db iterator in 27134ns > I1120 12:02:54.018275 29806 leveldb.cpp:204] Seeked to beginning of db in > 3025ns > I1120 12:02:54.018375 29806 leveldb.cpp:273] Iterated through 0 keys in the > db in 679ns > I1120 12:02:54.018491 29806 replica.cpp:780] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I1120 12:02:54.021386 29838 recover.cpp:449] Starting replica recovery > I1120 12:02:54.021692 29838 recover.cpp:475] Replica is in EMPTY status > I1120 12:02:54.022189 29827 master.cpp:367] Master > 9a3c45ec-28b3-49e6-a83f-1f2035cc1105 (a51e6bb03b55) started on > 172.17.5.188:41228 > I1120 12:02:54.022212 29827 master.cpp:369] Flags at startup: --acls="" > --allocation_interval="1secs" --allocator="HierarchicalDRF" > --authenticate="true" --authenticate_slaves="true" 
--authenticators="crammd5" > --authorizers="local" --credentials="/tmp/DsMniF/credentials" > --framework_sorter="drf" --help="false" --hostname_lookup="true" > --initialize_driver_logging="true" --log_auto_initialize="true" > --logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" > --quiet="false" --recovery_slave_removal_limit="100%" > --registry="replicated_log" --registry_fetch_timeout="1mins" > --registry_store_timeout="25secs" --registry_strict="true" > --root_submissions="true" --slave_ping_timeout="15secs" > --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" > --webui_dir="/mesos/mesos-0.26.0/_inst/share/mesos/webui" > --work_dir="/tmp/DsMniF/master" --zk_session_timeout="10secs" > I1120 12:02:54.022557 29827 master.cpp:414] Master only allowing > authenticated frameworks to register > I1120 12:02:54.022569 29827 master.cpp:419] Master only allowing > authenticated slaves to register > I1120 12:02:54.022578 29827 credentials.hpp:37] Loading credentials for > authentication from '/tmp/DsMniF/credentials' > I1120 12:02:54.022896 29827 master.cpp:458] Using default 'crammd5' > authenticator > I1120 12:02:54.023217 29827 master.cpp:495] Authorization enabled > I1120 12:02:54.023512 29831 whitelist_watcher.cpp:79] No whitelist given > I1120 12:02:54.023814 29833 replica.cpp:676] Replica in EMPTY status received > a broadcasted recover request from (562)@172.17.5.188:41228 > I1120 12:02:54.023519 29832 hierarchical.cpp:153] Initialized hierarchical > allocator process > I1120 12:02:54.025997 29831 recover.cpp:195] Received a recover response from > a replica in EMPTY status > I1120 12:02:54.027042 29832 recover.cpp:566] Updating replica status to > STARTING > I1120 12:02:54.027354 29830 master.cpp:1612] The newly elected leader is > master@172.17.5.188:41228 with id 9a3c45ec-28b3-49e6-a83f-1f2035cc1105 > I1120 12:02:54.027385 29830 master.cpp:1625] Elected as the leading master! 
> I1120 12:02:54.027403 29830 master.cpp:1385] Recovering from registrar > I1120 12:02:54.027679 29830 registrar.cpp:309] Recovering registrar > I1120 12:02:54.028439 29840 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 1.195171ms > I1120 12:02:54.028539 29840 replica.cpp:323] Persisted replica status to > STARTING > I1120 12:02:54.028944 29840 recover.cpp:475] Replica is in STARTING status > I1120 12:02:54.030910 29840 replica.cpp:676] Replica in STARTING status > received a broadcasted recover request from (563)@172.17.5.188:41228 > I1120 12:02:54.031429 29840 recover.cpp:195] Received a recover response from > a replica in STARTING status > I1120 12:02:54.032032 29840 recover.cpp:566] Updating replica status to VOTING > I1120 12:02:54.032816 29840 leveldb.cpp:306] Persisting metadata (8 bytes) to > leve