[jira] [Created] (MESOS-9460) Speculative operations may make master and agent resource views out of sync.

2018-12-06 Thread Meng Zhu (JIRA)
Meng Zhu created MESOS-9460:
---

 Summary: Speculative operations may make master and agent resource 
views out of sync.
 Key: MESOS-9460
 URL: https://issues.apache.org/jira/browse/MESOS-9460
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.7.0, 1.6.1, 1.5.1
Reporter: Meng Zhu


This bug could happen with the following sequence of events:

- agent (re)registers with the master
- speculative operation calls are made to the master
- the allocator is speculatively updated in 
https://github.com/apache/mesos/blob/master/src/master/master.cpp#L11315
- before the agent's resources get updated, the agent sends `UpdateSlaveMessage` on 
receiving the (re)registered message, if it has the `RESOURCE_PROVIDER` capability 
or oversubscription is used 
(https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L1551 and 
https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L1633)
- the `UpdateSlaveMessage` triggers the allocator to update the total resources 
with STALE info sent from the agent 
https://github.com/apache/mesos/blob/master/src/master/master.cpp#L8205, so 
the update from the previous operation is overwritten and LOST
- the agent finishes the operation and informs the master through 
`UpdateOperationStatusMessage`, but for a speculative operation we do not 
update the allocator 
https://github.com/apache/mesos/blob/master/src/master/master.cpp#L11177
- The resource views of master and agent are out of sync.
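
The sequence above can be replayed with a small toy model. This is only a sketch in Python, not Mesos code: the `Master` and `Agent` classes, their method names, and the use of plain sets for resources are all hypothetical stand-ins for the allocator state and for the `UpdateSlaveMessage` and `UpdateOperationStatusMessage` handling described in the steps.

```python
class Agent:
    """Toy agent; `resources` stands in for the agent's total resources."""

    def __init__(self):
        self.resources = {"disk"}

    def update_slave_message(self):
        # Hypothetical stand-in for UpdateSlaveMessage: a copy of the
        # agent's totals at the moment the message is sent.
        return set(self.resources)

    def apply_operation(self, volume):
        self.resources.add(volume)


class Master:
    """Toy master; `allocator_total` stands in for the allocator's view."""

    def __init__(self):
        self.allocator_total = {"disk"}

    def speculatively_apply(self, volume):
        self.allocator_total.add(volume)

    def handle_update_slave_message(self, total):
        # The reported totals replace the allocator's view wholesale.
        self.allocator_total = set(total)

    def handle_operation_status(self, volume):
        # For speculative operations the allocator is not updated again.
        pass


def replay_race():
    agent, master = Agent(), Master()
    master.speculatively_apply("volume")       # allocator updated first
    stale = agent.update_slave_message()       # sent before the agent applies
    master.handle_update_slave_message(stale)  # speculative update is lost
    agent.apply_operation("volume")            # agent applies the operation
    master.handle_operation_status("volume")   # no allocator update
    return master.allocator_total, agent.resources
```

In this model, `replay_race()` leaves the volume in the agent's view but not in the master's, which is the divergence described above.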

This caused MESOS-7971 and likely MESOS-9458 as well. 

[~chhsia0] proposes to use `resource_version_uuid` to fix this 
(https://issues.apache.org/jira/browse/MESOS-7971?focusedCommentId=16712278&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16712278).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9321) Add an optional `vendor` field in `Resource.DiskInfo.Source`.

2018-12-06 Thread Chun-Hung Hsiao (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16712332#comment-16712332
 ] 

Chun-Hung Hsiao commented on MESOS-9321:


Reviews so far:
[https://reviews.apache.org/r/69037/] (targeted 1.7.x)
[https://reviews.apache.org/r/69520/] (targeted 1.7.x)
[https://reviews.apache.org/r/69521/] (targeted 1.7.x)
[https://reviews.apache.org/r/69522/]

Will post another follow-up patch to remove the hacky use of the relative "root" 
path.

> Add an optional `vendor` field in `Resource.DiskInfo.Source`.
> -
>
> Key: MESOS-9321
> URL: https://issues.apache.org/jira/browse/MESOS-9321
> Project: Mesos
>  Issue Type: Task
>  Components: resource provider, storage
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>Priority: Critical
>  Labels: mesosphere, storage
>
> This will allow the framework to recover volumes reported by the 
> corresponding CSI plugin across agent ID changes.
> When an agent changes its ID, all reservation information related to 
> resources coming from a given resource provider will be lost, so frameworks 
> need a unique identifier to determine whether a new volume associated with the 
> new agent ID is the same volume. Since CSI volume IDs are not unique across 
> different plugins, we will need to add a new {{vendor}} field, which together 
> with the existing {{id}} field can provide the means to globally uniquely 
> identify this source.
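
To illustrate the idea in the description, here is a minimal sketch of matching volumes by the pair of `vendor` and `id` instead of by agent ID. The dictionaries and the `same_volume` helper are hypothetical; they are not the Mesos protobuf API, and the sample values are made up for the example.

```python
from typing import NamedTuple


class SourceKey(NamedTuple):
    vendor: str  # the proposed `vendor` field
    id: str      # the existing `id` field


def same_volume(a: dict, b: dict) -> bool:
    # Compare (vendor, id) only; the agent ID is deliberately ignored.
    return SourceKey(a["vendor"], a["id"]) == SourceKey(b["vendor"], b["id"])


# Hypothetical records: the same volume seen before and after an agent ID
# change, plus a volume from another plugin that reuses the same volume ID.
before = {"agent_id": "agent-1", "vendor": "plugin-a", "id": "vol-1"}
after = {"agent_id": "agent-2", "vendor": "plugin-a", "id": "vol-1"}
other = {"agent_id": "agent-2", "vendor": "plugin-b", "id": "vol-1"}
```

Under these assumptions, `before` and `after` compare as the same volume even though the agent ID changed, while `other` does not, because its `vendor` differs.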





[jira] [Comment Edited] (MESOS-7971) PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove test is flaky

2018-12-06 Thread Chun-Hung Hsiao (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16712278#comment-16712278
 ] 

Chun-Hung Hsiao edited comment on MESOS-7971 at 12/7/18 2:52 AM:
-

For resource provider operations, we use  {{resource_version_uuid}} to resolve 
this.

It seems to me that we should do the same in {{Slave::applyOperation}} as well:
 Check if {{ApplyOperationMessage.resource_version_uuid}} equals 
{{resourceVersion}},
 and only apply the speculative operation if the version matches.

However, we have only had {{resource_version_uuid}} since 1.5 (with the 
{{RESOURCE_PROVIDER}} agent capability),
 so we could not use the same strategy to fix this in 1.4 even if we wanted to 
(1.4 is no longer supported, though).
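
A rough sketch of the version check proposed above, written in Python for brevity rather than as actual Mesos C++. The `Agent` class and its methods are hypothetical, and two points are assumptions that the comment does not spell out: when the agent's version changes, and what happens to an operation whose version does not match (here it is simply rejected).

```python
import uuid


class Agent:
    def __init__(self):
        self.resources = {"disk"}
        self.resource_version = uuid.uuid4()

    def bump_resource_version(self):
        # Assumption: the version changes whenever the agent reports a new
        # view of its resources to the master.
        self.resource_version = uuid.uuid4()

    def apply_operation(self, volume, resource_version_uuid):
        # Only apply the speculative operation if the version matches.
        if resource_version_uuid != self.resource_version:
            return False
        self.resources.add(volume)
        return True
```

With this check, an operation that carries a version from before the agent's latest update is not applied, so the agent does not end up holding a change that the master's view has already lost.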


was (Author: chhsia0):
For resource provider operations, we use  {{resource_version_uuid}} to resolve 
this.

It seems to me that we should do the same in {{Slave::applyOperation}} as well:
 Check if {{ApplyOperationMessage.resource_version_uuid}} equals 
{{resourceVersion}},
 and only apply the speculative operation if the version matches.

However, we have only had {{resource_version_uuid}} since 1.5 (with the 
{{RESOURCE_PROVIDER}} agent capability),
 so we could not use the same strategy to fix this in 1.4 even if we wanted to.

> PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove test is flaky
> -
>
> Key: MESOS-7971
> URL: https://issues.apache.org/jira/browse/MESOS-7971
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
>Affects Versions: 1.4.0, 1.6.0, 1.7.0, 1.8.0
>Reporter: Vinod Kone
>Assignee: Meng Zhu
>Priority: Critical
>  Labels: flaky-test, mesosphere
> Attachments: ApacheJenkinsConsoleText_autotools_gcc_ubuntu16.txt
>
>
> Saw this when testing 1.4.0-rc5
> {code}
> [ RUN  ] PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove
> I0912 05:40:27.335222 30860 cluster.cpp:162] Creating default 'local' 
> authorizer
> I0912 05:40:27.338429 30867 master.cpp:442] Master 
> 2bd1e8eb-e314-4181-9ed3-d397ec1dbede (6aa774430302) started on 
> 172.17.0.3:54639
> I0912 05:40:27.338472 30867 master.cpp:444] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="50ms" --allocator="HierarchicalDRF" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authenticators="crammd5" 
> --authorizers="local" --credentials="/tmp/hH0YXe/credentials" 
> --filter_gpu_resources="true" --framework_sorter="drf" --help="false" 
> --hostname_lookup="true" --http_authenticators="basic" 
> --http_framework_authenticators="basic" --initialize_driver_logging="true" 
> --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
> --max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
> --max_completed_tasks_per_framework="1000" 
> --max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" --roles="role1" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/hH0YXe/master" 
> --zk_session_timeout="10secs"
> I0912 05:40:27.338778 30867 master.cpp:494] Master only allowing 
> authenticated frameworks to register
> I0912 05:40:27.338788 30867 master.cpp:508] Master only allowing 
> authenticated agents to register
> I0912 05:40:27.338793 30867 master.cpp:521] Master only allowing 
> authenticated HTTP frameworks to register
> I0912 05:40:27.338799 30867 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/hH0YXe/credentials'
> I0912 05:40:27.353009 30867 master.cpp:566] Using default 'crammd5' 
> authenticator
> I0912 05:40:27.353183 30867 http.cpp:1026] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> I0912 05:40:27.353364 30867 http.cpp:1026] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> I0912 05:40:27.353482 30867 http.cpp:1026] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> I0912 05:40:27.353588 30867 master.cpp:646] Authorization enabled
> W0912 05:40:27.353605 30867 master.cpp:709] The '--roles' flag is deprecated. 
> This flag will be removed in the future. See the Mesos 0.27 upgrade notes for 
> more information
> I0912 05:40:27.353742 30868 hierarchical.cpp:171] Initialized hierarchical 
> a

[jira] [Comment Edited] (MESOS-7971) PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove test is flaky

2018-12-06 Thread Chun-Hung Hsiao (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16712278#comment-16712278
 ] 

Chun-Hung Hsiao edited comment on MESOS-7971 at 12/7/18 2:51 AM:
-

For resource provider operations, we use  {{resource_version_uuid}} to resolve 
this.

It seems to me that we should do the same in {{Slave::applyOperation}} as well:
 Check if {{ApplyOperationMessage.resource_version_uuid}} equals 
{{resourceVersion}},
 and only apply the speculative operation if the version matches.

However, we have only had {{resource_version_uuid}} since 1.5 (with the 
{{RESOURCE_PROVIDER}} agent capability),
 so we could not use the same strategy to fix this in 1.4 even if we wanted to.


was (Author: chhsia0):
For resource provider operations, we use  {{resource_version_uuid}} to resolve 
this.

It seems to me that we should do the same in {{Slave::applyOperation}} as well:
Check if {{ApplyOperationMessage.resource_version_uuid}} equals 
{{resourceVersion}},
and only apply the speculative operation if the version matches.

However, we have only had `resource_version_uuid` since 1.5 (with the 
{{RESOURCE_PROVIDER}} agent capability),
so we could not use the same strategy to fix this in 1.4 even if we wanted to.


[jira] [Commented] (MESOS-7971) PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove test is flaky

2018-12-06 Thread Chun-Hung Hsiao (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16712278#comment-16712278
 ] 

Chun-Hung Hsiao commented on MESOS-7971:


For resource provider operations, we use  {{resource_version_uuid}} to resolve 
this.

It seems to me that we should do the same in {{Slave::applyOperation}} as well:
Check if {{ApplyOperationMessage.resource_version_uuid}} equals 
{{resourceVersion}},
and only apply the speculative operation if the version matches.

However, we have only had `resource_version_uuid` since 1.5 (with the 
{{RESOURCE_PROVIDER}} agent capability),
so we could not use the same strategy to fix this in 1.4 even if we wanted to.


[jira] [Commented] (MESOS-7971) PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove test is flaky

2018-12-06 Thread Meng Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16712200#comment-16712200
 ] 

Meng Zhu commented on MESOS-7971:
-

This looks like a legitimate bug. Here is the sequence of events that can 
trigger it:

- agent (re)registers with the master
- operation calls are made to the master (let’s say create volume)
- the allocator is speculatively updated in 
https://github.com/apache/mesos/blob/master/src/master/master.cpp#L11315
- before the agent's resources get updated, the agent sends `UpdateSlaveMessage` on 
receiving the (re)registered message in 
https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L1551 and 
https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L1633
- the `UpdateSlaveMessage` triggers the allocator to update the total resources 
again https://github.com/apache/mesos/blob/master/src/master/master.cpp#L8205, so 
the resource update from the previous operation is overwritten and LOST
- agent finishes the operation and informs the master through 
`UpdateOperationStatusMessage`
- but for the speculative operation, we do not update the allocator 
https://github.com/apache/mesos/blob/master/src/master/master.cpp#L11177

Thus, the speculative operation failed to be applied on the allocator but 
was successfully applied on the agent.


[jira] [Comment Edited] (MESOS-7971) PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove test is flaky

2018-12-06 Thread Meng Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16712200#comment-16712200
 ] 

Meng Zhu edited comment on MESOS-7971 at 12/7/18 1:12 AM:
--

This looks like a legitimate bug. Here is the sequence of events that can 
trigger it:

- agent (re)registers with the master
- operation calls are made to the master (let’s say create volume)
- the allocator is speculatively updated in 
https://github.com/apache/mesos/blob/master/src/master/master.cpp#L11315
- before the agent's resources get updated, the agent sends `UpdateSlaveMessage` on 
receiving the (re)registered message in 
https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L1551 and 
https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L1633
- the `UpdateSlaveMessage` triggers the allocator to update the total resources 
with STALE info sent from the agent 
https://github.com/apache/mesos/blob/master/src/master/master.cpp#L8205, so 
the updates from the previous operation are overwritten and LOST
- agent finishes the operation and informs the master through 
`UpdateOperationStatusMessage`
- but for the speculative operation, we do not update the allocator 
https://github.com/apache/mesos/blob/master/src/master/master.cpp#L11177

Thus, the speculative operation failed to be applied on the allocator but 
was successfully applied on the agent.


was (Author: mzhu):
This looks like a legitimate bug. Here is the sequence of events that can 
trigger it:

- agent (re)registers with the master
- operation calls are made to the master (let’s say create volume)
- the allocator is speculatively updated in 
https://github.com/apache/mesos/blob/master/src/master/master.cpp#L11315
- before the agent's resources get updated, the agent sends `UpdateSlaveMessage` on 
receiving the (re)registered message in 
https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L1551 and 
https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L1633
- the `UpdateSlaveMessage` triggers the allocator to update the total resources 
again https://github.com/apache/mesos/blob/master/src/master/master.cpp#L8205, so 
the resource update from the previous operation is overwritten and LOST
- agent finishes the operation and informs the master through 
`UpdateOperationStatusMessage`
- but for the speculative operation, we do not update the allocator 
https://github.com/apache/mesos/blob/master/src/master/master.cpp#L11177

Thus, the speculative operation failed to be applied on the allocator but 
was successfully applied on the agent.


[jira] [Issue Comment Deleted] (MESOS-7971) PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove test is flaky

2018-12-06 Thread Meng Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Meng Zhu updated MESOS-7971:

Comment: was deleted

(was: This test is flaky due to a race.

The test expects the offer to be sent out after the reserve and create 
operations have finished, but it only waits for the 202 returned by both calls.

When the offer is sent while the agent is still processing the create operation, 
the offer does not contain the expected volume resource, which fails the test.

Adding manual clock control and settling the clock properly should fix the test.
)

> PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove test is flaky
> -
>
> Key: MESOS-7971
> URL: https://issues.apache.org/jira/browse/MESOS-7971
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
>Affects Versions: 1.4.0, 1.6.0, 1.7.0, 1.8.0
>Reporter: Vinod Kone
>Assignee: Meng Zhu
>Priority: Critical
>  Labels: flaky-test, mesosphere
> Attachments: ApacheJenkinsConsoleText_autotools_gcc_ubuntu16.txt
>
>
> Saw this when testing 1.4.0-rc5
> {code}
> [ RUN  ] PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove
> I0912 05:40:27.335222 30860 cluster.cpp:162] Creating default 'local' 
> authorizer
> I0912 05:40:27.338429 30867 master.cpp:442] Master 
> 2bd1e8eb-e314-4181-9ed3-d397ec1dbede (6aa774430302) started on 
> 172.17.0.3:54639
> I0912 05:40:27.338472 30867 master.cpp:444] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="50ms" --allocator="HierarchicalDRF" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authenticators="crammd5" 
> --authorizers="local" --credentials="/tmp/hH0YXe/credentials" 
> --filter_gpu_resources="true" --framework_sorter="drf" --help="false" 
> --hostname_lookup="true" --http_authenticators="basic" 
> --http_framework_authenticators="basic" --initialize_driver_logging="true" 
> --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
> --max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
> --max_completed_tasks_per_framework="1000" 
> --max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" --roles="role1" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/hH0YXe/master" 
> --zk_session_timeout="10secs"
> I0912 05:40:27.338778 30867 master.cpp:494] Master only allowing 
> authenticated frameworks to register
> I0912 05:40:27.338788 30867 master.cpp:508] Master only allowing 
> authenticated agents to register
> I0912 05:40:27.338793 30867 master.cpp:521] Master only allowing 
> authenticated HTTP frameworks to register
> I0912 05:40:27.338799 30867 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/hH0YXe/credentials'
> I0912 05:40:27.353009 30867 master.cpp:566] Using default 'crammd5' 
> authenticator
> I0912 05:40:27.353183 30867 http.cpp:1026] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> I0912 05:40:27.353364 30867 http.cpp:1026] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> I0912 05:40:27.353482 30867 http.cpp:1026] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> I0912 05:40:27.353588 30867 master.cpp:646] Authorization enabled
> W0912 05:40:27.353605 30867 master.cpp:709] The '--roles' flag is deprecated. 
> This flag will be removed in the future. See the Mesos 0.27 upgrade notes for 
> more information
> I0912 05:40:27.353742 30868 hierarchical.cpp:171] Initialized hierarchical 
> allocator process
> I0912 05:40:27.353775 30872 whitelist_watcher.cpp:77] No whitelist given
> I0912 05:40:27.356655 30873 master.cpp:2163] Elected as the leading master!
> I0912 05:40:27.356675 30873 master.cpp:1702] Recovering from registrar
> I0912 05:40:27.356868 30874 registrar.cpp:347] Recovering registrar
> I0912 05:40:27.357390 30874 registrar.cpp:391] Successfully fetched the 
> registry (0B) in 494080ns
> I0912 05:40:27.357483 30874 registrar.cpp:495] Applied 1 operations in 
> 31911ns; attempting to update the registry
> I0912 05:40:27.357919 30874 registrar.cpp:552] Successfully updated the 
> registry in 391936ns
> I0912 05:40:27.358018 30874 registrar.cpp:424] Successfully recovered 
> registrar
> I0912 05:40:27.35841

[jira] [Issue Comment Deleted] (MESOS-7971) PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove test is flaky

2018-12-06 Thread Meng Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Meng Zhu updated MESOS-7971:

Comment: was deleted

(was: https://reviews.apache.org/r/69516/)

> PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove test is flaky
> -
>
> Key: MESOS-7971
> URL: https://issues.apache.org/jira/browse/MESOS-7971
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
>Affects Versions: 1.4.0, 1.6.0, 1.7.0, 1.8.0
>Reporter: Vinod Kone
>Assignee: Meng Zhu
>Priority: Critical
>  Labels: flaky-test, mesosphere
> Attachments: ApacheJenkinsConsoleText_autotools_gcc_ubuntu16.txt
>
>
> Saw this when testing 1.4.0-rc5
> {code}
> [ RUN  ] PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove
> I0912 05:40:27.335222 30860 cluster.cpp:162] Creating default 'local' 
> authorizer
> I0912 05:40:27.338429 30867 master.cpp:442] Master 
> 2bd1e8eb-e314-4181-9ed3-d397ec1dbede (6aa774430302) started on 
> 172.17.0.3:54639
> I0912 05:40:27.338472 30867 master.cpp:444] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="50ms" --allocator="HierarchicalDRF" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authenticators="crammd5" 
> --authorizers="local" --credentials="/tmp/hH0YXe/credentials" 
> --filter_gpu_resources="true" --framework_sorter="drf" --help="false" 
> --hostname_lookup="true" --http_authenticators="basic" 
> --http_framework_authenticators="basic" --initialize_driver_logging="true" 
> --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
> --max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
> --max_completed_tasks_per_framework="1000" 
> --max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" --roles="role1" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/hH0YXe/master" 
> --zk_session_timeout="10secs"
> I0912 05:40:27.338778 30867 master.cpp:494] Master only allowing 
> authenticated frameworks to register
> I0912 05:40:27.338788 30867 master.cpp:508] Master only allowing 
> authenticated agents to register
> I0912 05:40:27.338793 30867 master.cpp:521] Master only allowing 
> authenticated HTTP frameworks to register
> I0912 05:40:27.338799 30867 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/hH0YXe/credentials'
> I0912 05:40:27.353009 30867 master.cpp:566] Using default 'crammd5' 
> authenticator
> I0912 05:40:27.353183 30867 http.cpp:1026] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> I0912 05:40:27.353364 30867 http.cpp:1026] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> I0912 05:40:27.353482 30867 http.cpp:1026] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> I0912 05:40:27.353588 30867 master.cpp:646] Authorization enabled
> W0912 05:40:27.353605 30867 master.cpp:709] The '--roles' flag is deprecated. 
> This flag will be removed in the future. See the Mesos 0.27 upgrade notes for 
> more information
> I0912 05:40:27.353742 30868 hierarchical.cpp:171] Initialized hierarchical 
> allocator process
> I0912 05:40:27.353775 30872 whitelist_watcher.cpp:77] No whitelist given
> I0912 05:40:27.356655 30873 master.cpp:2163] Elected as the leading master!
> I0912 05:40:27.356675 30873 master.cpp:1702] Recovering from registrar
> I0912 05:40:27.356868 30874 registrar.cpp:347] Recovering registrar
> I0912 05:40:27.357390 30874 registrar.cpp:391] Successfully fetched the 
> registry (0B) in 494080ns
> I0912 05:40:27.357483 30874 registrar.cpp:495] Applied 1 operations in 
> 31911ns; attempting to update the registry
> I0912 05:40:27.357919 30874 registrar.cpp:552] Successfully updated the 
> registry in 391936ns
> I0912 05:40:27.358018 30874 registrar.cpp:424] Successfully recovered 
> registrar
> I0912 05:40:27.358413 30868 master.cpp:1801] Recovered 0 agents from the 
> registry (129B); allowing 10mins for agents to re-register
> I0912 05:40:27.358482 30867 hierarchical.cpp:209] Skipping recovery of 
> hierarchical allocator: nothing to recover
> W0912 05:40:27.364050 30860 process.cpp:3196] Attempted to spawn already 
> running process files@172.17.0.3:54639
> I0912 05:40:27.365372 30860 c

[jira] [Assigned] (MESOS-9458) PersistentVolumeEndpointsTest.StaticReservation is flaky

2018-12-06 Thread Meng Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Meng Zhu reassigned MESOS-9458:
---

Assignee: Meng Zhu

> PersistentVolumeEndpointsTest.StaticReservation is flaky
> 
>
> Key: MESOS-9458
> URL: https://issues.apache.org/jira/browse/MESOS-9458
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
>Reporter: Vinod Kone
>Assignee: Meng Zhu
>Priority: Major
>  Labels: flaky-test, mesosphere
>
> Observed this in ASF CI 
> https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Buildbot-Test/310/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose%20--disable-libtool-wrappers%20--disable-parallel-test-execution,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1%20MESOS_TEST_AWAIT_TIMEOUT=60secs,OS=ubuntu:16.04,label_exp=(ubuntu)&&(!ubuntu-us1)&&(!ubuntu-eu2)&&(!ubuntu-4)&&(!H21)&&(!H23)&&(!H26)&&(!H27)/consoleText
> {noformat}
> [ RUN  ] PersistentVolumeEndpointsTest.StaticReservation
> I1205 11:34:05.896515 22538 cluster.cpp:173] Creating default 'local' 
> authorizer
> I1205 11:34:05.898870 22542 master.cpp:413] Master 
> 3f2d828b-bff8-461a-98cf-de9163b36657 (488de0351206) started on 
> 172.17.0.2:40803
> I1205 11:34:05.898895 22542 master.cpp:416] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1000secs" --allocator="hierarchical" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/qOMyLF/credentials" --filter_gpu_resources="true" 
> --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --http_authenticators="basic" --http_framework_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --max_unreachable_tasks_per_framework="1000" --memory_profiling="false" 
> --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" 
> --publish_per_framework_metrics="true" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --require_agent_domain="false" --role_sorter="drf" --roles="role1" 
> --root_submissions="true" --version="false" 
> --webui_dir="/tmp/SRC/build/mesos-1.8.0/_inst/share/mesos/webui" 
> --work_dir="/tmp/qOMyLF/master" --zk_session_timeout="10secs"
> I1205 11:34:05.899194 22542 master.cpp:465] Master only allowing 
> authenticated frameworks to register
> I1205 11:34:05.899205 22542 master.cpp:471] Master only allowing 
> authenticated agents to register
> I1205 11:34:05.899212 22542 master.cpp:477] Master only allowing 
> authenticated HTTP frameworks to register
> I1205 11:34:05.899219 22542 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/qOMyLF/credentials'
> I1205 11:34:05.899503 22542 master.cpp:521] Using default 'crammd5' 
> authenticator
> I1205 11:34:05.899674 22542 http.cpp:1042] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> I1205 11:34:05.899879 22542 http.cpp:1042] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> I1205 11:34:05.900029 22542 http.cpp:1042] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> I1205 11:34:05.900211 22542 master.cpp:602] Authorization enabled
> W1205 11:34:05.900238 22542 master.cpp:665] The '--roles' flag is deprecated. 
> This flag will be removed in the future. See the Mesos 0.27 upgrade notes for 
> more information
> I1205 11:34:05.900684 22539 hierarchical.cpp:175] Initialized hierarchical 
> allocator process
> I1205 11:34:05.900707 22545 whitelist_watcher.cpp:77] No whitelist given
> I1205 11:34:05.903553 22540 master.cpp:2105] Elected as the leading master!
> I1205 11:34:05.903587 22540 master.cpp:1660] Recovering from registrar
> I1205 11:34:05.903753 22551 registrar.cpp:339] Recovering registrar
> I1205 11:34:05.904373 22551 registrar.cpp:383] Successfully fetched the 
> registry (0B) in 574976ns
> I1205 11:34:05.904498 22551 registrar.cpp:487] Applied 1 operations in 
> 34823ns; attempting to update the registry
> I1205 11:34:05.905134 22551 registrar.cpp:544] Successfully updated the 
> registry in 566016ns
> I1205 11:34:05.905258 22551 registrar.cpp:416] Successfully recovered 
> registrar
> I1205 11:34:05

[jira] [Commented] (MESOS-7971) PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove test is flaky

2018-12-06 Thread Meng Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16712148#comment-16712148
 ] 

Meng Zhu commented on MESOS-7971:
-

This test is flaky due to a race.

The test expects the offer to be sent out after the reserve and create 
operations have finished. But it only waits for the 202 returned by both calls.

When the offer is sent while the agent is still processing the create operation, 
the offer does not contain the expected volume resource, and the test fails.

Adding manual clock control and properly settling the clock should fix the test.
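
The race can be illustrated with a toy model. This is only a sketch, not Mesos code: `ToyCluster`, `reserve`, `create_volume` and `run` are hypothetical names, and `settle()` stands in for pausing the clock and settling pending work before the allocator is allowed to send an offer.

```python
from collections import deque


class ToyCluster:
    """Toy model: calls are accepted at once (HTTP 202) but applied later."""

    def __init__(self):
        self.pending = deque()      # operations accepted but not yet applied
        self.resources = {"disk"}   # what an offer would currently contain

    def submit(self, operation):
        # Accepting the call only queues the operation.
        self.pending.append(operation)
        return 202

    def settle(self):
        # Apply every queued operation, in order.
        while self.pending:
            self.pending.popleft()(self.resources)

    def offer(self):
        return set(self.resources)


def reserve(resources):
    resources.discard("disk")
    resources.add("reserved-disk")


def create_volume(resources):
    resources.discard("reserved-disk")
    resources.add("volume")


def run(settle_before_offer):
    cluster = ToyCluster()
    assert cluster.submit(reserve) == 202
    assert cluster.submit(create_volume) == 202
    if settle_before_offer:
        cluster.settle()
    return cluster.offer()


racy_offer = run(settle_before_offer=False)    # waits only for the 202s
settled_offer = run(settle_before_offer=True)  # waits for the operations
```

In the racy run the offer is built before the queued operations are applied, so it lacks the volume. In the settled run it contains it.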


> PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove test is flaky
> -
>
> Key: MESOS-7971
> URL: https://issues.apache.org/jira/browse/MESOS-7971
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
>Affects Versions: 1.4.0, 1.6.0, 1.7.0, 1.8.0
>Reporter: Vinod Kone
>Assignee: Meng Zhu
>Priority: Critical
>  Labels: flaky-test, mesosphere
> Attachments: ApacheJenkinsConsoleText_autotools_gcc_ubuntu16.txt
>
>
> Saw this when testing 1.4.0-rc5

[jira] [Assigned] (MESOS-7971) PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove test is flaky

2018-12-06 Thread Meng Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Meng Zhu reassigned MESOS-7971:
---

Assignee: Meng Zhu

> PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove test is flaky
> -
>
> Key: MESOS-7971
> URL: https://issues.apache.org/jira/browse/MESOS-7971
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
>Affects Versions: 1.4.0, 1.6.0, 1.7.0, 1.8.0
>Reporter: Vinod Kone
>Assignee: Meng Zhu
>Priority: Critical
>  Labels: flaky-test, mesosphere
> Attachments: ApacheJenkinsConsoleText_autotools_gcc_ubuntu16.txt
>
>
> Saw this when testing 1.4.0-rc5

[jira] [Assigned] (MESOS-9314) Consider introducing a ScalarResourceQuantity protobuf message.

2018-12-06 Thread Meng Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Meng Zhu reassigned MESOS-9314:
---

Assignee: Meng Zhu

> Consider introducing a ScalarResourceQuantity protobuf message.
> ---
>
> Key: MESOS-9314
> URL: https://issues.apache.org/jira/browse/MESOS-9314
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API
>Reporter: Benjamin Mahler
>Assignee: Meng Zhu
>Priority: Major
>  Labels: multitenancy
>
> As part of introducing quota limits, we're adding a new master::Call for 
> updating quota. This call can take a simplified message that expresses scalar 
> resource quantities:
> {code}
> message ScalarResourceQuantity {
>   // Field numbers are illustrative.
>   required string name = 1;
>   required Value::Scalar quantity = 2;
> }
> {code}
> This greatly simplified the validation code, as well as the UX of the API 
> when it comes to knowing what kind of data to provide.
> Ideally, the new quota paths can use this message in lieu of Resource 
> objects, but we'll have to explore backwards compatibility (e.g. registry 
> data).
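
To show why a name-plus-scalar message is easier to validate than a full Resource object, here is a minimal sketch. It assumes a Python mirror of the proposed message; the `validate` helper and its three rules are hypothetical and are not taken from the Mesos code base.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalarResourceQuantity:
    """Hypothetical mirror of the proposed message: a name and a scalar."""
    name: str
    quantity: float


def validate(q):
    """Return None if `q` is acceptable, otherwise an error string.

    These rules are assumptions made for illustration only.
    """
    if not q.name:
        return "resource name must not be empty"
    if math.isnan(q.quantity) or math.isinf(q.quantity):
        return "quantity must be a finite number"
    if q.quantity < 0:
        return "quantity must be non-negative"
    return None


ok = validate(ScalarResourceQuantity("cpus", 2.0))
bad_name = validate(ScalarResourceQuantity("", 1.0))
bad_value = validate(ScalarResourceQuantity("mem", -512.0))
```

Because the message has no reservation, disk, range or set fields, there is nothing of that kind for the validator to reject. A validator for full Resource objects would have to check all of them.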



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-6630) Add some benchmark test for quota allocation

2018-12-06 Thread Meng Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16712122#comment-16712122
 ] 

Meng Zhu commented on MESOS-6630:
-

Uploaded two perf traces.

mesos-master_nonquota_1206.stacks

{noformat}
[--] 1 test from 
NonQuotaVsQuotaParam/HierarchicalAllocator_BENCHMARK_WithNonQuotaVsQuotaParam
[ RUN  ] 
NonQuotaVsQuotaParam/HierarchicalAllocator_BENCHMARK_WithNonQuotaVsQuotaParam.NonQuotaVsQuota/4
Added 2000 agents in 82.263735ms
Added 2000 frameworks in 12.301791731secs
Nonquota run setup: 2000 agents, 1000 roles, 2000 frameworks
Made 2000 allocations in 4.322305035secs
Made 0 allocation in 4.036441876secs
[   OK ] 
NonQuotaVsQuotaParam/HierarchicalAllocator_BENCHMARK_WithNonQuotaVsQuotaParam.NonQuotaVsQuota/4
 (21315 ms)
{noformat}

mesos-master_quota_1206.stacks

{noformat}
[ RUN  ] 
NonQuotaVsQuotaParam/HierarchicalAllocator_BENCHMARK_WithNonQuotaVsQuotaParam.NonQuotaVsQuota/5
Added 2000 agents in 82.183633ms
Added 2000 frameworks in 12.508906279secs
Quota run setup: 2000 agents, 1000 roles, 2000 frameworks
Made 2000 allocations in 36.546906639secs
Made 0 allocation in 27.331330684secs
[   OK ] 
NonQuotaVsQuotaParam/HierarchicalAllocator_BENCHMARK_WithNonQuotaVsQuotaParam.NonQuotaVsQuota/5
 (77055 ms)
{noformat}
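
To make the two runs easier to compare, the sketch below divides the quota timings by the nonquota timings. The numbers are copied from the output above; the dictionary keys and the `slowdown` helper are names made up for this example.

```python
# Timings copied from the two benchmark runs above.
nonquota = {
    "made_2000_allocations_secs": 4.322305035,
    "made_0_allocation_secs": 4.036441876,
    "total_test_ms": 21315,
}
quota = {
    "made_2000_allocations_secs": 36.546906639,
    "made_0_allocation_secs": 27.331330684,
    "total_test_ms": 77055,
}


def slowdown(quota_run, nonquota_run):
    """Ratio of quota time to nonquota time for each measurement."""
    return {key: quota_run[key] / nonquota_run[key] for key in nonquota_run}


ratios = slowdown(quota, nonquota)
for key, ratio in ratios.items():
    print(f"{key}: {ratio:.2f}x")
```

With this setup the quota run is roughly 8.5 times slower for the cycle that makes 2000 allocations and roughly 6.8 times slower for the cycle that makes none.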

> Add some benchmark test for quota allocation
> 
>
> Key: MESOS-6630
> URL: https://issues.apache.org/jira/browse/MESOS-6630
> Project: Mesos
>  Issue Type: Task
>  Components: allocation
>Reporter: Guangya Liu
>Assignee: Meng Zhu
>Priority: Major
>  Labels: mesosphere, performance
> Fix For: 1.8.0
>
> Attachments: mesos-master_nonquota_1206.stacks, 
> mesos-master_quota_1206.stacks
>
>
> Compared to non-quota allocation, current quota allocation involves a 
> separate allocation stage and additional tracking such as headroom and role 
> consumed quota. Thus quota allocation performance could be drastically 
> different (probably slower) than non-quota allocation. A dedicated benchmark 
> for quota allocation is necessary.





[jira] [Commented] (MESOS-6630) Add some benchmark test for quota allocation

2018-12-06 Thread Meng Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16712121#comment-16712121
 ] 

Meng Zhu commented on MESOS-6630:
-


{noformat}
commit 4e512618d508f32170193c49d72011189f6e2fa1
Author: Meng Zhu 
Date:   Fri Oct 12 17:28:39 2018 -0700

Added an allocator benchmark for quota performance.

This benchmark evaluates the allocator performance in
the presence of roles with both small quota (which can
be satisfied by half an agent) as well as large quota
(which need resources from two agents). We setup the cluster,
trigger one allocation cycle and measure the elapsed time.

Review: https://reviews.apache.org/r/69097
{noformat}


{noformat}
commit 740fa11a33df8528742b3e784206d00111edc4a3
Author: Meng Zhu m...@mesosphere.io
Date:   Fri Oct 19 22:44:04 2018 -0700

Added a benchmark to compare quota and nonquota allocation performance.

This benchmark evaluates the performance difference between nonquota
and quota settings. In both settings, the same allocations are made
for fair comparison. In particular, since the agent will always be
allocated as a whole in nonquota settings, we should also avoid
agent chopping in quota setting as well. Thus in this benchmark,
quotas are only set to be multiples of whole agent resources.
This is also why we have this dedicated benchmark for comparison
rather than extending the existing quota benchmarks (which involves
agent chopping).

Review: https://reviews.apache.org/r/69098
{noformat}

> Add some benchmark test for quota allocation
> 
>
> Key: MESOS-6630
> URL: https://issues.apache.org/jira/browse/MESOS-6630
> Project: Mesos
>  Issue Type: Task
>  Components: allocation
>Reporter: Guangya Liu
>Assignee: Meng Zhu
>Priority: Major
>  Labels: mesosphere, performance
> Attachments: mesos-master_nonquota_1206.stacks, 
> mesos-master_quota_1206.stacks
>
>
> Compared to non-quota allocation, current quota allocation involves a 
> separate allocation stage and additional tracking such as headroom and role 
> consumed quota. Thus quota allocation performance could be drastically 
> different (probably slower) than non-quota allocation. A dedicated benchmark 
> for quota allocation is necessary.





[jira] [Created] (MESOS-9459) Reviewbot is not verifying reviews that need verification

2018-12-06 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-9459:
-

 Summary: Reviewbot is not verifying reviews that need verification
 Key: MESOS-9459
 URL: https://issues.apache.org/jira/browse/MESOS-9459
 Project: Mesos
  Issue Type: Bug
Reporter: Vinod Kone
Assignee: Armand Grillet


For example this run of ReviewBot 
https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Reviewbot/23594/console 
says that there are no reviews to be verified, which is false because if we 
look at ReviewBoard there are a bunch of reviews that have not been commented 
on by ReviewBot since a new diff has been posted.

{noformat}
12-05-18_23:41:54 - Running 
/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py
0 review requests need verification
{noformat}

I see that the logic of the verify-reviews.py script was changed as part of the 
python3 transition here: https://reviews.apache.org/r/68619/diff/1#27 which 
likely caused the bug. 

As an aside, it's unfortunate that the python3 update was bundled with logic 
changes in this review. cc [~andschwa]
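
The expected behaviour can be written down as a small predicate. This is only a sketch of the rule described above, not the logic in verify-reviews.py; `needs_verification` and its arguments are hypothetical names.

```python
from datetime import datetime


def needs_verification(latest_diff_time, reviewbot_comment_times):
    """True if ReviewBot has not commented since the latest diff was posted.

    A review that ReviewBot never commented on also needs verification,
    because all() over an empty list is True.
    """
    return all(t < latest_diff_time for t in reviewbot_comment_times)


diff_posted = datetime(2018, 12, 5, 12, 0)
old_comment = datetime(2018, 12, 4, 9, 30)
new_comment = datetime(2018, 12, 5, 13, 15)

stale = needs_verification(diff_posted, [old_comment])
fresh = needs_verification(diff_posted, [old_comment, new_comment])
never_checked = needs_verification(diff_posted, [])
```

A run that reports "0 review requests need verification" while some review satisfies this predicate would match the bug described here.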






[jira] [Commented] (MESOS-3968) DiskQuotaTest.SlaveRecovery is flaky

2018-12-06 Thread Benjamin Bannier (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-3968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16711282#comment-16711282
 ] 

Benjamin Bannier commented on MESOS-3968:
-

[~vinodkone], we'll spend some time in the near future to remove flakes from 
{{StorageLocalResourceProviderTest}} which were introduced pretty recently and 
which we see fail with some frequency.

Regarding this test here, I checked how often it failed in our internal CI, and 
it doesn't seem to be among the worst offenders -- it "only" seems to have 
failed a handful of times in the last couple hundred CI builds of {{master}}.

> DiskQuotaTest.SlaveRecovery is flaky
> 
>
> Key: MESOS-3968
> URL: https://issues.apache.org/jira/browse/MESOS-3968
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Benjamin Mahler
>Priority: Major
>  Labels: flaky-test, mesosphere, mesosphere-oncall, storage
>
> {noformat: title=Failed Run}
> [ RUN  ] DiskQuotaTest.SlaveRecovery
> I1120 12:02:54.015383 29806 leveldb.cpp:176] Opened db in 2.965411ms
> I1120 12:02:54.018033 29806 leveldb.cpp:183] Compacted db in 2.585354ms
> I1120 12:02:54.018175 29806 leveldb.cpp:198] Created db iterator in 27134ns
> I1120 12:02:54.018275 29806 leveldb.cpp:204] Seeked to beginning of db in 
> 3025ns
> I1120 12:02:54.018375 29806 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 679ns
> I1120 12:02:54.018491 29806 replica.cpp:780] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I1120 12:02:54.021386 29838 recover.cpp:449] Starting replica recovery
> I1120 12:02:54.021692 29838 recover.cpp:475] Replica is in EMPTY status
> I1120 12:02:54.022189 29827 master.cpp:367] Master 
> 9a3c45ec-28b3-49e6-a83f-1f2035cc1105 (a51e6bb03b55) started on 
> 172.17.5.188:41228
> I1120 12:02:54.022212 29827 master.cpp:369] Flags at startup: --acls="" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate="true" --authenticate_slaves="true" --authenticators="crammd5" 
> --authorizers="local" --credentials="/tmp/DsMniF/credentials" 
> --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" 
> --quiet="false" --recovery_slave_removal_limit="100%" 
> --registry="replicated_log" --registry_fetch_timeout="1mins" 
> --registry_store_timeout="25secs" --registry_strict="true" 
> --root_submissions="true" --slave_ping_timeout="15secs" 
> --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" 
> --webui_dir="/mesos/mesos-0.26.0/_inst/share/mesos/webui" 
> --work_dir="/tmp/DsMniF/master" --zk_session_timeout="10secs"
> I1120 12:02:54.022557 29827 master.cpp:414] Master only allowing 
> authenticated frameworks to register
> I1120 12:02:54.022569 29827 master.cpp:419] Master only allowing 
> authenticated slaves to register
> I1120 12:02:54.022578 29827 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/DsMniF/credentials'
> I1120 12:02:54.022896 29827 master.cpp:458] Using default 'crammd5' 
> authenticator
> I1120 12:02:54.023217 29827 master.cpp:495] Authorization enabled
> I1120 12:02:54.023512 29831 whitelist_watcher.cpp:79] No whitelist given
> I1120 12:02:54.023814 29833 replica.cpp:676] Replica in EMPTY status received 
> a broadcasted recover request from (562)@172.17.5.188:41228
> I1120 12:02:54.023519 29832 hierarchical.cpp:153] Initialized hierarchical 
> allocator process
> I1120 12:02:54.025997 29831 recover.cpp:195] Received a recover response from 
> a replica in EMPTY status
> I1120 12:02:54.027042 29832 recover.cpp:566] Updating replica status to 
> STARTING
> I1120 12:02:54.027354 29830 master.cpp:1612] The newly elected leader is 
> master@172.17.5.188:41228 with id 9a3c45ec-28b3-49e6-a83f-1f2035cc1105
> I1120 12:02:54.027385 29830 master.cpp:1625] Elected as the leading master!
> I1120 12:02:54.027403 29830 master.cpp:1385] Recovering from registrar
> I1120 12:02:54.027679 29830 registrar.cpp:309] Recovering registrar
> I1120 12:02:54.028439 29840 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 1.195171ms
> I1120 12:02:54.028539 29840 replica.cpp:323] Persisted replica status to 
> STARTING
> I1120 12:02:54.028944 29840 recover.cpp:475] Replica is in STARTING status
> I1120 12:02:54.030910 29840 replica.cpp:676] Replica in STARTING status 
> received a broadcasted recover request from (563)@172.17.5.188:41228
> I1120 12:02:54.031429 29840 recover.cpp:195] Received a recover response from 
> a replica in STARTING status
> I1120 12:02:54.032032 29840 recover.cpp:566] Updating replica status to VOTING
> I1120 12:02:54.032816 29840 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leve