[jira] [Commented] (MESOS-8112) DefaultExecutorTest.ResourceLimitation is flaky

2017-11-07 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16242343#comment-16242343
 ] 

Alexander Rukletsov commented on MESOS-8112:


[~jpe...@apache.org], Don't be confused by the number of expected status 
updates: {{TASK_STARTING}} change has been introduced in-between. 

Regarding the failure, it looks like  he agent started to cleanup the offending 
container 90s after limitation was reached!

> DefaultExecutorTest.ResourceLimitation is flaky
> ---
>
> Key: MESOS-8112
> URL: https://issues.apache.org/jira/browse/MESOS-8112
> Project: Mesos
>  Issue Type: Bug
>  Components: flaky, test
>Reporter: James Peach
> Attachments: GetContainers-badrun.txt, GetContainers-goodrun.txt, 
> ResourceLimitation-badrun.txt
>
>
> As seen in CI builds, the 
> {{MesosContainerizer/DefaultExecutorTest.ResourceLimitation/0}} test can be 
> flaky
> {noformat}[ RUN  ] 
> MesosContainerizer/DefaultExecutorTest.ResourceLimitation/0
> I1017 21:37:55.179539  3528 cluster.cpp:162] Creating default 'local' 
> authorizer
> I1017 21:37:55.182804  3529 master.cpp:445] Master 
> 0a7cd77c-8bc0-4fdc-b6c5-918b7ffc392e (42cd332f4072) started on 
> 172.17.0.2:33744
> I1017 21:37:55.182847  3529 master.cpp:447] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authenticators="crammd5" 
> --authorizers="local" --credentials="/tmp/1FtpuJ/credentials" 
> --filter_gpu_resources="true" --framework_sorter="drf" --help="false" 
> --hostname_lookup="true" --http_authenticators="basic" 
> --http_framework_authenticators="basic" --initialize_driver_logging="true" 
> --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
> --max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
> --max_completed_tasks_per_framework="1000" 
> --max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/mesos/mesos-1.5.0/_inst/share/mesos/webui" 
> --work_dir="/tmp/1FtpuJ/master" --zk_session_timeout="10secs"
> I1017 21:37:55.183141  3529 master.cpp:496] Master only allowing 
> authenticated frameworks to register
> I1017 21:37:55.183153  3529 master.cpp:502] Master only allowing 
> authenticated agents to register
> I1017 21:37:55.183161  3529 master.cpp:508] Master only allowing 
> authenticated HTTP frameworks to register
> I1017 21:37:55.183167  3529 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/1FtpuJ/credentials'
> I1017 21:37:55.183472  3529 master.cpp:552] Using default 'crammd5' 
> authenticator
> I1017 21:37:55.183661  3529 http.cpp:1045] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> I1017 21:37:55.183862  3529 http.cpp:1045] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> I1017 21:37:55.184082  3529 http.cpp:1045] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> I1017 21:37:55.184257  3529 master.cpp:631] Authorization enabled
> I1017 21:37:55.184450  3536 hierarchical.cpp:171] Initialized hierarchical 
> allocator process
> I1017 21:37:55.184551  3536 whitelist_watcher.cpp:77] No whitelist given
> I1017 21:37:55.187489  3536 master.cpp:2198] Elected as the leading master!
> I1017 21:37:55.187516  3536 master.cpp:1687] Recovering from registrar
> I1017 21:37:55.187728  3536 registrar.cpp:347] Recovering registrar
> I1017 21:37:55.188508  3536 registrar.cpp:391] Successfully fetched the 
> registry (0B) in 745984ns
> I1017 21:37:55.188616  3536 registrar.cpp:495] Applied 1 operations in 
> 37290ns; attempting to update the registry
> I1017 21:37:55.189162  3536 registrar.cpp:552] Successfully updated the 
> registry in 491008ns
> I1017 21:37:55.189285  3536 registrar.cpp:424] Successfully recovered 
> registrar
> I1017 21:37:55.190011  3531 hierarchical.cpp:209] Skipping recovery of 
> hierarchical allocator: nothing to recover
> I1017 21:37:55.190115  3534 master.cpp:1791] Recovered 0 agents from the 
> registry (129B); allowing 10mins for agents to re-register
> W1017 21:37:55.195062  3528 process.cpp:3194] Attempted to spawn already 
> running process files@172.17.0.2:33744
> I1017 21:37:55.195956  3528 

[jira] [Commented] (MESOS-8112) DefaultExecutorTest.ResourceLimitation is flaky

2017-11-07 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16242218#comment-16242218
 ] 

Alexander Rukletsov commented on MESOS-8112:


{{GetContainers}} should be fixed by 
https://github.com/apache/mesos/commit/ab3728f3adcd199cc0dd9c481a2946efffacdb0d

> DefaultExecutorTest.ResourceLimitation is flaky
> ---
>
> Key: MESOS-8112
> URL: https://issues.apache.org/jira/browse/MESOS-8112
> Project: Mesos
>  Issue Type: Bug
>  Components: flaky, test
>Reporter: James Peach
> Attachments: GetContainers-badrun.txt, GetContainers-goodrun.txt, 
> ResourceLimitation-badrun.txt
>
>
> As seen in CI builds, the 
> {{MesosContainerizer/DefaultExecutorTest.ResourceLimitation/0}} test can be 
> flaky
> {noformat}[ RUN  ] 
> MesosContainerizer/DefaultExecutorTest.ResourceLimitation/0
> I1017 21:37:55.179539  3528 cluster.cpp:162] Creating default 'local' 
> authorizer
> I1017 21:37:55.182804  3529 master.cpp:445] Master 
> 0a7cd77c-8bc0-4fdc-b6c5-918b7ffc392e (42cd332f4072) started on 
> 172.17.0.2:33744
> I1017 21:37:55.182847  3529 master.cpp:447] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authenticators="crammd5" 
> --authorizers="local" --credentials="/tmp/1FtpuJ/credentials" 
> --filter_gpu_resources="true" --framework_sorter="drf" --help="false" 
> --hostname_lookup="true" --http_authenticators="basic" 
> --http_framework_authenticators="basic" --initialize_driver_logging="true" 
> --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
> --max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
> --max_completed_tasks_per_framework="1000" 
> --max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/mesos/mesos-1.5.0/_inst/share/mesos/webui" 
> --work_dir="/tmp/1FtpuJ/master" --zk_session_timeout="10secs"
> I1017 21:37:55.183141  3529 master.cpp:496] Master only allowing 
> authenticated frameworks to register
> I1017 21:37:55.183153  3529 master.cpp:502] Master only allowing 
> authenticated agents to register
> I1017 21:37:55.183161  3529 master.cpp:508] Master only allowing 
> authenticated HTTP frameworks to register
> I1017 21:37:55.183167  3529 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/1FtpuJ/credentials'
> I1017 21:37:55.183472  3529 master.cpp:552] Using default 'crammd5' 
> authenticator
> I1017 21:37:55.183661  3529 http.cpp:1045] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> I1017 21:37:55.183862  3529 http.cpp:1045] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> I1017 21:37:55.184082  3529 http.cpp:1045] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> I1017 21:37:55.184257  3529 master.cpp:631] Authorization enabled
> I1017 21:37:55.184450  3536 hierarchical.cpp:171] Initialized hierarchical 
> allocator process
> I1017 21:37:55.184551  3536 whitelist_watcher.cpp:77] No whitelist given
> I1017 21:37:55.187489  3536 master.cpp:2198] Elected as the leading master!
> I1017 21:37:55.187516  3536 master.cpp:1687] Recovering from registrar
> I1017 21:37:55.187728  3536 registrar.cpp:347] Recovering registrar
> I1017 21:37:55.188508  3536 registrar.cpp:391] Successfully fetched the 
> registry (0B) in 745984ns
> I1017 21:37:55.188616  3536 registrar.cpp:495] Applied 1 operations in 
> 37290ns; attempting to update the registry
> I1017 21:37:55.189162  3536 registrar.cpp:552] Successfully updated the 
> registry in 491008ns
> I1017 21:37:55.189285  3536 registrar.cpp:424] Successfully recovered 
> registrar
> I1017 21:37:55.190011  3531 hierarchical.cpp:209] Skipping recovery of 
> hierarchical allocator: nothing to recover
> I1017 21:37:55.190115  3534 master.cpp:1791] Recovered 0 agents from the 
> registry (129B); allowing 10mins for agents to re-register
> W1017 21:37:55.195062  3528 process.cpp:3194] Attempted to spawn already 
> running process files@172.17.0.2:33744
> I1017 21:37:55.195956  3528 containerizer.cpp:292] Using isolation { 
> environment_secret, network/cni, filesystem/posix, disk/du }
> W1017 21:37:55.196488  3528 backend.cpp:76] Failed to 

[jira] [Commented] (MESOS-8112) DefaultExecutorTest.ResourceLimitation is flaky

2017-10-18 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16210353#comment-16210353
 ] 

James Peach commented on MESOS-8112:


The first failure in this test run is 
{{../../src/tests/default_executor_tests.cpp:1460: Failed to wait 15secs for 
failed}}. This indicates that we got the {{starting}} and {{running}} status 
updates but the final {{failure}} took too long.

However, the master forwarded the {{RUNNING}} update here:

{noformat}
I1017 21:37:55.499879  3532 master.cpp:7055] Forwarding status update 
TASK_RUNNING (UUID: 86bb612c-9d48-4e85-a0f1-89820ea65fa1) for task 
ffc4604c-a6cb-4ced-a969-fc6b9e6f955d of framework 
0a7cd77c-8bc0-4fdc-b6c5-918b7ffc392e-
...
1017 21:37:55.742033  3533 containerizer.cpp:2677] Container 
4789abbb-04c9-4d6d-b561-f44b34ec47d2 has reached its limit for resource 
[{"allocation_info":{"role":"*"},"name":"disk","scalar":{"value":20.0},"type":"SCALAR"}]
 and will be terminated
...
I1017 21:39:22.944377  3535 master.cpp:1417] Framework 
0a7cd77c-8bc0-4fdc-b6c5-918b7ffc392e- (default) disconnected
...
I1017 21:39:22.946893  3529 master.cpp:9157] Updating the state of task 
ffc4604c-a6cb-4ced-a969-fc6b9e6f955d of framework 
0a7cd77c-8bc0-4fdc-b6c5-918b7ffc392e- (latest state: TASK_KILLED, status 
update state: TASK_KILLED)
{noformat}

So from the master's perspective, the test framework disconnected? Or did this 
happen once the test failed and we started tearing it down?

Later in the test log:
{noformat}
../../src/tests/default_executor_tests.cpp:1427: Failure
Actual function call count doesn't match EXPECT_CALL(*scheduler, update(_, 
_))...
 Expected: to be called twice
   Actual: called once - unsatisfied and active
{noformat}

This seems to indicate that we only got 1 of 3 expected status updates, but if 
that was true I would expect to see a failure on {{AWAIT_READY(running)}} and I 
can't find that here :(

> DefaultExecutorTest.ResourceLimitation is flaky
> ---
>
> Key: MESOS-8112
> URL: https://issues.apache.org/jira/browse/MESOS-8112
> Project: Mesos
>  Issue Type: Bug
>  Components: flaky, test
>Reporter: James Peach
> Attachments: GetContainers-badrun.txt, GetContainers-goodrun.txt
>
>
> As seen in CI builds, the 
> {{MesosContainerizer/DefaultExecutorTest.ResourceLimitation/0}} test can be 
> flaky
> {noformat}[ RUN  ] 
> MesosContainerizer/DefaultExecutorTest.ResourceLimitation/0
> I1017 21:37:55.179539  3528 cluster.cpp:162] Creating default 'local' 
> authorizer
> I1017 21:37:55.182804  3529 master.cpp:445] Master 
> 0a7cd77c-8bc0-4fdc-b6c5-918b7ffc392e (42cd332f4072) started on 
> 172.17.0.2:33744
> I1017 21:37:55.182847  3529 master.cpp:447] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authenticators="crammd5" 
> --authorizers="local" --credentials="/tmp/1FtpuJ/credentials" 
> --filter_gpu_resources="true" --framework_sorter="drf" --help="false" 
> --hostname_lookup="true" --http_authenticators="basic" 
> --http_framework_authenticators="basic" --initialize_driver_logging="true" 
> --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
> --max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
> --max_completed_tasks_per_framework="1000" 
> --max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/mesos/mesos-1.5.0/_inst/share/mesos/webui" 
> --work_dir="/tmp/1FtpuJ/master" --zk_session_timeout="10secs"
> I1017 21:37:55.183141  3529 master.cpp:496] Master only allowing 
> authenticated frameworks to register
> I1017 21:37:55.183153  3529 master.cpp:502] Master only allowing 
> authenticated agents to register
> I1017 21:37:55.183161  3529 master.cpp:508] Master only allowing 
> authenticated HTTP frameworks to register
> I1017 21:37:55.183167  3529 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/1FtpuJ/credentials'
> I1017 21:37:55.183472  3529 master.cpp:552] Using default 'crammd5' 
> authenticator
> I1017 21:37:55.183661  3529 http.cpp:1045] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> I1017 21:37:55.183862  3529 http.cpp:1045] Creating default 'basic' HTTP 
> 

[jira] [Commented] (MESOS-8112) DefaultExecutorTest.ResourceLimitation is flaky

2017-10-18 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209913#comment-16209913
 ] 

James Peach commented on MESOS-8112:


{{ContentType/AgentAPITest.GetContainers/1}} might also be fallout from the 
same changes in MESOS-7963

> DefaultExecutorTest.ResourceLimitation is flaky
> ---
>
> Key: MESOS-8112
> URL: https://issues.apache.org/jira/browse/MESOS-8112
> Project: Mesos
>  Issue Type: Bug
>  Components: flaky, test
>Reporter: James Peach
>
> As seen in CI builds, the 
> {{MesosContainerizer/DefaultExecutorTest.ResourceLimitation/0}} test can be 
> flaky
> {noformat}[ RUN  ] 
> MesosContainerizer/DefaultExecutorTest.ResourceLimitation/0
> I1017 21:37:55.179539  3528 cluster.cpp:162] Creating default 'local' 
> authorizer
> I1017 21:37:55.182804  3529 master.cpp:445] Master 
> 0a7cd77c-8bc0-4fdc-b6c5-918b7ffc392e (42cd332f4072) started on 
> 172.17.0.2:33744
> I1017 21:37:55.182847  3529 master.cpp:447] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authenticators="crammd5" 
> --authorizers="local" --credentials="/tmp/1FtpuJ/credentials" 
> --filter_gpu_resources="true" --framework_sorter="drf" --help="false" 
> --hostname_lookup="true" --http_authenticators="basic" 
> --http_framework_authenticators="basic" --initialize_driver_logging="true" 
> --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
> --max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
> --max_completed_tasks_per_framework="1000" 
> --max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/mesos/mesos-1.5.0/_inst/share/mesos/webui" 
> --work_dir="/tmp/1FtpuJ/master" --zk_session_timeout="10secs"
> I1017 21:37:55.183141  3529 master.cpp:496] Master only allowing 
> authenticated frameworks to register
> I1017 21:37:55.183153  3529 master.cpp:502] Master only allowing 
> authenticated agents to register
> I1017 21:37:55.183161  3529 master.cpp:508] Master only allowing 
> authenticated HTTP frameworks to register
> I1017 21:37:55.183167  3529 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/1FtpuJ/credentials'
> I1017 21:37:55.183472  3529 master.cpp:552] Using default 'crammd5' 
> authenticator
> I1017 21:37:55.183661  3529 http.cpp:1045] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> I1017 21:37:55.183862  3529 http.cpp:1045] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> I1017 21:37:55.184082  3529 http.cpp:1045] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> I1017 21:37:55.184257  3529 master.cpp:631] Authorization enabled
> I1017 21:37:55.184450  3536 hierarchical.cpp:171] Initialized hierarchical 
> allocator process
> I1017 21:37:55.184551  3536 whitelist_watcher.cpp:77] No whitelist given
> I1017 21:37:55.187489  3536 master.cpp:2198] Elected as the leading master!
> I1017 21:37:55.187516  3536 master.cpp:1687] Recovering from registrar
> I1017 21:37:55.187728  3536 registrar.cpp:347] Recovering registrar
> I1017 21:37:55.188508  3536 registrar.cpp:391] Successfully fetched the 
> registry (0B) in 745984ns
> I1017 21:37:55.188616  3536 registrar.cpp:495] Applied 1 operations in 
> 37290ns; attempting to update the registry
> I1017 21:37:55.189162  3536 registrar.cpp:552] Successfully updated the 
> registry in 491008ns
> I1017 21:37:55.189285  3536 registrar.cpp:424] Successfully recovered 
> registrar
> I1017 21:37:55.190011  3531 hierarchical.cpp:209] Skipping recovery of 
> hierarchical allocator: nothing to recover
> I1017 21:37:55.190115  3534 master.cpp:1791] Recovered 0 agents from the 
> registry (129B); allowing 10mins for agents to re-register
> W1017 21:37:55.195062  3528 process.cpp:3194] Attempted to spawn already 
> running process files@172.17.0.2:33744
> I1017 21:37:55.195956  3528 containerizer.cpp:292] Using isolation { 
> environment_secret, network/cni, filesystem/posix, disk/du }
> W1017 21:37:55.196488  3528 backend.cpp:76] Failed to create 'aufs' backend: 
> AufsBackend requires root privileges
> W1017 21:37:55.196630  3528 backend.cpp:76] Failed to create 'bind' backend: