[jira] [Commented] (MESOS-8112) DefaultExecutorTest.ResourceLimitation is flaky
[ https://issues.apache.org/jira/browse/MESOS-8112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16242343#comment-16242343 ] Alexander Rukletsov commented on MESOS-8112: [~jpe...@apache.org], Don't be confused by the number of expected status updates: {{TASK_STARTING}} change has been introduced in-between. Regarding the failure, it looks like he agent started to cleanup the offending container 90s after limitation was reached! > DefaultExecutorTest.ResourceLimitation is flaky > --- > > Key: MESOS-8112 > URL: https://issues.apache.org/jira/browse/MESOS-8112 > Project: Mesos > Issue Type: Bug > Components: flaky, test >Reporter: James Peach > Attachments: GetContainers-badrun.txt, GetContainers-goodrun.txt, > ResourceLimitation-badrun.txt > > > As seen in CI builds, the > {{MesosContainerizer/DefaultExecutorTest.ResourceLimitation/0}} test can be > flaky > {noformat}[ RUN ] > MesosContainerizer/DefaultExecutorTest.ResourceLimitation/0 > I1017 21:37:55.179539 3528 cluster.cpp:162] Creating default 'local' > authorizer > I1017 21:37:55.182804 3529 master.cpp:445] Master > 0a7cd77c-8bc0-4fdc-b6c5-918b7ffc392e (42cd332f4072) started on > 172.17.0.2:33744 > I1017 21:37:55.182847 3529 master.cpp:447] Flags at startup: --acls="" > --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" > --allocation_interval="1secs" --allocator="HierarchicalDRF" > --authenticate_agents="true" --authenticate_frameworks="true" > --authenticate_http_frameworks="true" --authenticate_http_readonly="true" > --authenticate_http_readwrite="true" --authenticators="crammd5" > --authorizers="local" --credentials="/tmp/1FtpuJ/credentials" > --filter_gpu_resources="true" --framework_sorter="drf" --help="false" > --hostname_lookup="true" --http_authenticators="basic" > --http_framework_authenticators="basic" --initialize_driver_logging="true" > --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" > --max_agent_ping_timeouts="5" --max_completed_frameworks="50" > --max_completed_tasks_per_framework="1000" > --max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" > --recovery_agent_removal_limit="100%" --registry="in_memory" > --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" > --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" > --registry_store_timeout="100secs" --registry_strict="false" > --root_submissions="true" --user_sorter="drf" --version="false" > --webui_dir="/mesos/mesos-1.5.0/_inst/share/mesos/webui" > --work_dir="/tmp/1FtpuJ/master" --zk_session_timeout="10secs" > I1017 21:37:55.183141 3529 master.cpp:496] Master only allowing > authenticated frameworks to register > I1017 21:37:55.183153 3529 master.cpp:502] Master only allowing > authenticated agents to register > I1017 21:37:55.183161 3529 master.cpp:508] Master only allowing > authenticated HTTP frameworks to register > I1017 21:37:55.183167 3529 credentials.hpp:37] Loading credentials for > authentication from '/tmp/1FtpuJ/credentials' > I1017 21:37:55.183472 3529 master.cpp:552] Using default 'crammd5' > authenticator > I1017 21:37:55.183661 3529 http.cpp:1045] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-readonly' > I1017 21:37:55.183862 3529 http.cpp:1045] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-readwrite' > I1017 21:37:55.184082 3529 http.cpp:1045] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-scheduler' > I1017 21:37:55.184257 3529 master.cpp:631] Authorization enabled > I1017 21:37:55.184450 3536 hierarchical.cpp:171] Initialized hierarchical > allocator process > I1017 21:37:55.184551 3536 whitelist_watcher.cpp:77] No whitelist given > I1017 21:37:55.187489 3536 master.cpp:2198] Elected as the leading master! > I1017 21:37:55.187516 3536 master.cpp:1687] Recovering from registrar > I1017 21:37:55.187728 3536 registrar.cpp:347] Recovering registrar > I1017 21:37:55.188508 3536 registrar.cpp:391] Successfully fetched the > registry (0B) in 745984ns > I1017 21:37:55.188616 3536 registrar.cpp:495] Applied 1 operations in > 37290ns; attempting to update the registry > I1017 21:37:55.189162 3536 registrar.cpp:552] Successfully updated the > registry in 491008ns > I1017 21:37:55.189285 3536 registrar.cpp:424] Successfully recovered > registrar > I1017 21:37:55.190011 3531 hierarchical.cpp:209] Skipping recovery of > hierarchical allocator: nothing to recover > I1017 21:37:55.190115 3534 master.cpp:1791] Recovered 0 agents from the > registry (129B); allowing 10mins for agents to re-register > W1017 21:37:55.195062 3528 process.cpp:3194] Attempted to spawn already > running process files@172.17.0.2:33744 > I1017 21:37:55.195956 3528
[jira] [Commented] (MESOS-8112) DefaultExecutorTest.ResourceLimitation is flaky
[ https://issues.apache.org/jira/browse/MESOS-8112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16242218#comment-16242218 ] Alexander Rukletsov commented on MESOS-8112: {{GetContainers}} should be fixed by https://github.com/apache/mesos/commit/ab3728f3adcd199cc0dd9c481a2946efffacdb0d > DefaultExecutorTest.ResourceLimitation is flaky > --- > > Key: MESOS-8112 > URL: https://issues.apache.org/jira/browse/MESOS-8112 > Project: Mesos > Issue Type: Bug > Components: flaky, test >Reporter: James Peach > Attachments: GetContainers-badrun.txt, GetContainers-goodrun.txt, > ResourceLimitation-badrun.txt > > > As seen in CI builds, the > {{MesosContainerizer/DefaultExecutorTest.ResourceLimitation/0}} test can be > flaky > {noformat}[ RUN ] > MesosContainerizer/DefaultExecutorTest.ResourceLimitation/0 > I1017 21:37:55.179539 3528 cluster.cpp:162] Creating default 'local' > authorizer > I1017 21:37:55.182804 3529 master.cpp:445] Master > 0a7cd77c-8bc0-4fdc-b6c5-918b7ffc392e (42cd332f4072) started on > 172.17.0.2:33744 > I1017 21:37:55.182847 3529 master.cpp:447] Flags at startup: --acls="" > --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" > --allocation_interval="1secs" --allocator="HierarchicalDRF" > --authenticate_agents="true" --authenticate_frameworks="true" > --authenticate_http_frameworks="true" --authenticate_http_readonly="true" > --authenticate_http_readwrite="true" --authenticators="crammd5" > --authorizers="local" --credentials="/tmp/1FtpuJ/credentials" > --filter_gpu_resources="true" --framework_sorter="drf" --help="false" > --hostname_lookup="true" --http_authenticators="basic" > --http_framework_authenticators="basic" --initialize_driver_logging="true" > --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" > --max_agent_ping_timeouts="5" --max_completed_frameworks="50" > --max_completed_tasks_per_framework="1000" > --max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" > --recovery_agent_removal_limit="100%" --registry="in_memory" > --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" > --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" > --registry_store_timeout="100secs" --registry_strict="false" > --root_submissions="true" --user_sorter="drf" --version="false" > --webui_dir="/mesos/mesos-1.5.0/_inst/share/mesos/webui" > --work_dir="/tmp/1FtpuJ/master" --zk_session_timeout="10secs" > I1017 21:37:55.183141 3529 master.cpp:496] Master only allowing > authenticated frameworks to register > I1017 21:37:55.183153 3529 master.cpp:502] Master only allowing > authenticated agents to register > I1017 21:37:55.183161 3529 master.cpp:508] Master only allowing > authenticated HTTP frameworks to register > I1017 21:37:55.183167 3529 credentials.hpp:37] Loading credentials for > authentication from '/tmp/1FtpuJ/credentials' > I1017 21:37:55.183472 3529 master.cpp:552] Using default 'crammd5' > authenticator > I1017 21:37:55.183661 3529 http.cpp:1045] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-readonly' > I1017 21:37:55.183862 3529 http.cpp:1045] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-readwrite' > I1017 21:37:55.184082 3529 http.cpp:1045] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-scheduler' > I1017 21:37:55.184257 3529 master.cpp:631] Authorization enabled > I1017 21:37:55.184450 3536 hierarchical.cpp:171] Initialized hierarchical > allocator process > I1017 21:37:55.184551 3536 whitelist_watcher.cpp:77] No whitelist given > I1017 21:37:55.187489 3536 master.cpp:2198] Elected as the leading master! > I1017 21:37:55.187516 3536 master.cpp:1687] Recovering from registrar > I1017 21:37:55.187728 3536 registrar.cpp:347] Recovering registrar > I1017 21:37:55.188508 3536 registrar.cpp:391] Successfully fetched the > registry (0B) in 745984ns > I1017 21:37:55.188616 3536 registrar.cpp:495] Applied 1 operations in > 37290ns; attempting to update the registry > I1017 21:37:55.189162 3536 registrar.cpp:552] Successfully updated the > registry in 491008ns > I1017 21:37:55.189285 3536 registrar.cpp:424] Successfully recovered > registrar > I1017 21:37:55.190011 3531 hierarchical.cpp:209] Skipping recovery of > hierarchical allocator: nothing to recover > I1017 21:37:55.190115 3534 master.cpp:1791] Recovered 0 agents from the > registry (129B); allowing 10mins for agents to re-register > W1017 21:37:55.195062 3528 process.cpp:3194] Attempted to spawn already > running process files@172.17.0.2:33744 > I1017 21:37:55.195956 3528 containerizer.cpp:292] Using isolation { > environment_secret, network/cni, filesystem/posix, disk/du } > W1017 21:37:55.196488 3528 backend.cpp:76] Failed to
[jira] [Commented] (MESOS-8112) DefaultExecutorTest.ResourceLimitation is flaky
[ https://issues.apache.org/jira/browse/MESOS-8112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16210353#comment-16210353 ] James Peach commented on MESOS-8112: The first failure in this test run is {{../../src/tests/default_executor_tests.cpp:1460: Failed to wait 15secs for failed}}. This indicates that we got the {{starting}} and {{running}} status updates but the final {{failure}} took too long. However, the master forwarded the {{RUNNING}} update here: {noformat} I1017 21:37:55.499879 3532 master.cpp:7055] Forwarding status update TASK_RUNNING (UUID: 86bb612c-9d48-4e85-a0f1-89820ea65fa1) for task ffc4604c-a6cb-4ced-a969-fc6b9e6f955d of framework 0a7cd77c-8bc0-4fdc-b6c5-918b7ffc392e- ... 1017 21:37:55.742033 3533 containerizer.cpp:2677] Container 4789abbb-04c9-4d6d-b561-f44b34ec47d2 has reached its limit for resource [{"allocation_info":{"role":"*"},"name":"disk","scalar":{"value":20.0},"type":"SCALAR"}] and will be terminated ... I1017 21:39:22.944377 3535 master.cpp:1417] Framework 0a7cd77c-8bc0-4fdc-b6c5-918b7ffc392e- (default) disconnected ... I1017 21:39:22.946893 3529 master.cpp:9157] Updating the state of task ffc4604c-a6cb-4ced-a969-fc6b9e6f955d of framework 0a7cd77c-8bc0-4fdc-b6c5-918b7ffc392e- (latest state: TASK_KILLED, status update state: TASK_KILLED) {noformat} So from the master's perspective, the test framework disconnected? Or did this happen once the test failed and we started tearing it down? Later in the test log: {noformat} ../../src/tests/default_executor_tests.cpp:1427: Failure Actual function call count doesn't match EXPECT_CALL(*scheduler, update(_, _))... Expected: to be called twice Actual: called once - unsatisfied and active {noformat} This seems to indicate that we only got 1 of 3 expected status updates, but if that was true I would expect to see a failure on {{AWAIT_READY(running)}} and I can't find that here :( > DefaultExecutorTest.ResourceLimitation is flaky > --- > > Key: MESOS-8112 > URL: https://issues.apache.org/jira/browse/MESOS-8112 > Project: Mesos > Issue Type: Bug > Components: flaky, test >Reporter: James Peach > Attachments: GetContainers-badrun.txt, GetContainers-goodrun.txt > > > As seen in CI builds, the > {{MesosContainerizer/DefaultExecutorTest.ResourceLimitation/0}} test can be > flaky > {noformat}[ RUN ] > MesosContainerizer/DefaultExecutorTest.ResourceLimitation/0 > I1017 21:37:55.179539 3528 cluster.cpp:162] Creating default 'local' > authorizer > I1017 21:37:55.182804 3529 master.cpp:445] Master > 0a7cd77c-8bc0-4fdc-b6c5-918b7ffc392e (42cd332f4072) started on > 172.17.0.2:33744 > I1017 21:37:55.182847 3529 master.cpp:447] Flags at startup: --acls="" > --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" > --allocation_interval="1secs" --allocator="HierarchicalDRF" > --authenticate_agents="true" --authenticate_frameworks="true" > --authenticate_http_frameworks="true" --authenticate_http_readonly="true" > --authenticate_http_readwrite="true" --authenticators="crammd5" > --authorizers="local" --credentials="/tmp/1FtpuJ/credentials" > --filter_gpu_resources="true" --framework_sorter="drf" --help="false" > --hostname_lookup="true" --http_authenticators="basic" > --http_framework_authenticators="basic" --initialize_driver_logging="true" > --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" > --max_agent_ping_timeouts="5" --max_completed_frameworks="50" > --max_completed_tasks_per_framework="1000" > --max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" > --recovery_agent_removal_limit="100%" --registry="in_memory" > --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" > --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" > --registry_store_timeout="100secs" --registry_strict="false" > --root_submissions="true" --user_sorter="drf" --version="false" > --webui_dir="/mesos/mesos-1.5.0/_inst/share/mesos/webui" > --work_dir="/tmp/1FtpuJ/master" --zk_session_timeout="10secs" > I1017 21:37:55.183141 3529 master.cpp:496] Master only allowing > authenticated frameworks to register > I1017 21:37:55.183153 3529 master.cpp:502] Master only allowing > authenticated agents to register > I1017 21:37:55.183161 3529 master.cpp:508] Master only allowing > authenticated HTTP frameworks to register > I1017 21:37:55.183167 3529 credentials.hpp:37] Loading credentials for > authentication from '/tmp/1FtpuJ/credentials' > I1017 21:37:55.183472 3529 master.cpp:552] Using default 'crammd5' > authenticator > I1017 21:37:55.183661 3529 http.cpp:1045] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-readonly' > I1017 21:37:55.183862 3529 http.cpp:1045] Creating default 'basic' HTTP >
[jira] [Commented] (MESOS-8112) DefaultExecutorTest.ResourceLimitation is flaky
[ https://issues.apache.org/jira/browse/MESOS-8112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209913#comment-16209913 ] James Peach commented on MESOS-8112: {{ContentType/AgentAPITest.GetContainers/1}} might also be fallout from the same changes in MESOS-7963 > DefaultExecutorTest.ResourceLimitation is flaky > --- > > Key: MESOS-8112 > URL: https://issues.apache.org/jira/browse/MESOS-8112 > Project: Mesos > Issue Type: Bug > Components: flaky, test >Reporter: James Peach > > As seen in CI builds, the > {{MesosContainerizer/DefaultExecutorTest.ResourceLimitation/0}} test can be > flaky > {noformat}[ RUN ] > MesosContainerizer/DefaultExecutorTest.ResourceLimitation/0 > I1017 21:37:55.179539 3528 cluster.cpp:162] Creating default 'local' > authorizer > I1017 21:37:55.182804 3529 master.cpp:445] Master > 0a7cd77c-8bc0-4fdc-b6c5-918b7ffc392e (42cd332f4072) started on > 172.17.0.2:33744 > I1017 21:37:55.182847 3529 master.cpp:447] Flags at startup: --acls="" > --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" > --allocation_interval="1secs" --allocator="HierarchicalDRF" > --authenticate_agents="true" --authenticate_frameworks="true" > --authenticate_http_frameworks="true" --authenticate_http_readonly="true" > --authenticate_http_readwrite="true" --authenticators="crammd5" > --authorizers="local" --credentials="/tmp/1FtpuJ/credentials" > --filter_gpu_resources="true" --framework_sorter="drf" --help="false" > --hostname_lookup="true" --http_authenticators="basic" > --http_framework_authenticators="basic" --initialize_driver_logging="true" > --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" > --max_agent_ping_timeouts="5" --max_completed_frameworks="50" > --max_completed_tasks_per_framework="1000" > --max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" > --recovery_agent_removal_limit="100%" --registry="in_memory" > --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" > --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" > --registry_store_timeout="100secs" --registry_strict="false" > --root_submissions="true" --user_sorter="drf" --version="false" > --webui_dir="/mesos/mesos-1.5.0/_inst/share/mesos/webui" > --work_dir="/tmp/1FtpuJ/master" --zk_session_timeout="10secs" > I1017 21:37:55.183141 3529 master.cpp:496] Master only allowing > authenticated frameworks to register > I1017 21:37:55.183153 3529 master.cpp:502] Master only allowing > authenticated agents to register > I1017 21:37:55.183161 3529 master.cpp:508] Master only allowing > authenticated HTTP frameworks to register > I1017 21:37:55.183167 3529 credentials.hpp:37] Loading credentials for > authentication from '/tmp/1FtpuJ/credentials' > I1017 21:37:55.183472 3529 master.cpp:552] Using default 'crammd5' > authenticator > I1017 21:37:55.183661 3529 http.cpp:1045] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-readonly' > I1017 21:37:55.183862 3529 http.cpp:1045] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-readwrite' > I1017 21:37:55.184082 3529 http.cpp:1045] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-scheduler' > I1017 21:37:55.184257 3529 master.cpp:631] Authorization enabled > I1017 21:37:55.184450 3536 hierarchical.cpp:171] Initialized hierarchical > allocator process > I1017 21:37:55.184551 3536 whitelist_watcher.cpp:77] No whitelist given > I1017 21:37:55.187489 3536 master.cpp:2198] Elected as the leading master! > I1017 21:37:55.187516 3536 master.cpp:1687] Recovering from registrar > I1017 21:37:55.187728 3536 registrar.cpp:347] Recovering registrar > I1017 21:37:55.188508 3536 registrar.cpp:391] Successfully fetched the > registry (0B) in 745984ns > I1017 21:37:55.188616 3536 registrar.cpp:495] Applied 1 operations in > 37290ns; attempting to update the registry > I1017 21:37:55.189162 3536 registrar.cpp:552] Successfully updated the > registry in 491008ns > I1017 21:37:55.189285 3536 registrar.cpp:424] Successfully recovered > registrar > I1017 21:37:55.190011 3531 hierarchical.cpp:209] Skipping recovery of > hierarchical allocator: nothing to recover > I1017 21:37:55.190115 3534 master.cpp:1791] Recovered 0 agents from the > registry (129B); allowing 10mins for agents to re-register > W1017 21:37:55.195062 3528 process.cpp:3194] Attempted to spawn already > running process files@172.17.0.2:33744 > I1017 21:37:55.195956 3528 containerizer.cpp:292] Using isolation { > environment_secret, network/cni, filesystem/posix, disk/du } > W1017 21:37:55.196488 3528 backend.cpp:76] Failed to create 'aufs' backend: > AufsBackend requires root privileges > W1017 21:37:55.196630 3528 backend.cpp:76] Failed to create 'bind' backend: