[jira] [Updated] (MESOS-7216) Delayed executor termination leads to test failures

2017-03-30 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/MESOS-7216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gastón Kleiman updated MESOS-7216:
--
Sprint: Mesosphere Sprint 52  (was: Mesosphere Sprint 52, Mesosphere Sprint 53)

> Delayed executor termination leads to test failures
> ---
>
> Key: MESOS-7216
> URL: https://issues.apache.org/jira/browse/MESOS-7216
> Project: Mesos
> Issue Type: Bug
> Reporter: Gastón Kleiman
> Assignee: Gastón Kleiman
> Labels: debugging, mesosphere
>
> This bug came up during the development of a test for the new COMMAND health 
> checks that use nested containers. The test can be found here: 
> https://reviews.apache.org/r/55901/.
> The test setup was explained in MESOS-7050:
> 1) Start the scheduler driver
> 2) Launch a task group with the default executor that includes a single 
> long-running task with a COMMAND health check
> 3) Wait for the task to report a HEALTHY status once
> 4) Stop the scheduler driver without explicitly waiting for any of the tasks 
> to complete
> 5) Wait for all the containers to complete
> 6) Exit the test
> With this setup, all of the ASSERTs in the test itself pass, but the test 
> fails because processes are still running when the test exits (after a 
> timeout of 15 seconds):
> {noformat}
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from HealthCheckTest
> [ RUN  ] HealthCheckTest.DefaultExecutorCommandHealthCheck
> I0228 14:29:19.078368 3475919808 cluster.cpp:160] Creating default 'local' 
> authorizer
> I0228 14:29:19.084883 238907392 master.cpp:383] Master 
> 98c48dab-fd2b-404e-85dc-4ec5dd0d635c (172.18.8.139) started on 
> 172.18.8.139:55836
> I0228 14:29:19.084915 238907392 master.cpp:385] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authenticators="crammd5" 
> --authorizers="local" 
> --credentials="/private/var/folders/jr/17y2w4ld019bsn9vhx0c13f8gn/T/xZZCGr/credentials"
>  --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --http_authenticators="basic" --http_framework_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --max_unreachable_tasks_per_framework="1000" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/Users/gaston/mesos/master/share/mesos/webui" 
> --work_dir="/private/var/folders/jr/17y2w4ld019bsn9vhx0c13f8gn/T/xZZCGr/master"
>  --zk_session_timeout="10secs"
> I0228 14:29:19.086030 238907392 master.cpp:435] Master only allowing 
> authenticated frameworks to register
> I0228 14:29:19.086041 238907392 master.cpp:449] Master only allowing 
> authenticated agents to register
> I0228 14:29:19.086046 238907392 master.cpp:462] Master only allowing 
> authenticated HTTP frameworks to register
> I0228 14:29:19.086050 238907392 credentials.hpp:37] Loading credentials for 
> authentication from 
> '/private/var/folders/jr/17y2w4ld019bsn9vhx0c13f8gn/T/xZZCGr/credentials'
> I0228 14:29:19.086334 238907392 master.cpp:507] Using default 'crammd5' 
> authenticator
> I0228 14:29:19.086369 238907392 authenticator.cpp:519] Initializing server 
> SASL
> I0228 14:29:19.100981 238907392 auxprop.cpp:73] Initialized in-memory 
> auxiliary property plugin
> I0228 14:29:19.101080 238907392 http.cpp:933] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> I0228 14:29:19.101274 238907392 http.cpp:933] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> I0228 14:29:19.101414 238907392 http.cpp:933] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> I0228 14:29:19.101528 238907392 master.cpp:587] Authorization enabled
> I0228 14:29:19.101702 240517120 hierarchical.cpp:161] Initialized 
> hierarchical allocator process
> I0228 14:29:19.101740 239443968 whitelist_watcher.cpp:77] No whitelist given
> I0228 14:29:19.105717 240517120 master.cpp:2122] Elected as the leading 
> master!
> I0228 14:29:19.105738 240517120 master.cpp:1646] Recovering from registrar
> I0228 
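
The failure mode in the quoted steps (the test exits while executor or task processes are still alive) can be illustrated outside Mesos. The following is a minimal sketch, not the Mesos test code: `spawn_sleeper` and `wait_for_children` are hypothetical helpers, and plain child processes stand in for the executor and its nested containers. It waits for the children up to a deadline and reports any that are still running, which is the condition the test harness complained about.

```python
import subprocess
import sys
import time


def spawn_sleeper(seconds):
    """Start a child process that just sleeps.

    Hypothetical stand-in for a long-running task or executor process.
    """
    return subprocess.Popen(
        [sys.executable, "-c", f"import time; time.sleep({seconds})"])


def wait_for_children(children, timeout_secs):
    """Wait until every child has exited or the deadline passes.

    Returns the children that are still running when the wait ends;
    an empty list means it is safe to exit without leaving processes behind.
    """
    deadline = time.monotonic() + timeout_secs
    remaining = list(children)
    while True:
        # poll() returns None while the process is still alive.
        remaining = [c for c in remaining if c.poll() is None]
        if not remaining or time.monotonic() >= deadline:
            return remaining
        time.sleep(0.05)
```

A caller would spawn its children, call `wait_for_children(children, timeout_secs=15)` before exiting, and treat a non-empty result as leftover processes that need to be killed or investigated. The 15-second value mirrors the timeout mentioned in the issue; everything else here is an assumption made for illustration.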

[jira] [Updated] (MESOS-7216) Delayed executor termination leads to test failures

2017-03-23 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-7216:
-
Sprint: Mesosphere Sprint 52, Mesosphere Sprint 53  (was: Mesosphere Sprint 52)


[jira] [Updated] (MESOS-7216) Delayed executor termination leads to test failures

2017-03-07 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/MESOS-7216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gastón Kleiman updated MESOS-7216:
--
Summary: Delayed executor termination leads to test failures  (was: Termination of nested containers sometimes fails)

> Delayed executor termination leads to test failures
> ---
>
> Key: MESOS-7216
> URL: https://issues.apache.org/jira/browse/MESOS-7216
> Project: Mesos
> Issue Type: Bug
> Reporter: Gastón Kleiman
> Assignee: Kevin Klues
> Labels: debugging, mesosphere
>
> This bug came up during the development of a test for the new COMMAND health 
> checks that use nested containers. The test can be found here: 
> https://reviews.apache.org/r/55901/.
> The test setup was explained in MESOS-7050:
> 1) Start the scheduler driver
> 2) Launch a task group with the default executor that includes a single 
> long-running task with a COMMAND health check
> 3) Wait for the task to report a HEALTHY status once
> 4) Stop the scheduler driver without explicitly waiting for any of the tasks 
> to complete
> 5) Exit the test
> With this setup, all of the ASSERTs in the test itself pass, but the test 
> fails because processes are still running when the test exits (after a 
> timeout of 15 seconds):
> {noformat}
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from HealthCheckTest
> [ RUN  ] HealthCheckTest.DefaultExecutorCommandHealthCheck
> I0228 14:29:19.078368 3475919808 cluster.cpp:160] Creating default 'local' 
> authorizer
> I0228 14:29:19.084883 238907392 master.cpp:383] Master 
> 98c48dab-fd2b-404e-85dc-4ec5dd0d635c (172.18.8.139) started on 
> 172.18.8.139:55836
> I0228 14:29:19.084915 238907392 master.cpp:385] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authenticators="crammd5" 
> --authorizers="local" 
> --credentials="/private/var/folders/jr/17y2w4ld019bsn9vhx0c13f8gn/T/xZZCGr/credentials"
>  --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --http_authenticators="basic" --http_framework_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --max_unreachable_tasks_per_framework="1000" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/Users/gaston/mesos/master/share/mesos/webui" 
> --work_dir="/private/var/folders/jr/17y2w4ld019bsn9vhx0c13f8gn/T/xZZCGr/master"
>  --zk_session_timeout="10secs"
> I0228 14:29:19.086030 238907392 master.cpp:435] Master only allowing 
> authenticated frameworks to register
> I0228 14:29:19.086041 238907392 master.cpp:449] Master only allowing 
> authenticated agents to register
> I0228 14:29:19.086046 238907392 master.cpp:462] Master only allowing 
> authenticated HTTP frameworks to register
> I0228 14:29:19.086050 238907392 credentials.hpp:37] Loading credentials for 
> authentication from 
> '/private/var/folders/jr/17y2w4ld019bsn9vhx0c13f8gn/T/xZZCGr/credentials'
> I0228 14:29:19.086334 238907392 master.cpp:507] Using default 'crammd5' 
> authenticator
> I0228 14:29:19.086369 238907392 authenticator.cpp:519] Initializing server 
> SASL
> I0228 14:29:19.100981 238907392 auxprop.cpp:73] Initialized in-memory 
> auxiliary property plugin
> I0228 14:29:19.101080 238907392 http.cpp:933] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> I0228 14:29:19.101274 238907392 http.cpp:933] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> I0228 14:29:19.101414 238907392 http.cpp:933] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> I0228 14:29:19.101528 238907392 master.cpp:587] Authorization enabled
> I0228 14:29:19.101702 240517120 hierarchical.cpp:161] Initialized 
> hierarchical allocator process
> I0228 14:29:19.101740 239443968 whitelist_watcher.cpp:77] No whitelist given
> I0228 14:29:19.105717 240517120 master.cpp:2122] Elected as the leading 
> master!
> I0228 14:29:19.105738 240517120 master.cpp:1646] Recovering from registrar
> I0228