[jira] [Updated] (MESOS-8112) DefaultExecutorTest.ResourceLimitation is flaky

2017-10-18 Thread James Peach (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach updated MESOS-8112:
---
Attachment: ResourceLimitation-badrun.txt

> DefaultExecutorTest.ResourceLimitation is flaky
> ---
>
> Key: MESOS-8112
> URL: https://issues.apache.org/jira/browse/MESOS-8112
> Project: Mesos
> Issue Type: Bug
> Components: flaky, test
> Reporter: James Peach
> Attachments: GetContainers-badrun.txt, GetContainers-goodrun.txt, 
> ResourceLimitation-badrun.txt
>
>
> As seen in CI builds, the 
> {{MesosContainerizer/DefaultExecutorTest.ResourceLimitation/0}} test can be 
> flaky
> {noformat}
> [ RUN  ] MesosContainerizer/DefaultExecutorTest.ResourceLimitation/0
> I1017 21:37:55.179539  3528 cluster.cpp:162] Creating default 'local' 
> authorizer
> I1017 21:37:55.182804  3529 master.cpp:445] Master 
> 0a7cd77c-8bc0-4fdc-b6c5-918b7ffc392e (42cd332f4072) started on 
> 172.17.0.2:33744
> I1017 21:37:55.182847  3529 master.cpp:447] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authenticators="crammd5" 
> --authorizers="local" --credentials="/tmp/1FtpuJ/credentials" 
> --filter_gpu_resources="true" --framework_sorter="drf" --help="false" 
> --hostname_lookup="true" --http_authenticators="basic" 
> --http_framework_authenticators="basic" --initialize_driver_logging="true" 
> --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
> --max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
> --max_completed_tasks_per_framework="1000" 
> --max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/mesos/mesos-1.5.0/_inst/share/mesos/webui" 
> --work_dir="/tmp/1FtpuJ/master" --zk_session_timeout="10secs"
> I1017 21:37:55.183141  3529 master.cpp:496] Master only allowing 
> authenticated frameworks to register
> I1017 21:37:55.183153  3529 master.cpp:502] Master only allowing 
> authenticated agents to register
> I1017 21:37:55.183161  3529 master.cpp:508] Master only allowing 
> authenticated HTTP frameworks to register
> I1017 21:37:55.183167  3529 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/1FtpuJ/credentials'
> I1017 21:37:55.183472  3529 master.cpp:552] Using default 'crammd5' 
> authenticator
> I1017 21:37:55.183661  3529 http.cpp:1045] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> I1017 21:37:55.183862  3529 http.cpp:1045] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> I1017 21:37:55.184082  3529 http.cpp:1045] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> I1017 21:37:55.184257  3529 master.cpp:631] Authorization enabled
> I1017 21:37:55.184450  3536 hierarchical.cpp:171] Initialized hierarchical 
> allocator process
> I1017 21:37:55.184551  3536 whitelist_watcher.cpp:77] No whitelist given
> I1017 21:37:55.187489  3536 master.cpp:2198] Elected as the leading master!
> I1017 21:37:55.187516  3536 master.cpp:1687] Recovering from registrar
> I1017 21:37:55.187728  3536 registrar.cpp:347] Recovering registrar
> I1017 21:37:55.188508  3536 registrar.cpp:391] Successfully fetched the 
> registry (0B) in 745984ns
> I1017 21:37:55.188616  3536 registrar.cpp:495] Applied 1 operations in 
> 37290ns; attempting to update the registry
> I1017 21:37:55.189162  3536 registrar.cpp:552] Successfully updated the 
> registry in 491008ns
> I1017 21:37:55.189285  3536 registrar.cpp:424] Successfully recovered 
> registrar
> I1017 21:37:55.190011  3531 hierarchical.cpp:209] Skipping recovery of 
> hierarchical allocator: nothing to recover
> I1017 21:37:55.190115  3534 master.cpp:1791] Recovered 0 agents from the 
> registry (129B); allowing 10mins for agents to re-register
> W1017 21:37:55.195062  3528 process.cpp:3194] Attempted to spawn already 
> running process files@172.17.0.2:33744
> I1017 21:37:55.195956  3528 containerizer.cpp:292] Using isolation { 
> environment_secret, network/cni, filesystem/posix, disk/du }
> W1017 21:37:55.196488  3528 backend.cpp:76] Failed to create 'aufs' backend: 
> AufsBackend requires root privileges
> W1017 21:37:55.196630  3528 backend.cpp:76] Failed to create 

[jira] [Commented] (MESOS-8112) DefaultExecutorTest.ResourceLimitation is flaky

2017-10-18 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16210353#comment-16210353
 ] 

James Peach commented on MESOS-8112:


The first failure in this test run is 
{{../../src/tests/default_executor_tests.cpp:1460: Failed to wait 15secs for 
failed}}. This indicates that we got the {{starting}} and {{running}} status 
updates, but the final {{failed}} update did not arrive within the 15-second 
wait.

However, the master forwarded the {{RUNNING}} update here:

{noformat}
I1017 21:37:55.499879  3532 master.cpp:7055] Forwarding status update 
TASK_RUNNING (UUID: 86bb612c-9d48-4e85-a0f1-89820ea65fa1) for task 
ffc4604c-a6cb-4ced-a969-fc6b9e6f955d of framework 
0a7cd77c-8bc0-4fdc-b6c5-918b7ffc392e-
...
1017 21:37:55.742033  3533 containerizer.cpp:2677] Container 
4789abbb-04c9-4d6d-b561-f44b34ec47d2 has reached its limit for resource 
[{"allocation_info":{"role":"*"},"name":"disk","scalar":{"value":20.0},"type":"SCALAR"}]
 and will be terminated
...
I1017 21:39:22.944377  3535 master.cpp:1417] Framework 
0a7cd77c-8bc0-4fdc-b6c5-918b7ffc392e- (default) disconnected
...
I1017 21:39:22.946893  3529 master.cpp:9157] Updating the state of task 
ffc4604c-a6cb-4ced-a969-fc6b9e6f955d of framework 
0a7cd77c-8bc0-4fdc-b6c5-918b7ffc392e- (latest state: TASK_KILLED, status 
update state: TASK_KILLED)
{noformat}

So from the master's perspective, the test framework disconnected? Or did this 
happen once the test failed and we started tearing it down?
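
To put a number on the gap in the excerpt above, here is a rough sketch (not part of the Mesos test suite) that computes the time between two glog line prefixes. It assumes both lines were logged on the same day; the names `glogMicrosOfDay` and `elapsedMicros` are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Parses a glog prefix such as "I1017 21:37:55.499879" and returns the
// time of day in microseconds since midnight. Returns -1 if the text does
// not match that layout. (Hypothetical helper, not a Mesos or glog API.)
std::int64_t glogMicrosOfDay(const std::string& prefix)
{
  char severity = 0;
  int month = 0, day = 0, hour = 0, minute = 0, second = 0, micros = 0;

  const int matched = std::sscanf(
      prefix.c_str(),
      "%c%2d%2d %2d:%2d:%2d.%6d",
      &severity, &month, &day, &hour, &minute, &second, &micros);

  if (matched != 7) {
    return -1;
  }

  return ((hour * 60LL + minute) * 60LL + second) * 1000000LL + micros;
}

// Returns the microseconds elapsed between two glog prefixes.
// Assumption: both lines were logged on the same calendar day.
std::int64_t elapsedMicros(const std::string& from, const std::string& to)
{
  const std::int64_t start = glogMicrosOfDay(from);
  const std::int64_t end = glogMicrosOfDay(to);
  assert(start >= 0 && end >= 0);
  return end - start;
}
```

For the {{TASK_RUNNING}} forward ({{I1017 21:37:55.499879}}) and the framework disconnect ({{I1017 21:39:22.944377}}), `elapsedMicros` gives 87444498 microseconds, roughly 87 seconds, which is well past the 15-second wait that failed.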

Later in the test log:
{noformat}
../../src/tests/default_executor_tests.cpp:1427: Failure
Actual function call count doesn't match EXPECT_CALL(*scheduler, update(_, 
_))...
 Expected: to be called twice
   Actual: called once - unsatisfied and active
{noformat}

This seems to indicate that we only got 1 of 3 expected status updates, but if 
that were true I would expect to see a failure on {{AWAIT_READY(running)}}, and 
I can't find one here :(

> DefaultExecutorTest.ResourceLimitation is flaky
> ---
>
> Key: MESOS-8112
> URL: https://issues.apache.org/jira/browse/MESOS-8112
> Project: Mesos
> Issue Type: Bug
> Components: flaky, test
> Reporter: James Peach
> Attachments: GetContainers-badrun.txt, GetContainers-goodrun.txt

[jira] [Updated] (MESOS-7111) HttpFaultToleranceTest.SchedulerFailoverFrameworkToExecutorMessage segfaults

2017-10-18 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-7111:
---
Labels: flaky-test mesosphere  (was: mesosphere)

> HttpFaultToleranceTest.SchedulerFailoverFrameworkToExecutorMessage segfaults
> 
>
> Key: MESOS-7111
> URL: https://issues.apache.org/jira/browse/MESOS-7111
> Project: Mesos
> Issue Type: Bug
> Components: test
> Affects Versions: 1.2.0
> Environment: ubuntu-16
> Reporter: Benjamin Bannier
> Labels: flaky-test, mesosphere
>
> We observed a segfault in 
> {{HttpFaultToleranceTest.SchedulerFailoverFrameworkToExecutorMessage}} in 
> internal CI on an ubuntu16 machine. Note that ubuntu16 uses gcc-6.
> {code}
> [ RUN  ] 
> HttpFaultToleranceTest.SchedulerFailoverFrameworkToExecutorMessage
> I0210 02:47:31.260174 19578 cluster.cpp:160] Creating default 'local' 
> authorizer
> I0210 02:47:31.261225 19597 master.cpp:383] Master 
> d8129420-2a04-48e7-9b28-6b0a0af73168 (ip-10-150-111-24.ec2.internal) started 
> on 10.150.111.24:33608
> I0210 02:47:31.261281 19597 master.cpp:385] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="true" --authenticate_frameworks="false" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authenticators="crammd5" 
> --authorizers="local" --credentials="/tmp/fBrqHi/credentials" 
> --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --http_authenticators="basic" --http_framework_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --max_unreachable_tasks_per_framework="1000" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/fBrqHi/master" 
> --zk_session_timeout="10secs"
> I0210 02:47:31.261404 19597 master.cpp:437] Master allowing unauthenticated 
> frameworks to register
> I0210 02:47:31.261411 19597 master.cpp:449] Master only allowing 
> authenticated agents to register
> I0210 02:47:31.261415 19597 master.cpp:462] Master only allowing 
> authenticated HTTP frameworks to register
> I0210 02:47:31.261420 19597 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/fBrqHi/credentials'
> I0210 02:47:31.261488 19597 master.cpp:507] Using default 'crammd5' 
> authenticator
> I0210 02:47:31.261530 19597 http.cpp:919] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> I0210 02:47:31.261591 19597 http.cpp:919] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> I0210 02:47:31.261631 19597 http.cpp:919] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> I0210 02:47:31.261698 19597 master.cpp:587] Authorization enabled
> I0210 02:47:31.261754 19601 whitelist_watcher.cpp:77] No whitelist given
> I0210 02:47:31.261754 19602 hierarchical.cpp:161] Initialized hierarchical 
> allocator process
> I0210 02:47:31.262462 19597 master.cpp:2124] Elected as the leading master!
> I0210 02:47:31.262482 19597 master.cpp:1646] Recovering from registrar
> I0210 02:47:31.262545 19603 registrar.cpp:329] Recovering registrar
> I0210 02:47:31.262774 19602 registrar.cpp:362] Successfully fetched the 
> registry (0B) in 201984ns
> I0210 02:47:31.262809 19602 registrar.cpp:461] Applied 1 operations in 
> 2963ns; attempting to update the registry
> I0210 02:47:31.263062 19599 registrar.cpp:506] Successfully updated the 
> registry in 214016ns
> I0210 02:47:31.263119 19599 registrar.cpp:392] Successfully recovered 
> registrar
> I0210 02:47:31.263267 19597 master.cpp:1762] Recovered 0 agents from the 
> registry (172B); allowing 10mins for agents to re-register
> I0210 02:47:31.263295 19598 hierarchical.cpp:188] Skipping recovery of 
> hierarchical allocator: nothing to recover
> I0210 02:47:31.264645 19578 cluster.cpp:446] Creating default 'local' 
> authorizer
> I0210 02:47:31.265029 19598 slave.cpp:211] Mesos agent started on 
> (105)@10.150.111.24:33608
> I0210 02:47:31.265187 19578 scheduler.cpp:184] Version: 1.3.0
> I0210 02:47:31.265043 19598 slave.cpp:212] Flags at startup: --acls="" 
> 

[jira] [Updated] (MESOS-8113) Display task names in Alphanum pattern

2017-10-18 Thread Varun Gupta (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Gupta updated MESOS-8113:
---
Attachment: current_lexicographic.png
proposed_alphanum.png

> Display task names in Alphanum pattern
> --
>
> Key: MESOS-8113
> URL: https://issues.apache.org/jira/browse/MESOS-8113
> Project: Mesos
> Issue Type: Task
> Components: webui
> Affects Versions: 1.4.0
> Reporter: Varun Gupta
> Priority: Minor
> Fix For: 1.4.0
>
> Attachments: current_lexicographic.png, proposed_alphanum.png
>
>
> Task names are currently sorted in lexicographic order, which makes them 
> awkward to read. I propose sorting them in Alphanum order instead.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8113) Display task names in Alphanum pattern

2017-10-18 Thread Varun Gupta (JIRA)
Varun Gupta created MESOS-8113:
--

 Summary: Display task names in Alphanum pattern
 Key: MESOS-8113
 URL: https://issues.apache.org/jira/browse/MESOS-8113
 Project: Mesos
  Issue Type: Task
  Components: webui
Affects Versions: 1.4.0
Reporter: Varun Gupta
Priority: Minor
 Fix For: 1.4.0
 Attachments: current_lexicographic.png, proposed_alphanum.png

Task names are currently sorted in lexicographic order, which makes them 
awkward to read. I propose sorting them in Alphanum order instead.
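
For illustration only, the difference between the two orderings can be sketched as a comparator that compares runs of digits by numeric value and everything else character by character. This is an assumption about what "Alphanum" means here and is not the web UI's actual code; `alphanumLess`, `sortAlphanum`, and the task names mentioned below are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// True for the ASCII digits '0'..'9'.
static bool isDigit(char c)
{
  return c >= '0' && c <= '9';
}

// Returns true if `a` should be ordered before `b`. Runs of digits are
// compared by numeric value (ignoring leading zeros); all other
// characters are compared one at a time as unsigned bytes.
bool alphanumLess(const std::string& a, const std::string& b)
{
  std::size_t i = 0, j = 0;

  while (i < a.size() && j < b.size()) {
    if (isDigit(a[i]) && isDigit(b[j])) {
      // Skip leading zeros so that "01" and "1" compare as equal numbers.
      while (i < a.size() && a[i] == '0') { ++i; }
      while (j < b.size() && b[j] == '0') { ++j; }

      // Find the end of each run of significant digits.
      std::size_t endA = i, endB = j;
      while (endA < a.size() && isDigit(a[endA])) { ++endA; }
      while (endB < b.size() && isDigit(b[endB])) { ++endB; }

      // More significant digits means a larger number.
      const std::size_t lenA = endA - i, lenB = endB - j;
      if (lenA != lenB) {
        return lenA < lenB;
      }

      // Same length: the digit strings order the same way as the numbers.
      const int cmp = a.compare(i, lenA, b, j, lenB);
      if (cmp != 0) {
        return cmp < 0;
      }

      i = endA;
      j = endB;
    } else {
      const unsigned char ca = static_cast<unsigned char>(a[i]);
      const unsigned char cb = static_cast<unsigned char>(b[j]);
      if (ca != cb) {
        return ca < cb;
      }
      ++i;
      ++j;
    }
  }

  // Whichever string has characters left over sorts last.
  return (a.size() - i) < (b.size() - j);
}

// Returns a copy of `names` sorted with alphanumLess.
std::vector<std::string> sortAlphanum(std::vector<std::string> names)
{
  std::sort(names.begin(), names.end(), alphanumLess);
  return names;
}
```

With hypothetical names task-1, task-2 and task-10, plain lexicographic sorting yields task-1, task-10, task-2, whereas `sortAlphanum` yields task-1, task-2, task-10.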





[jira] [Updated] (MESOS-7726) MasterTest.IgnoreOldAgentReregistration test is flaky

2017-10-18 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-7726:
---
Attachment: IgnoreOldAgentReregistration-badrun.txt
IgnoreOldAgentReregistration-goodrun.txt

> MasterTest.IgnoreOldAgentReregistration test is flaky
> -
>
> Key: MESOS-7726
> URL: https://issues.apache.org/jira/browse/MESOS-7726
> Project: Mesos
> Issue Type: Bug
> Reporter: Vinod Kone
> Assignee: Neil Conway
> Labels: flaky-test, mesosphere-oncall
> Attachments: IgnoreOldAgentReregistration-badrun.txt, 
> IgnoreOldAgentReregistration-goodrun.txt
>
>
> Observed this on ASF CI.
> {code}
> [ RUN  ] MasterTest.IgnoreOldAgentReregistration
> I0627 05:23:06.031154  4917 cluster.cpp:162] Creating default 'local' 
> authorizer
> I0627 05:23:06.033433  4945 master.cpp:438] Master 
> a8778782-0da1-49a5-9cb8-9f6d11701733 (c43debbe7e32) started on 
> 172.17.0.4:41747
> I0627 05:23:06.033457  4945 master.cpp:440] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authenticators="crammd5" 
> --authorizers="local" --credentials="/tmp/2BARnF/credentials" 
> --filter_gpu_resources="true" --framework_sorter="drf" --help="false" 
> --hostname_lookup="true" --http_authenticators="basic" 
> --http_framework_authenticators="basic" --initialize_driver_logging="true" 
> --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
> --max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
> --max_completed_tasks_per_framework="1000" 
> --max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/mesos/mesos-1.4.0/_inst/share/mesos/webui" 
> --work_dir="/tmp/2BARnF/master" --zk_session_timeout="10secs"
> I0627 05:23:06.033771  4945 master.cpp:490] Master only allowing 
> authenticated frameworks to register
> I0627 05:23:06.033787  4945 master.cpp:504] Master only allowing 
> authenticated agents to register
> I0627 05:23:06.033798  4945 master.cpp:517] Master only allowing 
> authenticated HTTP frameworks to register
> I0627 05:23:06.033812  4945 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/2BARnF/credentials'
> I0627 05:23:06.034080  4945 master.cpp:562] Using default 'crammd5' 
> authenticator
> I0627 05:23:06.034221  4945 http.cpp:974] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> I0627 05:23:06.034409  4945 http.cpp:974] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> I0627 05:23:06.034569  4945 http.cpp:974] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> I0627 05:23:06.034688  4945 master.cpp:642] Authorization enabled
> I0627 05:23:06.034862  4938 whitelist_watcher.cpp:77] No whitelist given
> I0627 05:23:06.034868  4950 hierarchical.cpp:169] Initialized hierarchical 
> allocator process
> I0627 05:23:06.037211  4957 master.cpp:2161] Elected as the leading master!
> I0627 05:23:06.037236  4957 master.cpp:1700] Recovering from registrar
> I0627 05:23:06.037333  4938 registrar.cpp:345] Recovering registrar
> I0627 05:23:06.038146  4938 registrar.cpp:389] Successfully fetched the 
> registry (0B) in 768256ns
> I0627 05:23:06.038290  4938 registrar.cpp:493] Applied 1 operations in 
> 30798ns; attempting to update the registry
> I0627 05:23:06.038861  4938 registrar.cpp:550] Successfully updated the 
> registry in 510976ns
> I0627 05:23:06.038960  4938 registrar.cpp:422] Successfully recovered 
> registrar
> I0627 05:23:06.039364  4941 hierarchical.cpp:207] Skipping recovery of 
> hierarchical allocator: nothing to recover
> I0627 05:23:06.039594  4958 master.cpp:1799] Recovered 0 agents from the 
> registry (129B); allowing 10mins for agents to re-register
> I0627 05:23:06.043999  4917 containerizer.cpp:230] Using isolation: 
> posix/cpu,posix/mem,filesystem/posix,network/cni,environment_secret
> W0627 05:23:06.044456  4917 backend.cpp:76] Failed to create 'aufs' backend: 
> AufsBackend requires root privileges
> W0627 05:23:06.044548  4917 backend.cpp:76] Failed to create 'bind' backend: 
> BindBackend requires root privileges
> I0627 05:23:06.044580  

[jira] [Updated] (MESOS-8112) DefaultExecutorTest.ResourceLimitation is flaky

2017-10-18 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-8112:
---
Attachment: GetContainers-goodrun.txt
GetContainers-badrun.txt

> DefaultExecutorTest.ResourceLimitation is flaky
> ---
>
> Key: MESOS-8112
> URL: https://issues.apache.org/jira/browse/MESOS-8112
> Project: Mesos
> Issue Type: Bug
> Components: flaky, test
> Reporter: James Peach
> Attachments: GetContainers-badrun.txt, GetContainers-goodrun.txt

[jira] [Updated] (MESOS-7742) ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky

2017-10-18 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-7742:
---
Attachment: AgentAPITest.LaunchNestedContainerSession-badrun.txt

> ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky
> --
>
> Key: MESOS-7742
> URL: https://issues.apache.org/jira/browse/MESOS-7742
> Project: Mesos
> Issue Type: Bug
> Reporter: Vinod Kone
> Assignee: Gastón Kleiman
> Labels: flaky-test, mesosphere-oncall
> Attachments: AgentAPITest.LaunchNestedContainerSession-badrun.txt
>
>
> Observed this on ASF CI. 
> [~gkleiman] mind triaging this?
> {code}
> [ RUN  ] 
> ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession/0
> I0629 05:49:33.180673 25301 cluster.cpp:162] Creating default 'local' 
> authorizer
> I0629 05:49:33.182234 25306 master.cpp:436] Master 
> 90ea1640-bdf3-49ba-b78f-b2ba7ea30077 (296af9b598c3) started on 
> 172.17.0.3:45726
> I0629 05:49:33.182289 25306 master.cpp:438] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authenticators="crammd5" 
> --authorizers="local" --credentials="/tmp/a5h5J3/credentials" 
> --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --http_authenticators="basic" --http_framework_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/a5h5J3/master" 
> --zk_session_timeout="10secs"
> I0629 05:49:33.182561 25306 master.cpp:488] Master only allowing 
> authenticated frameworks to register
> I0629 05:49:33.182610 25306 master.cpp:502] Master only allowing 
> authenticated agents to register
> I0629 05:49:33.182636 25306 master.cpp:515] Master only allowing 
> authenticated HTTP frameworks to register
> I0629 05:49:33.182656 25306 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/a5h5J3/credentials'
> I0629 05:49:33.182915 25306 master.cpp:560] Using default 'crammd5' 
> authenticator
> I0629 05:49:33.183009 25306 http.cpp:975] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> I0629 05:49:33.183151 25306 http.cpp:975] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> I0629 05:49:33.183218 25306 http.cpp:975] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> I0629 05:49:33.183284 25306 master.cpp:640] Authorization enabled
> I0629 05:49:33.183462 25309 hierarchical.cpp:158] Initialized hierarchical 
> allocator process
> I0629 05:49:33.183504 25309 whitelist_watcher.cpp:77] No whitelist given
> I0629 05:49:33.184311 25308 master.cpp:2161] Elected as the leading master!
> I0629 05:49:33.184341 25308 master.cpp:1700] Recovering from registrar
> I0629 05:49:33.184404 25308 registrar.cpp:345] Recovering registrar
> I0629 05:49:33.184622 25308 registrar.cpp:389] Successfully fetched the 
> registry (0B) in 183040ns
> I0629 05:49:33.184687 25308 registrar.cpp:493] Applied 1 operations in 
> 6441ns; attempting to update the registry
> I0629 05:49:33.184885 25304 registrar.cpp:550] Successfully updated the 
> registry in 147200ns
> I0629 05:49:33.184993 25304 registrar.cpp:422] Successfully recovered 
> registrar
> I0629 05:49:33.185148 25308 master.cpp:1799] Recovered 0 agents from the 
> registry (129B); allowing 10mins for agents to re-register
> I0629 05:49:33.185161 25302 hierarchical.cpp:185] Skipping recovery of 
> hierarchical allocator: nothing to recover
> I0629 05:49:33.186769 25301 containerizer.cpp:221] Using isolation: 
> posix/cpu,posix/mem,filesystem/posix,network/cni
> W0629 05:49:33.187232 25301 backend.cpp:76] Failed to create 'aufs' backend: 
> AufsBackend requires root privileges
> W0629 05:49:33.187363 25301 backend.cpp:76] Failed to create 'bind' backend: 
> BindBackend requires root privileges
> I0629 05:49:33.187396 25301 

[jira] [Commented] (MESOS-8112) DefaultExecutorTest.ResourceLimitation is flaky

2017-10-18 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209913#comment-16209913
 ] 

James Peach commented on MESOS-8112:


{{ContentType/AgentAPITest.GetContainers/1}} might also be fallout from the 
same changes in MESOS-7963

> DefaultExecutorTest.ResourceLimitation is flaky
> ---
>
> Key: MESOS-8112
> URL: https://issues.apache.org/jira/browse/MESOS-8112
> Project: Mesos
> Issue Type: Bug
> Components: flaky, test
> Reporter: James Peach
> authenticator
> I1017 21:37:55.183661  3529 http.cpp:1045] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> I1017 21:37:55.183862  3529 http.cpp:1045] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> I1017 21:37:55.184082  3529 http.cpp:1045] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> I1017 21:37:55.184257  3529 master.cpp:631] Authorization enabled
> I1017 21:37:55.184450  3536 hierarchical.cpp:171] Initialized hierarchical 
> allocator process
> I1017 21:37:55.184551  3536 whitelist_watcher.cpp:77] No whitelist given
> I1017 21:37:55.187489  3536 master.cpp:2198] Elected as the leading master!
> I1017 21:37:55.187516  3536 master.cpp:1687] Recovering from registrar
> I1017 21:37:55.187728  3536 registrar.cpp:347] Recovering registrar
> I1017 21:37:55.188508  3536 registrar.cpp:391] Successfully fetched the 
> registry (0B) in 745984ns
> I1017 21:37:55.188616  3536 registrar.cpp:495] Applied 1 operations in 
> 37290ns; attempting to update the registry
> I1017 21:37:55.189162  3536 registrar.cpp:552] Successfully updated the 
> registry in 491008ns
> I1017 21:37:55.189285  3536 registrar.cpp:424] Successfully recovered 
> registrar
> I1017 21:37:55.190011  3531 hierarchical.cpp:209] Skipping recovery of 
> hierarchical allocator: nothing to recover
> I1017 21:37:55.190115  3534 master.cpp:1791] Recovered 0 agents from the 
> registry (129B); allowing 10mins for agents to re-register
> W1017 21:37:55.195062  3528 process.cpp:3194] Attempted to spawn already 
> running process files@172.17.0.2:33744
> I1017 21:37:55.195956  3528 containerizer.cpp:292] Using isolation { 
> environment_secret, network/cni, filesystem/posix, disk/du }
> W1017 21:37:55.196488  3528 backend.cpp:76] Failed to create 'aufs' backend: 
> AufsBackend requires root privileges
> W1017 21:37:55.196630  3528 backend.cpp:76] Failed to create 'bind' backend: 

[jira] [Created] (MESOS-8112) DefaultExecutorTest.ResourceLimitation is flaky

2017-10-18 Thread James Peach (JIRA)
James Peach created MESOS-8112:
--

             Summary: DefaultExecutorTest.ResourceLimitation is flaky
                 Key: MESOS-8112
                 URL: https://issues.apache.org/jira/browse/MESOS-8112
             Project: Mesos
          Issue Type: Bug
          Components: flaky, test
            Reporter: James Peach


As seen in CI builds, the 
{{MesosContainerizer/DefaultExecutorTest.ResourceLimitation/0}} test can be 
flaky

{noformat}[ RUN  ] 
MesosContainerizer/DefaultExecutorTest.ResourceLimitation/0
I1017 21:37:55.179539  3528 cluster.cpp:162] Creating default 'local' authorizer
I1017 21:37:55.182804  3529 master.cpp:445] Master 
0a7cd77c-8bc0-4fdc-b6c5-918b7ffc392e (42cd332f4072) started on 172.17.0.2:33744
I1017 21:37:55.182847  3529 master.cpp:447] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="HierarchicalDRF" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authenticators="crammd5" 
--authorizers="local" --credentials="/tmp/1FtpuJ/credentials" 
--filter_gpu_resources="true" --framework_sorter="drf" --help="false" 
--hostname_lookup="true" --http_authenticators="basic" 
--http_framework_authenticators="basic" --initialize_driver_logging="true" 
--log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
--max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
--max_completed_tasks_per_framework="1000" 
--max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--root_submissions="true" --user_sorter="drf" --version="false" 
--webui_dir="/mesos/mesos-1.5.0/_inst/share/mesos/webui" 
--work_dir="/tmp/1FtpuJ/master" --zk_session_timeout="10secs"
I1017 21:37:55.183141  3529 master.cpp:496] Master only allowing authenticated 
frameworks to register
I1017 21:37:55.183153  3529 master.cpp:502] Master only allowing authenticated 
agents to register
I1017 21:37:55.183161  3529 master.cpp:508] Master only allowing authenticated 
HTTP frameworks to register
I1017 21:37:55.183167  3529 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/1FtpuJ/credentials'
I1017 21:37:55.183472  3529 master.cpp:552] Using default 'crammd5' 
authenticator
I1017 21:37:55.183661  3529 http.cpp:1045] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I1017 21:37:55.183862  3529 http.cpp:1045] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I1017 21:37:55.184082  3529 http.cpp:1045] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I1017 21:37:55.184257  3529 master.cpp:631] Authorization enabled
I1017 21:37:55.184450  3536 hierarchical.cpp:171] Initialized hierarchical 
allocator process
I1017 21:37:55.184551  3536 whitelist_watcher.cpp:77] No whitelist given
I1017 21:37:55.187489  3536 master.cpp:2198] Elected as the leading master!
I1017 21:37:55.187516  3536 master.cpp:1687] Recovering from registrar
I1017 21:37:55.187728  3536 registrar.cpp:347] Recovering registrar
I1017 21:37:55.188508  3536 registrar.cpp:391] Successfully fetched the 
registry (0B) in 745984ns
I1017 21:37:55.188616  3536 registrar.cpp:495] Applied 1 operations in 37290ns; 
attempting to update the registry
I1017 21:37:55.189162  3536 registrar.cpp:552] Successfully updated the 
registry in 491008ns
I1017 21:37:55.189285  3536 registrar.cpp:424] Successfully recovered registrar
I1017 21:37:55.190011  3531 hierarchical.cpp:209] Skipping recovery of 
hierarchical allocator: nothing to recover
I1017 21:37:55.190115  3534 master.cpp:1791] Recovered 0 agents from the 
registry (129B); allowing 10mins for agents to re-register
W1017 21:37:55.195062  3528 process.cpp:3194] Attempted to spawn already 
running process files@172.17.0.2:33744
I1017 21:37:55.195956  3528 containerizer.cpp:292] Using isolation { 
environment_secret, network/cni, filesystem/posix, disk/du }
W1017 21:37:55.196488  3528 backend.cpp:76] Failed to create 'aufs' backend: 
AufsBackend requires root privileges
W1017 21:37:55.196630  3528 backend.cpp:76] Failed to create 'bind' backend: 
BindBackend requires root privileges
I1017 21:37:55.196662  3528 provisioner.cpp:255] Using default backend 'copy'
I1017 21:37:55.198724  3528 cluster.cpp:448] Creating default 'local' authorizer
I1017 21:37:55.200865  3535 slave.cpp:254] Mesos agent started on 
(724)@172.17.0.2:33744
I1017 21:37:55.200907  3535 slave.cpp:255] Flags at startup: --acls="" 
--appc_simple_discovery_uri_prefix="http://; 

[jira] [Commented] (MESOS-7941) Send TASK_STARTING status from built-in executors

2017-10-18 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209853#comment-16209853
 ] 

Alexander Rukletsov commented on MESOS-7941:


Reverting {{f43710eabb1c0956b368e9f855b26bebcf8cbc7a}} and 
{{1e1e409b3906d1a6189d5dfd47b21df7680244f6}} due to failing tests.

> Send TASK_STARTING status from built-in executors
> -
>
>                 Key: MESOS-7941
>                 URL: https://issues.apache.org/jira/browse/MESOS-7941
>             Project: Mesos
>          Issue Type: Improvement
>          Components: executor
>            Reporter: Benno Evers
>            Assignee: Benno Evers
>              Labels: executor, executors
>             Fix For: 1.5.0
>
>
> All executors have the option to send out a TASK_STARTING status update to 
> signal to the scheduler that they received the command to launch the task.
> It would be good if our built-in executors would do this, for reasons laid 
> out in 
> https://mail-archives.apache.org/mod_mbox/mesos-dev/201708.mbox/%3CCA%2B9TLTzkEVM0CKvY%2B%3D0%3DwjrN6hYFAt0401Y7b8tysDWx1WZzdw%40mail.gmail.com%3E
> This will also fix MESOS-6790.
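
As a rough illustration of the ordering requested above, the following sketch models an executor that reports {{TASK_STARTING}} as soon as it receives the launch command and {{TASK_RUNNING}} only once the task is actually up. The names {{TaskState}}, {{SketchExecutor}}, {{launch}} and {{started}} are hypothetical and are not the Mesos executor API.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical subset of task states, for illustration only.
enum class TaskState { TASK_STARTING, TASK_RUNNING };

// Hypothetical stand-in for a built-in executor; not the Mesos API.
class SketchExecutor {
public:
  // Called when the launch command arrives: acknowledge it right away.
  void launch(const std::string& taskId) {
    assert(!taskId.empty());
    taskId_ = taskId;
    updates_.push_back(TaskState::TASK_STARTING);
  }

  // Called once the task process is actually running.
  void started() {
    assert(!updates_.empty());  // launch() must have been called first.
    updates_.push_back(TaskState::TASK_RUNNING);
  }

  // Status updates in the order they would be sent to the scheduler.
  const std::vector<TaskState>& updates() const { return updates_; }

private:
  std::string taskId_;
  std::vector<TaskState> updates_;
};
```

With this ordering, a scheduler that has seen {{TASK_STARTING}} knows the launch command reached the executor, even if {{TASK_RUNNING}} has not arrived yet.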



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8111) Mesos sees task as running, but cannot kill it because the agent is offline

2017-10-18 Thread Cosmin Lehene (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cosmin Lehene updated MESOS-8111:
-
Description: 
After scaling down a cluster, the master is reporting a task as running 
although the slave has been long gone.
At the same time it reports it can't kill it because the agent is offline
{noformat}
I1018 16:55:22.00  6976 master.cpp:4913] Processing KILL call for task 
'spark.7b59a77b-b353-11e7-addd-b29ecbf071e1' of framework 
4d2a982a-0e62-4471-88e8-8df9cc0ae437-0001 (marathon) at 
scheduler-45eafb76-4510-482e-9bcc-06e3ad97c276@172.16.0.7:15101
W1018 16:55:22.00  6976 master.cpp:5000] Cannot kill task 
spark.7b59a77b-b353-11e7-addd-b29ecbf071e1 of framework 
4d2a982a-0e62-4471-88e8-8df9cc0ae437-0001 (marathon) at 
scheduler-45eafb76-4510-482e-9bcc-06e3ad97c276@172.16.0.7:15101 because the 
agent 4d2a982a-0e62-4471-88e8-8df9cc0ae437-S129 at slave(1)@10.0.0.81:5051 
(10.0.0.81) is disconnected. Kill will be retried if the agent re-registers
{noformat}

Clearly, if the agent is offline, the task is not running either. I am also not 
sure that waiting indefinitely for an agent to recover is a good strategy.

  was:
After scaling down a cluster, the master is reporting a task as running 
although the slave has been long gone.
At the same time it reports it can't kill it because the agent is offline
{noformat}
I1018 16:55:22.00  6976 master.cpp:4913] Processing KILL call for task 
'spark.7b59a77b-b353-11e7-addd-b29ecbf071e1' of framework 
4d2a982a-0e62-4471-88e8-8df9cc0ae437-0001 (marathon) at 
scheduler-45eafb76-4510-482e-9bcc-06e3ad97c276@172.16.0.7:15101
W1018 16:55:22.00  6976 master.cpp:5000] Cannot kill task 
spark.7b59a77b-b353-11e7-addd-b29ecbf071e1 of framework 
4d2a982a-0e62-4471-88e8-8df9cc0ae437-0001 (marathon) at 
scheduler-45eafb76-4510-482e-9bcc-06e3ad97c276@172.16.0.7:15101 because the 
agent 4d2a982a-0e62-4471-88e8-8df9cc0ae437-S129 at slave(1)@10.0.0.81:5051 
(10.0.0.81) is disconnected. Kill will be retried if the agent re-registers
{noformat}


> Mesos sees task as running, but cannot kill it because the agent is offline
> ---
>
>                 Key: MESOS-8111
>                 URL: https://issues.apache.org/jira/browse/MESOS-8111
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.2.3
>         Environment: DC/OS 1.9.4
>            Reporter: Cosmin Lehene
>
> After scaling down a cluster, the master is reporting a task as running 
> although the slave has been long gone.
> At the same time it reports it can't kill it because the agent is offline
> {noformat}
> I1018 16:55:22.00  6976 master.cpp:4913] Processing KILL call for task 
> 'spark.7b59a77b-b353-11e7-addd-b29ecbf071e1' of framework 
> 4d2a982a-0e62-4471-88e8-8df9cc0ae437-0001 (marathon) at 
> scheduler-45eafb76-4510-482e-9bcc-06e3ad97c276@172.16.0.7:15101
> W1018 16:55:22.00  6976 master.cpp:5000] Cannot kill task 
> spark.7b59a77b-b353-11e7-addd-b29ecbf071e1 of framework 
> 4d2a982a-0e62-4471-88e8-8df9cc0ae437-0001 (marathon) at 
> scheduler-45eafb76-4510-482e-9bcc-06e3ad97c276@172.16.0.7:15101 because the 
> agent 4d2a982a-0e62-4471-88e8-8df9cc0ae437-S129 at slave(1)@10.0.0.81:5051 
> (10.0.0.81) is disconnected. Kill will be retried if the agent re-registers
> {noformat}
> Clearly, if the agent is offline, the task is not running either. I am also not 
> sure that waiting indefinitely for an agent to recover is a good strategy.
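
The behaviour visible in the log above can be modelled with a small sketch: a kill for a task on a disconnected agent is not forwarded but kept until the agent re-registers. This is a simplified, hypothetical model ({{SketchMaster}} and all of its methods are invented for illustration) and not the Mesos master implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical model of the behaviour in the log above; not Mesos code.
class SketchMaster {
public:
  void addAgent(const std::string& agentId) { connected_[agentId] = true; }

  void disconnect(const std::string& agentId) {
    assert(connected_.count(agentId) == 1);
    connected_[agentId] = false;
  }

  // Returns true if the kill was forwarded, false if it was deferred.
  bool kill(const std::string& agentId, const std::string& taskId) {
    assert(connected_.count(agentId) == 1);
    if (!connected_[agentId]) {
      pendingKills_[agentId].insert(taskId);  // Retried on re-registration.
      return false;
    }
    sentKills_.push_back(taskId);
    return true;
  }

  // On re-registration, every deferred kill for the agent is forwarded.
  void reregister(const std::string& agentId) {
    assert(connected_.count(agentId) == 1);
    connected_[agentId] = true;
    for (const std::string& taskId : pendingKills_[agentId]) {
      sentKills_.push_back(taskId);
    }
    pendingKills_.erase(agentId);
  }

  const std::vector<std::string>& sentKills() const { return sentKills_; }

  std::size_t pendingKillCount(const std::string& agentId) const {
    auto it = pendingKills_.find(agentId);
    return it == pendingKills_.end() ? 0 : it->second.size();
  }

private:
  std::map<std::string, bool> connected_;
  std::map<std::string, std::set<std::string>> pendingKills_;
  std::vector<std::string> sentKills_;
};
```

In this model, a kill for an agent that never calls {{reregister}} stays pending forever, which is the situation the report questions.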





[jira] [Created] (MESOS-8111) Mesos sees task as running, but cannot kill it because the agent is offline

2017-10-18 Thread Cosmin Lehene (JIRA)
Cosmin Lehene created MESOS-8111:


             Summary: Mesos sees task as running, but cannot kill it because the agent is offline
                 Key: MESOS-8111
                 URL: https://issues.apache.org/jira/browse/MESOS-8111
             Project: Mesos
          Issue Type: Bug
          Components: master
    Affects Versions: 1.2.3
         Environment: DC/OS 1.9.4
            Reporter: Cosmin Lehene


After scaling down a cluster, the master is reporting a task as running 
although the slave has been long gone.
At the same time it reports it can't kill it because the agent is offline
{noformat}
I1018 16:55:22.00  6976 master.cpp:4913] Processing KILL call for task 
'spark.7b59a77b-b353-11e7-addd-b29ecbf071e1' of framework 
4d2a982a-0e62-4471-88e8-8df9cc0ae437-0001 (marathon) at 
scheduler-45eafb76-4510-482e-9bcc-06e3ad97c276@172.16.0.7:15101
W1018 16:55:22.00  6976 master.cpp:5000] Cannot kill task 
spark.7b59a77b-b353-11e7-addd-b29ecbf071e1 of framework 
4d2a982a-0e62-4471-88e8-8df9cc0ae437-0001 (marathon) at 
scheduler-45eafb76-4510-482e-9bcc-06e3ad97c276@172.16.0.7:15101 because the 
agent 4d2a982a-0e62-4471-88e8-8df9cc0ae437-S129 at slave(1)@10.0.0.81:5051 
(10.0.0.81) is disconnected. Kill will be retried if the agent re-registers
{noformat}





[jira] [Commented] (MESOS-7506) Multiple tests leave orphan containers.

2017-10-18 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209541#comment-16209541
 ] 

Andrei Budnik commented on MESOS-7506:
--

All failing tests log the same kind of error message:
{{E0922 00:38:40.509032 31034 slave.cpp:5398] Termination of executor '1' of 
framework 83bd1613-70d9-4c3e-b490-4aa60dd26e22- failed: Failed to kill all 
processes in the container: Timed out after 1mins}}

The container termination future is triggered by 
[MesosContainerizerProcess::___destroy|https://github.com/apache/mesos/blob/b361801f2c78043459199dab3e0defe9a0b4c1aa/src/slave/containerizer/mesos/containerizer.cpp#L2361].
 The agent subscribes to this future by calling 
[containerizer->wait()|https://github.com/apache/mesos/blob/b361801f2c78043459199dab3e0defe9a0b4c1aa/src/slave/slave.cpp#L5280].
 Once the future is triggered, {{Slave::executorTerminated}} is called, 
which sends a {{TASK_FAILED}} status update.

A typical test (e.g. {{SlaveTest.ShutdownUnregisteredExecutor}}) waits for:
{code}
  // Ensure that the slave times out and kills the executor.
  Future<Nothing> destroyExecutor =
    FUTURE_DISPATCH(_, &MesosContainerizerProcess::destroy);
{code}

After that, the test waits for the {{TASK_FAILED}} status update. The test body 
therefore completes successfully and the slave's destructor is called, [which 
fails|https://github.com/apache/mesos/blob/b361801f2c78043459199dab3e0defe9a0b4c1aa/src/tests/cluster.cpp#L580],
 because {{MesosContainerizerProcess::___destroy}} doesn't erase the container 
from the hashmap.
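
The sequence above can be reduced to a small model. This is only a sketch under the assumption that the failed-kill path reports the termination without removing the container's entry; {{SketchContainerizer}} and its methods are hypothetical names, not the Mesos containerizer.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical model of the sequence described above; not Mesos code.
class SketchContainerizer {
public:
  void launch(const std::string& id) { containers_.insert(id); }

  // Models the final destroy step. `killSucceeded` is false when killing
  // the container's processes timed out. Returns true if the container
  // was removed from the map.
  bool destroy(const std::string& id, bool killSucceeded) {
    assert(containers_.count(id) == 1);
    if (!killSucceeded) {
      // Failure path as described above: the termination is reported
      // as failed, but the container stays in the map.
      return false;
    }
    containers_.erase(id);
    return true;
  }

  // The check the test cluster performs on teardown.
  bool empty() const { return containers_.empty(); }

private:
  std::set<std::string> containers_;
};
```

After a failed {{destroy}}, {{empty()}} still returns false, which corresponds to the {{containers->empty()}} failure reported on teardown.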

> Multiple tests leave orphan containers.
> ---
>
>                 Key: MESOS-7506
>                 URL: https://issues.apache.org/jira/browse/MESOS-7506
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>         Environment: Ubuntu 16.04
> Fedora 23
> other Linux distros
>            Reporter: Alexander Rukletsov
>            Assignee: Andrei Budnik
>              Labels: containerizer, flaky-test, mesosphere
>
> I've observed a number of flaky tests that leave orphan containers upon 
> cleanup. A typical log looks like this:
> {noformat}
> ../../src/tests/cluster.cpp:580: Failure
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 }
> {noformat}





[jira] [Created] (MESOS-8110) Mesos Maintenance UI not rendering End time correctly

2017-10-18 Thread Vishnu Mohan (JIRA)
Vishnu Mohan created MESOS-8110:
---

             Summary: Mesos Maintenance UI not rendering End time correctly
                 Key: MESOS-8110
                 URL: https://issues.apache.org/jira/browse/MESOS-8110
             Project: Mesos
          Issue Type: Bug
          Components: webui
    Affects Versions: 1.4.0
            Reporter: Vishnu Mohan


When a maintenance window is initially POST'ed, the {{Begin}} time (e.g., 
{{2017-10-18T10:54:45-0400}}) and the {{End}} time (e.g., 
{{2017-10-18T11:54:45-0400}}) are both rendered as {{just now}}, even though 
they are an hour apart. The human-friendly (relative) {{Begin}} time updates 
afterwards, but the {{End}} time never does.

These scripts may be used to reproduce the issue:
https://github.com/vishnu2kmohan/dcos-toolbox/blob/master/mesos/maintain-agents.sh
https://github.com/vishnu2kmohan/dcos-toolbox/blob/master/mesos/agent-maintenance-schedule-example.json
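
To make the expected rendering concrete, here is a minimal sketch of a relative-time formatter. It is not the WebUI's code (the function {{relativeTime}}, its thresholds and its output strings are assumptions made for illustration); it only shows that two timestamps an hour apart should not both come out as {{just now}}.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical helper; the WebUI's real formatting code is not shown here.
// Both arguments are seconds since the Unix epoch.
std::string relativeTime(long long now, long long timestamp) {
  assert(now >= 0 && timestamp >= 0);
  const long long delta = timestamp - now;
  const long long magnitude = std::llabs(delta);

  // Anything less than a minute away counts as "just now".
  if (magnitude < 60) {
    return "just now";
  }

  long long count;
  std::string unit;
  if (magnitude < 3600) {
    count = magnitude / 60;
    unit = "minute";
  } else if (magnitude < 86400) {
    count = magnitude / 3600;
    unit = "hour";
  } else {
    count = magnitude / 86400;
    unit = "day";
  }

  const std::string amount =
      std::to_string(count) + " " + unit + (count == 1 ? "" : "s");

  // Future timestamps read "in N units", past ones "N units ago".
  return delta > 0 ? "in " + amount : amount + " ago";
}
```

Under these assumptions, a window that begins now and ends an hour later would show {{just now}} for {{Begin}} and {{in 1 hour}} for {{End}}.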





[jira] [Commented] (MESOS-8057) Apply security patches to AngularJS and JQuery in the Mesos UI

2017-10-18 Thread Alexander Rojas (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209438#comment-16209438
 ] 

Alexander Rojas commented on MESOS-8057:


The changes have landed in the Mesos master branch. They will be part of the 
next Mesos bump in DC/OS.

{noformat}
commit b0a660bb1811c0144cba781482b1ce4573e685b3
Author: Alexander Rojas 
AuthorDate: Wed Oct 18 12:11:05 2017 +0200
Commit: Alexander Rojas 
CommitDate: Wed Oct 18 16:33:19 2017 +0200

Upgrades jQuery used by Mesos WebUI to version 3.2.1.

The version of jQuery distributed with Mesos (1.7.1) was found to have
security issues which have been addressed in latter versions.

Review: https://reviews.apache.org/r/63101
{noformat}
{noformat}
commit 1b5a4e77e55f5c8665526294626a66905569a284 (HEAD -> master, 
upstream/master)
Author: Alexander Rojas 
AuthorDate: Wed Oct 18 12:11:40 2017 +0200
Commit: Alexander Rojas 
CommitDate: Wed Oct 18 16:33:37 2017 +0200

Upgrades AngularJS used by Mesos WebUI to version 1.2.32.

The version of AngularJS distributed with Mesos (1.2.3) was found to
have security issues which have been addressed in latter versions.

Review: https://reviews.apache.org/r/63102
{noformat}

> Apply security patches to AngularJS and JQuery in the Mesos UI
> --
>
>                 Key: MESOS-8057
>                 URL: https://issues.apache.org/jira/browse/MESOS-8057
>             Project: Mesos
>          Issue Type: Bug
>          Components: webui
>    Affects Versions: 1.4.0
>            Reporter: Alexander Rojas
>            Assignee: Alexander Rojas
>            Priority: Blocker
>              Labels: mesosphere
>             Fix For: 1.5.0
>
>
> Running a security tool returns:
> {noformat}
> Evidence 
> Vulnerable libraries were found: 
> https://admin.kpn-dsh.com/mesos/static/js/angular-1.2.3.min.js 
> https://admin.kpn-dsh.com/mesos/static/js/angular-route-1.2.3.min.js  
> https://admin.kpn-dsh.com/mesos/static/js/jquery-1.7.1.min.js 
> More information about the issues can be found at: - 
> https://github.com/angular/angular.js/blob/master/CHANGELOG.md - 
> http://bugs.jquery.com/ticket/11290 
> {noformat}





[jira] [Issue Comment Deleted] (MESOS-8057) Apply security patches to AngularJS and JQuery in the Mesos UI

2017-10-18 Thread Alexander Rojas (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rojas updated MESOS-8057:
---
Comment: was deleted

(was: Changes landed in Mesos master branch. They will be part of next Mesos 
bump on DC/OS

{noformat}
commit b0a660bb1811c0144cba781482b1ce4573e685b3
Author: Alexander Rojas 
AuthorDate: Wed Oct 18 12:11:05 2017 +0200
Commit: Alexander Rojas 
CommitDate: Wed Oct 18 16:33:19 2017 +0200

Upgrades jQuery used by Mesos WebUI to version 3.2.1.

The version of jQuery distributed with Mesos (1.7.1) was found to have
security issues which have been addressed in latter versions.

Review: https://reviews.apache.org/r/63101
{noformat}
{noformat}
commit 1b5a4e77e55f5c8665526294626a66905569a284 (HEAD -> master, 
upstream/master)
Author: Alexander Rojas 
AuthorDate: Wed Oct 18 12:11:40 2017 +0200
Commit: Alexander Rojas 
CommitDate: Wed Oct 18 16:33:37 2017 +0200

Upgrades AngularJS used by Mesos WebUI to version 1.2.32.

The version of AngularJS distributed with Mesos (1.2.3) was found to
have security issues which have been addressed in latter versions.

Review: https://reviews.apache.org/r/63102
{noformat})

> Apply security patches to AngularJS and JQuery in the Mesos UI
> --
>
>                 Key: MESOS-8057
>                 URL: https://issues.apache.org/jira/browse/MESOS-8057
>             Project: Mesos
>          Issue Type: Bug
>          Components: webui
>    Affects Versions: 1.4.0
>            Reporter: Alexander Rojas
>            Assignee: Alexander Rojas
>            Priority: Blocker
>              Labels: mesosphere
>             Fix For: 1.5.0
>
>
> Running a security tool returns:
> {noformat}
> Evidence 
> Vulnerable libraries were found: 
> https://admin.kpn-dsh.com/mesos/static/js/angular-1.2.3.min.js 
> https://admin.kpn-dsh.com/mesos/static/js/angular-route-1.2.3.min.js  
> https://admin.kpn-dsh.com/mesos/static/js/jquery-1.7.1.min.js 
> More information about the issues can be found at: - 
> https://github.com/angular/angular.js/blob/master/CHANGELOG.md - 
> http://bugs.jquery.com/ticket/11290 
> {noformat}





[jira] [Comment Edited] (MESOS-7594) Implement 'apply' for resource provider related operations

2017-10-18 Thread Jan Schlicht (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16150341#comment-16150341
 ] 

Jan Schlicht edited comment on MESOS-7594 at 10/18/17 2:35 PM:
---

https://reviews.apache.org/r/63104/
https://reviews.apache.org/r/61810/
https://reviews.apache.org/r/61946/
https://reviews.apache.org/r/63105/
https://reviews.apache.org/r/61947/


was (Author: nfnt):
https://reviews.apache.org/r/61810/
https://reviews.apache.org/r/61946/
https://reviews.apache.org/r/61947/

> Implement 'apply' for resource provider related operations
> --
>
>                 Key: MESOS-7594
>                 URL: https://issues.apache.org/jira/browse/MESOS-7594
>             Project: Mesos
>          Issue Type: Task
>          Components: master
>            Reporter: Jan Schlicht
>            Assignee: Jan Schlicht
>              Labels: mesosphere, storage
>
> Resource providers provide new offer operations ({{CREATE_BLOCK}}, 
> {{DESTROY_BLOCK}}, {{CREATE_VOLUME}}, {{DESTROY_VOLUME}}). These operations 
> can be applied by frameworks when they accept an offer. Handling of these 
> operations has to be added to the master's {{accept}} call. I.e. the 
> corresponding resource provider needs to be extracted from the offer's resources 
> and a {{resource_provider::Event::OPERATION}} has to be sent to the resource 
> provider. The resource provider will answer with a 
> {{resource_provider::Call::Update}} which needs to be handled as well.
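
The dispatch step described above might look roughly like the following sketch. All types here ({{OperationType}}, {{Resource}}, {{Operation}}, {{Outbox}}) and the function {{dispatch}} are simplified, hypothetical stand-ins for the real protobuf messages, and the sketch assumes that all resources of one operation belong to the same resource provider. Handling of the provider's {{resource_provider::Call::Update}} answer is not modelled.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified types; not the Mesos protobuf messages.
enum class OperationType {
  CREATE_BLOCK, DESTROY_BLOCK, CREATE_VOLUME, DESTROY_VOLUME
};

struct Resource {
  std::string name;
  std::string providerId;  // Empty if not backed by a resource provider.
};

struct Operation {
  OperationType type;
  std::vector<Resource> resources;
};

// Records which operations would be sent to which resource provider.
using Outbox = std::map<std::string, std::vector<OperationType>>;

// Sketch of the dispatch step in the master's accept handling.
Outbox dispatch(const std::vector<Operation>& operations) {
  Outbox outbox;
  for (const Operation& operation : operations) {
    assert(!operation.resources.empty());
    // Assumption: all resources of one operation share one provider.
    const std::string& providerId = operation.resources.front().providerId;
    if (providerId.empty()) {
      continue;  // Not a resource provider operation in this sketch.
    }
    outbox[providerId].push_back(operation.type);
  }
  return outbox;
}
```

Each entry in the returned map stands for the events that would be sent to one resource provider, in the order the framework listed the operations.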





[jira] [Created] (MESOS-8109) Broken markup in `Attaching containers to CNI networks`

2017-10-18 Thread Wilfried Goesgens (JIRA)
Wilfried Goesgens created MESOS-8109:


             Summary: Broken markup in `Attaching containers to CNI networks`
                 Key: MESOS-8109
                 URL: https://issues.apache.org/jira/browse/MESOS-8109
             Project: Mesos
          Issue Type: Documentation
            Reporter: Wilfried Goesgens
            Priority: Trivial


On http://mesos.apache.org/documentation/latest/cni/ under 'Attaching 
containers to CNI networks' the **NOTE** section is broken - it probably 
shouldn't be a verbatim box.





[jira] [Assigned] (MESOS-8057) Apply security patches to AngularJS and JQuery in the Mesos UI

2017-10-18 Thread Alexander Rojas (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rojas reassigned MESOS-8057:
--

Assignee: Alexander Rojas

> Apply security patches to AngularJS and JQuery in the Mesos UI
> --
>
>                 Key: MESOS-8057
>                 URL: https://issues.apache.org/jira/browse/MESOS-8057
>             Project: Mesos
>          Issue Type: Bug
>          Components: webui
>    Affects Versions: 1.4.0
>            Reporter: Alexander Rojas
>            Assignee: Alexander Rojas
>            Priority: Blocker
>              Labels: mesosphere
>
> Running a security tool returns:
> {noformat}
> Evidence 
> Vulnerable libraries were found: 
> https://admin.kpn-dsh.com/mesos/static/js/angular-1.2.3.min.js 
> https://admin.kpn-dsh.com/mesos/static/js/angular-route-1.2.3.min.js  
> https://admin.kpn-dsh.com/mesos/static/js/jquery-1.7.1.min.js 
> More information about the issues can be found at: - 
> https://github.com/angular/angular.js/blob/master/CHANGELOG.md - 
> http://bugs.jquery.com/ticket/11290 
> {noformat}


