[jira] [Updated] (MESOS-8460) `Slave::detachFile` can segfault because it could use invalid Framework*.

2018-01-18 Thread Gilbert Song (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song updated MESOS-8460:

Summary: `Slave::detachFile` can segfault because it could use invalid 
Framework*.  (was: `Slave::detachFile` can segfault because it could use 
invalid Framework*)

> `Slave::detachFile` can segfault because it could use invalid Framework*.
> -
>
> Key: MESOS-8460
> URL: https://issues.apache.org/jira/browse/MESOS-8460
> Project: Mesos
>  Issue Type: Bug
>Reporter: Vinod Kone
>Assignee: Chun-Hung Hsiao
>Priority: Major
> Fix For: 1.5.0
>
>
> Observed this SEGV in an internal cluster
> {code}
> {noformat}
> 2018-01-18 19:00:54: *** SIGSEGV (@0x0) received by PID 26410 (TID 
> 0x7fe9e4f65700) from PID 0; stack trace: ***
> 2018-01-18 19:00:54: @ 0x7fe9ea2c85e0 (unknown)
> 2018-01-18 19:00:54: @ 0x7fe9ec4cc855 mesos::internal::Files::detach()
> 2018-01-18 19:00:54: @ 0x7fe9ec8cb5b0 
> mesos::internal::slave::Slave::detachFile()
> 2018-01-18 19:00:54: @ 0x7fe9ec8ccadb 
> _ZZN5mesos8internal5slave9Framework15recoverExecutorERKNS1_5state13ExecutorStateEbRK7hashsetINS_6TaskIDESt4hashIS8_ESt8equal_toIS8_EEENKUlRKN7process6FutureI7NothingEEE0_clESL_.isra.2000
> 2018-01-18 19:00:54: @ 0x7fe9ec37e4e4 
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchIvEclINS0_IFvvEvRKNS1_4UPIDEOT_EUlOSE_S3_E_JSE_St12_PlaceholderILi1EEclEOS3_
> 2018-01-18 19:00:54: @ 0x7fe9ed455ea1 process::ProcessBase::consume()
> 2018-01-18 19:00:54: @ 0x7fe9ed464bcc process::ProcessManager::resume()
> 2018-01-18 19:00:54: @ 0x7fe9ed46a136 
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> 2018-01-18 19:00:54: @ 0x7fe9ea7a0230 (unknown)
> 2018-01-18 19:00:54: @ 0x7fe9ea2c0e25 start_thread
> 2018-01-18 19:00:54: @ 0x7fe9e9fee34d __clone
> 2018-01-18 19:00:54: dcos-mesos-slave.service: main process exited, 
> code=killed, status=11/SEGV{noformat}
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-8460) `Slave::detachFile` can segfault because it could use invalid Framework*

2018-01-18 Thread Gilbert Song (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song updated MESOS-8460:

Issue Type: Bug  (was: Improvement)

Updated the type from `Improvement` to `Bug`.

> `Slave::detachFile` can segfault because it could use invalid Framework*
> 
>
> Key: MESOS-8460
> URL: https://issues.apache.org/jira/browse/MESOS-8460
> Project: Mesos
>  Issue Type: Bug
>Reporter: Vinod Kone
>Assignee: Chun-Hung Hsiao
>Priority: Major
> Fix For: 1.5.0
>
>
> Observed this SEGV in an internal cluster
> {code}
> {noformat}
> 2018-01-18 19:00:54: *** SIGSEGV (@0x0) received by PID 26410 (TID 
> 0x7fe9e4f65700) from PID 0; stack trace: ***
> 2018-01-18 19:00:54: @ 0x7fe9ea2c85e0 (unknown)
> 2018-01-18 19:00:54: @ 0x7fe9ec4cc855 mesos::internal::Files::detach()
> 2018-01-18 19:00:54: @ 0x7fe9ec8cb5b0 
> mesos::internal::slave::Slave::detachFile()
> 2018-01-18 19:00:54: @ 0x7fe9ec8ccadb 
> _ZZN5mesos8internal5slave9Framework15recoverExecutorERKNS1_5state13ExecutorStateEbRK7hashsetINS_6TaskIDESt4hashIS8_ESt8equal_toIS8_EEENKUlRKN7process6FutureI7NothingEEE0_clESL_.isra.2000
> 2018-01-18 19:00:54: @ 0x7fe9ec37e4e4 
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchIvEclINS0_IFvvEvRKNS1_4UPIDEOT_EUlOSE_S3_E_JSE_St12_PlaceholderILi1EEclEOS3_
> 2018-01-18 19:00:54: @ 0x7fe9ed455ea1 process::ProcessBase::consume()
> 2018-01-18 19:00:54: @ 0x7fe9ed464bcc process::ProcessManager::resume()
> 2018-01-18 19:00:54: @ 0x7fe9ed46a136 
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> 2018-01-18 19:00:54: @ 0x7fe9ea7a0230 (unknown)
> 2018-01-18 19:00:54: @ 0x7fe9ea2c0e25 start_thread
> 2018-01-18 19:00:54: @ 0x7fe9e9fee34d __clone
> 2018-01-18 19:00:54: dcos-mesos-slave.service: main process exited, 
> code=killed, status=11/SEGV{noformat}
> {code}





[jira] [Assigned] (MESOS-8462) Unit test for `Slave::detachFile` on removed frameworks.

2018-01-18 Thread Chun-Hung Hsiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun-Hung Hsiao reassigned MESOS-8462:
--

Assignee: Qian Zhang  (was: Chun-Hung Hsiao)

> Unit test for `Slave::detachFile` on removed frameworks.
> 
>
> Key: MESOS-8462
> URL: https://issues.apache.org/jira/browse/MESOS-8462
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Chun-Hung Hsiao
>Assignee: Qian Zhang
>Priority: Major
>  Labels: mesosphere
>
> We should add a unit test for MESOS-8460.





[jira] [Commented] (MESOS-8460) `Slave::detachFile` can segfault because it could use invalid Framework*

2018-01-18 Thread Qian Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16331688#comment-16331688
 ] 

Qian Zhang commented on MESOS-8460:
---

commit 32b85a2b06f676b68a16deaa8359ae64a1e8ead9
Author: Chun-Hung Hsiao 
Date: Fri Jan 19 11:08:49 2018 +0800

Fixed detaching task volume directories of destroyed frameworks.
 
 Review: https://reviews.apache.org/r/65231/
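
For readers unfamiliar with this class of bug: the stack trace in this issue shows a deferred callback created in `Framework::recoverExecutor` calling `Slave::detachFile`, and the issue title points at an invalid `Framework*`. The following is only a minimal sketch of how such a callback can outlive the object it refers to, and of one way to avoid that by capturing copies of the needed values. It is not the Mesos code or the actual fix; `Agent`, `Framework`, `Deferred`, `scheduleDetach` and the sandbox path are hypothetical names made up for the illustration.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; these are not Mesos types.
struct Agent {
  std::vector<std::string> detached;  // Paths detached so far.
  void detachFile(const std::string& path) { detached.push_back(path); }
};

// A trivial queue of callbacks that run at some later point.
using Deferred = std::vector<std::function<void()>>;

struct Framework {
  Agent* agent;  // The agent is assumed to outlive every framework.

  explicit Framework(Agent* a) : agent(a) { assert(a != nullptr); }

  // Unsafe variant, shown only as a comment: `[=]` copies `this`, so the
  // body would read `this->agent` after the Framework may be destroyed.
  //
  //   queue.push_back([=]() { agent->detachFile(path); });

  // Safer variant: copy the values the callback needs into locals and
  // capture those, so the callback never touches the Framework again.
  void scheduleDetach(Deferred& queue, const std::string& path) {
    Agent* agentCopy = agent;
    queue.push_back([agentCopy, path]() { agentCopy->detachFile(path); });
  }
};

// Schedules a callback, destroys the framework, then runs the callback.
std::vector<std::string> detachAfterFrameworkRemoval() {
  Agent agent;
  Deferred queue;

  std::unique_ptr<Framework> framework(new Framework(&agent));
  framework->scheduleDetach(queue, "/sandbox/executor-1");
  framework.reset();  // The framework is removed before the callback runs.

  for (const std::function<void()>& callback : queue) {
    callback();
  }
  return agent.detached;
}
```

In this sketch the callback still runs after the framework is gone, but it only touches the copied `Agent*` and the copied path, so nothing dangling is dereferenced.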

> `Slave::detachFile` can segfault because it could use invalid Framework*
> 
>
> Key: MESOS-8460
> URL: https://issues.apache.org/jira/browse/MESOS-8460
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Vinod Kone
>Assignee: Chun-Hung Hsiao
>Priority: Major
>
> Observed this SEGV in an internal cluster
> {code}
> {noformat}
> 2018-01-18 19:00:54: *** SIGSEGV (@0x0) received by PID 26410 (TID 
> 0x7fe9e4f65700) from PID 0; stack trace: ***
> 2018-01-18 19:00:54: @ 0x7fe9ea2c85e0 (unknown)
> 2018-01-18 19:00:54: @ 0x7fe9ec4cc855 mesos::internal::Files::detach()
> 2018-01-18 19:00:54: @ 0x7fe9ec8cb5b0 
> mesos::internal::slave::Slave::detachFile()
> 2018-01-18 19:00:54: @ 0x7fe9ec8ccadb 
> _ZZN5mesos8internal5slave9Framework15recoverExecutorERKNS1_5state13ExecutorStateEbRK7hashsetINS_6TaskIDESt4hashIS8_ESt8equal_toIS8_EEENKUlRKN7process6FutureI7NothingEEE0_clESL_.isra.2000
> 2018-01-18 19:00:54: @ 0x7fe9ec37e4e4 
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchIvEclINS0_IFvvEvRKNS1_4UPIDEOT_EUlOSE_S3_E_JSE_St12_PlaceholderILi1EEclEOS3_
> 2018-01-18 19:00:54: @ 0x7fe9ed455ea1 process::ProcessBase::consume()
> 2018-01-18 19:00:54: @ 0x7fe9ed464bcc process::ProcessManager::resume()
> 2018-01-18 19:00:54: @ 0x7fe9ed46a136 
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> 2018-01-18 19:00:54: @ 0x7fe9ea7a0230 (unknown)
> 2018-01-18 19:00:54: @ 0x7fe9ea2c0e25 start_thread
> 2018-01-18 19:00:54: @ 0x7fe9e9fee34d __clone
> 2018-01-18 19:00:54: dcos-mesos-slave.service: main process exited, 
> code=killed, status=11/SEGV{noformat}
> {code}





[jira] [Updated] (MESOS-5116) Add support for accounting only mode in XFS isolator.

2018-01-18 Thread Gilbert Song (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song updated MESOS-5116:

Summary: Add support for accounting only mode in XFS isolator.  (was: Add 
support for accounting only mode in XFS isolator)

> Add support for accounting only mode in XFS isolator.
> -
>
> Key: MESOS-5116
> URL: https://issues.apache.org/jira/browse/MESOS-5116
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Yan Xu
>Assignee: James Peach
>Priority: Major
> Fix For: 1.4.0
>
>
> The initial implementation of the XFS isolator always enforces the disk quota 
> limit. In contrast, the Posix disk isolator can optionally monitor disk usage 
> without enforcing the limit, which eases the transition into disk quota 
> enforcement mode.
> The Mesos agent provides a {{flags.enforce_container_disk_quota}} flag that 
> turns on enforcement when the Posix disk isolator is enabled. For XFS, we 
> should either support the flag as well or change it so that it is specific 
> to the Posix disk isolator.
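
To make the distinction between accounting and enforcement concrete, here is a minimal sketch of the decision a disk isolator could make. It is an assumption-laden illustration, not the XFS or Posix isolator code: `DiskAction`, `decideDiskAction` and `demoDiskActions` are hypothetical names, and `enforceQuota` merely plays the role of `flags.enforce_container_disk_quota`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical outcome of a disk usage check; not a Mesos type.
enum class DiskAction { WithinQuota, ReportOnly, Enforce };

// Usage is always accounted for. The limit is only acted upon when
// `enforceQuota` is set; otherwise the overage is just reported.
DiskAction decideDiskAction(
    std::uint64_t usedBytes, std::uint64_t quotaBytes, bool enforceQuota) {
  if (usedBytes <= quotaBytes) {
    return DiskAction::WithinQuota;
  }
  return enforceQuota ? DiskAction::Enforce : DiskAction::ReportOnly;
}

// Example outcomes for a 100-byte quota.
void demoDiskActions() {
  assert(decideDiskAction(50, 100, true) == DiskAction::WithinQuota);
  assert(decideDiskAction(150, 100, false) == DiskAction::ReportOnly);
  assert(decideDiskAction(150, 100, true) == DiskAction::Enforce);
}
```

With a shared flag like this, both isolators would account for usage in the same way and differ only in how the `Enforce` outcome is carried out.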





[jira] [Updated] (MESOS-5116) Add support for accounting only mode in XFS isolator

2018-01-18 Thread Gilbert Song (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song updated MESOS-5116:

Summary: Add support for accounting only mode in XFS isolator  (was: 
Investigate supporting accounting only mode in XFS isolator)

> Add support for accounting only mode in XFS isolator
> 
>
> Key: MESOS-5116
> URL: https://issues.apache.org/jira/browse/MESOS-5116
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Yan Xu
>Assignee: James Peach
>Priority: Major
> Fix For: 1.4.0
>
>
> The initial implementation of the XFS isolator always enforces the disk quota 
> limit. In contrast, the Posix disk isolator can optionally monitor disk usage 
> without enforcing the limit, which eases the transition into disk quota 
> enforcement mode.
> The Mesos agent provides a {{flags.enforce_container_disk_quota}} flag that 
> turns on enforcement when the Posix disk isolator is enabled. For XFS, we 
> should either support the flag as well or change it so that it is specific 
> to the Posix disk isolator.





[jira] [Created] (MESOS-8462) Unit test for `Slave::detachFile` on removed frameworks.

2018-01-18 Thread Chun-Hung Hsiao (JIRA)
Chun-Hung Hsiao created MESOS-8462:
--

 Summary: Unit test for `Slave::detachFile` on removed frameworks.
 Key: MESOS-8462
 URL: https://issues.apache.org/jira/browse/MESOS-8462
 Project: Mesos
  Issue Type: Improvement
Reporter: Chun-Hung Hsiao
Assignee: Chun-Hung Hsiao


We should add a unit test for MESOS-8460.





[jira] [Updated] (MESOS-8460) `Slave::detachFile` can segfault because it could use invalid Framework*

2018-01-18 Thread Gilbert Song (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song updated MESOS-8460:

Target Version/s: 1.5.0

> `Slave::detachFile` can segfault because it could use invalid Framework*
> 
>
> Key: MESOS-8460
> URL: https://issues.apache.org/jira/browse/MESOS-8460
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Vinod Kone
>Assignee: Chun-Hung Hsiao
>Priority: Major
>
> Observed this SEGV in an internal cluster
> {code}
> {noformat}
> 2018-01-18 19:00:54: *** SIGSEGV (@0x0) received by PID 26410 (TID 
> 0x7fe9e4f65700) from PID 0; stack trace: ***
> 2018-01-18 19:00:54: @ 0x7fe9ea2c85e0 (unknown)
> 2018-01-18 19:00:54: @ 0x7fe9ec4cc855 mesos::internal::Files::detach()
> 2018-01-18 19:00:54: @ 0x7fe9ec8cb5b0 
> mesos::internal::slave::Slave::detachFile()
> 2018-01-18 19:00:54: @ 0x7fe9ec8ccadb 
> _ZZN5mesos8internal5slave9Framework15recoverExecutorERKNS1_5state13ExecutorStateEbRK7hashsetINS_6TaskIDESt4hashIS8_ESt8equal_toIS8_EEENKUlRKN7process6FutureI7NothingEEE0_clESL_.isra.2000
> 2018-01-18 19:00:54: @ 0x7fe9ec37e4e4 
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchIvEclINS0_IFvvEvRKNS1_4UPIDEOT_EUlOSE_S3_E_JSE_St12_PlaceholderILi1EEclEOS3_
> 2018-01-18 19:00:54: @ 0x7fe9ed455ea1 process::ProcessBase::consume()
> 2018-01-18 19:00:54: @ 0x7fe9ed464bcc process::ProcessManager::resume()
> 2018-01-18 19:00:54: @ 0x7fe9ed46a136 
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> 2018-01-18 19:00:54: @ 0x7fe9ea7a0230 (unknown)
> 2018-01-18 19:00:54: @ 0x7fe9ea2c0e25 start_thread
> 2018-01-18 19:00:54: @ 0x7fe9e9fee34d __clone
> 2018-01-18 19:00:54: dcos-mesos-slave.service: main process exited, 
> code=killed, status=11/SEGV{noformat}
> {code}





[jira] [Commented] (MESOS-6780) ContentType/AgentAPIStreamingTest.AttachContainerInput test fails reliably

2018-01-18 Thread Gilbert Song (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16331599#comment-16331599
 ] 

Gilbert Song commented on MESOS-6780:
-

Downgrading this issue to `Major` since, based on the comments above, it is a 
bug in the unit test.

> ContentType/AgentAPIStreamingTest.AttachContainerInput test fails reliably
> --
>
> Key: MESOS-6780
> URL: https://issues.apache.org/jira/browse/MESOS-6780
> Project: Mesos
>  Issue Type: Bug
> Environment: Mac OS 10.12, clang version 4.0.0 
> (http://llvm.org/git/clang 88800602c0baafb8739cb838c2fa3f5fb6cc6968) 
> (http://llvm.org/git/llvm 25801f0f22e178343ee1eadfb4c6cc058628280e), 
> libc++-513447dbb91dd555ea08297dbee6a1ceb6abdc46
>Reporter: Benjamin Bannier
>Assignee: Kevin Klues
>Priority: Major
>  Labels: mesosphere
> Attachments: attach_container_input_no_ssl.log
>
>
> The test {{ContentType/AgentAPIStreamingTest.AttachContainerInput}} (both 
> {{/0}} and {{/1}}) fails consistently for me in an SSL-enabled, optimized 
> build.
> {code}
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from ContentType/AgentAPIStreamingTest
> [ RUN  ] ContentType/AgentAPIStreamingTest.AttachContainerInput/0
> I1212 17:11:12.371175 3971208128 cluster.cpp:160] Creating default 'local' 
> authorizer
> I1212 17:11:12.393844 17362944 master.cpp:380] Master 
> c752777c-d947-4a86-b382-643463866472 (172.18.8.114) started on 
> 172.18.8.114:51059
> I1212 17:11:12.393899 17362944 master.cpp:382] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authenticators="crammd5" 
> --authorizers="local" 
> --credentials="/private/var/folders/6t/yp_xgc8d6k32rpp0bsbfqm9mgp/T/F46yYV/credentials"
>  --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --http_authenticators="basic" --http_framework_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/usr/local/share/mesos/webui" 
> --work_dir="/private/var/folders/6t/yp_xgc8d6k32rpp0bsbfqm9mgp/T/F46yYV/master"
>  --zk_session_timeout="10secs"
> I1212 17:11:12.394670 17362944 master.cpp:432] Master only allowing 
> authenticated frameworks to register
> I1212 17:11:12.394682 17362944 master.cpp:446] Master only allowing 
> authenticated agents to register
> I1212 17:11:12.394691 17362944 master.cpp:459] Master only allowing 
> authenticated HTTP frameworks to register
> I1212 17:11:12.394701 17362944 credentials.hpp:37] Loading credentials for 
> authentication from 
> '/private/var/folders/6t/yp_xgc8d6k32rpp0bsbfqm9mgp/T/F46yYV/credentials'
> I1212 17:11:12.394959 17362944 master.cpp:504] Using default 'crammd5' 
> authenticator
> I1212 17:11:12.394996 17362944 authenticator.cpp:519] Initializing server SASL
> I1212 17:11:12.411406 17362944 http.cpp:922] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> I1212 17:11:12.411571 17362944 http.cpp:922] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> I1212 17:11:12.411682 17362944 http.cpp:922] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> I1212 17:11:12.411775 17362944 master.cpp:584] Authorization enabled
> I1212 17:11:12.413318 16289792 master.cpp:2045] Elected as the leading master!
> I1212 17:11:12.413377 16289792 master.cpp:1568] Recovering from registrar
> I1212 17:11:12.417582 14143488 registrar.cpp:362] Successfully fetched the 
> registry (0B) in 4.131072ms
> I1212 17:11:12.417667 14143488 registrar.cpp:461] Applied 1 operations in 
> 27us; attempting to update the registry
> I1212 17:11:12.421799 14143488 registrar.cpp:506] Successfully updated the 
> registry in 4.10496ms
> I1212 17:11:12.421835 14143488 registrar.cpp:392] Successfully recovered 
> registrar
> I1212 17:11:12.421998 17362944 master.cpp:1684] Recovered 0 agents from the 
> registry (136B); allowing 10mins for agents to re-register
> 

[jira] [Updated] (MESOS-6780) ContentType/AgentAPIStreamingTest.AttachContainerInput test fails reliably

2018-01-18 Thread Gilbert Song (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song updated MESOS-6780:

Priority: Major  (was: Critical)

> ContentType/AgentAPIStreamingTest.AttachContainerInput test fails reliably
> --
>
> Key: MESOS-6780
> URL: https://issues.apache.org/jira/browse/MESOS-6780
> Project: Mesos
>  Issue Type: Bug
> Environment: Mac OS 10.12, clang version 4.0.0 
> (http://llvm.org/git/clang 88800602c0baafb8739cb838c2fa3f5fb6cc6968) 
> (http://llvm.org/git/llvm 25801f0f22e178343ee1eadfb4c6cc058628280e), 
> libc++-513447dbb91dd555ea08297dbee6a1ceb6abdc46
>Reporter: Benjamin Bannier
>Assignee: Kevin Klues
>Priority: Major
>  Labels: mesosphere
> Attachments: attach_container_input_no_ssl.log
>
>
> The test {{ContentType/AgentAPIStreamingTest.AttachContainerInput}} (both 
> {{/0}} and {{/1}}) fails consistently for me in an SSL-enabled, optimized 
> build.
> {code}
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from ContentType/AgentAPIStreamingTest
> [ RUN  ] ContentType/AgentAPIStreamingTest.AttachContainerInput/0
> I1212 17:11:12.371175 3971208128 cluster.cpp:160] Creating default 'local' 
> authorizer
> I1212 17:11:12.393844 17362944 master.cpp:380] Master 
> c752777c-d947-4a86-b382-643463866472 (172.18.8.114) started on 
> 172.18.8.114:51059
> I1212 17:11:12.393899 17362944 master.cpp:382] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authenticators="crammd5" 
> --authorizers="local" 
> --credentials="/private/var/folders/6t/yp_xgc8d6k32rpp0bsbfqm9mgp/T/F46yYV/credentials"
>  --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --http_authenticators="basic" --http_framework_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/usr/local/share/mesos/webui" 
> --work_dir="/private/var/folders/6t/yp_xgc8d6k32rpp0bsbfqm9mgp/T/F46yYV/master"
>  --zk_session_timeout="10secs"
> I1212 17:11:12.394670 17362944 master.cpp:432] Master only allowing 
> authenticated frameworks to register
> I1212 17:11:12.394682 17362944 master.cpp:446] Master only allowing 
> authenticated agents to register
> I1212 17:11:12.394691 17362944 master.cpp:459] Master only allowing 
> authenticated HTTP frameworks to register
> I1212 17:11:12.394701 17362944 credentials.hpp:37] Loading credentials for 
> authentication from 
> '/private/var/folders/6t/yp_xgc8d6k32rpp0bsbfqm9mgp/T/F46yYV/credentials'
> I1212 17:11:12.394959 17362944 master.cpp:504] Using default 'crammd5' 
> authenticator
> I1212 17:11:12.394996 17362944 authenticator.cpp:519] Initializing server SASL
> I1212 17:11:12.411406 17362944 http.cpp:922] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> I1212 17:11:12.411571 17362944 http.cpp:922] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> I1212 17:11:12.411682 17362944 http.cpp:922] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> I1212 17:11:12.411775 17362944 master.cpp:584] Authorization enabled
> I1212 17:11:12.413318 16289792 master.cpp:2045] Elected as the leading master!
> I1212 17:11:12.413377 16289792 master.cpp:1568] Recovering from registrar
> I1212 17:11:12.417582 14143488 registrar.cpp:362] Successfully fetched the 
> registry (0B) in 4.131072ms
> I1212 17:11:12.417667 14143488 registrar.cpp:461] Applied 1 operations in 
> 27us; attempting to update the registry
> I1212 17:11:12.421799 14143488 registrar.cpp:506] Successfully updated the 
> registry in 4.10496ms
> I1212 17:11:12.421835 14143488 registrar.cpp:392] Successfully recovered 
> registrar
> I1212 17:11:12.421998 17362944 master.cpp:1684] Recovered 0 agents from the 
> registry (136B); allowing 10mins for agents to re-register
> I1212 17:11:12.422780 3971208128 containerizer.cpp:220] Using isolation: 
> 

[jira] [Updated] (MESOS-6623) Re-enable tests impacted by request streaming support

2018-01-18 Thread Gilbert Song (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song updated MESOS-6623:

Priority: Major  (was: Critical)

> Re-enable tests impacted by request streaming support
> -
>
> Key: MESOS-6623
> URL: https://issues.apache.org/jira/browse/MESOS-6623
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API, test
>Reporter: Anand Mazumdar
>Priority: Major
>  Labels: mesosphere
>
> We added support for HTTP request streaming in libprocess as part of 
> MESOS-6466. However, this broke a few tests that relied on HTTP request 
> filtering since the handlers no longer have access to the body of the request 
> when {{visit()}} is invoked. We would need to revisit how we do HTTP request 
> filtering and then re-enable these tests.





[jira] [Updated] (MESOS-6623) Re-enable tests impacted by request streaming support

2018-01-18 Thread Gilbert Song (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song updated MESOS-6623:

Issue Type: Improvement  (was: Bug)

> Re-enable tests impacted by request streaming support
> -
>
> Key: MESOS-6623
> URL: https://issues.apache.org/jira/browse/MESOS-6623
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API, test
>Reporter: Anand Mazumdar
>Priority: Critical
>  Labels: mesosphere
>
> We added support for HTTP request streaming in libprocess as part of 
> MESOS-6466. However, this broke a few tests that relied on HTTP request 
> filtering since the handlers no longer have access to the body of the request 
> when {{visit()}} is invoked. We would need to revisit how we do HTTP request 
> filtering and then re-enable these tests.





[jira] [Updated] (MESOS-8413) Zookeeper configuration passwords are shown in clear text

2018-01-18 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-8413:
-
Sprint: Mesosphere Sprint 72

> Zookeeper configuration passwords are shown in clear text
> -
>
> Key: MESOS-8413
> URL: https://issues.apache.org/jira/browse/MESOS-8413
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.4.1
>Reporter: Alexander Rojas
>Assignee: Alexander Rojas
>Priority: Major
>  Labels: mesosphere, security
>
> No matter how one configures Mesos, either by passing the ZooKeeper flags on 
> the command line or by using a file, as follows:
> {noformat}
> ./bin/mesos-master.sh --work_dir=/tmp/$USER/mesos/master 
> --log_dir=/tmp/$USER/mesos/master/log 
> --zk=zk://${zk_username}:${zk_password}@${zk_addr}/mesos --quorum=1
> {noformat}
> {noformat}
> echo "zk://${zk_username}:${zk_password}@${zk_addr}/mesos" > 
> /tmp/${USER}/mesos/zk_config.txt
> ./bin/mesos-master.sh --work_dir=/tmp/$USER/mesos/master 
> --log_dir=/tmp/$USER/mesos/master/log --zk=/tmp/${USER}/mesos/zk_config.txt
> {noformat}
> both the logs and the response of the {{/flags}} endpoint show the resolved 
> contents of the flag, including the password, i.e.:
> {noformat}
> I0108 10:12:50.387522 28579 master.cpp:458] Flags at startup: 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="false" --authenticate_frameworks="false" 
> --authenticate_http_frameworks="false" --authenticate_http_readonly="false" 
> --authenticate_http_readwrite="false" --authenticators="crammd5" 
> --authorizers="local" --filter_gpu_resources="true" --framework_sorter="drf" 
> --help="false" --hostname_lookup="true" --http_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --log_dir="/tmp/user/mesos/master/log" --logbufsecs="0" 
> --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" 
> --quorum="1" --recovery_agent_removal_limit="100%" 
> --registry="replicated_log" --registry_fetch_timeout="1mins" 
> --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" 
> --registry_max_agent_count="102400" --registry_store_timeout="20secs" 
> --registry_strict="false" --require_agent_domain="false" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/home/user/mesos/build/../src/webui" 
> --work_dir="/tmp/user/mesos/master" 
> --zk="zk://user:passwd@127.0.0.1:2181/mesos" --zk_session_timeout="10secs"
> {noformat}
> {noformat}
> HTTP/1.1 200 OK
> Content-Encoding: gzip
> Content-Length: 591
> Content-Type: application/json
> Date: Mon, 08 Jan 2018 15:12:53 GMT
> {
> "flags": {
> "agent_ping_timeout": "15secs",
> "agent_reregister_timeout": "10mins",
> "allocation_interval": "1secs",
> "allocator": "HierarchicalDRF",
> "authenticate_agents": "false",
> "authenticate_frameworks": "false",
> "authenticate_http_frameworks": "false",
> "authenticate_http_readonly": "false",
> "authenticate_http_readwrite": "false",
> "authenticators": "crammd5",
> "authorizers": "local",
> "filter_gpu_resources": "true",
> "framework_sorter": "drf",
> "help": "false",
> "hostname_lookup": "true",
> "http_authenticators": "basic",
> "initialize_driver_logging": "true",
> "log_auto_initialize": "true",
> "log_dir": "/tmp/user/mesos/master/log",
> "logbufsecs": "0",
> "logging_level": "INFO",
> "max_agent_ping_timeouts": "5",
> "max_completed_frameworks": "50",
> "max_completed_tasks_per_framework": "1000",
> "max_unreachable_tasks_per_framework": "1000",
> "port": "5050",
> "quiet": "false",
> "quorum": "1",
> "recovery_agent_removal_limit": "100%",
> "registry": "replicated_log",
> "registry_fetch_timeout": "1mins",
> "registry_gc_interval": "15mins",
> "registry_max_agent_age": "2weeks",
> "registry_max_agent_count": "102400",
> "registry_store_timeout": "20secs",
> "registry_strict": "false",
> "require_agent_domain": "false",
> "root_submissions": "true",
> "user_sorter": "drf",
> "version": "false",
> "webui_dir": "/home/user/mesos/build/../src/webui",
> "work_dir": "/tmp/user/mesos/master",
> "zk": "zk://user:passwd@127.0.0.1:2181/mesos",
> "zk_session_timeout": "10secs"
> }
> }
> {noformat}
> Which leads to having no effective way to 
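
As an illustration of the exposure described above, the sketch below masks the password part of a `zk://user:password@host/path` string before it is logged or served. This is only an assumption about one possible mitigation, not the approach taken in Mesos; `redactCredentials` is a hypothetical helper and the `****` placeholder is arbitrary.

```cpp
#include <cassert>
#include <string>

// Masks the password in a `scheme://user:password@host/...` string.
// Strings without a scheme or without credentials are returned unchanged.
std::string redactCredentials(const std::string& url) {
  const std::string::size_type schemeEnd = url.find("://");
  if (schemeEnd == std::string::npos) {
    return url;  // E.g. a path to a file holding the ZooKeeper URL.
  }
  const std::string::size_type authStart = schemeEnd + 3;

  // Credentials, if present, end at the last '@' before the first '/'.
  const std::string::size_type pathStart = url.find('/', authStart);
  const std::string::size_type at = url.rfind('@', pathStart);
  if (at == std::string::npos || at < authStart) {
    return url;  // No credentials in the authority part.
  }

  const std::string::size_type colon = url.find(':', authStart);
  if (colon == std::string::npos || colon > at) {
    return url;  // A user name without a password.
  }

  assert(colon < at);
  return url.substr(0, colon + 1) + "****" + url.substr(at);
}
```

For example, `redactCredentials("zk://user:passwd@127.0.0.1:2181/mesos")` yields `zk://user:****@127.0.0.1:2181/mesos`, while a value such as `/tmp/user/mesos/zk_config.txt` is passed through untouched.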

[jira] [Assigned] (MESOS-8460) `Slave::detachFile` can segfault because it could use invalid Framework*

2018-01-18 Thread Chun-Hung Hsiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun-Hung Hsiao reassigned MESOS-8460:
--

Assignee: Chun-Hung Hsiao  (was: Vinod Kone)

> `Slave::detachFile` can segfault because it could use invalid Framework*
> 
>
> Key: MESOS-8460
> URL: https://issues.apache.org/jira/browse/MESOS-8460
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Vinod Kone
>Assignee: Chun-Hung Hsiao
>Priority: Major
>
> Observed this SEGV in an internal cluster
> {code}
> {noformat}
> 2018-01-18 19:00:54: *** SIGSEGV (@0x0) received by PID 26410 (TID 
> 0x7fe9e4f65700) from PID 0; stack trace: ***
> 2018-01-18 19:00:54: @ 0x7fe9ea2c85e0 (unknown)
> 2018-01-18 19:00:54: @ 0x7fe9ec4cc855 mesos::internal::Files::detach()
> 2018-01-18 19:00:54: @ 0x7fe9ec8cb5b0 
> mesos::internal::slave::Slave::detachFile()
> 2018-01-18 19:00:54: @ 0x7fe9ec8ccadb 
> _ZZN5mesos8internal5slave9Framework15recoverExecutorERKNS1_5state13ExecutorStateEbRK7hashsetINS_6TaskIDESt4hashIS8_ESt8equal_toIS8_EEENKUlRKN7process6FutureI7NothingEEE0_clESL_.isra.2000
> 2018-01-18 19:00:54: @ 0x7fe9ec37e4e4 
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchIvEclINS0_IFvvEvRKNS1_4UPIDEOT_EUlOSE_S3_E_JSE_St12_PlaceholderILi1EEclEOS3_
> 2018-01-18 19:00:54: @ 0x7fe9ed455ea1 process::ProcessBase::consume()
> 2018-01-18 19:00:54: @ 0x7fe9ed464bcc process::ProcessManager::resume()
> 2018-01-18 19:00:54: @ 0x7fe9ed46a136 
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> 2018-01-18 19:00:54: @ 0x7fe9ea7a0230 (unknown)
> 2018-01-18 19:00:54: @ 0x7fe9ea2c0e25 start_thread
> 2018-01-18 19:00:54: @ 0x7fe9e9fee34d __clone
> 2018-01-18 19:00:54: dcos-mesos-slave.service: main process exited, 
> code=killed, status=11/SEGV{noformat}
> {code}





[jira] [Updated] (MESOS-8461) SLRP should not assume a CSI plugin always has GetNodeID implemented.

2018-01-18 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-8461:
--
Description: According to the CSI 0.1.0 spec, GetNodeID is optional and will 
be implemented if the PUBLISH_UNPUBLISH_VOLUME capability is set.

> SLRP should not assume a CSI plugin always has GetNodeID implemented.
> 
>
> Key: MESOS-8461
> URL: https://issues.apache.org/jira/browse/MESOS-8461
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jie Yu
>Assignee: Chun-Hung Hsiao
>Priority: Major
>
> According to the CSI 0.1.0 spec, GetNodeID is optional and will be 
> implemented if the PUBLISH_UNPUBLISH_VOLUME capability is set.
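
A minimal sketch of the guard this implies: only ask the plugin for a node ID when it advertises the capability. The types below are hypothetical and do not mirror the CSI protobuf API or the SLRP code; `Plugin`, `Capability`, `makePlugin` and `nodeIdOrEmpty` are names invented for the example.

```cpp
#include <cassert>
#include <functional>
#include <set>
#include <string>

// Hypothetical model of a plugin; not the CSI or Mesos API.
enum class Capability { PUBLISH_UNPUBLISH_VOLUME, CREATE_DELETE_VOLUME };

struct Plugin {
  std::set<Capability> capabilities;
  std::function<std::string()> getNodeId;  // Empty if not implemented.
};

// Returns the node ID if the plugin advertises PUBLISH_UNPUBLISH_VOLUME,
// and an empty string otherwise, without calling into the plugin.
std::string nodeIdOrEmpty(const Plugin& plugin) {
  if (plugin.capabilities.count(Capability::PUBLISH_UNPUBLISH_VOLUME) == 0) {
    return std::string();
  }
  // A plugin that advertises the capability is expected to implement it.
  assert(plugin.getNodeId);
  return plugin.getNodeId();
}

// Builds an example plugin with or without the capability.
Plugin makePlugin(bool publishCapable) {
  Plugin plugin;
  if (publishCapable) {
    plugin.capabilities.insert(Capability::PUBLISH_UNPUBLISH_VOLUME);
    plugin.getNodeId = []() { return std::string("node-1"); };
  }
  return plugin;
}
```

Callers would then treat an empty node ID as "not applicable" instead of as an error from the plugin.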





[jira] [Updated] (MESOS-8461) SLRP should not assume a CSI plugin always has GetNodeID implemented.

2018-01-18 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-8461:
--
Environment: (was: According to 0.1.0 spec, GetNodeID is optional, and 
will be implemented if PUBLISH_UNPUBLISH_VOLUME capability is set.)

> SLRP should not assume a CSI plugin always has GetNodeID implemented.
> 
>
> Key: MESOS-8461
> URL: https://issues.apache.org/jira/browse/MESOS-8461
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jie Yu
>Assignee: Chun-Hung Hsiao
>Priority: Major
>






[jira] [Created] (MESOS-8461) SLRP should not assume a CSI plugin always has GetNodeID implemented.

2018-01-18 Thread Jie Yu (JIRA)
Jie Yu created MESOS-8461:
-

 Summary: SLRP should not assume a CSI plugin always has GetNodeID 
implemented.
 Key: MESOS-8461
 URL: https://issues.apache.org/jira/browse/MESOS-8461
 Project: Mesos
  Issue Type: Bug
 Environment: According to 0.1.0 spec, GetNodeID is optional, and will 
be implemented if PUBLISH_UNPUBLISH_VOLUME capability is set.
Reporter: Jie Yu
Assignee: Chun-Hung Hsiao








[jira] [Commented] (MESOS-8460) `Slave::detachFile` can segfault because it could use invalid Framework*

2018-01-18 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16331176#comment-16331176
 ] 

Vinod Kone commented on MESOS-8460:
---

Debugged the issue with [~mcypark].

The problem comes from the way we capture `this` implicitly via the `=` capture 
in this piece of code:

{code}

    slave->garbageCollect(path)
      .onAny(defer(slave->self(), [=](const Future<Nothing>& future) {
        slave->detachFile(path);

        if (executor->info.has_type() &&
            executor->info.type() == ExecutorInfo::DEFAULT) {
          foreachvalue (const Task* task, executor->launchedTasks) {
            executor->detachTaskVolumeDirectory(*task);
          }

          foreachvalue (const Task* task, executor->terminatedTasks) {
            executor->detachTaskVolumeDirectory(*task);
          }

          foreach (const shared_ptr<Task>& task, executor->completedTasks) {
            executor->detachTaskVolumeDirectory(*task);
          }
        }
      }));

{code}

 

Specifically, the `slave` pointer inside the onAny lambda actually refers to 
`this->slave`, which is a member variable of `Framework`: the `=` capture 
copies `this`, not the member itself. Since the Framework struct could be 
deleted before the onAny callback is executed, `this` could be dangling by 
then, and the `slave` pointer read through it could be invalid.

The proposed fix here is to explicitly capture copies of the member variables 
of `Framework` that the callback needs instead of using `=` in the lambda.

Note that there is more than one place in the code where we have to fix this.
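
To illustrate the difference, here is a minimal standalone sketch. `Agent` and 
`Owner` are hypothetical stand-ins for `Slave` and `Framework`, and 
`std::function` stands in for the deferred callback; none of this is the actual 
Mesos code or the actual patch.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Stand-in for the agent; it only records which paths were detached.
struct Agent {
  std::vector<std::string> detached;
  void detachFile(const std::string& path) { detached.push_back(path); }
};

// Stand-in for `Framework`: holds a non-owning pointer to the agent.
struct Owner {
  Agent* agent;

  // Risky: the lambda holds `this`, so `agent` below means `this->agent`
  // and is read only when the callback runs. In C++11/14/17 a plain `[=]`
  // captures `this` implicitly in exactly the same way.
  std::function<void()> riskyCallback(const std::string& path) {
    return [this, path]() { agent->detachFile(path); };
  }

  // Safer: copy the member into a local and capture that copy explicitly.
  // The callback never touches `this`, so it survives `Owner` being deleted
  // (the agent itself must still outlive the callback).
  std::function<void()> safeCallback(const std::string& path) {
    Agent* agent_ = agent;
    assert(agent_ != nullptr);
    return [agent_, path]() { agent_->detachFile(path); };
  }
};
```

Invoking the result of `riskyCallback` after the `Owner` has been deleted is 
undefined behaviour, while the result of `safeCallback` only depends on the 
agent still being alive.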

 

> `Slave::detachFile` can segfault because it could use invalid Framework*
> 
>
> Key: MESOS-8460
> URL: https://issues.apache.org/jira/browse/MESOS-8460
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Vinod Kone
>Assignee: Vinod Kone
>Priority: Major
>
> Observed this SEGV in an internal cluster
> {code}
> {noformat}
> 2018-01-18 19:00:54: *** SIGSEGV (@0x0) received by PID 26410 (TID 
> 0x7fe9e4f65700) from PID 0; stack trace: ***
> 2018-01-18 19:00:54: @ 0x7fe9ea2c85e0 (unknown)
> 2018-01-18 19:00:54: @ 0x7fe9ec4cc855 mesos::internal::Files::detach()
> 2018-01-18 19:00:54: @ 0x7fe9ec8cb5b0 
> mesos::internal::slave::Slave::detachFile()
> 2018-01-18 19:00:54: @ 0x7fe9ec8ccadb 
> _ZZN5mesos8internal5slave9Framework15recoverExecutorERKNS1_5state13ExecutorStateEbRK7hashsetINS_6TaskIDESt4hashIS8_ESt8equal_toIS8_EEENKUlRKN7process6FutureI7NothingEEE0_clESL_.isra.2000
> 2018-01-18 19:00:54: @ 0x7fe9ec37e4e4 
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchIvEclINS0_IFvvEvRKNS1_4UPIDEOT_EUlOSE_S3_E_JSE_St12_PlaceholderILi1EEclEOS3_
> 2018-01-18 19:00:54: @ 0x7fe9ed455ea1 process::ProcessBase::consume()
> 2018-01-18 19:00:54: @ 0x7fe9ed464bcc process::ProcessManager::resume()
> 2018-01-18 19:00:54: @ 0x7fe9ed46a136 
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> 2018-01-18 19:00:54: @ 0x7fe9ea7a0230 (unknown)
> 2018-01-18 19:00:54: @ 0x7fe9ea2c0e25 start_thread
> 2018-01-18 19:00:54: @ 0x7fe9e9fee34d __clone
> 2018-01-18 19:00:54: dcos-mesos-slave.service: main process exited, 
> code=killed, status=11/SEGV{noformat}
> {code}





[jira] [Created] (MESOS-8460) `Slave::detachFile` can segfault because it could use invalid Framework*

2018-01-18 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-8460:
-

 Summary: `Slave::detachFile` can segfault because it could use 
invalid Framework*
 Key: MESOS-8460
 URL: https://issues.apache.org/jira/browse/MESOS-8460
 Project: Mesos
  Issue Type: Improvement
Reporter: Vinod Kone
Assignee: Vinod Kone


Observed this SEGV in an internal cluster

{code}
{noformat}
2018-01-18 19:00:54: *** SIGSEGV (@0x0) received by PID 26410 (TID 
0x7fe9e4f65700) from PID 0; stack trace: ***
2018-01-18 19:00:54: @ 0x7fe9ea2c85e0 (unknown)
2018-01-18 19:00:54: @ 0x7fe9ec4cc855 mesos::internal::Files::detach()
2018-01-18 19:00:54: @ 0x7fe9ec8cb5b0 
mesos::internal::slave::Slave::detachFile()
2018-01-18 19:00:54: @ 0x7fe9ec8ccadb 
_ZZN5mesos8internal5slave9Framework15recoverExecutorERKNS1_5state13ExecutorStateEbRK7hashsetINS_6TaskIDESt4hashIS8_ESt8equal_toIS8_EEENKUlRKN7process6FutureI7NothingEEE0_clESL_.isra.2000
2018-01-18 19:00:54: @ 0x7fe9ec37e4e4 
_ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchIvEclINS0_IFvvEvRKNS1_4UPIDEOT_EUlOSE_S3_E_JSE_St12_PlaceholderILi1EEclEOS3_
2018-01-18 19:00:54: @ 0x7fe9ed455ea1 process::ProcessBase::consume()
2018-01-18 19:00:54: @ 0x7fe9ed464bcc process::ProcessManager::resume()
2018-01-18 19:00:54: @ 0x7fe9ed46a136 
_ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
2018-01-18 19:00:54: @ 0x7fe9ea7a0230 (unknown)
2018-01-18 19:00:54: @ 0x7fe9ea2c0e25 start_thread
2018-01-18 19:00:54: @ 0x7fe9e9fee34d __clone
2018-01-18 19:00:54: dcos-mesos-slave.service: main process exited, 
code=killed, status=11/SEGV{noformat}
{code}





[jira] [Created] (MESOS-8459) Executor could linger without ever receiving any tasks

2018-01-18 Thread Meng Zhu (JIRA)
Meng Zhu created MESOS-8459:
---

 Summary: Executor could linger without ever receiving any tasks
 Key: MESOS-8459
 URL: https://issues.apache.org/jira/browse/MESOS-8459
 Project: Mesos
  Issue Type: Bug
  Components: executor
Reporter: Meng Zhu


An executor's initial tasks may be killed even after it has been registered. In 
that case, the executor could linger forever.

In MESOS-8411, we have a short-term fix that checks an executor's completed and 
terminated task queues to see if it has ever received any tasks. If the check 
is false and there are no queued or launched tasks, the agent will shut down 
the executor.

However, this check is not bulletproof. The completedTasks queue is a 
circular_buffer (current size 200), which means earlier completed tasks that 
were possibly updated by the executor may be ejected and thus missed by this 
check. This would lead to false-positive shutdowns.
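
The failure mode described above can be sketched in isolation. This is a 
simplified model, not the agent code: `CompletedTask`, `BoundedHistory`, and 
the `updatedByExecutor` flag are hypothetical names, and a `std::deque` with a 
manual size limit stands in for the circular_buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>

// Simplified record of a finished task.
struct CompletedTask {
  std::string id;
  bool updatedByExecutor;  // True if the executor ever reported on the task.
};

// Bounded history that drops the oldest entry once it is full.
class BoundedHistory {
 public:
  explicit BoundedHistory(std::size_t capacity) : capacity_(capacity) {
    assert(capacity_ > 0);
  }

  void push(const CompletedTask& task) {
    if (items_.size() == capacity_) {
      items_.pop_front();  // The oldest completed task is ejected here.
    }
    items_.push_back(task);
  }

  // The inference under discussion: "has the executor ever received a task?"
  bool anyUpdatedByExecutor() const {
    for (const CompletedTask& task : items_) {
      if (task.updatedByExecutor) {
        return true;
      }
    }
    return false;
  }

 private:
  std::size_t capacity_;
  std::deque<CompletedTask> items_;
};
```

With a capacity of 2, pushing one task the executor reported on followed by two 
tasks it never saw ejects the first entry, and `anyUpdatedByExecutor()` then 
returns false even though the executor did receive a task.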

Per discussion with [~vinodkone] and [~bmahler], there are two long-term 
solutions.

The first one is to checkpoint additional executor state that indicates 
whether the executor has ever received any tasks (no more inference from task 
queue status).

The alternative is to add timeouts in the executor driver to shut down 
lingering executors automatically.

 





[jira] [Updated] (MESOS-8411) Killing a queued task can lead to the command executor never terminating.

2018-01-18 Thread Meng Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Meng Zhu updated MESOS-8411:

Affects Version/s: 1.3.0

> Killing a queued task can lead to the command executor never terminating.
> -
>
> Key: MESOS-8411
> URL: https://issues.apache.org/jira/browse/MESOS-8411
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.3.0
>Reporter: Benjamin Mahler
>Assignee: Meng Zhu
>Priority: Critical
>
> If a task is killed while the executor is re-registering, we will remove it 
> from queued tasks and shut down the executor if all of its initial tasks 
> could not be delivered. However, there is a case (within {{Slave::___run}}) 
> where we leave the executor running. The race is:
> # Command-executor task launched.
> # Command executor sends registration message. Agent tells containerizer to 
> update the resources before it sends the tasks to the executor.
> # Kill arrives, and we synchronously remove the task from queued tasks.
> # Containerizer finishes updating the resources, and in {{Slave::___run}} the 
> killed task is ignored.
> # Command executor stays running!
> Executors could have a timeout to handle this case, but it's not clear that 
> all executors will implement this correctly. It would be better to have a 
> defensive policy that will shut down an executor if all of its initial batch 
> of tasks were killed prior to delivery.
> In order to implement this, one approach discussed with [~vinodkone] is to 
> look at the running + terminated but unacked + completed tasks, and if empty, 
> shut the executor down in the {{Slave::___run}} path. This will require us to 
> check that the completed task cache size is set to at least 1, and this also 
> assumes that the completed tasks are not cleared based on time or during 
> agent recovery.





[jira] [Created] (MESOS-8458) Response from /help includes broken links.

2018-01-18 Thread Alexander Rukletsov (JIRA)
Alexander Rukletsov created MESOS-8458:
--

 Summary: Response from /help includes broken links.
 Key: MESOS-8458
 URL: https://issues.apache.org/jira/browse/MESOS-8458
 Project: Mesos
  Issue Type: Bug
  Components: libprocess
Affects Versions: 1.5.0
Reporter: Alexander Rukletsov
Assignee: Alexander Rukletsov


Links in the response from "http:///help" contain a duplicated 
"/help" prefix, e.g., "http:///help/help/version".





[jira] [Updated] (MESOS-8427) Clean up residual CSI endpoints for SLRP tests.

2018-01-18 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-8427:
-
Shepherd: Greg Mann  (was: Jie Yu)

> Clean up residual CSI endpoints for SLRP tests.
> ---
>
> Key: MESOS-8427
> URL: https://issues.apache.org/jira/browse/MESOS-8427
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>Priority: Major
>  Labels: mesosphere, storage
>
> Since the CSI endpoints are not in the sandbox directory of the unit tests, 
> they need to be explicitly cleaned up.





[jira] [Updated] (MESOS-8415) Add an SLRP test for agent reboot.

2018-01-18 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-8415:
-
Shepherd: Greg Mann  (was: Jie Yu)

> Add an SLRP test for agent reboot.
> ---
>
> Key: MESOS-8415
> URL: https://issues.apache.org/jira/browse/MESOS-8415
> Project: Mesos
>  Issue Type: Task
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>Priority: Major
>  Labels: mesosphere, storage
>
> We should add a test for the following scenario: when an agent is rebooted, 
> all previously published CSI volumes would become unmounted. So SLRP should 
> remount them when a task is going to use the volumes.





[jira] [Updated] (MESOS-8426) Speed up SLRP tests

2018-01-18 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-8426:
-
Shepherd: Greg Mann  (was: Jie Yu)

> Speed up SLRP tests
> ---
>
> Key: MESOS-8426
> URL: https://issues.apache.org/jira/browse/MESOS-8426
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>Priority: Major
>  Labels: mesosphere, storage
>
> Each of the current SLRP unit tests takes seconds to run. This can be 
> improved by reducing the allocation interval and declining offers with 
> filters.





[jira] [Updated] (MESOS-8409) Add an SLRP test for agent registered with a new ID.

2018-01-18 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-8409:
-
Shepherd: Greg Mann  (was: Jie Yu)

> Add an SLRP test for agent registered with a new ID.
> 
>
> Key: MESOS-8409
> URL: https://issues.apache.org/jira/browse/MESOS-8409
> Project: Mesos
>  Issue Type: Task
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>Priority: Major
>  Labels: mesosphere, storage
>
> When an agent is registered with a new ID, SLRP should be assigned with a 
> different ID, and all previously created volumes would become pre-existing 
> volumes without profile.





[jira] [Updated] (MESOS-8408) Add an SLRP test for CSI plugin restart.

2018-01-18 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-8408:
-
Shepherd: Greg Mann  (was: Jie Yu)

> Add an SLRP test for CSI plugin restart.
> 
>
> Key: MESOS-8408
> URL: https://issues.apache.org/jira/browse/MESOS-8408
> Project: Mesos
>  Issue Type: Task
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>Priority: Major
>  Labels: mesosphere, storage
>
> We will add a unit test that keeps killing the CSI plugin to verify that SLRP 
> can restart the plugin and work properly.





[jira] [Updated] (MESOS-8407) Add SLRP unit tests for profile updates and corner cases.

2018-01-18 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-8407:
-
Shepherd: Greg Mann  (was: Jie Yu)

> Add SLRP unit tests for profile updates and corner cases.
> -
>
> Key: MESOS-8407
> URL: https://issues.apache.org/jira/browse/MESOS-8407
> Project: Mesos
>  Issue Type: Task
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>Priority: Major
>  Labels: mesosphere, storage
>
> The following tests will be added:
> 1. RP updates state with no resources, then recovers from this state
> 2. Pre-existing CSI volume with zero size
> 3. RP updates state with no profile, followed by another state update that 
> contains resources associated with a profile





[jira] [Updated] (MESOS-8424) Test that operations are correctly reported following a master failover

2018-01-18 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-8424:
-
Labels: mesosphere  (was: )

> Test that operations are correctly reported following a master failover
> ---
>
> Key: MESOS-8424
> URL: https://issues.apache.org/jira/browse/MESOS-8424
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Jan Schlicht
>Assignee: Jan Schlicht
>Priority: Major
>  Labels: mesosphere
>
> As the master keeps track of operations running on a resource provider, it 
> needs to be updated on these operations when agents reregister after a master 
> failover. E.g., an operation that has finished during the failover should be 
> reported as finished by the master after the agent on which the resource 
> provider is running has reregistered.





[jira] [Updated] (MESOS-8354) Operation & CSI Testing Improvements

2018-01-18 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-8354:
-
Labels: mesosphere  (was: )

> Operation & CSI Testing Improvements
> 
>
> Key: MESOS-8354
> URL: https://issues.apache.org/jira/browse/MESOS-8354
> Project: Mesos
>  Issue Type: Epic
>Reporter: Greg Mann
>Priority: Major
>  Labels: mesosphere
>
> In order to harden recently-added CSI support (MESOS-7289) and the internal 
> pieces of operation feedback (MESOS-8054), additional tests must be added to 
> verify expected behavior in a variety of scenarios including the failover of 
> various components, network partitions, and rebooted hosts.
> This epic serves as a repository of tickets for specific tests along these 
> lines.
> For a preliminary list of current and desired tests, see [this 
> document|https://docs.google.com/document/d/1xJ-m37_D41lsddijUuLZsoXCk7k0JeNBItM6j2V-N68].





[jira] [Updated] (MESOS-5362) Add authentication to example frameworks

2018-01-18 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-5362:
--
Shepherd: Vinod Kone  (was: Kapil Arya)

> Add authentication to example frameworks
> 
>
> Key: MESOS-5362
> URL: https://issues.apache.org/jira/browse/MESOS-5362
> Project: Mesos
>  Issue Type: Improvement
>  Components: security
>Reporter: Greg Mann
>Assignee: Till Toenshoff
>Priority: Major
>  Labels: authentication, mesosphere, security
>
> Some example frameworks do not have the ability to authenticate with the 
> master. Adding authentication to the example frameworks that don't already 
> have it implemented would allow us to use these frameworks for testing in 
> authenticated/authorized scenarios.





[jira] [Updated] (MESOS-8449) Add missing fields to agent v1 operator API

2018-01-18 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-8449:
-
Sprint: Mesosphere Sprint 73  (was: Mesosphere Sprint 72)

> Add missing fields to agent v1 operator API
> ---
>
> Key: MESOS-8449
> URL: https://issues.apache.org/jira/browse/MESOS-8449
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Greg Mann
>Assignee: Greg Mann
>Priority: Major
>  Labels: mesosphere
>
> Some fields which are available via the agent {{/state}} endpoint are not 
> accessible via the v1 API.





[jira] [Updated] (MESOS-8240) Add an option to build the new CLI and run unit tests.

2018-01-18 Thread Armand Grillet (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Armand Grillet updated MESOS-8240:
--
Sprint: Mesosphere Sprint 70, Mesosphere Sprint 73  (was: Mesosphere Sprint 
70, Mesosphere Sprint 72)

> Add an option to build the new CLI and run unit tests.
> --
>
> Key: MESOS-8240
> URL: https://issues.apache.org/jira/browse/MESOS-8240
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Armand Grillet
>Assignee: Armand Grillet
>Priority: Major
>
> An update of the discarded https://reviews.apache.org/r/52543/





[jira] [Commented] (MESOS-7742) ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky

2018-01-18 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16330630#comment-16330630
 ] 

Andrei Budnik commented on MESOS-7742:
--

Steps to reproduce the second cause:
1. Add a {{::sleep(2);}} after [binding unix 
socket|https://github.com/apache/mesos/blob/634c8af2618c57a1405d20717fa909b399486f37/src/slave/containerizer/mesos/io/switchboard.cpp#L1056].
2. Recompile with `make && make check`.
3. Launch a test:
{code:}
GLOG_v=2 sudo GLOG_v=2 ./src/mesos-tests 
--gtest_filter=ContentType/AgentAPITest.LaunchNestedContainerSession/0 
--gtest_break_on_failure --gtest_repeat=1 --verbose
{code}


> ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky
> --
>
> Key: MESOS-7742
> URL: https://issues.apache.org/jira/browse/MESOS-7742
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.5.0
>Reporter: Vinod Kone
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: flaky-test, mesosphere-oncall
> Fix For: 1.6.0
>
> Attachments: AgentAPITest.LaunchNestedContainerSession-badrun.txt, 
> LaunchNestedContainerSessionDisconnected-badrun.txt
>
>
> Observed this on ASF CI and internal Mesosphere CI. Affected tests:
> {noformat}
> AgentAPIStreamingTest.AttachInputToNestedContainerSession
> AgentAPITest.LaunchNestedContainerSession
> AgentAPITest.AttachContainerInputAuthorization/0
> AgentAPITest.LaunchNestedContainerSessionWithTTY/0
> AgentAPITest.LaunchNestedContainerSessionDisconnected/1
> {noformat}
> This issue comes in at least three different flavours. Take 
> {{AgentAPIStreamingTest.AttachInputToNestedContainerSession}} as an example.
> h5. Flavour 1
> {noformat}
> ../../src/tests/api_tests.cpp:6473
> Value of: (response).get().status
>   Actual: "503 Service Unavailable"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: ""
> {noformat}
> h5. Flavour 2
> {noformat}
> ../../src/tests/api_tests.cpp:6473
> Value of: (response).get().status
>   Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: "Disconnected"
> {noformat}
> h5. Flavour 3
> {noformat}
> /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-ubuntu-16.04/mesos/src/tests/api_tests.cpp:6367
> Value of: (sessionResponse).get().status
>   Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: ""
> {noformat}





[jira] [Commented] (MESOS-8305) DefaultExecutorTest.ROOT_MultiTaskgroupSharePidNamespace is flaky.

2018-01-18 Thread Qian Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16330594#comment-16330594
 ] 

Qian Zhang commented on MESOS-8305:
---

From the log, we can see that the pid namespace we read for the second task is 
empty. I think the root cause may be that the test reads the file in the second 
task's sandbox after the file has been created but before the task has written 
its pid namespace into it (i.e., while the file is still empty), so the test 
reads nothing.
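
One common way to rule out this kind of race (not necessarily the fix adopted 
for this ticket) is to have the writer put the data into a temporary file and 
rename it into place only once it is complete. The sketch below assumes a POSIX 
system, where renaming within one directory replaces the target atomically; 
`writeThenRename`, `readAll`, and the ".tmp" suffix are hypothetical names 
chosen for illustration.

```cpp
#include <cassert>
#include <cstdio>
#include <fstream>
#include <sstream>
#include <string>

// Writes `content` so that a reader of `path` never observes a
// created-but-still-empty file: the data goes to a temporary sibling first
// and is renamed into place afterwards. Returns true on success.
bool writeThenRename(const std::string& path, const std::string& content) {
  assert(!path.empty());
  const std::string tmp = path + ".tmp";  // Hypothetical temp-name scheme.
  {
    std::ofstream out(tmp, std::ios::trunc);
    if (!out) {
      return false;
    }
    out << content;
    out.flush();
    if (!out) {
      return false;
    }
  }  // The stream is closed here, before the rename.
  return std::rename(tmp.c_str(), path.c_str()) == 0;
}

// Reads the whole file; returns an empty string if it does not exist.
std::string readAll(const std::string& path) {
  std::ifstream in(path);
  std::ostringstream buffer;
  buffer << in.rdbuf();
  return buffer.str();
}
```

A reader that polls `path` then sees either no file at all or the complete 
content, so an empty read can be treated as "not ready yet" and retried.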

> DefaultExecutorTest.ROOT_MultiTaskgroupSharePidNamespace is flaky.
> --
>
> Key: MESOS-8305
> URL: https://issues.apache.org/jira/browse/MESOS-8305
> Project: Mesos
>  Issue Type: Bug
> Environment: Ubuntu 16.04
> Fedora 23
>Reporter: Alexander Rukletsov
>Assignee: Qian Zhang
>Priority: Major
>  Labels: flaky-test
> Attachments: ROOT_MultiTaskgroupSharePidNamespace-badrun.txt
>
>
> On Ubuntu 16.04:
> {noformat}
> ../../src/tests/default_executor_tests.cpp:1877
>   Expected: strings::trim(pidNamespace1.get())
>   Which is: "4026532250"
> To be equal to: strings::trim(pidNamespace2.get())
>   Which is: ""
> {noformat}
> Full log attached.
> On Fedora 23:
> {noformat}
> ../../src/tests/default_executor_tests.cpp:1878
>   Expected: strings::trim(pidNamespace1.get())
>   Which is: "4026532233"
> To be equal to: strings::trim(pidNamespace2.get())
>   Which is: ""
> {noformat}
> The test became flaky shortly after MESOS-7306 was committed and is likely 
> related to it.





[jira] [Updated] (MESOS-8305) DefaultExecutorTest.ROOT_MultiTaskgroupSharePidNamespace is flaky.

2018-01-18 Thread Qian Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qian Zhang updated MESOS-8305:
--
Story Points: 2
  Sprint: Mesosphere Sprint 72

> DefaultExecutorTest.ROOT_MultiTaskgroupSharePidNamespace is flaky.
> --
>
> Key: MESOS-8305
> URL: https://issues.apache.org/jira/browse/MESOS-8305
> Project: Mesos
>  Issue Type: Bug
> Environment: Ubuntu 16.04
> Fedora 23
>Reporter: Alexander Rukletsov
>Assignee: Qian Zhang
>Priority: Major
>  Labels: flaky-test
> Attachments: ROOT_MultiTaskgroupSharePidNamespace-badrun.txt
>
>
> On Ubuntu 16.04:
> {noformat}
> ../../src/tests/default_executor_tests.cpp:1877
>   Expected: strings::trim(pidNamespace1.get())
>   Which is: "4026532250"
> To be equal to: strings::trim(pidNamespace2.get())
>   Which is: ""
> {noformat}
> Full log attached.
> On Fedora 23:
> {noformat}
> ../../src/tests/default_executor_tests.cpp:1878
>   Expected: strings::trim(pidNamespace1.get())
>   Which is: "4026532233"
> To be equal to: strings::trim(pidNamespace2.get())
>   Which is: ""
> {noformat}
> The test became flaky shortly after MESOS-7306 was committed and is likely 
> related to it.





[jira] [Commented] (MESOS-8094) Leverage helper functions to reduce boilerplate code related to v1 API.

2018-01-18 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16330453#comment-16330453
 ] 

Alexander Rukletsov commented on MESOS-8094:


{noformat}
commit 634c8af2618c57a1405d20717fa909b399486f37
Author: Armand Grillet 
AuthorDate: Thu Jan 18 13:37:16 2018 +0100
Commit: Alexander Rukletsov 
CommitDate: Thu Jan 18 13:37:16 2018 +0100

Updated tests to use `createCallSubscribe`.

Update all the tests that send v1 API SUBSCRIBE calls
to use the `createCallSubscribe` test helper.

Review: https://reviews.apache.org/r/63661/
{noformat}

> Leverage helper functions to reduce boilerplate code related to v1 API.
> ---
>
> Key: MESOS-8094
> URL: https://issues.apache.org/jira/browse/MESOS-8094
> Project: Mesos
>  Issue Type: Improvement
>  Components: test
>Reporter: Alexander Rukletsov
>Priority: Major
>  Labels: mesosphere, newbie
>
> https://reviews.apache.org/r/61982/ created an example of how test code 
> related to the scheduler v1 API can be simplified with appropriate use of 
> helper functions. For example, instead of crafting a subscribe call manually 
> as in
> {noformat}
>   {
>     v1::scheduler::Call call;
>     call.set_type(v1::scheduler::Call::SUBSCRIBE);
>     v1::scheduler::Call::Subscribe* subscribe = call.mutable_subscribe();
>     subscribe->mutable_framework_info()->CopyFrom(v1::DEFAULT_FRAMEWORK_INFO);
>     mesos.send(call);
>   }
> {noformat}
> a helper function {{v1::scheduler::SendSubscribe()}} shall be invoked.
> To find all occurrences that shall be fixed, one can grep the test codebase 
> for {{call.set_type}}. At the moment I see the following files:
> {noformat}
> api_tests.cpp
> check_tests.cpp
> http_fault_tolerant_tests.cpp
> master_maintenance_tests.cpp
> master_tests.cpp
> scheduler_tests.cpp
> slave_authorization_tests.cpp
> slave_recovery_tests.cpp
> slave_tests.cpp
> {noformat}
> The same applies to sending status update acks; the 
> {{v1::scheduler::SendAcknowledge()}} action shall be used instead of manually 
> crafting acks.





[jira] [Updated] (MESOS-7506) Multiple tests leave orphan containers.

2018-01-18 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-7506:
---
Attachment: ROOT_IsolatorFlags-badrun3.txt

> Multiple tests leave orphan containers.
> ---
>
> Key: MESOS-7506
> URL: https://issues.apache.org/jira/browse/MESOS-7506
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Ubuntu 16.04
> Fedora 23
> other Linux distros
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: containerizer, flaky-test, mesosphere
> Attachments: KillMultipleTasks-badrun.txt, 
> ROOT_IsolatorFlags-badrun.txt, ROOT_IsolatorFlags-badrun2.txt, 
> ROOT_IsolatorFlags-badrun3.txt, ReconcileTasksMissingFromSlave-badrun.txt, 
> ResourceLimitation-badrun.txt, ResourceLimitation-badrun2.txt, 
> RestartSlaveRequireExecutorAuthentication-badrun.txt, 
> TaskWithFileURI-badrun.txt
>
>
> I've observed a number of flaky tests that leave orphan containers upon 
> cleanup. A typical log looks like this:
> {noformat}
> ../../src/tests/cluster.cpp:580: Failure
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 }
> {noformat}
> All currently affected tests:
> {noformat}
> SlaveTest.RestartSlaveRequireExecutorAuthentication // cannot reproduce any 
> more
> ROOT_IsolatorFlags
> {noformat}





[jira] [Updated] (MESOS-7506) Multiple tests leave orphan containers.

2018-01-18 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-7506:
---
Description: 
I've observed a number of flaky tests that leave orphan containers upon 
cleanup. A typical log looks like this:
{noformat}
../../src/tests/cluster.cpp:580: Failure
Value of: containers->empty()
  Actual: false
Expected: true
Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 }
{noformat}

All currently affected tests:
{noformat}
SlaveTest.RestartSlaveRequireExecutorAuthentication // cannot reproduce any more
ROOT_IsolatorFlags
{noformat}

  was:
I've observed a number of flaky tests that leave orphan containers upon 
cleanup. A typical log looks like this:
{noformat}
../../src/tests/cluster.cpp:580: Failure
Value of: containers->empty()
  Actual: false
Expected: true
Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 }
{noformat}

All currently affected tests:
{noformat}
SlaveTest.RestartSlaveRequireExecutorAuthentication // cannot reproduce any more
{noformat}


> Multiple tests leave orphan containers.
> ---
>
> Key: MESOS-7506
> URL: https://issues.apache.org/jira/browse/MESOS-7506
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Ubuntu 16.04
> Fedora 23
> other Linux distros
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: containerizer, flaky-test, mesosphere
> Attachments: KillMultipleTasks-badrun.txt, 
> ROOT_IsolatorFlags-badrun.txt, ROOT_IsolatorFlags-badrun2.txt, 
> ReconcileTasksMissingFromSlave-badrun.txt, ResourceLimitation-badrun.txt, 
> ResourceLimitation-badrun2.txt, 
> RestartSlaveRequireExecutorAuthentication-badrun.txt, 
> TaskWithFileURI-badrun.txt
>
>
> I've observed a number of flaky tests that leave orphan containers upon 
> cleanup. A typical log looks like this:
> {noformat}
> ../../src/tests/cluster.cpp:580: Failure
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 }
> {noformat}
> All currently affected tests:
> {noformat}
> SlaveTest.RestartSlaveRequireExecutorAuthentication // cannot reproduce any 
> more
> ROOT_IsolatorFlags
> {noformat}





[jira] [Updated] (MESOS-7506) Multiple tests leave orphan containers.

2018-01-18 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-7506:
---
Fix Version/s: (was: 1.6.0)

> Multiple tests leave orphan containers.
> ---
>
> Key: MESOS-7506
> URL: https://issues.apache.org/jira/browse/MESOS-7506
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Ubuntu 16.04
> Fedora 23
> other Linux distros
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: containerizer, flaky-test, mesosphere
> Attachments: KillMultipleTasks-badrun.txt, 
> ROOT_IsolatorFlags-badrun.txt, ROOT_IsolatorFlags-badrun2.txt, 
> ReconcileTasksMissingFromSlave-badrun.txt, ResourceLimitation-badrun.txt, 
> ResourceLimitation-badrun2.txt, 
> RestartSlaveRequireExecutorAuthentication-badrun.txt, 
> TaskWithFileURI-badrun.txt
>
>
> I've observed a number of flaky tests that leave orphan containers upon 
> cleanup. A typical log looks like this:
> {noformat}
> ../../src/tests/cluster.cpp:580: Failure
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 }
> {noformat}
> All currently affected tests:
> {noformat}
> SlaveTest.RestartSlaveRequireExecutorAuthentication // cannot reproduce any more
> {noformat}





[jira] [Updated] (MESOS-7979) reviewboard's GUESS_FIELDS setting leads to redundant information in commit messages

2018-01-18 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-7979:

Shepherd: Till Toenshoff

> reviewboard's GUESS_FIELDS setting leads to redundant information in commit 
> messages
> 
>
> Key: MESOS-7979
> URL: https://issues.apache.org/jira/browse/MESOS-7979
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benjamin Bannier
>Assignee: Benjamin Bannier
>Priority: Minor
>  Labels: mesosphere, newbie, python
>
> Reviewboard can be set up to automatically guess a patch's summary and 
> description when {{GUESS_FIELDS}} is set. For commits that have no dedicated 
> description, it also uses the commit summary as the description. This leads to 
> commits with redundant commit messages, e.g.,
> {code}
> Frobnicated the foobarizer.
> Frobnicated the foobarizer.
> Review: https://reviews.apache.org/r/1234567890
> {code}
> When applying this commit with, e.g., {{apply_reviews.py}}, the redundant body 
> is faithfully copied, but we should consider updating it to remove the 
> redundant information automatically instead, leading to, e.g.,
> {code}
> Frobnicated the foobarizer.
> Review: https://reviews.apache.org/r/1234567890
> {code}
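
The cleanup proposed above could be sketched in Python roughly as follows. This is only an illustration and not the actual {{apply_reviews.py}} code: the function name `strip_redundant_body` is hypothetical, and the sketch assumes the redundant body is any line that exactly repeats the summary line.

```python
def strip_redundant_body(message: str) -> str:
    """Return `message` without body lines that merely repeat the summary.

    Hypothetical helper for illustration; not part of apply_reviews.py.
    """
    lines = message.splitlines()
    if not lines:
        return message

    summary = lines[0].strip()

    # Drop every body line that is identical to the summary line.
    body = [line for line in lines[1:] if line.strip() != summary]

    # Removing a line can leave two blank lines in a row; collapse them.
    cleaned = []
    for line in body:
        if not line.strip() and cleaned and not cleaned[-1].strip():
            continue
        cleaned.append(line)

    return "\n".join([lines[0]] + cleaned)


if __name__ == "__main__":
    redundant = (
        "Frobnicated the foobarizer.\n"
        "Frobnicated the foobarizer.\n"
        "Review: https://reviews.apache.org/r/1234567890"
    )
    print(strip_redundant_body(redundant))
```

Messages whose body differs from the summary pass through unchanged, apart from a trailing newline, which `splitlines()` and `join()` do not preserve.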





[jira] [Assigned] (MESOS-7979) reviewboard's GUESS_FIELDS setting leads to redundant information in commit messages

2018-01-18 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier reassigned MESOS-7979:
---

Assignee: Benjamin Bannier

> reviewboard's GUESS_FIELDS setting leads to redundant information in commit 
> messages
> 
>
> Key: MESOS-7979
> URL: https://issues.apache.org/jira/browse/MESOS-7979
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benjamin Bannier
>Assignee: Benjamin Bannier
>Priority: Minor
>  Labels: mesosphere, newbie, python
>
> Reviewboard can be set up to automatically guess a patch's summary and 
> description when {{GUESS_FIELDS}} is set. For commits that have no dedicated 
> description, it also uses the commit summary as the description. This leads to 
> commits with redundant commit messages, e.g.,
> {code}
> Frobnicated the foobarizer.
> Frobnicated the foobarizer.
> Review: https://reviews.apache.org/r/1234567890
> {code}
> When applying this commit with, e.g., {{apply_reviews.py}}, the redundant body 
> is faithfully copied, but we should consider updating it to remove the 
> redundant information automatically instead, leading to, e.g.,
> {code}
> Frobnicated the foobarizer.
> Review: https://reviews.apache.org/r/1234567890
> {code}





[jira] [Updated] (MESOS-8445) Test that `UPDATE_STATE` of a resource provider doesn't have unwanted side-effects in master or agent

2018-01-18 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-8445:

Shepherd: Benjamin Bannier

> Test that `UPDATE_STATE` of a resource provider doesn't have unwanted 
> side-effects in master or agent
> -
>
> Key: MESOS-8445
> URL: https://issues.apache.org/jira/browse/MESOS-8445
> Project: Mesos
>  Issue Type: Task
>Reporter: Jan Schlicht
>Assignee: Jan Schlicht
>Priority: Major
>  Labels: mesosphere
> Fix For: 1.6.0
>
>
> While we test the correct behavior of {{UPDATE_STATE}} sent by resource 
> providers when an operation state changes or after (re-)registration, this 
> call might also get sent independently of any such event, e.g., if resources 
> are added to a running resource provider. The correct behavior of master and 
> agent needs to be tested in this case. Outstanding offers should be rescinded 
> and internal states updated.


