[jira] [Comment Edited] (MESOS-9024) Mesos master segfaults with stack overflow under load.

2018-07-03 Thread Benjamin Mahler (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16532096#comment-16532096
 ] 

Benjamin Mahler edited comment on MESOS-9024 at 7/4/18 1:30 AM:


{noformat}
commit ab10f8310a735c3119f22dd3d9e636dc9cc38562
Author: Benjamin Mahler 
Date:   Tue Jul 3 16:54:11 2018 -0700

Reduced likelihood of a stack overflow in libprocess socket recv path.

Currently, the socket recv path is implemented using an asynchronous
loop with callbacks. Without using `process::loop`, this pattern is
prone to a stack overflow in the case that all asynchronous calls
complete synchronously. This is possible with sockets if the socket
is always ready for reading. The crash has been reported in MESOS-9024,
so the stack overflow has been encountered in practice.

This patch updates the recv path to leverage `process::loop`, which
is supposed to prevent stack overflows in asynchronous loops. However,
it is still possible for `process::loop` to stack overflow due to
MESOS-8852. In practice, I expect that even without MESOS-8852 fixed,
users won't see any stack overflows in the recv path.

Review: https://reviews.apache.org/r/67824
{noformat}

[~awruef] this has been cherry-picked and will land in 1.6.1 and 1.5.2; please 
let me know if you still see an issue.
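
To illustrate the failure mode described in the commit message above, here is a minimal, standalone C++ sketch (not the actual libprocess code; `recv_async`, `callback_loop`, and `iterative_loop` are made-up names). It shows why a callback-chained receive loop grows the stack when every "asynchronous" operation completes synchronously, and why expressing the same logic iteratively, which is the idea behind `process::loop`, keeps stack usage constant:

{code}
#include <functional>
#include <iostream>

// Pretend "asynchronous" recv that, in the pathological case, always completes
// immediately and invokes its callback on the caller's stack.
void recv_async(int remaining, const std::function<void(int)>& callback)
{
  callback(remaining - 1);
}

// Callback-chained loop: each completion re-issues the next recv from inside
// the callback, so synchronous completions nest one stack frame per iteration.
void callback_loop(int remaining)
{
  if (remaining == 0) {
    return;
  }

  recv_async(remaining, [](int next) {
    callback_loop(next); // Stack depth grows linearly with `remaining`.
  });
}

// Iterative loop (the `process::loop` idea): the same logic expressed as a
// loop keeps the stack flat no matter how many operations complete
// synchronously.
void iterative_loop(int remaining)
{
  while (remaining > 0) {
    recv_async(remaining, [&](int next) { remaining = next; });
  }
}

int main()
{
  iterative_loop(10000000);   // Constant stack usage.
  // callback_loop(10000000); // Would likely overflow the stack.
  std::cout << "done" << std::endl;
  return 0;
}
{code}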


was (Author: bmahler):
{noformat}
commit ab10f8310a735c3119f22dd3d9e636dc9cc38562
Author: Benjamin Mahler 
Date:   Tue Jul 3 16:54:11 2018 -0700

Reduced likelihood of a stack overflow in libprocess socket recv path.

Currently, the socket recv path is implemented using an asynchronous
loop with callbacks. Without using `process::loop`, this pattern is
prone to a stack overflow in the case that all asynchronous calls
complete synchronously. This is possible with sockets if the socket
is always ready for reading. The crash has been reported in MESOS-9024,
so the stack overflow has been encountered in practice.

This patch updates the recv path to leverage `process::loop`, which
is supposed to prevent stack overflows in asynchronous loops. However,
it is still possible for `process::loop` to stack overflow due to
MESOS-8852. In practice, I expect that even without MESOS-8852 fixed,
users won't see any stack overflows in the recv path.

Review: https://reviews.apache.org/r/67824
{noformat}

[~awruef] this has been cherry-picked and will land in 1.6.1 and 1.5.2, please 
let me know if you are still an issue.

> Mesos master segfaults with stack overflow under load.
> --
>
> Key: MESOS-9024
> URL: https://issues.apache.org/jira/browse/MESOS-9024
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess, master
>Affects Versions: 1.6.0
> Environment: Ubuntu 16.04.4 
>Reporter: Andrew Ruef
>Assignee: Benjamin Mahler
>Priority: Blocker
> Fix For: 1.5.2, 1.6.1
>
> Attachments: stack.txt.gz
>
>
> Running mesos in non-HA mode on a small cluster under load, the master 
> reliably segfaults due to some state it has worked itself into. The segfault 
> appears to be a stack overflow, at least, the call stack has 72662 elements 
> in it in the crashing thread. The root of the stack appears to be in 
> libprocess. 
> I've attached a gzip compressed stack backtrace since the uncompressed stack 
> backtrace is too large to attach to this issue. This happens to me fairly 
> reliably when doing jobs, but it can take many hours or days for mesos to 
> work itself back into this state. 
> I think the below is the beginning of the repeating part of the stack trace: 
> {noformat}
> #72565 0x7fd748882c32 in lambda::CallableOnce (process::Future const&)>::operator()(process::Future long> const&) && () at 
> ../../mesos-1.6.0/3rdparty/stout/include/stout/lambda.hpp:443}}
> {{#72566 0x7fd7488776d2 in process::Future long>::onAny(lambda::CallableOnce 
> const&)>&&) const () at 
> ../../mesos-1.6.0/3rdparty/libprocess/include/process/future.hpp:1461}}
> {{#72567 0x7fd74a81b35c in process::Future long>::onAny, char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*))(process::Future const&, 
> char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*)>, void>(std::_Bind (*(std::_Placeholder<1>, char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*))(process::Future const&, 
> char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*)>&&, process::Future long>::Prefer) const () at 
> ../../../mesos-1.6.0/3rdparty/libprocess/include/process/future.hpp:312}}
> {{#72568 0x7fd74a80a5b3 in process::Future long>::onAny, 

[jira] [Commented] (MESOS-8916) Allocation logic cleanup.

2018-07-03 Thread Meng Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16532115#comment-16532115
 ] 

Meng Zhu commented on MESOS-8916:
-

Update the scalar quantity related functions to also strip static reservation 
metadata. Currently there is extra code in the allocator across many places 
(including the allocation logic) to perform this in the call-sites.

https://reviews.apache.org/r/67615/
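
For context, a simplified, standalone sketch of what "stripping to a scalar quantity" means (the types below are made up; the real helper is a method on `mesos::Resources`): keep only the resource name and scalar value and drop the reservation metadata, so call sites no longer have to clear it themselves.

{code}
#include <string>
#include <vector>

// Simplified stand-in for mesos::Resource; only the fields relevant here.
struct Resource
{
  std::string name;                 // e.g. "cpus"
  double scalar = 0.0;              // e.g. 4.0
  std::string reservationRole;      // reservation metadata...
  std::string reservationPrincipal; // ...that call sites had to clear manually
};

// Keep only the name and scalar value; drop all reservation metadata.
std::vector<Resource> createStrippedScalarQuantity(
    const std::vector<Resource>& resources)
{
  std::vector<Resource> stripped;
  stripped.reserve(resources.size());

  for (const Resource& resource : resources) {
    Resource quantity;
    quantity.name = resource.name;
    quantity.scalar = resource.scalar;
    stripped.push_back(quantity);
  }

  return stripped;
}
{code}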

> Allocation logic cleanup.
> -
>
> Key: MESOS-8916
> URL: https://issues.apache.org/jira/browse/MESOS-8916
> Project: Mesos
>  Issue Type: Epic
>  Components: allocation
>Reporter: Benjamin Mahler
>Assignee: Meng Zhu
>Priority: Major
>
> The allocation logic has grown organically and is now very hard to read and 
> maintain. This epic will track cleanups to improve the readability of the 
> core allocation logic:
> * Add a function for returning the subset of frameworks that are capable of 
> receiving offers from the agent. This moves the capability checking out of 
> the core allocation logic and means the loops can just iterate over a smaller 
> set of framework candidates rather than having to write 'continue' cases. 
> This covers the GPU_RESOURCES and REGION_AWARE capabilities.
> * Similarly, add a function that allows framework capability based filtering 
> of resources. This pulls out the filtering logic from the core allocation 
> logic and instead the core allocation logic can just call out to the 
> capability filtering function. This covers the SHARED_RESOURCES, 
> REVOCABLE_RESOURCES and RESERVATION_REFINEMENT capabilities. Note that in 
> order to implement this one, we must refactor the shared resources logic in 
> order to have the resource generation occur regardless of the framework 
> capability (followed by getting filtered out via this new function if the 
> framework is not capable).
> * Update the scalar quantity related functions to also strip static 
> reservation metadata. Currently there is extra code in the allocator across 
> many places (including the allocation logic) to perform this in the 
> call-sites.
> * Track across allocation cycles or pull out the following into functions: 
> quantity of quota that is currently "charged" to a role, amount of "headroom" 
> that is needed/available for unsatisfied quota guarantees.
> * Pull out the resource shrinking function.





[jira] [Commented] (MESOS-8916) Allocation logic cleanup.

2018-07-03 Thread Meng Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16532114#comment-16532114
 ] 

Meng Zhu commented on MESOS-8916:
-

Similarly, add a function that allows framework capability based filtering of 
resources. This pulls out the filtering logic from the core allocation logic 
and instead the core allocation logic can just call out to the capability 
filtering function. This covers the SHARED_RESOURCES, REVOCABLE_RESOURCES and 
RESERVATION_REFINEMENT capabilities. Note that in order to implement this one, 
we must refactor the shared resources logic in order to have the resource 
generation occur regardless of the framework capability (followed by getting 
filtered out via this new function if the framework is not capable).

https://reviews.apache.org/r/67827/
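
A rough, standalone sketch of the shape such a helper could take (simplified types, not the actual allocator API): resources that require a capability the framework does not have are filtered out in one place, instead of being special-cased inside the allocation loops.

{code}
#include <algorithm>
#include <vector>

// Simplified stand-ins for the real Mesos types.
struct Resource
{
  bool shared = false;
  bool revocable = false;
  bool refinedReservation = false;
};

struct FrameworkCapabilities
{
  bool sharedResources = false;
  bool revocableResources = false;
  bool reservationRefinement = false;
};

// Returns only the resources the framework is capable of receiving.
std::vector<Resource> filterByCapabilities(
    std::vector<Resource> resources,
    const FrameworkCapabilities& capabilities)
{
  resources.erase(
      std::remove_if(
          resources.begin(),
          resources.end(),
          [&](const Resource& r) {
            return (r.shared && !capabilities.sharedResources) ||
                   (r.revocable && !capabilities.revocableResources) ||
                   (r.refinedReservation && !capabilities.reservationRefinement);
          }),
      resources.end());

  return resources;
}
{code}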

> Allocation logic cleanup.
> -
>
> Key: MESOS-8916
> URL: https://issues.apache.org/jira/browse/MESOS-8916
> Project: Mesos
>  Issue Type: Epic
>  Components: allocation
>Reporter: Benjamin Mahler
>Assignee: Meng Zhu
>Priority: Major
>
> The allocation logic has grown organically and is now very hard to read and 
> maintain. This epic will track cleanups to improve the readability of the 
> core allocation logic:
> * Add a function for returning the subset of frameworks that are capable of 
> receiving offers from the agent. This moves the capability checking out of 
> the core allocation logic and means the loops can just iterate over a smaller 
> set of framework candidates rather than having to write 'continue' cases. 
> This covers the GPU_RESOURCES and REGION_AWARE capabilities.
> * Similarly, add a function that allows framework capability based filtering 
> of resources. This pulls out the filtering logic from the core allocation 
> logic and instead the core allocation logic can just call out to the 
> capability filtering function. This covers the SHARED_RESOURCES, 
> REVOCABLE_RESOURCES and RESERVATION_REFINEMENT capabilities. Note that in 
> order to implement this one, we must refactor the shared resources logic in 
> order to have the resource generation occur regardless of the framework 
> capability (followed by getting filtered out via this new function if the 
> framework is not capable).
> * Update the scalar quantity related functions to also strip static 
> reservation metadata. Currently there is extra code in the allocator across 
> many places (including the allocation logic) to perform this in the 
> call-sites.
> * Track across allocation cycles or pull out the following into functions: 
> quantity of quota that is currently "charged" to a role, amount of "headroom" 
> that is needed/available for unsatisfied quota guarantees.
> * Pull out the resource shrinking function.





[jira] [Commented] (MESOS-8982) add cgroup memory.max_usage_in_bytes into slave monitor/statistics endpoint

2018-07-03 Thread Benjamin Mahler (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16532112#comment-16532112
 ] 

Benjamin Mahler commented on MESOS-8982:


FYI [~gilbert] [~qianzhang]

> add cgroup memory.max_usage_in_bytes into slave monitor/statistics endpoint
> ---
>
> Key: MESOS-8982
> URL: https://issues.apache.org/jira/browse/MESOS-8982
> Project: Mesos
>  Issue Type: Improvement
>  Components: cgroups, docker, HTTP API
>Affects Versions: 1.6.0
>Reporter: Martin Bydzovsky
>Priority: Minor
>
> As an operator, I'm periodically checking the slave's monitor/statistics endpoint 
> to get the memory/cpu usage/cpu throttle for each running task. However, if 
> there is a short-term memory usage peak (let's say seconds), I might miss it 
> (the memory might have been allocated and also released between my two 
> metrics-collection intervals). Since the max used memory is logged in 
> `/sys/fs/cgroup/memory/docker/CID/memory.max_usage_in_bytes`, it would be 
> great if this info were exposed in the API as well.
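
As a point of reference, a minimal sketch (hypothetical helper, not Mesos code) of how the requested value could be read from the cgroup filesystem for a given Docker container ID:

{code}
#include <cstdint>
#include <fstream>
#include <optional>
#include <string>

// Read the peak memory usage the kernel recorded for a Docker container's
// memory cgroup; returns nothing if the file can't be read.
std::optional<uint64_t> maxUsageInBytes(const std::string& containerId)
{
  std::ifstream file(
      "/sys/fs/cgroup/memory/docker/" + containerId +
      "/memory.max_usage_in_bytes");

  uint64_t bytes = 0;
  if (file >> bytes) {
    return bytes;
  }

  return std::nullopt;
}
{code}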





[jira] [Commented] (MESOS-9024) Mesos master segfaults with stack overflow under load.

2018-07-03 Thread Andrew Ruef (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16532110#comment-16532110
 ] 

Andrew Ruef commented on MESOS-9024:


Thanks! I'll check this out soon - I went to Plan B (divide up work manually 
using GNU parallel) and that task is still running, but when it's done I'll see 
what this fix does. 

> Mesos master segfaults with stack overflow under load.
> --
>
> Key: MESOS-9024
> URL: https://issues.apache.org/jira/browse/MESOS-9024
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess, master
>Affects Versions: 1.6.0
> Environment: Ubuntu 16.04.4 
>Reporter: Andrew Ruef
>Assignee: Benjamin Mahler
>Priority: Blocker
> Fix For: 1.5.2, 1.6.1
>
> Attachments: stack.txt.gz
>
>
> Running mesos in non-HA mode on a small cluster under load, the master 
> reliably segfaults due to some state it has worked itself into. The segfault 
> appears to be a stack overflow, at least, the call stack has 72662 elements 
> in it in the crashing thread. The root of the stack appears to be in 
> libprocess. 
> I've attached a gzip compressed stack backtrace since the uncompressed stack 
> backtrace is too large to attach to this issue. This happens to me fairly 
> reliably when doing jobs, but it can take many hours or days for mesos to 
> work itself back into this state. 
> I think the below is the beginning of the repeating part of the stack trace: 
> {noformat}
> #72565 0x7fd748882c32 in lambda::CallableOnce (process::Future const&)>::operator()(process::Future long> const&) && () at 
> ../../mesos-1.6.0/3rdparty/stout/include/stout/lambda.hpp:443}}
> {{#72566 0x7fd7488776d2 in process::Future long>::onAny(lambda::CallableOnce 
> const&)>&&) const () at 
> ../../mesos-1.6.0/3rdparty/libprocess/include/process/future.hpp:1461}}
> {{#72567 0x7fd74a81b35c in process::Future long>::onAny, char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*))(process::Future const&, 
> char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*)>, void>(std::_Bind (*(std::_Placeholder<1>, char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*))(process::Future const&, 
> char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*)>&&, process::Future long>::Prefer) const () at 
> ../../../mesos-1.6.0/3rdparty/libprocess/include/process/future.hpp:312}}
> {{#72568 0x7fd74a80a5b3 in process::Future long>::onAny, char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*))(process::Future const&, 
> char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*)> >(std::_Bind (*(std::_Placeholder<1>, char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*))(process::Future const&, 
> char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*)>&&) const () at 
> ../../../mesos-1.6.0/3rdparty/libprocess/include/process/future.hpp:382}}
> {{#72569 0x7fd74a7cff72 in process::internal::decode_recv () at 
> ../../../mesos-1.6.0/3rdparty/libprocess/src/process.cpp:849}}
> {{#72570 0x7fd74a83d103 in std::_Bind, 
> char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*))(process::Future const&, 
> char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*)>::__call long> const&, 0ul, 1ul, 2ul, 3ul, 4ul>(std::tuple long> const&>&&, std::_Index_tuple<0ul, 1ul, 2ul, 3ul, 4ul>) () at 
> /usr/include/c++/5/functional:1074}}
> {{#72571 0x7fd74a82afd2 in std::_Bind, 
> char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*))(process::Future const&, 
> char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*)>::operator() long> const&, void>(process::Future const&) () at 
> /usr/include/c++/5/functional:1133}}
> {{#72572 0x7fd74a81b23c in process::Future const& 
> process::Future::onAny (*(std::_Placeholder<1>, char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*))(process::Future const&, 
> char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*)>, void>(std::_Bind (*(std::_Placeholder<1>, char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*))(process::Future const&, 
> char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*)>&&, process::Future long>::Prefer) 

[jira] [Commented] (MESOS-8985) Posting to the operator api with 'accept recordio' header can crash the agent

2018-07-03 Thread Benjamin Mahler (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16532107#comment-16532107
 ] 

Benjamin Mahler commented on MESOS-8985:


[~bennoe] [~alexr] any reason not to backport to the supported releases?

> Posting to the operator api with 'accept recordio' header can crash the agent
> -
>
> Key: MESOS-8985
> URL: https://issues.apache.org/jira/browse/MESOS-8985
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Affects Versions: 1.4.1, 1.5.1
>Reporter: Philip Norman
>Assignee: Benno Evers
>Priority: Major
>  Labels: mesosphere
> Fix For: 1.7.0
>
> Attachments: mesos-slave-crash.log
>
>
> It's possible to crash the mesos agent by posting a reasonable request to the 
> operator API.
> h3. Background:
> Sending a request to the v1 api endpoint with an unsupported 'accept' header:
> {code:java}
> curl -X POST http://10.0.3.27:5051/api/v1 \
>   -H 'accept: application/atom+xml' \
>   -H 'content-type: application/json' \
>   -d '{"type":"GET_CONTAINERS","get_containers":{"show_nested": 
> true,"show_standalone": true}}'{code}
> Results in the following friendly error message:
> {code:java}
> Expecting 'Accept' to allow application/json or application/x-protobuf or 
> application/recordio{code}
> h3. Reproducible crash:
> However, sending the same request with 'application/recordio' 'accept' header:
> {code:java}
> curl -X POST \
> http://10.0.3.27:5051/api/v1 \
>   -H 'accept: application/recordio' \
>   -H 'content-type: application/json' \
>   -d '{"type":"GET_CONTAINERS","get_containers":{"show_nested": 
> true,"show_standalone": true}}'{code}
> causes the agent to crash (no response is received).
> Crash log is shown below, full log from the agent is attached here:
> {code:java}
> Jun 07 22:30:32 ip-10-0-3-27.us-west-2.compute.internal mesos-agent[3718]: 
> I0607 22:30:32.397320 3743 logfmt.cpp:178] type=audit timestamp=2018-06-07 
> 22:30:32.397243904+00:00 reason="Error in token 'Missing 'Authorization' 
> header from HTTP request'. Allowing anonymous connection" 
> object="/slave(1)/api/v1" agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 
> 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 
> Safari/537.36" authorizer="mesos-agent" action="POST" result=allow 
> srcip=10.0.6.99 dstport=5051 srcport=42084 dstip=10.0.3.27
> Jun 07 22:30:32 ip-10-0-3-27.us-west-2.compute.internal mesos-agent[3718]: 
> W0607 22:30:32.397434 3743 authenticator.cpp:289] Error in token on request 
> from '10.0.6.99:42084': Missing 'Authorization' header from HTTP request
> Jun 07 22:30:32 ip-10-0-3-27.us-west-2.compute.internal mesos-agent[3718]: 
> W0607 22:30:32.397466 3743 authenticator.cpp:291] Falling back to anonymous 
> connection using user 'dcos_anonymous'
> Jun 07 22:30:32 ip-10-0-3-27.us-west-2.compute.internal mesos-agent[3718]: 
> I0607 22:30:32.397629 3748 http.cpp:1099] HTTP POST for /slave(1)/api/v1 from 
> 10.0.6.99:42084 with User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 
> 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 
> Safari/537.36'
> Jun 07 22:30:32 ip-10-0-3-27.us-west-2.compute.internal mesos-agent[3718]: 
> I0607 22:30:32.397784 3748 http.cpp:2030] Processing GET_CONTAINERS call
> Jun 07 22:30:32 ip-10-0-3-27.us-west-2.compute.internal mesos-agent[3718]: 
> F0607 22:30:32.398736 3747 http.cpp:121] Serializing a RecordIO stream is not 
> supported
> Jun 07 22:30:32 ip-10-0-3-27.us-west-2.compute.internal mesos-agent[3718]: 
> *** Check failure stack trace: ***
> Jun 07 22:30:32 ip-10-0-3-27.us-west-2.compute.internal mesos-agent[3718]: @ 
> 0x7f619478636d google::LogMessage::Fail()
> Jun 07 22:30:32 ip-10-0-3-27.us-west-2.compute.internal mesos-agent[3718]: @ 
> 0x7f619478819d google::LogMessage::SendToLog()
> Jun 07 22:30:32 ip-10-0-3-27.us-west-2.compute.internal mesos-agent[3718]: @ 
> 0x7f6194785f5c google::LogMessage::Flush()
> Jun 07 22:30:32 ip-10-0-3-27.us-west-2.compute.internal mesos-agent[3718]: @ 
> 0x7f6194788a99 google::LogMessageFatal::~LogMessageFatal()
> Jun 07 22:30:32 ip-10-0-3-27.us-west-2.compute.internal mesos-agent[3718]: @ 
> 0x7f61935e2b9d mesos::internal::serialize()
> Jun 07 22:30:32 ip-10-0-3-27.us-west-2.compute.internal mesos-agent[3718]: @ 
> 0x7f6193a4c0ef 
> _ZNO6lambda12CallableOnceIFN7process6FutureINS1_4http8ResponseEEERKN4JSON5ArrayEEE10CallableFnIZNK5mesos8internal5slave4Http13getContainersERKNSD_5agent4CallENSD_11ContentTypeERK6OptionINS3_14authentication9PrincipalEEEUlRKNS2_IS7_EEE0_EclES9_
> Jun 07 22:30:32 ip-10-0-3-27.us-west-2.compute.internal mesos-agent[3718]: @ 
> 0x7f6193a81d61 process::internal::thenf<>()
> Jun 07 22:30:32 ip-10-0-3-27.us-west-2.compute.internal mesos-agent[3718]: @ 
> 

[jira] [Assigned] (MESOS-8916) Allocation logic cleanup.

2018-07-03 Thread Meng Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-8916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Meng Zhu reassigned MESOS-8916:
---

Assignee: Meng Zhu

> Allocation logic cleanup.
> -
>
> Key: MESOS-8916
> URL: https://issues.apache.org/jira/browse/MESOS-8916
> Project: Mesos
>  Issue Type: Epic
>  Components: allocation
>Reporter: Benjamin Mahler
>Assignee: Meng Zhu
>Priority: Major
>
> The allocation logic has grown organically and is now very hard to read and 
> maintain. This epic will track cleanups to improve the readability of the 
> core allocation logic:
> * Add a function for returning the subset of frameworks that are capable of 
> receiving offers from the agent. This moves the capability checking out of 
> the core allocation logic and means the loops can just iterate over a smaller 
> set of framework candidates rather than having to write 'continue' cases. 
> This covers the GPU_RESOURCES and REGION_AWARE capabilities.
> * Similarly, add a function that allows framework capability based filtering 
> of resources. This pulls out the filtering logic from the core allocation 
> logic and instead the core allocation logic can just call out to the 
> capability filtering function. This covers the SHARED_RESOURCES, 
> REVOCABLE_RESOURCES and RESERVATION_REFINEMENT capabilities. Note that in 
> order to implement this one, we must refactor the shared resources logic in 
> order to have the resource generation occur regardless of the framework 
> capability (followed by getting filtered out via this new function if the 
> framework is not capable).
> * Update the scalar quantity related functions to also strip static 
> reservation metadata. Currently there is extra code in the allocator across 
> many places (including the allocation logic) to perform this in the 
> call-sites.
> * Track across allocation cycles or pull out the following into functions: 
> quantity of quota that is currently "charged" to a role, amount of "headroom" 
> that is needed/available for unsatisfied quota guarantees.
> * Pull out the resource shrinking function.





[jira] [Assigned] (MESOS-9024) Mesos master segfaults with stack overflow under load

2018-07-03 Thread Benjamin Mahler (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-9024:
--

Assignee: Benjamin Mahler

> Mesos master segfaults with stack overflow under load
> -
>
> Key: MESOS-9024
> URL: https://issues.apache.org/jira/browse/MESOS-9024
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess, master
>Affects Versions: 1.6.0
> Environment: Ubuntu 16.04.4 
>Reporter: Andrew Ruef
>Assignee: Benjamin Mahler
>Priority: Blocker
> Attachments: stack.txt.gz
>
>
> Running mesos in non-HA mode on a small cluster under load, the master 
> reliably segfaults due to some state it has worked itself into. The segfault 
> appears to be a stack overflow, at least, the call stack has 72662 elements 
> in it in the crashing thread. The root of the stack appears to be in 
> libprocess. 
> I've attached a gzip compressed stack backtrace since the uncompressed stack 
> backtrace is too large to attach to this issue. This happens to me fairly 
> reliably when doing jobs, but it can take many hours or days for mesos to 
> work itself back into this state. 
> I think the below is the beginning of the repeating part of the stack trace: 
> {noformat}
> #72565 0x7fd748882c32 in lambda::CallableOnce (process::Future const&)>::operator()(process::Future long> const&) && () at 
> ../../mesos-1.6.0/3rdparty/stout/include/stout/lambda.hpp:443}}
> {{#72566 0x7fd7488776d2 in process::Future long>::onAny(lambda::CallableOnce 
> const&)>&&) const () at 
> ../../mesos-1.6.0/3rdparty/libprocess/include/process/future.hpp:1461}}
> {{#72567 0x7fd74a81b35c in process::Future long>::onAny, char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*))(process::Future const&, 
> char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*)>, void>(std::_Bind (*(std::_Placeholder<1>, char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*))(process::Future const&, 
> char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*)>&&, process::Future long>::Prefer) const () at 
> ../../../mesos-1.6.0/3rdparty/libprocess/include/process/future.hpp:312}}
> {{#72568 0x7fd74a80a5b3 in process::Future long>::onAny, char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*))(process::Future const&, 
> char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*)> >(std::_Bind (*(std::_Placeholder<1>, char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*))(process::Future const&, 
> char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*)>&&) const () at 
> ../../../mesos-1.6.0/3rdparty/libprocess/include/process/future.hpp:382}}
> {{#72569 0x7fd74a7cff72 in process::internal::decode_recv () at 
> ../../../mesos-1.6.0/3rdparty/libprocess/src/process.cpp:849}}
> {{#72570 0x7fd74a83d103 in std::_Bind, 
> char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*))(process::Future const&, 
> char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*)>::__call long> const&, 0ul, 1ul, 2ul, 3ul, 4ul>(std::tuple long> const&>&&, std::_Index_tuple<0ul, 1ul, 2ul, 3ul, 4ul>) () at 
> /usr/include/c++/5/functional:1074}}
> {{#72571 0x7fd74a82afd2 in std::_Bind, 
> char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*))(process::Future const&, 
> char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*)>::operator() long> const&, void>(process::Future const&) () at 
> /usr/include/c++/5/functional:1133}}
> {{#72572 0x7fd74a81b23c in process::Future const& 
> process::Future::onAny (*(std::_Placeholder<1>, char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*))(process::Future const&, 
> char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*)>, void>(std::_Bind (*(std::_Placeholder<1>, char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*))(process::Future const&, 
> char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*)>&&, process::Future long>::Prefer) const::\{lambda(std::_Bind, 
> char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*))(process::Future const&, 
> char*, unsigned long, 
> process::network::internal::Socket, 
> 

[jira] [Commented] (MESOS-9024) Mesos master segfaults with stack overflow under load

2018-07-03 Thread Benjamin Mahler (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16532048#comment-16532048
 ] 

Benjamin Mahler commented on MESOS-9024:


It looks like the socket receive path is also prone to stack overflow and needs 
a fix similar to the one made on the sending side in MESOS-8594. This can occur 
when a socket is always readable. This likely affects every supported version, 
but much like MESOS-8594, we can backport to 1.5.x and 1.6.x but not to 1.4.x.

> Mesos master segfaults with stack overflow under load
> -
>
> Key: MESOS-9024
> URL: https://issues.apache.org/jira/browse/MESOS-9024
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess, master
>Affects Versions: 1.6.0
> Environment: Ubuntu 16.04.4 
>Reporter: Andrew Ruef
>Priority: Blocker
> Attachments: stack.txt.gz
>
>
> Running mesos in non-HA mode on a small cluster under load, the master 
> reliably segfaults due to some state it has worked itself into. The segfault 
> appears to be a stack overflow, at least, the call stack has 72662 elements 
> in it in the crashing thread. The root of the stack appears to be in 
> libprocess. 
> I've attached a gzip compressed stack backtrace since the uncompressed stack 
> backtrace is too large to attach to this issue. This happens to me fairly 
> reliably when doing jobs, but it can take many hours or days for mesos to 
> work itself back into this state. 
> I think the below is the beginning of the repeating part of the stack trace: 
> {noformat}
> #72565 0x7fd748882c32 in lambda::CallableOnce (process::Future const&)>::operator()(process::Future long> const&) && () at 
> ../../mesos-1.6.0/3rdparty/stout/include/stout/lambda.hpp:443}}
> {{#72566 0x7fd7488776d2 in process::Future long>::onAny(lambda::CallableOnce 
> const&)>&&) const () at 
> ../../mesos-1.6.0/3rdparty/libprocess/include/process/future.hpp:1461}}
> {{#72567 0x7fd74a81b35c in process::Future long>::onAny, char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*))(process::Future const&, 
> char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*)>, void>(std::_Bind (*(std::_Placeholder<1>, char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*))(process::Future const&, 
> char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*)>&&, process::Future long>::Prefer) const () at 
> ../../../mesos-1.6.0/3rdparty/libprocess/include/process/future.hpp:312}}
> {{#72568 0x7fd74a80a5b3 in process::Future long>::onAny, char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*))(process::Future const&, 
> char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*)> >(std::_Bind (*(std::_Placeholder<1>, char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*))(process::Future const&, 
> char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*)>&&) const () at 
> ../../../mesos-1.6.0/3rdparty/libprocess/include/process/future.hpp:382}}
> {{#72569 0x7fd74a7cff72 in process::internal::decode_recv () at 
> ../../../mesos-1.6.0/3rdparty/libprocess/src/process.cpp:849}}
> {{#72570 0x7fd74a83d103 in std::_Bind, 
> char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*))(process::Future const&, 
> char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*)>::__call long> const&, 0ul, 1ul, 2ul, 3ul, 4ul>(std::tuple long> const&>&&, std::_Index_tuple<0ul, 1ul, 2ul, 3ul, 4ul>) () at 
> /usr/include/c++/5/functional:1074}}
> {{#72571 0x7fd74a82afd2 in std::_Bind, 
> char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*))(process::Future const&, 
> char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*)>::operator() long> const&, void>(process::Future const&) () at 
> /usr/include/c++/5/functional:1133}}
> {{#72572 0x7fd74a81b23c in process::Future const& 
> process::Future::onAny (*(std::_Placeholder<1>, char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*))(process::Future const&, 
> char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*)>, void>(std::_Bind (*(std::_Placeholder<1>, char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*))(process::Future const&, 
> char*, unsigned long, 
> process::network::internal::Socket, 
> 

[jira] [Created] (MESOS-9050) Mesos fetcher should use agent's credential to fetch artifacts.

2018-07-03 Thread Chun-Hung Hsiao (JIRA)
Chun-Hung Hsiao created MESOS-9050:
--

 Summary: Mesos fetcher should use agent's credential to fetch 
artifacts.
 Key: MESOS-9050
 URL: https://issues.apache.org/jira/browse/MESOS-9050
 Project: Mesos
  Issue Type: Bug
  Components: containerization
Reporter: Chun-Hung Hsiao


When launching a container, Mesos calls setuid to switch to the task's credential 
before fetching the artifacts into the executor sandbox. However, if any directory 
in the sandbox path forbids 'x' mode for the task's credential, the fetcher won't 
be able to store the artifact into the sandbox, but instead gets an {{EACCES}} 
from 
https://github.com/apache/mesos/blob/master/3rdparty/stout/include/stout/net.hpp#L214

We should use the agent's credential to fetch the artifacts, {{chown}} them, 
then setuid.
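
A rough sketch of the proposed ordering (illustration only, not the Mesos fetcher code; the helper name and its use of `curl` are assumptions): fetch while still holding the agent's credential, hand the artifact to the task user with {{chown}}, and only then drop privileges.

{code}
#include <pwd.h>
#include <sys/types.h>
#include <unistd.h>

#include <cstdlib>
#include <stdexcept>
#include <string>

// Fetch `uri` into `path` as the agent, give the artifact to the task user,
// then drop privileges for the rest of the launch.
void fetchAsAgentThenDropPrivileges(
    const std::string& uri,
    const std::string& path,
    const std::string& taskUser)
{
  // 1. Fetch as the agent (typically root), so restrictive permissions on
  //    parent directories of the sandbox can't produce EACCES.
  std::string command = "curl -sSfL -o '" + path + "' '" + uri + "'";
  if (std::system(command.c_str()) != 0) {
    throw std::runtime_error("Failed to fetch " + uri);
  }

  // 2. Hand ownership of the downloaded artifact to the task user.
  passwd* pw = ::getpwnam(taskUser.c_str());
  if (pw == nullptr || ::chown(path.c_str(), pw->pw_uid, pw->pw_gid) != 0) {
    throw std::runtime_error("Failed to chown " + path);
  }

  // 3. Only now switch to the task's credential.
  if (::setgid(pw->pw_gid) != 0 || ::setuid(pw->pw_uid) != 0) {
    throw std::runtime_error("Failed to setuid to " + taskUser);
  }
}
{code}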





[jira] [Comment Edited] (MESOS-8847) Per Framework task state metrics

2018-07-03 Thread Greg Mann (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16474521#comment-16474521
 ] 

Greg Mann edited comment on MESOS-8847 at 7/3/18 9:53 PM:
--

Review: https://reviews.apache.org/r/67813


was (Author: greggomann):
Reviews:
https://reviews.apache.org/r/66874/
https://reviews.apache.org/r/67187/

> Per Framework task state metrics
> 
>
> Key: MESOS-8847
> URL: https://issues.apache.org/jira/browse/MESOS-8847
> Project: Mesos
>  Issue Type: Task
>Reporter: Vinod Kone
>Assignee: Greg Mann
>Priority: Major
>
> Gauge metrics about current number of tasks in active states (RUNNING, 
> STAGING etc).





[jira] [Commented] (MESOS-8912) Per Framework terminal task state metrics

2018-07-03 Thread Greg Mann (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16531983#comment-16531983
 ] 

Greg Mann commented on MESOS-8912:
--

Merging this ticket with MESOS-8847

> Per Framework terminal task state metrics
> -
>
> Key: MESOS-8912
> URL: https://issues.apache.org/jira/browse/MESOS-8912
> Project: Mesos
>  Issue Type: Task
>Reporter: Greg Mann
>Assignee: Gilbert Song
>Priority: Major
>
> Counter metrics about the number of tasks that reached terminal states (FINISHED, 
> FAILED, etc.).
> These counter metrics will have granularity of task states and reasons (i.e., 
> number of tasks that are FINISHED due to REASON `foo` from SOURCE `master`).





[jira] [Commented] (MESOS-8845) Per Framework Operation metrics

2018-07-03 Thread Greg Mann (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16531977#comment-16531977
 ] 

Greg Mann commented on MESOS-8845:
--

Latest review; replaces the one above: https://reviews.apache.org/r/67814

> Per Framework Operation metrics
> ---
>
> Key: MESOS-8845
> URL: https://issues.apache.org/jira/browse/MESOS-8845
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Vinod Kone
>Assignee: Greg Mann
>Priority: Major
>
> Metrics for the number of operations sent via ACCEPT calls by a framework.





[jira] [Assigned] (MESOS-8848) Per Framework Offer metrics

2018-07-03 Thread Greg Mann (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-8848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-8848:


Assignee: Greg Mann  (was: Gilbert Song)

> Per Framework Offer metrics
> ---
>
> Key: MESOS-8848
> URL: https://issues.apache.org/jira/browse/MESOS-8848
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Vinod Kone
>Assignee: Greg Mann
>Priority: Major
>
> Metrics regarding number of offers (sent, accepted, declined, rescinded) on a 
> per framework basis.





[jira] [Commented] (MESOS-8848) Per Framework Offer metrics

2018-07-03 Thread Greg Mann (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16531978#comment-16531978
 ] 

Greg Mann commented on MESOS-8848:
--

Latest review; replaces the one above: https://reviews.apache.org/r/67812

> Per Framework Offer metrics
> ---
>
> Key: MESOS-8848
> URL: https://issues.apache.org/jira/browse/MESOS-8848
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Vinod Kone
>Assignee: Greg Mann
>Priority: Major
>
> Metrics regarding number of offers (sent, accepted, declined, rescinded) on a 
> per framework basis.





[jira] [Commented] (MESOS-8844) Per Framework EVENT metrics

2018-07-03 Thread Greg Mann (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16531974#comment-16531974
 ] 

Greg Mann commented on MESOS-8844:
--

Latest review; replaces those above: https://reviews.apache.org/r/67809

> Per Framework EVENT metrics
> ---
>
> Key: MESOS-8844
> URL: https://issues.apache.org/jira/browse/MESOS-8844
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Vinod Kone
>Assignee: Greg Mann
>Priority: Major
>
> Metrics for number of events sent by the master to the framework.





[jira] [Assigned] (MESOS-8845) Per Framework Operation metrics

2018-07-03 Thread Greg Mann (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-8845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-8845:


Assignee: Greg Mann  (was: Gilbert Song)

> Per Framework Operation metrics
> ---
>
> Key: MESOS-8845
> URL: https://issues.apache.org/jira/browse/MESOS-8845
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Vinod Kone
>Assignee: Greg Mann
>Priority: Major
>
> Metrics for the number of operations sent via ACCEPT calls by a framework.





[jira] [Assigned] (MESOS-8844) Per Framework EVENT metrics

2018-07-03 Thread Greg Mann (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-8844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-8844:


Assignee: Greg Mann  (was: Gilbert Song)

> Per Framework EVENT metrics
> ---
>
> Key: MESOS-8844
> URL: https://issues.apache.org/jira/browse/MESOS-8844
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Vinod Kone
>Assignee: Greg Mann
>Priority: Major
>
> Metrics for number of events sent by the master to the framework.





[jira] [Commented] (MESOS-8843) Per Framework CALL metrics

2018-07-03 Thread Greg Mann (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16531973#comment-16531973
 ] 

Greg Mann commented on MESOS-8843:
--

Latest review; replaces the one above: https://reviews.apache.org/r/67808

> Per Framework CALL metrics
> --
>
> Key: MESOS-8843
> URL: https://issues.apache.org/jira/browse/MESOS-8843
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Vinod Kone
>Assignee: Greg Mann
>Priority: Major
>
> Metrics about number of different kinds of calls sent by a framework to 
> master.





[jira] [Assigned] (MESOS-8843) Per Framework CALL metrics

2018-07-03 Thread Greg Mann (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-8843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-8843:


Assignee: Greg Mann  (was: Gilbert Song)

> Per Framework CALL metrics
> --
>
> Key: MESOS-8843
> URL: https://issues.apache.org/jira/browse/MESOS-8843
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Vinod Kone
>Assignee: Greg Mann
>Priority: Major
>
> Metrics about number of different kinds of calls sent by a framework to 
> master.





[jira] [Assigned] (MESOS-8846) Per Framework state metrics

2018-07-03 Thread Greg Mann (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-8846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-8846:


Assignee: Greg Mann

> Per Framework state metrics
> ---
>
> Key: MESOS-8846
> URL: https://issues.apache.org/jira/browse/MESOS-8846
> Project: Mesos
>  Issue Type: Task
>Reporter: Vinod Kone
>Assignee: Greg Mann
>Priority: Major
>
> Metrics about framework state (e.g., subscribed, suppressed etc).





[jira] [Assigned] (MESOS-9049) Agent GC could unmount a dangling persistent volume multiple times.

2018-07-03 Thread Chun-Hung Hsiao (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun-Hung Hsiao reassigned MESOS-9049:
--

Assignee: Zhitao Li  (was: Chun-Hung Hsiao)

> Agent GC could unmount a dangling persistent volume multiple times.
> ---
>
> Key: MESOS-9049
> URL: https://issues.apache.org/jira/browse/MESOS-9049
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.4.2, 1.5.2, 1.7.0, 1.6.1
>Reporter: Chun-Hung Hsiao
>Assignee: Zhitao Li
>Priority: Major
>
> When the agent GCs an executor dir and the sandbox of one of its runs that 
> contains a dangling persistent volume, the agent might try to unmount the 
> persistent volume twice, which leads to an {{EINVAL}} when trying to unmount 
> the target for the second time.
> Here is the log from a failed run of 
> {{GarbageCollectorIntegrationTest.ROOT_DanglingMount}}:
> {noformat}
> W0702 23:35:31.669946 25401 gc.cpp:241] Unmounting dangling mount point 
> '/tmp/GarbageCollectorIntegrationTest_ROOT_DanglingMount_zkItvU/slaves/f4dc0941-e3b0-4f2c-b7f9-025a1af264c8-S0/frameworks/f4dc0941-e3b0-4f2c-b7f9-025a1af264c8-/executors/test-task123/runs/3fcde2c8-b461-4f22-afec-daa269291c95/dangling'
>  of persistent volume 
> '/tmp/GarbageCollectorIntegrationTest_ROOT_DanglingMount_zkItvU/volumes/roles/default-role/persistence-id'
>  inside garbage collected path 
> '/tmp/GarbageCollectorIntegrationTest_ROOT_DanglingMount_zkItvU/slaves/f4dc0941-e3b0-4f2c-b7f9-025a1af264c8-S0/frameworks/f4dc0941-e3b0-4f2c-b7f9-025a1af264c8-/executors/test-task123'
> W0702 23:35:31.683878 25401 gc.cpp:241] Unmounting dangling mount point 
> '/tmp/GarbageCollectorIntegrationTest_ROOT_DanglingMount_zkItvU/slaves/f4dc0941-e3b0-4f2c-b7f9-025a1af264c8-S0/frameworks/f4dc0941-e3b0-4f2c-b7f9-025a1af264c8-/executors/test-task123/runs/3fcde2c8-b461-4f22-afec-daa269291c95/dangling'
>  of persistent volume 
> '/tmp/GarbageCollectorIntegrationTest_ROOT_DanglingMount_zkItvU/volumes/roles/default-role/persistence-id'
>  inside garbage collected path 
> '/tmp/GarbageCollectorIntegrationTest_ROOT_DanglingMount_zkItvU/slaves/f4dc0941-e3b0-4f2c-b7f9-025a1af264c8-S0/frameworks/f4dc0941-e3b0-4f2c-b7f9-025a1af264c8-'
> W0702 23:35:31.683912 25401 gc.cpp:248] Skipping deletion of 
> '/tmp/GarbageCollectorIntegrationTest_ROOT_DanglingMount_zkItvU/slaves/f4dc0941-e3b0-4f2c-b7f9-025a1af264c8-S0/frameworks/f4dc0941-e3b0-4f2c-b7f9-025a1af264c8-'
>  because unmount failed on 
> '/tmp/GarbageCollectorIntegrationTest_ROOT_DanglingMount_zkItvU/slaves/f4dc0941-e3b0-4f2c-b7f9-025a1af264c8-S0/frameworks/f4dc0941-e3b0-4f2c-b7f9-025a1af264c8-/executors/test-task123/runs/3fcde2c8-b461-4f22-afec-daa269291c95/dangling':
>  Failed to unmount 
> '/tmp/GarbageCollectorIntegrationTest_ROOT_DanglingMount_zkItvU/slaves/f4dc0941-e3b0-4f2c-b7f9-025a1af264c8-S0/frameworks/f4dc0941-e3b0-4f2c-b7f9-025a1af264c8-/executors/test-task123/runs/3fcde2c8-b461-4f22-afec-daa269291c95/dangling':
>  Invalid argument
> {noformat}





[jira] [Created] (MESOS-9049) Agent GC could unmount a dangling persistent volume multiple times.

2018-07-03 Thread Chun-Hung Hsiao (JIRA)
Chun-Hung Hsiao created MESOS-9049:
--

 Summary: Agent GC could unmount a dangling persistent volume 
multiple times.
 Key: MESOS-9049
 URL: https://issues.apache.org/jira/browse/MESOS-9049
 Project: Mesos
  Issue Type: Bug
  Components: agent
Affects Versions: 1.4.2, 1.5.2, 1.7.0, 1.6.1
Reporter: Chun-Hung Hsiao
Assignee: Chun-Hung Hsiao


When the agent GCs an executor dir and the sandbox of one of its runs that 
contains a dangling persistent volume, the agent might try to unmount the 
persistent volume twice, which leads to an {{EINVAL}} when trying to unmount 
the target for the second time.

Here is the log from a failed run of 
{{GarbageCollectorIntegrationTest.ROOT_DanglingMount}}:
{noformat}
W0702 23:35:31.669946 25401 gc.cpp:241] Unmounting dangling mount point 
'/tmp/GarbageCollectorIntegrationTest_ROOT_DanglingMount_zkItvU/slaves/f4dc0941-e3b0-4f2c-b7f9-025a1af264c8-S0/frameworks/f4dc0941-e3b0-4f2c-b7f9-025a1af264c8-/executors/test-task123/runs/3fcde2c8-b461-4f22-afec-daa269291c95/dangling'
 of persistent volume 
'/tmp/GarbageCollectorIntegrationTest_ROOT_DanglingMount_zkItvU/volumes/roles/default-role/persistence-id'
 inside garbage collected path 
'/tmp/GarbageCollectorIntegrationTest_ROOT_DanglingMount_zkItvU/slaves/f4dc0941-e3b0-4f2c-b7f9-025a1af264c8-S0/frameworks/f4dc0941-e3b0-4f2c-b7f9-025a1af264c8-/executors/test-task123'
W0702 23:35:31.683878 25401 gc.cpp:241] Unmounting dangling mount point 
'/tmp/GarbageCollectorIntegrationTest_ROOT_DanglingMount_zkItvU/slaves/f4dc0941-e3b0-4f2c-b7f9-025a1af264c8-S0/frameworks/f4dc0941-e3b0-4f2c-b7f9-025a1af264c8-/executors/test-task123/runs/3fcde2c8-b461-4f22-afec-daa269291c95/dangling'
 of persistent volume 
'/tmp/GarbageCollectorIntegrationTest_ROOT_DanglingMount_zkItvU/volumes/roles/default-role/persistence-id'
 inside garbage collected path 
'/tmp/GarbageCollectorIntegrationTest_ROOT_DanglingMount_zkItvU/slaves/f4dc0941-e3b0-4f2c-b7f9-025a1af264c8-S0/frameworks/f4dc0941-e3b0-4f2c-b7f9-025a1af264c8-'
W0702 23:35:31.683912 25401 gc.cpp:248] Skipping deletion of 
'/tmp/GarbageCollectorIntegrationTest_ROOT_DanglingMount_zkItvU/slaves/f4dc0941-e3b0-4f2c-b7f9-025a1af264c8-S0/frameworks/f4dc0941-e3b0-4f2c-b7f9-025a1af264c8-'
 because unmount failed on 
'/tmp/GarbageCollectorIntegrationTest_ROOT_DanglingMount_zkItvU/slaves/f4dc0941-e3b0-4f2c-b7f9-025a1af264c8-S0/frameworks/f4dc0941-e3b0-4f2c-b7f9-025a1af264c8-/executors/test-task123/runs/3fcde2c8-b461-4f22-afec-daa269291c95/dangling':
 Failed to unmount 
'/tmp/GarbageCollectorIntegrationTest_ROOT_DanglingMount_zkItvU/slaves/f4dc0941-e3b0-4f2c-b7f9-025a1af264c8-S0/frameworks/f4dc0941-e3b0-4f2c-b7f9-025a1af264c8-/executors/test-task123/runs/3fcde2c8-b461-4f22-afec-daa269291c95/dangling':
 Invalid argument
{noformat}
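
One way to make the unmount idempotent (a sketch under assumptions, not necessarily the fix that will land): consult {{/proc/self/mounts}} and skip the {{umount}} when the target is no longer mounted, so a second pass over a nested garbage-collected path does not fail with {{EINVAL}}.

{code}
#include <sys/mount.h>

#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Returns true if `target` currently appears as a mount point.
bool isMounted(const std::string& target)
{
  std::ifstream mounts("/proc/self/mounts");
  std::string line;
  while (std::getline(mounts, line)) {
    std::istringstream fields(line);
    std::string source, mountPoint;
    fields >> source >> mountPoint;
    if (mountPoint == target) {
      return true;
    }
  }
  return false;
}

void unmountIfMounted(const std::string& target)
{
  if (!isMounted(target)) {
    return; // Already unmounted by an earlier GC pass; nothing to do.
  }

  if (::umount(target.c_str()) != 0) {
    throw std::runtime_error("Failed to unmount '" + target + "'");
  }
}
{code}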





[jira] [Assigned] (MESOS-9007) XFS disk isolator doesn't clean up project ID from symlinks

2018-07-03 Thread Ilya Pronin (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilya Pronin reassigned MESOS-9007:
--

Assignee: Ilya Pronin

> XFS disk isolator doesn't clean up project ID from symlinks
> ---
>
> Key: MESOS-9007
> URL: https://issues.apache.org/jira/browse/MESOS-9007
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Ilya Pronin
>Assignee: Ilya Pronin
>Priority: Minor
>
> Upon container destruction, its project ID is deallocated by the isolator and 
> removed from the container work directory. However, the removing function 
> skips symbolic links, and because of that the project ID still exists until the 
> container directory is garbage collected. If the project ID is reused for a 
> new container, any lingering symlinks that still have that project ID will 
> contribute to the disk usage of the new container. Typically symlinks don't take 
> much space, but this still leads to inaccuracy in disk space usage accounting.





[jira] [Commented] (MESOS-8986) `slave.available()` in the allocator is expensive and drags down allocation performance.

2018-07-03 Thread Greg Mann (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16531697#comment-16531697
 ] 

Greg Mann commented on MESOS-8986:
--

Backports:

1.6.x:
{code}
commit 4e064011038d1afcb60e3374aa94dd01ac88f6b9
Author: Meng Zhu 
Date:   Thu Jun 21 09:09:39 2018 -0700

Modified `createStrippedScalarQuantity()` to clear all metadata fields.

Currently `createStrippedScalarQuantity()` strips resource meta-data
and transforms dynamic reservations into a static reservation.
However, no current code depends on the reservations in resources
returned by this helper function. This leads to boilerplate code
around call sites and performance overhead.

This patch updates the function to clear all reservation information.

Review: https://reviews.apache.org/r/67615/

commit 0d50c45bf19061e4c978641db1f7f9f99088dbae
Author: Meng Zhu 
Date:   Thu Jun 21 09:09:36 2018 -0700

Refactored `struct Slave` in the allocator for better performance.

This patch refactors the `struct Slave` in the allocator.
In particular, it addresses the slowness of computing
agents' available resources. Instead of calculating them
every time on the fly, this patch "denormalizes" the agent
available resources by updating and persisting the field
each time an agent's allocated or total resources change.

Review: https://reviews.apache.org/r/67561/
{code}

1.5.x:
{code}
commit e14a05fc0135697a41fd4c5ec4237ac195240736
Author: Meng Zhu 
Date:   Thu Jun 21 09:09:39 2018 -0700

Modified `createStrippedScalarQuantity()` to clear all metadata fields.

Currently `createStrippedScalarQuantity()` strips resource meta-data
and transforms dynamic reservations into a static reservation.
However, no current code depends on the reservations in resources
returned by this helper function. This leads to boilerplate code
around call sites and performance overhead.

This patch updates the function to clear all reservation information.

Review: https://reviews.apache.org/r/67615/

commit 762c78d5351e2cbef4dc5deb58389ee48e56ef4f
Author: Meng Zhu 
Date:   Thu Jun 21 09:09:36 2018 -0700

Refactored `struct Slave` in the allocator for better performance.

This patch refactors the `struct Slave` in the allocator.
In particular, it addresses the slowness of computing
agents' available resources. Instead of calculating them
every time on the fly, this patch "denormalizes" the agent
available resources by updating and persisting the field
each time an agent's allocated or total resources change.

Review: https://reviews.apache.org/r/67561/
{code}

1.4.x:
{code}
commit 8814f143b095addc3dffcc29dab275ba0b8f9d5d
Author: Meng Zhu 
Date:   Thu Jun 21 09:09:39 2018 -0700

Modified `createStrippedScalarQuantity()` to clear all metadata fields.

Currently `createStrippedScalarQuantity()` strips resource meta-data
and transforms dynamic reservations into a static reservation.
However, no current code depends on the reservations in resources
returned by this helper function. This leads to boilerplate code
around call sites and performance overhead.

This patch updates the function to clear all reservation information.

Review: https://reviews.apache.org/r/67615/

commit d5827547b3a83b79f4373199158bc40c0ae379d9
Author: Meng Zhu 
Date:   Thu Jun 21 09:09:36 2018 -0700

Refactored `struct Slave` in the allocator for better performance.

This patch refactors the `struct Slave` in the allocator.
In particular, it addresses the slowness of computing
agents' available resources. Instead of calculating them
every time on the fly, this patch "denormalizes" the agent
available resources by updating and persisting the field
each time an agent's allocated or total resources change.

Review: https://reviews.apache.org/r/67561/
{code}
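
A simplified, standalone sketch of the denormalization these commits describe (made-up types, not the allocator's actual `struct Slave`): keep `available` updated incrementally whenever allocated or total resources change, instead of recomputing `total - allocated` on every query in the allocation cycle.

{code}
// Simplified stand-in for mesos::Resources.
struct Resources
{
  double cpus = 0.0;
  double mem = 0.0;

  Resources& operator+=(const Resources& r) { cpus += r.cpus; mem += r.mem; return *this; }
  Resources& operator-=(const Resources& r) { cpus -= r.cpus; mem -= r.mem; return *this; }
};

struct Slave
{
  Resources total;
  Resources allocated;
  Resources available; // Kept in sync below instead of derived on the fly.

  void allocate(const Resources& r)   { allocated += r; available -= r; }
  void unallocate(const Resources& r) { allocated -= r; available += r; }

  void updateTotal(const Resources& newTotal)
  {
    // Adjust `available` only when the total actually changes.
    available += newTotal;
    available -= total;
    total = newTotal;
  }
};
{code}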

> `slave.available()` in the allocator is expensive and drags down allocation 
> performance.
> 
>
> Key: MESOS-8986
> URL: https://issues.apache.org/jira/browse/MESOS-8986
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Meng Zhu
>Assignee: Meng Zhu
>Priority: Major
> Fix For: 1.4.2, 1.5.2, 1.7.0, 1.6.1
>
>
> We noticed that the [`slave.available()` 
> function|https://github.com/apache/mesos/blob/d733b1031350e03bce443aa287044eb4eee1053a/src/master/allocator/mesos/hierarchical.hpp#L380-L388]
>  in the allocator is expensive and gets called many times in each allocation 
> cycle. In one of our profiling results, this function accounts for more than 
> 80% of the allocation time, dragging down allocator performance 
> significantly.
> One simple way to reduce the 

[jira] [Commented] (MESOS-9040) Break scheduler driver dependency on mesos-local.

2018-07-03 Thread Benjamin Hindman (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16531608#comment-16531608
 ] 

Benjamin Hindman commented on MESOS-9040:
-

As [~tillt] mentioned, it's meant to be a convenience for framework developers 
who don't want to have to spin up a Mesos cluster when doing local testing. 
This can dramatically decrease the friction for writing integration tests.

I also agree with [~tillt] that we could have users run {{mesos-local}} by 
themselves in order to do local testing or make their integration tests call 
{{mesos-local}}. If we really don't have any users using this that might be the 
most prudent path.

Alternatively, could we create a {{libmesos-local.so}} and then dynamically 
load that library only if we go down the local path?
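
To make the alternative concrete, a hedged sketch of the lazy-loading idea (both {{libmesos-local.so}} and the {{mesos_local_launch}} symbol are hypothetical names): the scheduler driver would only {{dlopen}} the local-cluster code when the local code path is actually taken, so {{libmesos.so}} would no longer need to link against it.

{code}
#include <dlfcn.h>

#include <stdexcept>
#include <string>

using LocalLaunchFn = void (*)();

void launchLocalCluster()
{
  // Load the local-cluster library only when the "local" path is taken.
  void* handle = ::dlopen("libmesos-local.so", RTLD_NOW | RTLD_LOCAL);
  if (handle == nullptr) {
    throw std::runtime_error(
        std::string("Failed to load local cluster library: ") + ::dlerror());
  }

  auto launch =
    reinterpret_cast<LocalLaunchFn>(::dlsym(handle, "mesos_local_launch"));
  if (launch == nullptr) {
    throw std::runtime_error(std::string("Missing symbol: ") + ::dlerror());
  }

  launch(); // Only schedulers that opt into local testing pay this cost.
}
{code}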

> Break scheduler driver dependency on mesos-local.
> -
>
> Key: MESOS-9040
> URL: https://issues.apache.org/jira/browse/MESOS-9040
> Project: Mesos
>  Issue Type: Task
>  Components: build, scheduler driver
>Reporter: James Peach
>Priority: Minor
>
> The scheduler driver in {{src/sched/sched.cpp}} has some special dependencies 
> on the {{mesos-local}} code. This seems fairly hacky, but it also causes 
> binary dependencies on {{src/local/local.cpp}} to be dragged into 
> {{libmesos.so}}. {{libmesos.so}} would not otherwise require this code, which 
> could be isolated in the {{mesos-local}} command.





[jira] [Commented] (MESOS-9031) Mesos CNI portmap plugins' iptables rules doesn't allow connections via host ip and port from the same bridge container network

2018-07-03 Thread Kirill Plyashkevich (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16531457#comment-16531457
 ] 

Kirill Plyashkevich commented on MESOS-9031:


[~qianzhang], do you have
{quote}"excludeDevices" : []{quote}
set in the config?
If `excludeDevices` contains the bridge (`mesos-cni0`), you get connection refused.

> Mesos CNI portmap plugins' iptables rules doesn't allow connections via host 
> ip and port from the same bridge container network
> ---
>
> Key: MESOS-9031
> URL: https://issues.apache.org/jira/browse/MESOS-9031
> Project: Mesos
>  Issue Type: Bug
>  Components: cni, containerization
>Affects Versions: 1.6.0
>Reporter: Kirill Plyashkevich
>Assignee: Qian Zhang
>Priority: Major
>
> Using `mesos-cni-port-mapper` with the following config:
> {noformat}
> {
>   "name": "dcos",
>   "type": "mesos-cni-port-mapper",
>   "excludeDevices": [],
>   "chain": "MESOS-CNI0-PORT-MAPPER",
>   "delegate": {
>     "type": "bridge",
>     "bridge": "mesos-cni0",
>     "isGateway": true,
>     "ipMasq": true,
>     "hairpinMode": true,
>     "ipam": {
>       "type": "host-local",
>       "ranges": [
>         [{"subnet": "172.26.0.0/16"}]
>       ],
>       "routes": [
>         {"dst": "0.0.0.0/0"}
>       ]
>     }
>   }
> }
> {noformat}
>  - 2 services running on the same mesos-slave using the unified containerizer 
> in different tasks and communicating via host ip and host port
>  - connection timeouts occur due to the iptables rules in the per-container 
> CNI-XXX chain
>  - the timeouts are actually caused by
> {noformat}
> Chain CNI-XXX (1 references)
> num  target      prot opt source    destination
> 1    ACCEPT      all  --  anywhere  172.26.0.0/16              /* name: "dcos" id: "" */
> 2    MASQUERADE  all  --  anywhere  !base-address.mcast.net/4  /* name: "dcos" id: "" */
> {noformat}
> Rule #1 is executed and no masquerading happens.
> There are multiple solutions:
>  - -the simplest and fastest one is not to add that ACCEPT- - NOT A SOLUTION: 
> the ACCEPT is added by the `bridge` plugin, and `cni/portmap` shows that 
> snat/masquerade should be done during portmapping as well.
>  - perhaps there's a better change to the iptables rules that can fix it
>  - the proper one (imho) is to finally implement CNI spec 0.3.x in order to be 
> able to chain plugins, using CNI's `bridge` and `portmap` plugins in a chain 
> (and eventually get rid of mesos-cni-port-mapper completely).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9031) Mesos CNI portmap plugins' iptables rules doesn't allow connections via host ip and port from the same bridge container network

2018-07-03 Thread Qian Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16531432#comment-16531432
 ] 

Qian Zhang commented on MESOS-9031:
---

[~Kirill P] I reproduced this issue with the following steps:
 # Launch a command task with the {{nginx:alpine}} image that joins the 
{{mesos-cni0}} bridge network and maps the host's port 8080 to the container's 
port 80.
 # Launch another command task that also joins the {{mesos-cni0}} bridge network 
and runs `curl hostIP:8080`.

I found that the second command task failed with a "connection refused" error 
rather than a timeout.

 

> Mesos CNI portmap plugins' iptables rules doesn't allow connections via host 
> ip and port from the same bridge container network
> ---
>
> Key: MESOS-9031
> URL: https://issues.apache.org/jira/browse/MESOS-9031
> Project: Mesos
>  Issue Type: Bug
>  Components: cni, containerization
>Affects Versions: 1.6.0
>Reporter: Kirill Plyashkevich
>Assignee: Qian Zhang
>Priority: Major
>
> Using `mesos-cni-port-mapper` with the following config:
> {noformat}
> {
>   "name": "dcos",
>   "type": "mesos-cni-port-mapper",
>   "excludeDevices": [],
>   "chain": "MESOS-CNI0-PORT-MAPPER",
>   "delegate": {
>     "type": "bridge",
>     "bridge": "mesos-cni0",
>     "isGateway": true,
>     "ipMasq": true,
>     "hairpinMode": true,
>     "ipam": {
>       "type": "host-local",
>       "ranges": [
>         [{"subnet": "172.26.0.0/16"}]
>       ],
>       "routes": [
>         {"dst": "0.0.0.0/0"}
>       ]
>     }
>   }
> }
> {noformat}
>  - 2 services running on the same mesos-slave using the unified containerizer 
> in different tasks and communicating via host ip and host port
>  - connection timeouts occur due to the iptables rules in the per-container 
> CNI-XXX chain
>  - the timeouts are actually caused by
> {noformat}
> Chain CNI-XXX (1 references)
> num  target      prot opt source    destination
> 1    ACCEPT      all  --  anywhere  172.26.0.0/16              /* name: "dcos" id: "" */
> 2    MASQUERADE  all  --  anywhere  !base-address.mcast.net/4  /* name: "dcos" id: "" */
> {noformat}
> Rule #1 is executed and no masquerading happens.
> There are multiple solutions:
>  - -the simplest and fastest one is not to add that ACCEPT- - NOT A SOLUTION: 
> the ACCEPT is added by the `bridge` plugin, and `cni/portmap` shows that 
> snat/masquerade should be done during portmapping as well.
>  - perhaps there's a better change to the iptables rules that can fix it
>  - the proper one (imho) is to finally implement CNI spec 0.3.x in order to be 
> able to chain plugins, using CNI's `bridge` and `portmap` plugins in a chain 
> (and eventually get rid of mesos-cni-port-mapper completely).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7441) RegisterSlaveValidationTest.DropInvalidRegistration is flaky

2018-07-03 Thread Jan Schlicht (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-7441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16531286#comment-16531286
 ] 

Jan Schlicht commented on MESOS-7441:
-

Reopened, as there was a recent test run (on {{master}}, SHA {{b50f6c8a}}) 
failing on CentOS 6 with
{noformat}
[ RUN  ] RegisterSlaveValidationTest.DropInvalidRegistration
I0703 11:44:46.746553 16172 cluster.cpp:173] Creating default 'local' authorizer
I0703 11:44:46.747535 16196 master.cpp:463] Master 
cce3860c-7d4f-4996-b865-fc8ce8302705 (ip-172-16-10-44.ec2.internal) started on 
172.16.10.44:33909
I0703 11:44:46.747611 16196 master.cpp:466] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="hierarchical" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authenticators="crammd5" 
--authorizers="local" --credentials="/tmp/dwPsJP/credentials" 
--filter_gpu_resources="true" --framework_sorter="drf" --help="false" 
--hostname_lookup="true" --http_authenticators="basic" 
--http_framework_authenticators="basic" --initialize_driver_logging="true" 
--log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
--max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
--max_completed_tasks_per_framework="1000" 
--max_unreachable_tasks_per_framework="1000" --memory_profiling="false" 
--min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--require_agent_domain="false" --role_sorter="drf" --root_submissions="true" 
--version="false" --webui_dir="/usr/local/share/mesos/webui" 
--work_dir="/tmp/dwPsJP/master" --zk_session_timeout="10secs"
I0703 11:44:46.747733 16196 master.cpp:515] Master only allowing authenticated 
frameworks to register
I0703 11:44:46.747748 16196 master.cpp:521] Master only allowing authenticated 
agents to register
I0703 11:44:46.747754 16196 master.cpp:527] Master only allowing authenticated 
HTTP frameworks to register
I0703 11:44:46.747761 16196 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/dwPsJP/credentials'
I0703 11:44:46.747872 16196 master.cpp:571] Using default 'crammd5' 
authenticator
I0703 11:44:46.747907 16196 http.cpp:959] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I0703 11:44:46.747944 16196 http.cpp:959] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I0703 11:44:46.747967 16196 http.cpp:959] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I0703 11:44:46.747997 16196 master.cpp:652] Authorization enabled
I0703 11:44:46.748157 16194 hierarchical.cpp:177] Initialized hierarchical 
allocator process
I0703 11:44:46.748183 16194 whitelist_watcher.cpp:77] No whitelist given
I0703 11:44:46.748715 16196 master.cpp:2162] Elected as the leading master!
I0703 11:44:46.748736 16196 master.cpp:1717] Recovering from registrar
I0703 11:44:46.748950 16196 registrar.cpp:339] Recovering registrar
I0703 11:44:46.749035 16196 registrar.cpp:383] Successfully fetched the 
registry (0B) in 68864ns
I0703 11:44:46.749059 16196 registrar.cpp:487] Applied 1 operations in 5058ns; 
attempting to update the registry
I0703 11:44:46.749349 16196 registrar.cpp:544] Successfully updated the 
registry in 275968ns
I0703 11:44:46.749385 16196 registrar.cpp:416] Successfully recovered registrar
I0703 11:44:46.749465 16196 master.cpp:1831] Recovered 0 agents from the 
registry (172B); allowing 10mins for agents to reregister
I0703 11:44:46.749589 16196 hierarchical.cpp:215] Skipping recovery of 
hierarchical allocator: nothing to recover
W0703 11:44:46.751214 16172 process.cpp:2824] Attempted to spawn already 
running process files@172.16.10.44:33909
I0703 11:44:46.751505 16172 containerizer.cpp:300] Using isolation { 
environment_secret, posix/cpu, posix/mem, filesystem/posix, network/cni }
I0703 11:44:46.753739 16172 linux_launcher.cpp:146] Using /cgroup/freezer as 
the freezer hierarchy for the Linux launcher
I0703 11:44:46.754091 16172 provisioner.cpp:298] Using default backend 'copy'
I0703 11:44:46.754447 16172 cluster.cpp:479] Creating default 'local' authorizer
I0703 11:44:46.754907 16195 slave.cpp:268] Mesos agent started on 
(361)@172.16.10.44:33909
I0703 11:44:46.754920 16195 slave.cpp:269] Flags at startup: --acls="" 
--appc_simple_discovery_uri_prefix="http://; 
--appc_store_dir="/tmp/RegisterSlaveValidationTest_DropInvalidRegistration_W7jYUL/store/appc"
 --authenticate_http_executors="true" --authenticate_http_readonly="true" 

[jira] [Comment Edited] (MESOS-9040) Break scheduler driver dependency on mesos-local.

2018-07-03 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530919#comment-16530919
 ] 

James Peach edited comment on MESOS-9040 at 7/3/18 7:25 AM:


{quote}
It is a convenience thing meant for framework developers - maybe we can achieve 
the same by exec'ing the mesos-local runnable if desired.
{quote}

Hmm, I never knew that. Our framework developers certainly don't know about it 
either. Do you know of anyone who does use it? Is there anything I can run to 
experiment with it?

If framework developers wanted to use {{mesos-local}}, why wouldn't they just 
exec the `mesos-local` process in their CI? 


was (Author: jamespeach):
{quote}
It is a convenience thing meant for framework developers - maybe we can achieve 
the same by exec'ing the mesos-local runnable if desired.
{quote}

Hmm, I never knew that. Our framework developers certainly don't know about it 
either. Do you know of anyone who does use it? Is there anything I can run to 
experiment with it?

If framework developers wanted to use {{mesos-local}}, why wouldn't they just 
exec the `mesos-local` process in their CI? 

> Break scheduler driver dependency on mesos-local.
> -
>
> Key: MESOS-9040
> URL: https://issues.apache.org/jira/browse/MESOS-9040
> Project: Mesos
>  Issue Type: Task
>  Components: build, scheduler driver
>Reporter: James Peach
>Priority: Minor
>
> The scheduler driver in {{src/sched/sched.cpp}} has some special dependencies 
> on the {{mesos-local}} code. This seems fairly hacky, but it also causes 
> binary dependencies on {{src/local/local.cpp}} to be dragged into 
> {{libmesos.so}}. {{libmesos.so}} would not otherwise require this code, which 
> could be isolated in the {{mesos-local}} command.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9040) Break scheduler driver dependency on mesos-local.

2018-07-03 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530919#comment-16530919
 ] 

James Peach commented on MESOS-9040:


{quote}
It is a convenience thing meant for framework developers - maybe we can achieve 
the same by exec'ing the mesos-local runnable if desired.
{quote}

Hmm, I never knew that. Our framework developers certainly don't know about it 
either. Do you know of anyone who does use it? Is there anything I can run to 
experiment with it?

If framework developers wanted to use {{mesos-local}}, why wouldn't they just 
exec the `mesos-local` process in their CI? 
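
For what it's worth, the exec-it-yourself approach in a test harness could be as small as the sketch below. This is only an illustration of the shape of it; the binary name and the absence of readiness checks and error handling are placeholders rather than a recommended harness.

{code}
// Hypothetical sketch: an integration test spawning mesos-local as a child process.
#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>

#include <stdexcept>

pid_t startMesosLocal()
{
  pid_t pid = fork();
  if (pid == -1) {
    throw std::runtime_error("fork failed");
  }

  if (pid == 0) {
    // Child: the binary name is a placeholder for whatever the CI image provides.
    execlp("mesos-local", "mesos-local", static_cast<char*>(nullptr));
    _exit(127);  // Only reached if exec fails.
  }

  // Parent: a real test would poll the master endpoint before running the framework.
  return pid;
}

void stopMesosLocal(pid_t pid)
{
  kill(pid, SIGTERM);
  waitpid(pid, nullptr, 0);
}
{code}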

> Break scheduler driver dependency on mesos-local.
> -
>
> Key: MESOS-9040
> URL: https://issues.apache.org/jira/browse/MESOS-9040
> Project: Mesos
>  Issue Type: Task
>  Components: build, scheduler driver
>Reporter: James Peach
>Priority: Minor
>
> The scheduler driver in {{src/sched/sched.cpp}} has some special dependencies 
> on the {{mesos-local}} code. This seems fairly hacky, but it also causes 
> binary dependencies on {{src/local/local.cpp}} to be dragged into 
> {{libmesos.so}}. {{libmesos.so}} would not otherwise require this code, which 
> could be isolated in the {{mesos-local}} command.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)