[jira] [Commented] (MESOS-9024) Mesos master segfaults with stack overflow under load.

2018-07-03 Thread Andrew Ruef (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16532110#comment-16532110
 ] 

Andrew Ruef commented on MESOS-9024:


Thanks! I'll check this out soon. I went to Plan B (dividing up the work 
manually with GNU parallel) and that task is still running, but when it's done 
I'll see what this fix does. 

> Mesos master segfaults with stack overflow under load.
> --
>
> Key: MESOS-9024
> URL: https://issues.apache.org/jira/browse/MESOS-9024
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess, master
>Affects Versions: 1.6.0
> Environment: Ubuntu 16.04.4 
>Reporter: Andrew Ruef
>Assignee: Benjamin Mahler
>Priority: Blocker
> Fix For: 1.5.2, 1.6.1
>
> Attachments: stack.txt.gz
>
>
> Running Mesos in non-HA mode on a small cluster under load, the master 
> reliably segfaults due to some state it has worked itself into. The segfault 
> appears to be a stack overflow; at the least, the call stack in the crashing 
> thread has 72662 frames. The root of the stack appears to be in libprocess. 
> I've attached a gzip-compressed stack backtrace since the uncompressed 
> backtrace is too large to attach to this issue. This happens to me fairly 
> reliably when running jobs, but it can take many hours or days for Mesos to 
> work itself back into this state. 
> I think the below is the beginning of the repeating part of the stack trace: 
> {noformat}
> #72565 0x7fd748882c32 in lambda::CallableOnce (process::Future const&)>::operator()(process::Future long> const&) && () at 
> ../../mesos-1.6.0/3rdparty/stout/include/stout/lambda.hpp:443}}
> {{#72566 0x7fd7488776d2 in process::Future long>::onAny(lambda::CallableOnce 
> const&)>&&) const () at 
> ../../mesos-1.6.0/3rdparty/libprocess/include/process/future.hpp:1461}}
> {{#72567 0x7fd74a81b35c in process::Future long>::onAny, char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*))(process::Future const&, 
> char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*)>, void>(std::_Bind (*(std::_Placeholder<1>, char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*))(process::Future const&, 
> char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*)>&&, process::Future long>::Prefer) const () at 
> ../../../mesos-1.6.0/3rdparty/libprocess/include/process/future.hpp:312}}
> {{#72568 0x7fd74a80a5b3 in process::Future long>::onAny, char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*))(process::Future const&, 
> char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*)> >(std::_Bind (*(std::_Placeholder<1>, char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*))(process::Future const&, 
> char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*)>&&) const () at 
> ../../../mesos-1.6.0/3rdparty/libprocess/include/process/future.hpp:382}}
> {{#72569 0x7fd74a7cff72 in process::internal::decode_recv () at 
> ../../../mesos-1.6.0/3rdparty/libprocess/src/process.cpp:849}}
> {{#72570 0x7fd74a83d103 in std::_Bind, 
> char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*))(process::Future const&, 
> char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*)>::__call long> const&, 0ul, 1ul, 2ul, 3ul, 4ul>(std::tuple long> const&>&&, std::_Index_tuple<0ul, 1ul, 2ul, 3ul, 4ul>) () at 
> /usr/include/c++/5/functional:1074}}
> {{#72571 0x7fd74a82afd2 in std::_Bind, 
> char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*))(process::Future const&, 
> char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*)>::operator() long> const&, void>(process::Future const&) () at 
> /usr/include/c++/5/functional:1133}}
> {{#72572 0x7fd74a81b23c in process::Future const& 
> process::Future::onAny (*(std::_Placeholder<1>, char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*))(process::Future const&, 
> char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*)>, void>(std::_Bind (*(std::_Placeholder<1>, char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*))(process::Future const&, 
> char*, unsigned long, 
> process::network::internal::Socket, 
> process::StreamingRequestDecoder*)>&&, process::Future long>::Prefer) 

[jira] [Commented] (MESOS-9024) Mesos master segfaults with stack overflow under load

2018-07-03 Thread Benjamin Mahler (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16532048#comment-16532048
 ] 

Benjamin Mahler commented on MESOS-9024:


It looks like the socket receive path is also prone to stack overflow and needs 
a fix similar to the one made on the sending side in MESOS-8594. The overflow 
can occur when a socket is always readable. This likely affects every 
supported version but, much like MESOS-8594, we can backport the fix to 1.5.x 
and 1.6.x but not to 1.4.x.
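
For illustration, here is a minimal C++ sketch of this failure mode. It is not 
the libprocess code or the actual patch: `ReadyFuture`, `AlwaysReadableSocket`, 
`DepthTracker`, `recvRecursive` and `recvLoop` are hypothetical names, and the 
sketch assumes that a callback registered on an already-completed future runs 
immediately on the caller's stack. Under that assumption, a receive callback 
that starts the next receive nests one level deeper per read, while a loop 
keeps the depth constant.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>

// Hypothetical stand-in for a future that is already complete: registering
// a callback runs it immediately, on the caller's stack (an assumption).
struct ReadyFuture {
  std::size_t value;
  void onAny(const std::function<void(std::size_t)>& callback) const {
    callback(value);
  }
};

// Hypothetical socket that is always readable until `remaining` reaches zero.
struct AlwaysReadableSocket {
  std::size_t remaining;
  ReadyFuture recv() {
    if (remaining == 0) return ReadyFuture{0};  // 0 bytes: end of stream.
    --remaining;
    return ReadyFuture{1};  // 1 byte was available right away.
  }
};

// Records how deeply the receive function is nested at any one time.
struct DepthTracker {
  std::size_t current = 0;
  std::size_t max = 0;
};

// Recursive shape: the callback starts the next read from inside the
// previous read's callback, so every completed read adds stack frames.
void recvRecursive(AlwaysReadableSocket& socket, DepthTracker& depth) {
  ++depth.current;
  depth.max = std::max(depth.max, depth.current);
  socket.recv().onAny([&](std::size_t length) {
    if (length > 0) recvRecursive(socket, depth);
  });
  --depth.current;
}

// Loop shape: the callback only records whether to continue, and the next
// read is issued after the callback has returned, so the depth stays flat.
void recvLoop(AlwaysReadableSocket& socket, DepthTracker& depth) {
  ++depth.current;
  depth.max = std::max(depth.max, depth.current);
  bool more = true;
  while (more) {
    socket.recv().onAny([&](std::size_t length) { more = length > 0; });
  }
  --depth.current;
}
```

With a socket that has 1000 reads available, `recvRecursive` reaches a nesting 
depth of 1001, whereas `recvLoop` never goes deeper than 1. With enough 
back-to-back reads, the recursive shape eventually exhausts the thread's stack.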

> Mesos master segfaults with stack overflow under load
> -
>
> Key: MESOS-9024
> URL: https://issues.apache.org/jira/browse/MESOS-9024
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess, master
>Affects Versions: 1.6.0
> Environment: Ubuntu 16.04.4 
>Reporter: Andrew Ruef
>Priority: Blocker
> Attachments: stack.txt.gz
>
>
> Running Mesos in non-HA mode on a small cluster under load, the master 
> reliably segfaults due to some state it has worked itself into. The segfault 
> appears to be a stack overflow; at the least, the call stack in the crashing 
> thread has 72662 frames. The root of the stack appears to be in libprocess. 
> I've attached a gzip-compressed stack backtrace since the uncompressed 
> backtrace is too large to attach to this issue. This happens to me fairly 
> reliably when running jobs, but it can take many hours or days for Mesos to 
> work itself back into this state. 

[jira] [Commented] (MESOS-9024) Mesos master segfaults with stack overflow under load

2018-06-23 Thread Andrew Ruef (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16521106#comment-16521106
 ] 

Andrew Ruef commented on MESOS-9024:


I tried to find the first place where the stack trace starts repeating, and I 
added that sequence to the description.

> Mesos master segfaults with stack overflow under load
> -
>
> Key: MESOS-9024
> URL: https://issues.apache.org/jira/browse/MESOS-9024
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess, master
>Affects Versions: 1.6.0
> Environment: Ubuntu 16.04.4 
>Reporter: Andrew Ruef
>Priority: Major
> Attachments: stack.txt.gz
>
>
> Running Mesos in non-HA mode on a small cluster under load, the master 
> reliably segfaults due to some state it has worked itself into. The segfault 
> appears to be a stack overflow; at the least, the call stack in the crashing 
> thread has 72662 frames. The root of the stack appears to be in libprocess. 
> I've attached a gzip-compressed stack backtrace since the uncompressed 
> backtrace is too large to attach to this issue. This happens to me fairly 
> reliably when running jobs, but it can take many hours or days for Mesos to 
> work itself back into this state. 

[jira] [Commented] (MESOS-9024) Mesos master segfaults with stack overflow under load

2018-06-23 Thread Andrei Budnik (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16521101#comment-16521101
 ] 

Andrei Budnik commented on MESOS-9024:
--

Could you please add the repeating part of the stack trace to the description?

> Mesos master segfaults with stack overflow under load
> -
>
> Key: MESOS-9024
> URL: https://issues.apache.org/jira/browse/MESOS-9024
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess, master
>Affects Versions: 1.6.0
> Environment: Ubuntu 16.04.4 
>Reporter: Andrew Ruef
>Priority: Major
> Attachments: stack.txt.gz
>
>
> Running Mesos in non-HA mode on a small cluster under load, the master 
> reliably segfaults due to some state it has worked itself into. The segfault 
> appears to be a stack overflow; at the least, the call stack in the crashing 
> thread has 72662 frames. The root of the stack appears to be in libprocess. 
> I've attached a gzip-compressed stack backtrace since the uncompressed 
> backtrace is too large to attach to this issue. This happens to me fairly 
> reliably when running jobs, but it can take many hours or days for Mesos to 
> work itself back into this state. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)