[jira] [Commented] (MESOS-9024) Mesos master segfaults with stack overflow under load.

2018-07-03 Thread Andrew Ruef (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16532110#comment-16532110
 ] 

Andrew Ruef commented on MESOS-9024:


Thanks! I'll check this out soon. I went to Plan B (dividing up the work manually 
using GNU parallel) and that task is still running, but when it's done I'll see 
what this fix does.

> Mesos master segfaults with stack overflow under load.
> --
>
> Key: MESOS-9024
> URL: https://issues.apache.org/jira/browse/MESOS-9024
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess, master
>Affects Versions: 1.6.0
> Environment: Ubuntu 16.04.4 
>Reporter: Andrew Ruef
>Assignee: Benjamin Mahler
>Priority: Blocker
> Fix For: 1.5.2, 1.6.1
>
> Attachments: stack.txt.gz
>
>
> When running Mesos in non-HA mode on a small cluster under load, the master 
> reliably segfaults due to some state it has worked itself into. The segfault 
> appears to be a stack overflow: the call stack in the crashing thread has 
> 72662 frames, and the root of the stack appears to be in libprocess. 
> I've attached a gzip-compressed stack backtrace, since the uncompressed 
> backtrace is too large to attach to this issue. This happens fairly reliably 
> when running jobs, but it can take many hours or days for Mesos to work 
> itself back into this state. 
> I think the following is the beginning of the repeating part of the stack trace: 
> {noformat}
> #72565 0x7fd748882c32 in lambda::CallableOnce<void (process::Future<unsigned long> const&)>::operator()(
>            process::Future<unsigned long> const&) && ()
>        at ../../mesos-1.6.0/3rdparty/stout/include/stout/lambda.hpp:443
> #72566 0x7fd7488776d2 in process::Future<unsigned long>::onAny(
>            lambda::CallableOnce<void (process::Future<unsigned long> const&)>&&) const ()
>        at ../../mesos-1.6.0/3rdparty/libprocess/include/process/future.hpp:1461
> #72567 0x7fd74a81b35c in process::Future<unsigned long>::onAny<std::_Bind<void (*(std::_Placeholder<1>,
>            char*, unsigned long, process::network::internal::Socket<...>, process::StreamingRequestDecoder*))(
>            process::Future<unsigned long> const&, char*, unsigned long,
>            process::network::internal::Socket<...>, process::StreamingRequestDecoder*)>, void>(
>            std::_Bind<...>&&, process::Future<unsigned long>::Prefer) const ()
>        at ../../../mesos-1.6.0/3rdparty/libprocess/include/process/future.hpp:312
> #72568 0x7fd74a80a5b3 in process::Future<unsigned long>::onAny<std::_Bind<void (*(std::_Placeholder<1>,
>            char*, unsigned long, process::network::internal::Socket<...>, process::StreamingRequestDecoder*))(
>            process::Future<unsigned long> const&, char*, unsigned long,
>            process::network::internal::Socket<...>, process::StreamingRequestDecoder*)> >(
>            std::_Bind<...>&&) const ()
>        at ../../../mesos-1.6.0/3rdparty/libprocess/include/process/future.hpp:382
> #72569 0x7fd74a7cff72 in process::internal::decode_recv ()
>        at ../../../mesos-1.6.0/3rdparty/libprocess/src/process.cpp:849
> #72570 0x7fd74a83d103 in std::_Bind<void (*(std::_Placeholder<1>, char*, unsigned long,
>            process::network::internal::Socket<...>, process::StreamingRequestDecoder*))(
>            process::Future<unsigned long> const&, char*, unsigned long,
>            process::network::internal::Socket<...>, process::StreamingRequestDecoder*)>::__call<void,
>            process::Future<unsigned long> const&, 0ul, 1ul, 2ul, 3ul, 4ul>(
>            std::tuple<process::Future<unsigned long> const&>&&, std::_Index_tuple<0ul, 1ul, 2ul, 3ul, 4ul>) ()
>        at /usr/include/c++/5/functional:1074
> #72571 0x7fd74a82afd2 in std::_Bind<void (*(std::_Placeholder<1>, char*, unsigned long,
>            process::network::internal::Socket<...>, process::StreamingRequestDecoder*))(
>            process::Future<unsigned long> const&, char*, unsigned long,
>            process::network::internal::Socket<...>, process::StreamingRequestDecoder*)>::operator()<
>            process::Future<unsigned long> const&, void>(process::Future<unsigned long> const&) ()
>        at /usr/include/c++/5/functional:1133
> #72572 0x7fd74a81b23c in process::Future<unsigned long> const&
>            process::Future<unsigned long>::onAny<std::_Bind<void (*(std::_Placeholder<1>, char*,
>            unsigned long, process::network::internal::Socket<...>, process::StreamingRequestDecoder*))(
>            process::Future<unsigned long> const&, char*, unsigned long,
>            process::network::internal::Socket<...>, process::StreamingRequestDecoder*)>, void>(
>            std::_Bind<...>&&, process::Future<unsigned long>::Prefer) 
> {noformat}
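
The repeating frames above alternate between process::internal::decode_recv and 
process::Future<unsigned long>::onAny, which looks like a receive loop whose callback is 
dispatched synchronously whenever the next read is already complete. The sketch below only 
illustrates that general pattern under that assumption; ReadyFuture, FastSocket, and 
decodeRecv are hypothetical stand-ins, not libprocess types.

{code:cpp}
// Minimal, self-contained sketch (hypothetical types, not libprocess code) of the
// recursion pattern suggested by the repeating frames above: if attaching a
// callback to an already-completed future dispatches it synchronously, a receive
// loop of the form recv() -> onAny(decode) -> recv() -> ... never unwinds, and
// every immediately-ready read adds more frames until the stack overflows.
#include <cstddef>
#include <cstdio>
#include <functional>

// Toy future that is always already complete: attaching a callback runs it
// immediately on the current stack instead of deferring it.
struct ReadyFuture {
  size_t length;
  void onAny(const std::function<void(const ReadyFuture&)>& callback) const {
    callback(*this);  // Synchronous dispatch keeps every caller's frame alive.
  }
};

// Stand-in for a socket whose next chunk of data is always immediately available.
struct FastSocket {
  ReadyFuture recv() { return ReadyFuture{512}; }
};

// Analogue of a decode_recv-style handler: handle the chunk, then issue the next
// recv() and re-attach itself as the callback.
void decodeRecv(const ReadyFuture& future, FastSocket* socket, size_t* depth) {
  ++*depth;
  if (*depth % 10000 == 0) {
    std::printf("recursion depth: %zu (last read: %zu bytes)\n", *depth, future.length);
  }
  // Because recv() is already ready, onAny() calls straight back into decodeRecv
  // before this frame returns: unbounded recursion.
  socket->recv().onAny(
      std::bind(&decodeRecv, std::placeholders::_1, socket, depth));
}

int main() {
  FastSocket socket;
  size_t depth = 0;
  // Kick off the "read loop"; with a peer that always has data ready, this never
  // unwinds and eventually dies with a stack overflow, much like the 72662-frame
  // backtrace attached to this issue.
  socket.recv().onAny(
      std::bind(&decodeRecv, std::placeholders::_1, &socket, &depth));
  return 0;
}
{code}

In general, this kind of recursion is avoided by deferring the callback onto a fresh 
stack or rewriting the chain as an iterative loop.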

[jira] [Commented] (MESOS-9024) Mesos master segfaults with stack overflow under load

2018-06-23 Thread Andrew Ruef (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16521106#comment-16521106
 ] 

Andrew Ruef commented on MESOS-9024:


I tried to find the first place where the stack trace starts repeating, and I 
added that sequence.

> Mesos master segfaults with stack overflow under load
> -
>
> Key: MESOS-9024
> URL: https://issues.apache.org/jira/browse/MESOS-9024
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess, master
>Affects Versions: 1.6.0
> Environment: Ubuntu 16.04.4 
>Reporter: Andrew Ruef
>Priority: Major
> Attachments: stack.txt.gz
>
>
> When running Mesos in non-HA mode on a small cluster under load, the master 
> reliably segfaults due to some state it has worked itself into. The segfault 
> appears to be a stack overflow: the call stack in the crashing thread has 
> 72662 frames, and the root of the stack appears to be in libprocess. 
> I've attached a gzip-compressed stack backtrace, since the uncompressed 
> backtrace is too large to attach to this issue. This happens fairly reliably 
> when running jobs, but it can take many hours or days for Mesos to work 
> itself back into this state. 

[jira] [Created] (MESOS-9024) Mesos master segfaults with stack overflow under load

2018-06-23 Thread Andrew Ruef (JIRA)
Andrew Ruef created MESOS-9024:
--

 Summary: Mesos master segfaults with stack overflow under load
 Key: MESOS-9024
 URL: https://issues.apache.org/jira/browse/MESOS-9024
 Project: Mesos
  Issue Type: Bug
  Components: libprocess, master
Affects Versions: 1.6.0
 Environment: Ubuntu 16.04.4 
Reporter: Andrew Ruef
 Attachments: stack.txt.gz

When running Mesos in non-HA mode on a small cluster under load, the master 
reliably segfaults due to some state it has worked itself into. The segfault 
appears to be a stack overflow: the call stack in the crashing thread has 
72662 frames, and the root of the stack appears to be in libprocess.

I've attached a gzip-compressed stack backtrace, since the uncompressed 
backtrace is too large to attach to this issue. This happens fairly reliably 
when running jobs, but it can take many hours or days for Mesos to work 
itself back into this state.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)