[jira] [Commented] (MESOS-9024) Mesos master segfaults with stack overflow under load.
[ https://issues.apache.org/jira/browse/MESOS-9024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16532110#comment-16532110 ]

Andrew Ruef commented on MESOS-9024:

Thanks! I'll check this out soon - I went to Plan B (divide up work manually using GNU parallel) and that task is still running, but when it's done I'll see what this fix does.

> Mesos master segfaults with stack overflow under load.
> --
>
> Key: MESOS-9024
> URL: https://issues.apache.org/jira/browse/MESOS-9024
> Project: Mesos
> Issue Type: Bug
> Components: libprocess, master
> Affects Versions: 1.6.0
> Environment: Ubuntu 16.04.4
> Reporter: Andrew Ruef
> Assignee: Benjamin Mahler
> Priority: Blocker
> Fix For: 1.5.2, 1.6.1
>
> Attachments: stack.txt.gz
>
>
> Running mesos in non-HA mode on a small cluster under load, the master
> reliably segfaults due to some state it has worked itself into. The segfault
> appears to be a stack overflow, at least, the call stack has 72662 elements
> in it in the crashing thread. The root of the stack appears to be in
> libprocess.
> I've attached a gzip compressed stack backtrace since the uncompressed stack
> backtrace is too large to attach to this issue. This happens to me fairly
> reliably when doing jobs, but it can take many hours or days for mesos to
> work itself back into this state.
> I think the below is the beginning of the repeating part of the stack trace:
> {noformat}
> #72565 0x7fd748882c32 in lambda::CallableOnce (process::Future const&)>::operator()(process::Future long> const&) && () at > ../../mesos-1.6.0/3rdparty/stout/include/stout/lambda.hpp:443}}
> {{#72566 0x7fd7488776d2 in process::Future long>::onAny(lambda::CallableOnce > const&)>&&) const () at > ../../mesos-1.6.0/3rdparty/libprocess/include/process/future.hpp:1461}}
> {{#72567 0x7fd74a81b35c in process::Future long>::onAny, char*, unsigned long, > process::network::internal::Socket, > process::StreamingRequestDecoder*))(process::Future const&, > char*, unsigned long, > process::network::internal::Socket, > process::StreamingRequestDecoder*)>, void>(std::_Bind (*(std::_Placeholder<1>, char*, unsigned long, > process::network::internal::Socket, > process::StreamingRequestDecoder*))(process::Future const&, > char*, unsigned long, > process::network::internal::Socket, > process::StreamingRequestDecoder*)>&&, process::Future long>::Prefer) const () at > ../../../mesos-1.6.0/3rdparty/libprocess/include/process/future.hpp:312}}
> {{#72568 0x7fd74a80a5b3 in process::Future long>::onAny, char*, unsigned long, > process::network::internal::Socket, > process::StreamingRequestDecoder*))(process::Future const&, > char*, unsigned long, > process::network::internal::Socket, > process::StreamingRequestDecoder*)> >(std::_Bind (*(std::_Placeholder<1>, char*, unsigned long, > process::network::internal::Socket, > process::StreamingRequestDecoder*))(process::Future const&, > char*, unsigned long, > process::network::internal::Socket, > process::StreamingRequestDecoder*)>&&) const () at > ../../../mesos-1.6.0/3rdparty/libprocess/include/process/future.hpp:382}}
> {{#72569 0x7fd74a7cff72 in process::internal::decode_recv () at > ../../../mesos-1.6.0/3rdparty/libprocess/src/process.cpp:849}}
> {{#72570 0x7fd74a83d103 in std::_Bind, > char*, unsigned long, > process::network::internal::Socket, > process::StreamingRequestDecoder*))(process::Future const&, > char*, unsigned long, > process::network::internal::Socket, > process::StreamingRequestDecoder*)>::__call long> const&, 0ul, 1ul, 2ul, 3ul, 4ul>(std::tuple long> const&>&&, std::_Index_tuple<0ul, 1ul, 2ul, 3ul, 4ul>) () at > /usr/include/c++/5/functional:1074}}
> {{#72571 0x7fd74a82afd2 in std::_Bind, > char*, unsigned long, > process::network::internal::Socket, > process::StreamingRequestDecoder*))(process::Future const&, > char*, unsigned long, > process::network::internal::Socket, > process::StreamingRequestDecoder*)>::operator() long> const&, void>(process::Future const&) () at > /usr/include/c++/5/functional:1133}}
> {{#72572 0x7fd74a81b23c in process::Future const& > process::Future::onAny (*(std::_Placeholder<1>, char*, unsigned long, > process::network::internal::Socket, > process::StreamingRequestDecoder*))(process::Future const&, > char*, unsigned long, > process::network::internal::Socket, > process::StreamingRequestDecoder*)>, void>(std::_Bind (*(std::_Placeholder<1>, char*, unsigned long, > process::network::internal::Socket, > process::StreamingRequestDecoder*))(process::Future const&, > char*, unsigned long, > process::network::internal::Socket, > process::StreamingRequestDecoder*)>&&, process::Future long>::Prefer)
[jira] [Commented] (MESOS-9024) Mesos master segfaults with stack overflow under load
[ https://issues.apache.org/jira/browse/MESOS-9024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16521106#comment-16521106 ]

Andrew Ruef commented on MESOS-9024:

I tried to find the first place where the stack trace starts repeating and added that sequence to the description.

> Mesos master segfaults with stack overflow under load
> -
>
> Key: MESOS-9024
> URL: https://issues.apache.org/jira/browse/MESOS-9024
> Project: Mesos
> Issue Type: Bug
> Components: libprocess, master
> Affects Versions: 1.6.0
> Environment: Ubuntu 16.04.4
> Reporter: Andrew Ruef
> Priority: Major
> Attachments: stack.txt.gz
>
>
> Running mesos in non-HA mode on a small cluster under load, the master
> reliably segfaults due to some state it has worked itself into. The segfault
> appears to be a stack overflow, at least, the call stack has 72662 elements
> in it in the crashing thread. The root of the stack appears to be in
> libprocess.
> I've attached a gzip compressed stack backtrace since the uncompressed stack
> backtrace is too large to attach to this issue. This happens to me fairly
> reliably when doing jobs, but it can take many hours or days for mesos to
> work itself back into this state.
> I think the below is the beginning of the repeating part of the stack trace:
> {{#72565 0x7fd748882c32 in lambda::CallableOnce (process::Future const&)>::operator()(process::Future long> const&) && () at > ../../mesos-1.6.0/3rdparty/stout/include/stout/lambda.hpp:443}}
> {{#72566 0x7fd7488776d2 in process::Future long>::onAny(lambda::CallableOnce > const&)>&&) const () at > ../../mesos-1.6.0/3rdparty/libprocess/include/process/future.hpp:1461}}
> {{#72567 0x7fd74a81b35c in process::Future long>::onAny, char*, unsigned long, > process::network::internal::Socket, > process::StreamingRequestDecoder*))(process::Future const&, > char*, unsigned long, > process::network::internal::Socket, > process::StreamingRequestDecoder*)>, void>(std::_Bind (*(std::_Placeholder<1>, char*, unsigned long, > process::network::internal::Socket, > process::StreamingRequestDecoder*))(process::Future const&, > char*, unsigned long, > process::network::internal::Socket, > process::StreamingRequestDecoder*)>&&, process::Future long>::Prefer) const () at > ../../../mesos-1.6.0/3rdparty/libprocess/include/process/future.hpp:312}}
> {{#72568 0x7fd74a80a5b3 in process::Future long>::onAny, char*, unsigned long, > process::network::internal::Socket, > process::StreamingRequestDecoder*))(process::Future const&, > char*, unsigned long, > process::network::internal::Socket, > process::StreamingRequestDecoder*)> >(std::_Bind (*(std::_Placeholder<1>, char*, unsigned long, > process::network::internal::Socket, > process::StreamingRequestDecoder*))(process::Future const&, > char*, unsigned long, > process::network::internal::Socket, > process::StreamingRequestDecoder*)>&&) const () at > ../../../mesos-1.6.0/3rdparty/libprocess/include/process/future.hpp:382}}
> {{#72569 0x7fd74a7cff72 in process::internal::decode_recv () at > ../../../mesos-1.6.0/3rdparty/libprocess/src/process.cpp:849}}
> {{#72570 0x7fd74a83d103 in std::_Bind, > char*, unsigned long, > process::network::internal::Socket, > process::StreamingRequestDecoder*))(process::Future const&, > char*, unsigned long, > process::network::internal::Socket, > process::StreamingRequestDecoder*)>::__call long> const&, 0ul, 1ul, 2ul, 3ul, 4ul>(std::tuple long> const&>&&, std::_Index_tuple<0ul, 1ul, 2ul, 3ul, 4ul>) () at > /usr/include/c++/5/functional:1074}}
> {{#72571 0x7fd74a82afd2 in std::_Bind, > char*, unsigned long, > process::network::internal::Socket, > process::StreamingRequestDecoder*))(process::Future const&, > char*, unsigned long, > process::network::internal::Socket, > process::StreamingRequestDecoder*)>::operator() long> const&, void>(process::Future const&) () at > /usr/include/c++/5/functional:1133}}
> {{#72572 0x7fd74a81b23c in process::Future const& > process::Future::onAny (*(std::_Placeholder<1>, char*, unsigned long, > process::network::internal::Socket, > process::StreamingRequestDecoder*))(process::Future const&, > char*, unsigned long, > process::network::internal::Socket, > process::StreamingRequestDecoder*)>, void>(std::_Bind (*(std::_Placeholder<1>, char*, unsigned long, > process::network::internal::Socket, > process::StreamingRequestDecoder*))(process::Future const&, > char*, unsigned long, > process::network::internal::Socket, > process::StreamingRequestDecoder*)>&&, process::Future long>::Prefer) const::\{lambda(std::_Bind, > char*, unsigned long, > process::network::internal::Socket, > process::StreamingRequestDecoder*))(process::Future const&, > char*, unsigned long, >
[jira] [Created] (MESOS-9024) Mesos master segfaults with stack overflow under load
Andrew Ruef created MESOS-9024:
--

Summary: Mesos master segfaults with stack overflow under load
Key: MESOS-9024
URL: https://issues.apache.org/jira/browse/MESOS-9024
Project: Mesos
Issue Type: Bug
Components: libprocess, master
Affects Versions: 1.6.0
Environment: Ubuntu 16.04.4
Reporter: Andrew Ruef
Attachments: stack.txt.gz

Running mesos in non-HA mode on a small cluster under load, the master reliably segfaults due to some state it has worked itself into. The segfault appears to be a stack overflow, at least, the call stack has 72662 elements in it in the crashing thread. The root of the stack appears to be in libprocess.

I've attached a gzip compressed stack backtrace since the uncompressed stack backtrace is too large to attach to this issue. This happens to me fairly reliably when doing jobs, but it can take many hours or days for mesos to work itself back into this state.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
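
The repeating frames quoted earlier in this thread (decode_recv, then Future::onAny, then CallableOnce, then std::_Bind, then decode_recv again) suggest a receive handler that re-registers itself from inside its own callback. If the future is already ready when the callback is registered, the callback runs synchronously and every read adds stack frames. The sketch below is a toy model of that shape and of an iterative alternative. ReadyFuture, ToySocket, Depth, handleRecursive, maxDepthRecursive and maxDepthIterative are hypothetical names invented for illustration; none of this is libprocess code, and it is not necessarily how the issue was fixed.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Hypothetical stand-in for a future that is ALREADY ready when a callback
// is registered. This is not libprocess's process::Future.
struct ReadyFuture {
  std::size_t bytes;  // 0 models end-of-stream

  // The result is already available, so the callback runs synchronously on
  // the caller's stack instead of being deferred.
  void onAny(const std::function<void(const ReadyFuture&)>& callback) const {
    callback(*this);
  }
};

// Hypothetical socket whose recv() always completes immediately.
struct ToySocket {
  std::size_t remaining;  // number of reads that still return data

  ReadyFuture recv() {
    if (remaining == 0) return ReadyFuture{0};
    --remaining;
    return ReadyFuture{1};
  }
};

// Tracks how many handler invocations are live on the stack at once.
struct Depth {
  std::size_t current = 0;
  std::size_t max = 0;
  void enter() { if (++current > max) max = current; }
  void leave() { assert(current > 0); --current; }
};

// Recursive shape: the handler issues the next read from inside its own
// callback, so each immediately-ready read nests one level deeper.
void handleRecursive(ToySocket& socket, Depth& depth, const ReadyFuture& result) {
  depth.enter();
  if (result.bytes != 0) {
    socket.recv().onAny([&socket, &depth](const ReadyFuture& next) {
      handleRecursive(socket, depth, next);
    });
  }
  depth.leave();
}

std::size_t maxDepthRecursive(std::size_t reads) {
  ToySocket socket{reads};
  Depth depth;
  handleRecursive(socket, depth, socket.recv());
  return depth.max;
}

// Iterative shape: the handler returns before the next read is issued, so
// stack usage stays constant no matter how many reads are ready.
std::size_t maxDepthIterative(std::size_t reads) {
  ToySocket socket{reads};
  Depth depth;
  ReadyFuture result = socket.recv();
  for (;;) {
    depth.enter();
    const bool more = result.bytes != 0;
    depth.leave();
    if (!more) break;
    result = socket.recv();
  }
  return depth.max;
}
```

In this model maxDepthRecursive(1000) returns 1001 (one nested handler per read plus the final end-of-stream call), while maxDepthIterative(1000) returns 1. A real stack has a fixed size, so the recursive shape eventually overflows once enough reads complete back to back, which would be consistent with the 72662-frame backtrace reported above.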