[jira] [Commented] (MESOS-9024) Mesos master segfaults with stack overflow under load.
[ https://issues.apache.org/jira/browse/MESOS-9024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16532110#comment-16532110 ]

Andrew Ruef commented on MESOS-9024:

Thanks! I'll check this out soon - I went to Plan B (divide up work manually
using GNU parallel) and that task is still running, but when it's done I'll
see what this fix does.

> Mesos master segfaults with stack overflow under load.
> ------------------------------------------------------
>
>                 Key: MESOS-9024
>                 URL: https://issues.apache.org/jira/browse/MESOS-9024
>             Project: Mesos
>          Issue Type: Bug
>          Components: libprocess, master
>    Affects Versions: 1.6.0
>         Environment: Ubuntu 16.04.4
>            Reporter: Andrew Ruef
>            Assignee: Benjamin Mahler
>            Priority: Blocker
>             Fix For: 1.5.2, 1.6.1
>
>         Attachments: stack.txt.gz
>
>
> Running mesos in non-HA mode on a small cluster under load, the master
> reliably segfaults due to some state it has worked itself into. The segfault
> appears to be a stack overflow, at least, the call stack has 72662 elements
> in it in the crashing thread. The root of the stack appears to be in
> libprocess.
> I've attached a gzip compressed stack backtrace since the uncompressed stack
> backtrace is too large to attach to this issue. This happens to me fairly
> reliably when doing jobs, but it can take many hours or days for mesos to
> work itself back into this state.
> I think the below is the beginning of the repeating part of the stack trace:
> {noformat}
> #72565 0x7fd748882c32 in lambda::CallableOnce (process::Future const&)>::operator()(process::Future long> const&) && () at ../../mesos-1.6.0/3rdparty/stout/include/stout/lambda.hpp:443
> #72566 0x7fd7488776d2 in process::Future long>::onAny(lambda::CallableOnce const&)>&&) const () at ../../mesos-1.6.0/3rdparty/libprocess/include/process/future.hpp:1461
> #72567 0x7fd74a81b35c in process::Future long>::onAny, char*, unsigned long, process::network::internal::Socket, process::StreamingRequestDecoder*))(process::Future const&, char*, unsigned long, process::network::internal::Socket, process::StreamingRequestDecoder*)>, void>(std::_Bind (*(std::_Placeholder<1>, char*, unsigned long, process::network::internal::Socket, process::StreamingRequestDecoder*))(process::Future const&, char*, unsigned long, process::network::internal::Socket, process::StreamingRequestDecoder*)>&&, process::Future long>::Prefer) const () at ../../../mesos-1.6.0/3rdparty/libprocess/include/process/future.hpp:312
> #72568 0x7fd74a80a5b3 in process::Future long>::onAny, char*, unsigned long, process::network::internal::Socket, process::StreamingRequestDecoder*))(process::Future const&, char*, unsigned long, process::network::internal::Socket, process::StreamingRequestDecoder*)> >(std::_Bind (*(std::_Placeholder<1>, char*, unsigned long, process::network::internal::Socket, process::StreamingRequestDecoder*))(process::Future const&, char*, unsigned long, process::network::internal::Socket, process::StreamingRequestDecoder*)>&&) const () at ../../../mesos-1.6.0/3rdparty/libprocess/include/process/future.hpp:382
> #72569 0x7fd74a7cff72 in process::internal::decode_recv () at ../../../mesos-1.6.0/3rdparty/libprocess/src/process.cpp:849
> #72570 0x7fd74a83d103 in std::_Bind, char*, unsigned long, process::network::internal::Socket, process::StreamingRequestDecoder*))(process::Future const&, char*, unsigned long, process::network::internal::Socket, process::StreamingRequestDecoder*)>::__call long> const&, 0ul, 1ul, 2ul, 3ul, 4ul>(std::tuple long> const&>&&, std::_Index_tuple<0ul, 1ul, 2ul, 3ul, 4ul>) () at /usr/include/c++/5/functional:1074
> #72571 0x7fd74a82afd2 in std::_Bind, char*, unsigned long, process::network::internal::Socket, process::StreamingRequestDecoder*))(process::Future const&, char*, unsigned long, process::network::internal::Socket, process::StreamingRequestDecoder*)>::operator() long> const&, void>(process::Future const&) () at /usr/include/c++/5/functional:1133
> #72572 0x7fd74a81b23c in process::Future const& process::Future::onAny (*(std::_Placeholder<1>, char*, unsigned long, process::network::internal::Socket, process::StreamingRequestDecoder*))(process::Future const&, char*, unsigned long, process::network::internal::Socket, process::StreamingRequestDecoder*)>, void>(std::_Bind (*(std::_Placeholder<1>, char*, unsigned long, process::network::internal::Socket, process::StreamingRequestDecoder*))(process::Future const&, char*, unsigned long, process::network::internal::Socket, process::StreamingRequestDecoder*)>&&, process::Future long>::Prefer)
[jira] [Commented] (MESOS-9024) Mesos master segfaults with stack overflow under load
[ https://issues.apache.org/jira/browse/MESOS-9024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16532048#comment-16532048 ]

Benjamin Mahler commented on MESOS-9024:

It looks like the socket receive path is also prone to stack overflow and
needs a fix similar to the one made on the sending side in MESOS-8594. This
can occur when a socket is always readable. This likely affects every
supported version but, much like MESOS-8594, we can backport the fix to 1.5.x
and 1.6.x but not to 1.4.x.

> Mesos master segfaults with stack overflow under load
> -----------------------------------------------------
>
>                 Key: MESOS-9024
>                 URL: https://issues.apache.org/jira/browse/MESOS-9024
>             Project: Mesos
>          Issue Type: Bug
>          Components: libprocess, master
>    Affects Versions: 1.6.0
>         Environment: Ubuntu 16.04.4
>            Reporter: Andrew Ruef
>            Priority: Blocker
>         Attachments: stack.txt.gz
>
>
> Running mesos in non-HA mode on a small cluster under load, the master
> reliably segfaults due to some state it has worked itself into. The segfault
> appears to be a stack overflow, at least, the call stack has 72662 elements
> in it in the crashing thread. The root of the stack appears to be in
> libprocess.
> I've attached a gzip compressed stack backtrace since the uncompressed stack
> backtrace is too large to attach to this issue. This happens to me fairly
> reliably when doing jobs, but it can take many hours or days for mesos to
> work itself back into this state.
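To illustrate the failure mode described in Benjamin Mahler's comment above, here is a minimal sketch. It assumes a toy future whose callback runs immediately because the result is already available, which is what an always-readable socket amounts to. None of this is libprocess code: `ReadyFuture`, `AlwaysReadableSocket`, `Depth`, `receiveRecursively` and `receiveInLoop` are hypothetical names made up for the illustration, and the loop variant is only one plausible shape for a fix, not necessarily what was done in MESOS-8594.

```cpp
#include <cstddef>
#include <functional>

// Hypothetical stand-in for an already-completed future: registering a
// callback runs it right away, on the caller's stack.
struct ReadyFuture {
  std::size_t length;

  void onAny(const std::function<void(std::size_t)>& callback) const {
    callback(length);
  }
};

// Hypothetical socket that never has to wait for data. It hands out
// `remaining` one-byte reads and then reports end of stream (0 bytes).
struct AlwaysReadableSocket {
  std::size_t remaining;

  ReadyFuture recv() {
    if (remaining == 0) {
      return ReadyFuture{0};
    }
    --remaining;
    return ReadyFuture{1};
  }
};

// Records how many receive calls are active at once. This stands in for the
// depth of the real call stack.
struct Depth {
  std::size_t current = 0;
  std::size_t max = 0;

  void enter() { if (++current > max) max = current; }
  void leave() { --current; }
};

// Shape suggested by the backtrace: the receive callback starts the next
// receive itself, so every read that completes immediately nests one level
// deeper.
void receiveRecursively(AlwaysReadableSocket& socket, Depth& depth) {
  depth.enter();
  socket.recv().onAny([&socket, &depth](std::size_t length) {
    if (length > 0) {
      receiveRecursively(socket, depth);
    }
  });
  depth.leave();
}

// Assumed alternative: the callback only records whether to continue, and an
// outer loop issues the next receive after the previous call has returned.
void receiveInLoop(AlwaysReadableSocket& socket, Depth& depth) {
  bool more = true;
  while (more) {
    depth.enter();
    socket.recv().onAny([&more](std::size_t length) { more = length > 0; });
    depth.leave();
  }
}

// Helpers that report the deepest nesting reached for a given number of reads.
std::size_t maxDepthRecursive(std::size_t reads) {
  AlwaysReadableSocket socket{reads};
  Depth depth;
  receiveRecursively(socket, depth);
  return depth.max;
}

std::size_t maxDepthInLoop(std::size_t reads) {
  AlwaysReadableSocket socket{reads};
  Depth depth;
  receiveInLoop(socket, depth);
  return depth.max;
}
```

In this sketch the nesting of the recursive variant grows by one for every read that completes immediately, so a long enough run of ready reads would exhaust a real stack, while the loop variant stays at a depth of one regardless of how many reads are ready.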
[jira] [Commented] (MESOS-9024) Mesos master segfaults with stack overflow under load
[ https://issues.apache.org/jira/browse/MESOS-9024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16521106#comment-16521106 ]

Andrew Ruef commented on MESOS-9024:

I tried to find the first place where the stack trace starts repeating, and I
added that sequence.

> Mesos master segfaults with stack overflow under load
> -----------------------------------------------------
>
>                 Key: MESOS-9024
>                 URL: https://issues.apache.org/jira/browse/MESOS-9024
>             Project: Mesos
>          Issue Type: Bug
>          Components: libprocess, master
>    Affects Versions: 1.6.0
>         Environment: Ubuntu 16.04.4
>            Reporter: Andrew Ruef
>            Priority: Major
>         Attachments: stack.txt.gz
>
>
> Running mesos in non-HA mode on a small cluster under load, the master
> reliably segfaults due to some state it has worked itself into. The segfault
> appears to be a stack overflow, at least, the call stack has 72662 elements
> in it in the crashing thread. The root of the stack appears to be in
> libprocess.
> I've attached a gzip compressed stack backtrace since the uncompressed stack
> backtrace is too large to attach to this issue. This happens to me fairly
> reliably when doing jobs, but it can take many hours or days for mesos to
> work itself back into this state.
[jira] [Commented] (MESOS-9024) Mesos master segfaults with stack overflow under load
[ https://issues.apache.org/jira/browse/MESOS-9024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16521101#comment-16521101 ]

Andrei Budnik commented on MESOS-9024:
--------------------------------------

Could you please add the repeating part of the stack trace to the description?

> Mesos master segfaults with stack overflow under load
> -----------------------------------------------------
>
>                 Key: MESOS-9024
>                 URL: https://issues.apache.org/jira/browse/MESOS-9024
>             Project: Mesos
>          Issue Type: Bug
>          Components: libprocess, master
>    Affects Versions: 1.6.0
>         Environment: Ubuntu 16.04.4
>            Reporter: Andrew Ruef
>            Priority: Major
>         Attachments: stack.txt.gz
>
>
> Running mesos in non-HA mode on a small cluster under load, the master
> reliably segfaults due to some state it has worked itself into. The segfault
> appears to be a stack overflow, at least, the call stack has 72662 elements
> in it in the crashing thread. The root of the stack appears to be in
> libprocess.
> I've attached a gzip compressed stack backtrace since the uncompressed stack
> backtrace is too large to attach to this issue. This happens to me fairly
> reliably when doing jobs, but it can take many hours or days for mesos to
> work itself back into this state.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)