[jira] [Commented] (MESOS-8623) Crashed framework brings down the whole Mesos cluster
[ https://issues.apache.org/jira/browse/MESOS-8623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16699724#comment-16699724 ]

Benjamin Mahler commented on MESOS-8623:
----------------------------------------

Looks like we really dropped the ball on this one; linking in MESOS-9419 and upgrading to Blocker.

> Crashed framework brings down the whole Mesos cluster
> -----------------------------------------------------
>
>                 Key: MESOS-8623
>                 URL: https://issues.apache.org/jira/browse/MESOS-8623
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.4.1
>         Environment: Debian 8
>                      Mesos 1.4.1
>            Reporter: Tomas Barton
>            Priority: Critical
>
> It might be hard to replicate, but when you do, your Mesos cluster is gone.
> The issue was caused by an unresponsive Docker engine on a single agent node.
> Unfortunately, even after fixing the Docker issues, all Mesos masters repeatedly
> failed to start. In despair I deleted all {{replicated_log}} data from the
> masters and from ZooKeeper. Even after that, messages from the agents'
> {{replicated_log}} got replayed and the master crashed again. The average
> lifetime of a Mesos master was less than one minute.
> {code}
> mesos-master[3814]: I0228 00:25:55.269835 3828 network.hpp:436] ZooKeeper group memberships changed
> mesos-master[3814]: I0228 00:25:55.269979 3832 group.cpp:700] Trying to get '/mesos/log_replicas/002519' in ZooKeeper
> mesos-master[3814]: I0228 00:25:55.271117 3832 group.cpp:700] Trying to get '/mesos/log_replicas/002520' in ZooKeeper
> mesos-master[3814]: I0228 00:25:55.277971 3832 group.cpp:700] Trying to get '/mesos/log_replicas/002521' in ZooKeeper
> mesos-master[3814]: I0228 00:25:55.279296 3827 network.hpp:484] ZooKeeper group PIDs: { log-replica(1)
> mesos-master[3814]: W0228 00:26:15.261255 3831 master.hpp:2372] Master attempted to send message to disconnected framework 911c4b47-2ba7-4959-b59e-c48d896fe210-0005 (kafka)
> mesos-master[3814]: F0228 00:26:15.261318 3831 master.hpp:2382] CHECK_SOME(pid): is NONE
> mesos-master[3814]: *** Check failure stack trace: ***
> mesos-master[3814]: @ 0x7f7187ca073d google::LogMessage::Fail()
> mesos-master[3814]: @ 0x7f7187ca23bd google::LogMessage::SendToLog()
> mesos-master[3814]: @ 0x7f7187ca0302 google::LogMessage::Flush()
> mesos-master[3814]: @ 0x7f7187ca2da9 google::LogMessageFatal::~LogMessageFatal()
> mesos-master[3814]: @ 0x7f7186d6d769 _CheckFatal::~_CheckFatal()
> mesos-master[3814]: @ 0x7f71870465d5 mesos::internal::master::Framework::send<>()
> mesos-master[3814]: @ 0x7f7186fcfe8a mesos::internal::master::Master::executorMessage()
> mesos-master[3814]: @ 0x7f718706b1a1 ProtobufProcess<>::handler4<>()
> mesos-master[3814]: @ 0x7f7187008e36 std::_Function_handler<>::_M_invoke()
> mesos-master[3814]: @ 0x7f71870293d1 ProtobufProcess<>::visit()
> mesos-master[3814]: @ 0x7f7186fb7ee4 mesos::internal::master::Master::_visit()
> mesos-master[3814]: @ 0x7f7186fd0d5d mesos::internal::master::Master::visit()
> mesos-master[3814]: @ 0x7f7187c02e22 process::ProcessManager::resume()
> mesos-master[3814]: @ 0x7f7187c08d46 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vE
> mesos-master[3814]: @ 0x7f7185babca0 (unknown)
> mesos-master[3814]: @ 0x7f71853c6064 start_thread
> mesos-master[3814]: @ 0x7f71850fb62d (unknown)
> systemd[1]: mesos-master.service: main process exited, code=killed, status=6/ABRT
> systemd[1]: Unit mesos-master.service entered failed state.
> systemd[1]: mesos-master.service holdoff time over, scheduling restart.
> systemd[1]: Stopping Mesos Master...
> systemd[1]: Starting Mesos Master...
> systemd[1]: Started Mesos Master.
> mesos-master[27840]: WARNING: Logging before InitGoogleLogging() is written to STDERR
> mesos-master[27840]: I0228 01:32:38.294122 27829 main.cpp:232] Build: 2017-11-18 02:15:41 by admin
> mesos-master[27840]: I0228 01:32:38.294168 27829 main.cpp:233] Version: 1.4.1
> mesos-master[27840]: I0228 01:32:38.294178 27829 main.cpp:236] Git tag: 1.4.1
> mesos-master[27840]: I0228 01:32:38.294186 27829 main.cpp:240] Git SHA: c844db9ac7c0cef59be87438c6781bfb71adcc42
> mesos-master[27840]: I0228 01:32:38.296067 27829 main.cpp:340] Using 'HierarchicalDRF' allocator
> mesos-master[27840]: I0228 01:32:38.411576 27829 replica.cpp:779] Replica recovered with log positions 13 -> 14 with 0 holes and 0 unlearned
> mesos-master[27840]: 2018-02-28 01:32:38,412:27829(0x7fab44755700):ZOO_INFO@log_env@726: Client environment:zookeeper.version=zookeeper C client 3.4.8
> mesos-master[27840]: 2018-02-28 01:32:38,412:27829(0x7fab44755700):ZOO_INFO@log_env@730: Client environment:host.name=svc01
> mesos-master[27840]: 2018-02-28
[jira] [Commented] (MESOS-8623) Crashed framework brings down the whole Mesos cluster
[ https://issues.apache.org/jira/browse/MESOS-8623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482324#comment-16482324 ]

Tomas Barton commented on MESOS-8623:
-------------------------------------

Any progress on this? We're running into the same issue again and again.
[jira] [Commented] (MESOS-8623) Crashed framework brings down the whole Mesos cluster
[ https://issues.apache.org/jira/browse/MESOS-8623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381284#comment-16381284 ]

Tomas Barton commented on MESOS-8623:
-------------------------------------

Anything specific I should be looking for? There doesn't appear to be much useful information. Mesos starts reconciliation:

{code}
I0227 03:00:34.023347 23757 master.cpp:7286] Performing implicit task state reconciliation for framework ecd3a4be-d34c-46f3-b358-c4e26ac0d131- (mar
I0227 03:00:35.845939 23754 master.cpp:7673] Sending 9 offers to framework 911c4b47-2ba7-4959-b59e-c48d896fe210-0005 (kafka) at scheduler-090f04ff-207f
I0227 03:00:35.849963 23757 master.cpp:5120] Processing DECLINE call for offers: [ 7ad6bb9b-b7f8-4467-8391-51f40c297ac1-O27 ] for framework 911c4b47-2b
I0227 03:00:35.850232 23757 master.cpp:9170] Removing offer 7ad6bb9b-b7f8-4467-8391-51f40c297ac1-O27
I0227 03:00:35.850301 23757 master.cpp:5120] Processing DECLINE call for offers: [ 7ad6bb9b-b7f8-4467-8391-51f40c297ac1-O28 ] for framework 911c4b47-2b
I0227 03:00:35.850574 23757 master.cpp:9170] Removing offer 7ad6bb9b-b7f8-4467-8391-51f40c297ac1-O28
I0227 03:00:35.850622 23757 master.cpp:5120] Processing DECLINE call for offers: [ 7ad6bb9b-b7f8-4467-8391-51f40c297ac1-O29 ] for framework 911c4b47-2b
I0227 03:00:35.850850 23757 master.cpp:9170] Removing offer 7ad6bb9b-b7f8-4467-8391-51f40c297ac1-O29
{code}

Then hundreds of declined offers follow (over 1000, to be more specific), until it crashes:

{code}
I0227 02:39:20.596915 31377 master.cpp:5120] Processing DECLINE call for offers: [ 82c5b27e-5ef9-427e-862a-cc4092b3d8e4-O1031 ] for framework 911c4b47-
I0227 02:39:20.597039 31377 master.cpp:9170] Removing offer 82c5b27e-5ef9-427e-862a-cc4092b3d8e4-O1031
I0227 02:39:20.597054 31377 master.cpp:5120] Processing DECLINE call for offers: [ 82c5b27e-5ef9-427e-862a-cc4092b3d8e4-O1032 ] for framework 911c4b47-
I0227 02:39:20.597149 31377 master.cpp:9170] Removing offer 82c5b27e-5ef9-427e-862a-cc4092b3d8e4-O1032
Stopping Mesos Master...
Starting Mesos Master...
Started Mesos Master.
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0227 02:39:20.895939 23727 main.cpp:232] Build: 2017-11-18 02:15:41 by admin
I0227 02:39:20.895979 23727 main.cpp:233] Version: 1.4.1
I0227 02:39:20.895983 23727 main.cpp:236] Git tag: 1.4.1
I0227 02:39:20.895985 23727 main.cpp:240] Git SHA: c844db9ac7c0cef59be87438c6781bfb71adcc42
I0227 02:39:20.896937 23727 main.cpp:340] Using 'HierarchicalDRF' allocator
I0227 02:39:20.934940 23727 replica.cpp:779] Replica recovered with log positions 7483 -> 7484 with 0 holes and 0 unlearned
2018-02-27 02:39:20,935:23727(0x7fdb3bcea700):ZOO_INFO@log_env@726: Client environment:zookeeper.version=zookeeper C client 3.4.8
{code}
[jira] [Commented] (MESOS-8623) Crashed framework brings down the whole Mesos cluster
[ https://issues.apache.org/jira/browse/MESOS-8623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380785#comment-16380785 ]

Joseph Wu commented on MESOS-8623:
----------------------------------

Do you have more master logs prior to the crash? A full run (from master starting to master crashing) would probably help track down which code path is unguarded. BTW, the precise framework probably does not matter too much.
[jira] [Commented] (MESOS-8623) Crashed framework brings down the whole Mesos cluster
[ https://issues.apache.org/jira/browse/MESOS-8623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380718#comment-16380718 ]

Tomas Barton commented on MESOS-8623:
-------------------------------------

Joseph, thanks for looking into this! We were running the Kafka framework via Marathon. However, Marathon could not start and elect a leader, as there was no stable Mesos leader. Thus reconnecting the Kafka framework proved difficult, even though the Kafka brokers were running normally.
[jira] [Commented] (MESOS-8623) Crashed framework brings down the whole Mesos cluster
[ https://issues.apache.org/jira/browse/MESOS-8623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380688#comment-16380688 ]

Joseph Wu commented on MESOS-8623:
----------------------------------

This CHECK can be hit whenever the master attempts to send any message to a framework recovered via agent registration (i.e. an agent reports that the framework is running tasks on it), but before the framework has reconnected to the master. The master should be guarding against sending messages to disconnected frameworks, so we'll have to track down which code path is responsible for sending this message.

A cursory {{git blame}} suggests that this crash could have happened from 1.2.0 onwards, but this also depends on which message is being sent:
https://github.com/apache/mesos/commit/0dbceafa3b7caba9e541cd64bce3d5421e3b1262

Note: This commit removed a {{return;}} that would have hidden this bug:
https://github.com/apache/mesos/commit/65efb347301f90e638361a50282cb74980c4c081