[jira] [Commented] (MESOS-8623) Crashed framework brings down the whole Mesos cluster

2018-11-26 Thread Benjamin Mahler (JIRA)


[ https://issues.apache.org/jira/browse/MESOS-8623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16699724#comment-16699724 ]

Benjamin Mahler commented on MESOS-8623:


Looks like we really dropped the ball on this one; linking in MESOS-9419 and 
upgrading to Blocker.

> Crashed framework brings down the whole Mesos cluster
> -
>
> Key: MESOS-8623
> URL: https://issues.apache.org/jira/browse/MESOS-8623
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.4.1
> Environment: Debian 8
> Mesos 1.4.1
>Reporter: Tomas Barton
>Priority: Critical
>
> It might be hard to replicate, but when you do, your Mesos cluster is gone. 
> The issue was caused by an unresponsive Docker engine on a single agent node. 
> Unfortunately, even after fixing the Docker issues, all Mesos masters repeatedly 
> failed to start. In despair I deleted all {{replicated_log}} data from the 
> master and ZooKeeper. Even after that, messages from the agent's 
> {{replicated_log}} got replayed and the master crashed again. The average 
> lifetime of a Mesos master was less than one minute.
> {code}
> mesos-master[3814]: I0228 00:25:55.269835  3828 network.hpp:436] ZooKeeper 
> group memberships changed
> mesos-master[3814]: I0228 00:25:55.269979  3832 group.cpp:700] Trying to get 
> '/mesos/log_replicas/002519' in ZooKeeper
> mesos-master[3814]: I0228 00:25:55.271117  3832 group.cpp:700] Trying to get 
> '/mesos/log_replicas/002520' in ZooKeeper
> mesos-master[3814]: I0228 00:25:55.277971  3832 group.cpp:700] Trying to get 
> '/mesos/log_replicas/002521' in ZooKeeper
> mesos-master[3814]: I0228 00:25:55.279296  3827 network.hpp:484] ZooKeeper 
> group PIDs: { log-replica(1)
> mesos-master[3814]: W0228 00:26:15.261255  3831 master.hpp:2372] Master 
> attempted to send message to disconnected framework 
> 911c4b47-2ba7-4959-b59e-c48d896fe210-0005 (kafka)
> mesos-master[3814]: F0228 00:26:15.261318  3831 master.hpp:2382] 
> CHECK_SOME(pid): is NONE
> mesos-master[3814]: *** Check failure stack trace: ***
> mesos-master[3814]: @ 0x7f7187ca073d  google::LogMessage::Fail()
> mesos-master[3814]: @ 0x7f7187ca23bd  google::LogMessage::SendToLog()
> mesos-master[3814]: @ 0x7f7187ca0302  google::LogMessage::Flush()
> mesos-master[3814]: @ 0x7f7187ca2da9  
> google::LogMessageFatal::~LogMessageFatal()
> mesos-master[3814]: @ 0x7f7186d6d769  _CheckFatal::~_CheckFatal()
> mesos-master[3814]: @ 0x7f71870465d5  
> mesos::internal::master::Framework::send<>()
> mesos-master[3814]: @ 0x7f7186fcfe8a  
> mesos::internal::master::Master::executorMessage()
> mesos-master[3814]: @ 0x7f718706b1a1  ProtobufProcess<>::handler4<>()
> mesos-master[3814]: @ 0x7f7187008e36  
> std::_Function_handler<>::_M_invoke()
> mesos-master[3814]: @ 0x7f71870293d1  ProtobufProcess<>::visit()
> mesos-master[3814]: @ 0x7f7186fb7ee4  
> mesos::internal::master::Master::_visit()
> mesos-master[3814]: @ 0x7f7186fd0d5d  
> mesos::internal::master::Master::visit()
> mesos-master[3814]: @ 0x7f7187c02e22  process::ProcessManager::resume()
> mesos-master[3814]: @ 0x7f7187c08d46  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vE
> mesos-master[3814]: @ 0x7f7185babca0  (unknown)
> mesos-master[3814]: @ 0x7f71853c6064  start_thread
> mesos-master[3814]: @ 0x7f71850fb62d  (unknown)
> systemd[1]: mesos-master.service: main process exited, code=killed, 
> status=6/ABRT
> systemd[1]: Unit mesos-master.service entered failed state.
> systemd[1]: mesos-master.service holdoff time over, scheduling restart.
> systemd[1]: Stopping Mesos Master...
> systemd[1]: Starting Mesos Master...
> systemd[1]: Started Mesos Master.
> mesos-master[27840]: WARNING: Logging before InitGoogleLogging() is written 
> to STDERR
> mesos-master[27840]: I0228 01:32:38.294122 27829 main.cpp:232] Build: 
> 2017-11-18 02:15:41 by admin
> mesos-master[27840]: I0228 01:32:38.294168 27829 main.cpp:233] Version: 1.4.1
> mesos-master[27840]: I0228 01:32:38.294178 27829 main.cpp:236] Git tag: 1.4.1
> mesos-master[27840]: I0228 01:32:38.294186 27829 main.cpp:240] Git SHA: 
> c844db9ac7c0cef59be87438c6781bfb71adcc42
> mesos-master[27840]: I0228 01:32:38.296067 27829 main.cpp:340] Using 
> 'HierarchicalDRF' allocator
> mesos-master[27840]: I0228 01:32:38.411576 27829 replica.cpp:779] Replica 
> recovered with log positions 13 -> 14 with 0 holes and 0 unlearned
> mesos-master[27840]: 2018-02-28 
> 01:32:38,412:27829(0x7fab44755700):ZOO_INFO@log_env@726: Client 
> environment:zookeeper.version=zookeeper C client 3.4.8
> mesos-master[27840]: 2018-02-28 
> 01:32:38,412:27829(0x7fab44755700):ZOO_INFO@log_env@730: Client 
> environment:host.name=svc01
> mesos-master[27840]: 2018-02-28 
> {code}

[jira] [Commented] (MESOS-8623) Crashed framework brings down the whole Mesos cluster

2018-05-21 Thread Tomas Barton (JIRA)

[ https://issues.apache.org/jira/browse/MESOS-8623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482324#comment-16482324 ]

Tomas Barton commented on MESOS-8623:
-

Any progress on this? We're running into the same issue again and again.


[jira] [Commented] (MESOS-8623) Crashed framework brings down the whole Mesos cluster

2018-02-28 Thread Tomas Barton (JIRA)

[ https://issues.apache.org/jira/browse/MESOS-8623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381284#comment-16381284 ]

Tomas Barton commented on MESOS-8623:
-

Anything specific I should be looking for? There doesn't appear to be much 
useful information. Mesos starts reconciliation:
{code}
I0227 03:00:34.023347 23757 master.cpp:7286] Performing implicit task state 
reconciliation for framework ecd3a4be-d34c-46f3-b358-c4e26ac0d131- (mar
I0227 03:00:35.845939 23754 master.cpp:7673] Sending 9 offers to framework 
911c4b47-2ba7-4959-b59e-c48d896fe210-0005 (kafka) at scheduler-090f04ff-207f
I0227 03:00:35.849963 23757 master.cpp:5120] Processing DECLINE call for 
offers: [ 7ad6bb9b-b7f8-4467-8391-51f40c297ac1-O27 ] for framework 911c4b47-2b
I0227 03:00:35.850232 23757 master.cpp:9170] Removing offer 
7ad6bb9b-b7f8-4467-8391-51f40c297ac1-O27
I0227 03:00:35.850301 23757 master.cpp:5120] Processing DECLINE call for 
offers: [ 7ad6bb9b-b7f8-4467-8391-51f40c297ac1-O28 ] for framework 911c4b47-2b
I0227 03:00:35.850574 23757 master.cpp:9170] Removing offer 
7ad6bb9b-b7f8-4467-8391-51f40c297ac1-O28
I0227 03:00:35.850622 23757 master.cpp:5120] Processing DECLINE call for 
offers: [ 7ad6bb9b-b7f8-4467-8391-51f40c297ac1-O29 ] for framework 911c4b47-2b
I0227 03:00:35.850850 23757 master.cpp:9170] Removing offer 
7ad6bb9b-b7f8-4467-8391-51f40c297ac1-O29
{code}
Then hundreds of declined offers follow (over 1,000, to be more specific) until 
it crashes:
{code}
I0227 02:39:20.596915 31377 master.cpp:5120] Processing DECLINE call for 
offers: [ 82c5b27e-5ef9-427e-862a-cc4092b3d8e4-O1031 ] for framework 911c4b47-
I0227 02:39:20.597039 31377 master.cpp:9170] Removing offer 
82c5b27e-5ef9-427e-862a-cc4092b3d8e4-O1031
I0227 02:39:20.597054 31377 master.cpp:5120] Processing DECLINE call for 
offers: [ 82c5b27e-5ef9-427e-862a-cc4092b3d8e4-O1032 ] for framework 911c4b47-
I0227 02:39:20.597149 31377 master.cpp:9170] Removing offer 
82c5b27e-5ef9-427e-862a-cc4092b3d8e4-O1032
Stopping Mesos Master...
Starting Mesos Master...
Started Mesos Master.
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0227 02:39:20.895939 23727 main.cpp:232] Build: 2017-11-18 02:15:41 by admin
I0227 02:39:20.895979 23727 main.cpp:233] Version: 1.4.1
I0227 02:39:20.895983 23727 main.cpp:236] Git tag: 1.4.1
I0227 02:39:20.895985 23727 main.cpp:240] Git SHA: 
c844db9ac7c0cef59be87438c6781bfb71adcc42
I0227 02:39:20.896937 23727 main.cpp:340] Using 'HierarchicalDRF' allocator
I0227 02:39:20.934940 23727 replica.cpp:779] Replica recovered with log 
positions 7483 -> 7484 with 0 holes and 0 unlearned
2018-02-27 02:39:20,935:23727(0x7fdb3bcea700):ZOO_INFO@log_env@726: Client 
environment:zookeeper.version=zookeeper C client 3.4.8
{code}
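
For context, both kinds of log lines above correspond to routine v1 scheduler 
calls rather than anything unusual on the framework side. A rough sketch of how 
an HTTP scheduler produces them, assuming the Mesos v1 protobuf headers are 
available (the helper function names here are made up for illustration), is 
below; the resulting {{Call}} messages would be POSTed to the master's 
{{/api/v1/scheduler}} endpoint.
{code}
#include <mesos/v1/mesos.hpp>
#include <mesos/v1/scheduler/scheduler.hpp>

using mesos::v1::FrameworkID;
using mesos::v1::OfferID;
using mesos::v1::scheduler::Call;

// Implicit reconciliation: leaving 'reconcile.tasks' empty asks the master
// for the latest state of every task it knows about for this framework.
// This is what produces the "Performing implicit task state reconciliation"
// log line on the master.
Call implicitReconcile(const FrameworkID& frameworkId)
{
  Call call;
  call.mutable_framework_id()->CopyFrom(frameworkId);
  call.set_type(Call::RECONCILE);
  call.mutable_reconcile();  // No tasks listed => implicit reconciliation.
  return call;
}

// A DECLINE call for a single offer, which is what produces the repeated
// "Processing DECLINE call for offers" lines when a scheduler has nothing
// to launch.
Call declineOffer(const FrameworkID& frameworkId, const OfferID& offerId)
{
  Call call;
  call.mutable_framework_id()->CopyFrom(frameworkId);
  call.set_type(Call::DECLINE);
  call.mutable_decline()->add_offer_ids()->CopyFrom(offerId);

  // A longer refusal interval would at least reduce the offer churn seen in
  // the log, though it has no bearing on the crash itself.
  call.mutable_decline()->mutable_filters()->set_refuse_seconds(300);
  return call;
}
{code}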


[jira] [Commented] (MESOS-8623) Crashed framework brings down the whole Mesos cluster

2018-02-28 Thread Joseph Wu (JIRA)

[ https://issues.apache.org/jira/browse/MESOS-8623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380785#comment-16380785 ]

Joseph Wu commented on MESOS-8623:
--

Do you have more master logs prior to the crash?  A full run (from master 
starting to master crashing) would probably help track down which code path is 
unguarded.

BTW, the precise framework probably does not matter too much.  


[jira] [Commented] (MESOS-8623) Crashed framework brings down the whole Mesos cluster

2018-02-28 Thread Tomas Barton (JIRA)

[ https://issues.apache.org/jira/browse/MESOS-8623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380718#comment-16380718 ]

Tomas Barton commented on MESOS-8623:
-

Joseph, thanks for looking into this! We were running the Kafka framework via 
Marathon. However, Marathon could not start and elect a leader, as there was no 
stable Mesos leader, so reconnecting the Kafka framework proved difficult, even 
though the Kafka brokers themselves kept running.


[jira] [Commented] (MESOS-8623) Crashed framework brings down the whole Mesos cluster

2018-02-28 Thread Joseph Wu (JIRA)

[ https://issues.apache.org/jira/browse/MESOS-8623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380688#comment-16380688 ]

Joseph Wu commented on MESOS-8623:
--

This CHECK can be hit whenever the master attempts to send any message to a 
framework that was recovered via agent registration (i.e., the agent reports 
that the framework is running tasks on it) but has not yet reconnected to the 
master.

The master should be guarding against sending messages to disconnected 
frameworks, so we'll have to track down which code path is responsible for 
sending this message.
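
To illustrate the kind of guard that is missing on this path (this is not the 
actual Mesos code; {{FrameworkStub}} and {{sendOrDrop()}} are made-up names): 
{{Framework::send()}} dereferences the framework PID under a {{CHECK_SOME}}, so 
any call site that can run before the framework re-registers has to check 
{{connected()}} first and drop the message instead, roughly like this:
{code}
#include <cassert>
#include <iostream>
#include <optional>
#include <string>

// Hypothetical, simplified stand-in for the master's framework bookkeeping.
// In Mesos the PID is an Option<process::UPID>; std::optional keeps this
// sketch self-contained.
struct FrameworkStub
{
  std::string id;
  std::optional<std::string> pid;  // Unset while the framework is disconnected.

  bool connected() const { return pid.has_value(); }

  // Mirrors the failing pattern: sending while disconnected aborts.
  // (glog's CHECK aborts unconditionally; assert only does in debug builds.)
  void send(const std::string& message) const
  {
    assert(pid.has_value() && "CHECK_SOME(pid): is NONE");
    std::cout << "-> " << *pid << ": " << message << '\n';
  }
};

// The kind of guard needed on every code path that forwards messages
// (e.g. executor-to-framework messages) to a framework.
void sendOrDrop(const FrameworkStub& framework, const std::string& message)
{
  if (!framework.connected()) {
    std::cout << "Dropping message to disconnected framework "
              << framework.id << '\n';
    return;
  }
  framework.send(message);
}

int main()
{
  // A framework recovered from agent re-registration: known to the master,
  // but not yet re-connected, so its PID is still unset.
  FrameworkStub kafka{"911c4b47-2ba7-4959-b59e-c48d896fe210-0005", std::nullopt};

  sendOrDrop(kafka, "executor message");  // Dropped instead of crashing.
}
{code}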

A cursory {{git blame}} suggests that this crash could have happened from 1.2.0 
onwards, but this also depends on which message is being sent:
https://github.com/apache/mesos/commit/0dbceafa3b7caba9e541cd64bce3d5421e3b1262

Note: This commit removed a {{return;}} that would have hidden this bug.
https://github.com/apache/mesos/commit/65efb347301f90e638361a50282cb74980c4c081
