[jira] [Commented] (MESOS-5193) Recovery failed: Failed to recover registrar on reboot of mesos master
[ https://issues.apache.org/jira/browse/MESOS-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15741427#comment-15741427 ]

bo kong commented on MESOS-5193:
--------------------------------

I have hit the same problem with Mesos 1.0.1, but I can't reproduce it.


> Recovery failed: Failed to recover registrar on reboot of mesos master
> -----------------------------------------------------------------------
>
>                 Key: MESOS-5193
>                 URL: https://issues.apache.org/jira/browse/MESOS-5193
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.22.0, 0.27.0
>            Reporter: Priyanka Gupta
>              Labels: master, mesosphere
>         Attachments: full.log, node1.log, node1_after_work_dir.log, node2.log, node2_after_work_dir.log, node3.log, node3_after_work_dir.log
>
>
> Hi all,
> We are using a 3-node cluster with a Mesos master, Mesos slave, and ZooKeeper on each node, and Chronos on top. The problem is that when we reboot the leading Mesos master, the other nodes try to get elected as leader but fail to recover the registrar:
> "Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins"
> The next node then tries to become the leader but fails with the same error. I am not sure about the cause. We are currently using Mesos 0.22 and have also tried upgrading to Mesos 0.27, but the problem continues to happen.
> /usr/sbin/mesos-master --work_dir=/tmp/mesos_dir --zk=zk://node1:2181,node2:2181,node3:2181/mesos --quorum=2
> Can you please help us resolve this issue, as it is a production system?
> Thanks,
> Priyanka
[jira] [Commented] (MESOS-5193) Recovery failed: Failed to recover registrar on reboot of mesos master
[ https://issues.apache.org/jira/browse/MESOS-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15264707#comment-15264707 ]

Priyanka Gupta commented on MESOS-5193:
---------------------------------------

[~bmahler] The ZooKeeper connectivity issues are because we also have ZooKeeper set up on the same nodes as the Mesos masters. So configuration-wise, we have 3 nodes, each running ZooKeeper, mesos-master, and mesos-slave. As far as restart is concerned, we have RHEL6 boxes and an init.d service which runs these; however, once a master process gets killed, the service gets terminated as well, so the master is not restarted automatically.
[jira] [Commented] (MESOS-5193) Recovery failed: Failed to recover registrar on reboot of mesos master
[ https://issues.apache.org/jira/browse/MESOS-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15264584#comment-15264584 ]

Benjamin Mahler commented on MESOS-5193:
----------------------------------------

[~prigupta] Looking at the logs, there was a ~3 minute window of time in which the masters were experiencing ZooKeeper connectivity issues (from 18:33 - 18:36). Have you noticed this?

Also, we require that the masters are run under supervision; are you ensuring that the masters are being promptly restarted if they terminate? Since the recovery timeout is 1 minute by default, I would suggest a restart delay much smaller than that, like 10 seconds. Were the masters restarted after the last recovery failures here?

{noformat}
Master 1:
W0429 18:33:08.726205  2518 logging.cpp:88] RAW: Received signal SIGTERM from process 2938 of user 0; exiting
I0429 18:33:28.846740  1083 main.cpp:230] Build: 2016-04-13 23:22:05 by screwdrv
I0429 18:37:26.008154  1134 master.cpp:1723] Elected as the leading master!
F0429 18:38:26.008847  1127 master.cpp:1457] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins

Master 2:
W0429 18:36:04.716518  2410 logging.cpp:88] RAW: Received signal SIGTERM from process 3029 of user 0; exiting
I0429 18:36:30.429669  1091 main.cpp:230] Build: 2016-04-13 23:22:05 by screwdrv
I0429 18:38:34.699726  1144 master.cpp:1723] Elected as the leading master!
F0429 18:39:34.715205  1139 master.cpp:1457] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins

Master 3:
I0429 18:32:12.877344  7962 main.cpp:230] Build: 2016-04-13 23:22:05 by screwdrv
I0429 18:36:16.489387  7963 master.cpp:1723] Elected as the leading master!
F0429 18:37:16.490408  7967 master.cpp:1457] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
{noformat}

If they were restarted and the ZooKeeper connectivity was resolved, the masters should have been able to get back up and running.
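As a reference point for the supervision setup described above, a minimal sketch of a unit that restarts the master promptly. This assumes a systemd host, so on the RHEL6/init.d boxes from this ticket it is illustrative only; the flags are the ones from this ticket, with the work_dir moved off /tmp as suggested elsewhere in this thread:

{noformat}
# Hypothetical /etc/systemd/system/mesos-master.service -- illustration only.
[Unit]
Description=Mesos master
After=network-online.target

[Service]
ExecStart=/usr/sbin/mesos-master \
  --work_dir=/var/lib/mesos \
  --zk=zk://node1:2181,node2:2181,node3:2181/mesos \
  --quorum=2
# Restart promptly, well within the 1-minute registry fetch timeout.
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
{noformat}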
[jira] [Commented] (MESOS-5193) Recovery failed: Failed to recover registrar on reboot of mesos master
[ https://issues.apache.org/jira/browse/MESOS-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15264513#comment-15264513 ]

Priyanka Gupta commented on MESOS-5193:
---------------------------------------

Hi [~jieyu], I tried changing the work dir as you suggested, but with no luck. Attaching the logs again.

Test scenario: node1 is the leading master. Rebooted node1 -> node2 became master. All is fine. Once node1 is back, I rebooted node2 (the current leading master); node3 becomes master and then exits, node1 tries to become master and fails, and then node2 fails as well.
[jira] [Commented] (MESOS-5193) Recovery failed: Failed to recover registrar on reboot of mesos master
[ https://issues.apache.org/jira/browse/MESOS-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15254189#comment-15254189 ]

Jie Yu commented on MESOS-5193:
-------------------------------

Can you change the work_dir to /var/lib/mesos and see if the issue gets resolved? The problem with a /tmp work_dir is that the replicated log will be wiped upon reboot. The rebooted master will then try to catch up with the other normal replicas, but if you don't have a quorum of normal replicas, the recovery will fail.
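Applied to the command line from this ticket, the suggested change is just the following (a sketch; any persistent directory writable by the master works):

{noformat}
# Use a work_dir that survives reboots; /tmp is typically cleared on boot.
sudo mkdir -p /var/lib/mesos

/usr/sbin/mesos-master --work_dir=/var/lib/mesos \
  --zk=zk://node1:2181,node2:2181,node3:2181/mesos --quorum=2
{noformat}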
[jira] [Commented] (MESOS-5193) Recovery failed: Failed to recover registrar on reboot of mesos master
[ https://issues.apache.org/jira/browse/MESOS-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252298#comment-15252298 ]

Priyanka Gupta commented on MESOS-5193:
---------------------------------------

That's right!
[jira] [Commented] (MESOS-5193) Recovery failed: Failed to recover registrar on reboot of mesos master
[ https://issues.apache.org/jira/browse/MESOS-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252293#comment-15252293 ]

Neil Conway commented on MESOS-5193:
------------------------------------

Ah -- to clarify, *rebooting* the current leading master node causes the error to occur reliably. However, killing and restarting the {{mesos-master}} process on the current leading master node doesn't cause any problems. Is that accurate?
[jira] [Commented] (MESOS-5193) Recovery failed: Failed to recover registrar on reboot of mesos master
[ https://issues.apache.org/jira/browse/MESOS-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252284#comment-15252284 ]

Priyanka Gupta commented on MESOS-5193:
---------------------------------------

[~neilc]: It's not a one-time thing; it's reproducible almost always. It happens only when I reboot the system; shutting down the mesos-master service works just fine. Not sure if it's something to do with the network going down. It's happening in production and is therefore kind of a blocker for us.

Thanks,
Priyanka
[jira] [Commented] (MESOS-5193) Recovery failed: Failed to recover registrar on reboot of mesos master
[ https://issues.apache.org/jira/browse/MESOS-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252276#comment-15252276 ]

Neil Conway commented on MESOS-5193:
------------------------------------

[~prigupta]: I dug into the logs. A few things seemed suspicious, but no smoking gun yet. A few questions to clarify:

1. Does this problem occur reliably, or was it a one-time issue?
2. If it is reproducible, can you start {{mesos-master}} with {{GLOG_v=1}} set as an environment variable? If this would cause production downtime, no need to bother.
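For example, with the command line from this ticket, enabling verbose logging looks something like the following (a sketch; {{GLOG_v=1}} only needs to be present in the master's environment, however the process is launched):

{noformat}
# Run the master with glog verbose logging enabled.
GLOG_v=1 /usr/sbin/mesos-master --work_dir=/tmp/mesos_dir \
  --zk=zk://node1:2181,node2:2181,node3:2181/mesos --quorum=2
{noformat}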
[jira] [Commented] (MESOS-5193) Recovery failed: Failed to recover registrar on reboot of mesos master
[ https://issues.apache.org/jira/browse/MESOS-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15250238#comment-15250238 ]

Priyanka Gupta commented on MESOS-5193:
---------------------------------------

[~neilc]: Any updates on this one?
[jira] [Commented] (MESOS-5193) Recovery failed: Failed to recover registrar on reboot of mesos master
[ https://issues.apache.org/jira/browse/MESOS-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15242071#comment-15242071 ]

Priyanka Gupta commented on MESOS-5193:
---------------------------------------

Thanks a lot for getting back.

[~neilc]: Please see the attached logs. I did a reboot on node1 ({{sudo reboot}}).

[~kaysoky]: Thanks for pointing that out. I will make this change.

Thanks,
Priyanka
[jira] [Commented] (MESOS-5193) Recovery failed: Failed to recover registrar on reboot of mesos master
[ https://issues.apache.org/jira/browse/MESOS-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15237882#comment-15237882 ]

Joseph Wu commented on MESOS-5193:
----------------------------------

Probably not related to this problem you're seeing, but using {{/tmp}} as your {{work_dir}} is problematic. See this and related JIRAs: https://issues.apache.org/jira/browse/MESOS-5064
[jira] [Commented] (MESOS-5193) Recovery failed: Failed to recover registrar on reboot of mesos master
[ https://issues.apache.org/jira/browse/MESOS-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15237874#comment-15237874 ]

Neil Conway commented on MESOS-5193:
------------------------------------

Hi [~prigupta] -- can you post the complete log files for all three nodes? I'd like to make sure that the snippets you've posted are not missing some important context. Thanks!
[jira] [Commented] (MESOS-5193) Recovery failed: Failed to recover registrar on reboot of mesos master
[ https://issues.apache.org/jira/browse/MESOS-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15237577#comment-15237577 ]

Priyanka Gupta commented on MESOS-5193:
---------------------------------------

Error stack in the mesos-master logs:

Node 3:
{noformat}
I0411 22:47:02.007249  1348 detector.cpp:479] A new leading master (UPID=master@10.221.28.61:5050) is detected
I0411 22:47:02.007380  1348 master.cpp:1710] The newly elected leader is master@10.221.28.61:5050 with id 725d1232-bea3-4df5-90c5-6479e5652ef4
I0411 22:47:02.007428  1348 master.cpp:1723] Elected as the leading master!
I0411 22:47:02.007457  1348 master.cpp:1468] Recovering from registrar
I0411 22:47:02.007551  1345 registrar.cpp:307] Recovering registrar
I0411 22:47:02.007649  1356 network.hpp:461] ZooKeeper group PIDs: { log-replica(1)@10.221.28.61:5050, log-replica(1)@10.221.28.249:5050 }
I0411 22:47:02.007841  1356 log.cpp:659] Attempting to start the writer
I0411 22:47:02.008477  1348 replica.cpp:493] Replica received implicit promise request from (30)@10.221.28.61:5050 with proposal 52
E0411 22:47:02.008903  1358 process.cpp:1966] Failed to shutdown socket with fd 23: Transport endpoint is not connected
I0411 22:47:02.009968  1348 leveldb.cpp:304] Persisting metadata (8 bytes) to leveldb took 1.44126ms
I0411 22:47:02.010022  1348 replica.cpp:342] Persisted promised to 52
F0411 22:48:02.008332  1357 master.cpp:1457] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
*** Check failure stack trace: ***
    @     0x7f4bd5bcedfd  (unknown)
    @     0x7f4bd5bd0c3d  (unknown)
    @     0x7f4bd5bce9ec  (unknown)
    @     0x7f4bd5bd1539  (unknown)
    @     0x7f4bd54022dc  (unknown)
    @     0x7f4bd5442ab0  (unknown)
    @           0x42807e  (unknown)
    @     0x7f4bd54690a5  (unknown)
    @     0x7f4bd54bb976  (unknown)
    @     0x7f4bd54cc566  (unknown)
    @     0x7f4bd52fc4d6  (unknown)
    @     0x7f4bd54cc553  (unknown)
    @     0x7f4bd54b0614  (unknown)
    @     0x7f4bd5b7c971  (unknown)
    @     0x7f4bd5b7cc77  (unknown)
    @       0x3dc38b6470  (unknown)
    @       0x3dc18079d1  (unknown)
    @       0x3dc14e88fd  (unknown)
    @              (nil)  (unknown)
/bin/bash: line 1:  1313 Aborted                 /usr/sbin/mesos-master --work_dir=/tmp/mesos_dir --zk=zk://scheduler1.rpt.cb.ne1.yahoo.com:2181,scheduler2.rpt.cb.ne1.yahoo.com:2181,scheduler3.rpt.cb.ne1.yahoo.com:2181/mesos --quorum=2
{noformat}

Node 2:
{noformat}
I0411 22:48:10.006216  1466 log.cpp:659] Attempting to start the writer
E0411 22:48:10.006958  1478 process.cpp:1966] Failed to shutdown socket with fd 23: Transport endpoint is not connected
I0411 22:48:10.007202  1467 replica.cpp:493] Replica received implicit promise request from (13)@10.221.28.249:5050 with proposal 52
E0411 22:48:10.007491  1478 process.cpp:1966] Failed to shutdown socket with fd 23: Transport endpoint is not connected
I0411 22:48:10.008458  1467 leveldb.cpp:304] Persisting metadata (8 bytes) to leveldb took 1.227092ms
I0411 22:48:10.008491  1467 replica.cpp:342] Persisted promised to 52
F0411 22:49:10.006739  1476 master.cpp:1457] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
*** Check failure stack trace: ***
    @     0x7fec686f2dfd  (unknown)
    @     0x7fec686f4c3d  (unknown)
    @     0x7fec686f29ec  (unknown)
    @     0x7fec686f5539  (unknown)
    @     0x7fec67f262dc  (unknown)
    @     0x7fec67f66ab0  (unknown)
    @           0x42807e  (unknown)
    @     0x7fec67f8d0a5  (unknown)
    @     0x7fec67fdf976  (unknown)
    @     0x7fec67ff0566  (unknown)
    @     0x7fec67e204d6  (unknown)
    @     0x7fec67ff0553  (unknown)
    @     0x7fec67fd4614  (unknown)
    @     0x7fec686a0971  (unknown)
    @     0x7fec686a0c77  (unknown)
    @       0x37f98b6470  (unknown)
    @       0x39ed207a51  (unknown)
    @       0x39ecae89ad  (unknown)
    @              (nil)  (unknown)
/bin/bash: line 1:  1452 Aborted                 /usr/sbin/mesos-master --work_dir=/tmp/mesos_dir --zk=zk://scheduler1.rpt.cb.ne1.yahoo.com:2181,scheduler2.rpt.cb.ne1.yahoo.com:2181,scheduler3.rpt.cb.ne1.yahoo.com:2181/mesos --quorum=2
{noformat}

Node 1:
{noformat}
I0411 22:45:52.017833  8338 detector.cpp:479] A new leading master (UPID=master@10.221.29.247:5050) is detected
I0411 22:45:52.017925  8338 master.cpp:1710] The newly elected leader is master@10.221.29.247:5050 with id 13df6437-fbe9-4390-9f6c-db9fd1d53a16
I0411 22:45:52.017956  8338 master.cpp:1723] Elected as the leading master!
I0411 22:45:52.017983  8338 master.cpp:1468] Recovering from registrar
I0411 22:45:52.018069  8339 registrar.cpp:307] Recovering registrar
I0411 22:45:52.018337  8333 log.cpp:659] Attempting to start the writer
I0411 22:45:52.018785  8336 network.hpp:461] ZooKeeper group PIDs: { log-replica(1)@10.221.28.61:5050, log-replica(1)@10.221.29.247:5050 }
I0411 22:45:52.019008  8336 replica.cpp:493] Replica received implicit promise request from
{noformat}