[jira] [Commented] (MESOS-5193) Recovery failed: Failed to recover registrar on reboot of mesos master

2016-12-12 Thread bo kong (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15741427#comment-15741427
 ] 

bo kong commented on MESOS-5193:


I have met the same problem with Mesos version 1.0.1, but I can't reproduce it.

> Recovery failed: Failed to recover registrar on reboot of mesos master
> --
>
> Key: MESOS-5193
> URL: https://issues.apache.org/jira/browse/MESOS-5193
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 0.22.0, 0.27.0
>Reporter: Priyanka Gupta
>  Labels: master, mesosphere
> Attachments: full.log, node1.log, node1_after_work_dir.log, 
> node2.log, node2_after_work_dir.log, node3.log, node3_after_work_dir.log
>
>
> Hi all, 
> We are using a 3-node cluster with a Mesos master, a Mesos slave, and ZooKeeper on 
> each node, with Chronos running on top of it. The problem is that when we reboot 
> the Mesos master leader, the other nodes try to get elected as leader but 
> fail with a registrar recovery error: 
> "Recovery failed: Failed to recover registrar: Failed to perform fetch within 
> 1mins"
> The next node then tries to become the leader but again fails with the same error. 
> I am not sure about the cause. We are currently using Mesos 0.22 and have also 
> tried upgrading to Mesos 0.27, but the problem continues to happen. 
>  /usr/sbin/mesos-master --work_dir=/tmp/mesos_dir 
> --zk=zk://node1:2181,node2:2181,node3:2181/mesos --quorum=2
> Can you please help us resolve this issue, as it is a production system?
> Thanks,
> Priyanka



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5193) Recovery failed: Failed to recover registrar on reboot of mesos master

2016-04-29 Thread Priyanka Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15264707#comment-15264707
 ] 

Priyanka Gupta commented on MESOS-5193:
---

[~bmahler] The ZooKeeper connectivity issues are because we also have ZooKeeper set 
up on the same nodes as the Mesos masters. So, configuration-wise, we have 3 nodes, 
each running zk, mesos-master and mesos-slave. As far as restarts are concerned, 
these are RHEL 6 boxes and an init.d service runs these processes; although, once a 
master process gets killed, the service gets terminated as well.



[jira] [Commented] (MESOS-5193) Recovery failed: Failed to recover registrar on reboot of mesos master

2016-04-29 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15264584#comment-15264584
 ] 

Benjamin Mahler commented on MESOS-5193:


[~prigupta] Looking at the logs, there was a ~3-minute window in which the 
masters were experiencing ZooKeeper connectivity issues (from 18:33 - 18:36). Have 
you noticed this?

Also, we require that the masters are run under supervision; are you ensuring 
that the masters are being promptly restarted if they terminate? Since the 
recovery timeout is 1 minute by default, I would suggest a restart delay much 
smaller than that, like 10 seconds.
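
For illustration only, a minimal supervision sketch along those lines could look like 
the following (the flags are the ones from the description, the 10-second delay is 
just the suggestion above, and a real deployment would use the init system rather 
than a bash loop):

{noformat}
#!/bin/bash
# Hypothetical supervisor: restart mesos-master promptly whenever it exits,
# keeping the restart delay well below the 1-minute registry fetch timeout.
while true; do
  /usr/sbin/mesos-master \
    --work_dir=/tmp/mesos_dir \
    --zk=zk://node1:2181,node2:2181,node3:2181/mesos \
    --quorum=2
  sleep 10
done
{noformat}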

Were the masters restarted after the last recovery failures here?

{noformat}
Master 1:
W0429 18:33:08.726205  2518 logging.cpp:88] RAW: Received signal SIGTERM from 
process 2938 of user 0; exiting
I0429 18:33:28.846740  1083 main.cpp:230] Build: 2016-04-13 23:22:05 by screwdrv
I0429 18:37:26.008154  1134 master.cpp:1723] Elected as the leading master!
F0429 18:38:26.008847  1127 master.cpp:1457] Recovery failed: Failed to recover 
registrar: Failed to perform fetch within 1mins

Master 2:
W0429 18:36:04.716518  2410 logging.cpp:88] RAW: Received signal SIGTERM from 
process 3029 of user 0; exiting
I0429 18:36:30.429669  1091 main.cpp:230] Build: 2016-04-13 23:22:05 by screwdrv
I0429 18:38:34.699726  1144 master.cpp:1723] Elected as the leading master!
F0429 18:39:34.715205  1139 master.cpp:1457] Recovery failed: Failed to recover 
registrar: Failed to perform fetch within 1mins

Master 3:
I0429 18:32:12.877344  7962 main.cpp:230] Build: 2016-04-13 23:22:05 by screwdrv
I0429 18:36:16.489387  7963 master.cpp:1723] Elected as the leading master!
F0429 18:37:16.490408  7967 master.cpp:1457] Recovery failed: Failed to recover 
registrar: Failed to perform fetch within 1mins
{noformat}

If they were restarted and the ZooKeeper connectivity was resolved, the masters 
should have been able to get back up and running.



[jira] [Commented] (MESOS-5193) Recovery failed: Failed to recover registrar on reboot of mesos master

2016-04-29 Thread Priyanka Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15264513#comment-15264513
 ] 

Priyanka Gupta commented on MESOS-5193:
---

Hi [~jieyu] 

I tried changing the work_dir as you suggested, but with no luck. Attaching the 
logs again. 
Test scenario: node1 is the leading master. I rebooted node1 -> node2 became the 
master; all is fine. Once node1 is back, I rebooted node2 (the current leading 
master); node3 becomes the master and then exits, then node1 tries to become the 
leader and fails, and then node2 also fails.



[jira] [Commented] (MESOS-5193) Recovery failed: Failed to recover registrar on reboot of mesos master

2016-04-22 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15254189#comment-15254189
 ] 

Jie Yu commented on MESOS-5193:
---

Can you change the work_dir to be /var/lib/mesos and see if the issue gets 
resolved?

The problem with a /tmp work_dir is that the replicated log will be wiped upon 
reboot. It will then try to catch up with the other normal replicas, but if you 
don't have a quorum of normal replicas, the recovery will fail.
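
As a sketch of what that change might look like (the path here is just the 
suggestion above; any persistent, node-local directory will do):

{noformat}
# Create a work directory that survives reboots and point the master at it.
mkdir -p /var/lib/mesos
/usr/sbin/mesos-master \
  --work_dir=/var/lib/mesos \
  --zk=zk://node1:2181,node2:2181,node3:2181/mesos \
  --quorum=2
{noformat}

With the work_dir outside /tmp, each master keeps its replicated log across reboots, 
so the surviving replicas can still form a quorum during recovery.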



[jira] [Commented] (MESOS-5193) Recovery failed: Failed to recover registrar on reboot of mesos master

2016-04-21 Thread Priyanka Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252298#comment-15252298
 ] 

Priyanka Gupta commented on MESOS-5193:
---

That's right!



[jira] [Commented] (MESOS-5193) Recovery failed: Failed to recover registrar on reboot of mesos master

2016-04-21 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252293#comment-15252293
 ] 

Neil Conway commented on MESOS-5193:


Ah -- to clarify, *rebooting* the current leading master node causes the error 
to occur reliably. However, killing and restarting the {{mesos-master}} process 
on the current leading master node doesn't cause any problems. Is that accurate?



[jira] [Commented] (MESOS-5193) Recovery failed: Failed to recover registrar on reboot of mesos master

2016-04-21 Thread Priyanka Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252284#comment-15252284
 ] 

Priyanka Gupta commented on MESOS-5193:
---

[~neilc]: It's not a one-time thing; it's reproducible almost always. It happens 
only when I reboot the system. Shutting down the mesos-master service works just 
fine. I am not sure if it's something to do with the network going down. It's 
happening in production and hence is kind of a blocker for us.

Thanks,
Priyanka



[jira] [Commented] (MESOS-5193) Recovery failed: Failed to recover registrar on reboot of mesos master

2016-04-21 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252276#comment-15252276
 ] 

Neil Conway commented on MESOS-5193:


[~prigupta]: I dug into the logs. A few things seemed suspicious, but no 
smoking gun yet. A few questions to clarify:

1. Does this problem occur reliably, or was it a one-time issue?
2. If it is reproducible, can you start {{mesos-master}} with {{GLOG_v=1}} set as an 
environment variable? (A sketch of that invocation is below.) If this would cause 
production downtime, no need to bother.
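
A sketch of that invocation (the flags are the ones from the description; 
{{GLOG_v=1}} simply raises glog's verbose logging level for the master):

{noformat}
# Run the master with verbose (level 1) glog output to capture more detail.
GLOG_v=1 /usr/sbin/mesos-master \
  --work_dir=/tmp/mesos_dir \
  --zk=zk://node1:2181,node2:2181,node3:2181/mesos \
  --quorum=2
{noformat}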



[jira] [Commented] (MESOS-5193) Recovery failed: Failed to recover registrar on reboot of mesos master

2016-04-20 Thread Priyanka Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15250238#comment-15250238
 ] 

Priyanka Gupta commented on MESOS-5193:
---

[~neilc] : Any updates on this one?



[jira] [Commented] (MESOS-5193) Recovery failed: Failed to recover registrar on reboot of mesos master

2016-04-14 Thread Priyanka Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15242071#comment-15242071
 ] 

Priyanka Gupta commented on MESOS-5193:
---

Thanks a lot for getting back to me. 
[~neilc]: Please see the attached logs. I did a reboot on node1 [sudo reboot].
[~kaysoky]: Thanks for pointing that out. I will make this change.

Thanks,
Priyanka



[jira] [Commented] (MESOS-5193) Recovery failed: Failed to recover registrar on reboot of mesos master

2016-04-12 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15237882#comment-15237882
 ] 

Joseph Wu commented on MESOS-5193:
--

Probably not related to the problem you're seeing, but using {{/tmp}} as your 
{{work_dir}} is problematic.  
See this and related JIRAs: https://issues.apache.org/jira/browse/MESOS-5064



[jira] [Commented] (MESOS-5193) Recovery failed: Failed to recover registrar on reboot of mesos master

2016-04-12 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15237874#comment-15237874
 ] 

Neil Conway commented on MESOS-5193:


Hi [~prigupta] -- can you post the complete log files for all three nodes? I'd 
like to make sure that the snippets you've posted are not missing some 
important context. Thanks!



[jira] [Commented] (MESOS-5193) Recovery failed: Failed to recover registrar on reboot of mesos master

2016-04-12 Thread Priyanka Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15237577#comment-15237577
 ] 

Priyanka Gupta commented on MESOS-5193:
---

Error stack traces from the mesos-master logs:

Node 3
I0411 22:47:02.007249  1348 detector.cpp:479] A new leading master 
(UPID=master@10.221.28.61:5050) is detected
I0411 22:47:02.007380  1348 master.cpp:1710] The newly elected leader is 
master@10.221.28.61:5050 with id 725d1232-bea3-4df5-90c5-6479e5652ef4
I0411 22:47:02.007428  1348 master.cpp:1723] Elected as the leading master!
I0411 22:47:02.007457  1348 master.cpp:1468] Recovering from registrar
I0411 22:47:02.007551  1345 registrar.cpp:307] Recovering registrar
I0411 22:47:02.007649  1356 network.hpp:461] ZooKeeper group PIDs: { 
log-replica(1)@10.221.28.61:5050, log-replica(1)@10.221.28.249:5050 }
I0411 22:47:02.007841  1356 log.cpp:659] Attempting to start the writer
I0411 22:47:02.008477  1348 replica.cpp:493] Replica received implicit promise 
request from (30)@10.221.28.61:5050 with proposal 52
E0411 22:47:02.008903  1358 process.cpp:1966] Failed to shutdown socket with fd 
23: Transport endpoint is not connected
I0411 22:47:02.009968  1348 leveldb.cpp:304] Persisting metadata (8 bytes) to 
leveldb took 1.44126ms
I0411 22:47:02.010022  1348 replica.cpp:342] Persisted promised to 52
F0411 22:48:02.008332  1357 master.cpp:1457] Recovery failed: Failed to recover 
registrar: Failed to perform fetch within 1mins
*** Check failure stack trace: ***
@ 0x7f4bd5bcedfd  (unknown)
@ 0x7f4bd5bd0c3d  (unknown)
@ 0x7f4bd5bce9ec  (unknown)
@ 0x7f4bd5bd1539  (unknown)
@ 0x7f4bd54022dc  (unknown)
@ 0x7f4bd5442ab0  (unknown)
@   0x42807e  (unknown)
@ 0x7f4bd54690a5  (unknown)
@ 0x7f4bd54bb976  (unknown)
@ 0x7f4bd54cc566  (unknown)
@ 0x7f4bd52fc4d6  (unknown)
@ 0x7f4bd54cc553  (unknown)
@ 0x7f4bd54b0614  (unknown)
@ 0x7f4bd5b7c971  (unknown)
@ 0x7f4bd5b7cc77  (unknown)
@   0x3dc38b6470  (unknown)
@   0x3dc18079d1  (unknown)
@   0x3dc14e88fd  (unknown)
@  (nil)  (unknown)
/bin/bash: line 1:  1313 Aborted /usr/sbin/mesos-master 
--work_dir=/tmp/mesos_dir 
--zk=zk://scheduler1.rpt.cb.ne1.yahoo.com:2181,scheduler2.rpt.cb.ne1.yahoo.com:2181,scheduler3.rpt.cb.ne1.yahoo.com:2181/mesos
 --quorum=2



Node 2

I0411 22:48:10.006216  1466 log.cpp:659] Attempting to start the writer
E0411 22:48:10.006958  1478 process.cpp:1966] Failed to shutdown socket with fd 
23: Transport endpoint is not connected
I0411 22:48:10.007202  1467 replica.cpp:493] Replica received implicit promise 
request from (13)@10.221.28.249:5050 with proposal 52
E0411 22:48:10.007491  1478 process.cpp:1966] Failed to shutdown socket with fd 
23: Transport endpoint is not connected
I0411 22:48:10.008458  1467 leveldb.cpp:304] Persisting metadata (8 bytes) to 
leveldb took 1.227092ms
I0411 22:48:10.008491  1467 replica.cpp:342] Persisted promised to 52
F0411 22:49:10.006739  1476 master.cpp:1457] Recovery failed: Failed to recover 
registrar: Failed to perform fetch within 1mins
*** Check failure stack trace: ***
@ 0x7fec686f2dfd  (unknown)
@ 0x7fec686f4c3d  (unknown)
@ 0x7fec686f29ec  (unknown)
@ 0x7fec686f5539  (unknown)
@ 0x7fec67f262dc  (unknown)
@ 0x7fec67f66ab0  (unknown)
@   0x42807e  (unknown)
@ 0x7fec67f8d0a5  (unknown)
@ 0x7fec67fdf976  (unknown)
@ 0x7fec67ff0566  (unknown)
@ 0x7fec67e204d6  (unknown)
@ 0x7fec67ff0553  (unknown)
@ 0x7fec67fd4614  (unknown)
@ 0x7fec686a0971  (unknown)
@ 0x7fec686a0c77  (unknown)
@   0x37f98b6470  (unknown)
@   0x39ed207a51  (unknown)
@   0x39ecae89ad  (unknown)
@  (nil)  (unknown)
/bin/bash: line 1:  1452 Aborted /usr/sbin/mesos-master 
--work_dir=/tmp/mesos_dir 
--zk=zk://scheduler1.rpt.cb.ne1.yahoo.com:2181,scheduler2.rpt.cb.ne1.yahoo.com:2181,scheduler3.rpt.cb.ne1.yahoo.com:2181/mesos
 --quorum=2



Node 1
I0411 22:45:52.017833  8338 detector.cpp:479] A new leading master 
(UPID=master@10.221.29.247:5050) is detected
I0411 22:45:52.017925  8338 master.cpp:1710] The newly elected leader is 
master@10.221.29.247:5050 with id 13df6437-fbe9-4390-9f6c-db9fd1d53a16
I0411 22:45:52.017956  8338 master.cpp:1723] Elected as the leading master!
I0411 22:45:52.017983  8338 master.cpp:1468] Recovering from registrar
I0411 22:45:52.018069  8339 registrar.cpp:307] Recovering registrar
I0411 22:45:52.018337  8333 log.cpp:659] Attempting to start the writer
I0411 22:45:52.018785  8336 network.hpp:461] ZooKeeper group PIDs: { 
log-replica(1)@10.221.28.61:5050, log-replica(1)@10.221.29.247:5050 }
I0411 22:45:52.019008  8336 replica.cpp:493] Replica received implicit promise 
request from